cs.CL

[1] Are you sure? Measuring models bias in content moderation through uncertainty

Alessandra Urbinati,Mirko Lai,Simona Frenda,Marco Antonio Stranisci

Main category: cs.CL

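TL;DR: Proposes an unsupervised method that measures the fairness of content moderation models through their uncertainty, using conformal prediction to analyze model bias on data annotated by vulnerable groups.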

Details

Motivation: Language models used for content moderation exhibit racial and social biases, and effective ways to evaluate their fairness are lacking.

Method: Conformal prediction is used to compute each model's uncertainty when classifying. This uncertainty serves as a proxy for assessing the bias of 11 models against women and non-white annotators, and it is compared with performance metrics such as the F1 score.

Result: Some pre-trained models predict minority-group labels accurately but with low confidence. Model confidence reveals which annotator groups are better represented in a model.

Conclusion: Model confidence is an effective indicator for exposing social bias in content moderation models and can be used to guide debiasing.

Abstract: Automatic content moderation is crucial to ensuring safety in social media. Language Model-based classifiers are being increasingly adopted for this task, but it has been shown that they perpetuate racial and social biases. Even if several resources and benchmark corpora have been developed to address this issue, measuring the fairness of models in content moderation remains an open issue. In this work, we present an unsupervised approach that benchmarks models on the basis of their uncertainty in classifying messages annotated by people belonging to vulnerable groups. We use uncertainty, computed by means of the conformal prediction technique, as a proxy to analyze the bias of 11 models against women and non-white annotators and observe to what extent it diverges from metrics based on performance, such as the $F_1$ score. The results show that some pre-trained models predict with high accuracy the labels coming from minority groups, even if the confidence in their prediction is low. Therefore, by measuring the confidence of models, we are able to see which groups of annotators are better represented in pre-trained models and guide the debiasing process of these models before their effective use.
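The abstract does not spell out how uncertainty is derived from conformal prediction, so the following is only a minimal sketch under stated assumptions: standard split conformal prediction with the nonconformity score `1 - p(true label)`, and the size of the resulting prediction set taken as the uncertainty proxy. The function names and the toy probabilities are hypothetical and not taken from the paper.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold from a held-out calibration set (split conformal)."""
    # Assumed nonconformity score: 1 minus the probability of the true label.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the (1 - alpha) quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every label is kept.
        return float("inf")
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """Labels whose nonconformity score does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]

def mean_set_size(prob_rows, threshold):
    """Average prediction-set size; larger means the model is less certain."""
    sizes = [len(prediction_set(probs, threshold)) for probs in prob_rows]
    return sum(sizes) / len(sizes)
```

Under this reading, one would compute `mean_set_size` separately on the messages labeled by each annotator group and compare the averages alongside the per-group F1 score; a group with high F1 but larger sets would match the "accurate but low-confidence" pattern the abstract describes.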

[2] AccessEval: Benchmarking Disability Bias in Large Language Models

Srikant Panda,Amit Agarwal,Hitesh Laxmichand Patel

Main category: cs.CL

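TL;DR: Introduces AccessEval, a benchmark for evaluating how large language models behave across different disability contexts. Responses to disability-related queries show a more negative tone, more stereotyping, and higher factual error rates, revealing deep-rooted ableist bias in model behavior.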

Details

Motivation: To systematically investigate how large language models treat different disability groups in real-life queries, and to bridge the gap between technical evaluation and actual user impact.

Method: Builds AccessEval, a dataset of paired neutral and disability-aware queries spanning 6 real-world domains and 9 disability types, and evaluates 21 closed- and open-source LLMs on sentiment, social perception, and factual accuracy.

Result: Compared with neutral queries, responses to disability-aware queries show a more negative tone, more stereotyping, and higher factual error rates. Hearing, speech, and mobility disabilities are affected most, and the effects vary by domain and disability type.

Conclusion: LLMs exhibit systematic disability bias that may cause real harm to disabled users in real-world decision-making settings, underscoring the need for stronger bias mitigation in everyday applications.

Abstract: Large Language Models (LLMs) are increasingly deployed across diverse domains but often exhibit disparities in how they handle real-life queries. To systematically investigate these effects within various disability contexts, we introduce AccessEval (Accessibility Evaluation), a benchmark evaluating 21 closed- and open-source LLMs across 6 real-world domains and 9 disability types using paired Neutral and Disability-Aware Queries. We evaluated model outputs with metrics for sentiment, social perception, and factual accuracy. Our analysis reveals that responses to disability-aware queries tend to have a more negative tone, increased stereotyping, and higher factual error compared to neutral queries. These effects show notable variation by domain and disability type, with disabilities affecting hearing, speech, and mobility disproportionately impacted. These disparities reflect persistent forms of ableism embedded in model behavior. By examining model performance in real-world decision-making contexts, we better illuminate how such biases can translate into tangible harms for disabled users. This framing helps bridge the gap between technical evaluation and user impact, reinforcing the importance of bias mitigation in day-to-day applications. Our dataset is publicly available at: https://huggingface.co/datasets/Srikant86/AccessEval

[3] RAR$^2$: Retrieval-Augmented Medical Reasoning via Thought-Driven Retrieval

Kaishuai Xu,Wenjun Hou,Yi Cheng,Wenjie Li

Main category: cs.CL

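TL;DR: Proposes RAR², a framework that jointly learns reasoning-augmented retrieval and retrieval-augmented reasoning, uses a thought process to uncover implicit knowledge requirements, and trains the model with Direct Preference Optimization. It outperforms existing RAG methods on several biomedical question answering datasets.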

Details

Motivation: Existing retrieval-augmented generation (RAG) methods perform poorly on medical questions that require complex reasoning, because surface-level queries fail to reflect the true knowledge needs and the reasoning process is not modeled explicitly.

Method: RAR² constructs a thought process to uncover implicit knowledge requirements and uses it to guide retrieval and answer generation. A training dataset of mixed preference pairs is built, the model is trained with Direct Preference Optimization (DPO), and two test-time scaling strategies are designed.

Result: On several biomedical question answering datasets, RAR² outperforms RAG baselines both with and without fine-tuning.

Conclusion: By jointly optimizing reasoning and retrieval, RAR² significantly improves the accuracy and relevance of answers to complex medical questions and offers a more reliable technical path for clinical decision support.

Abstract: Large Language Models (LLMs) have shown promising performance on diverse medical benchmarks, highlighting their potential in supporting real-world clinical tasks. Retrieval-Augmented Generation (RAG) has emerged as a key approach for mitigating knowledge gaps and hallucinations by incorporating external medical information. However, RAG still struggles with complex medical questions that require intensive reasoning, as surface-level input often fails to reflect the true knowledge needs of the task. Existing methods typically focus on refining queries without explicitly modeling the reasoning process, limiting their ability to retrieve and integrate clinically relevant knowledge. In this work, we propose RAR$^2$, a joint learning framework that improves both Reasoning-Augmented Retrieval and Retrieval-Augmented Reasoning. RAR$^2$ constructs a thought process to uncover implicit knowledge requirements and uses it to guide retrieval and answer generation. We build a training dataset of mixed preference pairs and apply Direct Preference Optimization (DPO) to train the model. Moreover, we design two test-time scaling strategies to explore the boundaries of our framework. Experiments demonstrate the effectiveness of RAR$^2$ across several biomedical question answering datasets, outperforming RAG baselines with or without fine-tuning.

[4] TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant?

Jiho Park,Jongyoon Song,Minjin Choi,Kyuho Heo,Taehun Huh,Ji Won Kim

Main category: cs.CL

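TL;DR: TRUEBench is a new benchmark for evaluating real-world use of large language models (LLMs) as productivity assistants. It covers 12 languages, multi-turn dialogue, and both explicit and implicit constraints, and its rigorous evaluation criteria expose the limitations of current models in practical use.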

Details

Motivation: Existing benchmarks fall short in multilingual coverage, in capturing the implicit constraints in user requests, and in handling the complexity of multi-turn dialogue, so they cannot realistically assess how well LLMs follow instructions in real-world settings.

Method: Builds input prompts across 12 languages with intra-instance multilingual instructions, designs rigorous evaluation criteria covering explicit and implicit constraints, and introduces multi-turn dialogue scenarios with accumulating constraints and context switches. An LLM validator refines the constraints to keep the evaluation reliable.

Result: Experiments show that TRUEBench is more challenging than existing benchmarks; for example, OpenAI o1 achieves an overall pass rate of only 69.07%.

Conclusion: TRUEBench offers a more realistic and more demanding way to evaluate LLMs, exposing their capabilities and limitations in practical productivity settings and supporting the development of more reliable AI assistants.

Abstract: Large language models (LLMs) are increasingly integral as productivity assistants, but existing benchmarks fall short in rigorously evaluating their real-world instruction-following capabilities. Current benchmarks often (i) lack sufficient multilinguality, (ii) fail to capture the implicit constraints inherent in user requests, and (iii) overlook the complexities of multi-turn dialogue. To address these critical gaps and provide a more realistic assessment, we introduce TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark), a novel benchmark specifically designed for LLM-based productivity assistants. TRUEBench distinguishes itself by featuring input prompts across 12 languages, incorporating intra-instance multilingual instructions, employing rigorous evaluation criteria to capture both explicit and implicit constraints, and including complex multi-turn dialogue scenarios with both accumulating constraints and context switches. Furthermore, to ensure reliability in evaluation, we refined constraints using an LLM validator. Extensive experiments demonstrate that TRUEBench presents significantly greater challenges than existing benchmarks; for instance, a strong model like OpenAI o1 achieved only a 69.07% overall pass rate. TRUEBench offers a demanding and realistic assessment of LLMs in practical productivity settings, highlighting their capabilities and limitations.

Sadia Abdulhalim,Muaz Albaghdadi,Moshiur Farazi

Main category: cs.CL

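TL;DR: Proposes Dynamic Attention Fusion (DAF), a lightweight framework for multimodal sentiment analysis that combines frozen text embeddings with acoustic features and, without any fine-tuning, outperforms static fusion and unimodal baselines on a large benchmark.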

Details

Motivation: Traditional sentiment analysis relies on text alone and ignores non-verbal cues such as vocal tone and prosody, making it hard to capture true emotional intent.

Method: Dynamic Attention Fusion (DAF) uses an adaptive attention mechanism to weight, per utterance, the text embeddings from a pretrained language model and the acoustic features from a speech encoder, without fine-tuning the underlying encoders.

Result: DAF outperforms the baselines on a large multimodal benchmark, with notable gains in F1-score and lower prediction error; ablation studies support the effectiveness of the dynamic weighting strategy.

Conclusion: DAF effectively fuses verbal and non-verbal information, offers a more robust approach to sentiment prediction, and has broad potential in emotion recognition, mental health assessment, and human-computer interaction.

Abstract: Traditional sentiment analysis has long been a unimodal task, relying solely on text. This approach overlooks non-verbal cues such as vocal tone and prosody that are essential for capturing true emotional intent. We introduce Dynamic Attention Fusion (DAF), a lightweight framework that combines frozen text embeddings from a pretrained language model with acoustic features from a speech encoder, using an adaptive attention mechanism to weight each modality per utterance. Without any finetuning of the underlying encoders, our proposed DAF model consistently outperforms both static fusion and unimodal baselines on a large multimodal benchmark. We report notable gains in F1-score and reductions in prediction error and perform a variety of ablation studies that support our hypothesis that the dynamic weighting strategy is crucial for modeling emotionally complex inputs. By effectively integrating verbal and non-verbal information, our approach offers a more robust foundation for sentiment prediction and carries broader impact for affective computing applications -- from emotion recognition and mental health assessment to more natural human-computer interaction.
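The abstract says only that an adaptive attention mechanism weights each modality per utterance, so the following is a rough sketch of one way such weighting could look, not the authors' architecture. It assumes both modality vectors have already been projected to the same dimension, and the scoring vectors `w_text` and `w_audio` are hypothetical learned parameters.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dynamic_fusion(text_vec, audio_vec, w_text, w_audio):
    """Fuse two same-sized modality vectors with per-utterance weights."""
    # Assumption: one scoring vector per modality yields a scalar relevance score.
    alpha_text, alpha_audio = softmax([dot(w_text, text_vec), dot(w_audio, audio_vec)])
    # The weights depend on this utterance's features, unlike static fusion,
    # which would use the same fixed weights for every input.
    fused = [alpha_text * t + alpha_audio * a for t, a in zip(text_vec, audio_vec)]
    return fused, (alpha_text, alpha_audio)
```

With all-zero scoring vectors the two modalities receive equal weight, which corresponds to a static average; the dynamic behaviour comes entirely from the learned scores.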

[6] Enabling Approximate Joint Sampling in Diffusion LMs

Parikshit Bansal,Sujay Sanghavi

Main category: cs.CL

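TL;DR: Proposes a new method for approximate joint sampling in masked diffusion language models. A lightweight single-layer "sampler" added on top of a large diffusion LM can be called several times after each full-model forward pass to unmask multiple tokens, speeding up generation while keeping quality high.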

Details

Motivation: In autoregressive language models, each token is generated conditioned on all previous tokens, so the overall string follows the correct joint distribution. Masked diffusion language models can unmask tokens in parallel and out of order, but the more tokens are unmasked in parallel, the further the output drifts from the true joint distribution and the lower the accuracy. A method is needed that speeds up generation while staying as close as possible to the true joint.

Method: A lightweight single-layer "sampler" is attached to an existing large diffusion language model and trained to mimic exact joint sampling from the frozen full model. After each full-model denoising step, the sampler layer can be called several times to produce several unmasked tokens, so multiple tokens are generated per full forward pass.

Result: The method is validated on Dream-7B-Base and Dream-7B-Instruct and performs well on language modeling, math, and coding tasks. When four tokens are unmasked per denoising step, it achieves a MAUVE score of 0.87 with respect to the true joint distribution, well above the baseline's 0.31.

Conclusion: The method improves the generation efficiency of diffusion language models while keeping generation quality high, offering a workable way to balance speed against distributional fidelity.

Abstract: In autoregressive language models, each token is sampled by conditioning on all the past tokens; the overall string has thus been sampled from the correct underlying joint distribution represented by the model. In contrast, masked diffusion language models generate text by unmasking tokens out of order and potentially in parallel. Generating an overall string sampled from the correct underlying joint distribution would (again) require exactly one token unmasking in every full-model forward pass. The more tokens unmasked in parallel, the further away the string is from the true joint; this can be seen in the resulting drop in accuracy (but increase in speed). In this paper we devise a way to approximately sample multiple tokens from the joint distribution in a single full-model forward pass; we do so by developing a new lightweight single-layer "sampler" on top of an existing large diffusion LM. One forward pass of the full model can now be followed by multiple forward passes of only this sampler layer, to yield multiple unmasked tokens. Our sampler is trained to mimic exact joint sampling from the (frozen) full model. We show the effectiveness of our approximate joint sampling for both pretrained-only (Dream-7B-Base) and instruction-tuned (Dream-7B-Instruct) models on language modeling and math & coding tasks. When four tokens are unmasked for each full-model denoising step, our sampling algorithm achieves a MAUVE score of 0.87 (vs marginal baseline of 0.31) with respect to the true joint distribution.
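The premise that parallel unmasking drifts from the true joint can be illustrated with a toy example. This sketch does not implement the paper's sampler; it only contrasts sampling two masked positions independently from their marginals with sampling them one after the other. The two-phrase joint distribution is made up for illustration.

```python
from collections import defaultdict

# Toy joint distribution over two masked positions (hypothetical numbers).
JOINT = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

def marginals(joint):
    """Per-position marginal distributions of a two-position joint."""
    first, second = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        first[a] += p
        second[b] += p
    return dict(first), dict(second)

def parallel_distribution(joint):
    """Distribution obtained by unmasking both positions independently."""
    first, second = marginals(joint)
    return {(a, b): pa * pb for a, pa in first.items() for b, pb in second.items()}

def sequential_distribution(joint):
    """Distribution obtained by sampling position 1, then position 2 given 1."""
    first, _ = marginals(joint)
    # p(a) * p(b | a) recovers the joint exactly.
    return {(a, b): first[a] * (p / first[a]) for (a, b), p in joint.items()}

def invalid_mass(dist, joint):
    """Probability placed on pairs that the true joint never produces."""
    return sum(p for pair, p in dist.items() if joint.get(pair, 0.0) == 0.0)
```

In this toy case the parallel scheme puts half of its probability on strings such as "New Angeles", while the sequential scheme puts none; the sampler layer described in the abstract aims to approximate the sequential behaviour at a fraction of the cost of a full forward pass per token.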

[7] Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

Sasha Cui,Zhongren Chen

Main category: cs.CL

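TL;DR: Proposes Painless Activation Steering (PAS), a fully automated activation steering method that needs no hand-crafted prompts or feature annotation. It quickly improves behavioral control of language models across multiple tasks and models, works especially well on behavior tasks such as bias, morality, and alignment, and can be combined with in-context learning and supervised fine-tuning.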

Details

Motivation: Existing post-training methods for language models (weight-based or prompt-based steering) are time-consuming, expensive, or hard to control, while current activation steering techniques depend on hand-crafted prompt pairs or laborious feature annotation, which limits their convenience. A more automated, cheaper, and easier-to-use alternative is needed.

Method: Painless Activation Steering (PAS) automatically builds activation vectors from a labeled dataset, enabling activation steering with no human intervention. An introspective variant, iPAS, is designed to strengthen the causal steering effect.

Result: Experiments on three open-weight models and 18 tasks show that PAS reliably improves behavior tasks (10.1% on Bias, 5.2% on Morality, and 34.8% on Alignment) but does not help intelligence-oriented tasks. It also adds gains on top of in-context learning and supervised fine-tuning.

Conclusion: PAS is a practical, automated post-training method for language models. It clarifies where activation steering applies and where it does not, and provides an efficient, lightweight tool for controlling model behavior.

Abstract: Language models (LMs) are typically post-trained for desired capabilities and behaviors via weight-based or prompt-based steering, but the former is time-consuming and expensive, and the latter is not precisely controllable and often requires manual trial-and-error. While activation steering (AS) promises a cheap, fast, and controllable alternative to the two existing post-training methods, current AS techniques require hand-crafted prompt pairs or labor-intensive feature annotation, making them more inconvenient than the plug-and-play methods such as Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). We introduce Painless Activation Steering (PAS), a family of fully automated methods that make AS readily usable with any given labeled dataset, with no need for prompt construction, feature labeling, or human intervention. We evaluate PAS on three open-weight models (Llama3.1-8B-Instruct, DeepSeek-R1-Distill-8B, and Nous-Hermes-2) and 18 tasks; we find that PAS reliably improves performance for behavior tasks, but not for intelligence-oriented tasks. The introspective variant (iPAS) delivers the strongest causal steering effects (10.1% on Bias, 5.2% on Morality, and 34.8% on Alignment). We also show PAS delivers additional gains on top of In-Context Learning (ICL) and SFT. PAS constructs a fast, lightweight activation vector that can be cheaply trained, easily stored, and activated at will. Our results provide a characterization of where AS helps, where it fails, and how to deploy it as a practical, automated LM post-training option.
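The abstract does not describe how PAS derives its activation vector from a labeled dataset. As a point of reference only, the sketch below shows the widely used difference-of-means construction, which may or may not match PAS; the activations are plain lists standing in for hidden states, and all names are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]

def steering_vector(activations, labels):
    """Difference between mean activations of label-1 and label-0 examples."""
    positive = [a for a, y in zip(activations, labels) if y == 1]
    negative = [a for a, y in zip(activations, labels) if y == 0]
    mean_pos, mean_neg = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def steer(hidden, vector, strength=1.0):
    """Shift a hidden state along the steering direction at inference time."""
    return [h + strength * v for h, v in zip(hidden, vector)]
```

The appeal the abstract points to is visible even in this simplified form: the vector is a single array computed from labeled examples, it can be stored cheaply, and the `strength` argument switches it on or off without touching model weights.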

[8] MIRAGE: Multi-hop Reasoning with Ambiguity Evaluation for Illusory Questions

Jeonghyun Park,Ingeol Baek,Seunghyun Yoon,Haeun Jang,Aparna Garimella,Akriti Jain,Nedim Lipka,Hwanhee Lee

Main category: cs.CL

TL;DR: Introduces MIRAGE, a benchmark for evaluating ambiguity in multi-hop reasoning, and proposes the CLARION framework to improve LLM performance on this task.

Details

Motivation: Current large language models perform poorly on real-world multi-hop question answering that involves multiple layers of ambiguity; they struggle to explore the right reasoning paths and to give complete answers.

Method: Builds MIRAGE, a multi-hop ambiguous QA benchmark with 1,142 high-quality examples, and proposes CLARION, a multi-agent framework that combines reasoning and instruction to clarify the ambiguity at each step.

Result: Experiments show that state-of-the-art models perform poorly on MIRAGE, while CLARION significantly outperforms existing approaches.

Conclusion: Resolving ambiguity in multi-hop reasoning is an important and challenging problem, and CLARION points toward more adaptive and robust reasoning systems.

Abstract: Real-world Multi-hop Question Answering (QA) often involves ambiguity that is inseparable from the reasoning process itself. This ambiguity creates a distinct challenge, where multiple reasoning paths emerge from a single question, each requiring independent resolution. Since each sub-question is ambiguous, the model must resolve ambiguity at every step. Thus, answering a single question requires handling multiple layers of ambiguity throughout the reasoning chain. We find that current Large Language Models (LLMs) struggle in this setting, typically exploring wrong reasoning paths and producing incomplete answers. To facilitate research on multi-hop ambiguity, we introduce MultI-hop Reasoning with AmbiGuity Evaluation for Illusory Questions (MIRAGE), a benchmark designed to analyze and evaluate this challenging intersection of ambiguity interpretation and multi-hop reasoning. MIRAGE contains 1,142 high-quality examples of ambiguous multi-hop questions, categorized under a taxonomy of syntactic, general, and semantic ambiguity, and curated through a rigorous multi-LLM verification pipeline. Our experiments reveal that even state-of-the-art models struggle on MIRAGE, confirming that resolving ambiguity combined with multi-step inference is a distinct and significant challenge. To establish a robust baseline, we propose CLarifying Ambiguity with a Reasoning and InstructiON (CLARION), a multi-agent framework that significantly outperforms existing approaches on MIRAGE, paving the way for more adaptive and robust reasoning systems.

[9] ML2B: Multi-Lingual ML Benchmark For AutoML

Ekaterina Trofimova,Zosia Shamina,Maria Selifanova,Artem Zaitsev,Remi Savchuk,Maxim Minets,Daria Ozerova,Emil Sataev,Denis Zuenko,Andrey E. Ustyuzhanin

Main category: cs.CL

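TL;DR: Presents ML2B, the first benchmark for evaluating multilingual machine learning code generation. It contains 30 Kaggle competitions translated into 13 languages and reveals a 15-45% performance drop on non-English tasks.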

Details

Motivation: Existing benchmarks for ML code generation are mostly limited to English and overlook the global, multilingual nature of ML research and practice.

Method: Builds the ML2B benchmark from 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data, and uses the automated AIDE framework for end-to-end evaluation.

Result: Model performance drops by 15-45% on non-English tasks, exposing critical challenges in multilingual representation learning for code generation.

Conclusion: ML2B provides an important evaluation tool for multilingual ML code generation, highlights the need for better support of non-English languages, and promotes globally inclusive AI development.

Abstract: Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal substantial 15-45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are made available through our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/enaix/ml2b.

[10] ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection

Mohamed Maged,Alhassan Ehab,Ali Mekky,Besher Hassan,Shady Shehata

Main category: cs.CL

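TL;DR: Introduces the first spoofed-speech dataset covering multiple Arabic dialects and shows, through multi-model evaluation and human ratings, that FishSpeech performs best at Arabic voice cloning.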

Details

Motivation: Spoof detection research has focused mainly on English, while Arabic and its many dialects remain understudied, so a dedicated multi-dialect Arabic spoofed-speech dataset is needed to fill the gap.

Method: The difficulty of the synthetic speech produced by different TTS models is assessed in several ways: modern embedding-based classifiers, classical machine learning algorithms on MFCC features, and the RawNet2 architecture, combined with human MOS ratings and ASR-based WER analysis.

Result: FishSpeech gives the best Arabic voice cloning results on the Casablanca corpus, producing more realistic and more challenging speech; however, relying on a single TTS model may limit the dataset's generalizability.

Conclusion: FishSpeech is currently the best Arabic synthesis model among those evaluated, and future work should combine multiple TTS models to improve the diversity and robustness of spoofed-speech datasets.

Abstract: With the rise of generative text-to-speech models, distinguishing between real and synthetic speech has become challenging, especially for Arabic, which has received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi-dialect Arabic spoofed speech dataset. To evaluate the difficulty of the synthesized audio from each model and determine which produces the most challenging samples, and thereby guide the construction of our final dataset either by merging audio from multiple models or by selecting the best-performing model, we built an evaluation pipeline that included training classifiers using three approaches: modern embedding-based methods combined with classifier heads; classical machine learning algorithms applied to MFCC features; and the RawNet2 architecture. The pipeline further incorporated the calculation of Mean Opinion Score based on human ratings, as well as processing both original and synthesized datasets through an Automatic Speech Recognition model to measure the Word Error Rate. Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS for dataset creation may limit generalizability.

[11] EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation

Kai Zhang,Christopher Malon,Lichao Sun,Martin Renqiang Min

Main category: cs.CL

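TL;DR: Proposes EditGRPO, a mixed-policy reinforcement learning algorithm for optimizing radiology report generation. Clinically motivated rewards significantly improve a multimodal large language model across multiple metrics and datasets.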

Details

Motivation: Existing radiology report generation methods, such as supervised fine-tuning, are not explicitly aligned with clinical efficacy and lack training objectives designed for medical text generation.

Method: EditGRPO combines on-policy exploration with off-policy guidance by injecting detailed sentence-level corrections during training rollouts, and optimizes the model with reinforcement learning on clinically motivated rewards.

Result: Across four major chest X-ray report datasets, EditGRPO improves on the SFT and GRPO baselines by 3.4% on average; it also generalizes better to unseen datasets, with an average gain of 5.9%.

Conclusion: Mixed-policy reinforcement learning lets EditGRPO address the exploration dilemma and sampling-efficiency problems, significantly improving the accuracy and clinical applicability of medical report generation.

Abstract: Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models (MLLMs), have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce EditGRPO, a mixed-policy reinforcement learning (RL) algorithm designed specifically to optimize the generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level detailed corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and sampling efficiency issues typically encountered in RL. Applied to a Qwen2.5-VL-3B MLLM initialized with supervised fine-tuning (SFT), EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in CheXbert, GREEN, Radgraph, and RATEScore metrics across four major chest X-ray report generation datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of 5.9% on unseen datasets.

[12] Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

Chi Ruan,Dongfu Jiang,Yubo Wang,Wenhu Chen

Main category: cs.CL

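TL;DR: Proposes Critique Reinforcement Learning (CRL), a new reinforcement learning method that adds a critique-generation task over (question, solution) pairs to strengthen an LLM's critique and reflection abilities. Critique-Coder models trained this way outperform RL-only models on code generation and logic reasoning, indicating that CRL improves general reasoning and critique ability.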

Details

Motivation: Standard reinforcement learning focuses on generating responses and does not explicitly model critique or reflection. Recent work shows that explicitly training critique ability helps, so the authors combine reinforcement learning with critique training to improve the model's judgment and generalization.

Method: In Critique Reinforcement Learning (CRL), the model generates a critique for a (question, solution) pair, and the reward depends only on whether the final judgment label matches the ground-truth label. CRL is combined with standard RL (20% CRL data) to train the Critique-Coder family of models.

Result: Critique-Coder outperforms RL-only models on multiple benchmarks. Critique-Coder-8B exceeds 60% on LiveCodeBench (v5), ahead of DeepCoder-14B and GPT-o1, and it also does better on BBEH logic reasoning tasks, showing stronger general reasoning.

Conclusion: CRL is an effective complement to standard reinforcement learning. Adding a critique task improves code generation and strengthens general reasoning and critique abilities, with potential to transfer across tasks.

Abstract: Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, like Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD), have shown the benefits of explicitly teaching LLMs how to critique. Motivated by them, we propose Critique Reinforcement Learning (CRL), where the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label $c \in \{\texttt{True}, \texttt{False}\}$ of the generated critique aligns with the ground-truth judgment $c^*$. Building on this point, we introduce Critique-Coder, which is trained on a hybrid of RL and CRL by substituting 20% of the standard RL data with CRL data. We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models. We show that Critique-Coder consistently outperforms RL-only baselines on all the evaluated benchmarks. Notably, our Critique-Coder-8B can reach over 60% on LiveCodeBench (v5), outperforming other reasoning models like DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that the application of CRL on coding datasets enhances general reasoning and critique abilities, which are transferable across a broad range of tasks. Hence, we believe that CRL works as a great complement to standard RL for LLM reasoning.
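The two concrete ingredients named in the abstract, a reward that depends only on the judgment label and a 20% substitution of RL data with CRL data, can be sketched as follows. This is an illustrative reading rather than the authors' code: how the label is parsed out of a generated critique and how items are chosen for substitution are not specified, so the random selection and all names here are assumptions.

```python
import random

def crl_reward(predicted_label, true_label):
    """Binary reward: 1.0 only when the critique's verdict c matches c*."""
    return 1.0 if predicted_label == true_label else 0.0

def mix_training_data(rl_items, crl_items, crl_fraction=0.2, seed=0):
    """Replace a fraction of the RL items with CRL items, keeping the total size."""
    rng = random.Random(seed)
    n_crl = round(len(rl_items) * crl_fraction)
    if n_crl > len(crl_items):
        raise ValueError("not enough CRL items for the requested fraction")
    # Assumption: the dropped and added items are chosen uniformly at random.
    kept = rng.sample(rl_items, len(rl_items) - n_crl)
    added = rng.sample(crl_items, n_crl)
    mixed = kept + added
    rng.shuffle(mixed)
    return mixed
```

Because the reward ignores the critique's wording and checks only the final verdict, it is as cheap to compute as a pass/fail reward on generated code, which is presumably what makes the two kinds of data easy to mix in one training run.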

[13] ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Hwan Chang,Yonghyun Jun,Hwanhee Lee

Main category: cs.CL

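TL;DR: Presents ChatInject, a new indirect prompt injection attack on LLM agents that mimics native chat template structure and uses persuasion across multi-turn dialogue. It substantially raises attack success rates and remains robust against existing defenses.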

Details

Motivation: As LLM-based agents are widely deployed in external environments, they face growing security threats, especially indirect prompt injection. Prior work has focused on plain-text injection and overlooked both the models' dependence on structured chat templates and their susceptibility to contextual manipulation in multi-turn dialogue. Stealthier and more effective attacks need to be studied to expose latent vulnerabilities.

Method: ChatInject formats malicious payloads to match the native chat template. A persuasion-driven multi-turn variant then gradually leads the agent to accept suspicious instructions. Experiments on several frontier LLMs evaluate attack success rate, cross-model transferability, and the effectiveness of existing defenses.

Result: (1) ChatInject significantly raises attack success rates, from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with the multi-turn variant averaging 52.33% on InjecAgent. (2) Chat-template-based payloads transfer well across models and remain effective even against closed-source models. (3) Existing prompt-based defenses are largely ineffective against this attack, especially the multi-turn variant.

Conclusion: Current LLM agent systems have serious security vulnerabilities when handling structured input and multi-turn interaction. ChatInject exposes the risks posed by template dependence and contextual persuasion and underlines the need for more robust defenses against such advanced prompt injection attacks.

Abstract: The growing deployment of large language model (LLM) based agents that interact with external environments has created new attack surfaces for adversarial manipulation. One major threat is indirect prompt injection, where attackers embed malicious instructions in external environment output, causing agents to interpret and execute them as if they were legitimate prompts. While previous research has focused primarily on plain-text injection attacks, we find a significant yet underexplored vulnerability: LLMs' dependence on structured chat templates and their susceptibility to contextual manipulation through persuasive multi-turn dialogues. To this end, we introduce ChatInject, an attack that formats malicious payloads to mimic native chat templates, thereby exploiting the model's inherent instruction-following tendencies. Building on this foundation, we develop a persuasion-driven Multi-turn variant that primes the agent across conversational turns to accept and execute otherwise suspicious actions. Through comprehensive experiments across frontier LLMs, we demonstrate three critical findings: (1) ChatInject achieves significantly higher average attack success rates than traditional prompt injection methods, improving from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with multi-turn dialogues showing particularly strong performance with an average success rate of 52.33% on InjecAgent, (2) chat-template-based payloads demonstrate strong transferability across models and remain effective even against closed-source LLMs, despite their unknown template structures, and (3) existing prompt-based defenses are largely ineffective against this attack approach, especially against Multi-turn variants. These findings highlight vulnerabilities in current agent systems.

[14] Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems

Kai Hua,Zhiyuan Feng,Chongyang Tao,Rui Yan,Lu Zhang

Main category: cs.CL

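TL;DR: Proposes RSM-DCK, a multi-turn response selection model that detects the relevant parts of the context and the knowledge collection to improve response selection in open-domain knowledge-grounded conversation.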

Details

Motivation: Existing methods use context and knowledge for response matching without adequately considering relevance, and too much irrelevant content hurts matching.

Method: The model uses the recent context as a query to pre-select relevant context and knowledge at the word level and the utterance level. The response candidate then interacts with the selected content, and a fused representation is used to post-select knowledge for better matching.

Result: Experiments on two benchmark datasets show that the model outperforms existing methods and effectively identifies the context and knowledge needed for response selection.

Conclusion: By dynamically selecting key information, RSM-DCK improves the accuracy and robustness of response selection in knowledge-grounded dialogue.

Abstract: Recently, knowledge-grounded conversations in the open domain have gained great attention from researchers. Existing works on retrieval-based dialogue systems have devoted tremendous effort to using neural networks to build a matching model, where all of the context and knowledge contents are used to match the response candidate with various representation methods. In fact, different parts of the context and knowledge differ in importance for recognizing the proper response candidate, as many utterances are useless due to topic shift. This excess of useless information in the context and knowledge can influence the matching process and lead to inferior performance. To address this problem, we propose a multi-turn Response Selection Model that can Detect the relevant parts of the Context and Knowledge collection (RSM-DCK). Our model first uses the recent context as a query to pre-select relevant parts of the context and knowledge collection at the word-level and utterance-level semantics. Further, the response candidate interacts with the selected context and knowledge collection respectively. Finally, the fused representation of the context and response candidate is utilized to post-select the relevant parts of the knowledge collection more confidently for matching. We test our proposed model on two benchmark datasets. Evaluation results indicate that our model achieves better performance than the existing methods, and can effectively detect the relevant context and knowledge for response selection.

[15] Towards Generalizable Implicit In-Context Learning with Attention Routing

Jiaqian Li,Yanshu Li,Ligong Han,Ruixiang Tang,Wenya Wang

Main category: cs.CL

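TL;DR: Proposes In-Context Routing (ICR), a new implicit in-context learning method that models generalizable ICL patterns at the attention-logits level. It achieves efficient zero-shot performance without task-specific training and generalizes well across domains and multiple large language models.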

Details

Motivation: Existing implicit in-context learning methods rely on injecting shift vectors into the residual stream, usually require labeled demonstrations or task-specific alignment, fail to exploit the structural mechanisms of ICL, and generalize poorly.

Method: In-Context Routing (ICR) extracts reusable structural directions that emerge during ICL and uses a learnable input-conditioned router to modulate attention logits, giving a train-once-and-reuse framework.

Result: On 12 real-world datasets and multiple LLMs, ICR consistently outperforms prior implicit ICL methods that need task-specific retrieval or training, and it is more robust on out-of-domain tasks.

Conclusion: ICR improves the generalization and practicality of implicit in-context learning and pushes the boundary of ICL's practical value.

Abstract: Implicit in-context learning (ICL) has newly emerged as a promising paradigm that simulates ICL behaviors in the representation space of Large Language Models (LLMs), aiming to attain few-shot performance at zero-shot cost. However, existing approaches largely rely on injecting shift vectors into residual flows, which are typically constructed from labeled demonstrations or task-specific alignment. Such designs fall short of utilizing the structural mechanisms underlying ICL and suffer from limited generalizability. To address this, we propose In-Context Routing (ICR), a novel implicit ICL method that internalizes generalizable ICL patterns at the attention logits level. It extracts reusable structural directions that emerge during ICL and employs a learnable input-conditioned router to modulate attention logits accordingly, enabling a train-once-and-reuse framework. We evaluate ICR on 12 real-world datasets spanning diverse domains and multiple LLMs. The results show that ICR consistently outperforms prior implicit ICL methods that require task-specific retrieval or training, while demonstrating robust generalization to out-of-domain tasks where existing methods struggle. These findings position ICR to push the boundary of ICL's practical value.

[16] The Bias is in the Details: An Assessment of Cognitive Bias in LLMs

R. Alexander Knipper,Charles S. Knipper,Kaiqi Zhang,Valerie Sims,Clint Bowers,Santu Karmaker

Main category: cs.CL

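TL;DR: A large-scale evaluation of 45 large language models (LLMs) on 8 classic cognitive biases finds bias-consistent, human-like behavior in 17.8% to 57.3% of instances and shows how model size and prompt detail affect bias.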

Details

Motivation: As LLMs take part in more real-world decisions, it is necessary to examine whether they show human-like cognitive biases, in order to improve their reliability and safety.

Method: Proposes an evaluation framework based on multiple-choice tasks, builds a dataset of 220 decision scenarios together with psychologists, and analyzes more than 2.8 million responses generated through controlled prompt variations.

Result: LLMs show bias-consistent behavior on anchoring, availability, confirmation, framing, interpretation, overattribution, prospect theory, and representativeness biases. Larger models (>32B parameters) reduce bias in 39.5% of cases, and more detailed prompts reduce most biases by up to 14.9% but exacerbate overattribution (by up to 8.8%).

Conclusion: LLMs do exhibit a range of cognitive biases. Prompt design and model size can mitigate them to some extent (and sometimes worsen them), which should be taken into account in practice.

Abstract: As Large Language Models (LLMs) are increasingly embedded in real-world decision-making processes, it becomes crucial to examine the extent to which they exhibit cognitive biases. Extensively studied in the field of psychology, cognitive biases appear as systematic distortions commonly observed in human judgments. This paper presents a large-scale evaluation of eight well-established cognitive biases across 45 LLMs, analyzing over 2.8 million LLM responses generated through controlled prompt variations. To achieve this, we introduce a novel evaluation framework based on multiple-choice tasks, hand-curate a dataset of 220 decision scenarios targeting fundamental cognitive biases in collaboration with psychologists, and propose a scalable approach for generating diverse prompts from human-authored scenario templates. Our analysis shows that LLMs exhibit bias-consistent behavior in 17.8-57.3% of instances across a range of judgment and decision-making contexts targeting anchoring, availability, confirmation, framing, interpretation, overattribution, prospect theory, and representativeness biases. We find that both model size and prompt specificity play a significant role in bias susceptibility as follows: larger size (>32B parameters) can reduce bias in 39.5% of cases, while higher prompt detail reduces most biases by up to 14.9%, except in one case (Overattribution), which is exacerbated by up to 8.8%.

[17] Lexicon-Enriched Graph Modeling for Arabic Document Readability Prediction

Passant Elchafei,Mayar Osama,Mohamed Rageh,Mervat Abuelkheir

Main category: cs.CL

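TL;DR: Proposes a hybrid approach that combines graph neural networks with lexicons for document-level Arabic readability prediction, performing well in the BAREC 2025 shared task.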

Details

Motivation: To improve the accuracy of document-level readability prediction for Arabic, particularly when handling complex linguistic structure under constrained conditions. Method: Each document is modeled as a sentence-level graph whose nodes represent sentences and lemmas and whose edges represent lexical co-occurrence and class membership. SAMER lexicon features and contextual embeddings from an Arabic transformer are used; the GNN and transformer branches are trained independently and combined via late fusion, and sentence-level outputs are aggregated with max pooling to obtain the document-level prediction. Result: The hybrid method outperforms the standalone GNN or transformer models across multiple readability metrics, especially for document-level prediction, while the GNN remains stronger for sentence-level prediction. Conclusion: A two-branch architecture that fuses a GNN with a transformer effectively improves document-level readability prediction, whereas the GNN alone is better suited to fine-grained sentence-level prediction. Abstract: We present a graph-based approach enriched with lexicons to predict document-level readability in Arabic, developed as part of the Constrained Track of the BAREC Shared Task 2025. Our system models each document as a sentence-level graph, where nodes represent sentences and lemmas, and edges capture linguistic relationships such as lexical co-occurrence and class membership. Sentence nodes are enriched with features from the SAMER lexicon as well as contextual embeddings from an Arabic transformer model. The graph neural network (GNN) and transformer sentence encoder are trained as two independent branches, and their predictions are combined via late fusion at inference. For document-level prediction, sentence-level outputs are aggregated using max pooling to reflect the most difficult sentence. Experimental results show that this hybrid method outperforms standalone GNN or transformer branches across multiple readability metrics. Overall, the findings highlight that fusion offers advantages at the document level, but the GNN-only approach remains stronger for precise prediction of sentence-level readability.
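
The late-fusion and max-pooling steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the fusion rule, so the weighted average, the `gnn_weight` parameter, and both function names are assumptions made for illustration only, not the authors' implementation.

```python
def late_fusion(gnn_scores, transformer_scores, gnn_weight=0.5):
    """Blend per-sentence readability scores from two independently trained branches.

    Assumption: a weighted average; the abstract does not specify the fusion rule.
    """
    if len(gnn_scores) != len(transformer_scores):
        raise ValueError("both branches must score the same sentences")
    return [
        gnn_weight * g + (1.0 - gnn_weight) * t
        for g, t in zip(gnn_scores, transformer_scores)
    ]


def document_readability(sentence_scores):
    """Max pooling: the document level reflects its most difficult sentence."""
    if not sentence_scores:
        raise ValueError("document has no sentences")
    return max(sentence_scores)


# Hypothetical per-sentence readability levels from each branch.
fused = late_fusion([3.0, 7.0, 5.0], [5.0, 9.0, 5.0])
doc_level = document_readability(fused)
```

Max pooling makes the document-level score sensitive to a single hard sentence, which matches the abstract's stated intent of reflecting the most difficult sentence.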

[18] HEART: Emotionally-driven test-time scaling of Language Models

Gabriela Pinto,Palash Goyal,Yiwen Song,Souradip Chakraborty,Zifeng Wang,Tomas Pfister,Hamid Palangi

Main category: cs.CL

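TL;DR: This paper proposes HEART, a framework that uses emotionally charged feedback prompts based on six basic emotions to drive iterative self-correction and improve language model performance on complex reasoning tasks. Experiments show clear accuracy gains when an oracle verifier is available, but inconsistent results without a verifier, revealing a key bottleneck for practical deployment.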

Details

Motivation: Existing test-time scaling strategies such as self-reflection focus mainly on logical or structural refinement and do not exploit the way affective feedback can modulate cognitive performance. Inspired by psychological findings that emotion influences cognition, the authors explore the potential of emotional feedback in language model reasoning. Method: The HEART framework builds concise, emotionally charged feedback phrases from the six universal emotions proposed by Paul Ekman and uses them iteratively to steer the model away from flawed reasoning paths. The emotional tone is varied systematically across iterations, with guidance from an oracle verifier. Result: On hard reasoning benchmarks including OlympiadBench, Humanity's Last Exam, and SimpleQA, HEART substantially outperforms state-of-the-art baselines when a verifier is available and shows deeper reasoning; without a verifier the gains are inconsistent, exposing a practical bottleneck. Conclusion: Emotional feedback can effectively strengthen language model reasoning; future progress in machine reasoning may depend not only on refining logic but also on understanding and leveraging the model's "heart", that is, affective guidance mechanisms. Abstract: Test-time scaling has shown considerable success in improving the performance of language models on complex reasoning tasks without requiring fine-tuning. However, current strategies such as self-reflection primarily focus on logical or structural refinement. They do not leverage the guiding potential of affective feedback. Inspired by psychological research showing that emotions can modulate cognitive performance, we introduce HEART, a novel framework that uses emotionally-driven prompts for iterative self-correction. HEART provides feedback on a model's incorrect response using a curated set of concise, emotionally charged phrases based on the six universal emotions categorized by Dr. Paul Ekman. By systematically varying the emotional tone of the feedback across iterations, our method guides the model to escape flawed reasoning paths and explore more promising alternatives. We evaluate our framework on challenging reasoning benchmarks including OlympiadBench, Humanity's Last Exam, and SimpleQA. Our results reveal a significant new phenomenon: when guided by an oracle verifier, this affective iteration protocol unlocks significantly deeper reasoning, leading to consistent and substantial increases in accuracy over state-of-the-art baselines with the same verifier. However, we also identify a critical bottleneck for practical deployment. In a verifier-free setting, it struggles to harness these gains consistently, highlighting this as a key challenge for future work. Our findings suggest that the next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the 'HEART' of the models.

[19] Infusing Theory of Mind into Socially Intelligent LLM Agents

EunJeong Hwang,Yuwei Yin,Giuseppe Carenini,Peter West,Vered Shwartz

Main category: cs.CL

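TL;DR: This paper proposes ToMAgent (ToMA), a dialogue agent focused on Theory of Mind (ToM) that pairs ToM with dialogue lookahead to improve the goal achievement and strategic reasoning of large language models in social dialogue.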

Details

Motivation: Current chatbots and LLM-based social agents lack an understanding of others' mental states (Theory of Mind, ToM), which limits their social intelligence. Method: The authors first prompt models to generate mental states between dialogue turns, then propose ToMAgent, which is trained by pairing ToM with dialogue lookahead to produce the mental states most useful for achieving dialogue goals. Result: Experiments on the Sotopia social evaluation benchmark show that ToMA outperforms a range of baselines, exhibiting more strategic, goal-oriented reasoning, long-horizon adaptation, and better relationships with dialogue partners. Conclusion: Effectively integrating Theory of Mind into LLMs is an important step toward more socially intelligent dialogue agents. Abstract: Theory of Mind (ToM), an understanding of the mental states of others, is a key aspect of human social intelligence, yet chatbots and LLM-based social agents do not typically integrate it. In this work, we demonstrate that LLMs that explicitly use ToM get better at dialogue, achieving goals more effectively. After showing that simply prompting models to generate mental states between dialogue turns already provides significant benefit, we further introduce ToMAgent (ToMA), a ToM-focused dialogue agent. ToMA is trained by pairing ToM with dialogue lookahead to produce mental states that are maximally useful for achieving dialogue goals. Experiments on the Sotopia interactive social evaluation benchmark demonstrate the effectiveness of our method over a range of baselines. Comprehensive analysis shows that ToMA exhibits more strategic, goal-oriented reasoning behaviors, which enable long-horizon adaptation, while maintaining better relationships with their partners. Our results suggest a step forward in integrating ToM for building socially intelligent LLM agents.

[20] Extract-0: A Specialized Language Model for Document Information Extraction

Henrique Godoy

Main category: cs.CL

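TL;DR: This paper proposes Extract-0, a 7-billion-parameter language model specialized for document information extraction. Using synthetic data generation, LoRA fine-tuning, and GRPO reinforcement learning with a semantic-similarity-based reward, it outperforms larger general-purpose models such as GPT-4.1 and o3 on 1,000 diverse extraction tasks.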

Details

Motivation: To reduce the resource consumption of large models on specific tasks, and to test whether task-specific optimization lets a small model outperform general-purpose large models. Method: A memory-preserving synthetic data generation pipeline creates about 280,000 training examples; this is combined with LoRA fine-tuning that modifies only 0.53% of the weights and with GRPO reinforcement learning that uses a semantic-similarity-based reward function. Result: The model reaches a mean reward of 0.573 on a benchmark of 1,000 document extraction tasks, clearly ahead of GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). Conclusion: Task-specific optimization can let a small model outperform general-purpose large models while substantially reducing compute requirements, offering a new path for designing efficient specialized models. Abstract: This paper presents Extract-0, a 7-billion parameter language model specifically optimized for document information extraction that achieves performance exceeding models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameter-efficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resources.

[21] Large language models management of medications: three performance analyses

Kelli Henry,Steven Xu,Kaitlin Blotske,Moriah Cargile,Erin F. Barreto,Brian Murray,Susan Smith,Seth R. Bauer,Yanjun Gao,Tianming Liu,Andrea Sikora

Main category: cs.CL

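TL;DR: This study evaluates GPT-4o on three medication-related tasks: drug-formulation matching, drug-drug interaction identification, and medication order generation. Overall performance was poor, with frequent omissions and hallucinations, indicating a need for domain-specific training on clinician-annotated data.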

Details

Motivation: To evaluate the accuracy and consistency of large language models such as GPT-4o in medication decision-making, specifically their reliability in drug naming, interaction assessment, and order generation. Method: GPT-4o is tested in three experiments: drug-formulation matching, drug-drug interaction identification using internal knowledge and web search, and medication order sentence generation. Accuracy is quantified with cosine similarity, Levenshtein similarity, ROUGE F1, and manual evaluation by clinicians. Result: In formulation matching, only 49% of medications were matched completely correctly; in drug-drug interaction identification, search augmentation lowered accuracy when no interaction was present (100% vs. 40%); and 34.2% of generated order sentences contained errors. Conclusion: In its current state GPT-4o is not suitable for standalone medication management; domain-specific training and a clinical validation framework are needed to improve safety and reliability. Abstract: Background: Large language models (LLMs) can be useful in diagnosing medical conditions, but few studies have evaluated their consistency in recommending appropriate medication regimens. The purpose of this evaluation was to test GPT-4o on three medication benchmarking tests including mapping a drug name to its correct formulation, identifying drug-drug interactions using both its internal knowledge and using a web search, and preparing a medication order sentence after being given the medication name. Methods: Three experiments were completed using GPT-4o. Accuracy was quantified by computing cosine similarity on TF-IDF vectors, normalized Levenshtein similarity, and ROUGE-1/ROUGE-L F1 between each response and its reference string or by manual evaluation by clinicians. Results: GPT-4o performed poorly on drug-formulation matching, with frequent omissions of available drug formulations (mean 1.23 per medication) and hallucinations of formulations that do not exist (mean 1.14 per medication). Only 49% of tested medications were correctly matched to all available formulations. Accuracy decreased for medications with more formulations (p<0.0001). GPT-4o was also inconsistent at identifying drug-drug interactions, although it had better performance with the search-augmented assessment compared to its internal knowledge (54.7% vs. 69.2%, p=0.013). However, allowing a web search worsened performance when there was no drug-drug interaction (median % correct 100% vs. 40%, p<0.001). Finally, GPT-4o performed moderately with preparing a medication order sentence, with only 65.8% of medication order sentences containing no medication or abbreviation errors. Conclusions: Model performance was overall poor for all tests. This highlights the need for domain-specific training through clinician-annotated datasets and a comprehensive evaluation framework for benchmarking performance.
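
One of the string metrics named in the Methods, normalized Levenshtein similarity, can be sketched in a few lines of Python. The abstract does not give the normalization, so dividing the edit distance by the longer string's length is an assumption here, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 when every character must change.

    Assumption: normalize by the longer string's length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

A character-level score like this rewards near-miss spellings of a formulation name, which is presumably why the study pairs it with token-level metrics and clinician review rather than relying on it alone.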

[22] LLMs Behind the Scenes: Enabling Narrative Scene Illustration

Melissa Roemmele,John Joon Young Chung,Taewook Kim,Yuqian Sun,Alex Calderwood,Max Kreminski

Main category: cs.CL

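TL;DR: This paper explores using large language models (LLMs) as an interface to text-to-image models for automatically generating story scene illustrations, and builds a dataset called SceneIllustrations to advance research on cross-modal narrative transformation.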

Details

Motivation: Generative AI makes it possible to transform content from one medium to another, and turning text into visual illustrations is especially valuable for storytelling. However, accurately capturing the scene information implicit in a story and generating high-quality illustrations remains challenging. Method: An LLM-based pipeline parses raw story text to produce prompts suitable for text-to-image models, which then generate scene illustrations. The pipeline is applied to a prominent story corpus, and illustration quality is assessed through human annotation. Result: The SceneIllustrations dataset was built, and experiments show that LLMs can effectively extract and verbalize the scene knowledge implicit in story text, which benefits both illustration generation and evaluation. Conclusion: LLMs play a key role in cross-modal narrative transformation: they improve text-to-image generation quality and provide a resource and methodological basis for future research. Abstract: Generative AI has established the opportunity to readily transform content from one medium to another. This capability is especially powerful for storytelling, where visual illustrations can illuminate a story originally expressed in text. In this paper, we focus on the task of narrative scene illustration, which involves automatically generating an image depicting a scene in a story. Motivated by recent progress on text-to-image models, we consider a pipeline that uses LLMs as an interface for prompting text-to-image models to generate scene illustrations given raw story text. We apply variations of this pipeline to a prominent story corpus in order to synthesize illustrations for scenes in these stories. We conduct a human annotation task to obtain pairwise quality judgments for these illustrations. The outcome of this process is the SceneIllustrations dataset, which we release as a new resource for future work on cross-modal narrative transformation. Through our analysis of this dataset and experiments modeling illustration quality, we demonstrate that LLMs can effectively verbalize scene knowledge implicitly evoked by story text. Moreover, this capability is impactful for generating and evaluating illustrations.

[23] What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?

Mohammed Sabry,Anya Belz

Main category: cs.CL

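TL;DR: This paper studies whether adding synthetic data (such as forward/backward copying) to pretraining improves in-context learning (ICL) at a fixed compute budget. The authors propose Bi-Induct and run experiments across model scales. Synthetic data speeds up the emergence of induction heads but does not consistently improve ICL; larger models trained on natural text alone generalize better and form induction circuits earlier. The results indicate that merely activating induction circuits is not enough: what matters is whether those circuits become functionally necessary.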

Details

Motivation: To investigate whether, under a fixed compute budget, explicitly introducing synthetic data that promotes induction circuits improves in-context learning more than natural text does, and to understand the mechanism by which ICL emerges. Method: Bi-Induct, a lightweight curriculum, injects forward-copy (Induction), backward-copy (Anti), or mixed patterns into the pretraining stream. Models from 0.13B to 1B parameters are trained under matched compute (iso-FLOPs) and evaluated on few-shot ICL tasks, induction-head activation, and language modeling perplexity. Result: Bi-Induct accelerates induction-head formation in small models but does not yield consistent ICL gains. It matches natural-only training on standard LM benchmarks, and the 1B natural-only model performs best on function-style ICL probes. Larger natural-only models develop broader induction heads earlier. Ablating the top 2% of induction heads hurts ICL more in natural-only models, indicating more centralized, critical circuits, whereas Bi-Induct models show more redundant activity. Conclusion: Inducing induction-circuit activation alone is not sufficient to improve in-context learning; real ICL gains come from these circuits becoming functionally indispensable. The study calls for data mixtures that foster load-bearing structure and for mechanism-aware pretraining diagnostics. Abstract: Does explicitly exercising the induction circuit during pretraining improve in-context learning (ICL), or is natural text sufficient when compute is held constant (iso-FLOPs)? To test whether targeted synthetic data can accelerate induction-head emergence and enhance ICL, we introduce Bi-Induct, a lightweight curriculum that injects forward-copy (Induction), backward-copy (Anti), or a balanced mix into the pretraining stream. We train models from 0.13B to 1B parameters under iso-FLOPs, evaluating (i) few-shot ICL benchmarks, (ii) head-level telemetry, and (iii) held-out language modeling perplexity. Our findings challenge the assumption that early induction circuit activation directly improves ICL. While Bi-Induct accelerates induction-head emergence at small scales, this does not consistently yield stronger generalization. On standard LM benchmarks, Bi-Induct matches natural-only training; on function-style ICL probes, the 1B natural-only model performs best. Stress tests (e.g., label permutation, HITS@1 vs. HITS@3, 1 vs. 10 shots) preserve these trends. Telemetry shows larger natural-only models develop broader, earlier induction heads without explicit induction patterns. Anti-induction data fails to elicit meaningful activation. Perplexity penalties from synthetic data shrink with scale, suggesting larger models can absorb non-natural patterns with minimal cost. Crucially, ablating the top 2% of induction heads degrades ICL more than random ablations, especially for natural-only models, indicating more centralized, load-bearing circuits. Bi-Induct variants exhibit more redundant induction activity, implying different circuit utilization. Overall, inducing activation is not sufficient: ICL gains depend on these circuits becoming functionally necessary. These results underscore mechanism-aware pretraining diagnostics and data mixtures that foster load-bearing, not merely present, structure.
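
To make the forward-copy and backward-copy patterns concrete, here is a rough Python sketch of what such synthetic sequences might look like. The abstract does not specify the sequence format, so the layout (a random token span followed by its copy or its reversal), the function names, and every parameter are assumptions made for illustration.

```python
import random


def make_copy_example(span_len, vocab_size, mode, rng):
    """Build one synthetic token sequence (hypothetical format).

    forward  -> span followed by the same span (induction-style copy)
    backward -> span followed by the reversed span (anti-induction-style copy)
    """
    span = [rng.randrange(vocab_size) for _ in range(span_len)]
    if mode == "forward":
        return span + span
    if mode == "backward":
        return span + span[::-1]
    raise ValueError(f"unknown mode: {mode!r}")


def mixed_batch(n, span_len, vocab_size, forward_ratio=0.5, seed=0):
    """Sample a batch mixing both patterns; forward_ratio=0.5 gives a balanced mix."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        mode = "forward" if rng.random() < forward_ratio else "backward"
        batch.append((mode, make_copy_example(span_len, vocab_size, mode, rng)))
    return batch
```

In a real curriculum such sequences would be interleaved with natural text at some mixing rate, with total training compute held fixed as in the iso-FLOPs comparison above.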

[24] Emergent morpho-phonological representations in self-supervised speech models

Jon Gauthier,Canaan Breiss,Matthew Leonard,Edward F. Chang

Main category: cs.CL

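TL;DR: This study examines how self-supervised speech models represent English noun and verb inflections, finding that the representations have a global linear geometry that reflects regular distributional relationships between words rather than mapping directly onto phonological or morphological units.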

Details

Motivation: To understand what kinds of linguistic representations self-supervised speech models rely on to recognize spoken words in noisy environments. Method: The authors analyze the phonological and morphological representations of S3M variants optimized for word recognition on frequent English noun and verb inflections. Result: The representations exhibit a global linear geometry that does not directly correspond to phonological or morphological units; instead it captures regular distributional relationships among many word pairs in the English lexicon, which often (but not always) arise from morphological inflection. Conclusion: This geometric structure points to representational strategies that may support human spoken word recognition, challenging the traditional view that phonology and morphology require distinct linguistic representations. Abstract: Self-supervised speech models can be trained to efficiently recognize spoken words in naturalistic, noisy environments. However, we do not understand the types of linguistic representations these models use to accomplish this task. To address this question, we study how S3M variants optimized for word recognition represent phonological and morphological phenomena in frequent English noun and verb inflections. We find that their representations exhibit a global linear geometry which can be used to link English nouns and verbs to their regular inflected forms. This geometric structure does not directly track phonological or morphological units. Instead, it tracks the regular distributional relationships linking many word pairs in the English lexicon, often, but not always, due to morphological inflection. These findings point to candidate representational strategies that may support human spoken word recognition, challenging the presumed necessity of distinct linguistic representations of phonology and morphology.

[25] Same Content, Different Representations: A Controlled Study for Table QA

Yue Zhang,Seiji Maekawa,Nikita Bhutani

Main category: cs.CL

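TL;DR: This paper presents the first controlled study of how table representation affects Table Question Answering (Table QA) performance, holding content constant while varying structure in order to compare modeling paradigms systematically.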

Details

Motivation: Real-world Table QA must handle both structured and semi-structured data, but existing benchmarks have not systematically examined how representation affects model performance. Method: A verbalization pipeline generates paired structured and semi-structured tables, and a diagnostic benchmark is built with splits along several dimensions (table size, join requirements, query complexity, and schema quality). Result: SQL-based methods perform well on structured data but degrade markedly on semi-structured data; LLMs are flexible but less precise; hybrid approaches perform better under noisy schemas. These differences grow with larger tables and more complex queries. Conclusion: No single method is best under all conditions, and table representation has a critical effect on performance. The findings offer practical guidance for model selection and design and motivate robust hybrid approaches suited to diverse real-world settings. Abstract: Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. However, existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance. We present the first controlled study that isolates the role of table representation by holding content constant while varying structure. Using a verbalization pipeline, we generate paired structured and semi-structured tables, enabling direct comparisons across modeling paradigms. To support detailed analysis, we introduce a diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality. Our experiments reveal consistent trade-offs: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data, LLMs exhibit flexibility but reduced precision, and hybrid approaches strike a balance, particularly under noisy schemas. These effects intensify with larger tables and more complex queries. Ultimately, no single method excels across all conditions, and we highlight the central role of representation in shaping Table QA performance. Our findings provide actionable insights for model selection and design, paving the way for more robust hybrid approaches suited for diverse real-world data formats.

[26] ADAM: A Diverse Archive of Mankind for Evaluating and Enhancing LLMs in Biographical Reasoning

Jasin Cekinmez,Omid Ghahroodi,Saad Fowad Chandle,Dhiman Gupta,Ehsaneddin Asgari

Main category: cs.CL

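TL;DR: This paper proposes ADAM, a framework for evaluating and improving multimodal large language models in biographical reasoning. It comprises AdamDB, a multilingual multimodal dataset; AdamBench, a benchmark structured by Bloom's taxonomy; and AdamRAG, a retrieval-augmented generation system tailored to biographical contexts that reduces hallucinations and improves model performance.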

Details

Motivation: Biographical reasoning is an important but underexplored dimension of factual knowledge. Existing models are prone to hallucination when handling information about people's lives, especially for lesser-known individuals, so a systematic evaluation and improvement framework is needed. Method: The authors build AdamDB, a multilingual multimodal dataset covering over 4 million individuals; design AdamBench, a benchmark spanning six reasoning levels based on Bloom's taxonomy; and propose AdamRAG, a retrieval-augmented generation system optimized for biographical contexts. Result: AdamRAG substantially improves open-source models and modestly benefits closed-source ones, with the largest gains on lower-order reasoning tasks. Popularity strongly affects accuracy, while multimodal input such as face images yields smaller and less consistent gains. Conclusion: ADAM establishes the first biographical evaluation benchmark that integrates cognitive, cultural, and multimodal elements, advancing multilingual, accurate, and hallucination-resistant multimodal LLMs. Abstract: We introduce ADAM (A Diverse Archive of Mankind), a framework for evaluating and improving multimodal large language models (MLLMs) in biographical reasoning. To the best of our knowledge, this is the first work to systematically examine LLM capabilities in biography, a critical yet underexplored dimension of factual knowledge. At its core, AdamDB is a multilingual and multimodal dataset covering over 4 million individuals across geography, time, and profession, while AdamBench provides cognitively structured evaluations based on Bloom's taxonomy, spanning six reasoning levels in both English and native languages. To address hallucinations, particularly for lesser-known individuals, we propose AdamRAG, a retrieval-augmented generation system tailored to biographical contexts. Experiments show that AdamRAG substantially improves open-source models and modestly benefits closed-source ones, with the largest gains on lower-order reasoning. Popularity strongly mediates accuracy, and multimodal input via face images offers smaller, less consistent improvements than retrieval. ADAM establishes the first benchmark and framework for cognitively, culturally, and multimodally grounded biographical evaluation, advancing the development of multilingual, accurate, and hallucination-resistant MLLMs.

[27] AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts

Jiří Milička,Anna Marklová,Václav Cvrček

Main category: cs.CL

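TL;DR: This paper introduces two corpora of English and Czech texts generated by large language models, intended for linguistic comparison with human-written texts.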

Details

Motivation: To create a resource that is multi-genre and rich in topics, authors, and text types for comparing the linguistic features of human writing with LLM-generated text, while remaining comparable with existing human corpora. Method: The corpora were generated with large language models from OpenAI, Anthropic, Alphabet, Meta, and DeepSeek (ranging from GPT-3 to GPT-4.5), replicating two reference human corpora, BE21 and Koditex, and were tokenized, lemmatized, and morphosyntactically annotated according to the Universal Dependencies standard. Result: The English part contains on average 864k tokens per model, 27M tokens in total; the Czech part contains on average 768k tokens per model, 21.5M tokens in total. The corpora are publicly released under the CC BY 4.0 license (annotated data under CC BY-NC-SA 4.0) and are accessible through the search interface of the Czech National Corpus. Conclusion: The resource provides high-quality, structured, and easily accessible multilingual data for analyzing linguistic differences between LLM-generated and human-written text. Abstract: This article presents two corpora of English and Czech texts generated with large language models (LLMs). The motivation is to create a resource for comparing human-written texts with LLM-generated text linguistically. Emphasis was placed on ensuring these resources are multi-genre and rich in terms of topics, authors, and text types, while maintaining comparability with existing human-created corpora. These generated corpora replicate reference human corpora: BE21 by Paul Baker, which is a modern version of the original Brown Corpus, and the Koditex corpus, which also follows the Brown Corpus tradition but in Czech. The new corpora were generated using models from OpenAI, Anthropic, Alphabet, Meta, and DeepSeek, ranging from GPT-3 (davinci-002) to GPT-4.5, and are tagged according to the Universal Dependencies standard (i.e., they are tokenized, lemmatized, and morphologically and syntactically annotated). The subcorpus size varies according to the model used (the English part contains on average 864k tokens per model, 27M tokens altogether; the Czech part contains on average 768k tokens per model, 21.5M tokens altogether). The corpora are freely available for download under the CC BY 4.0 license (the annotated data are under the CC BY-NC-SA 4.0 license) and are also accessible through the search interface of the Czech National Corpus.

[28] Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

Yaorui Shi,Yuxin Chen,Siyuan Wang,Sihang Li,Hengxing Cai,Qi Gu,Xiang Wang,An Zhang

Main category: cs.CL

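TL;DR: This paper proposes ReMemR1, a memory-augmented agent that combines a callback mechanism with multi-level-reward reinforcement learning to address information loss and sparse supervision in long-context question answering.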

Details

Motivation: Existing long-context QA methods suffer from irreversible forward-only single-pass processing, information loss through overwriting, and sparse reinforcement learning signals, which makes it hard to use dispersed evidence effectively. Method: ReMemR1 introduces a callback-enabled memory history that supports non-linear reasoning and revisiting of early evidence, together with Reinforcement Learning with Multi-Level Rewards (RLMLR), which combines final-answer rewards with dense step-level signals to guide memory use. Result: The approach significantly outperforms existing memory-based methods on long-document QA, validating its benefits for information retention, supervision, and multi-hop memory use. Conclusion: Through revisitable memory and multi-level rewards, ReMemR1 offers an efficient and scalable solution for long-context reasoning agents. Abstract: Large language models face challenges in long-context question answering, where key evidence of a query may be dispersed across millions of tokens. Existing works equip large language models with a memory corpus that is dynamically updated during a single-pass document scan, also known as the "memorize while reading" methods. While this approach scales efficiently, it suffers from irreversible forward-only processing, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, a memory-augmented agent with callback-enhanced memory that allows selective retrieval from the entire memory history and enables non-linear reasoning and revisiting of early evidence. To further strengthen training, we propose Reinforcement Learning with Multi-Level Rewards (RLMLR), which combines final-answer rewards with dense, step-level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support multi-hop memory utilization. Experiments on long-document QA show significant gains over existing memory-based approaches, which validates ReMemR1 as an effective solution for long-context reasoning agents.

[29] Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate

Binwei Yao,Chao Shang,Wanyu Du,Jianfeng He,Ruixue Lian,Yi Zhang,Hang Su,Sandesh Swamy,Yanjun Qi

Main category: cs.CL

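TL;DR: This paper proposes the first operational framework for sycophancy in multi-agent debating systems (MADS): it gives a formal definition, develops evaluation metrics, shows the negative impact of sycophancy on information exchange and debate outcomes, and proposes design principles that balance cooperation with disagreement.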

Details

Motivation: Large language models show excessive agreeability (sycophancy) in multi-agent debating systems, causing debates to reach consensus prematurely and harming innovation and accuracy, yet the impact of inter-agent sycophancy is poorly understood. Method: The authors propose a formal definition of sycophancy specific to MADS settings, design new evaluation metrics, and systematically study how the sycophancy levels of different roles (debaters and judges) affect outcomes in decentralized and centralized debate frameworks. Result: Sycophancy is a core failure mode that causes multi-agent debates to collapse early and to score lower accuracy than single-agent baselines, with distinct debater-driven and judge-driven failure mechanisms. Conclusion: Agent interactions should be governed by design principles that preserve effective cooperation while encouraging productive disagreement, improving the performance of multi-agent debating systems. Abstract: Large language models (LLMs) often display sycophancy, a tendency toward excessive agreeability. This behavior poses significant challenges for multi-agent debating systems (MADS) that rely on productive disagreement to refine arguments and foster innovative thinking. LLMs' inherent sycophancy can collapse debates into premature consensus, potentially undermining the benefits of multi-agent debate. While prior studies focus on user-LLM sycophancy, the impact of inter-agent sycophancy in debate remains poorly understood. To address this gap, we introduce the first operational framework that (1) proposes a formal definition of sycophancy specific to MADS settings, (2) develops new metrics to evaluate the agent sycophancy level and its impact on information exchange in MADS, and (3) systematically investigates how varying levels of sycophancy across agent roles (debaters and judges) affect outcomes in both decentralized and centralized debate frameworks. Our findings reveal that sycophancy is a core failure mode that amplifies disagreement collapse before reaching a correct conclusion in multi-agent debates, yields lower accuracy than single-agent baselines, and arises from distinct debater-driven and judge-driven failure modes. Building on these findings, we propose actionable design principles for MADS, effectively balancing productive disagreement with cooperation in agent interactions.

[30] Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks

Chunyang Jiang,Yonggang Zhang,Yiyang Cai,Chi-Min Chan,Yulong Liu,Mingming Chen,Wei Xue,Yike Guo

Main category: cs.CL

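TL;DR: Proposes a lightweight self-improvement method that needs no self-evaluation: semantic voting (based on semantic similarity) replaces conventional majority voting, improving the efficiency and performance of large language models on unverifiable tasks.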

Details

Motivation: To reduce the cost of acquiring supervised data and to address the high computational overhead and intrinsic bias of existing self-evaluation methods on unverifiable tasks. Method: A semantic voting mechanism uses a lightweight sentence embedding model to measure semantic similarity between outputs, achieving soft matching without relying on a large model for self-evaluation. Result: Experiments show that the method outperforms self-evaluation approaches across a variety of model architectures and tasks, with substantial gains in computational efficiency and better overall performance. Conclusion: Semantic voting is an efficient, low-overhead self-improvement method suited to unverifiable tasks that overcomes the high cost and overconfidence associated with self-evaluation. Abstract: The rising cost of acquiring supervised data has driven significant interest in self-improvement for large language models (LLMs). Straightforward unsupervised signals like majority voting have proven effective in generating pseudo-labels for verifiable tasks, while their applicability to unverifiable tasks (e.g., translation) is limited by the open-ended character of responses. As a result, self-evaluation mechanisms (e.g., self-judging and entropy minimization) are predominantly used to derive pseudo-labels. However, self-evaluation relying on LLMs typically incurs high computational overhead and introduces overconfidence issues due to intrinsic biases. To address these challenges, we propose a novel self-evaluation-free approach for unverifiable tasks, designed for lightweight yet effective self-improvement. Inspired by majority voting commonly employed in verifiable tasks, we propose semantic voting as a novel mechanism that relaxes the principle of hard matching (i.e., exact matching) toward soft matching (i.e., semantic similarity). Soft matching is achieved by leveraging a lightweight sentence embedding model to quantify semantic similarity, thereby mitigating excessive computational burden and intrinsic bias-associated limitations of self-evaluation. Comprehensive experiments demonstrate that our method achieves substantial gains in computational efficiency and overall better performance than self-evaluation methods across diverse model architectures and tasks.
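
The shift from hard matching to soft matching can be sketched as follows. This is a minimal illustration, not the paper's method: the bag-of-words `embed` function is a toy stand-in for a real sentence embedding model, and scoring each candidate by its mean cosine similarity to the others is an assumed voting rule; all names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence embedding model."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items() if token in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def semantic_vote(candidates):
    """Return the candidate with the highest mean similarity to the others.

    Assumed rule: soft agreement replaces exact-match counting; ties go to
    the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]
    vectors = [embed(c) for c in candidates]

    def support(i):
        others = [cosine(vectors[i], vectors[j])
                  for j in range(len(vectors)) if j != i]
        return sum(others) / len(others)

    return candidates[max(range(len(candidates)), key=support)]
```

With exact-match majority voting, three differently worded translations would each receive one vote; under soft matching, the two that share most of their content reinforce each other and the outlier loses.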

[31] From Evidence to Trajectory: Abductive Reasoning Path Synthesis for Training Retrieval-Augmented Generation Agents

Muzhi Li,Jinhu Qi,Yihong Wu,Minghao Zhao,Liheng Ma,Yifan Li,Xinyu Wang,Yingxue Zhang,Ho-fung Leung,Irwin King

Main category: cs.CL

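TL;DR: This paper proposes EviPath, an evidence-anchored reasoning path synthesis paradigm for developing retrieval-augmented generation (RAG) agents. By synthesizing full-process data that includes environment interaction, it substantially improves the question-answering performance of small language models.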

Details

Motivation: Existing RAG agents lack process-level supervision, which makes it hard to train capabilities such as task decomposition, retriever invocation, and stepwise decision-making; reinforcement learning suffers from sparse rewards, and existing data synthesis methods cannot model environment interaction. Method: EviPath has three parts: (i) Abductive Subtask Planning, which decomposes the problem into sub-questions and plans an optimal solution path; (ii) Faithful Sub-question Answering, which uses supporting evidence to build a proxy environment that generates reasoning and answers; and (iii) Conversational Fine-Tuning, which formats the complete agent-environment interaction trajectory as a dialogue for supervised fine-tuning. Result: Experiments on several standard QA benchmarks show that an 8B-parameter model trained on EviPath-synthesized data achieves an absolute EM gain of 14.7% over state-of-the-art methods in open-domain QA and consistently outperforms the baselines. Conclusion: By synthesizing high-quality reasoning paths that include environment interaction, EviPath improves the complex reasoning and tool-use abilities of small language models and offers a scalable, data-driven approach to RAG agent development. Abstract: The development of retrieval-augmented generation (RAG) agents is hindered by the lack of process-level supervision to effectively guide agentic capabilities like task decomposition, retriever invocation, and stepwise decision-making. While reinforcement learning offers a potential solution, it suffers from sparse rewards and the limited reasoning capabilities of large language models (LLMs). Meanwhile, existing data synthesis methods only produce chain-of-thought rationales and fail to model environmental interactions. In this paper, we propose EviPath, an evidence-anchored reasoning path synthesis paradigm for RAG agent development. EviPath comprises: (i) Abductive Subtask Planning, which decomposes the problem into sub-questions and iteratively plans an optimal solution path based on the dependencies between them; (ii) Faithful Sub-question Answering, which uses supporting evidence to construct a proxy environment to generate reasoning thoughts and answers for each sub-question; and (iii) Conversational Fine-Tuning, which formats the complete agent-environment interaction trajectory into a dialogue format suitable for Supervised Fine-Tuning. EviPath allows LLMs to learn complex reasoning and tool-use capabilities directly from synthesized data. Extensive experiments on widely-used question-answering benchmarks show that an 8B parameter model trained with EviPath-synthesized data significantly and consistently outperforms state-of-the-art baselines with a double-digit absolute EM gain of 14.7% in open-domain question answering.

[32] The Geometry of Creative Variability: How Credal Sets Expose Calibration Gaps in Language Models

Esteban Garces Arias,Julian Rodemann,Christian Heumann

Main category: cs.CL

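TL;DR: Proposes a geometric framework based on credal sets to quantify and decompose the uncertainty of large language models in creative text generation; comparison with human creative variation reveals that current models fall short in capturing human creativity.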

Details

Motivation: Understanding the uncertainty of large language models in creative tasks is difficult, especially when many valid outputs exist, so better uncertainty measures are needed. Method: Using credal sets (convex hulls of probability distributions) as a geometric framework, the authors analyze 500 writing prompts with 10 human continuations each and evaluate 100,000 stories generated by four language models under five decoding strategies. Result: The best model-human calibration score is 0.434 (Gemma-2B at temperature 0.7); the choice of decoding strategy accounts for 39.4% to 72.0% of epistemic uncertainty; model scale correlates only weakly with calibration quality, and base and instruction-tuned models do not differ significantly. Conclusion: The geometric framework offers actionable insights for improving generation systems for human-AI creative collaboration and highlights the decisive influence of decoding strategy on uncertainty, beyond model scale or training regime. Abstract: Understanding uncertainty in large language models remains a fundamental challenge, particularly in creative tasks where multiple valid outputs exist. We present a geometric framework using credal sets (convex hulls of probability distributions) to quantify and decompose uncertainty in neural text generation, calibrated against human creative variation. Analyzing 500 creative writing prompts from the WritingPrompts dataset with 10 unique human continuations each, we evaluate four language models across five decoding strategies, generating 100,000 stories. Our credal set analysis reveals substantial gaps in capturing human creative variation, with the best model-human calibration reaching only 0.434 (Gemma-2B with temperature 0.7). We decompose total uncertainty into epistemic and aleatoric components, finding that the choice of decoding strategy contributes 39.4% to 72.0% of total epistemic uncertainty. Model scale shows weak correlation with calibration quality and no significant difference exists between base and instruction-tuned models in calibration quality. Our geometric framework provides actionable insights for improving generation systems for human-AI creative alignment. We release our complete experimental framework.

[33] d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

Yuchu Jiang,Yue Cai,Xiangzhong Luo,Jiale Fu,Jiarui Wang,Chonghan Liu,Xu Yang

Main category: cs.CL

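TL;DR: Proposes d$^2$Cache, a training-free approximate KV cache framework that accelerates inference for diffusion-based large language models (dLLMs), delivering substantial speedups together with improved generation quality.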

Details

Motivation: Because dLLMs rely on bidirectional attention, they cannot directly use a standard KV cache, which makes inference inefficient. Method: Designs a two-stage fine-grained selection strategy that adaptively updates the KV states of selected tokens at each decoding step and caches the KV states of the remaining tokens for reuse; it also supports quasi left-to-right generation. Result: Experiments on two representative dLLMs, LLaDA and Dream, show that d$^2$Cache substantially speeds up inference and consistently improves generation quality. Conclusion: d$^2$Cache is an efficient, general inference acceleration method that addresses the unavailability of standard KV caching in dLLMs while balancing speed and generation quality. Abstract: Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce Dual aDaptive Cache (d$^2$Cache), which is a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which can enable quasi left-to-right generation and mitigate premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.

[34] How to Make Large Language Models Generate 100% Valid Molecules?

Wen Tao,Jing Tang,Alvin Chan,Bryan Hooi,Baolong Bi,Nanyun Peng,Yuansheng Liu,Yiwei Wang

Main category: cs.CL

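TL;DR: Proposes SmiSelf, a framework that corrects chemical molecule representations by converting invalid SMILES to SELFIES, guaranteeing 100% valid generated molecules while maintaining or even improving performance.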

Details

Motivation: Large language models struggle to generate valid molecules (for example in SMILES) in few-shot settings, and using SELFIES directly performs worse, so the validity and accuracy of molecule generation need to improve. Method: Proposes the SmiSelf framework, which uses grammatical rules to convert invalid SMILES to SELFIES, relies on the robustness of the SELFIES grammar to correct errors, and is compatible with existing SMILES-based generative models. Result: Experiments show that SmiSelf guarantees 100% molecular validity, preserves molecular characteristics, and maintains or improves generation performance on other metrics. Conclusion: SmiSelf addresses the difficulty LLMs have in generating valid molecules in few-shot settings and broadens their potential applications in biomedicine. Abstract: Molecule generation is key to drug discovery and materials science, enabling the design of novel compounds with specific properties. Large language models (LLMs) can learn to perform a wide range of tasks from just a few examples. However, generating valid molecules using representations like SMILES is challenging for LLMs in few-shot settings. In this work, we explore how LLMs can generate 100% valid molecules. We evaluate whether LLMs can use SELFIES, a representation where every string corresponds to a valid molecule, for valid molecule generation but find that LLMs perform worse with SELFIES than with SMILES. We then examine LLMs' ability to correct invalid SMILES and find their capacity limited. Finally, we introduce SmiSelf, a cross-chemical language framework for invalid SMILES correction. SmiSelf converts invalid SMILES to SELFIES using grammatical rules, leveraging SELFIES' mechanisms to correct the invalid SMILES. Experiments show that SmiSelf ensures 100% validity while preserving molecular characteristics and maintaining or even enhancing performance on other metrics. SmiSelf helps expand LLMs' practical applications in biomedicine and is compatible with all SMILES-based generative models. Code is available at https://github.com/wentao228/SmiSelf.

[35] Non-Collaborative User Simulators for Tool Agents

Jeonghoon Shim,Woojung Song,Cheyon Jin,Seungwon KooK,Yohan Jo

Main category: cs.CL

TL;DR: Proposes a non-collaborative user simulation method for tool agents.

Details

Motivation: Existing user simulators are mostly cooperative and therefore cannot effectively train or test tool agents against the non-collaborative users found in real-world settings. Method: Designs a new user simulator architecture that simulates four categories of non-collaborative behavior (requesting unavailable services, digressing off topic, expressing impatience, and providing incomplete utterances) and generates natural yet challenging dialogues while still delivering all task-relevant information. Result: Experiments on MultiWOZ and $\tau$-bench show that state-of-the-art tool agents degrade significantly when facing non-collaborative users, with escalated hallucinations and dialogue breakdowns. Conclusion: Contributes an extensible non-collaborative user simulation framework that helps the research community develop tool agents and diagnose them in advance under complex real-world conditions. Abstract: Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, which fails to train and test agents against non-collaborative users in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and $\tau$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents' weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.

[36] Tagging the Thought: Unlocking Personalization Reasoning via Reinforcement Learning

Song Jin,Juntian Zhang,Yong Liu,Xun Zhang,Yufei Zhang,Fei Jiang,Guojun Yin,Wei Lin,Rui Yan

Main category: cs.CL

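TL;DR: Proposes TagPR, a new training framework that uses a "tagging the thought" approach to significantly improve the personalization reasoning ability of large language models.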

Details

Motivation: Large language models perform well at general reasoning but remain weak at personalization reasoning, such as inferring preferences from a user's history. Method: Builds a data-driven pipeline that automatically generates and semantically labels reasoning chains to form a structured dataset; adopts a synergistic training strategy that combines supervised fine-tuning with multi-stage reinforcement learning, guided by a composite reward signal that integrates tag-based constraints and a Personalization Reward Model with User Embeddings (PRMU). Result: Experiments on the public LaMP benchmark and a self-constructed dataset show an average improvement of 32.65% over the base model across all tasks, reaching state-of-the-art results. Conclusion: Structured, interpretable reasoning is an effective path to genuine personalization capabilities in large language models. Abstract: Recent advancements have endowed Large Language Models (LLMs) with impressive general reasoning capabilities, yet they often struggle with personalization reasoning - the crucial ability to analyze user history, infer unique preferences, and generate tailored responses. To address this limitation, we introduce TagPR, a novel training framework that significantly enhances an LLM's intrinsic capacity for personalization reasoning through a "tagging the thought" approach. Our method first develops a data-driven pipeline to automatically generate and semantically label reasoning chains, creating a structured dataset that fosters interpretable reasoning. We then propose a synergistic training strategy that begins with Supervised Fine-Tuning (SFT) on this tagged data to establish foundational reasoning patterns, followed by a multi-stage reinforcement learning (RL) process. This RL phase is guided by a unique composite reward signal, which integrates tag-based constraints and a novel Personalization Reward Model with User Embeddings (PRMU) to achieve fine-grained alignment with user-specific logic. Extensive experiments on the public LaMP benchmark and a self-constructed dataset demonstrate that our approach achieves state-of-the-art results, delivering an average improvement of 32.65% over the base model across all tasks. Our work validates that structured, interpretable reasoning is a highly effective pathway to unlocking genuine personalization capabilities in LLMs.

[37] Tree Reward-Aligned Search for TReASURe in Masked Diffusion Language Models

Zichao Yu,Ming Li,Wenyi Zhang,Weiguo Gao

Main category: cs.CL

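TL;DR: TReASURe is a tree-search test-time alignment method for masked diffusion language models; its UnmaskBranch and ResubstituteScore components address the highly correlated branches produced by parallel unmasking and the high variance of reward estimates, with especially strong results at low NFE budgets.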

Details

Motivation: Applying tree search to masked diffusion language models raises two challenges: parallel unmasking yields highly correlated branches, which limits exploration, and reward evaluation based on sampled completions has high variance, which makes pruning unstable. Method: Proposes TReASURe, built on two techniques: (1) UnmaskBranch, a branching strategy based on first-hitting unmasking that diversifies both token content and reveal order with a single model call; and (2) ResubstituteScore, a pruning rule that scores partially masked sequences using low-variance proxy completions produced by deterministic resubstitution. Result: Theoretical analysis covers NFE efficiency gains, a bound on scoring error, and improvements with larger tree widths; experiments show it outperforms prior methods on perplexity, linguistic acceptability, and sentiment and toxicity control, with particularly strong gains at low NFE. Conclusion: TReASURe addresses the key challenges of tree-search alignment in masked diffusion models and reaches state-of-the-art results across metrics and compute budgets, making it well suited to resource-constrained settings. Abstract: Tree search has recently emerged as a powerful framework for aligning generative models with task-specific rewards at test time. Applying tree search to Masked Diffusion Language Models, however, introduces two key challenges: (i) parallel unmasking yields highly correlated branches, limiting exploration, and (ii) reward evaluation via sampled completions produces high-variance estimates, making pruning unstable. We propose TReASURe, a tree-search test-time alignment method that addresses these issues. It introduces (i) UnmaskBranch, a branching strategy based on first-hitting unmasking that diversifies both token content and reveal order with a single model call per parent node, and (ii) ResubstituteScore, a pruning rule that uses deterministic resubstitution to score partially masked sequences with low-variance proxy completions. Theoretically, we quantify branching efficiency gains in NFEs (number of function evaluations), show that the scoring rule approximates the true reward with error bounded by predictive uncertainty, and prove improvements with larger tree widths. Empirically, TReASURe achieves state-of-the-art results on perplexity, linguistic acceptability, and control of sentiment and toxicity, outperforming prior methods under matched compute budgets, with especially strong gains in low-NFE regimes.

[38] Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

Chenxing Wei,Hong Wang,Ying He,Fei Yu,Yao Shu

Main category: cs.CL

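TL;DR: Proposes T2PAM, a new paradigm for test-time policy adaptation in multi-turn interactions, and ROSA, a lightweight algorithm that uses user feedback to adjust the model's policy in real time, enabling efficient in-conversation self-correction with significant gains in task effectiveness and efficiency.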

Details

Motivation: Large language models are typically trained on static, single-turn data and struggle to adapt to real-time user feedback in multi-turn interactions, so their performance drops on complex tasks. Method: Proposes the T2PAM paradigm, which uses user feedback as a reward signal to estimate a latent optimal policy, and the ROSA algorithm, which adjusts a small subset of model parameters in a single update step to approach that policy. Result: Theoretical analysis shows that ROSA's policy converges to the user's preferences as the number of interactions grows; experiments on several challenging benchmarks show significant gains in both task effectiveness and efficiency. Conclusion: T2PAM and ROSA offer an efficient and practical solution for online adaptation of large language models in multi-turn interactions, with low computational overhead and good potential for real-world use. Abstract: Large Language Models (LLMs) employ multi-turn interaction as a fundamental paradigm for completing complex tasks. However, their performance often degrades in extended interactions, as they are typically trained on static, single-turn data, which hinders their ability to adapt to real-time user feedback. To address this limitation, we first propose a new paradigm: Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), which utilizes user feedback from the ongoing interaction as a reward signal to estimate a latent optimal policy aligned with user preferences, then updates a small subset of parameters to steer the model toward this policy, ultimately enabling efficient in-conversation self-correction. We then introduce Optimum-Referenced One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM. ROSA guides the model parameters toward a theoretical optimal policy in a single, efficient update step, avoiding costly iterative gradient-based optimization and minimizing computational overhead. We provide a rigorous theoretical analysis guaranteeing that the policy of ROSA converges to the user's preferences as the number of interactions increases. Extensive experiments on challenging benchmarks demonstrate that ROSA achieves significant improvements in both task effectiveness and efficiency.

[39] Pretraining LLM with Latent Thoughts in Continuous Space

Boyi Zeng,He Li,Shixiang Song,Yixuan Wang,Ziwei He,Xinbing Wang,Zhouhan Lin

Main category: cs.CL

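TL;DR: Proposes a new pretraining method, "Pretraining Language Models with Latent Thoughts", which has the model produce an intermediate latent thought (a hidden state) before generating each token; at identical inference cost it outperforms a standard model with twice the parameters.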

Details

Motivation: Inspired by how Chain-of-Thought (CoT) improves performance by adding generation steps at test time, the work asks whether adding computational steps during pretraining can likewise improve the generation of each individual token. Method: During pretraining, the language model first generates an intermediate latent thought for the current position (its last hidden state) and then uses that thought as input to predict the actual next token, refining its prediction in continuous space. Result: A 1.4B model that generates one additional latent thought per token significantly outperforms a standard 2.8B model trained on the same data; increasing the number of latent thoughts per token keeps improving performance. Conclusion: Adding latent-thought computation steps significantly improves language model performance at matched inference cost, supporting the value of scaling computation per token during pretraining. Abstract: The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test-time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve the generation of each individual token? To address this, we propose a novel pre-training methodology: Pretraining Language Models with Latent Thoughts. Our approach pretrains a language model (LM) to first generate an intermediate latent thought (the last hidden state of the current position), which is then used as input to predict the actual subsequent token. This additional computational step enables the LM to refine its prediction within unconstrained continuous space. Our experiments demonstrate that, at an identical inference cost, an LM that generates one additional latent thought per token outperforms a standard model with double the parameters. For instance, ours-1.4B (Pythia Arch), pretrained on 300B tokens from the Pile, significantly surpasses the vanilla Pythia-2.8B trained on the same data on both language modeling and a range of general downstream tasks. Furthermore, increasing the number of latent thoughts generated before each actual token, forming a chain analogous to CoT, consistently improves the model's performance.

[40] Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts

Guancheng Wan,Leixin Sun,Longxu Dou,Zitong Shi,Fang Wu,Eric Hanchen Jiang,Wenke Huang,Guibin Zhang,Hejia Geng,Xiangru Tang,Zhenfei Yin,Yizhou Sun,Wei Wang

Main category: cs.CL

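TL;DR: Proposes a full-stack solution to hierarchical compliance failures caused by instruction conflicts in LLM-based multi-agent systems, organized into three stages: diagnose, localize, and align.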

Details

Motivation: Multi-agent systems are widely used for complex tasks, but hierarchical compliance failures caused by instruction conflicts within the system undermine their reliability, and existing macro-level metrics fail to capture micro-level violations. Method: Proposes the CRAS score for fine-grained diagnosis, localizes the key layers through attention drift analysis, and designs SAIL, which applies LoRA only to those focal layers and optimizes a DPO-style objective weighted by each token's attentional contribution. Result: The method is validated on multiple benchmarks and MAS frameworks, improving instruction hierarchy compliance (for example, +5.60% with AutoGen on MedQA) without full-model finetuning. Conclusion: The work enables precise diagnosis and efficient alignment for instruction conflicts in multi-agent systems, offering an interpretable, low-cost path toward reliable deployment. Abstract: Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts (system-user, peer-peer), where agents misprioritize system-level rules in the presence of competing demands. Moreover, widely used macro-level metrics (e.g., pass@k) obscure these micro-level violations and offer little actionable guidance for remedy. In this work, we present a full-stack, three-stage framework: (1) Diagnose - Contextualized Role Adherence Score (CRAS), a query-wise, context-aware scoring metric that decomposes role adherence into four measurable dimensions; (2) Localize - attention drift analysis revealing that instruction conflicts are resolved by attention heads that are largely concentrated in middle layers; (3) Align - Surgical Alignment of Instruction Layers (SAIL), which installs LoRA only on the localized focal layers and optimizes a token-weighted DPO-style preference objective that credits tokens by their focal attentional contribution. Across standard benchmarks and MAS frameworks, our surgical approach improves instruction hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full-model finetuning.
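
To make the "token-weighted DPO-style preference objective" in the SAIL entry more concrete, here is a minimal sketch of one way such a loss could look. It takes the standard DPO loss and replaces each sequence-level log-ratio with a weighted sum of per-token log-ratios. The weighting scheme, the function names, and the default `beta` are assumptions for illustration; the abstract does not give the exact formulation used in the paper.

```python
import math
from typing import Sequence, Tuple

# (policy log-probs, reference log-probs, per-token weights) for one response
TokenTriple = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def weighted_log_ratio(policy_logps: Sequence[float],
                       ref_logps: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Sum over tokens of w_t * (log pi(y_t) - log pi_ref(y_t))."""
    if not (len(policy_logps) == len(ref_logps) == len(weights)):
        raise ValueError("per-token inputs must have equal length")
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))


def token_weighted_dpo_loss(chosen: TokenTriple,
                            rejected: TokenTriple,
                            beta: float = 0.1) -> float:
    """-log(sigmoid(beta * (weighted ratio of chosen - weighted ratio of rejected)))."""
    margin = beta * (weighted_log_ratio(*chosen) - weighted_log_ratio(*rejected))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With all weights set to 1 this reduces to the usual DPO loss; in the assumed setting, the weights would come from each token's attention contribution in the localized layers.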

[41] Estimating the strength and timing of syntactic structure building in naturalistic reading

Nan Wang,Jiaxuan Li

Main category: cs.CL

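TL;DR: Using EEG and eye-tracking data, the study shows that phrase structure construction in sentence processing can precede syntactic category detection and dominate lexical influences, supporting a predictive "tree-scaffolding" account of comprehension.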

Details

Motivation: To address the question of syntactic timing in psycholinguistics by separating syntactic category detection from phrase structure construction and determining their temporal order. Method: Uses co-registered EEG and eye-tracking data from the ZuCo corpus, with analyses of gaze transitions, Bayesian network modeling, and fixation-related potentials. Result: Readers preferentially move their gaze between syntactic heads, structural depth is the strongest driver of deviations from linear reading, and syntactic surprisal influences neural activity before word onset. Conclusion: Phrase structure construction can precede category detection and dominate lexical influences, supporting a predictive "tree-scaffolding" account of comprehension. Abstract: A central question in psycholinguistics is the timing of syntax in sentence processing. Much of the existing evidence comes from violation paradigms, which conflate two separable processes - syntactic category detection and phrase structure construction - and implicitly assume that phrase structure follows category detection. In this study, we use co-registered EEG and eye-tracking data from the ZuCo corpus to disentangle these processes and test their temporal order under naturalistic reading conditions. Analyses of gaze transitions showed that readers preferentially moved between syntactic heads, suggesting that phrase structures, rather than serial word order, organize scanpaths. Bayesian network modeling further revealed that structural depth was the strongest driver of deviations from linear reading, outweighing lexical familiarity and surprisal. Finally, fixation-related potentials demonstrated that syntactic surprisal influences neural activity before word onset (-184 to -10 ms) and during early integration (48 to 300 ms). These findings extend current models of syntactic timing by showing that phrase structure construction can precede category detection and dominate lexical influences, supporting a predictive "tree-scaffolding" account of comprehension.

[42] From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

Haonan Wang,Weida Liang,Zihang Fu,Nie Zheng,Yifan Zhang,Yao Tong,Tongyao Zhu,Hao Jiang,Chuang Li,Jiaying Wu,Kenji Kawaguchi

Main category: cs.CL

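TL;DR: Proposes Insight-to-Solve (I2S), a new test-time reasoning framework that turns in-context demonstrations into reusable insights, addressing the performance drop that current reasoning language models show under few-shot chain-of-thought prompting.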

Details

Motivation: Recent reasoning language models (RLMs) trained with verifier-based reinforcement learning perform worse with few-shot chain-of-thought (CoT) than with direct answering, a paradoxical degradation; the paper aims to identify its root causes and propose a remedy. Method: Analyzes high-quality reasoning traces (for example from DeepSeek-R1) and identifies two mechanisms, semantic misguidance and strategy transfer failure; on that basis it designs the I2S framework, which turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace, optionally self-refining the reasoning (I2S+). Result: Experiments across open- and closed-source models and benchmarks such as AIME'25 and GPQA show that I2S and I2S+ consistently outperform direct answering and other test-time scaling methods; for example, GPT-4.1 improves by 14.0% on AIME'25, and o1-mini improves by 2.7% on AIME and 1.7% on GPQA. Conclusion: By converting in-context demonstrations into structured insights, the I2S framework overcomes semantic misguidance and strategy transfer failure and offers a new way to improve reasoning in few-shot settings. Abstract: Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot CoT than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as the same as the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to target questions. Guided by these, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME'25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via the insight-refine-solve framework.

[43] Global Beats, Local Tongue: Studying Code Switching in K-pop Hits on Billboard Charts

Aditya Narayan Sankaran,Reza Farahbakhsh,Noel Crespi

Main category: cs.CL

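TL;DR: The study analyzes K-pop songs that charted on Billboard from 2017 to 2025 and finds that English usage and Korean-English code-switching dominate globally successful K-pop, that female solo artists favor English more consistently, and that higher English usage appears more critical for entering the US-focused Hot 100.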

Details

Motivation: To examine how language choices in K-pop, especially Korean-English code-switching, reflect aesthetic preferences and market strategies as the genre globalizes, and how these linguistic features relate to global chart success. Method: Compiles a dataset of K-pop songs that appeared on the Billboard Hot 100 and Global 200 charts from 2017 to 2025, covering 14 groups and 8 solo artists; analyzes the proportion of Korean and English lyrics, the frequency of code-switching, and other stylistic features; runs a gender classification task using multilingual embeddings and handcrafted features; and compares language use across the two charts. Result: English dominates charting K-pop songs; both male and female performers code-switch frequently, with no significant gender difference, although female solo artists favor English more consistently; predicting performer gender from lyrics reaches a macro F1 of up to 0.76; songs on the Hot 100 use more English than those on the Global 200, suggesting English matters more for success in the US market. Conclusion: Language choices in K-pop lyrics are strongly shaped by global market pressures; English usage and code-switching are both a means of artistic expression and a reflection of performer identity and adaptation to target markets, particularly the US market. Abstract: Code switching, particularly between Korean and English, has become a defining feature of modern K-pop, reflecting both aesthetic choices and global market strategies. This paper is a primary investigation into the linguistic strategies employed in K-pop songs that achieve global chart success, with a focus on the role of code-switching and English lyric usage. A dataset of K-pop songs that appeared on the Billboard Hot 100 and Global 200 charts from 2017 to 2025, spanning 14 groups and 8 solo artists, was compiled. Using this dataset, the proportion of English and Korean lyrics, the frequency of code-switching, and other stylistic features were analysed. It was found that English dominates the linguistic landscape of globally charting K-pop songs, with both male and female performers exhibiting high degrees of code-switching and English usage. Statistical tests indicated no significant gender-based differences, although female solo artists tend to favour English more consistently. A classification task was also performed to predict performer gender from lyrics, achieving macro F1 scores up to 0.76 using multilingual embeddings and handcrafted features. Finally, differences between songs charting on the Hot 100 versus the Global 200 were examined, suggesting that, while there is no significant gender difference in English usage, higher English usage may be more critical for success in the US-focused Hot 100. The findings highlight how linguistic choices in K-pop lyrics are shaped by global market pressures and reveal stylistic patterns that reflect performer identity and chart context.
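
As a simple illustration of the lyric features mentioned in the K-pop entry, the sketch below computes an English share and a count of Korean-English switch points for a whitespace-tokenized lyric. The script-based token classification and the definition of a "switch" as a change of language between adjacent tokens are assumptions made here; the abstract does not state how the paper measures code-switching frequency.

```python
import re
from typing import Dict, Union

# Hangul syllables, Hangul Jamo, and Hangul compatibility Jamo.
HANGUL = re.compile(r"[\uac00-\ud7a3\u1100-\u11ff\u3130-\u318f]")
LATIN = re.compile(r"[A-Za-z]")


def token_language(token: str) -> str:
    """Classify a token by script; tokens mixing both scripts count as Korean."""
    if HANGUL.search(token):
        return "ko"
    if LATIN.search(token):
        return "en"
    return "other"  # digits, punctuation, other scripts


def lyric_stats(lyrics: str) -> Dict[str, Union[int, float]]:
    """Return token count, English share, and number of ko/en switch points."""
    langs = [token_language(tok) for tok in lyrics.split()]
    langs = [lang for lang in langs if lang != "other"]
    switches = sum(1 for prev, cur in zip(langs, langs[1:]) if prev != cur)
    total = len(langs)
    english_share = langs.count("en") / total if total else 0.0
    return {"tokens": total, "english_share": english_share, "switches": switches}
```

A token-level count like this is coarse: it ignores romanized Korean and intra-word mixing, so it is best read as a starting point, not a faithful reproduction of the study's measure.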

[44] Steering Prepositional Phrases in Language Models: A Case of with-headed Adjectival and Adverbial Complements in Gemma-2

Stefan Arnold,René Gröbner

Main category: cs.CL

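TL;DR: The study examines which function Gemma-2 prefers for the complement when generating prepositional phrases (instrumental adjunct or attributive modifier) and controls that choice by scaling the value vectors of attention heads.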

Details

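Motivation: To understand the internal mechanisms by which a language model decides the functional role of a prepositional complement, in particular whether it modifies the verb adverbially or the noun adjectivally. Method: Builds a prompt suite of with-headed prepositional phrases whose contexts accommodate either an instrumental or an attributive continuation, identifies attention heads that favor instrumental complements by projecting activations into the vocabulary space, and shifts the distribution of complement functions by scaling the value vector of a specific attention head. Result: The model shows a strong preference for the instrumental reading (reported at a ratio of 3:4); scaling the value vector of a single attention head reduces instrumental uses to 33% while raising attributive uses to 36%. Conclusion: Specific attention heads can be identified and manipulated to change the distribution of functional roles of prepositional complements, revealing part of how the model handles this syntactic ambiguity. Abstract: Language Models, when generating prepositional phrases, must often decide whether their complements function as an instrumental adjunct (describing the verb adverbially) or an attributive modifier (enriching the noun adjectivally), yet the internal mechanisms that resolve this split decision remain poorly understood. In this study, we conduct a targeted investigation into Gemma-2 to uncover and control the generation of prepositional complements. We assemble a prompt suite containing with-headed prepositional phrases whose contexts equally accommodate either an instrumental or attributive continuation, revealing a strong preference for an instrumental reading at a ratio of 3:4. To pinpoint individual attention heads that favor instrumental over attributive complements, we project activations into the vocabulary space. By scaling the value vector of a single attention head, we can shift the distribution of functional roles of complements, attenuating instruments to 33% while elevating attributes to 36%.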

[45] PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness

Huacan Chai,Zijie Cao,Maolin Ran,Yingxuan Yang,Jianghao Lin,pengxin,Hairui Wang,Renjie Ding,Ziyu Wan,Muning Wen,Weiwen Liu,Weinan Zhang,Fei Huang,Ying Wen

Main category: cs.CL

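TL;DR: Proposes PARL-MT, a framework that introduces progress awareness to improve large language models at multi-turn function calling; it combines the PAG pipeline for automatic data generation with the PAG-RL reinforcement learning algorithm and significantly outperforms existing methods on two public benchmarks.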

Details

Motivation: Existing multi-turn approaches either neglect task-level planning or lack an explicit progress-awareness mechanism in reinforcement learning, which leads to incoherent or redundant execution of long-horizon tasks; a method that models progress awareness explicitly is needed. Method: Proposes the PARL-MT framework, comprising the Progress Awareness Generation (PAG) pipeline, which builds datasets that couple conversation summaries with future task planning, and the Progress Awareness-Guided Reinforcement Learning (PAG-RL) algorithm, which integrates progress information into RL training to reduce contextual redundancy and improve alignment between local actions and the global goal. Result: On two public multi-turn function-calling benchmarks, PARL-MT significantly outperforms existing methods, confirming its benefit for task coherence and execution efficiency. Conclusion: An explicit progress-awareness mechanism is essential for robust and efficient multi-turn function calling, and PARL-MT provides an effective solution for applying large language models to complex, long-horizon tasks. Abstract: Large language models (LLMs) have achieved impressive success in single-turn function calling, yet real-world applications such as travel planning or multi-stage data analysis typically unfold across multi-turn conversations. In these settings, LLMs must not only issue accurate function calls at each step but also maintain progress awareness, the ability to summarize past interactions and plan future actions to ensure coherent, long-horizon task execution. Existing approaches, however, either reduce multi-turn training to isolated single-turn samples, which neglects task-level planning, or employ end-to-end reinforcement learning (RL) that struggles with redundancy and lacks explicit integration of progress awareness. To overcome these limitations, we introduce PARL-MT, a framework that explicitly incorporates progress awareness into LLM training for multi-turn function calling. PARL-MT combines (i) a Progress Awareness Generation (PAG) pipeline, which automatically constructs datasets coupling conversation summaries with future task planning, and (ii) a Progress Awareness-Guided Reinforcement Learning (PAG-RL) algorithm, which integrates progress awareness into RL training to reduce contextual redundancy and improve alignment between local actions and global task completion. Empirical results on two public benchmarks demonstrate that PARL-MT significantly outperforms existing methods, highlighting the effectiveness of progress awareness in enabling robust and efficient multi-turn function calling.

[46] A Structured Framework for Evaluating and Enhancing Interpretive Capabilities of Multimodal LLMs in Culturally Situated Tasks

Haorui Yu,Ramon Ruiz-Dolz,Qiufeng Yi

Main category: cs.CL

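TL;DR: The study develops a quantitative framework for evaluating how well mainstream vision-language models (VLMs) generate critiques of traditional Chinese painting, extracting multi-dimensional evaluative features from human expert critiques and defining several critic personas to test models such as Llama, Qwen, and Gemini.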

Details

Motivation: To evaluate how current VLMs perform at generating critiques of traditional Chinese painting and to explore their potential and limitations in complex semantic understanding and artistic content generation. Method: Builds a multi-dimensional quantitative evaluation framework based on a zero-shot classification model, extracting features such as evaluative stance, feature focus, and commentary quality; defines representative critic personas; and evaluates several VLMs with persona-guided prompting. Result: The experiments reveal the current performance levels, strengths, and areas for improvement of VLMs in art critique generation, with models differing in perspective diversity, semantic depth, and cultural understanding. Conclusion: The study provides a quantifiable evaluation benchmark for applying VLMs to art criticism and identifies weaknesses in cultural sensitivity and deep semantic understanding that can guide future improvements. Abstract: This study aims to test and evaluate the capabilities and characteristics of current mainstream Visual Language Models (VLMs) in generating critiques for traditional Chinese painting. To achieve this, we first developed a quantitative framework for Chinese painting critique. This framework was constructed by extracting multi-dimensional evaluative features covering evaluative stance, feature focus, and commentary quality from human expert critiques using a zero-shot classification model. Based on these features, several representative critic personas were defined and quantified. This framework was then employed to evaluate selected VLMs such as Llama, Qwen, and Gemini. The experimental design involved persona-guided prompting to assess the VLM's ability to generate critiques from diverse perspectives. Our findings reveal the current performance levels, strengths, and areas for improvement of VLMs in the domain of art critique, offering insights into their potential and limitations in complex semantic understanding and content generation tasks. The code used for our experiments can be publicly accessed at: https://github.com/yha9806/VULCA-EMNLP2025.

[47] Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models

Sina J. Semnani,Jirayu Burapacheep,Arpandeep Khatua,Thanawan Atchariyachanvanit,Zheng Wang,Monica S. Lam

Main category: cs.CL

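TL;DR: Proposes CLAIRE, an agentic system that combines LLM reasoning with retrieval to uncover factual inconsistencies in Wikipedia. A user study and WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies, show that CLAIRE significantly improves editors' efficiency and confidence in detecting inconsistencies, that a measurable share of Wikipedia facts are contradictory, and that current automated methods leave substantial room for improvement.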

Details

Motivation: Wikipedia is a widely used open knowledge corpus whose accuracy is critical for large language models and retrieval-augmented generation systems, yet its factual inconsistencies have not been adequately measured or addressed, and effective methods for large-scale detection and correction are needed. Method: Proposes CLAIRE, an LLM-based agentic system that combines reasoning and retrieval to surface potentially inconsistent claims together with contextual evidence; combines it with human annotation to build the WIKICOLLIDE benchmark; and uses random sampling to estimate inconsistency rates in Wikipedia and downstream datasets. Result: In the user study, 87.5% of editors reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies; at least 3.3% of English Wikipedia facts are estimated to contradict another fact, and these contradictions propagate into 7.3% of FEVEROUS and 4.0% of AmbigQA examples; the best fully automated system reaches an AUROC of only 75.1%, leaving clear headroom. Conclusion: Contradictions are a measurable phenomenon in Wikipedia, and LLM-based systems such as CLAIRE are a practical tool for helping editors improve knowledge consistency at scale. Abstract: Wikipedia is the largest open knowledge corpus, widely used worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia, and how can we improve it? We focus on inconsistencies, a specific type of factual inaccuracy, and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time. Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact, with inconsistencies propagating into 7.3% of FEVEROUS and 4.0% of AmbigQA examples. Benchmarking strong baselines on this dataset reveals substantial headroom: the best fully automated system achieves an AUROC of only 75.1%. Our results show that contradictions are a measurable component of Wikipedia and that LLM-based systems like CLAIRE can provide a practical tool to help editors improve knowledge consistency at scale.

[48] Fin-ExBERT: User Intent based Text Extraction in Financial Context using Graph-Augmented BERT and trainable Plugin

Soumick Sarker,Abhijit Kumar Rai

Main category: cs.CL

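TL;DR: Proposes Fin-ExBERT, a lightweight, modular framework for extracting user intent-relevant sentences from financial customer service dialogues; it combines a LoRA-enhanced, domain-adapted BERT with a dynamic thresholding strategy to achieve efficient and accurate extraction from limited labeled data.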

Details

Motivation: Financial dialogue transcripts have an informal structure, heavy domain-specific vocabulary, and uneven intent density, which traditional sentence-level information extraction methods handle poorly. Method: Builds on a domain-adapted BERT model with LoRA for low-rank fine-tuning; uses a two-stage training strategy with progressive unfreezing and differential learning rates; and applies dynamic thresholding based on probability curvature (elbow detection) to make extraction robust under uncertainty. Result: Shows strong precision and F1 on real-world financial dialogue data, with interpretable output; the framework supports batched evaluation, visualization, and calibrated export, and suits downstream auditing and question-answering tasks. Conclusion: Fin-ExBERT extracts intent-relevant sentences from financial dialogues efficiently with limited labeled data and is practical to deploy. Abstract: Financial dialogue transcripts pose a unique challenge for sentence-level information extraction due to their informal structure, domain-specific vocabulary, and variable intent density. We introduce Fin-ExBERT, a lightweight and modular framework for extracting user intent-relevant sentences from annotated financial service calls. Our approach builds on a domain-adapted BERT (Bidirectional Encoder Representations from Transformers) backbone enhanced with LoRA (Low-Rank Adaptation) adapters, enabling efficient fine-tuning using limited labeled data. We propose a two-stage training strategy with progressive unfreezing: initially training a classifier head while freezing the backbone, followed by gradual fine-tuning of the entire model with differential learning rates. To ensure robust extraction under uncertainty, we adopt a dynamic thresholding strategy based on probability curvature (elbow detection), avoiding fixed cutoff heuristics. Empirical results show strong precision and F1 performance on real-world transcripts, with interpretable output suitable for downstream auditing and question-answering workflows. The full framework supports batched evaluation, visualization, and calibrated export, offering a deployable solution for financial dialogue mining.
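
The "dynamic thresholding strategy based on probability curvature (elbow detection)" in the Fin-ExBERT entry can be pictured with a small sketch. The version below sorts per-sentence probabilities in descending order, takes the point of largest discrete second difference as the elbow, and keeps every sentence scoring strictly above it. The curvature definition, the fallback cutoff for very short inputs, and the function names are assumptions; the abstract does not describe the exact procedure.

```python
from typing import List, Sequence


def elbow_threshold(probs: Sequence[float], fallback: float = 0.5) -> float:
    """Return the probability at the elbow of the descending-sorted score curve."""
    ranked = sorted(probs, reverse=True)
    if len(ranked) < 3:
        # Too few points to estimate curvature; use a fixed cutoff (assumption).
        return fallback
    best_index, best_curvature = 1, float("-inf")
    for i in range(1, len(ranked) - 1):
        # Discrete second difference: large and positive where a steep drop flattens out.
        curvature = ranked[i - 1] - 2 * ranked[i] + ranked[i + 1]
        if curvature > best_curvature:
            best_index, best_curvature = i, curvature
    return ranked[best_index]


def select_sentences(probs: Sequence[float], fallback: float = 0.5) -> List[int]:
    """Indices (in original order) of sentences scoring strictly above the elbow."""
    threshold = elbow_threshold(probs, fallback)
    return [i for i, p in enumerate(probs) if p > threshold]
```

For scores such as `[0.2, 0.95, 0.3, 0.88, 0.9, 0.25]`, the elbow falls at 0.3, so the three high-confidence sentences are kept without any hand-tuned cutoff. When all scores are equal the curve has no elbow and nothing is selected, which a production system would likely handle separately.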

[49] A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models

Wonje Jeung,Sangyeon Yoon,Yoonjun Cho,Dongjae Jeon,Sangwoo Shin,Hyesoo Hong,Albert No

Main category: cs.CL

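TL;DR: A2D is a token-level alignment defense for diffusion large language models (dLLMs) that withstands attacks under any decoding order and prefilling at any step, cutting DIJA attack success rates on safety benchmarks from over 80% to near zero.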

Details

Motivation: Because dLLMs support any-order generation, harmful content can appear at arbitrary positions, and response-level refusals are bypassed by template-based prefilling attacks such as DIJA, so a finer-grained, more robust safety alignment mechanism is needed. Method: Proposes A2D, which performs alignment training at the token level so that the model emits an [EOS] refusal signal as soon as harmful content arises; randomized masking provides robustness to any generation order and any-step attacks, and the method supports real-time monitoring with automatic termination. Result: On several safety benchmarks, A2D lowers DIJA attack success rates to 1.3% or even 0.0%; thresholding the [EOS] probability enables early rejection, with up to 19.3x faster safe termination. Conclusion: A2D gives dLLMs efficient, fine-grained safety control that blocks harmful generation in real time regardless of generation order, offering a reliable safety solution for any-order generation models. Abstract: Diffusion large language models (dLLMs) enable any-order generation, but this flexibility enlarges the attack surface: harmful spans may appear at arbitrary positions, and template-based prefilling attacks such as DIJA bypass response-level refusals. We introduce A2D (Any-Order, Any-Step Defense), a token-level alignment method that aligns dLLMs to emit an [EOS] refusal signal whenever harmful content arises. By aligning safety directly at the token-level under randomized masking, A2D achieves robustness to both any-decoding-order and any-step prefilling attacks under various conditions. It also enables real-time monitoring: dLLMs may begin a response but automatically terminate if unsafe continuation emerges. On safety benchmarks, A2D consistently prevents the generation of harmful outputs, slashing DIJA success rates from over 80% to near-zero (1.3% on LLaDA-8B-Instruct, 0.0% on Dream-v0-Instruct-7B), and thresholded [EOS] probabilities allow early rejection, yielding up to 19.3x faster safe termination.
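
The early-rejection idea (stop as soon as the [EOS] probability crosses a threshold) can be sketched as follows. This is a minimal illustration only: how A2D reads the [EOS] probability during denoising is not described in the abstract, so the per-step probability list, the function name `first_refusal_step`, and the default threshold of 0.5 are all hypothetical.

```python
from typing import Optional, Sequence


def first_refusal_step(eos_probs: Sequence[float],
                       threshold: float = 0.5) -> Optional[int]:
    """Return the first decoding step whose [EOS] probability reaches
    the threshold, or None if generation is never cut short.

    eos_probs is assumed to hold one monitored [EOS] probability per
    decoding step (a hypothetical interface).
    """
    for step, prob in enumerate(eos_probs):
        if prob >= threshold:
            return step
    return None


# Hypothetical trace: the refusal signal rises as unsafe content emerges.
trace = [0.01, 0.05, 0.62, 0.90]
print(first_refusal_step(trace))
```

Stopping at the first crossing rather than running every remaining step is what would yield the faster safe termination reported in the abstract.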

[50] Scaling Policy Compliance Assessment in Language Models with Policy Reasoning Traces

Joseph Marvin Imperial,Harish Tayyar Madabushi

Main category: cs.CL

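TL;DR: Introduces Policy Reasoning Traces (PRT), specialized generated reasoning chains that improve large language models on policy compliance assessment. PRTs significantly improve both open-weight and commercial models at inference time and training time, setting a new state of the art on HIPAA and GDPR policies.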

Details

Motivation: Policy compliance assessment currently relies on experts manually writing out detailed reasoning, which is costly and makes high-quality annotated data hard to obtain, so a cheaper, more efficient way to improve models' compliance judgments is needed. Method: Proposes Policy Reasoning Traces (PRT), generated structured reasoning chains that act as a bridge to strengthen an LLM's policy compliance judgments at inference time and training time, and validates them across different models. Result: Experiments show that PRTs significantly improve accuracy on HIPAA and GDPR compliance assessment, improve the models' ability to cite specific policy clauses, and influence the compliance decision process. Conclusion: PRT is an effective way to improve LLM performance on policy compliance tasks, combining accuracy with interpretability and suggesting a new direction for automated compliance review in practice. Abstract: Policy compliance assessment is a fundamental task of evaluating whether an input case strictly complies with a set of human-defined rules, more generally known as policies. In practice, human experts follow a systematic, step-by-step process to identify violations with respect to specific stipulations outlined in the policy. However, such documentation of gold-standard, expert-level reasoning processes is costly to acquire. In this paper, we introduce Policy Reasoning Traces (PRT), a form of specialized generated reasoning chains that serve as a reasoning bridge to improve an LLM's policy compliance assessment capabilities. Our empirical evaluations demonstrate that the use of PRTs for both inference-time and training-time scenarios significantly enhances the performance of open-weight and commercial models, setting a new state-of-the-art for HIPAA and GDPR policies. Beyond accuracy gains, we also highlight how PRTs can improve an LLM's ability to accurately cite policy clauses, as well as influence compliance decisions through their high utilization from the raw chains of thought.

[51] Learning to Reason in Structured In-context Environments with Reinforcement Learning

Peng Yu,Zeyuan Zhao,Shao Zhang,Luoyi Fu,Xinbing Wang,Ying Wen

Main category: cs.CL

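TL;DR: Proposes the Structured In-context Environment (SIE) framework, which automatically builds reasoning environments from large-scale structured data to address the poor scalability, weak generalization, and difficult verification of existing RL environments for language models. Experiments show that it improves LLM reasoning and generalization on both in-domain and out-of-domain reasoning tasks.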

Details

Motivation: Existing mathematical and coding environments rely on expert annotation and are hard to scale, while skills learned in game-based environments generalize poorly; an ideal reasoning environment that is scalable, supports generalizable reasoning, and is verifiable is missing. Method: Proposes the SIE (Structured In-context Environment) framework, which automatically generates reasoning environments from large-scale structured data, supports verifiable reasoning through explicit schemas and reasoning chains, and uses information-limited partial SIEs to study whether LLMs can infer missing information by exploring the environment. Result: Experiments show that SIE performs strongly on in-domain structured reasoning, transfers the learned compositional reasoning skills to out-of-domain mathematical and logical reasoning tasks, and retains robust reasoning and generalization in partial-information environments. Conclusion: SIE gives LLMs a scalable, generalization-friendly, and verifiable RL reasoning environment that markedly improves reasoning performance and cross-task transfer. Abstract: Large language models (LLMs) have achieved significant advancements in reasoning capabilities through reinforcement learning (RL) via environmental exploration. As the intrinsic properties of the environment determine the abilities that LLMs can learn, the environment plays an important role in the RL finetuning process. An ideal LLM reasoning environment should possess three core characteristics: scalability, generalizable reasoning, and verifiability. However, existing mathematical and coding environments are difficult to scale due to heavy reliance on expert annotation, while the skills learned in game-based environments are too specialized to generalize. To bridge this gap, we introduce the \textbf{S}tructured \textbf{I}n-context \textbf{E}nvironment (SIE) framework. SIE achieves scalability by automatically constructing reasoning environments from large-scale structured data, where the rich compositional patterns naturally support generalizable reasoning. Moreover, the explicit schemas and reasoning chains in structured data provide a foundation for rule-based verifiability. Experimental results show that the SIE framework not only achieves substantial improvements in in-domain structured reasoning, but also enables the learned compositional reasoning skills to generalize effectively to out-of-domain mathematical and logical reasoning tasks. We further explored learning in information-limited partial SIEs and found that LLMs can infer the missing information through exploring the environment, leading to robust reasoning improvements and generalization performance.

[52] C-Evolve: Consensus-based Evolution for Prompt Groups

Tiancheng Li,Yuhang Wang,Zhiyang Chen,Zijun Wang,Liyuan Ma,Guo-jun Qi

Main category: cs.CL

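TL;DR: Introduces Consensus-Evolve (C-Evolve), an evolutionary algorithm that finds a group of prompts whose outputs are aggregated by majority voting, achieving state-of-the-art performance across a range of tasks.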

Details

Motivation: To explore whether aggregating the results of multiple prompts into a consensus can push the capability boundary of AI systems built on closed-source models further. Method: Uses an island-based evolutionary algorithm to maintain population diversity, evaluates each prompt's contribution within groups through a voting score, and uses that score as the fitness for evolution. Result: Reaches state-of-the-art performance on tasks including HotpotQA and MATH: 70.67% on HotpotQA and 43.88% on IFBench with Qwen3-8B, and 95.33% on MATH with GPT-4.1-mini. Conclusion: C-Evolve effectively produces high-performing prompt groups and markedly improves closed-source models across diverse tasks. Abstract: Prompt evolution algorithms offer a powerful paradigm for enhancing AI systems based on closed-source models, yet few works explore whether aggregating results from multiple prompts to reach a consensus can further advance the system capability boundary. In this paper, we introduce Consensus-Evolve (C-Evolve), an evolutionary algorithm that discovers a group of prompts whose aggregated outputs after majority voting achieve optimal performance. More specifically, C-Evolve employs an island-based evolutionary algorithm to maintain population diversity, and prompts from distinct islands are selected to form groups to aggregate their outputs. The key difference from single individual evolution is a voting score, which evaluates each individual prompt's contribution within groups. We take this as the fitness score for evolution instead of individual performance. Consequently, C-Evolve is more likely to produce and maintain prompts with higher potential to form a high-performing group and eliminate low-performing ones, gradually improving the group performance after reaching consensus. Our method achieves state-of-the-art performance across a wide range of tasks, including both open-ended tasks like HotpotQA and closed-ended tasks like MATH. On Qwen3-8B, C-Evolve achieves 70.67% on HotpotQA and 43.88% on IFBench, which are 4.95% and 2.73% higher than GEPA, respectively. For GPT-4.1-mini, the accuracy on IFBench is further improved to 47.96% and reaches 95.33% in the MATH benchmark. These results demonstrate C-Evolve's competitive performance.
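
A toy sketch may help show why a group-level voting score differs from individual accuracy. The abstract does not define the voting score, so the definition used here (mean majority-vote accuracy over the groups that contain a prompt) and the names `majority_vote`, `group_accuracy`, and `voting_score` are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most common answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def group_accuracy(outputs: Dict[str, List[str]],
                   group: Tuple[str, ...],
                   gold: List[str]) -> float:
    """Accuracy of the majority-voted answers of one prompt group."""
    hits = 0
    for q, target in enumerate(gold):
        voted = majority_vote([outputs[p][q] for p in group])
        hits += voted == target
    return hits / len(gold)


def voting_score(prompt: str,
                 groups: List[Tuple[str, ...]],
                 outputs: Dict[str, List[str]],
                 gold: List[str]) -> float:
    """Hypothetical fitness: mean accuracy of the groups containing prompt."""
    member_of = [g for g in groups if prompt in g]
    return sum(group_accuracy(outputs, g, gold) for g in member_of) / len(member_of)


# Each prompt alone gets 2 of 3 questions right, but their errors differ.
outputs = {"a": ["x", "n", "z"], "b": ["x", "y", "w"], "c": ["q", "y", "z"]}
gold = ["x", "y", "z"]
print(voting_score("a", [("a", "b", "c")], outputs, gold))
```

In this toy case every prompt is individually imperfect, yet the group answers all three questions correctly after voting, which is the kind of complementarity a group-level fitness can reward.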

[53] Dual-Space Smoothness for Robust and Balanced LLM Unlearning

Han Yan,Zheyuan Liu,Meng Jiang

Main category: cs.CL

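TL;DR: Proposes PRISM, a unified framework that enforces dual-space smoothness in the representation and parameter spaces to make machine unlearning more robust and to balance unlearning metrics.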

Details

Motivation: State-of-the-art unlearning methods often suffer from catastrophic forgetting and metric imbalance, and are vulnerable to relearning and jailbreak attacks. Method: PRISM has two smoothness optimization stages: a representation-space stage that uses a robustly trained probe to defend against jailbreak attacks, and a parameter-space stage that decouples retain-forget gradient conflicts, reduces imbalance, and smooths the parameter space to mitigate relearning attacks. Result: Experiments on WMDP and MUSE show that PRISM outperforms existing methods under multiple attacks and achieves a better balance among key metrics. Conclusion: PRISM improves the robustness and multi-objective balance of machine unlearning, offering a new approach to privacy, copyright, and safety concerns. Abstract: With the rapid advancement of large language models, Machine Unlearning has emerged to address growing concerns around user privacy, copyright infringement, and overall safety. Yet state-of-the-art (SOTA) unlearning methods often suffer from catastrophic forgetting and metric imbalance, for example by over-optimizing one objective (e.g., unlearning effectiveness, utility preservation, or privacy protection) at the expense of others. In addition, small perturbations in the representation or parameter space can be exploited by relearn and jailbreak attacks. To address these challenges, we propose PRISM, a unified framework that enforces dual-space smoothness in representation and parameter spaces to improve robustness and balance unlearning metrics. PRISM consists of two smoothness optimization stages: (i) a representation-space stage that employs a robustly trained probe to defend against jailbreak attacks, and (ii) a parameter-space stage that decouples retain-forget gradient conflicts, reduces imbalance, and smooths the parameter space to mitigate relearning attacks. Extensive experiments on WMDP and MUSE, across conversational-dialogue and continuous-text settings, show that PRISM outperforms SOTA baselines under multiple attacks while achieving a better balance among key metrics.

[54] MedCritical: Enhancing Medical Reasoning in Small Language Models via Self-Collaborative Correction

Xinchun Su,Chunxu Luo,Yixuan Li,Weidong Yang,Lipeng Ma

Main category: cs.CL

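TL;DR: Proposes MedCritical, a two-stage framework in which a large model guides a small model that then refines itself iteratively. By extracting long-chain thought templates and applying self-learning DPO, it lowers training cost while improving small models on complex medical reasoning; the 7B model reaches SOTA on the CMExam benchmark.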

Details

Motivation: Small language models perform poorly on complex medical reasoning tasks, and knowledge distillation methods that depend on large models are costly, slow, and inefficient, so a more efficient, lower-cost training method is needed to improve small models' reasoning. Method: Proposes the two-stage MedCritical framework: the first stage extracts high-level and detailed long-chain thought templates from a large teacher model to guide the student model toward more complex reasoning; the second stage introduces direct preference optimization (DPO) based on self-iterative collaboration, letting the student model strengthen its reasoning through its own correction trajectories. Result: The MedCritical 7B model outperforms Taiyi and Huatuo-o1-7B by 3.04% and 10.12% respectively on the CMExam benchmark, a new SOTA among 7B-class small models, with results comparable to traditional knowledge distillation at lower cost. Conclusion: Through small-model self-iteration and DPO, MedCritical improves performance on complex medical reasoning and offers a feasible route to training strong small models cheaply. Abstract: In the field of medicine, complex reasoning tasks such as clinical diagnosis, treatment planning, and medical knowledge integration pose significant challenges, where small language models often underperform compared to large language models like GPT-4 and Deepseek. Recent knowledge distillation-based methods aim to address these issues through teacher-guided error correction, but this LLM-as-judge approach remains challenging in terms of cost, time, and efficiency. To circumvent this issue, we propose a novel two-stage framework, MedCritical, which uses a small language model fine-tuned by a large teacher model to play against itself. In the first stage, we extract high-level and detailed long-chain thought templates from the teacher model to guide the student model to generate more complex reasoning thoughts. In the second stage, we introduce direct preference optimization (DPO) through model self-iteration collaboration to enhance the reasoning ability of the student model by playing against the correction trajectory of the fine-tuned model during training. This model self-learning DPO approach teaches the student model to use its own error-driven insights to consolidate its skills and knowledge to solve complex problems, and achieves comparable results to traditional knowledge distillation methods using teacher models at a lower cost. Notably, our MedCritical 7B model outperforms the Taiyi and Huatuo-o1-7B models by 3.04% and 10.12% respectively on the CMExam benchmark, achieving new SOTA performance among 7B-class small models.
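
For readers unfamiliar with DPO, the standard per-pair objective can be written in a few lines. This sketch shows only the generic DPO loss; how MedCritical builds its preference pairs from self-correction trajectories is not specified in the abstract, and the argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# The loss falls as the policy favors the chosen response more strongly
# than the reference model does.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; it decreases as the policy shifts probability toward the preferred response.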

[55] Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

Junming Yang,Ning Xu,Biao Liu,Shiqi Qiao,Xin Geng

Main category: cs.CL

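TL;DR: Proposes the MetaAPO framework, which dynamically couples data generation with model training and uses a meta-learner to estimate the alignment gap, adaptively balancing online and offline data. It outperforms existing methods on several benchmarks while cutting online annotation costs by 42%.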

Details

Motivation: To address the distribution mismatch between offline preference data and the model policy; existing methods adapt poorly to the model's changing learning state. Method: Designs a lightweight meta-learner as an alignment gap estimator that guides targeted online sampling and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Result: Outperforms existing preference optimization methods on AlpacaEval 2, Arena-Hard, and MT-Bench, while reducing online annotation costs by 42%. Conclusion: MetaAPO mitigates the distribution mismatch, improves alignment, and lowers cost, showing good potential for practical use. Abstract: Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods attempt to reduce this gap using static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model's dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO employs a lightweight meta-learner, as an "alignment gap estimator", to evaluate the potential benefits of on-policy sampling in relation to offline data. This guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing online annotation costs by 42%.

[56] CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding

Xi Zhang,Zaiqiao Meng,Jake Lever,Edmond S. L. Ho

Main category: cs.CL

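TL;DR: Proposes CCD, a training-free and retrieval-free inference framework that reduces medical hallucinations of multimodal large language models in radiology by integrating structured clinical signals from task-specific radiology expert models.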

Details

Motivation: Multimodal large language models in radiology tend to produce clinically unsupported descriptions (medical hallucinations), largely because they are over-sensitive to clinical sections of the prompt, which harms the accuracy and reliability of their output. Method: Proposes Clinical Contrastive Decoding (CCD), a dual-stage contrastive mechanism that refines token-level logits during generation and integrates structured clinical signals from expert models, improving clinical fidelity without modifying the base MLLM. Result: Experiments on three datasets and multiple models show that CCD consistently improves radiology report generation; on MIMIC-CXR, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art models. Conclusion: CCD is a lightweight, generalizable solution that mitigates medical hallucinations of MLLMs in radiology and lets expert models and MLLMs work together effectively. Abstract: Multimodal large language models (MLLMs) have recently achieved remarkable progress in radiology by integrating visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accuracy and image-grounded outputs. Through empirical analysis, we find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely due to over-sensitivity to clinical sections. To address this, we introduce Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD introduces a dual-stage contrastive mechanism to refine token-level logits during generation, thereby enhancing clinical fidelity without modifying the base MLLM. Experiments on three datasets and multiple models demonstrate that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. Our approach provides a lightweight and generalisable solution for mitigating medical hallucinations, effectively bridging expert models and MLLMs in radiology.

[57] Guard Vector: Beyond English LLM Guardrails with Task-Vector Composition and Streaming-Aware Prefix SFT

Wonhyuk Lee,Youngchol Kim,Yunjin Park,Junhyung Moon,Dongyoung Jeong,Wanjin Park

Main category: cs.CL

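TL;DR: Proposes Guard Vector, which applies the parameter difference between a guardrail model and a pretrained language model to a target model to build a Target Guard Model (TGM), improving safety classification, supporting additional languages, and offering model portability without extra training or labels.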

Details

Motivation: To improve language model safety while scaling across models and languages and reducing dependence on labeled data and compute. Method: Proposes Guard Vector as a safety task vector and combines it with prefix-based fine-tuning and a single-token output classifier, trained and evaluated on streaming inputs, to build the Target Guard Model (TGM). Result: TGM outperforms existing guardrail models on several standard safety suites, extends to Chinese, Japanese, and Korean, is portable across the Llama and Gemma architectures, and gains throughput and lower latency from its single-token output design. Conclusion: Guard Vector is an efficient, extensible, streaming-friendly way to strengthen safety that reduces data and compute requirements and supports more responsible AI systems. Abstract: We introduce Guard Vector, a safety task vector computed as the parameter difference between a guardrail model (Guard Model) and a same-architecture pretrained language model. Composing this vector with a target language model yields a Target Guard Model (TGM). We then adapt TGM with a streaming-aware approach that combines prefix-based training and evaluation with a classifier that produces a single-token output. With this composition alone, TGM improves classification quality over established Guard Models across standard safety suites and enables language extensibility to Chinese, Japanese, and Korean, requiring neither additional training nor target language labels. It also demonstrates model portability across two widely used public guardrail backbones, Llama and Gemma. With prefix SFT (supervised fine-tuning), TGM preserves classification quality under streaming by aligning the behavior between prefix inputs and full-text inputs. The single-token output design increases throughput and reduces latency. Together, these components reduce data and compute requirements while promoting streaming-aware evaluation practices, thereby contributing to a more responsible AI ecosystem.
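
The task-vector composition described above reduces to per-parameter arithmetic. The sketch below uses plain Python lists in place of real weight tensors; the parameter names, the `scale` argument, and the toy values are assumptions for illustration, and all three models are assumed to share one architecture so that the keys and shapes line up.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def guard_vector(guard: Params, base: Params) -> Params:
    """Safety task vector: guard-model weights minus pretrained weights."""
    return {name: [g - b for g, b in zip(guard[name], base[name])]
            for name in guard}


def compose(target: Params, vector: Params, scale: float = 1.0) -> Params:
    """Add the (optionally scaled) task vector to a target model."""
    return {name: [t + scale * v for t, v in zip(target[name], vector[name])]
            for name in target}


# Toy weights standing in for full model state dicts.
base = {"layer.weight": [1.0, 2.0]}
guard = {"layer.weight": [1.5, 2.0]}
target = {"layer.weight": [0.25, -1.0]}

tgm = compose(target, guard_vector(guard, base))
print(tgm)
```

No gradient step is involved in this composition, which matches the abstract's claim that the Target Guard Model is obtained without additional training or target-language labels.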

[58] Train Once, Answer All: Many Pretraining Experiments for the Cost of One

Sebastian Bordt,Martin Pawelczyk

Main category: cs.CL

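TL;DR: Proposes running multiple pretraining experiments simultaneously within a single training run, making it cheap to study training, reasoning, and memorization in large models.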

Details

Motivation: The high compute cost of pretraining limits controlled experiments on large language models, so a more efficient approach is needed. Method: Runs multiple pretraining experiments in parallel during a single training run, demonstrating feasibility with a 1.5B-parameter model trained on 210B tokens. Result: Replicates several earlier findings on data contamination, poisoning, and memorization, and carries out new experiments on knowledge acquisition, mathematical reasoning, and watermarking; the experiments have little effect on overall training, and interaction effects are negligible. Conclusion: Running many experiments in parallel in one training run is feasible and efficient, enabling systematic scientific study of large models on a limited compute budget. Abstract: Recent work has demonstrated that controlled pretraining experiments are a powerful tool for understanding learning, reasoning, and memorization in large language models (LLMs). However, the computational cost of pretraining presents a significant constraint. To overcome this constraint, we propose to conduct multiple pretraining experiments simultaneously during a single training run. We demonstrate the feasibility of this approach by conducting ten experiments during the training of a 1.5B parameter model on 210B tokens. Although we only train a single model, we can replicate the results from multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking. For example, we dynamically update the training data until the model acquires a particular piece of knowledge. Remarkably, the influence of the ten experiments on the model's training dynamics and overall performance is minimal. However, interactions between different experiments may act as a potential confounder in our approach. We propose to test for interactions with continual pretraining experiments, finding them to be negligible in our setup. Overall, our findings suggest that performing multiple pretraining experiments in a single training run can enable rigorous scientific experimentation with large models on a compute budget.

Wenhang Shi,Yiren Chen,Shuqing Bian,Xinyi Zhang,Kai Tang,Pengfei Hu,Zhe Zhao,Wei Lu,Xiaoyong Du

Main category: cs.CL

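TL;DR: Proposes the GRACE framework, which achieves efficient and stable prompt optimization through gated refinement and adaptive compression, clearly outperforming existing methods on multiple tasks while greatly reducing computational overhead.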

Details

Motivation: Existing automatic prompt optimization methods struggle to generate improved prompts stably, get trapped in local optima easily, and are inefficient. Method: Proposes the GRACE framework, which combines gated refinement (a feedback regulation gate and an update rejection gate) with adaptive compression (distilling the prompt's core concepts and restructuring the optimization path when progress stalls), using controlled information loss to improve efficiency and stability. Result: On 11 tasks across three domains, GRACE achieves average relative improvements over state-of-the-art methods of 4.7% (BBH), 4.4% (domain-specific), and 2.7% (general NLP), using only 25% of the prompt generation budget. Conclusion: GRACE delivers efficient, stable prompt optimization with clear performance gains and lower compute cost, and is practical and extensible. Abstract: Prompt engineering is crucial for leveraging the full potential of large language models (LLMs). While automatic prompt optimization offers a scalable alternative to costly manual design, generating effective prompts remains challenging. Existing methods often struggle to stably generate improved prompts, leading to low efficiency, and overlook that prompt optimization easily gets trapped in local optima. Addressing this, we propose GRACE, a framework that integrates two synergistic strategies: Gated Refinement and Adaptive Compression, achieving Efficient prompt optimization. The gated refinement strategy introduces a feedback regulation gate and an update rejection gate, which refine update signals to produce stable and effective prompt improvements. When optimization stagnates, the adaptive compression strategy distills the prompt's core concepts, restructuring the optimization trace and opening new paths. By strategically introducing information loss through refinement and compression, GRACE delivers substantial gains in performance and efficiency. In extensive experiments on 11 tasks across three practical domains, including BIG-Bench Hard (BBH), domain-specific, and general NLP tasks, GRACE achieves significant average relative performance improvements of 4.7%, 4.4% and 2.7% over state-of-the-art methods, respectively. Further analysis shows that GRACE achieves these gains using only 25% of the prompt generation budget required by prior methods, highlighting its high optimization efficiency and low computational overhead. Our code is available at https://github.com/Eric8932/GRACE.

[60] Liaozhai through the Looking-Glass: On Paratextual Explicitation of Culture-Bound Terms in Machine Translation

Sherrie Shen,Weixuan Wang,Alexandra Birch

Main category: cs.CL

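TL;DR: Introduces the task of paratextual explicitation of culture-bound terms in machine translation, builds a dataset from professional translators' annotations, and evaluates large language models on it. LLM-generated paratexts improve reader comprehension but remain less effective than translator-authored ones, and translators themselves differ widely, suggesting that cultural mediation is open-ended.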

Details

Motivation: Existing machine translation methods handle culture-bound terms poorly and overlook the paratexts (such as footnotes and endnotes) that professional translators routinely use, so paratextual explicitation is needed to improve cultural transfer in translation. Method: Building on Genette's (1987) theory of paratexts, constructs a dataset of 560 expert-aligned paratexts from four English translations of Liaozhai; evaluates large language models (with and without reasoning traces) on paratext choice and content generation using intrinsic prompting and agentic retrieval. Result: Experiments show that the task is challenging: LLM-generated paratexts improve audience comprehension but are clearly less effective than those of professional translators, and statistical analysis shows that professional translators vary widely in their use of paratexts. Conclusion: Paratextual explicitation helps move machine translation beyond linguistic equivalence toward cultural communication, with potential applications in monolingual explanation and personalized adaptation, but cultural mediation is inherently open-ended rather than prescriptive. Abstract: The faithful transfer of contextually-embedded meaning continues to challenge contemporary machine translation (MT), particularly in the rendering of culture-bound terms--expressions or concepts rooted in specific languages or cultures, resisting direct linguistic transfer. Existing computational approaches to explicitating these terms have focused exclusively on in-text solutions, overlooking paratextual apparatus in the footnotes and endnotes employed by professional translators. In this paper, we formalize Genette's (1987) theory of paratexts from literary and translation studies to introduce the task of paratextual explicitation for MT. We construct a dataset of 560 expert-aligned paratexts from four English translations of the classical Chinese short story collection Liaozhai and evaluate LLMs with and without reasoning traces on choice and content of explicitation. Experiments across intrinsic prompting and agentic retrieval methods establish the difficulty of this task, with human evaluation showing that LLM-generated paratexts improve audience comprehension, though remain considerably less effective than translator-authored ones. Beyond model performance, statistical analysis reveals that even professional translators vary widely in their use of paratexts, suggesting that cultural mediation is inherently open-ended rather than prescriptive. Our findings demonstrate the potential of paratextual explicitation in advancing MT beyond linguistic equivalence, with promising extensions to monolingual explanation and personalized adaptation.

[61] Comparison of Scoring Rationales Between Large Language Models and Human Raters

Haowei Hua,Hong Jiao,Dan Song

Main category: cs.CL

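TL;DR: Examines how the rationales produced by human raters and large language models (LLMs) in automated scoring agree and differ. Using essays from a large-scale test, it evaluates the scoring accuracy of GPT-4o, Gemini, and other models, and compares rationale similarity and clustering patterns with cosine similarity and principal component analysis.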

Details

Motivation: LLMs are now widely used for automated scoring and for generating scoring rationales, but human and LLM scoring reasoning may be inconsistent; the causes need closer analysis to improve the interpretability and reliability of automated scoring systems. Method: Evaluates the scoring accuracy of GPT-4o, Gemini, and other LLMs with quadratic weighted kappa and normalized mutual information; measures the similarity between human and LLM rationales with cosine similarity; and explores clustering patterns with principal component analysis on rationale embeddings. Result: The LLMs score at a level close to human raters; cosine similarity shows some differences between LLM and human rationales; and principal component analysis reveals different clustering structures in human and LLM reasoning patterns. Conclusion: LLMs show strong reasoning ability in automated scoring, but the way they generate rationales still differs systematically from humans; analyzing these differences can improve the transparency and trustworthiness of automated scoring systems. Abstract: Advances in automated scoring are closely aligned with advances in machine-learning and natural-language-processing techniques. With recent progress in large language models (LLMs), the use of ChatGPT, Gemini, Claude, and other generative-AI chatbots for automated scoring has been explored. Given their strong reasoning capabilities, LLMs can also produce rationales to support the scores they assign. Thus, evaluating the rationales provided by both human and LLM raters can help improve the understanding of the reasoning that each type of rater applies when assigning a score. This study investigates the rationales of human and LLM raters to identify potential causes of scoring inconsistency. Using essays from a large-scale test, the scoring accuracy of GPT-4o, Gemini, and other LLMs is examined based on quadratic weighted kappa and normalized mutual information. Cosine similarity is used to evaluate the similarity of the rationales provided. In addition, clustering patterns in rationales are explored using principal component analysis based on the embeddings of the rationales. The findings of this study provide insights into the accuracy and "thinking" of LLMs in automated scoring, helping to improve the understanding of the rationales behind both human scoring and LLM-based automated scoring.
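
Two of the metrics named above, quadratic weighted kappa and cosine similarity, are easy to state in code. The sketch below is a plain-Python reference version under assumed inputs (integer scores on a known scale, and rationale embeddings as float lists); it is not the study's evaluation script.

```python
import math
from typing import List, Sequence


def quadratic_weighted_kappa(rater_a: Sequence[int], rater_b: Sequence[int],
                             min_score: int, max_score: int) -> float:
    """Agreement between two raters, penalizing large disagreements more.

    Assumes at least two score levels and some spread in the ratings;
    otherwise the denominator is zero.
    """
    n = max_score - min_score + 1
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    numerator = denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


# Hypothetical human and LLM scores on a 1-4 scale.
print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 3], 1, 4))
```

Kappa is 1 for perfect agreement and falls below 0 when agreement is worse than chance, so it suits ordinal essay scores better than raw accuracy.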

[62] Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models

Rajaa El Hamdani,Samy Haffoudhi,Nils Holzenberger,Fabian Suchanek,Thomas Bonald,Fragkiskos D. Malliaros

Main category: cs.CL

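TL;DR: Proposes Retrieval-Constrained Decoding (RCD), a decoding strategy that limits the variety of surface forms a language model can output so that its factual knowledge is assessed more accurately. The results indicate that existing methods underestimate what models know.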

Details

Motivation: Language models hold substantial factual knowledge, but because they can express an answer in many surface forms, strict conventional evaluation often marks correct answers wrong and underestimates their true knowledge. Method: Proposes the RCD decoding strategy, which restricts model outputs to unique canonical surface forms, and builds YAGO-QA, a dataset of 19,137 general knowledge questions, for evaluation. Result: On open-source LMs, RCD raises F1 substantially: Llama-3.1-70B improves from 32.3% to 46.0%, and Llama-3.1-8B reaches 33.0%, exceeding the larger model under standard decoding. Conclusion: RCD reveals language models' parametric knowledge more accurately, indicating that current evaluation is too strict and limits estimates of models' true knowledge capacity. Abstract: Language models (LMs) encode substantial factual knowledge, but often produce answers judged as incorrect. We hypothesize that many of these answers are actually correct, but are expressed in alternative surface forms that are dismissed due to an overly strict evaluation, leading to an underestimation of models' parametric knowledge. We propose Retrieval-Constrained Decoding (RCD), a decoding strategy that restricts model outputs to unique surface forms. We introduce YAGO-QA, a dataset of 19,137 general knowledge questions. Evaluating open-source LMs from 135M to 70B parameters, we show that standard decoding undervalues their knowledge. For instance, Llama-3.1-70B scores only 32.3% F1 with vanilla decoding but 46.0% with RCD. Similarly, Llama-3.1-8B reaches 33.0% with RCD, outperforming the larger model under vanilla decoding. We publicly share the code and dataset at https://github.com/Rajjaa/disambiguated-LLM.
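
One common way to restrict decoding to a fixed set of surface forms is to walk a token trie and pick only among tokens that continue an allowed form. The sketch below illustrates that general idea with a toy scoring function; the abstract does not describe RCD's actual mechanism, so the trie, the greedy choice, and the names `build_trie` and `constrained_greedy_decode` are assumptions.

```python
from typing import Callable, Dict, List, Sequence

END = "<end>"  # marks the end of an allowed surface form


def build_trie(forms: Sequence[Sequence[str]]) -> Dict:
    """Build a nested-dict trie over the allowed token sequences."""
    root: Dict = {}
    for form in forms:
        node = root
        for token in form:
            node = node.setdefault(token, {})
        node[END] = {}
    return root


def constrained_greedy_decode(score: Callable[[List[str], str], float],
                              trie: Dict) -> List[str]:
    """Greedy decoding that only considers tokens allowed by the trie.

    score(prefix, token) stands in for a language model's next-token
    score. Assumes the trie holds at least one form.
    """
    prefix: List[str] = []
    node = trie
    while True:
        token = max(node, key=lambda t: score(prefix, t))
        if token == END:
            return prefix
        prefix.append(token)
        node = node[token]


# Toy scores: the model's favorite token "NYC" is not an allowed form.
toy = {"NYC": 5.0, "New": 2.0, "Paris": 1.0, "York": 1.0, "City": 0.5, END: 0.7}
trie = build_trie([["New", "York"], ["New", "York", "City"], ["Paris"]])
print(constrained_greedy_decode(lambda prefix, t: toy[t], trie))
```

Because only canonical forms can be produced, a string-match evaluation no longer penalizes the model for choosing an alias such as "NYC".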

Xuanming Zhang,Yuxuan Chen,Min-Hsuan Yeh,Yixuan Li

Main category: cs.CL

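TL;DR: Proposes Cognition-of-Thought (CooT), a decoding-time framework that strengthens LLM alignment through an explicit cognitive self-monitoring loop, detecting and correcting potentially unsafe behavior dynamically during inference without retraining the model.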

Details

Motivation: Existing alignment methods usually embed safety implicitly in model weights, which makes the controls static and hard to modify, so a more flexible, auditable, dynamic alignment mechanism is needed. Method: CooT couples a Generator with a cognitive Perceiver that continuously monitors the generated sequence against a hierarchy of principles; when a violation is detected, it rolls back the generation and regenerates under injected guidance that combines universal social priors with context-specific warnings. Result: Experiments across multiple benchmarks and model families show that CooT consistently improves safety and social reasoning performance. Conclusion: CooT turns alignment from a fixed model property into an explicit, dynamic, auditable process during inference, supports flexible policy updates, and offers a new paradigm for safe generation with large models. Abstract: Large language models (LLMs) excel at complex reasoning but can still exhibit harmful behaviors. Current alignment strategies typically embed safety into model weights, making these controls implicit, static, and difficult to modify. This paper introduces Cognition-of-Thought (CooT), a novel decoding-time framework that equips LLMs with an explicit cognitive self-monitoring loop. CooT couples a standard text Generator with a cognitive Perceiver that continuously monitors the unfolding sequence. The Perceiver uses a structured, precedence-based hierarchy of principles (e.g., safety over obedience) to detect potential misalignments as they arise. When violations are flagged, CooT intervenes by rolling back the generation to the point of error and regenerating under injected guidance that combines universal social priors with context-specific warnings. CooT thus transforms alignment from a fixed property into an explicit, dynamic, and auditable process active during inference, allowing for flexible policy updates without retraining the model. Extensive experiments across multiple benchmarks and model families confirm that CooT consistently improves safety and social reasoning performance.

[64] Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review

Sydney Peters,Nan Zhang,Hong Jiao,Ming Li,Tianyi Zhou,Robert Lissitz

Main category: cs.CL

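TL;DR: Reviews 37 studies on text-based automated item difficulty prediction, comparing traditional machine learning with modern language models in large-scale assessments. Transformer models capture linguistic patterns without manual feature engineering and predict difficulty accurately, showing broad promise.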

Details

Motivation: Traditional item difficulty modeling relies on field testing and classical measurement theory, which is slow and expensive; efficient, low-cost automated prediction methods are needed. Method: Systematically reviews 37 relevant articles published through May 2025, extracting and analyzing each study's dataset, difficulty parameter, domain, item type, model, features, and performance metrics. Result: Although classic machine learning models keep an interpretability advantage, transformer-based language models capture syntactic and semantic features without manual features and predict well, with RMSE as low as 0.165, Pearson correlation as high as 0.87, and accuracy as high as 0.806. Conclusion: Text-based automated difficulty prediction has high potential; future work should explore model interpretability, cross-disciplinary applications, and paths to integration in operational assessments. Abstract: Item difficulty plays a crucial role in test performance, interpretability of scores, and equity for all test-takers, especially in large-scale assessments. Traditional approaches to item difficulty modeling rely on field testing and classical test theory (CTT)-based item analysis or item response theory (IRT) calibration, which can be time-consuming and costly. To overcome these challenges, text-based approaches leveraging machine learning and language models have emerged as promising alternatives. This paper reviews and synthesizes 37 articles on automated item difficulty prediction in large-scale assessment settings published through May 2025. For each study, we delineate the dataset, difficulty parameter, subject domain, item type, number of items, training and test data split, input, features, model, evaluation criteria, and model performance outcomes. Results showed that although classic machine learning models remain relevant due to their interpretability, state-of-the-art language models, using both small and large transformer-based architectures, can capture syntactic and semantic patterns without the need for manual feature engineering. Uniquely, model performance outcomes were summarized to serve as a benchmark for future research and overall, text-based methods have the potential to predict item difficulty with root mean square error (RMSE) as low as 0.165, Pearson correlation as high as 0.87, and accuracy as high as 0.806. The review concludes by discussing implications for practice and outlining future research directions for automated item difficulty modeling.
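
The two regression metrics quoted in this review, RMSE and Pearson correlation, can be computed as follows. The sketch uses made-up difficulty values purely to show the calculation; it does not reproduce any number from the reviewed studies.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean square error between predicted and calibrated difficulty."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted vs. calibrated item difficulties.
predicted = [0.2, 0.5, 0.9, 1.4]
calibrated = [0.1, 0.6, 1.0, 1.2]
print(rmse(predicted, calibrated), pearson(predicted, calibrated))
```

RMSE depends on the scale of the difficulty parameter while Pearson correlation does not, so values reported on different difficulty scales are not directly comparable through RMSE alone.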

[65] The Impact of Role Design in In-Context Learning for Large Language Models

Hamidreza Rouzegar,Masoud Makrehchi

Main category: cs.CL

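TL;DR: Studies how role design affects the performance of large language models (GPT-3.5, GPT-4o, Llama2-7b, and Llama2-13b) in zero-shot and few-shot settings, finding that role-based prompt structure helps on tasks such as sentiment analysis, text classification, question answering, and math reasoning.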

Details

Motivation: Prompt engineering has been widely studied, but the effect of role design within prompts remains underexplored. This work aims to fill that gap by systematically analyzing how different role configurations affect LLMs' in-context learning. Method: Evaluates GPT-3.5, GPT-4o, Llama2-7b, and Llama2-13b in zero-shot and few-shot settings on multiple datasets, comparing how prompts with different role configurations affect sentiment analysis, text classification, question answering, and math reasoning. Result: Well-designed role-based prompt structures notably improve task performance, especially in zero-shot and few-shot settings. Conclusion: Role-based prompt design is an effective prompt engineering strategy that strengthens LLMs' in-context learning and has broad potential. Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to generate predictions based on prompts without additional fine-tuning. While prompt engineering has been widely studied, the impact of role design within prompts remains underexplored. This study examines the influence of role configurations in zero-shot and few-shot learning scenarios using GPT-3.5 and GPT-4o from OpenAI and Llama2-7b and Llama2-13b from Meta. We evaluate the models' performance across datasets, focusing on tasks like sentiment analysis, text classification, question answering, and math reasoning. Our findings suggest the potential of role-based prompt structuring to enhance LLM performance.

[66] AraS2P: Arabic Speech-to-Phonemes System

Bassam Matar,Mohamed Fayed,Ayman Khalafallah

Main category: cs.CL

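TL;DR: This paper presents AraS2P, an Arabic speech-to-phonemes system that ranked first in the Iqra'Eval 2025 Shared Task. The system adapts Wav2Vec2-BERT with a two-stage training strategy: task-adaptive continued pretraining on large-scale Arabic speech-phoneme datasets, followed by fine-tuning on the official task data augmented with diverse recitations synthesized by XTTS-v2. The results indicate that phoneme-aware pretraining combined with targeted augmentation yields strong mispronunciation detection performance.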

Details

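Motivation: Improving phoneme-level mispronunciation detection in Arabic speech, particularly in high-precision settings such as religious recitation, calls for more effective pretraining and data augmentation. Method: Adapt Wav2Vec2-BERT in two stages. The first stage performs task-adaptive continued pretraining on large-scale Arabic speech-phoneme data generated with the MSA Phonetiser. The second stage fine-tunes on the official shared task data, augmented with XTTS-v2-synthesized recitations that vary Ayat segments, speaker embeddings, and textual perturbations to simulate possible human errors. Result: AraS2P ranked first on the official Iqra'Eval 2025 leaderboard, supporting the effectiveness of phoneme-aware pretraining combined with targeted augmentation. Conclusion: Combining phoneme-aware pretraining with augmentation that simulates realistic errors improves speech-to-phoneme systems, especially for high-precision mispronunciation detection. Abstract: This paper describes AraS2P, our speech-to-phonemes system submitted to the Iqra'Eval 2025 Shared Task. We adapted Wav2Vec2-BERT via a two-stage training strategy. In the first stage, task-adaptive continued pretraining was performed on large-scale Arabic speech-phonemes datasets, which were generated by converting the Arabic text using the MSA Phonetiser. In the second stage, the model was fine-tuned on the official shared task data, with additional augmentation from XTTS-v2-synthesized recitations featuring varied Ayat segments, speaker embeddings, and textual perturbations to simulate possible human errors. The system ranked first on the official leaderboard, demonstrating that phoneme-aware pretraining combined with targeted augmentation yields strong performance in phoneme-level mispronunciation detection.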

[67] From Human Annotation to Automation: LLM-in-the-Loop Active Learning for Arabic Sentiment Analysis

Dania Refai,Alaa Dalaq,Doaa Dalaq,Irfan Ahmad

Main category: cs.CL

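TL;DR: This paper proposes an active learning framework for Arabic sentiment analysis that uses large language models (LLMs) to assist annotation, substantially reducing human labeling cost while maintaining high performance.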

Details

Motivation: Arabic sentiment analysis is held back by the lack of large, high-quality labeled datasets, and little work has explored active learning or LLM-assisted annotation for Arabic. Method: Evaluate an active learning framework with several deep learning architectures (LSTM, GRU, and RNN) on three benchmark datasets (Hunger Station, AJGT, and MASAC), and compare human labeling with labeling assisted by five LLMs (GPT-4o, Claude 3 Sonnet, Gemini 2.5 Pro, DeepSeek Chat, and LLaMA 3 70B Instruct). Result: LLM-assisted active learning matches or exceeds human labeling on several datasets. For example, on Hunger Station an LSTM trained with GPT-4o-generated labels reaches 93% accuracy with only 450 labeled samples; on MASAC, DeepSeek Chat reaches 82% accuracy with 650 samples, on par with human labeling. Conclusion: LLM-assisted active learning can lower annotation cost for Arabic sentiment analysis while maintaining high accuracy, and has practical potential. Abstract: Natural language processing (NLP), particularly sentiment analysis, plays a vital role in areas like marketing, customer service, and social media monitoring by providing insights into user opinions and emotions. However, progress in Arabic sentiment analysis remains limited due to the lack of large, high-quality labeled datasets. While active learning has proven effective in reducing annotation efforts in other languages, few studies have explored it in Arabic sentiment tasks. Likewise, the use of large language models (LLMs) for assisting annotation and comparing their performance to human labeling is still largely unexplored in the Arabic context. In this paper, we propose an active learning framework for Arabic sentiment analysis designed to reduce annotation costs while maintaining high performance. We evaluate multiple deep learning architectures, specifically long short-term memory (LSTM), gated recurrent units (GRU), and recurrent neural networks (RNN), across three benchmark datasets: Hunger Station, AJGT, and MASAC, encompassing both modern standard Arabic and dialectal variations. Additionally, two annotation strategies are compared: human labeling and LLM-assisted labeling. Five LLMs are evaluated as annotators: GPT-4o, Claude 3 Sonnet, Gemini 2.5 Pro, DeepSeek Chat, and LLaMA 3 70B Instruct. For each dataset, the best-performing LLM was used: GPT-4o for Hunger Station, Claude 3 Sonnet for AJGT, and DeepSeek Chat for MASAC. Our results show that LLM-assisted active learning achieves competitive or superior performance compared to human labeling. For example, on the Hunger Station dataset, the LSTM model achieved 93% accuracy with only 450 labeled samples using GPT-4o-generated labels, while on the MASAC dataset, DeepSeek Chat reached 82% accuracy with 650 labeled samples, matching the accuracy obtained through human labeling.

[68] On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization

Janvijay Singh,Austin Xu,Yilun Zhou,Yefan Zhou,Dilek Hakkani-Tur,Shafiq Joty

Main category: cs.CL

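TL;DR: This paper studies, in the math domain, how finetuned judge models fare on future proofing, backward compatibility, and question generalization. Future proofing proves challenging while backward compatibility is relatively easy; continual learning gives a more balanced adaptation to shifts between older and newer response distributions; and existing judges still degrade on unseen questions.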

Details

Motivation: Finetuned judge models face evolving generator models and new questions once deployed, yet their long-term effectiveness has not been systematically evaluated. This motivates a study of their future proofing, backward compatibility, and question generalization. Method: Within a unified framework that varies train and test distributions, systematically evaluate finetuned judges in the math domain using three SFT- and DPO-based finetuning algorithms and three different base models. Result: Future proofing is difficult for most models, while backward compatibility is relatively easy, with DPO-trained models consistently improving performance. Continual learning adapts to distribution shifts in a more balanced way than training solely on stronger or weaker responses. All models degrade on unseen questions. Conclusion: Current finetuned judges are limited when facing future generator models and new questions. Combining DPO with continual learning can improve robustness, and better generalization to unseen questions is needed for effective real-world deployment. Abstract: The LLM-as-a-judge paradigm is widely used in both evaluating free-text model responses and reward modeling for model alignment and finetuning. Recently, finetuning judges with judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges, as the former achieves better performance with smaller model sizes while being more robust to common biases. However, the standard evaluation ignores several practical concerns of finetuned judges regarding their real world deployment. In this paper, we identify and formalize three aspects that affect the shelf life of these judges: future proofing and backward compatibility -- how well judges finetuned on responses by today's generator models perform on responses by future models or past models, as well as question generalization -- how well judges generalize to unseen questions at test time. We study these three aspects in the math domain under a unified framework with varying train and test distributions, three SFT- and DPO-based finetuning algorithms and three different base models. Experiments suggest that future-proofing is challenging for most models, while backward compatibility is relatively easy, with DPO-trained models consistently improving performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models show some degree of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators.

[69] Automatic Speech Recognition for Greek Medical Dictation

Vardis Georgilas,Themos Stafylakis

Main category: cs.CL

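TL;DR: This paper presents a domain-specific system for Greek medical speech transcription that combines automatic speech recognition with a text correction model to improve the automation of medical documentation.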

Details

Motivation: Reducing the manual documentation burden on healthcare professionals and improving workflow efficiency requires a speech transcription system suited to the Greek medical domain. Method: Combine automatic speech recognition with a text correction model, using both acoustic and textual modeling, and adapt to the Greek medical context through domain-specific fine-tuning. Result: The system handles Greek medical terminology and linguistic variation better and produces more accurate and coherent transcriptions. Conclusion: The system contributes to practical language technology for the Greek healthcare sector and to more automated, reliable medical documentation. Abstract: Medical dictation systems are essential tools in modern healthcare, enabling accurate and efficient conversion of speech into written medical documentation. The main objective of this paper is to create a domain-specific system for Greek medical speech transcriptions. The ultimate goal is to assist healthcare professionals by reducing the overload of manual documentation and improving workflow efficiency. Towards this goal, we develop a system that combines automatic speech recognition techniques with a text correction model, allowing better handling of domain-specific terminology and linguistic variations in Greek. Our approach leverages both acoustic and textual modeling to create more realistic and reliable transcriptions. We focused on adapting existing language and speech technologies to the Greek medical context, addressing challenges such as complex medical terminology and linguistic inconsistencies. Through domain-specific fine-tuning, our system achieves more accurate and coherent transcriptions, contributing to the development of practical language technologies for the Greek healthcare sector.

[70] Towards Efficient CoT Distillation: Self-Guided Rationale Selector for Better Performance with Fewer Rationales

Jianzhi Yan,Le Liu,Youcheng Pan,Shiwei Chen,Yang Xiang,Buzhou Tang

Main category: cs.CL

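TL;DR: Proposes Model-Oriented Rationale Selection Distillation (MoRSD), which selects high-quality rationales and introduces a Rationale Difficulty (RD) metric. It improves the reasoning of small language models by 4.6% on average across seven datasets while using fewer rationales than the baseline.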

Details

Motivation: Existing chain-of-thought distillation methods focus on data quantity and neglect rationale quality, which can pass incorrect or noisy information to the student model and limit its reasoning gains. Method: Propose the MoRSD framework, which selects high-quality rationales in a model-oriented way, together with a Rationale Difficulty (RD) metric that measures the student model's ability to generate the correct answer under a given rationale. Result: An average improvement of 4.6% on seven datasets over three tasks, using fewer rationales, which supports the value of a small set of high-quality rationales over a large set of low-quality ones. Conclusion: Selecting high-quality rationales is important for improving the reasoning of small language models, and MoRSD offers a feasible approach to efficient CoT distillation. Abstract: Chain-of-thought (CoT) distillation aims to enhance small language models' (SLMs) reasoning by transferring multi-step reasoning capability from the larger teacher models. However, existing work underestimates rationale quality, focusing primarily on data quantity, which may transfer noisy or incorrect information to the student model. To address the above issues, we propose \textbf{M}odel-\textbf{O}riented \textbf{R}ationale \textbf{S}election \textbf{D}istillation (MoRSD), which can discern and select high-quality rationales for distillation to improve performance further. We further propose a Rationale Difficulty (RD) metric to measure the ability of the student model to generate the correct answer under a given rationale. Compared to the baseline, we achieved 4.6$\%$ average improvement on seven datasets over three tasks, using fewer rationales by controlling their accuracy, diversity, and difficulty. Our results reveal that a small portion of the high-quality rationales can enhance the reasoning ability of student models more than the entire dataset. Our method promises to be a possible solution for efficient CoT distillation. Our code will be released at https://github.com/Leon221220/MoRSD.

[71] Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks

Kevin Frank,Anmol Gulati,Elias Lumer,Sindy Campagna,Vamse Kumar Subbiah

Main category: cs.CL

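TL;DR: This paper introduces Jackal, a large-scale natural-language-to-JQL benchmark with 100,000 natural language requests paired with executable JQL queries validated on a live Jira instance. The authors evaluate 23 LLMs on the Jackal-5K subset and find that the best model (Gemini 2.5 Pro) reaches only 60.3% average execution accuracy, showing the limits of current models on enterprise text-to-query tasks.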

Details

Motivation: The lack of an open, real-world, execution-based benchmark for mapping natural language to JQL has limited research in this area, so a realistic, reproducible, execution-validated text-to-JQL benchmark is needed. Method: Build the Jackal dataset of 100,000 natural language requests paired with valid JQL queries, validated by execution on a live Jira instance with over 200,000 issues. Define four types of user request: Long NL, Short NL, Semantically Similar, and Semantically Exact. Release the dataset, an execution-based scoring toolkit, and a Jira snapshot. Evaluate 23 LLMs on the Jackal-5K subset using execution accuracy, exact match, and canonical exact match. Result: On Jackal-5K, the best model, Gemini 2.5 Pro, reaches 60.3% average execution accuracy. Performance differs sharply by request type: Long NL (86.0%), Short NL (35.7%), Semantically Similar (22.7%), and Semantically Exact (99.3%). Models do poorly on short and semantically similar requests. Conclusion: Current LLMs still fall short at turning natural language into executable JQL, especially for requests that are brief or semantically close but not literal matches. Jackal provides an open, realistic, execution-validated benchmark for future work on enterprise text-to-query systems. Abstract: Enterprise teams rely on the Jira Query Language (JQL) to retrieve and filter issues from Jira. Yet, to our knowledge, there is no open, real-world, execution-based benchmark for mapping natural language queries to JQL. We introduce Jackal, a novel, large-scale text-to-JQL benchmark comprising 100,000 natural language (NL) requests paired with validated JQL queries and execution-based results on a live Jira instance with over 200,000 issues. To reflect real-world usage, each JQL query is associated with four types of user requests: (i) Long NL, (ii) Short NL, (iii) Semantically Similar, and (iv) Semantically Exact. We release Jackal, a corpus of 100,000 text-to-JQL pairs, together with an execution-based scoring toolkit, and a static snapshot of the evaluated Jira instance for reproducibility. We report text-to-JQL results on 23 Large Language Models (LLMs) spanning parameter sizes, open and closed source models, across execution accuracy, exact match, and canonical exact match. In this paper, we report results on Jackal-5K, a 5,000-pair subset of Jackal. On Jackal-5K, the best overall model (Gemini 2.5 Pro) achieves only 60.3% execution accuracy averaged equally across four user request types. Performance varies significantly across user request types: (i) Long NL (86.0%), (ii) Short NL (35.7%), (iii) Semantically Similar (22.7%), and (iv) Semantically Exact (99.3%). By benchmarking LLMs on their ability to produce correct and executable JQL queries, Jackal exposes the limitations of current state-of-the-art LLMs and sets a new, execution-based challenge for future research in Jira enterprise data.

[72] LLM Hallucination Detection: HSAD

JinXin Li,Gang Tu,JunJie Hu

Main category: cs.CL

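TL;DR: This paper proposes HSAD, a hallucination detection method based on frequency-domain analysis of hidden-layer temporal signals. It models the LLM's reasoning as a cognitive process and uses the Fast Fourier Transform to extract spectral features, identifying hallucinations in generated content and addressing the limits of existing methods in knowledge coverage and in detecting reasoning biases.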

Details

Motivation: Existing hallucination detection methods are either constrained by knowledge coverage or struggle to capture biases that arise during reasoning, so a more effective way to detect hallucinations during LLM generation is needed. Method: Treat the LLM's reasoning as a cognitive journey that unfolds over time and use hidden-layer temporal signals to simulate human signal perception in a deception-detection scenario. Apply the Fast Fourier Transform (FFT) to map these signals into the frequency domain, build spectral features that capture reasoning anomalies, and design a hallucination detection algorithm on top of them. Result: Experiments indicate that the spectral features reflect anomalies in the reasoning process and that HSAD achieves higher detection accuracy and robustness than existing methods. Conclusion: By combining reasoning-process modeling with frequency-domain feature extraction, HSAD improves hallucination detection and offers a new direction for reliable deployment of LLMs in critical scenarios. Abstract: Although Large Language Models have demonstrated powerful capabilities in a wide range of tasks such as language understanding and code generation, the frequent occurrence of hallucinations during the generation process has become a significant impediment to their deployment in critical application scenarios. Current mainstream hallucination detection methods rely on factual consistency verification or static hidden layer features. The former is constrained by the scope of knowledge coverage, while the latter struggles to capture reasoning biases during the inference process. To address these issues, and inspired by signal analysis methods in cognitive neuroscience, this paper proposes a hallucination detection method based on the frequency-domain analysis of hidden layer temporal signals, named HSAD (\textbf{H}idden \textbf{S}ignal \textbf{A}nalysis-based \textbf{D}etection). First, by treating the LLM's reasoning process as a cognitive journey that unfolds over time, we propose modeling and simulating the human process of signal perception and discrimination in a deception-detection scenario through hidden layer temporal signals. Next, the Fast Fourier Transform is applied to map these temporal signals into the frequency domain to construct spectral features, which are used to capture anomalies that arise during the reasoning process; analysis experiments on these spectral features have proven the effectiveness of this approach. Finally, a hallucination detection algorithm is designed based on these spectral features to identify hallucinations in the generated content. By effectively combining the modeling of the reasoning process with frequency-domain feature extraction, the HSAD method overcomes the limitations of existing approaches in terms of knowledge coverage and the detection of reasoning biases, demonstrating higher detection accuracy and robustness.
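The frequency-domain step described above can be illustrated with a small sketch. It assumes a hidden-layer temporal signal has already been reduced to one real number per generated token; how HSAD actually extracts that signal is not stated in the abstract. For self-containment the sketch uses a naive discrete Fourier transform, which computes the same spectrum as an FFT but more slowly. The `high_frequency_ratio` feature is a hypothetical summary statistic, not one of the paper's features.

```python
import cmath
import math


def spectral_magnitudes(signal):
    """Magnitude spectrum of a real-valued sequence via a naive DFT."""
    n = len(signal)
    magnitudes = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coeff))
    return magnitudes


def high_frequency_ratio(signal):
    """Share of spectral energy outside the DC bin (hypothetical feature)."""
    energy = [m * m for m in spectral_magnitudes(signal)]
    total = sum(energy)
    if total == 0:
        return 0.0
    return sum(energy[1:]) / total
```

A flat signal puts all of its energy in the DC bin, while a rapidly oscillating one moves energy to higher bins. A detector in this style would feed such spectral features to a classifier.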

[73] Timber: Training-free Instruct Model Refining with Base via Effective Rank

Taiqiang Wu,Runming Yang,Tao Liu,Jiahao Wang,Zenan Xu,Ngai Wong

Main category: cs.CL

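TL;DR: This paper proposes Timber, a simple and effective training-free method that partially reverts an Instruct model toward its paired Base model through subtle yet targeted refinement of the weight deltas, enhancing exploration while preserving exploitation.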

Details

Motivation: Post-training is widely considered superficial: it improves a model's exploitation capabilities but limits its exploration. This paper aims to address that trade-off. Method: Analyze the effect of post-training at the weight level and propose Timber, which refines the weight deltas so that the Instruct model partially reverts toward the Base model, improving exploration. Result: Experiments on the Llama and Qwen series show that Timber consistently improves vanilla Instruct models, particularly on Pass@k. Conclusion: The paper offers weight-level insight into the post-training stage and a practical strategy for refining Instruct models without further training. Abstract: Post-training, which elicits a pretrained Base model into the corresponding Instruct model, is widely considered to be superficial. In this work, we first reinforce this hypothesis by providing novel quantitative evidence from the weight level that the effective rank (eRank) remains negligibly changed. However, this superficiality also suffers a critical trade-off, improving the exploitation capabilities at the cost of limiting its exploration. To tackle this issue, we propose Timber, a simple yet effective training-free method that enhances the exploration capability of the Instruct model while preserving its exploitation. The key insight is to partially revert Instruct towards the paired Base model by subtle yet targeted refinement of the weight deltas. Extensive experiments on Llama and Qwen series demonstrate that Timber consistently improves vanilla Instruct models, particularly on Pass@k performance. Our findings offer new insights into the post-training stage at the weight level and practical strategies to refine the Instruct model without training.
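For readers unfamiliar with effective rank, the sketch below uses one common definition: the exponential of the Shannon entropy of the normalized singular values. Whether Timber uses exactly this definition is an assumption, since the abstract does not spell it out. The function takes singular values as input (for example, from an SVD of a weight matrix computed elsewhere), which keeps the example dependency-free.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the singular-value distribution (assumed eRank definition)."""
    positive = [s for s in singular_values if s > 0]
    if not positive:
        raise ValueError("at least one singular value must be positive")
    total = sum(positive)
    entropy = -sum((s / total) * math.log(s / total) for s in positive)
    return math.exp(entropy)
```

Under this definition a matrix with four equal singular values has effective rank 4, and a matrix with a single nonzero singular value has effective rank 1. Comparing the value for Base and Instruct weights is one way to quantify how much post-training changed a layer.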

[74] Fast Thinking for Large Language Models

Haoyu Zheng,Zhuonan Wang,Yuqian Yuan,Tianwei Lin,Wenqiao Zhang,Zheqi Lv,Juncheng Li,Siliang Tang,Yueting Zhuang,Hongyang He

Main category: cs.CL

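TL;DR: Proposes a fast-reasoning framework based on latent codebooks. Training uses concise chain-of-thought sketches to learn discrete strategy priors; inference uses a handful of continuous vectors for strategy-level guidance, and the GainRouter mechanism adaptively switches between fast and slow reasoning, substantially lowering inference cost while maintaining strong performance.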

Details

Motivation: Reasoning-oriented LLMs rely on explicitly generating chain-of-thought step by step, which is effective but inefficient, with high latency and token usage. A more efficient reasoning method is needed. Method: Introduce the Latent Codebooks framework, which learns discrete strategy priors from concise chain-of-thought sketches during training. At inference, continuous thinking vectors distilled from the codebook guide the model in a single forward pass. GainRouter, a lightweight routing mechanism, adaptively chooses between the fast and slow reasoning paths. Result: Across multiple reasoning benchmarks, the method achieves competitive or superior accuracy while substantially lowering inference cost. Conclusion: The method offers an efficient and controllable reasoning path for LLMs that balances reasoning performance against resource use. Abstract: Reasoning-oriented Large Language Models (LLMs) often rely on generating explicit tokens step by step, and their effectiveness typically hinges on large-scale supervised fine-tuning or reinforcement learning. While Chain-of-Thought (CoT) techniques substantially enhance performance on complex reasoning tasks, they remain inefficient, requiring long reasoning traces that increase latency and token usage. In this work, we introduce Latent Codebooks for Fast Thinking, a framework that uses concise CoT sketches only during training to learn a codebook of discrete strategy priors. At inference, the model conditions on a handful of continuous thinking vectors distilled from the codebook in a single pass, enabling strategy-level guidance without producing explicit reasoning tokens. To complement this design, we propose GainRouter, a lightweight routing mechanism that adaptively switches between fast codebook-guided inference and slow explicit reasoning, thereby suppressing overthinking and reducing unnecessary token generation. Experiments across multiple reasoning benchmarks show that our approach achieves competitive or superior accuracy while substantially lowering inference cost, offering a practical path toward efficient and controllable reasoning in large language models.

[75] Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models

Zemin Huang,Yuhang Wang,Zhiyang Chen,Guo-Jun Qi

Main category: cs.CL

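TL;DR: This paper proposes RemeDi, a mask-based diffusion language model that introduces a remasking mechanism to identify and revise low-quality tokens during generation, improving text generation quality.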

Details

Motivation: Existing mask-based diffusion language models can rarely modify a token once it is generated and lack the ability to identify and correct potential errors in their inputs. Method: At each step, RemeDi jointly predicts token distributions and a per-token confidence score. The confidence scores decide which tokens are remasked, and remasked tokens are resampled in later steps with richer context. Training combines supervised fine-tuning and reinforcement learning. Result: Experiments show that RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets. Conclusion: By introducing remasking, RemeDi gives diffusion language models a flexible revision ability and improves the quality and controllability of generated text. Abstract: Mask-based Diffusion Language Models (DLMs) struggle to revise incorrect tokens: once a token is generated, it typically remains fixed. The key challenge is to identify potential errors in the inputs. In this paper, we propose \emph{\underline{Rem}asking-\underline{e}nabled \underline{Di}ffusion Language Model} (RemeDi), a mask-based DLM that introduces \emph{remasking} as another fundamental mechanism, enabling more flexible text refinement in diffusion-based text generation. To achieve this, RemeDi jointly predicts token distributions and per-token confidence scores at each step. The confidence scores determine which tokens are unmasked after the current step, allowing the model to identify tokens with low quality and remask them. These remasked tokens can be resampled with richer context in subsequent steps. We design a remask-aware pipeline to train this ability, including supervised fine-tuning, which teaches the model to detect and remask incorrect tokens in addition to predicting masked tokens, and reinforcement learning, which optimizes full generation trajectories toward higher rewards. Experiments show that RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets.
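The remasking idea can be illustrated with a toy selection step. This is a minimal sketch under stated assumptions: the fixed threshold, the `[MASK]` string, and the function name are hypothetical, and RemeDi's actual rule for turning confidence scores into unmask and remask decisions is learned and not described in the abstract.

```python
def remask_low_confidence(tokens, confidences, threshold, mask_token="[MASK]"):
    """Keep tokens whose confidence meets the threshold; remask the rest."""
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    return [
        tok if conf >= threshold else mask_token
        for tok, conf in zip(tokens, confidences)
    ]
```

For example, with tokens `["the", "cat", "sat"]`, confidences `[0.9, 0.2, 0.8]`, and a threshold of 0.5, only the middle token is remasked, so a later denoising step can resample it with both neighbors visible.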

[76] Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in LLMs

Shulin Huang,Yiran Ding,Junshu Pan,Yue Zhang

Main category: cs.CL

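TL;DR: This study presents the first systematic comparison of the cross-lingual generalization of reinforcement learning (RL) and supervised fine-tuning (SFT) on multilingual complex reasoning tasks. RL not only achieves higher accuracy but also generalizes better when trained on non-English data.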

Details

Motivation: Although reinforcement learning performs well at improving the complex reasoning of large models, its effect on cross-lingual generalization is unclear and needs a systematic comparison with supervised fine-tuning. Method: Using Qwen2.5-3B-Base as the foundation model, compare the cross-lingual reasoning generalization of RL and SFT on multilingual benchmarks covering math, commonsense, and scientific reasoning, and conduct mechanistic analyses. Result: (1) RL achieves higher accuracy and stronger cross-lingual generalization than SFT. (2) RL training on non-English data performs better than training on English data, a pattern not observed with SFT. Mechanistic analysis indicates that RL gives the model more robust reasoning strategies. Conclusion: RL outperforms SFT for multilingual complex reasoning, with better overall performance and generalization especially when trained on non-English data, offering guidance for building equitable and effective multilingual reasoning systems. Abstract: Enhancing the complex reasoning capabilities of Large Language Models (LLMs) attracts widespread attention. While reinforcement learning (RL) has shown superior performance for improving complex reasoning, its impact on cross-lingual generalization compared to Supervised Fine-Tuning (SFT) remains unexplored. We present the first systematic investigation into cross-lingual reasoning generalization of RL and SFT. Using Qwen2.5-3B-Base as our foundation model, we conduct experiments on diverse multilingual reasoning benchmarks, including math reasoning, commonsense reasoning, and scientific reasoning. Our investigation yields two significant findings: (1) Tuning with RL not only achieves higher accuracy but also demonstrates substantially stronger cross-lingual generalization capabilities compared to SFT. (2) RL training on non-English data yields better overall performance and generalization than training on English data, which is not observed with SFT. Furthermore, through comprehensive mechanistic analyses, we explore the underlying factors of RL's superiority and generalization across languages. Our results provide compelling evidence that RL equips the model with more robust reasoning strategies, offering crucial guidance for more equitable and effective multilingual reasoning.

[77] Aligning LLMs for Multilingual Consistency in Enterprise Applications

Amit Agarwal,Hansa Meghwani,Hitesh Laxmichand Patel,Tao Sheng,Sujith Ravi,Dan Roth

Main category: cs.CL

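TL;DR: Proposes a batch-wise alignment fine-tuning strategy that uses semantically equivalent multilingual data to improve LLM performance in mid- and low-resource languages, substantially narrowing the accuracy gap with English.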

Details

Motivation: LLMs behave inconsistently across languages, with accuracy dropping markedly outside English, which hurts the reliability and user experience of global enterprise applications. Method: Perform batch-wise alignment fine-tuning with semantically equivalent multilingual data in each training batch, directly aligning model outputs across languages. Result: Non-English accuracy improves by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Conclusion: The method is simple and scalable, integrates into existing LLM training and deployment pipelines, and supports more robust and equitable multilingual AI solutions in industry. Abstract: Large language models (LLMs) remain unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases. This inconsistency undermines customer experience and operational reliability in multilingual settings such as customer support, content moderation, and information retrieval. Even with advanced Retrieval-Augmented Generation (RAG) systems, we observe up to a 29% accuracy drop in non-English languages compared to English. We propose a practical, batch-wise alignment strategy for fine-tuning LLMs, leveraging semantically equivalent multilingual data in each training batch to directly align model outputs across languages. This approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Our method is simple to implement, scalable, and integrates seamlessly with existing LLM training & deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.

[78] TF-Bench: Evaluating Program Semantics Reasoning with Type Inference in System F

Yifeng He,Luning Yang,Christopher Castro Gaw Gonzalo,Hao Chen

Main category: cs.CL

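TL;DR: This paper introduces TF-Bench, a code semantics reasoning benchmark based on type inference in System F, for evaluating whether LLMs genuinely reason about program semantics. TF-Bench_pure, a variant with natural-language cues removed, shows that even the best-performing model (Claude-3.7-sonnet) reaches only 55.85% accuracy on the purely semantic task. The paper also proposes two new metrics for robustness and test-time reasoning effectiveness and points to key directions for future research.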

Details

Motivation: Existing code reasoning benchmarks lack a formal, program-centric deductive framework and cannot tell whether models truly understand program semantics or merely exploit superficial associations between natural language and code tokens. Method: Propose TF-Bench, which evaluates LLM program semantics reasoning through type inference in System F. Use verified transformations to remove semantically irrelevant natural language and build the purely semantics-driven TF-Bench_pure. Introduce two new metrics for model robustness and the effectiveness of test-time reasoning. Result: The best accuracy of state-of-the-art LLMs on TF-Bench_pure is 55.85% (Claude-3.7-sonnet), indicating substantial limitations in deep semantic reasoning. The new metrics underscore limitations in the robustness of current models and in the effectiveness of their test-time reasoning. Conclusion: Current LLMs have limited program semantics reasoning ability and rely heavily on surface features rather than deeper logic. TF-Bench sets a stricter standard for evaluating code reasoning, and future work should improve formal reasoning ability and robustness. Abstract: Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute (TTC) reasoning capabilities show significant potential for understanding program logic and semantics beyond mere token recognition. However, current benchmarks for code reasoning lack a formal, program-centric deductive framework to ensure sound evaluation, and are incapable of assessing whether models genuinely reason about program semantics or merely exploit superficial associations between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as program semantics reasoning. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess robustness and the effectiveness of test-time reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.

[79] VIVA+: Human-Centered Situational Decision-Making

Zhe Hu,Yixiao Ren,Guanzhong Liu,Jing Li,Yu Yin

Main category: cs.CL

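TL;DR: This paper introduces VIVA+, a cognitively grounded benchmark for evaluating the reasoning and decision-making of multimodal large language models in human-centered situations. It covers situation comprehension, action justification, and reflective reasoning, and the experiments reveal the limitations of existing models and directions for improvement.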

Details

Motivation: Multimodal large language models show promise in complex human environments, but there is no effective way to evaluate their capacity for nuanced, human-like reasoning and decision-making. A systematic, cognitively grounded benchmark is needed to measure these abilities. Method: Build VIVA+, a dataset of 1,317 real-world situations and 6,373 multiple-choice questions targeting three core dimensions: foundational situation comprehension, context-driven action justification, and reflective reasoning. Evaluate the latest commercial and open-source models and explore targeted training and multi-step reasoning strategies. Result: The experiments reveal distinct performance patterns across models and dimensions and show that current models still face significant challenges in social reasoning and contextual understanding. The training and reasoning strategies yield performance improvements. Conclusion: VIVA+ provides an effective tool for evaluating the socially grounded decision-making of MLLMs. The findings identify current model limitations and suggest paths toward better context awareness and social adaptability. Abstract: Multimodal Large Language Models (MLLMs) show promising results for embodied agents in operating meaningfully in complex, human-centered environments. Yet, evaluating their capacity for nuanced, human-like reasoning and decision-making remains challenging. In this work, we introduce VIVA+, a cognitively grounded benchmark for evaluating the reasoning and decision-making of MLLMs in human-centered situations. VIVA+ consists of 1,317 real-world situations paired with 6,373 multiple-choice questions, targeting three core abilities for decision-making: (1) Foundational Situation Comprehension, (2) Context-Driven Action Justification, and (3) Reflective Reasoning. Together, these dimensions provide a systematic framework for assessing a model's ability to perceive, reason, and act in socially meaningful ways. We evaluate the latest commercial and open-source models on VIVA+, where we reveal distinct performance patterns and highlight significant challenges. We further explore targeted training and multi-step reasoning strategies, which yield consistent performance improvements. Finally, our in-depth analysis highlights current model limitations and provides actionable insights for advancing MLLMs toward more robust, context-aware, and socially adept decision-making in real-world settings.

Zhiqiang Liu,Yichi Zhang,Mengshu Sun,Lei Liang,Wen Zhang

Main category: cs.CL

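TL;DR: Proposes M-Hyper, a new multi-modal knowledge graph completion method that combines fused and independent modality representations and uses biquaternions to model interactions between modalities, with strong performance, robustness, and computational efficiency.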

Details

Motivation: Existing multi-modal knowledge graph completion methods either lose modality-specific information because of fixed fusion strategies or struggle to capture context-dependent semantic interplay between modalities. Method: Propose M-Hyper, which introduces a Fine-grained Entity Representation Factorization (FERF) module and a Robust Relation-aware Modality Fusion (R2MF) module, maps three independent modalities and one fused modality onto the four orthogonal bases of a biquaternion, and models modality interactions with the Hamilton product. Result: Experiments indicate that M-Hyper outperforms existing methods on multiple metrics, with good robustness and computational efficiency. Conclusion: M-Hyper enables fused and independent modality representations to coexist and collaborate, improving multi-modal knowledge graph completion. Abstract: Multi-modal knowledge graph completion (MMKGC) aims to discover missing facts in multi-modal knowledge graphs (MMKGs) by leveraging both structural relationships and diverse modality information of entities. Existing MMKGC methods follow two multi-modal paradigms: fusion-based and ensemble-based. Fusion-based methods employ fixed fusion strategies, which inevitably leads to the loss of modality-specific information and a lack of flexibility to adapt to varying modality relevance across contexts. In contrast, ensemble-based methods retain modality independence through dedicated sub-models but struggle to capture the nuanced, context-dependent semantic interplay between modalities. To overcome these dual limitations, we propose a novel MMKGC method, M-Hyper, which achieves the coexistence and collaboration of fused and independent modality representations. Our method integrates the strengths of both paradigms, enabling effective cross-modal interactions while maintaining modality-specific information. Inspired by "quaternion" algebra, we utilize its four orthogonal bases to represent multiple independent modalities and employ the Hamilton product to efficiently model pair-wise interactions among them. Specifically, we introduce a Fine-grained Entity Representation Factorization (FERF) module and a Robust Relation-aware Modality Fusion (R2MF) module to obtain robust representations for three independent modalities and one fused modality. The resulting four modality representations are then mapped to the four orthogonal bases of a biquaternion (a hypercomplex extension of quaternion) for comprehensive modality interaction. Extensive experiments indicate its state-of-the-art performance, robustness, and computational efficiency.
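The Hamilton product mentioned above can be written out for scalar components as a minimal sketch. The tuple layout `(a, b, c, d)` for `a + b*i + c*j + d*k` and the function name are assumptions made for illustration; M-Hyper applies the product to learned embeddings, and its exact parameterization is not given in the abstract. Because Python's arithmetic also works on `complex` values, the same formula covers biquaternions, whose four coefficients are complex numbers.

```python
def hamilton_product(q1, q2):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )
```

Each output component mixes all four input components, which is why the product is a compact way to model pairwise interactions among four representations. The product is not commutative: multiplying `i` by `j` gives `k`, while `j` by `i` gives `-k`.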

[81] Do LLMs Understand Romanian Driving Laws? A Study on Multimodal and Fine-Tuned Question Answering

Eduard Barbu,Adrian Marius Dumitran

Main category: cs.CL

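TL;DR: This paper evaluates LLMs on Romanian driving-law question answering, releases a 1,208-question dataset, and compares text-only and multimodal systems. Domain fine-tuned small models are competitive, textual descriptions of images outperform direct visual input, and LLMs used as judges show self-preference bias.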

Details

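Motivation: Ensuring that drivers master traffic rules is critical to road safety, but explainable question answering for less-resourced languages such as Romanian is under-studied, so the ability of LLMs in this setting needs to be evaluated. Method: Build a Romanian driving-law dataset of 1,208 questions (387 multimodal), compare text-only and multimodal SOTA systems, apply domain-specific fine-tuning to Llama 3.1-8B-Instruct and RoLlama 3.1-8B-Instruct, and use an LLM-as-a-Judge to assess the quality of generated explanations. Result: SOTA models perform well, but domain fine-tuned 8B models are competitive. Textual descriptions of images work better than direct visual input. The LLM judge exhibits self-preference bias. Conclusion: In less-resourced languages, domain fine-tuning can make small models competitive on legal question answering, and textual descriptions help model understanding. Evaluations of explanation quality should account for the self-preference bias of LLM judges. Abstract: Ensuring that both new and experienced drivers master current traffic rules is critical to road safety. This paper evaluates Large Language Models (LLMs) on Romanian driving-law QA with explanation generation. We release a 1,208-question dataset (387 multimodal) and compare text-only and multimodal SOTA systems, then measure the impact of domain-specific fine-tuning for Llama 3.1-8B-Instruct and RoLlama 3.1-8B-Instruct. SOTA models perform well, but fine-tuned 8B models are competitive. Textual descriptions of images outperform direct visual input. Finally, an LLM-as-a-Judge assesses explanation quality, revealing self-preference bias. The study informs explainable QA for less-resourced languages.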

[82] Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

Yucheng Wang,Yifan Hou,Aydin Javadov,Mubashara Akhtar,Mrinmaya Sachan

Main category: cs.CL

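TL;DR: This paper proposes a logic-grounded evaluation framework for studying cross-modal reasoning in multimodal LLMs. Additional modalities help only when they provide independent and sufficient reasoning paths, while redundant or chained support often hurts. The paper identifies two core problems: a task-composition bottleneck and a fusion bottleneck.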

Details

Motivation: Prior work reports conflicting conclusions on whether added modalities help reasoning, and it lacks controlled evaluation frameworks and analysis of model internals, making it hard to pin down when and why modality interactions affect reasoning. Method: Build a logic-grounded framework with six interaction patterns, systematically evaluate reasoning under different distributions of facts across modalities and different logical combinations, and probe internal behavior with attention analysis, modality-identity probing, and two-step prompting (recognize, then reason). Result: Performance improves only when an added modality provides an independent and sufficient reasoning path; redundant or chained support often lowers performance. Three systematic degradations appear: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals across modalities fail to integrate. Attention patterns fail to encode fact usefulness, but two-step prompting restores performance. Modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning. Conclusion: The main barrier to multimodal reasoning is information integration rather than perception. The task-composition bottleneck and biased fusion are the core failure mechanisms, and future work should focus on composition-aware training and early fusion control. Abstract: Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. Therefore, we identify two core failures: task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and fusion bottleneck, where early integration introduces bias. For further investigation, we find that attention patterns fail to encode fact usefulness, but a simple two-step prompting strategy (recognize then reason) restores performance, confirming the task-composition bottleneck. Moreover, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.

[83] Understanding Textual Capability Degradation in Speech LLMs via Parameter Importance Analysis

Chao Wang,Rui-Chen Zheng,Yang Ai,Zhen-Hua Ling

Main category: cs.CL

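TL;DR: This paper studies the degradation of textual capability caused by integrating speech into large language models. It analyzes the mechanism through parameter importance estimation and mitigates the problem with layer-wise learning rate scheduling and Low-Rank Adaptation (LoRA), preserving textual capability while improving spoken question answering performance.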

Details

Motivation: Speech-enabled LLMs often gain speech capability at the expense of their original text understanding, which limits how fully they can exploit pre-trained textual knowledge. The root cause therefore needs to be investigated and mitigated. Method: Working within the encoder-adaptor paradigm, the paper proposes an analytical framework based on parameter importance estimation, which shows that speech fine-tuning shifts the importance distribution of text-related parameters. It then applies layer-wise learning rate scheduling and Low-Rank Adaptation (LoRA) to preserve the original parameter distribution. Result: Both strategies maintain textual capability better than full fine-tuning while also improving downstream spoken question answering. Conclusion: Controlling the shift in parameter importance during fine-tuning effectively mitigates the negative effect of speech integration on textual capability, providing both a principled explanation and a practical recipe for designing multimodal LLMs. Abstract: The integration of speech into Large Language Models (LLMs) has substantially expanded their capabilities, but often at the cost of weakening their core textual competence. This degradation limits the ability of speech-enabled LLMs to fully exploit their pre-trained text-based knowledge. In this work, we analyze the underlying mechanisms of this issue through a focused study of the widely used encoder-adaptor paradigm. We propose an analytical framework based on parameter importance estimation, which reveals that fine-tuning for speech introduces a textual importance distribution shift: the layer-wise allocation of parameters critical to textual reasoning is disrupted. Building on this insight, we investigate two mitigation strategies: layer-wise learning rate scheduling and Low-Rank Adaptation (LoRA), both of which aim to preserve the original parameter distribution. Experimental results show that both approaches better maintain textual competence than full fine-tuning, while also improving downstream spoken question answering performance. Furthermore, our analysis offers a principled explanation for the effectiveness of the proposed mitigation strategies, linking their benefits to the structural properties of textual knowledge in LLMs.

[84] Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality

Junliang Li,Yucheng Wang,Yan Chen,Yu Ran,Ruiqing Zhang,Jing Liu,Hua Wu,Haifeng Wang

Main category: cs.CL

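TL;DR: Proposes the Knowledge-Level Consistency Reinforcement Learning Framework (KLCF), which uses a Dual-Fact Alignment mechanism to improve the factuality of large language models in long-form generation and reduce hallucination.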

Details

Motivation: Existing reinforcement learning from human feedback methods rely mainly on preference rewards and overlook the model's internal knowledge boundaries, which leads to factual errors and hallucination in generated content. Method: Proposes KLCF, which uses pretrained knowledge boundaries to build a fact checklist that guides online reinforcement learning toward better factual coverage and recall, and trains a self-assessment module based on the base model's internal knowledge to improve factual precision during generation. The method needs no external knowledge retrieval or heavy verification, so it is lightweight and scalable. Result: KLCF substantially improves factuality metrics on multiple long-form generation benchmarks and effectively alleviates hallucination. Conclusion: By focusing on the consistency between the knowledge expressed by the policy model and the parametric knowledge of the base model, KLCF offers an efficient, scalable way to improve LLM factuality. Abstract: Hallucination and factuality deficits remain key obstacles to the reliability of large language models (LLMs) in long-form generation. Existing reinforcement learning from human feedback (RLHF) frameworks primarily rely on preference rewards, yet they often overlook the model's internal knowledge boundaries, exacerbating the so-called "hallucination tax". To address this challenge, we propose the Knowledge-Level Consistency Reinforcement Learning Framework (KLCF), a novel framework that focuses on the knowledge consistency between the policy model's expressed knowledge and the base model's parametric knowledge, and introduces a Dual-Fact Alignment mechanism to jointly optimize factual recall and precision. Specifically, KLCF leverages pretrained knowledge boundaries to construct a fact checklist, guiding online reinforcement learning to improve factual coverage and recall; simultaneously, it trains a self-assessment module based on the base model's internal knowledge to enhance factual precision during generation. Unlike prior methods that rely on external retrieval or heavy verification, our reward design is fully external-knowledge-free and lightweight, making KLCF efficient and easily scalable to large-scale training. Experimental results demonstrate that KLCF substantially improves factuality metrics across multiple long-form benchmarks and effectively alleviates model hallucinations.

[85] From Personal to Collective: On the Role of Local and Global Memory in LLM Personalization

Zehong Wang,Junlin Wu,Zhaoxuan Tan,Bolian Li,Xianrui Zhong,Zheli Liu,Qingkai Zeng

Main category: cs.CL

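TL;DR: Proposes a local-global memory framework (LoGo) that combines personalized local memory with collective global memory and uses a mediator module to reconcile conflicts between the two, effectively alleviating the cold-start and biasing problems in LLM personalization.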

Details

Motivation: To address the cold-start problem (too little user history) and the biasing problem (abundant but heavily skewed history) in LLM personalization, both of which stem from the inability to model collective knowledge across users. Method: Designs a local-global memory framework (LoGo) in which local memory captures individual preferences and global memory models interests shared across the population, and introduces a mediator module to reconcile conflicts between the local and global memories. Result: Experiments on multiple benchmarks show that LoGo significantly improves personalization, both raising quality for cold-start users and reducing the overfitting caused by skewed data. Conclusion: Incorporating collective knowledge improves LLM personalization. By fusing local and global memory and resolving their conflicts, LoGo provides a more robust solution for personalization. Abstract: Large language model (LLM) personalization aims to tailor model behavior to individual users based on their historical interactions. However, its effectiveness is often hindered by two key challenges: the \textit{cold-start problem}, where users with limited history provide insufficient context for accurate personalization, and the \textit{biasing problem}, where users with abundant but skewed history cause the model to overfit to narrow preferences. We identify both issues as symptoms of a common underlying limitation, i.e., the inability to model collective knowledge across users. To address this, we propose a local-global memory framework (LoGo) that combines the personalized local memory with a collective global memory that captures shared interests across the population. To reconcile discrepancies between these two memory sources, we introduce a mediator module designed to resolve conflicts between local and global signals. Extensive experiments on multiple benchmarks demonstrate that LoGo consistently improves personalization quality by both warming up cold-start users and mitigating biased predictions. These results highlight the importance of incorporating collective knowledge to enhance LLM personalization.

[86] Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions

Yoonah Park,Haesung Pyun,Yohan Jo

Main category: cs.CL

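TL;DR: This paper proposes KAPPA, a parameter-free intervention that applies a projection-based adjustment to hidden states within a specific subspace so that an LLM's predictions on multiple-choice questions align with its internal knowledge, substantially improving accuracy.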

Details

Motivation: Large language models (LLMs) often perform poorly on multiple-choice questions (MCQs) even though they show the correct knowledge in other settings, revealing a gap between knowledge and prediction. This paper aims to understand the mechanism behind this gap and reduce it. Method: A probing analysis finds that the residual streams of certain layers contain a subspace spanned by a "knowledge basis" and a "prediction basis". Within this subspace, KAPPA applies a projection-based adjustment to the hidden states so that the prediction coordinate aligns with the knowledge coordinate, yielding knowledge-guided predictions. Result: On binary-choice reformulations of Big-Bench-Hard and ARC-Challenge, KAPPA substantially improves accuracy and outperforms baselines; the method generalizes to some extent across tasks and extends to free-form questions. Conclusion: KAPPA offers a new geometric view of the knowledge-prediction gap in LLMs and improves model behavior through a simple, effective, parameter-free intervention that better exploits the model's latent knowledge. Abstract: Large Language Models (LLMs) often fail on multiple-choice questions (MCQs) despite demonstrating correct knowledge in other contexts, such as free-form generation. To investigate the mechanism underlying this knowledge-prediction gap on MCQs and alleviate it, we conduct a probing analysis and find that residual streams in certain layers contain a subspace spanned by two important bases: a \emph{knowledge basis} that encodes the probability of the ground-truth answer for a given MCQ and a \emph{prediction basis} that encodes the probability of the answer choice predicted by the model. We observe that incorrect predictions arise from a misalignment of the model's hidden states along these two bases. Hence, we introduce \textbf{KAPPA} (Knowledge-Aligned Prediction through Projection-based Adjustment), a parameter-free intervention that transforms the hidden states to align the prediction coordinate with the knowledge coordinate within this subspace. Experiments on binary-choice reformulations of Big-Bench-Hard and ARC-Challenge show that KAPPA substantially improves accuracy and consistently outperforms baselines. While optimal subspaces differ across tasks, subspaces generalize to some extent, as supported by cross-dataset experiments. Moreover, KAPPA extends its effectiveness to free-form questions beyond MCQs. Our work provides a new geometric understanding of the knowledge-prediction gap and offers a practical method for better aligning model behavior with its latent knowledge.
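
The coordinate alignment described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the knowledge and prediction directions are available as orthonormal unit vectors (here called `u_k` and `u_p`, both hypothetical names) and uses the simplest possible adjustment, shifting the hidden state along `u_p` until its prediction coordinate equals its knowledge coordinate. The transformation used in the paper may differ.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def align_prediction_to_knowledge(h, u_k, u_p):
    """Shift hidden state h along u_p so that <h', u_p> == <h, u_k>.

    Assumption: u_k and u_p are orthonormal, so moving along u_p
    leaves the knowledge coordinate unchanged.
    """
    c_k = dot(h, u_k)  # knowledge coordinate
    c_p = dot(h, u_p)  # prediction coordinate
    return [x + (c_k - c_p) * p for x, p in zip(h, u_p)]


# Toy 3-dimensional example with axis-aligned directions.
h = [2.0, -1.0, 0.5]
u_k = [1.0, 0.0, 0.0]
u_p = [0.0, 1.0, 0.0]
h_adjusted = align_prediction_to_knowledge(h, u_k, u_p)
```

No parameters are learned in this sketch, which mirrors the "parameter-free" property named in the abstract; how the two directions are found by probing is outside its scope.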

[87] Transformer Tafsir at QIAS 2025 Shared Task: Hybrid Retrieval-Augmented Generation for Islamic Knowledge Question Answering

Muhammad Abu Ahmad,Mohamad Ballout,Raia Abu Ahmad,Elia Bruni

Main category: cs.CL

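TL;DR: This paper presents a hybrid retrieval-augmented generation (RAG) system for Islamic knowledge understanding and reasoning that combines sparse and dense retrieval with cross-encoder reranking and significantly improves LLM performance.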

Details

Motivation: To improve the accuracy and semantic matching of large language models on Islamic knowledge understanding and reasoning, and to address the limitations of existing methods on complex knowledge question answering. Method: Uses a three-stage hybrid RAG pipeline: BM25 for sparse retrieval, a dense embedding model for semantic matching, and cross-encoder reranking to sharpen retrieval precision. The pipeline is evaluated with two LLMs, Fanar and Mistral. Result: The method improves performance on both subtasks, with accuracy gains of up to 25%. Fanar performs best, reaching 45% accuracy on Subtask 1 and 80% on Subtask 2. Conclusion: The hybrid RAG pipeline effectively strengthens LLMs on Islamic knowledge tasks, particularly when combined with cross-encoder reranking, which supports the value of multi-stage retrieval. Abstract: This paper presents our submission to the QIAS 2025 shared task on Islamic knowledge understanding and reasoning. We developed a hybrid retrieval-augmented generation (RAG) system that combines sparse and dense retrieval methods with cross-encoder reranking to improve large language model (LLM) performance. Our three-stage pipeline incorporates BM25 for initial retrieval, a dense embedding retrieval model for semantic matching, and cross-encoder reranking for precise content retrieval. We evaluate our approach on both subtasks using two LLMs, Fanar and Mistral, demonstrating that the proposed RAG pipeline enhances performance across both, with accuracy improvements up to 25%, depending on the task and model configuration. Our best configuration is achieved with Fanar, yielding accuracy scores of 45% in Subtask 1 and 80% in Subtask 2.
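
The three-stage retrieval flow in this abstract can be sketched as follows. This is a minimal illustration rather than the submitted system: the BM25 scorer is a plain Okapi BM25 written from scratch, while `jaccard` and `overlap_count` are toy stand-ins (assumptions) for the dense embedding model and the cross-encoder. How the sparse and dense candidate lists are merged is also an assumption; the sketch simply takes their union before reranking.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(scores, k):
    """Indices of the k highest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def hybrid_retrieve(query, docs, dense_score, rerank_score, k=2, final_k=1):
    sparse = top_k(bm25_scores(query, docs), k)                # stage 1: BM25
    dense = top_k([dense_score(query, d) for d in docs], k)    # stage 2: dense
    candidates = list(dict.fromkeys(sparse + dense))           # union, order kept
    reranked = sorted(candidates,                              # stage 3: rerank
                      key=lambda i: rerank_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:final_k]]


# Toy stand-ins for the dense retriever and the cross-encoder (assumptions).
def jaccard(query, doc):
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d)


def overlap_count(query, doc):
    return len(set(tokenize(query)) & set(tokenize(doc)))


DOCS = [
    "zakat is an obligatory charity",
    "fasting in ramadan",
    "prayer times and rules",
]
best = hybrid_retrieve("rules of zakat charity", DOCS, jaccard, overlap_count)
```

In a real pipeline the two stand-in scorers would be replaced by an embedding model and a cross-encoder, and the reranked passages would be placed in the LLM prompt.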

[88] Open-DeBias: Toward Mitigating Open-Set Bias in Language Models

Arti Rani,Shweta Singh,Nihar Ranjan Sahoo,Gaurav Kumar Nayak

Main category: cs.CL

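TL;DR: This paper introduces the OpenBiasBench benchmark and the Open-DeBias debiasing method for detecting and mitigating open-set bias in text-based question answering. The method fine-tunes adapter modules on a small amount of data, substantially improves accuracy, and shows zero-shot cross-lingual transfer.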

Details

Motivation: Existing debiasing methods are limited to predefined categories and struggle with novel or context-specific emergent biases. This paper targets the detection and mitigation of open-set bias in text-based QA. Method: Introduces the OpenBiasBench benchmark to evaluate both known and previously unseen biases, and designs Open-DeBias, which uses adapter modules for data-efficient and parameter-efficient debiasing. Result: Compared with the state-of-the-art BMBI method, accuracy on the BBQ dataset improves by nearly 48% on ambiguous subsets and 6% on disambiguated ones. Adapters fine-tuned on only a small fraction of the training data reach 84% accuracy in zero-shot transfer to Korean BBQ, and evaluation on StereoSet, CrowS-Pairs, and other NLP tasks confirms robustness and multilingual strength. Conclusion: Open-DeBias is an efficient, general-purpose debiasing method that generalizes well and suits mitigation of social and stereotypical bias in open-domain, multilingual settings. Abstract: Large Language Models (LLMs) have achieved remarkable success on question answering (QA) tasks, yet they often encode harmful biases that compromise fairness and trustworthiness. Most existing bias mitigation approaches are restricted to predefined categories, limiting their ability to address novel or context-specific emergent biases. To bridge this gap, we tackle the novel problem of open-set bias detection and mitigation in text-based QA. We introduce OpenBiasBench, a comprehensive benchmark designed to evaluate biases across a wide range of categories and subgroups, encompassing both known and previously unseen biases. Additionally, we propose Open-DeBias, a novel, data-efficient, and parameter-efficient debiasing method that leverages adapter modules to mitigate existing social and stereotypical biases while generalizing to unseen ones. Compared to the state-of-the-art BMBI method, Open-DeBias improves QA accuracy on the BBQ dataset by nearly 48% on ambiguous subsets and 6% on disambiguated ones, using adapters fine-tuned on just a small fraction of the training data. Remarkably, the same adapters, in a zero-shot transfer to Korean BBQ, achieve 84% accuracy, demonstrating robust language-agnostic generalization. Through extensive evaluation, we also validate the effectiveness of Open-DeBias across a broad range of NLP tasks, including StereoSet and CrowS-Pairs, highlighting its robustness, multilingual strength, and suitability for general-purpose, open-domain bias mitigation. The project page is available at: https://sites.google.com/view/open-debias25

[89] SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models

Ziyi Yang,Weizhou Shen,Ruijun Chen,Chenliang Li,Fanqi Wan,Ming Yan,Xiaojun Quan,Fei Huang

Main category: cs.CL

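TL;DR: This paper proposes SPELL, a multi-role self-play reinforcement learning framework for scalable, label-free optimization of long-context reasoning.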

Details

Motivation: Progress in long-context reasoning for large models has lagged because reliable human annotations and programmatically verifiable reward signals are scarce. Method: SPELL integrates three cyclical roles (questioner, responder, and verifier) within a single model and improves continually through self-play. An automated curriculum and an adaptive reward function are introduced to stabilize training. Result: On six long-context benchmarks, SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on annotated data. On Qwen3-30B-A3B-Thinking it raises average pass@8 by 7.6 points. Conclusion: SPELL enables continual improvement of long-context reasoning without human annotation and shows promise for scaling to stronger models. Abstract: Progress in long-context reasoning for large language models (LLMs) has lagged behind other recent advances. This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals. In this paper, we propose SPELL, a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning. SPELL integrates three cyclical roles (questioner, responder, and verifier) within a single model to enable continual self-improvement. The questioner generates questions from raw documents paired with reference answers; the responder learns to solve these questions based on the documents; and the verifier evaluates semantic equivalence between the responder's output and the questioner's reference answer, producing reward signals to guide continual training. To stabilize training, we introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model's evolving capabilities. Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Notably, SPELL achieves an average 7.6-point gain in pass@8 on the strong reasoning model Qwen3-30B-A3B-Thinking, raising its performance ceiling and showing promise for scaling to even more capable models.

[90] Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning

Shaobo Wang,Jiaming Wang,Jiajun Zhang,Cong Wang,Yue Min,Zichen Wen,Fei Huang,Huiqiang Jiang,Junyang Lin,Dayiheng Liu,Linfeng Zhang

Main category: cs.CL

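TL;DR: Proposes Q-Tuning, a unified data pruning framework built on the Error-Uncertainty plane that jointly optimizes sample-level and token-level pruning. Using only 12.5% of the data, it achieves a 38% average improvement on SmolLM2-1.7B, markedly improving the data efficiency of LLM fine-tuning.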

Details

Motivation: Existing data pruning methods operate at either the sample level or the token level in isolation and fail to optimize both jointly. As a result, redundant tokens in high-value samples are not handled effectively, or crucial signals are discarded by mistake, which limits data efficiency. Method: Introduces the Error-Uncertainty (EU) plane for analyzing data utility and builds the Q-Tuning framework on it. The first stage performs sample-level triage, retaining samples that carry informative misconception or calibration signals. The second stage applies a context-aware, asymmetric token-pruning policy to misconception samples while keeping calibration samples intact. Result: Achieves state-of-the-art results on five diverse benchmarks. On SmolLM2-1.7B, it improves on the full-data baseline by 38% on average while using only 12.5% of the training data. Conclusion: Q-Tuning is the first dynamic pruning method to consistently outperform full-data training, offering an efficient and scalable way to use data for LLM supervised fine-tuning under resource constraints. Abstract: As supervised fine-tuning (SFT) evolves from a lightweight post-training step into a compute-intensive phase rivaling mid-training in scale, data efficiency has become critical for aligning large language models (LLMs) under tight budgets. Existing data pruning methods suffer from a fragmented design: they operate either at the sample level or the token level in isolation, failing to jointly optimize both dimensions. This disconnect leads to significant inefficiencies: high-value samples may still contain redundant tokens, while token-level pruning often discards crucial instructional or corrective signals embedded in individual examples. To address this bottleneck, we introduce the Error-Uncertainty (EU) Plane, a diagnostic framework that jointly characterizes the heterogeneous utility of training data across samples and tokens. Guided by this insight, we propose Quadrant-based Tuning (Q-Tuning), a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning employs a two-stage strategy: first, it performs sample-level triage to retain examples rich in informative misconceptions or calibration signals; second, it applies an asymmetric token-pruning policy, using a context-aware scoring mechanism to trim less salient tokens exclusively from misconception samples while preserving calibration samples in their entirety. Our method sets a new state of the art across five diverse benchmarks. Remarkably, on SmolLM2-1.7B, Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data. As the first dynamic pruning approach to consistently outperform full-data training, Q-Tuning provides a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.
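
The two-stage strategy in this abstract can be sketched in a few lines. Everything specific here is an assumption made for illustration, not the paper's definition: the mapping of quadrants to labels (high error with low uncertainty as "misconception", low error with high uncertainty as "calibration"), the fixed thresholds, the `keep_ratio` parameter, and the per-token `scores` that stand in for the context-aware scoring mechanism.

```python
import math


def quadrant(error, uncertainty, err_thr, unc_thr):
    """Place one sample on the Error-Uncertainty plane.

    Assumed labels: confidently wrong samples are 'misconception',
    correct but unsure samples are 'calibration', the rest are pruned.
    """
    if error >= err_thr and uncertainty < unc_thr:
        return "misconception"
    if error < err_thr and uncertainty >= unc_thr:
        return "calibration"
    return "pruned"


def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, in original order."""
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


def two_stage_prune(samples, err_thr, unc_thr, keep_ratio):
    """Stage 1: sample-level triage. Stage 2: asymmetric token pruning."""
    kept_samples = []
    for s in samples:
        label = quadrant(s["error"], s["uncertainty"], err_thr, unc_thr)
        if label == "misconception":
            # Only misconception samples have tokens trimmed.
            kept_samples.append(prune_tokens(s["tokens"], s["scores"], keep_ratio))
        elif label == "calibration":
            # Calibration samples are preserved in full.
            kept_samples.append(list(s["tokens"]))
    return kept_samples


samples = [
    {"tokens": ["a", "b", "c", "d"], "scores": [0.9, 0.1, 0.8, 0.2],
     "error": 2.0, "uncertainty": 0.1},
    {"tokens": ["e", "f"], "scores": [0.5, 0.5],
     "error": 0.1, "uncertainty": 0.9},
    {"tokens": ["g"], "scores": [1.0],
     "error": 0.1, "uncertainty": 0.1},
]
result = two_stage_prune(samples, err_thr=1.0, unc_thr=0.5, keep_ratio=0.5)
```

The abstract describes the approach as dynamic, so in practice the error and uncertainty values would presumably be recomputed as training progresses; this sketch shows a single static pass.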

[91] DocPruner: A Storage-Efficient Framework for Multi-Vector Visual Document Retrieval via Adaptive Patch-Level Embedding Pruning

Yibo Yan,Guangwei Xu,Xin Zou,Shuliang Liu,James Kwok,Xuming Hu

Main category: cs.CL

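TL;DR: This paper introduces DocPruner, the first adaptive patch-level embedding pruning framework for visual document retrieval (VDR). It uses the intra-document patch attention distribution to dynamically remove redundant embeddings, cutting storage by 50-60% while preserving retrieval performance.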

Details

Motivation: Existing multi-vector VDR methods are effective but store hundreds of vectors per document, so the storage overhead makes large-scale deployment difficult. Method: Proposes DocPruner, which uses the patch attention distribution within each document to adaptively identify and prune redundant patch-level embeddings, reducing storage overhead. Result: Experiments on more than ten representative datasets show that DocPruner reduces the storage of leading multi-vector VDR models by 50-60% with negligible impact on retrieval performance. Conclusion: DocPruner offers a robust, flexible, and effective solution for building storage-efficient, scalable VDR systems. Abstract: Visual Document Retrieval (VDR), the task of retrieving visually-rich document pages using queries that combine visual and textual cues, is crucial for numerous real-world applications. Recent state-of-the-art methods leverage Large Vision-Language Models (LVLMs) in a multi-vector paradigm, representing each document as patch-level embeddings to capture fine-grained details. While highly effective, this approach introduces a critical challenge: prohibitive storage overhead, as storing hundreds of vectors per page makes large-scale deployment costly and impractical. To address this, we introduce DocPruner, the first framework to employ adaptive patch-level embedding pruning for VDR to effectively reduce the storage overhead. DocPruner leverages the intra-document patch attention distribution to dynamically identify and discard redundant embeddings for each document. This adaptive mechanism enables a significant 50-60% reduction in storage for leading multi-vector VDR models with negligible degradation in document retrieval performance. Extensive experiments across more than ten representative datasets validate that DocPruner offers a robust, flexible, and effective solution for building storage-efficient, large-scale VDR systems.
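
One way to picture per-document adaptive pruning is a threshold derived from that document's own attention statistics. The rule below (mean plus `k` standard deviations) and the fallback that always keeps one patch are assumptions chosen for illustration; the abstract does not state the exact criterion DocPruner uses.

```python
from statistics import mean, pstdev


def prune_patches(embeddings, attention, k=0.0):
    """Keep patch embeddings whose attention exceeds a per-document threshold.

    Assumed rule: threshold = mean + k * std of this document's attention
    scores, so the cut-off adapts to each document's own distribution.
    Returns the kept embeddings and their original indices.
    """
    threshold = mean(attention) + k * pstdev(attention)
    kept = [i for i, a in enumerate(attention) if a > threshold]
    if not kept:
        # Never drop every patch: fall back to the most-attended one.
        kept = [max(range(len(attention)), key=attention.__getitem__)]
    return [embeddings[i] for i in kept], kept


# Toy document with four 2-dimensional patch embeddings.
patch_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
patch_attention = [0.05, 0.40, 0.10, 0.45]
pruned, kept_indices = prune_patches(patch_embeddings, patch_attention)
storage_ratio = len(pruned) / len(patch_embeddings)
```

Raising `k` prunes more aggressively; a document with a flat attention distribution falls back to keeping a single patch under this rule, which is a design choice of the sketch rather than of the paper.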

[92] Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step

Jingyi Yang,Guanxu Chen,Xuhao Hu,Jing Shao

Main category: cs.CL

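TL;DR: This paper proposes the decoding strategies EOSER and ASS and the reinforcement learning algorithm CJ-GRPO for masked diffusion language models (MDLMs), addressing the inconsistency between training and inference in conventional methods, and validates them on reasoning tasks such as mathematics and planning.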

Details

Motivation: Directly transferring techniques from autoregressive models to MDLMs creates a training-inference inconsistency, and decoding strategies and reinforcement learning algorithms designed specifically for MDLMs are lacking. Method: Proposes the EOS Early Rejection (EOSER) and Ascending Step-Size (ASS) decoding scheduler, and introduces Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO), which keeps the rollout trajectory consistent with the optimization trajectory. Result: Experiments on LLaDA-8B-Instruct show that the proposed methods achieve competitive performance with fewer decoding steps, especially on mathematical and planning reasoning tasks. Conclusion: EOSER, ASS, and CJ-GRPO improve the inference efficiency and training stability of MDLMs and open a new direction for optimizing them. Abstract: Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding is non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between rollout trajectory and optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL.
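
To give a feel for why an ascending step size reduces the number of decoding steps, the sketch below builds a schedule in which the number of tokens unmasked per step grows geometrically. The doubling rule and the `growth` parameter are assumptions for illustration only; the abstract does not specify the schedule that the ASS scheduler actually uses.

```python
def ascending_schedule(total_tokens, growth=2):
    """Number of tokens to unmask at each decoding step.

    Assumed rule: start with one token and multiply the step size by
    `growth` each step, so early steps are cautious and later steps
    decode many tokens in parallel. The final step takes whatever
    remains, so it may be smaller than the step before it.
    """
    sizes = []
    remaining = total_tokens
    step = 1
    while remaining > 0:
        take = min(step, remaining)
        sizes.append(take)
        remaining -= take
        step *= growth
    return sizes


schedule_15 = ascending_schedule(15)
schedule_16 = ascending_schedule(16)
```

Under this assumed rule, 15 masked tokens are decoded in 4 steps instead of the 15 steps a one-token-per-step schedule would need.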

[93] Assessing Large Language Models in Updating Their Forecasts with New Information

Zhangdie Yuan,Zifeng Ding,Andreas Vlachos

Main category: cs.CL

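TL;DR: This paper introduces EVOLVECAST, a framework for evaluating whether large language models appropriately revise their forecasts of future events when new information appears. It finds that model updates are often inconsistent or overly conservative and that confidence estimates fall well short of the human reference.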

Details

Motivation: Prior work treats future event prediction as a static task and ignores that forecasts and confidence should change as new evidence emerges. Method: Proposes the EVOLVECAST framework, which presents LLMs with information released after their training cutoff and evaluates how they revise their predictions, using human forecasters as a reference for analyzing prediction shifts and confidence calibration. Result: LLMs respond to new information, but their updates are often inconsistent or overly conservative. Neither verbalized nor logits-based confidence estimates consistently outperform the other, and both remain far from the human reference. Conclusion: Current LLMs fall short at dynamic belief updating and show a conservative bias, which calls for more robust approaches to belief updating. Abstract: Prior work has largely treated future event prediction as a static task, failing to consider how forecasts and the confidence in them should evolve as new evidence emerges. To address this gap, we introduce EVOLVECAST, a framework for evaluating whether large language models appropriately revise their predictions in response to new information. In particular, EVOLVECAST assesses whether LLMs adjust their forecasts when presented with information released after their training cutoff. We use human forecasters as a comparative reference to analyze prediction shifts and confidence calibration under updated contexts. While LLMs demonstrate some responsiveness to new information, their updates are often inconsistent or overly conservative. We further find that neither verbalized nor logits-based confidence estimates consistently outperform the other, and both remain far from the human reference standard. Across settings, models tend to express conservative bias, underscoring the need for more robust approaches to belief updating.

[94] Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

Guojian Li,Chengyou Wang,Hongfei Xue,Shuiyuan Wang,Dehui Gao,Zihan Zhang,Yuke Lin,Wenjie Li,Longshuai Xiao,Zhonghua Fu,Lei Xie

Main category: cs.CL

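TL;DR: Proposes Easy Turn, an open-source, modular, bimodal (acoustic and linguistic) turn-taking detection model that predicts four dialogue turn states, and releases a 1,145-hour training set. The model reaches state-of-the-art performance on an open-source test set.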

Details

Motivation: Most existing turn-taking detection models are not open-sourced, have large parameter counts, or support only a single modality, and approaches that fine-tune LLMs depend on large amounts of scarce full-duplex data. Method: Designs a modular bimodal model that fuses acoustic and linguistic information to predict four turn states (complete, incomplete, backchannel, and wait) and trains it on a large self-built dataset. Result: On the authors' Easy Turn test set, the model achieves state-of-the-art turn-taking detection accuracy compared with open-source models such as TEN Turn Detection and Smart Turn V2. Conclusion: Easy Turn is an efficient, open-source bimodal turn-taking detection solution, and its accompanying large-scale dataset supports the development of full-duplex dialogue systems. Abstract: Full-duplex interaction is crucial for natural human-machine communication, yet remains challenging as it requires robust turn-taking detection to decide when the system should speak, listen, or remain silent. Existing solutions often rely on dedicated turn-taking models, most of which are not open-sourced; the few available ones are limited by their large parameter size or by supporting only a single modality, such as acoustic or linguistic. Alternatively, some approaches finetune LLM backbones to enable full-duplex capability, but this requires large amounts of full-duplex data, which remain scarce in open-source form. To address these issues, we propose Easy Turn, an open-source, modular turn-taking detection model that integrates acoustic and linguistic bimodal information to predict four dialogue turn states: complete, incomplete, backchannel, and wait, accompanied by the release of Easy Turn trainset, a 1,145-hour speech dataset designed for training turn-taking detection models. Compared to existing open-source models like TEN Turn Detection and Smart Turn V2, our model achieves state-of-the-art turn-taking detection accuracy on our open-source Easy Turn testset. The data and model will be made publicly available on GitHub.

[95] Vision-Grounded Machine Interpreting: Improving the Translation Process through Visual Cues

Claudio Fantinuoli

Main category: cs.CL

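TL;DR: This paper introduces Vision-Grounded Interpreting (VGI), which combines visual and linguistic information to reduce ambiguity problems in machine interpreting.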

Details

Motivation: Unimodal systems are limited on translation tasks where resolving ambiguity depends on visual, situational, or pragmatic information. Method: Builds a prototype system that integrates a vision-language model, takes speech and visual input from a webcam, and uses visual context to support the translation process. A hand-crafted diagnostic corpus targeting three types of ambiguity is used for evaluation. Result: Visual grounding substantially improves lexical disambiguation, yields modest and less stable gains for gender resolution, and shows no clear benefit for syntactic ambiguities. Conclusion: Multimodal integration is a necessary direction for improving the quality of machine interpreting. Abstract: Machine Interpreting systems are currently implemented as unimodal, real-time speech-to-speech architectures, processing translation exclusively on the basis of the linguistic signal. Such reliance on a single modality, however, constrains performance in contexts where disambiguation and adequacy depend on additional cues, such as visual, situational, or pragmatic information. This paper introduces Vision-Grounded Interpreting (VGI), a novel approach designed to address the limitations of unimodal machine interpreting. We present a prototype system that integrates a vision-language model to process both speech and visual input from a webcam, with the aim of priming the translation process through contextual visual information. To evaluate the effectiveness of this approach, we constructed a hand-crafted diagnostic corpus targeting three types of ambiguity. In our evaluation, visual grounding substantially improves lexical disambiguation, yields modest and less stable gains for gender resolution, and shows no benefit for syntactic ambiguities. We argue that embracing multimodality represents a necessary step forward for advancing translation quality in machine interpreting.

[96] HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

Ken Deng,Zizheng Zhan,Wen Xiang,Wenqiang Zhu,Tianhao Peng,Xinping Lei,Weihao Li,Jingxuan Xu,Kun Wu,Yifan Yao,Haoyang Huang,Huaixi Tang,Kepeng Lei,Zhiyi Lai,Songwei Yu,Zongxian Feng,Zuchen Gao,Weihao Xie,Chenchen Zhang,Yanan Wu,Yuanxing Zhang,Lecheng Huang,Yuqun Zhang,Jie Liu,Zhaoxiang Zhang,Haotian Zhang,Bin Chen,Jiaheng Liu

Main category: cs.CL

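TL;DR: This paper proposes HiPO, a framework that uses hybrid policy optimization to give large language models adaptive control over reasoning, substantially reducing the number of tokens spent on reasoning while maintaining or improving accuracy.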

Details

Motivation: To make large language models more efficient on complex tasks by avoiding the waste and high cost of always producing lengthy chain-of-thought reasoning. Method: Proposes HiPO, which combines a hybrid data pipeline (providing paired Think-on and Think-off responses) with a hybrid reinforcement learning reward mechanism so that the model can dynamically choose whether to reason in detail. Result: On mathematics and coding benchmarks, HiPO substantially reduces reasoning token length while maintaining or even improving accuracy. Conclusion: HiPO provides a principled approach to efficient adaptive reasoning and supports deploying reasoning-oriented LLMs in resource-constrained real-world settings. Abstract: Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipeline, which provides paired Think-on and Think-off responses, with a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO can be a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.

[97] ByteSized32Refactored: Towards an Extensible Interactive Text Games Corpus for LLM World Modeling and Evaluation

Haonan Wang,Junfeng Sun,Xingdi Yuan,Ruoyao Wang,Ziang Xiao

Main category: cs.CL

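TL;DR: This paper presents ByteSized32Refactored, a refactored, modular framework for text game generation. Building the GameBasic.py foundation library halves the amount of code and improves extensibility, but experiments show that the hierarchical structure poses new challenges for LLM generation quality.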

Details

Motivation: To improve LLMs' ability at interactive world modeling, and in particular the extensibility and code reuse of text game generation, the existing code structure needs to be optimized. Method: Refactors the original ByteSized32 corpus, abstracts 7 base classes, and builds a unified GameBasic.py foundation library, yielding a modular, extensible design with less redundant code. Result: The code shrinks from 20k to 10k lines. Experiments with GPT-4o show improvements on two of the four evaluation dimensions and declines on the other two, indicating that the hierarchical structure of the refactored code poses new challenges for LLM generation. Conclusion: The modular, centralized design improves extensibility, helps LLMs adapt to environment specifications, and lays a scalable foundation for future extensions. Abstract: Simulating interactive world models remains a core challenge in Large Language Models (LLMs). In this work, we introduce ByteSized32Refactored, a refactored, modular, and extensible implementation of the original ByteSized32 corpus to explore the task of text game generation. We further optimize the code structure of each text game and create the GameBasic.py foundation library, which centralizes common logic across all 32 games by abstracting 7 base classes (GameObject, etc.) into reusable modules, thereby reducing the total from 20k to 10k lines of Python code compared to the original ByteSized32. Our refactored implementation enables extensibility: with our centralized design, ByteSized32Refactored can be more efficiently extended to include text games of new scenarios and specifications by reusing the shared logic and functionalities. Extensive experiments with GPT-4o demonstrate mixed results: with ByteSized32Refactored, the generated text games for unseen scenarios show quality improvements on two of the four evaluation dimensions but declines on the other two, indicating that the hierarchical structure of the refactored code presents new challenges for LLMs. Overall, we highlight that our extensible code structure, centered on the foundation library and the modular optimization, not only facilitates LLM adaptation to environment specifications but also establishes a scalable environment that supports future extensions.

[98] Toward Preference-aligned Large Language Models via Residual-based Model Steering

Lucio La Cava,Andrea Tagarelli

Main category: cs.CL

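TL;DR: This paper proposes PaLRS, a training-free preference alignment method that exploits preference signals in the residual streams of large language models, extracts lightweight steering vectors from a small number of preference pairs, and adjusts model behavior at inference time.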

Details

Motivation: Existing preference alignment methods such as RLHF and DPO depend on large amounts of curated data and expensive parameter optimization, and they produce task-specific models, which limits flexibility and efficiency. Method: PaLRS analyzes activation patterns in the residual streams of LLMs and extracts plug-and-play steering vectors from a small number of preference pairs. The vectors steer model outputs toward preferred behavior at inference time, with no training required. Result: On several small-to-medium-scale open-source LLMs, PaLRS-aligned models achieve consistent gains over baseline models on mathematical reasoning and code generation, perform better than DPO-aligned models, preserve general-purpose performance, and save substantial compute time. Conclusion: PaLRS offers an efficient, flexible, training-free paradigm for preference alignment that provides plug-and-play control of model behavior with minimal data and a more scalable route to aligning large models. Abstract: Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approaches such as Reinforcement Learning from Human Feedback or Direct Preference Optimization typically require curated data and expensive optimization over billions of parameters, and eventually lead to persistent task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, when compared to DPO-aligned models, they perform better with huge time savings. Our findings highlight that PaLRS offers an effective, much more efficient and flexible alternative to standard preference optimization pipelines, offering a training-free, plug-and-play mechanism for alignment with minimal data.
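
A common way to obtain a steering vector from paired examples is the difference of mean activations, and the sketch below uses that technique as a stand-in. It is an assumption that PaLRS extracts its vectors this way; the abstract only states that lightweight steering vectors are extracted from preference pairs and applied at inference time. The activations are plain lists here, and the scaling factor `alpha` is a hypothetical parameter.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(preferred_acts, dispreferred_acts):
    """Difference of mean activations: preferred minus dispreferred."""
    mean_pref = mean_vector(preferred_acts)
    mean_disp = mean_vector(dispreferred_acts)
    return [p - d for p, d in zip(mean_pref, mean_disp)]


def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a hidden state at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy residual-stream activations (2-dimensional) for two preference pairs.
preferred = [[1.0, 2.0], [3.0, 4.0]]
dispreferred = [[0.0, 1.0], [2.0, 1.0]]
v = steering_vector(preferred, dispreferred)
steered = apply_steering([0.0, 0.0], v, alpha=0.5)
```

No weights are updated anywhere in this sketch, which is what makes the approach training-free; choosing the layer and the value of `alpha` is left open here.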

[99] The Hidden Costs of Translation Accuracy: Distillation, Quantization, and Environmental Impact

Dhaathri Vijay,Anandaswarup Vadapalli

Main category: cs.CL

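TL;DR: Using machine translation as a case study, this work compares the trade-offs between translation quality and efficiency for full-scale, distilled, and quantized models, and finds that model compression can substantially reduce computational demands and environmental impact while maintaining high translation quality.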

Details

Motivation: With the rapid growth of large language models, their computational and environmental costs have drawn concern, so methods that cut resource consumption while preserving performance need to be examined. Method: Full-scale, distilled, and quantized models are compared on the Flores+ benchmark and through human evaluation of conversational translations in French, Hindi, and Kannada, and carbon emissions per evaluation run are analyzed. Result: The full 3.3B fp32 model achieves the highest BLEU scores but has the largest carbon footprint (about 0.007-0.008 kg CO2 per run); distilled models run inference up to 4.5x faster with a slight drop in BLEU; the INT4-quantized model still retains high accuracy and fluency in human evaluation. Conclusion: Compression strategies can substantially reduce computational overhead and environmental impact while keeping translation quality competitive, although the trade-offs need particular care in low-resource settings; the authors recommend making efficiency and sustainability core dimensions of how NLP progress is evaluated. Abstract: The rapid expansion of large language models (LLMs) has heightened concerns about their computational and environmental costs. This study investigates the trade-offs between translation quality and efficiency by comparing full-scale, distilled, and quantized models using machine translation as a case study. We evaluated performance on the Flores+ benchmark and through human judgments of conversational translations in French, Hindi, and Kannada. Our analysis of carbon emissions per evaluation run revealed that the full 3.3B fp32 model, while achieving the highest BLEU scores, incurred the largest environmental footprint (about 0.007-0.008 kg CO2 per run). The distilled models achieved inference up to 4.5x faster than the full 3.3B model, with only minimal reductions in BLEU scores. Human evaluations also showed that even aggressive quantization (INT4) preserved high levels of accuracy and fluency, with differences between models generally minor. These findings demonstrate that model compression strategies can substantially reduce computational demands and environmental impact while maintaining competitive translation quality, though trade-offs are more pronounced in low-resource settings. We argue for evaluation frameworks that integrate efficiency and sustainability alongside objective metrics as central dimensions of progress in NLP.

[100] The AI Agent Code of Conduct: Automated Guardrail Policy-as-Prompt Synthesis

Gauri Kholkar,Ratinder Ahuja

Main category: cs.CL

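TL;DR: Proposes "Policy as Prompt," a new framework that uses large language models to automatically convert unstructured design documents into verifiable real-time guardrails, interpreting and enforcing natural language policies through contextual understanding and the principle of least privilege.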

Details

Motivation: As autonomous AI agents see wide use in industry, safeguarding them becomes essential. Existing policy enforcement methods struggle to bridge the gap between unstructured design documents and real-time oversight needs. Method: The system first parses technical documents to build a verifiable policy tree, then compiles it into lightweight prompt-based classifiers that audit agent behavior at runtime, using LLMs to interpret and enforce natural language policies. Result: The approach is validated across a range of application scenarios, demonstrating a scalable, auditable policy-to-practice pipeline that significantly improves the regulatability and safety of AI systems. Conclusion: The framework bridges the gap between policy formulation and enforcement and offers a feasible path toward verifiably safe AI and regulatory compliance. Abstract: As autonomous AI agents are increasingly deployed in industry, it is essential to safeguard them. We introduce a novel framework that automates the translation of unstructured design documents into verifiable, real-time guardrails. We introduce "Policy as Prompt," a new approach that uses Large Language Models (LLMs) to interpret and enforce natural language policies by applying contextual understanding and the principle of least privilege. Our system first ingests technical artifacts to construct a verifiable policy tree, which is then compiled into lightweight, prompt-based classifiers that audit agent behavior at runtime. We validate our approach across diverse applications, demonstrating a scalable and auditable pipeline that bridges the critical policy-to-practice gap, paving the way for verifiably safer and more regulatable AI.

[101] MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Zijian Wu,Xiangyan Liu,Xinyuan Zhang,Lingjun Chen,Fanqing Meng,Lingxiao Du,Yiran Zhao,Fanshi Zhang,Yaoqi Ye,Jiawei Wang,Zirui Wang,Jinjie Ni,Yufan Yang,Arvin Xu,Michael Qizhe Shieh

Main category: cs.CL

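TL;DR: This paper presents MCPMark, a benchmark of 127 high-quality tasks for evaluating the use of MCP (a standard for how large language models interact with external systems) more comprehensively and realistically. The tasks are created jointly by domain experts and AI agents, involve rich CRUD operations, and are verified automatically by programmatic scripts. Experiments show that even the best-performing model, gpt-5-medium, reaches only 52.56% pass@1, indicating that current LLMs have substantial room for improvement on complex, deeply interactive tasks.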

Details

Motivation: Existing MCP benchmarks are narrow in scope, concentrating on read-heavy tasks or tasks with limited interaction depth, and do not reflect the complexity and realism of real-world workflows. A more challenging and representative benchmark is therefore needed to evaluate LLMs in practical settings. Method: MCPMark comprises 127 tasks designed collaboratively by domain experts and AI agents, each with a curated initial state and an automated verification script. Frontier LLMs are evaluated with a minimal agent framework running a tool-calling loop, measuring performance across diverse CRUD operations. Result: The best model, gpt-5-medium, reaches 52.56% pass@1 and 33.86% pass^4; other strong models such as claude-sonnet-4 and o3 fall below 30% pass@1. Tasks require on average 16.2 execution turns and 17.4 tool calls, significantly more than earlier benchmarks, reflecting higher complexity and difficulty. Conclusion: MCPMark provides a more realistic and challenging evaluation environment, exposes the limits of current LLMs on complex multi-step interactive tasks, and points to directions for improving MCP-compatible agent systems. Abstract: MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
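
The gap between the two reported metrics is easier to see with a small calculation. The abstract does not define them, so this sketch assumes the common reading: pass@1 is the success rate averaged over all runs, and pass^k counts a task only if all k of its runs succeed. The function names and the outcome table are hypothetical.

```python
def pass_at_1(results):
    """Mean success rate over every run of every task."""
    runs = [outcome for task in results for outcome in task]
    return sum(runs) / len(runs)

def pass_hat_k(results):
    """Fraction of tasks solved in all of their k runs (assumed meaning of pass^k)."""
    return sum(all(task) for task in results) / len(results)

# Three hypothetical tasks, four runs each.
results = [
    [True, True, True, True],
    [True, False, True, True],
    [False, False, False, False],
]

print(pass_at_1(results))   # 7 successes out of 12 runs
print(pass_hat_k(results))  # 1 task out of 3 solved in every run
```

Under this reading pass^k can never exceed pass@1, so the distance between the two reflects how inconsistent a model is across repeated attempts.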

[102] Sequential Diffusion Language Models

Yangzhou Liu,Yue Cao,Hao Li,Gen Luo,Zhe Chen,Weiyun Wang,Xiaobo Liang,Biqing Qi,Lijun Wu,Changyao Tian,Yanting Zhang,Yuqiang Li,Tong Lu,Yu Qiao,Jifeng Dai,Wenhai Wang

Main category: cs.CL

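TL;DR: This paper proposes the Next Sequence Prediction (NSP) framework and the Sequential Diffusion Language Model (SDLM), which unify next-token and next-block prediction, adapt the generation length, remain compatible with KV caches, and allow efficient fine-tuning of pretrained models; with few training samples, SDLM matches or surpasses strong autoregressive models and significantly raises inference throughput.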

Details

Motivation: Existing diffusion language models are limited by fixed-length decoding and incompatibility with KV caches; block diffusion improves on this but still needs expensive training and enforces a fixed block size, so a more flexible, efficient generation paradigm that fits existing architectures is needed. Method: The Next Sequence Prediction (NSP) framework unifies next-token and next-block prediction, letting the model decide the generation length adaptively at each step. Built on NSP, the Sequential Diffusion Language Model (SDLM) performs diffusion inference within fixed-size mask blocks while dynamically decoding consecutive subsequences according to model confidence, which preserves KV-cache compatibility. Result: SDLM matches or surpasses strong autoregressive baselines with only 3.5M training samples, achieves 2.1x higher inference throughput than Qwen-2.5, and SDLM-32B shows even stronger efficiency gains and scalability. Conclusion: NSP and SDLM provide a flexible, efficient, and scalable language generation framework that combines the theoretical advantages of diffusion models with the engineering convenience of autoregressive models, with good prospects for practical use. Abstract: Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value (KV) caches. Block diffusion mitigates these issues, yet still enforces a fixed block size and requires expensive training. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction, enabling the model to adaptively determine the generation length at each step. When the length is fixed to 1, NSP reduces to standard next-token prediction. Building on NSP, we propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost. Specifically, SDLM performs diffusion inference within fixed-size mask blocks, but dynamically decodes consecutive subsequences based on model confidence, thereby preserving KV-cache compatibility and improving robustness to varying uncertainty and semantics across the sequence. Experiments show that SDLM matches or surpasses strong autoregressive baselines using only 3.5M training samples, while achieving 2.1x higher throughput than Qwen-2.5. Notably, the SDLM-32B model delivers even more pronounced efficiency gains, demonstrating the strong scalability potential of our modeling paradigm. Project page and codes: https://github.com/OpenGVLab/SDLM
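
One way to picture "dynamically decodes consecutive subsequences based on model confidence" is a rule that accepts the longest leading run of confident positions in a block. This is a minimal sketch under that assumption; the threshold value and the function name `confident_prefix_length` are hypothetical, and the abstract does not specify the actual acceptance rule.

```python
def confident_prefix_length(confidences, threshold=0.9):
    """Length of the longest leading run with confidence >= threshold, never below 1."""
    n = 0
    for c in confidences:
        if c < threshold:
            break
        n += 1
    # Always emit at least one token so decoding makes progress.
    return max(n, 1)

# Hypothetical per-position confidences for one 4-token mask block.
block = [0.99, 0.95, 0.62, 0.97]
accepted = confident_prefix_length(block)  # only the first two positions are kept
```

Accepting only a contiguous prefix keeps the decoded text left-to-right, which is what allows an ordinary KV cache to be reused. When no position clears the threshold, the rule falls back to a single token, matching the abstract's remark that length 1 reduces to next-token prediction.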

[103] SparseD: Sparse Attention for Diffusion Language Models

Zeqing Wang,Gongfan Fang,Xinyin Ma,Xingyi Yang,Xinchao Wang

Main category: cs.CL

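TL;DR: Proposes SparseD, a new sparse attention method for diffusion language models (DLMs) that pre-computes head-specific sparse patterns and uses full attention in early denoising steps before switching to sparse attention later, achieving lossless acceleration.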

Details

Motivation: Existing open-source DLMs suffer from high inference latency, mainly because attention has quadratic complexity in long contexts; conventional sparse attention methods are unsuitable because they ignore DLM-specific attention behavior, such as differences across heads and the importance of early steps. Method: Based on observations of DLM attention behavior, SparseD pre-computes a sparse pattern for each attention head and reuses it across all denoising steps, using full attention in early denoising steps to protect generation quality and switching to sparse attention later to improve efficiency. Result: At a 64k context length with 1,024 denoising steps, SparseD achieves up to a 1.50x speedup over FlashAttention with no loss in generation quality. Conclusion: SparseD is an efficient, practical solution for deploying DLMs, particularly in long-context applications, and resolves the incompatibility of existing sparse attention methods with DLMs. Abstract: While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to the attention's quadratic complexity with respect to context length in computing all query-key pairs. Intuitively, to reduce this complexity, a natural strategy is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well-established in ARs, where attention follows fixed and clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging the observations, SparseD only requires pre-computing head-specific sparse patterns one time, and reuses them across all steps. This avoids recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps, then switches to sparse attention later to maintain generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. 
Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps.
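
The two ingredients described above, a per-head sparse pattern computed once and a step schedule that starts with full attention, can be sketched for a single head as follows. The top-k selection rule, the `full_fraction` parameter, and all function names are hypothetical illustrations rather than SparseD's actual procedure, and the dense-then-mask computation shows the logic only; it gives none of the speedup a real sparse kernel would.

```python
import numpy as np

def head_topk_mask(scores, keep):
    """Boolean mask keeping the `keep` largest scores in each query row of one head."""
    idx = np.argsort(scores, axis=-1)[:, -keep:]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def attention(q, k, v, mask=None):
    """Softmax attention for one head; masked-out pairs receive zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def use_sparse(step, total_steps, full_fraction=0.2):
    """Full attention for the first `full_fraction` of steps, sparse afterwards."""
    return step >= int(full_fraction * total_steps)

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 4))         # toy head: 6 tokens, dimension 4
mask = head_topk_mask(q @ k.T, keep=2)       # computed once, reused at later steps
out = attention(q, k, v, mask if use_sparse(5, 10) else None)
```

Reusing one mask per head relies on the abstract's second observation, that a head's attention pattern stays highly similar across denoising steps.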

[104] ResFormer: All-Time Reservoir Memory for Long Sequence Classification

Hongbo Liu,Jia Xu

Main category: cs.CL

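TL;DR: Proposes ResFormer, a new neural architecture that combines reservoir computing with a Transformer to model long- and short-range context dependencies efficiently in linear time; it significantly outperforms baselines on several conversational emotion and intent tasks while reducing memory consumption.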

Details

Motivation: The quadratic complexity of Transformers limits input length and makes long contexts hard to process efficiently, and existing methods struggle to balance computational efficiency against performance. Method: ResFormer uses a cascaded structure: a reservoir computing network with a nonlinear readout captures long-term contextual dependencies in linear time, while a Transformer with fixed-length inputs models short-term dependencies within sentences. Result: ResFormer significantly outperforms DeepSeek-Qwen and ModernBERT on EmoryNLP, MultiWOZ, MELD, and IEMOCAP, with an accuracy gain of up to 22.3% on EmoryNLP and lower memory consumption. Conclusion: ResFormer addresses the efficiency and performance bottleneck of Transformers in long-sequence classification, balancing accuracy against compute, and suits sequence classification tasks that must handle long contexts. Abstract: Sequence classification is essential in NLP for understanding and categorizing language patterns in tasks like sentiment analysis, intent detection, and topic classification. Transformer-based models, despite achieving state-of-the-art performance, have inherent limitations due to quadratic time and memory complexity, restricting their input length. Although extensive efforts have aimed at reducing computational demands, processing extensive contexts remains challenging. To overcome these limitations, we propose ResFormer, a novel neural network architecture designed to model varying context lengths efficiently through a cascaded methodology. ResFormer integrates a reservoir computing network featuring a nonlinear readout to effectively capture long-term contextual dependencies in linear time. Concurrently, short-term dependencies within sentences are modeled using a conventional Transformer architecture with fixed-length inputs. Experiments demonstrate that ResFormer significantly outperforms baseline models of DeepSeek-Qwen and ModernBERT, delivering an accuracy improvement of up to +22.3% on the EmoryNLP dataset and consistent gains on MultiWOZ, MELD, and IEMOCAP. In addition, ResFormer exhibits reduced memory consumption, underscoring its effectiveness and efficiency in modeling extensive contextual information.
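
For readers unfamiliar with reservoir computing, the linear-time claim follows from the recurrence itself: a fixed random recurrent network is updated once per input. The sketch below is a generic echo-state-style reservoir, not ResFormer's actual component; the dimensions, the spectral radius of 0.9, and the function names are assumptions, and the nonlinear readout mentioned in the abstract is left out.

```python
import numpy as np

def make_reservoir(input_dim, state_dim, spectral_radius=0.9, seed=0):
    """Fixed random input and recurrent weights; nothing here is trained."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-1.0, 1.0, size=(state_dim, input_dim))
    w = rng.uniform(-1.0, 1.0, size=(state_dim, state_dim))
    # Rescale so the largest eigenvalue magnitude equals spectral_radius.
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w

def run_reservoir(inputs, w_in, w):
    """One state update per input, so cost grows linearly with sequence length."""
    state = np.zeros(w.shape[0])
    for x in inputs:
        state = np.tanh(w_in @ x + w @ state)
    return state

w_in, w = make_reservoir(input_dim=4, state_dim=16)
summary = run_reservoir(np.ones((50, 4)), w_in, w)  # fixed-size summary of 50 inputs
```

The final state has a fixed size no matter how long the input is, which is the property that lets a downstream classifier consume arbitrarily long context at constant memory.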

[105] Ensembling Multilingual Transformers for Robust Sentiment Analysis of Tweets

Meysam Shirdel Bilehsavar,Negin Mahmoudi,Mohammad Jalili Torkamani,Kiana Kiashemshaki

Main category: cs.CL

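TL;DR: Proposes a multilingual sentiment analysis method based on a Transformer ensemble and a large language model, reaching over 86% performance on datasets in multiple languages.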

Details

Motivation: For lack of labelled data, existing sentiment analysis methods perform poorly on foreign languages, so a method is needed that works across languages without large amounts of labelled data. Method: An ensemble is built from multilingual pretrained sentiment analysis models (such as bert-base-multilingual-uncased-sentiment and XLM-R) and combined with a large language model for multilingual sentiment analysis. Result: The proposed method achieves sentiment analysis accuracy above 86% on multilingual datasets. Conclusion: The ensemble approach effectively improves multilingual sentiment analysis, especially for languages with scarce labelled data. Abstract: Sentiment analysis is a very important natural language processing activity in which one identifies the polarity of a text, whether it conveys positive, negative, or neutral sentiment. Along with the growth of social media and the Internet, the significance of sentiment analysis has grown across numerous industries such as marketing, politics, and customer service. Sentiment analysis is flawed, however, when applied to foreign languages, particularly when there is no labelled data to train models upon. In this study, we present a transformer ensemble model and a large language model (LLM) that perform sentiment analysis in other languages. We used a multilingual dataset. Sentiment was then assessed for sentences using an ensemble of pre-trained sentiment analysis models: bert-base-multilingual-uncased-sentiment and XLM-R. Our experimental results indicated that sentiment analysis performance was more than 86% using the proposed method.
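
The abstract does not say how the ensemble members' outputs are combined. One common option is soft voting, sketched below under the assumption that every model's output has already been mapped to the same three-way label set; `LABELS`, `soft_vote`, and the probability values are hypothetical.

```python
import numpy as np

LABELS = ["negative", "neutral", "positive"]

def soft_vote(prob_rows, weights=None):
    """Weighted average of per-model class probabilities; returns (label, averaged probs)."""
    probs = np.average(np.asarray(prob_rows, dtype=float), axis=0, weights=weights)
    return LABELS[int(np.argmax(probs))], probs

# Hypothetical outputs of two models for a single tweet.
label, probs = soft_vote([[0.1, 0.3, 0.6], [0.2, 0.5, 0.3]])
```

In this toy case the second model alone would predict "neutral", yet the averaged distribution favors "positive", which shows how a confident member can outweigh a hesitant one.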

[106] Large-Scale Constraint Generation -- Can LLMs Parse Hundreds of Constraints?

Matteo Boffa,Jiaxuan You

Main category: cs.CL

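TL;DR: This paper introduces the Large-Scale Constraint Generation (LSCG) problem, evaluates how large models handle large numbers of fine-grained constraints through the "Words Checker" task, and proposes the FoCusNet model to improve performance.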

Details

Motivation: Prior work mostly studies LLM generation under a few task-specific constraints and lacks an evaluation of how models handle large-scale, fine-grained, generic constraints, so a new problem and method are needed to measure and improve this ability. Method: The paper introduces the LSCG problem and its instance Words Checker, systematically evaluates the effect of model characteristics and steering techniques, and designs FoCusNet, a model that reduces the original constraint list to a relevant subset so the LLM can focus on the key constraints. Result: Existing methods degrade markedly as the number of constraints grows, while adding FoCusNet improves accuracy by 8-13%. Conclusion: LLMs face difficulties when handling large numbers of constraints; FoCusNet improves their constraint adherence and offers a workable approach to controllable generation under complex constraints. Abstract: Recent research has explored the constrained generation capabilities of Large Language Models (LLMs) when explicitly prompted with a few task-specific requirements. In contrast, we introduce Large-Scale Constraint Generation (LSCG), a new problem that evaluates whether LLMs can parse a large, fine-grained, generic list of constraints. To examine the LLMs' ability to handle an increasing number of constraints, we create a practical instance of LSCG, called Words Checker. In Words Checker, we evaluate the impact of model characteristics (e.g., size, family) and steering techniques (e.g., Simple Prompt, Chain of Thought, Best of N) on performance. We also propose FoCusNet, a small and dedicated model that parses the original list of constraints into a smaller subset, helping the LLM focus on relevant constraints. Experiments reveal that existing solutions suffer a significant performance drop as the number of constraints increases, with FoCusNet showing an 8-13% accuracy boost.

[107] GEAR: A General Evaluation Framework for Abductive Reasoning

Kaiyu He,Peilin Wu,Mian Zhang,Kun Wan,Wentian Zhao,Xinya Du,Zhiyu Chen

Main category: cs.CL

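TL;DR: This paper presents GEAR, an automated, transparent, label-free framework for evaluating how well large language models generate plausible hypotheses in abductive reasoning, and shows empirically that it helps improve the diversity and reliability of model hypotheses.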

Details

Motivation: The core question is whether LLMs can discover new knowledge and how to evaluate that ability; existing methods rely on human annotation and static benchmarks, which makes it hard to measure a model's ability to generate novel, plausible hypotheses. Method: GEAR scores hypothesis sets automatically on three metrics: consistency, generalizability, and diversity. Nine LLMs are analyzed in detail on four abductive reasoning benchmarks, and a momentum-based curriculum learning strategy adjusts training data dynamically using GEAR scores. Result: The experiments generate over 50,000 candidate hypotheses and reveal differences between models; the curriculum strategy improves all GEAR objectives without supervision, and the gains transfer to established benchmarks. Conclusion: GEAR provides a scalable, reliable, and open-ended framework for evaluating and training abductive reasoning, helping LLMs generate more diverse and reliable new knowledge. Abstract: Since the advent of large language models (LLMs), research has focused on instruction following and deductive reasoning. A central question remains: can these models discover new knowledge, and how can we evaluate this ability? We address this by studying abductive reasoning (the generation of plausible hypotheses to explain observations) and introduce GEAR (General Evaluation for Abductive Reasoning), a general-purpose, fully automated, transparent, and label-free evaluation paradigm. GEAR scores hypothesis sets by three metrics: consistency (each hypothesis explains the observations), generalizability (consistent hypotheses make meaningful predictions on unseen inputs), and diversity (the set covers distinct predictions and patterns). Built this way, GEAR is scalable (no human gold answers), reliable (deterministic scoring aligned with classical abduction), and open-ended (scores improve only when models produce new plausible hypotheses, unlike static benchmarks that saturate once accuracy is high). Using GEAR, we conduct a fine-grained study of nine LLMs on four abduction benchmarks with 1,500 problems, generating over 50,000 candidate hypotheses and revealing model differences obscured by gold-answer or purely human evaluations. We further propose a momentum-based curriculum that adjusts GEAR-derived training data by learning velocity: it starts with what the model learns quickly and shifts toward harder objectives such as generating diverse hypotheses once the model is confident on foundational objectives. 
Without gold-label supervision, this strategy improves all GEAR objectives and these gains transfer to established abductive reasoning benchmarks. Taken together, GEAR provides a principled framework that evaluates abduction and supplies label-free, scalable training signals that help LLMs produce more diverse and reliable hypotheses.

[108] BTC-SAM: Leveraging LLMs for Generation of Bias Test Cases for Sentiment Analysis Models

Zsolt T. Kardkovács,Lynda Djennane,Anna Field,Boualem Benatallah,Yacine Gaci,Fabio Casati,Walid Gaaloul

Main category: cs.CL

TL;DR: Proposes BTC-SAM, a new LLM-based bias testing framework that generates high-quality, diverse test cases for detecting social bias in sentiment analysis models.

Details

Motivation: Sentiment analysis models carry harmful social biases, and traditional ways of constructing test sentences are costly and struggle to cover many kinds of bias. Method: Large language models (LLMs) are used for controllable generation of test sentences, producing diverse bias test cases efficiently from a minimal input specification. Result: Compared with base prompting, the method yields higher linguistic diversity and test coverage, and remains effective even for previously unseen biases. Conclusion: BTC-SAM lowers the cost and barrier of bias testing and improves the fairness and reliability of sentiment analysis models. Abstract: Sentiment Analysis (SA) models harbor inherent social biases that can be harmful in real-world applications. These biases are identified by examining the output of SA models for sentences that only vary in the identity groups of the subjects. Constructing natural, linguistically rich, relevant, and diverse sets of sentences that provide sufficient coverage over the domain is expensive, especially when addressing a wide range of biases: it requires domain experts and/or crowd-sourcing. In this paper, we present a novel bias testing framework, BTC-SAM, which generates high-quality test cases for bias testing in SA models with minimal specification using Large Language Models (LLMs) for the controllable generation of test sentences. Our experiments show that relying on LLMs can provide high linguistic variation and diversity in the test sentences, thereby offering better test coverage compared to base prompting methods even for previously unseen biases.

[109] Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Distributional Semantics

Guangliang Liu,Xi Chen,Bocheng Chen,Xitong Zhang,Kristen Johnson

Main category: cs.CL

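TL;DR: This paper examines the challenge of generalization in moral reasoning for large language models (LLMs) and proposes pragmatic inference methods grounded in moral foundations theory, which use contextual information to bridge the gap between distributional semantics and pragmatic-level morality, significantly improving LLM generalization on moral reasoning tasks.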

Details

Motivation: LLMs are good at capturing distributional semantics, but moral reasoning operates at the pragmatic level, which makes generalization in moral reasoning difficult and calls for new methods to close the gap. Method: The paper proposes pragmatic inference methods grounded in moral foundations theory, which use contextual information at each step to connect moral foundations with moral reasoning objectives and so guide LLMs toward more effective reasoning. Result: Experiments show that the proposed methods significantly improve LLM generalization on moral reasoning tasks. Conclusion: The study offers an effective path for LLM moral reasoning based on moral foundations theory and supports future research in the area. Abstract: Moral reasoning has emerged as a promising research direction for Large Language Models (LLMs), yet achieving generalization remains a central challenge. From a linguistic standpoint, this difficulty arises because LLMs are adept at capturing distributional semantics, which differs fundamentally from morals, which operate at the pragmatic level. This paper investigates how LLMs can achieve generalized moral reasoning despite their reliance on distributional semantics. We propose pragmatic inference methods grounded in moral foundations theory, which leverage contextual information at each step to bridge the pragmatic gap and guide LLMs in connecting moral foundations with moral reasoning objectives. Experimental results demonstrate that our approach significantly enhances LLMs' generalization in moral reasoning, providing a foundation for future research grounded in moral foundations theory.

[110] Dual-Scale World Models for LLM Agents Towards Hard-Exploration Problems

Minsoo Kim,Seung-won Hwang

Main category: cs.CL

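TL;DR: This paper proposes GLoW, a new method that uses dual-scale world models to improve LLM agents on "hard-exploration" tasks; it achieves state-of-the-art results among LLM-based methods on the Jericho text-game benchmark while requiring only 1/100 to 1/800 of the environment interactions of reinforcement learning methods.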

Details

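Motivation: Existing LLM agents perform poorly on "hard-exploration" tasks that require learning new knowledge through exploration, and they lack an effective mechanism to guide that exploration. Method: The GLoW framework combines dual-scale world models: at the global scale it maintains a trajectory frontier of high-value discoveries, and at the local scale it learns from trial and error through a Multi-path Advantage Reflection mechanism, which uses advantage-based progress signals to guide exploration. Result: On the Jericho text-game benchmark, GLoW reaches the state of the art among LLM-based methods and performs comparably to state-of-the-art reinforcement learning methods while needing only 1/100 to 1/800 of their environment interactions. Conclusion: Through dual-scale modeling and advantage-guided exploration, GLoW significantly improves the efficiency and performance of LLM agents on complex exploration tasks and offers an effective way to reduce environment interaction cost. Abstract: LLM-based agents have seen promising advances, yet they are still limited in "hard-exploration" tasks requiring learning new knowledge through exploration. We present GLoW, a novel approach leveraging dual-scale world models, maintaining a trajectory frontier of high-value discoveries at the global scale, while learning from local trial-and-error in exploration through a Multi-path Advantage Reflection mechanism which infers advantage-based progress signals to guide exploration. To evaluate our framework for hard-exploration, we tackle the Jericho benchmark suite of text-based games, where GLoW achieves a new state-of-the-art performance for LLM-based approaches. Compared to state-of-the-art RL-based methods, our approach achieves comparable performance while requiring 100-800x fewer environment interactions.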

[111] EduVidQA: Generating and Evaluating Long-form Answers to Student Questions based on Lecture Videos

Sourjyadip Ray,Shubham Sharma,Somak Aditya,Pawan Goyal

Main category: cs.CL

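TL;DR: This paper proposes using Multimodal Large Language Models (MLLMs) to automatically answer student questions about online lectures and builds the EduVidQA dataset of 5252 question-answer pairs. Experiments evaluate several state-of-the-art MLLMs on the task, showing how challenging it is and how effective synthetic data is for finetuning.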

Details

Motivation: As digital platforms change how education works, interactivity in online learning is vital. Existing methods struggle with complex multimodal questions arising from video lectures, so automated solutions are needed that understand audio-visual content and answer student questions accurately. Method: The EduVidQA dataset is built from real and synthetic question-answer pairs drawn from 296 computer science videos; a benchmark for MLLMs is designed and evaluated with both text-based and qualitative metrics; and the role of synthetic data in finetuning is studied. Result: Current state-of-the-art MLLMs still show limitations on this task, which is highly challenging; finetuning on synthetic data improves some models; and the analysis of student preferences provides an important reference for future work. Conclusion: The study establishes a new benchmark for automatic question answering over video lectures, advances the use of multimodal large models in education, and points to research directions at the intersection of natural language processing and education. Abstract: As digital platforms redefine educational paradigms, ensuring interactivity remains vital for effective learning. This paper explores using Multimodal Large Language Models (MLLMs) to automatically respond to student questions from online lectures, a novel question answering task of real-world significance. We introduce the EduVidQA Dataset with 5252 question-answer pairs (both synthetic and real-world) from 296 computer science videos covering diverse topics and difficulty levels. To understand the needs of the dataset and task evaluation, we empirically study the qualitative preferences of students, which we provide as an important contribution to this line of work. Our benchmarking experiments cover 6 state-of-the-art MLLMs, through which we study the effectiveness of our synthetic data for finetuning, as well as showing the challenging nature of the task. We evaluate the models using both text-based and qualitative metrics, thus showing a nuanced perspective of the models' performance, which is paramount to future work. This work not only sets a benchmark for this important problem, but also opens exciting avenues for future research in the field of Natural Language Processing for Education.

[112] Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE

Guancheng Wan,Lucheng Fu,Haoxin Liu,Yiqiao Jin,Hui Yi Leong,Eric Hanchen Jiang,Hejia Geng,Jinhe Bi,Yunpu Ma,Xiangru Tang,B. Aditya Prakash,Yizhou Sun,Wei Wang

Main category: cs.CL

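TL;DR: This paper introduces the notion of textual sharpness, which describes how small paraphrases within a prompt's semantic neighborhood cause large performance swings, and proposes two gradient-free prompt optimization frameworks, TARE and ATARE, which improve prompt stability and generalization through adversarial search and robust selection.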

Details

Motivation: Existing prompt optimization methods focus on point-wise accuracy and neglect stability under semantically equivalent rewrites (paraphrase invariance), which leaves automated prompt search brittle and unreliable in practice. The authors aim to address prompt sensitivity to small meaning-preserving paraphrases. Method: TARE alternates between an inner sampling-based adversarial search that generates hard paraphrases to test a prompt's robustness and an outer robust selection that favors prompts whose semantic neighborhood performs well overall. ATARE additionally learns anisotropic per-dimension weights and adapts the neighborhood radius dynamically to balance exploration and fidelity. The whole method is black-box and does not rely on model gradients. Result: TARE and ATARE are validated on a range of tasks; the resulting prompts keep high accuracy while becoming significantly more robust to paraphrasing, outperform accuracy-only optimization, and keep computational cost manageable. Conclusion: By formalizing textual sharpness in prompt space and introducing a robustness criterion over a semantic neighborhood, TARE and ATARE offer an effective, practical, gradient-free way to build stable, generalizable prompts and make automated prompt search more reliable. Abstract: The performance of Large Language Models (LLMs) hinges on carefully engineered prompts. However, prevailing prompt optimization methods, ranging from heuristic edits and reinforcement learning to evolutionary search, primarily target point-wise accuracy. They seldom enforce paraphrase invariance or search stability. As a result, automated prompt search remains brittle in practice: small, semantically preserving paraphrases often cause large performance swings. We identify this brittleness as the textual sharpness of the prompt landscape. In this work, we provide the first formal treatment of textual sharpness in the discrete, semantic space of prompts, together with an operational robustness criterion over a semantic neighborhood; the design is black-box or API-only, requiring no gradients to update the model's parameters. Then we introduce TARE (Textual Sharpness-Aware Evolving), a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases and an outer, robust selection that prefers candidates whose neighborhoods remain strong. We further propose ATARE, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity. 
We evaluate our methods on diverse tasks. Their design, which minimizes the textual sharpness gap, yields prompts that preserve accuracy under paraphrasing and outperform accuracy-only prompt search while remaining computationally practical.
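
The alternation between an inner adversarial search and an outer robust selection can be pictured with a worst-case score over a sampled neighborhood. This is a minimal sketch under assumptions: the `paraphrase` and `evaluate` callables stand in for an LLM paraphraser and a task scorer, the min-over-neighborhood criterion is one plausible reading of "neighborhoods remain strong", and the score table is invented for illustration.

```python
def robust_score(prompt, paraphrase, evaluate, n_samples=2):
    """Worst accuracy over the prompt and `n_samples` sampled paraphrases of it."""
    neighborhood = [prompt] + [paraphrase(prompt, i) for i in range(n_samples)]
    return min(evaluate(p) for p in neighborhood)

def select_robust(candidates, paraphrase, evaluate, n_samples=2):
    """Prefer the candidate whose whole neighborhood stays strong."""
    return max(candidates, key=lambda p: robust_score(p, paraphrase, evaluate, n_samples))

# Hypothetical accuracies: prompt "A" is best on its own but has one weak paraphrase.
scores = {"A": 0.90, "A~0": 0.40, "A~1": 0.85,
          "B": 0.80, "B~0": 0.78, "B~1": 0.79}
paraphrase = lambda prompt, i: f"{prompt}~{i}"
evaluate = scores.get

best = select_robust(["A", "B"], paraphrase, evaluate)  # picks "B"
```

Accuracy-only search would keep "A" because of its 0.90 point score, while the robust criterion keeps "B", whose worst paraphrase still scores 0.78.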

[113] Your thoughts tell who you are: Characterize the reasoning patterns of LRMs

Yida Chen,Yuning Mao,Xianjun Yang,Suyu Ge,Shengjie Bi,Lijuan Liu,Saghar Hosseini,Liang Tan,Yixin Nie,Shaoliang Nie

Main category: cs.CL

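TL;DR: This paper introduces the LLM-proposed Open Taxonomy (LOT), a method for comparing the reasoning processes of large reasoning models (LRMs). It reveals systematic differences in how they think on math, science, and coding tasks, explains those differences through a natural-language taxonomy, and shows that aligning reasoning style can improve smaller models.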

Details

Motivation: Existing comparisons of large reasoning models concentrate on macro-level statistics and offer little insight into how different models reason, so a method is needed that can analyze and classify reasoning processes in detail. Method: LOT uses a generative language model to compare reasoning traces from two LRMs, extracts their distinctive features, builds a predictive model from the features' empirical distributions, and iterates to produce a human-readable taxonomy that characterizes how the models think. Result: Across 12 open-source LRMs, LOT achieves 80-100% accuracy in distinguishing reasoning traces from models that differ in scale, base model family, or target domain, and it finds systematic differences in their thinking; a case study shows that aligning the reasoning style of smaller Qwen3 models with that of the largest Qwen3 raises their GPQA accuracy by 3.3-5.7%. Conclusion: LOT not only distinguishes how different LRMs reason but also provides an interpretable natural-language taxonomy, which helps explain model thinking and can be used to improve performance by optimizing reasoning style. Abstract: Current comparisons of large reasoning models (LRMs) focus on macro-level statistics such as task accuracy or reasoning length. Whether different LRMs reason differently remains an open question. To address this gap, we introduce the LLM-proposed Open Taxonomy (LOT), a classification method that uses a generative language model to compare reasoning traces from two LRMs and articulate their distinctive features in words. LOT then models how these features predict the source LRM of a reasoning trace based on their empirical distributions across LRM outputs. Iterating this process over a dataset of reasoning traces yields a human-readable taxonomy that characterizes how models think. We apply LOT to compare the reasoning of 12 open-source LRMs on tasks in math, science, and coding. LOT identifies systematic differences in their thoughts, achieving 80-100% accuracy in distinguishing reasoning traces from LRMs that differ in scale, base model family, or objective domain. Beyond classification, LOT's natural-language taxonomy provides qualitative explanations of how LRMs think differently. Finally, in a case study, we link the reasoning differences to performance: aligning the reasoning style of smaller Qwen3 models with that of the largest Qwen3 during test time improves their accuracy on GPQA by 3.3-5.7%.

[114] Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis

Haolin Yang,Hakaze Cho,Naoya Inoue

Main category: cs.CL

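TL;DR: Proposes a new framework based on Task Subspace Logit Attribution (TSLA) that identifies the attention heads responsible for task recognition and task learning in in-context learning and reveals how they work.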

Details

Motivation: To reconcile the two dominant perspectives on in-context learning (ICL) in large language models: the component-level analysis of attention heads and the holistic decomposition of ICL into task recognition and task learning. Method: The Task Subspace Logit Attribution (TSLA) framework is combined with correlation analysis, ablation studies, input perturbations, and steering experiments to analyze the role of attention heads in task recognition and task learning. Result: Attention heads specialized in task recognition (TR) and task learning (TL) are identified; TR heads promote recognition by aligning hidden states with the task subspace, and TL heads rotate states within the subspace to enable prediction. Conclusion: The framework gives a unified, interpretable account of how large language models carry out in-context learning and integrates earlier findings on induction heads and task vectors. Abstract: We investigate the mechanistic underpinnings of in-context learning (ICL) in large language models by reconciling two dominant perspectives: the component-level analysis of attention heads and the holistic decomposition of ICL into Task Recognition (TR) and Task Learning (TL). We propose a novel framework based on Task Subspace Logit Attribution (TSLA) to identify attention heads specialized in TR and TL, and demonstrate their distinct yet complementary roles. Through correlation analysis, ablation studies, and input perturbations, we show that the identified TR and TL heads independently and effectively capture the TR and TL components of ICL. Using steering experiments with geometric analysis of hidden states, we reveal that TR heads promote task recognition by aligning hidden states with the task subspace, while TL heads rotate hidden states within the subspace toward the correct label to facilitate prediction. We further show how previous findings on ICL mechanisms, including induction heads and task vectors, can be reconciled with our attention-head-level analysis of the TR-TL decomposition. Our framework thus provides a unified and interpretable account of how large language models execute ICL across diverse tasks and settings.

[115] Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insight

Haolin Yang,Hakaze Cho,Kaize Ding,Naoya Inoue

Main category: cs.CL

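TL;DR: This paper proposes directly training Learned Task Vectors (LTVs), which outperform conventionally extracted task vectors, and clarifies their mechanistic role in in-context learning: LTVs influence predictions mainly through the OV circuits of attention heads, and their propagation is approximately linear.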

Details

Motivation: Existing methods for extracting task vectors (TVs) are cumbersome and opaque, and the mechanisms by which TVs affect model computation are poorly understood, so more effective and interpretable methods are needed to study the role of TVs in ICL. Method: The paper proposes directly training Learned Task Vectors (LTVs) and evaluates them across layers, positions, and prompts; a systematic analysis examines how TVs act through the OV pathways of attention heads and how they propagate through the Transformer. Result: LTVs surpass extracted TVs in accuracy and flexibility and act effectively at arbitrary layers and positions; the mechanistic analysis shows that TVs influence predictions mainly through the OV circuits of a small set of "key heads," and that TV propagation is approximately linear, with rotation early and scaling later. Conclusion: LTVs provide a practical way to obtain effective task vectors and a new theoretical lens on the mechanistic foundations of in-context learning in large models. Abstract: Large Language Models (LLMs) can perform new tasks from in-context demonstrations, a phenomenon known as in-context learning (ICL). Recent work suggests that these demonstrations are compressed into task vectors (TVs), compact task representations that LLMs exploit for predictions. However, prior studies typically extract TVs from model outputs or hidden states using cumbersome and opaque methods, and they rarely elucidate the mechanisms by which TVs influence computation. In this work, we address both limitations. First, we propose directly training Learned Task Vectors (LTVs), which surpass extracted TVs in accuracy and exhibit superior flexibility, acting effectively at arbitrary layers, positions, and even with ICL prompts. Second, through systematic analysis, we investigate the mechanistic role of TVs, showing that at the low level they steer predictions primarily through attention-head OV circuits, with a small subset of "key heads" most decisive. At a higher level, we find that despite Transformer nonlinearities, TV propagation is largely linear: early TVs are rotated toward task-relevant subspaces to improve logits of relevant labels, while later TVs are predominantly scaled in magnitude. Taken together, LTVs not only provide a practical approach for obtaining effective TVs but also offer a principled lens into the mechanistic foundations of ICL.

[116] Retrieval-augmented GUI Agents with Generative Guidelines

Ran Xu,Kaixin Ma,Wenhao Yu,Hongming Zhang,Joyce C. Ho,Carl Yang,Dong Yu

Main category: cs.CL

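TL;DR: Proposes RAG-GUI, a lightweight VLM that uses web tutorials at inference time to improve GUI agents, with strong generalization and plug-and-play capability.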

Details

Motivation: Existing GUI agents are limited in real-world use by scarce training data and by task complexity that involves long-tailed knowledge. Method: Warm-start the model with supervised finetuning (SFT), refine it further with self-guided rejection sampling finetuning (RSF), and bring in web tutorials as an external knowledge source at inference time. Result: Evaluated on three distinct tasks, RAG-GUI consistently outperforms baseline agents and improves on other inference baselines by 2.6% to 13.3% across two model sizes. Conclusion: RAG-GUI is model-agnostic and can serve as a generic plug-in that improves VLM-based agents in real-world scenarios, with strong generalization and plug-and-play behavior. Abstract: GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI, a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.

[117] Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models

Zhimeng Luo,Lixin Wu,Adam Frisch,Daqing He

Main category: cs.CL

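TL;DR: This paper proposes MedIRT, an evaluation framework for medical LLMs based on Item Response Theory (IRT). Using responses from 80 LLMs on 1,100 USMLE-aligned questions, it jointly estimates model ability, question difficulty, and discrimination, reveals "spiky" ability profiles that plain accuracy cannot capture, and supports finer-grained model evaluation and benchmark auditing.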

Details

Motivation: Traditional accuracy metrics do not adequately reflect how LLMs really perform in high-stakes medical applications, because they capture neither question characteristics nor topic-specific ability, so a more reliable and fine-grained evaluation method is needed. Method: Grounded in Item Response Theory (IRT), fit one unidimensional two-parameter logistic model per medical topic; prospectively collect answers from 80 diverse LLMs on a balanced set of 1,100 USMLE-aligned questions; jointly estimate model ability, question difficulty, and discrimination. Result: MedIRT yields more stable and nuanced performance rankings than accuracy; it reveals "spiky" ability profiles, where a model that ranks high overall may do poorly in a specific domain; GPT-5 leads in 8 of 11 domains but is outperformed by Claude-3-opus in Social Science and Communication; the framework can also flag flawed test questions, giving it a benchmark-auditing role. Conclusion: MedIRT provides a psychometrically rigorous, multi-dimensional evaluation method that supports safer, more effective, and more trustworthy deployment of LLMs in healthcare, shifting evaluation from a single accuracy number toward multi-dimensional competency profiles. Abstract: As Large Language Models (LLMs) are increasingly proposed for high-stakes medical applications, there has emerged a critical need for reliable and accurate evaluation methodologies. Traditional accuracy metrics are inadequate, as they neither capture question characteristics nor offer topic-specific insights. To address this gap, we introduce MedIRT, a rigorous evaluation framework grounded in Item Response Theory (IRT), the gold standard in high-stakes educational testing. Unlike previous research relying on archival data, we prospectively gathered fresh responses from 80 diverse LLMs on a balanced, 1,100-question USMLE-aligned benchmark. Using one unidimensional two-parameter logistic IRT model per topic, we estimate each LLM's latent ability jointly with question difficulty and discrimination, yielding more stable and nuanced performance rankings than accuracy alone. Notably, we identify distinctive "spiky" ability profiles, where overall rankings can be misleading due to highly specialized model abilities. While GPT-5 was the top performer in a majority of domains (8 of 11), it was outperformed in Social Science and Communication by Claude-3-opus, demonstrating that even an overall 23rd-ranked model can hold the top spot for specific competencies. Furthermore, we demonstrate IRT's utility in auditing benchmarks by identifying flawed questions. We synthesize these findings into a practical decision-support framework that integrates our multi-factor competency profiles with operational metrics. This work establishes a robust, psychometrically grounded methodology essential for the safe, effective, and trustworthy deployment of LLMs in healthcare.
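
For readers unfamiliar with the two-parameter logistic (2PL) IRT model named in the abstract, the sketch below shows the standard 2PL response function and the log-likelihood of one model's answers. It is a textbook formulation, not the authors' code; the function names and the example values are hypothetical, and the abstract does not say how the parameters are actually fitted.

```python
import math

def p_correct(theta, a, b):
    """Standard 2PL item response function.

    theta: latent ability of the test taker (here, an LLM)
    a:     item discrimination
    b:     item difficulty
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of 0/1 responses given ability theta.

    items:     list of (a, b) pairs, one per question
    responses: list of 0/1 outcomes in the same order
    """
    total = 0.0
    for (a, b), y in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Hypothetical items: (discrimination, difficulty)
items = [(1.0, -1.0), (2.0, 0.0), (0.5, 1.5)]
responses = [1, 1, 0]
ll_strong = log_likelihood(1.0, items, responses)
ll_weak = log_likelihood(-2.0, items, responses)
```

With these illustrative numbers, the observed answers are more likely under the higher ability value, which is the signal an estimator would use when fitting theta.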

[118] PET: Preference Evolution Tracking with LLM-Generated Explainable Distribution

Luyang Zhang,Siyuan Peng,Jialu Wang,Shichao Zhu,Beibei Li,Zhongcun Wang,Guangmou Pan,Yan Li,Song Yang

Main category: cs.CL

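TL;DR: This paper proposes Preference Evolution Tracking (PET), a framework that improves LLM-based user preference modeling by inferring a dynamic probability distribution over a stable, interpretable lattice of preference clusters, substantially improving ranking quality and long-tail recommendation.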

Details

Motivation: Having an LLM directly generate a user's preference list limits personalization, makes decisions opaque, and worsens popularity bias, which stands in the way of explainable and fair recommender systems. Method: PET uses logit-probing and generative classification to model a user's preference as a probability distribution over preference clusters rather than generating an item list directly, enabling transparent preference learning and tracking of how preferences evolve. Result: On the public Yelp and MovieLens datasets, PET improves NDCG by up to 40% over baselines; on large-scale real-world data from a short-video platform, its NDCG on long-tail content is 7 times that of the current SOTA production model. Conclusion: PET moves user profiling from black-box generation to a transparent, probabilistic preference mapping, opening a path toward more explainable, fair, and diverse personalization systems. Abstract: Understanding how user preference evolves over time is a fundamental challenge central to modern digital ecosystems, for which Large Language Models (LLMs) are an increasingly prominent and popular approach due to their ability to comprehend the rich semantic context within behavioral data. A common practice is to use LLMs to predict a user's next action by directly generating a ranked list of preferred items. Although effective for short-term prediction, the end-to-end generation paradigm inherently limits personalization. Its opaque decision-making process obscures holistic user profiling and exacerbates popularity bias. To address these limitations, we propose Preference Evolution Tracking (PET), a framework that reframes the task as inferring a dynamic probability distribution over a stable and interpretable lattice of preference clusters. By applying logit-probing and generative classification techniques, PET infers a user's preference as a probability distribution, enabling transparent preference learning. On public benchmarks (Yelp, MovieLens), PET improves ranking quality by up to 40% in NDCG over direct generation baselines. On a large-scale, real-world dataset from a short-video platform, it excels at ranking long-tail contents, significantly outperforming a SOTA production model by 7 times in the NDCG score. Ultimately, PET transforms the user profile model from direct preference list generation to a transparent distributional preference mapping, paving the way for more explainable, fair, and diverse personalization systems.
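
The results above are reported in NDCG. As a reminder of what that metric measures, here is a minimal sketch of one common definition (linear gain, log2 discount). The abstract does not state which gain function or cutoff the authors use, so both are assumptions here, and the relevance lists are made up.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """NDCG of a ranked list of graded relevances, optionally truncated at k."""
    ranked = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)
    ideal = ideal if k is None else ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance grades in the order a system ranked the items
perfect = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
```

A ranking that places the most relevant items first scores 1.0; any other ordering of the same items scores lower.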

[119] AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play

Ran Xu,Yuchen Zhuang,Zihan Dong,Jonathan Wang,Yue Yu,Joyce C. Ho,Linjun Zhang,Haoyu Wang,Wenqi Shi,Carl Yang

Main category: cs.CL

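TL;DR: AceSearcher is a new search-augmented LLM framework that trains a single LLM to alternate between two roles, query decomposition and answer solving, to improve performance on complex reasoning tasks.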

Details

Motivation: Existing search-augmented LLMs are weak at multi-hop retrieval and reasoning and struggle with complex reasoning tasks. Method: The AceSearcher framework combines supervised fine-tuning (covering search, reasoning, and decomposition tasks) with reinforcement fine-tuning (optimized for final answer accuracy), and uses self-play to train a single LLM to act in turn as decomposer and solver, with no intermediate annotations. Result: On three reasoning-intensive tasks across 10 datasets, AceSearcher improves average exact match by 7.6%; AceSearcher-32B matches DeepSeek-V3 on document-level finance reasoning with less than 5% of its parameters; even the 1.5B and 8B variants often beat existing models with up to 9x more parameters. Conclusion: AceSearcher is both efficient and effective on complex reasoning tasks and outperforms existing search-augmented LLMs across model scales. Abstract: Search-augmented LLMs often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceSearcher, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceSearcher couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level finance reasoning tasks, AceSearcher-32B matches the performance of the DeepSeek-V3 model using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9x more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks. Our code will be published at https://github.com/ritaranx/AceSearcher and https://huggingface.co/AceSearcher.

[120] Can Large Language Models Express Uncertainty Like Human?

Linwei Tao,Yi-Fan Yeh,Bo Kai,Minjing Dong,Tao Huang,Tom A. Lamb,Jialin Yu,Philip H. S. Torr,Chang Xu

Main category: cs.CL

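TL;DR: Proposes linguistic confidence (LC) as a lightweight, human-centered approach to LLM uncertainty estimation, and advances it with a dataset, a mapper, a systematic study, and a fine-tuning framework.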

Details

Motivation: Existing confidence estimation methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and numerical expressions of uncertainty are unnatural. A more practical approach that matches human communication is needed. Method: (1) Release the first large-scale dataset of hedging expressions with human-annotated confidence scores; (2) propose a lightweight mapper that converts hedges into confidence scores; (3) systematically study the LC behavior of modern LLMs on QA benchmarks; (4) introduce a fine-tuning framework that improves LC reliability. Result: Most LLMs are poor at expressing reliable LC, but carefully designed prompting achieves competitive calibration and discriminability, and fine-tuning improves reliability further. Conclusion: Linguistic confidence is a scalable, efficient, and human-aligned approach to LLM uncertainty estimation that deserves deeper exploration. Abstract: Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we (1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and (2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we (3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we (4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction.
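
To make the idea of mapping hedges to confidence scores and then checking calibration concrete, here is a toy sketch. It is not the paper's mapper: the lookup table and its scores are invented for illustration, and expected calibration error (ECE) is used only as one common calibration metric, since the abstract does not name the metrics it reports.

```python
# Hypothetical hedge-to-confidence table; the values are illustrative only.
HEDGE_SCORES = {"definitely": 0.95, "probably": 0.70, "might": 0.40, "unlikely": 0.20}

def hedge_confidence(text, default=0.5):
    """Return the score of the first known hedge word in text, else default."""
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")
        if word in HEDGE_SCORES:
            return HEDGE_SCORES[word]
    return default

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |mean confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(y for _, y in bucket) / len(bucket)
            ece += len(bucket) / total * abs(mean_conf - accuracy)
    return ece

answers = ["It is probably Paris.", "It might be Lyon.", "The capital is Paris."]
scores = [hedge_confidence(a) for a in answers]
```

A learned mapper would replace the lookup table, but the downstream calibration check would take the same form.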

[121] BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models

Gaurav Srivastava,Aafiya Hussain,Zhenyu Bi,Swastik Roy,Priya Pitre,Meng Lu,Morteza Ziyadi,Xuan Wang

Main category: cs.CL

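TL;DR: This paper proposes BeyondBench, a contamination-free evaluation framework for language models that generates mathematical problems algorithmically. It covers 44 tasks with 117 variations across easy, medium, and hard levels; experiments on 101 models show reasoning ability dropping sharply on complex problems.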

Details

Motivation: Existing benchmarks may be contaminated by internet-scale training data, which biases evaluation and makes it hard to tell whether a model is reasoning or recalling answers, so a clean and controllable evaluation method is needed. Method: The BeyondBench framework generates mathematically grounded problems on the fly, so that each problem is unique and absent from training data; tasks range from polynomial to exponential complexity, and answers are verified by mathematical proofs. Result: Across 101 language models, performance degrades sharply as problem complexity rises; on the Hard Suite, large models such as Gemini, Llama, and Qwen stay below 60% accuracy; tool use clearly improves results, and accuracy falls substantially without it. Conclusion: Current language models still show serious reasoning deficiencies on high-complexity, uncontaminated algorithmic problems, and BeyondBench offers an effective route to fair, reliable model evaluation. Abstract: Evaluating language models fairly is becoming harder as static benchmarks available on the internet risk contamination by training data. This makes it unclear whether models are truly reasoning or just recalling answers. In this paper, we introduce BeyondBench, an evaluation framework that avoids this problem by using algorithmic problem generation. Unlike traditional benchmarks that risk contamination from internet-scale training data, BeyondBench creates mathematically grounded problems on the fly, ensuring each test remains fresh and uncontaminated. Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels: the Easy Suite (29 tasks) for basic arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) tackling NP-complete and constraint satisfaction problems. Each task generates problems from a combinatorial space larger than 10^15 unique instances, with solutions verified deterministically by mathematical proofs. We evaluated 101 language models, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters and multiple quantization schemes. Our results show consistent reasoning deficiencies across model families, with performance degrading sharply as problem complexity increases from polynomial to exponential. In our Hard Suite evaluations, models such as Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved average accuracies of 56.38%, 26.91%, and 33.60%, respectively. Moreover, we observe that performance drops drastically without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing a decline of 16.81%, 28.05%, and 47.59% accuracy on the hard suite. Our leaderboard is publicly available at https://ctrl-gaurav.github.io/BeyondBench/
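
The core idea described above, generating a fresh problem on the fly and checking the answer deterministically, can be illustrated with a very small sketch. This is not BeyondBench code: the task (summing integers), the function names, and the parameter ranges are all hypothetical stand-ins for the framework's much richer task suites.

```python
import random

def make_sum_problem(seed, n=5, lo=-1000, hi=1000):
    """Generate one arithmetic problem from a seed; same seed, same problem."""
    rng = random.Random(seed)
    numbers = [rng.randint(lo, hi) for _ in range(n)]
    prompt = "Compute the sum of: " + ", ".join(str(x) for x in numbers)
    return prompt, numbers

def verify(numbers, model_answer):
    """Check a candidate answer against the recomputed ground truth."""
    return model_answer == sum(numbers)

prompt, numbers = make_sum_problem(seed=42)
is_right = verify(numbers, sum(numbers))
is_wrong = verify(numbers, sum(numbers) + 1)
```

Because the ground truth is recomputed from the generated instance, no answer key needs to be stored anywhere a training corpus could pick it up.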

[122] ScenarioBench: Trace-Grounded Compliance Evaluation for Text-to-SQL and RAG

Zahra Atf,Peter R Lewis

Main category: cs.CL

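TL;DR: ScenarioBench is a policy-grounded, trace-aware benchmark for evaluating Text-to-SQL and retrieval-augmented generation in compliance scenarios, with an emphasis on decision justification and verifiable explanations.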

Details

Motivation: Existing Text-to-SQL and RAG benchmarks lack strict evidence and auditability for the basis of a decision, making it hard to assess how well and how truthfully a system explains itself in compliance settings. Method: Design YAML scenarios containing policy clauses, a gold-standard decision, a minimal witness trace, and canonical SQL; require systems to justify outputs with the specified clause IDs; evaluate end to end with multi-dimensional metrics such as decision accuracy, trace quality, and SQL correctness. Result: The evaluator reports decision accuracy, trace quality, retrieval effectiveness, SQL correctness, policy coverage, latency, and explanation-hallucination rate, and introduces a Scenario Difficulty Index (SDI) and an SDI-R variant that aggregate results while accounting for retrieval difficulty and time limits. Conclusion: Through strict clause-level evidence binding and no-peek rules, ScenarioBench improves the auditability and trustworthiness of model decisions and explanations in compliance settings, shifting the focus of evaluation from output correctness alone toward justification quality and timeliness. Abstract: ScenarioBench is a policy-grounded, trace-aware benchmark for evaluating Text-to-SQL and retrieval-augmented generation in compliance contexts. Each YAML scenario includes a no-peek gold-standard package with the expected decision, a minimal witness trace, the governing clause set, and the canonical SQL, enabling end-to-end scoring of both what a system decides and why. Systems must justify outputs using clause IDs from the same policy canon, making explanations falsifiable and audit-ready. The evaluator reports decision accuracy, trace quality (completeness, correctness, order), retrieval effectiveness, SQL correctness via result-set equivalence, policy coverage, latency, and an explanation-hallucination rate. A normalized Scenario Difficulty Index (SDI) and a budgeted variant (SDI-R) aggregate results while accounting for retrieval difficulty and time. Compared with prior Text-to-SQL or KILT/RAG benchmarks, ScenarioBench ties each decision to clause-level evidence under strict grounding and no-peek rules, shifting gains toward justification quality under explicit time budgets.
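
"SQL correctness via result-set equivalence" means a predicted query is judged by the rows it returns rather than by its text. The sketch below shows one plausible way to do that with Python's built-in sqlite3 module. The table, its columns, and the choice to ignore row order are assumptions made for illustration; the abstract does not describe ScenarioBench's actual comparison rules.

```python
import sqlite3
from collections import Counter

def result_sets_equal(conn, predicted_sql, canonical_sql):
    """Compare two queries by the multiset of rows they return (row order ignored)."""
    predicted = Counter(conn.execute(predicted_sql).fetchall())
    canonical = Counter(conn.execute(canonical_sql).fetchall())
    return predicted == canonical

def demo_connection():
    """Build a tiny in-memory table; the schema is purely illustrative."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE consents (user_id INTEGER, opted_in INTEGER)")
    conn.executemany("INSERT INTO consents VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)])
    return conn

conn = demo_connection()
# Two differently written queries that select the same rows
same = result_sets_equal(
    conn,
    "SELECT user_id FROM consents WHERE opted_in = 1",
    "SELECT user_id FROM consents WHERE opted_in <> 0",
)
```

Comparing multisets rather than lists tolerates differences in row order while still catching missing or duplicated rows; a benchmark that cares about ORDER BY would need a stricter comparison.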

[123] MoVa: Towards Generalizable Classification of Human Morals and Values

Ziyu Chen,Junfei Sun,Chenxi Li,Tuan Dung Nguyen,Jing Yao,Xiaoyuan Yi,Xing Xie,Chenhao Tan,Lexing Xie

Main category: cs.CL

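TL;DR: MoVa is a suite of resources for classifying human morals and values, comprising 16 labeled datasets, benchmark results under four theoretical frameworks, a lightweight LLM prompting strategy, and a new application for evaluating psychological surveys.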

Details

Motivation: Researchers analyzing human morals and values in language face a wide diversity of theoretical frameworks and data, and need unified, generalizable tooling. Method: Bring together labeled datasets from multiple theoretical frameworks, propose the all@once classification strategy, and benchmark a lightweight LLM prompting approach across multiple domains and frameworks. Result: The proposed LLM prompting strategy outperforms fine-tuned models, and all@once scores multiple related concepts simultaneously and effectively, improving cross-domain classification. Conclusion: MoVa provides efficient, general-purpose tools for analyzing morals and values in language, supporting fine-grained interpretation of human and machine communication and the alignment of machine behavior. Abstract: Identifying human morals and values embedded in language is essential to empirical studies of communication. However, researchers often face substantial difficulty navigating the diversity of theoretical frameworks and data available for their analysis. Here, we contribute MoVa, a well-documented suite of resources for generalizable classification of human morals and values, consisting of (1) 16 labeled datasets and benchmarking results from four theoretically-grounded frameworks; (2) a lightweight LLM prompting strategy that outperforms fine-tuned models across multiple domains and frameworks; and (3) a new application that helps evaluate psychological surveys. In practice, we specifically recommend a classification strategy, all@once, that scores all related concepts simultaneously, resembling the well-known multi-label classifier chain. The data and methods in MoVa can facilitate many fine-grained interpretations of human and machine communication, with potential implications for the alignment of machine behavior.

[124] Model Fusion with Multi-LoRA Inference for Tool-Enhanced Game Dialogue Agents

Kangxu Wang,Ze Chen,Chengcheng Wei,Jiewen Zheng,Jiarong He,Max Gao

Main category: cs.CL

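TL;DR: This paper presents the opdainlp team's solution for the GPU track of the CPDC 2025 challenge, which uses LoRA fine-tuning and model fusion and placed first in two of the three tasks and second in the other.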

Details

Motivation: To build, under resource and time constraints, an in-game conversational AI that adheres to character personas, fits the game's worldview, and supports function calling. Method: Use Qwen3-14B with LoRA fine-tuning and model fusion, with separate LoRA adapters for tool calling, response generation with tool call results, and response generation without tool call results, and implement MultiLoRA inference on vLLM. Result: First place in Task 1 and Task 3, and second place in Task 2. Conclusion: The approach improves multi-task performance while keeping inference efficient, supporting the feasibility of separating LoRA adapters by function and of model fusion in complex dialogue systems. Abstract: This paper presents the opdainlp team's solution for the GPU track of the CPDC 2025 challenge. The challenge consists of three tasks, aiming to build an in-game conversational AI that adheres to character personas, aligns with the game's worldview, and supports function calling. Considering both effectiveness and resource/time constraints during inference, we synthesized data for some of the tasks based on the datasets provided by the competition organizers. We employed Qwen3-14B with LoRA fine-tuning and model fusion, and utilized a base model integrated with multiple LoRA adapters during inference. Specifically, in the competition, we used three distinct LoRA adapters to handle tool calling, response generation with tool call results, and response generation without tool call results, respectively. MultiLoRA inference was implemented using vLLM. Our solution achieved first place in Task 1 and Task 3, and second place in Task 2 of the GPU track.

[125] Prompt and Parameter Co-Optimization for Large Language Models

Xiaohe Bo,Rui Li,Zexu Sun,Quanyu Dai,Zeyu Zhang,Zihang Tian,Xu Chen,Zhenhua Dong

Main category: cs.CL

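TL;DR: This paper proposes MetaTuner, a framework that jointly optimizes prompt learning and model fine-tuning through a shared encoding layer and a supervised regularization loss, and outperforms the baselines on multiple benchmarks.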

Details

Motivation: Prompt optimization and fine-tuning are usually studied separately, and their potential synergy is underexplored; this paper explores the complementary benefits of combining them. Method: Design a framework with a prompt-generating network and a parameter-generating network that share a bottom encoding layer, and introduce a supervised regularization loss to jointly optimize the discrete prompt space and the continuous parameter space. Result: Experiments on multiple benchmark tasks show that MetaTuner consistently outperforms baselines that use prompt optimization or fine-tuning alone. Conclusion: Jointly optimizing prompts and parameters is an effective way to improve LLM performance, and MetaTuner provides a workable framework for combining the two. Abstract: Prompt optimization and fine-tuning are two major approaches to improve the performance of Large Language Models (LLMs). They enhance the capabilities of LLMs from complementary perspectives: the former through explicit natural language, and the latter through implicit parameter updates. However, prior work has typically studied them in isolation, leaving their synergistic potential largely underexplored. To bridge this gap, in this paper, we introduce MetaTuner, a novel framework that jointly integrates prompt optimization and fine-tuning for LLM training. Specifically, we introduce two neural networks to generate prompts and parameters, respectively, while allowing them to share a common bottom encoding layer to enable knowledge sharing. Guided by the final supervised signals, our framework is optimized to discover the optimal combinations between the prompts and parameters. Given that prompt learning involves discrete optimization while fine-tuning operates in a continuous parameter space, we design a supervised regularization loss to train our framework effectively. Extensive experiments across diverse benchmarks show that our method consistently outperforms the baselines.

[126] MRAG-Suite: A Diagnostic Evaluation Platform for Visual Retrieval-Augmented Generation

Yuelyu Ji

Main category: cs.CL

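TL;DR: Proposes MRAG-Suite, a diagnostic evaluation platform for multimodal retrieval-augmented generation (Visual RAG) that combines difficulty- and ambiguity-aware filtering with the MM-RAGChecker tool, exposing performance drops and hallucinations under complex queries.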

Details

Motivation: Existing evaluations of Visual RAG systems do not systematically account for how query difficulty and ambiguity affect performance, and lack fine-grained diagnostic tools. Method: Build the MRAG-Suite platform, which integrates several multimodal benchmarks and introduces difficulty-based and ambiguity-aware filtering strategies along with the claim-level diagnostic tool MM-RAGChecker. Result: Experiments show that model accuracy drops substantially on difficult and ambiguous queries and that hallucinations are widespread; MM-RAGChecker identifies these problems effectively. Conclusion: MRAG-Suite can serve as an effective tool for diagnosing weaknesses in Visual RAG systems and can guide the development of more robust multimodal question answering systems. Abstract: Multimodal Retrieval-Augmented Generation (Visual RAG) significantly advances question answering by integrating visual and textual evidence. Yet, current evaluations fail to systematically account for query difficulty and ambiguity. We propose MRAG-Suite, a diagnostic evaluation platform integrating diverse multimodal benchmarks (WebQA, Chart-RAG, Visual-RAG, MRAG-Bench). We introduce difficulty-based and ambiguity-aware filtering strategies, alongside MM-RAGChecker, a claim-level diagnostic tool. Our results demonstrate substantial accuracy reductions under difficult and ambiguous queries, highlighting prevalent hallucinations. MM-RAGChecker effectively diagnoses these issues, guiding future improvements in Visual RAG systems.

[127] SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents

Gyuhyeon Seo,Jungwoo Yang,Junseong Pyo,Nalim Kim,Jonggeun Lee,Yohan Jo

Main category: cs.CL

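TL;DR: This paper proposes SimuHome, a time-accelerated smart home simulation environment built on the Matter protocol, for training and evaluating LLM agents on complex tasks involving latent user intent, temporal dependencies, and device constraints. It includes a benchmark of 600 episodes; experiments show that existing models do poorly at state verification and temporal scheduling.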

Details

Motivation: In smart home environments, LLM agents must handle latent user intent, temporal dependencies, and device constraints, and the lack of a high-fidelity simulation environment and a suitable evaluation benchmark is a bottleneck for development. Method: Build SimuHome, a simulation environment based on the Matter protocol that supports API calls and changes in environmental variables; design a benchmark of 600 tasks across 12 user query types; evaluate 11 agent models under a unified ReAct framework. Result: Current models do well on simple tasks but struggle markedly with latent intent inference, state verification, and temporal scheduling; even the best model, GPT-4.1, reaches only a 54% success rate. Conclusion: New methods are needed that reliably verify the current state via tools and coordinate time-dependent actions; SimuHome provides a high-fidelity, transferable platform for developing and deploying smart home agents. Abstract: Large Language Model (LLM) agents excel at multi-step, tool-augmented tasks. However, smart homes introduce distinct challenges, requiring agents to handle latent user intents, temporal dependencies, device constraints, scheduling, and more. The main bottlenecks for developing smart home agents with such capabilities include the lack of a realistic simulation environment where agents can interact with devices and observe the results, as well as a challenging benchmark to evaluate them. To address this, we introduce SimuHome, a time-accelerated home environment that simulates smart devices, supports API calls, and reflects changes in environmental variables. By building the simulator on the Matter protocol (the global industry standard for smart home communication), SimuHome provides a high-fidelity environment, and agents validated in SimuHome can be deployed on real Matter-compliant devices with minimal adaptation. We provide a challenging benchmark of 600 episodes across twelve user query types that require the aforementioned capabilities. Our evaluation of 11 agents under a unified ReAct framework reveals that while models perform well on simple tasks, they struggle with latent intent inference, state verification, and especially temporal scheduling. Even the top-performing model, GPT-4.1, reaches only 54% success rate. These findings highlight a critical need for methods that can reliably verify the current state via tools before acting and coordinate time-dependent actions.

[128] Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement

Yu-Che Tsai,Kuan-Yu Chen,Yuan-Chi Li,Yuan-Hao Chen,Ching-Yu Tsai,Shou-De Lin

Main category: cs.CL

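TL;DR: Proposes GIRCSE, a framework that uses generative iterative refinement to improve contrastive sentence embeddings and outperforms existing LLM encoder-style methods.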

Details

Motivation: Existing LLM-based embedding methods mostly follow an encoder-only paradigm that ignores the generative capabilities of LLMs and has trouble capturing implicit semantics. Method: The GIRCSE framework autoregressively generates a sequence of soft tokens and progressively refines the semantic representation under an Iterative Contrastive Refinement (ICR) objective. Result: GIRCSE outperforms strong baselines on the MTEB benchmark and on instruction-following tasks, and shows a test-time scaling property in which generating more tokens at inference keeps improving performance. Conclusion: Generative iterative refinement is a new paradigm for representation learning, and GIRCSE taps the generative potential of LLMs to improve embedding quality. Abstract: Existing large language model (LLM)-based embeddings typically adopt an encoder-only paradigm, treating LLMs as static feature extractors and overlooking their core generative strengths. We introduce GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a novel framework that leverages autoregressive generation to iteratively refine semantic representations. By producing sequences of soft tokens optimized under a contrastive objective, GIRCSE captures latent concepts and implicit semantics that encoder-only methods often miss. To guide this process, we propose an Iterative Contrastive Refinement (ICR) objective that encourages each refinement step to yield better representations. Extensive experiments show that GIRCSE outperforms strong LLM-based embedding baselines on the MTEB benchmark and instruction-following tasks. Moreover, GIRCSE exhibits an emergent test-time scaling property: generating more tokens at inference steadily improves embedding quality. Our results establish generative iterative refinement as a new paradigm for representation learning.

[129] LOGOS: LLM-driven End-to-End Grounded Theory Development and Schema Induction for Qualitative Research

Xinyu Pi,Qisen Yang,Chuong Nguyen

Main category: cs.CL

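TL;DR: LOGOS is an end-to-end automated framework that uses LLMs, semantic clustering, and related techniques to turn raw text from qualitative research into structured theory. It performs well on multiple datasets and makes grounded theory more scalable and its codebooks more reusable.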

Details

Motivation: Traditional grounded theory depends on manual coding by experts, which is slow and hard to scale, and existing computational tools cannot fully automate it, limiting the efficiency and reach of qualitative research. Method: The LOGOS framework combines LLM-driven coding, semantic clustering, graph reasoning, and an iterative refinement mechanism to build a hierarchical theory from text fully automatically; a 5-dimensional metric and a train-test split protocol are also designed for fair evaluation. Result: Across five different corpora, LOGOS outperforms strong baselines and reaches 88.2% alignment with an expert-developed schema on a complex dataset. Conclusion: LOGOS fully automates the grounded theory workflow and improves scalability while preserving theoretical depth, offering a new path to democratizing and scaling qualitative research. Abstract: Grounded theory offers deep insights from qualitative data, but its reliance on expert-intensive manual coding presents a major scalability bottleneck. Current computational tools stop short of true automation, keeping researchers firmly in the loop. We introduce LOGOS, a novel, end-to-end framework that fully automates the grounded theory workflow, transforming raw text into a structured, hierarchical theory. LOGOS integrates LLM-driven coding, semantic clustering, graph reasoning, and a novel iterative refinement process to build highly reusable codebooks. To ensure fair comparison, we also introduce a principled 5-dimensional metric and a train-test split protocol for standardized, unbiased evaluation. Across five diverse corpora, LOGOS consistently outperforms strong baselines and achieves a remarkable 88.2% alignment with an expert-developed schema on a complex dataset. LOGOS demonstrates a powerful new path to democratize and scale qualitative research without sacrificing theoretical nuance.

[130] DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Zherui Li,Zheng Nie,Zhenhong Zhou,Yufei Guo,Yue Liu,Yitong Zhang,Yu Cheng,Qingsong Wen,Kun Wang,Jiaheng Zhang

Main category: cs.CL

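TL;DR: This paper studies new safety vulnerabilities that diffusion large language models (dLLMs) face during iterative generation and proposes DiffuGuard, a training-free defense framework that lowers jailbreak attack success rates through stochastic annealing remasking and block-level audit and repair.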

Details

Motivation: The iterative and parallel generation mechanisms of dLLMs create new safety vulnerabilities that differ from those of autoregressive models and that existing defenses handle poorly, so protections designed around dLLM generation are needed. Method: Analyze dLLM vulnerability along two dimensions, intra-step and inter-step; identify the phenomenon of Denoising-path Dependence; design the DiffuGuard framework, which defends in two stages, Stochastic Annealing Remasking and Block-level Audit and Repair. Result: On four dLLMs, DiffuGuard reduces the average attack success rate of six representative jailbreak attacks from 47.9% to 14.7% while preserving generation quality and efficiency. Conclusion: dLLMs carry inherent safety risks but also substantial intrinsic safety potential; by decoupling safety control from the generation process, DiffuGuard gives dLLMs an efficient, general, training-free defense. Abstract: The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.

Junying Wang,Zicheng Zhang,Ye Shen,Yalun Wu,Yingji Liang,Yijin Guo,Farong Wen,Wenzhe Li,Xuezhi Zhao,Qi Jia,Guangtao Zhai

Main category: cs.CL

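TL;DR: Proposes the TQA-to-MMQA framework and uses the Q-Mirror agentic system to automatically convert text-only QA pairs into multi-modal ones with iterative refinement, substantially improving generation quality.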

Details

Motivation: Building high-quality multi-modal scientific reasoning benchmarks by hand is costly and hard to scale, so an automated method is needed to remove this bottleneck. Method: Design a TQA-to-MMQA transformation framework and a multi-dimensional quality rubric, construct two large benchmarks, and develop the Q-Mirror agentic system, which closes the loop between generation and evaluation for iterative refinement. Result: MMQAs generated by existing models still fall clearly short; top-tier understanding models can assess MMQA quality effectively; Q-Mirror raises the average score from 78.90 to 85.22 and the pass rate from 72% to 95%. Conclusion: Q-Mirror offers a practical path to automatically generating high-quality scientific multi-modal benchmarks, and combining generation and evaluation in a closed loop improves both efficiency and quality. Abstract: High-quality, multi-modal benchmarks are crucial for advancing scientific reasoning in large models yet their manual creation is costly and unscalable. To address this bottleneck, we explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs), which include three parts: 1) Task Definition & Evaluation Rubric: We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. 2) Benchmark Construction: Then we construct two extensive benchmarks to rigorously evaluate state-of-the-art generation & understanding models on the distinct tasks of MMQA generation & MMQA quality evaluation. 3) Preliminary Solution: We develop an agentic system (Q-Mirror), which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement. Our experiments show that while state-of-the-art models can generate MMQAs, their outputs still leave substantial gaps, underscoring the need for reliable evaluation. We further demonstrate that top-tier understanding models align closely with human judgment in MMQA quality assessment. Leveraging both insights, the Q-Mirror agent raises average scores from 78.90 to 85.22 and pass rates from 72% to 95%, offering a practical path to large-scale scientific benchmarks.

[132] Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs

Jongwook Han,Jongwon Lim,Injin Kong,Yohan Jo

Main category: cs.CL

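TL;DR: This paper compares the mechanisms behind intrinsic and prompted value expression in LLMs and finds that they have both shared and unique components, which leads to differences in steerability and response diversity.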

Details

Motivation: It is important for value alignment and persona steering to understand whether intrinsic and prompted value mechanisms in LLMs overlap or rely on different mechanisms, yet this has received little study. Method: Analyze the question at the mechanistic level with two approaches: value vectors (feature directions extracted from the residual stream) and value neurons (MLP neurons that contribute to value expression). Result: The intrinsic and prompted mechanisms share some key components but each also has unique elements; the prompted mechanism is easier to steer, while the intrinsic mechanism produces more lexically diverse responses; components unique to the prompted mechanism strengthen instruction following and take effect even in distant tasks such as jailbreaking. Conclusion: Intrinsic and prompted value mechanisms are related yet distinct, and their unique components respectively shape steerability, response diversity, and instruction following. Abstract: Large language models (LLMs) can express different values in two distinct ways: (1) intrinsic expression, reflecting the model's inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given their widespread use in value alignment and persona steering, it is paramount to clearly understand their underlying mechanisms, particularly whether they mostly overlap (as one might expect) or rely on substantially different mechanisms, but this remains largely understudied. We analyze this at the mechanistic level using two approaches: (1) value vectors, feature directions representing value mechanisms extracted from the residual stream, and (2) value neurons, MLP neurons that contribute to value expressions. We demonstrate that intrinsic and prompted value mechanisms partly share common components that are crucial for inducing value expression, but also possess unique elements that manifest in different ways. As a result, these mechanisms lead to different degrees of value steerability (prompted > intrinsic) and response diversity (intrinsic > prompted). In particular, components unique to the intrinsic mechanism seem to promote lexical diversity in responses, whereas those specific to the prompted mechanism primarily strengthen instruction following, taking effect even in distant tasks like jailbreaking.

[133] Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey

Yuntao Shou,Tao Meng,Wei Ai,Keqin Li

Main category: cs.CL

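TL;DR: This survey reviews the use of large language models (LLMs) and multimodal large language models (MLLMs) for emotion recognition and reasoning, covering model architectures, datasets, and performance benchmarks, and identifies key challenges and future research directions. To the authors' knowledge, it is the first comprehensive survey of the intersection of MLLMs with multimodal emotion recognition and reasoning.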

Details

Motivation: Although LLMs and MLLMs have made progress in emotion recognition and reasoning, the field still lacks a systematic review, and recent developments need to be consolidated to guide future work. Method: Provides a comprehensive analysis of the model architectures, datasets, and performance benchmarks of existing LLMs and MLLMs for emotion recognition and reasoning, and summarizes the strengths and weaknesses of current methods. Result: Delivers a comprehensive survey of the field that summarizes key technical advances, the main challenges, and future research directions, and releases a GitHub repository collecting the surveyed methods. Conclusion: The survey gives researchers an authoritative reference and practical insights, advancing multimodal emotion recognition and reasoning within AI for Science. Abstract: In recent years, large language models (LLMs) have driven major advances in language understanding, marking a significant step toward artificial general intelligence (AGI). With increasing demands for higher-level semantics and cross-modal fusion, multimodal large language models (MLLMs) have emerged, integrating diverse information sources (e.g., text, vision, and audio) to enhance modeling and reasoning in complex scenarios. In AI for Science, multimodal emotion recognition and reasoning has become a rapidly growing frontier. While LLMs and MLLMs have achieved notable progress in this area, the field still lacks a systematic review that consolidates recent developments. To address this gap, this paper provides a comprehensive survey of LLMs and MLLMs for emotion recognition and reasoning, covering model architectures, datasets, and performance benchmarks. We further highlight key challenges and outline future research directions, aiming to offer researchers both an authoritative reference and practical insights for advancing this domain. To the best of our knowledge, this paper is the first attempt to comprehensively survey the intersection of MLLMs with multimodal emotion recognition and reasoning. A summary of the existing methods discussed is available on our GitHub: https://github.com/yuntaoshou/Awesome-Emotion-Reasoning.

[134] Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding

Sungkyun Kim,Jaemin Kim,Dogyung Yoon,Jiho Shin,Junyeol Lee,Jiwon Seo

Main category: cs.CL

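TL;DR: Proposes Speculative Verification (SV), a method that dynamically predicts speculation accuracy and adapts the verification length to improve LLM decoding efficiency, achieving an average 1.4x speedup over speculative decoding in large-batch settings.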

Details

Motivation: Existing speculative decoding (SD) wastes substantial computation when speculation accuracy is low, which limits its benefit especially at large batch sizes, so decoding efficiency and throughput need to be improved. Method: Introduces a small auxiliary companion model to estimate how well the draft and target model distributions align, and adapts the verification length based on information gain, refining verification decisions and reducing wasted computation. Result: Across a range of LLMs and tasks, SV outperforms both standard decoding and SD in all experiments, improving on SD by 1.4x on average in large-batch settings and by up to 2x. Conclusion: SV improves the efficiency and robustness of speculative decoding, requires no changes to the existing models, is compatible with existing SD variants, and scales well in practice. Abstract: LLMs have low GPU efficiency and high latency due to autoregressive decoding. Speculative decoding (SD) mitigates this using a small draft model to speculatively generate multiple tokens, which are then verified in parallel by a target model. However, when speculation accuracy is low, the overhead from rejected tokens can offset the benefits, limiting SD's effectiveness, especially at large batch sizes. To address this, we propose Speculative Verification (SV), an efficient augmentation to SD that dynamically predicts speculation accuracy and adapts the verification length to maximize throughput. SV introduces a companion model - a small auxiliary model similar in size to the draft model - to estimate the alignment between draft and target model distributions. By maximizing the information gain from quantifying this alignment, SV refines verification decisions, reducing wasted computation on rejected tokens and improving decoding efficiency. Moreover, SV requires no modifications to the draft or target models and is compatible with existing SD variants. We extensively evaluated SV on publicly available LLMs across three NLP tasks using nine combinations of draft, companion, and target models, including 13B-72B target models and three types of variations: base (no finetuning), instruction-tuned, and task fine-tuned. Across all experiments and batch sizes (4-80), SV consistently outperforms both SD and standard decoding with the target model. It improves SD performance by up to 2$\times$, with an average speedup of 1.4$\times$ in large-batch settings (batch sizes 32-80). These results demonstrate SV's robustness, scalability, and practical utility for efficient LLM inference.
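
The overhead argument in this abstract can be illustrated with the usual back-of-the-envelope estimate for plain speculative decoding. This is a minimal sketch, not SV itself: it assumes every draft token is accepted independently with the same probability `alpha` (a simplification), and both function names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumption: each draft token is accepted independently with
    probability `alpha`. A pass emits the accepted prefix plus one token
    produced by the target model itself.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        # Every draft token is accepted, plus the extra target token.
        return float(draft_len + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**draft_len.
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


def expected_wasted_drafts(alpha: float, draft_len: int) -> float:
    """Expected number of draft tokens that are verified and then discarded."""
    accepted = expected_tokens_per_pass(alpha, draft_len) - 1.0
    return draft_len - accepted


if __name__ == "__main__":
    for alpha in (0.5, 0.9):
        print(alpha,
              expected_tokens_per_pass(alpha, 4),
              expected_wasted_drafts(alpha, 4))
```

Under this simplified model, a lower acceptance rate means more of each verification pass is spent on tokens that are thrown away, which is the cost that adapting the verification length is meant to reduce.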

[135] AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment

Mengyu Bu,Shaolei Zhang,Zhongjun He,Hua Wu,Yang Feng

Main category: cs.CL

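TL;DR: This paper proposes AlignX, a two-stage representation-level framework that improves the performance and cross-lingual alignment of pre-trained multilingual LLMs on non-dominant languages.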

Details

Motivation: Existing approaches perform poorly on non-dominant languages, and fine-tuning on large, balanced multilingual corpora often yields imprecise alignment and weak knowledge transfer. Method: The first stage aligns multilingual representations through multilingual semantic alignment and language feature integration; the second stage stimulates the model's multilingual capability through multilingual instruction fine-tuning. Result: Experiments on several pre-trained LLMs show that the method improves multilingual general capability and cross-lingual generation, brings multilingual representations closer together, and improves cross-lingual alignment. Conclusion: AlignX narrows the performance gap in multilingual LLMs and strengthens both cross-lingual representation alignment and overall multilingual performance. Abstract: Multilingual large language models (LLMs) possess impressive multilingual understanding and generation capabilities. However, their performance and cross-lingual alignment often lag for non-dominant languages. A common solution is to fine-tune LLMs on large-scale and more balanced multilingual corpora, but such approaches often lead to imprecise alignment and suboptimal knowledge transfer, yielding only limited improvements across languages. In this paper, we propose AlignX to bridge the multilingual performance gap, which is a two-stage representation-level framework for enhancing multilingual performance of pre-trained LLMs. In the first stage, we align multilingual representations with multilingual semantic alignment and language feature integration. In the second stage, we stimulate the multilingual capability of LLMs via multilingual instruction fine-tuning. Experimental results on several pre-trained LLMs demonstrate that our approach enhances LLMs' multilingual general and cross-lingual generation capability. Further analysis indicates that AlignX brings the multilingual representations closer and improves the cross-lingual alignment.

[136] Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining

Matthew Theodore Roque,Dan John Velasco

Main category: cs.CL

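TL;DR: Studies how text simplification and complexity-ordered curriculum learning affect language model pretraining under data constraints, finding that adding simplified text improves representation quality for both smaller and larger models, and that models of different sizes benefit from different data schedules.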

Details

Motivation: In data-constrained pretraining, the effects of training data order and of using multiple versions of the same text are unclear; existing work focuses on large datasets and overlooks the potential of such strategies. Method: Builds on parallel corpora in which original texts are aligned with LLM-simplified versions, designs four data schedules (repeated exposure, low-to-high complexity, high-to-low, and interleaved), and evaluates models through fine-tuning and zero-shot performance. Result: Adding simplified text improves fine-tuning and zero-shot performance compared with repeating the original data; smaller models do better with low-to-high complexity ordering, while larger models do better with interleaved ordering. Conclusion: Data scheduling significantly affects model performance in data-constrained pretraining; text simplification and complexity ordering are effective, and the presentation order should be chosen according to model size. Abstract: Most studies on language model pretraining focus on large datasets, leaving open questions about optimization in data-constrained settings. In such settings, the effects of training data order and of including alternative versions of the same text remain underexplored. We address this by studying curriculum learning in pretraining, focusing on text-complexity ordering and data augmentation via simplification. We ask: (1) Does simplifying texts enhance representation quality more than reusing the original data? and (2) Does ordering data by text complexity yield better representations? To answer, we build on a pair of parallel corpora where human-written paragraphs are aligned with LLM-simplified variants, and test four data schedules: repeated exposure, low-to-high complexity, high-to-low, and interleaved. We analyze models' representation quality from a sample efficiency perspective via fine-tuning, as well as their zero-shot performance on linguistic knowledge, entity tracking, world knowledge, and commonsense reasoning. Our findings show that adding simplified data improves fine-tuning and zero-shot performance over a repeated-exposure baseline: smaller models benefit from low-to-high complexity, while larger models perform better with interleaved ordering.
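
One plausible reading of the four data schedules named in this abstract can be sketched as follows. The abstract does not give the exact definitions, so everything here is an assumption: simplified paragraphs are treated as the low-complexity half, originals as the high-complexity half, and `build_schedule` and its `kind` labels are hypothetical names.

```python
from itertools import chain
from typing import List, Sequence


def build_schedule(originals: Sequence[str],
                   simplified: Sequence[str],
                   kind: str) -> List[str]:
    """Return the training order for one of four illustrative schedules.

    Assumption: `simplified[i]` is the simplified variant of `originals[i]`,
    and simplified text counts as lower complexity than the original.
    """
    if len(originals) != len(simplified):
        raise ValueError("corpora must be aligned paragraph by paragraph")
    if kind == "repeated":
        # Baseline: see the original data twice, no simplified text.
        return list(originals) + list(originals)
    if kind == "low_to_high":
        return list(simplified) + list(originals)
    if kind == "high_to_low":
        return list(originals) + list(simplified)
    if kind == "interleaved":
        # Alternate each simplified paragraph with its original.
        return list(chain.from_iterable(zip(simplified, originals)))
    raise ValueError(f"unknown schedule: {kind}")


if __name__ == "__main__":
    print(build_schedule(["O1", "O2"], ["S1", "S2"], "interleaved"))
```

All four schedules produce the same number of training paragraphs, so a comparison between them isolates the effect of content and order rather than data volume.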

[137] Reinforcement Mid-Training

Yijun Tian,Shaoyu Chen,Zhichao Xu,Yawei Wang,Jinhe Bi,Peng Han,Wei Wang

Main category: cs.CL

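TL;DR: Proposes RMT, a reinforcement mid-training framework that uses a dynamic token budget, adaptive sampling, and a dual training strategy to substantially improve LLM performance, shorten reasoning length, and strengthen subsequent post-training.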

Details

Motivation: Current LLM training pipelines lack an effective intermediate stage for improving reasoning efficiency and token utilization, which leads to overthinking, imbalanced training, and underused token information. Method: Proposes the RMT framework, comprising a dynamic token budget mechanism, a curriculum-based adaptive sampling method, and a dual training strategy that combines reinforcement learning with next-token prediction. Result: Experiments show that RMT achieves up to +64.91% performance improvement in language modeling with only 21% of the reasoning length, and that it improves subsequent post-training in the mathematical domain by up to +18.76%. Conclusion: Reinforcement mid-training is effective and necessary; the RMT framework markedly improves training efficiency and model performance and offers a new paradigm for LLM training. Abstract: The development of state-of-the-art large language models is commonly understood as a two-stage process involving pre-training and post-training. We point out the need for an additional intermediate stage called reinforcement mid-training with potential for strong performance gains. In this paper, we formally define the problem and identify three key challenges: (1) inefficient training due to excessive reasoning steps, (2) disregard of the imbalanced token entropy distribution, and (3) underutilization of token information. To address these challenges, we propose RMT, a framework for efficient, adaptive, and unified reinforcement mid-training with various innovative components. In particular, we first introduce a dynamic token budget mechanism that constrains unnecessary reasoning steps and mitigates model overthinking. Next, we design a curriculum-based adaptive sampling method that fosters a progressive learning trajectory from easy to hard tokens. Finally, we present a dual training strategy that combines reinforcement learning with next-token prediction, ensuring targeted learning on key tokens and full exploitation of all token information. Extensive experiments demonstrate the superiority of RMT over state-of-the-art methods, achieving up to +64.91% performance improvement with only 21% of the reasoning length in language modeling. We also show that checkpoints obtained after reinforcement mid-training can benefit the subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.

[138] HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment

Langqi Yang,Tianhang Zheng,Kedong Xiu,Yixuan Chen,Di Wang,Puning Zhao,Zhan Qin,Kui Ren

Main category: cs.CL

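TL;DR: This paper introduces HarmMetric Eval, a comprehensive benchmark for evaluating the effectiveness of harmfulness metrics and judges for large language models (LLMs). The results show that the conventional metrics METEOR and ROUGE-1 outperform LLM-based judges at assessing the harmfulness of model outputs.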

Details

Motivation: Jailbreak evaluation lacks a systematic benchmark for measuring the quality and effectiveness of different harmfulness metrics and judges, which undermines the credibility of reported results, so a standardized evaluation framework is needed. Method: Builds a high-quality dataset of representative harmful prompts paired with diverse harmful and non-harmful model responses, and designs a flexible scoring mechanism compatible with various metrics and judges, enabling a comprehensive evaluation of harmfulness assessment tools. Result: Experiments find that the conventional metrics METEOR and ROUGE-1 outperform LLM-based judges at assessing harmfulness, challenging the prevailing belief that LLM judges are superior. Conclusion: HarmMetric Eval provides a reliable benchmark for harmfulness assessment, reveals the strengths of conventional metrics on this task, and stresses that the choice of evaluation method should rest on evidence rather than assumption. Abstract: The alignment of large language models (LLMs) with human values is critical for their safe deployment, yet jailbreak attacks can subvert this alignment to elicit harmful outputs from LLMs. In recent years, a proliferation of jailbreak attacks has emerged, accompanied by diverse metrics and judges to assess the harmfulness of the LLM outputs. However, the absence of a systematic benchmark to assess the quality and effectiveness of these metrics and judges undermines the credibility of the reported jailbreak effectiveness and other risks. To address this gap, we introduce HarmMetric Eval, a comprehensive benchmark designed to support both overall and fine-grained evaluation of harmfulness metrics and judges. Our benchmark includes a high-quality dataset of representative harmful prompts paired with diverse harmful and non-harmful model responses, alongside a flexible scoring mechanism compatible with various metrics and judges. With HarmMetric Eval, our extensive experiments uncover a surprising result: two conventional metrics--METEOR and ROUGE-1--outperform LLM-based judges in evaluating the harmfulness of model responses, challenging prevailing beliefs about LLMs' superiority in this domain. Our dataset is publicly available at https://huggingface.co/datasets/qusgo/HarmMetric_Eval, and the code is available at https://anonymous.4open.science/r/HarmMetric-Eval-4CBE.
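
For readers unfamiliar with ROUGE-1, one of the two conventional metrics named above, it measures clipped unigram overlap between a candidate and a reference text. The sketch below is a simplified version that assumes lowercased whitespace tokenization with no stemming; it does not reproduce how HarmMetric Eval applies the metric, and `rouge1` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict


def rouge1(candidate: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1).

    Assumption: tokens are lowercased whitespace-separated words; real
    implementations usually add proper tokenization and optional stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(rouge1("the cat sat", "the cat sat on the mat"))
```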

[139] LLaDA-MoE: A Sparse MoE Diffusion Language Model

Fengqi Zhu,Zebin You,Yipeng Xing,Zenan Huang,Lin Liu,Yihong Zhuang,Guoshan Lu,Kangyu Wang,Xudong Wang,Lanning Wei,Hongrui Guo,Jiaqi Hu,Wentao Ye,Tieyuan Chen,Chenchen Li,Chengfu Tang,Haibo Feng,Jun Hu,Jun Zhou,Xiaolu Zhang,Zhenzhong Lan,Junbo Zhao,Da Zheng,Chongxuan Li,Jianguo Li,Ji-Rong Wen

Main category: cs.CL

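TL;DR: LLaDA-MoE is a large language diffusion model built on the MoE architecture and trained from scratch on roughly 20T tokens; it achieves competitive performance while activating only 1.4B parameters, significantly reducing computational overhead.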

Details

Motivation: Aims to improve the training and inference efficiency of diffusion language models through a sparse MoE architecture while maintaining strong performance, exploring a more efficient model design. Method: Proposes LLaDA-MoE, which adopts the Mixture-of-Experts architecture and is trained from scratch with a masked diffusion language modeling objective; the model has a total capacity of 7B parameters and activates only 1.4B during inference. Result: LLaDA-MoE surpasses existing diffusion language models such as LLaDA, LLaDA 1.5, and Dream across multiple benchmarks, reaching state-of-the-art performance among them; its instruction-tuned version, LLaDA-MoE-7B-A1B-Instruct, performs comparably to Qwen2.5-3B-Instruct on knowledge understanding, code generation, mathematical reasoning, agent, and alignment tasks while using fewer active parameters. Conclusion: Bringing a sparse MoE architecture into diffusion language model training improves inference efficiency while preserving strong performance, pointing to a new direction for diffusion language models. Abstract: We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models with larger parameters, surpassing previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths under efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available on Hugging Face.

[140] Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling

Pengfei Wang,Baolin Sun,Xuemei Dong,Yaxun Dai,Hongwei Yuan,Mengdie Chu,Yingqi Gao,Xiang Qi,Peng Zhang,Ying Yan

Main category: cs.CL

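TL;DR: This paper presents Agentar-Scale-SQL, a new framework that improves Text-to-SQL performance through an orchestrated test-time scaling strategy, reaching 81.67% execution accuracy on the BIRD benchmark and ranking first on its leaderboard.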

Details

Motivation: Existing Text-to-SQL methods still fall far behind human experts on challenging benchmarks, and they neither make effective use of the model's internal reasoning process nor apply test-time scaling in a systematic way. Method: Proposes the Agentar-Scale-SQL framework, which combines three scaling perspectives: internal scaling via RL-enhanced intrinsic reasoning, sequential scaling through iterative refinement, and parallel scaling using diverse synthesis and tournament selection. Result: Achieves 81.67% execution accuracy on the BIRD test set, a new state-of-the-art result, and ranks first on the official leaderboard. Conclusion: Through its coordinated test-time scaling strategy, Agentar-Scale-SQL substantially improves Text-to-SQL performance and offers an effective path toward human-level performance. Abstract: State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model's internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.
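
Execution accuracy, the metric reported above, is commonly computed by running the predicted and reference queries against the same database and comparing the returned rows. The following is a minimal sketch of that idea using Python's built-in `sqlite3` module. It assumes row order does not matter and that a query which fails to run counts as wrong; the official BIRD evaluator may differ in both respects, and the function names are hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple


def execution_match(conn: sqlite3.Connection,
                    predicted_sql: str,
                    gold_sql: str) -> bool:
    """True if both queries return the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # Assumption: a prediction that cannot be executed is incorrect.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Sort by repr so rows with mixed types (e.g. NULLs) stay comparable.
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)


def execution_accuracy(conn: sqlite3.Connection,
                       pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) query pairs whose results match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b")])
    print(execution_match(conn,
                          "SELECT name FROM users ORDER BY id DESC",
                          "SELECT name FROM users"))
```

Comparing results rather than SQL strings means two differently written queries are both counted as correct when they return the same rows.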

[141] Multilingual Text-to-SQL: Benchmarking the Limits of Language Models with Collaborative Language Agents

Khanh Trinh Pham,Thu Huong Nguyen,Jun Jo,Quoc Viet Hung Nguyen,Thanh Tam Nguyen

Main category: cs.CL

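TL;DR: MultiSpider 2.0 extends Spider 2.0 to eight languages, reveals a severe performance drop for current LLMs on multilingual Text-to-SQL, and proposes a baseline based on collaborative language agents to improve accuracy.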

Details

Motivation: Most existing Text-to-SQL benchmarks are English-only and lack multilingual support, which limits research and applications in cross-lingual settings. Method: Builds the MultiSpider 2.0 dataset, covering eight languages while preserving the original structural complexity, and designs a collaborative language agent framework that generates SQL queries through iterative refinement. Result: State-of-the-art LLMs reach only 4% execution accuracy on MultiSpider 2.0 (a sharp drop from 60% on MultiSpider 1.0); the collaborative agent baseline raises accuracy to 15%. Conclusion: There is a substantial multilingual performance gap, and more robust cross-lingual methods are needed to support real-world enterprise applications. Abstract: Text-to-SQL enables natural access to databases, yet most benchmarks are English-only, limiting multilingual progress. We introduce MultiSpider 2.0, extending Spider 2.0 to eight languages (English, German, French, Spanish, Portuguese, Japanese, Chinese, Vietnamese). It preserves Spider 2.0's structural difficulty while adding linguistic and dialectal variability, demanding deeper reasoning for complex SQL. On this benchmark, state-of-the-art LLMs (such as DeepSeek-R1 and OpenAI o1) reach only 4% execution accuracy when relying on intrinsic reasoning, versus 60% on MultiSpider 1.0. Therefore, we provide a collaboration-driven language agents baseline that iteratively refines queries, improving accuracy to 15%. These results reveal a substantial multilingual gap and motivate methods that are robust across languages and ready for real-world enterprise deployment. Our benchmark is available at https://github.com/phkhanhtrinh23/Multilingual_Text_to_SQL.

[142] CDT: A Comprehensive Capability Framework for Large Language Models Across Cognition, Domain, and Task

Haosi Mo,Xinyu Ma,Xuebo Liu,Derek F. Wong,Yu Li,Jie Liu,Min Zhang

Main category: cs.CL

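TL;DR: This paper proposes the Cognition-Domain-Task (CDT) framework for comprehensively evaluating LLM capabilities, drawing on cognitive theory for multi-dimensional analysis, and validates it on dataset evaluation and data selection.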

Details

Motivation: Existing benchmarks usually focus on isolated abilities and lack a holistic framework for assessing the overall capabilities of LLMs. Method: Builds the CDT framework on the Cattell-Horn-Carroll cognitive theory, evaluates model capabilities along three dimensions (cognition, domain, and task), and applies it to dataset evaluation and data selection. Result: CDT's capability metrics correlate well with downstream performance; data selection reaches scores of 44.3 and 45.4 on general and specific benchmarks, 1.6 and 2.2 points above the baselines, respectively. Conclusion: The CDT framework effectively supports comprehensive capability evaluation as well as dataset analysis and construction, and is both practical and effective. Abstract: Recent advances in Large Language Models (LLMs) have significantly enhanced their capabilities, highlighting the need for comprehensive evaluation frameworks that extend beyond task-specific benchmarks. However, existing benchmarks often focus on isolated abilities, lacking a holistic framework for assessing LLM capabilities. To address this gap, we propose the Cognition-Domain-Task (CDT) framework, which comprehensively measures a model's capabilities across three dimensions. We expand the scope of model capability definitions at the cognitive level by incorporating the Cattell-Horn-Carroll cognitive theory, refining the categorization of model capabilities. We apply CDT in two directions: dataset capability evaluation and data selection. Experiments show that our capability metrics correlate well with downstream performance and can support effective dataset analysis and construction. The experiments on data selection also show significant improvements in both general and specific benchmarks, achieving scores of 44.3 and 45.4, with an increase of 1.6 and 2.2 points over the baselines, respectively. These results validate the effectiveness and practicality of CDT. Source code and models are available at https://github.com/Alessa-mo/CDT.

[143] Alternatives To Next Token Prediction In Text Generation -- A Survey

Charlie Wyatt,Aditya Joshi,Flora Salim

Main category: cs.CL

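TL;DR: This survey reviews emerging alternatives to the next-token prediction (NTP) paradigm in language models and proposes a taxonomy of five families of alternatives, aimed at overcoming the limitations of LLMs in long-term planning, error accumulation, and computational efficiency.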

Details

Motivation: Although NTP has driven the success of large language models, its inherent weaknesses, such as poor long-range planning, error accumulation, and computational inefficiency, need to be addressed, which calls for a systematic exploration of alternative paradigms. Method: Through a literature review and categorization, groups the alternatives to NTP into five families (multi-token prediction, plan-then-generate, latent reasoning, continuous generation approaches, and non-Transformer architectures) and builds a unified taxonomy. Result: Presents a systematic taxonomy covering five families of NTP alternatives and shows how each addresses the limitations of NTP from a different angle, giving future research a structured view. Conclusion: NTP is not the only viable path; diverse generation paradigms are emerging, and the taxonomy can help guide language models toward more efficient, controllable, and capable designs. Abstract: The paradigm of Next Token Prediction (NTP) has driven the unprecedented success of Large Language Models (LLMs), but is also the source of their most persistent weaknesses such as poor long-term planning, error accumulation, and computational inefficiency. Acknowledging the growing interest in exploring alternatives to NTP, this survey describes the emerging ecosystem of alternatives to NTP. We categorise these approaches into five main families: (1) Multi-Token Prediction, which targets a block of future tokens instead of a single one; (2) Plan-then-Generate, where a global, high-level plan is created upfront to guide token-level decoding; (3) Latent Reasoning, which shifts the autoregressive process itself into a continuous latent space; (4) Continuous Generation Approaches, which replace sequential generation with iterative, parallel refinement through diffusion, flow matching, or energy-based methods; and (5) Non-Transformer Architectures, which sidestep NTP through their inherent model structure. By synthesizing insights across these methods, this survey offers a taxonomy to guide research into models that address the known limitations of token-level generation to develop new transformative models for natural language processing.

[144] Bias Mitigation or Cultural Commonsense? Evaluating LLMs with a Japanese Dataset

Taisei Yamamoto,Ryoma Kumon,Danushka Bollegala,Hitomi Yanaka

Main category: cs.CL

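TL;DR: This paper introduces SOBACO, a Japanese benchmark for evaluating social biases and cultural commonsense in LLMs, and finds that debiasing methods substantially degrade performance on cultural commonsense tasks, indicating a trade-off between fairness and utility.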

Details

Motivation: Existing debiasing methods may harm the capabilities of LLMs, and their effect on cultural commonsense, which is closely tied to social biases, has not been well studied. Method: Builds SOBACO, a unified benchmark for measuring social biases and cultural commonsense, and tests the effect of different debiasing methods on several LLMs. Result: Debiasing methods reduce accuracy on the cultural commonsense task by up to 75%. Conclusion: Debiasing methods can severely damage cultural commonsense; future work should develop debiasing strategies that preserve cultural commonsense while improving fairness. Abstract: Large language models (LLMs) exhibit social biases, prompting the development of various debiasing methods. However, debiasing methods may degrade the capabilities of LLMs. Previous research has evaluated the impact of bias mitigation primarily through tasks measuring general language understanding, which are often unrelated to social biases. In contrast, cultural commonsense is closely related to social biases, as both are rooted in social norms and values. The impact of bias mitigation on cultural commonsense in LLMs has not been well investigated. Considering this gap, we propose SOBACO (SOcial BiAs and Cultural cOmmonsense benchmark), a Japanese benchmark designed to evaluate social biases and cultural commonsense in LLMs in a unified format. We evaluate several LLMs on SOBACO to examine how debiasing methods affect cultural commonsense in LLMs. Our results reveal that the debiasing methods degrade the performance of the LLMs on the cultural commonsense task (up to 75% accuracy deterioration). These results highlight the importance of developing debiasing methods that consider the trade-off with cultural commonsense to improve fairness and utility of LLMs.

[145] A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems

Lasse Borgholt,Jakob Havtorn,Christian Igel,Lars Maaløe,Zheng-Hua Tan

Main category: cs.CL

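TL;DR: Proposes a new alignment algorithm that couples dynamic programming with beam search scoring to enable more precise error analysis for speech recognition.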

Details

Motivation: Existing speech recognition metrics such as word error rate tend to hide errors in rare words, named entities, and domain-specific vocabulary, so finer-grained error analysis is needed. Method: Proposes a new alignment algorithm that couples dynamic programming with beam search scoring, improving on conventional text alignment to align reference and model transcripts more precisely. Result: Compared with traditional methods, the new algorithm aligns individual errors more accurately and supports reliable fine-grained error analysis; it is released via PyPI. Conclusion: The algorithm makes error analysis of speech recognition systems more precise, helps expose important errors hidden by aggregate metrics, and supports model improvements on semantically important vocabulary. Abstract: Modern neural networks have greatly improved performance across speech recognition benchmarks. However, gains are often driven by frequent words with limited semantic weight, which can obscure meaningful differences in word error rate, the primary evaluation metric. Errors in rare terms, named entities, and domain-specific vocabulary are more consequential, but remain hidden by aggregate metrics. This highlights the need for finer-grained error analysis, which depends on accurate alignment between reference and model transcripts. However, conventional alignment methods are not designed for such precision. We propose a novel alignment algorithm that couples dynamic programming with beam search scoring. Compared to traditional text alignment methods, our approach provides more accurate alignment of individual errors, enabling reliable error analysis. The algorithm is made available via PyPI.
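
As background, the conventional alignment that this abstract contrasts itself with is typically a word-level edit-distance alignment computed by dynamic programming, which is also what word error rate is derived from. The sketch below shows only that baseline; it is not the proposed algorithm, it does not include the beam search scoring, and `align_words` is a hypothetical name.

```python
from typing import List, Optional, Tuple

Op = Tuple[str, Optional[str], Optional[str]]


def align_words(ref: List[str], hyp: List[str]) -> Tuple[List[Op], int]:
    """Levenshtein alignment of reference and hypothesis word lists.

    Returns the list of (operation, ref_word, hyp_word) steps and the total
    number of edits (substitutions + deletions + insertions).
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner to recover one optimal path.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            kind = "match" if ref[i - 1] == hyp[j - 1] else "substitute"
            ops.append((kind, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops, cost[n][m]


if __name__ == "__main__":
    steps, edits = align_words("the cat sat".split(), "the cat sat down".split())
    print(steps, edits, edits / 3)  # the last value is the word error rate
```

Several optimal paths often exist with the same edit count, and the backtrace above picks one by a fixed tie-breaking order. That arbitrariness is one reason per-error attribution from a plain edit-distance alignment can be unreliable.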

[146] Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models

Wenjie Fu,Huandong Wang,Junyao Gao,Guoan Wan,Tao Jiang

Main category: cs.CL

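TL;DR: This paper proposes Self-Sanitize, a new LLM safety mitigation framework inspired by cognitive psychology, which uses self-monitoring and self-repair modules to detect and correct harmful content in real time during generation, with a focus on privacy leakage, and provides effective protection at low latency.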

Details

Motivation: As LLMs are deployed across more applications, they may generate harmful or privacy-invasive content; existing approaches based on post-hoc filtering add high latency and do not support streaming generation, so a more efficient, real-time safety mechanism is needed. Method: The Self-Sanitize framework has two modules: a lightweight Self-Monitor module that continuously monitors the LLM's high-level intentions at the token level via representation engineering, and a Self-Repair module that corrects harmful content in place without a separate dialogue or reasoning pass. Result: Experiments on four LLMs across three privacy leakage scenarios show that Self-Sanitize clearly outperforms existing methods with almost no added latency or resource use, effectively preventing the generation of harmful content such as leaked private information. Conclusion: Self-Sanitize offers a practical, efficient LLM safety solution that is compatible with streaming generation and is especially suited to applications with strict real-time and privacy requirements. Abstract: As Large Language Models (LLMs) achieve remarkable success across a wide range of applications, such as chatbots and code copilots, concerns surrounding the generation of harmful content have come increasingly into focus. Despite significant advances in aligning LLMs with safety and ethical standards, adversarial prompts can still be crafted to elicit undesirable responses. Existing mitigation strategies are predominantly based on post-hoc filtering, which introduces substantial latency or computational overhead, and is incompatible with token-level streaming generation. In this work, we introduce Self-Sanitize, a novel LLM-driven mitigation framework inspired by cognitive psychology, which emulates human self-monitoring and self-repair behaviors during conversations. Self-Sanitize comprises a lightweight Self-Monitor module that continuously inspects high-level intentions within the LLM at the token level via representation engineering, and a Self-Repair module that performs in-place correction of harmful content without initiating separate review dialogues. This design allows for real-time streaming monitoring and seamless repair, with negligible impact on latency and resource utilization. Given that privacy-invasive content has often received insufficient attention in previous studies, we perform extensive experiments on four LLMs across three privacy leakage scenarios. The results demonstrate that Self-Sanitize achieves superior mitigation performance with minimal overhead and without degrading the utility of LLMs, offering a practical and robust solution for safer LLM deployments. Our code is available at https://github.com/wjfu99/LLM_Self_Sanitize

[147] GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training

Hongcheng Wang,Yinuo Huang,Sukai Wang,Guanghui Ren,Hao Dong

Main category: cs.CL

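TL;DR: This paper proposes GRPO-MA, which generates multiple answers per thought process to address gradient coupling, sparse rewards, and unstable advantage estimation in GRPO-based chain-of-thought training; theory and experiments both show that it reduces gradient variance and improves training efficiency and model performance.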

Details

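Motivation: When training chain-of-thought reasoning in large models, GRPO faces three key challenges: gradient coupling between thoughts and answers, sparse reward signals, and unstable advantage estimation, so its optimization stability and efficiency need to be improved. Method: Proposes GRPO-MA, which generates multiple answers from each thought process to decouple the gradients of thoughts and answers and improve the accuracy of advantage estimation, and proves that the variance of the thought advantage decreases as the number of answers per thought increases. Result: Theoretical analysis shows that GRPO-MA lowers gradient variance; experiments show that it substantially outperforms GRPO on math, code, and multimodal tasks, reduces gradient spikes, and improves training efficiency and model performance; ablation studies confirm the benefit of the multi-answer strategy. Conclusion: Through multi-answer generation, GRPO-MA mitigates GRPO's key weaknesses, delivers more stable and efficient reinforcement learning across tasks, and offers a better optimization scheme for chain-of-thought reasoning. Abstract: Recent progress, such as DeepSeek-R1, has shown that the GRPO algorithm, a Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models (VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling between thoughts and answers, sparse reward signals caused by limited parallel sampling, and unstable advantage estimation. To mitigate these challenges, we propose GRPO-MA, a simple yet theoretically grounded method that leverages multi-answer generation from each thought process, enabling more robust and efficient optimization. Theoretically, we show that the variance of thought advantage decreases as the number of answers per thought increases. Empirically, our gradient analysis confirms this effect, showing that GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and diverse multimodal tasks demonstrate that GRPO-MA substantially improves performance and training efficiency. Our ablation studies further reveal that increasing the number of answers per thought consistently enhances model performance.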
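
The intuition behind the variance claim can be checked with a small simulation. It is a toy model under stated assumptions rather than the paper's analysis: the value of a thought is estimated as the mean of binary answer rewards, each answer is correct independently with probability `success_prob`, and `thought_value_variance` is a hypothetical name. In this model the variance of the estimate is `p * (1 - p) / M` for `M` answers.

```python
import random
import statistics


def thought_value_variance(num_answers: int,
                           success_prob: float = 0.5,
                           trials: int = 2000,
                           seed: int = 0) -> float:
    """Empirical variance of a thought's estimated value.

    Assumption: the value is the mean of `num_answers` independent 0/1
    rewards, one per answer sampled from the same thought.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    estimates = []
    for _ in range(trials):
        rewards = [1.0 if rng.random() < success_prob else 0.0
                   for _ in range(num_answers)]
        estimates.append(sum(rewards) / num_answers)
    return statistics.pvariance(estimates)


if __name__ == "__main__":
    for m in (1, 4, 16):
        print(m, thought_value_variance(m))
```

With one answer per thought, the estimate is a single coin flip; averaging over more answers gives a steadier signal for the same thought, which is the effect the abstract attributes to multi-answer generation.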

[148] Knowledge Editing with Subspace-Aware Key-Value Mappings

Haewon Park,Sangwoo Kim,Yohan Jo

Main category: cs.CL

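TL;DR: This paper proposes Subspace Knowledge Edit (SUIT), a knowledge editing method for language models that modifies only the subspace of critical features relevant to the edit, reducing perturbation to the model while keeping edit efficacy high.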

Details

Motivation: Existing knowledge editing methods place no constraints on the key and value vectors and therefore perturb the model significantly, so a more precise, less invasive editing method is needed. Method: SUIT follows the locate-then-edit framework, identifies the subspace of critical features in the MLP layer that is relevant to the edit, and modifies only that subspace, limiting the scope of the edit's effect. Result: Experiments on LLaMA-3-8B, GPT-J-6B, and Qwen2.5-7B show that SUIT dramatically improves knowledge preservation over strong baselines while maintaining a high edit success rate. Conclusion: SUIT identifies and modifies the critical subspace relevant to an edit, updating knowledge effectively with less unnecessary perturbation, which supports the feasibility and advantages of subspace-level modification. Abstract: Knowledge editing aims to efficiently correct factual errors in Language Models (LMs). The popular locate-then-edit approach modifies an MLP layer by finding an optimal mapping between its input vector (key) and output vector (value) that leads to the expression of the edited knowledge. However, existing methods without any constraints on the key and value vectors cause significant perturbations to the edited model. To address this, we propose Subspace Knowledge Edit (SUIT), a method that identifies and modifies only the subspace of critical features relevant to the edit. Our empirical results on LLaMA-3-8B, GPT-J-6B, and Qwen2.5-7B models show that SUIT dramatically improves knowledge preservation over strong baselines while maintaining high edit efficacy. This effectiveness confirms that SUIT successfully identifies the critical subspace for the edit. Further analyses provide additional validation for our approach. The source code and data will be released to the public upon publication of the paper.

[149] Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings

Hamna,Gayatri Bhat,Sourabrata Mukherjee,Faisal Lalani,Evan Hadfield,Divya Siddarth,Kalika Bali,Sunayana Sitaram

Main category: cs.CL

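TL;DR: Proposes Samiksha, a community-driven evaluation approach developed with civil-society organizations and community members, enabling culturally sensitive and inclusive evaluation of multilingual LLMs in the health domain in India.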

Details

Motivation: Existing LLM evaluations do not account for the lived realities of end users; critical domains such as healthcare in particular need evaluations that reflect communities' actual needs, cultural practices, and nuanced contexts. Method: Co-creates Samiksha, a community-driven evaluation pipeline, with civil-society organizations and community members; community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored, enabling scalable, automated benchmarking. Result: Applied to the health domain in India, the approach shows how well current multilingual LLMs answer nuanced community health queries and provides a scalable, contextually grounded, and inclusive pathway for LLM evaluation. Conclusion: Samiksha offers a more culturally sensitive and socially relevant evaluation paradigm, moving LLM evaluation from artificial or simulated tasks toward dynamic, collaborative evaluation grounded in real community needs. Abstract: Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware, community-driven pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how current multilingual LLMs address nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.

[150] AdaThink-Med: Medical Adaptive Thinking with Uncertainty-Guided Length Calibration

Shaohao Rui,Kaitao Chen,Weijie Ma,Xiaosong Wang

Main category: cs.CL

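TL;DR: AdaThink-Med is the first end-to-end framework for strengthening adaptive reasoning in medical LLMs; using uncertainty-guided length calibration, it substantially shortens reasoning chains on simple questions while largely preserving performance, balancing efficiency and effectiveness.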

Details

Motivation: 现有医学大语言模型在推理时对所有问题都进行长链思维，导致计算成本高，缺乏根据问题难度自适应调整推理深度的端到端方法。 Method: 提出AdaThink-Med框架：生成多个候选输出，评估其正确性和不确定性，通过不确定性引导的长度校准模块估计问题难度；对简单正确的问题惩罚长推理路径，对困难错误的问题鼓励扩展思维链。 Result: 在六个公开医学问答基准上，AdaThink-Med平均减少6.4倍推理长度，性能仅有轻微下降；模型自发形成“不思考”和“思考”两种模式，动态抑制冗余推理。 Conclusion: AdaThink-Med有效实现了医学LLMs的自适应推理，在显著降低计算成本的同时保持了高性能，为实际应用中的效率优化提供了可行方案。 Abstract: Recent advances in inference time scaling with extended long chain-of thought have significantly improved the reasoning capabilities of both general and medical large language models (LLMs). However, these models tend to engage in lengthy reasoning processes regardless of the difficulty of the input question, leading to increased inference costs in real-world applications. Therefore, enabling adaptive thinking where models think less for simpler questions and think more for complex ones is critical for the effective use of medical LLMs in practice. Despite its importance, there is a lack of end-to-end approaches designed to enhance the adaptive thinking capabilities of medical LLMs while providing a comprehensive examination of the trade-off between performance and computational cost. To bridge this gap, we propose AdaThink-Med, the first end-to-end framework designed to enhance adaptive thinking ability in medical reasoning models with uncertainty-guided length calibration. AdaThink-Med first generates multiple candidate outputs for each question, evaluates the correctness and uncertainty of each candidate, and then estimates problem difficulty via an uncertainty-guided length calibration module. For outputs with low difficulty and correct answers, the framework penalizes longer reasoning paths; whereas for those with high difficulty and incorrect answers, it encourages extending the chain of thought to explore alternative solutions. On six public medical QA benchmarks, AdaThink-Med achieves up to 6.4x length reduction on average while retaining performance with only minimal degradation. 
Intriguingly, we observe that AdaThink-Med spontaneously develops two distinct reasoning modes, which we characterize as "non-thinking" and "thinking", demonstrating the model's ability to suppress redundant reasoning processes dynamically.
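
The length calibration described in this abstract can be illustrated with a small reward-shaping sketch. It is not the authors' implementation: the `Candidate` fields, the difficulty formula (an equal blend of error rate and mean uncertainty), the thresholds, and the weight `alpha` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    correct: bool        # whether the sampled answer matches the reference
    uncertainty: float   # assumed to be normalized to [0, 1]
    length: int          # number of reasoning tokens


def estimate_difficulty(cands):
    """Blend error rate and mean uncertainty (hypothetical formula)."""
    error_rate = 1.0 - sum(c.correct for c in cands) / len(cands)
    mean_unc = sum(c.uncertainty for c in cands) / len(cands)
    return 0.5 * error_rate + 0.5 * mean_unc


def shaped_reward(c, difficulty, max_len, easy=0.3, hard=0.7, alpha=0.5):
    """Correctness reward adjusted by a length term that depends on difficulty."""
    base = 1.0 if c.correct else 0.0
    rel_len = min(c.length / max_len, 1.0)
    if c.correct and difficulty <= easy:
        return base - alpha * rel_len   # easy and correct: penalize long chains
    if not c.correct and difficulty >= hard:
        return base + alpha * rel_len   # hard and wrong: encourage longer chains
    return base
```

With these assumed settings, two correct candidates on an easy question are ranked by brevity, while two wrong candidates on a hard question are ranked by how far they extended their reasoning.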

[151] Inducing Dyslexia in Vision Language Models

Melika Honarmand,Ayati Sharma,Badr AlKhamissi,Johannes Mehrer,Martin Schrimpf

Main category: cs.CL

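TL;DR: This study uses large-scale vision-language models (VLMs) to simulate dyslexia. By identifying and perturbing units analogous to the visual word form area, it reproduces key characteristics of dyslexia, in particular phonological deficits, and offers a new computational framework for studying the disorder.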

Details

Motivation: Traditional behavioral and neuroimaging methods are limited in their ability to test causal hypotheses about the mechanisms of dyslexia, so new modeling approaches are needed to probe its underlying causes. Method: Using stimuli from cognitive neuroscience, the authors identify visual-word-form-selective units in VLMs and ablate them in a targeted manner, comparing the effect against ablation of random units. Result: Ablating the visual-word-form-selective units produces selective impairments in reading tasks while general visual and language comprehension remain intact; the model shows phonological deficits similar to those of dyslexic humans, without a significant change in orthographic processing. Conclusion: The model replicates core characteristics of dyslexia, demonstrating the potential of VLMs as a computational tool for studying neurodevelopmental disorders such as dyslexia. Abstract: Dyslexia, a neurodevelopmental disorder characterized by persistent reading difficulties, is often linked to reduced activity of the visual word form area in the ventral occipito-temporal cortex. Traditional approaches to studying dyslexia, such as behavioral and neuroimaging methods, have provided valuable insights but remain limited in their ability to test causal hypotheses about the underlying mechanisms of reading impairments. In this study, we use large-scale vision-language models (VLMs) to simulate dyslexia by functionally identifying and perturbing artificial analogues of word processing. Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that targeted ablation of these units, unlike ablation of random units, leads to selective impairments in reading tasks while general visual and language comprehension abilities remain intact. In particular, the resulting model matches dyslexic humans' phonological deficits without a significant change in orthographic processing. Taken together, our modeling results replicate key characteristics of dyslexia and establish a computational framework for investigating reading disorders.

[152] HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition

Gio Paik,Yongbeom Kim,Soungmin Lee,Sangmin Ahn,Chanwoo Kim

Main category: cs.CL

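TL;DR: This paper presents HiKE, the first globally accessible benchmark for Korean-English code-switching speech recognition, designed to systematically evaluate multilingual ASR models at different levels of code-switching and to foster research in the field.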

Details

Motivation: Despite progress in multilingual automatic speech recognition, code-switching, which is common in everyday speech, remains a severely underexplored challenge. Method: The authors build a high-quality, natural Korean-English code-switching dataset with meticulous loanword labels and a hierarchical code-switching labeling scheme (word, phrase, and sentence levels), and validate it by evaluating and fine-tuning a range of multilingual ASR models. Result: Most multilingual ASR models struggle with code-switching without fine-tuning, but fine-tuning on code-switching data enables this capability. Conclusion: HiKE provides a reliable evaluation framework for Korean-English code-switching, supports progress in multilingual ASR for real-world speech, and will be publicly available. Abstract: Despite advances in multilingual automatic speech recognition (ASR), code-switching (CS), the mixing of languages within an utterance common in daily speech, remains a severely underexplored challenge. In this paper, we introduce HiKE: the Hierarchical Korean-English code-switching benchmark, the first globally accessible evaluation framework for Korean-English CS, aiming to provide a means for the precise evaluation of multilingual ASR models and to foster research in the field. The proposed framework not only consists of high-quality, natural CS data across various topics, but also provides meticulous loanword labels and a hierarchical CS-level labeling scheme (word, phrase, and sentence) that together enable a systematic evaluation of a model's ability to handle each distinct level of code-switching. Through evaluations of diverse multilingual ASR models and fine-tuning experiments, this paper demonstrates that while most multilingual ASR models initially struggle with CS-ASR, this capability can be enabled through fine-tuning with CS data. HiKE will be available at https://github.com/ThetaOne-AI/HiKE.

[153] Hype or not? Formalizing Automatic Promotional Language Detection in Biomedical Research

Bojan Batalo,Erica K. Shimomoto,Neil Millar

Main category: cs.CL

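TL;DR: This paper is the first to frame automatic detection of promotional language ("hype") as a natural language processing task. It defines hype as hyperbolic or subjective language that authors use to glamorize or exaggerate their research, formalizes annotation guidelines, and applies them to NIH grant application text. Experiments show that the guidelines help humans reliably identify hype adjectives and that machine learning models trained on the annotated data yield promising results, although the task remains linguistically complex and may require domain knowledge and temporal awareness.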

Details

Motivation: Promotional language (hype) is increasing in scientific writing and can undermine objective evaluation of evidence, impede research development, and erode public trust in science, so automatic detection of such language is needed. Method: The authors define hype, propose formalized annotation guidelines, manually annotate a portion of the NIH grant application corpus, and run automatic detection experiments with traditional text classifiers and language models, comparing them against a human baseline. Result: Formalized guidelines help humans annotate candidate hype adjectives reliably, and machine learning models trained on the annotated data yield promising results; however, the task is linguistically complex and may depend on domain knowledge and the temporal status of facts. Conclusion: This is the first work to treat hype detection as an NLP task; it shows that building annotated data and applying models is feasible, and it highlights the need for future work to incorporate domain knowledge and temporal awareness. Abstract: In science, promotional language ('hype') is increasing and can undermine objective evaluation of evidence, impede research development, and erode trust in science. In this paper, we introduce the task of automatic detection of hype, which we define as hyperbolic or subjective language that authors use to glamorize, promote, embellish, or exaggerate aspects of their research. We propose formalized guidelines for identifying hype language and apply them to annotate a portion of the National Institutes of Health (NIH) grant application corpus. We then evaluate traditional text classifiers and language models on this task, comparing their performance with a human baseline. Our experiments show that formalizing annotation guidelines can help humans reliably annotate candidate hype adjectives and that using our annotated dataset to train machine learning models yields promising results. Our findings highlight the linguistic complexity of the task, and the potential need for domain knowledge and temporal awareness of the facts. While some linguistic works address hype detection, to the best of our knowledge, we are the first to approach it as a natural language processing task.

[154] InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation

Weilin Zhao,Zihan Zhou,Zhou Su,Chaojun Xiao,Yuxuan Li,Yanghao Li,Yudi Zhang,Weilun Zhao,Zhen Li,Yuxiang Huang,Ao Sun,Xu Han,Zhiyuan Liu

Main category: cs.CL

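TL;DR: This paper proposes InfLLM-V2, a trainable dense-sparse switchable attention framework that addresses the compute and memory bottlenecks Transformer models face when processing long sequences.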

Details

Motivation: Self-attention in the standard Transformer faces compute and memory bottlenecks on long sequences, and existing sparse attention methods introduce too many extra parameters and disrupt the pretrain-then-finetune workflow, leading to slow convergence and difficulty in acceleration. Method: InfLLM-V2 reuses dense attention parameters through a parameter-free architecture modification, applies dense attention to short sequences and transitions smoothly to sparse attention for long ones, and comes with an efficient implementation that reduces computational overhead. Result: InfLLM-V2 is 4x faster than dense attention while retaining 98.1% and 99.7% of the performance on long-context understanding and chain-of-thought reasoning, respectively; the MiniCPM4.1 model was trained with this framework and open-sourced. Conclusion: InfLLM-V2 adapts models seamlessly from short to long sequences, substantially improving efficiency while preserving performance, and supports reproducible research. Abstract: Long-sequence processing is a critical capability for modern large language models. However, the self-attention mechanism in the standard Transformer architecture faces severe computational and memory bottlenecks when processing long sequences. While trainable sparse attention methods offer a promising solution, existing approaches such as NSA introduce excessive extra parameters and disrupt the conventional \textit{pretrain-on-short, finetune-on-long} workflow, resulting in slow convergence and difficulty in acceleration. To overcome these limitations, we introduce a dense-sparse switchable attention framework, termed InfLLM-V2. InfLLM-V2 is a trainable sparse attention method that seamlessly adapts models from short to long sequences. Specifically, InfLLM-V2 reuses dense attention parameters through parameter-free architecture modification, maintaining consistency between short and long sequence processing. Additionally, InfLLM-V2 ensures computational efficiency across all sequence lengths, by using dense attention for short inputs and smoothly transitioning to sparse attention for long sequences. To achieve practical acceleration, we further introduce an efficient implementation of InfLLM-V2 that significantly reduces the computational overhead. Our experiments on long-context understanding and chain-of-thought reasoning demonstrate that InfLLM-V2 is 4$\times$ faster than dense attention while retaining 98.1% and 99.7% of the performance, respectively.
Based on the InfLLM-V2 framework, we have trained and open-sourced MiniCPM4.1 (https://huggingface.co/openbmb/MiniCPM4.1-8B), a hybrid reasoning model, providing a reproducible implementation for the research community.
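
The dense-to-sparse switch described in this abstract can be sketched for a single query and a single head. This is a minimal illustration, not the InfLLM-V2 kernel: the length threshold `dense_limit`, the mean-key block scoring, and the `top_blocks` budget are assumptions, and causal masking is omitted. The point it shows is that both paths share the same keys and values, so no extra parameters are needed.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values, idx):
    """Scaled dot-product attention restricted to the key positions in idx."""
    idx = list(idx)
    weights = softmax([dot(q, keys[i]) / math.sqrt(len(q)) for i in idx])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]


def switchable_attention(q, keys, values, dense_limit=8, block=4, top_blocks=2):
    n = len(keys)
    if n <= dense_limit:
        # Short input: ordinary dense attention over every position.
        return attend(q, keys, values, range(n))

    # Long input: score each block by its mean key and keep the best blocks.
    def block_score(start):
        blk = keys[start:start + block]
        mean_key = [sum(k[d] for k in blk) / len(blk) for d in range(len(q))]
        return dot(q, mean_key)

    starts = list(range(0, n, block))
    chosen = sorted(sorted(starts, key=block_score, reverse=True)[:top_blocks])
    idx = [i for s in chosen for i in range(s, min(s + block, n))]
    return attend(q, keys, values, idx)
```

When the block budget covers every block, the sparse path reduces to dense attention, which is a convenient sanity check for any implementation of this idea.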

[155] Understanding the Dilemma of Unlearning for Large Language Models

Qingjie Zhang,Haoting Qian,Zhicong Huang,Cheng Hong,Minlie Huang,Ke Xu,Chao Zhang,Han Qiu

Main category: cs.CL

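TL;DR: This paper proposes unPact, an interpretable framework for analyzing unlearning in large language models. It finds that existing unlearning methods either fail to remove knowledge thoroughly or damage overall model performance through catastrophic forgetting, revealing a dilemma for unlearning techniques.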

Details

Motivation: Because knowledge is hard to trace through the complex architectures of large language models, existing unlearning methods lack interpretability analyses, so an interpretable framework is needed to understand the unlearning mechanism. Method: The unPact framework uses prompt attribution and contribution tracking to quantify each prompt token's influence on the output, compares the results before and after unlearning, and is validated across six mainstream unlearning methods, three LLMs, and three benchmarks. Result: Unlearning appears effective because it disrupts the model's focus on keywords; much of the knowledge can still be recovered by emphasizing those keywords; and catastrophic forgetting arises from indiscriminate penalization of all tokens. Conclusion: Existing unlearning methods face a dilemma: either the knowledge is not truly erased or general capabilities are damaged too severely, leaving a gap to reliable unlearning. Abstract: Unlearning seeks to remove specific knowledge from large language models (LLMs), but its effectiveness remains contested. On one side, "forgotten" knowledge can often be recovered through interventions such as light fine-tuning; on the other side, unlearning may induce catastrophic forgetting that degrades general capabilities. Despite active exploration of unlearning methods, interpretability analyses of the mechanism are scarce due to the difficulty of tracing knowledge in LLMs' complex architectures. We address this gap by proposing unPact, an interpretable framework for unlearning via prompt attribution and contribution tracking. Typically, it quantifies each prompt token's influence on outputs, enabling pre- and post-unlearning comparisons to reveal what changes. Across six mainstream unlearning methods, three LLMs, and three benchmarks, we find that: (1) Unlearning appears to be effective by disrupting focus on keywords in the prompt; (2) Much of the knowledge is not truly erased and can be recovered by simply emphasizing these keywords in prompts, without modifying the model's weights; (3) Catastrophic forgetting arises from indiscriminate penalization of all tokens. Taken together, our results suggest an unlearning dilemma: existing methods tend either to be insufficient - knowledge remains recoverable by keyword emphasis, or overly destructive - general performance collapses due to catastrophic forgetting, still leaving a gap to reliable unlearning.

[156] Reference-Free Rating of LLM Responses via Latent Information

Leander Girrbach,Chi-Ping Su,Tankred Saanum,Richard Socher,Eric Schulz,Zeynep Akata

Main category: cs.CL

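TL;DR: The paper studies how reliable LLM-as-a-judge ratings are without references, and proposes and evaluates "Latent Judges", which derive ratings from internal model signals; these outperform standard prompting on pairwise accuracy and listwise ranking.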

Details

Motivation: Single-response LLM judge ratings without references are unstable and poorly calibrated. Method: The authors propose three rating methods based on internal model signals: probability-weighted scores, verifier-style probabilities of "yes", and linear probes trained on activations, and evaluate them on multiple benchmarks. Result: Latent methods perform better on pairwise accuracy and listwise ranking; probability-weighted scores achieve the strongest single-rating correlations, and linear probes remain useful when output logits are miscalibrated. Conclusion: Latent information provides more stable and more discriminative signals for reference-free evaluation and can improve selection and training approaches such as Best-of-N and multi-teacher distillation. Abstract: How reliable are single-response LLM-as-a-judge ratings without references, and can we obtain fine-grained, deterministic scores in this setting? We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses and show two systematic issues: scores are unstable under sampling and poorly calibrated, leading to compression near the top of the scale and frequent ties. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals: (i) probability-weighted scores over integer ratings, (ii) verifier-style probabilities of "yes", and (iii) linear probes trained on model activations at the rating position. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting, with consistent gains on pairwise accuracy and listwise ranking relevant to Best-of-N selection. Probability-weighted scores achieve the strongest single-rating correlations, while probes recover useful signals when output logits are miscalibrated. These results indicate that latent information provides deterministic and more discriminative signals for reference-free evaluation, and can improve selection and training approaches like Best-of-$N$, multi-teacher distillation, and routing.
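
The first two latent scoring rules named in this abstract reduce to short formulas over the judge's next-token logits. The sketch below assumes the logits for the rating tokens "1" to "5" (or for "yes" and "no") have already been extracted from the model; the function names are illustrative, not the authors' API.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probability_weighted_score(rating_logits, ratings=(1, 2, 3, 4, 5)):
    """Expected rating under the judge's distribution over rating tokens."""
    probs = softmax(rating_logits)
    return sum(r * p for r, p in zip(ratings, probs))


def verifier_yes_probability(yes_logit, no_logit):
    """Probability of "yes" after renormalizing over the two answer tokens."""
    return softmax([yes_logit, no_logit])[0]
```

Because the score is an expectation rather than a sampled token, two responses whose most likely rating is the same integer can still receive different scalar scores, which is how this kind of rule breaks ties.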

[157] MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

Guibin Zhang,Muxin Fu,Shuicheng Yan

Main category: cs.CL

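TL;DR: This paper proposes MemGen, a dynamic generative memory framework in which a memory trigger and a memory weaver let LLM agents invoke and augment latent memory during reasoning. It clearly outperforms existing external memory systems and spontaneously develops human-like memory faculties.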

Details

Motivation: Existing parametric and retrieval-based memory paradigms fail to capture the dynamic interweaving of reasoning and memory in human cognition, so an agent memory architecture closer to human cognitive mechanisms is needed. Method: MemGen consists of a memory trigger, which monitors the reasoning state to decide whether to invoke memory, and a memory weaver, which takes the current state as stimulus and generates a latent token sequence as machine-native memory, tightly integrating memory with reasoning. Result: Across eight benchmarks, MemGen surpasses ExpeL and AWM by up to 38.22%, exceeds GRPO by up to 13.44%, and shows strong cross-domain generalization; without explicit supervision it spontaneously develops human-like memory faculties, including planning memory, procedural memory, and working memory. Conclusion: MemGen couples memory and reasoning tightly, delivering strong performance and pointing to a possible path toward more naturalistic machine cognition. Abstract: Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a \textit{memory trigger}, which monitors the agent's reasoning state to decide explicit memory invocation, and a \textit{memory weaver}, which takes the agent's current state as stimulus to construct a latent token sequence as machine-native memory to enrich its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to $38.22\%$, exceeds GRPO by up to $13.44\%$, and exhibits strong cross-domain generalization ability. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.

[158] Socratic-Zero : Bootstrapping Reasoning via Data-Free Agent Co-evolution

Shaobo Wang,Zhengbo Jiao,Zifan Zhang,Yilang Peng,Xu Ze,Boyu Yang,Wei Wang,Hu Wei,Linfeng Zhang

Main category: cs.CL

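TL;DR: This paper proposes Socratic-Zero, a fully autonomous data synthesis framework that requires no human annotation and is built on the co-evolution of three agents. Through closed-loop interaction among a Teacher, a Solver, and a Generator, it generates high-quality reasoning data automatically, clearly outperforms prior methods on several mathematical reasoning tasks, and even lets smaller models surpass top commercial LLMs.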

Details

Motivation: Existing data synthesis methods suffer from inconsistent data quality and cannot adapt dynamically to a model's evolving capabilities, so they provide suboptimal training signals; they also rely on large amounts of human-annotated data, which is costly to scale. Method: The framework is a closed loop of three agents: the Solver learns from preference feedback on successful and failed reasoning trajectories; the Teacher adaptively designs more challenging questions based on the Solver's weaknesses; and the Generator distills the Teacher's question-design strategy to produce high-quality curriculum data at scale. The process starts from only 100 seed questions and needs no pre-existing tasks or labels. Result: Socratic-Solver-8B gains an average of 20.2 percentage points over prior data synthesis methods across seven mathematical reasoning benchmarks; student models trained on data from Socratic-Generator-32B outperform several SOTA commercial LLMs, including Qwen3-235B, GPT-5, Gemini, and Claude. Conclusion: Socratic-Zero achieves fully autonomous, scalable generation of high-quality reasoning training data, shows the potential for models to improve themselves without human intervention, and offers an effective way to reduce dependence on human-annotated data. Abstract: Recent breakthroughs in large language models (LLMs) on reasoning tasks rely heavily on massive, high-quality datasets, which are typically human-annotated and thus difficult to scale. While data synthesis or distillation offers a promising alternative, existing methods struggle with inconsistent data quality and an inability to dynamically adapt to the evolving capabilities of the model, leading to suboptimal training signals. To address these limitations, we introduce Socratic-Zero, a fully autonomous framework that generates high-quality training data from minimal seed examples through the co-evolution of three agents: the Teacher, the Solver, and the Generator. The Solver continuously refines its reasoning by learning from preference feedback on both successful and failed trajectories; the Teacher adaptively crafts increasingly challenging questions based on the Solver's weaknesses; and the Generator distills the Teacher's question-design strategy to enable scalable, high-fidelity curriculum generation. This closed-loop system produces a self-improving curriculum that requires no pre-existing tasks or labels. Remarkably, starting from only 100 seed questions, our Socratic-Solver-8B achieves an average gain of +20.2 percentage points over prior data synthesis methods across seven mathematical reasoning benchmarks (AMC23, AIME24-25, Olympiad, MATH-500, Minerva, and GSM8K), with consistent gains on both Qwen3 and GLM4 series models.
Even more surprisingly, synthetic data from Socratic-Generator-32B enables student LLMs to achieve superior performance compared to other state-of-the-art (SOTA) commercial LLMs on these benchmarks, including Qwen3-235B-A22B, DeepSeek-V3.1-671B, GPT-5, Gemini-2.5-Pro, Grok-4, and Claude-4.1-Opus.

[159] ProxyAttn: Guided Sparse Attention via Representative Heads

Yixuan Wang,Huang He,Siqi Bao,Hua Wu,Haifeng Wang,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

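TL;DR: This paper proposes ProxyAttn, a training-free sparse attention algorithm that compresses the attention-head dimension and estimates budgets dynamically to evaluate block importance efficiently and precisely, substantially speeding up LLM inference on long-text tasks while keeping performance stable.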

Details

Motivation: The quadratic complexity of attention limits LLM efficiency on long-text tasks, and existing block sparse attention methods degrade noticeably at high sparsity because their importance estimates are coarse-grained, so a finer-grained and efficient sparsification method is needed. Method: Exploiting the similarity among attention heads, the method uses the scores of pooled representative heads (proxy heads) to approximate the scores of all heads, and combines them with a block-aware dynamic budget allocation strategy to obtain fine-grained block importance estimates at low computational cost. Result: Experiments on a variety of mainstream models and benchmarks confirm the similarity among attention heads; ProxyAttn achieves up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss. Conclusion: ProxyAttn is an efficient, training-free sparse attention method whose fine-grained importance estimation greatly improves long-text processing efficiency while preserving model performance, with good generality and application prospects. Abstract: The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their coarse-grained estimation inevitably leads to performance degradation at high sparsity rates. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads, we use the scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads. Leveraging a fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss. Our code is available at https://github.com/wyxstriker/ProxyAttn.
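
The proxy-head idea can be sketched for one query position. This is a simplified illustration under stated assumptions, not the released ProxyAttn code: heads are pooled by plain averaging, block importance is the softmax mass that the proxy head assigns to each block, and the per-head budget is modeled as a `coverage` threshold on cumulative mass. All function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vec(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def proxy_block_mass(q_heads, k_heads, block):
    """Attention mass per key block, computed once with a pooled proxy head.

    q_heads: one query vector per head; k_heads: [head][position][dim].
    """
    q_proxy = mean_vec(q_heads)
    n = len(k_heads[0])
    k_proxy = [mean_vec([kh[p] for kh in k_heads]) for p in range(n)]
    scale = math.sqrt(len(q_proxy))
    probs = softmax([sum(a * b for a, b in zip(q_proxy, k)) / scale for k in k_proxy])
    return [sum(probs[s:s + block]) for s in range(0, n, block)]


def select_blocks(block_mass, coverage):
    """Keep the highest-mass blocks until the requested coverage is reached."""
    order = sorted(range(len(block_mass)), key=lambda b: block_mass[b], reverse=True)
    kept, total = [], 0.0
    for b in order:
        kept.append(b)
        total += block_mass[b]
        if total >= coverage:
            break
    return sorted(kept)
```

A head with concentrated attention can be given a low coverage target and keeps few blocks, while a head with diffuse attention gets a higher target and keeps more; the block scores themselves are computed only once for the pooled head.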

[160] LatentEvolve: Self-Evolving Test-Time Scaling in Latent Space

Guibin Zhang,Fanci Meng,Guancheng Wan,Zherui Li,Kun Wang,Zhenfei Yin,Lei Bai,Shuicheng Yan

Main category: cs.CL

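TL;DR: The paper proposes LatentEvolve, a self-evolving latent test-time scaling framework inspired by complementary learning systems theory. By mimicking the human brain's fast daytime retrieval and slow nighttime consolidation, it substantially improves LLM reasoning without supervision.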

Details

Motivation: Existing test-time scaling methods operate independently and cannot progressively learn to scale test-time computation more effectively, so the goal is to let LLMs learn "how to scale test-time computation". Method: Based on complementary learning systems theory, LatentEvolve is a two-phase evolutionary framework with daytime scaling (fast retrieval of historical latent representations) and nighttime scaling (slow consolidation of past latent optimizations), enabling test-time computation to evolve on its own. Result: Experiments on eight benchmarks and five model backbones show that LatentEvolve surpasses state-of-the-art TTS methods such as LatentSeek and TTRL by up to 13.33% and exhibits strong cross-domain and cross-backbone generalization. Conclusion: By mimicking human cognitive dynamics, LatentEvolve lets LLMs evolve their test-time computation efficiently and offers a new paradigm for improving model reasoning. Abstract: Test-time Scaling (TTS) has been demonstrated to significantly enhance the reasoning capabilities of Large Language Models (LLMs) during the inference phase without altering model parameters. However, existing TTS methods are largely independent, implying that LLMs have not yet evolved to progressively learn how to scale more effectively. With the objective of evolving LLMs to learn "how to scale test-time computation," we propose LatentEvolve, a self-evolving latent TTS framework inspired by the complementary learning system (CLS) theory. Analogous to the human brain's dual system of a fast-recall hippocampus and a slow-consolidating neocortex, LatentEvolve comprises two evolutionary components: \textit{daytime scaling}, which rapidly retrieves historical latent representations to better guide current LLM reasoning; and \textit{nighttime scaling}, which integrates past latent optimizations in a manner akin to the human brain's consolidation of experiences during sleep. The alternation of daytime and nighttime processes facilitates a fast and slow evolution of LLM TTS, mirroring human cognitive dynamics in a fully unsupervised manner. Extensive experiments across eight benchmarks and five model backbones demonstrate that our LatentEvolve surpasses state-of-the-art TTS methods such as LatentSeek and TTRL by up to $13.33\%$ and exhibits exceptional cross-domain and cross-backbone generalization.

[161] SeaPO: Strategic Error Amplification for Robust Preference Optimization of Large Language Models

Jun Rao,Yunjie Liao,Xuebo Liu,Zepeng Lin,Lian Lian,Dong Jin,Shengjun Cheng,Jun Yu,Min Zhang

Main category: cs.CL

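TL;DR: This paper proposes SeaPO, a strategic error amplification method that injects three error types common in LLMs to widen the gap between positive and negative samples in preference optimization, improving model performance across several capability dimensions, with gains of 5-10 percentage points in truthfulness.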

Details

Motivation: Existing preference-based alignment methods are hard to optimize because the quality of positive and negative samples tends to converge, which limits performance gains. Method: SeaPO strategically introduces specific error patterns into negative samples, drawing on three error types common in LLMs, and uses preference-based training to reduce the occurrence of these errors. Result: On models from 1.5B to 14B parameters, SeaPO significantly improves overall performance across five capability dimensions, most notably truthfulness; different error types have different effects, and mixing error types yields broader improvements. Conclusion: Strategic error amplification makes preference learning more effective and improves overall LLM performance, with the largest gains on related tasks when the most common error types are injected. Abstract: Existing alignment methods for preference optimization of large language models (LLMs) aim to enhance model performance by utilizing pairs of positive and negative samples. However, due to the limited capacity of models in scoring or generating responses, the quality of positive and negative samples may become similar during training, which complicates optimization for preference learning. To address this issue, we introduce SeaPO, a Strategic Error Amplification method that leverages three error types commonly occurring in LLMs to introduce specific error patterns into the model Preference Optimization. This strategy ensures that negative samples are more erroneous than positive samples and preference-based training is employed to mitigate the occurrence of these errors, thereby enhancing model performance. Evaluations across five capability dimensions and different model scales (1.5B to 14B) demonstrate that the generated data significantly improved overall model performance, particularly in terms of truthfulness, with improvements of 5-10 percentage points observed. Further analysis reveals that task performance varies depending on the error types introduced. Injecting the most common error types improves performance in related tasks, while a mix of error types leads to a broader performance enhancement: most tasks show stable improvements, while a few tasks exhibit significant gains.

[162] Evaluating Spatiotemporal Consistency in Automatically Generated Sewing Instructions

Luisa Geiger,Mareike Hartmann,Michael Sullivan,Alexander Koller

Main category: cs.CL

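TL;DR: The paper proposes an automatic tree-based metric for evaluating LLM-generated step-by-step assembly instructions; in the sewing-instruction domain it correlates better with human judgments and is more robust than traditional metrics.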

Details

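Motivation: Traditional metrics such as BLEU and BERT similarity do not accurately reflect the spatiotemporal aspects of construction in generated instructions, so their scores diverge from human judgment. Method: The authors design a tree-based evaluation method that models generated step-by-step assembly instructions, and validate on sewing instructions how well it correlates with manually annotated error counts and human quality ratings. Result: The metric correlates better with manual error counts and human quality ratings than traditional text-similarity metrics, and it is more robust against carefully constructed counterfactual examples. Conclusion: The proposed tree-based metric is more reliable and practical than traditional methods for evaluating LLM-generated assembly instructions with spatiotemporal dependencies. Abstract: In this paper, we propose a novel, automatic tree-based evaluation metric for LLM-generated step-by-step assembly instructions, that more accurately reflects spatiotemporal aspects of construction than traditional metrics such as BLEU and BERT similarity scores. We apply our proposed metric to the domain of sewing instructions, and show that our metric better correlates with manually-annotated error counts as well as human quality ratings, demonstrating our metric's superiority for evaluating the spatiotemporal soundness of sewing instructions. Further experiments show that our metric is more robust than traditional approaches against artificially-constructed counterfactual examples that are specifically constructed to confound metrics that rely on textual similarity.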

[163] KnowGuard: Knowledge-Driven Abstention for Multi-Round Clinical Reasoning

Xilin Dang,Kexin Chen,Xiaorui Su,Ayush Noori,Iñaki Arango,Lucas Vittor,Xinyi Long,Yuyang Du,Marinka Zitnik,Pheng Ann Heng

Main category: cs.CL

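TL;DR: This paper proposes KnowGuard, an "investigate-before-abstain" paradigm that improves the abstention mechanism of large language models in clinical decision-making by systematically exploring medical knowledge graphs.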

Details

Motivation: Existing LLMs struggle to abstain when information is insufficient and often give overconfident answers; there is no systematic method for identifying knowledge boundaries using external medical evidence. Method: KnowGuard has two stages: an evidence discovery stage explores the medical knowledge space through graph expansion and direct retrieval, and an evidence evaluation stage ranks evidence using multiple factors and adapts the exploration strategy to the patient context and conversation history. Result: On open-ended multi-round clinical benchmarks, KnowGuard improves diagnostic accuracy by 3.93% and reduces unnecessary interaction by 7.27 turns on average. Conclusion: By integrating systematic knowledge graph exploration, KnowGuard clearly improves the abstention quality and diagnostic efficiency of LLMs in clinical settings. Abstract: In clinical practice, physicians refrain from making decisions when patient information is insufficient. This behavior, known as abstention, is a critical safety mechanism preventing potentially harmful misdiagnoses. Recent investigations have reported the application of large language models (LLMs) in medical scenarios. However, existing LLMs struggle with abstention, frequently providing overconfident responses despite incomplete information. This limitation stems from conventional abstention methods relying solely on model self-assessments, which lack systematic strategies to identify knowledge boundaries with external medical evidence. To address this, we propose \textbf{KnowGuard}, a novel \textit{investigate-before-abstain} paradigm that integrates systematic knowledge graph exploration for clinical decision-making. Our approach consists of two key stages operating on a shared contextualized evidence pool: 1) an evidence discovery stage that systematically explores the medical knowledge space through graph expansion and direct retrieval, and 2) an evidence evaluation stage that ranks evidence using multiple factors to adapt exploration based on patient context and conversation history. This two-stage approach enables systematic knowledge graph exploration, allowing models to trace structured reasoning paths and recognize insufficient medical evidence. We evaluate our abstention approach using open-ended multi-round clinical benchmarks that mimic realistic diagnostic scenarios, assessing abstention quality through accuracy-efficiency trade-offs beyond existing closed-form evaluations.
Experimental evidence clearly demonstrates that KnowGuard outperforms state-of-the-art abstention approaches, improving diagnostic accuracy by 3.93\% while reducing unnecessary interaction by 7.27 turns on average.

[164] DiaCDM: Cognitive Diagnosis in Teacher-Student Dialogues using the Initiation-Response-Evaluation Framework

Rui Jia,Yuang Wei,Ruijia Li,Yuang-Hao Jiang,Xinyu Xie,Yaomin Shen,Min Zhang,Bo Jiang

Main category: cs.CL

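TL;DR: This paper proposes DiaCDM, the first model to apply cognitive diagnosis to teacher-student dialogues; by introducing the IRE framework and a graph-based encoding method, it improves both diagnostic accuracy and interpretability.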

Details

Motivation: Traditional cognitive diagnosis models cannot handle unstructured, dynamic teacher-student dialogues, and it is difficult to extract diagnostic semantics accurately from lengthy dialogues. Method: The authors build a dialogue diagnosis structure on the IRE (initiation-response-evaluation) framework from educational theory and design a novel graph-based encoding method that combines teacher questions with relevant knowledge components to capture key information more precisely. Result: Experiments on three real-world dialogue datasets show that DiaCDM significantly improves diagnostic accuracy and makes the results more interpretable. Conclusion: DiaCDM is the first cognitive diagnosis model for dialogue settings and gives teachers a powerful tool for assessing students' cognitive states. Abstract: While cognitive diagnosis (CD) effectively assesses students' knowledge mastery from structured test data, applying it to real-world teacher-student dialogues presents two fundamental challenges. Traditional CD models lack a suitable framework for handling dynamic, unstructured dialogues, and it's difficult to accurately extract diagnostic semantics from lengthy dialogues. To overcome these hurdles, we propose DiaCDM, an innovative model. We've adapted the initiation-response-evaluation (IRE) framework from educational theory to design a diagnostic framework tailored for dialogue. We also developed a unique graph-based encoding method that integrates teacher questions with relevant knowledge components to capture key information more precisely. To our knowledge, this is the first exploration of cognitive diagnosis in a dialogue setting. Experiments on three real-world dialogue datasets confirm that DiaCDM not only significantly improves diagnostic accuracy but also enhances the results' interpretability, providing teachers with a powerful tool for assessing students' cognitive states. The code is available at https://github.com/Mind-Lab-ECNU/DiaCDM/tree/main.

Xinye Zhao,Spyridon Mastorakis

Main category: cs.CL

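TL;DR: The paper proposes SemShareKV, a KV cache sharing and compression framework based on semantic similarity; it uses locality-sensitive hashing for fuzzy matching and RoPE to preserve positional information, substantially improving LLM inference efficiency.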

Details

Motivation: Existing KV cache compression methods rely on exact token matches or shared prefixes and cannot handle prompts that are semantically similar but lexically different, which limits their use in scenarios such as multi-document summarization and conversational systems. Method: SemShareKV applies locality-sensitive hashing (LSH) to token embeddings for fuzzy matching and incorporates Rotary Position Embedding (RoPE) to preserve positional information, so that KV caches can be shared across semantically similar prompts. Result: With 5k-token inputs, experiments show up to 6.25x speedup and 42% lower GPU memory usage, with negligible loss in generation quality. Conclusion: Through semantic-aware cache sharing, SemShareKV reduces redundant computation in LLM inference and offers a new direction for efficient inference. Abstract: As large language models (LLMs) continue to scale, the memory footprint of key-value (KV) caches during inference has become a significant bottleneck. Existing approaches primarily focus on compressing KV caches within a single prompt or reusing shared prefixes or frequently occurring text segments across prompts. However, such strategies are limited in scenarios where prompts are semantically similar but lexically different, which frequently occurs in tasks such as multi-document summarization and conversational agents. We propose \textit{SemShareKV}, a KV cache sharing and compression framework that accelerates LLM inference by reusing the KV cache across semantically similar prompts. Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. By selectively reusing relevant key-value pairs from a reference prompt's cache, SemShareKV reduces redundant computation while maintaining output quality. Experiments on diverse summarization datasets show up to 6.25$\times$ speedup and 42\% lower GPU memory usage with 5k tokens input, with negligible quality degradation. These results highlight the potential of semantic-aware cache sharing for efficient LLM inference.
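
Fuzzy token matching with LSH can be sketched using random-hyperplane signatures. This illustrates the general technique only and is an assumption about how such matching might look, not SemShareKV's actual code: the hyperplane count, the first-hit matching policy, and all function names are hypothetical, and the RoPE handling described in the abstract is not modeled.

```python
import random


def make_planes(dim, n_bits, seed=0):
    """Random hyperplanes shared by every embedding that is hashed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(int(sum(p * x for p, x in zip(plane, vec)) >= 0) for plane in planes)


def match_tokens(target_embs, ref_embs, planes):
    """Map each target token index to a reference token index in the same bucket.

    A matched target token could reuse the reference token's cached key-value
    pair; unmatched tokens would be recomputed as usual.
    """
    buckets = {}
    for j, emb in enumerate(ref_embs):
        buckets.setdefault(signature(emb, planes), []).append(j)
    matches = {}
    for i, emb in enumerate(target_embs):
        hits = buckets.get(signature(emb, planes))
        if hits:
            matches[i] = hits[0]
    return matches
```

Vectors pointing in the same direction always share a signature, and vectors at a small angle share one with high probability, so tokens with similar embeddings tend to land in the same bucket even when the surface text differs.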

[166] Hierarchical Error Correction for Large Language Models: A Systematic Framework for Domain-Specific AI Quality Enhancement

Zhilong Zhao,Yindi Liu

Main category: cs.CL

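TL;DR: This study proposes a Hierarchical Error Correction (HEC) framework that uses systematic error analysis to substantially improve LLM performance in specialized domains, working best on moderate-baseline tasks.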

Details

Motivation: LLMs perform poorly in specialized domains, for example reaching only 45.9% accuracy on medical coding tasks, so domain-specific error patterns need targeted treatment. Method: From an analysis of error patterns in four specialized domains, the authors identify three hierarchical error types (knowledge-layer, reasoning-layer, and complexity-layer) and design the three-stage HEC correction framework, which intervenes according to the importance of each layer. Result: On medical transcription, legal document classification, political bias detection, and legal reasoning, HEC yields an average improvement of 11.2 percentage points across five LLM architectures (p < 0.001), but on high-baseline tasks (>75% accuracy) its effect is limited and it may even interfere with effective reasoning. Conclusion: Systematic error analysis can guide AI improvements in specialized domains; HEC suits moderate-baseline tasks, and its limits on high-baseline tasks should be kept in mind. Abstract: Large Language Models face significant performance challenges in specialized domains, with state-of-the-art models achieving only 45.9% accuracy on medical coding tasks. This study proposes a Hierarchical Error Correction (HEC) framework that addresses domain-specific AI limitations through systematic error analysis and targeted intervention strategies. We analyze error patterns across four specialized domains and find that AI errors follow consistent hierarchical structures: Knowledge-layer errors (58.4%), Reasoning-layer errors (39.6%), and Complexity-layer errors (2.0%). Based on these patterns, we develop a three-stage correction framework that addresses errors according to their hierarchical importance and demonstrates that framework effectiveness correlates inversely with baseline task performance. Experimental validation across medical transcription (4,921 cases), legal document classification (1,000 cases), political bias detection (645 cases), and legal reasoning (1,000 cases) shows consistent improvements. Cross-model validation across five LLM architectures demonstrates average improvements of 11.2 percentage points (p < 0.001). However, analysis reveals framework limitations in high-baseline tasks (>75% accuracy), where hierarchical intervention may interfere with effective reasoning processes. The results suggest that systematic error analysis can guide effective AI enhancement strategies in specialized domains, particularly for moderate-baseline tasks, while highlighting the importance of understanding framework boundaries for optimal deployment.

[167] Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Adrian Arnaiz-Rodriguez,Miguel Baidal,Erik Derner,Jenn Layton Annable,Mark Ball,Mark Ince,Elvira Perez Vallejos,Nuria Oliver

Main category: cs.CL

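TL;DR: This paper introduces a clinically informed taxonomy of mental health crises, builds an evaluation dataset and an expert review protocol, and systematically benchmarks three state-of-the-art LLMs on recognizing and responding to crises. Current models still produce harmful responses, struggle with ambiguous signals, and lack awareness of user context, which underscores the need for stronger safeguards and better contextual understanding.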

Details

Motivation: As LLMs are increasingly used in high-stakes settings such as mental health support, the safety of their crisis detection and response must be addressed, yet the field lacks a unified crisis taxonomy, high-quality annotated datasets, and evaluation methods grounded in clinical practice. Method: The authors propose a unified taxonomy of six clinically informed mental health crisis categories, curate a diverse evaluation dataset, design an expert-developed protocol for assessing response appropriateness, and benchmark three state-of-the-art LLMs on crisis-type classification and on generating safe, appropriate responses. Result: LLMs are consistent and generally reliable on explicit crisis disclosures, but a non-negligible share of responses is rated inappropriate or harmful, and the open-weight model fails more often than the commercial ones; the models also show systemic weaknesses with indirect or ambiguous risk signals, formulaic replies, and misalignment with user context. Conclusion: Current LLMs still carry significant safety risks in mental health support and need better crisis detection, more context-sensitive responses, and stronger safeguards; the proposed taxonomy, dataset, and evaluation framework lay groundwork for the responsible development of AI mental health support systems. Abstract: The widespread use of chatbots powered by large language models (LLMs) such as ChatGPT and Llama has fundamentally reshaped how people seek information and advice across domains. Increasingly, these chatbots are being used in high-stakes contexts, including emotional support and mental health concerns. While LLMs can offer scalable support, their ability to safely detect and respond to acute mental health crises remains poorly understood. Progress is hampered by the absence of unified crisis taxonomies, robust annotated benchmarks, and empirical evaluations grounded in clinical best practices. In this work, we address these gaps by introducing a unified taxonomy of six clinically-informed mental health crisis categories, curating a diverse evaluation dataset, and establishing an expert-designed protocol for assessing response appropriateness. We systematically benchmark three state-of-the-art LLMs for their ability to classify crisis types and generate safe, appropriate responses. The results reveal that while LLMs are highly consistent and generally reliable in addressing explicit crisis disclosures, significant risks remain. A non-negligible proportion of responses are rated as inappropriate or harmful, with responses generated by an open-weight model exhibiting higher failure rates than those generated by the commercial ones. We also identify systemic weaknesses in handling indirect or ambiguous risk signals, a reliance on formulaic and inauthentic default replies, and frequent misalignment with user context. These findings underscore the urgent need for enhanced safeguards, improved crisis detection, and context-aware interventions in LLM deployments. Our taxonomy, datasets, and evaluation framework lay the groundwork for ongoing research and responsible innovation in AI-driven mental health support, helping to minimize harm and better protect vulnerable users.

[168] Metaphor identification using large language models: A comparison of RAG, prompt engineering, and fine-tuning

Matteo Fuoli,Weihang Huang,Jeannette Littlemore,Sarah Turner,Ellen Wilding

Main category: cs.CL

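TL;DR: This study examines the potential of large language models (LLMs) to automate metaphor identification in full texts, comparing retrieval-augmented generation, prompt engineering, and fine-tuning. Fine-tuning performs best, with a median F1 score of 0.79, and most discrepancies with human annotation are systematic, suggesting that LLMs can partly automate metaphor identification and help refine identification protocols and theory.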

Details

Motivation: Because metaphor is context-sensitive, large-scale analysis has relied on manual annotation, which limits efficiency, so automated methods are needed to make metaphor identification scalable. Method: Three ways of applying LLMs are compared: retrieval-augmented generation (RAG), prompt engineering (zero-shot, few-shot, and chain-of-thought strategies), and fine-tuning, each evaluated on metaphor identification. Result: State-of-the-art closed-source LLMs perform well, with fine-tuning reaching a median F1 score of 0.79; most disagreements with human annotation fall in grey areas already well known in metaphor theory. Conclusion: LLMs can partly automate metaphor identification and can serve as a testbed for developing and refining identification protocols and their theoretical basis. Abstract: Metaphor is a pervasive feature of discourse and a powerful lens for examining cognition, emotion, and ideology. Large-scale analysis, however, has been constrained by the need for manual annotation due to the context-sensitive nature of metaphor. This study investigates the potential of large language models (LLMs) to automate metaphor identification in full texts. We compare three methods: (i) retrieval-augmented generation (RAG), where the model is provided with a codebook and instructed to annotate texts based on its rules and examples; (ii) prompt engineering, where we design task-specific verbal instructions; and (iii) fine-tuning, where the model is trained on hand-coded texts to optimize performance. Within prompt engineering, we test zero-shot, few-shot, and chain-of-thought strategies. Our results show that state-of-the-art closed-source LLMs can achieve high accuracy, with fine-tuning yielding a median F1 score of 0.79. A comparison of human and LLM outputs reveals that most discrepancies are systematic, reflecting well-known grey areas and conceptual challenges in metaphor theory. We propose that LLMs can be used to at least partly automate metaphor identification and can serve as a testbed for developing and refining metaphor identification protocols and the theory that underpins them.

[169] Expanding Computation Spaces of LLMs at Inference Time

Yoonna Jang,Kisu Yang,Isabelle Augenstein

Main category: cs.CL

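TL;DR: This study examines whether filler tokens inserted at inference time can serve as extra computation space, and finds that suitable token types and counts substantially improve small language models, particularly on open-domain QA and math tasks.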

Details

Motivation: To explore whether language models can exploit artificially inserted sequences of filler tokens at inference time as additional computation space, and thereby solve problems better. Method: The authors identify effective token types, counts, and insertion locations, examine at what stage of training models begin to exploit the expanded computation space, and analyze the dynamics within these spaces via attention maps. Result: Inserting filler tokens directly before the final 'Answer:' token is most effective, and smaller models benefit most (SmolLM2-1.7B-Instruct gains up to 12.372 percentage points); attention maps show that the expanded spaces often continue the original attention mechanism and sometimes focus on questions or answer options. Conclusion: The expanded computation space provided by filler tokens can substantially improve problem-solving, especially for smaller models, indicating that it acts as additional computational capacity rather than redundant input. Abstract: Chain-of-thought (CoT) rationale enables language models to use additional task-related text for problem-solving, benefiting not only from detailed reasoning steps but also from the expanded computational space of longer inputs. Prior work has trained filler or special tokens to serve as additional computation spaces. In this study, we investigate whether language models can leverage artificially inserted sequences of filler tokens solely at inference. We first identify effective token types, numbers, and insertion locations, then examine at what stage of training models begin to exploit the expanded computation space, and finally analyze dynamics within these spaces via attention maps. Experiments on models ranging from 1.7B to 32B across open-domain QA and math tasks show that appropriate token types and counts vary, but placing filler tokens directly before the final 'Answer:' token is most effective. Smaller models benefit most, up to 12.372 percentage points in SmolLM2-1.7B-Instruct, indicating that these spaces act as additional computational capacity rather than redundant input. Attention maps reveal that expanded spaces often continue the original attention mechanism and sometimes focus on questions or answer options, suggesting meaningful computation for problem-solving.
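The insertion position reported as most effective (directly before the final 'Answer:' token) can be illustrated at the string level. The following is a minimal sketch, not the authors' code: the function name, the choice of "." as the filler, and the count are all assumptions, and real token counts depend on the tokenizer in use.

```python
def insert_filler_tokens(prompt: str, filler: str = ".", count: int = 8,
                         marker: str = "Answer:") -> str:
    """Insert `count` copies of `filler` directly before the last `marker`.

    Hypothetical helper: the filler string and count are illustrative
    choices, not values taken from the paper.
    """
    if count <= 0:
        return prompt
    idx = prompt.rfind(marker)  # position of the final 'Answer:' marker
    if idx == -1:
        raise ValueError(f"marker {marker!r} not found in prompt")
    padding = " ".join([filler] * count)
    # Keep everything before the marker, add the padding, then the marker.
    return f"{prompt[:idx]}{padding} {prompt[idx:]}"


prompt = "Question: What is 2 + 3?\nAnswer:"
padded = insert_filler_tokens(prompt, filler=".", count=4)
print(repr(padded))
```

Using `rfind` rather than `find` keeps the padding next to the final marker even when few-shot examples earlier in the prompt contain their own 'Answer:' lines.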

[170] BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and Notifications

Andrés Fernández García,Javier de la Rosa,Julio Gonzalo,Roser Morante,Enrique Amigó,Alejandro Benito-Santos,Jorge Carrillo-de-Albornoz,Víctor Fresno,Adrian Ghajari,Guillermo Marco,Laura Plaza,Eva Sánchez Salido

Main category: cs.CL

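TL;DR: This paper presents BOE-XSUM, a new dataset of 3,648 concise summaries of documents from Spain's official state gazette for legal-domain summarization, and shows experimentally that models fine-tuned on it significantly outperform general-purpose zero-shot models.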

Details

Motivation: With information overload, the ability to summarize long documents concisely is increasingly important, yet such summaries are scarce for Spanish documents, particularly in the legal domain. Method: The authors build BOE-XSUM, in which each entry contains the original text, a short plain-language summary, and a document type label, then fine-tune medium-sized LLMs on it and compare them with general-purpose generative models in a zero-shot setting. Result: Fine-tuned models significantly outperform zero-shot models; the best, BERTIN GPT-J 6B, achieves a 24% performance gain over DeepSeek-R1 (accuracies of 41.6% vs. 33.5%). Conclusion: Building a high-quality domain-specific summarization dataset (here, Spanish legal text) and fine-tuning on it significantly improves summarization over zero-shot use of general-purpose models. Abstract: The ability to summarize long documents succinctly is increasingly important in daily life due to information overload, yet there is a notable lack of such summaries for Spanish documents in general, and in the legal domain in particular. In this work, we present BOE-XSUM, a curated dataset comprising 3,648 concise, plain-language summaries of documents sourced from Spain's "Boletín Oficial del Estado" (BOE), the State Official Gazette. Each entry in the dataset includes a short summary, the original text, and its document type label. We evaluate the performance of medium-sized large language models (LLMs) fine-tuned on BOE-XSUM, comparing them to general-purpose generative models in a zero-shot setting. Results show that fine-tuned models significantly outperform their non-specialized counterparts. Notably, the best-performing model, BERTIN GPT-J 6B (32-bit precision), achieves a 24% performance gain over the top zero-shot model, DeepSeek-R1 (accuracies of 41.6% vs. 33.5%).
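The two figures in the last sentence are consistent if the 24% is read as a relative gain rather than a difference in percentage points. A small arithmetic check, using only the two accuracies quoted in the abstract (the function name is illustrative):

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


fine_tuned_acc = 41.6  # BERTIN GPT-J 6B, accuracy in percent (from the abstract)
zero_shot_acc = 33.5   # DeepSeek-R1, accuracy in percent (from the abstract)

absolute_gap = fine_tuned_acc - zero_shot_acc       # in percentage points
gain = relative_gain(fine_tuned_acc, zero_shot_acc)

print(f"absolute gap: {absolute_gap:.1f} percentage points")
print(f"relative gain: {gain:.1%}")
```

Under that reading, the absolute gap is about 8.1 percentage points and the relative gain rounds to 24%.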

[171] How Well Do LLMs Imitate Human Writing Style?

Rebira Jemama,Rajesh Kumar

Main category: cs.CL

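TL;DR: Proposes a training-free framework for authorship verification and style imitation analysis that combines TF-IDF character n-grams with transformer embeddings and classifies text pairs through empirical distance distributions. It reaches 97.5% accuracy on academic essays and 94.5% in cross-domain evaluation while substantially reducing training time and memory use. Experiments show that prompting strategy affects style fidelity more than model size: few-shot and completion prompting markedly improve imitation, yet LLM outputs have lower perplexity than human text, so high style fidelity does not imply human-like unpredictability.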

Details

Motivation: To assess how well LLMs can replicate the writing style of a specific human author, and to remove the dependence of existing methods on supervised training and threshold tuning. Method: A training-free framework that fuses TF-IDF character n-grams with transformer embeddings and classifies text pairs through empirical distance distributions, with no supervised training or threshold tuning. Result: The method reaches 97.5% accuracy on academic essays and 94.5% in cross-domain evaluation, with 91.8% less training time and 59% less memory use than parameter-based baselines; few-shot prompting yields up to 23.5x higher style-matching accuracy than zero-shot prompting, and completion prompting reaches 99.9% agreement with the original author's style; however, the perplexity of matched LLM outputs (average 15.2) is well below that of human essays (29.5). Conclusion: Stylistic fidelity and statistical detectability are separable, prompting strategy matters more for style imitation than model size, and high-fidelity imitation does not imply human-like unpredictability; the framework offers a reproducible basis for authorship modeling, detection, and identity-conditioned generation. Abstract: Large language models (LLMs) can generate fluent text, but their ability to replicate the distinctive style of a specific human author remains unclear. We present a fast, training-free framework for authorship verification and style imitation analysis. The method integrates TF-IDF character n-grams with transformer embeddings and classifies text pairs through empirical distance distributions, eliminating the need for supervised training or threshold tuning. It achieves 97.5% accuracy on academic essays and 94.5% in cross-domain evaluation, while reducing training time by 91.8% and memory usage by 59% relative to parameter-based baselines. Using this framework, we evaluate five LLMs from three separate families (Llama, Qwen, Mixtral) across four prompting strategies: zero-shot, one-shot, few-shot, and text completion. Results show that the prompting strategy has a more substantial influence on style fidelity than model size: few-shot prompting yields up to 23.5x higher style-matching accuracy than zero-shot, and completion prompting reaches 99.9% agreement with the original author's style. Crucially, high-fidelity imitation does not imply human-like unpredictability: human essays average a perplexity of 29.5, whereas matched LLM outputs average only 15.2. These findings demonstrate that stylistic fidelity and statistical detectability are separable, establishing a reproducible basis for future work in authorship modeling, detection, and identity-conditioned generation.

[172] MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes

Changsheng Zhao,Ernie Chang,Zechun Liu,Chia-Jung Chang,Wei Wen,Chen Lai,Rick Cao,Yuandong Tian,Raghuraman Krishnamoorthi,Yangyang Shi,Vikas Chandra

Main category: cs.CL

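TL;DR: This work challenges the assumption that reasoning ability in LLMs requires training on extremely large corpora, showing that carefully curated and resampled high-quality open-source data (about 2T tokens) is enough to produce strong reasoning in sub-billion-parameter models. The resulting MobileLLM-R1 models outperform prior models trained on fully open-source data on several reasoning benchmarks and match or surpass Qwen3-0.6B.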

Details

Motivation: To question the prevailing assumption that reasoning ability only emerges after training on extremely large corpora (>10T tokens), and to explore more efficient, lower-cost use of data. Method: The authors design metrics to identify beneficial open-source datasets, curate and resample them, build a 4.2T-token pre-training corpus from about 2T tokens of high-quality data, and apply an established post-training procedure to train the sub-billion-parameter MobileLLM-R1 series. Result: MobileLLM-R1-950M achieves an AIME score of 15.5, far above OLMo-2-1.48B (0.6) and SmolLM-2-1.7B (0.3); although it is pre-trained on only 11.7% of the tokens used for Qwen3 (4.2T vs. 36T), it matches or surpasses Qwen3-0.6B on multiple reasoning benchmarks. Conclusion: Selecting and resampling high-quality data greatly reduces the amount of data needed for strong reasoning, challenging the view that massive datasets are required and showing that small models with an efficient data strategy can reason well. Abstract: The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities only emerge in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants, the second remains largely unquestioned. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning emergence. By carefully curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning abilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient, and pre-training with 4.2T tokens on the dataset resampled from these ~2T tokens, followed by an established post-training procedure, enables the development of MobileLLM-R1, a series of sub-billion-parameter reasoning models that substantially outperform prior models trained on fully open-sourced data. For example, MobileLLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Remarkably, despite being trained on only 11.7% of the tokens compared to Qwen3's proprietary 36T-token corpus for pretraining, MobileLLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we have released the complete training recipe, data sources, data mixing ratio, and model checkpoints, together with the key insights obtained throughout this study.

[173] The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability

Linlu Gong,Ante Wang,Yunghwei Lai,Weizhi Ma,Yang Liu

Main category: cs.CL

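TL;DR: MAQuE is a comprehensive benchmark for evaluating medical multi-turn questioning, with 3,000 simulated patient agents and a multi-faceted evaluation framework; it exposes the shortcomings of current AI doctors when faced with realistic patient behavior.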

Details

Motivation: Existing AI doctors have diagnostic skills but still fall short on key qualities such as empathy and communication, and no benchmark comprehensively evaluates their multi-turn questioning. Method: The MAQuE benchmark provides 3,000 simulated patients with diverse linguistic patterns, cognitive limitations, and emotional responses, together with an evaluation framework covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Result: Existing LLMs leave substantial room for improvement on every aspect, are sensitive to realistic patient behavior in ways that affect diagnostic accuracy, and show trade-offs between evaluation dimensions. Conclusion: MAQuE offers an effective tool for assessing the overall inquiry capability of AI doctors and highlights the challenge of balancing performance with practical clinical use, pointing toward more humane AI doctors. Abstract: An effective physician should possess a combination of empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE (Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across the evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.

[174] SemanticShield: LLM-Powered Audits Expose Shilling Attacks in Recommender Systems

Kaihong Li,Huichi Zhou,Bin Ma,Fangjun Huang

Main category: cs.CL

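TL;DR: Proposes SemanticShield, a two-stage detection framework that incorporates item-side semantic information and uses large language models to defend recommender systems against shilling attacks through behavioral screening and semantic consistency auditing.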

Details

Motivation: Existing defenses focus mostly on user behavior and overlook the potential of item-side semantic features (such as titles and descriptions) to expose malicious intent, so detection of shilling attacks is incomplete. Method: A two-stage detection framework: the first stage pre-screens suspicious users with low-cost behavioral criteria; the second stage audits semantic consistency with a lightweight LLM enhanced through reinforcement fine-tuning, which improves detection accuracy. Result: SemanticShield is shown to be effective on six representative attack strategies and generalizes well to previously unseen attack methods. Conclusion: Incorporating item-side semantic information markedly strengthens the attack resistance of recommender systems, and SemanticShield performs well in both detection and generalization. Abstract: Recommender systems (RS) are widely used in e-commerce for personalized suggestions, yet their openness makes them susceptible to shilling attacks, where adversaries inject fake behaviors to manipulate recommendations. Most existing defenses emphasize user-side behaviors while overlooking item-side features such as titles and descriptions that can expose malicious intent. To address this gap, we propose a two-stage detection framework that integrates item-side semantics via large language models (LLMs). The first stage pre-screens suspicious users using low-cost behavioral criteria, and the second stage employs LLM-based auditing to evaluate semantic consistency. Furthermore, we enhance the auditing model through reinforcement fine-tuning on a lightweight LLM with carefully designed reward functions, yielding a specialized detector called SemanticShield. Experiments on six representative attack strategies demonstrate the effectiveness of SemanticShield against shilling attacks, and further evaluation on previously unseen attack methods shows its strong generalization capability. Code is available at https://github.com/FrankenstLee/SemanticShield.

[175] Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns

Hanqi Xiao,Vaidehi Patil,Hyunji Lee,Elias Stengel-Eskin,Mohit Bansal

Main category: cs.CL

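TL;DR: This paper proposes the Generalized Correctness Model (GCM), which improves LLM confidence estimation by systematically encoding historical correctness information instead of relying on the model's self-introspection, and validates the approach across multiple models and datasets.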

Details

Motivation: Accurate, calibrated confidence estimates are critical for high-stakes or user-facing applications, but current methods mostly assume that a model can judge the correctness of its own answers, an assumption that lacks solid support, so better ways of modeling confidence are needed. Method: The authors propose the Generalized Correctness Model (GCM), a correctness model trained on the historical predictions of target models; they control the training data to study where correctness prediction ability comes from, and also try injecting history as in-context examples and applying post-hoc calibration. Result: An LLM predicts the correctness of its own outputs no better than an unrelated LLM; GCMs generalize well across models and datasets; answer phrasing is a strong predictor of correctness; and both in-context history and post-hoc calibration improve confidence estimation. Conclusion: Reliable LLM confidence estimation is a generalizable, model-agnostic skill acquired by systematically encoding correctness history, not a model-specific skill that depends on self-introspection. Abstract: Generating accurate and calibrated confidence estimates is critical for deploying LLMs in high-stakes or user-facing applications, and remains an open challenge. Prior research has often framed confidence as a problem of eliciting a model's "self-knowledge", i.e., the ability of an LLM to judge whether its own answers are correct; this approach implicitly assumes that there is some privileged information about the answer's correctness that is accessible to the model itself. However, our experiments reveal that an LLM attempting to predict the correctness of its own outputs generally performs no better than an unrelated LLM. Moreover, we hypothesize that a key factor in building a "Correctness Model" (CM) is exposure to a target model's historical predictions. We propose multiple methods to inject this historical correctness information, creating a Generalized Correctness Model (GCM). We first show that GCMs can be trained on the correctness data from many LLMs and learn patterns for correctness prediction applicable across datasets and models. We then use CMs as a lens for studying the source of correctness prediction ability and its generalization, systematically controlling their training data and finding that answer phrasing is a strong predictor for correctness. We further explore alternative methods of injecting history without training an LLM, finding that including history as in-context examples can help improve correctness prediction, and post-hoc calibration can provide complementary reductions in calibration error. We evaluate GCMs based on Qwen3-8B across 5 model families and the MMLU and TriviaQA datasets, as well as on a downstream selective prediction task, finding that reliable LLM confidence estimation is a generalizable and model-agnostic skill learned by systematically encoding correctness history rather than a model-specific skill reliant on self-introspection.

[176] Circuit Distillation

Somin Wadhwa,Silvio Amir,Byron C. Wallace

Main category: cs.CL

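TL;DR: Proposes circuit distillation, a new method that distills computational mechanisms by matching functionally corresponding internal components of teacher and student models, transferring algorithmic capabilities more effectively than conventional behavior-mimicking distillation.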

Details

Motivation: Conventional distillation only mimics output behavior and ignores the transfer of internal computational mechanisms, which limits how well a student inherits the teacher's algorithmic capabilities; a method that transfers the underlying mechanisms is therefore needed. Method: Circuit distillation identifies functionally corresponding circuit components in teacher and student models and introduces a representation alignment loss over the internal representations these components induce, adjusting only a small, targeted subset of student parameters. Result: On entity tracking and theory of mind (ToM) tasks with Llama3-family models, circuit distillation outperforms standard distillation and successfully transfers algorithmic capabilities. Conclusion: Circuit distillation demonstrates that transferring internal computational mechanisms is feasible, opening a path to efficient transfer of targeted capabilities in an interpretable and controllable way. Abstract: Model distillation typically focuses on behavioral mimicry, where a student model is trained to replicate a teacher's output while treating its internal computations as a black box. In this work we propose an alternative approach: Distilling the underlying computational mechanisms implemented by a teacher model. Specifically, we propose circuit distillation, which introduces an objective to align internal representations between analogous circuit components in teacher and student models. We propose a method to match "functionally correspondent" circuit components and introduce a loss reflecting similarities between the representations that these induce. We evaluate circuit distillation on entity tracking and theory of mind (ToM) tasks using models from the Llama3 family. Our results demonstrate that circuit distillation outperforms standard distillation, successfully transferring algorithmic capabilities by adjusting only a small, targeted subset of student model parameters. This work establishes the feasibility of transferring mechanisms, which may in turn allow for efficient distillation of targeted teacher capabilities via interpretable and controllable internal student mechanisms.

[177] Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct

Haoyang Zheng,Xinyang Liu,Cindy Xiangrui Kong,Nan Jiang,Zheyuan Hu,Weijian Luo,Wei Deng,Guang Lin

Main category: cs.CL

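TL;DR: DiDi-Instruct is an efficient text generation method initialized from a pre-trained discrete diffusion language model; using an integral KL-divergence minimization framework and several optimization techniques, it achieves a 64x speedup while maintaining strong performance.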

Details

Motivation: To achieve fast language generation and overcome the slow speed or weak performance of existing generative models. Method: DiDi-Instruct is built on an integral KL-divergence minimization framework and uses grouped reward normalization, intermediate-state matching, and a reward-guided ancestral sampler (RGAS); it is initialized from a pre-trained discrete diffusion language model and trained efficiently. Result: On OpenWebText, DiDi-Instruct achieves sample perplexities ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming accelerated models, the GPT-2 baseline, and standard dLLMs, with an entropy loss of only about 1% and 20x less additional training time. Conclusion: DiDi-Instruct is an efficient and effective distillation method that enables very fast language generation with good stability, coverage, and inference performance, and its generality is validated on tasks such as protein sequence generation. Abstract: Fast generation of language texts is the holy grail that people pursue in the AI era. In this work, we introduce Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that leads to fast language generation models by initializing from a pre-trained (masked) discrete diffusion language model (dLLM). The resulting DiDi-Instruct model outperforms the dLLM counterparts and the GPT-2 baseline with 64x acceleration. In the theoretical part of the paper, we build the foundation of DiDi-Instruct in a framework of integral KL-divergence minimization, with practical training algorithms. We also introduce techniques like grouped reward normalization, intermediate-state matching, and the reward-guided ancestral sampler (RGAS) that significantly improve the training stability, the model coverage, and the inference performances. On OpenWebText, DiDi-Instruct outperforms all accelerated language generation models as well as the GPT-2 baseline and the standard dLLMs, achieving sample perplexities ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs). These performance gains are accomplished with a negligible entropy loss of about 1% and 20x less additional training wall-clock time. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, and the generation of discrete protein sequences. In conclusion, DiDi-Instruct is an efficient yet effective distillation method, enabling language generation in the blink of an eye. We will release both code and models at github.com/haoyangzheng-ai/didi-instruct.

[178] GateMABSA: Aspect-Image Gated Fusion for Multimodal Aspect-based Sentiment Analysis

Adamu Lawan,Haruna Yunusa

Main category: cs.CL

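TL;DR: Proposes GateMABSA, a new gated multimodal architecture that addresses noise filtering and cross-modal alignment in multimodal aspect-based sentiment analysis.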

Details

Motivation: Existing MABSA models struggle to filter visual noise and to align aspects with opinion-bearing content across modalities. Method: Three specialized mLSTMs are introduced: Syn-mLSTM incorporates syntactic structure, Sem-mLSTM emphasizes aspect-semantic relevance, and Fuse-mLSTM performs selective multimodal fusion. Result: Experiments on two benchmark Twitter datasets show that GateMABSA outperforms several baselines. Conclusion: GateMABSA handles noise and cross-modal alignment in multimodal sentiment analysis more effectively and improves performance. Abstract: Aspect-based Sentiment Analysis (ABSA) has recently advanced into the multimodal domain, where user-generated content often combines text and images. However, existing multimodal ABSA (MABSA) models struggle to filter noisy visual signals and to effectively align aspects with opinion-bearing content across modalities. To address these challenges, we propose GateMABSA, a novel gated multimodal architecture that integrates syntactic, semantic, and fusion-aware mLSTM. Specifically, GateMABSA introduces three specialized mLSTMs: Syn-mLSTM to incorporate syntactic structure, Sem-mLSTM to emphasize aspect-semantic relevance, and Fuse-mLSTM to perform selective multimodal fusion. Extensive experiments on two benchmark Twitter datasets demonstrate that GateMABSA outperforms several baselines.

[179] Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures

Marco Bronzini,Carlo Nicolini,Bruno Lepri,Jacopo Staiano,Andrea Passerini

Main category: cs.CL

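TL;DR: This paper presents Hyperdimensional Probe, a new paradigm that combines symbolic representations with neural probing to decode information from the vector space of large language models, overcoming limitations of existing methods and demonstrating effectiveness and interpretability across a range of tasks.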

Details

Motivation: The internal representations of LLMs are opaque, and existing interpretability methods such as DLA and SAEs are constrained by the output vocabulary or by unclear feature names, so they reveal only part of the model's internal mechanisms. Method: Hyperdimensional Probe uses Vector Symbolic Architectures (VSAs) to project the model's residual stream into an interpretable concept space, combining the strengths of sparse autoencoders and conventional probes. Result: On syntactic pattern recognition, key-value associations, abstract inference, and question answering, the probe reliably extracts meaningful concepts across different models, embedding sizes, and input domains, and helps identify model failures. Conclusion: Hyperdimensional Probe improves information decoding from the LLM vector space, extracting richer, more interpretable, and more structured features and advancing model interpretability. Abstract: Despite their capabilities, Large Language Models (LLMs) remain opaque with limited understanding of their internal representations. Current interpretability methods, such as direct logit attribution (DLA) and sparse autoencoders (SAEs), provide restricted insight due to limitations such as the model's output vocabulary or unclear feature names. This work introduces Hyperdimensional Probe, a novel paradigm for decoding information from the LLM vector space. It combines ideas from symbolic representations and neural probing to project the model's residual stream into interpretable concepts via Vector Symbolic Architectures (VSAs). This probe combines the strengths of SAEs and conventional probes while overcoming their key limitations. We validate our decoding paradigm with controlled input-completion tasks, probing the model's final state before next-token prediction on inputs spanning syntactic pattern recognition, key-value associations, and abstract inference. We further assess it in a question-answering setting, examining the state of the model both before and after text generation. Our experiments show that our probe reliably extracts meaningful concepts across varied LLMs, embedding sizes, and input domains, also helping identify LLM failures. Our work advances information decoding in LLM vector space, enabling extracting more informative, interpretable, and structured features from neural representations.

[180] Confidence-Guided Error Correction for Disordered Speech Recognition

Abner Hernandez,Tomás Arias Vergara,Andreas Maier,Paula Andrea Pérez-Toro

Main category: cs.CL

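TL;DR: Proposes confidence-informed prompting, which uses large language models to correct ASR output for disordered speech; embedding word-level uncertainty estimates into training significantly reduces word error rate.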

Details

Motivation: Automatic speech recognition (ASR) for disordered speech has high error rates, and conventional post-processing tends to overcorrect, so LLM-based correction needs to be more robust and to generalize across speakers and datasets. Method: Confidence-informed prompting embeds word-level confidence information into LLM training so that the model focuses on low-confidence ASR regions; a LLaMA 3.1 model is fine-tuned and compared with transcript-only fine-tuning and post hoc confidence-based filtering. Result: Relative to naive LLM correction, the method reduces WER by 10% on the Speech Accessibility Project spontaneous speech data and by 47% on TORGO. Conclusion: Incorporating confidence information into LLM fine-tuning improves error correction for disordered-speech ASR, reduces overcorrection, and generalizes well across datasets. Abstract: We investigate the use of large language models (LLMs) as post-processing modules for automatic speech recognition (ASR), focusing on their ability to perform error correction for disordered speech. In particular, we propose confidence-informed prompting, where word-level uncertainty estimates are embedded directly into LLM training to improve robustness and generalization across speakers and datasets. This approach directs the model to uncertain ASR regions and reduces overcorrection. We fine-tune a LLaMA 3.1 model and compare our approach to both transcript-only fine-tuning and post hoc confidence-based filtering. Evaluations show that our method achieves a 10% relative WER reduction compared to naive LLM correction on the Speech Accessibility Project spontaneous speech and a 47% reduction on TORGO, demonstrating the effectiveness of confidence-aware fine-tuning for impaired speech.
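One way to picture confidence-informed prompting is a prompt in which low-confidence ASR words are visibly marked. The sketch below is only an illustration under stated assumptions: the bracket format, the 0.6 threshold, the instruction wording, and the example WER values are hypothetical and are not taken from the paper. The second helper shows how a relative WER reduction is computed.

```python
from typing import List, Tuple


def format_confidence_prompt(words: List[Tuple[str, float]],
                             low_conf: float = 0.6) -> str:
    """Build a correction prompt that marks low-confidence ASR words.

    The `[word|confidence]` format and the threshold are assumptions
    made for illustration only.
    """
    annotated = []
    for word, conf in words:
        if conf < low_conf:
            annotated.append(f"[{word}|{conf:.2f}]")  # flag uncertain word
        else:
            annotated.append(word)                    # leave confident word as-is
    return ("Correct the transcript. Bracketed words are low-confidence.\n"
            "Transcript: " + " ".join(annotated))


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in word error rate, as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


hypothesis = [("the", 0.98), ("whether", 0.41), ("is", 0.95), ("nice", 0.90)]
print(format_confidence_prompt(hypothesis))

# Hypothetical WER values: going from 0.20 to 0.18 is a 10% relative reduction.
print(f"{relative_wer_reduction(0.20, 0.18):.0%}")
```

Leaving confident words unmarked is what would steer a correction model away from rewriting text that the recognizer most likely got right.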

[181] An empirical study on the limitation of Transformers in program trace generation

Simeng Sun

Main category: cs.CL

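TL;DR: Studies Transformers on the task of generating program execution traces; the models perform well in distribution but fail systematically when generalizing along factors such as program length and number of trace steps, although some designs significantly improve generalization.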

Details

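Motivation: To probe how well Transformers generalize on algorithmic tasks that require step-by-step reasoning, using program trace generation, which externalizes the reasoning process. Method: Small Transformers are trained with a range of modifications (alternative position encodings, softmax replacements, hybrid models, and short convolutions) and evaluated on program trace generation. Result: The models reach high in-distribution accuracy but generalize poorly along dimensions such as program length and number of trace steps; some architectural changes significantly improve generalization. Conclusion: Program trace generation is a challenging testbed that exposes the limits of current Transformers in systematic generalization and shows that specific design choices can help. Abstract: We study Transformers on the task of program trace generation (PTG), where models produce step-by-step execution traces for synthetic programs. Unlike existing algorithmic problems, PTG externalizes reasoning through long traces where each step is trivial. We train small Transformers with diverse modifications, including alternative position encodings, softmax replacements, hybrid models, and short convolutions. While these models achieve strong in-distribution accuracy, they exhibit systematic failures when generalizing to various factors (e.g., program length, trace steps), though some designs significantly improve generalization.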

[182] Scaling Generalist Data-Analytic Agents

Shuofei Qiao,Yanqiu Zhao,Zhisong Qiu,Xiaobin Wang,Jintian Zhang,Zhao Bin,Ningyu Zhang,Yong Jiang,Pengjun Xie,Fei Huang,Huajun Chen

Main category: cs.CL

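TL;DR: This paper introduces DataMind, a scalable data synthesis and agent training recipe for building generalist open-source data-analytic agents, addressing the shortcomings of existing approaches in data resources, training strategy, and the stability of multi-turn code execution.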

Details

Motivation: Existing data-analytic agents depend on proprietary models and prompt engineering, while open-source models struggle with the diverse formats, large-scale data, and long-horizon multi-step reasoning of real-world analytics, so a stronger open-source solution is needed. Method: DataMind generates diverse queries with a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism; samples trajectories with knowledge augmentation followed by model-based and rule-based filtering; dynamically adjusts a training objective that combines SFT and RL losses; and uses a memory-frugal, stable code-based multi-turn rollout framework. On this basis the authors curate DataMind-12K, a set of 12K high-quality trajectories. Result: On multiple data analysis benchmarks, DataMind-14B reaches state-of-the-art with an average score of 71.16%, surpassing DeepSeek-V3.1 and GPT-5; DataMind-7B is the best open-source model with 68.10%. Conclusion: DataMind provides an effective training recipe and high-quality data for building strong open-source data-analytic agents, and the dataset and models will be released to support further research. Abstract: Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to face diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents, including insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art performance with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K and DataMind-7B,14B for the community's future research.

[183] jina-reranker-v3: Last but Not Late Interaction for Document Reranking

Feng Wang,Yuqing Li,Han Xiao

Main category: cs.CL

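TL;DR: jina-reranker-v3 is a 0.6B-parameter multilingual document reranker that applies causal self-attention between the query and the documents within the same context window, enabling rich cross-document interaction and reaching state-of-the-art BEIR performance while staying compact.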

Details

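Motivation: To improve multilingual document reranking while reducing model size, and to overcome the limited interaction of traditional late interaction models such as ColBERT. Method: Proposes a new "last but not late" interaction that jointly encodes the query and the documents in the same context window, uses causal self-attention for deep interaction, and reranks using the contextual embedding extracted from the last token of each document. Result: Reaches 61.94 nDCG@10 on the BEIR benchmark, outperforming existing models while being ten times smaller than generative listwise rerankers. Conclusion: Through its novel interaction mechanism, jina-reranker-v3 delivers efficient and strong multilingual document reranking that combines high performance with a lightweight model. Abstract: jina-reranker-v3 is a 0.6B parameter multilingual document reranker that introduces a novel last but not late interaction. Unlike late interaction models such as ColBERT that perform separate encoding followed by multi-vector matching, our approach conducts causal self-attention between query and documents within the same context window, enabling rich cross-document interactions before extracting contextual embeddings from the last token of each document. This compact architecture achieves state-of-the-art BEIR performance with 61.94 nDCG@10 while being ten times smaller than generative listwise rerankers.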
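
For reference, the nDCG@10 metric quoted above can be computed roughly as in the following minimal sketch. It assumes graded relevance labels listed in ranked order and a linear gain with a log2 rank discount, which is one common formulation; the helper names `dcg_at_k` and `ndcg_at_k` are illustrative and are not taken from the paper or from any BEIR tooling. For simplicity the ideal ranking is built from the same list, so every judged document is assumed to appear in it.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1])   # already in ideal order
swapped = ndcg_at_k([0, 1])      # the only relevant document is ranked second
```

A benchmark-level score such as the one reported would then be an average of per-query values over the evaluated datasets.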

[184] Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs

Akio Hayakawa,Stefan Bott,Horacio Saggion

Main category: cs.CL

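TL;DR: This paper proposes an efficient framework for lexical simplification (LS) built on small language models that can be deployed locally; it balances performance, efficiency, and safety, and filters harmful simplifications using the model's output probability to enable safe real-world use.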

Details

Motivation: Large language models are hard to use in privacy-sensitive and resource-constrained environments, and because lexical simplification mainly serves vulnerable user groups such as people with disabilities, the safety and correctness of system output must be ensured. Method: Proposes a local lexical simplification framework built on small language models, uses knowledge distillation with synthesized data and in-context learning as baselines, and designs a filtering strategy based on the model's output probability to suppress harmful simplifications. Result: Experiments in five languages show that knowledge distillation raises automatic metric scores but increases harmful simplifications; the model's output probability is a useful signal for detecting them, and the proposed filter suppresses harmful outputs while largely preserving beneficial ones. Conclusion: The work establishes a benchmark for efficient and safe lexical simplification with small language models, exposes the trade-offs between performance, efficiency, and safety, and offers a promising path to safe real-world deployment. Abstract: Despite their strong performance, large language models (LLMs) face challenges in real-world application of lexical simplification (LS), particularly in privacy-sensitive and resource-constrained environments. Moreover, since vulnerable user groups (e.g., people with disabilities) are one of the key target groups of this technology, it is crucial to ensure the safety and correctness of the output of LS systems. To address these issues, we propose an efficient framework for LS systems that utilizes small LLMs deployable in local environments. Within this framework, we explore knowledge distillation with synthesized data and in-context learning as baselines. Our experiments in five languages evaluate model outputs both automatically and manually. Our manual analysis reveals that while knowledge distillation boosts automatic metric scores, it also introduces a safety trade-off by increasing harmful simplifications. Importantly, we find that the model's output probability is a useful signal for detecting harmful simplifications. Leveraging this, we propose a filtering strategy that suppresses harmful simplifications while largely preserving beneficial ones. This work establishes a benchmark for efficient and safe LS with small LLMs. It highlights the key trade-offs between performance, efficiency, and safety, and demonstrates a promising approach for safe real-world deployment.

[185] Towards Personalized Deep Research: Benchmarks and Evaluations

Yuan Liang,Jiaxian Li,Yuqing Wang,Piaohong Wang,Motong Tian,Pai Liu,Shuofei Qiao,Runnan Fang,He Zhu,Ge Zhang,Minghao Liu,Yuchen Eleanor Jiang,Ningyu Zhang,Wangchunshu Zhou

Main category: cs.CL

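TL;DR: Introduces Personalized Deep Research Bench, the first benchmark for evaluating personalization in Deep Research Agents (DRAs), together with the PQR Evaluation Framework, which measures personalization alignment, content quality, and factual reliability.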

Details

Motivation: Existing DRA evaluations mostly rely on close-ended benchmarks; open-ended deep research benchmarks are scarce and typically neglect personalized scenarios. Method: Builds a benchmark that pairs 50 research tasks across 10 domains with 25 authentic user profiles, yielding 250 user-task queries, and proposes the PQR Evaluation Framework. Result: Experiments highlight the capabilities and limitations of current systems on personalized deep research tasks. Conclusion: The work provides a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants. Abstract: Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing evaluations mostly rely on close-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures (P) Personalization Alignment, (Q) Content Quality, and (R) Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.

[186] Knowledge Extraction on Semi-Structured Content: Does It Remain Relevant for Question Answering in the Era of LLMs?

Kai Sun,Yin Huang,Srishti Mehra,Mohammad Kachuee,Xilun Chen,Renjie Tao,Zhaojiang Lin,Andrea Jessee,Nirav Shah,Alex Betty,Yue Liu,Anuj Kumar,Wen-tau Yih,Xin Luna Dong

Main category: cs.CL

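TL;DR: This paper examines the value of extracting knowledge triples from web content for question answering in the era of large language models (LLMs), and finds that although LLMs perform well on QA, adding knowledge extraction still improves performance.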

Details

Motivation: LLMs have substantially advanced web-based question answering, but it is unclear whether knowledge extraction is still useful, so the paper studies the role of triple extraction in this new paradigm. Method: Extends an existing benchmark with knowledge extraction annotations and evaluates commercial and open-source LLMs of varying sizes on both knowledge extraction and question answering. Result: Web-scale knowledge extraction remains challenging for current LLMs, yet augmenting QA with extracted triples or using multi-task learning improves performance. Conclusion: Even though LLMs are strong at QA, knowledge triple extraction remains a useful aid and helps improve results across different model sizes and resource settings. Abstract: The advent of Large Language Models (LLMs) has significantly advanced web-based Question Answering (QA) systems over semi-structured content, raising questions about the continued utility of knowledge extraction for question answering. This paper investigates the value of triple extraction in this new paradigm by extending an existing benchmark with knowledge extraction annotations and evaluating commercial and open-source LLMs of varying sizes. Our results show that web-scale knowledge extraction remains a challenging task for LLMs. Despite achieving high QA accuracy, LLMs can still benefit from knowledge extraction, through augmentation with extracted triples and multi-task learning. These findings provide insights into the evolving role of knowledge triple extraction in web-based QA and highlight strategies for maximizing LLM effectiveness across different model sizes and resource settings.

[187] Investigating Language and Retrieval Bias in Multilingual Previously Fact-Checked Claim Detection

Ivan Vykopal,Antonia Karamolegkou,Jaroslav Kopčan,Qiwei Peng,Tomáš Javůrek,Michal Gregor,Marián Šimko

Main category: cs.CL

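TL;DR: This paper studies language bias and retrieval bias of multilingual LLMs in fact-checking; experiments across 20 languages show that models perform better on high-resource languages and that popular claims are over-represented during retrieval.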

Details

Motivation: To address the language bias and retrieval bias of multilingual LLMs in cross-lingual fact-checking and to improve fairness and accuracy for low-resource languages. Method: Evaluates six open-source multilingual LLMs across 20 languages using the AMC-16K dataset and a fully multilingual prompting strategy, and uses multilingual embedding models to analyze retrieval frequency and probe retrieval bias. Result: Model family, size, and prompting strategy significantly affect performance, and clear language and retrieval biases appear: high-resource languages perform better, and popular claims are retrieved disproportionately often, which distorts performance estimates. Conclusion: Multilingual LLMs still show systematic bias in fact-checking, and prompting strategies and retrieval mechanisms need to improve to make cross-lingual fact-checking more equitable. Abstract: Multilingual Large Language Models (LLMs) offer powerful capabilities for cross-lingual fact-checking. However, these models often exhibit language bias, performing disproportionately better on high-resource languages such as English than on low-resource counterparts. We also present and inspect a novel concept - retrieval bias, when information retrieval systems tend to favor certain information over others, leaving the retrieval process skewed. In this paper, we study language and retrieval bias in the context of Previously Fact-Checked Claim Detection (PFCD). We evaluate six open-source multilingual LLMs across 20 languages using a fully multilingual prompting strategy, leveraging the AMC-16K dataset. By translating task prompts into each language, we uncover disparities in monolingual and cross-lingual performance and identify key trends based on model family, size, and prompting strategy. Our findings highlight persistent bias in LLM behavior and offer recommendations for improving equity in multilingual fact-checking. To investigate retrieval bias, we employed multilingual embedding models and looked into the frequency of retrieved claims. Our analysis reveals that certain claims are retrieved disproportionately across different posts, leading to inflated retrieval performance for popular claims while under-representing less common ones.
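
The retrieval-bias analysis described above looks at how often each claim is retrieved across posts. A minimal sketch of that counting step follows, assuming each post has already been mapped to a list of retrieved claim IDs; the function name `retrieval_frequency` and the toy IDs are hypothetical and do not come from the paper.

```python
from collections import Counter

def retrieval_frequency(retrieved_per_post):
    """Count in how many posts each claim ID appears among the retrieved results."""
    counts = Counter()
    for claim_ids in retrieved_per_post:
        counts.update(set(claim_ids))  # count a claim at most once per post
    return counts

# Toy data: three posts, each with its list of retrieved claim IDs.
posts = [["c1", "c2"], ["c1", "c3"], ["c1", "c2", "c2"]]
freq = retrieval_frequency(posts)
most_common_claim, hits = freq.most_common(1)[0]
```

A heavily skewed distribution of these counts would be one sign of the over-retrieval of popular claims that the abstract describes.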

[188] Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation

Yen-Ju Lu,Thomas Thebaud,Laureano Moro-Velazquez,Najim Dehak,Jesus Villalba

Main category: cs.CL

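TL;DR: Proposes Paired by the Teacher (PbT), a two-stage teacher-student framework that generates high-quality input-output pairs without human labels or parallel data, substantially lowering annotation cost and improving low-resource natural language generation.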

Details

Motivation: In low-resource natural language generation, practitioners usually have only unpaired inputs or unpaired outputs and lack paired annotations, so small models are hard to train or must depend on costly synthetic data from large models. Method: Uses a teacher-student framework in which a teacher LLM compresses each unpaired example into an intermediate representation (IR) and a student learns to reconstruct inputs from IRs, so that original outputs can be paired with student-generated inputs to form high-quality synthetic pairs. Result: Across five benchmarks and SwitchBoard, an 8B student trained only on PbT data outperforms models trained on data generated by a 70B teacher, comes within 1.2 ROUGE-L of human-annotated pairs, closes 82% of the oracle gap, and produces more faithful, style-consistent summaries in human evaluation. Conclusion: PbT effectively addresses data mismatch in low-resource NLG; by generating in-domain source texts it avoids the shortcomings of direct synthesis and achieves strong performance at lower cost. Abstract: We present Paired by the Teacher (PbT), a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners may have only raw outputs, like highlights, recaps, or questions, or only raw inputs, such as articles, dialogues, or paragraphs, but seldom both. This mismatch forces small models to learn from very few examples or rely on costly, broad-scope synthetic examples produced by large LLMs. PbT addresses this by asking a teacher LLM to compress each unpaired example into a concise intermediate representation (IR), and training a student to reconstruct inputs from IRs. This enables outputs to be paired with student-generated inputs, yielding high-quality synthetic data. We evaluate PbT on five benchmarks: document summarization (XSum, CNNDM), dialogue summarization (SAMSum, DialogSum), and question generation (SQuAD), as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on 70B teacher-generated corpora and other unsupervised baselines, coming within 1.2 ROUGE-L of human-annotated pairs and closing 82% of the oracle gap at one-third the annotation cost of direct synthesis. Human evaluation on SwitchBoard further confirms that only PbT produces concise, faithful summaries aligned with the target style, highlighting its advantage of generating in-domain sources that avoid the mismatch limiting direct synthesis.

[189] Pretraining Large Language Models with NVFP4

NVIDIA,Felix Abecassis,Anjulie Agrusa,Dong Ahn,Jonah Alben,Stefania Alborghetti,Michael Andersch,Sivakumar Arayandi,Alexis Bjorlin,Aaron Blakeman,Evan Briones,Ian Buck,Bryan Catanzaro,Jinhang Choi,Mike Chrzanowski,Eric Chung,Victor Cui,Steve Dai,Bita Darvish Rouhani,Carlo del Mundo,Deena Donia,Burc Eryilmaz,Henry Estela,Abhinav Goel,Oleg Goncharov,Yugi Guvvala,Robert Hesse,Russell Hewett,Herbert Hum,Ujval Kapasi,Brucek Khailany,Mikail Khona,Nick Knight,Alex Kondratenko,Ronny Krashinsky,Ben Lanir,Simon Layton,Michael Lightstone,Daniel Lo,Paulius Micikevicius,Asit Mishra,Tim Moon,Deepak Narayanan,Chao Ni,Abhijit Paithankar,Satish Pasumarthi,Ankit Patel,Mostofa Patwary,Ashwin Poojary,Gargi Prasad,Sweta Priyadarshi,Yigong Qin,Xiaowei Ren,Oleg Rybakov,Charbel Sakr,Sanjeev Satheesh,Stas Sergienko,Pasha Shamis,Kirthi Shankar,Nishant Sharma,Mohammad Shoeybi,Michael Siu,Misha Smelyanskiy,Darko Stosic,Dusan Stosic,Bor-Yiing Su,Frank Sun,Nima Tajbakhsh,Shelby Thomas,Przemek Tredak,Evgeny Tsykunov,Gandhi Vaithilingam,Aditya Vavre,Rangharajan Venkatesan,Roger Waleffe,Qiyu Wan,Hexin Wang,Mengdi Wang,Lizzie Wei,Hao Wu,Evan Wu,Keith Wyss,Ning Xu,Jinze Xue,Charlene Yang,Yujia Zhai,Ruoxi Zhang,Jingyang Zhu,Zhongbo Zhu

Main category: cs.CL

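TL;DR: This paper presents a new LLM training method based on the NVFP4 format that combines Random Hadamard transforms, a two-dimensional quantization scheme, stochastic rounding, and selective high-precision layers to achieve stable and accurate 4-bit training; it is validated on a 12-billion-parameter model in the longest publicly documented 4-bit training run to date, with results comparable to an FP8 baseline.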

Details

Motivation: Improving the efficiency of large-model pretraining is essential; moving to narrower precision such as 4-bit floating point can increase compute speed and resource utilization, but it raises challenges for training stability, convergence, and implementation. Method: Uses the NVFP4 format, applies Random Hadamard transforms to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across the forward and backward passes, uses stochastic rounding for unbiased gradient estimation, and keeps selected layers in high precision. Result: A 12-billion-parameter model is trained on 10 trillion tokens in 4-bit precision, with training loss and downstream task accuracies comparable to an FP8 baseline. Conclusion: NVFP4 combined with the proposed method is a major step forward for narrow-precision LLM training and can markedly improve training efficiency without sacrificing performance. Abstract: Large Language Models (LLMs) today are powerful problem solvers across many domains, and they continue to get stronger as they scale in model size, training set size, and training set quality, as shown by extensive research and experimentation across the industry. Training a frontier model today requires on the order of tens to hundreds of yottaflops, which is a massive investment of time, compute, and energy. Improving pretraining efficiency is therefore essential to enable the next generation of even more capable LLMs. While 8-bit floating point (FP8) training is now widely adopted, transitioning to even narrower precision, such as 4-bit floating point (FP4), could unlock additional improvements in computational speed and resource utilization. However, quantization at this level poses challenges to training stability, convergence, and implementation, notably for large-scale models trained on long token horizons. In this study, we introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers. We validate our approach by training a 12-billion-parameter model on 10 trillion tokens -- the longest publicly documented training run in 4-bit precision to date. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline. These findings highlight that NVFP4, when combined with our training approach, represents a major step forward in narrow-precision LLM training algorithms.
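
To illustrate why stochastic rounding gives an unbiased estimate, here is a minimal sketch on a uniform grid. The uniform grid is a simplifying assumption (an FP4 grid is not uniform), and `stochastic_round` is an illustrative helper, not code from the paper or from any NVIDIA library.

```python
import math
import random

def stochastic_round(x, step=1.0, rng=random):
    """Round x to a multiple of `step`, rounding up with probability equal to the
    fractional distance above the lower grid point, so the expected result is x."""
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    rounded = lower + (1 if rng.random() < frac else 0)
    return rounded * step

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [stochastic_round(0.3, rng=rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)  # close to 0.3; round-to-nearest would always give 0.0
```

Each individual sample is either 0.0 or 1.0, but the average stays near the true value, which is why small gradient contributions are not systematically lost.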

[190] EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering

Haolei Xu,Xinyu Mei,Yuchen Yan,Rui Zhou,Wenqi Zhang,Weiming Lu,Yueting Zhuang,Yongliang Shen

Main category: cs.CL

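TL;DR: EasySteer is an efficient, extensible LLM steering framework built on vLLM; through modular design and deep integration it controls model behavior at inference time and performs well on tasks such as overthinking mitigation and hallucination reduction.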

Details

Motivation: Existing LLM steering frameworks suffer from computational inefficiency, limited extensibility, and restricted functionality, which hinders research progress and practical deployment, so a high-performance, flexible, unified framework is needed. Method: Builds EasySteer, a modular framework on top of vLLM that supports both analysis-based and learning-based steering methods and provides pluggable interfaces, fine-grained parameter control, and pre-computed steering vectors for eight application domains, with deep integration into the optimized inference engine. Result: Achieves a 5.5-11.4x speedup over existing frameworks, demonstrates effectiveness in overthinking mitigation, hallucination reduction, and other applications, and ships an interactive demonstration system. Conclusion: EasySteer turns LLM steering from a research technique into a production-ready capability and establishes key infrastructure for deployable, controllable language models. Abstract: Large language model (LLM) steering has emerged as a promising paradigm for controlling model behavior at inference time through targeted manipulation of hidden states, offering a lightweight alternative to expensive retraining. However, existing steering frameworks suffer from critical limitations: computational inefficiency, limited extensibility, and restricted functionality that hinder both research progress and practical deployment. We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM. Our system features modular architecture with pluggable interfaces for both analysis-based and learning-based methods, fine-grained parameter control, pre-computed steering vectors for eight application domains, and an interactive demonstration system. Through deep integration with vLLM's optimized inference engine, EasySteer achieves 5.5-11.4$\times$ speedup over existing frameworks. Extensive experiments demonstrate its effectiveness in overthinking mitigation, hallucination reduction, and other key applications. EasySteer transforms steering from research technique to production-ready capability, establishing critical infrastructure for deployable, controllable language models.
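
As a rough illustration of what "manipulation of hidden states" can mean, the sketch below shows additive steering, one widely used form in which a scaled steering vector is added to a hidden state. This is an assumption about the general technique, not EasySteer's API; `steer`, `alpha`, and the toy vectors are hypothetical.

```python
def steer(hidden_state, steering_vector, alpha=1.0):
    """Return hidden_state + alpha * steering_vector, element by element."""
    if len(hidden_state) != len(steering_vector):
        raise ValueError("hidden state and steering vector must have the same length")
    return [h + alpha * v for h, v in zip(hidden_state, steering_vector)]

h = [0.5, -1.0, 2.0]   # toy hidden state for one token at one layer
v = [1.0, 0.0, -1.0]   # toy steering direction
steered = steer(h, v, alpha=2.0)
```

In a real model the same addition would be applied inside the forward pass at chosen layers and token positions, and `alpha` would set the steering strength.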

[191] NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation

Penghai Zhao,Jinyu Tian,Qinghua Xing,Xin Zhang,Zheng Li,Jianjun Qian,Ming-Ming Cheng,Xiang Li

Main category: cs.CL

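TL;DR: NAIPv2 is a debiased, efficient framework for paper quality estimation that improves scoring consistency through pairwise learning within domain-year groups and a Review Tendency Signal (RTS); it achieves state-of-the-art performance on the large-scale NAIDv2 dataset and offers good generalization and inference efficiency.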

Details

Motivation: Existing LLM-based paper quality estimation methods have high inference cost, while direct score regression suffers from scale inconsistencies, so an efficient and consistent estimation framework is needed. Method: Proposes NAIPv2, which uses pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings and introduces the Review Tendency Signal (RTS) to integrate reviewer scores and confidences; builds NAIDv2, a dataset of about 24,000 ICLR submissions, for training and evaluation. Result: NAIPv2 reaches state-of-the-art performance (78.2% AUC, 0.432 Spearman) with linear-time inference, and on unseen NeurIPS submissions its predicted scores rise consistently across decision categories from Rejected to Oral, indicating good generalization. Conclusion: NAIPv2 is a debiased, efficient, and generalizable framework for paper quality estimation and a step toward future scientific intelligence systems. Abstract: The ability to estimate the quality of scientific papers is central to how both humans and AI systems will advance scientific knowledge in the future. However, existing LLM-based estimation methods suffer from high inference cost, whereas the faster direct score regression approach is limited by scale inconsistencies. We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings and introduces the Review Tendency Signal (RTS) as a probabilistic integration of reviewer scores and confidences. To support training and evaluation, we further construct NAIDv2, a large-scale dataset of 24,276 ICLR submissions enriched with metadata and detailed structured content. Trained on pairwise comparisons but enabling efficient pointwise prediction at deployment, NAIPv2 achieves state-of-the-art performance (78.2% AUC, 0.432 Spearman), while maintaining scalable, linear-time efficiency at inference. Notably, on unseen NeurIPS submissions, it further demonstrates strong generalization, with predicted scores increasing consistently across decision categories from Rejected to Oral. These findings establish NAIPv2 as a debiased and scalable framework for automated paper quality estimation, marking a step toward future scientific intelligence systems. Code and dataset are released at https://sway.cloud.microsoft/Pr42npP80MfPhvj8.
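
The idea of comparing papers only within the same domain-year group can be sketched as follows. This is a toy illustration under stated assumptions: the tuple layout, the function name `within_group_pairs`, and the numeric scores are hypothetical, and a plain score stands in for the paper's RTS.

```python
from collections import defaultdict
from itertools import combinations

def within_group_pairs(papers):
    """papers: iterable of (paper_id, domain, year, score) tuples.
    Returns (preferred_id, other_id) pairs, formed only between papers that
    share the same (domain, year) group."""
    groups = defaultdict(list)
    for paper_id, domain, year, score in papers:
        groups[(domain, year)].append((paper_id, score))
    pairs = []
    for members in groups.values():
        for (id_a, s_a), (id_b, s_b) in combinations(members, 2):
            if s_a == s_b:
                continue  # ties carry no preference signal
            pairs.append((id_a, id_b) if s_a > s_b else (id_b, id_a))
    return pairs

papers = [
    ("p1", "cv", 2023, 6.5),
    ("p2", "cv", 2023, 4.0),
    ("p3", "nlp", 2023, 8.0),
    ("p4", "cv", 2024, 7.0),
    ("p5", "cv", 2023, 6.5),
]
pairs = within_group_pairs(papers)
```

Because "p3" and "p4" have no group mates, they contribute no pairs, so scores are never compared across domains or years where rating scales may differ.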

[192] Incentive-Aligned Multi-Source LLM Summaries

Yanchen Jiang,Zhe Feng,Aranyak Mehta

Main category: cs.CL

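TL;DR: Proposes Truthful Text Summarization (TTS), an incentive-aligned framework that improves the factual accuracy and robustness of multi-source synthesis by decomposing claims, eliciting source stances, scoring sources, and filtering unreliable ones.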

Details

Motivation: When LLMs synthesize multiple, possibly conflicting texts, current pipelines give sources weak incentives to be accurate and are vulnerable to adversarial content. Method: Decomposes a draft into atomic claims, elicits each source's stance on every claim, scores sources with an adapted multi-task peer-prediction mechanism, and filters unreliable sources before re-summarizing. Result: Experiments show that TTS improves factual accuracy and robustness while preserving fluency and discourages manipulation. Conclusion: Through incentive alignment, TTS makes truthful reporting the utility-maximizing strategy and improves factual robustness without ground-truth labels. Abstract: Large language models (LLMs) are increasingly used in modern search and answer systems to synthesize multiple, sometimes conflicting, texts into a single response, yet current pipelines offer weak incentives for sources to be accurate and are vulnerable to adversarial content. We introduce Truthful Text Summarization (TTS), an incentive-aligned framework that improves factual robustness without ground-truth labels. TTS (i) decomposes a draft synthesis into atomic claims, (ii) elicits each source's stance on every claim, (iii) scores sources with an adapted multi-task peer-prediction mechanism that rewards informative agreement, and (iv) filters unreliable sources before re-summarizing. We establish formal guarantees that align a source's incentives with informative honesty, making truthful reporting the utility-maximizing strategy. Experiments show that TTS improves factual accuracy and robustness while preserving fluency, aligning exposure with informative corroboration and disincentivizing manipulation.

[193] Learning to Parallel: Accelerating Diffusion Large Language Models via Adaptive Parallel Decoding

Wenrui Bao,Zhiben Chen,Dan Xu,Yuzhang Shang

Main category: cs.CL

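TL;DR: This paper proposes Learn2PD, a dynamic parallel decoding framework that trains a lightweight filter model to predict, for each token position, whether the current prediction already matches the final output; combined with the EoTP mechanism, it substantially speeds up inference of diffusion large language models without a drop in performance.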

Details

Motivation: Existing parallel decoding strategies rely on fixed, input-agnostic heuristics that cannot adapt to input-specific characteristics, which leads to suboptimal speed-quality trade-offs across diverse NLP tasks. Method: Proposes the Learn2PD framework, which trains a lightweight, adaptive filter model to predict for each token position whether the current prediction matches the final output, and uses End-of-Text Prediction (EoTP) to avoid redundant decoding; the filter model is optimized quickly in a post-training manner. Result: On the LLaDA benchmark, the method achieves up to 22.58x speedup with no performance drop, and up to 57.51x when combined with KV-Cache. Conclusion: Learn2PD offers an efficient and flexible parallel decoding scheme that adapts the decoding process to the input and greatly increases inference throughput while maintaining output quality. Abstract: Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through iterative denoising. However, current parallel decoding strategies rely on fixed, input-agnostic heuristics (e.g., confidence thresholds), which fail to adapt to input-specific characteristics, resulting in suboptimal speed-quality trade-offs across diverse NLP tasks. In this work, we explore a more flexible and dynamic approach to parallel decoding. We propose Learning to Parallel Decode (Learn2PD), a framework that trains a lightweight and adaptive filter model to predict, for each token position, whether the current prediction matches the final output. This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when correctly predicted. Importantly, the filter model is learned in a post-training manner, requiring only a small amount of computation to optimize it (minute-level GPU time). Additionally, we introduce End-of-Text Prediction (EoTP) to detect decoding completion at the end of sequence, avoiding redundant decoding of padding tokens. Experiments on the LLaDA benchmark demonstrate that our method achieves up to 22.58$\times$ speedup without any performance drop, and up to 57.51$\times$ when combined with KV-Cache.
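
The oracle strategy that the learned filter is said to approximate can be sketched in a few lines. The oracle needs the final output, which is not available at inference time, so this is only an illustration of the target behavior; `oracle_unmask_step`, `MASK`, and the toy tokens are hypothetical names and data.

```python
MASK = None  # placeholder for a position that is still masked

def oracle_unmask_step(revealed, current_pred, final_output):
    """Reveal every still-masked position whose current prediction already
    equals the final output; keep the others masked."""
    return [
        r if r is not MASK else (c if c == f else MASK)
        for r, c, f in zip(revealed, current_pred, final_output)
    ]

final_output = ["the", "cat", "sat", "down"]
revealed = [MASK, MASK, MASK, MASK]
# Denoising step 1: two positions are already correct and get unmasked.
step1 = oracle_unmask_step(revealed, ["the", "dog", "sat", "up"], final_output)
# Denoising step 2: the remaining positions are now correct.
step2 = oracle_unmask_step(step1, ["the", "cat", "sat", "down"], final_output)
```

A learned filter would replace the `c == f` comparison with a per-position prediction of whether the current token is already final.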

[194] InfoAgent: Advancing Autonomous Information-Seeking Agents

Gongrui Zhang,Jialiang Zhu,Ruiqi Yang,Kai Qiu,Miaosen Zhang,Zhirong Wu,Qi Dai,Bei Liu,Chong Luo,Zhengyuan Yang,Linjie Li,Lijuan Wang,Weizhu Chen,Yuan Zhang,Xin Li,Zhaoyi Liu,Xin Geng,Baining Guo

Main category: cs.CL

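TL;DR: This paper presents InfoAgent, a deep research agent built on a self-hosted search infrastructure and an innovative data synthesis pipeline, and uses a two-stage training recipe to improve LLM performance on complex information-seeking queries.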

Details

Motivation: Existing LLM agents depend on commercial search tools, which lack transparency and are hard to extend, and there is no systematic method for constructing hard-to-find questions. Method: Builds entity trees and applies sub-tree sampling with entity fuzzification to generate difficult queries; develops a self-hosted search infrastructure; and optimizes agent behavior with a two-stage post-training recipe of cold-start supervised finetuning followed by reinforcement learning. Result: InfoAgent reaches 15.3%, 29.2%, and 40.4% accuracy on BrowseComp, BrowseComp-ZH, and Xbench-DS respectively, outperforming prior open-source agents such as WebSailor-72B and DeepDive-32B. Conclusion: With a self-hosted search environment and systematic data synthesis, InfoAgent substantially improves performance on open-domain complex information-seeking tasks and points toward reproducible, extensible agent research. Abstract: Building Large Language Model agents that expand their capabilities by interacting with external tools represents a new frontier in AI research and applications. In this paper, we introduce InfoAgent, a deep research agent powered by an innovative data synthesis pipeline and orchestrated web search tools. To construct challenging, hard-to-find queries, we build entity trees and apply sub-tree sampling with entity fuzzification to systematically increase question difficulty. Unlike prior work that relies heavily on commercial search tools, we develop a dedicated self-hosted search infrastructure, enhancing transparency of agent environments and facilitating further advancement of agent capacity. We evaluate the effectiveness of our data pipeline by measuring the average number of tool calls required to correctly answer a question, and also show that our agent yields better performance when equipped with our tools. Our InfoAgent is post-trained from Qwen3-14B using a two-stage recipe: cold-start supervised finetuning to instill long-horizon search behaviors, followed by reinforcement learning which significantly improves reasoning-driven tool use. With our methods, InfoAgent achieves 15.3% accuracy on BrowseComp, 29.2% on BrowseComp-ZH, and 40.4% on Xbench-DS, outperforming prior open-source deep research agents such as WebSailor-72B and DeepDive-32B.

[195] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

Shubhashis Roy Dipta,Francis Ferraro

Main category: cs.CL

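TL;DR: Proposes Q2E, a query-to-event decomposition method for zero-shot multilingual text-to-video retrieval that uses the latent knowledge in large language models and vision-language models to improve retrieval of videos about complex real-world events.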

Details

Motivation: For text-to-video retrieval of complex real-world events, existing methods struggle to fully understand overly simplified user queries, so a method that automatically extracts latent parametric knowledge about events is needed to improve retrieval. Method: Proposes Q2E, which decomposes the user query with LLMs and VLMs to extract the event's implicit semantics, supports both visual and speech-based inputs, and uses entropy-based fusion scoring for zero-shot multimodal fusion. Result: On two different datasets, Q2E outperforms several state-of-the-art baselines, and experiments show that adding audio information significantly improves text-to-video retrieval. Conclusion: Q2E effectively uses the parametric knowledge in LLMs and VLMs to improve video retrieval for complex events, adapts across datasets, domains, and models, and its multimodal fusion strategy further improves zero-shot retrieval. Abstract: Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.

cs.CV [Back]

[196] Pathological Truth Bias in Vision-Language Models

Yash Thube

Main category: cs.CV

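TL;DR: This paper introduces MATS, a compact behavioral audit for detecting systematic spatial-consistency failures in vision-language models, along with two metrics: Spatial Consistency Score (SCS) and Incorrect Agreement Rate (IAR). Generative VLMs score poorly on these metrics, while contrastive encoders such as CLIP are far more robust. Activation patching localizes the key failure mechanisms and suggests possible repair paths.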

Details

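Motivation: Vision-language models (VLMs) are improving quickly, but standard benchmarks can hide systematic failures that matter in real-world use, especially in judging truthfulness when visual and linguistic information conflict. Finer-grained audits are needed to expose these hidden failure modes and improve trust in the models. Method: Proposes the MATS (Multimodal Audit for Truthful Spatialization) framework, a set of carefully designed tasks that require a model to reject statements contradicted by the image. Defines two quantitative metrics, Spatial Consistency Score (SCS) and Incorrect Agreement Rate (IAR). Evaluates several models (including LLaVA, QwenVLchat, CLIP, and SigLIP) and uses activation patching for causal analysis to localize the model components responsible for failures. Result: Instruction-tuned generative VLMs (such as LLaVA 1.5 and QwenVLchat) show very low SCS and high IAR, meaning they readily accept statements that contradict the visual content, while contrastive models (such as CLIP and SigLIP) do far better on both metrics. Activation patching indicates that failures in generative models concentrate in mid-to-late cross-attention, whereas in contrastive models they lie in the pooled projection components. Conclusion: Generative VLMs have a systematic weakness in judging spatial truthfulness, while contrastive architectures are more robust. MATS offers an effective diagnostic that reveals the mechanisms behind model behavior and, through causal intervention, points to concrete repair directions for more trustworthy multimodal systems. Abstract: Vision Language Models (VLMs) are improving quickly, but standard benchmarks can hide systematic failures that reduce real-world trust. We introduce MATS (Multimodal Audit for Truthful Spatialization), a compact behavioral audit that measures whether models reject visually contradicted statements, and two metrics, Spatial Consistency Score (SCS) and Incorrect Agreement Rate (IAR). Instruction-tuned generative VLMs (LLaVA 1.5, QwenVLchat) exhibit very low SCS and high IAR, while contrastive encoders (CLIP, SigLIP) are far more robust. Activation patching causally localizes failure loci (mid to late cross attention for generative models, pooled projection components for contrastive models) and suggests concrete repair paths.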

[197] Scale and Rotation Estimation of Similarity-Transformed Images via Cross-Correlation Maximization Based on Auxiliary Function Method

Shinji Yamashita,Yuma Kinoshita,Hitoshi Kiya

Main category: cs.CV

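TL;DR: Proposes an efficient algorithm based on the Fourier transform in log-polar coordinates and sub-pixel cross-correlation for accurately estimating scale and rotation changes between images.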

Details

Motivation: Traditional phase-correlation methods cannot handle the scale and rotation changes caused by camera zoom or rotation, so a more accurate image registration method is needed. Method: Combines the Fourier transform in log-polar coordinates with a cross-correlation maximization strategy and uses the auxiliary function method to jointly estimate scale and rotation with sub-pixel precision. Result: Experiments show lower mean estimation errors for scale and rotation than conventional Fourier-based methods that rely on discrete cross-correlation. Conclusion: The proposed algorithm estimates scale and rotation between images more accurately than conventional methods and suits applications that require high-precision image alignment. Abstract: This paper introduces a highly efficient algorithm capable of jointly estimating scale and rotation between two images with sub-pixel precision. Image alignment serves as a critical process for spatially registering images captured from different viewpoints, and finds extensive use in domains such as medical imaging and computer vision. Traditional phase-correlation techniques are effective in determining translational shifts; however, they are inadequate when addressing scale and rotation changes, which often arise due to camera zooming or rotational movements. In this paper, we propose a novel algorithm that integrates scale and rotation estimation based on the Fourier transform in log-polar coordinates with a cross-correlation maximization strategy, leveraging the auxiliary function method. By incorporating sub-pixel-level cross-correlation, our method enables precise estimation of both scale and rotation. Experimental results demonstrate that the proposed method achieves lower mean estimation errors for scale and rotation than conventional Fourier transform-based techniques that rely on discrete cross-correlation.
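
The reason log-polar coordinates help here is that scaling and rotation about the origin become plain translations, which correlation-based methods can then estimate. The sketch below checks this property on a single point; it is a geometric illustration only, not the paper's algorithm (which works on Fourier-domain representations), and the function names are illustrative.

```python
import math

def to_log_polar(x, y):
    """Map Cartesian (x, y) to (log radius, angle)."""
    return math.log(math.hypot(x, y)), math.atan2(y, x)

def similarity_transform(x, y, scale, theta):
    """Rotate (x, y) by theta about the origin, then scale by `scale`."""
    c, s = math.cos(theta), math.sin(theta)
    return scale * (c * x - s * y), scale * (s * x + c * y)

scale, theta = 2.0, math.pi / 6
x, y = 3.0, 1.0
rho0, phi0 = to_log_polar(x, y)
rho1, phi1 = to_log_polar(*similarity_transform(x, y, scale, theta))
# In log-polar coordinates the transform is a shift by (log(scale), theta).
d_rho, d_phi = rho1 - rho0, phi1 - phi0
```

Estimating that shift with sub-pixel precision is then what determines how accurately scale and rotation can be recovered.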

[198] Robust Object Detection for Autonomous Driving via Curriculum-Guided Group Relative Policy Optimization

Xu Jia

Main category: cs.CV

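TL;DR: Proposes a reinforcement learning framework that combines curriculum learning with difficulty-aware filtering to improve the localization accuracy and robustness of multimodal large models on structured perception tasks.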

Details

Motivation: Multimodal large language models excel at vision-language reasoning but struggle with structured perception tasks that require precise localization and robustness. Method: Augments Group Relative Policy Optimization (GRPO) with curriculum-based data scheduling and difficulty-aware filtering to cope with optimization under sparse, noisy rewards. Result: Detection accuracy and robustness improve substantially on autonomous driving benchmarks, and ablations confirm the importance of reward design, KL regularization, and curriculum pacing for convergence stability and generalization. Conclusion: Reinforcement-driven optimization combined with structured data curricula is a scalable path toward robust and interpretable multimodal detection. Abstract: Multimodal Large Language Models (MLLMs) excel in vision-language reasoning but often struggle with structured perception tasks requiring precise localization and robustness. We propose a reinforcement learning framework that augments Group Relative Policy Optimization (GRPO) with curriculum-based data scheduling and difficulty-aware filtering. This approach stabilizes optimization under sparse, noisy rewards and enables progressive adaptation to complex samples. Evaluations on autonomous driving benchmarks demonstrate substantial improvements in detection accuracy and robustness. Ablation studies confirm the importance of reward design, KL regularization, and curriculum pacing for convergence stability and generalization. Our findings highlight reinforcement-driven optimization with structured data curricula as a scalable path toward robust and interpretable multimodal detection.

[199] Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation

Ha-Hieu Pham,Minh Le,Han Huynh,Nguyen Quoc Khanh Le,Huy-Hieu Pham

Main category: cs.CV

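TL;DR: Proposes Topology Graph Consistency (TGC), a framework based on graph-theoretic constraints that improves semi-supervised semantic segmentation by matching Laplacian spectra, connected-component counts, and adjacency statistics, and significantly narrows the gap to full supervision at low annotation rates.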

Details

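Motivation: Existing semi-supervised semantic segmentation methods rely on pixel-level consistency, which propagates noisy pseudo-labels and yields fragmented or topologically invalid masks; since dense annotation is especially costly in computational pathology, a more robust global consistency mechanism is needed. Method: Proposes the Topology Graph Consistency (TGC) framework, which aligns the graph structure of prediction graphs and reference graphs by constraining Laplacian spectra, connected-component counts, and adjacency statistics, thereby preserving global topology. Result: Experiments on GlaS and CRAG show that TGC achieves state-of-the-art performance with 5-10% supervision and significantly narrows the gap to fully supervised methods. Conclusion: By introducing graph-theoretic global topology constraints, TGC improves the accuracy and structural plausibility of semi-supervised segmentation and suits annotation-scarce medical image analysis. Abstract: Semi-supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision. Code is available at https://github.com/hieuphamha19/TGC.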
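
One of the topology statistics mentioned above, the component count, can be illustrated with a small sketch that counts connected foreground regions in a binary mask and compares prediction against reference. This is an assumption-laden toy: it uses 4-connectivity on hard masks and a plain absolute difference, whereas the paper's actual graph construction and loss are not specified in the abstract; `count_components` is an illustrative name.

```python
from collections import deque

def count_components(mask):
    """Count 4-connected foreground components in a binary mask (list of rows)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                components += 1  # new component found; flood-fill it with BFS
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    cr, cc = queue.popleft()
                    for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                        if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            queue.append((nr, nc))
    return components

prediction = [[1, 0, 1],
              [0, 0, 1],
              [1, 0, 0]]
reference = [[1, 1, 1],
             [0, 0, 1],
             [0, 0, 0]]
mismatch = abs(count_components(prediction) - count_components(reference))
```

A fragmented prediction like the one above has more components than the reference, which is the kind of topological error that pixel-level consistency alone does not penalize directly.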

[200] A review of Recent Techniques for Person Re-Identification

Andrea Asperti,Salvatore Fiorilla,Simone Nardi,Lorenzo Orsini

Main category: cs.CV

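TL;DR: This paper surveys person re-identification (Re-ID), focusing on recent progress in supervised and unsupervised methods. The authors argue that supervised methods are close to a performance ceiling, while unsupervised methods have advanced markedly in recent years and the performance gap is narrowing.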

Details

Motivation: Supervised methods depend on large amounts of annotated data, which is costly to label and scales poorly, so research has shifted toward unsupervised Re-ID methods that exploit unlabeled data. Method: Systematically reviews and categorizes significant recent publications on supervised and unsupervised person re-identification and analyzes technical trends and potential. Result: Supervised Re-ID is near its performance ceiling with little room for improvement, while unsupervised Re-ID has made notable progress over the past three years and is approaching supervised performance. Conclusion: Unsupervised person re-identification has great potential and may converge with supervised methods in performance, supporting broader application of Re-ID. Abstract: Person re-identification (Re-ID), a crucial task in surveillance, involves matching individuals across different camera views. The advent of Deep Learning, especially supervised techniques like Convolutional Neural Networks and Attention Mechanisms, has significantly enhanced person Re-ID. However, the success of supervised approaches hinges on vast amounts of annotated data, posing scalability challenges in data labeling and computational costs. To address these limitations, recent research has shifted towards unsupervised person re-identification. Leveraging abundant unlabeled data, unsupervised methods aim to overcome the need for pairwise labelled data. Although traditionally trailing behind supervised approaches, unsupervised techniques have shown promising developments in recent years, signalling a narrowing performance gap. Motivated by this evolving landscape, our survey pursues two primary objectives. First, we review and categorize significant publications in supervised person re-identification, providing an in-depth overview of the current state-of-the-art and emphasizing little room for further improvement in this domain. Second, we explore the latest advancements in unsupervised person re-identification over the past three years, offering insights into emerging trends and shedding light on the potential convergence of performance between supervised and unsupervised paradigms. This dual-focus survey aims to contribute to the evolving narrative of person re-identification, capturing both the mature landscape of supervised techniques and the promising outcomes in the realm of unsupervised learning.

[201] Sequential Token Merging: Revisiting Hidden States

Yan Wen,Peng Ye,Lin Zhang,Baopu Li,Jiakang Yuan,Yaoxin Yang,Tao Chen

Main category: cs.CV

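TL;DR: Proposes Sequential Token Merging (STM), which uses bidirectional nearest-neighbor merging and hidden-state protection to substantially reduce the token count of Vision Mambas while preserving performance, yielding efficient and stable compression of vision models.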

Details

Motivation: Existing methods do not adequately account for the Limited Directional Sequential Dependence (LDSD) of Vision Mambas, so they overlook a key information-flow mechanism when reducing token redundancy. Method: Building on Mamba's selective scan mechanism, the authors design a bidirectional nearest-neighbor token merging strategy that preserves sequential dependencies, protect hidden states to stabilize the representations around the class token, and exploit layer-wise loss convergence to improve stability. Result: On ViM-Ti, a 20% token reduction costs only 1.0% accuracy; on ViM-S, a 40% reduction costs only 1.4%, a state-of-the-art balance of efficiency and accuracy. Conclusion: STM exploits Mamba's intrinsic dynamics to cut computational complexity substantially while maintaining performance, and offers new understanding and optimization directions for state-space models. Abstract: Vision Mambas (ViMs) achieve remarkable success with sub-quadratic complexity, but their efficiency remains constrained by quadratic token scaling with image resolution. While existing methods address token redundancy, they overlook ViMs' intrinsic Limited Directional Sequential Dependence (LDSD) - a critical information flow mechanism revealed in our analysis. We further identify that Mamba's selective scan enables gradual information aggregation in hidden states. Based on these insights, we propose Sequential Token Merging (STM), featuring: 1) Bidirectional nearest neighbor merging to preserve sequential dependencies through symmetric spatial aggregation, and 2) Hidden states protection to stabilize the hidden states around the class token. STM strategically leverages Mamba's layer-wise loss convergence to convert temporal forgetfulness into stability. Experiments demonstrate STM's superiority: 1.0% accuracy drop for ViM-Ti at 20% token reduction, and only 1.4% degradation for ViM-S at 40% reduction. Our method achieves state-of-the-art efficiency with minimal complexity, while providing new insights into state-space model dynamics. Codes will be released soon.
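The abstract does not spell out the merging rule, so the following is only a minimal sketch of one possible reading: each token is compared with its sequential neighbours, the most similar adjacent pair is averaged, and order is preserved so the scan still sees a valid sequence. The function names, the cosine criterion, the averaging rule, and the `protected` set (standing in for tokens near the class token) are all assumptions for illustration, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_adjacent(tokens, num_merges, protected=()):
    """Greedily average the most similar adjacent token pairs.

    Hypothetical helper: `protected` holds original token indices that
    must never take part in a merge (e.g. the class token).
    """
    tokens = [list(t) for t in tokens]
    origin = list(range(len(tokens)))  # original index of each surviving token
    for _ in range(num_merges):
        best, best_sim = None, -2.0
        for i in range(len(tokens) - 1):
            if origin[i] in protected or origin[i + 1] in protected:
                continue
            sim = cosine(tokens[i], tokens[i + 1])
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None:  # nothing left that may be merged
            break
        merged = [(x + y) / 2 for x, y in zip(tokens[best], tokens[best + 1])]
        tokens[best:best + 2] = [merged]
        origin[best:best + 2] = [origin[best]]
    return tokens


toy_tokens = [[1, 0], [1, 0.1], [0, 1], [0, 1.1], [5, 5]]
reduced = merge_adjacent(toy_tokens, num_merges=2, protected={4})
```

In this toy run five tokens become three (a 40% reduction) and the protected last token is left untouched.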

[202] Deep Learning Empowered Super-Resolution: A Comprehensive Survey and Future Prospects

Le Zhang,Ao Li,Qibin Hou,Ce Zhu,Yonina C. Eldar

Main category: cs.CV

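TL;DR: A comprehensive survey of super-resolution covering single-image, video, stereo, and light-field methods. It analyzes over 150 SISR methods, nearly 70 VSR methods, and about 30 SSR and LFSR methods, proposes a taxonomy based on backbone structure, and discusses under-studied open problems.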

Details

Motivation: Existing surveys mostly focus on specific sub-areas and lack a systematic summary of super-resolution as a whole; with advances in deep learning and growing demand for high-quality visual applications, a comprehensive review is needed. Method: A broad survey of mainstream super-resolution methods, organized by task type (SISR, VSR, SSR, LFSR), analyzing methodologies, datasets, evaluation protocols, empirical results, and complexity, and proposing a new taxonomy based on backbone structure. Result: Over 250 super-resolution methods are collected and analyzed, a unified taxonomy is established, the performance and characteristics of the methods are summarized, gaps and challenges in current research are identified, and a GitHub repository is released to support follow-up work. Conclusion: The survey provides a systematic reference for the super-resolution field, helping researchers understand how the techniques have evolved, choose suitable methods, and pursue the open problems. Abstract: Super-resolution (SR) has garnered significant attention within the computer vision community, driven by advances in deep learning (DL) techniques and the growing demand for high-quality visual applications. With the expansion of this field, numerous surveys have emerged. Most existing surveys focus on specific domains, lacking a comprehensive overview of this field. Here, we present an in-depth review of diverse SR methods, encompassing single image super-resolution (SISR), video super-resolution (VSR), stereo super-resolution (SSR), and light field super-resolution (LFSR). We extensively cover over 150 SISR methods, nearly 70 VSR approaches, and approximately 30 techniques for SSR and LFSR. We analyze methodologies, datasets, evaluation protocols, empirical results, and complexity. In addition, we construct a taxonomy based on each backbone structure according to the diverse purposes. We also explore valuable yet under-studied open issues in the field. We believe that this work will serve as a valuable resource and offer guidance to researchers in this domain. To facilitate access to related work, we created a dedicated repository available at https://github.com/AVC2-UESTC/Holistic-Super-Resolution-Review.

[203] Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment

Abhiroop Chatterjee,Susmita Ghosh

Main category: cs.CV

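TL;DR: Proposes a vision-language model (VLM) for hyperspectral scene understanding built on a CLIP-style contrastive learning framework; it reaches SOTA performance while updating only 0.07% of the parameters.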

Details

Motivation: Hyperspectral images have a high-dimensional 3D voxel structure, and cross-modal alignment of vision and language models in this domain remains underexplored, so learning must make efficient use of high-value data. Method: A CLIP-style contrastive training framework maps voxel-level embeddings from a vision backbone onto the latent space of a frozen large embedding model (LEM), with a trainable probe aligning vision features with the textual token representations; descriptive prompts serve as structured anchors for the HSI embeddings, and the contrastive loss is applied to curated hard and semi-hard negatives. Result: Gains of +0.92 OA and +1.60 κ over baselines on Indian Pines and +0.69 OA and +0.90 κ on Pavia University, with a parameter count about 1/50 that of DCTN and 1/90 that of SS-TMNet. Conclusion: The method achieves effective cross-modal alignment for hyperspectral images with very few updated parameters and outperforms existing unimodal and multimodal baselines. Abstract: As data requirements continue to grow, efficient learning increasingly depends on the curation and distillation of high-value data rather than brute-force scaling of model sizes. In the case of a hyperspectral image (HSI), the challenge is amplified by the high-dimensional 3D voxel structure, where each spatial location is associated with hundreds of contiguous spectral channels. While vision and language models have been optimized effectively for natural image or text tasks, their cross-modal alignment in the hyperspectral domain remains an open and underexplored problem. In this article, we make an attempt to optimize a Vision-Language Model (VLM) for hyperspectral scene understanding by exploiting a CLIP-style contrastive training framework. Our framework maps voxel-level embeddings from a vision backbone onto the latent space of a frozen large embedding model (LEM), where a trainable probe aligns vision features with the model's textual token representations. The two modalities are aligned via a contrastive loss restricted to a curated set of hard (closest wrong classes) and semi-hard (random distractors) negatives, along with positive pairs. To further enhance alignment, descriptive prompts that encode class semantics are introduced and act as structured anchors for the HSI embeddings. It is seen that the proposed method updates only 0.07 percent of the total parameters, yet yields state-of-the-art performance. For example, on Indian Pines (IP) the model produces better results over unimodal and multimodal baselines by +0.92 Overall Accuracy (OA) and +1.60 Kappa ($\kappa$), while on Pavia University (PU) data it provides gains of +0.69 OA and +0.90 $\kappa$. Moreover, this is achieved with a parameter set nearly 50$\times$ smaller than DCTN and 90$\times$ smaller than SS-TMNet.
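A contrastive loss restricted to curated negatives can be sketched as follows. This is an illustrative sketch only: it assumes an InfoNCE-style softmax over one positive, the `num_hard` most similar wrong classes, and `num_random` randomly drawn distractors. The function name, the temperature, and the selection counts are hypothetical and are not taken from the paper.

```python
import math
import random


def curated_contrastive_loss(sims, label, num_hard=2, num_random=1,
                             temperature=0.1, rng=None):
    """InfoNCE-style loss over one positive and a curated negative set.

    sims[c] is the similarity between one image embedding and the text
    embedding of class c; `label` is the index of the true class.
    """
    rng = rng or random.Random(0)
    wrong = [c for c in range(len(sims)) if c != label]
    # Hard negatives: the wrong classes closest to the image embedding.
    hard = sorted(wrong, key=lambda c: sims[c], reverse=True)[:num_hard]
    # Semi-hard negatives: random distractors drawn from the remaining classes.
    rest = [c for c in wrong if c not in hard]
    semi = rng.sample(rest, min(num_random, len(rest)))
    logits = [sims[label] / temperature]
    logits += [sims[c] / temperature for c in hard + semi]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]


loss_aligned = curated_contrastive_loss([0.9, 0.1, 0.2, 0.0, -0.3], label=0)
loss_misaligned = curated_contrastive_loss([0.1, 0.9, 0.2, 0.0, -0.3], label=0)
```

The loss is small when the true class already has the highest similarity and grows when a wrong class dominates, which is the signal that would drive the trainable probe.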

Zhuang Qi,Pan Yu,Lei Meng,Sijin Zhou,Han Yu,Xiaoxiao Li,Xiangxu Meng

Main category: cs.CV

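TL;DR: Proposes GPR-NIAM, which uses a non-interfering attention masking mechanism to enable one-shot federated prompt learning and outperforms existing methods in cross-task generalization and communication efficiency.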

Details

Motivation: Existing federated prompt learning methods depend on multi-round communication and lack cross-task generalization, and they fall short especially in the one-shot communication setting. Method: A non-interfering attention masking mechanism is designed, in which an attention isolation module suppresses prompt-to-text attention and reweights the reverse attention to preserve cross-task generalization; a cross-silo collaborative refinement module integrates decentralized visual knowledge and calibrates the global prompt by aligning multi-source cross-modal knowledge. Result: Experiments on ten benchmark datasets under two tasks show that GPR-NIAM outperforms eight state-of-the-art methods in both class-level and domain-level generalization. Conclusion: GPR-NIAM addresses the reliance on multi-round communication and the weak cross-task generalization of federated prompt learning, substantially improving performance and adaptability under a single communication round. Abstract: Federated Prompt Learning (FPL) enables communication-efficient adaptation by tuning lightweight prompts on top of frozen pre-trained models. Existing FPL methods typically rely on global information, which is only available after the second training round, to facilitate collaboration among client models. Therefore, they are inherently dependent on multi-round communication to fully exhibit their strengths. Moreover, existing one-shot federated learning methods typically focus on fitting seen tasks, but lack cross-task generalization. To bridge this gap, we propose the Global Prompt Refinement with Non-Interfering Attention Masking (GPR-NIAM) method for one-shot FPL. The core idea is to design a masking mechanism that restricts excessive interaction between the original text embeddings and the learnable prompt embeddings. GPR-NIAM achieves this through the collaboration of two key modules. Firstly, the attention isolation module suppresses attention from the learnable prompt tokens to the original text tokens, and reweights the reverse attention which preserves generalization across tasks. Secondly, the cross-silo collaborative refinement module integrates decentralized visual knowledge into a unified base and calibrates the global prompt through multi-source cross-modal knowledge alignment, further mitigating the inconsistency caused by data heterogeneity. Extensive experiments conducted on ten benchmark datasets under two tasks show that GPR-NIAM outperforms eight state-of-the-art methods in both class-level and domain-level generalization.
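The attention isolation idea can be illustrated with an additive attention bias. The sketch below assumes that "suppressing" prompt-to-text attention means masking it out entirely and that "reweighting" the reverse direction means scaling text-to-prompt attention by a constant factor before normalization. Both choices, along with the names `build_attention_bias` and `reverse_weight`, are hypothetical and not taken from the paper.

```python
import math


def build_attention_bias(is_prompt, reverse_weight=0.5):
    """Additive bias matrix; rows are queries, columns are keys."""
    n = len(is_prompt)
    bias = [[0.0] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if is_prompt[q] and not is_prompt[k]:
                # Prompt query, text key: attention is blocked.
                bias[q][k] = float("-inf")
            elif not is_prompt[q] and is_prompt[k]:
                # Text query, prompt key: attention is down-weighted.
                bias[q][k] = math.log(reverse_weight)
    return bias


def masked_softmax(scores, bias_row):
    """Softmax over one query's scores after adding its bias row."""
    logits = [s + b for s, b in zip(scores, bias_row)]
    peak = max(logits)  # finite, because a token can always attend to itself
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two learnable prompt tokens followed by two original text tokens.
bias = build_attention_bias([True, True, False, False])
prompt_row = masked_softmax([0.0, 0.0, 0.0, 0.0], bias[0])
text_row = masked_softmax([0.0, 0.0, 0.0, 0.0], bias[2])
```

With uniform raw scores, a prompt token spreads its attention over prompt tokens only, while a text token still sees the prompts at half weight.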

[205] GZSL-MoE: Apprentissage Généralisé Zéro-Shot basé sur le Mélange d'Experts pour la Segmentation Sémantique de Nuages de Points 3D Appliqué à un Jeu de Données d'Environnement de Collaboration Humain-Robot

Ahed Alboody

Main category: cs.CV

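TL;DR: Proposes GZSL-MoE, a Mixture-of-Experts-based generative zero-shot learning model for 3D point cloud semantic segmentation, evaluated on the COVERED dataset of human-robot collaboration environments; combining MoE with a generative model improves segmentation of both seen and unseen classes.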

Details

Motivation: In 3D point cloud semantic segmentation it is difficult to obtain complete training data for every object class, so a method is needed that can recognize unseen classes at test time. Method: Mixture-of-Experts (MoE) layers are introduced into the generator and discriminator of a generative zero-shot learning (GZSL) model; features are extracted with a pre-trained KPConv, and fake features closer to the real ones are generated to improve generalization to unseen classes. Result: GZSL-MoE improves semantic segmentation performance on both seen and unseen classes of the COVERED dataset, supporting its effectiveness. Conclusion: By incorporating the MoE mechanism, GZSL-MoE performs well on zero-shot 3D point cloud semantic segmentation and offers a solution for complex environments that lack fully annotated data. Abstract: Generative Zero-Shot Learning approach (GZSL) has demonstrated significant potential in 3D point cloud semantic segmentation tasks. GZSL leverages generative models like GANs or VAEs to synthesize realistic features (real features) of unseen classes. This allows the model to label unseen classes during testing, despite being trained only on seen classes. In this context, we introduce the Generalized Zero-Shot Learning based-upon Mixture-of-Experts (GZSL-MoE) model. This model incorporates Mixture-of-Experts layers (MoE) to generate fake features that closely resemble real features extracted using a pre-trained KPConv (Kernel Point Convolution) model on seen classes. The main contribution of this paper is the integration of Mixture-of-Experts into the Generator and Discriminator components of the Generative Zero-Shot Learning model for 3D point cloud semantic segmentation, applied to the COVERED dataset (CollabOratiVE Robot Environment Dataset) for Human-Robot Collaboration (HRC) environments. By combining the Generative Zero-Shot Learning model with Mixture-of-Experts, GZSL-MoE for 3D point cloud semantic segmentation provides a promising solution for understanding complex 3D environments, especially when comprehensive training data for all object classes is unavailable. The performance evaluation of the GZSL-MoE model highlights its ability to enhance performance on both seen and unseen classes. Keywords: Generalized Zero-Shot Learning (GZSL), 3D Point Cloud, 3D Semantic Segmentation, Human-Robot Collaboration, COVERED (CollabOratiVE Robot Environment Dataset), KPConv, Mixture-of-Experts

[206] IBiT: Utilizing Inductive Biases to Create a More Data Efficient Attention Mechanism

Adithya Giri

Main category: cs.CV

TL;DR: Proposes IBiT, a Vision Transformer that adds inductive biases through learned masks, improving performance on small datasets while retaining the explainability of Transformers.

Details

Motivation: Vision Transformers lack the inductive biases of convolutional neural networks, which limits them on small datasets, so inductive biases need to be introduced to improve learning in that regime. Method: Learnable masks inject the inductive biases of convolutional neural networks into Vision Transformers, giving Inductively Biased Image Transformers (IBiT). Result: IBiT is significantly more accurate on small datasets without knowledge distillation, and it keeps the explainability of the original Transformer. Conclusion: Adding inductive biases improves Vision Transformer performance on small datasets and suggests a new approach for applying Transformers when data is limited. Abstract: In recent years, Transformer-based architectures have become the dominant method for Computer Vision applications. While Transformers are explainable and scale well with dataset size, they lack the inductive biases of Convolutional Neural Networks. While these biases may be learned on large datasets, we show that introducing these inductive biases through learned masks allows Vision Transformers to learn on much smaller datasets without Knowledge Distillation. These Transformers, which we call Inductively Biased Image Transformers (IBiT), are significantly more accurate on small datasets, while retaining the explainability of Transformers.
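The abstract says the masks are learned and does not describe their form. As a rough illustration of the kind of convolution-like locality bias such a mask could encode, the sketch below builds a fixed binary mask that lets each patch attend only to patches within a small window on the patch grid. The function name and the `radius` parameter are hypothetical, and a fixed mask is an assumption made purely for illustration.

```python
def locality_mask(grid_h, grid_w, radius=1):
    """Binary attention mask over patches laid out on a grid_h x grid_w grid.

    mask[i][j] is 1 when patch j lies within `radius` rows and columns of
    patch i, mimicking the local receptive field of a convolution.
    """
    n = grid_h * grid_w
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        row_i, col_i = divmod(i, grid_w)
        for j in range(n):
            row_j, col_j = divmod(j, grid_w)
            if max(abs(row_i - row_j), abs(col_i - col_j)) <= radius:
                mask[i][j] = 1
    return mask


mask = locality_mask(3, 3, radius=1)
centre_neighbours = sum(mask[4])  # the centre patch of a 3x3 grid
corner_neighbours = sum(mask[0])  # the top-left corner patch
```

On a 3x3 grid the centre patch sees all nine patches while a corner patch sees only four, the same edge behaviour as an unpadded 3x3 convolution window.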

[207] LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning

Zezhong Fan,Xiaohan Li,Luyi Ma,Kai Zhao,Liang Peng,Topojoy Biswas,Evren Korpeoglu,Kaushiki Nag,Kannan Achan

Main category: cs.CV

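TL;DR: Proposes LayoutAgent, a framework that combines vision-language reasoning with compositional diffusion to generate multi-object scene layouts that are semantically plausible and spatially realistic.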

Details

Motivation: Existing diffusion models lack explicit spatial reasoning, while traditional spatial planning methods struggle to capture the semantic richness of visual scenes, so a layout generation method is needed that respects both semantic relations and physical plausibility. Method: A vision-language model first preprocesses the input images through segmentation, object size estimation, scene graph construction, and prompt rewriting; compositional diffusion then generates bounding boxes that satisfy the spatial constraints implied by the object relations in the scene graph; finally, a foreground-conditioned image generator renders the objects into the planned layout. Result: Experiments show that LayoutAgent outperforms state-of-the-art layout generation models in layout coherence, spatial realism, and aesthetic alignment. Conclusion: LayoutAgent combines vision-language understanding with spatial planning techniques from robotics to generate more realistic and semantically consistent multi-object scene layouts. Abstract: Designing realistic multi-object scenes requires not only generating images, but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, while recent advances in diffusion models have enabled high-quality image generation, they lack explicit spatial reasoning, leading to unrealistic object layouts. On the other hand, traditional spatial planning methods in robotics emphasize geometric and relational consistency, but they struggle to capture semantic richness in visual scenes. To bridge this gap, in this paper, we propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Given multiple input images with target objects in them, our method first employs a vision-language model to preprocess the inputs through segmentation, object size estimation, scene graph construction, and prompt rewriting. Then we leverage compositional diffusion (a method traditionally used in robotics) to synthesize bounding boxes that respect object relations encoded in the scene graph for spatial layouts. In the end, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout guided by designed prompts. Experiments demonstrate that LayoutAgent outperforms other state-of-the-art layout generation models in layout coherence, spatial realism and aesthetic alignment.

[208] CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models

Jie Cai,Kangning Yang,Lan Fu,Jiaming Ding,Jinlong Li,Huiming Sun,Daitao Xing,Jinglin Shen,Zibo Meng

Main category: cs.CV

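TL;DR: CompareBench is a new benchmark for evaluating visual comparison reasoning in vision-language models; it reveals significant weaknesses of existing models on temporal, spatial, quantity, and geometric comparison tasks.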

Details

Motivation: Visual comparison reasoning is a fundamental but overlooked capability of vision-language models, and systematic evaluation tools are lacking. Method: The CompareBench benchmark is built with 1000 QA pairs covering four tasks (quantity, temporal, geometric, and spatial), derived from two auxiliary datasets, TallyBench and HistCaps. Result: Experiments show that even the strongest models perform poorly on temporal ordering and spatial relations and often make mistakes in basic counting and geometric comparison. Conclusion: Visual comparison remains a systematic blind spot for current vision-language models, and CompareBench provides a diagnostic basis for improving multimodal reasoning. Abstract: We introduce CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs), a fundamental yet understudied skill. CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100). It is derived from two auxiliary datasets that we constructed: TallyBench (2000 counting images with QA) and HistCaps (515 historical images with bilingual captions). We evaluate both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (Qwen2.5-VL and Qwen3-VL series). Results show clear scaling trends but also reveal critical limitations: even the strongest models consistently fail at temporal ordering and spatial relations, and they often make mistakes in basic counting and geometric comparisons that are trivial for humans. These findings demonstrate that visual comparison remains a systematic blind spot for current VLMs. By providing controlled, diverse, and diagnostic evaluation, CompareBench establishes a foundation for advancing more reliable multimodal reasoning.

[209] MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning

Yapeng Mi,Hengli Li,Yanpeng Zhao,Chenxi Li,Huimin Wu,Xiaojian Ma,Song-Chun Zhu,Ying Nian Wu,Qing Li

Main category: cs.CV

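TL;DR: Proposes MILR, an image generation method that reasons jointly over image and text at test time by searching over representations of discrete tokens in a unified latent vector space, achieving the best results on several benchmarks.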

Details

Motivation: Existing reasoning-based image generation methods are limited to single-modality reasoning or depend on high-quality reasoning data for fine-tuning, and they lack an effective mechanism for joint cross-modal reasoning. Method: MILR performs search-based reasoning over the vector representations of image and text tokens in a unified latent space using a policy gradient method guided by an image quality critic, and it is instantiated within the MUG framework, which natively supports language reasoning before image synthesis. Result: State-of-the-art results on the GenEval, T2I-CompBench, and WISE benchmarks; on knowledge-intensive WISE the overall score is 0.63, an 80% improvement over the baseline; further analysis indicates that joint reasoning in the unified latent space is key, and qualitative studies show temporal and cultural reasoning ability. Conclusion: By enabling joint cross-modal reasoning in a unified latent space, MILR substantially improves image generation quality, especially in knowledge-intensive settings, supporting the effectiveness and generality of test-time joint reasoning. Abstract: Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic. We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR's non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.

[210] UESA-Net: U-Shaped Embedded Multidirectional Shrinkage Attention Network for Ultrasound Nodule Segmentation

Tangqi Shi,Pietro Lio

Main category: cs.CV

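TL;DR: Proposes UESA-Net, a U-shaped network with multidirectional shrinkage attention that improves the accuracy and robustness of lesion segmentation in breast and thyroid ultrasound images.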

Details

Motivation: Existing networks struggle to balance global semantics and local detail in noisy ultrasound images, which limits segmentation performance, so a framework is needed that bridges the semantic gap between global context and local detail. Method: UESA-Net is an encoder-decoder network whose encoder uses multidirectional attention modules along the horizontal, vertical, and depth directions together with a shrinkage strategy that integrates prior knowledge and local features; the decoder uses a pairwise shrinkage mechanism that combines low-level physical cues with encoder features to enhance context modeling. Result: State-of-the-art segmentation performance on two public datasets, TN3K (IoU 0.8487) and BUSI (IoU 0.6495). Conclusion: By aggregating multidirectional spatial information and prior knowledge, UESA-Net outperforms existing methods on multiple benchmarks and improves the accuracy and robustness of ultrasound image segmentation. Abstract: Background: Breast and thyroid cancers pose an increasing public-health burden. Ultrasound imaging is a cost-effective, real-time modality for lesion detection and segmentation, yet suffers from speckle noise, overlapping structures, and weak global-local feature interactions. Existing networks struggle to reconcile high-level semantics with low-level spatial details. We aim to develop a segmentation framework that bridges the semantic gap between global context and local detail in noisy ultrasound images. Methods: We propose UESA-Net, a U-shaped network with multidirectional shrinkage attention. The encoder-decoder architecture captures long-range dependencies and fine-grained structures of lesions. Within each encoding block, attention modules operate along horizontal, vertical, and depth directions to exploit spatial details, while a shrinkage (threshold) strategy integrates prior knowledge and local features. The decoder mirrors the encoder but applies a pairwise shrinkage mechanism, combining prior low-level physical cues with corresponding encoder features to enhance context modeling. Results: On two public datasets - TN3K (3493 images) and BUSI (780 images) - UESA-Net achieved state-of-the-art performance with intersection-over-union (IoU) scores of 0.8487 and 0.6495, respectively. Conclusions: UESA-Net effectively aggregates multidirectional spatial information and prior knowledge to improve robustness and accuracy in breast and thyroid ultrasound segmentation, demonstrating superior performance to existing methods on multiple benchmarks.
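The abstract refers to a "shrinkage (threshold) strategy" without giving its formula. The standard soft-thresholding operator is one common way to realize shrinkage for denoising, and the sketch below shows it only as a point of reference. Whether UESA-Net uses exactly this operator, and how its threshold is set, is not stated in the abstract, so the function and the fixed `tau` are assumptions.

```python
def soft_threshold(values, tau):
    """Shrink each value toward zero by tau; values within [-tau, tau] become 0."""
    shrunk = []
    for x in values:
        if x > tau:
            shrunk.append(x - tau)
        elif x < -tau:
            shrunk.append(x + tau)
        else:
            shrunk.append(0.0)
    return shrunk


# Small responses (likely speckle noise) are zeroed, large ones are kept but reduced.
features = [-2.0, -0.5, 0.3, 1.5]
denoised = soft_threshold(features, tau=1.0)
```

Here the two weak responses are suppressed to zero while the two strong ones keep their sign and lose one unit of magnitude.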

[211] PartCo: Part-Level Correspondence Priors Enhance Category Discovery

Fernando Julio Cendra,Kai Han

Main category: cs.CV

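TL;DR: Proposes PartCo, a generalized category discovery framework based on part-level correspondence priors; introducing part-level visual feature correspondences improves existing GCD methods and achieves state-of-the-art results on multiple benchmark datasets.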

Details

Motivation: Existing generalized category discovery (GCD) methods rely mainly on semantic labels and global image representations and overlook the part-level cues that are crucial for distinguishing similar categories, so finer-grained feature modeling is needed. Method: The PartCo framework uses part-level visual feature correspondence priors to capture finer semantic structure, enhances category discovery by incorporating part-level relationships, and integrates with existing GCD methods. Result: Experiments on multiple benchmark datasets show that PartCo significantly improves existing GCD methods, bridges the gap between semantic labels and part-level visual compositions, and reaches state-of-the-art performance. Conclusion: By introducing part-level correspondences, PartCo strengthens fine-grained recognition in generalized category discovery, generalizes well across methods, and sets a new reference point for GCD. Abstract: Generalized Category Discovery (GCD) aims to identify both known and novel categories within unlabeled data by leveraging a set of labeled examples from known categories. Existing GCD methods primarily depend on semantic labels and global image representations, often overlooking the detailed part-level cues that are crucial for distinguishing closely related categories. In this paper, we introduce PartCo, short for Part-Level Correspondence Prior, a novel framework that enhances category discovery by incorporating part-level visual feature correspondences. By leveraging part-level relationships, PartCo captures finer-grained semantic structures, enabling a more nuanced understanding of category relationships. Importantly, PartCo seamlessly integrates with existing GCD methods without requiring significant modifications. Our extensive experiments on multiple benchmark datasets demonstrate that PartCo significantly improves the performance of current GCD approaches, achieving state-of-the-art results by bridging the gap between semantic labels and part-level visual compositions, thereby setting new benchmarks for GCD. Project page: https://visual-ai.github.io/partco

[212] DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models

Komal Kumar,Rao Muhammad Anwer,Fahad Shahbaz Khan,Salman Khan,Ivan Laptev,Hisham Cholakkal

Main category: cs.CV

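TL;DR: Proposes DEFT, an efficient fine-tuning framework that decomposes the weight update of a pre-trained text-to-image model into two low-rank matrix components, learning new concepts and unifying multiple tasks while preserving instruction following and editability.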

Details

Motivation: Existing efficient fine-tuning methods struggle to balance personalization, task unification, and editability of generation, especially with few parameters and limited compute, so a method is needed that adapts to new tasks while retaining the original model's capabilities. Method: DEFT decomposes the weight update into two parts: a projection onto the orthogonal complement of the subspace spanned by a low-rank matrix, and a low-rank update within that subspace. Two trainable low-rank matrices are used, one defining the subspace and the other enabling flexible parameter adaptation within it. Result: Extensive experiments on Dreambooth, Dreambench Plus, InsDet, and VisualCloze show state-of-the-art performance in personalization, object and scene adaptation, and visual in-context learning, while instruction following and editing ability are maintained. Conclusion: Through decompositional efficient fine-tuning, DEFT balances adaptability, generality, and editability, showing the potential of efficient fine-tuning for complex text-to-image generation tasks. Abstract: Efficient fine-tuning of pre-trained Text-to-Image (T2I) models involves adjusting the model to suit a particular task or dataset while minimizing computational resources and limiting the number of trainable parameters. However, it often faces challenges in striking a trade-off between aligning with the target distribution: learning a novel concept from a limited image for personalization and retaining the instruction ability needed for unifying multiple tasks, all while maintaining editability (aligning with a variety of prompts or in-context generation). In this work, we introduce DEFT, Decompositional Efficient Fine-Tuning, an efficient fine-tuning framework that adapts a pre-trained weight matrix by decomposing its update into two components with two trainable matrices: (1) a projection onto the complement of a low-rank subspace spanned by a low-rank matrix, and (2) a low-rank update. One trainable low-rank matrix defines the subspace, while the other trainable low-rank matrix enables flexible parameter adaptation within that subspace. We conducted extensive experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework through visual in-context learning with both Stable Diffusion and a unified model. Our results demonstrated state-of-the-art performance, highlighting the emergent properties of efficient fine-tuning. Our code is available at https://github.com/MAXNORM8650/DEFT.
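One way to read the two-component update is $W' = (I - P)W + U R$, where $P$ projects onto the subspace spanned by a trainable matrix $U$ and $R$ is the second trainable matrix. The sketch below works this out for the rank-1 case in plain Python, so that $P = u u^\top / (u^\top u)$. The formula, the rank-1 restriction, and the names `u` and `r` are assumptions made for illustration; the abstract does not give the exact parameterization.

```python
def deft_rank1_update(W, u, r):
    """Rank-1 sketch of a decomposed update: W' = (I - P) W + u r^T.

    W is a d x k matrix (list of rows), u has length d and defines the
    subspace, r has length k and sets the new content inside that subspace.
    """
    d, k = len(W), len(W[0])
    u_norm_sq = sum(x * x for x in u)
    # u^T W: the component of each column of W along u (length k).
    u_t_W = [sum(u[i] * W[i][j] for i in range(d)) for j in range(k)]
    updated = []
    for i in range(d):
        row = []
        for j in range(k):
            projected_out = W[i][j] - u[i] * u_t_W[j] / u_norm_sq  # (I - P) W
            row.append(projected_out + u[i] * r[j])                # + u r^T
        updated.append(row)
    return updated


W = [[1, 2], [3, 4]]
W_new = deft_rank1_update(W, u=[1, 0], r=[10, 20])
```

With `u` aligned to the first coordinate, the first row of `W` is replaced by `r` while the second row, which lies in the complement, is left unchanged. That is the intended behaviour of removing old content from the subspace and writing new content into it.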

[213] VideoScore2: Think before You Score in Generative Video Evaluation

Xuan He,Dongfu Jiang,Ping Nie,Minghao Liu,Zhengxuan Jiang,Mingyi Su,Wentao Ma,Junru Lin,Chun Ye,Yi Lu,Keming Wu,Benjamin Schneider,Quy Duc Do,Zhuofeng Li,Yiming Jia,Yuxuan Zhang,Guo Cheng,Haozhe Wang,Wangchunshu Zhou,Qunshu Lin,Yuanxing Zhang,Ge Zhang,Wenhao Huang,Wenhu Chen

Main category: cs.CV

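TL;DR: Proposes VideoScore2, a multi-dimensional, interpretable, and human-aligned evaluation framework for video generation that gives fine-grained assessments of visual quality, text-to-video alignment, and physical/common-sense consistency, and achieves leading performance on several benchmarks through large-scale annotated data and reinforcement learning.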

Details

Motivation: Existing evaluation methods for text-to-video generation usually give a single opaque score and lack interpretability and fine-grained analysis, so they cannot measure video quality comprehensively; an evaluation framework is needed that assesses several key dimensions at once and provides interpretable reasoning. Method: VideoScore2 uses a two-stage training pipeline: supervised fine-tuning on VideoFeedback2, a large-scale dataset of 27,168 videos annotated with scores and reasoning traces, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to improve analytical robustness. The model explicitly evaluates three dimensions: visual quality, text alignment, and physical/common-sense consistency. Result: VideoScore2 reaches 44.35 (+5.94) accuracy on the in-domain benchmark VideoScore-Bench-v2 and an average of 50.37 (+4.32) on four out-of-domain benchmarks (such as VideoGenReward-Bench and VideoPhy2), clearly outperforming existing methods; it also produces interpretable rationales and supports effective reward modeling for Best-of-N sampling. Conclusion: VideoScore2 is a more comprehensive, interpretable, and high-performing evaluation framework for video generation; multi-dimensional evaluation and chain-of-thought reasoning improve the transparency and usefulness of evaluation and provide a strong reward signal for controllable video generation. Abstract: Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge due to their multi-faceted nature encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for capturing the comprehensive nature of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on a large-scale dataset VideoFeedback2 containing 27,168 human-annotated videos with both scores and reasoning traces across three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc.), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling. Project Page: https://tiger-ai-lab.github.io/VideoScore2/
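Best-of-N sampling with a multi-dimensional evaluator amounts to scoring each candidate and keeping the highest-scoring one. The sketch below assumes the three per-dimension scores are combined with a weighted sum. The `score_fn` callable, the weights, and the toy score table are hypothetical stand-ins and do not reflect the real VideoScore2 interface.

```python
def best_of_n(candidates, score_fn, weights=(1.0, 1.0, 1.0)):
    """Return the candidate with the highest weighted sum of dimension scores.

    score_fn(candidate) must return a tuple of three scores:
    (visual quality, text alignment, physical consistency).
    """
    def total(candidate):
        return sum(w * s for w, s in zip(weights, score_fn(candidate)))
    return max(candidates, key=total)


# Hypothetical pre-computed scores standing in for an evaluator's output.
toy_scores = {
    "video_a": (3.0, 4.0, 2.0),
    "video_b": (4.0, 4.0, 3.5),
    "video_c": (5.0, 2.0, 3.0),
}
best_overall = best_of_n(list(toy_scores), toy_scores.__getitem__)
best_visual = best_of_n(list(toy_scores), toy_scores.__getitem__,
                        weights=(1.0, 0.0, 0.0))
```

Changing the weights changes which sample wins, which is one practical benefit of separate per-dimension scores over a single opaque number.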

Sahar Dastani,Ali Bahri,Gustavo Adolfo Vargas Hakim,Moslem Yazdanpanah,Mehrdad Noori,David Osowiechi,Samuel Barbeau,Ismail Ben Ayed,Herve Lombaert,Christian Desrosiers

Main category: cs.CV

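TL;DR: Proposes TRUST, a test-time adaptation method that uses uncertainty-guided state space model traversals to improve the generalization of VMamba under distribution shift.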

Details

Motivation: The generalization of existing state space models degrades significantly under distribution shift, and effective test-time adaptation methods tailored to their architecture are lacking. Method: Multiple traversal permutations generate several views of the input image, model predictions serve as pseudo-labels to update the Mamba-specific parameters, and the adapted weights are averaged. Result: Experiments on seven benchmarks show that TRUST significantly improves robustness and outperforms existing TTA methods. Conclusion: TRUST is the first method to explicitly exploit the architectural properties of SSMs for test-time adaptation, and it shows strong generalization and robustness on vision tasks. Abstract: State Space Models (SSMs) have emerged as efficient alternatives to Vision Transformers (ViTs), with VMamba standing out as a pioneering architecture designed for vision tasks. However, their generalization performance degrades significantly under distribution shifts. To address this limitation, we propose TRUST (Test-Time Refinement using Uncertainty-Guided SSM Traverses), a novel test-time adaptation (TTA) method that leverages diverse traversal permutations to generate multiple causal perspectives of the input image. Model predictions serve as pseudo-labels to guide updates of the Mamba-specific parameters, and the adapted weights are averaged to integrate the learned information across traversal scans. Altogether, TRUST is the first approach that explicitly leverages the unique architectural properties of SSMs for adaptation. Experiments on seven benchmarks show that TRUST consistently improves robustness and outperforms existing TTA methods.
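Two building blocks of the described pipeline are easy to sketch in isolation: enumerating different traversal orders of a patch grid, and averaging the weights adapted under each traversal. The four scan orders below (row-major, column-major, and their reverses) and the dictionary-of-lists weight format are assumptions for illustration. The abstract does not specify which permutations TRUST uses or how uncertainty guides them.

```python
def traversal_orders(height, width):
    """Four scan orders over a height x width patch grid (patch indices)."""
    row_major = [r * width + c for r in range(height) for c in range(width)]
    col_major = [r * width + c for c in range(width) for r in range(height)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def pseudo_label(probabilities):
    """Use the model's most confident class as the adaptation target."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def average_weights(weight_sets):
    """Element-wise mean of parameter dictionaries adapted under each traversal."""
    count = len(weight_sets)
    return {
        name: [sum(ws[name][i] for ws in weight_sets) / count
               for i in range(len(weight_sets[0][name]))]
        for name in weight_sets[0]
    }


orders = traversal_orders(2, 2)
label = pseudo_label([0.1, 0.7, 0.2])
merged = average_weights([{"a": [1.0, 2.0]}, {"a": [3.0, 4.0]}])
```

In a full pipeline each traversal would produce its own adapted copy of the parameters after a pseudo-label update, and only the averaged copy would be kept for inference.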

Jaeik Kim,Woojin Kim,Woohyeon Park,Jaeyoung Do

Main category: cs.CV

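TL;DR: Introduces MMPB, the first large-scale benchmark for evaluating vision-language models (VLMs) on personalization tasks, with 10k image-query pairs and 111 personalizable concepts. A three-stage protocol applied to 23 mainstream VLMs shows that existing models fall short in dialogue consistency, handling user preferences, and adapting to visual cues.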

Details

Motivation: The ability of large vision-language models to adapt to individual users remains underexplored, while real applications such as smart homes and healthcare require models to align with user-specific concepts, so a systematic evaluation framework is needed. Method: The MMPB benchmark is built with 10k image-query pairs and 111 personalizable concepts across four categories (humans, animals, objects, and characters), with preference-grounded queries in the human category; personalization is decomposed into three task types and evaluated with a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Result: Experiments on 23 mainstream VLMs show that most models (including some closed-source ones) perform poorly on personalization, especially in maintaining consistency over dialogue, handling user preferences, and responding to visual cues, and they exhibit problems such as refusal behaviors and long-context forgetting. Conclusion: MMPB provides a scalable benchmark for evaluating VLM personalization, exposes key weaknesses of current models, and lays a foundation for future research on truly personalized multi-modal AI. Abstract: Visual personalization is essential in user-facing AI systems such as smart homes and healthcare, where aligning model behavior with user-centric concepts is critical. However, recent large Vision-Language Models (VLMs), despite their broad applicability, remain underexplored in their ability to adapt to individual users. In this paper, we introduce MMPB, the first extensive benchmark for evaluating VLMs on personalization. MMPB comprises 10k image-query pairs and includes 111 personalizable concepts across four categories: humans, animals, objects, and characters, with the human category enriched with preference-grounded queries. We structure personalization into three main task types, each highlighting a different key property of VLMs. Using 23 widely used VLMs including both open- and closed-source models, we evaluate personalization performance via a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Our findings indicate that most VLMs (including some closed-source models) struggle with personalization, particularly in maintaining consistency over dialogue, handling user preferences, and adapting to visual cues. Our analysis reveals that the challenges in VLM personalization (such as refusal behaviors and long-context forgetting) highlight substantial room for improvement. By identifying these limitations and offering a scalable benchmark, MMPB offers valuable insights and a solid foundation for future research toward truly personalized multi-modal AI. Project Page: aidaslab.github.io/MMPB

[216] Seeing Isn't Believing: Context-Aware Adversarial Patch Synthesis via Conditional GAN

Roie Kazoom,Alon Goldberg,Hodaya Cohen,Ofer Hadar

Main category: cs.CV

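TL;DR: Proposes a controllable adversarial patch generation framework that combines a generative U-Net design with Grad-CAM-guided patch placement, achieving attack success rates above 99% on a range of deep neural networks while maintaining visual realism and black-box applicability.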

Details

Motivation: Existing adversarial patch attacks mostly rely on white-box assumptions, untargeted objectives, or visually conspicuous patches, which limits their use in real-world settings, so an attack is needed that combines visual stealth, targeted control, and black-box applicability. Method: A generative U-Net produces the patch, and Grad-CAM guides its placement onto semantically important image regions to improve attack effectiveness and visual naturalness, with full control over the input image and target class. Result: Experiments on DenseNet-121, ResNet-50, ViT-B/16, Swin-B/16, and other models show attack success rate (ASR) and target-class success (TCS) above 99%, outperforming existing white-box, untargeted, and non-realistic methods. Conclusion: The framework achieves precise targeted control and good black-box transferability while keeping high visual realism, setting a new benchmark for patch-based adversarial attacks and narrowing the gap between theoretical attack strength and practical stealthiness. Abstract: Adversarial patch attacks pose a severe threat to deep neural networks, yet most existing approaches rely on unrealistic white-box assumptions, untargeted objectives, or produce visually conspicuous patches that limit real-world applicability. In this work, we introduce a novel framework for fully controllable adversarial patch generation, where the attacker can freely choose both the input image $x$ and the target class $y_{\text{target}}$, thereby dictating the exact misclassification outcome. Our method combines a generative U-Net design with Grad-CAM-guided patch placement, enabling semantic-aware localization that maximizes attack effectiveness while preserving visual realism. Extensive experiments across convolutional networks (DenseNet-121, ResNet-50) and vision transformers (ViT-B/16, Swin-B/16, among others) demonstrate that our approach achieves state-of-the-art performance across all settings, with attack success rates (ASR) and target-class success (TCS) consistently exceeding 99%. Importantly, we show that our method not only outperforms prior white-box attacks and untargeted baselines, but also surpasses existing non-realistic approaches that produce detectable artifacts. By simultaneously ensuring realism, targeted control, and black-box applicability (the three most challenging dimensions of patch-based attacks), our framework establishes a new benchmark for adversarial robustness research, bridging the gap between theoretical attack strength and practical stealthiness.
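Saliency-guided placement can be pictured as a sliding-window search over a class-activation heatmap. The sketch below assumes the heatmap has already been computed (for example by Grad-CAM) and simply picks the window with the largest summed saliency. The function name and the exhaustive search are illustrative assumptions, not the paper's procedure.

```python
def best_patch_location(saliency, patch_h, patch_w):
    """Return (top, left) of the patch_h x patch_w window with maximal saliency."""
    rows, cols = len(saliency), len(saliency[0])
    best_pos, best_sum = None, float("-inf")
    for top in range(rows - patch_h + 1):
        for left in range(cols - patch_w + 1):
            total = sum(
                saliency[r][c]
                for r in range(top, top + patch_h)
                for c in range(left, left + patch_w)
            )
            if total > best_sum:
                best_pos, best_sum = (top, left), total
    return best_pos


# Toy 4x4 heatmap with its most salient region in the centre.
heatmap = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]
location = best_patch_location(heatmap, patch_h=2, patch_w=2)
```

For large images a summed-area table would avoid recomputing each window, but the exhaustive version keeps the idea visible.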

[217] Learning Temporal Saliency for Time Series Forecasting with Cross-Scale Attention

Ibrahim Delibasoglu,Fredrik Heintz

Main category: cs.CV

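TL;DR: Presents CrossScaleNet, a time series forecasting model that combines a patch-based cross-attention mechanism with multi-scale processing, offering both high performance and strong temporal explainability.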

Details

Motivation: To improve the transparency and trustworthiness of time series forecasting models, and to address the high computational cost and weak performance of existing explainability methods for temporal saliency detection. Method: Designs CrossScaleNet, a new architecture with embedded attention mechanisms that achieves intrinsic temporal explainability through cross-attention and multi-scale processing. Result: Experiments on synthetic and real-world data confirm the model's advantage in temporal saliency identification and forecasting accuracy, outperforming most Transformer-based models. Conclusion: CrossScaleNet balances explainability and predictive performance, delivering strong forecasting and saliency detection on datasets of varying complexity. Abstract: Explainability in time series forecasting is essential for improving model transparency and supporting informed decision-making. In this work, we present CrossScaleNet, an innovative architecture that combines a patch-based cross-attention mechanism with multi-scale processing to achieve both high performance and enhanced temporal explainability. By embedding attention mechanisms into the training process, our model provides intrinsic explainability for temporal saliency, making its decision-making process more transparent. Traditional post-hoc methods for temporal saliency detection are computationally expensive, particularly when compared to feature importance detection. While ablation techniques may suffice for datasets with fewer features, identifying temporal saliency poses greater challenges due to its complexity. We validate CrossScaleNet on synthetic datasets with known saliency ground truth and on established public benchmarks, demonstrating the robustness of our method in identifying temporal saliency. Experiments on real-world datasets for the forecasting task show that our approach consistently outperforms most transformer-based models, offering better explainability without sacrificing predictive accuracy. Our evaluations demonstrate superior performance in both temporal saliency detection and forecasting accuracy. Moreover, we highlight that existing models claiming explainability often fail to maintain strong performance on standard benchmarks. CrossScaleNet addresses this gap, offering a balanced approach that captures temporal saliency effectively while delivering state-of-the-art forecasting performance across datasets of varying complexity.

[218] Multimodal Slice Interaction Network Enhanced by Transfer Learning for Precise Segmentation of Internal Gross Tumor Volume in Lung Cancer PET/CT Imaging

Yi Luo,Yike Guo,Hamed Hooshangnejad,Rui Zhang,Xue Feng,Quan Chen,Wil Ngwa,Kai Ding

Main category: cs.CV

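TL;DR: Proposes an IGTV segmentation method based on transfer learning and a multimodal interactive perception network (with MAMBA and a slice interaction module, SIM). A model pre-trained on GTV data is fine-tuned on a private IGTV dataset, which addresses weak PET signal and scarce annotations and substantially improves segmentation accuracy on the LUCID dataset (Dice raised to 0.609).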

Details

Motivation: Accurate delineation of the internal gross tumor volume (IGTV) is critical in lung cancer radiotherapy, but existing methods are limited by scarce annotated data and weak PET signal at tumor boundaries. Method: Uses a transfer learning strategy in which a multimodal interactive perception network (with MAMBA) is pre-trained on large-scale GTV data and then fine-tuned on private IGTV data. A slice interaction module (SIM) within a 2.5D framework combines channel and spatial attention with depthwise convolutions to better model inter-slice relationships. Result: On the PET/CT subset of the LUCID dataset, the method reaches a Dice coefficient of 0.609, well above the conventional baseline's 0.385. Conclusion: Combining transfer learning, multimodal fusion, and the slice interaction module effectively improves IGTV segmentation accuracy, shows clinical promise, and can make lung cancer radiotherapy planning more reliable. Abstract: Lung cancer remains the leading cause of cancer-related deaths globally. Accurate delineation of internal gross tumor volume (IGTV) in PET/CT imaging is pivotal for optimal radiation therapy in mobile tumors such as lung cancer to account for tumor motion, yet is hindered by the limited availability of annotated IGTV datasets and attenuated PET signal intensity at tumor boundaries. In this study, we present a transfer learning-based methodology utilizing a multimodal interactive perception network with MAMBA, pre-trained on extensive gross tumor volume (GTV) datasets and subsequently fine-tuned on a private IGTV cohort. This cohort constitutes the PET/CT subset of the Lung-cancer Unified Cross-modal Imaging Dataset (LUCID). To further address the challenge of weak PET intensities in IGTV peripheral slices, we introduce a slice interaction module (SIM) within a 2.5D segmentation framework to effectively model inter-slice relationships. Our proposed module integrates channel and spatial attention branches with depthwise convolutions, enabling more robust learning of slice-to-slice dependencies and thereby improving overall segmentation performance. A comprehensive experimental evaluation demonstrates that our approach achieves a Dice of 0.609 on the private IGTV dataset, substantially surpassing the conventional baseline score of 0.385. This work highlights the potential of transfer learning, coupled with advanced multimodal techniques and a SIM to enhance the reliability and clinical relevance of IGTV segmentation for lung cancer radiation therapy planning.
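For readers less familiar with the Dice score reported above, here is a minimal sketch of how it is commonly computed for a pair of binary masks: twice the overlap divided by the total number of foreground pixels in both masks. The helper name `dice_coefficient` and the choice to return 1.0 for two empty masks are assumptions for illustration, not details from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary 2D masks."""
    pred_flat = [bool(v) for row in pred for v in row]
    target_flat = [bool(v) for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p and t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)
    if total == 0:
        # Both masks empty: treated as perfect agreement (a convention, assumed here).
        return 1.0
    return 2.0 * intersection / total


predicted_mask = [[1, 1], [0, 0]]
reference_mask = [[1, 0], [0, 0]]
score = dice_coefficient(predicted_mask, reference_mask)
```

Here one pixel overlaps, the prediction has two foreground pixels and the reference has one, so the score is 2 * 1 / (2 + 1).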

[219] ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models

Yixuan Hu,Yuxuan Xue,Simon Klenk,Daniel Cremers,Gerard Pons-Moll

Main category: cs.CV

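TL;DR: Presents ControlEvents, a diffusion-based generative method that synthesizes high-quality event data guided by control signals such as text labels, 2D skeletons, and 3D body poses, significantly reducing the cost of obtaining labeled data.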

Details

Motivation: Labeling event camera data is difficult and costly, and the lack of large-scale labeled data holds back event-based vision tasks, so an efficient way to generate high-quality labeled event data is needed. Method: Leverages the diffusion prior of foundation models such as Stable Diffusion to build ControlEvents, a diffusion-based generative model that produces event data under multimodal control signals with minimal fine-tuning and limited labeled data. Result: Experiments show that the synthesized labeled event data improves model performance on visual recognition, 2D skeleton estimation, and 3D body pose estimation, and that the model can generate events for text labels unseen during training. Conclusion: ControlEvents offers an efficient, low-cost solution for event data generation, inherits the text-based generation capabilities of foundation models, and shows good generalization and application potential. Abstract: In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events based on unseen text labels during training, illustrating the powerful text-based generation capabilities inherited from foundation models.

[220] Learning KAN-based Implicit Neural Representations for Deformable Image Registration

Nikita Drozdov,Marat Zinovev,Dmitry Sorokin

Main category: cs.CV

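TL;DR: Presents KAN-IDIR and RandKAN-IDIR, the first use of Kolmogorov-Arnold Networks (KANs) in deformable image registration based on implicit neural representations (INRs); a randomized basis sampling strategy significantly lowers computational cost while preserving accuracy.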

Details

Motivation: Existing learning-based registration methods depend on large training datasets and fall short on precision for some organs and modalities, while INR methods, despite their continuous modeling advantage, suffer from inefficient instance-specific optimization and unstable training. An efficient, stable, and accurate INR registration framework is therefore needed. Method: Proposes KAN-IDIR and RandKAN-IDIR, which use a KAN as the implicit neural representation of the spatial deformation field, and introduces a randomized basis sampling strategy that reduces the number of basis functions while maintaining registration quality, improving computational efficiency and training stability. Result: On three datasets (lung CT, brain MRI, cardiac MRI), the proposed methods achieve the highest registration accuracy among all INR-based methods, with low computational overhead and superior learning stability across multiple random seeds; unexpectedly, randomized basis sampling slightly outperforms learnable basis function indices while avoiding their extra complexity. Conclusion: KAN-IDIR and RandKAN-IDIR offer an efficient, stable, and high-performing new paradigm for INR-based medical image registration, and RandKAN further simplifies training and improves practicality. Abstract: Deformable image registration (DIR) is a cornerstone of medical image analysis, enabling spatial alignment for tasks like comparative studies and multi-modal fusion. While learning-based methods (e.g., CNNs, transformers) offer fast inference, they often require large training datasets and struggle to match the precision of classical iterative approaches on some organ types and imaging modalities. Implicit neural representations (INRs) have emerged as a promising alternative, parameterizing deformations as continuous mappings from coordinates to displacement vectors. However, this comes at the cost of requiring instance-specific optimization, making computational efficiency and seed-dependent learning stability critical factors for these methods. In this work, we propose KAN-IDIR and RandKAN-IDIR, the first integration of Kolmogorov-Arnold Networks (KANs) into deformable image registration with implicit neural representations (INRs). Our proposed randomized basis sampling strategy reduces the required number of basis functions in KAN while maintaining registration quality, thereby significantly lowering computational costs. We evaluated our approach on three diverse datasets (lung CT, brain MRI, cardiac MRI) and compared it with competing instance-specific learning-based approaches, dataset-trained deep learning models, and classical registration approaches.
KAN-IDIR and RandKAN-IDIR achieved the highest accuracy among INR-based methods across all evaluated modalities and anatomies, with minimal computational overhead and superior learning stability across multiple random seeds. Additionally, we discovered that our RandKAN-IDIR model with randomized basis sampling slightly outperforms the model with learnable basis function indices, while eliminating its additional training-time complexity.

[221] Convolutional Set Transformer

Federico Chinello,Giacomo Boracchi

Main category: cs.CV

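TL;DR: Proposes the Convolutional Set Transformer (CST), which directly processes image sets of arbitrary size that are visually heterogeneous but share high-level semantics, performing feature extraction and contextual modeling at the same time. It outperforms existing methods, supports visualization techniques such as Grad-CAM, and can be used for transfer learning.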

Details

Motivation: Existing set-input networks cannot directly handle 3D image tensors and must rely on a CNN to extract features first, which separates feature extraction from set modeling and limits performance and explainability. Method: Designs the CST architecture, which operates directly on 3D image tensors and combines convolution with attention to perform feature extraction and inter-image relationship modeling jointly. Result: CST outperforms existing methods on set classification and set anomaly detection, supports Grad-CAM visualization, and demonstrates transfer learning capability; CST-15, pre-trained on ImageNet, has been open-sourced. Conclusion: By unifying feature extraction and set modeling, CST improves the performance, explainability, and generalization of image set processing and offers a new paradigm for set-input learning. Abstract: We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics - such as a common category, scene, or concept. Existing set-input networks, e.g., Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddings before the set-input network can model inter-image relationships. In contrast, CST operates directly on 3D image tensors, performing feature extraction and contextual modeling simultaneously, thereby enabling synergies between the two processes. This design yields superior performance in tasks such as Set Classification and Set Anomaly Detection and further provides native compatibility with CNN explainability methods such as Grad-CAM, unlike competing approaches that remain opaque. Finally, we show that CSTs can be pre-trained on large-scale datasets and subsequently adapted to new domains and tasks through standard Transfer Learning schemes. To support further research, we release CST-15, a CST backbone pre-trained on ImageNet (https://github.com/chinefed/convolutional-set-transformer).

[222] TY-RIST: Tactical YOLO Tricks for Real-time Infrared Small Target Detection

Abdulkarim Atrash,Omar Moured,Yufan Chen,Jiaming Zhang,Seyda Ertekin,Omur Ugur

Main category: cs.CV

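TL;DR: Proposes TY-RIST, an infrared small target detection method built on an optimized YOLOv12n. Improvements to the backbone, detection head, and attention mechanism, together with a branch pruning strategy, lower computational cost while improving detection, yielding state-of-the-art mAP, precision, and recall with real-time inference.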

Details

Motivation: Infrared small target detection faces challenges such as minimal features, cluttered backgrounds, low saliency, and high computational cost, and existing methods struggle to balance accuracy and efficiency. Method: Designs a stride-aware backbone, a high-resolution detection head, and cascaded coordinate attention blocks, uses a branch pruning strategy to cut computation, and adds the NWD loss to improve regression stability. Result: Experiments on four benchmark datasets show gains over existing methods of 7.9% in mAP@0.5, 3% in precision, and 10.2% in recall, with computational cost reduced by about 25.5% and up to 123 FPS on a single GPU; validation on a fifth dataset confirms good cross-dataset generalization. Conclusion: TY-RIST strikes a good balance between accuracy, speed, and generalization, making infrared small target detection more practical and real-time capable. Abstract: Infrared small target detection (IRSTD) is critical for defense and surveillance but remains challenging due to (1) target loss from minimal features, (2) false alarms in cluttered environments, (3) missed detections from low saliency, and (4) high computational costs. To address these issues, we propose TY-RIST, an optimized YOLOv12n architecture that integrates (1) a stride-aware backbone with fine-grained receptive fields, (2) a high-resolution detection head, (3) cascaded coordinate attention blocks, and (4) a branch pruning strategy that reduces computational cost by about 25.5% while marginally improving accuracy and enabling real-time inference. We also incorporate the Normalized Gaussian Wasserstein Distance (NWD) to enhance regression stability. Extensive experiments on four benchmarks and across 20 different models demonstrate state-of-the-art performance, improving mAP at 0.5 IoU by +7.9%, Precision by +3%, and Recall by +10.2%, while achieving up to 123 FPS on a single GPU. Cross-dataset validation on a fifth dataset further confirms strong generalization capability. Additional results and resources are available at https://www.github.com/moured/TY-RIST
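The Normalized Gaussian Wasserstein Distance mentioned above is usually defined by modeling each box (cx, cy, w, h) as a 2D Gaussian and mapping the Wasserstein distance between the two Gaussians into (0, 1] with an exponential. The sketch below follows that common formulation; how TY-RIST integrates it into training is not described in the abstract, and the normalizing constant `c` is dataset-dependent, so the default used here is only a placeholder assumption.

```python
import math


def nwd(box_a, box_b, c=12.8):
    """Normalized Gaussian Wasserstein Distance between two boxes.

    Each box is (cx, cy, w, h) and is modeled as a Gaussian with mean
    (cx, cy) and covariance diag(w^2 / 4, h^2 / 4). `c` is a
    dataset-dependent constant (placeholder default, assumed).
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Second-order Wasserstein distance between the two Gaussians.
    w2 = math.sqrt(
        (cxa - cxb) ** 2
        + (cya - cyb) ** 2
        + ((wa - wb) / 2.0) ** 2
        + ((ha - hb) / 2.0) ** 2
    )
    return math.exp(-w2 / c)


same = nwd((10.0, 10.0, 4.0, 4.0), (10.0, 10.0, 4.0, 4.0))
near = nwd((10.0, 10.0, 4.0, 4.0), (13.0, 10.0, 4.0, 4.0))
far = nwd((10.0, 10.0, 4.0, 4.0), (40.0, 10.0, 4.0, 4.0))
```

Unlike IoU, the value stays positive and varies smoothly even when two tiny boxes do not overlap at all, which is the usual argument for using it on small targets.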

[223] Learning Unified Representation of 3D Gaussian Splatting

Yuelin Xin,Yuheng Liu,Xiaohui Xie,Xinke Li

Main category: cs.CV

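TL;DR: Proposes an embedding representation of 3D Gaussian Splatting based on continuous submanifold fields, addressing the non-uniqueness and heterogeneity of the conventional parametric representation in neural network learning.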

Details

Motivation: The existing parametric representation of 3D Gaussian Splatting is non-unique and has heterogeneous features when used for neural network learning, which makes models highly data-dependent and hard to train effectively. Method: Proposes an embedding representation based on continuous submanifold fields that encapsulates the intrinsic information of Gaussian primitives, enforcing a unique mapping and channel homogeneity while preserving color and geometric structure. Result: The proposed embedding supports neural-network-based learning on 3D Gaussian Splatting more effectively, improving generalization and learning efficiency. Conclusion: The embedding offers a more principled way to integrate 3D Gaussian Splatting into neural networks and helps broaden its use in learning systems. Abstract: A well-designed vectorized representation is crucial for the learning systems natively based on 3D Gaussian Splatting. While 3DGS enables efficient and explicit 3D reconstruction, its parameter-based representation remains hard to learn as features, especially for neural-network-based models. Directly feeding raw Gaussian parameters into learning frameworks fails to address the non-unique and heterogeneous nature of the Gaussian parameterization, yielding highly data-dependent models. This challenge motivates us to explore a more principled approach to represent 3D Gaussian Splatting in neural networks that preserves the underlying color and geometric structure while enforcing unique mapping and channel homogeneity. In this paper, we propose an embedding representation of 3DGS based on continuous submanifold fields that encapsulate the intrinsic information of Gaussian primitives, thereby benefiting the learning of 3DGS.

[224] Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings

Yuanzhi Zhu,Xi Wang,Stéphane Lathuilière,Vicky Kalogeiton

Main category: cs.CV

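TL;DR: Introduces soft embeddings, which replace discrete tokens with the expected embedding under the generator's output distribution and thereby remove the obstacle that prevents gradient-based optimization of distilled one-step generators. Integrated into the Di[M]O framework (as Soft-Di[M]O), the method enables end-to-end training and supports GAN-based refinement, reward-based fine-tuning, and Test-Time Embedding Optimization (TTEO), achieving state-of-the-art one-step generation across multiple masked diffusion models.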

Details

Motivation: One-step generators distilled from masked diffusion models are efficient but have two problems: they inherit the teacher's modeling bias, and their discrete token outputs block gradient propagation, which rules out later optimization methods. Method: Introduces soft embeddings, which replace discrete tokens with the expected embedding under the generator's output distribution, giving a differentiable continuous surrogate that is compatible with the teacher model and decoder. Integrating them into Di[M]O yields Soft-Di[M]O, which supports end-to-end training and a range of optimization strategies. Result: Across multiple MDM teachers (e.g., MaskBit, MaskGen), Soft-Di[M]O achieves state-of-the-art one-step results: an FID of 1.56 on ImageNet-256 after GAN-based refinement, higher GenEval and HPS scores on text-to-image, and further gains from TTEO. Conclusion: Soft embeddings make one-step generators differentiable, removing the limitation of discrete outputs and enabling a variety of gradient-driven optimization techniques, with clear gains on class-to-image and text-to-image generation. Abstract: One-step generators distilled from Masked Diffusion Models (MDMs) compress multiple sampling steps into a single forward pass, enabling efficient text and image synthesis. However, they suffer two key limitations: they inherit modeling bias from the teacher, and their discrete token outputs block gradient flow, preventing post-distillation refinements such as adversarial training, reward-based fine-tuning, and Test-Time Embedding Optimization (TTEO). In this work, we introduce soft embeddings, a simple relaxation that replaces discrete tokens with the expected embeddings under the generator's output distribution. Soft embeddings preserve representation fidelity for one-step discrete generators while providing a fully differentiable continuous surrogate that is compatible with teacher backbones and tokenizer decoders. Integrating soft embeddings into the Di[M]O distillation framework (denoted Soft-Di[M]O) makes one-step generators end-to-end trainable and enables straightforward application of GAN-based refinement, differentiable reward fine-tuning, and TTEO. Empirically, across multiple MDM teachers (e.g., MaskBit, MaskGen), Soft-Di[M]O achieves state-of-the-art one-step results: improved class-to-image performance, a one-step FID of 1.56 on ImageNet-256 with GAN-based refinement, along with higher GenEval and HPS scores on text-to-image with reward fine-tuning, and further gains from TTEO.
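The core relaxation described above, replacing a sampled token with the expected embedding under the output distribution, can be illustrated in a few lines. This is a minimal sketch for a single token position in plain Python; the function names and the toy embedding table are hypothetical, and a real implementation would operate on batched tensors in an autodiff framework so that gradients flow through the softmax.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(logits, embedding_table):
    """Expected embedding sum_k p_k * E[k], where p = softmax(logits).

    A hard token would pick one row of the table (argmax or sampling),
    which is not differentiable; the weighted average is.
    """
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy codebook with two tokens and 2D embeddings (hypothetical values).
table = [[1.0, 0.0], [0.0, 1.0]]
uniform_case = soft_embedding([0.0, 0.0], table)
peaked_case = soft_embedding([50.0, 0.0], table)
```

With equal logits the result is the midpoint of the two embeddings, and as the distribution sharpens the soft embedding approaches the embedding of the most likely token, which is why it can stand in for the discrete output.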

Chenghan Yang,Peng Zhou,Dong-Sheng Zhang,Yueyun Wang,Hong-Bin Shen,Xiaoyong Pan

Main category: cs.CV

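TL;DR: FishAI 2.0 is a framework that combines multimodal few-shot deep learning with image generation for data augmentation to improve recognition of rare marine fish, reaching a Top-1 accuracy of 91.67% at the family level and clearly outperforming baseline models.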

Details

Motivation: Traditional marine biological image recognition suffers from incomplete datasets and insufficient model accuracy, especially under few-shot conditions for rare species, where data scarcity severely hurts performance. Method: Builds a hierarchical marine fish benchmark dataset, uses the large language model DeepSeek to generate textual descriptions, and augments images with Stable Diffusion 2 through a hierarchical diffusion strategy, constructing a visual-textual multimodal feature space that feeds a CLIP-based model for robust few-shot image recognition. Result: At the family level, FishAI 2.0 reaches a Top-1 accuracy of 91.67% and a Top-5 accuracy of 97.97%; at the genus and species levels it reaches Top-1 accuracies of 87.58% and 85.42%, respectively. It clearly outperforms baseline models, particularly on rare classes with fewer than 10 training samples. Conclusion: FishAI 2.0 improves the efficiency and accuracy of marine fish identification and provides a scalable technical solution for marine ecological monitoring and conservation, with scientific value and practical potential. Abstract: Traditional marine biological image recognition faces challenges of incomplete datasets and unsatisfactory model accuracy, particularly for few-shot conditions of rare species where data scarcity significantly hampers the performance. To address these issues, this study proposes an intelligent marine fish recognition framework, FishAI 2.0, integrating multimodal few-shot deep learning techniques with image generation for data augmentation. First, a hierarchical marine fish benchmark dataset, which provides a comprehensive data foundation for subsequent model training, is utilized to train the FishAI 2.0 model. To address the data scarcity of rare classes, the large language model DeepSeek was employed to generate high-quality textual descriptions, which are input into Stable Diffusion 2 for image augmentation through a hierarchical diffusion strategy that extracts latent encoding to construct a multimodal feature space. The enhanced visual-textual datasets were then fed into a Contrastive Language-Image Pre-Training (CLIP) based model, enabling robust few-shot image recognition. Experimental results demonstrate that FishAI 2.0 achieves a Top-1 accuracy of 91.67 percent and Top-5 accuracy of 97.97 percent at the family level, outperforming baseline CLIP and ViT models by a substantial margin for the minority classes with fewer than 10 training samples. To better apply FishAI 2.0 to real-world scenarios, at the genus and species level, FishAI 2.0 respectively achieves a Top-1 accuracy of 87.58 percent and 85.42 percent, demonstrating practical utility.
In summary, FishAI 2.0 improves the efficiency and accuracy of marine fish identification and provides a scalable technical solution for marine ecological monitoring and conservation, highlighting its scientific value and practical applicability.

[226] Brain Tumor Classification from MRI Scans via Transfer Learning and Enhanced Feature Representation

Ahta-Shamul Hoque Emran,Hafija Akter,Abdullah Al Shiam,Abu Saleh Musa Miah,Anichur Rahman,Fahmid Al Farid,Hezerul Abdul Karim

Main category: cs.CV

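TL;DR: Proposes an efficient deep learning framework for brain tumor detection that extracts features with a pre-trained ResNet50, global average pooling, and a linear projection, and adds a novel Dense-Dropout sequence to strengthen non-linear feature learning and robustness. It also introduces MMCBT, a new brain tumor MRI dataset, and uses data augmentation to resolve class imbalance.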

Details

Motivation: Early detection of brain tumors is critical for patient outcomes, but MRI data resources are limited and efficient automatic detection methods are lacking, so a reliable dataset and an efficient deep learning model are needed. Method: Uses a pre-trained ResNet50 for feature extraction, followed by Global Average Pooling (GAP) and a linear projection to obtain compact, high-level representations; proposes a Dense-Dropout sequence to improve non-linear learning and generalization; and uses the clinically verified MMCBT dataset, balanced through data augmentation. Result: Builds the MMCBT brain tumor MRI dataset from 209 subjects (16,944 images in total) and applies the proposed framework to brain tumor detection on the balanced dataset. Conclusion: The proposed framework is efficient and practical for automatic brain tumor detection, the Dense-Dropout structure improves robustness, and the MMCBT dataset is a valuable resource for future research. Abstract: Brain tumors are abnormal cell growths in the central nervous system (CNS), and their timely detection is critical for improving patient outcomes. This paper proposes an automatic and efficient deep-learning framework for brain tumor detection from magnetic resonance imaging (MRI) scans. The framework employs a pre-trained ResNet50 model for feature extraction, followed by Global Average Pooling (GAP) and linear projection to obtain compact, high-level image representations. These features are then processed by a novel Dense-Dropout sequence, a core contribution of this work, which enhances non-linear feature learning, reduces overfitting, and improves robustness through diverse feature transformations. Another major contribution is the creation of the Mymensingh Medical College Brain Tumor (MMCBT) dataset, designed to address the lack of reliable brain tumor MRI resources. The dataset comprises MRI scans from 209 subjects (ages 9 to 65), including 3671 tumor and 13273 non-tumor images, all clinically verified under expert supervision. To overcome class imbalance, the tumor class was augmented, resulting in a balanced dataset well-suited for deep learning research.

[227] Hemorica: A Comprehensive CT Scan Dataset for Automated Brain Hemorrhage Classification, Segmentation, and Detection

Kasra Davoodi,Mohammad Hoseyni,Javad Khoramdel,Reza Barati,Reihaneh Mortazavi,Amirhossein Nikoofard,Mahdi Aliyari-Shoorehdeli,Jaber Hatam Parikhan

Main category: cs.CV

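TL;DR: Introduces Hemorica, a public dataset of 372 head CT scans for AI-assisted diagnosis of intracranial hemorrhage (ICH). The dataset covers five ICH subtypes with several kinds of fine-grained annotation. Double reading with neurosurgeon adjudication ensures annotation quality, and strong baseline results confirm the dataset's quality and suitability.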

Details

Motivation: Timely diagnosis of intracranial hemorrhage is critical, but existing AI solutions are held back by fragmented public data, so a high-quality, public, finely annotated dataset is needed to advance research. Method: Collects 372 head CT scans and annotates them through a double-reading workflow with a consensus phase and neurosurgeon adjudication, providing patient-wise and slice-wise classification labels, bounding boxes, 2D pixel masks, and 3D voxel masks for five ICH subtypes. Convolutional and Transformer models are benchmarked on binary classification and segmentation. Result: Inter-rater variability is low; MobileViT-XS reaches an F1 score of 87.8% in binary classification, and a U-Net with a DenseNet161 encoder reaches a Dice score of 85.5% in binary lesion segmentation, confirming the dataset's quality and adequate sample size. Conclusion: Hemorica is a high-quality public dataset with multi-level annotations that can serve as a unified benchmark for AI systems for ICH detection and quantification, supporting multi-task learning, curriculum learning, and transfer to larger weakly labeled datasets. Abstract: Timely diagnosis of intracranial hemorrhage (ICH) on Computed Tomography (CT) scans remains a clinical priority, yet the development of robust Artificial Intelligence (AI) solutions is still hindered by fragmented public data. To close this gap, we introduce Hemorica, a publicly available collection of 372 head CT examinations acquired between 2012 and 2024. Each scan has been exhaustively annotated for five ICH subtypes: epidural (EPH), subdural (SDH), subarachnoid (SAH), intraparenchymal (IPH), and intraventricular (IVH), yielding patient-wise and slice-wise classification labels, subtype-specific bounding boxes, two-dimensional pixel masks and three-dimensional voxel masks. A double-reading workflow, preceded by a pilot consensus phase and supported by neurosurgeon adjudication, maintained low inter-rater variability. Comprehensive statistical analysis confirms the clinical realism of the dataset. To establish reference baselines, standard convolutional and transformer architectures were fine-tuned for binary slice classification and hemorrhage segmentation. With only minimal fine-tuning, lightweight models such as MobileViT-XS achieved an F1 score of 87.8% in binary classification, whereas a U-Net with a DenseNet161 encoder reached a Dice score of 85.5% for binary lesion segmentation, results that validate both the quality of the annotations and the sufficiency of the sample size.
Hemorica therefore offers a unified, fine-grained benchmark that supports multi-task and curriculum learning, facilitates transfer to larger but weakly labelled cohorts, and eases the design of AI-based assistants for ICH detection and quantification.

[228] ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View

Wenbin Teng,Gonglin Chen,Haiwei Chen,Yajie Zhao

Main category: cs.CV

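TL;DR: Presents ARSS, a new autoregressive framework for generating novel views from a single image; by combining a video tokenizer with a camera encoder, it improves generation quality while preserving the autoregressive structure.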

Details

Motivation: In world modeling tasks, the non-causal generation of diffusion models leads to inconsistencies across views, limiting their use for novel view generation from sparse inputs. Method: Uses a GPT-style decoder-only autoregressive model with a video tokenizer that discretizes image sequences and a camera encoder that turns camera trajectories into 3D positional guidance; an autoregressive Transformer module that randomly permutes the spatial order of tokens while keeping their temporal order is introduced to improve generation quality. Result: Qualitative and quantitative experiments on several public datasets show performance comparable to, or better than, state-of-the-art diffusion-based view synthesis methods. Conclusion: ARSS successfully applies autoregressive models to novel view generation, shows advantages in consistency and incremental updating, and offers a viable new option for world modeling. Abstract: Despite their exceptional generative quality, diffusion models have limited applicability to world modeling tasks, such as novel view generation from sparse inputs. This limitation arises because diffusion models generate outputs in a non-causal manner, often leading to distortions or inconsistencies across views, and making it difficult to incrementally adapt accumulated knowledge to new queries. In contrast, autoregressive (AR) models operate in a causal fashion, generating each token based on all previously generated tokens. In this work, we introduce ARSS, a novel framework that leverages a GPT-style decoder-only AR model to generate novel views from a single image, conditioned on a predefined camera trajectory. We employ a video tokenizer to map continuous image sequences into discrete tokens and propose a camera encoder that converts camera trajectories into 3D positional guidance. Then, to enhance generation quality while preserving the autoregressive structure, we propose an autoregressive transformer module that randomly permutes the spatial order of tokens while maintaining their temporal order. Extensive qualitative and quantitative experiments on public datasets demonstrate that our method performs comparably to, or better than, state-of-the-art view synthesis approaches based on diffusion models. Our code will be released upon paper acceptance.
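The token ordering idea above, shuffling tokens spatially while keeping frames in temporal order, can be sketched as follows. This is only an illustration under assumptions: frames are represented as lists of token ids in raster order, each frame gets its own independent permutation, and the function name `permute_spatial_order` is hypothetical. The abstract does not say whether the permutation is shared across frames or how the model is told which position each token came from.

```python
import random


def permute_spatial_order(frames, rng):
    """Shuffle token order inside each frame; keep the frame order intact.

    frames: list of frames, each a list of token ids in raster order.
    rng: a random.Random instance, so the shuffle is reproducible.
    Returns (permuted_frames, orders), where orders[t][k] is the original
    spatial index of the k-th token emitted for frame t.
    """
    permuted, orders = [], []
    for tokens in frames:
        order = list(range(len(tokens)))
        rng.shuffle(order)
        permuted.append([tokens[i] for i in order])
        orders.append(order)
    return permuted, orders


# Three frames of four tokens each (toy token ids).
video_tokens = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
shuffled, spatial_orders = permute_spatial_order(video_tokens, random.Random(0))
```

Every token of frame t still precedes every token of frame t+1 in the flattened sequence, so generation remains causal in time while the spatial visiting order varies.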

[229] Disentangling Static and Dynamic Information for Reducing Static Bias in Action Recognition

Masato Kobayashi,Ning Ding,Toru Tamaki

Main category: cs.CV

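TL;DR: Proposes a method that reduces static bias in action recognition models by separating temporal dynamic information from static scene information, using a statistical independence loss and a scene prediction loss.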

Details

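Motivation: Action recognition models rely too heavily on static cues rather than dynamic human motion, which hurts performance in real-world settings and in zero-shot action recognition. Method: Applies a statistical independence loss to separate the biased and unbiased streams, combined with a scene prediction loss, to reduce static bias. Result: Experiments show that the method effectively reduces static bias and confirm the importance of the scene prediction loss. Conclusion: The proposed method substantially mitigates static bias in action recognition, increasing the model's reliance on dynamic information and its generalization ability. Abstract: Action recognition models rely excessively on static cues rather than dynamic human motion, which is known as static bias. This bias leads to poor performance in real-world applications and zero-shot action recognition. In this paper, we propose a method to reduce static bias by separating temporal dynamic information from static scene information. Our approach uses a statistical independence loss between biased and unbiased streams, combined with a scene prediction loss. Our experiments demonstrate that this method effectively reduces static bias and confirm the importance of scene prediction loss.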

[230] Desensitizing for Improving Corruption Robustness in Point Cloud Classification through Adversarial Training

Zhiqiang Tian,Weigang Li,Chunhua Deng,Junwei Hu,Yongqiang Wang,Wenping Liu

Main category: cs.CV

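TL;DR: Proposes Desensitized Adversarial Training (DesenAT), which uses feature desensitization and a self-distillation framework to make point cloud DNNs less dependent on particular input features, improving performance on corrupted point clouds while preserving performance on clean data.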

Details

Motivation: Because of scene complexity, sensor inaccuracies, and processing imprecision, point cloud data is inevitably corrupted. Over-reliance on input features is the root cause of DNN vulnerability, but it is unclear whether the problem is widespread in 3D point cloud tasks. This work investigates the question and tests whether reducing a model's dependence on specific features improves its robustness. Method: Uses Shapley values to quantify DNN sensitivity to point cloud features and finds that conventionally trained models are highly sensitive to certain features. DesenAT first removes high-contribution points and applies spatial transformations to generate adversarial samples for adversarial training, then uses self-distillation to transfer knowledge from clean samples to adversarial samples, compensating for information loss and stabilizing training. Result: Experiments on the ModelNet-C and PointCloud-C benchmarks show that the method significantly improves robustness to a variety of point cloud corruptions without hurting performance on clean datasets. Conclusion: Reducing dependence on highly sensitive features effectively strengthens the robustness of point cloud DNNs; by combining sensitivity smoothing with knowledge distillation, DesenAT offers an effective way to improve the corruption robustness of 3D models. Abstract: Due to scene complexity, sensor inaccuracies, and processing imprecision, point cloud corruption is inevitable. Over-reliance on input features is the root cause of DNN vulnerabilities. It remains unclear whether this issue exists in 3D tasks involving point clouds and whether reducing dependence on these features can enhance the model's robustness to corrupted point clouds. This study attempts to answer these questions. Specifically, we quantified the sensitivity of the DNN to point cloud features using Shapley values and found that models trained using traditional methods exhibited high sensitivity values for certain features. Furthermore, under an equal pruning ratio, prioritizing the pruning of highly sensitive features causes more severe damage to model performance than random pruning. We propose "Desensitized Adversarial Training" (DesenAT), generating adversarial samples using feature desensitization and conducting training within a self-distillation framework, which aims to alleviate DNN's over-reliance on point cloud features by smoothing sensitivity. First, data points with high contribution components are eliminated, and spatial transformation is used to simulate corruption scenes, generate adversarial samples, and conduct adversarial training on the model.
Next, to compensate for information loss in adversarial samples, we use the self-distillation method to transfer knowledge from clean samples to adversarial samples, and perform adversarial training in a distillation manner. Extensive experiments on ModelNet-C and PointCloud-C show that the proposed method can effectively improve the robustness of the model without reducing its performance on clean datasets. The code is publicly available at https://github.com/JerkyT/DesenAT.
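The first step described above, eliminating the points that contribute most to the prediction, can be illustrated with a small sketch. It assumes per-point contribution scores (for example, Shapley-value estimates) have already been computed by some other routine; the function name `drop_top_contributors` and the fixed drop ratio are hypothetical, and the subsequent spatial transformation and self-distillation steps are not shown.

```python
def drop_top_contributors(points, contributions, drop_ratio):
    """Remove the fraction of points with the highest contribution scores.

    points: list of (x, y, z) tuples.
    contributions: one precomputed score per point (assumed given,
        e.g. approximate Shapley values).
    drop_ratio: fraction of points to remove, between 0 and 1.
    """
    if len(points) != len(contributions):
        raise ValueError("need exactly one contribution score per point")
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must be in [0, 1]")
    n_drop = int(len(points) * drop_ratio)
    # Indices sorted from most to least influential.
    ranked = sorted(
        range(len(points)), key=lambda i: contributions[i], reverse=True
    )
    dropped = set(ranked[:n_drop])
    return [p for i, p in enumerate(points) if i not in dropped]


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
scores = [0.1, 0.9, 0.3, 0.8]
desensitized = drop_top_contributors(cloud, scores, 0.5)
```

Training on such samples forces the classifier to make its decision without the points it would otherwise lean on most, which is the intuition behind smoothing sensitivity.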

[231] Geometry-Aware Losses for Structure-Preserving Text-to-Sign Language Generation

Zetian Wu,Tianshuo Zhou,Stefan Lee,Liang Huang

Main category: cs.CV

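TL;DR: Proposes a new method that models geometric constraints and motion dynamics among skeletal joints, substantially improving the naturalness and anatomical plausibility of poses in text-to-sign-language video generation.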

Details

Motivation: Existing sign language translation methods often ignore the anatomical constraints and coordination patterns of human skeletal motion, producing rigid or biomechanically implausible movements. Method: Introduces geometric constraints between joints (such as bone lengths and positional relationships), a parent-relative reweighting mechanism to improve finger flexibility, and bone-pose losses with bone-length constraints to enforce anatomical consistency. Result: Narrows the performance gap between the previous best method and the ground-truth oracle by 56.51%, and reduces discrepancies in bone length and movement variance by 18.76% and 5.48%, respectively. Conclusion: The method markedly improves the naturalness and anatomical plausibility of body poses in generated sign language videos, closing much of the gap to real motion. Abstract: Sign language translation from text to video plays a crucial role in enabling effective communication for Deaf and hard-of-hearing individuals. A major challenge lies in generating accurate and natural body poses and movements that faithfully convey intended meanings. Prior methods often neglect the anatomical constraints and coordination patterns of human skeletal motion, resulting in rigid or biomechanically implausible outputs. To address this, we propose a novel approach that explicitly models the relationships among skeletal joints (including shoulders, arms, and hands) by incorporating geometric constraints on joint positions, bone lengths, and movement dynamics. During training, we introduce a parent-relative reweighting mechanism to enhance finger flexibility and reduce motion stiffness. Additionally, bone-pose losses and bone-length constraints enforce anatomically consistent structures. Our method narrows the performance gap between the previous best and the ground-truth oracle by 56.51%, and further reduces discrepancies in bone length and movement variance by 18.76% and 5.48%, respectively, demonstrating significant gains in anatomical realism and motion naturalness.
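A bone-length constraint of the kind described above can be sketched as a penalty on the difference between predicted and reference bone lengths, where a bone is the segment between a joint and its parent in the skeleton tree. The exact loss form used in the paper is not given in the abstract; the mean absolute difference, the function names, and the toy three-joint chain below are all assumptions for illustration.

```python
import math


def bone_lengths(joints, parents):
    """Length of each bone, i.e. the distance from a joint to its parent.

    joints: list of coordinate tuples, one per joint.
    parents: parents[i] is the index of joint i's parent, or -1 for the root.
    """
    return [
        math.dist(joints[child], joints[parent])
        for child, parent in enumerate(parents)
        if parent >= 0
    ]


def bone_length_loss(pred_joints, ref_joints, parents):
    """Mean absolute difference between predicted and reference bone lengths."""
    pred = bone_lengths(pred_joints, parents)
    ref = bone_lengths(ref_joints, parents)
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


# Toy chain: shoulder -> elbow -> wrist (2D coordinates, hypothetical).
skeleton_parents = [-1, 0, 1]
reference_pose = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
predicted_pose = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
loss = bone_length_loss(predicted_pose, reference_pose, skeleton_parents)
```

In the toy pose the upper arm is stretched from length 1 to length 2 while the forearm is unchanged, so the penalty averages to 0.5; a pose that only rotates the joints would incur no penalty, which is what makes the term a structural rather than positional constraint.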

[232] Planning with Unified Multimodal Models

Yihao Sun,Zhilong Zhang,Yang Yu,Pierre-Luc Bacon

Main category: cs.CV

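TL;DR: Presents Uni-Plan, a planning framework built on unified multimodal models (UMMs) that reasons through generated visual content; it substantially outperforms methods based on vision-language models on long-horizon planning tasks and shows good data scalability.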

Details

Motivation: Existing decision-making methods based on large language models and vision-language models rely mainly on language-based reasoning, which limits their reasoning and decision-making ability. Unified multimodal models (UMMs) support multimodal inputs and outputs and may strengthen reasoning through generated visual content, so the authors explore their potential for planning. Method: Proposes the Uni-Plan framework, in which a single UMM serves as policy, dynamics model, and value function at once; a self-discriminated filtering mechanism uses the generative model itself as a discriminator to filter out invalid dynamics predictions and reduce hallucination. Result: On long-horizon planning tasks, Uni-Plan substantially raises success rates over VLM-based methods; it needs no expert demonstrations and shows stronger data scalability and better performance under the same training-data size. Conclusion: Uni-Plan demonstrates the potential of unified multimodal models for reasoning and decision-making and lays a foundation for future UMM-based research. Abstract: With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach, self-discriminated filtering, where the generative model serves as a self-discriminator to filter out invalid dynamics predictions. Experiments on long-horizon planning tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also showing strong data scalability, requiring no expert demonstrations and achieving better performance under the same training-data size. This work lays a foundation for future research in reasoning and decision-making with UMMs.

[233] Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy

Xiafeng Man,Zhipeng Wei,Jingjing Chen

Main category: cs.CV

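TL;DR: This paper proposes DPM, a copyright infringement detection framework grounded in differential privacy. It simulates the inclusion and exclusion of a data point by fine-tuning the model in the learning and unlearning directions, and combines a conditional sensitivity metric with statistical measures to identify copyright-infringing content in diffusion models without access to the original training data or text prompts. The authors also build the CIDD dataset for standardized benchmarking.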

Details

Motivation: Existing copyright infringement detection methods lack robustness and theoretical grounding, and large vision models can reproduce copyrighted content without authorization, so an interpretable, theoretically rigorous, and practical detection scheme is needed. Method: The authors introduce a conditional sensitivity metric and propose the D-Plus-Minus (DPM) framework, which fine-tunes the model in two opposing directions (learning and unlearning) to simulate the inclusion and exclusion of a training data point. Confidence scores are computed with statistical metrics over orthogonal prompt distributions to separate concept-specific influence from global parameter shifts. Result: DPM reliably detects infringing content without access to the original training data or text prompts. The newly built Copyright Infringement Detection Dataset (CIDD) supports evaluation across categories, and experiments show good detection performance and interpretability. Conclusion: DPM offers an interpretable, practical, and theoretically grounded post-hoc detection solution for protecting intellectual property in the era of generative AI, and advances the standardization and reliability of copyright infringement detection. Abstract: The widespread deployment of large vision models such as Stable Diffusion raises significant legal and ethical concerns, as these models can memorize and reproduce copyrighted content without authorization. Existing detection approaches often lack robustness and fail to provide rigorous theoretical underpinnings. To address these gaps, we formalize the concept of copyright infringement and its detection from the perspective of Differential Privacy (DP), and introduce the conditional sensitivity metric, a concept analogous to sensitivity in DP, that quantifies the deviation in a diffusion model's output caused by the inclusion or exclusion of a specific training data point. To operationalize this metric, we propose D-Plus-Minus (DPM), a novel post-hoc detection framework that identifies copyright infringement in text-to-image diffusion models. Specifically, DPM simulates inclusion and exclusion processes by fine-tuning models in two opposing directions: learning or unlearning. Besides, to disentangle concept-specific influence from the global parameter shifts induced by fine-tuning, DPM computes confidence scores over orthogonal prompt distributions using statistical metrics. Moreover, to facilitate standardized benchmarking, we also construct the Copyright Infringement Detection Dataset (CIDD), a comprehensive resource for evaluating detection across diverse categories. Our results demonstrate that DPM reliably detects infringing content without requiring access to the original training dataset or text prompts, offering an interpretable and practical solution for safeguarding intellectual property in the era of generative AI.

[234] Perceptual Influence: Improving the Perceptual Loss Design for Low-Dose CT Enhancement

Gabriel A. Viana,Luis F. Alves Pereira,Tsang Ing Ren,George D. C. Cavalcanti,Jan Sijbers

Main category: cs.CV

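TL;DR: This paper proposes a systematic framework for assessing how perceptual loss design affects low-dose CT image enhancement, introduces the "perceptual influence" metric, and shows experimentally that a better-designed perceptual loss significantly improves denoising and structural fidelity.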

Details

Motivation: Conventional pixel-wise losses such as MSE tend to over-smooth low-dose CT reconstructions and lose clinical detail, while existing perceptual loss designs lack systematic guidance and leave key design choices underexplored. Method: The authors introduce the perceptual influence metric to quantify the contribution of the perceptual loss term, build a principled framework to assess design factors such as the feature representation level, the encoder's pretraining dataset, and the perceptual weight, and compare configurations through systematic experiments. Result: Perceptual loss configurations commonly used in the literature underperform; better-designed alternatives significantly improve noise reduction and structural fidelity without any change to the network architecture. Conclusion: Perceptual loss design matters: a well-chosen configuration improves LDCT image quality, and the paper provides objective guidelines backed by statistical analysis. Abstract: Perceptual losses have emerged as powerful tools for training networks to enhance Low-Dose Computed Tomography (LDCT) images, offering an alternative to traditional pixel-wise losses such as Mean Squared Error, which often lead to over-smoothed reconstructions and loss of clinically relevant details in LDCT images. The perceptual losses operate in a latent feature space defined by a pretrained encoder and aim to preserve semantic content by comparing high-level features rather than raw pixel values. However, the design of perceptual losses involves critical yet underexplored decisions, including the feature representation level, the dataset used to pretrain the encoder, and the relative importance assigned to the perceptual component during optimization. In this work, we introduce the concept of perceptual influence (a metric that quantifies the relative contribution of the perceptual loss term to the total loss) and propose a principled framework to assess the impact of the loss design choices on the model training performance. Through systematic experimentation, we show that the widely used configurations in the literature to set up a perceptual loss underperform compared to better-designed alternatives. Our findings show that better perceptual loss designs lead to significant improvements in noise reduction and structural fidelity of reconstructed CT images, without requiring any changes to the network architecture. We also provide objective guidelines, supported by statistical analysis, to inform the effective use of perceptual losses in LDCT denoising. Our source code is available at https://github.com/vngabriel/perceptual-influence.
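The abstract describes perceptual influence as the relative contribution of the perceptual term to the total loss. A minimal sketch follows, assuming the total loss has the common form pixel loss plus weight times perceptual loss; that form, the function names, and the sample numbers are assumptions, and the paper's exact definition may differ.

```python
def perceptual_influence(pixel_loss, perceptual_loss, weight):
    """Share of the total loss contributed by the weighted perceptual term,
    assuming total = pixel_loss + weight * perceptual_loss."""
    weighted = weight * perceptual_loss
    total = pixel_loss + weighted
    return weighted / total if total > 0 else 0.0

def weight_for_influence(pixel_loss, perceptual_loss, target):
    """Weight that yields a target influence in (0, 1) under the same assumption.
    Solves weight * P / (M + weight * P) = target for weight."""
    return target * pixel_loss / ((1.0 - target) * perceptual_loss)

# Hypothetical loss magnitudes: the raw perceptual loss is 200x the pixel loss.
mse, perc = 0.02, 4.0
print(perceptual_influence(mse, perc, weight=1.0))
print(weight_for_influence(mse, perc, target=0.5))
```

The inverse function shows why a fixed weight copied from the literature can be misleading: the same weight gives very different influence when the two raw losses sit on different scales.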

Tomohiro Tanaka,Narumasa Tsutsumida

Main category: cs.CV

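TL;DR: Proposes a sensor-flexible flood detection method based on Presto, a lightweight multi-modal pre-trained transformer, that maps floods from SAR data, multispectral data, or both, and is efficient, robust, and practical.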

Details

Motivation: Existing flood monitoring methods are limited either by the data availability of a single sensor or by the heavy compute and large labeled datasets that multi-sensor fusion requires, so they struggle to deliver flood extent quickly during disaster response. Method: The authors fine-tune Presto, a lightweight (about 0.4M parameters) multi-modal pre-trained transformer, for pixel-level flood detection on SAR and multispectral data, supporting SAR-only, MS-only, and SAR+MS inputs. Result: On Sen1Floods11, the method reaches an F1 score of 0.896 and an mIoU of 0.886 with sensor fusion, outperforming the Prithvi-100M baseline; it reaches an F1 of 0.893 with MS-only input and 0.718 with SAR-only input, showing good robustness. Conclusion: The sensor-flexible, parameter-efficient model runs stably under different data availability conditions and offers a fast, reliable, and easily deployed flood mapping solution for real disaster response. Abstract: Floods are increasingly frequent natural disasters causing extensive human and economic damage, highlighting the critical need for rapid and accurate flood inundation mapping. While remote sensing technologies have advanced flood monitoring capabilities, operational challenges persist: single-sensor approaches face weather-dependent data availability and limited revisit periods, while multi-sensor fusion methods require substantial computational resources and large-scale labeled datasets. To address these limitations, this study introduces a novel sensor-flexible flood detection methodology by fine-tuning Presto, a lightweight ($\sim$0.4M parameters) multi-modal pre-trained transformer that processes both Synthetic Aperture Radar (SAR) and multispectral (MS) data at the pixel level. Our approach uniquely enables flood mapping using SAR-only, MS-only, or combined SAR+MS inputs through a single model architecture, addressing the critical operational need for rapid response with whatever sensor data becomes available first during disasters. We evaluated our method on the Sen1Floods11 dataset against the large-scale Prithvi-100M baseline ($\sim$100M parameters) across three realistic data availability scenarios. The proposed model achieved superior performance with an F1 score of 0.896 and mIoU of 0.886 in the optimal sensor-fusion scenario, outperforming the established baseline. Crucially, the model demonstrated robustness by maintaining effective performance in MS-only scenarios (F1: 0.893) and functional capabilities in challenging SAR-only conditions (F1: 0.718), confirming the advantage of multi-modal pre-training for operational flood mapping. Our parameter-efficient, sensor-flexible approach offers an accessible and robust solution for real-world disaster scenarios requiring immediate flood extent assessment regardless of sensor availability constraints.
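For readers unfamiliar with the two reported metrics, the sketch below computes F1 and mean IoU for a binary flood mask. It assumes mIoU is averaged over the flood and non-flood classes and uses a made-up eight-pixel example; the paper's evaluation protocol (for instance, how invalid pixels are handled) is not described in the abstract and is not reproduced here.

```python
def confusion(pred, truth):
    """Count true/false positives and negatives for flat binary masks (1 = flood)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn

def f1_score(pred, truth):
    """F1 for the flood class: 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mean_iou(pred, truth):
    """Average of flood-class IoU and non-flood-class IoU (assumed definition)."""
    tp, fp, fn, tn = confusion(pred, truth)
    iou_flood = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    iou_dry = tn / (tn + fp + fn) if (tn + fp + fn) else 0.0
    return (iou_flood + iou_dry) / 2

# Hypothetical flattened masks for eight pixels.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(f1_score(pred, truth), mean_iou(pred, truth))
```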

[236] GeLoc3r: Enhancing Relative Camera Pose Regression with Geometric Consistency Regularization

Jingxing Li,Yongjae Lee,Deliang Fan

Main category: cs.CV

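TL;DR: GeLoc3r is a new relative camera pose estimation method based on Geometric Consistency Regularization (GCR). During training it uses ground-truth depth and weighted RANSAC to build a consistency loss that teaches the regression network geometric knowledge; at test time it uses only the regression head for fast inference, combining the speed of ReLoc3R with accuracy approaching that of MASt3R.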

Details

Motivation: The prior method ReLoc3R is fast and accurate, but geometric inconsistencies in its internal representations limit further gains in precision, while correspondence-based methods such as MASt3R are accurate but slow. A method is needed that keeps fast inference while reaching high accuracy. Method: Geometric Consistency Regularization (GCR) uses ground-truth depth during training to generate dense 3D-2D correspondences, learns correspondence weights with a FusionTransformer, and computes a geometrically consistent pose via weighted RANSAC; the resulting consistency loss strengthens the regression network. At test time only the regression head is used, with no geometric solving. Result: GeLoc3r outperforms ReLoc3R on the challenging CO3Dv2, RealEstate10K, and MegaDepth1500 benchmarks, for example reaching 40.45% AUC@5° on CO3Dv2 (a 16% relative improvement), 68.66% on RealEstate10K, and 50.45% on MegaDepth1500, while keeping 25ms inference. Conclusion: By introducing geometric consistency regularization during training, GeLoc3r unifies the speed of regression methods with the accuracy of correspondence-based methods, representing a new paradigm for how neural networks learn camera geometry. Abstract: The prior method ReLoc3R achieves breakthrough performance with fast 25ms inference and state-of-the-art regression accuracy, yet our analysis reveals subtle geometric inconsistencies in its internal representations that prevent reaching the precision ceiling of correspondence-based methods like MASt3R (which require 300ms per pair). In this work, we present GeLoc3r, a novel approach to relative camera pose estimation that enhances pose regression methods through Geometric Consistency Regularization (GCR). GeLoc3r overcomes the speed-accuracy dilemma by training regression networks to produce geometrically consistent poses without inference-time geometric computation. During training, GeLoc3r leverages ground-truth depth to generate dense 3D-2D correspondences, weights them using a FusionTransformer that learns correspondence importance, and computes geometrically-consistent poses via weighted RANSAC. This creates a consistency loss that transfers geometric knowledge into the regression network. Unlike the FAR method, which requires both regression and geometric solving at inference, GeLoc3r only uses the enhanced regression head at test time, maintaining ReLoc3R's fast speed and approaching MASt3R's high accuracy. On challenging benchmarks, GeLoc3r consistently outperforms ReLoc3R, achieving significant improvements including 40.45% vs. 34.85% AUC@5° on the CO3Dv2 dataset (16% relative improvement), 68.66% vs. 66.70% AUC@5° on RealEstate10K, and 50.45% vs. 49.60% on MegaDepth1500. By teaching geometric consistency during training rather than enforcing it at inference, GeLoc3r represents a paradigm shift in how neural networks learn camera geometry, achieving both the speed of regression and the geometric understanding of correspondence methods.
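The abstract mixes absolute scores with one relative figure, so a short worked calculation helps keep them apart. The sketch below uses only the score pairs quoted in the abstract; the helper name `relative_improvement` is ours.

```python
def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# (GeLoc3r, ReLoc3R) scores as quoted in the abstract.
results = {
    "CO3Dv2": (40.45, 34.85),
    "RealEstate10K": (68.66, 66.70),
    "MegaDepth1500": (50.45, 49.60),
}

for name, (new, old) in results.items():
    gain = new - old
    rel = relative_improvement(new, old)
    print(f"{name}: +{gain:.2f} points absolute, {rel:.1f}% relative")
```

On CO3Dv2 the absolute gain is 5.60 points, which corresponds to the 16% relative improvement the abstract reports; the gains on the other two benchmarks are much smaller in both absolute and relative terms.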

Ye-eun Kim,Suhyeon Lim,Andrew J. Choi

Main category: cs.CV

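TL;DR: To address action recognition for remote rehabilitation monitoring of stroke patients, this study proposes a multimodal deep learning system based on IMU sensors and an RGB-D camera, designed to recognize upper-limb activities of daily living (ADL) in the home.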

Details

Motivation: Existing action recognition techniques target healthy individuals and do not transfer well to the complex movement characteristics of stroke patients. At the same time, demand for home rehabilitation of stroke patients is rising while therapy resources are scarce, so automated remote monitoring is needed. Method: The authors design and build a multimodal action monitoring system that combines IMU sensors with an RGB-D camera, collect real action data from stroke patients, propose a deep learning model for this multimodal data (MMeViT), and apply preprocessing and clustering analysis to the data. Result: Action data from stroke patients clusters less well than data from healthy individuals, indicating greater movement variability. The proposed model learns label-consistent features from this hard-to-cluster data and shows promise on complex ADL recognition. Conclusion: A deep learning model designed around the movement characteristics of stroke patients can serve action recognition and could later be extended to home rehabilitation assessment and feedback, supporting personalized remote rehabilitation. Abstract: Rehabilitation therapy for stroke patients faces a supply shortage despite the increasing demand. To address this issue, remote monitoring systems that reduce the burden on medical staff are emerging as a viable alternative. A key component of these remote monitoring systems is Human Action Recognition (HAR) technology, which classifies actions. However, existing HAR studies have primarily focused on non-disabled individuals, making them unsuitable for recognizing the actions of stroke patients. HAR research for stroke has largely concentrated on classifying relatively simple actions using machine learning rather than deep learning. In this study, we designed a system to monitor the actions of stroke patients, focusing on domiciliary upper limb Activities of Daily Living (ADL). Our system utilizes IMU (Inertial Measurement Unit) sensors and an RGB-D camera, which are the most common modalities in HAR. We directly collected a dataset through this system, investigated appropriate preprocessing, and proposed a deep learning model suitable for processing multimodal data. We analyzed the collected dataset and found that the action data of stroke patients clusters less well than that of non-disabled individuals. Simultaneously, we found that the proposed model learns similar tendencies for each label in data with features that are difficult to cluster. This study suggests the possibility of expanding the deep learning model, which has learned the action features of stroke patients, to not only simple action recognition but also feedback such as assessment contributing to domiciliary rehabilitation in future research. The code presented in this study is available at https://github.com/ye-Kim/MMeViT.

[238] Activation Matching for Explanation Generation

Pirzada Suhail,Aditya Anand,Amit Sethi

Main category: cs.CV

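TL;DR: Proposes an activation-matching approach that trains a lightweight autoencoder to generate minimal, faithful explanations of image classification decisions.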

Details

Motivation: To make the decisions of a pretrained classifier interpretable while keeping the explanations concise and faithful. Method: A lightweight autoencoder is trained to output a binary mask that keeps the key regions of the input image; the objective combines multi-layer activation matching, mask priors, and abductive constraints. Result: The resulting small, crisp binary masks preserve the model's prediction and intermediate activation distributions while discarding irrelevant regions, improving the readability and trustworthiness of the explanations. Conclusion: The method produces concise, faithful, human-interpretable visual explanations and applies to any pretrained image classifier. Abstract: In this paper we introduce an activation-matching-based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image $x$ and a frozen model $f$, we train a lightweight autoencoder to output a binary mask $m$ such that the explanation $e = m \odot x$ preserves both the model's prediction and the intermediate activations of $x$. Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors: L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision making of the underlying model.
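The three mask priors named in item (ii) of the abstract can be sketched directly. The abstract names the priors but not their formulas, so the specific forms below (mean absolute value for area, m(1 - m) for binarization, anisotropic total variation) and the weights in `mask_prior` are assumptions.

```python
def l1_area(mask):
    """Mean absolute mask value: smaller means a more minimal explanation."""
    cells = [m for row in mask for m in row]
    return sum(abs(m) for m in cells) / len(cells)

def binarization_penalty(mask):
    """Mean of m * (1 - m): zero only when every value is exactly 0 or 1."""
    cells = [m for row in mask for m in row]
    return sum(m * (1.0 - m) for m in cells) / len(cells)

def total_variation(mask):
    """Sum of absolute differences between horizontal and vertical neighbours."""
    h, w = len(mask), len(mask[0])
    horiz = sum(abs(mask[i][j + 1] - mask[i][j]) for i in range(h) for j in range(w - 1))
    vert = sum(abs(mask[i + 1][j] - mask[i][j]) for i in range(h - 1) for j in range(w))
    return horiz + vert

def mask_prior(mask, w_area=1.0, w_bin=1.0, w_tv=0.1):
    """Weighted sum of the three priors (weights are hypothetical)."""
    return (w_area * l1_area(mask)
            + w_bin * binarization_penalty(mask)
            + w_tv * total_variation(mask))

crisp = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]  # compact binary blob
soft = [[0.5] * 3 for _ in range(3)]       # uniformly undecided mask
print(mask_prior(crisp), mask_prior(soft))
```

The two toy masks show how the terms pull in different directions: the soft mask has zero total variation but a large binarization penalty, while the crisp blob pays only for its area and boundary.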

[239] Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis

Ruilang Wang,Shuotong Xu,Bowen Liu,Runlin Huang,Donglong Chen,Weifeng Su

Main category: cs.CV

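TL;DR: Proposes Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis that uses vision-language models for prompt-driven region localization and significantly improves semantic alignment and cross-task generalization.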

Details

Motivation: Existing masked image modeling methods are inefficient and semantically poorly aligned under high-ratio random masking, and region-aware variants depend on reconstruction heuristics or supervised signals, which limits their adaptability across tasks and modalities. Method: Vision-language models provide prompt-based region localization, and a differentiated masking strategy emphasizes diagnostically relevant regions while reducing background redundancy, giving controllable self-supervised learning. Result: The method performs well across several medical imaging modalities, including brain MRI, chest CT, and lung X-ray. Compared with existing MIM methods such as SparK, it improves classification accuracy by up to 3.1 percentage points, BoxAP by 1.3, and MaskAP by 1.1, with a substantially lower overall masking ratio (for example 40% vs. 70%). Conclusion: Controllable text-driven masking enables semantically aligned self-supervised learning and advances robust vision models for medical image analysis. Abstract: The scarcity of annotated data in specialized domains such as medical imaging presents significant challenges to training robust vision models. While self-supervised masked image modeling (MIM) offers a promising solution, existing approaches largely rely on random high-ratio masking, leading to inefficiency and poor semantic alignment. Moreover, region-aware variants typically depend on reconstruction heuristics or supervised signals, limiting their adaptability across tasks and modalities. We propose Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis. By leveraging vision-language models for prompt-based region localization, our method flexibly applies differentiated masking to emphasize diagnostically relevant regions while reducing redundancy in background areas. This controllable design enables better semantic alignment, improved representation learning, and stronger cross-task generalizability. Comprehensive evaluation across multiple medical imaging modalities, including brain MRI, chest CT, and lung X-ray, shows that Mask What Matters consistently outperforms existing MIM methods (e.g., SparK), achieving gains of up to +3.1 percentage points in classification accuracy, +1.3 in box average precision (BoxAP), and +1.1 in mask average precision (MaskAP) for detection. Notably, it achieves these improvements with substantially lower overall masking ratios (e.g., 40% vs. 70%). This work demonstrates that controllable, text-driven masking can enable semantically aligned self-supervised learning, advancing the development of robust vision models for medical image analysis.
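A rough sketch of differentiated masking follows. It assumes that "emphasizing" relevant regions means masking them at a higher ratio than the background, and that relevance arrives as a per-patch boolean list (here hard-coded, where the paper would obtain it from a vision-language model). The function name, ratios, and patch counts are hypothetical.

```python
import random

def differentiated_mask(relevant, ratio_relevant=0.8, ratio_background=0.3, seed=0):
    """Return a per-patch boolean mask (True = masked), sampling relevant and
    background patches at different ratios."""
    rng = random.Random(seed)
    idx_rel = [i for i, r in enumerate(relevant) if r]
    idx_bg = [i for i, r in enumerate(relevant) if not r]
    masked = set(rng.sample(idx_rel, round(ratio_relevant * len(idx_rel))))
    masked |= set(rng.sample(idx_bg, round(ratio_background * len(idx_bg))))
    return [i in masked for i in range(len(relevant))]

# Hypothetical image of 100 patches, of which the first 20 are flagged relevant.
relevant = [i < 20 for i in range(100)]
mask = differentiated_mask(relevant)
overall_ratio = sum(mask) / len(mask)
print(overall_ratio)
```

With these made-up numbers, 16 of 20 relevant patches and 24 of 80 background patches are masked, so the overall ratio is well below a uniform high-ratio scheme while the relevant region is still masked heavily.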

[240] FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection

Ben Liang,Yuan Liu,Bingwen Qiu,Yihong Wang,Xiubao Sui,Qian Chen

Main category: cs.CV

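TL;DR: This paper proposes FMC-DETR, a new framework for tiny-object detection in aerial imagery. A frequency-decoupled fusion mechanism strengthens global low-frequency context in shallow features while preserving detail, and together with a lightweight Cross-stage Partial Fusion module and a Multi-Domain Feature Coordination module it significantly improves detection, reaching state-of-the-art results on VisDrone.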

Details

Motivation: Tiny-object detection in aerial imagery is hard because visual cues are limited and global context is difficult to model in complex scenes. Existing methods suffer from delayed contextual fusion and inadequate non-linear modeling, so they cannot use global information effectively to refine shallow features and hit a performance bottleneck. Method: FMC-DETR includes the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which applies cascaded wavelet transforms to enhance low-frequency context perception and uses Kolmogorov-Arnold networks for adaptive non-linear modeling of multi-scale dependencies; a lightweight Cross-stage Partial Fusion (CPF) module that reduces redundancy and improves multi-scale feature interaction; and a Multi-Domain Feature Coordination (MDFC) module that unifies spatial, frequency, and structural priors to balance detail preservation and global enhancement. Result: In extensive experiments on several aerial-view benchmarks, FMC-DETR achieves state-of-the-art performance with fewer parameters. On VisDrone it improves AP by 6.5% and AP50 by 8.2% over the baseline, markedly improving tiny-object detection. Conclusion: The frequency-decoupled fusion strategy addresses insufficient context use and weak non-linear modeling in aerial tiny-object detection and offers a new technical route for remote sensing image analysis. Abstract: Aerial-view object detection is a critical technology for real-world applications such as natural resource monitoring, traffic management, and UAV-based search and rescue. Detecting tiny objects in high-resolution aerial imagery presents a long-standing challenge due to their limited visual cues and the difficulty of modeling global context in complex scenes. Existing methods are often hampered by delayed contextual fusion and inadequate non-linear modeling, failing to effectively use global information to refine shallow features and thus encountering a performance bottleneck. To address these challenges, we propose FMC-DETR, a novel framework with frequency-decoupled fusion for aerial-view object detection. First, we introduce the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which applies cascaded wavelet transforms to enhance global low-frequency context perception in shallow features while preserving fine-grained details, and employs Kolmogorov-Arnold networks to achieve adaptive non-linear modeling of multi-scale dependencies. Next, a lightweight Cross-stage Partial Fusion (CPF) module reduces redundancy and improves multi-scale feature interaction. Finally, we introduce the Multi-Domain Feature Coordination (MDFC) module, which unifies spatial, frequency, and structural priors to balance detail preservation and global enhancement. Extensive experiments on benchmark aerial-view datasets demonstrate that FMC-DETR achieves state-of-the-art performance with fewer parameters. On the challenging VisDrone dataset, our model achieves improvements of 6.5% AP and 8.2% AP50 over the baseline, highlighting its effectiveness in tiny object detection. The code can be accessed at https://github.com/bloomingvision/FMC-DETR.
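To make "cascaded wavelet transforms" concrete, the sketch below applies a one-dimensional orthonormal Haar transform repeatedly to the low-frequency band. The abstract does not say which wavelet WeKat uses or how the bands feed the network, so the Haar choice, the 1-D setting, and the function names are assumptions for illustration only.

```python
import math

def haar_step(signal):
    """One orthonormal Haar level: returns (low, high) bands of half the length.
    The input length must be even."""
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    return low, high

def haar_inverse(low, high):
    """Exact inverse of haar_step."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out += [(a + d) / s, (a - d) / s]
    return out

def cascade(signal, levels):
    """Apply haar_step repeatedly to the low band; keep each level's high band."""
    low, highs = list(signal), []
    for _ in range(levels):
        low, high = haar_step(low)
        highs.append(high)
    return low, highs

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
low, highs = cascade(signal, levels=2)
print(low, highs)
```

Each cascade level halves the resolution of the low band, so it summarizes an increasingly wide context, while the stored high bands keep the fine detail needed to reconstruct the input exactly.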

[241] Follow-Your-Preference: Towards Preference-Aligned Image Inpainting

Yutao Shen,Junkun Yuan,Toru Aonishi,Hideki Nakayama,Yue Ma

Main category: cs.CV

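TL;DR: This paper studies preference alignment for image inpainting. It builds preference training data with direct preference optimization and public reward models and runs experiments across multiple benchmarks and models. Most reward models give valid reward scores but show biases in brightness, composition, and color scheme that can lead to reward hacking. A simple ensemble of the models mitigates these biases and significantly improves the aligned models without changing model structures or using new datasets.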

Details

Motivation: To address fundamental problems of preference alignment in image inpainting and to examine how effective and how limited existing reward models are for building preference data. Method: Alignment training uses direct preference optimization (DPO) with preference data generated by public reward models. Experiments cover nine reward models, two benchmarks, and two baseline models, and analyze candidate scaling, sample scaling, and the effect of reward model biases. Result: Most reward models provide valid reward scores; preference data scales well; reward models show systematic biases in brightness, composition, and color; a simple ensemble mitigates these biases and improves performance. Conclusion: Without changing model structures or adding new datasets, alignment based on existing reward models and an ensemble strategy significantly improves preference alignment for image inpainting and gives the field a simple yet solid baseline. Abstract: This paper investigates image inpainting with preference alignment. Instead of introducing a novel method, we go back to basics and revisit fundamental problems in achieving such alignment. We leverage the prominent direct preference optimization approach for alignment training and employ public reward models to construct preference training datasets. Experiments are conducted across nine reward models, two benchmarks, and two baseline models with varying structures and generative algorithms. Our key findings are as follows: (1) Most reward models deliver valid reward scores for constructing preference data, even if some of them are not reliable evaluators. (2) Preference data demonstrates robust trends in both candidate scaling and sample scaling across models and benchmarks. (3) Observable biases in reward models, particularly in brightness, composition, and color scheme, make them prone to causing reward hacking. (4) A simple ensemble of these models yields robust and generalizable results by mitigating such biases. Built upon these observations, our alignment models significantly outperform prior models across standard metrics, GPT-4 assessments, and human evaluations, without any changes to model structures or the use of new datasets. We hope our work can set a simple yet solid baseline, pushing this promising frontier. Our code is open-sourced at: https://github.com/shenytzzz/Follow-Your-Preference.
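Finding (4) can be illustrated with a small sketch of a reward ensemble used to pick a preference pair. The abstract does not say how the ensemble is formed, so the z-score normalization, the plain average, the best-versus-worst pairing, and the three made-up score lists are all assumptions.

```python
import statistics

def zscores(scores):
    """Standardize one reward model's scores so models on different scales are comparable."""
    mu = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mu) / sd if sd else 0.0 for s in scores]

def ensemble_scores(per_model_scores):
    """Average the standardized scores of all reward models for each candidate."""
    normalized = [zscores(s) for s in per_model_scores]
    return [statistics.fmean(col) for col in zip(*normalized)]

def preference_pair(per_model_scores):
    """Indices of the (preferred, rejected) candidates under the ensemble."""
    combined = ensemble_scores(per_model_scores)
    best = max(range(len(combined)), key=combined.__getitem__)
    worst = min(range(len(combined)), key=combined.__getitem__)
    return best, worst

# Hypothetical scores for four inpainting candidates from three reward models.
# Model A is imagined to be biased toward the overly bright candidate 3.
model_a = [0.1, 0.2, 0.3, 0.9]
model_b = [0.2, 0.8, 0.3, 0.1]
model_c = [0.3, 0.9, 0.4, 0.2]
print(preference_pair([model_a]))
print(preference_pair([model_a, model_b, model_c]))
```

Used alone, the biased model would mark candidate 3 as preferred; averaged with two models that do not share the bias, the ensemble prefers candidate 1 instead. A pair selected this way would then serve as one training example for DPO.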

[242] Streamline pathology foundation model by cross-magnification distillation

Ziyu Su,Abdul Rehman Akbar,Usama Sajjad,Anil V. Parwani,Muhammad Khalid Khan Niazi

Main category: cs.CV

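TL;DR: This paper proposes XMAG, a lightweight foundation model that uses cross-magnification distillation to transfer knowledge from a 20x teacher to a 5x student. It sharply reduces compute, approaches the performance of large models on several pathology analysis tasks, and delivers a 30-fold processing speed-up.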

Details

Motivation: Existing foundation models have large parameter counts and depend on high-magnification image processing, so their compute cost is too high for clinical deployment. Method: A cross-magnification distillation framework with dual-level knowledge transfer, combining global representation alignment and local spatial token mapping, trains a lightweight student model at 5x magnification (XMAG) on 3.49 million images. Result: XMAG performs well on six clinically relevant pathology tasks, with accuracy within 1% of large models, a processing speed of 8.8 whole slide images per minute, 11.3 times fewer patches, and good cross-institutional generalization. Conclusion: Cross-magnification distillation is a viable strategy for deploying foundation models efficiently in resource-constrained clinical settings and could enable real-time integration of pathology AI. Abstract: Foundation models (FM) have transformed computational pathology but remain computationally prohibitive for clinical deployment due to their massive parameter counts and high-magnification processing requirements. Here, we introduce XMAG, a lightweight FM developed through cross-magnification distillation that transfers knowledge from a state-of-the-art 20x magnification teacher to an efficient 5x magnification student architecture. XMAG employs a compact backbone and operates entirely at 5x, requiring 11.3 times fewer patches per whole slide image (WSI) compared to existing approaches. Our novel distillation framework incorporates dual-level knowledge transfer, aligning both global image representations and local spatial token mapping. We trained XMAG on 3.49 million images curated from publicly available datasets and evaluated performance across six clinically relevant histopathology analysis tasks spanning multiple cancer types. XMAG achieved diagnostic accuracy within 1% of substantially larger foundation models while delivering 30-fold processing acceleration, reaching 8.8 WSIs per minute processing speed. Our cross-institutional validation confirmed robust generalization. Further, we developed an end-to-end training strategy to further boost our model's performance to approach the larger FMs' performance. These results establish cross-magnification distillation as a viable approach for deploying FM capabilities in resource-constrained clinical environments, potentially enabling real-time pathology AI integration.

[243] CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP

Na Min An,Inha Kang,Minhyun Lee,Hyunjung Shim

Main category: cs.CV

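TL;DR: Proposes CoPatch, a training-free zero-shot referring image segmentation framework that improves spatial grounding by enhancing spatial representations in both the text and image modalities.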

Details

Motivation: Existing vision-language models such as CLIP understand spatial relationships poorly, and text feature extraction often ignores context tokens, so spatial grounding in referring segmentation is weak. Method: In the language stream, CoPatch builds hybrid text features that include context tokens carrying spatial cues. In the vision stream, it extracts patch-level features through a new path discovered in intermediate layers, where spatial structure is better preserved. The two are fused into a clustered image-text similarity map (CoMap) for precise mask selection. Result: Zero-shot performance improves significantly on RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut (+2 to +7 mIoU) without any additional training. Conclusion: Recovering and using the spatial knowledge latent in vision-language models improves spatial grounding in zero-shot referring segmentation and opens a new direction for the field. Abstract: Spatial grounding is crucial for referring image segmentation (RIS), where the goal of the task is to localize an object described by language. Current foundational vision-language models (VLMs), such as CLIP, excel at aligning images and text but struggle with understanding spatial relationships. Within the language stream, most existing methods often focus on the primary noun phrase when extracting local text features, undermining contextual tokens. Within the vision stream, CLIP generates similar features for images with different spatial layouts, resulting in limited sensitivity to spatial structure. To address these limitations, we propose CoPatch, a zero-shot RIS framework that leverages internal model components to enhance spatial representations in both text and image modalities. For language, CoPatch constructs hybrid text features by incorporating context tokens carrying spatial cues. For vision, it extracts patch-level image features using our novel path discovered from intermediate layers, where spatial structure is better preserved. These enhanced features are fused into a clustered image-text similarity map, CoMap, enabling precise mask selection. As a result, CoPatch significantly improves spatial grounding in zero-shot RIS across RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut (+2 to +7 mIoU) without requiring any additional training. Our findings underscore the importance of recovering and leveraging the untapped spatial knowledge inherently embedded in VLMs, thereby paving the way for opportunities in zero-shot RIS.
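The idea of scoring candidate masks against a patch-level image-text similarity map can be sketched as follows. This is a simplification under stated assumptions: it uses plain cosine similarity and mean pooling, and it omits the clustering step that gives CoMap its name. The feature values, mask layout, and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_map(patch_feats, text_feat):
    """Per-patch similarity to the text feature (grid of feature vectors in, grid of floats out)."""
    return [[cosine(f, text_feat) for f in row] for row in patch_feats]

def select_mask(sim_map, masks):
    """Index of the candidate mask whose covered patches have the highest mean similarity."""
    def score(mask):
        vals = [sim_map[i][j] for i, row in enumerate(mask) for j, m in enumerate(row) if m]
        return sum(vals) / len(vals) if vals else float("-inf")
    return max(range(len(masks)), key=lambda k: score(masks[k]))

# Hypothetical 2x2 grid of 2-D patch features and a text feature.
patch_feats = [[(1.0, 0.0), (0.9, 0.1)],
               [(0.0, 1.0), (0.1, 0.9)]]
text_feat = (1.0, 0.0)
masks = [[[1, 1], [0, 0]],   # candidate covering the top row
         [[0, 0], [1, 1]]]   # candidate covering the bottom row
sim = similarity_map(patch_feats, text_feat)
print(select_mask(sim, masks))
```

Because the score is computed per patch, the selection depends on where the matching features sit in the grid, which is the spatial sensitivity that a single global image embedding lacks.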

[244] Deep Learning for Oral Health: Benchmarking ViT, DeiT, BEiT, ConvNeXt, and Swin Transformer

Ajo Babu George,Sadhvik Bathini,Niranjana S R

Main category: cs.CV

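TL;DR: This study systematically evaluates five state-of-the-art transformer-based architectures (ViT, DeiT, ConvNeXt, Swin Transformer, and BEiT) on multi-class dental disease classification, with a focus on real-world challenges such as data imbalance. Trained and validated on the Oral Diseases dataset, ConvNeXt performs best with 81.06% accuracy, followed closely by BEiT and Swin Transformer, while ViT and DeiT do poorly on caries-related classes.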

Details

Motivation: To address real-world problems that the literature often overlooks, in particular the effect of data imbalance on dental disease classification models. Method: Five vision models (ViT, DeiT, ConvNeXt, Swin Transformer, BEiT) are trained and validated on the Oral Diseases dataset and evaluated by accuracy, precision, recall, and F1-score, with particular attention to how each handles class imbalance. Result: ConvNeXt achieves the highest validation accuracy (81.06%); BEiT (80.00%) and Swin Transformer (79.73%) are close and stable, all with strong F1-scores; ViT (79.37%) and DeiT (78.79%) are weaker, especially on caries-related classes. Conclusion: ConvNeXt, Swin Transformer, and BEiT show reliable diagnostic performance and clinical potential, can guide model selection for future AI-driven oral disease diagnostic tools, and underline the importance of handling data imbalance. Abstract: Objective: The aim of this study was to systematically evaluate and compare the performance of five state-of-the-art transformer-based architectures - Vision Transformer (ViT), Data-efficient Image Transformer (DeiT), ConvNeXt, Swin Transformer, and Bidirectional Encoder Representation from Image Transformers (BEiT) - for multi-class dental disease classification. The study specifically focused on addressing real-world challenges such as data imbalance, which is often overlooked in existing literature. Study Design: The Oral Diseases dataset was used to train and validate the selected models. Performance metrics, including validation accuracy, precision, recall, and F1-score, were measured, with special emphasis on how well each architecture managed imbalanced classes. Results: ConvNeXt achieved the highest validation accuracy at 81.06, followed by BEiT at 80.00 and Swin Transformer at 79.73, all demonstrating strong F1-scores. ViT and DeiT achieved accuracies of 79.37 and 78.79, respectively, but both struggled particularly with Caries-related classes. Conclusions: ConvNeXt, Swin Transformer, and BEiT showed reliable diagnostic performance, making them promising candidates for clinical application in dental imaging. These findings provide guidance for model selection in future AI-driven oral disease diagnostic tools and highlight the importance of addressing data imbalance in real-world scenarios.

[245] HTMA-Net: Towards Multiplication-Avoiding Neural Networks via Hadamard Transform and In-Memory Computing

Emadeldeen Hamdan,Ahmet Enis Cetin

Main category: cs.CV

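TL;DR: HTMA-Net is a new framework that combines the Hadamard transform with multiplication-avoiding SRAM-based in-memory computing. It removes up to 52% of multiplications while maintaining accuracy and significantly reduces computational complexity and parameter count.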

Details

Motivation: Deploying deep neural networks efficiently on resource-constrained edge devices requires cutting the cost of multiplications. Prior methods either target only the multiplications in convolutional layers or focus only on in-memory acceleration, which limits them. Method: HTMA-Net selectively replaces intermediate convolutional layers with hybrid Hadamard-based transform layers and implements their internal convolutions with multiplication-avoiding operations in SRAM-based in-memory computing, reducing overall arithmetic complexity. Result: Validated on ResNet-18 and related models, the approach removes up to 52% of multiplications relative to the baselines while keeping comparable accuracy on CIFAR-10, CIFAR-100, and Tiny ImageNet, and significantly reduces computational complexity and parameter count. Conclusion: Combining structured Hadamard transform layers with multiplication-avoiding SRAM-based in-memory computing is a promising path to efficient deep learning architectures. Abstract: Reducing the cost of multiplications is critical for efficient deep neural network deployment, especially in energy-constrained edge devices. In this work, we introduce HTMA-Net, a novel framework that integrates the Hadamard Transform (HT) with multiplication-avoiding (MA) SRAM-based in-memory computing to reduce arithmetic complexity while maintaining accuracy. Unlike prior methods that only target multiplications in convolutional layers or focus solely on in-memory acceleration, HTMA-Net selectively replaces intermediate convolutions with Hybrid Hadamard-based transform layers whose internal convolutions are implemented via multiplication-avoiding in-memory operations. We evaluate HTMA-Net on ResNet-18 using CIFAR-10, CIFAR-100, and Tiny ImageNet, and provide a detailed comparison against regular, MF-only, and HT-only variants. Results show that HTMA-Net eliminates up to 52% of multiplications compared to baseline ResNet-18, ResNet-20, and ResNet-50 models, while achieving comparable accuracy in evaluation and significantly reducing computational complexity and the number of parameters. Our results demonstrate that combining structured Hadamard transform layers with SRAM-based in-memory computing multiplication-avoiding operators is a promising path towards efficient deep learning architectures.
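The reason a Hadamard transform suits multiplication-avoiding hardware is that it can be computed with additions and subtractions alone. The sketch below is the standard unnormalized fast Walsh-Hadamard transform; it illustrates that property in isolation and is not HTMA-Net's layer design, which the abstract does not detail.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural ordering).
    Uses only additions and subtractions; the length must be a power of two."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly: no multiplication
        h *= 2
    return out

x = [1, 0, 1, 0, 0, 1, 1, 0]
y = fwht(x)
print(y)  # [4, 2, 0, -2, 0, 2, 0, 2]
# Applying the transform twice returns the input scaled by its length.
print([v // len(x) for v in fwht(y)])
```

For a length-n input this takes n log2(n) additions or subtractions, compared with n squared multiply-accumulate operations for a general dense transform of the same size.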

[246] Towards Comprehensive Interactive Change Understanding in Remote Sensing: A Large-scale Dataset and Dual-granularity Enhanced VLM

Junxiao Xue,Quan Deng,Xuecheng Wu,Kelu Yao,Xinyi Yin,Fei Yu,Wei Zhou,Yanfei Zhong,Yang Liu,Dingkang Yang

Main category: cs.CV

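TL;DR: This paper presents ChangeIMTI, a large-scale interactive multi-task dataset, and ChangeVG, a vision-guided vision-language model with dual-granularity awareness, for remote sensing change understanding, achieving superior performance on four tasks.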

Details

Motivation: Existing remote sensing change understanding datasets lack deep understanding and interaction across tasks such as change captioning, counting, and localization, and fall short of what complex environmental change analysis requires. Method: The authors build the ChangeIMTI dataset covering four tasks (change captioning, binary change classification, change counting, and change localization) and design ChangeVG, a dual-branch vision-guided model that combines fine-grained spatial features with high-level semantic information to assist instruction tuning of large vision-language models. Result: On change captioning, the method outperforms Semantic-CC by 1.39 points on the comprehensive S*m metric and performs well across tasks; ablation studies confirm the value of the key modules. Conclusion: The ChangeIMTI dataset and the ChangeVG model advance multi-task interaction and hierarchical cross-modal learning for remote sensing change understanding. Abstract: Remote sensing change understanding (RSCU) is essential for analyzing remote sensing images and understanding how human activities affect the environment. However, existing datasets lack deep understanding and interactions in the diverse change captioning, counting, and localization tasks. To tackle these gaps, we construct ChangeIMTI, a new large-scale interactive multi-task instruction dataset that encompasses four complementary tasks including change captioning, binary change classification, change counting, and change localization. Building upon this new dataset, we further design a novel vision-guided vision-language model (ChangeVG) with dual-granularity awareness for bi-temporal remote sensing images (i.e., two remote sensing images of the same area at different times). The introduced vision-guided module is a dual-branch architecture that synergistically combines fine-grained spatial feature extraction with high-level semantic summarization. These enriched representations further serve as the auxiliary prompts to guide large vision-language models (VLMs) (e.g., Qwen2.5-VL-7B) during instruction tuning, thereby facilitating the hierarchical cross-modal learning. We conduct extensive experiments across four tasks to demonstrate the superiority of our approach. Remarkably, on the change captioning task, our method outperforms the strongest method Semantic-CC by 1.39 points on the comprehensive S*m metric, which integrates the semantic similarity and descriptive accuracy to provide an overall evaluation of change captions. Moreover, we also perform a series of ablation studies to examine the critical components of our method.

[247] Stochastic Interpolants via Conditional Dependent Coupling

Chenrui Ma,Xi Xiao,Tianyang Wang,Xiao Wang,Yanning Shen

Main category: cs.CV

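TL;DR: Proposes a unified multistage generative framework based on a Conditional Dependent Coupling strategy, optimized end to end with a single Diffusion Transformer, that improves generation efficiency while keeping high fidelity.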

Details

Motivation: Existing image generation models face a trade-off between compute cost and generation quality. VAE-based methods suffer from information loss and cannot be trained end to end, pixel-space methods are computationally expensive, and cascade models are hard to optimize end to end. Method: The Conditional Dependent Coupling strategy decomposes the generative process into interpolant trajectories at multiple stages and models the whole process with one unified Diffusion Transformer, enabling end-to-end training and knowledge sharing. Result: Experiments show high fidelity and high efficiency across multiple resolutions, outperforming existing models. Conclusion: The unified multistage framework resolves the tension among compute cost, information loss, and end-to-end optimization and offers a new approach to efficient, high-quality image generation. Abstract: Existing image generation models face critical challenges regarding the trade-off between computation and fidelity. Specifically, models relying on a pretrained Variational Autoencoder (VAE) suffer from information loss, limited detail, and the inability to support end-to-end training. In contrast, models operating directly in the pixel space incur prohibitive computational cost. Although cascade models can mitigate computational cost, stage-wise separation prevents effective end-to-end optimization, hampers knowledge sharing, and often results in inaccurate distribution learning within each stage. To address these challenges, we introduce a unified multistage generative framework based on our proposed Conditional Dependent Coupling strategy. It decomposes the generative process into interpolant trajectories at multiple stages, ensuring accurate distribution learning while enabling end-to-end optimization. Importantly, the entire process is modeled as a single unified Diffusion Transformer, eliminating the need for disjoint modules and also enabling knowledge sharing. Extensive experiments demonstrate that our method achieves both high fidelity and efficiency across multiple resolutions.
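As background for "interpolant trajectories", the sketch below shows the simplest stochastic interpolant, a straight line between a source sample and a target sample, together with the velocity a model would be trained to predict along it. This is the generic linear interpolant only; the paper's Conditional Dependent Coupling, which determines how source and target samples are paired at each stage, is not described in the abstract and is not reproduced here.

```python
def interpolate(x0, x1, t):
    """Point on the linear interpolant x_t = (1 - t) * x0 + t * x1, for t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Time derivative of the linear interpolant, d x_t / dt = x1 - x0.
    A velocity model is typically regressed onto this target."""
    return [b - a for a, b in zip(x0, x1)]

# Hypothetical source (e.g. noise or a coarse stage) and target (finer stage) samples.
x0 = [0.0, 2.0, -1.0]
x1 = [1.0, 0.0, 3.0]
print(interpolate(x0, x1, 0.5))
print(velocity_target(x0, x1))
```

In a multistage setting, one can picture each stage as its own trajectory of this kind, with the end point of one stage informing the start of the next; how exactly that coupling is built is the paper's contribution.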

[248] Benchmarking DINOv3 for Multi-Task Stroke Analysis on Non-Contrast CT

Donghao Zhang,Yimin Chen,Kauê TN Duarte,Taha Aslan,Mohamed AlShamrani,Brij Karmur,Yan Wan,Shengcai Chen,Bo Hu,Bijoy K Menon,Wu Qiu

Main category: cs.CV

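TL;DR: This study uses DINOv3, a state-of-the-art self-supervised vision transformer, to work around the low image contrast and signal-to-noise ratio of non-contrast CT (NCCT) in stroke diagnosis, establishing strong benchmarks for infarct and hemorrhage segmentation, anomaly classification, hemorrhage subtype classification, and dichotomized ASPECTS classification.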

Details

Motivation: NCCT is essential for rapid stroke diagnosis but is limited by low contrast and signal-to-noise ratio; existing methods extract features poorly and struggle to support automated multi-task analysis. Method: Uses the self-supervised vision transformer DINOv3 to generate feature representations and evaluates them on multiple stroke analysis tasks (segmentation and classification) across several public and private datasets. Result: Establishes strong benchmarks for infarct/hemorrhage segmentation, normal vs. stroke classification, hemorrhage subtype classification, and ASPECTS classification, and demonstrates the potential to improve automated stroke diagnosis. Conclusion: DINOv3 shows considerable potential for multi-task stroke analysis on NCCT and supports the use of advanced self-supervised models in clinical image analysis, although the approach has constraints that remain to be addressed. Abstract: Non-contrast computed tomography (NCCT) is essential for rapid stroke diagnosis but is limited by low image contrast and signal-to-noise ratio. We address this challenge by leveraging DINOv3, a state-of-the-art self-supervised vision transformer, to generate powerful feature representations for a comprehensive set of stroke analysis tasks. Our evaluation encompasses infarct and hemorrhage segmentation, anomaly classification (normal vs. stroke and normal vs. infarct vs. hemorrhage), hemorrhage subtype classification (EDH, SDH, SAH, IPH, IVH), and dichotomized ASPECTS classification (<=6 vs. >6) on multiple public and private datasets. This study establishes strong benchmarks for these tasks and demonstrates the potential of advanced self-supervised models to improve automated stroke diagnosis from NCCT, providing a clear analysis of both the advantages and current constraints of the approach. The code is available at https://github.com/Zzz0251/DINOv3-stroke.

[249] Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents

Peilin Feng,Zhutao Lv,Junyan Ye,Xiaolei Wang,Xinjie Huo,Jinhua Yu,Wanghan Xu,Wenlong Zhang,Lei Bai,Conghui He,Weijia Li

Main category: cs.CV

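TL;DR: Introduces Earth-Agent, the first agentic framework to unify RGB and spectral Earth observation data, supporting cross-modal, multi-step, and quantitative spatiotemporal reasoning, and releases the Earth-Bench benchmark for systematic evaluation of LLM-based scientific applications in remote sensing.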

Details

Motivation: Existing MLLMs are limited on complex Earth observation tasks that require multi-step reasoning and domain-specific tools, and current agent-based methods are confined to RGB perception and shallow reasoning and lack systematic evaluation. Method: Proposes the Earth-Agent framework, which unifies RGB and spectral data within an MCP-based tool ecosystem and dynamically invokes expert tools for cross-modal, multi-step reasoning; builds the Earth-Bench benchmark of 248 expert-curated tasks and 13,729 images with a dual-level evaluation protocol. Result: Experiments across different LLM backbones, against general agent frameworks, and against MLLMs on remote sensing benchmarks demonstrate the effectiveness of Earth-Agent on complex tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis. Conclusion: Earth-Agent establishes a new paradigm for Earth observation analysis, moving LLM applications in the field toward scientifically grounded, next-generation use. Abstract: Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent MLLMs have advanced EO research, they still lack the capability to tackle complex tasks that require multi-step reasoning and the use of domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy, confined to RGB perception, shallow reasoning, and lacking systematic evaluation protocols. To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images, spanning spectrum, products and RGB modalities, and equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments that vary the LLM backbone, compare against general agent frameworks, and compare against MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation. Our code and dataset will be publicly released.

[250] WeatherCycle: Unpaired Multi-Weather Restoration via Color Space Decoupled Cycle Learning

Wenxuan Fang,Jiangwei Weng,Jianjun Qian,Jian Yang,Jun Li

Main category: cs.CV

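TL;DR: Proposes WeatherCycle, an unsupervised multi-weather image restoration framework that uses lumina-chroma decomposition and a bidirectional degradation-content translation cycle to restore images without paired data; with its LDGM and DACR modules it shows strong generalization and semantic consistency under complex weather.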

Details

Motivation: Existing image restoration methods usually rely on task-specific physical priors, which limits scalability and generalization to diverse real-world weather, so a unified multi-weather restoration framework that does not need paired data is required. Method: Proposes the WeatherCycle framework, which uses a lumina-chroma decomposition strategy to decouple degradation from content; designs a Lumina Degradation Guidance Module (LDGM) that injects luminance degradation priors via frequency-domain amplitude modulation; and introduces a Difficulty-Aware Contrastive Regularization (DACR) module that identifies hard samples with a CLIP-based classifier and applies contrastive alignment to strengthen semantic consistency. Result: Experiments on multiple multi-weather restoration datasets show state-of-the-art performance among unsupervised methods and strong generalization to complex weather degradations. Conclusion: WeatherCycle offers a general, unpaired solution for multi-weather image restoration; decoupling degradation modeling from content reconstruction and adding curriculum-style contrastive regularization improves unsupervised restoration performance and robustness. Abstract: Unsupervised image restoration under multi-weather conditions remains a fundamental yet underexplored challenge. While existing methods often rely on task-specific physical priors, their narrow focus limits scalability and generalization to diverse real-world weather scenarios. In this work, we propose \textbf{WeatherCycle}, a unified unpaired framework that reformulates weather restoration as a bidirectional degradation-content translation cycle, guided by degradation-aware curriculum regularization. At its core, WeatherCycle employs a \textit{lumina-chroma decomposition} strategy to decouple degradation from content without modeling complex weather, enabling domain conversion between degraded and clean images. To model diverse and complex degradations, we propose a \textit{Lumina Degradation Guidance Module} (LDGM), which learns luminance degradation priors from a degraded image pool and injects them into clean images via frequency-domain amplitude modulation, enabling controllable and realistic degradation modeling. Additionally, we incorporate a \textit{Difficulty-Aware Contrastive Regularization (DACR)} module that identifies hard samples via a CLIP-based classifier and enforces contrastive alignment between hard samples and restored features to enhance semantic consistency and robustness. Extensive experiments across several multi-weather datasets demonstrate that our method achieves state-of-the-art performance among unsupervised approaches, with strong generalization to complex weather degradations.

[251] Sparse2Dense: A Keypoint-driven Generative Framework for Human Video Compression and Vertex Prediction

Bolin Chen,Ru-Ling Liao,Yan Ye,Jie Chen,Shanzhi Yin,Xinrui Ju,Shiqi Wang,Yibo Fan

Main category: cs.CV

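TL;DR: Proposes the Sparse2Dense framework, which uses extremely sparse 3D keypoints to achieve ultra-low bitrate human video compression and precise vertex prediction.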

Details

Motivation: In bandwidth-constrained multimedia applications, achieving ultra-low bitrate human video compression and accurate vertex prediction at the same time is challenging because it requires reconciling dynamic motion modeling, appearance synthesis, and geometric consistency. Method: Proposes Sparse2Dense, a keypoint-driven generative framework built on a multi-task, keypoint-aware deep generative model; sparse 3D keypoints encode human motion and are used to estimate dense motion for video synthesis with temporal coherence and realistic textures, while an integrated vertex predictor learns human vertex geometry through joint optimization with video generation. Result: Experiments show competitive compression performance for human video relative to traditional and generative video codecs, together with precise human vertex prediction for downstream geometry applications. Conclusion: Sparse2Dense is expected to facilitate bandwidth-efficient human-centric media transmission in scenarios such as real-time motion analysis, virtual human animation, and immersive entertainment. Abstract: For bandwidth-constrained multimedia applications, simultaneously achieving ultra-low bitrate human video compression and accurate vertex prediction remains a critical challenge, as it demands the harmonization of dynamic motion modeling, detailed appearance synthesis, and geometric consistency. To address this challenge, we propose Sparse2Dense, a keypoint-driven generative framework that leverages extremely sparse 3D keypoints as compact transmitted symbols to enable ultra-low bitrate human video compression and precise human vertex prediction. The key innovation is the multi-task learning-based and keypoint-aware deep generative model, which could encode complex human motion via compact 3D keypoints and leverage these sparse keypoints to estimate dense motion for video synthesis with temporal coherence and realistic textures. Additionally, a vertex predictor is integrated to learn human vertex geometry through joint optimization with video generation, ensuring alignment between visual content and geometric structure. Extensive experiments demonstrate that the proposed Sparse2Dense framework achieves competitive compression performance for human video over traditional/generative video codecs, whilst enabling precise human vertex prediction for downstream geometry applications. As such, Sparse2Dense is expected to facilitate bandwidth-efficient human-centric media transmission, such as real-time motion analysis, virtual human animation, and immersive entertainment.

[252] TRAX: TRacking Axles for Accurate Axle Count Estimation

Avinash Rai,Sandeep Jana,Vishal Vijay

Main category: cs.CV

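TL;DR: Proposes an end-to-end, video-based vehicle axle counting method that detects vehicles with YOLO-OBB and tires with YOLO, then tracks axle-related features with the proposed TRAX algorithm, reducing axle miscounts for long and occluded vehicles in dense scenes and improving accuracy and robustness on real traffic video.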

Details

Motivation: Existing methods count axles poorly for long and occluded vehicles in dense traffic because tires are hard to associate with the right vehicle, leading to false and missed detections. Method: Uses YOLO-OBB to detect and categorize vehicles and YOLO to detect tires, and associates each tire with its parent vehicle; proposes the TRAX (Tire and Axle Tracking) algorithm, which tracks tire and axle features across frames to mitigate occlusion and partial detections. Result: Significantly reduces false positives and improves axle-count accuracy for long vehicles, with strong robustness on real-world traffic videos. Conclusion: The method is a step toward scalable, AI-driven axle counting systems in which machine vision could replace legacy roadside infrastructure. Abstract: Accurate counting of vehicle axles is essential for traffic control, toll collection, and infrastructure development. We present an end-to-end, video-based pipeline for axle counting that tackles limitations of previous works in dense environments. Our system leverages a combination of YOLO-OBB to detect and categorize vehicles, and YOLO to detect tires. Detected tires are intelligently associated with their respective parent vehicles, enabling accurate axle prediction even in complex scenarios. However, there are a few challenges in detection when it comes to scenarios with longer and occluded vehicles. We mitigate vehicular occlusions and partial detections for longer vehicles by proposing a novel TRAX (Tire and Axle Tracking) Algorithm to successfully track axle-related features between frames. Our method stands out by significantly reducing false positives and improving the accuracy of axle-counting for long vehicles, demonstrating strong robustness in real-world traffic videos. This work represents a significant step toward scalable, AI-driven axle counting systems, paving the way for machine vision to replace legacy roadside infrastructure.
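The abstract does not spell out how tires are associated with vehicles or how counts are stabilized across frames, so the following is only a minimal sketch under stated assumptions: boxes are axis-aligned (the paper uses oriented boxes), a tire belongs to the smallest vehicle box containing its center, and a track's axle count is the maximum seen over frames. The names `assign_tires` and `aggregate_over_frames` are hypothetical and are not the TRAX algorithm itself.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def area(box: Box) -> float:
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def contains(box: Box, point: Tuple[float, float]) -> bool:
    x0, y0, x1, y1 = box
    return x0 <= point[0] <= x1 and y0 <= point[1] <= y1


def assign_tires(vehicles: Dict[str, Box], tires: List[Box]) -> Dict[str, int]:
    """Count tires per vehicle in one frame (assumed rule: smallest containing box wins)."""
    counts = {vid: 0 for vid in vehicles}
    for tire in tires:
        c = center(tire)
        candidates = [vid for vid, vbox in vehicles.items() if contains(vbox, c)]
        if candidates:
            best = min(candidates, key=lambda vid: area(vehicles[vid]))
            counts[best] += 1
    return counts


def aggregate_over_frames(per_frame: List[Dict[str, int]]) -> Dict[str, int]:
    """Assumed heuristic: keep the largest per-track count, so occluded frames do not lower it."""
    result: Dict[str, int] = {}
    for frame_counts in per_frame:
        for vid, n in frame_counts.items():
            result[vid] = max(result.get(vid, 0), n)
    return result


vehicles = {"truck": (0.0, 0.0, 100.0, 40.0), "car": (120.0, 10.0, 160.0, 40.0)}
frame_a = assign_tires(vehicles, [(5, 30, 15, 40), (50, 30, 60, 40), (125, 30, 133, 40)])
frame_b = assign_tires(vehicles, [(5, 30, 15, 40), (50, 30, 60, 40), (85, 30, 95, 40),
                                  (125, 30, 133, 40), (150, 30, 158, 40), (200, 30, 210, 40)])
axles = aggregate_over_frames([frame_a, frame_b])
print(axles)
```

Taking the maximum over frames is a deliberately simple stand-in for tracking: it tolerates frames where a tire is hidden, but it would also keep a spurious extra detection, which a real tracker would need to filter.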

[253] Confidence-Calibrating Regularization for Robust Brain MRI Segmentation Under Domain Shift

Behraj Khan,Tahir Qasim Syed

Main category: cs.CV

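TL;DR: Proposes CalSAM, a lightweight adaptation framework that uses a Feature Fisher Information Penalty and a Confidence Misalignment Penalty to improve the domain generalization and uncertainty calibration of SAM in medical image segmentation.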

Details

Motivation: SAM performs well on natural images but suffers from domain shift and overconfidence on medical volumes, which hurts segmentation performance and reliability across centers and scanners. Method: Proposes the CalSAM framework, which introduces a Feature Fisher Information Penalty (FIP) computed on 3D feature maps to reduce encoder sensitivity to domain shift and a Confidence Misalignment Penalty (CMP) to suppress overconfident voxel-wise predictions; only the mask decoder is fine-tuned while the SAM encoder stays frozen. Result: On the BraTS scanner-shift split, DSC improves by 7.4% (relative), HD95 drops by 26.9%, and ECE drops by 39.5%; on ATLAS-C motion corruptions, DSC improves by 5.3% and ECE drops by 32.6%; ablations show complementary gains from FIP and CMP (p<0.01), and the Fisher penalty adds about 15% training-time overhead. Conclusion: CalSAM improves the domain generalization and prediction calibration of SAM on medical images while keeping the computational benefit of a frozen encoder, and suits clinical settings such as brain MRI segmentation. Abstract: The Segment Anything Model (SAM) exhibits strong zero-shot performance on natural images but suffers from domain shift and overconfidence when applied to medical volumes. We propose \textbf{CalSAM}, a lightweight adaptation framework that (i) reduces encoder sensitivity to domain shift via a \emph{Feature Fisher Information Penalty} (FIP) computed on 3D feature maps and (ii) penalizes overconfident voxel-wise errors through a \emph{Confidence Misalignment Penalty} (CMP). The combined loss, $\mathcal{L}_{\mathrm{CalSAM}}$, fine-tunes only the mask decoder while keeping SAM's encoders frozen. On cross-center and scanner-shift evaluations, CalSAM substantially improves accuracy and calibration: e.g., on the BraTS scanner split (Siemens$\to$GE) CalSAM shows a $+7.4\%$ relative improvement in $\mathrm{DSC}$ (80.1\% vs.\ 74.6\%), a $-26.9\%$ reduction in $\mathrm{HD95}$ (4.6 mm vs.\ 6.3 mm), and a $-39.5\%$ reduction in $\mathrm{ECE}$ (5.2\% vs.\ 8.6\%). On ATLAS-C (motion corruptions), CalSAM achieves a $+5.3\%$ relative improvement in $\mathrm{DSC}$ (75.9\%) and a $-32.6\%$ reduction in $\mathrm{ECE}$ (5.8\%). Ablations show FIP and CMP contribute complementary gains ($p<0.01$), and the Fisher penalty incurs a modest $\sim$15\% training-time overhead. CalSAM therefore delivers improved domain generalization and better-calibrated uncertainty estimates for brain MRI segmentation, while retaining the computational benefits of freezing SAM's encoder.
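The relative changes quoted for the BraTS scanner split can be checked directly from the absolute values given in the abstract. The short sketch below does only that arithmetic; the helper name `relative_change` is ours, and the number pairs are copied from the abstract (CalSAM value first, baseline second).

```python
def relative_change(new: float, baseline: float) -> float:
    """Relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# (CalSAM, baseline) pairs as reported for the BraTS Siemens-to-GE split.
reported = {
    "DSC": (80.1, 74.6),   # percent, higher is better
    "HD95": (4.6, 6.3),    # millimetres, lower is better
    "ECE": (5.2, 8.6),     # percent, lower is better
}

changes = {name: relative_change(new, base) for name, (new, base) in reported.items()}
for name, value in changes.items():
    print(f"{name}: {value:+.2f}%")
```

The results come out at roughly +7.4% for DSC, -27.0% for HD95, and -39.5% for ECE, which matches the abstract; the quoted -26.9% for HD95 appears to be truncated rather than rounded.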

[254] Unsupervised Online 3D Instance Segmentation with Synthetic Sequences and Dynamic Loss

Yifan Zhang,Wei Zhang,Chuangxin He,Zhonghua Miao,Junhui Hou

Main category: cs.CV

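TL;DR: Proposes a new unsupervised online 3D instance segmentation framework that increases training diversity by generating synthetic point cloud sequences, captures temporal dynamics with a flexible sampling strategy, and improves robustness with a dynamic-weighting loss, outperforming existing methods on several datasets.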

Details

Motivation: Existing unsupervised 3D instance segmentation methods are constrained by limited training diversity, rigid temporal sampling, and dependence on noisy pseudo-labels, which makes it hard to keep object identities consistent across LiDAR scans. Method: Enriches the training distribution by generating synthetic point cloud sequences; uses a flexible temporal sampling strategy over adjacent and non-adjacent frames to capture long-range and short-term dependencies; and designs a dynamic-weighting loss that emphasizes confident and informative samples. Result: Experiments on SemanticKITTI, nuScenes, and PandaSet show higher segmentation accuracy and more robust temporal associations than UNIT and other unsupervised baselines. Conclusion: Greater data diversity, flexible modeling of temporal dynamics, and the reweighted loss together improve unsupervised online 3D instance segmentation across the evaluated datasets. Abstract: Unsupervised online 3D instance segmentation is a fundamental yet challenging task, as it requires maintaining consistent object identities across LiDAR scans without relying on annotated training data. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo-labels. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation, enabling greater diversity without relying on manual labels or simulation engines. To better capture temporal dynamics, our method incorporates a flexible sampling strategy that leverages both adjacent and non-adjacent frames, allowing the model to learn from long-range dependencies as well as short-term variations. In addition, a dynamic-weighting loss emphasizes confident and informative samples, guiding the network toward more robust representations. Through extensive experiments on SemanticKITTI, nuScenes, and PandaSet, our method consistently outperforms UNIT and other unsupervised baselines, achieving higher segmentation accuracy and more robust temporal associations. The code will be publicly available at github.com/Eaphan/SFT3D.

[255] Real-World Transferable Adversarial Attack on Face-Recognition Systems

Andrey Kaznacheev,Matvey Mikhalchuk,Andrey Kuznetsov,Aleksandr Petiushko,Anton Razzhigaev

Main category: cs.CV

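TL;DR: Proposes GaP (Gaussian Patch), a new adversarial patch generation method that yields a universal, physically transferable attack under a strict black-box setting and substantially degrades the identity recognition of face recognition systems.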

Details

Motivation: Most existing adversarial attacks are confined to the digital domain or require white-box access; universal attacks that transfer physically under strict black-box conditions are lacking. Method: Uses a query-efficient, zero-order greedy algorithm to iteratively build a symmetric grayscale pattern for the forehead by successively adding Gaussian blobs, optimized with the cosine similarity scores of a surrogate face recognition model. Result: With approximately 10,000 queries to a black-box ArcFace model, GaP achieves a high attack success rate in both digital and real-world physical tests and transfers strongly, deceiving an unseen FaceNet model. Conclusion: Robust, transferable adversarial attacks can be crafted even with limited knowledge of the target system, exposing a severe practical vulnerability of face recognition systems. Abstract: Adversarial attacks on face recognition (FR) systems pose a significant security threat, yet most are confined to the digital domain or require white-box access. We introduce GaP (Gaussian Patch), a novel method to generate a universal, physically transferable adversarial patch under a strict black-box setting. Our approach uses a query-efficient, zero-order greedy algorithm to iteratively construct a symmetric, grayscale pattern for the forehead. The patch is optimized by successively adding Gaussian blobs, guided only by the cosine similarity scores from a surrogate FR model to maximally degrade identity recognition. We demonstrate that with approximately 10,000 queries to a black-box ArcFace model, the resulting GaP achieves a high attack success rate in both digital and real-world physical tests. Critically, the attack shows strong transferability, successfully deceiving an entirely unseen FaceNet model. Our work highlights a practical and severe vulnerability, proving that robust, transferable attacks can be crafted with limited knowledge of the target system.
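The greedy, zero-order construction described above can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the candidate sampling ranges, the acceptance rule (keep the best candidate only if it lowers the score), and above all `toy_similarity`, which is a hypothetical stand-in for the cosine similarity returned by a face recognition model and has nothing to do with ArcFace.

```python
import math
import random
from typing import Callable, List, Tuple

Patch = List[List[float]]  # grayscale values in [0, 1]


def add_symmetric_blob(patch: Patch, cx: int, cy: int, sigma: float, amp: float) -> Patch:
    """Return a copy of `patch` with a Gaussian blob and its horizontal mirror added."""
    h, w = len(patch), len(patch[0])
    out = [row[:] for row in patch]
    mx = w - 1 - cx  # mirrored blob center keeps the pattern left-right symmetric
    for y in range(h):
        for x in range(w):
            g = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            g_mirror = math.exp(-((x - mx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            out[y][x] = min(1.0, max(0.0, out[y][x] + amp * (g + g_mirror)))
    return out


def greedy_patch(score: Callable[[Patch], float], h: int, w: int,
                 steps: int, candidates: int, seed: int = 0) -> Tuple[Patch, float, int]:
    """Zero-order greedy search: only score queries are used, never gradients."""
    rng = random.Random(seed)
    patch = [[0.5] * w for _ in range(h)]
    best = score(patch)
    queries = 1
    for _ in range(steps):
        step_best, step_patch = best, None
        for _ in range(candidates):
            cand = add_symmetric_blob(patch, rng.randrange(w), rng.randrange(h),
                                      rng.uniform(1.0, 3.0), rng.choice([-0.3, 0.3]))
            s = score(cand)
            queries += 1
            if s < step_best:
                step_best, step_patch = s, cand
        if step_patch is not None:  # accept only blobs that lower the similarity
            patch, best = step_patch, step_best
    return patch, best, queries


def toy_similarity(patch: Patch) -> float:
    """HYPOTHETICAL black box: mean brightness, standing in for a model's similarity score."""
    return sum(map(sum, patch)) / (len(patch) * len(patch[0]))


patch, final_score, queries = greedy_patch(toy_similarity, h=8, w=16, steps=5, candidates=8)
print(f"final score {final_score:.3f} after {queries} queries")
```

The query budget is explicit in this formulation (one query per candidate), which is what makes a figure such as "approximately 10,000 queries" meaningful for a method of this kind.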

[256] UltraUNet: Real-Time Ultrasound Tongue Segmentation for Diverse Linguistic and Imaging Conditions

Alisher Myrgyyassov,Zhen Song,Yu Sun,Bruce Xiao Wang,Min Ney Wong,Yongping Zheng

Main category: cs.CV

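TL;DR: UltraUNet is a lightweight encoder-decoder network for real-time segmentation of tongue contours in ultrasound images, combining high accuracy with high computational efficiency.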

Details

Motivation: Low signal-to-noise ratios, imaging variability, and computational demands make real-time, accurate tongue contour segmentation in ultrasound tongue images difficult for existing methods. Method: Proposes UltraUNet, which uses lightweight Squeeze-and-Excitation blocks, Group Normalization, and summation-based skip connections, together with ultrasound-specific augmentations such as denoising and blur simulation. Result: Evaluated on 8 datasets, with single-dataset Dice = 0.855 and MSD = 0.993px, cross-dataset Dice averaging 0.734 and 0.761, and a processing speed of 250 frames per second. Conclusion: UltraUNet provides a fast, accurate solution for speech research, clinical diagnostics, and analysis of speech motor disorders. Abstract: Ultrasound tongue imaging (UTI) is a non-invasive and cost-effective tool for studying speech articulation, motor control, and related disorders. However, real-time tongue contour segmentation remains challenging due to low signal-to-noise ratios, imaging variability, and computational demands. We propose UltraUNet, a lightweight encoder-decoder architecture optimized for real-time segmentation of tongue contours in ultrasound images. UltraUNet incorporates domain-specific innovations such as lightweight Squeeze-and-Excitation blocks, Group Normalization for small-batch stability, and summation-based skip connections to reduce memory and computational overhead. It achieves 250 frames per second and integrates ultrasound-specific augmentations like denoising and blur simulation. Evaluations on 8 datasets demonstrate high accuracy and robustness, with single-dataset Dice = 0.855 and MSD = 0.993px, and cross-dataset Dice averaging 0.734 and 0.761. UltraUNet provides a fast, accurate solution for speech research, clinical diagnostics, and analysis of speech motor disorders.
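For readers unfamiliar with the Dice score reported above, here is a minimal sketch of the standard Dice coefficient on binary masks. It follows the textbook definition, 2|A ∩ B| / (|A| + |B|); whether the paper computes it on contours, filled regions, or per-frame averages is not stated in the abstract, so treat the mask layout as an assumption.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = tongue pixel, 0 = background


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient 2|A∩B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    intersection = sum(p * t for row_p, row_t in zip(pred, target)
                       for p, t in zip(row_p, row_t))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = dice(pred, target)  # one shared pixel, 2 + 1 foreground pixels: 2/3
print(round(score, 3))
```

Because the denominator counts foreground pixels only, Dice is insensitive to the large background area in ultrasound frames, which is why it is preferred over plain pixel accuracy for thin structures such as a tongue contour.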

[257] Patch Rebirth: Toward Fast and Transferable Model Inversion of Vision Transformers

Seongsoo Heo,Dong-Wan Choi

Main category: cs.CV

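TL;DR: Proposes Patch Rebirth Inversion (PRI), a model inversion method that improves the efficiency and performance of inverting Vision Transformers for data-free learning. Contrary to the earlier view that unimportant patches can be discarded, the authors find that such patches still acquire transferable knowledge through continued inversion. PRI progressively selects important patches while letting the rest keep evolving, running up to 10x faster than DMI and 2x faster than SMI while matching DMI in accuracy and outperforming SMI.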

Details

Motivation: Existing Sparse Model Inversion (SMI), when applied to Vision Transformers, discards "unimportant" patches too early, which suppresses the extraction of class-agnostic features and hurts knowledge transfer; the authors seek a more efficient inversion strategy that balances efficiency and performance. Method: Proposes Patch Rebirth Inversion (PRI), which incrementally detaches the most important patches during inversion to build sparse synthetic images while the remaining patches continue to be optimized for later selection; this progressive strategy lets initially less informative patches gradually accumulate class-relevant knowledge (the Re-Birth effect). Result: PRI achieves up to 10x faster inversion than standard Dense Model Inversion (DMI) and 2x faster than SMI, consistently outperforms SMI in accuracy, and matches the performance of DMI. Conclusion: By retaining and continuing to optimize non-critical patches, PRI balances class-agnostic and class-specific knowledge, improves the efficiency and performance of model inversion, and challenges the earlier assumption that discarding redundant patches benefits knowledge transfer. Abstract: Model inversion is a widely adopted technique in data-free learning that reconstructs synthetic inputs from a pretrained model through iterative optimization, without access to original training data. Unfortunately, its application to state-of-the-art Vision Transformers (ViTs) poses a major computational challenge, due to their expensive self-attention mechanisms. To address this, Sparse Model Inversion (SMI) was proposed to improve efficiency by pruning and discarding seemingly unimportant patches, which were even claimed to be obstacles to knowledge transfer. However, our empirical findings suggest the opposite: even randomly selected patches can eventually acquire transferable knowledge through continued inversion. This reveals that discarding any prematurely inverted patches is inefficient, as it suppresses the extraction of class-agnostic features essential for knowledge transfer, along with class-specific features. In this paper, we propose Patch Rebirth Inversion (PRI), a novel approach that incrementally detaches the most important patches during the inversion process to construct sparse synthetic images, while allowing the remaining patches to continue evolving for future selection. This progressive strategy not only improves efficiency, but also encourages initially less informative patches to gradually accumulate more class-relevant knowledge, a phenomenon we refer to as the Re-Birth effect, thereby effectively balancing class-agnostic and class-specific knowledge. Experimental results show that PRI achieves up to 10x faster inversion than standard Dense Model Inversion (DMI) and 2x faster than SMI, while consistently outperforming SMI in accuracy and matching the performance of DMI.

[258] Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection

Mingfei Han,Haihong Hao,Jinxing Zhou,Zhihui Li,Yuhui Zheng,Xueqing Deng,Linjie Yang,Xiaojun Chang

Main category: cs.CV

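TL;DR: Proposes a self-consistency framework that builds preference pairs from the agreement between a VLM's long responses and its short answers and trains on them to reduce hallucinations, with no human annotation or external supervision.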

Details

Motivation: Existing methods mitigate hallucinations in vision-language models through extensive human annotation or external supervision from stronger models, which is costly and hard to scale. Method: Designs a self-reflection pipeline that uses the model's reliable answers to short binary questions to evaluate and rank its long responses, automatically curating high-quality training data from inconsistency signals and relying only on the model's own consistency. Result: Hallucinations are reduced significantly on AMBER, MultiObject-Hal (ROPE), Object HalBench, and MMHal-Bench, with better factual grounding and reliability, while instruction-following ability is maintained, as shown by improved performance on LLaVA-Bench and MMBench. Conclusion: Using only unlabeled data and self-consistency, the method reduces VLM hallucinations in a scalable and efficient way. Abstract: Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability. Existing methods typically address these issues via extensive human annotations or external supervision from more powerful models. In this work, we present a novel framework that leverages the model's self-consistency between long responses and short answers to generate preference pairs for training. We observe that short binary questions tend to yield highly reliable responses, which can be used to query the target model to evaluate and rank its generated responses. Specifically, we design a self-reflection pipeline where detailed model responses are compared against concise binary answers, and inconsistency signals are utilized to automatically curate high-quality training data without human annotations or external model-based supervision. By relying solely on self-consistency rather than external supervision, our method offers a scalable and efficient solution that effectively reduces hallucinations using unlabeled data. Extensive experiments on multiple benchmarks, i.e., AMBER, MultiObject-Hal (ROPE), Object HalBench, and MMHal-Bench, demonstrate significant improvements in factual grounding and reliability. Moreover, our approach maintains robust instruction-following ability, as evidenced by enhanced performance on LLaVA-Bench and MMBench.

[259] TATTOO: Training-free AesTheTic-aware Outfit recOmmendation

Yuntian Wu,Xiaonan Hu,Ziqi Zhou,Hao Lu

Main category: cs.CV

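TL;DR: Proposes TATTOO, a training-free outfit recommendation method that uses multimodal large language models to generate target-item descriptions, extracts structured aesthetic features from images via an aesthetic chain-of-thought, and ranks candidates with a dynamic entropy-gated fusion mechanism, achieving state-of-the-art performance on a real-world dataset.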

Details

Motivation: Existing outfit recommendation methods depend on large-scale labeled data and task-specific training and give no explicit guidance from human aesthetics, which limits practicality and aesthetic awareness. Method: Proposes TATTOO, which first generates a target-item description with multimodal large language models (MLLMs), then uses an aesthetic chain-of-thought to distill images into a structured aesthetic profile covering dimensions such as color, style, and occasion, and finally fuses the visual summary, the textual description, and the aesthetic vectors through a dynamic entropy-gated mechanism to rank candidate items in a shared embedding space. Result: Achieves state-of-the-art performance on the real-world Aesthetic-100 evaluation set compared with existing training-based methods, and shows strong zero-shot retrieval on the Polyvore dataset. Conclusion: TATTOO establishes a training-free, aesthetic-aware paradigm for outfit recommendation, improving both recommendation performance and aesthetic awareness and offering a practical solution for intelligent recommendation in fashion e-commerce. Abstract: The global fashion e-commerce market relies significantly on intelligent and aesthetic-aware outfit-completion tools to promote sales. While previous studies have approached the problem of fashion outfit-completion and compatible-item retrieval, most of them require expensive, task-specific training on large-scale labeled data, and no effort is made to guide outfit recommendation with explicit human aesthetics. In the era of Multimodal Large Language Models (MLLMs), we show that the conventional training-based pipeline could be streamlined to a training-free paradigm, with better recommendation scores and enhanced aesthetic awareness. We achieve this with TATTOO, a Training-free AesTheTic-aware Outfit recommendation approach. It first generates a target-item description using MLLMs, followed by an aesthetic chain-of-thought used to distill the images into a structured aesthetic profile including color, style, occasion, season, material, and balance. By fusing the visual summary of the outfit with the textual description and aesthetics vectors using a dynamic entropy-gated mechanism, candidate items can be represented in a shared embedding space and be ranked accordingly. Experiments on a real-world evaluation set Aesthetic-100 show that TATTOO achieves state-of-the-art performance compared with existing training-based methods. Another standard Polyvore dataset is also used to measure the advanced zero-shot retrieval capability of our training-free method.

[260] Increasing the Diversity in RGB-to-Thermal Image Translation for Automotive Applications

Kaili Wang,Leonardo Ravaglia,Roberto Longo,Lore Goetschalckx,David Van Hamme,Julie Moeyersoms,Ben Stoffelen,Tom De Schepper

Main category: cs.CV

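TL;DR: Proposes a multimodal RGB-to-thermal image translation framework based on CoAdaIN to augment thermal data generation for driver assistance systems, producing more realistic and more diverse thermal image translations.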

Details

Motivation: Research is hampered by the limited availability of thermal datasets for real driving scenes and their poor representation in simulators, and existing methods are mostly restricted to one-to-one image translation and lack diversity. Method: Proposes a multi-modal image translation framework based on Component-aware Adaptive Instance Normalization (CoAdaIN), which adapts style to individual image components and enables one-to-many RGB-to-thermal translation. Result: The authors show that the translations are more realistic and more diverse than with the original AdaIN. Conclusion: CoAdaIN improves the quality and diversity of RGB-to-thermal image translation and offers a practical way to generate thermal data for ADAS. Abstract: Thermal imaging in Advanced Driver Assistance Systems (ADAS) improves road safety with superior perception in low-light and harsh weather conditions compared to traditional RGB cameras. However, research in this area faces challenges due to limited dataset availability and poor representation in driving simulators. RGB-to-thermal image translation offers a potential solution, but existing methods focus on one-to-one mappings. We propose a one-to-many mapping using a multi-modal translation framework enhanced with our Component-aware Adaptive Instance Normalization (CoAdaIN). Unlike the original AdaIN, which applies styles globally, CoAdaIN adapts styles to different image components individually. The result, as we show, is more realistic and diverse thermal image translations. This is the accepted author manuscript of the paper published in IEEE Sensors Conference 2024. The final published version is available at 10.1109/SENSORS60989.2024.10785056.
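The contrast between global AdaIN and a component-wise variant can be sketched on a flat feature vector. The `adain` function follows the standard AdaIN formula (normalize with the content statistics, then rescale with the style statistics). The `component_adain` function is only our reading of "adapts styles to different image components individually": the component labels, the per-component style statistics, and the function itself are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List, Tuple


def mean_std(values: List[float], eps: float = 1e-5) -> Tuple[float, float]:
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var + eps)


def adain(content: List[float], style_mean: float, style_std: float) -> List[float]:
    """Standard AdaIN: one set of style statistics applied to the whole feature map."""
    m, s = mean_std(content)
    return [style_std * (v - m) / s + style_mean for v in content]


def component_adain(content: List[float], labels: List[str],
                    styles: Dict[str, Tuple[float, float]]) -> List[float]:
    """Assumed component-wise variant: each labelled region gets its own style statistics."""
    out = list(content)
    for comp, (s_mean, s_std) in styles.items():
        idx = [i for i, lab in enumerate(labels) if lab == comp]
        if not idx:
            continue
        restyled = adain([content[i] for i in idx], s_mean, s_std)
        for i, v in zip(idx, restyled):
            out[i] = v
    return out


features = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
labels = ["road", "road", "road", "car", "car", "car"]
styles = {"road": (10.0, 1.0), "car": (30.0, 2.0)}  # hypothetical (mean, std) per component

restyled = component_adain(features, labels, styles)
road_mean = sum(restyled[:3]) / 3
car_mean = sum(restyled[3:]) / 3
print(round(road_mean, 3), round(car_mean, 3))
```

Sampling different style statistics per component is one plausible way to obtain a one-to-many mapping, for example a warm or a cold engine for the same car, which a single global style cannot express.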

[261] LiDAR-based Human Activity Recognition through Laplacian Spectral Analysis

Sasan Sharifipour,Constantino Álvarez Casado,Le Nguyen,Tharindu Ekanayake,Manuel Lage Cañellas,Nhi Nguyen,Miguel Bordallo López

Main category: cs.CV

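TL;DR: Proposes a human activity recognition method for LiDAR point clouds based on graph spectral analysis; it builds proximity graphs, extracts Laplacian spectral features, and achieves high accuracy on the MM-Fi dataset, surpassing the reported baselines.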

Details

Motivation: To avoid the privacy and illumination limitations of cameras, the work explores human activity recognition from LiDAR point clouds and seeks a feature representation that is more efficient and interpretable than deep learning. Method: Maps each LiDAR frame to a proximity graph (epsilon-graph) and computes its Laplacian spectrum; builds pose descriptors from the eigenvalues and eigenvector statistics, aggregates temporal statistics over sliding windows into fixed-length vectors, and classifies them with support vector machines and random forests. Result: On the MM-Fi dataset (40 subjects, 27 activities) under a strict subject-independent protocol, accuracy reaches 94.4% on the 13-class rehabilitation subset and 90.3% on all 27 activities, surpassing the skeleton-based baselines reported for the dataset. Conclusion: The method extracts compact, interpretable features directly from point cloud geometry and is an accurate, efficient alternative to end-to-end deep learning. Abstract: Human Activity Recognition supports applications in healthcare, manufacturing, and human-machine interaction. LiDAR point clouds offer a privacy-preserving alternative to cameras and are robust to illumination. We propose a HAR method based on graph spectral analysis. Each LiDAR frame is mapped to a proximity graph (epsilon-graph) and the Laplacian spectrum is computed. Eigenvalues and statistics of eigenvectors form pose descriptors, and temporal statistics over sliding windows yield fixed vectors for classification with support vector machines and random forests. On the MM-Fi dataset with 40 subjects and 27 activities, under a strict subject-independent protocol, the method reaches 94.4% accuracy on a 13-class rehabilitation set and 90.3% on all 27 activities. It also surpasses the skeleton-based baselines reported for MM-Fi. The contribution is a compact and interpretable feature set derived directly from point cloud geometry that provides an accurate and efficient alternative to end-to-end deep learning.
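The first step of the pipeline, turning a point cloud into an epsilon-graph and its Laplacian, can be sketched in a few lines. The sketch assumes an unweighted graph and the combinatorial Laplacian L = D - A; the abstract does not say which Laplacian variant or edge weighting is used. A full spectrum would normally come from a linear algebra library; to stay dependency-free, this example only estimates the largest eigenvalue by power iteration.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def epsilon_graph_laplacian(points: Sequence[Sequence[float]], eps: float) -> Matrix:
    """Combinatorial Laplacian L = D - A of the graph linking points closer than `eps`."""
    n = len(points)
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:
                lap[i][j] = lap[j][i] = -1.0
                lap[i][i] += 1.0
                lap[j][j] += 1.0
    return lap


def largest_eigenvalue(lap: Matrix, iters: int = 200) -> float:
    """Power iteration with a final Rayleigh quotient; returns 0.0 for an edgeless graph."""
    n = len(lap)
    # Start from a unit basis vector: the all-ones vector is always in the null space of L.
    # This start can still fail if it happens to be orthogonal to the top eigenvector.
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iters):
        w = [sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    lv = [sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, lv))


# Three collinear points one unit apart: eps = 1.5 links neighbours only (a path graph).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
lap = epsilon_graph_laplacian(points, eps=1.5)
lam_max = largest_eigenvalue(lap)  # path graph on 3 nodes has spectrum {0, 1, 3}
print(lap, round(lam_max, 6))
```

Because the Laplacian depends only on pairwise distances, its eigenvalues do not change under rotation or translation of the point cloud, which is one reason such descriptors are compact and interpretable.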

[262] OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting

Atakan Topaloglu,Kunyi Li,Michael Niemeyer,Nassir Navab,A. Murat Tekalp,Federico Tombari

Main category: cs.CV

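TL;DR: Proposes the OracleGS framework, which combines the completeness of generative models with the geometric fidelity of regressive models: a 3D-aware diffusion model generates novel views, an MVS model acts as a validator that supplies uncertainty signals, and these guide 3D Gaussian Splatting optimization, reducing hallucinated artifacts and outperforming existing methods in sparse-view novel view synthesis.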

Details

Motivation: Sparse-view novel view synthesis is fundamentally ill-posed because of severe geometric ambiguity, and existing methods trade geometric fidelity against scene completion. Method: Adopts a "propose-and-validate" framework: a pre-trained 3D-aware diffusion model first synthesizes novel views to propose a complete scene, then a multi-view stereo (MVS) model serves as a 3D-aware validator whose attention maps provide an uncertainty signal, which drives an uncertainty-weighted loss for optimizing the 3D Gaussian Splatting model. Result: Outperforms state-of-the-art methods on datasets including Mip-NeRF 360 and NeRF Synthetic, filtering hallucinated artifacts while preserving plausible completions in under-constrained regions. Conclusion: OracleGS combines the completion ability of generative models with the geometric accuracy of regressive models, and validating generated views against multi-view evidence improves sparse-view novel view synthesis. Abstract: Sparse-view novel view synthesis is fundamentally ill-posed due to severe geometric ambiguity. Current methods are caught in a trade-off: regressive models are geometrically faithful but incomplete, whereas generative models can complete scenes but often introduce structural inconsistencies. We propose OracleGS, a novel framework that reconciles generative completeness with regressive fidelity for sparse view Gaussian Splatting. Instead of using generative models to patch incomplete reconstructions, our "propose-and-validate" framework first leverages a pre-trained 3D-aware diffusion model to synthesize novel views to propose a complete scene. We then repurpose a multi-view stereo (MVS) model as a 3D-aware oracle to validate the 3D uncertainties of generated views, using its attention maps to reveal regions where the generated views are well-supported by multi-view evidence versus where they fall into regions of high uncertainty due to occlusion, lack of texture, or direct inconsistency. This uncertainty signal directly guides the optimization of a 3D Gaussian Splatting model via an uncertainty-weighted loss. Our approach conditions the powerful generative prior on multi-view geometric evidence, filtering hallucinatory artifacts while preserving plausible completions in under-constrained regions, outperforming state-of-the-art methods on datasets including Mip-NeRF 360 and NeRF Synthetic.
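The abstract names an uncertainty-weighted loss but does not give its form. The sketch below shows one plausible shape, assumed purely for illustration: a per-pixel L1 difference between the rendered and the generated view, weighted by (1 - uncertainty) and normalized by the total weight. The function name and the weighting rule are hypothetical.

```python
from typing import List


def uncertainty_weighted_l1(rendered: List[float], generated: List[float],
                            uncertainty: List[float]) -> float:
    """Assumed form: pixels the oracle trusts (low uncertainty) dominate the loss."""
    weights = [1.0 - u for u in uncertainty]  # uncertainty expected in [0, 1]
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0  # nothing in this view is trusted, so it contributes no supervision
    weighted = sum(w * abs(r - g) for w, r, g in zip(weights, rendered, generated))
    return weighted / total_weight


rendered = [0.2, 0.8]
generated = [0.2, 0.0]  # second pixel disagrees strongly with the rendering

loss_trusting_all = uncertainty_weighted_l1(rendered, generated, [0.0, 0.0])
loss_masking_bad = uncertainty_weighted_l1(rendered, generated, [0.0, 1.0])
print(loss_trusting_all, loss_masking_bad)
```

With uniform trust the disagreeing pixel pulls the reconstruction toward the generated content, while marking it as fully uncertain removes its influence, which is the filtering behavior the abstract describes for hallucinated regions.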

[263] Learning Regional Monsoon Patterns with a Multimodal Attention U-Net

Swaib Ilias Mazumder,Manish Kumar,Aparajita Khan

Main category: cs.CV

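TL;DR: Proposes a multimodal deep learning framework for high-resolution precipitation classification that uses satellite and Earth observation data and improves Indian monsoon rainfall prediction at 1 km resolution.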

Details

Motivation: Accurate monsoon rainfall prediction is vital for India's agriculture, water management, and climate risk planning, but it remains challenging because ground observations are sparse and regional variability is complex. Method: Curates a new 1 km resolution dataset integrating seven geospatial modalities (land surface temperature, vegetation (NDVI), soil moisture, relative humidity, wind speed, elevation, and land use) and models it with an attention-guided U-Net trained with focal and Dice losses. Result: The multimodal framework consistently outperforms unimodal baselines and existing deep learning methods, especially in extreme rainfall categories. Conclusion: The work contributes a scalable framework, a benchmark dataset, and state-of-the-art results for regional monsoon forecasting, climate resilience, and geospatial AI applications in India. Abstract: Accurate monsoon rainfall prediction is vital for India's agriculture, water management, and climate risk planning, yet remains challenging due to sparse ground observations and complex regional variability. We present a multimodal deep learning framework for high-resolution precipitation classification that leverages satellite and Earth observation data. Unlike previous rainfall prediction models based on coarse 5-50 km grids, we curate a new 1 km resolution dataset for five Indian states, integrating seven key geospatial modalities: land surface temperature, vegetation (NDVI), soil moisture, relative humidity, wind speed, elevation, and land use, covering the June-September 2024 monsoon season. Our approach uses an attention-guided U-Net architecture to capture spatial patterns and temporal dependencies across modalities, combined with focal and dice loss functions to handle rainfall class imbalance defined by the India Meteorological Department (IMD). Experiments demonstrate that our multimodal framework consistently outperforms unimodal baselines and existing deep learning methods, especially in extreme rainfall categories. This work contributes a scalable framework, benchmark dataset, and state-of-the-art results for regional monsoon forecasting, climate resilience, and geospatial AI applications in India.
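To make the role of the two losses concrete, here is a minimal sketch of a combined focal and soft Dice loss. It is simplified to a single binary class (for example "extreme rainfall" versus everything else), whereas the paper classifies several IMD rainfall categories; the mixing weight `alpha` and the focusing parameter `gamma = 2` are assumptions, since the abstract gives neither.

```python
import math
from typing import List


def focal_loss(probs: List[float], targets: List[int],
               gamma: float = 2.0, eps: float = 1e-7) -> float:
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma to down-weight easy pixels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p_t = p if y == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def soft_dice_loss(probs: List[float], targets: List[int], eps: float = 1e-7) -> float:
    """1 - soft Dice overlap; insensitive to the number of background pixels."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)


def combined_loss(probs: List[float], targets: List[int],
                  alpha: float = 0.5, gamma: float = 2.0) -> float:
    """Assumed convex combination; the paper's actual weighting is not stated."""
    return alpha * focal_loss(probs, targets, gamma) + (1.0 - alpha) * soft_dice_loss(probs, targets)


probs = [0.9, 0.2, 0.6, 0.1]   # predicted probability of the rare class per pixel
targets = [1, 0, 1, 0]
print(round(combined_loss(probs, targets), 4))
```

Focal loss concentrates the gradient on hard, misclassified pixels, while Dice loss rewards overlap on the rare class directly, so the two address class imbalance from complementary angles.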

[264] SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction

Yihao Ding,Soyeon Caren Han,Yanbei Jiang,Yan Li,Zechuan Li,Yifan Peng

Main category: cs.CV

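TL;DR: Proposes SynDoc, a framework that combines discriminative and generative models and uses synthetic data generation with adaptive instruction tuning to improve understanding of domain-specific visually rich documents.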

Details

Motivation: Existing large models suffer from hallucinations, inadequate domain adaptation, and reliance on extensive labeled data when handling complex, sensitive documents in fields such as medicine and finance. Method: SynDoc creates high-quality synthetic data through structural information extraction and domain-specific query generation, and coordinates the discriminative and generative models through adaptive instruction tuning and a recursive inferencing mechanism. Result: The framework demonstrates scalable, efficient, and precise document understanding on key information extraction tasks. Conclusion: SynDoc bridges the gap between domain-specific adaptation and general world knowledge, offering an efficient and robust solution for domain-specific VRDU. Abstract: Domain-specific Visually Rich Document Understanding (VRDU) presents significant challenges due to the complexity and sensitivity of documents in fields such as medicine, finance, and material science. Existing Large (Multimodal) Language Models (LLMs/MLLMs) achieve promising results but face limitations such as hallucinations, inadequate domain adaptation, and reliance on extensive fine-tuning datasets. This paper introduces SynDoc, a novel framework that combines discriminative and generative models to address these challenges. SynDoc employs a robust synthetic data generation workflow, using structural information extraction and domain-specific query generation to produce high-quality annotations. Through adaptive instruction tuning, SynDoc improves the discriminative model's ability to extract domain-specific knowledge. At the same time, a recursive inferencing mechanism iteratively refines the output of both models for stable and accurate predictions. This framework demonstrates scalable, efficient, and precise document understanding and bridges the gap between domain-specific adaptation and general world knowledge for document key information extraction tasks.

[265] Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing

Rohit Chowdhury,Aniruddha Bala,Rohan Jaiswal,Siddharth Roheda

Main category: cs.CV

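TL;DR: This paper proposes Vid-Freeze, a novel adversarial attack that adds carefully crafted perturbations to images to suppress the attention mechanism of video generation models, blocking motion synthesis while preserving image semantics.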

Details

Motivation: With the rapid progress of image-to-video (I2V) generation models, static images can be used to generate deceptive or malicious video content, and existing defenses remain inadequate at blocking motion generation, so more effective and principled protection is needed. Method: Vid-Freeze adds adversarial perturbations to the input image that specifically suppress the attention mechanism of I2V models, disrupting motion synthesis while preserving the semantic integrity of the image. Result: Experiments show that the method makes generated videos stand-still or near-static, substantially strengthening defense against misuse of I2V models. Conclusion: Vid-Freeze demonstrates the potential of immunizing images by attacking attention mechanisms and offers a new, effective direction for proactive defense against malicious use of I2V models. Abstract: The rapid progress of image-to-video (I2V) generation models has introduced significant risks, enabling video synthesis from static images and facilitating deceptive or malicious content creation. While prior defenses such as I2VGuard attempt to immunize images, effective and principled protection to block motion remains underexplored. In this work, we introduce Vid-Freeze - a novel attention-suppressing adversarial attack that adds carefully crafted adversarial perturbations to images. Our method explicitly targets the attention mechanism of I2V models, completely disrupting motion synthesis while preserving semantic fidelity of the input image. The resulting immunized images generate stand-still or near-static videos, effectively blocking malicious content creation. Our experiments demonstrate the impressive protection provided by the proposed approach, highlighting the importance of attention attacks as a promising direction for robust and proactive defenses against misuse of I2V generation models.

[266] Seeing Through the Blur: Unlocking Defocus Maps for Deepfake Detection

Minsun Jeon,Simon S. Woo

Main category: cs.CV

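TL;DR: Proposes an interpretable deepfake detection framework based on defocus blur, using the defocus blur that arises naturally in optical imaging as a reliable forensic signal for identifying forged images.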

Details

Motivation: As generative AI advances, synthetic images are becoming increasingly realistic and traditional detection methods struggle to keep up, so more robust and interpretable physical cues are needed to separate real from forged images. Method: Construct a defocus blur map and use it as a discriminative feature, analyzing how it differs between real and synthetic images and applying a deep learning model for deepfake detection. Result: Experiments confirm that defocus blur is a reliable and interpretable forensic cue that identifies synthetic images across a range of AIGC scenarios, with good generalization and robustness. Conclusion: Defocus blur, grounded in physical imaging principles, provides a general and interpretable feature for deepfake detection and helps strengthen authenticity verification of visual media. Abstract: The rapid advancement of generative AI has enabled the mass production of photorealistic synthetic images, blurring the boundary between authentic and fabricated visual content. This challenge is particularly evident in deepfake scenarios involving facial manipulation, but also extends to broader AI-generated content (AIGC) cases involving fully synthesized scenes. As such content becomes increasingly difficult to distinguish from reality, the integrity of visual media is under threat. To address this issue, we propose a physically interpretable deepfake detection framework and demonstrate that defocus blur can serve as an effective forensic signal. Defocus blur is a depth-dependent optical phenomenon that naturally occurs in camera-captured images due to lens focus and scene geometry. In contrast, synthetic images often lack realistic depth-of-field (DoF) characteristics. To capture these discrepancies, we construct a defocus blur map and use it as a discriminative feature for detecting manipulated content. Unlike RGB textures or frequency-domain signals, defocus blur arises universally from optical imaging principles and encodes physical scene structure. This makes it a robust and generalizable forensic cue. Our approach is supported by three in-depth feature analyses, and experimental results confirm that defocus blur provides a reliable and interpretable cue for identifying synthetic images. We aim for our defocus-based detection pipeline and interpretability tools to contribute meaningfully to ongoing research in media forensics. The implementation is publicly available at: https://github.com/irissun9602/Defocus-Deepfake-Detection

[267] Seeing the Unseen in Low-light Spike Streams

Liwen Hu,Yang Li,Mianzhi Liu,Yijia Guo,Shenghao Xie,Ziluo Ding,Tiejun Huang,Lei Ma

Main category: cs.CV

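TL;DR: This paper proposes Diff-SPK, the first diffusion-based reconstruction method for spike cameras, which leverages generative priors to supplement texture information in low-light conditions, and establishes the first real-world benchmark dataset for low-light spike stream reconstruction.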

Details

Motivation: Existing methods struggle with the severe noise and sparse information of spike streams in low-light, high-speed scenarios, which leads to poor reconstructions. Method: Diff-SPK first uses the ETFI module to aggregate sparse information from low-light spike streams via inter-spike intervals, then feeds ETFI to ControlNet as a conditioning input to generate the high-speed scene, and adds an ETFI-based feature fusion module to improve generation quality. Result: Diff-SPK's superiority is validated on real low-light spike streams, and the new benchmark surpasses existing datasets in scale and provides quantitative illumination information. Conclusion: Diff-SPK markedly improves reconstruction quality for spike camera data under low-light, high-speed conditions, offering an effective solution for spike cameras in extreme environments. Abstract: Spike camera, a type of neuromorphic sensor with high-temporal resolution, shows great promise for high-speed visual tasks. Unlike traditional cameras, spike camera continuously accumulates photons and fires asynchronous spike streams. Due to unique data modality, spike streams require reconstruction methods to become perceptible to the human eye. However, many methods struggle to handle spike streams in low-light high-speed scenarios due to severe noise and sparse information. In this work, we propose Diff-SPK, the first diffusion-based reconstruction method for spike camera. Diff-SPK effectively leverages generative priors to supplement texture information in low-light conditions. Specifically, it first employs an \textbf{E}nhanced \textbf{T}exture \textbf{f}rom Inter-spike \textbf{I}nterval (ETFI) to aggregate sparse information from low-light spike streams. Then, ETFI serves as a conditioning input for ControlNet to generate the high-speed scenes. To improve the quality of results, we introduce an ETFI-based feature fusion module during the generation process. Moreover, we establish the first bona fide benchmark for the low-light spike stream reconstruction task. It significantly surpasses existing reconstruction datasets in scale and provides quantitative illumination information. The performance on real low-light spike streams demonstrates the superiority of Diff-SPK.

[268] Balanced Diffusion-Guided Fusion for Multimodal Remote Sensing Classification

Hao Liu,Yongjie Zheng,Yuhan Kang,Mingyang Zhang,Maoguo Gong,Lorenzo Bruzzone

Main category: cs.CV

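TL;DR: This paper proposes a balanced diffusion-guided fusion (BDGF) framework that uses multimodal diffusion features to guide a multi-branch network for land-cover classification, addresses modality imbalance in multimodal DDPM pre-training, and achieves superior classification performance on four multimodal remote sensing datasets.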

Details

Motivation: Address the modality imbalance that arises when pre-training multimodal denoising diffusion probabilistic models (DDPMs), and explore how to use diffusion features effectively to guide complementary, diverse feature extraction. Method: An adaptive modality masking strategy yields a modality-balanced diffusion data distribution; a multi-branch network of CNN, Mamba, and transformer branches uses the diffusion features hierarchically through feature fusion, group channel attention, and cross-attention; and a mutual learning strategy strengthens inter-branch collaboration by aligning the probability entropy and feature similarity of the subnetworks. Result: Experiments on four multimodal remote sensing datasets show that the method outperforms existing methods on land-cover classification, with better classification accuracy and feature fusion. Conclusion: BDGF mitigates modality imbalance in multimodal DDPMs and, through hierarchical diffusion-feature guidance and collaborative multi-branch learning, markedly improves classification of multimodal remote sensing data. Abstract: Deep learning-based techniques for the analysis of multimodal remote sensing data have become popular due to their ability to effectively integrate complementary spatial, spectral, and structural information from different sensors. Recently, denoising diffusion probabilistic models (DDPMs) have attracted attention in the remote sensing community due to their powerful ability to capture robust and complex spatial-spectral distributions. However, pre-training multimodal DDPMs may result in modality imbalance, and effectively leveraging diffusion features to guide complementary diversity feature extraction remains an open question. To address these issues, this paper proposes a balanced diffusion-guided fusion (BDGF) framework that leverages multimodal diffusion features to guide a multi-branch network for land-cover classification. Specifically, we propose an adaptive modality masking strategy to encourage the DDPMs to obtain a modality-balanced rather than spectral image-dominated data distribution. Subsequently, these diffusion features hierarchically guide feature extraction among CNN, Mamba, and transformer networks by integrating feature fusion, group channel attention, and cross-attention mechanisms. Finally, a mutual learning strategy is developed to enhance inter-branch collaboration by aligning the probability entropy and feature similarity of individual subnetworks. Extensive experiments on four multimodal remote sensing datasets demonstrate that the proposed method achieves superior classification performance. The code is available at https://github.com/HaoLiu-XDU/BDGF.

[269] Seeing Symbols, Missing Cultures: Probing Vision-Language Models' Reasoning on Fire Imagery and Cultural Meaning

Haorui Yu,Qiufeng Yi,Yijia Chu,Yang Zhao

Main category: cs.CV

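TL;DR: Proposes a diagnostic framework to evaluate how vision-language models classify and explain fire-themed cultural imagery, revealing that models perform well on Western festivals but show systematic biases on non-Western traditions and emergency scenes.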

Details

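Motivation: Vision-language models appear culturally competent but in fact rely on superficial pattern matching rather than genuine cultural understanding, which can lead to misclassification and potential risks. Method: Introduce a diagnostic framework that tests multiple models on Western festivals, non-Western traditions, and emergency scenes through classification and explanation analysis. Result: Models correctly identify prominent Western festivals but perform poorly on underrepresented cultural events, often giving vague labels or dangerously misclassifying emergencies as celebrations. Conclusion: The findings expose the risks of symbolic shortcuts and underline the need for cultural evaluation beyond accuracy to ensure interpretable and fair multimodal systems. Abstract: Vision-Language Models (VLMs) often appear culturally competent but rely on superficial pattern matching rather than genuine cultural understanding. We introduce a diagnostic framework to probe VLM reasoning on fire-themed cultural imagery through both classification and explanation analysis. Testing multiple models on Western festivals, non-Western traditions, and emergency scenes reveals systematic biases: models correctly identify prominent Western festivals but struggle with underrepresented cultural events, frequently offering vague labels or dangerously misclassifying emergencies as celebrations. These failures expose the risks of symbolic shortcuts and highlight the need for cultural evaluation beyond accuracy metrics to ensure interpretable and fair multimodal systems.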

Siheng Wang,Zhengdao Li,Yanshu Li,Canran Xiao,Haibo Zhan,Zhengtao Yao,Xuzhi Zhang,Jiale Kang,Linshan Li,Weiming Liu,Zhikang Dong,Jifeng Shen,Junhao Dong,Qiang Sun,Piotr Koniusz

Main category: cs.CV

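TL;DR: This paper proposes C3-OWD, a curriculum cross-modal contrastive learning framework that aims to improve both generalization to unseen categories and robustness under adverse conditions in object detection. It has two stages: Stage 1 uses RGBT data to strengthen robustness, and Stage 2 uses vision-language alignment to improve generalization, with an Exponential Moving Average (EMA) mechanism to mitigate catastrophic forgetting. Experiments show strong performance on several benchmarks.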

Details

Motivation: Existing object detection methods struggle to combine robustness and category diversity: visible-infrared detection improves robustness but generalizes poorly to new categories, while open-world detection recognizes new categories but performs poorly in extreme environments, so a method that achieves both is needed. Method: C3-OWD has two stages: Stage 1 pretrains on RGBT (visible and infrared) data to strengthen robustness under adverse conditions; Stage 2 uses vision-language alignment for open-world detection of unseen categories; and an Exponential Moving Average (EMA) mechanism prevents catastrophic forgetting between the stages, with a theoretical guarantee that pre-stage performance is preserved. Result: Experiments on FLIR, OV-COCO, and OV-LVIS give 80.1 AP^50 on FLIR, 48.6 AP^50_Novel on OV-COCO, and 35.7 mAP_r on OV-LVIS, competitive performance across both robustness and diversity evaluations. Conclusion: C3-OWD combines multimodal robustness with vision-language-driven category generalization through curriculum cross-modal contrastive learning, addressing the difficulty of achieving robustness and diversity together in open-world detection. Abstract: Object detection has advanced significantly in the closed-set setting, but real-world deployment remains limited by two challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior research has explored these issues separately: visible-infrared detection improves robustness but lacks generalization, while open-world detection leverages vision-language alignment strategy for category diversity but struggles under extreme environments. This trade-off leaves robustness and diversity difficult to achieve simultaneously. To mitigate these issues, we propose \textbf{C3-OWD}, a curriculum cross-modal contrastive learning framework that unifies both strengths. Stage~1 enhances robustness by pretraining with RGBT data, while Stage~2 improves generalization via vision-language alignment. To prevent catastrophic forgetting between the two stages, we introduce an Exponential Moving Average (EMA) mechanism that theoretically guarantees preservation of pre-stage performance with bounded parameter lag and function consistency. Experiments on FLIR, OV-COCO, and OV-LVIS demonstrate the effectiveness of our approach: C3-OWD achieves $80.1$ AP$^{50}$ on FLIR, $48.6$ AP$^{50}_{\text{Novel}}$ on OV-COCO, and $35.7$ mAP$_r$ on OV-LVIS, establishing competitive performance across both robustness and diversity evaluations. Code available at: https://github.com/justin-herry/C3-OWD.git.
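The EMA mechanism mentioned above can be illustrated with the standard exponential-moving-average parameter update. The sketch below is a generic EMA on a toy weight vector, assuming a constant per-step drift; the decay value and variable names are illustrative and the paper's exact bound may differ.

```python
def ema_update(ema_params, live_params, decay=0.999):
    """Blend the slow EMA copy toward the live weights: ema = decay*ema + (1-decay)*live."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, live_params)]

# Toy run: pretend Stage 2 training nudges every live weight by 0.01 per step.
live = [1.0, -2.0]   # stand-in for the weights at the end of Stage 1
ema = list(live)     # the EMA copy starts from the same weights
for _ in range(100):
    live = [w + 0.01 for w in live]
    ema = ema_update(ema, live, decay=0.9)

# With constant drift delta, the lag (live - ema) approaches delta*decay/(1-decay) = 0.09.
lag = [w - e for w, e in zip(live, ema)]
```

In this toy setting the EMA copy trails the live weights by a bounded amount, which is the intuition behind a "bounded parameter lag": the averaged model never drifts arbitrarily far from either the earlier or the current weights.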

[271] Spatial-Spectral Binarized Neural Network for Panchromatic and Multi-spectral Images Fusion

Yizhen Jiang,Mengting Ma,Anqi Zhu,Xiaowen Ma,Jiaxin Li,Wei Zhang

Main category: cs.CV

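TL;DR: This paper proposes S2BNet, an efficient binarized neural network for remote sensing pansharpening. Its spatial-spectral binarized convolution (S2B-Conv) mitigates spectral distortion and spatial feature degradation, substantially reducing computational complexity while maintaining performance.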

Details

Motivation: Deep learning models perform well at pansharpening but are computationally expensive and hard to deploy on resource-limited devices; moreover, direct binarization causes severe spectral distortion and contour degradation, so a binarization method that balances efficiency and accuracy is needed. Method: The S2B-Conv module comprises a Spectral-Redistribution Mechanism (SRM) and a Gabor Spatial Feature Amplifier (GSFA): SRM adjusts the spectral distribution with a dynamically learned affine transformation to reduce spectral distortion, and GSFA introduces multi-scale, multi-directional Gabor filters to strengthen spatial feature representation. A series of S2B-Conv modules forms S2BNet for efficient binarized pansharpening. Result: Experiments show that S2BNet matches or exceeds existing binarized models and some full-precision models on multiple datasets while substantially reducing parameter count and computational cost, with good visual quality and quantitative metrics. Conclusion: S2BNet demonstrates that binary neural networks are feasible for remote sensing pansharpening; the customized S2B-Conv module addresses spectral distortion and spatial feature degradation and offers an efficient solution for remote sensing image processing on edge devices. Abstract: Remote sensing pansharpening aims to reconstruct spatial-spectral properties during the fusion of panchromatic (PAN) images and low-resolution multi-spectral (LR-MS) images, finally generating the high-resolution multi-spectral (HR-MS) images. Although deep learning-based models have achieved excellent performance, they often come with high computational complexity, which hinders their application on resource-limited devices. In this paper, we explore the feasibility of applying the binary neural network (BNN) to pan-sharpening. Nevertheless, there are two main issues with binarizing pan-sharpening models: (i) the binarization will cause serious spectral distortion due to the inconsistent spectral distribution of the PAN/LR-MS images; (ii) the common binary convolution kernel is difficult to adapt to the multi-scale and anisotropic spatial features of remote sensing objects, resulting in serious degradation of contours. To address the above issues, we design the customized spatial-spectral binarized convolution (S2B-Conv), which is composed of the Spectral-Redistribution Mechanism (SRM) and Gabor Spatial Feature Amplifier (GSFA). Specifically, SRM employs an affine transformation, generating its scaling and bias parameters through a dynamic learning process. GSFA, which randomly selects different frequencies and angles within a preset range, enables better handling of multi-scale and multi-directional spatial features. A series of S2B-Conv form a brand-new binary network for pan-sharpening, dubbed S2BNet.
Extensive quantitative and qualitative experiments have shown our high-efficiency binarized pan-sharpening method can attain a promising performance.
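To make the two S2B-Conv ingredients concrete, the following pure-Python sketch shows (a) an affine shift-and-scale applied before sign binarization and (b) a Gabor kernel with a randomly drawn wavelength and orientation. This is a toy illustration under assumed parameter ranges; the function names, the fixed `scale`/`bias` values, and the sampling ranges are hypothetical and do not reproduce the authors' SRM or GSFA.

```python
import math
import random

def affine_then_binarize(values, scale, bias):
    """Shift and scale activations, then binarize to {-1, +1} with the sign function."""
    return [1 if scale * v + bias >= 0 else -1 for v in values]

def gabor_kernel(size, wavelength, theta, sigma=2.0, gamma=0.5, psi=0.0):
    """Real Gabor kernel: Gaussian envelope times a cosine carrier rotated by theta (size is odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength + psi))
        kernel.append(row)
    return kernel

def sample_gabor_bank(n, rng, size=5, wavelengths=(2.0, 8.0)):
    """Draw n kernels with random wavelength and orientation inside preset (assumed) ranges."""
    return [gabor_kernel(size, rng.uniform(*wavelengths), rng.uniform(0.0, math.pi))
            for _ in range(n)]

bank = sample_gabor_bank(4, random.Random(0))
```

Without the affine step, activations whose distribution sits far from zero would all binarize to the same sign; shifting them first keeps both signs in use, which is one intuition for why a redistribution step helps before binarization.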

[272] Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning

Hongrui Jia,Chaoya Jiang,Shikun Zhang,Wei Ye

Main category: cs.CV

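TL;DR: Proposes a training-free visual reasoning framework that decouples reasoning from perception: a large language model leads the reasoning and dynamically queries a multimodal model for visual information, reducing reasoning that drifts from the image content and improving accuracy.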

Details

Motivation: Existing large multimodal models gradually lose their grounding in visual information over long reasoning chains, so the reasoning drifts from the image content and reaches wrong conclusions. Method: Separate reasoning from perception: a large language model handles high-level reasoning and actively poses visual questions to a multimodal model to obtain the visual information it needs, while the multimodal model serves only as a visual question-answering engine. Result: Without any training or architectural changes, the method markedly reduces visually unfounded reasoning steps and improves reasoning fidelity and accuracy. Conclusion: The lightweight, plug-and-play framework improves the reliability and consistency of multimodal models on complex visual reasoning tasks. Abstract: Significant advancements in the reasoning capabilities of Large Language Models (LLMs) are now driven by test-time scaling laws, particularly those leveraging extended Chain-of-Thought (CoT) reasoning. Inspired by these breakthroughs, researchers have extended these paradigms to Large Multimodal Models (LMMs). However, a critical limitation emerges: as their reasoning chains extend, LMMs increasingly rely on textual logic, progressively losing grounding in the underlying visual information. This leads to reasoning paths that diverge from the image content, culminating in erroneous conclusions. To address this, we introduce a strikingly simple yet effective training-free visual-reasoning pipeline. The core concept is to decouple the reasoning and perception processes. A powerful LLM orchestrates the high-level reasoning, strategically interrogating an LMM to extract specific visual information required for its logical chain. The LMM, in turn, functions exclusively as a visual question-answering engine, supplying the necessary perceptual details on demand. This lightweight, plug-and-play approach requires no additional training or architectural changes. Comprehensive evaluations validate that our framework effectively governs the visual reasoning process, leading to a significant reduction in visually-unfounded reasoning steps and a substantial improvement in reasoning fidelity.
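The decoupled pipeline described above can be sketched as a simple control loop in which a text-only reasoner either asks a visual question or commits to an answer. The interface below (`reasoner`, `perceiver`, the `"ask"`/`"answer"` protocol, and the toy stand-ins) is entirely hypothetical; in a real system these callables would wrap an LLM and an LMM.

```python
def decoupled_visual_reasoning(question, image, reasoner, perceiver, max_queries=5):
    """Let the reasoner drive; the perceiver only answers visual questions about the image."""
    observations = []  # (visual question, visual answer) pairs gathered so far
    for _ in range(max_queries):
        kind, text = reasoner(question, observations)
        if kind == "answer":
            return text, observations
        observations.append((text, perceiver(image, text)))
    return None, observations  # query budget exhausted without a final answer

# Toy stand-ins so the loop can run without any model.
def toy_reasoner(question, observations):
    if not observations:
        return "ask", "How many apples are on the table?"
    return "answer", f"There are {observations[-1][1]} apples."

def toy_perceiver(image, visual_question):
    return str(image["apples"])

answer, trace = decoupled_visual_reasoning(
    "Count the apples.", {"apples": 3}, toy_reasoner, toy_perceiver)
```

Because the reasoner never sees the image directly, every visual fact it uses appears in `trace`, which makes it easy to check which reasoning steps were grounded in a perception call.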

[273] DDP: Dual-Decoupled Prompting for Multi-Label Class-Incremental Learning

Kaile Du,Zihan Ye,Junzhou Xie,Fan Lyu,Yixi Shen,Yuyang Li,Miaoxuan Zhu,Fuyuan Hu,Ling Shao,Guangcan Liu

Main category: cs.CV

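TL;DR: Proposes Dual-Decoupled Prompting (DDP), a replay-free and parameter-efficient framework that addresses semantic confusion and true-negative/false-positive confusion in multi-label class-incremental learning, with strong results on MS-COCO and PASCAL VOC.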

Details

Motivation: Existing prompt-based methods work well for single-label class-incremental learning, but in the multi-label setting their performance drops because of semantic confusion and false positives caused by partial labeling. Method: DDP decouples semantics with class-specific positive and negative prompts and introduces Progressive Confidence Decoupling (PCD) to suppress false positives; past prompts are frozen as knowledge anchors, and interlayer prompting improves efficiency. Result: On MS-COCO and PASCAL VOC, DDP clearly outperforms prior methods and is the first replay-free MLCIL method to exceed 80% mAP and 70% F1 on the standard MS-COCO B40-C10 benchmark. Conclusion: DDP addresses the key challenges of MLCIL and delivers efficient, replay-free multi-label class-incremental learning with a new level of performance. Abstract: Prompt-based methods have shown strong effectiveness in single-label class-incremental learning, but their direct extension to multi-label class-incremental learning (MLCIL) performs poorly due to two intrinsic challenges: semantic confusion from co-occurring categories and true-negative-false-positive confusion caused by partial labeling. We propose Dual-Decoupled Prompting (DDP), a replay-free and parameter-efficient framework that explicitly addresses both issues. DDP assigns class-specific positive-negative prompts to disentangle semantics and introduces Progressive Confidence Decoupling (PCD), a curriculum-inspired decoupling strategy that suppresses false positives. Past prompts are frozen as knowledge anchors, and interlayer prompting enhances efficiency. On MS-COCO and PASCAL VOC, DDP consistently outperforms prior methods and is the first replay-free MLCIL approach to exceed 80% mAP and 70% F1 under the standard MS-COCO B40-C10 benchmark.

Bin Wu,Yahui Liu,Chi Zhang,Yao Zhao,Wei Wang

Main category: cs.CV

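TL;DR: Proposes LRPO, an online reinforcement learning framework for blind face restoration that uses likelihood-regularized policy optimization with a composite reward function, ground-truth guided regularization, and noise-level advantage assignment to improve restoration and perceptual quality on low-quality inputs.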

Details

Motivation: Blind face restoration (BFR) has a large solution space, and restored images often show missing details and identity ambiguity; existing methods struggle to balance perceptual quality and fidelity. Method: The Likelihood-Regularized Policy Optimization (LRPO) framework applies online reinforcement learning with a composite reward function, ground-truth guided likelihood regularization, and noise-level advantage assignment, refining the policy network to raise the likelihood of high-quality outputs. Result: Experiments show that LRPO significantly improves face restoration quality over baseline methods and achieves state-of-the-art performance, with stronger restoration on low-quality inputs in particular. Conclusion: LRPO is the first framework to apply online reinforcement learning to BFR; its three key strategies balance perceptual quality and fidelity and markedly improve restoration. Abstract: Blind Face Restoration (BFR) encounters inherent challenges in exploring its large solution space, leading to common artifacts like missing details and identity ambiguity in the restored images. To tackle these challenges, we propose a Likelihood-Regularized Policy Optimization (LRPO) framework, the first to apply online reinforcement learning (RL) to the BFR task. LRPO leverages rewards from sampled candidates to refine the policy network, increasing the likelihood of high-quality outputs while improving restoration performance on low-quality inputs. However, directly applying RL to BFR creates incompatibility issues, producing restoration results that deviate significantly from the ground truth. To balance perceptual quality and fidelity, we propose three key strategies: 1) a composite reward function tailored for face restoration assessment, 2) ground-truth guided likelihood regularization, and 3) noise-level advantage assignment. Extensive experiments demonstrate that our proposed LRPO significantly improves the face restoration quality over baseline methods and achieves state-of-the-art performance.

[275] DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice

Zijie Meng,Jin Hao,Xiwei Dai,Yang Feng,Jiaxiang Liu,Bin Feng,Huikai Wu,Xiaotang Gai,Hengchuan Zhu,Tianxiang Hu,Yangyang Wu,Hongxia Xu,Jin Li,Jun Xiao,Xiaoqiang Liu,Joey Tianyi Zhou,Fudong Zhu,Zhihe Zhao,Lunguo Xia,Bing Fang,Jimeng Sun,Jian Wu,Zuozhu Liu

Main category: cs.CV

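TL;DR: This paper proposes DentVLM, a multimodal vision-language model for expert-level oral disease diagnosis. Trained on a large-scale bilingual dataset, it significantly outperforms existing models across imaging modalities and diagnostic tasks, surpasses junior and some senior dentists in a clinical study, and improves diagnostic efficiency.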

Details

Motivation: Existing AI models usually handle only isolated tasks and fall short of the complex multimodal requirements of clinical dental practice, so a system is needed that can interpret multiple kinds of oral imaging together and diagnose accurately. Method: DentVLM is a large-scale bilingual multimodal vision-language model built on 110,447 images and 2.46 million visual question-answering (VQA) pairs; it interprets seven 2D oral imaging modalities across 36 diagnostic tasks. Result: DentVLM achieves 19.6% higher accuracy than leading models for oral diseases and 27.9% higher for malocclusions; in a clinical study involving 25 dentists and 1,946 patients, it surpassed 13 junior dentists on 21 of 36 tasks and 12 senior dentists on 12 of 36 tasks; in a collaborative workflow it raised junior dentists' performance to senior levels and cut diagnostic time for all practitioners by 15-22%. Conclusion: DentVLM is a robust clinical decision support tool that can enhance primary dental care, mitigate provider-patient imbalances, and broaden access to specialized expertise. Abstract: Diagnosing and managing oral diseases necessitate advanced visual interpretation across diverse imaging modalities and integrated information synthesis. While current AI models excel at isolated tasks, they often fall short in addressing the complex, multimodal requirements of comprehensive clinical dental practice. Here we introduce DentVLM, a multimodal vision-language model engineered for expert-level oral disease diagnosis. DentVLM was developed using a comprehensive, large-scale, bilingual dataset of 110,447 images and 2.46 million visual question-answering (VQA) pairs. The model is capable of interpreting seven 2D oral imaging modalities across 36 diagnostic tasks, significantly outperforming leading proprietary and open-source models by 19.6% higher accuracy for oral diseases and 27.9% for malocclusions. In a clinical study involving 25 dentists, evaluating 1,946 patients and encompassing 3,105 QA pairs, DentVLM surpassed the diagnostic performance of 13 junior dentists on 21 of 36 tasks and exceeded that of 12 senior dentists on 12 of 36 tasks. When integrated into a collaborative workflow, DentVLM elevated junior dentists' performance to senior levels and reduced diagnostic time for all practitioners by 15-22%. Furthermore, DentVLM exhibited promising performance across three practical utility scenarios, including home-based dental health management, hospital-based intelligent diagnosis and multi-agent collaborative interaction.
These findings establish DentVLM as a robust clinical decision support tool, poised to enhance primary dental care, mitigate provider-patient imbalances, and democratize access to specialized medical expertise within the field of dentistry.

[276] Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling

Xiaolong Fu,Lichen Ma,Zipeng Guo,Gaojing Zhou,Chongxiao Wang,ShiPing Dong,Shizhe Zhou,Shizhe Zhou,Ximan Liu,Jingling Fu,Tan Lit Sin,Yu Shi,Zhen Chen,Junshi Huang,Jason Li

Main category: cs.CV

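TL;DR: This paper proposes Dynamic-TreeRPO, which improves exploration efficiency in text-to-image generation through tree-structured search with dynamic noise intensities, and LayerTuning-RL, which combines supervised fine-tuning with reinforcement learning, significantly improving generation quality and training efficiency at no extra computational cost.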

Details

Motivation: Existing reinforcement-learning-based text-to-image methods are limited in generation quality and training efficiency by inefficient sampling strategies and insufficient exploration. Method: Dynamic-TreeRPO builds a tree-structured search with a sliding-window strategy, combines GRPO-guided optimization with constrained SDE sampling, and sets a dynamic noise intensity for each tree layer; LayerTuning-RL further reformulates the supervised fine-tuning loss as a dynamically weighted Progress Reward Model and pairs it with dynamic-adaptive clipping bounds. Result: The approach outperforms the state of the art by 4.9%, 5.91%, and 8.66% on HPS-v2.1, PickScore, and ImageReward respectively, while improving training efficiency by nearly 50%. Conclusion: Dynamic-TreeRPO and LayerTuning-RL improve semantic consistency, visual fidelity, and human preference alignment in text-to-image generation and significantly raise training efficiency. Abstract: The integration of Reinforcement Learning (RL) into flow matching models for text-to-image (T2I) generation has driven substantial advances in generation quality. However, these gains often come at the cost of exhaustive exploration and inefficient sampling strategies due to slight variation in the sampling group. Building on this insight, we propose Dynamic-TreeRPO, which implements the sliding-window sampling strategy as a tree-structured search with dynamic noise intensities along depth. We perform GRPO-guided optimization and constrained Stochastic Differential Equation (SDE) sampling within this tree structure. By sharing prefix paths of the tree, our design effectively amortizes the computational overhead of trajectory search. With well-designed noise intensities for each tree layer, Dynamic-TreeRPO can enhance the variation of exploration without any extra computational cost. Furthermore, we seamlessly integrate Supervised Fine-Tuning (SFT) and RL paradigm within Dynamic-TreeRPO to construct our proposed LayerTuning-RL, reformulating the loss function of SFT as a dynamically weighted Progress Reward Model (PRM) rather than a separate pretraining method. By associating this weighted PRM with dynamic-adaptive clipping bounds, the disruption of exploration process in Dynamic-TreeRPO is avoided. Benefiting from the tree-structured sampling and the LayerTuning-RL paradigm, our model dynamically explores a diverse search space along effective directions.
Compared to existing baselines, our approach demonstrates significant superiority in terms of semantic consistency, visual fidelity, and human preference alignment on established benchmarks, including HPS-v2.1, PickScore, and ImageReward. In particular, our model outperforms SoTA by $4.9\%$, $5.91\%$, and $8.66\%$ on those benchmarks, respectively, while improving the training efficiency by nearly $50\%$.
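The claim that sharing prefix paths amortizes the cost of trajectory search can be checked with simple counting. The sketch below compares the number of denoising calls for a tree with shared prefixes against the same number of fully independent trajectories; the branching schedule is a made-up example and the functions are illustrative only, not the authors' sampler.

```python
def tree_sampling_cost(branching):
    """branching[i] = children spawned per node at denoising step i (1 means no branching)."""
    nodes, total = 1, 0
    for b in branching:
        nodes *= b
        total += nodes  # one denoising call per node alive at this step
    return total, nodes  # (calls with shared prefixes, number of leaf trajectories)

def independent_cost(branching):
    """Calls needed if every leaf trajectory were sampled from scratch."""
    _, leaves = tree_sampling_cost(branching)
    return leaves * len(branching)

schedule = [2, 1, 2, 1]  # hypothetical: branch at steps 0 and 2 only
shared, leaves = tree_sampling_cost(schedule)
separate = independent_cost(schedule)
```

For this schedule the tree yields 4 trajectories with 12 calls instead of 16; the later the branching happens, the longer the shared prefix and the larger the saving.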

[277] Test-time Uncertainty Estimation for Medical Image Registration via Transformation Equivariance

Lin Tian,Xiaoling Hu,Juan Eugenio Iglesias

Main category: cs.CV

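TL;DR: Proposes a test-time uncertainty estimation framework that works with any pretrained medical image registration network; it analyzes prediction variance through transformation equivariance, decomposes it into an intrinsic spread and a bias jitter, and identifies regions with high registration error.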

Details

Motivation: Existing uncertainty estimation methods require architectural changes or retraining and are hard to apply to pretrained registration networks, yet accurate uncertainty estimates are essential for safe clinical deployment. Method: Building on the transformation equivariance of registration, the authors analyze the variance of network predictions under spatial perturbations of the input and derive a theoretical decomposition of perturbation-based uncertainty into an intrinsic spread term and a bias jitter term. Result: Across four anatomical structures and multiple registration models, the uncertainty maps correlate consistently with registration errors and highlight regions that require caution. Conclusion: The framework turns any pretrained registration network into a risk-aware tool at test time, moving medical image registration closer to safe clinical use. Abstract: Accurate image registration is essential for downstream applications, yet current deep registration networks provide limited indications of whether and when their predictions are reliable. Existing uncertainty estimation strategies, such as Bayesian methods, ensembles, or MC dropout, require architectural changes or retraining, limiting their applicability to pretrained registration networks. Instead, we propose a test-time uncertainty estimation framework that is compatible with any pretrained networks. Our framework is grounded in the transformation equivariance property of registration, which states that the true mapping between two images should remain consistent under spatial perturbations of the input. By analyzing the variance of network predictions under such perturbations, we derive a theoretical decomposition of perturbation-based uncertainty in registration. This decomposition separates into two terms: (i) an intrinsic spread, reflecting epistemic noise, and (ii) a bias jitter, capturing how systematic error drifts under perturbations. Across four anatomical structures (brain, cardiac, abdominal, and lung) and multiple registration models (uniGradICON, SynthMorph), the uncertainty maps correlate consistently with registration errors and highlight regions requiring caution. Our framework turns any pretrained registration network into a risk-aware tool at test time, placing medical image registration one step closer to safe deployment in clinical and large-scale research settings.
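The equivariance idea can be illustrated on a 1D toy problem in which "registration" means estimating the translation between two point sets. The sketch below perturbs the moving points by known shifts, maps each prediction back to the unperturbed frame, and reports the variance; the two toy models and all names are hypothetical, and the paper's split into intrinsic spread and bias jitter is not reproduced here.

```python
import statistics

def equivariance_uncertainty(predict_shift, moving, fixed, perturbations):
    """Variance of predictions after undoing each known perturbation of the moving points."""
    corrected = []
    for t in perturbations:
        perturbed = [x + t for x in moving]  # known spatial perturbation of the input
        # Shifting the input by t lowers the true shift by t, so add t back to compare.
        corrected.append(predict_shift(perturbed, fixed) + t)
    return statistics.pvariance(corrected)

def ideal_model(moving, fixed):
    """Perfectly equivariant toy registrar: the difference of centroids."""
    return statistics.mean(fixed) - statistics.mean(moving)

def coarse_model(moving, fixed):
    """Flawed toy registrar that can only output whole-number shifts."""
    return float(round(ideal_model(moving, fixed)))

moving, fixed = [0.0, 1.0, 2.0], [1.5, 2.5, 3.5]
shifts = [-0.4, 0.0, 0.3]
u_ideal = equivariance_uncertainty(ideal_model, moving, fixed, shifts)
u_coarse = equivariance_uncertainty(coarse_model, moving, fixed, shifts)
```

The equivariant model gives essentially zero variance, while the coarse model's predictions disagree once the perturbations are undone; that disagreement is the test-time signal the framework turns into an uncertainty map, with no retraining required.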

[278] GRAPE: Let GPRO Supervise Query Rewriting by Ranking for Retrieval

Zhaohua Zhang,Jianhuan Zhuo,Muxi Chen,Chenchen Zhao,Wenyu Jiang,Tianwen Jiang,Mingyang Chen,Yu Tang,Qiuyong Xiao,Jihong Zhang,Zhixun Su

Main category: cs.CV

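TL;DR: This paper proposes GRAPE, a plug-and-play retrieval enhancement method that uses a ranking-aware reward to improve query rewriting for CLIP under distribution shift, mitigates score inflation, and significantly improves retrieval in multilingual, long-text, and cross-modal scenarios.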

Details

Motivation: CLIP performs poorly when inputs diverge from its training distribution, and existing LLM-based query rewriting methods lack supervision signals and so fail to generate the optimal query; a method is needed that aligns distribution gaps without retraining. Method: GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement) uses GRPO with retrieval ranking signals to guide LLM query rewriting, and designs a corpus-relative ranking-based reward to avoid the score inflation that arises when optimizing similarity scores alone. Result: Across several distribution-shift scenarios (multilingual, long-text, cross-modal), GRAPE improves Recall@10 by 4.9% on average and clearly outperforms baseline methods. Conclusion: By introducing a ranking-aware reward, GRAPE improves LLM query rewriting under distribution shift and serves as a general, efficient retrieval enhancement framework. Abstract: The CLIP model has become a cornerstone of large-scale retrieval systems by aligning text and image data in a unified embedding space. Despite its simplicity and efficiency, CLIP struggles when applied to tasks whose input distributions diverge from its training corpus, such as queries with multilingual, long-form, or multimodal differences. To avoid costly retraining, existing methods mainly adopt query-rewriting strategies with large language models (LLMs), aiming to mitigate distribution gaps at the query level. However, due to the lack of supervision signals, LLMs fail to generate the optimal query that fits the training distribution. We address this challenge with GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play enhancement approach that incorporates ranking signals into retrieval-guided query rewriting with LLMs. Intuitively, GRAPE proposes to leverage GRPO to bridge distributional differences -- including length, multilingual, and modality shifts -- by transforming queries into forms better aligned with the retriever's training distribution. However, our preliminary experiment finds that naively finetuning LLM with similarity scores can lead to score inflation, where nearly all candidates are assigned unexpectedly high scores regardless of their true relevance. To address score inflation, we propose a corpus-relative ranking-based reward, which explicitly aligns optimization with ranking metrics while suppressing spurious score inflation.
Extensive experiments demonstrate that GRAPE consistently improves retrieval performance under distributional shifts -- including multilingual differences (Flickr30k-CN, CVLUE, XM3600), length differences (Wikipedia), and multimodal differences (CIRR) -- achieving an average improvement of 4.9% in Recall@10. The code is available at https://github.com/Chinese0123456/GRAPE.git
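The score-inflation problem and a ranking-based remedy can be illustrated with a small sketch. The reciprocal-rank reward below is an assumed stand-in for the paper's corpus-relative reward, whose exact form is not given in the abstract; the group-normalized advantage follows the usual GRPO formulation (reward minus group mean, divided by group standard deviation).

```python
import statistics

def rank_reward(target_score, other_scores):
    """Reciprocal rank of the target among the corpus; ignores absolute score magnitude."""
    rank = 1 + sum(1 for s in other_scores if s > target_score)
    return 1.0 / rank

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward against its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A rewrite that inflates every similarity still ranks the target third,
# while a rewrite with modest scores can rank it first.
inflated = rank_reward(0.90, [0.95, 0.93, 0.20])
modest = rank_reward(0.60, [0.10, 0.20])
advantages = group_relative_advantages([modest, inflated, 0.5, 0.5])
```

A raw-similarity reward would prefer the first rewrite (0.90 versus 0.60), whereas the rank-based reward prefers the second, which is the behavior a ranking-aligned objective is meant to encourage.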

[279] CasPoinTr: Point Cloud Completion with Cascaded Networks and Knowledge Distillation

Yifan Yang,Yuxiang Yan,Boda Liu,Jian Pu

Main category: cs.CV

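TL;DR: Proposes CasPoinTr, a new method that uses cascaded networks and knowledge distillation to improve both overall shape prediction and detail recovery in point cloud completion.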

Details

Motivation: Point clouds from real environments are often incomplete because of sensor limits, occlusion, and similar factors, and must be completed to support downstream applications; existing methods struggle to recover overall shape and detail from highly incomplete inputs. Method: CasPoinTr is a two-stage cascaded framework: the first stage (Shape Reconstruction) generates auxiliary information, and the second stage (Fused Completion) adds knowledge distillation, in which a teacher model trained on denser point clouds transfers incomplete-complete associative knowledge to the student model to improve completion quality. Result: Experiments on ShapeNet-55 under several difficulty settings show that CasPoinTr outperforms existing methods in shape recovery and detail preservation. Conclusion: The cascaded structure and distillation strategy strengthen the model's ability to capture global shape context and reconstruct local details, markedly improving point cloud completion. Abstract: Point clouds collected from real-world environments are often incomplete due to factors such as limited sensor resolution, single viewpoints, occlusions, and noise. These challenges make point cloud completion essential for various applications. A key difficulty in this task is predicting the overall shape and reconstructing missing regions from highly incomplete point clouds. To address this, we introduce CasPoinTr, a novel point cloud completion framework using cascaded networks and knowledge distillation. CasPoinTr decomposes the completion task into two synergistic stages: Shape Reconstruction, which generates auxiliary information, and Fused Completion, which leverages this information alongside knowledge distillation to generate the final output. Through knowledge distillation, a teacher model trained on denser point clouds transfers incomplete-complete associative knowledge to the student model, enhancing its ability to estimate the overall shape and predict missing regions. Together, the cascaded networks and knowledge distillation enhance the model's ability to capture global shape context while refining local details, effectively bridging the gap between incomplete inputs and complete targets. Experiments on ShapeNet-55 under different difficulty settings demonstrate that CasPoinTr outperforms existing methods in shape recovery and detail preservation, highlighting the effectiveness of our cascaded structure and distillation strategy.

[280] UniPose: Unified Cross-modality Pose Prior Propagation towards RGB-D data for Weakly Supervised 3D Human Pose Estimation

Jinghong Zheng,Changlong Jiang,Jiaqi Li,Haohong Kuang,Hang Xu,Tingbing Yan

Main category: cs.CV

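TL;DR: UniPose proposes a unified cross-modality pose prior propagation method for weakly supervised 3D human pose estimation; using unannotated single-view RGB-D sequences and self-supervised learning, it transfers 2D pose annotations from large-scale RGB datasets to the 3D domain without manual 3D keypoint annotation.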

Details

Motivation: Reduce the dependence of 3D human pose estimation on large amounts of manually annotated data, while avoiding the difficulties of multi-view camera calibration and synthetic-to-real transfer. Method: Off-the-shelf 2D pose estimates serve as weak supervision for point cloud networks, combined with spatial-temporal constraints (such as body symmetry and joint motion) and strengthened by a 2D-to-3D back-projection loss and cross-modality interaction; an anchor-to-joint prediction method performs the 3D lifting. Result: On CMU Panoptic and ITOP, performance is comparable to fully supervised methods; adding large-scale unlabeled data (e.g., NTU RGB+D 60) further improves results under challenging conditions, and the 3D lifting method reaches state-of-the-art results. Conclusion: UniPose transfers knowledge from 2D to 3D pose estimation, reduces the reliance on 3D annotations, and shows potential for practical applications. Abstract: In this paper, we present UniPose, a unified cross-modality pose prior propagation method for weakly supervised 3D human pose estimation (HPE) using unannotated single-view RGB-D sequences (RGB, depth, and point cloud data). UniPose transfers 2D HPE annotations from large-scale RGB datasets (e.g., MS COCO) to the 3D domain via self-supervised learning on easily acquired RGB-D sequences, eliminating the need for labor-intensive 3D keypoint annotations. This approach bridges the gap between 2D and 3D domains without suffering from issues related to multi-view camera calibration or synthetic-to-real data shifts. During training, UniPose leverages off-the-shelf 2D pose estimations as weak supervision for point cloud networks, incorporating spatial-temporal constraints like body symmetry and joint motion. The 2D-to-3D back-projection loss and cross-modality interaction further enhance this process. By treating the point cloud network's 3D HPE results as pseudo ground truth, our anchor-to-joint prediction method performs 3D lifting on RGB and depth networks, making it more robust against inaccuracies in 2D HPE results compared to state-of-the-art methods. Experiments on CMU Panoptic and ITOP datasets show that UniPose achieves comparable performance to fully supervised methods. Incorporating large-scale unlabeled data (e.g., NTU RGB+D 60) enhances its performance under challenging conditions, demonstrating its potential for practical applications. Our proposed 3D lifting method also achieves state-of-the-art results.
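Two of the weak-supervision signals named above, agreement between 3D joints and 2D pose estimates and a body symmetry constraint, can be sketched with a pinhole camera model. The code below is a generic illustration: the reprojection-style consistency term, the intrinsics, and the joint indices are assumptions, and it is not the paper's back-projection loss.

```python
import math

def project(point3d, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (z > 0) to pixel coordinates."""
    x, y, z = point3d
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_loss(joints3d, joints2d, fx, fy, cx, cy):
    """Mean pixel distance between projected 3D joints and the 2D pose estimates."""
    total = 0.0
    for p3, p2 in zip(joints3d, joints2d):
        u, v = project(p3, fx, fy, cx, cy)
        total += math.hypot(u - p2[0], v - p2[1])
    return total / len(joints3d)

def symmetry_loss(joints3d, left_bones, right_bones):
    """Mean absolute length difference between paired left and right bones."""
    total = 0.0
    for (la, lb), (ra, rb) in zip(left_bones, right_bones):
        left = math.dist(joints3d[la], joints3d[lb])
        right = math.dist(joints3d[ra], joints3d[rb])
        total += abs(left - right)
    return total / len(left_bones)

# Hypothetical three-joint skeleton: neck, left shoulder, right shoulder (metres).
joints3d = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0), (-0.5, 0.0, 2.0)]
joints2d = [(320.0, 240.0), (445.0, 240.0), (195.0, 240.0)]
reproj = reprojection_loss(joints3d, joints2d, 500.0, 500.0, 320.0, 240.0)
sym = symmetry_loss(joints3d, [(0, 1)], [(0, 2)])
```

Both terms need only 2D estimates and a skeleton definition, which is why constraints of this kind can supervise a 3D network without any 3D keypoint labels.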

[281] Generative Modeling of Shape-Dependent Self-Contact Human Poses

Takehiko Ohkawa,Jihyun Lee,Shunsuke Saito,Jason Saragih,Fabian Prado,Yichen Xu,Shoou-I Yu,Ryosuke Furuta,Yoichi Sato,Takaaki Shiratori

Main category: cs.CV

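TL;DR: This paper introduces Goliath-SC, the first large-scale self-contact dataset with precise body shape registration, and builds a self-contact prior conditioned on body shape parameters using a body-part-wise latent diffusion model, improving the accuracy of single-view human pose estimation in self-contact cases.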

Details

Motivation: Existing self-contact datasets lack diverse self-contact poses and precise body shape information, which limits analysis of the relationship between self-contact poses and body shape. A dataset combining rich poses with precise shapes, and a matching modeling approach, is therefore needed. Method: The authors build Goliath-SC, a large-scale dataset of 383K self-contact poses across 130 subjects; propose a body-part-wise latent diffusion model with self-attention that models a self-contact prior conditioned on body shape parameters; and incorporate this prior into single-view human pose estimation, refining estimated poses so that they satisfy self-contact. Result: Experiments suggest that conditioning on body shape is vital to modeling the self-contact pose distribution and yields more accurate single-view pose estimation in self-contact than baselines without shape conditioning. Conclusion: Body shape plays a key role in self-contact pose modeling, a generative prior with precise shape information helps pose estimation, and Goliath-SC is a resource for future work. Abstract: One can hardly model self-contact of human poses without considering underlying body shapes. For example, the pose of rubbing a belly for a person with a low BMI leads to penetration of the hand into the belly for a person with a high BMI. Despite its relevance, existing self-contact datasets lack the variety of self-contact poses and precise body shapes, limiting conclusive analysis between self-contact poses and shapes. To address this, we begin by introducing the first extensive self-contact dataset with precise body shape registration, Goliath-SC, consisting of 383K self-contact poses across 130 subjects. Using this dataset, we propose generative modeling of self-contact prior conditioned by body shape parameters, based on a body-part-wise latent diffusion with self-attention. We further incorporate this prior into single-view human pose estimation while refining estimated poses to be in contact. Our experiments suggest that shape conditioning is vital to the successful modeling of self-contact pose distribution, hence improving single-view pose estimation in self-contact.

[282] WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving

Ziyue Zhu,Zhanqian Wu,Zhenxin Zhu,Lijun Zhou,Haiyang Sun,Bing Wan,Kun Ma,Guang Chen,Hangjun Ye,Jin Xie,Jian Yang

Main category: cs.CV

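TL;DR: This paper proposes WorldSplat, a feed-forward framework that generates high-quality, spatially and temporally consistent multi-view driving-scene videos by combining the strengths of generation and reconstruction methods.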

Details

Motivation: Existing driving-scene generation methods are limited in novel-view synthesis by weak 3D consistency and sparse viewpoint coverage, while reconstruction methods lack generative capability; a method that offers both generation and high-quality novel-view synthesis is needed. Method: WorldSplat first uses a 4D-aware latent diffusion model to produce pixel-aligned 4D Gaussians in a feed-forward manner, then refines the novel-view videos rendered from these Gaussians with an enhanced video diffusion model. Result: Experiments on benchmark datasets show that WorldSplat generates high-fidelity, temporally and spatially consistent multi-track novel-view driving videos. Conclusion: WorldSplat combines the advantages of generative models and reconstruction methods, balancing 4D driving-scene generation with high-quality novel-view synthesis. Abstract: Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: (i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. (ii) Subsequently, we refine the novel-view videos rendered from these Gaussians using an enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that WorldSplat effectively generates high-fidelity, temporally and spatially consistent multi-track novel-view driving videos.

[283] Enhanced Fracture Diagnosis Based on Critical Regional and Scale Aware in YOLO

Yuyang Sun,Junchuan Yu,Cuiming Zou

Main category: cs.CV

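TL;DR: This paper proposes Fracture-YOLO, an improved YOLO model for fracture detection that adds a CRSelector attention mechanism and Scale-Aware heads, improving detection precision over the baseline and reaching SOTA performance.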

Details

Motivation: Traditional fracture diagnosis relies on physician experience, which limits its speed and accuracy; an efficient and accurate automated detection method is needed. Method: Built on the YOLO framework, Fracture-YOLO integrates a CRSelector module, which uses global texture information to focus on critical fracture regions, and an ScA module, which dynamically adjusts the weights of multi-scale features. Result: Compared with the baseline model, mAP50 increases by 4 and mAP50-95 by 3, reaching state-of-the-art performance. Conclusion: By introducing an attention mechanism and a multi-scale optimization strategy, Fracture-YOLO performs strongly on fracture detection and has potential for clinical decision support. Abstract: Fracture detection plays a critical role in medical imaging analysis. Traditional fracture diagnosis relies on visual assessment by experienced physicians; however, the speed and accuracy of this approach are constrained by the physician's expertise. With the rapid advancements in artificial intelligence, deep learning models based on the YOLO framework have been widely employed for fracture detection, demonstrating significant potential in improving diagnostic efficiency and accuracy. This study proposes an improved YOLO-based model, termed Fracture-YOLO, which integrates novel Critical-Region-Selector Attention (CRSelector) and Scale-Aware (ScA) heads to further enhance detection performance. Specifically, the CRSelector module utilizes global texture information to focus on critical features of fracture regions. Meanwhile, the ScA module dynamically adjusts the weights of features at different scales, enhancing the model's capacity to identify fracture targets at multiple scales. Experimental results demonstrate that, compared to the baseline model, Fracture-YOLO achieves a significant improvement in detection precision, with mAP50 and mAP50-95 increasing by 4 and 3, respectively, achieving state-of-the-art (SOTA) performance.

[284] FracDetNet: Advanced Fracture Detection via Dual-Focus Attention and Multi-scale Calibration in Medical X-ray Imaging

Yuyang Sun,Cuiming Zou

Main category: cs.CV

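TL;DR: This paper proposes FracDetNet, a fracture detection framework that combines Dual-Focus Attention (DFA) and Multi-scale Calibration (MC) modules to better detect subtle and morphologically diverse fractures in medical images, reaching a mAP$_{50-95}$ of 40.0% on the GRAZPEDWRI-DX dataset, a 7.5% improvement over the baseline.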

Details

Motivation: Existing methods detect subtle, morphologically diverse fractures poorly when imaging angles vary and image quality is suboptimal, which hurts clinical diagnostic efficiency; a more robust fracture detection model is needed. Method: FracDetNet introduces a Dual-Focus Attention (DFA) module that captures local detail and global context together, and a Multi-scale Calibration (MC) module that adaptively refines feature representations. Result: On the GRAZPEDWRI-DX dataset, FracDetNet reaches a mAP$_{50-95}$ of 40.0% (+7.5%) and a mAP$_{50}$ of 63.9% (+4.2%), and fracture-specific detection accuracy improves by 2.9%. Conclusion: With the DFA and MC modules, FracDetNet improves the precision and robustness of fracture detection, particularly on low-quality images or fractures with complex morphology, and has clinical potential. Abstract: In this paper, an advanced fracture detection framework, FracDetNet, is proposed to address challenges in medical imaging, as accurate fracture detection is essential for enhancing diagnostic efficiency in clinical practice. Despite recent advancements, existing methods still struggle with detecting subtle and morphologically diverse fractures due to variable imaging angles and suboptimal image quality. To overcome these limitations, FracDetNet integrates Dual-Focus Attention (DFA) and Multi-scale Calibration (MC). Specifically, the DFA module effectively captures detailed local features and comprehensive global context through combined global and local attention mechanisms. Additionally, the MC adaptively refines feature representations to enhance detection performance. Experimental evaluations on the publicly available GRAZPEDWRI-DX dataset demonstrate state-of-the-art performance, with FracDetNet achieving a mAP$_{50-95}$ of 40.0%, reflecting a 7.5% improvement over the baseline model. Furthermore, the mAP$_{50}$ reaches 63.9%, representing an increase of 4.2%, with fracture-specific detection accuracy also enhanced by 2.9%.

[285] SPIKE-RL: Video-LLMs meet Bayesian Surprise

Sahithya Ravi,Aditya Chinchure,Raymond T. Ng,Leonid Sigal,Vered Shwartz

Main category: cs.CV

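TL;DR: This paper introduces SPIKE, a framework that quantifies Bayesian Surprise to identify key moments in a video that conflict with prior beliefs, enabling more effective frame sampling. The follow-up SPIKE-RL optimizes belief hypotheses with reinforcement learning and consistently outperforms uniform sampling on several downstream tasks.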

Details

Motivation: Existing Video-LLMs usually sample frames uniformly and so tend to miss the key moments that define a video's narrative. The authors want models that can perceive and localize surprising events in order to understand video content better. Method: SPIKE quantifies surprise as the size of the belief update caused by new visual evidence; SPIKE-RL uses GRPO to optimize belief generation from a reward signal based on the video caption, and both guide query-agnostic, surprise-weighted frame sampling. Result: SPIKE correlates strongly with human judgments on positive (FunQA) and negative (Oops!) surprise benchmarks; sampling guided by SPIKE/SPIKE-RL outperforms uniform sampling on five downstream benchmarks. Conclusion: By adding belief tracking and surprise registration, this work offers a path toward more robust Video-LLMs that revise their understanding as new information arrives. Abstract: Real-world videos often show routine activities punctuated by memorable, surprising events. However, most Video-LLMs process videos by sampling frames uniformly, likely missing critical moments that define a video's narrative. We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream, identifying moments where new visual evidence conflicts with prior beliefs. SPIKE effectively localizes surprise in videos, strongly correlated with humans on positive (FunQA) and negative (Oops!) surprise benchmarks. Since the beliefs of zero-shot Video-LLMs are often suboptimal, we develop SPIKE-RL, which leverages GRPO to optimize belief hypotheses based on a reward signal from the video caption. SPIKE and SPIKE-RL guide query-agnostic surprise-weighted frame sampling, which allocates more frames to interesting moments in the video. With this strategy, we achieve consistent performance gains on five downstream benchmarks over uniform sampling. By enabling Video-LLMs to track beliefs and register surprise, our work paves the way for more robust models that can revise their understanding in response to new information.
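
To make the idea of surprise-weighted frame sampling concrete, here is a minimal sketch. It assumes Bayesian Surprise is measured as the KL divergence from the prior to the posterior over a discrete set of belief hypotheses, which is the conventional definition; the abstract does not give SPIKE's exact formula. The allocation rule (one guaranteed frame per segment, remainder split in proportion to surprise) is likewise an illustrative assumption.

```python
import math


def bayesian_surprise(prior, posterior, eps=1e-12):
    """KL(posterior || prior) over a discrete set of belief hypotheses."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(posterior, prior)
        if p > 0
    )


def allocate_frames(surprise, budget):
    """Split `budget` frames across video segments in proportion to surprise.

    Every segment keeps at least one frame (an assumption of this sketch);
    leftover frames go to the largest fractional shares.
    """
    n = len(surprise)
    if budget < n:
        raise ValueError("budget must cover one frame per segment")
    remaining = budget - n
    total = sum(surprise)
    if total <= 0:
        shares = [remaining / n] * n
    else:
        shares = [remaining * s / total for s in surprise]
    floors = [int(x) for x in shares]
    counts = [1 + f for f in floors]
    leftover = remaining - sum(floors)
    order = sorted(range(n), key=lambda i: shares[i] - floors[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

With this rule a segment whose beliefs barely change still contributes one frame, while segments with large belief updates absorb most of the budget.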

[286] FM-SIREN & FM-FINER: Nyquist-Informed Frequency Multiplier for Implicit Neural Representation with Periodic Activation

Mohammed Alsakabi,Wael Mobeirek,John M. Dolan,Ozan K. Tonguz

Main category: cs.CV

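TL;DR: FM-SIREN and FM-FINER use Nyquist-informed, neuron-specific frequency multipliers to reduce feature redundancy, improving the expressive capacity and reconstruction quality of implicit neural representations.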

Details

Motivation: Existing implicit neural representation networks with periodic activations suffer from redundant hidden features, which limits the expressive capacity of the MLP. Method: Inspired by the Discrete Sine Transform, each neuron is assigned its own Nyquist-informed frequency multiplier, introducing frequency diversity. Result: Feature redundancy drops by nearly 50%, and the models outperform their baselines on 1D audio, 2D image, and 3D shape fitting and on NeRF synthesis. Conclusion: The method needs no hyperparameter tuning or extra network depth, and is a simple way to improve INR performance while maintaining efficiency. Abstract: Existing periodic activation-based implicit neural representation (INR) networks, such as SIREN and FINER, suffer from hidden feature redundancy, where neurons within a layer capture overlapping frequency components due to the use of a fixed frequency multiplier. This redundancy limits the expressive capacity of multilayer perceptrons (MLPs). Drawing inspiration from classical signal processing methods such as the Discrete Sine Transform (DST), we propose FM-SIREN and FM-FINER, which assign Nyquist-informed, neuron-specific frequency multipliers to periodic activations. Unlike existing approaches, our design introduces frequency diversity without requiring hyperparameter tuning or additional network depth. This simple yet principled modification reduces the redundancy of features by nearly 50% and consistently improves signal reconstruction across diverse INR tasks, including fitting 1D audio, 2D image and 3D shape, and synthesis of neural radiance fields (NeRF), outperforming their baseline counterparts while maintaining efficiency.
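
The difference between a fixed and a neuron-specific frequency multiplier can be sketched as follows. The abstract does not give the exact assignment rule, so the evenly spaced multipliers below, capped at a caller-supplied `omega_max` that stands in for a Nyquist-derived bound, are an assumption of this sketch; `frequency_multipliers` and `periodic_layer` are hypothetical names.

```python
import math


def frequency_multipliers(num_neurons, omega_max):
    """Evenly spaced multipliers omega_max * k / num_neurons for k = 1..num_neurons.

    omega_max is assumed to come from the Nyquist limit of the sampled
    signal; the even spacing is illustrative, not the paper's rule.
    """
    return [omega_max * (k + 1) / num_neurons for k in range(num_neurons)]


def periodic_layer(x, weights, biases, omegas):
    """One sine-activated layer: neuron i outputs sin(omegas[i] * (w_i . x + b_i)).

    A SIREN-style layer corresponds to all entries of `omegas` being equal.
    """
    return [
        math.sin(omega * (sum(w * xi for w, xi in zip(row, x)) + b))
        for row, b, omega in zip(weights, biases, omegas)
    ]
```

Passing a constant list such as `[30.0] * num_neurons` as `omegas` recovers the fixed-multiplier case, so the same layer can be used to compare the two settings.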

[287] FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing

Tanawan Premsri,Parisa Kordjamshidi

Main category: cs.CV

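TL;DR: This paper proposes FoR-SALE, a frame-of-reference-guided spatial adjustment method for text-to-image generation that maps language and vision to a unified perspective in order to correct misaligned spatial relations in images, improving existing models on spatial understanding tasks.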

Details

Motivation: Existing text-to-image models show a performance gap when spatial descriptions are given from perspectives other than the camera, and lack effective modeling of the Frame of Reference that humans commonly use. Method: FoR-SALE extends the SLD framework. Vision modules extract the spatial configuration of the image, and the spatial expressions in the text are mapped to the corresponding camera perspective, so that text-image alignment can be evaluated in a unified perspective; when misalignment is detected, latent-space operations are generated and applied to adjust the facing direction and depth in the image. Result: On two benchmarks designed to assess spatial understanding with FoR, FoR-SALE improves state-of-the-art T2I models by up to 5.3% with a single round of correction. Conclusion: FoR-SALE integrates the Frame of Reference into T2I models, strengthening their handling of spatial descriptions from multiple perspectives and narrowing the gap to human spatial reasoning. Abstract: Frame of Reference (FoR) is a fundamental concept in spatial reasoning that humans utilize to comprehend and describe space. With the rapid progress in multimodal language models, the moment has come to integrate this long-overlooked dimension into these models. In particular, in text-to-image (T2I) generation, even state-of-the-art models exhibit a significant performance gap when spatial descriptions are provided from perspectives other than the camera. To address this limitation, we propose Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing (FoR-SALE), an extension of the Self-correcting LLM-controlled Diffusion (SLD) framework for T2I. FoR-SALE evaluates the alignment between a given text and an initially generated image, and refines the image based on the Frame of Reference specified in the spatial expressions. It employs vision modules to extract the spatial configuration of the image, while simultaneously mapping the spatial expression to a corresponding camera perspective. This unified perspective enables direct evaluation of alignment between language and vision. When misalignment is detected, the required editing operations are generated and applied. FoR-SALE applies novel latent-space operations to adjust the facing direction and depth of the generated images. We evaluate FoR-SALE on two benchmarks specifically designed to assess spatial understanding with FoR. Our framework improves the performance of state-of-the-art T2I models by up to 5.3% using only a single round of correction.

[288] 3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras

Tharindu Ekanayake,Constantino Álvarez Casado,Miguel Bordallo López

Main category: cs.CV

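TL;DR: 3DPCNet is a compact, estimator-agnostic module that removes the viewpoint dependence of monocular 3D pose estimation by transforming 3D joint coordinates into a consistent, body-centered canonical frame.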

Details

Motivation: Monocular 3D pose estimation produces camera-centered skeletons, so kinematic signals depend on the viewpoint, which complicates comparative analysis in fields such as health and sports science. A method is needed to bring poses into a consistent body-centered frame. Method: A hybrid encoder fuses local skeletal features from a graph convolutional network with global context from a transformer through gated cross-attention; the model predicts a continuous 6D rotation that is mapped to an $SO(3)$ matrix to align the pose; training is self-supervised on the MM-Fi dataset using synthetically rotated poses, with a composite loss covering rotation accuracy and pose reconstruction. Result: On the MM-Fi benchmark, mean rotation error falls from over 20$^{\circ}$ to 3.4$^{\circ}$ and Mean Per Joint Position Error from about 64 mm to 47 mm; qualitative evaluation on TotalCapture shows that acceleration signals derived from video correspond well visually to ground-truth IMU sensor data. Conclusion: 3DPCNet removes the effect of viewpoint changes, yields physically plausible motion analysis, works with different 3D pose estimators, and makes cross-view motion analysis more reliable. Abstract: Monocular 3D pose estimators produce camera-centered skeletons, creating view-dependent kinematic signals that complicate comparative analysis in applications such as health and sports science. We present 3DPCNet, a compact, estimator-agnostic module that operates directly on 3D joint coordinates to rectify any input pose into a consistent, body-centered canonical frame. Its hybrid encoder fuses local skeletal features from a graph convolutional network with global context from a transformer via a gated cross-attention mechanism. From this representation, the model predicts a continuous 6D rotation that is mapped to an $SO(3)$ matrix to align the pose. We train the model in a self-supervised manner on the MM-Fi dataset using synthetically rotated poses, guided by a composite loss ensuring both accurate rotation and pose reconstruction. On the MM-Fi benchmark, 3DPCNet reduces the mean rotation error from over 20$^{\circ}$ to 3.4$^{\circ}$ and the Mean Per Joint Position Error from ~64 mm to 47 mm compared to a geometric baseline. Qualitative evaluations on the TotalCapture dataset further demonstrate that our method produces acceleration signals from video that show strong visual correspondence to ground-truth IMU sensor data, confirming that our module removes viewpoint variability to enable physically plausible motion analysis.
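
The step of mapping a continuous 6D rotation to an $SO(3)$ matrix is commonly done by Gram-Schmidt orthogonalization of two predicted 3-vectors. The abstract does not state which construction 3DPCNet uses, so the sketch below assumes that common one; `rotation_from_6d` and `mpjpe` are illustrative names, and the MPJPE helper simply averages per-joint Euclidean distances.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(a1, a2):
    """Map two 3-vectors (the 6D representation) to a 3x3 rotation matrix.

    Gram-Schmidt: b1 = normalize(a1), b2 = normalize(a2 - (b1 . a2) b1),
    b3 = b1 x b2. The vectors b1, b2, b3 become the matrix columns.
    """
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def mpjpe(pred_joints, true_joints):
    """Mean Per Joint Position Error: average Euclidean distance per joint."""
    dists = [math.dist(p, t) for p, t in zip(pred_joints, true_joints)]
    return sum(dists) / len(dists)
```

Because the output is orthonormal by construction for any non-degenerate input, a network can regress the six numbers freely without a separate projection step.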

[289] No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation

Mohammad Hossein Sameti,Amir M. Mansourian,Arash Marioriyad,Soheil Fadaee Oshyani,Mohammad Hossein Rohban,Mahdieh Soleymani Baghshah

Main category: cs.CV

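TL;DR: A fine-grained test-time optimization framework that decomposes the prompt into semantic concepts and evaluates alignment at both the global and concept levels, improving compositional faithfulness in text-to-image generation.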

Details

Motivation: Existing text-to-image models often omit or misrepresent objects and attributes in complex prompts, so the faithfulness of generated content needs to improve. Method: The input prompt is decomposed into semantic concepts, a fine-grained variant of CLIP computes concept-level correspondence, and the resulting feedback drives an iterative prompt refinement loop. Result: Experiments on DrawBench and CompBench prompts show significant gains in concept coverage and human-judged faithfulness. Conclusion: Fine-grained alignment feedback strengthens compositional faithfulness in text-to-image generation, outperforming both standard test-time optimization and the base model. Abstract: Despite recent advances in text-to-image (T2I) models, they often fail to faithfully render all elements of complex prompts, frequently omitting or misrepresenting specific objects and attributes. Test-time optimization has emerged as a promising approach to address this limitation by refining generation without the need for retraining. In this paper, we propose a fine-grained test-time optimization framework that enhances compositional faithfulness in T2I generation. Unlike most prior approaches that rely solely on a global image/text similarity score, our method decomposes the input prompt into semantic concepts and evaluates alignment at both the global and concept levels. A fine-grained variant of CLIP is used to compute concept-level correspondence, producing detailed feedback on missing or inaccurate concepts. This feedback is fed into an iterative prompt refinement loop, enabling the large language model to propose improved prompts. Experiments on DrawBench and CompBench prompts demonstrate that our method significantly improves concept coverage and human-judged faithfulness over both standard test-time optimization and the base T2I model. Code is available at: https://github.com/AmirMansurian/NoConceptLeftBehind

Ming-Tsung Hsu,Fang-Yu Hsu,Yi-Ting Lin,Kai-Heng Chien,Jun-Ren Chen,Cheng-Hsiang Su,Yi-Chen Ou,Chiou-Ting Hsu,Pei-Kai Huang

Main category: cs.CV

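TL;DR: This paper proposes MFAS-DANet, a multi-modal face anti-spoofing framework that addresses three main challenges in the domain adaptation scenario: missing modalities, noisy pseudo labels, and model degradation.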

Details

Motivation: Existing multi-modal FAS methods perform poorly on unseen attacks from new target domains, and the domain adaptation scenario has not been studied for them. Method: Complementary features are extracted from other modalities to handle missing modalities; prediction uncertainty across modalities is used to derive reliable pseudo labels and reduce the effect of noise; an adaptive mechanism adjusts the loss weight to prevent model degradation. Result: Extensive experiments show that MFAS-DANet is effective and reaches state-of-the-art performance in the domain adaptation scenario for multi-modal FAS. Conclusion: MFAS-DANet addresses the key challenges of domain adaptation in multi-modal FAS, improving robustness to unseen attacks and cross-domain conditions. Abstract: Recent multi-modal face anti-spoofing (FAS) methods have investigated the potential of leveraging multiple modalities to distinguish live and spoof faces. However, pre-adapted multi-modal FAS models often fail to detect unseen attacks from new target domains. Although a more realistic domain adaptation (DA) scenario has been proposed for single-modal FAS to learn specific spoof attacks during inference, DA remains unexplored in multi-modal FAS methods. In this paper, we propose a novel framework, MFAS-DANet, to address three major challenges in multi-modal FAS under the DA scenario: missing modalities, noisy pseudo labels, and model degradation. First, to tackle the issue of missing modalities, we propose extracting complementary features from other modalities to substitute missing modality features or enhance existing ones. Next, to reduce the impact of noisy pseudo labels during model adaptation, we propose deriving reliable pseudo labels by leveraging prediction uncertainty across different modalities. Finally, to prevent model degradation, we design an adaptive mechanism that decreases the loss weight during unstable adaptations and increases it during stable ones. Extensive experiments demonstrate the effectiveness and state-of-the-art performance of our proposed MFAS-DANet.

[291] RestoRect: Degraded Image Restoration via Latent Rectified Flow & Feature Distillation

Shourya Verma,Mengbo Wang,Nadia Atallah Lanman,Ananth Grama

Main category: cs.CV

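TL;DR: RestoRect is a latent rectified flow feature distillation method for image restoration. It uses generative learning to synthesize teacher-quality features in latent space, combines Retinex theory, learnable anisotropic diffusion constraints, and trigonometric color space polarization, and outperforms existing methods across multiple datasets and tasks.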

Details

Motivation: Existing knowledge distillation methods cannot capture the dynamic feature generation of modern transformer architectures, and high-performance restoration models are usually slow while fast models give poor results; a method that balances speed and quality is needed. Method: RestoRect applies rectified flow to feature distillation, building generative learning trajectories in latent space; it combines Retinex theory for physics-based decomposition with learnable anisotropic diffusion and trigonometric color space polarization, and adds a Feature Layer Extraction loss for robust knowledge transfer across architectures. Result: Results are superior across 15 image restoration datasets, 4 tasks, and 8 metrics, with better training stability and faster convergence and inference while restoration quality is preserved. Conclusion: The generative feature distillation framework addresses the trade-off between speed and performance and offers a new approach to knowledge transfer for transformer-based image restoration models. Abstract: Current approaches for restoration of degraded images face a critical trade-off: high-performance models are too slow for practical use, while fast models produce poor results. Knowledge distillation transfers teacher knowledge to students, but existing static feature matching methods cannot capture how modern transformer architectures dynamically generate features. We propose 'RestoRect', a novel Latent Rectified Flow Feature Distillation method for restoring degraded images. We apply rectified flow to reformulate feature distillation as a generative process where students learn to synthesize teacher-quality features through learnable trajectories in latent space. Our framework combines Retinex theory for physics-based decomposition with learnable anisotropic diffusion constraints and trigonometric color space polarization. We introduce a Feature Layer Extraction loss for robust knowledge transfer between different network architectures through cross-normalized transformer feature alignment with percentile-based outlier detection. RestoRect achieves better training stability and faster convergence and inference while preserving restoration quality. We demonstrate superior results across 15 image restoration datasets, covering 4 tasks, on 8 metrics.

[292] Orientation-anchored Hyper-Gaussian for 4D Reconstruction from Casual Videos

Junyi Wu,Jiachen Tao,Haoxuan Wang,Gaowen Liu,Ramana Rao Kompella,Yan Yan

Main category: cs.CV

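TL;DR: Orientation-anchored Gaussian Splatting (OriGS) performs high-quality 4D reconstruction from casually captured monocular videos; by introducing a hyperdimensional representation grounded in scene orientation, it models complex dynamic scenes more effectively.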

Details

Motivation: Existing dynamic-scene reconstruction methods based on 3D Gaussian Splatting rely on low-rank assumptions and struggle to model the complex, region-specific deformations of unconstrained dynamics. Method: A Global Orientation Field propagates principal forward directions across space and time, and an Orientation-aware Hyper-Gaussian embeds time, space, geometry, and orientation into one probabilistic state, from which region-specific deformation is inferred through conditioned slicing. Result: Experiments show that OriGS reconstructs challenging real-world dynamic scenes with higher fidelity than mainstream methods. Conclusion: With its orientation-anchored hyperdimensional representation, OriGS addresses local deformation modeling in complex dynamic scenes and achieves higher-quality 4D reconstruction. Abstract: We present Orientation-anchored Gaussian Splatting (OriGS), a novel framework for high-quality 4D reconstruction from casually captured monocular videos. While recent advances extend 3D Gaussian Splatting to dynamic scenes via various motion anchors, such as graph nodes or spline control points, they often rely on low-rank assumptions and fall short in modeling complex, region-specific deformations inherent to unconstrained dynamics. OriGS addresses this by introducing a hyperdimensional representation grounded in scene orientation. We first estimate a Global Orientation Field that propagates principal forward directions across space and time, serving as stable structural guidance for dynamic modeling. Built upon this, we propose Orientation-aware Hyper-Gaussian, a unified formulation that embeds time, space, geometry, and orientation into a coherent probabilistic state. This enables inferring region-specific deformation through principled conditioned slicing, adaptively capturing diverse local dynamics in alignment with global motion intent. Experiments demonstrate the superior reconstruction fidelity of OriGS over mainstream methods in challenging real-world dynamic scenes.

Divyam Madaan,Varshan Muhunthan,Kyunghyun Cho,Sumit Chopra

Main category: cs.CV

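TL;DR: Through a large-scale empirical study with multi-modal large language models, this paper quantifies intra- and inter-modality dependencies across 23 visual question-answering benchmarks. It finds that many benchmarks designed to reduce text bias have instead amplified image-only dependencies, and that large models often rely on single-modality cues that mask a lack of multi-modal reasoning.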

Details

Motivation: Intra- and inter-modality dependencies in multi-modal learning are poorly understood and have not been characterized systematically in benchmark evaluations, so models may rely on a single modality rather than reason across modalities. Method: A large-scale empirical analysis with multi-modal large language models on 23 visual question-answering benchmarks quantifies the reliance on vision, text, and their interaction, and tracks how these dependencies change with model size. Result: Reliance on vision, text, and their interaction varies significantly across and within benchmarks; many benchmarks meant to mitigate text-only bias have amplified image-only dependencies; larger models often exploit intra-modality dependencies to score well, masking a lack of multi-modal reasoning. Conclusion: The study provides a quantitative characterization of multi-modal datasets and principled guidance for designing and evaluating multi-modal benchmarks. Abstract: Understanding the interplay between intra-modality dependencies (the contribution of an individual modality to a target task) and inter-modality dependencies (the relationships between modalities and the target task) is fundamental to advancing multi-modal learning. However, the nature of and interaction between these dependencies within current benchmark evaluations remains poorly characterized. In this work, we present a large-scale empirical study to quantify these dependencies across 23 visual question-answering benchmarks using multi-modal large language models (MLLMs) covering domains such as general and expert knowledge reasoning, optical character recognition, and document understanding. Our findings show that the reliance on vision, question (text), and their interaction varies significantly, both across and within benchmarks. We discover that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies. This characterization persists across model sizes, as larger models often use these intra-modality dependencies to achieve high performance that masks an underlying lack of multi-modal reasoning. We provide a quantitative characterization of multi-modal datasets, enabling a principled approach to multi-modal benchmark design and evaluation.

[294] Enhancing Polyp Segmentation via Encoder Attention and Dynamic Kernel Update

Fatemeh Salahi Chashmi,Roya Sotoudeh

Main category: cs.CV

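TL;DR: A framework combining a dynamic kernel mechanism with a global encoder attention module improves the accuracy and efficiency of polyp segmentation and outperforms existing methods on several benchmark datasets.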

Details

Motivation: Polyp segmentation remains challenging because polyps vary in shape and size and have low-contrast boundaries in medical images. Method: A Dynamic Kernel (DK) mechanism is combined with a global Encoder Attention (EA) module: the DK is initialized by a global context vector from the EA module and iteratively refines segmentation predictions across decoding stages; Unified Channel Adaptation (UCA) standardizes feature dimensions for efficient information fusion. Result: On the Kvasir-SEG and CVC-ClinicDB datasets, the method outperforms several state-of-the-art segmentation methods on Dice and IoU, and UCA reduces computational cost. Conclusion: The method offers a robust and adaptable solution for polyp segmentation, with potential use in clinical and automated diagnostic systems. Abstract: Polyp segmentation is a critical step in colorectal cancer detection, yet it remains challenging due to the diverse shapes, sizes, and low-contrast boundaries of polyps in medical imaging. In this work, we propose a novel framework that improves segmentation accuracy and efficiency by integrating a Dynamic Kernel (DK) mechanism with a global Encoder Attention (EA) module. The DK mechanism, initialized by a global context vector from the EA module, iteratively refines segmentation predictions across decoding stages, enabling the model to focus on and accurately delineate complex polyp boundaries. The EA module enhances the network's ability to capture critical lesion features by aggregating multi-scale information from all encoder layers. In addition, we employ Unified Channel Adaptation (UCA) in the decoder to standardize feature dimensions across stages, ensuring consistent and computationally efficient information fusion. Our approach extends the lesion-aware kernel framework by introducing a more flexible, attention-driven kernel initialization and a unified decoder design. Extensive experiments on the Kvasir-SEG and CVC-ClinicDB benchmark datasets demonstrate that our model outperforms several state-of-the-art segmentation methods, achieving superior Dice and Intersection over Union scores. Moreover, UCA simplifies the decoder structure, reducing computational cost without compromising accuracy. Overall, the proposed method provides a robust and adaptable solution for polyp segmentation, with promising applications in clinical and automated diagnostic systems.
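
For readers unfamiliar with the two reported metrics, the following sketch computes Dice and Intersection over Union for binary masks given as flat sequences of 0/1 values. Returning 1.0 when both masks are empty is a convention chosen for this sketch, not something specified in the abstract.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of equal length.

    Dice = 2|P ∩ T| / (|P| + |T|), IoU = |P ∩ T| / |P ∪ T|.
    Two empty masks are treated as a perfect match (assumed convention).
    """
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    dice = 2 * inter / (pred_sum + target_sum) if pred_sum + target_sum else 1.0
    iou = inter / union if union else 1.0
    return dice, iou
```

The two scores are monotonically related (Dice = 2·IoU / (1 + IoU)), so a method that ranks higher on one ranks higher on the other for a single image; averages over a dataset can still differ.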

[295] Evaluating point-light biological motion in multimodal large language models

Akila Kadambi,Marco Iacoboni,Lisa Aziz-Zadeh,Srini Narayanan

Main category: cs.CV

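TL;DR: This paper introduces ActPLD, the first benchmark for evaluating how multimodal large language models (MLLMs) process actions from human point-light displays (PLDs). Existing models perform poorly on action recognition and spatiotemporal understanding, exposing fundamental gaps.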

Details

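Motivation: Humans can extract rich semantic information from minimal visual cues such as point-light displays, an ability rooted in embodied experience; testing how models understand PLDs therefore helps reveal the limits of action understanding in current multimodal models. Method: The authors build the ActPLD benchmark, containing point-light displays of single actors and of social interactions, and test state-of-the-art proprietary and open-source multimodal large language models. Result: All tested models perform poorly on ActPLD, showing clear deficits in action recognition and in understanding spatiotemporal dynamics. Conclusion: Current multimodal large language models have fundamental gaps in understanding human actions from body motion cues alone, suggesting that they lack the embodied, human-like basis for action understanding. Abstract: Humans can extract rich semantic information from minimal visual cues, as demonstrated by point-light displays (PLDs), which consist of sparse sets of dots localized to key joints of the human body. This ability emerges early in development and is largely attributed to human embodied experience. Since PLDs isolate body motion as the sole source of meaning, they represent key stimuli for testing the constraints of action understanding in multimodal large language models (MLLMs). Here we introduce ActPLD, the first benchmark to evaluate action processing in MLLMs from human PLDs. Tested models include state-of-the-art proprietary and open-source systems on single-actor and socially interacting PLDs. Our results reveal consistently low performance across models, indicating fundamental gaps in action and spatiotemporal understanding.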

[296] Imaging-Based Mortality Prediction in Patients with Systemic Sclerosis

Alec K. Peltekian,Karolina Senkow,Gorkem Durak,Kevin M. Grudzinski,Bradford C. Bemiss,Jane E. Dematte,Carrie Richardson,Nikolay S. Markov,Mary Carns,Kathleen Aren,Alexandra Soriano,Matthew Dapas,Harris Perlman,Aaron Gundersheimer,Kavitha C. Selvan,John Varga,Monique Hinchcliff,Krishnan Warrior,Catherine A. Gao,Richard G. Wunderink,GR Scott Budinger,Alok N. Choudhary,Anthony J. Esposito,Alexander V. Misharin,Ankit Agrawal,Ulas Bagci

Main category: cs.CV

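TL;DR: This study presents a large-scale longitudinal chest CT analysis framework based on radiomics and deep learning for predicting mortality associated with lung complications of systemic sclerosis (SSc).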

Details

Motivation: The role of chest CT in predicting progression and mortality in SSc-related interstitial lung disease has not been fully clarified, and better predictive tools are needed. Method: 2,125 CT scans from SSc patients in the Northwestern Scleroderma Registry were collected and analyzed; pre-trained ResNet-18, DenseNet-121, and Swin Transformer models were fine-tuned to predict mortality within one, three, and five years. Result: The models reached AUCs of 0.769, 0.801, and 0.709 for one-, three-, and five-year mortality, respectively; 181, 326, and 428 of the scans came from patients who died within those intervals. Conclusion: Combining radiomics and deep learning can improve early detection and risk assessment of SSc-related interstitial lung disease. Abstract: Interstitial lung disease (ILD) is a leading cause of morbidity and mortality in systemic sclerosis (SSc). Chest computed tomography (CT) is the primary imaging modality for diagnosing and monitoring lung complications in SSc patients. However, its role in disease progression and mortality prediction has not yet been fully clarified. This study introduces a novel, large-scale longitudinal chest CT analysis framework that utilizes radiomics and deep learning to predict mortality associated with lung complications of SSc. We collected and analyzed 2,125 CT scans from SSc patients enrolled in the Northwestern Scleroderma Registry, conducting mortality analyses at one, three, and five years using advanced imaging analysis techniques. Death labels were assigned based on recorded deaths over the one-, three-, and five-year intervals, confirmed by expert physicians. In our dataset, 181, 326, and 428 of the 2,125 CT scans were from patients who died within one, three, and five years, respectively. We fine-tuned pre-trained ResNet-18, DenseNet-121, and Swin Transformer models on the 2,125 images of SSc patients. The models achieved AUCs of 0.769, 0.801, and 0.709 for predicting mortality within one, three, and five years, respectively. Our findings highlight the potential of both radiomics and deep learning computational methods to improve early detection and risk assessment of SSc-related interstitial lung disease, marking a significant advancement in the literature.

[297] Calibrated and Resource-Aware Super-Resolution for Reliable Driver Behavior Analysis

Ibne Farabi Shihab,Weiheng Chai,Jiyang Wang,Sanjeda Akter,Senem Velipasalar Gursoy,Anuj Sharma

Main category: cs.CV

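TL;DR: A resource-aware adaptive super-resolution framework that optimizes model calibration and precision-recall on critical events, improving the reliability of driver monitoring systems in safety-critical scenarios.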

Details

Motivation: Direct low-resolution training gives high overall accuracy but poorly calibrated confidence, which is a safety risk; prediction reliability in safety-critical scenarios needs to improve. Method: An adaptive super-resolution framework is combined with a lightweight artifact detector that filters SR-induced hallucinations, optimizing for model calibration and precise detection of critical events. Result: The approach leads on safety-centric metrics: an ECE of 5.8%, lower (better) than the 6.2% of LR-trained baselines; an AUPR of 0.78 for drowsiness detection (vs 0.74); and precision-recall of 0.74 for phone use detection (vs 0.71). Conclusion: For safety-critical applications, the adaptive framework outperforms LR-trained models where reliability is paramount. Abstract: Driver monitoring systems require not just high accuracy but reliable, well-calibrated confidence scores for safety-critical deployment. While direct low-resolution training yields high overall accuracy, it produces poorly calibrated predictions that can be dangerous in safety-critical scenarios. We propose a resource-aware adaptive super-resolution framework that optimizes for model calibration and high precision-recall on critical events. Our approach achieves state-of-the-art performance on safety-centric metrics: best calibration (ECE of 5.8% vs 6.2% for LR-trained baselines), highest AUPR for drowsiness detection (0.78 vs 0.74), and superior precision-recall for phone use detection (0.74 vs 0.71). A lightweight artifact detector (0.3M parameters, 5.2ms overhead) provides additional safety by filtering SR-induced hallucinations. While LR-trained video models serve as strong general-purpose baselines, our adaptive framework represents the state-of-the-art solution for safety-critical applications where reliability is paramount.
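
Since the headline result is a calibration figure, a short sketch of how Expected Calibration Error is usually computed may help. It assumes the standard equal-width binning estimator with ten bins; the abstract does not say which binning the authors used.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins.

    ECE = sum over bins of (bin size / N) * |mean confidence - accuracy|.
    `confidences` are predicted probabilities in [0, 1]; `correct` holds
    truthy values where the prediction was right. Lower is better.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece
```

Under this definition, an ECE of 5.8% means that stated confidence and observed accuracy differ by 5.8 percentage points on average, weighted by bin size, which is why the lower figure counts as the better-calibrated model.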

[298] OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction

Hongyang Li,Jinyuan Qu,Lei Zhang

Main category: cs.CV

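TL;DR: This paper proposes OVSeg3R, a training scheme that learns open-vocabulary 3D instance segmentation from 2D perception models with the aid of 3D reconstruction. A View-wise Instance Partition algorithm and 2D Instance Boundary-aware Superpoints improve segmentation, especially on novel classes.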

Details

Motivation: Existing 3D instance segmentation methods mostly rely on closed vocabularies and manual annotation, which makes them hard to adapt to real-world applications. 3D reconstruction also tends to over-smooth geometric detail, hurting segmentation of small or non-salient objects. A method is therefore needed that reuses knowledge from existing 2D models, requires no manual annotation, and preserves instance boundaries. Method: OVSeg3R takes 3D scenes reconstructed from 2D videos as input and uses 2D-to-3D correspondences to project instance masks predicted by an open-vocabulary 2D model into 3D as annotations. A View-wise Instance Partition algorithm assigns supervision per view to avoid introducing false positives, and a 2D Instance Boundary-aware Superpoint uses 2D masks to guide superpoint clustering so that superpoints do not cross instance boundaries. Result: On the ScanNet200 benchmark, OVSeg3R improves the closed-vocabulary model it extends by +2.3 mAP overall, and in the open-vocabulary setting it surpasses previous methods by about +7.1 mAP on novel classes, substantially narrowing the gap between head and tail classes. Conclusion: OVSeg3R transfers open-vocabulary 2D perception to 3D instance segmentation without manual annotation, improving accuracy and generalization, especially on rare classes. Abstract: In this paper, we propose a training scheme called OVSeg3R to learn open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R directly adopts reconstructed scenes from 2D videos as input, avoiding costly manual adjustment while aligning input with real-world applications. By exploiting the 2D to 3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view's 2D instance mask predictions, obtained from an open-vocabulary 2D model, onto 3D to generate annotations for the view's corresponding sub-scene. To avoid incorrectly introduced false positives as supervision due to partial annotations from 2D to 3D, we propose a View-wise Instance Partition algorithm, which partitions predictions to their respective views for supervision, stabilizing the training process. Furthermore, since 3D reconstruction models tend to over-smooth geometric details, clustering reconstructed points into representative super-points based solely on geometry, as commonly done in mainstream 3D segmentation methods, may overlook geometrically non-salient objects. We therefore introduce 2D Instance Boundary-aware Superpoint, which leverages 2D masks to constrain the superpoint clustering, preventing superpoints from violating instance boundaries. 
With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to open-vocabulary, but also substantially narrows the performance gap between tail and head classes, ultimately leading to an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Furthermore, under the standard open-vocabulary setting, OVSeg3R surpasses previous methods by about +7.1 mAP on the novel classes, further validating its effectiveness.

[299] From Fields to Splats: A Cross-Domain Survey of Real-Time Neural Scene Representations

Javed Ahmad,Penggang Gao,Donatien Delehelle,Mennuti Canio,Nikhil Deshpande,Jesús Ortiz,Darwin G. Caldwell,Yonas Teodros Tefera

Main category: cs.CV

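TL;DR: Surveys applications of 3D Gaussian Splatting (3DGS) across multiple domains and discusses its technical advantages over NeRF, its adaptability, and its remaining limitations.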

Details

Motivation: To explain why 3DGS is increasingly displacing NeRF as the mainstream neural scene representation. Method: A systematic comparison of domain-specific pipelines, organized around unified research questions. Result: 3DGS balances photorealism, geometric fidelity, and computational efficiency, and is applicable to SLAM, telepresence and teleoperation, robotic manipulation, and 3D content generation. Conclusion: 3DGS is being widely adopted for its efficiency and flexibility and may further advance neural rendering for perception, interaction, and content creation. Abstract: Neural scene representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have transformed how 3D environments are modeled, rendered, and interpreted. NeRF introduced view-consistent photorealism via volumetric rendering; 3DGS has rapidly emerged as an explicit, efficient alternative that supports high-quality rendering, faster optimization, and integration into hybrid pipelines for enhanced photorealism and task-driven scene understanding. This survey examines how 3DGS is being adopted across SLAM, telepresence and teleoperation, robotic manipulation, and 3D content generation. Despite their differences, these domains share common goals: photorealistic rendering, meaningful 3D structure, and accurate downstream tasks. We organize the review around unified research questions that explain why 3DGS is increasingly displacing NeRF-based approaches: What technical advantages drive its adoption? How does it adapt to different input modalities and domain-specific constraints? What limitations remain? By systematically comparing domain-specific pipelines, we show that 3DGS balances photorealism, geometric fidelity, and computational efficiency. The survey offers a roadmap for leveraging neural rendering not only for image synthesis but also for perception, interaction, and content creation across real and virtual environments.

[300] Pancreas Part Segmentation under Federated Learning Paradigm

Ziliang Hong,Halil Ertugrul Aktas,Andrea Mia Bejar,Katherine Wu,Hongyi Pan,Gorkem Durak,Zheyuan Zhang,Sait Kayali,Temel Tirkes,Federica Proietto Salanitri,Concetto Spampinato,Michael Goggins,Tamas Gonda,Candice Bolan,Raj Keswani,Frank Miller,Michael Wallace,Ulas Bagci

Main category: cs.CV

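TL;DR: Presents the first federated learning (FL) approach for segmenting the pancreas into head, body, and tail in MRI, addressing the clinical challenge posed by the regional heterogeneity of pancreatic disease.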

Details

Motivation: Pancreatic diseases are regionally heterogeneous (cancers occur predominantly in the head, while chronic pancreatitis causes tissue loss in the tail), so accurate part segmentation matters for diagnosis and treatment. In MRI the task is hard because of variable morphology, poor soft-tissue contrast, and anatomical variation across patients, and data scarcity has limited prior methods. Method: A privacy-preserving FL framework trains models jointly across seven medical institutions without sharing data, using 711 T1-weighted and 726 T2-weighted MRI scans. The authors systematically evaluate three segmentation architectures (U-Net, Attention U-Net, Swin UNETR) paired with two FL algorithms (FedAvg, FedProx), and design an anatomically informed loss function that emphasizes region-specific texture contrasts in MRI. Result: On distributed, heterogeneous datasets the approach reaches clinically viable performance; Attention U-Net with FedAvg performs best, and the anatomy-aware loss improves segmentation accuracy. Conclusion: The study is the first cross-institutional FL framework for pancreas part segmentation; it addresses data privacy and data scarcity and offers a transferable approach for precise diagnosis and treatment of pancreatic disease. Abstract: We present the first federated learning (FL) approach for pancreas part (head, body, and tail) segmentation in MRI, addressing a critical clinical challenge. Pancreatic diseases exhibit marked regional heterogeneity: cancers predominantly occur in the head region, while chronic pancreatitis causes tissue loss in the tail, making accurate segmentation of the organ into head, body, and tail regions essential for precise diagnosis and treatment planning. This segmentation task remains exceptionally challenging in MRI due to variable morphology, poor soft-tissue contrast, and anatomical variations across patients. Our contribution tackles two fundamental challenges: first, the technical complexity of pancreas part delineation in MRI, and second, the data scarcity problem that has hindered prior approaches. We introduce a privacy-preserving FL framework that enables collaborative model training across seven medical institutions without direct data sharing, leveraging a diverse dataset of 711 T1W and 726 T2W MRI scans. Our key innovations include: (1) a systematic evaluation of three state-of-the-art segmentation architectures (U-Net, Attention U-Net, Swin UNETR) paired with two FL algorithms (FedAvg, FedProx), a comparison that has not been done before and that reveals Attention U-Net with FedAvg as optimal for pancreatic heterogeneity; (2) a novel anatomically-informed loss function prioritizing region-specific texture contrasts in MRI. 
Comprehensive evaluation demonstrates that our approach achieves clinically viable performance despite training on distributed, heterogeneous datasets.
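
For readers unfamiliar with FedAvg, the aggregation step can be sketched as a dataset-size-weighted average of client parameters. This is a toy illustration on flat parameter lists, with hypothetical client sizes; it is not the authors' training code, and real systems average full model state per layer.

```python
def fedavg(client_weights, client_sizes):
    """Aggregate client parameter vectors, weighting each by its local sample count.

    client_weights: list of equal-length parameter lists, one per institution.
    client_sizes:   number of local training samples at each institution.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical sites; the second holds three times as much data,
# so the global parameters land closer to its local parameters.
global_params = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only parameters leave each site in this scheme, which is what lets the institutions collaborate without direct data sharing. FedProx differs on the client side, adding a proximal term that keeps local updates close to the global model.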

[301] Towards Interpretable Visual Decoding with Attention to Brain Representations

Pinyuan Feng,Hossein Adeli,Wenxuan Guo,Fan Cheng,Ethan Hwang,Nikolaus Kriegeskorte

Main category: cs.CV

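TL;DR: Proposes NeuroAdapter, a visual decoding framework that conditions a diffusion model directly on brain signals without an intermediate feature space, delivering competitive reconstruction quality with greater transparency.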

Details

Motivation: Existing methods decode visual stimuli by mapping brain signals into image or text feature spaces, which obscures the contributions of different brain areas to the output and reduces transparency. Method: NeuroAdapter conditions a latent diffusion model directly on brain representations. The accompanying IBBI framework analyzes cross-attention across diffusion denoising steps to reveal how cortical areas influence generation. Result: On public fMRI datasets the method achieves visual reconstruction quality competitive with prior work, and IBBI provides interpretability of brain-area contributions. Conclusion: End-to-end brain-to-image decoding is promising, and visual neuroscience offers a lens for interpreting diffusion models. Abstract: Recent work has demonstrated that complex visual stimuli can be decoded from human brain activity using deep generative models, helping brain science researchers interpret how the brain represents real-world scenes. However, most current approaches map brain signals into intermediate image or text feature spaces before guiding the generative process, masking the contributions of different brain areas to the final reconstruction output. In this work, we propose NeuroAdapter, a visual decoding framework that directly conditions a latent diffusion model on brain representations, bypassing the need for intermediate feature spaces. Our method demonstrates competitive visual reconstruction quality on public fMRI datasets compared to prior work, while providing greater transparency into how brain signals shape the generation process. To this end, we contribute an Image-Brain BI-directional interpretability framework (IBBI) which investigates cross-attention mechanisms across diffusion denoising steps to reveal how different cortical areas influence the unfolding generative trajectory. Our results highlight the potential of end-to-end brain-to-image decoding and establish a path toward interpreting diffusion models through the lens of visual neuroscience.

[302] RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization

Kaicheng Yang,Xun Zhang,Haotong Qin,Yucheng Lin,Kaisen Yang,Xianglong Yan,Yulun Zhang

Main category: cs.CV

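TL;DR: Proposes RobuQ, a systematic quantization-aware training framework for Diffusion Transformers (DiTs) that addresses the key bottleneck of low-bit activation quantization, achieves state-of-the-art results in sub-4-bit configurations, and is reported as the first to achieve stable, competitive image generation on large datasets such as ImageNet-1K with activations quantized to an average of 2 bits.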

Details

Motivation: DiTs perform well in image generation, but their high compute and memory costs limit practical deployment. Existing quantization methods struggle with the sensitivity and distributional complexity of DiT activations, and low-bit activation quantization is the main bottleneck. Method: RobuQ comprises a ternary-weight (W1.58A4) baseline; RobustQuantizer, which uses the Hadamard transform to turn unknown per-token activation distributions into approximately normal ones for more robust quantization; and AMPN, the first activation-only mixed-precision pipeline for DiTs, which assigns different activation precisions to different layers to remove information bottlenecks. Result: On unconditional and conditional image generation, RobuQ reaches state-of-the-art performance for sub-4-bit quantization configurations, and with activations quantized to an average of 2 bits it is reported as the first to achieve stable, competitive image generation on ImageNet-1K. Conclusion: RobuQ addresses activation quantization for DiTs at extremely low bit widths, reducing deployment cost and improving prospects for Diffusion Transformers in resource-constrained settings. Abstract: Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating superior scalability and performance over U-Net architectures. However, their practical deployment is hindered by substantial computational and memory costs. While Quantization-Aware Training (QAT) has shown promise for U-Nets, its application to DiTs faces unique challenges, primarily due to the sensitivity and distributional complexity of activations. In this work, we identify activation quantization as the primary bottleneck for pushing DiTs to extremely low-bit settings. To address this, we propose a systematic QAT framework for DiTs, named RobuQ. We start by establishing a strong ternary weight (W1.58A4) DiT baseline. Building upon this, we propose RobustQuantizer to achieve robust activation quantization. Our theoretical analyses show that the Hadamard transform can convert unknown per-token distributions into per-token normal distributions, providing a strong foundation for this method. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs. This method applies ternary weights across the entire network while allocating different activation precisions to each layer to eliminate information bottlenecks. Through extensive experiments on unconditional and conditional image generation, our RobuQ framework achieves state-of-the-art performance for DiT quantization in sub-4-bit quantization configurations. 
To the best of our knowledge, RobuQ is the first to achieve stable and competitive image generation on large datasets like ImageNet-1K with activations quantized to an average of 2 bits. The code and models will be available at https://github.com/racoonykc/RobuQ .
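
Two ideas in this abstract can be illustrated with a short sketch: the "W1.58" label (a ternary weight carries log2(3), about 1.58, bits) and the way a Hadamard rotation spreads a single outlier across all channels of a token. The code below is a generic orthonormal fast Walsh-Hadamard transform written for illustration; it is not RobustQuantizer, and the toy token is made up.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving
    return [v * scale for v in y]


# A ternary weight takes one of three values, so it carries log2(3) bits.
bits_per_ternary_weight = math.log2(3)

# Toy token with one extreme outlier channel (hypothetical values).
token = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rotated = fwht(token)
```

After the rotation the largest magnitude drops from 8 to about 2.83 while the vector norm is unchanged, so a uniform low-bit quantizer wastes less of its range on one channel. Because the orthonormal transform is its own inverse, applying `fwht` again recovers the original token.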

[303] VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement

Shulian Zhang,Yong Guo,Long Peng,Ziyang Wang,Ye Chen,Wenbo Li,Xiao Zhang,Yulun Zhang,Jian Chen

Main category: cs.CV

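TL;DR: Proposes VividFace, an efficient one-step diffusion framework for video face enhancement. It uses a pretrained model and spatiotemporal priors to map low-quality inputs directly to high-quality outputs, and combines a Joint Latent-Pixel Face-Focused Training strategy with an MLLM-driven data curation pipeline, reaching state-of-the-art perceptual quality, identity preservation, and temporal stability.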

Details

Motivation: Existing video face enhancement methods struggle with facial texture modeling, generalization, and inference efficiency, and a lack of high-quality training data limits performance. Method: Built on the pretrained WANX video generation model, VividFace uses a single-step flow matching paradigm for direct mapping; a Joint Latent-Pixel Face-Focused Training strategy combines facial region optimization with global reconstruction; and an MLLM-driven data curation pipeline selects high-quality face video data. Result: Experiments show that VividFace outperforms existing methods in perceptual quality, identity preservation, and temporal stability while markedly reducing inference time. Conclusion: Through one-step diffusion and a fine-grained training strategy, VividFace addresses the key challenges of video face enhancement efficiently and at high quality. Abstract: Video Face Enhancement (VFE) seeks to reconstruct high-quality facial regions from degraded video sequences, a capability that underpins numerous applications including video conferencing, film restoration, and surveillance. Despite substantial progress in the field, current methods that primarily rely on video super-resolution and generative frameworks continue to face three fundamental challenges: (1) faithfully modeling intricate facial textures while preserving temporal consistency; (2) restricted model generalization due to the lack of high-quality face video training data; and (3) low efficiency caused by repeated denoising steps during inference. To address these challenges, we propose VividFace, a novel and efficient one-step diffusion framework for video face enhancement. Built upon the pretrained WANX video generation model, our method leverages powerful spatiotemporal priors through a single-step flow matching paradigm, enabling direct mapping from degraded inputs to high-quality outputs with significantly reduced inference time. To further boost efficiency, we propose a Joint Latent-Pixel Face-Focused Training strategy that employs stochastic switching between facial region optimization and global reconstruction, providing explicit supervision in both latent and pixel spaces through a progressive two-stage training process. Additionally, we introduce an MLLM-driven data curation pipeline for automated selection of high-quality video face datasets, enhancing model generalization. 
Extensive experiments demonstrate that VividFace achieves state-of-the-art results in perceptual quality, identity preservation, and temporal stability, while offering practical resources for the research community.

[304] Multi-Level Heterogeneous Knowledge Transfer Network on Forward Scattering Center Model for Limited Samples SAR ATR

Chenxi Zhao,Daochang Wang,Siqian Zhang,Gangyao Kuang

Main category: cs.CV

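TL;DR: Proposes a multi-level heterogeneous knowledge transfer (MHKT) network based on the forward scattering center model (FSCM) to improve knowledge transfer from simulated to measured data in SAR target recognition. It filters out irrelevant information such as background and noise, and transfers purer, key target knowledge at the feature, distribution, and category levels.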

Details

Motivation: Existing SAR target recognition methods based on simulated images carry a large amount of irrelevant information such as background and noise, which degrades transfer quality, and samples are limited. A method is needed that extracts and transfers key target knowledge with stronger physical meaning and interpretability. Method: The MHKT network introduces a task-associated information selector (TAIS) at the feature level to separate non-informative knowledge, a maximum discrimination divergence (MDD) metric at the distribution level to preserve discriminative class structure, and a category relation consistency constraint at the category level to counter the optimization bias caused by imbalance between simulated and measured data. Result: Extensive experiments on two new datasets built from FSCM data and measured SAR images show that the method outperforms existing methods. Conclusion: Transferring knowledge from the physically meaningful FSCM through a multi-level heterogeneous strategy improves simulation-assisted SAR target recognition and helps keep the transferred knowledge pure and complete. Abstract: Simulated data-assisted SAR target recognition methods are a current research hotspot, aimed at solving the problem of limited samples. Existing works revolve around simulated images, but the large amount of irrelevant information embedded in the images, such as background and noise, seriously degrades the quality of the transferred information. Our work explores a new kind of simulated data for transferring purer, key target knowledge: the forward scattering center model (FSCM), which models the actual local structure of the target with strong physical meaning and interpretability. To this end, we propose a multi-level heterogeneous knowledge transfer (MHKT) network, which transfers FSCM knowledge at the feature, distribution, and category levels. Specifically, we permit more suitable feature representations for the heterogeneous data and separate non-informative knowledge with a task-associated information selector (TAIS), to complete purer target feature transfer. For distribution alignment, a new metric function, maximum discrimination divergence (MDD), in the target generic knowledge transfer (TGKT) module perceives transferable knowledge efficiently while preserving discriminative structure across classes. Moreover, the category relation knowledge transfer (CRKT) module leverages a category relation consistency constraint to counter the optimization bias towards simulated data caused by the imbalance between simulated and measured data. Such stepwise knowledge selection and transfer ensures the integrity of the transferred FSCM knowledge. 
Notably, extensive experiments on two new datasets formed by FSCM data and measured SAR images demonstrate the superior performance of our method.

[305] VAMamba: An Efficient Visual Adaptive Mamba for Image Restoration

Han Hu,Zhuoran Zheng,Liang Li,Chen Lyu

Main category: cs.CV

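TL;DR: Proposes VAMamba, a Visual Adaptive Mamba framework whose two modules, QCLAM and GPS-SS2D, address the fixed scanning patterns and inefficient feature utilization of existing Mamba methods for image restoration.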

Details

Motivation: Existing Mamba-based image restoration methods are limited by fixed scanning patterns and inefficient feature utilization, so they cope poorly with diverse degradations, which hurts restoration performance and computational efficiency. Method: VAMamba has two modules. (1) QCLAM performs dynamic feature fusion guided by the similarity between LoRA-adapted features and features held in a queue-based cache. (2) GPS-SS2D uses a Vision Transformer to generate pixel-importance score maps and a greedy strategy to determine scanning paths, enabling adaptive feature extraction. Result: Across a range of image restoration tasks, VAMamba outperforms existing methods in both restoration quality and computational efficiency. Conclusion: Through adaptive scanning and efficient feature reuse, VAMamba improves restoration performance and efficiency and offers a more flexible option for Mamba-based models. Abstract: Recent Mamba-based image restoration methods have achieved promising results but remain limited by fixed scanning patterns and inefficient feature utilization. Conventional Mamba architectures rely on predetermined paths that cannot adapt to diverse degradations, constraining both restoration performance and computational efficiency. To overcome these limitations, we propose VAMamba, a Visual Adaptive Mamba framework with two key innovations. First, QCLAM (Queue-based Cache Low-rank Adaptive Memory) enhances feature learning through a FIFO cache that stores historical representations. Similarity between current LoRA-adapted and cached features guides intelligent fusion, enabling dynamic reuse while effectively controlling memory growth. Second, GPS-SS2D (Greedy Path Scan SS2D) introduces adaptive scanning. A Vision Transformer generates score maps to estimate pixel importance, and a greedy strategy determines optimal forward and backward scanning paths. These learned trajectories replace rigid patterns, enabling SS2D to perform targeted feature extraction. The integration of QCLAM and GPS-SS2D allows VAMamba to adaptively focus on degraded regions while maintaining high computational efficiency. Extensive experiments across diverse restoration tasks demonstrate that VAMamba consistently outperforms existing approaches in both restoration quality and efficiency, establishing new benchmarks for adaptive image restoration. Our code is available at https://github.com/WaterHQH/VAMamba.
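
The FIFO cache with similarity-guided fusion described for QCLAM can be pictured with a small sketch. Everything specific here is an assumption made for illustration: the class name, the cosine similarity, the clipping of negative similarities, and the fixed `mix` factor. The abstract does not specify the actual fusion rule.

```python
from collections import deque
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class FifoFeatureCache:
    """Hypothetical bounded cache: the oldest feature is evicted once capacity is hit."""

    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)  # bounded, so memory cannot grow

    def fuse(self, feature, mix=0.5):
        """Blend `feature` with a similarity-weighted mean of cached features, then cache it."""
        weights = [max(0.0, cosine(feature, past)) for past in self.queue]
        total = sum(weights)
        if total > 0:
            memory = [
                sum(w * past[i] for w, past in zip(weights, self.queue)) / total
                for i in range(len(feature))
            ]
            fused = [(1 - mix) * f + mix * m for f, m in zip(feature, memory)]
        else:
            fused = list(feature)  # nothing similar cached yet
        self.queue.append(list(feature))
        return fused


cache = FifoFeatureCache(capacity=2)
first = cache.fuse([1.0, 0.0])
second = cache.fuse([0.0, 1.0])
third = cache.fuse([1.0, 0.0])   # evicts nothing yet; the cache held two entries
fourth = cache.fuse([1.0, 1.0])  # cache now holds [0, 1] and [1, 0]
```

Using `deque(maxlen=...)` gives the first-in, first-out eviction for free, which is one simple way to get the bounded memory growth the abstract mentions.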

[306] Deep Taxonomic Networks for Unsupervised Hierarchical Prototype Discovery

Zekun Wang,Ethan Haarer,Zhiyi Dai,Tianyi Zhu,Christopher J. MacLellan

Main category: cs.CV

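TL;DR: Proposes deep taxonomic networks, a deep latent variable model trained with variational inference. A mixture-of-Gaussians prior structured as a complete binary tree lets the model discover hierarchical cluster structure without a preset number of classes and exploit prototypes at intermediate levels, giving strong hierarchical clustering performance on image classification datasets.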

Details

Motivation: Existing deep hierarchical clustering methods usually tie the structure to the number of classes and underuse prototype information at intermediate levels, limiting flexibility and expressiveness. Inspired by how humans organize knowledge, the authors seek a method that discovers hierarchy automatically and organizes prototypes effectively. Method: Deep taxonomic networks use a mixture-of-Gaussians prior structured as a complete binary tree as a latent variable model, optimized via the ELBO within a variational inference framework. Maximizing the evidence lower bound encourages hierarchical relationships among prototypes, so taxonomic structure and prototype clusters are learned from unlabeled data. Result: Across several image classification datasets the method outperforms baselines under a newly designed evaluation mechanism that uses prototype clusters from all hierarchical levels. Qualitative results show rich, interpretable taxonomies that capture both coarse-grained semantic categories and fine-grained visual distinctions. Conclusion: Deep taxonomic networks remove the dependence on a preset number of classes and make better use of prototype information, yielding hierarchical clustering that is both strong and interpretable. Abstract: Inspired by the human ability to learn and organize knowledge into hierarchical taxonomies with prototypes, this paper addresses key limitations in current deep hierarchical clustering methods. Existing methods often tie the structure to the number of classes and underutilize the rich prototype information available at intermediate hierarchical levels. We introduce deep taxonomic networks, a novel deep latent variable approach designed to bridge these gaps. Our method optimizes a large latent taxonomic hierarchy, specifically a complete binary tree structured mixture-of-Gaussian prior within a variational inference framework, to automatically discover taxonomic structures and associated prototype clusters directly from unlabeled data without assuming true label sizes. We analytically show that optimizing the ELBO of our method encourages the discovery of hierarchical relationships among prototypes. Empirically, our learned models demonstrate strong hierarchical clustering performance, outperforming baselines across diverse image classification datasets using our novel evaluation mechanism that leverages prototype clusters discovered at all hierarchical levels. Qualitative results further reveal that deep taxonomic networks discover rich and interpretable hierarchical taxonomies, capturing both coarse-grained semantic categories and fine-grained visual distinctions.

[307] MAN: Latent Diffusion Enhanced Multistage Anti-Noise Network for Efficient and High-Quality Low-Dose CT Image Denoising

Tangtangfang Fang,Jingxi Hu,Xiangjian He,Jiaqi Yang

Main category: cs.CV

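TL;DR: Proposes MAN, a latent-diffusion-enhanced multistage anti-noise network for efficient, high-quality low-dose CT image denoising that keeps high fidelity while substantially speeding up inference.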

Details

Motivation: Diffusion models give high-quality low-dose CT denoising, but heavy computation and long inference times limit clinical use. Method: The model operates in a compressed latent space via a perceptually optimized autoencoder, where an attention-based conditional U-Net performs a fast, deterministic conditional denoising diffusion process. Result: On the LDCT and Projection dataset, the model surpasses CNN/GAN-based methods in perceptual quality, rivals diffusion models such as DDPM, is over 60x faster at inference than representative pixel-space diffusion denoisers, and remains competitive on PSNR/SSIM. Conclusion: The method balances high-fidelity reconstruction with clinical practicality, offering a workable path for generative models in medical imaging. Abstract: While diffusion models have set a new benchmark for quality in Low-Dose Computed Tomography (LDCT) denoising, their clinical adoption is critically hindered by extreme computational costs, with inference times often exceeding thousands of seconds per scan. To overcome this barrier, we introduce MAN, a Latent Diffusion Enhanced Multistage Anti-Noise Network for Efficient and High-Quality Low-Dose CT Image Denoising. Our method operates in a compressed latent space via a perceptually-optimized autoencoder, enabling an attention-based conditional U-Net to perform the fast, deterministic conditional denoising diffusion process with drastically reduced overhead. On the LDCT and Projection dataset, our model achieves superior perceptual quality, surpassing CNN/GAN-based methods while rivaling the reconstruction fidelity of computationally heavy diffusion models like DDPM and Dn-Dp. Most critically, in the inference stage, our model is over 60x faster than representative pixel space diffusion denoisers, while remaining competitive on PSNR/SSIM scores. By bridging the gap between high fidelity and clinical viability, our work demonstrates a practical path forward for advanced generative models in medical imaging.

[308] VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis

Zeren Xiong,Yue Yu,Zedong Zhang,Shuo Chen,Jian Yang,Jun Li

Main category: cs.CV

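TL;DR: Proposes VMDiff, a diffusion-based visual fusion framework that uses hybrid sampling at the noise and latent levels together with adaptive parameter adjustment to address coexistent generation and semantic bias when generating from multiple source images.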

Details

Motivation: To address two problems in existing image-to-image methods that fuse visual information from multiple sources: coexistent generation (objects are merely juxtaposed) and bias generation (one object dominates because of semantic imbalance). Method: Visual Mixing Diffusion (VMDiff) fuses two input images at the noise and latent levels through a hybrid sampling process that combines guided denoising, inversion, and spherical interpolation, together with an adaptive adjustment module driven by a similarity-based score. Result: On a benchmark of 780 concept pairs, the method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity. Conclusion: VMDiff achieves structure-aware fusion of multi-source visual information, producing a single coherent image that is semantically balanced and creative. Abstract: Creating novel images by fusing visual cues from multiple sources is a fundamental yet underexplored problem in image-to-image generation, with broad applications in artistic creation, virtual reality and visual media. Existing methods often face two key challenges: coexistent generation, where multiple objects are simply juxtaposed without true integration, and bias generation, where one object dominates the output due to semantic imbalance. To address these issues, we propose Visual Mixing Diffusion (VMDiff), a simple yet effective diffusion-based framework that synthesizes a single, coherent object by integrating two input images at both noise and latent levels. Our approach comprises: (1) a hybrid sampling process that combines guided denoising, inversion, and spherical interpolation with adjustable parameters to achieve structure-aware fusion, mitigating coexistent generation; and (2) an efficient adaptive adjustment module, which introduces a novel similarity-based score to automatically and adaptively search for optimal parameters, countering semantic bias. Experiments on a curated benchmark of 780 concept pairs demonstrate that our method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity.
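
Spherical interpolation, one ingredient of the hybrid sampling process, is a standard way to blend two noise or latent vectors along the arc between them rather than along a straight line. The sketch below is the textbook formula on plain lists; how VMDiff sets the interpolation factor or combines it with guided denoising is not shown, and the example vectors are made up.

```python
import math


def slerp(a, b, t):
    """Spherical interpolation between vectors a and b at fraction t in [0, 1].

    Assumes a and b are non-zero and not exactly opposite to each other.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding drift
    omega = math.acos(cos_omega)                # angle between the two vectors
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Linear interpolation of the same two vectors would give a midpoint of norm about 0.71; keeping the norm at 1 matters for diffusion noise, whose samples are expected to have a roughly fixed magnitude.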

[309] FlowLUT: Efficient Image Enhancement via Differentiable LUTs and Iterative Flow Matching

Liubing Hu,Chen Wu,Anrui Wang,Dianjie Lu,Guijuan Zhang,Zhuoran Zheng

Main category: cs.CV

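TL;DR: Proposes FlowLUT, an end-to-end image enhancement model that combines the efficiency of 3D look-up tables (LUTs), multiple priors, and the parameter-independent characteristic of flow-matched reconstruction, providing scene-adaptive color correction at $\mathcal{O}(1)$ complexity along with detail restoration.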

Details

Motivation: Deep learning image enhancement methods trade computational efficiency against representational capacity. A conventional 3D LUT is efficient but inflexible and depends on a fixed prior, so a method that is both efficient and expressive is needed. Method: FlowLUT transforms colors with a collection of differentiable 3D LUTs, and a lightweight content-aware network dynamically predicts fusion weights for scene-adaptive correction at $\mathcal{O}(1)$ complexity. An iterative flow matching method restores local structural detail, and the whole model is optimized jointly under a composite loss enforcing perceptual and structural fidelity. Result: Experiments on three benchmarks show strong performance, balancing efficiency and enhancement quality in color correction and detail restoration. Conclusion: FlowLUT combines the efficiency of LUTs with the expressiveness of deep learning, improving enhancement quality through multiple priors and flow matching while remaining efficient. Abstract: Deep learning-based image enhancement methods face a fundamental trade-off between computational efficiency and representational capacity. For example, although a conventional three-dimensional Look-Up Table (3D LUT) can process a degraded image in real time, it lacks representational flexibility and depends solely on a fixed prior. To address this problem, we introduce FlowLUT, a novel end-to-end model that integrates the efficiency of LUTs, multiple priors, and the parameter-independent characteristic of flow-matched reconstructed images. Specifically, the input image is first transformed in color space by a collection of differentiable 3D LUTs (containing a large number of 3D LUTs with different priors). Subsequently, a lightweight content-aware fusion prediction network dynamically predicts fusion weights over these LUTs, enabling scene-adaptive color correction with $\mathcal{O}(1)$ complexity. Furthermore, to address the inherent representation limitations of LUTs, we design an innovative iterative flow matching method to restore local structural details and eliminate artifacts. Finally, the entire model is jointly optimized under a composite loss function enforcing perceptual and structural fidelity. Extensive experimental results demonstrate the effectiveness of our method on three benchmarks.
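
The idea of fusing several 3D LUTs with predicted weights can be sketched in a few lines. This is a toy under stated assumptions: the LUT size, the two example LUTs, the fixed weights (which a predictor network would supply in the real model), and nearest-neighbour lookup (production pipelines typically use trilinear interpolation) are all chosen for illustration.

```python
def make_lut(size, fn):
    """Build a size^3 LUT as nested lists; fn maps (r, g, b) in [0, 1] to an RGB tuple."""
    step = 1.0 / (size - 1)
    return [[[fn(r * step, g * step, b * step) for b in range(size)]
             for g in range(size)] for r in range(size)]


def fuse_luts(luts, weights):
    """Weighted sum of corresponding LUT entries, giving one image-specific LUT."""
    size = len(luts[0])
    return [[[tuple(sum(w * lut[r][g][b][c] for lut, w in zip(luts, weights))
                    for c in range(3))
              for b in range(size)] for g in range(size)] for r in range(size)]


def apply_lut(lut, pixel):
    """Nearest-neighbour lookup of one RGB pixel with channels in [0, 1]."""
    size = len(lut)
    r, g, b = (min(size - 1, max(0, round(v * (size - 1)))) for v in pixel)
    return lut[r][g][b]


identity = make_lut(3, lambda r, g, b: (r, g, b))
inverted = make_lut(3, lambda r, g, b: (1 - r, 1 - g, 1 - b))

# Hypothetical fusion weights; in FlowLUT these come from the predictor network.
fused = fuse_luts([identity, inverted], [0.75, 0.25])
corrected = apply_lut(fused, (1.0, 0.0, 0.5))
```

Once the LUTs are fused, each pixel costs a single table lookup regardless of image content, which is one way to read the constant-time claim for the color-correction stage.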

[310] InteractMove: Text-Controlled Human-Object Interaction Generation in 3D Scenes with Movable Objects

Xinhao Cai,Minghang Zheng,Xin Jin,Yang Liu

Main category: cs.CV

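TL;DR: Introduces the new task of text-controlled human-object interaction generation in 3D scenes with movable objects, builds the InteractMove dataset, and combines 3D visual grounding with hand-object joint affordance learning to generate physically plausible interactions.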

Details

Motivation: Existing datasets have too few interaction categories and are mostly limited to static objects, and collecting human-scene interaction data with movable objects is difficult and costly. A richer, more realistic dataset and matching methods are needed. Method: The authors build the InteractMove dataset and propose a three-stage pipeline of 3D visual grounding, hand-object joint affordance learning, and local-scene modeling with collision-avoidance optimization, to identify the target object, predict contact regions, and generate plausible collision-free motion. Result: Experiments show that the method outperforms existing approaches at generating text-compliant, physically plausible interactions, handles objects of different sizes and categories, and avoids collisions between objects and the scene. Conclusion: The method and dataset offer an effective solution for text-driven human-scene interaction generation with movable objects and advance research on interaction in complex dynamic scenes. Abstract: We propose a novel task of text-controlled human-object interaction generation in 3D scenes with movable objects. Existing human-scene interaction datasets suffer from insufficient interaction categories and typically only consider interactions with static objects (i.e., objects whose positions do not change), and the collection of such datasets with movable objects is difficult and costly. To address this problem, we construct the InteractMove dataset for Movable Human-Object Interaction in 3D Scenes by aligning existing human-object interaction data with scene contexts, featuring three key characteristics: 1) scenes containing multiple movable objects with text-controlled interaction specifications (including same-category distractors requiring spatial and 3D scene context understanding), 2) diverse object types and sizes with varied interaction patterns (one-hand, two-hand, etc.), and 3) physically plausible object manipulation trajectories. With the introduction of various movable objects, this task becomes more challenging, as the model needs to identify objects to be interacted with accurately, learn to interact with objects of different sizes and categories, and avoid collisions between movable objects and the scene. To tackle such challenges, we propose a novel pipeline solution. We first use 3D visual grounding models to identify the interaction object. Then, we propose a hand-object joint affordance learning to predict contact regions for different hand joints and object parts, enabling accurate grasping and manipulation of diverse objects. 
Finally, we optimize interactions with local-scene modeling and collision avoidance constraints, ensuring physically plausible motions and avoiding collisions between objects and the scene. Comprehensive experiments demonstrate our method's superiority in generating physically plausible, text-compliant interactions compared to existing approaches.

[311] BioVessel-Net and RetinaMix: Unsupervised Retinal Vessel Segmentation from OCTA Images

Cheng Huang,Weizheng Xie,Fan Gao,Yutong Liu,Ruoling Wu,Zeyu Han,Jingxi Qiu,Xiangxiang Wang,Zhenglin Yang,Hao Wang,Yongbin Yu

Main category: cs.CV

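TL;DR: Proposes BioVessel-Net, an unsupervised generative framework for retinal vessel segmentation in optical coherence tomography angiography (OCTA) that extracts vessels accurately and interpretably without labeled data, and releases RetinaMix, a new benchmark dataset of 2D and 3D OCTA images.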

Details

Motivation: Existing retinal vessel segmentation methods depend on extensive manual annotation, which is costly and error-prone, and labels are especially hard to obtain for OCTA images. A label-free, efficient, and reliable method is needed. Method: BioVessel-Net combines vessel biostatistics, adversarial refinement, and a radius-guided segmentation strategy to model vascular structure directly rather than segment pixel by pixel; the new RetinaMix dataset supports training and evaluation. Result: On RetinaMix and existing datasets, BioVessel-Net is reported to achieve near-perfect segmentation accuracy, substantially outperforming state-of-the-art supervised and semi-supervised methods. Conclusion: Together, BioVessel-Net and RetinaMix provide a label-free, computationally efficient, and clinically interpretable approach to retinal vessel analysis, with potential for glaucoma monitoring, blood flow modeling, and progression prediction. Abstract: Structural changes in retinal blood vessels are critical biomarkers for the onset and progression of glaucoma and other ocular diseases. However, current vessel segmentation approaches largely rely on supervised learning and extensive manual annotations, which are costly, error-prone, and difficult to obtain in optical coherence tomography angiography. Here we present BioVessel-Net, an unsupervised generative framework that integrates vessel biostatistics with adversarial refinement and a radius-guided segmentation strategy. Unlike pixel-based methods, BioVessel-Net directly models vascular structures with biostatistical coherence, achieving accurate and explainable vessel extraction without labeled data or high-performance computing. To support training and evaluation, we introduce RetinaMix, a new benchmark dataset of 2D and 3D OCTA images with high-resolution vessel details from diverse populations. Experimental results demonstrate that BioVessel-Net achieves near-perfect segmentation accuracy across RetinaMix and existing datasets, substantially outperforming state-of-the-art supervised and semi-supervised methods. Together, BioVessel-Net and RetinaMix provide a label-free, computationally efficient, and clinically interpretable solution for retinal vessel analysis, with broad potential for glaucoma monitoring, blood flow modeling, and progression prediction. Code and dataset are available: https://github.com/VikiXie/SatMar8.

[312] DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation

Wei Pan,Huiguo He,Hiuyi Cheng,Yilin Shi,Lianwen Jin

Main category: cs.CV

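TL;DR: Proposes DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. Its two modules, InkVAE and InkDiT, disentangle content and style, and it outperforms existing methods in accuracy, style fidelity, and generation efficiency.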

Details

Motivation: Existing text-to-handwriting methods mostly work at the character or word level, which makes full-line generation inefficient and leaves overall structure unmodeled. Method: DiffInk consists of InkVAE, a sequential variational autoencoder regularized with an OCR-based loss and a style-classification loss, and InkDiT, a diffusion Transformer operating in the latent space. Together they disentangle content from style and generate coherent pen trajectories. Result: Experiments show that DiffInk outperforms existing state-of-the-art methods in glyph accuracy and style fidelity while significantly improving generation efficiency. Conclusion: DiffInk offers an efficient, structured solution for full-line online handwriting generation and advances the use of deep generative models in this area. Abstract: Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency. Code will be made publicly available.

[313] RIV: Recursive Introspection Mask Diffusion Vision Language Model

YuQian Li,Limeng Qiao,Lin Ma

Main category: cs.CV

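TL;DR: This paper proposes RIV, a Recursive Introspection Mask Diffusion Vision Language Model with self-correction ability; through introspection training and a recursive inference mechanism, it achieves state-of-the-art performance on multimodal understanding tasks.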

Details

Motivation: Existing mask diffusion vision language models (MDVLMs) cannot correct errors made during generation; this lack of self-correction limits their performance on complex tasks. Method: Two new mechanisms are proposed: (1) Introspection Training, which introduces an Introspection Model that identifies grammatical, spelling, and logical errors in generated sequences; and (2) Recursive Inference, which repeats an unmask-introspection-remask cycle to progressively correct the output. Result: Experiments on multiple benchmarks show that RIV outperforms most existing MDVLMs and achieves state-of-the-art performance. Conclusion: By introducing a self-correction mechanism, RIV significantly improves the accuracy and reliability of mask diffusion vision language models and offers a new approach to multimodal understanding. Abstract: Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models are unable to correct errors in generated tokens, meaning they lack self-correction capability. In this paper, we propose Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability through two novel mechanisms. The first is Introspection Training, where an Introspection Model is introduced to identify errors within generated sequences. Introspection Training enables the model to detect not only grammatical and spelling mistakes, but more importantly, logical errors. The second is Recursive Inference. Beginning with the standard unmasking step, the learned Introspection Model helps to identify errors in the output sequence and remask them. This alternating ($\text{unmask}\rightarrow\text{introspection}\rightarrow\text{remask}$) process is repeated recursively until reliable results are obtained. Experimental results on multiple benchmarks demonstrate that the proposed RIV achieves state-of-the-art performance, outperforming most existing MDVLMs.
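
The unmask, introspection, remask cycle described in the RIV abstract can be sketched as a small control loop. This is only an illustration of the control flow, not the authors' implementation: the `unmask` and `introspect` callables, the `MASK` sentinel, and the `max_rounds` cap are hypothetical stand-ins for the diffusion model and the Introspection Model.

```python
from typing import Callable, List, Optional

MASK = None  # hypothetical sentinel marking a masked position


def recursive_inference(
    length: int,
    unmask: Callable[[List[Optional[str]]], List[str]],
    introspect: Callable[[List[str]], List[int]],
    max_rounds: int = 8,  # assumed safety cap; the abstract gives no stopping budget
) -> List[str]:
    """Alternate unmask -> introspection -> remask until no errors are flagged."""
    tokens: List[Optional[str]] = [MASK] * length
    for _ in range(max_rounds):
        filled = unmask(tokens)      # fill every masked position
        bad = introspect(filled)     # indices the introspection step flags
        if not bad:
            return filled            # nothing flagged: accept the sequence
        tokens = list(filled)
        for i in bad:                # remask only the flagged positions
            tokens[i] = MASK
    return unmask(tokens)            # round budget exhausted: return last fill


# Toy stand-ins: the first unmask pass makes one mistake, later passes do not.
TARGET = ["a", "cat", "sat"]
calls = {"n": 0}


def toy_unmask(tokens):
    calls["n"] += 1
    draft = ["a", "dog", "sat"] if calls["n"] == 1 else TARGET
    return [draft[i] if t is MASK else t for i, t in enumerate(tokens)]


def toy_introspect(tokens):
    return [i for i, t in enumerate(tokens) if t != TARGET[i]]


result = recursive_inference(3, toy_unmask, toy_introspect)
```

In this toy run only the flagged position is regenerated in the second round; positions that passed introspection are kept as they are.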

[314] Efficient Domain-Adaptive Multi-Task Dense Prediction with Vision Foundation Models

Beomseok Kang,Niluthpol Chowdhury Mithun,Mikhail Sizintsev,Han-Pang Chiu,Supun Samarasekera

Main category: cs.CV

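TL;DR: This paper proposes FAMDA, a multi-task unsupervised domain adaptation framework built on vision foundation models; it uses a self-training paradigm to generate high-quality pseudo-labels, achieves state-of-the-art performance, and suits resource-constrained robotics applications.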

Details

Motivation: Existing multi-task unsupervised domain adaptation methods rely mainly on adversarial learning, which is less effective than recent self-training techniques, and models suffer from domain shift in new environments. Method: Foundation models for semantic segmentation and depth estimation are integrated into a self-training framework, where they act as teachers that generate pseudo-labels for the unlabeled target domain, distilling their generalization ability into a single efficient student network. Result: FAMDA achieves SOTA performance on standard synthetic-to-real multi-task UDA benchmarks and on a new day-to-night adaptation task; a lightweight variant is more than 10 times smaller than the foundation models while maintaining high accuracy. Conclusion: FAMDA bridges the gap between adversarial learning and self-training in multi-task UDA, using vision foundation models to achieve efficient, robust domain adaptation suited to resource-constrained settings such as robotics. Abstract: Multi-task dense prediction, which aims to jointly solve tasks like semantic segmentation and depth estimation, is crucial for robotics applications but suffers from domain shift when deploying models in new environments. While unsupervised domain adaptation (UDA) addresses this challenge for single tasks, existing multi-task UDA methods primarily rely on adversarial learning approaches that are less effective than recent self-training techniques. In this paper, we introduce FAMDA, a simple yet effective UDA framework that bridges this gap by leveraging Vision Foundation Models (VFMs) as powerful teachers. Our approach integrates Segmentation and Depth foundation models into a self-training paradigm to generate high-quality pseudo-labels for the target domain, effectively distilling their robust generalization capabilities into a single, efficient student network. Extensive experiments show that FAMDA achieves state-of-the-art (SOTA) performance on standard synthetic-to-real UDA multi-task learning (MTL) benchmarks and a challenging new day-to-night adaptation task. Our framework enables the training of highly efficient models; a lightweight variant achieves SOTA accuracy while being more than 10$\times$ smaller than foundation models, highlighting FAMDA's suitability for creating domain-adaptive and efficient models for resource-constrained robotics applications.

[315] MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing

Ruibing Hou,Mingshuang Luo,Hongyu Pan,Hong Chang,Shiguang Shan

Main category: cs.CV

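TL;DR: This paper proposes MotionVerse, a unified framework that uses large language models (LLMs) to comprehend, generate, and edit human motion in single-person and multi-person scenarios. A motion tokenizer and a Delay Parallel modeling strategy capture multi-stream motion features, a dual-tower architecture alleviates modality interference, and experiments demonstrate superior performance across a range of motion-related tasks.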

Details

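Motivation: Efficient comprehension and generation of human motion, especially across diverse single-person and multi-person scenarios, calls for a unified framework that integrates LLM capabilities while addressing motion data representation, cross-modal interference, and computational efficiency. Method: A motion tokenizer with residual quantization converts continuous motion sequences into multi-stream discrete tokens; a Delay Parallel modeling strategy encodes the residual token streams in a temporally staggered manner; and a dual-tower architecture with modality-specific parameters reduces interference between motion and language. Result: Ablation studies show that every component of MotionVerse is effective, and extensive experiments show superior performance across a range of motion-related tasks with computational efficiency comparable to single-stream modeling. Conclusion: MotionVerse combines LLMs with human motion processing; its multi-stream tokenization, Delay Parallel modeling, and dual-tower design improve motion comprehension and generation while keeping computation efficient, making it applicable to a wide range of human motion analysis and synthesis tasks. Abstract: This paper proposes MotionVerse, a unified framework that harnesses the capabilities of Large Language Models (LLMs) to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. To efficiently represent motion data, we employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. Furthermore, we introduce a \textit{Delay Parallel} Modeling strategy, which temporally staggers the encoding of residual token streams. This design enables LLMs to effectively capture inter-stream dependencies while maintaining computational efficiency comparable to single-stream modeling. Moreover, to alleviate modality interference between motion and language, we design a \textit{dual-tower architecture} with modality-specific parameters, ensuring stable integration of motion information for both comprehension and generation tasks. Comprehensive ablation studies demonstrate the effectiveness of each component in MotionVerse, and extensive experiments showcase its superior performance across a wide range of motion-relevant tasks.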
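
One way to picture the temporal staggering of residual token streams is a delay layout in which stream k is shifted right by k steps, so that at any step the model sees coarser streams slightly ahead of finer ones while the sequence length grows only by the number of streams. The sketch below assumes a one-step offset per stream and a hypothetical `PAD` id; the abstract does not state the exact offsets MotionVerse uses.

```python
from typing import List

PAD = -1  # hypothetical padding id for positions with no token


def delay_streams(streams: List[List[int]]) -> List[List[int]]:
    """Shift residual stream k right by k steps (assumed one-step offset)."""
    num_streams = len(streams)
    length = len(streams[0])
    total = length + num_streams - 1  # staggered timeline length
    return [
        [PAD] * k + list(s) + [PAD] * (total - length - k)
        for k, s in enumerate(streams)
    ]


def undelay_streams(delayed: List[List[int]]) -> List[List[int]]:
    """Invert delay_streams, recovering the time-aligned token streams."""
    num_streams = len(delayed)
    length = len(delayed[0]) - (num_streams - 1)
    return [row[k:k + length] for k, row in enumerate(delayed)]


# Three residual streams over three time steps.
streams = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
delayed = delay_streams(streams)
restored = undelay_streams(delayed)
```

With S streams of length T, the staggered layout has T + S - 1 steps rather than the S * T steps a fully flattened sequence would need, which is consistent with the abstract's claim of efficiency comparable to single-stream modeling.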

[316] LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders

Boyu Han,Qianqian Xu,Shilong Bao,Zhiyong Yang,Kangli Zi,Qingming Huang

Main category: cs.CV

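TL;DR: This paper proposes LightFair, a lightweight method that achieves fair text-to-image diffusion models by fine-tuning text embeddings; it reduces the training and sampling burden and achieves state-of-the-art debiasing on Stable Diffusion v1.5.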

Details

Motivation: Existing debiasing methods usually require full-parameter training or rely on auxiliary networks, incurring heavy training or sampling overhead with unsatisfactory results. This paper aims to mitigate the bias introduced by the text encoder in a lighter, more efficient way. Method: A collaborative distance-constrained debiasing strategy fine-tunes text embeddings to balance the embedding distances of different attributes in CLIP space, and a two-stage text-guided sampling strategy limits when the debiased text encoder intervenes so that generation quality is preserved. Result: On Stable Diffusion v1.5, the method achieves SOTA debiasing at only 1/4 of the training cost, with virtually no increase in sampling burden. Conclusion: By focusing on fine-tuning the text encoder, LightFair provides an efficient, low-overhead approach to fair text-to-image generation that balances debiasing performance and generation quality. Abstract: This paper explores a novel lightweight approach LightFair to achieve fair text-to-image diffusion models (T2I DMs) by addressing the adverse effects of the text encoder. Most existing methods either couple different parts of the diffusion model for full-parameter training or rely on auxiliary networks for correction. They incur heavy training or sampling burden and unsatisfactory performance. Since T2I DMs consist of multiple components, with the text encoder being the most fine-tunable and front-end module, this paper focuses on mitigating bias by fine-tuning text embeddings. To validate feasibility, we observe that the text encoder's neutral embedding output shows substantial skewness across image embeddings of various attributes in the CLIP space. More importantly, the noise prediction network further amplifies this imbalance. To finetune the text embedding, we propose a collaborative distance-constrained debiasing strategy that balances embedding distances to improve fairness without auxiliary references. However, mitigating bias can compromise the original generation quality. To address this, we introduce a two-stage text-guided sampling strategy to limit when the debiased text encoder intervenes. Extensive experiments demonstrate that LightFair is effective and efficient. Notably, on Stable Diffusion v1.5, our method achieves SOTA debiasing at just $1/4$ of the training burden, with virtually no increase in sampling burden. The code is available at https://github.com/boyuh/LightFair.

[317] EfficientMIL: Efficient Linear-Complexity MIL Method for WSI Classification

Chengying She,Ben Wang,Xinran Zhang,Dongjie Fan,Jialu Zhang,Chengwei Chen,Lizhuang Liu

Main category: cs.CV

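TL;DR: This paper proposes EfficientMIL, a linear-complexity multiple instance learning method for whole slide image classification; an Adaptive Patch Selector (APS) module and efficient sequence models (GRU, LSTM, Mamba) replace conventional self-attention, yielding better performance and computational efficiency than existing methods on multiple pathology datasets.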

Details

Motivation: Existing attention-based multiple instance learning methods have high (quadratic) computational complexity when processing large numbers of image patches, which limits their efficiency and scalability for whole slide image classification. Method: EfficientMIL replaces the self-attention mechanism of Transformers with linear-complexity sequence models (GRU, LSTM, and Mamba) and introduces an Adaptive Patch Selector (APS) to improve patch selection. Result: On TCGA-Lung, EfficientMIL-Mamba reaches an AUC of 0.976 and an accuracy of 0.933; on CAMELYON16, EfficientMIL-GRU reaches an AUC of 0.990 and an accuracy of 0.975, both surpassing existing SOTA methods. APS also outperforms conventional patch selection strategies. Conclusion: With linear-complexity models and adaptive patch selection, EfficientMIL significantly improves both the computational efficiency and the classification performance of whole slide image classification, offering a more efficient solution for large-scale pathology image analysis. Abstract: Whole slide image (WSI) classification represents a fundamental challenge in computational pathology, where multiple instance learning (MIL) has emerged as the dominant paradigm. Current state-of-the-art (SOTA) MIL methods rely on attention mechanisms, achieving good performance but requiring substantial computational resources due to quadratic complexity when processing hundreds of thousands of patches. To address this computational bottleneck, we introduce EfficientMIL, a novel linear-complexity MIL approach for WSI classification with a patch selection module that we designed, the Adaptive Patch Selector (APS), replacing the quadratic-complexity self-attention mechanisms in Transformer-based MIL methods with efficient sequence models including RNN-based GRU, LSTM, and State Space Model (SSM) Mamba. EfficientMIL achieves significant computational efficiency improvements while outperforming other MIL methods across multiple histopathology datasets. On the TCGA-Lung dataset, EfficientMIL-Mamba achieved an AUC of 0.976 and an accuracy of 0.933, while on the CAMELYON16 dataset, EfficientMIL-GRU achieved an AUC of 0.990 and an accuracy of 0.975, surpassing previous state-of-the-art methods. Extensive experiments demonstrate that APS is also more effective for patch selection than conventional selection strategies.

[318] From Static to Dynamic: a Survey of Topology-Aware Perception in Autonomous Driving

Yixiao Chen,Ruining Yang,Xin Chen,Jia He,Dongliang Xu,Yue Yao

Main category: cs.CV

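TL;DR: This survey reviews topology-aware perception, the key to achieving autonomous driving, and analyzes four research directions: vectorized map construction, topological structure modeling, prior knowledge fusion, and language model-based perception. It identifies a shift from static, pre-built maps to dynamic, sensor-driven perception.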

Details

Motivation: Traditional static maps are costly, hard to update, and generalize poorly, so they cannot meet autonomous driving's need for real-time operation and scalability. Method: The survey systematically reviews and analyzes four core research directions in topology-aware perception and summarizes their role in the shift toward dynamic perception. Result: Dynamic perception methods use on-board sensors for real-time map construction and topology reasoning, with progress in spatial modeling, semantic relational reasoning, knowledge fusion, and multimodal understanding. Conclusion: The dynamic, sensor-driven perception paradigm is more adaptive, scalable, and explainable, and represents the future direction of autonomous driving. Abstract: The key to achieving autonomous driving lies in topology-aware perception, the structured understanding of the driving environment with an emphasis on lane topology and road semantics. This survey systematically reviews four core research directions under this theme: vectorized map construction, topological structure modeling, prior knowledge fusion, and language model-based perception. Across these directions, we observe a unifying trend: a paradigm shift from static, pre-built maps to dynamic, sensor-driven perception. Specifically, traditional static maps have provided semantic context for autonomous systems. However, they are costly to construct, difficult to update in real time, and lack generalization across regions, limiting their scalability. In contrast, dynamic representations leverage on-board sensor data for real-time map construction and topology reasoning. Each of the four research directions contributes to this shift through compact spatial modeling, semantic relational reasoning, robust domain knowledge integration, and multimodal scene understanding powered by pre-trained language models. Together, they pave the way for more adaptive, scalable, and explainable autonomous driving systems.

[319] Griffin: Generative Reference and Layout Guided Image Composition

Aryan Mikaeili,Amirhossein Alimohammadi,Negar Hassanpour,Ali Mahdavi-Amiri,Andrea Tagliasacchi

Main category: cs.CV

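TL;DR: Proposes a training-free method for multi-image layout control, in which content is specified through images rather than text and the placement of object- and part-level compositions is controlled precisely.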

Details

Motivation: Text-only control gives insufficient control over where content is placed in a generated image, so more explicit guidance is needed for fine-grained control. Method: Images serve as reference inputs, with a single image required per reference; a training-free approach provides multi-image layout control with explicit, simple control at the object and part level. Result: The method's effectiveness is demonstrated on a variety of image composition tasks. Conclusion: The method achieves training-free image layout control, improving the controllability of text-to-image models in composition tasks where placement matters. Abstract: Text-to-image models have achieved a level of realism that enables the generation of highly convincing images. However, text-based control can be a limiting factor when more explicit guidance is needed. Defining both the content and its precise placement within an image is crucial for achieving finer control. In this work, we address the challenge of multi-image layout control, where the desired content is specified through images rather than text, and the model is guided on where to place each element. Our approach is training-free, requires a single image per reference, and provides explicit and simple control for object and part-level composition. We demonstrate its effectiveness across various image composition tasks.

[320] Sparse-Up: Learnable Sparse Upsampling for 3D Generation with High-Fidelity Textures

Lu Xiao,Jiale Zhang,Yang Liu,Taicheng Huang,Xin Tian

Main category: cs.CV

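TL;DR: This paper proposes Sparse-Up, an efficient, high-fidelity texture modeling framework that uses sparse voxels to guide texture reconstruction and combines surface anchoring with view-domain partitioning to break through resolution limits while maintaining multi-view consistency and preserving high-frequency details.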

Details

Motivation: Existing methods lose high-frequency detail when texturing 3D assets and often must trade cross-view consistency against resolution, resulting in torn textures or a resolution ceiling imposed by explicit voxels. Method: Sparse voxels guide texture reconstruction to ensure multi-view consistency; a surface anchoring strategy constrains voxels to the mesh surface, eliminating over 70% of redundant voxels; and an image patch-guided voxel partitioning scheme implements view-domain partitioning, supervising and back-propagating gradients only on visible local patches. Result: Memory consumption during high-resolution voxel training is significantly reduced while geometric consistency is maintained and high-frequency texture details are preserved. Conclusion: Without sacrificing geometric consistency, Sparse-Up addresses the memory and detail-preservation challenges of high-resolution texture modeling, enabling high-quality, efficient 3D texture reconstruction. Abstract: The creation of high-fidelity 3D assets is often hindered by a 'pixel-level pain point': the loss of high-frequency details. Existing methods often trade off one aspect for another: either sacrificing cross-view consistency, resulting in torn or drifting textures, or remaining trapped by the resolution ceiling of explicit voxels, forfeiting fine texture detail. In this work, we propose Sparse-Up, a memory-efficient, high-fidelity texture modeling framework that effectively preserves high-frequency details. We use sparse voxels to guide texture reconstruction and ensure multi-view consistency, while leveraging surface anchoring and view-domain partitioning to break through resolution constraints. Surface anchoring employs a learnable upsampling strategy to constrain voxels to the mesh surface, eliminating over 70% of redundant voxels present in traditional voxel upsampling. View-domain partitioning introduces an image patch-guided voxel partitioning scheme, supervising and back-propagating gradients only on visible local patches. Through these two strategies, we can significantly reduce memory consumption during high-resolution voxel training without sacrificing geometric consistency, while preserving high-frequency details in textures.

[321] Color-Pair Guided Robust Zero-Shot 6D Pose Estimation and Tracking of Cluttered Objects on Edge Devices

Xingjian Yang,Ashis G. Banerjee

Main category: cs.CV

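TL;DR: Proposes a unified framework for edge devices that uses a shared, lighting-invariant color-pair feature representation to achieve robust 6D pose estimation and efficient real-time tracking of novel objects under challenging illumination.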

Details

Motivation: Under challenging illumination, existing methods struggle to combine accurate initial pose estimation with efficient real-time tracking, particularly on edge devices with limited compute. Method: A unified framework combines a robust RGB-D-based initial estimation module with a fast motion-based tracker; both share a lighting-invariant color-pair feature representation that keeps initial registration and temporal matching consistent. Result: Experiments on benchmark datasets show that the method delivers competitive pose estimation accuracy together with high-fidelity tracking, and remains robust even under abrupt pose changes. Conclusion: The proposed framework strikes a good balance between efficiency and robustness on edge devices and suits 6D pose estimation of novel objects under challenging illumination. Abstract: Robust 6D pose estimation of novel objects under challenging illumination remains a significant challenge, often requiring a trade-off between accurate initial pose estimation and efficient real-time tracking. We present a unified framework explicitly designed for efficient execution on edge devices, which synergizes a robust initial estimation module with a fast motion-based tracker. The key to our approach is a shared, lighting-invariant color-pair feature representation that forms a consistent foundation for both stages. For initial estimation, this feature facilitates robust registration between the live RGB-D view and the object's 3D mesh. For tracking, the same feature logic validates temporal correspondences, enabling a lightweight model to reliably regress the object's motion. Extensive experiments on benchmark datasets demonstrate that our integrated approach is both effective and robust, providing competitive pose estimation accuracy while maintaining high-fidelity tracking even through abrupt pose changes.

[322] ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis

Congzhi Zhang,Zhibin Wang,Yinchao Ma,Jiawei Peng,Yihan Wang,Qiang Zhou,Jun Song,Bo Zheng

Main category: cs.CV

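TL;DR: This paper introduces the ReWatch dataset and the ReWatch-R1 model, which improve video reasoning through a multi-stage synthesis pipeline and a novel reward mechanism.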

Details

Motivation: Existing datasets lack complex multi-hop questions and high-quality, video-grounded chain-of-thought data, which limits the application of reinforcement learning with verifiable rewards to complex video reasoning. Method: The authors introduce the ReWatch dataset and a multi-stage synthesis pipeline, use a Multi-Agent ReAct framework to generate chains of thought, and design an Observation & Reasoning (O&R) reward mechanism for model training. Result: ReWatch-R1 achieves state-of-the-art average performance on five challenging video reasoning benchmarks. Conclusion: The ReWatch dataset and the O&R reward mechanism advance complex video reasoning with LVLMs. Abstract: While Reinforcement Learning with Verifiable Reward (RLVR) significantly advances image reasoning in Large Vision-Language Models (LVLMs), its application to complex video reasoning remains underdeveloped. This gap stems primarily from a critical data bottleneck: existing datasets lack the challenging, multi-hop questions and high-quality, video-grounded Chain-of-Thought (CoT) data necessary to effectively bootstrap RLVR. To address this, we introduce ReWatch, a large-scale dataset built to foster advanced video reasoning. We propose a novel multi-stage synthesis pipeline to synthesize its three components: ReWatch-Caption, ReWatch-QA, and ReWatch-CoT. A core innovation is our Multi-Agent ReAct framework for CoT synthesis, which simulates a human-like "re-watching" process to generate video-grounded reasoning traces by explicitly modeling information retrieval and verification. Building on this dataset, we develop ReWatch-R1 by post-training a strong baseline LVLM with Supervised Fine-Tuning (SFT) and our RLVR framework. This framework incorporates a novel Observation \& Reasoning (O\&R) reward mechanism that evaluates both the final answer's correctness and the reasoning's alignment with video content, directly penalizing hallucination. Our experiments show that ReWatch-R1 achieves state-of-the-art average performance on five challenging video reasoning benchmarks.

[323] LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Xiang An,Yin Xie,Kaicheng Yang,Wenkang Zhang,Xiuwei Zhao,Zheng Cheng,Yirui Wang,Songcen Xu,Changrui Chen,Chunsheng Wu,Huajie Tan,Chunyuan Li,Jing Yang,Jie Yu,Xiyao Wang,Bin Qin,Yumeng Wang,Zizhen Yan,Ziyong Feng,Ziwei Liu,Bo Li,Jiankang Deng

Main category: cs.CV

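TL;DR: LLaVA-OneVision-1.5 is a family of high-performance, low-cost open-source multimodal models for building high-quality vision-language models from scratch; the release includes large-scale datasets and an efficient training framework, and the models surpass comparable existing models on most of the 27 benchmarks evaluated.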

Details

Motivation: Existing vision-language models are costly to train and hard to reproduce, and open, efficient frameworks are lacking. LLaVA-OneVision-1.5 aims to provide a reproducible, high-performance open-source solution at lower cost. Method: The authors build an 85M concept-balanced pretraining dataset and a 26M instruction-tuning dataset, and use an offline parallel data packing strategy for efficient end-to-end training, completing model training within a $16,000 budget. Result: LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. Conclusion: LLaVA-OneVision-1.5 provides an efficient, reproducible paradigm for vision-language modeling that achieves leading performance at significantly lower training cost, advancing open-source multimodal models. Abstract: We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Different from the existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset LLaVA-OneVision-1.5-Mid-Training and a meticulously curated 26M instruction dataset LLaVA-OneVision-1.5-Instruct, collectively encompassing 64B compressed multimodal tokens. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.

[324] HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score

Jingqi Xu,Jingxi Lu,Chenghao Li,Sreetama Sarkar,Peter A. Beerel

Main category: cs.CV

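TL;DR: This paper proposes HIVTP, a training-free hierarchical visual token pruning method that estimates token importance from attention maps in the middle layers of the vision encoder, significantly improving the inference efficiency of vision-language models while maintaining or even improving accuracy.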

Details

Motivation: Vision-language models are capable, but the large number of visual tokens makes inference inefficient, and many existing methods fail to separate important from unimportant tokens effectively, so an efficient, training-free pruning method is needed. Method: HIVTP computes token importance scores from attention maps extracted from the middle layers of the vision encoder. It reshapes the 1-D token sequence into a 2-D spatial layout to preserve spatial structure, then prunes hierarchically: it first retains high-importance tokens within each global region, then retains the most important token within each small local window. Result: On LLaVA-v1.5-7B and LLaVA-Next-7B, HIVTP reduces time-to-first-token (TTFT) by up to 50.0% and 55.1% and improves generation throughput by up to 60.9% and 47.3%, respectively, without loss of accuracy and with gains on some benchmarks. Conclusion: HIVTP is an efficient, training-free visual token pruning method; by combining middle-layer attention with hierarchical retention, it significantly speeds up inference while maintaining or improving model performance, outperforming prior methods. Abstract: Vision-Language Models (VLMs) have shown strong capabilities on diverse multimodal tasks. However, the large number of visual tokens output by the vision encoder severely hinders inference efficiency, and prior studies have shown that many of these tokens are not important and can therefore be safely pruned. In this work, we propose HIVTP, a training-free method to improve VLM efficiency via hierarchical visual token pruning using a novel middle-layer-based importance score. Specifically, we utilize attention maps extracted from the middle layers of the vision encoder, which better reflect fine-grained and object-level attention, to estimate visual token importance. Based on this, we propose a hierarchical visual token pruning method to retain both globally and locally important visual tokens. Specifically, we reshape the 1-D visual token sequence output by the vision encoder into a 2-D spatial layout. In the global retaining stage, we divide the image into regions and retain tokens with higher importance scores in each region; in the local retaining stage, we then divide the image into small windows and retain the most important token in each local window. Experimental results show that our proposed method, HIVTP, can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively, and improve the token generation throughput by up to 60.9% and 47.3%, without sacrificing accuracy, and even achieving improvements on certain benchmarks. Compared with prior works, HIVTP achieves better accuracy while offering higher inference efficiency.
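
The two retaining stages in the HIVTP abstract can be sketched over a flat list of importance scores laid out on a 2-D grid. This is a simplified illustration under stated assumptions: the scores are taken as given (the paper derives them from middle-layer attention maps), the parameters `region`, `keep_per_region`, and `window` are hypothetical names, and taking the union of the two stages is an assumption about how their results are combined.

```python
from typing import Iterator, List


def block_indices(height: int, width: int, size: int) -> Iterator[List[int]]:
    """Yield the flat token indices of each size x size block of the grid."""
    for r0 in range(0, height, size):
        for c0 in range(0, width, size):
            yield [
                r * width + c
                for r in range(r0, min(r0 + size, height))
                for c in range(c0, min(c0 + size, width))
            ]


def hierarchical_prune(
    scores: List[float],
    height: int,
    width: int,
    region: int,           # side length of a global region (assumed parameter)
    keep_per_region: int,  # tokens kept per global region (assumed parameter)
    window: int,           # side length of a local window (assumed parameter)
) -> List[int]:
    """Return sorted indices of tokens kept by the global and local stages."""
    assert len(scores) == height * width
    kept = set()
    # Global stage: keep the highest-scoring tokens in each large region.
    for block in block_indices(height, width, region):
        ranked = sorted(block, key=lambda i: scores[i], reverse=True)
        kept.update(ranked[:keep_per_region])
    # Local stage: keep the single most important token in each small window.
    for block in block_indices(height, width, window):
        kept.add(max(block, key=lambda i: scores[i]))
    return sorted(kept)


# 4 x 4 grid whose score equals the token index, one global region,
# two tokens kept globally, and 2 x 2 local windows.
scores = [float(i) for i in range(16)]
kept = hierarchical_prune(scores, 4, 4, region=4, keep_per_region=2, window=2)
```

Here 5 of 16 tokens survive: the local stage guarantees every 2 x 2 window keeps a representative, so spatial coverage is preserved even where global scores are low.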

[325] Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding

Xixi Jiang,Chen Yang,Dong Zhang,Pingcheng Dong,Xin Yang,Kwang-Ting Cheng

Main category: cs.CV

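TL;DR: Proposes STIM-TM, a spatiotemporal information mining token merging method for surgical video understanding; decoupled redundancy reduction along the temporal and spatial dimensions significantly improves model efficiency while preserving competitive accuracy.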

Details

Motivation: Existing Vision Transformers incur heavy computational cost on surgical video, and current token merging methods do not adequately account for the spatiotemporal structure of video data or the heterogeneous distribution of information, leading to suboptimal performance. Method: STIM-TM uses a decoupled strategy: along the temporal dimension, it merges spatially corresponding tokens from consecutive frames using saliency weighting; along the spatial dimension, it uses temporal stability analysis to prioritize merging static tokens, protecting dynamic surgical regions. The method is training-free. Result: It reduces GFLOPs by over 65% while preserving competitive accuracy across multiple surgical video tasks, and it supports efficient training on long-sequence surgical videos. Conclusion: STIM-TM is the first token merging method designed specifically for surgical video understanding; it balances efficiency and performance and addresses the computational bottleneck in surgical video analysis. Abstract: Vision Transformer models have shown impressive effectiveness in surgical video understanding tasks through long-range dependency modeling. However, current methods suffer from prohibitive computational costs due to processing massive spatiotemporal tokens across video frames. While prior work on token merging has advanced model efficiency, it fails to adequately consider the inherent spatiotemporal structure of video data and overlooks the heterogeneous nature of information distribution, leading to suboptimal performance. In this paper, we propose a spatiotemporal information mining token merging (STIM-TM) method, representing the first dedicated approach for surgical video understanding. STIM-TM introduces a decoupled strategy that reduces token redundancy along temporal and spatial dimensions independently. Specifically, the temporal component merges spatially corresponding tokens from consecutive frames using saliency weighting, preserving critical sequential information and maintaining continuity. Meanwhile, the spatial component prioritizes merging static tokens through temporal stability analysis, protecting dynamic regions containing essential surgical information. Operating in a training-free manner, STIM-TM achieves significant efficiency gains with over $65\%$ GFLOPs reduction while preserving competitive accuracy across comprehensive surgical video tasks. Our method also supports efficient training of long-sequence surgical videos, addressing computational bottlenecks in surgical applications.
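
The temporal component described above, merging spatially corresponding tokens from consecutive frames with saliency weighting, might look like the following in its simplest form. The sketch assumes the saliency values are already available and merges every position of a frame pair; how STIM-TM computes saliency and chooses which tokens to merge is not stated in the abstract, so both are placeholders here.

```python
from typing import List

Token = List[float]


def merge_frame_pair(
    frame_a: List[Token],
    frame_b: List[Token],
    sal_a: List[float],  # per-token saliency for frame_a (assumed given)
    sal_b: List[float],  # per-token saliency for frame_b (assumed given)
) -> List[Token]:
    """Merge tokens at the same spatial index by a saliency-weighted mean."""
    merged = []
    for tok_a, tok_b, wa, wb in zip(frame_a, frame_b, sal_a, sal_b):
        total = wa + wb
        if total == 0:          # no saliency signal: fall back to a plain mean
            wa, wb, total = 0.5, 0.5, 1.0
        merged.append([(wa * x + wb * y) / total for x, y in zip(tok_a, tok_b)])
    return merged


# Two frames with two tokens each; the more salient token dominates the merge.
frame_a = [[0.0, 0.0], [2.0, 2.0]]
frame_b = [[4.0, 8.0], [2.0, 6.0]]
merged = merge_frame_pair(frame_a, frame_b, [1.0, 3.0], [3.0, 1.0])
```

Merging a frame pair this way halves the number of tokens along the temporal axis while letting the more salient observation contribute more to the merged representation.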

[326] RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks

Amit Agarwal,Hitesh Laxmichand Patel,Srikant Panda,Hansa Meghwani,Jyotika Singh,Karan Dua,Paul Li,Tao Sheng,Sujith Ravi,Dan Roth

Main category: cs.CV

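TL;DR: This paper proposes the Region Comprehension Index (RCI), which quantifies how much a multimodal dataset relies on global versus local visual information; most existing benchmarks are found to favor localized reasoning and exhibit spatial biases, which may affect real-world applications.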

Details

Motivation: Existing evaluation methods do not explicitly distinguish whether a model has genuine global reasoning ability, making it hard to judge model reliability in real-world settings. Method: RCI compares a reference model's performance on image patches with its performance on full images to systematically measure a dataset's reliance on global or local visual information. Result: Applying RCI to 13 widely used multimodal benchmarks shows that most exhibit significant spatial biases, with tasks solvable from local visual cues alone and no need for holistic understanding. Conclusion: RCI provides an effective tool for diagnosing and mitigating biases in multimodal benchmarks, supporting the construction of more robust, application-oriented multimodal systems. Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive results on vision-language benchmarks, yet it remains unclear whether these benchmarks assess genuine global reasoning or allow success via localized visual cues. Existing evaluation methods do not explicitly measure this distinction, hindering effective dataset curation and real-world focused model development. We introduce Region Comprehension Index (RCI), the first model-based score to directly quantify a dataset's reliance on global versus local visual information. RCI systematically compares reference-model performance on image patches versus full images, revealing if tasks require holistic image understanding or can be solved with partial or localized visual cues. When applying RCI to 13 widely used multimodal benchmarks, we observed that most of them favor localized reasoning and exhibit significant spatial biases, indicating potential risks in real-world applications. RCI equips researchers and practitioners with an actionable tool for diagnosing and mitigating these biases, enabling the construction of datasets and benchmarks to foster the development of robust, enterprise-ready multimodal systems.

Dayu Tan,Ziwei Zhang,Yansan Su,Xin Peng,Yike Dai,Chunhou Zheng,Weimin Zhong

Main category: cs.CV

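TL;DR: Proposes MSD-KMamba, a new 3D multi-modal image segmentation framework that combines bidirectional spatial perception with multi-scale self-distillation, significantly improving segmentation accuracy and generalization while maintaining high computational efficiency.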

Details

Motivation: Existing CNN-Transformer hybrid models have high computational complexity because of global attention and struggle to balance segmentation accuracy with efficiency. Method: A bidirectional spatial-aware branch captures long-range spatial dependencies, combined with a multi-scale self-distilled fusion strategy that strengthens hierarchical feature representations. Result: On multiple benchmark datasets, MSD-KMamba outperforms state-of-the-art methods in segmentation accuracy, robustness, and generalization, with high computational efficiency and good scalability. Conclusion: MSD-KMamba mitigates the quadratic-complexity bottleneck in volumetric segmentation while strengthening global perception, offering a new approach to efficient, accurate medical image segmentation. Abstract: Numerous CNN-Transformer hybrid models rely on high-complexity global attention mechanisms to capture long-range dependencies, which introduces non-linear computational complexity and leads to significant resource consumption. Although knowledge distillation and sparse attention mechanisms can improve efficiency, they often fall short of delivering the high segmentation accuracy necessary for complex tasks. Balancing model performance with computational efficiency remains a critical challenge. In this work, we propose a novel 3D multi-modal image segmentation framework, termed MSD-KMamba, which integrates bidirectional spatial perception with multi-scale self-distillation. The bidirectional spatial aware branch effectively captures long-range spatial context dependencies across brain regions, while also incorporating a powerful nonlinear feature extraction mechanism that further enhances the model's ability to learn complex and heterogeneous patterns. In addition, the proposed multi-scale self-distilled fusion strategy strengthens hierarchical feature representations and improves the transfer of semantic information at different resolution levels. By jointly leveraging the bidirectional spatial perception branch and the multi-scale self-distilled fusion strategy, our framework effectively mitigates the bottleneck of quadratic computational complexity in volumetric segmentation, while simultaneously addressing the limitation of insufficient global perception. Extensive experiments on multiple standard benchmark datasets demonstrate that MSD-KMamba consistently outperforms state-of-the-art methods in segmentation accuracy, robustness, and generalization, while maintaining high computational efficiency and favorable scalability. The source code of MSD-KMamba is publicly available at https://github.com/daimao-zhang/MSD-KMamba.

[328] QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification

Weilun Feng,Chuanguang Yang,Haotong Qin,Mingqiang Wu,Yuqi Li,Xiangqi Li,Zhulin An,Libo Huang,Yulun Zhang,Michele Magno,Yongjun Xu

Main category: cs.CV

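TL;DR: Proposes QuantSparse, a framework that combines model quantization with attention sparsification; through multi-scale salient attention distillation and second-order sparse attention reparameterization, it compresses the model substantially while achieving better video generation quality than prior quantization baselines.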

Details

Motivation: Diffusion Transformers perform well at video generation but have high computational and memory costs; quantization or sparsification alone causes severe performance degradation under aggressive compression, so the two must be combined effectively. Method: QuantSparse introduces Multi-Scale Salient Attention Distillation to mitigate quantization-induced bias and Second-Order Sparse Attention Reparameterization, which exploits the temporal stability of second-order residuals to recover information lost under sparsity. Result: On HunyuanVideo-13B, QuantSparse achieves 20.88 PSNR, well above the 16.85 PSNR of Q-VDiT, with a 3.68$\times$ reduction in storage and a 1.88$\times$ speedup in end-to-end inference. Conclusion: QuantSparse makes quantization and sparsification work together, significantly improving efficiency while preserving generation quality better than prior compression baselines, and provides a practical route to deploying large models. Abstract: Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective. The sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose \textbf{QuantSparse}, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce \textit{Multi-Scale Salient Attention Distillation}, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop \textit{Second-Order Sparse Attention Reparameterization}, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a \textbf{3.68$\times$} reduction in storage and \textbf{1.88$\times$} acceleration in end-to-end inference. Our code will be released at https://github.com/wlfeng0509/QuantSparse.

[329] HomeSafeBench: A Benchmark for Embodied Vision-Language Models in Free-Exploration Home Safety Inspection

Siyuan Gao,Jiashu Yao,Haoyu Wen,Yuhang Guo,Zeming Liu,Heyan Huang

Main category: cs.CV

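TL;DR: Proposes HomeSafeBench, a benchmark with 12,900 data points for evaluating VLM-based embodied agents on home safety inspection tasks, addressing the oversimplification and static viewpoints of existing benchmarks.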

Details

Motivation: Existing benchmarks use textual descriptions and static viewpoints, so they cannot accurately evaluate embodied agents' safety inspection abilities in complex home environments. Method: Simulated home environments covering five common categories of home safety hazards provide dynamic first-person images and let agents explore freely to detect hazards from multiple angles. Result: Even the best-performing mainstream VLM achieves an F1-score of only 10.23% on HomeSafeBench, revealing significant weaknesses in hazard identification and exploration strategy. Conclusion: HomeSafeBench evaluates embodied agents' safety inspection abilities more realistically and provides a useful reference for future home safety research. Abstract: Embodied agents can identify and report safety hazards in home environments. Accurately evaluating their capabilities in home safety inspection tasks is crucial, but existing benchmarks suffer from two key limitations. First, they oversimplify safety inspection tasks by using textual descriptions of the environment instead of direct visual information, which hinders the accurate evaluation of embodied agents based on Vision-Language Models (VLMs). Second, they use a single, static viewpoint for environmental observation, which restricts the agents' free exploration and causes the omission of certain safety hazards, especially those that are occluded from a fixed viewpoint. To alleviate these issues, we propose HomeSafeBench, a benchmark with 12,900 data points covering five common home safety hazards: fire, electric shock, falling object, trips, and child safety. HomeSafeBench provides dynamic first-person perspective images from simulated home environments, enabling the evaluation of VLM capabilities for home safety inspection. By allowing the embodied agents to freely explore the room, HomeSafeBench provides multiple dynamic perspectives in complex environments for a more thorough inspection. Our comprehensive evaluation of mainstream VLMs on HomeSafeBench reveals that even the best-performing model achieves an F1-score of only 10.23%, demonstrating significant limitations in current VLMs. The models particularly struggle with identifying safety hazards and selecting effective exploration strategies. We hope HomeSafeBench will provide valuable reference and support for future research related to home security inspections. Our dataset and code will be publicly available soon.

[330] Confidence Aware SSD Ensemble with Weighted Boxes Fusion for Weapon Detection

Atharva Jadhav,Arush Karekar,Manas Divekar,Shachi Natu

Main category: cs.CV

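TL;DR: Proposes an ensemble of SSD models with diverse backbone networks, combined through Weighted Boxes Fusion (WBF), that improves the robustness and accuracy of weapon detection in complex environments, reaching an mAP of 0.838, a 2.948% relative improvement over the best single model.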

Details

Motivation: Single-model detectors are not robust enough for weapon detection under occlusion, lighting changes, and cluttered backgrounds, and so fall short of public safety needs. Method: Builds a diverse ensemble of SSD models with VGG16, ResNet50, EfficientNet, and MobileNetV3 backbones, fuses their predictions with Weighted Boxes Fusion (WBF), and compares different confidence scoring strategies. Result: WBF with the 'max' confidence strategy achieves an mAP of 0.838, a 2.948% relative improvement over the best single model, and outperforms the other fusion heuristics. Conclusion: The fusion strategy matters as much as model diversity; confidence-aware fusion is key to improving detection accuracy, and the approach can strengthen weapon detection in real-time surveillance. Abstract: The safety and security of public spaces is of vital importance, driving the need for sophisticated surveillance systems capable of accurately detecting weapons. Such systems are often hampered by partial occlusion, varying lighting, and cluttered backgrounds. While single-model detectors are advanced, they often lack robustness in these challenging conditions. This paper presents the hypothesis that an ensemble of Single Shot Multibox Detector (SSD) models with diverse feature extraction backbones can significantly enhance detection robustness. To leverage diverse feature representations, individual SSD models were trained using a selection of backbone networks: VGG16, ResNet50, EfficientNet, and MobileNetV3. The study is conducted on a dataset consisting of images of three distinct weapon classes: guns, heavy weapons and knives. The predictions from these models are combined using the Weighted Boxes Fusion (WBF) method, an ensemble technique designed to optimize bounding box accuracy. Our key finding is that the fusion strategy is as critical as the ensemble's diversity: a WBF approach using a 'max' confidence scoring strategy achieved a mean Average Precision (mAP) of 0.838. This represents a 2.948% relative improvement over the best-performing single model and consistently outperforms other fusion heuristics. This research offers a robust approach to enhancing real-time weapon detection capabilities in surveillance applications by demonstrating that confidence-aware fusion is a key mechanism for improving accuracy metrics of ensembles.
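
The fusion step can be illustrated with a simplified sketch of Weighted Boxes Fusion for a single class. This is not the authors' code: the IoU threshold of 0.55, the greedy clustering order, and the function names are assumptions for illustration. Full WBF implementations typically also rescale the fused confidence by how many models contributed to a cluster; that step is omitted here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_cluster(cluster, conf_type="max"):
    """Fuse one cluster of (box, score) pairs into a single box and score."""
    total = sum(score for _, score in cluster)
    # Coordinates are averaged with the confidences as weights.
    box = [sum(b[i] * s for b, s in cluster) / total for i in range(4)]
    scores = [s for _, s in cluster]
    conf = max(scores) if conf_type == "max" else sum(scores) / len(scores)
    return box, conf


def weighted_boxes_fusion(detections, iou_thr=0.55, conf_type="max"):
    """Greedily cluster detections pooled from all models, then fuse each cluster."""
    clusters = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        for cluster in clusters:
            fused_box, _ = fuse_cluster(cluster, conf_type)
            if iou(fused_box, box) >= iou_thr:
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    return [fuse_cluster(c, conf_type) for c in clusters]


# Hypothetical detections pooled from several SSD models.
detections = [([0, 0, 10, 10], 0.9), ([1, 1, 11, 11], 0.6), ([50, 50, 60, 60], 0.8)]
fused = weighted_boxes_fusion(detections, conf_type="max")
```

With `conf_type="max"` the two overlapping boxes merge into one box that keeps the higher score (0.9), whereas `"avg"` would lower it to 0.75; this difference is the kind of scoring choice the abstract compares.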

[331] INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception

Yunjiang Xu,Lingzhi Li,Jin Wang,Yupeng Ouyang,Benyuan Yang

Main category: cs.CV

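TL;DR: Proposes INSTINCT, a new collaborative perception framework that uses instance-level interaction, quality-aware filtering, and cross-agent feature fusion to substantially improve detection accuracy while reducing communication bandwidth.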

Details

Motivation: Existing collaborative perception systems fall short in bandwidth consumption and in their use of LiDAR data, with gaps remaining in instance-level interaction and real-world applicability. Method: Proposes the INSTINCT framework, comprising quality-aware filtering, dual-branch detection routing, and a Cross Agent Local Instance Fusion module, and improves the GT sampling technique to support training with diverse features. Result: On the DAIR-V2X and V2V4Real datasets, accuracy improves by 13.23% and 33.08%, respectively, while communication bandwidth drops to 1/281 and 1/264 of that of existing methods. Conclusion: INSTINCT balances performance against communication overhead, clearly outperforms existing methods, and advances LiDAR-based collaborative perception. Abstract: Collaborative perception systems overcome single-vehicle limitations in long-range detection and occlusion scenarios by integrating multi-agent sensory data, improving accuracy and safety. However, frequent cooperative interactions and real-time requirements impose stringent bandwidth constraints. Previous work shows that query-based instance-level interaction reduces bandwidth demands and manual priors; however, LiDAR-focused implementations in collaborative perception remain underdeveloped, with performance still trailing state-of-the-art approaches. To bridge this gap, we propose INSTINCT (INSTance-level INteraCtion ArchiTecture), a novel collaborative perception framework featuring three core components: 1) a quality-aware filtering mechanism for high-quality instance feature selection; 2) a dual-branch detection routing scheme to decouple collaboration-irrelevant and collaboration-relevant instances; and 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. Additionally, we enhance the ground truth (GT) sampling technique to facilitate training with diverse hybrid instance features. Extensive experiments across multiple datasets demonstrate that INSTINCT achieves superior performance. Specifically, our method improves accuracy by 13.23% and 33.08% on DAIR-V2X and V2V4Real, respectively, while reducing the communication bandwidth to 1/281 and 1/264 of that of state-of-the-art methods. The code is available at https://github.com/CrazyShout/INSTINCT.

[332] CrimEdit: Controllable Editing for Counterfactual Object Removal, Insertion, and Movement

Boseong Jeon,Junghyuk Lee,Jimin Park,Kwanyoung Kim,Jingi Jung,Sangwon Lee,Hyunbo Shim

Main category: cs.CV

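TL;DR: Proposes CrimEdit, a unified diffusion-based framework that jointly trains task embeddings for removal and insertion and applies classifier-free guidance to handle edits of objects and their effects (such as shadows and reflections), supporting controllable effect synthesis and object movement within a single denoising step.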

Details

Motivation: Existing methods for object removal and insertion still model object effects (such as shadows and reflections) inadequately, and the impact of classifier-free guidance on both tasks within a unified model has not been studied. Method: Proposes CrimEdit, which jointly trains task embeddings for removal and insertion and incorporates them into a classifier-free guidance scheme; the task prompts are extended so they can be applied to spatially distinct regions, enabling object movement within a single step. Result: Experiments show that CrimEdit outperforms existing methods in object removal, controllable effect insertion, and efficient object movement, without additional training or separate stages. Conclusion: Through unified modeling and guidance, CrimEdit improves the efficiency and quality of composite image editing and offers a more general and flexible solution for object editing. Abstract: Recent works on object removal and insertion have enhanced their performance by handling object effects such as shadows and reflections, using diffusion models trained on counterfactual datasets. However, the performance impact of applying classifier-free guidance to handle object effects across removal and insertion tasks within a unified model remains largely unexplored. To address this gap and improve efficiency in composite editing, we propose CrimEdit, which jointly trains the task embeddings for removal and insertion within a single model and leverages them in a classifier-free guidance scheme -- enhancing the removal of both objects and their effects, and enabling controllable synthesis of object effects during insertion. CrimEdit also extends these two task prompts to be applied to spatially distinct regions, enabling object movement (repositioning) within a single denoising step. Extensive experiments show that, by employing both guidance techniques, CrimEdit achieves superior object removal, controllable effect insertion, and efficient object movement without requiring additional training or separate removal and insertion stages.
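
For readers unfamiliar with classifier-free guidance, the standard combination rule is sketched below. The abstract does not state how CrimEdit feeds its removal and insertion task embeddings into this rule, so `eps_cond` here simply stands for a noise prediction made with some task embedding and `eps_uncond` for one made without it; both names and the flat-list representation are assumptions.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG: move from the unconditional prediction toward the
    conditional one, extrapolating past it when scale > 1."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "noise predictions"; scale = 1 reproduces eps_cond.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], scale=2.0)
```

A larger `scale` strengthens the effect of the condition, which is one plausible reading of how a guidance scheme could make the synthesis of shadows or reflections controllable.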

[333] PD-Diag-Net: Clinical-Priors guided Network on Brain MRI for Auxiliary Diagnosis of Parkinson's Disease

Shuai Shao,Shu Jiang,Shiyuan Zhao,Di Yang,Yan Wang,Yutong Bai,Jianguo Zhang,Jiangtao Wang

Main category: cs.CV

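TL;DR: Proposes PD-Diag-Net, an end-to-end automated method for Parkinson's disease diagnosis that combines MRI scans with clinical prior knowledge for high-accuracy auxiliary diagnosis.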

Details

Motivation: Parkinson's disease diagnosis currently depends on expert experience and a complex workflow, which delays early detection; an automated, accurate diagnostic method is needed. Method: Designs the PD-Diag-Net framework, which includes an MRI pre-processing module and introduces brain-region relevance and aging priors to build prior-guided feature aggregation and diagnosis modules that perform risk assessment and auxiliary diagnosis directly from raw MRI. Result: Achieves 86% accuracy on external tests and over 96% accuracy in early-stage diagnosis, outperforming existing advanced methods by more than 20%. Conclusion: PD-Diag-Net integrates medical prior knowledge with deep learning, markedly improving the accuracy and interpretability of Parkinson's disease diagnosis, and shows potential for clinical use. Abstract: Parkinson's disease (PD) is a common neurodegenerative disorder that severely diminishes patients' quality of life. Its global prevalence has increased markedly in recent decades. Current diagnostic workflows are complex and heavily reliant on neurologists' expertise, often resulting in delays in early detection and missed opportunities for timely intervention. To address these issues, we propose an end-to-end automated diagnostic method for PD, termed PD-Diag-Net, which performs risk assessment and auxiliary diagnosis directly from raw MRI scans. This framework first introduces an MRI Pre-processing Module (MRI-Processor) to mitigate inter-subject and inter-scanner variability by flexibly integrating established medical imaging preprocessing tools. It then incorporates two forms of clinical prior knowledge: (1) Brain-Region-Relevance-Prior (Relevance-Prior), which specifies brain regions strongly associated with PD; and (2) Brain-Region-Aging-Prior (Aging-Prior), which reflects the accelerated aging typically observed in PD-associated regions. Building on these priors, we design two dedicated modules: the Relevance-Prior Guided Feature Aggregation Module (Aggregator), which guides the model to focus on PD-associated regions at the inter-subject level, and the Age-Prior Guided Diagnosis Module (Diagnoser), which leverages brain age gaps as auxiliary constraints at the intra-subject level to enhance diagnostic accuracy and clinical interpretability. Furthermore, we collected external test data from our collaborating hospital. Experimental results show that PD-Diag-Net achieves 86% accuracy on external tests and over 96% accuracy in early-stage diagnosis, outperforming existing advanced methods by more than 20%.

[334] DiffPCN: Latent Diffusion Model Based on Multi-view Depth Images for Point Cloud Completion

Zijun Li,Hongyu Yan,Shijie Li,Kunming Luo,Li Lu,Xulei Yang,Weisi Lin

Main category: cs.CV

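TL;DR: Proposes DiffPCN, a diffusion-based coarse-to-fine framework for point cloud completion that achieves high-quality results through depth image generation, point denoising, and association-aware upsampling.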

Details

Motivation: The use of diffusion models for point cloud completion is limited by the unstructured, irregular nature of point clouds, and effective ways to exploit their generative power are lacking. Method: First projects the partial point cloud into structured depth maps and uses DepthLDM to generate completed multi-view depth maps that form a coarse point cloud; then removes artifacts with a point denoising network; finally upsamples with an association-aware upsampler that exploits local association features. Result: Reaches state-of-the-art performance in geometric accuracy and shape completeness, significantly improving the robustness and consistency of completion. Conclusion: DiffPCN combines diffusion models with point cloud processing to achieve high-quality point cloud completion, confirming the potential of LDMs for this task. Abstract: Latent diffusion models (LDMs) have demonstrated remarkable generative capabilities across various low-level vision tasks. However, their potential for point cloud completion remains underexplored due to the unstructured and irregular nature of point clouds. In this work, we propose DiffPCN, a novel diffusion-based coarse-to-fine framework for point cloud completion. Our approach comprises two stages: an initial stage for generating coarse point clouds, and a refinement stage that improves their quality through point denoising and upsampling. Specifically, we first project the unordered and irregular partial point cloud into structured depth images, which serve as conditions for a well-designed DepthLDM to synthesize completed multi-view depth images that are used to form coarse point clouds. In this way, our DiffPCN can yield high-quality and high-completeness coarse point clouds by leveraging LDMs' powerful generation and comprehension capabilities. Then, since LDMs inevitably introduce outliers into the generated depth maps, we design a Point Denoising Network to remove artifacts from the coarse point cloud by predicting a per-point distance score. Finally, we devise an Association-Aware Point Upsampler, which guides the upsampling process by leveraging local association features between the input point cloud and the corresponding coarse points, further yielding a dense and high-fidelity output. Experimental results demonstrate that our DiffPCN achieves state-of-the-art performance in geometric accuracy and shape completeness, significantly improving the robustness and consistency of point cloud completion.

[335] Video Panels for Long Video Understanding

Lars Doorenbos,Federico Spurio,Juergen Gall

Main category: cs.CV

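TL;DR: Proposes a training-free, parameter-free visual prompting strategy that combines multiple frames into a single image to improve the performance of existing video-language models on long-video understanding.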

Details

Motivation: Existing video-language models still perform worse on long-video understanding than on image or short-video tasks, and proposed improvements often add complexity and training cost. Method: Designs a new visual prompting strategy that combines multiple video frames as panels into one image, trading spatial detail for temporal resolution so that existing VLMs can handle long videos better. Result: The method's effectiveness and generality are validated on five benchmarks; on the TimeScope (Long) dataset, question-answering accuracy improves by up to 19.4%. Conclusion: The method needs no training, adds no parameters, works across models, and markedly improves existing VLMs on long-video understanding, raising the bar for the field. Abstract: Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4%. Overall, our method raises the bar for long video understanding models. We will make our code available upon acceptance.
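
The panel idea can be sketched as tiling frames into a grid. This is an illustrative toy, not the authors' code: frames are represented as nested lists of pixel values, the row-major ordering is an assumption, and the downscaling that would keep the panel within a model's input resolution is omitted.

```python
def make_panel(frames, rows, cols):
    """Tile rows*cols equally sized frames (lists of pixel rows) into one image,
    filling the grid in row-major (reading) order."""
    if len(frames) != rows * cols:
        raise ValueError("expected exactly rows * cols frames")
    height = len(frames[0])
    panel = []
    for r in range(rows):
        row_frames = frames[r * cols:(r + 1) * cols]
        for y in range(height):
            line = []
            for frame in row_frames:
                line.extend(frame[y])  # place frames side by side
            panel.append(line)
    return panel


# Four tiny 1x2 "frames" become one 2x4 panel image.
frames = [[[1, 1]], [[2, 2]], [[3, 3]], [[4, 4]]]
panel = make_panel(frames, rows=2, cols=2)
```

If the panel is resized to the model's usual input size, each frame gets a quarter of the pixels in a 2x2 grid, but one image slot now covers four time steps; that is the spatial-for-temporal trade the abstract describes.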

[336] M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation

Yiheng Zhang,Zhuojiang Cai,Mingdao Wang,Meitong Guo,Tianxiao Li,Li Lin,Yuwang Wang

Main category: cs.CV

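TL;DR: Introduces M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation containing 15,080 layouts and over 258k object instances, combining real-world scans, CAD designs, and procedurally generated scenes with detailed text descriptions, to improve the training and performance of text-driven 3D scene generation models.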

Details

Motivation: Existing datasets for 3D indoor layout generation are small, lack diversity, and have limited annotation quality, which constrains what models can learn; a larger, higher-quality dataset is needed to advance text-to-3D scene generation. Method: Builds M3DLayout, a multi-source dataset that integrates real-world scans, professional CAD designs, and procedurally generated scenes, pairs each layout with a structured text description, and establishes a benchmark with a text-conditioned diffusion model. Result: Experiments show that M3DLayout supports learning of complex spatial and semantic patterns and increases the diversity and detail of generated scenes, with the Inf3DLayout subset standing out for small-object layout. Conclusion: As a large-scale, high-quality, multi-source 3D layout dataset, M3DLayout is an important resource for research on text-driven 3D scene generation. Abstract: In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 15,080 layouts and over 258k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using a text-conditioned diffusion model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset which provides rich small-object information, enabling the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis.

[337] LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Shubhang Bhatnagar,Andy Xu,Kar-Han Tan,Narendra Ahuja

Main category: cs.CV

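TL;DR: Presents the first study of ultra-low bit (<4-bit) quantization for multimodal large language models (MLLMs) and proposes LUQ, a layerwise quantization strategy that substantially reduces memory use while preserving performance.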

Details

Motivation: Multimodal large language models excel at vision-language tasks, but deploying them requires large amounts of memory and compute. Post-training quantization works well for language-only models, yet its application to MLLMs, especially at ultra-low bit widths, remains underexplored. Method: Analyzes the statistics of multimodal tokens and intermediate activations and finds that layers differ in their sensitivity to quantization; on this basis proposes LUQ, which selectively applies ultra-low bit quantization to the more robust layers and uses a mix of image and text tokens for post-training quantization. Result: Evaluated on LLaVA-1.5 and Qwen-2.5-VL, the LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, with a performance drop of less than 10% on the MME benchmark. Conclusion: LUQ offers an effective quantization scheme for efficient MLLM deployment and demonstrates the feasibility and benefits of layerwise, selective ultra-low bit quantization. Abstract: Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision without significant performance loss, its effectiveness for multimodal LLMs (MLLMs) remains relatively unexplored. In this paper, we present the first study on ultra-low bit (<4-bit) quantization for multimodal LLMs. Our analysis reveals that multimodal tokens and intermediate layer activations produced by them exhibit significantly higher statistical variance and entropy compared to text tokens, making them less tolerant to ultra-low bit quantization. However, the activation distributions of multimodal tokens vary significantly over different layers, with some layers having lower entropy activation distributions. We empirically show that such layers in these models can better tolerate ultra-low bit quantization. Building on these insights, we propose a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it. We also show that using a mix of multimodal tokens (image and text) for PTQ boosts VQA performance in the ultra-low bit regime. We evaluate our method on LLaVA-1.5 and Qwen-2.5-VL across 9 popular VQA benchmarks. The resulting LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, while exhibiting a performance degradation of less than 10% on the MME benchmark.
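
The layer-selection idea can be sketched as ranking layers by the entropy of their activation distributions and giving the lowest-entropy layers the smaller bit width. Everything concrete below is an assumption made for illustration (histogram-based entropy with 16 bins, a 2-bit versus 4-bit split, selecting a fixed fraction of layers); the abstract does not specify LUQ's actual selection rule.

```python
import math


def histogram_entropy(values, bins=16):
    """Shannon entropy (in bits) of a histogram over the value range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def assign_bits(layer_activations, low_bits=2, high_bits=4, fraction=0.5):
    """Give the lowest-entropy fraction of layers the ultra-low bit width."""
    entropies = {name: histogram_entropy(a) for name, a in layer_activations.items()}
    ranked = sorted(entropies, key=entropies.get)
    low = set(ranked[:int(len(ranked) * fraction)])
    return {name: (low_bits if name in low else high_bits) for name in layer_activations}


# Hypothetical calibration activations for two layers.
layers = {
    "layer_a": [0.0] * 15 + [1.0],            # concentrated, low entropy
    "layer_b": [float(i) for i in range(16)],  # spread out, high entropy
}
bit_plan = assign_bits(layers)
```

In practice the activations would come from running calibration data (a mix of image and text tokens, per the abstract) through the model, and the chosen fraction controls the trade-off between memory savings and accuracy.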

[338] FastViDAR: Real-Time Omnidirectional Depth Estimation via Alternative Hierarchical Attention

Hangtian Zhao,Xiang Chen,Yizhe Li,Qianhao Wang,Haibo Lu,Fei Gao

Main category: cs.CV

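TL;DR: Proposes FastViDAR, a new framework that produces 360-degree depth maps from four fisheye camera inputs, introducing an Alternative Hierarchical Attention mechanism and an ERP fusion method, and running at up to 20 FPS on embedded hardware.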

Details

Motivation: To generate high-quality 360-degree depth maps efficiently from multiple fisheye cameras, addressing the computational overhead and cross-view feature fusion limitations of existing methods. Method: Proposes the Alternative Hierarchical Attention (AHA) mechanism for cross-view feature fusion and a new ERP fusion method that maps multi-view depth into a shared equirectangular coordinate system; ERP image-depth pairs generated from the HM3D and 2D3D-S datasets are used for evaluation. Result: Shows competitive zero-shot performance on real datasets and runs at up to 20 FPS on NVIDIA Orin NX embedded hardware. Conclusion: With an efficient attention mechanism and a novel depth fusion strategy, FastViDAR delivers fast, accurate 360-degree depth estimation suited to deployment on resource-constrained platforms. Abstract: In this paper, we propose FastViDAR, a novel framework that takes four fisheye camera inputs and produces a full $360^\circ$ depth map along with per-camera depth, fusion depth, and confidence estimates. Our main contributions are: (1) We introduce the Alternative Hierarchical Attention (AHA) mechanism, which efficiently fuses features across views through separate intra-frame and inter-frame windowed self-attention, achieving cross-view feature mixing with reduced overhead. (2) We propose a novel ERP fusion approach that projects multi-view depth estimates to a shared equirectangular coordinate system to obtain the final fusion depth. (3) We generate ERP image-depth pairs using HM3D and 2D3D-S datasets for comprehensive evaluation, demonstrating competitive zero-shot performance on real datasets while achieving up to 20 FPS on NVIDIA Orin NX embedded hardware. Project page: https://3f7dfc.github.io/FastVidar/
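
Projecting into a shared equirectangular (ERP) coordinate system amounts to converting a 3D viewing direction into longitude and latitude and then into pixel coordinates. The sketch below uses one common convention (z forward, x right, y up, image center looking along +z), which is an assumption; FastViDAR's own axis and pixel conventions are not given in the abstract.

```python
import math


def direction_to_erp(x, y, z, width, height):
    """Map a 3D direction to (u, v) pixel coordinates in an ERP image.

    Assumed convention: longitude is measured from +z toward +x,
    latitude is positive upward, and v grows downward.
    """
    r = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)   # range (-pi, pi]
    lat = math.asin(y / r)   # range [-pi/2, pi/2]
    u = (lon / (2 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v


# Looking straight ahead lands at the image center.
center = direction_to_erp(0.0, 0.0, 1.0, width=1024, height=512)
```

Once each camera's depth estimates are mapped into the same ERP grid this way, overlapping pixels can be merged, for example by weighting each camera's depth by its confidence estimate.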

[339] HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation

Cong Chen,Ziyuan Huang,Cheng Zou,Muzhi Zhu,Kaixiang Ji,Jiajia Liu,Jingdong Chen,Hao Chen,Chunhua Shen

Main category: cs.CV

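TL;DR: Proposes HieraTok, a visual tokenizer based on a multi-scale Vision Transformer that uses multi-scale downsampling and scale-causal attention to clearly outperform single-scale approaches in image reconstruction and generation, with better rFID and gFID and faster convergence.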

Details

Motivation: Existing visual tokenizers usually model a single-scale representation, which limits their performance in image reconstruction and generation; this work aims to overcome that limitation. Method: Proposes HieraTok with two key designs: multi-scale downsampling of the token map to produce a multi-scale token sequence, and a scale-causal attention mechanism that lets information flow progressively from low-resolution semantic features to high-resolution structural details. Result: Under identical settings, HieraTok improves rFID by 27.2% over a single-scale tokenizer (1.47→1.07); in generation it converges 1.38 times faster and improves gFID by 18.9% (16.4→13.3); after scaled-up training it reaches a state-of-the-art rFID of 0.45 and gFID of 1.82 among ViT tokenizers. Conclusion: HieraTok is the first multi-scale ViT-based tokenizer for image reconstruction and generation; its design improves generation performance and latent space quality and may advance ViT-based tokenizers in visual generation. Abstract: In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart by a 27.2% improvement in rFID ($1.47 \rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38\times$ faster convergence rate and an 18.9% boost in gFID ($16.4 \rightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential with a state-of-the-art rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce a multi-scale ViT-based tokenizer for image reconstruction and image generation. We hope our findings and designs advance ViT-based tokenizers in visual generation tasks.
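
One way to picture scale-causal attention is as a block-structured mask over a token sequence ordered from coarse to fine scales. The sketch below assumes that a token may attend to every token at its own scale and at coarser scales, but not at finer ones; this reading of "scale-causal" and the function name are assumptions, since the abstract does not spell out the exact masking rule.

```python
def scale_causal_mask(tokens_per_scale):
    """Build a boolean attention mask for tokens ordered coarse to fine.

    mask[q][k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same scale as q or to a coarser one.
    """
    scale_of = [s for s, n in enumerate(tokens_per_scale) for _ in range(n)]
    return [[scale_of[k] <= scale_of[q] for k in range(len(scale_of))]
            for q in range(len(scale_of))]


# One coarse token (1x1 map) followed by four finer tokens (2x2 map).
mask = scale_causal_mask([1, 4])
```

Here the single coarse token sees only itself, while each of the four finer tokens sees all five tokens, so information flows from the low-resolution summary toward the high-resolution details and not the other way.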

[340] GRS-SLAM3R: Real-Time Dense SLAM with Gated Recurrent State

Guole Shen,Tianchen Deng,Yanbo Wang,Yongtao Chen,Yilin Shen,Jiuming Liu,Jingchuan Wang

Main category: cs.CV

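TL;DR: Proposes GRS-SLAM3R, a DUSt3R-based end-to-end SLAM framework that introduces spatial memory and global consistency optimization to achieve dense scene reconstruction and pose estimation without prior knowledge.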

Details

Motivation: Most existing methods estimate point maps from image pairs and do not account for spatial memory or global consistency, which limits reconstruction quality. Method: Takes sequential input, maintains spatial memory with a transformer that has a gated update mechanism, incrementally estimates metric-scale point clouds in a global coordinate frame, and registers a global map through submap partitioning, local alignment, and relative constraints. Result: Experiments on multiple datasets show that the method substantially improves reconstruction accuracy while remaining real-time. Conclusion: GRS-SLAM3R strengthens cross-frame spatial correlation and global consistency and outperforms existing DUSt3R-based methods. Abstract: DUSt3R-based end-to-end scene reconstruction has recently shown promising results in dense visual SLAM. However, most existing methods only use image pairs to estimate pointmaps, overlooking spatial memory and global consistency. To this end, we introduce GRS-SLAM3R, an end-to-end SLAM framework for dense scene reconstruction and pose estimation from RGB images without any prior knowledge of the scene or camera parameters. Unlike existing DUSt3R-based frameworks, which operate on all image pairs and predict per-pair point maps in local coordinate frames, our method supports sequentialized input and incrementally estimates metric-scale point clouds in a global coordinate frame. To improve consistent spatial correlation, we use a latent state for spatial memory and design a transformer-based gated update module to reset and update the spatial memory that continuously aggregates and tracks relevant 3D information across frames. Furthermore, we partition the scene into submaps, apply local alignment within each submap, and register all submaps into a common world frame using relative constraints, producing a globally consistent map. Experiments on various datasets show that our framework achieves superior reconstruction accuracy while maintaining real-time performance.

[341] ResAD++: Towards Class Agnostic Anomaly Detection via Residual Feature Learning

Xincheng Yao,Chao Shi,Muming Zhao,Guangtao Zhai,Chongyang Zhang

Main category: cs.CV

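TL;DR: Proposes ResAD++, a new class-agnostic anomaly detection framework that learns the residual feature distribution and adds a hypersphere constraint to achieve feature decorrelation and scale consistency, detecting anomalies in new classes without retraining.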

Details

Motivation: Existing anomaly detection methods perform poorly on new classes, mainly because their feature representations remain class-related (feature correlation) and do not generalize across classes. Method: Proposes residual features, formed by matching and subtracting normal reference features; introduces a feature hypersphere constraint to unify feature scales across classes; and designs a logbarrier bidirectional contraction OCC loss and a vector quantization based feature distribution matching module. Result: Experiments on eight real-world anomaly detection datasets show that ResAD++, applied directly to new classes, clearly outperforms state-of-the-art methods and its base version ResAD. Conclusion: Through residual feature learning and feature scale normalization, ResAD++ achieves strong class-agnostic anomaly detection with good cross-domain generalization. Abstract: This paper explores the problem of class-agnostic anomaly detection (AD), where the objective is to train one class-agnostic AD model that can generalize to detect anomalies in diverse new classes from different domains without any retraining or fine-tuning on the target data. When applied to new classes, the performance of current single- and multi-class AD methods is still unsatisfactory. One fundamental reason is that representation learning in existing methods is still class-related, a problem we refer to as feature correlation. To address this issue, we propose residual features and construct a simple but effective framework, termed ResAD. Our core insight is to learn the residual feature distribution rather than the initial feature distribution. Residual features are formed by matching and then subtracting normal reference features. In this way, we can effectively realize feature decorrelation. Even in new classes, the distribution of normal residual features would not remarkably shift from the learned distribution. In addition, we think that residual features still have one issue: scale correlation. To this end, we propose a feature hypersphere constraining approach, which learns to constrain initial normal residual features into a spatial hypersphere so that the feature scales of different classes are as consistent as possible. Furthermore, we propose a novel logbarrier bidirectional contraction OCC loss and vector quantization based feature distribution matching module to enhance ResAD, leading to the improved version of ResAD (ResAD++). Comprehensive experiments on eight real-world AD datasets demonstrate that our ResAD++ can achieve remarkable AD results when directly used in new classes, outperforming state-of-the-art competing methods and also surpassing ResAD. The code is available at https://github.com/xcyao00/ResAD.
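
The phrase "matching and then subtracting normal reference features" can be illustrated with a nearest-neighbor sketch. The use of squared Euclidean distance for matching and plain Python lists for features are assumptions for illustration, not details taken from the ResAD code.

```python
def residual_feature(feature, references):
    """Subtract the closest normal reference feature from the input feature."""
    nearest = min(
        references,
        key=lambda ref: sum((f - r) ** 2 for f, r in zip(feature, ref)),
    )
    return [f - r for f, r in zip(feature, nearest)]


# Hypothetical 2-D features: the input is close to the second reference.
references = [[0.0, 0.0], [1.0, 1.5]]
residual = residual_feature([1.0, 2.0], references)
```

For a normal sample the residual stays near zero whatever the class looks like, which is the intuition behind feature decorrelation; an anomalous sample has no close normal reference and leaves a larger residual.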

[342] Poivre: Self-Refining Visual Pointing with Reinforcement Learning

Wenjie Yang,Zengfeng Huang

Main category: cs.CV

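TL;DR: Proposes Poivre, a self-refining method that combines a point-visualize-refine procedure with reinforcement learning to improve vision-language models on visual pointing, clearly outperforming existing models.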

Details

Motivation: Existing vision-language models usually complete visual pointing in a single step and cannot refine iteratively, leaving their performance far below human level. Method: Proposes the Point, Visualize, then Refine (Poivre) framework, in which the model first predicts coordinates and then iteratively refines them using a visualization of its own result, trained with a reward-based reinforcement learning strategy. Result: Poivre-7B reaches state-of-the-art performance on Point-Bench, surpassing larger models such as Gemini-2.5-Pro and Molmo-72B by over 3%. Conclusion: Adding self-refinement and reinforcement learning effectively improves the accuracy of vision-language models on visual pointing. Abstract: Visual pointing, which aims to localize a target by predicting its coordinates on an image, has emerged as an important problem in the realm of vision-language models (VLMs). Despite its broad applicability, recent benchmarks show that current VLMs still fall far behind human performance on this task. A key limitation is that VLMs are typically required to complete the pointing task in a single step, akin to asking humans to point at an object without seeing their own fingers. To address this issue, we propose a simple yet effective self-refining procedure: Point, Visualize, then Refine (Poivre). This procedure enables a VLM to first mark its estimated point, then iteratively refine the coordinates if necessary. Inspired by advances of reasoning models in the natural language domain, we employ reinforcement learning (RL) to incentivize this self-refining ability. For the RL training, we design a neat process reward that is not only empirically effective but also grounded in appealing properties. Our trained model, Poivre-7B, sets a new state of the art on Point-Bench, outperforming both proprietary models such as Gemini-2.5-Pro and large open-source models such as Molmo-72B by over 3%. To support future research, we release our training and inference code, dataset, and the Poivre-7B checkpoint.
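
The point-visualize-refine procedure can be sketched as a simple loop. The callables `predict` and `draw_marker`, the round limit, and the rule of stopping when the point no longer changes are all hypothetical stand-ins; the abstract only says the model marks its estimate and refines "if necessary".

```python
def self_refine_point(predict, draw_marker, image, query, max_rounds=3):
    """Predict a point, draw it on the image, and let the model revise it.

    predict(image, query, previous) -> (x, y)   # hypothetical model call
    draw_marker(image, point) -> marked image   # hypothetical renderer
    """
    point = predict(image, query, previous=None)
    for _ in range(max_rounds):
        marked = draw_marker(image, point)       # the model "sees its finger"
        new_point = predict(marked, query, previous=point)
        if new_point == point:                   # no further correction needed
            break
        point = new_point
    return point


# Stub model that steps one pixel toward a fixed target on every call.
TARGET = (5, 5)

def stub_predict(image, query, previous=None):
    if previous is None:
        return (3, 5)
    x, y = previous
    return (x + (TARGET[0] > x) - (TARGET[0] < x), y)

def stub_draw_marker(image, point):
    return (image, point)

final_point = self_refine_point(stub_predict, stub_draw_marker, "img", "the cup", max_rounds=5)
```

In the paper's setting the refinement behavior is learned with reinforcement learning and a process reward rather than hand-coded as it is in this stub.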

[343] PVTAdpNet: Polyp Segmentation using Pyramid vision transformer with a novel Adapter block

Arshia Yousefi Nezhad,Helia Aghaei,Hedieh Sajedi

Main category: cs.CV

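TL;DR: Proposes PVTAdpNet, a new network for colorectal polyp segmentation that combines a Pyramid Vision Transformer with a U-Net structure and adds residual blocks and adapter-based skip connections, achieving high accuracy and real-time performance on multiple datasets.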

Details

Motivation: Traditional colonoscopy suffers from high miss rates and highly variable polyp appearance, so a more effective automatic polyp segmentation method is needed to improve early detection. Method: Uses a U-Net-style encoder-decoder with a Pyramid Vision Transformer backbone, designs new residual blocks and adapter-based skip connections, and adds squeeze-and-excitation attention to refine channel-wise features. Result: Achieves a Dice coefficient of 0.8851 and an mIoU of 0.8167 on out-of-distribution polyp datasets, and shows strong real-time performance and segmentation accuracy on the PolypGen dataset. Conclusion: PVTAdpNet performs well on polyp segmentation and has good potential for clinical use, especially where polyp appearance is complex and variable. Abstract: Colorectal cancer ranks among the most common and deadly cancers, emphasizing the need for effective early detection and treatment. To address the limitations of traditional colonoscopy, including high miss rates due to polyp variability, we introduce the Pyramid Vision Transformer Adapter Residual Network (PVTAdpNet). This model integrates a U-Net-style encoder-decoder structure with a Pyramid Vision Transformer backbone, novel residual blocks, and adapter-based skip connections. The design enhances feature extraction, dense prediction, and gradient flow, supported by squeeze-and-excitation attention for improved channel-wise feature refinement. PVTAdpNet achieves real-time, accurate polyp segmentation, demonstrating superior performance on benchmark datasets with high mDice and mIoU scores, making it highly suitable for clinical applications. PVTAdpNet obtains a high Dice coefficient of 0.8851 and a mean Intersection over Union (mIoU) of 0.8167 on out-of-distribution polyp datasets. Evaluation on the PolypGen dataset demonstrates PVTAdpNet's capability for real-time, accurate performance within familiar distributions. The source code of our network is available at https://github.com/ayousefinejad/PVTAdpNet.git
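
The two reported metrics have standard definitions for binary masks, sketched below on flattened masks. Returning 1.0 when both masks are empty is a convention chosen here, not something stated in the abstract, and the averaging across images that turns these into mDice and mIoU is omitted.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flattened binary masks (0/1 values)."""
    inter = sum(p * t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match
    dice = 2 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


# Toy 4-pixel masks that agree on one foreground pixel.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single mask pair the two are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU, consistent with the reported 0.8851 versus 0.8167.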

[344] UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception

Xinyang Song,Libin Wang,Weining Wang,Shaozhen Liu,Dandan Zheng,Jingdong Chen,Qi Li,Zhenan Sun

Main category: cs.CV

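TL;DR: Proposes UniAlignment, a unified multimodal generation framework built on a single diffusion transformer that uses a dual-stream diffusion training strategy to improve cross-modal consistency and instruction following, and releases SemGen-Bench, a new benchmark for evaluating multimodal semantic consistency under complex textual instructions.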

Details

Motivation: Existing methods depend on vision-language models or multi-module designs, leading to fragmented architectures and computational inefficiency that hinder efficient, unified multimodal semantic understanding and generation. Method: Proposes the UniAlignment framework, which uses a dual-stream diffusion training strategy to achieve intrinsic-modal and cross-modal semantic alignment within a single diffusion transformer, strengthening cross-modal consistency and the ability to follow complex instructions. Result: Experiments across multiple tasks and benchmarks show that UniAlignment outperforms existing baselines, with stronger semantic consistency and generation quality on multimodal tasks such as image understanding, editing, and perception. Conclusion: Diffusion models have great potential for unified multimodal generation, and UniAlignment offers an effective way to achieve efficient generation with strong semantic alignment. Abstract: The remarkable success of diffusion models in text-to-image generation has sparked growing interest in expanding their capabilities to a variety of multi-modal tasks, including image understanding, manipulation, and perception. These tasks require advanced semantic comprehension across both visual and textual modalities, especially in scenarios involving complex semantic instructions. However, existing approaches often rely heavily on vision-language models (VLMs) or modular designs for semantic guidance, leading to fragmented architectures and computational inefficiency. To address these challenges, we propose UniAlignment, a unified multimodal generation framework within a single diffusion transformer. UniAlignment introduces a dual-stream diffusion training strategy that incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. Additionally, we present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions. Extensive experiments across multiple tasks and benchmarks demonstrate that UniAlignment outperforms existing baselines, underscoring the significant potential of diffusion models in unified multimodal generation.

[345] GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning

Xiaojie Li,Bei Wang,Jianlong Wu,Yue Yu,Liqiang Nie,Min Zhang

Main category: cs.CV

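TL;DR: Proposes GenView++, a unified framework that improves the quality and use of positive pairs in contrastive learning through multi-source adaptive view generation and a quality-driven contrastive learning mechanism.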

Details

Motivation: 现有对比学习方法在正样本对构建和学习过程中存在局限性，如增强方式多样性不足、语义易损以及缺乏对配对质量的评估机制。 Method: 提出GenView++框架，包含两个创新：一是多源自适应视图生成机制，结合图像、文本及图文条件生成多样化且语义一致的视图；二是质量驱动的对比学习机制，动态评估并重加权每对样本的训练贡献。 Result: 实验表明，GenView++在视觉和视觉-语言任务中均显著提升性能：在ImageNet线性分类上比MoCov2提升+2.5%；在十个数据集上零样例分类准确率比CLIP提升+12.31%，比SLIP提升+5.31%；Flickr30k文本检索R@5提升+3.2%。 Conclusion: GenView++通过高质量视图生成与动态质量感知训练策略，有效提升了对比学习的表现，适用于多种模态任务。 Abstract: The success of contrastive learning depends on the construction and utilization of high-quality positive pairs. However, current methods face critical limitations on two fronts: on the construction side, both handcrafted and generative augmentations often suffer from limited diversity and risk semantic corruption; on the learning side, the absence of a quality assessment mechanism leads to suboptimal supervision where all pairs are treated equally. To tackle these challenges, we propose GenView++, a unified framework that addresses both fronts by introducing two synergistic innovations. To improve pair construction, GenView++ introduces a multi-source adaptive view generation mechanism to synthesize diverse yet semantically coherent views by dynamically modulating generative parameters across image-conditioned, text-conditioned, and image-text-conditioned strategies. Second, a quality-driven contrastive learning mechanism assesses each pair's semantic alignment and diversity to dynamically reweight their training contribution, prioritizing high-quality pairs while suppressing redundant or misaligned pairs. Extensive experiments demonstrate the effectiveness of GenView++ across both vision and vision-language tasks. For vision representation learning, it improves MoCov2 by +2.5% on ImageNet linear classification. For vision-language learning, it raises the average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and further improves Flickr30k text retrieval R@5 by +3.2%. 
The code is available at https://github.com/xiaojieli0903/GenViewPlusPlus.
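
The quality-driven reweighting described in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style loss per positive pair and a non-negative quality score per pair; the function names, the temperature value, and the normalization of weights are illustrative assumptions, not the authors' implementation.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.2):
    """InfoNCE loss for one anchor: negative log-softmax of the positive pair."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

def quality_weighted_loss(pairs, qualities, temperature=0.2):
    """Average per-pair losses with weights proportional to pair quality.

    pairs: list of (pos_sim, neg_sims) tuples.
    qualities: non-negative quality scores, one per pair (hypothetical input;
    the paper derives them from semantic alignment and diversity).
    """
    total = sum(qualities)
    if total <= 0:
        raise ValueError("at least one pair needs a positive quality score")
    return sum(q / total * info_nce(p, n, temperature)
               for (p, n), q in zip(pairs, qualities))
```

With uniform quality scores this reduces to the ordinary mean loss; pairs with a score of zero drop out of the objective entirely.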

[346] A Modality-Tailored Graph Modeling Framework for Urban Region Representation via Contrastive Learning

Yaya Zhao,Kaiqi Zhao,Zixuan Tang,Zhiyuan Liu,Xiaoling Lu,Yalei Du

Main category: cs.CV

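TL;DR: Proposes MTGRR, a modality-tailored graph modeling framework that improves urban region representation through modality-specific graph neural networks and a spatially-aware multimodal fusion mechanism.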

Details

Motivation: Existing graph models apply a single GNN architecture to all modalities of multimodal urban data and ignore spatial heterogeneity, which limits representation quality. Method: Modalities are split into aggregated-level and point-level groups, handled by a mixture-of-experts graph network and a dual-level GNN respectively; a spatially-aware multimodal fusion mechanism dynamically generates region-specific fusion weights, and contrastive learning optimizes the representations. Result: Experiments on two real-world datasets, six modalities, and three tasks show that MTGRR consistently outperforms state-of-the-art baselines. Conclusion: Through modality-tailored modeling and a dynamic fusion strategy, MTGRR improves the accuracy and robustness of multimodal urban region representations. Abstract: Graph-based models have emerged as a powerful paradigm for modeling multimodal urban data and learning region representations for various downstream tasks. However, existing approaches face two major limitations. (1) They typically employ identical graph neural network architectures across all modalities, failing to capture modality-specific structures and characteristics. (2) During the fusion stage, they often neglect spatial heterogeneity by assuming that the aggregation weights of different modalities remain invariant across regions, resulting in suboptimal representations. To address these issues, we propose MTGRR, a modality-tailored graph modeling framework for urban region representation, built upon a multimodal dataset comprising point of interest (POI), taxi mobility, land use, road element, remote sensing, and street view images. (1) MTGRR categorizes modalities into two groups based on spatial density and data characteristics: aggregated-level and point-level modalities. For aggregated-level modalities, MTGRR employs a mixture-of-experts (MoE) graph architecture, where each modality is processed by a dedicated expert GNN to capture distinct modality-specific characteristics. For the point-level modality, a dual-level GNN is constructed to extract fine-grained visual semantic features. (2) To obtain effective region representations under spatial heterogeneity, a spatially-aware multimodal fusion mechanism is designed to dynamically infer region-specific modality fusion weights. 
Building on this graph modeling framework, MTGRR further employs a joint contrastive learning strategy that integrates region aggregated-level, point-level, and fusion-level objectives to optimize region representations. Experiments on two real-world datasets across six modalities and three tasks demonstrate that MTGRR consistently outperforms state-of-the-art baselines, validating its effectiveness.
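
The idea of region-specific fusion weights can be sketched as follows. This assumes each region comes with one relevance score per modality and that the scores are turned into weights by a softmax; both the scoring input and the softmax are illustrative assumptions, since the abstract does not specify how the weights are inferred.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_region(modality_embeddings, modality_scores):
    """Fuse one region's per-modality embeddings with region-specific weights.

    modality_embeddings: list of equal-length vectors, one per modality.
    modality_scores: one scalar per modality for THIS region (hypothetical;
    in the paper these would come from a learned, spatially-aware module).
    """
    weights = softmax(modality_scores)
    dim = len(modality_embeddings[0])
    fused = [sum(w * emb[d] for w, emb in zip(weights, modality_embeddings))
             for d in range(dim)]
    return fused, weights
```

Because the scores are supplied per region, two regions with identical modality embeddings can still receive different fused representations, which is the contrast with region-invariant aggregation weights drawn in the abstract.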

[347] Texture Vector-Quantization and Reconstruction Aware Prediction for Generative Super-Resolution

Qifan Li,Jiale Zou,Jinhua Zhang,Wei Long,Xinyu Zhou,Shuhang Gu

Main category: cs.CV

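TL;DR: Proposes a generative super-resolution model based on texture vector-quantization and reconstruction aware prediction, reducing quantization error and improving reconstruction quality.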

Details

Motivation: Existing VQ-based methods for visual prior modeling suffer from large quantization error and coarse-grained training supervision, which degrade reconstruction quality. Method: Texture vector-quantization (TVQ) models only the missing textures; reconstruction aware prediction (RAP) uses the straight-through estimator to train the index predictor with image-level supervision. Result: The model delivers high-quality, photo-realistic super-resolution at a low computational cost. Conclusion: Through finer-grained prior modeling and a reconstruction-aware training strategy, TVQ&RAP significantly improves VQ-based super-resolution models. Abstract: Vector-quantization-based models have recently demonstrated strong potential for visual prior modeling. However, existing VQ-based methods simply encode visual features with the nearest codebook items and train the index predictor with code-level supervision. Due to the richness of visual signals, VQ encoding often leads to large quantization error. Furthermore, training the predictor with code-level supervision cannot take the final reconstruction errors into consideration, resulting in sub-optimal prior modeling accuracy. In this paper we address the above two issues and propose a Texture Vector-Quantization and a Reconstruction Aware Prediction strategy. The texture vector-quantization strategy leverages the task character of super-resolution and only introduces a codebook to model the prior of missing textures. The reconstruction aware prediction strategy, in turn, makes use of the straight-through estimator to directly train the index predictor with image-level supervision. Our proposed generative SR model (TVQ&RAP) is able to deliver photo-realistic SR results with small computational cost.

[348] GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning

Nayeong Kim,Seong Joon Oh,Suha Kwak

Main category: cs.CV

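TL;DR: Proposes GroupCoOp, a parameter-efficient fine-tuning method that uses group-specific text prompts to improve the group robustness of vision-language models on imbalanced data, achieving the best results on five benchmarks while training only 0.016% of the parameters.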

Details

Motivation: Vision-language models fine-tuned on subgroup-imbalanced data suffer from spurious correlations and group bias, which harm robustness. Method: Group-specific text prompts serve as multiple classifiers for each class; the semantic knowledge of the text encoder helps discover effective prompts, mitigating the neglect of minority groups and the scattered distribution of class embeddings. Result: Best results on five benchmarks across five CLIP architectures, occasionally surpassing methods that fine-tune the entire network, while tuning only 0.016% of the parameters. Conclusion: GroupCoOp is an efficient and robust PEFT method that significantly improves the generalization and fairness of VLMs under group imbalance. Abstract: Parameter-efficient fine-tuning (PEFT) of vision-language models (VLMs) excels in various vision tasks thanks to the rich knowledge and generalization ability of VLMs. However, recent studies revealed that such fine-tuned VLMs are vulnerable to spurious correlations stemming from the subgroup imbalance in the fine-tuning datasets. To resolve this issue, we propose Group Context Optimization (GroupCoOp), a simple and effective debiased fine-tuning algorithm that enhances the group robustness of fine-tuned VLMs. Its key idea is to employ group-specific text prompts as group representatives serving as multiple classifiers for their target class. The rich semantic knowledge of the text encoder of VLM enables the discovery of effective group prompts even for groups with a small number of training samples. Leveraging the group prompts for each class addresses the issues caused by the group-imbalanced training set, such as the neglect of minority groups and the scattered distribution of each class in the embedding space. GroupCoOp achieved the best results on five benchmarks across five CLIP architectures and occasionally outperformed prior methods that fine-tune the entire network, despite training only 0.016% of the network's parameters.
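
The key idea of using several group prompts as multiple classifiers for one class can be sketched as below. The sketch assumes the image and prompt embeddings are already computed, and it scores each class by its best-matching group prompt; the max aggregation and all names are assumptions for illustration, since the abstract does not state how group scores are combined.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def classify(image_emb, group_prompts):
    """Predict a class from per-group prompt embeddings.

    group_prompts: {class_name: [prompt_embedding_for_group_0, ...]}.
    Each class is scored by its best-matching group prompt
    (assumed aggregation rule, not confirmed by the source).
    """
    scores = {cls: max(cosine(image_emb, p) for p in prompts)
              for cls, prompts in group_prompts.items()}
    return max(scores, key=scores.get), scores
```

Under this reading, a sample from a minority group can still be assigned to its class as long as it lies close to that group's own prompt, even if it is far from the majority group's prompt.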

[349] From Unstable to Playable: Stabilizing Angry Birds Levels via Object Segmentation

Mahdi Farrokhimaleki,Parsa Rahmati,Richard Zhao

Main category: cs.CV

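TL;DR: Proposes an image-based repair method that identifies and fixes unstable game levels generated by PCG models, using Angry Birds as a case study to validate improved stability and playability of AI-generated levels.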

Details

Motivation: Ensuring that PCG-generated content is consistently high-quality and meets industry standards, and addressing the unstable levels produced by existing PCG models. Method: Object segmentation and visual analysis of level images are used to detect structural gaps and perform targeted repairs; multiple object segmentation models are evaluated and the most effective one is chosen as the basis of the repair pipeline. Result: Experiments show that the method improves the stability and playability of AI-generated levels, and the image-based approach is designed to be applicable to a wide range of 2D games with similar level structures. Conclusion: The proposed method repairs unstable PCG-generated levels and offers a practical route to higher-quality automated content generation. Abstract: Procedural Content Generation (PCG) techniques enable automatic creation of diverse and complex environments. While PCG facilitates more efficient content creation, ensuring consistently high-quality, industry-standard content remains a significant challenge. In this research, we propose a method to identify and repair unstable levels generated by existing PCG models. We use Angry Birds as a case study, demonstrating our method on game levels produced by established PCG approaches. Our method leverages object segmentation and visual analysis of level images to detect structural gaps and perform targeted repairs. We evaluate multiple object segmentation models and select the most effective one as the basis for our repair pipeline. Experimental results show that our method improves the stability and playability of AI-generated levels. Although our evaluation is specific to Angry Birds, our image-based approach is designed to be applicable to a wide range of 2D games with similar level structures.

[350] Controllable Generation of Large-Scale 3D Urban Layouts with Semantic and Structural Guidance

Mengyuan Niu,Xinxin Zhuo,Ruizhe Wang,Yuyue Huang,Junyan Yang,Qiao Wang

Main category: cs.CV

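TL;DR: Proposes a controllable framework that fuses geometric and semantic attributes to generate large-scale 3D vector urban layouts, letting users control the output directly by adjusting semantic attributes.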

Details

Motivation: Existing image-based methods lack geometric continuity and scalability, while graph-based methods overlook parcel semantics, so neither yields urban layouts that are both structurally sound and semantically rich. Method: By fusing geometric and semantic attributes, introducing edge weights, and embedding building height in the graph, the method extends 2D layouts to realistic 3D structures and gives fine-grained control over the layout. Result: Experiments show that the method produces valid, large-scale urban models with good geometric continuity and semantic consistency. Conclusion: The framework offers an effective generative tool for data-driven urban planning and design that balances controllability, scalability, and realism. Abstract: Urban modeling is essential for city planning, scene synthesis, and gaming. Existing image-based methods generate diverse layouts but often lack geometric continuity and scalability, while graph-based methods capture structural relations yet overlook parcel semantics. We present a controllable framework for large-scale 3D vector urban layout generation, conditioned on both geometry and semantics. By fusing geometric and semantic attributes, introducing edge weights, and embedding building height in the graph, our method extends 2D layouts to realistic 3D structures. It also enables users to directly control the output by modifying semantic attributes. Experiments show that it produces valid, large-scale urban models, offering an effective tool for data-driven planning and design.

[351] A Multi-Camera Vision-Based Approach for Fine-Grained Assembly Quality Control

Ali Nazeri,Shashank Mishra,Achim Wagner,Martin Ruskowski,Didier Stricker,Jason Rambach

Main category: cs.CV

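TL;DR: Proposes a new quality control module based on multi-view imaging and image fusion that significantly improves the accuracy and reliability of small-part assembly inspection.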

Details

Motivation: Existing single-view imaging and manual inspection are vulnerable to occlusions and lighting changes, which cause inspection errors and increase line downtime and cost. Method: A three-camera imaging system is combined with a tailored image fusion method and advanced object detection algorithms to give comprehensive visual coverage of assembly components. Result: Experiments show that the approach significantly outperforms single-view methods at identifying improperly fastened small parts such as screws, with high precision and recall. An annotated dataset covering diverse real-world scenarios is also released. Conclusion: The work overcomes the limitations of single-view inspection and provides a scalable, cost-effective, and accurate quality control solution for industrial automation that improves the reliability and safety of the assembly line. Abstract: Quality control is a critical aspect of manufacturing, particularly in ensuring the proper assembly of small components in production lines. Existing solutions often rely on single-view imaging or manual inspection, which are prone to errors due to occlusions, restricted perspectives, or lighting inconsistencies. These limitations require the installation of additional inspection stations, which could disrupt the assembly line and lead to increased downtime and costs. This paper introduces a novel multi-view quality control module designed to address these challenges, integrating a multi-camera imaging system with advanced object detection algorithms. By capturing images from three camera views, the system provides comprehensive visual coverage of components of an assembly process. A tailored image fusion methodology combines results from multiple views, effectively resolving ambiguities and enhancing detection reliability. To support this system, we developed a unique dataset comprising annotated images across diverse scenarios, including varied lighting conditions, occlusions, and angles, to enhance applicability in real-world manufacturing environments. Experimental results show that our approach significantly outperforms single-view methods, achieving high precision and recall rates in the identification of improperly fastened small assembly parts such as screws. This work contributes to industrial automation by overcoming single-view limitations, and providing a scalable, cost-effective, and accurate quality control mechanism that ensures the reliability and safety of the assembly line. The dataset used in this study is publicly available to facilitate further research in this domain.

[352] Assessing Visual Privacy Risks in Multimodal AI: A Novel Taxonomy-Grounded Evaluation of Vision-Language Models

Efthymios Tsaprazlis,Tiantian Feng,Anil Ramakrishna,Rahul Gupta,Shrikanth Narayanan

Main category: cs.CV

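TL;DR: This paper proposes a multi-level visual privacy taxonomy for evaluating how well state-of-the-art vision-language models understand and enforce privacy principles, revealing significant inconsistencies in their understanding of contextual privacy.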

Details

Motivation: Because large language models have a limited grasp of the notion of privacy, it is important to determine whether and how they can understand and enforce privacy principles. Method: Drawing on legal frameworks, the authors build a scalable, adaptable multi-level visual privacy taxonomy and evaluate several state-of-the-art vision-language models. Result: The evaluation shows that current vision-language models are significantly inconsistent in their understanding of contextual privacy, exposing the limitations of existing models. Conclusion: The study contributes a foundational privacy taxonomy for future research and highlights the urgent need for more robust, privacy-aware AI systems. Abstract: Artificial Intelligence has profoundly transformed the technological landscape in recent years. Large Language Models (LLMs) have demonstrated impressive abilities in reasoning, text comprehension, contextual pattern recognition, and integrating language with visual understanding. While these advances offer significant benefits, they also reveal critical limitations in the models' ability to grasp the notion of privacy. There is hence substantial interest in determining if and how these models can understand and enforce privacy principles, particularly given the lack of supporting resources to test such a task. In this work, we address these challenges by examining how legal frameworks can inform the capabilities of these emerging technologies. To this end, we introduce a comprehensive, multi-level Visual Privacy Taxonomy that captures a wide range of privacy issues, designed to be scalable and adaptable to existing and future research needs. Furthermore, we evaluate the capabilities of several state-of-the-art Vision-Language Models (VLMs), revealing significant inconsistencies in their understanding of contextual privacy. Our work contributes both a foundational taxonomy for future research and a critical benchmark of current model limitations, demonstrating the urgent need for more robust, privacy-aware AI systems.

[353] Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation

Hanyu Zhou,Gim Hee Lee

Main category: cs.CV

TL;DR: Proposes Uni4D-LLM, the first unified vision-language model framework with spatiotemporal awareness, unifying 4D scene understanding and generation.

Details

Motivation: Existing 3D/4D methods use different paradigms for semantic understanding and content generation, which makes it hard for one model to handle both tasks in dynamic 4D scenes. Method: Semantic features and noise-injected appearance features are extracted, combined with 4D geometric cues, and fused by adaptive cross-attention into a spatiotemporal-aware visual representation; autoregression and diffusion are integrated on a shared Transformer architecture to form a unified LLM framework, and instruction fine-tuning improves generalization across tasks. Result: Experiments on multiple benchmarks show that Uni4D-LLM achieves competitive or superior results on 4D scene understanding and generation and offers the first true unification of the two tasks. Conclusion: Uni4D-LLM unifies understanding and generation in 4D scenes, supporting the value of shared representations and a shared architecture for complex spatiotemporal modeling. Abstract: Vision-language models (VLMs) have demonstrated strong performance in 2D scene understanding and generation, but extending this unification to the physical world remains an open challenge. Existing 3D and 4D approaches typically embed scene geometry into an autoregressive model for semantic understanding and a diffusion model for content generation. This paradigm gap prevents a single model from jointly handling both tasks, especially in dynamic 4D settings where spatiotemporal modeling is critical. We propose Uni4D-LLM, the first unified VLM framework with spatiotemporal awareness for 4D scene understanding and generation. Our design is guided by two key insights: 1) Unification requires a shared representation. We extract semantic features for understanding and noise-injected appearance features for generation, incorporate 4D geometric cues, and fuse them into a spatiotemporal-aware visual representation through adaptive cross-attention. 2) Unification requires a shared architecture. Both autoregression and diffusion are built on Transformer backbones, and this enables integration into a single LLM with task-specific heads. By aligning visual and linguistic representations, our Uni4D-LLM produces predictions for both understanding and generation within one Transformer-based framework. We further apply instruction fine-tuning on diverse 4D vision-language datasets to improve generalization across tasks. Extensive experiments on multiple benchmarks demonstrate that Uni4D-LLM achieves competitive or superior results compared to state-of-the-art models and offers the first true unification of 4D scene understanding and generation.

[354] 2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC

Zhixiong Zhang,Shuangrui Ding,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Jiaqi Wang

Main category: cs.CV

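TL;DR: This report evaluates the zero-shot performance of the Segment Concept (SeC) framework on the coMplex video Object SEgmentation v2 (MOSEv2) dataset. SeC uses a Large Vision-Language Model (LVLM) to build a deep semantic understanding of the target and thereby make semi-supervised video object segmentation more robust. Without any fine-tuning on the training set, SeC achieved 39.7 $\mathcal{J}\&\dot{\mathcal{F}}$ on the test set and ranked 2nd in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge.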

Details

Motivation: Existing methods rely on appearance-based pattern matching, so they are not robust to drastic visual changes, occlusions, and scene shifts, and they lack a high-level conceptual understanding of the target. Method: The Segment Concept (SeC) framework uses a Large Vision-Language Model (LVLM) to establish a deep semantic understanding of the target for more persistent video object segmentation. Result: In the zero-shot setting, SeC achieves 39.7 $\mathcal{J}\&\dot{\mathcal{F}}$ on the MOSEv2 test set and ranks 2nd in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge. Conclusion: By introducing high-level semantic understanding, SeC improves the robustness and performance of semi-supervised video object segmentation in complex scenes, supporting the value of semantic modeling for video segmentation. Abstract: Semi-supervised Video Object Segmentation aims to segment a specified target throughout a video sequence, initialized by a first-frame mask. Previous methods rely heavily on appearance-based pattern matching and thus exhibit limited robustness against challenges such as drastic visual changes, occlusions, and scene shifts. This failure is often attributed to a lack of high-level conceptual understanding of the target. The recently proposed Segment Concept (SeC) framework mitigated this limitation by using a Large Vision-Language Model (LVLM) to establish a deep semantic understanding of the object for more persistent segmentation. In this work, we evaluate its zero-shot performance on the challenging coMplex video Object SEgmentation v2 (MOSEv2) dataset. Without any fine-tuning on the training set, SeC achieved 39.7 $\mathcal{J}\&\dot{\mathcal{F}}$ on the test set and ranked 2nd place in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge.

[355] Towards Fine-Grained Text-to-3D Quality Assessment: A Benchmark and A Two-Stage Rank-Learning Metric

Bingyang Cui,Yujie Zhang,Qi Yang,Zhu Li,Yiling Xu

Main category: cs.CV

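TL;DR: This paper proposes T23D-CompBench, a comprehensive benchmark for compositional text-to-3D generation, and builds on it the Rank2Score evaluator, whose two-stage training yields quality assessments more consistent with human judgment.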

Details

Motivation: Existing benchmarks for text-to-3D quality assessment are outdated, fragmented, and coarse-grained, and current objective metrics have design limitations that keep them from reflecting human perception accurately. Method: The T23D-CompBench benchmark defines five components with twelve sub-components, generates 3,600 textured meshes, and collects 129,600 human ratings; Rank2Score performs pairwise training with supervised contrastive regression and curriculum learning, then refines its predictions using mean opinion scores. Result: Rank2Score outperforms existing metrics across multiple dimensions and can serve as a reward function for optimizing generative models in downstream tasks. Conclusion: Rank2Score significantly improves the agreement between text-to-3D quality assessment and human perception. Abstract: Recent advances in Text-to-3D (T23D) generative models have enabled the synthesis of diverse, high-fidelity 3D assets from textual prompts. However, existing challenges restrict the development of reliable T23D quality assessment (T23DQA). First, existing benchmarks are outdated, fragmented, and coarse-grained, making fine-grained metric training infeasible. Moreover, current objective metrics exhibit inherent design limitations, resulting in non-representative feature extraction and diminished metric robustness. To address these limitations, we introduce T23D-CompBench, a comprehensive benchmark for compositional T23D generation. We define five components with twelve sub-components for compositional prompts, which are used to generate 3,600 textured meshes from ten state-of-the-art generative models. A large-scale subjective experiment is conducted to collect 129,600 reliable human ratings across different perspectives. Based on T23D-CompBench, we further propose Rank2Score, an effective evaluator with two-stage training for T23DQA. Rank2Score enhances pairwise training via supervised contrastive regression and curriculum learning in the first stage, and subsequently refines predictions using mean opinion scores to achieve closer alignment with human judgments in the second stage. Extensive experiments and downstream applications demonstrate that Rank2Score consistently outperforms existing metrics across multiple dimensions and can additionally serve as a reward function to optimize generative models. The project is available at https://cbysjtu.github.io/Rank2Score/.

[356] CE-FAM: Concept-Based Explanation via Fusion of Activation Maps

Michihiro Kuroki,Toshihiko Yamasaki

Main category: cs.CV

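TL;DR: Proposes CE-FAM, a concept-based explanation method that fuses activation maps to identify the concepts an image classifier has learned, their associated regions, and their contributions to predictions, without annotated data and with strong zero-shot inference.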

Details

Motivation: Few existing methods can simultaneously reveal which concepts an image classifier has learned, which regions they correspond to, and how they contribute to predictions, and reliance on annotated data limits interpretability and generalization. Method: A branch network shares activation maps with the image classifier and predicts concepts by mimicking the embeddings of a vision-language model (VLM); concept regions are localized by a gradient-weighted sum of activation maps, and contributions are quantified by their impact on the classification score. Result: The method identifies concept regions and their contributions, a newly proposed evaluation metric assesses the accuracy of region localization, qualitative and quantitative evaluations show it outperforms existing approaches, and it performs well in zero-shot inference on unseen concepts. Conclusion: CE-FAM provides a general, annotation-free framework for concept-based explanation that draws on VLM knowledge to explain image classifier decisions at a fine-grained level. Abstract: Although saliency maps can highlight important regions to explain the reasoning behind image classification in artificial intelligence (AI), the meaning of these regions is left to the user's interpretation. In contrast, concept-based explanations decompose AI predictions into human-understandable concepts, clarifying their contributions. However, few methods can simultaneously reveal what concepts an image classifier learns, which regions are associated with them, and how they contribute to predictions. We propose a novel concept-based explanation method, Concept-based Explanation via Fusion of Activation Maps (CE-FAM). It employs a branched network that shares activation maps with an image classifier and learns to mimic the embeddings of a Vision and Language Model (VLM). The branch network predicts concepts in an image, and their corresponding regions are represented by a weighted sum of activation maps, with weights given by the gradients of the concept prediction scores. Their contributions are quantified based on their impact on the image classification score. Our method provides a general framework for identifying the concept regions and their contributions while leveraging VLM knowledge to handle arbitrary concepts without requiring an annotated dataset. Furthermore, we introduce a novel evaluation metric to assess the accuracy of the concept regions. Our qualitative and quantitative evaluations demonstrate our method outperforms existing approaches and excels in zero-shot inference for unseen concepts.
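
The region computation described in the abstract, a weighted sum of activation maps with gradient-derived weights, can be sketched as follows. The sketch assumes one scalar weight per activation map (for example, a spatially averaged gradient of the concept score); how the gradients are reduced to scalars is an assumption here, not something the abstract specifies.

```python
def concept_region_map(activation_maps, channel_weights):
    """Weighted sum of K activation maps, each an H x W nested list.

    channel_weights: one scalar per map, assumed to be derived from the
    gradient of a concept prediction score (hypothetical reduction).
    """
    if len(activation_maps) != len(channel_weights):
        raise ValueError("need one weight per activation map")
    height, width = len(activation_maps[0]), len(activation_maps[0][0])
    region = [[0.0] * width for _ in range(height)]
    for amap, w in zip(activation_maps, channel_weights):
        for i in range(height):
            for j in range(width):
                region[i][j] += w * amap[i][j]
    return region
```

Locations where positively weighted maps fire strongly end up with high values in the returned map, which is what marks them as belonging to the concept.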

[357] FairViT-GAN: A Hybrid Vision Transformer with Adversarial Debiasing for Fair and Explainable Facial Beauty Prediction

Djamel Eddine Boukhari

Main category: cs.CV

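TL;DR: Proposes FairViT-GAN, a novel hybrid framework that combines a CNN and a ViT to capture local features and global context respectively, and adds an adversarial debiasing mechanism to reduce algorithmic bias tied to protected attributes such as ethnicity, yielding significant gains in both accuracy and fairness.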

Details

Motivation: Existing facial beauty prediction models suffer from architectural constraints, demographic bias, and a lack of transparency: CNNs struggle to model global facial harmony, ViTs can miss fine-grained details, and models may perpetuate societal biases. Method: A dual-branch design fuses a CNN for local texture features with a ViT for global dependencies; an adversarial debiasing mechanism makes the feature representation invariant to protected attributes such as ethnicity, improving fairness and interpretability. Result: On the SCUT-FBP5500 dataset the model reaches a Pearson correlation of 0.9230 and an RMSE of 0.2650, a new state of the art; the performance gap between ethnic subgroups shrinks by 82.9%, and the adversary's classification accuracy falls to 52.1%, close to chance. Conclusion: FairViT-GAN maintains high predictive accuracy while significantly improving fairness and transparency, offering a blueprint for responsible AI systems for subjective visual assessment. Abstract: Facial Beauty Prediction (FBP) has made significant strides with the application of deep learning, yet state-of-the-art models often exhibit critical limitations, including architectural constraints, inherent demographic biases, and a lack of transparency. Existing methods, primarily based on Convolutional Neural Networks (CNNs), excel at capturing local texture but struggle with global facial harmony, while Vision Transformers (ViTs) effectively model long-range dependencies but can miss fine-grained details. Furthermore, models trained on benchmark datasets can inadvertently learn and perpetuate societal biases related to protected attributes like ethnicity. To address these interconnected challenges, we propose FairViT-GAN, a novel hybrid framework that synergistically integrates a CNN branch for local feature extraction and a ViT branch for global context modeling. More significantly, we introduce an adversarial debiasing mechanism where the feature extractor is explicitly trained to produce representations that are invariant to protected attributes, thereby actively mitigating algorithmic bias. Our framework's transparency is enhanced by visualizing the distinct focus of each architectural branch. Extensive experiments on the SCUT-FBP5500 benchmark demonstrate that FairViT-GAN not only sets a new state-of-the-art in predictive accuracy, achieving a Pearson Correlation of 0.9230 and reducing RMSE to 0.2650, but also excels in fairness. 
Our analysis reveals a remarkable 82.9% reduction in the performance gap between ethnic subgroups, with the adversary's classification accuracy dropping to near-random chance (52.1%). We believe FairViT-GAN provides a robust, transparent, and significantly fairer blueprint for developing responsible AI systems for subjective visual assessment.

[358] Sim-DETR: Unlock DETR for Temporal Sentence Grounding

Jiajin Tang,Zhengxuan Wei,Yuchen Zhu,Cheng Shi,Guanbin Li,Liang Lin,Sibei Yang

Main category: cs.CV

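TL;DR: Proposes Sim-DETR, which resolves DETR's performance bottleneck in temporal sentence grounding with two minor modifications to the decoder layers.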

Details

Motivation: Typical strategies for enhancing DETR fail to improve, and may even degrade, its performance on temporal sentence grounding, so the root causes need to be identified and addressed. Method: The standard DETR decoder is extended with a self-attention constraint between queries based on their semantic and positional overlap, plus a query-to-frame alignment mechanism that bridges global and local context. Result: Experiments show that Sim-DETR unlocks the full potential of DETR for this task. Conclusion: Sim-DETR offers a simple yet strong new baseline for temporal sentence grounding and points to the design choices that matter when improving DETR. Abstract: Temporal sentence grounding aims to identify exact moments in a video that correspond to a given textual query, typically addressed with detection transformer (DETR) solutions. However, we find that typical strategies designed to enhance DETR do not improve, and may even degrade, its performance in this task. We systematically analyze and identify the root causes of this abnormal behavior: (1) conflicts between queries from similar target moments and (2) internal query conflicts due to the tension between global semantics and local localization. Building on these insights, we propose a simple yet powerful baseline, Sim-DETR, which extends the standard DETR with two minor modifications in the decoder layers: (1) constraining self-attention between queries based on their semantic and positional overlap and (2) adding query-to-frame alignment to bridge the global and local contexts. Experiments demonstrate that Sim-DETR unlocks the full potential of DETR for temporal sentence grounding, offering a strong baseline for future research.
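
One possible reading of "constraining self-attention between queries based on their positional overlap" is sketched below. It covers only the positional part: each query is assumed to carry a predicted (start, end) span, and attention is allowed only between queries whose temporal IoU meets a threshold. The threshold, the direction of the constraint (keeping overlapping rather than non-overlapping pairs), and the omission of semantic overlap are all assumptions, not the authors' design.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) spans on the time axis."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def overlap_attention_mask(spans, iou_threshold=0.5):
    """Boolean mask where mask[i][j] is True if query i may attend to query j.

    A query always attends to itself; other pairs are kept only when their
    predicted spans overlap enough (hypothetical rule for illustration).
    """
    n = len(spans)
    return [[i == j or temporal_iou(spans[i], spans[j]) >= iou_threshold
             for j in range(n)] for i in range(n)]
```

A mask of this shape could be handed to a decoder's self-attention layer, after converting it to whatever mask convention that layer expects.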

[359] Not All Tokens are Guided Equal: Improving Guidance in Visual Autoregressive Models

Ky Dan Nguyen,Hoang Lam Tran,Anh-Dung Dinh,Daochang Liu,Weidong Cai,Xiuying Wang,Chang Xu

Main category: cs.CV

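TL;DR: Proposes Information-Grounding Guidance (IGG), which uses attention to anchor guidance to semantically important regions, addressing the information inconsistencies caused by progressive resolution scaling in autoregressive image generation.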

Details

Motivation: In autoregressive image generation, progressive resolution scaling causes information inconsistencies between patches across timesteps, so guidance signals drift away from the conditioning information and produce ambiguous or unfaithful features. Method: IGG uses attention to identify informative patches and adaptively reinforce them during sampling, anchoring guidance to semantically important regions and keeping guidance and content aligned. Result: On class-conditioned and text-to-image generation, IGG yields sharper, more coherent, and more semantically accurate images than existing autoregressive methods. Conclusion: IGG mitigates information drift under progressive resolution scaling and gives autoregressive image generation stronger semantic consistency, setting a new benchmark. Abstract: Autoregressive (AR) models based on next-scale prediction are rapidly emerging as a powerful tool for image generation, but they face a critical weakness: information inconsistencies between patches across timesteps introduced by progressive resolution scaling. These inconsistencies scatter guidance signals, causing them to drift away from conditioning information and leaving behind ambiguous, unfaithful features. We tackle this challenge with Information-Grounding Guidance (IGG), a novel mechanism that anchors guidance to semantically important regions through attention. By adaptively reinforcing informative patches during sampling, IGG ensures that guidance and content remain tightly aligned. Across both class-conditioned and text-to-image generation tasks, IGG delivers sharper, more coherent, and semantically grounded images, setting a new benchmark for AR-based methods.

[360] PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications

Hitesh Laxmichand Patel,Amit Agarwal,Srikant Panda,Hansa Meghwani,Karan Dua,Paul Li,Tao Sheng,Sujith Ravi,Dan Roth

Main category: cs.CV

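TL;DR: This paper proposes the Patch Context Robustness Index (PCRI), the first systematic metric for quantifying the robustness of multimodal large language models (MLLMs) to changes in visual context granularity. Evaluating 19 state-of-the-art MLLMs on 15 vision-language benchmarks shows that most leading models are sensitive to background noise, with only a few, such as InternVL2-26B and Qwen2VL-72B, showing consistent robustness across tasks. PCRI provides an interpretable diagnostic tool for architecture design and training strategies, supporting more reliable real-world deployment.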

Details

Motivation: Existing evaluation metrics do not capture how sensitive MLLMs are to irrelevant or distracting visual context in real-world settings, and there is no systematic way to measure robustness to changes in visual context granularity. Method: The Patch Context Robustness Index (PCRI) quantifies robustness by comparing a model's performance on localized image patches with its performance on the full image; it is applied systematically to 19 MLLMs on 15 benchmarks. Result: Most leading MLLMs are brittle to background noise; a few, such as InternVL2-26B and Qwen2VL-72B, are consistently robust; PCRI also exposes how different architectures handle visual context, giving actionable diagnostic insight. Conclusion: PCRI is the first interpretable, systematic score for visual context robustness and supports the design, selection, and deployment of more reliable MLLMs. Abstract: The reliability of Multimodal Large Language Models (MLLMs) in real-world settings is often undermined by sensitivity to irrelevant or distracting visual context, an aspect not captured by existing evaluation metrics. We introduce the Patch Context Robustness Index (PCRI), the first systematic and interpretable score for quantifying MLLM robustness to variations in visual context granularity, measuring performance changes between localized image patches and full-image input. Applying PCRI to 19 state-of-the-art MLLMs across 15 vision-language benchmarks, we find that most leading models remain brittle to background noise, with only a few, such as InternVL2-26B and Qwen2VL-72B, demonstrating consistent robustness across tasks. PCRI analysis also highlights how different model architectures handle and integrate visual context, offering actionable diagnostic insight for both researchers and practitioners. PCRI enables rigorous comparison of context robustness, supporting principled model selection and guiding the development of future architectures and training strategies for robust, real-world deployment.

[361] Learning Adaptive Pseudo-Label Selection for Semi-Supervised 3D Object Detection

Taehun Kong,Tae-Kyun Kim

Main category: cs.CV

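TL;DR: This paper proposes a new semi-supervised 3D object detection framework in which a learnable pseudo-labeling module adaptively selects high-quality pseudo-labels using contextual information and a soft supervision strategy, yielding significant gains on the KITTI and Waymo datasets.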

Details

Motivation: Existing pseudo-label selection relies on manually set or dynamic thresholds, overlooks contextual information such as object distance, class, and learning state, and assesses pseudo-label quality from only part of the information available in the networks, so selection is not accurate enough. Method: A novel SS3DOD framework adds two networks at the teacher output level: one assesses pseudo-label quality by score fusion and the other produces context-adaptive thresholds, both supervised by the alignment of pseudo-labels with ground-truth boxes; a soft supervision strategy adds robustness to noisy pseudo-labels. Result: Experiments on KITTI and Waymo show that the method selects high-precision pseudo-labels with wider context coverage and higher recall, significantly improving on existing methods. Conclusion: The learnable pseudo-labeling module improves semi-supervised 3D object detection through context-aware quality assessment and soft supervision. Abstract: Semi-supervised 3D object detection (SS3DOD) aims to reduce costly 3D annotations utilizing unlabeled data. Recent studies adopt pseudo-label-based teacher-student frameworks and demonstrate impressive performance. The main challenge of these frameworks is in selecting high-quality pseudo-labels from the teacher's predictions. Most previous methods, however, select pseudo-labels by comparing confidence scores against manually set thresholds. The latest works tackle the challenge either by dynamic thresholding or refining the quality of pseudo-labels. Such methods still overlook contextual information e.g. object distances, classes, and learning states, and inadequately assess the pseudo-label quality using partial information available from the networks. In this work, we propose a novel SS3DOD framework featuring a learnable pseudo-labeling module designed to automatically and adaptively select high-quality pseudo-labels. Our approach introduces two networks at the teacher output level. These networks reliably assess the quality of pseudo-labels by the score fusion and determine context-adaptive thresholds, which are supervised by the alignment of pseudo-labels over GT bounding boxes. Additionally, we introduce a soft supervision strategy that can learn robustly under pseudo-label noise. This helps the student network prioritize cleaner labels over noisy ones in semi-supervised learning. Extensive experiments on the KITTI and Waymo datasets demonstrate the effectiveness of our method. 
The proposed method selects high-precision pseudo-labels while maintaining a wider coverage of contexts and a higher recall rate, significantly improving relevant SS3DOD methods.
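
The combination of context-adaptive thresholds and soft supervision can be illustrated with a small sketch. In the paper the thresholds come from a learned network; here a plain lookup table keyed by class and a coarse distance bin stands in for it. The bin boundary, the dictionary layout, and the use of the fused score as the soft weight are all hypothetical.

```python
def select_pseudo_labels(predictions, thresholds, default_threshold=0.5):
    """Keep teacher predictions whose score clears a context-dependent threshold.

    predictions: list of dicts with 'cls', 'distance' (metres), 'score' in [0, 1].
    thresholds: {(cls, distance_bin): threshold}; a stand-in for the learned
    threshold network described in the abstract (assumption).
    """
    selected = []
    for pred in predictions:
        # Coarse context: near vs far objects (the 30 m boundary is illustrative).
        distance_bin = "near" if pred["distance"] < 30.0 else "far"
        threshold = thresholds.get((pred["cls"], distance_bin), default_threshold)
        if pred["score"] >= threshold:
            # Soft supervision: carry the score as a loss weight instead of 1.0.
            selected.append({**pred, "weight": pred["score"]})
    return selected
```

A lower threshold for distant objects, for example, lets sparse far-range detections through, while the soft weight keeps their influence on the student proportional to their estimated quality.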

[362] Tunable-Generalization Diffusion Powered by Self-Supervised Contextual Sub-Data for Low-Dose CT Reconstruction

Guoquan Wei,Zekun Zhou,Liu Shi,Wenzhe Shan,Qiegen Liu

Main category: cs.CV

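TL;DR: This paper proposes SuperDiff, a new method for low-dose CT denoising that combines self-supervised contextual sub-data with diffusion models to achieve strong reconstruction and generalization without paired data.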

Details

Motivation: Existing deep learning models depend on paired data and generalize poorly; diffusion models need to learn the distribution of clean data, which is hard to obtain in clinical practice; and self-supervised methods degrade significantly when transferred across doses. Method: A contextual sub-data similarity adaptive sensing strategy is designed in the LDCT projection domain; knowledge distillation is combined with latent diffusion models to refine details; and a pixel-level self-correcting fusion technique performs fine-grained reconstruction in the image domain. Only LDCT projection domain data is needed for training and testing. Result: On multiple datasets and real data, SuperDiff outperforms existing state-of-the-art methods in both qualitative and quantitative evaluations, with better reconstruction accuracy and cross-dose generalization, including to unseen doses. Conclusion: With a dual-domain cascaded self-supervised strategy, SuperDiff achieves high-fidelity low-dose CT reconstruction with tunable generalization and no paired data, showing good potential for clinical use. Abstract: Current models based on deep learning for low-dose CT denoising rely heavily on paired data and generalize poorly. Even the more concerned diffusion models need to learn the distribution of clean data for reconstruction, which is difficult to satisfy in medical clinical applications. At the same time, self-supervised-based methods face the challenge of significant degradation of generalizability of models pre-trained for the current dose to expand to other doses. To address these issues, this paper proposes a novel method of tunable-generalization diffusion powered by self-supervised contextual sub-data for low-dose CT reconstruction, named SuperDiff. Firstly, a contextual subdata similarity adaptive sensing strategy is designed for denoising centered on the LDCT projection domain, which provides an initial prior for the subsequent progress. Subsequently, the initial prior is used to combine knowledge distillation with a deep combination of latent diffusion models for optimizing image details. The pre-trained model is used for inference reconstruction, and the pixel-level self-correcting fusion technique is proposed for fine-grained reconstruction of the image domain to enhance the image fidelity, using the initial prior and the LDCT image as a guide. In addition, the technique is flexibly applied to the generalization of upper and lower doses or even unseen doses. As a dual-domain cascaded strategy for self-supervised LDCT denoising, SuperDiff requires only LDCT projection domain data for training and testing. 
Full qualitative and quantitative evaluations on both datasets and real data show that SuperDiff consistently outperforms existing state-of-the-art methods in terms of reconstruction and generalization performance.

[363] AssemblyHands-X: Modeling 3D Hand-Body Coordination for Understanding Bimanual Human Activities

Tatsuro Banno,Takehiko Ohkawa,Ruicong Liu,Ryosuke Furuta,Yoichi Sato

Main category: cs.CV

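TL;DR: This paper presents AssemblyHands-X, the first markerless 3D hand-body benchmark for studying hand-body coordination in bimanual activities. It builds a 3D pose annotation pipeline from multi-view videos that combines multi-view triangulation with SMPL-X fitting to obtain reliable 3D registration of hands and upper body, and shows that pose-based action recognition outperforms video baselines and that jointly modeling hand and body cues improves recognition.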

Details

Motivation: Existing 3D activity datasets usually annotate either hand or body pose, so coordinated hand-body motion has not been studied systematically; marker-based motion capture provides full-body pose but introduces visual artifacts that limit generalization to natural, markerless videos. A high-quality markerless hand-body dataset is therefore needed. Method: An annotation pipeline extracts 3D pose from synchronized multi-view videos, combining multi-view triangulation with SMPL-X mesh fitting to produce accurate 3D hand and upper-body annotations. Several input representations (video, hand pose, body pose, joint hand-body pose) are evaluated on action recognition models based on graph convolution or spatio-temporal attention. Result: Pose-based action recognition is more efficient and accurate than video-based methods, and using hand and body pose together outperforms using either alone, showing the importance of modeling hand-body interaction. Conclusion: Hand-body coordination plays a key role in understanding bimanual activities, and AssemblyHands-X provides a reliable benchmark for studying it, advancing action recognition in markerless, natural settings. Abstract: Bimanual human activities inherently involve coordinated movements of both hands and body. However, the impact of this coordination in activity understanding has not been systematically evaluated due to the lack of suitable datasets. Such evaluation demands kinematic-level annotations (e.g., 3D pose) for the hands and body, yet existing 3D activity datasets typically annotate either hand or body pose. Another line of work employs marker-based motion capture to provide full-body pose, but the physical markers introduce visual artifacts, thereby limiting models' generalization to natural, markerless videos. To address these limitations, we present AssemblyHands-X, the first markerless 3D hand-body benchmark for bimanual activities, designed to study the effect of hand-body coordination for action recognition. We begin by constructing a pipeline for 3D pose annotation from synchronized multi-view videos. Our approach combines multi-view triangulation with SMPL-X mesh fitting, yielding reliable 3D registration of hands and upper body. We then validate different input representations (e.g., video, hand pose, body pose, or hand-body pose) across recent action recognition models based on graph convolution or spatio-temporal attention. Our extensive experiments show that pose-based action inference is more efficient and accurate than video baselines.
Moreover, joint modeling of hand and body cues improves action recognition over using hands or upper body alone, highlighting the importance of modeling interdependent hand-body dynamics for a holistic understanding of bimanual activities.
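
The abstract names multi-view triangulation as one step of the annotation pipeline but does not say which variant is used. As a minimal sketch only, the following shows the two-view midpoint method: each camera contributes a ray (center plus viewing direction toward the detected keypoint), and the 3D point is taken as the midpoint of the shortest segment between the two rays. The function name and the ray-based interface are assumptions made for illustration, not the authors' implementation.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _axpy(origin: Vec3, direction: Vec3, s: float) -> Vec3:
    """Return origin + s * direction."""
    return (origin[0] + s * direction[0],
            origin[1] + s * direction[1],
            origin[2] + s * direction[2])


def triangulate_midpoint(c1: Vec3, d1: Vec3, c2: Vec3, d2: Vec3) -> Vec3:
    """Midpoint of the closest points on rays c1 + s*d1 and c2 + t*d2.

    Hypothetical helper: c1/c2 are camera centers, d1/d2 are ray directions
    through the 2D keypoint in each view (they need not be unit length).
    """
    w = (c1[0] - c2[0], c1[1] - c2[1], c1[2] - c2[2])
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; triangulation is ill-posed")
    # Closed-form minimizer of |(c1 + s*d1) - (c2 + t*d2)|^2 over s and t.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = _axpy(c1, d1, s)
    p2 = _axpy(c2, d2, t)
    return ((p1[0] + p2[0]) / 2.0, (p1[1] + p2[1]) / 2.0, (p1[2] + p2[2]) / 2.0)
```

With more than two views, a least-squares formulation over all rays would be the usual choice; the two-view case is shown because it has a short closed form.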

[364] LifeCLEF Plant Identification Task 2015

Herve Goeau,Pierre Bonnet,Alexis Joly

Main category: cs.CV

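TL;DR: The LifeCLEF 2015 challenge evaluated large-scale plant identification methods on more than 100K images covering 1000 plant species of Western Europe, with data collected through a participatory sensing platform.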

Details

Motivation: Evaluate large-scale plant identification methods under conditions close to a real-world biodiversity monitoring scenario. Method: A challenge was organized on a dataset of more than 100K images and 1000 plant species, and the methods and systems of the participating research groups were evaluated. Result: The overview summarizes the approaches and systems used by the participating groups and analyzes the main outcomes of the challenge. Conclusion: The challenge provides an important benchmark for large-scale plant identification and shows the feasibility and potential of data gathered through public participation. Abstract: The LifeCLEF plant identification challenge aims at evaluating plant identification methods and systems at a very large scale, close to the conditions of a real-world biodiversity monitoring scenario. The 2015 evaluation was conducted on a set of more than 100K images illustrating 1000 plant species living in Western Europe. The main originality of this dataset is that it was built through a large-scale participatory sensing platform that was initiated in 2011 and now involves tens of thousands of contributors. This overview presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.

Jinghan Xu,Yuyang Zhang,Qixuan Cai,Jiancheng Chen,Keqiu Li

Main category: cs.CV

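TL;DR: Proposes Cross-modal Contrastive Unlearning (CCU), a framework that unlearns visual data in visual-radar multimodal settings while preserving knowledge of the other modalities and the intra-class structural stability of retained data, substantially improving unlearning efficiency and model performance.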

Details

Motivation: Existing machine unlearning methods struggle to preserve cross-modal knowledge and the structural stability of retained data when removing visual data, degrading overall performance and especially that of other modalities. Method: The CCU framework has three core components: selective visual unlearning (inverse contrastive learning that dissociates visual representations from their semantics), cross-modal knowledge retention (maintaining other modalities' discriminability through semantic consistency), and dual-set contrastive separation (isolating structural perturbations between the unlearn set and the retain set to maintain model performance). Result: Experiments on three datasets show that, compared with the top-accuracy baseline, CCU achieves a 7.12% accuracy improvement with only 7% of the unlearning time. Conclusion: CCU unlearns visual-modality data efficiently and precisely while protecting cross-modal knowledge and the structural integrity of retained data, clearly outperforming existing methods. Abstract: The visual modality is the most vulnerable to privacy leakage in real-world multimodal applications like autonomous driving with visual and radar data. Machine unlearning removes specific training data from pre-trained models to address privacy leakage; however, existing methods fail to preserve cross-modal knowledge and maintain intra-class structural stability of retain data, leading to reduced overall and other modalities' performance during visual unlearning. To address these challenges, we propose a Cross-modal Contrastive Unlearning (CCU) framework, which integrates three key components: (a) selective visual unlearning: employing inverse contrastive learning to dissociate visual representations from their original semantics, (b) cross-modal knowledge retention: preserving other modalities' discriminability through semantic consistency, and (c) dual-set contrastive separation: preserving the model performance via isolation of structural perturbations between the unlearn set and retain set. Extensive experiments on three datasets demonstrate the superiority of CCU, and our method achieves a 7.12% accuracy improvement with only 7% of the unlearning time compared to the top-accuracy baseline.

[366] Q-FSRU: Quantum-Augmented Frequency-Spectral For Medical Visual Question Answering

Rakesh Thakur,Yusra Tariq,Rakesh Chandra Joshi

Main category: cs.CV

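TL;DR: Proposes Q-FSRU, a model that combines frequency-domain representations with quantum retrieval-augmented generation to improve the accuracy and explainability of medical visual question answering.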

Details

Motivation: Answering clinical questions that require both image and text understanding remains a major challenge in healthcare AI, and existing models have limited reasoning ability on complex cases. Method: Medical image and text features are transformed into the frequency domain with the Fast Fourier Transform (FFT), and a quantum-inspired retrieval-augmented generation (Quantum RAG) system retrieves relevant information from external knowledge sources for fused reasoning. Result: On the VQA-RAD dataset the model outperforms earlier models, especially on complex cases that need joint image-text reasoning. Conclusion: Combining frequency-domain information with quantum retrieval improves the performance and explainability of medical VQA models, giving doctors smarter and more transparent AI tools. Abstract: Solving tough clinical questions that require both image and text understanding is still a major challenge in healthcare AI. In this work, we propose Q-FSRU, a new model that combines Frequency Spectrum Representation and Fusion (FSRU) with a method called Quantum Retrieval-Augmented Generation (Quantum RAG) for medical Visual Question Answering (VQA). The model takes in features from medical images and related text, then shifts them into the frequency domain using Fast Fourier Transform (FFT). This helps it focus on more meaningful data and filter out noise or less useful information. To improve accuracy and ensure that answers are based on real knowledge, we add a quantum-inspired retrieval system. It fetches useful medical facts from external sources using quantum-based similarity techniques. These details are then merged with the frequency-based features for stronger reasoning. We evaluated our model using the VQA-RAD dataset, which includes real radiology images and questions. The results showed that Q-FSRU outperforms earlier models, especially on complex cases needing image-text reasoning. The mix of frequency and quantum information improves both performance and explainability. Overall, this approach offers a promising way to build smart, clear, and helpful AI tools for doctors.
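
The abstract says only that features are moved into the frequency domain with an FFT so that noise can be filtered out; it does not describe the filter. The sketch below illustrates the general idea under stated assumptions: a 1D feature vector, a naive O(n^2) discrete Fourier transform (which computes the same transform an FFT does, just more slowly), and a simple low-pass rule. The function names and the `keep` cutoff are hypothetical and not taken from Q-FSRU.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform; same result as an FFT, O(n^2) time."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: List[complex]) -> List[complex]:
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def low_pass(features: List[float], keep: int) -> List[float]:
    """Zero every frequency bin farther than `keep` bins from the DC term.

    Assumption for illustration: low frequencies carry the useful signal.
    """
    spectrum = dft(features)
    n = len(spectrum)
    filtered = [spectrum[k] if min(k, n - k) <= keep else 0j for k in range(n)]
    return [value.real for value in idft(filtered)]
```

For example, `low_pass(vector, keep=0)` leaves only the mean of the vector, while larger `keep` values retain progressively finer variation.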

[367] LifeCLEF Plant Identification Task 2014

Herve Goeau,Alexis Joly,Pierre Bonnet,Souheil Selmi,Jean-Francois Molino,Daniel Barthelemy,Nozha Boujemaa

Main category: cs.CV

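TL;DR: The LifeCLEF plant identification task built a multi-view image dataset of about 500 plant species through a citizen science initiative to evaluate plant identification systems.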

Details

Motivation: Promote the use of plant identification technology in real-world settings and support biodiversity and botany research. Method: Systems are evaluated on seven types of plant image content (leaf, flower, fruit, and so on), with data from a citizen science project run by Tela Botanica. Result: Ten groups from six countries submitted a total of 27 runs using distinct methods, showing the multimedia retrieval community's continued interest in plant identification. Conclusion: The task validates plant identification systems and points to remaining challenges and research directions in plant identification. Abstract: The LifeCLEF plant identification task provides a testbed for a system-oriented evaluation of plant identification on about 500 species of trees and herbaceous plants. Seven types of image content are considered: scan and scan-like pictures of leaf, and 6 kinds of detailed views with unconstrained conditions, directly photographed on the plant: flower, fruit, stem & bark, branch, leaf and entire view. The main originality of this data is that it was specifically built through a citizen science initiative conducted by Tela Botanica, a French social network of amateur and expert botanists. This makes the task closer to the conditions of a real-world application. This overview presents more precisely the resources and assessments of the task, summarizes the retrieval approaches employed by the participating groups, and provides an analysis of the main evaluation results. With a total of ten groups from six countries and a total of twenty-seven submitted runs, involving distinct and original methods, this fourth-year task confirms the Image & Multimedia Retrieval community's interest in biodiversity and botany, and highlights further challenging studies in plant identification.

[368] EWC-Guided Diffusion Replay for Exemplar-Free Continual Learning in Medical Imaging

Anoushka Harit,William Prew,Zhongtian Sun,Florian Markowetz

Main category: cs.CV

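TL;DR: Proposes a continual learning framework that combines class-conditional diffusion replay with Elastic Weight Consolidation (EWC) for continually adapting medical imaging foundation models, reducing forgetting and preserving privacy without storing patient exemplars.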

Details

Motivation: Medical imaging models need continual updates, but full retraining is often impractical because of privacy constraints and cost, so an efficient, privacy-preserving continual learning method is needed. Method: Class-conditional diffusion replay generates synthetic samples for training and is paired with Elastic Weight Consolidation (EWC) to stabilize important parameters; a compact Vision Transformer serves as the backbone, and evaluation covers multiple medical imaging datasets. Result: On CheXpert the method reaches 0.851 AUROC, reduces forgetting by more than 30% relative to DER++, and approaches joint training at 0.869 AUROC while remaining efficient and privacy-preserving; analysis identifies replay fidelity and Fisher-weighted parameter drift as two key factors in forgetting. Conclusion: The method offers a scalable, privacy-friendly path for continual adaptation of clinical imaging models, balancing performance and practicality. Abstract: Medical imaging foundation models must adapt over time, yet full retraining is often blocked by privacy constraints and cost. We present a continual learning framework that avoids storing patient exemplars by pairing class-conditional diffusion replay with Elastic Weight Consolidation. Using a compact Vision Transformer backbone, we evaluate across eight MedMNIST v2 tasks and CheXpert. On CheXpert our approach attains 0.851 AUROC, reduces forgetting by more than 30% relative to DER++, and approaches joint training at 0.869 AUROC, while remaining efficient and privacy-preserving. Analyses connect forgetting to two measurable factors: fidelity of replay and Fisher-weighted parameter drift, highlighting the complementary roles of replay diffusion and synaptic stability. The results indicate a practical route for scalable, privacy-aware continual adaptation of clinical imaging models.
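
For readers unfamiliar with Elastic Weight Consolidation, the standard penalty is (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2, where theta* are the parameters after the previous task and F is a diagonal Fisher information estimate. This quantity is also the "Fisher-weighted parameter drift" the abstract refers to, up to the scale factor. The sketch below is the textbook form on flat parameter lists; the function names and the value of `lam` are illustrative assumptions, and how the paper combines this term with the diffusion replay loss is not specified in the abstract.

```python
from typing import Sequence


def ewc_penalty(params: Sequence[float],
                anchor_params: Sequence[float],
                fisher_diag: Sequence[float],
                lam: float) -> float:
    """Standard EWC regularizer: 0.5 * lam * sum(F_i * (theta_i - theta*_i)^2)."""
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    drift = sum(f * (p - a) ** 2
                for p, a, f in zip(params, anchor_params, fisher_diag))
    return 0.5 * lam * drift


def regularized_loss(task_loss: float,
                     params: Sequence[float],
                     anchor_params: Sequence[float],
                     fisher_diag: Sequence[float],
                     lam: float) -> float:
    """Loss on the current task (e.g. on replayed samples) plus the EWC term."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)
```

Parameters with a large Fisher value are expensive to move, so the optimizer changes them less when learning a new task.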

[369] Adversarial Versus Federated: An Adversarial Learning based Multi-Modality Cross-Domain Federated Medical Segmentation

You Zhou,Lijiang Chen,Shuchang Lyu,Guangxia Cui,Wenpei Bai,Zheng Zhou,Meng Li,Guangliang Cheng,Huiyu Zhou,Qi Zhao

Main category: cs.CV

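TL;DR: This paper proposes a new Federated Domain Adaptation (FedDA) segmentation training framework that aligns feature maps across clients through feature-level adversarial learning, improving model generalization in cross-domain medical image segmentation.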

Details

Motivation: Because of imbalanced medical resources, data corruption, or improper data preservation, different clients may hold medical images of different modalities, which makes cross-domain segmentation in federated learning challenging. Method: A feature-level adversarial learning approach based on an adversarial training mechanism aligns feature maps across clients to mitigate domain shift. Result: Experiments on three medical image datasets show that FedDA outperforms state-of-the-art federated aggregation algorithms in both objective and subjective assessment, achieving robust cross-domain federated aggregation. Conclusion: FedDA gives single-modality clients cross-modality processing capability and offers a feasible solution to cross-domain medical image segmentation in federated learning. Abstract: Federated learning enables collaborative training of machine learning models among different clients while ensuring data privacy, emerging as the mainstream for breaking data silos in the healthcare domain. However, the imbalance of medical resources, data corruption or improper data preservation may lead to a situation where different clients possess medical images of different modalities. This heterogeneity poses a significant challenge for cross-domain medical image segmentation within the federated learning framework. To address this challenge, we propose a new Federated Domain Adaptation (FedDA) segmentation training framework. Specifically, we propose a feature-level adversarial learning among clients by aligning feature maps across clients through embedding an adversarial training mechanism. This design can enhance the model's generalization on multiple domains and alleviate the negative impact from domain-shift. Comprehensive experiments on three medical image datasets demonstrate that our proposed FedDA substantially achieves cross-domain federated aggregation, endowing single-modality clients with cross-modality processing capabilities, and consistently delivers robust performance compared to state-of-the-art federated aggregation algorithms in objective and subjective assessment. Our code is available at https://github.com/GGbond-study/FedDA.

[370] EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

Xin Luo,Jiahao Wang,Chenyuan Wu,Shitao Xiao,Xiyan Jiang,Defu Lian,Jiajun Zhang,Dong Liu,Zheng Liu

Main category: cs.CV

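TL;DR: This paper presents EditScore, a high-fidelity reward model for instruction-guided image editing. Using the EditReward-Bench benchmark to systematically evaluate and improve reward models, it successfully applies reinforcement learning to image editing and substantially improves model performance.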

Details

Motivation: Existing instruction-guided image editing models struggle with complex instructions and lack an efficient reward signal to support reinforcement learning, so a high-quality, domain-specific reward model is needed. Method: The authors first build EditReward-Bench to evaluate reward models, then train a series of reward models called EditScore (7B-72B) on carefully curated data, apply a self-ensemble strategy to improve them, and finally use them in an online reinforcement learning framework for policy optimization of base models such as OmniGen2. Result: EditScore performs strongly on EditReward-Bench, with the largest variant even surpassing GPT-5; compared with existing open-source vision-language models, it provides the high-quality feedback signal that reinforcement learning needs and substantially improves the editing performance of the baseline model. Conclusion: A high-fidelity, domain-specialized reward model is the key to unlocking reinforcement learning for image editing, and this work offers the first systematic path from benchmarking to reward modeling to RL training. Abstract: Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark to systematically evaluate reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 in the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift.
Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.
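
The abstract mentions a self-ensemble strategy suited to a generative reward model but gives no details. One common reading, assumed here purely for illustration, is to sample the stochastic scorer several times for the same edit and average the results to reduce variance. The `score_once` callable is a hypothetical stand-in for a single sampled reward-model call; nothing below reflects the actual EditScore interface.

```python
import statistics
from typing import Callable


def self_ensemble_score(score_once: Callable[[], float], num_samples: int) -> float:
    """Average several independently sampled scores for the same edit.

    `score_once` is a hypothetical zero-argument callable that runs one
    stochastic forward pass of a generative reward model and returns a score.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    return statistics.fmean(score_once() for _ in range(num_samples))
```

Averaging trades extra inference cost for a less noisy reward, which matters when the reward drives policy optimization.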

[371] MoReact: Generating Reactive Motion from Textual Descriptions

Xiyan Xu,Sirui Xu,Yu-Xiong Wang,Liang-Yan Gui

Main category: cs.CV

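TL;DR: This paper proposes MoReact, a text-driven method for generating human reactive motion that uses a diffusion model to generate global trajectories and then local motions, with an interaction loss to improve realism and alignment with the text description.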

Details

Motivation: Existing methods struggle to respond accurately to diverse, dynamic interaction scenarios when generating multi-person interactions and make little use of semantic information, so a reaction generation model that incorporates text semantics and responds adaptively is needed. Method: MoReact uses a diffusion model to generate the global trajectory first and local motion second, keeping the motion consistent with the text and the other person's actions; a new interaction loss improves the realism of close interactions. Result: Experiments on a two-person motion dataset show that the method produces realistic, diverse, and controllable reactions that match the counterpart's motion and follow the text description. Conclusion: MoReact performs well on text-driven human reaction generation, improving semantic consistency and interaction realism, and offers a new approach to modeling human interaction. Abstract: Modeling and generating human reactions poses a significant challenge with broad applications for computer vision and human-computer interaction. Existing methods either treat multiple individuals as a single entity, directly generating interactions, or rely solely on one person's motion to generate the other's reaction, failing to integrate the rich semantic information that underpins human interactions. These methods also often fall short in adaptive responsiveness, i.e., the ability to accurately respond to diverse and dynamic interaction scenarios. Recognizing this gap, our work introduces an approach tailored to address the limitations of existing models by focusing on text-driven human reaction generation. Our model specifically generates realistic motion sequences for individuals responding to the other's actions based on a descriptive text of the interaction scenario. The goal is to produce motion sequences that not only complement the opponent's movements but also semantically fit the described interactions. To achieve this, we present MoReact, a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially. This approach stems from the observation that generating global trajectories first is crucial for guiding local motion, ensuring better alignment with given action and text. Furthermore, we introduce a novel interaction loss to enhance the realism of generated close interactions.
Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach for this novel task, which is capable of producing realistic, diverse, and controllable reactions that not only closely match the movements of the counterpart but also adhere to the textual guidance. Please find our webpage at https://xiyan-xu.github.io/MoReactWebPage.

[372] Revisit the Imbalance Optimization in Multi-task Learning: An Experimental Analysis

Yihang Guo,Tianyuan Yu,Liang Bai,Yanming Guo,Yirun Ruan,William Li,Weishi Zheng

Main category: cs.CV

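TL;DR: This paper systematically analyzes optimization imbalance in multi-task learning, finds that it correlates strongly with the norm of task-specific gradients, and shows that a simple strategy of scaling task losses by their gradient norms matches the performance of a costly grid search.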

Details

Motivation: Multi-task learning suffers from "unbalanced optimization", which leaves its performance below single-task models; existing methods behave inconsistently across datasets and depend on expensive hyperparameter tuning, so a systematic analysis of the root causes is needed. Method: Systematic experiments examine the factors behind optimization imbalance, focusing on the relationship between task gradient norms and performance, and a strategy that adjusts task loss weights based on gradient norms is proposed. Result: Task gradient norms correlate strongly with optimization imbalance; scaling losses by gradient norm achieves performance comparable to, or even better than, an extensive grid search, without extra tuning. Conclusion: Controlling gradient dynamics is key to stable multi-task learning, and understanding and regulating gradient behavior is a more direct route than designing increasingly complex models or methods. Abstract: Multi-task learning (MTL) aims to build general-purpose vision systems by training a single network to perform multiple tasks jointly. While promising, its potential is often hindered by "unbalanced optimization", where task interference leads to subpar performance compared to single-task models. To facilitate research in MTL, this paper presents a systematic experimental analysis to dissect the factors contributing to this persistent problem. Our investigation confirms that the performance of existing optimization methods varies inconsistently across datasets, and advanced architectures still rely on costly grid-searched loss weights. Furthermore, we show that while powerful Vision Foundation Models (VFMs) provide strong initialization, they do not inherently resolve the optimization imbalance, and merely increasing data quantity offers limited benefits. A crucial finding emerges from our analysis: a strong correlation exists between the optimization imbalance and the norm of task-specific gradients. We demonstrate that this insight is directly applicable, showing that a straightforward strategy of scaling task losses according to their gradient norms can achieve performance comparable to that of an extensive and computationally expensive grid search. Our comprehensive analysis suggests that understanding and controlling gradient dynamics is a more direct path to stable MTL than developing increasingly complex methods.
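
The abstract states that task losses are scaled according to their gradient norms but does not give the exact rule. A minimal sketch of one plausible variant, assuming inverse-norm weighting so that every weighted task gradient ends up with the same magnitude on the shared parameters, is shown below. The function names and the normalization (weights sum to the number of tasks) are assumptions for illustration, not the paper's formula.

```python
import math
from typing import List, Sequence


def grad_norm(grad: Sequence[float]) -> float:
    """Euclidean norm of one task's gradient w.r.t. the shared parameters."""
    return math.sqrt(sum(g * g for g in grad))


def gradient_norm_weights(task_grads: Sequence[Sequence[float]],
                          eps: float = 1e-12) -> List[float]:
    """Loss weights inversely proportional to each task's gradient norm.

    Weights are rescaled to sum to the number of tasks, so equal norms
    give the usual unweighted sum of losses.
    """
    inverse = [1.0 / (grad_norm(g) + eps) for g in task_grads]
    scale = len(inverse) / sum(inverse)
    return [w * scale for w in inverse]
```

In a training loop the weights would typically be recomputed every few steps from fresh gradients and applied as `total = sum(w * loss for w, loss in zip(weights, losses))`.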

[373] Bridging the Task Gap: Multi-Task Adversarial Transferability in CLIP and Its Derivatives

Kuanrong Liu,Siyuan Liang,Cheng Qian,Ming Zhang,Xiaochun Cao

Main category: cs.CV

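TL;DR: This paper studies the cross-task transfer behavior of CLIP models under adversarial perturbations, finds that adversarial examples generated from fine-grained tasks transfer more strongly, and proposes MT-AdvCLIP, a new multi-task adversarial framework that significantly raises attack success rates against various CLIP-derived models.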

Details

Motivation: Understanding how adversarial examples transfer across tasks is essential for assessing CLIP's generalization ability and security risks, yet its robustness on fine-grained tasks remains underexplored. Method: The MT-AdvCLIP framework introduces a task-aware feature aggregation loss to generate adversarial perturbations with stronger cross-task generalization, strengthening attacks from fine-grained task models on the shared CLIP backbone. Result: Experiments on multiple public datasets show that MT-AdvCLIP significantly improves adversarial transfer success rates against various CLIP-derived models without increasing the perturbation budget, with the average attack success rate improving by over 39%. Conclusion: The study reveals the transfer mechanism of adversarial examples in multi-task CLIP models and offers new insights for multi-task robustness evaluation and adversarial example design. Abstract: As a general-purpose vision-language pretraining model, CLIP demonstrates strong generalization ability in image-text alignment tasks and has been widely adopted in downstream applications such as image classification and image-text retrieval. However, it struggles with fine-grained tasks such as object detection and semantic segmentation. While many variants aim to improve CLIP on these tasks, its robustness to adversarial perturbations remains underexplored. Understanding how adversarial examples transfer across tasks is key to assessing CLIP's generalization limits and security risks. In this work, we conduct a systematic empirical analysis of the cross-task transfer behavior of CLIP-based models on image-text retrieval, object detection, and semantic segmentation under adversarial perturbations. We find that adversarial examples generated from fine-grained tasks (e.g., object detection and semantic segmentation) often exhibit stronger transfer potential than those from coarse-grained tasks, enabling more effective attacks against the original CLIP model. Motivated by this observation, we propose a novel framework, Multi-Task Adversarial CLIP (MT-AdvCLIP), which introduces a task-aware feature aggregation loss and generates perturbations with enhanced cross-task generalization capability. This design strengthens the attack effectiveness of fine-grained task models on the shared CLIP backbone.
Experimental results on multiple public datasets show that MT-AdvCLIP significantly improves the adversarial transfer success rate against various CLIP-derived models (the average attack success rate across multiple tasks improves by over 39%), without increasing the perturbation budget. This study reveals the transfer mechanism of adversarial examples in multi-task CLIP models, offering new insights into multi-task robustness evaluation and adversarial example design.

[374] Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models

Longtao Jiang,Mingfei Han,Lei Chen,Yongqiang Yu,Feng Zhao,Xiaojun Chang,Zhihui Li

Main category: cs.CV

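TL;DR: This paper proposes Token Painter, a training-free text-guided image inpainting method based on Mask AutoRegressive (MAR) models. Through dual-stream encoder information fusion and adaptive decoder attention enhancement, it produces inpainting results that align closely with the text prompt and remain harmonious with the background.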

Details

Motivation: Existing diffusion models for text-guided image inpainting struggle to both align with prompt details and keep the background consistent, so a method with better local controllability is needed. Method: Token Painter comprises Dual-Stream Encoder Information Fusion (DEIF) and Adaptive Decoder Attention Score Enhancing (ADAE): it fuses text and background information in the frequency domain to produce guidance tokens and enhances key attention scores to improve generation quality. Result: The method surpasses prior state-of-the-art methods on multiple metrics, with superior visual quality and prompt consistency. Conclusion: As a training-free method, Token Painter balances text fidelity and background harmony in text-guided image inpainting and shows the potential of MAR models for this task. Abstract: Text-guided image inpainting aims to inpaint masked image regions based on a textual prompt while preserving the background. Although diffusion-based methods have become dominant, their property of modeling the entire image in latent space makes it challenging for the results to align well with prompt details and maintain a consistent background. To address these issues, we explore Mask AutoRegressive (MAR) models for this task. MAR naturally supports image inpainting by generating latent tokens corresponding to mask regions, enabling better local controllability without altering the background. However, directly applying MAR to this task makes the inpainting content either ignore the prompts or be disharmonious with the background context. Through analysis of the attention maps from the inpainting images, we identify the impact of background tokens on text tokens during the MAR generation, and leverage this to design Token Painter, a training-free text-guided image inpainting method based on MAR. Our approach introduces two key components: (1) Dual-Stream Encoder Information Fusion (DEIF), which fuses the semantic and context information from text and background in the frequency domain to produce novel guidance tokens, allowing MAR to generate text-faithful inpainting content while remaining harmonious with the background context. (2) Adaptive Decoder Attention Score Enhancing (ADAE), which adaptively enhances attention scores on guidance tokens and inpainting tokens to further enhance the alignment of prompt details and the content visual quality.
Extensive experiments demonstrate that our training-free method outperforms prior state-of-the-art methods across almost all metrics and delivers superior visual results. Codes will be released.

[375] DriveE2E: Closed-Loop Benchmark for End-to-End Autonomous Driving through Real-to-Simulation

Haibao Yu,Wenxian Yang,Ruiyang Hao,Chuanye Wang,Jiaru Zhong,Ping Luo,Zaiqing Nie

Main category: cs.CV

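TL;DR: Proposes a closed-loop evaluation framework that incorporates real-world driving scenarios: using the CARLA simulator with infrastructure cooperation, it builds high-fidelity digital-twin intersections and dynamic traffic scenarios, making closed-loop evaluation of autonomous driving more realistic and challenging.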

Details

Motivation: Existing CARLA-based closed-loop benchmarks rely on manually configured traffic scenarios that diverge from real-world conditions and fail to reflect actual driving performance, so a more realistic and diverse evaluation environment is needed. Method: 800 dynamic traffic scenarios are extracted from 100 hours of video captured by high-mounted infrastructure sensors, static digital-twin assets are built for 15 real-world intersections, and both are integrated into the CARLA simulator for realistic closed-loop testing. Result: The resulting closed-loop simulation environment is highly realistic, covering complex urban intersections across diverse driving behaviors, locations, weather conditions, and times of day, and makes evaluation markedly more challenging and relevant to real driving. Conclusion: The framework narrows the gap between simulation and reality and provides a more realistic, comprehensive closed-loop benchmark for end-to-end autonomous driving models. Abstract: Closed-loop evaluation is increasingly critical for end-to-end autonomous driving. Current closed-loop benchmarks using the CARLA simulator rely on manually configured traffic scenarios, which can diverge from real-world conditions, limiting their ability to reflect actual driving performance. To address these limitations, we introduce a simple yet challenging closed-loop evaluation framework that closely integrates real-world driving scenarios into the CARLA simulator with infrastructure cooperation. Our approach involves extracting 800 dynamic traffic scenarios selected from a comprehensive 100-hour video dataset captured by high-mounted infrastructure sensors, and creating static digital twin assets for 15 real-world intersections with consistent visual appearance. These digital twins accurately replicate the traffic and environmental characteristics of their real-world counterparts, enabling more realistic simulations in CARLA. This evaluation is challenging due to the diversity of driving behaviors, locations, weather conditions, and times of day at complex urban intersections. In addition, we provide a comprehensive closed-loop benchmark for evaluating end-to-end autonomous driving models. Project URL: https://github.com/AIR-THU/DriveE2E.

[376] Learning Encoding-Decoding Direction Pairs to Unveil Concepts of Influence in Deep Vision Networks

Alexandros Doumanoglou,Kurt Driessens,Dimitrios Zarpalas

Main category: cs.CV

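TL;DR: Proposes a new method that estimates encoding-decoding direction pairs through directional clustering and signal vectors under a probabilistic view, revealing concept representations in deep vision networks and enabling model understanding, debugging, and correction.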

Details

Motivation: Concept representations in deep neural networks are hard to interpret, and encoding-decoding direction pairs are not directly accessible, which limits understanding of and intervention in a model's internal mechanisms. Method: Decoding directions are identified by directional clustering of activations, encoding directions are estimated with signal vectors under a probabilistic view, and Uncertainty Region Alignment is used to uncover interpretable directions that affect predictions. Result: The method recovers ground-truth direction pairs on synthetic data; on real data, decoding directions correspond to monosemantic concepts and outperform unsupervised baselines, and encoding direction estimates are accurate, as validated by activation maximization; applications to model understanding, prediction explanation, and counterfactual generation through intervention are demonstrated. Conclusion: The method reveals the concept structure of deep networks and provides interpretable direction pairs that support model understanding, debugging, and correction, outperforming traditional feature-reconstruction methods. Abstract: Empirical evidence shows that deep vision networks represent concepts as directions in latent space, vectors we call concept embeddings. Each concept has a latent factor (a scalar) indicating its presence in an input patch. For a given patch, multiple latent factors are encoded into a compact representation by linearly combining concept embeddings, with the factors as coefficients. Since these embeddings enable such encoding, we call them encoding directions. A latent factor can be recovered via the inner product with a filter, a vector we call a decoding direction. These encoding-decoding direction pairs are not directly accessible, but recovering them helps open the black box of deep networks, enabling understanding, debugging, and improving models. Decoding directions attribute meaning to latent codes, while encoding directions assess concept influence on predictions, with both enabling model correction by unlearning irrelevant concepts. Unlike prior matrix decomposition, autoencoder, or dictionary learning methods that rely on feature reconstruction, we propose a new perspective: decoding directions are identified via directional clustering of activations, and encoding directions are estimated with signal vectors under a probabilistic view. We further leverage network weights through a novel technique, Uncertainty Region Alignment, which reveals interpretable directions affecting predictions.
Our analysis shows that (a) on synthetic data, our method recovers ground-truth direction pairs; (b) on real data, decoding directions map to monosemantic, interpretable concepts and outperform unsupervised baselines; and (c) signal vectors faithfully estimate encoding directions, validated via activation maximization. Finally, we demonstrate applications in understanding global model behavior, explaining individual predictions, and intervening to produce counterfactuals or correct errors.
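
The definitions in the abstract (a representation is a linear combination of concept embeddings, and each latent factor is recovered by an inner product with a matching filter) can be made concrete with a toy 2D case. The sketch below builds the decoding directions for two known, non-orthogonal encoding directions by inverting the 2x2 embedding matrix. This only illustrates the encoding-decoding relationship; it assumes the encoding directions are already known, which is exactly what the paper's clustering-based method does not assume, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]


def encode(factors: Sequence[float], embeddings: Sequence[Vec2]) -> List[float]:
    """Representation = sum_i factor_i * embedding_i (the encoding directions)."""
    return [sum(z * e[axis] for z, e in zip(factors, embeddings)) for axis in range(2)]


def decoding_directions(e1: Vec2, e2: Vec2) -> Tuple[Vec2, Vec2]:
    """Filters d1, d2 with d_i . e_j = 1 if i == j else 0 (rows of the inverse)."""
    det = e1[0] * e2[1] - e1[1] * e2[0]
    if abs(det) < 1e-12:
        raise ValueError("encoding directions are collinear")
    d1 = (e2[1] / det, -e2[0] / det)
    d2 = (-e1[1] / det, e1[0] / det)
    return d1, d2


def decode(representation: Sequence[float], direction: Vec2) -> float:
    """Recover one latent factor as an inner product with its decoding direction."""
    return representation[0] * direction[0] + representation[1] * direction[1]
```

Note that when the embeddings are not orthogonal, a decoding direction differs from its encoding direction, which is why the abstract treats the two as a pair rather than as a single vector.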

[377] SAR-KnowLIP: Towards Multimodal Foundation Models for Remote Sensing

Yi Yang,Xiaokun Zhang,Qingchen Fang,Ziqi Ye,Rui Li,Li Liu,Haipeng Wang

Main category: cs.CV

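TL;DR: This paper proposes SAR-KnowLIP, the first general-purpose SAR multimodal foundation model, builds SAR-GEOVL-1M, the first large-scale SAR dataset with complete geographic projection properties, and designs a self-consistent iterative optimization mechanism, achieving leading performance on 11 downstream tasks.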

Details

Motivation: Existing cross-modal AI methods mainly target RGB imagery and lack effective modeling of synthetic aperture radar (SAR) imagery with its all-weather imaging capability, and geographic information has long been overlooked in remote sensing research. Method: Geographic information attributes are introduced to build the large-scale SAR multimodal dataset SAR-GEOVL-1M; aligned text is generated with a hierarchical cognitive chain-of-thought; and a Self-Consistent Iterative Optimization mechanism combines contrastive, matching, and reconstruction learning in a closed self-supervised loop on a transferable multimodal encoder. Result: A unified evaluation benchmark covers 11 representative vision and vision-language downstream tasks and compares 14 leading foundation models, with SAR-KnowLIP leading on tasks such as object counting and land-cover classification. Conclusion: SAR-KnowLIP's large-scale multimodal data, transferable model architecture, and comprehensive experimental benchmark are expected to significantly advance SAR multimodal foundation models. Abstract: Cross-modal artificial intelligence has garnered widespread attention in recent years, achieving significant progress in the study of natural images. However, existing methods are mostly designed for RGB imagery, leaving a significant gap in modeling synthetic aperture radar (SAR) imagery. SAR, with its all-day, all-weather imaging capabilities, plays an irreplaceable role in remote sensing scene understanding. To address this gap, this paper proposes SAR-KnowLIP, the first universal SAR multimodal foundational model, along with reusable data and evaluation baselines. Specifically: (1) This work introduces the critical yet long-overlooked attribute of geographic information into remote sensing research, constructing SAR-GEOVL-1M (the first large-scale SAR dataset with complete geographic projection properties), covering multiple satellite platforms, 120,000 images, and 135 cities. (2) Aligned structured text is generated through a hierarchical cognitive chain-of-thought (HCoT), providing more than one million multi-dimensional semantic annotations of landforms, regional functions, target attributes, and spatial relationships. (3) We design a Self-Consistent Iterative Optimization mechanism that continuously enhances cross-modal alignment through a self-supervised closed loop of contrastive, matching, and reconstruction learning on a transferable multimodal encoder.
(4) A unified evaluation benchmark is established across 11 representative downstream vision and vision-language tasks, with comparisons against 14 leading foundation models, where SAR-KnowLIP demonstrates leading performance, particularly in object counting and land-cover classification. We expect that SAR-KnowLIP's large-scale multimodal data, transferable model architecture, and comprehensive experimental benchmark will significantly advance the development of SAR multimodal baseline models.

[378] AutoPrune: Each Complexity Deserves a Pruning Policy

Hanshi Wang,Yuhao Xu,Zekun Xu,Jin Gao,Yufan Liu,Weiming Hu,Ke Wang,Zhipeng Zhang

Main category: cs.CV

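TL;DR: Proposes AutoPrune, a training-free adaptive pruning framework that dynamically adjusts the visual token pruning policy to sample and task complexity, sharply reducing computation while maintaining high accuracy.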

Details

Motivation: Visual tokens in existing vision-language models are redundant, but fixed or heuristic pruning strategies cannot adapt to the complexity of different inputs and tasks and align poorly with the model's overall reasoning process; inspired by human visual cognition, a more flexible pruning mechanism is needed. Method: The mutual information between visual and textual tokens is quantified and mapped to a budget-constrained logistic retention curve whose shape adapts to the complexity of each task, enabling dynamic, on-demand token retention and pruning. Result: On LLaVA-1.5-7B the method prunes 89% of visual tokens, cuts inference FLOPs by 76.8%, and retains 96.7% of the original accuracy on average, a 9.1% improvement over PDrop; it also performs well on Vision-Language-Action models for autonomous driving. Conclusion: AutoPrune is a plug-and-play, training-free, complexity-adaptive pruning method that balances efficiency and performance in vision-language models and applies to a range of tasks. Abstract: The established redundancy in visual tokens within large vision-language models allows pruning to effectively reduce their substantial computational demands. Previous methods typically employ heuristic layer-specific pruning strategies where, although the number of tokens removed may differ across decoder layers, the overall pruning schedule is fixed and applied uniformly to all input samples and tasks, failing to align token elimination with the model's holistic reasoning trajectory. Cognitive science indicates that human visual processing often begins with broad exploration to accumulate evidence before narrowing focus as the target becomes distinct. Our experiments reveal an analogous pattern in these models. This observation suggests that neither a fixed pruning schedule nor a heuristic layer-wise strategy can optimally accommodate the diverse complexities inherent in different inputs. To overcome this limitation, we introduce Complexity-Adaptive Pruning (AutoPrune), a training-free, plug-and-play framework that tailors pruning policies to varying sample and task complexities. Specifically, AutoPrune quantifies the mutual information between visual and textual tokens, then projects this signal to a budget-constrained logistic retention curve. Each such logistic curve, defined by its unique shape, corresponds to the specific complexity of different tasks and can guarantee adherence to predefined computational constraints. We evaluate AutoPrune on standard vision-language tasks and on Vision-Language-Action models for autonomous driving.
Notably, when applied to LLaVA-1.5-7B, our method prunes 89% of visual tokens and reduces inference FLOPs by 76.8% while retaining 96.7% of the original accuracy averaged over all tasks. This corresponds to a 9.1% improvement over the recent work PDrop, demonstrating the effectiveness. Code is available at https://github.com/AutoLab-SAI-SJTU/AutoPrune.
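To make the idea of a budget-constrained logistic retention curve concrete, here is a minimal sketch and not the AutoPrune implementation. It assumes the fraction of visual tokens kept at decoder layer `l` follows a decreasing logistic in depth, that the steepness is supplied by some complexity signal (the abstract derives it from mutual information; here it is just a parameter), and that the budget is the mean retention over layers. The function names and the bisection-based fitting are hypothetical.

```python
import math

def logistic_retention(layer, midpoint, steepness):
    """Fraction of visual tokens kept at `layer` (decreasing logistic, overflow-safe)."""
    z = steepness * (layer - midpoint)
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def fit_midpoint(num_layers, steepness, budget, iters=60):
    """Bisection on the midpoint so that mean retention over layers equals `budget`.

    Assumption: the compute budget is expressed as the average kept fraction.
    """
    lo, hi = -10.0 * num_layers, 10.0 * num_layers
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean_keep = sum(
            logistic_retention(l, mid, steepness) for l in range(num_layers)
        ) / num_layers
        if mean_keep < budget:
            lo = mid  # pruning starts too early; move the midpoint deeper
        else:
            hi = mid
    return 0.5 * (lo + hi)

def tokens_per_layer(num_tokens, num_layers, steepness, budget):
    """Integer token counts per layer under the fitted curve."""
    midpoint = fit_midpoint(num_layers, steepness, budget)
    return [
        round(num_tokens * logistic_retention(l, midpoint, steepness))
        for l in range(num_layers)
    ]
```

A steeper curve keeps almost all tokens in early layers and drops them abruptly, while a flatter one prunes gradually; under this sketch both spend the same average budget.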

[379] CrashSplat: 2D to 3D Vehicle Damage Segmentation in Gaussian Splatting

Dragoş-Andrei Chileban,Andrei-Ştefan Bulzan,Cosmin Cernǎzanu-Glǎvan

Main category: cs.CV

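TL;DR: Proposes a method based on 3D Gaussian Splatting (3D-GS) for single-view car damage detection and 3D segmentation, which lifts 2D masks into 3D and combines SfM with Z-buffering to localize damage precisely.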

Details

Motivation: Existing automatic car damage detection relies mostly on 2D image analysis and makes little use of 3D geometric information, while multi-view consistency methods fail when damage is visible in only a single view. A 3D method is therefore needed that works from a single view and can accurately reconstruct and segment damaged regions. Method: Proposes a learning-free single-view 3D-GS segmentation method: camera parameters are obtained via SfM, 3D Gaussians are projected onto the image plane, and a Z-buffering algorithm combined with a normal distribution model of depth and opacity filters and segments the damaged regions, lifting 2D masks to 3D damage. Result: Experiments show that the method performs well on small damage visible in only a single view (e.g., scratches, small dents), outperforming traditional methods that rely on multi-view consistency, with good geometric accuracy and robustness. Conclusion: The method effectively combines 3D Gaussian Splatting with geometric priors to achieve precise 3D car damage segmentation from a single view, offering the insurance industry a feasible and efficient solution for automated damage assessment. Abstract: Automatic car damage detection has been a topic of significant interest for the auto insurance industry as it promises faster, more accurate, and cost-effective damage assessments. However, few works have gone beyond 2D image analysis to leverage 3D reconstruction methods, which have the potential to provide a more comprehensive and geometrically accurate representation of the damage. Moreover, recent methods employing 3D representations for novel view synthesis, particularly 3D Gaussian Splatting (3D-GS), have demonstrated the ability to generate accurate and coherent 3D reconstructions from a limited number of views. In this work we introduce an automatic car damage detection pipeline that performs 3D damage segmentation by up-lifting 2D masks. Additionally, we propose a simple yet effective learning-free approach for single-view 3D-GS segmentation. Specifically, Gaussians are projected onto the image plane using camera parameters obtained via Structure from Motion (SfM). They are then filtered through an algorithm that utilizes Z-buffering along with a normal distribution model of depth and opacities. Through experiments we found that this method is particularly effective for challenging scenarios like car damage detection, where target objects (e.g., scratches, small dents) may only be clearly visible in a single view, making multi-view consistency approaches impractical or impossible. The code is publicly available at: https://github.com/DragosChileban/CrashSplat.
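The projection and Z-buffer filtering step can be illustrated with a simplified sketch. It is not the CrashSplat code: it treats each Gaussian as its center point already expressed in camera coordinates, uses an ideal pinhole camera, and replaces the paper's normal distribution model of depth and opacity with a plain depth tolerance. All names and the `depth_tol` parameter are hypothetical.

```python
def project(point, focal, cx, cy):
    """Pinhole projection of a camera-space point to integer pixel coordinates."""
    x, y, z = point
    return int(round(focal * x / z + cx)), int(round(focal * y / z + cy)), z

def lift_mask(points, mask, focal, cx, cy, depth_tol=0.05):
    """Indices of points that land on mask pixels and lie near the front-most surface.

    `mask` is a 2D list of 0/1 values indexed as mask[row][col].
    """
    height, width = len(mask), len(mask[0])
    zbuf = {}   # (u, v) -> smallest depth seen at that pixel
    hits = []   # (index, u, v, depth) for every point inside the image
    for idx, p in enumerate(points):
        if p[2] <= 0:
            continue  # behind the camera
        u, v, z = project(p, focal, cx, cy)
        if 0 <= u < width and 0 <= v < height:
            hits.append((idx, u, v, z))
            if z < zbuf.get((u, v), float("inf")):
                zbuf[(u, v)] = z
    # Keep masked points whose depth is within depth_tol of the Z-buffer value.
    return [
        idx for idx, u, v, z in hits
        if mask[v][u] and z - zbuf[(u, v)] <= depth_tol
    ]
```

Building the Z-buffer from all visible points, and not only the masked ones, is what rejects Gaussians that project onto the damage mask but sit behind the visible surface.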

[380] HunyuanImage 3.0 Technical Report

Siyu Cao,Hangting Chen,Peng Chen,Yiji Cheng,Yutao Cui,Xinchi Deng,Ying Dong,Kipper Gong,Tianpeng Gu,Xiusen Gu,Tiankai Hang,Duojun Huang,Jie Jiang,Zhengkai Jiang,Weijie Kong,Changlin Li,Donghao Li,Junzhe Li,Xin Li,Yang Li,Zhenxi Li,Zhimin Li,Jiaxin Lin,Linus,Lucaz Liu,Shu Liu,Songtao Liu,Yu Liu,Yuhong Liu,Yanxin Long,Fanbin Lu,Qinglin Lu,Yuyang Peng,Yuanbo Peng,Xiangwei Shen,Yixuan Shi,Jiale Tao,Yangyu Tao,Qi Tian,Pengfei Wan,Chunyu Wang,Kai Wang,Lei Wang,Linqing Wang,Lucas Wang,Qixun Wang,Weiyan Wang,Hao Wen,Bing Wu,Jianbing Wu,Yue Wu,Senhao Xie,Fang Yang,Miles Yang,Xiaofeng Yang,Xuan Yang,Zhantao Yang,Jingmiao Yu,Zheng Yuan,Chao Zhang,Jian-Wei Zhang,Peizhen Zhang,Shi-Xue Zhang,Tao Zhang,Weigang Zhang,Yepeng Zhang,Yingfang Zhang,Zihao Zhang,Zijian Zhang,Penghao Zhao,Zhiyuan Zhao,Xuefei Zhe,Jianchen Zhu,Zhao Zhong

Main category: cs.CV

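TL;DR: HunyuanImage 3.0 is a native multimodal model that unifies multimodal understanding and generation in an autoregressive framework. With 80 billion parameters (13 billion activated at inference), it is presented as the largest and most powerful open-source image generation model to date.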

Details

Motivation: To build a unified model that handles both multimodal understanding and generation tasks, and to advance the open-source multimodal ecosystem. Method: Combines meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive pre-training, aggressive post-training, and efficient infrastructure for large-scale training and inference. Result: Successfully trains an MoE model with over 80 billion parameters whose text-image alignment and visual quality match or rival current state-of-the-art models. Conclusion: HunyuanImage 3.0 is among the strongest open-source image generation models to date; its code and weights are publicly released to support multimodal AI research and applications. Abstract: We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments, and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0

[381] ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation

Shilan Zhang,Jirui Huang,Ruilin Yao,Cong Wang,Yaxiong Chen,Peng Xu,Shengwu Xiong

Main category: cs.CV

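TL;DR: Proposes ColLab, a collaborative spatial progressive data engine that automatically generates Referring Expression Comprehension (REC) and Referring Expression Generation (REG) data without human supervision, significantly improving annotation efficiency and expression quality.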

Details

Motivation: Existing REC and REG datasets depend on manual annotation, which is time-consuming and hard to scale, so a fully automatic and scalable data generation method is needed. Method: Proposes a Collaborative Multimodal Model Interaction (CMMI) strategy that uses multimodal large language models (MLLMs) and large language models (LLMs) to generate descriptions, and designs a Spatial Progressive Augmentation (SPA) module to improve spatial expressiveness among duplicate instances. Result: ColLab significantly speeds up REC and REG annotation while improving the quality and discriminability of the generated expressions; the framework was partially adopted in the ICCV 2025 MARS2 Challenge, increasing the diversity and difficulty of its dataset. Conclusion: ColLab achieves high-quality, fully automated REC and REG data generation, advancing multimodal reasoning tasks, with potential for practical use and large-scale expansion. Abstract: Referring Expression Comprehension (REC) and Referring Expression Generation (REG) are fundamental tasks in multimodal understanding, supporting precise object localization through natural language. However, existing REC and REG datasets rely heavily on manual annotation, which is labor-intensive and difficult to scale. In this paper, we propose ColLab, a collaborative spatial progressive data engine that enables fully automated REC and REG data generation without human supervision. Specifically, our method introduces a Collaborative Multimodal Model Interaction (CMMI) strategy, which leverages the semantic understanding of multimodal large language models (MLLMs) and large language models (LLMs) to generate descriptions. Furthermore, we design a module termed Spatial Progressive Augmentation (SPA) to enhance spatial expressiveness among duplicate instances. Experiments demonstrate that ColLab significantly accelerates the annotation process of REC and REG while improving the quality and discriminability of the generated expressions. In addition to the core methodological contribution, our framework was partially adopted in the data generation pipeline of the ICCV 2025 MARS2 Challenge on Multimodal Reasoning, enriching the dataset with diverse and challenging samples that better reflect real-world reasoning demands.

[382] Reinforcement Learning with Inverse Rewards for World Model Post-training

Yang Ye,Tianyu He,Shuo Yang,Jiang Bian

Main category: cs.CV

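TL;DR: Proposes RLIR, a post-training framework that uses an inverse dynamics model to recover input actions from generated videos, providing verifiable reward signals that improve the action-following ability of video world models.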

Details

Motivation: Existing video world models are still weak at accurately modeling human-specified actions, and conventional reinforcement learning methods are unsuitable for world model post-training because annotation is costly and rule-based video verifiers are hard to build. Method: Proposes Reinforcement Learning with Inverse Rewards (RLIR), which uses an inverse dynamics model to map high-dimensional video to a low-dimensional action space, recovers the input actions to produce a verifiable reward signal, and optimizes with Group Relative Policy Optimization. Result: Experiments on autoregressive and diffusion models show 5-10% gains in action-following, up to 10% improvement in visual quality, and higher human preference scores. Conclusion: RLIR is the first post-training method designed specifically to strengthen action-following in video world models, effectively addressing unverifiable reward signals and costly annotation. Abstract: World models simulate dynamic environments, enabling agents to interact with diverse input modalities. Although recent advances have improved the visual quality and temporal consistency of video world models, their ability to accurately model human-specified actions remains under-explored. Reinforcement learning presents a promising approach for directly improving the suboptimal action-following capability of pre-trained models, assuming that an appropriate reward function can be defined. However, transferring reinforcement learning post-training methods to world models is impractical due to the prohibitive cost of large-scale preference annotations and the infeasibility of constructing rule-based video verifiers. To address this gap, we propose Reinforcement Learning with Inverse Rewards (RLIR), a post-training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model. By mapping high-dimensional video modality to a low-dimensional action space, RLIR provides an objective and verifiable reward for optimization via Group Relative Policy Optimization. Experiments across autoregressive and diffusion paradigms demonstrate 5-10% gains in action-following, up to 10% improvements in visual quality, and higher human preference scores, establishing RLIR as the first post-training method specifically designed to enhance action-following in video world models.
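The reward construction can be sketched in a few lines, under stated assumptions. The inverse dynamics model is treated as an external component whose output (a recovered action per frame) is already available; the reward here is simply the fraction of recovered actions that match the conditioning actions, which is one plausible choice and not necessarily the paper's. The group-relative normalization follows the usual GRPO recipe of standardizing rewards within a group of samples drawn for the same input.

```python
import statistics

def action_reward(input_actions, recovered_actions):
    """Fraction of frames whose recovered action equals the conditioning action."""
    if len(input_actions) != len(recovered_actions):
        raise ValueError("action sequences must have equal length")
    matches = sum(a == b for a, b in zip(input_actions, recovered_actions))
    return matches / len(input_actions)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts (GRPO-style advantages)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: three generated videos for the same action sequence.
conditioning = ["left", "left", "forward", "jump"]
recovered = [
    ["left", "left", "forward", "jump"],     # follows every action
    ["left", "right", "forward", "jump"],    # one mismatch
    ["right", "right", "forward", "left"],   # three mismatches
]
rewards = [action_reward(conditioning, r) for r in recovered]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed by comparing discrete actions, it is checkable without preference labels, which is the property the abstract calls verifiable.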

[383] A Novel Hybrid Deep Learning and Chaotic Dynamics Approach for Thyroid Cancer Classification

Nada Bouchekout,Abdelkrim Boukabou,Morad Grimes,Yassine Habchi,Yassine Himeur,Hamzah Ali Alkhazaleh,Shadi Atalla,Wathiq Mansoor

Main category: cs.CV

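TL;DR: Proposes a thyroid ultrasound image classification method that combines an adaptive CNN with a wavelet-chaos system, reaching 98.17% accuracy and 0.9912 AUC on the DDTI dataset, outperforming mainstream models while offering good interpretability and cross-dataset robustness.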

Details

Motivation: With thyroid cancer incidence rising, ultrasound diagnosis needs to become more accurate and timely, and existing deep learning methods still fall short in feature representation and generalization. Method: Proposes an adaptive CNN that combines the CDF9/7 wavelet transform with detail coefficients modulated by an n-scroll chaotic system, using chaos to enrich wavelet features and improve discrimination, and verifies the regions the model attends to with explainability methods such as Grad-CAM and SHAP. Result: On the DDTI dataset the method achieves 98.17% accuracy, 98.76% sensitivity, 97.58% specificity, 97.55% F1-score, and 0.9912 AUC, outperforming strong models such as EfficientNetV2-S; an ablation shows chaotic modulation raises accuracy by 8.79 percentage points; cross-dataset tests (TCIA, ISIC) are robust, with 28.7 ms inference per image and 1,125 MB peak VRAM. Conclusion: The wavelet-chaos-CNN framework shows excellent performance, strong generalization, and efficient inference for thyroid ultrasound classification, with potential for clinical use. Abstract: Timely and accurate diagnosis is crucial in addressing the global rise in thyroid cancer, ensuring effective treatment strategies and improved patient outcomes. We present an intelligent classification method that couples an Adaptive Convolutional Neural Network (CNN) with Cohen-Daubechies-Feauveau (CDF9/7) wavelets whose detail coefficients are modulated by an n-scroll chaotic system to enrich discriminative features. We evaluate on the public DDTI thyroid ultrasound dataset (n = 1,638 images; 819 malignant / 819 benign) using 5-fold cross-validation, where the proposed method attains 98.17% accuracy, 98.76% sensitivity, 97.58% specificity, 97.55% F1-score, and an AUC of 0.9912. A controlled ablation shows that adding chaotic modulation to CDF9/7 improves accuracy by +8.79 percentage points over a CDF9/7-only CNN (from 89.38% to 98.17%). To objectively position our approach, we trained state-of-the-art backbones on the same data and splits: EfficientNetV2-S (96.58% accuracy; AUC 0.987), Swin-T (96.41%; 0.986), ViT-B/16 (95.72%; 0.983), and ConvNeXt-T (96.94%; 0.987). Our method outperforms the best of these by +1.23 points in accuracy and +0.0042 in AUC, while remaining computationally efficient (28.7 ms per image; 1,125 MB peak VRAM). Robustness is further supported by cross-dataset testing on TCIA (accuracy 95.82%) and transfer to an ISIC skin-lesion subset (n = 28 unique images, augmented to 2,048; accuracy 97.31%). Explainability analyses (Grad-CAM, SHAP, LIME) highlight clinically relevant regions. Altogether, the wavelet-chaos-CNN pipeline delivers state-of-the-art thyroid ultrasound classification with strong generalization and practical runtime characteristics suitable for clinical integration.

[384] VFSI: Validity First Spatial Intelligence for Constraint-Guided Traffic Diffusion

Kargi Chauhan,Leilani H. Gilpin

Main category: cs.CV

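TL;DR: Proposes Validity-First Spatial Intelligence (VFSI), which explicitly enforces physical constraints through energy-based guidance during diffusion sampling, markedly improving physical validity in traffic simulation while also improving the realism of generated trajectories.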

Details

Motivation: Traffic trajectories generated by existing diffusion models often violate basic physical laws (collisions, driving off the road, and so on), and there is no mechanism that guarantees physical validity. Method: Introduces an energy-based guidance mechanism that incorporates physical constraints such as collision avoidance and kinematics as energy functions during diffusion sampling, steering the denoising process toward physically valid trajectories without retraining the model. Result: Across 200 urban scenarios from the Waymo Open Motion Dataset, VFSI lowers the collision rate from 24.6% to 8.1%, raises overall physical validity from 50.3% to 94.2%, and improves ADE from 1.34 m to 1.21 m. Conclusion: Explicit constraint enforcement at inference time is both necessary and sufficient for physically valid traffic simulation, and VFSI offers a general, model-agnostic solution. Abstract: Modern diffusion models generate realistic traffic simulations but systematically violate physical constraints. In a large-scale evaluation of SceneDiffuser++, a state-of-the-art traffic simulator, we find that 50% of generated trajectories violate basic physical laws - vehicles collide, drive off roads, and spawn inside buildings. This reveals a fundamental limitation: current models treat physical validity as an emergent property rather than an architectural requirement. We propose Validity-First Spatial Intelligence (VFSI), which enforces constraints through energy-based guidance during diffusion sampling, without model retraining. By incorporating collision avoidance and kinematic constraints as energy functions, we guide the denoising process toward physically valid trajectories. Across 200 urban scenarios from the Waymo Open Motion Dataset, VFSI reduces collision rates by 67% (24.6% to 8.1%) and improves overall validity by 87% (50.3% to 94.2%), while simultaneously improving realism metrics (ADE: 1.34m to 1.21m). Our model-agnostic approach demonstrates that explicit constraint enforcement during inference is both necessary and sufficient for physically valid traffic simulation.
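As an illustration of energy-based guidance, the sketch below defines a simple pairwise collision energy over 2D agent positions and applies one gradient correction of the kind that could be interleaved with denoising steps. It is an assumption-laden toy: the denoiser is omitted, the hinge-squared energy and the `safe_dist` and `step_size` parameters are hypothetical, and kinematic constraints are not modeled.

```python
import math

def collision_energy(positions, safe_dist):
    """Sum of squared violations of the minimum separation between agent pairs."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            gap = safe_dist - math.dist(positions[i], positions[j])
            energy += max(0.0, gap) ** 2
    return energy

def energy_gradient(positions, safe_dist):
    """Analytic gradient of collision_energy with respect to each position."""
    grads = [[0.0, 0.0] for _ in positions]
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            d = math.hypot(dx, dy)
            gap = safe_dist - d
            if gap > 0 and d > 0:
                coef = -2.0 * gap / d  # d/dx_i of (safe_dist - d)^2
                grads[i][0] += coef * dx
                grads[i][1] += coef * dy
                grads[j][0] -= coef * dx
                grads[j][1] -= coef * dy
    return grads

def guided_step(positions, safe_dist, step_size):
    """One guidance correction: move positions down the energy gradient."""
    grads = energy_gradient(positions, safe_dist)
    return [
        (x - step_size * gx, y - step_size * gy)
        for (x, y), (gx, gy) in zip(positions, grads)
    ]
```

For two agents 1 m apart with a 2 m safe distance and a step size of 0.1, one correction pushes them to about 1.4 m apart, lowering the energy from 1.0 to roughly 0.36. The percentages in the abstract are relative changes: (24.6 - 8.1) / 24.6 is about 67%, and (94.2 - 50.3) / 50.3 is about 87%.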

[385] Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution

Jinpei Guo,Yifei Ji,Zheng Chen,Yufei Wang,Sizhuo Ma,Yong Guo,Yulun Zhang,Jian Wang

Main category: cs.CV

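TL;DR: Proposes OASIS, a one-step diffusion model for real-world video super-resolution (VSR) that uses attention specialization routing to reduce redundancy and improve performance. Combined with a progressive training strategy, it achieves state-of-the-art results on synthetic and real-world datasets with significantly faster inference.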

Details

Motivation: Because low-quality videos already contain substantial content information, applying generative diffusion models directly to video super-resolution introduces redundancy, increasing computational overhead and the learning burden. A more efficient way to adapt them to VSR is needed. Method: Proposes OASIS, which introduces an attention specialization routing mechanism that assigns attention heads to different patterns according to their intrinsic behaviors, reducing redundancy while preserving pretrained knowledge; it also designs a progressive training strategy that starts with temporally consistent degradations and gradually moves to inconsistent ones. Result: OASIS reaches state-of-the-art performance on multiple synthetic and real-world datasets and is 6.2 times faster at inference than existing one-step diffusion models such as SeedVR2. Conclusion: Through attention specialization and progressive training, OASIS effectively addresses the redundancy of diffusion models in video super-resolution, combining high performance with high efficiency and offering a practical solution for real applications. Abstract: Diffusion models have recently shown promising results for video super-resolution (VSR). However, directly adapting generative diffusion models to VSR can result in redundancy, since low-quality videos already preserve substantial content information. Such redundancy leads to increased computational overhead and learning burden, as the model performs superfluous operations and must learn to filter out irrelevant information. To address this problem, we propose OASIS, an efficient $\textbf{o}$ne-step diffusion model with $\textbf{a}$ttention $\textbf{s}$pecialization for real-world v$\textbf{i}$deo $\textbf{s}$uper-resolution. OASIS incorporates an attention specialization routing that assigns attention heads to different patterns according to their intrinsic behaviors. This routing mitigates redundancy while effectively preserving pretrained knowledge, allowing diffusion models to better adapt to VSR and achieve stronger performance. Moreover, we propose a simple yet effective progressive training strategy, which starts with temporally consistent degradations and then shifts to inconsistent settings. This strategy facilitates learning under complex degradations. Extensive experiments demonstrate that OASIS achieves state-of-the-art performance on both synthetic and real-world datasets. OASIS also provides superior inference speed, offering a $\mathbf{6.2\times}$ speedup over one-step diffusion baselines such as SeedVR2. The code will be available at https://github.com/jp-guo/OASIS.

[386] RPG360: Robust 360 Depth Estimation with Perspective Foundation Models and Graph Optimization

Dongki Jung,Jaehoon Choi,Yonghan Lee,Dinesh Manocha

Main category: cs.CV

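TL;DR: Proposes RPG360, a training-free, robust 360-degree monocular depth estimation method that uses perspective foundation models and graph optimization to resolve scale inconsistency in panoramic depth estimation.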

Details

Motivation: Depth estimation for 360-degree images is challenging because large-scale labeled datasets are lacking, and existing methods do not maintain consistent depth scale across faces. Method: Converts the 360-degree image into a six-face cubemap, estimates depth and normals for each face with a perspective foundation model, and enforces global consistency through a graph-optimization-based depth scale alignment that introduces a per-face scale parameter. Result: Achieves superior performance on datasets including Matterport3D, Stanford2D3D, and 360Loc, and notably improves downstream tasks such as feature matching and Structure from Motion (AUC@5 gains of 0.2 ~ 9.7%). Conclusion: RPG360 delivers accurate, scale-consistent 360-degree depth estimation without training, with good generalization and application potential. Abstract: The increasing use of 360 images across various domains has emphasized the need for robust depth estimation techniques tailored for omnidirectional images. However, obtaining large-scale labeled datasets for 360 depth estimation remains a significant challenge. In this paper, we propose RPG360, a training-free robust 360 monocular depth estimation method that leverages perspective foundation models and graph optimization. Our approach converts 360 images into six-face cubemap representations, where a perspective foundation model is employed to estimate depth and surface normals. To address depth scale inconsistencies across different faces of the cubemap, we introduce a novel depth scale alignment technique using graph-based optimization, which parameterizes the predicted depth and normal maps while incorporating an additional per-face scale parameter. This optimization ensures depth scale consistency across the six-face cubemap while preserving 3D structural integrity. Furthermore, as foundation models exhibit inherent robustness in zero-shot settings, our method achieves superior performance across diverse datasets, including Matterport3D, Stanford2D3D, and 360Loc. We also demonstrate the versatility of our depth estimation approach by validating its benefits in downstream tasks such as feature matching (3.2 ~ 5.4%) and Structure from Motion (0.2 ~ 9.7%) in AUC@5.
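The per-face scale alignment can be illustrated with a much-reduced sketch. It keeps only one scalar scale per cubemap face and asks that scaled depths agree at the seams between adjacent faces; the paper's optimization also parameterizes the depth and normal maps, which is omitted here. Working in log space turns the problem into a linear least-squares system on a graph, solved below with Gauss-Seidel sweeps. The `seams` input format and function name are hypothetical.

```python
import math

def align_face_scales(num_faces, seams, iters=200):
    """Estimate one multiplicative scale per face so seam depths agree.

    seams: list of (i, j, depth_i, depth_j), the depths predicted by faces
    i and j at the same physical point on their shared border.
    Face 0 is fixed at scale 1 to remove the global scale ambiguity.
    """
    log_s = [0.0] * num_faces
    neighbors = {f: [] for f in range(num_faces)}
    for i, j, depth_i, depth_j in seams:
        # Want s_i * depth_i == s_j * depth_j, i.e. log_s[i] - log_s[j] == r.
        r = math.log(depth_j) - math.log(depth_i)
        neighbors[i].append((j, r))
        neighbors[j].append((i, -r))
    for _ in range(iters):
        for f in range(1, num_faces):
            if neighbors[f]:
                # Least-squares update: average of what each neighbor implies.
                log_s[f] = sum(log_s[n] + r for n, r in neighbors[f]) / len(neighbors[f])
    return [math.exp(v) for v in log_s]
```

With noisy seam measurements the same update converges to the least-squares compromise across all edges, which is the reason to treat the faces as a graph instead of chaining pairwise ratios.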

[387] Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning

Muleilan Pei,Shaoshuai Shi,Shaojie Shen

Main category: cs.CV

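TL;DR: Proposes SMART-R1, an R1-style reinforcement fine-tuning paradigm that uses an iterative SFT-RFT-SFT strategy to improve the realism and performance of multi-agent traffic behavior simulation, reaching state-of-the-art results on the Waymo Open Motion Dataset.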

Details

Motivation: Existing data-driven simulators rely on supervised learning and struggle with the distribution shift between training and testing, which leads to poor generalization in new environments. Method: Proposes SMART-R1, which uses a metric-oriented policy optimization algorithm and an iterative SFT-RFT-SFT training framework that combines supervised fine-tuning and reinforcement fine-tuning to improve the alignment of behavior distributions. Result: Validated on the Waymo Open Motion Dataset; in the WOSAC challenge it achieves a realism meta score of 0.7858, ranking first. Conclusion: SMART-R1 notably improves the realism of agent behavior and model generalization in traffic simulation, offering an efficient and scalable paradigm for autonomous driving simulation. Abstract: Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative "SFT-RFT-SFT" training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1-style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.

[388] TREAT-Net: Tabular-Referenced Echocardiography Analysis for Acute Coronary Syndrome Treatment Prediction

Diane Kim,Minh Nguyen Nhat To,Sherif Abdalla,Teresa S. M. Tsang,Purang Abolmaesumi,Christina Luong

Main category: cs.CV

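TL;DR: Proposes TREAT-Net, a multimodal deep learning framework that uses echocardiography videos and structured clinical records for non-invasive treatment prediction in Acute Coronary Syndrome (ACS).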

Details

Motivation: Coronary angiography is the gold standard for diagnosing ACS, but it is resource-intensive and invasive, which can increase patient risk and delay diagnosis. A non-invasive, efficient alternative is needed for timely treatment decisions. Method: TREAT-Net uses tabular-guided cross-attention to enhance video understanding and a late fusion mechanism to integrate multimodal information. The model is trained on more than 9000 ACS cases, combining echocardiography videos and clinical records to predict treatment intervention. Result: The model reaches a balanced accuracy of 67.6% and an AUROC of 71.1%, outperforming unimodal and non-fused baselines; cross-modality agreement analysis shows 88.6% accuracy for intervention prediction. Conclusion: TREAT-Net has potential as a non-invasive tool for rapid triage of ACS patients, especially in underserved areas with limited access to coronary angiography. Abstract: Coronary angiography remains the gold standard for diagnosing Acute Coronary Syndrome (ACS). However, its resource-intensive and invasive nature can expose patients to procedural risks and diagnostic delays, leading to postponed treatment initiation. In this work, we introduce TREAT-Net, a multimodal deep learning framework for ACS treatment prediction that leverages non-invasive modalities, including echocardiography videos and structured clinical records. TREAT-Net integrates tabular-guided cross-attention to enhance video interpretation, along with a late fusion mechanism to align predictions across modalities. Trained on a dataset of over 9000 ACS cases, the model outperforms unimodal and non-fused baselines, achieving a balanced accuracy of 67.6% and an AUROC of 71.1%. Cross-modality agreement analysis demonstrates 88.6% accuracy for intervention prediction. These findings highlight the potential of TREAT-Net as a non-invasive tool for timely and accurate patient triage, particularly in underserved populations with limited access to coronary angiography.

[389] Gaze Estimation for Human-Robot Interaction: Analysis Using the NICO Platform

Matej Palider,Omar Eldardeer,Viktor Kocur

Main category: cs.CV

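TL;DR: Evaluates existing gaze estimation methods for human-robot interaction (HRI) in a shared workspace scenario, using a new annotated dataset collected with the NICO robotic platform and four state-of-the-art gaze estimation models. Angular errors are close to those reported on general-purpose benchmarks, but the median distance error in the shared workspace reaches 16.48 cm, exposing the practical limitations of current methods. The paper discusses these limitations and gives recommendations for integrating gaze estimation into HRI systems.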

Details

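Motivation: To assess how well current gaze estimation methods actually perform in a real HRI shared workspace scenario, and to close the gap between general-purpose benchmarks and practical use. Method: Introduces a new annotated dataset collected with the NICO robotic platform and evaluates four state-of-the-art gaze estimation models, using angular error and distance error in the shared workspace as metrics. Result: Angular errors are close to those on general-purpose benchmarks, but in the actual shared workspace the best model's median distance error is 16.48 cm, a substantial localization offset. Conclusion: Current gaze estimation methods have clear limitations in real HRI scenarios, especially in spatial localization accuracy; integrating them into HRI systems should draw on contextual information and multimodal fusion to improve practicality. Abstract: This paper evaluates the current gaze estimation methods within an HRI context of a shared workspace scenario. We introduce a new, annotated dataset collected with the NICO robotic platform. We evaluate four state-of-the-art gaze estimation models. The evaluation shows that the angular errors are close to those reported on general-purpose benchmarks. However, when expressed in terms of distance in the shared workspace, the best median error is 16.48 cm, quantifying the practical limitations of current methods. We conclude by discussing these limitations and offering recommendations on how to best integrate gaze estimation as a modality in HRI systems.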
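The gap between angular error and workspace distance error follows from simple geometry, which the sketch below makes explicit. It assumes the shared workspace is a horizontal table plane and that eye position and gaze direction are given in the same coordinate frame; the function names and the plane parameterization are illustrative and not taken from the paper.

```python
import math

def gaze_point_on_table(eye, direction, table_z=0.0):
    """Intersect a gaze ray with the horizontal plane z = table_z."""
    ex, ey, ez = eye
    dx, dy, dz = direction
    if dz == 0:
        raise ValueError("gaze ray is parallel to the table")
    t = (table_z - ez) / dz
    if t <= 0:
        raise ValueError("gaze ray points away from the table")
    return (ex + t * dx, ey + t * dy)

def angular_error_deg(a, b):
    """Angle in degrees between two 3D direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def workspace_error(eye, true_dir, pred_dir, table_z=0.0):
    """Distance on the table between the true and predicted gaze targets."""
    return math.dist(
        gaze_point_on_table(eye, true_dir, table_z),
        gaze_point_on_table(eye, pred_dir, table_z),
    )
```

For a fixed angular error, the distance error grows with the eye-to-table distance and with how obliquely the gaze meets the table, so a few degrees of error can translate into many centimeters on the workspace.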

[390] SIE3D: Single-image Expressive 3D Avatar generation via Semantic Embedding and Perceptual Expression Loss

Zhiqi Huang,Dulongkai Cui,Jinglu Hu

Main category: cs.CV

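TL;DR: Proposes SIE3D, a new framework that generates high-fidelity 3D head avatars with controllable expressions from a single image and descriptive text. It fuses identity features from the image with semantic embeddings from the text and introduces a perceptual expression loss based on a pre-trained expression classifier, significantly improving expression accuracy and identity preservation.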

Details

Motivation: Existing methods fall short in fine-grained, intuitive text-based control of 3D avatar expressions and struggle to keep generated expressions consistent with the text description. Method: SIE3D uses a novel conditioning scheme that fuses identity features from the image with semantic embeddings from the text, and designs a perceptual expression loss that uses a pre-trained expression classifier to constrain generation so that expressions match the text. Result: Experiments show SIE3D outperforms existing methods in expression controllability, realism, identity preservation, and expression fidelity, and runs efficiently on a single consumer-grade GPU. Conclusion: SIE3D achieves precisely controlled generation of high-fidelity expressive 3D avatars from a single image and text, effectively combining multimodal input with a perceptual loss and balancing expression accuracy with visual quality. Abstract: Generating high-fidelity 3D head avatars from a single image is challenging, as current methods lack fine-grained, intuitive control over expressions via text. This paper proposes SIE3D, a framework that generates expressive 3D avatars from a single image and descriptive text. SIE3D fuses identity features from the image with semantic embedding from text through a novel conditioning scheme, enabling detailed control. To ensure generated expressions accurately match the text, it introduces an innovative perceptual expression loss function. This loss uses a pre-trained expression classifier to regularize the generation process, guaranteeing expression accuracy. Extensive experiments show SIE3D significantly improves controllability and realism, outperforming competitive methods in identity preservation and expression fidelity on a single consumer-grade GPU. Project page: https://blazingcrystal1747.github.io/SIE3D/

[391] FrameMind: Frame-Interleaved Chain-of-Thought for Video Reasoning via Reinforcement Learning

Haonan Ge,Yiwei Wang,Kai-Wei Chang,Hang Wu,Yujun Cai

Main category: cs.CV

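TL;DR: Proposes FrameMind, an end-to-end framework trained with reinforcement learning that uses dynamic frame sampling and Frame-Interleaved Chain-of-Thought (FiCOT) reasoning to acquire visual information adaptively during video understanding.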

Details

Motivation: Existing video understanding models use fixed frame sampling strategies and cannot flexibly adjust the spatio-temporal information they gather to the needs of each question, which limits performance on complex tasks. Method: Proposes the FrameMind framework, which combines Frame-Interleaved Chain-of-Thought (FiCOT) with reinforcement learning so that the model actively requests key frames or clips during reasoning; introduces Dynamic Resolution Frame Sampling (DRFS) and the DRFS-GRPO algorithm for training without frame-level annotations. Result: Significantly outperforms existing methods on benchmarks such as MLVU and VideoMME, enabling more flexible and efficient video understanding. Conclusion: By alternating dynamic perception and reasoning, FrameMind improves the model's ability to adapt to spatio-temporal information and advances video understanding. Abstract: Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require either broad temporal coverage or fine-grained spatial detail. In this paper, we introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract targeted frames or video clips based on identified knowledge gaps. To train effective dynamic sampling policies, we propose Dynamic Resolution Frame Sampling (DRFS), which exposes models to diverse temporal-spatial trade-offs during learning, and DRFS-GRPO, a group-relative policy optimization algorithm that learns from outcome-based rewards without requiring frame-level annotations. Extensive experiments on challenging benchmarks like MLVU and VideoMME demonstrate that our method significantly outperforms existing models, advancing the state of the art in flexible and efficient video understanding.

[392] Generalized Category Discovery in Hyperspectral Images via Prototype Subspace Modeling

Xianlu Li,Nicolas Nadisic,Shaoguang Huang,Aleksandra Pizurica

Main category: cs.CV

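TL;DR: Proposes the first generalized category discovery (GCD) framework for hyperspectral images (HSI), using prototype subspace modeling to better capture class structure. Unlike existing methods that use a single prototype vector, it represents each class by a subspace spanned by a set of basis vectors, which is more expressive and discriminative in high-dimensional feature space. Basis orthogonality and reconstruction constraints guide learning, and the method significantly outperforms existing GCD methods on real HSI data.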

Details

Motivation: Existing GCD methods are designed mainly for RGB images and adapt poorly to the high dimensionality and complex spectral structure of hyperspectral images (HSI), so a GCD framework tailored to HSI is needed. Method: Proposes prototype subspace modeling, in which each class is represented by a subspace spanned by a set of basis vectors, with two key constraints: a basis orthogonality constraint that strengthens inter-class separability, and a reconstruction constraint that ensures the bases can effectively reconstruct samples of their class. Result: Experiments on real hyperspectral image data show that the method significantly outperforms current state-of-the-art GCD methods. Conclusion: The method lays a solid foundation for generalized category discovery in hyperspectral images and demonstrates the effectiveness of subspace modeling for category discovery in high-dimensional data. Abstract: Generalized category discovery (GCD) seeks to jointly identify both known and novel categories in unlabeled data. While prior works have mainly focused on RGB images, their assumptions and modeling strategies do not generalize well to hyperspectral images (HSI), which are inherently high-dimensional and exhibit complex spectral structures. In this paper, we propose the first GCD framework tailored for HSI, introducing a prototype subspace model to better capture class structure. Instead of learning a single prototype vector for each category as in existing methods such as SimGCD, we model each category using a set of basis vectors, forming a subspace representation that enables greater expressiveness and discrimination in a high-dimensional feature space. To guide the learning of such bases, we enforce two key constraints: (1) a basis orthogonality constraint that promotes inter-class separability, and (2) a reconstruction constraint that ensures each prototype basis can effectively reconstruct its corresponding class samples. Experimental results on real-world HSI demonstrate that our method significantly outperforms state-of-the-art GCD methods, establishing a strong foundation for generalized category discovery in hyperspectral settings.
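A small sketch can show what representing a class by a subspace buys over a single prototype vector. It assumes each class basis is orthonormal, scores a sample by its squared reconstruction residual after projection onto each class subspace, and measures cross-class overlap as the sum of squared inner products between bases of different classes. Reading the orthogonality constraint as a cross-class penalty is one plausible interpretation of the abstract and is an assumption here, as are all function names.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reconstruction_residual(x, basis):
    """Squared distance from x to span(basis); basis vectors assumed orthonormal."""
    recon = [0.0] * len(x)
    for b in basis:
        coeff = dot(x, b)
        recon = [r + coeff * bi for r, bi in zip(recon, b)]
    return sum((xi - ri) ** 2 for xi, ri in zip(x, recon))

def cross_class_overlap(basis_a, basis_b):
    """Penalty that is zero when the two class subspaces are mutually orthogonal."""
    return sum(dot(a, b) ** 2 for a in basis_a for b in basis_b)

def assign(x, subspaces):
    """Index of the class subspace that reconstructs x with the smallest residual."""
    return min(range(len(subspaces)), key=lambda k: reconstruction_residual(x, subspaces[k]))

# Hypothetical toy example in 3D: class 0 spans the xy-plane, class 1 the z-axis.
subspaces = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 1.0)],
]
```

A two-vector basis lets class 0 cover samples that vary along two directions, which a single prototype vector could only approximate by their mean.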

[393] Hazy Pedestrian Trajectory Prediction via Physical Priors and Graph-Mamba

Jian Chen,Zhuoran Zheng,Han Hu,Guijuan Zhang,Dianjie Lu,Liang Li,Chen Lyu

Main category: cs.CV

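TL;DR: Proposes a deep learning model that combines physical priors of atmospheric scattering with topological modeling of pedestrian relationships for pedestrian trajectory prediction in haze, significantly improving prediction accuracy and inference speed.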

Details

Motivation: To address physical information degradation and ineffective pedestrian interaction modeling in pedestrian trajectory prediction under hazy conditions. Method: Builds a differentiable atmospheric scattering model to decouple haze concentration from light degradation; designs an adaptive scanning state space model (a Mamba variant) for feature extraction; develops a heterogeneous graph attention network with a spatio-temporal fusion module to model multi-granularity pedestrian interactions. Result: In dense haze scenarios (visibility < 30 m), minADE and minFDE drop by 37.2% and 41.5% respectively compared with SOTA models; the adaptive Mamba variant is 78% faster at inference. Conclusion: By fusing physical priors with topological modeling, the method markedly improves pedestrian trajectory prediction in adverse weather and offers a new paradigm for reliable perception by intelligent transportation systems in complex environments. Abstract: To address the issues of physical information degradation and ineffective pedestrian interaction modeling in pedestrian trajectory prediction under hazy weather conditions, we propose a deep learning model that combines physical priors of atmospheric scattering with topological modeling of pedestrian relationships. Specifically, we first construct a differentiable atmospheric scattering model that decouples haze concentration from light degradation through a network with physical parameter estimation, enabling the learning of haze-mitigated feature representations. Second, we design an adaptive scanning state space model for feature extraction. Our adaptive Mamba variant achieves a 78% inference speed increase over native Mamba while preserving long-range dependency modeling. Finally, to efficiently model pedestrian relationships, we develop a heterogeneous graph attention network, using graph matrices to model multi-granularity interactions between pedestrians and groups, combined with a spatio-temporal fusion module to capture the collaborative evolution patterns of pedestrian movements. Furthermore, we constructed a new pedestrian trajectory prediction dataset based on ETH/UCY to evaluate the effectiveness of the proposed method. Experiments show that our method reduces the minADE / minFDE metrics by 37.2% and 41.5%, respectively, compared to the SOTA models in dense haze scenarios (visibility < 30m), providing a new modeling paradigm for reliable perception in intelligent transportation systems in adverse environments.
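For readers unfamiliar with the physical prior, the widely used atmospheric scattering model writes an observed hazy intensity as I = J·t + A·(1 − t), with transmission t = exp(−β·d), where J is the clear-scene intensity, A the airlight, β the scattering coefficient, and d the scene depth. The sketch below applies and inverts this model for a single pixel value. Whether the paper uses exactly this form is an assumption, and in the paper the parameters are estimated by a network, whereas here they are simply given.

```python
import math

def transmission(depth, beta):
    """Fraction of scene light that survives scattering over `depth`."""
    return math.exp(-beta * depth)

def add_haze(clear, depth, beta, airlight):
    """Forward model: I = J * t + A * (1 - t)."""
    t = transmission(depth, beta)
    return clear * t + airlight * (1.0 - t)

def remove_haze(hazy, depth, beta, airlight, t_min=0.05):
    """Inverse model: J = (I - A) / max(t, t_min) + A.

    The floor t_min avoids amplifying noise where almost no scene light remains;
    the inversion is exact only when the true transmission is above the floor.
    """
    t = max(transmission(depth, beta), t_min)
    return (hazy - airlight) / t + airlight
```

Every operation is a smooth function of β and A (apart from the clamp), which is what makes such a model usable as a differentiable layer whose parameters are learned jointly with the rest of the network.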

[394] $\mathbf{R}^3$: Reconstruction, Raw, and Rain: Deraining Directly in the Bayer Domain

Nate Rothschild,Moshe Kimhi,Avi Mendelson,Chaim Baskin

Main category: cs.CV

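TL;DR: Proposes performing image deraining and reconstruction in the raw Bayer domain instead of the sRGB domain, avoiding the information loss introduced by the ISP and achieving better restoration.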

Details

Motivation: Conventional image reconstruction methods mostly operate on sRGB images that have passed through the ISP, a process that irreversibly loses color, dynamic range, and detail. The paper aims to avoid these losses and improve reconstruction quality by learning directly on raw Bayer data. Method: Using deraining as the test case, builds and evaluates sRGB-domain and Bayer-domain reconstruction pipelines; releases Raw-Rain, the first public dataset of real rainy scenes in 12-bit Bayer with bit-depth-matched sRGB images; proposes the Information Conservation Score (ICS), a color-invariant metric that agrees better with human perception. Result: On the test set, the raw-domain model improves on the sRGB pipeline by up to 0.99 dB PSNR and 1.2% ICS while running faster with half the GFLOPs. Conclusion: The results support a new paradigm of reconstructing first and applying the ISP last, moving low-level vision toward end-to-end learnable camera pipelines. Abstract: Image reconstruction from corrupted images is crucial across many domains. Most reconstruction networks are trained on post-ISP sRGB images, even though the image-signal-processing pipeline irreversibly mixes colors, clips dynamic range, and blurs fine detail. This paper uses the rain degradation problem as a use case to show that these losses are avoidable, and demonstrates that learning directly on raw Bayer mosaics yields superior reconstructions. To substantiate the claim, we (i) evaluate post-ISP and Bayer reconstruction pipelines, (ii) curate Raw-Rain, the first public benchmark of real rainy scenes captured in both 12-bit Bayer and bit-depth-matched sRGB, and (iii) introduce Information Conservation Score (ICS), a color-invariant metric that aligns more closely with human opinion than PSNR or SSIM. On the test split, our raw-domain model improves sRGB results by up to +0.99 dB PSNR and +1.2% ICS, while running faster with half of the GFLOPs. The results advocate an ISP-last paradigm for low-level vision and open the door to end-to-end learnable camera pipelines.
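To show what working directly on a Bayer mosaic can look like in practice, the sketch below packs a single-channel mosaic into four half-resolution planes, a common way to feed raw data to a network. This is not necessarily how the paper's model consumes its input: the RGGB layout, the normalization by a 12-bit white level of 4095, and the omission of black-level subtraction are all assumptions.

```python
def pack_rggb(mosaic, white_level=4095.0):
    """Split an RGGB Bayer mosaic into normalized (R, G1, G2, B) planes.

    mosaic: 2D list with even height and width, indexed as mosaic[row][col].
    Each 2x2 cell is assumed to be laid out as
        R  G1
        G2 B
    """
    height, width = len(mosaic), len(mosaic[0])
    if height % 2 or width % 2:
        raise ValueError("mosaic height and width must be even")

    def plane(row_offset, col_offset):
        return [
            [mosaic[y][x] / white_level for x in range(col_offset, width, 2)]
            for y in range(row_offset, height, 2)
        ]

    return plane(0, 0), plane(0, 1), plane(1, 0), plane(1, 1)
```

Each plane has a quarter of the mosaic's pixels, so the packed input has 4 values per spatial location where a demosaiced sRGB image at full resolution has 3 values at four times as many locations, which gives some intuition for the reduced compute the abstract reports.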

[395] Joint Superpixel and Self-Representation Learning for Scalable Hyperspectral Image Clustering

Xianlu Li,Nicolas Nadisic,Shaoguang Huang,Aleksandra Pizurica

Main category: cs.CV

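TL;DR: Proposes an end-to-end framework that jointly optimizes superpixel segmentation and subspace clustering: feedback from an unfolded-ADMM self-representation network guides a differentiable superpixel module, producing clustering-aware segmentation and improving both the accuracy and the efficiency of hyperspectral image analysis.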

Details

Motivation: Existing superpixel methods segment independently of the clustering task, so the partitions do not align with the clustering objective, and conventional subspace clustering has high computational and memory costs that limit scalability. Method: Designs a unified end-to-end framework that couples a self-representation network based on unfolded ADMM (which supplies a model-driven feedback signal) with a differentiable superpixel module, jointly optimizing superpixel segmentation and subspace clustering; the superpixel network also learns a separate compactness parameter for each superpixel to increase adaptivity. Result: Experiments on several hyperspectral image benchmarks show that the method consistently outperforms state-of-the-art clustering approaches in accuracy while also improving computational efficiency. Conclusion: The joint optimization framework produces clustering-aware superpixels that respect both spectral and spatial structure, improving the performance and scalability of unsupervised hyperspectral image analysis. Abstract: Subspace clustering is a powerful unsupervised approach for hyperspectral image (HSI) analysis, but its high computational and memory costs limit scalability. Superpixel segmentation can improve efficiency by reducing the number of data points to process. However, existing superpixel-based methods usually perform segmentation independently of the clustering task, often producing partitions that do not align with the subsequent clustering objective. To address this, we propose a unified end-to-end framework that jointly optimizes superpixel segmentation and subspace clustering. Its core is a feedback mechanism: a self-representation network based on unfolded Alternating Direction Method of Multipliers (ADMM) provides a model-driven signal to guide a differentiable superpixel module. This joint optimization yields clustering-aware partitions that preserve both spectral and spatial structure. Furthermore, our superpixel network learns a unique compactness parameter for each superpixel, enabling more flexible and adaptive segmentation. Extensive experiments on benchmark HSI datasets demonstrate that our method consistently achieves superior accuracy compared with state-of-the-art clustering approaches.

[396] A Second-Order Perspective on Pruning at Initialization and Knowledge Transfer

Leonardo Iurada,Beatrice Occhiena,Tatiana Tommasi

Main category: cs.CV

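TL;DR: Studies pruning-at-initialization for pre-trained vision models and finds that, even when the downstream task is unknown, pruning on one task preserves zero-shot performance on unseen tasks, and fine-tuning can recover performance on held-out tasks; the authors attribute this to the favorable loss landscape induced by large-scale pre-training.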

Details

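Motivation: The computational and storage costs of pre-trained vision models limit practical deployment, and conventional pruning requires task-specific data, which is a problem when downstream tasks are unknown; the paper therefore explores whether effective pruning is possible without task-specific data. Method: Runs pruning-at-initialization experiments on different tasks, analyzes the pruned models' zero-shot performance on unseen tasks and their recovery after fine-tuning, and examines how large-scale pre-training shapes the loss landscape. Result: After pruning on one task, the model retains good zero-shot performance on other, unseen tasks; further fine-tuning improves performance on the original task and can recover performance on held-out tasks. Conclusion: Large-scale pre-training gives models a robust, transferable structure that makes pruning-at-initialization feasible without task-specific data, suggesting a new route to efficient compression and adaptation of pre-trained models. Abstract: The widespread availability of pre-trained vision models has enabled numerous deep learning applications through their transferable representations. However, their computational and storage costs often limit practical deployment. Pruning-at-Initialization has emerged as a promising approach to compress models before training, enabling efficient task-specific adaptation. While conventional wisdom suggests that effective pruning requires task-specific data, this creates a challenge when downstream tasks are unknown in advance. In this paper, we investigate how data influences the pruning of pre-trained vision models. Surprisingly, pruning on one task also retains the model's zero-shot performance on unseen tasks. Furthermore, fine-tuning these pruned models not only improves performance on original seen tasks but can recover held-out tasks' performance. We attribute this phenomenon to the favorable loss landscapes induced by extensive pre-training on large-scale datasets.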

[397] Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding

Zhecheng Li,Guoxian Song,Yiwei Wang,Zhen Xiong,Junsong Yuan,Yujun Cai

Main category: cs.CV

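TL;DR: Proposes the GMS framework, which pairs a general vision-language model with a task-specific model in a synergistic coarse-to-fine scheme to improve the grounding of natural language queries in GUIs.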

Details

Motivation: Existing GUI grounding methods fall short in understanding diverse UI elements across applications and systems and in predicting precise coordinates, so a more effective approach is needed. Method: Proposes the GMS framework, which has five stages and uses hierarchical search with cross-modal communication; a general VLM acts as the 'Scanner' that identifies regions of interest, and a specialized grounding model acts as the 'Locator' that outputs precise coordinates. Result: On ScreenSpot-Pro, the standalone 'Scanner' and 'Locator' reach only 2.0% and 3.7% accuracy, respectively, whereas the full GMS framework reaches 35.7%, a 10-fold improvement, and significantly outperforms several baselines. Conclusion: By combining coarse scanning with fine localization, GMS improves the accuracy and robustness of GUI grounding and shows strong potential for general-purpose GUI grounding. Abstract: Grounding natural language queries in graphical user interfaces (GUIs) presents a challenging task that requires models to comprehend diverse UI elements across various applications and systems, while also accurately predicting the spatial coordinates for the intended operation. To tackle this problem, we propose GMS: Generalist Scanner Meets Specialist Locator, a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. GMS leverages the complementary strengths of general vision-language models (VLMs) and small, task-specific GUI grounding models by assigning them distinct roles within the framework. Specifically, the general VLM acts as a 'Scanner' to identify potential regions of interest, while the fine-tuned grounding model serves as a 'Locator' that outputs precise coordinates within these regions. This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization. Our whole framework consists of five stages and incorporates hierarchical search with cross-modal communication to achieve promising prediction results. Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only $2.0\%$ and $3.7\%$ accuracy respectively when used independently, their integration within the GMS framework yields an overall accuracy of $35.7\%$, representing a $10 \times$ improvement. Additionally, GMS significantly outperforms other strong baselines under various settings, demonstrating its robustness and potential for general-purpose GUI grounding.
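One mechanical step that any coarse-to-fine grounding pipeline of this kind needs is mapping a coordinate predicted inside a cropped region back to full-screen pixels. The sketch below shows that step only; the normalized-coordinate convention, the `(left, top, width, height)` region format, and the function name are assumptions for illustration, since the abstract does not describe the interface between the two models.

```python
def crop_to_screen(point, region):
    """Map a normalized (x, y) in [0, 1]^2, predicted inside a crop, to
    absolute screen pixels. region = (left, top, width, height) in pixels."""
    left, top, width, height = region
    x, y = point
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("point must be normalized to the crop")
    return (left + x * width, top + y * height)

# Hypothetical usage: a coarse stage proposes a 200x100 region at (1000, 400),
# and a fine stage predicts the target at the crop's horizontal center.
click = crop_to_screen((0.5, 0.25), (1000, 400, 200, 100))
```

Working on a crop lets the fine model see the target at a higher effective resolution, which matters on high-resolution professional screens.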

Hosein Hasani,Amirmohammad Izadi,Fatemeh Askari,Mobin Bagherian,Sadegh Mohammadian,Mohammad Izadi,Mahdieh Soleymani Baghshah

Main category: cs.CV

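TL;DR: Introduces the concept of "Grounding IDs" to explain how external visual cues, acting through latent identifiers, strengthen image-text alignment in multimodal models, improving cross-modal grounding and reducing hallucinations.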

Details

Motivation: Large vision-language models perform well on multimodal tasks but remain limited in structured reasoning and precise grounding. Prior work shows that adding simple visual structures improves performance, yet the underlying mechanism is unclear. Method: Uses representation analysis and causal interventions to study how external cues (such as partitions and annotations) induce a cross-modal object-binding mechanism (Grounding IDs) in embedding space, and analyzes its effect on attention and on the modality gap. Result: Grounding IDs emerge as robust within-partition alignment in embedding space and narrow the image-text modality gap; causal experiments confirm that they mediate binding between objects and symbolic cues and strengthen attention between related components, which improves cross-modal grounding and reduces hallucinations. Conclusion: Grounding IDs are a key symbolic mechanism that explains how external cues improve binding in multimodal models, offering better interpretability and a path to improved robustness. Abstract: Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding. Recent work has demonstrated that adding simple visual structures, such as partitions and annotations, improves accuracy, yet the internal mechanisms underlying these gains remain unclear. We investigate this phenomenon and propose the concept of Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Through representation analysis, we find that these identifiers emerge as robust within-partition alignment in embedding space and reduce the modality gap between image and text. Causal interventions further confirm that these identifiers mediate binding between objects and symbolic cues. We show that Grounding IDs strengthen attention between related components, which in turn improves cross-modal grounding and reduces hallucinations. Taken together, our results identify Grounding IDs as a key symbolic mechanism explaining how external cues enhance multimodal binding, offering both interpretability and practical improvements in robustness.

[399] Latent Visual Reasoning

Bangzheng Li,Ximeng Sun,Jiang Liu,Ze Wang,Jialian Wu,Xiaodong Yu,Hao Chen,Emad Barsoum,Muhao Chen,Zicheng Liu

Main category: cs.CV

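TL;DR: Proposes Latent Visual Reasoning (LVR), a new reasoning paradigm for multimodal large language models that reasons autoregressively in the visual embedding space and substantially improves fine-grained visual understanding and perception.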

Details

Motivation: Existing methods treat visual information as a static precondition and confine reasoning to the language space, which limits understanding on complex visual tasks. Method: A visual encoder projects images into visual tokens in a semantic space shared with the language model; the language model is trained to generate latent states that reconstruct the key visual tokens, which constitutes latent visual reasoning; this is interleaved with text generation and further optimized with reinforcement learning using an adapted GRPO algorithm. Result: Reaches 71.67% on MMVP versus 66.67% for Qwen2.5-VL, with substantial gains on perception-intensive visual question answering tasks. Conclusion: LVR enables reasoning directly in the visual embedding space, strengthening the model's ability to understand and use visual information dynamically and offering a new paradigm for multimodal reasoning. Abstract: Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. Code base and model weights will be released later.

[400] Autoregressive Video Generation beyond Next Frames Prediction

Sucheng Ren,Chen Chen,Zhenbang Wang,Liangchen Song,Xiangxin Zhu,Alan Yuille,Yinfei Yang,Jiasen Lu

Main category: cs.CV

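TL;DR: Presents VideoAR, a unified video generation framework that supports several prediction units (full frames, key-detail frames, multiscale refinements, and spatiotemporal cubes). Using spatiotemporal cubes as the prediction unit markedly improves generation quality, speed, and temporal coherence, removing the constraint of frame-by-frame generation.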

Details

Motivation: Questions whether the frame should be the basic prediction unit in video generation and explores more suitable units to improve generation quality. Method: Proposes the VideoAR framework, which supports multiple prediction units and, in particular, uses spatiotemporal cubes for autoregressive modeling across both spatial and temporal dimensions. Result: Cube-based prediction outperforms state-of-the-art methods on VBench, runs faster at inference, and scales seamlessly to minute-long sequences. Conclusion: The frame is not the best atomic unit for video autoregression; spatiotemporal cubes work better as prediction units, which may prompt a rethinking of sequence decomposition in video and other spatiotemporal domains. Abstract: Autoregressive models for video generation typically operate frame-by-frame, extending next-token prediction from language to video's temporal dimension. Whereas the word is universally accepted as the token in language, we question whether the frame is an appropriate prediction unit for video. To address this, we present VideoAR, a unified framework that supports a spectrum of prediction units including full frames, key-detail frames, multiscale refinements, and spatiotemporal cubes. Among these designs, we find that modeling video generation with \textit{spatiotemporal} cubes as prediction units is the most effective, as it allows autoregressive models to operate across both spatial and temporal dimensions simultaneously. This approach eliminates the assumption that frames are the natural atomic units for video autoregression. We evaluate VideoAR across diverse prediction strategies, finding that cube-based prediction consistently delivers superior quality, speed, and temporal coherence. By removing the frame-by-frame constraint, our video generator surpasses state-of-the-art baselines on VBench while achieving faster inference and enabling seamless scaling to minute-long sequences. We hope this work will motivate rethinking sequence decomposition in video and other spatiotemporal domains.
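To make the idea of a spatiotemporal cube as a prediction unit concrete, the sketch below enumerates non-overlapping cubes over a (frames, height, width) token grid. The cube sizes, the raster ordering (time, then rows, then columns), and the function name are assumptions for illustration; the abstract does not state how VideoAR sizes or orders its cubes.

```python
def cube_units(frames, height, width, ct, ch, cw):
    """Return the (t0, y0, x0) origin of every non-overlapping
    ct x ch x cw cube tiling a frames x height x width token grid."""
    if frames % ct or height % ch or width % cw:
        raise ValueError("grid dimensions must be divisible by the cube size")
    return [
        (t, y, x)
        for t in range(0, frames, ct)
        for y in range(0, height, ch)
        for x in range(0, width, cw)
    ]

# Hypothetical example: an 8-frame, 4x4 token grid split into 4x2x2 cubes
# gives 8 autoregressive steps, versus 8 steps of 16 tokens each when the
# unit is a full frame.
units = cube_units(8, 4, 4, 4, 2, 2)
```

Each autoregressive step then predicts one cube, so a single step spans several frames as well as a spatial patch.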

[401] Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA

Jianxin Liang,Tan Yue,Yuxuan Wang,Yueqian Wang,Zhihan Yin,Huishuai Zhang,Dongyan Zhao

Main category: cs.CV

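TL;DR: Proposes a framework that improves video question answering (VideoQA) models by synthesizing richer supervision signals through Question-Based Paraphrasing (QBP) and Question-Based Captioning (QBC), yielding clear gains in accuracy, generalization, and training efficiency.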

Details

Motivation: Conventional VideoQA models are supervised with isolated, factual question-answer pairs and lack an understanding of the narrative and causal structure of events in a video, which limits deeper comprehension. Method: Proposes two strategies: Question-Based Paraphrasing (QBP), which merges multiple question-answer pairs into a narrative paragraph, and Question-Based Captioning (QBC), which generates fine-grained visual rationales for each answer. The data are synthesized with generative models, and VideoQA models are trained under a unified next-token prediction objective. Result: Sets new state-of-the-art results on STAR and NExT-QA, for example raising a 3B model to 72.5% on STAR (+4.9%) and a 7B model to 80.8% on NExT-QA. Cross-dataset generalization also improves, and QBP speeds up convergence by more than 2.5x. Conclusion: Shifting supervision from isolated facts to narrative coherence and grounded rationales yields a more accurate, efficient, and generalizable VideoQA training paradigm. Abstract: The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This "bag-of-facts" approach fails to capture the underlying narrative and causal structure of events, limiting models to a shallow understanding of video content. To move beyond this paradigm, we introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Question-Based Paraphrasing (QBP), which synthesizes the diverse inquiries (what, how, why) from a video's existing set of question-answer pairs into a holistic narrative paragraph that reconstructs the video's event structure; and Question-Based Captioning (QBC), which generates fine-grained visual rationales, grounding the answer to each question in specific, relevant evidence. Leveraging powerful generative models, we use this synthetic data to train VideoQA models under a unified next-token prediction objective. Extensive experiments on STAR and NExT-QA validate our approach, demonstrating significant accuracy gains and establishing new state-of-the-art results, such as improving a 3B model to 72.5\% on STAR (+4.9\%) and a 7B model to 80.8\% on NExT-QA. Beyond accuracy, our analysis reveals that both QBP and QBC substantially enhance cross-dataset generalization, with QBP additionally accelerating model convergence by over 2.5x. These results demonstrate that shifting data synthesis from isolated facts to narrative coherence and grounded rationales yields a more accurate, efficient, and generalizable training paradigm.

Prerit Gupta,Shourya Verma,Ananth Grama,Aniket Bera

Main category: cs.CV

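TL;DR: DualFlow is a unified, efficient framework for multi-modal two-person motion generation that produces 3D motion conditioned on text, music, and prior motion sequences; it uses rectified flow for fast deterministic sampling and a retrieval-augmented generation module to improve semantic alignment and motion coordination.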

Details

Motivation: Generating realistic, context-aware two-person motion remains challenging in computer graphics, animation, and human-computer interaction, particularly when the system must handle several input modalities while keeping the motion synchronized and semantically consistent. Method: Proposes the DualFlow framework, which uses rectified flow to obtain straight-line sampling paths between noise and data, adds a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions, and introduces a contrastive objective and a synchronization loss to strengthen conditional alignment and two-person coordination. Result: Achieves consistent gains on text-to-motion, music-to-motion, and multi-modal interactive tasks; the generated motion is better than existing methods in temporal coherence, rhythmic synchronization, and responsiveness. Conclusion: DualFlow reaches the state of the art in multi-modal human motion generation, improving generation quality, modality alignment, and two-person coordination. Abstract: Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use a contrastive objective that further strengthens alignment with conditioning signals and introduce a synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting a new state of the art in multi-modal human motion generation.
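The "straight-line sampling paths" mentioned above come from the standard rectified-flow construction: points on the path are x_t = (1 − t)·x0 + t·x1, the regression target for the velocity field is x1 − x0, and sampling integrates the learned velocity with a few Euler steps. The sketch below illustrates that generic recipe on plain Python lists; the function names and the toy velocity function are assumptions, and nothing here reflects DualFlow's actual network or conditioning.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field: constant along the path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

When the velocity field is exactly constant along the path, a single Euler step already lands on the data point, which is why straighter paths allow fewer sampling steps than curved diffusion trajectories.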

[403] Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

Shijie Lian,Changti Wu,Laurence Tianruo Yang,Hang Yuan,Bin Yu,Lei Zhang,Kai Chen

Main category: cs.CV

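TL;DR: Proposes Euclidean geometry problem-solving as a surrogate task for improving the spatial intelligence of multimodal large language models. The authors build Euclid30K, a dataset of about 30K geometry problems, and fine-tune the Qwen2.5VL and RoboBrain2.0 model families with GRPO so that they learn to identify shapes, count, relate entities, and reason in multiple steps. Zero-shot performance improves substantially on several spatial reasoning benchmarks; RoboBrain2.0-Euclid-7B reaches 49.6% accuracy on VSI-Bench, surpassing the previous best model, Spatial-MLLM.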

Details

Motivation: Current multimodal large language models show limited spatial intelligence: they struggle with visualizing shapes, rotating objects, and judging spatial relations, so an effective way to improve their spatial reasoning is needed. Method: Builds the Euclid30K dataset of about 30K plane and solid geometry problems and fine-tunes the Qwen2.5VL and RoboBrain2.0 model families with Group Relative Policy Optimization (GRPO) so that they learn and apply Euclidean principles in multi-step reasoning. Result: Achieves substantial zero-shot gains on four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube); the mean VSI-Bench accuracy of all evaluated models rises from 34.5% to 40.5%; RoboBrain2.0-Euclid-7B reaches 49.6% accuracy, surpassing the previous SOTA model, Spatial-MLLM. Conclusion: This is the first systematic study showing that geometry-centric fine-tuning can give vision-language models broadly transferable spatial reasoning skills, offering an effective route to better spatial intelligence in multimodal models. Abstract: Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. To enable the model to acquire and apply Euclidean principles from these geometry problems, we employed Group Relative Policy Optimization (GRPO) to finetune the Qwen2.5VL family and RoboBrain2.0 family, inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on the Euclid30K, the mean VSI-Bench accuracy of all evaluated models rose from 34.5% to 40.5%, improving by 5.5 percentage points. Among them, RoboBrain2.0-Euclid-7B achieves 49.6\% accuracy, surpassing the previous state-of-the-art model, Spatial-MLLM. To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills. Code and Euclid30K dataset can be found in https://zgca-ai4edu.github.io/Euclids_Gift.
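For readers unfamiliar with GRPO, its distinguishing step is that each sampled answer is scored relative to the other answers drawn for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage is shown below; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions, and the reward design used for Euclid30K is not described in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled answers:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: four sampled solutions to one geometry problem,
# rewarded 1.0 when the final answer is correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative, and a group in which every answer earns the same reward contributes no learning signal.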

[404] SVAC: Scaling Is All You Need For Referring Video Object Segmentation

Li Zhang,Haoxiang Gao,Zhihao Zhang,Luoxiao Huang,Tao Zhang

Main category: cs.CV

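TL;DR: Proposes SVAC, a unified model that improves referring video object segmentation (RVOS) by scaling up input frames and segmentation tokens, and introduces the ASTC module and the CSA strategy to handle the resulting computational cost and to model dynamic object behavior, achieving state-of-the-art results on multiple benchmarks.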

Details

Motivation: Existing methods under-exploit the prior knowledge of multimodal large models, incur heavy computational cost on long videos, and handle complex temporal dynamics poorly, so a more efficient and precise RVOS method is needed. Method: Proposes the SVAC model, which uses an Anchor-Based Spatio-Temporal Compression (ASTC) module to compress visual tokens while preserving spatio-temporal structure, together with a Clip-Specific Allocation (CSA) strategy to better model dynamic objects across clips, strengthening video-language interaction and segmentation precision. Result: SVAC reaches state-of-the-art performance on multiple RVOS benchmarks with competitive computational efficiency. Conclusion: By scaling up the input and designing efficient compression and allocation strategies, SVAC significantly improves the performance and practicality of RVOS. Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in video sequences based on natural language descriptions. While recent advances in Multi-modal Large Language Models (MLLMs) have improved RVOS performance through enhanced text-video understanding, several challenges remain, including insufficient exploitation of MLLMs' prior knowledge, prohibitive computational and memory costs for long-duration videos, and inadequate handling of complex temporal dynamics. In this work, we propose SVAC, a unified model that improves RVOS by scaling up input frames and segmentation tokens to enhance video-language interaction and segmentation precision. To address the resulting computational challenges, SVAC incorporates the Anchor-Based Spatio-Temporal Compression (ASTC) module to compress visual tokens while preserving essential spatio-temporal structure. Moreover, the Clip-Specific Allocation (CSA) strategy is introduced to better handle dynamic object behaviors across video clips. Experimental results demonstrate that SVAC achieves state-of-the-art performance on multiple RVOS benchmarks with competitive efficiency. Our code is available at https://github.com/lizhang1998/SVAC.

[405] NeMo: Needle in a Montage for Video-Language Understanding

Zi-Yuan Hu,Shuo Liang,Duo Zheng,Yanyang Li,Yeyao Tao,Shijia Huang,Wei Feng,Jia Qin,Jianguang Yu,Jing Huang,Meng Fang,Yin Li,Liwei Wang

Main category: cs.CV

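TL;DR: Introduces Needle in a Montage (NeMo), a new task for evaluating video large language models (VideoLLMs) on complex temporal reasoning, long-context recall, and temporal grounding, together with an automated data generation pipeline and NeMoBench, a large-scale video-language benchmark with more than 30K automatically generated question-answer pairs; experiments on 20 state-of-the-art models demonstrate the benchmark's effectiveness and scalability.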

Details

Motivation: Video large language models lack an effective way to evaluate complex temporal reasoning; new evaluation protocols and benchmarks are needed to measure their key reasoning abilities, especially long-context understanding and temporal grounding. Method: Proposes the NeMo task and a scalable automated data generation pipeline, and on that basis builds the NeMoBench video-language benchmark, which contains a large number of automatically generated video question-answer samples on videos ranging from seconds to hours in length. Result: Generates 31,378 high-quality question-answer pairs covering 13,486 videos; experiments confirm the reliability of the data generation pipeline, and a systematic evaluation of 20 state-of-the-art models reveals their performance and limitations on temporal reasoning tasks. Conclusion: NeMoBench gives video-language models a reliable, scalable, and continuously updatable evaluation benchmark that tests long-context recall and temporal grounding, supporting progress of VideoLLMs on complex temporal understanding. Abstract: Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle in a haystack test widely used by LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs' critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated data generation pipeline that facilitates high-quality data synthesis. Built upon the proposed pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, our full set of NeMoBench features 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with various durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.

[406] GANji: A Framework for Introductory AI Image Generation

Chandon Hamel,Mike Busch

Main category: cs.CV

TL;DR: Introduces GANji, a lightweight framework for benchmarking image generation models on a dataset of 10,314 Japanese Kanji characters, comparing a VAE, a GAN, and a DDPM.

Details

Motivation: Lower the computational barrier to comparative studies of generative models by providing an accessible benchmarking tool. Method: Uses a dataset of 10,314 Japanese Kanji characters to systematically evaluate a VAE, a GAN, and a DDPM on image fidelity and sampling speed. Result: The DDPM achieves the best image fidelity (FID of 26.2), but its sampling is more than 2,000 times slower than the other models; the VAE and GAN are far more efficient. Conclusion: The GANji framework exposes the basic trade-offs between model architecture, computational cost, and visual quality, making it suitable for both education and research. Abstract: The comparative study of generative models often requires significant computational resources, creating a barrier for researchers and practitioners. This paper introduces GANji, a lightweight framework for benchmarking foundational AI image generation techniques using a dataset of 10,314 Japanese Kanji characters. It systematically compares the performance of a Variational Autoencoder (VAE), a Generative Adversarial Network (GAN), and a Denoising Diffusion Probabilistic Model (DDPM). The results demonstrate that while the DDPM achieves the highest image fidelity, with a Fr\'echet Inception Distance (FID) score of 26.2, its sampling is over 2,000 times slower than that of the other models. The GANji framework is an effective and accessible tool for revealing the fundamental trade-offs between model architecture, computational cost, and visual quality, making it ideal for both educational and research purposes.

[407] MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment

Fankai Jia,Daisong Gan,Zhe Zhang,Zhaochi Wen,Chenchen Dan,Dong Liang,Haifeng Wang

Main category: cs.CV

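TL;DR: Proposes MMRQA, a multimodal MRI quality assessment framework that combines multimodal large language models with acquisition-aware signal processing to deliver interpretable MRI quality assessment with strong generalization.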

Details

Motivation: Traditional MRI quality assessment trades off quantitative metrics against semantic understanding, and deep learning methods lack interpretability, so a method that is both accurate and clinically interpretable is needed. Method: The MMRQA framework rests on three innovations: robust metric extraction via MRQy augmented with simulated artifacts, structuring of the metrics into question-answer pairs using Qwen, and parameter-efficient fusion through LoRA applied to LLaVA-OneVision. Result: Evaluated on the MR-ART, FastMRI, and MyConnectome datasets, MMRQA achieves state-of-the-art performance with strong zero-shot generalization; ablation studies validate each component. Conclusion: MMRQA combines quantitative analysis with semantic reasoning to produce clinically interpretable outputs, improving MRI quality control in dynamic medical settings. Abstract: Magnetic resonance imaging (MRI) quality assessment is crucial for clinical decision-making, yet remains challenging due to data scarcity and protocol variability. Traditional approaches face fundamental trade-offs: signal-based methods like MRIQC provide quantitative metrics but lack semantic understanding, while deep learning approaches achieve high accuracy but sacrifice interpretability. To address these limitations, we introduce the Multimodal MRI Quality Assessment (MMRQA) framework, pioneering the integration of multimodal large language models (MLLMs) with acquisition-aware signal processing. MMRQA combines three key innovations: robust metric extraction via MRQy augmented with simulated artifacts, structured transformation of metrics into question-answer pairs using Qwen, and parameter-efficient fusion through Low-Rank Adaptation (LoRA) of LLaVA-OneVision. Evaluated on MR-ART, FastMRI, and MyConnectome benchmarks, MMRQA achieves state-of-the-art performance with strong zero-shot generalization, as validated by comprehensive ablation studies. By bridging quantitative analysis with semantic reasoning, our framework generates clinically interpretable outputs that enhance quality control in dynamic medical settings.

[408] TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models

Junyi Zhang,Jia-Chen Gu,Wenbo Hu,Yu Zhou,Robinson Piramuthu,Nanyun Peng

Main category: cs.CV

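TL;DR: Introduces TemMed-Bench, the first medical vision-language benchmark for analyzing changes in a patient's condition across clinical visits. It comprises three tasks (visual question answering, report generation, and image-pair selection) and shows that most large models perform poorly on temporal medical image reasoning, although multi-modal retrieval augmentation improves performance notably.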

Details

Motivation: Existing medical reasoning benchmarks mostly analyze an image from a single visit, whereas in clinical practice doctors refer to a patient's history to form a comprehensive assessment; a benchmark is therefore needed to evaluate how well models understand changes across temporal medical images. Method: Builds the TemMed-Bench benchmark, with three tasks (VQA, report generation, and image-pair selection) and a knowledge corpus of more than 17,000 instances, evaluates six proprietary and six open-source large vision-language models, and explores multi-modal retrieval augmentation that combines visual and textual retrieval. Result: Most LVLMs perform close to random guessing on temporal medical image reasoning; GPT o3, o4-mini, and Claude 3.5 Sonnet do comparatively well but remain limited; multi-modal retrieval augmentation yields notable gains on most models, with an average improvement of 2.59% on the VQA task. Conclusion: TemMed-Bench fills a gap in the evaluation of temporal medical image reasoning, exposes the limitations of current LVLMs on such tasks, and indicates that multi-modal retrieval augmentation is a direction worth exploring. Abstract: Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's historical conditions to provide a comprehensive assessment by tracking their changes over time. In this paper, we introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits, which challenges large vision-language models (LVLMs) to reason over temporal medical images. TemMed-Bench consists of a test set comprising three tasks - visual question-answering (VQA), report generation, and image-pair selection - and a supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we conduct an evaluation of six proprietary and six open-source LVLMs. Our results show that most LVLMs lack the ability to analyze patients' condition changes over temporal medical images, and a large proportion perform only at a random-guessing level in the closed-book setting. In contrast, GPT o3, o4-mini and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they have yet to reach the desired level. Furthermore, we explore augmenting the input with both retrieved visual and textual modalities in the medical domain. We also show that multi-modal retrieval augmentation yields notably higher performance gains than no retrieval and textual retrieval alone across most models on our benchmark, with the VQA task showing an average improvement of 2.59%. Overall, we compose a benchmark grounded on real-world clinical practice, and it reveals LVLMs' limitations in temporal medical image reasoning, as well as highlighting the use of multi-modal retrieval augmentation as a potentially promising direction worth exploring to address this challenge.

[409] EYE-DEX: Eye Disease Detection and EXplanation System

Youssef Sabiri,Walid Houmaidi,Amine Abouaomar

Main category: cs.CV

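TL;DR: Presents EYE-DEX, an automated framework for classifying 10 retinal conditions on a large-scale dataset of 21,577 fundus images; a fine-tuned VGG16 reaches 92.36% accuracy, and Grad-CAM is integrated to improve interpretability.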

Details

Motivation: Diagnosing retinal disease is essential for preventing vision loss and reducing socioeconomic burden; traditional manual grading is time-consuming and subjective, so efficient and objective automated diagnosis is needed. Method: Fine-tunes pre-trained convolutional neural networks (VGG16, VGG19, ResNet50) on a large-scale retinal disease dataset, with VGG16 performing best, and integrates Grad-CAM to generate visual explanations. Result: The fine-tuned VGG16 reaches 92.36% test accuracy, which the authors report as state of the art, and Grad-CAM highlights disease-specific regions. Conclusion: The EYE-DEX framework combines high accuracy with good interpretability in multi-class retinal disease classification, which can help build clinician trust in AI-assisted diagnosis and support its use in practice. Abstract: Retinal disease diagnosis is critical in preventing vision loss and reducing socioeconomic burdens. Globally, over 2.2 billion people are affected by some form of vision impairment, resulting in annual productivity losses estimated at $411 billion. Traditional manual grading of retinal fundus images by ophthalmologists is time-consuming and subjective. In contrast, deep learning has revolutionized medical diagnostics by automating retinal image analysis and achieving expert-level performance. In this study, we present EYE-DEX, an automated framework for classifying 10 retinal conditions using the large-scale Retinal Disease Dataset comprising 21,577 eye fundus images. We benchmark three pre-trained Convolutional Neural Network (CNN) models--VGG16, VGG19, and ResNet50--with our finetuned VGG16 achieving a state-of-the-art global benchmark test accuracy of 92.36%. To enhance transparency and explainability, we integrate the Gradient-weighted Class Activation Mapping (Grad-CAM) technique to generate visual explanations highlighting disease-specific regions, thereby fostering clinician trust and reliability in AI-assisted diagnostics.

[410] GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts

Fan Yuan,Yuchen Yan,Yifan Jiang,Haoran Zhao,Tao Feng,Jinyan Chen,Yanwei Lou,Wenqi Zhang,Yongliang Shen,Weiming Lu,Jun Xiao,Yueting Zhuang

Main category: cs.CV

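TL;DR: Introduces GSM8K-V, a purely visual, multi-image mathematical reasoning benchmark for evaluating how well vision language models handle math problems presented visually.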

Details

Motivation: Existing visual mathematical reasoning benchmarks are mostly restricted to geometry, lack coverage of math word problems, and rarely assess reasoning across multiple images; GSM8K-V was built to fill these gaps. Method: Systematically converts each sample of the widely used text-based GSM8K benchmark into visual form, combining an automated image-generation pipeline with careful human annotation to produce 1,319 high-quality samples. Result: A range of open-source and closed-source models were evaluated on GSM8K-V; although existing models are close to saturation on text-based GSM8K, there is still substantial room for improvement on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, reaches 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. Conclusion: GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable vision language models. Abstract: Vision language models (VLMs) achieve unified modeling of images and text, enabling them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, with mathematical reasoning serving as a prominent example. It highlights the high-level capability of VLMs to comprehend mathematical information in images and to perform sophisticated reasoning. Recently, numerous visual mathematical reasoning benchmarks have been proposed, but they are often restricted to geometry, lack coverage of math word problems, and rarely assess reasoning across multiple images. To address these gaps, we introduce GSM8K-V, a purely visual multi-image mathematical reasoning benchmark. GSM8K-V is built by systematically mapping each sample from the widely used text-based GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Results show that although existing VLMs have nearly saturated performance on text-based GSM8K, there remains substantial room for improvement on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the limitations of current models as well as potential directions for improvement. GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable VLMs.

[411] Analysis of Bias in Deep Learning Facial Beauty Regressors

Chandon Hamel,Mike Busch

Main category: cs.CV

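TL;DR: Finds that AI facial beauty prediction models exhibit significant ethnicity-based bias, and amplify societal beauty biases, even when trained on seemingly balanced datasets. In cross-dataset validation of models trained on SCUT-FBP5500 and MEBeauty, predictions differ significantly across ethnic groups (p < 0.001), and only 4.8-9.5% of inter-group comparisons satisfy distributional parity criteria.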

Details

Motivation: To expose potential bias in how AI shapes aesthetic standards, particularly unfairness along the ethnic dimension, and to warn that current AI beauty technologies may aggravate societal bias. Method: Kruskal-Wallis H-tests and post hoc Dunn analyses are used to statistically validate models trained on the SCUT-FBP5500 and MEBeauty datasets; cross-ethnicity predictions are evaluated on the balanced FairFace dataset, and cross-dataset validation is used to analyze how bias propagates. Result: Both models show significant prediction disparities across ethnic groups (p < 0.001). In terms of prediction and error parity, the algorithms amplify societal beauty biases rather than mitigate them, and only 4.8-9.5% of inter-group comparisons satisfy distributional parity. Conclusion: Current AI facial beauty prediction approaches are seriously inadequate and fall short of fairness; fairer data collection and algorithmic strategies are needed to build more equitable beauty technologies. Abstract: Bias can be introduced to AI systems even from seemingly balanced sources, and AI facial beauty prediction is subject to ethnicity-based bias. This work sounds warnings about AI's role in shaping aesthetic norms while providing potential pathways toward equitable beauty technologies through comparative analysis of models trained on SCUT-FBP5500 and MEBeauty datasets. Rigorous statistical validation (Kruskal-Wallis H-tests, post hoc Dunn analyses) demonstrates that both models exhibit significant prediction disparities across ethnic groups $(p < 0.001)$, even when evaluated on the balanced FairFace dataset. Cross-dataset validation shows algorithmic amplification of societal beauty biases rather than mitigation based on prediction and error parity. The findings underscore the inadequacy of current AI beauty prediction approaches, with only 4.8-9.5% of inter-group comparisons satisfying distributional parity criteria. Mitigation strategies are proposed and discussed in detail.
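
The Kruskal-Wallis H-test named in this abstract can be sketched in a few lines. This is only an illustration of the statistic under stated assumptions: the score lists are made-up numbers, the helper names are hypothetical, no tie correction is applied, and the closed-form p-value used here holds only for three groups (two degrees of freedom). It is not the authors' analysis code.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical predicted scores for three groups (made-up numbers).
scores = [[2.1, 2.4, 2.2, 2.6], [3.0, 3.3, 3.1, 3.4], [4.0, 4.2, 3.9, 4.4]]
h_stat = kruskal_wallis_h(scores)
# With k = 3 groups, H is compared to a chi-square distribution with 2 degrees
# of freedom, whose survival function has the closed form exp(-H / 2).
p_value = math.exp(-h_stat / 2)
```

A significant H only says that at least one group differs; a post hoc procedure such as Dunn's test is then needed to find which pairs differ, which is the role it plays in the abstract.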

[412] Asymmetric VAE for One-Step Video Super-Resolution Acceleration

Jianze Li,Yong Guo,Yulun Zhang,Xiaokang Yang

Main category: cs.CV

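TL;DR: This paper proposes FastVSR, an efficient video super-resolution method that uses a high-compression VAE (f16) and an optimized training strategy to speed up inference, with speedups of 111.9x over multi-step models and 3.92x over one-step models.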

Details

Motivation: Diffusion models for video super-resolution already sample in a single step, yet inference efficiency still leaves room for optimization, which calls for a more efficient model design. Method: FastVSR uses a high-compression VAE with a spatial compression ratio of 16 (f16), applies pixel shuffle and channel replication for additional upsampling, and introduces a lower-bound-guided training strategy to stabilize training. Result: Experiments show that FastVSR is 111.9 times faster than multi-step models and 3.92 times faster than existing one-step models while maintaining good reconstruction quality. Conclusion: With an efficient network structure and a stable training strategy, FastVSR substantially speeds up video super-resolution inference and has potential for practical use. Abstract: Diffusion models have significant advantages in the field of real-world video super-resolution and have demonstrated strong performance in past research. In recent diffusion-based video super-resolution (VSR) models, the number of sampling steps has been reduced to just one, yet there remains significant room for further optimization in inference efficiency. In this paper, we propose FastVSR, which achieves substantial reductions in computational cost by implementing a high compression VAE (spatial compression ratio of 16, denoted as f16). We design the structure of the f16 VAE and introduce a stable training framework. We employ pixel shuffle and channel replication to achieve additional upsampling. Furthermore, we propose a lower-bound-guided training strategy, which introduces a simpler training objective as a lower bound for the VAE's performance. It makes the training process more stable and easier to converge. Experimental results show that FastVSR achieves speedups of 111.9 times compared to multi-step models and 3.92 times compared to existing one-step models. We will release code and models at https://github.com/JianzeLi-114/FastVSR.
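
For readers unfamiliar with the two upsampling operations named above, the following is a minimal sketch of pixel shuffle and channel replication on nested lists. How FastVSR actually wires them into its VAE is not stated in the abstract; the function names and the way the two are combined below are assumptions made purely for illustration.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # Sub-pixel offset (i % r, j % r) selects the source channel.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out

def replicate_channels(x, times):
    """Repeat every channel `times` times consecutively (hypothetical helper)."""
    return [ch for ch in x for _ in range(times)]

# Four 1x1 channels become one 2x2 channel.
shuffled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2)

# Replicating a channel r*r times before shuffling gives nearest-neighbour
# upsampling, since every sub-pixel reads the same source value.
upsampled = pixel_shuffle(replicate_channels([[[1, 2], [3, 4]]], 4), 2)
```

Both operations are parameter-free, which is one plausible reason to use them for extra upsampling when the goal is lower inference cost.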

[413] Accelerating Cerebral Diagnostics with BrainFusion: A Comprehensive MRI Tumor Framework

Walid Houmaidi,Youssef Sabiri,Salmane El Mansour Billah,Amine Abouaomar

Main category: cs.CV

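TL;DR: This paper proposes BrainFusion, a brain tumor classification and localization method that combines fine-tuned CNN models with YOLOv8 for high-accuracy MRI analysis; VGG16 reaches 99.86% test accuracy, substantially exceeding previous methods.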

Details

Motivation: Early and accurate classification of brain tumors is crucial for guiding treatment and improving patient outcomes, and existing methods leave room for improvement in accuracy and interpretability. Method: Fine-tuned CNN models (VGG16, ResNet50, and Xception) classify tumors, YOLOv8 provides precise localization, the Brain Tumor MRI Dataset is used for training and evaluation, and explainable AI techniques are added to increase trust in the model. Result: The fine-tuned VGG16 model reaches 99.86% classification accuracy on the test set, exceeding previous benchmarks, and bounding-box localization together with explainability analysis improves clinical usability. Conclusion: BrainFusion performs well at brain tumor classification and localization and shows the potential of deep learning to make diagnosis faster and more reliable, which may help improve patient care and survival rates. Abstract: The early and accurate classification of brain tumors is crucial for guiding effective treatment strategies and improving patient outcomes. This study presents BrainFusion, a significant advancement in brain tumor analysis using magnetic resonance imaging (MRI) by combining fine-tuned convolutional neural networks (CNNs) for tumor classification--including VGG16, ResNet50, and Xception--with YOLOv8 for precise tumor localization with bounding boxes. Leveraging the Brain Tumor MRI Dataset, our experiments reveal that the fine-tuned VGG16 model achieves a test accuracy of 99.86%, substantially exceeding previous benchmarks. Beyond setting a new accuracy standard, the integration of bounding-box localization and explainable AI techniques further enhances both the clinical interpretability and trustworthiness of the system's outputs. Overall, this approach underscores the transformative potential of deep learning in delivering faster, more reliable diagnoses, ultimately contributing to improved patient care and survival rates.

Moxin Zhao,Nan Meng,Jason Pui Yin Cheung,Chris Yuk Kwan Tang,Chenxi Yu,Wenting Zhong,Pengyu Lu,Chang Shi,Yipeng Zhuang,Teng Zhang

Main category: cs.CV

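TL;DR: Proposes LatXGen, a framework that generates lateral spinal radiographs from RGBD images, enabling radiation-free assessment of sagittal spinal morphology.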

Details

Motivation: Prior work has advanced radiation-free assessment in the coronal plane, but a reliable method for accurate, radiation-free assessment of sagittal spinal morphology is still missing. Method: LatXGen is a dual-stage generative framework that combines an attention-based Fast Fourier Convolution module with a Spatial Deformation Network to synthesize lateral radiographs from RGBD images of the back surface; a large-scale paired dataset of 3,264 pairs is also constructed. Result: LatXGen outperforms existing GAN-based methods in visual fidelity and quantitative metrics and generates anatomically accurate radiographs. Conclusion: LatXGen offers an effective radiation-free solution for sagittal assessment of adolescent idiopathic scoliosis and advances comprehensive evaluation of spinal morphology. Abstract: Adolescent Idiopathic Scoliosis (AIS) is a complex three-dimensional spinal deformity, and accurate morphological assessment requires evaluating both coronal and sagittal alignment. While previous research has made significant progress in developing radiation-free methods for coronal plane assessment, reliable and accurate evaluation of sagittal alignment without ionizing radiation remains largely underexplored. To address this gap, we propose LatXGen, a novel generative framework that synthesizes realistic lateral spinal radiographs from posterior Red-Green-Blue and Depth (RGBD) images of unclothed backs. This enables accurate, radiation-free estimation of sagittal spinal alignment. LatXGen tackles two core challenges: (1) inferring sagittal spinal morphology changes from a lateral perspective based on posteroanterior surface geometry, and (2) performing cross-modality translation from RGBD input to the radiographic domain. The framework adopts a dual-stage architecture that progressively estimates lateral spinal structure and synthesizes corresponding radiographs. To enhance anatomical consistency, we introduce an attention-based Fast Fourier Convolution (FFC) module for integrating anatomical features from RGBD images and 3D landmarks, and a Spatial Deformation Network (SDN) to model morphological variations in the lateral view. Additionally, we construct the first large-scale paired dataset for this task, comprising 3,264 RGBD and lateral radiograph pairs. Experimental results demonstrate that LatXGen produces anatomically accurate radiographs and outperforms existing GAN-based methods in both visual fidelity and quantitative metrics. This study offers a promising, radiation-free solution for sagittal spine assessment and advances comprehensive AIS evaluation.

[415] High-Order Progressive Trajectory Matching for Medical Image Dataset Distillation

Le Dong,Jinghao Bian,Jingyang Hou,Jingliang Hu,Yilei Shi,Weisheng Dong,Xiao Xiang Zhu,Lichao Mou

Main category: cs.CV

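TL;DR: Proposes a trajectory matching method based on a shape-wise potential and a progressive matching strategy for medical image dataset distillation, improving distillation performance while preserving privacy.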

Details

Motivation: Existing dataset distillation methods focus mainly on terminal states and overlook intermediate information from the optimization process, which loses information; the medical imaging field in particular faces privacy and data-sharing obstacles. Method: A shape-wise potential captures the geometric structure of parameter trajectories, and an easy-to-complex progressive matching strategy makes full use of intermediate optimization states for dataset distillation. Result: The method is validated on medical image classification tasks; distillation performance improves while model accuracy stays comparable to training on the original datasets and data privacy is protected. Conclusion: By exploiting the geometric features of parameter trajectories and a progressive matching strategy, the method improves medical image dataset distillation while preserving privacy, and has potential for practical use. Abstract: Medical image analysis faces significant challenges in data sharing due to privacy regulations and complex institutional protocols. Dataset distillation offers a solution to address these challenges by synthesizing compact datasets that capture essential information from real, large medical datasets. Trajectory matching has emerged as a promising methodology for dataset distillation; however, existing methods primarily focus on terminal states, overlooking crucial information in intermediate optimization states. We address this limitation by proposing a shape-wise potential that captures the geometric structure of parameter trajectories, and an easy-to-complex matching strategy that progressively addresses parameters based on their complexity. Experiments on medical image classification tasks demonstrate that our method improves distillation performance while preserving privacy and maintaining model accuracy comparable to training on the original datasets. Our code is available at https://github.com/Bian-jh/HoP-TM.
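
To make the "terminal states" remark concrete, here is a sketch of the end-point objective commonly used in trajectory matching, as I understand it: the distance between the student's final parameters and the expert's later checkpoint, normalized by how far the expert itself moved. It is the baseline the abstract contrasts against, not the shape-wise potential proposed in the paper, and the function name and flat-list parameter format are assumptions.

```python
def trajectory_matching_loss(student_end, expert_start, expert_end):
    """Normalized squared distance between student and expert end points.

    All arguments are flat lists of parameters. Only the end of the student
    trajectory enters the loss, so intermediate states are ignored.
    """
    num = sum((s - e) ** 2 for s, e in zip(student_end, expert_end))
    den = sum((a - e) ** 2 for a, e in zip(expert_start, expert_end))
    return num / den

# Toy 2-parameter example with made-up values.
loss = trajectory_matching_loss(
    student_end=[1.0, 0.0], expert_start=[0.0, 0.0], expert_end=[1.0, 1.0]
)
```

Two student trajectories that wander very differently but end at the same point receive the same loss here, which is the information gap the paper's shape-aware term is meant to address.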

[416] Combining Discrepancy-Confusion Uncertainty and Calibration Diversity for Active Fine-Grained Image Classification

Yinghao Jin,Xi Yang

Main category: cs.CV

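TL;DR: Proposes DECERN, a new method that combines discrepancy-confusion uncertainty and calibration diversity for active fine-grained image classification, improving the quality of sample selection.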

Details

Motivation: In fine-grained image classification, inter-class differences are subtle, so conventional active learning struggles to assess how informative a sample is; a more effective informativeness measure is needed. Method: DECERN combines discrepancy-confusion uncertainty with calibration diversity: the former measures category directionality and structural stability during local feature fusion, and the latter applies uncertainty-weighted clustering and then calibrates diversity to balance global diversity with local representativeness. Result: Across 26 experimental settings on 7 fine-grained image datasets, DECERN clearly outperforms existing mainstream methods. Conclusion: DECERN perceives the differences between fine-grained images more effectively and improves active learning under limited annotation budgets. Abstract: Active learning (AL) aims to build high-quality labeled datasets by iteratively selecting the most informative samples from an unlabeled pool under limited annotation budgets. However, in fine-grained image classification, assessing this informativeness is especially challenging due to subtle inter-class differences. In this paper, we introduce a novel method, combining discrepancy-confusion uncertainty and calibration diversity for active fine-grained image classification (DECERN), to effectively perceive the distinctiveness between fine-grained images and evaluate the sample value. DECERN introduces a multifaceted informativeness measure that combines discrepancy-confusion uncertainty and calibration diversity. The discrepancy-confusion uncertainty quantifies the category directionality and structural stability of fine-grained unlabeled data during local feature fusion. Subsequently, uncertainty-weighted clustering is performed to diversify the uncertainty samples. Then we calibrate the diversity to maximize the global diversity of the selected sample while maintaining its local representativeness. Extensive experiments conducted on 7 fine-grained image datasets across 26 distinct experimental settings demonstrate that our method achieves superior performance compared to state-of-the-art methods.

[417] Tumor Synthesis conditioned on Radiomics

Jonghun Kim,Inye Na,Eun Sook Ko,Hyunjin Park

Main category: cs.CV

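TL;DR: Proposes a tumor generation model conditioned on radiomics features that combines a GAN and a diffusion model to generate tumor masks and textures in 3D medical images. It supports tumor removal, editing, and repositioning, applies to multiple organs and modalities, and aids downstream-task training and treatment planning.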

Details

Motivation: Privacy concerns make large 3D medical imaging datasets hard to obtain, and existing generative models are limited in output diversity, so they cannot accurately represent 3D medical images. Method: A GAN generates tumor masks and a diffusion model generates tumor texture conditioned on radiomics features (such as size, shape, and texture), giving controllable tumor image generation. Result: The model is validated on CT and MRI data from four organs (kidney, lung, breast, and brain); the generated images are realistic, their authenticity was assessed by experts, and they effectively aid training for downstream tasks. Conclusion: The method generates diverse, realistic tumor images, supports clinical visualization, data augmentation, and personalized treatment planning, and has potential for broad use in medical image analysis. Abstract: Due to privacy concerns, obtaining large datasets is challenging in medical image analysis, especially with 3D modalities like Computed Tomography (CT) and Magnetic Resonance Imaging (MRI). Existing generative models, developed to address this issue, often face limitations in output diversity and thus cannot accurately represent 3D medical images. We propose a tumor-generation model that utilizes radiomics features as generative conditions. Radiomics features are high-dimensional handcrafted semantic features that are biologically well-grounded and thus are good candidates for conditioning. Our model employs a GAN-based model to generate tumor masks and a diffusion-based approach to generate tumor texture conditioned on radiomics features. Our method allows the user to generate tumor images according to user-specified radiomics features such as size, shape, and texture at an arbitrary location. This enables the physicians to easily visualize tumor images to better understand tumors according to changing radiomics features. Our approach allows for the removal, manipulation, and repositioning of tumors, generating various tumor types in different scenarios. The model has been tested on tumors in four different organs (kidney, lung, breast, and brain) across CT and MRI. The synthesized images are shown to effectively aid in training for downstream tasks and their authenticity was also evaluated through expert evaluations. Our method has potential usage in treatment planning with diverse synthesized tumors.

[418] Simulating Post-Neoadjuvant Chemotherapy Breast Cancer MRI via Diffusion Model with Prompt Tuning

Jonghun Kim,Hyunjin Park

Main category: cs.CV

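TL;DR: This study uses maximum intensity projection images from DCE-MRI, together with a diffusion model and prompt tuning, to generate images from pre-treatment scans that predict tumor changes after neoadjuvant chemotherapy (NAC) for breast cancer; the generated images reflect changes in tumor size more accurately than those of other generative models.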

Details

Motivation: Accurately predicting how breast cancer patients respond to neoadjuvant chemotherapy (NAC) helps optimize treatment plans, but models that incorporate clinical factors and generate accurate post-treatment images are lacking. Method: Using maximum intensity projection images from DCE-MRI, a diffusion model generates post-treatment images (3 or 12 weeks after NAC) from pre-treatment images, and prompt tuning incorporates the key clinical factors that affect NAC response. Result: The model outperforms other generative models on image quality metrics and better reflects changes in tumor size associated with pathological complete response (pCR); an ablation study confirms the design choices. Conclusion: The proposed diffusion model with prompt tuning predicts tumor changes after NAC more accurately and could help advance precision medicine for breast cancer. Abstract: Neoadjuvant chemotherapy (NAC) is a common therapy option before the main surgery for breast cancer. Response to NAC is monitored using follow-up dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI). Accurate prediction of NAC response helps with treatment planning. Here, we adopt maximum intensity projection images from DCE-MRI to generate post-treatment images (i.e., 3 or 12 weeks after NAC) from pre-treatment images leveraging the emerging diffusion model. We introduce prompt tuning to account for the known clinical factors affecting response to NAC. Our model performed better than other generative models in image quality metrics. Our model was better at generating images that reflected changes in tumor size according to pCR compared to other models. An ablation study confirmed the design choices of our method. Our study has the potential to help with precision medicine.

[419] Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection

Sojung An,Kwanyong Park,Yong Jae Lee,Donghyun Kim

Main category: cs.CV

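TL;DR: This paper proposes TaSe, a new framework that decomposes text representations into core components (objects, attributes, and relations) and re-aggregates them hierarchically, improving language-based object detection by vision-language models on complex queries, with a 24% performance gain on the OmniLabel benchmark.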

Details

Motivation: The text encoders of existing vision-language models struggle to separate target objects from their descriptive attributes and relations in complex queries, which hurts fine-grained multimodal perception. Method: The TaSe framework has three parts: a hierarchical synthetic captioning dataset; the "Talk in Pieces" disentanglement module, which uses a newly designed disentanglement loss to decompose text embeddings into subspace representations; and the "See in Whole" aggregation module, which learns structured representations through hierarchical objectives. Result: A 24% performance improvement on the OmniLabel benchmark, which supports the importance of linguistic compositionality for fine-grained multimodal representations. Conclusion: By introducing an inductive bias toward linguistic structure, TaSe strengthens the ability of vision-language models to understand complex language queries and improves language-driven object detection. Abstract: While vision-language models (VLMs) have made significant progress in multimodal perception (e.g., open-vocabulary object detection) with simple language queries, state-of-the-art VLMs still show limited ability to perceive complex queries involving descriptive attributes and relational clauses. Our in-depth analysis shows that these limitations mainly stem from text encoders in VLMs. Such text encoders behave like bags-of-words and fail to separate target objects from their descriptive attributes and relations in complex queries, resulting in frequent false positives. To address this, we propose restructuring linguistic representations according to the hierarchical relations within sentences for language-based object detection. A key insight is the necessity of disentangling textual tokens into core components, namely objects, attributes, and relations ("talk in pieces"), and subsequently aggregating them into hierarchically structured sentence-level representations ("see in whole"). Building on this principle, we introduce the TaSe framework with three main contributions: (1) a hierarchical synthetic captioning dataset spanning three tiers from category names to descriptive sentences; (2) Talk in Pieces, a three-component disentanglement module guided by a novel disentanglement loss function that transforms text embeddings into subspace compositions; and (3) See in Whole, which learns to aggregate disentangled components into hierarchically structured embeddings under the guidance of the proposed hierarchical objectives. The proposed TaSe framework strengthens the inductive bias of hierarchical linguistic structures, resulting in fine-grained multimodal representations for language-based object detection. Experimental results on the OmniLabel benchmark show a 24% performance improvement, demonstrating the importance of linguistic compositionality.

[420] An Efficient 3D Latent Diffusion Model for T1-contrast Enhanced MRI Generation

Zach Eidex,Mojtaba Safari,Jie Ding,Richard Qiu,Justin Roper,David Yu,Hui-Kuo Shu,Zhen Tian,Hui Mao,Xiaofeng Yang

Main category: cs.CV

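TL;DR: Proposes a deep learning framework named 3D latent rectified flow (T1C-RFlow) that generates high-quality T1 contrast-enhanced images from contrast-free multiparametric MRI, with gains in both speed and quality over existing models.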

Details

Motivation: Gadolinium-based contrast agents (GBCAs) help visualize lesions in T1-weighted MRI but carry a risk of nephrogenic systemic fibrosis and are administered inconsistently, so a method that produces T1 contrast-enhanced images without contrast agent is needed. Method: A pretrained autoencoder extracts latent representations of T1w and T2-FLAIR images, and a 3D rectified flow diffusion model (T1C-RFlow) is trained in that latent space, using the glioma, meningioma, and metastases data of BraTS 2024 for training and validation. Result: T1C-RFlow outperforms benchmark models (pix2pix, DDPM, DiT-3D) on several metrics, achieves low NMSE and high SSIM on glioma, meningioma, and metastases, gives the best tumor reconstruction, and needs only 6.9 s per volume for denoising, far faster than conventional DDPM models. Conclusion: T1C-RFlow efficiently generates realistic synthetic T1 contrast-enhanced images and may eventually enable contrast-agent-free MRI for brain tumors, with potential for clinical use. Abstract: Objective: Gadolinium-based contrast agents (GBCAs) are commonly employed with T1w MRI to enhance lesion visualization but are restricted in patients at risk of nephrogenic systemic fibrosis, and variations in GBCA administration can introduce imaging inconsistencies. This study develops an efficient 3D deep-learning framework to generate T1-contrast enhanced images (T1C) from pre-contrast multiparametric MRI. Approach: We propose the 3D latent rectified flow (T1C-RFlow) model for generating high-quality T1C images. First, T1w and T2-FLAIR images are input into a pretrained autoencoder to acquire an efficient latent space representation. A rectified flow diffusion model is then trained in this latent space representation. The T1C-RFlow model was trained on a curated dataset comprised of the BraTS 2024 glioma (GLI; 1480 patients), meningioma (MEN; 1141 patients), and metastases (MET; 1475 patients) datasets. Selected patients were split into train (N=2860), validation (N=612), and test (N=614) sets. Results: Both qualitative and quantitative results demonstrate that the T1C-RFlow model outperforms benchmark 3D models (pix2pix, DDPM, Diffusion Transformers (DiT-3D)) trained in the same latent space. T1C-RFlow achieved the following metrics. GLI: NMSE 0.044 +/- 0.047, SSIM 0.935 +/- 0.025; MEN: NMSE 0.046 +/- 0.029, SSIM 0.937 +/- 0.021; MET: NMSE 0.098 +/- 0.088, SSIM 0.905 +/- 0.082. T1C-RFlow had the best tumor reconstruction performance and significantly faster denoising times (6.9 s/volume, 200 steps) than conventional DDPM models in both latent space (37.7 s, 1000 steps) and patch-based in image space (4.3 hr/volume). Significance: Our proposed method generates synthetic T1C images that closely resemble ground truth T1C in much less time than previous diffusion models. Further development may permit a practical method for contrast-agent-free MRI for brain tumors.
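
Sampling from a rectified flow model amounts to integrating an ordinary differential equation from noise (t = 0) to data (t = 1). The sketch below shows fixed-step Euler integration with 200 steps, the step count quoted in the abstract. The velocity function is a toy stand-in that points straight at a known target, not the trained 3D network, and all names here are hypothetical.

```python
def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def make_toy_velocity(target):
    """Toy stand-in for a trained network: always points straight at `target`."""
    def velocity(x, t):
        # On a straight path from x to target, the remaining displacement
        # divided by the remaining time gives the constant path velocity.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity

target_latent = [1.0, 0.0, -1.0]   # made-up "data" latent
noise_latent = [0.5, -1.0, 2.0]    # made-up starting noise
sample = euler_sample(make_toy_velocity(target_latent), noise_latent, steps=200)
```

Because rectified flow is trained to make these paths close to straight, a coarse Euler discretization loses little accuracy, which is consistent with the abstract reporting fewer steps and shorter denoising times than DDPM.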

[421] UniVid: The Open-Source Unified Video Model

Jiabin Luo,Junhui Lin,Zeyu Zhang,Biao Wu,Meng Fang,Ling Chen,Hao Tang

Main category: cs.CV

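TL;DR: This paper proposes UniVid, a unified video modeling architecture that couples an MLLM with a diffusion decoder through a lightweight adapter for both video understanding and generation. It introduces Temperature Modality Alignment and Pyramid Reflection to improve the semantic faithfulness of generation and the efficiency of temporal reasoning, and reaches state-of-the-art performance on several benchmarks.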

Details

Motivation: Existing unified video modeling methods suffer from semantic distortion during generation caused by text-visual token imbalance, and uniform cross-modal attention performs poorly along the flow trajectory; extending image MLLMs to video also usually requires costly retraining. An efficient and semantically consistent unified architecture is therefore needed. Method: UniVid connects an MLLM to a diffusion decoder through a lightweight adapter. Temperature Modality Alignment (TMA) improves prompt adherence, and Pyramid Reflection (PR) enables efficient temporal reasoning through dynamic keyframe selection. Result: The VBench-Long total score improves by 2.2% over EasyAnimateV5.1, and accuracy on MSVD-QA and ActivityNet-QA improves by 1.0% and 3.3%, respectively, over the best prior 7B baselines. Conclusion: UniVid unifies video understanding and generation efficiently; TMA and PR address semantic faithfulness and temporal reasoning efficiency, and image MLLMs can be extended to video tasks without large-scale retraining. Abstract: Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: maintaining semantic faithfulness during flow-based generation due to text-visual token imbalance and the limitations of uniform cross-modal attention across the flow trajectory, and efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance, achieving a 2.2% improvement on VBench-Long total score compared to EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA, respectively, compared with the best prior 7B baselines.

[422] BALR-SAM: Boundary-Aware Low-Rank Adaptation of SAM for Resource-Efficient Medical Image Segmentation

Zelin Liu,Sicheng Dong,Bocheng Li,Yixuan Yang,Jiacheng Ruan,Chenxu Zhou,Suncheng Xiang

Main category: cs.CV

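TL;DR: Proposes BALR-SAM, a boundary-aware low-rank adaptation framework for medical image segmentation that outperforms several SOTA methods while fine-tuning only 1.8% of the parameters.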

Details

Motivation: Vision foundation models such as SAM perform poorly on medical image segmentation because they lack domain adaptation, and full fine-tuning is resource-intensive and inefficient. Method: BALR-SAM has three components: (1) a Complementary Detail Enhancement Network (CDEN) that uses depthwise separable convolutions and multi-scale fusion to capture boundary features; (2) low-rank adapters inserted into SAM's ViT blocks to optimize medical feature representation; and (3) a low-rank tensor attention mechanism in the mask decoder that lowers memory consumption and speeds up inference. Result: On standard medical segmentation datasets, BALR-SAM, which needs no prompts, outperforms several SOTA methods including MedSAM, cuts memory usage by 75%, speeds up inference, and updates only 1.8% (11.7M) of the parameters. Conclusion: Through efficient parameter adaptation, BALR-SAM improves SAM's medical image segmentation performance while combining high accuracy, low resource consumption, and fast inference, which suits clinical settings. Abstract: Vision foundation models like the Segment Anything Model (SAM), pretrained on large-scale natural image datasets, often struggle in medical image segmentation due to a lack of domain-specific adaptation. In clinical practice, fine-tuning such models efficiently for medical downstream tasks with minimal resource demands, while maintaining strong performance, is challenging. To address these issues, we propose BALR-SAM, a boundary-aware low-rank adaptation framework that enhances SAM for medical imaging. It combines three tailored components: (1) a Complementary Detail Enhancement Network (CDEN) using depthwise separable convolutions and multi-scale fusion to capture boundary-sensitive features essential for accurate segmentation; (2) low-rank adapters integrated into SAM's Vision Transformer blocks to optimize feature representation and attention for medical contexts, while simultaneously significantly reducing the parameter space; and (3) a low-rank tensor attention mechanism in the mask decoder, cutting memory usage by 75% and boosting inference speed. Experiments on standard medical segmentation datasets show that BALR-SAM, without requiring prompts, outperforms several state-of-the-art (SOTA) methods, including fully fine-tuned MedSAM, while updating just 1.8% (11.7M) of its parameters.
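
The parameter saving from low-rank adapters follows from simple counting. The sketch below shows a generic low-rank update of a frozen weight matrix and the resulting trainable fraction. It assumes the usual formulation y = W x + B (A x); the matrix sizes, the rank, and the function names are illustrative and are not taken from BALR-SAM, so the computed fraction is not meant to reproduce the 1.8% figure.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w, a, b, x, scale=1.0):
    """y = W x + scale * B (A x), with W frozen and only A, B trainable."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + scale * u for y, u in zip(base, update)]

def trainable_fraction(d_out, d_in, rank):
    """Parameters in A and B relative to the frozen d_out x d_in matrix."""
    return rank * (d_in + d_out) / (d_out * d_in)

w = [[1.0, 2.0], [3.0, 4.0]]       # frozen 2x2 weight
a = [[1.0, 1.0]]                   # rank-1 down-projection (1x2)
x = [1.0, 1.0]

# With B initialized to zero, the adapted layer matches the frozen layer.
y_init = lora_forward(w, a, [[0.0], [0.0]], x)
# After B is updated, the output shifts within a rank-1 subspace.
y_tuned = lora_forward(w, a, [[1.0], [2.0]], x)

# For a 768x768 projection at rank 4 (illustrative sizes).
fraction = trainable_fraction(768, 768, 4)
```

Starting B at zero is a common choice because fine-tuning then begins exactly from the pretrained model's behavior.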

[423] Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos

Yingdong Hu,Yisheng He,Jinnan Chen,Weihao Yuan,Kejie Qiu,Zehong Lin,Siyu Zhu,Zilong Dong,Jun Zhang

Main category: cs.CV

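TL;DR: Proposes Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs spatio-temporally consistent dynamic 3D humans from uncalibrated sparse multi-view videos and supports synthesis at arbitrary times and viewpoints.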

Details

Motivation: Existing methods for dynamic 3D human reconstruction are either slow or unable to generate novel time frames, and they struggle with uncalibrated sparse-view input, so an efficient solution that supports temporal interpolation is needed. Method: 4D reconstruction and interpolation are cast as a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. Static 3D Gaussians are first reconstructed from sparse-view images, and learnable state tokens enforce temporal consistency by updating information shared across timestamps. A motion prediction module predicts dense motion for each 3D Gaussian between adjacent frames and, together with occlusion-aware Gaussian fusion, interpolates at arbitrary timestamps. Training uses a self-supervised retargeting loss and an optical flow loss and needs no ground-truth dense motion labels. Result: Experiments show that Forge4D outperforms existing methods on both in-domain and out-of-domain datasets, delivering fast, high-quality 4D human reconstruction and novel-view, novel-time synthesis, with inference fast enough for real-time applications. Conclusion: By jointly modeling streaming 3D Gaussian reconstruction and dense motion prediction, Forge4D addresses 4D reconstruction and temporal interpolation of dynamic humans from uncalibrated sparse views, and is both efficient and robust. Abstract: Instant reconstruction of dynamic 3D humans from uncalibrated sparse-view videos is critical for numerous downstream applications. Existing methods, however, are either limited by slow reconstruction speeds or incapable of generating novel-time representations. To address these challenges, we propose Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs temporally aligned representations from uncalibrated sparse-view videos, enabling both novel view and novel time synthesis. Our model simplifies the 4D reconstruction and interpolation problem as a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. For the task of streaming 3D Gaussian reconstruction, we first reconstruct static 3D Gaussians from uncalibrated sparse-view images and then introduce learnable state tokens to enforce temporal consistency in a memory-friendly manner by interactively updating shared information across different timestamps. For novel time synthesis, we design a novel motion prediction module to predict dense motions for each 3D Gaussian between two adjacent frames, coupled with an occlusion-aware Gaussian fusion process to interpolate 3D Gaussians at arbitrary timestamps. To overcome the lack of ground truth for dense motion supervision, we formulate dense motion prediction as a dense point matching task and introduce a self-supervised retargeting loss to optimize this module. An additional occlusion-aware optical flow loss is introduced to ensure motion consistency with plausible human movement, providing stronger regularization. Extensive experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets. Project page and code at: https://zhenliuzju.github.io/huyingdong/Forge4D.

[424] Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis

Xuecheng Wu,Junxiao Xue,Xinyi Yin,Yunyun Shi,Liangyu Fu,Danlei Huang,Yifan Wang,Jia Zhang,Jiayu Nie,Jun Wang

Main category: cs.CV

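TL;DR: This paper proposes AVF-MAE++, a family of audio-visual masked autoencoder models for affective video facial analysis (AVFA). A dual masking strategy and an iterative audio-visual correlation learning module improve cross-modal correlation modeling and scalability, and the models reach state-of-the-art performance on multiple tasks across 17 datasets.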

Details

Motivation: AVFA is limited by data scarcity and by the difficulty of capturing correlations between the audio and visual modalities; existing self-supervised methods fall short in scalability and cross-modal modeling. Method: AVF-MAE++ introduces a dual masking strategy, a strengthened modality encoder design, an Iterative Audio-Visual Correlation Learning Module, and a three-stage progressive semantic injection training strategy to improve scalability and cross-modal correlation modeling. Result: Experiments on 17 datasets show state-of-the-art performance on three major AVFA tasks, and ablation studies confirm the contribution of each component. Conclusion: Through systematic design, AVF-MAE++ improves audio-visual self-supervised learning for affective video analysis, shows the value of combining a scalable architecture with strong cross-modal modeling, and provides a useful framework for future AVFA research. Abstract: Affective video facial analysis (AVFA) has emerged as a key research field for building emotion-aware intelligent systems, yet this field continues to suffer from limited data availability. In recent years, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained momentum, with growing adaptations in its audio-visual contexts. While scaling has proven essential for breakthroughs in general multi-modal learning domains, its specific impact on AVFA remains largely unexplored. Another core challenge in this field is capturing both intra- and inter-modal correlations through scalable audio-visual representations. To tackle these issues, we propose AVF-MAE++, a family of audio-visual MAE models designed to efficiently investigate the scaling properties in AVFA while enhancing cross-modal correlation modeling. Our framework introduces a novel dual masking strategy across audio and visual modalities and strengthens modality encoders with a more holistic design to better support scalable pre-training. Additionally, we present the Iterative Audio-Visual Correlation Learning Module, which improves correlation learning within the SSL paradigm, bridging the limitations of previous methods. To support smooth adaptation and reduce overfitting risks, we further introduce a progressive semantic injection strategy, organizing the model training into three structured stages. Extensive experiments conducted on 17 datasets, covering three major AVFA tasks, demonstrate that AVF-MAE++ achieves consistent state-of-the-art performance across multiple benchmarks. Comprehensive ablation studies further highlight the importance of each proposed component and provide deeper insights into the design choices driving these improvements. Our code and models have been publicly released on GitHub.

[425] EVLF-FM: Explainable Vision Language Foundation Model for Medicine

Yang Bai,Haoran Cheng,Yang Zhou,Jun Zhou,Arun Thirunavukarasu,Yuhe Ke,Jie Yao,Kanae Fukutsu,Chrystie Wan Ning Quek,Ashley Hong,Laura Gutierrez,Zhen Ling Teo,Darren Shu Jeng Ting,Brian T. Soetikno,Christopher S. Nielsen,Tobias Elze,Zengxiang Li,Linh Le Dinh,Hiok Hong Chan,Victor Koh,Marcus Tan,Kelvin Z. Li,Leonard Yip,Ching Yu Cheng,Yih Chung Tham,Gavin Siew Wei Tan,Leopold Schmetterer,Marcus Ang,Rahat Hussain,Jod Mehta,Tin Aung,Lionel Tim-Ee Cheng,Tran Nguyen Tuan Anh,Chee Leong Cheng,Tien Yin Wong,Nan Liu,Iain Beehuat Tan,Soon Thye Lim,Eyal Klang,Tony Kiat Hon Lim,Rick Siow Mong Goh,Yong Liu,Daniel Shu Wei Ting

Main category: cs.CV

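TL;DR: EVLF-FM is a multimodal vision-language foundation model with diagnostic capability across many medical imaging modalities and pixel-level explainability; it shows strong accuracy and reasoning ability in both internal and external validation.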

Details

Motivation: Existing medical AI models are usually limited to a single modality and lack a transparent reasoning process, which restricts clinical use; a unified model with both broad diagnostic capability and fine-grained explainability is needed. Method: EVLF-FM uses a hybrid training strategy (supervised learning combined with visual reinforcement fine-tuning) and supports multi-disease diagnosis, visual question answering, and pixel-level visual grounding and reasoning. The model was trained on more than 1.3 million samples from 23 global datasets and externally validated on 10 additional datasets. Result: In internal validation, EVLF-FM reaches an average accuracy of 0.858 and an F1-score of 0.797, outperforming leading generalist and specialist models; in medical visual grounding it reaches an average mIOU of 0.743 and Acc@0.5 of 0.837; external validation shows strong zero-shot and few-shot performance. Conclusion: EVLF-FM is an early multi-disease VLM with explainability and reasoning capabilities, which may help build trust in and adoption of foundation models for clinical deployment. Abstract: Despite the promise of foundation models in medical AI, current systems remain limited: they are modality-specific and lack transparent reasoning processes, hindering clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grained explainability. The development and testing of EVLF-FM encompassed over 1.3 million total samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation employed 8,884 independent test samples from 10 additional datasets across five imaging modalities. Technically, EVLF-FM is developed to assist with multiple disease diagnosis and visual question answering with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnostics, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797), outperforming leading generalist and specialist models. In medical visual grounding, EVLF-FM also achieved stellar performance across nine modalities with average mIOU of 0.743 and Acc@0.5 of 0.837. External validations further confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning outputs with visual evidence. EVLF-FM is an early multi-disease VLM with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.
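
The two grounding metrics quoted in the abstract, mIOU and Acc@0.5, can be illustrated with axis-aligned boxes. This is a generic sketch: it assumes that mIOU is the mean box IoU over samples and that Acc@0.5 counts a prediction as correct when its IoU is at least 0.5. The abstract does not spell out either convention, and the boxes are made-up values.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, targets, threshold=0.5):
    """Return (mean IoU, fraction of samples with IoU >= threshold)."""
    ious = [box_iou(p, t) for p, t in zip(preds, targets)]
    miou = sum(ious) / len(ious)
    acc = sum(iou >= threshold for iou in ious) / len(ious)
    return miou, acc

# Two made-up predictions: one exact match, one shifted by half its width.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (5, 0, 15, 10)]
miou, acc_at_half = grounding_metrics(preds, targets)
```

The shifted box overlaps half of the target's area yet has an IoU of only one third, which shows how strict the 0.5 threshold is in practice.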

[426] FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation

Seungwook Kim,Seunghyeon Lee,Minsu Cho

Main category: cs.CV

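TL;DR: Proposes two training-free, inference-time techniques that dynamically adjust classifier-free guidance strength and noise initialization to improve action coherence and visual quality in diffusion-based robot video generation.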

Details

Motivation: To make better use of explicit action parameters in diffusion-based robot video generation and to improve the action consistency and controllability of generated videos. Method: Action-scaled classifier-free guidance and action-scaled noise truncation actively use the action vectors at inference time to adjust guidance strength and the initial noise distribution. Result: Experiments on real robot manipulation datasets show that the proposed techniques significantly improve action coherence and visual quality. Conclusion: The two training-free methods strengthen action control in diffusion-based robot video generation and apply across a range of robot environments. Abstract: Generating realistic robot videos from explicit action trajectories is a critical step toward building effective world models and robotics foundation models. We introduce two training-free, inference-time techniques that fully exploit explicit action parameters in diffusion-based robot video generation. Instead of treating action vectors as passive conditioning signals, our methods actively incorporate them to guide both the classifier-free guidance process and the initialization of Gaussian latents. First, action-scaled classifier-free guidance dynamically modulates guidance strength in proportion to action magnitude, enhancing controllability over motion intensity. Second, action-scaled noise truncation adjusts the distribution of initially sampled noise to better align with the desired motion dynamics. Experiments on real robot manipulation datasets demonstrate that these techniques significantly improve action coherence and visual quality across diverse robot environments.
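
The idea of scaling guidance with action magnitude can be sketched as follows. The classifier-free guidance combination is written in one common convention, eps_u + w (eps_c - eps_u). The rule that maps an action vector to a weight is an assumption made for illustration (linear in the Euclidean norm, with hypothetical `base_weight` and `gain` parameters); the abstract only says that strength is modulated in proportion to action magnitude.

```python
import math

def action_scaled_guidance_weight(action, base_weight, gain):
    """Hypothetical rule: guidance weight grows linearly with the action norm."""
    magnitude = math.sqrt(sum(a * a for a in action))
    return base_weight * (1.0 + gain * magnitude)

def classifier_free_guidance(eps_uncond, eps_cond, weight):
    """Combine predictions as eps_u + w * (eps_c - eps_u)."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]

action = [3.0, 4.0]                      # made-up action vector, norm 5
weight = action_scaled_guidance_weight(action, base_weight=2.0, gain=0.1)
guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], weight)
```

With a zero action the weight falls back to `base_weight`, so small motions are guided gently and large motions are pushed harder toward the action-conditioned prediction.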

[427] When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs

Jinming Liu,Zhaoyang Jia,Jiahao Li,Bin Li,Xin Jin,Wenjun Zeng,Yan Lu

Main category: cs.CV

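TL;DR: Proposes CoTAM, an image codec tailored to multimodal large language models (MLLMs) that adaptively protects multi-level features, substantially reducing bandwidth while maintaining downstream task performance.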

Details

Motivation: Conventional image codecs are optimized for the human visual system and are ill-suited to the diverse downstream tasks of MLLMs; existing compression methods perform poorly in MLLM settings, so a codec designed for MLLMs is needed. Method: The authors first analyze how compression artifacts affect mainstream MLLMs and find that distortion impacts features at different levels unevenly. Based on this, they propose CoTAM: the encoder uses CLIP's shallow-layer attention to generate an importance map for bit allocation, and the decoder combines a lightweight adapter with a multi-level loss to reconstruct both low-level details and high-level semantics faithfully. Result: Experiments show that CoTAM saves up to 35.99% bitrate at the same MLLM task performance, outperforming previous state-of-the-art neural codecs. Conclusion: CoTAM is an efficient MLLM-oriented image compression framework whose adaptive multi-level feature protection reduces transmission bandwidth while preserving multimodal understanding performance, offering a practical way to compress inputs sent from edge devices to MLLMs. Abstract: The increasing deployment of powerful Multimodal Large Language Models (MLLMs), typically hosted on cloud platforms, urgently requires effective compression techniques to efficiently transmit signal inputs (e.g., images, videos) from edge devices with minimal bandwidth usage. However, conventional image codecs are optimized for fidelity to serve the Human Visual System (HVS) and ill-suited for MLLMs, in which diverse downstream tasks are jointly considered. In this paper, we first systematically analyze the impact of compression artifacts on several mainstream MLLMs. We find that: Compression distortion unevenly impacts different-level image features, leading to varying effects on MLLMs' downstream tasks depending on their feature-level reliance. Motivated by this discovery, we propose an image Codec TAilored to MLLMs (CoTAM) designed to adaptively protect multi-level features and suit different demands of downstream tasks. The encoder leverages CLIP's shallow-layer attention to generate an importance map for bit allocation, preserving critical semantic regions. Concurrently, the decoder integrates a lightweight adapter with a multi-level loss function to ensure the faithful reconstruction of both low-level details and high-level semantic context for robust synthesis of cross-level features. Extensive experiments validate that our method achieves up to 35.99\% bitrate saving while maintaining the same performance on the MLLM tasks, outperforming previous SOTA neural codecs.

[428] S$^2$NN: Sub-bit Spiking Neural Networks

Wenjie Wei,Malu Zhang,Jieyuan Zhang,Ammar Belatreche,Shuai Wang,Yimeng Shan,Hanwen Liu,Honglin Cao,Guoqing Wang,Yang Yang,Haizhou Li

Main category: cs.CV

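TL;DR: Proposes sub-bit spiking neural networks (S$^2$NNs), which represent weights with less than one bit for compression and acceleration; combined with OS-Quant and MPFD, which mitigate outlier-induced bias and improve performance, they outperform existing quantized SNNs on vision and non-vision tasks.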

Details

Motivation: To further improve the deployment efficiency of spiking neural networks (SNNs) on resource-limited devices by exploring their compression and acceleration potential, reducing storage and computation beyond what binary SNNs achieve. Method: First, an S$^2$NN baseline is built from the clustering patterns of kernels in well-trained binary SNNs. Next, outlier-aware sub-bit weight quantization (OS-Quant) optimizes codeword selection by identifying and adaptively scaling outliers. Finally, membrane potential-based feature distillation (MPFD) uses a teacher model to give more precise guidance and improve highly compressed S$^2$NNs. Result: Experiments on vision and non-vision tasks show that S$^2$NN outperforms existing quantized SNNs in both performance and efficiency. Conclusion: By representing weights with less than one bit and combining OS-Quant and MPFD, S$^2$NN balances compression ratio and performance well and is promising for efficient SNN deployment in edge computing. Abstract: Spiking Neural Networks (SNNs) offer an energy-efficient paradigm for machine intelligence, but their continued scaling poses challenges for resource-limited deployment. Despite recent advances in binary SNNs, the storage and computational demands remain substantial for large-scale networks. To further explore the compression and acceleration potential of SNNs, we propose Sub-bit Spiking Neural Networks (S$^2$NNs) that represent weights with less than one bit. Specifically, we first establish an S$^2$NN baseline by leveraging the clustering patterns of kernels in well-trained binary SNNs. This baseline is highly efficient but suffers from \textit{outlier-induced codeword selection bias} during training. To mitigate this issue, we propose an \textit{outlier-aware sub-bit weight quantization} (OS-Quant) method, which optimizes codeword selection by identifying and adaptively scaling outliers. Furthermore, we propose a \textit{membrane potential-based feature distillation} (MPFD) method, improving the performance of highly compressed S$^2$NN via more precise guidance from a teacher model. Extensive results on vision and non-vision tasks reveal that S$^2$NN outperforms existing quantized SNNs in both performance and efficiency, making it promising for edge computing applications.

[429] Cycle Diffusion Model for Counterfactual Image Generation

Fangrui Huang,Alan Wang,Binxu Li,Bailey Trang,Ridvan Yesiloglu,Tianyu Hua,Wei Peng,Ehsan Adeli

Main category: cs.CV

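TL;DR: Proposes the Cycle Diffusion Model (CDM), a cycle training framework for diffusion models that improves conditioning faithfulness and image quality in medical image generation; experiments on 3D brain MRI data show improved results.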

Details

Motivation: Existing deep generative models for medical image synthesis struggle with conditioning faithfulness and image quality, especially in direct and counterfactual generation. Method: The Cycle Diffusion Model (CDM) adds cycle constraints that enforce consistency between generated and original images, fine-tuning a diffusion model within a cycle training framework to improve conditioning adherence and image realism. Result: On a combined 3D brain MRI dataset (ABCD, HCP, ADNI, and PPMI), CDM improves conditioning accuracy and image quality as measured by FID and SSIM. Conclusion: The cycle strategy in CDM can be an effective way to refine diffusion-based medical image generation, with applications in data augmentation, counterfactual generation, and disease progression modeling. Abstract: Deep generative models have demonstrated remarkable success in medical image synthesis. However, ensuring conditioning faithfulness and high-quality synthetic images for direct or counterfactual generation remains a challenge. In this work, we introduce a cycle training framework to fine-tune diffusion models for improved conditioning adherence and enhanced synthetic image realism. Our approach, Cycle Diffusion Model (CDM), enforces consistency between generated and original images by incorporating cycle constraints, enabling more reliable direct and counterfactual generation. Experiments on a combined 3D brain MRI dataset (from ABCD, HCP aging & young adults, ADNI, and PPMI) show that our method improves conditioning accuracy and enhances image quality as measured by FID and SSIM. The results suggest that the cycle strategy used in CDM can be an effective method for refining diffusion-based medical image generation, with applications in data augmentation, counterfactual, and disease progression modeling.

[430] Skeleton-based Robust Registration Framework for Corrupted 3D Point Clouds

Yongqiang Wang,Weigang Li,Wenping Liu,Zhiqiang Tian,Jinling Li

Main category: cs.CV

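TL;DR: Proposes a skeleton-based robust registration framework (SRRF) for point clouds that introduces a corruption-resilient skeletal representation, combines the transformations from point cloud and skeleton alignment, and adds a distribution distance loss to improve accuracy and robustness under density distortions, noise, and geometric deformations.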

Details

Motivation: Existing point cloud registration methods are sensitive to degradation caused by sensor limitations, environmental noise, and preprocessing errors, which makes accurate alignment difficult, so a more robust method is needed. Method: The SRRF framework brings skeletal structures into the registration process, combines the transformations obtained from aligning the corrupted point cloud and from aligning its skeleton, and uses a distribution distance loss to keep the source and target skeletons consistent. Result: Experiments across multiple corruption scenarios show that SRRF outperforms state-of-the-art methods in registration accuracy and robustness. Conclusion: SRRF handles the kinds of point cloud degradation found in real scenes, accounts for both local geometric features and the global stability of the skeleton, and is a promising registration method for 3D perception tasks. Abstract: Point cloud registration is fundamental in 3D vision applications, including autonomous driving, robotics, and medical imaging, where precise alignment of multiple point clouds is essential for accurate environment reconstruction. However, real-world point clouds are often affected by sensor limitations, environmental noise, and preprocessing errors, making registration challenging due to density distortions, noise contamination, and geometric deformations. Existing registration methods rely on direct point matching or surface feature extraction, which are highly susceptible to these corruptions and lead to reduced alignment accuracy. To address these challenges, a skeleton-based robust registration framework (SRRF) is presented, which introduces a corruption-resilient skeletal representation to improve registration robustness and accuracy. The framework integrates skeletal structures into the registration process and combines the transformations obtained from both the corrupted point cloud alignment and its skeleton alignment to achieve optimal registration. In addition, a distribution distance loss function is designed to enforce the consistency between the source and target skeletons, which significantly improves the registration performance. This framework ensures that the alignment considers both the original local geometric features and the global stability of the skeleton structure, resulting in robust and accurate registration results. Experimental evaluations on diverse corrupted datasets demonstrate that SRRF consistently outperforms state-of-the-art registration methods across various corruption scenarios, including density distortions, noise contamination, and geometric deformations.
The results confirm the robustness of SRRF in handling corrupted point clouds, making it a potential approach for 3D perception tasks in real-world scenarios.

[431] Robust Partial 3D Point Cloud Registration via Confidence Estimation under Global Context

Yongqiang Wang,Weigang Li,Wenping Liu,Zhe Xu,Zhiqiang Tian

Main category: cs.CV

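TL;DR: Proposes CEGC, a unified, confidence-driven framework for robust partial 3D point cloud registration that jointly models overlap confidence and correspondence reliability to achieve accurate alignment in complex scenes.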

Details

Motivation: Partial point cloud registration is essential for autonomous perception and 3D scene understanding, but it remains challenging because of structural ambiguity, partial visibility, and noise. Method: A hybrid overlap confidence estimation module combines semantic descriptors and geometric similarity to detect overlapping regions and suppress outliers; a context-aware matching strategy uses global attention to assign soft confidence scores to correspondences, which then guide a differentiable weighted SVD solver that computes the transformation. Result: Experiments on ModelNet40, ScanObjectNN, and 7Scenes show that CEGC outperforms state-of-the-art methods in accuracy, robustness, and generalization. Conclusion: CEGC offers an interpretable and scalable solution to partial point cloud registration under challenging conditions. Abstract: Partial point cloud registration is essential for autonomous perception and 3D scene understanding, yet it remains challenging owing to structural ambiguity, partial visibility, and noise. We address these issues by proposing Confidence Estimation under Global Context (CEGC), a unified, confidence-driven framework for robust partial 3D registration. CEGC enables accurate alignment in complex scenes by jointly modeling overlap confidence and correspondence reliability within a shared global context. Specifically, the hybrid overlap confidence estimation module integrates semantic descriptors and geometric similarity to detect overlapping regions and suppress outliers early. The context-aware matching strategy mitigates ambiguity by employing global attention to assign soft confidence scores to correspondences, improving robustness. These scores guide a differentiable weighted singular value decomposition solver to compute precise transformations. This tightly coupled pipeline adaptively down-weights uncertain regions and emphasizes contextually reliable matches. Experiments on ModelNet40, ScanObjectNN, and 7Scenes 3D vision datasets demonstrate that CEGC outperforms state-of-the-art methods in accuracy, robustness, and generalization. Overall, CEGC offers an interpretable and scalable solution to partial point cloud registration under challenging conditions.

[432] ASIA: Adaptive 3D Segmentation using Few Image Annotations

Sai Raj Kishore Perla,Aditya Vora,Sauradip Nag,Ali Mahdavi-Amiri,Hao Zhang

Main category: cs.CV

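TL;DR: ASIA is an adaptive 3D segmentation framework driven by a few annotated in-the-wild images; it uses the priors of text-to-image diffusion models to transfer image-space segmentations to 3D and handles both semantic and non-semantic 3D segmentation.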

Details

Motivation: Existing 3D segmentation methods depend on multi-view images or demanding 3D annotations, which are costly and hard to scale. ASIA aims to segment 3D "parts" that are hard to describe in text, precisely and controllably, using a few 2D image annotations that are easier to obtain. Method: ASIA draws on the priors of text-to-image diffusion models such as Stable Diffusion, optimizes a text token for each segment, and fine-tunes the model with a new cross-view part correspondence loss. At inference it segments multi-view renderings of the 3D mesh, fuses the labels in UV space by voting, refines them with a Noise Optimization technique, and maps them back onto the mesh. Result: ASIA outperforms existing methods by a noticeable margin in both quantitative and qualitative evaluations and handles target objects that differ significantly from the annotated ones in geometry or structure. Conclusion: ASIA is a practical and generalizable solution for semantic and non-semantic 3D segmentation that needs only a few 2D image annotations. Abstract: We introduce ASIA (Adaptive 3D Segmentation using few Image Annotations), a novel framework that enables segmentation of possibly non-semantic and non-text-describable "parts" in 3D. Our segmentation is controllable through a few user-annotated in-the-wild images, which are easier to collect than multi-view images, less demanding to annotate than 3D models, and more precise than potentially ambiguous text descriptions. Our method leverages the rich priors of text-to-image diffusion models, such as Stable Diffusion (SD), to transfer segmentations from image space to 3D, even when the annotated and target objects differ significantly in geometry or structure. During training, we optimize a text token for each segment and fine-tune our model with a novel cross-view part correspondence loss. At inference, we segment multi-view renderings of the 3D mesh, fuse the labels in UV-space via voting, refine them with our novel Noise Optimization technique, and finally map the UV-labels back onto the mesh. ASIA provides a practical and generalizable solution for both semantic and non-semantic 3D segmentation tasks, outperforming existing methods by a noticeable margin in both quantitative and qualitative evaluations.
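
The label-fusion step ("fuse the labels in UV-space via voting") can be sketched as a per-texel majority vote over views. The abstract does not specify the voting scheme, so the plain majority rule, the dictionary-based data layout, the `unlabeled` sentinel, and the function name are all assumptions for illustration.

```python
from collections import Counter

def fuse_labels_by_voting(view_labels, unlabeled=-1):
    """Fuse per-view segmentation labels into one label per UV texel.

    view_labels: list of dicts, one per rendered view, mapping a texel id
    to the label predicted for it in that view (hypothetical layout).
    Texels marked `unlabeled` in a view do not vote in that view.
    """
    votes = {}
    for labels in view_labels:
        for texel, label in labels.items():
            if label == unlabeled:
                continue
            votes.setdefault(texel, Counter())[label] += 1
    # Keep the most frequent label for every texel that received a vote.
    return {texel: counter.most_common(1)[0][0]
            for texel, counter in votes.items()}

views = [
    {0: 1, 1: 2},
    {0: 1, 1: 3},
    {0: 2, 1: 3, 2: -1},
]
fused = fuse_labels_by_voting(views)
```

In this example texel 0 receives label 1 and texel 1 receives label 3, while texel 2 is left out because no view assigned it a label. A texel left out this way would need a fallback in practice, which is where a refinement stage could take over.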

[433] SVGThinker: Instruction-Aligned and Reasoning-Driven Text-to-SVG Generation

Hanqi Chen,Zhongyin Zhao,Ye Chen,Zhujin Liang,Bingbing Ni

Main category: cs.CV

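TL;DR: Proposes SVGThinker, a reasoning-driven text-to-SVG generation framework that improves output quality and instruction following through stepwise generation and multimodal annotation.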

Details

Motivation: To address the weak generalization and poor instruction adherence of existing text-to-SVG generation methods. Method: Each SVG primitive is rendered in sequence and annotated by a multimodal model, yielding a dataset of stepwise updates; a large language model is then trained on this data with supervised fine-tuning so that it exposes chain-of-thought reasoning. Result: Experiments show that SVGThinker produces more stable, more editable, and higher-quality SVGs than existing methods while preserving the structural advantages of vector graphics. Conclusion: SVGThinker improves the robustness and accuracy of text-to-SVG generation and supports precise, hierarchical editing, opening new directions for design and automated graphics generation. Abstract: Scalable Vector Graphics (SVG) is a code-based representation for 2D visuals. Leveraging recent advances in large language models (LLMs), we study text-to-SVG generation and address two persistent gaps: weak generalization and poor adherence to input instructions. We present SVGThinker, a reasoning-driven framework that aligns the production of SVG code with the visualization process and supports the full set of SVG primitives. Our pipeline first renders each primitive in sequence and uses a multimodal model to annotate the image and code; we then build stepwise updates that mirror the incremental addition of primitives. On this data, we train an LLM with supervised fine-tuning that exposes its chain-of-thought as intermediate reasoning, improving robustness and reducing errors and hallucinations. Experiments against state-of-the-art baselines show that SVGThinker produces more stable, editable, and higher-quality SVGs while preserving the structural advantages of vector graphics. Unlike image-based methods, our outputs enable precise and hierarchical editing, opening new directions for design, content creation, and automated graphics generation.

[434] FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting

Zefeng He,Xiaoye Qu,Yafu Li,Siyuan Huang,Daizong Liu,Yu Cheng

Main category: cs.CV

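TL;DR: Proposes FrameThinker, a framework that lets large vision-language models (LVLMs) reason over long videos by iteratively interrogating the video content. It uses a two-phase training strategy (SFT followed by reinforcement learning) and improves results on several long-video understanding and reasoning benchmarks while processing far fewer frames.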

Details

Motivation: Existing LVLMs are limited in long-video understanding by uniform frame sampling and static textual reasoning, which are inefficient and struggle with visually intensive tasks. A more efficient method with dynamic reasoning is needed. Method: FrameThinker lets an LVLM iteratively select key frames for reasoning. Training has two phases: supervised fine-tuning (SFT) first instills basic action capabilities, then reinforcement learning (RL) optimizes the decision-making policy, with a detailed study of the reward design for each action. Result: On reasoning benchmarks such as Video-Holmes and LongVideo-Reason and on long-video understanding benchmarks, FrameThinker improves on baselines by 10.4% on average. The 7B model reaches 76.1% accuracy on LongVideo-Reason with an average of only 20.6 frames, outperforming LongVILA-R1 (72.0%) while using over 20x fewer frames. Conclusion: By adding dynamic frame selection and iterative reasoning, FrameThinker improves both the efficiency and the performance of LVLMs on long-video tasks. Abstract: While Large Vision-Language Models (LVLMs) have achieved substantial progress in video understanding, their application to long video reasoning is hindered by uniform frame sampling and static textual reasoning, which are inefficient and struggle to handle visually intensive video tasks. To overcome these challenges, in this paper, we introduce the concept of thinking with long videos and propose a novel framework FrameThinker. Within this framework, LVLMs are able to iteratively interrogate video content. Developing such video reasoning capabilities in LVLMs presents notable challenges, particularly in adapting the model to new video actions (e.g. select frame), and designing reward functions to guide LVLMs to adopt the newly introduced action. To solve these challenges, we propose a two-phase training strategy, first employing Supervised Fine-Tuning (SFT) to instill fundamental action capabilities, followed by Reinforcement Learning (RL) to optimize a strategic decision-making policy. Notably, in this RL phase, we conduct an in-depth and comprehensive exploration of the reward design for each action and format reward. Extensive experiments on reasoning benchmarks like Video-Holmes, LongVideo-Reason, and long-video understanding benchmarks such as LongVideoBench, MLVU, VideoMME, and LVBench, demonstrate that FrameThinker achieves a significant average improvement of +10.4% over baselines while drastically reducing the number of processed frames.
Most notably, our 7B model, FrameThinker, establishes a new state-of-the-art on LongVideo-Reason, achieving 76.1% accuracy using an average of only 20.6 frames. This not only outperforms the competitive LongVILA-R1 (72.0%) but does so with over 20x fewer frames (vs. 512), demonstrating unparalleled efficiency and effectiveness.

[435] OMeGa: Joint Optimization of Explicit Meshes and Gaussian Splats for Robust Scene-Level Surface Reconstruction

Yuhang Cao,Haojun Yan,Danya Yao

Main category: cs.CV

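TL;DR: Proposes OMeGa, an end-to-end framework that jointly optimizes an explicit triangle mesh and 2D Gaussian splats; a binding strategy, mesh constraints, and normal supervision substantially improve reconstruction accuracy in texture-less indoor scenes.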

Details

Motivation: Existing Gaussian splatting methods reconstruct geometry inaccurately in texture-less regions, and mesh extraction is decoupled from optimization, so mesh geometry is not used to guide splat optimization. Method: OMeGa uses a flexible binding strategy in which the spatial attributes of the Gaussian splats are expressed in the mesh frame while texture attributes stay on the splats. Mesh constraints and monocular normal supervision regularize geometry learning, and a heuristic, iterative mesh-refinement strategy splits high-error faces and prunes unreliable ones. Result: OMeGa achieves state-of-the-art performance on indoor reconstruction benchmarks, reducing Chamfer-$L_1$ by 47.3% over the 2DGS baseline while keeping competitive novel-view rendering quality. Conclusion: OMeGa addresses inaccurate geometry and decoupled optimization in texture-less indoor reconstruction, balancing high-quality mesh reconstruction with rendering quality. Abstract: Neural rendering with Gaussian splatting has advanced novel view synthesis, and most methods reconstruct surfaces via post-hoc mesh extraction. However, existing methods suffer from two limitations: (i) inaccurate geometry in texture-less indoor regions, and (ii) the decoupling of mesh extraction from optimization, thereby missing the opportunity to leverage mesh geometry to guide splat optimization. In this paper, we present OMeGa, an end-to-end framework that jointly optimizes an explicit triangle mesh and 2D Gaussian splats via a flexible binding strategy, where spatial attributes of Gaussian Splats are expressed in the mesh frame and texture attributes are retained on splats. To further improve reconstruction accuracy, we integrate mesh constraints and monocular normal supervision into the optimization, thereby regularizing geometry learning. In addition, we propose a heuristic, iterative mesh-refinement strategy that splits high-error faces and prunes unreliable ones to further improve the detail and accuracy of the reconstructed mesh. OMeGa achieves state-of-the-art performance on challenging indoor reconstruction benchmarks, reducing Chamfer-$L_1$ by 47.3\% over the 2DGS baseline while maintaining competitive novel-view rendering quality. The experimental results demonstrate that OMeGa effectively addresses prior limitations in indoor texture-less reconstruction.

[436] Towards Foundation Models for Cryo-ET Subtomogram Analysis

Runmin Jiang,Wanyue Feng,Yuntian Yang,Shriya Pingulkar,Hong Wang,Xi Xiao,Xiaoyu Cao,Genpei Zhang,Xiao Wang,Xiaolong Wu,Tianyang Wang,Yang Liu,Xingjian Li,Min Xu

Main category: cs.CV

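TL;DR: Takes a first step towards foundation models for cryo-electron tomography (cryo-ET) subtomogram analysis, with a large-scale synthetic data generator (CryoEngine), an equivariance-enhanced Vision Transformer (APT-ViT), and a noise-resilient contrastive learning strategy (NRCL); the approach shows strong performance and generalization across 24 datasets.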

Details

Motivation: Scarce annotations, severe noise, and poor generalization hinder existing cryo-ET subtomogram analysis, so a robust and scalable foundation model is needed to improve structural determination. Method: CryoEngine generates large-scale synthetic data for pretraining; APT-ViT adds an adaptive phase tokenization module to improve robustness to geometric and semantic variations; NRCL applies noise-resilient contrastive learning to stabilize representation learning under severe noise. Result: Evaluations on 24 synthetic and real datasets show state-of-the-art performance on classification, alignment, and averaging, along with strong generalization to unseen datasets. Conclusion: The work is a first step towards foundation models for cryo-ET subtomogram analysis; synthetic-data pretraining, an equivariance-enhanced architecture, and noise-resilient training improve the scalability and robustness of the analysis. Abstract: Cryo-electron tomography (cryo-ET) enables in situ visualization of macromolecular structures, where subtomogram analysis tasks such as classification, alignment, and averaging are critical for structural determination. However, effective analysis is hindered by scarce annotations, severe noise, and poor generalization. To address these challenges, we take the first step towards foundation models for cryo-ET subtomograms. First, we introduce CryoEngine, a large-scale synthetic data generator that produces over 904k subtomograms from 452 particle classes for pretraining. Second, we design an Adaptive Phase Tokenization-enhanced Vision Transformer (APT-ViT), which incorporates adaptive phase tokenization as an equivariance-enhancing module that improves robustness to both geometric and semantic variations. Third, we introduce a Noise-Resilient Contrastive Learning (NRCL) strategy to stabilize representation learning under severe noise conditions. Evaluations across 24 synthetic and real datasets demonstrate state-of-the-art (SOTA) performance on all three major subtomogram tasks and strong generalization to unseen datasets, advancing scalable and robust subtomogram analysis in cryo-ET.

[437] Similarity-Aware Selective State-Space Modeling for Semantic Correspondence

Seungwook Kim,Minsu Cho

Main category: cs.CV

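TL;DR: Proposes MambaMatcher, which uses selective state-space models to model high-dimensional correlations efficiently and achieves state-of-the-art performance on semantic correspondence.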

Details

Motivation: Traditional feature-metric methods may miss complex inter-correlation relationships, while recent correlation-metric methods are limited by the high computational cost of processing 4D correlation maps. Method: A similarity-aware selective scan mechanism, adapted from Mamba's linear-complexity algorithm, refines the 4D correlation map without compromising feature map resolution or receptive field. Result: Experiments on standard semantic correspondence benchmarks show that MambaMatcher achieves state-of-the-art performance. Conclusion: By modeling high-dimensional correlations efficiently with SSMs, MambaMatcher overcomes the limitations of earlier methods and improves semantic matching performance. Abstract: Establishing semantic correspondences between images is a fundamental yet challenging task in computer vision. Traditional feature-metric methods enhance visual features but may miss complex inter-correlation relationships, while recent correlation-metric approaches are hindered by high computational costs due to processing 4D correlation maps. We introduce MambaMatcher, a novel method that overcomes these limitations by efficiently modeling high-dimensional correlations using selective state-space models (SSMs). By implementing a similarity-aware selective scan mechanism adapted from Mamba's linear-complexity algorithm, MambaMatcher refines the 4D correlation map effectively without compromising feature map resolution or receptive field. Experiments on standard semantic correspondence benchmarks demonstrate that MambaMatcher achieves state-of-the-art performance.

[438] TP-MVCC: Tri-plane Multi-view Fusion Model for Silkie Chicken Counting

Sirui Chen,Yuhong Feng,Yifeng Wang,Jianghai Liao,Qi Zhang

Main category: cs.CV

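TL;DR: Proposes TP-MVCC, a tri-plane multi-view chicken counting model that uses geometric projection and feature fusion to improve counting accuracy in heavily occluded scenes.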

Details

Motivation: In crowded farming scenes, occlusions and limited camera views make accurate animal counting difficult for traditional single-view methods. Method: Data are captured with multiple cameras; geometric projection aligns the features from each view and fuses them onto a unified ground plane, and a decoded density map gives the scene-level count. Result: On the authors' multi-view silkie chicken dataset, TP-MVCC reaches 95.1% counting accuracy, significantly outperforming single-view and conventional fusion methods, and remains robust in dense, occluded scenes. Conclusion: TP-MVCC addresses chicken counting in complex farming environments and shows practical potential for intelligent agriculture. Abstract: Accurate animal counting is essential for smart farming but remains difficult in crowded scenes due to occlusions and limited camera views. To address this, we propose a tri-plane-based multi-view chicken counting model (TP-MVCC), which leverages geometric projection and tri-plane fusion to integrate features from multiple cameras onto a unified ground plane. The framework extracts single-view features, aligns them via spatial transformation, and decodes a scene-level density map for precise chicken counting. In addition, we construct the first multi-view dataset of silkie chickens under real farming conditions. Experiments show that TP-MVCC significantly outperforms single-view and conventional fusion comparisons, achieving 95.1\% accuracy and strong robustness in dense, occluded scenarios, demonstrating its practical potential for intelligent agriculture.

[439] Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

Guolin Ke,Hui Xue

Main category: cs.CV

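TL;DR: SphereAR constrains the inputs and outputs of an autoregressive model to a fixed-radius hypersphere, addressing the variance collapse caused by heterogeneous variance in VAE latents. It sets new state-of-the-art results for AR models on ImageNet generation and is, to the authors' knowledge, the first pure next-token AR image generator to surpass diffusion and masked-generation models at comparable parameter scales.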

Details

Motivation: Continuous-token autoregressive image generators are prone to variance collapse during decoding, because heterogeneous variance in VAE latents is amplified by classifier-free guidance (CFG); as a result they trail diffusion and masked-generation models. Method: SphereAR uses a hyperspherical VAE to constrain all AR inputs and outputs, including those after CFG, to a fixed-radius hypersphere (constant $\ell_2$ norm). This removes the scale component and stabilizes AR decoding. Result: On ImageNet generation, SphereAR-H (943M) reaches FID 1.34, SphereAR-L (479M) reaches FID 1.54, and SphereAR-B (208M) reaches FID 1.92, matching or surpassing much larger baselines such as MAR-H and VAR-d30 and, for the first time, surpassing diffusion and masked-generation models at comparable parameter scales. Conclusion: The hyperspherical constraint addresses variance collapse in autoregressive image generation and improves generation quality, showing that pure autoregressive models can compete with strong non-autoregressive models. Abstract: Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs -- including after CFG -- to lie on a fixed-radius hypersphere (constant $\ell_2$ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster order surpasses diffusion and masked-generation models at comparable parameter scales.
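
The fixed-radius constraint can be pictured as a renormalization step applied to every latent token, including the one produced by classifier-free guidance. The sketch below is only an illustration of that idea under stated assumptions: the function names, the standard CFG extrapolation, and the choice to rescale after mixing are not taken from the paper, which may apply the constraint differently.

```python
import math

def project_to_sphere(x, radius=1.0, eps=1e-12):
    """Rescale a vector so that its L2 norm equals `radius`.
    Only the direction of x is kept; the scale component is discarded."""
    norm = math.sqrt(sum(v * v for v in x))
    return [radius * v / max(norm, eps) for v in x]

def guided_token(cond, uncond, scale, radius=1.0):
    """Hypothetical decoding step: standard CFG mixing, followed by
    projection back onto the hypersphere so the norm stays constant."""
    mixed = [u + scale * (c - u) for c, u in zip(cond, uncond)]
    return project_to_sphere(mixed, radius)

token = guided_token([0.6, 0.8], [1.0, 0.0], scale=3.0, radius=2.0)
token_norm = math.sqrt(sum(v * v for v in token))
```

Without the projection, the mixed vector's norm grows with the guidance scale; with it, `token_norm` stays at the chosen radius regardless of `scale`, which is the property the abstract attributes to the hyperspherical constraint.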

[440] Dynamic Orchestration of Multi-Agent System for Real-World Multi-Image Agricultural VQA

Yan Ke,Xin Yu,Heming Du,Scott Chapman,Helen Huang

Main category: cs.CV

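TL;DR: Proposes a self-reflective, self-improving multi-agent framework that addresses insufficient context and inadequate reasoning in multi-image agricultural visual question answering.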

Details

Motivation: Existing methods are mostly limited to single-image or text-only queries and cannot handle real agricultural scenarios that need multi-image inputs and up-to-date external agricultural knowledge. Method: A collaborative multi-agent framework with four roles (Retriever, Reflector, Answerer, and Improver) provides context enrichment, reflective reasoning, parallel answer drafting, and iterative improvement. Result: The framework achieves competitive performance on the AgMMU benchmark for multi-image agricultural QA. Conclusion: The framework copes with complex agricultural visual question answering and offers good adaptability and quality control. Abstract: Agricultural visual question answering is essential for providing farmers and researchers with accurate and timely knowledge. However, many existing approaches are predominantly developed for evidence-constrained settings such as text-only queries or single-image cases. This design prevents them from coping with real-world agricultural scenarios that often require multi-image inputs with complementary views across spatial scales and growth stages. Moreover, limited access to up-to-date external agricultural context makes these systems struggle to adapt when evidence is incomplete. In addition, rigid pipelines often lack systematic quality control. To address this gap, we propose a self-reflective and self-improving multi-agent framework that integrates four roles, the Retriever, the Reflector, the Answerer, and the Improver. They collaborate to enable context enrichment, reflective reasoning, answer drafting, and iterative improvement. A Retriever formulates queries and gathers external information, while a Reflector assesses adequacy and triggers sequential reformulation and renewed retrieval. Two Answerers draft candidate responses in parallel to reduce bias. The Improver refines them through iterative checks while ensuring that information from multiple images is effectively aligned and utilized. Experiments on the AgMMU benchmark show that our framework achieves competitive performance on multi-image agricultural QA.

[441] NeRV-Diffusion: Diffuse Implicit Neural Representations for Video Synthesis

Yixuan Ren,Hanyu Wang,Hao Chen,Bo He,Abhinav Shrivastava

Main category: cs.CV

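TL;DR: Proposes NeRV-Diffusion, an implicit latent diffusion model that synthesizes videos by generating neural network weights and performs strongly on real-world video benchmarks.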

Details

Motivation: Traditional video generation models rely on cross-frame attention, which is inefficient and makes temporal consistency hard to maintain, so a more efficient, high-quality approach to video synthesis is needed. Method: A two-stage framework: (1) a hypernetwork-based tokenizer encodes videos into implicit neural representation (INR) weights; (2) an implicit diffusion transformer denoises the INR weights. The bottleneck latent is reused across layers, with reformed weight assignment, upsampling connections, and input coordinates; training uses SNR-adaptive loss weighting and scheduled sampling. Result: On real-world video benchmarks including UCF-101 and Kinetics-600, generation quality exceeds previous INR-based models and is comparable to recent non-implicit diffusion models; the model also supports smooth interpolation between frames or videos. Conclusion: By modeling a whole video as a single neural network, NeRV-Diffusion achieves efficient, high-quality video generation and shows the potential of implicit diffusion models for video synthesis. Abstract: We present NeRV-Diffusion, an implicit latent video diffusion model that synthesizes videos via generating neural network weights. The generated weights can be rearranged as the parameters of a convolutional neural network, which forms an implicit neural representation (INR), and decodes into videos with frame indices as the input. Our framework consists of two stages: 1) A hypernetwork-based tokenizer that encodes raw videos from pixel space to neural parameter space, where the bottleneck latent serves as INR weights to decode. 2) An implicit diffusion transformer that denoises on the latent INR weights. In contrast to traditional video tokenizers that encode videos into frame-wise feature maps, NeRV-Diffusion compresses and generates a video holistically as a unified neural network. This enables efficient and high-quality video synthesis via obviating temporal cross-frame attentions in the denoiser and decoding video latent with dedicated decoders. To achieve Gaussian-distributed INR weights with high expressiveness, we reuse the bottleneck latent across all NeRV layers, as well as reform its weight assignment, upsampling connection and input coordinates. We also introduce SNR-adaptive loss weighting and scheduled sampling for effective training of the implicit diffusion model. NeRV-Diffusion reaches superior video generation quality over previous INR-based models and comparable performance to most recent state-of-the-art non-implicit models on real-world video benchmarks including UCF-101 and Kinetics-600.
It also brings a smooth INR weight space that facilitates seamless interpolations between frames or videos.

[442] An Enhanced Pyramid Feature Network Based on Long-Range Dependencies for Multi-Organ Medical Image Segmentation

Dayu Tan,Cheng Kong,Yansen Su,Hai Chen,Dongliang Yang,Junfeng Xia,Chunhou Zheng

Main category: cs.CV

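TL;DR: Proposes LamFormer, a network for multi-organ medical image segmentation that combines Linear Attention Mamba and a Reduced Transformer to lower computational cost while improving the modeling of local detail and long-range dependencies.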

Details

Motivation: Existing Transformer-based methods for multi-organ segmentation are computationally expensive and extract local detail poorly, so more efficient feature extraction modules are needed. Method: LamFormer is a U-shaped network that uses Linear Attention Mamba (LAM) in an enhanced pyramid encoder to capture multi-scale long-range dependencies, a Parallel Hierarchical Feature Aggregation (PHFA) module to aggregate features from different encoder layers, and a Reduced Transformer (RT) to model up-sampled features globally. Result: LamFormer outperforms existing methods on seven complex and diverse datasets while balancing model performance and complexity. Conclusion: Through its module design, LamFormer delivers efficient and accurate fine-grained multi-organ medical image segmentation. Abstract: In the field of multi-organ medical image segmentation, recent methods frequently employ Transformers to capture long-range dependencies from image features. However, these methods overlook the high computational cost of Transformers and their deficiencies in extracting local detailed information. To address high computational costs and inadequate local detail information, we reassess the design of feature extraction modules and propose a new deep-learning network called LamFormer for fine-grained segmentation tasks across multiple organs. LamFormer is a novel U-shaped network that employs Linear Attention Mamba (LAM) in an enhanced pyramid encoder to capture multi-scale long-range dependencies. We construct the Parallel Hierarchical Feature Aggregation (PHFA) module to aggregate features from different layers of the encoder, narrowing the semantic gap among features while filtering information. Finally, we design the Reduced Transformer (RT), which utilizes a distinct computational approach to globally model up-sampled features. RT enhances the extraction of detailed local information and improves the network's capability to capture long-range dependencies. LamFormer outperforms existing segmentation methods on seven complex and diverse datasets, demonstrating exceptional performance. Moreover, the proposed network achieves a balance between model performance and model complexity.

[443] DRIFT: Divergent Response in Filtered Transformations for Robust Adversarial Defense

Amira Guesmi,Muhammad Shafique

Main category: cs.CV

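TL;DR: Proposes DRIFT, which makes deep neural networks more robust to adversarial examples by inducing gradient dissonance; it withstands a range of attacks and improves defense performance at negligible overhead.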

Details

Motivation: Deep neural networks are vulnerable to adversarial examples, and existing defenses tend to fail once gradients can be estimated reliably, largely because randomized transformations yield aligned gradients (gradient consensus), which makes attacks highly transferable. Method: DRIFT (Divergent Response in Filtered Transformations) is a stochastic ensemble of lightweight, learnable filters that disrupts gradient consensus by maximizing divergence in Jacobian-space and logit-space responses. Its consensus-divergence training strategy combines prediction consistency, Jacobian separation, logit-space separation, and adversarial robustness. Result: On ImageNet, DRIFT substantially improves the adversarial robustness of CNNs and Vision Transformers and outperforms state-of-the-art preprocessing, adversarial training, and diffusion-based defenses under adaptive white-box, transfer-based, and gradient-free attacks, with negligible runtime and memory cost. Conclusion: Gradient divergence is a practical and generalizable principle for adversarial defense; by actively disrupting gradient consensus, DRIFT provides an efficient and robust defense. Abstract: Deep neural networks remain highly vulnerable to adversarial examples, and most defenses collapse once gradients can be reliably estimated. We identify \emph{gradient consensus} -- the tendency of randomized transformations to yield aligned gradients -- as a key driver of adversarial transferability. Attackers exploit this consensus to construct perturbations that remain effective across transformations. We introduce \textbf{DRIFT} (Divergent Response in Filtered Transformations), a stochastic ensemble of lightweight, learnable filters trained to actively disrupt gradient consensus. Unlike prior randomized defenses that rely on gradient masking, DRIFT enforces \emph{gradient dissonance} by maximizing divergence in Jacobian- and logit-space responses while preserving natural predictions. Our contributions are threefold: (i) we formalize gradient consensus and provide a theoretical analysis linking consensus to transferability; (ii) we propose a consensus-divergence training strategy combining prediction consistency, Jacobian separation, logit-space separation, and adversarial robustness; and (iii) we show that DRIFT achieves substantial robustness gains on ImageNet across CNNs and Vision Transformers, outperforming state-of-the-art preprocessing, adversarial training, and diffusion-based defenses under adaptive white-box, transfer-based, and gradient-free attacks. DRIFT delivers these improvements with negligible runtime and memory cost, establishing gradient divergence as a practical and generalizable principle for adversarial defense.

[444] UI-UG: A Unified MLLM for UI Understanding and Generation

Hao Yang,Weijie Qiu,Ru Zhang,Zhou Fang,Ruichao Mao,Xiaoyu Lin,Maji Huang,Zhaosong Huang,Teng Guo,Shuoyang Liu,Hai Rao

Main category: cs.CV

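TL;DR: This paper proposes UI-UG, a unified multimodal large language model for user interface (UI) understanding and generation. By combining supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO), it achieves state-of-the-art performance on understanding tasks and generation quality on par with larger models, and it introduces an efficient industrial-grade workflow.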

Details

Motivation: Existing multimodal large language models still fall short in accuracy and quality on domain-specific tasks such as UI understanding and generation, which calls for specialized models and optimization methods. Method: Uses supervised fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to improve UI understanding; uses Direct Preference Optimization (DPO) to improve UI generation quality; and designs an industrial-grade workflow covering a domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. Result: Achieves state-of-the-art UI understanding, outperforming larger general-purpose MLLMs and similarly sized UI-specialized models; matches larger models on generation at a lower computational cost; and shows that jointly training understanding and generation improves both. Conclusion: With a unified architecture and optimization strategy, UI-UG delivers efficient, high-quality UI understanding and generation and has strong potential for industrial use. Abstract: Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they still face challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding on modern, complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly-sized UI-specialized models. Our model is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We also demonstrate that integrating understanding and generation tasks can improve accuracy and quality for both tasks.
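
For readers unfamiliar with DPO, the following minimal sketch computes the standard DPO loss for a single preference pair from sequence log-probabilities. It illustrates the general objective, not the authors' training code; the function name, the `beta` default, and the log-probability values are placeholders chosen for this example.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)]))
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0, so the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy raised the chosen UI and lowered the rejected one relative to the reference.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)

print(neutral, improved)
```

The loss falls as the policy assigns relatively more probability to the human-preferred output than the frozen reference model does.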

[445] Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models

Jitai Hao,Hao Liu,Xinyan Xiao,Qiang Huang,Jun Yu

Main category: cs.CV

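TL;DR: Proposes the Uni-X architecture to address gradient conflicts between vision and text in the shared Transformer of unified multimodal models, using an X-shaped structure with separated ends and a shared middle to improve training efficiency and performance.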

Details

Motivation: Modality-shared Transformers suffer from severe vision-text gradient conflicts when processing multimodal inputs, especially in shallow and deep layers, which limits the training efficiency and performance of unified multimodal models. Method: Proposes Uni-X, which makes the initial and final layers modality-specific and shares parameters in the middle layers, forming an X-shaped structure that mitigates gradient conflicts while enabling high-level semantic fusion. Result: Under identical training conditions, Uni-X trains more efficiently than strong baselines; at 3B parameters it matches or surpasses 7B autoregressive unified multimodal models, reaching a GenEval score of 82 for image generation along with strong results on text and vision understanding tasks. Conclusion: By separating modality processing in the low and high layers, Uni-X resolves gradient conflicts and offers a parameter-efficient, scalable paradigm for unified multimodal modeling. Abstract: Unified Multimodal Models (UMMs) built on shared autoregressive (AR) transformers are attractive for their architectural simplicity. However, we identify a critical limitation: when trained on multimodal inputs, modality-shared transformers suffer from severe gradient conflicts between vision and text, particularly in shallow and deep layers. We trace this issue to the fundamentally different low-level statistical properties of images and text, while noting that conflicts diminish in middle layers where representations become more abstract and semantically aligned. To overcome this challenge, we propose Uni-X, a two-end-separated, middle-shared architecture. Uni-X dedicates its initial and final layers to modality-specific processing, while maintaining shared parameters in the middle layers for high-level semantic fusion. This X-shaped design not only eliminates gradient conflicts at both ends but also further alleviates residual conflicts in the shared layers. Extensive experiments validate the effectiveness of Uni-X. Under identical training conditions, Uni-X achieves superior training efficiency compared to strong baselines. When scaled to 3B parameters with larger training data, Uni-X matches or surpasses 7B AR-based UMMs, achieving a GenEval score of 82 for image generation alongside strong performance in text and vision understanding tasks. These results establish Uni-X as a parameter-efficient and scalable foundation for future unified multimodal modeling. Our code is available at https://github.com/CURRENTF/Uni-X
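
The two-end-separated, middle-shared layout can be pictured with a small routing sketch. This is only a schematic under assumed names: `layer_plan`, `route`, and the layer counts are hypothetical, and the real model's layer split is not stated in the abstract.

```python
def layer_plan(num_layers, num_separated):
    """Label each layer as modality-specific (both ends) or shared (middle)."""
    plan = []
    for i in range(num_layers):
        if i < num_separated or i >= num_layers - num_separated:
            plan.append("modality-specific")
        else:
            plan.append("shared")
    return plan

def route(plan, modality):
    """Name the block a token of the given modality would pass through at each depth."""
    return [
        f"{modality}_block_{i}" if kind == "modality-specific" else f"shared_block_{i}"
        for i, kind in enumerate(plan)
    ]

plan = layer_plan(num_layers=6, num_separated=2)  # toy sizes, not the paper's
print(route(plan, "text"))
print(route(plan, "vision"))
```

Text and vision tokens use separate parameters in the first and last blocks and meet only in the shared middle blocks, which is where the abstract reports that gradient conflicts are weakest.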

[446] Real-Aware Residual Model Merging for Deepfake Detection

Jinhee Park,Guisik Kim,Choongsang Cho,Junseok Kwon

Main category: cs.CV

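TL;DR: Proposes R²M, a training-free model merging framework for deepfake detection that separates Real and Fake components and merges specialist models, outperforming joint training and other baselines on in-distribution, cross-dataset, and unseen-dataset evaluations.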

Details

Motivation: Deepfake generators evolve quickly, making data collection and repeated retraining impractical, so an efficient detector that can keep absorbing new forgery types is needed. Method: Proposes Real-aware Residual Model Merging (R²M), which estimates a shared Real component through low-rank factorization, decomposes each specialist into a Real-aligned part and a Fake residual, denoises the residuals with layerwise rank truncation, and aggregates them with per-task norm matching. Result: R²M outperforms joint training and other merging baselines across several evaluation settings and is composable: when a new forgery type appears, only one specialist is fine-tuned and the models are re-merged, with no training from scratch. Conclusion: R²M provides an efficient, scalable deepfake detection framework that can keep integrating new forgery types without retraining, which is an advantage for practical deployment. Abstract: Deepfake generators evolve quickly, making exhaustive data collection and repeated retraining impractical. We argue that model merging is a natural fit for deepfake detection: unlike generic multi-task settings with disjoint labels, deepfake specialists share the same binary decision and differ in generator-specific artifacts. Empirically, we show that simple weight averaging preserves Real representations while attenuating Fake-specific cues. Building upon these findings, we propose Real-aware Residual Model Merging (R$^2$M), a training-free parameter-space merging framework. R$^2$M estimates a shared Real component via a low-rank factorization of task vectors, decomposes each specialist into a Real-aligned part and a Fake residual, denoises residuals with layerwise rank truncation, and aggregates them with per-task norm matching to prevent any single generator from dominating. A concise rationale explains why a simple head suffices: the Real component induces a common separation direction in feature space, while truncated residuals contribute only minor off-axis variations. Across in-distribution, cross-dataset, and unseen-dataset evaluations, R$^2$M outperforms joint training and other merging baselines. Importantly, R$^2$M is also composable: when a new forgery family appears, we fine-tune one specialist and re-merge, eliminating the need for retraining.

[447] From Satellite to Street: A Hybrid Framework Integrating Stable Diffusion and PanoGAN for Consistent Cross-View Synthesis

Khawlah Bajbaa,Abbas Anwar,Muhammad Saqib,Hafeez Anwar,Nabin Sharma,Muhammad Usman

Main category: cs.CV

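TL;DR: This paper proposes a hybrid framework that combines diffusion models with a conditional generative adversarial network (GAN) to generate geographically consistent street-view images from satellite imagery; on the CVUSA dataset it outperforms diffusion-only methods and is competitive with state-of-the-art GAN-based methods.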

Details

Motivation: Satellite and street-view images differ substantially in appearance and viewpoint, so directly synthesizing high-quality, geographically consistent street views is very challenging, and existing methods still fall short on geometric consistency and detail preservation. Method: Uses a multi-stage training strategy that combines Stable Diffusion with a conditional GAN in a dual-branch architecture, with a fusion strategy that integrates the strengths of both to improve the geometric consistency and visual quality of the generated images. Result: On the CVUSA dataset, the method outperforms diffusion-only methods on multiple metrics and is competitive with state-of-the-art GAN-based methods, producing realistic, geometrically consistent panoramic street views with fine-grained details such as street markings, secondary roads, and clouds. Conclusion: The proposed hybrid framework addresses key challenges in cross-view image synthesis, offers a new approach to generating high-quality street views from satellite imagery, and has potential applications in urban planning and geospatial analysis. Abstract: Street view imagery has become an essential source for geospatial data collection and urban analytics, enabling the extraction of valuable insights that support informed decision-making. However, synthesizing street-view images from corresponding satellite imagery presents significant challenges due to substantial differences in appearance and viewing perspective between these two domains. This paper presents a hybrid framework that integrates diffusion-based models and conditional generative adversarial networks to generate geographically consistent street-view images from satellite imagery. Our approach uses a multi-stage training strategy that incorporates Stable Diffusion as the core component within a dual-branch architecture. To enhance the framework's capabilities, we integrate a conditional Generative Adversarial Network (GAN) that enables the generation of geographically consistent panoramic street views. Furthermore, we implement a fusion strategy that leverages the strengths of both models to create robust representations, thereby improving the geometric consistency and visual quality of the generated street-view images. The proposed framework is evaluated on the challenging Cross-View USA (CVUSA) dataset, a standard benchmark for cross-view image synthesis. Experimental results demonstrate that our hybrid approach outperforms diffusion-only methods across multiple evaluation metrics and achieves competitive performance compared to state-of-the-art GAN-based methods.
The framework successfully generates realistic and geometrically consistent street-view images while preserving fine-grained local details, including street markings, secondary roads, and atmospheric elements such as clouds.

[448] DINOReg: Strong Point Cloud Registration with Vision Foundation Model

Congjia Chen,Yufu Qu

Main category: cs.CV

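TL;DR: This paper proposes DINOReg, a point cloud registration network that makes full use of both visual and geometric information. It uses DINOv2 to extract rich texture and semantic features from images, fuses visual and geometric features at the patch level, and adds a mixed positional embedding, substantially improving registration performance.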

Details

Motivation: Existing methods rely mainly on geometric information, or add color information without fully exploiting the texture and semantic information in images, and their feature fusion is lossy, which limits performance. Method: Uses DINOv2 to extract visual features from images, fuses visual and geometric features at the patch level, and designs a mixed positional embedding that unifies positional information from image space and point cloud space. Result: Experiments on the RGBD-3DMatch and RGBD-3DLoMatch datasets show a 14.2% increase in patch inlier ratio and a 15.7% increase in registration recall over existing state-of-the-art methods. Conclusion: DINOReg effectively fuses the visual and geometric modalities and clearly outperforms single-modal and existing multi-modal methods on point cloud registration, supporting the use of vision foundation models to improve 3D registration. Abstract: Point cloud registration is a fundamental task in 3D computer vision. Most existing methods rely solely on geometric information for feature extraction and matching. Recently, several studies have incorporated color information from RGB-D data into feature extraction. Although these methods achieve remarkable improvements, they have not fully exploited the abundant texture and semantic information in images, and the feature fusion is performed in an image-lossy manner, which limits their performance. In this paper, we propose DINOReg, a registration network that sufficiently utilizes both visual and geometric information to solve the point cloud registration problem. Inspired by advances in vision foundation models, we employ DINOv2 to extract informative visual features from images, and fuse visual and geometric features at the patch level. This design effectively combines the rich texture and global semantic information extracted by DINOv2 with the detailed geometric structure information captured by the geometric backbone. Additionally, a mixed positional embedding is proposed to encode positional information from both image space and point cloud space, which enhances the model's ability to perceive spatial relationships between patches. Extensive experiments on the RGBD-3DMatch and RGBD-3DLoMatch datasets demonstrate that our method achieves significant improvements over state-of-the-art geometry-only and multi-modal registration methods, with a 14.2% increase in patch inlier ratio and a 15.7% increase in registration recall. The code is publicly available at https://github.com/ccjccjccj/DINOReg.

[449] Mask Clustering-based Annotation Engine for Large-Scale Submeter Land Cover Mapping

Hao Chen,Fang Xu,Tamer Saleh,Weifeng Hao,Gui-Song Xia

Main category: cs.CV

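TL;DR: Proposes MCAE, a mask clustering-based annotation engine that uses the spatial autocorrelation principle to efficiently build HiCity-LC, a high-quality, city-scale, submeter-resolution land cover dataset, greatly improving annotation efficiency and supporting large-scale applications.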

Details

Motivation: Annotated land cover data for submeter-resolution remote sensing imagery is scarce and expensive, and traditional methods cannot meet the needs of large-scale, high-accuracy mapping. Method: Based on the spatial autocorrelation principle, treats semantically consistent mask groups as the minimal annotation unit and uses mask clustering to annotate multiple instances simultaneously, forming the MCAE annotation engine. Result: Annotation efficiency improves by one to two orders of magnitude; the resulting HiCity-LC dataset contains about 14 billion labeled pixels and supports city-scale land cover mapping in five major Chinese cities with classification accuracies above 85%. Conclusion: MCAE efficiently produces high-quality, semantically diverse, and spatially representative submeter-resolution land cover labels over large areas; HiCity-LC is the first publicly available city-level submeter-resolution land cover benchmark and demonstrates the scalability and practicality of MCAE for large-scale mapping. Abstract: Recent advances in remote sensing technology have made submeter resolution imagery increasingly accessible, offering remarkable detail for fine-grained land cover analysis. However, its full potential remains underutilized - particularly for large-scale land cover mapping - due to the lack of sufficient, high-quality annotated datasets. Existing labels are typically derived from pre-existing products or manual annotation, which are often unreliable or prohibitively expensive, particularly given the rich visual detail and massive data volumes of submeter imagery. Inspired by the spatial autocorrelation principle, which suggests that objects of the same class tend to co-occur with similar visual features in local neighborhoods, we propose the Mask Clustering-based Annotation Engine (MCAE), which treats semantically consistent mask groups as the minimal annotating units to enable efficient, simultaneous annotation of multiple instances. It significantly improves annotation efficiency by one to two orders of magnitude, while preserving label quality, semantic diversity, and spatial representativeness. With MCAE, we build a high-quality annotated dataset of about 14 billion labeled pixels, referred to as HiCity-LC, which supports the generation of city-scale land cover maps across five major Chinese cities with classification accuracies above 85%. It is the first publicly available submeter resolution city-level land cover benchmark, highlighting the scalability and practical utility of MCAE for large-scale, submeter resolution mapping. The dataset is available at https://github.com/chenhaocs/MCAE

[450] REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport

Soumyadeep Chandra,Kaushik Roy

Main category: cs.CV

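TL;DR: This paper proposes REALIGN, a self-supervised framework based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport for learning representations from procedural videos; it handles irrelevant frames, repeated actions, and non-monotonic step orders.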

Details

Motivation: Existing methods rely on strong monotonicity assumptions and use only feature similarity, so they struggle with the complex temporal structure of real-world instructional videos, including background segments, repeated actions, and out-of-order steps. Method: Proposes the REALIGN framework, which combines Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT) with inter-sequence contrastive learning to jointly model visual correspondences and temporal relations, achieving robust alignment under a partial alignment scheme. Result: On the EgoProceL, ProceL, and CrossTask benchmarks, REALIGN improves average F1-score by up to 18.9% and temporal IoU by over 30%, and produces more interpretable transport maps that preserve key-step orderings and filter out noise. Conclusion: By jointly modeling visual and temporal structure, REALIGN achieves more robust, accurate, and interpretable alignment in self-supervised learning from complex procedural videos, clearly outperforming existing methods. Abstract: Learning from procedural videos remains a core challenge in self-supervised representation learning, as real-world instructional data often contains background segments, repeated actions, and steps presented out of order. Such variability violates the strong monotonicity assumptions underlying many alignment methods. Prior state-of-the-art approaches, such as OPEL, leverage Kantorovich Optimal Transport (KOT) to build frame-to-frame correspondences, but rely solely on feature similarity and fail to capture the higher-order temporal structure of a task. In this paper, we introduce REALIGN, a self-supervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT). In contrast to KOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme, enabling robust handling of irrelevant frames, repeated actions, and non-monotonic step orders common in instructional videos. To stabilize training, we integrate FPGWOT distances with inter-sequence contrastive learning, avoiding the need for multiple regularizers and preventing collapse to degenerate solutions. Across egocentric (EgoProceL) and third-person (ProceL, CrossTask) benchmarks, REALIGN achieves up to 18.9% average F1-score improvements and over 30% temporal IoU gains, while producing more interpretable transport maps that preserve key-step orderings and filter out noise.

[451] Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy

Haijier Chen,Bo Xu,Shoujian Zhang,Haoze Liu,Jiaxuan Lin,Jingrong Wang

Main category: cs.CV

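TL;DR: This paper proposes Vid-LLM, a video-based 3D multimodal large language model that needs no external 3D input. It improves 3D scene understanding through geometric priors, a Cross-Task Adapter (CTA), and a Metric Depth Model, and performs strongly across several tasks.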

Details

Motivation: Existing 3D multimodal large language models depend on 3D input data, which limits their scalability and generalization, so a more practical and scalable approach to real-world 3D scene understanding is needed. Method: Proposes Vid-LLM, which processes video input directly; designs a Cross-Task Adapter (CTA) to align 3D geometric priors with vision-language representations; introduces a Metric Depth Model to recover real-scale geometry; and fine-tunes with a two-stage distillation optimization strategy. Result: Extensive experiments on benchmarks for 3D question answering, 3D dense captioning, and 3D visual grounding validate the method and show strong multi-task performance. Conclusion: Vid-LLM achieves effective 3D scene understanding without external 3D data and, by compactly integrating geometric cues, offers good scalability and prospects for practical deployment. Abstract: Recent developments in Multimodal Large Language Models (MLLMs) have significantly improved Vision-Language (VL) reasoning in 2D domains. However, extending these capabilities to 3D scene understanding remains a major challenge. Existing 3D Multimodal Large Language Models (3D-MLLMs) often depend on 3D data inputs, which limits scalability and generalization. To address this limitation, we propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data, making it practical for real-world deployment. In our method, geometric priors are used directly to improve scene perception. To integrate the geometric cues into the MLLM compactly, we design a Cross-Task Adapter (CTA) module to align the 3D geometric priors with the vision-language representations. To ensure geometric consistency and integrity, we introduce a Metric Depth Model that recovers real-scale geometry from the reconstruction outputs. Finally, the model is fine-tuned with a two-stage distillation optimization strategy, achieving fast convergence and stable training. Extensive experiments across diverse benchmarks verify the effectiveness of our method on 3D Question Answering, 3D Dense Captioning and 3D Visual Grounding tasks, demonstrating superior multi-task capabilities.

[452] PCICF: A Pedestrian Crossing Identification and Classification Framework

Junyi Gu,Beatriz Cabrero-Daniel,Ali Nouri,Lydia Armini,Christian Berger

Main category: cs.CV

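TL;DR: This paper proposes PCICF, a framework for systematically identifying and classifying situations involving vulnerable road users (such as pedestrians) to support incident analysis for robotaxis within their operational design domain. The authors build MoreSMIRK, a dictionary of multi-pedestrian crossing situations based on the SMIRK dataset, use space-filling curves (SFCs) to turn multi-dimensional scenario features into matchable patterns, validate the method on the real-world PIE dataset, and note its potential for onboard real-time use.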

Details

Motivation: Robotaxis in urban environments must reliably detect vulnerable road users (VRUs), especially in complex and varied pedestrian crossing scenarios. Existing approaches lack systematic tools for scenario classification and analysis and give little support for monitoring and assessment outside the operational design domain (ODD), so a framework that systematically identifies and classifies VRU situations is needed. Method: Proposes the PCICF framework; extends the synthetic dataset SMIRK into MoreSMIRK, which covers multi-pedestrian crossing situations; and uses space-filling curves (SFCs) to map multi-dimensional scenario features to one-dimensional characteristic patterns that are matched against MoreSMIRK entries, enabling identification and classification of complex pedestrian behavior such as grouping and merging. Result: PCICF is validated on the real-world PIE dataset, which contains more than 150 manually annotated videos, and identifies and classifies complex pedestrian crossings, including groups that split or merge; the method is also computationally efficient, with potential for real-time onboard deployment on robotaxis for OOD detection. Conclusion: PCICF provides a systematic, extensible framework for identifying and classifying multi-pedestrian crossing scenarios. Combined with MoreSMIRK and SFCs, it improves the understanding of complex VRU behavior and offers a practical tool for safety assessment of automated driving systems inside and outside the ODD. Abstract: We have recently observed the commercial roll-out of robotaxis in various countries. They are deployed within an operational design domain (ODD) on specific routes and environmental conditions, and are subject to continuous monitoring to regain control in safety-critical situations. Since ODDs typically cover urban areas, robotaxis must reliably detect vulnerable road users (VRUs) such as pedestrians, bicyclists, or e-scooter riders. To better handle such varied traffic situations, end-to-end AI, which directly computes vehicle control actions from multi-modal sensor data instead of only for perception, is on the rise. High quality data is needed for systematically training and evaluating such systems within their ODD. In this work, we propose PCICF, a framework to systematically identify and classify VRU situations to support ODD's incident analysis. We base our work on the existing synthetic dataset SMIRK, and enhance it by extending its single-pedestrian-only design into the MoreSMIRK dataset, a structured dictionary of multi-pedestrian crossing situations constructed systematically. We then use space-filling curves (SFCs) to transform multi-dimensional features of scenarios into characteristic patterns, which we match with corresponding entries in MoreSMIRK. We evaluate PCICF with the large real-world dataset PIE, which contains more than 150 manually annotated pedestrian crossing videos.
We show that PCICF can successfully identify and classify complex pedestrian crossings, even when groups of pedestrians merge or split. By leveraging computationally efficient components like SFCs, PCICF even has the potential to be used onboard robotaxis, for example for OOD detection. We share an open-source replication package for PCICF containing its algorithms, the complete MoreSMIRK dataset and dictionary, as well as our experiment results presented in: https://github.com/Claud1234/PCICF
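
To make the space-filling-curve step concrete, here is a minimal sketch that maps 2D grid coordinates to a single index with a Morton (Z-order) curve. The abstract does not say which curve PCICF uses, so Morton order is an assumption for this example, as are the function name and the toy pedestrian track.

```python
def morton_encode(x, y, bits=16):
    """Interleave the bits of non-negative integers x and y into one Z-order index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return code

# Hypothetical pedestrian positions quantized to a grid, one per frame.
track = [(0, 0), (1, 0), (1, 1), (3, 5)]
pattern = [morton_encode(x, y) for x, y in track]
print(pattern)  # [0, 1, 3, 39]
```

Because nearby grid cells tend to receive nearby indices, a multi-dimensional trajectory becomes a one-dimensional sequence that is cheap to compare against stored reference patterns.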

[453] RapidMV: Leveraging Spatio-Angular Representations for Efficient and Consistent Text-to-Multi-View Synthesis

Seungwook Kim,Yichun Shi,Kejie Li,Minsu Cho,Peng Wang

Main category: cs.CV

TL;DR: This paper proposes RapidMV, an efficient text-to-multi-view generative model that produces 32 multi-view synthetic images in about 5 seconds.

Details

Motivation: Generating multi-view-consistent synthetic images for 3D asset creation more efficiently requires a fast, high-quality text-to-multi-view generative model. Method: Proposes a new spatio-angular latent space that encodes spatial appearance and angular viewpoint deviations in a single latent representation, and trains the model effectively with a multi-step training strategy. Result: RapidMV outperforms existing methods in multi-view consistency and generation latency while keeping competitive image quality and text-image alignment. Conclusion: RapidMV is an efficient, consistent, and fast text-to-multi-view image generation model suited to 3D asset synthesis. Abstract: Generating synthetic multi-view images from a text prompt is an essential bridge to generating synthetic 3D assets. In this work, we introduce RapidMV, a novel text-to-multi-view generative model that can produce 32 multi-view synthetic images in just around 5 seconds. In essence, we propose a novel spatio-angular latent space, encoding both the spatial appearance and angular viewpoint deviations into a single latent for improved efficiency and multi-view consistency. We achieve effective training of RapidMV by strategically decomposing our training process into multiple steps. We demonstrate that RapidMV outperforms existing methods in terms of consistency and latency, with competitive quality and text-image alignment.

[454] CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers

Kai Liu,Shaoqiu Zhang,Linghe Kong,Yulun Zhang

Main category: cs.CV

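TL;DR: This paper proposes CLQ, a cross-layer guided orthogonal-based quantization method for efficient compression of diffusion transformers (DiTs), which substantially reduces memory use and speeds up inference while preserving visual quality.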

Details

Motivation: As DiTs grow in size and complexity, deploying them on edge devices becomes difficult, which calls for efficient model compression. Method: CLQ has three key designs: cross-block calibration (CBC) to obtain calibration data that accurately reflects activation distributions; orthogonal-based smoothing (OBS), which smooths outlier channels with a block Hadamard matrix; and cross-layer parameter searching (CLPS), which optimizes the quantization parameters. Result: CLQ is validated on image and video generation models, compressing them to W4A4 with negligible quality loss while achieving 3.98x memory saving and 3.95x inference speedup. Conclusion: CLQ offers an efficient post-training quantization scheme for DiTs and makes their deployment on edge devices considerably more feasible. Abstract: Visual generation quality has been greatly promoted with the rapid advances in diffusion transformers (DiTs), which is attributed to the scaling of model size and complexity. However, these same factors also hinder the practical deployment of DiTs on edge devices, limiting their development and application. Serving as an efficient model compression technique, post-training quantization (PTQ) can reduce memory consumption and speed up inference, with inevitable performance degradation. To alleviate the degradation, we propose CLQ, a cross-layer guided orthogonal-based quantization method for DiTs. To be specific, CLQ consists of three key designs. First, we observe that the calibration data used by most PTQ methods cannot faithfully represent the distribution of the activations. Therefore, we propose cross-block calibration (CBC) to obtain accurate calibration data, with which the quantization can be better guided. Second, we propose orthogonal-based smoothing (OBS), which quantifies the outlier score of each channel and leverages a block Hadamard matrix to smooth the outliers with negligible overhead. Third, we propose cross-layer parameter searching (CLPS) to search. We evaluate CLQ with both image generation and video generation models and successfully compress the model into W4A4 with negligible degradation in visual quality and metrics. CLQ achieves 3.98x memory saving and 3.95x speedup. Our code is available at https://github.com/Kai-Liu001/CLQ.
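
The idea behind Hadamard-based smoothing can be shown on a toy activation vector: an orthonormal Hadamard rotation spreads one outlier channel across all channels, which narrows the range a low-bit quantizer must cover. This is a generic sketch of that principle, not CLQ's OBS procedure; the helper names and the activation values are assumptions for illustration.

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix; n must be a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def rotate(x, h):
    """Apply the orthonormal rotation (1/sqrt(n)) * H to vector x."""
    scale = 1.0 / math.sqrt(len(h))
    return [scale * sum(hij * xj for hij, xj in zip(row, x)) for row in h]

x = [8.0, 0.1, -0.1, 0.1]          # one block of channels with a single outlier
y = rotate(x, hadamard(4))

print(max(abs(v) for v in x))       # peak magnitude before rotation
print(max(abs(v) for v in y))       # smaller peak after rotation
```

Because the rotation is orthogonal, it preserves the vector's energy and can be undone exactly by folding the inverse rotation into the adjacent weights, so the smoothing itself need not change the layer's output.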

[455] A Data-Centric Perspective on the Influence of Image Data Quality in Machine Learning Models

Pei-Han Chen,Szu-Chi Chung

Main category: cs.CV

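TL;DR: This study systematically evaluates how image dataset quality affects model performance and proposes an improved pipeline that combines the CleanVision and Fastdup tools; enhancements such as automatic threshold selection substantially raise F1 scores for low-quality image detection and deduplication.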

Details

Motivation: As model architectures mature, data quality has become a key factor in machine learning performance, yet systematic studies of data quality assessment in the image domain remain limited. Method: Identifies common quality issues using the CIFAKE dataset, analyzes the mechanisms of the CleanVision and Fastdup tools, introduces improvements such as automatic threshold selection, and evaluates low-quality image detection as a binary classification task. Result: Convolutional neural networks are robust to some distortions but sensitive to degradations of critical visual features, such as blurring and severe downscaling; automatic thresholding raises the F1 score from 0.6794 to 0.9468 under single perturbations and from 0.7447 to 0.8557 under dual perturbations; the deduplication strategy raises the F1 score from 0.4576 to 0.7928. Conclusion: The improved pipeline makes image dataset quality assessment more accurate and provides a workable approach and research foundation for data quality management in image-based machine learning. Abstract: In machine learning, research has traditionally focused on model development, with relatively less attention paid to training data. As model architectures have matured and marginal gains from further refinements diminish, data quality has emerged as a critical factor. However, systematic studies on evaluating and ensuring dataset quality in the image domain remain limited. This study investigates methods for systematically assessing image dataset quality and examines how various image quality factors influence model performance. Using the publicly available and relatively clean CIFAKE dataset, we identify common quality issues and quantify their impact on training. Building on these findings, we develop a pipeline that integrates two community-developed tools, CleanVision and Fastdup. We analyze their underlying mechanisms and introduce several enhancements, including automatic threshold selection to detect problematic images without manual tuning. Experimental results demonstrate that not all quality issues exert the same level of impact. While convolutional neural networks show resilience to certain distortions, they are particularly vulnerable to degradations that obscure critical visual features, such as blurring and severe downscaling. To assess the performance of existing tools and the effectiveness of our proposed enhancements, we formulate the detection of low-quality images as a binary classification task and use the F1 score as the evaluation metric.
Our automatic thresholding method improves the F1 score from 0.6794 to 0.9468 under single perturbations and from 0.7447 to 0.8557 under dual perturbations. For near-duplicate detection, our deduplication strategy increases the F1 score from 0.4576 to 0.7928. These results underscore the effectiveness of our workflow and provide a foundation for advancing data quality assessment in image-based machine learning.
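
Since the evaluation treats low-quality image detection as binary classification scored by F1, a small worked example may help. The labels below are invented for illustration (1 marks a low-quality image) and are unrelated to the paper's data.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0]   # ground truth: three low-quality images
y_pred = [1, 1, 0, 1, 0]   # detector flags: two hits, one miss, one false alarm
score = f1_score(y_true, y_pred)
print(round(score, 4))
```

Here precision and recall are both 2/3, so F1 is also 2/3. A poorly chosen detection threshold hurts one of the two and drags F1 down, which is why threshold selection matters for this metric.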

[456] Proxy-GS: Efficient 3D Gaussian Splatting via Proxy Mesh

Yuanyuan Gao,Yuning Gong,Yifei Liu,Li Jingfeng,Zhihang Zhong,Dingwen Zhang,Yanci Zhang,Dan Xu,Xiao Sun

Main category: cs.CV

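TL;DR: This paper proposes Proxy-GS, a new method that uses a proxy system to give Gaussian primitives occlusion awareness, substantially improving the rendering efficiency and quality of MLP-based 3D Gaussian Splatting.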

Details

Motivation: Existing 3D Gaussian Splatting methods suffer from redundant computation in large-scale scenes because they lack occlusion awareness, which hurts rendering efficiency and quality. Method: Designs a fast proxy system that produces precise occlusion depth maps, which guide the culling of Gaussians at render time and densification during training, reducing redundancy and improving rendering quality. Result: In heavily occluded scenes such as MatrixCity Streets, Proxy-GS achieves more than 2.5x speedup over Octree-GS while delivering higher rendering quality. Conclusion: By introducing occlusion awareness, Proxy-GS improves both the rendering speed and visual fidelity of MLP-based 3D Gaussian Splatting, especially in large-scale scenes with complex occlusion. Abstract: 3D Gaussian Splatting (3DGS) has emerged as an efficient approach for achieving photorealistic rendering. Recent MLP-based variants further improve visual fidelity but introduce substantial decoding overhead during rendering. To alleviate computation cost, several pruning strategies and level-of-detail (LOD) techniques have been introduced, aiming to effectively reduce the number of Gaussian primitives in large-scale scenes. However, our analysis reveals that significant redundancy still remains due to the lack of occlusion awareness. In this work, we propose Proxy-GS, a novel pipeline that exploits a proxy to introduce Gaussian occlusion awareness from any view. At the core of our approach is a fast proxy system capable of producing precise occlusion depth maps at a resolution of 1000x1000 in under 1 ms. This proxy serves two roles: first, it guides the culling of anchors and Gaussians to accelerate rendering speed. Second, it guides the densification towards surfaces during training, avoiding inconsistencies in occluded regions, and improving the rendering quality. In heavily occluded scenarios, such as the MatrixCity Streets dataset, Proxy-GS not only equips MLP-based Gaussian splatting with stronger rendering capability but also achieves faster rendering speed. Specifically, it achieves more than 2.5x speedup over Octree-GS, and consistently delivers substantially higher rendering quality. Code will be public upon acceptance.
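
Depth-based culling of the kind described can be sketched in a few lines: a primitive is kept only if it is not farther from the camera than the proxy surface at its pixel. This is a simplified illustration that assumes primitives are already projected to pixel coordinates; the function name, the tolerance `eps`, and the toy depth map are hypothetical and not taken from the paper.

```python
def cull_occluded(points, depth_map, eps=0.05):
    """Keep points whose depth does not exceed the proxy depth at their pixel.

    points: list of (u, v, depth) with integer pixel coordinates (column u, row v).
    depth_map: 2D list indexed as depth_map[v][u], from the proxy render.
    """
    visible = []
    for u, v, d in points:
        if d <= depth_map[v][u] + eps:   # small tolerance for points on the surface
            visible.append((u, v, d))
    return visible

depth_map = [[2.0, 5.0],
             [5.0, 5.0]]                 # toy 2x2 occlusion depth map
points = [(0, 0, 1.5),                   # in front of the proxy surface: kept
          (0, 0, 3.0),                   # behind the surface at that pixel: culled
          (1, 0, 4.0)]                   # in front of a farther surface: kept
kept = cull_occluded(points, depth_map)
print(kept)
```

Culled primitives skip the costly MLP decoding step, which is where the abstract locates the rendering overhead.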

Runmin Zhang,Jialiang Wang,Si-Yuan Cao,Zhu Yu,Junchen Yu,Guangyi Zhang,Hui-Liang Shen

Main category: cs.CV

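TL;DR: This paper proposes DCFlow, an unsupervised cross-modal flow estimation framework that combines a decoupled optimization strategy with a cross-modal consistency constraint and achieves state-of-the-art performance without ground-truth flow labels.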

Details

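Motivation: Existing methods learn flow implicitly from appearance similarity and struggle with modality discrepancy and geometric misalignment, so a method that explicitly decouples and separately handles these two problems is needed. Method: Proposes a decoupled optimization strategy that collaboratively trains a modality transfer network and a flow estimation network; designs a geometry-aware data synthesis pipeline with an outlier-robust loss to provide reliable motion supervision; and introduces a cross-modal consistency constraint to jointly optimize both networks. Result: Experiments on a cross-modal flow benchmark built by the authors show that DCFlow can be integrated with various flow estimation networks and achieves state-of-the-art performance among unsupervised methods. Conclusion: Through decoupled optimization and a cross-modal consistency constraint, DCFlow addresses modality discrepancy and geometric misalignment in cross-modal flow estimation and markedly improves unsupervised flow prediction accuracy. Abstract: This work presents DCFlow, a novel unsupervised cross-modal flow estimation framework that integrates a decoupled optimization strategy and a cross-modal consistency constraint. Unlike previous approaches that implicitly learn flow estimation solely from appearance similarity, we introduce a decoupled optimization strategy with task-specific supervision to address modality discrepancy and geometric misalignment distinctly. This is achieved by collaboratively training a modality transfer network and a flow estimation network. To enable reliable motion supervision without ground-truth flow, we propose a geometry-aware data synthesis pipeline combined with an outlier-robust loss. Additionally, we introduce a cross-modal consistency constraint to jointly optimize both networks, significantly improving flow prediction accuracy. For evaluation, we construct a comprehensive cross-modal flow benchmark by repurposing public datasets. Experimental results demonstrate that DCFlow can be integrated with various flow estimation networks and achieves state-of-the-art performance among unsupervised approaches.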

[458] UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

Ailing Zhang,Lina Lei,Dehong Kong,Zhixin Wang,Jiaqi Xu,Fenglong Song,Chun-Le Guo,Chang Liu,Fan Li,Jie Chen

Main category: cs.CV

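TL;DR: Proposes UI2V-Bench, a new benchmark for evaluating Image-to-Video generation models with a focus on semantic understanding and reasoning. It covers four evaluation dimensions, designs two evaluation methods based on multimodal large language models, evaluates a range of I2V models on about 500 text-image pairs, and shows strong agreement between human evaluation and the MLLM-based metrics.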

Details

Motivation: Existing video generation benchmarks focus mainly on video quality and temporal consistency and overlook whether a model understands the semantics of subjects in the input image and whether the generated video follows physical laws and common sense, so a more comprehensive benchmark is needed. Method: Proposes UI2V-Bench with four dimensions: spatial understanding, attribute binding, category understanding, and reasoning; designs an MLLM-based instance-level pipeline for fine-grained evaluation and a feedback-based reasoning evaluation pipeline. Result: Several open-source and closed-source I2V models are evaluated on about 500 carefully constructed text-image pairs, and the MLLM-based metrics align strongly with human evaluation. Conclusion: UI2V-Bench fills a gap in evaluating the semantic understanding and reasoning abilities of I2V generation models and provides a robust framework and dataset to support future research. Abstract: Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model's ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open-source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.

[459] NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding

Yanpeng Zhao,Shanyan Guan,Yunbo Wang,Yanhao Ge,Wei Li,Xiaokang Yang

Main category: cs.CV

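TL;DR: NeoWorld is a deep learning framework that generates interactive 3D virtual worlds from a single image. It uses a hybrid 2D-3D scene structure, combining object-level 3D representations with 2D background synthesis, to enable efficient, dynamic, and visually coherent exploration of virtual environments.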

Details

Motivation: Inspired by the on-demand worldbuilding concept in the science fiction novel Simulacron-3, the work aims to overcome the limits of global world generation and 2D hallucination and to generate 3D virtual worlds that are efficient and interactive. Method: Uses a hybrid scene structure built on recent representation learning and object-to-3D techniques: foreground objects are modeled in full 3D, backgrounds and non-interacted regions are synthesized in 2D, and object appearance and dynamics can be controlled through natural language. Result: NeoWorld significantly outperforms existing 2D and 2.5D methods on the WorldScore benchmark, supports flexible viewpoint manipulation and physically plausible animation, and progressively unfolds more detailed 3D content as the user interacts. Conclusion: By combining 3D object modeling with 2D background synthesis, NeoWorld achieves efficient, immersive, and interactive virtual world generation and offers a new paradigm for building virtual environments from a single image. Abstract: We introduce NeoWorld, a deep learning framework for generating interactive 3D virtual worlds from a single input image. Inspired by the on-demand worldbuilding concept in the science fiction novel Simulacron-3 (1964), our system constructs expansive environments where only the regions actively explored by the user are rendered with high visual realism through object-centric 3D representations. Unlike previous approaches that rely on global world generation or 2D hallucination, NeoWorld models key foreground objects in full 3D, while synthesizing backgrounds and non-interacted regions in 2D to ensure efficiency. This hybrid scene structure, implemented with cutting-edge representation learning and object-to-3D techniques, enables flexible viewpoint manipulation and physically plausible scene animation, allowing users to control object appearance and dynamics using natural language commands. As users interact with the environment, the virtual world progressively unfolds with increasing 3D detail, delivering a dynamic, immersive, and visually coherent exploration experience. NeoWorld significantly outperforms existing 2D and depth-layered 2.5D methods on the WorldScore benchmark.

[460] Generalist Multi-Class Anomaly Detection via Distillation to Two Heterogeneous Student Networks

Hangil Park,Yongmin Seo,Tae-Kyun Kim

Main category: cs.CV

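TL;DR: Proposes a knowledge-distillation-based dual-model ensemble that combines an Encoder-Decoder model with an Encoder-Encoder model, achieving state-of-the-art performance on multiple industrial inspection and semantic anomaly detection benchmarks.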

Details

Motivation: Existing anomaly detection methods generalize poorly between industrial inspection and semantic anomaly detection, are sensitive to dataset-specific settings, and struggle to handle both multi-class and single-class tasks. Method: A framework with two student models that share a pre-trained encoder (DINOv2): an Encoder-Decoder model for industrial defects and an Encoder-Encoder model for semantic anomalies. The two are trained through knowledge distillation with a joint Noisy-OR objective, which fuses their local and semantic anomaly scores. Result: Achieves state-of-the-art performance on eight public benchmarks (including MVTec-AD and CIFAR-10) in both single-class and multi-class settings, with an image-level AUROC of 99.7% on MVTec-AD and 97.8% on CIFAR-10. Conclusion: The dual-model ensemble bridges the gap between industrial and semantic anomaly detection, generalizes well across tasks, and outperforms both generalist and specialist models. Abstract: Anomaly detection (AD) plays an important role in various real-world applications. Recent advancements in AD, however, are often biased towards industrial inspection and struggle to generalize to broader tasks like semantic anomaly detection, and vice versa. Although recent methods have attempted to address general anomaly detection, their performance remains sensitive to dataset-specific settings and single-class tasks. In this paper, we propose a novel dual-model ensemble approach based on knowledge distillation (KD) to bridge this gap. Our framework consists of a teacher and two student models: an Encoder-Decoder model, specialized in detecting patch-level minor defects for industrial AD, and an Encoder-Encoder model, optimized for semantic AD. Both models leverage a shared pre-trained encoder (DINOv2) to extract high-quality feature representations. The dual models are jointly learned using the Noisy-OR objective, and the final anomaly score is obtained using the joint probability via local and semantic anomaly scores derived from the respective models. We evaluate our method on eight public benchmarks under both single-class and multi-class settings: MVTec-AD, MVTec-LOCO, VisA and Real-IAD for industrial inspection and CIFAR-10/100, FMNIST and View for semantic anomaly detection. The proposed method achieved state-of-the-art accuracies in both domains, in multi-class as well as single-class settings, demonstrating generalization across multiple domains of anomaly detection.
Our model achieved an image-level AUROC of 99.7% on MVTec-AD and 97.8% on CIFAR-10, which is significantly better than the prior general AD models in multi-class settings and even higher than the best specialist models on individual benchmarks.
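
The abstract says the final anomaly score combines a local and a semantic score as a joint probability under a Noisy-OR objective, but it does not give the formula. As a rough illustration only, the sketch below uses the textbook Noisy-OR combination of two independent probabilities; the function name `noisy_or` and the assumption that both scores are already calibrated to [0, 1] are ours, not the paper's.

```python
def noisy_or(p_local: float, p_semantic: float) -> float:
    """Textbook Noisy-OR: probability that at least one of two
    independent detectors flags the sample as anomalous.

    Assumption (not from the abstract): both inputs are probabilities
    in [0, 1], e.g. normalised anomaly scores from the two students.
    """
    for p in (p_local, p_semantic):
        if not 0.0 <= p <= 1.0:
            raise ValueError("scores must be probabilities in [0, 1]")
    # The sample is normal only if both detectors say normal.
    return 1.0 - (1.0 - p_local) * (1.0 - p_semantic)


if __name__ == "__main__":
    # A strong local defect dominates even when the semantic score is low.
    print(noisy_or(0.9, 0.1))
```

One property of this form is that either branch alone can push the fused score towards 1, which is consistent with the stated goal of covering both patch-level defects and semantic anomalies with a single score.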

[461] LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation

Heechang Kim,Gwanghyun Kim,Se Young Chun

Main category: cs.CV

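TL;DR: Proposes a zero-shot, inference-time optimization technique based on quantified Laban Effort and Shape components, enabling interpretable and expressive motion control in text-to-motion generation models.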

Details

Motivation: Existing text-to-motion synthesis models struggle with fine-grained, expressive motion control, mainly because datasets lack motion style diversity and natural language is poorly suited to expressing quantitative motion characteristics. Method: Integrates the Effort and Shape quantification methods of Laban movement analysis into a text-guided motion diffusion model, optimizing the text embedding of the pre-trained model during sampling so that control is achieved zero-shot at inference time without additional motion data. Result: Experiments show that the method manipulates motion attributes according to target Laban tags while preserving motion identity, producing diverse expressive motion qualities. Conclusion: The method provides interpretable, fine-grained, and expressive control over human motion generation, improving the expressiveness and controllability of text-to-motion models. Abstract: Diverse human motion generation is an increasingly important task, having various applications in computer vision, human-computer interaction and animation. While text-to-motion synthesis using diffusion models has shown success in generating high-quality motions, achieving fine-grained expressive motion control remains a significant challenge. This is due to the lack of motion style diversity in datasets and the difficulty of expressing quantitative characteristics in natural language. Laban movement analysis has been widely used by dance experts to express the details of motion, including motion quality, as consistently as possible. Inspired by that, this work aims for interpretable and expressive control of human motion generation by seamlessly integrating the quantification methods of Laban Effort and Shape components into the text-guided motion generation models. Our proposed zero-shot, inference-time optimization method guides the motion generation model to have desired Laban Effort and Shape components without any additional motion data by updating the text embedding of pretrained diffusion models during the sampling step. We demonstrate that our approach yields diverse expressive motion qualities while preserving motion identity by successfully manipulating motion attributes according to target Laban tags.

[462] Performance-Efficiency Trade-off for Fashion Image Retrieval

Julio Hurtado,Haoran Ni,Duygu Sap,Connor Mattinson,Martin Lotz

Main category: cs.CV

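TL;DR: Proposes a selective representation framework that uses clustering, coreset selection, and outlier removal based on a neighbour-homogeneity consistency score to shrink second-hand garment image databases to 10% of their original size while maintaining retrieval accuracy and substantially reducing computational cost.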

Details

Motivation: To make large-scale image retrieval for the second-hand clothing market more scalable by reducing database size and computational overhead while keeping retrieval accuracy high. Method: Introduces a selective representation framework that combines clustering and coreset selection to identify representative samples, and proposes an efficient outlier removal method based on a neighbour-homogeneity consistency score that filters out uncharacteristic data before selection. Result: On three public datasets (DeepFashion Attribute, DeepFashion Con2Shop, and DeepFashion2), the method keeps retrieval accuracy near optimal with the database reduced to 10%, at substantially lower computational cost; adding outlier removal improves retrieval performance further. Conclusion: The framework balances retrieval performance and efficiency, offering a scalable and efficient solution for large-scale second-hand garment image retrieval. Abstract: The fashion industry has been identified as a major contributor to waste and emissions, leading to an increased interest in promoting the second-hand market. Machine learning methods play an important role in facilitating the creation and expansion of second-hand marketplaces by enabling the large-scale valuation of used garments. We contribute to this line of work by addressing the scalability of second-hand image retrieval from databases. By introducing a selective representation framework, we can shrink databases to 10% of their original size without sacrificing retrieval accuracy. We first explore clustering and coreset selection methods to identify representative samples that capture the key features of each garment and its internal variability. Then, we introduce an efficient outlier removal method, based on a neighbour-homogeneity consistency score measure, that filters out uncharacteristic samples prior to selection. We evaluate our approach on three public datasets: DeepFashion Attribute, DeepFashion Con2Shop, and DeepFashion2. The results demonstrate a clear performance-efficiency trade-off by strategically pruning and selecting representative vectors of images. The retrieval system maintains near-optimal accuracy, while greatly reducing computational costs by reducing the images added to the vector database. Furthermore, applying our outlier removal method to clustering techniques yields even higher retrieval performance by removing non-discriminative samples before the selection.
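
The abstract mentions coreset selection for picking representative vectors but does not say which algorithm is used. As one possible illustration, the sketch below implements k-center greedy selection, a common coreset heuristic; the function name `k_center_greedy`, the Euclidean metric, and the fixed starting index are assumptions made for this example and are not taken from the paper.

```python
import math


def k_center_greedy(vectors, budget):
    """Pick up to `budget` indices so that every vector lies close to a
    picked one (k-center greedy; an assumed stand-in for the paper's
    unspecified coreset method).
    """
    if not 0 < budget <= len(vectors):
        raise ValueError("budget must be between 1 and len(vectors)")
    selected = [0]  # arbitrary but deterministic starting point
    # Distance from each vector to its nearest selected vector so far.
    nearest = [math.dist(v, vectors[0]) for v in vectors]
    while len(selected) < budget:
        # Take the vector that is currently worst covered.
        far = max(range(len(vectors)), key=nearest.__getitem__)
        if nearest[far] == 0.0:
            break  # everything left duplicates an already selected vector
        selected.append(far)
        for i, v in enumerate(vectors):
            nearest[i] = min(nearest[i], math.dist(v, vectors[far]))
    return selected


if __name__ == "__main__":
    embeddings = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
    print(k_center_greedy(embeddings, 3))
```

With a budget of roughly one tenth of the database size, a selection like this would keep one representative per well-separated region of embedding space, which is the kind of pruning the abstract describes.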

[463] Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs

Yuanshuai Li,Yuping Yan,Junfeng Tang,Yunxuan Li,Zeqi Zheng,Yaochu Jin

Main category: cs.CV

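TL;DR: Proposes SCPO, a new alignment framework for multimodal large language models that reduces visual hallucinations through semantic curriculum preference optimization.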

Details

Motivation: Existing MLLMs suffer from visual hallucinations, and DPO struggles to capture fine-grained semantic differences and encourages shortcut learning. Method: Builds a dataset of fine-grained semantic contrast pairs sorted by difficulty, and trains with a progressive easy-to-hard curriculum, a dynamic reference model, and a symmetric, bidirectional objective. Result: Significantly outperforms baseline models on multiple hallucination benchmarks, reducing the hallucination rate by up to 62.9%, while performance on general vision-language benchmarks remains stable. Conclusion: SCPO is the first framework to unify semantics, symmetry, and curriculum learning for MLLM alignment, and it effectively mitigates visual hallucinations. Abstract: Multimodal Large Language Models (MLLMs) have significantly improved the performance of various tasks, but continue to suffer from visual hallucinations, a critical issue where generated responses contradict visual evidence. While Direct Preference Optimization (DPO) is widely used for alignment, its application to MLLMs often fails to capture fine-grained semantic differences and encourages shortcut learning. To address these challenges, we propose Semantic Curriculum Preference Optimization (SCPO), a novel framework for MLLM alignment. SCPO employs a progressive, easy-to-hard curriculum built upon our Semantic Curriculum Preference Pairs dataset, which provides fine-grained semantic contrasts sorted by difficulty. This curriculum is trained with a dynamic reference model and a novel symmetric, bidirectional objective to facilitate simultaneous learning from both textual and visual preferences. To our knowledge, SCPO is the first framework to unify semantics, symmetry, and curriculum for MLLM alignment, effectively mitigating visual hallucinations. Extensive experiments on LLaVA models across various scales and versions validate that SCPO demonstrates superior performance compared to baseline models on multiple hallucination benchmarks, reducing the hallucination rate by up to 62.9%. Moreover, evaluations on generalized benchmarks show that SCPO improves factuality while preserving general capabilities, with its performance remaining stable across general vision-language benchmarks.
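
For readers unfamiliar with the baseline that SCPO builds on, the sketch below computes the standard DPO loss for a single preference pair. It is background only: the abstract does not specify SCPO's symmetric, bidirectional objective, and nothing here attempts to reproduce it. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative choices.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy and under a
    frozen reference model. This is plain DPO, not SCPO's objective.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # The policy favours the chosen answer more than the reference does,
    # so the loss falls below log(2), its value at zero margin.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss depends only on how much more the policy prefers the chosen response than the reference model does, which is why near-identical response pairs yield a weak training signal; the abstract's point about fine-grained semantic differences relates to this.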

[464] Robust Multimodal Semantic Segmentation with Balanced Modality Contributions

Jiaqi Tan,Xu Zheng,Fangyu Li,Yang Liu

Main category: cs.CV

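TL;DR: Proposes the EQUISeg framework, which addresses modality imbalance in multimodal semantic segmentation through balanced encoding of modalities and a self-guided mechanism, improving robustness when a modality degrades.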

Details

Motivation: Existing methods depend unevenly on their input modalities, so performance drops sharply when the dominant modality degrades, which limits the use of multimodal semantic segmentation in real-world scenarios. Method: Proposes EQUISeg, which uses a four-stage Cross-modal Transformer Block (CMTB) for efficient multimodal fusion and hierarchical selection, together with a Self-guided Module (SGM) whose mutual guidance mechanism lets each modality adaptively adjust its contribution so that contributions are balanced. Result: Experiments on multiple datasets show that EQUISeg achieves significant performance gains in multimodal segmentation and alleviates the negative effects of modality imbalance. Conclusion: Through balanced modality encoding and adaptive adjustment, EQUISeg improves robustness under modality degradation and offers an effective solution for multimodal semantic segmentation in practice. Abstract: Multimodal semantic segmentation enhances model robustness by exploiting cross-modal complementarities. However, existing methods often suffer from imbalanced modal dependencies, where overall performance degrades significantly once a dominant modality deteriorates in real-world scenarios. Thus, modality balance has become a critical challenge for practical multimodal segmentation. To address this issue, we propose EQUISeg, a multimodal segmentation framework that balances modality contributions through equal encoding of modalities. Built upon a four-stage Cross-modal Transformer Block (CMTB), EQUISeg enables efficient multimodal fusion and hierarchical selection. Furthermore, we design a Self-guided Module (SGM) that mitigates modality imbalance by introducing a mutual guidance mechanism, enabling each modality to adaptively adjust its contribution and enhance robustness under degraded conditions. Extensive experiments on multiple datasets demonstrate that EQUISeg achieves significant performance gains and effectively alleviates the adverse effects of modality imbalance in segmentation tasks.

[465] Instruction Guided Multi Object Image Editing with Quantity and Layout Consistency

Jiaqi Tan,Fangyu Li,Yang Liu

Main category: cs.CV

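TL;DR: Proposes the QL-Adapter framework to address the challenges that CLIP-based image editing faces in complex scenes with respect to object counts, spatial layout, and category diversity.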

Details

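Motivation: Image editing with the standard CLIP text encoder performs poorly in complex multi-object scenes: it struggles to keep object counts and spatial layouts consistent and offers limited support for diverse categories. Method: Designs two core modules. The Image-Layout Fusion Module (ILFM) fuses layout priors with ViT patch tokens from the CLIP image encoder to strengthen spatial structure understanding. The Cross-Modal Augmentation Module (CMAM) injects image features into the text branch to enrich textual embeddings and improve instruction following. The work also builds QL-Dataset, a benchmark spanning broad category, layout, and count variations, and defines the task of quantity- and layout-consistent image editing (QL-Edit). Result: In extensive experiments on QL-Edit, QL-Adapter achieves state-of-the-art performance and significantly outperforms existing models. Conclusion: By introducing layout awareness and cross-modal augmentation, QL-Adapter improves the quality of multi-object image editing in complex scenes, performing well on object counts, spatial layout, and category diversity. Abstract: Instruction-driven image editing with standard CLIP text encoders often fails in complex scenes with many objects. We present QL-Adapter, a framework for multiple-object editing that tackles two challenges: enforcing object counts and spatial layouts, and accommodating diverse categories. QL-Adapter consists of two core modules: the Image-Layout Fusion Module (ILFM) and the Cross-Modal Augmentation Module (CMAM). ILFM fuses layout priors with ViT patch tokens from the CLIP image encoder to strengthen spatial structure understanding. CMAM injects image features into the text branch to enrich textual embeddings and improve instruction following. We further build QL-Dataset, a benchmark that spans broad category, layout, and count variations, and define the task of quantity- and layout-consistent image editing (QL-Edit). Extensive experiments show that QL-Adapter achieves state-of-the-art performance on QL-Edit and significantly outperforms existing models.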

[466] CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models

Zheyuan Hu,Chieh-Hsin Lai,Yuki Mitsufuji,Stefano Ermon

Main category: cs.CV

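TL;DR: Proposes mid-training, a new method that inserts a lightweight intermediate stage between pre-training and final flow map training. It markedly improves training stability and efficiency and achieves state-of-the-art few-step generation performance on several datasets.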

Details

Motivation: Existing flow map models, such as Consistency Models and Mean Flow, support few-step generation, but their training is unstable, sensitive to hyperparameters, and computationally expensive; even initializing from a pre-trained diffusion model does not resolve the instability of turning infinitesimal steps into a long-jump map. Method: Proposes Consistency Mid-Training (CMT), which trains a model on solver trajectories generated by the pre-trained model to map points along a trajectory, starting from a prior sample, directly to the clean sample, yielding a trajectory-consistent and stable initialization whose weights are then used to initialize flow map training. Result: CMT achieves two-step FIDs of 1.97 on CIFAR-10, 1.32 on ImageNet 64x64, and 1.84 on ImageNet 512x512, and a one-step FID of 3.34 on ImageNet 256x256, while using up to 98% less training data and GPU time than Consistency Models and about 50% less total training time than Mean Flow trained from scratch. Conclusion: CMT is a principled, efficient, and general framework for training flow map models that addresses training instability and high cost, making few-step generative models more practical and better performing. Abstract: Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state-of-the-art two-step FIDs: 1.97 on CIFAR-10, 1.32 on ImageNet 64x64, and 1.84 on ImageNet 512x512, while using up to 98% less training data and GPU time, compared to CMs. On ImageNet 256x256, CMT reaches 1-step FID 3.34 while cutting total training time by about 50% compared to MF from scratch (FID 3.43). This establishes CMT as a principled, efficient, and general framework for training flow map models.

[467] CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D

Mohamad Amin Mirzaei,Pantea Amoie,Ali Ekhterachian,Matin Mirzababaei

Main category: cs.CV

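TL;DR: Proposes an improved 3D semantic mapping method that uses SemanticSAM and a context-aware CLIP encoding strategy to improve 3D scene understanding in zero-shot, open-vocabulary settings.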

Details

Motivation: Existing methods use raw 2D masks directly, which leads to fragmented masks and inaccurate semantic assignments and limits 3D semantic segmentation in complex environments. Method: Uses SemanticSAM with progressive granularity refinement to generate more accurate object-level masks, and combines it with a CLIP encoding that weights multiple contextual views of each mask to enrich the semantic representation. Result: Significantly outperforms existing methods on 3D semantic segmentation and language-query object retrieval across multiple benchmark datasets. Conclusion: The method mitigates over-segmentation and strengthens semantic context, substantially improving open-vocabulary 3D scene understanding. Abstract: 3D scene understanding is fundamental for embodied AI and robotics, supporting reliable perception for interaction and navigation. Recent approaches achieve zero-shot, open-vocabulary 3D semantic mapping by assigning embedding vectors to 2D class-agnostic masks generated via vision-language models (VLMs) and projecting these into 3D. However, these methods often produce fragmented masks and inaccurate semantic assignments due to the direct use of raw masks, limiting their effectiveness in complex environments. To address this, we leverage SemanticSAM with progressive granularity refinement to generate more accurate and numerous object-level masks, mitigating the over-segmentation commonly observed in mask generation models such as vanilla SAM, and improving downstream 3D semantic segmentation. To further enhance semantic context, we employ a context-aware CLIP encoding strategy that integrates multiple contextual views of each mask using empirically determined weighting, providing much richer visual context. We evaluate our approach on multiple 3D scene understanding tasks, including 3D semantic segmentation and object retrieval from language queries, across several benchmark datasets. Experimental results demonstrate significant improvements over existing methods, highlighting the effectiveness of our approach.

[468] Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis

Kaizhen Zhu,Mokai Pan,Zhechuan Yu,Jingya Wang,Jingyi Yu,Ye Shi

Main category: cs.CV

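TL;DR: Presents the first unified theoretical and experimental analysis of two distribution-transformation methods, Diffusion Bridge and Flow Matching. Using Stochastic Optimal Control and Optimal Transport theory, it identifies the former's advantages in trajectory stability and small-sample settings, and verifies the theoretical predictions with a fair comparison on a shared latent Transformer architecture.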

Details

Motivation: Although Diffusion Bridge and Flow Matching both perform well on distribution-transformation tasks, differences in their modeling assumptions and implementations have prevented a unified theoretical comparison, making it hard to judge their relative merits. Method: Recasts both methods within a Stochastic Optimal Control framework for theoretical analysis, and builds models with the same latent Transformer architecture to allow a fair experimental comparison. Result: The theory shows that the Diffusion Bridge has a lower cost function and more stable trajectories, and that Flow Matching's interpolation coefficients become less effective as the amount of training data shrinks. The experiments agree with the theory, showing Diffusion Bridge performing better across multiple generation tasks and in small-sample settings. Conclusion: Diffusion Bridge has better theoretical properties and practical performance than Flow Matching, especially when training data is limited or stable trajectories are required. Abstract: Diffusion Bridge and Flow Matching have both demonstrated compelling empirical performance in transformation between arbitrary distributions. However, there remains confusion about which approach is generally preferable, and the substantial discrepancies in their modeling assumptions and practical implementations have hindered a unified theoretical account of their relative merits. We have, for the first time, provided a unified theoretical and experimental validation of these two models. We recast their frameworks through the lens of Stochastic Optimal Control and prove that the cost function of the Diffusion Bridge is lower, guiding the system toward more stable and natural trajectories. Simultaneously, from the perspective of Optimal Transport, interpolation coefficients $t$ and $1-t$ of Flow Matching become increasingly ineffective when the training data size is reduced. To corroborate these theoretical claims, we propose a novel, powerful architecture for Diffusion Bridge built on a latent Transformer, and implement a Flow Matching model with the same structure to enable a fair performance comparison in various experiments. Comprehensive experiments are conducted across Image Inpainting, Super-Resolution, Deblurring, Denoising, Translation, and Style Transfer tasks, systematically varying both the distributional discrepancy (different difficulty) and the training data size. Extensive empirical results align perfectly with our theoretical predictions and allow us to delineate the respective advantages and disadvantages of these two models.
Our code is available at https://anonymous.4open.science/r/DBFM-3E8E/.
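
The abstract refers to the interpolation coefficients $t$ and $1-t$ of Flow Matching without writing the path out. As a reminder of where they appear, the sketch below uses the commonly used linear (rectified-flow style) path; which endpoint carries $t$ versus $1-t$ is a convention we assume here, and the function names are illustrative rather than taken from the paper's code.

```python
def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1.

    Assumed convention: x0 is the source sample at t = 0 and x1 the
    target sample at t = 1; the paper may use the opposite labelling.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target of linear Flow Matching: d x_t / d t = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


if __name__ == "__main__":
    source, target = [0.0, 2.0], [4.0, 2.0]
    print(interpolate(source, target, 0.25))
    print(target_velocity(source, target))
```

Because the target velocity is fixed by the particular pair of samples, each training pair contributes exactly one straight line; the abstract's Optimal Transport argument concerns how well such lines cover the space when few pairs are available.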

[469] Foggy Crowd Counting: Combining Physical Priors and KAN-Graph

Yuhao Wang,Zhuoran Zheng,Han Hu,Dianjie Lu,Guijuan Zhang,Chen Lyu

Main category: cs.CV

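TL;DR: Proposes a crowd counting method that incorporates an atmospheric scattering physical prior and improves counting accuracy in foggy environments through joint optimization of the physical mechanism and data-driven learning.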

Details

Motivation: In foggy environments, blurred targets, degraded features, and reduced contrast make crowd counting difficult, and existing methods perform poorly under complex weather conditions. Method: Introduces a differentiable atmospheric scattering model with dynamic transmittance estimation and adaptive calibration of scattering parameters; designs the MSA-KAN network, based on the Kolmogorov-Arnold representation theorem, to strengthen nonlinear representation in degraded regions; and proposes a weather-aware graph convolutional network (GCN) that dynamically builds spatial adjacency matrices from deep features. Result: Experiments on four public datasets show that the method reduces MAE by 12.2%-27.5% compared with mainstream algorithms in dense fog scenes. Conclusion: By combining physical priors with deep learning, the method substantially improves the robustness and accuracy of crowd counting in complex foggy environments. Abstract: Aiming at the key challenges of crowd counting in foggy environments, such as long-range target blurring, local feature degradation, and image contrast attenuation, this paper proposes a crowd-counting method with a physical prior of atmospheric scattering, which improves crowd counting accuracy under complex meteorological conditions through the synergistic optimization of the physical mechanism and data-driven learning. Specifically, first, the method introduces a differentiable atmospheric scattering model and employs transmittance dynamic estimation and scattering parameter adaptive calibration techniques to accurately quantify the nonlinear attenuation laws of haze on targets with different depths of field. Secondly, the MSA-KAN is designed based on the Kolmogorov-Arnold Representation Theorem to construct a learnable edge activation function. By integrating a multi-layer progressive architecture with adaptive skip connections, it significantly enhances the model's nonlinear representation capability in feature-degraded regions, effectively suppressing feature confusion under fog interference. Finally, we further propose a weather-aware GCN that dynamically constructs spatial adjacency matrices using deep features extracted by MSA-KAN. Experiments on four public datasets demonstrate that our method achieves a 12.2%-27.5% reduction in MAE metrics compared to mainstream algorithms in dense fog scenarios.
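
The abstract names an atmospheric scattering model with transmittance and scattering parameters but gives no equations. The sketch below uses the widely used single-scattering haze model, I = J·t + A·(1 − t) with t = exp(−β·d), applied to one pixel intensity; we assume this is the kind of model meant, and the function names and the `t_min` clamp are illustrative choices.

```python
import math


def transmittance(depth, beta):
    """Fraction of scene light that survives the fog: t = exp(-beta * depth).

    beta is the scattering coefficient and depth the scene distance
    (both assumed non-negative; units must match).
    """
    return math.exp(-beta * depth)


def add_fog(clear, depth, beta, airlight):
    """Standard haze formation model: I = J * t + A * (1 - t)."""
    t = transmittance(depth, beta)
    return clear * t + airlight * (1.0 - t)


def remove_fog(foggy, depth, beta, airlight, t_min=0.05):
    """Invert the model for J; t is clamped so distant pixels do not blow up."""
    t = max(transmittance(depth, beta), t_min)
    return (foggy - airlight * (1.0 - t)) / t


if __name__ == "__main__":
    hazy = add_fog(clear=0.3, depth=2.0, beta=0.5, airlight=0.9)
    print(hazy, remove_fog(hazy, depth=2.0, beta=0.5, airlight=0.9))
```

Since t decays exponentially with depth, distant people lose contrast much faster than nearby ones, which matches the "long-range target blurring" the abstract lists as a key difficulty. Every operation here is differentiable in β and the airlight except the clamp, which is flat wherever it is active.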

[470] TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models

Zhifang Zhang,Qiqi Tao,Jiaqi Lv,Na Zhao,Lei Feng,Joey Tianyi Zhou

Main category: cs.CV

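TL;DR: Proposes TokenSwap, a stealthy backdoor attack on large vision-language models that swaps the grammatical roles of key tokens in textual answers and injects a visual trigger, so that the model mentions the correct objects but misrepresents the relationships between them.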

Details

Motivation: Existing fixed-pattern backdoor attacks are easy to detect because the model becomes overconfident on the frequent patterns it sees in poisoned inputs; a harder-to-detect attack is needed to improve stealth. Method: Introduces TokenSwap, which injects a visual trigger into selected training samples and simultaneously swaps the grammatical roles of key tokens in the corresponding textual answers, and uses an adaptive token-weighted loss that emphasizes learning of the swapped tokens so that the visual trigger becomes associated with bag-of-words behavior. Result: Experiments show that TokenSwap achieves high attack success rates across multiple benchmarks and LVLM architectures while being more evasive and stealthy. Conclusion: TokenSwap is an effective and stealthy backdoor attack that targets the compositional understanding of LVLMs and is harder to detect than conventional fixed-pattern attacks. Abstract: Large vision-language models (LVLMs) have achieved impressive performance across a wide range of vision-language tasks, while they remain vulnerable to backdoor attacks. Existing backdoor attacks on LVLMs aim to force the victim model to generate a predefined target pattern, which is either inserted into or replaces the original content. We find that these fixed-pattern attacks are relatively easy to detect, because the attacked LVLM tends to memorize such frequent patterns in the training dataset, thereby exhibiting overconfidence on these targets given poisoned inputs. To address these limitations, we introduce TokenSwap, a more evasive and stealthy backdoor attack that focuses on the compositional understanding capabilities of LVLMs. Instead of enforcing a fixed targeted content, TokenSwap subtly disrupts the understanding of object relationships in text. Specifically, it causes the backdoored model to generate outputs that mention the correct objects in the image but misrepresent their relationships (i.e., bags-of-words behavior). During training, TokenSwap injects a visual trigger into selected samples and simultaneously swaps the grammatical roles of key tokens in the corresponding textual answers. However, the poisoned samples exhibit only subtle differences from the original ones, making it challenging for the model to learn the backdoor behavior. To address this, TokenSwap employs an adaptive token-weighted loss that explicitly emphasizes the learning of swapped tokens, such that the visual triggers and bags-of-words behavior are associated.
Extensive experiments demonstrate that TokenSwap achieves high attack success rates while maintaining superior evasiveness and stealthiness across multiple benchmarks and various LVLM architectures.

[471] SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics

Peter Hönig,Stefan Thalhammer,Jean-Baptiste Weibel,Matthias Hirschmanner,Markus Vincze

Main category: cs.CV

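TL;DR: Proposes SCOPE, a diffusion-based category-level object pose estimation method that uses DINOv2 features as continuous semantic priors instead of discrete category labels, narrowing the Sim2Real gap and generalizing well to both known and unknown categories.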

Details

Motivation: In open environments, robots must handle unknown objects. Conventional methods rely on discrete category labels and generalize poorly beyond known categories, so a pose estimation method is needed that combines semantic understanding with strong generalization. Method: Proposes SCOPE, which uses DINOv2 features as continuous semantic priors, injects them through cross-attention, and combines them with photorealistic synthetic training data and a noise model for point normals to perform category-level object pose estimation. Result: Trained on synthetic data, SCOPE achieves a relative improvement of 31.9% over the current state of the art on the 5°5cm metric; experiments on two instance-level datasets show generalization to unknown categories, with a grasp success rate of up to 100% on unseen objects. Conclusion: By combining continuous semantic priors with a diffusion model, SCOPE improves category-level pose estimation and generalizes strongly to objects from unknown categories, advancing robotic manipulation in open environments. Abstract: Object manipulation requires accurate object pose estimation. In open environments, robots encounter unknown objects, which requires semantic understanding in order to generalize both to known categories and beyond. To resolve this challenge, we present SCOPE, a diffusion-based category-level object pose estimation model that eliminates the need for discrete category labels by leveraging DINOv2 features as continuous semantic priors. By combining these DINOv2 features with photorealistic training data and a noise model for point normals, we reduce the Sim2Real gap in category-level object pose estimation. Furthermore, injecting the continuous semantic priors via cross-attention enables SCOPE to learn canonicalized object coordinate systems across object instances beyond the distribution of known categories. SCOPE outperforms the current state of the art in synthetically trained category-level object pose estimation, achieving a relative improvement of 31.9% on the 5°5cm metric. Additional experiments on two instance-level datasets demonstrate generalization beyond known object categories, enabling grasping of unseen objects from unknown categories with a success rate of up to 100%. Code available: https://github.com/hoenigpeter/scope.
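
The 5°5cm metric quoted above counts a pose as correct when its rotation error is under 5 degrees and its translation error under 5 cm. The sketch below shows one common way to compute that check for a single prediction; it assumes rotations are 3x3 matrices and translations are in metres, and it ignores the special handling of symmetric objects that benchmarks often add. All function names are illustrative.

```python
import math


def rot_z(deg):
    """Helper for the demo: rotation about the z-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle between two rotations: acos((trace(R_gt^T R_pred) - 1) / 2)."""
    # trace(R_gt^T R_pred) equals the sum of element-wise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard rounding
    return math.degrees(math.acos(cos_angle))


def within_5deg_5cm(r_pred, t_pred, r_gt, t_gt):
    """True if the pose passes the 5 degree / 5 cm test (translations in metres)."""
    return (rotation_error_deg(r_pred, r_gt) < 5.0
            and math.dist(t_pred, t_gt) < 0.05)


if __name__ == "__main__":
    identity = rot_z(0.0)
    print(within_5deg_5cm(rot_z(3.0), (0.0, 0.0, 0.03), identity, (0.0, 0.0, 0.0)))
```

The reported score is then the fraction of test predictions for which this check passes, so a 31.9% relative improvement refers to that fraction rather than to the raw angular or positional error.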

[472] BFSM: 3D Bidirectional Face-Skull Morphable Model

Zidu Wang,Meng Xu,Miao Xu,Hengyuan Ma,Jiankuo Zhao,Xutao Li,Xiangyu Zhu,Zhen Lei

Main category: cs.CV

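TL;DR: Proposes the 3D Bidirectional Face-Skull Morphable Model (BFSM). Using a dataset of over 200 samples and a new dense ray matching registration method, it supports accurate shape inference between face and skull and models tissue thickness variation, enabling single-image 3D reconstruction and surgical planning prediction.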

Details

Motivation: Joint face-skull modeling is limited and lacks inclusivity because paired data are scarce, registration accuracy is insufficient, and patients with craniofacial deformities receive little attention. Method: Builds a dataset containing CT-based skulls, CT-based faces, and high-fidelity textured face scans; proposes a dense ray matching registration method to ensure topological consistency; and establishes the 3D Bidirectional Face-Skull Morphable Model (BFSM) with a shared coefficient space, modeling tissue thickness variation to support one-to-many facial reconstruction. Result: Achieves accurate mutual shape inference between face and skull, supports generating faces with different fat levels from the same skull (one-to-many reconstruction), and shows potential for clinical use in single-image 3D reconstruction and surgical planning prediction. Experiments confirm the method's robustness and accuracy. Conclusion: BFSM offers a new framework for joint face-skull modeling with broad application prospects in medical diagnosis, surgical planning, and facial simulation, and improves inclusivity for patients with craniofacial deformities. Abstract: Building a joint face-skull morphable model holds great potential for applications such as remote diagnostics, surgical planning, medical education, and physically based facial simulation. However, realizing this vision is constrained by the scarcity of paired face-skull data, insufficient registration accuracy, and limited exploration of reconstruction and clinical applications. Moreover, individuals with craniofacial deformities are often overlooked, resulting in underrepresentation and limited inclusivity. To address these challenges, we first construct a dataset comprising over 200 samples, including both normal cases and rare craniofacial conditions. Each case contains a CT-based skull, a CT-based face, and a high-fidelity textured face scan. Secondly, we propose a novel dense ray matching registration method that ensures topological consistency across face, skull, and their tissue correspondences. Based on this, we introduce the 3D Bidirectional Face-Skull Morphable Model (BFSM), which enables shape inference between the face and skull through a shared coefficient space, while also modeling tissue thickness variation to support one-to-many facial reconstructions from the same skull, reflecting individual changes such as fat over time. Finally, we demonstrate the potential of BFSM in medical applications, including 3D face-skull reconstruction from a single image and surgical planning prediction. Extensive experiments confirm the robustness and accuracy of our method. BFSM is available at https://github.com/wang-zidu/BFSM

[473] Comprehensive Benchmarking of YOLOv11 Architectures for Scalable and Granular Peripheral Blood Cell Detection

Mohamad Abou Ali,Mariam Abdulfattah,Baraah Al Hussein,Fadi Dornaika,Ali Cherry,Mohamad Hajj-Hassan,Lara Hamawy

Main category: cs.CV

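TL;DR: Presents a YOLOv11-based framework for fine-grained detection in peripheral blood smears. Using a large-scale dataset of 298,850 annotated cells, the study systematically evaluates five YOLOv11 variants, finds that the Medium variant offers the best balance between accuracy and computational efficiency (mAP@0.5 of 0.934), and recommends an 8:1:1 data split.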

Details

Motivation: Manual peripheral blood smear analysis is time-consuming and subjective, and existing deep learning models have not been evaluated systematically, especially for fine-grained blood cell detection, so a standardized dataset and model comparison are needed. Method: Builds a large-scale annotated dataset (16,891 images, 13 blood cell classes, 298,850 annotated cells) and systematically evaluates five YOLOv11 variants (Nano to XLarge) under two data splitting strategies (70:20:10 and 80:10:10), using mAP, precision, recall, F1 score, and computational efficiency. Result: The YOLOv11 Medium model achieves the best mAP@0.5 of 0.934 under the 8:1:1 split; larger models bring limited gains at substantially higher computational cost; and the 8:1:1 split outperforms the 7:2:1 split overall. Conclusion: YOLOv11 Medium is an effective option for automated fine-grained peripheral blood smear detection, and the released dataset is a valuable resource for hematology research. Abstract: Manual peripheral blood smear (PBS) analysis is labor-intensive and subjective. While deep learning offers a promising alternative, a systematic evaluation of state-of-the-art models such as YOLOv11 for fine-grained PBS detection is still lacking. In this work, we make two key contributions. First, we curate a large-scale annotated dataset for blood cell detection and classification, comprising 16,891 images across 12 peripheral blood cell (PBC) classes, along with the red blood cell class, all carefully re-annotated for object detection tasks. In total, the dataset contains 298,850 annotated cells. Second, we leverage this dataset to conduct a comprehensive evaluation of five YOLOv11 variants (ranging from Nano to XLarge). These models are rigorously benchmarked under two data splitting strategies (70:20:10 and 80:10:10) and systematically assessed using multiple performance criteria, including mean Average Precision (mAP), precision, recall, F1 score, and computational efficiency. Our experiments show that the YOLOv11 Medium variant achieves the best trade-off, reaching a mAP@0.5 of 0.934 under the 8:1:1 split. Larger models (Large and XLarge) provide only marginal accuracy gains at substantially higher computational cost. Moreover, the 8:1:1 split consistently outperforms the 7:2:1 split across all models. These findings highlight YOLOv11, particularly the Medium variant, as a highly effective framework for automated, fine-grained PBS detection.
Beyond benchmarking, our publicly released dataset (github.com/Mohamad-AbouAli/OI-PBC-Dataset) offers a valuable resource to advance research on blood cell detection and classification in hematology.

[474] Biomechanical-phase based Temporal Segmentation in Sports Videos: a Demonstration on Javelin-Throw

Bikash Kumar Badatya,Vipul Baghel,Jyotirmoy Amin,Ravi Hegde

Main category: cs.CV

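TL;DR: Proposes an unsupervised framework that augments the Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) with structured optimal transport (SOT) for context-aware temporal segmentation of elite javelin throws, identifying motion phases accurately without manual labeling.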

Details

Motivation: Traditional sports motion analysis relies on manual annotation or laboratory instrumentation, which is time-consuming, costly, and hard to scale; an automatic, accurate, and scalable unsupervised method is needed to segment key motion phases. Method: Combines structured optimal transport (SOT) with ASTGCN in an unsupervised framework that performs spatio-temporal modeling on pose estimation data to automatically segment the key biomechanical phases of a javelin throw (approach steps, drive, throw, and recovery). Result: The method reaches 71.02% mean average precision (mAP) and a 74.61% F1-score on test data, substantially higher than existing unsupervised methods; a new dataset of 211 professional javelin-throw videos is also released. Conclusion: The SOT-augmented ASTGCN framework segments motion phases accurately without manual labeling, offering a scalable and robust automated solution for sports motion analysis. Abstract: Precise analysis of athletic motion is central to sports analytics, particularly in disciplines where nuanced biomechanical phases directly impact performance outcomes. Traditional analytics techniques rely on manual annotation or laboratory-based instrumentation, which are time-consuming, costly, and lack scalability. Automatic extraction of relevant kinetic variables requires a robust and contextually appropriate temporal segmentation. Considering the specific case of elite javelin-throw, we present a novel unsupervised framework for such a contextually aware segmentation, which applies the structured optimal transport (SOT) concept to augment the well-known Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN). This enables the identification of motion phase transitions without requiring expensive manual labeling. Extensive experiments demonstrate that our approach outperforms state-of-the-art unsupervised methods, achieving 71.02% mean average precision (mAP) and 74.61% F1-score on test data, substantially higher than competing baselines. We also release a new dataset of 211 manually annotated professional javelin-throw videos with frame-level annotations, covering key biomechanical phases: approach steps, drive, throw, and recovery.

[475] FreeRet: MLLMs as Training-Free Retrievers

Yuhan Zhu,Xiangyu Zeng,Chenting Wang,Xinhao Li,Yicheng Xu,Ziang Yan,Yi Wang,Limin Wang

Main category: cs.CV

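TL;DR: Proposes FreeRet, a framework that turns off-the-shelf multimodal large language models (MLLMs) into strong two-stage retrievers without additional training, using semantic embeddings for candidate search and reasoning-based reranking, and outperforming heavily trained models on multiple benchmarks.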

Details

Motivation: Using multimodal large language models for retrieval usually requires heavy post-hoc training to convert them into contrastive encoders; this work asks whether off-the-shelf MLLMs can be used directly for effective retrieval, reducing the dependence on additional training. Method: Proposes FreeRet. In the first stage, semantically grounded embeddings are extracted directly from the MLLM for candidate retrieval; in the second stage, the model's reasoning ability is used for precise reranking. The framework bypasses lexical alignment layers, conditions representation generation on explicit priors, and uses neutral choice framing to mitigate the framing effect in reranking. Result: On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms dedicated retrieval models trained on millions of pairs. It is model-agnostic, scales across MLLM families and sizes, supports arbitrary modality combinations, and unifies retrieval and generation end to end. Conclusion: Carefully harnessed pre-trained MLLMs can serve as strong retrieval engines without additional training, closing a critical gap in their role as generalist models. Abstract: Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating the framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
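
The two-stage pattern described above (fast embedding search, then precise reranking of a shortlist) can be sketched generically as follows. This is not FreeRet's implementation: the embeddings are plain lists of floats, similarity is assumed to be cosine, and `rerank_score` is a hypothetical callable standing in for whatever score the MLLM's reasoning step would assign to a candidate.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def retrieve_then_rerank(query_vec, cand_vecs, rerank_score, top_k):
    """Stage 1: shortlist the top_k candidates by embedding similarity.
    Stage 2: reorder only the shortlist with the (expensive) reranker.

    rerank_score(index) -> float is a hypothetical placeholder for a
    model-based relevance judgement; higher means more relevant.
    """
    by_similarity = sorted(range(len(cand_vecs)),
                           key=lambda i: cosine(query_vec, cand_vecs[i]),
                           reverse=True)
    shortlist = by_similarity[:top_k]
    return sorted(shortlist, key=rerank_score, reverse=True)


if __name__ == "__main__":
    candidates = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
    toy_scores = {0: 0.2, 1: 0.8}  # made-up reranker outputs for the demo
    print(retrieve_then_rerank((1.0, 0.0), candidates,
                               lambda i: toy_scores.get(i, 0.0), top_k=2))
```

The split keeps the costly reasoning step off most of the corpus: only `top_k` candidates reach the reranker, so the first stage's recall bounds what the second stage can recover.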

[476] Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs

Mohamad Ballout,Okajevo Wilfred,Seyedalireza Yaghoubi,Nohayr Muhammad Abdelmoneim,Julius Mayer,Elia Bruni

Main category: cs.CV

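TL;DR: SPLICE is a human-curated benchmark built on the COIN dataset for evaluating multi-dimensional event reasoning. It shows that current vision-language models lag significantly behind humans in visual understanding, and perform worse on contextual and spatial reasoning and on specialized tasks.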

Details

Motivation: To evaluate in depth how well vision-language models (VLMs) reason about events along multiple dimensions (temporal, causal, spatial, contextual, and commonsense), and to expose the gap between models and humans. Method: Build the SPLICE benchmark of 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, segmented into 11,423 event clips; ask both humans and VLMs to rearrange the clips into coherent sequences and compare their performance. Result: VLMs clearly lag behind humans. Textual descriptions improve model performance but do not affect human performance, suggesting that models rely more on language priors than on visual understanding. Models do better on tasks dominated by temporal and causal reasoning and worse on contextual and spatial reasoning and on specialized tasks. Conclusion: Current VLMs remain significantly limited in complex event reasoning, especially on tasks that depend on spatial or contextual understanding or on specialized knowledge; genuine visual understanding, rather than language-based reasoning, needs to improve. Abstract: In this work, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset, designed to probe event-based reasoning across multiple dimensions: temporal, causal, spatial, contextual, and general knowledge. SPLICE includes 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, such as sports, engineering, and housework. These videos are segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art vision-language models (VLMs) on the task of rearranging these clips into coherent event sequences to assess visual reasoning capabilities. Results reveal a significant gap: VLMs struggle to match human performance. While human-annotated textual descriptions improve model accuracy, they do not affect human performance, suggesting that models rely more on language priors than on visual understanding. Even with annotations, VLMs fall short of human-level reasoning, underscoring persistent challenges in visual reasoning. A deeper analysis across sub-categories shows that VLMs perform relatively better on videos where temporal and causal reasoning are dominant, compared to those where contextual and spatial reasoning are dominant. They also perform better on everyday tasks than on specialized ones.

[477] RIFLE: Removal of Image Flicker-Banding via Latent Diffusion Enhancement

Libo Zhu,Zihan Zhou,Xiaoyang Liu,Weihang Zhang,Keyu Shi,Yifan Fu,Yulun Zhang

Main category: cs.CV

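TL;DR: This paper proposes RIFLE, a diffusion-based framework for removing flicker-banding (FB) from photographs of screens, and introduces a new dataset and simulation pipeline to support this line of research.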

Details

Motivation: Flicker-banding (FB) severely degrades the readability and visual quality of screen photographs, yet it has received little study, and effective methods and real-world datasets are lacking. Method: RIFLE is a diffusion-based framework with a flicker-banding prior estimator (FPE) and a Masked Loss (ML). A simulation pipeline synthesizes FB in the luminance domain with stochastic jitter to produce more realistic training data. Result: On a real-world dataset, RIFLE outperforms existing image reconstruction baselines in both quantitative metrics and visual quality, handling FB degradation from mild to severe. Conclusion: According to the authors, this is the first work on FB simulation and removal, and it lays a foundation for subsequent work on dataset construction and FB removal model design. Abstract: Capturing screens is now routine in our everyday lives. But photographs of emissive displays are often affected by flicker-banding (FB): alternating bright–dark stripes that arise from temporal aliasing between a camera's rolling-shutter readout and the display's brightness modulation. Unlike moire degradation, which has been extensively studied, FB remains underexplored despite its frequent and severe impact on readability and perceived quality. We formulate FB removal as a dedicated restoration task and introduce Removal of Image Flicker-Banding via Latent Diffusion Enhancement, RIFLE, a diffusion-based framework designed to remove FB while preserving fine details. We propose the flicker-banding prior estimator (FPE) that predicts key banding attributes and injects them into the restoration network. Additionally, Masked Loss (ML) is proposed to concentrate supervision on banded regions without sacrificing global fidelity. To overcome data scarcity, we provide a simulation pipeline that synthesizes FB in the luminance domain with stochastic jitter in banding angle, banding spacing, and banding width. Feathered boundaries and sensor noise are also applied for a more realistic simulation. For evaluation, we collect a paired real-world FB dataset with pixel-aligned banding-free references captured via long exposure. Across quantitative metrics and visual comparisons on our real-world dataset, RIFLE consistently outperforms recent image reconstruction baselines from mild to severe flicker-banding. To the best of our knowledge, it is the first work to research the simulation and removal of FB.
Our work establishes a great foundation for subsequent research in both the dataset construction and the removal model design. Our dataset and code will be released soon.

[478] Learning Object-Centric Representations Based on Slots in Real World Scenarios

Adil Kaan Akan

Main category: cs.CV

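TL;DR: This work proposes SlotAdapt, a framework that adapts pretrained diffusion models to object-centric image and video generation through lightweight slot-based conditioning, enabling fine-grained control over individual objects while preserving global scene coherence.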

Details

Motivation: Existing diffusion models typically treat images holistically and rely on text conditioning, which makes object-level editing difficult and mismatches the object-based way humans perceive scenes. An object-centric generation method with fine-grained control is therefore needed. Method: SlotAdapt uses slots to separate background from objects, introduces a register token for background/style, and uses slot-conditioned modules for object-specific manipulation. For video, it combines Invariant Slot Attention (ISA) with a Transformer-based temporal aggregator to keep object representations and dynamics consistent across frames. Result: State-of-the-art results in object discovery, segmentation, compositional editing, and controllable image generation; new benchmarks in unsupervised video object segmentation and reconstruction; and support for advanced editing such as object removal, replacement, and insertion without explicit supervision. Conclusion: The work establishes a general and scalable approach to object-centric generative modeling, bridging human object-based perception and machine learning and expanding the design space for interactive, structured generative tools. Abstract: A central goal in AI is to represent scenes as compositions of discrete objects, enabling fine-grained, controllable image and video generation. Yet leading diffusion models treat images holistically and rely on text conditioning, creating a mismatch for object-level editing. This thesis introduces a framework that adapts powerful pretrained diffusion models for object-centric synthesis while retaining their generative capacity. We identify a core challenge: balancing global scene coherence with disentangled object control. Our method integrates lightweight, slot-based conditioning into pretrained models, preserving their visual priors while providing object-specific manipulation. For images, SlotAdapt augments diffusion models with a register token for background/style and slot-conditioned modules for objects, reducing text-conditioning bias and achieving state-of-the-art results in object discovery, segmentation, compositional editing, and controllable image generation. We further extend the framework to video. Using Invariant Slot Attention (ISA) to separate object identity from pose and a Transformer-based temporal aggregator, our approach maintains consistent object representations and dynamics across frames. This yields new benchmarks in unsupervised video object segmentation and reconstruction, and supports advanced editing tasks such as object removal, replacement, and insertion without explicit supervision. Overall, this work establishes a general and scalable approach to object-centric generative modeling for images and videos.
By bridging human object-based perception and machine learning, it expands the design space for interactive, structured, and user-driven generative tools in creative, scientific, and practical domains.

[479] VNODE: A Piecewise Continuous Volterra Neural Network

Siddharth Roheda,Aniruddha Bala,Rohit Chowdhury,Rohan Jaiswal

Main category: cs.CV

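TL;DR: This paper proposes Volterra Neural Ordinary Differential Equations (VNODE), which combine nonlinear Volterra filtering with continuous-time neural ODEs for image classification. Inspired by the visual cortex, the method alternates between discrete feature extraction and ODE-driven state evolution, substantially reducing parameter count while outperforming existing models on datasets such as CIFAR10 and Imagenet1K.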

Details

Motivation: Inspired by the visual cortex, where discrete event processing alternates with continuous integration, the work aims to design an image classification model that is more efficient and uses fewer parameters while performing better. Method: VNODE combines nonlinear Volterra filtering with neural ODEs in a piecewise continuous structure: Volterra feature extraction happens at discrete steps, and the state is updated continuously through an ODE. Result: On benchmarks such as CIFAR10 and Imagenet1K, VNODE consistently outperforms state-of-the-art models at lower computational complexity. Conclusion: By fusing Volterra filtering with neural ODEs, VNODE achieves strong image classification with few parameters and suggests a new direction for deep model design. Abstract: This paper introduces Volterra Neural Ordinary Differential Equations (VNODE), a piecewise continuous Volterra Neural Network that integrates nonlinear Volterra filtering with continuous-time neural ordinary differential equations for image classification. Drawing inspiration from the visual cortex, where discrete event processing is interleaved with continuous integration, VNODE alternates between discrete Volterra feature extraction and ODE-driven state evolution. This hybrid formulation captures complex patterns while requiring substantially fewer parameters than conventional deep architectures. VNODE consistently outperforms state-of-the-art models with improved computational complexity, as exemplified on benchmark datasets like CIFAR10 and Imagenet1K.

[480] Classifier-Centric Adaptive Framework for Open-Vocabulary Camouflaged Object Segmentation

Hanyu Zhang,Yiming Zhou,Jinxia Zhang

Main category: cs.CV

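TL;DR: Proposes a classifier-centric adaptive framework that strengthens the classification component with a lightweight text adapter and layered asymmetric initialization, significantly improving open-vocabulary camouflaged object segmentation.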

Details

Motivation: In existing methods the classification component strongly affects segmentation performance, and models need to generalize better to categories unseen during training. Method: A classifier-centric adaptive framework that uses a lightweight text adapter with a layered asymmetric initialization strategy to improve classification. Result: Substantial gains over the OVCoser baseline on the OVCamo benchmark: cIoU rises from 0.443 to 0.493, cSm from 0.579 to 0.658, and cMAE falls from 0.336 to 0.239. Conclusion: Targeted enhancement of the classification component effectively improves overall open-vocabulary camouflaged object segmentation. Abstract: Open-vocabulary camouflaged object segmentation requires models to segment camouflaged objects of arbitrary categories unseen during training, placing extremely high demands on generalization capabilities. Through analysis of existing methods, it is observed that the classification component significantly affects overall segmentation performance. Accordingly, a classifier-centric adaptive framework is proposed to enhance segmentation performance by improving the classification component via a lightweight text adapter with a novel layered asymmetric initialization. Through the classification enhancement, the proposed method achieves substantial improvements in segmentation metrics compared to the OVCoser baseline on the OVCamo benchmark: cIoU increases from 0.443 to 0.493, cSm from 0.579 to 0.658, and cMAE reduces from 0.336 to 0.239. These results demonstrate that targeted classification enhancement provides an effective approach for advancing camouflaged object segmentation performance.

[481] Traumatic Brain Injury Segmentation using an Ensemble of Encoder-decoder Models

Ghanshyam Dhamat,Vaanathi Sundaresan

Main category: cs.CV

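TL;DR: This study proposes an automated pipeline based on the nnUNet framework for accurately segmenting moderate-severe traumatic brain injury (TBI) lesions on T1-weighted MRI. It ranked among the top 6 methods in the AIMS-TBI 2025 challenge with an overall Dice score of 0.5973.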

Details

Motivation: Moderate-severe TBI lesions are extremely heterogeneous in neuroimaging (varying in size, number, and laterality), which complicates downstream tasks such as image registration and brain parcellation and reduces analytical accuracy, so highly accurate automatic segmentation is needed. Method: Use several architectures within the nnUNet framework for initial segmentation, refine the results with post-processing strategies, and assemble a fully automated TBI lesion detection and segmentation pipeline. Result: In the AIMS-TBI 2025 challenge the submission achieved an accuracy of 0.8451, Dice scores of 0.4711 and 0.8514 for images with and without visible lesions respectively, and an overall Dice score of 0.5973, ranking among the top 6 methods. Conclusion: The nnUNet-based automated pipeline can effectively identify and segment TBI lesions; the code is publicly available for reproduction and further research. Abstract: The identification and segmentation of moderate-severe traumatic brain injury (TBI) lesions pose a significant challenge in neuroimaging. This difficulty arises from the extreme heterogeneity of these lesions, which vary in size, number, and laterality, thereby complicating downstream image processing tasks such as image registration and brain parcellation and reducing analytical accuracy. Thus, developing methods for highly accurate segmentation of TBI lesions is essential for reliable neuroimaging analysis. This study aims to develop an effective automated segmentation pipeline to automatically detect and segment TBI lesions in T1-weighted MRI scans. We evaluate multiple approaches to achieve accurate segmentation of the TBI lesions. The core of our pipeline leverages various architectures within the nnUNet framework for initial segmentation, complemented by post-processing strategies to enhance evaluation metrics. Our final submission to the challenge achieved an accuracy of 0.8451, Dice score values of 0.4711 and 0.8514 for images with and without visible lesions, respectively, with an overall Dice score of 0.5973, ranking among the top-6 methods in the AIMS-TBI 2025 challenge. The Python implementation of our pipeline is publicly available.
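
For readers unfamiliar with the Dice score reported above, a minimal sketch of the metric on binary masks follows. The handling of two empty masks (scored as 1.0 here) is an assumed convention that matters for images without visible lesions; the challenge's exact rule is not stated in the abstract, and `dice_score` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: predicting nothing on a lesion-free image is perfect.
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

pred = [[1, 1, 0], [0, 0, 0]]
target = [[1, 0, 0], [0, 1, 0]]
empty = [[0, 0, 0], [0, 0, 0]]

overlap_dice = dice_score(pred, target)      # one shared pixel out of 2 + 2
empty_dice = dice_score(empty, empty)        # both masks empty
false_alarm_dice = dice_score(pred, empty)   # prediction on a lesion-free image
print(overlap_dice, empty_dice, false_alarm_dice)
```

Under this convention a single false-positive pixel on a lesion-free image drops the score from 1.0 to 0.0, which is one reason per-subset Dice values can differ sharply.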

[482] SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Junsong Chen,Yuyang Zhao,Jincheng Yu,Ruihang Chu,Junyu Chen,Shuai Yang,Xianbang Wang,Yicheng Pan,Daquan Zhou,Huan Ling,Haozhe Liu,Hongwei Yi,Hao Zhang,Muyang Li,Yukang Chen,Han Cai,Sanja Fidler,Ping Luo,Song Han,Enze Xie

Main category: cs.CV

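TL;DR: SANA-Video is an efficient diffusion model that generates high-quality, long, high-resolution videos at low resource cost. It uses linear attention and a constant-memory KV cache, significantly improving generation speed and deployability.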

Details

Motivation: To enable low-cost, efficient generation of long high-resolution videos and overcome the compute and memory bottlenecks of conventional attention on long sequences. Method: A Linear DiT that replaces standard attention with linear attention, a block-wise autoregressive generation scheme with a constant-memory KV cache, and efficient data filtering and training strategies. Result: Training takes only 12 days on 64 H100 GPUs, about 1% of the cost of MovieGen; the model generates videos at up to 720x1280 resolution and minute-length duration; NVFP4 precision on an RTX 5090 gives a 2.4x inference speedup; and performance is competitive with current small diffusion models. Conclusion: SANA-Video delivers efficient, high-quality, low-cost video generation that is practical to deploy, offering a lightweight route to long video generation. Abstract: We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on an RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
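
The "constant-memory state, derived from the cumulative properties of linear attention" can be illustrated with a small NumPy sketch. It is a generic block-causal linear attention, not SANA-Video's implementation: the feature map `phi` (elu(x) + 1), the shapes, and the function names are assumptions made for illustration.

```python
import numpy as np

def phi(x):
    # Positive feature map elu(x) + 1; an assumption, not necessarily the paper's choice.
    return np.where(x > 0, x + 1.0, np.exp(x))

def block_linear_attention(q, k, v, block):
    """Process tokens block by block, carrying only a fixed-size state."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))  # running sum of phi(k) v^T
    z = np.zeros(d_k)         # running sum of phi(k)
    out = np.empty((q.shape[0], d_v))
    for start in range(0, q.shape[0], block):
        sl = slice(start, start + block)
        fk = phi(k[sl])
        S += fk.T @ v[sl]     # state size does not grow with sequence length
        z += fk.sum(axis=0)
        fq = phi(q[sl])
        out[sl] = (fq @ S) / (fq @ z)[:, None]
    return out

def reference(q, k, v, block):
    """Same result via an explicit quadratic block-causal weight matrix."""
    n = q.shape[0]
    w = phi(q) @ phi(k).T
    blk = np.arange(n) // block
    w = w * (blk[None, :] <= blk[:, None])  # token i sees blocks up to its own
    return (w @ v) / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 3))
fast = block_linear_attention(q, k, v, block=3)
slow = reference(q, k, v, block=3)
```

The state `(S, z)` has shape `(d_k, d_v)` and `(d_k,)` regardless of how many blocks have been processed, whereas a conventional KV cache grows linearly with the number of tokens.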

[483] Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

Yutong Hao,Chen Chen,Ajmal Saeed Mian,Chang Xu,Daochang Liu

Main category: cs.CV

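TL;DR: Proposes a training-free video generation framework that improves the physical plausibility of generated videos through explicit physical reasoning and a novel Synchronized Decoupled Guidance (SDG) strategy.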

Details

Motivation: Existing diffusion models rely on large-scale text-video data to learn physical laws implicitly, which is costly, hard to scale, and still prone to unrealistic motion that violates physical laws. Method: A lightweight physics-aware reasoning pipeline constructs counterfactual prompts that encode physics-violating behavior, and a Synchronized Decoupled Guidance (SDG) strategy combines synchronized directional normalization with trajectory-decoupled denoising to suppress implausible content at inference time. Result: Experiments across multiple physical domains show substantially improved physical fidelity while maintaining visual realism, with no additional training. Ablations confirm the effectiveness of the physics-aware reasoning module and SDG, and the key role of each SDG component. Conclusion: The work establishes a new plug-and-play, physics-aware paradigm for video generation that explicitly corrects physical implausibility at inference time and is broadly applicable. Abstract: Diffusion models can generate realistic videos, but existing methods rely on implicitly learning physical reasoning from large-scale text-video datasets, which is costly, difficult to scale, and still prone to producing implausible motions that violate fundamental physical laws. We introduce a training-free framework that improves physical plausibility at inference time by explicitly reasoning about implausibility and guiding the generation away from it. Specifically, we employ a lightweight physics-aware reasoning pipeline to construct counterfactual prompts that deliberately encode physics-violating behaviors. Then, we propose a novel Synchronized Decoupled Guidance (SDG) strategy, which leverages these prompts through synchronized directional normalization to counteract lagged suppression and trajectory-decoupled denoising to mitigate cumulative trajectory bias, ensuring that implausible content is suppressed immediately and consistently throughout denoising. Experiments across different physical domains show that our approach substantially enhances physical fidelity while maintaining photorealism, despite requiring no additional training. Ablation studies confirm the complementary effectiveness of both the physics-aware reasoning component and SDG. In particular, the two designs of SDG are also individually validated to contribute critically to the suppression of implausible content and the overall gains in physical plausibility. This establishes a new plug-and-play physics-aware paradigm for video generation.

[484] IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?

Yang Chen,Minghao Liu,Yufan Shen,Yunwen Li,Tianyuan Huang,Xinyu Fang,Tianyu Zheng,Wenxuan Huang,Cheng Yang,Daocheng Fu,Jianbiao Mei,Rong Wu,Licheng Wen,Xuemeng Yang,Song Mao,Qunshu Lin,Zhi Yu,Yongliang Shen,Yu Qiao,Botian Shi

Main category: cs.CV

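TL;DR: This paper presents IWR-Bench, a new benchmark for evaluating the ability of Large Vision-Language Models (LVLMs) to reconstruct interactive webpages from video. It contains 113 tasks from 100 real-world websites with 1,001 actions, covering diverse interaction complexities, visual styles, and domains, and provides user interaction videos along with static assets. An agent-as-a-judge framework automatically assesses the functional correctness and visual fidelity of generated webpages. The best current model reaches an overall score of only 36.35%, with functional correctness (24.39%) far below visual fidelity (64.25%), exposing key weaknesses in understanding temporal dynamics and generating event-driven logic.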

Details

Motivation: Existing webpage-to-code tasks focus mainly on converting static screenshots to code and ignore the dynamic interactions of real-world web applications. A new benchmark is needed to evaluate how well models understand interactive webpage behavior and generate the corresponding code. Method: IWR-Bench contains 113 tasks from real websites, each providing an interaction video and the complete static assets (e.g., images, videos). It poses two core challenges: inferring interaction logic from multi-modal input, and generating functional code. An automated agent-as-a-judge framework quantifies results with functional correctness (IFS) and visual fidelity (VFS) metrics. Result: In extensive experiments on 28 LVLMs, the best model achieves an overall score of 36.35%, with functional correctness at 24.39% and visual fidelity at 64.25%, showing that models fall well short in understanding temporal dynamics and generating event-driven logic. Conclusion: IWR-Bench sets a new challenge for vision-language models in interactive webpage reconstruction, exposes the limits of current models in interaction-logic reasoning and functional code generation, and encourages future work on temporal modeling and event-driven programming ability. Abstract: The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS).
These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available. Code is available at https://github.com/L-O-I/IWR-Bench.

[485] Evaluation of Polarimetric Fusion for Semantic Segmentation in Aquatic Environments

Luis F. W. Batista,Tom Bourbon,Cedric Pradalier

Main category: cs.CV

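TL;DR: This study uses polarimetric imaging to mitigate the effect of water-surface glare on semantic segmentation of floating objects. It evaluates multimodal fusion networks on the public PoTATO dataset and finds that polarimetric cues improve detection of low-contrast objects and reduce reflection-induced false positives, at the cost of greater model complexity and computational load.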

Details

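Motivation: Water-surface glare and changing outdoor illumination often reduce the accuracy of floating-object segmentation, and conventional RGB images cope poorly with these problems, so methods that suppress glare are worth exploring. Method: Capture images with polarization information using polarimetric imaging, benchmark state-of-the-art multimodal fusion networks on the PoTATO dataset, and compare them with single-image baselines built on traditional models for semantic segmentation. Result: Polarimetric cues recover low-contrast objects and suppress reflection-induced false positives, raising mean IoU (mIoU) and lowering contour error relative to RGB inputs; however, the extra polarization channels increase model size and computational load and may introduce new false positives. Conclusion: Polarimetric imaging improves segmentation of floating objects on water, especially under glare and low contrast, but its computational cost and model complexity must be weighed; the reproducible benchmark and open-source code support further research and help guide sensor selection in practice. Abstract: Accurate segmentation of floating debris on water is often compromised by surface glare and changing outdoor illumination. Polarimetric imaging offers a single-sensor route to mitigate water-surface glare that disrupts semantic segmentation of floating objects. We benchmark state-of-the-art fusion networks on PoTATO, a public dataset of polarimetric images of plastic bottles in inland waterways, and compare their performance with single-image baselines using traditional models. Our results indicate that polarimetric cues help recover low-contrast objects and suppress reflection-induced false positives, raising mean IoU and lowering contour error relative to RGB inputs. These sharper masks come at a cost: the additional channels enlarge the models, increasing the computational load and introducing the risk of new false positives. By providing a reproducible, diagnostic benchmark and publicly available code, we hope to help researchers decide whether polarized cameras are suitable for their applications and to accelerate related research.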

[486] Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

Huu Tien Nguyen,Dac Thai Nguyen,The Minh Duc Nguyen,Trung Thanh Nguyen,Thao Nguyen Truong,Huy Hieu Pham,Johan Barthelemy,Minh Quan Tran,Thanh Tam Nguyen,Quoc Viet Hung Nguyen,Quynh Anh Chau,Hong Son Mai,Thanh Trung Nguyen,Phi Le Nguyen

Main category: cs.CV

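TL;DR: This paper introduces a new Vietnamese-language multimodal medical dataset with 1,567,062 paired CT-PET images and 2,757 clinical reports, filling gaps in existing vision-language models for PET/CT imaging and for low-resource languages (Vietnamese in particular). It also proposes a framework to enhance VLM training, and experiments show the dataset significantly improves existing models on tasks such as medical report generation and visual question answering.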

Details

Motivation: Most existing medical vision-language models are limited to high-resource languages and lack diverse imaging modalities such as PET/CT, which restricts their generality and clinical value in low-resource language settings. Method: Build a large-scale dataset of Vietnamese CT-PET images paired with clinical reports, propose a training framework that includes data augmentation and expert-validated test sets, and benchmark existing VLMs on medical report generation and visual question answering. Result: Experiments show that incorporating the dataset significantly improves existing vision-language models on downstream medical tasks. Conclusion: The dataset and benchmark provide a key foundation for developing robust medical vision-language models for low-resource languages and for improving their clinical relevance in the Vietnamese healthcare system. Abstract: Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset comprising 1,567,062 paired CT-PET images and corresponding 2,757 full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLM training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly the Vietnamese language, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs.
We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and improving their clinical relevance in Vietnamese healthcare.

Xue-Feng Zhu,Tianyang Xu,Yifan Pan,Jinjie Gu,Xi Li,Jiwen Lu,Xiao-Jun Wu,Josef Kittler

Main category: cs.CV

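TL;DR: This paper proposes a new multi-modal object tracking task that uses three modalities (RGB, depth, and thermal infrared) and builds RGBDT500, a new dataset of 500 synchronized videos. It also proposes RDTTrack, which fuses thermal infrared and depth information under an orthogonal projection constraint and combines them with a pretrained RGB tracking model via prompt learning, significantly improving tracking accuracy and robustness in complex scenarios.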

Details

Motivation: Existing dual-modal tracking methods are limited in complex scenarios by their restricted input modalities, so additional complementary modalities are needed to improve robustness. Method: RDTTrack takes a pretrained RGB tracking model and applies prompt learning: it fuses thermal infrared and depth information under an orthogonal projection constraint and feeds the combined tri-modal information to the model as prompts for robust tracking. Result: Experiments show that the method significantly outperforms existing dual-modal approaches in tracking accuracy and robustness, validating tri-modal fusion. Conclusion: RGB-D-TIR tri-modal fusion effectively improves object tracking in complex scenarios, and the RGBDT500 dataset and RDTTrack provide a new basis for multi-modal tracking research. Abstract: Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhance robustness in complex scenarios. To support this task, we construct a new multi-modal tracking dataset, coined RGBDT500, which consists of 500 videos with synchronised frames across the three modalities. Each frame provides spatially aligned RGB, depth, and thermal infrared images with precise object bounding box annotations. Furthermore, we propose a novel multi-modal tracker, dubbed RDTTrack. RDTTrack integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model and prompt learning techniques. Specifically, RDTTrack fuses thermal infrared and depth modalities under a proposed orthogonal projection constraint, then integrates them with RGB signals as prompts for the pre-trained foundation tracking model, effectively harmonising tri-modal complementary cues. The experimental results demonstrate the effectiveness and advantages of the proposed method, showing significant improvements over existing dual-modal approaches in terms of tracking accuracy and robustness in complex scenarios.

[488] ExGS: Extreme 3D Gaussian Compression with Diffusion Priors

Jiaqi Chen,Xinhao Ji,Yuanyuan Gao,Hao Li,Yuning Gong,Yifei Liu,Dan Xu,Zhihang Zhong,Dingwen Zhang,Xiao Sun

Main category: cs.CV

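TL;DR: This paper proposes ExGS, a novel feed-forward framework for extreme 3D Gaussian Splatting (3DGS) compression. It combines re-optimization-free Universal Gaussian Compression (UGC) with GaussPainter, a restoration method based on diffusion priors, to shrink models drastically while preserving high-quality rendering, reaching compression ratios above 100x.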

Details

Motivation: Existing 3DGS compression methods are limited in compression efficiency, speed, or rendering quality, and are hard to deploy in resource-constrained environments. A general solution is needed that compresses efficiently while keeping rendering quality high. Method: ExGS has two parts: UGC sharply reduces the number of Gaussian primitives through re-optimization-free pruning, and GaussPainter uses diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned scenes while also enhancing visible pixels. A lightweight VAE and a one-step diffusion design enable real-time restoration. Result: ExGS achieves over 100x compression without scene-specific optimization (for example, reducing a 354.77 MB model to about 3.31 MB) and significantly improves rendered image quality under challenging conditions. Conclusion: Diffusion priors play a key role in bridging extreme compression and high-quality neural rendering, and ExGS offers a practical path to efficient neural scene representations in real applications. Abstract: Neural scene representations, such as 3D Gaussian Splatting (3DGS), have enabled high-quality neural rendering; however, their large storage and transmission costs hinder deployment in resource-constrained environments. Existing compression methods either rely on costly optimization, which is slow and scene-specific, or adopt training-free pruning and quantization, which degrade rendering quality under high compression ratios. In contrast, recent data-driven approaches provide a promising direction to overcome this trade-off, enabling efficient compression while preserving high rendering quality. We introduce ExGS, a novel feed-forward framework that unifies Universal Gaussian Compression (UGC) with GaussPainter for Extreme 3DGS compression. UGC performs re-optimization-free pruning to aggressively reduce Gaussian primitives while retaining only essential information, whereas GaussPainter leverages powerful diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned Gaussian scenes. Unlike conventional inpainting, GaussPainter not only fills in missing regions but also enhances visible pixels, yielding substantial improvements in degraded renderings. To ensure practicality, it adopts a lightweight VAE and a one-step diffusion design, enabling real-time restoration. Our framework can even achieve over $100\times$ compression (reducing a typical 354.77 MB model to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions.
These results highlight the central role of diffusion priors in bridging the gap between extreme compression and high-quality neural rendering. Our code repository will be released at https://github.com/chenttt2001/ExGS.
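
As a quick check of the "over $100\times$" figure, the ratio implied by the two sizes quoted in the abstract can be computed directly. Only the two megabyte values come from the source; the variable names are illustrative.

```python
# Sizes quoted in the abstract, in megabytes.
original_mb = 354.77
compressed_mb = 3.31

ratio = original_mb / compressed_mb           # how many times smaller
saving = 1.0 - compressed_mb / original_mb    # fraction of storage removed

print(f"compression ratio: {ratio:.1f}x, storage saved: {saving:.2%}")
```

The ratio comes out a little above 107, consistent with the abstract's claim.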

[489] VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding

Yizhuo Ding,Mingkang Chen,Zhibang Feng,Tong Xiao,Wanying Qu,Wenqi Shao,Yanwei Fu

Main category: cs.CV

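TL;DR: Proposes VTPerception-R1, a framework that decouples perception from reasoning to improve the reasoning accuracy and robustness of multimodal large language models.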

Details

Motivation: Multimodal large language models struggle to ground reasoning in perceptual evidence, so different perception strategies need systematic study to improve performance. Method: VTPerception-R1 has two stages: Stage 1 is perception-augmented fine-tuning, and Stage 2 is perception-aware reinforcement learning with visual, textual, and consistency rewards. Result: Validated on four multimodal benchmarks and two MLLMs, the method significantly improves reasoning accuracy and robustness, with larger gains for smaller models. Conclusion: VTPerception-R1 offers a scalable and auditable solution for perception-grounded multimodal reasoning. Abstract: Multimodal large language models (MLLMs) often struggle to ground reasoning in perceptual evidence. We present a systematic study of perception strategies (explicit, implicit, visual, and textual) across four multimodal benchmarks and two MLLMs. Our findings show that explicit perception, especially when paired with textual cues, consistently yields the best improvements, particularly for smaller models. Based on this insight, we propose VTPerception-R1, a unified two-stage framework that decouples perception from reasoning. Stage 1 introduces perception-augmented fine-tuning, and Stage 2 applies perception-aware reinforcement learning with novel visual, textual, and consistency rewards. Experiments demonstrate that VTPerception-R1 significantly improves reasoning accuracy and robustness across diverse tasks, offering a scalable and auditable solution for perception-grounded multimodal reasoning. Our code is available at: https://github.com/yizhuoDi/VTPerceprion-R1.

[490] SkyLink: Unifying Street-Satellite Geo-Localization via UAV-Mediated 3D Scene Alignment

Hongyang Zhang,Yinhao Liu,Zhenyu Kuang

Main category: cs.CV

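TL;DR: This paper proposes SkyLink, a new cross-view geo-localization method that makes feature matching across viewpoints more robust through data enhancement, local feature aggregation, and alignment with 3D scene information.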

Details

Motivation: Existing methods suffer from semantic degradation under extreme viewpoint differences and struggle to establish cross-view location correspondences. Method: A Google Retrieval Enhancement Module performs data enhancement on street images; a Patch-Aware Feature Aggregation module strengthens extraction of multiple local features; and 3D scene information built from multi-scale UAV images serves as a bridge between street and satellite views, with features aligned through self-supervised and cross-view contrastive learning. Result: The method achieves 25.75% Recall@1 on the University-1652 dataset in the UAVM2025 Challenge, showing good robustness and generalization. Conclusion: By fusing 3D scene information with jointly optimized modules, SkyLink significantly improves cross-view geo-localization. Abstract: Cross-view geo-localization aims at establishing location correspondences between different viewpoints. Existing approaches typically learn cross-view correlations through direct feature similarity matching, often overlooking semantic degradation caused by extreme viewpoint disparities. To address this unique problem, we focus on robust feature retrieval under viewpoint variation and propose the novel SkyLink method. We first use the Google Retrieval Enhancement Module to perform data enhancement on street images, which mitigates the occlusion of the key target due to restricted street viewpoints. The Patch-Aware Feature Aggregation module is further adopted to emphasize multiple local feature aggregations to ensure consistent feature extraction across viewpoints. Meanwhile, we integrate the 3D scene information constructed from multi-scale UAV images as a bridge between street and satellite viewpoints, and perform feature alignment through self-supervised and cross-view contrastive learning. Experimental results demonstrate robustness and generalization across diverse urban scenarios, achieving 25.75% Recall@1 accuracy on University-1652 in the UAVM2025 Challenge. Code will be released at https://github.com/HRT00/CVGL-3D.
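
The Recall@1 figure above is a standard retrieval metric: the fraction of queries whose top-ranked gallery item shares the query's location label. A minimal sketch follows, with a made-up similarity matrix and labels; `recall_at_k` is a hypothetical helper and not the challenge's evaluation script.

```python
import numpy as np

def recall_at_k(sim, query_labels, gallery_labels, k=1):
    """Fraction of queries with at least one correct match in their top-k."""
    sim = np.asarray(sim)
    gallery_labels = np.asarray(gallery_labels)
    # Indices of the k most similar gallery items for each query.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [(gallery_labels[idx] == lab).any() for idx, lab in zip(topk, query_labels)]
    return float(np.mean(hits))

# Toy example: 4 street-view queries against 3 satellite gallery images.
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.3, 0.8],
       [0.1, 0.2, 0.7],
       [0.4, 0.5, 0.1]]
query_labels = [0, 1, 2, 0]
gallery_labels = [0, 1, 2]

r1 = recall_at_k(sim, query_labels, gallery_labels, k=1)
r2 = recall_at_k(sim, query_labels, gallery_labels, k=2)
print(r1, r2)
```

In this toy case two of four queries are matched at rank 1, and all four are matched within the top 2.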

[491] LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning

Shenghao Fu,Qize Yang,Yuan-Ming Li,Xihan Wei,Xiaohua Xie,Wei-Shi Zheng

Main category: cs.CV

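TL;DR: Proposes LOVE-R1, a model that balances temporal and spatial information in long video understanding through a slow-fast adaptive frame sampling mechanism, trained with multi-step reasoning and decoupled reinforcement finetuning.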

Details

Motivation: Existing Large Video-Language Models use uniform frame sampling and therefore struggle to combine long-range temporal understanding with detailed spatial perception. Method: An adaptive zoom-in, multi-step reasoning framework: the model first sees densely sampled low-resolution frames to capture temporal clues, then zooms in on key clips at high resolution for spatial detail. It is finetuned on 38k high-quality chain-of-thought (CoT) samples and further optimized with decoupled reinforcement learning applied to each reasoning step. Result: On 4 common long video understanding benchmarks, LOVE-R1 improves on the Qwen2.5-VL baseline by an average of 3.1 percentage points, supporting the value of adaptive sampling and decoupled training. Conclusion: Through adaptive frame sampling and fine-grained reasoning training, LOVE-R1 achieves a better spatio-temporal balance in long video understanding and significantly improves model performance. Abstract: Long video understanding is still challenging for recent Large Video-Language Models (LVLMs) due to the conflict between long-form temporal understanding and detailed spatial perception. LVLMs with a uniform frame sampling mechanism, which samples frames with an equal frame size and fixed sampling rate, inevitably sacrifice either temporal clues or spatial details, resulting in suboptimal solutions. To mitigate this dilemma, we propose LOVE-R1, a model that can adaptively zoom in on a video clip. The model is first provided with densely sampled frames but in a small resolution. If some spatial details are needed, the model can zoom in on a clip of interest with a large frame resolution based on its reasoning until key visual information is obtained. The whole process is implemented as a multi-step reasoning process. To train the reasoning ability, we first finetune the model on our collected 38k high-quality CoT data and enhance it with decoupled reinforcement finetuning. As outcome rewards cannot provide fine-grained process supervision, we decouple multi-step reasoning into multiple single-step reasoning and optimize the internal zoom-in ability explicitly. Experiments on long video understanding benchmarks show that our model with the slow-fast adaptive frame sampling mechanism achieves a great trade-off between sampling density and frame resolutions, and LOVE-R1 outperforms our baseline Qwen2.5-VL by an average of 3.1 percentage points across 4 common long video understanding benchmarks.
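
The slow-fast sampling idea (a dense low-resolution overview, then a high-resolution look at one clip) can be sketched as two calls to a uniform index sampler. The frame counts, resolutions, and clip boundaries below are invented for illustration; the abstract does not give these values, and in LOVE-R1 the clip is chosen by the model's own reasoning rather than fixed in advance.

```python
def uniform_indices(start, end, n):
    """Return n frame indices spread evenly over [start, end)."""
    if n == 1:
        return [start]
    step = (end - 1 - start) / (n - 1)
    return [start + round(i * step) for i in range(n)]

total_frames = 3000  # hypothetical video length

# Step 1: many frames at low resolution to cover the whole timeline.
overview = {"frames": uniform_indices(0, total_frames, 64), "resolution": 112}

# Step 2: fewer frames at high resolution inside a clip of interest.
clip_start, clip_end = 1200, 1500  # hypothetical clip picked by the model
zoom = {"frames": uniform_indices(clip_start, clip_end, 8), "resolution": 448}

print(len(overview["frames"]), zoom["frames"])
```

Each step keeps the total pixel budget bounded: the overview trades resolution for temporal coverage, and the zoom step does the opposite.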

[492] Vision Function Layer in Multimodal LLMs

Cheng Shi,Yizhou Yu,Sibei Yang

Main category: cs.CV

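TL;DR: Introduces the concept of Vision Function Layers (VFL), shows how visual decoding functions are distributed across decoder layers in multimodal LLMs, and uses a Visual Token Swapping analysis framework to pinpoint the function of each layer, improving model efficiency and interpretability.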

Details

Motivation: To understand how Multimodal Large Language Models (MLLMs) process visual information, to reveal how different visual functions (such as counting, grounding, and OCR) are distributed across decoder layers, and to improve model interpretability and practical utility. Method: Proposes the Visual Token Swapping analysis framework, which modifies specific KV cache entries to identify the decoder layers responsible for a given function (Vision Function Layers, VFL), and builds VFL-LoRA and VFL-select on top of it for finetuning and data selection. Result: Each visual function concentrates in 2-3 specific decoder layers (VFLs), and the ordering of these layers is consistent across MLLMs and matches the human cognitive process; VFL-LoRA outperforms full LoRA on matching tasks and prevents out-of-domain forgetting; VFL-select reaches 98% of full-data performance with only 20% of the data, surpassing human data selection. Conclusion: Visual functions are distributed across MLLM decoder layers in an ordered hierarchy, and VFL-based methods can markedly improve training efficiency, avoid function forgetting, and enable efficient data selection, supporting more efficient, interpretable, and robust multimodal models. Abstract: This study identifies that vision-related functional decoding is distributed across different decoder layers in Multimodal Large Language Models (MLLMs). Typically, each function, such as counting, grounding, or OCR, narrows down to two or three layers, which we define as Vision Function Layers (VFL). Additionally, the depths and ordering of different VFLs exhibit a consistent pattern across different MLLMs, which is well aligned with human behavior (e.g., recognition occurs first, followed by counting, and then grounding). These findings are derived from Visual Token Swapping, our novel analytical framework that modifies targeted KV cache entries to precisely elucidate layer-specific functions during decoding. Furthermore, these insights offer substantial utility in tailoring MLLMs for real-world downstream applications. For instance, when LoRA training is selectively applied to VFLs whose functions align with the training data, VFL-LoRA not only outperforms full-LoRA but also prevents out-of-domain function forgetting. Moreover, by analyzing the performance differential on training data when particular VFLs are ablated, VFL-select automatically classifies data by function, enabling highly efficient data selection to directly bolster corresponding capabilities. Consequently, VFL-select surpasses human experts in data selection, and achieves 98% of full-data performance with only 20% of the original dataset.
This study delivers deeper comprehension of MLLM visual processing, fostering the creation of more efficient, interpretable, and robust models.

[493] Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Lei Tong,Zhihua Liu,Chaochao Lu,Dino Oglic,Tom Diethe,Philip Teare,Sotirios A. Tsaftaris,Chen Jin

Main category: cs.CV

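TL;DR: Causal-Adapter is a modular framework that adapts frozen text-to-image diffusion models for counterfactual image generation, introducing causal structure to achieve precise control over target attributes and high-fidelity image generation.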

Details

Motivation: Existing methods rely on prompt engineering without explicit causal structure, which makes it hard to control attributes precisely while preserving image identity; an approach that models causal relations explicitly is needed to improve counterfactual image generation. Method: Proposes the Causal-Adapter framework, which combines structural causal modeling with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings, and a conditioned token contrastive loss, which disentangles attribute factors and reduces spurious correlations. Result: Achieves state-of-the-art performance on both synthetic and real-world datasets, with a 91% MAE reduction for attribute control on Pendulum and an 87% FID reduction on ADNI, markedly improving the accuracy of attribute modification and image fidelity. Conclusion: Through explicit causal modeling, Causal-Adapter enables robust and generalizable counterfactual image editing, and outperforms existing methods in precise attribute control and identity preservation. Abstract: We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method enables causal interventions on target attributes, consistently propagating their effects to causal dependents without altering the core identity of the image. In contrast to prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling augmented with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss to disentangle attribute factors and reduce spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, with up to 91% MAE reduction on Pendulum for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI image generation. These results show that our approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.

[494] TACO-Net: Topological Signatures Triumph in 3D Object Classification

Anirban Ghosh,Ayan Dutta

Main category: cs.CV

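TL;DR: Proposes TACO-Net, a new 3D object classification method that combines topological data analysis with image filtration techniques: point clouds are converted into voxelized binary 3D images, topological features are extracted, and a lightweight 1D CNN performs classification. It reaches SOTA performance on ModelNet40, ModelNet10, and the real-world OmniObject3D dataset, and is robust to many kinds of input noise.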

Details

Motivation: Because point clouds are unordered, irregular, and noisy, existing deep learning methods still struggle to achieve high accuracy in 3D object classification, so a more robust and efficient feature extraction method is needed. Method: Converts each point cloud into a voxelized binary 3D image, extracts discriminative topological features using topological data analysis and image filtration techniques, and trains a lightweight one-dimensional convolutional neural network (1D CNN) for classification. Result: TACO-Net reaches 99.05% accuracy on ModelNet40 and 99.52% on ModelNet10, generalizes well to the real-world OmniObject3D dataset, and remains robust on ModelNet40 inputs corrupted in ten different ways. Conclusion: The proposed TACO-Net framework sets a new SOTA for 3D object classification while being efficient and robust, validating the combination of topological features with a lightweight network. Abstract: 3D object classification is a crucial problem due to its significant practical relevance in many fields, including computer vision, robotics, and autonomous driving. Although deep learning methods applied to point clouds sampled on CAD models of the objects and/or captured by LiDAR or RGBD cameras have achieved remarkable success in recent years, achieving high classification accuracy remains a challenging problem because point clouds are unordered, irregular, and noisy. To this end, we propose a novel state-of-the-art (SOTA) 3D object classification technique that combines topological data analysis with various image filtration techniques to classify objects when they are represented using point clouds. We transform every point cloud into a voxelized binary 3D image to extract distinguishing topological features. Next, we train a lightweight one-dimensional Convolutional Neural Network (1D CNN) using the extracted feature set from the training dataset. Our framework, TACO-Net, sets a new state of the art by achieving 99.05% and 99.52% accuracy on the widely used synthetic benchmarks ModelNet40 and ModelNet10, and further demonstrates its robustness on the large-scale real-world OmniObject3D dataset. When tested with ten different kinds of corrupted ModelNet40 inputs, the proposed TACO-Net demonstrates strong resilience overall.
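The voxelization step in the TACO-Net abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `voxelize` and `count_components`, the grid resolution, and the choice of 6-connectivity are hypothetical, and counting connected components is only the simplest topological summary, not the filtration-based feature set the paper extracts.

```python
from collections import deque

def voxelize(points, resolution=8):
    """Map (x, y, z) points to a binary occupancy grid (hypothetical helper)."""
    points = list(points)
    if not points:
        raise ValueError("empty point cloud")
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    # One scale for all axes keeps the object's aspect ratio.
    extent = max(maxs[i] - mins[i] for i in range(3)) or 1.0
    grid = [[[0] * resolution for _ in range(resolution)] for _ in range(resolution)]
    for p in points:
        cell = [min(int((p[i] - mins[i]) / extent * resolution), resolution - 1)
                for i in range(3)]  # clamp points on the upper boundary
        grid[cell[0]][cell[1]][cell[2]] = 1
    return grid

def count_components(grid):
    """Count 6-connected occupied components, a very simple topological feature."""
    n = len(grid)
    seen = set()
    components = 0
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    for x in range(n):
        for y in range(n):
            for z in range(n):
                if not grid[x][y][z] or (x, y, z) in seen:
                    continue
                components += 1
                seen.add((x, y, z))
                queue = deque([(x, y, z)])
                while queue:  # breadth-first flood fill of one component
                    cx, cy, cz = queue.popleft()
                    for dx, dy, dz in steps:
                        nx, ny, nz = cx + dx, cy + dy, cz + dz
                        inside = 0 <= nx < n and 0 <= ny < n and 0 <= nz < n
                        if inside and grid[nx][ny][nz] and (nx, ny, nz) not in seen:
                            seen.add((nx, ny, nz))
                            queue.append((nx, ny, nz))
    return components
```

A binary grid like this is order-invariant by construction, which is one plausible reason for rasterizing an unordered point cloud before feature extraction.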

[495] UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections

Zeyu Cai,Ziyang Li,Xiaoben Li,Boqian Li,Zeyu Wang,Zhenyu Zhang,Yuliang Xiu

Main category: cs.CV

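TL;DR: UP2You is a tuning-free method for high-fidelity 3D clothed portrait reconstruction that efficiently generates high-quality 3D models directly from unconstrained in-the-wild 2D photos.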

Details

Motivation: Existing methods depend on clean, complete, or multi-view-aligned input images and struggle with real-world photos that vary widely in pose, viewpoint, occlusion, and cropping, so a robust and efficient solution is needed. Method: Proposes a data rectifier paradigm that converts unconstrained inputs into clean, orthogonal multi-view images in a single forward pass; introduces a pose-correlated feature aggregation module (PCFA) to fuse information from multiple reference images, and a perceiver-based multi-reference shape predictor that removes the need for predefined body templates. Result: On datasets including 4D-Dress and PuzzleIOI, UP2You outperforms prior methods in geometric accuracy (Chamfer reduced by 15%, P2S reduced by 18%) and texture fidelity (PSNR improved by 21%, LPIPS improved by 46%), with a processing time of 1.5 minutes per person. Conclusion: UP2You delivers efficient, high-quality, tuning-free 3D clothed portrait reconstruction, supports arbitrary pose control and multi-garment virtual try-on, and is practical and extensible for real-world use. Abstract: We present UP2You, the first tuning-free solution for reconstructing high-fidelity 3D clothed portraits from extremely unconstrained in-the-wild 2D photos. Unlike previous approaches that require "clean" inputs (e.g., full-body images with minimal occlusions, or well-calibrated cross-view captures), UP2You directly processes raw, unstructured photographs, which may vary significantly in pose, viewpoint, cropping, and occlusion. Instead of compressing data into tokens for slow online text-to-3D optimization, we introduce a data rectifier paradigm that efficiently converts unconstrained inputs into clean, orthogonal multi-view images in a single forward pass within seconds, simplifying the 3D reconstruction. Central to UP2You is a pose-correlated feature aggregation module (PCFA) that selectively fuses information from multiple reference images w.r.t. target poses, enabling better identity preservation and a nearly constant memory footprint as the number of observations grows. We also introduce a perceiver-based multi-reference shape predictor, removing the need for pre-captured body templates. Extensive experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy (Chamfer improved by 15% and P2S by 18% on PuzzleIOI) and texture fidelity (PSNR improved by 21% and LPIPS by 46% on 4D-Dress). UP2You is efficient (1.5 minutes per person) and versatile (it supports arbitrary pose control and training-free multi-garment 3D virtual try-on), making it practical for real-world scenarios where humans are casually captured.
Both models and code will be released to facilitate future research on this underexplored task. Project Page: https://zcai0612.github.io/UP2You

[496] Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models

Youngeun Kim,Youjia Zhang,Huiling Liu,Aecheon Jung,Sunwoo Lee,Sungeun Hong

Main category: cs.CV

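TL;DR: Proposes a training-free token pruning method for vision-language models that estimates token sensitivity with zeroth-order perturbations, significantly improving inference efficiency while maintaining accuracy.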

Details

Motivation: Existing attention-based and diversity-based token pruning methods risk instability or loss of key information, so a more reliable and efficient pruning strategy is needed. Method: Estimates token sensitivity with zeroth-order perturbations at the projection layer and selects visual tokens that are both highly sensitive and complementary, avoiding redundancy and information loss and enabling efficient training-free pruning. Result: Validated across multiple VLMs and benchmarks, the method prunes up to 94.4% of tokens while maintaining accuracy and achieves up to 2.30x faster end-to-end inference. Conclusion: \ours is an efficient, stable, training-free token pruning framework that significantly reduces the inference cost of large vision-language models. Abstract: Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads and can lead to redundant selections. Diversity-based methods improve robustness by selecting tokens far apart in feature space but risk dropping regions needed for accurate prediction. We propose \ours, a training-free framework built on a simple intuition: tokens with higher sensitivity are more likely to influence the model's output, and they should also capture complementary visual cues rather than overlapping information. To achieve this, we estimate token sensitivity using zeroth-order perturbations at the projection layer, a shallow and computationally light component of the model. This approach measures how small random perturbations affect the projection outputs, allowing us to approximate each token's influence through lightweight forward passes without backpropagation. Extensive experiments across multiple VLMs and benchmarks show that \ours consistently outperforms prior methods, pruning up to 94.4% of tokens while maintaining accuracy and significantly improving efficiency, achieving up to 2.30x faster end-to-end inference over the baseline.
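The zeroth-order sensitivity estimate described in this abstract can be sketched in plain Python as a two-sided finite difference along random directions. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy `project` layer (a linear map followed by tanh), the step size `eps`, the number of random directions, and the simple top-k selection are all hypothetical, and the complementarity criterion mentioned in the abstract is omitted.

```python
import math
import random

def project(x, weights):
    """Toy stand-in for a projection layer: linear map followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

def token_sensitivity(x, weights, eps=1e-2, num_samples=8, rng=None):
    """Average output change per unit perturbation, using forward passes only."""
    if rng is None:
        rng = random.Random(0)
    total = 0.0
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]  # random probe direction
        plus = project([v + eps * d for v, d in zip(x, u)], weights)
        minus = project([v - eps * d for v, d in zip(x, u)], weights)
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(plus, minus)))
        total += diff / (2 * eps)
    return total / num_samples

def select_tokens(tokens, weights, keep, seed=0):
    """Keep the indices of the `keep` most sensitive tokens, in original order."""
    # The same seed is reused so every token is probed along the same directions.
    scores = [token_sensitivity(t, weights, rng=random.Random(seed)) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

No gradients are computed anywhere, which is the point of a zeroth-order estimate: each token costs `2 * num_samples` forward passes through a shallow layer.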

[497] PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement

Bo Zhao,Dan Guo,Junzhe Cao,Yong Xu,Tao Tan,Yue Sun,Bochao Zou,Jie Zhang,Zitong Yu

Main category: cs.CV

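TL;DR: Proposes a physics-grounded paradigm for remote photoplethysmography (rPPG): a second-order dynamical system model of the pulse signal is derived from the Navier-Stokes equations and used to design the lightweight PHASE-Net, which comprises an axial swapper module, an adaptive spatial filter, and a gated temporal convolutional network, achieving state-of-the-art performance while remaining efficient.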

Details

Motivation: Existing deep learning methods lack theoretical grounding, which limits the robustness and interpretability of rPPG measurement under head motion and illumination changes. Method: Starting from the Navier-Stokes equations of hemodynamics, the authors show that the pulse signal follows a second-order dynamical system whose discrete solution corresponds to a causal convolution, which justifies the use of a Temporal Convolutional Network (TCN). PHASE-Net is designed with a Zero-FLOPs Axial Swapper module, an Adaptive Spatial Filter, and a gated dilated TCN. Result: PHASE-Net performs strongly across extensive experiments, reaching state-of-the-art results with high efficiency and robustness, and accurately recovers the pulse signal under head motion and illumination changes. Conclusion: The work provides a theoretically grounded solution for rPPG that improves interpretability and deployability, an important step toward reliable non-contact physiological monitoring. Abstract: Remote photoplethysmography (rPPG) measurement enables non-contact physiological monitoring but suffers from accuracy degradation under head motion and illumination changes. Existing deep learning methods are mostly heuristic and lack theoretical grounding, which limits robustness and interpretability. In this work, we propose a physics-informed rPPG paradigm derived from the Navier-Stokes equations of hemodynamics, showing that the pulse signal follows a second-order dynamical system whose discrete solution naturally leads to a causal convolution. This provides a theoretical justification for using a Temporal Convolutional Network (TCN). Based on this principle, we design PHASE-Net, a lightweight model with three key components: (1) a Zero-FLOPs Axial Swapper module, which swaps or transposes a few spatial channels to mix distant facial regions and enhance cross-region feature interaction without breaking temporal order; (2) an Adaptive Spatial Filter, which learns a soft spatial mask per frame to highlight signal-rich areas and suppress noise; and (3) a Gated TCN, a causal dilated TCN with gating that models long-range temporal dynamics for accurate pulse recovery. Extensive experiments demonstrate that PHASE-Net achieves state-of-the-art performance with strong efficiency, offering a theoretically grounded and deployment-ready rPPG solution.
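The claim that a discretized second-order system leads to a causal convolution can be checked numerically with a short Python sketch. The specific difference equation below, y[n] = a1*y[n-1] + a2*y[n-2] + b*x[n] with zero initial state, is an assumed generic form chosen for illustration; the abstract does not give the paper's actual discretization, and the coefficient values used in any call are arbitrary.

```python
def simulate_second_order(x, a1, a2, b=1.0):
    """Run y[n] = a1*y[n-1] + a2*y[n-2] + b*x[n] from a zero initial state."""
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(a1 * y1 + a2 * y2 + b * xn)
    return y

def impulse_response(length, a1, a2, b=1.0):
    """Response of the same system to a unit impulse at n = 0."""
    impulse = [1.0] + [0.0] * (length - 1)
    return simulate_second_order(impulse, a1, a2, b)

def causal_convolution(x, h):
    """y[n] = sum_{k=0..n} h[k] * x[n-k]; the output never sees future inputs."""
    return [sum(h[k] * x[n - k] for k in range(n + 1)) for n in range(len(x))]
```

Because the system is linear and time-invariant, the recursion and the convolution with its impulse response give the same output, so a learned causal kernel (as in a TCN) can represent this kind of dynamics.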

[498] ELPG-DTFS: Prior-Guided Adaptive Time-Frequency Graph Neural Network for EEG Depression Diagnosis

Jingru Qiu,Jiale Liang,Xuanhan Fan,Mingda Zhang,Zhenli He

Main category: cs.CV

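TL;DR: Proposes ELPG-DTFS, a prior-guided adaptive time-frequency graph neural network for EEG-based screening of major depressive disorder (MDD), which reaches 97.63% accuracy on the MODMA dataset and outperforms existing state-of-the-art methods.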

Details

Motivation: Existing deep-learning-based EEG analysis methods treat spectra as static images, fix the inter-channel connectivity, and ignore neuroscience priors, which limits the accuracy and interpretability of depression diagnosis. Method: Proposes the ELPG-DTFS model with three innovations: (1) channel-band attention with cross-band mutual information; (2) a learnable adjacency matrix that models dynamic functional connectivity; (3) a residual knowledge-graph pathway that injects neuroscience priors. Result: On the 128-channel MODMA dataset (53 subjects), ELPG-DTFS achieves 97.63% accuracy and a 97.33% F1 score, surpassing the 2025 state-of-the-art ACM-GNN; ablations show that removing any module lowers F1 by up to 4.35%, confirming that the modules are complementary. Conclusion: ELPG-DTFS provides a highly accurate and interpretable framework for next-generation EEG-based MDD diagnosis. Abstract: Timely and objective screening of major depressive disorder (MDD) is vital, yet diagnosis still relies on subjective scales. Electroencephalography (EEG) provides a low-cost biomarker, but existing deep models treat spectra as static images, fix inter-channel graphs, and ignore prior knowledge, limiting accuracy and interpretability. We propose ELPG-DTFS, a prior-guided adaptive time-frequency graph neural network that introduces: (1) channel-band attention with cross-band mutual information, (2) a learnable adjacency matrix for dynamic functional links, and (3) a residual knowledge-graph pathway injecting neuroscience priors. On the 128-channel MODMA dataset (53 subjects), ELPG-DTFS achieves 97.63% accuracy and 97.33% F1, surpassing the 2025 state-of-the-art ACM-GNN. Ablation shows that removing any module lowers F1 by up to 4.35, confirming their complementary value. ELPG-DTFS thus offers a robust and interpretable framework for next-generation EEG-based MDD diagnostics.

[499] Vision At Night: Exploring Biologically Inspired Preprocessing For Improved Robustness Via Color And Contrast Transformations

Lorena Stracke,Lia Nimmermann,Shashank Agnihotri,Margret Keuper,Volker Blanz

Main category: cs.CV

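TL;DR: Proposes an input preprocessing method inspired by biological vision that applies Difference-of-Gaussians (DoG) filtering to RGB, grayscale, and opponent-color channels to enhance local contrast, improving the robustness of semantic segmentation under adverse conditions.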

Details

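Motivation: Inspired by the contrast enhancement and color opponency mechanisms of the human visual system, the work explores input preprocessing that improves the robustness of semantic segmentation in adverse environments without changing the model architecture or training procedure. Method: Applies Difference-of-Gaussians (DoG) filtering to RGB, grayscale, and opponent-color channels as preprocessing to enhance local contrast while preserving performance on the standard data distribution. Result: Experiments on Cityscapes, ACDC, and Dark Zurich show that the preprocessing maintains performance under normal conditions while markedly improving robustness in adverse conditions such as night, fog, and snow. Conclusion: The method is model-agnostic and lightweight, can be integrated into imaging pipelines, and supplies task-ready, robust inputs to downstream vision models in safety-critical settings. Abstract: Inspired by the human visual system's mechanisms for contrast enhancement and color opponency, we explore biologically motivated input preprocessing for robust semantic segmentation. By applying Difference-of-Gaussians (DoG) filtering to RGB, grayscale, and opponent-color channels, we enhance local contrast without modifying model architecture or training. Evaluations on Cityscapes, ACDC, and Dark Zurich show that such preprocessing maintains in-distribution performance while improving robustness to adverse conditions like night, fog, and snow. As this processing is model-agnostic and lightweight, it holds potential for integration into imaging pipelines, enabling imaging systems to deliver task-ready, robust inputs for downstream vision models in safety-critical environments.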
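Difference-of-Gaussians filtering, as named in this abstract, subtracts a wide Gaussian blur from a narrow one so that flat regions cancel and local contrast remains. The following pure-Python sketch operates on a single channel stored as a list of rows; the sigma values, the kernel radius of three sigmas, and the clamped border handling are assumptions made for illustration, since the abstract does not state the paper's filter parameters.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about three sigmas."""
    radius = max(1, int(3 * sigma + 0.5))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        width = len(row)
        out.append([
            sum(k * row[min(max(x + j - radius, 0), width - 1)]
                for j, k in enumerate(kernel))
            for x in range(width)
        ])
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable 2D blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def difference_of_gaussians(image, sigma_center=1.0, sigma_surround=2.0):
    """Narrow blur minus wide blur; a center-surround contrast response."""
    center = gaussian_blur(image, sigma_center)
    surround = gaussian_blur(image, sigma_surround)
    return [[c - s for c, s in zip(rc, rs)] for rc, rs in zip(center, surround)]
```

To follow the abstract's setup, the same function would be applied separately to each RGB, grayscale, or opponent-color channel before the image is passed to an unmodified segmentation model.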

[500] StreamForest: Efficient Online Video Understanding with Persistent Event Memory

Xiangyu Zeng,Kefan Qiu,Qingyu Zhang,Xinhao Li,Jing Wang,Jiaxin Li,Ziang Yan,Kun Tian,Meng Tian,Xinhai Zhao,Yi Wang,Limin Wang

Main category: cs.CV

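TL;DR: Proposes StreamForest, a new architecture for streaming video understanding that improves long-term memory and real-time perception through a Persistent Event Memory Forest and a Fine-grained Spatiotemporal Window. The work also builds OnlineIT, a dedicated instruction-tuning dataset, and ODV-Bench, an evaluation benchmark for autonomous driving scenarios. Experiments show state-of-the-art performance on multiple benchmarks together with high robustness and efficiency.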

Details

Motivation: In streaming video understanding, existing multimodal large language models are limited by the storage pressure of historical visual features and by insufficient real-time spatiotemporal reasoning, and so fall short of practical requirements. Method: Proposes the StreamForest architecture, which includes a Persistent Event Memory Forest (organizing frames into event-level tree structures that are managed adaptively by penalty functions based on temporal distance, content similarity, and merge frequency) and a Fine-grained Spatiotemporal Window (capturing short-term visual cues to strengthen perception of the current scene), and builds the OnlineIT instruction-tuning dataset and the ODV-Bench benchmark. Result: StreamForest reaches accuracies of 77.3%, 60.5%, and 55.6% on StreamingBench, OVBench, and OVO-Bench respectively, and under extreme visual token compression (only 1024 tokens) still retains 96.8% of the average accuracy of the default setting. Conclusion: StreamForest shows excellent robustness, efficiency, and generalization in streaming video understanding and clearly outperforms existing methods. Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench.
In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.

[501] Environment-Aware Satellite Image Generation with Diffusion Models

Nikos Kostagiolas,Pantelis Georgiades,Yannis Panagakis,Mihalis A. Nicolaou

Main category: cs.CV

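TL;DR: Proposes a novel diffusion model conditioned on environmental context that generates satellite images under joint text, metadata, and visual conditioning, significantly improving generation quality and robustness.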

Details

Motivation: Existing generative models for remote sensing make limited use of environmental context, are sensitive to missing data, and often fail to reflect user intent accurately. Method: Designs a novel diffusion model that takes text, metadata, and visual signals as control inputs, and introduces a metadata fusion strategy to handle missing or corrupted data. Result: On single-image and temporal generation tasks, both qualitative and quantitative results surpass existing methods, with 6 metrics showing higher fidelity, accuracy, and generation quality. Conclusion: Conditioning on environmental context helps improve the performance of remote sensing foundation models, and the proposed method offers a reliable generation tool for downstream applications. Abstract: Diffusion-based foundation models have recently garnered much attention in the field of generative modeling due to their ability to generate images of high quality and fidelity. Although not straightforward, their recent application to the field of remote sensing signaled the first successful trials towards harnessing the large volume of publicly available datasets containing multimodal information. Despite their success, existing methods face considerable limitations: they rely on limited environmental context, struggle with missing or corrupted data, and often fail to reliably reflect user intentions in generated outputs. In this work, we propose a novel diffusion model conditioned on environmental context that is able to generate satellite images by conditioning on any combination of three different control signals: a) text, b) metadata, and c) visual data. In contrast to previous works, the proposed method i) is, to our knowledge, the first of its kind to condition satellite image generation on dynamic environmental conditions as part of its control signals, and ii) incorporates a metadata fusion strategy that models attribute embedding interactions to account for partially corrupt and/or missing observations. Our method outperforms previous methods both qualitatively (robustness to missing metadata, higher responsiveness to control inputs) and quantitatively (higher fidelity, accuracy, and quality of generations measured using 6 different metrics) in the trials of single-image and temporal generation.
The reported results support our hypothesis that conditioning on environmental context can improve the performance of foundation models for satellite imagery, and render our model a promising candidate for use in downstream tasks. The collected 3-modal dataset is, to our knowledge, the first publicly available dataset to combine data from these three different mediums.

[502] ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation

Jiuhong Xiao,Roshan Nayak,Ning Zhang,Daniel Tortei,Giuseppe Loianno

Main category: cs.CV

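TL;DR: Proposes ThermalGen, an adaptive flow-based generative model for RGB-to-thermal (RGB-T) image translation that addresses the scarcity of paired data and synthesizes high-quality thermal images under diverse conditions.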

Details

Motivation: The scarcity of synchronized and calibrated RGB-thermal image pairs limits progress in visual-thermal sensor fusion and cross-modality tasks, so an effective way to generate thermal images from abundant RGB data is needed. Method: Proposes the ThermalGen model, a flow-based generative architecture combined with an RGB image conditioning design and a style-disentangled mechanism; curates eight public RGB-T paired datasets and adds three new large-scale satellite-aerial RGB-T datasets for training and evaluation. Result: Experiments on multiple RGB-T benchmarks show that ThermalGen matches or exceeds existing GAN-based and diffusion-based methods in translation performance and can synthesize thermal images with significant variation in viewpoint, sensor characteristics, and environmental conditions. Conclusion: ThermalGen is the first RGB-T translation model able to generate high-quality thermal images under diverse real-world conditions, providing an effective data augmentation solution for multimodal perception tasks. Abstract: Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, including important applications such as multi-modal image alignment and retrieval. However, the scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle to progress in these areas. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution, enabling the synthesis of thermal images from abundant RGB datasets for training purposes. In this study, we propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation, incorporating an RGB image conditioning architecture and a style-disentangled mechanism. To support large-scale training, we curated eight public satellite-aerial, aerial, and ground RGB-T paired datasets, and introduced three new large-scale satellite-aerial RGB-T datasets (DJI-day, Bosonplus-day, and Bosonplus-night) captured across diverse times, sensor types, and geographic regions. Extensive evaluations across multiple RGB-T benchmarks demonstrate that ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods. To our knowledge, ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions. Project page: http://xjh19971.github.io/ThermalGen

[503] Vehicle Classification under Extreme Imbalance: A Comparative Study of Ensemble Learning and CNNs

Abu Hanif Muhammad Syarubany

Main category: cs.CV

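TL;DR: Merges several datasets and applies resampling to build balanced variants, then compares lightweight ensemble models with a deep convolutional neural network for vehicle type recognition. The deep model performs better, but extremely rare classes (such as Barge) remain hard to recognize, indicating that more minority-class data collection and strategies such as cost-sensitive learning are needed.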

Details

Motivation: Severe class imbalance in public datasets suppresses recognition performance on rare vehicle categories, harming the accuracy of intelligent transportation and logistics systems. Method: Merges Kaggle, ImageNet, and web-crawled data into a 16-class dataset of about 47k images, and creates six balanced variants with SMOTE oversampling and targeted undersampling; compares lightweight ensembles built on MobileNet-V2 features (Random Forest, AdaBoost, and soft voting) with a configurable ResNet-style CNN trained with strong augmentation and label smoothing. Result: The best ensemble (SMOTE-combined) reaches 74.8% test accuracy, while the CNN reaches 79.19% on the full test set and 81.25% on an unseen inference batch, showing the advantage of deep models; the scarcest class, Barge, remains a failure mode. Conclusion: Rebalancing alone is not enough to solve extreme class imbalance; priority should go to collecting more minority-class samples, adopting cost-sensitive objectives (such as focal loss), and exploring hybrid pipelines that combine interpretability with representational power. Abstract: Accurate vehicle type recognition underpins intelligent transportation and logistics, but severe class imbalance in public datasets suppresses performance on rare categories. We curate a 16-class corpus (~47k images) by merging Kaggle, ImageNet, and web-crawled data, and create six balanced variants via SMOTE oversampling and targeted undersampling. Lightweight ensembles, such as Random Forest, AdaBoost, and a soft-voting combiner built on MobileNet-V2 features, are benchmarked against a configurable ResNet-style CNN trained with strong augmentation and label smoothing. The best ensemble (SMOTE-combined) attains 74.8% test accuracy, while the CNN achieves 79.19% on the full test set and 81.25% on an unseen inference batch, confirming the advantage of deep models. Nonetheless, the most under-represented class (Barge) remains a failure mode, highlighting the limits of rebalancing alone. Results suggest prioritizing additional minority-class collection and cost-sensitive objectives (e.g., focal loss) and exploring hybrid ensemble or CNN pipelines to combine interpretability with representational power.
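SMOTE oversampling, which this abstract uses to build its balanced variants, creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that core step in plain Python. The function names, the neighbour count `k`, and the fixed seed are hypothetical, and the abstract does not say which feature space the authors resampled; a maintained implementation such as the one in the imbalanced-learn package would normally be preferred in practice.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic samples from a list of minority feature vectors."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen sample (excluding itself).
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + lam * (q - b) for b, q in zip(base, neighbour)])
    return synthetic
```

Since every synthetic point lies on a segment between two existing minority samples, SMOTE cannot add information about regions the minority class never covered, which is consistent with the abstract's finding that the rarest class stays a failure mode after rebalancing.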

[504] VAGUEGAN: Stealthy Poisoning and Backdoor Attacks on Image Generative Pipelines

Mostafa Mohaimen Akand Faisal,Rabeya Amin Jhuma

Main category: cs.CV

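TL;DR: Introduces VagueGAN, an attack pipeline that combines PoisonerNet with a generator-discriminator pair and uses latent-space poisoning to achieve stealthy, controllable manipulation of image generation in GANs and diffusion models. Experiments show that such attacks can preserve or even improve generated image quality, challenging the effectiveness of conventional pixel-level defenses.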

Details

Motivation: Generative models such as GANs and diffusion models are widely used for image synthesis and editing, but stealthy attacks on their generation pipelines are little studied. The paper explores how small, stealthy input perturbations can give targeted control over generated outputs, exposing blind spots in existing defenses. Method: Proposes the VagueGAN attack framework, which combines a modular perturbation network, PoisonerNet, with a generator-discriminator structure, injects triggers in latent space to achieve targeted image manipulation, and assesses stealth with perceptual and frequency-domain measures; transferability is further tested on a ControlNet-based diffusion pipeline. Result: The attack produces perturbations that are hard to perceive visually, and the poisoned outputs sometimes have higher visual quality than clean ones; latent-space poisoning achieves consistent targeted manipulation without reducing fidelity, posing a serious threat to modern generative models. Conclusion: Latent-space attacks can bypass conventional pixel-level defenses and achieve stealthy control without sacrificing, and sometimes while improving, visual quality, exposing a serious security weakness in current generative models and calling for a re-examination of existing defenses. Abstract: Generative models such as GANs and diffusion models are widely used to synthesize photorealistic images and to support downstream creative and editing tasks. While adversarial attacks on discriminative models are well studied, attacks targeting generative pipelines, where small, stealthy perturbations in inputs lead to controlled changes in outputs, are less explored. This study introduces VagueGAN, an attack pipeline combining a modular perturbation network, PoisonerNet, with a generator-discriminator pair to craft stealthy triggers that cause targeted changes in generated images. Attack efficacy is evaluated using a custom proxy metric, while stealth is analyzed through perceptual and frequency-domain measures. The transferability of the method to a modern diffusion-based pipeline is further examined through ControlNet-guided editing. Interestingly, the experiments show that poisoned outputs can display higher visual quality compared to clean counterparts, challenging the assumption that poisoning necessarily reduces fidelity. Unlike conventional pixel-level perturbations, latent-space poisoning in GANs and diffusion pipelines can retain or even enhance output aesthetics, exposing a blind spot in pixel-level defenses. Moreover, carefully optimized perturbations can produce consistent, stealthy effects on generator outputs while remaining visually inconspicuous, raising concerns for the integrity of image generation pipelines.

[505] DWGS: Enhancing Sparse-View Gaussian Splatting with Hybrid-Loss Depth Estimation and Bidirectional Warping

Yu Ma,Guoliang Wei,Yue Cheng

Main category: cs.CV

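TL;DR: Proposes the DWGS framework, which enhances novel view synthesis with 3D Gaussian Splatting under sparse views by integrating structural cues, virtual view constraints, and occluded region completion.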

Details

Motivation: To address overfitting, geometric distortion, and incomplete scene recovery in sparse-view 3D reconstruction, as well as the floating artifacts and structural inconsistencies of 3D Gaussian Splatting. Method: Introduces a Hybrid-Loss Depth Estimation module, a Bidirectional Warping Virtual View Synthesis method, and an Occlusion-Aware Reconstruction component, combining multi-view consistency, virtual view constraints, and a learning-based inpainting model. Result: Reaches state-of-the-art results on the standard LLFF, Blender, and DTU benchmarks, with up to 21.13 dB PSNR and 0.189 LPIPS, while retaining real-time inference. Conclusion: DWGS effectively improves novel view synthesis quality under sparse inputs and performs strongly in geometric consistency and completeness. Abstract: Novel View Synthesis (NVS) from sparse views remains a core challenge in 3D reconstruction, typically suffering from overfitting, geometric distortion, and incomplete scene recovery due to limited multi-view constraints. Although 3D Gaussian Splatting (3DGS) enables real-time, high-fidelity rendering, it suffers from floating artifacts and structural inconsistencies under sparse-input settings. To address these issues, we propose DWGS, a novel unified framework that enhances 3DGS for sparse-view synthesis by integrating robust structural cues, virtual view constraints, and occluded region completion. Our approach introduces three principal contributions: a Hybrid-Loss Depth Estimation module that leverages dense matching priors with reprojection, point propagation, and smoothness constraints to enforce multi-view consistency; a Bidirectional Warping Virtual View Synthesis method that generates virtual training views to impose stronger geometric and photometric constraints; and an Occlusion-Aware Reconstruction component that uses a depth-difference mask and a learning-based inpainting model to recover obscured regions. Extensive experiments on standard benchmarks (LLFF, Blender, and DTU) show that DWGS sets a new state of the art, reaching up to 21.13 dB PSNR and 0.189 LPIPS, while retaining real-time inference capabilities.

[506] DAM: Dual Active Learning with Multimodal Foundation Model for Source-Free Domain Adaptation

Xi Chen,Hongxun Yao,Zhaopan Xu,Kui Jiang

Main category: cs.CV

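TL;DR: Proposes DAM, a new framework that combines multimodal supervision from a vision-language model with human annotations to form a dual supervisory signal, significantly improving source-free active domain adaptation.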

Details

Motivation: Existing methods fail to fuse vision-language model supervision with data supervision effectively, which limits pseudo-label quality and feature alignment. Method: Proposes the DAM framework, which uses a vision-language model to produce stable initial targets and applies a bidirectional distillation mechanism to promote mutual knowledge exchange between the target model and the dual supervisory signals during iterative adaptation. Result: Across multiple SFADA benchmarks and active learning strategies, DAM consistently outperforms existing methods and sets a new state of the art. Conclusion: By effectively fusing multimodal supervision with sparse human annotations, DAM significantly improves source-free active domain adaptation and validates the dual supervisory signal and the bidirectional distillation mechanism. Abstract: Source-free active domain adaptation (SFADA) enhances knowledge transfer from a source model to an unlabeled target domain using limited manual labels selected via active learning. While recent domain adaptation studies have introduced Vision-and-Language (ViL) models to improve pseudo-label quality or feature alignment, they often treat ViL-based and data supervision as separate sources, lacking effective fusion. To overcome this limitation, we propose Dual Active learning with Multimodal (DAM) foundation model, a novel framework that integrates multimodal supervision from a ViL model to complement sparse human annotations, thereby forming a dual supervisory signal. DAM initializes stable ViL-guided targets and employs a bidirectional distillation mechanism to foster mutual knowledge exchange between the target model and the dual supervisions during iterative adaptation. Extensive experiments demonstrate that DAM consistently outperforms existing methods and sets a new state of the art across multiple SFADA benchmarks and active learning strategies.

[507] Accurate Cobb Angle Estimation via SVD-Based Curve Detection and Vertebral Wedging Quantification

Chang Shi,Nan Meng,Yipeng Zhuang,Moxin Zhao,Jason Pui Yin Cheung,Hua Huang,Xiuyuan Chen,Cong Nie,Wenting Zhong,Guiqiang Jiang,Yuxin Wei,Jacob Hong Man Yu,Si Chen,Xiaowen Ou,Teng Zhang

Main category: cs.CV

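TL;DR: Proposes a new deep learning framework for scoliosis assessment that combines HRNet with Swin-Transformer modules to predict vertebral endplate angles and coordinates accurately, introduces the Vertebral Wedging Index (VWI) to quantify deformation, and achieves accurate diagnosis and good generalization on 630 radiographs.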

Details

Motivation: Traditional Cobb angle measurement suffers from large observer variability and reliance on manual work, and existing automated methods use simplified models and predetermined curve patterns that cannot handle clinical complexity, so a more accurate, flexible, and anatomically faithful method for AIS assessment is needed. Method: Uses a network that combines HRNet with Swin-Transformer modules to predict the superior and inferior endplate angles and midpoint coordinates of each vertebra, applies Singular Value Decomposition (SVD) to analyze angles directly from vertebral morphology, adds biomechanical constraints to improve feature extraction, and proposes the Vertebral Wedging Index (VWI) as a new quantitative metric. Result: On 630 full-spine anteroposterior radiographs from patients aged 10-18, the method reaches 83.45% diagnostic accuracy and a mean absolute error of 2.55°, and generalizes well to out-of-distribution data; longitudinal analysis shows that VWI correlates significantly with curve progression, whereas the traditional Cobb angle does not. Conclusion: The framework assesses AIS severity more accurately and consistently, and VWI, as a new biomarker, has good prognostic value that supports early detection, personalized treatment, and progression monitoring. Abstract: Adolescent idiopathic scoliosis (AIS) is a common spinal deformity affecting approximately 2.2% of boys and 4.8% of girls worldwide. The Cobb angle serves as the gold standard for AIS severity assessment, yet traditional manual measurements suffer from significant observer variability, compromising diagnostic accuracy. Despite prior automation attempts, existing methods use simplified spinal models and predetermined curve patterns that fail to address clinical complexity. We present a novel deep learning framework for AIS assessment that simultaneously predicts both superior and inferior endplate angles with corresponding midpoint coordinates for each vertebra, preserving the anatomical reality of vertebral wedging in progressive AIS. Our approach combines an HRNet backbone with Swin-Transformer modules and biomechanically informed constraints for enhanced feature extraction. We employ Singular Value Decomposition (SVD) to analyze angle predictions directly from vertebral morphology, enabling flexible detection of diverse scoliosis patterns without predefined curve assumptions. Using 630 full-spine anteroposterior radiographs from patients aged 10-18 years with rigorous dual-rater annotation, our method achieved 83.45% diagnostic accuracy and 2.55° mean absolute error. The framework demonstrates exceptional generalization capability on out-of-distribution cases. Additionally, we introduce the Vertebral Wedging Index (VWI), a novel metric quantifying vertebral deformation.
Longitudinal analysis revealed VWI's significant prognostic correlation with curve progression while traditional Cobb angles showed no correlation, providing robust support for early AIS detection, personalized treatment planning, and progression monitoring.
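
As a rough illustration of how per-vertebra endplate angles could be turned into a curve measurement, the sketch below computes a Cobb-style angle as the largest tilt difference between a superior endplate and an inferior endplate at the same or a lower vertebra. This is the textbook Cobb definition applied to hypothetical angle lists, not the paper's SVD-based procedure; the function name and the sample values are assumptions made for the example.

```python
def cobb_angle(superior_deg, inferior_deg):
    """Return (angle, upper_index, lower_index) for the largest tilt
    difference between a superior endplate and an inferior endplate
    located at the same or a lower vertebra. Angles are in degrees."""
    if len(superior_deg) != len(inferior_deg):
        raise ValueError("need one superior and one inferior angle per vertebra")
    best = (0.0, None, None)
    for i, upper in enumerate(superior_deg):
        for j in range(i, len(inferior_deg)):
            diff = abs(upper - inferior_deg[j])
            if diff > best[0]:
                best = (diff, i, j)
    return best


# Hypothetical endplate tilts, listed from the top vertebra downward.
superior = [2.0, 8.0, 15.0, 9.0, -3.0, -12.0]
inferior = [5.0, 12.0, 11.0, 1.0, -9.0, -14.0]

angle, upper_idx, lower_idx = cobb_angle(superior, inferior)
print(f"Cobb-style angle: {angle:.1f} deg between vertebra {upper_idx} and {lower_idx}")
```

Searching all endplate pairs, rather than assuming a fixed curve pattern, mirrors the abstract's point about avoiding predefined curve assumptions, although the actual selection rule used by the authors is not given here.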

[508] Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Mohsen Ghafoorian,Denis Korzhenkov,Amirhossein Habibian

Main category: cs.CV

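TL;DR: Proposes Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained video diffusion models that substantially cuts compute cost while maintaining generation quality.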

Details

Motivation: Transformer-based video diffusion models are expensive for long sequences and high resolutions because self-attention has quadratic complexity, and existing linear attention methods struggle to match softmax attention without costly retraining. Method: The approach combines a hybrid attention mechanism (mixing softmax and linear tokens), a lightweight distillation and fine-tuning pipeline, and a cost-aware block-rate strategy to replace attention efficiently in a pretrained model. Result: Applied to Wan2.1 1.3B, it reduces attention cost by up to 40% (in FLOPs) while maintaining generation quality on VBench and VBench-2.0, yielding the first competitive sub-quadratic video diffusion models. Conclusion: Attention Surgery offers an efficient, practical way to optimize attention in large video diffusion models, delivering clear efficiency gains without training from scratch. Abstract: Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism (mixing softmax and linear tokens) with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery achieves the first competitive sub-quadratic attention video diffusion models, reducing attention cost by up to 40% in terms of FLOPs, while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks.
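
To make the quadratic versus sub-quadratic contrast concrete, here is a minimal pure-Python sketch of standard softmax attention next to a generic kernelized linear attention. The feature map `phi` (elu(x) + 1) is a common choice from the linear-attention literature and is an assumption here; the sketch does not reproduce the paper's hybrid mechanism, distillation, or block-rate strategy.

```python
import math


def softmax_attention(Q, K, V):
    """Standard attention: every query scores every key, so cost is O(N^2)."""
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(Q, K, V):
    """Kernelized attention: summarize keys and values once, then each
    query costs O(d * d_v), giving O(N) overall in the token count N."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k) v^T
    z = [0.0] * d                       # running sum of phi(k)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for c in range(dv):
                S[a][c] += fk[a] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][c] for a in range(d)) / denom
                    for c in range(dv)])
    return out


# Toy inputs: three tokens with two-dimensional queries, keys, and values.
Q = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.9]]
K = [[0.3, 0.1], [-0.6, 0.2], [0.7, -0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

print(softmax_attention(Q, K, V))
print(linear_attention(Q, K, V))
```

The two functions return different values because they use different similarity kernels, which is the expressiveness gap the abstract refers to. A hybrid scheme would presumably route some tokens through one path and the rest through the other.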

[509] OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

Zhihong Chen,Xuehai Bai,Yang Shi,Chaoyou Fu,Huanyu Zhang,Haotian Wang,Xiaoyan Sun,Zhang Zhang,Liang Wang,Yuanxing Zhang,Pengfei Wan,Yi-Fan Zhang

Main category: cs.CV

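TL;DR: Presents OpenGPT-4o-Image, a large-scale, systematically organized multimodal dataset for image generation and editing, built with a hierarchical task taxonomy and automated generation, that significantly improves model performance.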

Details

Motivation: Existing datasets fall short in task coverage and challenging scenarios, which limits the development of unified multimodal models, so more systematic, higher-quality training data is needed. Method: A new methodology combining a hierarchical task taxonomy with automated generation uses structured resource pools and GPT-4o to build a dataset of 80k instruction-image pairs covering 11 domains and 51 subtasks. Result: Fine-tuning on the dataset yields significant gains on multiple benchmarks, up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Conclusion: Systematic data construction is key to advancing multimodal AI capabilities. Abstract: The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.

Songze Li,Zun Wang,Gengze Zhou,Jialu Li,Xiangyu Zeng,Limin Wang,Yu Qiao,Qi Wu,Mohit Bansal,Yi Wang

Main category: cs.CV

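TL;DR: Proposes SID, a self-improving demonstration approach for goal-oriented language-guided navigation that iteratively generates more exploratory trajectories to improve agent navigation performance.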

Details

Motivation: Existing methods rely mainly on shortest-path trajectories and lack effective exploration priors, which leaves agents poorly equipped to navigate unknown environments. Method: SID first trains an initial agent on shortest-path data, then uses that agent to generate new exploration trajectories, iterating so that each round's higher-quality demonstrations train a better agent. Result: Experiments show that SID significantly improves exploration and generalization and reaches state-of-the-art results on several language-guided navigation tasks, notably a 50.9% success rate on the unseen validation splits of SOON, 13.9% above the previous best approach. Conclusion: SID's self-improving demonstrations strengthen exploration in language-guided navigation, and the approach scales to new environments and transfers across tasks. Abstract: Goal-oriented language-guided navigation requires robust exploration capabilities for agents to navigate to specified goals in unknown environments without step-by-step instructions. Existing methods tend to exclusively utilize shortest-path trajectories, lacking effective exploration priors for training navigation agents. To address the above challenges, we present SID, a goal-oriented language-guided navigation learning approach with Self-Improving Demonstrations. Specifically, SID learns an initial agent on the shortest-path data sampled from environments and then leverages this agent to generate novel exploration trajectories. The novel rollouts provide demonstrations with stronger exploration strategies to train a better agent, which in turn produces higher-quality agent demonstrations for the next round of training. We show that this iterative self-improving pipeline readily scales to new environments, and the resulting demonstrations can be transferred across a variety of language-guided navigation tasks, elevating the performance ceiling in diverse goal-oriented navigation tasks. Extensive experiments demonstrate that SID significantly boosts the exploration capabilities and generalization of navigation agents. The resulting agent achieves new state-of-the-art performance on goal-oriented language-guided navigation tasks, including REVERIE and SOON, notably achieving a 50.9% success rate on the unseen validation splits of SOON, surpassing the prior leading approaches by a margin of 13.9%.

[511] Segmentor-Guided Counterfactual Fine-Tuning for Image Synthesis

Tian Xia,Matthew Sinclair,Andreas Schuh,Fabio De Sousa Ribeiro,Raghav Mehta,Rajat Rasal,Esther Puyol-Antón,Samuel Gerber,Kersten Petersen,Michiel Schaap,Ben Glocker

Main category: cs.CV

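TL;DR: Proposes Seg-CFT, a method for generating structure-specific counterfactual images that intervenes effectively on scalar variables while preserving local coherence, with applications to medical image generation (such as chest radiographs) and disease modeling.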

Details

Motivation: For structure-specific interventions, existing counterfactual image generation methods either depend on external classifiers or regressors, which tends to cause undesirable global effects, or require tedious manual segmentation annotations, which limits the effectiveness and practicality of local interventions. Method: Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT) uses a segmentor to guide counterfactual fine-tuning, enabling precise intervention on structure-specific variables (such as left lung area) without relying on pixel-level label maps. Result: The method generates realistic chest radiographs and shows promising results for modeling coronary artery disease; the generated counterfactuals are locally coherent and effective. Conclusion: Seg-CFT produces locally coherent, structure-specific counterfactual images without complex annotation, improves on approaches that rely on external models or manual segmentation, and has strong potential for clinical use. Abstract: Counterfactual image generation is a powerful tool for augmenting training data, de-biasing datasets, and modeling disease. Current approaches rely on external classifiers or regressors to increase the effectiveness of subject-level interventions (e.g., changing the patient's age). For structure-specific interventions (e.g., changing the area of the left lung in a chest radiograph), we show that this is insufficient, and can result in undesirable global effects across the image domain. Previous work used pixel-level label maps as guidance, requiring a user to provide hypothetical segmentations which are tedious and difficult to obtain. We propose Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT), which preserves the simplicity of intervening on scalar-valued, structure-specific variables while producing locally coherent and effective counterfactuals. We demonstrate the capability of generating realistic chest radiographs, and we show promising results for modeling coronary artery disease. Code: https://github.com/biomedia-mira/seg-cft.

[512] Scalable GANs with Transformers

Sangeek Hyun,MinKyu Lee,Jae-Pil Heo

Main category: cs.CV

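TL;DR: Proposes GAT, a generative adversarial network built on a pure transformer architecture and trained in latent space; lightweight intermediate supervision and width-aware learning-rate adjustment address the underuse of early layers and the optimization instability that appear when scaling GANs, reaching a state-of-the-art FID of 2.96 on ImageNet-256 in only 40 epochs.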

Details

Motivation: To explore the scalability of generative adversarial networks (GANs) by borrowing designs that work in other generative models, namely latent-space training and pure transformer architectures, in order to improve training efficiency and generation quality. Method: GANs are trained in the compact latent space of a variational autoencoder with pure transformers as generator and discriminator; lightweight intermediate supervision and width-aware learning-rate adjustment address the problems that arise during scaling. Result: GAT trains stably across a range of capacities; GAT-XL/2 reaches an FID of 2.96 on ImageNet-256 in only 40 epochs, 6x fewer than strong baselines, the best reported single-step class-conditional generation performance. Conclusion: With sound design choices and simple fixes, pure-transformer GANs scale efficiently and stably in latent space, markedly improving training efficiency and generation quality. Abstract: Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues such as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions such as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based, latent-space GAN, can be trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines.

[513] Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents

Jiahua Li,Kun Wei,Zhe Xu,Zibo Su,Xu Yang,Cheng Deng

Main category: cs.CV

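TL;DR: Proposes CogniGPT, a long-video understanding framework inspired by human progressive visual cognition, in which a Multi-Granular Perception Agent (MGPA) and a Verification-Enhanced Reflection Agent (VERA) interact in a loop to capture task-critical information efficiently and reliably, with strong results on multiple datasets.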

Details

Motivation: Existing LLM-based approaches to long-video understanding struggle to balance completeness and efficiency and fail to capture task-critical information effectively. Method: An interactive framework pairs a Multi-Granular Perception Agent (MGPA) with a Verification-Enhanced Reflection Agent (VERA): MGPA mimics human divergent and focused attention to extract task-related information, while VERA verifies key clues and refines the subsequent perception strategy, so the system progressively explores a minimal set of reliable task-related clues. Result: Extensive experiments on EgoSchema, Video-MME, NExT-QA, and MovieChat show that CogniGPT outperforms existing methods in both accuracy and efficiency; on EgoSchema in particular it surpasses existing training-free methods using only 11.2 frames and matches Gemini 1.5-Pro. Conclusion: By mimicking human cognition, CogniGPT achieves efficient and reliable long-video understanding with very few frames, offering a new and effective paradigm for the task. Abstract: Long videos, characterized by temporal complexity and sparse task-relevant information, pose significant reasoning challenges for AI systems. Although various Large Language Model (LLM)-based approaches have advanced long video understanding, they still struggle to achieve both completeness and efficiency in capturing task-critical information. Inspired by human progressive visual cognition, we propose CogniGPT, a framework that leverages an interactive loop between Multi-Granular Perception Agent (MGPA) and Verification-Enhanced Reflection Agent (VERA) for efficient and reliable long video understanding. Specifically, MGPA mimics human visual divergent and focused attention to capture task-related information, while VERA verifies perceived key clues to mitigate hallucination and optimize subsequent perception strategies. Through this interactive process, CogniGPT explores a minimal set of informative and reliable task-related clues. Extensive experiments on EgoSchema, Video-MME, NExT-QA, and MovieChat datasets demonstrate CogniGPT's superiority in both accuracy and efficiency. Notably, on EgoSchema, it surpasses existing training-free methods using only 11.2 frames and achieves performance comparable to Gemini 1.5-Pro.

[514] Evaluating Temperature Scaling Calibration Effectiveness for CNNs under Varying Noise Levels in Brain Tumour Detection

Ankur Chanda,Kushan Choudhury,Shubhrodeep Roy,Shubhajit Biswas,Somenath Kuiry

Main category: cs.CV

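TL;DR: Studies how well Temperature Scaling (TS) calibrates convolutional neural networks (CNNs) for brain tumor classification, using several kinds of image noise to simulate real-world uncertainty; TS significantly reduces expected calibration error (ECE) and negative log-likelihood (NLL) without harming classification accuracy.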

Details

Motivation: Deep learning in high-stakes fields such as medical imaging needs precise confidence estimates, because overconfident misclassifications can have serious consequences. Method: A custom CNN is trained on a merged brain MRI dataset, Temperature Scaling is applied as a post-hoc calibration method, and five types of image noise (Gaussian, Poisson, Salt & Pepper, Speckle, and Uniform) simulate uncertain conditions. Result: Under all noise conditions, TS significantly reduces ECE and NLL, with no drop in classification accuracy. Conclusion: TS is an effective, computationally cheap way to make the decisions of medical AI systems more reliable in noisy or uncertain settings. Abstract: Precise confidence estimation in deep learning is vital for high-stakes fields like medical imaging, where overconfident misclassifications can have serious consequences. This work evaluates the effectiveness of Temperature Scaling (TS), a post-hoc calibration technique, in improving the reliability of convolutional neural networks (CNNs) for brain tumor classification. We develop a custom CNN and train it on a merged brain MRI dataset. To simulate real-world uncertainty, five types of image noise are introduced: Gaussian, Poisson, Salt & Pepper, Speckle, and Uniform. Model performance is evaluated using precision, recall, F1-score, accuracy, negative log-likelihood (NLL), and expected calibration error (ECE), both before and after calibration. Results demonstrate that TS significantly reduces ECE and NLL under all noise conditions without degrading classification accuracy. This underscores TS as an effective and computationally efficient approach to enhance decision confidence of medical AI systems, hence making model outputs more reliable in noisy or uncertain settings.
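
Temperature scaling itself is simple enough to sketch in a few lines: divide the logits by a scalar T chosen to minimize NLL on held-out data, then compare ECE before and after. The grid search, the ten-bin ECE, and the toy logits below are assumptions for illustration; the paper's CNN, dataset, and fitting procedure are not reproduced.

```python
import math


def softmax(logits, T=1.0):
    """Softmax of logits divided by temperature T."""
    m = max(logits)
    e = [math.exp((z - m) / T) for z in logits]
    s = sum(e)
    return [x / s for x in e]


def nll(logit_rows, labels, T=1.0):
    """Mean negative log-likelihood of the true labels."""
    return -sum(math.log(softmax(z, T)[y])
                for z, y in zip(logit_rows, labels)) / len(labels)


def fit_temperature(logit_rows, labels):
    """Pick T from a coarse grid over (0, 10] by minimizing validation NLL."""
    grid = [0.05 * k for k in range(1, 201)]
    return min(grid, key=lambda T: nll(logit_rows, labels, T))


def ece(logit_rows, labels, T=1.0, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum conf, sum correct
    for z, y in zip(logit_rows, labels):
        p = softmax(z, T)
        conf = max(p)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += float(p.index(conf) == y)
    # (n_b / n) * |acc_b - conf_b| simplifies to |sum_correct - sum_conf| / n
    return sum(abs(correct - conf) for cnt, conf, correct in bins if cnt) / len(labels)


# Toy validation logits from an overconfident two-class model (hypothetical).
val_logits = [[4.0, 0.0], [3.5, 0.0], [0.0, 4.2], [3.8, 0.0], [0.0, 3.9], [4.1, 0.0]]
val_labels = [0, 1, 1, 0, 0, 0]

T = fit_temperature(val_logits, val_labels)
print("fitted T:", round(T, 2))
print("ECE before:", round(ece(val_logits, val_labels), 3))
print("ECE after: ", round(ece(val_logits, val_labels, T), 3))
```

Because dividing every logit by the same positive T never changes the argmax, accuracy is untouched by construction, which is consistent with the abstract's observation that calibration improves without degrading classification accuracy.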

Ermanno Bartoli,Dennis Rotondi,Buwei He,Patric Jensfelt,Kai O. Arras,Iolanda Leite

Main category: cs.CV

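TL;DR: Proposes Social 3D Scene Graphs, an augmented 3D scene graph representation that captures the humans in an environment along with their attributes, activities, and relationships within an open-vocabulary framework, and introduces a new benchmark with detailed human-scene relationship annotations; experiments show improved human activity prediction and reasoning about human-environment relations.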

Details

Motivation: For robots to act in socially compliant, context-aware ways, they must understand how people interact with their surroundings and with each other. Existing 3D scene graph methods, however, mostly ignore the humans in the scene and are limited to open-vocabulary relation recognition from single image frames, which makes long-range interactions hard to model. Method: Social 3D Scene Graphs extend conventional 3D scene graphs to include humans and their attributes, activities, and local and remote relationships within an open-vocabulary framework; a new benchmark of synthetic environments with diverse queries evaluates 3D social scene understanding. Result: Experiments show that the representation outperforms existing methods on human activity prediction and human-environment relation reasoning, improving social scene understanding. Conclusion: Social 3D Scene Graphs provide a useful semantic representation and evaluation basis for socially intelligent robots, advancing contextual understanding and decision-making in complex social environments. Abstract: Understanding how people interact with their surroundings and each other is essential for enabling robots to act in socially compliant and context-aware ways. While 3D Scene Graphs have emerged as a powerful semantic representation for scene understanding, existing approaches largely ignore humans in the scene, also due to the lack of annotated human-environment relationships. Moreover, existing methods typically capture only open-vocabulary relations from single image frames, which limits their ability to model long-range interactions beyond the observed content. We introduce Social 3D Scene Graphs, an augmented 3D Scene Graph representation that captures humans, their attributes, activities and relationships in the environment, both local and remote, using an open-vocabulary framework. Furthermore, we introduce a new benchmark consisting of synthetic environments with comprehensive human-scene relationship annotations and diverse types of queries for evaluating social scene understanding in 3D. The experiments demonstrate that our representation improves human activity prediction and reasoning about human-environment relations, paving the way toward socially intelligent robots.

Donghwa Kang,Junho Kim,Dongwoo Kang

Main category: cs.CV

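TL;DR: Proposes a new framework for event-based facial keypoint alignment, built on cross-modal fusion attention (CMFA) and self-supervised multi-event representation learning (SSMER), that clearly outperforms existing methods.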

Details

Motivation: Existing RGB facial keypoint alignment methods perform poorly on event data, training on event data alone is limited by its sparse spatial information, and comprehensive labeled event datasets are lacking. Method: A CMFA module fuses RGB data to guide the model toward robust features from event data, while SSMER adds self-supervised learning that exploits unlabeled event data to improve feature learning. Result: On the real-event E-SIE dataset and a synthetic-event version of the WFLW-V benchmark, the method clearly outperforms state-of-the-art methods across multiple evaluation metrics. Conclusion: The combined CMFA and SSMER framework addresses the challenges of facial keypoint alignment on event data and delivers better performance. Abstract: Event cameras offer unique advantages for facial keypoint alignment under challenging conditions, such as low light and rapid motion, due to their high temporal resolution and robustness to varying illumination. However, existing RGB facial keypoint alignment methods do not perform well on event data, and training solely on event data often leads to suboptimal performance because of its limited spatial information. Moreover, the lack of comprehensive labeled event datasets further hinders progress in this area. To address these issues, we propose a novel framework based on cross-modal fusion attention (CMFA) and self-supervised multi-event representation learning (SSMER) for event-based facial keypoint alignment. Our framework employs CMFA to integrate corresponding RGB data, guiding the model to extract robust facial features from event input images. In parallel, SSMER enables effective feature learning from unlabeled event data, overcoming spatial limitations. Extensive experiments on our real-event E-SIE dataset and a synthetic-event version of the public WFLW-V benchmark show that our approach consistently surpasses state-of-the-art methods across multiple evaluation metrics.

[517] On-the-Fly Data Augmentation for Brain Tumor Segmentation

Ishika Jain,Siri Willems,Steven Latre,Tom De Schepper

Main category: cs.CV

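TL;DR: Proposes an on-the-fly data augmentation strategy based on pretrained generative adversarial networks (GliGANs) that dynamically inserts synthetic tumors during training, improving how well glioma segmentation models generalize across treatment stages.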

Details

Motivation: Training robust segmentation models that generalize across treatment stages requires large amounts of high-quality annotated data, which is scarce in practice, and storing large volumes of augmented 3D data is computationally expensive. Method: An on-the-fly augmentation strategy uses pretrained GliGANs to generate synthetic tumors and insert them into healthy brain images during training; three nnU-Net-based models and their ensembles are evaluated: a baseline without external augmentation, a regular on-the-fly augmented model, and a model with customized on-the-fly augmentation. Result: On the online BraTS 2025 validation platform, the three-model ensemble achieves lesion-wise Dice scores of 0.79 (ET), 0.749 (NETC), 0.872 (RC), 0.825 (SNFH), 0.79 (TC), and 0.88 (WT). Conclusion: The method ranked first in BraTS Lighthouse Challenge 2025 Task 1 (Adult Glioma Segmentation), which supports on-the-fly generative augmentation as a way to improve generalization and segmentation performance. Abstract: Robust segmentation across both pre-treatment and post-treatment glioma scans can be helpful for consistent tumor monitoring and treatment planning. BraTS 2025 Task 1 addresses this by challenging models to generalize across varying tumor appearances throughout the treatment timeline. However, training such generalized models requires access to diverse, high-quality annotated data, which is often limited. While data augmentation can alleviate this, storing large volumes of augmented 3D data is computationally expensive. To address these challenges, we propose an on-the-fly augmentation strategy that dynamically inserts synthetic tumors using pretrained generative adversarial networks (GliGANs) during training. We evaluate three nnU-Net-based models and their ensembles: (1) a baseline without external augmentation, (2) a regular on-the-fly augmented model, and (3) a model with customized on-the-fly augmentation. Built upon the nnU-Net framework, our pipeline leverages pretrained GliGAN weights and tumor insertion methods from prior challenge-winning solutions. An ensemble of the three models achieves lesion-wise Dice scores of 0.79 (ET), 0.749 (NETC), 0.872 (RC), 0.825 (SNFH), 0.79 (TC), and 0.88 (WT) on the online BraTS 2025 validation platform. This work ranked first in the BraTS Lighthouse Challenge 2025 Task 1: Adult Glioma Segmentation.
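
The storage argument for on-the-fly augmentation can be shown with a small generator that synthesizes augmented samples only when a batch is requested. In this sketch `paint_blob` is a hypothetical stand-in for a GAN-based tumor inserter, the images are tiny 2D grids rather than 3D MRI volumes, and `p_insert` is an assumed parameter; none of these come from the paper or from nnU-Net.

```python
import random


def paint_blob(image, mask, rng, size=2, intensity=1.0):
    """Hypothetical tumor inserter: copies the inputs and paints a square blob."""
    h, w = len(image), len(image[0])
    r0 = rng.randrange(h - size + 1)
    c0 = rng.randrange(w - size + 1)
    new_image = [row[:] for row in image]  # copy so the stored scan is untouched
    new_mask = [row[:] for row in mask]
    for r in range(r0, r0 + size):
        for c in range(c0, c0 + size):
            new_image[r][c] = intensity
            new_mask[r][c] = 1
    return new_image, new_mask


def on_the_fly_batches(scans, batch_size, p_insert, seed=0):
    """Yield batches forever; augmented samples exist only in memory."""
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            image, mask = rng.choice(scans)
            if rng.random() < p_insert:
                image, mask = paint_blob(image, mask, rng)
            batch.append((image, mask))
        yield batch


# One "healthy" 4x4 scan with an empty tumor mask.
healthy = ([[0.0] * 4 for _ in range(4)], [[0] * 4 for _ in range(4)])
batches = on_the_fly_batches([healthy], batch_size=3, p_insert=1.0)
first_batch = next(batches)
print([sum(map(sum, mask)) for _, mask in first_batch])
```

Since each synthetic sample is discarded after its batch, the disk footprint stays at the size of the original dataset, which is the trade-off the abstract describes.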

[518] Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel

Haotian Dong,Wenjing Wang,Chen Li,Di Lin

Main category: cs.CV

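TL;DR: Proposes Wan-Alpha, an RGBA video generation framework that learns the RGB and transparency (alpha) channels jointly; with a purpose-built variational autoencoder and a high-quality dataset, it outperforms existing methods in visual quality, motion realism, and transparency rendering.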

Details

Motivation: Existing RGBA video generation methods often neglect visual quality, which limits practical use, so a new framework is needed that improves both transparency and visual quality. Method: Wan-Alpha uses a variational autoencoder to encode the alpha channel into the RGB latent space and builds a high-quality, diverse RGBA video dataset to support training of a diffusion transformer. Result: The model outperforms state-of-the-art methods in visual quality, motion realism, and transparency rendering, and can generate semi-transparent objects, glowing effects, and fine details such as hair strands. Conclusion: By jointly modeling the RGB and alpha channels, Wan-Alpha substantially improves the quality and diversity of transparent video generation and has broad application potential. Abstract: RGBA video generation, which includes an alpha channel to represent transparency, is gaining increasing attention across a wide range of applications. However, existing methods often neglect visual quality, limiting their practical usability. In this paper, we propose Wan-Alpha, a new framework that generates transparent videos by learning both RGB and alpha channels jointly. We design an effective variational autoencoder (VAE) that encodes the alpha channel into the RGB latent space. Then, to support the training of our diffusion transformer, we construct a high-quality and diverse RGBA video dataset. Compared with state-of-the-art methods, our model demonstrates superior performance in visual quality, motion realism, and transparency rendering. Notably, our model can generate a wide variety of semi-transparent objects, glowing effects, and fine-grained details such as hair strands. The released model is available on our website: https://donghaotian123.github.io/Wan-Alpha/.

[519] SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation

Shuang Liang,Jing He,Chuanmeizhi Wang,Lejun Liao,Guo Zhang,Yingcong Chen,Yuan Yuan

Main category: cs.CV

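TL;DR: Proposes SDPose, a fine-tuning framework built on Stable Diffusion for human pose estimation that predicts keypoint heatmaps in latent space, adds a lightweight convolutional head and an RGB reconstruction branch, and achieves state-of-the-art results on cross-domain benchmarks.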

Details

Motivation: Pretrained diffusion models perform well on dense prediction tasks, but their potential for structured outputs such as human pose estimation remains underexplored. Method: SDPose predicts keypoint heatmaps directly in the image latent space of the Stable Diffusion U-Net, uses a lightweight convolutional pose head, and adds an auxiliary RGB reconstruction branch to improve cross-domain robustness. Result: With only one-fifth of the training schedule used by Sapiens, SDPose matches it on the COCO validation set and sets a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD; it is also demonstrated as a zero-shot pose annotator for controllable generation tasks. Conclusion: SDPose makes effective use of pretrained diffusion priors, achieving strong generalization and cross-domain robustness in human pose estimation, with potential zero-shot use in downstream generation tasks. Abstract: Pre-trained diffusion models provide rich multi-scale latent features and are emerging as powerful vision backbones. While recent works such as Marigold (Ke et al., 2024) and Lotus (He et al., 2024) adapt diffusion priors for dense prediction with strong cross-domain generalization, their potential for structured outputs (e.g., human pose estimation) remains underexplored. In this paper, we propose SDPose, a fine-tuning framework built upon Stable Diffusion to fully exploit pre-trained diffusion priors for human pose estimation. First, rather than modifying cross-attention modules or introducing learnable embeddings, we directly predict keypoint heatmaps in the SD U-Net's image latent space to preserve the original generative priors. Second, we map these latent features into keypoint heatmaps through a lightweight convolutional pose head, which avoids disrupting the pre-trained backbone. Finally, to prevent overfitting and enhance out-of-distribution robustness, we incorporate an auxiliary RGB reconstruction branch that preserves domain-transferable generative semantics. To evaluate robustness under domain shift, we further construct COCO-OOD, a style-transferred variant of COCO with preserved annotations. With just one-fifth of the training schedule used by Sapiens on COCO, SDPose attains parity with Sapiens-1B/2B on the COCO validation set and establishes a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD. Furthermore, we showcase SDPose as a zero-shot pose annotator for downstream controllable generation tasks, including ControlNet-based image synthesis and video generation, where it delivers qualitatively superior pose guidance.

[520] PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion

Yuyang Yin,HaoXiang Guo,Fangfu Liu,Mengyu Wang,Hanwen Liang,Eric Li,Yikai Wang,Xiaojie Jin,Yao Zhao,Yunchao Wei

Main category: cs.CV

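TL;DR: PanoWorld-X is a new framework for generating high-fidelity, controllable panoramic video with diverse camera trajectories, enabling free exploration of a 360-degree visual world.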

Details

Motivation: Existing methods are limited by a narrow field of view or insufficient camera controllability, so they cannot generate continuous, complete panoramic video, which restricts free exploration by users or agents. Method: PanoWorld-X first simulates camera trajectories in virtual 3D environments to build a large-scale dataset of panoramic video and exploration-route pairs, then introduces a Sphere-Aware Diffusion Transformer that reprojects equirectangular features onto the sphere to model geometric adjacency in latent space. Result: Experiments show that PanoWorld-X outperforms existing methods in motion range, control precision, and visual quality, with clear gains in visual fidelity and spatiotemporal continuity. Conclusion: PanoWorld-X generates high-quality, controllable panoramic video and has broad potential for real-world applications. Abstract: Generating a complete and explorable 360-degree visual world enables a wide range of downstream applications. While prior works have advanced the field, they remain constrained by either narrow field-of-view limitations, which hinder the synthesis of continuous and holistic scenes, or insufficient camera controllability that restricts free exploration by users or autonomous agents. To address this, we propose PanoWorld-X, a novel framework for high-fidelity and controllable panoramic video generation with diverse camera trajectories. Specifically, we first construct a large-scale dataset of panoramic video-exploration route pairs by simulating camera trajectories in virtual 3D environments via Unreal Engine. As the spherical geometry of panoramic data misaligns with the inductive priors from conventional video diffusion, we then introduce a Sphere-Aware Diffusion Transformer architecture that reprojects equirectangular features onto the spherical surface to model geometric adjacency in latent space, significantly enhancing visual fidelity and spatiotemporal continuity. Extensive experiments demonstrate that our PanoWorld-X achieves superior performance in various aspects, including motion range, control precision, and visual quality, underscoring its potential for real-world applications.
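
The mismatch between an equirectangular grid and spherical geometry is easy to see numerically. The sketch below maps pixel centers to unit vectors on the sphere and shows that the left and right image edges, far apart in the 2D grid, are neighbors on the sphere. The axis convention and function names are assumptions; this illustrates the geometry only, not the Sphere-Aware Diffusion Transformer.

```python
import math


def equirect_to_sphere(u, v, width, height):
    """Map the center of pixel (u, v) in an equirectangular image to a
    unit vector. Assumed convention: y is up, longitude 0 faces +z."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi    # in [-pi, pi)
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi   # +pi/2 at the top row
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))


def great_circle(p, q):
    """Angular distance in radians between two unit vectors."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    return math.acos(dot)


W, H, row = 8, 4, 1
left_edge = equirect_to_sphere(0, row, W, H)
right_edge = equirect_to_sphere(W - 1, row, W, H)
neighbor = equirect_to_sphere(1, row, W, H)

# Columns 0 and W-1 are W-1 pixels apart in the image but adjacent on the sphere.
print("edge-to-edge distance:", great_circle(left_edge, right_edge))
print("neighbor distance:    ", great_circle(left_edge, neighbor))
```

A model that treats the image as a flat grid sees the two edge columns as maximally distant, which is one way to read the abstract's remark about conventional video diffusion priors misaligning with panoramic data.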

[521] LVT: Large-Scale Scene Reconstruction via Local View Transformers

Tooba Imtiaz,Lucy Chai,Kathryn Heal,Xuan Luo,Jungyeon Park,Jennifer Dy,John Flynn

Main category: cs.CV

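TL;DR: Proposes the Local View Transformer (LVT), an architecture for large-scale scene reconstruction and novel view synthesis that avoids the quadratic attention of standard transformers and can reconstruct arbitrarily large, high-resolution scenes in a single forward pass.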

Details

Motivation: The quadratic complexity of standard transformers is hard to scale to large scenes, so a more efficient method is needed for large-scale 3D vision tasks. Method: Based on the insight that nearby views carry more useful information about the local scene, the model processes information within a local neighborhood around each view and attends to tokens in neighboring views through a new positional encoding that conditions on the relative geometric transformation between the query view and its neighbors. Result: The model reconstructs arbitrarily large, high-resolution scenes in a single forward pass and outputs a 3D Gaussian Splat scene representation with view-dependent color and opacity. Conclusion: The Local View Transformer addresses the scalability problem of transformers in large-scale 3D scenes and offers a new approach to efficient novel view synthesis and scene reconstruction. Abstract: Large transformer models are proving to be a powerful tool for 3D vision and novel view synthesis. However, the standard Transformer's well-known quadratic complexity makes it difficult to scale these methods to large scenes. To address this challenge, we propose the Local View Transformer (LVT), a large-scale scene reconstruction and novel view synthesis architecture that circumvents the need for the quadratic attention operation. Motivated by the insight that spatially nearby views provide more useful signal about the local scene composition than distant views, our model processes all information in a local neighborhood around each view. To attend to tokens in nearby views, we leverage a novel positional encoding that conditions on the relative geometric transformation between the query and nearby views. We decode the output of our model into a 3D Gaussian Splat scene representation that includes both color and opacity view-dependence. Taken together, the Local View Transformer enables reconstruction of arbitrarily large, high-resolution scenes in a single forward pass. See our project page for results and interactive demos: https://toobaimt.github.io/lvt/.

[522] CLASP: Adaptive Spectral Clustering for Unsupervised Per-Image Segmentation

Max Curie,Paulo da Costa

Main category: cs.CV

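TL;DR: CLASP is a lightweight, training-free framework for unsupervised image segmentation that extracts features with a self-supervised ViT encoder, applies spectral clustering, and refines boundaries with a dense conditional random field, matching existing methods on COCO Stuff and ADE20K.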

Details

Motivation: To develop an unsupervised image segmentation method that needs no labeled data or fine-tuning, is easy to reproduce, and suits large unannotated corpora such as those in digital advertising and content moderation. Method: A DINO self-supervised ViT extracts patch features, an affinity matrix is built for spectral clustering, the number of clusters is chosen automatically by an eigengap and silhouette search, and a fully connected DenseCRF refines the segment boundaries. Result: CLASP attains competitive mIoU and pixel accuracy on COCO Stuff and ADE20K, on par with recent unsupervised methods. Conclusion: Because it needs no training, is structurally simple, and performs well, CLASP can serve as a strong baseline for large-scale unannotated image data. Abstract: We introduce CLASP (Clustering via Adaptive Spectral Processing), a lightweight framework for unsupervised image segmentation that operates without any labeled data or fine-tuning. CLASP first extracts per-patch features using a self-supervised ViT encoder (DINO); then, it builds an affinity matrix and applies spectral clustering. To avoid manual tuning, we select the segment count automatically with an eigengap-silhouette search, and we sharpen the boundaries with a fully connected DenseCRF. Despite its simplicity and training-free nature, CLASP attains competitive mIoU and pixel accuracy on COCO Stuff and ADE20K, matching recent unsupervised baselines. The zero-training design makes CLASP a strong, easily reproducible baseline for large unannotated corpora, especially common in digital advertising and marketing workflows such as brand safety screening, creative asset curation, and social media content moderation.

[523] GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning

Mustansar Fiaz,Hiyam Debary,Paolo Fraccaro,Danda Paudel,Luc Van Gool,Fahad Khan,Salman Khan

Main category: cs.CV

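TL;DR: Proposes a new post-training framework that adds task-aware rewards so reasoning-based reinforcement learning models adapt better to Earth Observation tasks, improving reasoning on remote sensing images, optimization stability, and robustness.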

Details

Motivation: Earth Observation tasks pose distinctive challenges, including referred object detection, image captioning, and change detection, that require task-aware reasoning, yet reinforcement learning models remain largely unexplored in this domain. Method: A new post-training framework incorporates task-aware rewards so that reasoning-based reinforcement learning models adapt effectively to diverse Earth Observation tasks. Result: Experiments on multiple Earth Observation benchmarks show consistent gains over state-of-the-art generic and specialized vision-language models. Conclusion: The framework strengthens reasoning and robustness on remote sensing images and opens a new direction for reinforcement learning in Earth Observation. Abstract: Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task-aware reasoning. We propose a novel post-training framework that incorporates task-aware rewards to enable effective adaptation of reasoning-based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state-of-the-art generic and specialized vision-language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/.

[524] STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation

Xiaoxiao Ma,Haibo Qiu,Guohui Zhang,Zhixiong Zeng,Siqi Yang,Lin Ma,Feng Zhao

Main category: cs.CV

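TL;DR: Proposes STAGE, a stable and generalizable framework that addresses the training instability and performance degradation seen when existing GRPO algorithms are applied to autoregressive image generation.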

Details

Motivation: When applied to autoregressive image models, existing GRPO algorithms suffer from unstable training, degraded image quality, and poor generalization, mainly because of conflicting gradients from unnecessary tokens and unstable policy entropy. Method: Two targeted solutions are proposed: 1) similarity-aware advantage/KL reweighting to alleviate conflicting updates; 2) a reward based on the reference model's entropy to stabilize training. Result: Experiments show that STAGE clearly improves visual quality, training stability, and cross-task generalization on multiple benchmarks compared with baseline GRPO. Conclusion: By reducing conflicts between tokens and adding an entropy reward, STAGE protects the pretrained distribution and mitigates reward hacking, yielding more stable and more general reinforcement learning for text-to-image generation. Abstract: Reinforcement learning has recently been explored to improve text-to-image generation, yet applying existing GRPO algorithms to autoregressive (AR) image models remains challenging. The instability of the training process easily disrupts the pretrained model capability during long runs, resulting in marginal gains, degraded image quality, and poor generalization. In this work, we revisit GRPO for AR image generation and identify two key issues: contradictory gradients from unnecessary tokens and unstable policy entropy dynamics. To address these, we introduce STAGE, a stable and generalizable framework that leverages two targeted solutions: 1) advantage/KL reweighting, a similarity-aware reweighting that alleviates conflicting updates; and 2) an entropy reward, an entropy-based reward tied to the reference model that stabilizes learning. By alleviating conflicts between tokens and adding an entropy reward to stabilize training, we reduce disruption of the pretrained distribution and mitigate reward hacking, which in turn improves generalization and transfer to other benchmarks. Experiments across multiple benchmarks show that STAGE consistently improves visual quality, stability, and cross-task generalization compared to baseline GRPO.
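
For readers unfamiliar with the baseline being modified, the core of GRPO is a group-relative advantage: several samples are drawn for the same prompt and each reward is standardized against the group's mean and standard deviation. The sketch below shows only that baseline step, with hypothetical reward values; STAGE's similarity-aware reweighting and entropy reward are not specified in enough detail here to reproduce.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group, as in
    baseline GRPO: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four images sampled from one text prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

In baseline GRPO every token of a sample inherits that sample's advantage, which is where conflicting per-token gradients of the kind the abstract describes can arise.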

[525] VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Wenhao Li,Qiangchang Wang,Xianjing Meng,Zhibin Wu,Yilong Yin

Main category: cs.CV

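TL;DR: Proposes the VT-FSL framework, which combines visual and textual information, uses a large language model to generate precise cross-modal prompts, and applies geometry-aware alignment to set a new state of the art in few-shot learning.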

Details

Motivation: Existing few-shot learning methods lack grounding in actual instances and therefore hallucinate semantics, which yields noisy semantic guidance and costly corrections. Method: Proposes the VT-FSL framework, consisting of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA): CIP conditions an LLM on class names and support images to generate fine-grained class descriptions and synthesize images; CGA jointly aligns the textual, support-image, and synthetic-image representations by minimizing the kernelized parallelotope volume. Result: Achieves state-of-the-art performance on ten standard, cross-domain, and fine-grained few-shot learning benchmarks. Conclusion: Through precise cross-modal prompt generation and geometry-aware multimodal alignment, VT-FSL effectively improves few-shot learning performance and is broadly applicable. Abstract: Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
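
The VT-FSL abstract describes CGA only as minimizing the kernelized volume of the parallelotope spanned by three representations. The sketch below illustrates one way such a quantity could be computed: the square root of the determinant of a 3 x 3 kernel Gram matrix. The RBF kernel, the `gamma` parameter, and the function names are assumptions made for illustration; they are not taken from the paper or its code.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Assumed kernel choice; the abstract does not name the kernel.
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

def kernelized_volume(text, support, synthetic, gamma=1.0):
    """Volume of the parallelotope spanned by three vectors in kernel feature space."""
    feats = [np.asarray(v, dtype=float) for v in (text, support, synthetic)]
    # Gram matrix of pairwise kernel values.
    gram = np.array([[rbf_kernel(u, v, gamma) for v in feats] for u in feats])
    # The squared volume equals det(gram); clamp tiny negative round-off.
    det = max(float(np.linalg.det(gram)), 0.0)
    return float(np.sqrt(det))

aligned = kernelized_volume([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
spread = kernelized_volume([0.0, 0.0], [10.0, 0.0], [0.0, 10.0])
```

Under these assumptions the volume shrinks toward zero as the three representations coincide and approaches one as they move far apart, which is why minimizing it pulls the modalities together.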

[526] Fast Real-Time Pipeline for Robust Arm Gesture Recognition

Milán Zsolt Bagladi,László Gulyás,Gergő Szalay

Main category: cs.CV

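TL;DR: Proposes a real-time pipeline for dynamic arm gesture recognition based on OpenPose keypoint estimation and a recurrent neural network, using 1 x 1 normalization and two feature representations; artificially rotated training data improves robustness to camera-angle changes. The pipeline achieves high accuracy on a custom traffic-control gesture dataset and can also compute the speed of the arm signal.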

Details

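Motivation: To recognize dynamic arm gestures accurately and in real time across viewing angles and speeds, with the robustness and practicality needed for real applications such as traffic control. Method: Uses OpenPose for keypoint detection, applies a 1 x 1 normalization to the keypoint data, builds two feature representations (coordinate-based and angle-based), and classifies with a recurrent neural network; artificially rotated training data covering multiple angles improves robustness to viewpoint changes. Result: Achieves high recognition accuracy on a custom traffic-control gesture dataset, maintains performance across viewing angles and motion speeds, and can estimate the speed of the arm signal when needed. Conclusion: The proposed pipeline is efficient and robust for dynamic gesture recognition, suits the complex visual conditions of real applications, and has strong practical value. Abstract: This paper presents a real-time pipeline for dynamic arm gesture recognition based on OpenPose keypoint estimation, keypoint normalization, and a recurrent neural network classifier. The 1 x 1 normalization scheme and two feature representations (coordinate- and angle-based) are presented for the pipeline. In addition, an efficient method to improve robustness against camera angle variations is also introduced by using artificially rotated training data. Experiments on a custom traffic-control gesture dataset demonstrate high accuracy across varying viewing angles and speeds. Finally, an approach to calculate the speed of the arm signal (if necessary) is also presented.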
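
To make the difference between the coordinate-based and angle-based features concrete, the sketch below rotates a toy three-keypoint arm and computes the elbow angle before and after. It is an illustration only: the in-plane 2D rotation, the keypoint layout, and the function names are assumptions, and the paper's actual rotation scheme may differ (for example, it may rotate in 3D).

```python
import numpy as np

def rotate_keypoints(points, degrees):
    """Rotate 2D keypoints about the origin (assumed in-plane augmentation)."""
    theta = np.deg2rad(degrees)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    return np.asarray(points, dtype=float) @ rot.T

def joint_angle(a, b, c):
    """Angle in degrees at joint b, between segments b->a and b->c."""
    u = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical shoulder, elbow, wrist keypoints.
arm = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
rotated = rotate_keypoints(arm, 30.0)
angle_before = joint_angle(*arm)
angle_after = joint_angle(*rotated)
```

The coordinates change under rotation while the joint angle does not, which suggests why coordinate features benefit from rotated training data more than angle features do in this simplified setting.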

[527] A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration

Rohit Jena,Vedant Zope,Pratik Chaudhari,James C. Gee

Main category: cs.CV

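TL;DR: Proposes FFDP, a set of IO-aware non-GEMM fused kernels with a distributed framework for large-scale image registration, substantially improving computational efficiency and memory utilization.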

Details

Motivation: Image registration is critical in biomedicine, but existing algorithms have not kept pace with image acquisition capabilities, so scalable solutions are needed. Method: Proposes the FFDP framework, which combines IO-aware non-GEMM fused kernels with a convolution-aware tensor sharding strategy to optimize non-GEMM bottlenecks in model parallelism. Result: On 8 A6000 GPUs, performs multimodal registration of 100 micron human brain MRI data, a problem more than 570x larger than a standard clinical datum, in about one minute; accelerates existing methods by up to 6-7x, lowers peak memory by 20-59%, and fits problems up to 64x larger on a single GPU. Conclusion: FFDP markedly improves the performance and scalability of large-scale image registration, offering an efficient and practical solution for biomedical image analysis. Abstract: In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to biomedical and life sciences, but algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate unprecedented capabilities by performing multimodal registration of a 100 micron ex-vivo human brain MRI volume at native resolution - an inverse problem more than 570x larger than a standard clinical datum in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization and deep learning registration pipelines by up to 6-7x while reducing peak memory consumption by 20-59%. Comparative analysis on a 250 micron dataset shows that FFDP can fit up to 64x larger problems than existing SOTA on a single GPU, and highlights both the performance and efficiency gains of FFDP compared to SOTA image registration methods.

[528] GEM: 3D Gaussian Splatting for Efficient and Accurate Cryo-EM Reconstruction

Huaizhi Qu,Xiao Wang,Gengwei Zhang,Jie Peng,Tianlong Chen

Main category: cs.CV

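TL;DR: GEM is a new cryo-electron microscopy (cryo-EM) 3D reconstruction framework based on 3D Gaussian Splatting that runs efficiently in real space, combining high accuracy with low computational overhead.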

Details

Motivation: Traditional Fourier-space methods lose fidelity through repeated transforms, while real-space methods based on neural radiance fields are accurate but costly in computation and memory, so a method that is both efficient and accurate is needed. Method: GEM represents protein structure with compact 3D Gaussians, each requiring only 11 parameters, and introduces a new gradient computation that reduces memory footprint and training cost. Result: On standard benchmarks, GEM trains up to 48% faster than state-of-the-art methods, uses 12% less memory, and improves local resolution by as much as 38.8%. Conclusion: GEM offers a practical and scalable paradigm for cryo-EM reconstruction, unifying speed, efficiency, and high-resolution accuracy. Abstract: Cryo-electron microscopy (cryo-EM) has become a central tool for high-resolution structural biology, yet the massive scale of datasets (often exceeding 100k particle images) renders 3D reconstruction both computationally expensive and memory intensive. Traditional Fourier-space methods are efficient but lose fidelity due to repeated transforms, while recent real-space approaches based on neural radiance fields (NeRFs) improve accuracy but incur cubic memory and computation overhead. Therefore, we introduce GEM, a novel cryo-EM reconstruction framework built on 3D Gaussian Splatting (3DGS) that operates directly in real space while maintaining high efficiency. Instead of modeling the entire density volume, GEM represents proteins with compact 3D Gaussians, each parameterized by only 11 values. To further improve training efficiency, we design a novel gradient computation for the 3D Gaussians that contribute to each voxel. This design substantially reduces both memory footprint and training cost. On standard cryo-EM benchmarks, GEM achieves up to 48% faster training and 12% lower memory usage compared to state-of-the-art methods, while improving local resolution by as much as 38.8%. These results establish GEM as a practical and scalable paradigm for cryo-EM reconstruction, unifying speed, efficiency, and high-resolution accuracy. Our code is available at https://github.com/UNITES-Lab/GEM.

[529] BRIDGE -- Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation

Dingning Liu,Haoyu Guo,Jingyi Zhou,Tong He

Main category: cs.CV

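TL;DR: Proposes BRIDGE, a framework that uses an RL-optimized depth-to-image generation method to synthesize a large set of RGB images paired with ground-truth depth maps for training monocular depth estimation models, substantially improving performance and robustness.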

Details

Motivation: Traditional monocular depth estimation methods are limited by the quantity and quality of data; without enough diverse, precisely annotated data, models lack robustness. Method: Proposes the BRIDGE framework, which uses an RL-optimized depth-to-image (D2I) generation model to synthesize more than 20 million realistic, geometrically accurate RGB images from depth maps of diverse sources, each paired with its ground-truth depth; the depth model is trained with a hybrid supervision strategy that combines teacher pseudo-labels with ground-truth depth. Result: Outperforms state-of-the-art methods on multiple benchmarks, with better quantitative metrics and stronger capture of detail in complex scenes, while achieving larger-scale and more diverse data coverage. Conclusion: Through its data generation and training paradigm, BRIDGE markedly improves the generalization and robustness of monocular depth estimation. Abstract: Monocular Depth Estimation (MDE) is a foundational task for computer vision. Traditional methods are limited by data scarcity and quality, hindering their robustness. To overcome this, we propose BRIDGE, an RL-optimized depth-to-image (D2I) generation framework that synthesizes over 20M realistic and geometrically accurate RGB images, each intrinsically paired with its ground truth depth, from diverse source depth maps. Then we train our depth estimation model on this dataset, employing a hybrid supervision strategy that integrates teacher pseudo-labels with ground truth depth for comprehensive and robust training. This innovative data generation and training paradigm enables BRIDGE to achieve breakthroughs in scale and domain diversity, consistently outperforming existing state-of-the-art approaches quantitatively and in complex scene detail capture, thereby fostering general and robust depth features. Code and models are available at https://dingning-liu.github.io/bridge.github.io/.

[530] UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation

Guanjun Wu,Jiemin Fang,Chen Yang,Sikuang Li,Taoran Yi,Jia Lu,Zanwei Zhou,Jiazhong Cen,Lingxi Xie,Xiaopeng Zhang,Wei Wei,Wenyu Liu,Xinggang Wang,Qi Tian

Main category: cs.CV

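TL;DR: Proposes UniLat3D, a 3D asset generation framework that encodes geometry and appearance in a single latent space, enabling efficient, high-quality single-stage generation.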

Details

Motivation: Most existing 3D generation models follow a two-stage diffusion pipeline, which suffers from geometry-texture misalignment and high computational cost. Method: Designs a Unified VAE that produces a compact latent representation, UniLat, and trains a flow-matching model to generate UniLat directly from noise. Result: Trained on public datasets, UniLat3D generates high-quality 3D assets from a single image within seconds, with superior geometric quality and appearance fidelity. Conclusion: With a unified latent space, UniLat3D achieves efficient, high-fidelity single-stage 3D generation and outperforms traditional two-stage methods. Abstract: High-fidelity 3D asset generation is crucial for various industries. While recent 3D pretrained models show strong capability in producing realistic content, most are built upon diffusion models and follow a two-stage pipeline that first generates geometry and then synthesizes appearance. Such a decoupled design tends to produce geometry-texture misalignment and non-negligible cost. In this paper, we propose UniLat3D, a unified framework that encodes geometry and appearance in a single latent space, enabling direct single-stage generation. Our key contribution is a geometry-appearance Unified VAE, which compresses high-resolution sparse features into a compact latent representation -- UniLat. UniLat integrates structural and visual information into a dense low-resolution latent, which can be efficiently decoded into diverse 3D formats, e.g., 3D Gaussians and meshes. Based on this unified representation, we train a single flow-matching model to map Gaussian noise directly into UniLat, eliminating redundant stages. Trained solely on public datasets, UniLat3D produces high-quality 3D assets in seconds from a single image, achieving superior appearance fidelity and geometric quality. More demos and code are available at https://unilat3d.github.io/

[531] MANI-Pure: Magnitude-Adaptive Noise Injection for Adversarial Purification

Xiaoyi Huang,Junwei Wu,Kejia Zhang,Carl Yang,Zhiming Luo

Main category: cs.CV

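TL;DR: Proposes MANI-Pure, a magnitude-adaptive purification framework that uses the magnitude spectrum of the input to guide purification, suppressing adversarial perturbations in fragile high-frequency regions while preserving semantically critical low-frequency content; it shows superior robustness and accuracy on CIFAR-10 and ImageNet-1K.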

Details

Motivation: Existing diffusion-based adversarial purification methods usually rely on uniform noise injection, which corrupts semantic structure and weakens robustness; the authors find that adversarial perturbations concentrate in high-frequency regions with unevenly distributed magnitude, which calls for a finer-grained purification strategy. Method: Proposes the MANI-Pure framework, which uses the magnitude spectrum of the input to adaptively apply heterogeneous, frequency-targeted noise, prioritizing fragile high-frequency, low-magnitude bands and protecting low-frequency semantic information. Result: Experiments on CIFAR-10 and ImageNet-1K show that MANI-Pure narrows the clean accuracy gap to within 0.59 of the original classifier, raises robust accuracy by 2.15, and achieves the top robust accuracy on the RobustBench leaderboard, surpassing the previous state-of-the-art method. Conclusion: Through magnitude-adaptive, frequency-targeted noise injection, MANI-Pure balances semantic preservation against suppression of adversarial perturbations and markedly improves diffusion-based adversarial defense. Abstract: Adversarial purification with diffusion models has emerged as a promising defense strategy, but existing methods typically rely on uniform noise injection, which indiscriminately perturbs all frequencies, corrupting semantic structures and undermining robustness. Our empirical study reveals that adversarial perturbations are not uniformly distributed: they are predominantly concentrated in high-frequency regions, with heterogeneous magnitude intensity patterns that vary across frequencies and attack types. Motivated by this observation, we introduce MANI-Pure, a magnitude-adaptive purification framework that leverages the magnitude spectrum of inputs to guide the purification process. Instead of injecting homogeneous noise, MANI-Pure adaptively applies heterogeneous, frequency-targeted noise, effectively suppressing adversarial perturbations in fragile high-frequency, low-magnitude bands while preserving semantically critical low-frequency content. Extensive experiments on CIFAR-10 and ImageNet-1K validate the effectiveness of MANI-Pure. It narrows the clean accuracy gap to within 0.59 of the original classifier, while boosting robust accuracy by 2.15, and achieves the top-1 robust accuracy on the RobustBench leaderboard, surpassing the previous state-of-the-art method.
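
The idea of magnitude-adaptive, frequency-targeted noise can be illustrated with a small NumPy sketch. This is not the MANI-Pure algorithm: the weighting rule (one minus the normalized FFT magnitude), the `sigma` value, and the function names are assumptions chosen to show how noise could be concentrated in low-magnitude bands while high-magnitude (typically low-frequency) content is left nearly untouched.

```python
import numpy as np

def magnitude_adaptive_weights(image, eps=1e-8):
    """Per-frequency noise weights in [0, 1]; larger where the spectrum is weak."""
    magnitude = np.abs(np.fft.fft2(image))
    return 1.0 - magnitude / (magnitude.max() + eps)

def inject_adaptive_noise(image, sigma=0.1, seed=0):
    """Add Gaussian noise whose spectrum is scaled by the adaptive weights."""
    rng = np.random.default_rng(seed)
    weights = magnitude_adaptive_weights(image)
    noise_spectrum = np.fft.fft2(rng.normal(0.0, sigma, size=image.shape))
    noisy = np.fft.ifft2(np.fft.fft2(image) + weights * noise_spectrum)
    # Weights are symmetric for a real image, so the imaginary part is round-off.
    return noisy.real

image = np.ones((8, 8))          # toy input: all energy sits in the DC component
weights = magnitude_adaptive_weights(image)
noisy = inject_adaptive_noise(image)
```

In this toy case the DC component receives almost no noise, so the image mean is preserved while every other frequency is perturbed. In a diffusion-based purifier, such weights would presumably modulate the forward noising step rather than a single additive perturbation.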

[532] Triangle Splatting+: Differentiable Rendering with Opaque Triangles

Jan Held,Renaud Vandeghen,Sanghyun Son,Daniel Rebain,Matheus Gadelha,Yi Zhou,Ming C. Lin,Marc Van Droogenbroeck,Andrea Tagliasacchi

Main category: cs.CV

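TL;DR: Proposes Triangle Splatting+, a method that directly optimizes triangles within a differentiable rendering framework for 3D scene reconstruction and novel view synthesis. Unlike 3D Gaussian Splatting, it produces triangle meshes that can be used in standard graphics engines without post-processing, and it delivers better visual quality and downstream application support while keeping training efficient.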

Details

Motivation: 3D Gaussian Splatting enables real-time rendering, but its output is incompatible with mesh-based VR and graphics applications, and existing conversion methods are complex and degrade visual quality. A method that natively outputs triangles without post-processing is therefore needed. Method: Proposes Triangle Splatting+, which establishes triangle connectivity through shared vertices and optimizes triangles directly within a differentiable splatting framework; a training strategy enforces opaque triangles, and the output is a semi-connected mesh. Result: Achieves state-of-the-art mesh-based novel view synthesis on the Mip-NeRF360 and Tanks & Temples datasets, surpasses prior splatting methods in visual fidelity, trains quickly, produces results directly usable in standard graphics engines, and supports downstream applications such as physics-based simulation and interactive walkthroughs. Conclusion: Triangle Splatting+ delivers high-quality, efficient, and practical mesh-based novel view synthesis, bridging neural rendering and traditional graphics pipelines and easing deployment in VR and real-time applications. Abstract: Reconstructing 3D scenes and synthesizing novel views has seen rapid progress in recent years. Neural Radiance Fields demonstrated that continuous volumetric radiance fields can achieve high-quality image synthesis, but their long training and rendering times limit practicality. 3D Gaussian Splatting (3DGS) addressed these issues by representing scenes with millions of Gaussians, enabling real-time rendering and fast optimization. However, Gaussian primitives are not natively compatible with the mesh-based pipelines used in VR headsets and real-time graphics applications. Existing solutions attempt to convert Gaussians into meshes through post-processing or two-stage pipelines, which increases complexity and degrades visual quality. In this work, we introduce Triangle Splatting+, which directly optimizes triangles, the fundamental primitive of computer graphics, within a differentiable splatting framework. We formulate triangle parametrization to enable connectivity through shared vertices, and we design a training strategy that enforces opaque triangles. The final output is immediately usable in standard graphics engines without post-processing. Experiments on the Mip-NeRF360 and Tanks & Temples datasets show that Triangle Splatting+ achieves state-of-the-art performance in mesh-based novel view synthesis. Our method surpasses prior splatting approaches in visual fidelity while remaining efficient and fast to train. Moreover, the resulting semi-connected meshes support downstream applications such as physics-based simulation or interactive walkthroughs. The project page is https://trianglesplatting2.github.io/trianglesplatting2/.

[533] Score Distillation of Flow Matching Models

Mingyuan Zhou,Yi Gu,Huangjie Zheng,Liangchen Song,Guande He,Yizhe Zhang,Wenze Hu,Yinfei Yang

Main category: cs.CV

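TL;DR: Presents a simple derivation based on Bayes' rule and conditional expectations that unifies Gaussian diffusion and flow matching, and applies Score identity Distillation (SiD) directly to pretrained text-to-image flow-matching models (such as SANA, the SD3 family, and FLUX.1-dev). It achieves effective accelerated generation without teacher fine-tuning or architectural changes, showing that score distillation applies broadly to flow-matching models.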

Details

Motivation: Diffusion models generate high-quality images but sample slowly, and distillation can accelerate generation. Flow matching is theoretically equivalent to diffusion, yet it has been unclear whether score distillation can be applied to it directly; this work tests whether that transfer is feasible and aims to unify acceleration across model families. Method: Based on Bayes' rule and conditional expectations, proposes a unified view that does not rely on ODE/SDE formulations, extends Score identity Distillation (SiD) to DiT-based flow-matching models, and validates it experimentally in both data-free and data-aided settings. Result: SiD works on several mainstream flow-matching models without teacher fine-tuning or architectural changes, substantially improves generation efficiency, and remains stable across settings. Conclusion: Score distillation applies directly to text-to-image flow-matching models and accelerates them with only modest adjustments, providing a unified framework for accelerating diffusion and flow-matching models and resolving earlier concerns about stability and soundness. Abstract: Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation -- based on Bayes' rule and conditional expectations -- that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher finetuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. We will make the PyTorch implementation publicly available.
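
The equivalence the abstract refers to can be checked numerically in a one-dimensional Gaussian case. The sketch below assumes the linear interpolation path x_t = (1 - t) * x0 + t * eps with t = 0 as data and t = 1 as noise; this convention and the function names are assumptions for illustration and may not match the parameterization used in the paper. Under that path, conditional expectations give E[x0 | x_t] = x_t - t * v and score = -(x_t + (1 - t) * v) / t, where v is the flow-matching velocity.

```python
def x0_from_velocity(x_t, v, t):
    """Posterior mean of the clean sample implied by velocity v (assumed linear path)."""
    return x_t - t * v

def score_from_velocity(x_t, v, t):
    """Score of the noisy marginal implied by velocity v (assumed linear path)."""
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1]")
    eps_hat = x_t + (1.0 - t) * v   # E[eps | x_t]
    return -eps_hat / t

# Closed-form check with x0 ~ N(0, 1) and eps ~ N(0, 1), so x_t ~ N(0, s2).
t, x_t = 0.3, 0.7
s2 = (1.0 - t) ** 2 + t ** 2
v_true = (2.0 * t - 1.0) * x_t / s2      # E[eps - x0 | x_t]
score_true = -x_t / s2                   # d/dx log N(x; 0, s2)
x0_true = (1.0 - t) * x_t / s2           # E[x0 | x_t]
score_est = score_from_velocity(x_t, v_true, t)
x0_est = x0_from_velocity(x_t, v_true, t)
```

Because the score is recoverable from the velocity by a fixed linear map, a distillation loss written in terms of scores can, in principle, be evaluated with a flow-matching teacher.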

[534] Fast Feature Field ($\text{F}^3$): A Predictive Representation of Events

Richeek Das,Kostas Daniilidis,Pratik Chaudhari

Main category: cs.CV

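TL;DR: Proposes Fast Feature Field ($\text{F}^3$), a representation for event-camera data learned by predicting future events, which preserves scene structure and motion information. It is efficient, sparse, and robust to noise, and it reaches state-of-the-art performance on several downstream tasks.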

Details

Motivation: Event-camera data offers high temporal resolution and low latency, but its sparsity and non-uniformity make effective representation difficult. A general representation that uses event data efficiently and preserves spatiotemporal information is needed. Method: Proposes Fast Feature Field ($\text{F}^3$), which uses multi-resolution hash encoding and deep sets to build a multi-channel image representation over a contiguous spatiotemporal volume, and learns this representation in a self-supervised way by predicting future events from past events. Result: $\text{F}^3$ achieves state-of-the-art performance on optical flow estimation, semantic segmentation, and monocular depth estimation across several robotic platforms, environments, lighting conditions, and sensors; it runs at 120 Hz at HD resolution, and task predictions run at 25-75 Hz. Conclusion: $\text{F}^3$ is an efficient, robust, and general representation for event-camera data that fuses spatiotemporal information and supports high-rate real-time perception. Abstract: This paper develops a mathematical argument and algorithms for building representations of data from event-based cameras, that we call Fast Feature Field ($\text{F}^3$). We learn this representation by predicting future events from past events and show that it preserves scene structure and motion information. $\text{F}^3$ exploits the sparsity of event data and is robust to noise and variations in event rates. It can be computed efficiently using ideas from multi-resolution hash encoding and deep sets - achieving 120 Hz at HD and 440 Hz at VGA resolutions. $\text{F}^3$ represents events within a contiguous spatiotemporal volume as a multi-channel image, enabling a range of downstream tasks. We obtain state-of-the-art performance on optical flow estimation, semantic segmentation, and monocular metric depth estimation, on data from three robotic platforms (a car, a quadruped robot and a flying platform), across different lighting conditions (daytime, nighttime), environments (indoors, outdoors, urban, as well as off-road) and dynamic vision sensors (resolutions and event rates). Our implementations can predict these tasks at 25-75 Hz at HD resolution.

[535] VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning

Zhaozhi Wang,Tong Zhang,Mingyue Guo,Yaowei Wang,Qixiang Ye

Main category: cs.CV

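TL;DR: Proposes VideoAnchor, a plug-and-play module that uses subspace affinities to reinforce visual cues across frames, addressing the weakness of multimodal large language models in visual-spatial reasoning.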

Details

Motivation: Multimodal large language models have made notable progress in vision-language alignment but remain limited in visual-spatial reasoning, mainly because visual tokens are overshadowed by language tokens in the attention mechanism. Method: Building on a new connection between the self-expressiveness property in sparse subspace clustering and the attention mechanism in Transformers, proposes the VideoAnchor module, which reinforces visual cues across frames without retraining. Result: Extensive experiments across benchmarks and backbone models show consistent gains, for example 3.2% and 4.6% improvements on VSI-Bench and Video-MME (spatial-related tasks). Qualitative analyses also show more coherent subspace partitions and stronger visual grounding. Conclusion: VideoAnchor addresses the attention bias of multimodal large language models in visual-spatial reasoning, markedly improves spatial understanding, and generalizes well. Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language alignment, yet they remain limited in visual-spatial reasoning. We first identify that this limitation arises from the attention mechanism: visual tokens are overshadowed by language tokens, preventing the model from consistently recognizing the same visual cues across frames. To address this challenge, we draw a novel connection between the self-expressiveness property in sparse subspace clustering and the attention mechanism in Transformers. Building on this insight, we propose VideoAnchor, a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining, effectively anchoring attention to shared visual structures. Extensive experiments across benchmarks and backbone models show consistent performance gains -- e.g., 3.2% and 4.6% improvements on VSI-Bench and Video-MME (spatial-related tasks) with InternVL2-8B and Qwen2.5VL-72B -- while qualitative analyses demonstrate more coherent subspace partitions and stronger visual grounding. Our code will be made publicly available at https://github.com/feufhd/VideoAnchor.

[536] Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Kunhao Liu,Wenbo Hu,Jiale Xu,Ying Shan,Shijian Lu

Main category: cs.CV

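TL;DR: Proposes Rolling Forcing, a new streaming video generation method that uses joint denoising, an attention anchor, and an efficient training strategy to substantially reduce error accumulation in long video streams.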

Details

Motivation: Existing streaming video generation methods suffer from severe error accumulation over long horizons, which degrades video quality. Method: (1) A joint denoising scheme that processes multiple frames with progressively increasing noise levels at once; (2) an attention sink mechanism that keeps the key-value states of the initial frames as a global context anchor; (3) a few-step distillation training algorithm on non-overlapping windows that mitigates exposure bias conditioned on self-generated histories. Result: Experiments show that the method enables real-time generation of multi-minute video streams on a single GPU, with substantially less error accumulation and better temporal consistency and quality. Conclusion: Rolling Forcing addresses long-horizon error accumulation in streaming video generation and offers a more reliable video generation approach for interactive world models and neural game engines. Abstract: Streaming video generation, as one fundamental component in interactive world models and neural game engines, aims to generate high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that often significantly degrades the generated stream videos over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key-value states of initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias conditioned on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.
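
The joint denoising scheme can be pictured as a rolling window in which each slot holds a frame at a different noise level. The toy scheduler below tracks only integer noise levels and frame ids; there is no model, and the window size, the warm start, and the function name are assumptions for illustration rather than details from the paper.

```python
from collections import deque

def rolling_denoise(num_frames, window=4):
    """Return frame ids in the order they become fully denoised."""
    # Each slot is [frame_id, noise_level]; level `window` means pure noise.
    # Simplified warm start: the first frames begin at levels 1..window.
    buffer = deque([i, i + 1] for i in range(min(window, num_frames)))
    next_id = len(buffer)
    emitted = []
    while buffer:
        # One joint step lowers the noise level of every frame in the window.
        for slot in buffer:
            slot[1] -= 1
        # The oldest frame reaches level 0 and leaves the window.
        if buffer[0][1] == 0:
            emitted.append(buffer.popleft()[0])
        # A new pure-noise frame enters at the back while frames remain.
        if next_id < num_frames:
            buffer.append([next_id, window])
            next_id += 1
    return emitted

order = rolling_denoise(6, window=4)
```

Each joint step emits one clean frame and admits one noisy frame, so frames stream out in order while later frames are still being refined alongside their predecessors.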

[537] Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models

Bowei Chen,Sai Bi,Hao Tan,He Zhang,Tianyuan Zhang,Zhengqi Li,Yuanjun Xiong,Jianming Zhang,Kai Zhang

Main category: cs.CV

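TL;DR: Proposes a three-stage alignment strategy that turns pretrained visual encoders into semantically rich tokenizers for diffusion models in image generation, substantially speeding up convergence and improving generation quality.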

Details

Motivation: Training a VAE from scratch emphasizes low-level details and lacks high-level semantics, whereas pretrained foundation encoders carry rich semantic structure that can be used to build a better image tokenizer. Method: A three-stage alignment strategy: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with a semantic preservation loss so that perceptual details and high-level semantics are both retained; (3) refine the decoder to improve reconstruction quality. Result: On ImageNet 256$\times$256, reaches a gFID of 1.90 within only 64 epochs, markedly accelerating diffusion model convergence; on LAION, a 2B-parameter text-to-image model trained with this tokenizer outperforms FLUX VAE. Conclusion: The method is simple and scalable and offers a semantically grounded paradigm for continuous tokenizer design. Abstract: In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders. We introduce a three-stage alignment strategy: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with an additional semantic preservation loss, enabling the encoder to capture perceptual details while retaining high-level semantics; and (3) refine the decoder for improved reconstruction quality. This alignment yields semantically rich image tokenizers that benefit diffusion models. On ImageNet 256$\times$256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. Scaling to LAION, a 2B-parameter text-to-image model trained with our tokenizer consistently outperforms FLUX VAE under the same training steps. Overall, our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.

[538] YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection

Ranjan Sapkota,Rahul Harsha Cheppally,Ajay Sharda,Manoj Karkee

Main category: cs.CV

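TL;DR: Presents the key architectural improvements and performance benchmarks of YOLO26, released by Ultralytics in September 2025, for real-time object detection at the edge, showing its efficiency and accuracy on edge devices.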

Details

Motivation: To push the YOLO family toward higher efficiency and accuracy on edge computing devices and meet the needs of low-power, real-time detection. Method: Introduces end-to-end NMS-free inference, removes the DFL loss, adopts ProgLoss and STAL to improve small-object detection, and uses the MuSGD optimizer inspired by large language model training. Result: Benchmarks on edge devices such as NVIDIA Orin Jetson show that YOLO26 offers better efficiency, accuracy, and deployment flexibility than YOLOv8 and YOLO11 through YOLOv13. Conclusion: YOLO26 is an important milestone in the evolution of YOLO and markedly improves deployment capability and detection performance in edge scenarios. Abstract: This study presents Key Architectural Enhancements and Performance Benchmarking of Ultralytics YOLO26 for real-time edge object detection, providing a comprehensive overview of the design principles of YOLO26, technological advances, and deployment readiness. YOLO26, released in September 2025 by Ultralytics, represents the newest and most cutting-edge member of the You Only Look Once (YOLO) family, engineered to push the boundaries of efficiency and accuracy on edge and low-power devices. This paper highlights architectural innovations in YOLO26, including end-to-end NMS-free inference, removal of Distribution Focal Loss (DFL) for streamlined exports, introduction of ProgLoss and Small-Target-Aware Label Assignment (STAL) for improved stability and small-object detection, and the adoption of the MuSGD optimizer inspired by large language model training. In addition, we report performance benchmarks for YOLO26 across edge devices, specifically NVIDIA Orin Jetson platforms, and compare results against YOLOv8 and YOLO11 (previous Ultralytics releases) as well as YOLOv12 and YOLOv13, which bridged the lineage between YOLO11 and YOLO26. Our comparative analysis highlights the superior efficiency, accuracy, and deployment versatility of YOLO26, establishing it as a pivotal milestone in the YOLO evolution.

[539] Personalized Vision via Visual In-Context Learning

Yuxin Jiang,Yuchao Gu,Yiren Song,Ivor Tsang,Mike Zheng Shou

Main category: cs.CV

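TL;DR: Proposes PICO, a visual in-context learning framework built on diffusion transformers that transfers to personalized vision tasks from a single annotated exemplar, without fine-tuning.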

Details

Motivation: Existing personalization methods rely on costly fine-tuning or synthetic data pipelines, which are inflexible and generalize poorly to open-ended tasks. Method: Proposes the four-panel PICO framework, which repurposes diffusion transformers as visual in-context learners, and builds VisRel, a compact yet diverse tuning dataset; an attention-guided seed scorer improves inference reliability. Result: Experiments show that PICO surpasses fine-tuning and synthetic-data baselines, adapts flexibly to user-defined tasks, and generalizes across recognition and generation. Conclusion: PICO enables efficient and flexible personalized vision and confirms that task diversity is key to generalization. Abstract: Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks but struggle with personalized vision -- tasks defined at test time by users with customized objects or novel objectives. Existing personalization approaches rely on costly fine-tuning or synthetic data pipelines, which are inflexible and restricted to fixed task formats. Visual in-context learning (ICL) offers a promising alternative, yet prior methods are confined to narrow, in-domain tasks and fail to generalize to open-ended personalization. We introduce Personalized In-Context Operator (PICO), a simple four-panel framework that repurposes diffusion transformers as visual in-context learners. Given a single annotated exemplar, PICO infers the underlying transformation and applies it to new inputs without retraining. To enable this, we construct VisRel, a compact yet diverse tuning dataset, showing that task diversity, rather than scale, drives robust generalization. We further propose an attention-guided seed scorer that improves reliability via efficient inference scaling. Extensive experiments demonstrate that PICO (i) surpasses fine-tuning and synthetic-data baselines, (ii) flexibly adapts to novel user-defined tasks, and (iii) generalizes across both recognition and generation.

[540] Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding

Bingkui Tong,Jiaer Xia,Kaiyang Zhou

Main category: cs.CV

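TL;DR: Proposes Layer Contrastive Decoding (LayerCD), a simple method that filters hallucinations in multimodal large language models by contrasting the output distributions generated from shallow and deep features of the vision encoder; it significantly outperforms the state of the art on two benchmarks.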

Details

Motivation: Multimodal large language models (MLLMs) often hallucinate, producing language output that is inconsistent with the input image; shallow visual features are a source of this because they carry biased, low-level information that is insufficient for high-level reasoning. Method: Proposes LayerCD, which contrasts the output distributions generated from shallow and deep features of the vision encoder to suppress hallucinations induced by shallow features. Result: Experiments on two hallucination benchmarks show that LayerCD significantly outperforms the current state of the art. Conclusion: LayerCD effectively reduces hallucinations in MLLMs and improves the consistency of generated content with the image context. Abstract: Multimodal Large Language Models (MLLMs) have shown impressive perception and reasoning capabilities, yet they often suffer from hallucinations -- generating outputs that are linguistically coherent but inconsistent with the context of the input image, including inaccuracies in objects, attributes, and relations. To address this challenge, we propose a simple approach called Layer Contrastive Decoding (LayerCD). Our design is motivated by the observation that shallow visual features are much more likely than deep visual features to cause an MLLM to hallucinate as they only capture biased, low-level information that is insufficient for high-level reasoning. Therefore, LayerCD aims to filter out hallucinations by contrasting the output distributions generated from visual features of different levels, specifically those from the shallow and deep layers of the vision encoder, respectively. We conduct extensive experiments on two hallucination benchmarks and show that LayerCD significantly outperforms the current state of the art. The code for LayerCD is available at https://github.com/maifoundations/LayerCD.
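
Contrasting two output distributions is commonly done in logit space. The sketch below uses the generic contrastive-decoding form (1 + alpha) * deep - alpha * shallow followed by a softmax; that formula, the `alpha` value, and the toy logits are assumptions for illustration, and the abstract does not state the exact rule LayerCD uses.

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def layer_contrastive_probs(deep_logits, shallow_logits, alpha=1.0):
    """Next-token distribution after penalizing tokens favored by shallow features."""
    deep = np.asarray(deep_logits, dtype=float)
    shallow = np.asarray(shallow_logits, dtype=float)
    return softmax((1.0 + alpha) * deep - alpha * shallow)

# Toy vocabulary of three tokens; token 1 is boosted mainly by shallow features.
deep = [2.0, 1.0, 0.0]
shallow = [0.0, 3.0, 0.0]
baseline = softmax(deep)
contrasted = layer_contrastive_probs(deep, shallow)
```

In this toy case the token that only the shallow-feature branch prefers loses probability mass, while the token supported by deep features gains it.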

[541] GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

Aryan Yazdan Parast,Parsa Hosseini,Hesam Asadollahzadeh,Arshia Soltani Moakhar,Basim Azam,Soheil Feizi,Naveed Akhtar

Main category: cs.CV

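TL;DR: Proposes GHOST, a method that optimizes in the image embedding space to generate natural images that induce hallucinations in multimodal large language models (MLLMs), making it possible to actively discover and mitigate object hallucination.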

Details

Motivation: Existing studies of object hallucination rely on static benchmarks, which struggle to reveal model-specific or unanticipated hallucination vulnerabilities, so an automated method that needs no human intervention is required to actively probe and assess MLLM weaknesses. Method: GHOST optimizes misleading features in the image embedding space so that the model perceives a target object that is absent, and uses a diffusion model to generate images that look natural and stay close to the original; the whole process needs no human supervision or prior knowledge. Result: Across multiple MLLMs, GHOST achieves a hallucination success rate above 28%, far higher than the roughly 1% of prior methods; quantitative metrics and human evaluation confirm that the generated images are high-quality and free of the target object; the discovered vulnerabilities also transfer across models, for example images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate. Conclusion: GHOST is an effective diagnostic tool for finding object hallucination vulnerabilities in MLLMs and also a corrective tool, since fine-tuning on its images mitigates hallucination, which helps build more reliable multimodal systems. Abstract: Object hallucination in Multimodal Large Language Models (MLLMs) is a persistent failure mode that causes the model to perceive objects absent in the image. This weakness of MLLMs is currently studied using static benchmarks with fixed visual scenarios, which preempts the possibility of uncovering model-specific or unanticipated hallucination vulnerabilities. We introduce GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a method designed to stress-test MLLMs by actively generating images that induce hallucination. GHOST is fully automatic and requires no human supervision or prior knowledge. It operates by optimizing in the image embedding space to mislead the model while keeping the target object absent, and then guiding a diffusion model conditioned on the embedding to generate natural-looking images. The resulting images remain visually natural and close to the original input, yet introduce subtle misleading cues that cause the model to hallucinate. We evaluate our method across a range of models, including reasoning models like GLM-4.1V-Thinking, and achieve a hallucination success rate exceeding 28%, compared to around 1% in prior data-driven discovery methods. We confirm that the generated images are both high-quality and object-free through quantitative metrics and human evaluation. Also, GHOST uncovers transferable vulnerabilities: images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate. Finally, we show that fine-tuning on our images mitigates hallucination, positioning GHOST as both a diagnostic and corrective tool for building more reliable multimodal systems.

[542] DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space

Wenkun He,Yuchao Gu,Junyu Chen,Dongyun Zou,Yujun Lin,Zhekai Zhang,Haocheng Xi,Muyang Li,Ligeng Zhu,Jincheng Yu,Junsong Chen,Enze Xie,Song Han,Han Cai

Main category: cs.CV

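TL;DR: Proposes DC-Gen, a general framework that accelerates text-to-image diffusion models with a deeply compressed latent space, substantially improving the efficiency of high-resolution (for example 4K) image generation while preserving quality.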

Details

Motivation: Existing text-to-image diffusion models generate high-quality images but hit efficiency bottlenecks at high resolutions, and the redundancy within the latent space has not been fully exploited. Method: DC-Gen uses an efficient post-training pipeline: a lightweight embedding alignment stage first bridges the representation gap between the base model's latent space and the deeply compressed latent space, after which only a small amount of LoRA fine-tuning is needed to recover generation quality. Result: Validated on SANA and FLUX.1-Krea, DC-Gen matches the quality of the base models with a large speedup; DC-Gen-FLUX reduces 4K image generation latency by 53x on the NVIDIA H100 and, combined with NVFP4 SVDQuant, generates a 4K image in 3.5 seconds on a single NVIDIA 5090, a total latency reduction of 138x. Conclusion: By compressing the latent space and using lightweight fine-tuning, DC-Gen efficiently accelerates high-resolution text-to-image generation and is practical and scalable. Abstract: Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions, like 4K image generation. While previous research accelerates diffusion models in various aspects, it seldom handles the inherent redundancy within the latent space. To bridge this gap, this paper introduces DC-Gen, a general framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Rather than a costly training-from-scratch approach, DC-Gen uses an efficient post-training pipeline to preserve the quality of the base model. A key challenge in this paradigm is the representation gap between the base model's latent space and a deeply compressed latent space, which can lead to instability during direct fine-tuning. To overcome this, DC-Gen first bridges the representation gap with a lightweight embedding alignment training. Once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed to unlock the base model's inherent generation quality. We verify DC-Gen's effectiveness on SANA and FLUX.1-Krea. The resulting DC-Gen-SANA and DC-Gen-FLUX models achieve quality comparable to their base models but with a significant speedup. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU. When combined with NVFP4 SVDQuant, DC-Gen-FLUX generates a 4K image in just 3.5 seconds on a single NVIDIA 5090 GPU, achieving a total latency reduction of 138x compared to the base FLUX.1-Krea model. Code: https://github.com/dc-ai-projects/DC-Gen.

[543] DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder

Junyu Chen,Wenkun He,Yuchao Gu,Yuyang Zhao,Jincheng Yu,Junsong Chen,Dongyun Zou,Yujun Lin,Zhekai Zhang,Muyang Li,Haocheng Xi,Ligeng Zhu,Enze Xie,Song Han,Han Cai

Main category: cs.CV

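TL;DR: DC-VideoGen is a post-training acceleration framework for efficient video generation. It can be applied to any pre-trained video diffusion model, adapting it to a deep compression latent space through lightweight fine-tuning, which substantially lowers inference latency and enables high-resolution video generation on a single GPU.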

Details

Motivation: Existing video generation models are costly and inefficient at inference and struggle with long videos and high-resolution output, so a general and efficient acceleration method is needed. Method: The DC-VideoGen framework has two core components: (1) a deep compression video autoencoder with a chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression; and (2) AE-Adapt-V, an adaptation strategy that transfers pre-trained models into the new latent space quickly and stably. Result: Adapting the Wan-2.1-14B model takes only 10 GPU days on the NVIDIA H100; inference latency drops by up to 14.8x without loss of generation quality, and 2160x3840 video can be generated on a single GPU. Conclusion: DC-VideoGen offers an efficient, low-cost way to accelerate existing video diffusion models, balancing compression ratio, generation quality, and generalization to long videos, and makes high-resolution video generation more practical. Abstract: We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen.
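
To make the compression ratios quoted for DC-VideoGen concrete, the following minimal sketch computes the latent grid implied by 32x or 64x spatial and 4x temporal compression. It is an illustration under stated assumptions, not the authors' implementation: the function name `latent_grid` is hypothetical, each axis is simply divided by its factor and rounded up, and the 80-frame clip length is made up. The actual chunk-causal autoencoder may pad or group frames differently.

```python
import math


def latent_grid(frames, height, width, spatial=32, temporal=4):
    """Return (latent_frames, latent_height, latent_width).

    Assumption: each axis is divided by its compression factor and rounded up.
    The real autoencoder may handle padding and frame chunking differently.
    """
    return (
        math.ceil(frames / temporal),
        math.ceil(height / spatial),
        math.ceil(width / spatial),
    )


# Hypothetical 80-frame clip at the 2160x3840 resolution quoted in the abstract.
for factor in (32, 64):
    t, h, w = latent_grid(80, 2160, 3840, spatial=factor)
    print(f"{factor}x spatial: {t} x {h} x {w} = {t * h * w} latent positions")
```

Under these assumptions, going from 32x to 64x spatial compression leaves a quarter as many latent positions for the diffusion model to process.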

[544] PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos

Ting-Hsuan Liao,Haowen Liu,Yiran Xu,Songwei Ge,Gengshan Yang,Jia-Bin Huang

Main category: cs.CV

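TL;DR: This paper presents PAD3R, a method for reconstructing deformable 3D objects from casually captured monocular videos, achieving high-fidelity, articulated 3D reconstruction despite substantial deformation, large-scale camera movement, and limited view coverage.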

Details

Motivation: Existing methods perform poorly on long video sequences with substantial object deformation, large-scale camera movement, and limited view coverage, so a more robust and general deformable 3D reconstruction method is needed. Method: PAD3R trains a personalized, object-centric pose estimator supervised by a pre-trained image-to-3D model, combines generative priors with differentiable rendering to optimize a deformable 3D Gaussian representation, and regularizes the optimization with long-term 2D point tracking over the entire video. Result: Experiments show that PAD3R is robust and generalizes well across challenging scenarios, producing high-quality, category-agnostic 3D object representations. Conclusion: PAD3R handles complex real-world dynamic scenes well, showing potential for dynamic scene understanding and 3D content creation. Abstract: We present PAD3R, a method for reconstructing deformable 3D objects from casually captured, unposed monocular videos. Unlike existing approaches, PAD3R handles long video sequences featuring substantial object deformation, large-scale camera movement, and limited view coverage that typically challenge conventional systems. At its core, our approach trains a personalized, object-centric pose estimator, supervised by a pre-trained image-to-3D model. This guides the optimization of deformable 3D Gaussian representation. The optimization is further regularized by long-term 2D point tracking over the entire input video. By combining generative priors and differentiable rendering, PAD3R reconstructs high-fidelity, articulated 3D representations of objects in a category-agnostic way. Extensive qualitative and quantitative results show that PAD3R is robust and generalizes well across challenging scenarios, highlighting its potential for dynamic scene understanding and 3D content creation.

[545] PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

Shuoshuo Zhang,Zijian Li,Yizhen Zhang,Jingjing Fu,Lei Song,Jiang Bian,Jun Zhang,Yujiu Yang,Rui Wang

Main category: cs.CV

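TL;DR: This paper proposes PixelCraft, a multi-agent system that improves the visual reasoning of multimodal large language models on structured images such as charts and geometric diagrams. The system combines high-fidelity image processing with a flexible three-stage reasoning workflow (tool selection, agent discussion, and self-criticism), integrates pixel-level localization with traditional computer vision algorithms, and introduces an image memory that supports nonlinear, revisitable reasoning. Experiments show that PixelCraft significantly improves performance on several challenging benchmarks, setting a new standard for structured image understanding.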

Details

Motivation: Structured images remain challenging for existing MLLMs because perceptual errors easily lead to reasoning errors; existing visual-cue-based methods are limited by low-quality image processing and linear, rigid reasoning patterns, and struggle with complex tasks. Method: PixelCraft comprises a dispatcher, a planner, a reasoner, critics, and several visual tool agents. A high-quality corpus is built and an MLLM is fine-tuned into a grounding model, whose pixel-level localizations are combined with traditional CV algorithms for high-fidelity processing. A dynamic three-stage workflow (tool selection, multi-agent discussion, and self-criticism) is paired with an image memory that lets the planner revisit earlier steps, explore alternative paths, and dynamically adjust the reasoning trajectory. Result: Extensive experiments on challenging chart and geometry reasoning benchmarks show that PixelCraft significantly improves advanced MLLMs on structured image understanding and outperforms existing methods. Conclusion: Through high-fidelity visual processing and flexible multi-agent collaborative reasoning, PixelCraft addresses key bottlenecks in structured image understanding and sets a new bar for complex visual reasoning tasks. Abstract: Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained by low-fidelity image processing and linear, rigid reasoning patterns, limiting their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in tool agents. Building on this foundation, PixelCraft facilitates flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory to allow the planner to adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning. Our code will be available at https://github.com/microsoft/PixelCraft.

[546] FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation

Yunyang Ge,Xinhua Cheng,Chengshu Zhao,Xianyi He,Shenghai Yuan,Bin Lin,Bin Zhu,Li Yuan

Main category: cs.CV

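TL;DR: This paper proposes FlashI2V, an image-to-video generation method that mitigates conditional image leakage through implicit condition modeling and a Fourier guidance mechanism, showing strong generalization and generation performance on out-of-domain data.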

Details

Motivation: Existing I2V methods directly concatenate the conditional image, which causes "conditional image leakage" and leads to problems such as slow motion and color inconsistency, as well as degraded performance on out-of-domain data. This work aims to resolve the leakage in order to improve generalization. Method: FlashI2V has two core components: (1) Latent Shifting, which subtracts the conditional image information from the noisy latents to introduce the condition implicitly, changing the source and target distributions of flow matching; and (2) Fourier Guidance, which uses high-frequency magnitude features extracted with the Fourier transform to accelerate convergence and adjust the level of detail in the generated video. Result: Experiments show that FlashI2V overcomes conditional image leakage and achieves the best generalization on out-of-domain data among various I2V paradigms. With only 1.3B parameters, it reaches a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. Conclusion: Through implicit condition injection and frequency-domain guidance, FlashI2V substantially mitigates conditional image leakage and improves the generation quality and robustness of I2V models on both in-domain and out-of-domain data, offering a new approach to efficient, high-quality video generation. Abstract: In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P.
Github page: https://pku-yuangroup.github.io/FlashI2V/
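
As a rough illustration of what "high-frequency magnitude features obtained by the Fourier Transform" can mean, the sketch below takes a 1D signal, computes its discrete Fourier transform, and keeps only the magnitudes of bins at or above a cutoff frequency. This is a simplified stand-in, not FlashI2V's actual procedure: the function names, the 1D setting, and the hard `cutoff` threshold are all assumptions made for illustration, and the paper operates on image latents rather than toy signals.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def high_freq_magnitude(signal, cutoff):
    """Magnitude per DFT bin, with bins below `cutoff` zeroed out.

    The frequency of bin k is taken as min(k, n - k), so that the mirrored
    (conjugate-symmetric) half of the spectrum is treated consistently.
    The hard threshold is an illustrative assumption.
    """
    n = len(signal)
    features = []
    for k, coeff in enumerate(dft(signal)):
        freq = min(k, n - k)
        features.append(abs(coeff) if freq >= cutoff else 0.0)
    return features


# A flat signal has no high-frequency content; an alternating one is all high frequency.
flat = [1.0] * 8
alternating = [1.0, -1.0] * 4
print(high_freq_magnitude(flat, cutoff=2))
print(high_freq_magnitude(alternating, cutoff=2))
```

Raising or lowering `cutoff` changes how much fine detail survives in the feature, which loosely mirrors the detail-level adjustment the abstract mentions.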

[547] Visual Jigsaw Post-Training Improves MLLMs

Penghao Wu,Yushan Zhang,Haiwen Diao,Bo Li,Lewei Lu,Ziwei Liu

Main category: cs.CV

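TL;DR: This paper proposes Visual Jigsaw, a self-supervised post-training framework that partitions and shuffles visual inputs and asks a multimodal large language model to recover the correct arrangement in natural language, strengthening its visual understanding. The method needs no extra annotations or visual generative components, applies to images, videos, and 3D data, and yields substantial gains in fine-grained perception, temporal reasoning, and 3D spatial understanding.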

Details

Motivation: Existing post-training methods for multimodal large language models are predominantly text-centric and neglect deep understanding of visual signals; a genuinely vision-centric self-supervised training paradigm is lacking. Method: The Visual Jigsaw framework partitions visual inputs into pieces and shuffles them at random, forming a general ordering task in which the model must output the correct permutation as a natural-language response, and is trained with reinforcement learning from verifiable rewards (RLVR). Result: The method yields substantial gains across images, videos, and 3D data, strengthening MLLMs' fine-grained perception, temporal reasoning, and 3D spatial understanding without human annotation or extra generative modules. Conclusion: Visual Jigsaw shows that vision-centric self-supervised post-training can effectively improve the visual understanding of MLLMs, pointing to a new direction for designing further vision-centric pretext tasks. Abstract: Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. There exist a few approaches in this direction; however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities, including images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: https://penghao-wu.github.io/visual_jigsaw/
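
The ordering task described for Visual Jigsaw can be sketched in a few lines: shuffle a list of pieces, keep the permutation as the label, and score a text answer against it. The sketch below is a toy version under stated assumptions, not the paper's code: the function names are hypothetical, strings stand in for image patches, video clips, or 3D points, and the all-or-nothing reward is an assumption (the abstract only says the reward is verifiable).

```python
import random


def make_jigsaw(pieces, seed=0):
    """Shuffle pieces; return (shuffled, order) where order[i] is the
    original index of the piece shown at slot i. No annotation is needed:
    the label comes from the shuffle itself."""
    rng = random.Random(seed)
    order = list(range(len(pieces)))
    rng.shuffle(order)
    shuffled = [pieces[j] for j in order]
    return shuffled, order


def parse_answer(text):
    """Parse a reply such as '2, 0, 1' into a list of ints, or None if malformed."""
    try:
        return [int(tok) for tok in text.replace(",", " ").split()]
    except ValueError:
        return None


def reward(answer_text, target):
    """Verifiable reward: 1.0 for the exact permutation, 0.0 otherwise.
    Exact-match scoring is an illustrative assumption."""
    pred = parse_answer(answer_text)
    if pred is None or sorted(pred) != list(range(len(target))):
        return 0.0  # not a valid permutation of the slots
    return 1.0 if pred == target else 0.0


shuffled, order = make_jigsaw(["top-left", "top-right", "bottom-left", "bottom-right"])
correct_reply = ", ".join(str(i) for i in order)
print(shuffled, reward(correct_reply, order), reward("no idea", order))
```

Because the answer is checked mechanically against the stored permutation, the same pattern plugs into an RLVR loop without a learned reward model.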

[548] VGGT-X: When VGGT Meets Dense Novel View Synthesis

Yang Liu,Chuanchen Luo,Zimo Tang,Junran Peng,Zhaoxiang Zhang

Main category: cs.CV

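TL;DR: This paper studies the application of 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS) and proposes VGGT-X to address the VRAM burden and degraded output quality that arise when scaling to dense views.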

Details

Motivation: Existing NVS methods rely on SfM to obtain accurate 3D attributes, a process that is slow and fragile, while current 3DFMs have mostly been validated only in sparse-view settings and do not extend directly to dense scenes. Method: VGGT-X comprises a memory-efficient VGGT implementation, an adaptive global alignment module, and robust 3DGS training strategies, and handles more than a thousand images. Result: Experiments show that VGGT-X substantially narrows the rendering fidelity gap with COLMAP-initialized methods and achieves state-of-the-art results in dense COLMAP-free NVS and pose estimation. Conclusion: VGGT-X advances the use of 3DFMs in dense NVS and analyzes the causes of the remaining gap, offering insights for the future development of 3D foundation models. Abstract: We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in Novel View Synthesis powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders of magnitude speedup over the traditional pipeline and great potential for online NVS. But most of the validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of remaining gaps with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/

Table of Contents

cs.CL [Back]

[1] Are you sure? Measuring models bias in content moderation through uncertainty

[2] AccessEval: Benchmarking Disability Bias in Large Language Models

[3] RAR$^2$: Retrieval-Augmented Medical Reasoning via Thought-Driven Retrieval

[4] TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant?

[5] Multi-Modal Sentiment Analysis with Dynamic Attention Fusion

[6] Enabling Approximate Joint Sampling in Diffusion LMs

[7] Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

[8] MIRAGE: Multi-hop Reasoning with Ambiguity Evaluation for Illusory Questions

[9] ML2B: Multi-Lingual ML Benchmark For AutoML

[10] ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection

[11] EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation

[12] Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

[13] ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

[14] Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems

[15] Towards Generalizable Implicit In-Context Learning with Attention Routing

[16] The Bias is in the Details: An Assessment of Cognitive Bias in LLMs

[17] Lexicon-Enriched Graph Modeling for Arabic Document Readability Prediction

[18] HEART: Emotionally-driven test-time scaling of Language Models

[19] Infusing Theory of Mind into Socially Intelligent LLM Agents

[20] Extract-0: A Specialized Language Model for Document Information Extraction

[21] Large language models management of medications: three performance analyses

[22] LLMs Behind the Scenes: Enabling Narrative Scene Illustration

[23] What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?

[24] Emergent morpho-phonological representations in self-supervised speech models

[25] Same Content, Different Representations: A Controlled Study for Table QA

[26] ADAM: A Diverse Archive of Mankind for Evaluating and Enhancing LLMs in Biographical Reasoning

[27] AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts

[28] Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

[29] Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate

[30] Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks

[31] From Evidence to Trajectory: Abductive Reasoning Path Synthesis for Training Retrieval-Augmented Generation Agents

[32] The Geometry of Creative Variability: How Credal Sets Expose Calibration Gaps in Language Models

[33] d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

[34] How to Make Large Language Models Generate 100% Valid Molecules?

[35] Non-Collaborative User Simulators for Tool Agents

[36] Tagging the Thought: Unlocking Personalization Reasoning via Reinforcement Learning

[37] Tree Reward-Aligned Search for TReASURe in Masked Diffusion Language Models

[38] Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

[39] Pretraining LLM with Latent Thoughts in Continuous Space

[40] Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts

[41] Estimating the strength and timing of syntactic structure building in naturalistic reading

[42] From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

[43] Global Beats, Local Tongue: Studying Code Switching in K-pop Hits on Billboard Charts

[44] Steering Prepositional Phrases in Language Models: A Case of with-headed Adjectival and Adverbial Complements in Gemma-2

[45] PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness

[46] A Structured Framework for Evaluating and Enhancing Interpretive Capabilities of Multimodal LLMs in Culturally Situated Tasks

[47] Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models

[48] Fin-ExBERT: User Intent based Text Extraction in Financial Context using Graph-Augmented BERT and trainable Plugin

[49] A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models

[50] Scaling Policy Compliance Assessment in Language Models with Policy Reasoning Traces

[51] Learning to Reason in Structured In-context Environments with Reinforcement Learning

[52] C-Evolve: Consensus-based Evolution for Prompt Groups

[53] Dual-Space Smoothness for Robust and Balanced LLM Unlearning

[54] MedCritical: Enhancing Medical Reasoning in Small Language Models via Self-Collaborative Correction

[55] Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

[56] CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding

[57] Guard Vector: Beyond English LLM Guardrails with Task-Vector Composition and Streaming-Aware Prefix SFT

[58] Train Once, Answer All: Many Pretraining Experiments for the Cost of One

[59] No Loss, No Gain: Gated Refinement and Adaptive Compression for Prompt Optimization

[60] Liaozhai through the Looking-Glass: On Paratextual Explicitation of Culture-Bound Terms in Machine Translation

[61] Comparison of Scoring Rationales Between Large Language Models and Human Raters

[62] Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models

[63] Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models

[64] Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review

[65] The Impact of Role Design in In-Context Learning for Large Language Models

[66] AraS2P: Arabic Speech-to-Phonemes System

[67] From Human Annotation to Automation: LLM-in-the-Loop Active Learning for Arabic Sentiment Analysis

[68] On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization

[69] Automatic Speech Recognition for Greek Medical Dictation

[70] Towards Efficient CoT Distillation: Self-Guided Rationale Selector for Better Performance with Fewer Rationales

[71] Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks

[72] LLM Hallucination Detection: HSAD

[73] Timber: Training-free Instruct Model Refining with Base via Effective Rank

[74] Fast Thinking for Large Language Models

[75] Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models

[76] Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in LLMs

[77] Aligning LLMs for Multilingual Consistency in Enterprise Applications

[78] TF-Bench: Evaluating Program Semantics Reasoning with Type Inference in System F