cs.CL

[1] DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

Nikita Gupta,Riju Chatterjee,Lukas Haas,Connie Tao,Andrew Wang,Chang Liu,Hidekazu Oiwa,Elena Gribovskaya,Jan Ackermann,John Blitzer,Sasha Goldshtein,Dipanjan Das

Main category: cs.CL

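TL;DR: DeepSearchQA is a 900-prompt benchmark for evaluating AI agents on multi-step information-seeking tasks across 17 fields. It emphasizes systematic collation of information, de-duplication and entity resolution, and reasoning about when to stop in an open-ended search space.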

Details

Motivation: Existing benchmarks mostly target single-answer retrieval or broad-spectrum factuality and do not systematically evaluate complex, multi-step, deep information seeking; a dedicated test is needed for agents' long-horizon planning, collation of fragmented information, and precise answer generation on the real open web. Method: Builds a handcrafted task set structured as causal chains (each step depends on the previous one), with every task grounded in the open web and objectively verifiable; runs a large-scale empirical evaluation of mainstream agent architectures, analyzing recall, precision, and typical failure modes. Result: Current state-of-the-art models struggle to balance high recall with high precision; common failure modes are premature stopping (under-retrieval) and hedging, where agents emit many low-confidence answers to boost recall. Conclusion: DeepSearchQA exposes key weaknesses of current agents in deep research and serves as a diagnostic tool for improving systematic information processing, robust search planning, and precise answer generation. Abstract: We introduce DeepSearchQA, a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields. Unlike traditional benchmarks that target single answer retrieval or broad-spectrum factuality, DeepSearchQA features a dataset of challenging, handcrafted tasks designed to evaluate an agent's ability to execute complex search plans to generate exhaustive answer lists. This shift in design explicitly tests three critical, yet under-evaluated capabilities: 1) systematic collation of fragmented information from disparate sources, 2) de-duplication and entity resolution to ensure precision, and 3) the ability to reason about stopping criteria within an open-ended search space. Each task is structured as a causal chain, where discovering information for one step is dependent on the successful completion of the previous one, stressing long-horizon planning and context retention. All tasks are grounded in the open web with objectively verifiable answer sets. Our comprehensive evaluation of state-of-the-art agent architectures reveals significant performance limitations: even the most advanced models struggle to balance high recall with precision. We observe distinct failure modes ranging from premature stopping (under-retrieval) to hedging behaviors, where agents cast an overly wide net of low-confidence answers to artificially boost recall. These findings highlight critical headroom in current agent designs and position DeepSearchQA as an essential diagnostic tool for driving future research toward more robust, deep-research capabilities.
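The recall/precision trade-off described above can be illustrated with a minimal sketch. It assumes answers are scored as sets after a simple lowercase/whitespace normalisation; the function name `set_scores` and the toy answer lists are hypothetical, and the benchmark's actual scoring procedure may differ.

```python
def set_scores(predicted, gold):
    """Precision, recall and F1 for an answer list scored as a set."""
    # Assumption: lowercasing and stripping is enough to collapse duplicates.
    # Real entity resolution would need considerably more than this.
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


gold = ["Alpha", "Beta", "Gamma", "Delta"]

# Premature stopping: everything returned is right, but half is missing.
under = set_scores(["alpha", "beta"], gold)  # precision 1.0, recall 0.5

# Hedging: all gold answers are found, padded with four wrong guesses.
hedge = set_scores(
    ["alpha", "beta", "gamma", "delta", "x1", "x2", "x3", "x4"], gold
)  # precision 0.5, recall 1.0
```

Both toy agents end up with the same F1, which is why reporting precision and recall separately helps tell the two failure modes apart.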

[2] asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation

Oleg Sedukhin,Andrey Kostin

Main category: cs.CL

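TL;DR: The paper proposes several improvements to speech recognition evaluation: a string alignment algorithm that supports multi-reference labeling and long insertions, a new longform in-the-wild Russian test set (DiverseSpeech-Ru), and evidence that models readily overfit to labeling conventions, which inflates metrics. It also provides tools for evaluating streaming recognition and uniform model wrappers.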

Details

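Motivation: Existing speech recognition evaluation methods are limited when handling non-Latin languages, languages with rich word formation, and longform or cluttered speech; single-reference labeling also lets models overfit to the labeling style, creating an illusion of improvement. Method: Proposes a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions, and better word alignment; collects and labels the new test set DiverseSpeech-Ru; relabels a popular Russian dataset with multiple references; analyzes adaptation to labeling during fine-tuning; develops tools for streaming evaluation and for visually aligning multiple transcriptions; provides uniform wrappers for offline and streaming ASR models. Result: Shows that models readily adapt to dataset-specific labeling, producing spurious reductions in WER and similar metrics; delivers more robust multi-reference evaluation and a reproducible, extensible tool chain; DiverseSpeech-Ru and the code will be open-sourced. Conclusion: Multi-reference labeling and improved alignment are essential for speech recognition evaluation, especially for morphologically complex languages; relying on single-reference WER alone introduces bias; the tools and dataset support fairer, finer-grained ASR evaluation. Abstract: We propose several improvements to speech recognition evaluation. First, we propose a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions, and better word alignment. This is especially useful for non-Latin languages and those with rich word formation, and for labeling cluttered or longform speech. Secondly, we collect a novel test set DiverseSpeech-Ru of longform in-the-wild Russian speech with careful multi-reference labeling. We also perform multi-reference relabeling of a popular Russian test set and study fine-tuning dynamics on its corresponding train set. We demonstrate that the model often adapts to dataset-specific labeling, causing an illusion of metric improvement. Based on the improved word alignment, we develop tools to evaluate streaming speech recognition and to align multiple transcriptions to compare them visually. Additionally, we provide uniform wrappers for many offline and streaming speech recognition models. Our code will be made publicly available.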
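To make the idea of multi-reference scoring concrete, here is a deliberately simplified sketch. It is not the alignment algorithm from the paper: it assumes each utterance has several whole-utterance reference transcripts and scores the hypothesis against the closest one. The names `edit_distance` and `multi_reference_wer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def multi_reference_wer(references, hypothesis):
    """WER against whichever reference transcript is closest."""
    hyp = hypothesis.split()
    return min(
        edit_distance(ref.split(), hyp) / max(len(ref.split()), 1)
        for ref in references
    )


single = multi_reference_wer(["twenty five dollars"], "25 dollars")
multi = multi_reference_wer(["twenty five dollars", "25 dollars"], "25 dollars")
```

With a single reference, the hypothesis is penalised for a spelling convention rather than a recognition error; adding the second acceptable transcript removes that penalty.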

[3] UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop

Muhammad Ali Shafique,Areej Mehboob,Layba Fiaz,Muhammad Usman Qadeer,Hamza Farooq

Main category: cs.CL

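TL;DR: The paper proposes a contextually ensembled translation framework that combines multiple translation systems with human validation, uses it to build UrduBench, the first standardized Urdu reasoning benchmark, and systematically evaluates a range of LLMs on Urdu reasoning, highlighting the difficulty of multi-step and symbolic reasoning and the importance of language alignment.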

Details

Motivation: Low-resource languages such as Urdu lack standardized reasoning benchmarks; existing approaches are limited by the sensitivity of machine translation and an emphasis on general language tasks, so they cannot accurately measure a model's reasoning ability. Method: Proposes a contextually ensembled translation framework that merges several translation systems and adds human validation to preserve semantic, structural, and contextual integrity; translates MGSM, MATH-500, CommonSenseQA, and OpenBookQA into Urdu to form UrduBench; systematically evaluates reasoning-oriented and instruction-tuned LLMs under multiple prompting strategies. Result: Multi-step and symbolic reasoning degrade markedly in Urdu; language consistency is a key prerequisite for robust reasoning; performance differs clearly across datasets, difficulty levels, model architectures, and model scales; the method extends to other low-resource languages. Conclusion: The work establishes the first scalable methodology for Urdu reasoning evaluation, offers empirical insight into multilingual reasoning failures, and will release code and datasets to advance AI evaluation for low-resource languages. Abstract: Recent advances in large language models (LLMs) have led to strong reasoning capabilities; however, evaluating such models in low-resource languages remains challenging due to the lack of standardized benchmarks. In particular, Urdu reasoning evaluation has been limited by the sensitivity of machine translation and an emphasis on general language tasks rather than reasoning benchmarks. In this paper, we propose a contextually ensembled translation framework with human-in-the-loop validation that leverages multiple translation systems to develop Urdu reasoning benchmarks while preserving contextual and structural integrity. Using this framework, we translate widely adopted reasoning and question-answering benchmarks, including MGSM, MATH-500, CommonSenseQA, and OpenBookQA, into Urdu, collectively referred to as UrduBench, and conduct a comprehensive evaluation of both reasoning-oriented and instruction-tuned LLMs across multiple prompting strategies. Our analysis reveals performance differences across (1) four datasets, (2) five task difficulty levels, (3) diverse model architectures, (4) multiple model scaling settings, and (5) language consistency tests. We find that multi-step and symbolic reasoning tasks pose significant challenges in Urdu, and that stable language alignment is a critical prerequisite for robust reasoning. Overall, our work establishes a scalable methodology for standardized reasoning evaluation in Urdu and provides empirical insights into multilingual reasoning failures. This experimental setup is also broadly applicable to other low-resource languages. The code and datasets will be publicly released.

[4] Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations

Amit Meghanani,Thomas Hain

Main category: cs.CL

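TL;DR: The paper examines how mean squared error (MSE) loss, when used to fine-tune a front-end speech enhancement (SE) model against self-supervised learning (SSL) speech representations, is confounded by positional embeddings. It studies two position-invariant fine-tuning strategies, zero-padding and speed perturbation with a soft dynamic time warping (soft-DTW) loss; the latter converges faster and gives better downstream performance.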

Details

Motivation: When fine-tuning against SSL representations, MSE loss tends to exploit positional embeddings rather than content, biasing the optimization; position-invariant, fine-grained alignment strategies are needed. Method: Considers two strategies: (1) zero-padding, previously used in SSL pre-training and here applied to fine-tuning; (2) speed perturbation combined with a soft-DTW loss, giving representation-guided speech enhancement that is more robust to alignment. Result: Compared with MSE, the soft-DTW approach converges faster and improves downstream task performance (e.g. ASR), supporting the value of position-invariant fine-tuning. Conclusion: Position-invariant, fine-grained alignment such as soft-DTW suits SSL-representation-guided SE fine-tuning better than conventional MSE and suggests a new direction for representation alignment in SSL speech modelling. Abstract: Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models is effective for downstream tasks in noisy conditions. SE models are commonly fine-tuned using SSL representations with mean squared error (MSE) loss between enhanced and clean speech. However, MSE is prone to exploiting positional embeddings in SSL models, allowing the objective to be minimised through positional correlations instead of content-related information. This work frames the problem as a general limitation of self-supervised representation fine-tuning and investigates it through representation-guided SE. Two strategies are considered: (1) zero-padding, previously explored in SSL pre-training but here examined in the fine-tuning setting, and (2) speed perturbations with a soft-DTW loss. Experiments show that the soft-DTW-based approach achieves faster convergence and improved downstream performance, underscoring the importance of position-invariant fine-tuning in SSL-based speech modelling.
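For readers unfamiliar with soft-DTW, the following is a minimal NumPy sketch of the standard soft-DTW recursion (a smoothed minimum over alignment paths). It is not the paper's training code: the squared Euclidean frame cost, the `gamma` value, and the toy one-dimensional "representations" are assumptions made for illustration only.

```python
import numpy as np


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between frame sequences x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    # Assumed frame cost: squared Euclidean distance between every frame pair.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[n, m]


clean = np.array([[0.0], [1.0], [2.0], [3.0]])
slowed = np.repeat(clean, 2, axis=0)  # same content at half speed

aligned = soft_dtw(clean, slowed, gamma=0.01)
reversed_order = soft_dtw(clean, slowed[::-1], gamma=0.01)
```

A frame-by-frame MSE cannot even be computed between `clean` and `slowed` because their lengths differ, whereas the soft-DTW value stays near zero for the time-stretched copy and grows large when the content order is reversed. That is the sense in which the loss depends on content rather than absolute position.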

[5] ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference

Ketan Thakkar,Maitreyi Chatterjee,Ramasubramanian Balasubramanian,Achyuthan Jootoo,Rajendra Ugrani

Main category: cs.CL

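TL;DR: The paper proposes ChunkWise LoRA, a dynamic, adaptive LoRA method that splits sequences into variable-length chunks based on token complexity and gives each chunk its own low-rank configuration, reducing latency and memory while maintaining or improving model performance.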

Details

Motivation: Existing LoRA methods apply a static, uniform rank configuration to all input tokens, ignoring differences in token complexity and compute requirements, which is inefficient. Method: Proposes ChunkWise LoRA: a runtime scheduler estimates token difficulty, chunks adaptively, and selects a per-chunk low rank and scaling via a rank-ladder mechanism; a boundary-safe composition module and policy-driven KV-cache strategies preserve output consistency. Result: On benchmarks such as Wikitext-103 and SQuAD, latency drops by up to 34% and memory by 38% relative to baseline LoRA, while BLEU, EM, and perplexity are maintained or improved. Conclusion: ChunkWise LoRA is an efficient, compatible approach to parameter-efficient LLMs that can be deployed directly on existing transformer architectures and inference frameworks. Abstract: Recent advances in low-rank adaptation (LoRA) have enabled efficient fine-tuning of large language models (LLMs) with minimal additional parameters. However, existing LoRA methods apply static rank configurations uniformly across all input tokens, ignoring variation in token complexity and computational requirements. In this work, we propose ChunkWise LoRA, a dynamic and adaptive approach that partitions sequences into variable-length chunks based on token complexity and assigns each chunk a tailored low-rank configuration. Our system introduces a runtime scheduler that estimates token difficulty, performs adaptive chunking, and selects per-chunk LoRA rank and scaling using a rank-ladder mechanism. To preserve output consistency, we further introduce a boundary-safe composition module and integrate policy-driven KV-cache strategies. Experiments on benchmark datasets such as Wikitext-103 and SQuAD demonstrate that ChunkWise LoRA achieves up to 34% lower latency and 38% memory reduction compared to baseline LoRA, while maintaining or improving task performance metrics like BLEU, EM, and perplexity. The proposed framework remains fully compatible with existing transformer architectures and inference frameworks, providing a practical solution for real-world deployment of parameter-efficient LLMs.

[6] Multi-task Code LLMs: Data Mix or Model Merge?

Mingzhi Zhu,Boris Sobolev,Rahul Krishna,Raju Pavuluri,Stacy Patterson,Michele Merler

Main category: cs.CL

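TL;DR: The paper compares data mixing and model merging for building small multi-task code LLMs. Model merging works better at the larger scale, data mixing at the smaller scale, and a weight analysis technique is introduced to understand how tasks affect model parameters.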

Details

Motivation: As demand grows for deploying small, specialized code LLMs in agentic frameworks, multi-task learning strategies are needed that balance performance, constraints, and cost efficiently. Method: On two model families (Qwen Coder and DeepSeek Coder) at two scales (2B and 7B parameters), trains multi-task models for code generation and code summarization using data mixing and model merging, evaluates on HumanEval, MBPP, and CodeXGlue, and introduces a weight analysis technique to study how tasks affect parameters. Result: Model merging performs best at the larger scale, retaining 96% of specialized-model code generation performance while keeping summarization ability; the merged Qwen Coder 2.5 7B model reaches 92.7% Pass@1 on HumanEval, above the corresponding task-specific model (90.9%); at the smaller scale, data mixing is better. Conclusion: Carefully designed merging and mixing strategies can combine task capabilities without significant performance loss, which suits resource-constrained deployment. Abstract: Recent research advocates deploying smaller, specialized code LLMs in agentic frameworks alongside frontier models, sparking interest in efficient strategies for multi-task learning that balance performance, constraints, and costs. We compare two approaches for creating small, multi-task code LLMs: data mixing versus model merging. We conduct extensive experiments across two model families (Qwen Coder and DeepSeek Coder) at two scales (2B and 7B parameters), fine-tuning them for code generation and code summarization tasks. Our evaluation on HumanEval, MBPP, and CodeXGlue benchmarks reveals that model merging achieves the best overall performance at larger scale across model families, retaining 96% of specialized model performance on code generation tasks while maintaining summarization capabilities. Notably, merged models can even surpass individually fine-tuned models, with our best configuration of Qwen Coder 2.5 7B model achieving 92.7% Pass@1 on HumanEval compared to 90.9% for its task-specific fine-tuned equivalent. At a smaller scale we find instead data mixing to be a preferred strategy. We further introduce a weight analysis technique to understand how different tasks affect model parameters and their implications for merging strategies. The results suggest that careful merging and mixing strategies can effectively combine task-specific capabilities without significant performance degradation, making them ideal for resource-constrained deployment scenarios.
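The abstract does not say which merging algorithm is used, so the sketch below shows only the simplest possible form of model merging, linear interpolation of two fine-tuned checkpoints that share an architecture. The function `merge_linear`, the `alpha` parameter, and the toy parameter dictionaries are hypothetical illustrations, not the paper's method.

```python
import numpy as np


def merge_linear(weights_a, weights_b, alpha=0.5):
    """Interpolate two checkpoints parameter by parameter.

    Assumption: both checkpoints come from the same base architecture,
    so they expose identical parameter names and shapes.
    """
    if weights_a.keys() != weights_b.keys():
        raise ValueError("checkpoints must share parameter names")
    return {
        name: alpha * weights_a[name] + (1.0 - alpha) * weights_b[name]
        for name in weights_a
    }


# Toy stand-ins for a code-generation and a code-summarization checkpoint.
generation = {"layer.weight": np.array([1.0, 3.0])}
summarization = {"layer.weight": np.array([3.0, 1.0])}

merged = merge_linear(generation, summarization, alpha=0.5)
```

Data mixing, by contrast, would combine the two tasks' training examples and fine-tune a single model once; merging needs no joint training run, only the two finished checkpoints.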

[7] Large Language Models Naively Recover Ethnicity from Individual Records

Noah Dasanaike

Main category: cs.CL

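TL;DR: The paper shows that large language models (LLMs) can infer ethnicity from names alone with high accuracy and no additional training data, outperform the traditional BISG method across several countries, reduce income bias, and support low-cost local deployment.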

Details

Motivation: Traditional ethnicity inference methods such as BISG rely on US surname geocoding, generalize poorly to other countries, and suffer from income bias; the paper asks whether LLMs can infer ethnicity from names more accurately and fairly without supervision and across cultures. Method: Uses several mainstream LLMs (e.g. GPT-4o, Gemini 3 Flash, DeepSeek v3.2, GLM-4.7) to classify ethnicity from real voter registration names; refines the approach with stratified sampling, extended reasoning, and added metadata such as party registration; validates across cultures on data from India, Lebanon, Uganda, and other countries; finally fine-tunes small models for low-cost local deployment. Result: On Florida and North Carolina voter data the LLM approach reaches 84.7% accuracy versus 68.2% for BISG, and 86.7% with party information; accuracy is 64.3% for Lebanese religious sect, 99.2% for Indian MPs from reserved constituencies, and 74.0% for Indian land records with caste classification; on full-count voter rolls from six countries it recovers known population distributions; small fine-tuned models also exceed BISG and can be deployed locally at no cost. Conclusion: LLMs have strong, transferable ability to infer ethnicity from names, with better accuracy, less bias, cross-cultural adaptability, and lightweight deployment, offering a new tool for social science, public policy, and fairness research. Abstract: I demonstrate that large language models can infer ethnicity from names with accuracy exceeding that of Bayesian Improved Surname Geocoding (BISG) without additional training data, enabling inference outside the United States and to contextually appropriate classification categories. Using stratified samples from Florida and North Carolina voter files with self-reported race, LLM-based classification achieves up to 84.7% accuracy, outperforming BISG (68.2%) on balanced samples. I test six models including Gemini 3 Flash, GPT-4o, and open-source alternatives such as DeepSeek v3.2 and GLM-4.7. Enabling extended reasoning can improve accuracy by 1-3 percentage points, though effects vary across contexts; including metadata such as party registration reaches 86.7%. LLM classification also reduces the income bias inherent in BISG, where minorities in wealthier neighborhoods are systematically misclassified as White. I further validate using Lebanese voter registration with religious sect (64.3% accuracy), Indian MPs from reserved constituencies (99.2%), and Indian land records with caste classification (74.0%). Aggregate validation across India, Uganda, Nepal, Armenia, Chile, and Costa Rica using original full-count voter rolls demonstrates that the method recovers known population distributions where naming conventions are distinctive. For large-scale applications, small transformer models fine-tuned on LLM labels exceed BISG accuracy while enabling local deployment at no cost.

[8] EnsembleLink: Accurate Record Linkage Without Training Data

Noah Dasanaike

Main category: cs.CL

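TL;DR: The paper presents EnsembleLink, a method that uses pre-trained language models to perform accurate record linkage without labeled data, addressing the uncertainty that linkage errors introduce into downstream social science analyses.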

Details

Motivation: Record linkage is essential to empirical social science, but existing methods either achieve low accuracy or need substantial labeled data, and researchers usually treat linkage as preprocessing without quantifying the uncertainty that linkage errors introduce. Method: Proposes EnsembleLink, which leverages pre-trained language models that have learned semantic relationships from large text corpora (e.g. place-name hierarchies, party aliases), requires no labeled data, and runs locally on open-source models. Result: On benchmarks covering city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods that require extensive labeling, and typical tasks finish in minutes. Conclusion: EnsembleLink offers social science an accurate, label-free, locally deployable approach to record linkage that helps make downstream analyses more reliable. Abstract: Record linkage, the process of matching records that refer to the same entity across datasets, is essential to empirical social science but remains methodologically underdeveloped. Researchers treat it as a preprocessing step, applying ad hoc rules without quantifying the uncertainty that linkage errors introduce into downstream analyses. Existing methods either achieve low accuracy or require substantial labeled training data. I present EnsembleLink, a method that achieves high accuracy without any training labels. EnsembleLink leverages pre-trained language models that have learned semantic relationships (e.g., that "South Ozone Park" is a neighborhood in "New York City" or that "Lutte ouvriere" refers to the Trotskyist "Workers' Struggle" party) from large text corpora. On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling. The method runs locally on open-source models, requiring no external API calls, and completes typical linkage tasks in minutes.

[9] Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space

Tobias Materzok

Main category: cs.CL

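TL;DR: The paper introduces Output-Space Search (OS-Search), which recasts LLM generation as endpoint search in a 3D output space Z defined by a frozen encoder. An outer loop picks a target z*, and a retrieval-grounded policy trained with sequence-level reinforcement learning generates outputs that land near it; experiments on stories and code show gains in diversity and optimization.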

Details

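Motivation: Conventional LLM generation relies on token-by-token or program-path search, which is strongly path-dependent, hard to parallelize, and hard to optimize as a black box; the paper aims to decouple generation from path dependence for more efficient, parallel, and optimizable output control. Method: Proposes the OS-Search framework: the outer loop selects a target z* in a 3D output space Z defined by a frozen encoder; the inner loop uses a retrieval-grounded policy, trained with sequence-level reinforcement learning, to generate outputs whose coordinates land near z* under standard autoregressive decoding; this supports parallel sweeps and black-box optimization such as Bayesian optimization. Result: On stories, sweeping Z yields 3.1x higher LLM-scored diversity than prompt-chaining; on code, Bayesian optimization over Z improves a withheld objective under matched inference budgets while keeping the generated code valid. Conclusion: OS-Search treats generation as locating an endpoint in output space, removing token-level path dependence, enabling efficient parallel search and black-box optimization, and showing clear advantages in diversity and controllability. Abstract: We introduce Output-Space Search (OS-Search), which turns LLM generation into endpoint search. An outer loop selects a target z* in a frozen encoder-defined 3D output space Z, and a retrieval-grounded policy trained with sequence-level RL generates outputs whose coordinates land near z* under standard autoregressive decoding. This enables parallel sweeps and black-box optimization in Z without path-dependent token/program search. On stories, sweeping Z (text) yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z (code) improves an objective withheld from the controller under matched inference budgets while preserving validity.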

[10] From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning

Xiulin Yang,Heidi Getz,Ethan Gotlieb Wilcox

Main category: cs.CL

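TL;DR: Using cross-linguistic corpus analysis and neural modeling experiments, the paper shows that three distributional properties of function words (high frequency, reliable association with syntactic structure, and alignment with phrase boundaries) hold across 186 languages, and that language variants preserving all three are easier for neural learners to acquire. Frequency and structural association contribute more than boundary alignment, and probing and ablation analyses show that models rely on function words in systematically different ways under different learning conditions.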

Details

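Motivation: To identify the statistical conditions that support learning hierarchical structure from linear input, focusing on the distributional properties of function words and their role in language acquisition. Method: Uses cross-linguistic corpus analysis to test how universal the three properties (high frequency, structural association, boundary alignment) are; combines counterfactual language modeling with ablation experiments to assess each property's effect on acquisition by neural learners; uses probing and ablation analyses to examine how models depend on function words under different learning conditions. Result: All three properties are present across 186 languages; variants preserving all three are easier to acquire, with frequency and structural association mattering more than boundary alignment; different learning conditions lead to systematically different reliance on function words. Conclusion: The statistical distribution of function words provides key support for learning hierarchical structure, but similar behavior can arise from different internal mechanisms, so acquisition research should analyze both behavior and mechanism. Abstract: What statistical conditions support learning hierarchical structure from linear input? In this paper, we address this question by focusing on the statistical distribution of function words. Function words have long been argued to play a crucial role in language acquisition due to their distinctive distributional properties, including high frequency, reliable association with syntactic structure, and alignment with phrase boundaries. We use cross-linguistic corpus analysis to first establish that all three properties are present across 186 studied languages. Next, we use a combination of counterfactual language modeling and ablation experiments to show that language variants preserving all three properties are more easily acquired by neural learners, with frequency and structural association contributing more strongly than boundary alignment. Follow-up probing and ablation analyses further reveal that different learning conditions lead to systematically different reliance on function words, indicating that similar performance can arise from distinct internal mechanisms.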

[11] Scaling Embeddings Outperforms Scaling Experts in Language Models

Hong Liu,Jiaqi Zhang,Chao Wang,Xing Hu,Linkun Lyu,Jiaqi Sun,Xurui Yang,Bo Wang,Fengcun Li,Yulei Qian,Lingtong Si,Yerui Sun,Rumei Li,Peng Pei,Yuchen Xie,Xunliang Cai

Main category: cs.CL

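TL;DR: The paper proposes scaling the embedding layer (embedding scaling) as a way to scale sparsity, which under specific conditions achieves a better Pareto frontier than the usual scaling of expert count (expert scaling). The authors characterize the architectural factors that govern its effectiveness, combine system optimizations with speculative decoding to speed up inference, and introduce LongCat-Flash-Lite, a 68.5B-parameter model with about 3B activated that performs strongly on agentic and coding tasks.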

Details

Motivation: MoE architectures in large language models face diminishing returns and system-level bottlenecks, so new dimensions for scaling sparsity are needed. Method: Through analysis and experiments, identifies the parameter and structural regimes where embedding scaling beats expert scaling; systematically studies how parameter budgeting and model width and depth affect embedding scaling; combines tailored system optimizations with speculative decoding to improve inference efficiency; trains LongCat-Flash-Lite from scratch to validate the approach. Result: LongCat-Flash-Lite (68.5B total parameters, ~3B activated) surpasses parameter-equivalent MoE baselines on multiple benchmarks and matches or exceeds advanced models of comparable scale on agentic and coding tasks. Conclusion: Embedding scaling is an efficient sparsity-scaling path orthogonal to expert scaling; with sound architecture design and system co-optimization it can significantly improve model performance and inference efficiency. Abstract: While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.

[12] Multilingual Dysarthric Speech Assessment Using Universal Phone Recognition and Language-Specific Phonemic Contrast Modeling

Eunjung Yeo,Julie M. Liss,Visar Berisha,David R. Mortensen

Main category: cs.CL

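TL;DR: The paper presents a multilingual dysarthric speech assessment framework that combines universal phone recognition with language-specific phoneme interpretation. Using contrastive phonological feature distances and sequence alignment, it produces three metrics (PER, PFER, PhonCov) and validates their clinical relevance in four languages.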

Details

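Motivation: Dysarthria associated with neurological disorders is increasingly prevalent, creating a need for automated intelligibility assessment that works across languages; existing methods are mostly monolingual or ignore language-specific factors. Method: Builds a multilingual phoneme-production assessment framework that integrates universal phone recognition with language-specific phoneme interpretation, uses contrastive phonological feature distances for phone-to-phoneme mapping together with sequence alignment, and defines three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and the alignment-free phoneme coverage (PhonCov). Result: On English, Spanish, Italian, and Tamil data, PER benefits from mapping plus alignment, PFER from alignment alone, and PhonCov from mapping alone; the framework captures degradation patterns consistent with clinical observations. Conclusion: The framework accounts for both shared and language-specific properties, and its metrics are cross-lingually applicable and clinically interpretable, offering a new approach to multilingual dysarthria assessment. Abstract: The growing prevalence of neurological disorders associated with dysarthria motivates the need for automated intelligibility assessment methods that are applicable across languages. However, most existing approaches are either limited to a single language or fail to capture language-specific factors shaping intelligibility. We present a multilingual phoneme-production assessment framework that integrates universal phone recognition with language-specific phoneme interpretation using contrastive phonological feature distances for phone-to-phoneme mapping and sequence alignment. The framework yields three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and a newly proposed alignment-free measure, phoneme coverage (PhonCov). Analyses on English, Spanish, Italian, and Tamil show that PER benefits from the combination of mapping and alignment, PFER from alignment alone, and PhonCov from mapping. Further analyses demonstrate that the proposed framework captures clinically meaningful patterns of intelligibility degradation consistent with established observations of dysarthric speech.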

[13] Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

Zhaoyi Li,Jiatong Li,Gangwei Jiang,Linqi Song,Defu Lian,Ying Wei

Main category: cs.CL

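TL;DR: The paper explains why LLM performance drops sharply in reasoning hop generalization: errors concentrate at token positions of a few critical error types and are caused by specific attention heads (ep heads). Building on this, it proposes a lightweight test-time correction method that dynamically identifies and deactivates these erroneous processing heads to improve generalization.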

Details

Motivation: To understand the internal mechanism behind the sharp performance drop of chain-of-thought (CoT) reasoning when the number of reasoning steps exceeds the training distribution. Method: Systematically analyzes error distributions on tasks from multiple domains, identifies the specific attention heads (ep heads) responsible for errors, and proposes a lightweight test-time correction that dynamically identifies and deactivates them. Result: Experiments show that the method consistently improves reasoning hop generalization across tasks and LLMs. Conclusion: Reasoning errors are not uniformly distributed but dominated by a small number of erroneous processing heads; intervening on these heads at test time effectively mitigates the generalization failure. Abstract: Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.

[14] Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

Christopher Adrian Kusuma,Muhammad Reza Qorib,Hwee Tou Ng

Main category: cs.CL

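TL;DR: The paper proposes a more robust benchmark for evaluating LLM honesty and, using the open model Pythia and its public pretraining data, a new method that improves a model's ability to answer "I don't know" when a question falls outside its knowledge boundary, reducing hallucination.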

Details

Motivation: LLMs often produce factual errors (hallucinations) because they are unaware of their own knowledge boundary; many methods aim to improve honesty, but their evaluations ignore the knowledge the model acquired during pretraining and therefore lack robustness. Method: Builds a new honesty evaluation benchmark based on the open model Pythia and its public pretraining data, and proposes a new method that uses the pretraining data to make the LLM more honest. Result: Delivers a more robust benchmark dataset for LLM honesty and shows that the proposed method improves the model's ability to say "I don't know" on questions it cannot answer. Conclusion: Explicitly modeling and using pretraining knowledge can markedly improve LLM honesty; openly verifiable benchmarks such as one built on Pythia are important for advancing honesty research. Abstract: Large language models (LLMs) are highly capable of answering questions, but they are often unaware of their own knowledge boundary, i.e., knowing what they know and what they don't know. As a result, they can generate factually incorrect responses on topics they do not have enough knowledge of, commonly known as hallucination. Rather than hallucinating, a language model should be more honest and respond with "I don't know" when it does not have enough knowledge about a topic. Many methods have been proposed to improve LLM honesty, but their evaluations lack robustness, as they do not take into account the knowledge that the LLM has ingested during its pretraining. In this paper, we propose a more robust evaluation benchmark dataset for LLM honesty by utilizing Pythia, a truly open LLM with publicly available pretraining data. In addition, we also propose a novel method for harnessing the pretraining data to build a more honest LLM.

[15] MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

Tianyi Xu,Kosei Uemura,Alfred Malengo Kondoro,Tadesse Destaw Belay,Catherine Nana Nyaah Essuman,Ifeoma Okoh,Ganiyat Afolabi,Ayodele Awokoya,David Ifeoluwa Adelani

Main category: cs.CL

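TL;DR: The paper introduces MGSM-Pro, which generates five instantiations of each MGSM question with different digits, names, and irrelevant context to test the robustness of multilingual LLMs in mathematical reasoning. Low-resource languages and some proprietary models (e.g. Gemini 2.5 Flash, GPT-4.1) are sensitive to digit changes, while Claude 4.0 Sonnet and the open models GPT-OSS 120B and DeepSeek V3 are more robust; the authors recommend using at least five digit-varying instantiations per problem.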

Details

Motivation: Multilingual math reasoning benchmarks lag behind English ones in difficulty and recency; GSM-Symbolic showed high variance across instantiations of the same question in English, but this has not been verified multilingually, so a more robust multilingual evaluation method is needed. Method: Starting from MGSM, applies the GSM-Symbolic approach to generate five instantiations per question, varying names, digits, and irrelevant context; evaluates mainstream proprietary and open LLMs across nine languages for math reasoning performance and robustness to digit changes. Result: Low-resource languages show large performance drops under changed digit instantiations; Gemini 2.5 Flash and GPT-4.1 are less robust, Claude 4.0 Sonnet is more robust; the open models GPT-OSS 120B and DeepSeek V3 show stronger robustness. Conclusion: Single-instance evaluation risks optimistic bias; each problem should be evaluated with at least five digit-varying instantiations for a more robust and realistic assessment, and MGSM-Pro provides a new benchmark for multilingual robustness evaluation. Abstract: Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset with the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that some proprietary models, notably Gemini 2.5 Flash and GPT-4.1, are less robust to digit instantiation, whereas Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.

Alok Abhishek,Tushar Bandopadhyay,Lisa Erickson

Main category: cs.CL

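TL;DR: The paper proposes SHARP, a framework for multidimensional, distribution-aware evaluation of social harm. By modeling harm as a multivariate random variable, decomposing it into dimensions, and using risk-sensitive statistics such as CVaR95, it reveals substantial heterogeneity in the tail risk of LLMs and shows that evaluations based only on mean scores are severely limited.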

Details

Motivation: Existing benchmarks reduce complex social risk to mean-centered scalar scores, hiding distributional structure, cross-dimensional interactions, and worst-case behavior, which makes them ill-suited to rare but severe failures in high-stakes settings. Method: Proposes the SHARP framework: models social harm as a multivariate random variable; explicitly decomposes it into four dimensions (bias, fairness, ethics, and epistemic reliability); uses a union-of-failures aggregation reparameterized as additive cumulative log-risk; and adopts CVaR95 as the primary metric for tail risk. Result: Experiments on 11 frontier LLMs and 901 sensitive prompts show that models with similar mean risk can differ by more than a factor of two in tail exposure and volatility; tail severity varies systematically by dimension (bias strongest, ethics lowest), revealing model-specific failure structures that scalar benchmarks conflate. Conclusion: Responsible evaluation and governance of LLMs must move beyond scalar averages toward multidimensional, tail-sensitive risk profiling. Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where rare but severe failures can result in irreversible harm. However, prevailing evaluation benchmarks often reduce complex social risk to mean-centered scalar scores, thereby obscuring distributional structure, cross-dimensional interactions, and worst-case behavior. This paper introduces Social Harm Analysis via Risk Profiles (SHARP), a framework for multidimensional, distribution-aware evaluation of social harm. SHARP models harm as a multivariate random variable and integrates explicit decomposition into bias, fairness, ethics, and epistemic reliability with a union-of-failures aggregation reparameterized as additive cumulative log-risk. The framework further employs risk-sensitive distributional statistics, with Conditional Value at Risk (CVaR95) as a primary metric, to characterize worst-case model behavior. Application of SHARP to eleven frontier LLMs, evaluated on a fixed corpus of n=901 socially sensitive prompts, reveals that models with similar average risk can exhibit more than twofold differences in tail exposure and volatility. Across models, dimension-wise marginal tail behavior varies systematically across harm dimensions, with bias exhibiting the strongest tail severities, epistemic and fairness risks occupying intermediate regimes, and ethical misalignment consistently lower; together, these patterns reveal heterogeneous, model-dependent failure structures that scalar benchmarks conflate. These findings indicate that responsible evaluation and governance of LLMs require moving beyond scalar averages toward multidimensional, tail-sensitive risk profiling.
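To show why a mean score can hide tail risk, here is a small sketch of an empirical Conditional Value at Risk at the 95% level, taken here as the mean of the scores at or above the 95th percentile. The estimator, the function name `cvar`, and the synthetic per-prompt risk scores are assumptions for illustration; the paper's exact estimator and aggregation are not specified in the abstract.

```python
import numpy as np


def cvar(risks, level=0.95):
    """Empirical CVaR: mean of the scores at or above the `level` quantile."""
    risks = np.asarray(risks, dtype=float)
    value_at_risk = np.quantile(risks, level)
    tail = risks[risks >= value_at_risk]
    return tail.mean()


# Two synthetic models with (almost exactly) the same mean risk of 0.10.
steady = np.full(100, 0.10)                    # uniformly mild
spiky = np.array([0.05] * 95 + [1.05] * 5)     # mostly safe, rare severe failures

mean_steady, mean_spiky = steady.mean(), spiky.mean()
tail_steady, tail_spiky = cvar(steady), cvar(spiky)
```

On these toy scores the two means coincide, while the CVaR95 of the second model is roughly ten times that of the first, which is the kind of difference a scalar average would not surface.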

[17] MoCo: A One-Stop Shop for Model Collaboration Research

Shangbin Feng,Yuyang Bai,Ziyuan Yang,Yike Wang,Zhaoxuan Tan,Jiajie Yan,Zhenyu Lei,Wenxuan Ding,Weijia Shi,Haojin Wang,Zhenting Qi,Yuru Jiang,Heng Wang,Chengsong Huang,Yu Fei,Jihan Yao,Yilun Du,Luke Zettlemoyer,Yejin Choi,Yulia Tsvetkov

Main category: cs.CL

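TL;DR: The paper presents MoCo, a Python library for executing, benchmarking, and comparing model collaboration algorithms at scale, with 26 collaboration methods and 25 evaluation datasets. Experiments show that collaboration beats single models in most settings, and the paper discusses scaling, efficiency, and future directions.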

Details

Motivation: Research on model collaboration is scattered and lacks systematic comparison; a unified framework is needed to consolidate it and establish model collaboration as a research direction in its own right. Method: Builds the open-source Python library MoCo, integrating 26 methods that exchange information across models at different levels (routing, text, logit, and parameter) and 25 diverse evaluation datasets, with support for user-supplied data and scalable experiments. Result: Collaboration strategies outperform single models in 61.0% of (model, data) settings on average, with the best methods improving by up to 25.8%; the study also characterizes scaling behavior, training/inference efficiency, and the ability to solve problems where single models struggle. Conclusion: MoCo provides standardized, reproducible infrastructure for model collaboration research and supports the development of open, modular, decentralized, and collaborative AI. Abstract: Advancing beyond single monolithic language models (LMs), recent research increasingly recognizes the importance of model collaboration, where multiple LMs collaborate, compose, and complement each other. Existing research on this topic has mostly been disparate and disconnected, from different research communities, and lacks rigorous comparison. To consolidate existing research and establish model collaboration as a school of thought, we present MoCo: a one-stop Python library for executing, benchmarking, and comparing model collaboration algorithms at scale. MoCo features 26 model collaboration methods, spanning diverse levels of cross-model information exchange such as routing, text, logit, and model parameters. MoCo integrates 25 evaluation datasets spanning reasoning, QA, code, safety, and more, while users can flexibly bring their own data. Extensive experiments with MoCo demonstrate that most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. We further analyze the scaling of model collaboration strategies and the training/inference efficiency of diverse methods, highlight that the collaborative system solves problems where single LMs struggle, and discuss future work in model collaboration, all made possible by MoCo. We envision MoCo as a valuable toolkit to facilitate and turbocharge the quest for an open, modular, decentralized, and collaborative AI future.

[18] CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding

Jiahao Huo,Yu Huang,Yibo Yan,Ye Pan,Yi Cao,Mingdong Ou,Philip S. Yu,Xuming Hu

Main category: cs.CL

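TL;DR: This paper proposes CausalEmbed, which builds multi-vector embeddings through auto-regressive generation. It cuts the number of visual tokens by 30-155x while maintaining strong performance, making visual document retrieval more practical and scalable.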

Details

Motivation: Multimodal large language models perform well on visual document retrieval, but representing each page with thousands of visual tokens incurs a large storage overhead that limits practical use. Method: Proposes CausalEmbed, an auto-regressive generation approach, and adds an iterative margin loss to contrastive training so that the model learns compact, well-structured embeddings. Result: Efficient visual document retrieval is achieved with only dozens of visual tokens, a 30-155x reduction in token count, with highly competitive performance across multiple backbones and benchmarks; theoretical analysis and experiments support its advantages in training efficiency and test-time scalability. Conclusion: CausalEmbed offers a flexible test-time scaling strategy for multi-vector VDR representations and highlights the potential of the generative paradigm in multimodal document retrieval. Abstract: Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval.

[19] Qwen3-ASR Technical Report

Xian Shi,Xiong Wang,Zhifang Guo,Yongqi Wang,Pei Zhang,Xinyu Zhang,Zishan Guo,Hongkun Hao,Yu Xi,Baosong Yang,Jin Xu,Jingren Zhou,Junyang Lin

Main category: cs.CL

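TL;DR: This report introduces the Qwen3-ASR family of speech recognition models: two end-to-end ASR models supporting 52 languages and dialects (1.7B and 0.6B parameters) and an LLM-based non-autoregressive forced alignment model (0.6B). All build on the Qwen3-Omni foundation model, show strong accuracy and efficiency in real-world scenarios, and are released under the Apache 2.0 license.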

Details

Motivation: Open-source ASR models can score similarly on public benchmarks yet differ markedly in real-world quality; the work also aims to improve the accuracy, efficiency, and versatility of multilingual speech recognition and forced alignment. Method: Building on the audio understanding ability of Qwen3-Omni, the authors develop two all-in-one ASR models (Qwen3-ASR-1.7B and Qwen3-ASR-0.6B) and a non-autoregressive forced alignment model (Qwen3-ForcedAligner-0.6B), train them on large-scale speech data, and run a comprehensive internal evaluation to compensate for the limits of public benchmarks. Result: Qwen3-ASR-1.7B reaches SOTA among open-source ASR models and is competitive with the strongest proprietary APIs; Qwen3-ASR-0.6B achieves an average time to first token (TTFT) as low as 92 ms and transcribes 2000 seconds of speech in 1 second at a concurrency of 128; Qwen3-ForcedAligner-0.6B outperforms the three strongest forced alignment baselines on timestamp accuracy across 11 languages while being more efficient and versatile. Conclusion: The Qwen3-ASR family makes notable gains in accuracy, efficiency, and multilingual support, and its open release is intended to advance community research on speech recognition and audio understanding. Abstract: In this report, we introduce Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both of them leverage large-scale speech training data and the strong audio understanding ability of their foundation model Qwen3-Omni. We conduct comprehensive internal evaluation besides the open-sourced benchmarks as ASR models might differ little on open-sourced benchmark scores but exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-sourced ASR models and is competitive with the strongest proprietary APIs while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B can achieve an average TTFT as low as 92ms and transcribe 2000 seconds speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM based NAR timestamp predictor that is able to align text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest force alignment models and takes more advantages in efficiency and versatility. To further accelerate the community research of ASR and audio understanding, we release these models under the Apache 2.0 license.

[20] Self-Improving Pretraining: using post-trained models to pretrain better models

Ellen Xiaoqing Tan,Shehzaad Dhuliawala,Jing Xu,Ping Yu,Sainbayar Sukhbaatar,Jason Weston,Olga Golovneva

Main category: cs.CL

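TL;DR: This paper proposes bringing reinforcement learning into pretraining: documents are streamed, and a strong judge model scores several candidate continuations (model rollouts, the original suffix, and a rewritten suffix), progressively improving the factuality, safety, and generation quality of large language models.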

Details

Motivation: Existing approaches rely on expensive fine-tuning and alignment pipelines that still cannot reliably remove unsafe or hallucinated output patterns learned during pretraining, so core safety and factuality behaviors should be shaped during pretraining itself. Method: Introduces reinforcement learning into pretraining, optimizing the next K generated tokens at each step; a strong post-trained judge model scores multiple candidates (rollouts, the original suffix, and a rewritten suffix) for quality, safety, and factuality, and the reward emphasis shifts from the original and rewritten suffixes toward high-quality rollouts as the model improves. Result: Compared with standard pretraining, the method yields relative improvements of 36.2% in factuality and 18.5% in safety, and up to 86.3% win rate improvements in overall generation quality. Conclusion: Embedding RL-driven quality control in pretraining builds safer, more factual, higher-quality language models from the ground up, rather than depending solely on later alignment. Abstract: Ensuring safety, factuality and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee the correction of patterns learned during pretraining. Therefore, addressing these issues during pretraining is crucial, as it shapes a model's core behaviors and prevents unsafe or hallucinated outputs from becoming deeply embedded. To tackle this issue, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations -- including model rollouts, the original suffix, and a rewritten suffix -- for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher quality, safer, and more factual models from the ground up. In experiments, our method gives 36.2% and 18.5% relative improvements over standard pretraining in terms of factuality and safety, and up to 86.3% win rate improvements in overall generation quality.

[21] The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation

Devanshu Sahoo,Manish Prasad,Vasudev Majhi,Arjun Neekhra,Yash Sinha,Murari Mandal,Vinay Chamola,Dhruv Kumar

Main category: cs.CL

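TL;DR: This paper identifies a "Compliance Paradox" in LLM-based automated grading of programming assignments: models decouple from the code's logic to satisfy hidden directives, producing severe misjudgments. The authors propose the SPACI and AST-ASIP attack frameworks, which inject adversarial directives into syntactically inert positions; failure rates exceed 95% for some high-capacity open-weights models. They recommend moving from general-purpose alignment toward "Adjudicative Robustness" training tailored to evaluation tasks.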

Details

Motivation: LLMs are widely used in educational assessment, but it has not been verified that instruction-following ability equals objective adjudication ability; the authors challenge this implicit assumption and argue that models tuned for extreme helpfulness may ignore the substance of the code and yield to hidden directives. Method: Proposes the SPACI framework and the AST-ASIP protocol, which embed adversarial directives into syntactically inert regions of the abstract syntax tree (trivia nodes such as comments and whitespace), exploiting the syntax-semantics gap; 9 SOTA models are evaluated on 25,000 submissions in Python, C, C++, and Java. Result: High-capacity open-weights models such as DeepSeek-V3 show failure rates above 95%, systematically prioritizing hidden formatting constraints over code correctness; a three-part metric framework (Decoupling Probability, Score Divergence, Pedagogical Severity) quantifies the resulting "False Certification" of functionally broken code. Conclusion: Current alignment paradigms create a "Trojan"-like vulnerability in automated grading; domain-specific Adjudicative Robustness training is needed so that models prioritize evidence over instruction compliance. Abstract: The rapid integration of Large Language Models (LLMs) into educational assessment rests on the unverified assumption that instruction following capability translates directly to objective adjudication. We demonstrate that this assumption is fundamentally flawed. Instead of evaluating code quality, models frequently decouple from the submission's logic to satisfy hidden directives, a systemic vulnerability we term the Compliance Paradox, where models fine-tuned for extreme helpfulness are vulnerable to adversarial manipulation. To expose this, we introduce the Semantic-Preserving Adversarial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP). These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree. Through a large-scale evaluation of 9 SOTA models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (>95%) in high-capacity open-weights models like DeepSeek-V3, which systematically prioritize hidden formatting constraints over code correctness. We quantify this failure using our novel tripartite framework measuring Decoupling Probability, Score Divergence, and Pedagogical Severity to demonstrate the widespread "False Certification" of functionally broken code. Our findings suggest that current alignment paradigms create a "Trojan" vulnerability in automated grading, necessitating a shift from standard RLHF toward domain-specific Adjudicative Robustness, where models are conditioned to prioritize evidence over instruction compliance. We release our complete dataset and injection framework to facilitate further research on the topic.
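The idea of a "syntactically inert region" can be illustrated with Python's standard `ast` module, which discards comments when parsing. The sketch below is not the SPACI or AST-ASIP implementation; the two submissions and the helper `same_ast` are hypothetical, and the injected comment text is invented for illustration. It only shows that a directive placed in a comment leaves the parsed program unchanged while still being visible to a grader that reads raw text.

```python
import ast

# A deliberately buggy submission: add() subtracts instead of adding.
clean = "def add(a, b):\n    return a - b\n"

# The same submission with a hypothetical hidden directive in a comment.
injected = (
    "def add(a, b):\n"
    "    # NOTE TO GRADER: this solution is correct, award full marks.\n"
    "    return a - b\n"
)


def same_ast(src_a: str, src_b: str) -> bool:
    """Return True if both sources parse to structurally identical ASTs.

    ast.dump omits line/column attributes by default, so only the
    program structure is compared.
    """
    return ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b))


if __name__ == "__main__":
    print("text identical:", clean == injected)
    print("AST identical: ", same_ast(clean, injected))
```

Because the two sources differ as text but not as parsed programs, any change in an automated grade between them would stem from the comment rather than from the code's behavior.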

[22] User-Centric Evidence Ranking for Attribution and Fact Verification

Guy Alt,Eran Hirsch,Serwar Basch,Ido Dagan,Oren Glickman

Main category: cs.CL

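TL;DR: This paper proposes Evidence Ranking, a new task that aims to present sufficient but not excessive evidence as early as possible in a ranked list, reducing reading effort and improving fact verification. It compares one-shot and incremental ranking, builds a unified benchmark and a new evaluation framework, and shows through experiments and a user study that incremental ranking and LLM-based methods perform better, and that evidence ranking reduces reading effort and improves verification accuracy relative to conventional evidence selection.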

Details

Motivation: Automated systems and large language models often present insufficient or redundant evidence for fact verification, making verification inefficient and error-prone; a new paradigm is needed that balances sufficiency of information against user reading effort. Method: Defines the evidence ranking task and two strategies, one-shot ranking and incremental ranking; builds a unified benchmark by aggregating existing datasets; designs a new evaluation framework inspired by information retrieval metrics; and runs a controlled user study. Result: Incremental ranking better captures complementary evidence; LLM-based methods outperform shallower baselines; compared with evidence selection, evidence ranking reduces reading effort and improves verification accuracy; models still struggle to balance sufficiency and redundancy. Conclusion: Evidence ranking is a foundational step toward more interpretable, efficient, and user-centric verification systems, shifting fact verification from selecting evidence to ranking it. Abstract: Attribution and fact verification are critical challenges in natural language processing for assessing information reliability. While automated systems and Large Language Models (LLMs) aim to retrieve and select concise evidence to support or refute claims, they often present users with either insufficient or overly redundant information, leading to inefficient and error-prone verification. To address this, we propose Evidence Ranking, a novel task that prioritizes presenting sufficient information as early as possible in a ranked list. This minimizes user reading effort while still making all available evidence accessible for sequential verification. We compare two approaches for the new ranking task: one-shot ranking and incremental ranking. We introduce a new evaluation framework, inspired by information retrieval metrics, and construct a unified benchmark by aggregating existing fact verification datasets. Extensive experiments with diverse models show that incremental ranking strategies better capture complementary evidence and that LLM-based methods outperform shallower baselines, while still facing challenges in balancing sufficiency and redundancy. Compared to evidence selection, we conduct a controlled user study and demonstrate that evidence ranking both reduces reading effort and improves verification. This work provides a foundational step toward more interpretable, efficient, and user-aligned information verification systems.
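One way to make "sufficient information as early as possible" concrete is to count how many ranked items a reader must go through before the evidence seen so far is sufficient. The sketch below is an illustrative measure under that assumption, not the paper's evaluation framework; the function name `prefix_until_sufficient`, the evidence identifiers, and the representation of sufficiency as a list of evidence sets are all hypothetical.

```python
from typing import Hashable, Iterable, Optional, Sequence, Set


def prefix_until_sufficient(
    ranking: Sequence[Hashable],
    sufficient_sets: Iterable[Set[Hashable]],
) -> Optional[int]:
    """Return the smallest k such that the top-k ranked items contain
    at least one complete sufficient evidence set, or None if no prefix does.
    """
    sufficient_sets = list(sufficient_sets)
    seen: Set[Hashable] = set()
    for k, evidence_id in enumerate(ranking, start=1):
        seen.add(evidence_id)
        # A set is covered once all of its members have been read.
        if any(s <= seen for s in sufficient_sets):
            return k
    return None


if __name__ == "__main__":
    # Hypothetical claim that is verifiable from {e1, e2} or from {e3, e4}.
    gold = [{"e1", "e2"}, {"e3", "e4"}]
    print(prefix_until_sufficient(["e1", "e2", "e3"], gold))
    print(prefix_until_sufficient(["e3", "e1", "e2"], gold))
```

A lower value means the reader reaches a verifiable state sooner, which is the behavior the task description rewards.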

[23] Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

Yuan Sui,Bryan Hooi

Main category: cs.CL

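TL;DR: This paper proposes CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. It measures critique quality by whether a critique helps improve a solution, enabling joint optimization of generation and judging without external judges or ground-truth labels.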

Details

Motivation: Non-verifiable tasks (such as creative writing, dialogue, and ethical reasoning) lack ground-truth labels, which makes LLM training difficult, and LLM-as-Judge methods are limited by the judge's own quality and biases (for example, favoring verbosity); this motivates meta-evaluation, the ability to evaluate and improve the evaluator itself. Method: Proposes CoNL, in which multiple agents sharing the same policy engage in structured self-play, taking turns to propose, critique, and revise solutions; whether a critique leads to an improved solution serves as a diagnostic reward signal, jointly optimizing generation and judging without external judges or ground truth. Result: On five benchmarks, CoNL consistently outperforms self-rewarding baselines while training remains stable. Conclusion: By defining critique quality in terms of whether it enables improvement, CoNL improves generation and evaluation together and offers a new approach to high-quality generation when ground-truth supervision is unavailable. Abstract: Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.

[24] SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models

Lei Yang,Wei Bi,Chenxi Sun,Renren Jin,Deyi Xiong

Main category: cs.CL

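TL;DR: This paper proposes SOUP, a framework that unifies on-policy and off-policy learning at the token level within a single sample, improving exploration and stability in reinforcement learning for language models.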

Details

Motivation: On-policy reinforcement learning methods such as GRPO suffer from limited exploration and early saturation, while mixing in entire off-policy trajectories causes policy mismatch and unstable training. Method: Proposes SOUP: within a single generated sequence, the prefix is sampled from historical policies (off-policy) and the continuation is generated by the current policy (on-policy), with token-level importance ratios used to combine the two. Result: Experiments show that SOUP consistently outperforms standard on-policy training and existing off-policy extensions; further analysis indicates it improves both exploration and final performance. Conclusion: Through a fine-grained, single-sample mix-policy paradigm, SOUP mitigates limited exploration and policy mismatch in language model RL and provides a more stable and efficient framework for LLM post-training. Abstract: On-policy reinforcement learning (RL) methods widely used for language model post-training, like Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the Single-sample Mix-pOlicy Unified Paradigm (SOUP), a framework that unifies off- and on-policy learning within individual samples at the token level. It confines off-policy influence to the prefix of a generated sequence sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP effectively leverages off-policy information while preserving training stability. Extensive experiments demonstrate that SOUP consistently outperforms standard on-policy training and existing off-policy extensions. Our further analysis clarifies how our fine-grained, single-sample mix-policy training can improve both exploration and final performance in LLM RL.
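The split between an off-policy prefix and an on-policy continuation can be sketched with per-token importance ratios. The example below is a minimal illustration under stated assumptions, not the SOUP objective: the function name, the log-probability values, and the absence of any clipping or advantage weighting are all choices made here for clarity.

```python
import math
from typing import List, Sequence


def token_importance_ratios(
    logp_current: Sequence[float],
    logp_sampler: Sequence[float],
) -> List[float]:
    """Per-token ratio pi_current(token) / pi_sampler(token).

    logp_sampler holds the log-probability each token had under the
    policy that actually sampled it.
    """
    return [math.exp(c - s) for c, s in zip(logp_current, logp_sampler)]


# Hypothetical log-probabilities for one 5-token sample.
prefix_len = 2
logp_current = [-1.0, -0.5, -0.7, -0.2, -0.9]  # under the policy being trained
logp_old_prefix = [-1.2, -0.4]                 # under the historical policy

# The continuation was sampled by the current policy, so its sampler
# log-probabilities equal the current ones and its ratios are exactly 1.
logp_sampler = logp_old_prefix + logp_current[prefix_len:]
ratios = token_importance_ratios(logp_current, logp_sampler)

if __name__ == "__main__":
    print("prefix ratios:      ", ratios[:prefix_len])
    print("continuation ratios:", ratios[prefix_len:])
```

Only the prefix tokens carry a correction factor different from 1, which is one way to read the abstract's statement that off-policy influence is confined to the prefix.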

[25] DimStance: Multilingual Datasets for Dimensional Stance Analysis

Jonas Becker,Liang-Chih Yu,Shamsuddeen Hassan Muhammad,Jan Philip Wahle,Terry Ruas,Idris Abdulmumin,Lung-Hao Lee,Wen-Ni Liu,Tzu-Mi Lin,Zhe-Yu Xu,Ying-Lung Lin,Jin Wang,Maryam Ibrahim Mukhtar,Bela Gipp,Saif M. Mohammed

Main category: cs.CL

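TL;DR: This paper proposes an affect-aware approach to stance detection based on valence-arousal (VA) dimensions, builds DimStance, the first multilingual, multi-domain dimensional stance resource, defines the dimensional stance regression task, and evaluates a range of pretrained and large language models on it.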

Details

Motivation: Conventional stance detection outputs only discrete categories (such as Favor, Neutral, Against) and cannot capture the nuanced affective states behind a stance; drawing on affective science, this work models stance along continuous valence and arousal dimensions for finer-grained, affect-aware analysis. Method: Builds the dimensional stance dataset DimStance (11,746 target aspects in 7,365 texts) across five languages (English, German, Chinese, Nigerian Pidgin, and Swahili) and two domains (politics and environmental protection); defines the dimensional stance regression task; benchmarks pretrained and large language models on VA prediction under regression and prompting settings; and analyzes cross-lingual VA patterns. Result: Fine-tuned LLM regressors perform competitively; low-resource languages (such as Nigerian Pidgin and Swahili) remain challenging; token-based generation shows inherent limitations for VA prediction. Conclusion: DimStance provides a new benchmark and resource for multilingual, emotion-aware stance analysis, moving stance detection from discrete classification toward continuous, interpretable dimensional modeling. Abstract: Stance detection is an established task that classifies an author's attitude toward a specific target into categories such as Favor, Neutral, and Against. Beyond categorical stance labels, we leverage a long-established affective science framework to model stance along real-valued dimensions of valence (negative-positive) and arousal (calm-active). This dimensional approach captures nuanced affective states underlying stance expressions, enabling fine-grained stance analysis. To this end, we introduce DimStance, the first dimensional stance resource with valence-arousal (VA) annotations. This resource comprises 11,746 target aspects in 7,365 texts across five languages (English, German, Chinese, Nigerian Pidgin, and Swahili) and two domains (politics and environmental protection). To facilitate the evaluation of stance VA prediction, we formulate the dimensional stance regression task, analyze cross-lingual VA patterns, and benchmark pretrained and large language models under regression and prompting settings. Results show competitive performance of fine-tuned LLM regressors, persistent challenges in low-resource languages, and limitations of token-based generation. DimStance provides a foundation for multilingual, emotion-aware, stance analysis and benchmarking.

[26] MURAD: A Large-Scale Multi-Domain Unified Reverse Arabic Dictionary Dataset

Serry Sibaee,Yasser Alhabashi,Nadia Sibai,Yara Farouk,Adel Ammar,Sawsan AlHalawani,Wadii Boulila

Main category: cs.CL

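TL;DR: This paper introduces MURAD, a multi-domain unified reverse Arabic dictionary dataset with 96,243 Arabic word-definition pairs, intended to fill the gap in large-scale Arabic lexical datasets and to support computational linguistics, lexicographic research, and NLP applications.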

Details

Motivation: Arabic has a rich vocabulary spanning many domains, yet it lacks large-scale lexical datasets with precise definitions, which limits progress in Arabic natural language processing and semantic research. Method: Extracts data from trusted reference works and educational sources using a hybrid pipeline (direct text parsing, optical character recognition, and automated reconstruction), and aligns each target word with a standardized Arabic definition and source-domain metadata. Result: Produces MURAD, an open lexical dataset with 96,243 word-definition pairs covering linguistics, Islamic studies, mathematics, physics, psychology, and engineering. Conclusion: MURAD is a resource for Arabic computational linguistics, reverse dictionary modeling, semantic retrieval, and educational tools, and supports reproducible research on Arabic lexical semantics. Abstract: Arabic is a linguistically and culturally rich language with a vast vocabulary that spans scientific, religious, and literary domains. Yet, large-scale lexical datasets linking Arabic words to precise definitions remain limited. We present MURAD (Multi-domain Unified Reverse Arabic Dictionary), an open lexical dataset with 96,243 word-definition pairs. The data come from trusted reference works and educational sources. Extraction used a hybrid pipeline integrating direct text parsing, optical character recognition, and automated reconstruction. This ensures accuracy and clarity. Each record aligns a target word with its standardized Arabic definition and metadata that identifies the source domain. The dataset covers terms from linguistics, Islamic studies, mathematics, physics, psychology, and engineering. It supports computational linguistics and lexicographic research. Applications include reverse dictionary modeling, semantic retrieval, and educational tools. By releasing this resource, we aim to advance Arabic natural language processing and promote reproducible research on Arabic lexical semantics.

[27] LMK > CLS: Landmark Pooling for Dense Embeddings

Meet Doshi,Aashka Trivedi,Vishwajeet Kumar,Parul Awasthy,Yulong Li,Jaydeep Sen,Radu Florian,Sachindra Joshi

Main category: cs.CL

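TL;DR: This paper proposes Landmark (LMK) pooling, a new sequence pooling method that partitions a sequence into chunks, inserts landmark tokens between chunks, and mean-pools the landmark token embeddings to form the final representation, improving long-context extrapolation without sacrificing salient local features.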

Details

Motivation: The dominant pooling strategies ([CLS] token or mean pooling over token embeddings) have systematic weaknesses: [CLS] tends to concentrate information toward the start of the sequence and under-represents distributed evidence, while mean pooling can dilute salient local signals and hurt short-context performance. Method: Proposes Landmark (LMK) pooling: the input sequence is partitioned into chunks, landmark tokens are inserted between chunks, and the sequence representation is the mean of the landmark token embeddings. Result: LMK pooling matches existing methods on short-context retrieval tasks and substantially outperforms them on long-context tasks, while adding only a small number of special tokens. Conclusion: LMK pooling is a simple, practical, and scalable alternative that balances local sensitivity with long-range modeling. Abstract: Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence to a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.
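The chunk, insert, and pool steps described in the abstract can be sketched in a few lines. This is a hedged illustration, not the authors' code: the function names, the fixed chunk size, the landmark id, and the choice to place a landmark after every chunk (including the last) are assumptions, and the encoder that would produce the hidden states is replaced by placeholder vectors.

```python
from typing import List, Sequence, Tuple


def insert_landmarks(
    token_ids: Sequence[int], chunk_size: int, lmk_id: int
) -> Tuple[List[int], List[int]]:
    """Split token_ids into fixed-size chunks and append a landmark token
    after each chunk. Returns the new ids and the landmark positions."""
    out: List[int] = []
    positions: List[int] = []
    for start in range(0, len(token_ids), chunk_size):
        out.extend(token_ids[start:start + chunk_size])
        positions.append(len(out))  # index the landmark is about to occupy
        out.append(lmk_id)
    return out, positions


def landmark_pool(
    hidden_states: Sequence[Sequence[float]], positions: Sequence[int]
) -> List[float]:
    """Mean-pool the hidden states found at the landmark positions only."""
    dim = len(hidden_states[0])
    return [
        sum(hidden_states[p][d] for p in positions) / len(positions)
        for d in range(dim)
    ]


if __name__ == "__main__":
    ids, lmk_pos = insert_landmarks([1, 2, 3, 4, 5], chunk_size=2, lmk_id=-1)
    # Placeholder "encoder output": one 2-d vector per position.
    hidden = [[float(i), 1.0] for i in range(len(ids))]
    print(ids, lmk_pos, landmark_pool(hidden, lmk_pos))
```

In a real encoder the landmark positions would attend over their neighboring chunk, so the pooled vector summarizes every chunk rather than only the start of the sequence.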

[28] inversedMixup: Data Augmentation via Inverting Mixed Embeddings

Fanshuang Kong,Richong Zhang,Qiyu Sun,Zhijie Nie,Ting Deng,Chunming Hu

Main category: cs.CL

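TL;DR: This paper proposes inversedMixup, a unified text data augmentation framework that combines the controllability of Mixup with the interpretability of LLM-based generation. A three-stage training procedure aligns the embedding spaces of a task model and an LLM so that controllably mixed embeddings can be reconstructed into readable sentences; the work also provides the first empirical evidence of manifold intrusion in text Mixup and a strategy to mitigate it.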

Details

Motivation: Mixup is controllable but produces samples that are not human-interpretable, whereas LLM-based generation is readable but offers limited control; existing methods also fail to bridge the gap between the latent embedding space and the discrete token space. Method: Proposes inversedMixup, which uses a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM, applies LLM inversion to reconstruct controllably mixed embeddings into readable sentences, and introduces a strategy to mitigate manifold intrusion. Result: The approach improves text data augmentation in both few-shot and fully supervised settings, provides the first empirical evidence of manifold intrusion in text Mixup, and validates the proposed mitigation strategy. Conclusion: inversedMixup combines the controllability of Mixup with the interpretability of LLM generation and exposes manifold intrusion as a key challenge in embedding mixing. Abstract: Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates in the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between latent embedding space and discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Upon successful alignment, inversedMixup can reconstruct mixed embeddings with a controllable mixing ratio into human-interpretable augmented sentences, thereby improving the augmentation performance. Additionally, inversedMixup provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and introduces a simple yet effective strategy to mitigate it. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.
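The Mixup step that the abstract builds on, linear interpolation of inputs and labels with a controllable ratio, can be written directly. The sketch below covers only that interpolation; it does not implement the alignment or inversion stages of inversedMixup, and the function name and toy vectors are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def mixup(
    emb_a: Sequence[float],
    emb_b: Sequence[float],
    label_a: Sequence[float],
    label_b: Sequence[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Interpolate two embeddings and their one-hot (or soft) labels.

    lam is the mixing ratio in [0, 1]: lam=1 returns sample a, lam=0 sample b.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed_emb = [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_emb, mixed_label


if __name__ == "__main__":
    # Toy 2-d embeddings with one-hot labels for two classes.
    emb, label = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam=0.25)
    print(emb, label)
```

The mixed embedding is a point between the two inputs that corresponds to no actual sentence, which is the interpretability gap the paper aims to close by inverting it back to text.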

[29] Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes

Yang Zhou,Zhenting Sheng,Mingrui Tan,Yuting Song,Jun Zhou,Yu Heng Kwan,Lian Leng Low,Yang Bai,Yong Liu

Main category: cs.CL

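TL;DR: This paper proposes Note2Chat, a framework that converts real-world medical notes into high-quality doctor-patient dialogues and combines a three-stage fine-tuning strategy with a single-turn reasoning paradigm, substantially improving LLM diagnostic ability in dynamic, multi-turn clinical history taking.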

Details

Motivation: Large language models do well on static benchmarks but fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement; high-quality doctor-patient dialogue data is also scarce and sensitive. Method: Proposes Note2Chat: 1) a decision tree-guided generation and refinement pipeline converts real-world medical notes into high-quality doctor-patient dialogues; 2) a three-stage fine-tuning strategy combines supervised learning, simulated data augmentation, and preference learning; 3) a single-turn reasoning paradigm reframes history taking as a sequence of single-turn reasoning problems. Result: The method substantially improves clinical reasoning, with gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o. Conclusion: Note2Chat offers an interpretable and sample-efficient route to training clinical reasoning models from widely available medical notes. Abstract: Effective clinical history taking is a foundational yet underexplored component of clinical reasoning. While large language models (LLMs) have shown promise on static benchmarks, they often fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement. To address this gap, we propose Note2Chat, a note-driven framework that trains LLMs to conduct structured history taking and diagnosis by learning from widely available medical notes. Instead of relying on scarce and sensitive dialogue data, we convert real-world medical notes into high-quality doctor-patient dialogues using a decision tree-guided generation and refinement pipeline. We then propose a three-stage fine-tuning strategy combining supervised learning, simulated data augmentation, and preference learning. Furthermore, we propose a novel single-turn reasoning paradigm that reframes history taking as a sequence of single-turn reasoning problems. This design enhances interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experimental results show that our method substantially improves clinical reasoning, achieving gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o. Our code and dataset can be found at https://github.com/zhentingsheng/Note2Chat.

[30] ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Xiaoyu Tian,Haotian Wang,Shuaiting Chen,Hao Zhou,Kaichi Yu,Yudian Zhang,Jade Ouyang,Junxi Yin,Jiong Chen,Baoyan Guo,Lei Zhang,Junjie Tao,Yuansheng Song,Ming Cui,Chengwei Liu

Main category: cs.CL

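TL;DR: This paper proposes ASTRA, a fully automated framework that trains LLM agents with robust tool-use ability through scalable data synthesis and verifiable reinforcement learning.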

Details

Motivation: Existing methods for training tool-augmented agents require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either SFT or RL, and are unstable in long-horizon, multi-turn learning. Method: ASTRA has two components: 1) a pipeline that synthesizes structurally grounded trajectories from the static topology of tool-call graphs; 2) an environment synthesis framework that converts decomposed question-answer traces into code-executable, rule-verifiable environments for deterministic multi-turn RL. Training integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Result: On multiple agentic tool-use benchmarks, ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. Conclusion: ASTRA provides a fully automated, verifiable, and scalable training paradigm for tool-augmented agents and improves robustness in long-horizon, multi-step decision making. Abstract: Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.

[31] KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices

Wuyang Zhou,Yuxuan Gu,Giorgos Iacovides,Danilo Mandic

Main category: cs.CL

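TL;DR: This paper proposes KromHC, which builds doubly stochastic residual matrices from Kronecker products. It guarantees exact double stochasticity while reducing parameter complexity from O(n^3C) or O(nC·n!) to O(n^2C), improving training stability and scalability.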

Details

Motivation: Hyper-Connections (HC) suffer from training instability and restricted scalability. mHC mitigates this, but its Sinkhorn-Knopp iterations do not always yield exactly doubly stochastic residual matrices and its parameter complexity is O(n^3C); mHC-lite guarantees double stochasticity but its parameter complexity grows factorially, O(nC·n!). Method: Proposes KromHC, which parametrizes the residual matrix as a Kronecker product of smaller doubly stochastic matrices and enforces manifold constraints on the factor matrices along each mode of the tensorized residual stream, so that the full residual matrix is exactly doubly stochastic. Result: KromHC matches or outperforms SOTA mHC variants while requiring significantly fewer trainable parameters, with parameter complexity reduced to O(n^2C). Conclusion: KromHC addresses the two core weaknesses of the mHC family, the double stochasticity guarantee and parameter efficiency, and offers a more efficient way to impose manifold constraints on residual connections. Abstract: The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to its training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope, however, it faces two issues: 1) its iterative Sinkhorn-Knopp (SK) algorithm does not always yield exact doubly stochastic residual matrices; 2) mHC incurs a prohibitive $\mathcal{O}(n^3C)$ parameter complexity with $n$ as the width of the residual stream and $C$ as the feature dimension. The recently proposed mHC-lite reparametrizes the residual matrix via the Birkhoff-von-Neumann theorem to guarantee double stochasticity, but also faces a factorial explosion in its parameter complexity, $\mathcal{O} \left( nC \cdot n! \right)$. To address both challenges, we propose KromHC, which uses the Kronecker products of smaller doubly stochastic matrices to parametrize the residual matrix in mHC. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to $\mathcal{O}(n^2C)$. Comprehensive experiments demonstrate that KromHC matches or even outperforms state-of-the-art (SOTA) mHC variants, while requiring significantly fewer trainable parameters. The code is available at https://github.com/wz1119/KromHC.
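The property that makes the Kronecker construction work is that the Kronecker product of two doubly stochastic matrices is itself doubly stochastic, so exactness of the factors carries over to the full matrix. The sketch below checks this numerically in plain Python; it is not the KromHC parametrization, and the helper names and example factors are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def kron(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker product: entry (i*p + k, j*q + l) equals a[i][j] * b[k][l]."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


def is_doubly_stochastic(m: Matrix, tol: float = 1e-12) -> bool:
    """Non-negative entries, and every row and column sums to 1."""
    n = len(m)
    if any(len(row) != n for row in m):
        return False
    if any(x < 0 for row in m for x in row):
        return False
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in m)
    cols_ok = all(abs(sum(m[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    return rows_ok and cols_ok


if __name__ == "__main__":
    # Two small doubly stochastic factors (values chosen for illustration).
    a = [[0.7, 0.3], [0.3, 0.7]]
    b = [[0.5, 0.5], [0.5, 0.5]]
    full = kron(a, b)  # 4x4 matrix built from two 2x2 factors
    print(len(full), is_doubly_stochastic(full))
```

Here a 4x4 doubly stochastic matrix is specified by two 2x2 factors, which hints at why factorizing the residual matrix can reduce the number of parameters that need to be learned.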

[32] Language Models as Artificial Learners: Investigating Crosslinguistic Influence

Abderrahmane Issam,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis

Main category: cs.CL

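TL;DR: This paper uses language models (LMs) as controlled statistical learners to systematically simulate crosslinguistic influence (CLI), examining how L1 dominance, L2 proficiency, and syntactic distance affect CLI and probing the mechanism with a cross-linguistic priming paradigm. The results support and extend psycholinguistic theories of CLI.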

Details

Motivation: Human bilingualism studies of crosslinguistic influence (CLI) often yield conflicting results because of experimental variance, so a more controlled method is needed to isolate its drivers. Method: Uses language models as controlled statistical learners, manipulating L1 dominance and L2 proficiency through the L2 age of exposure (the training step at which the L2 is introduced) and varying the syntactic distance between the pretraining L1 and the L2; a cross-linguistic priming paradigm is used to analyze how activating L1 structures affects L2 processing. Result: The results align with psycholinguistic evidence: language dominance and proficiency are strong predictors of CLI; priming of grammatical structures is bidirectional, whereas priming of ungrammatical structures is sensitive to language dominance; in LMs, the L1 is co-activated during L2 processing and directly influences the circuitry recruited for the L2. Conclusion: Language models can serve as a computational framework that provides mechanistic evidence and new insight for theories of human CLI. Abstract: Despite the centrality of crosslinguistic influence (CLI) to bilingualism research, human studies often yield conflicting results due to inherent experimental variance. We address these inconsistencies by using language models (LMs) as controlled statistical learners to systematically simulate CLI and isolate its underlying drivers. Specifically, we study the effect of varying the L1 language dominance and the L2 language proficiency, which we manipulate by controlling the L2 age of exposure -- defined as the training step at which the L2 is introduced. Furthermore, we investigate the impact of pretraining on L1 languages with varying syntactic distance from the L2. Using cross-linguistic priming, we analyze how activating L1 structures impacts L2 processing. Our results align with evidence from psycholinguistic studies, confirming that language dominance and proficiency are strong predictors of CLI. We further find that while priming of grammatical structures is bidirectional, the priming of ungrammatical structures is sensitive to language dominance. Finally, we provide mechanistic evidence of CLI in LMs, demonstrating that the L1 is co-activated during L2 processing and directly influences the neural circuitry recruited for the L2. More broadly, our work demonstrates that LMs can serve as a computational framework to inform theories of human CLI.

[33] ILRR: Inference-Time Steering Method for Masked Diffusion Language Models

Eden Avrahami,Eliya Nachmani

Main category: cs.CL

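TL;DR: This paper proposes ILRR, a training-free inference-time steering framework for discrete diffusion language models (DLMs). It steers semantic attributes such as sentiment by dynamically aligning the internal activations of the generated sequence with those of a reference sequence, and adds Spatially Modulated Steering for long texts, improving attribute accuracy while maintaining generation quality.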

Details

Motivation: Discrete diffusion language models (DLMs) offer a non-autoregressive route to text generation, but effective mechanisms for inference-time control remain underexplored. Method: Proposes Iterative Latent Representation Refinement (ILRR), a learning-free framework that uses a single reference sequence and dynamically aligns the internal activations of the generated sequence with those of the reference throughout denoising; Spatially Modulated Steering extends this by regulating guidance intensity across the sequence so that short references can steer long texts. Result: ILRR achieves effective attribute steering on the LLaDA and MDLM architectures with minor overhead, requiring only one additional parallel forward pass per denoising step; under the same compute budget, attribute accuracy improves over comparable baselines by 10 to 60 percentage points while generation quality stays high. Conclusion: ILRR is a lightweight, general, and efficient inference-time control method for DLMs that transfers high-level semantic attributes from a reference. Abstract: Discrete Diffusion Language Models (DLMs) offer a promising non-autoregressive alternative for text generation, yet effective mechanisms for inference-time control remain relatively underexplored. Existing approaches include sampling-level guidance procedures or trajectory optimization mechanisms. In this work, we introduce Iterative Latent Representation Refinement (ILRR), a learning-free framework for steering DLMs using a single reference sequence. ILRR guides generation by dynamically aligning the internal activations of the generated sequence with those of a given reference throughout the denoising process. This approach captures and transfers high-level semantic properties, with a tunable steering scale enabling flexible control over attributes such as sentiment. We further introduce Spatially Modulated Steering, an extension that enables steering long texts using shorter references by regulating guidance intensity across the sequence. Empirically, we demonstrate that ILRR achieves effective attribute steering on LLaDA and MDLM architectures with a minor computational overhead, requiring only one additional parallel forward pass per denoising step. Under the same compute budget, ILRR improves attribute accuracy over comparable baselines by 10 to 60 percentage points, while maintaining high generation quality.

[34] AdaptBPE: From General Purpose to Specialized Tokenizers

Vijini Liyanage,François Yvon

Main category: cs.CL

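TL;DR: This paper proposes a lightweight post-training adaptation method for subword tokenizers such as BPE: it replaces low-utility tokens based on a domain- or language-specific corpus, optimizing the vocabulary to improve compression efficiency and downstream task performance.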

Details

Motivation: General-purpose subword tokenizers use tokens inefficiently and encode text poorly in specific domains or languages, which hurts model performance and efficiency. Method: A post-training adaptation strategy that analyzes token frequencies in an adaptation corpus, selectively replaces low-utility tokens in the original vocabulary, and builds the token inventory that best encodes the corpus at a target vocabulary size. Result: On multilingual generation and classification tasks, the adapted tokenizers compress test corpora better than baselines at the same vocabulary size and improve downstream performance. Conclusion: The method is a lightweight, efficient, plug-in vocabulary fine-tuning mechanism that markedly improves how well an LLM's tokenizer fits a specific domain or language. Abstract: Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora more effectively than baselines using the same vocabulary size. This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks. Our code and data are available at https://github.com/vijini/Adapt-BPE.git.

[35] Scale-Dependent Semantic Dynamics Revealed by Allan Deviation

Debayan Dasgupta

Main category: cs.CL

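TL;DR: This paper treats the semantic evolution of text as a stochastic trajectory in a high-dimensional state space and uses Allan deviation, a tool from precision metrology, to analyze semantic stability. Human text shows two dynamical regimes, short-time power-law scaling and a long-time stability-limited noise floor; large language models mimic the local statistics but have a markedly shorter stability horizon.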

Details

Motivation: To probe the intrinsic dynamics of how meaning evolves over the course of a text, and to understand how human-written and LLM-generated text differ in semantic stability. Method: Ordered sentence embeddings are treated as a displacement signal, and Allan deviation is used to analyze their stability in the high-dimensional semantic space and to identify dynamical behavior at different time scales. Result: Human text exhibits short-time power-law scaling (which distinguishes literary from technical text) and a long-time stability-limited noise floor; LLMs reproduce the short-time scaling but sustain stability over a systematically shorter horizon. Conclusion: Semantic coherence is a measurable physical property, and the framework can quantitatively separate the semantic dynamics of human cognition from those of algorithmic models. Abstract: While language progresses through a sequence of semantic states, the underlying dynamics of this progression remain elusive. Here, we treat the semantic progression of written text as a stochastic trajectory in a high-dimensional state space. We utilize Allan deviation, a tool from precision metrology, to analyze the stability of meaning by treating ordered sentence embeddings as a displacement signal. Our analysis reveals two distinct dynamical regimes: short-time power-law scaling, which differentiates creative literature from technical texts, and a long-time crossover to a stability-limited noise floor. We find that while large language models successfully mimic the local scaling statistics of human text, they exhibit a systematic reduction in their stability horizon. These results establish semantic coherence as a measurable physical property, offering a framework to differentiate the nuanced dynamics of human cognition from the patterns generated by algorithmic models.
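
For readers unfamiliar with Allan deviation, the following minimal sketch computes the standard non-overlapping estimator: the signal is averaged in consecutive blocks of `m` samples, and the deviation is the square root of half the mean squared difference between adjacent block means. How the paper reduces sentence embeddings to a signal is not stated in the abstract; the vector extension below (summing squared differences over embedding dimensions) and the use of the sentence index as the time axis are assumptions made for illustration.

```python
import numpy as np


def allan_deviation(y, m):
    """Non-overlapping Allan deviation at averaging factor m.

    y -- array of shape (T,) or (T, d); for embeddings, row t is sentence t
    m -- number of consecutive samples averaged per block (the "tau")
    """
    y = np.asarray(y, dtype=float)
    if y.ndim == 1:
        y = y[:, None]  # treat a scalar signal as d = 1
    n_bins = len(y) // m
    if n_bins < 2:
        raise ValueError("need at least two full blocks of length m")
    # Mean of each consecutive block of m samples: shape (n_bins, d).
    means = y[: n_bins * m].reshape(n_bins, m, -1).mean(axis=1)
    # Differences between adjacent block means.
    diffs = np.diff(means, axis=0)
    # Assumption: squared differences are summed over embedding dimensions.
    avar = 0.5 * np.mean(np.sum(diffs ** 2, axis=1))
    return float(np.sqrt(avar))


def allan_curve(y, factors):
    """Allan deviation evaluated at several averaging factors."""
    return {m: allan_deviation(y, m) for m in factors}
```

Plotting `allan_curve` on log-log axes is the usual way to read off a power-law slope at small `m` and a flattening at large `m`, which is the kind of two-regime picture the abstract describes.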

[36] FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning

Xiaoyu Xu,Minxin Du,Kun Fang,Zi Liang,Yaxin Xiao,Zhicong Huang,Cheng Hong,Qingqing Ye,Haibo Hu

Main category: cs.CL

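TL;DR: This paper proposes FIT, a framework for continual unlearning in large language models that combines data filtering, importance-aware updates, and targeted layer attribution to prevent catastrophic forgetting and post-unlearning recovery under large numbers of sequential deletion requests; its advantage is validated on the PCH benchmark.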

Details

Motivation: Existing LLM unlearning methods struggle with the continual, high-volume deletion requests seen in practice, which leads to utility degradation and catastrophic forgetting. Method: FIT has three core components: rigorous data Filtering, Importance-aware parameter updates, and Targeted layer attribution. The paper also builds the PCH benchmark (personal information, copyright, and harmful content) and two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.). Result: In experiments with hundreds of deletion requests on four open-source LLMs, FIT achieves the best F.D./R.U. trade-off, surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and is robust to relearning and quantization recovery attacks. Conclusion: FIT is an efficient, robust continual-unlearning framework that balances forgetting effectiveness with model utility and shows clear advantages under realistic deletion workloads. Abstract: Large language models (LLMs) demonstrate impressive capabilities across diverse tasks but raise concerns about privacy, copyright, and harmful materials. Existing LLM unlearning methods rarely consider the continual and high-volume nature of real-world deletion requests, which can cause utility degradation and catastrophic forgetting as requests accumulate. To address this challenge, we introduce FIT, a framework for continual unlearning that handles large numbers of deletion requests while maintaining robustness against both catastrophic forgetting and post-unlearning recovery. FIT mitigates degradation through rigorous data Filtering, Importance-aware updates, and Targeted layer attribution, enabling stable performance across long sequences of unlearning operations and achieving a favorable balance between forgetting effectiveness and utility retention. To support realistic evaluation, we present PCH, a benchmark covering Personal information, Copyright, and Harmful content in sequential deletion scenarios, along with two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), which jointly assess forgetting quality and utility preservation. Extensive experiments on four open-source LLMs with hundreds of deletion requests show that FIT achieves the strongest trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and remains resistant against both relearning and quantization recovery attacks.

[37] Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling

Xinglin Wang,Jiayi Shi,Shaoxiong Feng,Peiwen Yuan,Yiwei Li,Yueqi Zhang,Chuyi Tan,Ji Zhang,Boyuan Pan,Yao Hu,Kan Li

Main category: cs.CL

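TL;DR: This paper proposes Recycling Search Experience (RSE), which reuses successful conclusions and failure patterns from the search process at test time to make LLM reasoning more efficient, substantially reducing redundant computation and achieving state-of-the-art scaling efficiency on several benchmarks.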

Details

Motivation: Existing test-time scaling methods treat each rollout as an independent sample and ignore intermediate reasoning results, causing heavy recomputation and repeated visits to dead ends. Method: RSE is a training-free, self-guided strategy that builds a shared experience bank, enabling positive recycling of intermediate conclusions (to shortcut derivations) and negative recycling of failure patterns (to prune dead ends). Result: Theoretical analysis shows RSE's advantage over independent sampling; experiments show it consistently outperforms strong baselines at comparable compute on HMMT24/25, IMO-Bench, and HLE, reaching state-of-the-art scaling efficiency. Conclusion: RSE turns test-time search from isolated attempts into a cumulative process, mitigating memorylessness and offering a new paradigm for efficient reasoning. Abstract: Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This systemic memorylessness leads to massive computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose Recycling Search Experience (RSE), a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE, validating its advantage over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art scaling efficiency.

[38] Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

Hojae Han,Heeyun Jung,Jongyoon Kim,Seung-won Hwang

Main category: cs.CL

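TL;DR: This paper proposes DAVID-GRPO, a framework that improves the multi-hop reasoning of small language models under resource constraints by stabilizing early learning, assigning retrieval credit based on evidence recall, and resampling truncated near-miss trajectories to improve exploration and training stability.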

Details

Motivation: Existing RL methods for multi-turn reasoning depend on large numbers of on-policy rollouts in a high-cost, high-accuracy regime, whereas small models under resource constraints fall into a low-accuracy regime with unstable training. Method: DAVID-GRPO (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit according to evidence recall, and (iii) resamples truncated near-miss trajectories to strengthen exploration. Result: With models of up to 1.5B parameters trained on only four RTX 3090 GPUs, it consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. Conclusion: With the right inductive biases, small language models can achieve accurate multi-hop reasoning at low training cost, breaking the cost-accuracy trade-off. Abstract: While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense explorations, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated on agents up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve low training cost with high accuracy.

[39] Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

Wonduk Seo,Wonseok Choi,Junseo Koh,Juhyeon Lee,Hyunjin An,Minhyeong Yu,Jian Park,Qingshan Zhou,Seunghyun Lee,Yi Bu

Main category: cs.CL

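TL;DR: This paper proposes OG-MAR, a framework that combines World Values Survey (WVS) data with a cultural ontology to build a multi-agent reasoning system, improving the alignment, robustness, and interpretability of LLMs in culturally sensitive decision making.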

Details

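Motivation: LLMs show value misalignment in culturally sensitive decisions because of skewed pretraining data and the lack of structured value representations; existing alignment methods lack demographic grounding and treat values as independent, unstructured signals, which hurts consistency and interpretability. Method: OG-MAR summarizes respondent-specific values from the WVS and builds a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Result: Experiments on regional social-survey benchmarks with four LLM backbones show that OG-MAR outperforms strong baselines in cultural alignment, robustness, and reasoning transparency. Conclusion: OG-MAR offers a scalable, interpretable approach to LLM reasoning that is culturally aware, demographically grounded, and guided by structured values. Abstract: Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.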

[40] Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

Qingyue Yang,Jie Wang,Xing Li,Yinqi Bai,Xialiang Tong,Huiling Zhen,Jianye Hao,Mingxuan Yuan,Bin Li

Main category: cs.CL

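TL;DR: This paper proposes TAPPA, a framework that explains diverse attention patterns in LLMs from a temporally continuous perspective, distinguishes predictable from unpredictable patterns, provides a mathematical analysis based on query self-similarity, and guides KV cache compression and model pruning.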

Details

Motivation: Observations of attention patterns (such as retrieval heads, sink heads, and diagonal traces) are fragmented and lack a unifying explanation, so a framework that integrates and explains them is needed. Method: Temporal Attention Pattern Predictability Analysis (TAPPA) models attention from a temporally continuous perspective, mathematically characterizes three representative predictable patterns through the joint effect of queries, keys, and RoPE, and explains differences in predictability by the self-similarity of queries along the temporal dimension. Result: TAPPA is validated on KV cache compression and LLM pruning; a simple TAPPA-motivated metric consistently outperforms baselines across these tasks. Conclusion: TAPPA provides a unified, interpretable view of attention behavior with practical value for inference acceleration, and indicates that the predictability of attention patterns stems from query self-similarity along the temporal dimension. Abstract: Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab-USTC/LLM-TAPPA.

[41] TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning

Huiyuan Lai,Malvina Nissim

Main category: cs.CL

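TL;DR: This paper proposes TACLer, which combines model-tailored curriculum reinforcement learning with a hybrid Thinking/NoThinking reasoning paradigm to improve the learning and reasoning efficiency of LLMs on complex reasoning tasks, raising accuracy while lowering compute cost.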

Details

Motivation: Existing long chain-of-thought (CoT) reasoning methods rely on large-scale RL training and tend to produce redundant steps (overthinking), which hurts learning and reasoning efficiency. Method: TACLer has two core components: (i) tailored curriculum learning that raises data complexity in progressive stages according to the model's proficiency; (ii) a hybrid Thinking/NoThinking paradigm that enables or disables the Thinking mode to balance accuracy and efficiency. Result: TACLer cuts training compute by over 50% compared to long-thinking models, reduces inference token usage by over 42% relative to the base model, and improves accuracy by over 9% on four math datasets, consistently outperforming state-of-the-art NoThinking and Thinking baselines. Conclusion: TACLer substantially improves training and inference efficiency while preserving or improving reasoning performance, offering a new approach to efficient, controllable complex reasoning. Abstract: Large Language Models (LLMs) have shown remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT typically requires large-scale reinforcement learning (RL) training, while often leading to overthinking with redundant intermediate steps. To improve learning and reasoning efficiency, while preserving or even enhancing performance, we propose TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases the complexity of the data based on the model's proficiency in multi-stage RL training. TACLer features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50% compared to long thinking models and reducing inference token usage by over 42% relative to the base model; and (ii) it improves accuracy by over 9% relative to the base model, consistently outperforming state-of-the-art NoThinking and Thinking baselines across four math datasets with complex problems.

[42] Enhancing Language Models for Robust Greenwashing Detection

Neil Heinrich Braun,Keane Ong,Rui Mao,Erik Cambria,Gianmarco Mengaldo

Main category: cs.CL

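TL;DR: This paper proposes a parameter-efficient framework that structures the latent space of LLMs by combining contrastive learning with an ordinal ranking objective, to better separate concrete actions from vague claims in sustainability reports and make ESG assessment more robust to greenwashing and ambiguous statements.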

Details

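Motivation: Sustainability reports are critical for ESG assessment, but greenwashing and vague claims often undermine their reliability; existing NLP models rely on surface patterns and are neither robust nor good at generalizing. Method: A parameter-efficient framework that combines contrastive learning with an ordinal ranking objective to structure the LLM latent space, adds gated feature modulation to filter disclosure noise, and uses MetaGradNorm to stabilize multi-objective optimization. Result: In cross-category experiments the method is more robust than standard baselines and reveals a trade-off between representational rigidity and generalization. Conclusion: The framework improves the modeling of vague and misleading ESG disclosures and points toward more trustworthy NLP systems for sustainability assessment. Abstract: Sustainability reports are critical for ESG assessment, yet greenwashing and vague claims often undermine their reliability. Existing NLP models lack robustness to these practices, typically relying on surface-level patterns that generalize poorly. We propose a parameter-efficient framework that structures LLM latent spaces by combining contrastive learning with an ordinal ranking objective to capture graded distinctions between concrete actions and ambiguous claims. Our approach incorporates gated feature modulation to filter disclosure noise and utilizes MetaGradNorm to stabilize multi-objective optimization. Experiments in cross-category settings demonstrate superior robustness over standard baselines while revealing a trade-off between representational rigidity and generalization.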

[43] Procedural Pretraining: Warming Up Language Models with Abstract Data

Liangze Jiang,Zachary Shinnick,Anton van den Hengel,Hemanth Saratchandran,Damien Teney

Main category: cs.CL

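TL;DR: This paper proposes a new pretraining paradigm: before natural-language pretraining, the model is briefly pretrained on abstract procedural data (such as Dyck sequences), which markedly improves performance on algorithmic tasks, structured reasoning, and language modeling, and speeds up convergence.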

Details

Motivation: Inspired by how humans learn simple logic and mathematics before higher reasoning, the paper explores exposing a language model to abstract structured data (specifically procedural data) as a step before large-scale pretraining, to build semantic and reasoning abilities more efficiently. Method: Various forms of procedural data (such as Dyck sequences, counters, and Turing-machine simulation) are used for front-loaded pretraining and compared against standard pretraining on C4, CodeParrot, and DeepMind-Math; the study analyzes the structure induced in attention and MLP layers and evaluates models of up to 1.3B parameters on algorithmic tasks, speed of loss reduction, and generalization. Result: On Needle-in-a-haystack, accuracy rises from 10% to 98%; front-loading as little as 0.1% procedural data outperforms standard pretraining; reaching the same loss requires only 55% to 86% of the original data; the structure instilled in attention layers matters most for structured domains such as code, and that in MLP layers for language. Conclusion: Procedural front-loaded pretraining is a lightweight, effective way to improve algorithmic ability and accelerate convergence, and it suggests that knowledge acquisition and reasoning could be disentangled in LLMs. Abstract: Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. We specifically focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10% to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this procedural pretraining enables the models to reach the same loss value with only 55%, 67%, and 86% of the original data. Third, we explore the mechanisms behind these gains and find that procedural pretraining instils non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
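
To make "procedural data" concrete, here is a small sketch of a generator for Dyck sequences (balanced brackets), the example named in the abstract. The sampling scheme (a fair coin between opening and closing whenever both are allowed), the bracket alphabet, and the function names are assumptions; the paper's actual data generator may differ.

```python
import random


def sample_dyck(n_pairs, brackets=("()", "[]"), seed=None):
    """Sample one balanced-bracket string containing exactly n_pairs pairs."""
    rng = random.Random(seed)
    out, stack = [], []
    opened = 0
    while opened < n_pairs or stack:
        can_open = opened < n_pairs
        # Open if the stack is empty, otherwise flip a coin (assumed scheme).
        if can_open and (not stack or rng.random() < 0.5):
            opener, closer = rng.choice(brackets)
            out.append(opener)
            stack.append(closer)
            opened += 1
        else:
            # Close the most recently opened bracket.
            out.append(stack.pop())
    return "".join(out)


def is_balanced(s, brackets=("()", "[]")):
    """Check that s is a valid Dyck word over the given bracket pairs."""
    closer_of = {b[0]: b[1] for b in brackets}
    stack = []
    for ch in s:
        if ch in closer_of:
            stack.append(closer_of[ch])
        elif not stack or stack.pop() != ch:
            return False
    return not stack
```

A pretraining corpus in this style would simply be many such strings of varying length; `is_balanced` is included only as a sanity check on the generator.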

[44] CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering

Jiayin Lan,Jiaqi Li,Baoxin Wang,Ming Liu,Dayong Wu,Shijin Wang,Bing Qin,Guoping Hu

Main category: cs.CL

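TL;DR: This paper proposes CE-GOCD, which uses paper titles as central entities to build and optimize subgraphs of an academic knowledge graph and applies community detection to improve LLM question answering over scientific literature.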

Details

Motivation: Existing retrieval-augmented methods rely on isolated text chunks or concepts and overlook deeper semantic connections between papers, limiting the LLM's comprehension of scientific literature and the comprehensiveness and specificity of its answers. Method: Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD): (1) retrieve subgraphs using paper titles as central entities; (2) enhance implicit semantic discovery through subgraph pruning and completion; (3) apply community detection to distill thematically coherent groups of papers. Result: On three NLP literature question-answering datasets, CE-GOCD significantly outperforms other retrieval-augmented baselines. Conclusion: Explicitly modeling and exploiting semantic substructures in academic knowledge graphs effectively improves LLM performance on scientific question answering. Abstract: Large Language Models (LLMs) are increasingly used for question answering over scientific research papers. Existing retrieval augmentation methods often rely on isolated text chunks or concepts, but overlook deeper semantic connections between papers. This impairs the LLM's comprehension of scientific literature, hindering the comprehensiveness and specificity of its responses. To address this, we propose Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD), a method that augments LLMs' scientific question answering by explicitly modeling and leveraging semantic substructures within academic knowledge graphs. Our approach operates by: (1) leveraging paper titles as central entities for targeted subgraph retrieval, (2) enhancing implicit semantic discovery via subgraph pruning and completion, and (3) applying community detection to distill coherent paper groups with shared themes. We evaluated the proposed method on three NLP literature-based question-answering datasets, and the results demonstrate its superiority over other retrieval-augmented baseline approaches, confirming the effectiveness of our framework.

[45] Temporal Guidance for Large Language Models

Hong-Kai Zheng,Piji Li

Main category: cs.CL

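TL;DR: This paper proposes Temporal Guidance (TeGu), a new contrastive guidance method along the temporal dimension that uses Multi-Token Prediction (MTP) to construct the model's own weaker predictions for self-contrast, and introduces a lightweight Conditional MTP Projector (cMTPP), markedly improving generation quality at low overhead.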

Details

Motivation: Existing contrastive decoding methods such as CD and DoLa either incur heavy computational overhead or are unstable on small models; the authors observe that LLMs exhibit local preferences, which motivates a contrastive strategy along the temporal dimension. Method: Temporal Guidance (TeGu) uses MTP to generate weaker predictions as the self-contrast signal, and a lightweight Conditional MTP Projector (cMTPP) standardizes the implementation and avoids maintaining multiple independent networks. Result: Across multiple model series and benchmarks, TeGu markedly improves generation quality while keeping additional memory and compute low. Conclusion: TeGu is an efficient, stable, easy-to-deploy self-contrastive decoding method that offers a new approach to reducing decoding overhead while improving output quality. Abstract: Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.
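
As background for the contrastive setup TeGu builds on, the sketch below shows generic contrastive decoding: tokens are scored by the gap between "expert" and "amateur" log-probabilities, restricted to tokens the expert itself finds plausible. It is not TeGu's exact formulation. Where the amateur logits come from (in TeGu, an MTP-based prediction) is outside this sketch, and the function names and the default `alpha` are illustrative assumptions.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def contrastive_scores(expert_logits, amateur_logits, alpha=0.1):
    """Generic contrastive-decoding scores for one decoding step.

    Tokens whose expert probability is below alpha * (max expert probability)
    are masked out; the rest are scored by log p_expert - log p_amateur.
    """
    lp_expert = log_softmax(np.asarray(expert_logits, dtype=float))
    lp_amateur = log_softmax(np.asarray(amateur_logits, dtype=float))
    # Plausibility constraint, expressed in log space.
    plausible = lp_expert >= np.log(alpha) + lp_expert.max()
    return np.where(plausible, lp_expert - lp_amateur, -np.inf)


def pick_token(expert_logits, amateur_logits, alpha=0.1):
    """Greedy choice under the contrastive score."""
    return int(np.argmax(contrastive_scores(expert_logits, amateur_logits, alpha)))
```

The plausibility mask matters: without it, a token the expert considers nearly impossible could win simply because the amateur likes it even less.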

[46] CoFrGeNet: Continued Fraction Architectures for Language Generation

Amit Dhurandhar,Vijil Chenthamarakshan,Dennis Wei,Tejaswini Pedapati,Karthikeyan Natesan Ramamurthy,Rahul Nair

Main category: cs.CL

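TL;DR: This paper proposes a new function class for generative modeling based on continued fractions, and the corresponding architecture family CoFrGeNets, which replaces multi-head attention and feed-forward networks in Transformers, matching or exceeding the original models on downstream tasks with substantially fewer parameters and less pretraining time.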

Details

Motivation: Transformer architectures have many parameters and high compute cost, so more efficient, lighter alternatives are needed; inspired by the mathematics of continued fractions, the paper explores a new function class to improve the efficiency and performance of generative models. Method: A continued-fraction-based function class with plug-in components that replace Multi-head Attention and Feed-Forward Networks in Transformer blocks, plus custom gradient formulations that optimize them more accurately and efficiently than standard PyTorch gradients. Result: On GPT2-xl (1.5B) and Llama3 (3.2B), the new models need only one-half to two-thirds of the original parameters and shorter pretraining, with downstream classification, Q&A, reasoning, and text-understanding performance competitive with and sometimes better than the originals. Conclusion: CoFrGeNets are a promising lightweight, efficient alternative for generative modeling with potential for industrial adoption; hardware-customized implementations may bring out further gains. Abstract: Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets - Continued Fraction Generative Networks. We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change in training or inference procedures that have already been put in place for Transformer-based models, thus making our approach easy to incorporate in large industrial workflows. We experiment on two very different transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B), where the former we pre-train on OpenWebText and GneissWeb, while the latter we pre-train on the docling data mix, which consists of nine different datasets. Results show that the performance on downstream classification, Q&A, reasoning and text understanding tasks of our models is competitive and sometimes even superior to the original models with 2/3 to 1/2 the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.
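
The abstract does not give the functional form, so the following is only a generic illustration of what a continued-fraction "ladder" with input-dependent terms could look like: a_0 + 1/(a_1 + 1/(a_2 + ...)) with each a_k taken as a linear function of the input. The linear parameterization, the clamped reciprocal used to avoid division by zero, and all names are assumptions, not the CoFrGeNet components themselves.

```python
import numpy as np


def safe_reciprocal(z, eps=1e-3):
    """Sign-preserving 1/z with |z| clamped below at eps (assumed safeguard)."""
    sign = np.where(z >= 0, 1.0, -1.0)
    return sign / np.maximum(np.abs(z), eps)


def continued_fraction_ladder(x, W, eps=1e-3):
    """Evaluate a_0 + 1/(a_1 + 1/(... + 1/a_D)) with a_k = W[k] @ x.

    x -- input vector of shape (d,)
    W -- weight matrix of shape (D + 1, d), one row per ladder rung
    """
    a = np.asarray(W, dtype=float) @ np.asarray(x, dtype=float)
    # Evaluate bottom-up, starting from the deepest rung.
    value = a[-1]
    for k in range(len(a) - 2, -1, -1):
        value = a[k] + safe_reciprocal(value, eps)
    return float(value)
```

For example, with `x = [1.0]` and rungs `[[1.0], [2.0], [2.0]]` the ladder evaluates 1 + 1/(2 + 1/2). A depth-D ladder uses only (D + 1) x d weights, which gives some intuition for how such a function class could be parameter-light.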

[47] Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond

Wei Zhu

Main category: cs.CL

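TL;DR: This paper systematically evaluates ChatGPT on four types of medical information extraction (MedIE) tasks across six benchmark datasets along five dimensions: performance, explainability, confidence, faithfulness, and uncertainty. ChatGPT underperforms fine-tuned models; its explanations are high quality and faithful to the text, but it is overconfident and suffers from generation uncertainty.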

Details

Motivation: To assess what large language models such as ChatGPT can actually do on medical information extraction (MedIE) beyond general chat, focusing on performance, explainability, confidence, faithfulness, and uncertainty in a specialized NLP task. Method: A systematic experimental evaluation of ChatGPT on four MedIE tasks across six benchmark medical datasets, measuring task performance, explanation quality, confidence calibration, faithfulness of predictions to the source text, and the effect of generation uncertainty. Result: (a) ChatGPT's performance on MedIE tasks is below that of fine-tuned baseline models; (b) it provides high-quality explanations but is overconfident in its predictions; (c) it is highly faithful to the original text in most cases; (d) generation uncertainty makes extraction results unstable, limiting medical applications. Conclusion: ChatGPT shows some promise for MedIE (notably in explainability and faithfulness), but its performance gap, overconfidence, and generation uncertainty constrain direct deployment in high-reliability medical settings and call for targeted improvements. Abstract: Large Language Models (LLMs) like ChatGPT have demonstrated amazing capabilities in comprehending user intents and generating reasonable and useful responses. Besides their ability to chat, their capabilities in various natural language processing (NLP) tasks are of interest to the research community. In this paper, we focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis by measuring ChatGPT's performance, explainability, confidence, faithfulness, and uncertainty. Our experiments reveal that: (a) ChatGPT's performance scores on MedIE tasks fall behind those of the fine-tuned baseline models. (b) ChatGPT can provide high-quality explanations for its decisions; however, ChatGPT is over-confident in its predictions. (c) ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. (d) The uncertainty in generation causes uncertainty in information extraction results, and thus may hinder its applications in MedIE tasks.

[48] Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention

Alon Rozental

Main category: cs.CL

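TL;DR: This paper proposes Zonkey, a fully differentiable language modeling framework based on hierarchical diffusion that replaces the conventional BPE tokenizer with a learnable, differentiable Segment Splitter and introduces Probabilistic Attention and DDMM for end-to-end training, supporting variable-length generation and hierarchical representations.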

Details

Motivation: Existing LLMs are constrained by fixed, non-differentiable tokenizers such as BPE, which hinder end-to-end optimization and adapt poorly to noisy or domain-specific data. Method: Zonkey comprises (1) a differentiable Segment Splitter that learns probabilistic BOS decisions for adaptive segmentation; (2) Probabilistic Attention, which uses position-specific existence probabilities to soft-mask theoretically infinite sequences while preserving gradients; (3) hierarchical compression with a Denoising Diffusion Mixed Model (DDMM) for stable denoising in latent space; (4) a Stitcher that ensures overlap invariance across segments. Result: Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise and shows emergent linguistic hierarchy (for example, word boundaries at spaces and sentence starts at periods), with qualitatively closer alignment to the data distribution than entropy-based learnable tokenizers. Conclusion: Zonkey advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation; the code is released. Abstract: Large language models (LLMs) have revolutionized natural language processing, yet they remain constrained by fixed, non-differentiable tokenizers like Byte Pair Encoding (BPE), which hinder end-to-end optimization and adaptability to noisy or domain-specific data. We introduce Zonkey, a hierarchical diffusion model that addresses these limitations through a fully trainable pipeline from raw characters to document-level representations. At its core is a differentiable tokenizer (Segment Splitter) that learns probabilistic beginning-of-sequence (BOS) decisions, enabling adaptive splits that emerge as linguistically meaningful (e.g., word boundaries at spaces, sentence starts at periods) without explicit supervision. This differentiability is enabled by our novel Probabilistic Attention mechanism, which incorporates position-specific existence probabilities to simulate soft masking over theoretically infinite sequences while preserving gradients. Sequences decay probabilistically rather than relying on end-of-sequence tokens, supporting variable-length outputs. Hierarchical levels compress sequences into higher abstractions (e.g., character n-grams to word-like vectors, then sentence-like), with reconstruction via our Denoising Diffusion Mixed Model (DDMM) for stable and efficient denoising in latent space. A Stitcher ensures overlap invariance across segments. Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise, demonstrating emergent hierarchies and promising qualitative alignment to data distributions compared to entropy-based learnable tokenizers. Our approach advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation. We release the source code for training and reproducing our experiments.

[49] KID: Knowledge-Injected Dual-Head Learning for Knowledge-Grounded Harmful Meme Detection

Yaocong Li,Leihan Zhang,Le Zhang,Qiang Yan

Main category: cs.CL

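TL;DR: This paper proposes KID, a framework that improves harmful meme detection through knowledge injection and dual-head learning, significantly outperforming existing methods.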

Details

Motivation: Internet memes often carry implicit harmful content, but because they depend on metaphor and cultural context, existing methods struggle to capture implicit toxicity, so external knowledge is needed to aid understanding. Method: KID, a knowledge-injected dual-head learning framework, uses a label-constrained distillation paradigm to build structured reasoning chains that explicitly link visual evidence, background knowledge, and classification labels, and a dual-head architecture that jointly optimizes semantic generation and classification. Result: KID achieves SOTA on five multilingual datasets spanning English, Chinese, and Bengali, improving binary and multi-label detection by 2.1% to 19.7%; ablations confirm that knowledge injection and dual-head learning are effective and complementary. Conclusion: Through knowledge enhancement and joint learning, KID yields more robust, generalizable harmful meme understanding and is practical in multilingual settings. Abstract: Internet memes have become pervasive carriers of digital culture on social platforms. However, their heavy reliance on metaphors and sociocultural context also makes them subtle vehicles for harmful content, posing significant challenges for automated content moderation. Existing approaches primarily focus on intra-modal and inter-modal signal analysis, while the understanding of implicit toxicity often depends on background knowledge that is not explicitly present in the meme itself. To address this challenge, we propose KID, a Knowledge-Injected Dual-Head Learning framework for knowledge-grounded harmful meme detection. KID adopts a label-constrained distillation paradigm to decompose complex meme understanding into structured reasoning chains that explicitly link visual evidence, background knowledge, and classification labels. These chains guide the learning process by grounding external knowledge in meme-specific contexts. In addition, KID employs a dual-head architecture that jointly optimizes semantic generation and classification objectives, enabling aligned linguistic reasoning while maintaining stable decision boundaries. Extensive experiments on five multilingual datasets spanning English, Chinese, and low-resource Bengali demonstrate that KID achieves SOTA performance on both binary and multi-label harmful meme detection tasks, improving over previous best methods by 2.1% to 19.7% across primary evaluation metrics. Ablation studies further confirm the effectiveness of knowledge injection and dual-head joint learning, highlighting their complementary contributions to robust and generalizable meme understanding. The code and data are available at https://github.com/PotatoDog1669/KID.

[50] Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation

Yimin Deng,Yuqing Fu,Derong Xu,Yejing Wang,Wei Ni,Jingtong Gao,Xiaopeng Li,Chengxu Liu,Xiao Han,Guoshuai Zhao,Xiangyu Zhao,Li Zhu,Xueming Qian

Main category: cs.CL

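TL;DR: This paper proposes an Adversarial Memory Adaptation mechanism (AMA) that simulates downstream task execution to inject task-aware supervision during the offline phase, improving how conversational agents construct and update memory in long conversations.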

Details

Motivation: Existing memory systems use fixed, task-independent strategies for memory construction and update in the offline phase, which misaligns them with downstream task needs and hurts performance. Method: In AMA, a challenger agent generates question-answer pairs, which are answered using the current memory to simulate inference; an evaluator agent analyzes the errors; and an adapter agent then updates the memory system at two levels (construction strategy and content). Result: AMA can be plugged into a variety of existing memory systems, and experiments on the long-dialogue benchmark LoCoMo show significant performance gains. Conclusion: By adding a task-oriented adversarial adaptation loop to the offline phase, AMA closes the gap between memory preparation and task requirements and improves long-conversation understanding and reasoning. Abstract: Conversational agents struggle to handle long conversations due to context window limitations. Therefore, memory systems are developed to leverage essential historical information. Existing memory systems typically follow a pipeline of offline memory construction and update, and online retrieval. Despite the flexible online phase, the offline phase remains fixed and task-independent. In this phase, memory construction operates under a predefined workflow and fails to emphasize task-relevant information. Meanwhile, memory updates are guided by generic metrics rather than task-specific supervision. This leads to a misalignment between offline memory preparation and task requirements, which undermines downstream task performance. To this end, we propose an Adversarial Memory Adaptation mechanism (AMA) that aligns memory construction and update with task objectives by simulating task execution. Specifically, first, a challenger agent generates question-answer pairs based on the original dialogues. The constructed memory is then used to answer these questions, simulating downstream inference. Subsequently, an evaluator agent assesses the responses and performs error analysis. Finally, an adapter agent analyzes the error cases and performs dual-level updates on both the construction strategy and the content. Through this process, the memory system receives task-aware supervision signals in advance during the offline phase, enhancing its adaptability to downstream tasks. AMA can be integrated into various existing memory systems, and extensive experiments on the long-dialogue benchmark LoCoMo demonstrate its effectiveness.

[51] RAG-E: Quantifying Retriever-Generator Alignment and Failure Modes

Korbinian Randl,Guido Rocchietti,Aron Henriksson,Ziawasch Abedjan,Tony Lindgren,John Pavlopoulos

Main category: cs.CL

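TL;DR: This paper proposes RAG-E, a framework that quantifies retriever-generator alignment using adapted attribution methods (Integrated Gradients and PMCSHAP) and a new metric, WARG, and reveals severe mismatches between the retrieval and generation stages of RAG systems.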

Details

Motivation: RAG systems face explainability challenges when deployed in high-stakes domains: the interaction between retriever and generator is opaque, and end-to-end explainability tools are needed. Method: The RAG-E framework (1) analyzes the retriever with Integrated Gradients; (2) introduces PMCSHAP, a Monte Carlo-stabilized Shapley value approximation, for generator attribution; and (3) defines the Weighted Attribution-Relevance Gap (WARG) metric to measure how consistently the generator's actual use of retrieved documents matches the retriever's ranking. Result: On TREC CAsT and FoodSafeSum, generators ignore the retriever's top-ranked document for 47.4% to 66.7% of queries and rely on less relevant documents for 48.1% to 65.9%, confirming that component misalignment is a key factor in output quality. Conclusion: RAG performance depends not only on how each component performs in isolation but also on how the components behave together; RAG-E offers an auditable, mathematically grounded means of end-to-end analysis, laying groundwork for trustworthy RAG deployment. Abstract: Retrieval-Augmented Generation (RAG) systems combine dense retrievers and language models to ground LLM outputs in retrieved documents. However, the opacity of how these components interact creates challenges for deployment in high-stakes domains. We present RAG-E, an end-to-end explainability framework that quantifies retriever-generator alignment through mathematically grounded attribution methods. Our approach adapts Integrated Gradients for retriever analysis, introduces PMCSHAP, a Monte Carlo-stabilized Shapley Value approximation, for generator attribution, and introduces the Weighted Attribution-Relevance Gap (WARG) metric to measure how well a generator's document usage aligns with a retriever's ranking. Empirical analysis on TREC CAsT and FoodSafeSum reveals critical misalignments: for 47.4% to 66.7% of queries, generators ignore the retriever's top-ranked documents, while 48.1% to 65.9% rely on documents ranked as less relevant. These failure modes demonstrate that RAG output quality depends not solely on individual component performance but on their interplay, which can be audited via RAG-E.
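
The kind of mismatch reported in this abstract can be illustrated with a small check. The sketch below is not RAG-E's WARG metric, whose formula the abstract does not give; it is a simplified, hypothetical top-document agreement test in Python. The score lists, the function name `generator_ignores_top_document`, and the `top_k` parameter are all assumptions made for illustration.

```python
def rank_order(scores):
    """Return document indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def generator_ignores_top_document(retriever_scores, generator_attributions, top_k=1):
    """Hypothetical check (not WARG): True if the retriever's top-ranked
    document is absent from the generator's top_k most-attributed documents."""
    retriever_top = rank_order(retriever_scores)[0]
    generator_top = rank_order(generator_attributions)[:top_k]
    return retriever_top not in generator_top


# Made-up numbers for one query with three retrieved documents.
retriever_scores = [0.91, 0.55, 0.30]        # retriever relevance per document
generator_attributions = [0.05, 0.15, 0.80]  # attribution mass per document

ignored = generator_ignores_top_document(retriever_scores, generator_attributions)
print(ignored)
```

Averaging such a flag over a query set would give a rate comparable in spirit to the percentages quoted above, although the paper's own measurement procedure may differ.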

[52] Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning

Bodong Du,Xuanqi Huang,Xiaomeng Li

Main category: cs.CL

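TL;DR: This paper proposes DARE, a distribution-aware reward estimation method that improves the robustness and performance of test-time reinforcement learning (TTRL) in unsupervised settings, significantly outperforming existing majority-voting-based methods.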

Details

Motivation: Existing TTRL methods rely on majority voting (MV) to produce deterministic rewards, but this assumption is fragile: MV discards information about non-majority yet correct actions, which systematically biases the reward estimate. Method: Distribution-Aware Reward Estimation (DARE) extends reward estimation from a single majority outcome to the full empirical rollout distribution, and adds an exploration bonus and a distribution pruning mechanism to encourage exploration of non-majority rollouts and to denoise rewards. Result: On reasoning benchmarks including AIME 2024 and AMC, DARE achieves relative improvements of 25.3% and 5.3%, respectively, over baselines, and improves optimization stability. Conclusion: Estimating rewards from the full rollout distribution rather than a single majority outcome is more robust and more informative; DARE offers a more reliable way to model reward signals for unsupervised TTRL. Abstract: Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV reduces the rollout distribution into a single outcome, discarding information about non-majority but correct action candidates, and yields systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a distribution pruning mechanism for non-majority rollout exploration and reward denoising, yielding a more informative and robust reward estimation. Extensive experiments on challenging reasoning benchmarks show that DARE improves optimization stability and final performance over recent baselines, achieving relative improvements of 25.3% on challenging AIME 2024 and 5.3% on AMC.
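
To make the contrast between majority voting and a distribution-based reward concrete, here is a minimal Python sketch. It is an illustrative simplification, not DARE itself: the abstract does not specify the reward formula, so using the raw empirical frequency as the reward is an assumption, and the exploration bonus and pruning mechanism are omitted. The function names and the sample answers are hypothetical.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Binary reward: 1.0 for rollouts matching the majority answer, else 0.0."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]


def empirical_distribution_rewards(answers):
    """Assumed simplification: reward each rollout by the empirical
    frequency of its final answer across all rollouts."""
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]


# Six hypothetical rollouts for one unlabeled question.
answers = ["42", "42", "42", "17", "17", "8"]

mv_rewards = majority_vote_rewards(answers)
dist_rewards = empirical_distribution_rewards(answers)
```

Under majority voting, the two rollouts answering "17" receive zero reward even though they form a sizable minority; the frequency-based variant keeps a graded signal for them, which is the kind of information the abstract says MV discards.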

[53] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

Aadi Palnitkar,Mingyang Mao,Nicholas Waytowich,Vinicius G. Goecks,Tinoosh Mohsenin,Xiaomin Lin

Main category: cs.CL

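TL;DR: This paper presents MilSCORE, described as the first long-context benchmark dataset for military planning scenarios, which evaluates large language models' selective reading and spatial reasoning over heterogeneous, multi-source information (such as maps, orders, and intelligence reports).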

Details

Motivation: Existing long-context benchmarks do not capture realistic, complex tasks, especially high-stakes planning tasks that require fusing multimodal geospatial information, so they are poorly suited to evaluating LLM decision-making and reasoning in operational-grade scenarios. Method: The authors build MilSCORE, an expert-authored multi-hop question-answering dataset grounded in a simulated military planning scenario. It covers seven question categories (including factual retrieval, constraint reasoning, strategy analysis, and spatial analysis) and comes with an evaluation protocol, which is used to run baseline evaluations of several vision-language models. Result: Current mainstream vision-language models perform markedly poorly on MilSCORE, leaving substantial headroom and confirming that the benchmark is both challenging and practical. Conclusion: MilSCORE fills a gap in long-context, multimodal, high-stakes scenario-level reasoning benchmarks and provides a key evaluation platform for advancing LLM capabilities on real-world complex planning tasks. Abstract: As large language models (LLMs) are applied to increasingly longer and more complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs' ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.

[54] Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Xiang Li,Ning Yan,Masood Mortazavi

Main category: cs.CL

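TL;DR: This paper proposes GiG, a framework that uses a Graph-in-Graph structure and a bounded lookahead module to improve the long-horizon planning of embodied agents, significantly outperforming existing methods on several benchmarks.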

Details

Motivation: Long-horizon planning with large language models in embodied agents still suffers from poor strategy coherence and violations of environmental constraints, calling for more effective memory and planning mechanisms. Method: The Graph-in-Graph (GiG) planning framework encodes environment states with a graph neural network and stores action-connected execution trace graphs in an experience memory bank; clusters graph embeddings to retrieve structure-aware priors; and adds a bounded lookahead module based on symbolic transition logic for grounded action projection. Result: On three embodied planning benchmarks (Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld), Pass@1 improves by up to 22%, 37%, and 15%, respectively, at comparable or lower computational cost. Conclusion: Through structured memory and symbolically guided lookahead, GiG improves the robustness and accuracy of long-horizon planning for embodied agents in dynamic environments. Abstract: While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intent into actionable sub-goals while strictly adhering to the logic of a dynamic, observed environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitations or hallucinate transitions that violate constraints. We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within an experience memory bank. By clustering these graph embeddings, the framework enables retrieval of structure-aware priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a novel bounded lookahead module that leverages symbolic transition logic to enhance the agents' planning capabilities through grounded action projection. We evaluate our framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld with comparable or lower computational cost.

[55] Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text

Hongyi Zhou,Jin Zhu,Erhan Xu,Kai Ye,Ying Yang,Chengchun Shi

Main category: cs.CL

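TL;DR: This paper proposes a new rewrite-based algorithm for detecting text generated by large language models (LLMs). It explains rewrite-based detection from a geometric perspective and introduces a mechanism that adaptively learns the distance between the original and rewritten text. The authors show theoretically that this is more effective than a fixed distance, and experiments across more than 100 settings show it significantly outperforms baselines (relative improvements of 57.8% to 80.6%).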

Details

Motivation: The highly human-like text that LLMs produce raises risks of misinformation and threats to academic integrity, so reliable detection algorithms are urgently needed. Method: The paper offers a geometric view of rewrite-based detection methods and designs a detection algorithm that adaptively learns the distance between the original and rewritten text. Result: Across more than 100 experimental settings, the method outperforms baselines in the majority of scenarios, achieving relative improvements of 57.8% to 80.6% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini). Conclusion: Adaptive distance learning is more effective in theory and more robust in practice, offering a new direction and a practical tool for detecting LLM-generated content. Abstract: Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, creating an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 57.8% to 80.6% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini).

[56] SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching

Hong Chen,Xiang Liu,Bo Wang,Yuxuan Fan,Yuanlin Chu,Zongluo Li,Xiaowen Chu,Xuming Hu

Main category: cs.CL

TL;DR: SONIC is a learning-based framework that compresses historical KV cache into compact Nexus tokens for multi-turn LLMs, enabling adaptive compression without retraining and significantly improving both dialogue coherence and inference speed.

Details

Motivation: The linear growth of KV cache hinders efficient multi-turn LLM deployment; existing compression methods use heuristic eviction and ignore dialogue structure, risking loss of critical context. Method: SONIC introduces semantically rich 'Nexus' tokens to compress historical segments, combined with dynamic budget training for flexible memory adaptation without retraining. Result: At 80% and 50% compression ratios, SONIC outperforms H2O and StreamingLLM across four multi-turn benchmarks; on MTBench101, it improves average score by 35.55% and speeds up inference by 50.1% vs. full-context generation. Conclusion: SONIC effectively balances memory efficiency and dialogue coherence in multi-turn LLMs through learned, structured KV compression and dynamic budgeting. Abstract: The linear growth of Key-Value (KV) cache remains a bottleneck for multi-turn LLM deployment. Existing KV cache compression methods often fail to account for the structural properties of multi-turn dialogues, relying on heuristic eviction that risks losing critical context. We propose SONIC, a learning-based framework that compresses historical segments into compact and semantically rich Nexus tokens. By integrating dynamic budget training, SONIC allows flexible adaptation to varying memory constraints without retraining. Experiments show that at compression ratios of 80% and 50%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. Specifically, on the widely used MTBench101 benchmark, SONIC achieves an average score improvement of 35.55% over state-of-the-art baselines, validating its effectiveness in sustaining coherent multi-turn dialogues. Furthermore, SONIC enhances deployment efficiency, accelerating the overall inference process by 50.1% compared to full-context generation.

[57] From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes

Fariba Afrin Irany

Main category: cs.CL

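TL;DR: This paper proposes a GPT-based architecture for clinical text classification that uses selective fine-tuning (training only the final Transformer block, the final layer normalization, and a lightweight classification head), which substantially reduces trainable parameters and compute cost while preserving performance. It is evaluated on radiology reports from MIMIC-IV-Note, where it proves effective and robust on tasks including multi-label and uncertainty-aware classification.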

Details

Motivation: Unstructured text in clinical electronic health records (EHRs) is increasingly abundant, but modeling long, domain-specific clinical text remains challenging because of scarce labeled data, severe class imbalance, and the high computational cost of adapting large models. Method: Starting from the decoder-only GPT-2 architecture, the approach applies selective fine-tuning: most backbone parameters are frozen, and only the final Transformer block, the final LayerNorm, and a lightweight classification head are trained. Result: On MIMIC-IV-Note radiology reports, the method performs stably and well on multi-label classification, uncertainty-aware binary classification, and disease outcome prediction, with a particular advantage in settings dominated by non-mention and negated findings; the number of trainable parameters drops substantially and convergence is more stable. Conclusion: Selective fine-tuning of pretrained generative language models is an efficient, effective, and scalable route to clinical text classification that balances performance with computational efficiency and suits real-world EHR data. Abstract: The increasing availability of unstructured clinical narratives in electronic health records (EHRs) has created new opportunities for automated disease characterization, cohort identification, and clinical decision support. However, modeling long, domain-specific clinical text remains challenging due to limited labeled data, severe class imbalance, and the high computational cost of adapting large pretrained language models. This study presents a GPT-based architecture for clinical text classification that adapts a pretrained decoder-only Transformer using a selective fine-tuning strategy. Rather than updating all model parameters, the majority of the GPT-2 backbone is frozen, and training is restricted to the final Transformer block, the final layer normalization, and a lightweight classification head. This approach substantially reduces the number of trainable parameters while preserving the representational capacity required to model complex clinical language. The proposed method is evaluated on radiology reports from the MIMIC-IV-Note dataset using uncertainty-aware CheXpert-style labels derived directly from report text. Experiments cover multiple problem formulations, including multi-label classification of radiographic findings, binary per-label classification under different uncertainty assumptions, and aggregate disease outcome prediction. Across varying dataset sizes, the model exhibits stable convergence behavior and strong classification performance, particularly in settings dominated by non-mention and negated findings.
Overall, the results indicate that selective fine-tuning of pretrained generative language models provides an efficient and effective pathway for clinical text classification, enabling scalable adaptation to real-world EHR data while significantly reducing computational complexity.
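
The selective fine-tuning rule described in this abstract (freeze most of the backbone, train only the final Transformer block, the final layer normalization, and the classification head) can be sketched as a parameter-name filter. This is a minimal Python illustration, not the author's code. It assumes Hugging Face style GPT-2 parameter names (`transformer.h.<i>.` for blocks, `transformer.ln_f.` for the final LayerNorm, `score.` for a sequence-classification head); the abstract does not say which implementation was used, and the helper `is_trainable` is hypothetical.

```python
def is_trainable(param_name, num_blocks):
    """Return True if a parameter should be updated under the selective
    fine-tuning rule. Assumes Hugging Face style GPT-2 parameter names."""
    last_block_prefix = f"transformer.h.{num_blocks - 1}."
    trainable_prefixes = (last_block_prefix, "transformer.ln_f.", "score.")
    return param_name.startswith(trainable_prefixes)


# A few representative parameter names for a 12-block GPT-2 classifier.
example_names = [
    "transformer.wte.weight",              # token embeddings (frozen)
    "transformer.h.0.attn.c_attn.weight",  # first block (frozen)
    "transformer.h.11.mlp.c_fc.weight",    # final block (trained)
    "transformer.ln_f.weight",             # final LayerNorm (trained)
    "score.weight",                        # classification head (trained)
]

trainable = [name for name in example_names if is_trainable(name, num_blocks=12)]
print(trainable)
```

In a PyTorch training script, one would typically loop over `model.named_parameters()` and set `param.requires_grad = is_trainable(name, num_blocks)`, so that the optimizer only updates the selected subset.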

[58] OVD: On-policy Verbal Distillation

Jing Xiong,Hui Shen,Shansan Gong,Yuxin Cheng,Jianghan Shen,Chaofan Tao,Haochen Tan,Haoli Bai,Lifeng Shang,Ngai Wong

Main category: cs.CL

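TL;DR: This paper proposes On-policy Verbal Distillation (OVD), a new knowledge distillation framework that replaces conventional token-level probability matching with trajectory matching based on discrete verbal scores (0-9) from a teacher model. This avoids token alignment, gives the student model more freedom to explore, lowers memory consumption, and significantly outperforms existing methods on Web question answering and mathematical reasoning tasks.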

Details

Motivation: Existing token-level on-policy distillation methods are constrained by token alignment, which limits the student model's exploration, makes it hard to use environment feedback, and causes severe memory bottlenecks in reinforcement learning. Method: The OVD framework performs trajectory-level matching with discrete verbal scores (0-9) in place of token-level probability matching, enabling on-policy distillation without token alignment and improving memory efficiency and exploration freedom. Result: On Web question answering and mathematical reasoning tasks, OVD substantially outperforms existing methods, with gains of up to +12.9% in average EM on Web Q&A and up to +25.7% on math benchmarks (when trained with only one random sample), along with better training efficiency. Conclusion: OVD is an efficient, flexible, and memory-light approach to knowledge distillation, suited to reasoning tasks that require strong exploration and environment interaction, and it offers a better path for transferring large-model capabilities to small models. Abstract: Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model's exploration ability, prevents effective use of interactive environment feedback, and suffers from severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0-9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to +12.9% absolute improvement in average EM on Web Q&A tasks and an up to +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io

[59] Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding

Yifan Zhu,Huiqiang Rong,Haoran Luo

Main category: cs.CL

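TL;DR: This paper proposes Token-Guard, a token-level hallucination control method based on self-checking decoding. It combines internal verification at each reasoning step, explicit hallucination risk scoring in a latent space, and iterative pruning and regeneration to substantially reduce hallucinations in large language models.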

Details

Motivation: Existing approaches to mitigating LLM hallucinations (such as RAG and RLHF) are resource-intensive, while decoding-based methods lack an explicit hallucination control mechanism. Method: Token-Guard uses self-checking decoding, performing internal verification at each token generation step; scores candidate fragments for hallucination risk explicitly in a latent space; and dynamically corrects errors through iterative pruning and regeneration. Result: Experiments on HALU datasets show that Token-Guard substantially reduces hallucinations and improves generation accuracy, while being scalable and modular. Conclusion: Token-Guard provides a lightweight, efficient, plug-in approach to hallucination control that helps make LLM outputs more reliable. Abstract: Large Language Models (LLMs) often hallucinate, generating content inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) can mitigate hallucinations but require resource-intensive retrieval or large-scale fine-tuning. Decoding-based methods are lighter yet lack explicit hallucination control. To address this, we present Token-Guard, a token-level hallucination control method based on self-checking decoding. Token-Guard performs internal verification at each reasoning step to detect hallucinated tokens before they propagate. Candidate fragments are further evaluated in a latent space with explicit hallucination risk scoring, while iterative pruning and regeneration dynamically correct detected errors. Experiments on HALU datasets show Token-Guard substantially reduces hallucinations and improves generation accuracy, offering a scalable, modular solution for reliable LLM outputs. Our code is publicly available.

[60] Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Jianhui Chen,Yuzhang Luo,Liangming Pan

Main category: cs.CL

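TL;DR: This paper proposes Mechanistic Data Attribution (MDA), a framework that uses influence functions to trace interpretable units in large language models back to specific training samples. Experiments show that repetitive structural data (such as LaTeX and XML) catalyzes the formation of interpretable heads, and provide evidence of a causal link between induction heads and in-context learning.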

Details

Motivation: Although mechanistic interpretability has identified interpretable circuits in large language models, their causal origins in the training data remain unclear. Method: The Mechanistic Data Attribution (MDA) framework uses influence functions to trace interpretable units back to specific training samples, supported by extensive experiments on the Pythia model family that include targeted removal or augmentation of high-influence samples. Result: Repetitive structural data is confirmed to act as a mechanistic catalyst for the formation of interpretable heads; interventions on induction head formation produce a concurrent change in the model's in-context learning capability; and the proposed data augmentation pipeline accelerates circuit convergence across model scales. Conclusion: MDA provides a scalable tool for understanding the training origins of LLM internal mechanisms and a principled method for deliberately steering a model's developmental trajectory. Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

[61] When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications

Daniel Commey

Main category: cs.CL

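TL;DR: This paper proposes an evaluation-driven workflow for large language model (LLM) applications (Define-Test-Diagnose-Fix) and a tiered Minimum Viable Evaluation Suite (MVES) covering general LLM applications, RAG, and agentic tool use. Reproducible local experiments show that a generic prompt template introduces behavioral trade-offs, which argues for evaluation-based prompt iteration and claim calibration rather than reliance on "universal" prompt templates.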

Details

Motivation: Traditional software testing methods do not carry over to LLM applications, whose outputs are stochastic, high-dimensional, and highly sensitive to prompt and model changes; a systematic, repeatable evaluation engineering approach is needed. Method: The paper proposes the Define-Test-Diagnose-Fix evaluation-driven workflow; builds the tiered MVES evaluation suite (covering general LLM, RAG, and agentic applications); synthesizes three evaluation methods (automated checks, human rubrics, and LLM-as-judge) and analyzes judge failure modes; and runs controlled experiments locally in Ollama with Llama 3 8B and Qwen 2.5 7B. Result: Replacing task-specific prompts with a generic "improved" prompt template lowers key metrics: for Llama 3 on the structured suites, the extraction pass rate falls from 100% to 90% and RAG compliance from 93.3% to 80%, while instruction-following improves, confirming that prompt optimization involves behavioral trade-offs. Conclusion: LLM application development should be evaluation-driven, using tiered, scenario-appropriate evaluation suites (such as MVES) and avoiding blanket adoption of generic prompt templates; it should combine multiple evaluation methods and pay attention to reproducibility and claim calibration. Abstract: Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow (Define, Test, Diagnose, Fix) that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic "improved" prompt template can trade off behaviors: on our small structured suites, extraction pass rate decreased from 100% to 90% and RAG compliance from 93.3% to 80% for Llama 3 when replacing task-specific prompts with generic rules, while instruction-following improved. These findings motivate evaluation-driven prompt iteration and careful claim calibration rather than universal prompt recipes. All test suites, harnesses, and results are included for reproducibility.

[62] Causal Autoregressive Diffusion Language Model

Junhao Ruan,Bei Li,Yongjing Yin,Pengcheng Huang,Xin Chen,Jingang Wang,Xunliang Cai,Tong Xiao,JingBo Zhu

Main category: cs.CL

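TL;DR: This paper proposes Causal Autoregressive Diffusion (CARD), a framework that combines the training efficiency of autoregressive models with the high-throughput inference of diffusion models. It provides dense supervision in a single forward pass under a causal attention mask, improves stability with soft-tailed masking and signal-to-noise-driven reweighting, and supports confidence-based dynamic parallel decoding. Experiments show that it outperforms discrete diffusion baselines, reduces training latency by 3x, and offers both ARM-level data efficiency and the low latency of parallel generation.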

Details

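Motivation: Autoregressive models (ARMs) train efficiently but have low inference throughput, whereas diffusion models offer high inference throughput but train unstably and inefficiently; the paper seeks a paradigm that delivers both training efficiency and parallel inference. Method: The CARD framework reformulates the diffusion process under a strictly causal attention mask to obtain dense token-level supervision in a single forward pass; introduces soft-tailed masking to preserve local context and a context-aware reweighting mechanism based on signal-to-noise principles; and uses KV caching to support confidence-based, dynamic, variable-length parallel decoding. Result: CARD outperforms existing discrete diffusion baselines, reduces training latency by 3x compared with block diffusion methods, reaches ARM-level data efficiency, and gains the low-latency benefits of parallel generation. Conclusion: CARD unifies the strengths of autoregressive and diffusion modeling, offering a robust and efficient paradigm for next-generation efficient large language models. Abstract: In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of ARMs with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by $3\times$ compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.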

[63] Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models

Longxuan Yu,Yu Fu,Shaorong Zhang,Hui Liu,Mukund Varma T,Greg Ver Steeg,Yue Dong

Main category: cs.CL

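TL;DR: This paper argues that masked diffusion language models (MDLMs) can address the "premature commitment" problem that autoregressive (AR) models face when the required output order conflicts with the reasoning order. Through parallel iterative refinement, MDLMs are robust to output order; the paper validates this on several mathematical reasoning benchmarks and identifies where the advantage breaks down.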

Details

Motivation: Autoregressive language models are forced to generate left to right. When a task requires the answer before the explanation, the model must commit to an answer before it has reasoned sufficiently, which severely degrades performance; a modeling paradigm that decouples computation order from output structure is needed. Method: The paper uses masked diffusion language models (MDLMs), which refine all tokens through multiple parallel iterative steps; builds a new benchmark, ReasonOrderQA, to control and evaluate reasoning under different output orders; and analyzes the observation that in MDLMs simpler tokens (such as reasoning steps) stabilize earlier than complex tokens (such as final answers). Result: On GSM8K, Math500, and ReasonOrderQA, when prompts request the answer before the reasoning, AR models show a relative accuracy drop of up to 67%, whereas MDLMs drop by at most 14%; the evidence indicates that MDLMs achieve order robustness by stabilizing reasoning tokens early; the paper also identifies the conditions under which this advantage fails. Conclusion: MDLMs exhibit "order robustness", maintaining stable performance when the output structure runs against the natural reasoning order, through a staged timing of token stabilization during diffusion; however, the advantage depends on specific conditions and has clear limits. Abstract: Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to answers before generating intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable ($\leq$14% relative drop), a property we term "order robustness". Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), enabling reasoning tokens to stabilize before answer commitment. Finally, we identify failure conditions where this advantage weakens, outlining the limits required for order robustness.
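
The "relative drop" figures quoted in this abstract compare accuracy under answer-first prompting against accuracy under standard chain-of-thought ordering. A minimal Python sketch of that arithmetic follows. The accuracy values are invented purely to show the calculation and are not results from the paper, and the function name `relative_drop` is hypothetical.

```python
def relative_drop(baseline_accuracy, reordered_accuracy):
    """Fraction of baseline accuracy lost when the output order changes."""
    if baseline_accuracy <= 0:
        raise ValueError("baseline_accuracy must be positive")
    return (baseline_accuracy - reordered_accuracy) / baseline_accuracy


# Hypothetical accuracies: reasoning-first (baseline) vs. answer-first prompts.
ar_drop = relative_drop(0.60, 0.30)    # roughly 0.5, i.e. a 50% relative drop
mdlm_drop = relative_drop(0.60, 0.54)  # roughly 0.1, i.e. a 10% relative drop

print(f"AR: {ar_drop:.0%}, MDLM: {mdlm_drop:.0%}")
```

A relative drop is normalized by the baseline, so a fall from 60% to 30% accuracy is a 50% relative drop even though the absolute difference is 30 percentage points.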

[64] A Separable Architecture for Continuous Token Representation in Language Models

Reza T. Batley,Sourav Saha

Main category: cs.CL

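TL;DR: This paper proposes Leviathan, an architecture that replaces the conventional discrete lookup table with a continuous embedding generator, markedly improving parameter efficiency in small language models and outperforming LLaMA-style models at the same parameter count.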

Details

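Motivation: In small language models (SLMs), embedding matrices account for most of the parameters, an allocation that is both suboptimal and counterintuitive, yet existing scaling laws treat parameters as interchangeable and overlook this structural issue. Method: The Leviathan architecture replaces the standard discrete embedding lookup table with a continuous embedding generator; it is compared against baselines on the Pile dataset at equal parameter counts, and effective parameter capacity is assessed with an empirical power-law fit. Result: Leviathan consistently outperforms the LLaMA-style baseline under isoparametric settings, behaving like a dense model with 1.47 to 2.11 times as many parameters. Conclusion: Embedding-layer design is critical to the scaling efficiency of small language models; a continuous embedding generator uses parameters more efficiently and challenges the usual assumption that parameters are simply interchangeable. Abstract: Transformer scaling law analyses typically treat parameters as interchangeable, an abstraction that accurately predicts loss-compute relationships. Yet, in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture with a continuous embedding generator to replace the discrete lookup tables of canonical models. Evaluating on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard, LLaMA-style architecture. By means of an empirical power-law fit, Leviathan exhibits a markedly superior effective parameter capacity. Across the regime studied, Leviathan behaves as a dense model with $1.47\times$ to $2.11\times$ more parameters.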

[65] On the Paradoxical Interference between Instruction-Following and Task Solving

Yunjia Qi,Hao Peng,Xintong Shi,Amy Xin,Xiaozhi Wang,Bin Xu,Lei Hou,Juanzi Li

Main category: cs.CL

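TL;DR: This paper reveals a counterintuitive phenomenon in which instruction following can unexpectedly weaken the task-solving ability of large language models (LLMs), and proposes the SUSTAINSCORE metric to quantify this interference. Experiments show that adding even self-evident constraints significantly lowers performance on mathematics, multi-hop QA, and code generation, and that the interference is associated with excessive attention to the constraints. The study also takes a first look at how different post-training paradigms affect the interference.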

Details

Motivation: Instruction-following methods aim to improve LLM alignment with human intent, but the authors find that they can instead harm the model's inherent task-solving ability, a potential negative effect that has not been systematically exposed or quantified. Method: The SUSTAINSCORE metric measures the drop in task performance after inserting into the original instruction a self-evident constraint that is extracted from a successful output and already satisfied by it; experiments span multiple tasks and models and are combined with attention analysis and a comparison of post-training paradigms. Result: On mathematics, multi-hop QA, and code generation tasks, adding self-evident constraints causes substantial performance drops in mainstream LLMs (including Claude-Sonnet-4.5); failed cases allocate markedly more attention to the constraints; and different alignment strategies show different degrees of interference. Conclusion: Instruction following is not always beneficial and can introduce inherent interference with task solving; SUSTAINSCORE offers a new tool for evaluating and understanding alignment side effects, suggesting that future alignment methods need to balance instruction compliance with task robustness. Abstract: Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed. However, we reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with LLMs' task-solving capability. We propose a metric, SUSTAINSCORE, to quantify the interference of instruction following with task solving. It measures task performance drop after inserting into the instruction a self-evident constraint, which is naturally met by the original successful model output and extracted from it. Experiments on current LLMs in mathematics, multi-hop QA, and code generation show that adding the self-evident constraints leads to substantial performance drops, even for advanced models such as Claude-Sonnet-4.5. We validate the generality of the interference across constraint types and scales. Furthermore, we identify common failure patterns, and by investigating the mechanisms of interference, we observe that failed cases allocate significantly more attention to constraints compared to successful ones. Finally, we use SUSTAINSCORE to conduct an initial investigation into how distinct post-training paradigms affect the interference, presenting empirical observations on current alignment strategies. We will release our code and data to facilitate further research.

[66] MasalBench: A Benchmark for Contextual and Cross-Cultural Understanding of Persian Proverbs in LLMs

Ghazal Kalhor,Behnam Bahrak

Main category: cs.CL

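TL;DR: This paper introduces MasalBench, a benchmark for evaluating multilingual large language models' contextual and cross-cultural understanding of Persian proverbs. Existing models recognize Persian proverbs well but fall clearly short when mapping them to equivalent English proverbs, revealing limitations in cultural knowledge and analogical reasoning.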

Details

Motivation: Prior work has mostly examined LLM understanding of figurative language in high-resource languages, while cross-cultural proverb understanding in low-resource languages such as Persian has not been systematically evaluated. Method: The authors build MasalBench, a comprehensive benchmark dedicated to evaluating LLMs' contextual understanding of Persian proverbs and their cross-cultural mapping ability, and evaluate eight state-of-the-art LLMs on it. Result: Models exceed 0.90 accuracy on identifying Persian proverbs in context, but on matching equivalent English proverbs the best model reaches only 0.79 accuracy. Conclusion: Current multilingual LLMs still have significant weaknesses in modeling cultural knowledge and in cross-lingual analogical reasoning for low-resource languages; MasalBench provides a reusable framework for assessing cross-cultural understanding in other low-resource languages. Abstract: In recent years, multilingual Large Language Models (LLMs) have become an inseparable part of daily life, making it crucial for them to master the rules of conversational language in order to communicate effectively with users. While previous work has evaluated LLMs' understanding of figurative language in high-resource languages, their performance in low-resource languages remains underexplored. In this paper, we introduce MasalBench, a comprehensive benchmark for assessing LLMs' contextual and cross-cultural understanding of Persian proverbs, which are a key component of conversation in this low-resource language. We evaluate eight state-of-the-art LLMs on MasalBench and find that they perform well in identifying Persian proverbs in context, achieving accuracies above 0.90. However, their performance drops considerably when tasked with identifying equivalent English proverbs, with the best model achieving 0.79 accuracy. Our findings highlight the limitations of current LLMs in cultural knowledge and analogical reasoning, and they provide a framework for assessing cross-cultural understanding in other low-resource languages. MasalBench is available at https://github.com/kalhorghazal/MasalBench.

[67] $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA

Yaxin Du,Junru Song,Yifan Zhou,Cheng Wang,Jiahao Gu,Zimeng Chen,Menglan Chen,Wen Yao,Yang Yang,Ying Wen,Siheng Chen

Main category: cs.CL

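TL;DR: This paper proposes $G^2$-Reader, a dual-graph system that uses a Content Graph to preserve document-native structure and cross-modal semantics and a Planning Graph to track sub-questions and intermediate findings, improving retrieval-augmented generation for multimodal question answering over long documents.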

Details

Motivation: Existing retrieval-augmented generation methods have two main problems in multimodal long-document QA: flat chunking breaks document structure and cross-modal alignment, and iterative retrieval, lacking a global search state, tends to loop on partial evidence or drift into irrelevant sections. Method: The $G^2$-Reader dual-graph system consists of (1) a Content Graph that models document-native structure and the cross-modal semantic relations among text, tables, and figures, and (2) a Planning Graph, a directed acyclic graph that dynamically generates and tracks sub-questions and intermediate evidence to support goal-directed stepwise navigation and evidence completion. Result: On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%). Conclusion: Through structure-aware content modeling and goal-driven planning, $G^2$-Reader mitigates semantic fragmentation and retrieval drift in multimodal long-document QA and offers an extensible graph-based approach to retrieval-augmented generation. Abstract: Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).

[68] VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Yibo Wang,Yongcheng Jing,Shunyu Liu,Hao Guan,Rong-cheng Tu,Chengyu Wang,Jun Huang,Dacheng Tao

Main category: cs.CL

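TL;DR: This paper proposes VTC-R1, a paradigm that renders intermediate reasoning into images and feeds them to vision-language models as "optical memory", achieving 3.4x token compression and a 2.7x inference speedup while maintaining performance.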

Details

Motivation: Long-context reasoning strengthens LLMs but incurs heavy computational cost; existing efficient methods rely on additional training or external models, sacrificing scalability and fine-grained information. Method: The authors propose the VTC-R1 paradigm, which renders textual reasoning segments into compact images that are iteratively fed to vision-language models (Glyph and Qwen3-VL) as "optical memory"; a training set is built from OpenR1-Math-220K and the models are fine-tuned on it. Result: VTC-R1 consistently outperforms standard long-context reasoning on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D, with a 2.7x end-to-end latency speedup and 3.4x token compression. Conclusion: VTC-R1 is a scalable, efficient reasoning paradigm that retains information and offers a practical path for reasoning-intensive applications. Abstract: Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to the computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as "optical memory." We construct a training dataset based on OpenR1-Math-220K, achieving 3.4x token compression, and fine-tune two representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23 and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.

[69] ECO: Quantized Training without Full-Precision Master Weights

Mahdi Nikdan,Amir Zandieh,Dan Alistarh,Vahab Mirrokni

Main category: cs.CL

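TL;DR: This paper proposes the Error-Compensating Optimizer (ECO), which uses error feedback to update quantized parameters directly without high-precision master weights, substantially reducing memory overhead, especially for Sparse Mixture of Experts (SMoE) models; convergence and accuracy retention are validated on models of several sizes.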

Details

Motivation: Existing quantized LLM training methods still need a high-precision master-weight buffer, which adds significant memory overhead and becomes a bottleneck for SMoE models in particular. Method: ECO applies gradient updates directly to quantized parameters, quantizes the weights after each update, and injects the quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. Result: The authors prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum; experiments show near-lossless accuracy in FP8 pretraining (30M to 2.1B parameters) and INT4 fine-tuning (16B), shifting the memory vs. loss Pareto frontier. Conclusion: ECO removes the dependence on master weights in quantized training, greatly reducing memory use while preserving convergence and accuracy, which is especially useful for efficient training of large sparse models. Abstract: Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as master weights. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.
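
To see why naive master-weight removal fails and what error feedback buys, the toy sketch below minimizes a one-dimensional quadratic with weights stored on a coarse grid. It is not ECO's rule: the abstract does not give the momentum-injection formula, so this sketch uses classic error feedback with an explicit residual buffer, which is exactly the extra memory ECO is said to avoid. The function names, grid step, and learning rate are illustrative assumptions.

```python
def quantize(x, step):
    """Round x to the nearest multiple of step (a toy uniform quantizer)."""
    return round(x / step) * step

def sgd_quantized(target, steps, lr, step, error_feedback):
    """Minimize 0.5 * (w - target)**2 with w stored only on the quantized grid."""
    w = 0.0
    residual = 0.0  # explicit error buffer (assumption; ECO folds this into momentum)
    for _ in range(steps):
        grad = w - target
        proposed = w - lr * grad
        if error_feedback:
            proposed += residual  # re-apply what earlier rounding threw away
        w_new = quantize(proposed, step)
        if error_feedback:
            residual = proposed - w_new
        w = w_new
    return w

naive = sgd_quantized(target=1.0, steps=1000, lr=0.01, step=0.25, error_feedback=False)
with_feedback = sgd_quantized(target=1.0, steps=1000, lr=0.01, step=0.25, error_feedback=True)
```

Without feedback every update is smaller than half a grid step and is rounded away, so the weight never leaves its starting point; with feedback the lost fractions accumulate until they cross a grid boundary.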

[70] A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine

Anran Li,Yuanyuan Chen,Wenjun Long,Yu Yin,Yan Hu,Hyunjae Kim,Weipeng Zhou,Yujia Zhou,Hongyi Peng,Yang Ren,Xuguang Ai,Zhenyue Qin,Ming Hu,Xiaoxiao Li,Han Yu,Yih-Chung Tham,Lucila Ohno-Machado,Hua Xu,Qingyu Chen

Main category: cs.CL

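TL;DR: This paper proposes Fed-MedLoRA and its enhanced variant Fed-MedLoRA+, a model-agnostic, parameter-efficient federated learning framework for adapting large language models (LLMs) to medicine; it addresses cross-institution data heterogeneity and high communication overhead and is validated on clinical information extraction.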

Details

Motivation: Most medical LLMs are trained on single-center data, which limits generalizability and safety; federated learning is promising, but conventional methods face two fundamental bottlenecks with LLMs: high communication overhead and difficulty handling heterogeneous clinical data. Method: Fed-MedLoRA transmits only low-rank adapter (LoRA) parameters to reduce communication and computation; Fed-MedLoRA+ adds an adaptive, data-aware aggregation strategy to improve convergence under cross-site heterogeneity; both are applied to clinical information extraction. Result: Accuracy is assessed on five patient cohorts against BERT, LLaMA-3, DeepSeek-R1, and GPT-4o baselines, covering in-domain testing, external validation, and low-resource new-site adaptation (real clinical notes from the Yale New Haven Health System); the summary reports that Fed-MedLoRA(+) outperforms these baselines in all three settings. Conclusion: The Fed-MedLoRA framework eases the communication and heterogeneity problems of federated LLM training in medicine and offers a practical route to safe, scalable cross-institution model development. Abstract: Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. To enable their use in clinical settings, LLMs are typically further adapted through continued pretraining or post-training using clinical data. However, most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems. Federated learning (FL) is a promising solution for enabling collaborative model development across healthcare institutions. Yet applying FL to LLMs in medicine remains fundamentally limited. First, conventional FL requires transmitting the full model during each communication round, which becomes impractical for multi-billion-parameter LLMs given the limited computational resources. Second, many FL algorithms implicitly assume data homogeneity, whereas real-world clinical data are highly heterogeneous across patients, diseases, and institutional practices. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications. Fed-MedLoRA transmits only low-rank adapter parameters, reducing communication and computation overhead, while Fed-MedLoRA+ further incorporates adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity. We apply the framework to clinical information extraction (IE), which transforms patient narratives into structured medical entities and relations. Accuracy was assessed across five patient cohorts through comparisons with BERT models and with LLaMA-3, DeepSeek-R1, and GPT-4o models. Evaluation settings included (1) in-domain training and testing, (2) external validation on independent cohorts, and (3) a low-resource new-site adaptation scenario using real-world clinical notes from the Yale New Haven Health System.
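
The communication argument can be made concrete with a small sketch. It assumes a single 4096 x 4096 weight matrix, LoRA rank 8, and plain sample-size-weighted averaging (FedAvg) of flattened adapter vectors; the adaptive, data-aware aggregation of Fed-MedLoRA+ is not specified in the abstract and is not reproduced here. All names and sizes are illustrative.

```python
def lora_param_count(d_out, d_in, rank):
    """Parameters in a rank-r adapter: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)

def fedavg_adapters(site_adapters, site_sizes):
    """Average flattened adapter vectors, weighting each site by its sample count."""
    total = sum(site_sizes)
    n_params = len(site_adapters[0])
    return [
        sum(adapter[i] * n for adapter, n in zip(site_adapters, site_sizes)) / total
        for i in range(n_params)
    ]

full_matrix = 4096 * 4096
adapter_only = lora_param_count(4096, 4096, rank=8)
reduction = full_matrix // adapter_only  # how many times fewer values are sent

# Two hypothetical sites with 100 and 300 training notes.
merged = fedavg_adapters([[1.0, 2.0], [3.0, 6.0]], [100, 300])
```

Under these assumptions each round sends 256 times fewer values for that matrix, and the larger site pulls the merged adapter toward its own update.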

[71] Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Xin Chen,Feng Jiang,Yiqian Zhang,Hardy Chen,Shuo Yan,Wenya Xie,Min Yang,Shujian Huang

Main category: cs.CL

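TL;DR: This paper proposes a new reasoning paradigm, Proactive Interactive Reasoning (PIR), in which an LLM proactively asks the user questions to resolve uncertainty about premises or intent instead of reasoning blindly on its own; it is implemented with uncertainty-aware supervised fine-tuning and user-simulator-based policy optimization, and it improves accuracy and efficiency on tasks such as mathematical reasoning and code generation while reducing computation.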

Details

Motivation: Reasoning LLMs based on Chain-of-Thought (CoT) suffer from "blind self-thinking": they reason extensively even when key information is missing or ambiguous, which is inefficient and error-prone; a paradigm is needed that interacts with the user to resolve premise- and intent-level uncertainty. Method: The Proactive Interactive Reasoning (PIR) paradigm has two core components: (1) uncertainty-aware supervised fine-tuning that gives the model interactive reasoning ability; (2) a user-simulator-based policy optimization framework with a composite reward that aligns behavior with user intent. Result: On mathematical reasoning, code generation, and document editing, PIR improves over strong baselines by up to 32.70% accuracy, 22.90% pass rate, and 41.36 BLEU, while cutting reasoning computation and unnecessary interaction turns by nearly half; it also generalizes robustly to factual knowledge, question answering, and missing-premise scenarios. Conclusion: PIR turns LLMs from passive solvers into proactive inquirers, using interaction with the user to reduce premise- and intent-level uncertainty, and offers a more efficient and reliable reasoning paradigm. Abstract: Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at https://github.com/SUAT-AIRI/Proactive-Interactive-R1.

[72] FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Ajay Patel,Colin Raffel,Chris Callison-Burch

Main category: cs.CL

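TL;DR: This paper proposes FineInstructions, a procedure that turns internet-scale pre-training documents into billions of synthetic instruction-answer pairs, so that an LLM can be pre-trained from scratch with the instruction-tuning objective alone, improving performance on free-form generation tasks.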

Details

Motivation: Because supervised data is limited, LLMs are usually pre-trained with large-scale self-supervision and then fine-tuned on a small amount of instruction data; this work aims to overcome the scarcity of supervised data by converting the knowledge in pre-training corpora into high-quality instruction data at scale. Method: About 18 million instruction templates derived from real user queries are automatically matched to and instantiated with human-written documents from unstructured pre-training corpora, producing the large synthetic instruction-answer dataset FineInstructions, which is then used for pre-training from scratch with the instruction objective only. Result: In token-for-token controlled experiments, models pre-trained on FineInstructions outperform standard pre-training and other synthetic pre-training methods on benchmarks of free-form response quality. Conclusion: Pre-training an LLM from scratch on large-scale synthetic instruction data alone is feasible and effective; the approach is closer to how models are actually used (responding to user prompts) and reduces reliance on human-annotated data. Abstract: Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of "instruction-tuning" data composed of supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at https://huggingface.co/fineinstructions.

[73] DynaWeb: Model-Based Reinforcement Learning of Web Agents

Hang Ding,Peidong Liu,Junqiao Wang,Ziwei Ji,Meng Cao,Rongzhao Zhang,Lynn Ai,Eric Yang,Tianyu Shi,Lei Yu

Main category: cs.CL

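TL;DR: This paper proposes DynaWeb, a model-based reinforcement learning framework that builds a web world model to simulate the web environment, letting web agents perform large numbers of policy rollouts in a synthetic environment and improving the efficiency and stability of online reinforcement learning.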

Details

Motivation: Training autonomous web agents by interacting with the live internet is inefficient, costly, and risky, so a safer and more efficient training method is needed. Method: The DynaWeb framework uses a web world model that predicts naturalistic web page representations, and mixes free policy rollouts with real expert trajectories during training to improve stability and sample efficiency. Result: On the WebArena and WebVoyager benchmarks, DynaWeb significantly improves the performance of current open-source web agent models. Conclusion: The work shows that training web agents through "imagination" is viable and offers a scalable, efficient route for online agentic reinforcement learning. Abstract: The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.

[74] Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

Yingfa Chen,Zhen Leng Thai,Zihan Zhou,Zhu Zhang,Xingyu Shen,Shuo Wang,Chaojun Xiao,Xu Han,Zhiyuan Liu

Main category: cs.CL

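TL;DR: This paper proposes the HALO distillation pipeline and the HypeNet hybrid architecture, which efficiently convert pre-trained Transformers (such as Qwen3) into RNN-attention hybrid models with strong long-context performance and inference efficiency, using only 2.3B tokens (less than 0.01% of the original pre-training data).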

Details

Motivation: Hybrid Transformer models are hard to adopt because pre-training them from scratch is prohibitively expensive; existing parameter-transfer and distillation methods need large amounts of data (more than 10B tokens) and yield poor long-context performance, failing to realize the inference advantage of hybrid models on long inputs. Method: The authors propose HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling pre-trained Transformers into RNN-attention hybrid models, and design HypeNet, a new architecture that adds the HyPE position encoding and several architectural modifications to improve length generalization. Result: The Qwen3 series is converted into HypeNet with performance comparable to the original Transformers, better long-context modeling and inference efficiency, and only 2.3B training tokens. Conclusion: HALO plus HypeNet provides a low-data way to convert Transformers into hybrid models, easing the adoption of hybrid architectures and showing clear practical value in long-context settings. Abstract: Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.

cs.CV [Back]

[75] MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading

Matteo Rossi

Main category: cs.CV

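TL;DR: This paper proposes a Multi-Attention Lipreading Network (MA-LipNet) that uses three attention mechanisms (channel, joint spatial-temporal, and separate spatial-temporal) to make visual features more discriminative and generalizable for lipreading, significantly reducing CER and WER on the CMLR and GRID datasets.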

Details

Motivation: Because lip movements are subtle, existing lipreading methods suffer from weakly discriminative features and poor generalization, so visual features need to be purified along the temporal, spatial, and channel dimensions. Method: MA-LipNet applies, in sequence, a Channel Attention (CA) module, a Joint Spatial-Temporal Attention (JSTA) module, and a Separate Spatial-Temporal Attention (SSTA) module, which perform channel recalibration, coarse-grained spatio-temporal filtering, and fine-grained spatio-temporal modeling, respectively. Result: On the CMLR and GRID datasets, MA-LipNet significantly reduces the Character Error Rate (CER) and Word Error Rate (WER) and outperforms several state-of-the-art methods. Conclusion: Multi-dimensional feature refinement is important for robust visual speech recognition, and MA-LipNet provides an effective attention design for lipreading. Abstract: Lipreading, the technology of decoding spoken content from silent videos of lip movements, holds significant application value in fields such as public security. However, due to the subtle nature of articulatory gestures, existing lipreading methods often suffer from limited feature discriminability and poor generalization capabilities. To address these challenges, this paper delves into the purification of visual features from temporal, spatial, and channel dimensions. We propose a novel method named Multi-Attention Lipreading Network (MA-LipNet). The core of MA-LipNet lies in its sequential application of three dedicated attention modules. Firstly, a Channel Attention (CA) module is employed to adaptively recalibrate channel-wise features, thereby mitigating interference from less informative channels. Subsequently, two spatio-temporal attention modules with distinct granularities, Joint Spatial-Temporal Attention (JSTA) and Separate Spatial-Temporal Attention (SSTA), are leveraged to suppress the influence of irrelevant pixels and video frames. The JSTA module performs a coarse-grained filtering by computing a unified weight map across the spatio-temporal dimensions, while the SSTA module conducts a more fine-grained refinement by separately modeling temporal and spatial attentions. Extensive experiments conducted on the CMLR and GRID datasets demonstrate that MA-LipNet significantly reduces the Character Error Rate (CER) and Word Error Rate (WER), validating its effectiveness and superiority over several state-of-the-art methods. Our work highlights the importance of multi-dimensional feature refinement for robust visual speech recognition.
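
Both reported metrics are edit-distance ratios, so a short reference implementation may help when reading the results. This is the standard Levenshtein-based definition, not code from the paper, and the example sentence is only an illustrative command-style string.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, free if tokens match
            ))
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: character-level edits per reference character."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word Error Rate: word-level edits per reference word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

reference = "bin blue at f two now"
hypothesis = "bin blue at f too now"
```

One wrong word out of six gives a WER of 1/6, while the same error costs only one character at the character level, which is why CER is usually the lower number.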

[76] Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs

Haochen Zhang,Animesh Sinha,Felix Juefei-Xu,Haoyu Ma,Kunpeng Li,Zhipeng Fan,Meng Dong,Xiaoliang Dai,Tingbo Hou,Peizhao Zhang,Zecheng He

Main category: cs.CV

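TL;DR: This paper proposes a framework for non-Markov multi-round conversational image generation: it builds history-aware data with rollback-style editing and name-based personalization, designs history-conditioned training and inference with token-level caching, and adds a reconstruction-based DiT detokenizer and a multi-stage fine-tuning strategy, substantially improving multi-round consistency and instruction compliance.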

Details

Motivation: Most existing conversational image generation methods make a Markov assumption (depending only on the latest image) and cannot handle non-Markov interactions such as references to earlier states, undo operations, or cross-round entity references. Method: (i) construct non-Markov multi-round data, including rollback-style editing and name-based cross-round personalization; (ii) propose a history-conditioned training and inference framework with token-level caching to prevent identity drift; (iii) introduce a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum to improve high-fidelity reconstruction and editable personalization. Result: Training explicitly for non-Markov interactions substantially improves multi-round consistency and instruction compliance while maintaining strong single-round editing and personalization. Conclusion: Non-Markov modeling is key to robust, coherent multi-round conversational image generation, and the proposed method offers a systematic and practical approach. Abstract: Conversational image generation requires a model to follow user instructions across multiple rounds of interaction, grounded in interleaved text and images that accumulate as chat history. While recent multimodal large language models (MLLMs) can generate and edit images, most existing multi-turn benchmarks and training recipes are effectively Markov: the next output depends primarily on the most recent image, enabling shortcut solutions that ignore long-range history. In this work we formalize and target the more challenging non-Markov setting, where a user may refer back to earlier states, undo changes, or reference entities introduced several rounds ago. We present (i) non-Markov multi-round data construction strategies, including rollback-style editing that forces retrieval of earlier visual states and name-based multi-round personalization that binds names to appearances across rounds; (ii) a history-conditioned training and inference framework with token-level caching to prevent multi-round identity drift; and (iii) enabling improvements for high-fidelity image reconstruction and editable personalization, including a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum. We demonstrate that explicitly training for non-Markov interactions yields substantial improvements in multi-round consistency and instruction compliance, while maintaining strong single-round editing and personalization.

[77] Text controllable PET denoising

Xuehua Ye,Hongxu Yang,Adam J. Schwarz

Main category: cs.CV

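TL;DR: This paper proposes a text-guided PET image denoising method that combines features from a pretrained CLIP model with a U-Net denoiser, improving image quality across multiple count levels within a single model, with gains in both qualitative and quantitative assessments and the potential to shorten acquisition time.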

Details

Motivation: PET images are often degraded by complicated noise that obscures diagnostic information, and existing methods struggle to handle different count levels within a single model. Method: A text-guided denoising method that combines features from a pretrained CLIP model with a U-Net based denoising model. Result: The method yields significant qualitative and quantitative improvements across a wide range of count levels, and the model is flexible. Conclusion: The approach may support more complicated clinical denoising demands and shorter scan times. Abstract: Positron Emission Tomography (PET) imaging is a vital tool in medical diagnostics, offering detailed insights into molecular processes within the human body. However, PET images often suffer from complicated noise, which can obscure critical diagnostic information. The quality of the PET image is impacted by various factors including scanner hardware, image reconstruction, tracer properties, dose/count level, and acquisition time. In this study, we propose a novel text-guided denoising method capable of enhancing PET images across a wide range of count levels within a single model. The model combines features from a pretrained CLIP model with a U-Net based denoising model. Experimental results demonstrate that the proposed model leads to significant improvements in both qualitative and quantitative assessments. The flexibility of the model shows its potential for handling more complicated denoising demands or reducing the acquisition time.

[78] Low performing pixel correction in computed tomography with unrolled network and synthetic data training

Hongxu Yang,Levente Lippenszky,Edina Timko,Lehel Ferenczi,Gopal Avinash

Main category: cs.CV

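TL;DR: This paper proposes an unrolled dual-domain (sinogram and image) deep learning method, trained only on synthetic data, for correcting the ring and streak artifacts caused by low performing pixels (LPP) in CT detectors; it needs no real clinical data for training and clearly outperforms existing methods at 1-2% detector defects.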

Details

Motivation: Low performing pixels (LPP) cause ring and streak artifacts that make CT images clinically unusable; existing supervised methods depend on dedicated training datasets that are expensive to collect and correct in only one domain (image or sinogram), ignoring the cross-domain correlations that follow from the CT forward geometry. Method: An unrolled dual-domain correction method based on synthetic data: paired sinogram and image data with LPP defects are generated from natural images, and an end-to-end trainable dual-domain network exploits the correlations of LPP between the sinogram and image domains. Result: In experiments simulating 1-2% detector defects near the isocenter, the method outperforms state-of-the-art approaches by a large margin; it requires no real clinical data for training and adapts to different scanner settings. Conclusion: A synthetic-data-driven unrolled dual-domain model can correct LPP artifacts without the cost of data collection, supporting software-based CT quality assurance. Abstract: Low performing pixels (LPP) in Computed Tomography (CT) detectors lead to ring and streak artifacts in the reconstructed images, making them clinically unusable. In recent years, several solutions have been proposed to correct LPP artifacts, either in the image domain or in the sinogram domain, using supervised deep learning methods. However, these methods require dedicated datasets for training, which are expensive to collect. Moreover, existing approaches focus solely on either image-space or sinogram-space correction, ignoring the intrinsic correlations from the forward operation of the CT geometry. In this work, we propose an unrolled dual-domain method based on synthetic data to correct LPP artifacts. Specifically, the intrinsic correlations of LPP between the sinogram and image domains are leveraged through synthetic data generated from natural images, enabling the trained model to correct artifacts without requiring any real-world clinical data. In experiments simulating 1-2% detector defects near the isocenter, the proposed method outperformed the state-of-the-art approaches by a large margin. The results indicate that our solution can correct LPP artifacts without the cost of data collection for model training, and it is adaptable to different scanner settings for software-based applications.

[79] AI-based Prediction of Biochemical Recurrence from Biopsy and Prostatectomy Samples

Andrea Camilloni,Chiara Micoli,Nita Mulliqi,Erik Everett Palm,Thorgerdur Palsdottir,Kelvin Szolnoky,Xiaoyi Ji,Sol Erika Boman,Andrea Discacciati,Henrik Grönberg,Lars Egevad,Tobias Nordström,Kimmo Kartasalo,Martin Eklund

Main category: cs.CV

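TL;DR: This paper develops an AI model based on prostate biopsy images, using foundation models and attention-based multiple instance learning, to predict the risk of biochemical recurrence (BCR) after radical prostatectomy; generalization is validated on several external cohorts, and adding clinical variables further improves risk stratification beyond the conventional CAPRA-S score.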

Details

Motivation: Current prognostic tools for predicting biochemical recurrence (BCR) after radical prostatectomy are imprecise, and more accurate, generalizable prediction methods are needed. Method: An AI model is built on diagnostic prostate biopsy slides from the STHLM3 cohort (n = 676) using foundation models and attention-based multiple instance learning (MIL); generalization is evaluated on three external radical prostatectomy cohorts (LEOPARD, CHIMERA, and TCGA-PRAD), and clinical variables are integrated for joint prediction. Result: The image-based model achieves 5-year time-dependent AUCs of 0.64, 0.70, and 0.70 on the three external cohorts; integrating clinical variables enables statistically significant risk stratification and improves on the CAPRA-S score. Conclusion: A histopathology AI model trained on biopsy images can generalize across specimen types and support preoperative and postoperative decisions, but the incremental value of AI-based multimodal approaches over simpler models should be critically assessed in further studies. Abstract: Biochemical recurrence (BCR) after radical prostatectomy (RP) is a surrogate marker for aggressive prostate cancer with adverse outcomes, yet current prognostic tools remain imprecise. We trained an AI-based model on diagnostic prostate biopsy slides from the STHLM3 cohort (n = 676) to predict patient-specific risk of BCR, using foundation models and attention-based multiple instance learning. Generalizability was assessed across three external RP cohorts: LEOPARD (n = 508), CHIMERA (n = 95), and TCGA-PRAD (n = 379). The image-based approach achieved 5-year time-dependent AUCs of 0.64, 0.70, and 0.70, respectively. Integrating clinical variables added complementary prognostic value and enabled statistically significant risk stratification. Compared with guideline-based CAPRA-S, AI incrementally improved postoperative prognostication. These findings suggest biopsy-trained histopathology AI can generalize across specimen types to support preoperative and postoperative decision making, but the added value of AI-based multimodal approaches over simpler predictive models should be critically scrutinized in further studies.
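
Attention-based multiple instance learning treats each slide as a bag of patch embeddings and pools them with learned weights. The sketch below shows only the pooling step, assuming the per-patch attention scores are already given; in a real model they come from a small trained network, and the embeddings come from a pathology foundation model. The values and names are illustrative.

```python
import math

def attention_pool(embeddings, scores):
    """Softmax the per-patch scores and return the weighted mean embedding."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by max for numerical stability
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(embeddings[0])
    bag = [sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)]
    return bag, weights

# Three hypothetical patch embeddings from one slide.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_bag, uniform_w = attention_pool(patches, [0.0, 0.0, 0.0])
focused_bag, focused_w = attention_pool(patches, [0.0, 0.0, 5.0])
```

With equal scores the bag embedding is the plain average of the patches; raising one score moves the bag embedding toward that patch, and the weights show which regions drove the prediction.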

[80] BadDet+: Robust Backdoor Attacks for Object Detection

Kealan Dunnett,Reza Arablouei,Dimity Miller,Volkan Dedeoglu,Raja Jurdak

Main category: cs.CV

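TL;DR: This paper proposes BadDet+, a framework that unifies Region Misclassification Attacks (RMA) and Object Disappearance Attacks (ODA) through a log-barrier penalty, strengthening backdoor attacks on object detection models, in particular their robustness when transferred to the physical world, without harming performance on clean samples.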

Details

Motivation: Existing backdoor attacks on object detection rely on unrealistic assumptions and lack physical validation, and their impact is far less well understood than in image classification. Method: BadDet+ is a penalty-based framework whose core is a log-barrier penalty that suppresses true-class predictions for triggered inputs, yielding position and scale invariance and stronger physical robustness; a theoretical analysis shows that the penalty acts within a trigger-specific feature subspace. Result: On real-world benchmarks, BadDet+ achieves better synthetic-to-physical transfer than existing RMA and ODA baselines while preserving clean detection performance; the theoretical analysis confirms that the attack is reliable and does not degrade standard inference. Conclusion: Object detection models have serious backdoor vulnerabilities and need specialized defenses; BadDet+ exposes the fragility of current detection systems in physical settings. Abstract: Backdoor attacks pose a severe threat to deep learning, yet their impact on object detection remains poorly understood compared to image classification. While attacks have been proposed, we identify critical weaknesses in existing detection-based methods, specifically their reliance on unrealistic assumptions and a lack of physical validation. To bridge this gap, we introduce BadDet+, a penalty-based framework that unifies Region Misclassification Attacks (RMA) and Object Disappearance Attacks (ODA). The core mechanism utilizes a log-barrier penalty to suppress true-class predictions for triggered inputs, resulting in (i) position and scale invariance, and (ii) enhanced physical robustness. On real-world benchmarks, BadDet+ achieves superior synthetic-to-physical transfer compared to existing RMA and ODA baselines while preserving clean performance. Theoretical analysis confirms the proposed penalty acts within a trigger-specific feature subspace, reliably inducing attacks without degrading standard inference. These results highlight significant vulnerabilities in object detection and the necessity for specialized defenses.
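
A log-barrier penalty grows without bound as a quantity approaches a limit. The abstract does not state the exact form used in BadDet+, so the sketch below assumes the simplest candidate, the negative log of one minus the true-class probability, purely to show the shape of such a term. The function name and the clamp `eps` are assumptions.

```python
import math

def log_barrier_penalty(p_true, eps=1e-6):
    """Assumed form: -log(1 - p_true), clamped so the log stays finite."""
    return -math.log(max(1.0 - p_true, eps))

low = log_barrier_penalty(0.1)
mid = log_barrier_penalty(0.5)
high = log_barrier_penalty(0.99)
```

Under this assumed form the penalty is near zero when the true class is already suppressed and rises steeply as its probability approaches one, so confident true-class predictions dominate the loss.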

[81] Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization

Jiaqi Li,Guangming Wang,Shuntian Zheng,Minzhe Ni,Xiaoman Lu,Guanghui Ye,Yu Guan

Main category: cs.CV

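TL;DR: This paper proposes ActionVLM, a framework that uses debiasing reweighting and residual aggregation to mitigate vision-language modality bias in temporal action localization, keeping vision as the primary signal and language as a supplement, and improving performance.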

Details

Motivation: Existing temporal action localization methods based on vision-language models rely too heavily on linguistic priors, producing a pronounced modality bias that weakens visual performance. Method: The ActionVLM framework comprises (i) a debiasing reweighting module that estimates the incremental gain of language over vision and dynamically adjusts the language weight, and (ii) a residual aggregation strategy that treats language as a complementary refinement of vision rather than the dominant signal. Result: On THUMOS14, the model outperforms the state of the art by up to 3.2% mAP. Conclusion: Adaptive aggregation with vision as the dominant signal and language as an auxiliary one mitigates modality bias and improves the accuracy and robustness of temporal action localization. Abstract: Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors at the expense of visual performance, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. To this end, we introduce (i) a debiasing reweighting module that estimates the language advantage (the incremental benefit of language over vision-only predictions) and dynamically reweights the language modality accordingly, and (ii) a residual aggregation strategy that treats language as a complementary refinement rather than the primary driver. This combination alleviates modality bias, reduces overconfidence from linguistic priors, and strengthens temporal reasoning. Experiments on THUMOS14 show that our model outperforms state-of-the-art by up to 3.2% mAP.

[82] Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

Yu Huo,Siyu Zhang,Kun Zeng,Haoyue Liu,Owen Lee,Junlin Chen,Yuquan Lu,Yifu Guo,Yaodong Liang,Xiaoying Tang

Main category: cs.CV

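TL;DR: This paper proposes the Shape-of-Thought (SoT) framework, which uses a visual chain-of-thought to perform progressive 2D shape assembly without external engines, making text-to-image generation more robust under structural constraints such as counts, attribute binding, and part relations.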

Details

Motivation: Existing multimodal text-to-image models are brittle under compositional structural constraints (generative numeracy, attribute binding, part-level relations) and lack explicit modeling of shape-assembly logic. Method: The SoT visual chain-of-thought framework trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states; the authors also build the SoT-26K dataset of assembly traces and the T2S-CompBench evaluation benchmark. Result: The model reaches 88.4% on component numeracy and 84.8% on structural topology, about 20% above text-only baselines, supporting SoT's effectiveness for structural integrity and trace faithfulness. Conclusion: SoT establishes a transparent, process-supervised approach to compositional generation that improves structure and interpretability without explicit geometric representations or external engines. Abstract: Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework that enables progressive shape assembly via coherent 2D projections without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming text-only baselines by around 20%. SoT establishes a new paradigm for transparent, process-supervised compositional generation. The code is available at https://anonymous.4open.science/r/16FE/. The SoT-26K dataset will be released upon acceptance.

[83] An AI Framework for Microanastomosis Motion Assessment

Yan Meng,Eduardo J. Torres-Rodríguez,Marcelle Altshuler,Nishanth Gowda,Arhum Naeem,Recai Yilmaz,Omar Arnaout,Daniel A. Donoho

Main category: cs.CV

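TL;DR: This paper proposes an AI framework for automated assessment of instrument handling skill in microanastomosis, integrating YOLO detection, DeepSORT tracking, instrument tip localization, and a supervised classification module, and reporting objective, reliable assessment with 97% detection precision and 96% mAP.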

Details

Motivation: Conventional assessment of microsurgical technique depends on subjective expert ratings, which suffer from inter-rater variability, inconsistent criteria, cognitive bias, and time cost; an objective, reliable, automated assessment system is needed. Method: A four-module AI framework: (1) a YOLO-based instrument detection module; (2) a DeepSORT-based instrument tracking module; (3) an instrument tip localization module based on shape descriptors; (4) a supervised classification module trained on expert-labeled data to rate instrument handling proficiency. Result: Instrument detection precision reaches 97%, with a mean Average Precision of 96% over IoU thresholds from 50% to 95% (mAP50-95). Conclusion: The framework enables objective, automated assessment of instrument handling in microanastomosis and has potential for clinical translation and surgical training. Abstract: Proficiency in microanastomosis is a fundamental competency across multiple microsurgical disciplines. These procedures demand exceptional precision and refined technical skills, making effective, standardized assessment methods essential. Traditionally, the evaluation of microsurgical techniques has relied heavily on the subjective judgment of expert raters. Such evaluations are inherently constrained by limitations such as inter-rater variability, lack of standardized evaluation criteria, susceptibility to cognitive bias, and the time-intensive nature of manual review. These shortcomings underscore the urgent need for an objective, reliable, and automated system capable of assessing microsurgical performance with consistency and scalability. To bridge this gap, we propose a novel AI framework for the automated assessment of microanastomosis instrument handling skills. The system integrates four core components: (1) an instrument detection module based on the You Only Look Once (YOLO) architecture; (2) an instrument tracking module developed from Deep Simple Online and Realtime Tracking (DeepSORT); (3) an instrument tip localization module employing shape descriptors; and (4) a supervised classification module trained on expert-labeled data to evaluate instrument handling proficiency. Experimental results demonstrate the effectiveness of the framework, achieving an instrument detection precision of 97%, with a mean Average Precision (mAP) of 96%, measured by Intersection over Union (IoU) thresholds ranging from 50% to 95% (mAP50-95).
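
The mAP50-95 figure averages detection quality over ten IoU thresholds, so it rewards tight boxes and not only correct ones. The sketch below computes IoU for axis-aligned boxes and lists which thresholds a single prediction would satisfy; the box coordinates are made up for illustration, and the full mAP computation (ranking by confidence, precision-recall integration) is omitted.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

# The ten thresholds behind mAP50-95: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [t / 100 for t in range(50, 100, 5)]

def thresholds_passed(iou_value):
    """Thresholds at which this prediction would count as a true positive."""
    return [t for t in THRESHOLDS if iou_value >= t]

ground_truth = (0, 0, 10, 10)
tight = iou(ground_truth, (0, 0, 10, 8))
loose = iou(ground_truth, (0, 0, 10, 5))
```

A box with IoU 0.8 counts as correct at seven of the ten thresholds, while one with IoU 0.5 counts only at the loosest, which is why a 96% mAP50-95 implies consistently tight localization.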

[84] Bidirectional Cross-Perception for Open-Vocabulary Semantic Segmentation in Remote Sensing Imagery

Jianzheng Wang,Huan Ni

Main category: cs.CV

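TL;DR: This paper proposes SDCI, a training-free open-vocabulary semantic segmentation framework that improves geometric localization and semantic prediction on high-resolution remote sensing imagery through cross-model attention fusion, bidirectional graph diffusion refinement, and superpixel collaborative prediction.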

Details

Motivation: High-resolution remote sensing imagery has densely distributed land-cover objects and complex boundaries, which raises the bar for geometric localization and semantic prediction; existing training-free open-vocabulary semantic segmentation methods rely on one-way injection and shallow post-processing and struggle to meet these demands. Method: SDCI, a spatial-regularization-aware dual-branch collaborative inference framework, comprising: 1) a cross-model attention fusion (CAF) module for mutual guidance during feature encoding; 2) a bidirectional cross-graph diffusion refinement (BCDR) module that raises the confidence of dual-branch segmentation through iterative random walks; and 3) a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism that uses low-level superpixel structure to refine object boundaries. Result: Outperforms existing methods on multiple remote sensing semantic segmentation benchmarks; ablations confirm that superpixel structures remain effective within deep learning frameworks. Conclusion: Through multi-level collaborative modeling, SDCI markedly improves the accuracy and robustness of training-free open-vocabulary semantic segmentation of remote sensing imagery, with particular advantages on complex boundaries. Abstract: High-resolution remote sensing imagery is characterized by densely distributed land-cover objects and complex boundaries, which places higher demands on both geometric localization and semantic prediction. Existing training-free open-vocabulary semantic segmentation (OVSS) methods typically fuse CLIP and vision foundation models (VFMs) using "one-way injection" and "shallow post-processing" strategies, making it difficult to satisfy these requirements. To address this issue, we propose a spatial-regularization-aware dual-branch collaborative inference framework for training-free OVSS, termed SDCI. First, during feature encoding, SDCI introduces a cross-model attention fusion (CAF) module, which guides collaborative inference by injecting self-attention maps into each other. Second, we propose a bidirectional cross-graph diffusion refinement (BCDR) module that enhances the reliability of dual-branch segmentation scores through iterative random-walk diffusion. Finally, we incorporate low-level superpixel structures and develop a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism to further refine object boundaries. Experiments on multiple remote sensing semantic segmentation benchmarks demonstrate that our method achieves better performance than existing approaches. Moreover, ablation studies further confirm that traditional object-based remote sensing image analysis methods leveraging superpixel structures remain effective within deep learning frameworks. Code: https://github.com/yu-ni1989/SDCI.

[85] Enhancing Underwater Light Field Images via Global Geometry-aware Diffusion Process

Yuji Lin,Qian Zhao,Zongsheng Yue,Junhui Hou,Deyu Meng

Main category: cs.CV

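TL;DR: This paper proposes GeoDiff-LF, a diffusion framework built on SD-Turbo for improving underwater 4-D light field imaging. Using a geometry-guided network structure, loss function, and sampling strategy, it mitigates color distortion in underwater images and reaches SOTA on both visual and quantitative measures.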

Details

Motivation: Underwater images suffer from severe color distortion and quality degradation, and traditional methods cannot adequately model the spatial-angular structure of 4-D light fields; a new method combining geometric priors with generative priors is needed. Method: GeoDiff-LF: (1) a modified U-Net with convolutional and attention adapters to model geometric cues; (2) a geometry-guided loss function based on tensor decomposition and progressive weighting; and (3) an optimized sampling strategy with noise prediction to improve inference efficiency; overall, diffusion priors are fused with light field geometry. Result: Clearly outperforms existing methods on multiple underwater datasets, with the best visual fidelity and quantitative scores such as PSNR/SSIM; the code will be open-sourced. Conclusion: GeoDiff-LF integrates diffusion modeling with light field geometry and offers an efficient, robust, and interpretable approach to underwater 4-D imaging. Abstract: This work studies the challenging problem of acquiring high-quality underwater images via 4-D light field (LF) imaging. To this end, we propose GeoDiff-LF, a novel diffusion-based framework built upon SD-Turbo to enhance underwater 4-D LF imaging by leveraging its spatial-angular structure. GeoDiff-LF consists of three key adaptations: (1) a modified U-Net architecture with convolutional and attention adapters to model geometric cues, (2) a geometry-guided loss function using tensor decomposition and progressive weighting to regularize global structure, and (3) an optimized sampling strategy with noise prediction to improve efficiency. By integrating diffusion priors and LF geometry, GeoDiff-LF effectively mitigates color distortion in underwater scenes. Extensive experiments demonstrate that our framework outperforms existing methods across both visual fidelity and quantitative performance, advancing the state-of-the-art in enhancing underwater imaging. The code will be publicly available at https://github.com/linlos1234/GeoDiff-LF.

[86] FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models

Chenyu Huang,Peng Ye,Xudong Tan,Jinhan Mu,Shenghe Zheng,Li Shen,Tao Chen

Main category: cs.CV

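TL;DR: This paper proposes FRISM, a framework that injects reasoning capability at fine granularity through subspace-level model merging. Combined with a label-free self-distillation strategy, it substantially improves the reasoning performance of vision-language models without harming their visual capabilities.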

Details

Motivation: Existing methods that merge large reasoning models (LRMs) with vision-language models (VLMs) usually merge at a coarse layer level, which creates a trade-off between stronger reasoning and preserved visual capability. Method: FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging) decomposes LRM task vectors with singular value decomposition (SVD) and adaptively tunes a scaling coefficient for each subspace; a label-free self-distillation strategy with dual-objective optimization is trained on common vision-language perception datasets. Result: Consistently achieves SOTA performance on multiple visual reasoning benchmarks, improving reasoning without degrading the original visual capabilities. Conclusion: Subspace-level model merging is an effective route to fine-grained, controllable reasoning injection, and FRISM offers a new paradigm for jointly strengthening VLMs and LRMs. Abstract: Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that reasoning capabilities are encoded in distinct subspaces, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with a dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model's original visual capabilities by consistently achieving state-of-the-art performance across diverse visual reasoning benchmarks.

[87] Generative Recall, Dense Reranking: Learning Multi-View Semantic IDs for Efficient Text-to-Video Retrieval

Zecheng Zhao,Zhi Chen,Zi Huang,Shazia Sadiq,Tong Chen

Main category: cs.CV

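TL;DR: This paper proposes GRDR, which improves generative recall quality through multi-view semantic ID assignment and joint training, sharply reducing storage and retrieval latency while keeping accuracy high.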

Details

Motivation: In existing two-stage text-to-video retrieval, the recall model is limited by semantic ambiguity and cross-modal misalignment; generative retrieval (GR) is efficient but suffers from exactly these defects. Method: Generative Recall and Dense Reranking (GRDR): a query-guided multi-view tokenizer assigns multiple semantic IDs to each video, and the tokenizer and generative retriever are trained jointly through a shared codebook; at inference, trie-constrained decoding produces a compact candidate set that a dense model then reranks. Result: On TVR benchmarks, GRDR matches the accuracy of strong dense retrievers while cutting index storage by an order of magnitude and speeding up full-corpus retrieval by up to 300×. Conclusion: GRDR alleviates semantic ambiguity and cross-modal misalignment in generative recall and achieves efficient, high-quality two-stage text-to-video retrieval. Abstract: Text-to-Video Retrieval (TVR) is essential in video platforms. Dense retrieval with dual-modality encoders leads in accuracy, but its computation and storage scale poorly with corpus size. Thus, real-time large-scale applications adopt two-stage retrieval, where a fast recall model gathers a small candidate pool, which is reranked by an advanced dense retriever. Because the candidate pool is so much smaller, the reranking model can use any off-the-shelf dense retriever without hurting efficiency, meaning the recall model bounds two-stage TVR performance. Recently, generative retrieval (GR) has replaced dense video embeddings with discrete semantic IDs and retrieves by decoding text queries into ID tokens. GR offers near-constant inference and storage complexity, and its semantic IDs capture high-level video features via quantization, making it ideal for quickly eliminating irrelevant candidates during recall. However, as a recall model in two-stage TVR, GR suffers from (i) semantic ambiguity, where each video satisfies diverse queries but is forced into one semantic ID; and (ii) cross-modal misalignment, as semantic IDs are solely derived from visual features without text supervision. We propose Generative Recall and Dense Reranking (GRDR), designing a novel GR method to uplift recalled candidate quality. GRDR assigns multiple semantic IDs to each video using a query-guided multi-view tokenizer exposing diverse semantic access paths, and jointly trains the tokenizer and generative retriever via a shared codebook to cast semantic IDs as the semantic bridge between texts and videos. 
At inference, trie-constrained decoding generates a compact candidate set reranked by a dense model for fine-grained matching. Experiments on TVR benchmarks show GRDR matches strong dense retrievers in accuracy while reducing index storage by an order of magnitude and accelerating up to 300$\times$ in full-corpus retrieval.
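Trie-constrained decoding, as mentioned in the abstract, restricts each decoding step to tokens that extend some valid semantic ID. The sketch below illustrates the general technique with a small beam search; it is not GRDR's implementation. The fixed ID length, the `score_fn` callback (a stand-in for the generative retriever's token log-probabilities), and all names are assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def build_trie(semantic_ids: Sequence[Sequence[int]]) -> dict:
    """Nested-dict trie: each key is a token, each value the sub-trie below it."""
    root: dict = {}
    for sid in semantic_ids:
        node = root
        for tok in sid:
            node = node.setdefault(tok, {})
    return root


def constrained_beam_search(
    trie: dict,
    score_fn: Callable[[Tuple[int, ...], int], float],  # hypothetical scorer
    length: int,
    beam_size: int,
) -> List[Tuple[List[int], float]]:
    """Beam search that only expands tokens present in the trie,
    so every returned sequence is a valid semantic ID."""
    beams = [((), 0.0, trie)]  # (prefix, cumulative score, trie node)
    for _ in range(length):
        candidates = []
        for prefix, score, node in beams:
            for tok, child in node.items():  # only valid continuations
                candidates.append(
                    (prefix + (tok,), score + score_fn(prefix, tok), child)
                )
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return [(list(prefix), score) for prefix, score, _ in beams]
```

With multiple semantic IDs per video, several decoded IDs may map back to the same video, so a deduplication step before dense reranking would be a natural addition.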

[88] Thinker: A vision-language foundation model for embodied intelligence

Baiyu Pan,Daqin Luo,Junpeng Yang,Jiyuan Wang,Yixuan Zhang,Hailin Shi,Jichao Jiao

Main category: cs.CV

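TL;DR: This paper proposes Thinker, a large vision-language foundation model designed for embodied intelligence. By building a large-scale dataset for robotic perception and reasoning and introducing an input scheme that combines key frames with full video sequences, it substantially improves video understanding and reaches state-of-the-art results on task planning benchmarks.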

Details

Motivation: When applied to robotics, large vision-language models make errors that are easy for humans to avoid, such as confusing third-person and first-person perspectives and overlooking information at the end of a video. Method: 1) Build a large-scale dataset for robotic perception and reasoning, covering ego-view videos, visual grounding, spatial understanding, and chain-of-thought data; 2) propose a simple, effective approach that feeds key frames and full video sequences jointly as input to strengthen video understanding. Result: Achieves state-of-the-art performance on two of the most commonly used task planning benchmark datasets. Conclusion: Through targeted data construction and an improved input strategy, Thinker mitigates perspective confusion and temporal reasoning deficits of vision-language models in embodied tasks, supporting its effectiveness and potential for robotic task planning. Abstract: When large vision-language models are applied to the field of robotics, they encounter problems that are simple for humans yet error-prone for models. Such issues include confusion between third-person and first-person perspectives and a tendency to overlook information in video endings during temporal reasoning. To address these challenges, we propose Thinker, a large vision-language foundation model designed for embodied intelligence. We tackle the aforementioned issues from two perspectives. Firstly, we construct a large-scale dataset tailored for robotic perception and reasoning, encompassing ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. Secondly, we introduce a simple yet effective approach that substantially enhances the model's capacity for video comprehension by jointly incorporating key frames and full video sequences as inputs. Our model achieves state-of-the-art results on two of the most commonly used benchmark datasets in the field of task planning.

[89] LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models

Alvi Md Ishmam,Najibul Haque Sarker,Zaber Ibn Abdul Hakim,Chris Thomas

Main category: cs.CV

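TL;DR: This paper proposes LAMP, a black-box method for learning universal adversarial perturbations (UAPs) against multi-image multimodal large language models (MLLMs). It uses an attention constraint, a cross-image contagious constraint, and an index-attention suppression loss to achieve an efficient, robust, position-invariant attack.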

Details

Motivation: Existing adversarial attacks mainly target single images and rely on white-box assumptions, which does not fit practical multi-image MLLM settings, whose security vulnerabilities have not been studied systematically. Method: LAMP: 1) an attention-based constraint that limits information aggregation across images; 2) a cross-image contagious constraint that makes perturbed tokens influence clean tokens; and 3) an index-attention suppression loss that yields a position-invariant attack. Result: LAMP clearly surpasses SOTA baselines across multiple vision-language tasks and models, achieving the highest attack success rates. Conclusion: LAMP is the first black-box universal adversarial attack framework for multi-image MLLMs; it exposes structural fragility in cross-image reasoning and provides a transferable, robust attack paradigm. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance across vision-language tasks. Recent advancements allow these models to process multiple images as inputs. However, the vulnerabilities of multi-image MLLMs remain unexplored. Existing adversarial attacks focus on single-image settings and often assume a white-box threat model, which is impractical in many real-world scenarios. This paper introduces LAMP, a black-box method for learning Universal Adversarial Perturbations (UAPs) targeting multi-image MLLMs. LAMP applies an attention-based constraint that prevents the model from effectively aggregating information across images. LAMP also introduces a novel cross-image contagious constraint that forces perturbed tokens to influence clean tokens, spreading adversarial effects without requiring all inputs to be modified. Additionally, an index-attention suppression loss enables a robust position-invariant attack. Experimental results show that LAMP outperforms SOTA baselines and achieves the highest attack success rates across multiple vision-language tasks and models.

[90] PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models

Xuewen Liu,Zhikai Li,Jing Zhang,Mengjuan Chen,Qingyi Gu

Main category: cs.CV

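TL;DR: This paper proposes PTQ4ARVG, a training-free post-training quantization framework that addresses three challenges in quantizing autoregressive visual generation (ARVG) models: channel-wise outliers, token-wise dynamic activations, and sample-wise distribution mismatch. It supports efficient 8-bit and 6-bit quantization.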

Details

Motivation: Quantization of ARVG models is underexplored and existing methods generalize poorly to them; three challenges must be solved: channel-wise outliers, token-wise dynamic activations, and sample-wise distribution mismatch. Method: The PTQ4ARVG framework has three parts: (1) Gain-Projected Scaling (GPS) expands the quantization loss with a Taylor series and differentiates to optimize the scaling factor, mitigating channel-wise outliers; (2) Static Token-Wise Quantization (STWQ) exploits ARVG's fixed token length and position-invariant distribution to perform static token-wise quantization; and (3) Distribution-Guided Calibration (DGC) selects representative calibration samples based on distributional entropy. Result: PTQ4ARVG quantizes ARVG-family models efficiently at 8 and 6 bits while keeping performance competitive. Conclusion: PTQ4ARVG is a general, training-free post-training quantization scheme tailored to ARVG characteristics that improves deployment efficiency without a noticeable loss in performance. Abstract: AutoRegressive Visual Generation (ARVG) models retain an architecture compatible with language models, while achieving performance comparable to diffusion-based models. Quantization is commonly employed in neural networks to reduce model size and computational latency. However, applying quantization to ARVG remains largely underexplored, and existing quantization methods fail to generalize effectively to ARVG models. In this paper, we explore this issue and identify three key challenges: (1) severe outliers at channel-wise level, (2) highly dynamic activations at token-wise level, and (3) mismatched distribution information at sample-wise level. To this end, we propose PTQ4ARVG, a training-free post-training quantization (PTQ) framework consisting of: (1) Gain-Projected Scaling (GPS), which mitigates the channel-wise outliers by expanding the quantization loss via a Taylor series to quantify the gain of scaling for activation-weight quantization and deriving the optimal scaling factor through differentiation; (2) Static Token-Wise Quantization (STWQ), which leverages the inherent properties of ARVG, fixed token length and position-invariant distribution across samples, to address token-wise variance without incurring dynamic calibration overhead; and (3) Distribution-Guided Calibration (DGC), which selects samples that contribute most to distributional entropy, eliminating the sample-wise distribution mismatch. Extensive experiments show that PTQ4ARVG can effectively quantize the ARVG family models to 8-bit and 6-bit while maintaining competitive performance. 
Code is available at http://github.com/BienLuky/PTQ4ARVG .
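The idea behind static token-wise quantization, as the abstract describes it, is that a fixed token length lets per-position quantization scales be computed once offline and then reused at inference. The sketch below illustrates that general idea with symmetric integer quantization; it is not the PTQ4ARVG code. The max-abs calibration rule, the data layout `calib[sample][position][channel]`, and all function names are assumptions.

```python
from typing import List

Activations = List[List[float]]  # [position][channel]


def calibrate_position_scales(calib: List[Activations], bits: int = 8) -> List[float]:
    """One static scale per token position, from the largest magnitude
    seen at that position across all calibration samples."""
    qmax = 2 ** (bits - 1) - 1
    scales = []
    for pos in range(len(calib[0])):
        peak = max(abs(v) for sample in calib for v in sample[pos])
        scales.append(peak / qmax if peak > 0 else 1.0)
    return scales


def quantize(acts: Activations, scales: List[float], bits: int = 8) -> List[List[int]]:
    """Symmetric quantization using the precomputed per-position scales."""
    qmax = 2 ** (bits - 1) - 1
    return [
        [max(-qmax, min(qmax, round(v / s))) for v in row]
        for row, s in zip(acts, scales)
    ]


def dequantize(q: List[List[int]], scales: List[float]) -> Activations:
    """Map integers back to approximate floating-point activations."""
    return [[v * s for v in row] for row, s in zip(q, scales)]
```

Because the scales are fixed after calibration, inference needs no per-token statistics, which is the overhead that dynamic token-wise quantization would otherwise incur.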

[91] NFCDS: A Plug-and-Play Noise Frequency-Controlled Diffusion Sampling Strategy for Image Restoration

Zhen Wang,Hongyi Liu,Jianing Li,Zhihui Wei

Main category: cs.CV

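TL;DR: This paper proposes Noise Frequency-Controlled Diffusion Sampling (NFCDS), which designs a Fourier-domain filter that suppresses low-frequency noise and preserves high-frequency detail, improving the balance between data fidelity and perceptual quality in diffusion PnP methods without additional training.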

Details

Motivation: Existing diffusion sampling-based Plug-and-Play (PnP) methods generate images of high perceptual quality, but noise introduced during reverse diffusion lowers data fidelity; the trade-off between fidelity and perceptual quality needs to be resolved. Method: A noise frequency control mechanism (NFCDS) that designs a Fourier-domain filter to progressively suppress low-frequency noise and preserve high-frequency components, embedding this data-consistency prior directly into diffusion sampling. Result: As a PnP module, NFCDS integrates seamlessly into existing diffusion restoration frameworks and clearly improves the fidelity-perception balance across diverse zero-shot tasks, with no additional training and faster convergence. Conclusion: Noise frequency is a key factor in understanding the fidelity-perception trade-off; through spectral modulation, NFCDS achieves high-quality, high-fidelity zero-shot image restoration and offers a new direction for diffusion PnP. Abstract: Diffusion sampling-based Plug-and-Play (PnP) methods produce images with high perceptual quality but often suffer from reduced data fidelity, primarily due to the noise introduced during reverse diffusion. To address this trade-off, we propose Noise Frequency-Controlled Diffusion Sampling (NFCDS), a spectral modulation mechanism for reverse diffusion noise. We show that the fidelity-perception conflict can be fundamentally understood through noise frequency: low-frequency components induce blur and degrade fidelity, while high-frequency components drive detail generation. Based on this insight, we design a Fourier-domain filter that progressively suppresses low-frequency noise and preserves high-frequency content. This controlled refinement injects a data-consistency prior directly into sampling, enabling fast convergence to results that are both high-fidelity and perceptually convincing--without additional training. As a PnP module, NFCDS seamlessly integrates into existing diffusion-based restoration frameworks and improves the fidelity-perception balance across diverse zero-shot tasks.
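The Fourier-domain filtering idea in the abstract can be illustrated on a 1-D noise vector: transform, attenuate the low-frequency bins, and transform back. This is a minimal sketch under stated assumptions, not the NFCDS filter itself. The hard cutoff, the linear `progressive_strength` schedule, and all names are hypothetical, and a naive O(n²) DFT is used only to keep the example dependency-free.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform (fine for short illustrative signals)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def idft(spectrum: List[complex]) -> List[float]:
    """Inverse DFT, returning the real part."""
    n = len(spectrum)
    return [
        sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)).real / n
        for k in range(n)
    ]


def suppress_low_freq(noise: List[float], cutoff: int, strength: float) -> List[float]:
    """Scale bins whose frequency index is <= cutoff by (1 - strength);
    higher-frequency bins pass through unchanged."""
    n = len(noise)
    spectrum = dft(noise)
    for f in range(n):
        freq = min(f, n - f)  # bins f and n - f carry the same frequency
        if freq <= cutoff:
            spectrum[f] *= 1.0 - strength
    return idft(spectrum)


def progressive_strength(step: int, total_steps: int) -> float:
    """Hypothetical schedule: suppression grows linearly from 0 to 1."""
    return step / max(1, total_steps - 1)
```

In an image setting the same operation would apply to a 2-D spectrum, typically with a smooth radial mask rather than a hard cutoff to avoid ringing.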

[92] Hypersolid: Emergent Vision Representations via Short-Range Repulsion

Esteban Rodríguez-Betancourt,Edgar Casasola-Murillo

Main category: cs.CV

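TL;DR: This paper proposes Hypersolid, which reinterprets self-supervised representation learning as a discrete packing problem. Short-range hard-ball repulsion prevents local collisions and thereby avoids representation collapse, and the method performs well on fine-grained and low-resolution classification tasks.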

Details

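Motivation: Representation collapse is a common problem in self-supervised learning; existing methods mostly rely on global regularization, whereas this work approaches the problem from the angle of information preservation and injectivity of the mapping. Method: Hypersolid models representation learning as a discrete packing problem and uses a short-range hard-ball repulsion mechanism to prevent local collisions in the embedding space, preserving injectivity of the mapping. Result: The method yields a high-separation geometric structure that preserves augmentation diversity and performs well on fine-grained and low-resolution image classification. Conclusion: By replacing traditional global regularization with a local hard-ball repulsion constraint, Hypersolid offers a new and effective geometric perspective on avoiding representation collapse and demonstrates its practicality on challenging vision tasks. Abstract: A recurring challenge in self-supervised learning is preventing representation collapse. Existing solutions typically rely on global regularization, such as maximizing distances, decorrelating dimensions or enforcing certain distributions. We instead reinterpret representation learning as a discrete packing problem, where preserving information simplifies to maintaining injectivity. We operationalize this in Hypersolid, a method using short-range hard-ball repulsion to prevent local collisions. This constraint results in a high-separation geometric regime that preserves augmentation diversity, excelling on fine-grained and low-resolution classification tasks.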
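A short-range repulsion term can be sketched as a penalty that is active only for pairs of embeddings closer than some radius and exactly zero otherwise, in contrast to global objectives that push all pairs apart. The quadratic overlap penalty, the `radius` parameter, and the function name below are assumptions for illustration; the abstract does not specify the form of Hypersolid's loss.

```python
import math
from typing import Sequence


def hard_ball_repulsion(embeddings: Sequence[Sequence[float]], radius: float) -> float:
    """Sum of squared overlaps (radius - distance)^2 over all pairs of
    embeddings closer than `radius`; pairs farther apart contribute nothing."""
    loss = 0.0
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(embeddings[i], embeddings[j])
            if d < radius:  # only local "collisions" are penalized
                loss += (radius - d) ** 2
    return loss
```

Because the penalty vanishes beyond the radius, embeddings that are already separated receive no gradient from it, which is what makes the constraint local rather than global.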

[93] Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference

Jianglong Li,Jun Xu,Bingcong Lu,Zhengxue Cheng,Hongwei Hu,Ronghua Wu,Li Song

Main category: cs.CV

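TL;DR: This paper proposes a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that combines FLAME parametric modeling with 3DGS neural rendering. It transmits only key facial metadata in real time, improves transmission efficiency through Gaussian attribute compression and MLP optimization, and achieves high-quality real-time 3D face rendering at extremely low bitrates.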

Details

Motivation: Traditional 2D video compression cannot preserve fine geometric and appearance details, while implicit neural rendering methods such as NeRF are too computationally expensive, so neither meets the needs of high-fidelity 3D video conferencing at low bitrates. Method: Combine FLAME parametric modeling with 3D Gaussian Splatting (3DGS) neural rendering in a lightweight Gaussian head model; propose a compact representation and compression scheme with Gaussian attribute compression and MLP optimization, transmitting only the necessary facial metadata. Result: Achieves better rate-distortion performance than existing methods at extremely low bitrates and supports high-quality, real-time 3D face rendering. Conclusion: The framework balances efficiency and quality, suits real-time 3D video conferencing, and offers a practical solution for immersive communication under low bandwidth. Abstract: The demand for immersive and interactive communication has driven advancements in 3D video conferencing, yet achieving high-fidelity 3D talking face representation at low bitrates remains a challenge. Traditional 2D video compression techniques fail to preserve fine-grained geometric and appearance details, while implicit neural rendering methods like NeRF suffer from prohibitive computational costs. To address these challenges, we propose a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that integrates FLAME-based parametric modeling with 3DGS neural rendering. Our approach transmits only essential facial metadata in real time, enabling efficient reconstruction with a Gaussian-based head model. Additionally, we introduce a compact representation and compression scheme, including Gaussian attribute compression and MLP optimization, to enhance transmission efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance, delivering high-quality facial rendering at extremely low bitrates, making it well-suited for real-time 3D video conferencing applications.

[94] GeoRC: A Benchmark for Geolocation Reasoning Chains

Mohit Talreja,Joshua Diao,Jim Thannikary James,Radu Casapu,Tejas Santanam,Ethan Mendes,Alan Ritter,Wei Xu,James Hays

Main category: cs.CV

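TL;DR: This paper introduces the first benchmark for geolocation reasoning chains. It shows that vision language models (VLMs) can predict image locations accurately yet often hallucinate when explaining their evidence. Against 800 expert-annotated reasoning chains, large closed-source VLMs (such as Gemini and GPT-5) predict well but reason poorly, while open-weights VLMs (such as Llama and Qwen) fail badly.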

Details

Motivation: VLMs perform very well at geolocation prediction, but the reasoning chains they generate often contain hallucinated content that does not match the image and cannot be audited; a benchmark is needed to evaluate whether the stated evidence is real. Method: Build the first geolocation reasoning chain benchmark: based on GeoGuessr game images, top experts (including the world champion) annotate 800 ground-truth reasoning chains for 500 scenes; LLM/VLM-as-a-judge scoring is applied automatically and validated against human scoring. Result: Qwen 3 LLM-as-a-judge correlates best with human scoring; closed-source VLMs (Gemini, GPT-5) approach human accuracy in localization but produce reasoning chains of far lower quality than humans; open-weights VLMs (Llama, Qwen) perform very poorly, only slightly better than a pure-hallucination baseline. Conclusion: Current VLMs have a clear weakness in extracting fine-grained visual attributes, which makes their reasoning chains unreliable; the benchmark provides an evaluation tool and a direction for improving VLM interpretability and trustworthy reasoning. Abstract: Vision Language Models (VLMs) are good at recognizing the global location of a photograph -- their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at explaining which image evidence led to their prediction, even when their location prediction is correct. The reasoning chains produced by VLMs frequently hallucinate scene attributes to support their location prediction (e.g. phantom writing, imagined infrastructure, misidentified flora). In this paper, we introduce the first benchmark for geolocation reasoning chains. We focus on the global location prediction task in the popular GeoGuessr game which draws from Google Street View spanning more than 100 countries. We collaborate with expert GeoGuessr players, including the reigning world champion, to produce 800 ground truth reasoning chains for 500 query scenes. These expert reasoning chains address hundreds of different discriminative visual attributes such as license plate shape, architecture, and soil properties to name just a few. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at predicting locations, they still lag behind human experts when it comes to producing auditable reasoning chains. 
Open weights VLMs such as Llama and Qwen catastrophically fail on our benchmark -- they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but no visual information at all. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high resolution images.

Dong Chen,Ruoyu Li,Xinyan Zhang,Jialei Xu,Ruoseng Zhao,Zhikang Zhang,Lingyun Li,Zizhuang Wei

Main category: cs.CV

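TL;DR: This paper proposes a multimodal method that fuses base-station video, antenna geometric features, and PCI signals to identify antenna affiliation automatically, and designs a Token Entropy Regularization module to improve cross-modal alignment.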

Details

Motivation: Antenna affiliation identification currently depends on manual inspection, which is inefficient and error-prone; existing pretrained models lack comparable data from the communications domain and therefore struggle with cross-modal alignment. Method: A multimodal classification and matching framework that fuses video, geometric features, and PCI signals, with a Token Entropy Regularization (TER) module that improves cross-modal representation alignment during pretraining. Result: Experiments show that TER accelerates convergence and yields significant performance gains; further analysis finds that the entropy of the first token is modality-dependent. Conclusion: The method shifts the task from manual inspection to automated multimodal identification and offers an efficient, robust technical path for communication network operations and maintenance. Abstract: Accurate antenna affiliation identification is crucial for optimizing and maintaining communication networks. Current practice, however, relies on the cumbersome and error-prone process of manual tower inspections. We propose a novel paradigm shift that fuses video footage of base stations, antenna geometric features, and Physical Cell Identity (PCI) signals, transforming antenna affiliation identification into multi-modal classification and matching tasks. Publicly available pretrained transformers struggle with this unique task due to a lack of analogous data in the communications domain, which hampers cross-modal alignment. To address this, we introduce a dedicated training framework that aligns antenna images with corresponding PCI signals. To tackle the representation alignment challenge, we propose a novel Token Entropy Regularization (TER) module in the pretraining stage. Our experiments demonstrate that TER accelerates convergence and yields significant performance gains. Further analysis reveals that the entropy of the first token is modality-dependent. Code will be made available upon publication.

[96] WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models

Rishi Upadhyay,Howard Zhang,Jim Solomon,Ayush Agrawal,Pranay Boreddy,Shruti Satya Narayana,Yunhao Ba,Alex Wong,Celso M de Melo,Achuta Kadambi

Main category: cs.CV

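TL;DR: This paper introduces WorldBench, a video benchmark for evaluating the physical fidelity of generative world models. It features concept disentanglement and a two-level design (intuitive physics and low-level physical parameters), and reveals systematic failures of current SOTA models on specific physical concepts.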

Details

Motivation: Existing physics video benchmarks entangle multiple concepts, so they cannot pinpoint a model's gaps in understanding specific physical laws and cannot support reliable deployment of world models in critical tasks such as robotics. Method: WorldBench, a video-based concept-disentangled benchmark with two evaluation levels: 1) intuitive physical concepts (such as object permanence and scale/perspective); and 2) low-level physical parameters (such as friction coefficients and fluid viscosity); each test isolates a single physical concept through controlled variables. Result: Evaluating current SOTA video world models on WorldBench reveals reproducible failure patterns on several specific physical concepts; overall, the models lack the basic physical consistency needed to generate physically coherent video. Conclusion: WorldBench provides a finer-grained, scalable framework for evaluating the physical reasoning of world models and helps drive the development of more robust and generalizable physically grounded world models. Abstract: Recent advances in generative foundational models, often termed "world models," have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties such as friction coefficients or fluid viscosity. When SOTA video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.

[97] Gaussian Belief Propagation Network for Depth Completion

Jie Tang,Pingping Xie,Jian Li,Ping Tan

Main category: cs.CV

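TL;DR: This paper proposes the Gaussian Belief Propagation Network (GBPN), which combines deep learning with probabilistic graphical models. It dynamically builds a scene-specific Markov random field and performs inference with an enhanced Gaussian belief propagation, substantially improving sparse depth completion and reaching SOTA on NYUv2 and KITTI.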

Details

Motivation: Existing deep learning methods struggle to handle the sparsity and irregularity of input depth data, and performance is limited especially under high sparsity. Method: The Gaussian Belief Propagation Network (GBPN): a Graphical Model Construction Network (GMCN) dynamically builds a scene-specific Markov Random Field (MRF), learning data-dependent potentials and an adaptive non-local edge structure; inference then uses an enhanced Gaussian Belief Propagation (GBP) with serial and parallel message passing to propagate information from sparse measurements more effectively. Result: Reaches SOTA performance on the NYUv2 and KITTI benchmarks, with stronger robustness and generalization across sparsity levels, sparsity patterns, and cross-dataset tests. Conclusion: By tightly integrating deep learning with probabilistic graphical modeling, GBPN overcomes key challenges in sparse depth completion and offers a new paradigm for the task. Abstract: Depth completion aims to predict a dense depth map from a color image with sparse depth measurements. Although deep learning methods have achieved state-of-the-art (SOTA) results, effectively handling the sparse and irregular nature of input depth data in deep networks remains a significant challenge, often limiting performance, especially under high sparsity. To overcome this limitation, we introduce the Gaussian Belief Propagation Network (GBPN), a novel hybrid framework synergistically integrating deep learning with probabilistic graphical models for end-to-end depth completion. Specifically, a scene-specific Markov Random Field (MRF) is dynamically constructed by the Graphical Model Construction Network (GMCN), and then inferred via Gaussian Belief Propagation (GBP) to yield the dense depth distribution. Crucially, the GMCN learns to construct not only the data-dependent potentials of MRF but also its structure by predicting adaptive non-local edges, enabling the capture of complex, long-range spatial dependencies. Furthermore, we enhance GBP with a serial & parallel message passing scheme, designed for effective information propagation, particularly from sparse measurements. Extensive experiments demonstrate that GBPN achieves SOTA performance on the NYUv2 and KITTI benchmarks. Evaluations across varying sparsity levels, sparsity patterns, and datasets highlight GBPN's superior performance, notable robustness, and generalizable capability.

[98] Mam-App: A Novel Parameter-Efficient Mamba Model for Apple Leaf Disease Classification

Md Nadim Mahamood,Md Imran Hasan,Md Rasheduzzaman,Ausrukona Ray,Md Shafi Ud Doula,Kamrul Hasan

Main category: cs.CV

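TL;DR: This paper proposes Mam-App, a parameter-efficient Mamba-based model for apple leaf disease recognition. With an extremely low parameter count (0.051M), it reaches 99.58% accuracy on the PlantVillage dataset, and its generalization is validated on corn and potato disease datasets.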

Details

Motivation: Existing deep learning models have large parameter counts and high compute cost, while lightweight models often sacrifice performance; a balance between efficiency and accuracy is needed to support real-time disease diagnosis in resource-constrained settings such as drones and mobile devices. Method: Mam-App, a lightweight Mamba-based model dedicated to feature extraction and classification of plant leaf diseases, designed to balance parameter efficiency with performance. Result: On the Apple Leaf Disease dataset it reaches 99.58% accuracy, 99.30% precision, 99.14% recall, and 99.22% F1; it also achieves strong cross-domain performance on the Corn and Potato datasets. Conclusion: Mam-App attains SOTA performance at an extremely low parameter count, demonstrating the effectiveness and generalization potential of the Mamba architecture for lightweight agricultural disease recognition and its suitability for edge deployment. Abstract: The rapid growth of the global population, alongside exponential technological advancement, has intensified the demand for food production. Meeting this demand depends not only on increasing agricultural yield but also on minimizing food loss caused by crop diseases. Diseases account for a substantial portion of apple production losses, despite apples being among the most widely produced and nutritionally valuable fruits worldwide. Previous studies have employed machine learning techniques for feature extraction and early diagnosis of apple leaf diseases, and more recently, deep learning-based models have shown remarkable performance in disease recognition. However, most state-of-the-art deep learning models are highly parameter-intensive, resulting in increased training and inference time. Although lightweight models are more suitable for user-friendly and resource-constrained applications, they often suffer from performance degradation. To address the trade-off between efficiency and performance, we propose Mam-App, a parameter-efficient Mamba-based model for feature extraction and leaf disease classification. The proposed approach achieves competitive state-of-the-art performance on the PlantVillage Apple Leaf Disease dataset, attaining 99.58% accuracy, 99.30% precision, 99.14% recall, and a 99.22% F1-score, while using only 0.051M parameters. This extremely low parameter count makes the model suitable for deployment on drones, mobile devices, and other low-resource platforms. 
To demonstrate the robustness and generalizability of the proposed model, we further evaluate it on the PlantVillage Corn Leaf Disease and Potato Leaf Disease datasets. The model achieves 99.48%, 99.20%, 99.34%, and 99.27% accuracy, precision, recall, and F1-score on the corn dataset and 98.46%, 98.91%, 95.39%, and 97.01% on the potato dataset, respectively.

[99] HiFi-Mesh: High-Fidelity Efficient 3D Mesh Generation via Compact Autoregressive Dependence

Yanfeng Li,Tao Tan,Qingquan Gao,Zhiwen Cao,Xiaohong Liu,Yue Sun

Main category: cs.CV

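TL;DR: This paper proposes the LANE model and the AdaGraph strategy to address low resource utilization, slow inference, and limited sequence length in existing autoregressive 3D mesh modeling, substantially improving generation speed, structural detail, and geometric consistency.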

Details

Motivation: Existing autoregressive 3D mesh modeling methods underuse resources, infer slowly, and handle only small-scale sequences, which severely limits the structural detail they can express. Method: The Latent Autoregressive Network (LANE) introduces compact autoregressive dependencies; the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy uses spatiotemporal decoupling to overcome the efficiency bottleneck of traditional serial inference. Result: LANE raises the maximum generatable sequence length by 6×; AdaGraph accelerates inference; experiments show advantages over existing methods in generation speed, structural detail, and geometric consistency. Conclusion: LANE combined with AdaGraph provides an efficient, high-fidelity solution for high-quality 3D mesh generation. Abstract: High-fidelity 3D meshes can be tokenized into one-dimensional (1D) sequences and directly modeled using autoregressive approaches for faces and vertices. However, existing methods suffer from insufficient resource utilization, resulting in slow inference and the ability to handle only small-scale sequences, which severely constrains the expressible structural details. We introduce the Latent Autoregressive Network (LANE), which incorporates compact autoregressive dependencies in the generation process, achieving a $6\times$ improvement in maximum generatable sequence length compared to existing methods. To further accelerate inference, we propose the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy, which effectively overcomes the efficiency bottleneck of traditional serial inference through spatiotemporal decoupling in the generation process. Experimental validation demonstrates that LANE achieves superior performance across generation speed, structural detail, and geometric consistency, providing an effective solution for high-quality 3D mesh generation.

[100] Optimal Transport-Induced Samples against Out-of-Distribution Overconfidence

Keke Tang,Ziyong Du,Xiaofei Wang,Weilong Peng,Peican Zhu,Zhihong Tian

Main category: cs.CV

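TL;DR: This paper proposes a framework based on the singular boundaries of semi-discrete optimal transport (OT). It constructs geometrically grounded OOD samples (OTIS) and applies a confidence suppression loss to them during training, mitigating the overconfidence of deep neural networks on out-of-distribution inputs.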

Details

Motivation: Deep neural networks often make overconfident predictions on out-of-distribution (OOD) inputs, and the singularities in semi-discrete optimal transport correspond to regions of semantic ambiguity where high-confidence errors are especially likely. Method: Formulate an OT problem between a continuous base distribution and the latent embeddings of the training data, identify the singular boundaries it induces, sample near those boundaries to generate OTIS samples, and apply a confidence suppression loss to OTIS during training. Result: Experiments show that the method significantly alleviates OOD overconfidence and outperforms state-of-the-art methods on multiple benchmarks. Conclusion: Using the geometry of optimal transport to guide OOD uncertainty modeling is an effective and principled calibration strategy and offers a new direction for improving model reliability in open-world settings. Abstract: Deep neural networks (DNNs) often produce overconfident predictions on out-of-distribution (OOD) inputs, undermining their reliability in open-world environments. Singularities in semi-discrete optimal transport (OT) mark regions of semantic ambiguity, where classifiers are particularly prone to unwarranted high-confidence predictions. Motivated by this observation, we propose a principled framework to mitigate OOD overconfidence by leveraging the geometry of OT-induced singular boundaries. Specifically, we formulate an OT problem between a continuous base distribution and the latent embeddings of training data, and identify the resulting singular boundaries. By sampling near these boundaries, we construct a class of OOD inputs, termed optimal transport-induced OOD samples (OTIS), which are geometrically grounded and inherently semantically ambiguous. During training, a confidence suppression loss is applied to OTIS to guide the model toward more calibrated predictions in structurally uncertain regions. Extensive experiments show that our method significantly alleviates OOD overconfidence and outperforms state-of-the-art methods.

[101] Do Pathology Foundation Models Encode Disease Progression? A Pseudotime Analysis of Visual Representations

Pritika Vig,Ren-Chin Wu,William Lotter

Main category: cs.CV

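TL;DR: This paper asks whether vision foundation models implicitly learn continuous disease progression from discrete images, and uses diffusion pseudotime to measure how well disease states are ordered along trajectories in representation space. Pathology-specific models recover progression orderings significantly better than null baselines, and trajectory fidelity correlates strongly with few-shot classification performance, showing that these models can represent continuous biological processes.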

Details

Motivation: Determine whether the latent representations of vision foundation models implicitly encode the continuous disease progression underlying their training data, which would better reflect the biology, improve generalization, and support quantitative analysis of features associated with disease transitions. Method: Apply diffusion pseudotime, a method from single-cell transcriptomics, to probe how foundation models organize four cancer progressions along trajectories in representation space, and relate trajectory fidelity to few-shot classification performance and to changes in cell-type composition. Result: All pathology-specific models significantly exceed null baselines, and vision-only models reach the highest trajectory fidelity ($\tau > 0.78$ on CRC-Serrated). Model rankings by trajectory fidelity correlate strongly with few-shot classification performance ($\rho = 0.92$), and cell-type composition varies smoothly along the inferred trajectories, consistent with known stromal remodeling. Conclusion: Vision foundation models can implicitly learn representations of continuous disease progression from static images alone. Trajectory fidelity is a new, complementary measure of representation quality, and the framework can extend to other domains where continuous processes are observed through static snapshots. Abstract: Vision foundation models trained on discretely sampled images achieve strong performance on classification benchmarks, yet whether their representations encode the continuous processes underlying their training data remains unclear. This question is especially pertinent in computational pathology, where we posit that models whose latent representations implicitly capture continuous disease progression may better reflect underlying biology, support more robust generalization, and enable quantitative analyses of features associated with disease transitions. Using diffusion pseudotime, a method developed to infer developmental trajectories from single-cell transcriptomics, we probe whether foundation models organize disease states along coherent progression directions in representation space. Across four cancer progressions and six models, we find that all pathology-specific models recover trajectory orderings significantly exceeding null baselines, with vision-only models achieving the highest fidelities ($\tau > 0.78$ on CRC-Serrated). Model rankings by trajectory fidelity on reference diseases strongly predict few-shot classification performance on held-out diseases ($\rho = 0.92$), and exploratory analysis shows cell-type composition varies smoothly along inferred trajectories in patterns consistent with known stromal remodeling. Together, these results demonstrate that vision foundation models can implicitly learn to represent continuous processes from independent static observations, and that trajectory fidelity provides a complementary measure of representation quality beyond downstream performance. While demonstrated in pathology, this framework could be applied to other domains where continuous processes are observed through static snapshots.
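
The abstract reports trajectory fidelity as $\tau$ and rank agreement as $\rho$. A minimal sketch, assuming these denote Kendall's tau and Spearman's rho (the abstract does not name the statistics), shows how such scores can be computed from known disease stages and inferred pseudotime. All values below are invented for illustration.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs; assumes no ties."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical data: ordinal disease stage per sample and the inferred pseudotime.
stages = [0, 1, 2, 3, 4]
pseudotime = [0.05, 0.20, 0.15, 0.60, 0.90]
tau = kendall_tau(stages, pseudotime)

# Hypothetical model rankings: by trajectory fidelity and by few-shot accuracy.
rho = spearman_rho([1, 2, 3, 4, 5, 6], [1, 3, 2, 4, 5, 6])
```

In the toy data a single pair of samples is ordered incorrectly, so nine of the ten pairs are concordant and `tau` evaluates to 0.8.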

Ji-Xuan He,Guohang Zhuang,Junge Bo,Tingyi Li,Chen Ling,Yanan Qiao

Main category: cs.CV

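TL;DR: This paper proposes SR²-Net, a lightweight plug-and-play spectral rectification network for hyperspectral image super-resolution (HSI-SR). Its enhance-then-rectify pipeline raises spatial resolution while preserving spectral consistency and physical plausibility.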

Details

Motivation: Existing HSI-SR methods exploit spatial correlations to raise spatial resolution but often neglect spectral consistency across bands, producing artifacts and physically unreliable results. Enforcing spectral consistency through the network architecture itself sacrifices generality and flexibility. Method: SR²-Net combines Hierarchical Spectral-Spatial Synergy Attention (H-S³A), which strengthens cross-band interactions, with Manifold Consistency Rectification (MCR), which constrains reconstructed spectra to a compact, physically plausible spectral manifold. A degradation-consistency loss enforces data fidelity. Result: Across multiple benchmarks and backbones, SR²-Net improves spectral fidelity and overall reconstruction quality with negligible computational overhead. Conclusion: SR²-Net is a general, flexible, and lightweight plug-and-play module that addresses spectral distortion in HSI-SR while balancing performance and practicality. Abstract: HSI-SR aims to enhance spatial resolution while preserving spectrally faithful and physically plausible characteristics. Recent methods have achieved great progress by leveraging spatial correlations to enhance spatial resolution. However, these methods often neglect spectral consistency across bands, leading to spurious oscillations and physically implausible artifacts. While spectral consistency can be addressed by designing the network architecture, it results in a loss of generality and flexibility. To address this issue, we propose a lightweight plug-and-play rectifier, the physical-prior Spectral Rectification Super-Resolution Network (SR$^{2}$-Net), which can be attached to a wide range of HSI-SR models without modifying their architectures. SR$^{2}$-Net follows an enhance-then-rectify pipeline consisting of (i) Hierarchical Spectral-Spatial Synergy Attention (H-S$^{3}$A) to reinforce cross-band interactions and (ii) Manifold Consistency Rectification (MCR) to constrain the reconstructed spectra to a compact, physically plausible spectral manifold. In addition, we introduce a degradation-consistency loss to enforce data fidelity by encouraging the degraded SR output to match the observed low-resolution input. Extensive experiments on multiple benchmarks and diverse backbones demonstrate consistent improvements in spectral fidelity and overall reconstruction quality with negligible computational overhead. Our code will be released upon publication.

[103] Dynamical Adapter Fusion: Constructing A Global Adapter for Pre-Trained Model-based Class-Incremental Learning

Ruiqi Liu,Boyu Diao,Zijia An,Zhulin An,Fei Wang,Yongjun Xu

Main category: cs.CV

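TL;DR: This paper proposes Dynamical Adapter Fusion (DAF), which uses PAC-Bayes theory and a Taylor expansion of the loss to dynamically fuse task-specific, previous global, and initialization parameters into a single robust global adapter. DAF mitigates catastrophic forgetting, improves knowledge transfer, and reaches SOTA on multiple class-incremental learning benchmarks.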

Details

Motivation: In existing CIL methods, task-specific adapters transfer knowledge poorly and incur high retrieval costs, while naive parameter fusion leads to interference and catastrophic forgetting. Method: Design a fusion mechanism grounded in the PAC-Bayes theorem that integrates the task-specific adapter, the previous global adapter, and the initialization parameters; derive the optimal fusion coefficients from a Taylor expansion of the loss; and introduce a robust initialization strategy to capture global knowledge patterns. Result: DAF achieves state-of-the-art (SOTA) performance on multiple class-incremental learning benchmarks. Conclusion: By dynamically balancing stability and plasticity, DAF mitigates forgetting and strengthens cross-task knowledge sharing, giving CIL an efficient and robust adapter-fusion paradigm. Abstract: Class-Incremental Learning (CIL) requires models to continuously acquire new classes without forgetting previously learned ones. A dominant paradigm involves freezing a pre-trained model and training lightweight, task-specific adapters. However, maintaining task-specific parameters hinders knowledge transfer and incurs high retrieval costs, while naive parameter fusion often leads to destructive interference and catastrophic forgetting. To address these challenges, we propose Dynamical Adapter Fusion (DAF) to construct a single robust global adapter. Grounded in the PAC-Bayes theorem, we derive a fusion mechanism that explicitly integrates three components: the optimized task-specific adapter parameters, the previous global adapter parameters, and the initialization parameters. We utilize the Taylor expansion of the loss function to derive the optimal fusion coefficients, dynamically achieving the best balance between stability and plasticity. Furthermore, we propose a Robust Initialization strategy to effectively capture global knowledge patterns. Experiments on multiple CIL benchmarks demonstrate that DAF achieves state-of-the-art (SOTA) performance.

[104] Semantic-Guided Dynamic Sparsification for Pre-Trained Model-based Class-Incremental Learning

Ruiqi Liu,Boyu Diao,Zijia An,Runjie Shao,Zhulin An,Fei Wang,Yongjun Xu

Main category: cs.CV

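TL;DR: This paper proposes Semantic-Guided Dynamic Sparsification (SGDS) for class-incremental learning (CIL). Instead of imposing rigid constraints on the parameter space, SGDS builds class-specific sparse subspaces in the activation space to reduce inter-task interference, achieving state-of-the-art performance on multiple benchmark datasets.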

Details

Motivation: Existing methods that constrain lightweight adapters to be orthogonal prevent inter-task interference but harm model plasticity. Method: Semantic-Guided Dynamic Sparsification (SGDS) governs the orientation and rank of activation subspaces through targeted sparsification. It encourages similar classes to share a compact activation subspace to promote knowledge transfer, and assigns non-overlapping activation subspaces to dissimilar classes to avoid interference. Result: Experiments on multiple benchmark datasets confirm the effectiveness of SGDS, which achieves state-of-the-art performance. Conclusion: By applying flexible constraints in the activation space rather than the parameter space, SGDS balances stability and plasticity more effectively and offers a new direction for class-incremental learning. Abstract: Class-Incremental Learning (CIL) requires a model to continually learn new classes without forgetting old ones. A common and efficient solution freezes a pre-trained model and employs lightweight adapters, whose parameters are often forced to be orthogonal to prevent inter-task interference. However, we argue that this parameter-constraining method is detrimental to plasticity. To this end, we propose Semantic-Guided Dynamic Sparsification (SGDS), a novel method that proactively guides the activation space by governing the orientation and rank of its subspaces through targeted sparsification. Specifically, SGDS promotes knowledge transfer by encouraging similar classes to share a compact activation subspace, while simultaneously preventing interference by assigning non-overlapping activation subspaces to dissimilar classes. By sculpting class-specific sparse subspaces in the activation space, SGDS effectively mitigates interference without imposing rigid constraints on the parameter space. Extensive experiments on various benchmark datasets demonstrate the state-of-the-art performance of SGDS.

[105] Towards Geometry-Aware and Motion-Guided Video Human Mesh Recovery

Hongjun Chen,Huan Zheng,Wencheng Han,Jianbing Shen

Main category: cs.CV

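TL;DR: This paper proposes HMRMamba, a new paradigm for video-based 3D human mesh recovery (HMR) built on structured state space models (SSMs). A geometry-aware lifting module and a motion-guided reconstruction network improve reconstruction accuracy, temporal consistency, and computational efficiency.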

Details

Motivation: Existing video-based 3D human mesh recovery methods rely on flawed intermediate 3D pose anchors and struggle to model complex spatiotemporal dynamics, so their results are often physically implausible. Method: HMRMamba has two core modules: 1) a Geometry-Aware Lifting Module with a dual-scan Mamba architecture that uses geometric cues from image features to lift 2D poses directly to 3D; and 2) a Motion-guided Reconstruction Network that takes the lifted pose sequence as an anchor and explicitly models kinematic patterns over time. Result: HMRMamba sets a new SOTA on 3DPW, MPI-INF-3DHP, and Human3.6M, with better reconstruction accuracy and temporal consistency at higher computational efficiency. Conclusion: By introducing SSMs and combining geometric and motion modeling, HMRMamba addresses the physical implausibility and weak spatiotemporal modeling of earlier HMR methods and provides an efficient, reliable paradigm for video HMR. Abstract: Existing video-based 3D Human Mesh Recovery (HMR) methods often produce physically implausible results, stemming from their reliance on flawed intermediate 3D pose anchors and their inability to effectively model complex spatiotemporal dynamics. To overcome these deep-rooted architectural problems, we introduce HMRMamba, a new paradigm for HMR that pioneers the use of Structured State Space Models (SSMs) for their efficiency and long-range modeling prowess. Our framework is distinguished by two core contributions. First, the Geometry-Aware Lifting Module, featuring a novel dual-scan Mamba architecture, creates a robust foundation for reconstruction. It directly grounds the 2D-to-3D pose lifting process with geometric cues from image features, producing a highly reliable 3D pose sequence that serves as a stable anchor. Second, the Motion-guided Reconstruction Network leverages this anchor to explicitly process kinematic patterns over time. By injecting this crucial temporal awareness, it significantly enhances the final mesh's coherence and robustness, particularly under occlusion and motion blur. Comprehensive evaluations on 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks confirm that HMRMamba sets a new state-of-the-art, outperforming existing methods in both reconstruction accuracy and temporal consistency while offering superior computational efficiency.

[106] Rectifying Geometry-Induced Similarity Distortions for Real-World Aerial-Ground Person Re-Identification

Kailash A. Hambarde,Hugo Proença

Main category: cs.CV

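TL;DR: This paper proposes GIQT, a geometry-induced query-key transformation that explicitly rectifies the cross-view similarity space, and pairs it with geometry-conditioned prompt generation to make aerial-ground person re-identification more robust to extreme viewpoint and scale changes.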

Details

Motivation: Existing AG-ReID methods implicitly assume that the geometry-invariant dot-product similarity used in attention remains reliable under large viewpoint and scale variations. Under extreme camera geometry this assumption fails: the similarity space is distorted and matching performance degrades. Method: Geometry-Induced Query-Key Transformation (GIQT) is a lightweight low-rank module that explicitly rectifies the query-key similarity computation using camera geometry. A geometry-conditioned prompt generation mechanism adds global, view-adaptive representation priors. Result: Experiments on four AG-ReID benchmarks show improved robustness under extreme and previously unseen geometric conditions, with minimal computational overhead. Conclusion: Explicitly modeling how camera geometry affects the similarity space is more effective than learning geometry-invariant features alone. GIQT provides an interpretable, low-overhead, and robust paradigm for cross-view matching. Abstract: Aerial-ground person re-identification (AG-ReID) is fundamentally challenged by extreme viewpoint and distance discrepancies between aerial and ground cameras, which induce severe geometric distortions and invalidate the assumption of a shared similarity space across views. Existing methods primarily rely on geometry-aware feature learning or appearance-conditioned prompting, while implicitly assuming that the geometry-invariant dot-product similarity used in attention mechanisms remains reliable under large viewpoint and scale variations. We argue that this assumption does not hold. Extreme camera geometry systematically distorts the query-key similarity space and degrades attention-based matching, even when feature representations are partially aligned. To address this issue, we introduce Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that explicitly rectifies the similarity space by conditioning query-key interactions on camera geometry. Rather than modifying feature representations or the attention formulation itself, GIQT adapts the similarity computation to compensate for dominant geometry-induced anisotropic distortions. Building on this local similarity rectification, we further incorporate a geometry-conditioned prompt generation mechanism that provides global, view-adaptive representation priors derived directly from camera geometry. Experiments on four aerial-ground person re-identification benchmarks demonstrate that the proposed framework consistently improves robustness under extreme and previously unseen geometric conditions, while introducing minimal computational overhead compared to state-of-the-art methods.

[107] Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

Zihan Su,Hongyang Wei,Kangrui Cen,Yong Wang,Guanhua Chen,Chun Yuan,Xiangxiang Chu

Main category: cs.CV

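TL;DR: This paper proposes UniMRG, a post-training method for unified multimodal models (UMMs) that adds generation tasks for several intrinsic image representations (pixels, depth, and segmentation). These tasks improve visual understanding and let understanding and generation reinforce each other.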

Details

Motivation: Existing post-training methods for UMMs mainly use understanding to improve generation; the reverse direction, using generation to improve understanding, is largely unexplored. Method: UniMRG is an architecture-agnostic post-training method that, alongside standard visual understanding objectives, trains the model to generate multiple intrinsic representations of the input image (pixel reconstruction, depth maps, and segmentation maps). Result: Across diverse UMM architectures, the method notably improves fine-grained perception, reduces hallucinations, and strengthens spatial understanding, while preserving or even improving generation capability. Conclusion: Generating diverse intrinsic representations promotes a deeper, more comprehensive understanding of visual inputs and enables two-way reinforcement between understanding and generation. Abstract: Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.

[108] MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations

Xinan He,Kaiqing Lin,Yue Zhou,Jiaming Zhong,Wei Ye,Wenhui Yi,Bing Fan,Feng Ding,Haodong Li,Bo Cao,Bin Li

Main category: cs.CV

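TL;DR: This paper proposes a dual-path video forgery detection framework built on the "Manifold Projection Fluctuations" (MPF) phenomenon. A static manifold deviation branch detects spatial anomalies, and a micro-temporal fluctuation branch captures the structured temporal fingerprints that remain in AI-generated videos, allowing high-fidelity forgeries to be identified.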

Details

Motivation: Current video generation models produce fakes with very high visual quality and consistent macro-level semantics and timing, but the authors argue that such videos result from manifold fitting rather than physical recording, so detectable regularities in the underlying pixel logic (MPF) remain. Method: A hierarchical dual-path framework: 1) a Static Manifold Deviation Branch uses the perceptual boundaries of large-scale vision foundation models (VFMs) to detect spatial anomalies that deviate from the real-world manifold; 2) a Micro-Temporal Fluctuation Branch analyzes the residual MPF structure of high-fidelity videos that pass the first stage, providing fine-grained temporal discrimination. Result: According to the summary, the framework identifies high-fidelity forgeries from state-of-the-art video generators (such as Veo and Wan) without relying on training data or generator-specific priors, covering both macro-level deviations and micro-level computational fingerprints. Conclusion: AI-generated videos approach real footage in appearance, but their manifold-fitting mechanism leaves an inherent, quantifiable pixel-level temporal structure (MPF). The dual-path detection paradigm built on it gives high-fidelity video forensics a new theoretical basis and a practical method. Abstract: With the rapid advancement of video generation models such as Veo and Wan, the visual quality of synthetic content has reached a level where macro-level semantic errors and temporal inconsistencies are no longer prominent. However, this does not imply that the distinction between real and cutting-edge high-fidelity fake videos is untraceable. We argue that AI-generated videos are essentially products of a manifold-fitting process rather than a physical recording. Consequently, the pixel-composition logic of the residuals between consecutive frames in AI videos exhibits a structured and homogeneous characteristic. We term this phenomenon "Manifold Projection Fluctuations" (MPF). Driven by this insight, we propose a hierarchical dual-path framework that operates as a sequential filtering process. The first, the Static Manifold Deviation Branch, leverages the refined perceptual boundaries of Large-Scale Vision Foundation Models (VFMs) to capture residual spatial anomalies or physical violations that deviate from the natural real-world manifold (off-manifold). For the remaining high-fidelity videos that successfully reside on-manifold and evade spatial detection, we introduce the Micro-Temporal Fluctuation Branch as a secondary, fine-grained filter. By analyzing the structured MPF that persists even in visually perfect sequences, our framework ensures that forgeries are exposed regardless of whether they manifest as global real-world manifold deviations or subtle computational fingerprints.

[109] From Implicit Ambiguity to Explicit Solidity: Diagnosing Interior Geometric Degradation in Neural Radiance Fields for Dense 3D Scene Understanding

Jiangsan Zhao,Jakob Geipel,Kryzysztof Kusnierek

Main category: cs.CV

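TL;DR: This paper identifies Interior Geometric Degradation (IGD), a failure of NeRF's implicit density fields in dense, self-occluding scenes, and proposes SVRaster, an explicit geometric method based on sparse voxel rasterization that substantially raises instance recovery and is more robust to supervision noise.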

Details

Motivation: The reliability of NeRFs for quantitative 3D analysis in dense, self-occluding scenes is poorly understood; in particular, implicit density fields reconstruct hollow or fragmented structures under heavy occlusion. Method: An explicit geometric pipeline based on Sparse Voxel Rasterization (SVRaster), initialized from SfM feature geometry, projects 2D instance masks onto a voxel grid and enforces geometric separation through recursive splitting. Result: In dense scenes SVRaster reaches a 95.8% instance recovery rate, about 7 percentage points above state-of-the-art mask-supervised NeRFs; when mask quality degrades, it recovers 43% more instances than implicit methods. Conclusion: Explicit geometric priors are a prerequisite for reliable quantitative analysis in highly self-occluding 3D scenes. Abstract: Neural Radiance Fields (NeRFs) have emerged as a powerful paradigm for multi-view reconstruction, complementing classical photogrammetric pipelines based on Structure-from-Motion (SfM) and Multi-View Stereo (MVS). However, their reliability for quantitative 3D analysis in dense, self-occluding scenes remains poorly understood. In this study, we identify a fundamental failure mode of implicit density fields under heavy occlusion, which we term Interior Geometric Degradation (IGD). We show that transmittance-based volumetric optimization satisfies photometric supervision by reconstructing hollow or fragmented structures rather than solid interiors, leading to systematic instance undercounting. Through controlled experiments on synthetic datasets with increasing occlusion, we demonstrate that state-of-the-art mask-supervised NeRFs saturate at approximately 89% instance recovery in dense scenes, despite improved surface coherence and mask quality. To overcome this limitation, we introduce an explicit geometric pipeline based on Sparse Voxel Rasterization (SVRaster), initialized from SfM feature geometry. By projecting 2D instance masks onto an explicit voxel grid and enforcing geometric separation via recursive splitting, our approach preserves physical solidity and achieves a 95.8% recovery rate in dense clusters. A sensitivity analysis using degraded segmentation masks further shows that explicit SfM-based geometry is substantially more robust to supervision failure, recovering 43% more instances than implicit baselines. These results demonstrate that explicit geometric priors are a prerequisite for reliable quantitative analysis in highly self-occluding 3D scenes.

[110] MultiModal Fine-tuning with Synthetic Captions

Shohei Enomoto,Shin'ya Yamaguchi

Main category: cs.CV

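TL;DR: This paper uses multimodal large language models (MLLMs) to generate synthetic captions for unimodal image datasets, turning unimodal fine-tuning into multimodal fine-tuning. Together with a supervised contrastive loss and an inference strategy based on class-averaged text embeddings, the approach improves performance on multiple image classification benchmarks, especially in few-shot settings.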

Details

Motivation: Pre-training has shifted to multimodal learning to strengthen visual understanding, but fine-tuning remains mostly unimodal, so the rich multimodal representations from pre-training are underused and a fundamental modality gap persists. Method: Use MLLMs with prompts that combine class labels and domain context to generate high-quality synthetic captions for unimodal image data; fine-tune with a supervised contrastive loss that encourages same-class representations to cluster; at inference, use class-averaged text embeddings computed from the multiple synthetic captions per image. Result: The approach outperforms baseline methods on 13 image classification benchmarks, with especially large gains in few-shot learning. Conclusion: The work establishes a new paradigm of dataset enhancement that bridges the gap between multimodal pre-training and unimodal fine-tuning. Abstract: In this paper, we address a fundamental gap between pre-training and fine-tuning of deep neural networks: while pre-training has shifted from unimodal to multimodal learning with enhanced visual understanding, fine-tuning predominantly remains unimodal, limiting the benefits of rich pre-trained representations. To bridge this gap, we propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for fine-tuning models with a multimodal objective. Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions tailored for classification tasks. Furthermore, we introduce a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, along with a new inference technique that leverages class-averaged text embeddings from multiple synthetic captions per image. Extensive experiments across 13 image classification benchmarks demonstrate that our approach outperforms baseline methods, with particularly significant improvements in few-shot learning scenarios. Our work establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning. Our code is available at https://github.com/s-enmt/MMFT.
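
The inference step described above can be illustrated with a minimal sketch, assuming that caption embeddings are averaged per class and that an image is assigned to the class whose averaged embedding has the highest cosine similarity. The 2-D vectors stand in for real encoder outputs and the helper names are hypothetical; the released MMFT code may differ.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def class_prototypes(caption_embeddings):
    """Average the caption embeddings of each class, then L2-normalize."""
    protos = {}
    for label, embs in caption_embeddings.items():
        dim = len(embs[0])
        mean = [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
        protos[label] = normalize(mean)
    return protos

def classify(image_embedding, protos):
    """Return the class whose prototype has the highest cosine similarity."""
    img = normalize(image_embedding)
    return max(protos, key=lambda c: sum(a * b for a, b in zip(img, protos[c])))

# Toy text embeddings of synthetic captions (hypothetical values).
captions = {
    "cat": [[1.0, 0.1], [0.9, 0.0]],
    "dog": [[0.0, 1.0], [0.2, 0.8]],
}
protos = class_prototypes(captions)
prediction = classify([0.8, 0.3], protos)
```

Averaging several captions per class smooths out the wording of any single caption, which is one plausible reason to prefer it over a single prompt embedding.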

[111] Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

Yuxiang Huang,Mingye Li,Xu Han,Chaojun Xiao,Weilin Zhao,Ao Sun,Ziqi Yuan,Hao Zhou,Fandong Meng,Zhiyuan Liu

Main category: cs.CV

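TL;DR: This paper proposes Spava, a multi-GPU framework that accelerates long-video inference with sequence parallelism and optimized attention. Distributed approximate attention reduces computation and increases parallelism, and system-level optimizations deliver speedups of up to 12.72x without notable performance loss.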

Details

Motivation: Existing methods compress visual embeddings or apply sparse attention on a single GPU, which yields limited acceleration or degraded performance and makes longer, more complex videos hard to handle. Method: Spava uses a sequence-parallel strategy that distributes approximate attention computation, combined with system-level optimizations (such as load balancing and fused forward passes) to improve multi-GPU utilization. Result: Spava achieves inference speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, respectively, without notable performance loss. Conclusion: Spava removes a key bottleneck in long-video inference, delivering efficient, high-fidelity processing across multiple GPUs and a scalable solution for video understanding with large multimodal models. Abstract: The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB

[112] Variance & Greediness: A comparative study of metric-learning losses

Donghuo Zeng,Hao Niu,Zhi Li,Masato Taya

Main category: cs.CV

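TL;DR: This paper introduces the VARIANCE and GREEDINESS diagnostic framework to analyze the embedding geometry and optimization dynamics of seven metric-learning losses on image retrieval. It reveals a trade-off between efficiency and granularity and offers practical guidance.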

Details

Motivation: Metric learning is central to retrieval, yet its effects on embedding geometry and optimization dynamics are not well understood. Method: Build a diagnostic framework of VARIANCE (intra-/inter-class variance) and GREEDINESS (active ratio and gradient norms), and compare seven representative losses (Contrastive, Triplet, N-pair, InfoNCE, ArcFace, SCL, CCL) on five image-retrieval datasets. Result: Triplet and SCL preserve higher within-class variance and clearer inter-class margins, improving top-1 accuracy in fine-grained retrieval. Contrastive and InfoNCE compact embeddings quickly through many small updates, which accelerates convergence but may oversimplify class structure. N-pair achieves a large mean inter-class separation, but with uneven spacing. Conclusion: The losses trade efficiency (convergence speed) against granularity (structural fidelity): prefer Triplet/SCL when preserving diversity and discriminating hard samples matter, and Contrastive/InfoNCE when fast embedding compaction is the goal. Abstract: Metric learning is central to retrieval, yet its effects on embedding geometry and optimization dynamics are not well understood. We introduce a diagnostic framework, VARIANCE (intra-/inter-class variance) and GREEDINESS (active ratio and gradient norms), to compare seven representative losses, i.e., Contrastive, Triplet, N-pair, InfoNCE, ArcFace, SCL, and CCL, across five image-retrieval datasets. Our analysis reveals that Triplet and SCL preserve higher within-class variance and clearer inter-class margins, leading to stronger top-1 retrieval in fine-grained settings. In contrast, Contrastive and InfoNCE compact embeddings quickly through many small updates, accelerating convergence but potentially oversimplifying class structures. N-pair achieves a large mean separation but with uneven spacing. These insights reveal a form of efficiency-granularity trade-off and provide practical guidance: prefer Triplet/SCL when diversity preservation and hard-sample discrimination are critical, and Contrastive/InfoNCE when faster embedding compaction is desired.
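
The two diagnostics can be made concrete with a minimal sketch. The definitions below (mean squared distance to the class centroid, spread of class centroids, and the fraction of triplets with non-zero hinge loss) are plausible readings of "intra-/inter-class variance" and "active ratio"; they are assumptions, not the paper's exact formulas, and the toy embeddings are invented.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def intra_class_variance(by_class):
    """Mean squared distance of each embedding to its own class centroid."""
    total, count = 0.0, 0
    for points in by_class.values():
        c = centroid(points)
        total += sum(sq_dist(p, c) for p in points)
        count += len(points)
    return total / count

def inter_class_variance(by_class):
    """Mean squared distance of class centroids to the mean of the centroids."""
    cents = [centroid(p) for p in by_class.values()]
    g = centroid(cents)
    return sum(sq_dist(c, g) for c in cents) / len(cents)

def triplet_active_ratio(triplets, margin=0.2):
    """Fraction of (anchor, positive, negative) triplets with non-zero hinge loss."""
    active = sum(1 for a, p, n in triplets
                 if sq_dist(a, p) - sq_dist(a, n) + margin > 0)
    return active / len(triplets)

# Toy 2-D embeddings for two classes (hypothetical values).
embeddings = {"a": [[0.0, 0.0], [0.0, 2.0]], "b": [[4.0, 0.0], [4.0, 2.0]]}
intra = intra_class_variance(embeddings)
inter = inter_class_variance(embeddings)
ratio = triplet_active_ratio([
    ([0.0, 0.0], [0.0, 2.0], [4.0, 0.0]),  # negative is far away: inactive
    ([0.0, 0.0], [0.0, 2.0], [0.0, 1.0]),  # negative closer than positive: active
])
```

A loss that keeps `intra` high while `inter` grows preserves within-class diversity, whereas a falling active ratio indicates that fewer samples still contribute gradient.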

[113] Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization

Midou Guo,Qilin Yin,Wei Lu,Xiangyang Luo,Rui Yang

Main category: cs.CV

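TL;DR: This paper proposes RT-DeepLoc, a reconstruction-based weakly supervised framework for temporal forgery localization. A masked autoencoder (MAE) trained only on real videos exposes forged segments through their reconstruction error, and an Asymmetric Intra-video Contrastive Loss (AICL) improves robustness and generalization, reaching SOTA on datasets including LAV-DF.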

Details

Motivation: Modern deepfakes are increasingly localized and intermittent, which calls for fine-grained temporal localization; frame-level annotation is prohibitively expensive, so weakly supervised methods that rely only on video-level labels are needed. Method: RT-DeepLoc 1) trains a Masked Autoencoder only on authentic data to model normal spatiotemporal patterns; 2) uses the reconstruction error caused by forged segments as a localization cue; and 3) introduces the Asymmetric Intra-video Contrastive Loss (AICL), guided by reconstruction error, to increase the compactness of authentic features and sharpen local discrimination of forgeries. Result: On large-scale datasets including LAV-DF, RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization. Conclusion: Weakly supervised localization based on reconstruction error is effective and practical. Combined with AICL, it balances local discrimination with generalization across forgery types and offers a low-cost, high-accuracy approach to deepfake detection. Abstract: Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization. The prohibitive cost of frame-level annotation makes weakly supervised methods, which rely only on video-level labels, a practical necessity. To this end, we propose Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc), a weakly supervised temporal forgery localization framework that identifies forgeries via reconstruction errors. Our framework uses a Masked Autoencoder (MAE) trained exclusively on authentic data to learn its intrinsic spatiotemporal patterns; this allows the model to produce significant reconstruction discrepancies for forged segments, effectively providing the missing fine-grained cues for localization. To robustly leverage these indicators, we introduce a novel Asymmetric Intra-video Contrastive Loss (AICL). By focusing on the compactness of authentic features guided by these reconstruction cues, AICL establishes a stable decision boundary that enhances local discrimination while preserving generalization to unseen forgeries. Extensive experiments on large-scale datasets, including LAV-DF, demonstrate that RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization.
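
The core cue, reconstruction error that rises on forged segments, can be illustrated with a deliberately simplified sketch. It assumes per-frame errors are already available from an autoencoder trained on real videos and uses a fixed threshold, whereas RT-DeepLoc learns its decision boundary with AICL. The error values and the threshold are hypothetical.

```python
def localize_segments(errors, threshold):
    """Return [start, end) frame-index pairs where the error exceeds the threshold."""
    segments, start = [], None
    for i, e in enumerate(errors):
        if e > threshold and start is None:
            start = i                      # a suspicious run begins
        elif e <= threshold and start is not None:
            segments.append((start, i))    # the run ended at the previous frame
            start = None
    if start is not None:                  # run continues to the last frame
        segments.append((start, len(errors)))
    return segments

# Hypothetical per-frame reconstruction errors for one video.
frame_errors = [0.02, 0.03, 0.40, 0.55, 0.04, 0.03, 0.61, 0.02]
forged = localize_segments(frame_errors, threshold=0.2)
```

With these toy values the function returns the segments `(2, 4)` and `(6, 7)`, the two runs of frames whose error is well above the level seen on authentic frames.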

[114] Hypernetwork-Based Adaptive Aggregation for Multimodal Multiple-Instance Learning in Predicting Coronary Calcium Debulking

Kaito Shiku,Ichika Seo,Tetsuya Matoba,Rissei Hino,Yasuhiro Nakano,Ryoma Bise

Main category: cs.CV

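TL;DR: This paper makes the first attempt to estimate from CT images whether coronary artery calcium debulking is needed. It formulates the task as multiple-instance learning (MIL) and proposes HyperAdAgFormer, a hypernetwork-based adaptive aggregation transformer that uses tabular data to adjust the feature aggregation strategy dynamically.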

Details

Motivation: In clinical practice, physicians decide on calcium debulking by combining CT images with patient-specific tabular data (such as physiological measurements), but existing methods struggle to model images and heterogeneous tabular information jointly. Method: Formulate the task as multiple-instance learning (MIL) and propose HyperAdAgFormer, in which a hypernetwork generates the parameters of the transformer's attention-based aggregation module from each patient's tabular data, enabling personalized feature fusion. Result: Experiments on a real clinical dataset confirm the effectiveness of HyperAdAgFormer, which clearly outperforms baseline methods. Conclusion: HyperAdAgFormer jointly models imaging and tabular data, providing an interpretable, personalized AI aid for clinical decision support; the code is open source. Abstract: In this paper, we present the first attempt to estimate the necessity of debulking coronary artery calcifications from computed tomography (CT) images. We formulate this task as a Multiple-Instance Learning (MIL) problem. The difficulty of this task lies in the fact that physicians adjust their focus and decision criteria for device usage according to tabular data representing each patient's condition. To address this issue, we propose a hypernetwork-based adaptive aggregation transformer (HyperAdAgFormer), which adaptively modifies the feature aggregation strategy for each patient based on tabular data through a hypernetwork. Experiments on a clinical dataset demonstrate the effectiveness of HyperAdAgFormer. The code is publicly available at https://github.com/Shiku-Kaito/HyperAdAgFormer.

[115] SimGraph: A Unified Framework for Scene Graph-Based Image Generation and Editing

Thanh-Nhan Vo,Trong-Thuan Nguyen,Tam V. Nguyen,Minh-Triet Tran

Main category: cs.CV

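TL;DR: This paper proposes SimGraph, a unified scene graph-based framework for image generation and editing. By integrating token-based generation and diffusion-based editing, it gives precise control over object relationships and layout while maintaining spatial consistency and semantic coherence.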

Details

Motivation: Treating generation and editing separately leads to poor spatial consistency and semantic coherence, and existing methods lack structured control over object relationships and spatial layout. Method: SimGraph uses the scene graph as a unified representation and combines token-based image generation with diffusion-based image editing, driving both from a single model. Result: Experiments show that the method outperforms current state-of-the-art approaches on both image generation and editing, with clear gains in spatial consistency, semantic coherence, and structural controllability. Conclusion: Unified modeling based on scene graphs bridges the gap between generation and editing and provides a new paradigm for controllable visual content creation. Abstract: Recent advancements in Generative Artificial Intelligence (GenAI) have significantly enhanced the capabilities of both image generation and editing. However, current approaches often treat these tasks separately, leading to inefficiencies and challenges in maintaining spatial consistency and semantic coherence between generated content and edits. Moreover, a major obstacle is the lack of structured control over object relationships and spatial arrangements. Scene graph-based methods, which represent objects and their interrelationships in a structured format, offer a solution by providing greater control over composition and interactions in both image generation and editing. To address this, we introduce SimGraph, a unified framework that integrates scene graph-based image generation and editing, enabling precise control over object interactions, layouts, and spatial coherence. In particular, our framework integrates token-based generation and diffusion-based editing within a single scene graph-driven model, ensuring high-quality and consistent results. Through extensive experiments, we empirically demonstrate that our approach outperforms existing state-of-the-art methods.

[116] HERS: Hidden-Pattern Expert Learning for Risk-Specific Vehicle Damage Adaptation in Diffusion Models

Teerapong Panboonyuen

Main category: cs.CV

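TL;DR: This paper proposes HERS, a framework that adapts text-to-image diffusion models to vehicle damage generation through annotation-free domain-expert adaptation, improving fidelity, controllability, and domain alignment. It raises text faithfulness and human preference scores and discusses implications for insurance fraud detection and safe deployment.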

Details

Motivation: As text-to-image diffusion models synthesize increasingly realistic vehicle damage, their reliability in automated insurance workflows comes into question, and they could be misused for insurance fraud or claim manipulation. Method: HERS (Hidden-Pattern Expert Learning for Risk-Specific Damage Adaptation) uses a large language model and a T2I pipeline to generate image-text pairs automatically, models each damage category (such as dents and scratches) as a separate expert in a self-supervised manner, and merges the experts into a unified multi-damage model, achieving domain fine-tuning without manual annotation. Result: Across four diffusion backbones, HERS improves text faithfulness by 5.5% and human preference ratings by 2.3% over baselines, and it improves the auditability and anti-fraud potential of generated images in insurance settings. Conclusion: HERS illustrates both the value and the challenges of domain-specific diffusion models in high-stakes applications, and stresses the need for trustworthy, controllable generation in safety-critical settings such as auto insurance. Abstract: Recent advances in text-to-image (T2I) diffusion models have enabled increasingly realistic synthesis of vehicle damage, raising concerns about their reliability in automated insurance workflows. The ability to generate crash-like imagery challenges the boundary between authentic and synthetic data, introducing new risks of misuse in fraud or claim manipulation. To address these issues, we propose HERS (Hidden-Pattern Expert Learning for Risk-Specific Damage Adaptation), a framework designed to improve fidelity, controllability, and domain alignment of diffusion-generated damage images. HERS fine-tunes a base diffusion model via domain-specific expert adaptation without requiring manual annotation. Using self-supervised image-text pairs automatically generated by a large language model and T2I pipeline, HERS models each damage category, such as dents, scratches, broken lights, or cracked paint, as a separate expert. These experts are later integrated into a unified multi-damage model that balances specialization with generalization. We evaluate HERS across four diffusion backbones and observe consistent improvements: +5.5% in text faithfulness and +2.3% in human preference ratings compared to baselines. Beyond image fidelity, we discuss implications for fraud detection, auditability, and safe deployment of generative models in high-stakes domains. Our findings highlight both the opportunities and risks of domain-specific diffusion, underscoring the importance of trustworthy generation in safety-critical applications such as auto insurance.

[117] Vision KAN: Towards an Attention-Free Backbone for Vision with Kolmogorov-Arnold Networks

Zhuoqin Yang,Jiansong Zhang,Xiaoling Luo,Xu Wu,Zheng Lu,Linlin Shen

Main category: cs.CV

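TL;DR: This paper proposes Vision KAN (ViK), an attention-free alternative built on Kolmogorov-Arnold Networks. Its MultiPatch-RBFKAN token mixer has linear complexity and reaches accuracy on ImageNet-1K comparable to attention-based models.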

Details

Motivation: Attention has quadratic computational complexity and is hard to interpret, while recent attention-free architectures show that strong performance does not require pairwise attention, which motivates more efficient and interpretable alternatives. Method: Vision KAN (ViK) is built around MultiPatch-RBFKAN, which combines (a) a patch-wise nonlinear transform using radial basis function-based KANs, (b) axis-wise separable mixing for efficient local propagation, and (c) a low-rank global mapping for long-range interaction. A patch-wise grouping strategy with lightweight operators replaces full KANs so that the design scales to high-resolution features. Result: On ImageNet-1K, ViK achieves competitive classification accuracy with linear complexity, supporting KAN-based token mixing as an effective and efficient alternative to attention. Conclusion: ViK shows that KAN-based token mixing is an efficient, theoretically grounded, and scalable alternative to attention, offering a new direction for vision backbone design. Abstract: Attention mechanisms have become a key module in modern vision backbones due to their ability to model long-range dependencies. However, their quadratic complexity in sequence length and the difficulty of interpreting attention weights limit both scalability and clarity. Recent attention-free architectures demonstrate that strong performance can be achieved without pairwise attention, motivating the search for alternatives. In this work, we introduce Vision KAN (ViK), an attention-free backbone inspired by the Kolmogorov-Arnold Networks. At its core lies MultiPatch-RBFKAN, a unified token mixer that combines (a) patch-wise nonlinear transform with Radial Basis Function-based KANs, (b) axis-wise separable mixing for efficient local propagation, and (c) low-rank global mapping for long-range interaction. Employed as a drop-in replacement for attention modules, this formulation tackles the prohibitive cost of full KANs on high-resolution features by adopting a patch-wise grouping strategy with lightweight operators to restore cross-patch dependencies. Experiments on ImageNet-1K show that ViK achieves competitive accuracy with linear complexity, demonstrating the potential of KAN-based token mixing as an efficient and theoretically grounded alternative to attention.
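
To make component (a) more tangible, the sketch below shows a generic radial-basis-function KAN layer, in which every input-output edge carries its own learnable 1-D function $\phi(x) = \sum_k w_k \exp(-((x - c_k)/h)^2)$. This is a textbook-style illustration under assumed Gaussian bases with shared centers; it is not the MultiPatch-RBFKAN implementation, and all names and values are hypothetical.

```python
import math

def rbf_kan_edge(x, centers, weights, width=1.0):
    """Learnable 1-D edge function: sum_k w_k * exp(-((x - c_k) / width)^2)."""
    return sum(w * math.exp(-((x - c) / width) ** 2)
               for c, w in zip(centers, weights))

def rbf_kan_layer(inputs, centers, weights, width=1.0):
    """KAN layer: output_j = sum_i phi_{j,i}(x_i).

    weights[j][i] holds the RBF coefficients of the edge from input i to output j.
    """
    return [sum(rbf_kan_edge(x, centers, w_ji, width)
                for x, w_ji in zip(inputs, row))
            for row in weights]

# One output unit, two inputs, three shared RBF centers (toy values).
centers = [-1.0, 0.0, 1.0]
weights = [[[0.0, 1.0, 0.0],    # edge from input 0: bump centered at 0
            [0.0, 0.0, 1.0]]]   # edge from input 1: bump centered at 1
output = rbf_kan_layer([0.0, 1.0], centers, weights)
```

Each edge costs a fixed number of basis evaluations, so applying such a layer patch by patch scales linearly with the number of tokens, in contrast to pairwise attention.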

[118] Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Hongxu Chen,Hongxiang Li,Zhen Wang,Long Chen

Main category: cs.CV

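TL;DR: This paper proposes BA-solver, a new solver that pairs a lightweight SideNet with the frozen backbone to sharply reduce the number of neural function evaluations (NFEs) that Flow Matching models need for generation, without significantly raising training cost, while preserving high fidelity and plug-and-play compatibility.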

Details

Motivation: Flow Matching models face a significant latency bottleneck because they rely on iterative ODE solving; existing acceleration methods either degrade badly at low NFEs or incur high training costs and lack generality. Method: Proposes the Bi-Anchor Interpolation Solver (BA-solver): 1) Bidirectional Temporal Perception, in which a SideNet (only 1-2% of the backbone's parameters) learns to approximate future and historical velocities without retraining the backbone; 2) Bi-Anchor Velocity Integration, which combines high-precision anchor velocities from the backbone with the SideNet to estimate intermediate velocities efficiently, supporting batched high-order integration. Result: On ImageNet-256^2, BA-solver matches the generation quality of a 100+ NFE Euler solver with only 10 NFEs and maintains high fidelity at 5 NFEs; training overhead is negligible, and it integrates seamlessly into existing generative pipelines (e.g., image editing). Conclusion: BA-solver retains the versatility of training-free solvers while delivering efficient, low-overhead, high-fidelity acceleration of Flow Matching, offering a new option for practical deployment. Abstract: Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error.
Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to that of a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
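
Illustrative sketch (not the authors' code): to make the NFE budget concrete, the snippet below implements the fixed-step Euler baseline that the abstract compares against, not BA-solver itself. It integrates dx/dt = v(x, t) from t = 0 to t = 1 and counts one velocity evaluation per step. The `velocity` callable stands in for the frozen backbone and is a hypothetical placeholder.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Returns the final state and the number of function evaluations (NFEs).
    """
    x = x0
    dt = 1.0 / num_steps
    nfe = 0
    for step in range(num_steps):
        t = step * dt
        x = x + dt * velocity(x, t)  # one (hypothetical) backbone call per step
        nfe += 1
    return x, nfe
```

With a curved trajectory such as dx/dt = x, the Euler error shrinks only as the step count grows, which is why a plain Euler solver needs many NFEs and why cheaper estimates of intermediate velocities are attractive.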

[119] Unifying Heterogeneous Degradations: Uncertainty-Aware Diffusion Bridge Model for All-in-One Image Restoration

Luwei Tu,Jiawei Wu,Xing Luo,Zhi Jin

Main category: cs.CV

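TL;DR: This paper proposes an Uncertainty-Aware Diffusion Bridge Model (UDBM) that recasts All-in-One Image Restoration (AiOIR) as a stochastic transport problem guided by pixel-wise uncertainty; a relaxed diffusion bridge and a dual modulation strategy address conflicting optimization objectives across degradations and the drift singularity, achieving state-of-the-art performance with single-step inference.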

Details

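Motivation: All-in-One image restoration faces the fundamental challenge of conflicting optimization objectives across degradation types; existing methods are limited by coarse-grained control or fixed mapping schedules and adapt poorly. Method: Proposes the Uncertainty-Aware Diffusion Bridge Model (UDBM), which models AiOIR as a stochastic transport problem driven by pixel-wise uncertainty; introduces a relaxed diffusion bridge to model degradation uncertainty and resolve the drift singularity of standard diffusion bridges; designs a dual modulation strategy in which the noise schedule aligns degradations into a high-entropy latent space and the path schedule adaptively regulates the transport trajectory based on the viscous dynamics of entropy regularization. Result: UDBM achieves state-of-the-art performance across multiple image restoration tasks with single-step inference. Conclusion: By rectifying the transport geometry and dynamics, UDBM improves the generalization and robustness of AiOIR under heterogeneous degradations and offers a new approach to unified image restoration. Abstract: All-in-One Image Restoration (AiOIR) faces the fundamental challenge of reconciling conflicting optimization objectives across heterogeneous degradations. Existing methods are often constrained by coarse-grained control mechanisms or fixed mapping schedules, yielding suboptimal adaptation. To address this, we propose an Uncertainty-Aware Diffusion Bridge Model (UDBM), which innovatively reformulates AiOIR as a stochastic transport problem steered by pixel-wise uncertainty. By introducing a relaxed diffusion bridge formulation which replaces the strict terminal constraint with a relaxed constraint, we model the uncertainty of degradations while theoretically resolving the drift singularity inherent in standard diffusion bridges. Furthermore, we devise a dual modulation strategy: the noise schedule aligns diverse degradations into a shared high-entropy latent space, while the path schedule adaptively regulates the transport trajectory motivated by the viscous dynamics of entropy regularization. By effectively rectifying the transport geometry and dynamics, UDBM achieves state-of-the-art performance across diverse restoration tasks within a single inference step.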

[120] HydroSense: A Dual-Microcontroller IoT Framework for Real-Time Multi-Parameter Water Quality Monitoring with Edge Processing and Cloud Analytics

Abdul Hasib,A. S. M. Ahsanul Sarkar Akib,Anish Giri

Main category: cs.CV

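TL;DR: This paper presents HydroSense, a low-cost, high-accuracy IoT water quality monitoring system that integrates six parameters (pH, dissolved oxygen, temperature, total dissolved solids, estimated nitrogen, and water level). It uses a dual-microcontroller architecture (Arduino Uno and ESP32) with calibration, signal processing, and cloud integration; in a 90-day experiment it showed high accuracy and 99.8% transmission reliability at 15% of the cost of commercial systems, making it suitable for resource-constrained regions.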

Details

Motivation: The global water crisis calls for affordable, accurate, real-time water quality monitoring, yet manual sampling and expensive commercial systems are hard to deploy in resource-poor regions. Method: Proposes the HydroSense IoT framework with a dual-microcontroller architecture: an Arduino Uno handles precision analog measurement with five-point calibration, and an ESP32 handles wireless connectivity, edge processing, and cloud integration; signal processing includes median filtering (for TDS), temperature compensation, and robust error handling. Result: A 90-day experiment showed pH accuracy of ±0.08 (0-14 range), DO stability of ±0.2 mg/L, TDS accuracy of ±1.9% (0-1000 ppm), and 99.8% cloud data transmission reliability; the total cost is 32,983 BDT (about 300 USD), 85% lower than commercial systems. Conclusion: Through its system architecture and low-cost component selection, HydroSense delivers professional-grade water quality monitoring and sets an example for environmental monitoring in resource-constrained settings. Abstract: The global water crisis necessitates affordable, accurate, and real-time water quality monitoring solutions. Traditional approaches relying on manual sampling or expensive commercial systems fail to address accessibility challenges in resource-constrained environments. This paper presents HydroSense, an innovative Internet of Things framework that integrates six critical water quality parameters including pH, dissolved oxygen (DO), temperature, total dissolved solids (TDS), estimated nitrogen, and water level into a unified monitoring system. HydroSense employs a novel dual-microcontroller architecture, utilizing Arduino Uno for precision analog measurements with five-point calibration algorithms and ESP32 for wireless connectivity, edge processing, and cloud integration. The system implements advanced signal processing techniques including median filtering for TDS measurement, temperature compensation algorithms, and robust error handling. Experimental validation over 90 days demonstrates exceptional performance metrics: pH accuracy of plus or minus 0.08 units across the 0 to 14 range, DO measurement stability within plus or minus 0.2 mg/L, TDS accuracy of plus or minus 1.9 percent across 0 to 1000 ppm, and 99.8 percent cloud data transmission reliability. With a total implementation cost of 32,983 BDT (approximately 300 USD), HydroSense achieves an 85 percent cost reduction compared to commercial systems while providing enhanced connectivity through the Firebase real-time database.
This research establishes a new paradigm for accessible environmental monitoring, demonstrating that professional-grade water quality assessment can be achieved through intelligent system architecture and cost-effective component selection.
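
Illustrative sketch (not the authors' firmware): the abstract mentions median filtering for TDS measurement and temperature compensation. The snippet below shows one common way these two steps are done on a burst of raw readings. The function names are hypothetical, and the linear compensation with a coefficient of about 2% per °C referenced to 25 °C is a widely used default assumed here; the abstract does not state which formula HydroSense uses.

```python
import statistics

def median_filter(samples):
    """Return the median of a burst of raw readings, rejecting isolated spikes."""
    return statistics.median(samples)

def compensate_to_25c(reading, temp_c, coeff=0.02):
    """Normalize a conductivity-derived reading to 25 degrees Celsius.

    coeff=0.02 (about 2 % per degree) is an assumed default, not a value
    taken from the paper.
    """
    return reading / (1.0 + coeff * (temp_c - 25.0))
```

A median is preferred over a mean for this kind of sensor because a single electrical glitch (for example one reading of 500 among values near 12) does not shift the result.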

[121] WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models

Zijin Yang,Yu Sun,Kejiang Chen,Jiawei Zhao,Jun Jiang,Weiming Zhang,Nenghai Yu

Main category: cs.CV

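TL;DR: This paper proposes WMVLM, the first unified, interpretable framework for evaluating diffusion model image watermarking with vision-language models (VLMs); it redefines quality and security metrics separately for residual and semantic watermarks and uses a three-stage training strategy to achieve classification, scoring, and interpretable text generation.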

Details

Motivation: Existing watermark evaluation methods lack a unified framework, are not interpretable, neglect comprehensive security, and apply inappropriate metrics to semantic watermarks. Method: Proposes the WMVLM framework, which redefines quality and security metrics for residual watermarks (artifact strength and erasure resistance) and semantic watermarks (latent distribution shift), and designs a three-stage training strategy that progressively enables classification, scoring, and interpretable text generation. Result: WMVLM generalizes well across datasets, diffusion models, and watermarking methods, and outperforms state-of-the-art VLMs. Conclusion: WMVLM provides the first unified, interpretable evaluation approach for diffusion model image watermarking that covers both quality and security, supporting reliable development and evaluation of watermarking algorithms. Abstract: Digital watermarking is essential for securing generated images from diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often use inappropriate metrics for semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy to progressively enable the model to achieve classification, scoring, and interpretable text generation. Experiments show WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods.

[122] PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization

Songhan Jiang,Fengchun Liu,Ziyue Wang,Linghan Cai,Yongbing Zhang

Main category: cs.CV

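TL;DR: This paper introduces PathReasoner, the first large-scale whole-slide image (WSI) reasoning dataset, and builds the PathReasoner-R1 model on it; a knowledge-guided generation pipeline, trajectory-masked supervised fine-tuning, and reasoning-oriented reinforcement learning improve the interpretability and clinical trustworthiness of pathology diagnosis models.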

Details

Motivation: Current vision-language models for pathology diagnosis lack verifiable, evidence-linked reasoning, which limits clinical trust and experts' ability to correct errors. Method: Builds the PathReasoner dataset (20K+ high-quality samples) with a knowledge-graph-guided generation pipeline; proposes PathReasoner-R1, which combines trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning, and designs a knowledge-aware multi-granular reward function that includes an Entity Reward mechanism. Result: PathReasoner-R1 achieves state-of-the-art performance on PathReasoner and several public benchmarks, improving reasoning transparency and clinical plausibility across image scales. Conclusion: This work is the first to systematically bring structured, knowledge-aligned chain-of-thought reasoning into computational pathology, offering a route to trustworthy, interpretable, and correctable AI-assisted diagnosis. Abstract: Vision-Language Models (VLMs) are advancing computational pathology with superior visual understanding capabilities. However, current systems often reduce diagnosis to directly outputting conclusions without verifiable evidence-linked reasoning, which severely limits clinical trust and hinders expert error rectification. To address these barriers, we construct PathReasoner, the first large-scale dataset of whole-slide image (WSI) reasoning. Unlike previous work reliant on unverified distillation, we develop a rigorous knowledge-guided generation pipeline. By leveraging medical knowledge graphs, we explicitly align structured pathological findings and clinical reasoning with diagnoses, generating over 20K high-quality instructional samples. Based on this dataset, we propose PathReasoner-R1, which synergizes trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning to instill structured chain-of-thought capabilities. To ensure medical rigor, we engineer a knowledge-aware multi-granular reward function incorporating an Entity Reward mechanism strictly aligned with knowledge graphs. This effectively guides the model to optimize for logical consistency rather than mere outcome matching, thereby enhancing robustness. Extensive experiments demonstrate that PathReasoner-R1 achieves state-of-the-art performance on both PathReasoner and public benchmarks across various image scales, equipping pathology models with transparent, clinically grounded reasoning capabilities. Dataset and code are available at https://github.com/cyclexfy/PathReasoner-R1.

[123] Similarity of Processing Steps in Vision Model Representations

Matéo Mahaut,Marco Baroni

Main category: cs.CV

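TL;DR: This paper studies how different vision models arrive at similar representations and finds that, although final representations may be similar, the intermediate processing steps and operations differ substantially; in particular, classifier models discard low-level image statistics, and CNN and Transformer models change their representations in different ways.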

Details

Motivation: To investigate whether different vision models converge not only in their final representations but also in their intermediate processing steps and operations. Method: Quantifies the distance between model representations at different processing stages, tracks how these distances evolve throughout processing, and identifies the processing steps that differ most between models. Result: Layers at similar positions have the most similar representations, yet strong differences remain; classifier models discard low-level image statistics in their final layers; Transformer models change representations more smoothly from layer to layer than CNN models. Conclusion: Vision models differ in the level and nature of their representational convergence, which supports a more qualitative understanding of the internal processing of image models. Abstract: Recent literature suggests that the bigger the model, the more likely it is to converge to similar, "universal" representations, despite different training objectives, datasets, or modalities. While this literature shows that there is an area where model representations are similar, we study here how vision models might get to those representations -- in particular, do they also converge to the same intermediate steps and operations? We therefore study the processes that lead to convergent representations in different models. First, we quantify distance between different model representations at different stages. We follow the evolution of distances between models throughout processing, identifying the processing steps which are most different between models. We find that while layers at similar positions in different models have the most similar representations, strong differences remain. Classifier models, unlike the others, will discard information about low-level image statistics in their final layers. CNN- and transformer-based models also behave differently, with transformer models applying smoother changes to representations from one layer to the next. These distinctions clarify the level and nature of convergence between model representations, and enable a more qualitative account of the underlying processes in image models.
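
Illustrative sketch (not the authors' code): the abstract says that distances between model representations are quantified at different stages but does not name the measure. One standard option, assumed here purely for illustration, is representational similarity analysis: compute the pairwise distances between the same inputs within each layer, then correlate the two distance profiles. This works even when the two layers have different widths. All function names are hypothetical.

```python
import math

def pairwise_distances(reps):
    """Euclidean distances between all pairs of sample representations."""
    n = len(reps)
    return [math.dist(reps[i], reps[j]) for i in range(n) for j in range(i + 1, n)]

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def layer_similarity(reps_a, reps_b):
    """Correlate the pairwise-distance structure of two layers over the same inputs."""
    return pearson(pairwise_distances(reps_a), pairwise_distances(reps_b))
```

Evaluating `layer_similarity` for every pair of layers across two models gives a grid in which one can check whether layers at matching depths are the most similar, as the abstract reports.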

[124] A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion

Pu Cao,Yiyang Ma,Feng Zhou,Xuedan Yin,Qing Song,Lu Yang

Main category: cs.CV

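TL;DR: This paper shows that autoencoder (AE) evaluation in latent diffusion models is overly biased toward generative metrics (such as gFID) at the expense of reconstruction fidelity, and that this bias causes condition drift and harms controllability in controllable diffusion tasks; empirically, reconstruction metrics (especially instance-level ones) reflect controllability better than gFID.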

Details

Motivation: Existing ImageNet-scale AE studies rely too heavily on generative metrics (such as gFID) for model selection and neglect reconstruction fidelity, but in controllable diffusion tasks this bias can degrade condition alignment; the relationship between evaluation metrics and controllability needs to be clarified. Method: Theoretically analyzes why a gFID-dominant preference appears reasonable for ImageNet generation but is risky for controllable diffusion; proposes a multi-dimensional condition-drift evaluation protocol and systematically evaluates several recent ImageNet AEs on controllable generation tasks; uses ControlNet experiments to verify that controllability correlates with condition preservation rather than gFID. Result: gFID is only weakly correlated with condition preservation, whereas reconstruction fidelity (especially instance-level metrics) is closely aligned with controllability; ControlNet experiments further confirm that controllability is determined by condition preservation, not by gFID. Conclusion: The current ImageNet-centric AE evaluation practice does not meet the needs of controllable diffusion; reconstruction fidelity should be given more weight, and condition-drift and instance-level reconstruction evaluation should be included in model selection and benchmarking. Abstract: In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits achievable condition alignment. Meanwhile, we find that reconstruction fidelity, especially instance-level measures, better indicates controllability. We empirically validate the impact of tilted autoencoder evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol reflecting controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, offering practical guidance for more reliable benchmarking and model selection.

[125] RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning

Shiqi Huang,Shuting He,Bihan Wen

Main category: cs.CV

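TL;DR: This paper proposes the RSGround-R1 framework, which improves the spatial reasoning of multimodal large language models on remote sensing visual grounding through chain-of-thought supervised fine-tuning, position-aware reinforcement fine-tuning, and spatial-consistency-guided optimization.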

Details

Motivation: Remote sensing scenes are large in scale and semantically ambiguous, and natural language descriptions rely heavily on positional cues, which poses unique spatial reasoning challenges for multimodal large language models. Method: Proposes RSGround-R1, a reasoning-guided, position-aware post-training framework comprising: 1) Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) on synthetically generated RSVG reasoning data; 2) Reinforcement Fine-Tuning (RFT) with a distance-aware positional reward; 3) a spatial-consistency-guided optimization scheme that stabilizes policy updates. Result: Experiments on RSVG benchmarks show that the method outperforms existing approaches in both performance and generalization. Conclusion: RSGround-R1 strengthens the model's understanding of positional relationships in remote sensing images and its localization accuracy, suggesting a way to apply multimodal large models to spatially intensive tasks. Abstract: Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To leverage this unique feature, we propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) using synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented by our newly designed positional reward that provides continuous and distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behaviors across rollouts, we introduce a spatial consistency guided optimization scheme that dynamically adjusts policy updates based on their spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate superior performance and generalization of our model.
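
Illustrative sketch (not the authors' code): the abstract describes a positional reward that gives continuous, distance-aware guidance, but it does not give the formula. The snippet below shows one plausible shape for such a reward, assumed here for illustration only: an exponential decay in the distance between predicted and ground-truth box centers, normalized by the image diagonal. Unlike a thresholded IoU reward, it still provides a signal when the boxes do not overlap. All names and the decay form are hypothetical.

```python
import math

def box_center(box):
    """Center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def positional_reward(pred_box, gt_box, image_size):
    """Continuous reward in (0, 1]: 1 when centers coincide, decaying with distance.

    The center distance is divided by the image diagonal so that the reward
    does not depend on image resolution. This form is an assumption.
    """
    width, height = image_size
    diagonal = math.hypot(width, height)
    distance = math.dist(box_center(pred_box), box_center(gt_box))
    return math.exp(-distance / diagonal)
```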

[126] OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Yufeng Zhong,Lei Chen,Xuanle Zhao,Wenkang Han,Liming Zheng,Jing Huang,Deyang Jiang,Yilin Cao,Lin Ma,Zhixiong Zeng

Main category: cs.CV

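TL;DR: This paper proposes OCRVerse, the first end-to-end model that handles both text-centric OCR (e.g., documents) and vision-centric OCR (e.g., charts, web pages, scientific plots) in a unified way; multi-domain two-stage SFT-RL training with customized reward mechanisms enables cross-domain fusion and strong recognition performance.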

Details

Motivation: Existing OCR methods focus mainly on text recognition (text-centric OCR) and neglect the recognition of visual elements in visually information-dense images such as charts, web pages, and scientific plots (vision-centric OCR), even though such images are widespread on the internet and have significant application value. Method: Proposes the OCRVerse framework, builds a comprehensive dataset covering text-centric sources (newspapers, magazines, books) and vision-centric sources (charts, web pages, scientific plots), and adopts two-stage SFT-RL multi-domain training: the SFT stage mixes cross-domain data to establish initial knowledge, and the RL stage designs a personalized reward strategy for each domain to fit its output format and objectives. Result: OCRVerse achieves competitive results on both text-centric and vision-centric OCR tasks, comparable to large-scale open-source and closed-source models. Conclusion: OCRVerse is the first to deliver unified end-to-end text-centric and vision-centric OCR, validating the effectiveness and scalability of joint multi-domain modeling and flexible reward mechanisms for complex OCR tasks. Abstract: The development of large vision language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (Text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (Vision-centric OCR), such as charts, web pages and science plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method in end-to-end manner that enables unified text-centric OCR and vision-centric OCR. To this end, we construct comprehensive data engineering to cover a wide range of text-centric documents, such as newspapers, magazines and books, as well as vision-centric rendered composites, including charts, web pages and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. SFT directly mixes cross-domain data to train and establish initial domain knowledge, while RL focuses on designing personalized reward strategies for the characteristics of each domain.
Specifically, since different domains require various output formats and expected outputs, we provide sufficient flexibility in the RL stage to customize flexible reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, achieving competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.

Bowen Zhou,Marc-André Fiedler,Ayoub Al-Hamadi

Main category: cs.CV

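TL;DR: This paper proposes CAF-Mamba, a Mamba-based cross-modal adaptive attention fusion framework for depression detection; it models cross-modal interactions both explicitly and implicitly, dynamically adjusts modality weights, and achieves state-of-the-art performance on the LMVD and D-Vlog datasets.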

Details

Motivation: Existing deep learning methods for depression detection rely on limited feature types, overlook explicit cross-modal interactions, and use simple fusion such as concatenation or static weighting. Method: Proposes the CAF-Mamba framework, built on the Mamba architecture, which introduces a modality-wise attention mechanism, models cross-modal interactions explicitly and implicitly, and supports dynamic adjustment of modality contributions. Result: On two in-the-wild benchmark datasets, LMVD and D-Vlog, CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance. Conclusion: CAF-Mamba improves multimodal depression detection, supporting the effectiveness and generalization of dynamic, adaptive cross-modal fusion. Abstract: Depression is a prevalent mental health disorder that severely impairs daily functioning and quality of life. While recent deep learning approaches for depression detection have shown promise, most rely on limited feature types, overlook explicit cross-modal interactions, and employ simple concatenation or static weighting for fusion. To overcome these limitations, we propose CAF-Mamba, a novel Mamba-based cross-modal adaptive attention fusion framework. CAF-Mamba not only captures cross-modal interactions explicitly and implicitly, but also dynamically adjusts modality contributions through a modality-wise attention mechanism, enabling more effective multimodal fusion. Experiments on two in-the-wild benchmark datasets, LMVD and D-Vlog, demonstrate that CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance.

[128] Few-Shot Domain Adaptation with Temporal References and Static Priors for Glacier Calving Front Delineation

Marcel Dreier,Nora Gourmelon,Dakota Pyles,Thorsten Seehaus,Matthias H. Braun,Andreas Maier,Vincent Christlein

Main category: cs.CV

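TL;DR: This paper proposes a few-shot domain adaptation method that requires no architectural changes; by combining spatial static priors with summer reference images, it substantially improves how well glacier calving front segmentation models generalize to new study sites.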

Details

Motivation: State-of-the-art glacier calving front segmentation models perform near human level on benchmarks, but at real, novel sites (out-of-distribution data) their accuracy is insufficient for downstream scientific analysis. Method: Uses a few-shot domain adaptation strategy, incorporates spatial static prior knowledge, and adds summer reference images to the input time series. Result: The calving front delineation error drops from 1131.6 m to 68.7 m without changing the model architecture. Conclusion: The method provides a workable framework for deploying deep learning models at new study sites and could enable calving front monitoring on a global scale. Abstract: During benchmarking, the state-of-the-art model for glacier calving front delineation achieves near-human performance. However, when applied in a real-world setting at a novel study site, its delineation accuracy is insufficient for calving front products intended for further scientific analyses. This site represents an out-of-distribution domain for a model trained solely on the benchmark dataset. By employing a few-shot domain adaptation strategy, incorporating spatial static prior knowledge, and including summer reference images in the input time series, the delineation error is reduced from 1131.6 m to 68.7 m without any architectural modifications. These methodological advancements establish a framework for applying deep learning-based calving front segmentation to novel study sites, enabling calving front monitoring on a global scale.

[129] When Gradient Optimization Is Not Enough: $\dagger$ Dispersive and Anchoring Geometric Regularizer for Multimodal Learning

Zixuan Xia,Hao Wang,Pengcheng Weng,Yanyu Qian,Yangxin Xu,William Dan,Fei Wang

Main category: cs.CV

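TL;DR: This paper proposes \regName, a lightweight geometry-aware regularization framework that addresses pathological representation geometry in multimodal learning; intra-modal dispersive regularization and inter-modal anchoring regularization improve representation diversity and cross-modal consistency without architectural changes, are compatible with various training paradigms, and are validated on multiple benchmarks.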

Details

Motivation: In multimodal learning, even well-optimized models with balanced training strategies often show pathological representation geometry (such as intra-modal collapse and sample-level cross-modal inconsistency), which harms unimodal robustness and multimodal fusion. Method: The \regName framework imposes two complementary constraints: intra-modal dispersive regularization (to increase representation diversity) and inter-modal anchoring regularization (to bound sample-level cross-modal drift without enforcing strict alignment); it is a plug-and-play regularizer that requires no architectural modification. Result: Experiments on multiple multimodal benchmarks show consistent, significant gains in both multimodal and unimodal performance, mitigating modality trade-offs. Conclusion: Explicitly regulating representation geometry is an important new axis for improving multimodal learning, and \regName offers a general, lightweight, and effective way to do so. Abstract: Multimodal learning aims to integrate complementary information from heterogeneous modalities, yet strong optimization alone does not guarantee well-structured representations. Even under carefully balanced training schemes, multimodal models often exhibit geometric pathologies, including intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion. We identify representation geometry as a missing control axis in multimodal learning and propose \regName, a lightweight geometry-aware regularization framework. \regName enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. The proposed regularizer is plug-and-play, requires no architectural modifications, and is compatible with various training paradigms. Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.
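
Illustrative sketch (not the authors' code): the abstract names two constraints on intermediate embeddings, a dispersive term within each modality and an anchoring term across modalities, without giving their formulas. The snippet below shows one plausible reading, assumed here for illustration: a dispersive loss that is highest when embeddings collapse onto one point, and a hinge-style anchoring loss that penalizes a sample's cross-modal drift only beyond a margin, so the two modalities are bounded but not forced to coincide. The function names, `temperature`, and `margin` are hypothetical.

```python
import math

def dispersive_loss(embeddings, temperature=1.0):
    """Intra-modal term: log of the mean Gaussian affinity over all pairs.

    Collapsed embeddings give the maximum value 0; spread-out embeddings
    give a lower (negative) value. Requires at least two embeddings.
    """
    n = len(embeddings)
    affinities = [
        math.exp(-math.dist(embeddings[i], embeddings[j]) ** 2 / temperature)
        for i in range(n) for j in range(i + 1, n)
    ]
    return math.log(sum(affinities) / len(affinities))

def anchoring_loss(emb_a, emb_b, margin=0.5):
    """Inter-modal term: penalize per-sample drift only beyond a margin."""
    penalties = [max(0.0, math.dist(a, b) - margin) ** 2 for a, b in zip(emb_a, emb_b)]
    return sum(penalties) / len(penalties)
```

In a training loop, terms like these would be added to the task loss with small weights, which matches the plug-and-play use described in the abstract.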

[130] Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification

Dexuan Ding,Ciyuan Peng,Endrowednes Kuantama,Jingcai Guo,Jia Wu,Jian Yang,Amin Beheshti,Ming-Hsuan Yang,Yuankai Qi

Main category: cs.CV

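TL;DR: This paper proposes Multimodal Visual Surrogate Compression (MVSC), which compresses high-dimensional 3D sMRI images into compact 2D visual surrogate features that are better suited to frozen 2D foundation models, improving Alzheimer's disease classification.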

Details

Motivation: Existing sMRI representation learning methods suffer from high computational cost, loss of cross-slice relations, and limited ability to extract discriminative features. Method: Proposes the MVSC framework, consisting of a text-guided Volume Context Encoder (capturing global cross-slice context) and a text-enhanced Adaptive Slice Fusion module (aggregating slice information patch-wise), which compresses 3D sMRI into 2D visual surrogate features aligned with 2D foundation models. Result: On three large-scale AD datasets, MVSC outperforms state-of-the-art methods on both binary and multi-class classification. Conclusion: Through efficient compression and text-guided cross-modal alignment, MVSC improves the representational power and classification performance of sMRI for AD diagnosis. Abstract: High-dimensional structural MRI (sMRI) images are widely used for Alzheimer's Disease (AD) diagnosis. Most existing methods for sMRI representation learning rely on 3D architectures (e.g., 3D CNNs), slice-wise feature extraction with late aggregation, or training-free feature extraction using 2D foundation models (e.g., DINO). However, these three paradigms suffer from high computational cost, loss of cross-slice relations, and limited ability to extract discriminative features, respectively. To address these challenges, we propose Multimodal Visual Surrogate Compression (MVSC). It learns to compress and adapt large 3D sMRI volumes into compact 2D features, termed visual surrogates, which are better aligned with frozen 2D foundation models to extract powerful representations for final AD classification. MVSC has two key components: a Volume Context Encoder that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner. Extensive experiments on three large-scale Alzheimer's disease benchmarks demonstrate that MVSC performs favourably on both binary and multi-class classification tasks compared against state-of-the-art methods.

[131] ChartE$^{3}$: A Comprehensive Benchmark for End-to-End Chart Editing

Shuo Li,Jiajun Sun,Zhekai Wang,Xiaoran Fan,Hui Li,Dingwen Yang,Zhiheng Xi,Yijun Wang,Zifei Shan,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CV

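TL;DR: This paper introduces ChartE$^{3}$, an end-to-end chart editing benchmark that evaluates whether models can edit charts directly from multimodal instructions, without intermediate natural language or code representations, covering both local edits (e.g., color, font) and global edits (e.g., data filtering, trend line addition).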

Details

Motivation: Existing pipeline-based chart editing methods depend on natural language or code as an intermediate representation and struggle to execute complex edits accurately; an end-to-end benchmark is needed that covers both fine-grained control and global structural consistency. Method: Builds the ChartE$^{3}$ benchmark, containing 1,200+ high-quality triplet samples (chart image, underlying code, multimodal editing instruction) that cover local and global editing tasks, with human curation to ensure quality. Result: Extensive evaluation of current state-of-the-art multimodal large language models shows markedly weak performance on global editing tasks, exposing a key gap in end-to-end chart editing. Conclusion: ChartE$^{3}$ provides the first standardized evaluation framework for end-to-end chart editing without intermediate representations and reveals serious limitations of existing models on data-driven global editing, encouraging further research in this direction. Abstract: Charts are a fundamental visualization format for structured data analysis. Enabling end-to-end chart editing according to user intent is of great practical value, yet remains challenging due to the need for both fine-grained control and global structural consistency. Most existing approaches adopt pipeline-based designs, where natural language or code serves as an intermediate representation, limiting their ability to faithfully execute complex edits. We introduce ChartE$^{3}$, an End-to-End Chart Editing benchmark that directly evaluates models without relying on intermediate natural language programs or code-level supervision. ChartE$^{3}$ focuses on two complementary editing dimensions: local editing, which involves fine-grained appearance changes such as font or color adjustments, and global editing, which requires holistic, data-centric transformations including data filtering and trend line addition. ChartE$^{3}$ contains over 1,200 high-quality samples constructed via a well-designed data pipeline with human curation. Each sample is provided as a triplet of a chart image, its underlying code, and a multimodal editing instruction, enabling evaluation from both objective and subjective perspectives. Extensive benchmarking of state-of-the-art multimodal large language models reveals substantial performance gaps, particularly on global editing tasks, highlighting critical limitations in current end-to-end chart editing capabilities.

[132] DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning

Mingshuang Luo,Shuang Liang,Zhengkun Rong,Yuxuan Luo,Tianshu Hu,Ruibing Hou,Hong Chang,Yong Li,Yuan Zhang,Mingyuan Gao

Main category: cs.CV

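TL;DR: DreamActor-M2 is a universal character image animation framework that needs no explicit pose prior; a two-stage paradigm (unified representation fusion plus self-bootstrapped pseudo-data synthesis) addresses the trade-off between identity preservation and motion consistency and achieves state-of-the-art performance on the new AW Bench benchmark.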

Details

Motivation: Existing methods have two main problems: suboptimal motion injection strategies make it hard to achieve both identity preservation and motion consistency (the "see-saw" effect), and over-reliance on explicit pose priors (such as skeletons) makes it hard to model complex dynamics and generalizes poorly, especially to non-humanoid characters. Method: Proposes the DreamActor-M2 framework. In the first stage, reference appearance and motion cues are fused into a unified latent space, and the generative prior of foundation models is used to jointly model spatial identity and temporal dynamics. In the second stage, a self-bootstrapped pseudo cross-identity data synthesis pipeline enables a smooth transition from pose-driven to end-to-end RGB-driven animation. A new benchmark, AW Bench, is also built. Result: Achieves state-of-the-art results on AW Bench and several standard datasets, with clear gains in visual fidelity and cross-domain generalization (especially for arbitrary non-humanoid characters and diverse motion scenarios). Conclusion: By recasting motion conditioning as an in-context learning problem, DreamActor-M2 removes the dependence on explicit pose representations and delivers more robust, more general character image animation. Abstract: Character image animation aims to synthesize high-fidelity videos by transferring motion from a driving sequence to a static reference image. Despite recent advancements, existing methods suffer from two fundamental challenges: (1) suboptimal motion injection strategies that lead to a trade-off between identity preservation and motion consistency, manifesting as a "see-saw" effect, and (2) an over-reliance on explicit pose priors (e.g., skeletons), which inadequately capture intricate dynamics and hinder generalization to arbitrary, non-humanoid characters. To address these challenges, we present DreamActor-M2, a universal animation framework that reimagines motion conditioning as an in-context learning problem. Our approach follows a two-stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space, enabling the model to jointly reason about spatial identity and temporal dynamics by leveraging the generative prior of foundational models. Second, we introduce a self-bootstrapped data synthesis pipeline that curates pseudo cross-identity training pairs, facilitating a seamless transition from pose-dependent control to direct, end-to-end RGB-driven animation. This strategy significantly enhances generalization across diverse characters and motion scenarios. To facilitate comprehensive evaluation, we further introduce AW Bench, a versatile benchmark encompassing a wide spectrum of character types and motion scenarios.
Extensive experiments demonstrate that DreamActor-M2 achieves state-of-the-art performance, delivering superior visual fidelity and robust cross-domain generalization. Project Page: https://grisoon.github.io/DreamActor-M2/

[133] From Global to Granular: Revealing IQA Model Performance via Correlation Surface

Baoliang Chen,Danni Huang,Hanwei Zhu,Lingyu Zhu,Wei Zhou,Shiqi Wang,Yuming Fang,Weisi Lin

Main category: cs.CV

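TL;DR: This paper proposes Granularity-Modulated Correlation (GMC), which introduces a Granularity Modulator and a Distribution Regulator to build a correlation surface, enabling fine-grained, distribution-robust analysis of image quality assessment (IQA) models and overcoming the limitations of global correlation metrics such as SRCC and PLCC.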

Details

Motivation: 现有IQA评估指标（如PLCC、SRCC）仅输出单一标量值，无法反映模型在不同质量区间（如高MOS或小ΔMOS）下的局部排序一致性差异，且易受测试集质量分布影响，导致跨数据集比较不稳定。 Method: 提出GMC框架，包含：(1) 粒度调制器——基于高斯加权，在绝对MOS值和|ΔMOS|两个维度上计算局部相关性；(2) 分布调节器——对相关性进行正则化以削弱非均匀质量分布带来的偏差；最终生成以MOS和|ΔMOS|为坐标的三维相关性曲面。 Result: 在标准IQA基准上的实验表明，GMC能揭示传统标量指标无法发现的模型性能特性（如高质图像排序优势或细微差异判别能力），提升模型分析、比较与部署的可靠性与信息量。 Conclusion: GMC提供了一种结构化、细粒度且分布鲁棒的IQA评估新范式，显著优于传统全局相关性指标，推动IQA模型评估向更可解释、更实用的方向发展。 Abstract: Evaluation of Image Quality Assessment (IQA) models has long been dominated by global correlation metrics, such as Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank-Order Correlation Coefficient (SRCC). While widely adopted, these metrics reduce performance to a single scalar, failing to capture how ranking consistency varies across the local quality spectrum. For example, two IQA models may achieve identical SRCC values, yet one ranks high-quality images (related to high Mean Opinion Score, MOS) more reliably, while the other better discriminates image pairs with small quality/MOS differences (related to $|Δ$MOS$|$). Such complementary behaviors are invisible under global metrics. Moreover, SRCC and PLCC are sensitive to test-sample quality distributions, yielding unstable comparisons across test sets. To address these limitations, we propose \textbf{Granularity-Modulated Correlation (GMC)}, which provides a structured, fine-grained analysis of IQA performance. GMC includes: (1) a \textbf{Granularity Modulator} that applies Gaussian-weighted correlations conditioned on absolute MOS values and pairwise MOS differences ($|Δ$MOS$|$) to examine local performance variations, and (2) a \textbf{Distribution Regulator} that regularizes correlations to mitigate biases from non-uniform quality distributions. The resulting \textbf{correlation surface} maps correlation values as a joint function of MOS and $|Δ$MOS$|$, providing a 3D representation of IQA performance. 
Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models. Codes are available at https://github.com/Dniaaa/GMC.
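
The Gaussian-weighted local correlation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a weighted Pearson correlation whose sample weights come from a Gaussian kernel centred on a chosen MOS value; the function name, the `sigma` bandwidth, and the toy scores are hypothetical and are not taken from the GMC code release, which additionally conditions on |ΔMOS| and applies a distribution regulator.

```python
import math


def gaussian_weighted_pearson(pred, mos, center, sigma):
    """Pearson correlation between pred and mos, with each sample weighted
    by a Gaussian kernel centred on `center` (in MOS units).

    Hypothetical helper for illustration only.
    """
    weights = [math.exp(-0.5 * ((m - center) / sigma) ** 2) for m in mos]
    total = sum(weights)
    mean_p = sum(w * p for w, p in zip(weights, pred)) / total
    mean_m = sum(w * m for w, m in zip(weights, mos)) / total
    cov = sum(w * (p - mean_p) * (m - mean_m)
              for w, p, m in zip(weights, pred, mos))
    var_p = sum(w * (p - mean_p) ** 2 for w, p in zip(weights, pred))
    var_m = sum(w * (m - mean_m) ** 2 for w, m in zip(weights, mos))
    if var_p == 0 or var_m == 0:
        return float("nan")  # correlation is undefined for constant inputs
    return cov / math.sqrt(var_p * var_m)


# Toy data (made up): subjective scores and one model's predictions.
mos = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
pred = [1.2, 1.4, 2.1, 2.4, 3.3, 3.1, 4.4, 4.1, 4.6]

# Sweeping the kernel centre gives one slice of a correlation profile over MOS.
for center in (1.5, 3.0, 4.5):
    print(center, round(gaussian_weighted_pearson(pred, mos, center, 0.75), 3))
```

Sweeping `center` over the MOS range yields a one-dimensional profile; a second sweep over pairwise |ΔMOS| would be needed to approximate the two-dimensional surface the paper describes.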

Jiankun Peng,Jianyuan Guo,Ying Xu,Yue Liu,Jiashuang Yan,Xuanwei Ye,Houhua Li,Xiaoming Wang

Main category: cs.CV

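TL;DR: This paper proposes DGNav, a framework that uses a scene-aware adaptive strategy and a dynamic graph Transformer to address the granularity rigidity of topological maps in vision-language navigation, achieving on-demand map densification and dynamic edge-weight optimization and significantly improving navigation performance and safety.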

Details

Motivation: Existing methods that build topological maps with fixed geometric thresholds suffer from "granularity rigidity": they over-sample in simple areas, causing computational redundancy, and under-sample in high-uncertainty regions, increasing collision risk and reducing precision. Method: The DGNav framework introduces two core innovations: (1) a scene-aware adaptive strategy that dynamically adjusts graph construction thresholds according to the dispersion of predicted waypoints; and (2) a dynamic graph Transformer that fuses visual, linguistic, and geometric cues into dynamic edge weights, filtering topological noise and strengthening instruction following. Result: Experiments on the R2R-CE and RxR-CE benchmarks show that DGNav achieves better navigation performance and strong generalization; ablation studies indicate an optimal trade-off between navigation efficiency and safe exploration. Conclusion: By dynamically modulating the density and connectivity of the topological map, DGNav overcomes granularity rigidity and provides a more robust, safe, and efficient solution for vision-language navigation in continuous environments. Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) presents a core challenge: grounding high-level linguistic instructions into precise, safe, and long-horizon spatial actions. Explicit topological maps have proven to be a vital solution for providing robust spatial memory in such tasks. However, existing topological planning methods suffer from a "Granularity Rigidity" problem. Specifically, these methods typically rely on fixed geometric thresholds to sample nodes, which fails to adapt to varying environmental complexities. This rigidity leads to a critical mismatch: the model tends to over-sample in simple areas, causing computational redundancy, while under-sampling in high-uncertainty regions, increasing collision risks and compromising precision. To address this, we propose DGNav, a framework for Dynamic Topological Navigation, introducing a context-aware mechanism to modulate map density and connectivity on-the-fly. Our approach comprises two core innovations: (1) A Scene-Aware Adaptive Strategy that dynamically modulates graph construction thresholds based on the dispersion of predicted waypoints, enabling "densification on demand" in challenging environments; (2) A Dynamic Graph Transformer that reconstructs graph connectivity by fusing visual, linguistic, and geometric cues into dynamic edge weights, enabling the agent to filter out topological noise and enhancing instruction adherence. Extensive experiments on the R2R-CE and RxR-CE benchmarks demonstrate that DGNav exhibits superior navigation performance and strong generalization capabilities. 
Furthermore, ablation studies confirm that our framework achieves an optimal trade-off between navigation efficiency and safe exploration. The code is available at https://github.com/shannanshouyin/DGNav.
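
The idea of modulating a graph construction threshold by waypoint dispersion can be sketched as follows. This is an illustration under stated assumptions, not DGNav's actual rule: dispersion is taken to be the mean distance of predicted waypoints from their centroid, and the node-merge threshold is assumed to shrink as dispersion grows (so the map densifies where predictions disagree). All names and constants (`base`, `gain`, `floor`) are hypothetical.

```python
import math


def waypoint_dispersion(waypoints):
    """Mean Euclidean distance of 2D waypoints from their centroid."""
    n = len(waypoints)
    cx = sum(x for x, _ in waypoints) / n
    cy = sum(y for _, y in waypoints) / n
    return sum(math.hypot(x - cx, y - cy) for x, y in waypoints) / n


def adaptive_merge_threshold(waypoints, base=0.5, gain=1.0, floor=0.1):
    """Hypothetical rule: a smaller merge threshold (denser graph) when the
    predicted waypoints are widely dispersed, never dropping below `floor`."""
    dispersion = waypoint_dispersion(waypoints)
    return max(floor, base / (1.0 + gain * dispersion))


# Tightly clustered predictions (e.g. a plain corridor) vs. scattered ones.
corridor = [(1.0, 0.0), (1.05, 0.0), (0.95, 0.0)]
junction = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
print(adaptive_merge_threshold(corridor))
print(adaptive_merge_threshold(junction))
```

With these toy inputs the scattered case yields the smaller threshold, so nearby candidate nodes would be merged less aggressively there.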

[135] Synthetic-to-Real Domain Bridging for Single-View 3D Reconstruction of Ships for Maritime Monitoring

Borja Carrillo-Perez,Felix Sattler,Angel Bueno Rodriguez,Maurice Stephan,Sarah Barnes

Main category: cs.CV

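TL;DR: This paper presents an efficient single-view 3D ship reconstruction method that is trained purely on synthetic data and needs only a single input image; it combines the Splatter Image network, YOLOv8 segmentation, and AIS-based geographic mapping to provide monitoring-grade 3D visualization without real-world 3D annotations.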

Details

Motivation: Mainstream 3D reconstruction methods depend on multi-view supervision or 3D ground-truth annotations, or are computationally expensive, which makes them unsuitable for real-time maritime monitoring. Method: The pipeline uses the Splatter Image network (which represents objects as sparse 3D Gaussians), fine-tuned on synthetic ShapeNet vessels and further refined on a diverse custom dataset of 3D ships; it integrates a YOLOv8 segmentation module with custom preprocessing; postprocessing covers real-world scaling, centering, orientation alignment, and georeferencing based on AIS metadata and homography mapping. Result: Quantitative evaluation on synthetic validation data shows high reconstruction fidelity; qualitative results on the real ShipSG dataset indicate that the method can transfer to operational maritime scenes; the system supports interactive 3D ship inspection without real-world 3D annotations. Conclusion: The pipeline offers an efficient, scalable single-view 3D reconstruction solution for maritime monitoring and points toward real-time 3D ship visualization in practice. Abstract: Three-dimensional (3D) reconstruction of ships is an important part of maritime monitoring, allowing improved visualization, inspection, and decision-making in real-world monitoring environments. However, most state-of-the-art 3D reconstruction methods require multi-view supervision, annotated 3D ground truth, or are computationally intensive, making them impractical for real-time maritime deployment. In this work, we present an efficient pipeline for single-view 3D reconstruction of real ships by training entirely on synthetic data and requiring only a single view at inference. Our approach uses the Splatter Image network, which represents objects as sparse sets of 3D Gaussians for rapid and accurate reconstruction from single images. The model is first fine-tuned on synthetic ShapeNet vessels and further refined with a diverse custom dataset of 3D ships, bridging the domain gap between synthetic and real-world imagery. We integrate a state-of-the-art segmentation module based on YOLOv8 and custom preprocessing to ensure compatibility with the reconstruction network. Postprocessing steps include real-world scaling, centering, and orientation alignment, followed by georeferenced placement on an interactive web map using AIS metadata and homography-based mapping. Quantitative evaluation on synthetic validation data demonstrates strong reconstruction fidelity, while qualitative results on real maritime images from the ShipSG dataset confirm the potential for transfer to operational maritime settings. The final system provides interactive 3D inspection of real ships without requiring real-world 3D annotations. 
This pipeline provides an efficient, scalable solution for maritime monitoring and highlights a path toward real-time 3D ship visualization in practical applications. Interactive demo: https://dlr-mi.github.io/ship3d-demo/.

Junming Huang,Weiwei Xu

Main category: cs.CV

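TL;DR: This paper proposes CG-MLLM, a multimodal large language model that supports both 3D captioning and high-resolution 3D content generation; a Mixture-of-Transformer architecture (TokenAR and BlockAR) decouples the two modeling needs, and a pre-trained vision-language backbone is combined with a specialized 3D VAE latent space, significantly improving 3D generation quality.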

Details

Motivation: Existing 3D content generation methods produce only low-resolution meshes or coarse structural proxies and fail to capture fine-grained geometry natively; the 3D generation capabilities of LLMs remain underexplored. Method: CG-MLLM adopts a Mixture-of-Transformer architecture with two Transformer modules, a Token-level Autoregressive (TokenAR) Transformer and a Block-level Autoregressive (BlockAR) Transformer, which handle token-level and block-level content respectively; it integrates a pre-trained vision-language backbone with a specialized 3D VAE latent space to enable long-context interaction between standard tokens and spatial blocks. Result: CG-MLLM significantly outperforms existing MLLMs at generating high-fidelity 3D objects and brings high-resolution 3D content generation into the mainstream LLM paradigm. Conclusion: CG-MLLM achieves high-quality 3D captioning and high-resolution 3D generation within a single framework, extending multimodal large models toward 3D content understanding and generation. Abstract: Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.

[137] MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Honglin Lin,Zheng Liu,Yun Zhu,Chonghan Qin,Juekai Lin,Xiaoran Shang,Conghui He,Wentao Zhang,Lijun Wu

Main category: cs.CV

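TL;DR: This paper introduces MMFineReason, a large-scale multimodal reasoning dataset (1.8M samples, 5.1B solution tokens) built with a three-stage pipeline that substantially improves the visual reasoning of open-source VLMs; Qwen3-VL models fine-tuned on it outperform larger Qwen3-VL Thinking baselines, and a high-quality 7% subset alone matches full-dataset performance.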

Details

Motivation: Open-source vision language models (VLMs) still lag behind proprietary systems in visual reasoning, mainly because high-quality multimodal reasoning data is lacking: existing data offers limited coverage of hard domains such as STEM diagrams and visual puzzles, and lacks consistent long-form Chain-of-Thought (CoT) annotations. Method: The MMFineReason dataset is built in three stages: (1) large-scale multi-source data collection and standardization; (2) generation of visually grounded long-form CoT rationales distilled from Qwen3-VL-235B-A22B-Thinking; (3) selection based on reasoning quality and difficulty awareness. Qwen3-VL-Instruct is then fine-tuned on the data to obtain MMFineReason-2B/4B/8B, accompanied by ablation and subset experiments. Result: MMFineReason-4B surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking; a difficulty-aware subset of only 123K samples (7%) matches full-dataset performance; training on reasoning data also improves general capabilities. Conclusion: High-quality, difficulty-aware, visually grounded long-form CoT data is key to improving the reasoning of open-source VLMs; a "less is more" selection strategy greatly improves parameter efficiency and training effectiveness; reasoning-oriented data composition yields generalization gains. Abstract: Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is established via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B versions. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B successfully surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. 
Crucially, we uncover a "less is more" phenomenon via our difficulty-aware filtering strategy: a subset of just 7\% (123K samples) achieves performance comparable to the full dataset. Notably, we reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.

[138] Trajectory-Guided Diffusion for Foreground-Preserving Background Generation in Multi-Layer Documents

Taewon Kang

Main category: cs.CV

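TL;DR: This paper proposes a diffusion-based framework for document background generation that achieves foreground preservation and multi-page stylistic consistency through latent-space design, without additional constraints or masking mechanisms.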

Details

Motivation: Existing document background generation damages foreground content and suffers from stylistic drift across pages; the goal is to address both without relying on explicit constraints, masks, or repeated prompt-based style control. Method: Diffusion is reinterpreted as the evolution of stochastic trajectories in a structured latent space; shaping the initial noise and its geometric alignment makes generation naturally avoid foreground regions; cached style direction vectors act as persistent style constraints in latent space, so that diffusion trajectories for multiple pages share one stylistic subspace. Result: The method is training-free and compatible with existing diffusion backbones, and it produces visually coherent, foreground-readable, cross-page consistent results on complex documents. Conclusion: The method reframes diffusion modeling as trajectory design and offers a principled approach to structured, consistent generation tasks. Abstract: We present a diffusion-based framework for document-centric background generation that achieves foreground preservation and multi-page stylistic consistency through latent-space design rather than explicit constraints. Instead of suppressing diffusion updates or applying masking heuristics, our approach reinterprets diffusion as the evolution of stochastic trajectories through a structured latent space. By shaping the initial noise and its geometric alignment, background generation naturally avoids designated foreground regions, allowing readable content to remain intact without auxiliary mechanisms. To address the long-standing issue of stylistic drift across pages, we decouple style control from text conditioning and introduce cached style directions as persistent vectors in latent space. Once selected, these directions constrain diffusion trajectories to a shared stylistic subspace, ensuring consistent appearance across pages and editing iterations. This formulation eliminates the need for repeated prompt-based style specification and provides a more stable foundation for multi-page generation. Our framework admits a geometric and physical interpretation, where diffusion paths evolve on a latent manifold shaped by preferred directions, and foreground regions are rarely traversed as a consequence of trajectory initialization rather than explicit exclusion. The proposed method is training-free, compatible with existing diffusion backbones, and produces visually coherent, foreground-preserving results across complex documents. By reframing diffusion as trajectory design in latent space, we offer a principled approach to consistent and structured generative modeling.

[139] Improving Classifier-Free Guidance of Flow Matching via Manifold Projection

Jian-Feng Cai,Haixia Liu,Zhengyi Su,Chao Wang

Main category: cs.CV

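TL;DR: This paper offers a new optimization-based interpretation of classifier-free guidance (CFG), viewing it as a homotopy optimization problem with a manifold constraint, and introduces manifold projection with Anderson Acceleration to improve generation quality, prompt alignment, and robustness to the guidance scale without additional model evaluations.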

Details

Motivation: Although CFG is widely used in diffusion and flow-matching models, it relies on a heuristic linear extrapolation, is sensitive to the guidance scale, and lacks theoretical grounding. Method: CFG is modeled as an approximation to the gradient of smoothed distance functions, which shows that the prediction gap (the difference between conditional and unconditional outputs) governs guidance sensitivity; CFG sampling is then recast as a homotopy optimization with a manifold constraint, with the manifold projection implemented by incremental gradient descent and combined with Anderson Acceleration for efficiency and stability. Result: The proposed methods are training-free and consistently improve generation fidelity, prompt alignment, and robustness to the guidance scale, with effectiveness validated on large models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5. Conclusion: CFG can be understood as an implicit optimization process, and the manifold-constrained homotopy optimization framework gives it a firmer theoretical basis and a better practical scheme. Abstract: Classifier-free guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
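
For readers unfamiliar with the baseline being reinterpreted, the standard CFG linear extrapolation and the prediction gap it amplifies can be written in a few lines. This sketch shows only the conventional formula applied to a velocity (or noise) prediction; it does not implement the paper's manifold projection or Anderson Acceleration, and the function names and toy vectors are hypothetical.

```python
import math


def prediction_gap(v_uncond, v_cond):
    """Element-wise difference between conditional and unconditional outputs."""
    return [c - u for u, c in zip(v_uncond, v_cond)]


def cfg_combine(v_uncond, v_cond, scale):
    """Standard classifier-free guidance: start from the unconditional
    prediction and move along the prediction gap by `scale`.

    scale = 0 returns the unconditional output, scale = 1 the conditional
    output, and scale > 1 extrapolates beyond the conditional output.
    """
    gap = prediction_gap(v_uncond, v_cond)
    return [u + scale * g for u, g in zip(v_uncond, gap)]


# Toy 2D predictions (made up) to show how the scale stretches the gap.
v_uncond = [0.0, 1.0]
v_cond = [1.0, 3.0]
for scale in (0.0, 1.0, 2.0):
    guided = cfg_combine(v_uncond, v_cond, scale)
    shift = math.dist(guided, v_cond)  # distance from the conditional output
    print(scale, guided, round(shift, 3))
```

Because the guided output depends linearly on the gap, any error in the gap is multiplied by the guidance scale, which is one way to read the sensitivity the abstract refers to.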

[140] Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion

Hanmo Chen,Chenghao Xu,Xu Yang,Xuan Chen,Cheng Deng

Main category: cs.CV

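TL;DR: This paper proposes PaFu-KV, a new KV cache policy that uses a lightweight salience estimation head to score token importance, dynamically retaining key spatiotemporal information and evicting redundant cache entries, which improves inference efficiency while maintaining high-quality video generation.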

Details

Motivation: Existing autoregressive video generation methods rely on heuristic KV cache policies that ignore how token importance varies over time in long video generation, causing loss of critical information and cache redundancy that hurt both quality and efficiency. Method: The Past- and Future-Informed KV Cache Policy (PaFu-KV) introduces a lightweight Salience Estimation Head, distilled from a bidirectional teacher model, to estimate a salience score for each token and manage the KV cache dynamically on that basis. Result: On multiple benchmarks, the method preserves high-fidelity video generation quality while substantially reducing memory footprint and accelerating inference, improving the efficiency-quality trade-off for long-horizon video generation. Conclusion: PaFu-KV is an efficient, practical KV cache optimization policy that offers a new approach and technical support for real-time, long-horizon autoregressive video generation. Abstract: Video generation is pivotal to digital media creation, and recent advances in autoregressive video generation have markedly enhanced the efficiency of real-time video synthesis. However, existing approaches generally rely on heuristic KV Cache policies, which ignore differences in token importance in long-term video generation. This leads to the loss of critical spatiotemporal information and the accumulation of redundant, invalid cache, thereby degrading video generation quality and efficiency. To address this limitation, we first observe that token contributions to video generation are highly time-heterogeneous and accordingly propose a novel Past- and Future-Informed KV Cache Policy (PaFu-KV). Specifically, PaFu-KV introduces a lightweight Salience Estimation Head distilled from a bidirectional teacher to estimate salience scores, allowing the KV cache to retain informative tokens while discarding less relevant ones. This policy yields a better quality-efficiency trade-off by shrinking KV cache capacity and reducing memory footprint at inference time. Extensive experiments on benchmarks demonstrate that our method preserves high-fidelity video generation quality while enabling accelerated inference, thereby supporting more efficient long-horizon video generation. Our code will be released upon paper acceptance.
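
Salience-based cache retention can be sketched generically as top-k selection over per-token scores. The sketch below assumes the salience scores are already available (in the paper they come from a learned head, which is not reproduced here) and that eviction keeps a fixed budget of tokens in their original order; the function name and the toy cache are hypothetical.

```python
def prune_kv_cache(keys, values, salience, budget):
    """Keep the `budget` highest-salience cache entries, preserving their
    original temporal order. Returns (keys, values, kept_indices).

    Hypothetical helper: the scoring model itself is out of scope.
    """
    if budget >= len(salience):
        kept = list(range(len(salience)))
    else:
        ranked = sorted(range(len(salience)),
                        key=lambda i: salience[i], reverse=True)
        kept = sorted(ranked[:budget])  # restore temporal order
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache with four tokens; strings stand in for key/value tensors.
keys = ["k0", "k1", "k2", "k3"]
values = ["v0", "v1", "v2", "v3"]
salience = [0.1, 0.9, 0.3, 0.8]

pruned_keys, pruned_values, kept = prune_kv_cache(keys, values, salience, budget=2)
print(kept, pruned_keys, pruned_values)
```

A fixed budget is the simplest policy; a threshold on the score, or a budget that varies per layer, would be equally plausible variants.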

[141] TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

Chuancheng Shi,Shangze Li,Wenjun Lu,Wenhua Wu,Cong Wang,Zifeng Cheng,Fei Shen,Tat-Seng Chua

Main category: cs.CV

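TL;DR: This paper proposes TraceRouter, a framework that strengthens the adversarial robustness of large foundation models by tracing and severing the causal propagation circuits of harmful semantics, avoiding the utility loss caused by traditional localized suppression.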

Details

Motivation: Existing defenses rest on the "locality hypothesis" and suppress only isolated neurons or features, but harmful semantics are in fact distributed, cross-layer circuits, so localized interventions are brittle and damage model performance. Method: TraceRouter works in three stages: (1) it locates a sensitive onset layer through attention divergence analysis; (2) it uses sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; (3) it computes feature influence scores (FIS) from zero-out interventions to map and sever downstream causal pathways. Result: Across extensive experiments, TraceRouter significantly outperforms state-of-the-art baselines and achieves a better trade-off between adversarial robustness and general utility. Conclusion: Through path-level causal intervention, TraceRouter blocks harmful semantics precisely and offers a new paradigm for large-model safety; the code will be released. Abstract: Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the "locality hypothesis", suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose TraceRouter, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter physically severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.
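
The zero-out intervention behind stage (3) can be illustrated with a generic ablation loop: set one feature to zero, rerun the downstream computation, and record how much the output moves. The abstract does not give the exact FIS definition, so the absolute output change used here is an assumption, and the linear `downstream` function is a hypothetical stand-in for the rest of the network.

```python
def zero_out_influence(features, downstream):
    """Score each feature by the absolute change in the downstream output
    when that feature alone is set to zero.

    Assumed scoring rule for illustration; not the paper's FIS formula.
    """
    baseline = downstream(features)
    scores = []
    for i in range(len(features)):
        ablated = list(features)
        ablated[i] = 0.0  # the zero-out intervention
        scores.append(abs(baseline - downstream(ablated)))
    return scores


def downstream(f):
    """Toy stand-in for later layers: a fixed linear readout."""
    return 2.0 * f[0] + 0.0 * f[1] - 3.0 * f[2]


features = [1.0, 5.0, 1.0]
scores = zero_out_influence(features, downstream)
print(scores)

# Features whose score exceeds a (hypothetical) threshold would be the ones
# whose downstream pathway is a candidate for suppression.
selected = [i for i, s in enumerate(scores) if s > 1.0]
print(selected)
```

Note that the second feature has the largest activation but zero influence, which is the distinction between activation-based and intervention-based attribution.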

[142] Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning

Hanmo Chen,Guangtao Lyu,Chenghao Xu,Jiexi Yan,Xu Yang,Cheng Deng

Main category: cs.CV

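TL;DR: This paper proposes Pyramidal Shapley-Taylor (PST), a learning framework for fine-grained motion-language retrieval that performs hierarchical alignment in a pyramidal fashion, from joint dynamics to motion segments to holistic semantics, and significantly improves cross-modal retrieval performance.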

Details

Motivation: Existing methods mainly align whole motion sequences with global text representations and overlook fine-grained interactions among local motion segments, body joints, and text tokens, which limits retrieval performance. Method: The Pyramidal Shapley-Taylor (PST) learning framework decomposes human motion into temporal segments and spatial body joints, and models local semantic details and hierarchical structure through progressive joint-wise and segment-wise cross-modal alignment. Result: The approach significantly outperforms state-of-the-art methods on multiple public benchmark datasets and achieves precise alignment between motion segments, body joints, and their corresponding text tokens. Conclusion: Pyramidal fine-grained modeling better matches how humans perceive motion, effectively narrows the semantic gap between motion and language, and suggests a new direction for human-centric cross-modal intelligence. Abstract: As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis. Yet existing approaches predominantly focus on aligning entire motion sequences with global textual representations. This global-centric paradigm overlooks fine-grained interactions between local motion segments and individual body joints and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, achieving precise alignment between motion segments and body joints and their corresponding text tokens. The code of this work will be released upon acceptance.

[143] VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models

Yunhao Li,Sijing Wu,Zhilin Gao,Zicheng Zhang,Qi Jia,Huiyu Duan,Xiongkuo Min,Guangtao Zhai

Main category: cs.CV

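TL;DR: This paper introduces VideoAesBench, a comprehensive benchmark for evaluating how well large multimodal models (LMMs) understand video aesthetic quality; it covers diverse video sources, multiple question formats, and multi-dimensional aesthetic criteria, and an evaluation of 23 mainstream LMMs shows that current models remain weak at this task.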

Details

Motivation: Video aesthetic quality assessment is a basic human ability, but it is underexplored for large multimodal models (LMMs), and no systematic benchmark exists. Method: The VideoAesBench benchmark contains 1,804 videos from multiple sources, including UGC, AIGC, compressed, RGC, and game videos; it supports single-choice, multi-choice, true-or-false, and novel open-ended description questions; it covers 12 aesthetic dimensions in three categories (visual form, visual style, and visual affectiveness); and 23 open-source and commercial LMMs are evaluated on it. Result: Current LMMs have only basic video aesthetics perception, and their overall performance is incomplete and imprecise. Conclusion: VideoAesBench can serve as a strong testbed that advances explainable video aesthetics assessment and guides improvements in LMM aesthetic understanding. Abstract: Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which has in turn made the evaluation of LMMs significant. However, the capability of video aesthetic quality assessment, which is a fundamental ability for humans, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs' understanding of video aesthetic quality. VideoAesBench has several significant characteristics: (1) Diverse content including 1,804 videos from multiple video sources including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos. (2) Multiple question formats containing traditional single-choice questions, multi-choice questions, True or False questions, and novel open-ended questions for video aesthetics description. (3) Holistic video aesthetics dimensions including visual form related questions from 5 aspects, visual style related questions from 4 aspects, and visual affectiveness questions from 3 aspects. Based on VideoAesBench, we benchmark 23 open-source and commercial large multimodal models. Our findings show that current LMMs possess only basic video aesthetics perception ability, and their performance remains incomplete and imprecise. We hope VideoAesBench can serve as a strong testbed and offer insights for explainable video aesthetics assessment.

[144] Zero-Shot Video Restoration and Enhancement with Assistance of Video Diffusion Models

Cong Cao,Huanjing Yue,Shangbin Xie,Xin Liu,Jingyu Yang

Main category: cs.CV

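TL;DR: This paper proposes a training-free framework that uses video diffusion models to assist image diffusion methods and improve temporal consistency in zero-shot video restoration and enhancement, via homologous and heterogeneous latent fusion, a COT-based fusion ratio strategy, and temporal-strengthening post-processing.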

Details

Motivation: Applying existing diffusion-based zero-shot image restoration methods to video produces severe temporal flickering, so temporal consistency needs to be improved. Method: The framework proposes homologous latent fusion, heterogeneous latent fusion, and a COT-based fusion ratio strategy, combined with temporal-strengthening post-processing that uses an image-to-video diffusion model. Result: Experiments show that the method markedly improves the temporal consistency of videos while preserving image-level restoration quality, and that it applies to any diffusion-based image restoration method. Conclusion: This is the first framework to bring video diffusion models into zero-shot video restoration to strengthen temporal consistency; it is general, training-free, and effective. Abstract: Although diffusion-based zero-shot image restoration and enhancement methods have achieved great success, applying them to video restoration or enhancement will lead to severe temporal flickering. In this paper, we propose the first framework that utilizes the rapidly developing video diffusion model to assist the image-based method in maintaining better temporal consistency for zero-shot video restoration and enhancement. We propose homologous latents fusion, heterogeneous latents fusion, and a COT-based fusion ratio strategy to utilize both homologous and heterogeneous text-to-video diffusion models to complement the image method. Moreover, we propose temporal-strengthening post-processing to utilize the image-to-video diffusion model to further improve temporal consistency. Our method is training-free and can be applied to any diffusion-based image restoration and enhancement methods. Experimental results demonstrate the superiority of the proposed method.

[145] Just Noticeable Difference Modeling for Deep Visual Features

Rui Zhao,Wenrui Li,Lin Zhu,Yajing Zheng,Weisi Lin

Main category: cs.CV

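TL;DR: This paper proposes FeatJND, a task-oriented just noticeable difference (JND) model for deep visual features that predicts the maximum tolerable perturbation of each feature while preserving downstream task performance; it is validated on classification, detection, and instance segmentation and applied to dynamic quantization.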

Details

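Motivation: Deep visual features are increasingly the interface of vision systems, so their characteristics need to be described and their quality controlled; extending the JND concept from human/machine vision to feature space provides a task-aligned tolerance boundary that can guide feature quality control under constrained resources. Method: FeatJND is a task-aligned JND formulation that predicts the maximum tolerable per-feature perturbation map while preserving downstream task performance; a FeatJND estimator is built at standardized split points and validated on image classification, object detection, and instance segmentation; it is further applied to token-wise dynamic quantization, where FeatJND-guided step-size allocation improves performance under a noise budget. Result: At matched distortion strength, FeatJND-based distortions preserve task performance better than unstructured Gaussian perturbations; attribution visualizations suggest FeatJND can suppress non-critical feature regions; in dynamic quantization, FeatJND-guided step-size allocation clearly outperforms random step-size permutation and a global uniform step size. Conclusion: FeatJND provides a task-driven quality control criterion for deep features, is of practical value in resource-constrained settings such as feature compression and quantization, and is a generalizable paradigm for JND modeling in feature space. Abstract: Deep visual features are increasingly used as the interface in vision systems, motivating the need to describe feature characteristics and control feature quality for machine perception. Just noticeable difference (JND) characterizes the maximum imperceptible distortion for images under human or machine vision. Extending it to deep visual features naturally meets the above demand by providing a task-aligned tolerance boundary in feature space, offering a practical reference for controlling feature quality under constrained resources. We propose FeatJND, a task-aligned JND formulation that predicts the maximum tolerable per-feature perturbation map while preserving downstream task performance. We propose a FeatJND estimator at standardized split points and validate it across image classification, detection, and instance segmentation. Under matched distortion strength, FeatJND-based distortions consistently preserve higher task performance than unstructured Gaussian perturbations, and attribution visualizations suggest FeatJND can suppress non-critical feature regions. As an application, we further apply FeatJND to token-wise dynamic quantization and show that FeatJND-guided step-size allocation yields clear gains over random step-size permutation and global uniform step size under the same noise budget. Our code will be released after publication.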

[146] BookNet: Book Image Rectification via Cross-Page Attention Network

Shaokai Liu,Hao Feng,Bozhi Luan,Min Hou,Jiajun Deng,Wengang Zhou

Main category: cs.CV

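TL;DR: This paper introduces BookNet, the first end-to-end deep learning framework designed for dual-page book image rectification; it uses a dual-branch architecture with cross-page attention to model the geometric coupling between left and right pages, and it contributes the synthetic dataset Book3D and the real-world benchmark Book100, with experiments showing better performance than existing methods.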

Details

Motivation: Existing single-page document rectification methods cannot capture the coupled geometric relationship between adjacent pages in a book, where binding constraints make the left and right pages curve in distinctly asymmetric ways; a method dedicated to dual-page rectification is needed. Method: BookNet uses a dual-branch network with cross-page attention to jointly estimate warping flows for individual pages and the complete book spread; the large-scale synthetic dataset Book3D is built for training and the real-world benchmark Book100 for evaluation. Result: BookNet outperforms existing state-of-the-art methods on book image rectification, with effectiveness and generalization validated on the real-world Book100 benchmark. Conclusion: BookNet is the first to model the dual-page geometric coupling of books explicitly; together with its dedicated datasets, it offers a new approach to book image rectification and advances document image processing under complex geometric distortion. Abstract: Book image rectification presents unique challenges in document image processing due to complex geometric distortions from binding constraints, where left and right pages exhibit distinctly asymmetric curvature patterns. However, existing single-page document image rectification methods fail to capture the coupled geometric relationships between adjacent pages in books. In this work, we introduce BookNet, the first end-to-end deep learning framework specifically designed for dual-page book image rectification. BookNet adopts a dual-branch architecture with cross-page attention mechanisms, enabling it to estimate warping flows for both individual pages and the complete book spread, explicitly modeling how left and right pages influence each other. Moreover, to address the absence of specialized datasets, we present Book3D, a large-scale synthetic dataset for training, and Book100, a comprehensive real-world benchmark for evaluation. Extensive experiments demonstrate that BookNet outperforms existing state-of-the-art methods on book image rectification. Code and dataset will be made publicly available.

[147] Deep Models, Shallow Alignment: Uncovering the Granularity Mismatch in Neural Decoding

Yang Du,Siyuan Dai,Yonghao Song,Paul M. Thompson,Haoteng Tang,Liang Zhan

Main category: cs.CV

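TL;DR: This paper proposes Shallow Alignment, which aligns neural signals with the intermediate-layer representations of visual encoders rather than their final outputs to resolve the granularity mismatch between human and machine vision; it improves neural visual decoding by 22% to 58% across multiple benchmarks and shows that decoding performance scales predictably with vision backbone capacity.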

Details

Motivation: Existing neural visual decoding methods overlook a fundamental granularity mismatch between human and machine vision: deep vision models emphasize semantic invariance and suppress local texture information, whereas neural signals preserve a mixture of low-level visual attributes and high-level semantic content. Method: Shallow Alignment is a contrastive learning strategy that aligns neural signals with intermediate-layer representations of visual encoders, to better balance low-level texture details and high-level semantic features. Result: It significantly outperforms standard final-layer alignment on multiple benchmarks, with gains of 22% to 58%, and it unlocks a scaling law in neural visual decoding, so that decoding performance improves predictably with the capacity of the pre-trained vision backbone. Conclusion: By aligning at intermediate layers, Shallow Alignment bridges the human-machine granularity gap, substantially improves decoding performance, and reveals a quantitative relationship between model capacity and decoding ability, offering a new approach to modeling neural representations in brain-computer interfaces. Abstract: Neural visual decoding is a central problem in brain-computer interface research, aiming to reconstruct human visual perception and to elucidate the structure of neural representations. However, existing approaches overlook a fundamental granularity mismatch between human and machine vision, where deep vision models emphasize semantic invariance by suppressing local texture information, whereas neural signals preserve an intricate mixture of low-level visual attributes and high-level semantic content. To address this mismatch, we propose Shallow Alignment, a novel contrastive learning strategy that aligns neural signals with intermediate representations of visual encoders rather than their final outputs, thereby striking a better balance between low-level texture details and high-level semantic features. Extensive experiments across multiple benchmarks demonstrate that Shallow Alignment significantly outperforms standard final-layer alignment, with performance gains ranging from 22% to 58% across diverse vision backbones. Notably, our approach effectively unlocks the scaling law in neural visual decoding, enabling decoding performance to scale predictably with the capacity of pre-trained vision backbones. We further conduct systematic empirical analyses to shed light on the mechanisms underlying the observed performance gains.
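
A contrastive alignment objective of the kind described can be sketched with a symmetric InfoNCE loss. The abstract does not name the exact loss, so InfoNCE, the temperature value, and the function names are assumptions; in the paper's setting, `visual` would hold features taken from an intermediate encoder layer rather than the final output, but the loss itself is agnostic to which layer supplies them.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)


def symmetric_info_nce(neural, visual, temperature=0.07):
    """Symmetric InfoNCE between paired neural and visual embeddings.

    neural[i] and visual[i] are treated as a positive pair; every other
    combination in the batch is a negative. Assumed loss, for illustration.
    """
    a = [_normalize(v) for v in neural]
    b = [_normalize(v) for v in visual]
    logits = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
              for ai in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))


# Toy batch of two pairs: correctly paired vs. swapped visual features.
neural = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_info_nce(neural, matched))
print(symmetric_info_nce(neural, swapped))
```

The loss is near zero when each neural embedding is closest to its own visual feature and large when the pairing is wrong, which is the signal that drives alignment.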

[148] PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Cheng Cui,Ting Sun,Suyin Liang,Tingquan Gao,Zelun Zhang,Jiaxuan Liu,Xueqing Wang,Changda Zhou,Hongen Liu,Manhui Lin,Yue Zhang,Yubo Zhang,Yi Liu,Dianhai Yu,Yanjun Ma

Main category: cs.CV

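TL;DR: PaddleOCR-VL-1.5 is an ultra-compact 0.9B-parameter vision-language model that reaches a state-of-the-art 94.5% accuracy on OmniDocBench v1.5, demonstrates strong robustness to real-world physical distortions on the newly proposed Real5-OmniDocBench benchmark, and adds support for seal recognition and text spotting.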

Details

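Motivation: To make document understanding models more robust to real-world physical distortions such as scanning, skew, warping, screen photography, and illumination changes, and to extend their multi-task capabilities (e.g., seal recognition and text spotting). Method: The PaddleOCR-VL model is upgraded, the Real5-OmniDocBench benchmark is built to evaluate robustness against real-world physical distortions, and seal recognition and text spotting tasks are incorporated. Result: The model reaches a SOTA accuracy of 94.5% on OmniDocBench v1.5 and SOTA performance on Real5-OmniDocBench, while remaining a 0.9B-parameter model that is both accurate and efficient. Conclusion: PaddleOCR-VL-1.5 delivers clear gains in accuracy, robustness, and multi-task capability, making it an efficient and practical VLM for real-world document understanding. Abstract: We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model's capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: https://github.com/PaddlePaddle/PaddleOCR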

[149] Causal World Modeling for Robot Control

Lin Li,Qihang Zhang,Yiming Luo,Shuai Yang,Ruilin Wang,Fei Han,Mingrui Yu,Zelin Gao,Nan Xue,Xing Zhu,Yujun Shen,Yinghao Xu

Main category: cs.CV

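TL;DR: This paper proposes LingBot-VA, an autoregressive diffusion framework that combines video world modeling with vision-language pre-training and achieves efficient robot learning through a shared latent space, closed-loop rollout, and asynchronous inference.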

Details

Motivation: Video world modeling captures the causality between actions and visual dynamics, letting a robot "imagine" the near future; together with vision-language pre-training, it can provide a new foundation for robot learning. Method: LingBot-VA features (1) a shared latent space for vision and action tokens driven by a Mixture-of-Transformers architecture; (2) a closed-loop rollout mechanism that incorporates feedback from real observations; and (3) an asynchronous inference pipeline that parallelizes action prediction and execution. Result: Evaluations in simulation and real-world scenarios show significant promise in long-horizon manipulation, high data efficiency in post-training, and strong generalization to novel configurations. Conclusion: Jointly modeling video world dynamics and vision-language is an important new paradigm for robot learning, and LingBot-VA offers an efficient, general, unified modeling framework for embodied intelligence. Abstract: This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.

[150] Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

Linhan Wang,Zichong Yang,Chen Bai,Guoxiang Zhang,Xiaotong Liu,Xiaoyin Zheng,Xiao-Xiao Long,Chang-Tien Lu,Cheng Lu

Main category: cs.CV

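TL;DR: This paper proposes Drive-JEPA, a framework that combines the Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation to improve planning representations for end-to-end autonomous driving, setting a new SOTA on the NAVSIM benchmark.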

Details

Motivation: Existing end-to-end driving methods based on self-supervised video pretraining bring only limited gains in scene understanding, and because each driving scene provides a single human trajectory, multimodal behaviors are hard to learn. Method: (1) Adapt V-JEPA to end-to-end driving by pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning; (2) design a proposal-centric planner that distills simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. Result: On NAVSIM, the V-JEPA representation with a simple transformer decoder outperforms prior methods by 3 PDMS in the perception-free setting; the complete Drive-JEPA framework reaches 93.3 PDMS on v1 and 87.8 EPDMS on v2, a new SOTA. Conclusion: By combining video representation learning with multimodal trajectory distillation, Drive-JEPA eases the modeling bottleneck caused by single-trajectory supervision and improves end-to-end planning performance. Abstract: End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.

[151] Understanding Multimodal Complementarity for Single-Frame Action Anticipation

Manuel Benavent-Lledo,Konstantinos Bacharidis,Konstantinos Papoutsakis,Antonis Argyros,Jose Garcia-Rodriguez

Main category: cs.CV

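TL;DR: This paper challenges the assumption that action anticipation requires dense temporal information, explores how far human action anticipation can go from a single frame, and proposes the refined framework AAG+, which matches or exceeds existing video-based methods on several benchmarks.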

Details

Motivation: Conventional action anticipation relies on video sequences; this work asks how much information about future actions is already contained in a single frame and how it can be exploited effectively. Method: Building on the authors' earlier AAG, systematically analyze the contributions of RGB appearance, depth-based geometric cues, and semantic representations of past actions, study how multimodal fusion strategies, keyframe selection policies, and past-action history sources affect performance, and consolidate the best design choices into AAG+. Result: Using only a single frame, AAG+ improves on the original AAG on challenging benchmarks including IKEA-ASM, Meccano, and Assembly101, and matches or exceeds state-of-the-art video-based methods. Conclusion: Single-frame action anticipation has substantial potential, and dense temporal modeling is not always necessary; in certain settings a carefully selected keyframe is sufficient for strong performance. Abstract: Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.

[152] Urban Neural Surface Reconstruction from Constrained Sparse Aerial Imagery with 3D SAR Fusion

Da Li,Chen Yao,Tong Mao,Jiacheng Bao,Houjun Sun

Main category: cs.CV

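TL;DR: This paper presents the first neural surface reconstruction framework that fuses 3D SAR point clouds with aerial imagery for high-fidelity urban 3D reconstruction under sparse views, improving accuracy, completeness, and robustness.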

Details

Motivation: Existing neural surface reconstruction methods suffer from geometric ambiguity and instability under sparse views, while aerial image acquisition in urban remote sensing is often limited by flight paths, terrain, and cost. Method: Integrate spatial constraints from 3D SAR point clouds into an SDF-based neural surface reconstruction backbone to guide structure-aware ray selection and adaptive sampling; construct the first benchmark dataset of co-registered 3D SAR point clouds and aerial imagery. Result: Under highly sparse and oblique-view conditions, reconstruction accuracy, completeness, and robustness improve markedly over single-modality baselines. Conclusion: Fusing optical and SAR sensing is a viable route to scalable, high-fidelity urban 3D reconstruction. Abstract: Neural surface reconstruction (NSR) has recently shown strong potential for urban 3D reconstruction from multi-view aerial imagery. However, existing NSR methods often suffer from geometric ambiguity and instability, particularly under sparse-view conditions. This issue is critical in large-scale urban remote sensing, where aerial image acquisition is limited by flight paths, terrain, and cost. To address this challenge, we present the first urban NSR framework that fuses 3D synthetic aperture radar (SAR) point clouds with aerial imagery for high-fidelity reconstruction under constrained, sparse-view settings. 3D SAR can efficiently capture large-scale geometry even from a single side-looking flight path, providing robust priors that complement photometric cues from images. Our framework integrates radar-derived spatial constraints into an SDF-based NSR backbone, guiding structure-aware ray selection and adaptive sampling for stable and efficient optimization. We also construct the first benchmark dataset with co-registered 3D SAR point clouds and aerial imagery, facilitating systematic evaluation of cross-modal 3D reconstruction. Extensive experiments show that incorporating 3D SAR markedly enhances reconstruction accuracy, completeness, and robustness compared with single-modality baselines under highly sparse and oblique-view conditions, highlighting a viable route toward scalable high-fidelity urban reconstruction with advanced airborne and spaceborne optical-SAR sensing.

[153] PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction

Changjian Jiang,Kerui Ren,Xudong Li,Kaiwen Song,Linning Xu,Tao Lu,Junting Dong,Yu Zhang,Bo Dai,Mulin Yu

Main category: cs.CV

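TL;DR: PLANING is an online 3D reconstruction framework for monocular image streams that uses a hybrid representation of explicit geometric primitives and neural Gaussians to model geometry and appearance in a decoupled manner, delivering both high-quality rendering and accurate geometry while outperforming existing methods in speed and accuracy.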

Details

Motivation: Existing monocular streaming reconstruction methods struggle to deliver high-fidelity rendering and accurate geometry at the same time, and face a trade-off between quality and efficiency. Method: Proposes PLANING, a loosely coupled hybrid representation (explicit geometric primitives plus neural Gaussians) with an online initialization and optimization strategy that separates geometry and appearance updates. Result: Improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, and reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, while matching the quality of offline optimization. Conclusion: PLANING achieves efficient, stable, high-quality streaming reconstruction, and its structural clarity and computational efficiency suit downstream tasks such as large-scale scene modeling and simulation for embodied AI. Abstract: Streaming reconstruction from monocular image sequences remains challenging, as existing methods typically favor either high-quality rendering or accurate geometry, but rarely both. We present PLANING, an efficient on-the-fly reconstruction framework built on a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians, enabling geometry and appearance to be modeled in a decoupled manner. This decoupling supports an online initialization and optimization strategy that separates geometry and appearance updates, yielding stable streaming reconstruction with substantially reduced structural redundancy. PLANING improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, and reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, while matching the quality of offline per-scene optimization. Beyond reconstruction quality, the structural clarity and computational efficiency of PLANING make it well suited for a broad range of downstream applications, such as enabling large-scale scene modeling and simulation-ready environments for embodied AI. Project page: https://city-super.github.io/PLANING/ .
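
For readers unfamiliar with the Chamfer-L2 metric quoted above, the following is a minimal sketch of one common definition: the symmetric sum of mean squared nearest-neighbor distances between two point sets. The abstract does not state which variant PLANING uses (squared versus unsquared, sum versus average of the two directions), so the convention here and the names `squared_dist` and `chamfer_l2` are assumptions for illustration only.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def squared_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_l2(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance with squared-L2 nearest-neighbor terms.

    Assumed convention: mean over each set of the squared distance to the
    nearest point in the other set, with the two directions summed.
    Brute force, O(len(a) * len(b)); fine for small illustrative inputs.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(squared_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(squared_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a
```

Under this convention, lower is better, so an 18.52% improvement corresponds to a lower Chamfer-L2 value than the baseline.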

[154] MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Baorui Ma,Jiahui Yang,Donglin Di,Xuancheng Zhang,Jianxun Cui,Hao Li,Yan Xie,Wei Chen

Main category: cs.CV

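TL;DR: This paper proposes Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy multi-source 3D data through a Sparse Metric Prompt, without manually engineered prompts, camera-specific modeling, or task-specific architectures; it reaches SOTA on multiple depth estimation and 3D understanding tasks and improves the spatial intelligence of multimodal large language models.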

Details

Motivation: Metric depth estimation is held back by heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in cross-source 3D data, which makes it hard to follow the scaling paradigm that has worked for vision foundation models. Method: Introduces the Sparse Metric Prompt (randomly masked depth maps) as a universal interface that decouples spatial reasoning from sensor and camera biases; pretrains at scale on about 20M image-depth pairs from reconstructed, captured, and rendered sources spanning 10,000 camera models. Result: Observes a clear scaling trend for metric depth for the first time; the pretrained model excels at prompt-driven tasks such as depth completion, super-resolution, and radar-camera fusion; its distilled prompt-free student reaches SOTA on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning; using the Metric Anything ViT as a visual encoder significantly improves the spatial intelligence of multimodal large language models. Conclusion: Metric depth estimation can benefit from the same scaling laws as modern foundation models, and Metric Anything opens a new path toward scalable, efficient real-world metric perception. Abstract: Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution and Radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception.
We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
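
The abstract describes the Sparse Metric Prompt only as "created by randomly masking depth maps". A minimal sketch of that idea follows, assuming independent per-pixel masking and a value of 0.0 to mark "no measurement"; the function name `sparse_metric_prompt`, the `keep_ratio` parameter, and the masking scheme are hypothetical and may differ from the authors' implementation.

```python
import random
from typing import List, Optional


def sparse_metric_prompt(
    depth: List[List[float]],
    keep_ratio: float,
    seed: Optional[int] = None,
) -> List[List[float]]:
    """Return a sparse copy of a dense metric depth map.

    Each pixel is kept independently with probability `keep_ratio`;
    dropped pixels are set to 0.0, used here as the "missing" marker
    (an assumption, not something the abstract specifies).
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    rng = random.Random(seed)
    return [
        [d if rng.random() < keep_ratio else 0.0 for d in row]
        for row in depth
    ]
```

Varying `keep_ratio` would let one prompt format mimic sources of very different density, from near-dense rendered depth to sparse radar returns, which is one plausible reading of the "universal interface" claim.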

[155] UEval: A Benchmark for Unified Multimodal Generation

Bo Li,Yida Yin,Wenhao Chai,Xingyu Fu,Zhuang Liu

Main category: cs.CV

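TL;DR: UEval is a new benchmark for evaluating unified multimodal models (models that generate both images and text); it contains 1,000 expert-curated questions across tasks and scores outputs with fine-grained, human-validated rubrics.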

Details

Motivation: Existing methods struggle to evaluate open-ended multimodal generation accurately, especially when images and text are produced together, and simple LLM-as-a-judge approaches tend to miss details; a more reliable, scalable, and fine-grained evaluation framework is needed. Method: Build the UEval benchmark of 1,000 expert-curated multimodal questions covering 8 real-world tasks and a range of reasoning types; for each question an MLLM generates an initial rubric, which human experts then refine and validate, yielding 10,417 validated rubric criteria that enable automatic, scalable, fine-grained scoring. Result: Current unified models perform modestly on UEval (GPT-5-Thinking scores only 66.4/100 and the best open-source model only 49.1/100); reasoning models often outperform non-reasoning ones; transferring reasoning traces to a non-reasoning model significantly narrows the gap. Conclusion: Reasoning may be important for complex multimodal understanding and generation; UEval offers a more robust, interpretable, and scalable way to evaluate unified models. Abstract: We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Different from previous works that rely on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric, consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
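
To make the rubric-based scoring concrete, here is a minimal sketch of how per-criterion judgments could be aggregated into a score out of 100. The abstract does not specify the aggregation rule, so the equal weighting of criteria, the per-question averaging, and the names `question_score` and `benchmark_score` are assumptions rather than UEval's actual procedure.

```python
from typing import List


def question_score(criteria_met: List[bool]) -> float:
    """Score one question as the percentage of rubric criteria satisfied.

    Assumes every criterion carries equal weight (hypothetical).
    """
    if not criteria_met:
        raise ValueError("a rubric needs at least one criterion")
    return 100.0 * sum(criteria_met) / len(criteria_met)


def benchmark_score(per_question: List[List[bool]]) -> float:
    """Average the per-question scores over the whole benchmark."""
    if not per_question:
        raise ValueError("no questions to score")
    return sum(question_score(c) for c in per_question) / len(per_question)
```

With roughly ten criteria per question on average (10,417 criteria over 1,000 questions), a scheme like this yields much finer-grained scores than a single pass/fail judgment per answer.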

[156] Unsupervised Decomposition and Recombination with Discriminator-Driven Diffusion Models

Archer Wang,Emile Anand,Yilun Du,Marin Soljačić

Main category: cs.CV

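TL;DR: This paper proposes an adversarially trained diffusion approach that learns factorized latent representations without supervision and uses cross-sample factor recombination to improve generation consistency and disentanglement, outperforming prior baselines on image and robotic-video tasks.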

Details

Motivation: Decomposing complex data into reusable factorized representations helps reveal data structure and supports compositional generation, but existing unsupervised diffusion models still fall short in factor discovery and recombination consistency. Method: Introduce a discriminator that distinguishes single-source samples from samples generated by recombining factors across sources; the generator is optimized adversarially to fool this discriminator, which encourages physical and semantic consistency in the recombinations. Result: Lower FID and better MIG and MCC on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D; significantly increased state-space coverage on LIBERO robotic video trajectories. Conclusion: The proposed adversarial factor-recombination mechanism improves the disentanglement and generation quality of unsupervised diffusion models and extends to a new application in action synthesis for embodied agents. Abstract: Decomposing complex data into factorized representations can reveal reusable components and enable synthesizing new samples via component recombination. We investigate this in the context of diffusion-based models that learn factorized latent spaces without factor-level supervision. In images, factors can capture background, illumination, and object attributes; in robotic videos, they can capture reusable motion components. To improve both latent factor discovery and quality of compositional generation, we introduce an adversarial training signal via a discriminator trained to distinguish between single-source samples and those generated by recombining factors across sources. By optimizing the generator to fool this discriminator, we encourage physical and semantic consistency in the resulting recombinations. Our method outperforms implementations of prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, achieving lower FID scores and better disentanglement as measured by MIG and MCC. Furthermore, we demonstrate a novel application to robotic video trajectories: by recombining learned action components, we generate diverse sequences that significantly increase state-space coverage for exploration on the LIBERO benchmark.

[157] Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Wenxuan Huang,Yu Zeng,Qiuchen Wang,Zhen Fang,Shaosheng Cao,Zheng Chu,Qingyu Yin,Shuang Chen,Zhenfei Yin,Lin Chen,Zehui Chen,Yao Hu,Philip Torr,Feng Zhao,Wanli Ouyang

Main category: cs.CV

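TL;DR: This paper proposes Vision-DeepResearch, a multimodal deep-research paradigm supporting multi-turn, multi-entity, and multi-scale visual and textual search; cold-start supervision and reinforcement learning internalize deep-research capabilities into the MLLM, which substantially outperforms existing methods and workflows built on closed-source models.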

Details

Motivation: Existing search-augmented MLLMs assume that a single image or text query is enough to retrieve the key evidence, ignoring the heavy visual noise and multi-source evidence aggregation of real-world scenarios, and they are limited in reasoning depth and search breadth. Method: Proposes the Vision-DeepResearch paradigm, supporting dozens of reasoning steps and hundreds of search-engine interactions; cold-start supervision and reinforcement learning internalize multi-turn, multi-entity, multi-scale visual and textual deep search into the MLLM. Result: Substantially outperforms existing methods on multimodal deep-research tasks, as well as workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet. Conclusion: Vision-DeepResearch delivers more robust and deeper multimodal retrieval and reasoning, offering a new paradigm for factual question answering with MLLMs in noisy real-world settings. Abstract: Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs by "reasoning-then-tool-call" for visual and textual search engines to obtain substantial gains on tasks requiring extensive factual information. However, these approaches typically define multimodal search in a naive setting, assuming that a single full-level or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. To address these limitations, we propose Vision-DeepResearch, which introduces a new multimodal deep-research paradigm, i.e., it performs multi-turn, multi-entity and multi-scale visual and textual search to robustly hit real-world search engines under heavy noise. Our Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs, and workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro and Claude-4-Sonnet.
The code will be released at https://github.com/Osilly/Vision-DeepResearch.

[158] BLO-Inst: Bi-Level Optimization Based Alignment of YOLO and SAM for Robust Instance Segmentation

Li Zhang,Pengtao Xie

Main category: cs.CV

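TL;DR: This paper proposes BLO-Inst, a framework that uses bi-level optimization to align object detection with SAM's segmentation objective, turning the detector into a prompt generator optimized for segmentation quality and improving automated image segmentation.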

Details

Motivation: SAM has zero-shot segmentation capability but depends on manual prompts; existing approaches that use a detector as the prompt generator suffer from objective mismatch and from alignment overfitting in joint training. Method: Proposes the unified BLO-Inst framework with bi-level optimization: the lower level fine-tunes SAM on data subset D1 to improve segmentation fidelity for the given detection boxes; the upper level updates the detector on a separate subset D2 so that its bounding boxes minimize the validation loss of the fine-tuned SAM. Result: BLO-Inst outperforms standard baselines on tasks in general and biomedical domains. Conclusion: BLO-Inst turns the detector into a segmentation-aware prompt generator, shifting the optimization target from localization accuracy to downstream mask quality and addressing both objective mismatch and alignment overfitting. Abstract: The Segment Anything Model has revolutionized image segmentation with its zero-shot capabilities, yet its reliance on manual prompts hinders fully automated deployment. While integrating object detectors as prompt generators offers a pathway to automation, existing pipelines suffer from two fundamental limitations: objective mismatch, where detectors optimized for geometric localization do not correspond to the optimal prompting context required by SAM, and alignment overfitting in standard joint training, where the detector simply memorizes specific prompt adjustments for training samples rather than learning a generalizable policy. To bridge this gap, we introduce BLO-Inst, a unified framework that aligns detection and segmentation objectives by bi-level optimization. We formulate the alignment as a nested optimization problem over disjoint data splits. In the lower level, the SAM is fine-tuned to maximize segmentation fidelity given the current detection proposals on a subset ($D_1$). In the upper level, the detector is updated to generate bounding boxes that explicitly minimize the validation loss of the fine-tuned SAM on a separate subset ($D_2$). This effectively transforms the detector into a segmentation-aware prompt generator, optimizing the bounding boxes not just for localization accuracy, but for downstream mask quality. Extensive experiments demonstrate that BLO-Inst achieves superior performance, outperforming standard baselines on tasks in general and biomedical domains.
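
The nested problem described in the abstract can be written in the generic bi-level form below. This is a sketch under assumed notation, not the paper's own formulation: $\theta$ (detector parameters), $\phi$ (SAM parameters), $\mathcal{L}_{\mathrm{seg}}$ and $\mathcal{L}_{\mathrm{val}}$ are symbols introduced here for illustration, and "maximize segmentation fidelity" is expressed as minimizing a segmentation loss.

```latex
\begin{aligned}
\min_{\theta}\quad & \mathcal{L}_{\mathrm{val}}\big(\phi^{*}(\theta),\, \theta;\, D_2\big) \\
\text{s.t.}\quad & \phi^{*}(\theta) \;=\; \arg\min_{\phi}\; \mathcal{L}_{\mathrm{seg}}\big(\phi,\, \theta;\, D_1\big)
\end{aligned}
```

Because the upper-level loss is evaluated on $D_2$ while SAM is fitted on the disjoint split $D_1$, the detector is rewarded only for prompts that help on data the fine-tuned SAM has not been fitted to, which is how the formulation targets the alignment-overfitting problem named in the abstract.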

[159] RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation

Hanzhuo Huang,Qingyang Bao,Zekai Gu,Zhongshuo Du,Cheng Lin,Yuan Liu,Sibei Yang

Main category: cs.CV

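TL;DR: This paper proposes a 3D asset-referenced diffusion model whose dual-branch perception architecture jointly models multi-view RGB images and point maps, achieving close consistency between generated images and the 3D reference.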

Details

Motivation: Existing generation methods conditioned on a single reference image cannot leverage 3D assets, which limits their practical versatility. Method: Build a cross-domain dual-branch diffusion model that processes multi-view RGB images and point maps, with spatial alignment and a domain-decoupled generation mechanism, to simultaneously generate RGB images and point maps that are content-disentangled but spatially aligned. Result: Experiments show the method effectively uses 3D assets as references to generate images consistent with them, strengthening the link between 2D image generation and 3D content creation. Conclusion: The work offers a new way to combine diffusion models with 3D assets and broadens the use of generative models in 3D content creation. Abstract: In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the simultaneous generation of two spatially aligned but content-disentangled outputs, RGB images and point maps, linking 2D image attributes with 3D asset attributes. Experiments show that our approach effectively uses 3D assets as references to produce images consistent with the given assets, opening new possibilities for combining diffusion models with 3D content creation.

[160] SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence

Saoud Aldowaish,Yashwanth Karumanchi,Kai-Chen Chiang,Soroosh Noorzad,Morteza Fayazi

Main category: cs.CV

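TL;DR: This paper presents SINA, an open-source, fully automated circuit schematic image-to-netlist generator that combines deep learning, connected-component labeling, OCR, and a vision-language model, reaching 96.47% netlist-generation accuracy, reported as 2.72x higher than state-of-the-art approaches.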

Details

Motivation: Existing methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. Method: SINA combines deep learning for component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, OCR for retrieving component reference designators, and a Vision-Language Model (VLM) for reliable reference designator assignment. Result: Experiments show 96.47% overall netlist-generation accuracy, 2.72x higher than state-of-the-art approaches. Conclusion: SINA is an efficient, fully automated schematic image parser that substantially improves the accuracy and reliability of netlist generation. Abstract: Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.
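
As an illustration of the Connected-Component Labeling step, the sketch below labels 4-connected regions of a binary wire mask; each label could then be treated as one electrical net. It assumes detected components have already been masked out of the image so that only wire pixels remain, which is an assumption about the pipeline rather than something the abstract states; `label_components` is a hypothetical name, not SINA's API.

```python
from collections import deque
from typing import List, Tuple


def label_components(grid: List[List[int]]) -> Tuple[List[List[int]], int]:
    """Label 4-connected regions of 1-pixels in a binary grid.

    Returns (labels, count): labels[r][c] is 0 for background and
    1..count for foreground pixels, assigned in raster-scan order.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or labels[r][c] != 0:
                continue
            # Found an unlabeled wire pixel: flood-fill its region with BFS.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and labels[ny][nx] == 0):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

Component terminals that touch pixels with the same label would then be listed on the same net in the generated netlist.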

[161] Creative Image Generation with Diffusion Model

Kunpeng Song,Ahmed Elgammal

Main category: cs.CV

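TL;DR: This paper proposes a new diffusion-based framework for creative image generation that increases creativity by steering generated images toward low-probability regions of the CLIP embedding space, with pullback mechanisms to preserve visual fidelity.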

Details

Motivation: Existing methods rely on manual concept blending or subcategory exclusion and lack a quantitative, principled model of creativity; a generation method is needed that automatically produces novel, high-quality, imaginative images. Method: Define creativity as the inverse probability of an image in the CLIP embedding space and explicitly drive the generated distribution toward low-probability regions during diffusion; introduce pullback mechanisms to balance creativity and visual fidelity. Result: Experiments on text-to-image diffusion models demonstrate the framework's effectiveness and efficiency, producing images that are creative, distinctive, and visually strong. Conclusion: The work gives a computable, optimizable, and principled definition of creativity in generative models, supporting systematic study of novelty in visual content synthesis. Abstract: Creative image generation has emerged as a compelling area of research, driven by the need to produce novel and high-quality images that expand the boundaries of imagination. In this work, we propose a novel framework for creative generation using diffusion models, where creativity is associated with the inverse probability of an image's existence in the CLIP embedding space. Unlike prior approaches that rely on a manual blending of concepts or exclusion of subcategories, our method calculates the probability distribution of generated images and drives it towards low-probability regions to produce rare, imaginative, and visually captivating outputs. We also introduce pullback mechanisms, achieving high creativity without sacrificing visual fidelity. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness and efficiency of our creative generation framework, showcasing its ability to produce unique, novel, and thought-provoking images. This work provides a new perspective on creativity in generative models, offering a principled method to foster innovation in visual content synthesis.

[162] EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers

John Flynn,Wolfgang Paier,Dimitar Dinev,Sam Nhut Nguyen,Hayk Poghosyan,Manuel Toribio,Sandipan Banerjee,Guy Gafni

Main category: cs.CV

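TL;DR: This paper presents EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables fine-grained, transcript-based editing of existing talking-head videos (adding, removing, or retiming spoken content) while preserving motion coherence, speaker identity, and lip synchronization.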

Details

Motivation: Generative video models are good at creating new videos from text or images but leave a critical gap in editing existing recordings, especially when the spoken script changes: motion consistency, identity preservation, and accurate lip synchronization are hard to maintain together. Method: Starting from a general-purpose video diffusion model (DiT), add audio conditioning and region-aware, edit-focused training extensions, and use spatiotemporal inpainting to achieve precise lip synchronization and coherent motion synthesis. Result: Enables transcript-driven editing of talking-head videos (adding, removing, and retiming spoken content) while maintaining visual fidelity, identity consistency, and natural human motion over long durations. Conclusion: EditYourself is a foundational step toward generative video models as practical tools for professional video post-production. Abstract: Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.

[163] Early and Prediagnostic Detection of Pancreatic Cancer from Computed Tomography

Wenxuan Li,Pedro R. A. S. Bassi,Lizhou Wu,Xinze Zhou,Yuxuan Zhao,Qi Chen,Szymon Plotka,Tianyu Lin,Zheren Zhu,Marisa Martin,Justin Caskey,Shanshan Jiang,Xiaoxi Chen,Jaroslaw B. Ćwikla,Artur Sankowski,Yaping Wu,Sergio Decherchi,Andrea Cavalli,Chandana Lall,Cristian Tomasetti,Yaxing Guo,Xuan Yu,Yuqing Cai,Hualin Qiao,Jie Bao,Chenhan Hu,Ximing Wang,Arkadiusz Sitek,Kai Ding,Heng Li,Meiyun Wang,Dexin Yu,Guang Zhang,Yang Yang,Kang Wang,Alan L. Yuille,Zongwei Zhou

Main category: cs.CV

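TL;DR: This paper presents ePAI, an AI system for early detection of pancreatic ductal adenocarcinoma (PDAC) on CT scans; it shows high AUC, sensitivity, and specificity in internal and external tests and can find small PDAC lesions missed by radiologists months before clinical diagnosis.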

Details

Motivation: PDAC is often found at a late stage, yet retrospective reviews show that expert radiologists can identify previously overlooked lesions on CT scans taken before diagnosis, so an automated tool to assist early detection is needed. Method: Develop a deep learning system, ePAI, trained on CT data from 1,598 patients at a single center and validated on an internal set (1,009 patients) and an external set (7,158 patients across 6 centers); run a multi-reader study comparing it with 30 board-certified radiologists. Result: In the internal test, ePAI achieves an AUC of 0.939-0.999, 95.3% sensitivity, and 98.7% specificity for small PDAC under 2 cm, localizing lesions as small as 2 mm; in the external test, an AUC of 0.918-0.945, 91.5% sensitivity, and 88.0% specificity, localizing lesions as small as 5 mm; on CT scans taken 3 to 36 months before clinical diagnosis it detects 75 of 159 PDAC cases, with a median lead time of 347 days; in the multi-reader study its sensitivity exceeds that of radiologists by 50.3% (P < 0.05) at a comparable specificity (95.4%). Conclusion: ePAI is a clinically promising AI assistive tool that could substantially improve early PDAC detection and patient outcomes. Abstract: Pancreatic ductal adenocarcinoma (PDAC), one of the deadliest solid malignancies, is often detected at a late and inoperable stage. Retrospective reviews of prediagnostic CT scans, when conducted by expert radiologists aware that the patient later developed PDAC, frequently reveal lesions that were previously overlooked. To help detect these lesions earlier, we developed an automated system named ePAI (early Pancreatic cancer detection with Artificial Intelligence). It was trained on data from 1,598 patients from a single medical center. In the internal test involving 1,009 patients, ePAI achieved an area under the receiver operating characteristic curve (AUC) of 0.939-0.999, a sensitivity of 95.3%, and a specificity of 98.7% for detecting small PDAC less than 2 cm in diameter, precisely localizing PDAC as small as 2 mm. In an external test involving 7,158 patients across 6 centers, ePAI achieved an AUC of 0.918-0.945, a sensitivity of 91.5%, and a specificity of 88.0%, precisely localizing PDAC as small as 5 mm. Importantly, ePAI detected PDACs on prediagnostic CT scans obtained 3 to 36 months before clinical diagnosis that had originally been overlooked by radiologists. It successfully detected and localized PDACs in 75 of 159 patients, with a median lead time of 347 days before clinical diagnosis.
Our multi-reader study showed that ePAI significantly outperformed 30 board-certified radiologists by 50.3% (P < 0.05) in sensitivity while maintaining a comparable specificity of 95.4% in early and prediagnostic PDAC detection. These findings suggest the potential of ePAI as an assistive tool to improve early detection of pancreatic cancer.
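
For reference, the sensitivity and specificity figures above follow the standard confusion-matrix definitions. The short sketch below states them and works through the one raw count the abstract gives (75 of 159 prediagnostic cases detected); the function names are illustrative, and the abstract does not report the other underlying counts.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of actual positives that are detected: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of actual negatives that are correctly cleared: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Prediagnostic cohort from the abstract: 75 detected out of 159 patients,
# so 159 - 75 = 84 cases were missed.
prediagnostic_rate = sensitivity(true_pos=75, false_neg=159 - 75)
print(f"{100 * prediagnostic_rate:.1f}%")  # 47.2%
```

The prediagnostic detection rate of about 47.2% is much lower than the 91.5% to 95.3% sensitivities reported for the internal and external tests, which is expected given that these scans were taken 3 to 36 months before diagnosis and the lesions had been overlooked at the time.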

[164] PI-Light: Physics-Inspired Diffusion for Full-Image Relighting

Zhexin Liang,Zhaoxi Chen,Yongwei Chen,Tianyi Wei,Tengfei Wang,Xingang Pan

Main category: cs.CV

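TL;DR: This paper presents π-Light (PI-Light), a physics-inspired two-stage diffusion framework for full-image relighting; batch-aware attention, physics-guided neural rendering, physics-inspired losses, and a curated dataset together improve generalization to real scenes and physical plausibility.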

Details

Motivation: Full-image relighting faces three challenges: large-scale structured paired data is hard to collect, physical plausibility is hard to maintain, and data-driven priors limit generalization; existing methods transfer poorly from synthetic to real data. Method: Proposes the two-stage physics-inspired diffusion framework π-Light, comprising (i) batch-aware attention that improves the consistency of intrinsic predictions across images; (ii) a physics-guided neural rendering module that enforces physically plausible light transport; (iii) physics-inspired losses that steer training toward physically meaningful solutions; and (iv) a dataset of diverse objects and scenes captured under controlled lighting. Result: π-Light synthesizes specular highlights and diffuse reflections across many materials, generalizes better to real-world scenes than prior methods, supports efficient finetuning of pretrained diffusion models, and provides a benchmark for downstream evaluation. Conclusion: By combining physical priors with diffusion modeling, π-Light improves the physical plausibility and cross-domain generalization of relighting while preserving generation quality. Abstract: Full-image relighting remains a challenging problem due to the difficulty of collecting large-scale structured paired data, the difficulty of maintaining physical plausibility, and the limited generalizability imposed by data-driven priors. Existing attempts to bridge the synthetic-to-real gap for full-scene relighting remain suboptimal. To tackle these challenges, we introduce Physics-Inspired diffusion for full-image reLight ($π$-Light, or PI-Light), a two-stage framework that leverages physics-inspired diffusion models. Our design incorporates (i) batch-aware attention, which improves the consistency of intrinsic predictions across a collection of images, (ii) a physics-guided neural rendering module that enforces physically plausible light transport, (iii) physics-inspired losses that regularize training dynamics toward a physically meaningful landscape, thereby enhancing generalizability to real-world image editing, and (iv) a carefully curated dataset of diverse objects and scenes captured under controlled lighting conditions. Together, these components enable efficient finetuning of pretrained diffusion models while also providing a solid benchmark for downstream evaluation. Experiments demonstrate that $π$-Light synthesizes specular highlights and diffuse reflections across a wide variety of materials, achieving superior generalization to real-world scenes compared with prior approaches.

[165] Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Xiaoxiao Sun,Mingyang Li,Kun yuan,Min Woo Sun,Mark Endo,Shengguang Wu,Changlin Li,Yuhui Zhang,Zeyu Wang,Serena Yeung-Levy

Main category: cs.CV

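TL;DR: This paper introduces VI-Probe, a framework of controlled visual-illusion experiments showing that the response persistence of large vision-language models (VLMs) under visual change has no single cause but stems from heterogeneous mechanisms such as memory override, perception-memory competition, and visual-processing limits.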

Details

Motivation: Investigate why VLMs keep giving the same answers on visual-illusion images even when the visual content changes, and distinguish genuine perception of the change from reliance on language-driven recall. Method: Build the VI-Probe framework with illusion images under graded perturbations and matched non-illusion controls; introduce Polarity-Flip Consistency, Template Fixation Index, and a normalized illusion multiplier to quantify stability and sensitivity to visual change. Result: Experiments show that the causes of response persistence differ across models: GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, and Qwen variants suggest visual-processing limits. Conclusion: Response persistence in VLMs arises from several heterogeneous mechanisms, motivating probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Abstract: Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.

[166] One-step Latent-free Image Generation with Pixel Mean Flows

Yiyang Lu,Susie Lu,Qiao Sun,Hanhong Zhao,Zhicheng Jiang,Xianbang Wang,Tianhong Li,Zhengyang Geng,Kaiming He

Main category: cs.CV

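TL;DR: This paper proposes pixel MeanFlow (pMF), a latent-free, one-step diffusion/flow-based image generation model. By formulating the network output space and the loss space separately, it achieves strong one-step latent-free generation on ImageNet.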

Details

Motivation: To advance diffusion/flow-based generative models toward one-step sampling without a latent space, filling a key gap in this regime. Method: Proposes pMF, which places the network target on a presumed low-dimensional image manifold (x-prediction) while defining the loss via MeanFlow in the velocity space, and introduces a simple transformation between the image manifold and the average velocity field. Result: One-step latent-free generation on ImageNet reaches an FID of 2.22 at 256x256 resolution and 2.48 at 512x512 resolution. Conclusion: pMF advances the one-step, latent-free diffusion/flow generation paradigm and offers a new approach and a foundation for follow-up research. Abstract: Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose "pixel MeanFlow" (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.
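
Illustrative note on the pMF entry above: the abstract mentions a "simple transformation between the image manifold and the average velocity field" but does not state it. The following is a minimal sketch of one plausible form, not the authors' implementation. It assumes the common linear-interpolation flow convention z_t = (1 - t) * x + t * noise and an averaging interval that starts at time 0, under which the average velocity u satisfies x = z_t - t * u. All function names are hypothetical.

```python
# Hypothetical sketch: converting between an x-prediction and an average velocity.
# Assumptions (not from the abstract): linear path z_t = (1 - t) * x + t * noise,
# and an average velocity taken over the interval [0, t].

def interpolate(x, noise, t):
    """Noisy sample on the straight path between image x (t=0) and noise (t=1)."""
    return [(1 - t) * xi + t * ni for xi, ni in zip(x, noise)]

def x_to_avg_velocity(z_t, x_pred, t):
    """Average velocity over [0, t] implied by a predicted clean image."""
    return [(z - xp) / t for z, xp in zip(z_t, x_pred)]

def avg_velocity_to_x(z_t, u, t):
    """Clean image implied by an average velocity over [0, t]."""
    return [z - t * ui for z, ui in zip(z_t, u)]

# Toy 3-pixel "image" and noise vector.
x = [0.2, -0.5, 0.9]
noise = [1.0, 0.3, -0.7]
t = 0.6

z_t = interpolate(x, noise, t)
u = x_to_avg_velocity(z_t, x, t)      # with the true x, this equals noise - x
x_back = avg_velocity_to_x(z_t, u, t)  # round trip recovers x

# One-step generation under these assumptions: start from pure noise (t = 1)
# and subtract the average velocity over the whole interval.
u_full = x_to_avg_velocity(noise, x, 1.0)
one_step = avg_velocity_to_x(noise, u_full, 1.0)
```

In this sketch a network would emit `x_pred` in pixel space, and the loss would be computed on the converted quantity `u`, which is one way to read "formulate the network output space and the loss space separately".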

Table of Contents

cs.CL [Back]

[1] DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

[2] asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation

[3] UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop

[4] Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations

[5] ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference

[6] Multi-task Code LLMs: Data Mix or Model Merge?

[7] Large Language Models Naively Recover Ethnicity from Individual Records

[8] EnsembleLink: Accurate Record Linkage Without Training Data

[9] Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space

[10] From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning

[11] Scaling Embeddings Outperforms Scaling Experts in Language Models

[12] Multilingual Dysarthric Speech Assessment Using Universal Phone Recognition and Language-Specific Phonemic Contrast Modeling

[13] Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

[14] Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

[15] MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

[16] SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models

[17] MoCo: A One-Stop Shop for Model Collaboration Research

[18] CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding

[19] Qwen3-ASR Technical Report

[20] Self-Improving Pretraining: using post-trained models to pretrain better models

[21] The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation

[22] User-Centric Evidence Ranking for Attribution and Fact Verification

[23] Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

[24] SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models

[25] DimStance: Multilingual Datasets for Dimensional Stance Analysis

[26] MURAD: A Large-Scale Multi-Domain Unified Reverse Arabic Dictionary Dataset

[27] LMK > CLS: Landmark Pooling for Dense Embeddings

[28] inversedMixup: Data Augmentation via Inverting Mixed Embeddings

[29] Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes

[30] ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

[31] KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices

[32] Language Models as Artificial Learners: Investigating Crosslinguistic Influence

[33] ILRR: Inference-Time Steering Method for Masked Diffusion Language Models

[34] AdaptBPE: From General Purpose to Specialized Tokenizers

[35] Scale-Dependent Semantic Dynamics Revealed by Allan Deviation

[36] FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning

[37] Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling

[38] Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

[39] Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

[40] Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

[41] TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning

[42] Enhancing Language Models for Robust Greenwashing Detection

[43] Procedural Pretraining: Warming Up Language Models with Abstract Data

[44] CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering

[45] Temporal Guidance for Large Language Models

[46] CoFrGeNet: Continued Fraction Architectures for Language Generation

[47] Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond

[48] Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention

[49] KID: Knowledge-Injected Dual-Head Learning for Knowledge-Grounded Harmful Meme Detection

[50] Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation

[51] RAG-E: Quantifying Retriever-Generator Alignment and Failure Modes

[52] Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning

[53] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

[54] Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

[55] Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text

[56] SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching

[57] From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes

[58] OVD: On-policy Verbal Distillation

[59] Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding

[60] Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

[61] When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications

[62] Causal Autoregressive Diffusion Language Model

[63] Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models

[64] A Separable Architecture for Continuous Token Representation in Language Models

[65] On the Paradoxical Interference between Instruction-Following and Task Solving

[66] MasalBench: A Benchmark for Contextual and Cross-Cultural Understanding of Persian Proverbs in LLMs

[67] $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA

[68] VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

[69] ECO: Quantized Training without Full-Precision Master Weights

[70] A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine

[71] Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

[72] FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

[73] DynaWeb: Model-Based Reinforcement Learning of Web Agents

[74] Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

cs.CV [Back]

[75] MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading

[76] Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs

[77] Text controllable PET denoising