cs.CL

[1] DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

Nikita Gupta, Riju Chatterjee, Lukas Haas, Connie Tao, Andrew Wang, Chang Liu, Hidekazu Oiwa, Elena Gribovskaya, Jan Ackermann, John Blitzer, Sasha Goldshtein, Dipanjan Das

Main category: cs.CL

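TL;DR: DeepSearchQA is a 900-prompt benchmark that evaluates agents on multi-step information-seeking tasks across 17 fields. It targets three key capabilities: collating fragmented information, de-duplication and entity resolution, and deciding when to stop in an open-ended search space.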

Details

Motivation: Existing benchmarks mostly target single-answer retrieval or factuality evaluation and do not systematically assess complex, multi-step, long-horizon information seeking. The authors aim to fill this gap and drive the development of more robust deep-research agents.

Method: Builds the DeepSearchQA benchmark: 900 handcrafted multi-step search tasks structured as causal chains and spanning 17 fields, all grounded in the open web with objectively verifiable answers. Current state-of-the-art agent architectures are then evaluated systematically.

Result: Even the most advanced models struggle to balance high recall with high precision. Characteristic failure modes include premature stopping (under-retrieval) and hedging (flooding the answer list with low-confidence items).

Conclusion: DeepSearchQA exposes significant shortcomings in the deep information-seeking abilities of current agents and offers a diagnostic tool and clear directions for improvement.

Abstract: We introduce DeepSearchQA, a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields. Unlike traditional benchmarks that target single answer retrieval or broad-spectrum factuality, DeepSearchQA features a dataset of challenging, handcrafted tasks designed to evaluate an agent's ability to execute complex search plans to generate exhaustive answer lists. This shift in design explicitly tests three critical, yet under-evaluated capabilities: 1) systematic collation of fragmented information from disparate sources, 2) de-duplication and entity resolution to ensure precision, and 3) the ability to reason about stopping criteria within an open-ended search space. Each task is structured as a causal chain, where discovering information for one step is dependent on the successful completion of the previous one, stressing long-horizon planning and context retention. All tasks are grounded in the open web with objectively verifiable answer sets. Our comprehensive evaluation of state-of-the-art agent architectures reveals significant performance limitations: even the most advanced models struggle to balance high recall with precision. We observe distinct failure modes ranging from premature stopping (under-retrieval) to hedging behaviors, where agents cast an overly wide net of low-confidence answers to artificially boost recall. These findings highlight critical headroom in current agent designs and position DeepSearchQA as an essential diagnostic tool for driving future research toward more robust, deep-research capabilities.
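The tension between recall and precision described above can be made concrete with set-based metrics over an answer list. The sketch below is only an illustration: the abstract does not state the benchmark's actual scoring, and the `normalize` helper, the function names, and the toy answers are all assumptions.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace; a crude stand-in for entity resolution."""
    return " ".join(answer.lower().split())


def set_metrics(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Precision, recall and F1 of a predicted answer list against a gold answer set."""
    pred = {normalize(a) for a in predicted}  # de-duplicate after normalization
    ref = {normalize(a) for a in gold}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold set and two failure modes named in the abstract.
gold = ["Alpha", "Beta", "Gamma", "Delta"]
under = set_metrics(["alpha", "Beta"], gold)  # premature stopping
hedged = set_metrics(
    ["alpha", "beta", "gamma", "delta", "x1", "x2", "x3", "x4"], gold
)  # overly wide net of guesses
print(under)
print(hedged)
```

In this toy case the under-retrieving agent has perfect precision but half the recall, and the hedging agent has the reverse, so a metric that looks at only one of the two would rank them very differently.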

[2] asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation

Oleg Sedukhin, Andrey Kostin

Main category: cs.CL

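TL;DR: This paper proposes improvements to speech recognition evaluation: a string alignment algorithm that supports multi-reference labeling and long insertions, a new long-form Russian test set (DiverseSpeech-Ru), and evidence that models adapt to dataset-specific labeling in ways that distort evaluation. It also provides tools for evaluating streaming recognition and uniform model wrappers.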

Details

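Motivation: Existing speech recognition evaluation suffers from inconsistent labeling and inaccurate alignment for non-Latin languages, languages with rich word formation, and long-form speech. Models also tend to adapt to the labeling conventions of a specific dataset, which inflates metrics.

Method: Proposes a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions, and better word alignment; collects and labels the long-form Russian test set DiverseSpeech-Ru; relabels an existing Russian test set with multiple references; analyzes adaptation to labeling during fine-tuning; develops tools for streaming evaluation and for visually aligning multiple transcriptions; and provides uniform wrappers for offline and streaming ASR models.

Result: Models quickly adapt to dataset-specific labeling style, which improves WER-style metrics without a matching gain in generalization. The improved alignment makes evaluation of non-Latin and long-form speech more reliable, and DiverseSpeech-Ru provides a long-form Russian benchmark with multi-reference labels. The code is to be released publicly.

Conclusion: Speech recognition evaluation should account for labeling diversity and alignment robustness, especially for non-Latin languages and long-form speech. Multi-reference labeling and improved alignment help avoid evaluation bias, and the tools and dataset give the community a more reliable basis for evaluation.

Abstract: We propose several improvements to speech recognition evaluation. First, we propose a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions and better word alignment. This is especially useful for non-Latin languages, those with rich word formation, to label cluttered or longform speech. Secondly, we collect a novel test set DiverseSpeech-Ru of longform in-the-wild Russian speech with careful multi-reference labeling. We also perform multi-reference relabeling of a popular Russian test set and study fine-tuning dynamics on its corresponding train set. We demonstrate that the model often adapts to dataset-specific labeling, causing an illusion of metric improvement. Based on the improved word alignment, we develop tools to evaluate streaming speech recognition and to align multiple transcriptions to compare them visually. Additionally, we provide uniform wrappers for many offline and streaming speech recognition models. Our code will be made publicly available.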
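To see why multiple references matter, consider the simplest possible multi-reference scoring: compute the word error rate (WER) against each acceptable transcription and keep the lowest. This is a deliberately simplified stand-in, not the alignment algorithm proposed in the paper (which handles multi-reference spans and arbitrary-length insertions within a single alignment); the function names and example strings are assumptions.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def multi_reference_wer(references: list[str], hypothesis: str) -> float:
    """Lowest WER of the hypothesis over all acceptable reference transcriptions."""
    if not references:
        raise ValueError("at least one reference is required")
    hyp = hypothesis.split()
    return min(
        edit_distance(ref.split(), hyp) / max(len(ref.split()), 1)
        for ref in references
    )


# Two equally valid spellings of the same utterance (toy example).
refs = ["ok let us go", "okay let's go"]
print(multi_reference_wer(refs, "okay let's go"))
print(multi_reference_wer(refs[:1], "okay let's go"))
```

With only the first reference, a perfectly reasonable transcription is penalized for spelling choices; a model fine-tuned on that dataset could "improve" simply by copying its labeling style.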

[3] UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop

Muhammad Ali Shafique, Areej Mehboob, Layba Fiaz, Muhammad Usman Qadeer, Hamza Farooq

Main category: cs.CL

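TL;DR: This paper proposes a contextually ensembled translation framework that combines multiple translation systems with human validation, uses it to build UrduBench, a standardized Urdu reasoning benchmark, and systematically evaluates a range of LLMs on Urdu reasoning. The evaluation exposes bottlenecks in multi-step and symbolic reasoning and the key role of language alignment.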

Details

Motivation: Low-resource languages such as Urdu lack standardized reasoning benchmarks. Existing evaluation is limited by sensitivity to machine translation and by an emphasis on general language tasks, which makes it hard to measure a model's actual reasoning ability.

Method: Proposes a contextually ensembled translation framework that combines multiple translation systems with human-in-the-loop validation to preserve semantic, structural, and contextual integrity; translates MGSM, MATH-500, CommonSenseQA, and OpenBookQA into Urdu to form UrduBench; and evaluates reasoning-oriented and instruction-tuned LLMs under multiple prompting strategies.

Result: Multi-step and symbolic reasoning are markedly harder in Urdu; stable language alignment is a prerequisite for robust reasoning; and performance differs across datasets, difficulty levels, model architectures, and model scales.

Conclusion: The work establishes a scalable methodology for reasoning evaluation in low-resource languages and provides a reasoning benchmark and empirical analysis for Urdu. The framework also applies to other low-resource languages.

Abstract: Recent advances in large language models (LLMs) have led to strong reasoning capabilities; however, evaluating such models in low-resource languages remains challenging due to the lack of standardized benchmarks. In particular, Urdu reasoning evaluation has been limited by the sensitivity of machine translation and an emphasis on general language tasks rather than reasoning benchmarks. In this paper, we propose a contextually ensembled translation framework with human-in-the-loop validation that leverages multiple translation systems to develop Urdu reasoning benchmarks while preserving contextual and structural integrity. Using this framework, we translate widely adopted reasoning and question-answering benchmarks, including MGSM, MATH-500, CommonSenseQA, and OpenBookQA, into Urdu, collectively referred to as UrduBench, and conduct a comprehensive evaluation of both reasoning-oriented and instruction-tuned LLMs across multiple prompting strategies. Our analysis reveals performance differences across (1) four datasets, (2) five task difficulty levels, (3) diverse model architectures, (4) multiple model scaling settings, and (5) language consistency tests. We find that multi-step and symbolic reasoning tasks pose significant challenges in Urdu, and that stable language alignment is a critical prerequisite for robust reasoning. Overall, our work establishes a scalable methodology for standardized reasoning evaluation in Urdu and provides empirical insights into multilingual reasoning failures. This experimental setup is also broadly applicable to other low-resource languages. The code and datasets will be publicly released.

[4] Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations

Amit Meghanani, Thomas Hain

Main category: cs.CL

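TL;DR: This paper examines how fine-tuning front-end speech enhancement (SE) models on self-supervised learning (SSL) speech representations with a mean squared error (MSE) loss lets the objective exploit positional embeddings. It proposes position-invariant fine-tuning based on soft dynamic time warping (soft-DTW), which converges faster and improves downstream performance.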

Details

Motivation: When fine-tuning with SSL representations, an MSE loss can be minimized by exploiting positional embeddings instead of content, which biases the learned representations. Position-invariant fine-tuning methods are therefore needed.

Method: Considers two position-invariant strategies for SE fine-tuning: (1) zero-padding and (2) speed perturbation combined with a soft-DTW loss, with emphasis on the convergence and downstream-task behavior of the latter.

Result: Compared with MSE, the soft-DTW approach converges faster and achieves better downstream performance in noisy conditions.

Conclusion: Position-invariant fine-tuning is important for SSL-based speech modelling, and soft-DTW is a more robust and effective alternative to MSE as a loss.

Abstract: Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models is effective for downstream tasks in noisy conditions. SE models are commonly fine-tuned using SSL representations with mean squared error (MSE) loss between enhanced and clean speech. However, MSE is prone to exploiting positional embeddings in SSL models, allowing the objective to be minimised through positional correlations instead of content-related information. This work frames the problem as a general limitation of self-supervised representation fine-tuning and investigates it through representation-guided SE. Two strategies are considered: (1) zero-padding, previously explored in SSL pre-training but here examined in the fine-tuning setting, and (2) speed perturbations with a soft-DTW loss. Experiments show that the soft-DTW-based approach achieves faster convergence and improved downstream performance, underscoring the importance of position-invariant fine-tuning in SSL-based speech modelling.
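The difference between a frame-by-frame MSE and a soft-DTW loss can be shown on a toy pair of sequences where one is a time-shifted copy of the other. The following is a minimal pure-Python sketch of the standard soft-DTW recursion (squared Euclidean cost, soft minimum with temperature `gamma`). It is not the authors' training code; the sequences, function names, and the choice of `gamma` are assumptions.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between two sequences of feature vectors."""
    n, m = len(x), len(y)
    inf = float("inf")
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i][j] = sq_dist(x[i - 1], y[j - 1]) + soft_min(
                (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]), gamma
            )
    return r[n][m]


def mse(x, y):
    """Position-locked loss: mean squared distance between frames at equal index."""
    return sum(sq_dist(a, b) for a, b in zip(x, y)) / len(x)


# Same content, shifted by one frame (one-dimensional toy features).
clean = [[0.0], [0.0], [1.0], [2.0], [1.0], [0.0]]
shifted = [[0.0], [1.0], [2.0], [1.0], [0.0], [0.0]]
print(mse(clean, shifted))
print(soft_dtw(clean, shifted, gamma=0.01))
```

MSE penalizes the shift heavily because it compares frames at identical positions, while soft-DTW stays close to zero because it can warp one sequence onto the other. Note that soft-DTW can be slightly negative for `gamma > 0`, since the soft minimum is never larger than the hard minimum.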

[5] ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference

Ketan Thakkar, Maitreyi Chatterjee, Ramasubramanian Balasubramanian, Achyuthan Jootoo, Rajendra Ugrani

Main category: cs.CL

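TL;DR: This paper proposes ChunkWise LoRA, a dynamic, adaptive LoRA method that splits sequences into variable-length chunks based on token complexity and assigns each chunk its own low-rank configuration, reducing latency and memory while maintaining or improving task performance.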

Details

Motivation: Existing LoRA methods apply one static rank configuration to all input tokens, ignoring differences in token complexity and computational needs, which creates an efficiency bottleneck.

Method: Proposes ChunkWise LoRA: a runtime scheduler estimates token difficulty, performs adaptive chunking, and selects a per-chunk LoRA rank and scaling through a rank-ladder mechanism. A boundary-safe composition module and policy-driven KV-cache strategies preserve output consistency.

Result: On benchmarks such as Wikitext-103 and SQuAD, latency drops by up to 34% and memory by 38% relative to baseline LoRA, while BLEU, EM, and perplexity are maintained or improved.

Conclusion: ChunkWise LoRA is an efficient parameter-efficient adaptation approach that is compatible with existing transformer architectures and inference frameworks and can be deployed directly.

Abstract: Recent advances in low-rank adaptation (LoRA) have enabled efficient fine-tuning of large language models (LLMs) with minimal additional parameters. However, existing LoRA methods apply static rank configurations uniformly across all input tokens, ignoring variation in token complexity and computational requirements. In this work, we propose ChunkWise LoRA, a dynamic and adaptive approach that partitions sequences into variable-length chunks based on token complexity and assigns each chunk a tailored low-rank configuration. Our system introduces a runtime scheduler that estimates token difficulty, performs adaptive chunking, and selects per-chunk LoRA rank and scaling using a rank-ladder mechanism. To preserve output consistency, we further introduce a boundary-safe composition module and integrate policy-driven KV-cache strategies. Experiments on benchmark datasets such as Wikitext-103 and SQuAD demonstrate that ChunkWise LoRA achieves up to 34% lower latency and 38% memory reduction compared to baseline LoRA, while maintaining or improving task performance metrics like BLEU, EM, and perplexity. The proposed framework remains fully compatible with existing transformer architectures and inference frameworks, providing a practical solution for real-world deployment of parameter-efficient LLMs.

[6] Multi-task Code LLMs: Data Mix or Model Merge?

Mingzhi Zhu, Boris Sobolev, Rahul Krishna, Raju Pavuluri, Stacy Patterson, Michele Merler

Main category: cs.CL

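TL;DR: This paper compares data mixing and model merging as ways to build small multi-task code LLMs. Model merging works better at the larger scale and data mixing at the smaller scale. The paper also proposes a weight analysis technique for understanding how tasks affect parameters.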

Details

Motivation: As small, specialized code LLMs are increasingly deployed in agentic frameworks, there is a need for efficient, low-cost multi-task learning strategies that respect resource constraints.

Method: Applies data mixing and model merging to two model families (Qwen Coder and DeepSeek Coder) at two scales (2B and 7B parameters) for code generation and code summarization, evaluates on HumanEval, MBPP, and CodeXGlue, and introduces a weight analysis technique to study how tasks affect model parameters.

Result: Model merging gives the best overall performance at the 7B scale, retaining 96% of specialized-model performance on code generation while maintaining summarization ability. The best merged Qwen Coder 2.5 7B model reaches 92.7% Pass@1 on HumanEval, above the 90.9% of its task-specific counterpart. At the 2B scale, data mixing is preferable.

Conclusion: A careful choice between data mixing and model merging can combine task-specific capabilities without significant performance loss, which suits resource-constrained deployment.

Abstract: Recent research advocates deploying smaller, specialized code LLMs in agentic frameworks alongside frontier models, sparking interest in efficient strategies for multi-task learning that balance performance, constraints, and costs. We compare two approaches for creating small, multi-task code LLMs: data mixing versus model merging. We conduct extensive experiments across two model families (Qwen Coder and DeepSeek Coder) at two scales (2B and 7B parameters), fine-tuning them for code generation and code summarization tasks. Our evaluation on HumanEval, MBPP, and CodeXGlue benchmarks reveals that model merging achieves the best overall performance at larger scale across model families, retaining 96% of specialized model performance on code generation tasks while maintaining summarization capabilities. Notably, merged models can even surpass individually fine-tuned models, with our best configuration of Qwen Coder 2.5 7B model achieving 92.7% Pass@1 on HumanEval compared to 90.9% for its task-specific fine-tuned equivalent. At a smaller scale we find instead data mixing to be a preferred strategy. We further introduce a weight analysis technique to understand how different tasks affect model parameters and their implications for merging strategies. The results suggest that careful merging and mixing strategies can effectively combine task-specific capabilities without significant performance degradation, making them ideal for resource-constrained deployment scenarios.
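For readers unfamiliar with model merging, the simplest variant is a linear interpolation of the weights of two fine-tuned checkpoints that share an architecture. The abstract does not say which merging algorithm the authors use, so the sketch below swaps in plain weight averaging purely as an illustration; the function name, the flattened toy "state dicts", and the parameter names are all hypothetical.

```python
def merge_linear(state_a, state_b, weight_a=0.5):
    """Interpolate two checkpoints parameter by parameter.

    Each state is a dict mapping a parameter name to a flat list of floats
    (a toy stand-in for real tensors). weight_a is the share given to state_a.
    """
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must have identical parameter names")
    merged = {}
    for name in state_a:
        a, b = state_a[name], state_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [weight_a * p + (1.0 - weight_a) * q for p, q in zip(a, b)]
    return merged


# Hypothetical task-specific checkpoints fine-tuned from the same base model.
codegen = {"layer0.weight": [1.0, 3.0], "layer0.bias": [0.0]}
summarize = {"layer0.weight": [3.0, 5.0], "layer0.bias": [2.0]}
print(merge_linear(codegen, summarize))
```

Data mixing, by contrast, trains a single model on the union of both task datasets, so the combination happens in the training data and not in weight space.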

[7] Large Language Models Naively Recover Ethnicity from Individual Records

Noah Dasanaike

Main category: cs.CL

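TL;DR: This paper shows that large language models (LLMs) can infer ethnicity from names alone with accuracy exceeding the traditional BISG method, without additional training data and in settings outside the United States. The approach is validated on several real-world datasets and reduces the income bias of BISG.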

Details

Motivation: Traditional ethnicity inference based on surname geocoding (such as BISG) is geographically limited (mainly applicable to the United States), uses coarse categories, and suffers from income bias. A more general, accurate, fair, and scalable alternative is needed.

Method: Uses several proprietary and open-source LLMs (such as GPT-4o, Gemini 3 Flash, DeepSeek v3.2, and GLM-4.7) to classify names directly into ethnicity or identity categories (race, religious sect, caste); evaluates prompting choices such as extended reasoning and added metadata like party registration; validates on voter registrations, lists of legislators, and land records from multiple countries; and distills LLM labels into small transformer models for low-cost local deployment.

Result: On Florida and North Carolina voter data, the LLM approach reaches up to 84.7% accuracy, against 68.2% for BISG, and 86.7% with metadata. It reaches 64.3% on Lebanese religious sect, 99.2% on Indian MPs from reserved constituencies, and 74.0% on Indian land records with caste classification. On full-count voter rolls from six countries (India, Uganda, Nepal, Armenia, Chile, and Costa Rica), it recovers known population distributions. Small fine-tuned models exceed BISG accuracy and can be deployed locally at no cost.

Conclusion: LLMs have a strong built-in ability to map names to identity categories. The approach needs no training data, adapts across cultural contexts, reduces bias, and scales well, making it a candidate replacement for traditional statistical methods in empirical social science.

Abstract: I demonstrate that large language models can infer ethnicity from names with accuracy exceeding that of Bayesian Improved Surname Geocoding (BISG) without additional training data, enabling inference outside the United States and to contextually appropriate classification categories. Using stratified samples from Florida and North Carolina voter files with self-reported race, LLM-based classification achieves up to 84.7% accuracy, outperforming BISG (68.2%) on balanced samples. I test six models including Gemini 3 Flash, GPT-4o, and open-source alternatives such as DeepSeek v3.2 and GLM-4.7. Enabling extended reasoning can improve accuracy by 1-3 percentage points, though effects vary across contexts; including metadata such as party registration reaches 86.7%. LLM classification also reduces the income bias inherent in BISG, where minorities in wealthier neighborhoods are systematically misclassified as White. I further validate using Lebanese voter registration with religious sect (64.3% accuracy), Indian MPs from reserved constituencies (99.2%), and Indian land records with caste classification (74.0%). Aggregate validation across India, Uganda, Nepal, Armenia, Chile, and Costa Rica using original full-count voter rolls demonstrates that the method recovers known population distributions where naming conventions are distinctive. For large-scale applications, small transformer models fine-tuned on LLM labels exceed BISG accuracy while enabling local deployment at no cost.

[8] EnsembleLink: Accurate Record Linkage Without Training Data

Noah Dasanaike

Main category: cs.CL

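TL;DR: This paper proposes EnsembleLink, a method that uses pre-trained language models to achieve accurate record linkage without labeled data. It applies to many entity types and runs quickly on local hardware.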

Details

Motivation: Record linkage is essential to empirical social science, but existing methods either achieve low accuracy or need substantial labeled data, and researchers often ignore the uncertainty that linkage errors introduce into downstream analyses.

Method: Proposes EnsembleLink, which uses pre-trained language models to capture semantic relationships (such as a neighborhood belonging to a city, or alternative names for a political party). It needs no labeled data, runs on local open-source models, and makes no external API calls.

Result: On benchmarks covering city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods that require extensive labeling, and typical tasks finish in minutes.

Conclusion: EnsembleLink gives social scientists an accurate, label-free, reproducible, and computationally efficient approach to record linkage, which helps make downstream analyses more reliable.

Abstract: Record linkage, the process of matching records that refer to the same entity across datasets, is essential to empirical social science but remains methodologically underdeveloped. Researchers treat it as a preprocessing step, applying ad hoc rules without quantifying the uncertainty that linkage errors introduce into downstream analyses. Existing methods either achieve low accuracy or require substantial labeled training data. I present EnsembleLink, a method that achieves high accuracy without any training labels. EnsembleLink leverages pre-trained language models that have learned semantic relationships (e.g., that "South Ozone Park" is a neighborhood in "New York City" or that "Lutte ouvriere" refers to the Trotskyist "Workers' Struggle" party) from large text corpora. On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling. The method runs locally on open-source models, requiring no external API calls, and completes typical linkage tasks in minutes.

[9] Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space

Tobias Materzok

Main category: cs.CL

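TL;DR: This paper proposes Output-Space Search (OS-Search), which recasts LLM generation as a search for a target point z* in a 3D output space Z defined by a frozen encoder. A retrieval-grounded policy trained with reinforcement learning generates outputs that land near z*, which enables parallel sweeps and black-box optimization and improves text diversity and objective-driven code generation.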

Details

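Motivation: Conventional LLM generation relies on path-dependent search over tokens or programs, which is inefficient and hard to optimize globally. The paper aims for a generation paradigm that is more efficient, parallelizable, and friendly to black-box optimization.

Method: Defines a 3D output space Z with a frozen encoder. An outer loop selects a target z*; an inner retrieval-grounded policy, trained with sequence-level RL, generates outputs whose coordinates land near z* under standard autoregressive decoding. This supports parallel sweeps and black-box optimization such as Bayesian optimization.

Result: On stories, sweeping Z yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z improves an objective withheld from the controller under matched inference budgets while preserving code validity.

Conclusion: OS-Search offers a new view of LLM generation, moving from sequence generation to search in an output space. It combines controllability, diversity, and optimization, and applies to both text and code generation.

Abstract: We introduce Output-Space Search (OS-Search), which turns LLM generation into endpoint search. An outer loop selects a target z* in a frozen encoder-defined 3D output space Z, and a retrieval-grounded policy trained with sequence-level RL generates outputs whose coordinates land near z* under standard autoregressive decoding. This enables parallel sweeps and black-box optimization in Z without path-dependent token/program search. On stories, sweeping Z (text) yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z (code) improves an objective withheld from the controller under matched inference budgets while preserving validity.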

[10] From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning

Xiulin Yang, Heidi Getz, Ethan Gotlieb Wilcox

Main category: cs.CL

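TL;DR: Using cross-linguistic corpus analysis and neural language modeling experiments, this paper tests three statistical properties of function words (high frequency, reliable association with syntactic structure, and alignment with phrase boundaries) as drivers of hierarchical structure learning, and shows that models rely on function words differently under different learning conditions.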

Details

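Motivation: Investigates which statistical conditions support learning hierarchical structure from linear input, in particular the role that the distributional properties of function words play in language acquisition.

Method: Uses cross-linguistic corpus analysis (186 languages) to verify the three properties of function words; combines counterfactual language modeling with ablation experiments to assess their effect on learnability for neural learners; and uses probing and further ablations to examine differences in internal mechanisms.

Result: High frequency, structural association, and boundary alignment are present across the studied languages; the first two contribute more strongly to acquisition; and similar performance can arise from different internal mechanisms.

Conclusion: The specific statistical distribution of function words is a key condition supporting hierarchical structure learning, and how neural models use function words depends strongly on the learning conditions.

Abstract: What statistical conditions support learning hierarchical structure from linear input? In this paper, we address this question by focusing on the statistical distribution of function words. Function words have long been argued to play a crucial role in language acquisition due to their distinctive distributional properties, including high frequency, reliable association with syntactic structure, and alignment with phrase boundaries. We use cross-linguistic corpus analysis to first establish that all three properties are present across 186 studied languages. Next, we use a combination of counterfactual language modeling and ablation experiments to show that language variants preserving all three properties are more easily acquired by neural learners, with frequency and structural association contributing more strongly than boundary alignment. Follow-up probing and ablation analyses further reveal that different learning conditions lead to systematically different reliance on function words, indicating that similar performance can arise from distinct internal mechanisms.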

[11] Scaling Embeddings Outperforms Scaling Experts in Language Models

Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, Xunliang Cai

Main category: cs.CL

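TL;DR: This paper proposes scaling embeddings as a way to scale sparsity. In specific regimes it achieves a better Pareto frontier than conventional expert scaling, and with system optimizations and speculative decoding it yields real inference speedups. Based on this idea, the authors build LongCat-Flash-Lite, a 68.5B-parameter model that performs strongly on agentic and coding tasks.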

Details

Motivation: MoE architectures in large language models face diminishing returns and system-level bottlenecks, so new dimensions for scaling sparsity are needed.

Method: Systematically analyzes the trade-off between embedding scaling and expert scaling and identifies the regimes where embedding scaling wins; studies architectural factors such as parameter budgeting and model width and depth; and combines tailored system optimizations with speculative decoding to improve inference efficiency.

Result: LongCat-Flash-Lite (68.5B parameters, about 3B activated) surpasses parameter-equivalent MoE baselines and is highly competitive with existing models of comparable scale on agentic and coding tasks.

Conclusion: Embedding scaling is an effective path for scaling sparsity that is orthogonal to expert scaling, and with suitable architecture design and system co-optimization it translates into practical inference gains.

Abstract: While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.

[12] Multilingual Dysarthric Speech Assessment Using Universal Phone Recognition and Language-Specific Phonemic Contrast Modeling

Eunjung Yeo, Julie M. Liss, Visar Berisha, David R. Mortensen

Main category: cs.CL

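TL;DR: This paper proposes a multilingual dysarthric speech assessment framework that combines universal phone recognition with language-specific phoneme interpretation. Using contrastive phonological feature distances and sequence alignment, it produces three metrics (PER, PFER, PhonCov) and validates their clinical relevance in four languages.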

Details

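Motivation: Dysarthria associated with neurological disorders is increasingly prevalent, which calls for automated intelligibility assessment that works across languages. Most existing approaches are limited to one language or ignore language-specific factors.

Method: Builds a multilingual phoneme-production assessment framework that integrates universal phone recognition with language-specific phoneme interpretation, using contrastive phonological feature distances for phone-to-phoneme mapping together with sequence alignment. It yields three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and an alignment-free phoneme coverage measure (PhonCov).

Result: On English, Spanish, Italian, and Tamil data, PER benefits from combining mapping and alignment, PFER from alignment alone, and PhonCov from mapping alone. The framework captures patterns of degradation consistent with clinical observations of dysarthric speech.

Conclusion: The framework supports cross-lingual dysarthria assessment while accounting for both shared and language-specific properties. The three metrics are complementary and show potential for clinical use.

Abstract: The growing prevalence of neurological disorders associated with dysarthria motivates the need for automated intelligibility assessment methods that are applicable across languages. However, most existing approaches are either limited to a single language or fail to capture language-specific factors shaping intelligibility. We present a multilingual phoneme-production assessment framework that integrates universal phone recognition with language-specific phoneme interpretation using contrastive phonological feature distances for phone-to-phoneme mapping and sequence alignment. The framework yields three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and a newly proposed alignment-free measure, phoneme coverage (PhonCov). Analyses on English, Spanish, Italian, and Tamil show that PER benefits from the combination of mapping and alignment, PFER from alignment alone, and PhonCov from mapping. Further analyses demonstrate that the proposed framework captures clinically meaningful patterns of intelligibility degradation consistent with established observations of dysarthric speech.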

[13] Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

Zhaoyi Li, Jiatong Li, Gangwei Jiang, Linqi Song, Defu Lian, Ying Wei

Main category: cs.CL

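TL;DR: This paper finds that errors in chain-of-thought reasoning concentrate in a few critical error types driven by specific attention heads (ep heads). It proposes a lightweight test-time correction that dynamically identifies and deactivates these heads, which consistently improves reasoning hop generalization.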

Details

Motivation: Chain-of-thought performance drops sharply when the number of reasoning steps exceeds the training distribution, and the internal mechanisms behind this failure are poorly understood.

Method: Systematically analyzes the error distribution on tasks from multiple domains, identifies the attention heads responsible for the errors (ep heads), and proposes a lightweight intervention that dynamically identifies and deactivates these heads at test time.

Result: The method consistently improves reasoning hop generalization across multiple tasks and LLMs.

Conclusion: Chain-of-thought failures stem from a small number of attention heads that amplify incorrect reasoning trajectories, and test-time correction can mitigate the problem.

Abstract: Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.

[14] Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

Christopher Adrian Kusuma, Muhammad Reza Qorib, Hwee Tou Ng

Main category: cs.CL

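TL;DR: This paper proposes a more robust benchmark dataset for evaluating LLM honesty and, using the open model Pythia and its public pretraining data, a new method that helps an LLM answer "I don't know" when it lacks the relevant knowledge, reducing hallucination.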

Details

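Motivation: LLMs often produce factual errors (hallucinations) because they are unaware of their own knowledge boundary. Many methods aim to improve honesty, but their evaluations lack robustness because they ignore what the model actually absorbed during pretraining.

Method: Uses Pythia, an open model with publicly available pretraining data, to build a more robust honesty benchmark, and proposes a new method that harnesses the pretraining data to make the LLM more willing to admit what it does not know.

Result: Builds a Pythia-based honesty benchmark in which knowledge is traceable to the pretraining data, and reports that the proposed method increases the tendency to answer "I don't know" on unknown questions and lowers the hallucination rate.

Conclusion: Modeling and evaluating honesty with reference to pretraining knowledge is both feasible and necessary, and the work offers a more transparent and reproducible benchmark and method for research on LLM trustworthiness.

Abstract: Large language models (LLMs) are highly capable of answering questions, but they are often unaware of their own knowledge boundary, i.e., knowing what they know and what they don't know. As a result, they can generate factually incorrect responses on topics they do not have enough knowledge of, commonly known as hallucination. Rather than hallucinating, a language model should be more honest and respond with "I don't know" when it does not have enough knowledge about a topic. Many methods have been proposed to improve LLM honesty, but their evaluations lack robustness, as they do not take into account the knowledge that the LLM has ingested during its pretraining. In this paper, we propose a more robust evaluation benchmark dataset for LLM honesty by utilizing Pythia, a truly open LLM with publicly available pretraining data. In addition, we also propose a novel method for harnessing the pretraining data to build a more honest LLM.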

[15] MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

Tianyi Xu, Kosei Uemura, Alfred Malengo Kondoro, Tadesse Destaw Belay, Catherine Nana Nyaah Essuman, Ifeoma Okoh, Ganiyat Afolabi, Ayodele Awokoya, David Ifeoluwa Adelani

Main category: cs.CL

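TL;DR: This paper introduces MGSM-Pro, a dataset that provides five instantiations of each MGSM question with different names, digits, and irrelevant context to test the robustness of multilingual math reasoning. Low-resource languages and some proprietary models (such as Gemini 2.5 Flash and GPT-4.1) are sensitive to digit changes, while Claude 4.0 Sonnet and the open models GPT-OSS 120B and DeepSeek V3 are more robust. The authors recommend evaluating with at least five digit-varying instantiations.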

Details

Motivation: Multilingual math reasoning benchmarks lag behind English ones in difficulty and recency. GSM-Symbolic showed high variance when models are evaluated on different instantiations of the same question, but only in English, so a more challenging and robust multilingual benchmark is needed.

Method: Starting from MGSM, applies the GSM-Symbolic approach to create five instantiations per question that vary names, digits, and irrelevant context, forming MGSM-Pro. Several proprietary and open models are evaluated across nine languages, with analysis of how stable their performance is across digit instantiations.

Result: Many low-resource languages show large performance drops on digit-varied instantiations. Gemini 2.5 Flash and GPT-4.1 are less robust to digit instantiation, Claude 4.0 Sonnet is more robust, and the open models GPT-OSS 120B and DeepSeek V3 show stronger robustness.

Conclusion: Testing with a single fixed set of digits can bias evaluation. Each problem should be evaluated with at least five digit-varying instantiations to reflect math reasoning ability more realistically, and MGSM-Pro provides a benchmark and practical guidance for doing so.

Abstract: Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset using the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that some proprietary models, notably Gemini 2.5 Flash and GPT-4.1, are less robust to digit instantiation, whereas Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.

Alok Abhishek, Tushar Bandopadhyay, Lisa Erickson

Main category: cs.CL

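TL;DR: This paper proposes SHARP, a framework for multidimensional, distribution-aware evaluation of social harm that emphasizes tail risk and cross-dimensional interactions. It shows that mainstream LLMs with similar average risk can differ widely in tail exposure, and calls for evaluation that goes beyond scalar averages.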

Details

Motivation: Existing benchmarks reduce complex social risk to mean-centered scalar scores, which hides distributional structure, cross-dimensional interactions, and worst-case behavior, and makes it hard to address severe failures in high-stakes settings.

Method: Proposes the SHARP framework: social harm is modeled as a multivariate random variable and explicitly decomposed into four dimensions (bias, fairness, ethics, and epistemic reliability); a union-of-failures aggregation is reparameterized as additive cumulative log-risk; and risk-sensitive statistics such as CVaR95 characterize tail behavior.

Result: Evaluating eleven frontier LLMs on 901 socially sensitive prompts shows that models with similar mean risk can differ by more than a factor of two in tail exposure. Tail severity by dimension is strongest for bias, intermediate for epistemic and fairness risks, and lowest for ethics, and failure structures vary considerably across models.

Conclusion: Responsible evaluation and governance of LLMs must move to multidimensional, tail-sensitive risk profiles instead of relying on scalar averages.

Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where rare but severe failures can result in irreversible harm. However, prevailing evaluation benchmarks often reduce complex social risk to mean-centered scalar scores, thereby obscuring distributional structure, cross-dimensional interactions, and worst-case behavior. This paper introduces Social Harm Analysis via Risk Profiles (SHARP), a framework for multidimensional, distribution-aware evaluation of social harm. SHARP models harm as a multivariate random variable and integrates explicit decomposition into bias, fairness, ethics, and epistemic reliability with a union-of-failures aggregation reparameterized as additive cumulative log-risk. The framework further employs risk-sensitive distributional statistics, with Conditional Value at Risk (CVaR95) as a primary metric, to characterize worst-case model behavior. Application of SHARP to eleven frontier LLMs, evaluated on a fixed corpus of n=901 socially sensitive prompts, reveals that models with similar average risk can exhibit more than twofold differences in tail exposure and volatility. Across models, dimension-wise marginal tail behavior varies systematically across harm dimensions, with bias exhibiting the strongest tail severities, epistemic and fairness risks occupying intermediate regimes, and ethical misalignment consistently lower; together, these patterns reveal heterogeneous, model-dependent failure structures that scalar benchmarks conflate. These findings indicate that responsible evaluation and governance of LLMs require moving beyond scalar averages toward multidimensional, tail-sensitive risk profiling.
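Two ideas in the abstract above lend themselves to a short worked example: a union-of-failures aggregation written as an additive log-risk, and CVaR95 as a tail statistic. The sketch assumes independent per-dimension failure probabilities and a simple empirical CVaR estimator (the mean of the worst 5% of scores); neither assumption is confirmed by the abstract, and the function names and toy score distributions are hypothetical.

```python
import math


def union_of_failures(probs):
    """Probability that at least one dimension fails, assuming independence.

    Returns (union_probability, cumulative_log_risk), where the log-risk
    -sum(log(1 - p)) is additive across dimensions. Each p must be in [0, 1).
    """
    log_risk = -sum(math.log1p(-p) for p in probs)
    return 1.0 - math.exp(-log_risk), log_risk


def cvar(scores, alpha=0.95):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of risk scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores, reverse=True)  # highest risk first
    # round() guards against floating-point noise such as 5.000000000000004
    k = max(1, math.ceil(round((1 - alpha) * len(ordered), 9)))
    return sum(ordered[:k]) / k


# Hypothetical per-dimension failure probabilities for one response.
union, log_risk = union_of_failures([0.1, 0.2])

# Two toy models with the same mean risk but very different tails.
model_a = [0.2] * 100
model_b = [0.16] * 95 + [0.96] * 5
for name, scores in (("A", model_a), ("B", model_b)):
    print(name, sum(scores) / len(scores), cvar(scores))
```

Both toy models average a risk of 0.2, yet the tail statistic separates them clearly, which is the kind of difference a mean-centered scalar score would hide.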

[17] MoCo: A One-Stop Shop for Model Collaboration Research

Shangbin Feng, Yuyang Bai, Ziyuan Yang, Yike Wang, Zhaoxuan Tan, Jiajie Yan, Zhenyu Lei, Wenxuan Ding, Weijia Shi, Haojin Wang, Zhenting Qi, Yuru Jiang, Heng Wang, Chengsong Huang, Yu Fei, Jihan Yao, Yilun Du, Luke Zettlemoyer, Yejin Choi, Yulia Tsvetkov

Main category: cs.CL

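TL;DR: This paper introduces MoCo, a Python library for executing, benchmarking, and comparing model collaboration algorithms at scale. It includes 26 collaboration methods and 25 evaluation datasets, and experiments show that most collaboration strategies outperform single language models.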

Details

Motivation: Research on model collaboration is scattered and lacks systematic comparison, so a unified framework is needed to advance the field.

Method: Builds the open-source MoCo library, which integrates 26 methods spanning different levels of cross-model information exchange (routing, text, logit, and model parameters) and 25 diverse evaluation datasets, and lets users bring their own data.

Result: Most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, and the most effective methods improve by up to 25.8%. The paper also analyzes how collaboration strategies scale, their training and inference efficiency, and their ability to solve problems where single models struggle.

Conclusion: MoCo offers a standardized platform for model collaboration research and may help advance open, modular, decentralized, and collaborative AI.

Abstract: Advancing beyond single monolithic language models (LMs), recent research increasingly recognizes the importance of model collaboration, where multiple LMs collaborate, compose, and complement each other. Existing research on this topic has mostly been disparate and disconnected, from different research communities, and lacks rigorous comparison. To consolidate existing research and establish model collaboration as a school of thought, we present MoCo: a one-stop Python library for executing, benchmarking, and comparing model collaboration algorithms at scale. MoCo features 26 model collaboration methods, spanning diverse levels of cross-model information exchange such as routing, text, logit, and model parameters. MoCo integrates 25 evaluation datasets spanning reasoning, QA, code, safety, and more, while users can flexibly bring their own data. Extensive experiments with MoCo demonstrate that most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. We further analyze the scaling of model collaboration strategies, the training/inference efficiency of diverse methods, highlight that the collaborative system solves problems where single LMs struggle, and discuss future work in model collaboration, all made possible by MoCo. We envision MoCo as a valuable toolkit to facilitate and turbocharge the quest for an open, modular, decentralized, and collaborative AI future.

[18] CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding

Jiahao Huo,Yu Huang,Yibo Yan,Ye Pan,Yi Cao,Mingdong Ou,Philip S. Yu,Xuming Hu

Main category: cs.CL

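TL;DR: This paper proposes CausalEmbed, which builds multi-vector embeddings through auto-regressive generation. It cuts the number of visual tokens by 30-155x while maintaining strong performance, making visual document retrieval more practical and scalable.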

Details

Motivation: Existing MLLMs perform well on visual document retrieval, but representing a single page with thousands of visual tokens incurs a large storage overhead that limits practical use. Method: The paper proposes CausalEmbed, an auto-regressive generation approach, and introduces an iterative margin loss during contrastive training so that the model learns compact, well-structured embeddings. Result: Efficient VDR is achieved with only dozens of visual tokens, a 30-155x reduction in token count, with highly competitive performance across multiple backbones and benchmarks; theoretical analysis and experiments indicate advantages in training efficiency and test-time scalability. Conclusion: CausalEmbed offers a flexible test-time scaling strategy and moves multimodal document retrieval toward a generative paradigm. Abstract: Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval.

[19] Qwen3-ASR Technical Report

Xian Shi,Xiong Wang,Zhifang Guo,Yongqi Wang,Pei Zhang,Xinyu Zhang,Zishan Guo,Hongkun Hao,Yu Xi,Baosong Yang,Jin Xu,Jingren Zhou,Junyang Lin

Main category: cs.CL

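TL;DR: This report introduces the Qwen3-ASR family: two end-to-end ASR models supporting 52 languages and dialects (1.7B and 0.6B) and an LLM-based non-autoregressive forced alignment model (0.6B). All are reported to reach SOTA or strong performance in real-world scenarios, and all are open-sourced.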

Details

Motivation: Open-source ASR models can score similarly on public benchmarks yet differ markedly in real-world quality; the work also aims to improve the accuracy, efficiency, and versatility of multilingual speech recognition and forced alignment. Method: Building on the Qwen3-Omni foundation model, the authors develop two all-in-one ASR models (Qwen3-ASR-1.7B and Qwen3-ASR-0.6B) and a non-autoregressive forced alignment model (Qwen3-ForcedAligner-0.6B), train them on large-scale speech data, and run a comprehensive internal evaluation alongside public benchmarks. Result: Qwen3-ASR-1.7B achieves SOTA among open-source ASR models and is competitive with the strongest proprietary APIs; Qwen3-ASR-0.6B reaches an average TTFT as low as 92 ms and transcribes 2000 seconds of speech in 1 second at a concurrency of 128; Qwen3-ForcedAligner-0.6B, evaluated on forced alignment in 11 languages, outperforms the three strongest existing forced alignment models in timestamp accuracy and has advantages in efficiency. Conclusion: The Qwen3-ASR models combine high performance, efficiency, and versatility for multilingual ASR and forced alignment, and are all released under the Apache 2.0 license to support the speech understanding community. Abstract: In this report, we introduce Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both of them leverage large-scale speech training data and the strong audio understanding ability of their foundation model Qwen3-Omni. We conduct comprehensive internal evaluation besides the open-sourced benchmarks as ASR models might differ little on open-sourced benchmark scores but exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-sourced ASR models and is competitive with the strongest proprietary APIs while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B can achieve an average TTFT as low as 92ms and transcribe 2000 seconds speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM based NAR timestamp predictor that is able to align text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest force alignment models and takes more advantages in efficiency and versatility. To further accelerate the community research of ASR and audio understanding, we release these models under the Apache 2.0 license.

[20] Self-Improving Pretraining: using post-trained models to pretrain better models

Ellen Xiaoqing Tan,Shehzaad Dhuliawala,Jing Xu,Ping Yu,Sainbayar Sukhbaatar,Jason Weston,Olga Golovneva

Main category: cs.CL

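TL;DR: This paper proposes bringing reinforcement learning into the pretraining stage. Documents are streamed, and several candidate continuations (model rollouts, the original suffix, and a rewritten suffix) are scored by a strong judge model for quality, safety, and factuality, improving model reliability from the ground up. Experiments show clear gains over standard pretraining in factuality, safety, and generation quality.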

Details

Motivation: Safety, factuality, and quality in LLMs are currently secured mainly through expensive and complex post-training pipelines, which cannot guarantee that harmful or hallucinated patterns learned during pretraining are corrected; high-quality behavior should therefore be instilled during pretraining, before such problems become embedded. Method: The paper proposes a pretraining method that streams documents and, at each step, considers several candidates for the next K tokens (model rollouts, the original suffix, and a rewritten suffix); a strong post-trained judge scores them for quality, safety, and factuality, and RL optimizes the generation policy. Early training relies on the original and rewritten suffixes, while later training rewards high-quality rollouts. Result: Relative to standard pretraining, the method improves factuality by 36.2% and safety by 18.5%, and improves the overall generation quality win rate by up to 86.3%. Conclusion: Integrating RL and multi-candidate judging into pretraining builds safer, more factual, and higher-quality LLMs from the ground up. Abstract: Ensuring safety, factuality and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee the correction of patterns learned during pretraining. Therefore, addressing these issues during pretraining is crucial, as it shapes a model's core behaviors and prevents unsafe or hallucinated outputs from becoming deeply embedded. To tackle this issue, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations -- including model rollouts, the original suffix, and a rewritten suffix -- for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher quality, safer, and more factual models from the ground up. In experiments, our method gives 36.2% and 18.5% relative improvements over standard pretraining in terms of factuality and safety, and up to 86.3% win rate improvements in overall generation quality.

[21] The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation

Devanshu Sahoo,Manish Prasad,Vasudev Majhi,Arjun Neekhra,Yash Sinha,Murari Mandal,Vinay Chamola,Dhruv Kumar

Main category: cs.CL

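TL;DR: This paper exposes a "Compliance Paradox" in LLM-based automated grading of programming assignments: models decouple from the code's logic to satisfy hidden directives, with failure rates above 95% for some models. The authors propose the SPACI and AST-ASIP attack frameworks, which inject adversarial directives into syntactically inert regions, build a three-part evaluation framework, and call for a shift toward domain-specific alignment for adjudicative robustness.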

Details

Motivation: Educational assessment currently assumes that an LLM's instruction-following capability translates into objective adjudication, an assumption that is unverified and potentially risky. Method: The paper proposes the Semantic-Preserving Adversarial Code Injection (SPACI) framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP), which exploit the syntax-semantics gap by embedding adversarial directives in trivia nodes of the abstract syntax tree, and evaluates 9 SOTA models on 25,000 submissions in Python, C, C++, and Java. Result: High-capacity open-weights models such as DeepSeek-V3 fail in more than 95% of cases, systematically prioritizing hidden formatting constraints over code correctness; three metrics (Decoupling Probability, Score Divergence, and Pedagogical Severity) quantify the resulting "False Certification". Conclusion: Current RLHF-based alignment creates a "Trojan"-like vulnerability in automated grading, and a shift toward evidence-first, domain-specific Adjudicative Robustness is needed. Abstract: The rapid integration of Large Language Models (LLMs) into educational assessment rests on the unverified assumption that instruction following capability translates directly to objective adjudication. We demonstrate that this assumption is fundamentally flawed. Instead of evaluating code quality, models frequently decouple from the submission's logic to satisfy hidden directives, a systemic vulnerability we term the Compliance Paradox, where models fine-tuned for extreme helpfulness are vulnerable to adversarial manipulation. To expose this, we introduce the Semantic-Preserving Adversarial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP). These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree. Through a large-scale evaluation of 9 SOTA models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (>95%) in high-capacity open-weights models like DeepSeek-V3, which systematically prioritize hidden formatting constraints over code correctness. We quantify this failure using our novel tripartite framework measuring Decoupling Probability, Score Divergence, and Pedagogical Severity to demonstrate the widespread "False Certification" of functionally broken code.
Our findings suggest that current alignment paradigms create a "Trojan" vulnerability in automated grading, necessitating a shift from standard RLHF toward domain-specific Adjudicative Robustness, where models are conditioned to prioritize evidence over instruction compliance. We release our complete dataset and injection framework to facilitate further research on the topic.

[22] User-Centric Evidence Ranking for Attribution and Fact Verification

Guy Alt,Eran Hirsch,Serwar Basch,Ido Dagan,Oren Glickman

Main category: cs.CL

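TL;DR: This paper proposes Evidence Ranking, a new task that presents sufficient, non-redundant evidence first in order to reduce user reading effort and make fact verification more efficient. It also introduces a new evaluation framework and benchmark; experiments show that incremental ranking strategies and LLM-based methods perform better.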

Details

Motivation: Automated systems and LLMs used for fact verification often present either insufficient or overly redundant evidence, making verification inefficient and error-prone; a new formulation is needed that balances information sufficiency with user reading effort. Method: The paper defines the evidence ranking task and compares one-shot ranking with incremental ranking; it builds a new evaluation framework inspired by information retrieval metrics and a unified benchmark, and runs both model experiments and a controlled user study. Result: Incremental ranking better captures complementary evidence; LLM-based methods outperform shallower baselines; compared with conventional evidence selection, evidence ranking reduces reading effort and improves verification. Conclusion: Evidence ranking is a foundational step toward interpretable, efficient, user-centric information verification systems. Abstract: Attribution and fact verification are critical challenges in natural language processing for assessing information reliability. While automated systems and Large Language Models (LLMs) aim to retrieve and select concise evidence to support or refute claims, they often present users with either insufficient or overly redundant information, leading to inefficient and error-prone verification. To address this, we propose Evidence Ranking, a novel task that prioritizes presenting sufficient information as early as possible in a ranked list. This minimizes user reading effort while still making all available evidence accessible for sequential verification. We compare two approaches for the new ranking task: one-shot ranking and incremental ranking. We introduce a new evaluation framework, inspired by information retrieval metrics, and construct a unified benchmark by aggregating existing fact verification datasets. Extensive experiments with diverse models show that incremental ranking strategies better capture complementary evidence and that LLM-based methods outperform shallower baselines, while still facing challenges in balancing sufficiency and redundancy. Compared to evidence selection, we conduct a controlled user study and demonstrate that evidence ranking both reduces reading effort and improves verification. This work provides a foundational step toward more interpretable, efficient, and user-aligned information verification systems.

[23] Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

Yuan Sui,Bryan Hooi

Main category: cs.CL

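TL;DR: This paper proposes CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. It measures the quality of a critique by whether it helps improve a solution, which allows generation and judging capabilities to be optimized jointly without external judges or ground-truth labels.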

Details

Motivation: LLM-as-Judge approaches are bounded by the judge's own ability and carry evaluation biases (such as favoring verbosity), and they lack a mechanism for evaluating and improving the judge itself, that is, meta-evaluation. Method: In the proposed CoNL framework, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions; critiques that lead to improved solutions earn a diagnostic reward, which enables joint self-play optimization of generation and judging. Result: On five benchmarks, CoNL consistently outperforms self-rewarding baselines while training remains stable. Conclusion: By introducing a learnable meta-evaluation mechanism, CoNL mitigates the limit that judge quality imposes when no ground truth is available and offers a new approach to training LLMs on non-verifiable tasks. Abstract: Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.

[24] SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models

Lei Yang,Wei Bi,Chenxi Sun,Renren Jin,Deyi Xiong

Main category: cs.CL

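TL;DR: This paper proposes SOUP, a framework that unifies on-policy and off-policy learning within a single sample. It uses token-level importance ratios to exploit off-policy information while keeping training stable, improving both exploration and final performance in LLM reinforcement learning.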

Details

Motivation: On-policy RL methods used in LLM post-training, such as GRPO, suffer from limited exploration and early saturation, while mixing in entire off-policy trajectories causes policy mismatch and unstable training. Method: The paper proposes SOUP (Single-sample Mix-policy Unified Paradigm): within a single generated sequence, the prefix is sampled from historical policies (off-policy) and the continuation is generated by the current policy (on-policy), with token-level importance ratios weighting the off-policy information. Result: SOUP consistently outperforms standard on-policy training and existing off-policy extensions; further analysis supports its benefit for exploration and final performance. Conclusion: Through fine-grained, single-sample policy mixing, SOUP eases the tension between limited exploration and policy mismatch and provides a more stable and efficient training scheme for LLM reinforcement learning. Abstract: On-policy reinforcement learning (RL) methods widely used for language model post-training, like Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the Single-sample Mix-pOlicy Unified Paradigm (SOUP), a framework that unifies off- and on-policy learning within individual samples at the token level. It confines off-policy influence to the prefix of a generated sequence sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP effectively leverages off-policy information while preserving training stability. Extensive experiments demonstrate that SOUP consistently outperforms standard on-policy training and existing off-policy extensions. Our further analysis clarifies how our fine-grained, single-sample mix-policy training can improve both exploration and final performance in LLM RL.
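The prefix/continuation split described in this abstract can be illustrated with a small sketch of per-token importance ratios. This is not the authors' implementation: the function name `token_importance_ratios`, the `prefix_len` argument, and the choice to give on-policy tokens a ratio of exactly 1.0 are assumptions made for illustration, and the abstract does not specify how the ratios enter the loss.

```python
import math

def token_importance_ratios(logp_current, logp_behavior, prefix_len):
    """Per-token importance ratios for one mixed-policy sample (illustrative).

    Tokens before prefix_len are treated as off-policy (sampled from an older
    policy), so their ratio is pi_current / pi_behavior = exp(logp difference).
    Tokens from prefix_len onward are treated as on-policy and assigned 1.0,
    which is a simplifying assumption of this sketch.
    """
    if len(logp_current) != len(logp_behavior):
        raise ValueError("log-probability lists must have the same length")
    ratios = []
    for t, (lc, lb) in enumerate(zip(logp_current, logp_behavior)):
        ratios.append(math.exp(lc - lb) if t < prefix_len else 1.0)
    return ratios

# Two off-policy prefix tokens followed by one on-policy token.
ratios = token_importance_ratios([-1.0, -2.0, -0.5], [-1.0, -1.0, -3.0], prefix_len=2)
```

A ratio below 1 marks a prefix token that the current policy finds less likely than the historical policy did, so its contribution would be down-weighted under this kind of scheme.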

[25] DimStance: Multilingual Datasets for Dimensional Stance Analysis

Jonas Becker,Liang-Chih Yu,Shamsuddeen Hassan Muhammad,Jan Philip Wahle,Terry Ruas,Idris Abdulmumin,Lung-Hao Lee,Wen-Ni Liu,Tzu-Mi Lin,Zhe-Yu Xu,Ying-Lung Lin,Jin Wang,Maryam Ibrahim Mukhtar,Bela Gipp,Saif M. Mohammed

Main category: cs.CL

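TL;DR: This paper proposes a new formulation of stance detection along the dimensions of valence and arousal. It builds DimStance, the first multilingual, multi-domain dimensional stance dataset with VA annotations, defines the dimensional stance regression task, and evaluates a range of pretrained and large language models on it.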

Details

Motivation: Conventional stance detection uses only discrete labels (such as Favor/Neutral/Against), which cannot capture the nuanced affective states behind a stance; the paper adopts the VA framework from affective science to enable finer-grained, emotion-aware stance modeling. Method: The authors build DimStance, a dimensional stance dataset covering five languages and two domains (politics and environmental protection) with 11,746 target aspects, define the dimensional stance regression task, benchmark pretrained and large language models under regression and prompting settings, and analyze cross-lingual VA patterns. Result: Fine-tuned LLM regressors are competitive on VA regression; low-resource languages (such as Nigerian Pidgin and Swahili) remain challenging; token-based generation approaches show limitations on this task. Conclusion: DimStance provides a new resource and benchmark for multilingual, emotion-aware stance analysis and moves stance understanding from classification toward continuous, interpretable dimensional modeling. Abstract: Stance detection is an established task that classifies an author's attitude toward a specific target into categories such as Favor, Neutral, and Against. Beyond categorical stance labels, we leverage a long-established affective science framework to model stance along real-valued dimensions of valence (negative-positive) and arousal (calm-active). This dimensional approach captures nuanced affective states underlying stance expressions, enabling fine-grained stance analysis. To this end, we introduce DimStance, the first dimensional stance resource with valence-arousal (VA) annotations. This resource comprises 11,746 target aspects in 7,365 texts across five languages (English, German, Chinese, Nigerian Pidgin, and Swahili) and two domains (politics and environmental protection). To facilitate the evaluation of stance VA prediction, we formulate the dimensional stance regression task, analyze cross-lingual VA patterns, and benchmark pretrained and large language models under regression and prompting settings. Results show competitive performance of fine-tuned LLM regressors, persistent challenges in low-resource languages, and limitations of token-based generation. DimStance provides a foundation for multilingual, emotion-aware, stance analysis and benchmarking.

[26] MURAD: A Large-Scale Multi-Domain Unified Reverse Arabic Dictionary Dataset

Serry Sibaee,Yasser Alhabashi,Nadia Sibai,Yara Farouk,Adel Ammar,Sawsan AlHalawani,Wadii Boulila

Main category: cs.CL

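TL;DR: This paper introduces MURAD, a multi-domain unified reverse Arabic dictionary dataset with 96,243 Arabic word-definition pairs, intended to advance Arabic natural language processing and lexical semantics research.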

Details

Motivation: Arabic is rich and diverse, yet large-scale lexical datasets linking Arabic words to precise definitions remain limited, which constrains computational linguistics and lexicographic research. Method: Data are extracted from trusted reference works and educational sources with a hybrid pipeline (direct text parsing, optical character recognition, and automated reconstruction); each record is annotated with a standardized Arabic definition and source-domain metadata. Result: The resulting MURAD dataset covers linguistics, Islamic studies, mathematics, physics, psychology, and engineering, contains 96,243 word-definition pairs, and is openly released. Conclusion: MURAD fills a gap in high-quality Arabic dictionary resources, supports reverse dictionary modeling, semantic retrieval, and educational tools, and promotes Arabic NLP and reproducible semantic research. Abstract: Arabic is a linguistically and culturally rich language with a vast vocabulary that spans scientific, religious, and literary domains. Yet, large-scale lexical datasets linking Arabic words to precise definitions remain limited. We present MURAD (Multi-domain Unified Reverse Arabic Dictionary), an open lexical dataset with 96,243 word-definition pairs. The data come from trusted reference works and educational sources. Extraction used a hybrid pipeline integrating direct text parsing, optical character recognition, and automated reconstruction. This ensures accuracy and clarity. Each record aligns a target word with its standardized Arabic definition and metadata that identifies the source domain. The dataset covers terms from linguistics, Islamic studies, mathematics, physics, psychology, and engineering. It supports computational linguistics and lexicographic research. Applications include reverse dictionary modeling, semantic retrieval, and educational tools. By releasing this resource, we aim to advance Arabic natural language processing and promote reproducible research on Arabic lexical semantics.

[27] LMK > CLS: Landmark Pooling for Dense Embeddings

Meet Doshi,Aashka Trivedi,Vishwajeet Kumar,Parul Awasthy,Yulong Li,Jaydeep Sen,Radu Florian,Sachindra Joshi

Main category: cs.CL

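TL;DR: This paper proposes Landmark (LMK) pooling, a new sequence pooling method that splits a sequence into chunks, inserts landmark tokens between chunks, and mean-pools the landmark token embeddings, addressing systematic weaknesses of [CLS] and mean pooling. The method matches existing approaches on short-context tasks while substantially improving long-context tasks.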

Details

Motivation: Mainstream pooling strategies have systematic weaknesses: [CLS] tends to concentrate on the beginning of the sequence and under-represents distributed evidence, while mean pooling dilutes salient local signals and can hurt short-context performance. Method: Landmark (LMK) pooling partitions the input sequence into chunks, inserts learnable landmark tokens between chunks, and takes the mean of all landmark token embeddings as the sequence representation. Result: LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, while adding only a small number of special tokens, which makes it practical and scalable. Conclusion: LMK pooling is a simple and effective pooling scheme that balances local sensitivity with long-range modeling and offers a more robust alternative for sequence representation learning. Abstract: Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence to a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.
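The pooling recipe in this abstract (chunk, insert landmarks, mean-pool the landmark embeddings) can be sketched in plain Python. The helper names `insert_landmarks` and `landmark_pool`, the `chunk_size` and `lmk_id` parameters, and the decision to place a landmark after every chunk including the last are assumptions for illustration; the encoder that would produce `hidden_states` is not shown.

```python
def insert_landmarks(token_ids, chunk_size, lmk_id):
    """Split token_ids into chunks and append a landmark token after each chunk.

    Returns the new token sequence and the positions of the landmark tokens.
    Placing a landmark after the final chunk is an assumption of this sketch.
    """
    out, positions = [], []
    for start in range(0, len(token_ids), chunk_size):
        out.extend(token_ids[start:start + chunk_size])
        positions.append(len(out))  # index the landmark is about to occupy
        out.append(lmk_id)
    return out, positions

def landmark_pool(hidden_states, positions):
    """Mean-pool the hidden states found at the landmark positions only."""
    dim = len(hidden_states[0])
    return [
        sum(hidden_states[p][d] for p in positions) / len(positions)
        for d in range(dim)
    ]

# Five tokens, chunk size 2, landmark id 0.
ids, lmk_positions = insert_landmarks([1, 2, 3, 4, 5], chunk_size=2, lmk_id=0)
# Stand-in for encoder outputs: one 2-dimensional vector per token.
hidden = [[float(i), 1.0] for i in range(len(ids))]
pooled = landmark_pool(hidden, lmk_positions)
```

Only the landmark positions contribute to the pooled vector, so the number of averaged vectors grows with the number of chunks and not with the number of tokens.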

[28] inversedMixup: Data Augmentation via Inverting Mixed Embeddings

Fanshuang Kong,Richong Zhang,Qiyu Sun,Zhijie Nie,Ting Deng,Chunming Hu

Main category: cs.CL

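TL;DR: This paper proposes inversedMixup, a unified text augmentation framework that combines the controllability of Mixup with the readability of LLM generation. A three-stage procedure aligns the task model's embedding space with the LLM's input space so that mixed embeddings with a controllable ratio can be reconstructed into interpretable sentences; the paper also gives the first empirical evidence of manifold intrusion in text Mixup and mitigates it.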

Details

Motivation: Mixup is controllable but its outputs are not interpretable, while LLM-based generation is readable but hard to control; in addition, manifold intrusion in text Mixup has not been adequately studied. Method: The inversedMixup framework uses three-stage training to align the output embedding space of a task-specific model with the input embedding space of an LLM, applies LLM inversion to reconstruct linearly mixed embeddings into readable sentences with a controllable mixing ratio, and introduces a strategy to mitigate manifold intrusion. Result: The approach improves text augmentation in both few-shot and fully supervised settings, provides the first empirical evidence of manifold intrusion in text Mixup, and validates the proposed mitigation strategy. Conclusion: inversedMixup combines the controllability of Mixup with the interpretability of LLM generation, offering a new approach to interpretable and controllable text augmentation and exposing a key geometric issue in embedding mixing. Abstract: Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates in the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between latent embedding space and discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Upon successful alignment, inversedMixup can reconstruct mixed embeddings with a controllable mixing ratio into human-interpretable augmented sentences, thereby improving the augmentation performance. Additionally, inversedMixup provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and introduces a simple yet effective strategy to mitigate it. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.
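The base Mixup operation that this abstract starts from, linear interpolation of inputs and labels with a controllable ratio, can be written in a few lines. The sketch below covers only that interpolation step, not the alignment or inversion stages of inversedMixup; the function name `mixup` and the use of plain lists for embeddings and one-hot labels are illustrative assumptions.

```python
def mixup(x_a, x_b, y_a, y_b, lam):
    """Linearly interpolate two embeddings and their label vectors.

    lam is the mixing ratio in [0, 1]: lam = 1 returns sample a unchanged,
    lam = 0 returns sample b unchanged.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed_x, mixed_y

# Mix two toy 2-dimensional embeddings with one-hot labels at a 25/75 ratio.
mixed_x, mixed_y = mixup([1.0, 0.0], [0.0, 1.0], [1, 0], [0, 1], lam=0.25)
```

The mixed label is a soft label that reflects the same ratio as the mixed embedding, which is what makes the augmentation controllable.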

[29] Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes

Yang Zhou,Zhenting Sheng,Mingrui Tan,Yuting Song,Jun Zhou,Yu Heng Kwan,Lian Leng Low,Yang Bai,Yong Liu

Main category: cs.CL

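TL;DR: This paper proposes Note2Chat, a framework that converts real medical notes into high-quality doctor-patient dialogues and combines a three-stage fine-tuning strategy with a single-turn reasoning paradigm, substantially improving LLM performance on dynamic multi-turn clinical history taking and diagnosis.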

Details

Motivation: LLMs perform well on static benchmarks but fall short in dynamic multi-turn diagnostic settings that require iterative questioning and hypothesis refinement, and high-quality, compliant doctor-patient dialogue data are scarce. Method: The Note2Chat framework comprises 1) a decision tree-guided generation and refinement pipeline that converts real medical notes into high-quality doctor-patient dialogues; 2) a three-stage fine-tuning strategy (supervised learning, simulated data augmentation, and preference learning); and 3) a single-turn reasoning paradigm that models history taking as a sequence of single-turn reasoning problems. Result: The method substantially improves clinical reasoning, with gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o; code and dataset are released. Conclusion: Note2Chat offers a scalable, interpretable, and sample-efficient approach to modeling clinical history taking that does not depend on sensitive dialogue data, moving LLMs closer to practical clinical decision support. Abstract: Effective clinical history taking is a foundational yet underexplored component of clinical reasoning. While large language models (LLMs) have shown promise on static benchmarks, they often fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement. To address this gap, we propose Note2Chat, a note-driven framework that trains LLMs to conduct structured history taking and diagnosis by learning from widely available medical notes. Instead of relying on scarce and sensitive dialogue data, we convert real-world medical notes into high-quality doctor-patient dialogues using a decision tree-guided generation and refinement pipeline. We then propose a three-stage fine-tuning strategy combining supervised learning, simulated data augmentation, and preference learning. Furthermore, we propose a novel single-turn reasoning paradigm that reframes history taking as a sequence of single-turn reasoning problems. This design enhances interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experimental results show that our method substantially improves clinical reasoning, achieving gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o. Our code and dataset can be found at https://github.com/zhentingsheng/Note2Chat.

[30] ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Xiaoyu Tian,Haotian Wang,Shuaiting Chen,Hao Zhou,Kaichi Yu,Yudian Zhang,Jade Ouyang,Junxi Yin,Jiong Chen,Baoyan Guo,Lei Zhang,Junjie Tao,Yuansheng Song,Ming Cui,Chengwei Liu

Main category: cs.CL

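TL;DR: This paper proposes ASTRA, a framework that trains tool-augmented LLM agents fully automatically through verifiable reinforcement learning and scalable data synthesis. It targets the manual intervention, non-verifiable simulated environments, single training paradigm, and unstable long-horizon multi-turn learning of existing methods.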

Details

Motivation: Existing methods for training tool-augmented agents require manual intervention, depend on non-verifiable simulated environments, use only one of supervised fine-tuning or reinforcement learning, and struggle with stable long-horizon, multi-turn learning. Method: ASTRA has two core components: 1) a data synthesis pipeline that uses the static topology of tool-call graphs to synthesize diverse, structurally grounded trajectories; and 2) an environment synthesis framework that converts decomposed question-answer traces into independent, code-executable, rule-verifiable environments. Training combines SFT with online RL and uses trajectory-level rewards to balance task completion and interaction efficiency. Result: On multiple agentic tool-use benchmarks, ASTRA-trained models achieve state-of-the-art performance at comparable scales and approach closed-source systems while preserving core reasoning ability. Conclusion: ASTRA is a fully automated, end-to-end, verifiable framework for training tool-augmented agents that improves the robustness and generalization of tool use in multi-step decision making. Abstract: Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.

[31] KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices

Wuyang Zhou,Yuxuan Gu,Giorgos Iacovides,Danilo Mandic

Main category: cs.CL

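TL;DR: This paper proposes KromHC, which constructs doubly stochastic residual matrices from Kronecker products. It guarantees exact double stochasticity while reducing parameter complexity from O(n^3C) or O(nC·n!) to O(n^2C), improving training stability and scalability.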

Details

Motivation: mHC and its variant mHC-lite either fail to guarantee exact double stochasticity during training or incur excessive parameter complexity. Method: KromHC parametrizes the residual matrix as a Kronecker product of smaller doubly stochastic matrices and enforces manifold constraints along each mode of the tensorized residual stream, so that the overall residual matrix is exactly doubly stochastic. Result: KromHC matches or outperforms existing mHC variants while requiring significantly fewer trainable parameters. Conclusion: KromHC gives hyper-connected networks a parametrization that is both theoretically exact (double stochasticity) and computationally efficient (O(n^2C) complexity). Abstract: The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to its training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope, however, it faces two issues: 1) its iterative Sinkhorn-Knopp (SK) algorithm does not always yield exact doubly stochastic residual matrices; 2) mHC incurs a prohibitive $\mathcal{O}(n^3C)$ parameter complexity with $n$ as the width of the residual stream and $C$ as the feature dimension. The recently proposed mHC-lite reparametrizes the residual matrix via the Birkhoff-von-Neumann theorem to guarantee double stochasticity, but also faces a factorial explosion in its parameter complexity, $\mathcal{O}(nC \cdot n!)$. To address both challenges, we propose KromHC, which uses the Kronecker products of smaller doubly stochastic matrices to parametrize the residual matrix in mHC. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to $\mathcal{O}(n^2C)$. Comprehensive experiments demonstrate that KromHC matches or even outperforms state-of-the-art (SOTA) mHC variants, while requiring significantly fewer trainable parameters. The code is available at https://github.com/wz1119/KromHC.
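The property this abstract relies on, that a Kronecker product of doubly stochastic matrices is itself doubly stochastic, can be checked numerically. The sketch below only demonstrates that closure property on small hand-written matrices; the helper names `kron` and `is_doubly_stochastic` are illustrative and nothing here reproduces the KromHC parametrization or training procedure.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]

def is_doubly_stochastic(m, tol=1e-9):
    """True if m is square, entries are non-negative, and rows and columns sum to 1."""
    if any(len(row) != len(m) for row in m):
        return False
    if any(v < -tol for row in m for v in row):
        return False
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in m)
    cols_ok = all(abs(sum(col) - 1.0) <= tol for col in zip(*m))
    return rows_ok and cols_ok

# Two small doubly stochastic factors.
a = [[0.5, 0.5], [0.5, 0.5]]
b = [[0.9, 0.1], [0.1, 0.9]]
# Their Kronecker product is a 4x4 matrix that is again doubly stochastic.
k = kron(a, b)
```

The check works because each row sum of the product factors into a row sum of `a` times a row sum of `b`, and likewise for columns, so the constraint holds exactly without an iterative projection.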

[32] Language Models as Artificial Learners: Investigating Crosslinguistic Influence

Abderrahmane Issam,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis

Main category: cs.CL

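TL;DR: This paper uses language models as controlled statistical learners to systematically simulate crosslinguistic influence (CLI). It examines the effects of L1 dominance, L2 proficiency, and syntactic distance on CLI and probes the mechanism with a cross-linguistic priming paradigm; the results agree with psycholinguistic evidence, suggesting that LMs can serve as a computational framework for theories of human CLI.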

Details

Motivation: Human bilingualism studies of crosslinguistic influence (CLI) often yield conflicting results because of experimental variance, so a more controlled method is needed to isolate its drivers. Method: Language models are used as controlled statistical learners; the authors manipulate L1 dominance, L2 proficiency (via the L2 age of exposure, defined as the training step at which the L2 is introduced), and the syntactic distance between L1 and L2, and use cross-linguistic priming to analyze how activating L1 structures affects L2 processing. Result: Language dominance and proficiency are strong predictors of CLI; priming of grammatical structures is bidirectional, whereas priming of ungrammatical structures is sensitive to language dominance; the L1 is co-activated during L2 processing in LMs and directly influences the neural circuitry recruited for the L2. Conclusion: Language models can serve as a computational framework for studying human crosslinguistic influence and provide mechanistic evidence to inform CLI theories. Abstract: Despite the centrality of crosslinguistic influence (CLI) to bilingualism research, human studies often yield conflicting results due to inherent experimental variance. We address these inconsistencies by using language models (LMs) as controlled statistical learners to systematically simulate CLI and isolate its underlying drivers. Specifically, we study the effect of varying the L1 language dominance and the L2 language proficiency, which we manipulate by controlling the L2 age of exposure -- defined as the training step at which the L2 is introduced. Furthermore, we investigate the impact of pretraining on L1 languages with varying syntactic distance from the L2. Using cross-linguistic priming, we analyze how activating L1 structures impacts L2 processing. Our results align with evidence from psycholinguistic studies, confirming that language dominance and proficiency are strong predictors of CLI. We further find that while priming of grammatical structures is bidirectional, the priming of ungrammatical structures is sensitive to language dominance. Finally, we provide mechanistic evidence of CLI in LMs, demonstrating that the L1 is co-activated during L2 processing and directly influences the neural circuitry recruited for the L2. More broadly, our work demonstrates that LMs can serve as a computational framework to inform theories of human CLI.

[33] ILRR: Inference-Time Steering Method for Masked Diffusion Language Models

Eden Avrahami,Eliya Nachmani

Main category: cs.CL

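TL;DR: This paper proposes ILRR, a learning-free framework for steering discrete diffusion language models (DLMs). It guides semantic attributes by dynamically aligning the internal activations of the generated sequence with those of a reference sequence, and extends to Spatially Modulated Steering, which lets short references steer long texts; attribute accuracy improves at small computational cost.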

Details

Motivation: Discrete diffusion language models (DLMs) offer a non-autoregressive route to text generation, but effective mechanisms for inference-time control remain underexplored. Method: The learning-free Iterative Latent Representation Refinement (ILRR) framework uses a single reference sequence and dynamically aligns the internal activations of the generated sequence with those of the reference throughout denoising; Spatially Modulated Steering further allows long texts to be steered with shorter references. Result: ILRR achieves effective attribute steering (such as sentiment) on the LLaDA and MDLM architectures, adding only one parallel forward pass per denoising step; under the same compute budget it improves attribute accuracy over comparable baselines by 10% to 60% points while maintaining high generation quality. Conclusion: ILRR is a lightweight, general, and accurate method for inference-time control of DLMs and offers a new direction for controllable diffusion-based text generation. Abstract: Discrete Diffusion Language Models (DLMs) offer a promising non-autoregressive alternative for text generation, yet effective mechanisms for inference-time control remain relatively underexplored. Existing approaches include sampling-level guidance procedures or trajectory optimization mechanisms. In this work, we introduce Iterative Latent Representation Refinement (ILRR), a learning-free framework for steering DLMs using a single reference sequence. ILRR guides generation by dynamically aligning the internal activations of the generated sequence with those of a given reference throughout the denoising process. This approach captures and transfers high-level semantic properties, with a tunable steering scale enabling flexible control over attributes such as sentiment. We further introduce Spatially Modulated Steering, an extension that enables steering long texts using shorter references by regulating guidance intensity across the sequence. Empirically, we demonstrate that ILRR achieves effective attribute steering on LLaDA and MDLM architectures with a minor computational overhead, requiring only one additional parallel forward pass per denoising step. Under the same compute budget, ILRR improves attribute accuracy over comparable baselines by 10% to 60% points, while maintaining high generation quality.

[34] AdaptBPE: From General Purpose to Specialized Tokenizers

Vijini Liyanage,François Yvon

Main category: cs.CL

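TL;DR: This paper proposes a lightweight post-training adaptation strategy for subword tokenizers such as BPE: it identifies low-utility tokens on a domain- or language-specific corpus and replaces them with more relevant ones, optimizing the vocabulary for the target while keeping its size fixed and improving compression efficiency.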

Details

Motivation: Standard general-purpose subword tokenizers (such as BPE) use a generic token set that is inefficient on specific domains or languages. Method: Proposes a post-training adaptation strategy that uses token frequencies in an adaptation corpus to identify low-utility tokens and replace them, yielding the token inventory that most effectively encodes that corpus for a given target vocabulary size. Result: In experiments on generation and classification tasks across multiple languages, the adapted tokenizers compress test corpora more effectively than baselines at the same vocabulary size. Conclusion: The method is a lightweight mechanism, akin to vocabulary fine-tuning, for tailoring a tokenizer to a specific domain or task. Abstract: Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora more effectively than baselines using the same vocabulary size. This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks. Our code and data are available at https://github.com/vijini/Adapt-BPE.git.

[35] Scale-Dependent Semantic Dynamics Revealed by Allan Deviation

Debayan Dasgupta

Main category: cs.CL

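TL;DR: This paper treats the semantic progression of text as a stochastic trajectory in a high-dimensional state space and analyzes semantic stability with Allan deviation, a tool from precision metrology. Human text shows two dynamical regimes, short-time power-law scaling and a long-time stability-limited noise floor; large language models reproduce the local scaling statistics but have a markedly shorter stability horizon.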

Details

Motivation: To understand the dynamics that govern how meaning progresses through a text, and in particular to separate human text from LLM-generated text by semantic stability. Method: Ordered sentence embeddings are treated as a displacement signal, and Allan deviation is used to analyze their stability in the high-dimensional semantic space and to identify the dynamical behavior at different time scales. Result: Human text shows short-time power-law scaling (which distinguishes creative literature from technical texts) and a long-time stability-limited noise floor; large language models reproduce the short-time scaling, but their stability horizon is systematically shorter. Conclusion: Semantic coherence is a measurable physical property, and the framework offers a way to distinguish the semantic dynamics of human cognition from those of algorithmic models. Abstract: While language progresses through a sequence of semantic states, the underlying dynamics of this progression remain elusive. Here, we treat the semantic progression of written text as a stochastic trajectory in a high-dimensional state space. We utilize Allan deviation, a tool from precision metrology, to analyze the stability of meaning by treating ordered sentence embeddings as a displacement signal. Our analysis reveals two distinct dynamical regimes: short-time power-law scaling, which differentiates creative literature from technical texts, and a long-time crossover to a stability-limited noise floor. We find that while large language models successfully mimic the local scaling statistics of human text, they exhibit a systematic reduction in their stability horizon. These results establish semantic coherence as a measurable physical property, offering a framework to differentiate the nuanced dynamics of human cognition from the patterns generated by algorithmic models.
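
For readers unfamiliar with Allan deviation, the following is a minimal sketch of the standard non-overlapping estimator applied to a vector-valued sequence. The abstract does not state which estimator or normalization the author uses, so the function name `allan_deviation`, the choice to sum squared differences across embedding dimensions, and the synthetic data in the demo are all assumptions made for illustration.

```python
import math
import random


def allan_deviation(signal, m):
    """Non-overlapping Allan deviation of a vector-valued signal at window size m.

    signal: list of equal-length lists of floats (e.g. one embedding per sentence).
    m: number of consecutive samples averaged into one block.
    """
    n_blocks = len(signal) // m
    if n_blocks < 2:
        raise ValueError("need at least two full blocks")
    dim = len(signal[0])
    # Mean vector of each consecutive block of m samples.
    means = []
    for k in range(n_blocks):
        block = signal[k * m:(k + 1) * m]
        means.append([sum(v[d] for v in block) / m for d in range(dim)])
    # Half the mean squared distance between adjacent block means.
    total = 0.0
    for a, b in zip(means, means[1:]):
        total += sum((y - x) ** 2 for x, y in zip(a, b))
    return math.sqrt(total / (2 * (n_blocks - 1)))


if __name__ == "__main__":
    # Synthetic stand-in for sentence embeddings: independent Gaussian noise.
    rng = random.Random(0)
    fake_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(1024)]
    for window in (1, 4, 16, 64):
        print(window, allan_deviation(fake_embeddings, window))
```

Plotting the deviation against the window size on log-log axes is the usual way to read off scaling regimes: a straight segment indicates power-law scaling, and a flattening indicates a noise floor of the kind the abstract describes.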

[36] FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning

Xiaoyu Xu,Minxin Du,Kun Fang,Zi Liang,Yaxin Xiao,Zhicong Huang,Cheng Hong,Qingqing Ye,Haibo Hu

Main category: cs.CL

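TL;DR: This paper proposes FIT, a framework for continual, high-volume unlearning in large language models (LLMs) that balances effective forgetting with utility retention, and introduces the PCH benchmark together with two metrics, Forget Degree (F.D.) and Retain Utility (R.U.).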

Details

Motivation: Existing LLM unlearning methods struggle with the frequent, continual deletion requests seen in practice, which cause utility degradation and catastrophic forgetting as requests accumulate. Method: Proposes the FIT framework, which combines rigorous data Filtering, Importance-aware updates, and Targeted layer attribution; builds the PCH benchmark (covering Personal information, Copyright, and Harmful content) and two metrics, Forget Degree (F.D.) and Retain Utility (R.U.). Result: On four open-source LLMs with hundreds of deletion requests, FIT achieves the strongest trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and resists both relearning and quantization recovery attacks. Conclusion: FIT offers a scalable, robust, and practical solution for continual, high-volume LLM unlearning. Abstract: Large language models (LLMs) demonstrate impressive capabilities across diverse tasks but raise concerns about privacy, copyright, and harmful materials. Existing LLM unlearning methods rarely consider the continual and high-volume nature of real-world deletion requests, which can cause utility degradation and catastrophic forgetting as requests accumulate. To address this challenge, we introduce FIT, a framework for continual unlearning that handles large numbers of deletion requests while maintaining robustness against both catastrophic forgetting and post-unlearning recovery. FIT mitigates degradation through rigorous data Filtering, Importance-aware updates, and Targeted layer attribution, enabling stable performance across long sequences of unlearning operations and achieving a favorable balance between forgetting effectiveness and utility retention. To support realistic evaluation, we present PCH, a benchmark covering Personal information, Copyright, and Harmful content in sequential deletion scenarios, along with two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), which jointly assess forgetting quality and utility preservation. Extensive experiments on four open-source LLMs with hundreds of deletion requests show that FIT achieves the strongest trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and remains resistant against both relearning and quantization recovery attacks.

[37] Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling

Xinglin Wang,Jiayi Shi,Shaoxiong Feng,Peiwen Yuan,Yiwei Li,Yueqi Zhang,Chuyi Tan,Ji Zhang,Boyuan Pan,Yao Hu,Kan Li

Main category: cs.CL

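TL;DR: This paper proposes Recycling Search Experience (RSE), which reuses the successes and failures encountered during test-time search to make LLM reasoning more efficient, cutting redundant computation and improving scaling efficiency.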

Details

Motivation: Existing test-time scaling methods treat every rollout as an independent, memoryless sample, so models repeat computation and revisit dead ends without using the intermediate experience they have already gained. Method: Proposes RSE, which builds a shared experience bank; positive recycling reuses correct intermediate conclusions to skip redundant derivations, and negative recycling reuses failure patterns to prune known dead ends; the strategy is training-free and self-guided. Result: On the complex reasoning benchmarks HMMT24, HMMT25, IMO-Bench, and HLE, RSE consistently outperforms strong baselines at comparable computational cost and achieves state-of-the-art scaling efficiency. Conclusion: RSE turns test-time search from memoryless independent sampling into a cumulative process, and both the theoretical analysis and the experiments support its efficiency advantage. Abstract: Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This systemic memorylessness leads to massive computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose Recycling Search Experience (RSE), a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE, validating its advantage over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art scaling efficiency.

[38] Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

Hojae Han,Heeyun Jung,Jongyoon Kim,Seung-won Hwang

Main category: cs.CL

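TL;DR: This paper proposes DAVID-GRPO, a framework that lets small language models perform multi-hop reasoning efficiently under resource constraints through stabilized early training, retrieval credit assignment based on evidence recall, and resampling of truncated near-miss trajectories; it outperforms prior RL methods designed for large-scale settings.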

Details

Motivation: Existing RL methods for multi-hop reasoning depend on extensive on-policy rollouts in a high-cost, high-accuracy regime, which makes stable and efficient training difficult with small models and limited compute. Method: Proposes DAVID-GRPO, which (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) resamples truncated near-miss trajectories to improve exploration. Result: With models of up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. Conclusion: With the right inductive biases, small language models can reach high accuracy at low training cost, breaking the assumed trade-off between low cost and low accuracy. Abstract: While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense explorations, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated on agents up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve low training cost with high accuracy.

[39] Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

Wonduk Seo,Wonseok Choi,Junseo Koh,Juhyeon Lee,Hyunjin An,Minhyeong Yu,Jian Park,Qingshan Zhou,Seunghyun Lee,Yi Bu

Main category: cs.CL

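TL;DR: This paper proposes OG-MAR, a framework that combines World Values Survey (WVS) data with a cultural ontology and uses multi-agent reasoning to make LLM outputs more culturally aligned, interpretable, and robust.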

Details

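Motivation: In culturally sensitive decision making, LLMs are often misaligned because of skewed pretraining data and the absence of structured value representations; existing alignment methods lack demographic grounding and treat values as independent, unstructured signals, which reduces consistency and interpretability. Method: Proposes OG-MAR (Ontology-Guided Multi-Agent Reasoning): 1) summarizes respondent-specific value profiles from the WVS; 2) builds a global cultural ontology by eliciting relations over a fixed value taxonomy via competency questions; 3) at inference time, retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Result: On regional social-survey benchmarks across four LLM backbones, OG-MAR improves cultural alignment and robustness over competitive baselines and produces more transparent reasoning traces. Conclusion: A structured cultural ontology combined with demographically aware multi-agent reasoning improves the value consistency and interpretability of LLMs on culturally sensitive tasks. Abstract: Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.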

[40] Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

Qingyue Yang,Jie Wang,Xing Li,Yinqi Bai,Xialiang Tong,Huiling Zhen,Jianye Hao,Mingxuan Yuan,Bin Li

Main category: cs.CL

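TL;DR: This paper proposes TAPPA, a framework that explains attention patterns in large language models from a temporally continuous perspective, separates predictable from unpredictable patterns, provides a mathematical analysis based on query self-similarity, and informs KV cache compression and model pruning.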

Details

Motivation: Prior observations of attention patterns (such as retrieval heads, sink heads, and diagonal traces) are fragmented and lack a unifying explanation. Method: Proposes Temporal Attention Pattern Predictability Analysis (TAPPA), which models attention from a temporally continuous perspective, analyzes the joint effect of queries, keys, and RoPE, and uses the degree of query self-similarity along the temporal dimension to separate and explain predictable and unpredictable attention patterns. Result: TAPPA is validated on KV cache compression and LLM pruning; a simple metric motivated by TAPPA consistently improves on baseline methods across these tasks. Conclusion: The predictability of an attention pattern stems from the temporal self-similarity of its queries; TAPPA both deepens the understanding of attention behavior and offers practical guidance for efficient inference. Abstract: Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab-USTC/LLM-TAPPA.

[41] TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning

Huiyuan Lai,Malvina Nissim

Main category: cs.CL

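TL;DR: This paper proposes TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases data complexity and uses a hybrid Thinking/NoThinking reasoning paradigm, improving both the reasoning efficiency and the accuracy of large language models on complex math tasks.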

Details

Motivation: Eliciting long chain-of-thought (CoT) reasoning typically requires large-scale reinforcement learning and often leads to overthinking with redundant steps, which hurts learning and reasoning efficiency. Method: Proposes TACLer, with two core components: (i) tailored curriculum learning, which determines in progressive stages what the model still needs to learn based on its proficiency; (ii) a hybrid Thinking/NoThinking reasoning paradigm, which enables or disables the Thinking mode to balance accuracy and efficiency. Result: TACLer cuts training compute by over 50% compared to long thinking models and reduces inference token usage by over 42% relative to the base model; it improves accuracy by over 9% over the base model and consistently outperforms state-of-the-art NoThinking and Thinking baselines across four math datasets with complex problems. Conclusion: TACLer improves the learning and inference efficiency of LLM reasoning while preserving or even enhancing performance. Abstract: Large Language Models (LLMs) have shown remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT typically requires large-scale reinforcement learning (RL) training and often leads to overthinking with redundant intermediate steps. To improve learning and reasoning efficiency, while preserving or even enhancing performance, we propose TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases the complexity of the data based on the model's proficiency in multi-stage RL training. TACLer features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50% compared to long thinking models and reducing inference token usage by over 42% relative to the base model; and (ii) it improves accuracy by over 9% over the base model, consistently outperforming state-of-the-art NoThinking and Thinking baselines across four math datasets with complex problems.

[42] Enhancing Language Models for Robust Greenwashing Detection

Neil Heinrich Braun,Keane Ong,Rui Mao,Erik Cambria,Gianmarco Mengaldo

Main category: cs.CL

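TL;DR: This paper proposes a parameter-efficient framework that structures the latent space of large language models by combining contrastive learning with an ordinal ranking objective to separate concrete actions from ambiguous claims, and uses gated feature modulation and MetaGradNorm to improve robustness and stabilize multi-objective optimization.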

Details

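Motivation: Sustainability reports are critical for ESG assessment, but greenwashing and vague claims undermine their reliability; existing NLP models are not robust to these practices and generalize poorly. Method: Proposes a parameter-efficient framework that combines contrastive learning with an ordinal ranking objective to capture graded distinctions between concrete actions and ambiguous claims, adds gated feature modulation to filter disclosure noise, and uses MetaGradNorm to stabilize multi-objective optimization. Result: Cross-category experiments show better robustness than standard baselines and reveal a trade-off between representational rigidity and generalization. Conclusion: The framework improves the detection of greenwashing and vague statements in ESG text and offers a new method for assessing the credibility of sustainability disclosures. Abstract: Sustainability reports are critical for ESG assessment, yet greenwashing and vague claims often undermine their reliability. Existing NLP models lack robustness to these practices, typically relying on surface-level patterns that generalize poorly. We propose a parameter-efficient framework that structures LLM latent spaces by combining contrastive learning with an ordinal ranking objective to capture graded distinctions between concrete actions and ambiguous claims. Our approach incorporates gated feature modulation to filter disclosure noise and utilizes MetaGradNorm to stabilize multi-objective optimization. Experiments in cross-category settings demonstrate superior robustness over standard baselines while revealing a trade-off between representational rigidity and generalization.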

[43] Procedural Pretraining: Warming Up Language Models with Abstract Data

Liangze Jiang,Zachary Shinnick,Anton van den Hengel,Hemanth Saratchandran,Damien Teney

Main category: cs.CL

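TL;DR: This paper proposes a pretraining paradigm in which a small amount of abstract procedural data (such as Dyck sequences) is used before large-scale natural-language pretraining, improving algorithmic skills and subsequent pretraining performance and accelerating convergence.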

Details

Motivation: Inspired by how humans learn simple logic and mathematics before higher reasoning, the paper asks whether exposing a model to abstract structured data (specifically procedural data) before standard language pretraining eases the later acquisition of semantic knowledge. Method: Systematically pretrains on several forms of procedural data (such as Dyck sequences) ahead of standard pretraining; compares against standard pretraining on C4, CodeParrot, and DeepMind-Math with models of up to 1.3B parameters; analyzes the structure induced in attention and MLP layers. Result: Front-loading as little as 0.1% procedural data significantly outperforms standard pretraining; on Needle-in-a-haystack context recall, accuracy rises from 10% to 98%; the same loss is reached with only 55%, 67%, and 86% of the original data; the structure in attention layers matters most for structured domains such as code, and the structure in MLP layers matters most for language. Conclusion: Procedural pretraining is a simple, lightweight way to improve performance and accelerate language model pretraining, and it suggests the promise of disentangling knowledge acquisition from reasoning. Abstract: Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. We specifically focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10% to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this procedural pretraining enables the models to reach the same loss value with only 55%, 67%, and 86% of the original data. Third, we explore the underlying mechanisms and find that procedural pretraining instils non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
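
To make "Dyck sequences (balanced brackets)" concrete, here is a minimal sketch of a generator for such procedural data together with a stack-based validity check. The bracket alphabet `PAIRS`, the opening probability `p_open`, and the function names are illustrative assumptions; the abstract does not describe how the authors sample their sequences.

```python
import random

# Three bracket types, chosen for illustration only.
PAIRS = [("(", ")"), ("[", "]"), ("{", "}")]


def sample_dyck(n_pairs, rng, p_open=0.5):
    """Sample a balanced bracket string containing exactly n_pairs bracket pairs."""
    out, stack, opened = [], [], 0
    while opened < n_pairs or stack:
        can_open = opened < n_pairs
        # Open a new bracket if allowed and either nothing is pending or the coin says so.
        if can_open and (not stack or rng.random() < p_open):
            left, right = rng.choice(PAIRS)
            out.append(left)
            stack.append(right)  # remember which closer this opener needs
            opened += 1
        else:
            out.append(stack.pop())  # close the most recently opened bracket
    return "".join(out)


def is_balanced(s):
    """Return True if s is a well-nested string over the brackets in PAIRS."""
    closer_for = {left: right for left, right in PAIRS}
    stack = []
    for ch in s:
        if ch in closer_for:
            stack.append(closer_for[ch])
        elif not stack or stack.pop() != ch:
            return False
    return not stack


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_dyck(6, rng))
```

Because every closing bracket is determined by the stack, predicting the next token in such data requires tracking nested context, which is one plausible reason this kind of data exercises recall-like skills.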

[44] CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering

Jiayin Lan,Jiaqi Li,Baoxin Wang,Ming Liu,Dayong Wu,Shijin Wang,Bing Qin,Guoping Hu

Main category: cs.CL

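TL;DR: This paper proposes CE-GOCD, which uses paper titles as central entities to retrieve and optimize subgraphs of an academic knowledge graph and applies community detection, improving LLM question answering over scientific literature.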

Details

Motivation: Existing retrieval augmentation methods rely on isolated text chunks or concepts and overlook deeper semantic connections between papers, which impairs the LLM's comprehension of scientific literature and the comprehensiveness and specificity of its answers. Method: Proposes Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD): (1) uses paper titles as central entities for targeted subgraph retrieval; (2) enhances implicit semantic discovery via subgraph pruning and completion; (3) applies community detection to distill coherent groups of papers with shared themes. Result: On three NLP literature-based question-answering datasets, CE-GOCD outperforms other retrieval-augmented baselines. Conclusion: Explicitly modeling and leveraging semantic substructures within academic knowledge graphs improves LLM performance on scientific question answering. Abstract: Large Language Models (LLMs) are increasingly used for question answering over scientific research papers. Existing retrieval augmentation methods often rely on isolated text chunks or concepts, but overlook deeper semantic connections between papers. This impairs the LLM's comprehension of scientific literature, hindering the comprehensiveness and specificity of its responses. To address this, we propose Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD), a method that augments LLMs' scientific question answering by explicitly modeling and leveraging semantic substructures within academic knowledge graphs. Our approach operates by: (1) leveraging paper titles as central entities for targeted subgraph retrieval, (2) enhancing implicit semantic discovery via subgraph pruning and completion, and (3) applying community detection to distill coherent paper groups with shared themes. We evaluated the proposed method on three NLP literature-based question-answering datasets, and the results demonstrate its superiority over other retrieval-augmented baseline approaches, confirming the effectiveness of our framework.

[45] Temporal Guidance for Large Language Models

Hong-Kai Zheng,Piji Li

Main category: cs.CL

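TL;DR: This paper proposes Temporal Guidance (TeGu), a contrastive guidance method along the temporal dimension that uses Multi-Token Prediction (MTP) to build weaker predictions for model self-contrast, and introduces a lightweight Conditional MTP Projector (cMTPP); it improves generation quality at low overhead.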

Details

Motivation: Existing contrastive decoding methods either incur high computational overhead (CD, which needs an auxiliary model) or are unstable on small models (DoLa); observing that LLMs exhibit local preferences, the authors explore contrastive guidance along the temporal dimension. Method: Proposes Temporal Guidance (TeGu), which uses Multi-Token Prediction (MTP) to produce weaker amateur predictions for model self-contrast, and designs a lightweight Conditional MTP Projector (cMTPP) that standardizes the implementation and avoids maintaining multiple independent networks. Result: TeGu achieves significant improvements across various model series and benchmarks while keeping additional memory consumption and computational overhead low. Conclusion: TeGu is an efficient and stable approach to contrastive decoding that remains usable on small-scale LLMs. Abstract: Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.
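
As background for the contrast between a stronger and a weaker prediction, the sketch below shows the generic contrastive decoding scoring step for a single position: tokens are first filtered by a plausibility cutoff relative to the expert's most likely token, then ranked by the difference in log-probability. It is not TeGu itself; how TeGu derives the amateur distribution from MTP is not detailed in the abstract, and the function names, the `alpha` default, and the toy distributions are assumptions.

```python
import math


def contrastive_scores(expert_logprobs, amateur_logprobs, alpha=0.1):
    """Score tokens by expert minus amateur log-probability.

    Only tokens whose expert probability is at least alpha times the
    expert's maximum probability are kept (the plausibility constraint).
    Both arguments map token -> log-probability over the same vocabulary.
    """
    cutoff = max(expert_logprobs.values()) + math.log(alpha)
    return {
        tok: lp - amateur_logprobs[tok]
        for tok, lp in expert_logprobs.items()
        if lp >= cutoff
    }


def pick_token(expert_logprobs, amateur_logprobs, alpha=0.1):
    """Greedy choice: the plausible token with the largest contrastive score."""
    scores = contrastive_scores(expert_logprobs, amateur_logprobs, alpha)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy next-token distributions, invented for illustration.
    expert = {"a": math.log(0.50), "b": math.log(0.49), "c": math.log(0.01)}
    amateur = {"a": math.log(0.80), "b": math.log(0.199), "c": math.log(0.001)}
    print(pick_token(expert, amateur))
```

The cutoff matters in the toy example: token "c" has the largest raw log-ratio only because both models consider it unlikely, and the plausibility filter removes it before ranking.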

[46] CoFrGeNet: Continued Fraction Architectures for Language Generation

Amit Dhurandhar,Vijil Chenthamarakshan,Dennis Wei,Tejaswini Pedapati,Karthikeyan Natesan Ramamurthy,Rahul Nair

Main category: cs.CL

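TL;DR: This paper proposes CoFrGeNets, a new function class for generative modeling based on continued fractions, whose components replace multi-head attention and feed-forward networks in Transformer blocks; with one-half to two-thirds of the original parameters and shorter pre-training time, the models match and sometimes exceed the originals on downstream tasks.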

Details

Motivation: Inspired by continued fractions, the paper explores generative architectures that need fewer parameters while remaining expressive, to reduce the parameter count and compute cost of Transformers. Method: Proposes the CoFrGeNets architecture family, with new components that directly replace Multi-head Attention and Feed-Forward Networks in Transformer blocks; derives custom gradient formulations for more accurate and efficient optimization; the components are plug-in replacements compatible with existing training and inference procedures. Result: On GPT2-xl (1.5B) and Llama3 (3.2B), with one-half to two-thirds of the parameters and shorter pre-training time, performance on downstream classification, Q&A, reasoning, and text understanding tasks is competitive with and sometimes superior to the original models. Conclusion: CoFrGeNets are a promising lightweight alternative to standard Transformer components that is easy to adopt in industrial workflows; hardware-specific implementations may bring further gains. Abstract: Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets - Continued Fraction Generative Networks. We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change in training or inference procedures that have already been put in place for Transformer-based models, thus making our approach easy to incorporate in large industrial workflows. We experiment on two very different transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B), pre-training the former on OpenWebText and GneissWeb and the latter on the docling data mix, which consists of nine different datasets. Results show that the performance of our models on downstream classification, Q&A, reasoning and text understanding tasks is competitive with and sometimes even superior to the original models, with two-thirds to one-half the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.
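
For readers who want a reminder of the mathematical object behind the name, the sketch below evaluates a finite simple continued fraction a0 + 1/(a1 + 1/(a2 + ...)) exactly, working from the innermost term outward. It illustrates continued fractions only; the abstract does not say how CoFrGeNets parameterize them, so nothing here should be read as the paper's layer design. The function names are illustrative.

```python
from fractions import Fraction


def continued_fraction(coeffs):
    """Evaluate a0 + 1/(a1 + 1/(a2 + ...)) exactly for a non-empty coefficient list.

    Raises ZeroDivisionError if an intermediate denominator is zero.
    """
    value = Fraction(coeffs[-1])
    # Fold from the innermost term outward.
    for a in reversed(coeffs[:-1]):
        value = a + 1 / value
    return value


def convergents(coeffs):
    """Successive truncations of the continued fraction, shortest first."""
    return [continued_fraction(coeffs[:k]) for k in range(1, len(coeffs) + 1)]


if __name__ == "__main__":
    # The first four coefficients of pi's continued fraction expansion.
    for c in convergents([3, 7, 15, 1]):
        print(c, float(c))
```

A handful of coefficients already yields a close rational approximation here, which gives some intuition for why a function class built on this form might be parameter-efficient.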

[47] Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond

Wei Zhu

Main category: cs.CL

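TL;DR: This paper systematically evaluates ChatGPT on 4 medical information extraction (MedIE) tasks across 6 benchmark datasets, measuring performance, explainability, confidence, faithfulness, and uncertainty. ChatGPT trails fine-tuned models overall; its explanations are high quality but it is over-confident, it is largely faithful to the source text, and generation uncertainty undermines the reliability of its extraction results.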

Details

Motivation: To assess the overall ability of large language models such as ChatGPT on medical information extraction (MedIE) tasks, covering performance, explainability, confidence, faithfulness, and uncertainty, and so clarify where they are applicable in specialized NLP tasks. Method: Systematically evaluates ChatGPT on 4 MedIE tasks across 6 benchmark datasets and measures its performance, explanation quality, confidence, faithfulness to the original text, and generation uncertainty. Result: (a) ChatGPT's performance on MedIE tasks falls behind fine-tuned baseline models; (b) it provides high-quality explanations but is over-confident in its predictions; (c) it is highly faithful to the original text in the majority of cases; (d) uncertainty in generation makes the extraction results unstable. Conclusion: ChatGPT shows some ability to extract and explain medical information, but its weaker performance and over-confidence mean it cannot yet replace dedicated fine-tuned models in demanding MedIE settings. Abstract: Large Language Models (LLMs) like ChatGPT have demonstrated amazing capabilities in comprehending user intents and generating reasonable and useful responses. Besides their ability to chat, their capabilities in various natural language processing (NLP) tasks are of interest to the research community. In this paper, we focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis by measuring ChatGPT's performance, explainability, confidence, faithfulness, and uncertainty. Our experiments reveal that: (a) ChatGPT's performance scores on MedIE tasks fall behind those of the fine-tuned baseline models. (b) ChatGPT can provide high-quality explanations for its decisions; however, it is over-confident in its predictions. (c) ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. (d) The uncertainty in generation causes uncertainty in information extraction results, and thus may hinder its applications in MedIE tasks.

[48] Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention

Alon Rozental

Main category: cs.CL

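TL;DR: This paper proposes Zonkey, a hierarchical diffusion framework with a differentiable tokenizer that replaces fixed, non-differentiable tokenizers such as BPE, enabling end-to-end trainable language modeling from raw characters to document-level representations.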

Details

Motivation: Current LLMs are constrained by fixed, non-differentiable tokenizers (such as BPE), which hinder end-to-end optimization and adapt poorly to noisy or domain-specific data. Method: Proposes Zonkey, which comprises a learnable Segment Splitter (differentiable tokenization enabled by Probabilistic Attention), hierarchical compression into higher abstractions, a Denoising Diffusion Mixed Model (DDMM) for denoising in latent space, and a Stitcher that ensures overlap invariance across segments; the whole pipeline is trained end-to-end. Result: Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise and shows emergent linguistic hierarchies (such as word boundaries at spaces and sentence starts at periods); qualitatively, it aligns with data distributions better than entropy-based learnable tokenizers. Conclusion: Zonkey advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation; the source code is released. Abstract: Large language models (LLMs) have revolutionized natural language processing, yet they remain constrained by fixed, non-differentiable tokenizers like Byte Pair Encoding (BPE), which hinder end-to-end optimization and adaptability to noisy or domain-specific data. We introduce Zonkey, a hierarchical diffusion model that addresses these limitations through a fully trainable pipeline from raw characters to document-level representations. At its core is a differentiable tokenizer (Segment Splitter) that learns probabilistic beginning-of-sequence (BOS) decisions, enabling adaptive splits that emerge as linguistically meaningful (e.g., word boundaries at spaces, sentence starts at periods) without explicit supervision. This differentiability is enabled by our novel Probabilistic Attention mechanism, which incorporates position-specific existence probabilities to simulate soft masking over theoretically infinite sequences while preserving gradients. Sequences decay probabilistically rather than relying on end-of-sequence tokens, supporting variable-length outputs. Hierarchical levels compress sequences into higher abstractions (e.g., character n-grams to word-like vectors, then sentence-like), with reconstruction via our Denoising Diffusion Mixed Model (DDMM) for stable and efficient denoising in latent space. A Stitcher ensures overlap invariance across segments. Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise, demonstrating emergent hierarchies and promising qualitative alignment to data distributions compared to entropy-based learnable tokenizers. Our approach advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation. We release the source code for training and reproducing our experiments.

[49] KID: Knowledge-Injected Dual-Head Learning for Knowledge-Grounded Harmful Meme Detection

Yaocong Li,Leihan Zhang,Le Zhang,Qiang Yan

Main category: cs.CL

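TL;DR: This paper proposes KID, a framework that improves harmful meme detection through knowledge injection and dual-head learning and achieves SOTA results on multilingual datasets.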

Details

Motivation: Existing methods struggle with the implicit toxicity of memes, which depends on sociocultural background knowledge that is not present in the meme itself. Method: Proposes KID, a Knowledge-Injected Dual-Head Learning framework that uses label-constrained distillation to produce structured reasoning chains explicitly linking visual evidence, external knowledge, and classification labels, and a dual-head architecture that jointly optimizes semantic generation and classification. Result: On five datasets spanning English, Chinese, and low-resource Bengali, KID achieves SOTA performance on both binary and multi-label harmful meme detection, improving over previous best methods by 2.1% to 19.7%; ablation studies confirm that knowledge injection and dual-head learning are effective and complementary. Conclusion: By explicitly introducing background knowledge and jointly optimizing generative and discriminative objectives, KID makes harmful meme detection more robust and generalizable. Abstract: Internet memes have become pervasive carriers of digital culture on social platforms. However, their heavy reliance on metaphors and sociocultural context also makes them subtle vehicles for harmful content, posing significant challenges for automated content moderation. Existing approaches primarily focus on intra-modal and inter-modal signal analysis, while the understanding of implicit toxicity often depends on background knowledge that is not explicitly present in the meme itself. To address this challenge, we propose KID, a Knowledge-Injected Dual-Head Learning framework for knowledge-grounded harmful meme detection. KID adopts a label-constrained distillation paradigm to decompose complex meme understanding into structured reasoning chains that explicitly link visual evidence, background knowledge, and classification labels. These chains guide the learning process by grounding external knowledge in meme-specific contexts. In addition, KID employs a dual-head architecture that jointly optimizes semantic generation and classification objectives, enabling aligned linguistic reasoning while maintaining stable decision boundaries. Extensive experiments on five multilingual datasets spanning English, Chinese, and low-resource Bengali demonstrate that KID achieves SOTA performance on both binary and multi-label harmful meme detection tasks, improving over previous best methods by 2.1% to 19.7% across primary evaluation metrics. Ablation studies further confirm the effectiveness of knowledge injection and dual-head joint learning, highlighting their complementary contributions to robust and generalizable meme understanding. The code and data are available at https://github.com/PotatoDog1669/KID.

[50] Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation

Yimin Deng,Yuqing Fu,Derong Xu,Yejing Wang,Wei Ni,Jingtong Gao,Xiaopeng Li,Chengxu Liu,Xiao Han,Guoshuai Zhao,Xiangyu Zhao,Li Zhu,Xueming Qian

Main category: cs.CL

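TL;DR: This paper proposes an Adversarial Memory Adaptation mechanism (AMA) that simulates downstream task execution to introduce task-aware supervision during the offline phase, adapting memory construction and update strategies and making the memory of conversational agents more effective in long conversations.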

Details

Motivation: Existing memory systems build and update memory offline in a fixed, task-independent way, so memory content is misaligned with downstream task requirements and performance suffers. Method: AMA uses three cooperating agents: a challenger generates question-answer pairs to simulate task inference; an evaluator analyzes errors in the answers; an adapter uses the error feedback to perform dual-level updates on both the memory construction strategy and the memory content. Result: AMA can be integrated into various existing memory systems, and experiments on the long-dialogue benchmark LoCoMo demonstrate its effectiveness. Conclusion: By bringing task objectives forward into the offline memory phase, AMA achieves task-driven memory adaptation and offers a new approach to the context bottleneck in long conversations. Abstract: Conversational agents struggle to handle long conversations due to context window limitations. Therefore, memory systems are developed to leverage essential historical information. Existing memory systems typically follow a pipeline of offline memory construction and update, and online retrieval. Despite the flexible online phase, the offline phase remains fixed and task-independent. In this phase, memory construction operates under a predefined workflow and fails to emphasize task-relevant information. Meanwhile, memory updates are guided by generic metrics rather than task-specific supervision. This leads to a misalignment between offline memory preparation and task requirements, which undermines downstream task performance. To this end, we propose an Adversarial Memory Adaptation mechanism (AMA) that aligns memory construction and update with task objectives by simulating task execution. Specifically, first, a challenger agent generates question-answer pairs based on the original dialogues. The constructed memory is then used to answer these questions, simulating downstream inference. Subsequently, an evaluator agent assesses the responses and performs error analysis. Finally, an adapter agent analyzes the error cases and performs dual-level updates on both the construction strategy and the content. Through this process, the memory system receives task-aware supervision signals in advance during the offline phase, enhancing its adaptability to downstream tasks. AMA can be integrated into various existing memory systems, and extensive experiments on the long-dialogue benchmark LoCoMo demonstrate its effectiveness.

[51] RAG-E: Quantifying Retriever-Generator Alignment and Failure Modes

Korbinian Randl,Guido Rocchietti,Aron Henriksson,Ziawasch Abedjan,Tony Lindgren,John Pavlopoulos

Main category: cs.CL

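TL;DR: This paper proposes the RAG-E framework for quantifying how well a retriever and a generator are aligned, shows that the two are often severely misaligned in RAG systems, and enables interpretable auditing through the new WARG metric and attribution methods such as PMCSHAP.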

Details

Motivation: RAG systems are hard to deploy in high-stakes domains because the interaction between retriever and generator is opaque, and there are few tools for interpretable analysis of how the two cooperate. Method: The authors propose RAG-E, an end-to-end explainability framework that: 1) adapts Integrated Gradients to analyze the retriever; 2) introduces PMCSHAP, a Monte Carlo-stabilized Shapley value approximation, for generator attribution; 3) defines the Weighted Attribution-Relevance Gap (WARG) metric to measure how consistent the generator's actual use of retrieved documents is with the retriever's ranking. Result: On TREC CAsT and FoodSafeSum, substantial misalignment is found: for 47.4% to 66.7% of queries the generator ignores the retriever's top-ranked document, and 48.1% to 65.9% rely on documents ranked as less relevant. Conclusion: RAG performance depends not only on each component in isolation but, more importantly, on the alignment between retriever and generator; RAG-E can audit this alignment, improving trustworthiness and deployability. Abstract: Retrieval-Augmented Generation (RAG) systems combine dense retrievers and language models to ground LLM outputs in retrieved documents. However, the opacity of how these components interact creates challenges for deployment in high-stakes domains. We present RAG-E, an end-to-end explainability framework that quantifies retriever-generator alignment through mathematically grounded attribution methods. Our approach adapts Integrated Gradients for retriever analysis, introduces PMCSHAP, a Monte Carlo-stabilized Shapley Value approximation, for generator attribution, and introduces the Weighted Attribution-Relevance Gap (WARG) metric to measure how well a generator's document usage aligns with a retriever's ranking. Empirical analysis on TREC CAsT and FoodSafeSum reveals critical misalignments: for 47.4% to 66.7% of queries, generators ignore the retriever's top-ranked documents, while 48.1% to 65.9% rely on documents ranked as less relevant. These failure modes demonstrate that RAG output quality depends not solely on individual component performance but on their interplay, which can be audited via RAG-E.

[52] Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning

Bodong Du,Xuanqi Huang,Xiaomeng Li

Main category: cs.CL

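TL;DR: This paper proposes DARE, a distribution-aware reward estimation method that makes unsupervised reward signals in test-time reinforcement learning (TTRL) more reliable, significantly improving the performance and optimization stability of large language models on reasoning tasks.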

Details

Motivation: Existing TTRL methods rely on majority voting (MV) to produce deterministic rewards, but this assumption is fragile: MV ignores non-majority yet correct actions, which systematically biases reward estimates. Method: DARE shifts reward estimation from a single majority outcome to the full empirical rollout distribution, and adds an exploration bonus and a distribution pruning mechanism to encourage exploration of non-majority rollouts and to denoise rewards. Result: On reasoning benchmarks, DARE achieves relative improvements over baselines of 25.3% on AIME 2024 and 5.3% on AMC, and improves optimization stability. Conclusion: Reward estimation based on the full rollout distribution, rather than a single majority outcome, is more robust and more informative, giving TTRL a more reliable unsupervised learning signal. Abstract: Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV reduces the rollout distribution into a single outcome, discarding information about non-majority but correct action candidates, and yields systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a distribution pruning mechanism for non-majority rollout exploration and reward denoising, yielding a more informative and robust reward estimation. Extensive experiments on challenging reasoning benchmarks show that DARE improves optimization stability and final performance over recent baselines, achieving relative improvements of 25.3% on challenging AIME 2024 and 5.3% on AMC.
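
The contrast the abstract draws between a majority-vote reward and a distribution-based reward can be illustrated with a small sketch. The abstract does not give DARE's actual formula, so the function names and the choice of empirical answer frequency as the soft reward are assumptions made purely for illustration; the exploration bonus and pruning step are omitted.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Deterministic MV reward: 1.0 if a rollout matches the majority answer, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def distribution_rewards(answers):
    """Assumed soft reward: each rollout is scored by its answer's empirical frequency."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

# Six hypothetical rollouts for one unlabeled prompt.
rollouts = ["42", "42", "42", "17", "17", "9"]
mv = majority_vote_rewards(rollouts)
dist = distribution_rewards(rollouts)
```

Under majority voting the two rollouts answering "17" receive zero reward, whereas the frequency-based variant still assigns them a non-zero signal, which is the kind of information the abstract says MV discards.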

[53] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

Aadi Palnitkar,Mingyang Mao,Nicholas Waytowich,Vinicius G. Goecks,Tinoosh Mohsenin,Xiaomin Lin

Main category: cs.CL

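TL;DR: This paper introduces MilSCORE, which the authors describe as the first long-context, multimodal benchmark dataset designed for complex military planning scenarios, intended to evaluate large language models on high-stakes decision-making and planning tasks.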

Details

Motivation: There is a lack of benchmarks that realistically reflect the need to integrate long-context, heterogeneous multi-source information (such as maps, orders, and intelligence reports); such a test tool is especially needed for high-stakes geospatial planning tasks like military planning. Method: The authors build the expert-authored MilSCORE dataset, containing multi-hop questions grounded in a simulated military scenario and covering seven question categories (factual recall, constraint reasoning, strategic analysis, spatial analysis, and others), and propose an accompanying evaluation protocol with baseline tests on several vision-language models. Result: Existing models perform poorly on MilSCORE, leaving substantial headroom and confirming that the benchmark is both challenging and useful. Conclusion: MilSCORE fills a gap in long-context, multimodal, scenario-level planning evaluation and provides an important testbed for improving large models on high-stakes, geospatially intensive tasks. Abstract: As large language models (LLMs) are applied to increasingly long and complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs' ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.

[54] Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Xiang Li,Ning Yan,Masood Mortazavi

Main category: cs.CL

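TL;DR: This paper proposes the GiG framework, which uses a Graph-in-Graph architecture and a bounded lookahead planning module to improve the long-horizon planning of embodied agents, substantially outperforming existing methods on several benchmarks.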

Details

Motivation: Long-horizon planning with large language models as embodied agents still faces fundamental challenges, including poor strategy coherence and violations of environment constraints. Method: GiG uses a graph neural network to encode environment states and build execution trace graphs, clusters the graph embeddings to retrieve structure-aware priors, and adds a bounded lookahead planning module based on symbolic transition logic. Result: On three benchmarks (Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld), Pass@1 improves by up to 22%, 37%, and 15% respectively, at comparable or lower computational cost. Conclusion: By combining structured memory with symbolic logic, GiG improves the robustness and accuracy of long-horizon planning for embodied agents in dynamic environments. Abstract: While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intent into actionable sub-goals while strictly adhering to the logic of a dynamic, observed environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitations or hallucinate transitions that violate constraints. We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within an experience memory bank. By clustering these graph embeddings, the framework enables retrieval of structure-aware priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a novel bounded lookahead module that leverages symbolic transition logic to enhance the agents' planning capabilities through grounded action projection. We evaluate our framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld with comparable or lower computational cost.

[55] Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text

Hongyi Zhou,Jin Zhu,Erhan Xu,Kai Ye,Ying Yang,Chengchun Shi

Main category: cs.CL

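TL;DR: This paper proposes a new rewrite-based algorithm for detecting LLM-generated text that adaptively learns the distance between the original and rewritten text; it outperforms fixed-distance methods both theoretically and empirically, with markedly better detection of text generated by models such as GPT, Claude, and Gemini.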

Details

Motivation: Large language models produce highly human-like text, raising concerns about misinformation and academic integrity, so reliable detection algorithms are urgently needed. Method: The authors present a geometric view of rewrite-based detection and design a new detection algorithm that adaptively learns the distance between the original and rewritten text. Result: Across more than 100 experimental settings, the method outperforms baseline algorithms in the majority of scenarios; for target models such as GPT, Claude, and Gemini, it improves on the strongest baseline by 57.8% to 80.6% in relative terms. Conclusion: An adaptively learned distance function is more effective than a fixed distance, and the proposed rewrite-based detector generalizes better and is more practical. Abstract: Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, creating an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 57.8% to 80.6% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini).
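
For orientation, the fixed-distance baseline that the abstract contrasts with a learned distance might look like the sketch below. It is not the paper's algorithm: the character-level similarity from Python's `difflib`, the threshold value, and the premise that LLM-generated text changes less under rewriting are all assumptions, and `rewrite` is a stand-in for a call to an LLM rewriter.

```python
from difflib import SequenceMatcher

def fixed_distance(original: str, rewritten: str) -> float:
    """A fixed (non-learned) distance: 1 minus the character-level similarity ratio."""
    return 1.0 - SequenceMatcher(None, original, rewritten).ratio()

def flag_as_llm_generated(text, rewrite, threshold=0.2):
    """Flag text whose rewrite stays close to the original.

    `rewrite` is a placeholder for an LLM rewriting call; `threshold` is arbitrary.
    """
    return fixed_distance(text, rewrite(text)) < threshold

# Toy rewriters for demonstration only.
def identity_rewrite(text):
    return text          # leaves the text untouched (distance 0.0)

def scramble_rewrite(text):
    return "#" * 8       # shares no characters with ordinary prose
```

The paper's contribution, per the abstract, is to replace a hand-picked function like `fixed_distance` with one that is learned adaptively.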

[56] SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching

Hong Chen,Xiang Liu,Bo Wang,Yuxuan Fan,Yuanlin Chu,Zongluo Li,Xiaowen Chu,Xuming Hu

Main category: cs.CL

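TL;DR: This paper proposes the SONIC framework, which learns to generate semantically rich Nexus tokens that compress the KV cache in multi-turn dialogue, maintaining dialogue coherence at high compression ratios while improving inference efficiency.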

Details

Motivation: Existing KV cache compression methods overlook the structural properties of multi-turn dialogue and rely on heuristic eviction, which risks losing critical context and adapts poorly to different memory constraints. Method: SONIC is a learning-based framework that compresses historical segments into compact, semantically rich Nexus tokens, with dynamic budget training to support flexible memory adaptation without retraining. Result: At compression ratios of 80% and 50%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four multi-turn benchmarks; on MTBench101 the average score improves by 35.55%, and inference is accelerated by 50.1%. Conclusion: SONIC alleviates the bottleneck of linear KV cache growth in multi-turn LLM deployment, substantially improving deployment efficiency while preserving dialogue quality. Abstract: The linear growth of Key-Value (KV) cache remains a bottleneck for multi-turn LLM deployment. Existing KV cache compression methods often fail to account for the structural properties of multi-turn dialogues, relying on heuristic eviction that risks losing critical context. We propose SONIC, a learning-based framework that compresses historical segments into compact and semantically rich Nexus tokens. By integrating dynamic budget training, SONIC allows flexible adaptation to varying memory constraints without retraining. Experiments show that at compression ratios of 80% and 50%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. Specifically, on the widely used MTBench101 benchmark, SONIC achieves an average score improvement of 35.55% over state-of-the-art baselines, validating its effectiveness in sustaining coherent multi-turn dialogues. Furthermore, SONIC enhances deployment efficiency, accelerating the overall inference process by 50.1% compared to full-context generation.

[57] From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes

Fariba Afrin Irany

Main category: cs.CL

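TL;DR: This paper proposes a GPT-based architecture for clinical text classification that uses selective fine-tuning (training only the final Transformer block, the final layer normalization, and a lightweight classification head) to cope with long text, scarce labels, class imbalance, and high compute cost in EHRs, and validates its efficiency and robustness on MIMIC-IV-Note radiology reports.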

Details

Motivation: The large volume of unstructured text in clinical electronic health records (EHRs) creates opportunities for tasks such as disease characterization, but also brings challenges: little labeled data, severe class imbalance, and the high computational cost of adapting large models. Method: GPT-2 is used as the base model with most parameters frozen; only the final Transformer block, the final layer normalization, and a lightweight classification head are fine-tuned, giving an efficient adaptation. Result: On the MIMIC-IV-Note radiology report dataset, the method is stable and performs strongly on multi-label classification, uncertainty-aware binary classification, and disease outcome prediction, with a particular advantage in settings dominated by non-mention and negated findings. Conclusion: Selectively fine-tuning a pretrained generative language model is an efficient, effective, and scalable solution for clinical text classification, significantly reducing computational complexity while retaining strong representational capacity. Abstract: The increasing availability of unstructured clinical narratives in electronic health records (EHRs) has created new opportunities for automated disease characterization, cohort identification, and clinical decision support. However, modeling long, domain-specific clinical text remains challenging due to limited labeled data, severe class imbalance, and the high computational cost of adapting large pretrained language models. This study presents a GPT-based architecture for clinical text classification that adapts a pretrained decoder-only Transformer using a selective fine-tuning strategy. Rather than updating all model parameters, the majority of the GPT-2 backbone is frozen, and training is restricted to the final Transformer block, the final layer normalization, and a lightweight classification head. This approach substantially reduces the number of trainable parameters while preserving the representational capacity required to model complex clinical language. The proposed method is evaluated on radiology reports from the MIMIC-IV-Note dataset using uncertainty-aware CheXpert-style labels derived directly from report text. Experiments cover multiple problem formulations, including multi-label classification of radiographic findings, binary per-label classification under different uncertainty assumptions, and aggregate disease outcome prediction. Across varying dataset sizes, the model exhibits stable convergence behavior and strong classification performance, particularly in settings dominated by non-mention and negated findings. Overall, the results indicate that selective fine-tuning of pretrained generative language models provides an efficient and effective pathway for clinical text classification, enabling scalable adaptation to real-world EHR data while significantly reducing computational complexity.
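
The selection rule described in the abstract (train only the final Transformer block, the final layer normalization, and the classification head) can be expressed as a filter over parameter names. This is a minimal sketch, not the author's code: the parameter names below are assumed to follow the Hugging Face GPT-2 convention (`transformer.h.<i>.`, `transformer.ln_f.`, `score.`), and a 12-block model is assumed.

```python
def is_trainable(name: str, last_block: int) -> bool:
    """True for parameters in the final Transformer block, final LayerNorm, or head."""
    return (
        name.startswith(f"transformer.h.{last_block}.")
        or name.startswith("transformer.ln_f.")
        or name.startswith("score.")
    )

# A hand-written sample of parameter names (assumed naming scheme).
param_names = [
    "transformer.wte.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.1.mlp.c_fc.weight",
    "transformer.h.11.attn.c_attn.weight",
    "transformer.h.11.mlp.c_fc.weight",
    "transformer.ln_f.weight",
    "score.weight",
]

trainable = [n for n in param_names if is_trainable(n, last_block=11)]
frozen = [n for n in param_names if not is_trainable(n, last_block=11)]
```

With a PyTorch model, the same predicate could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name, last_block)`. The trailing dot in each prefix matters: it keeps block 1 from being mistaken for block 11.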

[58] OVD: On-policy Verbal Distillation

Jing Xiong,Hui Shen,Shansan Gong,Yuxin Cheng,Jianghan Shen,Chaofan Tao,Haochen Tan,Haoli Bai,Lifeng Shang,Ngai Wong

Main category: cs.CL

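TL;DR: This paper proposes On-policy Verbal Distillation (OVD), a knowledge distillation framework that needs no token-level alignment and instead matches trajectories using discrete verbal scores (0–9); it substantially reduces memory consumption, improves the student model's exploration and training efficiency, and clearly outperforms existing methods on Web question answering and mathematical reasoning.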

Details

Motivation: Existing on-policy token-level distillation methods are constrained by token-level alignment, which suppresses student exploration, makes it hard to use environment feedback, and causes severe memory bottlenecks in reinforcement learning. Method: OVD replaces token-level probability matching with discrete verbal scores (0–9) produced by the teacher model, achieving trajectory-level matching; freed from token alignment, it supports on-policy distillation with interactive feedback. Result: On Web question answering and mathematical reasoning, OVD achieves absolute improvements over baselines of up to +12.9% EM (Web Q&A) and up to +25.7% (math benchmarks), the latter when trained with only one random sample, with better training efficiency. Conclusion: Using discrete verbal scores, OVD delivers efficient, flexible, low-memory on-policy distillation and offers a new approach to transferring capabilities between large models. Abstract: Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model's exploration ability, prevents effective use of interactive environment feedback, and suffers from severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0–9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to +12.9% absolute improvement in average EM on Web Q&A tasks and up to +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io

[59] Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding

Yifan Zhu,Huiqiang Rong,Haoran Luo

Main category: cs.CL

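TL;DR: This paper proposes Token-Guard, a token-level hallucination control method based on self-checking decoding; through internal verification at each reasoning step, explicit hallucination risk scoring in a latent space, and iterative pruning and regeneration, it reduces hallucinations in large language models (LLMs) without large-scale fine-tuning or retrieval.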

Details

Motivation: Large language models (LLMs) often hallucinate content inconsistent with the input; existing mitigations such as RAG and RLHF are resource-intensive, while decoding-based methods lack an explicit hallucination control mechanism. Method: Token-Guard performs self-checking verification at each decoding step, scores candidate fragments for hallucination risk explicitly in a latent space, and combines iterative pruning with regeneration to correct hallucinated tokens dynamically. Result: Experiments on HALU datasets show that Token-Guard substantially lowers the hallucination rate and improves generation accuracy, while remaining scalable and modular. Conclusion: Token-Guard offers a lightweight, efficient approach to token-level hallucination control that needs no additional training or retrieval, making LLM outputs more reliable. Abstract: Large Language Models (LLMs) often hallucinate, generating content inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) can mitigate hallucinations but require resource-intensive retrieval or large-scale fine-tuning. Decoding-based methods are lighter yet lack explicit hallucination control. To address this, we present Token-Guard, a token-level hallucination control method based on self-checking decoding. Token-Guard performs internal verification at each reasoning step to detect hallucinated tokens before they propagate. Candidate fragments are further evaluated in a latent space with explicit hallucination risk scoring, while iterative pruning and regeneration dynamically correct detected errors. Experiments on HALU datasets show Token-Guard substantially reduces hallucinations and improves generation accuracy, offering a scalable, modular solution for reliable LLM outputs. Our code is publicly available.

[60] Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Jianhui Chen,Yuzhang Luo,Liangming Pan

Main category: cs.CL

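TL;DR: This paper proposes the Mechanistic Data Attribution (MDA) framework, which uses influence functions to trace interpretable circuits in LLMs back to their training data. Experiments show that certain structured data (such as LaTeX and XML) catalyzes the formation of interpretable heads, that intervening on the training data of induction heads changes the model's in-context learning ability at the same time, and that a proposed data augmentation method accelerates circuit convergence.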

Details

Motivation: Although mechanistic interpretability has identified interpretable circuits in large language models, their causal origins in training data remain unclear. Method: The MDA framework uses influence functions to trace interpretable units back to specific training samples, with extensive experiments on the Pythia model family in which high-influence samples are removed or augmented as causal interventions. Result: Repetitive structural data (such as LaTeX and XML) is shown to act as a mechanistic catalyst for the formation of interpretable heads; intervening on data related to induction heads changes the model's in-context learning ability at the same time; the proposed data augmentation pipeline consistently accelerates circuit convergence across model scales. Conclusion: MDA provides a scalable causal analysis tool for understanding the training origins of LLM internals, reveals a deep link between data structure and model function, and offers a new way to steer how large models develop. Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

[61] When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications

Daniel Commey

Main category: cs.CL

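TL;DR: This paper proposes an evaluation-driven development workflow for large language model (LLM) applications (Define-Test-Diagnose-Fix) and a tiered Minimum Viable Evaluation Suite (MVES) covering three settings: general LLM applications, RAG, and agentic tool use. Locally reproducible experiments show that a generic prompt template can trade off behaviors, underlining the need for evaluation-based prompt iteration and claim calibration rather than reliance on a "universal" prompt template.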

Details

Motivation: Traditional software testing struggles with LLM application outputs, which are stochastic, high-dimensional, and highly sensitive to prompt and model changes; a systematic, repeatable, evaluation-driven engineering practice is needed. Method: The authors propose the Define-Test-Diagnose-Fix evaluation-driven workflow; build a Minimum Viable Evaluation Suite (MVES) covering three typical kinds of LLM application; synthesize evaluation methods including automated checks, human rubrics, and LLM-as-judge, and analyze their failure modes; and run controlled experiments locally with Ollama using Llama 3 8B and Qwen 2.5 7B. Result: When task-specific prompts were replaced with a generic "improved" prompt template, Llama 3's extraction pass rate on the structured suites fell from 100% to 90% and RAG compliance fell from 93.3% to 80%, while instruction-following improved, showing that prompt changes involve behavioral trade-offs. Conclusion: LLM application development should be evaluation-driven, using standardized suites such as MVES to support fast iteration and diagnosis; generic prompt templates should not be adopted blindly, and prompts must be designed and calibrated against the goals of the specific task. Abstract: Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow (Define, Test, Diagnose, Fix) that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic "improved" prompt template can trade off behaviors: on our small structured suites, extraction pass rate decreased from 100% to 90% and RAG compliance from 93.3% to 80% for Llama 3 when replacing task-specific prompts with generic rules, while instruction-following improved. These findings motivate evaluation-driven prompt iteration and careful claim calibration rather than universal prompt recipes. All test suites, harnesses, and results are included for reproducibility.
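
The "Test" step of such a loop boils down to comparing pass rates between prompt variants on a fixed suite. The sketch below shows one way to do that; it is not the paper's harness, and the 15-case suite with its per-case outcomes is hypothetical, chosen only because 14/15 and 12/15 happen to reproduce the reported 93.3% and 80% figures.

```python
def pass_rate(results):
    """Percentage of passing cases, rounded to one decimal place."""
    return round(100.0 * sum(results) / len(results), 1)

def compare(baseline, candidate):
    """Return (baseline %, candidate %, change in percentage points)."""
    b, c = pass_rate(baseline), pass_rate(candidate)
    return b, c, round(c - b, 1)

# Hypothetical per-case outcomes (True = pass) on an assumed 15-case suite.
task_specific_prompt = [True] * 14 + [False] * 1
generic_prompt = [True] * 12 + [False] * 3

report = compare(task_specific_prompt, generic_prompt)
```

On suites this small, a single failing case moves the rate by several points, which is one reason the abstract stresses careful claim calibration.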

[62] Causal Autoregressive Diffusion Language Model

Junhao Ruan,Bei Li,Yongjing Yin,Pengcheng Huang,Xin Chen,Jingang Wang,Xunliang Cai,Tong Xiao,JingBo Zhu

Main category: cs.CL

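TL;DR: This paper proposes the Causal Autoregressive Diffusion (CARD) framework, which combines the training efficiency of autoregressive models with the high-throughput inference of diffusion models. A causal attention mask gives dense supervision in a single forward pass; soft-tailed masking and a signal-to-noise-driven reweighting mechanism improve stability; and confidence-based dynamic parallel decoding is supported. Experiments show that CARD outperforms discrete diffusion baselines, reduces training latency by 3x, and offers both ARM-level data efficiency and the low latency of parallel generation.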

Details

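Motivation: Autoregressive models (ARMs) train efficiently but have low inference throughput, whereas diffusion models offer high inference throughput but are unstable and inefficient to train; the goal is an approach that delivers both training efficiency and inference throughput. Method: The CARD framework: 1) reformulates the diffusion process under a strictly causal attention mask, giving per-token supervision in a single forward pass; 2) uses soft-tailed masking to preserve local context; 3) builds a context-aware reweighting mechanism from signal-to-noise principles; 4) uses KV caching to support dynamic parallel decoding, generating variable-length sequences based on confidence. Result: CARD outperforms existing discrete diffusion models on several benchmarks; training latency is 3x lower than block diffusion methods; it reaches ARM-level data efficiency while gaining the low latency of parallel generation. Conclusion: CARD unifies the strengths of autoregressive and diffusion modeling, offering a robust and practical approach for the next generation of efficient large language models. Abstract: In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of ARMs with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3$\times$ compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.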
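
One plausible reading of "adaptively generate variable-length token sequences based on confidence" is a thresholded prefix-acceptance rule: tokens proposed in parallel are accepted left to right until confidence drops. The sketch below illustrates only that reading; the rule, the threshold, and the function name are assumptions, and CARD's actual decoding procedure is not specified in the abstract.

```python
def accept_confident_prefix(proposals, threshold=0.9):
    """Accept proposed tokens left to right until confidence falls below threshold.

    `proposals` is a list of (token, confidence) pairs for consecutive positions.
    The first token is always accepted so that decoding makes progress.
    """
    accepted = []
    for token, confidence in proposals:
        if accepted and confidence < threshold:
            break
        accepted.append(token)
    return accepted

# Hypothetical parallel proposals for the next four positions.
step = [("The", 0.99), ("cat", 0.95), ("sat", 0.60), ("on", 0.97)]
emitted = accept_confident_prefix(step)
```

Here two tokens are emitted in one step; the low-confidence third position and everything after it would be re-proposed in the next step, so the number of tokens per step varies with the model's confidence.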

[63] Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models

Longxuan Yu,Yu Fu,Shaorong Zhang,Hui Liu,Mukund Varma T,Greg Ver Steeg,Yue Dong

Main category: cs.CL

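TL;DR: This paper argues that masked diffusion language models (MDLMs) can address the "premature commitment" problem that autoregressive (AR) models face when output order conflicts with reasoning order, showing robustness to changes in output order (order robustness), and validates this property and its mechanism on several mathematical reasoning benchmarks.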

Details

Motivation: Autoregressive language models enforce a left-to-right generation order, which causes premature commitment in settings where the answer must be output before the explanation and limits their reasoning ability; more flexible generation schemes are worth exploring. Method: The authors study masked diffusion language models (MDLMs), which refine all tokens in parallel over iterations and so decouple computation order from output structure; they build a new benchmark, ReasonOrderQA, with controlled difficulty and order-level evaluation, and analyze when different token types (reasoning steps vs. final answers) stabilize in MDLMs. Result: On GSM8K, Math500, and ReasonOrderQA, when prompts ask for the answer before the reasoning, AR accuracy drops by up to 67% in relative terms while MDLMs drop by at most 14%; the evidence indicates that MDLMs stabilize simpler tokens (such as reasoning steps) first and more complex tokens (such as answers) later, which yields order robustness; boundary conditions where the advantage fails are also identified. Conclusion: By stabilizing tokens of different complexity in stages, MDLMs are robust to changes in output order, offering a way past a structural limitation of AR models, although the benefit has limits. Abstract: Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to answers before generating intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable ($\leq$14% relative drop), a property we term "order robustness". Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), enabling reasoning tokens to stabilize before answer commitment. Finally, we identify failure conditions where this advantage weakens, outlining the limits required for order robustness.
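
The "relative drop" figures quoted above compare accuracy under answer-first prompting with accuracy under standard chain-of-thought ordering. A short sketch of that arithmetic follows; the accuracy values are hypothetical and chosen only to show how a relative drop differs from an absolute one.

```python
def relative_drop(acc_reasoning_first: float, acc_answer_first: float) -> float:
    """Relative accuracy drop when the answer must precede the reasoning."""
    if acc_reasoning_first <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (acc_reasoning_first - acc_answer_first) / acc_reasoning_first

# Hypothetical accuracies, not taken from the paper.
ar_drop = relative_drop(0.60, 0.20)    # 40 points absolute, about 67% relative
mdlm_drop = relative_drop(0.50, 0.45)  # 5 points absolute, 10% relative
```

Because the drop is normalized by the baseline, a model with a lower starting accuracy can show a large relative drop from a modest absolute change.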

[64] A Separable Architecture for Continuous Token Representation in Language Models

Reza T. Batley,Sourav Saha

Main category: cs.CL

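TL;DR: This paper proposes the Leviathan architecture, which replaces the conventional discrete lookup table with a continuous embedding generator, markedly improving parameter efficiency in small language models and outperforming LLaMA-style models at the same parameter count.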

Details

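Motivation: In small language models (SLMs), embedding matrices take up most of the parameter budget, an allocation that is both inefficient and counterintuitive; the abstraction in existing Transformer scaling laws that treats parameters as interchangeable does not hold in this regime. Method: Leviathan replaces the conventional discrete embedding lookup table with a continuous embedding generator; isoparametric comparisons are run on the Pile dataset, and effective parameter capacity is assessed with an empirical power-law fit. Result: Under isoparametric settings Leviathan consistently outperforms a standard LLaMA-style architecture, with an effective parameter capacity equivalent to a dense model with 1.47 to 2.11 times more parameters. Conclusion: Embedding-layer design has a large effect on small language model performance; a continuous embedding generator uses parameters more efficiently, challenging the conventional assumption that parameters are interchangeable. Abstract: Transformer scaling law analyses typically treat parameters as interchangeable, an abstraction that accurately predicts loss-compute relationships. Yet, in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture with a continuous embedding generator to replace the discrete lookup tables of canonical models. Evaluating on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard, LLaMA-style architecture. By means of an empirical power-law fit, Leviathan exhibits a markedly superior effective parameter capacity. Across the regime studied, Leviathan behaves as a dense model with $1.47$ to $2.11 \times$ more parameters.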

[65] On the Paradoxical Interference between Instruction-Following and Task Solving

Yunjia Qi,Hao Peng,Xintong Shi,Amy Xin,Xiaozhi Wang,Bin Xu,Lei Hou,Juanzi Li

Main category: cs.CL

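TL;DR: This paper reveals a counterintuitive phenomenon in which instruction following can unexpectedly interfere with the task-solving ability of large language models (LLMs), and proposes the SUSTAINSCORE metric to quantify it. Experiments show that adding self-evident constraints substantially lowers performance on mathematics, multi-hop QA, and code generation, and attention analysis shows that failed cases focus more on the constraints; the study also takes a first look at how different post-training paradigms affect the interference.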

Details

Motivation: Instruction-following methods aim to align LLMs with human intent, but the authors observe that they can instead harm the model's own task-solving ability; this tension has lacked systematic quantification and mechanistic analysis. Method: The SUSTAINSCORE metric measures how much task performance drops after inserting into the original instruction a "self-evident constraint" that is extracted from a successful output and would naturally have been satisfied; this is combined with multi-task experiments, attention analysis, and a comparison of post-training strategies. Result: On mathematics, multi-hop QA, and code generation, inserting self-evident constraints causes substantial performance drops in advanced LLMs including Claude-Sonnet-4.5; the interference generalizes across constraint types and model scales; failed samples allocate more attention to the constraint portion; different alignment strategies show different degrees of interference. Conclusion: Instruction following is not always beneficial and can introduce a burden that conflicts with task solving; SUSTAINSCORE offers a new lens on the robustness of alignment methods; future alignment work needs to balance instruction compliance against task-intrinsic consistency. Abstract: Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed. However, we reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with LLMs' task-solving capability. We propose a metric, SUSTAINSCORE, to quantify the interference of instruction following with task solving. It measures task performance drop after inserting into the instruction a self-evident constraint, which is naturally met by the original successful model output and extracted from it. Experiments on current LLMs in mathematics, multi-hop QA, and code generation show that adding the self-evident constraints leads to substantial performance drops, even for advanced models such as Claude-Sonnet-4.5. We validate the generality of the interference across constraint types and scales. Furthermore, we identify common failure patterns, and by investigating the mechanisms of interference, we observe that failed cases allocate significantly more attention to constraints compared to successful ones. Finally, we use SUSTAINSCORE to conduct an initial investigation into how distinct post-training paradigms affect the interference, presenting empirical observations on current alignment strategies. We will release our code and data to facilitate further research.
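
To make the measurement concrete, the sketch below computes one possible form of such a score: among cases the model solved without the constraint, the fraction it still solves once the self-evident constraint is added. The abstract does not give the exact SUSTAINSCORE formula, so this ratio, the function name, and the example outcomes are assumptions for illustration.

```python
def sustained_fraction(before, after):
    """Among cases solved without the constraint, the fraction still solved with it.

    `before[i]` and `after[i]` are booleans for the same task instance,
    without and with the inserted self-evident constraint.
    """
    solved = [i for i, ok in enumerate(before) if ok]
    if not solved:
        raise ValueError("no originally successful cases to evaluate")
    return sum(after[i] for i in solved) / len(solved)

# Hypothetical outcomes for five task instances.
without_constraint = [True, True, True, False, True]
with_constraint = [True, False, True, True, False]

score = sustained_fraction(without_constraint, with_constraint)
```

Restricting the denominator to originally successful cases matches the abstract's setup, where the constraint is extracted from a successful output and should therefore cost nothing to satisfy.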

[66] MasalBench: A Benchmark for Contextual and Cross-Cultural Understanding of Persian Proverbs in LLMs

Ghazal Kalhor,Behnam Bahrak

Main category: cs.CL

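TL;DR: This paper proposes the MasalBench benchmark for evaluating multilingual large models' contextual and cross-cultural understanding of Persian proverbs; models do well at identifying Persian proverbs, but performance drops markedly when mapping them to equivalent English proverbs, exposing limits in cultural knowledge and analogical reasoning.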

Details

Motivation: Prior work mostly examines LLM understanding of figurative language in high-resource languages, while cross-cultural proverb understanding in low-resource languages such as Persian has not been systematically evaluated. Method: The authors build MasalBench, a comprehensive benchmark of LLMs' contextual and cross-cultural understanding of Persian proverbs, and test eight state-of-the-art LLMs on tasks including proverb identification and mapping Persian proverbs to English equivalents. Result: Models exceed 0.90 accuracy on Persian proverb identification, but the best model reaches only 0.79 accuracy when identifying the corresponding English proverb. Conclusion: Current LLMs have clear weaknesses in modeling cultural knowledge and in cross-lingual analogical reasoning for low-resource languages; MasalBench provides an extensible framework for evaluating cross-cultural understanding in other low-resource languages. Abstract: In recent years, multilingual Large Language Models (LLMs) have become an inseparable part of daily life, making it crucial for them to master the rules of conversational language in order to communicate effectively with users. While previous work has evaluated LLMs' understanding of figurative language in high-resource languages, their performance in low-resource languages remains underexplored. In this paper, we introduce MasalBench, a comprehensive benchmark for assessing LLMs' contextual and cross-cultural understanding of Persian proverbs, which are a key component of conversation in this low-resource language. We evaluate eight state-of-the-art LLMs on MasalBench and find that they perform well in identifying Persian proverbs in context, achieving accuracies above 0.90. However, their performance drops considerably when tasked with identifying equivalent English proverbs, with the best model achieving 0.79 accuracy. Our findings highlight the limitations of current LLMs in cultural knowledge and analogical reasoning, and they provide a framework for assessing cross-cultural understanding in other low-resource languages. MasalBench is available at https://github.com/kalhorghazal/MasalBench.

[67] $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA

Yaxin Du,Junru Song,Yifan Zhou,Cheng Wang,Jiahao Gu,Zimeng Chen,Menglan Chen,Wen Yao,Yang Yang,Ying Wen,Siheng Chen

Main category: cs.CL

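TL;DR: This paper proposes G²-Reader, a dual-graph system that uses a Content Graph to preserve document-native structure and cross-modal semantics and a Planning Graph to track sub-questions and intermediate findings, improving accuracy on multimodal question answering over long documents.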

Details

Motivation: Existing retrieval-augmented generation methods have two main problems in multimodal long-document QA: flat chunking breaks document structure and cross-modal alignment, and iterative retrieval tends to loop locally or drift into irrelevant regions because it lacks a global search state. Method: G²-Reader is a dual-graph system: 1) a Content Graph models document-native structure and cross-modal semantics; 2) a Planning Graph, a directed acyclic graph, dynamically generates and tracks sub-questions to guide stepwise evidence collection. Result: On VisDoMBench across five multimodal domains, G²-Reader (based on Qwen3-VL-32B-Instruct) reaches 66.21% average accuracy, clearly ahead of strong baselines and a standalone GPT-5 (53.08%). Conclusion: Modeling with two cooperating graphs (structure preservation plus goal-directed planning) mitigates semantic fragmentation and retrieval drift in multimodal long-document QA, offering a new approach to complex document understanding. Abstract: Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).
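
A Planning Graph of the kind described, a directed acyclic graph of sub-questions, implies that a sub-question is only worked on once the sub-questions it depends on have been resolved. The sketch below shows that ordering constraint with Python's standard `graphlib`; the sub-questions, the dictionary layout, and the `findings` store are hypothetical and do not reflect the paper's data structures.

```python
from graphlib import TopologicalSorter

# Hypothetical sub-questions; each maps to the set of sub-questions it depends on.
Q1 = "q1: which table reports revenue?"
Q2 = "q2: which figure shows growth?"
Q3 = "q3: does the figure agree with the table?"
planning_graph = {Q1: set(), Q2: set(), Q3: {Q1, Q2}}

def resolution_order(graph):
    """Return sub-questions in an order that respects their dependencies."""
    return list(TopologicalSorter(graph).static_order())

# Track intermediate findings as each sub-question is visited.
findings = {}
for question in resolution_order(planning_graph):
    findings[question] = "pending evidence"
```

`static_order()` raises `graphlib.CycleError` if the dependencies contain a cycle, which is a cheap way to enforce the acyclic property; a persistent `findings` map plays the role of the global search state that the abstract says plain iterative retrieval lacks.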

[68] VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Yibo Wang,Yongcheng Jing,Shunyu Liu,Hao Guan,Rong-cheng Tu,Chengyu Wang,Jun Huang,Dacheng Tao

Main category: cs.CL

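TL;DR: The paper proposes VTC-R1, a paradigm that renders intermediate reasoning into images and feeds them to a vision-language model as "optical memory", achieving 3.4x token compression and a 2.7x end-to-end inference speedup while matching or improving reasoning performance.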

Details

Motivation: Long-context reasoning strengthens LLMs but creates a severe efficiency bottleneck; existing efficient methods rely on additional training or external compression models, which scales poorly and loses fine-grained information. Method: VTC-R1 renders textual reasoning segments into compact images and feeds them iteratively to a vision-language model (Glyph, Qwen3-VL) as "optical memory"; a training set is built from OpenR1-Math-220K and the models are fine-tuned on it. Result: VTC-R1 consistently outperforms standard long-context reasoning on MATH500, AIME25, AMC23, and GPQA-D, with 3.4x token compression and a 2.7x speedup in end-to-end latency. Conclusion: VTC-R1 is a scalable, efficient reasoning paradigm that retains information well and offers a practical route for reasoning-intensive applications. Abstract: Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to the computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as "optical memory." We construct a training dataset based on OpenR1-Math-220K achieving 3.4x token compression and fine-tune representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23 and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.

[69] ECO: Quantized Training without Full-Precision Master Weights

Mahdi Nikdan,Amir Zandieh,Dan Alistarh,Vahab Mirrokni

Main category: cs.CL

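TL;DR: The paper proposes ECO, a quantized optimizer that needs no high-precision master weights; an error-compensation mechanism preserves training accuracy while substantially reducing memory overhead, which is especially useful for Sparse Mixture of Experts (SMoE) models.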

Details

Motivation: Existing quantized LLM training methods still require a high-precision master-weight buffer, which adds significant memory overhead, particularly for Sparse Mixture of Experts (SMoE) models. Method: The Error-Compensating Optimizer (ECO) applies gradient updates directly to quantized parameters and injects the quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory; the authors prove that with a decaying learning rate it converges to a neighborhood of the optimum. Result: In FP8 pretraining (Transformers from 30M to 2.1B parameters, including a Sparse MoE) and INT4 fine-tuning (DeepSeek-MoE-16B), ECO reaches near-lossless accuracy relative to baselines with master weights and clearly improves the memory vs. loss Pareto frontier. Conclusion: ECO removes the dependence on master weights in quantized training, cutting memory use without sacrificing accuracy and offering a new approach to efficient training of large sparse models. Abstract: Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high-precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as "master weights". This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.
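
To see why dropping master weights naively can stall training, and how folding the rounding error into the momentum can help, consider the toy sketch below. It is a one-parameter illustration under stated assumptions: the uniform grid, the quadratic loss, and the correction `m -= err / (lr * beta)` are choices made for this example and are not the ECO update rule, whose exact form the abstract does not give.

```python
def quantize(w, step=0.1):
    """Round to the nearest point on a uniform grid (stand-in for FP8/INT4)."""
    return round(w / step) * step


def grad(w):
    """Gradient of the toy loss 0.5 * (w - 1)^2."""
    return w - 1.0


def train_naive(steps=5000, lr=1e-3, beta=0.9):
    """Momentum SGD that stores only the quantized weight."""
    w, m = 0.0, 0.0
    for _ in range(steps):
        m = beta * m + grad(w)
        w = quantize(w - lr * m)  # small updates are rounded away
    return w


def train_error_feedback(steps=5000, lr=1e-3, beta=0.9):
    """Same storage, but the rounding error is folded into the momentum."""
    w, m = 0.0, 0.0
    for _ in range(steps):
        m = beta * m + grad(w)
        target = w - lr * m        # update before rounding
        w = quantize(target)
        err = target - w           # what rounding discarded
        m -= err / (lr * beta)     # next step re-applies err via -lr * beta * m
    return w


w_naive = train_naive()
w_ef = train_error_feedback()
```

In the naive loop every update is smaller than half a grid step, so the weight never leaves its starting point. In the second loop the discarded error is carried by the momentum buffer that already exists, so no extra state is stored and the weight ends near the optimum at 1.0.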

[70] A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine

Anran Li,Yuanyuan Chen,Wenjun Long,Yu Yin,Yan Hu,Hyunjae Kim,Weipeng Zhou,Yujia Zhou,Hongyi Peng,Yang Ren,Xuguang Ai,Zhenyue Qin,Ming Hu,Xiaoxiao Li,Han Yu,Yih-Chung Tham,Lucila Ohno-Machado,Hua Xu,Qingyu Chen

Main category: cs.CL

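TL;DR: The paper proposes Fed-MedLoRA and its enhanced variant Fed-MedLoRA+, a model-agnostic, parameter-efficient federated learning framework for fine-tuning large language models (LLMs) efficiently and securely in medical settings; it addresses cross-institution data heterogeneity and high communication cost and is validated on clinical information extraction.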

Details

Motivation: Most medical LLMs are trained on data from a single institution, which limits generalizability and safety; conventional federated learning is hard to apply to LLMs with very large parameter counts and assumes homogeneous data, whereas real clinical data are highly heterogeneous. Method: Fed-MedLoRA uploads only low-rank adapter (LoRA) parameters to reduce communication and computation overhead; Fed-MedLoRA+ adds an adaptive, data-aware aggregation strategy to handle cross-site heterogeneity; the framework is applied to clinical information extraction. Result: The framework is evaluated on five patient cohorts, covering in-domain testing, external validation, and low-resource adaptation to a new site (using real clinical notes from the Yale New Haven Health System), and is reported to outperform BERT, LLaMA-3, DeepSeek-R1, and GPT-4o baselines. Conclusion: Fed-MedLoRA and Fed-MedLoRA+ offer a feasible path to privacy-preserving, efficient collaborative training of medical LLMs, improving practicality, robustness, and adaptability in cross-institution deployment. Abstract: Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. To enable their use in clinical settings, LLMs are typically further adapted through continued pretraining or post-training using clinical data. However, most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems. Federated learning (FL) is a promising solution for enabling collaborative model development across healthcare institutions. Yet applying FL to LLMs in medicine remains fundamentally limited. First, conventional FL requires transmitting the full model during each communication round, which becomes impractical for multi-billion-parameter LLMs given the limited computational resources. Second, many FL algorithms implicitly assume data homogeneity, whereas real-world clinical data are highly heterogeneous across patients, diseases, and institutional practices. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications. Fed-MedLoRA transmits only low-rank adapter parameters, reducing communication and computation overhead, while Fed-MedLoRA+ further incorporates adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity. We apply the framework to clinical information extraction (IE), which transforms patient narratives into structured medical entities and relations.
Accuracy was assessed across five patient cohorts through comparisons with BERT, LLaMA-3, DeepSeek-R1, and GPT-4o models. Evaluation settings included (1) in-domain training and testing, (2) external validation on independent cohorts, and (3) a low-resource new-site adaptation scenario using real-world clinical notes from the Yale New Haven Health System.
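
The communication saving from sending only low-rank adapters can be checked with simple arithmetic, and the server-side step can be sketched as a weighted average. The example below is a minimal sketch: the layer size, the rank, the per-site values, and the sample-count weighting (plain FedAvg-style) are assumptions for illustration, standing in for the data-aware aggregation of Fed-MedLoRA+, whose details the abstract does not give.

```python
def lora_param_count(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def full_param_count(d_out, d_in):
    """Parameters in the dense weight matrix the adapter is attached to."""
    return d_out * d_in


def weighted_average(adapters, weights):
    """Aggregate per-site adapter vectors with normalized weights."""
    total = sum(weights)
    size = len(adapters[0])
    return [
        sum(w * a[i] for a, w in zip(adapters, weights)) / total
        for i in range(size)
    ]


# Hypothetical layer size and rank, chosen only for the arithmetic.
full = full_param_count(4096, 4096)
lora = lora_param_count(4096, 4096, rank=8)
ratio = lora / full  # fraction of the layer that must be transmitted

# Three hypothetical sites with flattened adapter values and sample counts.
site_adapters = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
site_samples = [100, 300, 600]
aggregated = weighted_average(site_adapters, site_samples)
```

For this assumed layer the adapter is 1/256 of the dense matrix, which is the kind of reduction that makes per-round transmission practical for multi-billion-parameter models.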

[71] Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Xin Chen,Feng Jiang,Yiqian Zhang,Hardy Chen,Shuo Yan,Wenya Xie,Min Yang,Shujian Huang

Main category: cs.CL

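TL;DR: The paper proposes Proactive Interactive Reasoning (PIR), a reasoning paradigm in which the model asks the user clarifying questions during reasoning to resolve uncertainty about premises and intent, overcoming the "blind self-thinking" limitation of conventional Chain-of-Thought (CoT). PIR combines uncertainty-aware supervised fine-tuning with user-simulator-based policy optimization, and it improves performance while lowering computation on mathematical reasoning, code generation, and document editing.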

Details

Motivation: Reasoning LLMs built on Chain-of-Thought suffer from "blind self-thinking": they keep reasoning internally at length even when key information is missing or ambiguous, which is inefficient and error-prone. The paper targets premise-level and intent-level uncertainty rather than knowledge-level uncertainty alone. Method: Proactive Interactive Reasoning (PIR) has two core components: (1) uncertainty-aware supervised fine-tuning that gives the model interactive reasoning ability; (2) a user-simulator-based policy optimization framework with a composite reward that aligns the model with user intent. Result: On mathematical reasoning, code generation, and document editing, PIR improves over strong baselines by up to 32.70% accuracy, 22.90% pass rate, and 41.36 BLEU, while cutting reasoning computation and redundant interaction turns nearly in half; it also generalizes well and stays robust on factual knowledge, question answering, and missing-premise scenarios. Conclusion: PIR turns LLMs from passive solvers into interactive reasoners that proactively clarify uncertainty, offering a new paradigm for improving reliability, efficiency, and alignment with users. Abstract: Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR.
Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1

[72] FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Ajay Patel,Colin Raffel,Chris Callison-Burch

Main category: cs.CL

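TL;DR: The paper proposes FineInstructions, which uses internet-scale pre-training documents to generate billions of synthetic instruction-answer pairs so that a large language model can be pre-trained from scratch with the instruction-tuning objective alone, improving performance on downstream free-form generation tasks.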

Details

Motivation: Because supervised training data is limited, LLMs are usually pre-trained with large-scale self-supervision and then trained on a small amount of instruction-tuning data; the paper aims to overcome this scarcity and explore a pre-training regime closer to actual usage (responding to user instructions). Method: The authors convert the knowledge in pre-training documents into synthetic instruction-answer pairs: about 18 million instruction templates built from real user queries are matched to and instantiated with human-written documents from unstructured pre-training corpora, yielding the FineInstructions dataset; an LLM is then pre-trained from scratch on this synthetic data using only the instruction-tuning objective. Result: In token-for-token controlled comparisons, models pre-trained only on FineInstructions outperform standard pre-training and other synthetic pre-training methods on standard benchmarks of free-form response quality. Conclusion: Pre-training on large-scale synthetic instruction data alone is feasible and effective, aligns model capability more closely with real usage, and offers a way to reduce reliance on human-annotated data. Abstract: Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of "instruction-tuning" data comprised of supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at https://huggingface.co/fineinstructions .

[73] DynaWeb: Model-Based Reinforcement Learning of Web Agents

Hang Ding,Peidong Liu,Junqiao Wang,Ziwei Ji,Meng Cao,Rongzhao Zhang,Lynn Ai,Eric Yang,Tianyu Shi,Lei Yu

Main category: cs.CL

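TL;DR: The paper proposes DynaWeb, a model-based reinforcement learning framework that builds a web world model to simulate the web environment, letting web agents generate large numbers of policy rollouts in a synthetic environment and improving the efficiency and stability of online reinforcement learning.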

Details

Motivation: Training autonomous web agents by interacting with the live internet is inefficient, costly, and risky, so a safer and more efficient training method is needed. Method: DynaWeb includes a web world model that predicts naturalistic web page representations, and it mixes free policy rollouts with real expert trajectories during training to improve stability and sample efficiency. Result: On the WebArena and WebVoyager benchmarks, DynaWeb significantly improves the performance of current state-of-the-art open-source web agent models. Conclusion: The work demonstrates that web agents can be trained through "imagination" (simulation with a world model), offering a new route to scaling online agentic reinforcement learning. Abstract: The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.

[74] Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

Yingfa Chen,Zhen Leng Thai,Zihan Zhou,Zhu Zhang,Xingyu Shen,Shuo Wang,Chaojun Xiao,Xu Han,Zhiyuan Liu

Main category: cs.CL

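TL;DR: The paper proposes the HALO pipeline and the HypeNet hybrid architecture, which distill pre-trained Transformers such as Qwen3 into RNN-attention hybrid models with strong performance and long-context advantages using a small amount of data (2.3B tokens).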

Details

Motivation: Hybrid Transformer models are hard to adopt because they require large-scale pre-training from scratch; existing parameter-transfer and distillation methods need large amounts of data (more than 10B tokens) and yield poor long-context performance, so they fail to exploit the inference advantage of hybrid models. Method: The paper presents HALO (Hybrid Attention via Layer Optimization), a distillation pipeline, and HypeNet, a hybrid architecture that adds a new position encoding (HyPE) and several architectural changes to improve length generalization; HALO is used to convert the Qwen3 series into HypeNet. Result: The conversion uses only 2.3B tokens (less than 0.01% of the original pre-training data) and keeps performance comparable to the original Transformer models while improving long-context modeling and inference efficiency. Conclusion: HALO and HypeNet offer a feasible path to building strong long-context hybrid models efficiently, greatly reducing the dependence on pre-training data and making hybrid architectures more practical. Abstract: Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.

cs.CV

[75] MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading

Matteo Rossi

Main category: cs.CV

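TL;DR: The paper proposes the Multi-Attention Lipreading Network (MA-LipNet), which uses channel, joint spatial-temporal, and separate spatial-temporal attention to make visual features from lip-movement video more discriminative and generalizable, significantly lowering CER and WER on the CMLR and GRID datasets.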

Details

Motivation: Because articulatory gestures are subtle, existing lipreading methods produce visual features with weak discriminability and poor generalization; features need to be purified along the temporal, spatial, and channel dimensions. Method: MA-LipNet applies three modules in sequence, Channel Attention (CA), Joint Spatial-Temporal Attention (JSTA), and Separate Spatial-Temporal Attention (SSTA), which perform channel recalibration, coarse-grained spatio-temporal filtering, and fine-grained spatio-temporal modeling, respectively. Result: On the CMLR and GRID datasets, MA-LipNet significantly reduces Character Error Rate (CER) and Word Error Rate (WER) and outperforms several state-of-the-art methods. Conclusion: Multi-dimensional feature refinement is important for robust visual speech recognition, and MA-LipNet provides an effective, extensible architectural pattern for lipreading. Abstract: Lipreading, the technology of decoding spoken content from silent videos of lip movements, holds significant application value in fields such as public security. However, due to the subtle nature of articulatory gestures, existing lipreading methods often suffer from limited feature discriminability and poor generalization capabilities. To address these challenges, this paper delves into the purification of visual features from temporal, spatial, and channel dimensions. We propose a novel method named Multi-Attention Lipreading Network (MA-LipNet). The core of MA-LipNet lies in its sequential application of three dedicated attention modules. Firstly, a Channel Attention (CA) module is employed to adaptively recalibrate channel-wise features, thereby mitigating interference from less informative channels. Subsequently, two spatio-temporal attention modules with distinct granularities, Joint Spatial-Temporal Attention (JSTA) and Separate Spatial-Temporal Attention (SSTA), are leveraged to suppress the influence of irrelevant pixels and video frames. The JSTA module performs a coarse-grained filtering by computing a unified weight map across the spatio-temporal dimensions, while the SSTA module conducts a more fine-grained refinement by separately modeling temporal and spatial attentions. Extensive experiments conducted on the CMLR and GRID datasets demonstrate that MA-LipNet significantly reduces the Character Error Rate (CER) and Word Error Rate (WER), validating its effectiveness and superiority over several state-of-the-art methods. Our work highlights the importance of multi-dimensional feature refinement for robust visual speech recognition.

[76] Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs

Haochen Zhang,Animesh Sinha,Felix Juefei-Xu,Haoyu Ma,Kunpeng Li,Zhipeng Fan,Meng Dong,Xiaoliang Dai,Tingbo Hou,Peizhao Zhang,Zecheng He

Main category: cs.CV

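TL;DR: The paper proposes a framework for non-Markov multi-round conversational image generation; by constructing history data with rollback edits and name-based personalization, introducing history-aware training with caching, and improving image reconstruction and editable personalization, it substantially improves multi-round consistency and instruction following.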

Details

Motivation: Most conversational image generation methods make a Markov assumption (depending only on the latest image) and cannot handle real non-Markov interactions in which users refer back to, undo, or revisit earlier content across rounds. Method: The paper proposes non-Markov multi-round data construction strategies (such as rollback-style editing and name-based cross-round personalization), a history-conditioned training and inference framework with token-level caching, and improvements for high-fidelity reconstruction and editable personalization (such as a DiT detokenizer and multi-stage fine-tuning). Result: Multi-round consistency and instruction compliance improve substantially on non-Markov multi-round tasks, while strong single-round editing and personalization are maintained. Conclusion: Explicitly modeling and training for non-Markov interactions is essential for the long-range coherence and robustness of conversational image generation systems. Abstract: Conversational image generation requires a model to follow user instructions across multiple rounds of interaction, grounded in interleaved text and images that accumulate as chat history. While recent multimodal large language models (MLLMs) can generate and edit images, most existing multi-turn benchmarks and training recipes are effectively Markov: the next output depends primarily on the most recent image, enabling shortcut solutions that ignore long-range history. In this work we formalize and target the more challenging non-Markov setting, where a user may refer back to earlier states, undo changes, or reference entities introduced several rounds ago. We present (i) non-Markov multi-round data construction strategies, including rollback-style editing that forces retrieval of earlier visual states and name-based multi-round personalization that binds names to appearances across rounds; (ii) a history-conditioned training and inference framework with token-level caching to prevent multi-round identity drift; and (iii) enabling improvements for high-fidelity image reconstruction and editable personalization, including a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum. We demonstrate that explicitly training for non-Markov interactions yields substantial improvements in multi-round consistency and instruction compliance, while maintaining strong single-round editing and personalization.

[77] Text controllable PET denoising

Xuehua Ye,Hongxu Yang,Adam J. Schwarz

Main category: cs.CV

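TL;DR: The paper proposes a text-guided PET image denoising method that combines a pretrained CLIP model with a U-Net to denoise images across multiple count levels with a single model, improving image quality and potentially shortening acquisition time.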

Details

Motivation: PET images often suffer from complicated noise that affects diagnosis, and existing methods struggle to cover denoising needs across different count levels. Method: Text and image features from a pretrained CLIP model are fused into a U-Net denoising architecture to build a text-guided PET denoising model that works across count levels. Result: The model achieves significant improvements in both qualitative and quantitative evaluation and shows good generalization and flexibility. Conclusion: The text-guided denoising framework offers a new approach to improving PET image quality and has clinical potential, for example to support low-dose imaging or shorter scans. Abstract: Positron Emission Tomography (PET) imaging is a vital tool in medical diagnostics, offering detailed insights into molecular processes within the human body. However, PET images often suffer from complicated noise, which can obscure critical diagnostic information. The quality of the PET image is impacted by various factors including scanner hardware, image reconstruction, tracer properties, dose/count level, and acquisition time. In this study, we propose a novel text-guided denoising method capable of enhancing PET images across a wide range of count levels within a single model. The model combines features from a pretrained CLIP model with a U-Net-based denoising model. Experimental results demonstrate that the proposed model leads to significant improvements in both qualitative and quantitative assessments. The flexibility of the model shows its potential for addressing more complicated denoising demands or reducing the acquisition time.

[78] Low performing pixel correction in computed tomography with unrolled network and synthetic data training

Hongxu Yang,Levente Lippenszky,Edina Timko,Lehel Ferenczi,Gopal Avinash

Main category: cs.CV

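TL;DR: The paper proposes a dual-domain (sinogram domain and image domain) method trained on synthetic data to correct artifacts caused by low performing pixels (LPP) in CT detectors; it needs no real clinical data for training and clearly outperforms existing methods when 1-2% of detectors are simulated as defective.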

Details

Motivation: Existing LPP correction methods depend on expensive dedicated training data and operate in a single domain (image or sinogram), ignoring the cross-domain correlations induced by the CT forward projection. Method: The paper proposes an unrolled dual-domain network trained on synthetic data: paired sinogram and image data with LPP artifacts are generated from natural images, and correction in the two domains is modeled and optimized jointly. Result: In simulations with 1-2% defective detectors near the isocenter, the method outperforms state-of-the-art methods by a large margin; it needs no real clinical data for training and adapts to different CT scanner settings. Conclusion: Synthetic-data-driven joint dual-domain correction is an effective approach to LPP artifacts that balances performance, generalization, and deployment cost, and it suits software-based CT post-processing. Abstract: Low performance pixels (LPP) in Computed Tomography (CT) detectors lead to ring and streak artifacts in the reconstructed images, making them clinically unusable. In recent years, several solutions have been proposed to correct LPP artifacts, either in the image domain or in the sinogram domain using supervised deep learning methods. However, these methods require dedicated datasets for training, which are expensive to collect. Moreover, existing approaches focus solely on either image-space or sinogram-space correction, ignoring the intrinsic correlations from the forward operation of the CT geometry. In this work, we propose an unrolled dual-domain method based on synthetic data to correct LPP artifacts. Specifically, the intrinsic correlations of LPP between the sinogram and image domains are leveraged through synthetic data generated from natural images, enabling the trained model to correct artifacts without requiring any real-world clinical data. In experiments simulating 1-2% detector defects near the isocenter, the proposed method outperformed the state-of-the-art approaches by a large margin. The results indicate that our solution can correct LPP artifacts without the cost of data collection for model training, and it is adaptable to different scanner settings for software-based applications.
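
A defective detector channel corrupts the same column of the sinogram in every view, which is what produces ring artifacts after reconstruction. The sketch below shows one simple way to build a synthetic training pair under that observation. The function name, the zero-value defect model, the width of the central band, and the toy sinogram are all assumptions for illustration; the abstract does not describe the authors' actual defect simulation.

```python
import random


def simulate_lpp(sinogram, defect_fraction=0.02, central_band=0.2, seed=0):
    """Zero out a few detector channels near the center in every view.

    sinogram: list of views, each a list of detector readings.
    Returns the corrupted copy and the sorted list of defective channels.
    """
    n_det = len(sinogram[0])
    n_bad = max(1, round(defect_fraction * n_det))
    half_band = max(n_bad, int(central_band * n_det / 2))
    center = n_det // 2
    candidates = list(range(center - half_band, center + half_band))
    bad = sorted(random.Random(seed).sample(candidates, n_bad))
    corrupted = [
        [0.0 if ch in bad else value for ch, value in enumerate(view)]
        for view in sinogram
    ]
    return corrupted, bad


# Hypothetical clean sinogram: 8 views x 100 detector channels of ones.
clean = [[1.0] * 100 for _ in range(8)]
corrupted, bad_channels = simulate_lpp(clean)
```

The pair `(corrupted, clean)` can then serve as input and target for a correction model, which is how training without clinical data becomes possible in principle.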

[79] AI-based Prediction of Biochemical Recurrence from Biopsy and Prostatectomy Samples

Andrea Camilloni,Chiara Micoli,Nita Mulliqi,Erik Everett Palm,Thorgerdur Palsdottir,Kelvin Szolnoky,Xiaoyi Ji,Sol Erika Boman,Andrea Discacciati,Henrik Grönberg,Lars Egevad,Tobias Nordström,Kimmo Kartasalo,Martin Eklund

Main category: cs.CV

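TL;DR: The paper develops an AI model that uses diagnostic prostate biopsy slides to predict the risk of biochemical recurrence (BCR) after radical prostatectomy, validates its generalization in several external cohorts, and reports an incremental improvement over the conventional CAPRA-S score.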

Details

Motivation: Current tools for predicting biochemical recurrence (BCR) after radical prostatectomy are imprecise, and BCR is a surrogate marker for aggressive prostate cancer and adverse outcomes. Method: An AI model is trained on whole-slide images of diagnostic prostate biopsies from the STHLM3 cohort (n=676) using foundation models and attention-based multiple instance learning; generalization is assessed in three external prostatectomy cohorts (LEOPARD, CHIMERA, and TCGA-PRAD); clinical variables are integrated to build a multimodal prediction model. Result: The image-based model achieves 5-year time-dependent AUCs of 0.64, 0.70, and 0.70 in the three external cohorts; integrating clinical variables enables statistically significant risk stratification and an incremental improvement over CAPRA-S. Conclusion: Histopathology AI trained on biopsy slides can generalize across specimen types and support preoperative and postoperative decisions, but the added value of multimodal AI over simpler models needs careful evaluation in further studies. Abstract: Biochemical recurrence (BCR) after radical prostatectomy (RP) is a surrogate marker for aggressive prostate cancer with adverse outcomes, yet current prognostic tools remain imprecise. We trained an AI-based model on diagnostic prostate biopsy slides from the STHLM3 cohort (n = 676) to predict patient-specific risk of BCR, using foundation models and attention-based multiple instance learning. Generalizability was assessed across three external RP cohorts: LEOPARD (n = 508), CHIMERA (n = 95), and TCGA-PRAD (n = 379). The image-based approach achieved 5-year time-dependent AUCs of 0.64, 0.70, and 0.70, respectively. Integrating clinical variables added complementary prognostic value and enabled statistically significant risk stratification. Compared with guideline-based CAPRA-S, AI incrementally improved postoperative prognostication. These findings suggest biopsy-trained histopathology AI can generalize across specimen types to support preoperative and postoperative decision making, but the added value of AI-based multimodal approaches over simpler predictive models should be critically scrutinized in further studies.

[80] BadDet+: Robust Backdoor Attacks for Object Detection

Kealan Dunnett,Reza Arablouei,Dimity Miller,Volkan Dedeoglu,Raja Jurdak

Main category: cs.CV

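TL;DR: The paper proposes BadDet+, a framework that uses a log-barrier penalty to unify Region Misclassification Attacks (RMA) and Object Disappearance Attacks (ODA), improving the physical robustness and the position and scale invariance of backdoor attacks on object detection, and demonstrates better synthetic-to-physical transfer on real-world benchmarks.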

Details

Motivation: Existing detection backdoor attacks rely on unrealistic assumptions and lack physical validation, and the impact of backdoor attacks on object detection remains poorly understood. Method: BadDet+ is a penalty-based framework in which a log-barrier penalty suppresses true-class predictions for triggered inputs, unifying region misclassification and object disappearance attacks and providing position and scale invariance along with stronger physical robustness. Result: On real-world benchmarks, BadDet+ achieves better synthetic-to-physical transfer than existing RMA and ODA baselines while preserving performance on clean samples; theoretical analysis shows the penalty acts within a trigger-specific feature subspace, reliably inducing attacks without degrading standard inference. Conclusion: Object detection models have significant backdoor vulnerabilities, and specialized defenses are needed. Abstract: Backdoor attacks pose a severe threat to deep learning, yet their impact on object detection remains poorly understood compared to image classification. While attacks have been proposed, we identify critical weaknesses in existing detection-based methods, specifically their reliance on unrealistic assumptions and a lack of physical validation. To bridge this gap, we introduce BadDet+, a penalty-based framework that unifies Region Misclassification Attacks (RMA) and Object Disappearance Attacks (ODA). The core mechanism utilizes a log-barrier penalty to suppress true-class predictions for triggered inputs, resulting in (i) position and scale invariance, and (ii) enhanced physical robustness. On real-world benchmarks, BadDet+ achieves superior synthetic-to-physical transfer compared to existing RMA and ODA baselines while preserving clean performance. Theoretical analysis confirms the proposed penalty acts within a trigger-specific feature subspace, reliably inducing attacks without degrading standard inference. These results highlight significant vulnerabilities in object detection and the necessity for specialized defenses.
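
For readers unfamiliar with the term, a log-barrier penalty on a probability can be illustrated with a generic form such as -log(1 - p). The sketch below shows only the mathematical shape of such a term: it is near zero when the score is low and grows without bound as the score approaches 1. The function name, the `eps` clamp, and the example scores are assumptions; the abstract does not give the paper's exact loss.

```python
import math


def log_barrier_penalty(p_true, eps=1e-6):
    """-log(1 - p_true): near zero when the true class is suppressed,
    growing without bound as p_true approaches 1."""
    p = min(max(p_true, 0.0), 1.0 - eps)  # keep the log argument positive
    return -math.log(1.0 - p)


# Hypothetical true-class scores, used only to show the shape of the curve.
scores = [0.05, 0.5, 0.95]
penalties = [log_barrier_penalty(p) for p in scores]
```

The steep growth near 1 is what distinguishes a barrier from a linear penalty: confident true-class predictions are penalized far more heavily than uncertain ones.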

[81] Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization

Jiaqi Li,Guangming Wang,Shuntian Zheng,Minzhe Ni,Xiaoman Lu,Guanghui Ye,Yu Guan

Main category: cs.CV

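TL;DR: The paper proposes ActionVLM, a framework that uses debiasing reweighting and residual aggregation to mitigate vision-language modality bias in temporal action localization, keeping vision as the dominant signal and language as a beneficial complement, and improving performance.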

Details

Motivation: Existing temporal action localization methods based on vision-language models rely too heavily on linguistic priors, causing a pronounced modality bias that weakens visual performance. Method: ActionVLM introduces (i) a debiasing reweighting module that estimates the incremental gain of language over vision and dynamically adjusts the language weight, and (ii) a residual aggregation strategy that treats language as a complementary refinement of the visual result rather than the primary driver. Result: On THUMOS14, mAP improves over the state of the art by up to 3.2%. Conclusion: By keeping vision dominant and using language as an adaptive complement, ActionVLM mitigates modality bias and improves the accuracy and robustness of temporal action localization. Abstract: Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors at the expense of visual performance, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. To this end, we introduce (i) a debiasing reweighting module that estimates the language advantage (the incremental benefit of language over vision-only predictions) and dynamically reweights the language modality accordingly, and (ii) a residual aggregation strategy that treats language as a complementary refinement rather than the primary driver. This combination alleviates modality bias, reduces overconfidence from linguistic priors, and strengthens temporal reasoning. Experiments on THUMOS14 show that our model outperforms state-of-the-art by up to 3.2% mAP.

[82] Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

Yu Huo,Siyu Zhang,Kun Zeng,Haoyue Liu,Owen Lee,Junlin Chen,Yuquan Lu,Yifu Guo,Yaodong Liang,Xiaoying Tang

Main category: cs.CV

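TL;DR: The paper proposes Shape-of-Thought (SoT), a visual Chain-of-Thought (CoT) framework for progressive 2D shape assembly without external engines, making text-to-image generation more robust under structural constraints such as numeracy, attribute binding, and part relations; with the new SoT-26K dataset and T2S-CompBench benchmark, the model reaches 88.4% on component numeracy and 84.8% on structural topology, well above text-only baselines.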

Details

Motivation: Existing multimodal text-to-image models are brittle when generating images with complex compositional structure (numeracy, attribute binding, part-level hierarchy) and lack explicit modeling of shape-assembly logic. Method: Shape-of-Thought (SoT) is a visual chain-of-thought framework that trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states; the authors build SoT-26K, a dataset of assembly traces derived from part-based CAD hierarchies, and the T2S-CompBench evaluation benchmark. Result: On T2S-CompBench, fine-tuning with SoT reaches 88.4% on component numeracy and 84.8% on structural topology, about 20% above text-only baselines, supporting its advantage in structural integrity and trace faithfulness. Conclusion: SoT establishes an interpretable, process-supervised paradigm for compositional generation that achieves structurally controllable text-to-image generation without explicit geometric representations or external engines. Abstract: Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework that enables progressive shape assembly via coherent 2D projections without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming text-only baselines by around 20%. SoT establishes a new paradigm for transparent, process-supervised compositional generation. The code is available at https://anonymous.4open.science/r/16FE/. The SoT-26K dataset will be released upon acceptance.

[83] An AI Framework for Microanastomosis Motion Assessment

Yan Meng,Eduardo J. Torres-Rodríguez,Marcelle Altshuler,Nishanth Gowda,Arhum Naeem,Recai Yilmaz,Omar Arnaout,Daniel A. Donoho

Main category: cs.CV

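TL;DR: The paper proposes an AI-based automated framework for objective, reliable assessment of instrument handling skill in microanastomosis; it combines YOLO detection, DeepSORT tracking, shape-descriptor tip localization, and a supervised classification module, and reports 97% detection precision and 96% mAP50-95.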

Details

Motivation: Traditional assessment of microsurgical technique depends on subjective expert ratings, which suffer from inter-rater variability, non-standardized criteria, cognitive bias, and long review times; an objective, automated, scalable assessment system is needed. Method: The AI framework has four core modules: (1) YOLO-based instrument detection; (2) DeepSORT-based instrument tracking; (3) instrument tip localization using shape descriptors; (4) a supervised classifier trained on expert-labeled data to assess handling proficiency. Result: The framework reaches 97% instrument detection precision and a mean Average Precision of 96% over IoU thresholds from 50% to 95% (mAP50-95). Conclusion: The framework enables automated, objective assessment of instrument handling in microanastomosis with high accuracy and practicality, and it may help standardize microsurgical training and evaluation. Abstract: Proficiency in microanastomosis is a fundamental competency across multiple microsurgical disciplines. These procedures demand exceptional precision and refined technical skills, making effective, standardized assessment methods essential. Traditionally, the evaluation of microsurgical techniques has relied heavily on the subjective judgment of expert raters. Such ratings are inherently constrained by limitations such as inter-rater variability, lack of standardized evaluation criteria, susceptibility to cognitive bias, and the time-intensive nature of manual review. These shortcomings underscore the urgent need for an objective, reliable, and automated system capable of assessing microsurgical performance with consistency and scalability. To bridge this gap, we propose a novel AI framework for the automated assessment of microanastomosis instrument handling skills. The system integrates four core components: (1) an instrument detection module based on the You Only Look Once (YOLO) architecture; (2) an instrument tracking module developed from Deep Simple Online and Realtime Tracking (DeepSORT); (3) an instrument tip localization module employing shape descriptors; and (4) a supervised classification module trained on expert-labeled data to evaluate instrument handling proficiency. Experimental results demonstrate the effectiveness of the framework, achieving an instrument detection precision of 97% and a mean Average Precision (mAP) of 96%, averaged over Intersection over Union (IoU) thresholds from 50% to 95% (mAP50-95).
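
The mAP50-95 figure quoted above averages detection quality over ten IoU thresholds. As a small illustration of the underlying quantity, the sketch below computes IoU for two axis-aligned boxes and lists the thresholds at which the prediction would count as a match. The box coordinates are hypothetical, and the full mAP computation (precision-recall curves per class) is deliberately left out.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds used by mAP50-95: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

# Hypothetical predicted and ground-truth instrument boxes.
predicted = (10, 10, 50, 50)
truth = (18, 10, 58, 50)
overlap = iou(predicted, truth)
matched_at = [t for t in THRESHOLDS if overlap >= t]
```

A prediction that matches only at the looser thresholds lowers mAP50-95 even though it would count as correct under the more lenient mAP50, which is why the averaged metric rewards tight localization.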

[84] Bidirectional Cross-Perception for Open-Vocabulary Semantic Segmentation in Remote Sensing Imagery

Jianzheng Wang,Huan Ni

Main category: cs.CV

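TL;DR: Proposes SDCI, a training-free open-vocabulary semantic segmentation framework built on spatial-regularization-aware dual-branch collaborative inference. By combining cross-model attention fusion, bidirectional graph diffusion refinement, and superpixel collaborative prediction, it substantially improves geometric localization and semantic segmentation accuracy on high-resolution remote sensing imagery.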

Details

Motivation: High-resolution remote sensing imagery contains densely distributed land-cover objects with complex boundaries, and existing training-free open-vocabulary semantic segmentation methods (e.g., one-way, shallow fusion of CLIP and VFMs) struggle to deliver accurate geometric localization and accurate semantic prediction at the same time. Method: Proposes the SDCI framework: 1) a cross-model attention fusion (CAF) module that provides bidirectional self-attention guidance during feature encoding; 2) a bidirectional cross-graph diffusion refinement (BCDR) module that raises the reliability of dual-branch segmentation scores through iterative random-walk diffusion; 3) a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism that uses low-level superpixel structure to refine boundaries. Result: Outperforms existing methods on multiple remote sensing semantic segmentation benchmarks; ablations show that superpixel structures remain effective within deep learning frameworks. Conclusion: Spatial regularization and multi-granularity collaborative inference improve the robustness and accuracy of training-free OVSS in complex remote sensing scenes, and superpixel priors remain practically useful. Abstract: High-resolution remote sensing imagery is characterized by densely distributed land-cover objects and complex boundaries, which places higher demands on both geometric localization and semantic prediction. Existing training-free open-vocabulary semantic segmentation (OVSS) methods typically fuse CLIP and vision foundation models (VFMs) using "one-way injection" and "shallow post-processing" strategies, making it difficult to satisfy these requirements. To address this issue, we propose a spatial-regularization-aware dual-branch collaborative inference framework for training-free OVSS, termed SDCI. First, during feature encoding, SDCI introduces a cross-model attention fusion (CAF) module, which guides collaborative inference by injecting the two branches' self-attention maps into each other. Second, we propose a bidirectional cross-graph diffusion refinement (BCDR) module that enhances the reliability of dual-branch segmentation scores through iterative random-walk diffusion. Finally, we incorporate low-level superpixel structures and develop a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism to further refine object boundaries. Experiments on multiple remote sensing semantic segmentation benchmarks demonstrate that our method achieves better performance than existing approaches. Moreover, ablation studies further confirm that traditional object-based remote sensing image analysis methods leveraging superpixel structures remain effective within deep learning frameworks. Code: https://github.com/yu-ni1989/SDCI.

[85] Enhancing Underwater Light Field Images via Global Geometry-aware Diffusion Process

Yuji Lin,Qian Zhao,Zongsheng Yue,Junhui Hou,Deyu Meng

Main category: cs.CV

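TL;DR: Proposes GeoDiff-LF, a diffusion-based framework built on SD-Turbo for enhancing underwater 4-D light field imaging. A geometry-guided network design, loss function, and sampling strategy mitigate color distortion in underwater images, and the method reaches SOTA in both visual and quantitative terms.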

Details

Motivation: Underwater imaging suffers from color distortion and related degradations; 4-D light field imaging provides spatial-angular information, but reconstructing it at high quality remains difficult. Method: Proposes GeoDiff-LF: (1) a modified U-Net with convolutional and attention adapters that model geometric cues; (2) a geometry-guided loss based on tensor decomposition and progressive weighting; (3) an optimized sampling strategy with noise prediction. The design fuses diffusion priors with light field geometry. Result: Clearly outperforms existing methods on multiple metrics and in visual quality, advancing the SOTA in underwater image enhancement. Conclusion: By jointly exploiting the modeling power of diffusion and the geometric priors of light fields, GeoDiff-LF offers an efficient and robust approach to underwater 4-D light field image enhancement. Abstract: This work studies the challenging problem of acquiring high-quality underwater images via 4-D light field (LF) imaging. To this end, we propose GeoDiff-LF, a novel diffusion-based framework built upon SD-Turbo to enhance underwater 4-D LF imaging by leveraging its spatial-angular structure. GeoDiff-LF consists of three key adaptations: (1) a modified U-Net architecture with convolutional and attention adapters to model geometric cues, (2) a geometry-guided loss function using tensor decomposition and progressive weighting to regularize global structure, and (3) an optimized sampling strategy with noise prediction to improve efficiency. By integrating diffusion priors and LF geometry, GeoDiff-LF effectively mitigates color distortion in underwater scenes. Extensive experiments demonstrate that our framework outperforms existing methods across both visual fidelity and quantitative performance, advancing the state-of-the-art in enhancing underwater imaging. The code will be publicly available at https://github.com/linlos1234/GeoDiff-LF.

[86] FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models

Chenyu Huang,Peng Ye,Xudong Tan,Jinhan Mu,Shenghe Zheng,Li Shen,Tao Chen

Main category: cs.CV

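TL;DR: Proposes FRISM, a framework that injects reasoning ability at fine granularity through subspace-level model merging, improving the reasoning of vision-language models without harming their visual capabilities.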

Details

Motivation: Existing methods merge large reasoning models and vision-language models at a coarse, layer-level granularity, which forces a trade-off between reasoning ability and visual ability. Method: Proposes FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), which decomposes the task vectors of the large reasoning model with SVD and adaptively tunes a scaling coefficient for each subspace; it also introduces a label-free self-distillation learning strategy with dual-objective optimization. Result: Consistently achieves state-of-the-art performance on multiple visual reasoning benchmarks, improving reasoning without degrading the original visual capabilities. Conclusion: Through subspace-level merging and adaptive tuning, FRISM efficiently strengthens the reasoning of vision-language models and resolves the capability trade-off caused by coarse-grained merging. Abstract: Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that reasoning capabilities are encoded in distinct subspaces, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with a dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model's original visual capabilities, consistently achieving state-of-the-art performance across diverse visual reasoning benchmarks.
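
To make the idea of subspace-level scaling concrete, here is a minimal NumPy sketch for a single weight matrix. It assumes the usual definition of a task vector (fine-tuned weights minus shared base weights) and takes the per-subspace coefficients as given; the function name and arguments are hypothetical, and the learning procedure that FRISM uses to tune the coefficients is not reproduced.

```python
import numpy as np


def inject_reasoning(w_vlm, w_lrm, w_base, coeffs):
    """Add an SVD-rescaled LRM task vector to one VLM weight matrix.

    Assumption: task vector = w_lrm - w_base. `coeffs` holds one scale per
    singular direction (subspace); in FRISM these are learned, here they
    are simply inputs to the sketch.
    """
    task_vector = w_lrm - w_base
    u, s, vt = np.linalg.svd(task_vector, full_matrices=False)
    # Rescale each rank-one component u_i * s_i * v_i^T by its own coefficient.
    rescaled = (u * (coeffs * s)) @ vt
    return w_vlm + rescaled
```

With all coefficients equal to 1 this reduces to plain task-vector addition, and with all coefficients equal to 0 the VLM weights are left untouched, so the coefficients interpolate between those two extremes one subspace at a time.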

[87] Generative Recall, Dense Reranking: Learning Multi-View Semantic IDs for Efficient Text-to-Video Retrieval

Zecheng Zhao,Zhi Chen,Zi Huang,Shazia Sadiq,Tong Chen

Main category: cs.CV

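TL;DR: Proposes GRDR, which improves generative recall quality through multi-view semantic ID assignment and joint training, sharply reducing storage and retrieval latency while keeping accuracy high.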

Details

Motivation: In existing two-stage text-to-video retrieval, the recall model is limited by semantic ambiguity and cross-modal misalignment; generative retrieval is efficient but struggles to model video-text correspondence accurately. Method: Proposes Generative Recall and Dense Reranking (GRDR): a query-guided multi-view tokenizer assigns multiple semantic IDs to each video, and the tokenizer and generative retriever are trained jointly through a shared codebook; at inference, trie-constrained decoding produces a compact candidate set that a dense model then reranks. Result: On TVR benchmarks, GRDR matches the accuracy of strong dense retrievers while cutting index storage by an order of magnitude and speeding up full-corpus retrieval by up to 300 times. Conclusion: GRDR alleviates semantic ambiguity and cross-modal misalignment in generative recall, offering an efficient, high-quality approach to two-stage text-to-video retrieval. Abstract: Text-to-Video Retrieval (TVR) is essential in video platforms. Dense retrieval with dual-modality encoders leads in accuracy, but its computation and storage scale poorly with corpus size. Thus, real-time large-scale applications adopt two-stage retrieval, where a fast recall model gathers a small candidate pool, which is reranked by an advanced dense retriever. Because the candidate pool is so small, the reranking model can use any off-the-shelf dense retriever without hurting efficiency, meaning the recall model bounds two-stage TVR performance. Recently, generative retrieval (GR) replaces dense video embeddings with discrete semantic IDs and retrieves by decoding text queries into ID tokens. GR offers near-constant inference and storage complexity, and its semantic IDs capture high-level video features via quantization, making it ideal for quickly eliminating irrelevant candidates during recall. However, as a recall model in two-stage TVR, GR suffers from (i) semantic ambiguity, where each video satisfies diverse queries but is forced into one semantic ID; and (ii) cross-modal misalignment, as semantic IDs are solely derived from visual features without text supervision. We propose Generative Recall and Dense Reranking (GRDR), designing a novel GR method to uplift recalled candidate quality. GRDR assigns multiple semantic IDs to each video using a query-guided multi-view tokenizer exposing diverse semantic access paths, and jointly trains the tokenizer and generative retriever via a shared codebook to cast semantic IDs as the semantic bridge between texts and videos. At inference, trie-constrained decoding generates a compact candidate set reranked by a dense model for fine-grained matching. Experiments on TVR benchmarks show GRDR matches strong dense retrievers in accuracy while reducing index storage by an order of magnitude and accelerating full-corpus retrieval by up to 300$\times$.
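
Trie-constrained decoding restricts each generation step to tokens that extend some valid semantic ID, so the decoder can only emit IDs that exist in the index. The sketch below is a simplified, hypothetical illustration: it uses greedy decoding and a caller-supplied `score_fn` in place of a real generative retriever, whereas a practical system would more likely run beam search over model logits.

```python
def build_trie(id_sequences):
    """Nested-dict trie over the semantic ID sequences present in the index."""
    root = {}
    for seq in id_sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


def allowed_next_tokens(trie, prefix):
    """Tokens that keep `prefix` on a path to at least one indexed ID."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()
        node = node[tok]
    return set(node.keys())


def constrained_greedy_decode(trie, score_fn, length):
    """Greedy decoding where every step is masked to trie-valid tokens.

    `score_fn(prefix, token)` is a stand-in for the retriever's token score.
    """
    prefix = []
    for _ in range(length):
        allowed = allowed_next_tokens(trie, prefix)
        if not allowed:
            break
        prefix.append(max(allowed, key=lambda t: score_fn(prefix, t)))
    return tuple(prefix)
```

Because a video may own several semantic IDs in GRDR, the decoded IDs would then be mapped back to videos and deduplicated before dense reranking; that mapping is left out of the sketch.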

[88] Thinker: A vision-language foundation model for embodied intelligence

Baiyu Pan,Daqin Luo,Junpeng Yang,Jiyuan Wang,Yixuan Zhang,Hailin Shi,Jichao Jiao

Main category: cs.CV

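TL;DR: Proposes Thinker, a model designed for embodied intelligence. It builds a large-scale dataset for robotic perception and reasoning and feeds key frames jointly with full video sequences, substantially improving video understanding and reaching SOTA on task-planning benchmarks.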

Details

Motivation: When applied to robotics, large vision-language models make mistakes on problems that are easy for humans, such as confusing third-person and first-person perspectives and overlooking information at the end of a video. Method: 1) Builds a large-scale dataset for robotic perception and reasoning, covering ego-view videos, visual grounding, spatial understanding, and chain-of-thought data; 2) proposes a simple and effective approach that feeds key frames jointly with the full video sequence to strengthen video understanding. Result: Achieves state-of-the-art performance on two of the most commonly used task-planning benchmark datasets. Conclusion: Thinker is a large vision-language foundation model designed for embodied intelligence that mitigates perspective confusion and temporal reasoning deficits and substantially improves robotic task planning. Abstract: When large vision-language models are applied to the field of robotics, they encounter problems that are simple for humans yet error-prone for models. Such issues include confusion between third-person and first-person perspectives and a tendency to overlook information in video endings during temporal reasoning. To address these challenges, we propose Thinker, a large vision-language foundation model designed for embodied intelligence. We tackle the aforementioned issues from two perspectives. Firstly, we construct a large-scale dataset tailored for robotic perception and reasoning, encompassing ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. Secondly, we introduce a simple yet effective approach that substantially enhances the model's capacity for video comprehension by jointly incorporating key frames and full video sequences as inputs. Our model achieves state-of-the-art results on two of the most commonly used benchmark datasets in the field of task planning.

[89] LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models

Alvi Md Ishmam,Najibul Haque Sarker,Zaber Ibn Abdul Hakim,Chris Thomas

Main category: cs.CV

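TL;DR: Proposes LAMP, a black-box universal adversarial perturbation (UAP) method targeting multi-image multimodal large language models (MLLMs). An attention-based constraint, a cross-image contagious constraint, and an index-attention suppression loss yield an efficient, robust, position-invariant attack.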

Details

Motivation: Existing adversarial attacks mainly target single-image settings and rely on white-box assumptions, so they do not transfer to practical multi-image MLLMs, whose vulnerabilities remain unexplored. Method: Proposes LAMP: 1) an attention-based constraint that limits information aggregation across images; 2) a cross-image contagious constraint that makes perturbed tokens influence clean tokens; 3) an index-attention suppression loss that makes the attack position-invariant. Result: LAMP outperforms SOTA baselines and achieves the highest attack success rates across multiple vision-language tasks and models. Conclusion: LAMP is an effective black-box universal adversarial attack that exposes security risks in multi-image MLLMs and provides a basis for future defense research. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance across vision-language tasks. Recent advancements allow these models to process multiple images as inputs. However, the vulnerabilities of multi-image MLLMs remain unexplored. Existing adversarial attacks focus on single-image settings and often assume a white-box threat model, which is impractical in many real-world scenarios. This paper introduces LAMP, a black-box method for learning Universal Adversarial Perturbations (UAPs) targeting multi-image MLLMs. LAMP applies an attention-based constraint that prevents the model from effectively aggregating information across images. LAMP also introduces a novel cross-image contagious constraint that forces perturbed tokens to influence clean tokens, spreading adversarial effects without requiring all inputs to be modified. Additionally, an index-attention suppression loss enables a robust position-invariant attack. Experimental results show that LAMP outperforms SOTA baselines and achieves the highest attack success rates across multiple vision-language tasks and models.

[90] PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models

Xuewen Liu,Zhikai Li,Jing Zhang,Mengjuan Chen,Qingyi Gu

Main category: cs.CV

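TL;DR: Proposes PTQ4ARVG, a training-free post-training quantization framework that addresses three challenges in quantizing autoregressive visual generation (ARVG) models: channel-wise outliers, token-wise dynamic activations, and sample-wise distribution mismatch. It achieves 8-bit and 6-bit quantization while maintaining performance.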

Details

Motivation: Quantization of ARVG models is underexplored and existing methods generalize poorly to them; three challenges stand out: severe channel-wise outliers, highly dynamic token-wise activations, and mismatched sample-wise distribution information. Method: Proposes the PTQ4ARVG framework with three parts: (1) Gain-Projected Scaling (GPS), which uses a Taylor expansion to quantify the gain of scaling and differentiates it to obtain the optimal scaling factor, mitigating channel-wise outliers; (2) Static Token-Wise Quantization (STWQ), which exploits ARVG's fixed token length and position-invariant distributions to quantize per token without runtime overhead; (3) Distribution-Guided Calibration (DGC), which selects representative calibration samples by distributional entropy to remove sample-wise distribution mismatch. Result: ARVG-family models are quantized to 8-bit and 6-bit with competitive performance; the code is open source. Conclusion: PTQ4ARVG is a general, efficient, training-free quantization scheme for ARVG that systematically addresses its specific quantization difficulties and offers a practical path to deployment. Abstract: AutoRegressive Visual Generation (ARVG) models retain an architecture compatible with language models, while achieving performance comparable to diffusion-based models. Quantization is commonly employed in neural networks to reduce model size and computational latency. However, applying quantization to ARVG remains largely underexplored, and existing quantization methods fail to generalize effectively to ARVG models. In this paper, we explore this issue and identify three key challenges: (1) severe outliers at channel-wise level, (2) highly dynamic activations at token-wise level, and (3) mismatched distribution information at sample-wise level. To this end, we propose PTQ4ARVG, a training-free post-training quantization (PTQ) framework consisting of: (1) Gain-Projected Scaling (GPS), which mitigates channel-wise outliers by expanding the quantization loss via a Taylor series to quantify the gain of scaling for activation-weight quantization and deriving the optimal scaling factor through differentiation; (2) Static Token-Wise Quantization (STWQ), which leverages the inherent properties of ARVG, namely fixed token length and position-invariant distribution across samples, to address token-wise variance without incurring dynamic calibration overhead; and (3) Distribution-Guided Calibration (DGC), which selects samples that contribute most to distributional entropy, eliminating the sample-wise distribution mismatch. Extensive experiments show that PTQ4ARVG can effectively quantize the ARVG family models to 8-bit and 6-bit while maintaining competitive performance. Code is available at http://github.com/BienLuky/PTQ4ARVG.
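
The static token-wise idea can be illustrated with a small NumPy sketch: because the token length is fixed, one scale per token position can be calibrated offline and reused at inference with no dynamic statistics. This is an assumed, simplified formulation (symmetric max-abs scaling, hypothetical function names), not the STWQ implementation from the paper.

```python
import numpy as np


def calibrate_token_scales(calib_acts, n_bits=8):
    """Offline step: one symmetric scale per token position.

    calib_acts has shape (num_samples, num_tokens, num_channels).
    Max-abs calibration is an assumption made for this sketch.
    """
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = np.abs(calib_acts).max(axis=(0, 2))  # shape: (num_tokens,)
    return np.maximum(max_abs, 1e-8) / qmax


def quantize_static_tokenwise(acts, scales, n_bits=8):
    """Inference step: fake-quantize with the fixed per-token scales."""
    qmax = 2 ** (n_bits - 1) - 1
    s = scales[None, :, None]
    q = np.clip(np.round(acts / s), -qmax - 1, qmax)
    return q * s  # dequantized values, same shape as acts
```

Since the scales are fixed after calibration, the inference path involves only a divide, round, and clip per activation, which is the property the abstract describes as avoiding dynamic calibration overhead.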

[91] NFCDS: A Plug-and-Play Noise Frequency-Controlled Diffusion Sampling Strategy for Image Restoration

Zhen Wang,Hongyi Liu,Jianing Li,Zhihui Wei

Main category: cs.CV

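TL;DR: Proposes Noise Frequency-Controlled Diffusion Sampling (NFCDS), which designs a Fourier-domain filter that suppresses low-frequency noise and preserves high-frequency detail, improving the balance between data fidelity and perceptual quality in diffusion PnP methods without additional training.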

Details

Motivation: Diffusion PnP methods produce images with high perceptual quality, but noise introduced during reverse diffusion often lowers data fidelity; the inherent conflict between fidelity and perceptual quality needs to be resolved. Method: Proposes a noise frequency control mechanism (NFCDS) that designs a Fourier-domain filter which progressively suppresses low-frequency noise while preserving high-frequency noise, injecting a data-consistency prior directly into the sampling process. Result: As a plug-and-play module requiring no additional training, NFCDS clearly improves the fidelity-perception balance across a range of zero-shot restoration tasks and converges quickly. Conclusion: Noise frequency is a key lens for understanding the fidelity-perception trade-off; NFCDS provides a general, efficient, training-free spectral modulation scheme that integrates into existing diffusion restoration frameworks. Abstract: Diffusion sampling-based Plug-and-Play (PnP) methods produce images with high perceptual quality but often suffer from reduced data fidelity, primarily due to the noise introduced during reverse diffusion. To address this trade-off, we propose Noise Frequency-Controlled Diffusion Sampling (NFCDS), a spectral modulation mechanism for reverse diffusion noise. We show that the fidelity-perception conflict can be fundamentally understood through noise frequency: low-frequency components induce blur and degrade fidelity, while high-frequency components drive detail generation. Based on this insight, we design a Fourier-domain filter that progressively suppresses low-frequency noise and preserves high-frequency content. This controlled refinement injects a data-consistency prior directly into sampling, enabling fast convergence to results that are both high-fidelity and perceptually convincing--without additional training. As a PnP module, NFCDS seamlessly integrates into existing diffusion-based restoration frameworks and improves the fidelity-perception balance across diverse zero-shot tasks.
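
As a rough illustration of filtering sampling noise in the Fourier domain, the sketch below attenuates the low-frequency part of a 2-D noise map and leaves high frequencies untouched. The hard radial cutoff, the `cutoff` and `strength` parameters, and the function name are all assumptions for this example; the actual NFCDS filter shape and its schedule over diffusion steps are not specified in the abstract.

```python
import numpy as np


def suppress_low_freq_noise(noise, cutoff, strength):
    """Attenuate low-frequency components of a 2-D noise map.

    cutoff:   radial frequency (cycles/pixel) below which noise is damped.
    strength: 0 leaves the noise unchanged, 1 removes low frequencies fully.
    A sampler could raise `strength` across steps to suppress progressively.
    """
    h, w = noise.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    # The mask depends only on |frequency|, so the filtered signal stays real.
    mask = np.where(radius < cutoff, 1.0 - strength, 1.0)
    return np.real(np.fft.ifft2(np.fft.fft2(noise) * mask))
```

A smooth roll-off (for example a Gaussian high-pass) would avoid ringing from the hard cutoff; the binary mask is used here only to keep the example short.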

[92] Hypersolid: Emergent Vision Representations via Short-Range Repulsion

Esteban Rodríguez-Betancourt,Edgar Casasola-Murillo

Main category: cs.CV

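TL;DR: Proposes Hypersolid, which reinterprets representation learning in self-supervised learning as a discrete packing problem. Short-range hard-ball repulsion prevents local collisions and thereby avoids representation collapse, and the method performs well on fine-grained and low-resolution classification tasks.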

Details

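Motivation: Representation collapse is a common problem in self-supervised learning, and existing methods mostly rely on global regularization; this work approaches it from the perspective of geometry and information preservation. Method: Models representation learning as a discrete packing problem that emphasizes keeping the mapping injective; proposes Hypersolid, which introduces a short-range hard-ball repulsion mechanism to prevent local collisions between representations. Result: The method drives representations into a high-separation geometric regime that preserves augmentation diversity and achieves strong performance on fine-grained and low-resolution image classification. Conclusion: Local hard-ball repulsion is an effective and geometrically intuitive anti-collapse mechanism that offers a new perspective on self-supervised representation learning. Abstract: A recurring challenge in self-supervised learning is preventing representation collapse. Existing solutions typically rely on global regularization, such as maximizing distances, decorrelating dimensions or enforcing certain distributions. We instead reinterpret representation learning as a discrete packing problem, where preserving information simplifies to maintaining injectivity. We operationalize this in Hypersolid, a method using short-range hard-ball repulsion to prevent local collisions. This constraint results in a high-separation geometric regime that preserves augmentation diversity, excelling on fine-grained and low-resolution classification tasks.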
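
One way to picture short-range repulsion is a penalty that is nonzero only when two embeddings come closer than a fixed radius. The sketch below uses a squared hinge on pairwise distance as a soft stand-in for a hard-ball constraint; the penalty form, the `radius` parameter, and the function name are assumptions for illustration, and the paper's actual objective may differ.

```python
import numpy as np


def short_range_repulsion(z, radius):
    """Mean squared overlap between embeddings closer than `radius`.

    z has shape (num_points, dim). Pairs farther apart than `radius`
    contribute nothing, so the penalty is purely local.
    """
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    overlap = np.maximum(0.0, radius - dist)
    np.fill_diagonal(overlap, 0.0)  # ignore each point's distance to itself
    n = len(z)
    return (overlap ** 2).sum() / (n * (n - 1))
```

Unlike a global uniformity term, this penalty vanishes once all points are at least `radius` apart, which matches the abstract's contrast between local collision avoidance and global regularization.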

[93] Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference

Jianglong Li,Jun Xu,Bingcong Lu,Zhengxue Cheng,Hongwei Hu,Ronghua Wu,Li Song

Main category: cs.CV

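TL;DR: Proposes a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that combines FLAME parametric modeling with 3DGS neural rendering. It transmits only essential facial metadata and improves transmission efficiency through Gaussian attribute compression and MLP optimization, achieving high-quality real-time rendering at extremely low bitrates.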

Details

Motivation: Traditional 2D video compression cannot preserve fine geometric and appearance detail, while implicit neural rendering such as NeRF is too computationally expensive to meet the low-bitrate, high-fidelity, real-time needs of 3D video conferencing. Method: Combines FLAME parametric modeling with 3D Gaussian Splatting (3DGS) neural rendering in a lightweight Gaussian head model; proposes a compact representation and compression scheme, with Gaussian attribute compression and MLP optimization, that transmits only the necessary facial metadata in real time. Result: Achieves better rate-distortion performance than existing methods at extremely low bitrates and supports high-quality, real-time 3D face rendering. Conclusion: The framework balances fidelity, efficiency, and real-time operation, making it suitable for resource-constrained 3D video conferencing. Abstract: The demand for immersive and interactive communication has driven advancements in 3D video conferencing, yet achieving high-fidelity 3D talking face representation at low bitrates remains a challenge. Traditional 2D video compression techniques fail to preserve fine-grained geometric and appearance details, while implicit neural rendering methods like NeRF suffer from prohibitive computational costs. To address these challenges, we propose a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that integrates FLAME-based parametric modeling with 3DGS neural rendering. Our approach transmits only essential facial metadata in real time, enabling efficient reconstruction with a Gaussian-based head model. Additionally, we introduce a compact representation and compression scheme, including Gaussian attribute compression and MLP optimization, to enhance transmission efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance, delivering high-quality facial rendering at extremely low bitrates, making it well-suited for real-time 3D video conferencing applications.

[94] GeoRC: A Benchmark for Geolocation Reasoning Chains

Mohit Talreja,Joshua Diao,Jim Thannikary James,Radu Casapu,Tejas Santanam,Ethan Mendes,Alan Ritter,Wei Xu,James Hays

Main category: cs.CV

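TL;DR: Introduces the first benchmark for geolocation reasoning chains. It shows that vision-language models (VLMs) can predict image locations accurately yet often hallucinate when explaining their evidence; measured against 800 expert-written reasoning chains, current VLMs (open-weights models in particular) fall far short of humans at producing auditable, visually grounded reasoning.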

Details

Motivation: VLMs perform very well at geolocation prediction, but their reasoning chains often drift from the actual image content and hallucinate, which undermines interpretability and trust; a dedicated benchmark is needed to evaluate reasoning quality. Method: Builds the first benchmark for geolocation reasoning chains: experts, including the GeoGuessr world champion, write 800 ground-truth reasoning chains for 500 street-view scenes, covering hundreds of discriminative visual cues (such as license plate shape, architectural style, and soil properties); VLM-generated chains are scored automatically with LLM-as-a-judge and VLM-as-a-judge strategies, validated against human scoring. Result: Qwen 3 as an LLM judge correlates best with human scoring; large closed-source VLMs (Gemini, GPT 5) approach human location accuracy but lag clearly in reasoning-chain quality; open-weights VLMs (Llama, Qwen) perform very poorly, only slightly better than a pure-hallucination baseline. Conclusion: Current VLMs are fundamentally limited in extracting fine-grained visual attributes, which makes their reasoning chains unreliable; the benchmark provides an evaluation tool and a direction for improving VLM interpretability and visual understanding. Abstract: Vision Language Models (VLMs) are good at recognizing the global location of a photograph -- their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at explaining which image evidence led to their prediction, even when their location prediction is correct. The reasoning chains produced by VLMs frequently hallucinate scene attributes to support their location prediction (e.g. phantom writing, imagined infrastructure, misidentified flora). In this paper, we introduce the first benchmark for geolocation reasoning chains. We focus on the global location prediction task in the popular GeoGuessr game which draws from Google Street View spanning more than 100 countries. We collaborate with expert GeoGuessr players, including the reigning world champion, to produce 800 ground truth reasoning chains for 500 query scenes. These expert reasoning chains address hundreds of different discriminative visual attributes such as license plate shape, architecture, and soil properties to name just a few. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at predicting locations, they still lag behind human experts when it comes to producing auditable reasoning chains. Open-weights VLMs such as Llama and Qwen catastrophically fail on our benchmark -- they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but no visual information at all. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high resolution images.

Dong Chen,Ruoyu Li,Xinyan Zhang,Jialei Xu,Ruoseng Zhao,Zhikang Zhang,Lingyun Li,Zizhuang Wei

Main category: cs.CV

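TL;DR: Proposes a multi-modal method that fuses video, antenna geometric features, and PCI signals to identify antenna affiliation automatically, and designs a Token Entropy Regularization module to improve cross-modal alignment.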

Details

Motivation: Antenna affiliation identification in today's communication networks depends on manual inspection, which is inefficient and error-prone; an automated, accurate solution is needed. Method: Proposes a multi-modal classification and matching framework that combines base-station video, antenna geometric features, and PCI signals; designs a dedicated training framework that aligns images with PCI signals and introduces a Token Entropy Regularization (TER) module in the pretraining stage to improve representation alignment. Result: Experiments show that TER speeds up convergence and yields significant performance gains; further analysis finds that the entropy of the first token is modality-dependent. Conclusion: The method addresses the cross-modal alignment difficulty caused by the lack of analogous data in the communications domain and offers a new approach to antenna affiliation identification. Abstract: Accurate antenna affiliation identification is crucial for optimizing and maintaining communication networks. Current practice, however, relies on the cumbersome and error-prone process of manual tower inspections. We propose a novel paradigm shift that fuses video footage of base stations, antenna geometric features, and Physical Cell Identity (PCI) signals, transforming antenna affiliation identification into multi-modal classification and matching tasks. Publicly available pretrained transformers struggle with this unique task due to a lack of analogous data in the communications domain, which hampers cross-modal alignment. To address this, we introduce a dedicated training framework that aligns antenna images with corresponding PCI signals. To tackle the representation alignment challenge, we propose a novel Token Entropy Regularization (TER) module in the pretraining stage. Our experiments demonstrate that TER accelerates convergence and yields significant performance gains. Further analysis reveals that the entropy of the first token is modality-dependent. Code will be made available upon publication.

[96] WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models

Rishi Upadhyay,Howard Zhang,Jim Solomon,Ayush Agrawal,Pranay Boreddy,Shruti Satya Narayana,Yunhao Ba,Alex Wong,Celso M de Melo,Achuta Kadambi

Main category: cs.CV

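TL;DR: Introduces WorldBench, a disentangled video benchmark for evaluating the physical understanding of video-generation world models. It covers two levels, intuitive physical concepts and low-level physical parameters, and reveals systematic physical-consistency failures in current SOTA models.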

Details

Motivation: Existing physics video benchmarks entangle multiple concepts in a single test, which makes it hard to pinpoint where a model misunderstands a specific physical law; a disentangled, interpretable evaluation tool is needed. Method: Builds WorldBench, a concept-specific, disentangled video benchmark with two levels: (1) intuitive physical understanding (e.g., object permanence, scale/perspective); (2) low-level physical parameters (e.g., friction coefficients, fluid viscosity). It is used to evaluate mainstream video world models systematically. Result: All tested SOTA video world models fail consistently on several specific physics concepts and lack the basic physical consistency needed to generate physically plausible video. Conclusion: WorldBench provides a finer-grained, scalable framework for evaluating physical reasoning and should help drive world models toward real physical reliability. Abstract: Recent advances in generative foundational models, often termed "world models," have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties such as friction coefficients or fluid viscosity. When SOTA video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.

[97] Gaussian Belief Propagation Network for Depth Completion

Jie Tang,Pingping Xie,Jian Li,Ping Tan

Main category: cs.CV

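TL;DR: Proposes the Gaussian Belief Propagation Network (GBPN), which combines deep learning with probabilistic graphical models. It dynamically constructs a scene-specific Markov random field and performs inference with an enhanced Gaussian belief propagation scheme to complete sparse depth maps, reaching SOTA performance on NYUv2 and KITTI.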

Details

Motivation: Existing deep learning methods handle the sparsity and irregularity of input depth data poorly, and performance suffers especially at high sparsity. Method: Proposes the Gaussian Belief Propagation Network (GBPN): a Graphical Model Construction Network (GMCN) dynamically builds a scene-specific Markov Random Field (MRF) and predicts adaptive non-local edges to model long-range spatial dependencies; a combined serial and parallel message passing scheme strengthens information propagation in Gaussian Belief Propagation (GBP). Result: Achieves SOTA performance on the NYUv2 and KITTI benchmarks and shows stronger robustness and generalization across sparsity levels, sparsity patterns, and datasets. Conclusion: By combining deep learning with probabilistic graphical models, GBPN offers a more robust and more interpretable approach to sparse depth completion. Abstract: Depth completion aims to predict a dense depth map from a color image with sparse depth measurements. Although deep learning methods have achieved state-of-the-art (SOTA) results, effectively handling the sparse and irregular nature of input depth data in deep networks remains a significant challenge, often limiting performance, especially under high sparsity. To overcome this limitation, we introduce the Gaussian Belief Propagation Network (GBPN), a novel hybrid framework synergistically integrating deep learning with probabilistic graphical models for end-to-end depth completion. Specifically, a scene-specific Markov Random Field (MRF) is dynamically constructed by the Graphical Model Construction Network (GMCN), and then inferred via Gaussian Belief Propagation (GBP) to yield the dense depth distribution. Crucially, the GMCN learns to construct not only the data-dependent potentials of the MRF but also its structure by predicting adaptive non-local edges, enabling the capture of complex, long-range spatial dependencies. Furthermore, we enhance GBP with a serial & parallel message passing scheme, designed for effective information propagation, particularly from sparse measurements. Extensive experiments demonstrate that GBPN achieves SOTA performance on the NYUv2 and KITTI benchmarks. Evaluations across varying sparsity levels, sparsity patterns, and datasets highlight GBPN's superior performance, notable robustness, and generalization capability.

[98] Mam-App: A Novel Parameter-Efficient Mamba Model for Apple Leaf Disease Classification

Md Nadim Mahamood,Md Imran Hasan,Md Rasheduzzaman,Ausrukona Ray,Md Shafi Ud Doula,Kamrul Hasan

Main category: cs.CV

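TL;DR: Proposes Mam-App, a parameter-efficient Mamba-based model for apple leaf disease classification. With only 0.051M parameters it reaches 99.58% accuracy on the PlantVillage dataset, and its generalization is validated on corn and potato datasets.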

Details

Motivation: Existing deep learning models have large parameter counts and high compute cost, while lightweight models often give up accuracy; a balance between efficiency and accuracy is needed, especially for resource-constrained settings such as drones and mobile devices. Method: Proposes Mam-App, a lightweight Mamba-based model for leaf disease feature extraction and classification that targets both parameter efficiency and performance. Result: On the Apple Leaf Disease dataset it reaches 99.58% accuracy, 99.30% precision, 99.14% recall, and a 99.22% F1-score; it also performs well across domains on the Corn and Potato datasets. Conclusion: Mam-App achieves SOTA-level performance with an extremely small parameter count, showing that the Mamba architecture can combine efficiency and generalization for agricultural disease recognition and is suitable for edge deployment. Abstract: The rapid growth of the global population, alongside exponential technological advancement, has intensified the demand for food production. Meeting this demand depends not only on increasing agricultural yield but also on minimizing food loss caused by crop diseases. Diseases account for a substantial portion of apple production losses, despite apples being among the most widely produced and nutritionally valuable fruits worldwide. Previous studies have employed machine learning techniques for feature extraction and early diagnosis of apple leaf diseases, and more recently, deep learning-based models have shown remarkable performance in disease recognition. However, most state-of-the-art deep learning models are highly parameter-intensive, resulting in increased training and inference time. Although lightweight models are more suitable for user-friendly and resource-constrained applications, they often suffer from performance degradation. To address the trade-off between efficiency and performance, we propose Mam-App, a parameter-efficient Mamba-based model for feature extraction and leaf disease classification. The proposed approach achieves competitive state-of-the-art performance on the PlantVillage Apple Leaf Disease dataset, attaining 99.58% accuracy, 99.30% precision, 99.14% recall, and a 99.22% F1-score, while using only 0.051M parameters. This extremely low parameter count makes the model suitable for deployment on drones, mobile devices, and other low-resource platforms. To demonstrate the robustness and generalizability of the proposed model, we further evaluate it on the PlantVillage Corn Leaf Disease and Potato Leaf Disease datasets. The model achieves 99.48%, 99.20%, 99.34%, and 99.27% accuracy, precision, recall, and F1-score on the corn dataset and 98.46%, 98.91%, 95.39%, and 97.01% on the potato dataset, respectively.
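
For readers relating the reported metrics to one another, F1 is the harmonic mean of precision and recall. The short sketch below applies that formula to the apple-dataset figures from the abstract. It assumes the reported F1 was computed directly from the reported precision and recall; with per-class (macro) averaging the two can differ slightly, so this is a sanity check rather than a reproduction of the paper's evaluation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Apple Leaf Disease figures quoted in the abstract (percentages).
apple_f1 = f1_score(99.30, 99.14)
print(round(apple_f1, 2))  # → 99.22
```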

[99] HiFi-Mesh: High-Fidelity Efficient 3D Mesh Generation via Compact Autoregressive Dependence

Yanfeng Li,Tao Tan,Qingquan Gao,Zhiwen Cao,Xiaohong liu,Yue Sun

Main category: cs.CV

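TL;DR: Proposes the LANE model and the AdaGraph strategy to address under-utilized resources, slow inference, and limited sequence length in existing autoregressive 3D mesh modeling, clearly improving generation speed, structural detail, and geometric consistency.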

Details

Motivation: Existing autoregressive 3D mesh modeling methods use resources inefficiently, infer slowly, and handle only small-scale sequences, which limits the structural detail they can express. Method: Proposes the Latent Autoregressive Network (LANE), which introduces compact autoregressive dependencies, and the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy, which overcomes the efficiency bottleneck of serial inference through spatiotemporal decoupling. Result: LANE increases the maximum generatable sequence length sixfold; AdaGraph accelerates inference; experiments show gains over existing methods in generation speed, structural detail, and geometric consistency. Conclusion: LANE combined with AdaGraph offers an efficient, high-fidelity approach to high-quality 3D mesh generation. Abstract: High-fidelity 3D meshes can be tokenized into one-dimensional (1D) sequences and directly modeled using autoregressive approaches for faces and vertices. However, existing methods suffer from insufficient resource utilization, resulting in slow inference and the ability to handle only small-scale sequences, which severely constrains the expressible structural details. We introduce the Latent Autoregressive Network (LANE), which incorporates compact autoregressive dependencies in the generation process, achieving a $6\times$ improvement in maximum generatable sequence length compared to existing methods. To further accelerate inference, we propose the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy, which effectively overcomes the efficiency bottleneck of traditional serial inference through spatiotemporal decoupling in the generation process. Experimental validation demonstrates that LANE achieves superior performance across generation speed, structural detail, and geometric consistency, providing an effective solution for high-quality 3D mesh generation.

[100] Optimal Transport-Induced Samples against Out-of-Distribution Overconfidence

Keke Tang,Ziyong Du,Xiaofei Wang,Weilong Peng,Peican Zhu,Zhihong Tian

Main category: cs.CV

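TL;DR: Proposes a framework based on the singular boundaries of semi-discrete optimal transport (OT). It generates geometrically grounded, semantically ambiguous OOD samples (OTIS) and applies a confidence suppression loss to them during training, mitigating the overconfidence of deep neural networks on OOD inputs.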

Details

Motivation: Deep neural networks often make overconfident predictions on out-of-distribution (OOD) inputs, which undermines their reliability in open-world settings; singularities in semi-discrete optimal transport mark regions of semantic ambiguity, exactly where high-confidence errors are most likely. Method: Formulates an OT problem between a continuous base distribution and the latent embeddings of the training data and identifies the singular boundaries it induces; samples near these boundaries to generate OTIS; applies a confidence suppression loss to OTIS during training to calibrate predictions in structurally uncertain regions. Result: Experiments show the method significantly alleviates OOD overconfidence and outperforms state-of-the-art methods on multiple benchmarks. Conclusion: Using the geometric structure of optimal transport to guide OOD detection and calibration is an effective and principled approach that offers a new route to more robust and trustworthy models. Abstract: Deep neural networks (DNNs) often produce overconfident predictions on out-of-distribution (OOD) inputs, undermining their reliability in open-world environments. Singularities in semi-discrete optimal transport (OT) mark regions of semantic ambiguity, where classifiers are particularly prone to unwarranted high-confidence predictions. Motivated by this observation, we propose a principled framework to mitigate OOD overconfidence by leveraging the geometry of OT-induced singular boundaries. Specifically, we formulate an OT problem between a continuous base distribution and the latent embeddings of training data, and identify the resulting singular boundaries. By sampling near these boundaries, we construct a class of OOD inputs, termed optimal transport-induced OOD samples (OTIS), which are geometrically grounded and inherently semantically ambiguous. During training, a confidence suppression loss is applied to OTIS to guide the model toward more calibrated predictions in structurally uncertain regions. Extensive experiments show that our method significantly alleviates OOD overconfidence and outperforms state-of-the-art methods.

[101] Do Pathology Foundation Models Encode Disease Progression? A Pseudotime Analysis of Visual Representations

Pritika Vig,Ren-Chin Wu,William Lotter

Main category: cs.CV

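TL;DR: This paper investigates whether vision foundation models implicitly learn continuous disease progression from static images. Using diffusion pseudotime, it verifies that several pathology models recover disease trajectories in representation space and finds that trajectory fidelity correlates strongly with few-shot classification performance.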

Details

Motivation: Vision foundation models excel at classification, but whether they encode the continuous processes underlying their training data is unclear; in computational pathology, models that implicitly capture continuous disease progression may better reflect the underlying biology, generalize more robustly, and support quantitative analysis. Method: Applies diffusion pseudotime, a method originating in single-cell transcriptomics, to probe whether foundation models organize coherent disease-progression directions in representation space, evaluating four cancer progressions and six models. Result: All pathology-specific models significantly exceed null baselines, with vision-only models reaching the highest trajectory fidelity (τ > 0.78 on CRC-Serrated); model rankings by trajectory fidelity correlate strongly with few-shot classification performance (ρ = 0.92); cell-type composition along the inferred trajectories changes in line with known stromal remodeling. Conclusion: Vision foundation models can implicitly learn continuous processes from independent static observations alone; trajectory fidelity is a new, complementary measure of representation quality; the framework can extend to other domains where continuous processes are observed through static snapshots. Abstract: Vision foundation models trained on discretely sampled images achieve strong performance on classification benchmarks, yet whether their representations encode the continuous processes underlying their training data remains unclear. This question is especially pertinent in computational pathology, where we posit that models whose latent representations implicitly capture continuous disease progression may better reflect underlying biology, support more robust generalization, and enable quantitative analyses of features associated with disease transitions. Using diffusion pseudotime, a method developed to infer developmental trajectories from single-cell transcriptomics, we probe whether foundation models organize disease states along coherent progression directions in representation space. Across four cancer progressions and six models, we find that all pathology-specific models recover trajectory orderings significantly exceeding null baselines, with vision-only models achieving the highest fidelities ($\tau > 0.78$ on CRC-Serrated). Model rankings by trajectory fidelity on reference diseases strongly predict few-shot classification performance on held-out diseases ($\rho = 0.92$), and exploratory analysis shows cell-type composition varies smoothly along inferred trajectories in patterns consistent with known stromal remodeling. Together, these results demonstrate that vision foundation models can implicitly learn to represent continuous processes from independent static observations, and that trajectory fidelity provides a complementary measure of representation quality beyond downstream performance. While demonstrated in pathology, this framework could be applied to other domains where continuous processes are observed through static snapshots.
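The trajectory fidelity in [101] is reported as a rank correlation τ. As a rough illustration, assuming τ is a Kendall-style correlation between inferred pseudotime and ordinal disease stage (the abstract does not say which variant is used), a tie-aware Kendall tau-b can be computed as below. The stage labels and pseudotime values are made up for the example.

```python
import math

def kendall_tau_b(x, y):
    """Kendall tau-b: rank correlation that corrects for ties in x or y."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both: excluded everywhere
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + tied_x_only)
                      * (concordant + discordant + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0

# Hypothetical data: three ordinal stages, two samples per stage.
stages = [0, 0, 1, 1, 2, 2]
pseudotime = [0.05, 0.10, 0.30, 0.45, 0.70, 0.95]
tau = kendall_tau_b(pseudotime, stages)
```

Because several samples share a stage label, a perfectly ordered pseudotime still gives a value below 1 under tau-b, which is worth keeping in mind when reading thresholds such as τ > 0.78.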

Ji-Xuan He,Guohang Zhuang,Junge Bo,Tingyi Li,Chen Ling,Yanan Qiao

Main category: cs.CV

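TL;DR: This paper proposes SR²-Net, a lightweight plug-and-play spectral rectification network for hyperspectral image super-resolution (HSI-SR); its enhance-then-rectify pipeline raises spatial resolution while preserving spectral consistency and physical plausibility.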

Details

Motivation: Existing HSI-SR methods focus on modeling spatial correlations but neglect spectral consistency across bands, which leads to artifacts and physically implausible results; hard-coding spectral constraints into the architecture sacrifices generality and flexibility. Method: Proposes SR²-Net, which comprises Hierarchical Spectral-Spatial Synergy Attention (H-S³A) to strengthen cross-band interactions and Manifold Consistency Rectification (MCR) to constrain the reconstructed spectra to a compact, physically plausible spectral manifold, together with a degradation-consistency loss that enforces data fidelity. Result: Validated on multiple benchmarks and different backbones, SR²-Net markedly improves spectral fidelity and overall reconstruction quality with negligible computational overhead. Conclusion: SR²-Net is a general, flexible, plug-and-play spectral rectification module that addresses spectral distortion in HSI-SR while balancing performance and efficiency. Abstract: HSI-SR aims to enhance spatial resolution while preserving spectrally faithful and physically plausible characteristics. Recent methods have achieved great progress by leveraging spatial correlations to enhance spatial resolution. However, these methods often neglect spectral consistency across bands, leading to spurious oscillations and physically implausible artifacts. While spectral consistency can be addressed by designing the network architecture, it results in a loss of generality and flexibility. To address this issue, we propose a lightweight plug-and-play rectifier with physical priors, the Spectral Rectification Super-Resolution Network (SR$^{2}$-Net), which can be attached to a wide range of HSI-SR models without modifying their architectures. SR$^{2}$-Net follows an enhance-then-rectify pipeline consisting of (i) Hierarchical Spectral-Spatial Synergy Attention (H-S$^{3}$A) to reinforce cross-band interactions and (ii) Manifold Consistency Rectification (MCR) to constrain the reconstructed spectra to a compact, physically plausible spectral manifold. In addition, we introduce a degradation-consistency loss to enforce data fidelity by encouraging the degraded SR output to match the observed low resolution input. Extensive experiments on multiple benchmarks and diverse backbones demonstrate consistent improvements in spectral fidelity and overall reconstruction quality with negligible computational overhead. Our code will be released upon publication.
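The degradation-consistency loss described above asks the degraded super-resolved output to match the observed low-resolution input. A minimal sketch for a single band, assuming average pooling as the degradation operator and an L1 penalty (both are assumptions; the abstract does not specify either), could look like this.

```python
def downsample(image, scale):
    """Average-pool a 2D list by an integer factor (assumed degradation)."""
    rows, cols = len(image) // scale, len(image[0]) // scale
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [image[r * scale + i][c * scale + j]
                     for i in range(scale) for j in range(scale)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def degradation_consistency_loss(sr_band, lr_band, scale):
    """Mean absolute error between the degraded SR band and the LR band."""
    degraded = downsample(sr_band, scale)
    diffs = [abs(d - l)
             for d_row, l_row in zip(degraded, lr_band)
             for d, l in zip(d_row, l_row)]
    return sum(diffs) / len(diffs)

# Hypothetical 2x2 low-resolution band and two 4x4 SR candidates.
lr = [[1.0, 2.0], [3.0, 4.0]]
sr_consistent = [[1.0, 1.0, 2.0, 2.0],
                 [1.0, 1.0, 2.0, 2.0],
                 [3.0, 3.0, 4.0, 4.0],
                 [3.0, 3.0, 4.0, 4.0]]
sr_shifted = [[v + 0.5 for v in row] for row in sr_consistent]

loss_consistent = degradation_consistency_loss(sr_consistent, lr, scale=2)
loss_shifted = degradation_consistency_loss(sr_shifted, lr, scale=2)
```

A real hyperspectral pipeline would apply the same check per band, typically with a blur kernel before downsampling; the point of the sketch is only that the loss needs no ground-truth high-resolution image.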

[103] Dynamical Adapter Fusion: Constructing A Global Adapter for Pre-Trained Model-based Class-Incremental Learning

Ruiqi Liu,Boyu Diao,Zijia An,Zhulin An,Fei Wang,Yongjun Xu

Main category: cs.CV

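TL;DR: This paper proposes Dynamical Adapter Fusion (DAF) for class-incremental learning. Using PAC-Bayes theory and a Taylor expansion of the loss function, it dynamically fuses task-specific, global historical, and initialization parameters to balance stability and plasticity and, combined with a robust initialization strategy, achieves SOTA performance on multiple benchmarks.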

Details

Motivation: In existing CIL methods, task-specific adapters hinder knowledge transfer and incur high retrieval costs, while naive parameter fusion easily leads to catastrophic forgetting and destructive interference. Method: Designs a dynamic adapter fusion mechanism grounded in the PAC-Bayes theorem and derives the optimal fusion coefficients from a Taylor expansion of the loss function; also proposes a robust initialization strategy to capture global knowledge patterns. Result: Achieves state-of-the-art (SOTA) performance on multiple class-incremental learning benchmarks. Conclusion: By explicitly fusing three kinds of parameters and dynamically adjusting their weights, DAF eases the stability-plasticity dilemma and improves the continual learning ability of CIL models. Abstract: Class-Incremental Learning (CIL) requires models to continuously acquire new classes without forgetting previously learned ones. A dominant paradigm involves freezing a pre-trained model and training lightweight, task-specific adapters. However, maintaining task-specific parameters hinders knowledge transfer and incurs high retrieval costs, while naive parameter fusion often leads to destructive interference and catastrophic forgetting. To address these challenges, we propose Dynamical Adapter Fusion (DAF) to construct a single robust global adapter. Grounded in the PAC-Bayes theorem, we derive a fusion mechanism that explicitly integrates three components: the optimized task-specific adapter parameters, the previous global adapter parameters, and the initialization parameters. We utilize the Taylor expansion of the loss function to derive the optimal fusion coefficients, dynamically achieving the best balance between stability and plasticity. Furthermore, we propose a Robust Initialization strategy to effectively capture global knowledge patterns. Experiments on multiple CIL benchmarks demonstrate that DAF achieves state-of-the-art (SOTA) performance.

[104] Semantic-Guided Dynamic Sparsification for Pre-Trained Model-based Class-Incremental Learning

Ruiqi Liu,Boyu Diao,Zijia An,Runjie Shao,Zhulin An,Fei Wang,Yongjun Xu

Main category: cs.CV

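TL;DR: This paper proposes Semantic-Guided Dynamic Sparsification (SGDS), a new method for tackling catastrophic forgetting in class-incremental learning. By actively guiding the orientation and rank of subspaces in the activation space, it balances knowledge transfer against interference suppression and avoids the harm that conventional parameter orthogonality constraints do to model plasticity.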

Details

Motivation: Conventional class-incremental learning methods based on lightweight adapters often force parameters to be orthogonal to prevent inter-task interference, but this parameter constraint harms model plasticity. Method: Proposes Semantic-Guided Dynamic Sparsification (SGDS), which governs the orientation and rank of activation subspaces through targeted sparsification: similar classes share a compact activation subspace to promote knowledge transfer, and dissimilar classes are assigned non-overlapping activation subspaces to prevent interference. Result: Extensive experiments on multiple benchmark datasets show that SGDS achieves state-of-the-art performance. Conclusion: By building class-specific sparse subspaces in the activation space, SGDS mitigates inter-task interference without imposing rigid parameter constraints and improves the model's continual learning ability. Abstract: Class-Incremental Learning (CIL) requires a model to continually learn new classes without forgetting old ones. A common and efficient solution freezes a pre-trained model and employs lightweight adapters, whose parameters are often forced to be orthogonal to prevent inter-task interference. However, we argue that this parameter-constraining method is detrimental to plasticity. To this end, we propose Semantic-Guided Dynamic Sparsification (SGDS), a novel method that proactively guides the activation space by governing the orientation and rank of its subspaces through targeted sparsification. Specifically, SGDS promotes knowledge transfer by encouraging similar classes to share a compact activation subspace, while simultaneously preventing interference by assigning non-overlapping activation subspaces to dissimilar classes. By sculpting class-specific sparse subspaces in the activation space, SGDS effectively mitigates interference without imposing rigid constraints on the parameter space. Extensive experiments on various benchmark datasets demonstrate the state-of-the-art performance of SGDS.

[105] Towards Geometry-Aware and Motion-Guided Video Human Mesh Recovery

Hongjun Chen,Huan Zheng,Wencheng Han,Jianbing Shen

Main category: cs.CV

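TL;DR: This paper proposes HMRMamba, a new video-based 3D human mesh recovery framework built on structured state space models (SSMs); a geometry-aware lifting module and a motion-guided reconstruction network significantly improve reconstruction accuracy, temporal consistency, and computational efficiency.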

Details

Motivation: Existing video-based 3D human mesh recovery methods rely on flawed intermediate 3D pose anchors and struggle to model complex spatiotemporal dynamics, which leads to physically implausible results. Method: Proposes the HMRMamba framework with two core modules: 1) a geometry-aware lifting module that uses a dual-scan Mamba architecture and fuses geometric cues from image features for 2D-to-3D pose lifting; 2) a motion-guided reconstruction network that takes the lifted 3D pose sequence as an anchor and explicitly models temporal kinematic patterns. Result: Sets a new SOTA on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets, with better reconstruction accuracy and temporal consistency and higher computational efficiency. Conclusion: By introducing SSMs and jointly modeling geometry and motion, HMRMamba overcomes inherent flaws of conventional HMR methods and offers a new paradigm for video-driven 3D human reconstruction. Abstract: Existing video-based 3D Human Mesh Recovery (HMR) methods often produce physically implausible results, stemming from their reliance on flawed intermediate 3D pose anchors and their inability to effectively model complex spatiotemporal dynamics. To overcome these deep-rooted architectural problems, we introduce HMRMamba, a new paradigm for HMR that pioneers the use of Structured State Space Models (SSMs) for their efficiency and long-range modeling prowess. Our framework is distinguished by two core contributions. First, the Geometry-Aware Lifting Module, featuring a novel dual-scan Mamba architecture, creates a robust foundation for reconstruction. It directly grounds the 2D-to-3D pose lifting process with geometric cues from image features, producing a highly reliable 3D pose sequence that serves as a stable anchor. Second, the Motion-guided Reconstruction Network leverages this anchor to explicitly process kinematic patterns over time. By injecting this crucial temporal awareness, it significantly enhances the final mesh's coherence and robustness, particularly under occlusion and motion blur. Comprehensive evaluations on 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks confirm that HMRMamba sets a new state-of-the-art, outperforming existing methods in both reconstruction accuracy and temporal consistency while offering superior computational efficiency.

[106] Rectifying Geometry-Induced Similarity Distortions for Real-World Aerial-Ground Person Re-Identification

Kailash A. Hambarde,Hugo Proença

Main category: cs.CV

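TL;DR: This paper proposes GIQT, which uses a geometry-induced query-key transformation to explicitly rectify the similarity-space distortions caused by viewpoint and scale differences and, combined with a geometry-conditioned prompt generation mechanism, improves aerial-ground cross-view person re-identification.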

Details

Motivation: Existing methods implicitly assume that dot-product similarity remains reliable under large viewpoint and scale variations, but in practice extreme camera geometry systematically distorts the query-key similarity space and degrades attention-based matching. Method: Proposes GIQT, a lightweight low-rank module that explicitly rectifies the query-key similarity computation based on camera geometry, and introduces a geometry-conditioned prompt generation mechanism that provides global, view-adaptive representation priors. Result: Experiments on four aerial-ground person re-identification benchmarks show improved robustness under extreme and unseen geometric conditions, with minimal computational overhead. Conclusion: Explicitly modeling how camera geometry affects the similarity space is more effective than implicit feature alignment, and GIQT offers a new approach to cross-view ReID. Abstract: Aerial-ground person re-identification (AG-ReID) is fundamentally challenged by extreme viewpoint and distance discrepancies between aerial and ground cameras, which induce severe geometric distortions and invalidate the assumption of a shared similarity space across views. Existing methods primarily rely on geometry-aware feature learning or appearance-conditioned prompting, while implicitly assuming that the geometry-invariant dot-product similarity used in attention mechanisms remains reliable under large viewpoint and scale variations. We argue that this assumption does not hold. Extreme camera geometry systematically distorts the query-key similarity space and degrades attention-based matching, even when feature representations are partially aligned. To address this issue, we introduce Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that explicitly rectifies the similarity space by conditioning query-key interactions on camera geometry. Rather than modifying feature representations or the attention formulation itself, GIQT adapts the similarity computation to compensate for dominant geometry-induced anisotropic distortions. Building on this local similarity rectification, we further incorporate a geometry-conditioned prompt generation mechanism that provides global, view-adaptive representation priors derived directly from camera geometry. Experiments on four aerial-ground person re-identification benchmarks demonstrate that the proposed framework consistently improves robustness under extreme and previously unseen geometric conditions, while introducing minimal computational overhead compared to state-of-the-art methods.

[107] Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

Zihan Su,Hongyang Wei,Kangrui Cen,Yong Wang,Guanhua Chen,Chun Yuan,Xiangxiang Chu

Main category: cs.CV

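TL;DR: This paper proposes UniMRG, an architecture-agnostic post-training method that adds multi-representation generation tasks (pixel, depth, and segmentation) to unified multimodal models (UMMs) to improve their visual understanding, so that understanding and generation reinforce each other.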

Details

Motivation: Existing post-training methods for UMMs mainly use understanding to improve generation; the reverse direction, using generation to strengthen understanding, remains underexplored. Method: Proposes UniMRG, which, alongside standard visual understanding objectives, trains UMMs to generate multiple intrinsic representations of an image (pixel reconstruction, depth maps, segmentation maps), thereby capturing complementary information about appearance, spatial relations, and structural layout. Result: Experiments across multiple UMM architectures show the method notably improves fine-grained perception, reduces hallucinations, and strengthens spatial understanding, while improving rather than harming generation. Conclusion: Generation tasks can feed back into understanding; joint multi-representation generation is a simple, general, and efficient post-training paradigm for UMMs that advances the co-evolution of understanding and generation. Abstract: Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.

[108] MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations

Xinan He,Kaiqing Lin,Yue Zhou,Jiaming Zhong,Wei Ye,Wenhui Yi,Bing Fan,Feng Ding,Haodong Li,Bo Cao,Bin Li

Main category: cs.CV

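TL;DR: This paper proposes a dual-path detection framework based on the "Manifold Projection Fluctuations" (MPF) phenomenon for identifying high-fidelity AI-generated videos, even those with no obvious macro-level semantic or temporal errors.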

Details

Motivation: Although current video generation models produce synthetic content of very high visual quality, the authors argue that AI videos are essentially the result of manifold fitting, so their pixel composition logic retains a detectable structured signature (MPF) and real and fake can still be told apart. Method: Proposes a hierarchical dual-path framework: 1) a static manifold deviation branch that uses large-scale vision foundation models (VFMs) to capture spatial off-manifold anomalies; 2) a micro-temporal fluctuation branch that analyzes the MPF signature remaining between consecutive frames and serves as a fine-grained secondary filter. Result: The framework detects high-fidelity AI-generated videos even when they contain no macro-level errors and lie strictly on the real-world manifold, covering both global off-manifold deviations and subtle computational fingerprints. Conclusion: AI-generated videos approach reality, but their underlying manifold-fitting mechanism makes them inherently detectable; the proposed dual-path method offers a new paradigm for detecting high-fidelity forged videos. Abstract: With the rapid advancement of video generation models such as Veo and Wan, the visual quality of synthetic content has reached a level where macro-level semantic errors and temporal inconsistencies are no longer prominent. However, this does not imply that the distinction between real and cutting-edge high-fidelity fake is untraceable. We argue that AI-generated videos are essentially products of a manifold-fitting process rather than a physical recording. Consequently, the pixel composition logic of consecutive adjacent frames residual in AI videos exhibits a structured and homogeneous characteristic. We term this phenomenon "Manifold Projection Fluctuations" (MPF). Driven by this insight, we propose a hierarchical dual-path framework that operates as a sequential filtering process. The first, the Static Manifold Deviation Branch, leverages the refined perceptual boundaries of Large-Scale Vision Foundation Models (VFMs) to capture residual spatial anomalies or physical violations that deviate from the natural real-world manifold (off-manifold). For the remaining high-fidelity videos that successfully reside on-manifold and evade spatial detection, we introduce the Micro-Temporal Fluctuation Branch as a secondary, fine-grained filter. By analyzing the structured MPF that persists even in visually perfect sequences, our framework ensures that forgeries are exposed regardless of whether they manifest as global real-world manifold deviations or subtle computational fingerprints.

[109] From Implicit Ambiguity to Explicit Solidity: Diagnosing Interior Geometric Degradation in Neural Radiance Fields for Dense 3D Scene Understanding

Jiangsan Zhao,Jakob Geipel,Kryzysztof Kusnierek

Main category: cs.CV

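TL;DR: This paper reveals Interior Geometric Degradation (IGD), a failure of NeRFs in dense, self-occluding scenes caused by their implicit density fields, and proposes an explicit geometric reconstruction method based on Sparse Voxel Rasterization (SVRaster) that significantly raises instance recovery and improves robustness to supervision noise.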

Details

Motivation: The reliability of NeRFs for quantitative 3D analysis in dense, self-occluding scenes is unclear; in particular, implicit density fields can distort interior structure. Method: Proposes an explicit geometric reconstruction pipeline based on Sparse Voxel Rasterization (SVRaster), initialized from SfM feature geometry, that projects 2D instance masks and applies recursive voxel splitting to enforce geometric separation and physical solidity. Result: On dense synthetic datasets, SVRaster reaches a 95.8% instance recovery rate, about 7 percentage points above the state-of-the-art mask-supervised NeRF; with degraded segmentation masks, it recovers 43% more instances than implicit baselines. Conclusion: Explicit geometric priors are a prerequisite for reliable quantitative analysis in highly self-occluding 3D scenes. Abstract: Neural Radiance Fields (NeRFs) have emerged as a powerful paradigm for multi-view reconstruction, complementing classical photogrammetric pipelines based on Structure-from-Motion (SfM) and Multi-View Stereo (MVS). However, their reliability for quantitative 3D analysis in dense, self-occluding scenes remains poorly understood. In this study, we identify a fundamental failure mode of implicit density fields under heavy occlusion, which we term Interior Geometric Degradation (IGD). We show that transmittance-based volumetric optimization satisfies photometric supervision by reconstructing hollow or fragmented structures rather than solid interiors, leading to systematic instance undercounting. Through controlled experiments on synthetic datasets with increasing occlusion, we demonstrate that state-of-the-art mask-supervised NeRFs saturate at approximately 89% instance recovery in dense scenes, despite improved surface coherence and mask quality. To overcome this limitation, we introduce an explicit geometric pipeline based on Sparse Voxel Rasterization (SVRaster), initialized from SfM feature geometry. By projecting 2D instance masks onto an explicit voxel grid and enforcing geometric separation via recursive splitting, our approach preserves physical solidity and achieves a 95.8% recovery rate in dense clusters. A sensitivity analysis using degraded segmentation masks further shows that explicit SfM-based geometry is substantially more robust to supervision failure, recovering 43% more instances than implicit baselines. These results demonstrate that explicit geometric priors are a prerequisite for reliable quantitative analysis in highly self-occluding 3D scenes.
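The claim in [109] that transmittance-based optimization can satisfy photometric supervision with hollow interiors is easy to see from the standard NeRF quadrature, where sample i contributes weight w_i = T_i (1 - exp(-σ_i δ)) with T_i the accumulated transmittance. The sketch below uses made-up densities along one ray; it illustrates the general rendering formula, not the paper's experiments.

```python
import math

def render_weights(densities, step):
    """Per-sample weights w_i = T_i * alpha_i along one ray.

    alpha_i = 1 - exp(-sigma_i * step); T_i = prod_{j<i} (1 - alpha_j).
    """
    weights = []
    transmittance = 1.0
    for sigma in densities:
        alpha = 1.0 - math.exp(-sigma * step)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

# Hypothetical ray: a solid object versus the same surface with an
# empty interior. The first dense sample absorbs almost all the weight.
solid = render_weights([0.0, 50.0, 50.0, 50.0, 50.0], step=0.1)
hollow = render_weights([0.0, 50.0, 0.0, 0.0, 50.0], step=0.1)
```

Both rays put more than 99% of their weight on the first dense sample, so the rendered color is nearly identical and the photometric loss gives almost no gradient that would distinguish a solid interior from a hollow one.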

[110] MultiModal Fine-tuning with Synthetic Captions

Shohei Enomoto,Shin'ya Yamaguchi

Main category: cs.CV

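TL;DR: This paper proposes a new method that uses multimodal large language models (MLLMs) to generate high-quality synthetic captions for unimodal image data, turning unimodal fine-tuning into multimodal fine-tuning. It adds a supervised contrastive loss and an inference strategy based on class-averaged text embeddings, and significantly improves performance on 13 image classification benchmarks, especially in few-shot settings.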

Details

Motivation: Pre-training has shifted to multimodal learning, but fine-tuning remains mostly unimodal, so the multimodal representations learned in pre-training go underused: a fundamental modality-mismatch gap. Method: Uses MLLMs with prompts that incorporate class labels and domain context to generate synthetic image captions; designs a supervised contrastive loss that strengthens clustering of same-class representations; proposes inference with class-averaged text embeddings computed from multiple synthetic captions per image. Result: Outperforms baseline methods on 13 image classification benchmarks, with especially large gains in few-shot learning. Conclusion: The method establishes a new paradigm connecting multimodal pre-training and fine-tuning, effectively bridging the modality gap through dataset enhancement. Abstract: In this paper, we address a fundamental gap between pre-training and fine-tuning of deep neural networks: while pre-training has shifted from unimodal to multimodal learning with enhanced visual understanding, fine-tuning predominantly remains unimodal, limiting the benefits of rich pre-trained representations. To bridge this gap, we propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for fine-tuning models with a multimodal objective. Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions tailored for classification tasks. Furthermore, we introduce a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, along with a new inference technique that leverages class-averaged text embeddings from multiple synthetic captions per image. Extensive experiments across 13 image classification benchmarks demonstrate that our approach outperforms baseline methods, with particularly significant improvements in few-shot learning scenarios. Our work establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning. Our code is available at https://github.com/s-enmt/MMFT.
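The inference step in [110] relies on class-averaged text embeddings. A minimal sketch of that idea, assuming caption and image embeddings already live in a shared space and that classification uses cosine similarity to the averaged prototype (both assumptions, with toy 3-dimensional vectors standing in for real encoder outputs), is shown below.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_prototypes(caption_embeddings):
    """Average the normalized caption embeddings of each class."""
    protos = {}
    for label, embs in caption_embeddings.items():
        normed = [normalize(e) for e in embs]
        dim = len(normed[0])
        mean = [sum(e[d] for e in normed) / len(normed) for d in range(dim)]
        protos[label] = normalize(mean)
    return protos

def classify(image_embedding, protos):
    """Return the label whose prototype is most cosine-similar."""
    img = normalize(image_embedding)
    scores = {label: sum(a * b for a, b in zip(img, p))
              for label, p in protos.items()}
    return max(scores, key=scores.get)

# Hypothetical caption embeddings for two classes.
captions = {
    "cat": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "dog": [[0.1, 0.9, 0.0], [0.2, 0.7, 0.1]],
}
protos = class_prototypes(captions)
prediction = classify([0.7, 0.3, 0.05], protos)
```

Averaging over several captions smooths out the wording of any single synthetic caption, which is presumably why the paper averages rather than using one caption per class.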

[111] Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

Yuxiang Huang,Mingye Li,Xu Han,Chaojun Xiao,Weilin Zhao,Ao Sun,Ziqi Yuan,Hao Zhou,Fandong Meng,Zhiyuan Liu

Main category: cs.CV

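TL;DR: This paper proposes Spava, a sequence-parallel framework for accelerating long-video inference; distributed approximate attention and system-level optimizations enable efficient multi-GPU computation, significantly speeding up inference without notable performance loss.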

Details

Motivation: Existing methods compress visual embeddings or apply sparse attention on a single GPU, which yields limited acceleration or degraded performance and makes it hard to handle longer, more complex videos. Method: Proposes the Spava framework, which combines sequence parallelism with an optimized approximate attention mechanism and system-level optimizations such as load balancing and fused forward passes. Result: Spava achieves speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, respectively, without notable performance loss. Conclusion: Spava removes a key bottleneck in long-video inference and scales efficiently across multiple GPUs, offering a new paradigm for processing long videos with large multimodal models. Abstract: The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB

[112] Variance & Greediness: A comparative study of metric-learning losses

Donghuo Zeng,Hao Niu,Zhi Li,Masato Taya

Main category: cs.CV

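TL;DR: This paper proposes the VARIANCE and GREEDINESS diagnostic framework, systematically analyzes how seven metric-learning losses differ in embedding geometry and optimization dynamics, reveals an efficiency-granularity trade-off, and offers practical guidance for different retrieval scenarios.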

Details

Motivation: Metric learning is central to retrieval, but its effects on embedding geometry and optimization dynamics are not well understood. Method: Builds a diagnostic framework, VARIANCE (intra-/inter-class variance) and GREEDINESS (active ratio and gradient norms), and compares seven representative losses (Contrastive, Triplet, N-pair, InfoNCE, ArcFace, SCL, CCL) on five image-retrieval datasets. Result: Triplet and SCL preserve higher intra-class variance and clearer inter-class margins, improving top-1 performance in fine-grained retrieval; Contrastive and InfoNCE compact embeddings quickly through many small updates, which speeds up convergence but may oversimplify class structure; N-pair achieves a large mean separation but with uneven inter-class spacing. Conclusion: The losses trade off efficiency (convergence speed, embedding compactness) against granularity (fidelity of class structure, fine-grained discrimination), so the choice should follow task needs: Triplet/SCL when diversity preservation and hard-sample discrimination matter, Contrastive/InfoNCE when fast embedding compaction is the goal. Abstract: Metric learning is central to retrieval, yet its effects on embedding geometry and optimization dynamics are not well understood. We introduce a diagnostic framework, VARIANCE (intra-/inter-class variance) and GREEDINESS (active ratio and gradient norms), to compare seven representative losses, i.e., Contrastive, Triplet, N-pair, InfoNCE, ArcFace, SCL, and CCL, across five image-retrieval datasets. Our analysis reveals that Triplet and SCL preserve higher within-class variance and clearer inter-class margins, leading to stronger top-1 retrieval in fine-grained settings. In contrast, Contrastive and InfoNCE compact embeddings quickly through many small updates, accelerating convergence but potentially oversimplifying class structures. N-pair achieves a large mean separation but with uneven spacing. These insights reveal a form of efficiency-granularity trade-off and provide practical guidance: prefer Triplet/SCL when diversity preservation and hard-sample discrimination are critical, and Contrastive/InfoNCE when faster embedding compaction is desired.
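The VARIANCE half of the diagnostic in [112] tracks intra- and inter-class variance of the embeddings. The abstract does not give the exact estimators, so the following is a sketch under the usual definitions: mean squared distance to the class centroid for intra-class variance, and size-weighted mean squared distance of centroids to the global mean for inter-class variance. The embeddings are toy 2D points.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def variance_diagnostics(embeddings, labels):
    """Return (intra_class_variance, inter_class_variance)."""
    by_class = {}
    for emb, lab in zip(embeddings, labels):
        by_class.setdefault(lab, []).append(emb)
    centroids = {lab: mean_vector(vs) for lab, vs in by_class.items()}
    global_mean = mean_vector(embeddings)
    n = len(embeddings)
    intra = sum(sq_dist(e, centroids[l])
                for e, l in zip(embeddings, labels)) / n
    inter = sum(len(vs) * sq_dist(centroids[lab], global_mean)
                for lab, vs in by_class.items()) / n
    return intra, inter

# Hypothetical embeddings: two classes, two samples each.
embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = ["a", "a", "b", "b"]
intra, inter = variance_diagnostics(embeddings, labels)
```

Logging these two numbers over training epochs is one way to see the behavior the abstract describes, for example a loss that drives intra-class variance toward zero quickly versus one that preserves it.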

[113] Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization

Midou Guo,Qilin Yin,Wei Lu,Xiangyang Luo,Rui Yang

Main category: cs.CV

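TL;DR: This paper proposes RT-DeepLoc, a weakly supervised temporal deepfake localization framework that localizes forged segments using reconstruction errors from a masked autoencoder (MAE) trained only on real videos, and introduces an Asymmetric Intra-video Contrastive Loss (AICL) to improve localization robustness and generalization, reaching SOTA performance on large datasets such as LAV-DF.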

Details

Motivation: Modern deepfakes are localized and intermittent and require fine-grained temporal localization; frame-level annotation is prohibitively expensive, so weakly supervised methods that rely only on video-level labels are needed. Method: Proposes the reconstruction-based temporal deepfake localization framework RT-DeepLoc: 1) a Masked Autoencoder trained only on authentic data models spatiotemporal regularities, so forged regions produce large reconstruction errors; 2) an Asymmetric Intra-video Contrastive Loss (AICL) uses the reconstruction errors to guide the compactness of authentic features and build a stable decision boundary. Result: On large-scale datasets including LAV-DF, RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization. Conclusion: Reconstruction error is an effective weak supervision signal for temporal forgery localization; combined with intra-video contrastive learning it balances discrimination and generalization, offering a new paradigm for deepfake detection without frame-level annotation. Abstract: Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization. The prohibitive cost of frame-level annotation makes weakly supervised methods a practical necessity, which rely only on video-level labels. To this end, we propose Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc), a weakly supervised temporal forgery localization framework that identifies forgeries via reconstruction errors. Our framework uses a Masked Autoencoder (MAE) trained exclusively on authentic data to learn its intrinsic spatiotemporal patterns; this allows the model to produce significant reconstruction discrepancies for forged segments, effectively providing the missing fine-grained cues for localization. To robustly leverage these indicators, we introduce a novel Asymmetric Intra-video Contrastive Loss (AICL). By focusing on the compactness of authentic features guided by these reconstruction cues, AICL establishes a stable decision boundary that enhances local discrimination while preserving generalization to unseen forgeries. Extensive experiments on large-scale datasets, including LAV-DF, demonstrate that RT-DeepLoc achieves state-of-the-art performance in weakly-supervised temporal forgery localization.
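To make the reconstruction-error cue in [113] concrete: if an autoencoder trained only on authentic video reconstructs forged frames poorly, per-frame errors already hint at where the forgery sits. The sketch below turns such errors into segments with a fixed threshold. This is a deliberate simplification for illustration; RT-DeepLoc itself learns the decision boundary with AICL, and the error values and threshold here are invented.

```python
def localize_segments(errors, threshold):
    """Group consecutive frames whose error exceeds the threshold.

    Returns (start, end) frame-index pairs with end exclusive.
    """
    segments = []
    start = None
    for i, err in enumerate(errors):
        if err > threshold and start is None:
            start = i                      # segment opens
        elif err <= threshold and start is not None:
            segments.append((start, i))    # segment closes
            start = None
    if start is not None:
        segments.append((start, len(errors)))
    return segments

# Hypothetical per-frame reconstruction errors for an 8-frame clip.
errors = [0.02, 0.03, 0.41, 0.38, 0.02, 0.01, 0.29, 0.33]
segments = localize_segments(errors, threshold=0.2)
```

A fixed threshold is brittle across videos with different compression or motion, which is one plausible reason the paper pairs the error signal with a contrastive objective instead of using it directly.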

[114] Hypernetwork-Based Adaptive Aggregation for Multimodal Multiple-Instance Learning in Predicting Coronary Calcium Debulking

Kaito Shiku,Ichika Seo,Tetsuya Matoba,Rissei Hino,Yasuhiro Nakano,Ryoma Bise

Main category: cs.CV

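TL;DR: This paper makes the first attempt to estimate from CT images whether coronary artery calcification debulking is necessary, formulates the task as a multiple-instance learning (MIL) problem, and proposes a hypernetwork-based adaptive aggregation transformer (HyperAdAgFormer) that uses patient tabular data to adapt the feature aggregation strategy.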

Details

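Motivation: In clinical practice, physicians combine CT images with patient-specific tabular data (such as physiological indicators) to decide whether coronary calcification debulking is needed, but existing methods struggle to fuse these two heterogeneous sources and make personalized decisions. Method: Formulates the task as multiple-instance learning (MIL) and proposes HyperAdAgFormer, in which a hypernetwork generates the parameters of the Transformer's feature aggregation module from the patient's tabular data, so that lesion features are aggregated adaptively for each patient. Result: Experiments on a real clinical dataset demonstrate the effectiveness of HyperAdAgFormer, which significantly outperforms conventional MIL and fusion models; the code is open source. Conclusion: HyperAdAgFormer jointly models images and tabular data, providing interpretable, generalizable AI support for personalized coronary intervention decisions and moving medical imaging AI toward clinical use. Abstract: In this paper, we present the first attempt to estimate the necessity of debulking coronary artery calcifications from computed tomography (CT) images. We formulate this task as a Multiple-instance Learning (MIL) problem. The difficulty of this task lies in that physicians adjust their focus and decision criteria for device usage according to tabular data representing each patient's condition. To address this issue, we propose a hypernetwork-based adaptive aggregation transformer (HyperAdAgFormer), which adaptively modifies the feature aggregation strategy for each patient based on tabular data through a hypernetwork. The experiments using the clinical dataset demonstrated the effectiveness of HyperAdAgFormer. The code is publicly available at https://github.com/Shiku-Kaito/HyperAdAgFormer.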
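The core idea in [114], a hypernetwork that turns tabular data into the parameters of the aggregation step, can be sketched with plain attention pooling. This is a toy stand-in, assuming a single linear hypernetwork that maps the tabular vector to an attention query; the actual HyperAdAgFormer is a transformer and its architecture is not described in the abstract. All names and numbers are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hyper_aggregate(instances, tabular, hyper_weights):
    """Attention-pool instance features with a tabular-conditioned query."""
    query = matvec(hyper_weights, tabular)   # hypernetwork: tabular -> query
    scores = [sum(q * f for q, f in zip(query, inst)) for inst in instances]
    attn = softmax(scores)
    dim = len(instances[0])
    bag = [sum(a * inst[d] for a, inst in zip(attn, instances))
           for d in range(dim)]
    return bag, attn

# Two instance features (e.g. two CT slices) and two patients whose
# tabular data steers attention toward different instances.
instances = [[1.0, 0.0], [0.0, 1.0]]
hyper_weights = [[2.0, 0.0], [0.0, 2.0]]     # feature_dim x tabular_dim
_, attn_a = hyper_aggregate(instances, [1.0, 0.0], hyper_weights)
_, attn_b = hyper_aggregate(instances, [0.0, 1.0], hyper_weights)
```

The same bag of instances yields different attention weights for the two patients, which is the patient-adaptive behavior the abstract describes.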

[115] SimGraph: A Unified Framework for Scene Graph-Based Image Generation and Editing

Thanh-Nhan Vo,Trong-Thuan Nguyen,Tam V. Nguyen,Minh-Triet Tran

Main category: cs.CV

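TL;DR: This paper proposes SimGraph, a unified framework that uses scene graphs to integrate image generation and editing; token-based generation and diffusion-based editing give fine-grained control over object relationships, layouts, and interactions while preserving spatial consistency and semantic coherence, and the framework outperforms existing SOTA methods in experiments.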

Details

Motivation: Existing approaches treat generation and editing separately, which leads to poor spatial consistency and semantic coherence and lacks structured control over object relationships and spatial layout. Method: Proposes the SimGraph framework, which models scene-graph-based image generation and editing in a unified way, combining token-based generation with diffusion-based editing and driving the whole pipeline with the scene graph. Result: Significantly outperforms current state-of-the-art methods in multiple experiments, confirming its advantages in generation quality, editing consistency, and layout control. Conclusion: SimGraph unifies generation and editing and demonstrates the effectiveness and generality of scene graphs as a structured prior for controllable image synthesis. Abstract: Recent advancements in Generative Artificial Intelligence (GenAI) have significantly enhanced the capabilities of both image generation and editing. However, current approaches often treat these tasks separately, leading to inefficiencies and challenges in maintaining spatial consistency and semantic coherence between generated content and edits. Moreover, a major obstacle is the lack of structured control over object relationships and spatial arrangements. Scene graph-based methods, which represent objects and their interrelationships in a structured format, offer a solution by providing greater control over composition and interactions in both image generation and editing. To address this, we introduce SimGraph, a unified framework that integrates scene graph-based image generation and editing, enabling precise control over object interactions, layouts, and spatial coherence. In particular, our framework integrates token-based generation and diffusion-based editing within a single scene graph-driven model, ensuring high-quality and consistent results. Through extensive experiments, we empirically demonstrate that our approach outperforms existing state-of-the-art methods.

[116] HERS: Hidden-Pattern Expert Learning for Risk-Specific Vehicle Damage Adaptation in Diffusion Models

Teerapong Panboonyuen

Main category: cs.CV

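TL;DR: This paper proposes the HERS framework, which uses annotation-free self-supervised learning to improve the fidelity, controllability, and domain alignment of text-to-image diffusion models for vehicle damage generation, significantly improving text faithfulness and human preference ratings, and discusses the implications for insurance fraud detection and safe deployment.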

Details

Motivation: As text-to-image diffusion models synthesize increasingly realistic vehicle damage, their reliability in automated insurance workflows is called into question, with a risk of misuse for insurance fraud or claim manipulation; the realism and trustworthiness of generated images need to improve. Method: Proposes the HERS (Hidden-Pattern Expert Learning for Risk-Specific Damage Adaptation) framework, which uses a large language model and a T2I pipeline to automatically generate image-text pairs, trains a dedicated expert model for each damage category (such as dents and scratches) in a self-supervised way, and then merges the experts into a unified multi-damage model; no manual annotation is required for the domain-specific fine-tuning. Result: Evaluated on four diffusion backbones, HERS improves text faithfulness by 5.5% and human preference ratings by 2.3% on average over baselines, and makes the generated results more useful for insurance fraud detection, auditability, and safe deployment. Conclusion: HERS improves the trustworthiness and domain fit of vehicle damage image generation, shows that domain-specific diffusion models bring both opportunities and risks in high-stakes settings such as auto insurance, and underscores the need for trustworthy generation in safety-critical applications. Abstract: Recent advances in text-to-image (T2I) diffusion models have enabled increasingly realistic synthesis of vehicle damage, raising concerns about their reliability in automated insurance workflows. The ability to generate crash-like imagery challenges the boundary between authentic and synthetic data, introducing new risks of misuse in fraud or claim manipulation. To address these issues, we propose HERS (Hidden-Pattern Expert Learning for Risk-Specific Damage Adaptation), a framework designed to improve fidelity, controllability, and domain alignment of diffusion-generated damage images. HERS fine-tunes a base diffusion model via domain-specific expert adaptation without requiring manual annotation. Using self-supervised image-text pairs automatically generated by a large language model and T2I pipeline, HERS models each damage category, such as dents, scratches, broken lights, or cracked paint, as a separate expert. These experts are later integrated into a unified multi-damage model that balances specialization with generalization. We evaluate HERS across four diffusion backbones and observe consistent improvements: +5.5% in text faithfulness and +2.3% in human preference ratings compared to baselines. Beyond image fidelity, we discuss implications for fraud detection, auditability, and safe deployment of generative models in high-stakes domains. Our findings highlight both the opportunities and risks of domain-specific diffusion, underscoring the importance of trustworthy generation in safety-critical applications such as auto insurance.

[117] Vision KAN: Towards an Attention-Free Backbone for Vision with Kolmogorov-Arnold Networks

Zhuoqin Yang,Jiansong Zhang,Xiaoling Luo,Xu Wu,Zheng Lu,Linlin Shen

Main category: cs.CV

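TL;DR: This paper proposes Vision KAN (ViK), an attention-free vision backbone built on MultiPatch-RBFKAN, a module inspired by Kolmogorov-Arnold Networks, which models long-range dependencies efficiently at linear complexity and reaches competitive accuracy on ImageNet-1K.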

Details

Motivation: Attention mechanisms suffer from quadratic computational complexity and poor interpretability; the recent success of attention-free architectures motivates the search for more efficient, interpretable alternatives. Method: Proposes Vision KAN (ViK), whose core is MultiPatch-RBFKAN: (a) patch-wise nonlinear transforms with radial-basis-function KANs, (b) axis-wise separable mixing for efficient local propagation, and (c) a low-rank global mapping for long-range dependencies; a patch-wise grouping strategy with lightweight operators replaces full-feature KANs, making the approach scale to high-resolution features. Result: On ImageNet-1K, ViK achieves classification accuracy competitive with mainstream models at linear computational complexity. Conclusion: KAN-based token mixing is an efficient, theoretically grounded alternative to self-attention for vision modeling. Abstract: Attention mechanisms have become a key module in modern vision backbones due to their ability to model long-range dependencies. However, their quadratic complexity in sequence length and the difficulty of interpreting attention weights limit both scalability and clarity. Recent attention-free architectures demonstrate that strong performance can be achieved without pairwise attention, motivating the search for alternatives. In this work, we introduce Vision KAN (ViK), an attention-free backbone inspired by the Kolmogorov-Arnold Networks. At its core lies MultiPatch-RBFKAN, a unified token mixer that combines (a) patch-wise nonlinear transform with Radial Basis Function-based KANs, (b) axis-wise separable mixing for efficient local propagation, and (c) low-rank global mapping for long-range interaction. Employed as a drop-in replacement for attention modules, this formulation tackles the prohibitive cost of full KANs on high-resolution features by adopting a patch-wise grouping strategy with lightweight operators to restore cross-patch dependencies. Experiments on ImageNet-1K show that ViK achieves competitive accuracy with linear complexity, demonstrating the potential of KAN-based token mixing as an efficient and theoretically grounded alternative to attention.
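Component (a) of MultiPatch-RBFKAN builds on RBF-based KANs. In a Kolmogorov-Arnold layer every input-output edge carries its own learnable univariate function, and each output sums those functions over the inputs. A minimal sketch, assuming Gaussian radial basis functions on a fixed grid of centers (a common RBF-KAN parameterization, not necessarily the one used in ViK), is given below with hand-picked coefficients.

```python
import math

def rbf_edge(x, centers, coeffs, width):
    """Univariate edge function: sum_k c_k * exp(-((x - mu_k) / width)^2)."""
    return sum(c * math.exp(-((x - mu) / width) ** 2)
               for mu, c in zip(centers, coeffs))

def rbf_kan_layer(inputs, centers, coeffs, width):
    """coeffs[o][i] holds the RBF coefficients of edge input i -> output o.

    Each output is the sum of its incoming edge functions.
    """
    return [sum(rbf_edge(x, centers, c_oi, width)
                for x, c_oi in zip(inputs, c_o))
            for c_o in coeffs]

# Hypothetical layer: two inputs, one output, three RBF centers.
centers = [-1.0, 0.0, 1.0]
coeffs = [[[0.0, 1.0, 0.0],     # edge from input 0: bump at 0
           [1.0, 0.0, 0.0]]]    # edge from input 1: bump at -1
output = rbf_kan_layer([0.0, -1.0], centers, coeffs, width=0.5)
```

The cost of such a layer grows with inputs × outputs × centers, which is why the abstract speaks of the prohibitive cost of full KANs on high-resolution features and resorts to patch-wise grouping.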

[118] Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Hongxu Chen,Hongxiang Li,Zhen Wang,Long Chen

Main category: cs.CV

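TL;DR: This paper proposes BA-solver, a new solver that pairs a lightweight SideNet with a frozen backbone to sharply reduce the number of neural function evaluations (NFEs) without significantly increasing training cost, accelerating generation for Flow Matching models while preserving output quality and plug-and-play compatibility.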

Details

Motivation: Flow Matching models face a significant latency bottleneck because they rely on iterative ODE solving; existing acceleration methods either degrade severely at low NFEs or incur high training cost and lack generality. Method: The paper proposes the Bi-Anchor Interpolation Solver (BA-solver): 1) Bidirectional Temporal Perception, in which the SideNet learns to predict both future and historical velocities without retraining the backbone; 2) Bi-Anchor Velocity Integration, which uses the SideNet with two anchor velocities to efficiently estimate intermediate velocities for batched high-order integration; the backbone supplies high-precision anchors and the SideNet densifies the trajectory. Result: On ImageNet-256², BA-solver needs only 10 NFEs to match the generation quality of a 100+ NFE Euler solver, stays high-fidelity at 5 NFEs, adds minimal training overhead, and integrates seamlessly into existing generative pipelines (e.g., image editing). Conclusion: BA-solver keeps the generality of training-free solvers while markedly improving the inference efficiency of Flow Matching models, combining quality at low NFEs, low training cost, and downstream compatibility, and offers a new paradigm for practical deployment. Abstract: Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error.
Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
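To make the NFE trade-off concrete, here is a minimal sketch of the fixed-step Euler baseline that the abstract compares against, not of BA-solver itself. The velocity field `toy_velocity` is an assumption standing in for a trained Flow Matching backbone; each call to it counts as one NFE.

```python
import math

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Returns the final state and the number of velocity evaluations (NFEs).
    """
    x, nfe = x0, 0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity(x, t)  # one (expensive) model call per step
        nfe += 1
    return x, nfe

def toy_velocity(x, t):
    """Hypothetical velocity field dx/dt = -x; exact solution is x0 * exp(-t)."""
    return -x

exact = math.exp(-1.0)
coarse, coarse_nfe = euler_sample(toy_velocity, 1.0, 10)
fine, fine_nfe = euler_sample(toy_velocity, 1.0, 100)
coarse_error = abs(coarse - exact)
fine_error = abs(fine - exact)
```

On this toy problem the 10-step run is visibly less accurate than the 100-step run, which is the degradation at low NFEs that motivates approaches such as densifying the trajectory between expensive backbone evaluations.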

[119] Unifying Heterogeneous Degradations: Uncertainty-Aware Diffusion Bridge Model for All-in-One Image Restoration

Luwei Tu,Jiawei Wu,Xing Luo,Zhi Jin

Main category: cs.CV

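TL;DR: This paper proposes an Uncertainty-Aware Diffusion Bridge Model (UDBM) that recasts All-in-One Image Restoration (AiOIR) as a stochastic transport problem guided by pixel-wise uncertainty; with a relaxed diffusion bridge and a dual modulation strategy, it addresses both the conflicting optimization objectives across degradations and the drift singularity, achieving SOTA performance with single-step inference.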

Details

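Motivation: All-in-One image restoration faces the fundamental challenge of conflicting optimization objectives across degradation types; existing methods are limited by coarse-grained control or fixed mapping schedules and adapt poorly. Method: The paper proposes the Uncertainty-Aware Diffusion Bridge Model (UDBM), which models AiOIR as a stochastic transport problem driven by pixel-wise uncertainty; a relaxed diffusion bridge replaces the strict terminal constraint to alleviate the drift singularity; a dual modulation strategy is introduced in which the noise schedule aligns diverse degradations into a shared high-entropy latent space and the path schedule adaptively regulates the transport trajectory based on the viscous dynamics of entropy regularization. Result: UDBM achieves state-of-the-art (SOTA) performance across diverse image restoration tasks with single-step inference. Conclusion: Through theoretical modeling and structural innovation, UDBM rectifies the transport geometry and dynamics, markedly improving the generalization and robustness of AiOIR. Abstract: All-in-One Image Restoration (AiOIR) faces the fundamental challenge of reconciling conflicting optimization objectives across heterogeneous degradations. Existing methods are often constrained by coarse-grained control mechanisms or fixed mapping schedules, yielding suboptimal adaptation. To address this, we propose an Uncertainty-Aware Diffusion Bridge Model (UDBM), which innovatively reformulates AiOIR as a stochastic transport problem steered by pixel-wise uncertainty. By introducing a relaxed diffusion bridge formulation which replaces the strict terminal constraint with a relaxed constraint, we model the uncertainty of degradations while theoretically resolving the drift singularity inherent in standard diffusion bridges. Furthermore, we devise a dual modulation strategy: the noise schedule aligns diverse degradations into a shared high-entropy latent space, while the path schedule adaptively regulates the transport trajectory motivated by the viscous dynamics of entropy regularization. By effectively rectifying the transport geometry and dynamics, UDBM achieves state-of-the-art performance across diverse restoration tasks within a single inference step.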

[120] HydroSense: A Dual-Microcontroller IoT Framework for Real-Time Multi-Parameter Water Quality Monitoring with Edge Processing and Cloud Analytics

Abdul Hasib,A. S. M. Ahsanul Sarkar Akib,Anish Giri

Main category: cs.CV

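TL;DR: This paper presents HydroSense, a low-cost, high-accuracy IoT water quality monitoring framework that integrates six key parameters (pH, dissolved oxygen, temperature, total dissolved solids, estimated nitrogen, and water level) and uses a dual-microcontroller architecture (Arduino Uno + ESP32); in a 90-day experiment it showed strong accuracy and 99.8% cloud transmission reliability at only 15% of the cost of commercial systems, markedly improving accessibility in resource-constrained regions.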

Details

Motivation: The global water crisis calls for affordable, accurate, real-time water quality monitoring, yet manual sampling and expensive commercial systems do not meet the accessibility needs of resource-constrained regions. Method: The paper proposes the HydroSense IoT framework with a dual-microcontroller architecture, using an Arduino Uno for analog measurements with five-point calibration and an ESP32 for wireless connectivity, edge processing, and cloud integration, together with signal processing techniques including median filtering (for TDS), temperature compensation, and robust error handling. Result: A 90-day experiment shows pH accuracy of ±0.08 (0 to 14), DO stability of ±0.2 mg/L, TDS accuracy of ±1.9% (0 to 1000 ppm), and 99.8% cloud data transmission reliability; the total cost is 32,983 BDT (approximately 300 USD), 85% lower than commercial systems. Conclusion: Through intelligent system architecture and low-cost component selection, HydroSense delivers professional-grade water quality monitoring and sets a new paradigm for environmental monitoring in resource-constrained settings. Abstract: The global water crisis necessitates affordable, accurate, and real-time water quality monitoring solutions. Traditional approaches relying on manual sampling or expensive commercial systems fail to address accessibility challenges in resource-constrained environments. This paper presents HydroSense, an innovative Internet of Things framework that integrates six critical water quality parameters including pH, dissolved oxygen (DO), temperature, total dissolved solids (TDS), estimated nitrogen, and water level into a unified monitoring system. HydroSense employs a novel dual-microcontroller architecture, utilizing Arduino Uno for precision analog measurements with five-point calibration algorithms and ESP32 for wireless connectivity, edge processing, and cloud integration. The system implements advanced signal processing techniques including median filtering for TDS measurement, temperature compensation algorithms, and robust error handling. Experimental validation over 90 days demonstrates exceptional performance metrics: pH accuracy of plus or minus 0.08 units across the 0 to 14 range, DO measurement stability within plus or minus 0.2 mg/L, TDS accuracy of plus or minus 1.9 percent across 0 to 1000 ppm, and 99.8 percent cloud data transmission reliability. With a total implementation cost of 32,983 BDT (approximately 300 USD), HydroSense achieves an 85 percent cost reduction compared to commercial systems while providing enhanced connectivity through the Firebase real-time database.
This research establishes a new paradigm for accessible environmental monitoring, demonstrating that professional-grade water quality assessment can be achieved through intelligent system architecture and cost-effective component selection.
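The abstract mentions median filtering for TDS measurement without giving details. The sketch below shows one common form, a sliding-window median that suppresses isolated spikes. It is written in Python for readability, whereas the actual firmware presumably runs on the Arduino/ESP32; the window size and the sample readings are assumptions, not values from the paper.

```python
from statistics import median

def median_filter(samples, window=5):
    """Sliding-window median; the window shrinks at the edges of the series."""
    half = window // 2
    filtered = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        filtered.append(median(samples[lo:hi]))
    return filtered

# Hypothetical TDS readings in ppm with one spurious spike (905).
raw_tds_ppm = [312, 310, 311, 905, 309, 313, 312]
smoothed = median_filter(raw_tds_ppm, window=5)
```

A median is preferred over a moving average here because a single outlier cannot drag the output toward itself; the spike is simply discarded as long as it occupies fewer than half of the window.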

[121] WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models

Zijin Yang,Yu Sun,Kejiang Chen,Jiawei Zhao,Jun Jiang,Weiming Zhang,Nenghai Yu

Main category: cs.CV

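TL;DR: This paper proposes WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking based on vision-language models (VLMs); it designs new quality and security metrics for residual and semantic watermarks respectively, and uses a three-stage training strategy to enable classification, scoring, and interpretable text generation.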

Details

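Motivation: Existing watermark evaluation methods lack a unified framework for residual and semantic watermarks, produce uninterpretable results, neglect comprehensive security, and often apply inappropriate metrics to semantic watermarks. Method: The paper proposes the WMVLM framework, which redefines quality and security metrics for residual watermarks (focusing on artifact strength and erasure resistance) and semantic watermarks (focusing on latent distribution shifts), and adopts a three-stage training strategy (classification, then scoring, then interpretable text generation) to strengthen the VLM's evaluation ability. Result: Across multiple datasets, diffusion models, and watermarking methods, WMVLM shows better generalization and evaluation accuracy than existing SOTA VLMs. Conclusion: WMVLM provides the first unified, interpretable evaluation paradigm for diffusion model image watermarking that covers both quality and security, pushing watermarking toward greater reliability and transparency. Abstract: Digital watermarking is essential for securing generated images from diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often use inappropriate metrics for semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy to progressively enable the model to achieve classification, scoring, and interpretable text generation. Experiments show WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods.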

[122] PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization

Songhan Jiang,Fengchun Liu,Ziyue Wang,Linghan Cai,Yongbing Zhang

Main category: cs.CV

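TL;DR: This paper introduces the PathReasoner dataset and the PathReasoner-R1 model, which aim to improve the interpretability and clinical trustworthiness of vision-language models in pathology diagnosis by training structured chain-of-thought reasoning through knowledge-graph-guided reasoning generation and a multi-granular reward mechanism.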

Details

Motivation: Existing pathology diagnosis models lack verifiable, evidence-linked reasoning, which limits clinical trust and the ability of experts to correct errors. Method: The authors construct PathReasoner, the first large-scale dataset for whole-slide image (WSI) reasoning, and propose PathReasoner-R1, which combines trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning and a knowledge-graph-based multi-granular reward function (including an Entity Reward mechanism). Result: PathReasoner-R1 reaches SOTA performance on PathReasoner and several public benchmarks, markedly improving reasoning transparency and clinical plausibility. Conclusion: The work offers computational pathology an interpretable, verifiable, and clinically trustworthy reasoning paradigm, a key step toward clinical deployment of VLMs. Abstract: Vision-Language Models (VLMs) are advancing computational pathology with superior visual understanding capabilities. However, current systems often reduce diagnosis to directly outputting conclusions without verifiable evidence-linked reasoning, which severely limits clinical trust and hinders expert error rectification. To address these barriers, we construct PathReasoner, the first large-scale dataset of whole-slide image (WSI) reasoning. Unlike previous work reliant on unverified distillation, we develop a rigorous knowledge-guided generation pipeline. By leveraging medical knowledge graphs, we explicitly align structured pathological findings and clinical reasoning with diagnoses, generating over 20K high-quality instructional samples. Based on this dataset, we propose PathReasoner-R1, which synergizes trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning to instill structured chain-of-thought capabilities. To ensure medical rigor, we engineer a knowledge-aware multi-granular reward function incorporating an Entity Reward mechanism strictly aligned with knowledge graphs. This effectively guides the model to optimize for logical consistency rather than mere outcome matching, thereby enhancing robustness. Extensive experiments demonstrate that PathReasoner-R1 achieves state-of-the-art performance on both PathReasoner and public benchmarks across various image scales, equipping pathology models with transparent, clinically grounded reasoning capabilities. Dataset and code are available at https://github.com/cyclexfy/PathReasoner-R1.

[123] Similarity of Processing Steps in Vision Model Representations

Matéo Mahaut,Marco Baroni

Main category: cs.CV

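TL;DR: This paper studies how different vision models arrive at similar representations, finding that although final representations may be similar, the intermediate processing steps and operations differ substantially; in particular, classifier models discard low-level image statistics, while CNN and Transformer models differ in how their representations change across layers.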

Details

Motivation: To investigate whether different vision models converge not only in their final representations but also in their intermediate processing steps and operations. Method: The authors quantify representation distances between models at different processing stages, track how these distances evolve through processing, and identify the steps at which models differ most. Result: Layers at similar positions in different models have the most similar representations, yet strong differences remain; classifier models discard low-level image statistics in their final layers; Transformer models change representations more smoothly from layer to layer than CNNs. Conclusion: Vision models differ in the level and nature of their representational convergence, which supports a more qualitative understanding of the internal processing of image models. Abstract: Recent literature suggests that the bigger the model, the more likely it is to converge to similar, "universal" representations, despite different training objectives, datasets, or modalities. While this literature shows that there is an area where model representations are similar, we study here how vision models might get to those representations -- in particular, do they also converge to the same intermediate steps and operations? We therefore study the processes that lead to convergent representations in different models. First, we quantify distance between different model representations at different stages. We follow the evolution of distances between models throughout processing, identifying the processing steps which are most different between models. We find that while layers at similar positions in different models have the most similar representations, strong differences remain. Classifier models, unlike the others, will discard information about low-level image statistics in their final layers. CNN- and transformer-based models also behave differently, with transformer models applying smoother changes to representations from one layer to the next. These distinctions clarify the level and nature of convergence between model representations, and enable a more qualitative account of the underlying processes in image models.

[124] A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion

Pu Cao,Yiyang Ma,Feng Zhou,Xuedan Yin,Qing Song,Lu Yang

Main category: cs.CV

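TL;DR: This paper shows that autoencoder (AE) evaluation in latent diffusion models is biased toward generative metrics (such as gFID) at the expense of reconstruction fidelity, and that this bias causes condition drift and harms controllability in controllable diffusion; empirically, reconstruction metrics (especially instance-level ones) better reflect controllability, giving practical guidance for AE evaluation and selection for controllable generation.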

Details

Motivation: In latent diffusion models the autoencoder must balance reconstruction fidelity against a generation-friendly latent space, but current ImageNet-scale AE studies generally favor generative metrics such as gFID and neglect reconstruction metrics, a bias that can carry serious risk when scaling to controllable diffusion. Method: The authors theoretically analyze why a gFID-dominant preference appears harmless for ImageNet generation yet induces condition drift in controllable diffusion; they propose a multi-dimensional condition-drift evaluation protocol and empirically test, on several recent ImageNet AEs, how well reconstruction and generative metrics predict controllability; ControlNet experiments further verify that controllability relates to condition preservation rather than gFID. Result: gFID only weakly predicts condition preservation, whereas reconstruction-oriented metrics (especially instance-level ones) align closely with controllability; ControlNet experiments confirm that controllability is determined by the degree of condition preservation, not by gFID. Conclusion: The current ImageNet-centric AE evaluation paradigm is disconnected from the practical needs of controllable diffusion; reconstruction fidelity (particularly instance-level metrics) should be treated as a key criterion for AE selection and benchmarking to make controllable generation more reliable. Abstract: In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits achievable condition alignment. Meanwhile, we find that reconstruction fidelity, especially instance-level measures, better indicates controllability. We empirically validate the impact of tilted autoencoder evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol reflecting controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, offering practical guidance for more reliable benchmarking and model selection.

[125] RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning

Shiqi Huang,Shuting He,Bihan Wen

Main category: cs.CV

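TL;DR: This paper proposes the RSGround-R1 framework, which improves the spatial reasoning of multimodal large models on remote sensing visual grounding through chain-of-thought supervised fine-tuning, reinforcement fine-tuning, and spatial-consistency-guided optimization.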

Details

Motivation: Remote sensing scenes are large in scale and semantically ambiguous, so natural language descriptions rely heavily on positional cues, which poses unique spatial reasoning challenges for multimodal large language models. Method: The paper proposes RSGround-R1, a reasoning-guided, position-aware post-training framework: 1) Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) on synthetic RSVG reasoning data; 2) Reinforcement Fine-Tuning (RFT) with a distance-aware positional reward; 3) a spatial-consistency-guided optimization strategy that stabilizes policy updates. Result: Experiments on RSVG benchmarks show that the method outperforms existing approaches in both performance and generalization. Conclusion: Positional cues can be modeled explicitly and integrated effectively into the training pipeline of multimodal large models; the proposed framework markedly improves spatial understanding and localization accuracy in remote sensing visual grounding. Abstract: Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To leverage this unique feature, we propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) using synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented by our newly designed positional reward that provides continuous and distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behaviors across rollouts, we introduce a spatial consistency guided optimization scheme that dynamically adjusts policy updates based on their spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate superior performance and generalization of our model.
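The abstract describes a positional reward that is "continuous and distance-aware" but does not give its formula. The sketch below shows one plausible shape for such a reward, an exponential decay in the normalized distance between box centers. The functional form, the `scale` parameter, and the box format `(x_min, y_min, x_max, y_max)` are all assumptions for illustration and should not be read as the paper's definition.

```python
import math

def box_center(box):
    """Center (cx, cy) of a box given as (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def positional_reward(pred_box, gt_box, image_size, scale=0.25):
    """Hypothetical reward in (0, 1] that decays with normalized center distance."""
    width, height = image_size
    (px, py), (gx, gy) = box_center(pred_box), box_center(gt_box)
    dist = math.hypot((px - gx) / width, (py - gy) / height)
    return math.exp(-dist / scale)

gt = (100, 100, 200, 200)
image_size = (1000, 1000)
exact = positional_reward(gt, gt, image_size)
near = positional_reward((110, 100, 210, 200), gt, image_size)
far = positional_reward((600, 600, 700, 700), gt, image_size)
```

The point of a continuous reward of this kind is that a prediction landing far from the target still receives a gradient-bearing signal that improves as it moves closer, whereas a thresholded IoU reward would be zero for every non-overlapping box.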

[126] OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Yufeng Zhong,Lei Chen,Xuanle Zhao,Wenkang Han,Liming Zheng,Jing Huang,Deyang Jiang,Yilin Cao,Lin Ma,Zhixiong Zeng

Main category: cs.CV

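TL;DR: This paper proposes OCRVerse, the first holistic end-to-end method that unifies text-centric and vision-centric OCR; with two-stage (SFT + RL) multi-domain training, it performs well on both text-dense and visually dense images (such as charts, web pages, and scientific plots).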

Details

Motivation: Existing OCR methods focus mainly on text recognition (text-centric OCR) and neglect the recognition of visual elements in visually information-dense images such as charts and web pages (vision-centric OCR), even though such images are widespread on the internet and have significant application value. Method: The paper proposes the OCRVerse framework, builds a comprehensive dataset covering text-centric sources (newspapers, magazines, etc.) and vision-centric sources (charts, web pages, scientific plots), and trains in two stages: supervised fine-tuning (SFT) on mixed multi-domain data to establish initial knowledge, followed by reinforcement learning (RL) with flexible, domain-specific reward strategies that fit each domain's output format and objectives. Result: OCRVerse achieves leading performance on both text-centric and vision-centric OCR tasks, with results comparable to large-scale open-source and closed-source models. Conclusion: OCRVerse is the first to model text and vision OCR in a unified end-to-end manner, validating the effectiveness and scalability of multi-domain joint training for holistic OCR. Abstract: The development of large vision language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (Text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (Vision-centric OCR), such as charts, web pages and science plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method in end-to-end manner that enables unified text-centric OCR and vision-centric OCR. To this end, we construct comprehensive data engineering to cover a wide range of text-centric documents, such as newspapers, magazines and books, as well as vision-centric rendered composites, including charts, web pages and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. SFT directly mixes cross-domain data to train and establish initial domain knowledge, while RL focuses on designing personalized reward strategies for the characteristics of each domain.
Specifically, since different domains require various output formats and expected outputs, we provide sufficient flexibility in the RL stage to customize flexible reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, achieving competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.

Bowen Zhou,Marc-André Fiedler,Ayoub Al-Hamadi

Main category: cs.CV

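TL;DR: This paper proposes CAF-Mamba, a Mamba-based framework for cross-modal adaptive attention fusion in depression detection, which reaches SOTA performance on the LMVD and D-Vlog datasets.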

Details

Motivation: Existing deep learning methods for depression detection rely on limited feature types, overlook explicit cross-modal interactions, and use simple fusion such as concatenation or static weighting. Method: The paper proposes CAF-Mamba, a Mamba-based cross-modal adaptive attention fusion framework that dynamically adjusts each modality's contribution through a modality-wise attention mechanism and models cross-modal interactions both explicitly and implicitly. Result: On the two in-the-wild benchmark datasets LMVD and D-Vlog, CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance. Conclusion: CAF-Mamba improves multimodal depression detection and demonstrates the benefit of dynamic attention fusion and the Mamba architecture for this task. Abstract: Depression is a prevalent mental health disorder that severely impairs daily functioning and quality of life. While recent deep learning approaches for depression detection have shown promise, most rely on limited feature types, overlook explicit cross-modal interactions, and employ simple concatenation or static weighting for fusion. To overcome these limitations, we propose CAF-Mamba, a novel Mamba-based cross-modal adaptive attention fusion framework. CAF-Mamba not only captures cross-modal interactions explicitly and implicitly, but also dynamically adjusts modality contributions through a modality-wise attention mechanism, enabling more effective multimodal fusion. Experiments on two in-the-wild benchmark datasets, LMVD and D-Vlog, demonstrate that CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance.

[128] Few-Shot Domain Adaptation with Temporal References and Static Priors for Glacier Calving Front Delineation

Marcel Dreier,Nora Gourmelon,Dakota Pyles,Thorsten Seehaus,Matthias H. Braun,Andreas Maier,Vincent Christlein

Main category: cs.CV

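TL;DR: This paper proposes a few-shot domain adaptation method that requires no change to the network architecture; by combining spatial static prior knowledge with summer reference images, it markedly improves how well a glacier calving front segmentation model generalizes to a new study site.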

Details

Motivation: The state-of-the-art model performs close to human level on the benchmark, but when applied to a new study site (out-of-distribution data) its accuracy is insufficient for downstream scientific analyses. Method: A few-shot domain adaptation strategy is used, spatial static prior knowledge is incorporated, and summer reference images are added to the input time series. Result: The calving front delineation error drops from 1131.6 m to 68.7 m without any modification to the model architecture. Conclusion: The method provides a transferable framework for applying deep learning-based calving front segmentation to new study sites, supporting calving front monitoring on a global scale. Abstract: During benchmarking, the state-of-the-art model for glacier calving front delineation achieves near-human performance. However, when applied in a real-world setting at a novel study site, its delineation accuracy is insufficient for calving front products intended for further scientific analyses. This site represents an out-of-distribution domain for a model trained solely on the benchmark dataset. By employing a few-shot domain adaptation strategy, incorporating spatial static prior knowledge, and including summer reference images in the input time series, the delineation error is reduced from 1131.6 m to 68.7 m without any architectural modifications. These methodological advancements establish a framework for applying deep learning-based calving front segmentation to novel study sites, enabling calving front monitoring on a global scale.

[129] When Gradient Optimization Is Not Enough: $\dagger$ Dispersive and Anchoring Geometric Regularizer for Multimodal Learning

Zixuan Xia,Hao Wang,Pengcheng Weng,Yanyu Qian,Yangxin Xu,William Dan,Fei Wang

Main category: cs.CV

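TL;DR: This paper proposes \regName, a lightweight geometry-aware regularization framework that constrains the representation geometry of intermediate embeddings (intra-modal dispersion and inter-modal anchoring) to mitigate representation collapse and cross-modal inconsistency in multimodal models, improving both multimodal and unimodal performance.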

Details

Motivation: In multimodal learning, even well-optimized models often exhibit geometric pathologies such as intra-modal representation collapse and sample-level cross-modal inconsistency, which harm robustness and fusion; representation geometry is identified as a missing control axis. Method: The paper proposes the \regName framework with two complementary constraints: (1) an intra-modal dispersive regularization that promotes representation diversity; (2) an inter-modal anchoring regularization that bounds sample-level cross-modal drift without enforcing rigid alignment; the method is plug-and-play, requires no architectural changes, and is compatible with various training paradigms. Result: Experiments on multiple multimodal benchmarks show that \regName consistently improves both multimodal and unimodal performance and mitigates modality trade-offs. Conclusion: Explicitly regulating representation geometry is an effective way to improve multimodal learning, and \regName offers a general, lightweight, and efficient solution. Abstract: Multimodal learning aims to integrate complementary information from heterogeneous modalities, yet strong optimization alone does not guarantee well-structured representations. Even under carefully balanced training schemes, multimodal models often exhibit geometric pathologies, including intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion. We identify representation geometry as a missing control axis in multimodal learning and propose \regName, a lightweight geometry-aware regularization framework. \regName enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. The proposed regularizer is plug-and-play, requires no architectural modifications, and is compatible with various training paradigms. Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.
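The abstract names the two constraints but not their formulas. As a hedged sketch of what such terms could look like, the code below uses a uniformity-style log-mean-exp penalty for intra-modal dispersion and a margin-based hinge for inter-modal anchoring. Both functional forms, the `temperature` and `margin` parameters, and the toy `audio`/`video` embeddings are assumptions made for illustration, not the paper's definitions.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def dispersive_loss(embeddings, temperature=1.0):
    """Assumed form: lower when embeddings within one modality are spread apart."""
    pairs = [
        math.exp(-sq_dist(embeddings[i], embeddings[j]) / temperature)
        for i in range(len(embeddings))
        for j in range(i + 1, len(embeddings))
    ]
    return math.log(sum(pairs) / len(pairs))

def anchoring_loss(mod_a, mod_b, margin=0.5):
    """Assumed form: penalize per-sample cross-modal distance only beyond a margin."""
    drifts = [max(0.0, math.sqrt(sq_dist(a, b)) - margin) for a, b in zip(mod_a, mod_b)]
    return sum(drifts) / len(drifts)

collapsed = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.01]]  # nearly identical embeddings
spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]       # well-separated embeddings
audio = [[0.0, 0.0], [1.0, 0.0]]                    # hypothetical modality A
video = [[0.3, 0.0], [1.0, 2.0]]                    # hypothetical modality B
drift_penalty = anchoring_loss(audio, video)
```

The margin in the anchoring term is one way to read "bounds drift without rigid alignment": paired embeddings closer than the margin incur no penalty, so the two modalities are not forced to coincide exactly.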

[130] Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification

Dexuan Ding,Ciyuan Peng,Endrowednes Kuantama,Jingcai Guo,Jia Wu,Jian Yang,Amin Beheshti,Ming-Hsuan Yang,Yuankai Qi

Main category: cs.CV

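TL;DR: This paper proposes Multimodal Visual Surrogate Compression (MVSC), which compresses high-dimensional 3D sMRI images into compact 2D visual surrogate features that better fit frozen 2D foundation models, improving Alzheimer's disease (AD) diagnosis.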

Details

Motivation: Existing sMRI representation learning methods suffer from high computational cost, loss of cross-slice relations, and limited ability to extract discriminative features. Method: The paper proposes the MVSC framework with two core modules, a text-guided Volume Context Encoder and a text-enhanced Adaptive Slice Fusion module, which compress 3D sMRI into 2D visual surrogate features aligned with 2D foundation models. Result: On three large-scale AD datasets, MVSC outperforms current state-of-the-art methods on both binary and multi-class classification. Conclusion: Through text-guided 3D-to-2D compression and alignment, MVSC improves sMRI representation quality and AD diagnostic accuracy while balancing efficiency and performance. Abstract: High-dimensional structural MRI (sMRI) images are widely used for Alzheimer's Disease (AD) diagnosis. Most existing methods for sMRI representation learning rely on 3D architectures (e.g., 3D CNNs), slice-wise feature extraction with late aggregation, or training-free feature extraction using 2D foundation models (e.g., DINO). However, these three paradigms suffer from high computational cost, loss of cross-slice relations, and limited ability to extract discriminative features, respectively. To address these challenges, we propose Multimodal Visual Surrogate Compression (MVSC). It learns to compress and adapt large 3D sMRI volumes into compact 2D features, termed visual surrogates, which are better aligned with frozen 2D foundation models to extract powerful representations for final AD classification. MVSC has two key components: a Volume Context Encoder that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner. Extensive experiments on three large-scale Alzheimer's disease benchmarks demonstrate that MVSC performs favourably on both binary and multi-class classification tasks compared against state-of-the-art methods.

[131] ChartE$^{3}$: A Comprehensive Benchmark for End-to-End Chart Editing

Shuo Li,Jiajun Sun,Zhekai Wang,Xiaoran Fan,Hui Li,Dingwen Yang,Zhiheng Xi,Yijun Wang,Zifei Shan,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CV

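TL;DR: This paper introduces the ChartE³ benchmark for end-to-end chart editing, covering local appearance adjustments and global data-driven transformations, and reveals a substantial performance gap for current multimodal large models on this task.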

Details

Motivation: Existing chart editing methods rely on natural language or code as an intermediate representation and struggle to execute complex edits faithfully; end-to-end editing ability needs to be evaluated directly. Method: The authors build the ChartE³ benchmark with more than 1,200 high-quality samples, each a triplet of a chart image, its underlying code, and a multimodal editing instruction, supporting evaluation from both objective and subjective perspectives. Result: Extensive evaluation of SOTA multimodal large language models shows substantial performance gaps, particularly on global editing tasks. Conclusion: ChartE³ provides the first evaluation benchmark for end-to-end chart editing that does not depend on intermediate representations, highlighting key limitations of current models in data-level semantic understanding and in preserving structural consistency. Abstract: Charts are a fundamental visualization format for structured data analysis. Enabling end-to-end chart editing according to user intent is of great practical value, yet remains challenging due to the need for both fine-grained control and global structural consistency. Most existing approaches adopt pipeline-based designs, where natural language or code serves as an intermediate representation, limiting their ability to faithfully execute complex edits. We introduce ChartE$^{3}$, an End-to-End Chart Editing benchmark that directly evaluates models without relying on intermediate natural language programs or code-level supervision. ChartE$^{3}$ focuses on two complementary editing dimensions: local editing, which involves fine-grained appearance changes such as font or color adjustments, and global editing, which requires holistic, data-centric transformations including data filtering and trend line addition. ChartE$^{3}$ contains over 1,200 high-quality samples constructed via a well-designed data pipeline with human curation. Each sample is provided as a triplet of a chart image, its underlying code, and a multimodal editing instruction, enabling evaluation from both objective and subjective perspectives. Extensive benchmarking of state-of-the-art multimodal large language models reveals substantial performance gaps, particularly on global editing tasks, highlighting critical limitations in current end-to-end chart editing capabilities.

[132] DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning

Mingshuang Luo,Shuang Liang,Zhengkun Rong,Yuxuan Luo,Tianshu Hu,Ruibing Hou,Hong Chang,Yong Li,Yuan Zhang,Mingyuan Gao

Main category: cs.CV

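TL;DR: DreamActor-M2 is a universal character image animation framework that models motion through in-context learning without explicit pose priors; using cross-identity data bootstrapping and fusion in a unified latent space, it strikes a better balance between identity preservation and motion consistency and reaches SOTA on the new AW Bench benchmark.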

Details

Motivation: Existing methods have two main problems: suboptimal motion injection strategies, which make it hard to achieve both identity preservation and motion consistency, and over-reliance on explicit pose priors, which limits generalization to non-humanoid characters. Method: A two-stage paradigm is adopted: 1) reference appearance and motion cues are fused into a unified latent space, and the generative prior of foundation models is used to jointly model spatial identity and temporal dynamics; 2) a self-bootstrapped pipeline synthesizes pseudo cross-identity data, enabling the transition from pose-driven to end-to-end RGB-driven animation. Result: The method achieves SOTA performance on the newly proposed AW Bench benchmark (covering many character types and motion scenarios), with markedly better visual fidelity and cross-domain generalization. Conclusion: By recasting motion conditioning as an in-context learning task, DreamActor-M2 removes the dependence on explicit pose representations and delivers more robust, more general character animation. Abstract: Character image animation aims to synthesize high-fidelity videos by transferring motion from a driving sequence to a static reference image. Despite recent advancements, existing methods suffer from two fundamental challenges: (1) suboptimal motion injection strategies that lead to a trade-off between identity preservation and motion consistency, manifesting as a "see-saw", and (2) an over-reliance on explicit pose priors (e.g., skeletons), which inadequately capture intricate dynamics and hinder generalization to arbitrary, non-humanoid characters. To address these challenges, we present DreamActor-M2, a universal animation framework that reimagines motion conditioning as an in-context learning problem. Our approach follows a two-stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space, enabling the model to jointly reason about spatial identity and temporal dynamics by leveraging the generative prior of foundational models. Second, we introduce a self-bootstrapped data synthesis pipeline that curates pseudo cross-identity training pairs, facilitating a seamless transition from pose-dependent control to direct, end-to-end RGB-driven animation. This strategy significantly enhances generalization across diverse characters and motion scenarios. To facilitate comprehensive evaluation, we further introduce AW Bench, a versatile benchmark encompassing a wide spectrum of character types and motion scenarios.
Extensive experiments demonstrate that DreamActor-M2 achieves state-of-the-art performance, delivering superior visual fidelity and robust cross-domain generalization. Project Page: https://grisoon.github.io/DreamActor-M2/

[133] From Global to Granular: Revealing IQA Model Performance via Correlation Surface

Baoliang Chen,Danni Huang,Hanwei Zhu,Lingyu Zhu,Wei Zhou,Shiqi Wang,Yuming Fang,Weisi Lin

Main category: cs.CV

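TL;DR: This paper proposes Granularity-Modulated Correlation (GMC), which introduces a granularity modulator and a distribution regulator to build a correlation surface, enabling fine-grained, distribution-robust 3D analysis of image quality assessment (IQA) model performance and overcoming the tendency of traditional global correlation metrics (such as SRCC and PLCC) to mask local performance differences and to depend on the data distribution.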

Details

Motivation: Traditional global correlation metrics (such as SRCC and PLCC) output a single scalar, so they cannot reflect how an IQA model's local ranking consistency differs across quality ranges (e.g., high MOS or small ΔMOS), and they are sensitive to uneven quality distributions in the test set, which makes model comparisons unstable. Method: The paper proposes Granularity-Modulated Correlation (GMC): (1) a Granularity Modulator that computes Gaussian-weighted local correlations along two dimensions, absolute MOS and |MOS difference|; (2) a Distribution Regulator that regularizes correlations to mitigate bias from non-uniform quality distributions; the result is a 3D correlation surface with MOS and |MOS difference| as its two axes. Result: Experiments on standard IQA benchmarks show that GMC reveals performance characteristics that traditional scalar metrics cannot capture (such as an advantage in ranking high-quality images or in discriminating subtle quality differences), giving richer, more stable, and more interpretable model analysis and comparison. Conclusion: GMC is a structured, fine-grained, distribution-robust evaluation paradigm for IQA that improves the depth and reliability of performance analysis and is suitable for developing, comparing, and deploying IQA models. Abstract: Evaluation of Image Quality Assessment (IQA) models has long been dominated by global correlation metrics, such as Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank-Order Correlation Coefficient (SRCC). While widely adopted, these metrics reduce performance to a single scalar, failing to capture how ranking consistency varies across the local quality spectrum. For example, two IQA models may achieve identical SRCC values, yet one ranks high-quality images (related to high Mean Opinion Score, MOS) more reliably, while the other better discriminates image pairs with small quality/MOS differences (related to $|\Delta\text{MOS}|$). Such complementary behaviors are invisible under global metrics. Moreover, SRCC and PLCC are sensitive to test-sample quality distributions, yielding unstable comparisons across test sets. To address these limitations, we propose Granularity-Modulated Correlation (GMC), which provides a structured, fine-grained analysis of IQA performance. GMC includes: (1) a Granularity Modulator that applies Gaussian-weighted correlations conditioned on absolute MOS values and pairwise MOS differences ($|\Delta\text{MOS}|$) to examine local performance variations, and (2) a Distribution Regulator that regularizes correlations to mitigate biases from non-uniform quality distributions. The resulting correlation surface maps correlation values as a joint function of MOS and $|\Delta\text{MOS}|$, providing a 3D representation of IQA performance.
Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models. Codes are available at https://github.com/Dniaaa/GMC.
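The idea of a Gaussian-weighted local correlation can be illustrated with a small sketch. This is not the authors' implementation (their code is in the repository linked above). The function name `gaussian_weighted_pearson`, the bandwidth `sigma`, and the choice to condition a weighted Pearson coefficient on absolute MOS only are assumptions made for illustration.

```python
import math


def gaussian_weighted_pearson(pred, mos, center, sigma):
    """Weighted Pearson correlation between predictions and MOS.

    Each sample is weighted by a Gaussian kernel on (mos - center), so
    samples whose MOS lies near `center` dominate the statistic.
    Hypothetical helper, not taken from the GMC code base.
    """
    weights = [math.exp(-0.5 * ((m - center) / sigma) ** 2) for m in mos]
    total = sum(weights)
    mean_p = sum(w * p for w, p in zip(weights, pred)) / total
    mean_m = sum(w * m for w, m in zip(weights, mos)) / total
    cov = sum(w * (p - mean_p) * (m - mean_m)
              for w, p, m in zip(weights, pred, mos))
    var_p = sum(w * (p - mean_p) ** 2 for w, p in zip(weights, pred))
    var_m = sum(w * (m - mean_m) ** 2 for w, m in zip(weights, mos))
    if var_p == 0 or var_m == 0:
        return float("nan")  # correlation is undefined for constant inputs
    return cov / math.sqrt(var_p * var_m)


# Toy data: a hypothetical model that misorders only the two lowest-quality images.
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [2.0, 1.0, 3.0, 4.0, 5.0]
low_quality_corr = gaussian_weighted_pearson(pred, mos, center=1.0, sigma=0.7)
high_quality_corr = gaussian_weighted_pearson(pred, mos, center=5.0, sigma=0.7)
```

Sweeping `center` over the MOS range gives one slice of a correlation curve. A full surface in the sense of the abstract would also condition on pairwise MOS differences and apply the distribution regulator, both of which this sketch omits.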

Jiankun Peng,Jianyuan Guo,Ying Xu,Yue Liu,Jiashuang Yan,Xuanwei Ye,Houhua Li,Xiaoming Wang

Main category: cs.CV

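TL;DR: This paper proposes the DGNav framework, which uses a scene-aware adaptive strategy and a dynamic graph Transformer to address the granularity rigidity of topological maps in vision-language navigation. It adjusts map density and connectivity dynamically, improving navigation precision, safety, and generalization.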

Details

Motivation: Existing methods that build topological maps with fixed geometric thresholds suffer from "granularity rigidity": they over-sample in simple areas, causing computational redundancy, and under-sample in high-uncertainty regions, increasing collision risk and reducing precision. Method: Proposes the DGNav framework with two core innovations: (1) a scene-aware adaptive strategy that dynamically adjusts graph construction thresholds according to the distribution of predicted waypoints, enabling densification on demand; (2) a dynamic graph Transformer that fuses visual, linguistic, and geometric cues into dynamic edge weights, reconstructing graph connectivity to suppress topological noise and strengthen instruction following. Result: Significantly improves navigation performance and generalization on the R2R-CE and RxR-CE benchmarks; ablation studies confirm an optimal trade-off between navigation efficiency and safe exploration. Conclusion: Through a context-aware dynamic topological mapping mechanism, DGNav overcomes granularity rigidity and provides a more robust, flexible, and safe solution for vision-language navigation in continuous environments. Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) presents a core challenge: grounding high-level linguistic instructions into precise, safe, and long-horizon spatial actions. Explicit topological maps have proven to be a vital solution for providing robust spatial memory in such tasks. However, existing topological planning methods suffer from a "Granularity Rigidity" problem. Specifically, these methods typically rely on fixed geometric thresholds to sample nodes, which fails to adapt to varying environmental complexities. This rigidity leads to a critical mismatch: the model tends to over-sample in simple areas, causing computational redundancy, while under-sampling in high-uncertainty regions, increasing collision risks and compromising precision. To address this, we propose DGNav, a framework for Dynamic Topological Navigation, introducing a context-aware mechanism to modulate map density and connectivity on-the-fly. Our approach comprises two core innovations: (1) A Scene-Aware Adaptive Strategy that dynamically modulates graph construction thresholds based on the dispersion of predicted waypoints, enabling "densification on demand" in challenging environments; (2) A Dynamic Graph Transformer that reconstructs graph connectivity by fusing visual, linguistic, and geometric cues into dynamic edge weights, enabling the agent to filter out topological noise and enhancing instruction adherence. Extensive experiments on the R2R-CE and RxR-CE benchmarks demonstrate that DGNav exhibits superior navigation performance and strong generalization capabilities. Furthermore, ablation studies confirm that our framework achieves an optimal trade-off between navigation efficiency and safe exploration. The code is available at https://github.com/shannanshouyin/DGNav.
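One way to picture a threshold that adapts to waypoint dispersion is sketched below. The abstract does not give the actual rule, so everything here is hypothetical: the function names, the inverse relationship, and the `base`, `sensitivity`, and `floor` parameters. The sketch assumes that higher dispersion signals a harder scene and should therefore lower the node-merging distance, producing a denser graph.

```python
import math


def waypoint_dispersion(waypoints):
    """Root-mean-square distance of 2D waypoints from their centroid."""
    n = len(waypoints)
    cx = sum(x for x, _ in waypoints) / n
    cy = sum(y for _, y in waypoints) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in waypoints) / n)


def adaptive_merge_threshold(waypoints, base=0.5, sensitivity=1.0, floor=0.1):
    """Shrink the node-merging distance as predicted waypoints spread out.

    Assumed rule for illustration only: threshold = base / (1 + sensitivity * dispersion),
    clipped from below at `floor` so the graph cannot become arbitrarily dense.
    """
    dispersion = waypoint_dispersion(waypoints)
    return max(floor, base / (1.0 + sensitivity * dispersion))


# Tightly clustered predictions keep the default threshold; scattered ones lower it.
simple_scene = adaptive_merge_threshold([(0.0, 0.0), (0.0, 0.0)])
complex_scene = adaptive_merge_threshold([(-2.0, 0.0), (2.0, 0.0)])
```

The lower bound matters in practice: without it, a single frame with widely scattered predictions would collapse the threshold and reintroduce the redundancy that the adaptive strategy is meant to avoid.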

[135] Synthetic-to-Real Domain Bridging for Single-View 3D Reconstruction of Ships for Maritime Monitoring

Borja Carrillo-Perez,Felix Sattler,Angel Bueno Rodriguez,Maurice Stephan,Sarah Barnes

Main category: cs.CV

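TL;DR: This paper presents an efficient single-view 3D ship reconstruction method trained entirely on synthetic data. It combines the Splatter Image network, a YOLOv8 segmentation module, and AIS-based geographic mapping to deliver real-time, interactive 3D ship visualization without real-world 3D annotations.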

Details

Motivation: Mainstream 3D reconstruction methods depend on multi-view supervision or real 3D annotations, or are computationally expensive, so they cannot meet the practical needs of real-time maritime monitoring. Method: Uses the Splatter Image network (which represents objects as sparse 3D Gaussians), fine-tuned on synthetic ShapeNet vessels and further refined with a custom 3D ship dataset; integrates YOLOv8-based segmentation and custom preprocessing; deploys to real scenes through scale normalization, centering, orientation correction, and georeferencing based on AIS metadata and homography. Result: Quantitative metrics on the synthetic validation set are strong, and qualitative results on real maritime images from ShipSG confirm cross-domain transfer; the system supports interactive 3D viewing and requires no real-world 3D annotations. Conclusion: This single-view, synthetic-data-driven end-to-end pipeline offers an efficient, scalable, near-real-time 3D ship reconstruction solution for maritime monitoring and points to a feasible path toward real-time 3D visualization in practice. Abstract: Three-dimensional (3D) reconstruction of ships is an important part of maritime monitoring, allowing improved visualization, inspection, and decision-making in real-world monitoring environments. However, most state-of-the-art 3D reconstruction methods require multi-view supervision, annotated 3D ground truth, or are computationally intensive, making them impractical for real-time maritime deployment. In this work, we present an efficient pipeline for single-view 3D reconstruction of real ships by training entirely on synthetic data and requiring only a single view at inference. Our approach uses the Splatter Image network, which represents objects as sparse sets of 3D Gaussians for rapid and accurate reconstruction from single images. The model is first fine-tuned on synthetic ShapeNet vessels and further refined with a diverse custom dataset of 3D ships, bridging the domain gap between synthetic and real-world imagery. We integrate a state-of-the-art segmentation module based on YOLOv8 and custom preprocessing to ensure compatibility with the reconstruction network. Postprocessing steps include real-world scaling, centering, and orientation alignment, followed by georeferenced placement on an interactive web map using AIS metadata and homography-based mapping. Quantitative evaluation on synthetic validation data demonstrates strong reconstruction fidelity, while qualitative results on real maritime images from the ShipSG dataset confirm the potential for transfer to operational maritime settings. The final system provides interactive 3D inspection of real ships without requiring real-world 3D annotations. This pipeline provides an efficient, scalable solution for maritime monitoring and highlights a path toward real-time 3D ship visualization in practical applications. Interactive demo: https://dlr-mi.github.io/ship3d-demo/.
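The homography-based mapping step mentioned in the abstract can be sketched as a projective transform from image pixels to map-plane coordinates. The matrix values below are invented for illustration, and `apply_homography` is a hypothetical helper. How the authors estimate the homography and combine it with AIS metadata is not described in this listing.

```python
def apply_homography(H, x, y):
    """Map an image pixel (x, y) to planar map coordinates using a 3x3 homography H.

    The point is lifted to homogeneous coordinates (x, y, 1), multiplied by H,
    and divided by the resulting third component.
    """
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return u / w, v / w


# Made-up homography: scale pixel coordinates by 2 and shift by (10, 20).
H_example = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, 20.0],
    [0.0, 0.0, 1.0],
]
map_x, map_y = apply_homography(H_example, 3.0, 4.0)
```

A real deployment would estimate `H` from at least four pixel-to-map correspondences on the water plane, since a single homography is only valid for points lying on one plane.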

Junming Huang,Weiwei Xu

Main category: cs.CV

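TL;DR: This paper proposes CG-MLLM, a new multimodal large language model that supports 3D captioning and high-resolution 3D generation. It decouples modeling needs with a dual-Transformer TokenAR and BlockAR architecture and combines a vision-language backbone with a specialized 3D VAE latent space, significantly improving 3D generation quality.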

Details

Motivation: Existing 3D content generation methods produce only low-resolution meshes or coarse structural proxies and cannot natively capture fine-grained geometry; the 3D generation capabilities of LLMs remain underexplored. Method: Proposes CG-MLLM, which adopts a Mixture-of-Transformer architecture with two Transformer branches, Token-level Autoregressive (TokenAR) and Block-level Autoregressive (BlockAR), handling token-level and block-level content respectively; it integrates a pre-trained vision-language backbone with a specialized 3D VAE latent space to enable long-context interaction between standard tokens and spatial blocks. Result: CG-MLLM significantly outperforms existing MLLM methods on high-fidelity 3D object generation, bringing high-resolution 3D content generation into the mainstream LLM paradigm. Conclusion: CG-MLLM provides a unified, efficient, high-resolution multimodal large-model framework for 3D content generation, extending LLMs into the 3D generation domain. Abstract: Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.

[137] MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Honglin Lin,Zheng Liu,Yun Zhu,Chonghan Qin,Juekai Lin,Xiaoran Shang,Conghui He,Wentao Zhang,Lijun Wu

Main category: cs.CV

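TL;DR: This paper introduces MMFineReason, a large-scale multimodal reasoning dataset (1.8M samples, 5.1B tokens) built with a three-stage pipeline. It significantly improves the visual reasoning performance of open-source VLMs and reveals a "less is more" effect: a high-quality subset of only 7% matches the performance of the full dataset.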

Details

Motivation: Open-source vision language models (VLMs) still lag behind closed-source systems in visual reasoning, mainly because they lack high-quality reasoning data that covers challenging domains such as STEM diagrams and visual puzzles and carries consistent long-form Chain-of-Thought (CoT) annotations. Method: Proposes a three-stage construction pipeline: (1) large-scale data collection and standardization; (2) generation of high-quality CoT reasoning chains distilled from Qwen3-VL-235B-A22B-Thinking; (3) strict selection based on reasoning quality and difficulty awareness. Qwen3-VL-Instruct is fine-tuned on the dataset to obtain the MMFineReason-2B/4B/8B model series. Result: MMFineReason-4B surpasses Qwen3-VL-8B-Thinking; MMFineReason-8B surpasses Qwen3-VL-30B-A3B-Thinking and approaches Qwen3-VL-32B-Thinking; a difficulty-aware filtered subset of only 7% (123K samples) matches full-dataset performance; reasoning-oriented data also improves general capabilities. Conclusion: High-quality, difficulty-aware multimodal reasoning data is key to improving the reasoning ability of open-source VLMs; MMFineReason shows that data quality matters more than quantity and reveals a synergistic gain in general capabilities from reasoning-oriented data. Abstract: Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is established via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B versions. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B successfully surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. Crucially, we uncover a "less is more" phenomenon via our difficulty-aware filtering strategy: a subset of just 7% (123K samples) achieves performance comparable to the full dataset. Notably, we reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.

[138] Trajectory-Guided Diffusion for Foreground-Preserving Background Generation in Multi-Layer Documents

Taewon Kang

Main category: cs.CV

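TL;DR: This paper proposes a diffusion-based framework for document background generation that achieves foreground preservation and multi-page stylistic consistency through latent-space design, without explicit constraints or additional training.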

Details

Motivation: Addresses two long-standing problems in document background generation: damage to foreground content and stylistic drift across pages. Method: Reinterprets diffusion as the evolution of stochastic trajectories in a structured latent space; shapes the initial noise and its geometric alignment so that generation naturally avoids foreground regions; introduces cached style direction vectors as persistent constraints in latent space, confining diffusion trajectories to a shared stylistic subspace. Result: Achieves foreground preservation and cross-page stylistic consistency without masks, suppressed updates, or repeated prompts; the framework is training-free, compatible with existing diffusion backbones, and produces visually coherent results on complex documents. Conclusion: By modeling diffusion as trajectory design in latent space, the paper offers a principled new perspective on structured, consistent generative modeling. Abstract: We present a diffusion-based framework for document-centric background generation that achieves foreground preservation and multi-page stylistic consistency through latent-space design rather than explicit constraints. Instead of suppressing diffusion updates or applying masking heuristics, our approach reinterprets diffusion as the evolution of stochastic trajectories through a structured latent space. By shaping the initial noise and its geometric alignment, background generation naturally avoids designated foreground regions, allowing readable content to remain intact without auxiliary mechanisms. To address the long-standing issue of stylistic drift across pages, we decouple style control from text conditioning and introduce cached style directions as persistent vectors in latent space. Once selected, these directions constrain diffusion trajectories to a shared stylistic subspace, ensuring consistent appearance across pages and editing iterations. This formulation eliminates the need for repeated prompt-based style specification and provides a more stable foundation for multi-page generation. Our framework admits a geometric and physical interpretation, where diffusion paths evolve on a latent manifold shaped by preferred directions, and foreground regions are rarely traversed as a consequence of trajectory initialization rather than explicit exclusion. The proposed method is training-free, compatible with existing diffusion backbones, and produces visually coherent, foreground-preserving results across complex documents. By reframing diffusion as trajectory design in latent space, we offer a principled approach to consistent and structured generative modeling.

[139] Improving Classifier-Free Guidance of Flow Matching via Manifold Projection

Jian-Feng Cai,Haixia Liu,Zhengyi Su,Chao Wang

Main category: cs.CV

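TL;DR: This paper offers an optimization-based interpretation of classifier-free guidance (CFG), treating it as a homotopy optimization problem with a manifold constraint, and introduces incremental gradient descent with Anderson acceleration to improve sampling quality and robustness without retraining the model.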

Details

Motivation: Although classifier-free guidance (CFG) is widely used in diffusion and flow models, it relies on a heuristic linear extrapolation, is sensitive to the guidance scale, and lacks a theoretical foundation. Method: Models CFG as a homotopy optimization problem with a manifold constraint; using the property that the velocity field in flow matching is the gradient of smoothed distance functions, designs an incremental gradient descent scheme for manifold projection and combines it with Anderson acceleration to improve convergence efficiency and stability. Result: On several large-scale models, including DiT-XL-2-256, Flux, and Stable Diffusion 3.5, the method significantly improves generation fidelity, prompt alignment, and robustness to the guidance scale, with no additional training. Conclusion: CFG can be rigorously understood as an implicit gradient optimization process; the proposed optimization framework offers a more robust, efficient, and principled alternative for controllable generation. Abstract: Classifier-free guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
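For readers unfamiliar with the "heuristic linear extrapolation" the abstract refers to, the standard CFG combination rule is sketched below on plain Python lists. This shows only the conventional baseline, not the paper's manifold-projection method. The helper names `cfg_velocity` and `prediction_gap` are chosen here for illustration.

```python
def prediction_gap(v_cond, v_uncond):
    """Element-wise difference between conditional and unconditional outputs."""
    return [c - u for c, u in zip(v_cond, v_uncond)]


def cfg_velocity(v_cond, v_uncond, scale):
    """Standard classifier-free guidance: v_uncond + scale * (v_cond - v_uncond).

    scale = 0 returns the unconditional output, scale = 1 the conditional one,
    and scale > 1 extrapolates beyond the conditional output along the gap.
    """
    gap = prediction_gap(v_cond, v_uncond)
    return [u + scale * g for u, g in zip(v_uncond, gap)]


# Toy two-dimensional outputs standing in for model predictions.
conditional = [1.0, 2.0]
unconditional = [0.0, 1.0]
guided = cfg_velocity(conditional, unconditional, scale=3.0)
```

Because the guided output moves linearly along the prediction gap, any error in that gap is amplified by the scale, which is one way to read the sensitivity to the guidance scale that the abstract describes.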

[140] Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion

Hanmo Chen,Chenghao Xu,Xu Yang,Xuan Chen,Cheng Deng

Main category: cs.CV

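TL;DR: This paper proposes PaFu-KV, a new KV cache policy that uses a lightweight salience estimation head to score token importance, dynamically retaining key spatiotemporal information and evicting redundant cache entries, which improves inference efficiency while preserving high-quality video generation.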

Details

Motivation: Existing autoregressive video generation methods rely on heuristic KV cache policies that ignore differences in token importance during long video generation, causing loss of key spatiotemporal information and cache redundancy, which hurts generation quality and efficiency. Method: Proposes the Past- and Future-Informed KV Cache Policy (PaFu-KV), which introduces a lightweight Salience Estimation Head distilled from a bidirectional teacher model to score token salience dynamically and manage the KV cache accordingly. Result: Across multiple benchmarks, the method significantly accelerates inference and reduces memory usage while preserving high-fidelity video generation quality, improving the efficiency of long-horizon video generation. Conclusion: By modeling temporal heterogeneity, PaFu-KV achieves a better quality-efficiency trade-off and offers a new approach to efficient long-video autoregressive generation. Abstract: Video generation is pivotal to digital media creation, and recent advances in autoregressive video generation have markedly enhanced the efficiency of real-time video synthesis. However, existing approaches generally rely on heuristic KV Cache policies, which ignore differences in token importance in long-term video generation. This leads to the loss of critical spatiotemporal information and the accumulation of redundant, invalid cache, thereby degrading video generation quality and efficiency. To address this limitation, we first observe that token contributions to video generation are highly time-heterogeneous and accordingly propose a novel Past- and Future-Informed KV Cache Policy (PaFu-KV). Specifically, PaFu-KV introduces a lightweight Salience Estimation Head distilled from a bidirectional teacher to estimate salience scores, allowing the KV cache to retain informative tokens while discarding less relevant ones. This policy yields a better quality-efficiency trade-off by shrinking KV cache capacity and reducing memory footprint at inference time. Extensive experiments on benchmarks demonstrate that our method preserves high-fidelity video generation quality while accelerating inference, thereby enabling more efficient long-horizon video generation. Our code will be released upon paper acceptance.
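The retain-or-discard behaviour described above can be pictured as a top-k selection over salience scores. In the minimal sketch below, the scores are simply given as inputs, whereas in the paper they come from a learned head. The function name `prune_kv_cache` and the fixed `budget` are assumptions, not details from the paper.

```python
def prune_kv_cache(cache, salience, budget):
    """Keep the `budget` most salient cache entries, preserving temporal order.

    `cache` is a list of per-token entries (for example, key/value pairs) and
    `salience` holds one score per entry. Hypothetical policy for illustration.
    """
    if len(cache) != len(salience):
        raise ValueError("cache and salience must have the same length")
    if budget >= len(cache):
        return list(cache)
    # Rank indices by score, highest first; ties keep the earlier token first.
    ranked = sorted(range(len(cache)), key=lambda i: salience[i], reverse=True)
    keep = sorted(ranked[:budget])  # restore original (temporal) order
    return [cache[i] for i in keep]


# Toy cache of four tokens with made-up salience scores.
tokens = ["t0", "t1", "t2", "t3"]
scores = [0.1, 0.9, 0.3, 0.8]
pruned = prune_kv_cache(tokens, scores, budget=2)
```

Restoring temporal order after ranking keeps the surviving entries aligned with their positions, which matters when positional information is attached to cached keys.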

[141] TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

Chuancheng Shi,Shangze Li,Wenjun Lu,Wenhua Wu,Cong Wang,Zifeng Cheng,Fei Shen,Tat-Seng Chua

Main category: cs.CV

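TL;DR: This paper proposes the TraceRouter framework, which improves the adversarial robustness of large models by tracing and severing the causal propagation circuits of harmful semantics, avoiding the utility loss caused by traditional localized suppression methods.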

Details

Motivation: Existing defenses rest on the "locality hypothesis" and suppress only isolated neurons or features, but harmful semantics are in fact distributed, cross-layer circuits, so these methods are brittle and degrade model performance. Method: TraceRouter works in three stages: (1) it locates a sensitive onset layer through attention divergence analysis; (2) it uses sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; (3) it computes feature influence scores (FIS) from zero-out interventions, maps the malicious features to downstream causal pathways, and selectively suppresses those pathways. Result: Experiments show that TraceRouter achieves a better trade-off between adversarial robustness and general performance, significantly outperforming current state-of-the-art baselines. Conclusion: Through path-level causal intervention, TraceRouter blocks harmful semantics precisely and offers a new paradigm for large-model safety. Abstract: Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the "locality hypothesis", suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose TraceRouter, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter physically severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.
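A zero-out intervention can be illustrated on a toy model: set one feature to zero, rerun the model, and record how much the output changes. The scoring rule below (absolute change in a scalar output) is an assumption made for illustration; the paper's exact FIS definition is not given in this listing. `feature_influence_scores` and `toy_model` are hypothetical names.

```python
def feature_influence_scores(model, features):
    """Score each feature by the output change caused by zeroing it out.

    `model` is any callable mapping a list of floats to a scalar.
    """
    baseline = model(features)
    scores = []
    for index in range(len(features)):
        ablated = list(features)
        ablated[index] = 0.0  # the zero-out intervention on one feature
        scores.append(abs(baseline - model(ablated)))
    return scores


def toy_model(f):
    """Stand-in downstream computation: a fixed linear readout."""
    return 2.0 * f[0] + 0.0 * f[1] - 3.0 * f[2]


fis = feature_influence_scores(toy_model, [1.0, 5.0, 1.0])
```

In this toy case the second feature has a large activation but no influence on the output. That distinction between activation magnitude and causal effect is the reason for intervening rather than simply inspecting activations.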

[142] Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning

Hanmo Chen,Guangtao Lyu,Chenghao Xu,Jiexi Yan,Xu Yang,Cheng Deng

Main category: cs.CV

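TL;DR: This paper proposes a Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval, which progressively aligns joints and motion segments with text tokens to improve the precision of cross-modal semantic matching.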

Details

Motivation: Most existing methods align whole motion sequences with global text representations, ignoring fine-grained interactions among local motion segments, body joints, and text tokens, which limits retrieval performance. Method: Inspired by the pyramidal process of human motion perception, the framework decomposes human motion into temporal segments and spatial joints and builds a pyramidal joint-wise and segment-wise alignment mechanism that learns fine-grained cross-modal correspondences from local to global. Result: Significantly outperforms state-of-the-art methods on multiple public benchmark datasets and achieves precise alignment of motion segments and joints with their corresponding text tokens. Conclusion: The PST framework effectively models hierarchical, fine-grained semantic associations between motion and language, offering a new direction for human-centric cross-modal intelligence. Abstract: As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis, yet existing approaches predominantly focus on aligning entire motion sequences with global textual representations. This global-centric paradigm overlooks fine-grained interactions between local motion segments and individual body joints and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, achieving precise alignment between motion segments and body joints and their corresponding text tokens. The code of this work will be released upon acceptance.

[143] VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models

Yunhao Li,Sijing Wu,Zhilin Gao,Zicheng Zhang,Qi Jia,Huiyu Duan,Xiongkuo Min,Guangtao Zhai

Main category: cs.CV

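TL;DR: This paper introduces VideoAesBench, a comprehensive benchmark for evaluating how well large multimodal models (LMMs) understand video aesthetic quality. It covers diverse video content, multiple question formats, and a broad set of aesthetic dimensions. An evaluation of 23 open-source and commercial LMMs finds that current models remain weak on this task.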

Details

Motivation: Video aesthetic quality assessment is a basic human ability, but the capability of large multimodal models (LMMs) in this area has not been fully explored, and a systematic evaluation benchmark is needed. Method: Builds the VideoAesBench benchmark, containing 1,804 videos from multiple sources (UGC, AIGC, compressed, RGC, and game videos); supports single-choice, multi-choice, true-or-false, and novel open-ended aesthetic description questions; and covers 12 aesthetic dimensions in three categories: visual form, visual style, and visual affectiveness. Twenty-three LMMs are systematically evaluated on the benchmark. Result: Current LMMs have only basic video aesthetics perception ability, and their overall performance is incomplete and imprecise. Conclusion: VideoAesBench can serve as an important testbed for evaluating and advancing explainable video aesthetics analysis, providing a benchmark and insights for future research. Abstract: Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which has in turn made the evaluation of LMMs significant. However, the capability of video aesthetic quality assessment, which is a fundamental ability for humans, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs' understanding of video aesthetic quality. VideoAesBench has several significant characteristics: (1) Diverse content including 1,804 videos from multiple video sources including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos. (2) Multiple question formats containing traditional single-choice questions, multi-choice questions, True or False questions, and novel open-ended questions for video aesthetics description. (3) Holistic video aesthetics dimensions including visual form related questions from 5 aspects, visual style related questions from 4 aspects, and visual affectiveness questions from 3 aspects. Based on VideoAesBench, we benchmark 23 open-source and commercial large multimodal models. Our findings show that current LMMs possess only basic video aesthetics perception ability, and their performance remains incomplete and imprecise. We hope our VideoAesBench can serve as a strong testbed and offer insights for explainable video aesthetics assessment.

[144] Zero-Shot Video Restoration and Enhancement with Assistance of Video Diffusion Models

Cong Cao,Huanjing Yue,Shangbin Xie,Xin Liu,Jingyu Yang

Main category: cs.CV

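TL;DR: This paper proposes a training-free framework that uses video diffusion models to improve the temporal consistency of image-based zero-shot video restoration and enhancement methods, through homologous and heterogenous latents fusion, a COT-based fusion strategy, and temporal-strengthening post-processing.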

Details

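Motivation: Diffusion models perform well in zero-shot image restoration, but applying them directly to video causes severe temporal flickering, so temporal consistency needs to be improved. Method: Proposes homologous latents fusion, heterogenous latents fusion, and a COT-based fusion ratio strategy, combined with temporal-strengthening post-processing that uses an image-to-video diffusion model; the whole framework requires no training. Result: Experiments show that the method significantly improves the temporal consistency of videos while maintaining image quality, and that it applies to any diffusion-based image restoration or enhancement method. Conclusion: This is the first framework to use video diffusion models to assist image-based methods in zero-shot video restoration and enhancement; it shows a clear advantage in temporal consistency and is both general and training-free. Abstract: Although diffusion-based zero-shot image restoration and enhancement methods have achieved great success, applying them to video restoration or enhancement will lead to severe temporal flickering. In this paper, we propose the first framework that utilizes the rapidly-developed video diffusion model to assist the image-based method in maintaining more temporal consistency for zero-shot video restoration and enhancement. We propose homologous latents fusion, heterogenous latents fusion, and a COT-based fusion ratio strategy to utilize both homologous and heterogenous text-to-video diffusion models to complement the image method. Moreover, we propose temporal-strengthening post-processing to utilize the image-to-video diffusion model to further improve temporal consistency. Our method is training-free and can be applied to any diffusion-based image restoration and enhancement methods. Experimental results demonstrate the superiority of the proposed method.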

[145] Just Noticeable Difference Modeling for Deep Visual Features

Rui Zhao,Wenrui Li,Lin Zhu,Yajing Zheng,Weisi Lin

Main category: cs.CV

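TL;DR: This paper proposes FeatJND, a task-oriented just noticeable difference (JND) model for deep visual features that predicts the maximum perturbation each feature dimension can tolerate while preserving downstream task performance. Validation on classification, detection, and instance segmentation shows it outperforms Gaussian perturbations, and applying it to token-wise dynamic quantization improves quantization performance.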

Details

Motivation: Deep visual features are increasingly important as the interface in vision systems, so their characteristics need to be described and their quality controlled; extending the traditional JND concept to feature space provides a task-aligned tolerance boundary that supports feature quality control under constrained resources. Method: Proposes FeatJND, a task-aligned feature-level JND formulation; designs a FeatJND estimator at standardized split points and validates it on image classification, object detection, and instance segmentation; further applies it to token-wise dynamic quantization with FeatJND-based step-size allocation. Result: Under matched distortion strength, FeatJND-guided perturbations preserve downstream task performance better than Gaussian perturbations; attribution visualizations show that it can suppress non-critical feature regions; in dynamic quantization, FeatJND-guided step-size allocation clearly outperforms random and global uniform step sizes. Conclusion: FeatJND provides an interpretable, task-aligned quality-control boundary for deep features, with both theoretical and practical value, especially for feature compression and quantization in resource-constrained settings. Abstract: Deep visual features are increasingly used as the interface in vision systems, motivating the need to describe feature characteristics and control feature quality for machine perception. Just noticeable difference (JND) characterizes the maximum imperceptible distortion for images under human or machine vision. Extending it to deep visual features naturally meets the above demand by providing a task-aligned tolerance boundary in feature space, offering a practical reference for controlling feature quality under constrained resources. We propose FeatJND, a task-aligned JND formulation that predicts the maximum tolerable per-feature perturbation map while preserving downstream task performance. We propose a FeatJND estimator at standardized split points and validate it across image classification, detection, and instance segmentation. Under matched distortion strength, FeatJND-based distortions consistently preserve higher task performance than unstructured Gaussian perturbations, and attribution visualizations suggest FeatJND can suppress non-critical feature regions. As an application, we further apply FeatJND to token-wise dynamic quantization and show that FeatJND-guided step-size allocation yields clear gains over random step-size permutation and global uniform step size under the same noise budget. Our code will be released after publication.
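Tolerance-guided step-size allocation under a fixed noise budget might look like the sketch below. The allocation rule is an assumption for illustration: steps are proportional to per-token tolerance and rescaled so that their mean squared value equals that of a global uniform step. The names `allocate_steps` and `quantize` are hypothetical, and the paper's actual procedure may differ.

```python
import math


def allocate_steps(tolerances, global_step):
    """Turn per-token tolerances into per-token quantization step sizes.

    Steps are proportional to tolerance and rescaled so that the mean of
    step**2 equals global_step**2. Uniform quantization noise variance is
    roughly step**2 / 12, so this keeps the average noise budget unchanged.
    """
    mean_square = sum(t * t for t in tolerances) / len(tolerances)
    if mean_square == 0:
        raise ValueError("at least one tolerance must be non-zero")
    scale = global_step / math.sqrt(mean_square)
    return [scale * t for t in tolerances]


def quantize(values, step):
    """Uniform quantization: snap each value to the nearest multiple of `step`."""
    return [round(v / step) * step for v in values]


# Token 1 tolerates more distortion than token 0, so it gets a coarser step.
steps = allocate_steps([3.0, 4.0], global_step=0.5)
coarse = quantize([1.3], 0.5)
```

The rescaling is what makes a comparison against a global uniform step fair: both schemes spend the same average noise, and only its distribution across tokens changes.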

[146] BookNet: Book Image Rectification via Cross-Page Attention Network

Shaokai Liu,Hao Feng,Bozhi Luan,Min Hou,Jiajun Deng,Wengang Zhou

Main category: cs.CV

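TL;DR: This paper presents BookNet, the first end-to-end deep learning framework designed specifically for dual-page book image rectification. It uses a dual-branch structure with cross-page attention to model the geometric coupling between left and right pages, and it introduces the synthetic dataset Book3D and the real-world benchmark Book100. Experiments show it outperforms existing methods.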

Details

Motivation: Existing single-page document image rectification methods cannot capture the complex, asymmetric geometric coupling between adjacent book pages caused by binding constraints. Method: Proposes BookNet, which adopts a dual-branch architecture with cross-page attention to jointly estimate warping flows for individual pages and the full book spread; also builds the large-scale synthetic dataset Book3D and the real-world benchmark Book100. Result: BookNet significantly outperforms existing state-of-the-art methods on book image rectification. Conclusion: BookNet effectively models the geometric dependencies between the two pages, confirming the importance of a dedicated architecture and datasets for book image rectification. Abstract: Book image rectification presents unique challenges in document image processing due to complex geometric distortions from binding constraints, where left and right pages exhibit distinctly asymmetric curvature patterns. However, existing single-page document image rectification methods fail to capture the coupled geometric relationships between adjacent pages in books. In this work, we introduce BookNet, the first end-to-end deep learning framework specifically designed for dual-page book image rectification. BookNet adopts a dual-branch architecture with cross-page attention mechanisms, enabling it to estimate warping flows for both individual pages and the complete book spread, explicitly modeling how left and right pages influence each other. Moreover, to address the absence of specialized datasets, we present Book3D, a large-scale synthetic dataset for training, and Book100, a comprehensive real-world benchmark for evaluation. Extensive experiments demonstrate that BookNet outperforms existing state-of-the-art methods on book image rectification. Code and dataset will be made publicly available.

[147] Deep Models, Shallow Alignment: Uncovering the Granularity Mismatch in Neural Decoding

Yang Du,Siyuan Dai,Yonghao Song,Paul M. Thompson,Haoteng Tang,Liang Zhan

Main category: cs.CV

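TL;DR: This paper proposes Shallow Alignment, which aligns neural signals with the intermediate-layer representations of visual encoders rather than their final outputs. This addresses the granularity mismatch between human and machine vision, significantly improves neural visual decoding across multiple benchmarks, and reveals that decoding performance scales predictably with the capacity of the pre-trained vision backbone.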

Details

Motivation: Existing neural visual decoding methods overlook a fundamental granularity mismatch between human vision (which preserves a mixture of low-level texture and high-level semantics) and deep vision models (which emphasize semantic invariance and suppress local texture). Method: Proposes Shallow Alignment, a novel contrastive learning strategy that aligns neural signals with the intermediate-layer representations of visual encoders to balance low-level texture details and high-level semantic features. Result: Significantly outperforms standard final-layer alignment on multiple benchmarks, with gains of 22% to 58%; unlocks a scaling law in neural visual decoding for the first time, so that decoding performance improves predictably with the capacity of the pre-trained vision backbone; and explains the source of the gains through systematic empirical analysis. Conclusion: By aligning with intermediate layers, Shallow Alignment mitigates the granularity mismatch between human and machine vision and is an effective paradigm for improving the performance and scalability of neural visual decoding. Abstract: Neural visual decoding is a central problem in brain-computer interface research, aiming to reconstruct human visual perception and to elucidate the structure of neural representations. However, existing approaches overlook a fundamental granularity mismatch between human and machine vision, where deep vision models emphasize semantic invariance by suppressing local texture information, whereas neural signals preserve an intricate mixture of low-level visual attributes and high-level semantic content. To address this mismatch, we propose Shallow Alignment, a novel contrastive learning strategy that aligns neural signals with intermediate representations of visual encoders rather than their final outputs, thereby striking a better balance between low-level texture details and high-level semantic features. Extensive experiments across multiple benchmarks demonstrate that Shallow Alignment significantly outperforms standard final-layer alignment, with performance gains ranging from 22% to 58% across diverse vision backbones. Notably, our approach effectively unlocks the scaling law in neural visual decoding, enabling decoding performance to scale predictably with the capacity of pre-trained vision backbones. We further conduct systematic empirical analyses to shed light on the mechanisms underlying the observed performance gains.
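Contrastive alignment between neural embeddings and visual features is commonly implemented with an InfoNCE-style loss; a dependency-free sketch follows. The abstract does not say whether the paper uses this exact loss, so treat it as an assumption, along with the temperature value and the function name `info_nce`. The "shallow" aspect would correspond to passing intermediate-layer features of the visual encoder as `visual` instead of final-layer outputs.

```python
import math


def _normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(neural, visual, temperature=0.07):
    """Mean InfoNCE loss where neural[i] should match visual[i].

    Similarities are cosine similarities divided by `temperature`; the loss
    for each row is the negative log-softmax probability of its true pair.
    """
    neural = [_normalize(v) for v in neural]
    visual = [_normalize(v) for v in visual]
    loss = 0.0
    for i, query in enumerate(neural):
        logits = [sum(a * b for a, b in zip(query, key)) / temperature
                  for key in visual]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(neural)


# Toy embeddings: correctly paired features versus deliberately swapped ones.
brain = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(brain, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(brain, [[0.0, 1.0], [1.0, 0.0]])
```

Only the choice of target features changes between shallow and final-layer alignment in this framing; the loss itself stays the same, which makes the two settings straightforward to compare.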

[148] PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Cheng Cui,Ting Sun,Suyin Liang,Tingquan Gao,Zelun Zhang,Jiaxuan Liu,Xueqing Wang,Changda Zhou,Hongen Liu,Manhui Lin,Yue Zhang,Yubo Zhang,Yi Liu,Dianhai Yu,Yanjun Ma

Main category: cs.CV

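TL;DR: PaddleOCR-VL-1.5 is an ultra-compact 0.9B-parameter vision-language model that reaches a state-of-the-art accuracy of 94.5% on OmniDocBench v1.5, demonstrates robustness to real-world physical distortions on the newly proposed Real5-OmniDocBench benchmark, and adds support for seal recognition and text spotting.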

Details

Motivation: Improve the robustness of document understanding models to the physical distortions found in real-world scenarios (such as scanning, skew, warping, screen photography, and illumination changes) and extend their multi-task capabilities. Method: Upgrades the PaddleOCR-VL model to version 1.5; builds the Real5-OmniDocBench benchmark to evaluate performance under five types of real-world physical distortion; integrates seal recognition and text spotting. Result: Reaches a SOTA accuracy of 94.5% on OmniDocBench v1.5; achieves SOTA performance on Real5-OmniDocBench; keeps the parameter count at 0.9B, balancing efficiency and versatility. Conclusion: PaddleOCR-VL-1.5 delivers combined gains in accuracy, robustness, task generalization, and model efficiency, moving lightweight document understanding VLMs closer to practical use. Abstract: We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model's capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: https://github.com/PaddlePaddle/PaddleOCR

[149] Causal World Modeling for Robot Control

Lin Li,Qihang Zhang,Yiming Luo,Shuai Yang,Ruilin Wang,Fei Han,Mingrui Yu,Zelin Gao,Nan Xue,Xing Zhu,Yujun Shen,Yinghao Xu

Main category: cs.CV

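TL;DR: This paper presents LingBot-VA, an autoregressive diffusion framework for robot learning that combines video world modeling with vision-language pre-training. Its three key designs (a shared latent space, a closed-loop rollout mechanism, and an asynchronous inference pipeline) significantly improve long-horizon manipulation, data efficiency, and generalization.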

Details

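Motivation: Video world modeling helps robots understand the causal relationship between actions and visual dynamics so that they can "imagine" the near future, providing a new foundation for robot learning; combining it with vision-language pre-training strengthens cross-modal understanding and generalization. Method: Proposes LingBot-VA, an autoregressive diffusion framework that unifies frame prediction and policy execution; it uses a Mixture-of-Transformers to realize a shared latent space for vision and action tokens, introduces a closed-loop rollout mechanism that continually incorporates feedback from real observations, and designs an asynchronous inference pipeline that parallelizes action prediction and execution. Result: Validated in both simulation and real-world scenarios, with significant improvements in long-horizon manipulation, data efficiency in post-training, and generalization to novel configurations. Conclusion: Video world modeling together with vision-language pre-training forms a new paradigm for robot learning; through the coordinated design of its modules, LingBot-VA achieves efficient, robust, and general embodied control. Abstract: This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.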

[150] Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

Linhan Wang,Zichong Yang,Chen Bai,Guoxiang Zhang,Xiaotong Liu,Xiaoyin Zheng,Xiao-Xiao Long,Chang-Tien Lu,Cheng Lu

Main category: cs.CV

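TL;DR: This paper proposes Drive-JEPA, a framework that combines the Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation to improve planning representations for end-to-end autonomous driving, setting a new SOTA on NAVSIM.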

Details

Motivation: Existing end-to-end driving methods built on self-supervised video pretraining are limited by the behavioral ambiguity of driving scenes, where each scene offers only a single human trajectory, which makes multimodal planning behavior hard to learn. Method: (1) Adapt V-JEPA to end-to-end driving by pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning; (2) design a proposal-centric planner that distills simulator-generated trajectories together with human trajectories, and add a momentum-aware selection mechanism to keep behavior stable and safe. Result: On NAVSIM, the V-JEPA representation with a simple transformer decoder outperforms prior methods by 3 PDMS in the perception-free setting; the complete Drive-JEPA framework reaches 93.3 PDMS on v1 and 87.8 EPDMS on v2, a new SOTA. Conclusion: By combining video representation learning with multimodal trajectory distillation, Drive-JEPA mitigates the modal ambiguity caused by single trajectories in driving scenes and significantly improves end-to-end planning performance. Abstract: End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.

[151] Understanding Multimodal Complementarity for Single-Frame Action Anticipation

Manuel Benavent-Lledo,Konstantinos Bacharidis,Konstantinos Papoutsakis,Antonis Argyros,Jose Garcia-Rodriguez

Main category: cs.CV

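TL;DR: This paper challenges the assumption that human action anticipation requires dense temporal information and proposes AAG+, a method that anticipates actions from a single frame. By fusing RGB appearance, depth-based geometric cues, and semantic representations of past actions, it matches or exceeds existing video-based methods on several benchmarks.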

Details

Motivation: Challenge the conventional assumption that action anticipation must rely on dense temporal information, and investigate how much information about future actions a single frame contains and how to exploit it. Method: Building on AAG, systematically study single-frame action anticipation; analyze the contributions of RGB appearance, depth-based geometric cues, and semantic representations of past actions; examine how multimodal fusion strategies, keyframe selection policies, and the source of past-action history affect performance; consolidate the findings into the improved framework AAG+. Result: On challenging anticipation benchmarks including IKEA-ASM, Meccano, and Assembly101, AAG+ matches or exceeds state-of-the-art video-based methods. Conclusion: Single-frame action anticipation has substantial potential and can rival or surpass video-based methods; dense temporal modeling is not always necessary, and a carefully selected keyframe can be enough for high-quality anticipation. Abstract: Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.

[152] Urban Neural Surface Reconstruction from Constrained Sparse Aerial Imagery with 3D SAR Fusion

Da Li,Chen Yao,Tong Mao,Jiacheng Bao,Houjun Sun

Main category: cs.CV

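TL;DR: This paper presents the first urban neural surface reconstruction framework that fuses 3D SAR point clouds with aerial imagery. It addresses geometric ambiguity and instability under sparse views and markedly improves the accuracy, completeness, and robustness of large-scale urban 3D reconstruction.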

Details

Motivation: Existing neural surface reconstruction methods suffer from geometric ambiguity and instability under sparse views (for example, aerial acquisition constrained by flight paths, terrain, and cost), which makes them hard to apply to large-scale urban remote sensing. Method: Inject the geometric priors provided by 3D SAR point clouds into an SDF-based neural surface reconstruction backbone, using radar-guided structure-aware ray selection and adaptive sampling for stable and efficient optimization; build the first benchmark dataset of co-registered SAR and aerial imagery. Result: Under highly sparse and oblique-view conditions, the method markedly improves reconstruction accuracy, completeness, and robustness over single-modality baselines. Conclusion: Fusing optical and SAR sensing is an effective route to scalable, high-fidelity urban 3D reconstruction. Abstract: Neural surface reconstruction (NSR) has recently shown strong potential for urban 3D reconstruction from multi-view aerial imagery. However, existing NSR methods often suffer from geometric ambiguity and instability, particularly under sparse-view conditions. This issue is critical in large-scale urban remote sensing, where aerial image acquisition is limited by flight paths, terrain, and cost. To address this challenge, we present the first urban NSR framework that fuses 3D synthetic aperture radar (SAR) point clouds with aerial imagery for high-fidelity reconstruction under constrained, sparse-view settings. 3D SAR can efficiently capture large-scale geometry even from a single side-looking flight path, providing robust priors that complement photometric cues from images. Our framework integrates radar-derived spatial constraints into an SDF-based NSR backbone, guiding structure-aware ray selection and adaptive sampling for stable and efficient optimization. We also construct the first benchmark dataset with co-registered 3D SAR point clouds and aerial imagery, facilitating systematic evaluation of cross-modal 3D reconstruction. Extensive experiments show that incorporating 3D SAR markedly enhances reconstruction accuracy, completeness, and robustness compared with single-modality baselines under highly sparse and oblique-view conditions, highlighting a viable route toward scalable high-fidelity urban reconstruction with advanced airborne and spaceborne optical-SAR sensing.

[153] PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction

Changjian Jiang,Kerui Ren,Xudong Li,Kaiwen Song,Linning Xu,Tao Lu,Junting Dong,Yu Zhang,Bo Dai,Mulin Yu

Main category: cs.CV

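TL;DR: PLANING is an efficient online 3D reconstruction framework for monocular image streams. Its hybrid representation loosely couples explicit geometric primitives with neural Gaussians, so geometry and appearance are modeled and optimized separately. It delivers both high-quality rendering and accurate geometry, and outperforms existing methods in accuracy, speed, and structural clarity.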

Details

Motivation: Existing reconstruction methods for monocular image streams struggle to deliver high-fidelity rendering and accurate geometry at the same time, and typically trade one for the other. Method: Proposes the PLANING framework, whose hybrid representation loosely couples explicit geometric primitives with neural Gaussians so that geometry and appearance are modeled in a decoupled manner; designs an online initialization and optimization strategy that updates geometry and appearance parameters separately. Result: Improves dense mesh Chamfer-L2 by 18.52% over PGSR and PSNR by 1.31 dB over ARTDECO; reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, with quality matching offline per-scene optimization. Conclusion: PLANING achieves efficient, stable, high-quality streaming 3D reconstruction; its structural clarity and computational efficiency make it suitable for large-scale scene modeling and for building simulation-ready environments for embodied AI. Abstract: Streaming reconstruction from monocular image sequences remains challenging, as existing methods typically favor either high-quality rendering or accurate geometry, but rarely both. We present PLANING, an efficient on-the-fly reconstruction framework built on a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians, enabling geometry and appearance to be modeled in a decoupled manner. This decoupling supports an online initialization and optimization strategy that separates geometry and appearance updates, yielding stable streaming reconstruction with substantially reduced structural redundancy. PLANING improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, and reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, while matching the quality of offline per-scene optimization. Beyond reconstruction quality, the structural clarity and computational efficiency of PLANING make it well suited for a broad range of downstream applications, such as enabling large-scale scene modeling and simulation-ready environments for embodied AI. Project page: https://city-super.github.io/PLANING/ .
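For readers unfamiliar with the Chamfer-L2 figure quoted above, the following is a minimal sketch of one common symmetric form: the mean squared nearest-neighbor distance from each set to the other, summed over both directions. The abstract does not state which variant or sampling density the PLANING evaluation uses, so the function name `chamfer_l2` and the brute-force search are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def _mean_nn_sq(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean squared distance from each point in src to its nearest point in dst."""
    return sum(min(_sq_dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_l2(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer-L2 (assumed variant: sum of the two directional means)."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return _mean_nn_sq(a, b) + _mean_nn_sq(b, a)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0)]
    print(chamfer_l2(predicted, reference))
```

The brute-force search is quadratic in the number of points; evaluations on dense meshes would normally use a spatial index such as a k-d tree instead.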

[154] MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Baorui Ma,Jiahui Yang,Donglin Di,Xuancheng Zhang,Jianxun Cui,Hao Li,Yan Xie,Wei Chen

Main category: cs.CV

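TL;DR: This paper presents Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures, and shows a clear scaling trend for metric depth for the first time.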

Details

Motivation: Metric depth estimation faces heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in cross-source 3D data, which makes it hard to reuse the scaling recipe that has worked for vision foundation models. Method: Proposes the Sparse Metric Prompt, built by randomly masking depth maps, as a universal interface that decouples spatial reasoning from sensor and camera biases; pretrains at scale on about 20M cross-source image-depth pairs covering reconstructed, captured, and rendered data across 10000 camera models. Result: The pretrained model performs strongly on prompt-driven tasks such as depth completion, super-resolution, and Radar-camera fusion; its distilled prompt-free student reaches SOTA on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning; using the Metric Anything pretrained ViT as a visual encoder significantly improves the spatial intelligence of multimodal large language models. Conclusion: Metric depth estimation can benefit from scaling laws similar to those of modern foundation models, and Metric Anything offers a new path toward scalable, efficient real-world metric perception. Abstract: Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution and Radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception.
We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
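The abstract describes the Sparse Metric Prompt only as "created by randomly masking depth maps". As a rough illustration of that idea, the sketch below keeps a random fraction of the valid depth pixels and zeroes the rest. The function name `sparse_metric_prompt`, the `keep_ratio` parameter, and the convention that 0 marks a missing value are all assumptions for this example; the actual masking patterns and ratios used in Metric Anything are not given in the abstract.

```python
import random
from typing import List, Optional, Tuple

DepthMap = List[List[float]]


def sparse_metric_prompt(
    depth: DepthMap, keep_ratio: float = 0.05, seed: Optional[int] = None
) -> Tuple[DepthMap, List[List[bool]]]:
    """Randomly keep a fraction of valid (positive) depth values; zero out the rest.

    Returns the sparse depth map and a boolean mask of the kept pixels.
    """
    rng = random.Random(seed)
    # Only pixels with a positive depth are treated as valid candidates.
    valid = [(r, c) for r, row in enumerate(depth) for c, d in enumerate(row) if d > 0]
    n_keep = min(len(valid), max(1, round(keep_ratio * len(valid)))) if valid else 0
    kept = set(rng.sample(valid, n_keep))
    sparse = [
        [d if (r, c) in kept else 0.0 for c, d in enumerate(row)]
        for r, row in enumerate(depth)
    ]
    mask = [[(r, c) in kept for c in range(len(row))] for r, row in enumerate(depth)]
    return sparse, mask


if __name__ == "__main__":
    dense = [[2.0] * 5 for _ in range(4)]  # toy 4x5 depth map, values in metres
    sparse, mask = sparse_metric_prompt(dense, keep_ratio=0.25, seed=0)
    print(sum(cell for row in mask for cell in row), "pixels kept")
```

Under this reading, the same sparse input format could stand in for LiDAR returns, radar points, or low-resolution depth, which is one plausible way a single interface covers completion, super-resolution, and fusion tasks.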

[155] UEval: A Benchmark for Unified Multimodal Generation

Bo Li,Yida Yin,Wenhao Chai,Xingyu Fu,Zhuang Liu

Main category: cs.CV

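TL;DR: This paper introduces UEval, a benchmark for evaluating unified models that generate both images and text. It contains 1,000 expert-curated questions covering 8 real-world tasks and a range of reasoning types, uses a rubric-based method for fine-grained automatic scoring, and finds that reasoning ability matters for complex multimodal tasks.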

Details

Motivation: Existing evaluation methods such as LLM-as-a-judge struggle to capture the fine details of open-ended multimodal generation quality, and there is no comprehensive, fine-grained, scalable benchmark for unified models that generate both images and text. Method: Builds the UEval benchmark from 1,000 expert-curated multimodal questions across tasks; for each question an MLLM generates an initial rubric, which human experts then refine and validate, giving 10,417 validated rubric criteria in total; the rubric system is used for automatic scoring. Result: Current unified models perform poorly on UEval (GPT-5-Thinking scores only 66.4/100 and the best open-source model only 49.1/100); reasoning models often outperform non-reasoning ones; transferring reasoning traces to a non-reasoning model significantly narrows the gap. Conclusion: UEval provides a fine-grained, scalable benchmark for unified multimodal models, designed with both model and human input; the results indicate that reasoning ability is an important factor in complex multimodal understanding and generation. Abstract: We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Different from previous works that rely on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric, consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
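To make the rubric-based scoring concrete, here is a minimal sketch of how per-criterion judgments might roll up into a score out of 100. The abstract does not say how UEval weights criteria or aggregates across questions, so the unweighted fraction in `rubric_score` and the plain mean in `benchmark_score` are assumptions, as are both function names.

```python
from typing import Sequence


def rubric_score(criteria_met: Sequence[bool]) -> float:
    """Score one answer as the percentage of rubric criteria judged satisfied.

    Assumption: every criterion carries equal weight.
    """
    if not criteria_met:
        raise ValueError("a rubric needs at least one criterion")
    return 100.0 * sum(criteria_met) / len(criteria_met)


def benchmark_score(per_question: Sequence[Sequence[bool]]) -> float:
    """Average the per-question rubric scores (assumed aggregation rule)."""
    if not per_question:
        raise ValueError("no questions to score")
    return sum(rubric_score(q) for q in per_question) / len(per_question)


if __name__ == "__main__":
    # Hypothetical judgments for two questions with 4 and 2 criteria.
    judgments = [[True, True, False, True], [True, False]]
    print(benchmark_score(judgments))
```

One consequence of averaging per question rather than pooling all criteria is that a question with many criteria counts no more than a question with few; pooling would be an equally plausible alternative.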

[156] Unsupervised Decomposition and Recombination with Discriminator-Driven Diffusion Models

Archer Wang,Emile Anand,Yilun Du,Marin Soljačić

Main category: cs.CV

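TL;DR: This paper proposes an adversarially trained diffusion model approach for unsupervised learning of factorized latent representations. Recombining factors improves both generation quality and disentanglement, with SOTA results on image and robotic video tasks.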

Details

Motivation: Decomposing complex data into factorized representations helps reveal reusable components and supports generating new samples by recombination, but learning a high-quality factorized latent space with diffusion models without supervision remains difficult. Method: Introduces a discriminator that distinguishes single-source samples from samples generated by recombining factors across sources; the generator is optimized adversarially to fool this discriminator, which improves the physical and semantic consistency of the recombinations. Result: Lower FID and better MIG and MCC on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D; significantly higher state-space exploration coverage on LIBERO robotic video trajectories. Conclusion: The proposed adversarial training mechanism strengthens unsupervised factor discovery and compositional generation in diffusion models and broadens their potential in vision and robotics. Abstract: Decomposing complex data into factorized representations can reveal reusable components and enable synthesizing new samples via component recombination. We investigate this in the context of diffusion-based models that learn factorized latent spaces without factor-level supervision. In images, factors can capture background, illumination, and object attributes; in robotic videos, they can capture reusable motion components. To improve both latent factor discovery and quality of compositional generation, we introduce an adversarial training signal via a discriminator trained to distinguish between single-source samples and those generated by recombining factors across sources. By optimizing the generator to fool this discriminator, we encourage physical and semantic consistency in the resulting recombinations. Our method outperforms implementations of prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, achieving lower FID scores and better disentanglement as measured by MIG and MCC. Furthermore, we demonstrate a novel application to robotic video trajectories: by recombining learned action components, we generate diverse sequences that significantly increase state-space coverage for exploration on the LIBERO benchmark.

[157] Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Wenxuan Huang,Yu Zeng,Qiuchen Wang,Zhen Fang,Shaosheng Cao,Zheng Chu,Qingyu Yin,Shuang Chen,Zhenfei Yin,Lin Chen,Zehui Chen,Yao Hu,Philip Torr,Feng Zhao,Wanli Ouyang

Main category: cs.CV

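TL;DR: This paper proposes Vision-DeepResearch, a multimodal deep-research paradigm that supports multi-turn, multi-entity, and multi-scale visual and textual search. Cold-start supervision and reinforcement learning internalize deep-research capabilities into the MLLM, substantially improving complex question answering in noisy, cluttered real-world settings.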

Details

Motivation: Existing MLLMs that rely on external search assume that a single image or text query is enough to retrieve the key evidence. This ignores the heavy visual noise and the need to aggregate evidence from multiple sources in real-world scenarios, and it limits both reasoning depth and search breadth. Method: Proposes the Vision-DeepResearch paradigm, which supports dozens of reasoning steps and hundreds of search-engine interactions; uses cold-start supervised training and reinforcement learning to internalize multi-turn, multi-entity, multi-scale joint visual-textual search into the MLLM. Result: Substantially outperforms existing methods on multimodal deep-research tasks, including workflows built on closed-source models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet. Conclusion: Vision-DeepResearch enables more robust and deeper multimodal information retrieval and reasoning, offering a new paradigm for applying MLLMs to noisy, complex real-world tasks. Abstract: Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs by "reasoning-then-tool-call" for visual and textual search engines to obtain substantial gains on tasks requiring extensive factual information. These approaches, however, typically define multimodal search in a naive setting, assuming that a single full-level or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. To address these limitations, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity and multi-scale visual and textual search to robustly hit real-world search engines under heavy noise. Our Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs, and workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro and Claude-4-Sonnet.
The code will be released at https://github.com/Osilly/Vision-DeepResearch.

[158] BLO-Inst: Bi-Level Optimization Based Alignment of YOLO and SAM for Robust Instance Segmentation

Li Zhang,Pengtao Xie

Main category: cs.CV

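TL;DR: This paper proposes BLO-Inst, a framework that uses bi-level optimization to address objective mismatch and alignment overfitting between an object detector and the SAM segmentation model, so that the detector produces prompts better suited to SAM and automatic segmentation improves.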

Details

Motivation: SAM has zero-shot segmentation capability but depends on manual prompts. Generating prompts automatically with a detector raises two problems: objective mismatch (the detector is optimized for localization, SAM for segmentation) and alignment overfitting (joint training leads the detector to memorize prompt adjustments for specific training samples). Method: Proposes BLO-Inst, which uses bi-level optimization. The lower level fine-tunes SAM on a data subset D1 to improve segmentation accuracy for the current detection boxes; the upper level updates the detector on a separate subset D2 so that its bounding boxes minimize the validation loss of the fine-tuned SAM, turning the detector into a segmentation-aware prompt generator. Result: BLO-Inst outperforms standard baselines on general and biomedical image segmentation tasks. Conclusion: Bi-level optimization effectively aligns detection and segmentation objectives, so the detector outputs prompts better suited to SAM, which supports fully automatic, high-quality segmentation. Abstract: The Segment Anything Model has revolutionized image segmentation with its zero-shot capabilities, yet its reliance on manual prompts hinders fully automated deployment. While integrating object detectors as prompt generators offers a pathway to automation, existing pipelines suffer from two fundamental limitations: objective mismatch, where detectors optimized for geometric localization do not correspond to the optimal prompting context required by SAM, and alignment overfitting in standard joint training, where the detector simply memorizes specific prompt adjustments for training samples rather than learning a generalizable policy. To bridge this gap, we introduce BLO-Inst, a unified framework that aligns detection and segmentation objectives by bi-level optimization. We formulate the alignment as a nested optimization problem over disjoint data splits. In the lower level, the SAM is fine-tuned to maximize segmentation fidelity given the current detection proposals on a subset ($D_1$). In the upper level, the detector is updated to generate bounding boxes that explicitly minimize the validation loss of the fine-tuned SAM on a separate subset ($D_2$). This effectively transforms the detector into a segmentation-aware prompt generator, optimizing the bounding boxes not just for localization accuracy, but for downstream mask quality. Extensive experiments demonstrate that BLO-Inst achieves superior performance, outperforming standard baselines on tasks in general and biomedical domains.
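The nested structure described above can be illustrated on a scalar toy problem. This is only a sketch of generic bi-level optimization, not the BLO-Inst training procedure: `theta` stands in for the detector parameters, `w` for the SAM parameters, the quadratic losses replace the real training and validation losses on the two splits, and the finite-difference hypergradient is a simplification chosen to keep the example dependency-free. All names and constants are hypothetical.

```python
def inner_solve(theta: float, steps: int = 50, lr: float = 0.1) -> float:
    """Lower level: fit w to the current theta by gradient descent.

    Toy training loss on the first split: (w - theta)^2.
    """
    w = 0.0
    for _ in range(steps):
        w -= lr * 2.0 * (w - theta)
    return w


def validation_loss(theta: float, target: float = 3.0) -> float:
    """Upper-level objective: loss of the fitted w on the second split.

    Toy validation loss: (w*(theta) - target)^2.
    """
    w_star = inner_solve(theta)
    return (w_star - target) ** 2


def outer_solve(steps: int = 100, lr: float = 0.1, eps: float = 1e-4) -> float:
    """Upper level: update theta to reduce the validation loss of the fitted w."""
    theta = 0.0
    for _ in range(steps):
        # Central finite difference through the whole inner solve.
        grad = (validation_loss(theta + eps) - validation_loss(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta


if __name__ == "__main__":
    print(round(outer_solve(), 3))
```

The point of the toy is the dependency chain: the upper-level variable is never scored directly, only through the lower-level solution it induces, which mirrors how the detector is judged by the mask quality of the SAM fine-tuned on its proposals.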

[159] RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation

Hanzhuo Huang,Qingyang Bao,Zekai Gu,Zhongshuo Du,Cheng Lin,Yuan Liu,Sibei Yang

Main category: cs.CV

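TL;DR: This paper proposes a 3D asset-referenced diffusion model whose dual-branch perception architecture jointly models multi-view RGB images and point maps, producing generated images that are highly consistent with the 3D reference.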

Details

Motivation: Existing generation methods that condition on a single reference image cannot make use of 3D assets, which limits their versatility and flexibility in practice. Method: Builds a cross-domain dual-branch diffusion model that processes multi-view RGB images and point maps in separate branches, and uses spatial alignment and domain decoupling to generate RGB images and point maps that are spatially aligned but content-disentangled. Result: Experiments show that the method effectively uses 3D assets as references to generate highly consistent images, improving the integration of 2D generation with 3D content creation. Conclusion: The proposed model bridges the gap between image diffusion generation and 3D assets and offers a new approach to 3D-aware image generation. Abstract: In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the simultaneous generation of two spatially aligned but content-disentangled outputs, RGB images and point maps, linking 2D image attributes with 3D asset attributes. Experiments show that our approach effectively uses 3D assets as references to produce images consistent with the given assets, opening new possibilities for combining diffusion models with 3D content creation.

[160] SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence

Saoud Aldowaish,Yashwanth Karumanchi,Kai-Chen Chiang,Soroosh Noorzad,Morteza Fayazi

Main category: cs.CV

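TL;DR: This paper presents SINA, an open-source, fully automated circuit schematic image-to-netlist generator that combines deep learning, connected-component labeling, OCR, and a vision-language model. It reaches 96.47% netlist-generation accuracy, 2.72x higher than state-of-the-art approaches.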

Details

Motivation: Existing methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. Method: SINA combines deep learning for component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, OCR for retrieving component reference designators, and a Vision-Language Model (VLM) for reliable reference designator assignment. Result: In experiments, SINA achieves 96.47% overall netlist-generation accuracy, 2.72x higher than state-of-the-art approaches. Conclusion: SINA is an efficient, fully automated solution for parsing schematic images that markedly improves the accuracy and robustness of netlist generation. Abstract: Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.
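Connected-Component Labeling itself is a standard image-processing step, and a small sketch may help show why it suits connectivity extraction. The example below labels 4-connected foreground regions of a binary grid with breadth-first search. How SINA prepares the wire mask and maps labels to nets is not described in the abstract, so treating each label as one electrical net is an assumption, and `label_components` is a hypothetical helper name.

```python
from collections import deque
from typing import List, Tuple


def label_components(grid: List[List[int]]) -> Tuple[List[List[int]], int]:
    """Label 4-connected regions of non-zero cells; returns (labels, count).

    Background cells keep label 0; regions are numbered from 1 in scan order.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and labels[r][c] == 0:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:  # flood-fill the region with the current label
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (
                            0 <= ny < rows
                            and 0 <= nx < cols
                            and grid[ny][nx]
                            and labels[ny][nx] == 0
                        ):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


if __name__ == "__main__":
    # Toy "wire mask": 1 marks wire pixels after components have been removed.
    wires = [
        [1, 1, 0, 0],
        [0, 0, 0, 1],
        [1, 0, 0, 1],
    ]
    labels, count = label_components(wires)
    print(count, labels)
```

Under the stated assumption, two component terminals that touch pixels with the same label would be placed on the same net in the generated netlist.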

[161] Creative Image Generation with Diffusion Model

Kunpeng Song,Ahmed Elgammal

Main category: cs.CV

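TL;DR: This paper proposes a new diffusion-based framework for creative image generation. It increases creativity by steering generated images toward low-probability regions of the CLIP embedding space, and introduces pullback mechanisms to preserve visual fidelity.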

Details

Motivation: Existing methods rely on manual concept blending or subcategory exclusion and lack a quantitative, principled model of creativity; a method is needed that generates novel, high-quality images automatically and interpretably. Method: Defines creativity as the inverse probability of an image in the CLIP embedding space; estimates the distribution of generated images during diffusion and drives it toward low-probability (rare) regions; introduces pullback mechanisms to balance creativity against visual quality. Result: Experiments on text-to-image diffusion models confirm the framework's effectiveness and efficiency, with generated images that are creative, visually faithful, and novel. Conclusion: The work gives "creativity" in generative models a computable, optimizable definition and an implementation path for visual content synthesis. Abstract: Creative image generation has emerged as a compelling area of research, driven by the need to produce novel and high-quality images that expand the boundaries of imagination. In this work, we propose a novel framework for creative generation using diffusion models, where creativity is associated with the inverse probability of an image's existence in the CLIP embedding space. Unlike prior approaches that rely on a manual blending of concepts or exclusion of subcategories, our method calculates the probability distribution of generated images and drives it towards low-probability regions to produce rare, imaginative, and visually captivating outputs. We also introduce pullback mechanisms, achieving high creativity without sacrificing visual fidelity. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness and efficiency of our creative generation framework, showcasing its ability to produce unique, novel, and thought-provoking images. This work provides a new perspective on creativity in generative models, offering a principled method to foster innovation in visual content synthesis.

[162] EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers

John Flynn,Wolfgang Paier,Dimitar Dinev,Sam Nhut Nguyen,Hayk Poghosyan,Manuel Toribio,Sandipan Banerjee,Guy Gafni

Main category: cs.CV

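TL;DR: This paper presents EditYourself, a DiT-based framework for audio-driven video-to-video editing. It supports fine-grained, transcript-based edits to existing talking-head videos (adding, removing, or retiming spoken content) while preserving motion, temporal coherence, speaker identity, and accurate lip synchronization.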

Details

Motivation: Existing generative video models are good at producing new videos from text or images but fall short when editing existing recordings. In particular, when only the spoken script changes, they struggle to preserve motion consistency, temporal coherence, speaker identity, and lip synchronization together. Method: Builds on a general-purpose video diffusion model (DiT), adds audio conditioning and region-aware, edit-focused training extensions, and uses spatiotemporal inpainting for precise lip synthesis and restructuring of the video. Result: EditYourself performs high-quality transcript-driven editing of talking-head videos, including adding, removing, and retiming visually spoken content, while maintaining identity consistency, natural motion, and lip-sync accuracy over long durations. Conclusion: The work is a foundational step toward using generative video models in professional video post-production. Abstract: Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.

[163] Early and Prediagnostic Detection of Pancreatic Cancer from Computed Tomography

Wenxuan Li,Pedro R. A. S. Bassi,Lizhou Wu,Xinze Zhou,Yuxuan Zhao,Qi Chen,Szymon Plotka,Tianyu Lin,Zheren Zhu,Marisa Martin,Justin Caskey,Shanshan Jiang,Xiaoxi Chen,Jaroslaw B. Ćwikla,Artur Sankowski,Yaping Wu,Sergio Decherchi,Andrea Cavalli,Chandana Lall,Cristian Tomasetti,Yaxing Guo,Xuan Yu,Yuqing Cai,Hualin Qiao,Jie Bao,Chenhan Hu,Ximing Wang,Arkadiusz Sitek,Kai Ding,Heng Li,Meiyun Wang,Dexin Yu,Guang Zhang,Yang Yang,Kang Wang,Alan L. Yuille,Zongwei Zhou

Main category: cs.CV

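TL;DR: This paper presents ePAI, an AI system for early detection of pancreatic ductal adenocarcinoma (PDAC) on CT scans. It can find small lesions that radiologists missed months before clinical diagnosis and significantly improves detection sensitivity.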

Details

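Motivation: Pancreatic ductal adenocarcinoma (PDAC) is highly lethal and is often found only at a late stage. Retrospective studies show that expert radiologists who know the patient was later diagnosed with PDAC can identify previously overlooked lesions on earlier CT scans, so an automated tool that assists early detection is needed. Method: Developed ePAI, an AI system trained on data from 1,598 patients at a single center, to automatically detect and precisely localize PDAC; evaluated it on an internal test (1,009 patients) and a multi-center external test (7,158 patients), and ran a multi-reader study against 30 board-certified radiologists. Result: In the internal test, ePAI achieved an AUC of 0.939-0.999, a sensitivity of 95.3%, and a specificity of 98.7% for small PDAC under 2 cm, localizing PDAC as small as 2 mm; in the external test, it achieved an AUC of 0.918-0.945, a sensitivity of 91.5%, and a specificity of 88.0%, localizing lesions as small as 5 mm; on prediagnostic CT it detected PDAC in 75 of 159 patients, a median of 347 days before clinical diagnosis; in the multi-reader study its sensitivity was 50.3% higher than that of radiologists (P < 0.05) at a comparable specificity (95.4%). Conclusion: ePAI shows strong early PDAC detection and localization, particularly for overlooked lesions in the prediagnostic stage, and could become a useful assistive tool for improving early diagnosis of pancreatic cancer. Abstract: Pancreatic ductal adenocarcinoma (PDAC), one of the deadliest solid malignancies, is often detected at a late and inoperable stage. Retrospective reviews of prediagnostic CT scans, when conducted by expert radiologists aware that the patient later developed PDAC, frequently reveal lesions that were previously overlooked. To help detect these lesions earlier, we developed an automated system named ePAI (early Pancreatic cancer detection with Artificial Intelligence). It was trained on data from 1,598 patients from a single medical center. In the internal test involving 1,009 patients, ePAI achieved an area under the receiver operating characteristic curve (AUC) of 0.939-0.999, a sensitivity of 95.3%, and a specificity of 98.7% for detecting small PDAC less than 2 cm in diameter, precisely localizing PDAC as small as 2 mm. In an external test involving 7,158 patients across 6 centers, ePAI achieved an AUC of 0.918-0.945, a sensitivity of 91.5%, and a specificity of 88.0%, precisely localizing PDAC as small as 5 mm. Importantly, ePAI detected PDACs on prediagnostic CT scans obtained 3 to 36 months before clinical diagnosis that had originally been overlooked by radiologists. It successfully detected and localized PDACs in 75 of 159 patients, with a median lead time of 347 days before clinical diagnosis.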
Our multi-reader study showed that ePAI significantly outperformed 30 board-certified radiologists by 50.3% (P < 0.05) in sensitivity while maintaining a comparable specificity of 95.4% in early and prediagnostic PDAC detection. These findings suggest the potential of ePAI as an assistive tool to improve early detection of pancreatic cancer.
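Since the results above are reported as sensitivity and specificity, a short sketch of the standard definitions may be useful. The only figure taken from the abstract is the prediagnostic detection count (75 of 159 patients); the remaining counts in the example are hypothetical, and the function names are illustrative.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of actual positives that were detected: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of actual negatives that were correctly cleared: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # From the abstract: PDAC detected in 75 of 159 prediagnostic patients,
    # which leaves 159 - 75 = 84 patients whose lesion was not detected.
    detected, total = 75, 159
    print(f"prediagnostic detection rate: {100 * sensitivity(detected, total - detected):.1f}%")

    # Hypothetical counts, only to show the specificity formula.
    print(f"specificity: {100 * specificity(true_neg=95, false_pos=5):.1f}%")
```

The first figure works out to about 47.2%, noticeably lower than the 95.3% and 91.5% sensitivities in the internal and external tests, which is consistent with prediagnostic scans being the harder setting.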

[164] PI-Light: Physics-Inspired Diffusion for Full-Image Relighting

Zhexin Liang,Zhaoxi Chen,Yongwei Chen,Tianyi Wei,Tengfei Wang,Xingang Pan

Main category: cs.CV

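TL;DR: This paper presents π-Light, a physics-inspired two-stage diffusion framework for full-image relighting. Batch-aware attention, a physics-guided neural rendering module, physics-inspired losses, and a carefully curated dataset together markedly improve generalization to real-world scenes.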

Details

Motivation: Address the main obstacles in full-image relighting: large-scale paired data is hard to collect, physical plausibility is hard to maintain, and data-driven priors generalize poorly; also narrow the synthetic-to-real gap in relighting. Method: Proposes the two-stage π-Light framework: batch-aware attention improves the consistency of intrinsic predictions; a physics-guided neural rendering module enforces physically plausible light transport; physics-inspired losses constrain training; a diverse dataset captured under controlled lighting is built; pretrained diffusion models can be finetuned efficiently. Result: Synthesizes high-fidelity specular highlights and diffuse reflections across many materials and generalizes to real-world scenes better than existing methods. Conclusion: By combining physical priors with diffusion modeling, π-Light improves the physical plausibility and cross-domain generalization of full-image relighting and provides a new approach and benchmark for follow-up work. Abstract: Full-image relighting remains a challenging problem due to the difficulty of collecting large-scale structured paired data, the difficulty of maintaining physical plausibility, and the limited generalizability imposed by data-driven priors. Existing attempts to bridge the synthetic-to-real gap for full-scene relighting remain suboptimal. To tackle these challenges, we introduce Physics-Inspired diffusion for full-image reLight ($π$-Light, or PI-Light), a two-stage framework that leverages physics-inspired diffusion models. Our design incorporates (i) batch-aware attention, which improves the consistency of intrinsic predictions across a collection of images, (ii) a physics-guided neural rendering module that enforces physically plausible light transport, (iii) physics-inspired losses that regularize training dynamics toward a physically meaningful landscape, thereby enhancing generalizability to real-world image editing, and (iv) a carefully curated dataset of diverse objects and scenes captured under controlled lighting conditions. Together, these components enable efficient finetuning of pretrained diffusion models while also providing a solid benchmark for downstream evaluation. Experiments demonstrate that $π$-Light synthesizes specular highlights and diffuse reflections across a wide variety of materials, achieving superior generalization to real-world scenes compared with prior approaches.

[165] Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Xiaoxiao Sun,Mingyang Li,Kun yuan,Min Woo Sun,Mark Endo,Shengguang Wu,Changlin Li,Yuhui Zhang,Zeyu Wang,Serena Yeung-Levy

Main category: cs.CV

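TL;DR: This paper introduces VI-Probe, a framework that uses controllable visual-illusion experiments to analyze how large vision-language models (VLMs) balance perception of visual change against language-memory-driven responses. It finds that response rigidity stems from heterogeneous mechanisms (memory override, perception-memory competition, and visual-processing limits) rather than a single cause.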

Details

Motivation: Find the root cause of response rigidity in VLMs on visual-illusion tasks: do the models actually perceive the visual change, or do they only recall from language memory? Prior studies observed the phenomenon without attributing it systematically. Method: Builds VI-Probe, which pairs graded illusion perturbations with matched non-illusion visual controls to disentangle visual perception from language recall; introduces new metrics beyond average accuracy: Polarity-Flip Consistency, Template Fixation Index, and a normalized illusion multiplier. Result: Cross-model experiments show that response persistence arises from heterogeneous mechanisms: GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, and the Qwen variants suggest visual-processing limits; no single explanation fits all models. Conclusion: The stability of VLM responses under visual change cannot be explained by one mechanism; probing-based evaluation is needed to measure both knowledge and sensitivity to controlled visual change. Abstract: Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different model families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.

[166] One-step Latent-free Image Generation with Pixel Mean Flows

Yiyang Lu,Susie Lu,Qiao Sun,Hanhong Zhao,Zhicheng Jiang,Xianbang Wang,Tianhong Li,Zhengyang Geng,Kaiming He

Main category: cs.CV

TL;DR: This paper proposes 'pixel MeanFlow' (pMF), a generative model for one-step, latent-free diffusion/flow-based image generation, and reports strong FID scores on ImageNet.

Details

Motivation: To advance diffusion/flow-based generative models toward one-step generation without a latent space, filling a key gap in this regime. Method: Proposes pMF, which formulates the network output space and the loss space separately: the network target lies on a presumed low-dimensional image manifold (x-prediction), while the loss is defined via MeanFlow in velocity space; a simple transformation links the image manifold and the average velocity field. Result: pMF achieves strong one-step, latent-free image generation on ImageNet at 256x256 (FID 2.22) and 512x512 (FID 2.48). Conclusion: pMF pushes the boundaries of diffusion/flow-based generative models and offers a viable, efficient paradigm for one-step, latent-free generation. Abstract: Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose "pixel MeanFlow" (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.
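Illustrative note: the abstract does not spell out the transformation between the image manifold and the average velocity field, so the following is only a minimal sketch of one natural candidate, not the authors' implementation. It assumes the usual MeanFlow convention, z_t = (1 - t) x + t ε with the displacement rule z_r = z_t - (t - r) u, under which setting r = 0 gives x = z_t - t u. The function names are hypothetical, and plain Python lists stand in for image tensors.

```python
# Sketch under stated assumptions: converts between an x-prediction and an
# average velocity u, assuming z_t = (1 - t) * x + t * eps and
# z_r = z_t - (t - r) * u with r = 0. Not taken from the pMF paper.

def x_to_avg_velocity(z_t, x_pred, t):
    """Map a predicted clean image x to the average velocity over [0, t]."""
    if t <= 0:
        raise ValueError("t must be positive to divide by it")
    return [(z - x) / t for z, x in zip(z_t, x_pred)]


def avg_velocity_to_x(z_t, u, t):
    """Inverse map: recover the clean image from the average velocity."""
    return [z - t * ui for z, ui in zip(z_t, u)]


def one_step_sample(noise, predict_x):
    """One-step generation from pure noise (t = 1).

    `predict_x(z, t)` is a hypothetical stand-in for a network that outputs
    an x-prediction; the result is z_1 - 1 * u, i.e. the x-prediction itself.
    """
    t = 1.0
    u = x_to_avg_velocity(noise, predict_x(noise, t), t)
    return avg_velocity_to_x(noise, u, t)


if __name__ == "__main__":
    x = [0.5, -1.0]      # toy "clean image"
    eps = [1.5, 1.0]     # toy noise sample
    t = 0.5
    z_t = [(1 - t) * xi + t * ei for xi, ei in zip(x, eps)]
    u = x_to_avg_velocity(z_t, x, t)
    print(u)                             # equals eps - x on this straight path
    print(avg_velocity_to_x(z_t, u, t))  # recovers x
```

On a straight-line path the instantaneous velocity ε - x is constant, so the average velocity equals it; this is why the toy example recovers ε - x. Keeping the network output in x-space while computing the loss on u is the separation the abstract describes, with this conversion acting as the bridge under the stated assumption.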

Table of Contents

cs.CL [Back]

[1] DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

[2] asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation

[3] UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop

[4] Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations

[5] ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference

[6] Multi-task Code LLMs: Data Mix or Model Merge?

[7] Large Language Models Naively Recover Ethnicity from Individual Records

[8] EnsembleLink: Accurate Record Linkage Without Training Data

[9] Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space

[10] From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning

[11] Scaling Embeddings Outperforms Scaling Experts in Language Models

[12] Multilingual Dysarthric Speech Assessment Using Universal Phone Recognition and Language-Specific Phonemic Contrast Modeling

[13] Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

[14] Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

[15] MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

[16] SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models

[17] MoCo: A One-Stop Shop for Model Collaboration Research

[18] CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding

[19] Qwen3-ASR Technical Report

[20] Self-Improving Pretraining: using post-trained models to pretrain better models

[21] The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation

[22] User-Centric Evidence Ranking for Attribution and Fact Verification

[23] Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

[24] SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models

[25] DimStance: Multilingual Datasets for Dimensional Stance Analysis

[26] MURAD: A Large-Scale Multi-Domain Unified Reverse Arabic Dictionary Dataset

[27] LMK > CLS: Landmark Pooling for Dense Embeddings

[28] inversedMixup: Data Augmentation via Inverting Mixed Embeddings

[29] Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes

[30] ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

[31] KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices

[32] Language Models as Artificial Learners: Investigating Crosslinguistic Influence

[33] ILRR: Inference-Time Steering Method for Masked Diffusion Language Models

[34] AdaptBPE: From General Purpose to Specialized Tokenizers

[35] Scale-Dependent Semantic Dynamics Revealed by Allan Deviation

[36] FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning

[37] Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling

[38] Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

[39] Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

[40] Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

[41] TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning

[42] Enhancing Language Models for Robust Greenwashing Detection

[43] Procedural Pretraining: Warming Up Language Models with Abstract Data

[44] CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering

[45] Temporal Guidance for Large Language Models

[46] CoFrGeNet: Continued Fraction Architectures for Language Generation

[47] Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond

[48] Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention

[49] KID: Knowledge-Injected Dual-Head Learning for Knowledge-Grounded Harmful Meme Detection

[50] Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation

[51] RAG-E: Quantifying Retriever-Generator Alignment and Failure Modes

[52] Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning

[53] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

[54] Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

[55] Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text

[56] SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching

[57] From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes

[58] OVD: On-policy Verbal Distillation

[59] Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding

[60] Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

[61] When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications

[62] Causal Autoregressive Diffusion Language Model

[63] Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models

[64] A Separable Architecture for Continuous Token Representation in Language Models

[65] On the Paradoxical Interference between Instruction-Following and Task Solving

[66] MasalBench: A Benchmark for Contextual and Cross-Cultural Understanding of Persian Proverbs in LLMs

[67] $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA

[68] VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

[69] ECO: Quantized Training without Full-Precision Master Weights

[70] A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine

[71] Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

[72] FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

[73] DynaWeb: Model-Based Reinforcement Learning of Web Agents

[74] Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

cs.CV [Back]

[75] MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading

[76] Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs

[77] Text controllable PET denoising