Table of Contents
cs.CL
[1] DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents
Nikita Gupta,Riju Chatterjee,Lukas Haas,Connie Tao,Andrew Wang,Chang Liu,Hidekazu Oiwa,Elena Gribovskaya,Jan Ackermann,John Blitzer,Sasha Goldshtein,Dipanjan Das
Main category: cs.CL
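TL;DR: DeepSearchQA is a 900-prompt benchmark for evaluating agents on multi-step information-seeking tasks across 17 fields. It emphasizes systematic collation of information, de-duplication and entity resolution, and reasoning about when to stop in an open-ended search space.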
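The tension between recall and precision on exhaustive answer lists can be illustrated with a small sketch. It assumes answers are compared as normalized strings and scored with set-based precision, recall, and F1; the normalization rule, the function names, and the toy answer lists are assumptions for illustration, not the benchmark's actual grader.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def set_scores(predicted, gold):
    """Return (precision, recall, f1) for a predicted answer list against a gold set."""
    pred = {normalize(a) for a in predicted}  # the set also removes duplicates
    ref = {normalize(a) for a in gold}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical gold set and two agent behaviours.
gold = ["Alpha", "Beta", "Gamma", "Delta"]
early_stop = ["alpha", "beta"]  # premature stopping (under-retrieval)
hedging = ["alpha", "beta", "gamma", "delta",
           "epsilon", "zeta", "eta", "theta"]  # wide net of extra guesses

print(set_scores(early_stop, gold)[:2])  # (1.0, 0.5)
print(set_scores(hedging, gold)[:2])     # (0.5, 1.0)
```

Both toy agents end up with the same F1, which is why looking at precision and recall separately is useful for telling the two failure modes apart.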
Details
Motivation: Existing benchmarks mostly target single-answer retrieval or factuality and lack a systematic evaluation of complex, multi-step, deep information seeking. In particular, they neglect key capabilities such as long-horizon planning, context retention, and balancing precision with recall.
Method: The authors build DeepSearchQA, a benchmark of 900 handcrafted tasks that are structured as causal chains, grounded in the open web, and have verifiable answers. Using a systematic evaluation framework, they analyze mainstream agents in terms of recall, precision, and failure modes such as premature stopping and overly broad answering.
Result: Current state-of-the-art agents perform poorly on deep search tasks: they struggle to combine high recall with high precision, and commonly exhibit failure modes such as premature stopping and hedging with low-confidence answers.
Conclusion: DeepSearchQA reveals significant shortcomings in the deep-research capabilities of current agents and offers a diagnostic tool and direction for building more robust, systematic information-seeking agents.
Abstract: We introduce DeepSearchQA, a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields. Unlike traditional benchmarks that target single answer retrieval or broad-spectrum factuality, DeepSearchQA features a dataset of challenging, handcrafted tasks designed to evaluate an agent's ability to execute complex search plans to generate exhaustive answer lists. This shift in design explicitly tests three critical, yet under-evaluated capabilities: 1) systematic collation of fragmented information from disparate sources, 2) de-duplication and entity resolution to ensure precision, and 3) the ability to reason about stopping criteria within an open-ended search space. Each task is structured as a causal chain, where discovering information for one step is dependent on the successful completion of the previous one, stressing long-horizon planning and context retention. All tasks are grounded in the open web with objectively verifiable answer sets. Our comprehensive evaluation of state-of-the-art agent architectures reveals significant performance limitations: even the most advanced models struggle to balance high recall with precision. We observe distinct failure modes ranging from premature stopping (under-retrieval) to hedging behaviors, where agents cast an overly wide net of low-confidence answers to artificially boost recall.
These findings highlight critical headroom in current agent designs and position DeepSearchQA as an essential diagnostic tool for driving future research toward more robust, deep-research capabilities.
[2] asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation
Oleg Sedukhin,Andrey Kostin
Main category: cs.CL
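TL;DR: The paper proposes several improvements to speech recognition evaluation: a string alignment algorithm that supports multi-reference labeling and arbitrary-length insertions, a new longform Russian test set (DiverseSpeech-Ru), and evidence that models readily overfit to dataset-specific labeling. It also provides tools for streaming evaluation and uniform model wrappers.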
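To make the idea of multi-reference scoring concrete, here is a minimal sketch. It is a deliberately simpler stand-in for the paper's alignment algorithm: it computes a standard word-level Levenshtein distance against each complete reference transcript and keeps the lowest word error rate. The function names and example sentences are assumptions for illustration.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def multi_reference_wer(references, hypothesis):
    """Lowest WER of the hypothesis against any of the acceptable references."""
    if not references:
        raise ValueError("at least one reference is required")
    hyp = hypothesis.split()
    rates = []
    for ref in references:
        words = ref.split()
        rates.append(edit_distance(words, hyp) / max(len(words), 1))
    return min(rates)


# Two equally valid ways to transcribe the same utterance (hypothetical example).
refs = ["it costs twenty five dollars", "it costs 25 dollars"]
hyp = "it costs 25 dollars"
print(multi_reference_wer(refs[:1], hyp))  # 0.4: penalized for a labeling choice
print(multi_reference_wer(refs, hyp))      # 0.0: the second reference matches
```

With a single reference the hypothesis is charged for a spelling convention rather than a recognition error, which is the kind of labeling-specific effect the summary above describes.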
Details
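Motivation: Existing speech recognition evaluation methods suffer from labeling difficulties and inaccurate alignment for non-Latin languages, languages with rich word formation, and longform speech. Models may also adapt to the labeling conventions of a particular dataset, which inflates metrics.
Method: The authors propose a string alignment algorithm that supports multi-reference labeling and arbitrary-length insertions; collect and label DiverseSpeech-Ru, a longform Russian test set; relabel an existing Russian test set with multiple references; analyze adaptation to labeling during fine-tuning; develop tools for streaming evaluation and for visualizing alignments of multiple transcriptions; and provide uniform wrappers for offline and streaming ASR models.
Result: The improved alignment algorithm makes labeling and evaluation of longform speech in non-Latin languages such as Russian more accurate. Models are found to adapt readily to dataset-specific labeling styles, which biases evaluation. The tools support more robust streaming evaluation and comparison of multiple transcriptions.
Conclusion: Speech recognition evaluation needs more robust alignment and multi-reference labeling, especially for morphologically complex languages. Measuring true performance free of labeling bias is essential, and an open-source toolchain can help standardize evaluation practice in the community.
Abstract: We propose several improvements to speech recognition evaluation. First, we propose a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions, and better word alignment. This is especially useful for non-Latin languages and those with rich word formation, and for labeling cluttered or longform speech. Secondly, we collect a novel test set, DiverseSpeech-Ru, of longform in-the-wild Russian speech with careful multi-reference labeling. We also perform multi-reference relabeling of a popular Russian test set and study fine-tuning dynamics on its corresponding train set. We demonstrate that the model often adapts to dataset-specific labeling, causing an illusion of metric improvement. Based on the improved word alignment, we develop tools to evaluate streaming speech recognition and to align multiple transcriptions to compare them visually. Additionally, we provide uniform wrappers for many offline and streaming speech recognition models. Our code will be made publicly available.
[3] UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop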
Muhammad Ali Shafique,Areej Mehboob,Layba Fiaz,Muhammad Usman Qadeer,Hamza Farooq
Main category: cs.CL
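TL;DR: The paper proposes a contextually ensembled translation framework with human validation and uses it to build UrduBench, the first standardized Urdu reasoning benchmark. It systematically evaluates several classes of LLMs on Urdu reasoning tasks, showing where multi-step and symbolic reasoning break down and why language alignment matters.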
Details
Motivation: Low-resource languages such as Urdu lack standardized reasoning benchmarks. Existing approaches are limited by sensitivity to machine translation and by a focus on general language tasks, making it hard to assess reasoning ability accurately.
Method: The authors propose a contextually ensembled translation framework that combines multiple translation systems with human validation to preserve semantic, contextual, and structural integrity. They translate MGSM, MATH-500, CommonSenseQA, and OpenBookQA into Urdu to form UrduBench, and evaluate reasoning-oriented and instruction-tuned LLMs under several prompting strategies.
Result: Multi-step and symbolic reasoning tasks are markedly harder in Urdu; language consistency is a key prerequisite for robust reasoning; and performance differs clearly across datasets, difficulty levels, model architectures, and model scales.
Conclusion: The work establishes a scalable methodology for Urdu reasoning evaluation and provides empirical insight into multilingual reasoning failures. The framework can be extended to other low-resource languages.
Abstract: Recent advances in large language models (LLMs) have led to strong reasoning capabilities; however, evaluating such models in low-resource languages remains challenging due to the lack of standardized benchmarks. In particular, Urdu reasoning evaluation has been limited by the sensitivity of machine translation and an emphasis on general language tasks rather than reasoning benchmarks. In this paper, we propose a contextually ensembled translation framework with human-in-the-loop validation that leverages multiple translation systems to develop Urdu reasoning benchmarks while preserving contextual and structural integrity. Using this framework, we translate widely adopted reasoning and question-answering benchmarks, including MGSM, MATH-500, CommonSenseQA, and OpenBookQA, into Urdu, collectively referred to as UrduBench, and conduct a comprehensive evaluation of both reasoning-oriented and instruction-tuned LLMs across multiple prompting strategies. Our analysis reveals performance differences across (1) four datasets, (2) five task difficulty levels, (3) diverse model architectures, (4) multiple model scaling settings, and (5) language consistency tests. We find that multi-step and symbolic reasoning tasks pose significant challenges in Urdu, and that stable language alignment is a critical prerequisite for robust reasoning. Overall, our work establishes a scalable methodology for standardized reasoning evaluation in Urdu and provides empirical insights into multilingual reasoning failures. This experimental setup is also broadly applicable to other low-resource languages.
The code and datasets will be publicly released.
[4] Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations
Amit Meghanani,Thomas Hain
Main category: cs.CL
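TL;DR: The paper examines how mean squared error (MSE) loss, when used to fine-tune front-end speech enhancement (SE) models against self-supervised (SSL) speech representations, can exploit positional embeddings. It considers two position-invariant fine-tuning strategies, zero-padding and speed perturbation with a soft dynamic time warping (soft-DTW) loss, and finds that the latter converges faster and gives better downstream performance.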
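As a rough illustration of why a soft-DTW loss is less tied to frame position than MSE, the sketch below implements the standard soft-DTW recursion (a smoothed minimum over the three dynamic-programming predecessors) on scalar sequences with a squared-difference cost. Real SSL representations are sequences of vectors and the paper's training setup is not reproduced here; the toy signals and the `gamma` value are assumptions.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    total = sum(math.exp(-(v - m) / gamma) for v in values)
    return m - gamma * math.log(total)


def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW discrepancy between two scalar sequences (squared-difference cost)."""
    inf = float("inf")
    n, m = len(x), len(y)
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + soft_min(
                (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]), gamma)
    return r[n][m]


def mse(x, y):
    """Frame-by-frame mean squared error; requires equal lengths."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# The same toy "content" shifted by one frame.
clean = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0]
shifted = [0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
print(mse(clean, shifted))                  # 2.0: position mismatch dominates
print(soft_dtw(clean, shifted, gamma=0.01))  # close to zero: content aligns
```

Because soft-DTW scores the best (softly weighted) alignment rather than a fixed frame-to-frame pairing, it also accepts sequences of different lengths, which is what makes it compatible with speed perturbation.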
Details
Motivation: When SE models are fine-tuned under the guidance of SSL representations, MSE loss tends to exploit positional embeddings rather than content, which biases optimization. The paper treats this as a general limitation of fine-tuning on self-supervised representations.
Method: Two position-invariant SE fine-tuning strategies are considered: (1) zero-padding, previously used in SSL pre-training and here applied to fine-tuning; and (2) speed perturbation combined with a soft-DTW loss, which aligns temporal structure and suppresses positional dependence.
Result: Compared with MSE, the soft-DTW approach converges markedly faster and achieves better downstream performance, supporting the value of position-invariant fine-tuning.
Conclusion: Position-invariant loss design, such as soft-DTW, is important for fine-tuning front-end SE models with SSL speech models: it avoids interference from positional embeddings and improves robustness and generalization.
Abstract: Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models is effective for downstream tasks in noisy conditions. SE models are commonly fine-tuned using SSL representations with mean squared error (MSE) loss between enhanced and clean speech. However, MSE is prone to exploiting positional embeddings in SSL models, allowing the objective to be minimised through positional correlations instead of content-related information. This work frames the problem as a general limitation of self-supervised representation fine-tuning and investigates it through representation-guided SE. Two strategies are considered: (1) zero-padding, previously explored in SSL pre-training but here examined in the fine-tuning setting, and (2) speed perturbations with a soft-DTW loss. Experiments show that the soft-DTW-based approach achieves faster convergence and improved downstream performance, underscoring the importance of position-invariant fine-tuning in SSL-based speech modelling.
[5] ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference
Ketan Thakkar,Maitreyi Chatterjee,Ramasubramanian Balasubramanian,Achyuthan Jootoo,Rajendra Ugrani
Main category: cs.CL
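TL;DR: The paper proposes ChunkWise LoRA, a dynamic, adaptive low-rank adaptation method that partitions sequences into chunks by token complexity and assigns each chunk its own low-rank configuration, reducing latency and memory while maintaining or improving model performance.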
Details
Motivation: Existing LoRA methods apply a single static rank configuration to all input tokens, ignoring differences in token complexity and computational requirements.
Method: ChunkWise LoRA partitions sequences into variable-length chunks based on token complexity. A runtime scheduler estimates difficulty, performs adaptive chunking, and selects each chunk's rank and scaling through a rank-ladder mechanism. A boundary-safe composition module and policy-driven KV-cache strategies are added to preserve output consistency.
Result: On benchmarks such as Wikitext-103 and SQuAD, latency is reduced by up to 34% and memory by 38% relative to baseline LoRA, while BLEU, EM, and perplexity are maintained or improved.
Conclusion: ChunkWise LoRA is an efficient parameter-efficient fine-tuning approach that is compatible with existing transformer architectures and inference frameworks and can be deployed directly.
Abstract: Recent advances in low-rank adaptation (LoRA) have enabled efficient fine-tuning of large language models (LLMs) with minimal additional parameters. However, existing LoRA methods apply static rank configurations uniformly across all input tokens, ignoring variation in token complexity and computational requirements. In this work, we propose ChunkWise LoRA, a dynamic and adaptive approach that partitions sequences into variable-length chunks based on token complexity and assigns each chunk a tailored low-rank configuration. Our system introduces a runtime scheduler that estimates token difficulty, performs adaptive chunking, and selects per-chunk LoRA rank and scaling using a rank-ladder mechanism. To preserve output consistency, we further introduce a boundary-safe composition module and integrate policy-driven KV-cache strategies. Experiments on benchmark datasets such as Wikitext-103 and SQuAD demonstrate that ChunkWise LoRA achieves up to 34% lower latency and 38% memory reduction compared to baseline LoRA, while maintaining or improving task performance metrics like BLEU, EM, and perplexity. The proposed framework remains fully compatible with existing transformer architectures and inference frameworks, providing a practical solution for real-world deployment of parameter-efficient LLMs.
[6] Multi-task Code LLMs: Data Mix or Model Merge?
Mingzhi Zhu,Boris Sobolev,Rahul Krishna,Raju Pavuluri,Stacy Patterson,Michele Merler
Main category: cs.CL
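TL;DR: The paper compares data mixing and model merging for building small multi-task code LLMs. It finds that model merging works better at larger scale and data mixing at smaller scale, and introduces a weight analysis technique to understand how different tasks affect model parameters.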
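The two strategies being compared can be sketched in a few lines. The digest does not say which merging algorithm the paper uses, so the sketch assumes the simplest one, linear interpolation of two task-specific checkpoints, and represents checkpoints as plain dictionaries of float lists. All names and values are hypothetical.

```python
import random


def merge_linear(state_a, state_b, alpha=0.5):
    """Model merging (assumed variant): interpolate two fine-tuned checkpoints."""
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name in state_a:
        a, b = state_a[name], state_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged


def mix_datasets(generation_examples, summarization_examples, seed=0):
    """Data mixing: pool both tasks' examples and shuffle before one fine-tune."""
    mixed = list(generation_examples) + list(summarization_examples)
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy checkpoints fine-tuned separately on each task.
codegen_ckpt = {"layer0.weight": [1.0, 3.0]}
summary_ckpt = {"layer0.weight": [3.0, 1.0]}
print(merge_linear(codegen_ckpt, summary_ckpt))  # {'layer0.weight': [2.0, 2.0]}

# Toy training pools for a single mixed fine-tuning run.
print(len(mix_datasets(["gen-1", "gen-2"], ["sum-1"])))  # 3
```

The practical difference is where the cost sits: merging needs one fine-tuning run per task but no joint training, whereas mixing needs a single run over the combined data.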
Details
Motivation: Recent work advocates deploying smaller, specialized code LLMs in agentic frameworks to balance performance, constraints, and cost, which calls for efficient multi-task learning strategies.
Method: The authors fine-tune models from two families, Qwen Coder and DeepSeek Coder, at 2B and 7B parameters on code generation and code summarization, compare data mixing with model merging as multi-task strategies, and introduce a weight analysis technique to study how tasks affect model parameters.
Result: Model merging gives the best overall performance at larger scale, retaining 96% of code generation capability while maintaining summarization. On Qwen Coder 2.5 7B it reaches 92.7% Pass@1 on HumanEval, above the task-specific fine-tuned model (90.9%). At smaller scale, data mixing is better.
Conclusion: Carefully designed merging and mixing strategies can combine task-specific capabilities without significant performance loss, which suits resource-constrained deployment.
Abstract: Recent research advocates deploying smaller, specialized code LLMs in agentic frameworks alongside frontier models, sparking interest in efficient strategies for multi-task learning that balance performance, constraints, and costs. We compare two approaches for creating small, multi-task code LLMs: data mixing versus model merging. We conduct extensive experiments across two model families (Qwen Coder and DeepSeek Coder) at two scales (2B and 7B parameters), fine-tuning them for code generation and code summarization tasks. Our evaluation on HumanEval, MBPP, and CodeXGlue benchmarks reveals that model merging achieves the best overall performance at larger scale across model families, retaining 96% of specialized model performance on code generation tasks while maintaining summarization capabilities. Notably, merged models can even surpass individually fine-tuned models, with our best configuration of Qwen Coder 2.5 7B model achieving 92.7% Pass@1 on HumanEval compared to 90.9% for its task-specific fine-tuned equivalent. At smaller scale, we instead find data mixing to be the preferred strategy. We further introduce a weight analysis technique to understand how different tasks affect model parameters and their implications for merging strategies. The results suggest that careful merging and mixing strategies can effectively combine task-specific capabilities without significant performance degradation, making them ideal for resource-constrained deployment scenarios.
[7] Large Language Models Naively Recover Ethnicity from Individual Records
Noah Dasanaike
Main category: cs.CL
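TL;DR: The paper shows that large language models (LLMs) can infer ethnicity from names alone with accuracy exceeding the traditional BISG method, without additional training data and across multiple national contexts, and validates the approach and its reduced bias on a range of real-world datasets.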
Details
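Motivation: Traditional ethnicity inference methods such as BISG rely on US surname geocoding. They are geographically limited, use rigid categories, and systematically misclassify minorities in wealthier areas, so a more general, accurate, and fair cross-cultural method is needed.
Method: Several mainstream proprietary and open-source LLMs (such as Gemini 3 Flash, GPT-4o, DeepSeek v3.2, and GLM-4.7) are given names directly, optionally with metadata such as party registration or with extended reasoning enabled, and asked to classify ethnicity, religion, or caste. The approach is validated externally on real voter registrations, lists of legislators, and land records from several countries, and small fine-tuned models are explored for low-cost local deployment.
Result: On balanced samples from Florida and North Carolina, LLM classification reaches 84.7% accuracy, well above BISG's 68.2%; adding party information gives 86.7%; extended reasoning adds 1-3 percentage points. Accuracy is 64.3% for religious sect in Lebanon, 99.2% for Indian MPs from reserved constituencies, and 74.0% for caste in Indian land records. Validation on voter data from six countries shows that the method recovers known population distributions, and small fine-tuned models also exceed BISG while allowing local deployment at no cost.
Conclusion: LLMs can serve as a powerful, general, training-free tool for ethnicity inference that overcomes BISG's geographic and structural limits, reduces income-related bias, and is applicable worldwide.
Abstract: I demonstrate that large language models can infer ethnicity from names with accuracy exceeding that of Bayesian Improved Surname Geocoding (BISG) without additional training data, enabling inference outside the United States and to contextually appropriate classification categories. Using stratified samples from Florida and North Carolina voter files with self-reported race, LLM-based classification achieves up to 84.7% accuracy, outperforming BISG (68.2%) on balanced samples. I test six models including Gemini 3 Flash, GPT-4o, and open-source alternatives such as DeepSeek v3.2 and GLM-4.7. Enabling extended reasoning can improve accuracy by 1-3 percentage points, though effects vary across contexts; including metadata such as party registration reaches 86.7%. LLM classification also reduces the income bias inherent in BISG, where minorities in wealthier neighborhoods are systematically misclassified as White. I further validate using Lebanese voter registration with religious sect (64.3% accuracy), Indian MPs from reserved constituencies (99.2%), and Indian land records with caste classification (74.0%). Aggregate validation across India, Uganda, Nepal, Armenia, Chile, and Costa Rica using original full-count voter rolls demonstrates that the method recovers known population distributions where naming conventions are distinctive.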
For large-scale applications, small transformer models fine-tuned on LLM labels exceed BISG accuracy while enabling local deployment at no cost.
[8] EnsembleLink: Accurate Record Linkage Without Training Data
Noah Dasanaike
Main category: cs.CL
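TL;DR: The paper proposes EnsembleLink, which uses pre-trained language models to achieve high-accuracy record linkage without labeled data, improving the accuracy and efficiency of entity matching in social science research.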
Details
Motivation: Record linkage is essential to empirical social science, but existing methods either have low accuracy or need large amounts of labeled data, and researchers often ignore the uncertainty that linkage errors introduce into downstream analyses.
Method: EnsembleLink uses pre-trained language models to capture semantic relationships (such as place-name hierarchies and party aliases) and combines them in an ensemble to match records accurately without supervision.
Result: On benchmarks covering city names, person names, organizations, multilingual party names, and bibliographic records, EnsembleLink matches or exceeds methods that require extensive labeling. It runs on local open-source models without API calls and completes typical tasks in minutes.
Conclusion: EnsembleLink offers social science research a high-accuracy, label-free, efficient, and locally deployable approach to record linkage, moving the field from ad hoc preprocessing toward rigorous methodology.
Abstract: Record linkage, the process of matching records that refer to the same entity across datasets, is essential to empirical social science but remains methodologically underdeveloped. Researchers treat it as a preprocessing step, applying ad hoc rules without quantifying the uncertainty that linkage errors introduce into downstream analyses. Existing methods either achieve low accuracy or require substantial labeled training data. I present EnsembleLink, a method that achieves high accuracy without any training labels. EnsembleLink leverages pre-trained language models that have learned semantic relationships (e.g., that "South Ozone Park" is a neighborhood in "New York City" or that "Lutte ouvriere" refers to the Trotskyist "Workers' Struggle" party) from large text corpora. On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling. The method runs locally on open-source models, requiring no external API calls, and completes typical linkage tasks in minutes.
[9] Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space
Tobias Materzok
Main category: cs.CL
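TL;DR: The paper proposes Output-Space Search (OS-Search), which recasts LLM generation as a search for target points in a 3D output space Z defined by a frozen encoder. A retrieval-grounded policy trained with sequence-level reinforcement learning generates outputs that land near a target z* in Z. The method supports parallel sweeps and black-box optimization, increasing diversity in text generation and improving optimization of hidden objectives in code generation.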
Details
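Motivation: Conventional LLM generation relies on path-dependent search over tokens or programs, which is inefficient and hard to optimize globally. The paper aims to remove this restriction with a more efficient and flexible output-space optimization paradigm.
Method: A 3D output space Z is defined by a frozen encoder. An outer loop selects a target point z*; an inner retrieval-grounded policy trained with sequence-level reinforcement learning generates outputs whose embedding coordinates land near z* under standard autoregressive decoding. Parallel sweeps over Z and black-box optimization such as Bayesian optimization are supported.
Result: For story generation, sweeping Z yields 3.1x higher LLM-scored diversity than prompt-chaining. For code generation, Bayesian optimization over Z improves an objective withheld from the controller under matched inference budgets while keeping the generated code valid.
Conclusion: OS-Search decouples generation from search: by operating directly in a semantic output space, it makes LLM generation more efficient, more controllable, and easier to optimize.
Abstract: We introduce Output-Space Search (OS-Search), which turns LLM generation into endpoint search. An outer loop selects a target z* in a frozen encoder-defined 3D output space Z, and a retrieval-grounded policy trained with sequence-level RL generates outputs whose coordinates land near z* under standard autoregressive decoding. This enables parallel sweeps and black-box optimization in Z without path-dependent token/program search. On stories, sweeping Z (text) yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z (code) improves an objective withheld from the controller under matched inference budgets while preserving validity.
[10] From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning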
Xiulin Yang,Heidi Getz,Ethan Gotlieb Wilcox
Main category: cs.CL
TL;DR: This paper investigates how statistical properties of function words—frequency, structural association, and boundary alignment—support hierarchical structure learning from linear input, using cross-linguistic corpus analysis and neural modeling.
Details
Motivation: To understand what statistical conditions enable learning of hierarchical (syntactic) structure from linear input, with focus on the role of function words in language acquisition.
Method: Cross-linguistic corpus analysis across 186 languages; counterfactual language modeling; ablation experiments; probing analyses to assess neural model reliance on function words.
Result: All three function-word properties (high frequency, reliable structural association, phrase-boundary alignment) are cross-linguistically robust; preserving them improves neural acquisition, with frequency and structural association having stronger effects than boundary alignment; different learning conditions yield distinct internal mechanisms despite similar performance.
Conclusion: Function words’ distributional properties collectively scaffold hierarchical learning, but their contributions are asymmetric and interact with learning architecture, highlighting that behavioral success may mask mechanistic diversity.
Abstract: What statistical conditions support learning hierarchical structure from linear input? In this paper, we address this question by focusing on the statistical distribution of function words. Function words have long been argued to play a crucial role in language acquisition due to their distinctive distributional properties, including high frequency, reliable association with syntactic structure, and alignment with phrase boundaries. We use cross-linguistic corpus analysis to first establish that all three properties are present across 186 studied languages. Next, we use a combination of counterfactual language modeling and ablation experiments to show that language variants preserving all three properties are more easily acquired by neural learners, with frequency and structural association contributing more strongly than boundary alignment.
Follow-up probing and ablation analyses further reveal that different learning conditions lead to systematically different reliance on function words, indicating that similar performance can arise from distinct internal mechanisms.
[11] Scaling Embeddings Outperforms Scaling Experts in Language Models
Hong Liu,Jiaqi Zhang,Chao Wang,Xing Hu,Linkun Lyu,Jiaqi Sun,Xurui Yang,Bo Wang,Fengcun Li,Yulei Qian,Lingtong Si,Yerui Sun,Rumei Li,Peng Pei,Yuchen Xie,Xunliang Cai
Main category: cs.CL
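TL;DR: The paper proposes scaling the embedding layer (embedding scaling) to improve sparse large language models, finds that under certain conditions it outperforms conventional expert scaling, and builds the LongCat-Flash-Lite model on this basis, with clear gains in inference speed and task performance.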
Details
Motivation: MoE architectures face diminishing returns and system bottlenecks in sparsity scaling, so a new, orthogonal scaling dimension is needed.
Method: The authors systematically analyze the Pareto frontiers of embedding scaling and expert scaling; identify the architectural factors that govern its effectiveness (such as parameter budgeting and model width and depth); combine system optimizations with speculative decoding to speed up inference; and train from scratch the 68.5B-parameter LongCat-Flash-Lite, in which more than 30B parameters are embeddings.
Result: LongCat-Flash-Lite surpasses parameter-equivalent MoE baselines on multiple benchmarks, especially agentic and coding tasks, and is competitive with or better than other models of comparable scale. Sparsity is converted into real inference speedups.
Conclusion: Embedding scaling is an efficient, orthogonal route to sparsity scaling. With suitable architectural design and system co-optimization, it can overcome current MoE bottlenecks.
Abstract: While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy, ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.
[12] Multilingual Dysarthric Speech Assessment Using Universal Phone Recognition and Language-Specific Phonemic Contrast Modeling
Eunjung Yeo,Julie M. Liss,Visar Berisha,David R. Mortensen
Main category: cs.CL
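TL;DR: The paper proposes a multilingual framework for assessing dysarthric speech intelligibility that combines universal phone recognition with language-specific phoneme interpretation. Using contrastive phonological feature distances and sequence alignment, it produces three metrics (PER, PFER, PhonCov) and validates their clinical relevance on four languages.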
Details
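Motivation: Dysarthria associated with neurological disorders is increasingly prevalent, creating a need for automated intelligibility assessment that works across languages. Existing methods are mostly limited to one language or ignore language-specific factors.
Method: The authors build a multilingual phoneme-production assessment framework that combines universal phone recognition with language-specific phoneme interpretation, uses contrastive phonological feature distances for phone-to-phoneme mapping, and adds sequence alignment. They define and compute PER, PFER, and a newly proposed alignment-free metric, PhonCov.
Result: Experiments on English, Spanish, Italian, and Tamil show that PER benefits from combining mapping and alignment, PFER from alignment alone, and PhonCov from mapping alone. The framework captures intelligibility degradation patterns in dysarthric speech that agree with clinical understanding.
Conclusion: The framework balances language universality with language specificity, and the proposed metrics are complementary and clinically interpretable, offering a new approach to cross-lingual dysarthria assessment.
Abstract: The growing prevalence of neurological disorders associated with dysarthria motivates the need for automated intelligibility assessment methods that are applicable across languages. However, most existing approaches are either limited to a single language or fail to capture language-specific factors shaping intelligibility. We present a multilingual phoneme-production assessment framework that integrates universal phone recognition with language-specific phoneme interpretation using contrastive phonological feature distances for phone-to-phoneme mapping and sequence alignment. The framework yields three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and a newly proposed alignment-free measure, phoneme coverage (PhonCov). Analyses on English, Spanish, Italian, and Tamil show that PER benefits from the combination of mapping and alignment, PFER from alignment alone, and PhonCov from mapping. Further analyses demonstrate that the proposed framework captures clinically meaningful patterns of intelligibility degradation consistent with established observations of dysarthric speech.
[13] Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models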
Zhaoyi Li,Jiatong Li,Gangwei Jiang,Linqi Song,Defu Lian,Ying Wei
Main category: cs.CL
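TL;DR: The paper finds that errors in chain-of-thought reasoning concentrate at specific token positions and are caused by certain attention heads (ep heads), and proposes dynamically identifying and deactivating these heads at test time to improve reasoning generalization.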
Details
Motivation: Chain-of-thought reasoning degrades sharply when the number of reasoning steps exceeds the training distribution, and the underlying mechanism is not well understood.
Method: Through a systematic study on tasks from multiple domains, the authors identify specific attention heads (ep heads) that cause errors, and propose a lightweight test-time correction method that dynamically identifies and deactivates these erroneous processing heads.
Result: Experiments across multiple tasks and LLMs show that the method consistently improves reasoning hop generalization.
Conclusion: Chain-of-thought failures arise from specific attention heads amplifying incorrect reasoning trajectories; correcting these heads at test time is an effective and promising intervention.
Abstract: Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.
[14] Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data
Christopher Adrian Kusuma,Muhammad Reza Qorib,Hwee Tou Ng
Main category: cs.CL
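TL;DR: The paper proposes a more robust benchmark for evaluating LLM honesty and, using the open model Pythia and its public pretraining data, designs a new method that improves an LLM's ability to respond "I don't know" to questions it cannot answer, reducing hallucination.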
Details
Motivation: LLMs often produce factual errors (hallucinations) because they cannot recognize their own knowledge boundary. Many methods aim to improve honesty, but their evaluations do not account for the knowledge the model absorbed during pretraining and are therefore not robust.
Method: The authors build a new honesty evaluation benchmark based on the open model Pythia and its public pretraining data, and propose a new method that uses the pretraining data to make the LLM more honest.
Result: The paper contributes a more robust benchmark dataset for LLM honesty and shows that the proposed method is effective at steering the model to answer "I don't know" on unknown questions.
Conclusion: Explicitly modeling and using pretraining knowledge can substantially improve LLM honesty, and evaluation grounded in publicly verifiable data better reflects a model's awareness of its knowledge boundary.
Abstract: Large language models (LLMs) are highly capable of answering questions, but they are often unaware of their own knowledge boundary, i.e., knowing what they know and what they don't know. As a result, they can generate factually incorrect responses on topics they do not have enough knowledge of, commonly known as hallucination. Rather than hallucinating, a language model should be more honest and respond with "I don't know" when it does not have enough knowledge about a topic. Many methods have been proposed to improve LLM honesty, but their evaluations lack robustness, as they do not take into account the knowledge that the LLM has ingested during its pretraining. In this paper, we propose a more robust evaluation benchmark dataset for LLM honesty by utilizing Pythia, a truly open LLM with publicly available pretraining data. In addition, we also propose a novel method for harnessing the pretraining data to build a more honest LLM.
[15] MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation
Tianyi Xu,Kosei Uemura,Alfred Malengo Kondoro,Tadesse Destaw Belay,Catherine Nana Nyaah Essuman,Ifeoma Okoh,Ganiyat Afolabi,Ayodele Awokoya,David Ifeoluwa Adelani
Main category: cs.CL
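TL;DR: The paper introduces MGSM-Pro, which extends MGSM with the GSM-Symbolic approach to produce five variants of each question for multilingual math reasoning evaluation. Low-resource languages and some proprietary models (such as Gemini 2.5 Flash and GPT-4.1) are not robust to changes in digits, whereas Claude 4.0 Sonnet and the open models GPT-OSS 120B and DeepSeek V3 are more robust. The authors recommend that future evaluations use at least five digit-varying instantiations.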
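The idea of evaluating on several instantiations of the same question can be sketched as a template with variable names and digits. The template, name list, and number range below are hypothetical and far simpler than the dataset's real items; the sketch only shows how each variant keeps the same reasoning structure while its surface details and gold answer change.

```python
import random

# Hypothetical symbolic template: the reasoning (a + b) stays fixed,
# while the name and the digits vary between instantiations.
TEMPLATE = ("{name} has {a} apples and buys {b} more. "
            "How many apples does {name} have now?")
NAMES = ["Amina", "Kofi", "Mei", "Diego", "Sara"]


def instantiate(template, names, rng, low=2, high=99):
    """Fill the template once and return (question, gold_answer)."""
    name = rng.choice(names)
    a = rng.randint(low, high)
    b = rng.randint(low, high)
    return template.format(name=name, a=a, b=b), a + b


def make_variants(template, names, n=5, seed=0):
    """Generate n instantiations of one question with a reproducible seed."""
    rng = random.Random(seed)
    return [instantiate(template, names, rng) for _ in range(n)]


def accuracy_per_variant(correct_flags):
    """Mean and worst-case accuracy over variants of the same problem set."""
    rates = [sum(flags) / len(flags) for flags in correct_flags]
    return sum(rates) / len(rates), min(rates)


for question, answer in make_variants(TEMPLATE, NAMES):
    print(answer, "|", question)
```

Reporting the worst-case accuracy alongside the mean is one simple way to surface the kind of variance across instantiations that a single fixed test set would hide.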
Details
Motivation: Multilingual math reasoning benchmarks lag behind English ones in difficulty and recency, and there is no systematic evaluation of how robust models are to different instantiations of a problem, especially changes in digits.
Method: The authors build MGSM-Pro on top of MGSM, generating five variants of each question by changing names, digits, and irrelevant context, and evaluate models in nine languages to analyze robustness to digit instantiation.
Result: Low-resource languages show large performance drops on digit variants. Gemini 2.5 Flash and GPT-4.1 are less robust, Claude 4.0 Sonnet is more robust, and the open models GPT-OSS 120B and DeepSeek V3 stand out.
Conclusion: Evaluating on a single digit instantiation tends to give optimistic results. The authors recommend at least five digit-varying instantiations for a more realistic and robust assessment of math reasoning.
Abstract: Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset using the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that some proprietary models, notably Gemini 2.5 Flash and GPT-4.1, are less robust to digit instantiation, whereas Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.
[16] SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models
Alok Abhishek,Tushar Bandopadhyay,Lisa Erickson
Main category: cs.CL
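TL;DR: The paper proposes SHARP, a framework for multidimensional, distribution-aware evaluation of social harm that emphasizes tail risk and cross-dimensional interactions. It shows that LLMs with similar average risk can differ substantially in tail exposure, and argues for moving beyond scalar mean scores.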
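Two of the quantities named in this entry can be illustrated numerically. The sketch assumes independent per-dimension failure probabilities, so that the union of failures is 1 - prod(1 - p_d) and the cumulative log-risk -sum(log(1 - p_d)) is additive across dimensions, and it uses a simple empirical estimator of CVaR (the mean of the worst 5% of per-prompt scores). The paper's exact parameterization and estimator may differ; the scores below are made up.

```python
import math


def cumulative_log_risk(failure_probs):
    """Additive log-risk: -sum(log(1 - p_d)) over harm dimensions."""
    return -sum(math.log1p(-p) for p in failure_probs)


def union_of_failures(failure_probs):
    """Probability that at least one dimension fails, assuming independence."""
    return -math.expm1(-cumulative_log_risk(failure_probs))


def cvar(scores, tail_percent=5):
    """Empirical CVaR: mean of the worst tail_percent% of scores (higher = worse)."""
    ordered = sorted(scores)
    k = max(1, -(-len(ordered) * tail_percent // 100))  # integer ceiling
    tail = ordered[-k:]
    return sum(tail) / len(tail)


# Two hypothetical models with (nearly) identical mean risk over 100 prompts.
model_a = [0.10] * 100
model_b = [0.08] * 95 + [0.48] * 5

for name, scores in (("A", model_a), ("B", model_b)):
    print(name, round(sum(scores) / len(scores), 3), round(cvar(scores), 3))

print(round(union_of_failures([0.1, 0.2]), 2))  # 0.28
```

In this toy case the two models are indistinguishable by their mean score, while the tail statistic separates them clearly, which is the motivation for reporting CVaR alongside averages.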
Details
Motivation: Existing benchmarks reduce complex social risk to mean scalar scores, which hides distributional structure, cross-dimensional interactions, and worst-case behavior, and makes it hard to address severe but rare failures in high-stakes settings.
Method: SHARP models social harm as a multivariate random variable, decomposes it explicitly into four dimensions (bias, fairness, ethics, and epistemic reliability), and reparameterizes a union-of-failures aggregation as additive cumulative log-risk. The primary metric is Conditional Value at Risk (CVaR95), supplemented by risk-sensitive distributional statistics.
Result: Evaluating 11 frontier LLMs on 901 sensitive prompts shows that models with similar average risk can differ by more than a factor of two in tail exposure and volatility. Tail severity is systematically stratified by dimension: strongest for bias, intermediate for epistemic and fairness risks, and lowest for ethics.
Conclusion: Responsible evaluation and governance of LLMs should move toward multidimensional, tail-sensitive risk profiles; scalar mean scores conflate heterogeneous, model-dependent failure structures.
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where rare but severe failures can result in irreversible harm. However, prevailing evaluation benchmarks often reduce complex social risk to mean-centered scalar scores, thereby obscuring distributional structure, cross-dimensional interactions, and worst-case behavior. This paper introduces Social Harm Analysis via Risk Profiles (SHARP), a framework for multidimensional, distribution-aware evaluation of social harm. SHARP models harm as a multivariate random variable and integrates explicit decomposition into bias, fairness, ethics, and epistemic reliability with a union-of-failures aggregation reparameterized as additive cumulative log-risk. The framework further employs risk-sensitive distributional statistics, with Conditional Value at Risk (CVaR95) as a primary metric, to characterize worst-case model behavior. Application of SHARP to eleven frontier LLMs, evaluated on a fixed corpus of n=901 socially sensitive prompts, reveals that models with similar average risk can exhibit more than twofold differences in tail exposure and volatility. Across models, dimension-wise marginal tail behavior varies systematically across harm dimensions, with bias exhibiting the strongest tail severities, epistemic and fairness risks occupying intermediate regimes, and ethical misalignment consistently lower; together, these patterns reveal heterogeneous, model-dependent failure structures that scalar benchmarks conflate.
These findings indicate that responsible evaluation and governance of LLMs require moving beyond scalar averages toward multidimensional, tail-sensitive risk profiling.
[17] MoCo: A One-Stop Shop for Model Collaboration Research
Shangbin Feng,Yuyang Bai,Ziyuan Yang,Yike Wang,Zhaoxuan Tan,Jiajie Yan,Zhenyu Lei,Wenxuan Ding,Weijia Shi,Haojin Wang,Zhenting Qi,Yuru Jiang,Heng Wang,Chengsong Huang,Yu Fei,Jihan Yao,Yilun Du,Luke Zettlemoyer,Yejin Choi,Yulia Tsvetkov
Main category: cs.CL
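TL;DR: The paper introduces MoCo, a Python library for executing, benchmarking, and comparing model collaboration algorithms at scale, intended to advance research on collaboration among multiple language models.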
Details
Motivation: Existing research on model collaboration is scattered and lacks systematic comparison, so a unified framework is needed to consolidate and advance the field.
Method: The authors build the open-source MoCo library, which integrates 26 model collaboration methods and 25 evaluation datasets and supports flexible data integration and multi-dimensional evaluation.
Result: Experiments show that most collaboration strategies outperform single models in 61.0% of (model, data) settings, with the best methods improving by up to 25.8%. The paper also analyzes the scaling and efficiency of collaboration strategies and their ability to solve problems that single models struggle with.
Conclusion: MoCo provides a standardized toolkit for model collaboration research and may help advance open, modular, decentralized, and collaborative AI.
Abstract: Advancing beyond single monolithic language models (LMs), recent research increasingly recognizes the importance of model collaboration, where multiple LMs collaborate, compose, and complement each other. Existing research on this topic has mostly been disparate and disconnected, from different research communities, and lacks rigorous comparison. To consolidate existing research and establish model collaboration as a school of thought, we present MoCo: a one-stop Python library for executing, benchmarking, and comparing model collaboration algorithms at scale. MoCo features 26 model collaboration methods, spanning diverse levels of cross-model information exchange such as routing, text, logit, and model parameters. MoCo integrates 25 evaluation datasets spanning reasoning, QA, code, safety, and more, while users can flexibly bring their own data. Extensive experiments with MoCo demonstrate that most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. We further analyze the scaling of model collaboration strategies, the training/inference efficiency of diverse methods, highlight that the collaborative system solves problems where single LMs struggle, and discuss future work in model collaboration, all made possible by MoCo. We envision MoCo as a valuable toolkit to facilitate and turbocharge the quest for an open, modular, decentralized, and collaborative AI future.
[18] CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
Jiahao Huo,Yu Huang,Yibo Yan,Ye Pan,Yi Cao,Mingdong Ou,Philip S. Yu,Xuming Hu
Main category: cs.CL
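TL;DR: This paper proposes CausalEmbed, which builds multi-vector embeddings through auto-regressive generation, cutting the number of visual tokens by 30-155x while maintaining strong performance and improving the practicality and scalability of VDR.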
Details
Motivation: Existing MLLMs perform well in visual document retrieval, but representing a single page with thousands of visual tokens incurs a large storage overhead that limits practical use. Method: The paper proposes CausalEmbed, an auto-regressive generation approach, and introduces an iterative margin loss into contrastive training to push the model toward compact, well-structured multi-vector embeddings. Result: VDR can be performed efficiently with only dozens of visual tokens, a 30-155x reduction in token count, while remaining highly competitive across multiple backbones and benchmarks; theoretical analysis and experiments support its advantages in training efficiency and test-time scalability. Conclusion: CausalEmbed enables a flexible test-time scaling strategy and moves multimodal document retrieval toward a generative paradigm. Abstract: Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval.
[19] Qwen3-ASR Technical Report
Xian Shi,Xiong Wang,Zhifang Guo,Yongqi Wang,Pei Zhang,Xinyu Zhang,Zishan Guo,Hongkun Hao,Yu Xi,Baosong Yang,Jin Xu,Jingren Zhou,Junyang Lin
Main category: cs.CL
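TL;DR: This report introduces the Qwen3-ASR family of speech recognition models: two end-to-end ASR models (1.7B and 0.6B) supporting 52 languages and dialects, and an LLM-based non-autoregressive forced alignment model (0.6B), all built on the Qwen3-Omni foundation model. The 1.7B model reaches SOTA among open-source ASR models, the 0.6B model balances high accuracy with high efficiency (TTFT as low as 92ms), and the alignment model outperforms the strongest existing methods in both accuracy and efficiency on 11 languages. All models are released under the Apache 2.0 license.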
Details
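Motivation: Open-source ASR models score similarly on standard benchmarks yet differ markedly in real-world quality; the work addresses this gap while improving multilingual support, inference efficiency, and forced alignment, and supporting community research. Method: Building on the audio understanding ability of Qwen3-Omni, the authors develop two all-in-one ASR models, Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, which support language identification and ASR for 52 languages and dialects; they propose Qwen3-ForcedAligner-0.6B, an LLM-based non-autoregressive timestamp predictor that aligns text-speech pairs in 11 languages; the models are trained on large-scale speech data and evaluated with comprehensive internal benchmarks to compensate for the limits of public ones. Result: Qwen3-ASR-1.7B reaches SOTA among open-source ASR models and is competitive with the strongest proprietary APIs; Qwen3-ASR-0.6B achieves an average TTFT as low as 92ms and a throughput of 2000 seconds of speech per second at a concurrency of 128; Qwen3-ForcedAligner-0.6B surpasses the three strongest existing models in both accuracy and efficiency on forced alignment across 11 languages. Conclusion: The Qwen3-ASR family makes notable advances in performance, efficiency, and multilingual support; the 0.6B version in particular offers an excellent accuracy-efficiency balance, and open-sourcing all models should benefit research and applications in speech recognition and audio understanding. Abstract: In this report, we introduce Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both of them leverage large-scale speech training data and the strong audio understanding ability of their foundation model Qwen3-Omni. We conduct comprehensive internal evaluation besides the open-sourced benchmarks as ASR models might differ little on open-sourced benchmark scores but exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-sourced ASR models and is competitive with the strongest proprietary APIs while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B can achieve an average TTFT as low as 92ms and transcribe 2000 seconds speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM based NAR timestamp predictor that is able to align text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest force alignment models and takes more advantages in efficiency and versatility.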
To further accelerate the community research of ASR and audio understanding, we release these models under the Apache 2.0 license.
[20] Self-Improving Pretraining: using post-trained models to pretrain better models
Ellen Xiaoqing Tan,Shehzaad Dhuliawala,Jing Xu,Ping Yu,Sainbayar Sukhbaatar,Jason Weston,Olga Golovneva
Main category: cs.CL
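TL;DR: This paper proposes a method that introduces reinforcement learning into pretraining: it streams documents and evaluates multiple candidate generations (model rollouts, the original suffix, and a rewritten suffix), which a strong judge model scores for safety, factuality, and quality, improving LLM reliability at the source. Experiments show the method substantially outperforms standard pretraining on factuality, safety, and generation quality.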
Details
Motivation: Existing approaches rely on expensive data collection and multi-stage fine-tuning and alignment, and still struggle to remove unsafe or hallucinated behaviors learned during pretraining; safety, factuality, and quality objectives therefore need to be addressed directly during pretraining. Method: An RL-based pretraining framework over streamed documents: at each step, several candidates for the next K tokens (rollouts, the original suffix, and a rewritten suffix) are scored by a strong judge model to produce RL rewards; early in training the process leans on the original and rewritten suffixes, and later it shifts to rewarding high-quality rollouts. Result: Relative to standard pretraining, factuality improves by 36.2%, safety by 18.5%, and the generation-quality win rate by up to 86.3%. Conclusion: Embedding RL-driven quality and safety optimization in pretraining builds high-quality, safe, and trustworthy LLMs more fundamentally than paradigms that rely only on post-training alignment. Abstract: Ensuring safety, factuality and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee the correction of patterns learned during pretraining. Therefore, addressing these issues during pretraining is crucial, as it shapes a model's core behaviors and prevents unsafe or hallucinated outputs from becoming deeply embedded. To tackle this issue, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations -- including model rollouts, the original suffix, and a rewritten suffix -- for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher quality, safer, and more factual models from the ground up. In experiments, our method gives 36.2% and 18.5% relative improvements over standard pretraining in terms of factuality and safety, and up to 86.3% win rate improvements in overall generation quality.
[21] The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation
Devanshu Sahoo,Manish Prasad,Vasudev Majhi,Arjun Neekhra,Yash Sinha,Murari Mandal,Vinay Chamola,Dhruv Kumar
Main category: cs.CL
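TL;DR: This paper exposes a "Compliance Paradox" that arises when large language models (LLMs) used in educational assessment over-emphasize instruction following: the models are vulnerable to Semantic-Preserving Adversarial Code Injection (SPACI) attacks, leading to "False Certification" of incorrect code. It proposes an AST-aware injection protocol and a three-dimensional evaluation framework, and calls for a new alignment paradigm oriented toward adjudicative robustness.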
Details
Motivation: Using LLMs for automated grading rests on the unverified assumption that instruction-following ability is roughly equivalent to the ability to judge; this ignores that a model may detach from the code's logic to satisfy hidden directives, which poses a serious assessment risk. Method: The paper proposes the SPACI adversarial injection framework and the AST-Aware Semantic Injection Protocol (AST-ASIP), which embed hidden directives in syntactically inert regions (trivia nodes) of the abstract syntax tree to exploit the Syntax-Semantics Gap; 9 SOTA models are evaluated on 25,000 code submissions in Python, C, C++, and Java. Result: High-capacity open-weights models such as DeepSeek-V3 fail in more than 95% of cases, systematically prioritizing hidden formatting constraints over code correctness; the "False Certification" phenomenon is quantified with three metrics: Decoupling Probability, Score Divergence, and Pedagogical Severity. Conclusion: Current RLHF-based alignment introduces a "Trojan" vulnerability in automated grading; a shift toward domain-specific Adjudicative Robustness training is needed, so that models prioritize evidence over instruction compliance. Abstract: The rapid integration of Large Language Models (LLMs) into educational assessment rests on the unverified assumption that instruction following capability translates directly to objective adjudication. We demonstrate that this assumption is fundamentally flawed. Instead of evaluating code quality, models frequently decouple from the submission's logic to satisfy hidden directives, a systemic vulnerability we term the Compliance Paradox, where models fine-tuned for extreme helpfulness are vulnerable to adversarial manipulation. To expose this, we introduce the Semantic-Preserving Adversarial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP). These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree. Through a large-scale evaluation of 9 SOTA models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (>95%) in high-capacity open-weights models like DeepSeek-V3, which systematically prioritize hidden formatting constraints over code correctness. We quantify this failure using our novel tripartite framework measuring Decoupling Probability, Score Divergence, and Pedagogical Severity to demonstrate the widespread "False Certification" of functionally broken code.
Our findings suggest that current alignment paradigms create a "Trojan" vulnerability in automated grading, necessitating a shift from standard RLHF toward domain-specific Adjudicative Robustness, where models are conditioned to prioritize evidence over instruction compliance. We release our complete dataset and injection framework to facilitate further research on the topic.
[22] User-Centric Evidence Ranking for Attribution and Fact Verification
Guy Alt,Eran Hirsch,Serwar Basch,Ido Dagan,Oren Glickman
Main category: cs.CL
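TL;DR: This paper proposes Evidence Ranking, a new task that prioritizes presenting sufficient, non-redundant evidence to reduce user reading effort and make fact verification more efficient. It compares one-shot and incremental ranking, builds a unified benchmark and a new evaluation framework, and finds that incremental ranking and LLMs perform better; a user study shows the approach both lowers reading cost and improves verification accuracy.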
Details
Motivation: Existing automated systems and LLMs often supply insufficient or overly redundant evidence for fact verification, making verification inefficient and error-prone; a new paradigm is needed that balances sufficiency, non-redundancy, and usability. Method: The paper defines the Evidence Ranking task, whose goal is to surface sufficient evidence as early as possible in a ranked list; designs one-shot and incremental ranking approaches; builds a unified benchmark by aggregating several fact verification datasets; introduces a new evaluation framework inspired by information retrieval; and runs model experiments and a controlled user study. Result: Incremental ranking strategies capture complementary evidence better; LLM-based methods clearly outperform shallower baselines but still struggle to balance sufficiency and redundancy; the user study indicates that, compared with conventional evidence selection, evidence ranking reduces reading effort by more than 30% and improves verification accuracy. Conclusion: Evidence ranking is a more user-aligned, interpretable, and efficient paradigm for fact verification and offers a methodological foundation for building trustworthy NLP systems. Abstract: Attribution and fact verification are critical challenges in natural language processing for assessing information reliability. While automated systems and Large Language Models (LLMs) aim to retrieve and select concise evidence to support or refute claims, they often present users with either insufficient or overly redundant information, leading to inefficient and error-prone verification. To address this, we propose Evidence Ranking, a novel task that prioritizes presenting sufficient information as early as possible in a ranked list. This minimizes user reading effort while still making all available evidence accessible for sequential verification. We compare two approaches for the new ranking task: one-shot ranking and incremental ranking. We introduce a new evaluation framework, inspired by information retrieval metrics, and construct a unified benchmark by aggregating existing fact verification datasets. Extensive experiments with diverse models show that incremental ranking strategies better capture complementary evidence and that LLM-based methods outperform shallower baselines, while still facing challenges in balancing sufficiency and redundancy. Compared to evidence selection, we conduct a controlled user study and demonstrate that evidence ranking both reduces reading effort and improves verification. This work provides a foundational step toward more interpretable, efficient, and user-aligned information verification systems.
[23] Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation
Yuan Sui,Bryan Hooi
Main category: cs.CL
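TL;DR: This paper proposes CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. It measures the quality of a critique by whether it helps improve a solution, which allows generation and judging abilities to be optimized jointly without external judges or ground-truth labels.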
Details
Motivation: Non-verifiable tasks (e.g., creative writing, dialogue, ethical reasoning) lack ground-truth labels, and existing LLM-as-Judge methods are limited by the judge's own quality and evaluation biases, so meta-evaluation is needed to improve the judge itself. Method: In CoNL, multiple agents sharing one policy engage in structured self-play: they propose solutions, critique each other, and revise; critiques that lead to improvements earn a diagnostic reward, enabling joint self-optimization of generation and judging. Result: On five benchmarks, CoNL consistently outperforms self-rewarding baselines while training stably. Conclusion: Through a learnable meta-evaluation mechanism, CoNL effectively decouples and jointly improves generation and judging abilities, offering a new paradigm for unsupervised, high-quality language modeling. Abstract: Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
[24] SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models
Lei Yang,Wei Bi,Chenxi Sun,Renren Jin,Deyi Xiong
Main category: cs.CL
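TL;DR: This paper proposes the SOUP framework, which unifies on-policy and off-policy learning at the level of a single sample and uses token-level importance ratios to exploit off-policy data, improving exploration and stability and outperforming existing methods.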
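Illustrative sketch (not the authors' code): one possible reading of the token-level importance ratios mentioned in this entry. It assumes that per-token log-probabilities are available under both the current policy and the historical policy; the function name `token_importance_ratios`, the argument `prefix_len`, and the plain unclipped ratio are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def token_importance_ratios(
    logp_current: Sequence[float],
    logp_historical: Sequence[float],
    prefix_len: int,
) -> List[float]:
    """Per-token weights for one mixed sample (hypothetical helper).

    Tokens before prefix_len are assumed to come from a historical policy,
    so they get the ratio pi_current / pi_historical. Later tokens are
    assumed to be sampled from the current policy, so their ratio is 1.
    """
    if len(logp_current) != len(logp_historical):
        raise ValueError("log-probability sequences must have equal length")
    if not 0 <= prefix_len <= len(logp_current):
        raise ValueError("prefix_len must lie within the sequence")
    ratios = []
    for t, (lp_cur, lp_old) in enumerate(zip(logp_current, logp_historical)):
        if t < prefix_len:
            ratios.append(math.exp(lp_cur - lp_old))  # off-policy prefix token
        else:
            ratios.append(1.0)  # on-policy continuation token
    return ratios
```

In a policy-gradient update, such ratios would typically multiply the per-token loss terms, so that only the off-policy prefix is reweighted.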
Details
Motivation: On-policy RL methods used for language model post-training (e.g., GRPO) suffer from limited exploration and early saturation, while mixing in entire off-policy trajectories causes policy mismatch and unstable training. Method: SOUP (Single-sample Mix-policy Unified Paradigm): for each generated sequence, only the prefix is sampled from historical policies (the off-policy part), the remaining tokens are generated by the current policy (the on-policy part), and token-level importance ratios reweight the gradients. Result: Experiments show that SOUP consistently outperforms standard on-policy training and existing off-policy extensions across multiple tasks; analysis confirms that it improves both exploration and final performance. Conclusion: With a fine-grained, single-sample mix-policy paradigm, SOUP balances the use of off-policy data against training stability and offers a new direction for LLM reinforcement learning. Abstract: On-policy reinforcement learning (RL) methods widely used for language model post-training, like Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the Single-sample Mix-policy Unified Paradigm (SOUP), a framework that unifies off- and on-policy learning within individual samples at the token level. It confines off-policy influence to the prefix of a generated sequence sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP effectively leverages off-policy information while preserving training stability. Extensive experiments demonstrate that SOUP consistently outperforms standard on-policy training and existing off-policy extensions. Our further analysis clarifies how our fine-grained, single-sample mix-policy training can improve both exploration and final performance in LLM RL.
[25] DimStance: Multilingual Datasets for Dimensional Stance Analysis
Jonas Becker,Liang-Chih Yu,Shamsuddeen Hassan Muhammad,Jan Philip Wahle,Terry Ruas,Idris Abdulmumin,Lung-Hao Lee,Wen-Ni Liu,Tzu-Mi Lin,Zhe-Yu Xu,Ying-Lung Lin,Jin Wang,Maryam Ibrahim Mukhtar,Bela Gipp,Saif M. Mohammed
Main category: cs.CL
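TL;DR: This paper proposes a fine-grained approach to stance analysis based on the valence-arousal (VA) dimensions from affective science, builds DimStance, the first multilingual, multi-domain dimensional stance resource, formulates the dimensional stance regression task, and evaluates a range of pretrained and large language models on it.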
Details
Motivation: Conventional stance detection provides only discrete category labels (e.g., favor/neutral/against) and cannot capture the subtle affective states behind stance expressions; this work brings in an established affective science framework (the two-dimensional valence-arousal model) for finer-grained, quantifiable stance modeling. Method: The authors build DimStance, a multilingual, multi-domain annotated dimensional stance resource (5 languages, 2 domains, 11,746 target aspects); define the dimensional stance regression task; benchmark a range of pretrained and large language models under regression and prompting settings; and analyze cross-lingual VA patterns. Result: Fine-tuned LLM regressors are competitive; low-resource languages (e.g., Nigerian Pidgin and Swahili) remain clearly challenging; token-based generation has inherent limitations for VA prediction. Conclusion: DimStance provides a new resource and benchmark for multilingual, emotion-aware stance analysis and moves stance understanding from discrete classification toward continuous, interpretable dimensional modeling. Abstract: Stance detection is an established task that classifies an author's attitude toward a specific target into categories such as Favor, Neutral, and Against. Beyond categorical stance labels, we leverage a long-established affective science framework to model stance along real-valued dimensions of valence (negative-positive) and arousal (calm-active). This dimensional approach captures nuanced affective states underlying stance expressions, enabling fine-grained stance analysis. To this end, we introduce DimStance, the first dimensional stance resource with valence-arousal (VA) annotations. This resource comprises 11,746 target aspects in 7,365 texts across five languages (English, German, Chinese, Nigerian Pidgin, and Swahili) and two domains (politics and environmental protection). To facilitate the evaluation of stance VA prediction, we formulate the dimensional stance regression task, analyze cross-lingual VA patterns, and benchmark pretrained and large language models under regression and prompting settings. Results show competitive performance of fine-tuned LLM regressors, persistent challenges in low-resource languages, and limitations of token-based generation. DimStance provides a foundation for multilingual, emotion-aware, stance analysis and benchmarking.
[26] MURAD: A Large-Scale Multi-Domain Unified Reverse Arabic Dictionary Dataset
Serry Sibaee,Yasser Alhabashi,Nadia Sibai,Yara Farouk,Adel Ammar,Sawsan AlHalawani,Wadii Boulila
Main category: cs.CL
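TL;DR: This paper introduces MURAD, a multi-domain Arabic reverse dictionary dataset with 96,243 word-definition pairs covering linguistics, Islamic studies, mathematics, and other disciplines, intended to advance Arabic natural language processing and lexical semantics research.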
Details
Motivation: Arabic is rich and diverse, but it lacks large-scale, high-precision lexical datasets linking words to definitions, which limits computational linguistics and lexicographic research. Method: The authors build a hybrid extraction pipeline that combines direct text parsing, optical character recognition (OCR), and automated reconstruction to extract and standardize Arabic words and their definitions from authoritative reference works and educational materials, with domain metadata attached. Result: The open dataset MURAD is released with 96,243 word-definition pairs across multiple academic domains, supporting reverse dictionary modeling, semantic retrieval, and educational applications. Conclusion: MURAD fills a gap in high-quality Arabic lexical resources and provides an important basis for Arabic NLP and reproducible lexical semantics research. Abstract: Arabic is a linguistically and culturally rich language with a vast vocabulary that spans scientific, religious, and literary domains. Yet, large-scale lexical datasets linking Arabic words to precise definitions remain limited. We present MURAD (Multi-domain Unified Reverse Arabic Dictionary), an open lexical dataset with 96,243 word-definition pairs. The data come from trusted reference works and educational sources. Extraction used a hybrid pipeline integrating direct text parsing, optical character recognition, and automated reconstruction. This ensures accuracy and clarity. Each record aligns a target word with its standardized Arabic definition and metadata that identifies the source domain. The dataset covers terms from linguistics, Islamic studies, mathematics, physics, psychology, and engineering. It supports computational linguistics and lexicographic research. Applications include reverse dictionary modeling, semantic retrieval, and educational tools. By releasing this resource, we aim to advance Arabic natural language processing and promote reproducible research on Arabic lexical semantics.
[27] LMK > CLS: Landmark Pooling for Dense Embeddings
Meet Doshi,Aashka Trivedi,Vishwajeet Kumar,Parul Awasthy,Yulong Li,Jaydeep Sen,Radu Florian,Sachindra Joshi
Main category: cs.CL
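TL;DR: This paper proposes Landmark (LMK) pooling, a new sequence pooling method that partitions a sequence into chunks, inserts landmark tokens between chunks, and mean-pools the landmark token embeddings to form the final representation, improving long-context extrapolation without sacrificing salient local features.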
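Illustrative sketch (not the authors' code) of the pooling step described in this entry: chunk the token sequence, add a landmark token per chunk, and mean-pool only the landmark embeddings. The token string `[LMK]`, the choice to place a landmark after every chunk including the last, and the helper names are assumptions; in practice the embeddings would come from an encoder run on the augmented sequence.

```python
from typing import List, Sequence

LMK = "[LMK]"  # hypothetical landmark token string


def insert_landmarks(tokens: Sequence[str], chunk_size: int) -> List[str]:
    """Split tokens into chunks of chunk_size and add a landmark after each chunk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    out: List[str] = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(tokens[start:start + chunk_size])
        out.append(LMK)
    return out


def landmark_pool(tokens: Sequence[str], embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool the embeddings found at landmark positions only."""
    picked = [emb for tok, emb in zip(tokens, embeddings) if tok == LMK]
    if not picked:
        raise ValueError("no landmark tokens found")
    dim = len(picked[0])
    return [sum(vec[i] for vec in picked) / len(picked) for i in range(dim)]
```

For comparison, plain mean pooling would average over every position, whereas this sketch ignores the non-landmark positions when forming the final vector.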
Details
Motivation: Mainstream pooling strategies (such as [CLS] or mean pooling) have systematic weaknesses: [CLS] tends to concentrate information from the start of the sequence and under-represents distributed evidence, while mean pooling can dilute salient local signals and hurt short-context performance. Method: Landmark (LMK) pooling partitions a variable-length sequence into chunks, inserts learnable landmark tokens between chunks, and mean-pools the embeddings of all landmark tokens to obtain the overall representation. Result: LMK pooling matches existing methods on short-context retrieval tasks and substantially outperforms them on long-context tasks, while introducing only a small number of special tokens, which keeps it practical and scalable. Conclusion: LMK pooling is a simple and effective pooling mechanism that balances local sensitivity with long-range modeling and offers a more robust alternative for sequence representation learning. Abstract: Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence to a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.
[28] inversedMixup: Data Augmentation via Inverting Mixed Embeddings
Fanshuang Kong,Richong Zhang,Qiyu Sun,Zhijie Nie,Ting Deng,Chunming Hu
Main category: cs.CL
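TL;DR: This paper proposes inversedMixup, a unified text data augmentation framework that combines the controllability of Mixup with the interpretability of LLM-based generation. A three-stage training procedure aligns the task model's embedding space with the LLM's input space so that controllably mixed embeddings can be reconstructed into readable sentences; the work also provides the first empirical evidence of manifold intrusion in text Mixup and a way to mitigate it.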
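Illustrative sketch of the standard Mixup interpolation that this entry builds on: inputs and labels are linearly interpolated with a controllable ratio. This shows only the mixing step, not the inversion of mixed embeddings back into sentences; the function name `mixup` and the list-based vectors are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def mixup(
    x_a: Sequence[float],
    x_b: Sequence[float],
    y_a: Sequence[float],
    y_b: Sequence[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Return lam * a + (1 - lam) * b for both embeddings and label vectors."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("paired vectors must have equal length")
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix
```

With one-hot labels, the mixed label is a soft label whose entries equal the mixing weights, which is what makes the ratio a direct control knob.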
Details
Motivation: Mixup is controllable but produces samples that are not human-interpretable, while LLM-based generation is readable but hard to control; an effective bridge between the embedding space and the discrete token space is also missing. Method: The inversedMixup framework uses a three-stage training procedure to align the output embedding space of a task model with the input embedding space of an LLM; it applies LLM inversion to reconstruct controllably mixed embeddings into readable sentences; and it introduces a strategy to mitigate manifold intrusion. Result: The method clearly improves text data augmentation in both few-shot and fully supervised settings, supporting its effectiveness and generalizability, and it provides the first empirical evidence of manifold intrusion in text Mixup together with a mitigation strategy. Conclusion: inversedMixup combines the controllability of Mixup with the interpretability of LLM generation, offers a new paradigm for controllable and interpretable text augmentation, and highlights a key manifold-structure issue in text Mixup. Abstract: Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates in the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between latent embedding space and discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Upon successful alignment, inversedMixup can reconstruct mixed embeddings with a controllable mixing ratio into human-interpretable augmented sentences, thereby improving the augmentation performance. Additionally, inversedMixup provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and introduces a simple yet effective strategy to mitigate it. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.
[29] Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes
Yang Zhou,Zhenting Sheng,Mingrui Tan,Yuting Song,Jun Zhou,Yu Heng Kwan,Lian Leng Low,Yang Bai,Yong Liu
Main category: cs.CL
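TL;DR: This paper proposes Note2Chat, a framework that converts medical notes into high-quality doctor-patient dialogues and uses a three-stage fine-tuning strategy and a single-turn reasoning paradigm to substantially improve LLMs' clinical reasoning in dynamic, multi-turn diagnostic tasks.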
Details
Motivation: LLMs perform well on static benchmarks but fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement, and high-quality, non-sensitive doctor-patient dialogue data is scarce. Method: The Note2Chat framework (1) converts real-world medical notes into high-quality doctor-patient dialogues with a decision tree-guided generation and refinement pipeline; (2) applies a three-stage fine-tuning strategy combining supervised learning, simulated data augmentation, and preference learning; and (3) adopts a single-turn reasoning paradigm that models history taking as a sequence of single-turn reasoning problems. Result: The method clearly outperforms GPT-4o on clinical reasoning, with gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy. Conclusion: The note-driven Note2Chat framework improves LLM reasoning in realistic clinical history taking while supporting interpretability, dynamic adaptation, and sample efficiency. Abstract: Effective clinical history taking is a foundational yet underexplored component of clinical reasoning. While large language models (LLMs) have shown promise on static benchmarks, they often fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement. To address this gap, we propose Note2Chat, a note-driven framework that trains LLMs to conduct structured history taking and diagnosis by learning from widely available medical notes. Instead of relying on scarce and sensitive dialogue data, we convert real-world medical notes into high-quality doctor-patient dialogues using a decision tree-guided generation and refinement pipeline. We then propose a three-stage fine-tuning strategy combining supervised learning, simulated data augmentation, and preference learning. Furthermore, we propose a novel single-turn reasoning paradigm that reframes history taking as a sequence of single-turn reasoning problems. This design enhances interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experimental results show that our method substantially improves clinical reasoning, achieving gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o. Our code and dataset can be found at https://github.com/zhentingsheng/Note2Chat.
[30] ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas
Xiaoyu Tian,Haotian Wang,Shuaiting Chen,Hao Zhou,Kaichi Yu,Yudian Zhang,Jade Ouyang,Junxi Yin,Jiong Chen,Baoyan Guo,Lei Zhang,Junjie Tao,Yuansheng Song,Ming Cui,Chengwei Liu
Main category: cs.CL
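TL;DR: This paper proposes ASTRA, a framework that trains tool-augmented LLM agents fully automatically via verifiable reinforcement learning and scalable data synthesis. It addresses existing methods' reliance on manual intervention, non-verifiable simulated environments, and unstable training, and reaches state-of-the-art performance on multiple benchmarks.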
Details
Motivation: Existing training methods for tool-augmented language model agents require manual intervention, depend on non-verifiable simulated environments, use only one of supervised fine-tuning or reinforcement learning, and struggle with stable long-horizon, multi-turn learning. Method: ASTRA has two core components: a pipeline that uses the static topology of tool-call graphs to synthesize diverse, structurally grounded trajectories that build tool-use competence, and an environment synthesis framework that converts decomposed question-answer traces into independent, executable, rule-verifiable environments supporting deterministic multi-turn RL; training combines supervised fine-tuning with online RL and uses trajectory-level rewards to balance task completion and interaction efficiency. Result: On multiple tool-use benchmarks, ASTRA-trained models reach state-of-the-art performance at comparable scales and approach closed-source systems while preserving core reasoning ability. Conclusion: ASTRA offers a fully automated, verifiable, and scalable training paradigm for tool-augmented agents and improves the robustness and generalization of tool use in multi-step decision making. Abstract: Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability.
We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.
[31] KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices
Wuyang Zhou,Yuxuan Gu,Giorgos Iacovides,Danilo Mandic
Main category: cs.CL
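TL;DR: This paper proposes KromHC, which constructs doubly stochastic residual matrices from Kronecker products, guaranteeing exact double stochasticity while reducing parameter complexity from O(n^3C) or O(nC·n!) to O(n^2C) and improving training stability and scalability.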
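Illustrative sketch (not the authors' code) of the linear-algebra fact this entry relies on: the Kronecker product of doubly stochastic matrices is itself doubly stochastic, so a large residual matrix can be assembled from small factors without losing the constraint. The helper names `kron` and `is_doubly_stochastic` and the tolerance are assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]


def kron(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker product: entry [(i, k), (j, l)] equals a[i][j] * b[k][l]."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


def is_doubly_stochastic(m: Matrix, tol: float = 1e-9) -> bool:
    """Check non-negativity and that every row and column sums to 1."""
    n = len(m)
    if any(len(row) != n for row in m):
        return False
    if any(x < -tol for row in m for x in row):
        return False
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in m)
    cols_ok = all(abs(sum(m[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    return rows_ok and cols_ok
```

Two 2x2 factors hold 8 entries but define a 4x4 matrix with 16, which hints at why a factorized parametrization can need fewer parameters than a dense one.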
Details
Motivation: mHC suffers from unstable training and poor scalability; in particular, its Sinkhorn-Knopp algorithm does not guarantee exact double stochasticity, and its parameter complexity is high (O(n^3C), or O(nC·n!) for mHC-lite). Method: KromHC parametrizes the residual matrix in mHC as the Kronecker product of several small doubly stochastic matrices and enforces manifold constraints along each mode of the tensorized residual stream, so that the overall residual matrix is exactly doubly stochastic. Result: KromHC guarantees exact double stochasticity while reducing parameter complexity to O(n^2C); experiments show that it matches or exceeds existing SOTA mHC variants with significantly fewer trainable parameters. Conclusion: KromHC is an efficient, stable, and scalable hyper-connection design that combines theoretical rigor with practical performance and suggests a new direction for designing residual connections in neural networks. Abstract: The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to its training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope, however, it faces two issues: 1) its iterative Sinkhorn-Knopp (SK) algorithm does not always yield exact doubly stochastic residual matrices; 2) mHC incurs a prohibitive $\mathcal{O}(n^3C)$ parameter complexity with $n$ as the width of the residual stream and $C$ as the feature dimension. The recently proposed mHC-lite reparametrizes the residual matrix via the Birkhoff-von-Neumann theorem to guarantee double stochasticity, but also faces a factorial explosion in its parameter complexity, $\mathcal{O} \left( nC \cdot n! \right)$. To address both challenges, we propose KromHC, which uses the Kronecker products of smaller doubly stochastic matrices to parametrize the residual matrix in mHC. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to $\mathcal{O}(n^2C)$. Comprehensive experiments demonstrate that KromHC matches or even outperforms state-of-the-art (SOTA) mHC variants, while requiring significantly fewer trainable parameters. The code is available at https://github.com/wz1119/KromHC.
[32] Language Models as Artificial Learners: Investigating Crosslinguistic Influence
Abderrahmane Issam,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis
Main category: cs.CL
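TL;DR: This paper uses language models (LMs) as controlled statistical learners to systematically simulate crosslinguistic influence (CLI), examining how L1 dominance, L2 proficiency, and syntactic distance affect CLI and probing the mechanism with a cross-linguistic priming paradigm; the results support and extend psycholinguistic theories of CLI.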
Details
Motivation: Human bilingualism studies often report conflicting results on crosslinguistic influence (CLI) because of experimental variance, so a more controllable research paradigm is needed. Method: Language models serve as controlled statistical learners; L1 dominance and L2 proficiency are manipulated through the L2 age of exposure (the training step at which the L2 is introduced), and the syntactic distance between the L1 pretraining language and the L2 is varied; a cross-linguistic priming paradigm is used to analyze how activating L1 structures affects L2 processing. Result: The results agree with psycholinguistic evidence: language dominance and proficiency are strong predictors of CLI; priming of grammatical structures is bidirectional, whereas priming of ungrammatical structures is modulated by language dominance; in LMs, the L1 is co-activated during L2 processing and directly influences the circuitry recruited for the L2. Conclusion: Language models can serve as a computational framework that provides interpretable, controllable mechanistic evidence for theories of human CLI and supports the modeling of bilingual cognition. Abstract: Despite the centrality of crosslinguistic influence (CLI) to bilingualism research, human studies often yield conflicting results due to inherent experimental variance. We address these inconsistencies by using language models (LMs) as controlled statistical learners to systematically simulate CLI and isolate its underlying drivers. Specifically, we study the effect of varying the L1 language dominance and the L2 language proficiency, which we manipulate by controlling the L2 age of exposure -- defined as the training step at which the L2 is introduced. Furthermore, we investigate the impact of pretraining on L1 languages with varying syntactic distance from the L2. Using cross-linguistic priming, we analyze how activating L1 structures impacts L2 processing. Our results align with evidence from psycholinguistic studies, confirming that language dominance and proficiency are strong predictors of CLI. We further find that while priming of grammatical structures is bidirectional, the priming of ungrammatical structures is sensitive to language dominance. Finally, we provide mechanistic evidence of CLI in LMs, demonstrating that the L1 is co-activated during L2 processing and directly influences the neural circuitry recruited for the L2. More broadly, our work demonstrates that LMs can serve as a computational framework to inform theories of human CLI.
[33] ILRR: Inference-Time Steering Method for Masked Diffusion Language Models
Eden Avrahami,Eliya Nachmani
Main category: cs.CL
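TL;DR: This paper proposes ILRR, a training-free control framework for discrete diffusion language models (DLMs) that steers semantic attributes by dynamically aligning the internal activations of the generated sequence with those of a reference sequence, and extends it with Spatially Modulated Steering for long texts, substantially improving attribute-control accuracy while preserving generation quality.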
Details
Motivation: Discrete diffusion language models (DLMs) lack efficient, flexible mechanisms for inference-time control, in particular methods that need no extra training and can steer at the semantic level from a single reference sequence. Method: Iterative Latent Representation Refinement (ILRR) progressively aligns the internal activations of the generated sequence with those of a reference sequence during denoising; Spatially Modulated Steering further regulates guidance intensity by position, so that a short reference can steer a long text. Result: ILRR achieves effective attribute control (e.g., sentiment) on LLaDA and MDLM, with a computational overhead of only one additional parallel forward pass per denoising step; under the same compute budget, attribute accuracy is 10% to 60% points higher than comparable baselines while generation quality remains high. Conclusion: ILRR is a lightweight, general, learning-free control framework for DLMs that offers an efficient, interpretable, and tunable approach to semantic steering in non-autoregressive text generation. Abstract: Discrete Diffusion Language Models (DLMs) offer a promising non-autoregressive alternative for text generation, yet effective mechanisms for inference-time control remain relatively underexplored. Existing approaches include sampling-level guidance procedures or trajectory optimization mechanisms. In this work, we introduce Iterative Latent Representation Refinement (ILRR), a learning-free framework for steering DLMs using a single reference sequence. ILRR guides generation by dynamically aligning the internal activations of the generated sequence with those of a given reference throughout the denoising process. This approach captures and transfers high-level semantic properties, with a tunable steering scale enabling flexible control over attributes such as sentiment. We further introduce Spatially Modulated Steering, an extension that enables steering long texts using shorter references by regulating guidance intensity across the sequence. Empirically, we demonstrate that ILRR achieves effective attribute steering on LLaDA and MDLM architectures with a minor computational overhead, requiring only one additional parallel forward pass per denoising step. Under the same compute budget, ILRR improves attribute accuracy over comparable baselines by 10% to 60% points, while maintaining high generation quality.
[34] AdaptBPE: From General Purpose to Specialized Tokenizers
Vijini Liyanage,François Yvon
Main category: cs.CL
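TL;DR: This paper proposes a lightweight post-training adaptation strategy for subword tokenizers such as BPE: low-utility tokens are identified on a domain- or language-specific corpus and replaced, yielding a better-tailored vocabulary and improving compression efficiency and downstream task performance.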
Details
Motivation: General-purpose subword tokenizers use tokens inefficiently in specific domains or languages, which degrades model performance and encoding efficiency. Method: Proposes a post-training adaptation algorithm that, based on token frequencies in an adaptation corpus, selects and replaces low-utility tokens so as to encode that corpus as effectively as possible at a target vocabulary size. Result: On generation and classification tasks across multiple languages, the adapted tokenizers compress test corpora more effectively than baselines at the same vocabulary size. Conclusion: The method is a lightweight, efficient 'vocabulary fine-tuning' mechanism that can markedly improve how well an LLM's tokenization fits a specific domain or task, and its overall performance. Abstract: Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora more effectively than baselines using the same vocabulary size. This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks. Our code and data are available at https://github.com/vijini/Adapt-BPE.git.
[35] Scale-Dependent Semantic Dynamics Revealed by Allan Deviation
Debayan Dasgupta
Main category: cs.CL
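TL;DR: This paper treats the semantic progression of written text as a stochastic trajectory in a high-dimensional state space and uses Allan deviation, a tool from precision metrology, to analyze semantic stability. Human text shows two dynamical regimes, short-time power-law scaling and a long-time noise floor; large language models mimic the local scaling statistics, but their stability horizon is markedly shorter.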
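As a rough illustration of the Allan-deviation analysis in this entry, the sketch below computes the standard overlapping Allan deviation of a vector-valued displacement signal in plain Python. It assumes unit spacing between consecutive sentences and uses the squared Euclidean norm of the second difference; the function name `allan_deviation`, the toy inputs, and the choice of estimator are assumptions for illustration, since the abstract does not state which estimator or normalization the paper uses.

```python
import math


def allan_deviation(points, m):
    """Overlapping Allan deviation of a displacement signal at tau = m steps.

    `points` is a list of equal-length coordinate lists (one per sentence).
    Assumption: unit spacing between consecutive samples.
    """
    n = len(points)
    if m < 1 or n < 2 * m + 1:
        raise ValueError("need m >= 1 and at least 2*m + 1 points")
    terms = n - 2 * m
    total = 0.0
    for i in range(terms):
        # Squared norm of the second difference x[i+2m] - 2*x[i+m] + x[i].
        total += sum(
            (c - 2.0 * b + a) ** 2
            for a, b, c in zip(points[i], points[i + m], points[i + 2 * m])
        )
    variance = total / (2.0 * m * m * terms)
    return math.sqrt(variance)


# Toy 1-D signal that flips between 0 and 1 on every step.
zigzag = [[0.0], [1.0], [0.0], [1.0], [0.0], [1.0]]
print(allan_deviation(zigzag, 1))

# A straight-line trajectory has zero second differences.
line = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
print(allan_deviation(line, 2))
```

Evaluating the function over a range of `m` and plotting the result on log-log axes is the usual way to look for a power-law region and a noise floor.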
Details
Motivation: To uncover the dynamics underlying semantic progression in language and to distinguish human cognition from algorithmic models in terms of semantic coherence. Method: Treats ordered sentence embeddings as a displacement signal and applies Allan deviation to analyze their stability in the high-dimensional semantic space, identifying the dynamical behavior at different time scales. Result: Human text exhibits short-time power-law scaling (which separates creative literature from technical text) and a long-time, stability-limited noise floor; large language models reproduce the short-time scaling but show a systematically shorter stability horizon. Conclusion: Semantic coherence is a measurable physical property, and the framework offers a quantitative way to distinguish the semantic dynamics of human cognition from those of large language models. Abstract: While language progresses through a sequence of semantic states, the underlying dynamics of this progression remain elusive. Here, we treat the semantic progression of written text as a stochastic trajectory in a high-dimensional state space. We utilize Allan deviation, a tool from precision metrology, to analyze the stability of meaning by treating ordered sentence embeddings as a displacement signal. Our analysis reveals two distinct dynamical regimes: short-time power-law scaling, which differentiates creative literature from technical texts, and a long-time crossover to a stability-limited noise floor. We find that while large language models successfully mimic the local scaling statistics of human text, they exhibit a systematic reduction in their stability horizon. These results establish semantic coherence as a measurable physical property, offering a framework to differentiate the nuanced dynamics of human cognition from the patterns generated by algorithmic models.
[36] FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning
Xiaoyu Xu,Minxin Du,Kun Fang,Zi Liang,Yaxin Xiao,Zhicong Huang,Cheng Hong,Qingqing Ye,Haibo Hu
Main category: cs.CL
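TL;DR: This paper proposes FIT, a framework for continual unlearning in large language models. Through data filtering, importance-aware updates, and targeted layer attribution, it guards against catastrophic forgetting and post-unlearning recovery while handling large numbers of deletion requests, and it is validated on the PCH benchmark.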
Details
Motivation: Existing LLM unlearning methods cope poorly with the continual, high-volume deletion requests of real-world settings, which tends to cause utility degradation and catastrophic forgetting. Method: Proposes the FIT framework with three core components: rigorous data Filtering, Importance-aware parameter updates, and Targeted layer attribution; also builds the PCH benchmark (covering Personal information, Copyright, and Harmful content) and two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.). Result: In experiments on four open-source LLMs with hundreds of deletion requests, FIT achieves the best trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and is robust to relearning and quantization recovery attacks. Conclusion: FIT is an efficient, robust framework for continual unlearning that balances forgetting effectiveness with model utility and is practical for real-world deletion scenarios. Abstract: Large language models (LLMs) demonstrate impressive capabilities across diverse tasks but raise concerns about privacy, copyright, and harmful materials. Existing LLM unlearning methods rarely consider the continual and high-volume nature of real-world deletion requests, which can cause utility degradation and catastrophic forgetting as requests accumulate. To address this challenge, we introduce FIT, a framework for continual unlearning that handles large numbers of deletion requests while maintaining robustness against both catastrophic forgetting and post-unlearning recovery. FIT mitigates degradation through rigorous data Filtering, Importance-aware updates, and Targeted layer attribution, enabling stable performance across long sequences of unlearning operations and achieving a favorable balance between forgetting effectiveness and utility retention. To support realistic evaluation, we present PCH, a benchmark covering Personal information, Copyright, and Harmful content in sequential deletion scenarios, along with two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), which jointly assess forgetting quality and utility preservation. Extensive experiments on four open-source LLMs with hundreds of deletion requests show that FIT achieves the strongest trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and remains resistant against both relearning and quantization recovery attacks.
[37] Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling
Xinglin Wang,Jiayi Shi,Shaoxiong Feng,Peiwen Yuan,Yiwei Li,Yueqi Zhang,Chuyi Tan,Ji Zhang,Boyuan Pan,Yao Hu,Kan Li
Main category: cs.CL
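TL;DR: This paper proposes Recycling Search Experience (RSE), a test-time search strategy that reuses intermediate conclusions and failure patterns to make test-time scaling of large language models more efficient on complex reasoning tasks, with no additional training and at comparable computational cost.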
Details
Motivation: Existing test-time scaling methods treat each reasoning trajectory as an independent, disposable sample, which leads to heavy redundant computation and repeated exploration of known dead ends; this lack of memory makes search inefficient. Method: Proposes RSE, which distills reasoning trajectories into a shared experience bank: positive recycling reuses intermediate conclusions to skip redundant derivations, and negative recycling reuses failure patterns to prune known dead ends; the method is a self-guided, training-free, cumulative search mechanism. Result: A theoretical analysis formalizes RSE's efficiency advantage over independent sampling; experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines at comparable computational cost and achieves state-of-the-art scaling efficiency. Conclusion: By giving test-time search a memory, RSE markedly improves the computational efficiency and effectiveness of LLM reasoning and offers a new approach to test-time scaling. Abstract: Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This systemic memorylessness leads to massive computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose Recycling Search Experience (RSE), a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE, validating its advantage over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art scaling efficiency.
[38] Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
Hojae Han,Heeyun Jung,Jongyoon Kim,Seung-won Hwang
Main category: cs.CL
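TL;DR: This paper proposes DAVID-GRPO, a framework that lets small language models perform multi-hop reasoning efficiently under resource constraints by stabilizing early learning, assigning retrieval credit based on evidence recall, and resampling truncated near-miss trajectories; it outperforms existing RL methods designed for large-scale settings.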
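The evidence-recall signal mentioned in this entry can be illustrated with a small sketch. It computes plain set recall of gold evidence documents among the retrieved ones; the function name `evidence_recall` and the document ids are hypothetical, and how DAVID-GRPO actually converts recall into retrieval credit is not described in the abstract, so nothing here should be read as the paper's reward.

```python
def evidence_recall(retrieved_ids, gold_ids):
    """Fraction of gold evidence documents that appear among the retrieved ones."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold evidence set must not be empty")
    # Duplicated retrievals count once because both sides are sets.
    return len(gold & set(retrieved_ids)) / len(gold)


# Hypothetical two-hop question with two gold evidence documents.
print(evidence_recall(["d1", "d3", "d1"], ["d1", "d2"]))
```

A recall-style signal of this kind is denser than a final-answer reward, because it is nonzero even when only part of the evidence chain has been found.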
Details
Motivation: Existing reinforcement learning methods depend on extensive on-policy rollouts in a high-cost, high-accuracy regime and struggle to train stably or reason accurately over multiple hops under resource constraints such as small models and limited rollout budgets. Method: Proposes DAVID-GRPO, which (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) resamples truncated near-miss trajectories to improve exploration. Result: With models of up to 1.5B parameters trained on only four RTX 3090 GPUs, it consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. Conclusion: With the right inductive biases, small language models can reach high reasoning accuracy at low training cost, breaking the assumed link between low cost and low accuracy. Abstract: While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense explorations, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated on agents up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve low training cost with high accuracy.
[39] Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning
Wonduk Seo,Wonseok Choi,Junseo Koh,Juhyeon Lee,Hyunjin An,Minhyeong Yu,Jian Park,Qingshan Zhou,Seunghyun Lee,Yi Bu
Main category: cs.CL
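TL;DR: This paper proposes OG-MAR, a framework that combines World Values Survey (WVS) data with a structured cultural ontology and uses multi-agent reasoning to improve the alignment, robustness, and interpretability of large language models in culturally sensitive decision making.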
Details
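Motivation: Because of skewed pretraining data and the absence of structured value representations, existing LLMs exhibit value misalignment in culturally sensitive decision making; existing alignment methods lack demographic grounding and treat values as independent, unstructured signals, which hurts consistency and interpretability. Method: Proposes the OG-MAR (Ontology-Guided Multi-Agent Reasoning) framework: 1) builds respondent-specific value profiles from the WVS; 2) constructs a global cultural ontology over a fixed taxonomy via competency questions; 3) at inference time, retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents; 4) has a judgment agent synthesize their outputs while enforcing ontology consistency and demographic proximity. Result: On regional social-survey benchmarks across four LLM backbones, OG-MAR improves cultural alignment and robustness over competitive baselines and produces more transparent reasoning traces. Conclusion: Combining a structured cultural ontology with demographically aware multi-agent reasoning effectively improves the alignment quality, stability, and interpretability of LLMs on value-related tasks. Abstract: Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.
[40] Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis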
Qingyue Yang,Jie Wang,Xing Li,Yinqi Bai,Xialiang Tong,Huiling Zhen,Jianye Hao,Mingxuan Yuan,Bin Li
Main category: cs.CL
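TL;DR: This paper proposes TAPPA, a framework that gives a unified explanation of the diverse attention patterns in LLMs from a temporally continuous perspective. It divides patterns into predictable and unpredictable ones and explains the distinction by the degree of query self-similarity; it further analyzes three representative predictable patterns mathematically through queries, keys, and RoPE, and validates the framework on KV cache compression and model pruning.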
Details
Motivation: Prior observations of attention patterns in LLMs (such as retrieval heads, sink heads, and diagonal traces) are fragmented and lack a unifying explanatory framework. Method: Proposes Temporal Attention Pattern Predictability Analysis (TAPPA), which models the predictability of attention patterns from a temporally continuous perspective, analyzes it theoretically in terms of query self-similarity along the temporal dimension, and derives three representative predictable patterns mathematically from queries, keys, and RoPE. Result: Reveals a quantitative link between the predictability of attention patterns and query self-similarity; on KV cache compression and LLM pruning, a simple metric motivated by TAPPA outperforms baseline methods. Conclusion: TAPPA provides a unified, interpretable theoretical framework for understanding attention in LLMs and can guide practical applications such as inference acceleration and model compression. Abstract: Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab-USTC/LLM-TAPPA.
[41] TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning
Huiyuan Lai,Malvina Nissim
Main category: cs.CL
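TL;DR: This paper proposes TACLer, a framework that uses model-tailored curriculum reinforcement learning and a hybrid Thinking/NoThinking reasoning paradigm to markedly improve the learning and reasoning efficiency of large language models on complex reasoning tasks while also raising accuracy.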
Details
Motivation: Existing long chain-of-thought (CoT) methods rely on large-scale reinforcement learning and tend to produce redundant reasoning (overthinking), which hurts learning and reasoning efficiency. Method: Proposes TACLer: 1) model-tailored, progressive curriculum reinforcement learning that adjusts data complexity to the model's proficiency; 2) a hybrid Thinking/NoThinking reasoning paradigm that enables or disables the Thinking mode to balance accuracy and efficiency. Result: Training compute falls by over 50% compared with long thinking models; relative to the base model, inference token usage falls by over 42% and accuracy rises by over 9%; TACLer consistently outperforms state-of-the-art Thinking and NoThinking baselines on four math datasets with complex problems. Conclusion: TACLer markedly improves the learning and reasoning efficiency of LLMs while preserving or even improving performance, offering a new approach to efficient, accurate complex reasoning. Abstract: Large Language Models (LLMs) have shown remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT typically requires large-scale reinforcement learning (RL) training, while often leading to overthinking with redundant intermediate steps. To improve learning and reasoning efficiency, while preserving or even enhancing performance, we propose TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases the complexity of the data based on the model's proficiency in multi-stage RL training. TACLer features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50% compared to long thinking models and reducing inference token usage by over 42% relative to the base model; and (ii) it improves accuracy by over 9% on the base model, consistently outperforming state-of-the-art NoThinking and Thinking baselines across four math datasets with complex problems.
[42] Enhancing Language Models for Robust Greenwashing Detection
Neil Heinrich Braun,Keane Ong,Rui Mao,Erik Cambria,Gianmarco Mengaldo
Main category: cs.CL
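TL;DR: This paper proposes a parameter-efficient framework that structures the latent space of large language models by combining contrastive learning with an ordinal ranking objective, so as to better separate concrete actions from ambiguous claims and make the detection of greenwashing and vague claims in ESG sustainability reports more robust.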
Details
Motivation: Existing NLP models are not robust when identifying greenwashing and vague claims in sustainability reports; they rely on surface-level patterns and generalize poorly. Method: Proposes a parameter-efficient framework that combines contrastive learning with an ordinal ranking objective, adds gated feature modulation to filter disclosure noise, and uses MetaGradNorm to stabilize multi-objective optimization. Result: In cross-category experiments, the method is more robust than standard baselines and reveals a trade-off between representational rigidity and generalization. Conclusion: The framework captures the graded concreteness of claims in sustainability reports more reliably and has practical potential for ESG assessment. Abstract: Sustainability reports are critical for ESG assessment, yet greenwashing and vague claims often undermine their reliability. Existing NLP models lack robustness to these practices, typically relying on surface-level patterns that generalize poorly. We propose a parameter-efficient framework that structures LLM latent spaces by combining contrastive learning with an ordinal ranking objective to capture graded distinctions between concrete actions and ambiguous claims. Our approach incorporates gated feature modulation to filter disclosure noise and utilizes MetaGradNorm to stabilize multi-objective optimization. Experiments in cross-category settings demonstrate superior robustness over standard baselines while revealing a trade-off between representational rigidity and generalization.
[43] Procedural Pretraining: Warming Up Language Models with Abstract Data
Liangze Jiang,Zachary Shinnick,Anton van den Hengel,Hemanth Saratchandran,Damien Teney
Main category: cs.CL
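TL;DR: This paper proposes a new pretraining paradigm: before large-scale natural-language pretraining, the model is first pretrained briefly on abstract procedural data (such as Dyck sequences), which markedly improves its abilities in algorithmic tasks, language modeling, and code, and speeds up convergence.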
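To make "procedural data" concrete, here is a minimal sketch of a generator for Dyck sequences (balanced brackets) together with a validity check. The bracket alphabet, the 0.5 open/close probability, and the function names are assumptions chosen for illustration; the abstract does not specify the generator settings used in the paper.

```python
import random

# Assumed bracket alphabet: three bracket types.
PAIRS = {"(": ")", "[": "]", "{": "}"}


def generate_dyck(num_pairs, rng):
    """Sample one balanced bracket string containing `num_pairs` pairs."""
    out, stack, opened = [], [], 0
    while opened < num_pairs or stack:
        can_open = opened < num_pairs
        # Open when nothing is pending, otherwise open or close at random.
        if can_open and (not stack or rng.random() < 0.5):
            bracket = rng.choice(sorted(PAIRS))
            stack.append(bracket)
            out.append(bracket)
            opened += 1
        else:
            # Close the most recently opened bracket.
            out.append(PAIRS[stack.pop()])
    return "".join(out)


def is_balanced(text):
    """Return True if `text` is a well-formed Dyck word over PAIRS."""
    stack = []
    for ch in text:
        if ch in PAIRS:
            stack.append(ch)
        elif not stack or PAIRS[stack.pop()] != ch:
            return False
    return not stack


sample = generate_dyck(8, random.Random(0))
print(sample, is_balanced(sample))
```

Sequences like these carry no natural-language semantics, which is the point: predicting the next closing bracket requires tracking nested context, not world knowledge.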
Details
Motivation: Inspired by the way humans learn simple logic and mathematics before higher reasoning, the paper explores introducing abstract structured data (procedural data in particular) before standard language pretraining, to build semantic and reasoning abilities more efficiently. Method: Systematically front-loads pretraining with several forms of procedural data (such as Dyck sequences and data generated by simple algorithms), evaluates the effect on downstream tasks (such as needle-in-a-haystack and the speed of loss convergence), on model scale (up to 1.3B), and on internal mechanisms (structure in attention and MLP layers), and explores ways to combine multiple forms of procedural data. Result: Front-loading as little as 0.1% procedural data outperforms standard pretraining on natural language, code, or mathematics alone; accuracy on context recall rises from 10% to 98%; the same loss is reached with only 55% to 86% of the original data; the structure instilled in attention layers matters most for structured domains (e.g., code), and that in MLP layers for language. Conclusion: Procedural pretraining is a lightweight, effective way to accelerate language model training and improve performance, and it supports the prospect of disentangling knowledge acquisition from reasoning in LLMs. Abstract: Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. We specifically focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10% to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this procedural pretraining enables the models to reach the same loss value with only 55%, 67%, and 86% of the original data, respectively. Third, we explore the mechanisms behind these gains and find that procedural pretraining instils non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g., code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
[44] CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering
Jiayin Lan,Jiaqi Li,Baoxin Wang,Ming Liu,Dayong Wu,Shijin Wang,Bing Qin,Guoping Hu
Main category: cs.CL
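TL;DR: This paper proposes CE-GOCD, which uses paper titles as central entities to retrieve and optimize subgraphs of an academic knowledge graph and applies community detection to improve large language models' question answering over scientific literature.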
Details
Motivation: Existing retrieval augmentation methods rely on isolated text chunks or concepts and overlook deeper semantic connections between papers, which limits the LLM's comprehension of scientific literature and the comprehensiveness and specificity of its answers. Method: Proposes Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD): (1) uses paper titles as central entities for targeted subgraph retrieval; (2) enhances implicit semantic discovery through subgraph pruning and completion; (3) applies community detection to distill coherent groups of papers with shared themes. Result: On three NLP literature-based question-answering datasets, CE-GOCD outperforms other retrieval-augmented baselines. Conclusion: Explicitly modeling and leveraging semantic substructures in academic knowledge graphs effectively improves LLM performance on scientific question answering. Abstract: Large Language Models (LLMs) are increasingly used for question answering over scientific research papers. Existing retrieval augmentation methods often rely on isolated text chunks or concepts, but overlook deeper semantic connections between papers. This impairs the LLM's comprehension of scientific literature, hindering the comprehensiveness and specificity of its responses. To address this, we propose Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD), a method that augments LLMs' scientific question answering by explicitly modeling and leveraging semantic substructures within academic knowledge graphs. Our approach operates by: (1) leveraging paper titles as central entities for targeted subgraph retrieval, (2) enhancing implicit semantic discovery via subgraph pruning and completion, and (3) applying community detection to distill coherent paper groups with shared themes. We evaluated the proposed method on three NLP literature-based question-answering datasets, and the results demonstrate its superiority over other retrieval-augmented baseline approaches, confirming the effectiveness of our framework.
[45] Temporal Guidance for Large Language Models
Hong-Kai Zheng,Piji Li
Main category: cs.CL
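TL;DR: This paper proposes Temporal Guidance (TeGu), a contrastive guidance strategy along the temporal dimension that uses Multi-Token Prediction (MTP) to construct the model's own weaker predictions for self-contrast; combined with a lightweight Conditional MTP Projector (cMTPP), it markedly improves LLM generation quality at low overhead.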
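For readers unfamiliar with the contrastive decoding that this entry builds on, the sketch below shows the commonly used form of the rule: keep only tokens the stronger ("expert") distribution finds plausible, then pick the token with the largest log-probability gap over the weaker ("amateur") distribution. This is generic contrastive decoding, not TeGu's exact scoring rule, which the abstract does not give; in TeGu the amateur distribution would come from the model's own MTP predictions. The function name, the `alpha` default, and the toy probabilities are assumptions.

```python
import math


def contrastive_pick(expert_probs, amateur_probs, alpha=0.1):
    """Pick the next token by contrasting expert and amateur distributions.

    Tokens with expert probability below alpha * max(expert) are excluded
    (the plausibility constraint); the rest are scored by
    log p_expert - log p_amateur.
    """
    cutoff = alpha * max(expert_probs.values())
    best_token, best_score = None, -math.inf
    for token, p_exp in expert_probs.items():
        if p_exp <= 0.0 or p_exp < cutoff:
            continue
        # Floor the amateur probability to avoid log(0).
        p_ama = max(amateur_probs.get(token, 0.0), 1e-12)
        score = math.log(p_exp) - math.log(p_ama)
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Toy next-token distributions (hypothetical numbers).
expert = {"the": 0.5, "Paris": 0.4, "zzz": 0.01}
amateur = {"the": 0.6, "Paris": 0.1, "zzz": 0.0001}

print(contrastive_pick(expert, amateur))
print(contrastive_pick(expert, amateur, alpha=0.0))
```

The second call disables the plausibility constraint and selects the implausible token `zzz`, which illustrates why the constraint is part of the rule.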
Details
Motivation: Contrastive Decoding improves LLM generation quality but needs an auxiliary model and incurs significant computational overhead; existing self-contrastive methods such as DoLa rely on discrepancies across layers and are unstable on small models; observing that LLMs exhibit local preferences, the authors explore a stable self-contrast mechanism along the temporal dimension. Method: Proposes Temporal Guidance (TeGu), which uses Multi-Token Prediction (MTP) to produce 'amateur' future-token predictions as a weak reference and contrasts them with the current-step prediction; a lightweight Conditional MTP Projector (cMTPP) standardizes the MTP implementation and avoids maintaining multiple independent networks. Result: Across multiple model series and benchmarks, TeGu significantly improves generation performance while keeping memory and compute overhead low, outperforming existing CD and self-contrastive methods such as DoLa. Conclusion: Self-contrastive guidance along the temporal dimension (TeGu) is an efficient, stable, and scalable decoding enhancement that enables high-quality generation without an additional model. Abstract: Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.
[46] CoFrGeNet: Continued Fraction Architectures for Language Generation
Amit Dhurandhar,Vijil Chenthamarakshan,Dennis Wei,Tejaswini Pedapati,Karthikeyan Natesan Ramamurthy,Rahul Nair
Main category: cs.CL
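TL;DR: This paper proposes CoFrGeNets, a new function class for generative modeling based on continued fractions, whose components can replace multi-head attention and feed-forward networks in Transformers with fewer parameters and faster training, while matching or sometimes exceeding the original models on several downstream tasks.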
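As background on the mathematical object this entry is named after, the sketch below evaluates a finite continued fraction a0 + 1/(a1 + 1/(a2 + ...)) from the innermost term outward, using exact rational arithmetic. It only illustrates continued fractions themselves; how CoFrGeNet parameterizes them as network components is not described in the abstract, and the function name and example coefficients are assumptions.

```python
from fractions import Fraction


def continued_fraction(coeffs):
    """Evaluate a0 + 1/(a1 + 1/(a2 + ... + 1/an)) exactly.

    Raises ZeroDivisionError if an intermediate denominator is zero.
    """
    if not coeffs:
        raise ValueError("need at least one coefficient")
    # Start from the innermost term and fold outward.
    value = Fraction(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        value = a + 1 / value
    return value


# All-ones coefficients give ratios of consecutive Fibonacci numbers.
print(continued_fraction([1, 1, 1, 1, 1]))

# The first terms of the continued fraction of pi.
print(continued_fraction([3, 7, 15, 1]))
```

The nested-reciprocal structure is what makes such functions expressive with few coefficients, and it is also why a zero or near-zero denominator needs care in any numerical implementation.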
Details
Motivation: Inspired by continued fractions, the work aims to design a generative architecture with fewer parameters and more efficient computation that still generates well, easing the large parameter counts and high training cost of Transformers. Method: Proposes the CoFrGeNets architecture family, designs continued-fraction-based components that replace Multi-head Attention and Feed-Forward Networks in Transformer blocks, and derives custom gradient formulations for more accurate and efficient optimization; the components are plug-in replacements compatible with existing training and inference procedures. Result: Validated on GPT2-xl (1.5B) and Llama3 (3.2B): with 2/3 to 1/2 of the parameters and shorter pre-training time, performance on downstream classification, Q&A, reasoning, and text understanding tasks matches or sometimes exceeds that of the original models. Conclusion: CoFrGeNets are an efficient, lightweight, and competitive alternative to standard Transformer components with good potential for industrial adoption; hardware-customized implementations should bring out further gains. Abstract: Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets (Continued Fraction Generative Networks). We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change in training or inference procedures that have already been put in place for Transformer-based models, thus making our approach easy to incorporate in large industrial workflows. We experiment on two very different Transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B): we pre-train the former on OpenWebText and GneissWeb and the latter on the docling data mix, which consists of nine different datasets. Results show that the performance of our models on downstream classification, Q&A, reasoning, and text understanding tasks is competitive with, and sometimes even superior to, the original models with 2/3 to 1/2 the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.
[47] Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond
Wei Zhu
Main category: cs.CL
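TL;DR: This paper systematically evaluates ChatGPT on 4 medical information extraction (MedIE) tasks across 6 benchmark datasets and finds that it performs below fine-tuned baselines; it gives high-quality explanations and stays faithful to the text, but it is over-confident and its generations are uncertain.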
Details
Motivation: To assess the overall ability of large language models such as ChatGPT on medical information extraction (MedIE) tasks, including performance, explainability, confidence, faithfulness, and uncertainty, and thereby judge their feasibility in a specialized domain. Method: Systematically evaluates ChatGPT on 4 MedIE tasks across 6 benchmark datasets, measuring its performance, explanation quality, confidence, faithfulness to the original text, and output uncertainty. Result: (a) ChatGPT's performance on MedIE tasks falls behind fine-tuned baseline models; (b) it provides high-quality explanations but is over-confident; (c) it is highly faithful to the original text in most cases; (d) uncertainty in generation makes the extraction results unstable. Conclusion: ChatGPT cannot yet replace dedicated fine-tuned models for medical information extraction; its calibration and uncertainty modeling need further improvement. Abstract: Large Language Models (LLMs) like ChatGPT have demonstrated amazing capabilities in comprehending user intents and generating reasonable and useful responses. Besides their ability to chat, their capabilities in various natural language processing (NLP) tasks are of interest to the research community. In this paper, we focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis by measuring ChatGPT's performance, explainability, confidence, faithfulness, and uncertainty. Our experiments reveal that: (a) ChatGPT's performance scores on MedIE tasks fall behind those of the fine-tuned baseline models. (b) ChatGPT can provide high-quality explanations for its decisions; however, ChatGPT is over-confident in its predictions. (c) ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. (d) The uncertainty in generation causes uncertainty in information extraction results, and thus may hinder its applications in MedIE tasks.
[48] Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention
Alon Rozental
Main category: cs.CL
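TL;DR: This paper proposes Zonkey, a hierarchical diffusion model with differentiable tokenization for text generation; a learnable Segment Splitter and a Probabilistic Attention mechanism allow end-to-end training from raw characters to document-level representations, removing the constraints of a fixed tokenizer.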
Details
Motivation: Existing LLMs are constrained by fixed, non-differentiable tokenizers such as BPE, which hinder end-to-end optimization and adapt poorly to noisy or domain-specific data. Method: Proposes Zonkey, which comprises a differentiable Segment Splitter (learning probabilistic BOS decisions), Probabilistic Attention (position-specific existence probabilities that give soft masking over theoretically infinite sequences while preserving gradients), hierarchical compression into higher abstractions (character n-grams to word-like vectors, then sentence-like), reconstruction by denoising in latent space with the Denoising Diffusion Mixed Model (DDMM), and a Stitcher that ensures overlap invariance across segments. Result: Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise, shows emergent linguistic hierarchies, and aligns qualitatively better with the data distribution than entropy-based learnable tokenizers. Conclusion: Zonkey advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation; the source code is released. Abstract: Large language models (LLMs) have revolutionized natural language processing, yet they remain constrained by fixed, non-differentiable tokenizers like Byte Pair Encoding (BPE), which hinder end-to-end optimization and adaptability to noisy or domain-specific data. We introduce Zonkey, a hierarchical diffusion model that addresses these limitations through a fully trainable pipeline from raw characters to document-level representations. At its core is a differentiable tokenizer (Segment Splitter) that learns probabilistic beginning-of-sequence (BOS) decisions, enabling adaptive splits that emerge as linguistically meaningful (e.g., word boundaries at spaces, sentence starts at periods) without explicit supervision. This differentiability is enabled by our novel Probabilistic Attention mechanism, which incorporates position-specific existence probabilities to simulate soft masking over theoretically infinite sequences while preserving gradients. Sequences decay probabilistically rather than relying on end-of-sequence tokens, supporting variable-length outputs. Hierarchical levels compress sequences into higher abstractions (e.g., character n-grams to word-like vectors, then sentence-like), with reconstruction via our Denoising Diffusion Mixed Model (DDMM) for stable and efficient denoising in latent space. A Stitcher ensures overlap invariance across segments. Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise, demonstrating emergent hierarchies and promising qualitative alignment to data distributions compared to entropy-based learnable tokenizers. Our approach advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation. We release the source code for training and reproducing our experiments.
[49] KID: Knowledge-Injected Dual-Head Learning for Knowledge-Grounded Harmful Meme Detection
Yaocong Li,Leihan Zhang,Le Zhang,Qiang Yan
Main category: cs.CL
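TL;DR: This paper proposes KID, a framework that improves harmful meme detection through knowledge injection and dual-head learning and outperforms existing methods.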
Details
Motivation: Internet memes rely on metaphor and sociocultural context, which makes them subtle vehicles for harmful content, and existing methods struggle to understand their implicit toxicity. Method: Proposes the KID (Knowledge-Injected Dual-Head) framework, which uses a label-constrained distillation paradigm to build structured reasoning chains and a dual-head architecture that jointly optimizes semantic generation and classification. Result: Achieves SOTA on five multilingual datasets (English, Chinese, and Bengali), improving over the previous best methods by 2.1% to 19.7% on binary and multi-label tasks; ablations confirm that knowledge injection and dual-head learning are effective and complementary. Conclusion: By grounding external knowledge in meme-specific contexts and jointly optimizing generative and discriminative objectives, KID achieves more robust and generalizable understanding of harmful memes. Abstract: Internet memes have become pervasive carriers of digital culture on social platforms. However, their heavy reliance on metaphors and sociocultural context also makes them subtle vehicles for harmful content, posing significant challenges for automated content moderation. Existing approaches primarily focus on intra-modal and inter-modal signal analysis, while the understanding of implicit toxicity often depends on background knowledge that is not explicitly present in the meme itself. To address this challenge, we propose KID, a Knowledge-Injected Dual-Head Learning framework for knowledge-grounded harmful meme detection. KID adopts a label-constrained distillation paradigm to decompose complex meme understanding into structured reasoning chains that explicitly link visual evidence, background knowledge, and classification labels. These chains guide the learning process by grounding external knowledge in meme-specific contexts. In addition, KID employs a dual-head architecture that jointly optimizes semantic generation and classification objectives, enabling aligned linguistic reasoning while maintaining stable decision boundaries. Extensive experiments on five multilingual datasets spanning English, Chinese, and low-resource Bengali demonstrate that KID achieves SOTA performance on both binary and multi-label harmful meme detection tasks, improving over previous best methods by 2.1% to 19.7% across primary evaluation metrics. Ablation studies further confirm the effectiveness of knowledge injection and dual-head joint learning, highlighting their complementary contributions to robust and generalizable meme understanding. The code and data are available at https://github.com/PotatoDog1669/KID.
[50] Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation
Yimin Deng,Yuqing Fu,Derong Xu,Yejing Wang,Wei Ni,Jingtong Gao,Xiaopeng Li,Chengxu Liu,Xiao Han,Guoshuai Zhao,Xiangyu Zhao,Li Zhu,Xueming Qian
Main category: cs.CL
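TL;DR: This paper proposes an Adversarial Memory Adaptation mechanism (AMA) that simulates downstream task execution to bring task-aware supervision into the offline phase and adapt memory construction and update accordingly, improving conversational agents in long dialogues.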
Details
Motivation: Existing memory systems build and update memory in the offline phase in a fixed, task-independent way, which misaligns memory with downstream task requirements and hurts performance. Method: Proposes Adversarial Memory Adaptation (AMA): a challenger agent generates question-answer pairs, which are answered from the current memory to simulate inference; an evaluator agent assesses the responses and analyzes errors; an adapter agent then performs dual-level updates on both the construction strategy and the memory content. Result: AMA can be plugged into various existing memory systems, and experiments on the long-dialogue benchmark LoCoMo demonstrate its effectiveness on downstream tasks. Conclusion: By introducing a task-oriented adversarial paradigm in the offline phase, AMA aligns the memory system with concrete task objectives and addresses the performance bottleneck caused by task-independent memory. Abstract: Conversational agents struggle to handle long conversations due to context window limitations. Therefore, memory systems are developed to leverage essential historical information. Existing memory systems typically follow a pipeline of offline memory construction and update, and online retrieval. Despite the flexible online phase, the offline phase remains fixed and task-independent. In this phase, memory construction operates under a predefined workflow and fails to emphasize task-relevant information. Meanwhile, memory updates are guided by generic metrics rather than task-specific supervision. This leads to a misalignment between offline memory preparation and task requirements, which undermines downstream task performance. To this end, we propose an Adversarial Memory Adaptation mechanism (AMA) that aligns memory construction and update with task objectives by simulating task execution. Specifically, first, a challenger agent generates question-answer pairs based on the original dialogues. The constructed memory is then used to answer these questions, simulating downstream inference. Subsequently, an evaluator agent assesses the responses and performs error analysis. Finally, an adapter agent analyzes the error cases and performs dual-level updates on both the construction strategy and the content. Through this process, the memory system receives task-aware supervision signals in advance during the offline phase, enhancing its adaptability to downstream tasks. AMA can be integrated into various existing memory systems, and extensive experiments on the long-dialogue benchmark LoCoMo demonstrate its effectiveness.
[51] RAG-E: Quantifying Retriever-Generator Alignment and Failure Modes
Korbinian Randl,Guido Rocchietti,Aron Henriksson,Ziawasch Abedjan,Tony Lindgren,John Pavlopoulos
Main category: cs.CL
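TL;DR: This paper proposes the RAG-E framework for quantifying alignment between retriever and generator, using adapted attribution methods (Integrated Gradients and PMCSHAP) and the WARG metric to expose critical misalignments between components of RAG systems.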
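As a rough illustration of the kind of misalignment statistic reported for this entry, the sketch below counts how often the document with the highest generator attribution differs from the retriever's top-ranked document. It is not the WARG metric, whose definition is not given in this listing; the function name `top_doc_mismatch_rate` and the input layout are assumptions made for the example.

```python
from typing import Sequence


def top_doc_mismatch_rate(
    retriever_scores: Sequence[Sequence[float]],
    generator_attributions: Sequence[Sequence[float]],
) -> float:
    """Fraction of queries whose most-attributed document is not the retriever's top hit.

    Both arguments hold one list per query, aligned by document index
    (a hypothetical layout chosen for this sketch).
    """
    if len(retriever_scores) != len(generator_attributions):
        raise ValueError("inputs must cover the same queries")
    mismatches = 0
    for r, g in zip(retriever_scores, generator_attributions):
        top_retrieved = max(range(len(r)), key=r.__getitem__)
        top_attributed = max(range(len(g)), key=g.__getitem__)
        mismatches += top_retrieved != top_attributed
    return mismatches / len(retriever_scores)
```

The attribution scores would come from whatever attribution method is in use; here they are simply taken as given numbers.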
Details
Motivation: RAG systems deployed in high-stakes domains suffer from opaque component interactions, so an explainability framework is needed to audit how well the retriever and generator work together. Method: Proposes RAG-E, an end-to-end explainability framework that 1) analyzes the retriever with Integrated Gradients; 2) introduces PMCSHAP, a Monte Carlo-stabilized Shapley value approximation, for generator attribution; and 3) defines the Weighted Attribution-Relevance Gap (WARG) metric to measure how consistent the generator's actual use of retrieved documents is with the retriever's ranking. Result: Experiments on TREC CAsT and FoodSafeSum show that for 47.4% to 66.7% of queries the generator ignores the retriever's top-ranked document, and for 48.1% to 65.9% it relies on documents ranked as less relevant, confirming severe misalignment between components. Conclusion: The output quality of a RAG system depends not only on the performance of each component but, more importantly, on how well their interaction is aligned; RAG-E provides a quantifiable, auditable way to assess that alignment. Abstract: Retrieval-Augmented Generation (RAG) systems combine dense retrievers and language models to ground LLM outputs in retrieved documents. However, the opacity of how these components interact creates challenges for deployment in high-stakes domains. We present RAG-E, an end-to-end explainability framework that quantifies retriever-generator alignment through mathematically grounded attribution methods. Our approach adapts Integrated Gradients for retriever analysis, introduces PMCSHAP, a Monte Carlo-stabilized Shapley Value approximation, for generator attribution, and introduces the Weighted Attribution-Relevance Gap (WARG) metric to measure how well a generator's document usage aligns with a retriever's ranking. Empirical analysis on TREC CAsT and FoodSafeSum reveals critical misalignments: for 47.4% to 66.7% of queries, generators ignore the retriever's top-ranked documents, while 48.1% to 65.9% rely on documents ranked as less relevant. These failure modes demonstrate that RAG output quality depends not solely on individual component performance but on their interplay, which can be audited via RAG-E.
[52] Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning
Bodong Du,Xuanqi Huang,Xiaomeng Li
Main category: cs.CL
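TL;DR: This paper proposes DARE, a distribution-aware reward estimation method that makes unsupervised reward estimation in test-time reinforcement learning (TTRL) more robust and effective, significantly improving LLM performance on reasoning tasks.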
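To make the contrast in this entry concrete, the sketch below compares a majority-vote reward with a reward based on the empirical answer distribution over rollouts. It is a minimal illustration under stated assumptions, not the DARE formulation: using the answer's empirical share as the reward and the `min_share` pruning threshold are hypothetical choices, and the exploration bonus is omitted.

```python
from collections import Counter
from typing import List


def majority_vote_rewards(answers: List[str]) -> List[float]:
    """Reward 1.0 for rollouts matching the most common answer, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def distribution_rewards(answers: List[str], min_share: float = 0.0) -> List[float]:
    """Reward each rollout by the empirical share of its answer.

    Answers whose share falls below `min_share` get 0.0, a crude stand-in
    for pruning (hypothetical threshold, not taken from the paper).
    """
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n if counts[a] / n >= min_share else 0.0 for a in answers]
```

Under majority voting, every non-majority rollout receives zero reward even when it is a plausible candidate; the distribution-based variant keeps a graded signal for it unless it is pruned.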
Details
Motivation: Existing TTRL methods rely on majority voting (MV) to produce deterministic rewards, but the underlying assumption is fragile: MV ignores non-majority but correct actions, which systematically biases reward estimates. Method: Proposes Distribution-Aware Reward Estimation (DARE), which extends reward estimation from a single majority outcome to the full empirical rollout distribution and adds an exploration bonus and a distribution pruning mechanism to encourage exploration of non-majority rollouts and to denoise rewards. Result: On the reasoning benchmarks AIME 2024 and AMC, DARE achieves relative improvements of 25.3% and 5.3%, respectively, over baselines and improves optimization stability. Conclusion: Reward estimation based on the full rollout distribution rather than a single majority outcome is more robust and more informative; DARE provides a more reliable unsupervised learning signal for TTRL. Abstract: Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV reduces the rollout distribution into a single outcome, discarding information about non-majority but correct action candidates, and yields systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a distribution pruning mechanism for non-majority rollout exploration and reward denoising, yielding a more informative and robust reward estimation. Extensive experiments on challenging reasoning benchmarks show that DARE improves optimization stability and final performance over recent baselines, achieving relative improvements of 25.3% on challenging AIME 2024 and 5.3% on AMC.
[53] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models
Aadi Palnitkar,Mingyang Mao,Nicholas Waytowich,Vinicius G. Goecks,Tinoosh Mohsenin,Xiaomin Lin
Main category: cs.CL
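TL;DR: This paper presents MilSCORE, described by the authors as the first long-context, multimodal benchmark dataset for military planning scenarios, for evaluating large language models on complex geospatial reasoning and multi-source information integration.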
Details
Motivation: Existing long-context benchmarks do not capture realistic, high-stakes, multimodal, geospatially dense tasks such as planning large-scale military operations, which makes it hard to evaluate selective reading and reasoning across heterogeneous information sources. Method: Builds the expert-authored MilSCORE dataset of multi-hop questions grounded in a simulated military scenario, covering seven question categories (factual recall, constraint reasoning, strategic analysis, spatial analysis, and others), provides an accompanying evaluation protocol, and benchmarks a range of vision-language models. Result: Current mainstream vision-language models perform poorly on MilSCORE, leaving substantial headroom and confirming the benchmark's difficulty and practical relevance. Conclusion: MilSCORE fills a gap in long-context, multimodal, high-stakes, scenario-level reasoning benchmarks and offers a testbed for evaluating and improving large models on complex planning tasks. Abstract: As large language models (LLMs) are applied to increasingly long and complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs' ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.
[54] Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li,Ning Yan,Masood Mortazavi
Main category: cs.CL
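TL;DR: This paper proposes the GiG framework, which models environment states and execution traces with a Graph-in-Graph structure and graph neural networks and combines structure-aware retrieval with a symbolic bounded lookahead module, significantly improving embodied agents on long-horizon planning tasks.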
Details
Motivation: When used as embodied agents, existing LLMs struggle to keep their strategy coherent in long-horizon planning, either because of context limits or because they hallucinate transitions that violate environment constraints. Method: Proposes a Graph-in-Graph architecture: a GNN encodes environment states into embeddings, which are organized into action-connected execution trace graphs stored in an experience memory bank; clustering the graph embeddings enables structure-aware retrieval, and a bounded lookahead module based on symbolic transition logic performs grounded action projection. Result: On three benchmarks (Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld), Pass@1 improves by up to 22%, 37%, and 15%, respectively, at comparable or lower computational cost. Conclusion: By adding structured memory and symbolic logic, GiG mitigates the coherence and constraint-adherence problems of LLMs in embodied long-horizon planning and offers a new paradigm for embodied agent planning. Abstract: While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intent into actionable sub-goals while strictly adhering to the logic of a dynamic, observed environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitations or hallucinate transitions that violate constraints. We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within an experience memory bank. By clustering these graph embeddings, the framework enables retrieval of structure-aware priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a novel bounded lookahead module that leverages symbolic transition logic to enhance the agents' planning capabilities through grounded action projection. We evaluate our framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld with comparable or lower computational cost.
[55] Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text
Hongyi Zhou,Jin Zhu,Erhan Xu,Kai Ye,Ying Yang,Chengchun Shi
Main category: cs.CL
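TL;DR: This paper proposes a new rewrite-based algorithm for detecting LLM-generated text. It explains rewrite-based detection from a geometric perspective and adaptively learns the distance between the original and rewritten text, outperforming existing methods both in theory and in experiments.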
Details
Motivation: Large language models produce highly human-like text, raising concerns about misinformation and academic integrity, so reliable algorithms for detecting LLM-generated content are urgently needed. Method: Proposes a rewrite-based detection framework grounded in a geometric perspective, designs an algorithm that adaptively learns the distance between the original and rewritten text, and shows theoretically that it is more effective than a fixed-distance strategy. Result: Across more than 100 experimental settings, the method outperforms baselines in the majority of scenarios; relative to the strongest baseline, it achieves relative improvements of 57.8% to 80.6% across target LLMs such as GPT, Claude, and Gemini. Conclusion: Adaptive distance learning is key to improving rewrite-based LLM detection; the proposed method combines theoretical rigor with strong empirical results. Abstract: Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, creating an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 57.8% to 80.6% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini).
[56] SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching
Hong Chen,Xiang Liu,Bo Wang,Yuxuan Fan,Yuanlin Chu,Zongluo Li,Xiaowen Chu,Xuming Hu
Main category: cs.CL
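TL;DR: SONIC is a learning-based KV cache compression framework that compresses multi-turn dialogue history into semantically rich Nexus tokens, significantly improving inference efficiency while preserving dialogue coherence.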
Details
Motivation: Existing KV cache compression methods ignore the structural properties of multi-turn dialogue, rely on heuristic eviction that risks losing critical context, and adapt poorly to different memory constraints. Method: Proposes the SONIC framework, which compresses historical dialogue segments into compact, semantically rich Nexus tokens and introduces dynamic budget training to support flexible memory adaptation without retraining. Result: At compression ratios of 80% and 50%, SONIC consistently outperforms H2O and StreamingLLM on four multi-turn benchmarks; on MTBench101 it improves the average score by 35.55% over state-of-the-art baselines, and it accelerates overall inference by 50.1% compared to full-context generation. Conclusion: SONIC alleviates the bottleneck of linear KV cache growth in multi-turn LLM deployment, balancing high compression, context preservation, and inference acceleration. Abstract: The linear growth of Key-Value (KV) cache remains a bottleneck for multi-turn LLM deployment. Existing KV cache compression methods often fail to account for the structural properties of multi-turn dialogues, relying on heuristic eviction that risks losing critical context. We propose SONIC, a learning-based framework that compresses historical segments into compact and semantically rich Nexus tokens. By integrating dynamic budget training, SONIC allows flexible adaptation to varying memory constraints without retraining. Experiments show that at compression ratios of 80% and 50%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. Specifically, on the widely used MTBench101 benchmark, SONIC achieves an average score improvement of 35.55% over state-of-the-art baselines, validating its effectiveness in sustaining coherent multi-turn dialogues. Furthermore, SONIC enhances deployment efficiency, accelerating the overall inference process by 50.1% compared to full-context generation.
[57] From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes
Fariba Afrin Irany
Main category: cs.CL
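TL;DR: This paper proposes a GPT-based architecture for clinical text classification with a selective fine-tuning strategy that trains only the final Transformer block, the final layer normalization, and a lightweight classification head. Evaluated on radiology reports from MIMIC-IV-Note, it is efficient and robust across multi-label, binary, and outcome prediction tasks, and is particularly suited to sparsely annotated settings dominated by negated and non-mentioned findings.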
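The selective fine-tuning strategy in this entry amounts to choosing which parameters stay trainable. The sketch below expresses that choice as a filter over parameter names. It assumes Hugging Face-style GPT-2 naming (`transformer.h.<i>.`, `transformer.ln_f.`) and a classification head named `score.`; the paper's actual implementation is not shown in this listing, so the helper `trainable_parameter_names` is purely illustrative.

```python
from typing import Iterable, List


def trainable_parameter_names(names: Iterable[str], n_layers: int) -> List[str]:
    """Select the parameters left unfrozen under the selective strategy.

    Assumes Hugging Face-style GPT-2 names: blocks under "transformer.h.<i>.",
    the final layer norm under "transformer.ln_f.", and a classification
    head under "score." (the naming is an assumption of this sketch).
    """
    keep_prefixes = (
        f"transformer.h.{n_layers - 1}.",  # final Transformer block
        "transformer.ln_f.",               # final layer normalization
        "score.",                          # lightweight classification head
    )
    return [n for n in names if n.startswith(keep_prefixes)]
```

With a PyTorch model, one plausible use is to iterate over `model.named_parameters()` and set `requires_grad` to `True` only for the selected names, leaving the rest of the backbone frozen.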
Details
Motivation: Unstructured text in clinical electronic health records (EHRs) is rich in information, but modeling long, domain-specific clinical text is challenging because of limited labeled data, severe class imbalance, and the high computational cost of adapting large models. Method: Starting from GPT-2, a decoder-only Transformer, applies selective fine-tuning: most backbone parameters are frozen, and only the final Transformer block, the final LayerNorm, and a lightweight classification head are trained. Result: On MIMIC-IV-Note radiology reports (with uncertainty-aware CheXpert-style labels), the method shows stable, strong performance on multi-label classification, binary classification under different uncertainty assumptions, and disease outcome prediction, particularly where non-mentioned and negated findings dominate; it also substantially reduces the number of trainable parameters and the computational cost. Conclusion: Selective fine-tuning of pretrained generative language models is a viable path to efficient and effective clinical text classification, balancing model performance with scalable adaptation to real-world EHR data. Abstract: The increasing availability of unstructured clinical narratives in electronic health records (EHRs) has created new opportunities for automated disease characterization, cohort identification, and clinical decision support. However, modeling long, domain-specific clinical text remains challenging due to limited labeled data, severe class imbalance, and the high computational cost of adapting large pretrained language models. This study presents a GPT-based architecture for clinical text classification that adapts a pretrained decoder-only Transformer using a selective fine-tuning strategy. Rather than updating all model parameters, the majority of the GPT-2 backbone is frozen, and training is restricted to the final Transformer block, the final layer normalization, and a lightweight classification head. This approach substantially reduces the number of trainable parameters while preserving the representational capacity required to model complex clinical language. The proposed method is evaluated on radiology reports from the MIMIC-IV-Note dataset using uncertainty-aware CheXpert-style labels derived directly from report text. Experiments cover multiple problem formulations, including multi-label classification of radiographic findings, binary per-label classification under different uncertainty assumptions, and aggregate disease outcome prediction. Across varying dataset sizes, the model exhibits stable convergence behavior and strong classification performance, particularly in settings dominated by non-mention and negated findings. Overall, the results indicate that selective fine-tuning of pretrained generative language models provides an efficient and effective pathway for clinical text classification, enabling scalable adaptation to real-world EHR data while significantly reducing computational complexity.
[58] OVD: On-policy Verbal Distillation
Jing Xiong,Hui Shen,Shansan Gong,Yuxin Cheng,Jianghan Shen,Chaofan Tao,Haochen Tan,Haoli Bai,Lifeng Shang,Ngai Wong
Main category: cs.CL
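TL;DR: This paper proposes On-policy Verbal Distillation (OVD), a knowledge distillation framework that performs trajectory matching with discrete verbal scores (0-9) and needs no token-level alignment. It substantially reduces memory overhead, improves the student model's exploration and training efficiency, and substantially outperforms existing methods on Web question answering and mathematical reasoning.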
Details
Motivation: Existing token-level on-policy distillation methods are constrained by token-level alignment, which suppresses student exploration, makes environment feedback hard to use, and creates severe memory bottlenecks in reinforcement learning. Method: Proposes the OVD framework, which replaces token-level probability matching with discrete verbal scores (0-9) produced by the teacher model, achieving trajectory-level matching, avoiding token alignment, supporting on-policy distillation, and reducing memory use. Result: Average EM on Web Q&A tasks improves by up to +12.9% (absolute), and math benchmarks gain up to +25.7% when training with only one random sample, with better training efficiency. Conclusion: OVD is an efficient, low-memory, exploration-friendly paradigm for on-policy knowledge distillation, suited to complex reasoning tasks that require interactive feedback. Abstract: Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model's exploration ability, prevents effective use of interactive environment feedback, and suffers from severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0-9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to +12.9% absolute improvement in average EM on Web Q&A tasks and up to a +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io.
[59] Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding
Yifan Zhu,Huiqiang Rong,Haoran Luo
Main category: cs.CL
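TL;DR: This paper proposes Token-Guard, a token-level hallucination control method based on self-checking decoding. Through internal verification at each reasoning step, latent-space risk scoring, and iterative pruning and regeneration, it substantially reduces hallucination in large language models without large-scale fine-tuning or retrieval.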
Details
Motivation: Large language models often hallucinate, yet existing mitigations such as RAG and RLHF are resource-intensive, and decoding-based methods lack an explicit hallucination control mechanism. Method: Token-Guard uses self-checking decoding, performing internal verification at each token generation step; candidate fragments receive an explicit hallucination risk score in a latent space, and iterative pruning and regeneration dynamically correct detected errors. Result: Experiments on HALU datasets show that Token-Guard substantially lowers the hallucination rate and improves generation accuracy, while remaining scalable and modular. Conclusion: Token-Guard offers a lightweight, efficient approach to hallucination control that needs no additional training or retrieval and improves the reliability of LLM outputs. Abstract: Large Language Models (LLMs) often hallucinate, generating content inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) can mitigate hallucinations but require resource-intensive retrieval or large-scale fine-tuning. Decoding-based methods are lighter yet lack explicit hallucination control. To address this, we present Token-Guard, a token-level hallucination control method based on self-checking decoding. Token-Guard performs internal verification at each reasoning step to detect hallucinated tokens before they propagate. Candidate fragments are further evaluated in a latent space with explicit hallucination risk scoring, while iterative pruning and regeneration dynamically correct detected errors. Experiments on HALU datasets show Token-Guard substantially reduces hallucinations and improves generation accuracy, offering a scalable, modular solution for reliable LLM outputs. Our code is publicly available.
[60] Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
Jianhui Chen,Yuzhang Luo,Liangming Pan
Main category: cs.CL
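TL;DR: This paper proposes the Mechanistic Data Attribution (MDA) framework, which uses influence functions to trace interpretable units in LLMs back to their training data. Experiments show that specific structured data (e.g., LaTeX, XML) causally catalyzes the formation of interpretable heads such as induction heads and reveal a functional link between induction heads and in-context learning; the paper also proposes a data augmentation method that accelerates circuit convergence.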
Details
Motivation: Mechanistic interpretability has identified interpretable circuits in large language models, but the causal origins of these circuits in the training data remain unclear. Method: Proposes the Mechanistic Data Attribution (MDA) framework, which uses influence functions to trace interpretable units back to specific training samples, and runs intervention experiments on the Pythia family (removing or adding high-influence samples) to analyze the effect on interpretable heads and in-context learning. Result: Repetitive structured data (e.g., LaTeX, XML) acts as a mechanistic catalyst for the formation of interpretable heads; targeted interventions significantly modulate the emergence of induction heads and concurrently change the model's in-context learning ability; the proposed data augmentation pipeline consistently accelerates circuit convergence across model scales. Conclusion: MDA provides a scalable causal analysis tool for understanding the training origins of LLM internals, supports a functional causal link between induction heads and ICL, and offers a principled way to steer the developmental trajectory of large models. Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.
[61] When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications
Daniel Commey
Main category: cs.CL
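TL;DR: This paper proposes an evaluation-driven development workflow for LLM applications (Define-Test-Diagnose-Fix) and a tiered Minimum Viable Evaluation Suite (MVES) covering general LLM applications, RAG, and agentic tool use. Reproducible local experiments show that a generic prompt template can trade off performance across metrics, underscoring the need for evaluation-driven prompt iteration rather than reliance on generic templates.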
Details
Motivation: Traditional software testing copes poorly with the stochastic, high-dimensional nature of LLM outputs and their sensitivity to prompt and model changes, so a systematic, repeatable evaluation-driven engineering practice is needed. Method: Proposes a four-stage evaluation-driven workflow (Define, Test, Diagnose, Fix); builds the tiered MVES evaluation suite; combines automated checks, human rubrics, and LLM-as-judge evaluation and discusses the failure modes of LLM judges; runs controlled local experiments on Ollama with Llama 3 8B Instruct and Qwen 2.5 7B Instruct. Result: Replacing task-specific prompts with a generic "improved" template lowered Llama 3's extraction pass rate on the structured suites from 100% to 90% and RAG compliance from 93.3% to 80%, while instruction-following improved, showing that prompt changes involve behavioral trade-offs. Conclusion: LLM application development should be organized around an evaluation loop, avoiding blind use of generic prompt templates in favor of evaluation-driven prompt iteration and claim calibration tied to the task at hand; the proposed MVES and the released resources support reproducible, extensible evaluation practice. Abstract: Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow (Define, Test, Diagnose, Fix) that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic "improved" prompt template can trade off behaviors: on our small structured suites, extraction pass rate decreased from 100% to 90% and RAG compliance from 93.3% to 80% for Llama 3 when replacing task-specific prompts with generic rules, while instruction-following improved. These findings motivate evaluation-driven prompt iteration and careful claim calibration rather than universal prompt recipes. All test suites, harnesses, and results are included for reproducibility.
[62] Causal Autoregressive Diffusion Language Model
Junhao Ruan,Bei Li,Yongjing Yin,Pengcheng Huang,Xin Chen,Jingang Wang,Xunliang Cai,Tong Xiao,JingBo Zhu
Main category: cs.CL
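TL;DR: CARD is a new framework that combines the training efficiency of autoregressive models with the high-throughput inference of diffusion models. It uses a causal attention mask to obtain dense supervision in a single forward pass, introduces soft-tailed masking and context-aware reweighting to stabilize training, and supports confidence-based dynamic parallel decoding, outperforming existing discrete diffusion models in both performance and training speed.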
Details
Motivation: Autoregressive models suffer from high inference latency, while diffusion models are unstable and inefficient to train; the goal is a paradigm that offers both efficient training and high-throughput inference. Method: Proposes the Causal Autoregressive Diffusion (CARD) framework: reformulates the diffusion process under a strictly causal attention mask; introduces soft-tailed masking to preserve local context; designs a context-aware reweighting mechanism derived from signal-to-noise principles; and uses KV caching for dynamic parallel decoding. Result: CARD outperforms discrete diffusion baselines while reducing training latency by $3\times$ compared to block diffusion methods; it reaches ARM-level data efficiency while gaining the latency benefits of parallel generation. Conclusion: CARD offers a robust paradigm for next-generation efficient LLMs that balances training efficiency and inference throughput. Abstract: In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of ARMs with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by $3\times$ compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.
[63] Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models
Longxuan Yu,Yu Fu,Shaorong Zhang,Hui Liu,Mukund Varma T,Greg Ver Steeg,Yue Dong
Main category: cs.CL
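TL;DR: This paper argues that masked diffusion language models (MDLMs) can overcome the "premature commitment" problem that autoregressive (AR) models face when the output order conflicts with the reasoning order. MDLMs are robust to changes in output order (order robustness), a property the paper validates, together with its underlying mechanism, on several mathematical reasoning benchmarks.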
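For readers unfamiliar with the "relative drop" used to compare output orderings in this entry, the sketch below shows one common way to compute it: the accuracy lost when the answer must come first, divided by the accuracy under the usual reasoning-first ordering. The function name and this exact normalization are assumptions; the listing does not spell out the formula.

```python
def relative_drop(acc_reasoning_first: float, acc_answer_first: float) -> float:
    """Relative accuracy drop when the answer must precede the reasoning."""
    if acc_reasoning_first <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (acc_reasoning_first - acc_answer_first) / acc_reasoning_first
```

With invented accuracies of 0.80 (reasoning first) and 0.60 (answer first), the function returns about 0.25, i.e., a 25% relative drop.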
Details
Motivation: Autoregressive language models enforce left-to-right generation, which forces premature commitment when the answer must be output before the explanation and limits reasoning flexibility; in practice, tasks often require an output structure that differs from the natural reasoning order. Method: Introduces masked diffusion language models (MDLMs) as an alternative, using their parallel, iterative refinement of all tokens to decouple computation order from output structure; builds a new benchmark, ReasonOrderQA, with controlled difficulty and order-level evaluation; compares AR models and MDLMs under different prompt orderings and analyzes the token stabilization process. Result: On GSM8K, Math500, and ReasonOrderQA, when prompts require the answer first, AR accuracy drops by up to 67% in relative terms, while MDLMs drop by at most 14%; analysis shows that MDLMs stabilize simpler tokens (such as reasoning steps) earlier, so the reasoning settles before the answer is fixed; the paper also identifies boundary conditions under which this advantage fails. Conclusion: MDLMs exhibit order robustness, mitigating the performance loss caused by a mismatch between output structure and reasoning logic; the core mechanism is that the order in which tokens stabilize during diffusion correlates with their semantic complexity, but the advantage depends on specific conditions and is not universal. Abstract: Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to answers before generating intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable ($\leq$14% relative drop), a property we term "order robustness". Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), enabling reasoning tokens to stabilize before answer commitment. Finally, we identify failure conditions where this advantage weakens, outlining the limits required for order robustness.
[64] A Separable Architecture for Continuous Token Representation in Language Models
Reza T. Batley,Sourav Saha
Main category: cs.CL
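TL;DR: This paper proposes the Leviathan architecture, which replaces the traditional discrete lookup table with a continuous embedding generator, markedly improving parameter efficiency in small language models and outperforming a LLaMA-style architecture at the same parameter count.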
Details
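Motivation: In small language models (SLMs), embedding matrices take up most of the parameter budget, an allocation that is both counterintuitive and suboptimal, while existing scaling laws treat parameters as interchangeable and overlook the issue. Method: Proposes the Leviathan architecture, which replaces the discrete embedding lookup table of the standard Transformer with a continuous embedding generator, runs isoparametric comparisons on the Pile dataset, and estimates effective parameter capacity with an empirical power-law fit. Result: Leviathan consistently outperforms a LLaMA-style architecture under isoparametric settings; the power-law fit indicates an effective parameter capacity equivalent to a dense model with 1.47 to 2.11 times as many parameters. Conclusion: Embedding parameters should not simply be treated as interchangeable; a continuous embedding generator can markedly improve the parameter efficiency and modeling capacity of small language models. Abstract: Transformer scaling law analyses typically treat parameters as interchangeable, an abstraction that accurately predicts loss-compute relationships. Yet, in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture that replaces the discrete lookup tables of canonical models with a continuous embedding generator. Evaluating on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard, LLaMA-style architecture. By means of an empirical power-law fit, Leviathan exhibits a markedly superior effective parameter capacity. Across the regime studied, Leviathan behaves as a dense model with $1.47\times$ to $2.11\times$ more parameters.
[65] On the Paradoxical Interference between Instruction-Following and Task Solving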
Yunjia Qi,Hao Peng,Xintong Shi,Amy Xin,Xiaozhi Wang,Bin Xu,Lei Hou,Juanzi Li
Main category: cs.CL
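TL;DR: This paper reveals the counterintuitive phenomenon that instruction following can interfere with the task-solving ability of large language models and proposes the SUSTAINSCORE metric to quantify it, by inserting self-evident constraints into instructions and measuring the resulting performance drop. Experiments show the phenomenon is widespread in mathematics, multi-hop QA, and code generation, and the paper analyzes its mechanism and the effect of alignment strategies.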
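The measurement idea in this entry can be sketched as follows: take the instances a model originally solves, add a constraint its correct output already satisfies, and see how many remain solved. The exact SUSTAINSCORE formula is not given in this listing, so the helper below, `sustain_rate`, is a hypothetical stand-in that reports the share of originally solved instances that stay solved.

```python
from typing import Sequence


def sustain_rate(solved_before: Sequence[bool], solved_after: Sequence[bool]) -> float:
    """Share of originally solved instances that stay solved once a
    self-evident constraint is added to the instruction."""
    if len(solved_before) != len(solved_after):
        raise ValueError("inputs must be aligned per instance")
    # Only instances solved without the constraint are relevant.
    kept = [after for before, after in zip(solved_before, solved_after) if before]
    if not kept:
        raise ValueError("no originally solved instances to evaluate")
    return sum(kept) / len(kept)
```

A value of 1.0 would mean the added constraints caused no interference on the evaluated instances; lower values indicate that some previously solved tasks now fail.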
Details
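Motivation: Instruction following is meant to improve alignment with human intent, but the authors observe that it can harm task-solving ability; this counterintuitive phenomenon lacks a systematic measure and a mechanistic analysis. Method: Proposes the SUSTAINSCORE metric: insert into the original valid instruction a self-evident constraint that is extracted from the model's correct output and already satisfied by it, then measure the performance drop; tests mainstream LLMs on mathematics, multi-hop QA, and code generation; studies the interference mechanism through attention analysis and a summary of failure patterns; compares how different post-training paradigms affect the degree of interference. Result: Adding self-evident constraints substantially lowers performance across tasks, even for advanced models such as Claude-Sonnet-4.5; the interference generalizes across constraint types and model scales; failed cases allocate significantly more attention to the constraints; different alignment strategies (e.g., SFT, RLHF) affect SUSTAINSCORE differently. Conclusion: Instruction following may introduce unnecessary cognitive burden that weakens a model's intrinsic ability; SUSTAINSCORE offers a new lens on alignment side effects; current alignment methods do not sufficiently mitigate the interference, and more robust instruction-understanding mechanisms are needed. Abstract: Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed. However, we reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with LLMs' task-solving capability. We propose a metric, SUSTAINSCORE, to quantify the interference of instruction following with task solving. It measures task performance drop after inserting into the instruction a self-evident constraint, which is naturally met by the original successful model output and extracted from it. Experiments on current LLMs in mathematics, multi-hop QA, and code generation show that adding the self-evident constraints leads to substantial performance drops, even for advanced models such as Claude-Sonnet-4.5. We validate the generality of the interference across constraint types and scales. Furthermore, we identify common failure patterns, and by investigating the mechanisms of interference, we observe that failed cases allocate significantly more attention to constraints compared to successful ones. Finally, we use SUSTAINSCORE to conduct an initial investigation into how distinct post-training paradigms affect the interference, presenting empirical observations on current alignment strategies. We will release our code and data to facilitate further research.
[66] MasalBench: A Benchmark for Contextual and Cross-Cultural Understanding of Persian Proverbs in LLMs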
Ghazal Kalhor,Behnam Bahrak
Main category: cs.CL
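TL;DR: This paper presents MasalBench, a benchmark for evaluating multilingual large language models (LLMs) on contextual and cross-cultural understanding of Persian proverbs. Existing models identify Persian proverbs in context well (accuracy above 0.90) but drop considerably when matching them to equivalent English proverbs (0.79 for the best model), revealing limitations in cultural knowledge and analogical reasoning.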
Details
Motivation: Prior work focuses on figurative language understanding in high-resource languages, while the cross-cultural and contextual understanding of LLMs in low-resource languages such as Persian has not been systematically evaluated. Method: Builds the MasalBench benchmark, covering in-context identification of Persian proverbs and cross-lingual (English-Persian) matching of equivalent proverbs, and evaluates eight state-of-the-art LLMs on it. Result: Models exceed 0.90 accuracy on identifying Persian proverbs in context, but the best model reaches only 0.79 on identifying equivalent English proverbs, indicating weak cross-cultural analogical reasoning. Conclusion: Current multilingual LLMs have clear shortcomings in cultural and semantic understanding for low-resource languages; MasalBench provides an extensible framework for assessing cross-cultural ability in other low-resource languages. Abstract: In recent years, multilingual Large Language Models (LLMs) have become an inseparable part of daily life, making it crucial for them to master the rules of conversational language in order to communicate effectively with users. While previous work has evaluated LLMs' understanding of figurative language in high-resource languages, their performance in low-resource languages remains underexplored. In this paper, we introduce MasalBench, a comprehensive benchmark for assessing LLMs' contextual and cross-cultural understanding of Persian proverbs, which are a key component of conversation in this low-resource language. We evaluate eight state-of-the-art LLMs on MasalBench and find that they perform well in identifying Persian proverbs in context, achieving accuracies above 0.90. However, their performance drops considerably when tasked with identifying equivalent English proverbs, with the best model achieving 0.79 accuracy. Our findings highlight the limitations of current LLMs in cultural knowledge and analogical reasoning, and they provide a framework for assessing cross-cultural understanding in other low-resource languages. MasalBench is available at https://github.com/kalhorghazal/MasalBench.
[67] $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA
Yaxin Du,Junru Song,Yifan Zhou,Cheng Wang,Jiahao Gu,Zimeng Chen,Menglan Chen,Wen Yao,Yang Yang,Ying Wen,Siheng Chen
Main category: cs.CL
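TL;DR: This paper proposes $G^2$-Reader, a retrieval-augmented generation system built on a dual-graph structure to improve question answering over long multimodal documents: a Content Graph preserves the document's native structure and cross-modal alignment, and a Planning Graph, a directed acyclic graph, dynamically tracks sub-questions and evidence progress. It significantly outperforms existing baselines.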
Details
Motivation: Existing retrieval-augmented generation methods have two major weaknesses on long multimodal document QA: flat chunking breaks the document's native structure and cross-modal alignment, and iterative retrieval lacks a global search state, so it tends to loop locally or drift into irrelevant regions. Method: Proposes the dual-graph system $G^2$-Reader: 1) a Content Graph models the native structure of, and semantic relations among, text, tables, and images in the document; 2) a Planning Graph, an agent-driven directed acyclic graph, dynamically generates and updates sub-questions, records intermediate findings, and guides stepwise evidence collection. Result: On VisDoMBench across five multimodal domains, $G^2$-Reader (with Qwen3-VL-32B-Instruct) reaches 66.21% average accuracy, well above strong baselines and a standalone GPT-5 (53.08%). Conclusion: Joint dual-graph modeling, combining structural fidelity with goal-directed navigation, is an effective paradigm for more robust complex QA over long multimodal documents. Abstract: Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).
[68] VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning
Yibo Wang,Yongcheng Jing,Shunyu Liu,Hao Guan,Rong-cheng Tu,Chengyu Wang,Jun Huang,Dacheng Tao
Main category: cs.CL
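TL;DR: This paper proposes VTC-R1, a new paradigm that renders intermediate reasoning into images and feeds them back to a vision-language model as "optical memory", significantly improving the efficiency of long-context reasoning while maintaining performance.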
Details
Motivation: Existing long-context reasoning methods are computationally expensive, depend on additional training or external models, and tend to lose fine-grained information, so a more efficient and scalable approach is needed. Method: VTC-R1 renders textual reasoning segments into compact images that are iteratively fed back as visual input to vision-language models (such as Glyph and Qwen3-VL). The training set is built from OpenR1-Math-220K, achieves 3.4x token compression, and is used to fine-tune the models. Result: VTC-R1 outperforms standard long-context reasoning on MATH500, AIME25, AMC23, and GPQA-D, with a 2.7x speedup in end-to-end latency. Conclusion: VTC-R1 is an efficient long-context reasoning paradigm that avoids complex training, retains key information, and scales well, offering a practical path for reasoning-intensive applications. Abstract: Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to the computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as "optical memory." We construct a training dataset based on OpenR1-Math-220K achieving 3.4x token compression and fine-tune representative VLMs (Glyph and Qwen3-VL). Extensive experiments on benchmarks such as MATH500, AIME25, AMC23 and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.
[69] ECO: Quantized Training without Full-Precision Master Weights
Mahdi Nikdan,Amir Zandieh,Dan Alistarh,Vahab Mirrokni
Main category: cs.CL
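TL;DR: This paper proposes the Error-Compensating Optimizer (ECO), which uses an error-compensation mechanism to eliminate the high-precision master weights normally required in LLM training. This enables fully quantized training that significantly reduces memory overhead while preserving model performance.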
Details
Motivation: Existing quantized training methods still need a high-precision master-weight buffer, which causes large memory overhead for big models such as SMoE; this bottleneck needs to be removed. Method: ECO applies gradient updates directly to quantized parameters and injects the error produced by each weight quantization into the optimizer momentum, forming an error-feedback loop with no additional memory. Result: Theory shows that ECO converges to a neighborhood of the optimum under a decaying learning rate. Experiments show that, when pretraining and fine-tuning several Transformer and MoE models with FP8/INT4 quantization, its accuracy matches baselines that keep master weights and significantly improves the memory-versus-loss Pareto frontier. Conclusion: ECO enables efficient fully quantized LLM training and provides a feasible, theoretically grounded approach to low-memory training of large sparse models. Abstract: Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as master weights. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.
[70] A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine
Anran Li,Yuanyuan Chen,Wenjun Long,Yu Yin,Yan Hu,Hyunjae Kim,Weipeng Zhou,Yujia Zhou,Hongyi Peng,Yang Ren,Xuguang Ai,Zhenyue Qin,Ming Hu,Xiaoxiao Li,Han Yu,Yih-Chung Tham,Lucila Ohno-Machado,Hua Xu,Qingyu Chen
Main category: cs.CL
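TL;DR: This paper proposes Fed-MedLoRA and its enhanced variant Fed-MedLoRA+, a model-agnostic, parameter-efficient federated learning framework for adapting large language models (LLMs) to heterogeneous clinical data from multiple institutions. The framework significantly reduces communication and computation overhead, improves cross-site generalization, and is validated on clinical information extraction.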
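As a rough illustration of why exchanging only low-rank adapters is cheap, the sketch below counts adapter parameters and merges per-site adapters on a server. It assumes plain sample-size-weighted averaging in the style of FedAvg; the data-aware aggregation used by Fed-MedLoRA+ is not described in this entry and is not reproduced here. The `Adapter` layout and all function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: adapter matrix name -> matrix stored as nested lists.
Adapter = Dict[str, List[List[float]]]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


def fedavg_adapters(adapters: List[Adapter], num_samples: List[int]) -> Adapter:
    """Sample-size-weighted average of the adapter matrices sent by each site."""
    total = sum(num_samples)
    merged: Adapter = {}
    for name in adapters[0]:
        rows = len(adapters[0][name])
        cols = len(adapters[0][name][0])
        merged[name] = [
            [
                sum(n * site[name][i][j] for site, n in zip(adapters, num_samples)) / total
                for j in range(cols)
            ]
            for i in range(rows)
        ]
    return merged


# Two sites with 1 and 3 training samples; only the tiny adapter is exchanged.
site_a = {"layer0.A": [[1.0, 2.0]]}
site_b = {"layer0.A": [[3.0, 4.0]]}
merged = fedavg_adapters([site_a, site_b], [1, 3])
```

For a 4096 x 4096 weight matrix and rank 8, `lora_param_count` gives 65,536 values per round instead of 16,777,216 for the full matrix, a 256-fold reduction. Note that averaging the two LoRA factors separately is not the same as averaging their product, which is one reason more careful aggregation schemes exist.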
Details
Motivation: Most existing medical LLMs are trained on data from a single institution, which limits generalizability and safety. Conventional federated learning is hard to apply to LLMs with many parameters because of communication cost, and it assumes homogeneous data, whereas real clinical data are highly heterogeneous. Method: Fed-MedLoRA transmits only low-rank adapter (LoRA) parameters for federated updates. Fed-MedLoRA+ further adds an adaptive, data-aware aggregation strategy to handle cross-site heterogeneity. The framework is applied to clinical information extraction. Result: Evaluated on five patient cohorts, Fed-MedLoRA(+) outperforms baselines including BERT, LLaMA-3, DeepSeek-R1, and GPT-4o in in-domain testing, external validation, and low-resource new-site adaptation. Conclusion: Fed-MedLoRA(+) offers a feasible path to privacy-preserving, efficient collaborative training of medical LLMs, addressing the communication bottleneck and data heterogeneity and improving practicality and robustness for clinical deployment. Abstract: Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. To enable their use in clinical settings, LLMs are typically further adapted through continued pretraining or post-training using clinical data. However, most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems. Federated learning (FL) is a promising solution for enabling collaborative model development across healthcare institutions. Yet applying FL to LLMs in medicine remains fundamentally limited. First, conventional FL requires transmitting the full model during each communication round, which becomes impractical for multi-billion-parameter LLMs given the limited computational resources. Second, many FL algorithms implicitly assume data homogeneity, whereas real-world clinical data are highly heterogeneous across patients, diseases, and institutional practices. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications. Fed-MedLoRA transmits only low-rank adapter parameters, reducing communication and computation overhead, while Fed-MedLoRA+ further incorporates adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity. We apply the framework to clinical information extraction (IE), which transforms patient narratives into structured medical entities and relations. Accuracy was assessed across five patient cohorts through comparisons with BERT models and with LLaMA-3, DeepSeek-R1, and GPT-4o models. Evaluation settings included (1) in-domain training and testing, (2) external validation on independent cohorts, and (3) a low-resource new-site adaptation scenario using real-world clinical notes from the Yale New Haven Health System.
[71] Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers
Xin Chen,Feng Jiang,Yiqian Zhang,Hardy Chen,Shuo Yan,Wenya Xie,Min Yang,Shujian Huang
Main category: cs.CL
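TL;DR: This paper proposes Proactive Interactive Reasoning (PIR), a new reasoning paradigm for large language models in which the model proactively asks the user to clarify uncertain premises and intent during reasoning, overcoming the "blind self-thinking" limitation of conventional chain-of-thought (CoT). PIR combines uncertainty-aware supervised fine-tuning with user-simulator-based policy optimization. On mathematical reasoning, code generation, and document editing it significantly improves accuracy, pass rate, and BLEU while reducing computation and redundant interaction.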
Details
Motivation: Existing reasoning LLMs rely on Chain-of-Thought prompting but suffer from "blind self-thinking": they carry out extensive internal reasoning even when key information is missing or ambiguous, which is inefficient and error-prone. This work targets uncertainty at the premise and intent level, not only at the knowledge level, and emphasizes human-model interaction for more reliable and efficient reasoning. Method: The Proactive Interactive Reasoning (PIR) paradigm has two core components: (1) uncertainty-aware supervised fine-tuning, which gives the model interactive reasoning ability; (2) a user-simulator-based policy optimization framework that uses a composite reward to align with user intent. Result: On mathematical reasoning, code generation, and document editing, PIR improves over strong baselines by up to 32.70% in accuracy, 22.90% in pass rate, and 41.36 BLEU. Reasoning computation is reduced by nearly half and redundant interaction turns drop markedly. PIR also generalizes well and remains robust on factual knowledge, question answering, and missing-premise scenarios. Conclusion: PIR turns LLMs from passive solvers into interactive reasoners that proactively clarify uncertainty, providing a new paradigm and a practical technical path toward more reliable, efficient, human-centered AI reasoning systems. Abstract: Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1
[72] FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale
Ajay Patel,Colin Raffel,Chris Callison-Burch
Main category: cs.CL
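TL;DR: This paper proposes FineInstructions, which uses internet-scale pre-training documents to generate billions of synthetic instruction-answer pairs. These pairs are used to pre-train large language models from scratch with only the instruction-tuning objective, significantly improving performance on downstream free-form generation tasks.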
Details
Motivation: Because supervised training data are limited, large language models are usually pre-trained with large-scale self-supervision and then post-trained on a small amount of instruction-tuning data. This work aims to overcome the scarcity of supervised data by exploring how to convert the knowledge in pre-training corpora into high-quality instruction-tuning data more effectively. Method: The authors propose a synthetic instruction data pipeline: about 18 million instruction templates are built from real user queries, then matched to and instantiated with unstructured pre-training text to produce the FineInstructions dataset. An LLM is then pre-trained from scratch directly with the instruction-tuning objective. Result: In token-controlled experiments, models pre-trained on FineInstructions outperform conventional self-supervised pre-training and other synthetic pre-training methods on standard benchmarks of free-form response quality. Conclusion: Systematically converting the knowledge in pre-training corpora into large-scale synthetic instruction data is feasible and effective, improves how well models respond to user prompts, and suggests a new direction for LLM training. Abstract: Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of "instruction-tuning" data comprised of supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at https://huggingface.co/fineinstructions.
[73] DynaWeb: Model-Based Reinforcement Learning of Web Agents
Hang Ding,Peidong Liu,Junqiao Wang,Ziwei Ji,Meng Cao,Rongzhao Zhang,Lynn Ai,Eric Yang,Tianyu Shi,Lei Yu
Main category: cs.CL
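TL;DR: This paper proposes DynaWeb, a model-based reinforcement learning framework that builds a web world model to simulate web interaction, enabling efficient training of autonomous web agents. It significantly improves the performance of existing open-source web agent models on the WebArena and WebVoyager benchmarks.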
Details
Motivation: Training autonomous web agents by interacting with the live internet is inefficient, costly, and risky. Method: The DynaWeb framework uses a web world model to predict naturalistic web page representations and mixes free policy rollouts with real expert trajectories during training to improve stability and sample efficiency. Result: On the WebArena and WebVoyager benchmarks, DynaWeb significantly improves the performance of state-of-the-art open-source web agent models. Conclusion: Training web agents through "imagination" (simulation based on a world model) is feasible and efficient, and it offers a scalable approach to online agentic reinforcement learning. Abstract: The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.
[74] Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
Yingfa Chen,Zhen Leng Thai,Zihan Zhou,Zhu Zhang,Xingyu Shen,Shuo Wang,Chaojun Xiao,Xu Han,Zhiyuan Liu
Main category: cs.CL
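TL;DR: This paper proposes HALO, a method that efficiently distills pre-trained Transformer models into HypeNet, an RNN-attention hybrid architecture. It needs only 2.3B tokens (less than 0.01% of the original pre-training data) and significantly improves long-context modeling and inference efficiency while preserving the original model's performance.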
Details
Motivation: Existing hybrid Transformer models are hard to adopt because they require large-scale pre-training from scratch. Existing parameter-transfer and knowledge-distillation methods depend on large amounts of data (more than 10B tokens) and yield poor long-context performance, so the inference advantage of hybrid models is not realized. Method: The authors propose the HALO (Hybrid Attention via Layer Optimization) distillation pipeline and design a new hybrid architecture, HypeNet, which includes a novel position encoding (HyPE) and several architectural modifications to improve length generalization. Result: The Qwen3 series is converted into HypeNet with only 2.3B distillation tokens. Performance is comparable to the original Transformers, with better results and higher efficiency on long-context tasks. Conclusion: HALO and HypeNet together provide a low-data, high-performance recipe for converting Transformers into hybrid models, easing the tension between performance and efficiency in long-context modeling. Abstract: Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
cs.CV [Back]
[75] MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading
Matteo Rossi
Main category: cs.CV
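TL;DR: This paper proposes a multi-attention lipreading network (MA-LipNet) that uses three attention mechanisms (channel, joint spatio-temporal, and separate spatio-temporal) to improve the discriminability and generalization of visual features for lipreading. It significantly reduces CER and WER on the CMLR and GRID datasets.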
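For readers unfamiliar with the two metrics named above, the following minimal sketch computes character error rate (CER) and word error rate (WER) using their standard definitions: Levenshtein edit distance divided by the reference length. It is not taken from the paper, and the example sentence is made up in the style of a GRID command.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)


reference = "bin blue at f two now"
hypothesis = "bin blue at f too now"
word_error = wer(reference, hypothesis)   # one wrong word out of six
char_error = cer("two", "too")            # one wrong character out of three
```

Both metrics can exceed 1.0 when the hypothesis contains many insertions, which is why they are reported as error rates and not as accuracies.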
Details
Motivation: Because articulatory movements are subtle, existing lipreading methods often suffer from weakly discriminative visual features and poor generalization. Method: MA-LipNet applies, in sequence, Channel Attention (CA), Joint Spatial-Temporal Attention (JSTA), and Separate Spatial-Temporal Attention (SSTA) modules, which purify visual features along the channel, coarse-grained spatio-temporal, and fine-grained spatio-temporal dimensions respectively. Result: On the CMLR and GRID datasets, MA-LipNet significantly reduces character error rate (CER) and word error rate (WER) and outperforms several state-of-the-art methods. Conclusion: Purifying features along multiple dimensions (temporal, spatial, and channel) is important for robust visual speech recognition, and MA-LipNet demonstrates the effectiveness of this approach. Abstract: Lipreading, the technology of decoding spoken content from silent videos of lip movements, holds significant application value in fields such as public security. However, due to the subtle nature of articulatory gestures, existing lipreading methods often suffer from limited feature discriminability and poor generalization capabilities. To address these challenges, this paper delves into the purification of visual features from temporal, spatial, and channel dimensions. We propose a novel method named Multi-Attention Lipreading Network (MA-LipNet). The core of MA-LipNet lies in its sequential application of three dedicated attention modules. Firstly, a Channel Attention (CA) module is employed to adaptively recalibrate channel-wise features, thereby mitigating interference from less informative channels. Subsequently, two spatio-temporal attention modules with distinct granularities, Joint Spatial-Temporal Attention (JSTA) and Separate Spatial-Temporal Attention (SSTA), are leveraged to suppress the influence of irrelevant pixels and video frames. The JSTA module performs a coarse-grained filtering by computing a unified weight map across the spatio-temporal dimensions, while the SSTA module conducts a more fine-grained refinement by separately modeling temporal and spatial attentions. Extensive experiments conducted on the CMLR and GRID datasets demonstrate that MA-LipNet significantly reduces the Character Error Rate (CER) and Word Error Rate (WER), validating its effectiveness and superiority over several state-of-the-art methods. Our work highlights the importance of multi-dimensional feature refinement for robust visual speech recognition.
[76] Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs
Haochen Zhang,Animesh Sinha,Felix Juefei-Xu,Haoyu Ma,Kunpeng Li,Zhipeng Fan,Meng Dong,Xiaoliang Dai,Tingbo Hou,Peizhao Zhang,Zecheng He
Main category: cs.CV
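TL;DR: This paper proposes a new framework for non-Markov multi-round conversational image generation. It constructs history-dependent data with rollback editing and name-based personalization, introduces history-conditioned training with token-level caching, and improves the DiT detokenizer and the multi-stage fine-tuning strategy, significantly improving multi-round consistency and instruction following.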
Details
Motivation: Most existing conversational image generation methods are Markov (they depend only on the latest image) and cannot handle non-Markov needs such as references to earlier states, undo requests, or entity references across rounds, so their understanding of history is weak. Method: (i) non-Markov multi-round data construction strategies, such as rollback-style editing and name-based cross-round personalization; (ii) a history-conditioned training and inference framework with token-level caching to prevent identity drift; (iii) a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum to improve high-fidelity reconstruction and editable personalization. Result: After explicit training for non-Markov interaction, the model improves markedly in multi-round consistency and instruction compliance while keeping strong single-round editing and personalization performance. Conclusion: Non-Markov modeling is key to making multi-round conversational image generation more robust and practical, and the proposed data, architecture, and training strategies form a systematic solution for this direction. Abstract: Conversational image generation requires a model to follow user instructions across multiple rounds of interaction, grounded in interleaved text and images that accumulate as chat history. While recent multimodal large language models (MLLMs) can generate and edit images, most existing multi-turn benchmarks and training recipes are effectively Markov: the next output depends primarily on the most recent image, enabling shortcut solutions that ignore long-range history. In this work we formalize and target the more challenging non-Markov setting, where a user may refer back to earlier states, undo changes, or reference entities introduced several rounds ago. We present (i) non-Markov multi-round data construction strategies, including rollback-style editing that forces retrieval of earlier visual states and name-based multi-round personalization that binds names to appearances across rounds; (ii) a history-conditioned training and inference framework with token-level caching to prevent multi-round identity drift; and (iii) enabling improvements for high-fidelity image reconstruction and editable personalization, including a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum. We demonstrate that explicitly training for non-Markov interactions yields substantial improvements in multi-round consistency and instruction compliance, while maintaining strong single-round editing and personalization.
[77] Text controllable PET denoising
Xuehua Ye,Hongxu Yang,Adam J. Schwarz
Main category: cs.CV
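TL;DR: This paper proposes a new text-guided method for PET image denoising. It combines features from a pretrained CLIP model with a U-Net to denoise images across multiple count levels with a single model, significantly improving qualitative and quantitative results and showing potential to shorten acquisition time.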
Details
Motivation: PET images are often degraded by complicated noise that affects diagnosis, and existing methods struggle to provide general denoising across different count levels. Method: A text-guided denoising method that fuses semantic features extracted by a pretrained CLIP model with a U-Net architecture for noise modeling and removal. Result: The method achieves significant qualitative (visual quality) and quantitative (for example PSNR and SSIM) improvements across count levels, and the model generalizes well and is flexible. Conclusion: The method offers a new approach to PET image denoising and may support more complex clinical denoising tasks or shorter scan times. Abstract: Positron Emission Tomography (PET) imaging is a vital tool in medical diagnostics, offering detailed insights into molecular processes within the human body. However, PET images often suffer from complicated noise, which can obscure critical diagnostic information. The quality of the PET image is impacted by various factors including scanner hardware, image reconstruction, tracer properties, dose/count level, and acquisition time. In this study, we propose a novel text-guided denoising method capable of enhancing PET images across a wide range of count levels within a single model. The model combines features from a pretrained CLIP model with a U-Net based denoising model. Experimental results demonstrate that the proposed model leads to significant improvements in both qualitative and quantitative assessments. The flexibility of the model shows its potential for handling more complicated denoising demands or reducing the acquisition time.
[78] Low performing pixel correction in computed tomography with unrolled network and synthetic data training
Hongxu Yang,Levente Lippenszky,Edina Timko,Lehel Ferenczi,Gopal Avinash
Main category: cs.CV
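TL;DR: This paper proposes an unrolled dual-domain method trained on synthetic data. It exploits the intrinsic correlations of the CT forward operation to correct ring and streak artifacts caused by low performance pixels (LPP) without requiring real clinical data.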
Details
Motivation: Existing LPP correction methods depend on expensive real training data and model only the image domain or only the sinogram domain, ignoring the cross-domain correlations of the CT forward projection. Method: An unrolled dual-domain correction method based on synthetic data: synthetic CT data with LPP defects are generated from natural images, and the correlations between the sinogram and image domains are modeled jointly. Result: In experiments simulating 1-2% detector defects, the method clearly outperforms state-of-the-art approaches. It needs no real clinical data for training and adapts to different CT scanner settings. Conclusion: The method corrects LPP artifacts efficiently without real data collection, with potential for deployment in clinical software and generalization across devices. Abstract: Low performance pixels (LPP) in Computed Tomography (CT) detectors lead to ring and streak artifacts in the reconstructed images, making them clinically unusable. In recent years, several solutions have been proposed to correct LPP artifacts, either in the image domain or in the sinogram domain using supervised deep learning methods. However, these methods require dedicated datasets for training, which are expensive to collect. Moreover, existing approaches focus solely on either image-space or sinogram-space correction, ignoring the intrinsic correlations from the forward operation of the CT geometry. In this work, we propose an unrolled dual-domain method based on synthetic data to correct LPP artifacts. Specifically, the intrinsic correlations of LPP between the sinogram and image domains are leveraged through synthetic data generated from natural images, enabling the trained model to correct artifacts without requiring any real-world clinical data. In experiments simulating 1-2% detector defects near the isocenter, the proposed method outperformed the state-of-the-art approaches by a large margin. The results indicate that our solution can correct LPP artifacts without the cost of data collection for model training, and it is adaptable to different scanner settings for software-based applications.
[79] AI-based Prediction of Biochemical Recurrence from Biopsy and Prostatectomy Samples
Andrea Camilloni,Chiara Micoli,Nita Mulliqi,Erik Everett Palm,Thorgerdur Palsdottir,Kelvin Szolnoky,Xiaoyi Ji,Sol Erika Boman,Andrea Discacciati,Henrik Grönberg,Lars Egevad,Tobias Nordström,Kimmo Kartasalo,Martin Eklund
Main category: cs.CV
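TL;DR: This paper develops an AI-based model that uses diagnostic prostate biopsy slides to predict the risk of biochemical recurrence (BCR) after radical prostatectomy. Its generalization is validated on several external cohorts, and it improves prognostic stratification when added to clinical variables.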
Details
Motivation: Current tools for predicting biochemical recurrence (BCR) after radical prostatectomy are imprecise, even though BCR is a key indicator of aggressive prostate cancer and adverse outcomes. Method: An AI model is trained on whole-slide images of diagnostic prostate biopsies from the STHLM3 cohort (n = 676), using foundation models and attention-based multiple instance learning. Generalization is evaluated on three external radical prostatectomy cohorts (LEOPARD, CHIMERA, and TCGA-PRAD), and the model is compared with guideline-based models such as CAPRA-S. Result: The image-based model achieves 5-year time-dependent AUCs of 0.64, 0.70, and 0.70 on the three external cohorts. Integrating clinical variables significantly improves risk stratification, and the AI model adds incremental predictive value over CAPRA-S. Conclusion: Histopathology AI models trained on biopsy slides can generalize across specimen types and support preoperative and postoperative decisions, but their added value over simpler predictive models should be assessed carefully in further studies. Abstract: Biochemical recurrence (BCR) after radical prostatectomy (RP) is a surrogate marker for aggressive prostate cancer with adverse outcomes, yet current prognostic tools remain imprecise. We trained an AI-based model on diagnostic prostate biopsy slides from the STHLM3 cohort (n = 676) to predict patient-specific risk of BCR, using foundation models and attention-based multiple instance learning. Generalizability was assessed across three external RP cohorts: LEOPARD (n = 508), CHIMERA (n = 95), and TCGA-PRAD (n = 379). The image-based approach achieved 5-year time-dependent AUCs of 0.64, 0.70, and 0.70, respectively. Integrating clinical variables added complementary prognostic value and enabled statistically significant risk stratification. Compared with guideline-based CAPRA-S, AI incrementally improved postoperative prognostication. These findings suggest biopsy-trained histopathology AI can generalize across specimen types to support preoperative and postoperative decision making, but the added value of AI-based multimodal approaches over simpler predictive models should be critically scrutinized in further studies.
[80] BadDet+: Robust Backdoor Attacks for Object Detection
Kealan Dunnett,Reza Arablouei,Dimity Miller,Volkan Dedeoglu,Raja Jurdak
Main category: cs.CV
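TL;DR: This paper proposes the BadDet+ framework, which uses a log-barrier penalty to unify Region Misclassification Attacks (RMA) and Object Disappearance Attacks (ODA). It makes backdoor attacks on object detection models more effective, with stronger robustness and transferability in the physical world.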
Details
Motivation: Existing backdoor attacks on object detection rely on unrealistic assumptions and lack physical validation, so their real-world threat is poorly understood. Method: The BadDet+ penalty framework uses a log-barrier penalty to suppress true-class predictions on triggered inputs, yielding position and scale invariance and better physical robustness. Theoretical analysis shows that the penalty acts within a trigger-specific feature subspace. Result: On real-world benchmarks, BadDet+ achieves better synthetic-to-physical transfer than existing RMA and ODA baselines while preserving performance on clean samples. The theoretical analysis supports the reliability of the attack and its compatibility with standard inference. Conclusion: Object detection models have serious backdoor vulnerabilities, and specialized defenses are needed. Abstract: Backdoor attacks pose a severe threat to deep learning, yet their impact on object detection remains poorly understood compared to image classification. While attacks have been proposed, we identify critical weaknesses in existing detection-based methods, specifically their reliance on unrealistic assumptions and a lack of physical validation. To bridge this gap, we introduce BadDet+, a penalty-based framework that unifies Region Misclassification Attacks (RMA) and Object Disappearance Attacks (ODA). The core mechanism utilizes a log-barrier penalty to suppress true-class predictions for triggered inputs, resulting in (i) position and scale invariance, and (ii) enhanced physical robustness. On real-world benchmarks, BadDet+ achieves superior synthetic-to-physical transfer compared to existing RMA and ODA baselines while preserving clean performance. Theoretical analysis confirms the proposed penalty acts within a trigger-specific feature subspace, reliably inducing attacks without degrading standard inference. These results highlight significant vulnerabilities in object detection and the necessity for specialized defenses.
[81] Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
Jiaqi Li,Guangming Wang,Shuntian Zheng,Minzhe Ni,Xiaoman Lu,Guanghui Ye,Yu Guan
Main category: cs.CV
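TL;DR: This paper proposes the ActionVLM framework, which uses debiasing reweighting and residual aggregation to reduce vision-language modality bias in temporal action localization. Vision remains the dominant signal and language acts as an auxiliary one, improving localization accuracy.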
Details
Motivation: Existing temporal action localization methods based on vision-language models rely too heavily on linguistic priors, causing a pronounced modality bias that weakens visual performance. Method: The ActionVLM framework has two parts: (i) a debiasing reweighting module that estimates the incremental gain of language over vision and dynamically adjusts the language weight; (ii) a residual aggregation strategy that treats language as a complementary refinement of visual predictions and not as the dominant signal. Result: On THUMOS14, the model improves mAP by up to 3.2% over state-of-the-art methods. Conclusion: An adaptive fusion scheme with vision as the primary signal and language as an auxiliary one effectively reduces modality bias, strengthens temporal reasoning, and improves TAL performance. Abstract: Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors at the expense of visual performance, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. To this end, we introduce (i) a debiasing reweighting module that estimates the language advantage (the incremental benefit of language over vision-only predictions) and dynamically reweights the language modality accordingly, and (ii) a residual aggregation strategy that treats language as a complementary refinement rather than the primary driver. This combination alleviates modality bias, reduces overconfidence from linguistic priors, and strengthens temporal reasoning. Experiments on THUMOS14 show that our model outperforms state-of-the-art by up to 3.2% mAP.
[82] Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought
Yu Huo,Siyu Zhang,Kun Zeng,Haoyue Liu,Owen Lee,Junlin Chen,Yuquan Lu,Yifu Guo,Yaodong Liang,Xiaoying Tang
Main category: cs.CV
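TL;DR: This paper proposes the Shape-of-Thought (SoT) framework, which uses a visual chain of thought to assemble 2D shapes progressively without an external engine. It makes text-to-image generation more robust under structural constraints such as counts, attribute binding, and part relations. Using the new SoT-26K dataset and the T2S-CompBench benchmark, it significantly outperforms text-only baselines.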
Details
Motivation: Existing text-to-image models are brittle under compositional structural constraints (such as generative numeracy, attribute binding, and part relations) and lack explicit modeling and process supervision of shape-assembly logic. Method: SoT is a visual chain-of-thought (CoT) framework that trains a unified multimodal autoregressive model to alternate between textual plans and rendered intermediate states. The authors also build the SoT-26K assembly-trace dataset and the T2S-CompBench evaluation benchmark. Result: The model reaches 88.4% on component numeracy and 84.8% on structural topology, about 20% above text-only baselines, supporting the effectiveness of SoT for structural integrity and trace faithfulness. Conclusion: SoT establishes an interpretable, process-supervised approach to compositional generation that achieves structurally controllable text-to-image generation without geometric representations or external engines. Abstract: Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework that enables progressive shape assembly via coherent 2D projections without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming text-only baselines by around 20%. SoT establishes a new paradigm for transparent, process-supervised compositional generation. The code is available at https://anonymous.4open.science/r/16FE/. The SoT-26K dataset will be released upon acceptance.
[83] An AI Framework for Microanastomosis Motion Assessment
Yan Meng,Eduardo J. Torres-Rodríguez,Marcelle Altshuler,Nishanth Gowda,Arhum Naeem,Recai Yilmaz,Omar Arnaout,Daniel A. Donoho
Main category: cs.CV
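TL;DR: This paper proposes an AI-based framework for automated assessment of instrument handling skill in microanastomosis. It integrates YOLO detection, DeepSORT tracking, instrument tip localization, and a supervised classification module, achieving objective and scalable assessment with high accuracy (97% detection precision and 96% mAP50-95).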
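The mAP50-95 figure quoted above averages detection quality over ten Intersection over Union (IoU) thresholds from 0.50 to 0.95. The sketch below shows only the IoU matching criterion behind that metric, with hypothetical box coordinates in `(x1, y1, x2, y2)` form; computing average precision itself also needs confidence-ranked predictions and is left out.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout

# The ten IoU thresholds behind mAP50-95: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [(50 + 5 * k) / 100 for k in range(10)]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def hits_per_threshold(pred: Box, truth: Box) -> List[bool]:
    """Whether a prediction counts as a true positive at each IoU threshold."""
    overlap = iou(pred, truth)
    return [overlap >= t for t in THRESHOLDS]


# A predicted instrument box that is slightly taller than the labeled box.
hits = hits_per_threshold((0, 0, 10, 10), (0, 0, 10, 8))
```

Here the two boxes overlap with IoU 0.8, so the prediction counts as correct at the seven thresholds up to 0.80 and as a miss at 0.85, 0.90, and 0.95. Averaging over thresholds in this way rewards tight localization and not only rough detection.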
Details
Motivation: Traditional assessment of microsurgical technique relies on subjective expert ratings, which suffer from large inter-rater variability, inconsistent criteria, susceptibility to cognitive bias, and long review times. An objective, reliable, automated assessment system is needed. Method: The AI framework has four core modules: a YOLO-based instrument detection module, a DeepSORT-based instrument tracking module, an instrument tip localization module based on shape descriptors, and a supervised classification module trained on expert-labeled data. Result: Experiments show strong performance: instrument detection precision reaches 97%, and mean Average Precision over IoU thresholds from 50% to 95% (mAP50-95) is 96%. Conclusion: The proposed AI framework enables automated, objective, and accurate assessment of instrument handling skill in microanastomosis, with potential for clinical translation and use at scale. Abstract: Proficiency in microanastomosis is a fundamental competency across multiple microsurgical disciplines. These procedures demand exceptional precision and refined technical skills, making effective, standardized assessment methods essential. Traditionally, the evaluation of microsurgical techniques has relied heavily on the subjective judgment of expert raters. Such evaluations are inherently constrained by limitations such as inter-rater variability, lack of standardized evaluation criteria, susceptibility to cognitive bias, and the time-intensive nature of manual review. These shortcomings underscore the urgent need for an objective, reliable, and automated system capable of assessing microsurgical performance with consistency and scalability. To bridge this gap, we propose a novel AI framework for the automated assessment of microanastomosis instrument handling skills. The system integrates four core components: (1) an instrument detection module based on the You Only Look Once (YOLO) architecture; (2) an instrument tracking module developed from Deep Simple Online and Realtime Tracking (DeepSORT); (3) an instrument tip localization module employing shape descriptors; and (4) a supervised classification module trained on expert-labeled data to evaluate instrument handling proficiency. Experimental results demonstrate the effectiveness of the framework, achieving an instrument detection precision of 97%, with a mean Average Precision (mAP) of 96%, measured by Intersection over Union (IoU) thresholds ranging from 50% to 95% (mAP50-95).
[84] Bidirectional Cross-Perception for Open-Vocabulary Semantic Segmentation in Remote Sensing Imagery
Jianzheng Wang,Huan Ni
Main category: cs.CV
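TL;DR: This paper proposes SDCI, a training-free open-vocabulary semantic segmentation framework. Through spatial-regularization-aware dual-branch collaborative inference, combining cross-model attention fusion, bidirectional graph diffusion refinement, and superpixel collaborative prediction, it markedly improves geometric localization and semantic prediction accuracy on high-resolution remote sensing imagery.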
Details
Motivation: High-resolution remote sensing imagery contains densely distributed land-cover objects with complex boundaries, which places higher demands on geometric localization and semantic prediction; existing training-free open-vocabulary segmentation methods rely on one-way injection and shallow post-processing and struggle to meet these demands. Method: The SDCI framework: 1) a cross-model attention fusion (CAF) module lets the two models guide each other during feature encoding; 2) a bidirectional cross-graph diffusion refinement (BCDR) module raises the confidence of the dual-branch segmentation through iterative random walks; 3) a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism incorporates low-level superpixel structure to refine boundaries. Result: Outperforms existing methods on multiple remote sensing semantic segmentation benchmarks; ablation studies confirm that superpixel structures remain effective within deep learning frameworks. Conclusion: Spatial regularization and multi-granularity structural modeling (such as superpixels) can strengthen the geometric and semantic consistency of training-free open-vocabulary remote sensing segmentation, offering a new direction for follow-up work. Abstract: High-resolution remote sensing imagery is characterized by densely distributed land-cover objects and complex boundaries, which places higher demands on both geometric localization and semantic prediction. Existing training-free open-vocabulary semantic segmentation (OVSS) methods typically fuse CLIP and vision foundation models (VFMs) using "one-way injection" and "shallow post-processing" strategies, making it difficult to satisfy these requirements. To address this issue, we propose a spatial-regularization-aware dual-branch collaborative inference framework for training-free OVSS, termed SDCI. First, during feature encoding, SDCI introduces a cross-model attention fusion (CAF) module, which guides collaborative inference by injecting self-attention maps into each other. Second, we propose a bidirectional cross-graph diffusion refinement (BCDR) module that enhances the reliability of dual-branch segmentation scores through iterative random-walk diffusion. Finally, we incorporate low-level superpixel structures and develop a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism to further refine object boundaries. Experiments on multiple remote sensing semantic segmentation benchmarks demonstrate that our method achieves better performance than existing approaches. Moreover, ablation studies further confirm that traditional object-based remote sensing image analysis methods leveraging superpixel structures remain effective within deep learning frameworks. Code: https://github.com/yu-ni1989/SDCI.
[85] Enhancing Underwater Light Field Images via Global Geometry-aware Diffusion Process
Yuji Lin,Qian Zhao,Zongsheng Yue,Junhui Hou,Deyu Meng
Main category: cs.CV
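TL;DR: This paper proposes GeoDiff-LF, a diffusion-based framework for 4-D light field imaging that improves underwater image quality. A geometry-guided network design, loss function, and sampling strategy effectively mitigate underwater color distortion.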
Details
Motivation: Acquiring high-quality images with underwater 4-D light field imaging is challenging, particularly because of color distortion. Method: GeoDiff-LF is built on SD-Turbo with three adaptations: (1) a modified U-Net with convolutional and attention adapters to model geometric cues; (2) a geometry-guided loss function based on tensor decomposition and progressive weighting; (3) an optimized sampling strategy with noise prediction. Result: Outperforms existing methods in both visual quality and quantitative metrics, advancing the state of the art in underwater imaging enhancement. Conclusion: Combining diffusion priors with light field geometry can substantially improve underwater 4-D light field image reconstruction; GeoDiff-LF offers a new approach and a practical framework for this direction. Abstract: This work studies the challenging problem of acquiring high-quality underwater images via 4-D light field (LF) imaging. To this end, we propose GeoDiff-LF, a novel diffusion-based framework built upon SD-Turbo to enhance underwater 4-D LF imaging by leveraging its spatial-angular structure. GeoDiff-LF consists of three key adaptations: (1) a modified U-Net architecture with convolutional and attention adapters to model geometric cues, (2) a geometry-guided loss function using tensor decomposition and progressive weighting to regularize global structure, and (3) an optimized sampling strategy with noise prediction to improve efficiency. By integrating diffusion priors and LF geometry, GeoDiff-LF effectively mitigates color distortion in underwater scenes. Extensive experiments demonstrate that our framework outperforms existing methods across both visual fidelity and quantitative performance, advancing the state-of-the-art in enhancing underwater imaging. The code will be publicly available at https://github.com/linlos1234/GeoDiff-LF.
[86] FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models
Chenyu Huang,Peng Ye,Xudong Tan,Jinhan Mu,Shenghe Zheng,Li Shen,Tao Chen
Main category: cs.CV
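TL;DR: This paper proposes FRISM, a framework that injects fine-grained reasoning ability through subspace-level model merging. Combined with a label-free self-distillation learning strategy, it substantially improves the reasoning performance of vision-language models without harming their visual capabilities.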
Details
Motivation: Existing methods for merging Large Reasoning Models (LRMs) with Vision-Language Models (VLMs) typically merge at a coarse layer level, which creates a trade-off between enhancing reasoning and preserving visual capabilities. Method: FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging) decomposes LRM task vectors with Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficient of each subspace; a label-free self-distillation learning strategy with dual-objective optimization is trained on common vision-language perception datasets. Result: Consistently achieves state-of-the-art performance on multiple visual reasoning benchmarks, confirming that FRISM improves reasoning without harming the original visual capabilities. Conclusion: Subspace-level merging is a finer-grained and more effective way to inject reasoning ability, and FRISM offers a new paradigm for jointly enhancing VLMs and LRMs. Abstract: Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that reasoning capabilities are encoded in distinct subspaces, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with a dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model's original visual capabilities by consistently achieving state-of-the-art performance across diverse visual reasoning benchmarks.
[87] Generative Recall, Dense Reranking: Learning Multi-View Semantic IDs for Efficient Text-to-Video Retrieval
Zecheng Zhao,Zhi Chen,Zi Huang,Shazia Sadiq,Tong Chen
Main category: cs.CV
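TL;DR: This paper proposes GRDR, which improves generative recall through multi-view semantic ID assignment and joint training, maintaining high accuracy while greatly reducing storage and retrieval latency.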
Details
Motivation: In existing two-stage text-to-video retrieval, the recall model is limited by semantic ambiguity and cross-modal misalignment; generative retrieval is efficient, but a single-ID representation cannot cover the multiple meanings of a video and lacks text supervision. Method: Generative Recall and Dense Reranking (GRDR): a query-guided multi-view tokenizer assigns multiple semantic IDs to each video, and the tokenizer and generative retriever are trained jointly through a shared codebook so that semantic IDs act as a semantic bridge between text and video; at inference, trie-constrained decoding produces a compact candidate set that a dense reranker then matches at fine granularity. Result: On TVR benchmarks, GRDR matches the accuracy of strong dense retrievers, reduces index storage by an order of magnitude, and speeds up full-corpus retrieval by up to 300 times. Conclusion: GRDR effectively alleviates semantic ambiguity and cross-modal misalignment in generative recall, yielding efficient, accurate, and scalable two-stage text-to-video retrieval. Abstract: Text-to-Video Retrieval (TVR) is essential in video platforms. Dense retrieval with dual-modality encoders leads in accuracy, but its computation and storage scale poorly with corpus size. Thus, real-time large-scale applications adopt two-stage retrieval, where a fast recall model gathers a small candidate pool, which is reranked by an advanced dense retriever. Because the candidate pool is so much smaller, the reranking model can use any off-the-shelf dense retriever without hurting efficiency, meaning the recall model bounds two-stage TVR performance. Recently, generative retrieval (GR) replaces dense video embeddings with discrete semantic IDs and retrieves by decoding text queries into ID tokens. GR offers near-constant inference and storage complexity, and its semantic IDs capture high-level video features via quantization, making it ideal for quickly eliminating irrelevant candidates during recall. However, as a recall model in two-stage TVR, GR suffers from (i) semantic ambiguity, where each video satisfies diverse queries but is forced into one semantic ID; and (ii) cross-modal misalignment, as semantic IDs are solely derived from visual features without text supervision. We propose Generative Recall and Dense Reranking (GRDR), designing a novel GR method to uplift recalled candidate quality. GRDR assigns multiple semantic IDs to each video using a query-guided multi-view tokenizer exposing diverse semantic access paths, and jointly trains the tokenizer and generative retriever via a shared codebook to cast semantic IDs as the semantic bridge between texts and videos. 
At inference, trie-constrained decoding generates a compact candidate set reranked by a dense model for fine-grained matching. Experiments on TVR benchmarks show GRDR matches strong dense retrievers in accuracy while reducing index storage by an order of magnitude and accelerating up to 300$\times$ in full-corpus retrieval.
[88] Thinker: A vision-language foundation model for embodied intelligence
Baiyu Pan,Daqin Luo,Junpeng Yang,Jiyuan Wang,Yixuan Zhang,Hailin Shi,Jichao Jiao
Main category: cs.CV
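TL;DR: This paper proposes Thinker, a model designed for embodied intelligence. By building a large-scale dataset for robotic perception and reasoning and by feeding key frames together with full video sequences as input, it substantially improves video understanding and reaches state-of-the-art results on task planning benchmarks.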
Details
Motivation: When applied to robotics, large vision-language models make errors that humans find easy to avoid, such as confusing third-person and first-person perspectives and overlooking information at the end of a video. Method: 1) Build a large-scale dataset for robotic perception and reasoning that covers ego-view videos, visual grounding, spatial understanding, and chain-of-thought data; 2) propose a simple yet effective method that takes key frames and full video sequences jointly as input to strengthen video understanding. Result: State-of-the-art performance on two of the most commonly used task planning benchmark datasets. Conclusion: Thinker is a large vision-language foundation model designed for embodied intelligence that mitigates perspective confusion and temporal reasoning deficiencies and substantially improves robotic task planning. Abstract: When large vision-language models are applied to the field of robotics, they encounter problems that are simple for humans yet error-prone for models. Such issues include confusion between third-person and first-person perspectives and a tendency to overlook information in video endings during temporal reasoning. To address these challenges, we propose Thinker, a large vision-language foundation model designed for embodied intelligence. We tackle the aforementioned issues from two perspectives. Firstly, we construct a large-scale dataset tailored for robotic perception and reasoning, encompassing ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. Secondly, we introduce a simple yet effective approach that substantially enhances the model's capacity for video comprehension by jointly incorporating key frames and full video sequences as inputs. Our model achieves state-of-the-art results on two of the most commonly used benchmark datasets in the field of task planning.
[89] LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models
Alvi Md Ishmam,Najibul Haque Sarker,Zaber Ibn Abdul Hakim,Chris Thomas
Main category: cs.CV
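TL;DR: This paper proposes LAMP, a black-box universal adversarial perturbation (UAP) attack on multi-image multimodal large language models (MLLMs). Using an attention-based constraint, a cross-image contagious constraint, and an index-attention suppression loss, it achieves an efficient, robust, position-invariant attack.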
Details
Motivation: Existing adversarial attacks mainly target single-image settings and assume white-box access, which does not suit practical multi-image MLLMs; the vulnerabilities of multi-image MLLMs remain unexplored. Method: LAMP comprises: 1) an attention-based constraint that hinders aggregation of information across images; 2) a cross-image contagious constraint that makes perturbed tokens influence clean tokens; 3) an index-attention suppression loss that yields a position-invariant attack. Result: LAMP outperforms SOTA baselines across multiple vision-language tasks and models and achieves the highest attack success rates. Conclusion: LAMP is the first black-box universal adversarial attack method for multi-image MLLMs; it is effective, robust, and generalizable, and it exposes a new security risk for multi-image multimodal models. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance across vision-language tasks. Recent advancements allow these models to process multiple images as inputs. However, the vulnerabilities of multi-image MLLMs remain unexplored. Existing adversarial attacks focus on single-image settings and often assume a white-box threat model, which is impractical in many real-world scenarios. This paper introduces LAMP, a black-box method for learning Universal Adversarial Perturbations (UAPs) targeting multi-image MLLMs. LAMP applies an attention-based constraint that prevents the model from effectively aggregating information across images. LAMP also introduces a novel cross-image contagious constraint that forces perturbed tokens to influence clean tokens, spreading adversarial effects without requiring all inputs to be modified. Additionally, an index-attention suppression loss enables a robust position-invariant attack. Experimental results show that LAMP outperforms SOTA baselines and achieves the highest attack success rates across multiple vision-language tasks and models.
[90] PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models
Xuewen Liu,Zhikai Li,Jing Zhang,Mengjuan Chen,Qingyi Gu
Main category: cs.CV
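TL;DR: This paper proposes PTQ4ARVG, a training-free post-training quantization framework that addresses three challenges in quantizing autoregressive visual generation (ARVG) models: channel-wise outliers, token-wise dynamic activations, and sample-wise distribution mismatch. It supports efficient 8-bit and 6-bit quantization.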
Details
Motivation: Autoregressive visual generation (ARVG) models have a language-model-compatible architecture and performance comparable to diffusion models, yet their quantization has received little study, and existing methods generalize poorly to them. Method: The PTQ4ARVG framework has three parts: (1) Gain-Projected Scaling (GPS) expands the quantization loss with a Taylor series and derives the optimal scaling factor by differentiation, mitigating channel-wise outliers; (2) Static Token-Wise Quantization (STWQ) exploits ARVG's fixed token length and position-invariant distribution to remove the overhead of token-wise dynamic calibration; (3) Distribution-Guided Calibration (DGC) selects the samples that contribute most to distributional entropy for calibration, resolving sample-wise distribution mismatch. Result: Achieves efficient 8-bit and 6-bit quantization of ARVG family models while maintaining competitive performance. Conclusion: PTQ4ARVG is a general, training-free, high-performing quantization scheme for ARVG models that offers a new approach to making visual generation models lightweight. Abstract: AutoRegressive Visual Generation (ARVG) models retain an architecture compatible with language models, while achieving performance comparable to diffusion-based models. Quantization is commonly employed in neural networks to reduce model size and computational latency. However, applying quantization to ARVG remains largely underexplored, and existing quantization methods fail to generalize effectively to ARVG models. In this paper, we explore this issue and identify three key challenges: (1) severe outliers at channel-wise level, (2) highly dynamic activations at token-wise level, and (3) mismatched distribution information at sample-wise level. To this end, we propose PTQ4ARVG, a training-free post-training quantization (PTQ) framework consisting of: (1) Gain-Projected Scaling (GPS), which mitigates channel-wise outliers by expanding the quantization loss via a Taylor series to quantify the gain of scaling for activation-weight quantization and deriving the optimal scaling factor through differentiation; (2) Static Token-Wise Quantization (STWQ), which leverages the inherent properties of ARVG, fixed token length and position-invariant distribution across samples, to address token-wise variance without incurring dynamic calibration overhead; and (3) Distribution-Guided Calibration (DGC), which selects samples that contribute most to distributional entropy, eliminating the sample-wise distribution mismatch. Extensive experiments show that PTQ4ARVG can effectively quantize the ARVG family models to 8-bit and 6-bit while maintaining competitive performance. 
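The channel-wise outlier problem named in the abstract above can be illustrated with a small, generic sketch. It is not the paper's GPS method and is not taken from the authors' repository: it uses plain symmetric uniform quantization and a simple square-root balancing heuristic (both assumptions chosen for illustration), and the toy values and function names are hypothetical. It shows that moving magnitude from an outlier activation channel into its weight leaves the exact product unchanged while shrinking the 6-bit quantization error.

```python
import math

def quantize(values, bits):
    """Symmetric uniform quantization with a single scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    step = max_abs / qmax
    return [round(v / step) * step for v in values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def balance(acts, weights):
    """Per-channel rescaling x_c / s_c and w_c * s_c (assumes nonzero entries).

    The product x_c * w_c is unchanged, but both factors end up with the
    same magnitude, so no single channel dominates the quantization range.
    """
    scales = [math.sqrt(abs(a) / abs(w)) for a, w in zip(acts, weights)]
    scaled_acts = [a / s for a, s in zip(acts, scales)]
    scaled_weights = [w * s for w, s in zip(weights, scales)]
    return scaled_acts, scaled_weights

# Hypothetical toy data: the last activation channel is an outlier.
acts = [0.5, -0.3, 0.8, 40.0]
weights = [0.2, -0.4, 0.1, 0.05]
exact = dot(acts, weights)

def output_error(a, w, bits):
    """Absolute error of the quantized dot product against the exact one."""
    return abs(dot(quantize(a, bits), quantize(w, bits)) - exact)

plain_error = output_error(acts, weights, 6)
balanced_acts, balanced_weights = balance(acts, weights)
balanced_error = output_error(balanced_acts, balanced_weights, 6)
print(plain_error, balanced_error)
```

In this toy case the outlier forces a coarse step size on every activation channel, so the small channels round to zero or to badly wrong values; after balancing, all channels share a much finer step.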
Code is available at http://github.com/BienLuky/PTQ4ARVG.
[91] NFCDS: A Plug-and-Play Noise Frequency-Controlled Diffusion Sampling Strategy for Image Restoration
Zhen Wang,Hongyi Liu,Jianing Li,Zhihui Wei
Main category: cs.CV
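TL;DR: This paper proposes Noise Frequency-Controlled Diffusion Sampling (NFCDS), which designs a Fourier-domain filter to control the frequency content of noise during reverse diffusion: low-frequency noise is suppressed to improve fidelity, and high-frequency noise is preserved to enhance detail. This improves the fidelity-perception balance of Plug-and-Play (PnP) diffusion models without additional training.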
Details
Motivation: Existing diffusion sampling-based Plug-and-Play (PnP) methods generate images with high perceptual quality, but the noise introduced during reverse diffusion lowers data fidelity; the trade-off between fidelity and perceptual quality needs to be eased. Method: Noise Frequency-Controlled Diffusion Sampling (NFCDS) designs a progressive Fourier-domain filter that suppresses low-frequency noise (which causes blur) and preserves high-frequency noise (which drives detail generation), embedding a data-consistency prior directly into the sampling process. Result: As a PnP module, NFCDS integrates seamlessly into existing diffusion-based restoration frameworks and markedly improves the fidelity-perception balance across a range of zero-shot tasks, with no additional training and faster convergence. Conclusion: Noise frequency is a key lens for understanding and resolving the fidelity-perception trade-off in diffusion PnP; NFCDS achieves high-quality, high-fidelity zero-shot image restoration through spectral control. Abstract: Diffusion sampling-based Plug-and-Play (PnP) methods produce images with high perceptual quality but often suffer from reduced data fidelity, primarily due to the noise introduced during reverse diffusion. To address this trade-off, we propose Noise Frequency-Controlled Diffusion Sampling (NFCDS), a spectral modulation mechanism for reverse diffusion noise. We show that the fidelity-perception conflict can be fundamentally understood through noise frequency: low-frequency components induce blur and degrade fidelity, while high-frequency components drive detail generation. Based on this insight, we design a Fourier-domain filter that progressively suppresses low-frequency noise and preserves high-frequency content. This controlled refinement injects a data-consistency prior directly into sampling, enabling fast convergence to results that are both high-fidelity and perceptually convincing--without additional training. As a PnP module, NFCDS seamlessly integrates into existing diffusion-based restoration frameworks and improves the fidelity-perception balance across diverse zero-shot tasks.
[92] Hypersolid: Emergent Vision Representations via Short-Range Repulsion
Esteban Rodríguez-Betancourt,Edgar Casasola-Murillo
Main category: cs.CV
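TL;DR: This paper proposes Hypersolid, which reinterprets representation learning in self-supervised learning as a discrete packing problem. Short-range hard-ball repulsion prevents local collisions and thereby avoids representation collapse, and the method performs well on fine-grained and low-resolution classification tasks.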
Details
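Motivation: Representation collapse is a common problem in self-supervised learning, and existing methods mostly rely on global regularization; this paper offers an alternative based on information preservation and injectivity of the mapping. Method: Representation learning is modeled as a discrete packing problem; Hypersolid uses a short-range hard-ball repulsion mechanism to prevent local collisions, pushing the representation space into a high-separation geometric regime. Result: The method preserves augmentation diversity and performs well on fine-grained classification and low-resolution image classification tasks. Conclusion: A local collision constraint is an effective anti-collapse mechanism; compared with conventional global regularization, short-range repulsion more naturally safeguards information invertibility and the discriminability of representations. Abstract: A recurring challenge in self-supervised learning is preventing representation collapse. Existing solutions typically rely on global regularization, such as maximizing distances, decorrelating dimensions or enforcing certain distributions. We instead reinterpret representation learning as a discrete packing problem, where preserving information simplifies to maintaining injectivity. We operationalize this in Hypersolid, a method using short-range hard-ball repulsion to prevent local collisions. This constraint results in a high-separation geometric regime that preserves augmentation diversity, excelling on fine-grained and low-resolution classification tasks.
[93] Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference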
Jianglong Li,Jun Xu,Bingcong Lu,Zhengxue Cheng,Hongwei Hu,Ronghua Wu,Li Song
Main category: cs.CV
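TL;DR: This paper proposes a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that combines FLAME parametric modeling with 3DGS neural rendering, achieving high-quality real-time reconstruction at extremely low bitrates for 3D video conferencing.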
Details
Motivation: Existing 2D video compression cannot preserve geometric and appearance details, while implicit neural rendering methods such as NeRF are too computationally expensive to support real-time, low-bitrate, high-fidelity 3D face communication. Method: Combine FLAME parametric modeling with 3D Gaussian Splatting (3DGS) neural rendering; transmit only essential facial metadata in real time; introduce a compact representation and compression scheme that includes Gaussian attribute compression and MLP optimization. Result: Achieves better rate-distortion performance than existing methods at extremely low bitrates and supports high-quality, real-time 3D face rendering. Conclusion: The proposed framework balances fidelity, efficiency, and practicality, providing a feasible solution for real-time, low-bandwidth 3D video conferencing. Abstract: The demand for immersive and interactive communication has driven advancements in 3D video conferencing, yet achieving high-fidelity 3D talking face representation at low bitrates remains a challenge. Traditional 2D video compression techniques fail to preserve fine-grained geometric and appearance details, while implicit neural rendering methods like NeRF suffer from prohibitive computational costs. To address these challenges, we propose a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that integrates FLAME-based parametric modeling with 3DGS neural rendering. Our approach transmits only essential facial metadata in real time, enabling efficient reconstruction with a Gaussian-based head model. Additionally, we introduce a compact representation and compression scheme, including Gaussian attribute compression and MLP optimization, to enhance transmission efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance, delivering high-quality facial rendering at extremely low bitrates, making it well-suited for real-time 3D video conferencing applications.
[94] GeoRC: A Benchmark for Geolocation Reasoning Chains
Mohit Talreja,Joshua Diao,Jim Thannikary James,Radu Casapu,Tejas Santanam,Ethan Mendes,Alan Ritter,Wei Xu,James Hays
Main category: cs.CV
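TL;DR: This paper presents the first benchmark for geolocation reasoning chains. It shows that vision-language models (VLMs), although able to predict image locations with high accuracy, are generally poor at producing verifiable reasoning chains grounded in image evidence and often hallucinate. The 800 expert-annotated reasoning chains cover fine-grained visual cues in street views from many countries; evaluation shows that large closed-source VLMs (such as Gemini and GPT-5) still fall well short of human experts, while open-weight VLMs (such as Llama and Qwen) nearly fail outright.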
Details
Motivation: VLMs now approach human experts at geolocation prediction, but their reasoning lacks interpretability and trustworthiness and often rests on hallucination rather than real image evidence; a high-quality benchmark is needed to evaluate the quality of their reasoning chains. Method: Build the first reasoning chain benchmark for geolocation: using GeoGuessr street view data, top human experts (including the world champion) write 800 ground truth, fine-grained reasoning chains for 500 query scenes; two automatic evaluation strategies, LLM-as-a-judge and VLM-as-a-judge, are assessed against human scoring as the gold standard. Result: Qwen 3 LLM-as-a-judge correlates best with human scoring; closed-source VLMs (Gemini, GPT-5) approach human accuracy in location prediction but lag clearly in reasoning chain quality; open-weight VLMs (Llama, Qwen) perform very poorly, only slightly better than a pure hallucination baseline. Conclusion: Current VLMs have fundamental limitations in extracting fine-grained visual attributes (such as license plate shape, architectural style, and soil characteristics) from high-resolution images, which makes their reasoning chains unreliable; the benchmark provides an evaluation tool and a direction for improving VLM interpretability and fine-grained visual understanding. Abstract: Vision Language Models (VLMs) are good at recognizing the global location of a photograph -- their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at explaining which image evidence led to their prediction, even when their location prediction is correct. The reasoning chains produced by VLMs frequently hallucinate scene attributes to support their location prediction (e.g. phantom writing, imagined infrastructure, misidentified flora). In this paper, we introduce the first benchmark for geolocation reasoning chains. We focus on the global location prediction task in the popular GeoGuessr game which draws from Google Street View spanning more than 100 countries. We collaborate with expert GeoGuessr players, including the reigning world champion, to produce 800 ground truth reasoning chains for 500 query scenes. These expert reasoning chains address hundreds of different discriminative visual attributes such as license plate shape, architecture, and soil properties to name just a few. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at predicting locations, they still lag behind human experts when it comes to producing auditable reasoning chains. 
Open weights VLMs such as Llama and Qwen catastrophically fail on our benchmark -- they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but no visual information at all. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high resolution images.
[95] Token Entropy Regularization for Multi-modal Antenna Affiliation Identification
Dong Chen,Ruoyu Li,Xinyan Zhang,Jialei Xu,Ruoseng Zhao,Zhikang Zhang,Lingyun Li,Zizhuang Wei
Main category: cs.CV
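TL;DR: This paper proposes a multi-modal method that fuses video, antenna geometric features, and PCI signals to identify antenna affiliation automatically, and designs a Token Entropy Regularization (TER) module to improve cross-modal alignment.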
Details
Motivation: Antenna affiliation identification currently relies on manual inspection, which is inefficient and error-prone; general-purpose pretrained models lack analogous data from the communications domain and therefore struggle with cross-modal alignment. Method: Build a dedicated training framework that aligns antenna images with their corresponding PCI signals; introduce a Token Entropy Regularization (TER) module in the pretraining stage to improve alignment of multi-modal representations. Result: Experiments show that TER accelerates convergence and yields significant performance gains; further analysis finds that the entropy of the first token is modality-dependent. Conclusion: The proposed multi-modal fusion paradigm and TER module offer an efficient, robust new approach to antenna affiliation identification in communication networks. Abstract: Accurate antenna affiliation identification is crucial for optimizing and maintaining communication networks. Current practice, however, relies on the cumbersome and error-prone process of manual tower inspections. We propose a novel paradigm shift that fuses video footage of base stations, antenna geometric features, and Physical Cell Identity (PCI) signals, transforming antenna affiliation identification into multi-modal classification and matching tasks. Publicly available pretrained transformers struggle with this unique task due to a lack of analogous data in the communications domain, which hampers cross-modal alignment. To address this, we introduce a dedicated training framework that aligns antenna images with corresponding PCI signals. To tackle the representation alignment challenge, we propose a novel Token Entropy Regularization (TER) module in the pretraining stage. Our experiments demonstrate that TER accelerates convergence and yields significant performance gains. Further analysis reveals that the entropy of the first token is modality-dependent. Code will be made available upon publication.
[96] WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models
Rishi Upadhyay,Howard Zhang,Jim Solomon,Ayush Agrawal,Pranay Boreddy,Shruti Satya Narayana,Yunhao Ba,Alex Wong,Celso M de Melo,Achuta Kadambi
Main category: cs.CV
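TL;DR: This paper proposes WorldBench, a disentangled video benchmark for evaluating the physical fidelity of generative world models. It covers two levels, intuitive physical understanding and low-level physical constants and material properties, and reveals systematic failures of current SOTA models on specific physics concepts.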
Details
Motivation: Existing physics video benchmarks suffer from concept entanglement, so they cannot precisely diagnose a model's shortcomings on an individual physical law, which makes it hard to support reliable deployment of world models in critical tasks such as robotic planning. Method: The WorldBench benchmark has two levels: 1) intuitive physical understanding (such as object permanence and scale/perspective); 2) low-level physical parameters and material properties (such as friction coefficients and fluid viscosity); every test is concept-disentangled and evaluated independently. Result: Evaluating current SOTA video world models on WorldBench reveals reproducible failure patterns on several specific physics concepts, and all tested models lack the physical consistency needed to generate reliable real-world interactions. Conclusion: WorldBench provides a finer-grained, scalable framework for evaluating physical reasoning and should help advance world models with high physical fidelity. Abstract: Recent advances in generative foundational models, often termed "world models," have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties such as friction coefficients or fluid viscosity. When SOTA video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.
[97] Gaussian Belief Propagation Network for Depth Completion
Jie Tang,Pingping Xie,Jian Li,Ping Tan
Main category: cs.CV
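TL;DR: This paper proposes the Gaussian Belief Propagation Network (GBPN), which combines deep learning with probabilistic graphical models. It dynamically constructs a scene-specific Markov Random Field and performs inference with an enhanced Gaussian Belief Propagation, markedly improving completion of sparse depth maps.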
Details
Motivation: Existing deep learning methods have difficulty handling the sparse and irregular nature of input depth data, and performance is limited especially under high sparsity. Method: The Gaussian Belief Propagation Network (GBPN): a Graphical Model Construction Network (GMCN) dynamically constructs a scene-specific Markov Random Field (MRF) and predicts adaptive non-local edges; a combined serial and parallel message passing scheme strengthens Gaussian Belief Propagation (GBP) inference. Result: Achieves SOTA performance on the NYUv2 and KITTI benchmarks and shows stronger robustness and generalization across sparsity levels, sparsity patterns, and datasets. Conclusion: By tightly integrating deep learning with probabilistic graphical modeling, GBPN captures long-range spatial dependencies and improves sparse depth completion, offering a new paradigm for the task. Abstract: Depth completion aims to predict a dense depth map from a color image with sparse depth measurements. Although deep learning methods have achieved state-of-the-art (SOTA), effectively handling the sparse and irregular nature of input depth data in deep networks remains a significant challenge, often limiting performance, especially under high sparsity. To overcome this limitation, we introduce the Gaussian Belief Propagation Network (GBPN), a novel hybrid framework synergistically integrating deep learning with probabilistic graphical models for end-to-end depth completion. Specifically, a scene-specific Markov Random Field (MRF) is dynamically constructed by the Graphical Model Construction Network (GMCN), and then inferred via Gaussian Belief Propagation (GBP) to yield the dense depth distribution. Crucially, the GMCN learns to construct not only the data-dependent potentials of MRF but also its structure by predicting adaptive non-local edges, enabling the capture of complex, long-range spatial dependencies. Furthermore, we enhance GBP with a serial & parallel message passing scheme, designed for effective information propagation, particularly from sparse measurements. Extensive experiments demonstrate that GBPN achieves SOTA performance on the NYUv2 and KITTI benchmarks. Evaluations across varying sparsity levels, sparsity patterns, and datasets highlight GBPN's superior performance, notable robustness, and generalizable capability.
[98] Mam-App: A Novel Parameter-Efficient Mamba Model for Apple Leaf Disease Classification
Md Nadim Mahamood,Md Imran Hasan,Md Rasheduzzaman,Ausrukona Ray,Md Shafi Ud Doula,Kamrul Hasan
Main category: cs.CV
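TL;DR: This paper proposes Mam-App, a lightweight Mamba-based model for apple leaf disease recognition. With an extremely small parameter count (0.051M), it reaches 99.58% accuracy on the PlantVillage dataset and also generalizes well to corn and potato disease datasets.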
Details
Motivation: Existing deep learning models have large parameter counts and high computational cost, which makes them hard to deploy on resource-constrained platforms such as drones and mobile devices, while lightweight models often sacrifice performance; a solution that balances efficiency and accuracy is needed. Method: Mam-App is a parameter-efficient Mamba-based model for feature extraction and classification of plant leaf diseases, with an emphasis on a compact structure and efficient computation. Result: On the PlantVillage Apple Leaf Disease dataset it reaches 99.58% accuracy, 99.30% precision, 99.14% recall, and a 99.22% F1-score; on the Corn dataset all four metrics exceed 99%, and on the Potato dataset they range from 95.39% (recall) to 98.91% (precision). The model has only 0.051M parameters, far fewer than mainstream models. Conclusion: Mam-App balances a lightweight design with high performance, supports the effectiveness and generalization of the Mamba architecture for agricultural disease recognition, and is suitable for real-time deployment on edge devices. Abstract: The rapid growth of the global population, alongside exponential technological advancement, has intensified the demand for food production. Meeting this demand depends not only on increasing agricultural yield but also on minimizing food loss caused by crop diseases. Diseases account for a substantial portion of apple production losses, despite apples being among the most widely produced and nutritionally valuable fruits worldwide. Previous studies have employed machine learning techniques for feature extraction and early diagnosis of apple leaf diseases, and more recently, deep learning-based models have shown remarkable performance in disease recognition. However, most state-of-the-art deep learning models are highly parameter-intensive, resulting in increased training and inference time. Although lightweight models are more suitable for user-friendly and resource-constrained applications, they often suffer from performance degradation. To address the trade-off between efficiency and performance, we propose Mam-App, a parameter-efficient Mamba-based model for feature extraction and leaf disease classification. The proposed approach achieves competitive state-of-the-art performance on the PlantVillage Apple Leaf Disease dataset, attaining 99.58% accuracy, 99.30% precision, 99.14% recall, and a 99.22% F1-score, while using only 0.051M parameters. This extremely low parameter count makes the model suitable for deployment on drones, mobile devices, and other low-resource platforms. 
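The four metrics reported above (accuracy, precision, recall, F1-score) can be computed from true and predicted labels as in the following minimal sketch. The abstract does not say how the per-class scores are averaged, so macro-averaging is an assumption here, and the class names and label lists are hypothetical toy data rather than results from the paper.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Hypothetical toy labels: one "scab" leaf is misclassified as "rust".
y_true = ["scab", "scab", "rust", "rust", "healthy", "healthy"]
y_pred = ["scab", "rust", "rust", "rust", "healthy", "healthy"]
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```

Under macro-averaging, the F1-score is the mean of per-class F1 values, so it need not equal the harmonic mean of the averaged precision and recall.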
To demonstrate the robustness and generalizability of the proposed model, we further evaluate it on the PlantVillage Corn Leaf Disease and Potato Leaf Disease datasets. The model achieves 99.48%, 99.20%, 99.34%, and 99.27% accuracy, precision, recall, and F1-score on the corn dataset and 98.46%, 98.91%, 95.39%, and 97.01% on the potato dataset, respectively.
[99] HiFi-Mesh: High-Fidelity Efficient 3D Mesh Generation via Compact Autoregressive Dependence
Yanfeng Li,Tao Tan,Qingquan Gao,Zhiwen Cao,Xiaohong liu,Yue Sun
Main category: cs.CV
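TL;DR: This paper proposes LANE, a latent autoregressive network that, combined with the AdaGraph strategy, markedly improves the efficiency and detail of 3D mesh sequence modeling.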
Details
Motivation: Existing autoregressive 3D mesh modeling methods use resources inefficiently, infer slowly, and struggle with long sequences, which limits the structural detail they can express. Method: The Latent Autoregressive Network (LANE) models compact autoregressive dependencies, and the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy accelerates inference through spatiotemporal decoupling. Result: LANE increases the maximum generatable sequence length sixfold and outperforms existing methods in generation speed, structural detail, and geometric consistency. Conclusion: LANE with AdaGraph offers an efficient, high-fidelity new paradigm for high-quality 3D mesh generation. Abstract: High-fidelity 3D meshes can be tokenized into one-dimensional (1D) sequences and directly modeled using autoregressive approaches for faces and vertices. However, existing methods suffer from insufficient resource utilization, resulting in slow inference and the ability to handle only small-scale sequences, which severely constrains the expressible structural details. We introduce the Latent Autoregressive Network (LANE), which incorporates compact autoregressive dependencies in the generation process, achieving a $6\times$ improvement in maximum generatable sequence length compared to existing methods. To further accelerate inference, we propose the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy, which effectively overcomes the efficiency bottleneck of traditional serial inference through spatiotemporal decoupling in the generation process. Experimental validation demonstrates that LANE achieves superior performance across generation speed, structural detail, and geometric consistency, providing an effective solution for high-quality 3D mesh generation.
[100] Optimal Transport-Induced Samples against Out-of-Distribution Overconfidence
Keke Tang,Ziyong Du,Xiaofei Wang,Weilong Peng,Peican Zhu,Zhihong Tian
Main category: cs.CV
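TL;DR: This paper proposes a framework based on the singular boundaries of semi-discrete optimal transport (OT). It constructs geometrically grounded OOD samples (OTIS) and applies a confidence suppression loss to them during training, effectively mitigating the overconfidence of deep neural networks on out-of-distribution inputs.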
Details
Motivation: Deep neural networks often produce overconfident predictions on out-of-distribution (OOD) inputs, which undermines their reliability in open-world environments; singularities in semi-discrete optimal transport correspond to regions of semantic ambiguity, exactly where classifiers tend to make high-confidence errors. Method: Formulate an optimal transport problem between a continuous base distribution and the latent embeddings of the training data and identify the singular boundaries it induces; sample near these boundaries to generate OTIS samples; apply a confidence suppression loss to OTIS during training to improve calibration in structurally uncertain regions. Result: Experiments show that the method significantly alleviates OOD overconfidence and outperforms current state-of-the-art methods on multiple benchmarks. Conclusion: Using the geometry of optimal transport to guide OOD modeling and training is a principled and effective new paradigm for model calibration. Abstract: Deep neural networks (DNNs) often produce overconfident predictions on out-of-distribution (OOD) inputs, undermining their reliability in open-world environments. Singularities in semi-discrete optimal transport (OT) mark regions of semantic ambiguity, where classifiers are particularly prone to unwarranted high-confidence predictions. Motivated by this observation, we propose a principled framework to mitigate OOD overconfidence by leveraging the geometry of OT-induced singular boundaries. Specifically, we formulate an OT problem between a continuous base distribution and the latent embeddings of training data, and identify the resulting singular boundaries. By sampling near these boundaries, we construct a class of OOD inputs, termed optimal transport-induced OOD samples (OTIS), which are geometrically grounded and inherently semantically ambiguous. During training, a confidence suppression loss is applied to OTIS to guide the model toward more calibrated predictions in structurally uncertain regions. Extensive experiments show that our method significantly alleviates OOD overconfidence and outperforms state-of-the-art methods.
[101] Do Pathology Foundation Models Encode Disease Progression? A Pseudotime Analysis of Visual Representations
Pritika Vig,Ren-Chin Wu,William Lotter
Main category: cs.CV
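TL;DR: This paper examines whether vision foundation models implicitly learn continuous disease progression from discrete images, using diffusion pseudotime to assess how well disease states are ordered in representation space. Pathology-specific models recover disease progression trajectories significantly better than null baselines, and trajectory fidelity correlates strongly with few-shot classification performance, revealing the models' capacity to represent continuous biological processes.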
Details
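Motivation: Vision foundation models perform well on classification tasks, but it is unclear whether they encode the continuous biological processes (such as disease progression) underlying their training data; in computational pathology especially, representations that reflect continuous disease progression are more likely to match real biology, generalize better, and support quantitative analysis.
Method: Applies diffusion pseudotime, a method originating in single-cell transcriptomics, in representation space to evaluate how well several vision foundation models reconstruct four cancer progression trajectories, and relates trajectory fidelity to few-shot classification performance and to changes in tissue cell-type composition.
Result: All pathology-specific models significantly exceed null baselines, with vision-only models reaching the highest trajectory fidelity on the CRC-Serrated dataset ($\tau > 0.78$); rankings by trajectory fidelity correlate strongly with few-shot classification performance ($\rho = 0.92$); cell-type composition varies smoothly along the inferred trajectories, consistent with known stromal remodeling.
Conclusion: Vision foundation models can implicitly learn representations of continuous disease progression from static images alone; trajectory fidelity is a new, biologically meaningful complementary measure of representation quality; the framework can extend to other domains where continuous processes are observed through static snapshots.
Abstract: Vision foundation models trained on discretely sampled images achieve strong performance on classification benchmarks, yet whether their representations encode the continuous processes underlying their training data remains unclear. This question is especially pertinent in computational pathology, where we posit that models whose latent representations implicitly capture continuous disease progression may better reflect underlying biology, support more robust generalization, and enable quantitative analyses of features associated with disease transitions. Using diffusion pseudotime, a method developed to infer developmental trajectories from single-cell transcriptomics, we probe whether foundation models organize disease states along coherent progression directions in representation space. Across four cancer progressions and six models, we find that all pathology-specific models recover trajectory orderings significantly exceeding null baselines, with vision-only models achieving the highest fidelities ($\tau > 0.78$ on CRC-Serrated). Model rankings by trajectory fidelity on reference diseases strongly predict few-shot classification performance on held-out diseases ($\rho = 0.92$), and exploratory analysis shows cell-type composition varies smoothly along inferred trajectories in patterns consistent with known stromal remodeling.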
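The trajectory-fidelity score $\tau$ quoted above can be read as a rank correlation between inferred pseudotime and the known ordering of disease stages. The sketch below assumes $\tau$ denotes Kendall's rank correlation (the abstract does not name the variant) and uses the tie-aware tau-b form, since many samples share a stage label. The function name and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every count
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    # tau-b denominator: sqrt(pairs untied in x * pairs untied in y)
    denom = sqrt((untied + tied_y_only) * (untied + tied_x_only))
    return (concordant - discordant) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical data: ordinal stage labels and an inferred pseudotime per sample.
    stages = [0, 0, 1, 1, 2, 2]
    pseudotime = [0.10, 0.20, 0.15, 0.50, 0.60, 0.90]
    print(f"tau-b = {kendall_tau_b(stages, pseudotime):.3f}")
```

A value near 1 means samples are ordered in pseudotime almost exactly as their stage labels would predict; a null baseline can be obtained by shuffling the stage labels and recomputing the score.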
Together, these results demonstrate that vision foundation models can implicitly learn to represent continuous processes from independent static observations, and that trajectory fidelity provides a complementary measure of representation quality beyond downstream performance. While demonstrated in pathology, this framework could be applied to other domains where continuous processes are observed through static snapshots.
[102] SR$^{2}$-Net: A General Plug-and-Play Model for Spectral Refinement in Hyperspectral Image Super-Resolution
Ji-Xuan He,Guohang Zhuang,Junge Bo,Tingyi Li,Chen Ling,Yanan Qiao
Main category: cs.CV
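TL;DR: This paper proposes SR²-Net, a lightweight plug-and-play spectral rectification network for hyperspectral image super-resolution (HSI-SR). By strengthening cross-band interactions and constraining reconstructed spectra to a physically plausible manifold, it improves spectral fidelity and reconstruction quality without modifying the host model's architecture.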
Details
Motivation: Existing HSI-SR methods improve spatial resolution by exploiting spatial correlations but often neglect spectral consistency across bands, producing artifacts and physically implausible results; enforcing spectral consistency through architecture design, in turn, sacrifices generality and flexibility.
Method: Proposes SR²-Net, comprising Hierarchical Spectral-Spatial Synergy Attention (H-S³A) to strengthen cross-band interactions and Manifold Consistency Rectification (MCR) to constrain reconstructed spectra to a compact, physically plausible spectral manifold; a degradation-consistency loss is added to preserve data fidelity.
Result: Experiments on multiple benchmarks and different backbones show that SR²-Net significantly improves spectral fidelity and overall reconstruction quality with minimal computational overhead.
Conclusion: SR²-Net is a general, flexible, and efficient plug-and-play module that effectively addresses the balance between spectral consistency and physical plausibility in HSI-SR.
Abstract: Hyperspectral image super-resolution (HSI-SR) aims to enhance spatial resolution while preserving spectrally faithful and physically plausible characteristics. Recent methods have achieved great progress by leveraging spatial correlations to enhance spatial resolution. However, these methods often neglect spectral consistency across bands, leading to spurious oscillations and physically implausible artifacts. While spectral consistency can be addressed by designing the network architecture, doing so results in a loss of generality and flexibility. To address this issue, we propose a lightweight plug-and-play rectifier, the physical-prior Spectral Rectification Super-Resolution Network (SR$^{2}$-Net), which can be attached to a wide range of HSI-SR models without modifying their architectures. SR$^{2}$-Net follows an enhance-then-rectify pipeline consisting of (i) Hierarchical Spectral-Spatial Synergy Attention (H-S$^{3}$A) to reinforce cross-band interactions and (ii) Manifold Consistency Rectification (MCR) to constrain the reconstructed spectra to a compact, physically plausible spectral manifold. In addition, we introduce a degradation-consistency loss to enforce data fidelity by encouraging the degraded SR output to match the observed low-resolution input. Extensive experiments on multiple benchmarks and diverse backbones demonstrate consistent improvements in spectral fidelity and overall reconstruction quality with negligible computational overhead. Our code will be released upon publication.
[103] Dynamical Adapter Fusion: Constructing A Global Adapter for Pre-Trained Model-based Class-Incremental Learning
Ruiqi Liu,Boyu Diao,Zijia An,Zhulin An,Fei Wang,Yongjun Xu
Main category: cs.CV
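TL;DR: This paper proposes Dynamical Adapter Fusion (DAF), which uses PAC-Bayes theory and a Taylor expansion of the loss function to dynamically fuse task-specific, historical global, and initialization parameters into a single robust global adapter, mitigating catastrophic forgetting in class-incremental learning and reaching SOTA on multiple benchmarks.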
Details
Motivation: In existing class-incremental learning, task-specific adapters hinder knowledge transfer and incur high retrieval costs, while naive parameter fusion tends to cause destructive interference and catastrophic forgetting.
Method: Designs a dynamic adapter fusion mechanism grounded in the PAC-Bayes theorem that fuses the task-specific adapter, the historical global adapter, and the initialization parameters; uses a Taylor expansion of the loss function to determine the optimal fusion coefficients; introduces a robust initialization strategy to capture global knowledge patterns.
Result: Achieves state-of-the-art (SOTA) performance on multiple class-incremental learning benchmarks.
Conclusion: By dynamically balancing stability and plasticity, DAF effectively mitigates catastrophic forgetting and improves knowledge-transfer efficiency and model robustness in class-incremental learning.
Abstract: Class-Incremental Learning (CIL) requires models to continuously acquire new classes without forgetting previously learned ones. A dominant paradigm involves freezing a pre-trained model and training lightweight, task-specific adapters. However, maintaining task-specific parameters hinders knowledge transfer and incurs high retrieval costs, while naive parameter fusion often leads to destructive interference and catastrophic forgetting. To address these challenges, we propose Dynamical Adapter Fusion (DAF) to construct a single robust global adapter. Grounded in the PAC-Bayes theorem, we derive a fusion mechanism that explicitly integrates three components: the optimized task-specific adapter parameters, the previous global adapter parameters, and the initialization parameters. We utilize the Taylor expansion of the loss function to derive the optimal fusion coefficients, dynamically achieving the best balance between stability and plasticity. Furthermore, we propose a Robust Initialization strategy to effectively capture global knowledge patterns. Experiments on multiple CIL benchmarks demonstrate that DAF achieves state-of-the-art (SOTA) performance.
[104] Semantic-Guided Dynamic Sparsification for Pre-Trained Model-based Class-Incremental Learning
Ruiqi Liu,Boyu Diao,Zijia An,Runjie Shao,Zhulin An,Fei Wang,Yongjun Xu
Main category: cs.CV
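TL;DR: This paper proposes Semantic-Guided Dynamic Sparsification (SGDS), a new method for class-incremental learning (CIL) that builds class-specific sparse subspaces in the activation space to mitigate inter-task interference without imposing rigid constraints on the parameter space, achieving state-of-the-art performance on multiple benchmark datasets.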
Details
Motivation: Existing methods that impose orthogonality constraints on lightweight adapters reduce task interference but harm plasticity; a new mechanism is needed that supports knowledge transfer while suppressing interference.
Method: Proposes Semantic-Guided Dynamic Sparsification (SGDS), which governs the orientation and rank of activation subspaces through targeted sparsification: similar classes share a compact activation subspace to promote knowledge transfer, while dissimilar classes are assigned non-overlapping activation subspaces to prevent interference.
Result: Achieves state-of-the-art performance on multiple class-incremental learning benchmark datasets.
Conclusion: By imposing flexible structural constraints in the activation space rather than the parameter space, SGDS effectively balances stability and plasticity and offers a new paradigm for CIL.
Abstract: Class-Incremental Learning (CIL) requires a model to continually learn new classes without forgetting old ones. A common and efficient solution freezes a pre-trained model and employs lightweight adapters, whose parameters are often forced to be orthogonal to prevent inter-task interference. However, we argue that this parameter-constraining method is detrimental to plasticity. To this end, we propose Semantic-Guided Dynamic Sparsification (SGDS), a novel method that proactively guides the activation space by governing the orientation and rank of its subspaces through targeted sparsification. Specifically, SGDS promotes knowledge transfer by encouraging similar classes to share a compact activation subspace, while simultaneously preventing interference by assigning non-overlapping activation subspaces to dissimilar classes. By sculpting class-specific sparse subspaces in the activation space, SGDS effectively mitigates interference without imposing rigid constraints on the parameter space. Extensive experiments on various benchmark datasets demonstrate the state-of-the-art performance of SGDS.
[105] Towards Geometry-Aware and Motion-Guided Video Human Mesh Recovery
Hongjun Chen,Huan Zheng,Wencheng Han,Jianbing Shen
Main category: cs.CV
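TL;DR: This paper proposes HMRMamba, a new video-based 3D human mesh recovery (HMR) method built on structured state space models (SSMs). With a geometry-aware lifting module and a motion-guided reconstruction network, it significantly improves reconstruction accuracy, temporal consistency, and computational efficiency.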
Details
Motivation: Existing video-based 3D human mesh recovery methods rely on flawed intermediate 3D pose anchors and struggle to model complex spatiotemporal dynamics, which leads to physically implausible results.
Method: Proposes the HMRMamba framework: 1) a geometry-aware lifting module with a dual-scan Mamba architecture that fuses geometric cues from image features for 2D-to-3D pose lifting; 2) a motion-guided reconstruction network that uses the resulting 3D pose sequence as an anchor and explicitly models temporal motion patterns.
Result: Achieves SOTA on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets, with higher reconstruction accuracy and temporal consistency and better computational efficiency.
Conclusion: By introducing SSMs and a two-module architecture, HMRMamba addresses fundamental shortcomings of traditional HMR methods in physical plausibility, spatiotemporal modeling, and efficiency.
Abstract: Existing video-based 3D Human Mesh Recovery (HMR) methods often produce physically implausible results, stemming from their reliance on flawed intermediate 3D pose anchors and their inability to effectively model complex spatiotemporal dynamics. To overcome these deep-rooted architectural problems, we introduce HMRMamba, a new paradigm for HMR that pioneers the use of Structured State Space Models (SSMs) for their efficiency and long-range modeling prowess. Our framework is distinguished by two core contributions. First, the Geometry-Aware Lifting Module, featuring a novel dual-scan Mamba architecture, creates a robust foundation for reconstruction. It directly grounds the 2D-to-3D pose lifting process with geometric cues from image features, producing a highly reliable 3D pose sequence that serves as a stable anchor. Second, the Motion-Guided Reconstruction Network leverages this anchor to explicitly process kinematic patterns over time. By injecting this crucial temporal awareness, it significantly enhances the final mesh's coherence and robustness, particularly under occlusion and motion blur. Comprehensive evaluations on 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks confirm that HMRMamba sets a new state-of-the-art, outperforming existing methods in both reconstruction accuracy and temporal consistency while offering superior computational efficiency.
[106] Rectifying Geometry-Induced Similarity Distortions for Real-World Aerial-Ground Person Re-Identification
Kailash A. Hambarde,Hugo Proença
Main category: cs.CV
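TL;DR: This paper proposes GIQT, which uses a geometry-induced query-key transformation to explicitly correct the similarity-space distortion caused by viewpoint differences, and combines it with a geometry-conditioned prompt generation mechanism to improve aerial-ground cross-view person re-identification.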
Details
Motivation: Existing methods implicitly assume that dot-product similarity remains reliable under large viewpoint and scale changes, but in practice extreme camera geometry systematically distorts the query-key similarity space and degrades attention-based matching.
Method: Proposes Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that explicitly rectifies the query-key similarity computation based on camera geometry, together with a geometry-conditioned prompt generation mechanism that provides global, view-adaptive representation priors.
Result: On four aerial-ground person re-identification benchmarks, the method shows significantly improved robustness under extreme and unseen geometric conditions, with minimal computational overhead.
Conclusion: Explicitly modeling the effect of camera geometry on the similarity space is more effective than relying only on geometry-aware feature learning or appearance-conditioned prompting; GIQT offers a new direction for cross-view ReID.
Abstract: Aerial-ground person re-identification (AG-ReID) is fundamentally challenged by extreme viewpoint and distance discrepancies between aerial and ground cameras, which induce severe geometric distortions and invalidate the assumption of a shared similarity space across views. Existing methods primarily rely on geometry-aware feature learning or appearance-conditioned prompting, while implicitly assuming that the geometry-invariant dot-product similarity used in attention mechanisms remains reliable under large viewpoint and scale variations. We argue that this assumption does not hold. Extreme camera geometry systematically distorts the query-key similarity space and degrades attention-based matching, even when feature representations are partially aligned. To address this issue, we introduce Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that explicitly rectifies the similarity space by conditioning query-key interactions on camera geometry. Rather than modifying feature representations or the attention formulation itself, GIQT adapts the similarity computation to compensate for dominant geometry-induced anisotropic distortions. Building on this local similarity rectification, we further incorporate a geometry-conditioned prompt generation mechanism that provides global, view-adaptive representation priors derived directly from camera geometry.
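One way to picture a low-rank rectification of the query-key similarity is as a correction to the plain dot product, $s(q, k) = q^\top (I + U V^\top) k$, where the thin factors $U$ and $V$ would be predicted from camera geometry. The sketch below is an illustration under that assumption only. The abstract does not give GIQT's exact formulation, and `rectified_similarity`, `U`, and `V` are hypothetical names.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def rectified_similarity(q, k, U, V):
    """Compute q^T (I + U V^T) k without forming the d x d matrix.

    U and V are lists of r vectors of length d (the columns of the low-rank
    factors). In this hypothetical setup they would come from a small network
    fed with camera geometry; here they are plain inputs.
    """
    base = dot(q, k)  # ordinary dot-product similarity
    # q^T U V^T k = sum over rank components of (q . u_r) * (v_r . k)
    correction = sum(dot(q, u) * dot(v, k) for u, v in zip(U, V))
    return base + correction


if __name__ == "__main__":
    q, k = [1.0, 2.0], [3.0, 4.0]
    U, V = [[1.0, 0.0]], [[0.0, 1.0]]  # a single rank-1 correction term
    print(rectified_similarity(q, k, [], []))  # plain dot product
    print(rectified_similarity(q, k, U, V))    # dot product plus correction
```

Writing the correction as a sum of rank-1 terms keeps the extra cost at O(rd) per query-key pair rather than O(d²), which is consistent with the abstract's description of the module as lightweight.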
Experiments on four aerial-ground person re-identification benchmarks demonstrate that the proposed framework consistently improves robustness under extreme and previously unseen geometric conditions, while introducing minimal computational overhead compared to state-of-the-art methods.
[107] Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation
Zihan Su,Hongyang Wei,Kangrui Cen,Yong Wang,Guanhua Chen,Chun Yuan,Xiangxiang Chu
Main category: cs.CV
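TL;DR: This paper proposes UniMRG, which adds generation tasks for multiple intrinsic image representations (pixels, depth, and segmentation) to the post-training of unified multimodal models (UMMs) to strengthen their visual understanding, so that understanding and generation improve together.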
Details
Motivation: Existing post-training methods for UMMs mainly use understanding to improve generation, while the reverse direction, using generation to strengthen understanding, remains largely unexplored.
Method: Proposes UniMRG, an architecture-agnostic post-training method that, alongside standard visual understanding objectives, jointly trains the model to generate multiple intrinsic representations of an image (pixel reconstruction, depth maps, and segmentation maps) in order to extract complementary information about appearance, geometry, and structure.
Result: Experiments on several UMM architectures show that the method notably improves fine-grained perception, reduces hallucinations, strengthens spatial understanding, and also boosts generation performance.
Conclusion: Generation tasks can effectively feed back into and strengthen the understanding ability of unified multimodal models; two-way synergy between understanding and generation is a key path toward more capable UMMs.
Abstract: Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
[108] MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations
Xinan He,Kaiqing Lin,Yue Zhou,Jiaming Zhong,Wei Ye,Wenhui Yi,Bing Fan,Feng Ding,Haodong Li,Bo Cao,Bin Li
Main category: cs.CV
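TL;DR: This paper proposes an AI-generated video detection framework based on the 'Manifold Projection Fluctuations' (MPF) phenomenon; two-stage filtering with a static manifold deviation branch and a micro-temporal fluctuation branch effectively identifies high-fidelity forged videos.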
Details
Motivation: Although current video generation models produce content of very high visual quality, they are fundamentally performing manifold fitting rather than physical recording, so detectable structured pixel-logic signatures (MPF) remain.
Method: Proposes a hierarchical dual-path framework: 1) a static manifold deviation branch that uses large-scale vision foundation models to capture spatial anomalies; 2) a micro-temporal fluctuation branch that analyzes the MPF signatures left in the residuals between consecutive frames to detect high-fidelity forgeries.
Result: The framework reliably identifies AI-generated videos even when macro-level semantic errors and temporal inconsistencies are absent, covering both off-manifold and on-manifold forgeries.
Conclusion: AI-generated videos may look realistic, but their underlying manifold-fitting mechanism leaves an inherent computational fingerprint (MPF) that can be systematically modeled for robust detection.
Abstract: With the rapid advancement of video generation models such as Veo and Wan, the visual quality of synthetic content has reached a level where macro-level semantic errors and temporal inconsistencies are no longer prominent. However, this does not imply that the distinction between real videos and cutting-edge high-fidelity fakes is untraceable. We argue that AI-generated videos are essentially products of a manifold-fitting process rather than a physical recording. Consequently, the pixel composition logic of the residuals between consecutive frames in AI videos exhibits a structured and homogeneous characteristic. We term this phenomenon 'Manifold Projection Fluctuations' (MPF). Driven by this insight, we propose a hierarchical dual-path framework that operates as a sequential filtering process. The first, the Static Manifold Deviation Branch, leverages the refined perceptual boundaries of Large-Scale Vision Foundation Models (VFMs) to capture residual spatial anomalies or physical violations that deviate from the natural real-world manifold (off-manifold). For the remaining high-fidelity videos that successfully reside on-manifold and evade spatial detection, we introduce the Micro-Temporal Fluctuation Branch as a secondary, fine-grained filter. By analyzing the structured MPF that persists even in visually perfect sequences, our framework ensures that forgeries are exposed regardless of whether they manifest as global real-world manifold deviations or subtle computational fingerprints.
[109] From Implicit Ambiguity to Explicit Solidity: Diagnosing Interior Geometric Degradation in Neural Radiance Fields for Dense 3D Scene Understanding
Jiangsan Zhao,Jakob Geipel,Kryzysztof Kusnierek
Main category: cs.CV
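TL;DR: This paper reveals Interior Geometric Degradation (IGD), a failure of NeRF's implicit density fields in dense, self-occluding scenes, and proposes an explicit geometric reconstruction approach based on Sparse Voxel Rasterization (SVRaster) that significantly improves instance recovery and robustness.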
Details
Motivation: The reliability of NeRFs for quantitative 3D analysis in dense, self-occluding scenes is poorly understood; in particular, implicit density fields under heavy occlusion reconstruct hollow or fragmented structures, leading to systematic instance undercounting.
Method: Proposes an explicit geometric reconstruction pipeline based on Sparse Voxel Rasterization (SVRaster), initialized from SfM feature geometry, that projects 2D instance masks onto a voxel grid and enforces geometric separation through recursive splitting.
Result: In dense scenes, SVRaster achieves a 95.8% instance recovery rate, well above the approximately 89% of state-of-the-art mask-supervised NeRFs; with degraded segmentation masks, it recovers 43% more instances than implicit baselines.
Conclusion: Explicit geometric priors are a prerequisite for reliable quantitative analysis in highly self-occluding 3D scenes.
Abstract: Neural Radiance Fields (NeRFs) have emerged as a powerful paradigm for multi-view reconstruction, complementing classical photogrammetric pipelines based on Structure-from-Motion (SfM) and Multi-View Stereo (MVS). However, their reliability for quantitative 3D analysis in dense, self-occluding scenes remains poorly understood. In this study, we identify a fundamental failure mode of implicit density fields under heavy occlusion, which we term Interior Geometric Degradation (IGD). We show that transmittance-based volumetric optimization satisfies photometric supervision by reconstructing hollow or fragmented structures rather than solid interiors, leading to systematic instance undercounting. Through controlled experiments on synthetic datasets with increasing occlusion, we demonstrate that state-of-the-art mask-supervised NeRFs saturate at approximately 89% instance recovery in dense scenes, despite improved surface coherence and mask quality. To overcome this limitation, we introduce an explicit geometric pipeline based on Sparse Voxel Rasterization (SVRaster), initialized from SfM feature geometry. By projecting 2D instance masks onto an explicit voxel grid and enforcing geometric separation via recursive splitting, our approach preserves physical solidity and achieves a 95.8% recovery rate in dense clusters. A sensitivity analysis using degraded segmentation masks further shows that explicit SfM-based geometry is substantially more robust to supervision failure, recovering 43% more instances than implicit baselines.
These results demonstrate that explicit geometric priors are a prerequisite for reliable quantitative analysis in highly self-occluding 3D scenes.
[110] MultiModal Fine-tuning with Synthetic Captions
Shohei Enomoto,Shin'ya Yamaguchi
Main category: cs.CV
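TL;DR: This paper proposes a new method that uses multimodal large language models (MLLMs) to generate high-quality synthetic captions for unimodal image data, turning unimodal datasets into multimodal ones. Together with a supervised contrastive loss and an inference strategy based on class-averaged text embeddings, it significantly improves image classification, especially in few-shot settings, and bridges the gap between multimodal pre-training and unimodal fine-tuning.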
Details
Motivation: Pre-training has shifted to multimodal learning to strengthen visual understanding, but fine-tuning remains mostly unimodal, which limits the benefit of multimodal pre-trained representations and leaves a fundamental modality mismatch between pre-training and fine-tuning.
Method: Uses MLLMs with carefully designed prompts that incorporate class labels and domain context to generate classification-oriented synthetic captions for unimodal images; introduces a supervised contrastive loss that explicitly encourages same-class representations to cluster; proposes a new inference method that predicts using class-averaged text embeddings derived from multiple synthetic captions per image.
Result: Extensive experiments on 13 image classification benchmarks show that the method outperforms baselines, with especially large gains in few-shot learning scenarios.
Conclusion: The work establishes a new paradigm of dataset enhancement for bridging the gap between multimodal pre-training and fine-tuning.
Abstract: In this paper, we address a fundamental gap between pre-training and fine-tuning of deep neural networks: while pre-training has shifted from unimodal to multimodal learning with enhanced visual understanding, fine-tuning predominantly remains unimodal, limiting the benefits of rich pre-trained representations. To bridge this gap, we propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for fine-tuning models with a multimodal objective. Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions tailored for classification tasks. Furthermore, we introduce a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, along with a new inference technique that leverages class-averaged text embeddings from multiple synthetic captions per image. Extensive experiments across 13 image classification benchmarks demonstrate that our approach outperforms baseline methods, with particularly significant improvements in few-shot learning scenarios. Our work establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning. Our code is available at https://github.com/s-enmt/MMFT.
[111] Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention
Yuxiang Huang,Mingye Li,Xu Han,Chaojun Xiao,Weilin Zhao,Ao Sun,Ziqi Yuan,Hao Zhou,Fandong Meng,Zhiyuan Liu
Main category: cs.CV
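TL;DR: This paper proposes Spava, a sequence-parallel framework for accelerating long-video inference. Through distributed approximate attention and system-level optimizations, it processes long videos efficiently across multiple GPUs, significantly speeding up inference without notable performance loss.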
Details
Motivation: Existing methods compress visual embeddings or apply sparse attention on a single GPU, which yields limited acceleration or degrades performance and makes it hard to handle longer, more complex videos.
Method: Proposes the Spava framework, which combines sequence parallelism with an optimized approximate attention mechanism and system-level optimizations such as load balancing and fused forward passes, distributing computation across multiple GPUs to increase parallelism.
Result: Spava achieves speedups of 12.72×, 1.70×, and 1.18× over FlashAttn, ZigZagRing, and APB, respectively, without notable performance loss.
Conclusion: Spava overcomes the prefill-stage compute bottleneck in long-video inference, supports processing more visual embeddings without compression, and improves task performance and scalability.
Abstract: The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code is available at https://github.com/thunlp/APB.
[112] Variance & Greediness: A comparative study of metric-learning losses
Donghuo Zeng,Hao Niu,Zhi Li,Masato Taya
Main category: cs.CV
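TL;DR: This paper proposes the VARIANCE and GREEDINESS diagnostic framework to analyze the embedding geometry and optimization dynamics of seven metric-learning losses in image retrieval, revealing a trade-off between efficiency and granularity.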
Details
Motivation: Metric learning is central to retrieval, but its effects on embedding geometry and optimization dynamics are not well understood.
Method: Introduces the VARIANCE (intra-/inter-class variance) and GREEDINESS (active ratio and gradient norms) diagnostic framework and compares seven representative loss functions on five image-retrieval datasets.
Result: Triplet and SCL preserve higher within-class variance and clearer inter-class margins and perform better in fine-grained retrieval; Contrastive and InfoNCE compact embeddings quickly through many small updates, which accelerates convergence but may oversimplify class structure; N-pair achieves a large mean separation but with uneven spacing.
Conclusion: There is an efficiency-granularity trade-off: prefer Triplet/SCL when diversity preservation and hard-sample discrimination matter, and Contrastive/InfoNCE when faster embedding compaction is needed.
Abstract: Metric learning is central to retrieval, yet its effects on embedding geometry and optimization dynamics are not well understood. We introduce a diagnostic framework, VARIANCE (intra-/inter-class variance) and GREEDINESS (active ratio and gradient norms), to compare seven representative losses, i.e., Contrastive, Triplet, N-pair, InfoNCE, ArcFace, SCL, and CCL, across five image-retrieval datasets. Our analysis reveals that Triplet and SCL preserve higher within-class variance and clearer inter-class margins, leading to stronger top-1 retrieval in fine-grained settings. In contrast, Contrastive and InfoNCE compact embeddings quickly through many small updates, accelerating convergence but potentially oversimplifying class structures. N-pair achieves a large mean separation but with uneven spacing. These insights reveal a form of efficiency-granularity trade-off and provide practical guidance: prefer Triplet/SCL when diversity preservation and hard-sample discrimination are critical, and Contrastive/InfoNCE when faster embedding compaction is desired.
[113] Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization
Midou Guo,Qilin Yin,Wei Lu,Xiangyang Luo,Rui Yang
Main category: cs.CV
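TL;DR: This paper proposes RT-DeepLoc, a weakly supervised temporal forgery localization framework that uses a masked autoencoder (MAE) trained only on real videos to model spatiotemporal patterns and localizes forged segments through reconstruction error. An Asymmetric Intra-video Contrastive Loss (AICL) improves local discrimination, and the method reaches SOTA performance on large datasets such as LAV-DF.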
Details
Motivation: Modern deepfakes are localized and intermittent and therefore require fine-grained temporal localization; frame-level annotation is prohibitively expensive, so weakly supervised methods that rely only on video-level labels are urgently needed.
Method: Proposes the RT-DeepLoc framework: 1) a Masked Autoencoder trained only on real data models normal spatiotemporal patterns; 2) the reconstruction error of forged segments serves as the localization cue; 3) an Asymmetric Intra-video Contrastive Loss (AICL), guided by reconstruction error, improves the compactness of authentic features and the separability of forged regions.
Result: On large-scale datasets including LAV-DF, RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization.
Conclusion: Weakly supervised localization based on reconstruction error is effective and feasible; combined with the AICL loss designed for intra-video contrast, it achieves accurate and generalizable temporal forgery localization without frame-level annotation.
Abstract: Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization. The prohibitive cost of frame-level annotation makes weakly supervised methods, which rely only on video-level labels, a practical necessity. To this end, we propose Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc), a weakly supervised temporal forgery localization framework that identifies forgeries via reconstruction errors. Our framework uses a Masked Autoencoder (MAE) trained exclusively on authentic data to learn its intrinsic spatiotemporal patterns; this allows the model to produce significant reconstruction discrepancies for forged segments, effectively providing the missing fine-grained cues for localization. To robustly leverage these indicators, we introduce a novel Asymmetric Intra-video Contrastive Loss (AICL). By focusing on the compactness of authentic features guided by these reconstruction cues, AICL establishes a stable decision boundary that enhances local discrimination while preserving generalization to unseen forgeries. Extensive experiments on large-scale datasets, including LAV-DF, demonstrate that RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization.
[114] Hypernetwork-Based Adaptive Aggregation for Multimodal Multiple-Instance Learning in Predicting Coronary Calcium Debulking
Kaito Shiku,Ichika Seo,Tetsuya Matoba,Rissei Hino,Yasuhiro Nakano,Ryoma Bise
Main category: cs.CV
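TL;DR: This paper makes the first attempt to estimate from CT images whether coronary artery calcification needs to be debulked, and proposes a hypernetwork-based adaptive aggregation transformer (HyperAdAgFormer) that uses tabular data to dynamically adjust its feature aggregation strategy; experiments confirm its effectiveness.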
Details
Motivation: When deciding whether to use a device to debulk coronary artery calcification, physicians adjust their focus and decision criteria according to each patient's tabular clinical data, and existing methods struggle to model this individualized decision process.
Method: Formulates the task as a multiple-instance learning (MIL) problem and proposes HyperAdAgFormer, in which a hypernetwork generates the parameters of the transformer's feature aggregation module from the patient's tabular data, yielding patient-specific feature fusion.
Result: Experiments on a clinical dataset show that HyperAdAgFormer significantly outperforms baseline methods and improves the assessment of debulking necessity.
Conclusion: Using tabular clinical data to guide image feature aggregation is key to this clinical decision task; HyperAdAgFormer offers a new approach to multimodal MIL, and the code is open source.
Abstract: In this paper, we present the first attempt to estimate the necessity of debulking coronary artery calcifications from computed tomography (CT) images. We formulate this task as a Multiple-Instance Learning (MIL) problem. The difficulty of this task lies in that physicians adjust their focus and decision criteria for device usage according to tabular data representing each patient's condition. To address this issue, we propose a hypernetwork-based adaptive aggregation transformer (HyperAdAgFormer), which adaptively modifies the feature aggregation strategy for each patient based on tabular data through a hypernetwork. Experiments on a clinical dataset demonstrated the effectiveness of HyperAdAgFormer. The code is publicly available at https://github.com/Shiku-Kaito/HyperAdAgFormer.
[115] SimGraph: A Unified Framework for Scene Graph-Based Image Generation and Editing
Thanh-Nhan Vo,Trong-Thuan Nguyen,Tam V. Nguyen,Minh-Triet Tran
Main category: cs.CV
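TL;DR: This paper proposes SimGraph, a unified scene graph-based framework that integrates image generation and editing. Token-based generation and diffusion-based editing give precise control over object relationships, layout, and spatial consistency, and the method outperforms existing approaches in experiments.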
Details
Motivation: Existing generation and editing methods are separate, which leads to poor spatial consistency and semantic coherence, and they lack structured control over object relationships and spatial layout.
Method: Proposes the SimGraph framework, which unifies scene graph-based image generation and editing, combining token-based generation and diffusion-based editing in a single scene graph-driven model for coordinated control.
Result: Experiments show that the method outperforms current state-of-the-art methods on both image generation and editing, significantly improving spatial consistency and semantic coherence.
Conclusion: Unified scene graph-based modeling effectively resolves the split between generation and editing; SimGraph offers a new paradigm for controllable image synthesis.
Abstract: Recent advancements in Generative Artificial Intelligence (GenAI) have significantly enhanced the capabilities of both image generation and editing. However, current approaches often treat these tasks separately, leading to inefficiencies and challenges in maintaining spatial consistency and semantic coherence between generated content and edits. Moreover, a major obstacle is the lack of structured control over object relationships and spatial arrangements. Scene graph-based methods, which represent objects and their interrelationships in a structured format, offer a solution by providing greater control over composition and interactions in both image generation and editing. To address this, we introduce SimGraph, a unified framework that integrates scene graph-based image generation and editing, enabling precise control over object interactions, layouts, and spatial coherence. In particular, our framework integrates token-based generation and diffusion-based editing within a single scene graph-driven model, ensuring high-quality and consistent results. Through extensive experiments, we empirically demonstrate that our approach outperforms existing state-of-the-art methods.
[116] HERS: Hidden-Pattern Expert Learning for Risk-Specific Vehicle Damage Adaptation in Diffusion Models
Teerapong Panboonyuen
Main category: cs.CV
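TL;DR: This paper proposes the HERS framework, which fine-tunes diffusion models in an annotation-free, self-supervised way to improve the realism, controllability, and domain alignment of generated vehicle damage images. It outperforms baselines on several metrics and discusses trustworthy deployment in high-risk settings such as insurance fraud detection.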
Details
Motivation: The ability of text-to-image diffusion models to generate realistic vehicle damage images threatens the reliability of automated insurance workflows, since such images could be used for fraud or claim manipulation; the fidelity and auditability of generated images need to improve.
Method: Proposes the HERS framework, which uses a large language model and a T2I pipeline to automatically generate image-text pairs, trains a separate expert model for each damage category (such as dents and scratches) in a self-supervised manner, and then merges the experts into a unified multi-damage model, achieving domain-specific fine-tuning without manual annotation.
Result: Evaluated on four diffusion backbones, HERS improves text faithfulness by 5.5% and human preference ratings by 2.3% over baselines; the paper also discusses practical implications for fraud detection, auditability, and safe deployment.
Conclusion: HERS improves the trustworthiness of vehicle damage image generation and highlights the need for trustworthy generative technology in high-risk domains such as insurance, weighing technical potential against ethical risk.
Abstract: Recent advances in text-to-image (T2I) diffusion models have enabled increasingly realistic synthesis of vehicle damage, raising concerns about their reliability in automated insurance workflows. The ability to generate crash-like imagery challenges the boundary between authentic and synthetic data, introducing new risks of misuse in fraud or claim manipulation. To address these issues, we propose HERS (Hidden-Pattern Expert Learning for Risk-Specific Damage Adaptation), a framework designed to improve fidelity, controllability, and domain alignment of diffusion-generated damage images. HERS fine-tunes a base diffusion model via domain-specific expert adaptation without requiring manual annotation. Using self-supervised image-text pairs automatically generated by a large language model and T2I pipeline, HERS models each damage category, such as dents, scratches, broken lights, or cracked paint, as a separate expert. These experts are later integrated into a unified multi-damage model that balances specialization with generalization. We evaluate HERS across four diffusion backbones and observe consistent improvements: +5.5% in text faithfulness and +2.3% in human preference ratings compared to baselines. Beyond image fidelity, we discuss implications for fraud detection, auditability, and safe deployment of generative models in high-stakes domains.
Our findings highlight both the opportunities and risks of domain-specific diffusion, underscoring the importance of trustworthy generation in safety-critical applications such as auto insurance.
[117] Vision KAN: Towards an Attention-Free Backbone for Vision with Kolmogorov-Arnold Networks
Zhuoqin Yang,Jiansong Zhang,Xiaoling Luo,Xu Wu,Zheng Lu,Linlin Shen
Main category: cs.CV
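TL;DR: This paper proposes Vision KAN (ViK), an attention-free vision backbone based on Kolmogorov-Arnold Networks. Its MultiPatch-RBFKAN module performs efficient token mixing and reaches competitive accuracy on ImageNet-1K while keeping linear complexity.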
Details
Motivation: Attention mechanisms have quadratic computational complexity and limited interpretability, while recent attention-free architectures show promise, which motivates the search for more efficient and interpretable alternatives.
Method: Proposes the ViK backbone, whose core is MultiPatch-RBFKAN: (a) a patch-wise nonlinear transform based on radial basis functions; (b) axis-wise separable mixing for efficient local propagation; (c) a low-rank global mapping for long-range interaction. A patch-wise grouping strategy reduces the cost of KANs on high-resolution features.
Result: On ImageNet-1K, ViK reaches accuracy comparable to mainstream models with linear computational complexity, supporting the effectiveness and efficiency of KAN-based token mixing.
Conclusion: ViK offers an attention-free, theoretically grounded paradigm for vision modeling that is both efficient and interpretable.
Abstract: Attention mechanisms have become a key module in modern vision backbones due to their ability to model long-range dependencies. However, their quadratic complexity in sequence length and the difficulty of interpreting attention weights limit both scalability and clarity. Recent attention-free architectures demonstrate that strong performance can be achieved without pairwise attention, motivating the search for alternatives. In this work, we introduce Vision KAN (ViK), an attention-free backbone inspired by Kolmogorov-Arnold Networks. At its core lies MultiPatch-RBFKAN, a unified token mixer that combines (a) patch-wise nonlinear transform with Radial Basis Function-based KANs, (b) axis-wise separable mixing for efficient local propagation, and (c) low-rank global mapping for long-range interaction. Employed as a drop-in replacement for attention modules, this formulation tackles the prohibitive cost of full KANs on high-resolution features by adopting a patch-wise grouping strategy with lightweight operators to restore cross-patch dependencies. Experiments on ImageNet-1K show that ViK achieves competitive accuracy with linear complexity, demonstrating the potential of KAN-based token mixing as an efficient and theoretically grounded alternative to attention.
[118] Bi-Anchor Interpolation Solver for Accelerating Generative Modeling
Hongxu Chen,Hongxiang Li,Zhen Wang,Long Chen
Main category: cs.CV
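TL;DR: This paper proposes BA-solver, a new solver in which a lightweight SideNet works alongside a frozen backbone. Without significantly increasing training cost, it sharply reduces the number of neural function evaluations (NFEs) needed for Flow Matching generation while preserving high fidelity and plug-and-play use.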
Details
Motivation: Flow Matching models face a significant latency bottleneck because they rely on iterative ODE solving; existing acceleration methods either degrade severely at low NFEs or are expensive to train and lack generality. Method: Proposes the Bi-Anchor Interpolation Solver (BA-solver): 1) Bidirectional Temporal Perception, in which a lightweight SideNet learns future and historical velocities without updating the backbone; 2) Bi-Anchor Velocity Integration, which uses high-precision anchor velocities from the backbone and intermediate velocities predicted by the SideNet for efficient batched high-order integration. Result: On ImageNet-256^2, 10 NFEs match the generation quality of a conventional Euler solver at 100+ NFEs, and 5 NFEs still maintain high fidelity; training overhead is minimal, and the solver integrates seamlessly into existing generative pipelines (e.g., image editing). Conclusion: BA-solver retains the generality of training-free solvers while markedly improving the inference efficiency and practicality of Flow Matching models. Abstract: Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error.
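To make the NFE budget concrete, here is a minimal sketch that is not BA-solver itself: it compares a first-order Euler solver with a second-order Heun solver on a toy ODE under the same budget of 10 velocity evaluations. The toy field `velocity`, the helper names, and the step counts are illustrative assumptions; in a Flow Matching model each call to `velocity` would correspond to one backbone forward pass (one NFE). BA-solver, as described above, instead obtains intermediate velocities from a lightweight SideNet, which this sketch does not model.

```python
import math

def velocity(x, t):
    # Toy stand-in for a learned velocity field: dx/dt = -x (assumption).
    return -x

def euler(x0, steps):
    """First-order Euler: one velocity evaluation (NFE) per step."""
    x, h, nfe = x0, 1.0 / steps, 0
    for i in range(steps):
        x += h * velocity(x, i * h)
        nfe += 1
    return x, nfe

def heun(x0, steps):
    """Second-order Heun: two velocity evaluations (NFEs) per step."""
    x, h, nfe = x0, 1.0 / steps, 0
    for i in range(steps):
        t = i * h
        v0 = velocity(x, t)               # velocity at the start of the interval
        v1 = velocity(x + h * v0, t + h)  # velocity at the Euler-predicted end
        x += 0.5 * h * (v0 + v1)          # average the two velocities
        nfe += 2
    return x, nfe

exact = math.exp(-1.0)  # closed-form solution at t = 1 for x(0) = 1
x_euler, nfe_euler = euler(1.0, steps=10)
x_heun, nfe_heun = heun(1.0, steps=5)
err_euler = abs(x_euler - exact)
err_heun = abs(x_heun - exact)
print(nfe_euler, nfe_heun, err_euler > err_heun)
```

Both runs spend 10 evaluations, but the higher-order update lands closer to the exact solution on this toy problem, which is the general motivation for spending a fixed NFE budget on higher-order integration.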
Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
[119] Unifying Heterogeneous Degradations: Uncertainty-Aware Diffusion Bridge Model for All-in-One Image Restoration
Luwei Tu,Jiawei Wu,Xing Luo,Zhi Jin
Main category: cs.CV
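TL;DR: This paper proposes an Uncertainty-Aware Diffusion Bridge Model (UDBM) that recasts all-in-one image restoration (AiOIR) as a stochastic transport problem driven by pixel-wise uncertainty. A relaxed diffusion bridge and a dual modulation strategy address conflicting optimization objectives across degradations and the drift singularity, yielding state-of-the-art performance with single-step inference.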
Details
Motivation: All-in-one image restoration (AiOIR) faces the fundamental challenge of conflicting optimization objectives across heterogeneous degradations; existing methods are limited by coarse-grained control or fixed mapping schedules and adapt poorly. Method: Proposes the Uncertainty-Aware Diffusion Bridge Model (UDBM), which models AiOIR as a stochastic transport problem driven by pixel-wise uncertainty; replaces the strict terminal constraint with a relaxed diffusion bridge to model degradation uncertainty and remove the drift singularity; and designs a dual modulation strategy in which the noise schedule aligns diverse degradations into a shared high-entropy latent space and the path schedule adaptively regulates the transport trajectory based on entropy-regularized viscous dynamics. Result: UDBM achieves state-of-the-art (SOTA) performance across multiple image restoration tasks with single-step inference. Conclusion: By reshaping the transport geometry and dynamics, UDBM handles diverse image degradations in a unified way, advancing both theoretical rigor and practical performance. Abstract: All-in-One Image Restoration (AiOIR) faces the fundamental challenge of reconciling conflicting optimization objectives across heterogeneous degradations. Existing methods are often constrained by coarse-grained control mechanisms or fixed mapping schedules, yielding suboptimal adaptation. To address this, we propose an Uncertainty-Aware Diffusion Bridge Model (UDBM), which innovatively reformulates AiOIR as a stochastic transport problem steered by pixel-wise uncertainty. By introducing a relaxed diffusion bridge formulation which replaces the strict terminal constraint with a relaxed constraint, we model the uncertainty of degradations while theoretically resolving the drift singularity inherent in standard diffusion bridges. Furthermore, we devise a dual modulation strategy: the noise schedule aligns diverse degradations into a shared high-entropy latent space, while the path schedule adaptively regulates the transport trajectory motivated by the viscous dynamics of entropy regularization. By effectively rectifying the transport geometry and dynamics, UDBM achieves state-of-the-art performance across diverse restoration tasks within a single inference step.
[120] HydroSense: A Dual-Microcontroller IoT Framework for Real-Time Multi-Parameter Water Quality Monitoring with Edge Processing and Cloud Analytics
Abdul Hasib,A. S. M. Ahsanul Sarkar Akib,Anish Giri
Main category: cs.CV
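TL;DR: HydroSense is a low-cost, high-accuracy IoT water quality monitoring framework with a dual-microcontroller architecture (Arduino Uno + ESP32). It monitors pH, dissolved oxygen, temperature, TDS, nitrogen content, and water level in real time at about 15% of the cost of commercial systems, making it suitable for resource-constrained regions.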
Details
Motivation: The global water crisis calls for affordable, accurate, real-time water quality monitoring, yet manual sampling and expensive commercial systems are hard to deploy widely in resource-constrained settings. Method: Proposes the HydroSense IoT framework, in which an Arduino Uno performs precision analog measurements with five-point calibration and an ESP32 handles wireless communication, edge processing, and cloud integration; signal processing includes median filtering, temperature compensation, and robust error handling. Result: A 90-day experiment showed a pH error of ±0.08, DO stability of ±0.2 mg/L, a TDS error of ±1.9%, and 99.8% cloud transmission reliability; the total cost is 32,983 BDT (about 300 USD), an 85% reduction relative to commercial systems. Conclusion: Through careful system architecture and low-cost component selection, HydroSense delivers professional-grade water quality monitoring and sets an example for accessible environmental monitoring. Abstract: The global water crisis necessitates affordable, accurate, and real-time water quality monitoring solutions. Traditional approaches relying on manual sampling or expensive commercial systems fail to address accessibility challenges in resource-constrained environments. This paper presents HydroSense, an innovative Internet of Things framework that integrates six critical water quality parameters including pH, dissolved oxygen (DO), temperature, total dissolved solids (TDS), estimated nitrogen, and water level into a unified monitoring system. HydroSense employs a novel dual-microcontroller architecture, utilizing Arduino Uno for precision analog measurements with five-point calibration algorithms and ESP32 for wireless connectivity, edge processing, and cloud integration. The system implements advanced signal processing techniques including median filtering for TDS measurement, temperature compensation algorithms, and robust error handling. Experimental validation over 90 days demonstrates exceptional performance metrics: pH accuracy of ±0.08 units across the 0 to 14 range, DO measurement stability within ±0.2 mg/L, TDS accuracy of ±1.9% across 0 to 1000 ppm, and 99.8% cloud data transmission reliability. With a total implementation cost of 32,983 BDT (approximately 300 USD), HydroSense achieves an 85% cost reduction compared to commercial systems while providing enhanced connectivity through the Firebase real-time database.
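The two signal-processing steps named above, median filtering and temperature compensation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the HydroSense firmware: the function names, the window size, the sample values, and the linear 2%-per-degree model normalized to 25 °C (a commonly used approximation for conductivity-style readings) are all assumptions that the abstract does not specify.

```python
from statistics import median

def median_filter(samples, window=5):
    """Sliding median; windows are truncated at the edges of the series."""
    half = window // 2
    return [
        median(samples[max(0, i - half): i + half + 1])
        for i in range(len(samples))
    ]

def compensate_to_25c(reading, temp_c, coeff=0.02):
    """Normalize a conductivity-style reading to 25 degrees C (assumed linear model)."""
    return reading / (1.0 + coeff * (temp_c - 25.0))

# Made-up ppm values with a single spike, for illustration only.
raw_tds = [412, 415, 980, 414, 413, 411, 416]
smoothed = median_filter(raw_tds, window=5)
compensated = [compensate_to_25c(v, temp_c=30.0) for v in smoothed]
print(smoothed)
```

A median is used rather than a mean because a single spurious spike cannot drag the filtered value away from the surrounding readings.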
This research establishes a new paradigm for accessible environmental monitoring, demonstrating that professional-grade water quality assessment can be achieved through intelligent system architecture and cost-effective component selection.
[121] WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models
Zijin Yang,Yu Sun,Kejiang Chen,Jiawei Zhao,Jun Jiang,Weiming Zhang,Nenghai Yu
Main category: cs.CV
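TL;DR: This paper proposes WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking based on vision-language models (VLMs). It redefines quality and security metrics separately for residual and semantic watermarks and uses a three-stage training strategy to enable classification, scoring, and interpretable text generation.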
Details
Motivation: Existing watermark evaluation methods lack a unified framework, are not interpretable, neglect comprehensive security considerations, and apply inappropriate metrics to semantic watermarks. Method: Proposes the WMVLM framework, which uses vision-language models and defines artifact strength and erasure resistance metrics for residual watermarks and a latent distribution shift metric for semantic watermarks; training follows three stages (classification, then scoring, then interpretable text generation). Result: WMVLM shows better generalization and evaluation performance than state-of-the-art VLMs across multiple datasets, diffusion models, and watermarking methods. Conclusion: WMVLM is the first framework to support unified, interpretable evaluation of residual and semantic watermarks, improving evaluation reliability and security analysis during watermarking algorithm development. Abstract: Digital watermarking is essential for securing generated images from diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often use inappropriate metrics for semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy to progressively enable the model to achieve classification, scoring, and interpretable text generation. Experiments show WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods.
[122] PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization
Songhan Jiang,Fengchun Liu,Ziyue Wang,Linghan Cai,Yongbing Zhang
Main category: cs.CV
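TL;DR: This paper presents PathReasoner, the first large-scale dataset for whole-slide image (WSI) reasoning, and builds on it PathReasoner-R1, a model with structured chain-of-thought capabilities. Knowledge-guided data generation and a multi-granular reward mechanism improve the interpretability and clinical trustworthiness of pathology diagnosis.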
Details
Motivation: Existing vision-language models in computational pathology lack verifiable, evidence-linked reasoning, which leads to low clinical trust and makes expert error correction difficult. Method: Builds PathReasoner, a knowledge-graph-driven WSI reasoning dataset (20K samples); proposes PathReasoner-R1, which combines trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning and uses a knowledge-aware multi-granular reward function (including an entity reward mechanism) to ensure logical consistency. Result: PathReasoner-R1 achieves SOTA performance on PathReasoner and several public benchmarks, supports reasoning across image scales, and markedly improves model transparency and clinical plausibility. Conclusion: A structured, knowledge-aligned reasoning training paradigm can strengthen the interpretability, robustness, and clinical applicability of VLMs in pathology diagnosis, laying groundwork for trustworthy AI-assisted pathology. Abstract: Vision-Language Models (VLMs) are advancing computational pathology with superior visual understanding capabilities. However, current systems often reduce diagnosis to directly outputting conclusions without verifiable evidence-linked reasoning, which severely limits clinical trust and hinders expert error rectification. To address these barriers, we construct PathReasoner, the first large-scale dataset of whole-slide image (WSI) reasoning. Unlike previous work reliant on unverified distillation, we develop a rigorous knowledge-guided generation pipeline. By leveraging medical knowledge graphs, we explicitly align structured pathological findings and clinical reasoning with diagnoses, generating over 20K high-quality instructional samples. Based on this dataset, we propose PathReasoner-R1, which synergizes trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning to instill structured chain-of-thought capabilities. To ensure medical rigor, we engineer a knowledge-aware multi-granular reward function incorporating an Entity Reward mechanism strictly aligned with knowledge graphs. This effectively guides the model to optimize for logical consistency rather than mere outcome matching, thereby enhancing robustness. Extensive experiments demonstrate that PathReasoner-R1 achieves state-of-the-art performance on both PathReasoner and public benchmarks across various image scales, equipping pathology models with transparent, clinically grounded reasoning capabilities.
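One way to picture an entity-level reward is as an overlap score between the entities a model mentions in its reasoning and the entities a knowledge graph associates with the case. The sketch below assumes a simple set-based F1; the abstract does not state how the Entity Reward is actually computed, so the function `entity_reward`, the normalization, and the example entity strings are all hypothetical.

```python
def entity_reward(predicted, reference):
    """F1 overlap between predicted and reference entity sets, in [0, 1]."""
    pred = {e.strip().lower() for e in predicted}
    ref = {e.strip().lower() for e in reference}
    if not pred or not ref:
        return 0.0
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)  # share of predicted entities that are supported
    recall = hits / len(ref)      # share of reference entities that were mentioned
    return 2 * precision * recall / (precision + recall)

# Made-up entity strings, for illustration only.
reference = ["nuclear atypia", "glandular architecture", "mitotic figures"]
predicted = ["Nuclear atypia", "mitotic figures", "necrosis"]
reward = entity_reward(predicted, reference)
print(round(reward, 3))
```

A reward of this shape scores the intermediate reasoning rather than only the final diagnosis, which is the contrast with "mere outcome matching" drawn in the abstract.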
Dataset and code are available at https://github.com/cyclexfy/PathReasoner-R1.
[123] Similarity of Processing Steps in Vision Model Representations
Matéo Mahaut,Marco Baroni
Main category: cs.CV
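TL;DR: This paper studies how different vision models converge to similar representations. Although final representations may be similar, intermediate processing steps and operations differ markedly: classifier models discard low-level image statistics, and CNN and Transformer models change their representations in different ways.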
Details
Motivation: To examine whether different vision models converge not only in their final representations but also in their intermediate processing steps and operations. Method: Quantifies representation distances between models at different processing stages, tracks how inter-model distances evolve, and identifies the processing steps where models differ most. Result: Layers at the same relative position have the most similar representations, yet strong differences remain; classifier models discard low-level image statistics in their final layers; Transformer models change representations more smoothly from layer to layer than CNNs. Conclusion: Vision models differ in the degree and nature of representational convergence, which helps deepen understanding of the internal processing of image models. Abstract: Recent literature suggests that the bigger the model, the more likely it is to converge to similar, "universal" representations, despite different training objectives, datasets, or modalities. While this literature shows that there is an area where model representations are similar, we study here how vision models might get to those representations; in particular, do they also converge to the same intermediate steps and operations? We therefore study the processes that lead to convergent representations in different models. First, we quantify distance between different model representations at different stages. We follow the evolution of distances between models throughout processing, identifying the processing steps which are most different between models. We find that while layers at similar positions in different models have the most similar representations, strong differences remain. Classifier models, unlike the others, will discard information about low-level image statistics in their final layers. CNN- and transformer-based models also behave differently, with transformer models applying smoother changes to representations from one layer to the next. These distinctions clarify the level and nature of convergence between model representations, and enable a more qualitative account of the underlying processes in image models.
[124] A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion
Pu Cao,Yiyang Ma,Feng Zhou,Xuedan Yin,Qing Song,Lu Yang
Main category: cs.CV
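TL;DR: This paper shows that, in latent diffusion models, autoencoder (AE) evaluation is overly biased toward generative metrics (such as gFID) at the expense of reconstruction fidelity. The bias has limited effect on ImageNet image generation but causes condition drift and harms controllability in controllable diffusion. Empirically, reconstruction metrics (especially instance-level ones) better reflect controllability, which gives new guidance for AE evaluation and selection in controllable generation.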
Details
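Motivation: Existing ImageNet-scale autoencoder studies rely heavily on generative metrics (such as gFID) for evaluation and selection while neglecting reconstruction fidelity. This bias looks harmless for plain image generation but may seriously damage condition alignment when extended to controllable diffusion (e.g., ControlNet), so its mechanism and alternative evaluation criteria need systematic analysis. Method: Explains theoretically why a gFID-dominant preference seems reasonable for ImageNet generation yet causes condition drift in controllable diffusion; proposes a multi-dimensional condition-drift evaluation protocol that quantifies how well different AEs preserve conditions in controllable generation tasks; empirically compares, on several recent ImageNet AEs, how well gFID and reconstruction metrics (such as LPIPS, DISTS, and instance-level error) predict controllability; and uses ControlNet experiments to verify that controllability tracks condition preservation rather than gFID. Result: gFID is only weakly correlated with condition preservation, whereas reconstruction-oriented metrics (especially instance-level ones) predict controllability substantially better; ControlNet experiments confirm that controllability follows condition preservation rather than gFID; the results expose a substantive gap between the current ImageNet-centric AE evaluation paradigm and the needs of controllable diffusion. Conclusion: Autoencoder evaluation should not pursue gFID alone; reconstruction fidelity (especially fine-grained, instance-level metrics) should be a key dimension for AE selection and benchmarking in controllable diffusion. The study provides a theoretical basis and practical guidance for more reliable, application-oriented AE evaluation. Abstract: In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits achievable condition alignment. Meanwhile, we find that reconstruction fidelity, especially instance-level measures, better indicates controllability. We empirically validate the impact of tilted autoencoder evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol reflecting controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID.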
Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, offering practical guidance for more reliable benchmarking and model selection.
[125] RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning
Shiqi Huang,Shuting He,Bihan Wen
Main category: cs.CV
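TL;DR: This paper proposes the RSGround-R1 framework, which uses chain-of-thought supervised fine-tuning, reinforcement fine-tuning, and spatial consistency optimization to improve the spatial reasoning of multimodal large language models on remote sensing visual grounding.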
Details
Motivation: Remote sensing scenes are large in scale and semantically ambiguous, and natural language descriptions depend heavily on positional cues, which poses unique spatial reasoning challenges for multimodal large language models. Method: Proposes RSGround-R1, a reasoning-guided, position-aware post-training framework comprising: 1) chain-of-thought supervised fine-tuning (CoT-SFT) on synthetic RSVG reasoning data; 2) reinforcement fine-tuning (RFT) with a distance-aware positional reward; 3) a spatial-consistency-guided optimization strategy that stabilizes policy updates. Result: Experiments on RSVG benchmarks show that the method outperforms existing approaches in both performance and generalization. Conclusion: RSGround-R1 strengthens explicit position awareness and spatial reasoning in MLLMs for remote sensing visual grounding and offers a new approach to position-dependent multimodal tasks. Abstract: Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To leverage this unique feature, we propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) using synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented by our newly designed positional reward that provides continuous and distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behaviors across rollouts, we introduce a spatial consistency guided optimization scheme that dynamically adjusts policy updates based on their spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate superior performance and generalization of our model.
[126] OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
Yufeng Zhong,Lei Chen,Xuanle Zhao,Wenkang Han,Liming Zheng,Jing Huang,Deyang Jiang,Yilin Cao,Lin Ma,Zhixiong Zeng
Main category: cs.CV
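TL;DR: This paper proposes OCRVerse, the first end-to-end model that unifies text-centric and vision-centric OCR. A multi-domain two-stage (SFT + RL) training method yields strong OCR performance on a variety of documents and visually dense images (such as charts, web pages, and scientific plots).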
Details
Motivation: Existing OCR methods focus mainly on text recognition (text-centric OCR) and neglect the recognition of visual elements in visually information-dense images such as charts and web pages (vision-centric OCR), even though such images are widespread on the internet and have significant application value. Method: Proposes the OCRVerse framework, builds a comprehensive dataset covering text-centric sources (newspapers, magazines, etc.) and vision-centric sources (charts, web pages, scientific plots), and trains in two stages: supervised fine-tuning (SFT) on mixed multi-domain data to establish initial knowledge, then reinforcement learning (RL) with flexible per-domain reward strategies that fit different output formats and objectives. Result: OCRVerse achieves competitive results on both text-centric and vision-centric OCR tasks, comparable to large-scale open-source and closed-source models. Conclusion: OCRVerse is the first to unify text and vision OCR capabilities end to end, supporting the effectiveness of multi-domain joint modeling and customized RL optimization for OCR. Abstract: The development of large vision language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (Text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (Vision-centric OCR), such as charts, web pages and science plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method in an end-to-end manner that enables unified text-centric OCR and vision-centric OCR. To this end, we construct comprehensive data engineering to cover a wide range of text-centric documents, such as newspapers, magazines and books, as well as vision-centric rendered composites, including charts, web pages and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. SFT directly mixes cross-domain data to train and establish initial domain knowledge, while RL focuses on designing personalized reward strategies for the characteristics of each domain.
Specifically, since different domains require various output formats and expected outputs, we provide sufficient flexibility in the RL stage to customize flexible reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, achieving competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.
[127] CAF-Mamba: Mamba-Based Cross-Modal Adaptive Attention Fusion for Multimodal Depression Detection
Bowen Zhou,Marc-André Fiedler,Ayoub Al-Hamadi
Main category: cs.CV
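TL;DR: This paper proposes CAF-Mamba, a Mamba-based cross-modal adaptive attention fusion framework for depression detection. It models cross-modal interactions both explicitly and implicitly, adjusts modality weights dynamically, and achieves SOTA performance on the LMVD and D-Vlog datasets.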
Details
Motivation: Existing deep learning methods for depression detection rely on limited feature types, overlook explicit cross-modal interactions, and use simple fusion (such as concatenation or static weighting). Method: Proposes the CAF-Mamba framework, which builds on the Mamba model, introduces a modality-wise adaptive attention mechanism, models cross-modal interactions explicitly and implicitly, and dynamically adjusts each modality's contribution for better multimodal fusion. Result: On two in-the-wild benchmark datasets, LMVD and D-Vlog, CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance. Conclusion: CAF-Mamba addresses a key fusion bottleneck in multimodal depression detection and supports the strength and generalization of a Mamba architecture driven by dynamic attention for this task. Abstract: Depression is a prevalent mental health disorder that severely impairs daily functioning and quality of life. While recent deep learning approaches for depression detection have shown promise, most rely on limited feature types, overlook explicit cross-modal interactions, and employ simple concatenation or static weighting for fusion. To overcome these limitations, we propose CAF-Mamba, a novel Mamba-based cross-modal adaptive attention fusion framework. CAF-Mamba not only captures cross-modal interactions explicitly and implicitly, but also dynamically adjusts modality contributions through a modality-wise attention mechanism, enabling more effective multimodal fusion. Experiments on two in-the-wild benchmark datasets, LMVD and D-Vlog, demonstrate that CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance.
[128] Few-Shot Domain Adaptation with Temporal References and Static Priors for Glacier Calving Front Delineation
Marcel Dreier,Nora Gourmelon,Dakota Pyles,Thorsten Seehaus,Matthias H. Braun,Andreas Maier,Vincent Christlein
Main category: cs.CV
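TL;DR: This paper proposes a few-shot domain adaptation method that requires no changes to the network architecture. By combining spatial static prior knowledge with summer reference images, it markedly improves the generalization of glacier calving front segmentation models to new study sites, reducing the delineation error from 1131.6 m to 68.7 m.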
Details
Motivation: State-of-the-art glacier calving front segmentation models approach human-level performance on benchmarks, but on new study sites (out-of-distribution data) their accuracy is insufficient for downstream scientific use. Method: Applies a few-shot domain adaptation strategy, incorporates spatial static prior knowledge, and adds summer reference images to the input time series. Result: Without changing the model architecture, the calving front delineation error drops from 1131.6 m to 68.7 m. Conclusion: The method provides a workable framework for deploying deep learning models at new study sites, making global-scale calving front monitoring feasible. Abstract: During benchmarking, the state-of-the-art model for glacier calving front delineation achieves near-human performance. However, when applied in a real-world setting at a novel study site, its delineation accuracy is insufficient for calving front products intended for further scientific analyses. This site represents an out-of-distribution domain for a model trained solely on the benchmark dataset. By employing a few-shot domain adaptation strategy, incorporating spatial static prior knowledge, and including summer reference images in the input time series, the delineation error is reduced from 1131.6 m to 68.7 m without any architectural modifications. These methodological advancements establish a framework for applying deep learning-based calving front segmentation to novel study sites, enabling calving front monitoring on a global scale.
[129] When Gradient Optimization Is Not Enough: $\dagger$ Dispersive and Anchoring Geometric Regularizer for Multimodal Learning
Zixuan Xia,Hao Wang,Pengcheng Weng,Yanyu Qian,Yangxin Xu,William Dan,Fei Wang
Main category: cs.CV
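TL;DR: This paper proposes \regName, a lightweight geometry-aware regularization framework. It imposes two constraints on intermediate embeddings, intra-modal dispersion and inter-modal anchoring, to improve representation geometry in multimodal learning and mitigate trade-offs between modalities.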
Details
Motivation: In multimodal learning, even well-optimized models often show geometric pathologies such as intra-modal representation collapse and sample-level cross-modal inconsistency, which hurt unimodal robustness and multimodal fusion. Method: The \regName framework has two complementary constraints: an intra-modal dispersive regularization (promoting representation diversity) and an inter-modal anchoring regularization (bounding sample-level cross-modal drift without rigid alignment). It is plug-and-play, requires no architectural changes, and is compatible with various training paradigms. Result: Extensive experiments on multiple multimodal benchmarks show that \regName consistently improves multimodal and unimodal performance and mitigates modality trade-offs. Conclusion: Explicitly regulating representation geometry is an important new dimension for improving multimodal learning, and \regName offers a simple, effective, general remedy for pathological multimodal representations. Abstract: Multimodal learning aims to integrate complementary information from heterogeneous modalities, yet strong optimization alone does not guarantee well-structured representations. Even under carefully balanced training schemes, multimodal models often exhibit geometric pathologies, including intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion. We identify representation geometry as a missing control axis in multimodal learning and propose \regName, a lightweight geometry-aware regularization framework. \regName enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. The proposed regularizer is plug-and-play, requires no architectural modifications, and is compatible with various training paradigms. Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.
[130] Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification
Dexuan Ding,Ciyuan Peng,Endrowednes Kuantama,Jingcai Guo,Jia Wu,Jian Yang,Amin Beheshti,Ming-Hsuan Yang,Yuankai Qi
Main category: cs.CV
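TL;DR: This paper proposes Multimodal Visual Surrogate Compression (MVSC), which compresses high-dimensional 3D sMRI images into compact 2D visual surrogate features that better fit frozen 2D foundation models, improving Alzheimer's disease classification.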
Details
Motivation: Existing sMRI representation learning methods suffer from high computational cost, loss of cross-slice relations, and limited ability to extract discriminative features. Method: Proposes the MVSC framework with two core modules: 1) a Volume Context Encoder that captures global cross-slice context under textual guidance; 2) an Adaptive Slice Fusion module that aggregates slice information in a text-enhanced, patch-wise manner. Result: On three large-scale AD datasets, MVSC outperforms current state-of-the-art methods on both binary and multi-class classification. Conclusion: By compressing 3D sMRI into 2D visual surrogate features aligned with 2D foundation models, MVSC improves representation quality and classification performance for AD diagnosis. Abstract: High-dimensional structural MRI (sMRI) images are widely used for Alzheimer's Disease (AD) diagnosis. Most existing methods for sMRI representation learning rely on 3D architectures (e.g., 3D CNNs), slice-wise feature extraction with late aggregation, or training-free feature extraction using 2D foundation models (e.g., DINO). However, these three paradigms suffer from high computational cost, loss of cross-slice relations, and limited ability to extract discriminative features, respectively. To address these challenges, we propose Multimodal Visual Surrogate Compression (MVSC). It learns to compress and adapt large 3D sMRI volumes into compact 2D features, termed visual surrogates, which are better aligned with frozen 2D foundation models to extract powerful representations for final AD classification. MVSC has two key components: a Volume Context Encoder that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner. Extensive experiments on three large-scale Alzheimer's disease benchmarks demonstrate that MVSC performs favourably on both binary and multi-class classification tasks compared against state-of-the-art methods.
[131] ChartE$^{3}$: A Comprehensive Benchmark for End-to-End Chart Editing
Shuo Li,Jiajun Sun,Zhekai Wang,Xiaoran Fan,Hui Li,Dingwen Yang,Zhiheng Xi,Yijun Wang,Zifei Shan,Tao Gui,Qi Zhang,Xuanjing Huang
Main category: cs.CV
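TL;DR: This paper presents ChartE$^{3}$, an end-to-end chart editing benchmark that evaluates whether models can edit charts directly from multimodal instructions without intermediate natural language or code representations. It covers two dimensions: local appearance adjustments and global data-driven transformations.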
Details
Motivation: Existing chart editing methods mostly use pipeline designs that rely on natural language or code as an intermediate representation and struggle to execute complex edits faithfully; a benchmark that directly evaluates end-to-end editing is needed. Method: Builds the ChartE$^{3}$ benchmark with more than 1,200 high-quality samples, each a triplet of chart image, corresponding source code, and multimodal editing instruction; defines two task types, local editing (e.g., font, color) and global editing (e.g., data filtering, trend line addition). Result: Broad evaluation of current mainstream multimodal large models shows substantial performance gaps in end-to-end chart editing, especially on global editing tasks. Conclusion: ChartE$^{3}$ exposes key limitations of existing models in understanding and executing complex, data-driven chart editing instructions and provides an evaluation tool for future end-to-end visualization editing research. Abstract: Charts are a fundamental visualization format for structured data analysis. Enabling end-to-end chart editing according to user intent is of great practical value, yet remains challenging due to the need for both fine-grained control and global structural consistency. Most existing approaches adopt pipeline-based designs, where natural language or code serves as an intermediate representation, limiting their ability to faithfully execute complex edits. We introduce ChartE$^{3}$, an End-to-End Chart Editing benchmark that directly evaluates models without relying on intermediate natural language programs or code-level supervision. ChartE$^{3}$ focuses on two complementary editing dimensions: local editing, which involves fine-grained appearance changes such as font or color adjustments, and global editing, which requires holistic, data-centric transformations including data filtering and trend line addition. ChartE$^{3}$ contains over 1,200 high-quality samples constructed via a well-designed data pipeline with human curation. Each sample is provided as a triplet of a chart image, its underlying code, and a multimodal editing instruction, enabling evaluation from both objective and subjective perspectives. Extensive benchmarking of state-of-the-art multimodal large language models reveals substantial performance gaps, particularly on global editing tasks, highlighting critical limitations in current end-to-end chart editing capabilities.
[132] DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning
Mingshuang Luo,Shuang Liang,Zhengkun Rong,Yuxuan Luo,Tianshu Hu,Ruibing Hou,Hong Chang,Yong Li,Yuan Zhang,Mingyuan Gao
Main category: cs.CV
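TL;DR: DreamActor-M2 is a universal character animation framework that needs no explicit pose priors. A two-stage paradigm (unified latent space modeling and self-bootstrapped pseudo-data synthesis) addresses the trade-off between identity preservation and motion consistency, and the method achieves SOTA performance on the new AW Bench benchmark.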
Details
Motivation: Existing methods have two core problems: suboptimal motion injection strategies make identity preservation and motion consistency hard to achieve together (the "see-saw" effect), and over-reliance on explicit pose priors (such as skeletons) limits generalization to non-humanoid characters and complex dynamics. Method: Proposes the DreamActor-M2 framework with a two-stage paradigm: the first stage fuses reference appearance and driving motion cues into a unified latent space and uses the generative prior of foundation models to jointly model spatial identity and temporal dynamics; the second stage designs a self-bootstrapped pseudo cross-identity data synthesis pipeline that enables the transition from pose-dependent control to end-to-end RGB-driven animation. A new benchmark, AW Bench, is built for comprehensive evaluation. Result: Achieves SOTA performance on AW Bench, markedly improving visual fidelity and generalization across characters and motion scenarios, and supports animation of arbitrary non-humanoid characters. Conclusion: DreamActor-M2 recasts character animation as an in-context learning problem and removes the dependence on explicit pose representations, offering a new approach to universal, high-fidelity, well-generalizing image-driven animation. Abstract: Character image animation aims to synthesize high-fidelity videos by transferring motion from a driving sequence to a static reference image. Despite recent advancements, existing methods suffer from two fundamental challenges: (1) suboptimal motion injection strategies that lead to a trade-off between identity preservation and motion consistency, manifesting as a "see-saw", and (2) an over-reliance on explicit pose priors (e.g., skeletons), which inadequately capture intricate dynamics and hinder generalization to arbitrary, non-humanoid characters. To address these challenges, we present DreamActor-M2, a universal animation framework that reimagines motion conditioning as an in-context learning problem. Our approach follows a two-stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space, enabling the model to jointly reason about spatial identity and temporal dynamics by leveraging the generative prior of foundational models. Second, we introduce a self-bootstrapped data synthesis pipeline that curates pseudo cross-identity training pairs, facilitating a seamless transition from pose-dependent control to direct, end-to-end RGB-driven animation. This strategy significantly enhances generalization across diverse characters and motion scenarios. To facilitate comprehensive evaluation, we further introduce AW Bench, a versatile benchmark encompassing a wide spectrum of character types and motion scenarios.
Extensive experiments demonstrate that DreamActor-M2 achieves state-of-the-art performance, delivering superior visual fidelity and robust cross-domain generalization. Project Page: https://grisoon.github.io/DreamActor-M2/
[133] From Global to Granular: Revealing IQA Model Performance via Correlation Surface
Baoliang Chen,Danni Huang,Hanwei Zhu,Lingyu Zhu,Wei Zhou,Shiqi Wang,Yuming Fang,Weisi Lin
Main category: cs.CV
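TL;DR: This paper proposes Granularity-Modulated Correlation (GMC) for fine-grained evaluation of image quality assessment (IQA) models. It addresses two weaknesses of conventional global correlation metrics (e.g., SRCC, PLCC): they cannot show how ranking consistency varies across the local quality spectrum, and they are sensitive to the test-sample distribution.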
Details
Motivation: Existing IQA evaluation metrics (such as PLCC and SRCC) give a single scalar score, cannot reveal how a model behaves in different quality ranges (e.g., high MOS or small ΔMOS), and are easily affected by the quality distribution of the test set, which makes cross-dataset comparison unstable. Method: Proposes the GMC framework with two core components: (1) a Granularity Modulator, a Gaussian-weighted correlation computation that analyzes local correlation conditioned on absolute MOS values and on |ΔMOS|; (2) a Distribution Regulator that regularizes correlations to reduce bias from non-uniform quality distributions. The output is a 3D correlation surface with MOS and |ΔMOS| as coordinates. Result: Experiments on standard IQA benchmarks show that GMC reveals performance characteristics that conventional scalar metrics miss (such as an advantage in ranking high-quality images or in discriminating small quality differences), making model analysis, comparison, and deployment more reliable and informative. Conclusion: GMC is a more structured, fine-grained, and distribution-robust IQA evaluation paradigm that shifts evaluation from a single score toward a multi-dimensional performance map. Abstract: Evaluation of Image Quality Assessment (IQA) models has long been dominated by global correlation metrics, such as Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank-Order Correlation Coefficient (SRCC). While widely adopted, these metrics reduce performance to a single scalar, failing to capture how ranking consistency varies across the local quality spectrum. For example, two IQA models may achieve identical SRCC values, yet one ranks high-quality images (related to high Mean Opinion Score, MOS) more reliably, while the other better discriminates image pairs with small quality/MOS differences (related to $|\Delta\mathrm{MOS}|$). Such complementary behaviors are invisible under global metrics. Moreover, SRCC and PLCC are sensitive to test-sample quality distributions, yielding unstable comparisons across test sets. To address these limitations, we propose Granularity-Modulated Correlation (GMC), which provides a structured, fine-grained analysis of IQA performance. GMC includes: (1) a Granularity Modulator that applies Gaussian-weighted correlations conditioned on absolute MOS values and pairwise MOS differences ($|\Delta\mathrm{MOS}|$) to examine local performance variations, and (2) a Distribution Regulator that regularizes correlations to mitigate biases from non-uniform quality distributions. The resulting correlation surface maps correlation values as a joint function of MOS and $|\Delta\mathrm{MOS}|$, providing a 3D representation of IQA performance.
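The idea of a Gaussian-weighted correlation conditioned on MOS can be illustrated with a weighted Pearson coefficient. This is a minimal sketch under stated assumptions, not the GMC implementation: the function name, the choice of Pearson rather than a rank correlation, the bandwidth `sigma`, and the toy scores are all assumptions, and the pairwise $|\Delta\mathrm{MOS}|$ axis and the Distribution Regulator are not modeled.

```python
import math

def gaussian_weighted_pearson(pred, mos, center, sigma):
    """Pearson correlation with samples weighted by closeness of MOS to `center`."""
    w = [math.exp(-0.5 * ((m - center) / sigma) ** 2) for m in mos]
    total = sum(w)
    mean_p = sum(wi * p for wi, p in zip(w, pred)) / total
    mean_m = sum(wi * m for wi, m in zip(w, mos)) / total
    cov = sum(wi * (p - mean_p) * (m - mean_m) for wi, p, m in zip(w, pred, mos))
    var_p = sum(wi * (p - mean_p) ** 2 for wi, p in zip(w, pred))
    var_m = sum(wi * (m - mean_m) ** 2 for wi, m in zip(w, mos))
    return cov / math.sqrt(var_p * var_m)

# Toy scores: the hypothetical model tracks MOS at the low end and inverts it at the high end.
mos = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
pred = [1.0, 1.5, 2.0, 2.5, 3.0, 3.4, 3.3, 3.2, 3.1]
low = gaussian_weighted_pearson(pred, mos, center=2.0, sigma=0.5)
high = gaussian_weighted_pearson(pred, mos, center=4.5, sigma=0.5)
print(low > 0.9, high < 0.0)
```

Sweeping `center` across the MOS range would give one slice of a correlation surface: a single global coefficient would hide the fact that this toy model is reliable at low quality and unreliable at high quality.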
Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models. Code is available at https://github.com/Dniaaa/GMC.
[134] Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation
Jiankun Peng,Jianyuan Guo,Ying Xu,Yue Liu,Jiashuang Yan,Xuanwei Ye,Houhua Li,Xiaoming Wang
Main category: cs.CV
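TL;DR: This paper proposes DGNav, a framework that addresses the granularity rigidity of topological maps in vision-language navigation through a scene-aware adaptive strategy and a dynamic graph transformer. It densifies the map on demand and optimizes dynamic edge weights, significantly improving navigation performance and safety.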
Details
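Motivation: Existing topological planning methods for vision-language navigation suffer from "granularity rigidity": sampling nodes with fixed geometric thresholds cannot adapt to varying environmental complexity, so simple areas are over-sampled and high-uncertainty regions are under-sampled, which wastes computation, raises collision risk, and reduces localization precision. Method: The DGNav framework has two core modules: (1) a Scene-Aware Adaptive Strategy, which dynamically adjusts graph construction thresholds according to the distribution of predicted waypoints, densifying the map on demand in complex environments; and (2) a Dynamic Graph Transformer, which fuses visual, linguistic, and geometric cues into dynamic edge weights and reconstructs graph connectivity to suppress topological noise and improve instruction following. Result: Experiments on the R2R-CE and RxR-CE benchmarks show that DGNav significantly outperforms existing methods, with stronger navigation performance and generalization; ablation studies confirm that it achieves an optimal trade-off between navigation efficiency and safe exploration. Conclusion: By introducing a context-aware dynamic topological mapping mechanism, DGNav effectively alleviates granularity rigidity and offers a more robust, flexible, and safe spatial reasoning paradigm for vision-language navigation in continuous environments. Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) presents a core challenge: grounding high-level linguistic instructions into precise, safe, and long-horizon spatial actions. Explicit topological maps have proven to be a vital solution for providing robust spatial memory in such tasks. However, existing topological planning methods suffer from a "Granularity Rigidity" problem. Specifically, these methods typically rely on fixed geometric thresholds to sample nodes, which fails to adapt to varying environmental complexities. This rigidity leads to a critical mismatch: the model tends to over-sample in simple areas, causing computational redundancy, while under-sampling in high-uncertainty regions, increasing collision risks and compromising precision. To address this, we propose DGNav, a framework for Dynamic Topological Navigation, introducing a context-aware mechanism to modulate map density and connectivity on-the-fly. Our approach comprises two core innovations: (1) A Scene-Aware Adaptive Strategy that dynamically modulates graph construction thresholds based on the dispersion of predicted waypoints, enabling "densification on demand" in challenging environments; (2) A Dynamic Graph Transformer that reconstructs graph connectivity by fusing visual, linguistic, and geometric cues into dynamic edge weights, enabling the agent to filter out topological noise and enhancing instruction adherence. Extensive experiments on the R2R-CE and RxR-CE benchmarks demonstrate that DGNav exhibits superior navigation performance and strong generalization capabilities.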
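To make the idea of modulating a graph construction threshold by waypoint dispersion more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the abstract does not give the dispersion measure, the mapping from dispersion to threshold, or any constants, so `waypoint_dispersion`, `adaptive_merge_threshold`, and the `base`, `gain`, and `floor` parameters are hypothetical. The sketch also assumes that higher dispersion should yield a smaller merge threshold, and therefore a denser graph.

```python
import math


def waypoint_dispersion(waypoints):
    """Root-mean-square distance of 2D waypoints from their centroid."""
    n = len(waypoints)
    cx = sum(x for x, _ in waypoints) / n
    cy = sum(y for _, y in waypoints) / n
    return math.sqrt(
        sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in waypoints) / n
    )


def adaptive_merge_threshold(waypoints, base=0.5, gain=0.2, floor=0.1):
    """Hypothetical rule: shrink the node-merge distance as dispersion grows.

    A smaller threshold merges fewer candidate nodes, so the graph becomes
    denser where predicted waypoints are spread out. The constants are
    made-up defaults, not values from the DGNav paper.
    """
    dispersion = waypoint_dispersion(waypoints)
    return max(floor, base / (1.0 + gain * dispersion))


# Tightly clustered predictions keep the base threshold (sparser graph).
corridor = [(1.0, 1.0), (1.0, 1.0)]
# Widely spread predictions lower the threshold (denser graph).
junction = [(0.0, 0.0), (4.0, 0.0)]

corridor_threshold = adaptive_merge_threshold(corridor)
junction_threshold = adaptive_merge_threshold(junction)
```

The `floor` parameter keeps the threshold from collapsing to zero when predictions are extremely spread out, which would otherwise add a node for every waypoint.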
Furthermore, ablation studies confirm that our framework achieves an optimal trade-off between navigation efficiency and safe exploration. The code is available at https://github.com/shannanshouyin/DGNav.
[135] Synthetic-to-Real Domain Bridging for Single-View 3D Reconstruction of Ships for Maritime Monitoring
Borja Carrillo-Perez,Felix Sattler,Angel Bueno Rodriguez,Maurice Stephan,Sarah Barnes
Main category: cs.CV
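TL;DR: This paper proposes an efficient single-image 3D ship reconstruction method trained purely on synthetic data. It combines the Splatter Image network, YOLOv8 segmentation, and georeferencing to provide interactive, real-time 3D ship visualization without real 3D annotations.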
Details
Motivation: Most existing 3D ship reconstruction methods depend on multi-view supervision or real 3D annotations, or are computationally expensive, so they cannot meet the needs of real-time maritime monitoring. Method: The approach uses the Splatter Image network, which represents objects as sparse sets of 3D Gaussians. The model is fine-tuned on synthetic ShapeNet vessels and further refined on a custom 3D ship dataset. A YOLOv8 segmentation module and custom preprocessing are integrated, and postprocessing covers real-world scaling, centering, orientation alignment, and georeferenced placement based on AIS metadata and a homography. Result: Quantitative metrics on the synthetic validation set are strong, qualitative results on the real ShipSG dataset support transfer to operational maritime scenes, and the system allows interactive 3D ship inspection without real 3D annotations. Conclusion: The pipeline offers an efficient, scalable single-view 3D reconstruction solution for maritime monitoring and advances real-time, practical 3D ship visualization. Abstract: Three-dimensional (3D) reconstruction of ships is an important part of maritime monitoring, allowing improved visualization, inspection, and decision-making in real-world monitoring environments. However, most state-of-the-art 3D reconstruction methods require multi-view supervision, annotated 3D ground truth, or are computationally intensive, making them impractical for real-time maritime deployment. In this work, we present an efficient pipeline for single-view 3D reconstruction of real ships by training entirely on synthetic data and requiring only a single view at inference. Our approach uses the Splatter Image network, which represents objects as sparse sets of 3D Gaussians for rapid and accurate reconstruction from single images. The model is first fine-tuned on synthetic ShapeNet vessels and further refined with a diverse custom dataset of 3D ships, bridging the domain gap between synthetic and real-world imagery. We integrate a state-of-the-art segmentation module based on YOLOv8 and custom preprocessing to ensure compatibility with the reconstruction network. Postprocessing steps include real-world scaling, centering, and orientation alignment, followed by georeferenced placement on an interactive web map using AIS metadata and homography-based mapping. Quantitative evaluation on synthetic validation data demonstrates strong reconstruction fidelity, while qualitative results on real maritime images from the ShipSG dataset confirm the potential for transfer to operational maritime settings. The final system provides interactive 3D inspection of real ships without requiring real-world 3D annotations.
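The homography-based mapping step mentioned above can be sketched in a few lines of Python. This shows only the generic operation of applying a 3x3 planar homography to an image point in homogeneous coordinates. The matrix values, the function name `apply_homography`, and the choice of target coordinate frame are assumptions for illustration. How the authors calibrate the homography or combine it with AIS metadata is not described in the abstract and is not reproduced here.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists.

    The point is lifted to homogeneous coordinates (x, y, 1), multiplied
    by H, and divided by the resulting third component.
    """
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return (u / w, v / w)


# Hypothetical homography: scale image coordinates by 2 and shift them,
# standing in for a calibrated image-to-map transform.
H_example = [
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 3.0],
    [0.0, 0.0, 1.0],
]

# Pixel (1, 1) lands at map coordinates (3, 5) under H_example.
map_xy = apply_homography(H_example, (1.0, 1.0))
```

In practice the matrix would be estimated from point correspondences between the camera image and the map plane. The mapping is only valid for points on that plane, such as the water surface.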
This pipeline provides an efficient, scalable solution for maritime monitoring and highlights a path toward real-time 3D ship visualization in practical applications. Interactive demo: https://dlr-mi.github.io/ship3d-demo/.
[136] CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
Junming Huang,Weiwei Xu
Main category: cs.CV
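TL;DR: This paper proposes CG-MLLM, a new multimodal large language model that unifies 3D captioning and high-resolution 3D generation. A Mixture-of-Transformer architecture (TokenAR and BlockAR) decouples the modeling needs, and a vision-language backbone is combined with a specialized 3D VAE latent space, significantly improving 3D generation quality.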
Details
Motivation: Existing 3D content generation methods produce only low-resolution meshes or coarse structural proxies and cannot natively capture fine-grained geometry, while the 3D generation capabilities of LLMs remain underexplored. Method: CG-MLLM adopts a Mixture-of-Transformer architecture in which a Token-level Autoregressive (TokenAR) Transformer handles token-level content and a Block-level Autoregressive (BlockAR) Transformer handles block-level content. It integrates a pre-trained vision-language backbone with a specialized 3D VAE latent space, supporting long-context interactions between standard tokens and spatial blocks. Result: CG-MLLM significantly outperforms existing MLLMs at generating high-fidelity 3D objects and brings high-resolution 3D content generation into the mainstream LLM paradigm. Conclusion: CG-MLLM handles both 3D understanding (captioning) and high-resolution generation within a single large-model framework, offering a new paradigm for 3D content creation. Abstract: Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.
[137] MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
Honglin Lin,Zheng Liu,Yun Zhu,Chonghan Qin,Juekai Lin,Xiaoran Shang,Conghui He,Wentao Zhang,Lijun Wu
Main category: cs.CV
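TL;DR: This paper presents MMFineReason, a large-scale multimodal reasoning dataset with 1.8M samples and 5.1B solution tokens, built with a three-stage pipeline. Qwen3-VL models fine-tuned on it surpass larger models despite having fewer parameters, and the work reveals a "less is more" phenomenon.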
Details
Motivation: Open-source vision language models still lag behind proprietary systems in visual reasoning, mainly because they lack high-quality reasoning data that covers challenging domains (such as STEM diagrams and visual puzzles) and carries consistent, long-form Chain-of-Thought (CoT) annotations. Method: A three-stage construction pipeline: (1) large-scale data collection and standardization; (2) CoT rationale generation distilled from Qwen3-VL-235B-A22B-Thinking; and (3) selection based on reasoning quality and difficulty awareness. Qwen3-VL-Instruct is then fine-tuned on the dataset to obtain the MMFineReason-2B/4B/8B models. Result: MMFineReason-4B surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking. A high-difficulty subset of only 7% (123K samples) matches full-dataset performance, and reasoning-oriented data also improves general capabilities. Conclusion: High-quality, difficulty-aware multimodal reasoning data is critical for improving VLM reasoning. Carefully selecting a small number of high-quality samples enables efficient training, supporting the idea that data quality matters more than quantity. Abstract: Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is established via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B versions. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B successfully surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency.
Crucially, we uncover a "less is more" phenomenon via our difficulty-aware filtering strategy: a subset of just 7% (123K samples) achieves performance comparable to the full dataset. Notably, we reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.
[138] Trajectory-Guided Diffusion for Foreground-Preserving Background Generation in Multi-Layer Documents
Taewon Kang
Main category: cs.CV
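TL;DR: This paper proposes a diffusion-based framework for document background generation that preserves foreground content and keeps style consistent across pages through latent-space design, without explicit constraints or auxiliary mechanisms.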
Details
Motivation: Existing document background generation tends to damage foreground content and to drift in style across pages; the goal is to solve both problems without heuristics such as masking or suppressed updates. Method: Diffusion is reinterpreted as the evolution of stochastic trajectories in a structured latent space. Shaping the initial noise and its geometric alignment makes generation naturally avoid foreground regions. Style control is decoupled from text conditioning, and cached style direction vectors act as persistent constraints in latent space that confine diffusion trajectories to a shared stylistic subspace. Result: The method is training-free and compatible with existing diffusion backbones. On complex documents it produces results that are visually coherent, keep the foreground readable, and stay stylistically consistent across pages, and it admits a geometric and physical interpretation on the latent manifold. Conclusion: By modeling diffusion as trajectory design in latent space, the paper offers a new paradigm for structured, consistent generative modeling that no longer depends on explicit constraints or repeated prompting. Abstract: We present a diffusion-based framework for document-centric background generation that achieves foreground preservation and multi-page stylistic consistency through latent-space design rather than explicit constraints. Instead of suppressing diffusion updates or applying masking heuristics, our approach reinterprets diffusion as the evolution of stochastic trajectories through a structured latent space. By shaping the initial noise and its geometric alignment, background generation naturally avoids designated foreground regions, allowing readable content to remain intact without auxiliary mechanisms. To address the long-standing issue of stylistic drift across pages, we decouple style control from text conditioning and introduce cached style directions as persistent vectors in latent space. Once selected, these directions constrain diffusion trajectories to a shared stylistic subspace, ensuring consistent appearance across pages and editing iterations. This formulation eliminates the need for repeated prompt-based style specification and provides a more stable foundation for multi-page generation. Our framework admits a geometric and physical interpretation, where diffusion paths evolve on a latent manifold shaped by preferred directions, and foreground regions are rarely traversed as a consequence of trajectory initialization rather than explicit exclusion. The proposed method is training-free, compatible with existing diffusion backbones, and produces visually coherent, foreground-preserving results across complex documents.
By reframing diffusion as trajectory design in latent space, we offer a principled approach to consistent and structured generative modeling.
[139] Improving Classifier-Free Guidance of Flow Matching via Manifold Projection
Jian-Feng Cai,Haixia Liu,Zhengyi Su,Chao Wang
Main category: cs.CV
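TL;DR: This paper offers a new optimization-based interpretation of classifier-free guidance (CFG) and reformulates it as a homotopy optimization problem with a manifold constraint. Incremental gradient descent combined with Anderson Acceleration gives more robust, efficient, and higher-fidelity generation without additional training.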
Details
Motivation: Although CFG is widely used in diffusion and flow-based models, it relies on a heuristic linear extrapolation, is sensitive to the guidance scale, and lacks a theoretical foundation. Method: The velocity field in flow matching is interpreted as the gradient of a sequence of smoothed distance functions, which shows that CFG approximates this gradient. On that basis, CFG sampling is reformulated as a homotopy optimization with a manifold constraint, with manifold projection carried out by incremental gradient descent and Anderson Acceleration added for efficiency and stability. Result: The methods require no training and significantly improve generation fidelity, prompt alignment, and robustness to the guidance scale on large models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5. Conclusion: CFG can be rigorously understood as an implicit optimization process, and the proposed optimization framework offers a more robust, interpretable, and practical paradigm for controllable generation. Abstract: Classifier-free guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
[140] Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion
Hanmo Chen,Chenghao Xu,Xu Yang,Xuan Chen,Cheng Deng
Main category: cs.CV
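TL;DR: This paper proposes PaFu-KV, a new KV cache policy that uses a lightweight salience estimation head to dynamically retain important tokens and discard redundant ones, improving the balance between quality and inference efficiency in long-horizon video generation.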
Details
Motivation: Existing autoregressive video generation methods rely on heuristic KV cache policies that ignore differences in token importance during long video generation, which loses critical spatiotemporal information, accumulates redundant cache, and hurts both quality and efficiency. Method: The Past- and Future-Informed KV Cache Policy (PaFu-KV) introduces a lightweight Salience Estimation Head, distilled from a bidirectional teacher model, that estimates a salience score for each token and uses it to prune the KV cache dynamically. Result: Across multiple benchmarks the method preserves high-fidelity video generation quality while accelerating inference, significantly reducing memory footprint, and improving the efficiency of long-horizon video generation. Conclusion: PaFu-KV gives autoregressive video generation a better quality-efficiency trade-off and serves as an efficient KV cache policy for long-horizon video generation. Abstract: Video generation is pivotal to digital media creation, and recent advances in autoregressive video generation have markedly enhanced the efficiency of real-time video synthesis. However, existing approaches generally rely on heuristic KV Cache policies, which ignore differences in token importance in long-term video generation. This leads to the loss of critical spatiotemporal information and the accumulation of redundant, invalid cache, thereby degrading video generation quality and efficiency. To address this limitation, we first observe that token contributions to video generation are highly time-heterogeneous and accordingly propose a novel Past- and Future-Informed KV Cache Policy (PaFu-KV). Specifically, PaFu-KV introduces a lightweight Salience Estimation Head distilled from a bidirectional teacher to estimate salience scores, allowing the KV cache to retain informative tokens while discarding less relevant ones. This policy yields a better quality-efficiency trade-off by shrinking KV cache capacity and reducing memory footprint at inference time. Extensive experiments on benchmarks demonstrate that our method preserves high-fidelity video generation quality while accelerating inference, thereby enabling more efficient long-horizon video generation. Our code will be released upon paper acceptance.
[141] TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention
Chuancheng Shi,Shangze Li,Wenjun Lu,Wenhua Wu,Cong Wang,Zifeng Cheng,Fei Shen,Tat-Seng Chua
Main category: cs.CV
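TL;DR: This paper proposes TraceRouter, a framework that improves the robustness of large foundation models by tracing and disconnecting the causal propagation circuits of harmful semantics, avoiding the damage to model utility caused by traditional localized suppression.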
Details
Motivation: Existing defenses rest on the "locality hypothesis" and suppress only isolated neurons or features, but harmful semantics actually form distributed, cross-layer circuits, so localized interventions are brittle and hurt model performance. Method: TraceRouter has three stages: (1) locating a sensitive onset layer through attention divergence analysis; (2) using sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) computing feature influence scores (FIS) from zero-out interventions to map the malicious features to downstream causal pathways, which are then selectively suppressed. Result: TraceRouter significantly outperforms state-of-the-art baselines in extensive experiments and achieves a better trade-off between adversarial robustness and general utility. Conclusion: Path-level causal intervention is more effective and more robust than traditional localized feature suppression, offering a new paradigm for large-model safety. Abstract: Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the "locality hypothesis", suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose TraceRouter, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter physically severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.
[142] Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning
Hanmo Chen,Guangtao Lyu,Chenghao Xu,Jiexi Yan,Xu Yang,Cheng Deng
Main category: cs.CV
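TL;DR: This paper proposes Pyramidal Shapley-Taylor (PST), a learning framework for fine-grained motion-language retrieval. It mimics the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) to progressively align motion segments and body joints with text tokens.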
Details
Motivation: Existing methods mainly align entire motion sequences with global text representations, overlooking fine-grained interactions among local motion segments, individual body joints, and text tokens, which limits retrieval performance. Method: The Pyramidal Shapley-Taylor (PST) framework decomposes human motion into temporal segments and spatial body joints, and progressively learns joint-level and segment-level cross-modal alignment in a pyramidal fashion. Result: The approach significantly outperforms state-of-the-art methods on multiple public benchmark datasets and achieves precise alignment of motion segments and body joints with their corresponding text tokens. Conclusion: By mimicking the hierarchical mechanism of human motion perception, PST effectively models fine-grained semantic associations between motion and language and improves cross-modal retrieval. Abstract: As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis, yet existing approaches predominantly focus on aligning entire motion sequences with global textual representations. This global-centric paradigm overlooks fine-grained interactions between local motion segments and individual body joints and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, achieving precise alignment between motion segments and body joints and their corresponding text tokens. The code of this work will be released upon acceptance.
[143] VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models
Yunhao Li,Sijing Wu,Zhilin Gao,Zicheng Zhang,Qi Jia,Huiyu Duan,Xiongkuo Min,Guangtao Zhai
Main category: cs.CV
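TL;DR: This paper introduces VideoAesBench, a comprehensive benchmark for evaluating how well large multimodal models (LMMs) understand video aesthetic quality. It covers diverse video sources, multiple question formats, and a broad set of aesthetic dimensions. An evaluation of 23 open-source and commercial LMMs finds that current models still perform poorly on this task.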
Details
Motivation: Assessing video aesthetic quality is a basic human ability, but it remains understudied for large multimodal models (LMMs), so a systematic evaluation benchmark is needed. Method: The VideoAesBench benchmark contains 1,804 videos from multiple sources, including UGC, AIGC, compressed, RGC, and game videos. It supports single-choice, multi-choice, and true-or-false questions, plus novel open-ended aesthetics description questions, and covers 12 aesthetic dimensions across visual form, visual style, and visual affectiveness. A systematic evaluation is run on 23 LMMs. Result: Current LMMs have only basic video aesthetics perception ability, and their overall performance is incomplete and imprecise. Conclusion: VideoAesBench can serve as a strong testbed and advance research on explainable video aesthetics assessment. Abstract: Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which has in turn made the evaluation of LMMs significant. However, the capability of video aesthetic quality assessment, which is a fundamental ability for humans, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs' understanding of video aesthetic quality. VideoAesBench has several significant characteristics: (1) Diverse content including 1,804 videos from multiple video sources including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos. (2) Multiple question formats containing traditional single-choice questions, multi-choice questions, True or False questions, and novel open-ended questions for video aesthetics description. (3) Holistic video aesthetics dimensions including visual form related questions from 5 aspects, visual style related questions from 4 aspects, and visual affectiveness questions from 3 aspects. Based on VideoAesBench, we benchmark 23 open-source and commercial large multimodal models. Our findings show that current LMMs possess only basic video aesthetics perception ability; their performance remains incomplete and imprecise. We hope VideoAesBench can serve as a strong testbed and offer insights for explainable video aesthetics assessment.
[144] Zero-Shot Video Restoration and Enhancement with Assistance of Video Diffusion Models
Cong Cao,Huanjing Yue,Shangbin Xie,Xin Liu,Jingyu Yang
Main category: cs.CV
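TL;DR: This paper proposes a training-free framework that uses video diffusion models to assist image diffusion models, improving temporal consistency in zero-shot video restoration and enhancement through several latent fusion strategies and temporal-strengthening post-processing.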
Details
Motivation: Existing diffusion-based zero-shot image restoration methods cause severe temporal flickering when applied to video, so temporal consistency needs to be improved. Method: The paper proposes homologous latents fusion, heterogeneous latents fusion, and a COT-based fusion ratio strategy, combined with temporal-strengthening post-processing that uses an image-to-video diffusion model, to make image-based methods more temporally consistent. Result: The method significantly outperforms existing methods on multiple zero-shot video restoration and enhancement tasks, effectively reduces temporal flickering, and needs no additional training. Conclusion: The framework is the first general, training-free scheme to bring video diffusion models into zero-shot video restoration and enhancement, balancing performance and flexibility. Abstract: Although diffusion-based zero-shot image restoration and enhancement methods have achieved great success, applying them to video restoration or enhancement will lead to severe temporal flickering. In this paper, we propose the first framework that utilizes the rapidly-developed video diffusion model to assist the image-based method in maintaining better temporal consistency for zero-shot video restoration and enhancement. We propose homologous latents fusion, heterogeneous latents fusion, and a COT-based fusion ratio strategy to utilize both homologous and heterogeneous text-to-video diffusion models to complement the image method. Moreover, we propose temporal-strengthening post-processing to utilize the image-to-video diffusion model to further improve temporal consistency. Our method is training-free and can be applied to any diffusion-based image restoration and enhancement methods. Experimental results demonstrate the superiority of the proposed method.
[145] Just Noticeable Difference Modeling for Deep Visual Features
Rui Zhao,Wenrui Li,Lin Zhu,Yajing Zheng,Weisi Lin
Main category: cs.CV
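TL;DR: This paper proposes FeatJND, a task-oriented just noticeable difference (JND) model for deep visual features. It predicts the largest perturbation each feature dimension can tolerate while downstream task performance is preserved, and it is validated on classification, detection, and instance segmentation, with a practical application in dynamic quantization.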
Details
Motivation: Deep visual features are increasingly important as the interface in vision systems, so their characteristics need to be described and their quality controlled. Extending the traditional JND concept to feature space provides a task-aligned tolerance boundary that supports feature quality control under constrained resources. Method: The FeatJND model estimates a per-feature maximum tolerable perturbation map at standardized split points. It is validated on image classification, object detection, and instance segmentation, and is further applied to token-wise dynamic quantization with FeatJND-based step-size allocation. Result: Under matched distortion strength, FeatJND perturbations preserve markedly higher task performance than Gaussian perturbations. Attribution visualizations indicate that FeatJND can suppress non-critical feature regions. In dynamic quantization, FeatJND-guided step-size allocation outperforms random and global uniform schemes. Conclusion: FeatJND provides a task-aligned reference for controlling the quality of deep features, with practical potential in resource-constrained settings such as feature compression and quantization. Abstract: Deep visual features are increasingly used as the interface in vision systems, motivating the need to describe feature characteristics and control feature quality for machine perception. Just noticeable difference (JND) characterizes the maximum imperceptible distortion for images under human or machine vision. Extending it to deep visual features naturally meets the above demand by providing a task-aligned tolerance boundary in feature space, offering a practical reference for controlling feature quality under constrained resources. We propose FeatJND, a task-aligned JND formulation that predicts the maximum tolerable per-feature perturbation map while preserving downstream task performance. We propose a FeatJND estimator at standardized split points and validate it across image classification, detection, and instance segmentation. Under matched distortion strength, FeatJND-based distortions consistently preserve higher task performance than unstructured Gaussian perturbations, and attribution visualizations suggest FeatJND can suppress non-critical feature regions. As an application, we further apply FeatJND to token-wise dynamic quantization and show that FeatJND-guided step-size allocation yields clear gains over random step-size permutation and global uniform step size under the same noise budget. Our code will be released after publication.
[146] BookNet: Book Image Rectification via Cross-Page Attention Network
Shaokai Liu,Hao Feng,Bozhi Luan,Min Hou,Jiajun Deng,Wengang Zhou
Main category: cs.CV
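TL;DR: This paper proposes BookNet, the first end-to-end deep learning framework designed specifically for dual-page book image rectification. It uses a dual-branch structure with cross-page attention and comes with a synthetic dataset, Book3D, and a real-world benchmark, Book100. Experiments show that it outperforms existing methods.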
Details
Motivation: Existing single-page document image rectification methods cannot model the coupled geometric relationship between the left and right pages of a book, where binding constraints produce complex, asymmetric curvature distortions. Method: BookNet is a dual-branch network with cross-page attention that jointly estimates warping flows for the individual pages and for the complete book spread. The authors also build Book3D, a large-scale synthetic dataset, and Book100, a real-world benchmark. Result: BookNet significantly outperforms existing state-of-the-art methods on book image rectification. Conclusion: BookNet is the first to jointly model the geometry of dual-page book images and rectify them end to end. It confirms that cross-page modeling is key to rectification accuracy and provides high-quality data resources for further work. Abstract: Book image rectification presents unique challenges in document image processing due to complex geometric distortions from binding constraints, where left and right pages exhibit distinctly asymmetric curvature patterns. However, existing single-page document image rectification methods fail to capture the coupled geometric relationships between adjacent pages in books. In this work, we introduce BookNet, the first end-to-end deep learning framework specifically designed for dual-page book image rectification. BookNet adopts a dual-branch architecture with cross-page attention mechanisms, enabling it to estimate warping flows for both individual pages and the complete book spread, explicitly modeling how left and right pages influence each other. Moreover, to address the absence of specialized datasets, we present Book3D, a large-scale synthetic dataset for training, and Book100, a comprehensive real-world benchmark for evaluation. Extensive experiments demonstrate that BookNet outperforms existing state-of-the-art methods on book image rectification. Code and dataset will be made publicly available.
[147] Deep Models, Shallow Alignment: Uncovering the Granularity Mismatch in Neural Decoding
Yang Du,Siyuan Dai,Yonghao Song,Paul M. Thompson,Haoteng Tang,Liang Zhan
Main category: cs.CV
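TL;DR: This paper proposes Shallow Alignment, which aligns neural signals with intermediate-layer representations of a visual encoder rather than its final outputs, addressing the granularity mismatch between human and machine vision. It significantly improves neural visual decoding and shows that decoding performance scales predictably with the capacity of the vision backbone.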
Details
Motivation: Existing neural visual decoding methods overlook a fundamental granularity mismatch: human vision preserves a mixture of low-level texture and high-level semantics, whereas deep vision models emphasize semantic invariance and suppress local texture. Method: Shallow Alignment is a novel contrastive learning strategy that aligns neural signals with intermediate-layer representations of visual encoders, balancing low-level texture details and high-level semantic features. Result: It significantly outperforms standard final-layer alignment on multiple benchmarks, with gains of 22% to 58%, and it unlocks the scaling law in neural visual decoding, so decoding performance grows predictably with the capacity of the pre-trained vision backbone. Conclusion: Intermediate-layer alignment is an effective way to bridge the granularity gap between neural and machine vision representations, and it offers a new paradigm for more accurate, scalable visual decoding in brain-computer interfaces. Abstract: Neural visual decoding is a central problem in brain-computer interface research, aiming to reconstruct human visual perception and to elucidate the structure of neural representations. However, existing approaches overlook a fundamental granularity mismatch between human and machine vision, where deep vision models emphasize semantic invariance by suppressing local texture information, whereas neural signals preserve an intricate mixture of low-level visual attributes and high-level semantic content. To address this mismatch, we propose Shallow Alignment, a novel contrastive learning strategy that aligns neural signals with intermediate representations of visual encoders rather than their final outputs, thereby striking a better balance between low-level texture details and high-level semantic features. Extensive experiments across multiple benchmarks demonstrate that Shallow Alignment significantly outperforms standard final-layer alignment, with performance gains ranging from 22% to 58% across diverse vision backbones. Notably, our approach effectively unlocks the scaling law in neural visual decoding, enabling decoding performance to scale predictably with the capacity of pre-trained vision backbones. We further conduct systematic empirical analyses to shed light on the mechanisms underlying the observed performance gains.
[148] UEval: A Benchmark for Unified Multimodal Generation
Bo Li,Yida Yin,Wenhao Chai,Xingyu Fu,Zhuang Liu
Main category: cs.CV
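TL;DR: This paper introduces UEval, a benchmark for evaluating unified models that generate both images and text. It contains 1,000 expert-curated questions covering 8 real-world tasks and a range of reasoning types, scored with human-validated, fine-grained rubrics (10,417 criteria in total), which improves the accuracy and scalability of multimodal generation evaluation. Experiments show that current unified models perform poorly and that reasoning ability is important for multimodal generation quality.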
Details
Motivation: Existing evaluation methods such as LLM-as-a-judge struggle to capture the quality of open-ended multimodal generation, and there is no comprehensive, fine-grained, scalable benchmark aimed specifically at unified models that generate both images and text. Method: UEval is built in three steps: (1) collecting 1,000 expert questions across 8 real-world task types that require combined image and text output; (2) providing reference images and text for each question, from which an MLLM drafts an initial rubric that human experts refine and validate, yielding 10,417 validated criteria; and (3) scoring automatically against these rubrics. Result: GPT-5-Thinking scores only 66.4 out of 100 on UEval, and the best open-source model reaches only 49.1. Reasoning models generally outperform non-reasoning ones, and transferring reasoning traces to a non-reasoning model significantly narrows the gap. Conclusion: UEval fills a gap in the evaluation of unified models, highlights the role of reasoning in complex multimodal understanding and generation, and offers a new paradigm for future model design and evaluation. Abstract: We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Different from previous works that rely on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric, consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
[149] PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
Cheng Cui,Ting Sun,Suyin Liang,Tingquan Gao,Zelun Zhang,Jiaxuan Liu,Xueqing Wang,Changda Zhou,Hongen Liu,Manhui Lin,Yue Zhang,Yubo Zhang,Yi Liu,Dianhai Yu,Yanjun Ma
Main category: cs.CV
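TL;DR: PaddleOCR-VL-1.5 is an ultra-compact 0.9B-parameter vision-language model that reaches a state-of-the-art accuracy of 94.5% on OmniDocBench v1.5. Its robustness to real-world physical distortions is validated on the newly proposed Real5-OmniDocBench benchmark, and it adds support for seal recognition and text spotting.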
Details
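Motivation: Make document understanding models more robust to real-world physical distortions (such as scanning, skew, warping, screen photography, and illumination changes) and extend them to more tasks (such as seal recognition and text spotting) while keeping the model lightweight and efficient. Method: Upgrade the PaddleOCR-VL model, build the new Real5-OmniDocBench benchmark covering five types of real-world physical distortion, and add seal recognition and text spotting capabilities. Result: The model reaches a SOTA accuracy of 94.5% on OmniDocBench v1.5 and the best performance on Real5-OmniDocBench, and it supports the new tasks with only 0.9B parameters and high efficiency. Conclusion: PaddleOCR-VL-1.5 advances accuracy, robustness, task generalization, and model efficiency together, making lightweight document-understanding VLMs more practical. Abstract: We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model's capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: https://github.com/PaddlePaddle/PaddleOCR
[150] Causal World Modeling for Robot Control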
Lin Li,Qihang Zhang,Yiming Luo,Shuai Yang,Ruilin Wang,Fei Han,Mingrui Yu,Zelin Gao,Nan Xue,Xing Zhu,Yujun Shen,Yinghao Xu
Main category: cs.CV
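TL;DR: This paper proposes LingBot-VA, an autoregressive diffusion framework built on video world modeling that learns frame prediction and policy execution simultaneously, using a shared latent space, closed-loop rollout, and asynchronous inference for efficient robot control.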
Details
Motivation: Video world models capture the causality between actions and visual dynamics, which makes it possible to "imagine" the near future and offers a new foundation for robot learning. Method: Propose LingBot-VA, with a shared vision-action latent space driven by a Mixture-of-Transformers (MoT) architecture, a closed-loop rollout mechanism that incorporates feedback from ground-truth observations, and an asynchronous inference pipeline that parallelizes action prediction and motor execution. Result: Evaluated in simulation and real-world scenarios, with promising results in long-horizon manipulation, post-training data efficiency, and generalization to novel configurations. Conclusion: Video world modeling, alongside vision-language pre-training, provides a new and independent foundation for robot learning; the LingBot-VA code and model are publicly released. Abstract: This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.
[151] Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving
Linhan Wang,Zichong Yang,Chen Bai,Guoxiang Zhang,Xiaotong Liu,Xiaoyin Zheng,Xiao-Xiao Long,Chang-Tien Lu,Cheng Lu
Main category: cs.CV
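TL;DR: This paper proposes Drive-JEPA, a framework that combines the Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation to improve planning representations for end-to-end autonomous driving, setting a new state of the art on NAVSIM.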
Details
Motivation: Existing end-to-end driving methods based on self-supervised video pretraining bring only limited gains in scene understanding, and because each driving scene typically provides a single human trajectory, multimodal behaviors are hard to learn. Method: 1) Adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning; 2) design a proposal-centric planner that distills simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. Result: On NAVSIM, the V-JEPA representation with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting; the complete Drive-JEPA reaches 93.3 PDMS on v1 and 87.8 EPDMS on v2, a new state of the art. Conclusion: By combining video representation learning with multimodal trajectory distillation, Drive-JEPA eases the representation bottleneck caused by single-trajectory supervision and improves end-to-end planning performance. Abstract: End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state of the art.
[152] Understanding Multimodal Complementarity for Single-Frame Action Anticipation
Manuel Benavent-Lledo,Konstantinos Bacharidis,Konstantinos Papoutsakis,Antonis Argyros,Jose Garcia-Rodriguez
Main category: cs.CV
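TL;DR: This paper challenges the conventional assumption that action anticipation requires dense temporal information, explores how well human actions can be anticipated from a single frame, and proposes the refined framework AAG+, which matches or exceeds existing video-based methods on several benchmarks.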
Details
Motivation: Conventional action anticipation methods rely on video sequences; this paper asks how much information about future actions is already contained in a single frame and how it can be exploited effectively. Method: Building on the authors' prior work AAG, systematically analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and study how multimodal fusion strategies, keyframe selection policies, and past-action history sources affect anticipation performance; consolidate the findings into the refined framework AAG+. Result: Using only a single frame, AAG+ matches or exceeds state-of-the-art video-based methods on challenging benchmarks including IKEA-ASM, Meccano, and Assembly101. Conclusion: Single-frame action anticipation has substantial potential; dense temporal modeling is not always necessary, and a carefully selected keyframe is sufficient in many scenarios. Abstract: Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.
[153] Urban Neural Surface Reconstruction from Constrained Sparse Aerial Imagery with 3D SAR Fusion
Da Li,Chen Yao,Tong Mao,Jiacheng Bao,Houjun Sun
Main category: cs.CV
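TL;DR: This paper presents the first neural surface reconstruction framework that fuses 3D SAR point clouds with aerial imagery for high-fidelity urban 3D reconstruction under sparse-view conditions.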
Details
Motivation: Existing neural surface reconstruction methods suffer from geometric ambiguity and instability under sparse views, while aerial image acquisition in urban remote sensing is limited by flight paths, terrain, and cost. Method: Integrate geometric priors from 3D SAR point clouds into an SDF-based neural surface reconstruction backbone to guide structure-aware ray selection and adaptive sampling; construct the first benchmark dataset of co-registered 3D SAR point clouds and aerial imagery. Result: Experiments show that, compared with single-modality baselines, the method markedly improves reconstruction accuracy, completeness, and robustness under highly sparse and oblique-view conditions. Conclusion: Fusing optical and SAR sensing is a viable route toward scalable, high-fidelity urban 3D reconstruction. Abstract: Neural surface reconstruction (NSR) has recently shown strong potential for urban 3D reconstruction from multi-view aerial imagery. However, existing NSR methods often suffer from geometric ambiguity and instability, particularly under sparse-view conditions. This issue is critical in large-scale urban remote sensing, where aerial image acquisition is limited by flight paths, terrain, and cost. To address this challenge, we present the first urban NSR framework that fuses 3D synthetic aperture radar (SAR) point clouds with aerial imagery for high-fidelity reconstruction under constrained, sparse-view settings. 3D SAR can efficiently capture large-scale geometry even from a single side-looking flight path, providing robust priors that complement photometric cues from images. Our framework integrates radar-derived spatial constraints into an SDF-based NSR backbone, guiding structure-aware ray selection and adaptive sampling for stable and efficient optimization. We also construct the first benchmark dataset with co-registered 3D SAR point clouds and aerial imagery, facilitating systematic evaluation of cross-modal 3D reconstruction. Extensive experiments show that incorporating 3D SAR markedly enhances reconstruction accuracy, completeness, and robustness compared with single-modality baselines under highly sparse and oblique-view conditions, highlighting a viable route toward scalable high-fidelity urban reconstruction with advanced airborne and spaceborne optical-SAR sensing.
[154] PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction
Changjian Jiang,Kerui Ren,Xudong Li,Kaiwen Song,Linning Xu,Tao Lu,Junting Dong,Yu Zhang,Bo Dai,Mulin Yu
Main category: cs.CV
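TL;DR: PLANING is an efficient on-the-fly 3D reconstruction framework for monocular image streams. It uses a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians, decoupling geometry from appearance modeling to achieve stable, fast, high-quality online reconstruction.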
Details
Motivation: Existing methods rarely achieve both high-fidelity rendering and accurate geometry, and streaming reconstruction suffers from structural redundancy and unstable optimization. Method: Propose PLANING, built on a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians; design an online initialization and optimization strategy that separates geometry and appearance updates to support efficient streaming reconstruction. Result: Improves dense-mesh Chamfer-L2 by 18.52% over PGSR and PSNR by 1.31 dB over ARTDECO; reconstructs ScanNetV2 scenes in under 100 seconds (over 5x faster than 2D Gaussian Splatting) while matching the quality of offline per-scene optimization. Conclusion: Through decoupled modeling and online optimization, PLANING offers combined advantages in reconstruction quality, geometric accuracy, speed, and structural clarity, making it suitable for downstream tasks such as large-scale scene modeling and simulation-ready environments for embodied AI. Abstract: Streaming reconstruction from monocular image sequences remains challenging, as existing methods typically favor either high-quality rendering or accurate geometry, but rarely both. We present PLANING, an efficient on-the-fly reconstruction framework built on a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians, enabling geometry and appearance to be modeled in a decoupled manner. This decoupling supports an online initialization and optimization strategy that separates geometry and appearance updates, yielding stable streaming reconstruction with substantially reduced structural redundancy. PLANING improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, and reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, while matching the quality of offline per-scene optimization. Beyond reconstruction quality, the structural clarity and computational efficiency of PLANING make it well suited for a broad range of downstream applications, such as enabling large-scale scene modeling and simulation-ready environments for embodied AI. Project page: https://city-super.github.io/PLANING/
[155] MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources
Baorui Ma,Jiahui Yang,Donglin Di,Xuancheng Zhang,Jianxun Cui,Hao Li,Yan Xie,Wei Chen
Main category: cs.CV
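TL;DR: This paper proposes Metric Anything, a scalable pretraining framework that needs no manually engineered prompts, camera-specific modeling, or task-specific architectures. It learns metric depth from noisy multi-source 3D data via a Sparse Metric Prompt and demonstrates, for the first time, a clear scaling trend in metric depth estimation.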
Details
Motivation: Metric depth estimation faces heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in cross-source 3D data, which makes it hard to follow the scaling paradigm that has succeeded for vision foundation models. Method: Propose the Sparse Metric Prompt (randomly masked depth maps) as a universal interface that decouples spatial reasoning from sensor and camera biases; pretrain at scale on about 20M image-depth pairs from reconstructed, captured, and rendered sources spanning 10,000 camera models. Result: A clear scaling trend is observed for the first time in metric depth; the pretrained model excels at prompt-driven tasks such as depth completion, super-resolution, and radar-camera fusion; its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning; using the Metric Anything ViT as a visual encoder significantly boosts the spatial intelligence of multimodal large language models. Conclusion: Metric depth estimation can benefit from the same scaling laws that drive modern foundation models, and Metric Anything offers a new path toward scalable, efficient real-world metric perception. Abstract: Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution, and radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception.
We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
[156] Unsupervised Decomposition and Recombination with Discriminator-Driven Diffusion Models
Archer Wang,Emile Anand,Yilun Du,Marin Soljačić
Main category: cs.CV
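TL;DR: This paper proposes an adversarially trained diffusion-model approach for unsupervised learning of factorized latent representations. Recombining factors across sources under a discriminator signal improves the physical and semantic consistency of generated samples, yielding better generation quality and disentanglement on image and robotic video tasks.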
Details
Motivation: Decomposing complex data into factorized representations helps reveal reusable components and enables synthesizing new samples by recombining them; diffusion models can learn factorized latent spaces, but without factor-level supervision the recombined results are not guaranteed to be physically and semantically consistent. Method: Introduce a discriminator that distinguishes single-source samples from samples generated by recombining factors across sources, and train the generator adversarially to fool it, which encourages physically and semantically consistent recombinations. Result: Lower FID and better MIG and MCC on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D; on LIBERO robotic video trajectories, recombining learned action components significantly increases state-space coverage. Conclusion: The proposed adversarial training strategy improves diffusion models' unsupervised factor decomposition and compositional generation, combining high-quality generation with strong disentanglement, and extends to new applications such as robotic video. Abstract: Decomposing complex data into factorized representations can reveal reusable components and enable synthesizing new samples via component recombination. We investigate this in the context of diffusion-based models that learn factorized latent spaces without factor-level supervision. In images, factors can capture background, illumination, and object attributes; in robotic videos, they can capture reusable motion components. To improve both latent factor discovery and quality of compositional generation, we introduce an adversarial training signal via a discriminator trained to distinguish between single-source samples and those generated by recombining factors across sources. By optimizing the generator to fool this discriminator, we encourage physical and semantic consistency in the resulting recombinations. Our method outperforms implementations of prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, achieving lower FID scores and better disentanglement as measured by MIG and MCC. Furthermore, we demonstrate a novel application to robotic video trajectories: by recombining learned action components, we generate diverse sequences that significantly increase state-space coverage for exploration on the LIBERO benchmark.
[157] Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
Wenxuan Huang,Yu Zeng,Qiuchen Wang,Zhen Fang,Shaosheng Cao,Zheng Chu,Qingyu Yin,Shuang Chen,Zhenfei Yin,Lin Chen,Zehui Chen,Yao Hu,Philip Torr,Feng Zhao,Wanli Ouyang
Main category: cs.CV
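TL;DR: This paper proposes Vision-DeepResearch, a new multimodal deep-research paradigm supporting multi-turn, multi-entity, and multi-scale visual and textual search. It internalizes deep-research capabilities into the MLLM through cold-start supervision and reinforcement learning, substantially improving complex question answering in noisy real-world settings.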
Details
Motivation: Existing search-augmented MLLMs assume that a single image or text query is enough to retrieve the key evidence, but real-world settings involve heavy visual noise and complex questions that require deeper and broader aggregation of evidence from multiple sources. Method: Propose the Vision-DeepResearch paradigm, which supports dozens of reasoning steps and hundreds of search-engine interactions; use cold-start supervision followed by RL training to internalize multi-turn, multi-scale, multi-entity search capabilities into the MLLM. Result: Substantially outperforms existing multimodal deep-research MLLMs as well as workflows built on closed-source models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet. Conclusion: Vision-DeepResearch delivers robust, scalable, end-to-end multimodal deep-research capability, offering a new paradigm for complex multimodal question answering in noisy real-world scenarios. Abstract: Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, because they are constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs with a "reasoning-then-tool-call" scheme over visual and textual search engines, obtaining substantial gains on tasks requiring extensive factual information. These approaches typically define multimodal search in a naive setting, assuming that a single full-level or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. To address this, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual search to robustly hit real-world search engines under heavy noise. Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs and workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet.
The code will be released at https://github.com/Osilly/Vision-DeepResearch.
[158] BLO-Inst: Bi-Level Optimization Based Alignment of YOLO and SAM for Robust Instance Segmentation
Li Zhang,Pengtao Xie
Main category: cs.CV
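TL;DR: This paper proposes BLO-Inst, a framework that uses bi-level optimization to align the object-detection objective with the SAM segmentation objective, so that the detector produces box prompts better suited to SAM, improving automated zero-shot segmentation performance.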