Table of Contents
cs.CL
[1] PPoGA: Predictive Plan-on-Graph with Action for Knowledge Graph Question Answering
MinGyu Jeon,SuWan Cho,JaeYoung Shu
Main category: cs.CL
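TL;DR: This paper proposes PPoGA, a framework that uses predictive processing and a self-correction mechanism (path correction and plan correction) to make LLM reasoning in knowledge graph question answering more robust and flexible.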
Details
Motivation: Existing LLM+KG methods tend to fail when the initial high-level reasoning plan is wrong, and they lack human-like cognitive control and the ability to restructure a problem.
Method: PPoGA uses a Planner-Executor architecture to separate strategy from execution, and adds a predictive processing mechanism and a two-level self-correction mechanism (path correction plus plan correction).
Result: State-of-the-art performance on three multi-hop KGQA benchmarks (GrailQA, CWQ, and WebQSP), significantly outperforming existing methods.
Conclusion: Metacognitive abilities such as problem restructuring are critical for building more robust and flexible AI reasoning systems.
Abstract: Large Language Models (LLMs) augmented with Knowledge Graphs (KGs) have advanced complex question answering, yet they often remain susceptible to failure when their initial high-level reasoning plan is flawed. This limitation, analogous to cognitive functional fixedness, prevents agents from restructuring their approach, leading them to pursue unworkable solutions. To address this, we propose PPoGA (Predictive Plan-on-Graph with Action), a novel KGQA framework inspired by human cognitive control and problem-solving. PPoGA incorporates a Planner-Executor architecture to separate high-level strategy from low-level execution and leverages a Predictive Processing mechanism to anticipate outcomes. The core innovation of our work is a self-correction mechanism that empowers the agent to perform not only Path Correction for local execution errors but also Plan Correction by identifying, discarding, and reformulating the entire plan when it proves ineffective. We conduct extensive experiments on three challenging multi-hop KGQA benchmarks: GrailQA, CWQ, and WebQSP. The results demonstrate that PPoGA achieves state-of-the-art performance, significantly outperforming existing methods. Our work highlights the critical importance of metacognitive abilities like problem restructuring for building more robust and flexible AI reasoning systems.
[2] Unlocking Electronic Health Records: A Hybrid Graph RAG Approach to Safe Clinical AI for Patient QA
Samuel Thio,Matthew Lewis,Spiros Denaxas,Richard JB Dobson
Main category: cs.CL
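TL;DR: This paper proposes MediGRAF, a hybrid Graph RAG framework that combines structured graph queries (Neo4j Text2Cypher) with unstructured semantic retrieval (vector embeddings) to improve the accuracy and safety of clinical information retrieval from EHRs. On MIMIC-IV data it reaches 100% recall on factual queries and high expert ratings on inference tasks.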
Details
Motivation: Information overload in EHR systems places a heavy cognitive burden on clinicians; existing LLMs suffer from hallucinations and weak context grounding in clinical settings; current retrieval methods handle structured or unstructured data in isolation and do not integrate the two.
Method: MediGRAF integrates Neo4j Text2Cypher for structured relationship traversal with vector embeddings for semantic retrieval over unstructured clinical text. A patient-level knowledge graph (5,973 nodes, 5,963 relationships) is built from MIMIC-IV to support natural language question answering.
Result: 100% recall on factual queries; complex inference tasks received a mean expert score of 4.25/5 with zero safety violations, supporting the effectiveness and safety of hybrid graph augmentation for clinical information retrieval.
Conclusion: By retrieving structured and unstructured data together, MediGRAF markedly improves the reliability and completeness of clinical RAG systems and offers a feasible path to deploying LLMs safely in real medical settings.
Abstract: Electronic health record (EHR) systems present clinicians with vast repositories of clinical information, creating a significant cognitive burden where critical details are easily overlooked. While Large Language Models (LLMs) offer transformative potential for data processing, they face significant limitations in clinical settings, particularly regarding context grounding and hallucinations. Current solutions typically isolate retrieval methods, focusing either on structured data (SQL/Cypher) or unstructured semantic search, but fail to integrate both simultaneously. This work presents MediGRAF (Medical Graph Retrieval Augmented Framework), a novel hybrid Graph RAG system that bridges this gap. By uniquely combining Neo4j Text2Cypher capabilities for structured relationship traversal with vector embeddings for unstructured narrative retrieval, MediGRAF enables natural language querying of the complete patient journey. Using 10 patients from the MIMIC-IV dataset (generating 5,973 nodes and 5,963 relationships), we generated enough nodes and data for patient-level question answering (QA), and we evaluated this architecture across varying query complexities. The system demonstrated 100% recall for factual queries, meaning all relevant information was retrieved and present in the output, while complex inference tasks achieved a mean expert quality score of 4.25/5 with zero safety violations. These results demonstrate that hybrid graph-grounding significantly advances clinical information retrieval, offering a safer, more comprehensive alternative to standard LLM deployments.
[3] G-MemLLM: Gated Latent Memory Augmentation for Long-Context Reasoning in Large Language Models
Xun Xu
Main category: cs.CL
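TL;DR: This paper proposes G-MemLLM, a memory-augmented architecture that pairs a frozen LLM backbone with a trainable latent memory bank and uses a GRU-style gated update mechanism to improve multi-hop reasoning and relation extraction.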
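The GRU-style gate mentioned in this entry can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `gated_memory_update`, the per-slot scalar gate, and passing the gate logits in directly (rather than computing them from the memory and the backbone's hidden states with learned weights) are simplifying assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_memory_update(memory, candidate, gate_logits):
    """Blend each memory slot with a candidate write, GRU-style.

    memory, candidate: lists of slots, each slot a list of floats.
    gate_logits: one logit per slot (hypothetical input; a trained model
    would produce these from the slot and the current hidden state).
    """
    updated = []
    for slot, cand, logit in zip(memory, candidate, gate_logits):
        z = sigmoid(logit)  # update gate in (0, 1)
        # z near 0 preserves the old slot, z near 1 overwrites it.
        updated.append([(1.0 - z) * m + z * c for m, c in zip(slot, cand)])
    return updated


memory = [[1.0, 1.0], [0.0, 0.0]]
candidate = [[5.0, 5.0], [2.0, 2.0]]
new_memory = gated_memory_update(memory, candidate, gate_logits=[-20.0, 20.0])
print(new_memory)  # first slot stays close to [1, 1], second moves close to [2, 2]
```

A strongly negative gate logit preserves a slot, a strongly positive one overwrites it, and values in between blend old and new content, which is the selective update, preserve, or overwrite behaviour the entry describes.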
Details
Motivation: Existing LLMs are limited by context window length and struggle to maintain factual consistency in long-range multi-hop reasoning; conventional compression or recurrent-token methods tend to cause "context rot" or dilute information.
Method: G-MemLLM integrates a frozen LLM backbone with a trainable latent memory bank and introduces GRU-style gated update logic that selectively updates, preserves, or overwrites latent memory slots.
Result: Evaluated on HotpotQA and ZsRE: accuracy on ZsRE improves by 13.3% for Llama 3.1-8B; Answer F1 improves by 8.56 points for GPT-2; Supporting Fact F1 on HotpotQA improves by 6.89 points for Llama 3.1-8B.
Conclusion: G-MemLLM mitigates knowledge forgetting in long-range reasoning and improves multi-hop reasoning and relation extraction precision across model scales, supporting the usefulness of gated memory mechanisms in LLMs.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, yet they remain constrained by the finite capacity of their context windows and the inherent difficulty of maintaining long-term factual consistency during multi-hop reasoning. While existing methods utilize context compression or recurrent tokens, they often suffer from "context rot" or the dilution of information over long horizons. In this paper, we propose G-MemLLM, a memory-augmented architecture that integrates a frozen LLM backbone with a trainable Latent Memory Bank. Our key innovation is a GRU-style gated update logic that allows the model to selectively update, preserve, or overwrite latent memory slots, preventing the vanishing gradients of knowledge common in recurrent systems. We evaluate G-MemLLM across scales, from GPT-2 (124M) to Llama 3.1 (8B), on the HotpotQA and Zero-Shot Relation Extraction (ZsRE) benchmarks. Our results demonstrate that G-MemLLM significantly enhances multi-hop reasoning and relational precision, achieving a 13.3% accuracy boost on ZsRE for Llama 3.1-8B, and it also yields improvements across model scales, boosting Answer F1 by 8.56 points for GPT-2 and increasing Supporting Fact F1 by 6.89 points for Llama 3.1-8B on HotpotQA.
[4] PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems
Jiongchi Yu,Yuhan Ma,Xiaoyu Zhang,Junjie Wang,Qiang Hu,Chao Shen,Xiaofei Xie
Main category: cs.CL
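TL;DR: This paper proposes PTCBENCH, a benchmark for evaluating how consistent the personality traits of large language models (LLMs) remain across external contexts, and finds that contexts such as "Unemployment" significantly change both an LLM's personality expression and its reasoning ability.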
Details
Motivation: Existing work overlooks the psychological consensus that personality traits are dynamic and context-dependent, while LLMs used in affective agents and AI systems need a consistent and authentic personality to sustain user trust and engagement.
Method: PTCBENCH is a systematic benchmark covering 12 external contexts (such as locations and life events); it assesses 39,240 personality trait records using the NEO Five-Factor Inventory.
Result: Specific external contexts (for example "Unemployment") trigger significant personality changes in LLMs and affect their reasoning ability; PTCBENCH provides an extensible framework for evaluating personality consistency in realistic, evolving environments.
Conclusion: LLM personality is not static and its consistency depends strongly on context; psychologically grounded design is needed to improve the robustness and psychological alignment of AI systems.
Abstract: With the increasing deployment of large language models (LLMs) in affective agents and AI systems, maintaining a consistent and authentic LLM personality becomes critical for user trust and engagement. However, existing work overlooks a fundamental psychological consensus that personality traits are dynamic and context-dependent. To bridge this gap, we introduce PTCBENCH, a systematic benchmark designed to quantify the consistency of LLM personalities under controlled situational contexts. PTCBENCH subjects models to 12 distinct external conditions spanning diverse location contexts and life events, and rigorously assesses the personality using the NEO Five-Factor Inventory. Our study on 39,240 personality trait records reveals that certain external scenarios (e.g., "Unemployment") can trigger significant personality changes of LLMs, and even alter their reasoning capabilities. Overall, PTCBENCH establishes an extensible framework for evaluating personality consistency in realistic, evolving environments, offering actionable insights for developing robust and psychologically aligned AI systems.
[5] SafeTalkCoach: Diversity-Driven Multi-Agent Simulation for Parent-Teen Health Conversations
Benyamin Tabarsi,Wenbo Li,Tahreem Yasir,Aryan Santhosh Kumar,Laura Widman,Dongkuan Xu,Tiffany Barnes
Main category: cs.CL
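TL;DR: This paper proposes SafeTalkCoach, a diversity-driven multi-agent dialogue generation framework that simulates parent-child conversations about sexual health, and releases an accompanying dataset.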
Details
Motivation: Because sexual health topics are private and sensitive, real-world data on parent-child conversations about them are scarce and hard to collect; in addition, existing LLMs may deviate from best practices in dialogue generation and lack realism and diversity.
Method: SafeTalkCoach combines crowd-sourced and synthesized scenarios, established sexual health guidelines, evidence-based personas, adaptive control modules, and hierarchical diversification in a multi-agent dialogue generation framework.
Result: Evaluations show that SafeTalkCoach generates diverse parent-child sexual health conversations while maintaining realism, communication quality, and controllability.
Conclusion: The SafeTalkCoach framework and its dataset are intended to support both AI research and health communication practice.
Abstract: The importance of effective parent-child communication about sexual health is widely acknowledged, but real-world data on these conversations is scarce and challenging to collect, due to their private and sensitive nature. Although LLMs have been widely adopted in dialogue generation, they may deviate from best practices and frequently lack realism and diversity. We introduce SafeTalkCoach, a diversity-driven multi-agent dialogue generation framework that simulates parent-child conversations about sexual health, and present an accompanying dataset. SafeTalkCoach integrates crowd-sourced and synthesized scenarios, established sexual health guidelines, evidence-based personas, adaptive control modules, and hierarchical diversification. Through evaluations, we demonstrate that SafeTalkCoach generates diverse conversations while maintaining realism, communication quality, and controllability in practice. Our goal is that the SafeTalkCoach framework and the dataset support both AI research and health communications practices.
[6] Construct, Align, and Reason: Large Ontology Models for Enterprise Knowledge Management
Yao Zhang,Hongyin Zhu
Main category: cs.CL
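TL;DR: This paper proposes a unified framework called the large ontology model (LOM). By building a dual-layer enterprise ontology and training with a three-stage pipeline, it fuses structured and unstructured data and improves semantic reasoning, outperforming DeepSeek-V3.2 on complex graph reasoning tasks.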
Details
Motivation: Traditional knowledge graphs fall short in implicit relationship discovery and in the semantic understanding needed for complex question answering, which makes enterprise knowledge management over multi-source heterogeneous data difficult.
Method: A unified construct-align-reason framework (LOM) that (1) builds and fuses a dual-layer enterprise ontology from structured databases and unstructured text; (2) trains in three stages: ontology instruction fine-tuning, text-ontology grounding, and multi-task ontology-language instruction tuning with curriculum learning; and (3) constructs datasets covering diverse ontology reasoning tasks.
Result: On the authors' own benchmark, the 4B-parameter LOM reaches 89.47% accuracy and outperforms DeepSeek-V3.2 on complex graph reasoning, indicating effective fusion of ontology structure and language ability.
Conclusion: LOM offers a new paradigm for large-scale enterprise knowledge management that combines structural modeling with semantic reasoning, improving implicit relationship discovery and complex question answering.
Abstract: Enterprise-scale knowledge management faces significant challenges in integrating multi-source heterogeneous data and enabling effective semantic reasoning. Traditional knowledge graphs often struggle with implicit relationship discovery and lack sufficient semantic understanding for complex question answering. To address these limitations, we introduce a unified construct-align-reason framework, the large ontology model (LOM). We first build a dual-layer enterprise ontology from structured databases and unstructured text, subsequently fusing these sources into a comprehensive enterprise ontology. To enable instruction-aligned reasoning, we propose a unified three-stage training pipeline: ontology instruction fine-tuning to improve structural understanding; text-ontology grounding to strengthen node semantic encoding; and multi-task instruction tuning on ontology-language pairs with curriculum learning to enhance semantic reasoning and generation. We also construct comprehensive training and evaluation datasets covering diverse ontology reasoning tasks. On this benchmark, our 4B-parameter LOM achieves 89.47% accuracy and outperforms DeepSeek-V3.2 on complex graph reasoning, indicating effective fusion of ontology structure and language.
[7] Reversible Diffusion Decoding for Diffusion Language Models
Xinyun Wang,Min Zhang,Sen Cui,Zhikang Chen,Bo Jiang,Kun Kuang,Mingbao Lin
Main category: cs.CL
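TL;DR: This paper proposes Reversible Diffusion Decoding (RDD), a framework that detects stagnation, caches model states to backtrack efficiently, and applies confidence-guided re-masking, improving the robustness and quality of diffusion language model generation while keeping its parallel efficiency.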
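The confidence-guided re-masking step named in this entry can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the `MASK` symbol, the function name `remask_low_confidence`, and the fixed threshold of 0.5 are hypothetical, and the entry does not say how RDD actually scores or selects uncertain tokens.

```python
MASK = "<mask>"  # hypothetical mask symbol


def remask_low_confidence(tokens, confidences, threshold=0.5):
    """Return a copy of `tokens` with low-confidence positions reset to MASK.

    Tokens at or above the threshold are kept as reliable context; the
    rest are re-initialized so a later denoising pass can redo them.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(tokens, confidences)]


block = ["The", "cat", "sat", "on", "teh"]
conf = [0.98, 0.91, 0.40, 0.88, 0.12]
remasked = remask_low_confidence(block, conf)
print(remasked)  # ['The', 'cat', '<mask>', 'on', '<mask>']
```

Only the two uncertain positions are reopened, so a retry after backtracking does not start from the same committed block and the confident context is preserved.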
Details
Motivation: Diffusion language models make irreversible commitments during block-wise decoding, which can lead to stagnation (the reverse diffusion process cannot make further progress) and degrade generation quality.
Method: Reversible Diffusion Decoding (RDD) (1) models stagnation as a state-dependent failure of the reverse process; (2) uses cached model states to backtrack at the block level without recomputation; and (3) applies confidence-guided re-masking to selectively reinitialize uncertain tokens while preserving reliable context.
Result: Experiments show that RDD improves generation robustness and quality with minimal computational overhead.
Conclusion: By introducing reversibility, RDD lets diffusion language models recover from early commitment errors while keeping parallel efficiency, improving generation stability and performance.
Abstract: Diffusion language models enable parallel token generation through block-wise decoding, but their irreversible commitments can lead to stagnation, where the reverse diffusion process fails to make further progress under a suboptimal context. We propose Reversible Diffusion Decoding (RDD), a decoding framework that introduces reversibility into block-wise diffusion generation. RDD detects stagnation as a state-dependent failure of the reverse process and enables efficient backtracking to earlier blocks without recomputation via cached model states. To avoid repeated failure trajectories, RDD applies confidence-guided re-masking to selectively reinitialize uncertain tokens while preserving reliable context. This reversible formulation allows decoding to recover from early commitment errors while maintaining the parallel efficiency of diffusion-based generation. Experiments show that RDD improves generation robustness and quality over baselines with minimal computational overhead.
[8] DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking
Tianyi Hu,Niket Tandon,Akhil Arora
Main category: cs.CL
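TL;DR: This paper proposes DIVERGE, a framework that addresses the lack of diversity in existing RAG systems on open-ended questions. Using reflection-guided generation and memory-augmented iterative refinement, it substantially increases generation diversity while preserving answer quality.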
Details
Motivation: Existing RAG systems assume each query has a single correct answer and overlook information-seeking scenarios with multiple plausible answers where diversity matters, which constrains creativity and compromises fair access to information.
Method: DIVERGE is a plug-and-play agentic RAG framework with reflection-guided generation and memory-augmented iterative refinement; the authors also design new metrics for the diversity-quality trade-off.
Result: On the real-world Infinity-Chat dataset, DIVERGE achieves a better diversity-quality trade-off than all baselines and previous state-of-the-art methods, substantially improving diversity without hurting quality; the new metrics correlate well with human judgments.
Conclusion: Explicitly modeling diversity can systematically mitigate an inherent limitation of current LLM systems in open-ended information seeking, pointing toward fairer, more inclusive, and more creative AI.
Abstract: Existing retrieval-augmented generation (RAG) systems are primarily designed under the assumption that each query has a single correct answer. This overlooks common information-seeking scenarios with multiple plausible answers, where diversity is essential to avoid collapsing to a single dominant response, thereby constraining creativity and compromising fair and inclusive information access. Our analysis reveals a commonly overlooked limitation of standard RAG systems: they underutilize retrieved context diversity, such that increasing retrieval diversity alone does not yield diverse generations. To address this limitation, we propose DIVERGE, a plug-and-play agentic RAG framework with novel reflection-guided generation and memory-augmented iterative refinement, which promotes diverse viewpoints while preserving answer quality. We introduce novel metrics tailored to evaluating the diversity-quality trade-off in open-ended questions, and show that they correlate well with human judgments. We demonstrate that DIVERGE achieves the best diversity-quality trade-off compared to competitive baselines and previous state-of-the-art methods on the real-world Infinity-Chat dataset, substantially improving diversity while maintaining quality. More broadly, our results reveal a systematic limitation of current LLM-based systems for open-ended information-seeking and show that explicitly modeling diversity can mitigate it. Our code is available at: https://github.com/au-clan/Diverge
[9] Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering
Philip Müller,Nicholas Popovič,Michael Färber,Peter Steinbach
Main category: cs.CL
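TL;DR: This paper presents the first large-scale benchmark for evaluating the calibration of uncertainty quantification (UQ) methods for large language models (LLMs) on scientific question answering (QA). It finds that current UQ methods, especially token-level confidences and verbalized approaches, are not reliable enough, that answer frequency (consistency) performs best, and that relying on ECE alone is misleading.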
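Two quantities named in this entry, answer-frequency confidence and ECE, can be made concrete with a short Python sketch. The ECE below is the common equal-width-bin formulation; whether the benchmark uses exactly this binning is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def answer_frequency_confidence(samples):
    """Return the majority answer and the fraction of samples agreeing with it."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Four sampled answers to one question; three of them agree.
majority = answer_frequency_confidence(["4", "4", "5", "4"])
print(majority)  # ('4', 0.75)

# Overconfident: always 1.0, right half the time.
overconfident = expected_calibration_error([1.0, 1.0], [True, False])
# Uninformative: always 0.5, right half the time.
uninformative = expected_calibration_error([0.5, 0.5], [True, False])
print(overconfident, uninformative)  # 0.5 0.0
```

The second case shows one well-known way ECE on its own can mislead: a score that is identical for right and wrong answers still gets a perfect ECE of 0. Whether this is the specific effect the authors analyze is not stated in this entry.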
Details
Motivation: Existing uncertainty quantification (UQ) methods are weakly validated in scientific QA, a domain that depends heavily on fact retrieval and reasoning, so a reliable and reproducible UQ evaluation benchmark is needed.
Method: The authors build the first large-scale UQ benchmark for reasoning-demanding scientific QA, covering up to 20 LLMs (base, instruction-tuned, and reasoning variants) and seven scientific QA datasets (multiple-choice and arithmetic), with prompting used to emulate open-ended QA. Seven representative UQ methods are evaluated at the token level and the sequence level on 685,000 long-form responses, and the limitations of ECE are examined critically.
Result: (1) Instruction tuning strongly polarizes token-level probabilities, making them less reliable as uncertainty estimates; (2) reasoning fine-tuning partly mitigates this, with effects that vary by provider; (3) verbalized UQ methods are systematically biased and poorly correlated with correctness; (4) answer frequency (consistency across samples) gives the most reliable sequence-level calibration; (5) evaluating UQ methods with ECE alone is misleading.
Conclusion: Current UQ methods for LLMs have fundamental weaknesses in scientific QA; the field should move toward more robust sequence-level measures such as answer frequency and stop relying on ECE as the sole metric. The benchmark provides an extensible, open-source evaluation framework for follow-up work.
Abstract: Large Language Models (LLMs) are commonly used in Question Answering (QA) settings, increasingly in the natural sciences if not science at large. Reliable Uncertainty Quantification (UQ) is critical for the trustworthy uptake of generated answers. Existing UQ approaches remain weakly validated in scientific QA, a domain relying on fact-retrieval and reasoning capabilities. We introduce the first large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA, studying calibration of UQ methods and providing an extensible open-source framework to reproducibly assess calibration. Our study spans up to 20 large language models of base, instruction-tuned and reasoning variants. Our analysis covers seven scientific QA datasets, including both multiple-choice and arithmetic question answering tasks, using prompting to emulate an open question answering setting. We evaluate and compare methods representative of prominent approaches on a total of 685,000 long-form responses, spanning different reasoning complexities representative of domain-specific tasks. At the token level, we find that instruction tuning induces strong probability mass polarization, reducing the reliability of token-level confidences as estimates of uncertainty. Models further fine-tuned for reasoning are exposed to the same effect, but the reasoning process appears to mitigate it depending on the provider. At the sequence level, we show that verbalized approaches are systematically biased and poorly correlated with correctness, while answer frequency (consistency across samples) yields the most reliable calibration. In the wake of our analysis, we study and report the misleading effect of relying exclusively on ECE as a sole measure for judging performance of UQ methods on benchmark datasets. Our findings expose critical limitations of current UQ methods for LLMs and standard practices in benchmarking thereof.
[10] Faithful-Patchscopes: Understanding and Mitigating Model Bias in Hidden Representations Explanation of Large Language Models
Xilin Gong,Shu Yang,Zehua Cao,Lynne Billard,Di Wang
Main category: cs.CL
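TL;DR: This paper finds that, within the Patchscopes framework, LLMs explaining hidden representations are swayed by their own linguistic priors, which makes the explanations unfaithful. The authors build an evaluation dataset and propose BALOR, which recalibrates logits to suppress bias and amplify contextual information, substantially improving explanation faithfulness.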
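The idea of contrasting patched and unpatched logits can be sketched in a few lines of Python. The exact BALOR formula is not given in this entry, so the update rule below (`patched + alpha * (patched - unpatched)`, a contrastive-decoding-style adjustment), the `alpha` parameter, and the toy logit values are all assumptions chosen for illustration.

```python
def recalibrate_logits(patched, unpatched, alpha=1.0):
    """Push logits away from the bias estimate.

    patched: token -> logit with the hidden representation patched in.
    unpatched: token -> logit from the same prompt without patching,
    treated here as an estimate of the model's prior bias.
    """
    return {tok: patched[tok] + alpha * (patched[tok] - unpatched[tok])
            for tok in patched}


# Toy numbers for the "purple broccoli" case (hypothetical values).
unpatched = {"green": 4.0, "purple": 1.0}  # prior association only
patched = {"green": 3.5, "purple": 3.0}    # hidden state encodes "purple"

before = max(patched, key=patched.get)
adjusted = recalibrate_logits(patched, unpatched)
after = max(adjusted, key=adjusted.get)
print(before, after)  # green purple
```

In this toy case the prior still wins on the raw patched logits, and the contrast flips the prediction to the attribute carried by the hidden representation.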
Details
Motivation: Patchscopes uses an LLM to decode its own internal hidden representations into readable explanations, but the authors find that LLMs tend to rely on inherent linguistic patterns rather than the contextual information actually encoded in the hidden representation (for example, still explaining the representation of a "purple broccoli" as "green"), which exposes a systematic unfaithfulness.
Method: The authors first build a faithfulness evaluation dataset for biased cases. They then propose Bias Alignment through Logit Recalibration (BALOR), which treats the output logits from an unpatched prompt as an estimate of model bias, contrasts them with the logits obtained after patching in contextual information, and recalibrates the logit distribution to suppress bias and amplify the contextual signal.
Result: Under biased cases, the original Patchscopes loses 18.84% faithfulness on average; BALOR consistently outperforms baselines across multiple LLMs, with up to 33% relative improvement.
Conclusion: LLM explanations in Patchscopes are systematically unfaithful because of linguistic priors; BALOR mitigates this through logit-level bias alignment and recalibration, improving the faithfulness and reliability of hidden representation explanations.
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities for hidden representation interpretation through Patchscopes, a framework that uses LLMs themselves to generate human-readable explanations by decoding from internal hidden representations. However, our work shows that LLMs tend to rely on inherent linguistic patterns, which can override contextual information encoded in the hidden representations during decoding. For example, even when a hidden representation encodes the contextual attribute "purple" for "broccoli", LLMs still generate "green" in their explanations, reflecting a strong prior association. This behavior reveals a systematic unfaithfulness in Patchscopes. To systematically study this issue, we first designed a dataset to evaluate the faithfulness of Patchscopes under biased cases, and our results show that there is an 18.84% faithfulness decrease on average. We then propose Bias Alignment through Logit Recalibration (BALOR), which treats the output logits from an unpatched prompt as capturing model bias and contrasts them with logits obtained under patched contextual information. By recalibrating the logit distribution through this contrast, BALOR suppresses model bias and amplifies contextual information during generation. Experiments across multiple LLMs demonstrate that BALOR consistently outperforms existing baselines, achieving up to 33% relative performance improvement.
[11] MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes
Rodrigo Batista,Luís Filipe Cunha,Purificação Silvano,Nuno Guimarães,Alípio Jorge,Evelin Amorim,Ricardo Campos
Main category: cs.CL
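TL;DR: This paper proposes a two-stage pipeline for extracting structured metadata (meeting number, date, location, and so on) from municipal meeting minutes with heterogeneous formats and styles. Stage one uses a question answering model to locate the opening and closing segments that contain metadata; stage two applies Transformer models with deslexicalization (BERTimbau/XLM-RoBERTa, with or without CRF) for fine-grained entity extraction. The pipeline is compared against LLMs such as Phi and Gemini on performance, inference cost, and carbon footprint. It performs well in-domain but generalizes less well across municipalities, and the work establishes the first benchmark for metadata extraction from municipal meeting minutes.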
Details
Motivation: Municipal meeting minutes have inconsistent formats and non-standardized metadata, and existing NER models are not adapted to their domain-specific categories, so a dedicated metadata extraction method is needed.
Method: A two-stage pipeline: first, a question answering (QA) model locates the opening and closing text segments that contain metadata; second, Transformer models enhanced with deslexicalization (BERTimbau and XLM-RoBERTa, with and without a CRF layer) perform fine-grained entity extraction. Open-weight (Phi) and closed-weight (Gemini) LLMs are benchmarked on predictive performance, inference cost, and carbon footprint.
Result: The pipeline outperforms larger general-purpose LLMs on in-domain data; generalization drops in cross-municipality tests, reflecting the heterogeneity of municipal records in format and language. The work also establishes the first benchmark for metadata extraction from municipal meeting minutes.
Conclusion: The two-stage approach is an effective and sustainable solution for metadata extraction from municipal minutes, and the benchmark lays a foundation for future research, though robustness to regional language variation and generalization need further work.
Abstract: Municipal meeting minutes are official documents of local governance, exhibiting heterogeneous formats and writing styles. Effective information retrieval (IR) requires identifying metadata such as meeting number, date, location, participants, and start/end times, elements that are rarely standardized or easy to extract automatically. Existing named entity recognition (NER) models are ill-suited to this task, as they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question answering (QA) model identifies the opening and closing text segments containing metadata. Transformer-based models (BERTimbau and XLM-RoBERTa with and without a CRF layer) are then applied for fine-grained entity extraction and enhanced through deslexicalization. To evaluate our proposed pipeline, we benchmark both open-weight (Phi) and closed-weight (Gemini) LLMs, assessing predictive performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, better than larger general-purpose LLMs. However, cross-municipality evaluation reveals reduced generalization reflecting the variability and linguistic complexity of municipal records. This work establishes the first benchmark for metadata extraction from municipal meeting minutes, providing a solid foundation for future research in this domain.
[12] Detecting AI-Generated Content in Academic Peer Reviews
Siyuan Shen,Kai Wang
Main category: cs.CL
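TL;DR: This study uses a detection model to trace how LLM-generated content in peer reviews at ICLR and Nature Communications (NC) has evolved over time. AI-generated review content rises markedly from 2022 onward, reaching about 20% of ICLR reviews and 12% of NC reviews in 2025, with the sharpest growth in NC between the third and fourth quarters of 2024.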
Details
Motivation: As large language models (LLMs) become widespread, their role in academic peer review has drawn attention, and the actual timing and scale of AI-generated content in reviews need to be understood.
Method: An AI-content detection model trained on historical reviews is applied retrospectively to review text from successive ICLR and Nature Communications cycles to track changes over time.
Result: AI-generated reviews are rare before 2022; in 2025 about 20% of ICLR reviews and 12% of NC reviews are classified as AI-generated; the most pronounced growth in NC occurs between the third and fourth quarters of 2024.
Conclusion: AI-assisted content is spreading quickly through academic peer review, and its growing presence calls for further study of its effects on the quality, fairness, and transparency of scholarly evaluation.
Abstract: The growing availability of large language models (LLMs) has raised questions about their role in academic peer review. This study examines the temporal emergence of AI-generated content in peer reviews by applying a detection model trained on historical reviews to later review cycles at the International Conference on Learning Representations (ICLR) and Nature Communications (NC). We observe minimal detection of AI-generated content before 2022, followed by a substantial increase through 2025, with approximately 20% of ICLR reviews and 12% of Nature Communications reviews classified as AI-generated in 2025. The most pronounced growth of AI-generated reviews in NC occurs between the third and fourth quarter of 2024. Together, these findings provide suggestive evidence of a rapidly increasing presence of AI-assisted content in peer review and highlight the need for further study of its implications for scholarly evaluation.
[13] DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning
Li Siyan,Darshan Deshpande,Anand Kannappan,Rebecca Qian
Main category: cs.CL
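TL;DR: This paper proposes DETOUR, a benchmark for evaluating multi-turn recall in vague, underspecified retrieval scenarios (the tip-of-the-tongue situation); current state-of-the-art models reach only 36% accuracy on it.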
Details
Motivation: Existing benchmarks are limited to single-turn retrieval and cannot realistically simulate how people complete a tip-of-the-tongue recollection over several conversational turns, so a more realistic multi-turn underspecified retrieval benchmark is needed.
Method: DETOUR is a dual-agent evaluation framework: the Primary Agent under evaluation issues multi-turn queries to a Memory Agent, which is held fixed across evaluations, in order to identify the target entity. The benchmark contains 1,011 prompts spanning text, image, audio, and video.
Result: Current state-of-the-art models reach only 36% overall accuracy across all modalities, well below their performance in single-turn settings, which confirms how challenging the benchmark is.
Conclusion: DETOUR exposes a serious weakness of large models on multi-turn, vague, cross-modal retrieval tasks and underlines the importance of improving reasoning and interactive retrieval in underspecified scenarios.
Abstract: When recalling information in conversation, people often arrive at the recollection after multiple turns. However, existing benchmarks for evaluating agent capabilities in such tip-of-the-tongue search processes are restricted to single-turn settings. To more realistically simulate tip-of-the-tongue search, we introduce Dual-agent based Evaluation Through Obscure Under-specified Retrieval (DETOUR), a dual-agent evaluation benchmark containing 1,011 prompts. The benchmark design involves a Primary Agent, which is the subject of evaluation, tasked with identifying the recollected entity through querying a Memory Agent that is held consistent across evaluations. Our results indicate that current state-of-the-art models still struggle with our benchmark, only achieving 36% accuracy when evaluated on all modalities (text, image, audio, and video), highlighting the importance of enhancing capabilities in underspecified scenarios.
[14] DecompressionLM: Deterministic, Diagnostic, and Zero-Shot Concept Graph Extraction from Language Models
Zhaochen Hong,Jiaxuan You
Main category: cs.CL
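TL;DR: This paper proposes DecompressionLM, a stateless zero-shot concept graph extraction framework that needs no predefined queries, to discover the knowledge encoded in language models; it also shows how different quantization methods affect concept coverage.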
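The details for this entry state that generation is driven by Van der Corput low-discrepancy sequences combined with arithmetic decoding. The sequence itself is standard and can be computed as below. The `decode_step` helper is only a toy reading of how a point in [0, 1) could select a token from a probability table; it is an assumption for illustration and not the paper's decoder.

```python
def van_der_corput(n: int, base: int = 2) -> float:
    """Return the n-th Van der Corput point in [0, 1).

    The point is obtained by mirroring the base-`base` digits of n
    around the radix point.
    """
    value, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        value += digit / denom
    return value


def decode_step(u, probs):
    """Toy arithmetic-decoding step (hypothetical helper).

    probs: list of (token, probability) pairs summing to 1.
    Returns the token whose cumulative interval contains u, plus u
    rescaled to [0, 1) so it can drive the next step.
    """
    low = 0.0
    for token, p in probs:
        if u < low + p:
            return token, (u - low) / p
        low += p
    return probs[-1][0], 0.0  # guard against rounding at the upper edge


points = [van_der_corput(i) for i in range(1, 8)]
print(points)  # [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]

step = decode_step(0.75, [("a", 0.5), ("b", 0.5)])
print(step)  # ('b', 0.5)
```

Each point depends only on its own index, so sequences can be generated deterministically and in parallel with no state shared between them, while the points still spread evenly over the unit interval.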
Details
Motivation: Existing knowledge probing methods depend on predefined queries, which limits extraction to known concepts; common decoding-based probing approaches also suffer from cross-sequence coupling, competitive decoding effects, and poor scalability.
Method: Van der Corput low-discrepancy sequences are combined with arithmetic decoding to give deterministic, highly parallel generation with no state shared across sequences; the method is evaluated across several model families and quantization variants, and hallucinations are detected through corpus-based verification.
Result: Activation-aware quantization (AWQ-4bit) expands concept coverage by 30-170%, whereas uniform quantization (GPTQ-Int4) causes a 71-86% coverage collapse; among MMLU-Pro Law models there is a 17-point hallucination gap between the top- and bottom-ranked models.
Conclusion: DecompressionLM establishes concept coverage as a new dimension for assessing the knowledge breadth and factual grounding of compressed models, which is useful for deployment.
Abstract: Existing knowledge probing methods rely on pre-defined queries, limiting extraction to known concepts. We introduce DecompressionLM, a stateless framework for zero-shot concept graph extraction that discovers what language models encode without pre-specified queries or shared cross-sequence state. Our method targets three limitations of common decoding-based probing approaches: cross-sequence coupling that concentrates probability mass on high-frequency prefixes, competitive decoding effects that suppress long-tail concepts, and scalability constraints arising from sequential exploration. Using Van der Corput low-discrepancy sequences with arithmetic decoding, DecompressionLM enables deterministic, embarrassingly parallel generation without shared state across sequences. Across two model families and five quantization variants, we find that activation-aware quantization (AWQ-4bit) expands concept coverage by 30-170%, while uniform quantization (GPTQ-Int4) induces 71-86% coverage collapse -- divergent behaviors not reliably reflected by explanation-level perplexity. Corpus-based verification further reveals a 17-point hallucination gap between top- and bottom-ranked MMLU-Pro Law models. DecompressionLM establishes concept coverage as a complementary evaluation dimension for assessing knowledge breadth and factual grounding in compressed models useful for their deployment.
[15] Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models
Sercan Karakaş
Main category: cs.CL
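TL;DR: This paper evaluates how well state-of-the-art large language models capture the binding relations of Turkish reflexive pronouns and finds marked differences between models in choosing local versus long-distance antecedents.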
Details
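Motivation: To examine whether current state-of-the-art large language models correctly model the syntactic binding of the Turkish reflexives kendi and kendisi, in particular the competition between local and non-local antecedents.
Method: A balanced dataset of 100 sentences is used to compare OpenAI's o1 Mini (a chain-of-thought model) with Trendyol-LLM-7B-base-v0.1 (a LLaMA-2-derived model extensively fine-tuned on Turkish data); antecedent preference is assessed with sentence-level perplexity combined with a forced-choice paradigm.
Result: Trendyol-LLM prefers local binding in about 70% of trials, a strong locality bias, whereas o1 Mini chooses local and long-distance readings almost equally often; the two systems differ clearly in binding behavior.
Conclusion: Large language models with different architectures and training strategies show very different syntactic sensitivity to Turkish reflexive binding, suggesting that a model's grammatical knowledge is not universal but depends heavily on its training setup and language adaptation.
Abstract: This study evaluates whether state-of-the-art large language models capture the binding relations of Turkish reflexive pronouns. We construct a balanced set of 100 sentences that pit local against non-local antecedents for the reflexives kendi and kendisi, and test two contrasting systems: an OpenAI chain-of-thought model designed for multi-step reasoning and Trendyol-LLM-7B-base-v0.1, a LLaMA-2-derived model extensively fine-tuned on Turkish data. Antecedent choice is assessed using a combined sentence-level perplexity and forced-choice paradigm. Trendyol-LLM favours local bindings in approximately 70% of trials, exhibiting a strong locality bias, whereas o1 Mini distributes its choices almost evenly between local and long-distance readings, revealing a marked contrast in binding behaviour across the two systems.
[16] Segment-Level Attribution for Selective Learning of Long Reasoning Traces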
Siyuan Wang,Yanchen Liu,Xiang Ren
Main category: cs.CL
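TL;DR: This paper proposes a segment-level selective supervised fine-tuning method (SegmentSelectiveSFT) based on integrated gradient attribution. It scores each segment of a reasoning chain by the strength and direction consistency of its contribution to the answer prediction, selects the important segments that reflect reflective reasoning for training, and thereby improves the accuracy and output efficiency of large reasoning models.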
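The two segment-level metrics and the selection rule in this entry can be sketched in Python. The entry does not give formulas, so everything below is assumed for illustration: strength as the mean absolute token attribution, consistency as |sum| / sum of |values| (1.0 when all signs agree), the threshold values, and the function names.

```python
def segment_metrics(attributions):
    """Return (strength, consistency) for one segment's token attributions."""
    total_abs = sum(abs(a) for a in attributions)
    if total_abs == 0.0:
        return 0.0, 0.0
    strength = total_abs / len(attributions)
    consistency = abs(sum(attributions)) / total_abs  # 1.0 when all signs agree
    return strength, consistency


def select_segments(segments, min_strength=0.5, max_consistency=0.8):
    """Return one boolean per segment: True keeps the SFT loss on it.

    Segments with high strength but only moderate consistency are kept;
    all others would have their loss masked during fine-tuning.
    """
    keep = []
    for attrs in segments:
        strength, consistency = segment_metrics(attrs)
        keep.append(strength >= min_strength and consistency <= max_consistency)
    return keep


segments = [
    [0.9, -0.7, 0.8, -0.2],    # strong, mixed signs
    [0.9, 0.8, 0.7, 0.9],      # strong, uniformly positive
    [0.01, -0.02, 0.01, 0.0],  # weak
]
loss_mask = select_segments(segments)
print(loss_mask)  # [True, False, False]
```

Only the first segment is both influential and mixed in sign, so it alone keeps its training loss. In an actual fine-tuning loop, a mask like this would be expanded to token positions and multiplied into the per-token loss.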
Details
Motivation: Large reasoning models (LRMs) can generate long chains of thought (CoTs), but much of the content is redundant, repetitive, or truncated, and supervised fine-tuning (SFT) reinforces these low-information patterns and hurts performance.
Method: Integrated gradient attribution quantifies each token's influence on the answer, and token scores are aggregated into two segment-level metrics: attribution strength and direction consistency. Segments with high attribution strength but moderate direction consistency are identified as important (reflecting reflective reasoning); SFT is applied only to these segments, and the loss is masked for the rest.
Result: Experiments across multiple models and datasets show improved reasoning accuracy and output efficiency, making more effective use of long reasoning traces for learning.
Conclusion: Segment-level selective SFT suppresses redundant reasoning and steers the model toward the reasoning steps that actually discriminate between answers, giving an efficient and interpretable way to train on reasoning traces.
Abstract: Large Reasoning Models (LRMs) achieve strong reasoning performance by generating long chains of thought (CoTs), yet only a small fraction of these traces meaningfully contributes to answer prediction, while the majority contains repetitive or truncated content. Such output redundancy is further propagated after supervised finetuning (SFT), as models learn to imitate verbose but uninformative patterns, which can degrade performance. To this end, we incorporate integrated gradient attribution to quantify each token's influence on final answers and aggregate them into two segment-level metrics: (1) attribution strength measures the overall attribution magnitude; and (2) direction consistency captures whether tokens' attributions within a segment are uniformly positive or negative (high consistency), or a mixture of both (moderate consistency). Based on these two metrics, we propose a segment-level selective learning framework to identify important segments with high attribution strength but moderate consistency that indicate reflective rather than shallow reasoning. The framework then applies selective SFT on these important segments while masking loss for unimportant ones. Experiments across multiple models and datasets show that our approach improves accuracy and output efficiency, enabling more effective learning from long reasoning traces. Code and data are available at https://github.com/SiyuanWangw/SegmentSelectiveSFT
[17] When Agents "Misremember" Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems
Naen Xu,Hengyu An,Shuo Shi,Jinghuai Zhang,Chunyi Zhou,Changjiang Li,Tianyu Du,Zhihui Fu,Jun Wang,Shouling Ji
Main category: cs.CL
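TL;DR: This paper studies the existence, causes, and mitigation of collective cognitive bias (specifically the Mandela effect) in multi-agent systems driven by large language models (LLMs). It introduces a new benchmark, MANBENCH, and designs prompt-level and model-level defenses that reduce the Mandela effect by 74.40% on average.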
Details
Motivation: Multi-agent systems are vulnerable to collective cognitive biases such as the Mandela effect, which lead to memory errors and the spread of misinformation; the problem remains underexplored and poses both a gap in understanding and an ethical risk.
Method: The authors build MANBENCH, a benchmark covering four task types susceptible to the Mandela effect and five interaction protocols; quantify the effect across multiple LLMs; and propose prompt-level (cognitive anchoring, source scrutiny) and model-level (alignment-based) mitigation strategies.
Result: The Mandela effect is clearly present in LLM-based multi-agent systems; key factors include agent roles, memory timescales, and interaction protocols; the proposed defenses reduce the effect by 74.40% on average.
Conclusion: The Mandela effect is a real and measurable risk in LLM-based multi-agent systems; prompt design and model alignment should be combined to make these systems more robust and ethically reliable.
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced the capabilities of collaborative multi-agent systems, enabling them to address complex challenges. However, within these multi-agent systems, the susceptibility of agents to collective cognitive biases remains an underexplored issue. A compelling example is the Mandela effect, a phenomenon where groups collectively misremember past events as a result of false details reinforced through social influence and internalized misinformation. This vulnerability limits our understanding of memory bias in multi-agent systems and raises ethical concerns about the potential spread of misinformation. In this paper, we conduct a comprehensive study on the Mandela effect in LLM-based multi-agent systems, focusing on its existence, causing factors, and mitigation strategies. We propose MANBENCH, a novel benchmark designed to evaluate agent behaviors across four common task types that are susceptible to the Mandela effect, using five interaction protocols that vary in agent roles and memory timescales. We evaluate agents powered by several LLMs on MANBENCH to quantify the Mandela effect and analyze how different factors affect it. Moreover, we propose strategies to mitigate this effect, including prompt-level defenses (e.g., cognitive anchoring and source scrutiny) and model-level alignment-based defense, achieving an average 74.40% reduction in the Mandela effect compared to the baseline. Our findings provide valuable insights for developing more resilient and ethically aligned collaborative multi-agent systems.
[18] What Matters to an LLM? Behavioral and Computational Evidences from Summarization
Yongxin Zhou,Changshun Wu,Philippe Mulhem,Didier Schwab,Maxime Peyrard
Main category: cs.CL
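TL;DR: This paper combines behavioral and computational analyses to probe the internal notion of "importance" that large language models (LLMs) apply in summarization. It finds that LLMs show stable importance-selection patterns that differ from those of traditional models, and it locates the attention heads and network layers closely tied to this mechanism.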
Details
Motivation: LLMs perform strongly at summarization, but the internal mechanism by which they decide what information is important remains opaque and calls for interpretability research.
Method: Combines behavioral analysis (generating length-controlled summaries and counting how often each information unit is selected, to build an empirical importance distribution) with computational analysis (identifying attention heads aligned with empirical importance and analyzing how well each layer predicts importance).
Result: LLMs show importance-selection patterns that are stable, cluster clearly by model family, and differ from traditional models; specific attention heads in middle-to-late layers align closely with empirical importance.
Conclusion: LLMs have an identifiable and localizable internal representation of importance in summarization, which opens a new path to understanding and controlling their information selection.
Abstract: Large Language Models (LLMs) are now state-of-the-art at summarization, yet the internal notion of importance that drives their information selections remains hidden. We propose to investigate this by combining behavioral and computational analyses. Behaviorally, we generate a series of length-controlled summaries for each document and derive empirical importance distributions based on how often each information unit is selected. These reveal that LLMs converge on consistent importance patterns, sharply different from pre-LLM baselines, and that LLMs cluster more by family than by size. Computationally, we identify that certain attention heads align well with empirical importance distributions, and that middle-to-late layers are strongly predictive of importance. Together, these results provide initial insights into what LLMs prioritize in summarization and how this priority is internally represented, opening a path toward interpreting and ultimately controlling information selection in these models.
[19] Words that make SENSE: Sensorimotor Norms in Learned Lexical Token Representations
Abhinav Gupta,Toben H. Mintz,Jesse Thomason
Main category: cs.CL
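TL;DR: This paper proposes the SENSE model, which maps word embeddings to the Lancaster sensorimotor norms, validates its correlation with human perceptual selection rates in a behavioral experiment, and finds systematic sound-meaning correspondences in the interoceptive modality.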
Details
Motivation: Word embeddings rely on co-occurrence statistics, whereas human language understanding is grounded in sensory and motor experience, so lexical representations need to be aligned with sensorimotor information.
Method: Builds the SENSE (Sensorimotor Embedding Norm Scoring Engine) model, which learns a mapping from word vectors to the Lancaster sensorimotor norms; runs a behavioral experiment with 281 participants to assess selection preferences for nonce words across 11 sensorimotor modalities; performs a sublexical sound-meaning analysis.
Result: SENSE ratings correlate significantly with human selection rates in 6 of the 11 modalities; the sublexical analysis reveals systematic phonosthemic patterns in the interoceptive modality.
Conclusion: SENSE offers a computable bridge between distributional semantics and embodied cognition; the phonosthemic findings support the feasibility of automatically mining candidate phonosthemes from text data.
Abstract: While word embeddings derive meaning from co-occurrence patterns, human language understanding is grounded in sensory and motor experience. We present SENSE (Sensorimotor Embedding Norm Scoring Engine), a learned projection model that predicts Lancaster sensorimotor norms from word lexical embeddings. We also conducted a behavioral study where 281 participants selected which among candidate nonce words evoked specific sensorimotor associations, finding statistically significant correlations between human selection rates and SENSE ratings across 6 of the 11 modalities. Sublexical analysis of these nonce-word selection rates revealed systematic phonosthemic patterns for the interoceptive norm, suggesting a path towards computationally proposing candidate phonosthemes from text data.
[20] Intention-Adaptive LLM Fine-Tuning for Text Revision Generation
Zhexiong Liu,Diane Litman
Main category: cs.CL
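TL;DR: This paper proposes Intention-Tuning, a framework that fine-tunes selected layers of a large language model (LLM) for intention-driven text revision generation and outperforms several PEFT methods on small-scale data.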
Details
Motivation: Existing LLMs fall short in intention-based text generation such as revision generation, especially when multiple intentions are entangled; instruction fine-tuning that depends on large amounts of annotated data is limited by data scarcity and high cost.
Method: Proposes Intention-Tuning, an intention-adaptive layer-wise LLM fine-tuning framework that dynamically selects a subset of model layers to learn writing intentions and transfers the learned intention representations to revision generation.
Result: Experiments on small revision corpora show that Intention-Tuning is effective and efficient and outperforms several parameter-efficient fine-tuning (PEFT) baselines.
Conclusion: Intention-Tuning offers a feasible and efficient new paradigm for adapting LLMs to low-resource, intention-driven generation tasks.
Abstract: Large Language Models (LLMs) have achieved impressive capabilities in various context-based text generation tasks, such as summarization and reasoning; however, their applications in intention-based generation tasks remain underexplored. One such example is revision generation, which requires the generated text to explicitly reflect the writer's actual intentions. Identifying intentions and generating desirable revisions are challenging due to their complex and diverse nature. Although prior work has employed LLMs to generate revisions with few-shot learning, these approaches struggle with entangled multi-intent scenarios. While fine-tuning LLMs using intention-based instructions appears promising, it demands large amounts of annotated data, which is expensive and scarce in the revision community. To address these challenges, we propose Intention-Tuning, an intention-adaptive layer-wise LLM fine-tuning framework that dynamically selects a subset of LLM layers to learn the intentions and subsequently transfers their representations to revision generation. Experimental results suggest that Intention-Tuning is effective and efficient on small revision corpora, outperforming several PEFT baselines.
[21] From Knowledge to Inference: Scaling Laws of Specialized Reasoning on GlobalHealthAtlas
Zhaokun Yan,Zhaohan Liu,Wuzheng Dong,Lijie Feng,Chengxiao Dai
Main category: cs.CL
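TL;DR: This paper presents the GlobalHealthAtlas dataset and an accompanying evaluation framework, aiming to advance machine learning research in the safety-critical area of public health reasoning.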
Details
Motivation: Public health reasoning is an important but underexplored machine learning problem that lacks supervised signals and benchmarks.
Method: Builds GlobalHealthAtlas, a multilingual, multi-domain dataset of about 280,000 instances stratified by difficulty, with an LLM-assisted construction and quality control pipeline; also proposes a domain-aligned, multi-dimensional evaluator.
Result: Provides a large-scale, structured public health reasoning dataset that supports stratified evaluation, together with evaluation tools, enabling reproducible training and evaluation of LLMs in this domain.
Conclusion: The work establishes a new data and evaluation foundation for safety-critical public health reasoning, going beyond conventional QA-style benchmarks.
Abstract: Public health reasoning requires population-level inference grounded in scientific evidence, expert consensus, and safety constraints. However, it remains underexplored as a structured machine learning problem with limited supervised signals and benchmarks. We introduce GlobalHealthAtlas, a large-scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages, stratified into three difficulty levels from health literacy to epidemiological and policy reasoning. Instances are derived from openly available public health sources and labeled by language, domain, and difficulty to support supervised learning and slice-based evaluation. We further propose a large language model (LLM)-assisted construction and quality control pipeline with retrieval, duplication, evidence grounding checks, and label validation to improve consistency at scale. Finally, we present a domain-aligned evaluator distilled from high-confidence judgments of diverse LLMs to assess outputs along six dimensions: Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness. Together, these contributions enable reproducible training and evaluation of LLMs for safety-critical public health reasoning beyond conventional QA benchmarks.
[22] Culturally-Grounded Governance for Multilingual Language Models: Rights, Data Boundaries, and Accountable AI Design
Hanjing Shi,Dominic DiFranzo
Main category: cs.CL
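TL;DR: This paper proposes a culturally grounded governance framework for multilingual large language models (MLLMs), arguing that governance must move beyond English-centric and technology-neutral assumptions and address structural inequities in data, norms, and accountability mechanisms that affect low-resource languages and marginalized cultural communities.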
Details
Motivation: Existing MLLM governance frameworks are English-centric and overlook cultural and linguistic diversity, creating systematic risks for low-resource languages and culturally marginalized groups.
Method: Drawing on cross-cultural perspectives in human-centered computing and AI governance, synthesizes evidence on multilingual model behavior, data asymmetries, and sociotechnical harm, and proposes a culturally grounded governance framework.
Result: Identifies three governance challenges: cultural and linguistic inequities in training and evaluation, misalignment between global deployment and local norms and power structures, and a lack of effective accountability mechanisms for marginalized language communities.
Conclusion: Multilingual AI governance should be reframed as a sociocultural and rights-based problem; culturally grounded governance is key to preventing MLLMs from deepening global inequality.
Abstract: Multilingual large language models (MLLMs) are increasingly deployed across cultural, linguistic, and political contexts, yet existing governance frameworks largely assume English-centric data, homogeneous user populations, and abstract notions of fairness. This creates systematic risks for low-resource languages and culturally marginalized communities, where data practices, model behavior, and accountability mechanisms often fail to align with local norms, rights, and expectations. Drawing on cross-cultural perspectives in human-centered computing and AI governance, this paper synthesizes existing evidence on multilingual model behavior, data asymmetries, and sociotechnical harm, and articulates a culturally grounded governance framework for MLLMs. We identify three interrelated governance challenges: cultural and linguistic inequities in training data and evaluation practices, misalignment between global deployment and locally situated norms, values, and power structures, and limited accountability mechanisms for addressing harms experienced by marginalized language communities. Rather than proposing new technical benchmarks, we contribute a conceptual agenda that reframes multilingual AI governance as a sociocultural and rights-based problem. We outline design and policy implications for data stewardship, transparency, and participatory accountability, and argue that culturally grounded governance is essential for ensuring that multilingual language models do not reproduce existing global inequalities under the guise of scale and neutrality.
[23] Reasoning by Commented Code for Table Question Answering
Seho Pyo,Jiheon Seok,Jaejin Lee
Main category: cs.CL
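TL;DR: This paper proposes a step-by-step code-generation framework with natural language comments that improves the numerical accuracy and interpretability of large language models on table question answering (TableQA), reaching 84.3% accuracy on the WikiTableQuestions benchmark.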
Details
Motivation: Conventional table linearization disrupts the two-dimensional structure of tables, and existing end-to-end or single-line program generation methods show limited numerical accuracy and interpretability.
Method: Proposes a framework that generates multi-line executable Python programs with concise natural language comments, explicitly decomposing TableQA reasoning into multi-step code, and combines it with a strong end-to-end model through a lightweight answer-selection mechanism.
Result: On the WikiTableQuestions benchmark, Qwen2.5-Coder-7B-Instruct alone reaches 70.9% accuracy (above Repanda's 67.6%); combined with the end-to-end model, accuracy reaches 84.3%.
Conclusion: Explicit step-by-step code generation with comments effectively improves the accuracy and interpretability of large models on TableQA and is a promising paradigm for strengthening structured reasoning.
Abstract: Table Question Answering (TableQA) poses a significant challenge for large language models (LLMs) because conventional linearization of tables often disrupts the two-dimensional relationships intrinsic to structured data. Existing methods, which depend on end-to-end answer generation or single-line program queries, typically exhibit limited numerical accuracy and reduced interpretability. This work introduces a commented, step-by-step code-generation framework that incorporates explicit reasoning into the Python program-generation process. The approach decomposes TableQA reasoning into multi-line executable programs with concise natural language comments, thereby promoting clearer reasoning and increasing the likelihood of generating correct code. On the WikiTableQuestions benchmark, the proposed method achieves 70.9% accuracy using Qwen2.5-Coder-7B-Instruct, surpassing the Repanda baseline (67.6%). Integrating the proposed framework with a robust end-to-end TableQA model via a lightweight answer-selection mechanism yields further improvements. This combined approach achieves up to 84.3% accuracy on the WikiTableQuestions benchmark.
[24] A Hierarchical and Attentional Analysis of Argument Structure Constructions in BERT Using Naturalistic Corpora
Liu Kaipeng,Wu Ling
Main category: cs.CL
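TL;DR: This study examines how BERT processes four basic argument structure constructions and finds that its representations are hierarchical: construction-specific information emerges in early layers, forms maximally separable clusters in middle layers, and is maintained in later layers.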
Details
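Motivation: To investigate how BERT processes four basic argument structure constructions and to understand its internal representational mechanisms.
Method: Uses a multi-dimensional analytical framework that integrates MDS and t-SNE for dimensionality reduction, GDV as a cluster separation metric, FDR for linear diagnostic probing, and an analysis of the attention mechanism.
Result: BERT's representations are hierarchically structured: construction-specific information emerges in early layers, forms maximally separable clusters in middle layers, and is maintained in later layers.
Conclusion: BERT's representation of argument structure constructions follows a clear hierarchical processing path, with different network layers encoding linguistic structure at different levels of abstraction.
Abstract: This study investigates how the Bidirectional Encoder Representations from Transformers (BERT) model processes four fundamental Argument Structure Constructions. We employ a multi-dimensional analytical framework, which integrates MDS and t-SNE for dimensionality reduction, Generalized Discrimination Value (GDV) as a cluster separation metric, Fisher Discriminant Ratio (FDR) for linear diagnostic probing, and attention mechanism analysis. Our results reveal a hierarchical representational structure. Construction-specific information emerges in early layers, forms maximally separable clusters in middle layers, and is maintained through later processing stages.
[25] The French Drama Revolution: Political Economy and Literary Production, 1700-1900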
Thiago Dumont Oliveira
Main category: cs.CL
TL;DR: This paper uses Latent Dirichlet Allocation and Jensen-Shannon Divergence to analyze French drama from 1700–1900, revealing a major thematic shift after the French Revolution—especially 1789–1850—with rising bourgeois themes, and links these literary changes to GDP trends and broader political/economic transformations.
Details
Motivation: To understand how French drama evolved thematically over two centuries and whether its evolution co-occurred with major political (e.g., French Revolution) and economic (e.g., industrialization, GDP growth) developments.
Method: Applies Latent Dirichlet Allocation (LDA) for topic modeling and Jensen-Shannon Divergence to measure topical change over time; plots yearly topic prevalence alongside historical French GDP data (1700–1900).
Result: A profound shift in topical distribution occurred post-1789, especially 1789–1850; bourgeois themes rose markedly from the late 18th century onward; topic dynamics align temporally with political upheaval and economic growth.
Conclusion: French drama's thematic evolution between 1700 and 1900 was closely tied to macro-historical forces, particularly the French Revolution and industrialization, suggesting that literature coevolved with socioeconomic change.
Abstract: This paper investigates the changing nature of French drama between 1700 and 1900 using Latent Dirichlet Allocation and Jensen-Shannon Divergence. Results indicate that the topical distribution of French drama changed profoundly after the French Revolution, particularly between 1789 and 1850. Bourgeois themes have been among the most prevalent topics since the late 18th century. To assess the coevolution of drama and economic growth, I plot the yearly prevalence of topics alongside French GDP between 1700 and 1900, and discuss these changes in light of the political and economic changes prompted by the French Revolution and the industrialization of the country.
[26] Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling
Zhijie Huang,Stephen McIntosh,Daisuke Saito,Nobuaki Minematsu
Main category: cs.CL
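TL;DR: This paper presents Kanade, a single-layer disentangled speech tokenizer designed to efficiently extract phonetic and prosodic information from speech while suppressing linguistically irrelevant features such as speaker identity. Without relying on auxiliary methods, it achieves state-of-the-art speaker disentanglement and lexical availability while maintaining high-quality reconstruction.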
Details
Motivation: Speech modeling must handle continuous signals that mix linguistic and non-linguistic information, so it needs a speech tokenizer that extracts phonetics and prosody, suppresses irrelevant information such as speaker identity, and supports high-quality synthesis.
Method: Proposes Kanade, a single-layer disentangled speech tokenizer that separates out acoustic constants to produce a single token stream capturing rich phonetics and prosody, without the auxiliary methods that existing disentangled codecs often rely on.
Result: Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability while maintaining excellent reconstruction quality.
Conclusion: Kanade is an efficient, simple, and high-performing speech tokenizer that provides a better foundation for speech modeling.
Abstract: A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information like speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade separates out acoustic constants to create a single stream of tokens that captures rich phonetics and prosody. It does so without the need for auxiliary methods that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability, while maintaining excellent reconstruction quality.
[27] Hermes the Polyglot: A Unified Framework to Enhance Expressiveness for Multimodal Interlingual Subtitling
Chaoqun Cui,Shijing Wang,Liangbin Huang,Qingqing Gu,Zhaolong Huang,Xiao Zeng,Wenji Mao
Main category: cs.CL
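TL;DR: This paper proposes Hermes, an automated subtitle translation framework based on large language models (LLMs). Through three modules (speaker diarization, terminology identification, and expressiveness enhancement), it addresses key challenges in subtitle translation such as semantic coherence, pronoun and terminology translation, and expressiveness, achieving state-of-the-art diarization performance and high-quality translations.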
Details
Motivation: Interlingual subtitling is essential for entertainment localization but has not been explored in depth in machine translation; although LLMs have improved general translation ability, they struggle with challenges specific to subtitle text, such as semantic coherence, pronoun and terminology translation, and limited expressiveness.
Method: Proposes Hermes, an LLM-based automated subtitle translation framework with three core modules, Speaker Diarization, Terminology Identification, and Expressiveness Enhancement, which work together to address the difficulties specific to subtitle translation.
Result: Experiments show that Hermes achieves state-of-the-art performance on speaker diarization and produces expressive, contextually coherent translations.
Conclusion: Hermes advances research on interlingual subtitling and offers a new paradigm for specialized translation systems for audiovisual media.
Abstract: Interlingual subtitling, which translates subtitles of visual media into a target language, is essential for entertainment localization but has not yet been explored in machine translation. Although Large Language Models (LLMs) have significantly advanced the general capabilities of machine translation, the distinctive characteristics of subtitle texts pose persistent challenges in interlingual subtitling, particularly regarding semantic coherence, pronoun and terminology translation, and translation expressiveness. To address these issues, we present Hermes, an LLM-based automated subtitling framework. Hermes integrates three modules: Speaker Diarization, Terminology Identification, and Expressiveness Enhancement, which effectively tackle the above challenges. Experiments demonstrate that Hermes achieves state-of-the-art diarization performance and generates expressive, contextually coherent translations, thereby advancing research in interlingual subtitling.
[28] Lookahead-then-Verify: Reliable Constrained Decoding for Diffusion LLMs under Context-Free Grammars
Yitong Zhang,Yongmin Li,Yuetong Liu,Jia Li,Xiaoran Jia,Zherui Li,Ge Li
Main category: cs.CL
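TL;DR: This paper proposes LAVE, a constrained decoding method designed for diffusion large language models (dLLMs). It exploits the fact that dLLMs predict token distributions for all positions in parallel and performs lookahead verification at each generation step, ensuring that intermediate outputs can always be extended into grammatically valid sentences. This substantially improves syntactic correctness at minimal overhead.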
Details
Motivation: As probabilistic models, dLLMs struggle to reliably generate syntactically valid outputs that conform to a context-free grammar; existing constrained decoding techniques do not apply because dLLMs are non-autoregressive, and methods designed specifically for dLLMs may produce intermediate results that cannot be completed into valid sentences, which limits their reliability.
Method: Proposes LAVE, which uses the token distributions that dLLMs predict in parallel for all positions during each forward pass to run a lookahead check on every newly proposed token, ensuring that the intermediate sequence can still be extended into a valid sentence once the token is added.
Result: Experiments on four widely used dLLMs and three representative benchmarks show that LAVE consistently outperforms existing baselines and substantially improves syntactic correctness with negligible runtime overhead.
Conclusion: LAVE is an efficient, reliable, low-overhead constrained decoding scheme that addresses the problem of syntactically invalid dLLM outputs and provides key support for the practical use of dLLMs in formal language generation.
Abstract: Diffusion Large Language Models (dLLMs) have demonstrated promising generative capabilities and are increasingly used to produce formal languages defined by context-free grammars, such as source code and chemical expressions. However, as probabilistic models, they still struggle to generate syntactically valid outputs reliably. A natural and promising direction to address this issue is to adapt constrained decoding techniques to enforce grammatical correctness during generation. However, applying these techniques faces two primary obstacles. On the one hand, the non-autoregressive nature of dLLMs renders most existing constrained decoding approaches inapplicable. On the other hand, current approaches specifically designed for dLLMs may allow intermediate outputs that are impossible to complete into valid sentences, which significantly limits their reliability in practice. To address these challenges, we present LAVE, a constrained decoding approach specifically designed for dLLMs. Our approach leverages a key property of dLLMs, namely their ability to predict token distributions for all positions in parallel during each forward pass. Whenever a new token is proposed by the model, LAVE performs lookahead using these distributions to efficiently and reliably verify the validity of the proposed token. This design ensures reliable constraint enforcement by preserving the potential for intermediate outputs to be extended into valid sentences.
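The validity requirement described above, that a partially filled sequence must remain completable into a valid sentence, can be illustrated with a toy example. The sketch below is not LAVE itself: it assumes a fixed-length sequence, a two-token vocabulary, and the balanced-parentheses grammar, and it replaces the distribution-guided lookahead with an exhaustive feasibility check. The names `MASK`, `completable`, and `accept_proposal` are hypothetical.

```python
from typing import List, Optional

MASK = None  # hypothetical marker for a position that is not yet committed


def completable(seq: List[Optional[str]]) -> bool:
    """Return True if the masked slots can be filled with '(' or ')' so that
    the full fixed-length sequence is a balanced-parentheses string."""
    depths = {0}  # nesting depths reachable after the tokens seen so far
    for tok in seq:
        nxt = set()
        for d in depths:
            if tok in ("(", MASK):           # slot may open a parenthesis
                nxt.add(d + 1)
            if tok in (")", MASK) and d > 0:  # slot may close an open one
                nxt.add(d - 1)
        if not nxt:                           # no valid continuation remains
            return False
        depths = nxt
    return 0 in depths                        # everything must be closed


def accept_proposal(seq: List[Optional[str]], pos: int, token: str) -> bool:
    """Accept a proposed token only if the sequence stays completable."""
    trial = list(seq)
    trial[pos] = token
    return completable(trial)


partial = [MASK, MASK, ")", ")"]
print(accept_proposal(partial, 1, "("))  # position 0 can still be "("
print(accept_proposal(partial, 1, ")"))  # no filling of position 0 helps
```

In this toy setting the first proposal is accepted and the second is rejected, because committing `)` at position 1 leaves no way to balance the two closing tokens that follow. A real grammar would need a parser-based check rather than a depth counter.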
Extensive experiments across four widely used dLLMs and three representative benchmarks demonstrate that LAVE consistently outperforms existing baselines and achieves substantial improvements in syntactic correctness, while incurring negligible runtime overhead.
[29] Transformer-Based Model for Multilingual Hope Speech Detection
Nsrin Ashraf,Mariam Labib,Hamada Nayel
Main category: cs.CL
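TL;DR: This paper presents a system for hope speech detection in English and German, built on RoBERTa and XLM-RoBERTa and submitted to the PolyHope-M task at RANLP2025. RoBERTa achieves a weighted F1 score of 0.818 on English, and XLM-RoBERTa achieves a weighted F1 score of 0.786 in the English and German setting, supporting the view that pre-trained large models improve performance on NLP tasks.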
Details
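Motivation: To improve hope speech detection in multiple languages (particularly English and German) and to explore how effective different pre-trained language models are on this task.
Method: Implements and evaluates several Transformer models, with RoBERTa for English and XLM-RoBERTa for English and German; trains and tests on the PolyHope-M dataset, using weighted F1 score and accuracy as evaluation metrics.
Result: RoBERTa reaches a weighted F1 of 0.818 and an accuracy of 81.8% on English; XLM-RoBERTa reaches a weighted F1 of 0.786 and an accuracy of 78.5% in the combined English and German setting.
Conclusion: Both RoBERTa and XLM-RoBERTa effectively support hope speech detection, and the monolingual model (RoBERTa) slightly outperforms the multilingual one (XLM-RoBERTa), suggesting that pre-trained models optimized for a specific language have an advantage; the results underline the importance of improved pre-trained large language models for downstream NLP performance.
Abstract: This paper describes a system that has been submitted to the "PolyHope-M" shared task at RANLP2025. In this work, various transformers have been implemented and evaluated for hope speech detection in English and German. RoBERTa has been implemented for English, while the multilingual model XLM-RoBERTa has been implemented for both English and German. The proposed system using RoBERTa reported a weighted F1 score of 0.818 and an accuracy of 81.8% for English. On the other hand, XLM-RoBERTa achieved a weighted F1 score of 0.786 and an accuracy of 78.5%. These results reflect the importance of improving pre-trained large language models and show how these models enhance the performance of different natural language processing tasks.
[30] Jailbreaking LLMs via Calibration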
Yuxuan Lu,Yongkang Guo,Yuqing Kong
Main category: cs.CL
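TL;DR: This paper proposes a framework that models safety alignment as a systematic distortion of the pre-alignment distribution and treats weak-to-strong jailbreaking as a forecast aggregation problem. It derives an optimal aggregation strategy (Gradient Shift) in the loss-induced dual space, gives a unified account of methods such as logit arithmetic, and proposes new aggregation rules that perform better on attack success rate and on reducing the "Jailbreak Tax".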
Details
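Motivation: Safety alignment often creates a systematic discrepancy between model outputs and the original data distribution; this discrepancy needs to be modeled in order to improve jailbreak attack effectiveness and reduce the damage to model usefulness (the "Jailbreak Tax").
Method: Models safety alignment as a systematic distortion of the pre-alignment distribution; formalizes Weak-to-Strong Jailbreaking as a forecast aggregation problem; derives the optimal aggregation strategy (Gradient Shift) in the loss-induced dual space; generalizes logit arithmetic to aggregation rules under a range of proper losses and proposes a new hybrid aggregation rule.
Result: On red-teaming benchmarks and math tasks, the new method markedly raises attack success rates and lowers the Jailbreak Tax on frontier models, especially the safety-hardened gpt-oss-120b.
Conclusion: Safety alignment can be modeled uniformly as a distribution distortion, and jailbreaking is in essence constrained forecast aggregation; the Gradient Shift aggregation framework is better grounded theoretically and generalizes better than existing logit manipulation, balancing attack effectiveness with model utility.
Abstract: Safety alignment in Large Language Models (LLMs) often creates a systematic discrepancy between a model's aligned output and the underlying pre-aligned data distribution. We propose a framework in which the effect of safety alignment on next-token prediction is modeled as a systematic distortion of a pre-alignment distribution. We cast Weak-to-Strong Jailbreaking as a forecast aggregation problem and derive an optimal aggregation strategy characterized by a Gradient Shift in the loss-induced dual space. We show that logit-arithmetic jailbreaking methods are a special case of this framework under cross-entropy loss, and derive a broader family of aggregation rules corresponding to other proper losses. We also propose a new hybrid aggregation rule. Evaluations across red-teaming benchmarks and math utility tasks using frontier models demonstrate that our approach achieves superior Attack Success Rates and lower "Jailbreak Tax" compared with existing methods, especially on the safety-hardened gpt-oss-120b.
[31] Formal Semantic Control over Language Models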
Yingji Zhang
Main category: cs.CL
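TL;DR: This thesis proposes a VAE-based approach to semantic representation learning that aims to make the latent space of language models more semantically and geometrically interpretable and to enable localized, quasi-symbolic, compositional control over generation. It disentangles and manipulates semantic features at the sentence level (explanatory text generation) and inference behaviors at the reasoning level (explanatory natural language inference).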
Details
Motivation: To improve the interpretability, structure, and controllability of the internal semantic representations of language models, addressing the weaknesses of current black-box models in semantic understanding and precise control.
Method: Within a variational autoencoder (VAE) framework, develops two complementary approaches: (i) sentence level, disentangling and manipulating specific semantic features in the latent space to guide explanatory text generation; (ii) reasoning level, isolating and steering inference behaviors in the latent space to control explanatory NLI tasks (two premises yielding a conclusion).
Result: Introduces new theoretical frameworks and practical methods, with experiments indicating that they enhance the interpretability and controllability of latent spaces for natural language.
Conclusion: Through a carefully designed latent space geometry, semantic representations can be systematically interpreted, precisely modeled, and reliably manipulated, offering a feasible path toward interpretable and controllable language models.
Abstract: This thesis advances semantic representation learning to render language representations or models more semantically and geometrically interpretable, and to enable localised, quasi-symbolic, compositional control through deliberate shaping of their latent space geometry. We pursue this goal within a VAE framework, exploring two complementary research directions: (i) Sentence-level learning and control: disentangling and manipulating specific semantic features in the latent space to guide sentence generation, with explanatory text serving as the testbed; and (ii) Reasoning-level learning and control: isolating and steering inference behaviours in the latent space to control NLI. In this direction, we focus on Explanatory NLI tasks, in which two premises (explanations) are provided to infer a conclusion. The overarching objective is to move toward language models whose internal semantic representations can be systematically interpreted, precisely structured, and reliably directed. We introduce a set of novel theoretical frameworks and practical methodologies, together with corresponding experiments, to demonstrate that our approaches enhance both the interpretability and controllability of latent spaces for natural language across the thesis.
[32] LegalOne: A Family of Foundation Models for Reliable Legal Reasoning
Haitao Li,Yifan Chen,Shuo Miao,Qian Dong,Jia Chen,Yiran Hu,Junjie Chen,Minghao Qin,Qingyao Ai,Yiqun Liu,Cheng Luo,Quan Zhou,Ya Zhang,Jikun Hu
Main category: cs.CL
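TL;DR: This paper presents LegalOne, a family of foundation models designed for the Chinese legal domain. A three-phase training pipeline (mid-training, supervised fine-tuning, and curriculum reinforcement learning) strengthens legal reasoning, and the models reach state-of-the-art performance on a range of legal tasks.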
Details
Motivation: Direct application of large language models in the legal domain is limited by a lack of precise domain knowledge and by the difficulty of rigorous multi-step judicial reasoning.
Method: Proposes a three-phase training pipeline: (1) a mid-training phase that uses the perplexity-based Plasticity-Adjusted Sampling (PAS) strategy for domain adaptation; (2) a supervised fine-tuning phase that uses Legal Agentic CoT Distillation (LEAD) to distill structured, factually grounded, and logically rigorous reasoning trajectories from raw legal texts; (3) a curriculum reinforcement learning phase that progressively builds model ability through memorization, understanding, and reasoning stages.
Result: LegalOne reaches state-of-the-art performance across a wide range of legal tasks, surpassing general-purpose LLMs with far larger parameter counts and showing higher knowledge density and reasoning efficiency.
Conclusion: LegalOne's targeted training paradigm substantially improves reasoning in the Chinese legal domain, and its open model weights and the LegalKit evaluation framework are intended to advance trustworthy and interpretable legal AI.
Abstract: While Large Language Models (LLMs) have demonstrated impressive general capabilities, their direct application in the legal domain is often hindered by a lack of precise domain knowledge and the complexity of performing rigorous multi-step judicial reasoning. To address this gap, we present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. First, during the mid-training phase, we propose Plasticity-Adjusted Sampling (PAS) to address the challenge of domain adaptation. This perplexity-based scheduler strikes a balance between the acquisition of new knowledge and the retention of original capabilities, effectively establishing a robust legal foundation. Second, during supervised fine-tuning, we employ Legal Agentic CoT Distillation (LEAD) to distill explicit reasoning from raw legal texts. Unlike naive distillation, LEAD utilizes an agentic workflow to convert complex judicial processes into structured reasoning trajectories, thereby enforcing factual grounding and logical rigor. Finally, we implement a Curriculum Reinforcement Learning (RL) strategy. Through a progressive reinforcement process spanning memorization, understanding, and reasoning, LegalOne evolves from simple pattern matching to autonomous and reliable legal reasoning. Experimental results demonstrate that LegalOne achieves state-of-the-art performance across a wide range of legal tasks, surpassing general-purpose LLMs with vastly larger parameter counts through enhanced knowledge density and efficiency. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI, paving the way for deploying trustworthy and interpretable foundation models in high-stakes judicial applications.
[33] Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation
Lakshan Cooray,Deshan Sumanathilaka,Pattigadapa Venkatesh Raju
Main category: cs.CL
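TL;DR: This study explores instruction-tuned small language models (SLMs) for multi-turn customer-service question answering and proposes an evaluation approach based on history summarization and conversation stages. Some SLMs come close to large-model performance, but overall they remain limited in dialogue continuity and contextual alignment.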
Details
Motivation: Large Language Models (LLMs) perform strongly but are computationally expensive and hard to deploy; Small Language Models (SLMs) are more efficient, but their effectiveness in multi-turn customer-service QA, which requires dialogue continuity and contextual understanding, is still unclear.
Method: Uses a history summarization strategy to preserve conversational state, builds instruction-tuned low-parameter SLMs, and introduces a conversation-stage-based qualitative analysis framework; compares 9 SLMs with 3 commercial LLMs using lexical and semantic similarity metrics together with qualitative methods such as human evaluation and LLM-as-a-judge.
Result: Performance varies considerably across SLMs: some approach LLM level, while others struggle to maintain dialogue continuity and contextual alignment; the qualitative analysis reveals how each model behaves at different stages of a customer-service conversation.
Conclusion: Low-parameter language models show potential for real-world customer-service QA but currently have clear limitations in modeling complex dialogue, and their summarization mechanisms and stage awareness need further improvement.
Abstract: Customer-service question answering (QA) systems increasingly rely on conversational language understanding. While Large Language Models (LLMs) achieve strong performance, their high computational cost and deployment constraints limit practical use in resource-constrained environments. Small Language Models (SLMs) provide a more efficient alternative, yet their effectiveness for multi-turn customer-service QA remains underexplored, particularly in scenarios requiring dialogue continuity and contextual understanding. This study investigates instruction-tuned SLMs for context-summarized multi-turn customer-service QA, using a history summarization strategy to preserve essential conversational state. We also introduce a conversation stage-based qualitative analysis to evaluate model behavior across different phases of customer-service interactions. Nine instruction-tuned low-parameterized SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods. Results show notable variation across SLMs, with some models demonstrating near-LLM performance, while others struggle to maintain dialogue continuity and contextual alignment. These findings highlight both the potential and current limitations of low-parameterized language models for real-world customer-service QA systems.
[34] EchoReview: Learning Peer Review from the Echoes of Scientific Citations
Yinuo Zhang,Dingcheng Huang,Haifeng Suo,Yizhuo Li,Ziya Zhao,Junhao Xu,Zhiying Tu,Dianhui Chu,Deming Zhai,Xianming Liu,Xiaoyan Yu,Dianbo Sui
Main category: cs.CL
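TL;DR: This paper proposes the EchoReview framework, which mines implicit evaluative signals from academic citations to build EchoReview-16K, a large-scale, cross-conference, cross-year citation-driven review dataset, and trains the automated reviewer EchoReviewer-7B, which achieves significant and stable gains on core dimensions such as evidence support and review comprehensiveness.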
Details
Motivation: Traditional peer review systems face scalability pressure, and existing supervised fine-tuning methods based on real review data are constrained by a single data source and by the subjectivity and inconsistency of human reviews, which makes it hard to support high-quality automated reviewing.
Method: Proposes EchoReview, a citation-context-driven data synthesis framework that systematically mines implicit collective evaluative signals from academic citations and converts them into structured review-style data; builds EchoReview-16K, the first large-scale, cross-conference, cross-year citation-driven review dataset, and trains the 7B-parameter automated reviewer EchoReviewer-7B.
Result: EchoReviewer-7B achieves significant and stable improvements on core review dimensions such as evidence support and review comprehensiveness, validating citation context as a reliable data paradigm for automated peer review.
Conclusion: Citation context is a robust and effective data paradigm that can support reliable and scalable automated peer review systems.
Abstract: As the volume of scientific submissions continues to grow rapidly, traditional peer review systems are facing unprecedented scalability pressures, highlighting the urgent need for automated reviewing methods that are both scalable and reliable. Existing supervised fine-tuning approaches based on real review data are fundamentally constrained by reliance on a single data source as well as the inherent subjectivity and inconsistency of human reviews, limiting their ability to support high-quality automated reviewers. To address these issues, we propose EchoReview, a citation-context-driven data synthesis framework that systematically mines implicit collective evaluative signals from academic citations and transforms the scientific community's long-term judgments into structured review-style data. Based on this pipeline, we construct EchoReview-16K, the first large-scale, cross-conference, and cross-year citation-driven review dataset, and train an automated reviewer, EchoReviewer-7B. Experimental results demonstrate that EchoReviewer-7B can achieve significant and stable improvements on core review dimensions such as evidence support and review comprehensiveness, validating citation context as a robust and effective data paradigm for reliable automated peer review.
[35] ExperienceWeaver: Optimizing Small-sample Experience Learning for LLM-based Clinical Text Improvement
Ziyan Xiao,Yinghao Zhu,Liang Peng,Lequan Yu
Main category: cs.CL
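TL;DR: ExperienceWeaver is a new hierarchical framework that distills structured experience (error-specific tips and high-level strategies) from multi-dimensional feedback to improve LLM performance on small-sample clinical text improvement, clearly outperforming existing models.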
Details
Motivation: Clinical text improvement is vital for healthcare efficiency but is constrained by scarce high-quality data and the complex requirements of medical documentation; existing LLM approaches perform poorly in small-sample settings, since supervised fine-tuning is data-hungry and costly, while RAG often yields superficial corrections without the underlying reasoning.
Method: Proposes the ExperienceWeaver framework, which shifts the focus from data retrieval to experience learning; distills two kinds of structured knowledge, error-specific Tips and high-level Strategies, from noisy multi-dimensional feedback and injects them into an agentic reasoning pipeline so that the model learns "how to revise" rather than only "what to revise".
Result: Extensive experiments on four clinical datasets show that ExperienceWeaver consistently improves performance in small-sample settings, surpassing state-of-the-art models such as Gemini-3 Pro.
Conclusion: ExperienceWeaver demonstrates the effectiveness of structured experience distillation for small-sample clinical text improvement and offers a new paradigm for applying LLMs in low-resource specialist domains.
Abstract: Clinical text improvement is vital for healthcare efficiency but remains difficult due to limited high-quality data and the complex constraints of medical documentation. While Large Language Models (LLMs) show promise, current approaches struggle in small-sample settings: supervised fine-tuning is data-intensive and costly, while retrieval-augmented generation often provides superficial corrections without capturing the reasoning behind revisions. To address these limitations, we propose ExperienceWeaver, a hierarchical framework that shifts the focus from data retrieval to experience learning. Instead of simply recalling past examples, ExperienceWeaver distills noisy, multi-dimensional feedback into structured, actionable knowledge, specifically error-specific Tips and high-level Strategies. By injecting this distilled experience into an agentic pipeline, the model learns "how to revise" rather than just "what to revise". Extensive evaluations across four clinical datasets demonstrate that ExperienceWeaver consistently improves performance, surpassing state-of-the-art models such as Gemini-3 Pro in small-sample settings.
[36] CURP: Codebook-based Continuous User Representation for Personalized Generation with LLMs
Liang Wang,Xinyi Mou,Xiaoyou Liu,Xuanjing Huang,Zhongyu Wei
Main category: cs.CL
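TL;DR: This paper proposes CURP, a framework that uses a bidirectional user encoder and a discrete prototype codebook to extract multi-dimensional user traits, enabling efficient plug-and-play personalization with only about 20M trainable parameters (about 0.2% of the total model size). It outperforms baselines on generation tasks while remaining interpretable and scalable.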
Details
Motivation: Existing prompt-based and training-based user modeling methods struggle to balance personalization quality against computational and data efficiency.
Method: The authors propose CURP, which uses a bidirectional user encoder and a discrete prototype codebook to extract multi-dimensional user traits, supporting plug-and-play personalization with a small number of trainable parameters (about 20M).
Result: Across a variety of generation tasks, CURP shows better performance and generalization than strong baselines, along with improved interpretability and scalability.
Conclusion: CURP is an efficient, lightweight, interpretable, and scalable approach to user modeling that substantially eases the trade-off between personalization and efficiency.
Abstract: User modeling characterizes individuals through their preferences and behavioral patterns to enable personalized simulation and generation with Large Language Models (LLMs). However, existing methods, whether prompt-based or training-based, face challenges in balancing personalization quality against computational and data efficiency. We propose a novel framework, CURP, which employs a bidirectional user encoder and a discrete prototype codebook to extract multi-dimensional user traits. This design enables plug-and-play personalization with a small number of trainable parameters (about 20M parameters, about 0.2% of the total model size). Through extensive experiments on a variety of generation tasks, we show that CURP achieves superior performance and generalization compared to strong baselines, while offering better interpretability and scalability. The code is available at https://github.com/RaidonWong/CURP_code
[37] Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
Shengrui Li,Fei Zhao,Kaiyan Zhao,Jieying Ye,Haifeng Liu,Fangcheng Shi,Zheyong Xie,Yao Hu,Shaosheng Cao
Main category: cs.CL
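TL;DR: This paper proposes DeMix, a framework that predicts the optimal data mixture ratio through model merging, decoupling search cost from training cost and improving pre-training data mixtures while reducing search overhead.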
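The merging idea behind DeMix can be illustrated with a minimal sketch. It assumes each component model is a flat dictionary of parameter lists and that a mixture proxy is a convex combination of component weights; `merge_models` and the toy parameters are hypothetical names for illustration, and the paper's actual merging procedure may differ.

```python
def merge_models(components, ratios):
    """Weighted average of parameter dicts; ratios are normalized to sum to 1."""
    if len(components) != len(ratios):
        raise ValueError("need one ratio per component model")
    total = sum(ratios)
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    weights = [r / total for r in ratios]
    merged = {}
    for name in components[0]:
        size = len(components[0][name])
        merged[name] = [
            sum(w * model[name][i] for w, model in zip(weights, components))
            for i in range(size)
        ]
    return merged


# Toy component models (hypothetical): one trained on math data, one on code data.
math_model = {"layer.w": [1.0, 2.0], "layer.b": [0.0]}
code_model = {"layer.w": [3.0, 6.0], "layer.b": [1.0]}

# Proxy for a 1:3 math-to-code mixture, obtained without any extra training run.
proxy = merge_models([math_model, code_model], ratios=[1, 3])
print(proxy)
```

Because merging is cheap compared with training, many candidate ratios can be scored this way, which is the sense in which search is decoupled from training cost.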
Details
Motivation: Existing data mixing methods for LLM pre-training rely either on unreliable small-scale proxy experiments or on prohibitively expensive large-scale exploration, making it hard to find an optimal mixture efficiently.
Method: DeMix first trains several component models at scale on candidate datasets, then produces proxy models for different mixture ratios through weighted model merging, avoiding a separate training run for each sampled mixture.
Result: DeMix achieves higher performance on multiple benchmarks at significantly lower search cost. The authors also release the 22T-token DeMix Corpora dataset and open-source code.
Conclusion: DeMix breaks the trade-off among sufficiency, accuracy, and efficiency in data mixture search, providing a more efficient and scalable paradigm for optimizing data mixtures in LLM pre-training.
Abstract: Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora are available at https://github.com/Lucius-lsr/DeMix.
[38] Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Case Study from Retrospective Forecasting
Ali El Lahib,Ying-Jieh Xia,Zehan Li,Yuxuan Wang,Xinyu Pi
Main category: cs.CL
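TL;DR: This paper shows that search-engine date filters are unreliable for retrospective evaluation of search-augmented forecasters because they often cause temporal leakage, and recommends stronger retrieval safeguards or evaluation on frozen, time-stamped web snapshots.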
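This entry compares forecasts using the Brier score, the mean squared difference between predicted probabilities and binary outcomes (lower is better). The sketch below shows the computation on made-up numbers; the forecasts are illustrative only and are not the paper's data.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical resolved questions (1 = event happened, 0 = it did not).
outcomes = [1, 0, 1, 0]

# Hypothetical forecasts: documents that leak the answer allow near-certain predictions.
leaky_forecasts = [0.9, 0.1, 0.8, 0.2]
clean_forecasts = [0.6, 0.4, 0.5, 0.5]

leaky_score = brier_score(leaky_forecasts, outcomes)
clean_score = brier_score(clean_forecasts, outcomes)
print(f"leaky: {leaky_score:.3f}, leak-free: {clean_score:.3f}")
```

A lower score under leaky retrieval, as in this toy case, is the pattern the paper describes as inflated accuracy.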
Details
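Motivation: Search-engine date filters (such as before:) are widely used in retrospective evaluations, but whether they actually guarantee temporal consistency has not been systematically tested.
Method: The authors audit the behavior of Google Search's before: filter across many questions and measure the rate of temporal leakage; compare the forecasting performance (Brier score) of gpt-oss-120b on leaky versus leak-free documents; and analyze common leakage mechanisms (updated articles, related-content modules, metadata errors, absence-based signals).
Result: For 71% of questions, at least one returned page contains strong temporal leakage, and for 41% at least one page directly reveals the answer. The Brier score is 0.108 with leaky documents versus 0.242 with leak-free documents (lower is better), confirming that evaluation results are substantially distorted.
Conclusion: Retrieval that relies on date filtering alone cannot meet the requirements of time-sensitive evaluation; frozen web snapshots or stronger retrieval isolation should be used to keep evaluations credible.
Abstract: Search-engine date filters are widely used to enforce pre-cutoff retrieval in retrospective evaluations of search-augmented forecasters. We show this approach is unreliable: auditing Google Search with a before: filter, 71% of questions return at least one page containing strong post-cutoff leakage, and for 41%, at least one page directly reveals the answer. Using a large language model (LLM), gpt-oss-120b, to forecast with these leaky documents, we demonstrate an inflated prediction accuracy (Brier score 0.108 vs. 0.242 with leak-free documents). We characterize common leakage mechanisms, including updated articles, related-content modules, unreliable metadata/timestamps, and absence-based signals, and argue that date-restricted search is insufficient for temporal evaluation. We recommend stronger retrieval safeguards or evaluation on frozen, time-stamped web snapshots to ensure credible retrospective forecasting.
[39] Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning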
Zhipeng Chen,Xiaobo Qin,Wayne Xin Zhao,Youbin Wu,Ji-Rong Wen
Main category: cs.CL
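TL;DR: This paper proposes A²D, which uses adaptive ability decomposition to strengthen LLM reasoning under RLVR, providing sub-question guidance without a teacher model and improving exploration and exploitation.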
Details
Motivation: Existing RLVR methods provide limited information, so models explore largely blindly and perform poorly on hard problems; additional guidance is needed that does not depend on a teacher model.
Method: A²D first trains a decomposer via RLVR without distillation so that it can break complex questions into simpler sub-questions. The decomposer then annotates sub-questions for the training set, and the reasoner is trained under RLVR with sub-question guidance.
Result: A²D improves the effectiveness of RLVR and works as a plug-and-play module that can be applied to different RLVR algorithms. The analysis examines which type of guidance is better suited to enhancing the reasoner's exploration and exploitation.
Conclusion: A²D is a general, teacher-free enhancement method based on adaptive decomposition that offers a new way to improve LLM reasoning under RLVR.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown great potential to enhance the reasoning ability of large language models (LLMs). However, due to the limited amount of information provided during the RLVR process, the model can only engage in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A$^2$D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A$^2$D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we conduct an analysis of the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner's exploration and exploitation abilities.
[40] APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards
Kaiyan Chang,Chenwei Zhu,Yingfeng Luo,Yifu Huo,Chenglong Wang,Xiaoqian Liu,Qiaozhi He,Tong Xiao,Zhengtao Yu,Jingbo Zhu
Main category: cs.CL
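TL;DR: This paper proposes Anchor-based Process Reward (APR), which identifies the Reasoning Anchor (the position where the answer first stabilizes) and penalizes the meaningless repeated verification that follows it (the Answer-Stable Tail, AST). This mitigates overthinking in large reasoning models, improving performance while substantially reducing computational cost.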
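A minimal sketch of the anchor idea, under stated assumptions: each reasoning step is annotated with the model's current candidate answer (`None` before any answer appears), the anchor is the first step from which that answer never changes again, and the penalty grows linearly with the tokens generated after the anchor. The function names, the linear form, and the `weight` value are hypothetical; the abstract does not give the exact reward formula.

```python
def find_reasoning_anchor(step_answers):
    """Index of the first step from which the answer stays equal to the final one."""
    if not step_answers or step_answers[-1] is None:
        return None  # no final answer, so no anchor
    final = step_answers[-1]
    anchor = len(step_answers) - 1
    # Walk backwards while the preceding step already held the final answer.
    while anchor > 0 and step_answers[anchor - 1] == final:
        anchor -= 1
    return anchor


def tail_penalty(step_lengths, anchor, weight=0.001):
    """Negative reward proportional to the tokens spent after the anchor step."""
    if anchor is None:
        return 0.0
    tail_tokens = sum(step_lengths[anchor + 1:])
    return -weight * tail_tokens


# Hypothetical trace: the answer changes once, then is re-verified twice unchanged.
answers = [None, "12", "15", "15", "15"]
lengths = [40, 30, 50, 60, 80]  # tokens per step

anchor = find_reasoning_anchor(answers)
penalty = tail_penalty(lengths, anchor)
print(anchor, penalty)
```

Only the steps after the anchor are penalized here, so reasoning that leads up to the stable answer is left untouched, which matches the "penalizes exclusively the post-anchor AST" description.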
Details
Motivation: Test-Time Scaling (TTS) has improved Large Reasoning Models (LRMs) but introduces overthinking. The authors observe that models keep repeating self-verification without revision after reaching the final answer, and revisit the phenomenon from a fine-grained perspective.
Method: The Reasoning Anchor is defined as the position where the answer first stabilizes, and the structural redundancy that follows it is termed the Answer-Stable Tail (AST). On this basis the authors propose Anchor-based Process Reward (APR), a structure-aware reward shaping method that locates the anchor and penalizes only the AST, trained with a policy optimization algorithm suited to length penalties.
Result: At 1.5B and 7B scales, APR models reach the performance-efficiency Pareto frontier averaged across five mathematical reasoning datasets, while requiring significantly fewer computational resources for RL training.
Conclusion: By locating the anchor and penalizing redundant reasoning in a targeted way, APR mitigates overthinking and greatly improves reasoning efficiency while maintaining or improving performance, supporting structure-aware reward design for reasoning optimization.
Abstract: Test-Time Scaling (TTS) has significantly enhanced the capabilities of Large Reasoning Models (LRMs) but introduces a critical side-effect known as Overthinking. We conduct a preliminary study to rethink this phenomenon from a fine-grained perspective. We observe that LRMs frequently conduct repetitive self-verification without revision even after obtaining the final answer during the reasoning process. We formally define this specific position where the answer first stabilizes as the Reasoning Anchor. By analyzing pre- and post-anchor reasoning behaviors, we uncover the structural redundancy fixed in LRMs: the meaningless repetitive verification after deriving the first complete answer, which we term the Answer-Stable Tail (AST). Motivated by this observation, we propose Anchor-based Process Reward (APR), a structure-aware reward shaping method that localizes the reasoning anchor and penalizes exclusively the post-anchor AST. Leveraging a policy optimization algorithm suitable for length penalties, our APR models achieve the performance-efficiency Pareto frontier at 1.5B and 7B scales averaged across five mathematical reasoning datasets while requiring significantly fewer computational resources for RL training.
[41] WordCraft: Scaffolding the Keyword Method for L2 Vocabulary Learning with Multimodal LLMs
Yuheng Shao,Junjie Xiong,Chaoran Wu,Xiyuan Wang,Ziyu Zhou,Yang Ouyang,Qinyi Tao,Quan Li
Main category: cs.CL
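TL;DR: This paper presents WordCraft, an interactive tool powered by multimodal large language models (MLLMs) that helps L1 Chinese learners of English apply the keyword method to vocabulary memorization by guiding keyword selection, association construction, and image formation, with high effectiveness and usability.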
Details
Motivation: When using the keyword method, L1 Chinese learners of English often struggle to generate phonologically appropriate keywords, build coherent associations, and form vivid mental imagery. Existing automated approaches either reduce learner engagement or lack process-oriented guidance.
Method: A formative study with 18 L1 Chinese learners of English and educators identifies key difficulties and needs. Based on it, the authors design and implement WordCraft, a learner-centered interactive tool powered by multimodal LLMs that scaffolds three stages: keyword selection, association construction, and image formation.
Result: Two user studies show that WordCraft preserves the generation effect while achieving high levels of effectiveness and usability.
Conclusion: Through process-oriented multimodal interaction, WordCraft addresses the shortcomings of the traditional keyword method and of existing technical solutions, offering a scalable, usable, and cognitively grounded approach to L2 vocabulary learning.
Abstract: Applying the keyword method for vocabulary memorization remains a significant challenge for L1 Chinese-L2 English learners. They frequently struggle to generate phonologically appropriate keywords, construct coherent associations, and create vivid mental imagery to aid long-term retention. Existing approaches, including fully automated keyword generation and outcome-oriented mnemonic aids, either compromise learner engagement or lack adequate process-oriented guidance. To address these limitations, we conducted a formative study with L1 Chinese-L2 English learners and educators (N=18), which revealed key difficulties and requirements in applying the keyword method to vocabulary learning. Building on these insights, we introduce WordCraft, a learner-centered interactive tool powered by Multimodal Large Language Models (MLLMs). WordCraft scaffolds the keyword method by guiding learners through keyword selection, association construction, and image formation, thereby enhancing the effectiveness of vocabulary memorization. Two user studies demonstrate that WordCraft not only preserves the generation effect but also achieves high levels of effectiveness and usability.
[42] Eliciting Trustworthiness Priors of Large Language Models via Economic Games
Siyu Yan,Lusha Zhu,Jian-Qiao Zhu
Main category: cs.CL
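TL;DR: This paper proposes a new method based on iterated in-context learning that uses the Trust Game from behavioral game theory to quantify the trustworthiness priors of large language models (LLMs). It finds that GPT-4.1's trust patterns closely match those of humans and are well predicted by a warmth-competence stereotype model.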
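For readers unfamiliar with the Trust Game, the sketch below computes payoffs in the standard investment-game form: the investor sends part of an endowment, the amount is multiplied, and the trustee decides how much to return. The multiplier of 3 and the endowment of 10 are conventional values assumed here for illustration; the abstract does not state the parameters used in the paper.

```python
def trust_game_payoffs(endowment, sent, returned, multiplier=3):
    """Return (investor_payoff, trustee_payoff) for one round of the Trust Game."""
    if not 0 <= sent <= endowment:
        raise ValueError("sent must be between 0 and the endowment")
    received = sent * multiplier  # the transfer grows before reaching the trustee
    if not 0 <= returned <= received:
        raise ValueError("returned must be between 0 and the amount received")
    investor_payoff = endowment - sent + returned
    trustee_payoff = received - returned
    return investor_payoff, trustee_payoff


# Hypothetical round: the investor sends half, the trustee returns 7 of the 15 received.
payoffs = trust_game_payoffs(endowment=10, sent=5, returned=7)
print(payoffs)
```

Sending money only pays off if the trustee returns enough, which is why the amount sent is read as a behavioral measure of trust instead of a self-reported attitude.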
Details
Motivation: Building human-centered, trustworthy AI requires calibrated trust (neither over-reliance on AI nor under-reliance), yet there is no established way to characterize the level of trust an AI system itself exhibits.
Method: The authors propose a method for eliciting trustworthiness priors based on iterated in-context learning, using the Trust Game as the behavioral paradigm and inferring an LLM's underlying trust tendencies from its responses to different personas in the game.
Result: GPT-4.1's trustworthiness priors closely track those observed in humans; its trust responses differ systematically across player personas; and this variation is well predicted by a stereotype-based model grounded in perceived warmth and competence.
Conclusion: LLMs have quantifiable, human-like trustworthiness priors that are shaped by social-cognitive dimensions such as warmth and competence, which offers a new route to understanding AI social reasoning and building more trustworthy AI.
Abstract: One critical aspect of building human-centered, trustworthy artificial intelligence (AI) systems is maintaining calibrated trust: appropriate reliance on AI systems outperforms both overtrust (e.g., automation bias) and undertrust (e.g., disuse). A fundamental challenge, however, is how to characterize the level of trust exhibited by an AI system itself. Here, we propose a novel elicitation method based on iterated in-context learning (Zhu and Griffiths, 2024a) and apply it to elicit trustworthiness priors using the Trust Game from behavioral game theory. The Trust Game is particularly well suited for this purpose because it operationalizes trust as voluntary exposure to risk based on beliefs about another agent, rather than self-reported attitudes. Using our method, we elicit trustworthiness priors from several leading large language models (LLMs) and find that GPT-4.1's trustworthiness priors closely track those observed in humans. Building on this result, we further examine how GPT-4.1 responds to different player personas in the Trust Game, providing an initial characterization of how such models differentiate trust across agent characteristics. Finally, we show that variation in elicited trustworthiness can be well predicted by a stereotype-based model grounded in perceived warmth and competence.
[43] Reasoning as State Transition: A Representational Analysis of Reasoning Evolution in Large Language Models
Siyuan Zhang,Jialian Li,Yichi Zhang,Xiao Yang,Yinpeng Dong,Hang Su
Main category: cs.CL
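TL;DR: This paper takes a representational perspective on how the internal states of LLMs evolve during reasoning. It finds that reasoning involves a continuous distributional shift in representations during generation, and that post-training improves reasoning mainly by steering this shift, not by improving the quality of the initial representations.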
Details
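Motivation: Prior work mostly analyzes the evolution of reasoning ability through generation outcomes, treating the reasoning process as a black box and obscuring internal changes. This paper aims to open the black box from a representational angle and understand the dynamics of internal states during reasoning.
Method: Systematic experiments on models at different training stages analyze both static initial representation quality and the evolution of representation distributions during generation. Statistical analysis and counterfactual experiments probe the relationship between representation changes and generation correctness, and what drives those changes.
Result: (1) Post-training yields only limited improvement in static initial representation quality. (2) Reasoning tasks involve a significant, continuous distributional shift in representations. (3) Post-training drives this shift toward a distribution better suited to solving the task. (4) Generation correctness correlates strongly with the final representations, and the semantics of the generated tokens (not additional computation or parameter differences) are the dominant driver of the shift.
Conclusion: Improved reasoning comes mainly from post-training optimizing the trajectory along which representations evolve during generation, not from static improvement of initial representations; this offers a new lens for understanding and optimizing LLM reasoning.
Abstract: Large Language Models have achieved remarkable performance on reasoning tasks, motivating research into how this ability evolves during training. Prior work has primarily analyzed this evolution via explicit generation outcomes, treating the reasoning process as a black box and obscuring internal changes. To address this opacity, we introduce a representational perspective to investigate the dynamics of the model's internal states. Through comprehensive experiments across models at various training stages, we discover that post-training yields only limited improvement in static initial representation quality. Furthermore, we reveal that, distinct from non-reasoning tasks, reasoning involves a significant continuous distributional shift in representations during generation. Comparative analysis indicates that post-training empowers models to drive this transition toward a better distribution for task solving. To clarify the relationship between internal states and external outputs, statistical analysis confirms a high correlation between generation correctness and the final representations, while counterfactual experiments identify the semantics of the generated tokens, rather than additional computation during inference or intrinsic parameter differences, as the dominant driver of the transition. Collectively, we offer a novel understanding of the reasoning process and the effect of training on reasoning enhancement, providing valuable insights for future model analysis and optimization.
[44] HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference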
Xuan Ai,Qingqing Yang,Peng Wang,Lei Deng,Lin Zhang,Renhai Chen,Gong Zhang
Main category: cs.CL
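TL;DR: This paper proposes HyLRA, a framework that uses layer-wise sparsity profiling and a dynamic programming policy to keep full attention in sensitive layers and reuse the preceding layer's critical-token indices in tolerant layers, substantially improving long-context inference throughput with almost no loss in accuracy.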
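The index-reuse idea can be sketched for a single query vector in plain Python. It assumes a "full" layer computes scores over all keys and records its top-k indices, while a "reuse" layer attends only to those recorded positions; the function names and toy vectors are hypothetical, and the real system operates on batched tensors with a per-layer policy found offline.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def full_attention(query, keys, values, k):
    """Dense attention over all keys; also returns the top-k key indices."""
    scores = [dot(query, key) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return output, sorted(ranked[:k])


def reuse_attention(query, keys, values, indices):
    """Sparse attention restricted to indices selected by a preceding layer."""
    scores = [dot(query, keys[i]) for i in indices]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices)) for d in range(dim)]


# Toy example with four cached tokens and 2-dimensional keys.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
query = [2.0, 1.0]

dense_out, top_indices = full_attention(query, keys, values, k=2)   # sensitive layer
sparse_out = reuse_attention(query, keys, values, top_indices)      # tolerant layer
print(top_indices, dense_out, sparse_out)
```

The reuse layer touches only k keys instead of all of them, which is where the savings come from when consecutive layers agree on which tokens matter.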
Details
Motivation: Existing sparse attention mechanisms rely on rigid patterns or aggressive pruning and fail to balance efficiency and accuracy well; long-context inference is bottlenecked by the quadratic complexity of attention and the large memory footprint of KV caches.
Method: HyLRA (Hybrid Layer Reuse Attention) builds on two empirical findings, intra-layer sensitivity and inter-layer similarity, and uses offline dynamic programming to derive a layer-wise policy: sensitive layers keep full attention, while tolerant layers reuse the top-k token indices from preceding layers so that computation concentrates on critical tokens.
Result: Across multiple benchmarks, HyLRA improves inference throughput by 6% to 46% with under 1% accuracy degradation, consistently outperforming state-of-the-art sparse attention methods.
Conclusion: With a data-driven, layer-adaptive policy, HyLRA eases the compute and memory bottlenecks of long-context LLM inference and achieves a better trade-off between efficiency and accuracy.
Abstract: Long-context inference in Large Language Models (LLMs) is bottlenecked by the quadratic computation complexity of attention and the substantial memory footprint of Key-Value (KV) caches. While existing sparse attention mechanisms attempt to mitigate this by exploiting inherent sparsity, they often rely on rigid patterns or aggressive pruning, failing to achieve an optimal balance between efficiency and accuracy. In this paper, we introduce HyLRA (Hybrid Layer Reuse Attention), a novel framework driven by layer-wise sparsity profiling. Our empirical analysis uncovers a dual characteristic in attention mechanics: intra-layer sensitivity, where specific layers necessitate full attention to prevent feature distortion, and inter-layer similarity, where consecutive layers share substantial critical tokens. Based on these observations, HyLRA employs an offline dynamic programming approach to derive an optimal layer-wise policy. This hybrid strategy retains full attention for sensitive layers to ensure robustness, while enabling tolerant layers to bypass quadratic calculations by directly reusing top-$k$ indices from preceding layers. This approach allows LLMs to restrict computation to the most critical tokens, effectively overcoming the quadratic bottleneck of dense attention. Extensive evaluations demonstrate that HyLRA improves inference throughput by 6% to 46% while maintaining comparable performance (under 1% accuracy degradation), consistently outperforming state-of-the-art sparse attention methods. HyLRA is open source at https://anonymous.4open.science/r/unified-cache-management-CF80/
[45] Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis
Zicheng Kong,Dehua Ma,Zhenbo Xu,Alven Yang,Yiwei Ru,Haoran Wang,Zixuan Zhou,Fuqing Bie,Liuyu Xiang,Huijia Wu,Jian Zhao,Zhaofeng He
Main category: cs.CL
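TL;DR: This paper introduces Omni-RRM, the first open-source rubric-grounded multimodal reward model, covering text, image, video, and audio and producing structured, multi-dimensional preference judgments with justifications. Its training data, Omni-Preference, is built by fully automatic synthesis and teacher-model reasoning with no human annotation. Two-stage training (SFT followed by GRPO) achieves state-of-the-art results on video and audio benchmarks and improves downstream performance.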
Details
Motivation: Existing reward models are vision-centric, output opaque scalar scores, depend on costly human annotation, and cannot support the fine-grained alignment that multimodal LLMs need.
Method: A fully automated pipeline generates Omni-Preference, a large-scale multimodal preference dataset with modality-aware rubric-grounded rationales. Training has two stages: supervised fine-tuning to learn structured rubric-grounded outputs, followed by GRPO reinforcement learning to sharpen discrimination on hard examples.
Result: Omni-RRM reaches 80.2% accuracy on the ShareGPT-V video benchmark and 66.8% on the Audio-HH-RLHF audio benchmark, substantially outperforms existing open-source reward models on image tasks, and gains 17.7% absolute overall accuracy over its base model. It also supports Best-of-N selection and transfers to text-only preference tasks.
Conclusion: Omni-RRM offers an interpretable, structured reward modeling paradigm for multimodal alignment that needs no human annotation, advancing open, efficient, multimodal RLHF.
Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce Omni-RRM, the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with dimension-wise justifications across text, image, video, and audio. At the core of our approach is Omni-Preference, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to reconcile and filter preferences while providing a modality-aware rubric-grounded rationale for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-$N$ selection and transfers to text-only preference benchmarks. Our data, code, and models are available at https://anonymous.4open.science/r/Omni-RRM-CC08.
[46] Factuality on Demand: Controlling the Factuality-Informativeness Trade-off in Text Generation
Ziwei Gong,Yanda Chen,Julia Hirschberg,Chen Zhao,He He,Zhou Yu,Kathleen Mckeown
Main category: cs.CL
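TL;DR: This paper proposes Factuality-Controlled Generation (FCG), a framework that lets users specify factuality constraints in their queries to trade off informativeness against factual accuracy, and shows that training on synthetic data significantly improves both adherence to those constraints and the informativeness of responses.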
Details
Motivation: LLMs face an inherent trade-off between informativeness and factual accuracy when answering queries, and different applications need different balances, so a controllable factuality mechanism is required.
Method: The authors propose the Factuality-Controlled Generation (FCG) framework, define a two-dimensional evaluation (adherence to factuality constraints and response informativeness), and train models on synthetic data.
Result: Training on synthetic data significantly improves models' ability to satisfy specified factuality constraints while keeping responses informative.
Conclusion: FCG provides a new paradigm for controllable text generation and shows that synthetic data can help jointly optimize factuality and informativeness.
Abstract: Large language models (LLMs) encode knowledge with varying degrees of confidence. When responding to queries, models face an inherent trade-off: they can generate responses that are less informative but highly factual, or more informative but potentially less accurate. Different applications demand different balances between informativeness and factuality. We introduce Factuality-Controlled Generation (FCG), a framework that enables users to specify factuality constraints alongside their queries. We propose to evaluate FCG performance on two dimensions: adherence to factuality constraints and response informativeness. We propose to train models on the FCG task using synthetic data, and show that our synthetic training significantly improves models' ability to both respect factuality requirements and maintain informativeness in their outputs.
[47] Unifying Adversarial Robustness and Training Across Text Scoring Models
Manveer Singh Tamber,Hosna Oyarhoseini,Jimmy Lin
Main category: cs.CL
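TL;DR: This paper proposes a unified study of adversarial robustness in text scoring models (dense retrievers, rerankers, and reward models), shows that current adversarial training methods generalize poorly, and introduces several complementary adversarial training methods that improve robustness together with task effectiveness. In RLHF in particular, they mitigate reward hacking and improve LLM alignment.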
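The abstract gives a directly testable success criterion: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. A minimal sketch of the resulting attack success rate follows; the function name and the score pairs are hypothetical and not taken from the paper.

```python
def attack_success_rate(score_pairs):
    """Fraction of pairs where the adversarial text outscores the legitimate one.

    Each pair is (score of relevant/chosen text, score of adversarial text).
    """
    if not score_pairs:
        raise ValueError("need at least one score pair")
    successes = sum(1 for good, adversarial in score_pairs if adversarial > good)
    return successes / len(score_pairs)


# Hypothetical scores from a reranker or reward model.
pairs = [(0.82, 0.40), (0.55, 0.91), (0.70, 0.69), (0.30, 0.35)]
rate = attack_success_rate(pairs)
print(rate)
```

Because the same comparison applies to retrievers, rerankers, and reward models, a single metric like this is what makes a shared evaluation across model roles possible.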
Details
Motivation: Research on adversarial robustness in language models is scattered across applications and attack types, which hides shared vulnerabilities. Failures of text scoring models can be tested directly and quantitatively, making them a good basis for a unified robustness framework.
Method: The authors adapt and improve adversarial attacks and adversarial training methods for text scoring models, propose several complementary adversarial training strategies, and validate them jointly on dense retrieval, reranking, and reward modeling.
Result: Combining complementary adversarial training methods yields strong robustness across attack types while preserving or improving task effectiveness. In RLHF, adversarially trained reward models mitigate reward hacking and support training better-aligned LLMs.
Conclusion: A unified study of adversarial robustness in text scoring models has both theoretical and practical value; achieving robustness and effectiveness together calls for combining multiple adversarial training strategies, an approach that extends to key downstream uses such as RLHF.
Abstract: Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models spanning dense retrievers, rerankers, and reward models. This motivates adapting both attacks and adversarial training methods across model roles. Unlike open-ended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Using this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to effectively generalize across attacks. To address this, we introduce multiple adversarial training methods for text scoring models and show that combining complementary training methods can yield strong robustness while also improving task effectiveness. We also highlight the practical value of our approach for RLHF, showing that our adversarially trained reward models mitigate reward hacking and support the training of better-aligned LLMs. We provide our code and models for further study.
[48] ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople
Shounak Paul,Raghav Dogra,Pawan Goyal,Saptarshi Ghosh
Main category: cs.CL
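TL;DR: This paper presents the ILSIC dataset for studying how court judgment text and laypeople's queries differ in the Legal Statute Identification (LSI) task. Experiments with several methods show that models trained only on court data perform poorly on laypeople's queries, while transfer learning from court data to laypeople data helps in some cases.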
Details
Motivation: LSI has mostly used court judgments as input queries, but real users are largely non-professionals whose questions are more informal, and the differences between court data and laypeople data for LSI have not been studied systematically.
Method: The authors build ILSIC, a corpus covering 500+ Indian statutes that contains laypeople queries together with court case judgments, and run experiments on it with zero-shot and few-shot inference, retrieval-augmented generation (RAG), and supervised fine-tuning, followed by fine-grained analysis by query category and statute frequency.
Result: Models trained only on court judgments are ineffective on the laypeople query test set; transfer learning from court data to laypeople data can help in certain scenarios. The fine-grained analysis breaks results down by query category and statute frequency.
Conclusion: Court and laypeople queries differ markedly in distribution for LSI, so datasets aimed at laypeople need to be built and used; transfer learning is one useful strategy for improving generalization.
Abstract: Legal Statute Identification (LSI) for a given situation is one of the most fundamental tasks in Legal NLP. This task has traditionally been modeled using facts from court judgments as input queries, due to their abundance. However, in practical settings, the input queries are likely to be informal and asked by laypersons, or non-professionals. While a few laypeople LSI datasets exist, there has been little research to explore the differences between court and laypeople data for LSI. In this work, we create ILSIC, a corpus of laypeople queries covering 500+ statutes from Indian law. Additionally, the corpus also contains court case judgements to enable researchers to effectively compare between court and laypeople data for LSI. We conducted extensive experiments on our corpus, including benchmarking over the laypeople dataset using zero and few-shot inference, retrieval-augmented generation and supervised fine-tuning. We observe that models trained purely on court judgements are ineffective during test on laypeople queries, while transfer learning from court to laypeople data can be beneficial in certain scenarios. We also conducted fine-grained analyses of our results in terms of categories of queries and frequency of statutes.
[49] EffGen: Enabling Small Language Models as Capable Autonomous Agents
Gaurav Srivastava,Aafiya Hussain,Chi Wang,Yingyan Celine Lin,Xuan Wang
Main category: cs.CL
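TL;DR: effGen is an open-source agentic framework designed for small language models (SLMs) that supports efficient, secure local deployment. Using prompt optimization, intelligent task decomposition, complexity-based routing, and a unified memory system, it outperforms existing frameworks on multiple benchmarks.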
Details
Motivation: Existing language model agent systems built on LLM APIs suffer from high token costs and privacy concerns, which limits them in sensitive applications; an alternative is needed that targets small language models, runs locally, and balances efficiency with security.
Method: effGen combines four core techniques: (1) enhanced tool calling with prompt optimization that compresses contexts by 70-80%; (2) dependency-aware intelligent task decomposition; (3) pre-execution routing based on five complexity factors; and (4) a unified memory system combining short-term, long-term, and vector-based storage. It also unifies several agent protocols (MCP, A2A, ACP).
Result: On 13 benchmarks, effGen outperforms LangChain, AutoGen, and Smolagents in success rate, execution speed, and memory usage. Prompt optimization helps small models more (11.2% gain at 1.5B versus 2.4% at 32B), while complexity routing helps large models more (3.6% at 1.5B versus 7.9% at 32B); combined, they give consistent gains at all scales.
Conclusion: effGen shows that system-level optimization can make small language models capable agents, offering a viable path toward lightweight, private, low-cost local AI agents; it is released as open source for research and commercial use.
Abstract: Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls. While powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce effGen, an open-source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment (pip install effgen). effGen makes four major contributions: (1) Enhanced tool-calling with prompt optimization that compresses contexts by 70-80% while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity-based routing using five factors to make smart pre-execution decisions, and (4) Unified memory system combining short-term, long-term, and vector-based storage. Additionally, effGen unifies multiple agent protocols (MCP, A2A, ACP) for cross-protocol communication. Results on 13 benchmarks show effGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory usage. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. effGen (https://effgen.org/) is released under the MIT License, ensuring broad accessibility for research and commercial use. Our framework code is publicly available at https://github.com/ctrl-gaurav/effGen.
[50] Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? When Hard Gating Hurts
Víctor Yeste,Paolo Rosso
Main category: cs.CL
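TL;DR: This paper studies whether the structure of Schwartz higher-order (HO) value categories can improve sentence-level human value detection under a tight compute budget (a single 8 GB GPU). Hard hierarchical gating hurts performance through error compounding and reduced recall, whereas label-wise threshold tuning and lightweight ensembling are more effective.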
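Label-wise threshold tuning can be sketched as a per-label grid search that maximizes F1 on validation data. This is a generic illustration of the technique, assuming predicted probabilities and binary gold labels; the grid, the toy scores, and the function names are hypothetical and do not reproduce the paper's setup.

```python
def f1_score(predictions, golds):
    """Binary F1 for parallel lists of booleans/0-1 labels."""
    tp = sum(1 for p, g in zip(predictions, golds) if p and g)
    fp = sum(1 for p, g in zip(predictions, golds) if p and not g)
    fn = sum(1 for p, g in zip(predictions, golds) if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_thresholds(scores, golds, grid):
    """Pick, for each label, the grid threshold with the best validation F1.

    scores[i][j] is the predicted probability of label j for example i;
    golds[i][j] is 1 if label j applies to example i, else 0.
    """
    thresholds = []
    for j in range(len(scores[0])):
        label_scores = [row[j] for row in scores]
        label_golds = [row[j] for row in golds]
        best = max(
            grid,
            key=lambda t: f1_score([s >= t for s in label_scores], label_golds),
        )
        thresholds.append(best)
    return thresholds


# Toy validation set with two labels; the second label has systematically low scores.
val_scores = [[0.9, 0.3], [0.6, 0.2], [0.4, 0.35], [0.1, 0.05]]
val_golds = [[1, 1], [1, 0], [0, 1], [0, 0]]

thresholds = tune_thresholds(val_scores, val_golds, grid=[0.1, 0.3, 0.5, 0.7])
print(thresholds)
```

A single global cutoff of 0.5 would miss every positive for the second label in this toy data, which illustrates why per-label calibration can be a high-leverage adjustment for multi-label classifiers.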
Details
Motivation: The paper asks whether Schwartz higher-order (HO) value categories provide usable structure for sentence-level multi-label classification, and evaluates their practical value under a strict compute budget.
Method: On ValueEval'24 / ValuesML, three families of methods are compared: (i) direct supervised transformers; (ii) HO→values pipelines that enforce the hierarchy with hard masks; and (iii) Presence→HO→values cascades. Low-cost add-ons (lexica, short context, topics) are also tested, along with label-wise threshold tuning, small instruction-tuned LLM baselines (≤10B parameters), QLoRA, and simple ensembles.
Result: HO categories are learnable from single sentences (the easiest bipolar pair reaches Macro-F1 ≈ 0.58), but hard hierarchical gating often lowers final Macro-F1. Label-wise threshold tuning adds up to +0.05 Macro-F1, small transformer ensembles give the most consistent gains (up to +0.02), and small LLMs are weak on their own but contribute complementary errors in cross-family ensembles.
Conclusion: HO structure is useful descriptively, but hard hierarchical constraints hurt sentence-level value detection; robust gains come from calibration (such as threshold tuning) and lightweight ensembling, not from enforced structure.
Abstract: Sentence-level human value detection is typically framed as multi-label classification over Schwartz values, but it remains unclear whether Schwartz higher-order (HO) categories provide usable structure. We study this under a strict compute-frugal budget (single 8 GB GPU) on ValueEval'24 / ValuesML (74K English sentences). We compare (i) direct supervised transformers, (ii) HO→values pipelines that enforce the hierarchy with hard masks, and (iii) Presence→HO→values cascades, alongside low-cost add-ons (lexica, short context, topics), label-wise threshold tuning, small instruction-tuned LLM baselines (≤10B), QLoRA, and simple ensembles. HO categories are learnable from single sentences (e.g., the easiest bipolar pair reaches Macro-F1 ≈ 0.58), but hard hierarchical gating is not a reliable win: it often reduces end-task Macro-F1 via error compounding and recall suppression. In contrast, label-wise threshold tuning is a high-leverage knob (up to +0.05 Macro-F1), and small transformer ensembles provide the most consistent additional gains (up to +0.02 Macro-F1). Small LLMs lag behind supervised encoders as stand-alone systems, yet can contribute complementary errors in cross-family ensembles. Overall, HO structure is useful descriptively, but enforcing it with hard gates hurts sentence-level value detection; robust improvements come from calibration and lightweight ensembling.
[51] A Baseline Multimodal Approach to Emotion Recognition in Conversations
Víctor Yeste,Rodrigo Rivas-Arévalo
Main category: cs.CL
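TL;DR: This paper presents a lightweight multimodal baseline for emotion recognition that combines a transformer-based text classifier with a self-supervised speech representation model through simple late fusion, evaluated on the Friends dataset (SemEval-2024 Task 3).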
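Late fusion combines the outputs of separately trained unimodal models instead of their internal features. A minimal sketch follows, assuming each model emits a probability distribution over the same emotion labels and that fusion is a weighted average; the weight, labels, and probabilities are hypothetical, since the abstract does not specify the fusion rule.

```python
def late_fusion(text_probs, audio_probs, text_weight=0.5):
    """Weighted average of two class-probability vectors, renormalized to sum to 1."""
    if len(text_probs) != len(audio_probs):
        raise ValueError("both models must score the same set of labels")
    fused = [
        text_weight * t + (1.0 - text_weight) * a
        for t, a in zip(text_probs, audio_probs)
    ]
    total = sum(fused)
    return [p / total for p in fused]


def argmax_label(probs, labels):
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best_index]


# Hypothetical utterance: the words look joyful, but the tone of voice sounds angry.
labels = ["anger", "joy", "neutral"]
text_probs = [0.2, 0.5, 0.3]
audio_probs = [0.6, 0.1, 0.3]

fused_probs = late_fusion(text_probs, audio_probs, text_weight=0.5)
text_only = argmax_label(text_probs, labels)
fused_label = argmax_label(fused_probs, labels)
print(text_only, fused_label)
```

In this toy case the audio evidence flips the prediction, the kind of situation in which a multimodal ensemble can improve over a text-only model.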
Details
Motivation: The goal is a reproducible, easy-to-understand reference implementation that supports future, more rigorous research and comparison in multimodal emotion recognition.
Method: Two components are used: (i) a transformer-based text classifier and (ii) a self-supervised speech representation model, combined through a simple late-fusion ensemble. Training is limited, with emphasis on accessibility and transparency.
Result: The report gives empirical results for the baseline under a limited training protocol and notes the cases in which multimodal fusion improves over unimodal models.
Conclusion: Although not state of the art, this lightweight multimodal baseline offers a clear, transparent, and reproducible starting point for multimodal emotion recognition in conversations.
Abstract: We present a lightweight multimodal baseline for emotion recognition in conversations using the SemEval-2024 Task 3 dataset built from the sitcom Friends. The goal of this report is not to propose a novel state-of-the-art method, but to document an accessible reference implementation that combines (i) a transformer-based text classifier and (ii) a self-supervised speech representation model, with a simple late-fusion ensemble. We report the baseline setup and empirical results obtained under a limited training protocol, highlighting when multimodal fusion improves over unimodal models. This preprint is provided for transparency and to support future, more rigorous comparisons.
[52] Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs
Anusa Saha,Tanmay Joshi,Vinija Jain,Aman Chadha,Amitava Das
Main category: cs.CL
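TL;DR: This paper proposes Neural FOXP2, which localizes, models, and intervenes on language-specific neurons to make a non-English language (such as Hindi or Spanish) the controllable default of an LLM, revealing and exploiting a sparse, low-rank language control circuit.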
Details
Motivation: Although LLMs are trained on multilingual data, their default language is usually English, and other languages remain in parametric memory but are systematically suppressed. The authors aim to expose and intervene on the mechanism behind this language defaultness.
Method: Neural FOXP2 has three stages. (i) Localize: per-layer sparse autoencoders (SAEs) decompose activations, identifying features that are highly selective for the target language and the language neurons behind them. (ii) Steering directions: a layerwise SVD of English-to-target activation-difference matrices extracts the low-rank directions that dominate language change and an intervention window. (iii) Steer: a signed, sparse activation shift is applied to language neurons in low to mid layers, positive toward the target language and negative for English-related responses.
Result: The method makes Hindi or Spanish the model's primary language, yielding controllable target-language defaultness, consistent with the view that language defaultness is governed by a sparse, low-rank control circuit.
Conclusion: Language defaultness is not a global property but is governed by a sparse subset of neurons that can be localized, interpreted, and steered; Neural FOXP2 offers a mechanistic intervention framework for controllable language switching in multilingual LLMs.
Abstract: LLMs are multilingual by training, yet their lingua franca is often English, reflecting English language dominance in pretraining. Other languages remain in parametric memory but are systematically suppressed. We argue that language defaultness is governed by a sparse, low-rank control circuit, language neurons, that can be mechanistically isolated and safely steered. We introduce Neural FOXP2, which makes a chosen language (Hindi or Spanish) primary in a model by steering language-specific neurons. Neural FOXP2 proceeds in three stages: (i) Localize: We train per-layer SAEs so each activation decomposes into a small set of active feature components. For every feature, we quantify English vs. Hindi/Spanish selectivity overall logit-mass lift toward the target-language token set. Tracing the top-ranked features back to their strongest contributing units yields a compact language-neuron set. (ii) Steering directions: We localize controllable language-shift geometry via a spectral low-rank analysis. For each layer, we build English to target activation-difference matrices and perform layerwise SVD to extract the dominant singular directions governing language change. The eigengap and effective-rank spectra identify a compact steering subspace and an empirically chosen intervention window (where these directions are strongest and most stable). (iii) Steer: We apply a signed, sparse activation shift targeted to the language neurons. Concretely, within low to mid layers we add a positive steering along the target-language dominant directions and a compensating negative shift toward the null space for the English neurons, yielding controllable target-language defaultness.
[53] Verification Required: The Impact of Information Credibility on AI Persuasion
Saaduddin Mahmud,Eugene Bagdasarian,Shlomo Zilberstein
Main category: cs.CL
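TL;DR: This paper proposes MixTalk, a model for studying strategic communication between LLM agents when information credibility is uncertain, and designs TOPD to improve receiver robustness to persuasion.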
Details
Motivation: Prior work considers only fully unverifiable cheap talk or fully verifiable disclosure, neither of which captures real-world settings where information has probabilistic credibility, so a more realistic model of strategic communication is needed.
Method: Proposes the MixTalk game framework, in which the sender mixes verifiable and unverifiable claims and the receiver performs costly verification under a limited budget; also designs Tournament Oracle Policy Distillation (TOPD), which distills a tournament oracle policy from interaction logs and deploys it in-context at inference time.
Result: Large-scale tournament evaluation across three realistic deployment settings reveals the capability limits of current LLM agents in reasoning about information credibility; TOPD significantly improves receiver robustness to persuasion.
Conclusion: MixTalk offers a new paradigm for modeling strategic communication among LLM agents, and TOPD shows that offline policy distillation from interaction logs can strengthen agents' decision-making in high-stakes communication.
Abstract: Agents powered by large language models (LLMs) are increasingly deployed in settings where communication shapes high-stakes decisions, making a principled understanding of strategic communication essential. Prior work largely studies either unverifiable cheap-talk or fully verifiable disclosure, failing to capture realistic domains in which information has probabilistic credibility. We introduce MixTalk, a strategic communication game for LLM-to-LLM interaction that models information credibility. In MixTalk, a sender agent strategically combines verifiable and unverifiable claims to communicate private information, while a receiver agent allocates a limited budget to costly verification and infers the underlying state from prior beliefs, claims, and verification outcomes. We evaluate state-of-the-art LLM agents in large-scale tournaments across three realistic deployment settings, revealing their strengths and limitations in reasoning about information credibility and the explicit behavior that shapes these interactions. Finally, we propose Tournament Oracle Policy Distillation (TOPD), an offline method that distills tournament oracle policy from interaction logs and deploys it in-context at inference time. Our results show that TOPD significantly improves receiver robustness to persuasion.
[54] Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals
Pengyue Yang,Jiawen Wen,Haolin Jin,Linghan Huang,Huaming Chen,Ling Chen
Main category: cs.CL
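TL;DR: This paper proposes Structural Confidence, a single-pass, model-agnostic confidence estimation framework that uses multi-scale structural signals (spectral features, local variation, and global shape) from the trajectory of an LLM's final-layer hidden states to better predict output correctness; it performs strongly on several cross-domain benchmarks while remaining computationally efficient.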
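To make the idea of trajectory descriptors concrete, here is a rough sketch of two such signals computed over a sequence of hidden-state vectors. The specific formulas (mean step length as local variation, displacement over path length as a global shape measure) and the toy trajectories are assumptions chosen for illustration; the entry does not give the authors' exact descriptors, and the spectral component is omitted here.

```python
import math

def local_variation(trajectory):
    """Mean Euclidean distance between consecutive hidden states."""
    steps = [math.dist(a, b) for a, b in zip(trajectory, trajectory[1:])]
    return sum(steps) / len(steps)

def straightness(trajectory):
    """End-to-end displacement divided by total path length.

    1.0 means the states move along a straight line; values near 0.0
    mean the trajectory doubles back on itself.
    """
    path = sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))
    if path == 0:
        return 0.0
    return math.dist(trajectory[0], trajectory[-1]) / path

# Toy 2-D "hidden states", one vector per generated token (hypothetical).
steady = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
looping = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]

features = {
    "steady": (local_variation(steady), straightness(steady)),
    "looping": (local_variation(looping), straightness(looping)),
}
```

In a real pipeline these features would feed a lightweight classifier that predicts correctness; that downstream step is not shown.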
Details
Motivation: Standard confidence estimators (such as token likelihood, semantic similarity, and multi-sample consistency) are brittle under distribution shift, domain-specialised text, and limited compute, and fall short of what high-stakes applications require.
Method: Proposes the Structural Confidence framework, which extracts multi-scale structural signals (spectral features, local variation, global shape) from the model's final-layer hidden-state trajectory; it needs no repeated sampling or auxiliary model, only a single deterministic forward pass.
Result: On four heterogeneous benchmarks (FEVER, SciFact, WikiBio-hallucination, and TruthfulQA), AUROC and AUPR are significantly better than existing baselines; compared with multi-sample consistency methods, compute cost is lower and deployment is simpler.
Conclusion: Structural Confidence is an efficient, robust, plug-and-play post-hoc confidence estimation method suited to socially impactful, resource-constrained LLM deployments.
Abstract: Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs. Yet standard confidence estimators, such as token likelihood, semantic similarity and multi-sample consistency, remain brittle under distribution shift, domain-specialised text, and compute limits. In this work, we present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction based on multi-scale structural signals derived from a model's final-layer hidden-state trajectory. By combining spectral, local-variation, and global shape descriptors, our method captures internal stability patterns that are missed by probabilities and sentence embeddings. We conduct extensive, cross-domain evaluation across four heterogeneous benchmarks: FEVER (fact verification), SciFact (scientific claims), WikiBio-hallucination (biographical consistency), and TruthfulQA (truthfulness-oriented QA). Our Structural Confidence framework demonstrates strong performance compared with established baselines in terms of AUROC and AUPR. More importantly, unlike sampling-based consistency methods which require multiple stochastic generations and an auxiliary model, our approach uses a single deterministic forward pass, offering a practical basis for efficient, robust post-hoc confidence estimation in socially impactful, resource-constrained LLM applications.
[55] MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA
Yutong Song,Shiva Shrestha,Chenhan Lyu,Elahe Khatibi,Pengfei Zhang,Honghui Xu,Nikil Dutt,Amir Rahmani
Main category: cs.CL
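TL;DR: MedSpeak is an ASR error-correction framework that combines a medical knowledge graph with large language models to improve medical term recognition and answer prediction in spoken medical question answering.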
Details
Motivation: Existing ASR-based spoken question-answering systems recognize medical terminology poorly, which hurts downstream QA performance.
Method: Proposes the MedSpeak framework, which uses the semantic relationships and phonetic information in a medical knowledge graph, together with the reasoning ability of LLMs, to correct and refine noisy ASR transcripts.
Result: Significantly improves medical term recognition accuracy and overall medical SQA performance on multiple benchmarks, reaching state-of-the-art results.
Conclusion: MedSpeak is an effective knowledge graph-aided ASR error-correction method and offers a new paradigm for spoken medical QA systems.
Abstract: Spoken question-answering (SQA) systems relying on automatic speech recognition (ASR) often struggle with accurately recognizing medical terminology. To this end, we propose MedSpeak, a novel knowledge graph-aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. Comprehensive experimental results on benchmarks demonstrate that MedSpeak significantly improves the accuracy of medical term recognition and overall medical SQA performance, establishing MedSpeak as a state-of-the-art solution for medical SQA. The code is available at https://github.com/RainieLLM/MedSpeak.
[56] DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning
Batuhan K. Karaman,Aditya Rawal,Suhaila Shakiah,Mohammad Ghavamzadeh,Mingyi Hong,Arijit Biswas,Ruida Zhou
Main category: cs.CL
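TL;DR: This paper proposes DISPO, an algorithm that decouples the upper and lower clipping of importance sampling weights for correct and incorrect responses to stabilize reinforcement learning training; it significantly outperforms existing PPO-style and REINFORCE-style methods on mathematical reasoning tasks.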
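The decoupled clipping idea can be illustrated with a small sketch: each response class (correct or incorrect) gets its own lower and upper bound on the importance sampling weight, giving four tunable parameters. The `ClipConfig` name, the threshold values, and the plain clamp are assumptions for illustration only; the entry does not specify the authors' values or the exact loss in which the clipped weight is used.

```python
from dataclasses import dataclass

@dataclass
class ClipConfig:
    # Hypothetical thresholds; the four values are tuned separately.
    correct_low: float = 0.8
    correct_high: float = 1.5
    incorrect_low: float = 0.5
    incorrect_high: float = 1.2

def clipped_weight(ratio: float, is_correct: bool, cfg: ClipConfig) -> float:
    """Clamp an importance sampling ratio using the bounds that belong
    to the response's class (correct vs. incorrect)."""
    if is_correct:
        low, high = cfg.correct_low, cfg.correct_high
    else:
        low, high = cfg.incorrect_low, cfg.incorrect_high
    return min(max(ratio, low), high)

cfg = ClipConfig()
examples = [
    clipped_weight(2.0, True, cfg),    # correct, weight > 1: up-clipped
    clipped_weight(0.1, True, cfg),    # correct, weight < 1: down-clipped
    clipped_weight(2.0, False, cfg),   # incorrect, weight > 1: up-clipped
    clipped_weight(0.1, False, cfg),   # incorrect, weight < 1: down-clipped
]
```

The four example calls correspond to the four update regimes named in the abstract; in a REINFORCE-style objective the clipped weight would typically be treated as a constant multiplier on the log-probability gradient, which is not shown here.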
Details
Motivation: Existing reinforcement learning methods with verifiable rewards show a clear trade-off between training stability and learning efficiency: PPO-style methods are stable but converge slowly, while REINFORCE-style methods are efficient but unstable.
Method: Proposes DISPO, a REINFORCE-style algorithm that sets independent upper and lower clipping thresholds for correct and incorrect responses, yielding four controllable policy update regimes, and uses ablations to analyze how each regime affects exploration (higher entropy) and distillation (lower entropy).
Result: DISPO reaches 61.04% accuracy on AIME'24, significantly better than CISPO (55.42%) and DAPO (50.21%), with consistent gains across multiple benchmarks and models.
Conclusion: Decoupling the upper and lower clipping of importance sampling weights is an effective way to balance exploration and distillation and avoid performance collapse; DISPO offers a more robust reinforcement learning recipe for high-precision tasks such as mathematical reasoning.
Abstract: Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models, particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability as they clip importance sampling weights while still permitting non-zero gradients outside the trust-region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights >1 increase the average token entropy (i.e., exploration) while weights <1 decrease it (i.e., distillation) -- both beneficial but causing gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights >1) or vanishing response lengths (when weights <1). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04% on AIME'24 (vs. 55.42% CISPO and 50.21% DAPO) with similar gains across various benchmarks and models.
[57] Sparse Reward Subsystem in Large Language Models
Guowei Xu,Mert Yuksekgonul,James Zou
Main category: cs.CL
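TL;DR: This paper finds a sparse reward subsystem in the hidden states of large language models (LLMs), analogous to the biological reward system in the human brain; it identifies "value neurons" that represent state value and play a key role in reasoning, and further finds "dopamine neurons" that encode reward prediction errors.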
Details
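Motivation: Inspired by the brain's reward system, the authors ask whether LLMs contain a similar functional reward representation, in order to understand the internal principles behind their decisions and reasoning.
Method: Analyze LLM hidden states to identify the sparse reward subsystem; use intervention experiments to verify the importance of value neurons for reasoning; locate neurons that encode reward prediction errors by examining cases where value predictions and actual rewards diverge.
Result: Value neurons are found to be robust across datasets, model scales, and architectures, and to transfer strongly across models fine-tuned from the same base model; dopamine-like neurons that respond to reward prediction errors are identified, with activation patterns consistent with RPE theory.
Conclusion: LLMs contain a brain-like reward subsystem with functionally distinct value neurons and dopamine neurons, offering a new perspective for understanding LLM reasoning and building more controllable, interpretable agents.
Abstract: In this paper, we identify a sparse reward subsystem within the hidden states of Large Language Models (LLMs), drawing an analogy to the biological reward subsystem in the human brain. We demonstrate that this subsystem contains value neurons that represent the model's internal expectation of state value, and through intervention experiments, we establish the importance of these neurons for reasoning. Our experiments reveal that these value neurons are robust across diverse datasets, model scales, and architectures; furthermore, they exhibit significant transferability across different datasets and models fine-tuned from the same base model. By examining cases where value predictions and actual rewards diverge, we identify dopamine neurons within the reward subsystem which encode reward prediction errors (RPE). These neurons exhibit high activation when the reward is higher than expected and low activation when the reward is lower than expected.
[58] DeALOG: Decentralized Multi-Agents Log-Mediated Reasoning Framework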
Abhijit Chakraborty,Ashish Raj Shekhar,Shiven Agarwal,Vivek Gupta
Main category: cs.CL
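TL;DR: This paper proposes DeALOG, a decentralized multi-agent framework for complex question answering across text, tables, and images. Specialized agents (Table, Context, Visual, Summarizing, and Verification) collaborate through a shared natural-language log, enabling interpretable, robust, and scalable multimodal QA.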
Details
Motivation: Complex question answering must integrate heterogeneous sources such as text, tables, and images; existing methods lack a unified framework that supports specialized processing, collaborative reasoning, and interpretability.
Method: Proposes the DeALOG framework, composed of several specialized agents that communicate and collaborate in a decentralized way through a shared natural-language log; the log serves as persistent memory and supports error detection and verification; each agent has a clear role and no central controller is needed.
Result: Achieves competitive performance on FinQA, TAT-QA, CRT-QA, WikiTableQuestions, FeTaQA, and MultiModalQA; ablation analysis confirms that the shared log, agent specialization, and verification are all important for accuracy.
Conclusion: DeALOG offers a modular, scalable, and interpretable paradigm for multimodal QA, and its natural-language log communication improves robustness and collaboration.
Abstract: Complex question answering across text, tables and images requires integrating diverse information sources. A framework supporting specialized processing with coordination and interpretability is needed. We introduce DeALOG, a decentralized multi-agent framework for multimodal question answering. It uses specialized agents (Table, Context, Visual, Summarizing, and Verification) that communicate through a shared natural-language log as persistent memory. This log-based approach enables collaborative error detection and verification without central control, improving robustness. Evaluations on FinQA, TAT-QA, CRT-QA, WikiTableQuestions, FeTaQA, and MultiModalQA show competitive performance. Analysis confirms the importance of the shared log, agent specialization, and verification for accuracy. DeALOG provides a scalable approach through modular components using natural-language communication.
[59] Reliable Use of Lemmas via Eligibility Reasoning and Section-Aware Reinforcement Learning
Zhikun Xu,Xiaodong Yu,Ben Zhou,Jiang Liu,Jialian Wu,Ze Wang,Ximeng Sun,Hao Chen,Zicheng Liu
Main category: cs.CL
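TL;DR: This paper proposes RULES, which improves LLMs' ability to apply lemmas correctly in mathematical reasoning through a structured prediction task (a precondition check and a conclusion-utility check), trained with section-aware reinforcement learning.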
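A small sketch of the two-section judgment and section-aware masking described in this entry. The conjunction rule for the usefulness decision, the section labels, and the flat per-token penalty are assumptions made for illustration; the entry only says that a decision is derived from the two checks and that the penalty is assigned to the section responsible for the error.

```python
def usefulness(precondition_ok: bool, conclusion_useful: bool) -> bool:
    """Assumed rule: a lemma is useful only if its preconditions hold
    and its conclusion helps with the target statement."""
    return precondition_ok and conclusion_useful

def masked_penalties(penalty, token_sections, faulty_section):
    """Apply `penalty` only to tokens in the section blamed for the error;
    tokens in the other section receive zero."""
    return [penalty if s == faulty_section else 0.0 for s in token_sections]

# Hypothetical output of 5 tokens: 2 in the precondition check,
# 3 in the conclusion-utility check.
sections = ["precondition", "precondition", "utility", "utility", "utility"]

# Suppose the precondition check was wrong but the utility check was right.
penalties = masked_penalties(-1.0, sections, "precondition")
decision = usefulness(precondition_ok=False, conclusion_useful=True)
```

Masking in this way keeps a correct section from being penalized for a mistake made in the other one, which is the intuition the abstract gives for section-aware reinforcement.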
Details
Motivation: Although current LLMs perform well on mathematical benchmarks, they often misapply lemmas (for example, by ignoring preconditions), so their ability to judge lemma applicability needs improvement.
Method: Casts lemma judging as a structured prediction task with a two-section output (a precondition check plus a conclusion-utility check) and trains with reinforcement learning using section-aware loss masking; data covers natural-language and formal proof corpora, and a perturbation test suite is built to assess robustness.
Result: RULES significantly outperforms baselines in-domain, with larger gains on applicability-breaking perturbations; end-to-end evaluation shows parity or modest gains on competition-style, perturbation-aligned, and theorem-based problems; ablations show that both the two-section output and section-aware reinforcement are necessary.
Conclusion: Structured output and section-aware reinforcement learning are key to making lemma judgment robust, offering a new paradigm for reliable lemma use in mathematical reasoning.
Abstract: Recent large language models (LLMs) perform strongly on mathematical benchmarks yet often misapply lemmas, importing conclusions without validating assumptions. We formalize lemma-judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion-utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification via a two-section output and trains with reinforcement learning plus section-aware loss masking to assign penalty to the section responsible for errors. Training and evaluation draw on diverse natural language and formal proof corpora; robustness is assessed with a held-out perturbation suite; and end-to-end evaluation spans competition-style, perturbation-aligned, and theorem-based problems across various LLMs. Results show consistent in-domain gains over both a vanilla model and a single-label RL baseline, larger improvements on applicability-breaking perturbations, and parity or modest gains on end-to-end tasks; ablations indicate that the two-section outputs and section-aware reinforcement are both necessary for robustness.
[60] Distilling Token-Trained Models into Byte-Level Models
Zishuo Bao,Jiaqi Leng,Junxiong Wang,Bowen Peng,Yucheng Lu
Main category: cs.CL
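TL;DR: This paper proposes an efficient distillation method that converts existing token-trained large language models (LLMs) into byte-level language models (BLMs) using only about 125 billion bytes of data, retaining performance while avoiding the high cost of training from scratch.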
Details
Motivation: Existing byte-level language models (BLMs) must be trained from scratch on trillions of bytes, which is prohibitively expensive; a cheap, efficient way to convert existing token-level models into BLMs is needed.
Method: A two-stage distillation curriculum: (1) progressive knowledge distillation, which aligns byte-level representations with the teacher model's token embeddings; (2) byte-level supervised fine-tuning, which enables end-to-end generation entirely in byte space.
Result: Validated on several model families including Llama, Qwen, and OLMo; the distilled BLMs retain most of the teacher models' performance using only about 125B bytes of training data.
Conclusion: The distillation recipe substantially lowers the barrier to training BLMs and offers a practical path to efficient, scalable tokenizer-free language models.
Abstract: Byte Language Models (BLMs) have emerged as a promising direction for scaling language models beyond tokenization. However, existing BLMs typically require training from scratch on trillions of bytes, making them prohibitively expensive. In this paper, we propose an efficient distillation recipe that converts existing token-trained LLMs into BLMs while retaining comparable capabilities. Our recipe follows a two-stage curriculum: (1) Progressive Knowledge Distillation, which aligns byte-level representations with the embeddings of the token-trained teacher model; and (2) Byte-Level Supervised Fine-Tuning, which enables end-to-end generation entirely in the byte space. We validate our approach across multiple model families, including Llama, Qwen, and OLMo, and demonstrate that the distilled BLMs retain most of the teacher models' performance using only approximately 125B bytes.
[61] Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident
Conrad Borchers,Jill-Jênn Vie,Roger Azevedo
Main category: cs.CL
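TL;DR: This paper evaluates how well large language models (LLMs), acting as "novices", model the reasoning and metacognitive judgments of human learners. Their generated reasoning is fluent but overly coherent and verbose, lacking the fragmentation and variability of real beginners, which leads to systematic overestimation of learner performance; the authors attribute this to training data that favors expert-like solutions and lacks expressions of affect and working memory limits.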
Details
Motivation: Existing LLM evaluations emphasize problem-solving accuracy and overlook the fragmented, imperfect reasoning typical of human beginners; it is necessary to test whether LLMs can faithfully simulate novice reasoning and metacognitive judgments in order to support better-adapted tutoring systems.
Method: Using 630 student think-aloud utterances from multi-step chemistry tutoring problems, together with detailed problem-solving logs (hint use, attempts, and context), the study compares reasoning generated by LLMs such as GPT-4.1 under minimal and extended contextual prompting, and assesses their ability to predict learner success at each step.
Result: GPT-4.1 generates fluent, contextually appropriate reasoning that is nonetheless systematically over-coherent, verbose, and low in variability; the richer the context, the stronger the bias; step-level learner performance is consistently overestimated.
Conclusion: LLMs have inherent limitations in modeling novice cognition, rooted in training data that lacks novice characteristics (such as hesitation, errors, affect, and working memory constraints); the findings warn of risks in using them directly for adaptive instruction and call for AI tutoring frameworks that better match real learning processes.
Abstract: Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models' ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving context during prompting. Learner performance was consistently overestimated. These findings highlight epistemic limitations of simulating learning with LLMs. We attribute these limitations to LLM training data, including expert-like solutions devoid of expressions of affect and working memory constraints during problem solving. Our evaluation framework can guide future design of adaptive systems that more faithfully support novice learning and self-regulation using generative artificial intelligence.
[62] Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations
Sheng-Lun Wei,Yu-Ling Liao,Yen-Hua Chang,Hen-Hsen Huang,Hsin-Hsi Chen
Main category: cs.CL
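TL;DR: This paper presents the first systematic study of speech bias in multilingual multimodal large language models (MLLMs). It builds and releases BiasInEar, a speech-augmented benchmark covering English, Chinese, and Korean, balanced by gender and accent, with 70.8 hours of speech and 11,200 questions. Nine models are evaluated with four metrics under perturbations of language, accent, gender, and option order: models are sensitive to language and option order but fairly robust to gender, and speech may amplify structural biases. Architectural design and reasoning strategy also substantially affect cross-lingual robustness. The study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs.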
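Of the four metrics this entry names, Fleiss' κ has a standard closed form, sketched below in plain Python. How the benchmark maps model outputs onto "raters" is not stated in the entry; the example assumes, purely for illustration, that each perturbed variant of a question acts as one rater choosing among answer options.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of shape (items x categories).

    counts[i][j] is the number of raters that assigned item i to
    category j; every row must sum to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Overall proportion of assignments falling in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]

    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1)) for row in counts]
    p_observed = sum(p_item) / n_items

    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical data: 2 questions, 2 answer options, 3 perturbed variants each.
consistent = [[3, 0], [0, 3]]   # all variants agree on every question
unstable = [[2, 1], [1, 2]]     # variants disagree on every question

kappa_consistent = fleiss_kappa(consistent)
kappa_unstable = fleiss_kappa(unstable)
```

A κ near 1 would indicate that the model's answer is stable across perturbations, while values near or below 0 indicate agreement no better than chance.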
Details
Motivation: To fill the gap in systematic research on speech bias in multilingual multimodal LLMs, unify speech-based and text-based evaluation, and improve the fairness and robustness of speech-augmented models.
Method: Builds the BiasInEar speech-augmented benchmark (based on Global MMLU Lite, covering English, Chinese, and Korean, balanced by gender and accent), designs four complementary metrics (accuracy, entropy, APES, Fleiss' κ), and systematically evaluates nine representative models under perturbations of language, accent, gender, and option order.
Result: MLLMs are highly sensitive to language and option order but relatively robust to gender perturbations; speech input amplifies structural biases; architectural design and reasoning strategy substantially affect cross-lingual robustness.
Conclusion: The study establishes the first unified framework for assessing fairness and robustness in speech-integrated LLMs, reveals new bias dimensions introduced by speech, and provides both a basis and practical resources for building fairer, more robust multilingual speech-language models.
Abstract: This work presents the first systematic investigation of speech bias in multilingual MLLMs. We construct and release the BiasInEar dataset, a speech-augmented benchmark based on Global MMLU Lite, spanning English, Chinese, and Korean, balanced by gender and accent, and totaling 70.8 hours (approximately 4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss' κ), we evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Moreover, architectural design and reasoning strategy substantially affect robustness across languages. Overall, this study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging the gap between text- and speech-based evaluation. The resources can be found at https://github.com/ntunlplab/BiasInEar.
[63] Personality Expression Across Contexts: Linguistic and Behavioral Variation in LLM Agents
Bin Han,Deuksin Kwon,Jonathan Gratch
Main category: cs.CL
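TL;DR: This study examines how large language models (LLMs) given the same personality prompt differ in language, behavior, and emotion across conversational settings, and finds that their personality expression is context-sensitive rather than fixed.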
Details
Motivation: To investigate why LLMs behave inconsistently across conversational settings under the same personality prompt, and whether this inconsistency is a flaw or a human-like form of contextual adaptation.
Method: Compare LLM outputs under the same personality prompt at the linguistic, behavioral, and emotional levels across four conversational settings (ice-breaking, negotiation, group decision, and empathy tasks).
Result: The same personality prompt produces markedly different linguistic styles, behavioral tendencies, and emotional tones in different settings; contextual cues systematically modulate personality and emotional expression.
Conclusion: LLM personality expression is consistent with Whole Trait Theory, showing context-sensitive adaptation rather than inconsistency; this has important implications for building more natural and credible dialogue agents.
Abstract: Large Language Models (LLMs) can be conditioned with explicit personality prompts, yet their behavioral realization often varies depending on context. This study examines how identical personality prompts lead to distinct linguistic, behavioral, and emotional outcomes across four conversational settings: ice-breaking, negotiation, group decision, and empathy tasks. Results show that contextual cues systematically influence both personality expression and emotional tone, suggesting that the same traits are expressed differently depending on social and affective demands. This raises an important question for LLM-based dialogue agents: whether such variations reflect inconsistency or context-sensitive adaptation akin to human behavior. Viewed through the lens of Whole Trait Theory, these findings highlight that LLMs exhibit context-sensitive rather than fixed personality expression, adapting flexibly to social interaction goals and affective conditions.
[64] Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs
Ruihan Jin,Pengpeng Shao,Zhengqi Wen,Jinyang Wu,Mingkuan Feng,Shuo Yang,Chu Yuan Zhang,Jianhua Tao
Main category: cs.CL
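TL;DR: This paper introduces Knowledge Purification, which consolidates the reasoning of multiple teacher LLMs into a single rationale to reduce knowledge conflicts and improve efficiency in multi-teacher distillation; it proposes five purification methods, among which router-based methods generalize strongly.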
Details
Motivation: Traditional knowledge distillation faces knowledge conflicts and high resource consumption when using multiple teacher models.
Method: Introduces the concept of knowledge purification and designs five purification methods from different perspectives, including router-based methods.
Result: Experiments show that the proposed methods improve the distilled model's performance and effectively alleviate knowledge conflicts, and that router-based methods generalize strongly.
Conclusion: Knowledge purification is a promising optimization strategy for multi-teacher distillation and helps enable practical deployment of powerful yet lightweight models.
Abstract: Knowledge distillation has emerged as a pivotal technique for transferring knowledge from stronger large language models (LLMs) to smaller, more efficient models. However, traditional distillation approaches face challenges related to knowledge conflicts and high resource demands, particularly when leveraging multiple teacher models. In this paper, we introduce the concept of Knowledge Purification, which consolidates the rationales from multiple teacher LLMs into a single rationale, thereby mitigating conflicts and enhancing efficiency. To investigate the effectiveness of knowledge purification, we further propose five purification methods from various perspectives. Our experiments demonstrate that these methods not only improve the performance of the distilled model but also effectively alleviate knowledge conflicts. Moreover, router-based methods exhibit robust generalization capabilities, underscoring the potential of innovative purification techniques in optimizing multi-teacher distillation and facilitating the practical deployment of powerful yet lightweight models.
[65] From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization
Chaoqun Cui,Shijing Wang,Liangbin Huang,Qingqing Gu,Zhaolong Huang,Xiao Zeng,Wenji Mao
Main category: cs.CL
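TL;DR: This paper studies how to build translation LLMs that meet domain customization needs. Taking visual media subtitle translation as its focus, it proposes Adaptive Local Preference Optimization (ALPO), constructs a multidirectional subtitle parallel corpus, and verifies the reliability of LLMs as translation reward models and evaluators.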
Details
Motivation: Large language models (LLMs) perform well in general machine translation but show limitations in vertical domains such as subtitle translation, so translation LLMs with domain adaptation ability are needed.
Method: Constructs a multidirectional subtitle parallel corpus and proposes Adaptive Local Preference Optimization (ALPO) for fine-grained preference alignment; also verifies the reliability of LLMs as translation reward models and evaluators.
Result: ALPO performs strongly in multidimensional evaluation of translation quality, markedly improving the expressiveness and vividness of subtitle translation.
Conclusion: Domain data construction combined with ALPO can effectively improve LLM translation in vertical domains, especially subtitle translation, and LLMs can serve as a reliable source of translation evaluation and reward signals.
Abstract: The rapid development of Large Language Models (LLMs) has significantly enhanced the general capabilities of machine translation. However, as application scenarios become more complex, the limitations of LLMs in vertical domain translations are gradually becoming apparent. In this study, we focus on how to construct translation LLMs that meet the needs of domain customization. We take visual media subtitle translation as our topic and explore how to train expressive and vivid translation LLMs. We investigated the situations of subtitle translation and other domains of literal and liberal translation, verifying the reliability of LLM as reward model and evaluator for translation. Additionally, to train an expressive translation LLM, we constructed and released a multidirectional subtitle parallel corpus dataset and proposed the Adaptive Local Preference Optimization (ALPO) method to address fine-grained preference alignment. Experimental results demonstrate that ALPO achieves outstanding performance in multidimensional evaluation of translation quality.
[66] What If We Allocate Test-Time Compute Adaptively?
Ahsan Bilal,Ahmed Mohsin,Muhammad Umer,Ali Subhan,Hassan Rizwan,Ayesha Mohsin,Dean Hougen
Main category: cs.CL
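TL;DR: This paper proposes a verifier-guided adaptive inference framework that uses a process reward model (PRM) to dynamically control the generation and selection of reasoning trajectories over multiple iterations; it significantly outperforms fixed compute allocation on several mathematical reasoning benchmarks and improves compute efficiency.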
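The two roles the PRM plays in this entry (pruning within an iteration, selecting across iterations) can be sketched as follows. The mean as the aggregation rule, the dictionary layout, and the scores are assumptions for illustration; the entry says only that step-level scores are aggregated, not how.

```python
def aggregate(step_scores):
    """Collapse step-level PRM scores into one number.
    Using the mean is an assumption; min or product are other options."""
    return sum(step_scores) / len(step_scores)

def prune(partial_trajectories, keep):
    """Within an iteration: keep the `keep` best-scoring partial
    trajectories for further expansion."""
    ranked = sorted(partial_trajectories,
                    key=lambda t: aggregate(t["scores"]),
                    reverse=True)
    return ranked[:keep]

def select_final(trajectories):
    """Across iterations: return the trajectory with the highest
    aggregated reward."""
    return max(trajectories, key=lambda t: aggregate(t["scores"]))

# Hypothetical candidates with made-up step-level PRM scores.
candidates = [
    {"answer": "A", "scores": [0.9, 0.2, 0.4]},
    {"answer": "B", "scores": [0.7, 0.8, 0.9]},
    {"answer": "C", "scores": [0.1, 0.3, 0.2]},
]

survivors = prune(candidates, keep=2)
final = select_final(candidates)
```

Pruning low-scoring partial trajectories early is what lets compute concentrate on higher-utility reasoning paths, in contrast to sampling a fixed number of complete responses and reranking them afterwards.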
Details
Motivation: Existing test-time compute scaling methods allocate compute uniformly, use fixed sampling strategies, and apply verification only for reranking, so they cannot dynamically control the reasoning process.
Method: Proposes a verifier-guided adaptive framework that models reasoning as iterative trajectory generation and selection; each iteration produces a plan, selects tools and a compute strategy, and uses the PRM at the step level to guide generation (pruning and expansion) and at the trajectory level to guide final response selection.
Result: Significantly outperforms direct test-time scaling on mathematical reasoning benchmarks including MATH-500, AIME24, and AMO-Bench, with several-fold improvements on the harder problems; analysis of theoretical FLOPs and compute intensity shows that the method concentrates compute more efficiently on high-utility reasoning paths.
Conclusion: PRM-guided dynamic compute allocation improves both performance and efficiency on complex reasoning tasks, supporting an adaptive, verification-driven approach to inference.
Abstract: Test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning as iterative trajectory generation and selection. For each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects a set of reasoning tools and a compute strategy together with an exploration parameter, and then generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within each iteration, step-level PRM scores are aggregated to guide pruning and expansion during generation, and across iterations, aggregated trajectory rewards are used to select the final response. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. We characterize efficiency using theoretical FLOPs and a compute intensity metric penalizing wasted generation and tool overhead, demonstrating that verification-guided allocation concentrates computation on high-utility reasoning paths.
[67] Logic-Oriented Retriever Enhancement via Contrastive Learning
Wenxuan Zhang,Yuan-Hao Jiang,Changyong Qi,Rui Jia,Yonghe Wu
Main category: cs.CL
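TL;DR: LORE is a fine-grained contrastive learning method that needs no external supervision; it strengthens a retriever's capacity for logical analysis and improves retrieval and generation on knowledge-intensive tasks.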
Details
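Motivation: Large language models underperform on knowledge-intensive tasks because conventional retrievers overfit to surface similarity and struggle with queries involving complex logical relations; the capacity for logical analysis is already present in model representations but is underused by standard training.
Method: Proposes LORE (Logic ORiented Retriever Enhancement), which uses fine-grained contrastive learning to guide embeddings toward logical structure rather than shallow similarity, without external supervision, extra resources, or pre-retrieval analysis, while remaining index-compatible.
Result: LORE consistently improves retrieval utility and downstream generation while maintaining inference efficiency.
Conclusion: LORE activates the latent logical analysis capacity of the model and provides a lightweight, efficient, plug-and-play retrieval enhancement for knowledge-intensive RAG systems.
Abstract: Large language models (LLMs) struggle in knowledge-intensive tasks, as retrievers often overfit to surface similarity and fail on queries involving complex logical relations. The capacity for logical analysis is inherent in model representations but remains underutilized in standard training. LORE (Logic ORiented Retriever Enhancement) introduces fine-grained contrastive learning to activate this latent capacity, guiding embeddings toward evidence aligned with logical structure rather than shallow similarity. LORE requires no external supervision, resources, or pre-retrieval analysis, remains index-compatible, and consistently improves retrieval utility and downstream generation while maintaining efficiency. The datasets and code are publicly available at https://github.com/mazehart/Lore-RAG.
[68] Tendem: A Hybrid AI+Human Platform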
Konstantin Chernyshev,Ekaterina Artemova,Viacheslav Zhukov,Maksim Nerush,Mariia Fedorova,Iryna Repik,Olga Shapovalova,Aleksey Sukhorosov,Vladimir Dobrovolskii,Natalia Mikhailova,Sergei Tilga
Main category: cs.CL
TL;DR: Tendem is a hybrid AI-human system that outperforms both AI-only and human-only approaches in quality, speed, and cost-efficiency, while its AI agent alone achieves near state-of-the-art performance on agentic benchmarks.
Details
Motivation: To overcome the limitations of purely AI-based or purely human-based workflows by combining their strengths: AI for scalability and consistency, humans for judgment and verification.
Method: Developed Tendem, a hybrid system integrating AI for structured tasks and human experts for failure recovery and quality assurance; evaluated it on 94 real-world tasks against AI-only agents and Upwork freelancers, and benchmarked its AI agent on third-party agentic evaluation suites.
Result: Tendem achieved higher output quality and faster turnaround than both AI-only and human-only baselines, with costs similar to human-only execution; its autonomous AI agent performed near state-of-the-art on web browsing and tool-use tasks, and strongly on domain knowledge and reasoning.
Conclusion: Hybrid AI-human systems like Tendem offer a practical and effective path toward robust, scalable, and high-quality intelligent automation, balancing performance, speed, and cost.
Abstract: Tendem is a hybrid system where AI handles structured, repeatable work and Human Experts step in when the models fail or to verify results. Each result undergoes a comprehensive quality review before delivery to the Client. To assess Tendem's performance, we conducted a series of in-house evaluations on 94 real-world tasks, comparing it with AI-only agents and human-only workflows carried out by Upwork freelancers. The results show that Tendem consistently delivers higher-quality outputs with faster turnaround times. At the same time, its operational costs remain comparable to human-only execution. On third-party agentic benchmarks, Tendem's AI Agent (operating autonomously, without human involvement) performs near state-of-the-art on web browsing and tool-use tasks while demonstrating strong results in frontier domain knowledge and reasoning.
[69] Long-range Modeling and Processing of Multimodal Event Sequences
Jichu Li,Yilun Zhong,Zhiting Li,Feng Zhou,Quyu Kong
Main category: cs.CL
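TL;DR: This paper extends LLM-based temporal point process (TPP) frameworks to support the visual modality and generate high-quality textual analyses; it addresses long-context modeling with an adaptive sequence compression mechanism based on temporal similarity and uses a two-stage training paradigm, surpassing existing methods in both predictive accuracy and text generation quality.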
Details
Motivation: When handling multimodal data (especially joint image-text modeling), existing TPP methods are limited by the degradation of attention mechanisms on long sequences and struggle to generate coherent text that depends on long-range context.
Method: Proposes a new LLM-based TPP framework that adds the visual modality and treats text generation as a core task; designs an adaptive sequence compression mechanism based on temporal similarity to shorten inputs; uses two-stage training, first pre-training on compressed sequences and then supervised fine-tuning for downstream tasks.
Result: Significantly outperforms existing state-of-the-art methods on benchmarks including DanmakuTPP-QA, improving both event time/type prediction accuracy and the quality of generated text.
Conclusion: The framework eases the long-context bottleneck in multimodal TPPs, demonstrates the feasibility and benefit of integrating text generation into TPP modeling, and offers a new paradigm for asynchronous multimodal event modeling.
Abstract: Temporal point processes (TPPs) have emerged as powerful tools for modeling asynchronous event sequences. While recent advances have extended TPPs to handle textual information, existing approaches are limited in their ability to generate rich, multimodal content and reason about event dynamics. A key challenge is that incorporating multimodal data dramatically increases sequence length, hindering the ability of attention-based models to generate coherent, long-form textual descriptions that require long-range understanding. In this paper, we propose a novel framework that extends LLM-based TPPs to the visual modality, positioning text generation as a core capability alongside time and type prediction. Our approach addresses the long-context problem through an adaptive sequence compression mechanism based on temporal similarity, which reduces sequence length while preserving essential patterns. We employ a two-stage paradigm of pre-training on compressed sequences followed by supervised fine-tuning for downstream tasks. Extensive experiments, including on the challenging DanmakuTPP-QA benchmark, demonstrate that our method outperforms state-of-the-art baselines in both predictive accuracy and the quality of its generated textual analyses.
[70] Don't Judge a Book by its Cover: Testing LLMs' Robustness Under Logical Obfuscation
Abhilekh Borah,Shubhra Ghosh,Kedar Joshi,Aditya Kumar Guru,Kripabandhu Ghosh
Main category: cs.CL
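TL;DR: This paper proposes the Logifus logical obfuscation framework and the LogiQAte diagnostic benchmark, showing that large language models degrade sharply on problems that are logically equivalent but obfuscated in form, which indicates that current models lack deep semantic understanding.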
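To illustrate what "logically equivalent but obfuscated" means, the sketch below checks by truth table that a rewritten formula has the same meaning as the original. It works at the propositional level only, and the example formulas are assumptions chosen for illustration; they are not drawn from Logifus or LogiQAte, whose first-order rewrites are more involved.

```python
from itertools import product

def equivalent(f, g, n_vars):
    """True if f and g agree on every assignment of n_vars booleans."""
    return all(f(*vals) == g(*vals)
               for vals in product([False, True], repeat=n_vars))

# "p implies q" in its usual form.
original = lambda p, q: (not p) or q

# A less direct surface form: "it is not the case that p holds and q fails".
obfuscated = lambda p, q: not (p and not q)

# The converse "q implies p" looks similar but means something different.
converse = lambda p, q: (not q) or p

same_meaning = equivalent(original, obfuscated, 2)
changed_meaning = equivalent(original, converse, 2)
```

A structure-preserving obfuscation should pass a check of this kind, so any drop in model accuracy on the rewritten question reflects sensitivity to surface form rather than a change in the underlying problem.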
Details
Motivation: Large language models handle logical reasoning tasks well in standard form but often fail when the same problems are posed in logically equivalent yet obfuscated forms; this vulnerability needs systematic study.
Method: Proposes Logifus, a structure-preserving logical obfuscation framework, and builds LogiQAte, a first-of-its-kind diagnostic benchmark with four obfuscated reasoning tasks, then evaluates the zero-shot performance of six state-of-the-art models.
Result: Obfuscation causes large drops in zero-shot performance: an average of 47% for GPT-4o, 27% for GPT-5, and 22% for o4-mini.
Conclusion: Current LLMs parse only the surface of a question and lack the ability to understand and preserve its underlying meaning; models that genuinely comprehend meaning are urgently needed.
Abstract: Tasks such as solving arithmetic equations, evaluating truth tables, and completing syllogisms are handled well by large language models (LLMs) in their standard form, but they often fail when the same problems are posed in logically equivalent yet obfuscated formats. To study this vulnerability, we introduce Logifus, a structure-preserving logical obfuscation framework, and, utilizing this, we present LogiQAte, a first-of-its-kind diagnostic benchmark with 1,108 questions across four reasoning tasks: (i) Obfus FOL (first-order logic entailment under equivalence-preserving rewrites), (ii) Obfus Blood Relation (family-graph entailment under indirect relational chains), (iii) Obfus Number Series (pattern induction under symbolic substitutions), and (iv) Obfus Direction Sense (navigation reasoning under altered directions and reference frames). Across all the tasks, evaluating six state-of-the-art models, we find that obfuscation severely degrades zero-shot performance, with performance dropping on average by 47% for GPT-4o, 27% for GPT-5, and 22% for the reasoning model o4-mini. Our findings reveal that current LLMs parse questions without deep understanding, highlighting the urgency of building models that genuinely comprehend and preserve meaning beyond surface form.
[71] Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models
Reem I. Masoud,Chen Feng,Shunta Asano,Saied Alshahrani,Philip Colin Treleaven,Miguel R. D. Rodrigues
Main category: cs.CL
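TL;DR: This paper takes a dataset-centric view of the linguistic properties of fine-tuning data used for cultural alignment of large language models (LLMs). It computes lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets and applies principal component analysis (PCA), identifying interpretable dimensions such as semantic coherence, surface-level lexical/syntactic diversity, and lexical/structural richness. Experiments show these dimensions correlate with downstream cultural performance but are highly model-dependent: the lexically oriented PC3 is the most robust, while emphasizing semantic or diversity extremes (PC1-PC2) is often ineffective or even harmful.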
Details
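Motivation: Global deployment of LLMs raises concerns about cultural misalignment, yet the linguistic properties of fine-tuning datasets used for cultural adaptation remain unclear; a dataset-centric study is needed of which linguistic properties affect cultural performance, whether they are predictive before training, and how the effects vary across models. Method: Computes lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese fine-tuning datasets and runs PCA independently within each language to extract interpretable axes of variation; then fine-tunes models from three families (LLaMA, Mistral, DeepSeek) and evaluates them on benchmarks of cultural knowledge, values, and norms; finally validates the effect of each principal component through controlled subset interventions. Result: PCA components correlate with downstream cultural performance, but the association is strongly model-dependent; the lexically oriented PC3 is the most robust across models and benchmarks, and increasing its weight yields more consistent performance, whereas interventions emphasizing semantic coherence (PC1) or surface diversity (PC2) are often neutral or even harmful. Conclusion: Cultural alignment depends not only on data content but also on the linguistic structure of the fine-tuning data; lexical-level properties (such as word-frequency distribution and terminology coverage) generalize across models better than semantic coherence or surface diversity, suggesting that future cultural adaptation should prioritize lexical richness and representativeness. Abstract: The global deployment of large language models (LLMs) has raised concerns about cultural misalignment, yet the linguistic properties of fine-tuning datasets used for cultural adaptation remain poorly understood. We adopt a dataset-centric view of cultural alignment and ask which linguistic properties of fine-tuning data are associated with cultural performance, whether these properties are predictive prior to training, and how these effects vary across models. We compute lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets and apply principal component analysis separately within each language. This design ensures that the resulting components capture variation among datasets written in the same language rather than differences between languages. The resulting components correspond to broadly interpretable axes related to semantic coherence, surface-level lexical and syntactic diversity, and lexical or structural richness, though their composition varies across languages. We fine-tune three major LLM families (LLaMA, Mistral, DeepSeek) and evaluate them on benchmarks of cultural knowledge, values, and norms. While PCA components correlate with downstream performance, these associations are strongly model-dependent.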
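The within-language PCA step described above can be illustrated with a minimal sketch, assuming NumPy. The function name `pca_within_language`, the choice to z-score each metric, and the random toy matrix are hypothetical and are not taken from the paper; the sketch only shows what running PCA on a datasets-by-metrics matrix for a single language might look like.

```python
import numpy as np

def pca_within_language(metrics, n_components=3):
    """PCA on a (datasets x metrics) matrix for ONE language.

    Columns are z-scored first so that metrics on different scales are
    comparable (an assumption of this sketch).
    Returns (scores, components, explained_variance_ratio).
    """
    X = np.asarray(metrics, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    Z = (X - X.mean(axis=0)) / std
    # SVD of the standardized matrix: rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    var = S ** 2
    total = var.sum()
    ratio = var / total if total > 0 else np.zeros_like(var)
    k = min(n_components, Vt.shape[0])
    scores = Z @ Vt[:k].T  # each dataset's coordinates on PC1..PCk
    return scores, Vt[:k], ratio[:k]

# Hypothetical toy data: 6 datasets x 4 metrics for a single language.
rng = np.random.default_rng(0)
toy = rng.normal(size=(6, 4))
scores, components, ratio = pca_within_language(toy)
```

Running the function once per language, rather than on a pooled matrix, is what keeps the components from capturing between-language differences.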
Through controlled subset interventions, we show that lexical-oriented components (PC3) are the most robust, yielding more consistent performance across models and benchmarks, whereas emphasizing semantic or diversity extremes (PC1-PC2) is often neutral or harmful.
[72] Typologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages
Nipuna Abeykoon,Ashen Weerathunga,Pubudu Wijesinghe,Parameswari Krishnamurthy
Main category: cs.CL
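TL;DR: This paper proposes a typology-based framework for improving translation quality that requires no parallel corpora or model retraining; using a Universal Metalinguistic Framework (UMF) and a Computational Engine, it performs linguistic disambiguation during generation and typological compliance scoring during selection, significantly improving translation into low-resource languages.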
Details
Motivation: LLMs trained predominantly on high-resource languages develop a systematic preference for dominant typological patterns, producing structural non-conformance when translating into typologically distant low-resource languages. Method: Builds two core components: (1) the Universal Metalinguistic Framework (UMF), which represents each language as a structured profile across 16 typological dimensions with divergence-weighted scoring; and (2) the Computational Engine, which performs linguistic disambiguation during generation and typological compliance scoring when selecting among candidate outputs. Result: Evaluation on nine language pairs shows that intervention rates correlate strongly with typological distance from English; in experiments on 341 English sentences covering different morphosyntactic phenomena, intervention precision is 48.16% for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages; the framework needs no parallel training data and works with any LLM that can produce multiple candidate outputs. Conclusion: The framework effectively mitigates typological bias in LLM translation into low-resource languages and is practical to deploy, especially in resource-scarce settings. Abstract: Large language models trained predominantly on high-resource languages exhibit systematic biases toward dominant typological patterns, leading to structural non-conformance when translating into typologically divergent low-resource languages. We present a framework that leverages linguistic typology to improve translation quality without parallel training data or model retraining. The framework consists of two components: the Universal Metalinguistic Framework (UMF), which represents languages as structured profiles across 16 typological dimensions with divergence-weighted scoring, and the Computational Engine, which operates through linguistic disambiguation during generation and typological compliance scoring during selection. Evaluation across nine language pairs demonstrates intervention rates strongly correlating with typological distance from English. In experiments on 341 English sentences each having different morphological and syntactic phenomena, the framework shows an intervention precision of 48.16% for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages. The framework requires no parallel training data and operates with any LLM capable of producing multiple candidate outputs, enabling practical deployment for under-resourced languages.
[73] PedagoSense: A Pedology Grounded LLM System for Pedagogical Strategy Detection and Contextual Response Generation in Learning Dialogues
Shahem Sultan,Shahem Fadi,Yousef Melhim,Ibrahim Alsarraj,Besher Hassan
Main category: cs.CL
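TL;DR: This paper presents PedagoSense, a system that combines a two-stage strategy classifier with LLM generation to detect and recommend effective pedagogical strategies in dialogue-based learning, with the aim of improving the quality of tutor-student interaction.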
Details
Motivation: To improve the quality of tutor-student interaction in dialogue-based learning and make educational technology more adaptive by automatically detecting and recommending effective pedagogical strategies. Method: Proposes PedagoSense: the first stage uses a binary classifier to detect whether a pedagogical strategy is present; the second stage performs fine-grained classification to identify the specific strategy; in parallel, the system recommends a suitable strategy from the dialogue context and uses an LLM to generate a response consistent with it. Result: Evaluation on human-annotated tutor-student dialogues shows strong strategy detection performance and consistent gains from data augmentation, although some fine-grained classes remain challenging. Conclusion: PedagoSense combines pedagogical theory with LLM response generation, offering a new path toward more adaptive educational technology. Abstract: This paper addresses the challenge of improving interaction quality in dialogue-based learning by detecting and recommending effective pedagogical strategies in tutor-student conversations. We introduce PedagoSense, a pedology-grounded system that combines a two-stage strategy classifier with large language model generation. The system first detects whether a pedagogical strategy is present using a binary classifier, then performs fine-grained classification to identify the specific strategy. In parallel, it recommends an appropriate strategy from the dialogue context and uses an LLM to generate a response aligned with that strategy. We evaluate on human-annotated tutor-student dialogues, augmented with additional non-pedagogical conversations for the binary task. Results show high performance for pedagogical strategy detection and consistent gains when using data augmentation, while analysis highlights where fine-grained classes remain challenging. Overall, PedagoSense bridges pedagogical theory and practical LLM-based response generation for more adaptive educational technologies.
[74] EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech
Besher Hassan,Ibrahim Alsarraj,Musaab Hasan,Yousef Melhim,Shahem Fadi,Shahem Sultan
Main category: cs.CL
TL;DR: EmoAra is an end-to-end emotion-preserving pipeline for cross-lingual spoken communication, translating English speech to emotional Arabic speech for banking customer service.
Details
Motivation: Banking customer service requires preservation of emotional context to maintain service quality during cross-lingual spoken communication. Method: EmoAra integrates Speech Emotion Recognition (CNN-based), ASR (Whisper), MT (fine-tuned MarianMT), and TTS (MMS-TTS-Ara) to convert English speech into emotionally faithful Arabic speech. Result: Achieves 94% F1-score in emotion classification, BLEU 56 and BERTScore F1 88.7% in translation, and 81% average human evaluation score on banking-domain translations. Conclusion: EmoAra effectively preserves emotional nuance across speech-to-speech translation, demonstrating strong performance in a real-world domain with publicly available implementation. Abstract: This work presents EmoAra, an end-to-end emotion-preserving pipeline for cross-lingual spoken communication, motivated by banking customer service where emotional context affects service quality. EmoAra integrates Speech Emotion Recognition, Automatic Speech Recognition, Machine Translation, and Text-to-Speech to process English speech and deliver an Arabic spoken output while retaining emotional nuance. The system uses a CNN-based emotion classifier, Whisper for English transcription, a fine-tuned MarianMT model for English-to-Arabic translation, and MMS-TTS-Ara for Arabic speech synthesis. Experiments report an F1-score of 94% for emotion classification, translation performance of BLEU 56 and BERTScore F1 88.7%, and an average human evaluation score of 81% on banking-domain translations. The implementation and resources are available at the accompanying GitHub repository.
[75] Bridging Lexical Ambiguity and Vision: A Mini Review on Visual Word Sense Disambiguation
Shashini Nilukshi,Deshan Sumanathilaka
Main category: cs.CL
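TL;DR: This paper reviews the development of Visual Word Sense Disambiguation (VWSD), from early multimodal fusion to newer methods based on CLIP, diffusion models, and large language models (LLMs), and outlines progress in multilingual support, prompt engineering, and fine-tuning along with remaining challenges.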
Details
Motivation: To overcome the text-only limitation of traditional Word Sense Disambiguation (WSD) by using visual cues to resolve lexical ambiguity in vision-language tasks, improving semantic understanding especially when text input is minimal. Method: Systematically surveys VWSD research from 2016 to 2025, grouped into three technical lines (feature fusion, graph modeling, and contrastive embedding), with a focus on CLIP fine-tuning, diffusion-based generation, LLM enhancement, prompt engineering, and multilingual adaptation. Result: CLIP fine-tuned and LLM-enhanced VWSD systems clearly outperform zero-shot baselines, with Mean Reciprocal Rank (MRR) gains of up to 6-8%; they remain limited by context modeling, bias toward common senses, scarce multilingual data, and immature evaluation frameworks. Conclusion: Future VWSD should combine CLIP alignment, diffusion generation, and LLM reasoning into a unified, strongly context-aware, multilingual disambiguation framework. Abstract: This paper offers a mini review of Visual Word Sense Disambiguation (VWSD), which is a multimodal extension of traditional Word Sense Disambiguation (WSD). VWSD helps tackle lexical ambiguity in vision-language tasks. While conventional WSD depends only on text and lexical resources, VWSD uses visual cues to find the right meaning of ambiguous words with minimal text input. The review looks at developments from early multimodal fusion methods to new frameworks that use contrastive models like CLIP, diffusion-based text-to-image generation, and large language model (LLM) support. Studies from 2016 to 2025 are examined to show the growth of VWSD through feature-based, graph-based, and contrastive embedding techniques. It focuses on prompt engineering, fine-tuning, and adapting to multiple languages. Quantitative results show that CLIP-based fine-tuned models and LLM-enhanced VWSD systems consistently perform better than zero-shot baselines, achieving gains of up to 6-8% in Mean Reciprocal Rank (MRR). However, challenges still exist, such as limitations in context, model bias toward common meanings, a lack of multilingual datasets, and the need for better evaluation frameworks. The analysis highlights the growing overlap of CLIP alignment, diffusion generation, and LLM reasoning as the future path for strong, context-aware, and multilingual disambiguation systems.
[76] Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
Zizhuo Fu,Wenxuan Zeng,Runsheng Wang,Meng Li
Main category: cs.CL
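TL;DR: This paper shows that the 'sink' phenomenon in attention mechanisms effectively forms a Mixture-of-Experts (MoE) structure, which explains head collapse, and proposes a sink-aware training algorithm with a load balancing loss that mitigates the problem and improves model performance across several attention mechanisms.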
Details
Motivation: Prior work lacks a systematic analysis of how the 'sink' phenomenon relates across Vanilla Attention, Sink Attention, and Gated Attention, and head collapse has not been explained in depth. Method: Through theoretical derivation and empirical analysis, argues that the sink naturally forms an MoE structure within attention layers; then proposes a sink-aware training algorithm for attention layers with an auxiliary load balancing loss to mitigate head collapse. Result: The method achieves effective attention-head load balancing and improves generation performance on Vanilla Attention, Sink Attention, and Gated Attention. Conclusion: Attention mechanisms contain an inherent MoE structure, and this perspective helps in understanding and improving their design and training. Abstract: Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.
[77] ASTER: Agentic Scaling with Tool-integrated Extended Reasoning
Xuqin Zhang,Quan He,Zhenrui Zheng,Zongzhang Zhang,Xu He,Dong Li
Main category: cs.CL
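TL;DR: This paper proposes ASTER, a framework that focuses on interaction-dense cold-start trajectories to address the failure of multi-turn tool use caused by interaction collapse in Tool-Integrated Reasoning (TIR), reaching SOTA performance on mathematical benchmarks.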
Details
Motivation: Reinforcement learning struggles to scale Tool-Integrated Reasoning because of interaction collapse, where the model degenerates into internal reasoning and cannot sustain multi-turn tool calls. Method: Systematically studies how cold-start supervised fine-tuning shapes the tool-use behavioral prior, how the interaction density of cold-start trajectories affects exploration and RL outcomes, and how the RL interaction budget affects learning dynamics and generalization; proposes ASTER, which uses a cold-start strategy centered on interaction-dense trajectories. Result: Only 4K interaction-dense expert trajectories are enough to build a strong behavioral prior and markedly improve subsequent RL training; ASTER-4B reaches 90.0% on AIME 2025, surpassing frontier open-source models such as DeepSeek-V3.2-Exp. Conclusion: Interaction density is a key indicator of cold-start quality; ASTER shows that a small, high-quality cold-start set can mitigate interaction collapse and help scale tool-integrated reasoning in LLMs. Abstract: Reinforcement learning (RL) has emerged as a dominant paradigm for eliciting long-horizon reasoning in Large Language Models (LLMs). However, scaling Tool-Integrated Reasoning (TIR) via RL remains challenging due to interaction collapse: a pathological state where models fail to sustain multi-turn tool usage, instead degenerating into heavy internal reasoning with only trivial, post-hoc code verification. We systematically study three questions: (i) how cold-start SFT induces an agentic, tool-using behavioral prior, (ii) how the interaction density of cold-start trajectories shapes exploration and downstream RL outcomes, and (iii) how the RL interaction budget affects learning dynamics and generalization under varying inference-time budgets. We then introduce ASTER (Agentic Scaling with Tool-integrated Extended Reasoning), a framework that circumvents this collapse through a targeted cold-start strategy prioritizing interaction-dense trajectories. We find that a small expert cold-start set of just 4K interaction-dense trajectories yields the strongest downstream performance, establishing a robust prior that enables superior exploration during extended RL training. Extensive evaluations demonstrate that ASTER-4B achieves state-of-the-art results on competitive mathematical benchmarks, reaching 90.0% on AIME 2025, surpassing leading frontier open-source models, including DeepSeek-V3.2-Exp.
[78] Chronos: Learning Temporal Dynamics of Reasoning Chains for Test-Time Scaling
Kai Zhang,Jiayi Liao,Chengpeng Li,Ziyuan Xie,Sihang Li,Xiang Wang
Main category: cs.CL
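TL;DR: This paper proposes Chronos, a lightweight, plug-and-play chronological reasoning scorer that models reasoning trajectories as time series, learns features of token probabilities, and uses weighted voting to improve LLM reasoning, significantly outperforming existing methods on multiple benchmarks.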
Details
Motivation: Existing test-time scaling methods (such as majority voting and heuristic token-level scoring) treat all reasoning traces or tokens equally and are therefore sensitive to variation in trajectory quality and to localized logical errors. Method: Chronos models each reasoning trajectory as a time series, learns trajectory features of token probabilities, assigns quality scores accordingly, and applies weighted voting. Result: Achieves significant gains on both in-domain and out-of-domain benchmarks; with Qwen3-4B-Thinking-2507 on HMMT25, Chronos@128 achieves relative improvements of 34.21% over Pass@1 and 22.70% over Maj@128. Conclusion: Chronos is an efficient, low-overhead, and generalizable method for improving reasoning, confirming the value of modeling the temporal dynamics of trajectories. Abstract: Test-Time Scaling (TTS) has emerged as an effective paradigm for improving the reasoning performance of large language models (LLMs). However, existing methods -- most notably majority voting and heuristic token-level scoring -- treat reasoning traces or tokens equally, thereby being susceptible to substantial variations in trajectory quality and localized logical failures. In this work, we introduce Chronos, a lightweight and plug-and-play chronological reasoning scorer that models each trajectory as a time series. Specifically, Chronos learns to capture trajectory features of token probabilities, assigns quality scores accordingly, and employs a weighted voting mechanism. Extensive evaluations on both in-domain and out-of-domain benchmarks demonstrate that Chronos consistently delivers substantial gains across a variety of models, with negligible computational overhead. Notably, Chronos@128 achieves relative improvements of 34.21% over Pass@1 and 22.70% over Maj@128 on HMMT25 using Qwen3-4B-Thinking-2507, highlighting its effectiveness.
[79] Supervised Fine-Tuning Needs to Unlock the Potential of Token Priority
Zhanming Shen,Zeyu Qin,Jiaqi Hu,Wentao Ye,Hao Chen,Xiaomeng Hu,Haokai Xu,Gang Chen,Yi R. Fung,Haobo Wang
Main category: cs.CL
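TL;DR: This paper proposes the concept of Token Priority as the key bridge between fitting empirical data and achieving true human utility, reframes Supervised Fine-Tuning (SFT) as a precise distribution reshaping process, and uses this view to classify and analyze existing methods in a unified way.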
Details
Motivation: To address the granularity mismatch between fine-grained autoregressive generation and coarse or uniform supervision signals, and thereby improve alignment with true human utility. Method: Proposes the Token Priority framework, formalizing SFT as a distribution reshaping process that aligns data with the ideal alignment manifold, and groups existing methods into Positive Priority (for noise filtration) and Signed Priority (for unlearning toxic modes). Result: Provides a unified lens for analyzing recent breakthroughs, revisits progress and limitations, identifies key challenges, and points to future research directions. Conclusion: Token Priority is the core mechanism for closing the granularity gap and achieving genuine human alignment, and should become a new foundation for SFT theory and practice. Abstract: The transition from fitting empirical data to achieving true human utility is fundamentally constrained by a granularity mismatch, where fine-grained autoregressive generation is often supervised by coarse or uniform signals. This position paper advocates Token Priority as the essential bridge, formalizing Supervised Fine-Tuning (SFT) not as simple optimization but as a precise distribution reshaping process that aligns raw data with the ideal alignment manifold. We analyze recent breakthroughs through this unified lens, categorizing them into two distinct regimes: Positive Priority for noise filtration and Signed Priority for toxic modes unlearning. We revisit existing progress and limitations, identify key challenges, and suggest directions for future research.
[80] Inferential Question Answering
Jamshid Mozafari,Hamed Zamani,Guido Zuccon,Adam Jatowt
Main category: cs.CL
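TL;DR: This paper introduces Inferential QA, a new task that requires inferring answers from supporting passages that provide only clues, and builds the QUIT dataset; experiments show that existing QA methods (retrievers, rerankers, and various LLMs) perform poorly on the task, revealing that current QA systems are not yet capable of reliable inference.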
Details
Motivation: Most existing QA systems assume answers can be directly extracted or generated from documents and overlook questions whose implicit answers must be inferred, so inferential QA needs dedicated study. Method: Defines the Inferential QA task and builds the QUIT dataset of 7,401 questions and 2.4M passages, with passages derived from high-convergence human- and machine-authored hints and labeled at three relevance levels through LLM-based answerability judgments and human verification; comprehensively evaluates retrievers, rerankers, and LLM readers. Result: Traditional QA methods do poorly on inferential QA: retrievers underperform, rerankers give limited gains, and fine-tuning is inconsistent; even reasoning-oriented LLMs fail to outperform smaller general-purpose models. Conclusion: Current QA pipelines cannot yet support understanding and reasoning from indirect textual evidence; Inferential QA opens a new reasoning-oriented direction for QA. Abstract: Despite extensive research on a wide range of question answering (QA) systems, most existing work focuses on answer containment, i.e., assuming that answers can be directly extracted and/or generated from documents in the corpus. However, some questions require inference, i.e., deriving answers that are not explicitly stated but can be inferred from the available information. We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct the QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages built from high-convergence human- and machine-authored hints, labeled across three relevance levels using LLM-based answerability and human verification. Through comprehensive evaluation of retrievers, rerankers, and LLM-based readers, we show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements. Even reasoning-oriented LLMs fail to outperform smaller general-purpose models. These findings reveal that current QA pipelines are not yet ready for inference-based reasoning. Inferential QA thus establishes a new class of QA tasks that move towards understanding and reasoning from indirect textual evidence.
[81] Minimizing Mismatch Risk: A Prototype-Based Routing Framework for Zero-shot LLM-generated Text Detection
Ke Sun,Guangsheng Bao,Han Cui,Yue Zhang
Main category: cs.CL
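TL;DR: This paper proposes DetectRouter, a framework that learns the affinity between texts and detectors through two-stage training, recasting zero-shot detection as a routing problem and significantly improving the robustness of LLM-generated text detection.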
Details
Motivation: Existing zero-shot detection methods use a fixed surrogate model and ignore its alignment with the unknown generation source, which makes performance unstable. Method: Proposes DetectRouter, a prototype-based framework: the first stage builds discriminative prototypes from white-box models; the second stage generalizes to black-box sources by aligning geometric distances with detection scores. Result: On the EvoBench and MAGE benchmarks, DetectRouter yields consistent improvements across multiple detection criteria and model families. Conclusion: Surrogate-source alignment is crucial, and modeling detection as a routing problem improves the robustness and generalization of zero-shot detection. Abstract: Zero-shot methods detect LLM-generated text by computing statistical signatures using a surrogate model. Existing approaches typically employ a fixed surrogate for all inputs regardless of the unknown source. We systematically examine this design and find that detection performance varies substantially depending on surrogate-source alignment. We observe that while no single surrogate achieves optimal performance universally, a well-matched surrogate typically exists within a diverse pool for any given input. This finding transforms robust detection into a routing problem: selecting the most appropriate surrogate for each input. We propose DetectRouter, a prototype-based framework that learns text-detector affinity through two-stage training. The first stage constructs discriminative prototypes from white-box models; the second generalizes to black-box sources by aligning geometric distances with observed detection scores. Experiments on EvoBench and MAGE benchmarks demonstrate consistent improvements across multiple detection criteria and model families.
[82] Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments
Siwei Wu,Yizhi Li,Yuyang Song,Wei Zhang,Yang Wang,Riza Batista-Navarro,Xian Yang,Mingjie Tang,Bryan Dai,Jian Yang,Chenghua Lin
Main category: cs.CL
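TL;DR: This paper proposes TerminalTraj, a scalable pipeline for building high-quality terminal trajectory datasets that addresses the executability and verifiability challenges of terminal task data; with it, the authors build 32K Docker images and 50,733 verified trajectories across 8 domains, significantly improving terminal-task model performance on TerminalBench.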
Details
Motivation: Training agentic models for terminal tasks depends on high-quality, long-horizon, cross-domain terminal trajectories, but building such data at scale faces two difficulties: each instance needs a suitable Docker environment (Executability), and heterogeneous task outputs make unified verification hard (Verifiability). Method: Proposes the TerminalTraj pipeline in three steps: (i) filter high-quality code repositories and build Dockerized execution environments; (ii) generate task instances aligned with those Docker environments; (iii) synthesize agent trajectories with executable validation code. Result: Builds 32K Docker images and 50,733 verified terminal trajectories covering 8 domains; models fine-tuned from Qwen2.5-Coder gain up to 20% on TB 1.0 and 10% on TB 2.0 on TerminalBench; TerminalTraj-32B performs strongly among models under 100B parameters (35.30% on TB 1.0 and 22.00% on TB 2.0) and shows better test-time scaling. Conclusion: TerminalTraj eases the executability and verifiability bottlenecks in building terminal task data, provides a high-quality, reproducible, and scalable data foundation for training terminal agents, and demonstrates a substantive performance gain. Abstract: Training agentic models for terminal-based tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: Executability, since each instance requires a suitable and often distinct Docker environment; and Verifiability, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose TerminalTraj, a scalable pipeline that (i) filters high-quality repositories to construct Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5-Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20% on TB 1.0 and 10% on TB 2.0 over their respective backbones. Notably, TerminalTraj-32B achieves strong performance among models with fewer than 100B parameters, reaching 35.30% on TB 1.0 and 22.00% on TB 2.0, and demonstrates improved test-time scaling behavior.
All code and data are available at https://github.com/Wusiwei0410/TerminalTraj.
[83] PARSE: An Open-Domain Reasoning Question Answering Benchmark for Persian
Jamshid Mozafari,Seyed Parsa Mousavinasab,Adam Jatowt
Main category: cs.CL
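TL;DR: This paper introduces PARSE, the first open-domain reasoning QA benchmark for Persian, containing 10,800 questions across multiple formats and reasoning types, built through LLM generation and human validation; experiments show that Persian prompts, structured prompting (such as CoT and few-shot), and fine-tuning significantly improve model performance.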
Details
Motivation: Persian, as a low-resource language, lacks a high-quality open-domain QA benchmark for evaluating reasoning, which holds back related research. Method: Builds the PARSE benchmark with a controlled LLM-based generation pipeline, supported by multi-stage filtering, human annotation, and consistency checks; systematically evaluates multilingual and Persian LLMs under various prompting strategies and fine-tuning settings. Result: Persian prompts and structured prompting (CoT for Boolean/multiple-choice, few-shot for factoid) improve performance; fine-tuning helps further, especially for Persian-specialized models. Conclusion: PARSE fills the gap in Persian reasoning QA benchmarks and provides a solid foundation for developing and fairly evaluating reasoning-capable LLMs in low-resource languages. Abstract: Reasoning-focused Question Answering (QA) has advanced rapidly with Large Language Models (LLMs), yet high-quality benchmarks for low-resource languages remain scarce. Persian, spoken by roughly 130 million people, lacks a comprehensive open-domain resource for evaluating reasoning-capable QA systems. We introduce PARSE, the first open-domain Persian reasoning QA benchmark, containing 10,800 questions across Boolean, multiple-choice, and factoid formats, with diverse reasoning types, difficulty levels, and answer structures. The benchmark is built via a controlled LLM-based generation pipeline and validated through human evaluation. We also ensure linguistic and factual quality through multi-stage filtering, annotation, and consistency checks. We benchmark multilingual and Persian LLMs under multiple prompting strategies and show that Persian prompts and structured prompting (CoT for Boolean/multiple-choice; few-shot for factoid) improve performance. Fine-tuning further boosts results, especially for Persian-specialized models. These findings highlight how PARSE supports both fair comparison and practical model adaptation. PARSE fills a critical gap in Persian QA research and provides a strong foundation for developing and evaluating reasoning-capable LLMs in low-resource settings.
[84] PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length
Situo Zhang,Yifan Zhang,Zichen Zhu,Hankun Wang,Da Ma,Danyang Zhang,Lu Chen,Kai Yu
Main category: cs.CL
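TL;DR: This paper proposes Pacer, a speculative decoding method that dynamically controls draft length with a lightweight, trainable pre-verification layer, significantly speeding up LLM inference.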
Details
Motivation: Existing speculative decoding uses a fixed draft length, but experiments show that the optimal draft length varies considerably across decoding steps, so a fixed length limits the achievable speedup. Method: Proposes Pacer, which adds a lightweight, trainable pre-verification layer that pre-verifies draft tokens block by block; if pre-verification fails, the draft model stops generating early, giving dynamic control of draft length. Result: Across multiple model pairs and benchmarks, Pacer achieves up to 2.66x speedup over autoregressive decoding and consistently outperforms standard speculative decoding; combined with Ouroboros it reaches up to 3.09x. Conclusion: Dynamically controlling draft length unlocks more of speculative decoding's speedup potential, and Pacer offers a new approach to efficient LLM inference. Abstract: Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs) without sacrificing accuracy. Typically, SD employs a small draft model to generate a fixed number of draft tokens, which are then verified in parallel by the target model. However, our experiments reveal that the optimal draft length varies significantly across different decoding steps. This variation suggests that using a fixed draft length limits the potential for further improvements in decoding speed. To address this challenge, we propose Pacer, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer. This layer pre-verifies draft tokens blockwise before they are sent to the target model, allowing the draft model to stop token generation if the blockwise pre-verification fails. We implement Pacer on multiple SD model pairs and evaluate its performance across various benchmarks. Our results demonstrate that Pacer achieves up to 2.66x speedup over autoregressive decoding and consistently outperforms standard speculative decoding. Furthermore, when integrated with Ouroboros, Pacer attains up to 3.09x speedup.
[85] EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models
Chuanrui Hu,Tong Li,Xingze Gao,Hongda Chen,Dannong Xu,Yi Bai,Tianwei Lin,Xinda Zhao,Xiaohong Li,Jiaqi An,Yunyun Han,Jian Pei,Yafeng Deng
Main category: cs.CL
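TL;DR: This paper presents EverMemBench, a new benchmark for evaluating long-term conversational memory, featuring very long multi-role, multi-topic, temporally evolving conversations, and reveals key weaknesses of current LLM memory systems in multi-hop reasoning, temporal understanding, and memory awareness.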
Details
Motivation: Existing conversational memory benchmarks are limited to single-topic, two-party dialogues and cannot reflect real-world complexity (multiple roles, cross-topic interleaving, temporal evolution), so a more challenging evaluation standard is needed. Method: Builds the EverMemBench benchmark: a multi-group, multi-role, temporally evolving conversation dataset spanning over a million tokens; designs 1,000+ QA pairs that evaluate memory systems along three dimensions: fine-grained recall, memory awareness, and user profile understanding. Result: Identifies three bottlenecks: (1) multi-hop reasoning in multi-party conversations drops to 26%; (2) temporal reasoning requires modeling version semantics, and timestamp matching alone is not enough; (3) similarity-based retrieval fails to bridge the semantic gap between queries and implicitly relevant memories. Conclusion: EverMemBench offers a more realistic and challenging testbed for next-generation conversational memory architectures and points to key directions for improvement. Abstract: Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26%; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.
[86] DreamOn: Diffusion Language Models For Code Infilling Beyond Fixed-size Canvas
Zirui Wu,Lin Zheng,Zhihui Xie,Jiacheng Ye,Jiahui Gao,Shansan Gong,Yansong Feng,Zhenguo Li,Wei Bi,Guorui Zhou,Lingpeng Kong
Main category: cs.CL
TL;DR: DreamOn is a novel diffusion framework that enables dynamic, variable-length generation for Diffusion Language Models (DLMs), overcoming the fixed-length mask limitation and achieving performance on par with state-of-the-art autoregressive models.
Details
Motivation: The fixed-length masked sequence requirement in Diffusion Language Models (DLMs) severely degrades code infilling performance when the mask size mismatches the ideal completion length, limiting their practical utility. Method: DreamOn introduces two length control states into the diffusion process, allowing the model to autonomously expand or contract output length based on its own predictions, integrated into existing DLMs with minimal changes to training objectives and no architectural modifications. Result: DreamOn achieves infilling performance on par with state-of-the-art autoregressive models on HumanEval-Infilling and SantaCoder-FIM benchmarks, and matches oracle performance achieved with ground-truth length. Conclusion: DreamOn removes a fundamental barrier to the practical deployment of DLMs, significantly advancing their flexibility and applicability for variable-length generation. Abstract: Diffusion Language Models (DLMs) present a compelling alternative to autoregressive models, offering flexible, any-order infilling without specialized prompting design. However, their practical utility is blocked by a critical limitation: the requirement of a fixed-length masked sequence for generation. This constraint severely degrades code infilling performance when the predefined mask size mismatches the ideal completion length. To address this, we propose DreamOn, a novel diffusion framework that enables dynamic, variable-length generation. DreamOn augments the diffusion process with two length control states, allowing the model to autonomously expand or contract the output length based solely on its own predictions. We integrate this mechanism into existing DLMs with minimal modifications to the training objective and no architectural changes. 
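As a rough illustration of how two length control states could resize a masked span, here is a minimal sketch in plain Python. The abstract does not specify how the two states act, so everything here is an assumption made for illustration: the token names `<mask>`, `<expand>`, and `<delete>`, the function `apply_length_control`, and the rule that one state splits a mask slot into two while the other drops the slot.

```python
# Hypothetical token names; not taken from the DreamOn paper or code.
MASK, EXPAND, DELETE = "<mask>", "<expand>", "<delete>"

def apply_length_control(canvas, predictions):
    """Apply one round of predictions to a masked canvas.

    canvas: list of tokens, some of which are MASK.
    predictions: dict mapping mask positions to predicted tokens.
    Assumed rules: EXPAND turns one mask slot into two (span grows by one),
    DELETE removes the slot (span shrinks by one), and any other token
    simply fills the slot.
    """
    out = []
    for i, tok in enumerate(canvas):
        if tok != MASK or i not in predictions:
            out.append(tok)  # context tokens and untouched masks pass through
            continue
        pred = predictions[i]
        if pred == EXPAND:
            out.extend([MASK, MASK])
        elif pred == DELETE:
            continue
        else:
            out.append(pred)
    return out

# The span grows: two mask slots become one filled token plus two new masks.
grown = apply_length_control(["def", "f", "(", MASK, MASK, ")"], {3: "x", 4: EXPAND})
# The span shrinks: one of two mask slots is dropped.
shrunk = apply_length_control(["a", MASK, MASK, "b"], {1: DELETE, 2: "y"})
```

Because the length decision is made per slot from the model's own predictions, the initial mask size no longer has to match the ideal completion length.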
Built upon Dream-Coder-7B and DiffuCoder-7B, DreamOn achieves infilling performance on par with state-of-the-art autoregressive models on HumanEval-Infilling and SantaCoder-FIM and matches oracle performance achieved with ground-truth length. Our work removes a fundamental barrier to the practical deployment of DLMs, significantly advancing their flexibility and applicability for variable-length generation. Our code is available at https://github.com/DreamLM/DreamOn.
[87] CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering
Yu Liu,Wenxiao Zhang,Cong Cao,Fangfang Yuan,Weizhuo Chen,Cheng Hu,Pin Xu,Yuling Yang,Kun Peng,Diandian Guo,Qiang Sun,Yanbing Liu,Jin B. Hong,Zhiyuan Ma
Main category: cs.CL
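TL;DR: This paper proposes CRAFT, a reinforcement learning framework with a dual reward mechanism that tackles three challenges in multi-hop QA (reasoning collapse, reasoning-answer inconsistency, and loss of format control), improving the reasoning faithfulness and answer accuracy of LLMs in retrieval-augmented generation.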
Details
Motivation: To address three challenges in multi-hop QA (reasoning collapse, reasoning-answer inconsistency, and loss of format control) and improve the reasoning faithfulness and answer accuracy of LLMs in retrieval-augmented generation. Method: Proposes CRAFT, a reinforcement learning framework based on Group Relative Policy Optimization (GRPO), which uses deterministic rewards to ensure structural correctness and judge-based rewards to verify semantic faithfulness, and supports controllable reasoning-trace variants for systematically analyzing how structure and scale affect reasoning performance. Result: Experiments on three multi-hop QA benchmarks show that CRAFT improves answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model matching closed-source LLMs under several reasoning-trace settings. Conclusion: CRAFT effectively mitigates unstable reasoning, unfaithful reasoning, and loss of format control in RAG-based multi-hop QA, offering a new paradigm for trustworthy multi-step reasoning generation. Abstract: Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Recent work mainly focused on improving answer accuracy via fine-tuning and structured or reinforcement-based optimization. However, reliable reasoning in response generation faces three challenges: 1) Reasoning Collapse. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. 2) Reasoning-answer inconsistency. Due to the intrinsic uncertainty of LLM generation and exposure to evidence-distractor mixtures, models may produce correct answers that are not faithfully supported by their intermediate reasoning or evidence. 3) Loss of format control. Traditional chain-of-thought generation often deviates from required structured output formats, leading to incomplete or malformed structured content. To address these challenges, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a Group Relative Policy Optimization (GRPO) based reinforcement learning framework that trains models to perform faithful reasoning during response generation. CRAFT employs dual reward mechanisms to optimize multi-hop reasoning: deterministic rewards ensure structural correctness while judge-based rewards verify semantic faithfulness. This optimization framework supports controllable trace variants that enable systematic analysis of how structure and scale affect reasoning performance and faithfulness.
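To make the dual-reward idea concrete, the following minimal sketch combines a deterministic format check with a judge score and then computes group-relative advantages in the style of GRPO (each reward normalized by the mean and standard deviation of its sampling group). The equal weights, the function names, and the use of the population standard deviation are assumptions for illustration; the abstract does not state how CRAFT combines or normalizes its rewards.

```python
import statistics

def combined_reward(format_ok, judge_score, w_det=0.5, w_judge=0.5):
    """Mix a deterministic structural reward with a judge-based reward.

    format_ok: True if the output passes a rule-based format check.
    judge_score: faithfulness score in [0, 1] from a judge model.
    The weights are hypothetical defaults.
    """
    det = 1.0 if format_ok else 0.0
    return w_det * det + w_judge * judge_score

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Three sampled responses to one question: well-formed and faithful,
# well-formed but unfaithful, and malformed.
rewards = [
    combined_reward(True, 1.0),
    combined_reward(True, 0.0),
    combined_reward(False, 0.0),
]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive positive advantages and are reinforced; a group with identical rewards yields all-zero advantages and therefore no learning signal.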
Experiments on three multi-hop QA benchmarks show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model achieving competitive performance with closed-source LLMs across multiple reasoning trace settings.
[88] Balancing Understanding and Generation in Discrete Diffusion Models
Yue Liu,Yuzhong Zhao,Zheyong Xie,Qixiang Ye,Jianbin Jiao,Yao Hu,Shaosheng Cao,Yunfan Liu
Main category: cs.CL
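TL;DR: This paper proposes XDLM, which unifies Masked Diffusion Language Models (MDLM) and Uniform-noise Diffusion Language Models (UDLM) through a stationary noise kernel, striking a better balance between semantic understanding and generation quality and clearly outperforming both on multiple tasks.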
Details
Motivation: In discrete generative modeling, MDLM excels at semantic understanding and zero-shot generalization while UDLM is strong at few-step generation quality, but neither offers both; a paradigm that balances the two capabilities is needed.
Method: Proposes XDLM, which theoretically unifies MDLM and UDLM via a stationary noise kernel and alleviates the memory bottleneck through an algebraic simplification of the posterior probabilities.
Result: XDLM surpasses UDLM by 5.4 points on zero-shot text benchmarks; reaches an FID of 54.1 in few-step image generation (versus 80.8 for MDLM); achieves 15.0 MBPP in only 32 steps when tuning an 8B LLM, doubling the baseline; and shows training dynamics better suited to long-term scaling.
Conclusion: XDLM bridges MDLM and UDLM and advances the Pareto frontier between understanding capability and generation quality, combining theoretical unification with practical efficiency.
Abstract: In discrete generative modeling, two dominant paradigms demonstrate divergent capabilities: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero-shot generalization, whereas Uniform-noise Diffusion Language Models (UDLM) achieve strong few-step generation quality, yet neither attains balanced performance across both dimensions. To address this, we propose XDLM, which bridges the two paradigms via a stationary noise kernel. XDLM offers two key contributions: (1) it provides a principled theoretical unification of MDLM and UDLM, recovering each paradigm as a special case; and (2) it alleviates the memory bottleneck through an algebraic simplification of the posterior probabilities. Experiments demonstrate that XDLM advances the Pareto frontier between understanding capability and generation quality. Quantitatively, XDLM surpasses UDLM by 5.4 points on zero-shot text benchmarks and outperforms MDLM in few-step image generation (FID 54.1 vs. 80.8). When scaled to tune an 8B-parameter large language model, XDLM achieves 15.0 MBPP in just 32 steps, effectively doubling the baseline performance. Finally, analysis of training dynamics reveals XDLM's superior potential for long-term scaling. Code is available at https://github.com/MzeroMiko/XDLM
[89] Context Dependence and Reliability in Autoregressive Language Models
Poushali Sengupta,Shashi Raj Pandey,Sabita Maharjan,Frank Eliassen
Main category: cs.CL
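TL;DR: This paper proposes RISE, a method for distinguishing essential context elements from correlated, redundant ones in large language models, yielding more stable and reliable explanations.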
Details
Motivation: Large language models (LLMs) often consume large contexts containing redundant information. Standard attribution methods handle redundancy and overlap poorly, so explanations become unstable and sensitive to small input perturbations, which threatens interpretability and safety (for example, under prompt injection).
Method: Proposes RISE (Redundancy-Insensitive Scoring of Explanation), which quantifies the unique influence of each input relative to the others, reducing interference from redundancy and producing stable attributions that account for conditional dependence.
Result: Experiments show that RISE gives more robust explanations than traditional methods, underlining the role of conditional information in trustworthy LLM explanation and monitoring.
Conclusion: RISE mitigates the attribution instability caused by context redundancy and offers a new route to interpretability and safety monitoring of LLMs in high-risk settings.
Abstract: Large language models (LLMs) generate outputs by utilizing extensive context, which often includes redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, as standard explanation methods struggle with redundancy and overlapping context. Minor changes in input can lead to unpredictable shifts in attribution scores, undermining interpretability and raising concerns about risks like prompt injection. This work addresses the challenge of distinguishing essential context elements from correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input relative to others, minimizing the impact of redundancies and providing clearer, stable attributions. Experiments demonstrate that RISE offers more robust explanations than traditional methods, emphasizing the importance of conditional information for trustworthy LLM explanations and monitoring.
[90] On the Power of (Approximate) Reward Models for Inference-Time Scaling
Youheng Zhu,Yiping Lu
Main category: cs.CL
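TL;DR: This paper gives a theoretical analysis of when approximate reward models are effective for inference-time scaling. It identifies the Bellman error as the key quantity and proves that when this error is bounded by $O(1/T)$, SMC reduces the computational complexity of reasoning from exponential to polynomial in $T$.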
Details
Motivation: True reward models are unavailable in real deployments, so systems must rely on approximate ones; a theoretical account is needed of why and when approximate models still support effective inference-time scaling.
Method: Through theoretical analysis, introduces the Bellman error as the key measure of approximate reward model quality and derives its effect on the computational complexity of Sequential Monte Carlo (SMC) inference.
Result: Proves that if the Bellman error of the approximate reward model is bounded by $O(1/T)$, SMC reduces the computational complexity of a length-$T$ reasoning task from exponential to polynomial, an exponential gain in inference efficiency.
Conclusion: The Bellman error is the core theoretical criterion for whether an approximate reward model can support efficient inference-time scaling, and it gives clear guidance for designing and evaluating reward models in practical systems.
Abstract: Inference-time scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, Sequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the reward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length $T$, we show that if the Bellman error of the approximate reward model is bounded by $O(1/T)$, then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in $T$ to polynomial in $T$. This yields an exponential improvement in inference efficiency despite using only approximate rewards.
[91] Rethinking Selective Knowledge Distillation
Almog Tavor,Itay Ebenspanger,Neil Cnaan,Mor Geva
Main category: cs.CL
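TL;DR: This paper revisits knowledge distillation in autoregressive large language models, proposes a position-selection method guided by student-model entropy (SE-KD), and extends it across the position, class, and sample axes (SE-KD 3X), improving accuracy, downstream task adherence, and memory efficiency.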
Details
Motivation: For selective knowledge distillation, it is still unclear which importance signals and selection policies work best over token positions, vocabulary classes, and training samples; systematic analysis and efficient recipes are lacking.
Method: Disentangles selective KD along the position, class, and sample axes and compares importance signals and selection policies; proposes SE-KD, which selects positions by the student model's predictive entropy, and extends it to SE-KD 3X.
Result: SE-KD often improves over dense distillation across benchmarks; SE-KD 3X reduces wall time by 70%, peak memory by 18%, and storage by 80% without sacrificing performance.
Conclusion: Student-model entropy is an effective selection signal, and selective distillation coordinated across axes (SE-KD 3X) substantially improves training and deployment efficiency while preserving performance.
Abstract: Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
[92] From Pragmas to Partners: A Symbiotic Evolution of Agentic High-Level Synthesis
Niansong Zhang,Sunwoo Kim,Shreesha Srinath,Zhiru Zhang
Main category: cs.CL
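TL;DR: This paper examines the continued importance of high-level synthesis (HLS) for hardware design in the era of AI agents. It argues that HLS is a natural abstraction layer for agentic optimization, analyzes the limitations of current HLS tools, and proposes a taxonomy for the co-evolution of HLS and AI agents.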
Details
Motivation: With the rise of large language models and AI agents, the value of HLS has been questioned; the paper argues that HLS keeps an essential role in AI-driven hardware design, particularly in enabling agentic optimization.
Method: A position paper that uses conceptual analysis and systematic review to explain the abstraction advantages of HLS, identify shortcomings of existing tools, and build a taxonomy describing how responsibility shifts between humans and agents.
Result: Three contributions: 1) HLS is established as a practical abstraction layer and golden reference for agentic hardware design; 2) current HLS tools are shown to fall short in performance feedback, interface flexibility, and debuggability; 3) a taxonomy for the symbiotic evolution of agentic HLS traces the path from human-led design to autonomous AI design partners.
Conclusion: HLS remains important in the agentic era and serves as a key intermediate layer for efficient, iterable, and verifiable hardware design; HLS tools and AI agent capabilities should evolve together.
Abstract: The rise of large language models has sparked interest in AI-driven hardware design, raising the question: does high-level synthesis (HLS) still matter in the agentic era? We argue that HLS remains essential. While we expect mature agentic hardware systems to leverage both HLS and RTL, this paper focuses on HLS and its role in enabling agentic optimization. HLS offers faster iteration cycles, portability, and design permutability that make it a natural layer for agentic optimization. This position paper makes three contributions. First, we explain why HLS serves as a practical abstraction layer and a golden reference for agentic hardware design. Second, we identify key limitations of current HLS tools, namely inadequate performance feedback, rigid interfaces, and limited debuggability, that agents are uniquely positioned to address. Third, we propose a taxonomy for the symbiotic evolution of agentic HLS, clarifying how responsibility shifts from human designers to AI agents as systems advance from copilots to autonomous design partners.
[93] SentiFuse: Deep Multi-model Fusion Framework for Robust Sentiment Extraction
Hieu Minh Duong,Rupa Ghosh,Cong Hoan Nguyen,Eugene Levin,Todd Gary,Long Nguyen
Main category: cs.CL
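TL;DR: SentiFuse is a model-agnostic framework that integrates heterogeneous sentiment analysis models through a standardization layer and several fusion strategies (decision-level, feature-level, and adaptive fusion), improving performance on multiple datasets.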
Details
Motivation: Existing sentiment analysis models have complementary strengths, but there is no unified and effective framework for integrating them.
Method: Proposes the SentiFuse framework, consisting of a standardization layer and three fusion strategies: decision-level, feature-level, and adaptive.
Result: On the Crowdflower, GoEmotions, and Sentiment140 datasets, feature-level fusion gives the largest F1 gain (up to 4%), and adaptive fusion improves robustness on hard cases such as negation and mixed emotions.
Conclusion: Systematically exploiting model complementarity improves the accuracy and reliability of sentiment analysis.
Abstract: Sentiment analysis models exhibit complementary strengths, yet existing approaches lack a unified framework for effective integration. We present SentiFuse, a flexible and model-agnostic framework that integrates heterogeneous sentiment models through a standardization layer and multiple fusion strategies. Our approach supports decision-level fusion, feature-level fusion, and adaptive fusion, enabling systematic combination of diverse models. We conduct experiments on three large-scale social-media datasets: Crowdflower, GoEmotions, and Sentiment140. These experiments show that SentiFuse consistently outperforms individual models and naive ensembles. Feature-level fusion achieves the strongest overall effectiveness, yielding up to 4% absolute improvement in F1 score over the best individual model and simple averaging, while adaptive fusion enhances robustness on challenging cases such as negation, mixed emotions, and complex sentiment expressions. These results demonstrate that systematically leveraging model complementarity yields more accurate and reliable sentiment analysis across diverse datasets and text types.
[94] Understanding QA generation: Extracting Parametric and Contextual Knowledge with CQA for Low Resource Bangla Language
Umme Abira Azmary,MD Ikramul Kayes,Swakkhar Shatabda,Farig Yousuf Sadeque
Main category: cs.CL
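TL;DR: This paper introduces BanglaCQA, the first counterfactual question-answering dataset in Bangla, and designs several model pipelines to separate the roles of parametric and contextual knowledge in QA. It finds that Chain-of-Thought (CoT) prompting is especially effective for extracting parametric knowledge in counterfactual scenarios with decoder-only LLMs.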
Details
Motivation: QA models for low-resource languages such as Bangla are limited by scarce annotated data and linguistic complexity, and existing datasets lack the structure needed to analyze whether models rely on parametric or contextual knowledge.
Method: Builds BanglaCQA, the first Bangla counterfactual QA dataset; designs fine-tuned encoder-decoder pipelines and prompting-based pipelines for decoder-only LLMs; evaluates semantic similarity with both LLM-based and human assessment; and analyzes performance across scenarios, with particular attention to Chain-of-Thought (CoT) prompting.
Result: CoT prompting substantially improves the ability of decoder-only LLMs to extract parametric knowledge in counterfactual scenarios; the study reveals differing behavior in how models draw on knowledge sources in a low-resource language and validates the framework for analyzing knowledge mechanisms.
Conclusion: The work provides the first counterfactual benchmark for Bangla QA that supports knowledge-source analysis and opens new directions for counterfactual reasoning in low-resource languages, highlighting the role of prompting strategies (especially CoT) in eliciting parametric knowledge.
Abstract: Question-Answering (QA) models for low-resource languages like Bangla face challenges due to limited annotated data and linguistic complexity. A key issue is determining whether models rely more on pre-encoded (parametric) knowledge or contextual input during answer generation, as existing Bangla QA datasets lack the structure required for such analysis. We introduce BanglaCQA, the first Counterfactual QA dataset in Bangla, by extending a Bangla dataset while integrating counterfactual passages and answerability annotations. In addition, we propose fine-tuned pipelines for encoder-decoder language-specific and multilingual baseline models, and prompting-based pipelines for decoder-only LLMs to disentangle parametric and contextual knowledge in both factual and counterfactual scenarios. Furthermore, we apply LLM-based and human evaluation techniques that measure answer quality based on semantic similarity. We also present a detailed analysis of how models perform across different QA settings in low-resource languages, and show that Chain-of-Thought (CoT) prompting reveals a uniquely effective mechanism for extracting parametric knowledge in counterfactual scenarios, particularly in decoder-only LLMs. Our work not only introduces a novel framework for analyzing knowledge sources in Bangla QA but also uncovers critical findings that open up broader directions for counterfactual reasoning in low-resource language settings.
[95] ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure
Jie Deng,Shining Liang,Jun Li,Hongzhi Li,Yutao Xie
Main category: cs.CL
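TL;DR: This paper finds that large reasoning models spontaneously produce shorter reasoning chains under multi-question prompts (Self-Compression) and, building on this, proposes ConPress, a lightweight self-supervised fine-tuning method that needs no external supervision and cuts reasoning token usage while preserving accuracy.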
Details
Motivation: Large reasoning models (LRMs) incur high inference cost because they generate long chain-of-thought (CoT) traces. The authors observe that when several questions appear together, the model spontaneously shortens its reasoning for each one (Self-Compression), a phenomenon that is reproducible and stable across models and benchmarks.
Method: Proposes ConPress: construct multi-question prompts to induce Self-Compression, sample the outputs, parse and filter them into concise and correct single-question reasoning trajectories, and use these for supervised fine-tuning, without external teachers, manual pruning, or reinforcement learning.
Result: With only 8k fine-tuning examples, reasoning tokens drop by 59% on MATH500 and 33% on AIME25 while accuracy stays competitive.
Conclusion: Self-Compression is an exploitable signal for reasoning optimization, and ConPress shows that contextual pressure alone can drive efficient reasoning-compression fine-tuning without external annotation.
Abstract: Large reasoning models (LRMs) typically solve reasoning-intensive tasks by generating long chain-of-thought (CoT) traces, leading to substantial inference overhead. We identify a reproducible inference-time phenomenon, termed Self-Compression: when multiple independent and answerable questions are presented within a single prompt, the model spontaneously produces shorter reasoning traces for each question. This phenomenon arises from multi-question contextual pressure during generation and consistently manifests across models and benchmarks. Building on this observation, we propose ConPress (Learning from Contextual Pressure), a lightweight self-supervised fine-tuning approach. ConPress constructs multi-question prompts to induce self-compression, samples the resulting model outputs, and parses and filters per-question traces to obtain concise yet correct reasoning trajectories. These trajectories are directly used for supervised fine-tuning, internalizing compressed reasoning behavior in single-question settings without external teachers, manual pruning, or reinforcement learning. With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25, while maintaining competitive accuracy.
[96] Ebisu: Benchmarking Large Language Models in Japanese Finance
Xueqing Peng,Ruoyu Xiang,Fan Zhang,Mingzi Song,Mingyang Jiang,Yan Wang,Lingfei Qian,Taiki Hara,Yuqing Guo,Jimin Huang,Junichi Tsujii,Sophia Ananiadou
Main category: cs.CL
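TL;DR: This paper introduces Ebisu, a benchmark for Japanese financial language understanding with two expert-annotated tasks: JF-ICR (recognizing implicit commitment and refusal in investor Q&A) and JF-TE (hierarchically extracting and ranking nested financial terms from professional disclosures). Evaluation shows that current mainstream LLMs perform poorly, highlighting the particular difficulty of Japanese financial language.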
Details
Motivation: Japanese financial language is agglutinative and head-final, uses mixed writing systems, and depends heavily on context, so existing LLMs struggle with its implicit commitments, indirect expressions, and nested terminology; a linguistically and culturally adapted benchmark is needed.
Method: Builds the Ebisu benchmark with two expert-annotated tasks, JF-ICR (implicit commitment and refusal recognition) and JF-TE (hierarchical extraction and ranking of nested financial terms), and systematically evaluates a range of open-source and proprietary LLMs (general-purpose, Japanese-adapted, and finance-specific).
Result: Even state-of-the-art models perform poorly on both tasks; larger model scale brings only limited gains; language or domain adaptation does not reliably improve performance, leaving substantial gaps.
Conclusion: Ebisu is a focused evaluation tool for NLP research that accounts for both Japanese linguistic properties and financial culture, and it exposes basic limitations of current LLMs on high-context Japanese financial text with implicit meaning.
Abstract: Japanese finance combines agglutinative, head-final linguistic structure, mixed writing systems, and high-context communication norms that rely on indirect expression and implicit commitment, posing a substantial challenge for LLMs. We introduce Ebisu, a benchmark for native Japanese financial language understanding, comprising two linguistically and culturally grounded, expert-annotated tasks: JF-ICR, which evaluates implicit commitment and refusal recognition in investor-facing Q&A, and JF-TE, which assesses hierarchical extraction and ranking of nested financial terminology from professional disclosures. We evaluate a diverse set of open-source and proprietary LLMs spanning general-purpose, Japanese-adapted, and financial models. Results show that even state-of-the-art systems struggle on both tasks. While increased model scale yields limited improvements, language- and domain-specific adaptation does not reliably improve performance, leaving substantial gaps unresolved. Ebisu provides a focused benchmark for advancing linguistically and culturally grounded financial NLP. All datasets and evaluation scripts are publicly released.
[97] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training
Ran Xu,Tianci Liu,Zihan Dong,Tony You,Ilgee Hong,Carl Yang,Linjun Zhang,Tao Zhao,Haoyu Wang
Main category: cs.CL
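TL;DR: This paper proposes Rubric-ARM, a framework that uses reinforcement learning to jointly optimize a rubric generator and a judge model so that the quality of open-ended responses can be assessed more comprehensively.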
Details
Motivation: Standard reward models output only a scalar score, which cannot capture the multidimensional nature of response quality in non-verifiable domains such as creative writing and open-ended instruction following.
Method: Proposes Rubric-ARM, which models rubric generation as a latent action and trains the generator and judge jointly with reinforcement learning from preference feedback; an alternating optimization strategy mitigates the non-stationarity of simultaneous updates, supported by a theoretical analysis of reduced gradient variance.
Result: Achieves state-of-the-art performance among baselines on multiple benchmarks and improves downstream policy alignment in both offline and online reinforcement learning.
Conclusion: By generating multidimensional rubrics dynamically, Rubric-ARM improves the expressiveness and generalization of reward modeling and offers a new paradigm for complex quality assessment.
Abstract: Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.
[98] Argument Rarity-based Originality Assessment for AI-Assisted Writing
Keito Inoshita,Michiaki Omura,Tsukasa Yamanaka,Go Maeda,Kentaro Tsuji
Main category: cs.CL
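TL;DR: This paper proposes the Argument Rarity-based Originality Assessment (AROA) framework, which automatically evaluates the argumentative originality of student essays by measuring their rarity within a reference corpus along four dimensions (structure, claims, evidence, and cognitive depth), with a quality adjustment mechanism that treats quality and originality as independent axes. Experiments find that high-quality texts tend to rely on typical claim patterns, indicating a quality-originality trade-off, and that AI-generated texts match humans in structural complexity but have much lower claim rarity.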
Details
Motivation: Now that large language models (LLMs) can easily generate high-quality text, traditional quality-centered writing assessment is no longer sufficient; since the core goal of education is to foster critical thinking and original perspectives, assessment needs to shift from quality to originality.
Method: Proposes the AROA framework, which defines originality as rarity within a reference corpus, uses density estimation to quantify four complementary dimensions (structural rarity, claim rarity, evidence rarity, and cognitive depth), and adds a quality adjustment mechanism so that quality and originality become two independent evaluation axes.
Result: 1) Quality and claim rarity are strongly negatively correlated, confirming a quality-originality trade-off; 2) AI-generated texts are comparable to human texts in structural complexity but have substantially lower claim rarity, suggesting that LLMs reproduce the form of argumentation well but are limited in content originality.
Conclusion: AROA offers an originality-oriented paradigm for writing assessment, exposes the limits of current LLMs in content innovation, and shows that educational assessment should consider quality and originality as distinct dimensions.
Abstract: As Large Language Models (LLMs) have become capable of effortlessly generating high-quality text, traditional quality-focused writing assessment is losing its significance. If the essential goal of education is to foster critical thinking and original perspectives, assessment must also shift its paradigm from quality to originality. This study proposes Argument Rarity-based Originality Assessment (AROA), a framework for automatically evaluating argumentative originality in student essays. AROA defines originality as rarity within a reference corpus and evaluates it through four complementary components: structural rarity, claim rarity, evidence rarity, and cognitive depth. The framework quantifies the rarity of each component using density estimation and integrates them with a quality adjustment mechanism, thereby treating quality and originality as independent evaluation axes. Experiments using human essays and AI-generated essays revealed a strong negative correlation between quality and claim rarity, demonstrating a quality-originality trade-off where higher-quality texts tend to rely on typical claim patterns. Furthermore, while AI essays achieved comparable levels of structural complexity to human essays, their claim rarity was substantially lower than that of humans, indicating that LLMs can reproduce the form of argumentation but have limitations in the originality of content.
[99] FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents
Chiwei Zhu,Benfeng Xu,Mingxuan Du,Shaohan Wang,Xiaorui Wang,Zhendong Mao,Yongdong Zhang
Main category: cs.CL
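TL;DR: This paper proposes FS-Researcher, a file-system-based dual-agent framework that uses a persistent workspace to get past LLM context-length limits and carry out high-quality, long-horizon deep research tasks.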
Details
Motivation: In long-horizon tasks such as deep research, existing LLM agents are constrained by the context window, which makes it hard to balance evidence collection against report writing and prevents effective test-time scaling.
Method: Designs a dual-agent architecture: a Context Builder (which produces structured notes and a hierarchical knowledge base) and a Report Writer (which generates the report section by section from the knowledge base), with the file system serving as external memory and as the coordination medium between agents.
Result: Achieves state-of-the-art report quality on the DeepResearch Bench and DeepConsult benchmarks; experiments show that the computation allocated to the Context Builder correlates positively with final report quality.
Conclusion: The file-system paradigm supports test-time scaling for LLM agents and provides a scalable, iterable architecture for long-horizon, information-dense tasks.
Abstract: Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling. We introduce FS-Researcher, a file-system-based, dual-agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian that browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open-ended benchmarks (DeepResearch Bench and DeepConsult) show that FS-Researcher achieves state-of-the-art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test-time scaling under the file-system paradigm. The code and data are anonymously open-sourced at https://github.com/Ignoramus0817/FS-Researcher.
[100] LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States
Yeqin Zhang,Yunfei Wang,Jiaxuan Chen,Ke Qin,Yizheng Zhao,Cam-Tu Nguyen
Main category: cs.CL
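TL;DR: This paper proposes Value Aggregation (VA), a training-free sentence representation method, and its refinement AlignedWVA. Both use the value vectors of the LLM attention mechanism instead of hidden states to better capture sentence-level semantics, and they reach state-of-the-art performance among training-free methods on multiple benchmarks.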
Details
Motivation: Existing LLM-based sentence representation methods mostly rely on final-layer hidden states, which are optimized for next-token prediction and therefore model global sentence semantics poorly.
Method: Proposes Value Aggregation (VA), which pools attention value vectors across multiple layers and token indices, and AlignedWVA, which uses the last token's attention scores as weights and applies the output projection matrix $W_O$ to align the weighted value vectors with the residual-stream space.
Result: In the training-free setting, VA outperforms other LLM-based embedding methods and matches or exceeds the ensemble-based MetaEOL; AlignedWVA becomes the best training-free LLM embedding method and surpasses the high-cost MetaEOL by a substantial margin; fine-tuning VA is shown to improve performance further.
Conclusion: Attention value vectors are a better basis for sentence representations than hidden states, and VA and AlignedWVA provide an efficient, high-performing paradigm for training-free sentence embeddings.
Abstract: Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings and even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.
[101] Provable Defense Framework for LLM Jailbreaks via Noise-Augumented Alignment
Zehua Cheng,Jianwei Yang,Wei Dai,Jiahao Sun
Main category: cs.CL
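TL;DR: This paper proposes a certifiable robustness framework based on statistical stability. Using Certified Semantic Smoothing (CSS) via stratified randomized ablation together with Noise-Augmented Alignment Tuning (NAAT), it substantially improves the robustness of large language models against adaptive jailbreak attacks while keeping benign utility high.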
Details
Motivation: Large language models (LLMs) remain vulnerable to adaptive jailbreak attacks, and existing empirical defenses do not provide reliable safety guarantees.
Method: Proposes Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, which partitions the input into immutable structural prompts and mutable payloads and derives $\ell_0$-norm robustness guarantees from the Hypergeometric distribution; introduces Noise-Augmented Alignment Tuning (NAAT) as a semantic denoiser to counter performance degradation on sparse contexts.
Result: On Llama-3, the success rate of gradient-based attacks falls from 84.2% to 1.2% while benign utility stays at 94.1%, well above character-level baselines (74.3% utility).
Conclusion: The framework provides a deterministic safety certificate, guaranteeing robustness against all adversarial variants within a provable radius and reconciling safety with utility.
Abstract: Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks that easily bypass empirical defenses like GCG. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous $\ell_0$ norm guarantees using the Hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines which degrade utility to 74.3%. This framework provides a deterministic certificate of safety, ensuring that a model remains robust against all adversarial variants within a provable radius.
[102] Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles
Shaohan Wang,Benfeng Xu,Licheng Zhang,Mingxuan Du,Chiwei Zhu,Xiaorui Wang,Zhendong Mao,Yongdong Zhang
Main category: cs.CL
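TL;DR: This paper proposes the Wiki Live Challenge (WLC), a live benchmark built on Wikipedia Good Articles for more reliable and fine-grained evaluation of Deep Research Agents (DRAs). It also releases Wiki Eval, an evaluation framework with 39 writing-quality criteria and strict factual verifiability metrics, and shows experimentally that current DRAs still fall well short of human expert level.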
Details
Motivation: Existing DRA evaluation relies on LLM-generated references or LLM-derived evaluation dimensions, which lack the reliability of expert-verified content and struggle to give objective, fine-grained assessments of critical dimensions.
Method: Builds the WLC benchmark from 100 recent Wikipedia Good Articles and designs the Wiki Eval framework, which covers 39 fine-grained writing-quality criteria and rigorous metrics for factual verifiability.
Result: Experiments on a range of DRA systems show a significant gap between current DRAs and Good Articles written by human experts, supporting WLC as a useful benchmark for advancing DRA research.
Conclusion: WLC offers a more reliable, challenging, and interpretable way to evaluate DRAs, addresses weaknesses of LLM-centric evaluation, and provides a high-quality benchmark and evaluation tool for future work.
Abstract: Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks primarily rely on LLM-generated references or LLM-derived evaluation dimensions. While these approaches offer scalability, they often lack the reliability of expert-verified content and struggle to provide objective, fine-grained assessments of critical dimensions. To bridge this gap, we introduce Wiki Live Challenge (WLC), a live benchmark that leverages the newest Wikipedia Good Articles (GAs) as expert-level references. Wikipedia's strict standards for neutrality, comprehensiveness, and verifiability pose a great challenge for DRAs, with GAs representing the pinnacle of these standards. We curate a dataset of 100 recent Good Articles and propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability. Extensive experiments on various DRA systems demonstrate a significant gap between current DRAs and human expert-level Wikipedia articles, validating the effectiveness of WLC in advancing agent research. We release our benchmark at https://github.com/WangShao2000/Wiki_Live_Challenge
[103] The Art of Socratic Inquiry: A Framework for Proactive Template-Guided Therapeutic Conversation Generation
Mingwen Zhang,Minqiang Yang,Changsheng Ma,Yang Yu,Hui Bai,Chen Xu,Xiangzhen Kong,Bin Hu
Main category: cs.CL
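TL;DR: This paper proposes the Socratic Inquiry Framework (SIF), a lightweight, plug-and-play therapeutic intent planner that lets large language models proactively ask Socratic questions and so guide more effectively in cognitive behavioral therapy. It also builds the Socratic-QA dataset to support training, and experiments show clear gains in questioning proactivity, conversational depth, and therapeutic alignment.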
Details
Motivation: Current psychological LLMs are overly reactive and lack the ability to proactively guide cognitive and behavioral change, so they cannot support the proactive questioning that is central to cognitive behavioral therapy.
Method: Proposes the Socratic Inquiry Framework (SIF), which decouples "when to ask" (Strategy Anchoring) from "what to ask" (Template Retrieval), and builds Socratic-QA, a high-quality, strategy-aligned dataset for supervised training.
Result: Experiments show that SIF significantly increases proactive questioning frequency, conversational depth, and alignment with therapeutic theory, shifting the model from reactive comfort to proactive exploration.
Conclusion: SIF provides a new paradigm for psychologically informed LLMs: they should not only respond to users but also actively guide the cognitive process.
Abstract: Proactive questioning, where therapists deliberately initiate structured, cognition-guiding inquiries, is a cornerstone of cognitive behavioral therapy (CBT). Yet, current psychological large language models (LLMs) remain overwhelmingly reactive, defaulting to empathetic but superficial responses that fail to surface latent beliefs or guide behavioral change. To bridge this gap, we propose the Socratic Inquiry Framework (SIF), a lightweight, plug-and-play therapeutic intent planner that transforms LLMs from passive listeners into active cognitive guides. SIF decouples when to ask (via Strategy Anchoring) from what to ask (via Template Retrieval), enabling context-aware, theory-grounded questioning without end-to-end retraining. Complementing SIF, we introduce Socratic-QA, a high-quality dataset of strategy-aligned Socratic sequences that provides explicit supervision for proactive reasoning. Experiments show that SIF significantly enhances proactive questioning frequency, conversational depth, and therapeutic alignment, marking a clear shift from reactive comfort to proactive exploration. Our work establishes a new paradigm for psychologically informed LLMs: not just to respond, but to guide.
[104] SEA-Guard: Culturally Grounded Multilingual Safeguard for Southeast Asia
Panuthep Tasawong,Jian Gang Ngui,Alham Fikri Aji,Trevor Cohn,Peerat Limkonchotiwat
Main category: cs.CL
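TL;DR: This paper proposes a new agentic data-generation framework for building safety datasets grounded in Southeast Asian (SEA) cultural contexts at scale, and uses it to release SEA-Guard, the first family of multilingual safeguard models grounded in SEA culture, which clearly outperforms existing models at identifying regionally sensitive content.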
Details
Motivation: Real-world AI alignment has to account for diverse local values, norms, and regional regulations, but building large-scale, culturally adapted datasets is limited by resources and a scarcity of native annotators; existing methods rely on machine-translated English data, which tends to lose cultural nuance.
Method: Proposes a new agentic data-generation framework that scalably produces authentic, region-specific safety datasets for Southeast Asia, and trains the SEA-Guard family of multilingual safeguard models on them.
Result: Across multiple benchmarks and cultural variants, SEA-Guard consistently outperforms existing safeguard models, especially at detecting regionally sensitive or harmful content, while maintaining strong general safety performance.
Conclusion: Culturally aware safeguarding requires regional grounding and authenticity at the data source, and this work offers a scalable, reproducible paradigm for regional AI alignment.
Abstract: Culturally aware safeguards are crucial for AI alignment in real-world settings, where safety extends beyond common sense and encompasses diverse local values, norms, and region-specific regulations. However, building large-scale, culturally grounded datasets is challenging due to limited resources and a scarcity of native annotators. Consequently, many safeguard models rely on machine translation of English datasets, often missing regional and cultural nuances. We present a novel agentic data-generation framework to scalably create authentic, region-specific safety datasets for Southeast Asia (SEA). On this foundation, we introduce the SEA-Guard family, the first multilingual safeguard models grounded in SEA cultural contexts. Evaluated across multiple benchmarks and cultural variants, SEA-Guard consistently outperforms existing safeguards at detecting regionally sensitive or harmful content while maintaining strong general safety performance.
[105] A2Eval: Agentic and Automated Evaluation for Embodied Brain
Shuai Zhang,Jiayu Hu,Zijie Chen,Zeyuan Ding,Yi Zhang,Yingji Zhang,Ziyi Zhou,Junwei Liao,Shengjie Zhou,Yong Dai,Zhenzhong Lan,Xiaozhu Ju
Main category: cs.CL
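TL;DR: This paper proposes A2Eval, an agent-based automated evaluation framework that addresses the redundancy, uneven coverage, high cost, and distorted rankings caused by relying on static, manually annotated benchmarks to evaluate embodied vision-language models (VLMs). The framework comprises two collaborative agents: a Data Agent that automatically builds a balanced, compact evaluation suite, and an Eval Agent that generates and validates executable evaluation pipelines, enabling high-quality, low-cost, fully automated evaluation. Experiments show that A2Eval substantially compresses the evaluation suite, reduces cost and run time, and improves human alignment and ranking fidelity.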
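The human-alignment and ranking-fidelity figures in this entry are reported as Spearman's rho and Kendall's tau. As a reminder of what those statistics measure, here is a minimal pure-Python sketch of both for tie-free score lists. The score values are hypothetical and are not taken from the paper; the Kendall variant shown is tau-a, which may differ from the variant the authors used.

```python
from itertools import combinations


def spearman_rho(x, y):
    """Spearman's rho for two equally long score lists with no ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores for five models: a reference evaluation vs. an automated one.
reference_scores = [0.91, 0.85, 0.78, 0.66, 0.52]
automated_scores = [0.88, 0.74, 0.80, 0.60, 0.49]

print(spearman_rho(reference_scores, automated_scores))
print(kendall_tau(reference_scores, automated_scores))
```

In this toy case only one pair of models swaps places, so both statistics stay high; the more pairs an automated pipeline reorders, the lower both values fall.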
Details
Motivation: Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks, which suffer from severe redundancy, imbalanced coverage, heavy resource consumption, high cost, and distorted model rankings, all of which hinder iterative development. Method: Proposes Agentic Automatic Evaluation (A2Eval), composed of two collaborative agents: the Data Agent autonomously induces capability dimensions and assembles a balanced, compact evaluation suite, and the Eval Agent synthesizes and validates executable evaluation pipelines, enabling fully automated, high-fidelity evaluation. Result: Validated on 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces computational cost by 77%, and delivers a 4.6x speedup while preserving evaluation quality; it corrects systematic ranking biases, reaching a human alignment of Spearman's rho=0.85 and a ranking fidelity of Kendall's tau=0.81. Conclusion: A2Eval sets a new standard for embodied evaluation (high-fidelity, low-cost, fully automated) and supports efficient, reliable, and sustainable development of embodied VLMs. Abstract: Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks that exhibit severe redundancy and coverage imbalance. This labor-intensive paradigm drains computational and annotation resources, inflates costs, and distorts model rankings, ultimately stifling iterative development. To address this, we propose Agentic Automatic Evaluation (A2Eval), the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. The Data Agent autonomously induces capability dimensions and assembles a balanced, compact evaluation suite, while the Eval Agent synthesizes and validates executable evaluation pipelines, enabling fully autonomous, high-fidelity assessment. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup while preserving evaluation quality. Crucially, A2Eval corrects systematic ranking biases, improves human alignment to Spearman's rho=0.85, and maintains high ranking fidelity (Kendall's tau=0.81), establishing a new standard for high-fidelity, low-cost embodied assessment. Our code and data will be public soon.[106] Steering Vector Fields for Context-Aware Inference-Time Control in Large Language Models
Jiaqian Li,Yanshu Li,Kuan-Hao Huang
Main category: cs.CL
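TL;DR: This paper proposes Steering Vector Fields (SVF), which learn a differentiable concept scoring function whose local gradient defines a dynamic steering direction at each hidden activation, addressing the unreliability of conventional static steering vectors (SVs), whose direction becomes inconsistent as context changes. SVF supports coordinated multi-layer intervention, long-form generation, and multi-attribute control, significantly improving the strength and reliability of inference-time model control.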
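To make the contrast between a static steering vector and a gradient-defined steering field concrete, the following is a toy sketch, not the authors' implementation. It assumes a hand-written quadratic `concept_score` (hypothetical; the paper learns this function) and estimates its gradient by central differences, so the steering direction depends on the current activation.

```python
import math


def concept_score(h, center=(1.0, -2.0)):
    """Hypothetical differentiable score: higher when h is closer to `center`."""
    return -sum((a - c) ** 2 for a, c in zip(h, center))


def gradient(score, h, eps=1e-5):
    """Central-difference estimate of the gradient of `score` at activation h."""
    grad = []
    for i in range(len(h)):
        up = list(h)
        up[i] += eps
        down = list(h)
        down[i] -= eps
        grad.append((score(up) - score(down)) / (2 * eps))
    return grad


def steer(h, score, alpha=0.5):
    """Move h by alpha along the unit-norm local gradient of the score."""
    g = gradient(score, h)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0
    return [a + alpha * x / norm for a, x in zip(h, g)]


# Two different activations receive two different steering directions.
h1, h2 = [0.0, 0.0], [3.0, -2.0]
print(steer(h1, concept_score))
print(steer(h2, concept_score))
```

A static vector would add the same update to `h1` and `h2`; here the first coordinate increases for `h1` and decreases for `h2`, because the score-improving direction differs at the two points.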
Details
Motivation: Conventional steering vectors (SVs) are unreliable in practice: some concepts cannot be steered; even when steering helps on average, it harms a non-trivial share of inputs; and performance degrades in long-form generation and multi-attribute control. The root cause is that a static SV assumes the concept-improving direction is constant everywhere in representation space, whereas in reality this direction varies with context. Method: Proposes Steering Vector Fields (SVF), which learn a differentiable concept scoring function and use its gradient at each hidden activation as a dynamic steering direction; supports coordinated multi-layer intervention in a shared concept space; and handles long-form and multi-attribute control in a unified way. Result: Across multiple LLMs and steering tasks, SVF significantly improves control strength and reliability over static SVs, with particular gains in long-form generation and joint multi-attribute control. Conclusion: By introducing a context-aware dynamic steering mechanism, SVF addresses the fundamental limitation of SVs from a geometric perspective and provides a more robust, flexible, and unified framework for practical inference-time model control. Abstract: Steering vectors (SVs) offer a lightweight way to control large language models (LLMs) at inference time by shifting hidden activations, providing a practical middle ground between prompting and fine-tuning. Yet SVs can be unreliable in practice. Some concepts are unsteerable, and even when steering helps on average it can backfire for a non-trivial fraction of inputs. Reliability also degrades in long-form generation and multi-attribute steering. We take a geometric view of these failures. A static SV applies the same update vector everywhere in representation space, implicitly assuming that the concept-improving direction is constant across contexts. When the locally effective direction varies with the current activation, a single global vector can become misaligned, which yields weak or reversed effects. Guided by this perspective, we propose Steering Vector Fields (SVF), which learns a differentiable concept scoring function whose local gradient defines the steering direction at each activation, making interventions explicitly context-dependent. This formulation supports coordinated multi-layer interventions in a shared, aligned concept space, and enables efficient long-form and multi-attribute control within a unified framework. Across multiple LLMs and steering tasks, SVF delivers stronger and more reliable control, improving the practicality of inference-time steering.[107] CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation
Zhongyuan Peng,Caijun Xu,Changyi Xiao,Shibo Hong,Eli Zhang,Stephen Huang,Yixin Cao
Main category: cs.CL
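TL;DR: This paper proposes the CoDiQ framework, which enables fine-grained difficulty control when generating competition-level questions, and builds CoDiQ-Corpus, a 44K high-quality question corpus that significantly improves the training of large reasoning models.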
Details
Motivation: Existing automated question generation methods struggle to control difficulty precisely, are computationally expensive, and have difficulty generating competition-level questions at scale. Method: Proposes the CoDiQ framework, which achieves difficulty-controllable question generation through test-time scaling; develops CoDiQ-Generator from Qwen3-8B to raise the upper bound of difficult question generation; and builds CoDiQ-Corpus, validated through human evaluation and model training. Result: CoDiQ-Corpus contains 44K competition-grade question sequences; human evaluation shows they are harder than LiveCodeBench/AIME with a solvability rate above 82%; training large reasoning models on this dataset significantly improves reasoning performance. Conclusion: Generating high-quality questions with controllable difficulty effectively strengthens the reasoning ability of large reasoning models, and CoDiQ provides open-source tools and data for related research. Abstract: Large Reasoning Models (LRMs) benefit substantially from training on challenging competition-level questions. However, existing automated question synthesis methods lack precise difficulty control, incur high computational costs, and struggle to generate competition-level questions at scale. In this paper, we propose CoDiQ (Controllable Difficult Question Generation), a novel framework enabling fine-grained difficulty control via test-time scaling while ensuring question solvability. Specifically, first, we identify a test-time scaling tendency (extended reasoning token budget boosts difficulty but reduces solvability) and the intrinsic properties defining the upper bound of a model's ability to generate valid, high-difficulty questions. Then, we develop CoDiQ-Generator from Qwen3-8B, which improves the upper bound of difficult question generation, making it particularly well-suited for challenging question construction. Building on the CoDiQ framework, we build CoDiQ-Corpus (44K competition-grade question sequences). Human evaluations show these questions are significantly more challenging than LiveCodeBench/AIME with over 82% solvability. Training LRMs on CoDiQ-Corpus substantially improves reasoning performance, verifying that scaling controlled-difficulty training questions enhances reasoning capabilities. We open-source CoDiQ-Corpus, CoDiQ-Generator, and implementations to support related research.[108] Scaling Search-Augmented LLM Reasoning via Adaptive Information Control
Siheng Xiong,Oguzhan Gungordu,Blair Johnson,James C. Kerce,Faramarz Fekri
Main category: cs.CL
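TL;DR: This paper proposes the DeepControl framework, which adaptively controls the retrieval process using an information utility measure, including retrieval continuation and granularity control mechanisms, and uses an annealing strategy so the agent internalizes efficient information acquisition behavior during training; it significantly outperforms existing methods on multiple benchmarks.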
Details
Motivation: Existing outcome-based reinforcement learning methods offer limited guidance for regulating information acquisition, leading to redundant evidence, context saturation, and unstable learning. Method: Proposes the DeepControl framework, based on information utility (the marginal value of newly retrieved evidence under a given reasoning state), introduces retrieval continuation and granularity control mechanisms, and adopts an annealed control strategy. Result: Experiments on seven benchmarks show that, relative to strong baselines, DeepControl improves average performance by 9.4% and 8.6% on Qwen2.5-7B and Qwen2.5-3B, respectively, and consistently outperforms both retrieval-free methods and retrieval-based methods without explicit information control. Conclusion: Adaptive information control is essential for scaling search-augmented reasoning agents to complex, real-world information environments. Abstract: Search-augmented reasoning agents interleave multi-step reasoning with external information retrieval, but uncontrolled retrieval often leads to redundant evidence, context saturation, and unstable learning. Existing approaches rely on outcome-based reinforcement learning (RL), which provides limited guidance for regulating information acquisition. We propose DeepControl, a framework for adaptive information control based on a formal notion of information utility, which measures the marginal value of retrieved evidence under a given reasoning state. Building on this utility, we introduce retrieval continuation and granularity control mechanisms that selectively regulate when to continue and stop retrieval, and how much information to expand. An annealed control strategy enables the agent to internalize effective information acquisition behaviors during training. Extensive experiments across seven benchmarks demonstrate that our method consistently outperforms strong baselines. In particular, our approach achieves average performance improvements of 9.4% and 8.6% on Qwen2.5-7B and Qwen2.5-3B, respectively, over strong outcome-based RL baselines, and consistently outperforms both retrieval-free and retrieval-based reasoning methods without explicit information control. These results highlight the importance of adaptive information control for scaling search-augmented reasoning agents to complex, real-world information environments.[109] Counting Hypothesis: Potential Mechanism of In-Context Learning
Jung H. Lee,Sujith Vijayan
Main category: cs.CL
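TL;DR: This paper proposes and tests the 'counting hypothesis' of ICL, which holds that large language models support in-context learning through an encoding strategy (such as tallying patterns in the examples), thereby explaining how they complete tasks without parameter updates.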
Details
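Motivation: ICL is practical (it requires no change to model structure and only a few examples), but its underlying mechanism remains unclear, which makes diagnosing and correcting errors difficult; a deeper understanding of its limitations and working principles is needed. Method: Inspired by the properties of ICL and the functional modules of LLMs, proposes the 'counting hypothesis', namely that LLMs realize in-context learning through some encoding strategy (such as implicit statistics or pattern matching), and provides corresponding empirical support. Result: Presents the 'counting hypothesis' as an explanation of the ICL mechanism, together with preliminary supporting evidence, offering a new perspective on how LLMs complete tasks without updating parameters. Conclusion: ICL may rely on encoding strategies inherent to LLMs (such as implicit modeling of statistical regularities in the input examples); the 'counting hypothesis' provides a theoretical basis and research direction for analyzing the ICL mechanism and for subsequent improvement. Abstract: In-Context Learning (ICL) indicates that large language models (LLMs) pretrained on a massive amount of data can learn specific tasks from input prompts' examples. ICL is notable for two reasons. First, it does not need modification of LLMs' internal structure. Second, it enables LLMs to perform a wide range of tasks/functions with a few examples demonstrating a desirable task. ICL opens up new ways to utilize LLMs in more domains, but its underlying mechanisms still remain poorly understood, making error correction and diagnosis extremely challenging. Thus, it is imperative that we better understand the limitations of ICL and how exactly LLMs support ICL. Inspired by ICL properties and LLMs' functional modules, we propose 'the counting hypothesis' of ICL, which suggests that LLMs' encoding strategy may underlie ICL, and provide supporting evidence.[110] Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models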
Wenhui Tan,Fiorenzo Parascandolo,Enver Sangineto,Jianzhong Ju,Zhenbo Luo,Qian Cao,Rita Cucchiara,Ruihua Song,Jian Luan
Main category: cs.CL
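TL;DR: This paper proposes Latent Exploration Decoding (LED), which exploits the high entropy of intermediate layers in large reasoning models for depth-conditioned decoding, improving math and code reasoning performance without additional training or parameters.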
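The core operation described for LED, cumulative-sum aggregation of per-layer posteriors followed by selecting the depth with maximal entropy, can be sketched in a few lines. This is an illustrative reading under stated assumptions, not the authors' algorithm: the layer ordering (final layer first), the normalization of each running sum into a distribution, and the toy three-token posteriors are all hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


def cumulative_mixtures(layer_posteriors):
    """Running sum of per-layer posteriors, renormalized at each depth."""
    running = [0.0] * len(layer_posteriors[0])
    mixtures = []
    for depth, posterior in enumerate(layer_posteriors, start=1):
        running = [r + x for r, x in zip(running, posterior)]
        mixtures.append([r / depth for r in running])
    return mixtures


def max_entropy_depth(layer_posteriors):
    """Return (index, mixture, entropies) for the highest-entropy aggregate."""
    mixtures = cumulative_mixtures(layer_posteriors)
    entropies = [entropy(m) for m in mixtures]
    best = max(range(len(entropies)), key=entropies.__getitem__)
    return best, mixtures[best], entropies


# Hypothetical next-token posteriors over a 3-token vocabulary:
# a sharply peaked final layer, then two progressively flatter intermediate layers.
layers = [
    [0.97, 0.02, 0.01],
    [0.60, 0.30, 0.10],
    [0.40, 0.35, 0.25],
]
best, mixture, entropies = max_entropy_depth(layers)
print(best, mixture, entropies)
```

In this toy case the aggregate that includes every layer has the highest entropy, so sampling from it spreads probability over more candidate tokens than the peaked final-layer posterior would.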
Details
Motivation: Modern reasoning post-training causes temperature sampling to stop improving pass@$n$ accuracy, that is, exploration collapse; meanwhile, the entropy of the final-layer posterior drops sharply while intermediate-layer entropy remains high, an entropy asymmetry. Method: Proposes Latent Exploration Decoding (LED), a depth-conditioned decoding strategy that aggregates intermediate-layer posteriors via cumulative sum and selects the depth configurations with maximal entropy as exploration candidates. Result: LED consistently improves pass@1 and pass@16 accuracy across multiple reasoning benchmarks and models, by 0.61 and 1.03 percentage points respectively, with no additional training or parameters. Conclusion: Intermediate layers retain exploration capacity that a decoding strategy can exploit, and LED shows that improved decoding alone can mitigate post-training exploration collapse. Abstract: Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibits sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: https://GitHub.com/Xiaomi-Research/LED.[111] Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory
Langyuan Cui,Chun Kai Ling,Hwee Tou Ng
Main category: cs.CL
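TL;DR: This paper proposes the Game of Thought (GoT) framework, which uses game-theoretic techniques to approximate a Nash equilibrium strategy in order to improve the worst-case information-seeking ability of large language models when information is missing, evaluated on Twenty Questions and its adversarial variant, the Strategic Language Search (SLS) problem.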
Details
Motivation: Existing methods for enhancing the information-seeking ability of LLMs often rely on simplifying assumptions that degrade worst-case performance, a serious concern in high-stakes applications. Method: Models information seeking as the adversarial Strategic Language Search (SLS) problem, formalized as a two-player zero-sum extensive-form game, and proposes the Game of Thought (GoT) framework, which uses game-theoretic methods to approximate a Nash equilibrium strategy for the restricted variant. Result: Experiments show that GoT significantly outperforms direct prompting and heuristic search in all tested settings, consistently improving worst-case performance. Conclusion: GoT effectively improves the robustness and worst-case performance of LLMs on tasks with missing information, offering a new approach for high-reliability AI systems. Abstract: Large Language Models (LLMs) are increasingly deployed in real-world scenarios where they may lack sufficient information to complete a given task. In such settings, the ability to actively seek out missing information becomes a critical capability. Existing approaches to enhancing this ability often rely on simplifying assumptions that degrade worst-case performance. This is an issue with serious implications in high-stakes applications. In this work, we use the game of Twenty Questions to evaluate the information-seeking ability of LLMs. We introduce and formalize its adversarial counterpart, the Strategic Language Search (SLS) problem, along with its variants, as a two-player zero-sum extensive-form game. We propose Game of Thought (GoT), a framework that applies game-theoretic techniques to approximate a Nash equilibrium (NE) strategy for the restricted variant of the game. Empirical results demonstrate that our approach consistently improves worst-case performance compared to (1) direct prompting-based methods and (2) heuristic-guided search methods across all tested settings.[112] ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation
Xingshan Zeng,Lingzhi Wang,Weiwen Liu,Liangyou Li,Yasheng Wang,Lifeng Shang,Xin Jiang,Qun Liu
Main category: cs.CL
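TL;DR: This paper proposes ARTIS, a risk-aware test-time scaling framework for agentic settings that explores through iterative simulation before real execution, decoupling exploration from execution, and designs a risk-aware tool simulator to better model high-impact failure modes, significantly improving agent reliability on multi-turn, multi-step tasks.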
Details
Motivation: Existing test-time scaling (TTS) methods fall short in agentic settings, where actions directly affect the external environment and their consequences can be irreversible and costly; action-level robustness must be improved without incurring real-world risk. Method: Proposes the ARTIS framework: 1) decouples test-time exploration from real execution through iterative simulated interaction; 2) builds a risk-aware tool simulator that focuses on failure-inducing actions, using targeted data generation and rebalanced training to improve its fidelity on rare, high-impact failure modes. Result: Experiments on multi-turn, multi-step agentic benchmarks show that iterative simulation significantly improves agent reliability, and that risk-aware simulation is essential for obtaining stable gains across models and tasks. Conclusion: ARTIS resolves the tension between safety and effectiveness in test-time scaling for agentic settings; risk-aware simulation is the key to simulation quality and therefore to reliable decision-making. Abstract: Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with external environments and their effects can be irreversible and costly. We propose ARTIS (Agentic Risk-Aware Test-Time Scaling via Iterative Simulation), a framework that decouples exploration from commitment by enabling test-time exploration through simulated interactions prior to real-world execution. This design allows extending inference-time computation to improve action-level reliability and robustness without incurring environmental risk. We further show that naive LLM-based simulators struggle to capture rare but high-impact failure modes, substantially limiting their effectiveness for agentic decision making. To address this limitation, we introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions via targeted data generation and rebalanced training. Experiments on multi-turn and multi-step agentic benchmarks demonstrate that iterative simulation substantially improves agent reliability, and that risk-aware simulation is essential for consistently realizing these gains across models and tasks.[113] MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark
Mouath Abu-Daoud,Leen Kharouf,Omar El Hajj,Dana El Samad,Mariam Al-Omari,Jihad Mallat,Khaled Saleh,Nizar Habash,Farah E. Shamout
Main category: cs.CL
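TL;DR: This paper introduces MedAraBench, a large-scale dataset of Arabic medical multiple-choice questions covering 19 specialties and five difficulty levels, for evaluating large language models in the Arabic medical domain, and open-sources the dataset and evaluation scripts.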
Details
Motivation: Arabic severely lacks open-source data and benchmarks in natural language processing, especially for medical applications, which limits the evaluation and development of multilingual LLM capabilities. Method: Builds MedAraBench by manually digitizing academic materials written by medical professionals in the Arabic-speaking region; assesses data quality with two frameworks, expert human evaluation and LLM-as-a-judge; and benchmarks eight mainstream open-source and proprietary LLMs. Result: The dataset is diverse and of high quality, spanning a wide range of specialties and difficulty levels; benchmarking shows that existing models still need domain-specific enhancement on Arabic medical tasks. Conclusion: MedAraBench fills a gap in Arabic medical NLP benchmarks, and its release will promote multilingual medical AI development and clinical deployment. Abstract: Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking purposes, we assessed the performance of eight state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.[114] Mechanistic Indicators of Steering Effectiveness in Large Language Models
Mehdi Jafari,Hao Xue,Flora Salim
Main category: cs.CL
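TL;DR: This paper studies what makes activation-based steering reliable in large language models, proposing internal information-theoretic indicators, the Normalized Branching Factor (NBF) and KL divergence, to predict whether steering succeeds; it validates their predictive power and introduces a stronger evaluation baseline for Contrastive Activation Addition (CAA) and sparse autoencoder-based steering.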
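Both indicators named in this entry are computed from next-token distributions. The sketch below shows the standard entropy, branching factor (the exponential of entropy), and KL divergence. Dividing the branching factor by the vocabulary size to normalize it is an assumption made here for illustration; the paper's exact NBF definition may differ, and the example distributions are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def branching_factor(p):
    """exp(entropy): the effective number of equally likely next tokens."""
    return math.exp(entropy(p))


def normalized_branching_factor(p):
    """Branching factor divided by vocabulary size (normalization is an assumption)."""
    return branching_factor(p) / len(p)


def kl_divergence(p, q):
    """KL(p || q) in nats; q must be positive wherever p is positive."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


# Hypothetical 4-token distributions: steered output vs. a target-concept distribution.
steered = [0.70, 0.20, 0.05, 0.05]
target = [0.60, 0.25, 0.10, 0.05]
print(normalized_branching_factor(steered))
print(kl_divergence(steered, target))
```

A uniform distribution gives a normalized branching factor of 1 and a one-hot distribution gives 1/|V|, so the quantity tracks how many continuations remain plausible at a decoding step.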
Details
Motivation: Although activation steering is widely used, the mechanisms behind its success or failure remain unclear; prior work relies mostly on black-box outputs or LLM judges and lacks interpretable diagnostics based on internal model signals. Method: Uses the entropy-derived Normalized Branching Factor (NBF) and the KL divergence between steered activations and target concepts in vocabulary space as information-theoretic indicators, analyzes structured entropy preservation and coherent KL alignment across decoding steps, and validates against LLM-generated annotations, backed by high agreement between two LLM judges, as ground truth. Result: The NBF and KL indicators have significant predictive power, identifying successful steering and estimating failure probability; the paper also introduces a stronger evaluation baseline for CAA and sparse autoencoder-based steering. Conclusion: Internal information-theoretic signals can serve as effective diagnostics of activation steering reliability, opening a path toward interpretable, controllable intervention on LLM behavior. Abstract: Activation-based steering enables Large Language Models (LLMs) to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges. In this study, we investigate whether the reliability of steering can be diagnosed using internal model signals. We focus on two information-theoretic measures: the entropy-derived Normalized Branching Factor (NBF), and the Kullback-Leibler (KL) divergence between steered activations and targeted concepts in the vocabulary space. We hypothesize that effective steering corresponds to structured entropy preservation and coherent KL alignment across decoding steps. Building on a reliability study demonstrating high inter-judge agreement between two architecturally distinct LLMs, we use LLM-generated annotations as ground truth and show that these mechanistic signals provide meaningful predictive power for identifying successful steering and estimating failure probability. We further introduce a stronger evaluation baseline for Contrastive Activation Addition (CAA) and Sparse Autoencoder-based steering, the two most widely adopted activation-steering methods.[115] BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition
Hyunsik Kim,Haeri Kim,Munhak Lee,Kyungmin Lee
Main category: cs.CL
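TL;DR: This paper proposes BBPE16, a UTF-16-based byte-level BPE tokenizer for multilingual ASR. Compared with conventional UTF-8 BBPE, it remains language-agnostic while significantly reducing token counts and decoding iterations for non-Latin scripts such as CJK, improving efficiency and lowering memory usage.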
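The encoding-length difference that motivates BBPE16 is easy to check directly. The short sketch below compares the number of base bytes a byte-level tokenizer would start from under UTF-8 and UTF-16 for a few sample strings (the strings are illustrative choices, and `utf-16-le` is used so that no byte-order mark is counted). These are pre-merge byte counts, not final BPE token counts, which depend on the learned merges.

```python
def byte_lengths(text):
    """Number of bytes in the UTF-8 and UTF-16 (little-endian, no BOM) encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


samples = {
    "Chinese": "语音识别",
    "Korean": "음성 인식",
    "English": "speech",
}

for language, text in samples.items():
    print(language, byte_lengths(text))
```

Common CJK characters take three bytes in UTF-8 but two in UTF-16, so their base sequences shrink, while ASCII text doubles from one byte to two per character. That trade-off is why the reported gains concentrate on CJK.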
Details
Motivation: Existing UTF-8-based byte-level BPE (BBPE) produces overly long token sequences for non-Latin scripts such as Chinese, Japanese, and Korean (CJK) because of variable-length encoding, which increases compute and memory overhead. Method: Proposes BBPE16, a UTF-16-based byte-level BPE tokenizer that exploits UTF-16's uniform 2-byte code unit for most modern scripts to obtain more compact token representations and stronger cross-lingual token sharing. Result: Across monolingual, bilingual, and trilingual ASR and multilingual continual learning, BBPE16 attains comparable or better accuracy; for Chinese, it reduces token counts by up to 10.4% and decoding iterations by up to 10.3%, speeding up fine-tuning and inference and lowering memory usage. Conclusion: BBPE16 is a practical tokenization scheme for multilingual ASR that combines language-agnosticism, cross-lingual sharing, and computational efficiency. Abstract: Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its variable-length encoding inflates token sequences for non-Latin scripts, such as Chinese, Japanese, and Korean (CJK). Longer sequences increase computational load and memory use. We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. BBPE16 preserves BBPE's language-agnostic properties while substantially improving cross-lingual token sharing. Across monolingual, bilingual, and trilingual ASR, and in a multilingual continual-learning setup, BBPE16 attains comparable or better accuracy; for Chinese, it reduces token counts by up to 10.4% and lowers decoding iterations by up to 10.3%. These reductions speed up fine-tuning and inference and decrease memory usage, making BBPE16 a practical tokenization choice for multilingual ASR.[116] COMI: Coarse-to-fine Context Compression via Marginal Information Gain
Jiwei Tang,Shilei Liu,Zhicheng Zhang,Yujin Yuan,Libin Zheng,Wenbo Su,Bo Zheng
Main category: cs.CL
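TL;DR: This paper proposes the COMI framework, which compresses long contexts in two stages, coarse-grained group reallocation and fine-grained token merging, reducing redundancy while preserving semantic relevance and significantly improving the efficiency and effectiveness of long-text processing.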
Details
Motivation: Large language models face computational inefficiency and information redundancy in long-context scenarios, and need context compression that is efficient and preserves quality. Method: Proposes COMI, a coarse-to-fine adaptive compression framework based on Marginal Information Gain (MIG): it first assigns compression rates dynamically according to inter-group MIG (coarse-grained), then fuses tokens within each group using MIG-based weights (fine-grained). Result: Substantially outperforms baselines on multiple question-answering and summarization datasets (e.g., NaturalQuestions, MultiNews) and across models (LLaMA-2-7B, Qwen2-7B); for example, with Qwen2-7B on NaturalQuestions, EM improves by about 25 points under 32x compression. Conclusion: COMI effectively balances compression rate, semantic preservation, and redundancy suppression, offering an efficient and reliable compression solution for long-context LLM deployment. Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse tasks. However, their deployment in long context scenarios remains hindered by computational inefficiency and information redundancy. Context compression methods address these challenges by significantly reducing input length and eliminating redundancy. We propose COMI, a coarse-to-fine adaptive context compression framework that jointly optimizes for semantic relevance and diversity under high compression rates. We introduce Marginal Information Gain (MIG), a metric defined as the relevance of a unit to the input query minus its semantic redundancy with other units, guiding the compression process to prioritize information that is both relevant and low in redundancy. The framework operates in two stages: (1) Coarse-Grained Group Reallocation, where the context is partitioned into groups and dynamically assigned compression rates based on inter-group MIG, ensuring compression budgets align with information value distribution; and (2) Fine-Grained Token Merging, where tokens within each group are fused via an intra-group MIG-based weighting mechanism, thereby preserving key semantics while avoiding the accumulation of redundancy. 
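The MIG definition above (relevance to the query minus redundancy with other units) can be illustrated with a small sketch. Two choices here are assumptions rather than details from the abstract: similarity is measured with cosine similarity over embedding vectors, and redundancy is taken as the maximum similarity to any other unit. The two-dimensional vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def marginal_information_gain(index, units, query):
    """Relevance of units[index] to the query minus its redundancy with other units."""
    unit = units[index]
    relevance = cosine(unit, query)
    redundancy = max(
        (cosine(unit, other) for j, other in enumerate(units) if j != index),
        default=0.0,
    )
    return relevance - redundancy


# Hypothetical embeddings: units 0 and 1 are near-duplicates, unit 2 is distinct.
query = [1.0, 1.0]
units = [[1.0, 0.2], [1.0, 0.25], [0.1, 1.0]]
scores = [marginal_information_gain(i, units, query) for i in range(len(units))]
print(scores)
```

Unit 2 is the least similar to the query of the three, yet it receives the highest score because the two near-duplicates penalize each other. This is the behavior that lets a compression budget favor information that is relevant and not already covered.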
Extensive experiments across question-answering (e.g., NaturalQuestions, 2WikiMQA, HotpotQA and NarrativeQA) and summarization (e.g., MultiNews) tasks with various backbones (e.g., LLaMA-2-7B, Qwen2-7B) show that COMI outperforms existing baselines by a large margin, e.g., an approximately 25-point Exact Match (EM) improvement under a 32x compression constraint with Qwen2-7B on NaturalQuestions.[117] SafePred: A Predictive Guardrail for Computer-Using Agents via World Models
Yurun Chen,Zeyi Liao,Ping Yin,Taotao Xie,Keting Yin,Shengyu Zhang
Main category: cs.CL
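TL;DR: This paper proposes SafePred, a predictive guardrail framework for computer-using agents (CUAs) that builds a risk-to-decision loop to jointly predict short- and long-term risks and optimize decisions based on risk, significantly improving both safety and task utility.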
Details
Motivation: Existing CUA guardrails are mostly reactive, constraining behavior only on the basis of the current observation, and cannot identify or avoid long-term risks with delayed effects (e.g., cleaning logs makes future audits untraceable). Method: Proposes a predictive guardrail paradigm and designs the SafePred framework: 1) safety policies guide a world model to generate semantic representations of short- and long-term risks and prune high-risk actions; 2) step-level interventions and task-level re-planning turn predicted risks into safe decision guidance. Result: Experiments show that SafePred sharply reduces high-risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% over reactive baselines. Conclusion: Predictive guardrails are an effective way to address long-term risks in CUAs; SafePred balances safety and practicality through a closed loop of risk prediction and decision optimization. Abstract: With the widespread deployment of Computer-using Agents (CUAs) in complex real-world environments, prevalent long-term risks often lead to severe and irreversible consequences. Most existing guardrails for CUAs adopt a reactive approach, constraining agent behavior only within the current observation space. While these guardrails can prevent immediate short-term risks (e.g., clicking on a phishing link), they cannot proactively avoid long-term risks: seemingly reasonable actions can lead to high-risk consequences that emerge with a delay (e.g., cleaning logs leads to future audits being untraceable), which reactive guardrails cannot identify within the current observation space. To address these limitations, we propose a predictive guardrail approach, with the core idea of aligning predicted future risks with current decisions. Based on this approach, we present SafePred, a predictive guardrail framework for CUAs that establishes a risk-to-decision loop to ensure safe agent behavior. SafePred supports two key abilities: (1) Short- and long-term risk prediction: by using safety policies as the basis for risk prediction, SafePred leverages the prediction capability of the world model to generate semantic representations of both short-term and long-term risks, thereby identifying and pruning actions that lead to high-risk states; (2) Decision optimization: translating predicted risks into actionable safe decision guidance through step-level interventions and task-level re-planning. 
Extensive experiments show that SafePred significantly reduces high-risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% compared with reactive baselines.[118] Enhancing Automated Essay Scoring with Three Techniques: Two-Stage Fine-Tuning, Score Alignment, and Self-Training
Hongseok Choi,Serynn Kim,Wencke Liermann,Jin Seong,Jin-Xia Huang
Main category: cs.CL
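TL;DR: This paper proposes a new approach to improving automated essay scoring (AES) when labeled data is scarce, comprising three key techniques (two-stage fine-tuning, score alignment, and uncertainty-aware self-training), and validates it on the ASAP++ dataset.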
Details
Motivation: In real-world settings, labeled data is extremely scarce, which severely constrains the development and deployment of robust AES systems. Method: Proposes three key techniques: 1) a two-stage fine-tuning strategy based on low-rank adaptation; 2) a score alignment technique that improves consistency between predicted and true score distributions; 3) uncertainty-aware self-training that exploits unlabeled data. All techniques are implemented on DualBERT. Result: With only 32 labeled samples, integrating the three techniques reaches 91.2% of the full-data performance (about 1,000 samples); score alignment yields stable gains in both full-data and few-shot settings and achieves SOTA in the full-data setting. Conclusion: The proposed approach substantially alleviates labeled-data scarcity in AES and improves model performance in both few-shot and full-data scenarios, giving it strong practical value. Abstract: Automated Essay Scoring (AES) plays a crucial role in education by providing scalable and efficient assessment tools. However, in real-world settings, the extreme scarcity of labeled data severely limits the development and practical adoption of robust AES systems. This study proposes a novel approach to enhance AES performance in both limited-data and full-data settings by introducing three key techniques. First, we introduce a Two-Stage fine-tuning strategy that leverages low-rank adaptations to better adapt an AES model to target prompt essays. Second, we introduce a Score Alignment technique to improve consistency between predicted and true score distributions. Third, we employ uncertainty-aware self-training using unlabeled data, effectively expanding the training set with pseudo-labeled samples while mitigating label noise propagation. We implement the above three key techniques on DualBERT. We conduct extensive experiments on the ASAP++ dataset. As a result, in the 32-data setting, all three key techniques improve performance, and their integration achieves 91.2% of the full-data performance trained on approximately 1,000 labeled samples. In addition, the proposed Score Alignment technique consistently improves performance in both limited-data and full-data settings: e.g., it achieves state-of-the-art results in the full-data setting when integrated into DualBERT.[119] WorldCup Sampling for Multi-bit LLM Watermarking
Yidan Wang,Yubing Ren,Yanan Cao,Li Guo
Main category: cs.CL
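TL;DR: This paper proposes WorldCup, a multi-bit watermarking framework for large language models that embeds information directly into token selection through a hierarchical competition mechanism and entropy-aware modulation, balancing capacity, robustness, text quality, and decoding efficiency.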
Details
Motivation: Existing zero-bit watermarking methods, when extended to multi-bit, suffer from indirect information flow, limited effective capacity, and suboptimal decoding; a more efficient multi-bit scheme is needed. Method: Proposes the WorldCup framework: treats sampling as a communication channel and embeds bits directly via a hierarchical competition mechanism guided by complementary signals; introduces entropy-aware modulation to preserve generation quality; and designs confidence-aware decoding for robust message recovery. Result: WorldCup achieves a strong balance across capacity, detectability, robustness, text quality, and decoding efficiency, outperforming existing baselines across the board. Conclusion: WorldCup offers a new paradigm for multi-bit watermarking of LLMs and lays a solid foundation for future research. Abstract: As large language models (LLMs) generate increasingly human-like text, watermarking offers a promising solution for reliable attribution beyond mere detection. While multi-bit watermarking enables richer provenance encoding, existing methods largely extend zero-bit schemes through seed-driven steering, leading to indirect information flow, limited effective capacity, and suboptimal decoding. In this paper, we propose WorldCup, a multi-bit watermarking framework for LLMs that treats sampling as a natural communication channel and embeds message bits directly into token selection via a hierarchical competition mechanism guided by complementary signals. Moreover, WorldCup adopts entropy-aware modulation to preserve generation quality and supports robust message recovery through confidence-aware decoding. Comprehensive experiments show that WorldCup achieves a strong balance across capacity, detectability, robustness, text quality, and decoding efficiency, consistently outperforming prior baselines and laying a solid foundation for future LLM watermarking studies.[120] Zero2Text: Zero-Training Cross-Domain Inversion Attacks on Textual Embeddings
Doohyun Kim,Donghwa Kang,Kyungjae Lee,Hyeongboo Baek,Brent Byunghoon Kang
Main category: cs.CL
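TL;DR: This paper proposes Zero2Text, a framework that uses recursive online alignment to carry out embedding inversion attacks against vector databases without training or real data, significantly improving cross-domain text recovery.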
Details
Motivation: Vector databases face serious privacy risks from embedding inversion attacks, yet current optimization-based and alignment-based methods are limited, respectively, by high computational cost and by the unrealistic assumption of access to in-domain training data, so they fail in strict black-box and cross-domain settings. Method: Proposes Zero2Text, a training-free recursive online alignment framework that combines LLM priors with a dynamic ridge regression mechanism to iteratively align generated text with the target embedding online. Result: Performs strongly on multiple benchmarks including MS MARCO; against the OpenAI victim model, ROUGE-L is 1.8x higher and BLEU-2 is 6.4x higher than baselines, and sentences from unknown domains are recovered without any leaked data pairs. Conclusion: Zero2Text overcomes the limits of conventional methods in black-box and cross-domain settings and shows that standard defenses (such as differential privacy) are ineffective against this adaptive attack, offering a new perspective on privacy and security for RAG systems. Abstract: The proliferation of retrieval-augmented generation (RAG) has established vector databases as critical infrastructure, yet they introduce severe privacy risks via embedding inversion attacks. Existing paradigms face a fundamental trade-off: optimization-based methods require computationally prohibitive queries, while alignment-based approaches hinge on the unrealistic assumption of accessible in-domain training data. These constraints render them ineffective in strict black-box and cross-domain settings. To dismantle these barriers, we introduce Zero2Text, a novel training-free framework based on recursive online alignment. Unlike methods relying on static datasets, Zero2Text synergizes LLM priors with a dynamic ridge regression mechanism to iteratively align generation to the target embedding on-the-fly. We further demonstrate that standard defenses, such as differential privacy, fail to effectively mitigate this adaptive threat. Extensive experiments across diverse benchmarks validate Zero2Text; notably, on MS MARCO against the OpenAI victim model, it achieves 1.8x higher ROUGE-L and 6.4x higher BLEU-2 scores compared to baselines, recovering sentences from unknown domains without a single leaked data pair.[121] : One LLM Token for Explicit Graph Structural Understanding
Jingyao Wu,Bin Lu,Zijun Di,Xiaoying Gan,Meng Jin,Luoyi Fu,Xinbing Wang,Chenghu Zhou
Main category: cs.CL
TL;DR: This paper proposes an approach named
Details
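Motivation: Large language models suffer from structural hallucination when processing graph-structured data, and existing methods (such as graph verbalization or soft prompts) consume many tokens, scatter attention, or are severely misaligned with the original text tokens. Method: Proposes[122] Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Model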
Kangtao Lv,Jiwei Tang,Langming Liu,Haibin Chen,Weidong Zhang,Shilei Liu,Yongwei Wang,Yujin Yuan,Wenbo Su,Bo Zheng
Main category: cs.CL
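TL;DR: This paper is the first to systematically study, from a data-centric perspective, how data distribution affects context compression quality, finding that input entropy correlates negatively with compression quality and that the intrinsic data gap between encoder and decoder significantly diminishes compression gains.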
Details
Motivation: Existing research focuses only on model-side improvements and overlooks the effect of the data distribution itself on context compression; this paper aims to fill that gap. Method: Uses an autoencoder framework to evaluate the semantic integrity of compressed representations, and analyzes the effect of data distribution along two dimensions: input data and the model's intrinsic pretrained knowledge. Result: (1) Encoder-measured input entropy correlates negatively with compression quality, while decoder-measured entropy shows no significant relationship under a frozen decoder; (2) the intrinsic data gap between encoder and decoder significantly diminishes compression gains and is hard to mitigate. Conclusion: Data distribution (especially input entropy and the consistency of intrinsic knowledge between encoder and decoder) is a key factor in long-context compression, and the paper derives practical guidelines for optimizing compression gains. Abstract: The deployment of Large Language Models (LLMs) in long-context scenarios is hindered by computational inefficiency and significant information redundancy. Although recent advancements have widely adopted context compression to address these challenges, existing research focuses only on model-side improvements, and the impact of the data distribution itself on context compression remains largely unexplored. To bridge this gap, we are the first to adopt a data-centric perspective to systematically investigate how data distribution impacts compression quality along two dimensions: input data and intrinsic data (i.e., the model's internal pretrained knowledge). We evaluate the semantic integrity of compressed representations using an autoencoder-based framework. Our experimental results reveal that: (1) encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under a frozen-decoder setting; and (2) the gap between intrinsic data of the encoder and decoder significantly diminishes compression gains, which is hard to mitigate. Based on these findings, we further present practical guidelines to optimize compression gains.[123] CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Yuling Shi,Chaoxiang Xie,Zhensu Sun,Yeheng Chen,Chenxu Zhang,Longfei Yun,Chengcheng Wan,Hongyu Zhang,David Lo,Xiaodong Gu
Main category: cs.CL
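TL;DR: This paper explores representing source code as images for multimodal large language models (MLLMs) to improve the computational efficiency of code understanding. Experiments show that image compression can cut token consumption substantially (by up to 8x) while maintaining performance, and on some tasks such as code completion and clone detection even improving it.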
Details
Motivation: Text-based LLM approaches to code understanding face a computational-efficiency bottleneck because context length grows linearly; the rise of multimodal LLMs opens a new path that exploits the image modality, which is easier to compress while preserving semantics. Method: Conducts the first systematic study of how well MLLMs handle rendered code images on code understanding tasks (such as code completion and clone detection), focusing on how performance changes across image resolutions (that is, token compression ratios) and comparing against text-input baselines. Result: (1) MLLMs still understand code effectively under up to 8x token compression; (2) visual cues such as syntax highlighting improve code completion performance under 4x compression; (3) clone detection is highly robust to image compression, and at some compression ratios even outperforms raw text input. Conclusion: Image-modality code representation is an effective new paradigm for improving the inference efficiency of code understanding in large models, but current MLLMs still have limitations and need better vision-semantic alignment. Abstract: Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs.
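The link between image resolution and token cost mentioned in the abstract can be illustrated with a small sketch. It assumes a plain ViT-style encoder that emits one token per 14x14-pixel patch and a hypothetical 896x896 render of a code screenshot; neither detail comes from the paper, and the ratio here is measured against the full-resolution image rather than against text tokens, which is a simplification.

```python
import math


def vision_tokens(width_px: int, height_px: int, patch_px: int = 14) -> int:
    """Patch-token count for an image on a plain ViT-style grid (assumed encoder)."""
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)


def downscale(width_px: int, height_px: int, ratio: float) -> tuple:
    """Shrink both sides by 1/sqrt(ratio) so the patch count drops by about `ratio`."""
    scale = 1.0 / math.sqrt(ratio)
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


BASE_W, BASE_H = 896, 896  # hypothetical render size, chosen for illustration
base_tokens = vision_tokens(BASE_W, BASE_H)

achieved = {}
for target in (1, 2, 4, 8):
    w, h = downscale(BASE_W, BASE_H, target)
    achieved[target] = base_tokens / vision_tokens(w, h)
    print(f"target {target}x -> {w}x{h} px, achieved {achieved[target]:.2f}x")
```

Because the patch grid is rounded up, the achieved ratio only approximates the target for scale factors that do not divide the image evenly; real MLLMs may also tile or pool patches differently, so actual token counts will vary by model.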
Our findings highlight both the potential and current limitations of MLLMs in code understanding, pointing to a shift toward image-modality code representation as a pathway to more efficient inference.
[124] Sentence Curve Language Models
DongNyeong Heo,Heelyoul Choi
Main category: cs.CL
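TL;DR: This paper proposes a new language model, SCLM, which replaces conventional static word embeddings with a continuous sentence representation, the sentence curve, to better model the global structure of a sentence. Theory shows that this induces a regularization effect, and experiments show SOTA performance among DLMs on several benchmarks.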
Details
Motivation: Existing language models (including diffusion language models, DLMs) represent target words with static word embeddings, which leads to weak modeling of global context structure and a focus on local word-prediction accuracy alone. Method: Proposes the "sentence curve" as a continuous sentence representation, namely a spline curve defined by control points that each affect multiple words; builds the Sentence Curve Language Model (SCLM) on top of it, changing the DLM objective from predicting static word embeddings to predicting sentence curves; and supports the approach with theoretical analysis and empirical validation. Result: SCLM achieves SOTA performance among DLMs on IWSLT14 and WMT14, trains stably without burdensome knowledge distillation, and shows promising potential compared to discrete DLMs on LM1B. Conclusion: Replacing static word embeddings with continuous, globally aware sentence curves effectively improves how diffusion language models understand and generate sentence structure, offering a new paradigm for language modeling. Abstract: Language models (LMs) are a central component of modern AI systems, and diffusion-based language models (DLMs) have recently emerged as a competitive alternative. Both paradigms rely on word embeddings not only to represent the input sentence, but also to represent the target sentence that backbone models are trained to predict. We argue that such static embedding of the target word is insensitive to neighboring words, encouraging locally accurate word prediction while neglecting global structure across the target sentence. To address this limitation, we propose a continuous sentence representation, termed sentence curve, defined as a spline curve whose control points affect multiple words in the sentence. Based on this representation, we introduce sentence curve language model (SCLM), which extends DLMs to predict sentence curves instead of the static word embeddings. We theoretically show that sentence curve prediction induces a regularization effect that promotes global structure modeling, and characterize how different sentence curve types affect this behavior. Empirically, SCLM achieves SOTA performance among DLMs on IWSLT14 and WMT14, shows stable training without burdensome knowledge distillation, and demonstrates promising potential compared to discrete DLMs on LM1B.
[125] AXE: Low-Cost Cross-Domain Web Structured Information Extraction
Abdelrahman Mansour,Khaled W. Alshaer,Moataz Elsaban
Main category: cs.CL
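TL;DR: AXE is a new method for extracting structured data from web pages. By treating the HTML DOM as a tree to be pruned, and combining a specialized pruning mechanism with traceable XPath resolution (GXR), it lets a small 0.6B LLM perform accurate zero-shot extraction, reaching an F1 of 88.1% on SWDE and outperforming several larger models.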
Details
Motivation: Resolve the tension in web structured data extraction between the brittleness of manual heuristics and the prohibitive cost of large language models. Method: Proposes the AXE (Adaptive XPath Extractor) pipeline, which treats the HTML DOM as a tree and prunes it to remove redundant nodes, and introduces Grounded XPath Resolution (GXR) to ensure that every extraction can be traced back to a source DOM node. Result: Achieves a zero-shot F1 score of 88.1% on the SWDE dataset, outperforming several larger, fully trained models. Conclusion: AXE offers a lightweight, efficient, traceable, and low-cost practical path for large-scale web information extraction. Abstract: Extracting structured data from the web is often a trade-off between the brittle nature of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized "pruning" mechanism to strip away boilerplate and irrelevant nodes, leaving behind a distilled, high-density context that allows a tiny 0.6B LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), ensuring every extraction is physically traceable to a source node. Despite its low footprint, AXE achieves state-of-the-art zero-shot performance, outperforming several much larger, fully-trained alternatives with an F1 score of 88.1% on the SWDE dataset. By releasing our specialized adaptors, we aim to provide a practical, cost-effective path for large-scale web information extraction.
[126] Read As Human: Compressing Context via Parallelizable Close Reading and Skimming
Jiwei Tang,Shilei Liu,Zhicheng Zhang,Qingsong Lv,Runsong Zhao,Tingwei Lu,Langming Liu,Haibin Chen,Yujin Yuan,Hai-Tao Zheng,Wenbo Su,Bo Zheng
Main category: cs.CL
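TL;DR: RAM is a context compression framework inspired by human reading behavior. It uses an adaptive hybrid reading strategy (close reading of high-relevance segments, skimming of low-relevance segments that are compressed into summary vectors) to improve the efficiency and performance of long-context LLM inference, and adds contrastive learning to sharpen the decision boundary.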
Details
Motivation: Address the computational inefficiency and redundant information that large language models face in long-context scenarios. Method: Proposes the RAM framework: the input context is split into segments and encoded in parallel together with the query; high-relevance segments are retained in full (close reading), while low-relevance segments undergo query-guided compression into summary vectors (skimming); the explicit text and the implicit summary vectors are concatenated and fed to the decoder; and a contrastive learning objective over positive and negative sample pairs refines the boundary between close reading and skimming. Result: Outperforms existing methods on multiple question answering and summarization benchmarks, and delivers up to a 12x end-to-end speedup across two backbone models (average length 16K, maximum 32K). Conclusion: RAM significantly improves efficiency and performance on long-context tasks while keeping natural-language interpretability, which supports the effectiveness of a human-like adaptive reading strategy. Abstract: Large Language Models (LLMs) demonstrate exceptional capability across diverse tasks. However, their deployment in long-context scenarios is hindered by two challenges: computational inefficiency and redundant information. We propose RAM (Read As HuMan), a context compression framework that adopts an adaptive hybrid reading strategy, to address these challenges. Inspired by human reading behavior (i.e., close reading important content while skimming less relevant content), RAM partitions the context into segments and encodes them with the input query in parallel. High-relevance segments are fully retained (close reading), while low-relevance ones are compressed, guided by the query, into compact summary vectors (skimming). Both explicit textual segments and implicit summary vectors are concatenated and fed into the decoder to achieve both superior performance and natural language format interpretability. To refine the decision boundary between close reading and skimming, we further introduce a contrastive learning objective based on positive and negative query-segment pairs. Experiments demonstrate that RAM outperforms existing baselines on multiple question answering and summarization benchmarks across two backbones, while delivering up to a 12x end-to-end speedup on long inputs (average length 16K; maximum length 32K).
[127] PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning
Langming Liu,Kangtao Lv,Haibin Chen,Weidong Zhang,Yejing Wang,Shilei Liu,Xin Tong,Yujin Yuan,Yongwei Wang,Wenbo Su,Bo Zheng
Main category: cs.CL
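TL;DR: This paper proposes the PretrainRL framework, which tackles factual hallucination in large language models by introducing reinforcement learning during pretraining. The core idea is to debias first and then learn: down-weight high-probability wrong answers to make room for low-probability correct answers to be learned.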
Details
Motivation: Large language models suffer from factual hallucination, rooted in the imbalanced data distribution of the pretraining corpus, which produces states of "low-probability truth" and "high-probability falsehood". Existing methods either evade the problem or face catastrophic forgetting. Method: Proposes the PretrainRL framework, which integrates reinforcement learning into the pretraining phase; follows a "debiasing then learning" principle that reshapes the model's probability distribution by down-weighting high-probability wrong answers; designs an efficient negative sampling strategy to discover high-probability falsehoods; and introduces new metrics to evaluate the model's probabilistic state with respect to factual knowledge. Result: Extensive experiments on three public benchmarks show that PretrainRL significantly alleviates factual hallucination and outperforms current state-of-the-art methods. Conclusion: Starting from the root cause, the imbalanced pretraining data distribution, and calibrating the probability distribution through reinforcement learning is an effective new path for mitigating factual hallucination in large language models. Abstract: Large language models (LLMs), despite their powerful capabilities, suffer from factual hallucinations where they generate verifiable falsehoods. We identify a root of this issue: the imbalanced data distribution in the pretraining corpus, which leads to a state of "low-probability truth" and "high-probability falsehood". Recent approaches, such as teaching models to say "I don't know" or post-hoc knowledge editing, either evade the problem or face catastrophic forgetting. To address this issue from its root, we propose PretrainRL, a novel framework that integrates reinforcement learning into the pretraining phase to consolidate factual knowledge. The core principle of PretrainRL is "debiasing then learning." It actively reshapes the model's probability distribution by down-weighting high-probability falsehoods, thereby making "room" for low-probability truths to be learned effectively. To enable this, we design an efficient negative sampling strategy to discover these high-probability falsehoods and introduce novel metrics to evaluate the model's probabilistic state concerning factual knowledge. Extensive experiments on three public benchmarks demonstrate that PretrainRL significantly alleviates factual hallucinations and outperforms state-of-the-art methods.
[128] ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support
Tiantian Chen,Jiaqi Lu,Ying Shen,Lin Zhang
Main category: cs.CL
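TL;DR: This paper introduces the ES-MemEval benchmark and the EvoEmo dataset for evaluating five core memory capabilities of large language models in long-term emotional support dialogue. Experiments show that explicit long-term memory is essential for reducing hallucinations and enabling personalization, while RAG improves factual consistency but struggles with temporal dynamics and evolving user states.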
Details
Motivation: Existing long-term dialogue benchmarks mainly target static, explicit fact retrieval and cannot evaluate critical scenarios where user information is dispersed, implicit, and continuously evolving, a clear shortcoming for complex long-term services such as online emotional support. Method: Builds the ES-MemEval benchmark (covering five capabilities: information extraction, temporal reasoning, conflict detection, abstention, and user modeling) and the EvoEmo multi-session emotional support dataset (capturing fragmented, implicit user disclosures and evolving user states), and runs systematic experiments on open-source long-context models, commercial models, and RAG models. Result: Explicit long-term memory significantly reduces hallucinations and improves personalization; RAG strengthens factual consistency but struggles with temporal dynamics and evolving user states. Conclusion: Current models remain limited in long-term personalized dialogue by insufficient coordination between memory and retrieval, and a more robust paradigm for integrating the two is needed. Abstract: Large Language Models (LLMs) have shown strong potential as conversational agents. Yet, their effectiveness remains limited by deficiencies in robust long-term memory, particularly in complex, long-term web-based services such as online emotional support. Existing long-term dialogue benchmarks primarily focus on static and explicit fact retrieval, failing to evaluate agents in critical scenarios where user information is dispersed, implicit, and continuously evolving. To address this gap, we introduce ES-MemEval, a comprehensive benchmark that systematically evaluates five core memory capabilities: information extraction, temporal reasoning, conflict detection, abstention, and user modeling, in long-term emotional support settings, covering question answering, summarization, and dialogue generation tasks. To support the benchmark, we also propose EvoEmo, a multi-session dataset for personalized long-term emotional support that captures fragmented, implicit user disclosures and evolving user states. Extensive experiments on open-source long-context, commercial, and retrieval-augmented (RAG) LLMs show that explicit long-term memory is essential for reducing hallucinations and enabling effective personalization. At the same time, RAG improves factual consistency but struggles with temporal dynamics and evolving user states. These findings highlight both the potential and limitations of current paradigms and motivate more robust integration of memory and retrieval for long-term personalized dialogue systems.
[129] GuideWeb: A Benchmark for Automatic In-App Guide Generation on Real-World Web UIs
Chengguang Gan,Yoshihiro Tsujii,Yunhao Liang,Tatsunori Mori,Shiwen Ni,Hiroki Itoh
Main category: cs.CL
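TL;DR: This paper introduces the GuideWeb benchmark for evaluating automatic in-app guide generation on real-world web UIs, and designs the GuideWeb Agent model, which achieves initial results on guide target element selection, intent generation, and guide-text generation, although the task remains highly challenging.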
Details
Motivation: Existing Digital Adoption Platforms (DAPs) let non-experts author web operation guides, but because websites change frequently, maintaining those guides is costly and requires repeated manual annotation. Method: Builds the GuideWeb benchmark, which defines automatic guide generation as selecting guide target elements grounded in the webpage and generating concise guide text aligned with user intent; proposes joint evaluation metrics covering target element selection accuracy and the quality of generated intents and guide texts; and designs the GuideWeb Agent model for experimental validation. Result: GuideWeb Agent reaches 30.79% accuracy on guide target element prediction, a BLEU score of 44.94 on intent generation, and a BLEU score of 21.34 on guide-text generation; existing baselines perform substantially worse. Conclusion: Automatic web guide generation remains highly challenging; current methods are not yet sufficient for reliable real-world deployment, and further research is needed. Abstract: Digital Adoption Platforms (DAPs) provide web-based overlays that deliver operation guidance and contextual hints to help users navigate complex websites. Although modern DAP tools enable non-experts to author such guidance, maintaining these guides remains labor-intensive because website layouts and functionalities evolve continuously, which requires repeated manual updates and re-annotation. In this work, we introduce GuideWeb, a new benchmark for automatic in-app guide generation on real-world web UIs. GuideWeb formulates the task as producing page-level guidance by selecting guide target elements grounded in the webpage and generating concise guide text aligned with user intent. We also propose a comprehensive evaluation suite that jointly measures the accuracy of guide target element selection and the quality of generated intents and guide texts. Experiments show that our proposed GuideWeb Agent achieves 30.79% accuracy in guide target element prediction, while obtaining BLEU scores of 44.94 for intent generation and 21.34 for guide-text generation. Existing baselines perform substantially worse, which highlights that automatic guide generation remains challenging and that further advances are necessary before such systems can be reliably deployed in real-world settings.
[130] From Code-Centric to Concept-Centric: Teaching NLP with LLM-Assisted "Vibe Coding"
Hend Al-Khalifa
Main category: cs.CL
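TL;DR: This paper presents a teaching approach called "Vibe Coding" that uses large language models (LLMs) as coding assistants while emphasizing conceptual understanding and critical thinking. It is applied in a senior-level undergraduate NLP course, with reflection-based assessment and prompt logging used to safeguard learning outcomes.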
Details
Motivation: Respond to the challenges and opportunities that the rapid advance of large language models brings to NLP education, and explore how to use LLM-assisted programming without weakening students' grasp of core concepts or their critical thinking. Method: Implements the "Vibe Coding" approach in a senior-level undergraduate NLP course: students complete seven programming labs with LLMs, but are assessed mainly on conceptual mastery through critical reflection questions; prompt logging is mandatory, and end-of-course student feedback is collected and analyzed. Result: Feedback from 19 students shows high satisfaction (4.4-4.6/5.0), with particular appreciation for the reduced debugging burden and deeper conceptual learning; students also reported challenges with time pressure, verifying LLM output, and unclear task specifications. Conclusion: Integrating LLMs as an assistive tool in a structured way (with prompt logging and reflection-based assessment) can shift the focus of learning from syntactic fluency to conceptual mastery and help students adapt to an AI-augmented professional environment. Abstract: The rapid advancement of Large Language Models (LLMs) presents both challenges and opportunities for Natural Language Processing (NLP) education. This paper introduces "Vibe Coding," a pedagogical approach that leverages LLMs as coding assistants while maintaining focus on conceptual understanding and critical thinking. We describe the implementation of this approach in a senior-level undergraduate NLP course, where students completed seven labs using LLMs for code generation while being assessed primarily on conceptual understanding through critical reflection questions. Analysis of end-of-course feedback from 19 students reveals high satisfaction (mean scores 4.4-4.6/5.0) across engagement, conceptual learning, and assessment fairness. Students particularly valued the reduced cognitive load from debugging, enabling deeper focus on NLP concepts. However, challenges emerged around time constraints, LLM output verification, and the need for clearer task specifications. Our findings suggest that when properly structured with mandatory prompt logging and reflection-based assessment, LLM-assisted learning can shift focus from syntactic fluency to conceptual mastery, preparing students for an AI-augmented professional landscape.
[131] Breaking the Static Graph: Context-Aware Traversal for Robust Retrieval-Augmented Generation
Kwun Hang Lau,Fangyuan Zhang,Boyu Ruan,Yingli Zhou,Qintian Guo,Ruiyuan Zhang,Xiaofang Zhou
Main category: cs.CL
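TL;DR: CatRAG proposes a query-adaptive graph traversal framework that uses symbolic anchoring, dynamic edge weighting, and key-fact enhancement to address the semantic drift caused by static graphs in conventional KG-RAG, significantly improving the completeness of evidence chains in multi-hop reasoning.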
Details
Motivation: Existing knowledge-graph-based RAG methods (such as HippoRAG) rely on a static graph structure and fixed transition probabilities, ignoring how the query changes edge relevance; as a result, random walks tend to get trapped in high-centrality nodes and fail to retrieve the complete multi-hop evidence chain. Method: CatRAG builds a query-aware navigation graph on top of HippoRAG 2: (1) symbolic anchoring, which introduces weak entity constraints to regularize the walk; (2) query-aware dynamic edge weighting, which adjusts edge weights on the fly to prune irrelevant paths and strengthen intent-aligned ones; and (3) key-fact passage weight enhancement, a low-cost structural anchor toward likely evidence. Result: Consistently outperforms SOTA on four multi-hop benchmarks; gains on standard recall are modest, but gains on "reasoning completeness" (the ability to recover the full evidence path without gaps) are substantial. Conclusion: CatRAG effectively bridges the gap between retrieving partial context and fully grounded reasoning, showing that query-adaptive modeling of graph structure is crucial for multi-hop RAG. Abstract: Recent advances in Retrieval-Augmented Generation (RAG) have shifted from simple vector similarity to structure-aware approaches like HippoRAG, which leverage Knowledge Graphs (KGs) and Personalized PageRank (PPR) to capture multi-hop dependencies. However, these methods suffer from a "Static Graph Fallacy": they rely on fixed transition probabilities determined during indexing. This rigidity ignores the query-dependent nature of edge relevance, causing semantic drift where random walks are diverted into high-degree "hub" nodes before reaching critical downstream evidence. Consequently, models often achieve high partial recall but fail to retrieve the complete evidence chain required for multi-hop queries. To address this, we propose CatRAG, Context-Aware Traversal for robust RAG, a framework that builds on the HippoRAG 2 architecture and transforms the static KG into a query-adaptive navigation structure. We introduce a multi-faceted framework to steer the random walk: (1) Symbolic Anchoring, which injects weak entity constraints to regularize the random walk; (2) Query-Aware Dynamic Edge Weighting, which dynamically modulates graph structure to prune irrelevant paths while amplifying those aligned with the query's intent; and (3) Key-Fact Passage Weight Enhancement, a cost-efficient bias that structurally anchors the random walk to likely evidence. Experiments across four multi-hop benchmarks demonstrate that CatRAG consistently outperforms state-of-the-art baselines.
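The effect of query-aware edge weighting on a Personalized PageRank walk can be illustrated on a toy graph. The sketch below is not CatRAG's implementation: the graph, the function names `personalized_pagerank` and `reweight`, and the hard-coded relevance scores are all hypothetical, and the way CatRAG actually scores edges is not described in the abstract.

```python
from typing import Dict, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: raw edge weight}


def personalized_pagerank(graph: Graph, seeds: Dict[str, float],
                          damping: float = 0.85, iters: int = 100) -> Dict[str, float]:
    """Power iteration for PPR; assumes every node has at least one out-edge."""
    nodes = list(graph)
    # Row-normalize raw weights into transition probabilities.
    trans = {n: {m: w / sum(graph[n].values()) for m, w in graph[n].items()} for n in nodes}
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for src in nodes:
            for dst, p in trans[src].items():
                nxt[dst] += damping * rank[src] * p
        rank = nxt
    return rank


def reweight(graph: Graph, relevance: Dict[Tuple[str, str], float],
             floor: float = 0.05) -> Graph:
    """Scale each edge by a query-dependent relevance score (placeholder values)."""
    return {src: {dst: w * max(relevance.get((src, dst), floor), floor)
                  for dst, w in nbrs.items()}
            for src, nbrs in graph.items()}


# Toy graph: the seed entity "q" links to a high-degree hub and to a bridge
# node that leads to the evidence needed for a two-hop question.
graph: Graph = {
    "q": {"hub": 1.0, "bridge": 1.0},
    "hub": {"h1": 1.0, "h2": 1.0, "h3": 1.0, "q": 1.0},
    "h1": {"hub": 1.0}, "h2": {"hub": 1.0}, "h3": {"hub": 1.0},
    "bridge": {"evidence": 1.0, "q": 1.0},
    "evidence": {"bridge": 1.0},
}
# Hypothetical per-query scores: edges on the evidence path are judged relevant.
relevance = {("q", "bridge"): 1.0, ("bridge", "evidence"): 1.0, ("q", "hub"): 0.1}

static_rank = personalized_pagerank(graph, {"q": 1.0})
adaptive_rank = personalized_pagerank(reweight(graph, relevance), {"q": 1.0})
print("static:  ", sorted(static_rank, key=static_rank.get, reverse=True)[:3])
print("adaptive:", sorted(adaptive_rank, key=adaptive_rank.get, reverse=True)[:3])
```

On this toy graph the static walk gives the hub more mass than the evidence node, while the reweighted walk reverses that order, which mirrors the hub-drift problem the abstract describes. In a real system the relevance scores would come from a model conditioned on the query rather than a lookup table.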
Our analysis reveals that while standard Recall metrics show modest gains, CatRAG achieves substantial improvements in reasoning completeness, the capacity to recover the entire evidence path without gaps. These results reveal that our approach effectively bridges the gap between retrieving partial context and enabling fully grounded reasoning. Resources are available at https://github.com/kwunhang/CatRAG.
[132] Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition
Wonjun Lee,Hyounghun Kim,Gary Geunbae Lee
Main category: cs.CL
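TL;DR: This paper proposes Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision that combines accent-aware routing with label-free routing to make ASR systems more robust to both seen and unseen accents, significantly reducing WER on the Mcv-Accent benchmark.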
Details
Motivation: Existing ASR models degrade severely on accented speech: accent-agnostic methods struggle with heavy or unseen accents, while accent-specific methods depend on limited and noisy accent labels. Method: Proposes Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision; it uses accent-aware routing during training (to promote expert specialization) and label-free routing at inference (to improve generalization); each expert has its own CTC head to align routing with transcription quality, and a routing-augmented loss stabilizes optimization. Result: On the Mcv-Accent benchmark, Moe-Ctc yields consistent gains across low- and high-resource conditions and seen and unseen accents, with up to 29.3% relative WER reduction over FastConformer baselines. Conclusion: By jointly optimizing expert specialization and generalization, Moe-Ctc eases the performance bottleneck caused by accent diversity in ASR and improves cross-accent robustness without requiring precise accent labels. Abstract: Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, which gradually transitions to label-free routing for inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the Mcv-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.
[133] Orthogonal Hierarchical Decomposition for Structure-Aware Table Understanding with Large Language Models
Bin Cao,Huixian Lu,Chenwen Ma,Ting Wang,Ruizhe Li,Jing Fan
Main category: cs.CL
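TL;DR: This paper proposes an Orthogonal Hierarchical Decomposition (OHD) framework that uses Orthogonal Tree Induction (OTI) to decompose a complex table into two hierarchical trees, one for rows and one for columns, and designs a dual-pathway association protocol with an LLM-based semantic arbitration mechanism to improve understanding of and question answering over complex tables.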
Details
Motivation: Complex tables (multi-level headers, merged cells, heterogeneous layouts) are hard to model accurately with existing linearized or grid-based representations, which leads to misalignment between structural semantics and textual representations. Method: Proposes the Orthogonal Hierarchical Decomposition (OHD) framework: (1) Orthogonal Tree Induction (OTI) based on spatial-semantic co-constraints, which produces a column tree and a row tree; (2) a dual-pathway association protocol that reconstructs the semantic lineage of each cell; and (3) an LLM used as a semantic arbitrator to align multi-level semantic information. Result: On two complex table question answering benchmarks, AITQA and HiTab, OHD consistently outperforms existing representation paradigms across multiple metrics. Conclusion: By explicitly modeling the orthogonal hierarchical structure of tables and their cross-dimensional dependencies, OHD significantly improves LLM understanding of and reasoning over complex tables and offers a new paradigm for structured data understanding. Abstract: Complex tables with multi-level headers, merged cells and heterogeneous layouts pose persistent challenges for LLMs in both understanding and reasoning. Existing approaches typically rely on table linearization or normalized grid modeling. However, these representations struggle to explicitly capture hierarchical structures and cross-dimensional dependencies, which can lead to misalignment between structural semantics and textual representations for non-standard tables. To address this issue, we propose an Orthogonal Hierarchical Decomposition (OHD) framework that constructs structure-preserving input representations of complex tables for LLMs. OHD introduces an Orthogonal Tree Induction (OTI) method based on spatial-semantic co-constraints, which decomposes irregular tables into a column tree and a row tree to capture vertical and horizontal hierarchical dependencies, respectively. Building on this representation, we design a dual-pathway association protocol to symmetrically reconstruct the semantic lineage of each cell, and incorporate an LLM as a semantic arbitrator to align multi-level semantic information. We evaluate the OHD framework on two complex table question answering benchmarks, AITQA and HiTab. Experimental results show that OHD consistently outperforms existing representation paradigms across multiple evaluation metrics.
[134] Beyond Local Edits: Embedding-Virtualized Knowledge for Broader Evaluation and Preservation of Model Editing
Shuainan Liu,Xuanang Chen,Ben He,Le Sun
Main category: cs.CL
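TL;DR: This paper proposes Embedding-Virtualized Knowledge (EVK), which characterizes model knowledge through perturbations in embedding space, builds EVK-Bench to evaluate the knowledge drift caused by editing, and designs the EVK-Align module to suppress that drift, improving knowledge preservation without lowering editing accuracy.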
Details
Motivation: Evaluation of existing knowledge editing methods for large language models is confined to a limited set of predefined benchmarks and does not reflect the broad impact of editing on the model's overall knowledge system. Method: Proposes Embedding-Virtualized Knowledge (EVK) to characterize model knowledge, builds the embedding-level evaluation benchmark EVK-Bench, and designs a plug-and-play EVK-Align module that constrains embedding-level knowledge drift during editing. Result: Experiments show that the approach evaluates knowledge editing more comprehensively and significantly improves knowledge preservation without harming editing accuracy. Conclusion: The EVK framework offers a broader and more fundamental perspective for evaluating and optimizing knowledge editing, compensating for the limitations of conventional sample-level evaluation. Abstract: Knowledge editing methods for large language models are commonly evaluated using predefined benchmarks that assess edited facts together with a limited set of related or neighboring knowledge. While effective, such evaluations remain confined to finite, dataset-bounded samples, leaving the broader impact of editing on the model's knowledge system insufficiently understood. To address this gap, we introduce Embedding-Virtualized Knowledge (EVK) that characterizes model knowledge through controlled perturbations in embedding space, enabling the exploration of a substantially broader and virtualized knowledge region beyond explicit data annotations. Based on EVK, we construct an embedding-level evaluation benchmark EVK-Bench that quantifies potential knowledge drift induced by editing, revealing effects that are not captured by conventional sample-based metrics. Furthermore, we propose a plug-and-play EVK-Align module that constrains embedding-level knowledge drift during editing and can be seamlessly integrated into existing editing methods. Experiments demonstrate that our approach enables more comprehensive evaluation while significantly improving knowledge preservation without sacrificing editing accuracy.
[135] S3-CoT: Self-Sampled Succinct Reasoning Enables Efficient Chain-of-Thought LLMs
Yanrui Du,Sendong Zhao,Yibo Gao,Danyang Zhao,Qika Lin,Ming Ma,Jiayun Li,Yi Jiang,Kai He,Qianyi Xu,Bing Qin,Mengling Feng
Main category: cs.CL
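TL;DR: This paper proposes S3-CoT, a self-sampling framework based on activation steering that lets a large language model generate style-consistent, variable-length chains of thought on its own, without teacher supervision or gold answers, enabling efficient reasoning learning with a human-like dual-cognitive system.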
Details
Motivation: Existing chain-of-thought (CoT) methods often contain redundant reasoning and depend on high-quality supervision data; inspired by fast, intuitive human System 1 thinking, the paper asks whether LLMs can acquire a similar "fast-thinking" mode. Method: Proposes the S3-CoT framework: (1) self-sampling based on activation steering, which elicits reasoning traces of varied length and aligned style from the target LLM itself; (2) supervised fine-tuning (SFT) on data filtered by gold answers, with a human-like dual-cognitive system and a progressive compression curriculum; and (3) a self-evolution regime that needs no gold answers and drives SFT solely with prediction-consistent variable-length reasoning data. Result: Yields stable improvements on math benchmarks and on cross-domain medical tasks, for both general and R1-style LLMs; substantially eases the supervision-data scarcity bottleneck and supports label-free self-evolution. Conclusion: LLMs can acquire efficient, flexible "fast-thinking" reasoning through self-sampling and self-evolution, and S3-CoT offers a new paradigm for lightweight, robust, and scalable CoT learning. Abstract: Large language models (LLMs) equipped with chain-of-thought (CoT) achieve strong performance and offer a window into LLM behavior. However, recent evidence suggests that improvements in CoT capabilities often come with redundant reasoning processes, motivating a key question: Can LLMs acquire a fast-thinking mode analogous to human System 1 reasoning? To explore this, our study presents a self-sampling framework based on activation steering for efficient CoT learning. Our method can induce style-aligned and variable-length reasoning traces from target LLMs themselves without any teacher guidance, thereby alleviating a central bottleneck of SFT-based methods: the scarcity of high-quality supervision data. Using data filtered by gold answers, we perform SFT for efficient CoT learning with (i) a human-like dual-cognitive system, and (ii) a progressive compression curriculum. Furthermore, we explore a self-evolution regime in which SFT is driven solely by prediction-consistent data of variable-length variants, eliminating the need for gold answers. Extensive experiments on math benchmarks, together with cross-domain generalization tests in medicine, show that our method yields stable improvements for both general and R1-style LLMs. Our data and model checkpoints can be found at https://github.com/DYR1/S3-CoT.
[136] From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs
Yanrui Du,Yibo Gao,Sendong Zhao,Jiayun Li,Haochun Wang,Qika Lin,Kai He,Bing Qin,Mengling Feng
Main category: cs.CL
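TL;DR: This paper uses the logit lens to analyze self-reflection in R1-style large language models and finds a structured, three-stage activation trajectory: latent-control layers (encoding thinking budget), semantic-pivot layers (where turning-point and summarization cues appear), and behavior-overt layers (where the probability of reflection-behavior tokens rises). It verifies a causal chain across these stages, revealing a human-like meta-cognitive process.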
Details
Motivation: R1-style large language models exhibit self-reflection, but the underlying mechanism is unclear; this paper aims to uncover where the behavior originates and how it develops across layers. Method: Uses the logit lens to read out token-level semantics layer by layer, identifies the layer-wise trajectory along which reflection behavior is activated, and verifies the causal relationships between stages through targeted intervention experiments. Result: Finds a three-stage structured activation pattern: latent-control layers encode thinking budget, semantic-pivot layers surface discourse-level cues (such as turning points and summarization), and in behavior-overt layers the sampling probability of reflection tokens rises markedly; experiments confirm a causal chain among the three. Conclusion: Self-reflection in R1-style LLMs follows a human-like meta-cognitive path, from latent monitoring, to discourse-level regulation, and finally to overt reflection; the process has an interpretable and intervenable structural basis. Abstract: R1-style LLMs have attracted growing attention for their capacity for self-reflection, yet the internal mechanisms underlying such behavior remain unclear. To bridge this gap, we anchor on the onset of reflection behavior and trace its layer-wise activation trajectory. Using the logit lens to read out token-level semantics, we uncover a structured progression: (i) Latent-control layers, where an approximate linear direction encodes the semantics of thinking budget; (ii) Semantic-pivot layers, where discourse-level cues, including turning-point and summarization cues, surface and dominate the probability mass; and (iii) Behavior-overt layers, where the likelihood of reflection-behavior tokens begins to rise until they become highly likely to be sampled. Moreover, our targeted interventions uncover a causal chain across these stages: prompt-level semantics modulate the projection of activations along latent-control directions, thereby inducing competition between turning-point and summarization cues in semantic-pivot layers, which in turn regulates the sampling likelihood of reflection-behavior tokens in behavior-overt layers. Collectively, our findings suggest a human-like meta-cognitive process, progressing from latent monitoring, to discourse-level regulation, and finally to overt self-reflection. Our analysis code can be found at https://github.com/DYR1/S3-CoT.
[137] Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
Zhanghao Hu,Qinglin Zhu,Hanqi Yan,Yulan He,Lin Gui
Main category: cs.CL
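TL;DR: This paper proposes xMemory, a new retrieval method for agent memory systems that builds a semantic hierarchy through a decoupling-to-aggregation framework and replaces the similarity retrieval of conventional RAG, improving answer quality and token efficiency on multi-fact queries.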
Details
Motivation: Conventional RAG performs poorly in agent memory settings because it assumes a large, heterogeneous corpus, whereas agent memory is a bounded, coherent, highly correlated and often duplicated dialogue stream; fixed top-k similarity retrieval therefore returns redundant context, and post-hoc pruning can discard temporally dependent key information. Method: Proposes xMemory: memories are decoupled into semantic components and organized into a hierarchy (high-level themes, then episodes, then raw messages), with a joint sparsity-semantics objective guiding memory split and merge; at inference, retrieval proceeds top-down and expands to lower-level detail only when doing so reduces the reader's uncertainty. Result: On the LoCoMo and PerLTQA datasets, xMemory consistently improves answer quality and token efficiency across three recent large language models. Conclusion: Retrieval for agent memory should move beyond similarity matching toward a principled decoupling-to-aggregation mechanism built on a semantic hierarchy, and xMemory provides an effective realization of it. Abstract: Agent memory systems often adopt the standard Retrieval-Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting. RAG targets large, heterogeneous corpora where retrieved passages are diverse, whereas agent memory is a bounded, coherent dialogue stream with highly correlated spans that are often duplicates. Under this shift, fixed top-k similarity retrieval tends to return redundant context, and post-hoc pruning can delete temporally linked prerequisites needed for correct reasoning. We argue retrieval should move beyond similarity matching and instead operate over latent components, following decoupling to aggregation: disentangle memories into semantic components, organise them into a hierarchy, and use this structure to drive retrieval. We propose xMemory, which builds a hierarchy of intact units and maintains a searchable yet faithful high-level node organisation via a sparsity-semantics objective that guides memory split and merge. At inference, xMemory retrieves top-down, selecting a compact, diverse set of themes and semantics for multi-fact queries, and expanding to episodes and raw messages only when it reduces the reader's uncertainty. Experiments on LoCoMo and PerLTQA across the three latest LLMs show consistent gains in answer quality and token efficiency.
[138] NEAT: Neuron-Based Early Exit for Large Reasoning Models
Kang Liu,Yongkang Liu,Xiaocui Yang,Peidong Wang,Wen Zhang,Shi Feng,Yifei Zhang,Daling Wang
Main category: cs.CL
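TL;DR: This paper proposes the NEAT framework, which monitors neuron activation dynamics to enable training-free early exit, effectively mitigating "overthinking" in large reasoning models and cutting token consumption by 22% to 28% on average while maintaining accuracy.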
Details
Motivation: Large Reasoning Models (LRMs) suffer from "overthinking", generating redundant reasoning steps after a correct answer has already been reached; existing early-exit methods rely on output-level heuristics or require extra computation or labeled data, which limits their efficiency or generality. Method: Proposes NEAT, a neuron-based early reasoning exit framework: it identifies exit-associated neurons, tracks their activation patterns in real time, and uses them to trigger early exit or suppress reflection dynamically, with no training and no additional test-time computation. Result: Experiments on four reasoning benchmarks with six models of different scales and architectures show that, for each model, NEAT reduces tokens by 22% to 28% on average across the four benchmarks while maintaining accuracy. Conclusion: NEAT is an efficient, general, training-free early-exit method that significantly reduces redundant reasoning overhead through neuron-level dynamic monitoring, offering a new route to more efficient inference in large models. Abstract: Large Reasoning Models (LRMs) often suffer from overthinking, a phenomenon in which redundant reasoning steps are generated after a correct solution has already been reached. Existing early reasoning exit methods primarily rely on output-level heuristics or trained probing models to skip redundant reasoning steps, thereby mitigating overthinking. However, these approaches typically require additional rollout computation or externally labeled datasets. In this paper, we propose NEAT, a Neuron-based Early reAsoning exiT framework that monitors neuron-level activation dynamics to enable training-free early exits, without introducing additional test-time computation. NEAT identifies exit-associated neurons and tracks their activation patterns during reasoning to dynamically trigger early exit or suppress reflection, thereby reducing unnecessary reasoning while preserving solution quality. Experiments on four reasoning benchmarks across six models with different scales and architectures show that, for each model, NEAT achieves an average token reduction of 22% to 28% when averaged over the four benchmarks, while maintaining accuracy.
[139] WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora
Pengyu Wang,Benfeng Xu,Licheng Zhang,Shaohan Wang,Mingxuan Du,Chiwei Zhu,Zhendong Mao
Main category: cs.CL
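TL;DR: This paper introduces WildGraphBench, a benchmark for evaluating graph-based retrieval-augmented generation (GraphRAG) in realistic settings. It is built on the structure of Wikipedia, covers 12 topic categories and 1,100 questions, and spans three task types. Experiments show that current GraphRAG helps with multi-source fact aggregation but is weaker on fine-grained summarization.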
Details
Motivation: Existing GraphRAG benchmarks mostly rely on short, curated passages and cannot adequately evaluate performance in realistic settings with long contexts and large-scale heterogeneous documents. Method: Builds the WildGraphBench benchmark: Wikipedia articles and their external references serve as the retrieval corpus, citation-linked statements serve as ground truth, 12 top-level topics are covered, and 1,100 questions are generated at three levels of complexity: single-fact QA, multi-fact QA, and section-level summarization. Result: Experiments show that current GraphRAG is effective on multi-fact aggregation when evidence comes from a moderate number of sources, but performance drops on summarization tasks that require preserving fine-grained details, suggesting that its aggregation paradigm may overemphasize high-level statements. Conclusion: WildGraphBench fills a gap in evaluating GraphRAG under realistic, complex conditions, exposes the limitations of existing methods in integrating fine-grained information, and provides a more challenging testbed for future research. Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph, enabling efficient retrieval and aggregation of scattered evidence across multiple documents. However, many existing benchmarks for GraphRAG rely on short, curated passages as external knowledge, failing to adequately evaluate systems in realistic settings involving long contexts and large-scale heterogeneous documents. To bridge this gap, we introduce WildGraphBench, a benchmark designed to assess GraphRAG performance in the wild. We leverage Wikipedia's unique structure, where cohesive narratives are grounded in long and heterogeneous external reference documents, to construct a benchmark reflecting real-world scenarios. Specifically, we sample articles across 12 top-level topics, using their external references as the retrieval corpus and citation-linked statements as ground truth, resulting in 1,100 questions spanning three levels of complexity: single-fact QA, multi-fact QA, and section-level summarization. Experiments across multiple baselines reveal that current GraphRAG pipelines help on multi-fact aggregation when evidence comes from a moderate number of sources, but this aggregation paradigm may overemphasize high-level statements at the expense of fine-grained details, leading to weaker performance on summarization tasks. Project page: https://github.com/BstWPY/WildGraphBench.
[140] Closing the Loop: Universal Repository Representation with RPG-Encoder
Jane Luo,Chengyu Yin,Xin Zhang,Qingtao Li,Steven Liu,Yiming Huang,Jie Wu,Hao Liu,Yangyu Huang,Yu Kang,Fangkai Yang,Ying Xin,Scarlett Li
Main category: cs.CL
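TL;DR: This paper proposes RPG-Encoder, a framework that generalizes the Repository Planning Graph (RPG) from a static generative blueprint into a unified, high-fidelity representation. It bridges the reasoning disconnect between code comprehension and generation, with state-of-the-art results on SWE-bench and high reconstruction coverage on RepoCraft.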
Details
Motivation: Existing repository agents rely on isolated API documentation or dependency graphs that lack semantic depth, which causes a reasoning disconnect. The authors treat repository comprehension and generation as inverse processes that should be modeled jointly.
Method: RPG-Encoder (1) encodes raw code into an RPG that combines lifted semantic features with code dependencies; (2) evolves the RPG topology incrementally to decouple maintenance cost from repository scale; and (3) serves as a unified interface for structure-aware navigation.
Result: 93.7% Acc@5 on SWE-bench Verified (state of the art); more than 10% above the best baseline on SWE-bench Live Lite; 98.5% reconstruction coverage on RepoCraft, confirming the fidelity of the RPG.
Conclusion: RPG-Encoder closes the loop between comprehension and generation, improves fine-grained localization accuracy in complex codebases, and supports the RPG as an effective, scalable unified representation.
Abstract: Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: generation expands intent into implementation, while comprehension compresses implementation back into intent. To address this, we propose RPG-Encoder, a framework that generalizes the Repository Planning Graph (RPG) from a static generative blueprint into a unified, high-fidelity representation. RPG-Encoder closes the reasoning loop through three mechanisms: (1) Encoding raw code into the RPG that combines lifted semantic features with code dependencies; (2) Evolving the topology incrementally to decouple maintenance costs from repository scale, reducing overhead by 95.7%; and (3) Operating as a unified interface for structure-aware navigation. In evaluations, RPG-Encoder establishes state-of-the-art repository understanding on SWE-bench Verified with 93.7% Acc@5 and exceeds the best baseline by over 10% on SWE-bench Live Lite. These results highlight our superior fine-grained localization accuracy in complex codebases. Furthermore, it achieves 98.5% reconstruction coverage on RepoCraft, confirming RPG's high-fidelity capacity to mirror the original codebase and closing the loop between intent and implementation.
[141] LEC-KG: An LLM-Embedding Collaborative Framework for Domain-Specific Knowledge Graph Construction -- A Case Study on SDGs
Yikai Zeng,Yingchao Piao,Jianhui Li
Main category: cs.CL
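TL;DR: This paper proposes LEC-KG, a framework in which a large language model (LLM) and knowledge graph embeddings (KGE) collaborate bidirectionally to improve domain-specific knowledge graph construction, with particularly strong gains on long-tail relations.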
Details
Motivation: Building domain-specific knowledge graphs is hampered by heterogeneous entity mentions, long-tail relation distributions, and the absence of standardized schemas.
Method: LEC-KG is a bidirectional collaborative framework with (1) hierarchical coarse-to-fine relation extraction to mitigate long-tail bias; (2) evidence-guided Chain-of-Thought feedback that grounds structural suggestions in the source text; and (3) semantic initialization that enables structural validation for unseen entities. The LLM and KGE modules improve each other iteratively.
Result: On Chinese Sustainable Development Goal (SDG) reports, LEC-KG substantially outperforms LLM-only baselines, especially on low-frequency relations.
Conclusion: LEC-KG reliably turns unstructured policy text into validated knowledge graph triples and copes well with long-tail relations and unseen entities.
Abstract: Constructing domain-specific knowledge graphs from unstructured text remains challenging due to heterogeneous entity mentions, long-tail relation distributions, and the absence of standardized schemas. We present LEC-KG, a bidirectional collaborative framework that integrates the semantic understanding of Large Language Models (LLMs) with the structural reasoning of Knowledge Graph Embeddings (KGE). Our approach features three key components: (1) hierarchical coarse-to-fine relation extraction that mitigates long-tail bias, (2) evidence-guided Chain-of-Thought feedback that grounds structural suggestions in source text, and (3) semantic initialization that enables structural validation for unseen entities. The two modules enhance each other iteratively: KGE provides structure-aware feedback to refine LLM extractions, while validated triples progressively improve KGE representations. We evaluate LEC-KG on Chinese Sustainable Development Goal (SDG) reports, demonstrating substantial improvements over LLM baselines, particularly on low-frequency relations. Through iterative refinement, our framework reliably transforms unstructured policy text into validated knowledge graph triples.
[142] Think Dense, Not Long: Dynamic Decoupled Conditional Advantage for Efficient Reasoning
Keqin Peng,Yuanxin Ouyang,Xuebo Liu,Zhiliang Tian,Ruijian Han,Yancheng Yuan,Liang Ding
Main category: cs.CL
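TL;DR: This paper proposes DDCA to address the accuracy drop caused by length penalties in RLVR. It computes the length advantage conditionally within the cluster of correct responses and adjusts the penalty strength dynamically based on the group pass rate, improving the efficiency-accuracy trade-off.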
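The two ideas named in this summary (a length baseline computed only over correct responses, and a penalty scaled by the group pass rate) can be illustrated with a toy sketch. This is not the paper's formula: the function name, the `max_penalty` parameter, the z-score normalization, and the linear pass-rate scaling are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def ddca_style_advantages(correct, lengths, max_penalty=1.0):
    """Toy per-response advantages for one sampled group.

    correct: list[bool]; lengths: list[int] (token counts).
    The normalization and the linear pass-rate scaling are illustrative
    assumptions, not the published DDCA objective.
    """
    n = len(correct)
    rewards = [1.0 if c else 0.0 for c in correct]
    r_mean, r_std = mean(rewards), pstdev(rewards)
    # Group-relative correctness advantage (zero when all rewards are equal).
    acc_adv = [(r - r_mean) / r_std if r_std > 0 else 0.0 for r in rewards]

    # Length term: the baseline uses correct responses only, so incorrect
    # responses cannot dilute it. Shorter-than-average correct answers
    # receive a positive length advantage.
    ok_lengths = [l for c, l in zip(correct, lengths) if c]
    len_adv = [0.0] * n
    if len(ok_lengths) > 1:
        l_mean, l_std = mean(ok_lengths), pstdev(ok_lengths)
        if l_std > 0:
            len_adv = [-(l - l_mean) / l_std if c else 0.0
                       for c, l in zip(correct, lengths)]

    # Pass rate as a difficulty proxy: easier groups get a stronger penalty.
    pass_rate = sum(correct) / n
    scale = max_penalty * pass_rate
    return [a + scale * b for a, b in zip(acc_adv, len_adv)]


advantages = ddca_style_advantages([True, True, False, True],
                                   [100, 300, 50, 200])
```

In this toy group the shortest correct response receives the largest advantage, the longest correct response is penalized but still ranks above the incorrect one, and the incorrect response's advantage does not depend on its length.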
Details
Motivation: RLVR elicits multi-step reasoning but tends to produce overly long reasoning traces, and naive length penalties in group-relative optimization can severely hurt accuracy because of two structural issues: dilution of the length baseline and a mismatch between difficulty and penalty.
Method: Dynamic Decoupled Conditional Advantage (DDCA) computes the length advantage conditionally within the correct-response cluster to remove baseline dilution, and scales the penalty strength dynamically using the group pass rate as a proxy for difficulty.
Result: On GSM8K, MATH500, AMC23, and AIME25, DDCA improves the efficiency-accuracy trade-off relative to adaptive baselines, cutting generated tokens by about 60% on easier tasks (e.g., GSM8K) and by more than 20% on harder ones (e.g., AIME25) while maintaining or improving accuracy.
Conclusion: DDCA decouples efficiency optimization from correctness and offers a way to control reasoning length in RLVR that balances conciseness and reliability.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) can elicit strong multi-step reasoning, yet it often encourages overly verbose traces. Moreover, naive length penalties in group-relative optimization can severely hurt accuracy. We attribute this failure to two structural issues: (i) Dilution of Length Baseline, where incorrect responses (with zero length reward) depress the group baseline and over-penalize correct solutions; and (ii) Difficulty-Penalty Mismatch, where a static penalty cannot adapt to problem difficulty, suppressing necessary reasoning on hard instances while leaving redundancy on easy ones. We propose Dynamic Decoupled Conditional Advantage (DDCA) to decouple efficiency optimization from correctness. DDCA computes length advantages conditionally within the correct-response cluster to eliminate baseline dilution, and dynamically scales the penalty strength using the group pass rate as a proxy for difficulty. Experiments on GSM8K, MATH500, AMC23, and AIME25 show that DDCA consistently improves the efficiency-accuracy trade-off relative to adaptive baselines, reducing generated tokens by approximately 60% on simpler tasks (e.g., GSM8K) and by over 20% on harder benchmarks (e.g., AIME25), while maintaining or improving accuracy. Code is available at https://github.com/alphadl/DDCA.
[143] Dicta-LM 3.0: Advancing The Frontier of Hebrew Sovereign LLMs
Shaltiel Shmidman,Avi Shmidman,Amir DN Cohen,Moshe Koppel
Main category: cs.CL
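TL;DR: This paper introduces Dicta-LM 3.0, an open-weight suite of Hebrew-English bilingual LLMs in multiple sizes with long-context support (65k tokens), and builds a new benchmark suite for evaluating Hebrew chat LLMs.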
Details
Motivation: Low-resource languages such as Hebrew lack high-quality sovereign LLMs, and existing open models do not meet local needs, which calls for language-specific models and a reusable training and evaluation framework.
Method: Adapt three base models (Mistral-Small-3.1, NVIDIA Nemotron Nano V2, and Qwen3-1.7B) into Hebrew-English models of 24B, 12B, and 1.7B parameters. All models have a native context length of 65k tokens, and the chat variants support tool calling. A Hebrew benchmark suite covers Translation, Summarization, Winograd, Israeli Trivia, and Diacritization (nikud).
Result: Dicta-LM 3.0 is released in three sizes and multiple variants (base, and chat with tool-calling support), together with a new benchmark suite for Hebrew chat LLMs; the adaptation framework is presented as reusable for other non-English languages.
Conclusion: Dicta-LM 3.0 addresses the shortage of Hebrew LLMs and offers an extensible approach to training and evaluating LLMs for low-resource languages, contributing to multilingual NLP.
Abstract: Open-weight LLMs have been released by frontier labs; however, sovereign Large Language Models (for languages other than English) remain low in supply yet high in demand. Training large language models (LLMs) for low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce Dicta-LM 3.0: an open-weight collection of LLMs trained on substantially-sized corpora of Hebrew and English texts. The model is released in three sizes: 24B - adapted from the Mistral-Small-3.1 base model, 12B - adapted from the NVIDIA Nemotron Nano V2 model, and 1.7B - adapted from the Qwen3-1.7B base model. We are releasing multiple variants of each model, each with a native context length of 65k tokens; base model and chat model with tool-calling support. To rigorously evaluate our models, we introduce a new benchmark suite for evaluation of Hebrew chat-LLMs, covering a diverse set of tasks including Translation, Summarization, Winograd, Israeli Trivia, and Diacritization (nikud). Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.
[144] Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts
Wenhao Li,Daohai Yu,Gen Luo,Yuxin Zhang,Fei Chao,Rongrong Ji,Yifan Wu,Jiaxin Liu,Ziyang Gong,Zimu Liao
Main category: cs.CL
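TL;DR: This paper proposes OOMB, a training system that keeps activation memory constant through chunk-recurrent training with on-the-fly activation recomputation, and combines paged memory management, asynchronous CPU offloading, and page-level sparse attention to sharply reduce GPU memory overhead in long-context LLM training. It enables Qwen2.5-7B to be trained with a 4M-token context on a single GPU.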
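As a back-of-the-envelope check on the reported scaling (about 10MB of additional end-to-end training memory per 10K extra context tokens for Qwen2.5-7B), a linear extrapolation to a 4M-token context looks like this. The function and parameter names are hypothetical, and the sketch assumes decimal units and that the growth stays linear over the whole range:

```python
def extra_memory_mb(context_tokens, mb_per_step=10, tokens_per_step=10_000):
    """Linear extrapolation of context-dependent memory growth.

    Assumes the reported ~10 MB per additional 10K tokens holds across the
    whole range; this is an extrapolation, not a measured figure.
    """
    return context_tokens / tokens_per_step * mb_per_step


print(extra_memory_mb(4_000_000))  # 4000.0 MB, i.e. roughly 4 GB
```

This covers only the context-dependent growth. Model weights, optimizer state, and other fixed costs come on top of it.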
Details
Motivation: Training long-context LLMs is limited by GPU memory, especially activation memory that grows linearly with sequence length, rather than by training time.
Method: OOMB uses a chunk-recurrent training framework with on-the-fly activation recomputation so that activation memory is O(1), and integrates three complementary optimizations: paged memory management for the KV cache and its gradients, asynchronous CPU offloading, and page-level sparse attention.
Result: For Qwen2.5-7B, end-to-end training memory grows by only about 10MB for every additional 10K tokens of context. The model can be trained with a 4M-token context on a single H200 GPU, without the large cluster that context parallelism would otherwise require.
Conclusion: OOMB substantially improves resource efficiency for long-context LLM training.
Abstract: Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. The synergy of these techniques yields exceptional efficiency. Our empirical results show that for every additional 10K tokens of context, the end-to-end training memory overhead increases by a mere 10MB for Qwen2.5-7B. This allows training Qwen2.5-7B with a 4M-token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long-context LLM training. The source code is available at https://github.com/wenhaoli-xmu/OOMB.
[145] There Is More to Refusal in Large Language Models than a Single Direction
Faaiz Joad,Majd Hawasly,Sabri Boughorbel,Nadir Durrani,Husrev Taha Sencar
Main category: cs.CL
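TL;DR: This paper challenges the earlier view that refusal in large language models is controlled by a single activation-space direction. Different types of refusal correspond to geometrically distinct directions, yet linear steering along any refusal-related direction yields a similar trade-off between refusal and over-refusal, acting as a shared one-dimensional control knob.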
Details
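Motivation: Prior work holds that refusal in LLMs is mediated by a single activation-space direction. This paper tests whether that account is complete and examines how different types of refusal are represented in activation space.
Method: Analyze eleven categories of refusal and non-compliance (including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal), examine the geometry of their directions in activation space, and use linear steering experiments to assess how each direction affects refusal behavior.
Result: Different categories of refusal correspond to geometrically distinct directions. Linear steering along any refusal-related direction nonetheless produces nearly identical trade-offs between refusal and over-refusal. The main effect of the direction is on how the model refuses, not whether it refuses.
Conclusion: The single-direction account is incomplete. Refusal directions are diverse, but steering along them behaves like a shared one-dimensional control; the choice of direction shapes the manner of refusal more than the decision to refuse.
Abstract: Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.
[146] Quantifying the Gap between Understanding and Generation within Unified Multimodal Models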
Chenlong Wang,Yuhang Chen,Zhihan Hu,Dongping Chen,Wenhu Chen,Sarah Wiegreffe,Tianyi Zhou
Main category: cs.CL
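TL;DR: This paper introduces GapEval, a benchmark for measuring the gap and the cognitive coherence between the understanding and generation directions of unified multimodal models (UMMs). It finds that current UMMs achieve only surface-level unification, with knowledge remaining disjoint and unsynchronized across modalities.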
Details
Motivation: Determine whether understanding and generation in unified multimodal models (UMMs) are genuinely aligned and integrated rather than only superficially unified.
Method: Build GapEval, a bidirectional benchmark in which each question can be answered in both the image and text modalities, to quantify cross-modal consistency and cognitive coherence; in addition, conduct an empirical study from the perspective of knowledge manipulation.
Result: A persistent gap between the two directions appears across a wide range of UMMs. Knowledge remains disjoint across modalities, and capability emergence is not synchronized with knowledge across modalities.
Conclusion: Current UMMs have not reached deep cognitive unification; their unification is largely surface-level, and mechanisms for synchronizing and integrating knowledge across modalities need further study.
Abstract: Recent advances in unified multimodal models (UMM) have demonstrated remarkable progress in both understanding and generation tasks. However, whether these two capabilities are genuinely aligned and integrated within a single model remains unclear. To investigate this question, we introduce GapEval, a bidirectional benchmark designed to quantify the gap between understanding and generation capabilities, and quantitatively measure the cognitive coherence of the two "unified" directions. Each question can be answered in both modalities (image and text), enabling a symmetric evaluation of a model's bidirectional inference capability and cross-modal consistency. Experiments reveal a persistent gap between the two directions across a wide range of UMMs with different architectures, suggesting that current models achieve only surface-level unification rather than deep cognitive convergence of the two. To further explore the underlying mechanism, we conduct an empirical study from the perspective of knowledge manipulation to illustrate the underlying limitations. Our findings indicate that knowledge within UMMs often remains disjoint. The capability emergence and knowledge across modalities are unsynchronized, paving the way for further exploration.
[147] Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing
Lingkun Long,Yushi Huang,Shihao Bai,Ruihao Gong,Jun Zhang,Ao Zhou,Jianlei Yang
Main category: cs.CL
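TL;DR: This paper proposes Focus-dLLM, a training-free attention sparsification framework that uses token confidence from past steps to predict unmasked regions and combines it with a sink-aware pruning strategy, substantially speeding up long-context dLLM inference while preserving performance.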
Details
Motivation: Diffusion large language models (dLLMs) handle long contexts well, but bidirectional full attention is computationally expensive. Existing sparse attention methods are ineffective because they must estimate attention importance for tokens not yet decoded, while the positions of unmasked tokens are unknown during diffusion.
Method: Based on the finding that token confidence correlates strongly across adjacent diffusion steps, design a past-confidence-guided indicator to predict unmasked regions; then apply a sink-aware pruning strategy that removes redundant attention computation while preserving influential attention sinks, reusing sink locations across layers to reduce overhead.
Result: More than 29× lossless speedup at a 32K context length.
Conclusion: Focus-dLLM is an efficient, accurate, and training-free attention sparsification approach for dLLMs that substantially improves long-context inference efficiency; the code is publicly available.
Abstract: Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits the inference efficiency. Although sparse attention is promising, existing methods remain ineffective. This stems from the need to estimate attention importance for tokens yet to be decoded, while the unmasked token positions are unknown during diffusion. In this paper, we present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence strongly correlates across adjacent steps, we first design a past confidence-guided indicator to predict unmasked regions. Built upon this, we propose a sink-aware pruning strategy to accurately estimate and remove redundant attention computation, while preserving highly influential attention sinks. To further reduce overhead, this strategy reuses identified sink locations across layers, leveraging the observed cross-layer consistency. Experimental results show that our method offers more than 29× lossless speedup under 32K context length. The code is publicly available at: https://github.com/Longxmas/Focus-dLLM
[148] D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use
Bowen Xu,Shaoyu Wu,Hao Jiang,Kai Liu,Xin Chen,Lulu Hu,Bin Yang
Main category: cs.CL
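TL;DR: This paper proposes D-CORE, a two-stage training framework that strengthens task decomposition through self-distillation and then restores reflective reasoning with diversity-aware reinforcement learning, improving large reasoning models on complex tool use.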
Details
Motivation: Current large reasoning models (LRMs) lack sub-task decomposition ability in complex tool-use scenarios, which leads to "Lazy Reasoning".
Method: D-CORE is a two-stage training framework. The first stage incentivizes task decomposition reasoning via self-distillation; the second stage uses diversity-aware reinforcement learning to restore reflective reasoning.
Result: On BFCLv3, D-CORE-8B reaches 77.7% accuracy, 5.7% above the best-performing 8B model, and D-CORE-14B reaches 79.3%, outperforming 70B models.
Conclusion: D-CORE mitigates Lazy Reasoning and delivers robust tool-use gains across model scales and benchmarks.
Abstract: Effective tool use and reasoning are essential capabilities for large reasoning models (LRMs) to address complex real-world problems. Through empirical analysis, we identify that current LRMs lack the capability of sub-task decomposition in complex tool use scenarios, leading to Lazy Reasoning. To address this, we propose a two-stage training framework, D-CORE (Decomposing tasks and Composing Reasoning processes), that first incentivizes the LRMs' task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning (RL) to restore LRMs' reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate superiority of our method: D-CORE-8B reaches 77.7% accuracy, surpassing the best-performing 8B model by 5.7%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3%, outperforming 70B models despite being 5× smaller. The source code is available at https://github.com/alibaba/EfficientAI.
[149] AR-MAP: Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models?
Liang Lin,Feng Xiong,Zengbin Wang,Kun Wang,Junhao Dong,Xuecai Hu,Yong Wang,Xiangxiang Chu
Main category: cs.CL
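TL;DR: This paper proposes AR-MAP, a framework that uses preference-aligned autoregressive LLMs (AR-LLMs) as implicit teachers and transfers their alignment knowledge to diffusion LLMs (DLLMs) through weight scaling. This avoids the high variance and computational overhead of aligning DLLMs directly while achieving competitive or superior performance on preference alignment tasks.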
Details
Motivation: Diffusion large language models (DLLMs) support parallel generation, but their preference alignment is difficult because ELBO-based likelihood estimation introduces high variance.
Method: AR-MAP is a transfer learning framework that uses preference-aligned AR-LLMs as implicit teachers and transfers alignment knowledge to DLLMs through simple weight scaling, exploiting the architecture the two paradigms share.
Result: AR-MAP reaches a 69.08% average score across all tasks and models, competitive with or superior to existing DLLM-specific alignment methods.
Conclusion: DLLMs can absorb alignment knowledge from AR-LLMs efficiently through shared structure and weight scaling; AR-MAP offers a low-variance, low-overhead route to DLLM alignment.
Abstract: Diffusion Large Language Models (DLLMs) have emerged as a powerful alternative to autoregressive models, enabling parallel token generation across multiple positions. However, preference alignment of DLLMs remains challenging due to high variance introduced by Evidence Lower Bound (ELBO)-based likelihood estimation. In this work, we propose AR-MAP, a novel transfer learning framework that leverages preference-aligned autoregressive LLMs (AR-LLMs) as implicit teachers for DLLM alignment. We reveal that DLLMs can effectively absorb alignment knowledge from AR-LLMs through simple weight scaling, exploiting the shared architectural structure between these divergent generation paradigms. Crucially, our approach circumvents the high variance and computational overhead of direct DLLM alignment. Comprehensive experiments across diverse preference alignment tasks demonstrate that AR-MAP achieves competitive or superior performance compared to existing DLLM-specific alignment methods, achieving a 69.08% average score across all tasks and models. Our code is available at https://github.com/AMAP-ML/AR-MAP.
[150] Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages
Tjaša Arčon,Matej Klemen,Marko Robnik-Šikonja,Kaja Dobrovoljc
Main category: cs.CL
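TL;DR: This paper evaluates LLMs' metalinguistic knowledge of language structure and finds it limited and heavily dependent on data availability rather than reflecting genuine cross-linguistic grammatical competence. GPT-4o performs best but reaches an accuracy of only 0.367, and open-source models do worse. Performance correlates strongly with digital language status (e.g., Wikipedia size, corpus availability), and low-resource languages fare substantially worse. The authors release the benchmark as an open-source dataset.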
Details
Motivation: Existing linguistic benchmarks focus on high-resource languages and narrow phenomena and rarely evaluate metalinguistic knowledge (explicit reasoning about language structure), so a broader cross-linguistic evaluation is needed.
Method: Evaluate multiple LLMs on metalinguistic tasks across many languages and linguistic domains (e.g., lexical, phonological), using accuracy and macro F1 together with majority-class and chance baselines, and analyze predictive factors such as resource-related indicators (Wikipedia size, corpus availability).
Result: All models perform above chance but fail to beat the majority-class baseline. GPT-4o is best but reaches only moderate accuracy (0.367). Lexical features show the highest accuracy and phonological features are among the lowest. Language-level accuracy is strongly associated with digital presence, and resource-related indicators predict accuracy better than geographical, genealogical, or sociolinguistic factors.
Conclusion: The metalinguistic knowledge of current LLMs is fragmented and shaped by data availability rather than generalizable grammatical competence; the authors call for greater global linguistic diversity in future LLMs.
Abstract: Large language models (LLMs) are routinely evaluated on language use tasks, yet their knowledge of linguistic structure remains poorly understood. Existing linguistic benchmarks typically focus on narrow phenomena, emphasize high-resource languages, and rarely evaluate metalinguistic knowledge, that is, explicit reasoning about language structure rather than language use. Using accuracy and macro F1, together with majority-class and chance baselines, we analyse overall performance and examine variation by linguistic domains and language-related factors. Our results show that metalinguistic knowledge in current LLMs is limited: GPT-4o performs best but achieves only moderate accuracy (0.367), while open-source models lag behind. All models perform above chance but fail to outperform the majority-class baseline, suggesting they capture cross-linguistic patterns but lack fine-grained grammatical distinctions. Performance varies across linguistic domains, with lexical features showing the highest accuracy and phonological features among the lowest, partially reflecting differences in online visibility. At the language level, accuracy shows a strong association with digital language status: languages with higher digital presence and resource availability are evaluated more accurately, while low-resource languages show substantially lower performance. Analyses of predictive factors confirm that resource-related indicators (Wikipedia size, corpus availability) are more informative predictors of accuracy than geographical, genealogical, or sociolinguistic factors. Together, these results suggest that LLMs' metalinguistic knowledge is fragmented and shaped by data availability rather than generalizable grammatical competence across the world's languages. We release our benchmark as an open-source dataset to support systematic evaluation and encourage greater global linguistic diversity in future LLMs.
[151] Sinhala Physical Common Sense Reasoning Dataset for Global PIQA
Nisansa de Silva,Surangika Ranathunga
Main category: cs.CL
TL;DR: This paper introduces the first Sinhala physical common sense reasoning dataset, part of Global PIQA, with 110 human-created and verified samples reflecting Sri Lankan context.
Details
Motivation: To address the lack of Sinhala-language resources for physical common sense reasoning, especially within the Sri Lankan cultural and linguistic context.
Method: Human creation and verification of 110 Sinhala-language samples, each containing a prompt, correct answer, and incorrect answer, aligned with the Global PIQA framework.
Result: A novel, culturally grounded Sinhala physical common sense reasoning dataset comprising 110 high-quality, verified samples.
Conclusion: The dataset fills a critical gap in low-resource language NLP benchmarks and supports future research in multilingual and culturally-aware commonsense reasoning.
Abstract: This paper presents the first-ever Sinhala physical common sense reasoning dataset created as part of Global PIQA. It contains 110 human-created and verified data samples, where each sample consists of a prompt, the corresponding correct answer, and a wrong answer. Most of the questions refer to the Sri Lankan context, where Sinhala is an official language.
[152] Towards AI Evaluation in Domain-Specific RAG Systems: The AgriHubi Case Study
Md. Toufique Hasan,Ayman Asad Khan,Mika Saari,Vaishnavi Bankhele,Pekka Abrahamsson
Main category: cs.CL
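TL;DR: This paper presents AgriHubi, a domain-adapted retrieval-augmented generation (RAG) system for Finnish-language agricultural decision support. It combines Finnish agricultural documents with open PORO models, explicit source grounding, and user feedback. User studies show gains in answer completeness, linguistic accuracy, and perceived reliability, and reveal a practical trade-off between model size and response latency.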
Details
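Motivation: The use of LLMs in knowledge-intensive domains such as agriculture is constrained by weak grounding, English-centric training data, and limited real-world evaluation. These problems are amplified for low-resource languages such as Finnish, where high-quality domain documents exist but are hard to access through general-purpose models.
Method: Build AgriHubi, which integrates Finnish agricultural documents with open PORO family models in a RAG architecture, adds explicit source grounding and user feedback for iterative refinement, and is developed over eight iterations and evaluated in two user studies.
Result: The system shows clear gains in answer completeness, linguistic accuracy, and perceived reliability. Deploying larger models involves practical trade-offs between response quality and latency. The study provides empirical guidance for designing and evaluating domain-specific RAG systems in low-resource languages.
Conclusion: AgriHubi shows that a RAG system combining domain documents, open models, and a user feedback loop can support professional decision-making in a low-resource language, and its development and evaluation approach can carry over to similar settings.
Abstract: Large language models show promise for knowledge-intensive domains, yet their use in agriculture is constrained by weak grounding, English-centric training data, and limited real-world evaluation. These issues are amplified for low-resource languages, where high-quality domain documentation exists but remains difficult to access through general-purpose models. This paper presents AgriHubi, a domain-adapted retrieval-augmented generation (RAG) system for Finnish-language agricultural decision support. AgriHubi integrates Finnish agricultural documents with open PORO family models and combines explicit source grounding with user feedback to support iterative refinement. Developed over eight iterations and evaluated through two user studies, the system shows clear gains in answer completeness, linguistic accuracy, and perceived reliability. The results also reveal practical trade-offs between response quality and latency when deploying larger models. This study provides empirical guidance for designing and evaluating domain-specific RAG systems in low-resource language settings.
[153] Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge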
Yuzheng Xu,Tosho Hirasawa,Tadashi Kozuno,Yoshitaka Ushiku
Main category: cs.CL
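TL;DR: This paper shows that rubric-based LLM-as-a-judge evaluation suffers from position bias and proposes a balanced permutation strategy to mitigate it, improving agreement between LLM judges and human ratings.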
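One simple way to obtain a balanced set of rubric orderings is to use cyclic rotations, so that every score option occupies every position exactly once, and then average the judge's selected score over those orderings. The sketch below assumes this construction; the paper may build its permutations differently, and `judge` is a hypothetical callable standing in for an LLM call.

```python
def balanced_orders(options):
    """Cyclic rotations: each option occupies each position exactly once."""
    k = len(options)
    return [options[i:] + options[:i] for i in range(k)]


def aggregate(judge, options):
    """Average the score selected by `judge` over all balanced orders.

    `judge` is a hypothetical callable that receives the rubric options in
    presentation order and returns the numeric score it selected.
    """
    orders = balanced_orders(options)
    return sum(judge(order) for order in orders) / len(orders)


def first_option_judge(order):
    """Toy judge with maximal position bias: always picks the first option."""
    return order[0]


print(aggregate(first_option_judge, [1, 2, 3, 4, 5]))  # 3.0
```

With a purely position-driven judge the average collapses to the midpoint of the scale, which shows how aggregation cancels a position preference that a single ordering would pass through unchanged.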
Details
Motivation: Prior work focuses on point-wise and pair-wise evaluation, while potential biases in rubric-based evaluation, such as position preference, have received little systematic analysis.
Method: Run controlled experiments across multiple models and datasets to verify position bias; propose a balanced permutation strategy that distributes each score option evenly across positions in the rubric list and aggregates results over the permutations.
Result: Position bias appears consistently. Aggregating over balanced permutations reveals the latent bias and improves the correlation between LLM judgments and human ratings.
Conclusion: Rubric-based LLM-as-a-judge is not inherently point-wise, and simple permutation-based calibration can substantially improve its reliability.
Abstract: Large language models (LLMs) are now widely used to evaluate the quality of text, a field commonly referred to as LLM-as-a-judge. While prior works mainly focus on point-wise and pair-wise evaluation paradigms, rubric-based evaluation, where LLMs select a score from multiple rubrics, has received less analysis. In this work, we show that rubric-based evaluation implicitly resembles a multi-choice setting and therefore has position bias: LLMs prefer score options appearing at specific positions in the rubric list. Through controlled experiments across multiple models and datasets, we demonstrate consistent position bias. To mitigate this bias, we propose a balanced permutation strategy that evenly distributes each score option across positions. We show that aggregating scores across balanced permutations not only reveals latent position bias, but also improves correlation between the LLM-as-a-Judge and human judgments. Our results suggest that rubric-based LLM-as-a-Judge is not inherently point-wise and that simple permutation-based calibration can substantially improve its reliability.
[154] Using Correspondence Patterns to Identify Irregular Words in Cognate sets Through Leave-One-Out Validation
Frederic Blum,Johann-Mattis List
Main category: cs.CL
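TL;DR: This paper proposes a new regularity measure, the balanced average recurrence of correspondence patterns, and builds on it a computational method for identifying irregular cognate sets. Experiments on simulated and real data validate the method, which reaches 85% accuracy on datasets based on real data and can help improve dataset quality in historical language comparison.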
Details
Motivation: In traditional historical linguistics, judgments about the regularity of sound correspondences are often intuitive rather than quantified, and irregularity is more common than the Neogrammarian model predicts. Progress in computational methods and the growing availability of standardized lexical data now make a quantitative evaluation possible.
Method: Propose the balanced average recurrence of correspondence patterns as a new regularity measure and build a computational method that uses it to identify cognate sets lacking regularity. Leave-one-out validation is applied to datasets in which one word form has been replaced by an irregular one, to test how well the method identifies the form causing the irregularity.
Result: Overall accuracy of 85% on datasets based on real data. The experiments also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences the results.
Conclusion: The regularity measure and the identification method built on it could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.
Abstract: Regular sound correspondences constitute the principal evidence in historical language comparison. Despite the heuristic focus on regularity, it is often more an intuitive judgement than a quantified evaluation, and irregularity is more common than expected from the Neogrammarian model. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, using simulated and real data. In the experiments, we employ leave-one-out validation to measure the regularity of cognate sets in which one word form has been replaced by an irregular one, checking how well our method identifies the forms causing the irregularity. Our method achieves an overall accuracy of 85% with the datasets based on real data. We also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences our results. Reflecting on the broader potential of our new regularity measure and the irregular cognate identification method based on it, we conclude that they could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.
[155] OpenSeal: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data
Tan Sang Nguyen,Muhammad Reza Qorib,Hwee Tou Ng
Main category: cs.CL
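TL;DR: This paper studies continual pretraining of large language models on parallel data and builds OpenSeal, the first truly open-source Southeast Asian LLM, whose performance rivals that of existing models of similar size.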
Details
Motivation: Existing Southeast Asian LLMs are not truly open source because they do not disclose their training data, which limits understanding of model internals, biases, and multilingual ability. In addition, most multilingual LLMs remain English-centric and perform poorly on low-resource languages.
Method: Run controlled and comprehensive experiments on the effectiveness of parallel data in continual pretraining, and train OpenSeal with 34.7B tokens of parallel data in 180 hours on 8x NVIDIA H200 GPUs.
Result: Using only parallel data is the most effective way to extend an LLM to new languages. OpenSeal is the first truly open Southeast Asian LLM and rivals existing models of similar size.
Conclusion: Parallel data plays a key role in continual pretraining of multilingual LLMs, and true openness, including public training data, matters for transparency, reproducibility, and deeper study of such models.
Abstract: Large language models (LLMs) have proven to be effective tools for a wide range of natural language processing (NLP) applications. Although many LLMs are multilingual, most remain English-centric and perform poorly on low-resource languages. Recently, several Southeast Asia-focused LLMs have been developed, but none are truly open source, as they do not publicly disclose their training data. Truly open-source models are important for transparency and for enabling a deeper and more precise understanding of LLM internals and development, including biases, generalization, and multilinguality. Motivated by recent advances demonstrating the effectiveness of parallel data in improving multilingual performance, we conduct controlled and comprehensive experiments to study the effectiveness of parallel data in continual pretraining of LLMs. Our findings show that using only parallel data is the most effective way to extend an LLM to new languages. Using just 34.7B tokens of parallel data and 180 hours on 8x NVIDIA H200 GPUs, we built OpenSeal, the first truly open Southeast Asian LLM that rivals the performance of existing models of similar size.
[156] DziriBOT: RAG Based Intelligent Conversational Agent for Algerian Arabic Dialect
El Batoul Bechiri,Dihia Lanasri
Main category: cs.CL
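TL;DR: This paper presents DziriBOT, a hybrid conversational agent designed for Darja, the Algerian Arabic dialect. It combines natural language understanding (NLU) with retrieval-augmented generation (RAG), systematically evaluates several modeling approaches, and achieves the best performance in this low-resource dialect setting with a fine-tuned DziriBERT model.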
Details
Motivation: Darja, the Algerian dialect, has non-standardized orthography, frequent code-switching with French, and mixed use of Arabic and Latin (Arabizi) scripts. Standard language models handle it poorly, so a dialect-specific conversational system is needed.
Method: Propose a multi-layered hybrid architecture that integrates specialized NLU with RAG, and systematically compare three approaches: a sparse-feature Rasa pipeline, classical machine learning baselines, and transformer-based fine-tuning with DziriBERT.
Result: The fine-tuned DziriBERT model achieves state-of-the-art performance on the Darja tasks and significantly outperforms traditional baselines, particularly in handling orthographic noise and rare intents.
Conclusion: DziriBOT bridges the gap between formal language models and the way Algerian users actually write, and offers a reusable blueprint for dialect-aware customer service automation in the region.
Abstract: The rapid digitalization of customer service has intensified the demand for conversational agents capable of providing accurate and natural interactions. In the Algerian context, this is complicated by the linguistic complexity of Darja, a dialect characterized by non-standardized orthography, extensive code-switching with French, and the simultaneous use of Arabic and Latin (Arabizi) scripts. This paper introduces DziriBOT, a hybrid intelligent conversational agent specifically engineered to overcome these challenges. We propose a multi-layered architecture that integrates specialized Natural Language Understanding (NLU) with Retrieval-Augmented Generation (RAG), allowing for both structured service flows and dynamic, knowledge-intensive responses grounded in curated enterprise documentation. To address the low-resource nature of Darja, we systematically evaluate three distinct approaches: a sparse-feature Rasa pipeline, classical machine learning baselines, and transformer-based fine-tuning. Our experimental results demonstrate that the fine-tuned DziriBERT model achieves state-of-the-art performance. It significantly outperforms traditional baselines, particularly in handling orthographic noise and rare intents. Ultimately, DziriBOT provides a robust, scalable solution that bridges the gap between formal language models and the linguistic realities of Algerian users, offering a blueprint for dialect-aware automation in the regional market.
[157] Kimi K2.5: Visual Agentic Intelligence
Kimi Team,Tongtong Bai,Yifan Bai,Yiping Bao,S. H. Cai,Yuan Cao,Y. Charles,H. S. Che,Cheng Chen,Guanduo Chen,Huarong Chen,Jia Chen,Jiahao Chen,Jianlong Chen,Jun Chen,Kefan Chen,Liang Chen,Ruijue Chen,Xinhao Chen,Yanru Chen,Yanxu Chen,Yicun Chen,Yimin Chen,Yingjiang Chen,Yuankun Chen,Yujie Chen,Yutian Chen,Zhirong Chen,Ziwei Chen,Dazhi Cheng,Minghan Chu,Jialei Cui,Jiaqi Deng,Muxi Diao,Hao Ding,Mengfan Dong,Mengnan Dong,Yuxin Dong,Yuhao Dong,Angang Du,Chenzhuang Du,Dikang Du,Lingxiao Du,Yulun Du,Yu Fan,Shengjun Fang,Qiulin Feng,Yichen Feng,Garimugai Fu,Kelin Fu,Hongcheng Gao,Tong Gao,Yuyao Ge,Shangyi Geng,Chengyang Gong,Xiaochen Gong,Zhuoma Gongque,Qizheng Gu,Xinran Gu,Yicheng Gu,Longyu Guan,Yuanying Guo,Xiaoru Hao,Weiran He,Wenyang He,Yunjia He,Chao Hong,Hao Hu,Jiaxi Hu,Yangyang Hu,Zhenxing Hu,Ke Huang,Ruiyuan Huang,Weixiao Huang,Zhiqi Huang,Tao Jiang,Zhejun Jiang,Xinyi Jin,Yu Jing,Guokun Lai,Aidi Li,C. Li,Cheng Li,Fang Li,Guanghe Li,Guanyu Li,Haitao Li,Haoyang Li,Jia Li,Jingwei Li,Junxiong Li,Lincan Li,Mo Li,Weihong Li,Wentao Li,Xinhang Li,Xinhao Li,Yang Li,Yanhao Li,Yiwei Li,Yuxiao Li,Zhaowei Li,Zheming Li,Weilong Liao,Jiawei Lin,Xiaohan Lin,Zhishan Lin,Zichao Lin,Cheng Liu,Chenyu Liu,Hongzhang Liu,Liang Liu,Shaowei Liu,Shudong Liu,Shuran Liu,Tianwei Liu,Tianyu Liu,Weizhou Liu,Xiangyan Liu,Yangyang Liu,Yanming Liu,Yibo Liu,Yuanxin Liu,Yue Liu,Zhengying Liu,Zhongnuo Liu,Enzhe Lu,Haoyu Lu,Zhiyuan Lu,Junyu Luo,Tongxu Luo,Yashuo Luo,Long Ma,Yingwei Ma,Shaoguang Mao,Yuan Mei,Xin Men,Fanqing Meng,Zhiyong Meng,Yibo Miao,Minqing Ni,Kun Ouyang,Siyuan Pan,Bo Pang,Yuchao Qian,Ruoyu Qin,Zeyu Qin,Jiezhong Qiu,Bowen Qu,Zeyu Shang,Youbo Shao,Tianxiao Shen,Zhennan Shen,Juanfeng Shi,Lidong Shi,Shengyuan Shi,Feifan Song,Pengwei Song,Tianhui Song,Xiaoxi Song,Hongjin Su,Jianlin Su,Zhaochen Su,Lin Sui,Jinsong Sun,Junyao Sun,Tongyu Sun,Flood Sung,Yunpeng Tai,Chuning Tang,Heyi Tang,Xiaojuan Tang,Zhengyang Tang,Jiawen Tao,Shiyuan Teng,Chaoran Tian,Pengfei Tian,Ao Wang,Bowen Wang,Chensi 
Wang,Chuang Wang,Congcong Wang,Dingkun Wang,Dinglu Wang,Dongliang Wang,Feng Wang,Hailong Wang,Haiming Wang,Hengzhi Wang,Huaqing Wang,Hui Wang,Jiahao Wang,Jinhong Wang,Jiuzheng Wang,Kaixin Wang,Linian Wang,Qibin Wang,Shengjie Wang,Shuyi Wang,Si Wang,Wei Wang,Xiaochen Wang,Xinyuan Wang,Yao Wang,Yejie Wang,Yipu Wang,Yiqin Wang,Yucheng Wang,Yuzhi Wang,Zhaoji Wang,Zhaowei Wang,Zhengtao Wang,Zhexu Wang,Zihan Wang,Zizhe Wang,Chu Wei,Ming Wei,Chuan Wen,Zichen Wen,Chengjie Wu,Haoning Wu,Junyan Wu,Rucong Wu,Wenhao Wu,Yuefeng Wu,Yuhao Wu,Yuxin Wu,Zijian Wu,Chenjun Xiao,Jin Xie,Xiaotong Xie,Yuchong Xie,Yifei Xin,Bowei Xing,Boyu Xu,Jianfan Xu,Jing Xu,Jinjing Xu,L. H. Xu,Lin Xu,Suting Xu,Weixin Xu,Xinbo Xu,Xinran Xu,Yangchuan Xu,Yichang Xu,Yuemeng Xu,Zelai Xu,Ziyao Xu,Junjie Yan,Yuzi Yan,Guangyao Yang,Hao Yang,Junwei Yang,Kai Yang,Ningyuan Yang,Ruihan Yang,Xiaofei Yang,Xinlong Yang,Ying Yang,Yi Yang,Yi Yang,Zhen Yang,Zhilin Yang,Zonghan Yang,Haotian Yao,Dan Ye,Wenjie Ye,Zhuorui Ye,Bohong Yin,Chengzhen Yu,Longhui Yu,Tao Yu,Tianxiang Yu,Enming Yuan,Mengjie Yuan,Xiaokun Yuan,Yang Yue,Weihao Zeng,Dunyuan Zha,Haobing Zhan,Dehao Zhang,Hao Zhang,Jin Zhang,Puqi Zhang,Qiao Zhang,Rui Zhang,Xiaobin Zhang,Y. Zhang,Yadong Zhang,Yangkun Zhang,Yichi Zhang,Yizhi Zhang,Yongting Zhang,Yu Zhang,Yushun Zhang,Yutao Zhang,Yutong Zhang,Zheng Zhang,Chenguang Zhao,Feifan Zhao,Jinxiang Zhao,Shuai Zhao,Xiangyu Zhao,Yikai Zhao,Zijia Zhao,Huabin Zheng,Ruihan Zheng,Shaojie Zheng,Tengyang Zheng,Junfeng Zhong,Longguang Zhong,Weiming Zhong,M. Zhou,Runjie Zhou,Xinyu Zhou,Zaida Zhou,Jinguo Zhu,Liya Zhu,Xinhao Zhu,Yuxuan Zhu,Zhen Zhu,Jingze Zhuang,Weiyu Zhuang,Ying Zou,Xinxing Zu
Main category: cs.CL
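TL;DR: Kimi K2.5 is an open-source multimodal agentic model that jointly optimizes text and vision and introduces Agent Swarm, a parallel agent orchestration framework. It reaches state-of-the-art performance on coding, vision, reasoning, and agentic tasks, and Agent Swarm reduces latency by up to 4.5× over single-agent baselines.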
Details
Motivation: To advance general agentic intelligence and address the weak coordination and low efficiency of single-modality or simple multimodal models on complex tasks. Method: Uses joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning; builds Agent Swarm, a self-directed parallel agent framework that dynamically decomposes tasks and executes them concurrently. Result: Achieves state-of-the-art results on coding, vision, reasoning, and agentic tasks; Agent Swarm reduces latency by up to 4.5× over single-agent baselines. Conclusion: Kimi K2.5 shows that deep joint text-vision modeling combined with parallel agent orchestration improves general agentic capability, and the released model is intended to support research and applications in this area. Abstract: We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.
[158] Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages
Isaac Chung,Linda Freienthal
Main category: cs.CL
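TL;DR: This paper generates synthetic customer-support dialogues with identical parameters in Estonian, Finnish, and Hungarian, holding generation conditions constant to assess the reliability of cross-lingual LLM evaluation. Surface-level metrics are stable, but pragmatic judgments such as coherence and instruction-following show marked cross-language ranking instability, indicating that zero-shot transfer of LLM-as-a-judge is unreliable in morphologically rich languages and that language-specific calibration is needed.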
Details
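Motivation: Existing cross-lingual LLM evaluation cannot easily separate genuine model performance differences from instability in the evaluation method itself; the two need to be decoupled to assess model ability accurately. Method: Uses a controlled synthetic approach to generate structurally consistent customer-support dialogues in three related, morphologically rich Uralic languages (Estonian, Finnish, Hungarian); fixes generation conditions and varies only the target language; compares the cross-language ranking consistency of automatic metrics and LLM-as-a-judge scores, using a small set of native-speaker annotations as a reference point. Result: Surface-level metrics (lexical diversity, surface and semantic similarity) are stable across languages, whereas pragmatic judgments (coherence, instruction-following) show rank inversions and near-zero correlations, indicating that the judge behaves inconsistently across languages rather than that model performance changes. Conclusion: Zero-shot LLM-as-a-judge transfer is unreliable for discourse-level assessment in morphologically rich languages; an evaluation method that cannot stay stable across languages under constant generation conditions signals a risk of transfer failure in deployment; judges should be calibrated against human annotations in the target language. Abstract: Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines.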
We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
[159] Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?
Alex Argese,Pasquale Lisena,Raphaël Troncy
Main category: cs.CL
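TL;DR: This paper proposes StoryScore, a composite metric for evaluating AI-generated scientific stories that integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection.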
Details
Motivation: Existing summarization metrics struggle to capture the abstraction, simplification, and pedagogical creativity that scientific storytelling requires, and hallucination detectors often misclassify legitimate narrative reformulations as errors and are unstable. Method: Proposes the StoryScore composite evaluation framework, which combines six dimensions: semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection. Result: Explains why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors: automatic metrics assess semantic similarity well but struggle to evaluate how content is narrated and controlled. Conclusion: StoryScore offers a more comprehensive and reliable way to evaluate AI-generated scientific narratives and exposes the limits of current automatic evaluation at the narrative level. Abstract: Generative AI can turn scientific articles into narratives for diverse audiences, but evaluating these stories remains challenging. Storytelling demands abstraction, simplification, and pedagogical creativity, qualities that are not often well captured by standard summarization metrics. Meanwhile, factual hallucinations are critical in scientific contexts, yet detectors often misclassify legitimate narrative reformulations or prove unstable when creativity is involved. In this work, we propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework. Our analysis also reveals why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting a key limitation: while automatic metrics can effectively assess semantic similarity with original content, they struggle to evaluate how it is narrated and controlled.
[160] Advancing General-Purpose Reasoning Models with Modular Gradient Surgery
Min Cai,Yu Liang,Longzheng Wang,Yan Wang,Yueyang Zhang,Long Xia,Zhiyuan Sun,Xi Ye,Daiting Shi
Main category: cs.CL
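TL;DR: This paper proposes Modular Gradient Surgery (MGS) to address the cross-domain interference that domain heterogeneity causes when large reasoning models (LRMs) are trained with multi-domain reinforcement learning, yielding significant gains across math, general chat, and instruction following.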
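One way to picture module-level gradient surgery is a projection step applied separately to each module's gradients. The sketch below is a hedged illustration, not the paper's algorithm: it assumes a PCGrad-style projection (drop the conflicting component when two domain gradients have a negative dot product), and the module names `attn` and `mlp` and the toy gradient values are hypothetical. The source does not say how MGS itself resolves conflicts.

```python
# Minimal sketch of per-module gradient conflict resolution.
# ASSUMPTION: PCGrad-style projection; the digest does not specify MGS's rule.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_conflict(g, other):
    """Return g with its component along `other` removed if the two conflict."""
    d = dot(g, other)
    if d >= 0:                      # no conflict: keep the gradient unchanged
        return list(g)
    scale = d / dot(other, other)   # other is non-zero whenever d < 0
    return [a - scale * b for a, b in zip(g, other)]

def modular_surgery(grads_a, grads_b):
    """Resolve conflicts module by module, then sum the two domain gradients."""
    merged = {}
    for name in grads_a:
        ga, gb = grads_a[name], grads_b[name]
        pa = project_conflict(ga, gb)
        pb = project_conflict(gb, ga)
        merged[name] = [x + y for x, y in zip(pa, pb)]
    return merged

# Hypothetical gradients from two domains (e.g. math and chat) for two modules.
math_grads = {"attn": [1.0, 0.0], "mlp": [1.0, 1.0]}
chat_grads = {"attn": [-1.0, 1.0], "mlp": [0.5, 0.5]}
merged = modular_surgery(math_grads, chat_grads)
```

In this toy case only the `attn` gradients conflict, so only that module is projected; the `mlp` gradients are summed unchanged. Working per module rather than on the full flattened gradient is the design point the summary emphasizes.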
Details
Motivation: Training a single general-purpose large reasoning model (LRM) is hampered by strong domain heterogeneity and severe cross-domain interference, and the existing Sequential RL and Mixed RL strategies deliver limited gains. Method: Proposes Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level of the transformer, reducing behavioral and gradient-level interference in multi-task RL. Result: On Llama and Qwen models, MGS improves the average over three representative domains by 4.3 (16.6%) and 4.5 (11.1%) points, respectively, and remains effective under prolonged training. Conclusion: MGS mitigates interference in multi-domain RL, offers a workable recipe for training general-purpose LRMs, and clarifies where the interference comes from. Abstract: Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6%) and 4.5 (11.1%) points, respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.
[161] The Shape of Beliefs: Geometry, Dynamics, and Interventions along Representation Manifolds of Language Models' Posteriors
Raphaël Sarfati,Eric Bigelow,Daniel Wurgaft,Jack Merullo,Atticus Geiger,Owen Lewis,Tom McGrath,Ekdeep Singh Lubana
Main category: cs.CL
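TL;DR: This paper studies how large language models (LLMs) encode and update implicit beliefs about distribution parameters in representation space. It finds curved "belief manifolds" and shows that geometry- and field-aware interventions outperform standard linear steering, indicating that purely linear representations do not adequately describe the complex structure inside LLMs.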
Details
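Motivation: There is no mechanistic account of how LLMs encode, update, and respond to interventions on their prompt-conditioned beliefs (posteriors) in representation space. Method: In a controlled setting, Llama-3.2 implicitly infers the mean and standard deviation of a normal distribution from in-context samples; the authors analyze the "belief manifolds" that form in representation space, compare standard linear steering with geometry- and field-aware steering when the distribution changes abruptly, and introduce linear field probing (LFP) for manifold-aligned interventions. Result: With sufficient in-context learning, curved belief manifolds representing the distribution parameters form; standard linear steering tends to push the model off-manifold and induces coupled out-of-distribution shifts, whereas geometry- and field-aware steering better preserves the intended belief family; LFP is shown to be a simple, effective manifold-aware intervention strategy. Conclusion: Rich geometric structure emerges naturally in LLMs, purely linear concept representations are often an inadequate abstraction, and interventions should respect the intrinsic manifold geometry. Abstract: Large language models (LLMs) represent prompt-conditioned beliefs (posteriors over answers and claims), but we lack a mechanistic account of how these beliefs are encoded in representation space, how they update with new evidence, and how interventions reshape them. We study a controlled setting in which Llama-3.2 generates samples from a normal distribution by implicitly inferring its parameters (mean and standard deviation) given only samples from the distribution in context. We find that representations of curved "belief manifolds" for these parameters form with sufficient in-context learning, and we study how the model adapts when the distribution suddenly changes. While standard linear steering often pushes the model off-manifold and induces coupled, out-of-distribution shifts, geometry- and field-aware steering better preserves the intended belief family. Our work demonstrates an example of linear field probing (LFP) as a simple approach to tile the data manifold and make interventions that respect the underlying geometry. We conclude that rich structure emerges naturally in LLMs and that purely linear concept representations are often an inadequate abstraction.
[162] A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method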
Feiyang Cai,Guijuan He,Yi Hu,Jingjing Wang,Joshua Luo,Tianyu Zhu,Srikanth Pilla,Gang Li,Ling Liu,Feng Luo
Main category: cs.CL
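TL;DR: This paper proposes a fully automated framework for generating molecular structure descriptions. An extended rule-based IUPAC parser produces structured XML metadata that guides a large language model (LLM) to write accurate natural-language descriptions, yielding a dataset of about 163k high-quality molecule-description pairs with a validated precision of 98.6%.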
Details
Motivation: Molecular function depends heavily on structure, but manually annotating structure-language pairs is expensive, which makes large, high-quality datasets hard to build and limits LLMs' structural understanding in chemistry tasks. Method: Builds on and extends a rule-based IUPAC nomenclature parser to parse IUPAC names automatically and generate structure-rich XML metadata, which then guides an LLM to produce natural-language descriptions; the pipeline is fully automated. Result: Produces a large-scale dataset of about 163,000 molecule-description pairs; on a 2,000-molecule subset validated jointly by LLMs and human experts, description precision reaches 98.6%. Conclusion: The automated annotation framework generates high-precision structure descriptions efficiently, provides a reliable data foundation for molecule-language alignment, and extends readily to larger datasets and broader chemical tasks. Abstract: Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular structure descriptions at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structured XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately $163$k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of $2,000$ molecules demonstrates a high description precision of $98.6\%$. The resulting dataset provides a reliable foundation for future molecule-language alignment, and the proposed annotation method is readily extensible to larger datasets and broader chemical tasks that rely on structural descriptions.
[163] Language Steering for Multilingual In-Context Learning
Neeraja Kirtane,Kuan-Hao Huang
Main category: cs.CL
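TL;DR: This paper proposes language vectors, a training-free method that adds language-specific vectors to intermediate activations at inference time to steer multilingual LLMs toward better handling of non-English languages in in-context learning. It improves performance consistently and shows that language vectors align with language-family structure.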
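The steering idea in this summary (build a vector from activation differences between languages, then add it to intermediate activations) can be sketched in a few lines. This is a minimal illustration under stated assumptions: it takes the difference of mean activations as the vector and uses a hypothetical scaling factor `alpha`; the source does not say how the differences are aggregated or scaled, and the toy two-dimensional activations stand in for real hidden states.

```python
# Minimal sketch of a difference-of-means steering vector.
# ASSUMPTIONS: mean aggregation and the `alpha` scale are illustrative choices.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def language_vector(source_acts, target_acts):
    """Direction pointing from the source-language mean to the target-language mean."""
    src = mean_vector(source_acts)
    tgt = mean_vector(target_acts)
    return [t - s for s, t in zip(src, tgt)]

def steer(hidden, vector, alpha=1.0):
    """Add the scaled language vector to one hidden state at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy activations collected at one layer for source and target language prompts.
english_acts = [[0.0, 0.0], [2.0, 2.0]]
target_acts = [[3.0, 1.0], [5.0, 3.0]]

vec = language_vector(english_acts, target_acts)
steered = steer([1.0, 1.0], vec, alpha=0.5)
```

No parameters are updated: only the hidden state passed to later layers changes, which is what makes the approach training-free.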
Details
Motivation: Multilingual LLMs perform far worse on non-English languages than on English, and the drop is especially pronounced in in-context learning with English demonstrations and non-English test inputs, so an efficient, training-free language adaptation mechanism is needed. Method: Based on the hypothesis that models develop a universal semantic space in which languages correspond to distinct directions, the paper proposes language vectors: steering vectors built from activation differences between source and target languages and added to intermediate-layer activations at inference time, shifting representations toward the target language without parameter updates. Result: Evaluated on three datasets, 19 languages, and three models, the method consistently beats baselines on multilingual in-context learning; hierarchical clustering of the language vectors closely matches language families, and the vectors transfer across tasks, indicating that they are task-agnostic. Conclusion: Language vectors are a simple, general, training-free language adaptation method that improves multilingual in-context learning and reveals structural regularities in the model's internal language representations. Abstract: While multilingual large language models have gained widespread adoption, their performance on non-English languages remains substantially inferior to English. This disparity is particularly evident in in-context learning scenarios, where providing demonstrations in English but testing on non-English inputs leads to significant performance degradation. In this paper, we hypothesize that LLMs develop a universal semantic space for understanding languages, where different languages are encoded as distinct directions within this space. Based on this hypothesis, we propose language vectors, a training-free language steering approach that leverages activation differences between source and target languages to guide model behavior. We steer the model generations by adding the vector to the intermediate model activations during inference. This is done to make the model's internal representations shift towards the target language space without any parameter updates. We evaluate our method across three datasets and test on a total of 19 languages on three different models. Our results show consistent improvements on multilingual in-context learning over baselines across all tasks and languages tested. Beyond performance gains, hierarchical clustering of steering vectors reveals meaningful linguistic structure aligned with language families. These vectors also successfully transfer across tasks, demonstrating that these representations are task-agnostic.
[164] Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
Ziwen Xu,Chenyan Wu,Hengyu Sun,Haiwen Hong,Mengru Wang,Yunzhi Yao,Longtao Huang,Hui Xue,Shumin Deng,Zhixuan Chu,Huajun Chen,Ningyu Zhang
Main category: cs.CL
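TL;DR: This paper proposes a unified framework that treats LLM control methods (such as fine-tuning, LoRA, and activation interventions) as dynamic weight updates induced by a control signal, and introduces a preference-utility analysis to quantify control effects. It finds a trade-off between preference and utility and proposes a new method, SPLIT, that balances the two better.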
Details
Motivation: Existing LLM control methods (such as fine-tuning, LoRA, and activation interventions) are usually studied in isolation, without a unified view or a basis for comparison, so conceptual integration and systematic evaluation are needed. Method: Builds a unified framework of dynamic weight updates induced by a control signal; proposes contrastive log-odds measures of preference (tendency toward the target concept) and utility (coherent, task-valid generation); explains the trade-off from an activation-manifold perspective; designs a new intervention method, SPLIT. Result: Finds a consistent preference-utility trade-off across control methods; shows that control raises preference by shifting representations along target-concept directions but harms utility when representations leave the valid-generation manifold; SPLIT improves preference while better preserving utility. Conclusion: LLM control amounts to moving representations on the activation manifold; the unified framework and preference-utility analysis give a theoretical basis for designing and evaluating methods, and SPLIT demonstrates the practical value of the analysis. Abstract: Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
[165] Automated Multiple Mini Interview (MMI) Scoring
Ryan Huynh,Frank Guerin,Alison Callwood
Main category: cs.CL
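TL;DR: This paper proposes a multi-agent prompting framework for automatically assessing soft skills (such as empathy, ethical judgment, and communication). It significantly outperforms existing fine-tuned methods on MMI assessment and generalizes across tasks.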
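The agreement figures reported for this entry are quadratic weighted kappa (QWK) scores, so a short reference computation may help when reading them. The sketch below implements the standard QWK formula for integer ratings; the function name, the rating range arguments, and the toy ratings are illustrative and are not taken from the paper's code.

```python
# Minimal sketch of quadratic weighted kappa (QWK) for two raters.
# The rating scale and example data are hypothetical.

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two rating levels")
    total = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)   # quadratic penalty
            expected = hist_a[i] * hist_b[j] / total    # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:          # both raters gave one identical constant rating
        return 1.0
    return 1.0 - numerator / denominator

human = [1, 2, 3, 4]
same = quadratic_weighted_kappa(human, [1, 2, 3, 4], 1, 4)
reversed_scores = quadratic_weighted_kappa(human, [4, 3, 2, 1], 1, 4)
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean systematic disagreement; large rating gaps are penalized more heavily than adjacent ones.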
Details
Motivation: Human scoring is inconsistent and biased, and existing rationale-based LLM fine-tuning methods struggle with the abstract, context-dependent narratives of MMIs, which are rich in implicit signals. Method: Designs a multi-agent prompting framework that decomposes evaluation into transcript refinement and criterion-specific scoring, using 3-shot in-context learning with a large instruction-tuned model. Result: Achieves an average quadratic weighted kappa (QWK) of 0.62 on MMI, well above the fine-tuned baselines (0.32), with reliability comparable to human experts; on the ASAP benchmark it rivals domain-specific state-of-the-art models without additional training. Conclusion: For complex, subjective reasoning tasks, structured prompt engineering may offer a scalable alternative to data-intensive fine-tuning, changing how LLMs can be applied to automated assessment. Abstract: Assessing soft skills such as empathy, ethical judgment, and communication is essential in competitive selection processes, yet human scoring is often inconsistent and biased. While Large Language Models (LLMs) have improved Automated Essay Scoring (AES), we show that state-of-the-art rationale-based fine-tuning methods struggle with the abstract, context-dependent nature of Multiple Mini-Interviews (MMIs), missing the implicit signals embedded in candidate narratives. We introduce a multi-agent prompting framework that breaks down the evaluation process into transcript refinement and criterion-specific scoring. Using 3-shot in-context learning with a large instruct-tuned model, our approach outperforms specialised fine-tuned baselines (Avg QWK 0.62 vs 0.32) and achieves reliability comparable to human experts. We further demonstrate the generalisability of our framework on the ASAP benchmark, where it rivals domain-specific state-of-the-art models without additional training. These findings suggest that for complex, subjective reasoning tasks, structured prompt engineering may offer a scalable alternative to data-intensive fine-tuning, altering how LLMs can be applied to automated assessment.
[166] Proof-RM: A Scalable and Generalizable Reward Model for Math Proof
Haotong Yang,Zitong Wang,Shijia Kang,Siqi Yang,Wenkai Yu,Xu Niu,Yike Sun,Yi Hu,Zhouchen Lin,Muhan Zhang
Main category: cs.CL
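TL;DR: This paper proposes a scalable data-construction pipeline that uses LLMs to automatically generate high-quality "question-proof-check" triplet data, and trains on it a reward model (RM) that can reliably evaluate full proofs, strengthening LLM reasoning on mathematical proof tasks.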
Details
Motivation: Reinforcement learning with verifiable rewards (RLVR) cannot readily handle proof-based math problems, which have no fixed answer to match, so a reward model that can automatically and reliably evaluate full proof processes is needed. Method: Designs a scalable data-construction pipeline that uses LLMs to generate diverse "question-proof-check" triplets, filtered through hierarchical human review; on this data, trains a proof-checking reward model that incorporates process reward and token weight balance. Result: The approach shows strong performance and scalability in reward accuracy, generalization ability, and test-time guidance. Conclusion: The work provides practical recipes for data construction and reward modeling, along with tools, for strengthening LLM capability on complex mathematical proofs. Abstract: While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with Verifiable Rewards (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a scalable data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality "question-proof-check" triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Utilizing these data, we train a proof-checking RM, incorporating additional process reward and token weight balance to stabilize the RL process. Our experiments validate the model's scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability and test-time guidance, providing important practical recipes and tools for strengthening LLM mathematical capabilities.
[167] From Sycophancy to Sensemaking: Premise Governance for Human-AI Decision Making
Raunak Jain,Mudita Khurana,John Stephens,Srinivas Dharmasanam,Shankar Venkataraman
Main category: cs.CL
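TL;DR: This paper proposes a new paradigm for human-AI collaboration that shifts from generating answers to jointly governing the premises of a decision, using a discrepancy-driven control loop and commitment gating to improve reliability and auditability in decisions under deep uncertainty.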
Details
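Motivation: As LLMs move from assistance to decision support, their fluent but uncalibrated agreement bakes in implicit assumptions and pushes verification costs onto experts; in deep-uncertainty decisions, where objectives are contested and reversals are costly, scaling fluent agreement amplifies poor commitments faster than it builds expertise. Method: Proposes a collaborative premise governance framework: a knowledge substrate serves as the shared premise space; a discrepancy-driven control loop detects conflicts and localizes three typed discrepancies (teleological, epistemic, procedural); decision slices enable bounded negotiation; commitment gating blocks action on load-bearing premises that have not been committed to, unless overridden under logged risk; value-gated challenge allocates probing under interaction cost. Result: Trust attaches to auditable premises and evidence standards rather than conversational fluency; the framework is illustrated with a tutoring scenario, and falsifiable evaluation criteria are proposed. Conclusion: Reliable human-AI collaboration depends less on "smarter" answer generation than on shared governance of key premises and a structured capacity for challenge; this points to a direction for responsible deployment of LLMs in high-stakes decisions. Abstract: As LLMs expand from assistance to decision support, a dangerous pattern emerges: fluent agreement without calibrated judgment. Low-friction assistants can become sycophantic, baking in implicit assumptions and pushing verification costs onto experts, while outcomes arrive too late to serve as reward signals. In deep-uncertainty decisions (where objectives are contested and reversals are costly), scaling fluent agreement amplifies poor commitments faster than it builds expertise. We argue reliable human-AI partnership requires a shift from answer generation to collaborative premise governance over a knowledge substrate, negotiating only what is decision-critical. A discrepancy-driven control loop operates over this substrate: detecting conflicts, localizing misalignment via typed discrepancies (teleological, epistemic, procedural), and triggering bounded negotiation through decision slices. Commitment gating blocks action on uncommitted load-bearing premises unless overridden under logged risk; value-gated challenge allocates probing under interaction cost. Trust then attaches to auditable premises and evidence standards, not conversational fluency. We illustrate with tutoring and propose falsifiable evaluation criteria.
[168] ROG: Retrieval-Augmented LLM Reasoning for Complex First-Order Queries over Knowledge Graphs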
Ziyan Zhang,Chao Wang,Zhuo Chen,Chiyi Li,Kai Song
Main category: cs.CL
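TL;DR: ROG is a retrieval-augmented framework that decomposes first-order logic queries into single-operator sub-queries and combines query-aware neighborhood retrieval with LLM chain-of-thought reasoning, improving reasoning over incomplete knowledge graphs on complex and negation-heavy queries.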
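The decomposition described here (a multi-operator query split into single-operator steps, with intermediate answer sets cached and reused) can be pictured with a toy executor. This is a sketch under strong assumptions: exact set operations stand in for the retrieval-grounded LLM reasoning that ROG actually performs at each step, and the knowledge graph, the plan format, and every name below are hypothetical.

```python
# Toy executor for a decomposed first-order query over a tiny knowledge graph.
# ASSUMPTION: exact set operations replace ROG's retrieval + LLM step.

def project(kg, entities, relation):
    """Follow `relation` from every entity in `entities`."""
    out = set()
    for entity in entities:
        out |= kg.get((entity, relation), set())
    return out

def run_plan(kg, all_entities, plan):
    """Execute single-operator steps in order, caching each intermediate set."""
    cache = {}
    for name, op, args in plan:
        if op == "anchor":
            cache[name] = set(args)
        elif op == "project":
            source, relation = args
            cache[name] = project(kg, cache[source], relation)
        elif op == "intersect":
            cache[name] = set.intersection(*(cache[a] for a in args))
        elif op == "union":
            cache[name] = set.union(*(cache[a] for a in args))
        elif op == "negate":
            cache[name] = all_entities - cache[args[0]]
        else:
            raise ValueError(f"unknown operator: {op}")
    return cache

kg = {
    ("acme", "employs"): {"alice", "bob"},
    ("globex", "employs"): {"bob", "carol"},
}
all_entities = {"acme", "globex", "alice", "bob", "carol"}

# "Who is employed by acme and NOT employed by globex?"
plan = [
    ("a", "anchor", ["acme"]),
    ("g", "anchor", ["globex"]),
    ("emp_a", "project", ("a", "employs")),
    ("emp_g", "project", ("g", "employs")),
    ("not_emp_g", "negate", ["emp_g"]),
    ("answer", "intersect", ["emp_a", "not_emp_g"]),
]
cache = run_plan(kg, all_entities, plan)
```

Each step reads only named entries in `cache`, so a later step can reuse an earlier answer set without recomputing it, which mirrors the consistency argument made for deep reasoning chains.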
Details
Motivation: Answering first-order logic (FOL) queries over incomplete knowledge graphs is difficult, especially for compound structures involving projection, intersection, union, and negation. Method: Proposes ROG, which decomposes a multi-operator FOL query into a sequence of single-operator sub-queries, grounds each step in compact, query-relevant neighborhood evidence, and caches and reuses intermediate answer sets to improve consistency on deep reasoning chains. Result: Consistently outperforms strong embedding-based baselines on standard KG reasoning benchmarks, with the largest gains on high-complexity and negation-heavy query types. Conclusion: ROG offers a practical alternative that replaces learned logical operators with retrieval-grounded, step-wise inference, reducing compounding errors and improving robustness on complex and negation-heavy queries. Abstract: Answering first-order logic (FOL) queries over incomplete knowledge graphs (KGs) is difficult, especially for complex query structures that compose projection, intersection, union, and negation. We propose ROG, a retrieval-augmented framework that combines query-aware neighborhood retrieval with large language model (LLM) chain-of-thought reasoning. ROG decomposes a multi-operator query into a sequence of single-operator sub-queries and grounds each step in compact, query-relevant neighborhood evidence. Intermediate answer sets are cached and reused across steps, improving consistency on deep reasoning chains. This design reduces compounding errors and yields more robust inference on complex and negation-heavy queries. Overall, ROG provides a practical alternative to embedding-based logical reasoning by replacing learned operators with retrieval-grounded, step-wise inference. Experiments on standard KG reasoning benchmarks show consistent gains over strong embedding-based baselines, with the largest improvements on high-complexity and negation-heavy query types.
[169] Misconception Diagnosis From Student-Tutor Dialogue: Generate, Retrieve, Rerank
Joshua Mitton,Prarthana Bhattacharyya,Digory Smith,Thomas Christie,Ralph Abboud,Simon Woodhead
Main category: cs.CL
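TL;DR: This paper proposes a new method that uses large language models (LLMs) to identify student misconceptions automatically from student-tutor dialogues. A three-stage generate-retrieve-rerank pipeline improves detection accuracy, and experiments on real educational dialogue data show it outperforms baseline models.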
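The retrieve and rerank stages of such a pipeline can be sketched without any model at all. The example below is illustrative only: it assumes cosine similarity as the embedding similarity, uses hand-written two-dimensional vectors in place of real embeddings, and replaces the fine-tuned reranking LLM with a hypothetical lookup table of scores. The generation stage is omitted, and the candidate misconceptions are invented.

```python
# Minimal sketch of retrieve-then-rerank over candidate misconceptions.
# ASSUMPTIONS: cosine similarity, toy embeddings, and a stand-in reranker.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(dialogue_vec, candidates, k):
    """Keep the k candidates whose embeddings are closest to the dialogue."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(dialogue_vec, candidates[name]),
        reverse=True,
    )
    return ranked[:k]

def rerank(names, score_fn):
    """Reorder the shortlist with a separate relevance scorer."""
    return sorted(names, key=score_fn, reverse=True)

candidates = {
    "adds denominators": [1.0, 0.1],
    "ignores negative sign": [0.0, 1.0],
    "multiplies instead of adds": [0.7, 0.7],
}
dialogue_vec = [1.0, 0.0]

shortlist = retrieve(dialogue_vec, candidates, k=2)

# Stand-in for the second fine-tuned LLM: a fixed table of relevance scores.
toy_scores = {"adds denominators": 0.2, "multiplies instead of adds": 0.9}
final = rerank(shortlist, lambda name: toy_scores[name])
```

Splitting the work this way lets a cheap similarity search narrow the pool before the more expensive reranker judges relevance, which is why the reranker can overturn the retrieval order in the toy example.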
Details
Motivation: Identifying student misconceptions promptly and accurately is key to improving learning outcomes, but it currently depends heavily on the teacher's experience and intuition, so automated support is needed. Method: Uses two fine-tuned LLMs: the first generates plausible misconceptions; the most promising candidates are then retrieved by embedding similarity with the input dialogue and assessed and re-ranked by the second fine-tuned LLM. Result: On real dialogues from an educational tutoring platform, the approach improves predictive performance over baselines built on LLaMA, Qwen, and Claude in zero-shot and fine-tuned settings; fine-tuning improves the quality of generated misconceptions and can outperform larger closed-source models; ablations confirm that the generation and reranking steps both matter. Conclusion: An LLM-based generate-retrieve-rerank framework improves the accuracy and practicality of misconception identification, fine-tuning is important for performance, and the approach offers a workable path toward intelligent educational tools. Abstract: Timely and accurate identification of student misconceptions is key to improving learning outcomes and pre-empting the compounding of student errors. However, this task is highly dependent on the effort and intuition of the teacher. In this work, we present a novel approach for detecting misconceptions from student-tutor dialogues using large language models (LLMs). First, we use a fine-tuned LLM to generate plausible misconceptions, and then retrieve the most promising candidates among these using embedding similarity with the input dialogue. These candidates are then assessed and re-ranked by another fine-tuned LLM to improve misconception relevance. Empirically, we evaluate our system on real dialogues from an educational tutoring platform. We consider multiple base LLM models including LLaMA, Qwen and Claude on zero-shot and fine-tuned settings. We find that our approach improves predictive performance over baseline models and that fine-tuning both improves generated misconception quality and can outperform larger closed-source models. Finally, we conduct ablation studies to validate the importance of both our generation and reranking steps on misconception generation quality.
[170] Large Language Models for Mental Health: A Multilingual Evaluation
Nishat Raihan,Sadiya Sayara Chowdhury Puspo,Ana-Maria Bucur,Stevie Chancellor,Marcos Zampieri
Main category: cs.CL
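TL;DR: This paper evaluates proprietary and open-source LLMs on multilingual mental health datasets in zero-shot, few-shot, and fine-tuned settings and compares them with conventional NLP baselines. LLMs perform well on original multilingual data, but performance drops on machine-translated data, and the size of the drop varies by language and typology.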
Details
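Motivation: To examine the potential and limits of LLMs in multilingual mental health applications, with particular attention to how performance differs on non-English languages and on machine-translated data. Method: Evaluates proprietary and open-source LLMs in zero-shot, few-shot, and fine-tuned settings on eight mental health datasets in various languages and their machine-translated (MT) counterparts, comparing against conventional non-LLM NLP baselines; also assesses translation quality across language families and typologies to understand its influence on LLM performance. Result: Proprietary LLMs and fine-tuned open-source LLMs achieve competitive F1 scores on several original datasets, often surpassing state-of-the-art results; performance on MT data is generally lower, and the extent of the decline varies by language and typology. Conclusion: LLMs handle multilingual mental health tasks well, but their performance depends on source-text quality: structural or lexical mismatches introduced by translation reduce effectiveness, which argues for caution with MT data and attention to language-specific modeling. Abstract: Large Language Models (LLMs) have remarkable capabilities across NLP tasks. However, their performance in multilingual contexts, especially within the mental health domain, has not been thoroughly explored. In this paper, we evaluate proprietary and open-source LLMs on eight mental health datasets in various languages, as well as their machine-translated (MT) counterparts. We compare LLM performance in zero-shot, few-shot, and fine-tuned settings against conventional NLP baselines that do not employ LLMs. In addition, we assess translation quality across language families and typologies to understand its influence on LLM performance. Proprietary LLMs and fine-tuned open-source LLMs achieve competitive F1 scores on several datasets, often surpassing state-of-the-art results. However, performance on MT data is generally lower, and the extent of this decline varies by language and typology. This variation highlights both the strengths of LLMs in handling mental health tasks in languages other than English and their limitations when translation quality introduces structural or lexical mismatches.
[171] Abstract Activation Spaces for Content-Invariant Reasoning in Large Language Models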
Gabriele Maraia,Marco Valentino,Fabio Massimo Zanzotto,Leonardo Ranaldi
Main category: cs.CL
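TL;DR: This paper proposes an abstraction-guided reasoning framework that separates structural inference from lexical semantics. Lightweight Abstractors apply multi-layer interventions to the model's residual-stream states, reducing LLM misjudgments of formal validity caused by semantic content in syllogistic reasoning.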
Details
Motivation: LLMs often conflate semantic plausibility with formal validity in deductive judgment such as syllogistic reasoning (the content effect), and the bias persists even when models generate step-wise explanations, so semantic interference must be suppressed to make formal reasoning more robust. Method: Constructs paired content-laden and abstract syllogisms and defines an abstract reasoning space from the model's activations on abstract inputs; trains lightweight Abstractors that predict representations aligned with this space from content-conditioned residual-stream states, and applies them through multi-layer interventions during the forward pass; uses cross-lingual transfer as a test bed. Result: Abstraction-aligned steering reduces content-driven errors and improves validity-sensitive performance; activation-level abstraction is positioned as a scalable mechanism for making LLM formal reasoning more robust. Conclusion: Explicitly separating structural inference from lexical semantics and intervening with abstraction-aligned activations mitigates the content effect and improves the reliability and generalization of LLMs on formal reasoning tasks. Abstract: Large Language Models (LLMs) often struggle with deductive judgment in syllogistic reasoning, systematically conflating semantic plausibility with formal validity, a phenomenon known as the content effect. This bias persists even when models generate step-wise explanations, indicating that intermediate rationales may inherit the same semantic shortcuts that affect answers. Recent approaches propose mitigating this issue by increasing inference-time structural constraints, either by encouraging abstract intermediate representations or by intervening directly in the model's internal computations; however, reliably suppressing semantic interference remains an open challenge. To make formal deduction less sensitive to semantic content, we introduce a framework for abstraction-guided reasoning that explicitly separates structural inference from lexical semantics. We construct paired content-laden and abstract syllogisms and use the model's activations on abstract inputs to define an abstract reasoning space. We then learn lightweight Abstractors that, from content-conditioned residual-stream states, predict representations aligned with this space and integrate these predictions via multi-layer interventions during the forward pass. Using cross-lingual transfer as a test bed, we show that abstraction-aligned steering reduces content-driven errors and improves validity-sensitive performance.
Our results position activation-level abstraction as a scalable mechanism for enhancing the robustness of formal reasoning in LLMs against semantic interference.
[172] From Directions to Regions: Decomposing Activations in Language Models via Local Geometry
Or Shafran,Shaked Ronen,Omri Fahn,Shauli Ravfogel,Atticus Geiger,Mor Geva
Main category: cs.CL
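TL;DR: This paper proposes decomposing language-model activation space with a Mixture of Factor Analyzers (MFA), modeling it as multiple Gaussian regions with local covariance structure so as to capture nonlinear, multi-dimensional concept structure. On localization and steering tasks it outperforms existing unsupervised methods and matches or exceeds some supervised methods and sparse autoencoders.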
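The decomposition summarized above (region centroid plus local variation) can be illustrated with a minimal NumPy sketch of a single factor-analyzer component. This is not the authors' implementation: the function names `fa_decompose` and `assign_region`, the diagonal noise model `psi`, and the toy parameters are assumptions made for illustration only.

```python
import numpy as np

def fa_decompose(x, mu, W, psi):
    """Split activation x into a centroid, a local variation W @ z, and a residual.

    mu:  (d,)   centroid of one Gaussian region
    W:   (d, k) factor loadings spanning the region's local subspace
    psi: (d,)   diagonal noise variances (assumed noise model)
    """
    # Posterior mean of z under x = mu + W z + eps, eps ~ N(0, diag(psi)).
    Wt_psi_inv = W.T / psi                       # W^T Psi^{-1}, shape (k, d)
    M = np.eye(W.shape[1]) + Wt_psi_inv @ W      # I + W^T Psi^{-1} W
    z = np.linalg.solve(M, Wt_psi_inv @ (x - mu))
    local = W @ z
    return mu, local, x - mu - local

def assign_region(x, mus, Ws, psis, log_priors):
    """Return the index of the component with the highest log posterior score."""
    scores = []
    for mu, W, psi, lp in zip(mus, Ws, psis, log_priors):
        cov = W @ W.T + np.diag(psi)             # marginal covariance of the region
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        scores.append(lp - 0.5 * (logdet + diff @ np.linalg.solve(cov, diff)))
    return int(np.argmax(scores))
```

In this sketch an activation is first assigned to a region and then split into that region's centroid and a low-dimensional offset inside the region's subspace, so the unit of analysis is a subspace rather than a single global direction.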
Details
Motivation: Existing activation decomposition methods rely on linear assumptions (such as a single global direction) and struggle to capture the nonlinear or multi-dimensional geometry of concepts in activation space.
Method: Use an unsupervised, scalable Mixture of Factor Analyzers (MFA) to model activation space as multiple Gaussian regions, each characterized by a centroid and its local covariance (local variation), thereby modeling local geometric structure.
Result: Large-scale MFAs were trained for Llama-3.1-8B and Gemma-2-2B; experiments show that they capture complex nonlinear structure, outperform unsupervised baselines on localization and steering benchmarks, are competitive with supervised localization methods, and often steer better than sparse autoencoders.
Conclusion: Local geometry, expressed through subspaces, is a more suitable and more scalable unit of analysis for concept discovery and model control than isolated directions, and it covers complex structures that traditional methods miss.
Abstract: Activation decomposition methods in language models are tightly coupled to geometric assumptions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions with their local covariance structure. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.
[173] Indications of Belief-Guided Agency and Meta-Cognitive Monitoring in Large Language Models
Noam Steinmetz Yalon,Ariel Goldstein,Liad Mudrik,Mor Geva
Main category: cs.CL
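TL;DR: This paper evaluates whether LLMs satisfy HOT-3, a consciousness indicator derived from higher-order theories, finds that LLMs exhibit belief-guided agency and meta-cognitive monitoring, and proposes a new method for quantifying the dynamics of a model's latent beliefs.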
Details
Motivation: Investigate whether large language models (LLMs) possess some form of consciousness, in particular with respect to indicators of consciousness derived from neuroscientific theories (such as HOT-3).
Method: Treat representations in the model's latent space as "beliefs" and propose a new metric that quantifies their dominance during generation; analyze the dynamics between competing beliefs across models and tasks to test the relationship between belief formation, action selection, and meta-cognitive monitoring.
Result: (1) External manipulations systematically modulate internal belief formation; (2) belief formation causally drives action selection; (3) models can monitor and report their own belief states.
Conclusion: The work provides empirical support for belief-guided agency and meta-cognitive monitoring in LLMs and lays methodological groundwork for studying the emergence of agency, beliefs, and meta-cognition in LLMs.
Abstract: Rapid advancements in large language models (LLMs) have raised the question of whether these models possess some form of consciousness. To tackle this challenge, Butlin et al. (2023) introduced a list of indicators for consciousness in artificial systems based on neuroscientific theories. In this work, we evaluate a key indicator from this list, called HOT-3, which tests for agency guided by a general belief-formation and action selection system that updates beliefs based on meta-cognitive monitoring. We view beliefs as representations in the model's latent space that emerge in response to a given input, and introduce a metric to quantify their dominance during generation. Analyzing the dynamics between competing beliefs across models and tasks reveals three key findings: (1) external manipulations systematically modulate internal belief formation, (2) belief formation causally drives the model's action selection, and (3) models can monitor and report their own belief states. Together, these results provide empirical support for the existence of belief-guided agency and meta-cognitive monitoring in LLMs. More broadly, our work lays methodological groundwork for investigating the emergence of agency, beliefs, and meta-cognition in LLMs.
[174] MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Haozhen Zhang,Quanyu Long,Jianzhu Bao,Tao Feng,Weizhi Zhang,Haodong Yue,Wenya Wang
Main category: cs.CL
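TL;DR: MemSkill models the memory operations of LLM agents as learnable, evolvable "memory skills": a controller selects skills, an executor generates memories, and a designer periodically refines the skill set. Together they form a closed-loop adaptive memory system that significantly improves performance on several long-horizon benchmarks.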
Details
Motivation: Existing LLM agent memory systems rely on static, hand-designed operations; they lack flexibility, handle long histories inefficiently, and adapt poorly to diverse interaction patterns.
Method: MemSkill abstracts memory operations into structured, reusable "memory skills" and introduces a controller (which learns skill selection), an executor (LLM-driven skill execution), and a designer (which iteratively evolves the skill set from hard cases), forming a closed-loop, self-evolving framework.
Result: MemSkill outperforms strong baselines on LoCoMo, LongMemEval, HotpotQA, and ALFWorld and generalizes better; analyses show how skills evolve with task demands.
Conclusion: MemSkill shows that making memory mechanisms learnable and self-evolving is effective, and offers a new paradigm for more adaptive, continually evolving memory systems for LLM agents.
Abstract: Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a controller that learns to select a small set of relevant skills, paired with an LLM-based executor that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a designer that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.
[175] Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability
Xiao Liang,Zhong-Zhi Li,Zhenghao Lin,Eric Hancheng Jiang,Hengyuan Zhang,Yelong Shen,Kai-Wei Chang,Ying Nian Wu,Yeyun Gong,Weizhu Chen
Main category: cs.CL
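TL;DR: This paper proposes an end-to-end reinforcement learning framework for divide-and-conquer (DAC) reasoning that addresses the shortcomings of chain-of-thought (CoT) on complex problems, significantly improving reasoning performance and test-time scalability on competition-level benchmarks.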
Details
Motivation: Chain-of-thought (CoT) falls short at the limits of model capability, and its strictly sequential nature constrains test-time scalability; divide-and-conquer (DAC) is promising, but general-purpose post-training is fundamentally misaligned with DAC-style inference.
Method: Propose an end-to-end reinforcement learning framework in which, at each step, the policy decomposes the problem into subproblems, solves them, and then solves the original problem conditioned on the subproblem solutions, with both decomposition and solving integrated into RL training.
Result: On competition-level benchmarks, the DAC framework surpasses CoT by 8.6% in Pass@1 and 6.3% in Pass@32, showing a higher performance ceiling and stronger test-time scalability.
Conclusion: A reinforcement learning paradigm designed specifically for DAC reasoning can unlock LLM reasoning on the most challenging tasks, beyond the conventional chain-of-thought paradigm.
Abstract: Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model's capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs' reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.
[176] RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents
Jialiang Zhu,Gongrui Zhang,Xiaolong Ma,Lin Xu,Miaosen Zhang,Ruiqi Yang,Song Wang,Kai Qiu,Zhirong Wu,Qi Dai,Ruichun Ma,Bei Liu,Yifan Yang,Chong Luo,Zhengyuan Yang,Linjie Li,Lijuan Wang,Weizhu Chen,Xin Geng,Baining Guo
Main category: cs.CL
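TL;DR: This paper proposes Re-TRAC, a framework that uses cross-trajectory exploration and a structured state representation to enable iterative reflection and global planning, significantly improving the search efficiency and performance of LLM research agents.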
Details
Motivation: Existing ReAct-based LLM research agents follow a linear design that makes it hard to backtrack, branch into alternatives, or maintain global awareness under long contexts, which leads to local optima, redundant exploration, and inefficient search.
Method: Propose Re-TRAC: after each trajectory, generate a structured state representation (covering evidence, uncertainties, failures, and plans) and condition subsequent trajectories on it; for smaller models, introduce Re-TRAC-aware supervised fine-tuning.
Result: On BrowseComp, Re-TRAC outperforms ReAct by 15-20%; fine-tuned small models reach state of the art at comparable scales; tool calls and token usage decrease monotonically across rounds, indicating more targeted and efficient exploration.
Conclusion: Re-TRAC models research as a progressive process and uses cross-trajectory reflection to obtain more robust, efficient, and scalable deep research agents.
Abstract: LLM-based deep research agents are largely built on the ReAct framework. This linear design makes it difficult to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, often leading to local optima, redundant exploration, and inefficient search. We propose Re-TRAC, an agentic framework that performs cross-trajectory exploration by generating a structured state representation after each trajectory to summarize evidence, uncertainties, failures, and future plans, and conditioning subsequent trajectories on this state representation. This enables iterative reflection and globally informed planning, reframing research as a progressive process. Empirical results show that Re-TRAC consistently outperforms ReAct by 15-20% on BrowseComp with frontier LLMs. For smaller models, we introduce Re-TRAC-aware supervised fine-tuning, achieving state-of-the-art performance at comparable scales. Notably, Re-TRAC shows a monotonic reduction in tool calls and token usage across rounds, indicating progressively targeted exploration driven by cross-trajectory reflection rather than redundant search.
[177] Reward-free Alignment for Conflicting Objectives
Peter Chen,Xiaopeng Li,Xi Chen,Tianyi Lin
Main category: cs.CL
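TL;DR: This paper proposes RACO, a reward-model-free framework for multi-objective alignment that resolves gradient conflicts directly from pairwise preference data using a clipped variant of conflict-averse gradient descent. Across several LLMs it outperforms existing baselines on multi-objective summarization and safety alignment tasks.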
Details
Motivation: Real-world alignment problems often involve multiple conflicting objectives; weighted-loss methods tend to cause unstable training and poor trade-offs, and existing multi-objective methods rely on explicit reward models, which add complexity and distort user preferences.
Method: Propose the reward-free alignment framework RACO, which handles pairwise preference data directly with a novel clipped variant of conflict-averse gradient descent and comes with theoretical guarantees of convergence to weighted Pareto-critical points; add heuristic improvements and validate them experimentally.
Result: On multi-objective summarization and safety alignment tasks, RACO achieves better Pareto trade-offs across several LLM families (Qwen 3, Llama 3, Gemma 3) and outperforms existing baselines in both qualitative and quantitative evaluations.
Conclusion: RACO is an effective, theoretically grounded, reward-model-free method for multi-objective LLM alignment that stably improves performance across objectives.
Abstract: Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.
cs.CV [Back]
[178] EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions
Weiyu Sun,Liangliang Chen,Yongnuo Cai,Huiru Xie,Yi Zeng,Ying Zhang
Main category: cs.CV
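TL;DR: This paper introduces EDU-CIRCUIT-HW, a dataset for evaluating how well multimodal large language models (MLLMs) recognize and understand handwritten STEM student solutions (formulas, diagrams, and textual reasoning). It shows that their reliability for auto-grading in high-stakes educational settings is insufficient, and that error-pattern analysis plus a small amount of human intervention (about 4%) significantly improves the robustness of an AI grading system.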
Details
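Motivation: MLLMs lack authentic, domain-specific benchmarks for unconstrained handwritten STEM student solutions (which interleave formulas, diagrams, and textual reasoning); in addition, current evaluation paradigms look only at downstream task outcomes (such as auto-grading) and cannot fully reflect a model's understanding of complex handwritten logic as a whole.
Method: Build and release EDU-CIRCUIT-HW (1,300+ authentic handwritten student solutions from a university-level STEM course); using expert-verified verbatim transcriptions and grading reports, evaluate MLLMs' upstream recognition fidelity and downstream auto-grading performance together; then carry out error-pattern analysis and a case study on targeted correction.
Result: MLLM recognition of handwritten content contains a large number of latent errors, which makes the models seriously unreliable for high-stakes educational applications such as auto-grading; preemptively detecting and correcting errors based on identified error patterns, with only about 4% human intervention, significantly improves the robustness of the AI grading system on unseen solutions.
Conclusion: Current MLLMs are not yet able to carry out high-confidence educational assessment on their own; fine-grained recognition evaluation, error diagnosis, and human-AI collaborative correction are needed for reliable deployment in real educational settings.
Abstract: Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs' understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Utilizing the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs' upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models' insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings.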
As a solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and rectify recognition errors, with only minimal human intervention (approximately 4% of the total solutions), can significantly enhance the robustness of the deployed AI-enabled grading system on unseen student solutions.
[179] Mirage2Matter: A Physically Grounded Gaussian World Model from Video
Zhengqing Gao,Ziwen Li,Xin Wang,Jiaxin Huang,Zhenyang Ren,Mingkai Shao,Hanlue Zhang,Tianyu Huang,Yongkang Cheng,Yandong Guo,Runqi Lin,Yuanyuan Wang,Tongliang Liu,Kun Zhang,Mingming Gong
Main category: cs.CV
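TL;DR: This paper presents Simulate Anything, a framework that efficiently generates high-fidelity embodied training data from multi-view videos and off-the-shelf assets. It reconstructs scenes with 3D Gaussian Splatting and uses generative models for physically realistic simulation, so that VLA models achieve zero-shot downstream performance that matches or surpasses training on real data.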
Details
Motivation: The scalability of embodied intelligence is limited by the scarcity of real-world interaction data; existing simulation platforms suffer from visual and physical gaps and depend on expensive sensors or precise calibration, which makes them impractical at scale.
Method: Use 3D Gaussian Splatting (3DGS) to reconstruct photometrically consistent, geometrically detailed 3D scenes from multi-view video; use generative models to recover physical plausibility and a precision calibration target to align the scene to real-world scale, yielding a unified, editable, physically grounded world model.
Result: Vision Language Action (VLA) models trained on the simulated data show strong zero-shot generalization on several downstream tasks, matching or surpassing baselines trained on real data.
Conclusion: Reconstruction-driven world modeling is an effective path to scalable, practical training for embodied intelligence.
Abstract: The scalability of embodied intelligence is fundamentally constrained by the scarcity of real-world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics-driven world modeling and simulation framework that enables efficient generation of high-fidelity embodied training data using only multi-view environment videos and off-the-shelf assets. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), seamlessly capturing fine-grained geometry and appearance from video. We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target, enabling accurate scale alignment between the reconstructed scene and the real world. Together, these components provide a unified, editable, and physically grounded world model. Vision Language Action (VLA) models trained on our simulated data achieve strong zero-shot performance on downstream tasks, matching or even surpassing results obtained with real-world data, highlighting the potential of reconstruction-driven world modeling for scalable and practical embodied intelligence training.
[180] R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation
Zhuohong Chen,Zhengxian Wu,Zirui Liao,Shenao Jiang,Hangrui Xu,Yang Chen,Chaokui Su,Xiaoyu Liu,Haoqian Wang
Main category: cs.CV
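TL;DR: This paper proposes R3G, a three-stage reasoning-retrieval-reranking framework that improves image retrieval and integration for visual question answering (VQA) and achieves state-of-the-art performance on MRAG-Bench.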
Details
Motivation: Vision-centric retrieval for VQA must obtain missing visual cues and integrate them into the reasoning process, but selecting and integrating images remains challenging.
Method: Propose the modular R3G framework: first produce a brief reasoning plan that specifies the required visual cues, then select evidence images with a two-stage strategy of coarse retrieval followed by fine-grained reranking; introduce a sufficiency-aware reranking mechanism.
Result: On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios and achieves state-of-the-art overall performance; ablations show that the reasoning step and reranking are complementary.
Conclusion: By decoupling reasoning planning from image retrieval and reranking, R3G makes significantly better use of visual cues and offers a new paradigm for VQA retrieval.
Abstract: Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remains challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.
[181] HYPE-EDIT-1: Benchmark for Measuring Reliability in Frontier Image Editing Models
Wing Chan,Richard Allen
Main category: cs.CV
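TL;DR: This paper introduces HYPE-EDIT-1, a benchmark for assessing how useful reference-based image editing models are in real marketing and design workflows. Its cost-effectiveness metrics (such as pass@10 and effective cost per successful edit) show that cheaper models are not necessarily more economical.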
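The metrics named in this entry can be related to one another with a short sketch. The formulas below assume independent attempts with a fixed per-attempt pass rate `p`, and the function names, prices, and review costs are hypothetical; the benchmark's exact definitions may differ.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts passes."""
    return 1.0 - (1.0 - p) ** k

def expected_attempts(p: float, cap: int) -> float:
    """Mean attempts used when stopping at the first pass or after `cap` tries."""
    # Attempt i is made only if the first i - 1 attempts all failed.
    return sum((1.0 - p) ** (i - 1) for i in range(1, cap + 1))

def cost_per_success(p: float, cap: int, price: float, review: float) -> float:
    """Expected spend per task divided by the chance of ending with a usable edit.

    price:  model cost per generated image (hypothetical value)
    review: human review cost per generated image (hypothetical value)
    """
    if p <= 0.0:
        return float("inf")
    spend = expected_attempts(p, cap) * (price + review)
    return spend / pass_at_k(p, cap)

# Hypothetical prices; the 34% and 83% pass rates are the extremes reported above.
cheap = cost_per_success(p=0.34, cap=10, price=0.02, review=0.30)
pricey = cost_per_success(p=0.83, cap=10, price=0.10, review=0.30)
print(cheap > pricey)
```

Under these assumptions the cost per success reduces to (price + review) / p, so a low pass rate multiplies the review cost as well as the model price, which is one way a cheaper model can end up more expensive overall.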
Details
Motivation: Public demos of image editing models are mostly best-case samples and do not reflect the time and cost of retries and human review in real workflows; a benchmark for realistic scenarios is needed.
Method: Build HYPE-EDIT-1, a benchmark of 100 reference-based image editing tasks (50 public, 50 private); generate 10 independent outputs per task and compute per-attempt pass rate, pass@10, expected attempts under a retry cap, and an effective cost per success that combines model price with human review time; provide a standardized JSON schema and tooling for VLM and human judging.
Result: Per-attempt pass rates of the tested models range from 34% to 83%, and effective cost per success from USD 0.66 to 1.42; cheaper models cost more overall because of their low pass rates.
Conclusion: Judging image editing services by per-image price alone is unrealistic; retry and human review costs must be included. HYPE-EDIT-1 provides a reproducible, extensible evaluation framework for practical image editing systems.
Abstract: Public demos of image editing models are typically best-case samples; real workflows pay for retries and review time. We introduce HYPE-EDIT-1, a 100-task benchmark of reference-based marketing/design edits with binary pass/fail judging. For each task we generate 10 independent outputs to estimate per-attempt pass rate, pass@10, expected attempts under a retry cap, and an effective cost per successful edit that combines model price with human review time. We release 50 public tasks and maintain a 50-task held-out private split for server-side evaluation, plus a standardized JSON schema and tooling for VLM and human-based judging. Across the evaluated models, per-attempt pass rates span 34-83 percent and effective cost per success spans USD 0.66-1.42. Models that have low per-image pricing are more expensive when you consider the total effective cost of retries and human reviews.
[182] Efficient UAV trajectory prediction: A multi-modal deep diffusion framework
Yuan Gao,Xinyu Guo,Wenjing Xie,Zifan Wang,Hongwen Yu,Gongyang Li,Shugong Xu
Main category: cs.CV
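TL;DR: This paper proposes a UAV trajectory prediction method that fuses LiDAR and millimeter-wave radar information. Its multi-modal deep fusion framework, built from two modality-specific feature extraction networks and a bidirectional cross-attention fusion module, improves trajectory prediction accuracy for unauthorized UAVs in the low-altitude economy by 40% over the baseline model.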
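To make the bidirectional cross-attention idea concrete, here is a minimal NumPy sketch in which each modality's tokens attend to the other's. It is a simplification under stated assumptions: single head, no learned projections, residual addition, and mean pooling; the names `cross_attend` and `bidirectional_fusion` are hypothetical and the paper's module is likely richer.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Scaled dot-product attention where `queries` read from `context`."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context

def bidirectional_fusion(lidar_feats, radar_feats):
    """Fuse (n_lidar, d) and (n_radar, d) token features into one (2 * d,) vector."""
    lidar_enriched = lidar_feats + cross_attend(lidar_feats, radar_feats)
    radar_enriched = radar_feats + cross_attend(radar_feats, lidar_feats)
    # Pool each stream over its tokens, then concatenate the two summaries.
    return np.concatenate([lidar_enriched.mean(axis=0), radar_enriched.mean(axis=0)])
```

The two streams may contain different numbers of tokens; only the feature width `d` has to match, which is why the sketch assumes both encoders emit features of the same dimension.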
Details
Motivation: Managing unauthorized UAVs in the low-altitude economy requires more accurate trajectory prediction; a single sensor is limiting, so the complementary information of LiDAR (spatial geometric structure) and millimeter-wave radar (dynamic reflection characteristics) should be fused.
Method: Propose a multi-modal deep fusion framework with two structurally identical but independent modality-specific feature encoders (one for LiDAR and one for radar point clouds) and a bidirectional cross-attention fusion module for cross-modal complementarity and semantic alignment; train and test on the MMAUD dataset from the CVPR 2024 UG2+ challenge, and run ablations on the effect of loss functions and post-processing strategies.
Result: On MMAUD, the model improves trajectory prediction accuracy by 40% over the baseline model; ablations confirm the contribution of the bidirectional cross-attention mechanism, specific loss functions, and post-processing strategies.
Conclusion: The multi-modal fusion method makes effective use of the complementary properties of LiDAR and millimeter-wave radar and provides an efficient, robust solution for real-time trajectory prediction of unauthorized UAVs in the low-altitude economy.
Abstract: To meet the requirements for managing unauthorized UAVs in the low-altitude economy, a multi-modal UAV trajectory prediction method based on the fusion of LiDAR and millimeter-wave radar information is proposed. A deep fusion network for multi-modal UAV trajectory prediction, termed the Multi-Modal Deep Fusion Framework, is designed. The overall architecture consists of two modality-specific feature extraction networks and a bidirectional cross-attention fusion module, aiming to fully exploit the complementary information of LiDAR and radar point clouds in spatial geometric structure and dynamic reflection characteristics. In the feature extraction stage, the model employs independent but structurally identical feature encoders for LiDAR and radar. After feature extraction, the model enters the Bidirectional Cross-Attention Mechanism stage to achieve information complementarity and semantic alignment between the two modalities. To verify the effectiveness of the proposed model, the MMAUD dataset used in the CVPR 2024 UG2+ UAV Tracking and Pose-Estimation Challenge is adopted as the training and testing dataset. Experimental results show that the proposed multi-modal fusion model significantly improves trajectory prediction accuracy, achieving a 40% improvement compared to the baseline model. In addition, ablation experiments are conducted to demonstrate the effectiveness of different loss functions and post-processing strategies in improving model performance.
The proposed model can effectively utilize multi-modal data and provides an efficient solution for unauthorized UAV trajectory prediction in the low-altitude economy.
[183] SITUATE -- Synthetic Object Counting Dataset for VLM training
René Peinl,Vincent Tischler,Patrick Schröder,Christian Groth
Main category: cs.CV
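TL;DR: This paper presents SITUATE, a dataset for training and evaluating vision language models on counting tasks with spatial constraints. It fills the gap between simple 2D datasets and realistic but ambiguous ones, and is shown to improve generalization to out-of-distribution images.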
Details
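Motivation: Existing counting datasets are limited: simple 2D datasets (such as VLMCountBench) lack realism, while real-life datasets (such as TallyQA) are ambiguous because occlusions and spatial composition are uncontrolled; a new benchmark that is both controllable and realistic is needed.
Method: Build the SITUATE dataset with counting tasks under explicit spatial constraints; fine-tune Qwen VL 2.5 7B on it and evaluate cross-dataset generalization on benchmarks such as Pixmo count.
Result: The model fine-tuned on SITUATE is more accurate on the Pixmo count test set, but not vice versa; comparisons across several counting benchmarks confirm that SITUATE yields larger generalization gains than an equally sized Pixmo count fine-tuning set.
Conclusion: SITUATE is a new benchmark dataset that effectively improves the generalization of vision language models on spatially constrained counting tasks, and in particular their robustness to out-of-distribution images.
Abstract: We present SITUATE, a novel dataset designed for training and evaluating Vision Language Models on counting tasks with spatial constraints. The dataset bridges the gap between simple 2D datasets like VLMCountBench and often ambiguous real-life datasets like TallyQA, which lack control over occlusions and spatial composition. Experiments show that our dataset helps to improve generalization for out-of-distribution images, since a finetune of Qwen VL 2.5 7B on SITUATE improves accuracy on the Pixmo count test data, but not vice versa. We cross-validate this by comparing the model performance across other established counting benchmarks and against an equally sized fine-tuning set derived from Pixmo count.
[184] Robustness of Presentation Attack Detection in Remote Identity Validation Scenarios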
John J. Howard,Richard O. Plesh,Yevgeniy B. Sirotin,Jerry L. Tipton,Arun R. Vemury
Main category: cs.CV
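TL;DR: This paper studies how low light and automated image acquisition affect the robustness of commercial presentation attack detection (PAD) systems in remote identity validation (RIV). Most systems degrade significantly; only one keeps its error rate below 3% in all scenarios.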
Details
Motivation: Ensuring that PAD subsystems remain robust, reliable, and user-friendly across diverse environmental and operating conditions is a key challenge for deploying remote identity validation systems.
Method: Run a scenario test of RIV to evaluate how the performance of commercial PAD systems changes under low-light conditions and automated image acquisition workflows, and use a model to predict the change in error rates.
Result: Low light raises PAD error rates by a factor of about four, and auto-capture roughly doubles them; only one system keeps its maximum bona fide presentation classification error rate below 3% across all tested scenarios.
Conclusion: PAD systems must be tested thoroughly under diverse real-world conditions to ensure that they are reliable and robust in practice.
Abstract: Presentation attack detection (PAD) subsystems are an important part of effective and user-friendly remote identity validation (RIV) systems. However, ensuring robust performance across diverse environmental and procedural conditions remains a critical challenge. This paper investigates the impact of low-light conditions and automated image acquisition on the robustness of commercial PAD systems using a scenario test of RIV. Our results show that PAD systems experience a significant decline in performance when utilized in low-light or auto-capture scenarios, with a model-predicted increase in error rates by a factor of about four under low-light conditions and a doubling of those odds under auto-capture workflows. Specifically, only one of the tested systems was robust to these perturbations, maintaining a maximum bona fide presentation classification error rate below 3% across all scenarios. Our findings emphasize the importance of testing across diverse environments to ensure robust and reliable PAD performance in real-world applications.
[185] Observing Health Outcomes Using Remote Sensing Imagery and Geo-Context Guided Visual Transformer
Yu Li,Guilherme N. DeSouza,Praveen Rao,Chi-Ren Shyu
Main category: cs.CV
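TL;DR: This paper proposes a new remote sensing image model that integrates auxiliary geospatial information through a geospatial embedding mechanism and a guided attention module, improving multimodal geospatial understanding and outperforming existing geospatial foundation models at predicting disease prevalence.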
Details
Motivation: Existing vision-language and multimodal models focus on semantic alignment between visual and textual content and cannot represent or reason over structured geospatial layers, so they fall short of what geospatial understanding of remote sensing imagery requires.
Method: Propose a geospatial embedding mechanism that turns diverse geospatial data into embedding patches spatially aligned with image patches; design a guided attention module that computes attention weights dynamically from correlations with the auxiliary data and assigns distinct roles to individual attention heads so that they capture complementary information.
Result: On disease prevalence prediction, the proposed framework outperforms existing pretrained geospatial foundation models, confirming its effectiveness for multimodal geospatial understanding.
Conclusion: Incorporating geospatial priors and strengthening cross-modal interaction modeling can significantly improve the performance and interpretability of remote sensing image analysis on real geospatial tasks.
Abstract: Visual transformers have driven major progress in remote sensing image analysis, particularly in object detection and segmentation. Recent vision-language and multimodal models further extend these capabilities by incorporating auxiliary information, including captions, question and answer pairs, and metadata, which broadens applications beyond conventional computer vision tasks. However, these models are typically optimized for semantic alignment between visual and textual content rather than geospatial understanding, and therefore are not suited for representing or reasoning with structured geospatial layers. In this study, we propose a novel model that enhances remote sensing imagery processing with guidance from auxiliary geospatial information. Our approach introduces a geospatial embedding mechanism that transforms diverse geospatial data into embedding patches that are spatially aligned with image patches. To facilitate cross-modal interaction, we design a guided attention module that dynamically integrates multimodal information by computing attention weights based on correlations with auxiliary data, thereby directing the model toward the most relevant regions. In addition, the module assigns distinct roles to individual attention heads, allowing the model to capture complementary aspects of the guidance information and improving the interpretability of its predictions.
Experimental results demonstrate that the proposed framework outperforms existing pretrained geospatial foundation models in predicting disease prevalence, highlighting its effectiveness in multimodal geospatial understanding.
[186] From Manual Observation to Automated Monitoring: Space Allowance Effects on Play Behaviour in Group-Housed Dairy Calves
Haiyu Yang,Heidi Lesscher,Enhong Liu,Miel Hostens
Main category: cs.CV
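TL;DR: This study examines how space allowance (2.66–17.98 m² per calf) affects play behaviour in 60 group-housed dairy calves on 14 commercial farms in the Netherlands, and develops an automated computer vision monitoring pipeline (97.6% accuracy, 99.4% recall). Play relates non-linearly to space and peaks at 8–10 m² per calf (1.6% of observation time), a range the authors propose as a practical target that balances welfare and economics.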
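The percentages of observation period in this entry are easier to interpret as minutes. The helper below is an illustrative conversion only; the name `play_minutes` is hypothetical, and the 17-hour observation period is taken from the abstract's "around 10 minutes per 17-hour period".

```python
def play_minutes(percent_op: float, period_hours: float = 17.0) -> float:
    """Convert play expressed as % of observation period (%OP) into minutes."""
    return percent_op / 100.0 * period_hours * 60.0

print(round(play_minutes(1.0), 1))  # mean across calves: 10.2 minutes
print(round(play_minutes(1.6), 1))  # peak at 8-10 m² per calf: 16.3 minutes
```

On this reading, the difference between the average and the peak space range amounts to roughly six extra minutes of play per 17-hour period.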
Details
Motivation: Play behaviour is an indicator of positive welfare in dairy calves, but the effect of intermediate-to-high space allowances (6–20 m² per calf) under commercial conditions is unclear; empirical evidence and a scalable monitoring method are needed.
Method: Video-observe 60 group-housed dairy calves on 14 commercial farms and quantify play with a detailed ethogram (as a percentage of the observation period); analyze the data with linear mixed models that include farm as a random effect; build and validate a computer vision pipeline trained on 108 hours of manual annotations.
Result: Calves played for 1.0% of the observation period on average (about 10 minutes per 17 hours); play related non-linearly to space, peaking at 8–10 m² per calf (1.6% OP) and lowest at 6–8 m² and 12–14 m² (<0.6% OP); the space effect remained significant after controlling for age, health, and group size; the computer vision classifier reached 97.6% accuracy and 99.4% recall.
Conclusion: 8–10 m² per calf is a recommended space allowance that improves calf welfare while remaining economically feasible; automated monitoring can scale small manual annotation efforts into continuous welfare assessment systems.
Abstract: Play behaviour serves as a positive welfare indicator in dairy calves, yet the influence of space allowance under commercial conditions remains poorly characterized, particularly at intermediate-to-high allowances (6-20 m² per calf). This study investigated the relationship between space allowance and play behaviour in 60 group-housed dairy calves across 14 commercial farms in the Netherlands (space range: 2.66-17.98 m² per calf), and developed an automated computer vision pipeline for scalable monitoring. Video observations were analyzed using a detailed ethogram, with play expressed as percentage of observation period (%OP). Statistical analysis employed linear mixed models with farm as a random effect. A computer vision pipeline was trained on manual annotations from 108 hours on 6 farms and validated on held-out test data. The computer vision classifier achieved 97.6% accuracy with 99.4% recall for active play detection. Calves spent on average 1.0% of OP playing, corresponding to around 10 minutes per 17-hour period. The space-play relationship was non-linear, with highest play levels at 8-10 m² per calf (1.6% OP) and lowest at 6-8 m² and 12-14 m² (<0.6% OP). Space remained significant after controlling for age, health, and group size.
In summary, these findings suggest that 8-10 m² per calf represents a practical target balancing welfare benefits with economic feasibility, and demonstrate that automated monitoring can scale small annotation projects to continuous welfare assessment systems.
[187] AI-Driven Three-Dimensional Reconstruction and Quantitative Analysis for Burn Injury Assessment
S. Kalaycioglu,C. Hong,K. Zhai,H. Xie,J. N. Wong
Main category: cs.CV
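TL;DR: This paper presents an AI-driven burn assessment platform that combines multi-view photogrammetry, 3D surface reconstruction, and deep learning segmentation to provide objective, reproducible, geometry-aware quantification of burn area, depth, and healing progress from ordinary camera images.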
Details
Motivation: Existing burn assessment methods (such as visual inspection and 2D photography) are subjective and hard to compare over time; an accurate, reproducible, non-invasive objective assessment method is needed.
Method: Integrate multi-view photogrammetry, patient-specific 3D surface reconstruction, anatomical mapping, and deep learning segmentation; from multi-angle images taken with consumer-grade cameras, perform 3D modeling within a clinical workflow, quantify burn regions (surface area, TBSA, depth proxies, volumetric change), and spatially align reconstructions across time points to track healing.
Result: Simulation-based evaluation shows stable reconstructions, consistent metric computation, and clinically plausible longitudinal trends, supporting objective burn assessment in real-world units and decision support.
Conclusion: The platform offers a scalable, non-invasive, geometry-aware approach to objective assessment for acute and outpatient burn care.
Abstract: Accurate, reproducible burn assessment is critical for treatment planning, healing monitoring, and medico-legal documentation, yet conventional visual inspection and 2D photography are subjective and limited for longitudinal comparison. This paper presents an AI-enabled burn assessment and management platform that integrates multi-view photogrammetry, 3D surface reconstruction, and deep learning-based segmentation within a structured clinical workflow. Using standard multi-angle images from consumer-grade cameras, the system reconstructs patient-specific 3D burn surfaces and maps burn regions onto anatomy to compute objective metrics in real-world units, including surface area, TBSA, depth-related geometric proxies, and volumetric change. Successive reconstructions are spatially aligned to quantify healing progression over time, enabling objective tracking of wound contraction and depth reduction. The platform also supports structured patient intake, guided image capture, 3D analysis and visualization, treatment recommendations, and automated report generation. Simulation-based evaluation demonstrates stable reconstructions, consistent metric computation, and clinically plausible longitudinal trends, supporting a scalable, non-invasive approach to objective, geometry-aware burn assessment and decision support in acute and outpatient care.
[188] 1S-DAug: One-Shot Data Augmentation for Robust Few-Shot Generalization
Yunwei Bai,Ying Kiat Tan,Yao Shu,Tsuhan Chen
Main category: cs.CV
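TL;DR: This paper proposes 1S-DAug, a training-free, model-agnostic test-time data augmentation method that generates diverse yet faithful augmented samples from a single image and significantly improves few-shot learning performance.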
Details
Motivation: In few-shot learning (FSL), traditional test-time augmentation is ineffective; a method that can generate high-quality augmented images from a single sample is needed to improve generalization.
Method: Propose 1S-DAug, which combines geometric transformations, controlled noise injection, and a denoising diffusion process conditioned on the original image to generate diverse augmented samples from a single image at test time; the augmented samples are then encoded and aggregated with the original image into a combined representation.
Result: Consistent gains on four standard FSL benchmarks, including a proportional accuracy improvement of more than 10% on the miniImageNet 5-way-1-shot task.
Conclusion: 1S-DAug is an efficient, plug-and-play test-time augmentation strategy that requires no fine-tuning or parameter updates and significantly improves robustness and generalization in few-shot settings.
Abstract: Few-shot learning (FSL) challenges model generalization to novel classes based on just a few shots of labeled examples, a testbed where traditional test-time augmentations fail to be effective. We introduce 1S-DAug, a one-shot generative augmentation operator that synthesizes diverse yet faithful variants from just one example image at test time. 1S-DAug couples traditional geometric perturbations with controlled noise injection and a denoising diffusion process conditioned on the original image. The generated images are then encoded and aggregated, alongside the original image, into a combined representation for more robust FSL predictions. Integrated as a training-free model-agnostic plugin, 1S-DAug consistently improves FSL across standard benchmarks of 4 different datasets without any model parameter update, including achieving over 10% proportional accuracy improvement on the miniImageNet 5-way-1-shot benchmark. Code will be released.
[189] Event Driven Clustering Algorithm
David El-Chai Ben-Ezra,Adar Tal,Daniel Brisk
Main category: cs.CV
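TL;DR: This paper proposes a novel asynchronous, event-driven algorithm for real-time detection of small event clusters in event camera data, with linear time complexity O(n) and a run time that is independent of the pixel array dimensions.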
Details
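Motivation: To detect small event clusters efficiently and in real time by exploiting the asynchronous data structure of event cameras. Method: An asynchronous, event-driven hierarchical agglomerative clustering algorithm based on the spatio-temporal distance between events, using a simple and efficient decision procedure. Result: The algorithm achieves linear time complexity O(n), and its run time does not depend on the pixel array size. Conclusion: The algorithm substantially improves computational efficiency while preserving clustering effectiveness, making it suitable for real-time event camera applications. Abstract: This paper introduces a novel asynchronous, event-driven algorithm for real-time detection of small event clusters in event camera data. Like other hierarchical agglomerative clustering algorithms, the algorithm detects the event clusters based on their tempo-spatial distance. However, the algorithm leverages the special asynchronous data structure of event cameras and, through sophisticated yet efficient and simple decision-making, achieves a linear complexity of $O(n)$, where $n$ is the number of events. In addition, the run time of the algorithm is independent of the dimensions of the pixel array.
[190] IC-EO: Interpretable Code-based assistant for Earth Observation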
Lamia Lahouel,Laurynas Lopata,Simon Gruening,Gabriele Meoni,Gaetan Petit,Sylvain Lobry
Main category: cs.CV
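TL;DR: This paper proposes a conversational, code-generating agent built on a tool-augmented large language model that turns natural-language queries into executable, auditable Python geospatial analysis workflows, making Earth Observation (EO) analysis more accessible, transparent, and reproducible.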
Details
Motivation: Earth Observation analysis currently has a high barrier to entry for non-specialists and depends on expert knowledge, and most existing systems return black-box predictions that are hard to audit or reproduce. Method: Build a unified, easily extensible API for Earth Observation tasks and a conversational, tool-augmented LLM agent that calls this API to generate Python code, supporting classification, segmentation, oriented object detection, spectral index computation, and geospatial operations. Result: On land-composition mapping and post-wildfire damage assessment, the agent reaches 64.2% and 50% accuracy respectively, compared with 51.7% and 0% for the general-purpose LLM/VLM baselines (GPT-4o, LLaVA); the generated code is verifiable, interpretable, and reproducible. Conclusion: The approach turns EO analysis into a transparent, auditable, reproducible process, lowers the expertise barrier, and helps democratize geospatial AI. Abstract: Despite recent advances in computer vision, Earth Observation (EO) analysis remains difficult for laypeople to perform, requiring expert knowledge and technical capabilities. Furthermore, many systems return black-box predictions that are difficult to audit or reproduce. Leveraging recent advances in tool LLMs, this study proposes a conversational, code-generating agent that transforms natural-language queries into executable, auditable Python workflows. The agent operates over a unified, easily extensible API for classification, segmentation, detection (oriented bounding boxes), spectral indices, and geospatial operators. With our proposed framework, it is possible to control the results at three levels: (i) at the tool level, through performance on public EO benchmarks; (ii) at the agent level, to understand the capacity to generate valid, hallucination-free code; and (iii) at the task level, on specific use cases. In this work, we select two use cases of interest: land-composition mapping and post-wildfire damage assessment. The proposed agent outperforms general-purpose LLM/VLM baselines (GPT-4o, LLaVA), achieving 64.2% vs. 51.7% accuracy on land-composition and 50% vs. 0% on post-wildfire analysis, while producing results that are transparent and easy to interpret. By outputting verifiable code, the approach turns EO analysis into a transparent, reproducible process.
[191] VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents
Hongzhu Yi,Yujia Yang,Yuanxiang Wang,Zhenyu Guan,Jiahuan Chen,Chenxi Bao,Tiankun Yang,Yixuan Yuan,Tianyu Zong,Xinming Wang,Tao Yu,Ruiwen Tao,Haijin Liang,Jin Ma,Jinwen Luo,Yeshani Xinyu Zuo,Jungang Xu
Main category: cs.CV
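TL;DR: This paper presents VDE Bench, the first benchmark dedicated to evaluating visual document editing on multilingual, dense-text documents. It comprises a high-quality dataset of densely textual English and Chinese documents and introduces a decoupled evaluation framework for fine-grained measurement of editing performance.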
Details
Motivation: Existing methods mainly target English documents with sparse text layouts and cannot adequately handle dense, structurally complex documents or non-Latin scripts such as Chinese. Method: Propose the VDE Bench benchmark, consisting of a human-annotated and human-evaluated high-quality dataset of densely textual English and Chinese documents, together with a decoupled evaluation framework based on OCR parsing. Result: A comprehensive evaluation of mainstream image editing models was carried out, and manual verification shows that the automated metrics agree closely with human judgment. Conclusion: VDE Bench is the first benchmark to systematically evaluate image editing models on multilingual, densely textual visual documents, filling a gap in this research direction. Abstract: In recent years, multimodal image editing models have achieved substantial progress, enabling users to manipulate visual content through natural language in a flexible and interactive manner. Nevertheless, an important yet insufficiently explored research direction remains visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing approaches, including AnyText, GlyphControl, and TextCtrl, predominantly focus on English-language scenarios and documents with relatively sparse textual layouts, thereby failing to adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose Visual Doc Edit Bench (VDE Bench), a rigorously human-annotated and evaluated benchmark specifically designed to assess image editing models on multilingual and complex visual document editing tasks. The benchmark comprises a high-quality dataset encompassing densely textual documents in both English and Chinese, including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a decoupled evaluation framework that systematically quantifies editing performance at the OCR parsing level, enabling fine-grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative state-of-the-art image editing models. Manual verification demonstrates a strong consistency between human judgments and automated evaluation metrics. 
VDE Bench constitutes the first systematic benchmark for evaluating image editing models on multilingual and densely textual visual documents.
[192] Context-Aware Autoencoders for Anomaly Detection in Maritime Surveillance
Divya Acharya,Pierre Bernabé,Antoine Chevrot,Helge Spieker,Arnaud Gotlieb,Bruno Legeard
Main category: cs.CV
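TL;DR: This paper proposes a context-aware autoencoder that uses context-specific thresholds to improve the detection of collective and contextual anomalies in maritime vessel traffic surveillance while reducing computational cost.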
Details
Motivation: Conventional autoencoders are of limited effectiveness at identifying collective and contextual anomalies in maritime AIS data, where anomalies depend heavily on vessel-specific context derived from self-reported messages. Method: Propose a context-aware autoencoder that integrates context-specific thresholds, and compare four context-aware variants with a conventional autoencoder in a case study on fishing status anomalies. Result: Experiments show that context has a significant effect on reconstruction loss and anomaly detection, and that the context-aware autoencoder performs best at detecting anomalies in time series data. Conclusion: Using context-specific thresholds and accounting for context can improve anomaly detection accuracy in maritime vessel traffic surveillance systems. Abstract: The detection of anomalies is crucial to ensuring the safety and security of maritime vessel traffic surveillance. Although autoencoders are popular for anomaly detection, their effectiveness in identifying collective and contextual anomalies is limited, especially in the maritime domain, where anomalies depend on vessel-specific contexts derived from self-reported AIS messages. To address these limitations, we propose a novel solution: the context-aware autoencoder. By integrating context-specific thresholds, our method improves detection accuracy and reduces computational cost. We compare four context-aware autoencoder variants and a conventional autoencoder using a case study focused on fishing status anomalies in maritime surveillance. Results demonstrate the significant impact of context on reconstruction loss and anomaly detection. The context-aware autoencoder outperforms others in detecting anomalies in time series data. By incorporating context-specific thresholds and recognizing the importance of context in anomaly detection, our approach offers a promising solution to improve accuracy in maritime vessel traffic surveillance systems.
[193] D3R-Net: Dual-Domain Denoising Reconstruction Network for Robust Industrial Anomaly Detection
Dmytro Filatov,Valentyn Fedorov,Vira Filatova,Andrii Zelenchuk
Main category: cs.CV
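TL;DR: This paper proposes D3R-Net, a dual-domain denoising reconstruction framework that combines a self-supervised 'healing' task with frequency-aware regularization (an FFT magnitude loss) to improve the localization of subtle defects in unsupervised anomaly detection, notably raising PRO AUC and pixel-level ROC AUC on the MVTec AD dataset.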
Details
Motivation: Reconstruction-based unsupervised anomaly detection is efficient but tends to oversmooth, so subtle defects in high-frequency detail are not highlighted and segmentation accuracy is limited. Method: Propose D3R-Net, which uses a lightweight convolutional autoencoder; adds a self-supervised 'healing' task (synthetically corrupted normal images as input, clean images as reconstruction targets); adds an FFT magnitude loss to encourage frequency-domain consistency; and optionally includes an SSIM loss, studied in an ablation. Result: On MVTec AD Hazelnut, PRO AUC rises from 0.603 to 0.687; averaged over 15 categories, pixel ROC AUC rises from 0.733 to 0.751 and PRO AUC from 0.417 to 0.468; inference runs at roughly 20 FPS on a single GPU. Conclusion: Frequency-domain regularization lets D3R-Net mitigate reconstruction oversmoothing and improve anomaly localization consistency while remaining efficient, giving a lightweight, practical unsupervised option for industrial visual inspection. Abstract: Unsupervised anomaly detection (UAD) is a key ingredient of automated visual inspection in modern manufacturing. Reconstruction-based methods are appealing because of their simple architecture and fast processing, but they produce oversmoothed results for high-frequency details. As a result, subtle defects are partially reconstructed rather than highlighted, which limits segmentation accuracy. We build on this line of work and introduce D3R-Net, a Dual-Domain Denoising Reconstruction framework that couples a self-supervised 'healing' task with frequency-aware regularization. During training, the network receives synthetically corrupted normal images and is asked to reconstruct the clean targets, which prevents trivial identity mapping and pushes the model to learn the manifold of defect-free textures. In addition to the spatial mean squared error, we employ a Fast Fourier Transform (FFT) magnitude loss that encourages consistency in the frequency domain. The implementation also allows an optional structural similarity (SSIM) term, which we study in an ablation. On the MVTec AD Hazelnut benchmark, D3R-Net with the FFT loss improves localization consistency over a spatial-only baseline: PRO AUC increases from 0.603 to 0.687, while image-level ROC AUC remains robust. Evaluated across fifteen MVTec categories, the FFT variant raises the average pixel ROC AUC from 0.733 to 0.751 and PRO AUC from 0.417 to 0.468 compared to the MSE-only baseline, at roughly 20 FPS on a single GPU. 
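As a rough illustration of the dual-domain objective described in this abstract (spatial MSE plus an FFT magnitude term), the sketch below may help. It is not the authors' implementation: the L1 form of the frequency term, the `freq_weight` value, and all function names are assumptions, and a naive 2D DFT stands in for a real FFT routine so the example needs only the standard library (same values, much slower, suitable for tiny images only).

```python
import cmath


def dft2_magnitude(img):
    """Magnitude of the 2D discrete Fourier transform of a small image.

    `img` is a list of equal-length rows. A naive O(N^4) DFT is used here
    in place of an FFT; the resulting magnitudes are the same.
    """
    h, w = len(img), len(img[0])
    mags = []
    for u in range(h):
        row = []
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2 * cmath.pi * (u * y / h + v * x / w)
                    acc += img[y][x] * cmath.exp(1j * angle)
            row.append(abs(acc))
        mags.append(row)
    return mags


def mse(a, b):
    """Spatial mean squared error between two images of equal size."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n


def fft_magnitude_loss(recon, target):
    """Mean absolute difference of spectrum magnitudes (L1 form is an assumption)."""
    mr, mt = dft2_magnitude(recon), dft2_magnitude(target)
    n = len(mr) * len(mr[0])
    return sum(abs(pa - pb) for ra, rb in zip(mr, mt) for pa, pb in zip(ra, rb)) / n


def dual_domain_loss(recon, target, freq_weight=0.1):
    """Spatial MSE plus a weighted frequency term; freq_weight is hypothetical."""
    return mse(recon, target) + freq_weight * fft_magnitude_loss(recon, target)


if __name__ == "__main__":
    target = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
    oversmoothed = [[0.25] * 4 for _ in range(4)]  # same mean, no detail
    print(dual_domain_loss(oversmoothed, target))
```

An oversmoothed reconstruction keeps the mean (DC) component but loses the higher-frequency magnitudes, so the frequency term penalizes it. The magnitude spectrum is unchanged by circular shifts, which is one reason such a term is normally paired with a spatial loss instead of replacing it.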
The network is trained from scratch and uses a lightweight convolutional autoencoder backbone, providing a practical alternative to heavy pre-trained feature embedding methods.
[194] PovNet+: A Deep Learning Architecture for Socially Assistive Robots to Learn and Assist with Multiple Activities of Daily Living
Fraser Robinson,Souren Pashangpour,Matthew Lisondra,Goldie Nejat
Main category: cs.CV
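TL;DR: This paper proposes POVNet+, a multimodal deep learning architecture that lets socially assistive robots recognize multiple activities of daily living (ADLs). It distinguishes known ADLs, unseen ADLs, and atypically performed ADLs, and combines this with user state estimation to initiate assistance proactively. Experiments show higher ADL classification accuracy than existing methods and successful operation in real human-robot interaction.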
Details
Motivation: A major barrier to the long-term deployment of autonomous socially assistive robots is their inability to both perceive and assist with multiple activities of daily living (ADLs). Method: Propose the multimodal deep learning architecture POVNet+, which jointly models an ADL embedding space and a motion embedding space; applies a novel user state estimation method in the motion embedding space to recognize new ADLs and monitor user performance; and uses the ADL perception results to proactively trigger assistive robot behaviors. Result: POVNet+ achieves higher ADL classification accuracy than state-of-the-art human activity recognition methods; interaction experiments in a cluttered home environment with multiple users and the robot Leia show that it identifies known, unseen, and atypically performed ADLs and initiates appropriate assistive interactions. Conclusion: POVNet+ gives socially assistive robots robust, adaptive multi-ADL perception and proactive assistance, improving their practicality and generalization in real scenarios. Abstract: A significant barrier to the long-term deployment of autonomous socially assistive robots is their inability to both perceive and assist with multiple activities of daily living (ADLs). In this paper, we present the first multimodal deep learning architecture, POVNet+, for multi-activity recognition for socially assistive robots to proactively initiate assistive behaviors. Our novel architecture introduces the use of both ADL and motion embedding spaces to uniquely distinguish between a known ADL being performed, a new unseen ADL, or a known ADL being performed atypically in order to assist people in real scenarios. Furthermore, we apply a novel user state estimation method to the motion embedding space to recognize new ADLs while monitoring user performance. This ADL perception information is used to proactively initiate robot assistive interactions. Comparison experiments with state-of-the-art human activity recognition methods show our POVNet+ method has higher ADL classification accuracy. Human-robot interaction experiments in a cluttered living environment with multiple users and the socially assistive robot Leia using POVNet+ demonstrate the ability of our multi-modal ADL architecture in successfully identifying different seen and unseen ADLs, and ADLs being performed atypically, while initiating appropriate assistive human-robot interactions.
[195] Shedding the Facades, Connecting the Domains: Detecting Shifting Multimodal Hate Video with Test-Time Adaptation
Jiao Li,Jian Lang,Xikai Tang,Wenzheng Shu,Ting Zhong,Qiang Gao,Yong Wang,Leiting Chen,Fan Zhou
Main category: cs.CV
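TL;DR: This paper proposes SCANNER, the first test-time adaptation (TTA) framework for hate video detection (HVD). It exploits hateful cores that stay stable across domains (targeting based on gender, race, and similar characteristics) and combines centroid-guided alignment, sample-level adaptive alignment, and intra-cluster diversity regularization to cope with the severe semantic drift in HVD.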
Details
Motivation: Existing HVD methods assume that training and test data share the same distribution, but in practice hateful content keeps evolving to evade moderation, causing severe semantic drift; conventional TTA methods handle only mild distribution shifts and struggle with such drastic change. The authors observe that although the surface form of hateful content changes, the core attributes it targets (such as gender and race) remain stable and can serve as a semantic bridge between source and target domains. Method: SCANNER has three components: 1) a centroid-guided alignment mechanism that extracts stable cores from ambiguous, shifting hate videos; 2) a sample-level adaptive centroid alignment strategy that reduces the influence of outlier samples on alignment; 3) an intra-cluster diversity regularization that prevents semantic collapse of cluster outputs and encourages semantic richness. Result: On multiple HVD benchmarks, SCANNER outperforms all baselines, with an average Macro-F1 gain of 4.69%. Conclusion: SCANNER shows that exploiting stable semantic cores is effective for test-time adaptation and offers a new approach to detecting dynamically evolving malicious content, especially in real scenarios with severe distribution shift. Abstract: Hate Video Detection (HVD) is crucial for online ecosystems. Existing methods assume identical distributions between training (source) and inference (target) data. However, hateful content often evolves into irregular and ambiguous forms to evade censorship, resulting in substantial semantic drift and rendering previously trained models ineffective. Test-Time Adaptation (TTA) offers a solution by adapting models during inference to narrow the cross-domain gap, while conventional TTA methods target mild distribution shifts and struggle with the severe semantic drift in HVD. To tackle these challenges, we propose SCANNER, the first TTA framework tailored for HVD. Motivated by the insight that, despite the evolving nature of hateful manifestations, their underlying cores remain largely invariant (i.e., targeting is still based on characteristics like gender, race, etc), we leverage these stable cores as a bridge to connect the source and target domains. Specifically, SCANNER initially reveals the stable cores from the ambiguous layout in evolving hateful content via a principled centroid-guided alignment mechanism. To alleviate the impact of outlier-like samples that are weakly correlated with centroids during the alignment process, SCANNER enhances the prior by incorporating a sample-level adaptive centroid alignment strategy, promoting more stable adaptation. Furthermore, to mitigate semantic collapse from overly uniform outputs within clusters, SCANNER introduces an intra-cluster diversity regularization that encourages the cluster-wise semantic richness. 
Experiments show that SCANNER outperforms all baselines, with an average gain of 4.69% in Macro-F1 over the best baseline.
[196] LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
Pengcheng Zheng,Chaoning Zhang,Jiarong Mo,GuoHui Li,Jiaquan Zhang,Jiahao Zhang,Sihan Cao,Sheng Zheng,Caiyan Qin,Guoqing Wang,Yang Yang
Main category: cs.CV
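TL;DR: This paper proposes LLaVA-FA, a new method that performs low-rank decomposition and quantization jointly in the frequency domain to compress large multimodal models (LMMs) efficiently, substantially reducing compute and memory cost while improving accuracy.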
Details
Motivation: Existing compression methods usually decouple low-rank decomposition from quantization, which compounds reconstruction error and performs poorly in multimodal architectures with cross-modal redundancy; they also depend on large-scale calibration data. Method: Propose LLaVA-FA, a joint low-rank plus quantization approximation framework in the frequency domain; exploit the de-correlation and conjugate symmetry properties of the Fourier transform to obtain better weight representations; design PolarQuant, a polar-coordinate quantization method for complex matrices; and introduce an optional diagonal calibration (ODC) scheme that does not need large-scale calibration data. Result: Outperforms existing efficient multimodal models on multiple benchmarks while keeping activated parameters and computational cost very low. Conclusion: LLaVA-FA is an effective, compact, and practical compression scheme for large multimodal models and offers a new option for practical deployment. Abstract: Large multimodal models (LMMs) have achieved impressive performance on various vision-language tasks, but their substantial computational and memory costs hinder their practical deployment. Existing compression methods often decouple low-rank decomposition and quantization, leading to compounded reconstruction errors, especially in multimodal architectures with cross-modal redundancy. To address this issue, we propose LLaVA-FA, a novel efficient LMM that performs joint low-rank plus quantization approximation in the frequency domain. By leveraging the de-correlation and conjugate symmetry properties of Fourier transform, LLaVA-FA achieves more compact and accurate weight representations. Furthermore, we introduce PolarQuant, a polar-coordinate quantization method tailored for complex matrices, and an optional diagonal calibration (ODC) scheme that eliminates the need for large-scale calibration data. Extensive experimental results demonstrate that our proposed LLaVA-FA outperforms existing efficient multimodal models across multiple benchmarks while maintaining minimal activated parameters and low computational costs, validating its effectiveness as a powerful solution for compressing LMMs.
[197] Scalable Analytic Classifiers with Associative Drift Compensation for Class-Incremental Learning of Vision Transformers
Xuan Rao,Mingming Ha,Bo Zhao,Derong Liu,Cesare Alippi
Main category: cs.CV
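TL;DR: This paper proposes an efficient class-incremental learning method for ViTs that combines LR-RGDA, a low-rank factorized RGDA classifier, with HopDC, a training-free Hopfield-based distribution compensation mechanism, substantially reducing computational complexity while maintaining high accuracy.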
Details
Motivation: Existing ViT-based class-incremental learning relies on computationally expensive iterative SGD in the classifier reconstruction phase; RGDA is Bayes-optimal, but its quadratic inference complexity makes it hard to scale to large settings. Method: Propose Low-Rank Factorized RGDA (LR-RGDA), which uses the Woodbury identity to decompose the discriminant function into a global affine term plus a low-rank quadratic perturbation; and design a training-free Hopfield-based Distribution Compensator (HopDC) that recalibrates historical class statistics through associative memory dynamics. Result: Achieves state-of-the-art performance on multiple CIL benchmarks and substantially reduces inference complexity (from O(Cd²) to O(d² + Crd²), r≪d) without additional training. Conclusion: LR-RGDA with HopDC offers an accurate, efficient, and scalable approach to large-scale class-incremental learning with ViTs. Abstract: Class-incremental learning (CIL) with Vision Transformers (ViTs) faces a major computational bottleneck during the classifier reconstruction phase, where most existing methods rely on costly iterative stochastic gradient descent (SGD). We observe that analytic Regularized Gaussian Discriminant Analysis (RGDA) provides a Bayes-optimal alternative with accuracy comparable to SGD-based classifiers; however, its quadratic inference complexity limits its use in large-scale CIL scenarios. To overcome this, we propose Low-Rank Factorized RGDA (LR-RGDA), a scalable classifier that combines RGDA's expressivity with the efficiency of linear classifiers. By exploiting the low-rank structure of the covariance via the Woodbury matrix identity, LR-RGDA decomposes the discriminant function into a global affine term refined by a low-rank quadratic perturbation, reducing the inference complexity from $\mathcal{O}(Cd^2)$ to $\mathcal{O}(d^2 + Crd^2)$, where $C$ is the class number, $d$ the feature dimension, and $r \ll d$ the subspace rank. To mitigate representation drift caused by backbone updates, we further introduce Hopfield-based Distribution Compensator (HopDC), a training-free mechanism that uses modern continuous Hopfield Networks to recalibrate historical class statistics through associative memory dynamics on unlabeled anchors, accompanied by a theoretical bound on the estimation error. Extensive experiments on diverse CIL benchmarks demonstrate that our framework achieves state-of-the-art performance, providing a scalable solution for large-scale class-incremental learning with ViTs. 
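To make the Woodbury step in this abstract concrete, here is a minimal sketch of a Gaussian discriminant score under a low-rank-plus-isotropic covariance. The parameterization Σ = σ²I + Σ_k λ_k u_k u_kᵀ with orthonormal u_k is an assumption chosen for simplicity and may differ from the paper's; the function names are hypothetical. With orthonormal u_k, the Woodbury identity reduces to Σ⁻¹ = (I − Σ_k w_k u_k u_kᵀ)/σ² with w_k = λ_k/(σ² + λ_k), so the Mahalanobis term is an isotropic distance minus a rank-r correction.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mahalanobis_low_rank(x, mean, basis, eigvals, sigma2):
    """Squared Mahalanobis distance under Sigma = sigma2*I + sum_k eigvals[k]*u_k u_k^T.

    `basis` holds r orthonormal vectors u_k (an assumption of this sketch).
    Only r inner products of length d are needed, so the cost is O(r*d)
    and no d-by-d matrix is ever formed.
    """
    diff = [xi - mi for xi, mi in zip(x, mean)]
    isotropic = dot(diff, diff)
    correction = sum(
        lam / (sigma2 + lam) * dot(u, diff) ** 2
        for u, lam in zip(basis, eigvals)
    )
    return (isotropic - correction) / sigma2


def low_rank_gaussian_score(x, mean, basis, eigvals, sigma2, log_prior=0.0):
    """Log-density style discriminant score (up to a constant shared by all classes)."""
    maha = mahalanobis_low_rank(x, mean, basis, eigvals, sigma2)
    # log|Sigma| = d*log(sigma2) + sum_k log(1 + eigvals[k]/sigma2)
    log_det = len(x) * math.log(sigma2) + sum(
        math.log1p(lam / sigma2) for lam in eigvals
    )
    return -0.5 * (maha + log_det) + log_prior


if __name__ == "__main__":
    # Sigma = I + 3 * u u^T with u = (0.6, 0.8): variance 4 along u, 1 across it.
    print(mahalanobis_low_rank([3.0, 4.0], [0.0, 0.0], [[0.6, 0.8]], [3.0], 1.0))
```

If σ² is shared across classes, the ‖x‖² part of the isotropic term is the same for every class, so what remains is affine in x plus the rank-r quadratic correction. That mirrors the "global affine term refined by a low-rank quadratic perturbation" wording, although whether it matches the paper's exact decomposition is not confirmed by the abstract.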
Code: https://github.com/raoxuan98-hash/lr_rgda_hopdc.
[198] DensiThAI, A Multi-View Deep Learning Framework for Breast Density Estimation using Infrared Images
Siva Teja Kakileti,Geetha Manjunath
Main category: cs.CV
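TL;DR: This paper proposes DensiThAI, a breast density classification method based on multi-view infrared thermal imaging and deep learning, aiming to provide density assessment without ionizing radiation. Validated on a multi-center dataset of 3,500 women, it reaches a mean AUROC of 0.73 with statistically significant separation between density classes (p << 0.05) and consistent performance across age cohorts.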
Details
Motivation: Breast tissue density is a key biomarker of breast cancer risk and a major factor affecting mammographic sensitivity, but its assessment currently relies on X-ray mammography (ionizing radiation), so a safe, non-invasive alternative is needed. Method: Propose DensiThAI, a multi-view deep learning framework that uses five standard infrared thermal views and is trained and evaluated with mammography-derived density labels as supervision. Result: On a multi-center dataset of 3,500 women, the mean AUROC over 10 random splits is 0.73, with statistically significant separation between density classes in every split (p << 0.05) and consistent performance across age cohorts. Conclusion: Infrared thermal imaging combined with AI is feasible as a non-ionizing tool for breast density assessment and may improve patient experience and clinical workflow. Abstract: Breast tissue density is a key biomarker of breast cancer risk and a major factor affecting mammographic sensitivity. However, density assessment currently relies almost exclusively on X-ray mammography, an ionizing imaging modality. This study investigates the feasibility of estimating breast density using artificial intelligence over infrared thermal images, offering a non-ionizing imaging approach. The underlying hypothesis is that fibroglandular and adipose tissues exhibit distinct thermophysical and physiological properties, leading to subtle but spatially coherent temperature variations on the breast surface. In this paper, we propose DensiThAI, a multi-view deep learning framework for breast density classification from thermal images. The framework was evaluated on a multi-center dataset of 3,500 women using mammography-derived density labels as reference. Using five standard thermal views, DensiThAI achieved a mean AUROC of 0.73 across 10 random splits, with statistically significant separation between density classes across all splits (p << 0.05). Consistent performance across age cohorts supports the potential of thermal imaging as a non-ionizing approach for breast density assessment with implications for improved patient experience and workflow optimization.
[199] Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields
Shiqian Li,Ruihong Shen,Junfeng Ni,Chang Pan,Chi Zhang,Yixin Zhu
Main category: cs.CV
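TL;DR: This paper proposes Neural Gaussian Force Field (NGFF), an end-to-end neural framework that combines 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, running two orders of magnitude faster than prior Gaussian simulators.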
Details
Motivation: Existing video generation models do not model physical laws and struggle to generate physically plausible videos; approaches that combine 3D Gaussian splatting with physics engines are feasible but computationally expensive and not robust. Method: Propose the NGFF framework, which integrates 3D Gaussian perception with physics-based dynamic modeling, and build GSCollision, a large-scale 4D Gaussian dataset (over 640k physical videos, about 4 TB) to support training. Result: NGFF shows strong generalization and robustness in physical reasoning on synthetic and real 3D scenes, and runs about 100 times faster than prior Gaussian simulators. Conclusion: NGFF moves video prediction toward physics-grounded world models while balancing efficiency, realism, and robustness. Abstract: Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos due to a lack of modeling of physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered by high computational costs in both reconstruction and simulation, and often lack robustness in complex real-world scenarios. To address these issues, we introduce Neural Gaussian Force Field (NGFF), an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, running two orders of magnitude faster than prior Gaussian simulators. To support training, we also present GSCollision, a 4D Gaussian dataset featuring diverse materials, multi-object interactions, and complex scenes, totaling over 640k rendered physical videos (~4 TB). Evaluations on synthetic and real 3D scenarios show NGFF's strong generalization and robustness in physical reasoning, advancing video prediction towards physics-grounded world models.
[200] SDCM: Simulated Densifying and Compensatory Modeling Fusion for Radar-Vision 3-D Object Detection in Internet of Vehicles
Shucong Li,Xiaoluo Zhou,Yuqian He,Zhenyu Liu
Main category: cs.CV
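TL;DR: This paper proposes the SDCM framework, which uses Simulated Densifying (SimDen), Radar Compensatory Mapping (RCM), and Mamba Modeling Interactive Fusion (MMIF) modules to address sparse radar point clouds and degraded visual representations in 4D radar-vision fusion for 3D object detection, achieving better performance with fewer parameters and faster inference on several datasets.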
Details
Motivation: 4D radar point clouds are sparse, which leads to poor 3D representation; vision data degrades under low light, at long range, and under heavy occlusion, which makes fusion unreliable. Method: Propose the SDCM framework: 1) the SimDen module generates dense radar point clouds using Gaussian simulation of key points obtained from 3D kernel density estimation together with curvature simulation; 2) the RCM module uses the all-weather advantage of radar to compensate for degraded vision; 3) the MMIF module models feature tensor difference values to reduce heterogeneity and fuse the modalities interactively. Result: On the VoD, TJ4DRadSet, and Astyx HiRes 2019 datasets, SDCM achieves the best performance with fewer parameters and faster inference. Conclusion: SDCM mitigates radar sparsity and vision degradation, improving the robustness and efficiency of radar-vision 3D detection in the Internet of Vehicles (IoV). Abstract: 3-D object detection based on 4-D radar-vision is an important part of the Internet of Vehicles (IoV). However, two challenges need to be faced. First, the 4-D radar point clouds are sparse, leading to poor 3-D representation. Second, vision data exhibit representation degradation in low-light, long-distance detection, and dense occlusion scenes, which provides unreliable texture information during the fusion stage. To address these issues, a framework named SDCM is proposed, which contains Simulated Densifying and Compensatory Modeling Fusion for radar-vision 3-D object detection in IoV. Firstly, considering point generation based on Gaussian simulation of key points obtained from 3-D Kernel Density Estimation (3-D KDE), and outline generation based on curvature simulation, a Simulated Densifying (SimDen) module is designed to generate dense radar point clouds. Secondly, considering that radar data could provide more real-time information than vision data, due to the all-weather property of 4-D radar, a Radar Compensatory Mapping (RCM) module is designed to reduce the effects of the vision data's representation degradation. Thirdly, considering that feature tensor difference values contain the effective information of every modality, which could be extracted and modeled for heterogeneity reduction and modality interaction, a Mamba Modeling Interactive Fusion (MMIF) module is designed to reduce heterogeneity and achieve interactive fusion. Experimental results on the VoD, TJ4DRadSet and Astyx HiRes 2019 datasets show that SDCM achieves the best performance with fewer parameters and faster inference speed. 
Our code will be available.
[201] Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency
Alexander Blezinger,Wolfgang Nejdl,Ming Tang
Main category: cs.CV
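TL;DR: This paper systematically evaluates histopathological foundation models on regression tasks (such as HRD score prediction), finds that they clearly outperform contrastive learning-based features, and proposes a distribution-based upsampling strategy to address data imbalance.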
Details
Motivation: Foundation models are widely used in computational pathology, but their potential for regression-based biomarker prediction has not been fully explored. Method: Within multiple instance learning frameworks, use five state-of-the-art foundation models to extract patch-level features from whole slide images and, together with a distribution-based upsampling strategy and ablation studies, train models to predict continuous HRD scores. Result: Models trained on foundation model features consistently outperform the baseline in predictive accuracy and generalization; the foundation models differ systematically from one another; the proposed upsampling strategy significantly improves recall and balanced accuracy for underrepresented clinical populations. Conclusion: Large-scale histopathological pretraining supports more precise and transferable regression-based biomarker prediction and advances AI-driven precision oncology. Abstract: Foundation models pretrained on large-scale histopathology data have found great success in various fields of computational pathology, but their impact on regressive biomarker prediction remains underexplored. In this work, we systematically evaluate histopathological foundation models for regression-based tasks, demonstrated through the prediction of homologous recombination deficiency (HRD) score - a critical biomarker for personalized cancer treatment. Within multiple instance learning frameworks, we extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models, and evaluate their impact compared to contrastive learning-based features. Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts from two public medical data collections. Extensive experiments demonstrate that models trained on foundation model features consistently outperform the baseline in terms of predictive accuracy and generalization capabilities while exhibiting systematic differences among the foundation models. Additionally, we propose a distribution-based upsampling strategy to mitigate target imbalance in these datasets, significantly improving the recall and balanced accuracy for underrepresented but clinically important patient populations. Furthermore, we investigate the impact of different sampling strategies and instance bag sizes through ablation studies. 
Our results highlight the benefits of large-scale histopathological pretraining for more precise and transferable regressive biomarker prediction, showcasing its potential to advance AI-driven precision oncology.
[202] Real-Time Human Activity Recognition on Edge Microcontrollers: Dynamic Hierarchical Inference with Multi-Spectral Sensor Fusion
Boyu Li,Kuangji Zuo,Lincong Li,Yonghui Wu
Main category: cs.CV
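TL;DR: This paper proposes HPPI-Net, a resource-aware hierarchical network for real-time human activity recognition (HAR) on edge devices. On an ARM Cortex-M4 it balances high accuracy (96.70%) with a very small memory footprint (22.3 KiB RAM), and it offers multi-spectral fusion and interpretability.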
Details
Motivation: Demand for accurate, real-time human activity recognition on edge devices is growing, but existing methods struggle to reconcile accuracy with limited computational resources. Method: Propose HPPI-Net, a two-layer hierarchy: the first layer extracts preliminary features from FFT spectrograms; the second layer selects, according to activity type, either a dedicated module or a parallel LSTM-MobileNet network (PLMN). PLMN fuses FFT, wavelet, and Gabor spectrograms through three parallel LSTM encoders, ECA attention, and depthwise separable convolution for efficient, interpretable feature fusion. Result: On an ARM Cortex-M4, it reaches 96.70% accuracy using only 22.3 KiB of RAM and 439.5 KiB of ROM; compared with MobileNetV3, accuracy improves by 1.22%, RAM usage falls by 71.2%, and ROM usage falls by 42.1%. Conclusion: HPPI-Net achieves a good trade-off among accuracy, efficiency, and interpretability on resource-constrained edge platforms and suits wearable, industrial, and smart home HAR. Abstract: The demand for accurate on-device pattern recognition in edge applications is intensifying, yet existing approaches struggle to reconcile accuracy with computational constraints. To address this challenge, a resource-aware hierarchical network based on multi-spectral fusion and interpretable modules, namely the Hierarchical Parallel Pseudo-image Enhancement Fusion Network (HPPI-Net), is proposed for real-time, on-device Human Activity Recognition (HAR). Deployed on an ARM Cortex-M4 microcontroller for low-power real-time inference, HPPI-Net achieves 96.70% accuracy while utilizing only 22.3 KiB of RAM and 439.5 KiB of ROM after optimization. HPPI-Net employs a two-layer architecture. The first layer extracts preliminary features using Fast Fourier Transform (FFT) spectrograms, while the second layer selectively activates either a dedicated module for stationary activity recognition or a parallel LSTM-MobileNet network (PLMN) for dynamic states. PLMN fuses FFT, Wavelet, and Gabor spectrograms through three parallel LSTM encoders and refines the concatenated features using Efficient Channel Attention (ECA) and Depthwise Separable Convolution (DSC), thereby offering channel-level interpretability while substantially reducing multiply-accumulate operations. Compared with MobileNetV3, HPPI-Net improves accuracy by 1.22% and reduces RAM usage by 71.2% and ROM usage by 42.1%. 
These results demonstrate that HPPI-Net achieves a favorable accuracy-efficiency trade-off and provides explainable predictions, establishing a practical solution for wearable, industrial, and smart home HAR on memory-constrained edge platforms.
[203] See Without Decoding: Motion-Vector-Based Tracking in Compressed Video
Axel Duché,Clément Chatelain,Gilles Gasso
Main category: cs.CV
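TL;DR: This paper proposes a lightweight compressed-domain object tracking model that runs directly on the video bitstream without fully decoding RGB video. It uses motion vectors and transform coefficients to propagate bounding boxes across frames, achieving up to a 3.7x computational speed-up on the MOTS15/17/20 datasets with only a 4% drop in mAP@0.5.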
Details
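Motivation: To enable efficient video analytics in real-time, large-scale monitoring systems by avoiding the high computational cost of full-frame RGB decoding. Method: Design a lightweight deep learning model that works in the compressed domain (motion vectors and DCT/transform coefficients) and propagates bounding boxes directly on encoded streams such as H.264/H.265. Result: On the MOTS15/17/20 datasets, it is up to 3.7 times faster than the RGB baseline, with mAP@0.5 dropping by only 4%. Conclusion: Compressed-domain modeling can substantially improve the efficiency of video analytics and is a viable path for real-time, low-power edge monitoring systems. Abstract: We propose a lightweight compressed-domain tracking model that operates directly on video streams, without requiring full RGB video decoding. Using motion vectors and transform coefficients from compressed data, our deep model propagates object bounding boxes across frames, achieving a computational speed-up of up to 3.7x with only a slight 4% mAP@0.5 drop vs. the RGB baseline on the MOTS15/17/20 datasets. These results highlight the efficiency of codec-domain motion modeling for real-time analytics in large monitoring systems.
[204] Deep Learning Pose Estimation for Multi-Label Recognition of Combined Hyperkinetic Movement Disorders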
Laura Cif,Diane Demailly,Gabriella A. Horvàth,Juan Dario Ortigoza Escobar,Nathalie Dorison,Mayté Castro Jiménez,Cécile A. Hubsch,Thomas Wirth,Gun-Marie Hariz,Sophie Huby,Morgan Dornadic,Zohra Souei,Muhammad Mushhood Ur Rehman,Simone Hemm,Mehdi Boulayme,Eduardo M. Moraud,Jocelyne Bloch,Xavier Vasques
Main category: cs.CV
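TL;DR: This paper proposes a pose-estimation-based machine learning framework that automatically extracts anatomical keypoint time series from routine outpatient videos and computes multi-dimensional kinematic features, in order to distinguish overlapping hyperkinetic movement disorder phenotypes in children and adults objectively and at scale.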
Details
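Motivation: Hyperkinetic movement disorders (HMDs) have fluctuating, intermittent, and frequently co-occurring clinical presentations, so subjective assessment is vulnerable to inter-rater variability, and objective, scalable video analysis methods are lacking. Method: Develop a pose-estimation-based machine learning framework that converts outpatient videos into anatomical keypoint time series and extracts multi-dimensional kinematic features covering statistical, temporal, spectral, and higher-order irregularity-complexity descriptors. Result: The framework enables objective, automated differentiation of multiple overlapping HMD phenotypes (such as dystonia, tremor, and chorea) from routine clinical videos. Conclusion: The framework offers a new tool for non-invasive, scalable, longitudinal monitoring of HMDs and may improve the accuracy of clinical recognition and the consistency of assessment. Abstract: Hyperkinetic movement disorders (HMDs) such as dystonia, tremor, chorea, myoclonus, and tics are disabling motor manifestations across childhood and adulthood. Their fluctuating, intermittent, and frequently co-occurring expressions hinder clinical recognition and longitudinal monitoring, which remain largely subjective and vulnerable to inter-rater variability. Objective and scalable methods to distinguish overlapping HMD phenotypes from routine clinical videos are still lacking. Here, we developed a pose-based machine-learning framework that converts standard outpatient videos into anatomically meaningful keypoint time series and computes kinematic descriptors spanning statistical, temporal, spectral, and higher-order irregularity-complexity features.
[205] YOLOE-26: Integrating YOLO26 with YOLOE for Real-Time Open-Vocabulary Instance Segmentation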
Ranjan Sapkota,Manoj Karkee
Main category: cs.CV
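TL;DR: YOLOE-26 is a unified framework for real-time open-vocabulary instance segmentation that combines the deployment-efficient YOLOv26 architecture with the open-vocabulary learning paradigm of YOLOE, supporting text prompts, visual example guidance, and prompt-free autonomous segmentation.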
Details
Motivation: To overcome the restriction of conventional YOLO models to closed-set recognition and achieve efficient, deterministic, real-time instance segmentation with open-vocabulary support. Method: Build on the NMS-free, end-to-end YOLOv26 architecture and replace fixed class logits with an object embedding head; combine RepRTA (zero-overhead text alignment), SAVPE (Semantic-Activated Visual Prompt Encoder), and Lazy Region Prompt Contrast to support multiple prompting modes; all prompts share one unified embedding space. Result: Shows favorable accuracy-efficiency trade-offs and consistent scaling behavior across model sizes; compatible with the Ultralytics ecosystem and supports multi-task training on large-scale detection and grounding datasets. Conclusion: YOLOE-26 provides a practical, scalable solution for real-time open-vocabulary instance segmentation in dynamic real-world scenes. Abstract: This paper presents YOLOE-26, a unified framework that integrates the deployment-optimized YOLO26 (or YOLOv26) architecture with the open-vocabulary learning paradigm of YOLOE for real-time open-vocabulary instance segmentation. Building on the NMS-free, end-to-end design of YOLOv26, the proposed approach preserves the hallmark efficiency and determinism of the YOLO family while extending its capabilities beyond closed-set recognition. YOLOE-26 employs a convolutional backbone with PAN/FPN-style multi-scale feature aggregation, followed by end-to-end regression and instance segmentation heads. A key architectural contribution is the replacement of fixed class logits with an object embedding head, which formulates classification as similarity matching against prompt embeddings derived from text descriptions, visual examples, or a built-in vocabulary. To enable efficient open-vocabulary reasoning, the framework incorporates Re-Parameterizable Region-Text Alignment (RepRTA) for zero-overhead text prompting, a Semantic-Activated Visual Prompt Encoder (SAVPE) for example-guided segmentation, and Lazy Region Prompt Contrast for prompt-free inference. All prompting modalities operate within a unified object embedding space, allowing seamless switching between text-prompted, visual-prompted, and fully autonomous segmentation. Extensive experiments demonstrate consistent scaling behavior and favorable accuracy-efficiency trade-offs across model sizes in both prompted and prompt-free settings. 
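The idea of classifying by similarity against prompt embeddings, instead of through fixed class logits, can be sketched in a few lines. This is a generic illustration, not the YOLOE-26 head: cosine similarity, the `min_similarity` cutoff, and the function names are assumptions, and the tiny 2D vectors stand in for real text or visual prompt embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def classify_by_prompt(object_embedding, prompt_embeddings, min_similarity=0.0):
    """Return (label, similarity) for the best-matching prompt.

    `prompt_embeddings` maps a label to its embedding vector. The label is
    None when no prompt reaches `min_similarity` (a hypothetical cutoff).
    """
    best_label, best_score = None, -1.0
    for label, embedding in prompt_embeddings.items():
        score = cosine(object_embedding, embedding)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_similarity:
        return None, best_score
    return best_label, best_score


if __name__ == "__main__":
    prompts = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
    print(classify_by_prompt([0.9, 0.1], prompts))
    # The vocabulary can grow at inference time without retraining a logit layer.
    prompts["bird"] = [-1.0, 0.0]
    print(classify_by_prompt([-0.8, 0.1], prompts))
```

Because the label set lives in the prompt dictionary and not in the output layer, adding or removing a category changes only the set of embeddings being compared, which is what makes this formulation open-vocabulary.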
The training strategy leverages large-scale detection and grounding datasets with multi-task optimization and remains fully compatible with the Ultralytics ecosystem for training, validation, and deployment. Overall, YOLOE-26 provides a practical and scalable solution for real-time open-vocabulary instance segmentation in dynamic, real-world environments.[206] Intra-Class Subdivision for Pixel Contrastive Learning: Application to Semi-supervised Cardiac Image Segmentation
Jiajun Zhao,Xuan Yang
Main category: cs.CV
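TL;DR: This paper proposes an intra-class subdivision pixel contrastive learning (SPCL) framework for cardiac image segmentation. By introducing the concept of an "unconcerned sample" and a boundary contrastive loss, it mitigates representation contamination in boundary regions and improves segmentation accuracy and boundary localization.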
Details
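Motivation: Boundary regions in cardiac image segmentation suffer from representation contamination; the model needs to better distinguish pixel representations within a class, especially between boundary and interior pixels. Method: An intra-class subdivision pixel contrastive learning (SPCL) framework introduces "unconcerned samples" to separate interior and boundary pixel representations of the same class, and a boundary contrastive loss strengthens representation discrimination across boundaries. Result: Experiments on public cardiac datasets show that SPCL significantly improves segmentation quality and boundary precision and outperforms existing methods. Conclusion: By modeling intra-class structure in finer detail (boundary characteristics in particular), SPCL mitigates representation contamination and offers a new approach to fine-grained representation learning in medical image segmentation. Abstract: We propose an intra-class subdivision pixel contrastive learning (SPCL) framework for cardiac image segmentation to address representation contamination at boundaries. The novel concept of the "unconcerned sample" is proposed to distinguish pixel representations at the inner and boundary regions within the same class, facilitating a clearer characterization of intra-class variations. A novel boundary contrastive loss for boundary representations is proposed to enhance representation discrimination across boundaries. The advantages of the unconcerned sample and boundary contrastive loss are analyzed theoretically. Experimental results on public cardiac datasets demonstrate that SPCL significantly improves segmentation performance, outperforming existing methods with respect to segmentation quality and boundary precision. Our code is available at https://github.com/Jrstud203/SPCL.[207] Stabilizing Diffusion Posterior Sampling by Noise-Frequency Continuation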
Feng Tian,Yixuan Li,Weili Zeng,Weitian Zhang,Yichao Yan,Xiaokang Yang
Main category: cs.CV
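TL;DR: This paper proposes a noise-frequency continuation framework that limits the frequency bandwidth of data-consistency guidance according to the noise level, improving the reconstruction quality of diffusion posterior sampling on inverse problems, particularly the recovery of fine details.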
Details
Motivation: In existing diffusion posterior sampling methods, the measurement-consistency term is only weakly coupled to the diffusion noise level, which leads to inaccurate gradients at high noise, geometric mismatch, early-step drift, high-frequency artifacts, and sensitivity to sampling schedules and ill-conditioned operators. Method: A noise-frequency continuation framework constructs a family of intermediate posteriors whose likelihood enforces measurement consistency only within a noise-dependent frequency band; a stabilized posterior sampler combines a diffusion predictor, band-limited likelihood guidance, and a multi-resolution consistency strategy (aggressive correction at coarse scales, conservative adoption at fine scales). Result: State-of-the-art results on super-resolution, inpainting, and deblurring, with motion-deblurring PSNR improved by up to 5 dB. Conclusion: Decoupling noise and frequency, together with multi-resolution consistency, mitigates geometric mismatch and high-frequency distortion in diffusion-based inverse problems and improves reconstruction fidelity and robustness. Abstract: Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance, but it often fails to recover fine details because measurement terms are applied in a manner that is weakly coupled to the diffusion noise level. At high noise, data-consistency gradients computed from inaccurate estimates can be geometrically incongruent with the posterior geometry, inducing early-step drift, spurious high-frequency artifacts, and sensitivity to schedules and ill-conditioned operators. To address these concerns, we propose a noise-frequency continuation framework that constructs a continuous family of intermediate posteriors whose likelihood enforces measurement consistency only within a noise-dependent frequency band. This principle is instantiated with a stabilized posterior sampler that combines a diffusion predictor, band-limited likelihood guidance, and a multi-resolution consistency strategy that aggressively commits reliable coarse corrections while conservatively adopting high-frequency details only when they become identifiable. Across super-resolution, inpainting, and deblurring, our method achieves state-of-the-art performance and improves motion deblurring PSNR by up to 5 dB over strong baselines.[208] CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
Hang Wu,Yujun Cai,Zehao Li,Haonan Ge,Bowen Sun,Junsong Yuan,Yiwei Wang
Main category: cs.CV
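TL;DR: This paper proposes CamReasoner, a framework that recasts camera movement understanding as a structured reasoning process following an Observation-Thinking-Answer (O-T-A) paradigm. It combines a large-scale reasoning-trajectory dataset with reinforcement learning (RL), applied to this task for the first time, for logical alignment, improving geometric consistency and reasoning accuracy.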
Details
Motivation: Existing multimodal models treat camera movement understanding as black-box classification, relying on superficial visual patterns rather than geometric cues and therefore confusing physically distinct motions. Method: An O-T-A reasoning paradigm with an explicit reasoning block that decodes spatio-temporal cues (such as trajectories and view frustums); a large-scale inference trajectory suite with 18k SFT chains and 38k RL feedback samples; and the first use of RL in this task to align motion reasoning with physical geometry. Result: Hallucinations are effectively suppressed and SOTA performance is reached on multiple benchmarks. Conclusion: Through structured reasoning and RL-driven geometric alignment, CamReasoner improves the accuracy and interpretability of camera dynamics understanding in video spatial intelligence. Abstract: Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to decode spatio-temporal cues such as trajectories and view frustums within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. Notably, we are the first to employ RL for logical alignment in this domain, ensuring motion inferences are grounded in physical geometry rather than contextual guesswork. By applying Reinforcement Learning to the Observation-Thinking-Answer (O-T-A) reasoning paradigm, CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks.[209] AI-Generated Image Detectors Overrely on Global Artifacts: Evidence from Inpainting Exchange
Elif Nebioglu,Emirhan Bilgiç,Adrian Popescu
Main category: cs.CV
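TL;DR: This paper shows that current deep learning-based inpainting detectors rely mainly on global artifacts rather than on locally synthesized content, and proposes the Inpainting Exchange (INP-X) operation to isolate the spectral shift caused by VAE reconstruction. Existing detectors degrade sharply under the INP-X intervention, and a theoretical analysis traces the cause to high-frequency attenuation induced by the VAE information bottleneck. The study argues for content-aware detection and releases a new 90K-sample dataset together with code.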
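The abstract for this entry describes INP-X as restoring original pixels outside the edited region while preserving all synthesized content. A minimal sketch of such an exchange follows, assuming images are 2D lists of pixel values and a binary mask marks the edited region; the function name and data layout are illustrative assumptions, not the authors' released code.

```python
def inpainting_exchange(original, inpainted, mask):
    """Keep inpainted pixels where mask == 1 and original pixels elsewhere.

    All three arguments are 2D lists of the same shape; mask holds 0 or 1.
    """
    return [
        [inp if m == 1 else org for org, inp, m in zip(o_row, i_row, m_row)]
        for o_row, i_row, m_row in zip(original, inpainted, mask)
    ]

original = [[10, 10], [10, 10]]
inpainted = [[12, 99], [11, 13]]  # toy case: unedited pixels were also perturbed
mask = [[0, 1], [0, 0]]           # only the top-right pixel was edited
exchanged = inpainting_exchange(original, inpainted, mask)
print(exchanged)  # [[10, 99], [10, 10]]
```

After the exchange, any difference from the original image is confined to the masked region, so a detector can no longer rely on changes spread over the unedited pixels.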
Details
Motivation: Current inpainting detectors rely on global artifacts rather than locally synthesized content, which weakens their ability to detect the actual edited content; more robust, content-sensitive detection is needed. Method: The Inpainting Exchange (INP-X) operation restores original pixels outside the edited region while keeping the edited content, isolating the global spectral shift caused by VAE reconstruction; a 90K-image test set (real, inpainted, and exchanged) is built, and a theoretical analysis relates high-frequency attenuation to the VAE information bottleneck. Result: Pretrained SOTA detectors, including commercial ones, lose substantial accuracy under INP-X (for example, from 91% to 55%), approaching chance level; the theory identifies high-frequency attenuation from the VAE information bottleneck as the main cause; detectors trained on the new dataset generalize and localize better. Conclusion: Existing detectors are not content-aware and are easily misled by global spectral shifts; detection should focus on locally synthesized content; INP-X provides a new benchmark and tool for evaluating and improving detector robustness. Abstract: Modern deep learning-based inpainting enables realistic local image manipulation, raising critical challenges for reliable detection. However, we observe that current detectors primarily rely on global artifacts that appear as inpainting side effects, rather than on locally synthesized content. We show that this behavior occurs because VAE-based reconstruction induces a subtle but pervasive spectral shift across the entire image, including unedited regions. To isolate this effect, we introduce Inpainting Exchange (INP-X), an operation that restores original pixels outside the edited region while preserving all synthesized content. We create a 90K test dataset including real, inpainted, and exchanged images to evaluate this phenomenon. Under this intervention, pretrained state-of-the-art detectors, including commercial ones, exhibit a dramatic drop in accuracy (e.g., from 91% to 55%), frequently approaching chance level. We provide a theoretical analysis linking this behavior to high-frequency attenuation caused by VAE information bottlenecks. Our findings highlight the need for content-aware detection. Indeed, training on our dataset yields better generalization and localization than standard inpainting. Our dataset and code are publicly available at https://github.com/emirhanbilgic/INP-X.[210] Vision-Language Model Purified Semi-Supervised Semantic Segmentation for Remote Sensing Images
Shanwen Wang,Xin Sun,Danfeng Hong,Fei Zhou
Main category: cs.CV
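TL;DR: This paper proposes SemiEarth, which uses vision-language models (VLMs) in a VLM-PP module to purify pseudo-labels, substantially improving the performance and interpretability of semi-supervised semantic segmentation for remote sensing images.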
Details
Motivation: Traditional semi-supervised semantic segmentation methods suffer from low-quality pseudo-labels in remote sensing, especially in multi-class boundary regions; existing teacher-student frameworks struggle to correct errors in low-confidence pseudo-labels. Method: A VLM-based pseudo-label purifying structure (VLM-PP) uses the open-world capability of VLMs to correct low-quality pseudo-labels produced by the teacher network, independently of the backbone architecture, with particular gains in multi-class boundary regions. Result: SOTA performance on multiple remote sensing datasets with good interpretability, surpassing previous SOTA methods. Conclusion: VLM-PP is a general, decoupled pseudo-label refinement mechanism that improves the accuracy and robustness of semi-supervised remote sensing segmentation while making the model more transparent. Abstract: Semi-supervised semantic segmentation (S4) can learn rich visual knowledge from low-cost unlabeled images. However, traditional S4 architectures all face the challenge of low-quality pseudo-labels, especially for the teacher-student framework. We propose a novel SemiEarth model that introduces vision-language models (VLMs) to address the S4 issues for the remote sensing (RS) domain. Specifically, we design a VLM pseudo-label purifying (VLM-PP) structure to purify the teacher network's pseudo-labels, achieving substantial improvements. Especially in multi-class boundary regions of RS images, the VLM-PP module can significantly improve the quality of pseudo-labels generated by the teacher, thereby correctly guiding the student model's learning. Moreover, since VLM-PP equips VLMs with open-world capabilities and is independent of the S4 architecture, it can correct mispredicted categories in low-confidence pseudo-labels whenever a discrepancy arises between its prediction and the pseudo-label. We conducted extensive experiments on multiple RS datasets, which demonstrate that our SemiEarth achieves SOTA performance. More importantly, unlike previous SOTA RS S4 methods, our model not only achieves excellent performance but also offers good interpretability. The code is released at https://github.com/wangshanwen001/SemiEarth.[211] Interpretable Unsupervised Deformable Image Registration via Confidence-bound Multi-Hop Visual Reasoning
Zafar Iqbal,Anwar Ul Haq,Srimannarayana Grandhi
Main category: cs.CV
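TL;DR: This paper proposes Multi-Hop Visual Chain of Reasoning (VCoR), a new unsupervised deformable image registration framework that uses localized spatial refinement and cross-reference attention to make registration progressive and interpretable, with theoretical guarantees and uncertainty estimates.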
Details
Motivation: Existing deep learning methods are accurate but lack transparency, leading to error drift and low clinical trust; interpretable, reliable unsupervised registration is needed. Method: The Multi-Hop Visual Chain of Reasoning (VCoR) framework, in which each hop contains a Localized Spatial Refinement (LSR) module and a Cross-Reference Attention (CRA) mechanism to refine the registration iteratively; uncertainty is estimated from the stability and convergence of the deformation fields. Result: Competitive registration accuracy on DIR-Lab 4D CT (lung) and IXI MRI (brain), together with intermediate visualizations and confidence measures. Conclusion: VCoR delivers accurate, interpretable, reliable, and clinically viable unsupervised medical image registration, offering transparent and trustworthy support for clinical decisions. Abstract: Unsupervised deformable image registration requires aligning complex anatomical structures without reference labels, making interpretability and reliability critical. Existing deep learning methods achieve considerable accuracy but often lack transparency, leading to error drift and reduced clinical trust. We propose a novel Multi-Hop Visual Chain of Reasoning (VCoR) framework that reformulates registration as a progressive reasoning process. Inspired by the iterative nature of clinical decision-making, each visual reasoning hop integrates a Localized Spatial Refinement (LSR) module to enrich feature representations and a Cross-Reference Attention (CRA) mechanism that leads the iterative refinement process, preserving anatomical consistency. This multi-hop strategy enables robust handling of large deformations and produces a transparent sequence of intermediate predictions with a theoretical bound. Beyond accuracy, our framework offers built-in interpretability by estimating uncertainty via the stability and convergence of deformation fields across hops. Extensive evaluations on two challenging public datasets, DIR-Lab 4D CT (lung) and IXI T1-weighted MRI (brain), demonstrate that VCoR achieves competitive registration accuracy while offering rich intermediate visualizations and confidence measures. By embedding an implicit visual reasoning paradigm, we present an interpretable, reliable, and clinically viable unsupervised medical image registration approach.[212] Deep Learning Based CNN Model for Automated Detection of Pneumonia from Chest XRay Images
Sathish Krishna Anumula,Vetrivelan Tamilmani,Aniruddha Arjun Singh,Dinesh Rajendran,Venkata Deepak Namburi
Main category: cs.CV
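TL;DR: This paper proposes an automated pneumonia diagnosis model based on a custom depthwise separable convolutional neural network, with CLAHE and geometric augmentation as preprocessing, achieving high-accuracy pneumonia recognition at low computational cost on 5,863 chest X-rays.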
Details
Motivation: Pneumonia causes high morbidity and mortality worldwide, especially among children and the elderly in resource-poor regions, while traditional manual reading of chest radiographs by radiologists is limited by inter-observer variation, reader fatigue, and a shortage of specialists. Method: A custom depthwise separable convolutional neural network (CNN), with Contrast Limited Adaptive Histogram Equalization (CLAHE) and geometric augmentation as preprocessing to mitigate class imbalance and improve generalization. Result: The model recognizes pneumonia with high accuracy on a dataset of 5,863 anterior-posterior chest X-rays, at markedly lower computational cost than generic transfer learning models. Conclusion: The lightweight custom CNN offers a feasible option for fast, accurate pneumonia diagnosis in resource-limited settings and has potential for clinical use. Abstract: Pneumonia has been one of the major causes of morbidity and mortality in the world, and its prevalence is disproportionately high among pediatric and elderly populations, especially in resource-constrained areas. Fast and precise diagnosis is a prerequisite for successful clinical intervention, but due to inter-observer variation, fatigue among experts, and a shortage of qualified radiologists, traditional approaches that rely on manual interpretation of chest radiographs are frequently constrained. To address these problems, this paper introduces a unified automated diagnostic model using a custom Convolutional Neural Network (CNN) that can recognize pneumonia in chest X-ray images with high precision and at minimal computational expense. In contrast to generic transfer learning-based models, which often possess redundant parameters, the proposed architecture uses a tailor-made depthwise separable convolutional design that is optimized for the textural characteristics of grayscale medical images. Contrast Limited Adaptive Histogram Equalization (CLAHE) and geometric augmentation are two significant preprocessing techniques used to ensure that the system is less affected by class imbalance and is more likely to generalize. The system is tested using a dataset of 5863 anterior-posterior chest X-rays.[213] A Geometric Multimodal Foundation Model Integrating Bp-MRI and Clinical Reports in Prostate Cancer Classification
Juan A. Olmos,Antoine Manzanera,Fabio Martínez
Main category: cs.CV
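TL;DR: This paper proposes MFM-Geom, a multimodal foundation model based on geometric deep learning that fuses bi-parametric MRI with clinical reports, achieving accurate prostate cancer diagnosis with few samples (only 10% of the training data) and validating its generalization on an external dataset.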
Details
Motivation: Existing practice depends on subjective expert reading, and most AI models use imaging data alone, ignore clinical context, and are limited by data scarcity, which makes robust representations hard to learn. Method: A geometric multimodal foundation model, MFM-Geom, jointly encodes bp-MRI images and clinical text; the classification head uses symmetric positive definite (SPD) matrices and Riemannian deep learning to fuse imaging-text representations. Result: With only 10% of the training data, AUC-PR reaches 90.67, an 8.3% improvement over the baseline class-token embedding approach; on an external cohort AUC-PR is 90.6, supporting generalization. Conclusion: By fusing multimodal clinical information through geometric modeling, MFM-Geom improves PCa diagnostic performance and model robustness when data is limited. Abstract: Prostate cancer (PCa) is one of the most common cancers in men worldwide. Bi-parametric MRI (bp-MRI) and clinical variables are crucial for PCa identification and improving treatment decisions. However, this process is subject to expert interpretation. Furthermore, most existing computer-aided diagnosis methods focus on imaging-based models, overlooking the clinical context and suffering from data scarcity, limiting their ability to learn robust representations. We propose a geometric multimodal Foundation Model (FM), named MFM-Geom, that learns representations from bp-MRI and clinical reports, encoding visual findings and information from the context of clinical variables. In the classification head, the approach leverages symmetric positive definite (SPD) matrices and Riemannian deep learning to integrate imaging-text representations from a biomedical multimodal FM. Using 10% of the training data, MFM-Geom outperformed baseline class token embedding-based classification (+8.3%, AUC-PR of 90.67). Generalization on an external dataset confirmed the robustness of fine-tuning the biomedical FM, achieving an AUC-PR of 90.6.[214] Development of a Cacao Disease Identification and Management App Using Deep Learning
Zaldy Pagaduan,Jason Occidental,Nathaniel Duro,Dexielito Badilles,Eleonor Palconit
Main category: cs.CV
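TL;DR: This study develops an offline mobile application that uses deep learning models to identify cacao diseases (96.93% accuracy) and black pod infection levels (79.49%). Field testing showed 84.2% agreement with expert assessments. The app is intended to improve cacao crop management for smallholder farmers in the Philippines.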
Details
Motivation: Smallholder farmers in the Philippines lack access to data, information, and good agricultural practices, rely on outdated farming techniques, and are vulnerable to pests and diseases, whereas large plantations have more resources. Method: Deep learning models for cacao disease identification and black pod infection level detection are built and integrated into a mobile application that runs offline, so farmers in remote areas can diagnose in the field. Result: The disease identification model reaches 96.93% validation accuracy and the black pod infection level model 79.49%; field testing shows 84.2% agreement with expert assessments. Conclusion: The offline mobile app gives smallholder farmers an accessible, practical tool that helps improve cacao crop health and productivity. Abstract: Smallholder cacao producers often rely on outdated farming techniques and face significant challenges from pests and diseases, unlike larger plantations with more resources and expertise. In the Philippines, cacao farmers have limited access to data, information, and good agricultural practices. This study addresses these issues by developing a mobile application for cacao disease identification and management that functions offline, enabling use in remote areas where farms are mostly located. The core of the system is a deep learning model trained to identify cacao diseases accurately. The trained model is integrated into the mobile app to support farmers in field diagnosis. The disease identification model achieved a validation accuracy of 96.93% while the model for detecting cacao black pod infection levels achieved 79.49% validation accuracy. Field testing of the application showed an agreement rate of 84.2% compared with expert cacao technician assessments. This approach empowers smallholder farmers by providing accessible, technology-enabled tools to improve cacao crop health and productivity.[215] CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models
Samyak Jha,Junho Kim
Main category: cs.CV
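TL;DR: This paper proposes CAPA, a framework that uses an Attention Contribution criterion to identify and prune low-contribution visual tokens and applies linear approximations to feed-forward networks (FFNs) to speed up inference in large vision-language models.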
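The abstract for this entry defines Attention Contribution as attention probabilities weighted by value vector magnitude. A minimal single-head sketch of that scoring follows, assuming the L2 norm as the magnitude and a simple top-k rule for pruning; both choices, and all names here, are illustrative assumptions rather than the CAPA implementation.

```python
import math

def attention_contribution(attn_probs, values):
    """Score each token as attention probability times the L2 norm of its value vector."""
    return [p * math.sqrt(sum(x * x for x in v)) for p, v in zip(attn_probs, values)]

def keep_top_k(scores, k):
    """Return the indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy example: token 0 receives the most attention but has a tiny value vector.
attn_probs = [0.6, 0.3, 0.1]
values = [[0.1, 0.0], [3.0, 4.0], [0.0, 2.0]]
scores = attention_contribution(attn_probs, values)
kept = keep_top_k(scores, 2)
print(scores, kept)
```

In this toy case a ranking by attention probability alone would keep token 0, while the contribution score drops it, which is the kind of disagreement the entry attributes to low-contribution attention sinks.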
Details
Motivation: Inference efficiency of large vision-language models is limited by the cost of processing many visual tokens, and existing attention-score estimates of token importance are inaccurate; a more reliable contribution measure and ways to simplify computation are needed. Method: Attention Contribution is proposed as a more accurate importance measure for visual tokens; the functional heterogeneity of visual attention sinks is analyzed, separating prunable "Probability Dumps" from critical "Structural Anchors"; redundancy of visual tokens in FFNs is identified, with linear behavior especially in intermediate layers; on this basis, CAPA combines contribution-aware token pruning with linear FFN approximation. Result: CAPA achieves a good efficiency-performance trade-off on multiple benchmarks while improving robustness. Conclusion: Attention Contribution is a better indicator of visual token importance than attention scores; both visual tokens and FFNs contain redundancy that can be safely compressed, and CAPA improves inference efficiency by combining the two strategies. Abstract: Efficient inference in Large Vision-Language Models is constrained by the high cost of processing thousands of visual tokens, yet it remains unclear which tokens and computations can be safely removed. While attention scores are commonly used to estimate visual token importance, they are an imperfect proxy for actual contribution. We show that Attention Contribution, which weights attention probabilities by value vector magnitude, provides a more accurate criterion for visual token selection. Our empirical analysis reveals that visual attention sinks are functionally heterogeneous, comprising Probability Dumps with low contribution that can be safely pruned, and Structural Anchors with high contribution essential for maintaining model performance. Further, we identify substantial redundancy in Feed-Forward Networks (FFNs) associated with visual tokens, particularly in intermediate layers where image tokens exhibit linear behavior. Based on our findings, we introduce CAPA (Contribution-Aware Pruning and FFN Approximation), a dual-strategy framework that prunes visual tokens using attention contribution at critical functional transitions and reduces FFN computation through efficient linear approximations. Experiments on various benchmarks across baselines show that CAPA achieves competent efficiency-performance trade-offs with improved robustness.[216] SANEval: Open-Vocabulary Compositional Benchmarks with Failure-mode Diagnosis
Rishav Pramanik,Ian E. Nielsen,Jeff Smith,Saurav Pandit,Ravi P. Ramachandran,Zhaozheng Yin
Main category: cs.CV
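TL;DR: This paper proposes SANEval, a new evaluation benchmark for text-to-image generation built on a large language model and an open-vocabulary object detector. It provides fine-grained evaluation of compositional abilities such as spatial relations, attribute binding, and numeracy, and its agreement with human evaluation is validated on several state-of-the-art models.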
Details
Motivation: Existing text-to-image models handle complex prompts (with multiple objects, attributes, and spatial relations) poorly, and there is no open-vocabulary, fine-grained evaluation method that gives interpretable feedback. Method: The SANEval benchmark combines a large language model for deep prompt understanding with an LLM-enhanced open-vocabulary object detector to build a scalable compositional evaluation pipeline. Result: Experiments on six state-of-the-art T2I models indicate that SANEval's automated evaluation has a higher Spearman rank correlation with human evaluation than existing benchmarks on attribute binding, spatial relations, and numeracy. Conclusion: SANEval offers a more accurate, interpretable, open-vocabulary framework for evaluating compositional ability in text-to-image generation; the authors will release the dataset and evaluation pipeline. Abstract: The rapid progress of text-to-image (T2I) models has unlocked unprecedented creative potential, yet their ability to faithfully render complex prompts involving multiple objects, attributes, and spatial relationships remains a significant bottleneck. Progress is hampered by a lack of adequate evaluation methods; current benchmarks are often restricted to closed-set vocabularies, lack fine-grained diagnostic capabilities, and fail to provide the interpretable feedback necessary to diagnose and remedy specific compositional failures. We address these challenges by introducing SANEval (Spatial, Attribute, and Numeracy Evaluation), a comprehensive benchmark that establishes a scalable new pipeline for open-vocabulary compositional evaluation. SANEval combines a large language model (LLM) for deep prompt understanding with an LLM-enhanced, open-vocabulary object detector to robustly evaluate compositional adherence, unconstrained by a fixed vocabulary. Through extensive experiments on six state-of-the-art T2I models, we demonstrate that SANEval's automated evaluations provide a more faithful proxy for human assessment; our metric achieves a Spearman's rank correlation that differs with statistical significance from those of existing benchmarks across tasks of attribute binding, spatial relations, and numeracy. To facilitate future research in compositional T2I generation and evaluation, we will release the SANEval dataset and our open-source evaluation pipeline.[217] Subspace Clustering on Incomplete Data with Self-Supervised Contrastive Learning
Huanran Li,Daniel Pimentel-Alarcón
Main category: cs.CV
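TL;DR: This paper proposes Contrastive Subspace Clustering (CSC), a contrastive self-supervised framework for subspace clustering on incomplete data. It generates masked views, learns invariant embeddings with a SimCLR-style contrastive loss, and then applies sparse subspace clustering for efficient and robust clustering.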
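The abstract for this entry describes generating masked views of partially observed inputs and training with a SimCLR-style contrastive loss. The sketch below shows one common form of that loss (NT-Xent) plus a toy masking function. It omits the encoder network entirely and treats the vectors as already-computed embeddings; the zero-filling of missing entries, the `drop_prob` parameter, the temperature, and all function names are assumptions made for illustration, not details from the paper.

```python
import math
import random

def mask_view(x, observed, drop_prob, rng):
    """Zero out unobserved entries and a random subset of the observed ones."""
    return [xi if (obs and rng.random() >= drop_prob) else 0.0
            for xi, obs in zip(x, observed)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss: z1[i] and z2[i] are embeddings of two views of sample i."""
    z = z1 + z2
    n = len(z1)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same sample
        logits = [cosine(z[i], z[j]) / temperature for j in range(2 * n) if j != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)

rng = random.Random(0)
view = mask_view([1.0, 2.0, 3.0], [True, False, True], drop_prob=0.0, rng=rng)
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(view, aligned, shuffled)
```

The loss is lower when the two views of each sample have matching embeddings than when the pairing is shuffled, which is the property a contrastive objective of this kind is meant to reward.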
Details
Motivation: Most existing subspace clustering methods assume fully observed data and cope poorly with the missing entries that are common in real-world settings. Method: The contrastive self-supervised framework CSC generates masked views of partially observed inputs, trains a deep neural network with a SimCLR-style contrastive loss to learn invariant embeddings, and then clusters the embeddings with sparse subspace clustering. Result: Experiments on six benchmark datasets show that CSC consistently outperforms classical and deep learning baselines, with strong robustness to missing data and scalability to large datasets. Conclusion: CSC offers an effective, robust, and scalable approach to subspace clustering on incomplete data. Abstract: Subspace clustering aims to group data points that lie in a union of low-dimensional subspaces and finds wide application in computer vision, hyperspectral imaging, and recommendation systems. However, most existing methods assume fully observed data, limiting their effectiveness in real-world scenarios with missing entries. In this paper, we propose a contrastive self-supervised framework, Contrastive Subspace Clustering (CSC), designed for clustering incomplete data. CSC generates masked views of partially observed inputs and trains a deep neural network using a SimCLR-style contrastive loss to learn invariant embeddings. These embeddings are then clustered using sparse subspace clustering. Experiments on six benchmark datasets show that CSC consistently outperforms both classical and deep learning baselines, demonstrating strong robustness to missing data and scalability to large datasets.[218] World-Shaper: A Unified Framework for 360° Panoramic Editing
Dong Liang,Yuhao Liu,Jinyuan Jia,Youjun Zhao,Rynson W. H. Lau
Main category: cs.CV
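TL;DR: This paper proposes World-Shaper, a unified geometry-aware framework that edits panoramic images directly in the equirectangular projection (ERP) domain. Through a generate-then-edit paradigm and a geometry-aware learning strategy, it improves the geometric consistency, fidelity, and text controllability of panoramic editing.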
Details
Motivation: Existing perspective-based image editing methods cannot model the spatial structure of panoramas; cube-map decompositions attempt to address this but break global consistency because they do not match spherical geometry. Method: (1) Model panoramic editing directly in the ERP domain; (2) adopt a generate-then-edit paradigm to ease the scarcity of paired data; (3) introduce a geometry-aware learning strategy that includes position-aware shape supervision and progressive training to implicitly internalize panoramic priors. Result: On the new benchmark PEBench, World-Shaper surpasses SOTA methods in geometric consistency, editing fidelity, and text controllability. Conclusion: World-Shaper enables unified, coherent, and flexible editing control over 360° visual worlds and offers a new approach to panoramic content creation. Abstract: Being able to edit panoramic images is crucial for creating realistic 360° visual experiences. However, existing perspective-based image editing methods fail to model the spatial structure of panoramas. Conventional cube-map decompositions attempt to overcome this problem but inevitably break global consistency due to their mismatch with spherical geometry. Motivated by this insight, we reformulate panoramic editing directly in the equirectangular projection (ERP) domain and present World-Shaper, a unified geometry-aware framework that bridges panoramic generation and editing within a single editing-centric design. To overcome the scarcity of paired data, we adopt a generate-then-edit paradigm, where controllable panoramic generation serves as an auxiliary stage to synthesize diverse paired examples for supervised editing learning. To address geometric distortion, we introduce a geometry-aware learning strategy that explicitly enforces position-aware shape supervision and implicitly internalizes panoramic priors through progressive training. Extensive experiments on our new benchmark, PEBench, demonstrate that our method achieves superior geometric consistency, editing fidelity, and text controllability compared to SOTA methods, enabling coherent and flexible 360° visual world creation with unified editing control. Code, model, and data will be released at our project page: https://world-shaper-project.github.io/ [219] PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
Gemma Canet Tarrés,Manel Baradad,Francesc Moreno-Noguer,Yumeng Li
Main category: cs.CV
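TL;DR: This paper proposes PLACID, a framework that uses a pretrained image-to-video diffusion model and synthetic video data to improve the consistency and fidelity of object identity, background and color, layout control, and overall visual appeal in multi-object image compositing.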
Details
Motivation: Existing generative AI performs well at single-image generation, but in professional multi-object compositing it suffers from distorted objects, omitted or duplicated objects, and layout errors, falling short of high-fidelity, controllable, and visually appealing composites. Method: (1) A pretrained image-to-video (I2V) diffusion model with text control, whose temporal video priors strengthen object consistency and background fidelity; (2) a new synthetic data strategy that generates video sequences in which objects move smoothly from random positions to their target positions, aligning with the model's temporal priors; at inference, text guidance makes randomly initialized objects converge to a coherent layout, and the final frame is the composite image. Result: Quantitative evaluation and user studies show that PLACID significantly outperforms existing SOTA methods in object identity preservation, background and color fidelity, object completeness, and visual appeal. Conclusion: By introducing temporal video modeling and purpose-built synthetic data, PLACID addresses the core challenges of multi-object compositing and offers a new approach to high-quality, controllable image compositing. Abstract: Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short for studio-level multi-object compositing. This task demands simultaneous (i) near-perfect preservation of each item's identity, (ii) precise background and color fidelity, (iii) layout and design elements control, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve object consistency, identities, and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model's temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image.
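The synthetic sequences described above, in which a randomly placed object moves smoothly to its target position, can be pictured with a small trajectory sketch. It assumes 2D positions and a smoothstep easing curve; the abstract does not specify how the motion is parameterized, so the easing function, the coordinates, and the function names are illustrative assumptions only.

```python
def smoothstep(t):
    """Ease-in/ease-out curve mapping 0 to 0 and 1 to 1 with zero slope at both ends."""
    return t * t * (3 - 2 * t)

def synthetic_trajectory(start, target, num_frames):
    """Return per-frame positions moving smoothly from start to target."""
    frames = []
    for k in range(num_frames):
        t = k / (num_frames - 1) if num_frames > 1 else 1.0
        s = smoothstep(t)
        frames.append(tuple(a + s * (b - a) for a, b in zip(start, target)))
    return frames

# Hypothetical object that starts at a random position and ends at its layout slot.
path = synthetic_trajectory((0.0, 0.0), (10.0, 4.0), 5)
print(path)
```

The last frame places the object exactly at its target, mirroring the role of the final frame as the composite image in the description above.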
Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with fewer omitted objects and visually appealing results.[220] TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation
Ariel Shaulov,Eitan Shaar,Amit Edenzon,Lior Wolf
Main category: cs.CV
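TL;DR: This paper proposes an inference-time method that identifies and removes unstable latent tokens to mitigate temporal drift in autoregressive video generation, improving the temporal consistency of long videos without changing the model architecture or training procedure.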
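The abstract for this entry defines unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch. A minimal sketch of that filtering step follows, assuming tokens at the same index correspond across batches, Euclidean distance as the deviation measure, and a fixed threshold; all three choices and the function names are assumptions for illustration, not the TokenTrim implementation.

```python
import math

def find_unstable_tokens(prev_tokens, curr_tokens, threshold):
    """Return indices of tokens whose distance to the previous batch exceeds threshold."""
    unstable = []
    for idx, (prev, curr) in enumerate(zip(prev_tokens, curr_tokens)):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(prev, curr)))
        if dist > threshold:
            unstable.append(idx)
    return unstable

def trim_context(curr_tokens, unstable):
    """Drop unstable tokens so they are not reused as conditioning context."""
    drop = set(unstable)
    return [tok for idx, tok in enumerate(curr_tokens) if idx not in drop]

prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
curr = [[0.1, 0.0], [4.0, 5.0], [2.0, 2.1]]
unstable = find_unstable_tokens(prev, curr, threshold=1.0)
context = trim_context(curr, unstable)
print(unstable, context)
```

Only the context passed to the next generation step changes; the model weights and the latent space are left untouched, consistent with the inference-time framing of the entry.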
Details
Motivation: Autoregressive video generation suffers from severe temporal drift, in which errors accumulate and amplify as more frames are generated; the authors argue that the main cause is not insufficient model capacity but the uncontrolled reuse of corrupted latent conditioning tokens during inference. Method: "Unstable tokens" are defined as tokens whose latent representations deviate significantly from those of the previously generated batch, indicating potential corruption or semantic drift; these tokens are explicitly removed during autoregressive inference so that they do not condition later frames. Result: The method significantly improves temporal consistency in long-horizon video generation and mitigates temporal drift without modifying the model architecture or training procedure and without leaving latent space. Conclusion: Screening latent conditioning tokens for stability at inference time and removing the unstable ones is a lightweight, efficient, and general strategy for reducing temporal drift in autoregressive video generation. Abstract: Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We hypothesize that this drift does not primarily stem from insufficient model capacity, but rather from inference-time error propagation. Specifically, we contend that drift arises from the uncontrolled reuse of corrupted latent conditioning tokens during auto-regressive inference. To correct this accumulation of errors, we propose a simple, inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. For this purpose, we define unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch, indicating potential corruption or semantic drift. By explicitly removing corrupted latent tokens from the auto-regressive context, rather than modifying entire spatial regions or model parameters, our method prevents unreliable latent information from influencing future generation steps. As a result, it significantly improves long-horizon temporal consistency without modifying the model architecture or training procedure and without leaving latent space.[221] TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
Baiqi Li,Kangyi Zhao,Ce Zhang,Chancharik Mitra,Jean de Dieu Nyandwi,Gedas Bertasius
Main category: cs.CV
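TL;DR: TimeBlind is a new diagnostic benchmark for evaluating the fine-grained spatio-temporal understanding of multimodal large language models (MLLMs). It finds that current frontier models rely heavily on static visual cues rather than genuine temporal reasoning.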
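The abstract for this entry reports Instance Accuracy, described as correctly distinguishing both videos in a pair. A small sketch of how such a strict per-instance metric differs from plain per-question accuracy follows. It assumes an instance counts as solved only if every one of its video-question pairs is answered correctly; that reading, the data layout, and the function names are assumptions, not the benchmark's official scoring code.

```python
def instance_accuracy(results):
    """Fraction of instances whose video-question pairs are all answered correctly.

    results maps an instance id to a list of booleans, one per video-question pair.
    """
    if not results:
        return 0.0
    solved = sum(1 for answers in results.values() if all(answers))
    return solved / len(results)

def pair_accuracy(results):
    """Fraction of individual video-question pairs answered correctly."""
    flat = [ok for answers in results.values() for ok in answers]
    return sum(flat) / len(flat) if flat else 0.0

# Hypothetical model outputs for two instances with four pairs each.
results = {
    "instance_a": [True, True, True, True],
    "instance_b": [True, True, True, False],
}
print(instance_accuracy(results), pair_accuracy(results))  # 0.5 0.875
```

A single wrong answer within an instance costs the whole instance, so a strict metric of this kind penalizes models that answer from static content shared by both videos.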
Details
Motivation: Existing MLLMs are good at static semantic understanding but brittle at modeling temporal dynamics, and there is no rigorous benchmark that separates recognition from temporal reasoning while avoiding language-prior confounds. Method: The TimeBlind benchmark, grounded in cognitive science, divides temporal understanding into three levels (recognizing atomic events, characterizing event properties, and reasoning about event interdependencies), adopts a minimal-pairs paradigm (video pairs differ only in temporal structure and share the same static content), and uses complementary questions to remove language bias. Result: More than 20 SOTA MLLMs were evaluated on 600 instances (2400 video-question pairs); the best model reaches only 48.2% instance accuracy, far below the human 98.2%, indicating that models generally rely on static visual shortcuts. Conclusion: TimeBlind exposes a fundamental weakness of current MLLMs in spatio-temporal reasoning and provides a diagnostic tool and direction for next-generation video understanding models. Abstract: Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2400 video-question pairs) reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best performing MLLM is only 48.2%, far below the human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at https://baiqi-li.github.io/timeblind_project/ .[222] Computer Vision and Its Relationship to Cognitive Science: A perspective from Bayes Decision Theory
Alan Yuille,Daniel Kersten
Main category: cs.CV
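TL;DR: This paper introduces computer vision and its relationship to cognitive science from the perspective of Bayes Decision Theory (BDT), analyzes the two dominant paradigms, Bayesian methods and deep neural networks, within one framework, and discusses how they complement each other and might be combined.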
Details
Motivation: To bridge the theoretical gap between Bayesian methods and deep learning methods in computer vision and to build a closer connection with cognitive science. Method: Using Bayes Decision Theory as a unifying framework, the paper compares the principles, strengths, and limitations of the Bayesian viewpoint and deep neural network methods in visual modeling. Result: The analysis reveals how the two approaches relate within the BDT framework, identifies their respective limitations, and points to possible ways of combining them into more powerful models of visual cognition. Conclusion: BDT serves both as a theoretical lens for understanding existing vision methods and as a basis for developing next-generation joint vision-cognition models that are both explanatory and practical. Abstract: This document presents an introduction to computer vision, and its relationship to Cognitive Science, from the perspective of Bayes Decision Theory (BDT) (Berger 1985). Computer vision is a vast and complex field, so this overview has a narrow scope and provides a theoretical lens which captures many key concepts. BDT is rich enough to include two different approaches: (i) the Bayesian viewpoint, which gives a conceptually attractive framework for vision with concepts that resonate with Cognitive Science (Griffiths et al., 2024), and (ii) the Deep Neural Network approach whose successes in the real world have made Computer Vision into a trillion-dollar industry and which is motivated by the hierarchical structure of the visual ventral stream. The BDT framework relates and captures the strengths and weaknesses of these two approaches and, by discussing the limitations of BDT, points the way to how they can be combined in a richer framework.[223] LogicGaze: Benchmarking Causal Consistency in Visual Narratives via Counterfactual Verification
Rory Driscoll,Alexandros Christoforos,Chadbourne Davis
Main category: cs.CV
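TL;DR: This paper proposes LogicGaze, a benchmark framework for evaluating whether vision-language models (VLMs) can verify causal reasoning chains in videos and images, with a focus on how much they rely on visual evidence and on their hallucination problems.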
Details
Motivation: As the sequential reasoning ability of VLMs improves, whether their reasoning chains are truly grounded in visual evidence remains insufficiently tested, and hallucination is a severe problem.
Method: Builds the LogicGaze benchmark from 40,000 video segments (ShareGPT4Video) and a subset of Flickr30k, introducing causal perturbations that are visually contradictory yet linguistically plausible, and designs a three-part evaluation protocol: Causal Validation, Grounded Narrative Synthesis, and Perturbation Rejection.
Result: Exposes significant vulnerabilities in SOTA models such as Qwen2.5-VL-72B, showing that current VLMs struggle to reliably anchor their reasoning in actual visual content.
Conclusion: LogicGaze pushes VLMs toward more robust and trustworthy multimodal causal reasoning, and all resources are released publicly.
Abstract: While sequential reasoning enhances the capability of Vision-Language Models (VLMs) to execute complex multimodal tasks, their reliability in grounding these reasoning chains within actual visual evidence remains insufficiently explored. We introduce LogicGaze, a novel benchmark framework designed to rigorously interrogate whether VLMs can validate sequential causal chains against visual inputs, specifically targeting the pervasive issue of hallucination. Curated from 40,000 video segments from ShareGPT4Video and a subset of Flickr30k imagery, LogicGaze integrates causal sequences with visually contradictory yet linguistically plausible perturbations, compelling models to verify the authenticity of each reasoning step. Our tripartite evaluation protocol - Causal Validation, Grounded Narrative Synthesis, and Perturbation Rejection - exposes significant vulnerabilities in state-of-the-art VLMs such as Qwen2.5-VL-72B. LogicGaze advocates for robust, trustworthy multimodal reasoning, with all resources publicly available in an anonymized repository.
[224] Opportunistic Promptable Segmentation: Leveraging Routine Radiological Annotations to Guide 3D CT Lesion Segmentation
Samuel Church,Joshua D. Warner,Danyal Maqbool,Xin Tie,Junjie Hu,Meghan G. Lubner,Tyler J. Bradshaw
Main category: cs.CV
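TL;DR: This paper proposes SAM2CT, a model that takes the sparse GSPS annotations (such as arrows and line measurements) that radiologists leave during routine reads and, within a promptable segmentation paradigm, automatically generates high-quality 3D CT lesion segmentations, substantially reducing manual annotation cost.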
Details
Motivation: 3D CT lesion segmentation depends on large amounts of costly manual annotation, while clinical PACS already hold many sparse GSPS annotations (such as arrows and line measurements) left by radiologists during routine reads that have not been used effectively.
Method: Proposes the Opportunistic Promptable Segmentation paradigm and builds the SAM2CT model on top of SAM2, extending the prompt encoder to accept arrow and line inputs and introducing Memory-Conditioned Memories (MCM), a memory mechanism designed for 3D medical volumes.
Result: On public lesion segmentation benchmarks, SAM2CT reaches Dice coefficients of 0.649 for arrow prompts and 0.757 for line prompts; on 60 real PACS GSPS cases, radiologists rated 87% of the generated segmentations as clinically acceptable or needing only minor adjustments; it also shows strong zero-shot generalization on select Emergency Department findings.
Conclusion: Large-scale mining of historical GSPS annotations is a feasible and scalable new route to building 3D CT segmentation datasets, and SAM2CT is the first effective technical solution for it.
Abstract: The development of machine learning models for CT imaging depends on the availability of large, high-quality, and diverse annotated datasets. Although large volumes of CT images and reports are readily available in clinical picture archiving and communication systems (PACS), 3D segmentations of critical findings are costly to obtain, typically requiring extensive manual annotation by radiologists. On the other hand, it is common for radiologists to provide limited annotations of findings during routine reads, such as line measurements and arrows, that are often stored in PACS as GSPS objects. We posit that these sparse annotations can be extracted along with CT volumes and converted into 3D segmentations using promptable segmentation models, a paradigm we term Opportunistic Promptable Segmentation. To enable this paradigm, we propose SAM2CT, the first promptable segmentation model designed to convert radiologist annotations into 3D segmentations in CT volumes. SAM2CT builds upon SAM2 by extending the prompt encoder to support arrow and line inputs and by introducing Memory-Conditioned Memories (MCM), a memory encoding strategy tailored to 3D medical volumes. On public lesion segmentation benchmarks, SAM2CT outperforms existing promptable segmentation models and similarly trained baselines, achieving Dice similarity coefficients of 0.649 for arrow prompts and 0.757 for line prompts.
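For readers unfamiliar with the metric quoted above, the Dice similarity coefficient of a predicted mask P and a reference mask T is 2|P ∩ T| / (|P| + |T|). The following is a minimal pure-Python sketch on flattened binary masks; the function name, the toy masks, and the convention for two empty masks are illustrative assumptions and are not taken from the SAM2CT code.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labeled as lesion in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total number of lesion voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 shared voxels, 4 predicted, 4 in the reference.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 1, 1, 0]
print(dice_coefficient(pred, target))  # 2 * 3 / (4 + 4) = 0.75
```

A 3D volume can be scored the same way after flattening it voxel by voxel.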
Applying the model to pre-existing GSPS annotations from a clinical PACS (N = 60), SAM2CT generates 3D segmentations that are clinically acceptable or require only minor adjustments in 87% of cases, as scored by radiologists. Additionally, SAM2CT demonstrates strong zero-shot performance on select Emergency Department findings. These results suggest that large-scale mining of historical GSPS annotations represents a promising and scalable approach for generating 3D CT segmentation datasets.
[225] On the Assessment of Sensitivity of Autonomous Vehicle Perception
Apostol Vassilev,Munawar Hasan,Edward Griffor,Honglan Jin,Pavel Piliptchak,Mahima Arora,Thoshitha Gamage
Main category: cs.CV
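TL;DR: This paper proposes a predictive sensitivity quantification method based on a model ensemble to assess the robustness of autonomous vehicle perception systems under adverse weather and road conditions, and finds that diminished lighting (e.g., fog, low sun altitude) and distant objects affect perception performance the most.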
Details
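Motivation: The viability of automated driving depends heavily on perception systems being real-time, accurate, and reliable. They must in particular cope with the perception errors and delays caused by natural and adversarial driving factors, so their robustness needs to be assessed and their reliability improved.
Method: Uses an ensemble of five mainstream vision models (YOLOv8-v9, DETR50/101, RT-DETR) to carry out predictive sensitivity quantification in simulated and real-world environments, builds a perception assessment criterion from interpretable quantities such as stopping distance, and examines the effect of road surface, lighting, occlusion, and object distance.
Result: Diminished lighting conditions such as fog and low sun altitude affect perception performance the most; model performance drops markedly when occlusion is combined with inclement weather; the farther away the object, the worse the perception performance and the lower the robustness. The paper also proposes a notional perception assessment architecture and a stopping-distance-based assessment criterion.
Conclusion: Perception robustness is strongly affected by environmental and geometric factors, and an assessment framework based on multi-model sensitivity quantification can identify weak points, giving an interpretable and deployable way to improve the reliability of autonomous driving perception.
Abstract: The viability of automated driving is heavily dependent on the performance of perception systems to provide real-time, accurate, and reliable information for robust decision-making and maneuvers. These systems must perform reliably not only under ideal conditions, but also when challenged by natural and adversarial driving factors. Both of these types of interference can lead to perception errors and delays in detection and classification. Hence, it is essential to assess the robustness of the perception systems of automated vehicles (AVs) and explore strategies for making perception more reliable. We approach this problem by evaluating perception performance using predictive sensitivity quantification based on an ensemble of models, capturing model disagreement and inference variability across multiple models, under adverse driving scenarios in both simulated environments and real-world conditions. A notional architecture for assessing perception performance is proposed. A perception assessment criterion is developed based on an AV's stopping distance at a stop sign on varying road surfaces, such as dry and wet asphalt, and vehicle speed. Five state-of-the-art computer vision models are used in our experiments: YOLO (v8-v9), DEtection TRansformer (DETR50, DETR101), and Real-Time DEtection TRansformer (RT-DETR). Diminished lighting conditions, e.g., resulting from the presence of fog and low sun altitude, have the greatest impact on the performance of the perception models.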
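The stopping-distance criterion mentioned above can be illustrated with the textbook constant-friction braking model, d = v·t + v²/(2μg). The sketch below is only that generic formula; the friction coefficients for dry and wet asphalt, the latency value, and the function name are illustrative assumptions rather than values or code from the paper.

```python
G = 9.81  # gravitational acceleration in m/s^2


def stopping_distance(speed_mps, friction, latency_s=0.0):
    """Distance (m) needed to stop: travel during latency plus braking distance.

    speed_mps: vehicle speed in m/s
    friction:  tire-road friction coefficient (assumed constant while braking)
    latency_s: delay between the sign becoming visible and braking onset
    """
    if speed_mps < 0 or friction <= 0 or latency_s < 0:
        raise ValueError("speed and latency must be >= 0, friction must be > 0")
    travel_during_latency = speed_mps * latency_s
    braking = speed_mps ** 2 / (2.0 * friction * G)
    return travel_during_latency + braking


# Assumed, illustrative friction coefficients (not from the paper).
DRY_ASPHALT = 0.7
WET_ASPHALT = 0.4

speed = 15.0  # m/s, roughly 54 km/h
for name, mu in (("dry", DRY_ASPHALT), ("wet", WET_ASPHALT)):
    d = stopping_distance(speed, mu, latency_s=0.5)
    print(f"{name} asphalt: the stop sign must be detected at least {d:.1f} m ahead")
```

Under this model a detection is only useful if it happens at a range larger than the computed distance, which is why lower friction or higher speed tightens the requirement on long-range perception.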
Additionally, adversarial road conditions such as occlusions of roadway objects increase perception sensitivity, and model performance drops when faced with a combination of adversarial road conditions and inclement weather conditions. It is also demonstrated that the greater the distance to a roadway object, the greater the impact on perception performance and hence the lower the perception robustness.
[226] Bridging the Semantic Chasm: Synergistic Conceptual Anchoring for Generalized Few-Shot and Zero-Shot OOD Perception
Alexandros Christoforos,Sarah Jenkins,Michael Brown,Tuan Pham,David Chen
Main category: cs.CV
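TL;DR: This paper proposes SynerNet, a synergistic neural agents network framework that addresses the degradation of cross-modal alignment in vision-language models when they encounter out-of-distribution concepts. Four specialized computational units work together, and experiments on the VISTA-Beyond benchmark show clear gains in few-shot and zero-shot settings.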
Details
Motivation: To address the degradation of cross-modal alignment in Vision-Language Models (VLMs) when they encounter Out-of-Distribution (OOD) concepts.
Method: Proposes the Synergistic Neural Agents Network (SynerNet) framework, in which four specialized computational units (visual perception, linguistic context, nominal embedding, and global coordination) collaboratively correct modality disparities through a structured message-propagation protocol; it also introduces a multi-agent latent space nomenclature acquisition framework, a semantic context-interchange algorithm, and an adaptive dynamic equilibrium mechanism.
Result: Experiments on the VISTA-Beyond benchmark show that SynerNet improves performance in both few-shot and zero-shot settings, with precision gains of 1.2% to 5.4%.
Conclusion: SynerNet mitigates the cross-modal alignment degradation of VLMs on OOD concepts and improves generalization and robustness.
Abstract: This manuscript presents a pioneering Synergistic Neural Agents Network (SynerNet) framework designed to mitigate the phenomenon of cross-modal alignment degeneration in Vision-Language Models (VLMs) when encountering Out-of-Distribution (OOD) concepts. Specifically, four specialized computational units - visual perception, linguistic context, nominal embedding, and global coordination - collaboratively rectify modality disparities via a structured message-propagation protocol. The principal contributions encompass a multi-agent latent space nomenclature acquisition framework, a semantic context-interchange algorithm for enhanced few-shot adaptation, and an adaptive dynamic equilibrium mechanism. Empirical evaluations conducted on the VISTA-Beyond benchmark demonstrate that SynerNet yields substantial performance augmentations in both few-shot and zero-shot scenarios, exhibiting precision improvements ranging from 1.2% to 5.4% across a diverse array of domains.
[227] When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs
Beidi Zhao,Wenlong Deng,Xinting Liao,Yushu Li,Nazim Shaikh,Yao Nie,Xiaoxiao Li
Main category: cs.CV
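TL;DR: This paper proposes MAD-RAG, a training-free method that addresses "Attention Distraction" (AD) in retrieval-augmented generation (RAG) for visual question answering, where retrieved text disrupts attention to the image, and improves performance.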
Details
Motivation: In knowledge-based visual question answering (VQA), existing RAG methods often let the retrieved text over-suppress attention to the image, so the model ignores question-relevant image regions and makes new errors. This failure mode, Attention Distraction (AD), has been overlooked so far.
Method: Proposes MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation and combines it with attention mixing to preserve image-conditioned evidence.
Result: Significantly outperforms baselines on OK-VQA, E-VQA, and InfoSeek, with absolute accuracy gains of up to 9.20%; it repairs up to 74.68% of AD failure cases at very low computational cost.
Conclusion: Attention Distraction is a key but overlooked failure mode of RAG in LVLMs; MAD-RAG mitigates it and offers a new approach to making RAG robust in multimodal reasoning.
Abstract: While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous studies overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses the visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.
[228] AdaFuse: Adaptive Multimodal Fusion for Lung Cancer Risk Prediction via Reinforcement Learning
Chongyu Qu,Zhengyi Lu,Yuxiang Lai,Thomas Z. Li,Junchao Zhu,Junlin Guo,Juming Xiong,Yanfan Zhu,Yuechen Yang,Allen J. Luna,Kim L. Sandler,Bennett A. Landman,Yuankai Huo
Main category: cs.CV
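TL;DR: This paper proposes AdaFuse, a reinforcement-learning-based adaptive multimodal fusion framework for lung cancer risk prediction. It selects and fuses modalities dynamically for each patient and terminates early when the information is sufficient, improving performance while lowering computational cost.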
Details
Motivation: Existing fusion methods either treat all modalities equally or only learn weights for them, leaving unaddressed the fundamental question of whether certain modalities should be used at all for a given patient.
Method: Models multimodal fusion as a sequential decision process in which a reinforcement learning policy network dynamically decides whether to incorporate another modality or to predict directly, enabling patient-specific modality selection and early termination.
Result: Reaches an AUC of 0.762 on the NLST dataset, better than the best single modality (0.732), fixed fusion (0.759), and adaptive baselines (DynMM 0.754, MoE 0.742), while using fewer FLOPs than all triple-modality methods.
Conclusion: Reinforcement learning can support personalized multimodal fusion, moving medical diagnosis from uniform fusion toward adaptive pipelines that consult modalities on demand.
Abstract: Multimodal fusion has emerged as a promising paradigm for disease diagnosis and prognosis, integrating complementary information from heterogeneous data sources such as medical images, clinical records, and radiology reports. However, existing fusion methods process all available modalities through the network, either treating them equally or learning to assign different contribution weights, leaving a fundamental question unaddressed: for a given patient, should certain modalities be used at all? We present AdaFuse, an adaptive multimodal fusion framework that leverages reinforcement learning (RL) to learn patient-specific modality selection and fusion strategies for lung cancer risk prediction. AdaFuse formulates multimodal fusion as a sequential decision process, where the policy network iteratively decides whether to incorporate an additional modality or proceed to prediction based on the information already acquired. This sequential formulation enables the model to condition each selection on previously observed modalities and terminate early when sufficient information is available, rather than committing to a fixed subset upfront. We evaluate AdaFuse on the National Lung Screening Trial (NLST) dataset. Experimental results demonstrate that AdaFuse achieves the highest AUC (0.762) compared to the best single-modality baseline (0.732), the best fixed fusion strategy (0.759), and adaptive baselines including DynMM (0.754) and MoE (0.742), while using fewer FLOPs than all triple-modality methods.
Our work demonstrates the potential of reinforcement learning for personalized multimodal fusion in medical imaging, representing a shift from uniform fusion strategies toward adaptive diagnostic pipelines that learn when to consult additional modalities and when existing information suffices for accurate prediction.
[229] MASC: Metal-Aware Sampling and Correction via Reinforcement Learning for Accelerated MRI
Zhengyi Lu,Ming Lu,Chongyu Qu,Junchao Zhu,Junlin Guo,Marilyn Lionts,Yanfan Zhu,Yuechen Yang,Tianyuan Yao,Jayasai Rajagopal,Bennett Allan Landman,Xiao Wang,Xinqiang Yan,Yuankai Huo
Main category: cs.CV
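TL;DR: This paper proposes MASC, a framework that uses reinforcement learning to jointly optimize the k-space sampling strategy and metal artifact correction for MRI with metal implants. It builds a paired dataset through physics-based simulation and trains the sampling policy and a U-Net MAR network end to end, so that image quality improves markedly while the MRI acquisition is accelerated.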
Details
Motivation: Metal implants cause severe artifacts in MRI that hinder diagnosis, and traditional approaches treat metal artifact reduction (MAR) and accelerated acquisition as separate problems without joint optimization.
Method: Proposes MASC, a unified reinforcement learning framework: 1) builds a paired 3D MRI dataset (with and without metal) from physics-based simulation; 2) models k-space sampling as a sequential decision problem in which a PPO agent selects phase-encoding lines; 3) feeds the sampler with undersampled reconstructions processed by a U-Net MAR network; 4) trains everything end to end so that the sampling policy and the MAR network adapt to each other.
Result: The sampling policies learned by MASC outperform conventional strategies; end-to-end training performs better than using a frozen pre-trained MAR network; experiments on the FastMRI dataset with physics-based artifact simulation confirm generalization to clinical data.
Conclusion: Jointly optimizing k-space sampling and metal artifact correction is feasible and effective, and MASC offers a new paradigm for accelerated, high-quality, metal-compatible MRI.
Abstract: Metal implants in MRI cause severe artifacts that degrade image quality and hinder clinical diagnosis. Traditional approaches address metal artifact reduction (MAR) and accelerated MRI acquisition as separate problems. We propose MASC, a unified reinforcement learning framework that jointly optimizes metal-aware k-space sampling and artifact correction for accelerated MRI. To enable supervised training, we construct a paired MRI dataset using physics-based simulation, generating k-space data and reconstructions for phantoms with and without metal implants. This paired dataset provides simulated 3D MRI scans with and without metal implants, where each metal-corrupted sample has an exactly matched clean reference, enabling direct supervision for both artifact reduction and acquisition policy learning. We formulate active MRI acquisition as a sequential decision-making problem, where an artifact-aware Proximal Policy Optimization (PPO) agent learns to select k-space phase-encoding lines under a limited acquisition budget. The agent operates on undersampled reconstructions processed through a U-Net-based MAR network, learning patterns that maximize reconstruction quality. We further propose an end-to-end training scheme where the acquisition policy learns to select k-space lines that best support artifact removal while the MAR network simultaneously adapts to the resulting undersampling patterns.
Experiments demonstrate that MASC's learned policies outperform conventional sampling strategies, and end-to-end training improves performance compared to using a frozen pre-trained MAR network, validating the benefit of joint optimization. Cross-dataset experiments on FastMRI with physics-based artifact simulation further confirm generalization to realistic clinical MRI data. The code and models of MASC have been made publicly available: https://github.com/hrlblab/masc
[230] ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models
Ignacy Kolton,Kacper Marzol,Paweł Batorski,Marcin Mazur,Paul Swoboda,Przemysław Spurek
Main category: cs.CV
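TL;DR: This paper proposes ReLAPSe, a framework that uses reinforcement learning (RLVR) to cast concept restoration as a policy learning problem. With the diffusion model's own noise prediction loss as a verifiable feedback signal, it recovers erased fine-grained identities and styles efficiently and in near real time, overcoming the computational cost and missing feedback that limit existing adversarial methods.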
Details
Motivation: Existing machine unlearning methods leave residual visual information, while current adversarial attacks are limited: optimization-based methods are computationally expensive, and reasoning-based or heuristic methods lack direct feedback from the target model's latent visual representations.
Method: Proposes ReLAPSe, a policy-based adversarial framework that models concept restoration as a reinforcement learning task. The agent is trained with Reinforcement Learning with Verifiable Rewards (RLVR), using the diffusion model's noise prediction loss as an intrinsic, verifiable feedback signal, which closes the loop between textual prompt manipulation and latent visual residuals.
Result: ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple SOTA unlearning methods, transfers across methods, and makes red-teaming considerably more scalable and practical.
Conclusion: ReLAPSe is the first to move adversarial concept restoration from per-instance optimization to global policy learning, providing a new paradigm and a practical tool for evaluating and strengthening the unlearning robustness of text-to-image diffusion models.
Abstract: Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search, while reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model's noise prediction loss as a model-intrinsic and verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at https://github.com/gmum/ReLaPSe
[231] Modeling Image-Caption Rating from Comparative Judgments
Kezia Minni,Qiang Zhang,Monoshiz Mahbub Khan,Zhe Yu
Main category: cs.CV
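TL;DR: This paper proposes a comparative-learning framework for evaluating image-text matching. By modeling people's relative preferences between image-caption pairs instead of absolute ratings, it keeps performance reasonably high while substantially reducing annotation cost and subjectivity.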
Details
Motivation: Assigning absolute ratings to image captions by hand is time-consuming and highly subjective, whereas people are better at judging which of two captions matches an image more closely, so comparative annotation is more efficient and reliable.
Method: Builds a comparative learning model that extracts image features with ResNet-50 and text features with MiniLM and trains on the VICR dataset; compares it with a direct regression model; and runs a small-scale human evaluation comparing three annotation schemes: absolute rating, pairwise comparison, and same-image comparison.
Result: The regression model is slightly better (Pearson ρ=0.7609, Spearman r_s=0.7089), but the comparative model keeps improving as data grows and approaches the regression baseline; the human evaluation shows that comparative annotation is faster and yields higher agreement.
Conclusion: Comparative learning can model human preferences effectively and can serve as a low-cost, high-agreement alternative for evaluating image-text matching.
Abstract: Rating the accuracy of captions in describing images is time-consuming and subjective for humans. In contrast, it is often easier for people to compare two captions and decide which one better matches a given image. In this work, we propose a machine learning framework that models such comparative judgments instead of direct ratings. The model can then be applied to rank unseen image-caption pairs in the same way as a regression model trained on direct ratings. Using the VICR dataset, we extract visual features with ResNet-50 and text features with MiniLM, then train both a regression model and a comparative learning model. While the regression model achieves better performance (Pearson's $ρ$: 0.7609 and Spearman's $r_s$: 0.7089), the comparative learning model steadily improves with more data and approaches the regression baseline. In addition, a small-scale human evaluation study comparing absolute rating, pairwise comparison, and same-image comparison shows that comparative annotation yields faster results and has greater agreement among human annotators. These results suggest that comparative learning can effectively model human preferences while significantly reducing the cost of human annotations.
[232] Deep Learning-Based Object Detection for Autonomous Vehicles: A Comparative Study of One-Stage and Two-Stage Detectors on Basic Traffic Objects
Bsher Karbouj,Adam Michael Altenbuchner,Joerg Krueger
Main category: cs.CV
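TL;DR: This paper compares YOLOv5 and Faster R-CNN on object detection for autonomous driving, using a dataset that mixes real and synthetic images and metrics such as mAP, recall, and inference speed. YOLOv5 is better in accuracy and training efficiency, while Faster R-CNN does better on small objects and in low-light scenes.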
Details
Motivation: There is little guidance on choosing a deep learning object detection method for specific autonomous driving applications, even though models differ considerably in detection accuracy, processing speed, and environmental robustness.
Method: Experimentally compares YOLOv5 (a one-stage detector) and Faster R-CNN (a two-stage detector) on a diverse dataset combining real and synthetic images, evaluating mAP, recall, and inference speed, and analyzes model behavior under different confidence thresholds and in real driving scenarios.
Result: YOLOv5 outperforms Faster R-CNN in mAP, recall, and training efficiency, with a clearer advantage on larger datasets and higher-resolution images; Faster R-CNN has the edge in detecting small, distant objects and in challenging lighting conditions.
Conclusion: Model choice has to balance accuracy, speed, and robustness: YOLOv5 suits mainstream scenarios with strict real-time requirements, while Faster R-CNN suits critical tasks that are sensitive to small objects or difficult lighting.
Abstract: Object detection is a crucial component in autonomous vehicle systems. It enables the vehicle to perceive and understand its environment by identifying and locating various objects around it. By utilizing advanced imaging and deep learning techniques, autonomous vehicle systems can rapidly and accurately identify objects based on their features. Different deep learning methods vary in their ability to accurately detect and classify objects in autonomous vehicle systems. Selecting the appropriate method significantly impacts system performance, robustness, and efficiency in real-world driving scenarios. While several generic deep learning architectures like YOLO, SSD, and Faster R-CNN have been proposed, guidance on their suitability for specific autonomous driving applications is often limited. The choice of method affects detection accuracy, processing speed, environmental robustness, sensor integration, scalability, and edge case handling. This study provides a comprehensive experimental analysis comparing two prominent object detection models: YOLOv5 (a one-stage detector) and Faster R-CNN (a two-stage detector). Their performance is evaluated on a diverse dataset combining real and synthetic images, considering various metrics including mean Average Precision (mAP), recall, and inference speed. The findings reveal that YOLOv5 demonstrates superior performance in terms of mAP, recall, and training efficiency, particularly as dataset size and image resolution increase.
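Detection metrics such as recall and mAP rest on matching predicted boxes to ground-truth boxes by intersection over union (IoU). The sketch below shows IoU and a simple greedy recall computation at a chosen confidence threshold; the box format, the 0.5 IoU threshold, the greedy matching rule, and the toy data are assumptions for illustration and do not reproduce the study's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def recall(predictions, ground_truth, iou_threshold=0.5, min_confidence=0.0):
    """Fraction of ground-truth boxes matched by a sufficiently confident prediction.

    predictions: list of (box, confidence); each ground-truth box is matched at most once.
    """
    if not ground_truth:
        return 1.0
    kept = [p for p in predictions if p[1] >= min_confidence]
    kept.sort(key=lambda p: p[1], reverse=True)  # most confident first
    matched = set()
    for box, _confidence in kept:
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
    return len(matched) / len(ground_truth)


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [((0, 0, 10, 10), 0.9), ((21, 21, 31, 31), 0.4)]
print(recall(predictions, ground_truth, min_confidence=0.0))  # both objects found
print(recall(predictions, ground_truth, min_confidence=0.5))  # low-confidence box dropped
```

Raising the confidence threshold discards the weaker detection, which lowers recall; sweeping that threshold is what produces the precision-recall curve underlying average precision.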
However, Faster R-CNN shows advantages in detecting small, distant objects and performs well in challenging lighting conditions. The models' behavior is also analyzed under different confidence thresholds and in various real-world scenarios, providing insights into their applicability for autonomous driving systems.
[233] Robust automatic brain vessel segmentation in 3D CTA scans using dynamic 4D-CTA data
Alberto Mario Ceballos-Arroyo,Shrikanth M. Yadav,Chu-Hsuan Lin,Jisoo Kim,Geoffrey S. Young,Huaizu Jiang,Lei Qin
Main category: cs.CV
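TL;DR: This paper proposes a new method for automatically annotating brain vessels from dynamic 4D-CTA scans. Multi-phase subtraction enhances vessel visibility, and the redundancy across phases is used to enlarge the dataset, which markedly improves the accuracy and robustness of deep learning models (such as nnUNet) on artery and vein segmentation.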
Details
Motivation: Conventional brain vessel annotation requires extensive manual work that is slow and error-prone, and existing CTA datasets are small and sensitive to contrast variation, which limits model generalization.
Method: Uses multiple time points of dynamic 4D-CTA to subtract bone and soft tissue and enhance vessel visibility; shares one high-quality segmentation label across several phases of the same patient, enlarging the dataset 4 to 5 times; trains an nnUNet model and evaluates it on datasets such as TopBrain.
Result: With 110 training and 165 test images, nnUNet reaches an mDC of 0.846 for arteries and 0.957 for veins; adHD is 0.304 mm and 0.078 mm respectively; tSens reaches 0.877 (arteries) and 0.974 (veins), clearly better than comparable datasets.
Conclusion: The method generates high-quality vessel annotations efficiently, makes models more robust to contrast phase, and provides a reliable open-source segmentation tool for clinical brain vessel analysis.
Abstract: In this study, we develop a novel methodology for annotating the brain vasculature using dynamic 4D-CTA head scans. By using multiple time points from dynamic CTA acquisitions, we subtract bone and soft tissue to enhance the visualization of arteries and veins, reducing the effort required to obtain manual annotations of brain vessels. We then train deep learning models on our ground truth annotations by using the same segmentation for multiple phases from the dynamic 4D-CTA collection, effectively enlarging our dataset by 4 to 5 times and inducing robustness to contrast phases. In total, our dataset comprises 110 training images from 25 patients and 165 test images from 14 patients. In comparison with two similarly-sized datasets for CTA-based brain vessel segmentation, an nnUNet model trained on our dataset can achieve significantly better segmentations across all vascular regions, with an average mDC of 0.846 for arteries and 0.957 for veins in the TopBrain dataset. Furthermore, metrics such as average directed Hausdorff distance (adHD) and topology sensitivity (tSens) reflected similar trends: using our dataset resulted in low error margins (adHD of 0.304 mm for arteries and 0.078 mm for veins) and high sensitivity (tSens of 0.877 for arteries and 0.974 for veins), indicating excellent accuracy in capturing vessel morphology. Our code and model weights are available online: https://github.com/alceballosa/robust-vessel-segmentation
[234] Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset
Gabriel Bromonschenkel,Alessandro L. Koerich,Thiago M. Paixão,Hilário Tomaz Alves de Oliveira
Main category: cs.CV
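TL;DR: This paper proposes a cross-native-translated evaluation for Brazilian Portuguese image captioning. It compares Transformer-based vision-language models on natively annotated and machine-translated versions of Flickr30K, and uses attention analysis and CLIP-Score to reveal model biases and image-text alignment.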
Details
Motivation: Low-resource languages such as Brazilian Portuguese lack dedicated image captioning datasets and models, and existing work mostly builds data through automatic translation, whose effect is still unclear.
Method: Uses a Flickr30K dataset with native Brazilian Portuguese captions; adopts a cross-context train-test setup (training on one dataset and testing on the other); adds attention maps to interpret model inference; and evaluates image-text alignment with CLIP-Score.
Result: Swin-DistilBERTimbau generalizes best; ViTucano beats larger models such as GPT-4o and LLaMa 3.2 Vision on text-based metrics; the GPT-4 models score highest on CLIP-Score; attention analysis reveals systematic biases including gender misclassification, object counting errors, and spatial inconsistencies.
Conclusion: Natively annotated data is essential for improving image captioning in low-resource languages; automatic translation is usable but introduces detectable semantic and visual biases; model choice should follow the task goal (text quality versus image-text alignment).
Abstract: Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has been focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross-native-translated evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K comprised of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross-context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP-Score metric to evaluate the image-description alignment. Our findings show that Swin-DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre-trained VLM, surpasses larger multilingual models (GPT-4o, LLaMa 3.2 Vision) in traditional text-based evaluation metrics, while GPT-4 models achieve the highest CLIP-Score, highlighting improved image-text alignment.
Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available at https://github.com/laicsiifes/transformer-caption-ptbr.
[235] Modeling Art Evaluations from Comparative Judgments: A Deep Learning Approach to Predicting Aesthetic Preferences
Manoj Reddy Bethi,Sai Rupa Jhade,Pravallika Yaganti,Monoshiz Mahbub Khan,Zhe Yu
Main category: cs.CV
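TL;DR: This paper proposes a deep learning framework based on pairwise preference comparisons to model human aesthetic judgments of visual art and reduce annotation cost; experiments show that it approaches regression-level prediction performance while being more efficient to annotate than direct rating.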
Details
Motivation: Modeling human aesthetic judgments of visual art is difficult because individual preferences vary widely and labeled data is costly to obtain, so a more efficient annotation and modeling approach with a lower cognitive burden is needed.
Method: Uses a deep regression model and a dual-branch pairwise comparison model on CNN features extracted with ResNet-50, draws on the Law of Comparative Judgment to replace direct ratings with pairwise preferences, and conducts an empirical analysis of four research questions.
Result: The deep regression model improves R² by up to 328%; the pairwise comparison model approaches regression performance without direct rating labels; individual preferences are predicted poorly; human experiments show that pairwise judgments take 60% less time than direct ratings.
Conclusion: Pairwise comparative learning is an efficient and feasible paradigm for modeling aesthetic preferences, keeping prediction performance high while markedly reducing annotation cost, and is especially suited to large-scale preference modeling.
Abstract: Modeling human aesthetic judgments in visual art presents significant challenges due to individual preference variability and the high cost of obtaining labeled data. To reduce the cost of acquiring such labels, we propose to apply a comparative learning framework based on pairwise preference assessments rather than direct ratings. This approach leverages the Law of Comparative Judgment, which posits that relative choices impose less cognitive burden and exhibit greater cognitive consistency than direct scoring. We extract deep convolutional features from painting images using ResNet-50 and develop both a deep neural network regression model and a dual-branch pairwise comparison model. We explored four research questions: (RQ1) How does the proposed deep neural network regression model with CNN features compare to the baseline linear regression model using hand-crafted features? (RQ2) How does pairwise comparative learning compare to regression-based prediction when lacking access to direct rating values? (RQ3) Can we predict individual rater preferences through within-rater and cross-rater analysis? (RQ4) What is the annotation cost trade-off between direct ratings and comparative judgments in terms of human time and effort? Our results show that the deep regression model substantially outperforms the baseline, achieving up to $328\%$ improvement in $R^2$. The comparative model approaches regression performance despite having no access to direct rating values, validating the practical utility of pairwise comparisons.
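To make the idea of learning from pairwise preferences concrete, here is a minimal sketch that recovers a latent score per item from "A was preferred over B" judgments. It swaps in a Bradley-Terry (logistic) link, P(i preferred over j) = sigmoid(s_i - s_j), fitted by gradient ascent; this is a stand-in assumption, not the paper's dual-branch network or its exact comparative-judgment formulation, and all names and data are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_scores(num_items, comparisons, learning_rate=0.1, epochs=500):
    """Fit one latent score per item from (winner, loser) index pairs.

    Maximizes the Bradley-Terry log-likelihood by plain gradient ascent.
    Scores are re-centered to mean zero because only differences matter.
    """
    scores = [0.0] * num_items
    for _ in range(epochs):
        grads = [0.0] * num_items
        for winner, loser in comparisons:
            p_win = sigmoid(scores[winner] - scores[loser])
            # Gradient of log(p_win) with respect to the two scores.
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        scores = [s + learning_rate * g for s, g in zip(scores, grads)]
        mean = sum(scores) / num_items
        scores = [s - mean for s in scores]
    return scores


# Hypothetical judgments over three artworks: 2 is usually preferred to 1, and 1 to 0.
comparisons = [
    (2, 1), (2, 1), (1, 2),
    (1, 0), (1, 0), (0, 1),
    (2, 0), (2, 0), (2, 0),
]
scores = fit_scores(3, comparisons)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
print(ranking)  # items ordered from most to least preferred
```

In a learned model the free scores would be replaced by the output of a shared network applied to each item's features, with the same pairwise loss driving training.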
However, predicting individual preferences remains challenging, with both within-rater and cross-rater performance significantly lower than average rating prediction. Human subject experiments reveal that comparative judgments require $60\%$ less annotation time per item, demonstrating superior annotation efficiency for large-scale preference modeling.
[236] 3DGS$^2$-TR: Scalable Second-Order Trust-Region Method for 3D Gaussian Splatting
Roger Hsiao,Yuchen Fang,Xiangru Huang,Ruilong Li,Hesam Rabeti,Zan Gojcic,Javad Lavaei,James Demmel,Sophia Shao
Main category: cs.CV
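TL;DR: This paper proposes 3DGS²-TR, a second-order optimizer for accelerating scene training in 3D Gaussian Splatting (3DGS). It approximates the Hessian diagonal with Hutchinson's method, which keeps it matrix-free with O(n) complexity, and adds a parameter-wise trust region based on the Hellinger distance to stabilize the nonlinear optimization, reaching better reconstruction quality with fewer iterations and little extra memory.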
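Hutchinson's method estimates the diagonal of a matrix H using only matrix-vector products: for random vectors z with independent ±1 entries, the expectation of z ⊙ (Hz) equals diag(H). The sketch below checks this on a small explicit matrix in pure Python; in an optimizer the product Hz would come from automatic differentiation instead. The matrix, sample count, and function names are illustrative assumptions, not the 3DGS²-TR implementation.

```python
import random


def matvec(matrix, vector):
    """Dense matrix-vector product, standing in for a Hessian-vector product."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def hutchinson_diagonal(hvp, dim, num_samples=2000, seed=0):
    """Estimate diag(H) from products hvp(z) = H z with Rademacher probes z."""
    rng = random.Random(seed)
    estimate = [0.0] * dim
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp(z)
        for i in range(dim):
            # z_i * (Hz)_i = H_ii + sum_{j != i} H_ij z_i z_j; the cross terms average out.
            estimate[i] += z[i] * hz[i]
    return [e / num_samples for e in estimate]


# Small symmetric test matrix (hypothetical).
H = [
    [4.0, 1.0, 0.5],
    [1.0, 3.0, 0.2],
    [0.5, 0.2, 2.0],
]
diag = hutchinson_diagonal(lambda v: matvec(H, v), dim=3)
print([round(d, 2) for d in diag])  # should be close to [4.0, 3.0, 2.0]
```

Only vectors of length n are ever stored, which is why a diagonal estimate of this kind keeps memory and compute linear in the number of parameters.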
Details
Motivation: 现有二阶优化方法(如3DGS-LM、3DGS2)依赖显式或稠密曲率表示,计算与内存开销大,难以扩展至大规模场景;同时3DGS光栅化过程强非线性,导致优化不稳定。 Method: 采用Hutchinson方法高效估计Hessian矩阵对角线,实现矩阵自由的二阶优化;设计基于平方Hellinger距离的逐参数信任域策略,约束高斯参数更新步长;整体复杂度与ADAM一致(O(n))。 Result: 在相同初始化且不进行致密化前提下,相比ADAM减少50%训练迭代次数、重建质量更优;峰值GPU内存仅比ADAM高17%(<1GB),远低于3DGS-LM(低85%);支持大规模场景及潜在分布式训练。 Conclusion: 3DGS²-TR在保持低计算与内存成本的同时,显著提升3DGS训练效率与稳定性,为实际大规模应用提供了可行的二阶优化方案。 Abstract: We propose 3DGS$^2$-TR,a second-order optimizer for accelerating the scene training problem in 3D Gaussian Splatting (3DGS). Unlike existing second-order approaches that rely on explicit or dense curvature representations, such as 3DGS-LM (Höllein et al., 2025) or 3DGS2 (Lan et al., 2025), our method approximates curvature using only the diagonal of the Hessian matrix, efficiently via Hutchinson's method. Our approach is fully matrix-free and has the same complexity as ADAM (Kingma, 2024), $O(n)$ in both computation and memory costs. To ensure stable optimization in the presence of strong nonlinearity in the 3DGS rasterization process, we introduce a parameter-wise trust-region technique based on the squared Hellinger distance, regularizing updates to Gaussian parameters. Under identical parameter initialization and without densification, 3DGS$^2$-TR is able to achieve better reconstruction quality on standard datasets, using 50% fewer training iterations compared to ADAM, while incurring less than 1GB of peak GPU memory overhead (17% more than ADAM and 85% less than 3DGS-LM), enabling scalability to very large scenes and potentially to distributed training settings.[237] Toward Autonomous Laboratory Safety Monitoring with Vision Language Models: Learning to See Hazards Through Scene Structure
Trishna Chakraborty,Udita Ghosh,Aldair Ernesto Gongora,Ruben Glatt,Yue Dong,Jiachen Li,Amit K. Roy-Chowdhury,Chengyu Song
Main category: cs.CV
TL;DR: 本文提出一种利用视觉语言模型(VLM)进行实验室安全自主监控的新方法,通过构建文本到图像-场景图的合成数据生成流程,并设计场景图引导的对齐策略,显著提升VLM在纯视觉场景下的安全隐患检测能力。
Details
Motivation: Minor unsafe actions in laboratories can lead to severe accidents, yet continuous human safety monitoring is limited; existing VLMs lack realistic visual evaluation data (incidents are mostly recorded as unstructured text), so their practical effectiveness is hard to verify.
Method: 1) Builds a synthetic data pipeline that uses large language models (as scene graph architects) and image generation models (as renderers) to produce (image, scene graph, ground truth) triples; 2) proposes scene-graph-guided alignment, a post-training context-engineering method that maps visual inputs to structured scene graphs better suited to VLM reasoning.
Result: Experiments on a synthetic dataset of 1,207 samples across 362 scenarios show that VLMs perform well when given textual scene graphs but degrade substantially with visual-only input; the proposed method bridges this perceptual gap and improves hazard detection in visual-only settings.
Conclusion: Structured scene graphs are the key bridge for improving the visual understanding of VLMs in laboratory safety monitoring, and the proposed synthetic data generation and scene-graph-guided alignment offer a feasible path for applying VLMs in safety domains that lack real visual annotations.
Abstract: Laboratories are prone to severe injuries from minor unsafe actions, yet continuous safety monitoring -- beyond mandatory pre-lab safety training -- is limited by human availability. Vision language models (VLMs) offer promise for autonomous laboratory safety monitoring, but their effectiveness in realistic settings is unclear due to the lack of visual evaluation data, as most safety incidents are documented primarily as unstructured text. To address this gap, we first introduce a structured data generation pipeline that converts textual laboratory scenarios into aligned triples of (image, scene graph, ground truth), using large language models as scene graph architects and image generation models as renderers. Our experiments on the synthetic dataset of 1,207 samples across 362 unique scenarios and seven open- and closed-source models show that VLMs perform effectively given a textual scene graph, but degrade substantially in visual-only settings, indicating difficulty in extracting structured object relationships directly from pixels. To overcome this, we propose a post-training context-engineering approach, scene-graph-guided alignment, to bridge perceptual gaps in VLMs by translating visual inputs into structured scene graphs better aligned with VLM reasoning, improving hazard detection performance in visual-only settings.
[238] Text is All You Need for Vision-Language Model Jailbreaking
Yihang Chen,Zhao Xu,Youyuan Jiang,Tianle Zheng,Cho-Jui Hsieh
Main category: cs.CV
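TL;DR: This paper proposes Text-DJ, an attack that decomposes a harmful text query into several benign sub-queries and embeds them in an image grid alongside distractor images, exploiting the OCR capability of LVLMs to bypass safety safeguards.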
Details
Motivation: Existing LVLM safety mechanisms mainly target explicit text or relevant visual inputs and overlook the new adversarial vulnerability opened by the OCR pathway.
Method: Text-DJ is a three-stage attack: (1) decompose the harmful query into semantically related but benign sub-queries; (2) add a large number of irrelevant distraction queries; (3) feed all queries to the model simultaneously as a grid of images, with the sub-queries placed in the middle.
Result: The attack bypasses the safety alignment of SOTA LVLMs, showing that the OCR pathway lacks robustness to dispersed multi-image input.
Conclusion: The OCR module of LVLMs lacks defenses against fragmented multimodal adversarial inputs, and targeted protection mechanisms are urgently needed.
Abstract: Large Vision-Language Models (LVLMs) are increasingly equipped with robust safety safeguards to prevent responses to harmful or disallowed prompts. However, these defenses often focus on analyzing explicit textual inputs or relevant visual scenes. In this work, we introduce Text-DJ, a novel jailbreak attack that bypasses these safeguards by exploiting the model's Optical Character Recognition (OCR) capability. Our methodology consists of three stages. First, we decompose a single harmful query into multiple semantically related but more benign sub-queries. Second, we pick a set of distraction queries that are maximally irrelevant to the harmful query. Third, we present all decomposed sub-queries and distraction queries to the LVLM simultaneously as a grid of images, with the sub-queries positioned in the middle of the grid. We demonstrate that this method successfully circumvents the safety alignment of state-of-the-art LVLMs. We argue this attack succeeds by (1) converting text-based prompts into images, bypassing standard text-based filters, and (2) inducing distractions, where the model's safety protocols fail to link the scattered sub-queries within a high number of irrelevant queries. Overall, our findings expose a critical vulnerability in LVLMs' OCR capabilities, which are not robust to dispersed, multi-image adversarial inputs, highlighting the need for defenses for fragmented multimodal inputs.
[239] DISK: Dynamic Inference SKipping for World Models
Anugunj Naman,Gaibo Zhang,Ayushman Singh,Yaguang Zhang
Main category: cs.CV
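TL;DR: DISK is a training-free adaptive inference method for autoregressive world models. Dual-branch controllers coordinate two coupled diffusion transformers, preserving motion-appearance consistency without retraining and delivering substantial speedups in long-horizon prediction.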
Details
Motivation: Address the high compute cost and poor stability of autoregressive world models in long-horizon joint video and trajectory prediction, while avoiding retraining.
Method: DISK uses dual-branch controllers to coordinate two coupled diffusion transformers for video and ego-trajectory, introduces cross-modal skip decisions and higher-order latent-difference skip testing, and propagates controller statistics through the autoregressive chain-of-forward to improve long-horizon stability.
Result: In closed-loop driving rollouts on the NuPlan and NuScenes datasets, DISK achieves 2x speedup on trajectory diffusion and 1.6x speedup on video diffusion on an NVIDIA L40S GPU while maintaining L2 planning error, visual quality (FID/FVD), and NAVSIM PDMS scores.
Conclusion: DISK delivers training-free, efficient, stable, high-quality long-horizon joint video and trajectory prediction with practical deployment value.
Abstract: We present DISK, a training-free adaptive inference method for autoregressive world models. DISK coordinates two coupled diffusion transformers for video and ego-trajectory via dual-branch controllers with cross-modal skip decisions, preserving motion-appearance consistency without retraining. We extend higher-order latent-difference skip testing to the autoregressive chain-of-forward regime and propagate controller statistics through rollout loops for long-horizon stability. When integrated into closed-loop driving rollouts on 1500 NuPlan and NuScenes samples using an NVIDIA L40S GPU, DISK achieves 2x speedup on trajectory diffusion and 1.6x speedup on video diffusion while maintaining L2 planning error, visual quality (FID/FVD), and NAVSIM PDMS scores, demonstrating practical long-horizon video-and-trajectory prediction at substantially reduced cost.
[240] Model Optimization for Multi-Camera 3D Detection and Tracking
Ethan Anderson,Justin Silva,Kyle Zheng,Sameer Pusegaonkar,Yizhou Wang,Zheng Tang,Sujit Biswas
Main category: cs.CV
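TL;DR: This paper evaluates the Sparse4D framework on indoor multi-camera perception, focusing on its robustness under low frame rates, quantization, cross-dataset transfer, and mixed-precision training, and proposes the AvgTrackDur metric to measure identity stability.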
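The entry describes AvgTrackDur only as identity persistence in seconds. The following is a minimal sketch of one plausible reading, assuming each track ID's duration is the number of frames in which it appears divided by a constant frame rate, and that the metric is the mean over IDs. The function name and input layout are hypothetical and are not taken from the paper.

```python
from collections import Counter


def avg_track_duration(frame_ids, fps):
    """Mean per-identity duration in seconds (assumed definition).

    frame_ids: list of lists; frame_ids[t] holds the track IDs present in frame t.
    fps: frames per second of the evaluated sequence, assumed constant.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Count in how many frames each identity appears (duplicates in a frame count once).
    counts = Counter(tid for ids in frame_ids for tid in set(ids))
    if not counts:
        return 0.0
    return sum(n / fps for n in counts.values()) / len(counts)


frames = [[1, 2], [1, 2], [1], [1, 3]]
print(avg_track_duration(frames, fps=2.0))
```

Under this reading, a tracker that fragments one person into many short-lived IDs scores lower than one that keeps a single ID alive, even if per-frame detections are identical.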
Details
Motivation: Address unstable multi-target tracking caused by occlusion and heterogeneous viewpoints in indoor multi-camera environments, with particular attention to identity stability.
Method: Sparse4D, a query-based spatiotemporal 3D detection and tracking framework, fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. The experiments cover reduced frame rates, INT8/FP8 post-training quantization, transfer to WILDTRACK, and Transformer Engine mixed-precision fine-tuning, and introduce the AvgTrackDur metric.
Result: Sparse4D stays stable under moderate frame-rate reductions, but identity association collapses below 2 FPS; selective quantization of the backbone and neck works best; attention modules are sensitive to low precision; low-FPS pretraining yields large zero-shot gains on WILDTRACK; Transformer Engine reduces latency and improves camera scalability but may harm identity propagation stability.
Conclusion: Identity stability is a key bottleneck in multi-camera tracking and should be protected first during model compression and acceleration; AvgTrackDur is a metric closer to real deployment needs; acceleration techniques such as mixed precision need accompanying stability validation.
Abstract: Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D remains stable under moderate FPS reductions, but below 2 FPS, identity association collapses even when detections are stable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, while attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability, but can destabilize identity propagation, motivating stability-aware validation.
[241] LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
Benno Krojer,Shravan Nayak,Oscar Mañas,Vaibhav Adlakha,Desmond Elliott,Siva Reddy,Marius Mosbach
Main category: cs.CV
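TL;DR: This paper proposes LatentLens, which matches visual tokens to their nearest neighbors among contextualized token representations from a large text corpus, enabling interpretability analysis of visual tokens in multimodal models. It finds that visual tokens are highly interpretable at every layer and that LatentLens outperforms conventional methods such as LogitLens.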
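The retrieval step summarised here can be sketched as a top-k nearest-neighbor lookup. The sketch assumes cosine similarity as the comparison and uses tiny hand-made vectors; `top_k_descriptions`, the bank layout, and the bracketed-token convention are hypothetical illustrations, not the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def top_k_descriptions(visual_vec, text_bank, k=3):
    """Return the k stored contexts whose token vectors are closest to visual_vec.

    text_bank: list of (context string, vector) pairs precomputed from a text corpus.
    """
    ranked = sorted(text_bank, key=lambda item: cosine(visual_vec, item[1]), reverse=True)
    return [context for context, _ in ranked[:k]]


bank = [
    ("a red [apple] on the table", [0.9, 0.1, 0.0]),
    ("the [sky] was clear", [0.0, 0.2, 0.9]),
    ("ripe [fruit] in a bowl", [0.8, 0.3, 0.1]),
]
print(top_k_descriptions([1.0, 0.2, 0.0], bank, k=2))
```

Returning whole contexts rather than single vocabulary items is what lets the description be finer-grained than an individual token.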
Details
Motivation: Understanding why large language models (LLMs) so readily process visual tokens requires interpretability methods that reveal what visual tokens encode at each layer.
Method: LatentLens encodes a large text corpus and stores the contextualized representation of each token; visual token representations are compared against these, and the top-k nearest textual representations serve as natural-language descriptions.
Result: Validated on 10 VLMs, LatentLens substantially outperforms methods such as LogitLens; most visual tokens are interpretable across all models and all layers; the generated descriptions are semantically clear and fine-grained.
Conclusion: Vision and language representations are deeply aligned; LatentLens provides a new tool and new evidence for analyzing the latent space of multimodal models.
Abstract: Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens works by encoding a large text corpus and storing contextualized token representations for each token in that corpus. Visual token representations are then compared to their contextualized textual representations, with the top-k nearest neighbor representations providing descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations, opening up new directions for analyzing latent representations.
[242] PSGS: Text-driven Panorama Sliding Scene Generation via Gaussian Splatting
Xin Zhang,Shen Chen,Jiale Zhou,Lei Li
Main category: cs.CV
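TL;DR: This paper proposes PSGS, a two-stage framework (semantically coherent panorama generation followed by globally consistent 3D Gaussian Splatting initialization) for high-quality text-to-3D scene generation, substantially improving detail fidelity and multi-view consistency.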
Details
Motivation: Existing text-driven 3D scene generation methods are limited by scarce paired 3D-text data and inconsistent multi-view stitching, producing overly simplistic scenes.
Method: PSGS is a two-stage framework. Stage one uses a two-layer optimization architecture to generate semantically coherent panoramas: a layout reasoning layer parses spatial relationships, and a self-optimization layer refines visual details via iterative MLLM feedback. Stage two uses a panorama sliding mechanism that samples overlapping perspectives to initialize globally consistent 3D Gaussian Splatting point clouds, with depth and semantic coherence losses added during training.
Result: Experiments show that PSGS outperforms existing methods in both panorama generation quality and 3D scene rendering, producing more realistic and appealing scenes.
Conclusion: PSGS offers a robust, efficient text-to-3D solution for scalable immersive content creation.
Abstract: Generating realistic 3D scenes from text is crucial for immersive applications like VR, AR, and gaming. While text-driven approaches promise efficiency, existing methods suffer from limited 3D-text data and inconsistent multi-view stitching, resulting in overly simplistic scenes. To address this, we propose PSGS, a two-stage framework for high-fidelity panoramic scene generation. First, a novel two-layer optimization architecture generates semantically coherent panoramas: a layout reasoning layer parses text into structured spatial relationships, while a self-optimization layer refines visual details via iterative MLLM feedback. Second, our panorama sliding mechanism initializes globally consistent 3D Gaussian Splatting point clouds by strategically sampling overlapping perspectives. By incorporating depth and semantic coherence losses during training, we greatly improve the quality and detail fidelity of rendered scenes. Our experiments demonstrate that PSGS outperforms existing methods in panorama generation and produces more appealing 3D scenes, offering a robust solution for scalable immersive content creation.
[243] ZS-TreeSeg: A Zero-Shot Framework for Tree Crown Instance Segmentation
Pengyu Chen,Fangzheng Lyu,Sicheng Wang,Cuizhen Wang
Main category: cs.CV
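TL;DR: This paper proposes ZS-TreeSeg, a zero-shot framework that combines ideas from canopy semantic segmentation and cell instance segmentation. It uses Cellpose-SAM to model tree crowns as star-convex objects and separates overlapping crowns mathematically via a topological flow field, achieving robust, training-free tree crown instance segmentation across sensors and canopy densities.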
Details
Motivation: Existing supervised deep learning methods have high annotation costs and generalize poorly, while foundation models such as SAM lack forestry domain knowledge and tend to under-segment dense, overlapping crowns.
Method: ZS-TreeSeg is a zero-shot framework that models tree crowns as star-convex objects, builds a topological flow field with Cellpose-SAM, and separates touching crowns mathematically through vector convergence.
Result: Validation on the NEON and BAMFOREST datasets shows that the method generalizes robustly across sensor types and canopy densities, providing training-free tree crown instance segmentation and label generation.
Conclusion: ZS-TreeSeg bridges the gap between general-purpose foundation models and forestry needs, offering a training-free, generalizable zero-shot solution for remote-sensing forest monitoring.
Abstract: Individual tree crown segmentation is an important task in remote sensing for forest biomass estimation and ecological monitoring. However, accurate delineation in dense, overlapping canopies remains a bottleneck. While supervised deep learning methods suffer from high annotation costs and limited generalization, emerging foundation models (e.g., Segment Anything Model) often lack domain knowledge, leading to under-segmentation in dense clusters. To bridge this gap, we propose ZS-TreeSeg, a zero-shot framework that adapts two mature tasks: 1) canopy semantic segmentation; and 2) cell instance segmentation. By modeling tree crowns as star-convex objects within a topological flow field using Cellpose-SAM, the ZS-TreeSeg framework forces the mathematical separation of touching tree crown instances based on vector convergence. Experiments on the NEON and BAMFOREST datasets and visual inspection demonstrate that our framework generalizes robustly across diverse sensor types and canopy densities, offering a training-free solution for tree crown instance segmentation and label generation.
[244] GTATrack: Winner Solution to SoccerTrack 2025 with Deep-EIoU and Global Tracklet Association
Rong-Lin Jian,Ming-Chi Luo,Chen-Wei Huang,Chia-Ming Lee,Yu-Fan Lin,Chih-Chung Hsu
Main category: cs.CV
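TL;DR: GTATrack is a hierarchical framework for multi-object tracking of soccer players under fisheye cameras. Its two core modules, Deep-EIoU and GTA, provide short-term matching and long-term identity consistency, and a pseudo-labeling strategy raises detection recall on small targets. It won first place in the SoccerTrack Challenge 2025.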
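The entry names Expansion IoU without defining it. As a rough sketch of the general idea, the code below enlarges both boxes about their centers before computing ordinary IoU, so that boxes of fast, irregularly moving players that no longer overlap can still be matched without a motion model. The expansion rule (each side grows by `e` times the box width or height) and the value `e=0.5` are assumptions for illustration, not values taken from the entry.

```python
def expand(box, e):
    """Grow an (x1, y1, x2, y2) box by e * width and e * height on each side."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (x1 - e * w, y1 - e * h, x2 + e * w, y2 + e * h)


def iou(a, b):
    """Standard intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def expansion_iou(a, b, e=0.5):
    """IoU computed on the expanded versions of both boxes."""
    return iou(expand(a, e), expand(b, e))


prev_box = (0.0, 0.0, 10.0, 10.0)
curr_box = (12.0, 0.0, 22.0, 10.0)  # same player after a large jump; no raw overlap
print(iou(prev_box, curr_box), expansion_iou(prev_box, curr_box, e=0.5))
```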
Details
Motivation: Multi-object tracking in sports faces irregular player motion, similar appearances, and frequent occlusions, and the geometric distortion and scale variation introduced by static fisheye cameras make these difficulties worse.
Method: GTATrack is a hierarchical tracking framework comprising: (1) Deep Expansion IoU (Deep-EIoU) for motion-agnostic online association; (2) Global Tracklet Association (GTA) for trajectory-level refinement; (3) a pseudo-labeling strategy that raises detection recall on small and distorted targets.
Result: First place in the SoccerTrack Challenge 2025, with a HOTA score of 0.60 and false positives reduced to 982.
Conclusion: By combining local association with global reasoning, GTATrack mitigates identity switches, occlusions, and track fragmentation, and is the current state of the art in fisheye soccer tracking.
Abstract: Multi-object tracking (MOT) in sports is highly challenging due to irregular player motion, uniform appearances, and frequent occlusions. These difficulties are further exacerbated by the geometric distortion and extreme scale variation introduced by static fisheye cameras. In this work, we present GTATrack, a hierarchical tracking framework that won first place in the SoccerTrack Challenge 2025. GTATrack integrates two core components: Deep Expansion IoU (Deep-EIoU) for motion-agnostic online association and Global Tracklet Association (GTA) for trajectory-level refinement. This two-stage design enables both robust short-term matching and long-term identity consistency. Additionally, a pseudo-labeling strategy is used to boost detector recall on small and distorted targets. The synergy between local association and global reasoning effectively addresses identity switches, occlusions, and tracking fragmentation. Our method achieved a winning HOTA score of 0.60 and significantly reduced false positives to 982, demonstrating state-of-the-art accuracy in fisheye-based soccer tracking. Our code is available at https://github.com/ron941/GTATrack-STC2025.
[245] Refining Strokes by Learning Offset Attributes between Strokes for Flexible Sketch Edit at Stroke-Level
Sicong Zang,Tao Sun,Cairong Yan
Main category: cs.CV
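TL;DR: This paper proposes SketchMod, which learns three offset attributes (scale, orientation, and position) from the source stroke to the target sketch and transforms the source stroke accordingly, achieving stroke-level sketch editing that is semantically consistent and visually faithful.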
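Once the three offsets are known, applying them to a stroke is a plain similarity transform. The sketch below assumes the offsets are already given (in SketchMod they are learned) and that scaling and rotation happen about the stroke's centroid; the function name and parameter conventions are hypothetical.

```python
import math


def transform_stroke(points, scale=1.0, angle=0.0, offset=(0.0, 0.0)):
    """Scale and rotate a stroke about its centroid, then translate it.

    points: non-empty list of (x, y) tuples describing the stroke.
    scale:  multiplicative size offset.
    angle:  rotation offset in radians, counter-clockwise.
    offset: (dx, dy) positional offset applied last.
    """
    if not points:
        raise ValueError("stroke must contain at least one point")
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for x, y in points:
        dx, dy = (x - cx) * scale, (y - cy) * scale   # 1) resize
        rx = dx * cos_a - dy * sin_a                  # 2) rotate
        ry = dx * sin_a + dy * cos_a
        out.append((cx + rx + offset[0], cy + ry + offset[1]))  # 3) displace
    return out


stroke = [(0.0, 0.0), (2.0, 0.0)]
print(transform_stroke(stroke, scale=2.0, angle=math.pi / 2, offset=(1.0, 1.0)))
```

Transforming about the centroid keeps the three attributes independent: changing scale or orientation does not move the stroke's center, so position can be set separately.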
Details
Motivation: Existing methods edit sketches only by relocating source strokes, which cannot handle large differences in stroke size and orientation and leads to semantic incoherence or visual distortion.
Method: SketchMod learns three offset attributes of the source stroke relative to the target sketch (scale, orientation, and position), applies scaling, rotation, and displacement to the source stroke, and uses the captured stroke attributes to control the edit precisely.
Result: Experiments show that SketchMod achieves high precision and flexibility in stroke-level sketch editing.
Conclusion: By transforming source strokes according to the target sketch's patterns, SketchMod improves the semantic consistency and visual fidelity of stroke-level sketch editing.
Abstract: Stroke-level sketch editing aims to transplant source strokes onto a target sketch via stroke expansion or replacement, while preserving semantic consistency and visual fidelity with the target sketch. Recent studies addressed it by relocating source strokes at appropriate canvas positions. However, as source strokes can vary significantly in both size and orientation, merely repositioning them without further adjustment may fail to produce plausible editing results. For example, anchoring an oversized source stroke onto the target without proper scaling would fail to produce a semantically coherent outcome. In this paper, we propose SketchMod to refine the source stroke through transformation so as to align it with the target sketch's patterns, thereby realizing flexible stroke-level sketch editing. As the source stroke refinement is governed by the patterns of the target sketch, we learn three key offset attributes (scale, orientation and position) from the source stroke to another stroke, and align it with the target by: 1) resizing to match spatial proportions by scale, 2) rotating to align with local geometry by orientation, and 3) displacing to fit the semantic layout by position. Besides, a stroke's profile can be precisely controlled during sketch editing via the exposed captured stroke attributes. Experimental results indicate that SketchMod achieves precise and flexible stroke-level sketch editing.
[246] HSSDCT: Factorized Spatial-Spectral Correlation for Hyperspectral Image Fusion
Chia-Ming Lee,Yu-Hao Ho,Yu-Fan Lin,Jen-Wei Lee,Li-Wei Kang,Chih-Chung Hsu
Main category: cs.CV
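TL;DR: This paper proposes HSSDCT, a network that uses hierarchical dense-residue Transformer blocks and a spatial-spectral correlation layer to address limited receptive fields, spectral redundancy, and the high computational complexity of self-attention in hyperspectral image fusion, substantially reducing compute cost while maintaining high performance.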
Details
Motivation: Existing deep learning methods for hyperspectral image fusion suffer from limited receptive fields, redundant spectral bands, and the quadratic complexity of self-attention, which hurt efficiency and robustness.
Method: The HSSDCT framework has two core modules: (i) a Hierarchical Dense-Residue Transformer Block (HDRTB), which aggregates multi-scale features through progressively enlarged windows and dense-residue connections; (ii) a Spatial-Spectral Correlation Layer (SSCL), which explicitly decouples spatial and spectral dependencies, reducing self-attention to linear complexity and mitigating spectral redundancy.
Result: Experiments on several benchmark datasets show that HSSDCT surpasses existing methods in reconstruction quality at substantially lower computational cost, reaching new SOTA performance.
Conclusion: HSSDCT is an efficient and robust hyperspectral image fusion method that balances performance and efficiency and offers a new approach to the task.
Abstract: Hyperspectral image (HSI) fusion aims to reconstruct a high-resolution HSI (HR-HSI) by combining the rich spectral information of a low-resolution HSI (LR-HSI) with the fine spatial details of a high-resolution multispectral image (HR-MSI). Although recent deep learning methods have achieved notable progress, they still suffer from limited receptive fields, redundant spectral bands, and the quadratic complexity of self-attention, which restrict both efficiency and robustness. To overcome these challenges, we propose the Hierarchical Spatial-Spectral Dense Correlation Network (HSSDCT). The framework introduces two key modules: (i) a Hierarchical Dense-Residue Transformer Block (HDRTB) that progressively enlarges windows and employs dense-residue connections for multi-scale feature aggregation, and (ii) a Spatial-Spectral Correlation Layer (SSCL) that explicitly factorizes spatial and spectral dependencies, reducing self-attention to linear complexity while mitigating spectral redundancy. Extensive experiments on benchmark datasets demonstrate that HSSDCT delivers superior reconstruction quality with significantly lower computational costs, achieving new state-of-the-art performance in HSI fusion. Our code is available at https://github.com/jemmyleee/HSSDCT.
[247] RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding
Jiahe Wu,Bing Cao,Qilong Wang,Qinghua Hu,Dongdong Li,Pengfei Zhu
Main category: cs.CV
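TL;DR: This paper proposes the RGBX-R1 framework, which constructs VM-CoT via the UAV prompting strategy and adopts a two-stage CS-SFT and ST-RFT training paradigm to improve MLLM perception and reasoning on X modalities such as infrared, depth, and event data. It also builds the first RGBX-Grounding benchmark and outperforms baselines by 22.71% on three tasks.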
Details
Motivation: Existing MLLMs are pre-trained mainly on the RGB modality and struggle with key X modalities such as infrared, depth, and event data, which limits their performance in complex scenarios.
Method: The RGBX-R1 framework includes: (1) an Understand-Associate-Validate (UAV) prompting strategy that constructs the Visual Modality Chain-of-Thought (VM-CoT); (2) two-stage training with Cold-Start Supervised Fine-Tuning (CS-SFT) and Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT), the latter built on GRPO with a Modality-understanding Spatio-Temporal (MuST) reward; (3) the first RGBX-Grounding benchmark.
Result: A 22.71% improvement over baselines on three RGBX grounding tasks, confirming the model's advantage in multimodal understanding and spatial perception.
Conclusion: RGBX-R1 substantially extends MLLM understanding and reasoning to non-RGB visual modalities, providing a new paradigm and a practical benchmark for cross-modal general-purpose agents.
Abstract: Multimodal Large Language Models (MLLMs) are primarily pre-trained on the RGB modality, thereby limiting their performance on other modalities, such as infrared, depth, and event data, which are crucial for complex scenarios. To address this, we propose RGBX-R1, a framework to enhance MLLMs' perception and reasoning capacities across various X visual modalities. Specifically, we employ an Understand-Associate-Validate (UAV) prompting strategy to construct the Visual Modality Chain-of-Thought (VM-CoT), which aims to expand the MLLMs' RGB understanding capability into X modalities. To progressively enhance reasoning capabilities, we introduce a two-stage training paradigm: Cold-Start Supervised Fine-Tuning (CS-SFT) and Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT). CS-SFT supervises the reasoning process with the guidance of VM-CoT, equipping the MLLM with fundamental modality cognition. Building upon GRPO, ST-RFT employs a Modality-understanding Spatio-Temporal (MuST) reward to reinforce modality reasoning. Notably, we construct the first RGBX-Grounding benchmark, and extensive experiments verify our superiority in multimodal understanding and spatial perception, outperforming baselines by 22.71% on three RGBX grounding tasks.
[248] Sparse Shortcuts: Facilitating Efficient Fusion in Multimodal Large Language Models
Jingrui Zhang,Feng Liang,Yong Zhang,Wei Wang,Runhao Zeng,Xiping Hu
Main category: cs.CV
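TL;DR: This paper proposes SparseCut, a general cross-modal fusion architecture for multimodal large language models (MLLMs). Sparse shortcut connections and a multi-grained feature fusion module integrate multi-level visual features efficiently and hierarchically, improving cross-modal understanding without adding computational overhead.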
Details
Motivation: Existing MLLM work mostly focuses on scaling up language models or building higher-quality training data, and pays little attention to integrating cross-modal knowledge (especially mid- and low-level visual semantics) into the language space; aligning modalities with only high-level visual features discards rich semantic information.
Method: The SparseCut architecture: (1) sparse shortcut connections between the cross-modal encoder and the LLM inject multi-level visual features hierarchically; (2) an efficient multi-grained feature fusion module fuses features before they are routed through the shortcuts, preserving the original language context without increasing input length.
Result: SparseCut substantially improves MLLM performance on multiple multimodal benchmarks, is general (it fits different base LLMs) and scalable, and adds no extra computational burden.
Conclusion: SparseCut offers an efficient, lightweight, hierarchical cross-modal fusion paradigm for MLLMs that addresses the weak alignment between mid- and low-level visual semantics and the language space.
Abstract: With the remarkable success of large language models (LLMs) in natural language understanding and generation, multimodal large language models (MLLMs) have rapidly advanced in their ability to process data across multiple modalities. While most existing efforts focus on scaling up language models or constructing higher-quality training data, limited attention has been paid to effectively integrating cross-modal knowledge into the language space. In vision-language models, for instance, aligning modalities using only high-level visual features often discards the rich semantic information present in mid- and low-level features, limiting the model's capacity for cross-modal understanding. To address this issue, we propose SparseCut, a general cross-modal fusion architecture for MLLMs, introducing sparse shortcut connections between the cross-modal encoder and the LLM. These shortcut connections enable the efficient and hierarchical integration of visual features at multiple levels, facilitating richer semantic fusion without increasing computational overhead. We further introduce an efficient multi-grained feature fusion module, which performs the fusion of visual features before routing them through the shortcuts. This preserves the original language context and does not increase the overall input length, thereby avoiding an increase in computational complexity for the LLM. Experiments demonstrate that SparseCut significantly enhances the performance of MLLMs across various multimodal benchmarks with generality and scalability for different base LLMs.
[249] DuoGen: Towards General Purpose Interleaved Multimodal Generation
Min Shi,Xiaohui Zeng,Jiannan Huang,Yin Cui,Francesco Ferroni,Jialuo Li,Shubham Pachori,Zhaoshuo Li,Yogesh Balaji,Haoxiang Wang,Tsung-Yi Lin,Xiao Fu,Yue Zhao,Chieh-Yun Chen,Ming-Yu Liu,Humphrey Shi
Main category: cs.CV
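TL;DR: DuoGen is a general-purpose interleaved multimodal generation framework. Through high-quality data curation and a decoupled two-stage training architecture (combining a multimodal LLM with a diffusion transformer), it substantially improves text quality, image fidelity, and image-text alignment, reaching the best results among open-source models on multiple benchmarks.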
Details
Motivation: Existing interleaved multimodal generation models are limited by insufficient training data and base model capacity, and struggle to produce high-quality interleaved content (such as step-by-step illustrated guides or visual planning) under general instructions.
Method: The DuoGen framework: (1) builds a large-scale, high-quality instruction-tuning dataset combining multimodal conversations rewritten from websites with synthetic examples covering everyday scenarios; (2) uses decoupled two-stage training, first instruction-tuning a pretrained multimodal LLM (MLLM), then aligning a diffusion transformer (DiT) with it using curated interleaved image-text sequences; (3) reuses an MLLM with strong visual understanding and a video-pretrained DiT, avoiding costly unimodal pretraining and allowing flexible base model selection.
Result: On public and newly proposed benchmarks, DuoGen surpasses existing open-source models in text quality, image fidelity, and image-context alignment, and achieves state-of-the-art text-to-image and image editing performance among unified generation models.
Conclusion: By systematically addressing data, architecture, and evaluation, DuoGen provides an efficient, scalable, high-performing paradigm for general-purpose interleaved multimodal generation.
Abstract: Interleaved multimodal generation enables capabilities beyond unimodal generation models, such as step-by-step instructional guides, visual planning, and generating visual drafts for reasoning. However, the quality of existing interleaved generation models under general instructions remains limited by insufficient training data and base model capacity. We present DuoGen, a general-purpose interleaved generation framework that systematically addresses data curation, architecture design, and evaluation. On the data side, we build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites and diverse synthetic examples covering everyday scenarios. Architecturally, DuoGen leverages the strong visual understanding of a pretrained multimodal LLM and the visual generation capabilities of a diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining and enabling flexible base model selection. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences. Across public and newly proposed benchmarks, DuoGen outperforms prior open-source models in text quality, image fidelity, and image-context alignment, and also achieves state-of-the-art performance on text-to-image and image editing among unified generation models. Data and code will be released at https://research.nvidia.com/labs/dir/duetgen/.
[250] SPARK: Stochastic Propagation via Affinity-guided Random walK for training-free unsupervised segmentation
Kunal Mahatha,Jose Dolz,Christian Desrosiers
Main category: cs.CV
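TL;DR: This paper proposes a new training-free image segmentation method based on stochastic flow equilibrium. By combining global diffusion attention with local neighborhood structure, it overcomes shortcomings of conventional spectral clustering methods: a preset number of clusters, blurred boundaries, and sensitivity to noise.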
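To make the random-walk view concrete, here is a generic label-propagation sketch on a small affinity graph. It is not the SPARK algorithm: the fixed pruning threshold `tau`, the restart weight `alpha`, and the use of seed labels are all simplifying assumptions introduced for illustration, whereas the paper describes an adaptive pruning strategy in an unsupervised setting.

```python
def prune_and_normalize(W, tau):
    """Zero out affinities below tau (keeping self-loops), then row-normalize."""
    P = []
    for i, row in enumerate(W):
        kept = [w if (w >= tau or j == i) else 0.0 for j, w in enumerate(row)]
        total = sum(kept)
        if total > 0:
            P.append([w / total for w in kept])
        else:
            P.append([1.0 if j == i else 0.0 for j in range(len(row))])
    return P


def propagate(W, seeds, n_labels, tau=0.2, alpha=0.9, steps=50):
    """Random walk with restart: Y <- alpha * P @ Y + (1 - alpha) * Y0.

    W: square affinity matrix (list of lists); seeds: {node index: label}.
    Returns the highest-scoring label for every node.
    """
    P = prune_and_normalize(W, tau)
    n = len(W)
    Y0 = [[1.0 if seeds.get(i) == c else 0.0 for c in range(n_labels)] for i in range(n)]
    Y = [row[:] for row in Y0]
    for _ in range(steps):
        Y = [[alpha * sum(P[i][j] * Y[j][c] for j in range(n)) + (1 - alpha) * Y0[i][c]
              for c in range(n_labels)] for i in range(n)]
    return [max(range(n_labels), key=lambda c: Y[i][c]) for i in range(n)]


W = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
]
print(propagate(W, seeds={0: 0, 3: 1}, n_labels=2))
```

Pruning the weak 0.1 links before normalizing stops label mass from leaking across the two groups, which is the intuition behind suppressing unreliable transitions.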
Details
Motivation: Existing training-free segmentation methods implicitly treat segmentation as spectral graph partitioning over diffusion-derived affinities, which has fundamental drawbacks: the number of clusters must be preset, boundaries are oversmoothed, results are sensitive to noisy and multi-modal affinity distributions, and local neighborhood structure is ignored.
Method: Training-free segmentation is recast as a stochastic flow equilibrium problem over diffusion-induced affinity graphs; a Markov propagation scheme performs random-walk-based label diffusion, with an adaptive pruning strategy that suppresses unreliable transitions and reinforces confident affinity paths.
Result: State-of-the-art zero-shot performance on seven widely used semantic segmentation benchmarks, with sharper boundaries, more coherent regions, and more stable masks.
Conclusion: By fusing global diffusion with local neighborhood modeling, the method builds a sparse yet expressive affinity structure, overcomes the inherent limits of the spectral clustering paradigm, and offers a new paradigm for training-free segmentation.
Abstract: We argue that existing training-free segmentation methods rely on an implicit and limiting assumption: that segmentation is a spectral graph partitioning problem over diffusion-derived affinities. Such approaches, based on global graph partitioning and eigenvector-based formulations of affinity matrices, suffer from several fundamental drawbacks: they require pre-selecting the number of clusters, induce boundary oversmoothing due to spectral relaxation, and remain highly sensitive to noisy or multi-modal affinity distributions. Moreover, many prior works neglect the importance of local neighborhood structure, which plays a crucial role in stabilizing affinity propagation and preserving fine-grained contours. To address these limitations, we reformulate training-free segmentation as a stochastic flow equilibrium problem over diffusion-induced affinity graphs, where segmentation emerges from a stochastic propagation process that integrates global diffusion attention with local neighborhoods extracted from stable diffusion, yielding a sparse yet expressive affinity structure. Building on this formulation, we introduce a Markov propagation scheme that performs random-walk-based label diffusion with an adaptive pruning strategy that suppresses unreliable transitions while reinforcing confident affinity paths. Experiments across seven widely used semantic segmentation benchmarks demonstrate that our method achieves state-of-the-art zero-shot performance, producing sharper boundaries, more coherent regions, and significantly more stable masks compared to prior spectral-clustering-based approaches.
[251] MRAD: Zero-Shot Anomaly Detection with Memory-Driven Retrieval
Chaoran Xu,Chengkan Lv,Qiyu Chen,Feng Zhang,Zhengtao Zhang
Main category: cs.CV
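TL;DR: This paper proposes MRAD, a new zero-shot anomaly detection method that builds a two-level memory bank (image-level and pixel-level) and performs similarity retrieval with no training or only lightweight fine-tuning, replacing conventional parametric fitting. It achieves strong classification and segmentation performance on 16 industrial and medical datasets.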
Details
Motivation: Existing zero-shot anomaly detection methods rely on prompt learning or complex modeling, which leads to high training or inference cost and poor cross-domain stability.
Method: The MRAD framework: the base version, MRAD-TF, is training-free, freezing the CLIP image encoder and building a two-level memory bank (image-level and pixel-level); MRAD-FT fine-tunes the retrieval metric; MRAD-CLIP injects region priors into CLIP text prompts as dynamic biases.
Result: On 16 industrial and medical datasets, MRAD substantially outperforms existing methods in both anomaly classification and segmentation, with high performance in both the training-free and lightly trained settings.
Conclusion: Directly exploiting the empirical distribution of raw data through memory retrieval is more effective than relying on parametric model fitting alone, offering an efficient, stable, and general paradigm for zero-shot anomaly detection.
Abstract: Zero-shot anomaly detection (ZSAD) often leverages pretrained vision or vision-language models, but many existing methods use prompt learning or complex modeling to fit the data distribution, resulting in high training or inference cost and limited cross-domain stability. To address these limitations, we propose the Memory-Retrieval Anomaly Detection method (MRAD), a unified framework that replaces parametric fitting with direct memory retrieval. The train-free base model, MRAD-TF, freezes the CLIP image encoder and constructs a two-level memory bank (image-level and pixel-level) from auxiliary data, where feature-label pairs are explicitly stored as keys and values. During inference, anomaly scores are obtained directly by similarity retrieval over the memory bank. Building on MRAD-TF, we further propose two lightweight variants as enhancements: (i) MRAD-FT fine-tunes the retrieval metric with two linear layers to enhance the discriminability between normal and anomalous samples; (ii) MRAD-CLIP injects the normal and anomalous region priors from MRAD-FT as dynamic biases into CLIP's learnable text prompts, strengthening generalization to unseen categories. Across 16 industrial and medical datasets, the MRAD framework consistently demonstrates superior performance in anomaly classification and segmentation, under both train-free and training-based settings. Our work shows that fully leveraging the empirical distribution of raw data, rather than relying only on model fitting, can achieve stronger anomaly detection performance. The code will be publicly released at https://github.com/CROVO1026/MRAD.
[252] SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding
Yujia Tong,Tian Zhang,Yunyang Wan,Kaiwei Lin,Jingling Yuan,Chuang Hu
Main category: cs.CV
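TL;DR: This paper proposes SAGE, a framework that dynamically adjusts the speculation tree structure based on real-time prediction uncertainty to make speculative decoding for vision-language models (VLMs) more efficient, achieving up to 3.36x speedup without loss of output quality.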
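The entropy-to-tree-shape rule can be illustrated with a small sketch: compute the Shannon entropy of the next-token distribution and pick a deeper, narrower tree when it is low and a shallower, wider one when it is high. The thresholds (`low`, `high`) and the three depth/width presets are hypothetical placeholders, not values reported for SAGE.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_tree_shape(probs, low=0.5, high=1.5):
    """Map prediction uncertainty to a speculation-tree shape (illustrative presets)."""
    h = entropy(probs)
    if h < low:       # confident: speculate far ahead along few branches
        return {"depth": 8, "width": 1}
    if h < high:      # moderately uncertain: balance depth and breadth
        return {"depth": 5, "width": 2}
    return {"depth": 3, "width": 4}  # uncertain: explore more alternatives per step


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [1 / 8] * 8
print(choose_tree_shape(confident), choose_tree_shape(uncertain))
```

Because the entry notes that entropy is strongly correlated across decoding steps, a real system could reuse the previous step's entropy to size the next tree; this sketch leaves that scheduling out.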
Details
Motivation: Existing speculative decoding methods use a fixed tree structure that cannot adapt to varying prediction difficulty across generation steps, leading to suboptimal acceptance lengths and limited speedup.
Method: SAGE uses output entropy as a temporally correlated confidence indicator and dynamically builds deeper, narrower speculation trees (high confidence) or shallower, wider ones (low confidence).
Result: Effective on multiple benchmarks: 3.36x speedup for LLaVA-OneVision-72B and 3.18x for Qwen2.5-VL-72B, with no loss in output quality.
Conclusion: Dynamic tree structures fit the changing uncertainty of VLM inference better than static trees, substantially improving the efficiency and practicality of speculative decoding.
Abstract: Speculative decoding has emerged as a promising approach to accelerate inference in vision-language models (VLMs) by enabling parallel verification of multiple draft tokens. However, existing methods rely on static tree structures that remain fixed throughout the decoding process, failing to adapt to the varying prediction difficulty across generation steps. This leads to suboptimal acceptance lengths and limited speedup. In this paper, we propose SAGE, a novel framework that dynamically adjusts the speculation tree structure based on real-time prediction uncertainty. Our key insight is that output entropy serves as a natural confidence indicator with strong temporal correlation across decoding steps. SAGE constructs deeper-narrower trees for high-confidence predictions to maximize speculation depth, and shallower-wider trees for uncertain predictions to diversify exploration. SAGE improves acceptance lengths and achieves greater acceleration than static tree baselines. Experiments on multiple benchmarks demonstrate the effectiveness of SAGE: without any loss in output quality, it delivers up to $3.36\times$ decoding speedup for LLaVA-OneVision-72B and $3.18\times$ for Qwen2.5-VL-72B.
[253] Enhancing Open-Vocabulary Object Detection through Multi-Level Fine-Grained Visual-Language Alignment
Tianyi Zhang,Antoine Simoulin,Kai Li,Sana Lakdawala,Shiqing Yu,Arpit Mittal,Hongyu Fu,Yu Lin
Main category: cs.CV
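TL;DR: This paper proposes the VLDet framework, which improves open-vocabulary object detection (OVD) by revamping the feature pyramid and introducing the SigRPN module. It substantially outperforms existing methods on COCO2017 and LVIS and shows strong zero-shot closed-set detection.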
Details
Motivation: Conventional object detection is limited to predefined categories. Open-vocabulary object detection (OVD) is promising, but existing methods still struggle to adapt CLIP's single-scale visual backbone to detection frameworks and to ensure fine-grained visual-language alignment.
Method: The VLDet framework: (1) the VL-PUB module redesigns the feature pyramid for fine-grained visual-language alignment and transfers CLIP knowledge; (2) the SigRPN module introduces a sigmoid-based anchor-text contrastive alignment loss to strengthen detection of novel categories.
Result: 58.7 AP on COCO2017 novel classes and 24.8 AP on LVIS, improvements of 27.6% and 6.9% over SOTA respectively, along with superior zero-shot closed-set detection.
Conclusion: Through structured visual-language alignment, VLDet improves open-vocabulary object detection, confirming the key role of a revamped feature pyramid and a contrastive alignment loss in OVD.
Abstract: Traditional object detection systems are typically constrained to predefined categories, limiting their applicability in dynamic environments. In contrast, open-vocabulary object detection (OVD) enables the identification of objects from novel classes not present in the training set. Recent advances in visual-language modeling have led to significant progress in OVD. However, prior works face challenges in either adapting the single-scale image backbone from CLIP to the detection framework or ensuring robust visual-language alignment. We propose Visual-Language Detection (VLDet), a novel framework that revamps the feature pyramid for fine-grained visual-language alignment, leading to improved OVD performance. With the VL-PUB module, VLDet effectively exploits the visual-language knowledge from CLIP and adapts the backbone for object detection through the feature pyramid. In addition, we introduce the SigRPN block, which incorporates a sigmoid-based anchor-text contrastive alignment loss to improve detection of novel categories. Through extensive experiments, our approach achieves 58.7 AP for novel classes on COCO2017 and 24.8 AP on LVIS, surpassing all state-of-the-art methods and achieving significant improvements of 27.6% and 6.9%, respectively. Furthermore, VLDet also demonstrates superior zero-shot performance on closed-set object detection.
[254] SADER: Structure-Aware Diffusion Framework with DEterministic Resampling for Multi-Temporal Remote Sensing Cloud Removal
Yifan Zhang,Qian Chen,Yi Liu,Wengen Li,Jihong Guan
Main category: cs.CV
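TL;DR: This paper proposes SADER, a structure-aware diffusion framework for cloud removal in multi-temporal remote sensing imagery. A multi-temporal conditional diffusion network, a cloud-aware attention loss, and a deterministic resampling strategy substantially improve cloud removal quality and sampling efficiency.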
Details
Motivation: Cloud contamination severely reduces the usability of remote sensing imagery; existing diffusion-based methods sample inefficiently and fail to make full use of multi-temporal structural and temporal priors.
Method: The SADER framework: (1) a scalable Multi-Temporal Conditional Diffusion Network (MTCDN) that fuses temporal and multimodal information; (2) a cloud-aware attention loss that weights cloud regions according to cloud thickness and brightness discrepancies; (3) a deterministic resampling strategy for continuous diffusion models that iteratively corrects outlier samples within a fixed number of steps.
Result: SADER outperforms state-of-the-art methods on multiple multi-temporal datasets, with the best results on all evaluation metrics.
Conclusion: SADER combines structural modeling, cloud-region-adaptive optimization, and an efficient sampling mechanism, offering a more robust and efficient paradigm for multi-temporal remote sensing cloud removal.
Abstract: Cloud contamination severely degrades the usability of remote sensing imagery and poses a fundamental challenge for downstream Earth observation tasks. Recently, diffusion-based models have emerged as a dominant paradigm for remote sensing cloud removal due to their strong generative capability and stable optimization. However, existing diffusion-based approaches often suffer from limited sampling efficiency and insufficient exploitation of structural and temporal priors in multi-temporal remote sensing scenarios. In this work, we propose SADER, a structure-aware diffusion framework for multi-temporal remote sensing cloud removal. SADER first develops a scalable Multi-Temporal Conditional Diffusion Network (MTCDN) to fully capture multi-temporal and multimodal correlations via temporal fusion and hybrid attention. Then, a cloud-aware attention loss is introduced to emphasize cloud-dominated regions by accounting for cloud thickness and brightness discrepancies. In addition, a deterministic resampling strategy is designed for continuous diffusion models to iteratively refine samples under fixed sampling steps by replacing outliers through guided correction. Extensive experiments on multiple multi-temporal datasets demonstrate that SADER consistently outperforms state-of-the-art cloud removal methods across all evaluation metrics. The code of SADER is publicly available at https://github.com/zyfzs0/SADER.
[255] NPNet: A Non-Parametric Network with Adaptive Gaussian-Fourier Positional Encoding for 3D Classification and Segmentation
Mohammad Saeid,Amir Salarpour,Pedram MohajerAnsari,Mert D. Pesé
Main category: cs.CV
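TL;DR: NPNet is a fully non-parametric method for 3D point-cloud classification and part segmentation. It uses no learned weights; instead it builds point features from deterministic operators (such as farthest point sampling, k-nearest neighbors, and pooling) and introduces an adaptive Gaussian-Fourier positional encoding to stay stable across scales and sampling densities. It performs strongly on several benchmarks, especially in few-shot settings, with better memory use and inference efficiency.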
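Farthest point sampling, one of the deterministic operators named in this summary, can be illustrated with a minimal sketch. This is the textbook greedy algorithm written in plain Python, not NPNet's own code; the function name `farthest_point_sampling`, the `start_index` parameter, and the toy points are assumptions made for illustration.

```python
import math

def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedy FPS: repeatedly pick the point farthest from the already selected set."""
    selected = [start_index]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        # The next sample is the point with the largest nearest-selected distance.
        next_index = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected

# Hypothetical toy cloud: five points on a line.
toy_points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (0.5, 0, 0), (5, 0, 0)]
sampled = farthest_point_sampling(toy_points, num_samples=3)
print(sampled)
```

Because the procedure involves no learned weights and no randomness once the start point is fixed, the same input always yields the same subset, which is the property a non-parametric pipeline relies on.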
Details
Motivation: Existing non-parametric point-cloud methods are unstable across scales and sampling densities, generalize poorly, and incur high memory and inference costs. Method: Proposes NPNet, which has no learned parameters and extracts features with deterministic geometric operators. Its core is an adaptive Gaussian-Fourier positional encoding whose bandwidth and Gaussian-cosine mixing ratio are determined by the input geometry; for part segmentation, fixed-frequency Fourier features are added to provide global context. Result: Achieves leading performance among non-parametric methods on ModelNet40/ModelNet-R, ScanObjectNN, and ShapeNetPart; is particularly effective in few-shot settings on ModelNet40; and uses less memory and inference time than prior non-parametric methods. Conclusion: NPNet demonstrates that a fully non-parametric design is feasible and competitive for 3D point-cloud understanding, and is especially suited to resource-constrained or data-scarce scenarios. Abstract: We present NPNet, a fully non-parametric approach for 3D point-cloud classification and part segmentation. NPNet contains no learned weights; instead, it builds point features using deterministic operators such as farthest point sampling, k-nearest neighbors, and pooling. Our key idea is an adaptive Gaussian-Fourier positional encoding whose bandwidth and Gaussian-cosine mixing are chosen from the input geometry, helping the method remain stable across different scales and sampling densities. For segmentation, we additionally incorporate fixed-frequency Fourier features to provide global context alongside the adaptive encoding. Across ModelNet40/ModelNet-R, ScanObjectNN, and ShapeNetPart, NPNet achieves strong performance among non-parametric baselines, and it is particularly effective in few-shot settings on ModelNet40. NPNet also offers favorable memory use and inference time compared to prior non-parametric methods.
[256] Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models
Wenbin Xing,Quanxing Zha,Lizheng Zu,Mengran Li,Ming Li,Junchi Yan
Main category: cs.CV
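TL;DR: This paper proposes the OmniVCHall benchmark, which systematically evaluates video multimodal large language models (VLLMs) on isolated and compositional hallucinations, and designs the TriCD contrastive decoding framework to mitigate compositional hallucination, markedly improving model accuracy.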
Details
Motivation: Existing work mostly addresses isolated video hallucination types, while compositional hallucinations caused by interactions among multiple spatiotemporal factors lack systematic study and evaluation. Method: Builds the OmniVCHall benchmark (with a new camera-based hallucination type, a fine-grained taxonomy, and adversarial answer options) and proposes the TriCD framework, which comprises an adaptive perturbation controller that generates negative video variants and a saliency-guided visual evidence enhancement module, jointly optimized via reinforcement learning. Result: An evaluation of 39 mainstream VLLMs shows that advanced models (e.g., Qwen3-VL and GPT-5) degrade markedly on compositional hallucination tasks; TriCD improves average accuracy by more than 10% on two backbones. Conclusion: Compositional hallucination is a key bottleneck for current VLLMs. OmniVCHall provides the first systematic benchmark for the problem, and TriCD confirms the effectiveness of contrastive decoding with multi-pathway calibration in mitigating such hallucinations. Abstract: Current research on video hallucination mitigation primarily focuses on isolated error types, leaving compositional hallucinations, arising from incorrect reasoning over multiple interacting spatial and temporal factors, largely underexplored. We introduce OmniVCHall, a benchmark designed to systematically evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). OmniVCHall spans diverse video domains, introduces a novel camera-based hallucination type, and defines a fine-grained taxonomy, together with adversarial answer options (e.g., "All are correct" and "None of the above") to prevent shortcut reasoning. Evaluations of 39 representative VLLMs reveal that even advanced models (e.g., Qwen3-VL and GPT-5) exhibit substantial performance degradation. We propose TriCD, a contrastive decoding framework with a triple-pathway calibration mechanism. An adaptive perturbation controller dynamically selects distracting operations to construct negative video variants, while a saliency-guided enhancement module adaptively reinforces grounded token-wise visual evidence. These components are optimized via reinforcement learning to encourage precise decision-making under compositional hallucination settings. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%. The data and code can be found at https://github.com/BMRETURN/OmniVCHall.
[257] GLAD: Generative Language-Assisted Visual Tracking for Low-Semantic Templates
Xingyu Luo,Yidong Cai,Jie Liu,Jie Tang,Gangshan Wu,Limin Wang
Main category: cs.CV
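TL;DR: This paper proposes GLAD, a generative language-assisted vision-language tracking model that uses diffusion models for generative multi-modal fusion of the text description and the template image, improving cross-modal understanding of low-semantic images (e.g., blurry or low-resolution ones) and reaching state-of-the-art performance on multiple benchmarks.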
Details
Motivation: Existing vision-language tracking methods perform poorly on low-semantic images (e.g., blurry or low-resolution ones), and the semantic gap between textual and visual features limits the effectiveness of direct fusion. Method: Proposes the GLAD model, which uses diffusion models for generative multi-modal fusion of the text description and the template image, improving compatibility between language and image and enriching the semantic information of the template image. Result: Sets a new state of the art on multiple benchmarks, markedly improves restoration and tracking for blurry or semantically ambiguous template images, and achieves a fast inference speed. Conclusion: The generative multi-modal fusion paradigm (particularly when based on diffusion models) can effectively bridge the gap between the visual and language modalities, offering a new approach and a practical solution for vision-language tracking. Abstract: Vision-language tracking has gained increasing attention in many scenarios. This task simultaneously deals with visual and linguistic information to localize objects in videos. Despite its growing utility, the development of vision-language tracking methods remains in its early stage. Current vision-language trackers usually employ Transformer architectures for interactive integration of template, search, and text features. However, persistent challenges with low-semantic images, including prevalent image blurriness and low resolution, may compromise model performance through degraded cross-modal understanding. To solve this problem, language assistance is usually used to deal with the obstacles posed by low-semantic images. However, due to the existing gap between current textual and visual features, direct concatenation and fusion of these features may have limited effectiveness. To address these challenges, we introduce a pioneering Generative Language-AssisteD tracking model, GLAD, which utilizes diffusion models for the generative multi-modal fusion of text description and template image to bolster compatibility between language and image and enhance template image semantic information. Our approach demonstrates notable improvements over the existing fusion paradigms. Blurry and semantically ambiguous template images can be restored to improve multi-modal features in the generative fusion paradigm. Experiments show that our method establishes a new state-of-the-art on multiple benchmarks and achieves an impressive inference speed.
The code and models will be released at: https://github.com/Confetti-lxy/GLAD
[258] Bridging Degradation Discrimination and Generation for Universal Image Restoration
JiaKui Hu,Zhengjian Yao,Lujia Jin,Yanye Lu
Main category: cs.CV
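TL;DR: This paper proposes BDG, which uses MAS-GLCM for fine-grained identification of degradation types and levels and integrates this discriminative information into a diffusion model through three training stages (generation, bridging, and restoration), improving universal image restoration performance.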
Details
Motivation: Universal image restoration must handle many degradation types and levels at once; the challenges lie in modeling the distribution of high-quality images and in adjusting the output according to the degradation. Method: Proposes the Multi-Angle and multi-Scale Gray Level Co-occurrence Matrix (MAS-GLCM) for degradation discrimination, and divides diffusion training into generation, bridging, and restoration stages so that the discriminative information from MAS-GLCM is integrated into the restoration process. Result: Markedly improves fidelity in all-in-one restoration and real-world super-resolution tasks without sacrificing perceptual quality; the gains require no change to the model architecture. Conclusion: BDG effectively couples degradation discrimination with image generation, improving universal image restoration in multi-task, multi-degradation scenarios. Abstract: Universal image restoration is a critical task in low-level vision, requiring the model to remove various degradations from low-quality images to produce clean images with rich detail. The challenges lie in sampling the distribution of high-quality images and adjusting the outputs on the basis of the degradation. This paper presents a novel approach, Bridging Degradation discrimination and Generation (BDG), which aims to address these challenges concurrently. First, we propose the Multi-Angle and multi-Scale Gray Level Co-occurrence Matrix (MAS-GLCM) and demonstrate its effectiveness in performing fine-grained discrimination of degradation types and levels. Subsequently, we divide the diffusion training process into three distinct stages: generation, bridging, and restoration. The objective is to preserve the diffusion model's capability of restoring rich textures while simultaneously integrating the discriminative information from the MAS-GLCM into the restoration process. This enhances its proficiency in addressing multi-task and multi-degraded scenarios. Without changing the architecture, BDG achieves significant performance gains in all-in-one restoration and real-world super-resolution tasks, primarily evidenced by substantial improvements in fidelity without compromising perceptual quality. The code and pretrained models are provided at https://github.com/MILab-PKU/BDG.
[259] MAUGen: A Unified Diffusion Approach for Multi-Identity Facial Expression and AU Label Generation
Xiangdong Li,Ye Lou,Ao Gao,Wei Zhang,Siyang Song
Main category: cs.CV
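TL;DR: This paper proposes MAUGen, a diffusion-based multi-modal framework that generates photorealistic facial expression images together with the corresponding Action Unit (AU) occurrence and intensity labels from a text prompt, and builds MIFA, a large-scale multi-identity facial action dataset.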
Details
Motivation: The lack of large-scale, demographically diverse face images with precise Action Unit (AU) occurrence and intensity annotations severely limits the generalization of AU recognition systems. Method: Proposes the MAUGen framework with two modules: (1) a Multi-modal Representation Learning (MRL) module that models the relationships among the textual description, facial identity, expression image, and AU activations in a unified latent space; and (2) a Diffusion-based Image label Generator (DIG) that decodes the joint representation into aligned image-label pairs across diverse identities. Result: Generates large-scale, demographically diverse, semantically aligned facial images and AU labels; builds the MIFA synthetic dataset; experiments show that MAUGen outperforms existing methods in image realism and label consistency. Conclusion: MAUGen alleviates the scarcity of annotated data in AU recognition and provides a high-quality synthetic data source for training AU recognition models that generalize well. Abstract: The lack of large-scale, demographically diverse face images with precise Action Unit (AU) occurrence and intensity annotations has long been recognized as a fundamental bottleneck in developing generalizable AU recognition systems. In this paper, we propose MAUGen, a diffusion-based multi-modal framework that jointly generates a large collection of photorealistic facial expressions and anatomically consistent AU labels, including both occurrence and intensity, conditioned on a single descriptive text prompt. Our MAUGen involves two key modules: (1) a Multi-modal Representation Learning (MRL) module that captures the relationships among the paired textual description, facial identity, expression image, and AU activations within a unified latent space; and (2) a Diffusion-based Image label Generator (DIG) that decodes the joint representation into aligned facial image-label pairs across diverse identities. Under this framework, we introduce Multi-Identity Facial Action (MIFA), a large-scale multimodal synthetic dataset featuring comprehensive AU annotations and identity variations. Extensive experiments demonstrate that MAUGen outperforms existing methods in synthesizing photorealistic, demographically diverse facial images along with semantically aligned AU labels.
[260] From Pixels to Facts (Pix2Fact): Benchmarking Multi-Hop Reasoning for Fine-Grained Visual Fact Checking
Yifan Jiang,Cong Zhang,Bofei Zhang,Yifan Yang,Bingzhang Wang,Yew-Soon Ong
Main category: cs.CV
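TL;DR: This paper introduces Pix2Fact, a new benchmark for evaluating vision-language models (VLMs) on expert-level visual perception and knowledge-intensive multi-hop reasoning. It contains 1,000 high-resolution (4K+) images with complex questions annotated by a team of PhD-level experts. Existing state-of-the-art models fall far short of humans on it (24.0% vs. 56%), exposing current VLMs' weaknesses in fine-grained visual understanding and knowledge-integrated reasoning.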
Details
Motivation: Existing benchmarks evaluate visual grounding and knowledge reasoning separately, so they cannot measure the expert-level visual understanding and knowledge-intensive multi-hop reasoning that their combination requires; a unified, high-difficulty benchmark is needed. Method: Builds the Pix2Fact benchmark: 1,000 4K+ images covering 8 categories of daily-life scenarios, with all questions and answers jointly designed by PhD holders from top global universities and a professional annotation firm; every question requires detailed visual grounding, multi-hop logical reasoning, and integration of external knowledge. Result: An evaluation of 9 state-of-the-art VLMs (including proprietary models such as Gemini-3-Pro and GPT-5) shows that the best model reaches only 24.0% average accuracy, well below the human level of 56%. Conclusion: Pix2Fact reveals fundamental limitations of current VLMs in combining fine-grained visual grounding with knowledge-driven reasoning, and is expected to drive the development of next-generation multimodal agents with both high-precision perception and strong reasoning. Abstract: Despite progress on general tasks, VLMs struggle with challenges demanding both detailed visual grounding and deliberate knowledge-based reasoning, a synergy not captured by existing benchmarks that evaluate these skills separately. To close this gap, we introduce Pix2Fact, a new visual question-answering benchmark designed to evaluate expert-level perception and knowledge-intensive multi-hop reasoning. Pix2Fact contains 1,000 high-resolution (4K+) images spanning 8 daily-life scenarios and situations, with questions and answers meticulously crafted by annotators holding PhDs from top global universities working in partnership with a professional data annotation firm. Each question requires detailed visual grounding, multi-hop reasoning, and the integration of external knowledge to answer. Our evaluation of 9 state-of-the-art VLMs, including proprietary models like Gemini-3-Pro and GPT-5, reveals the substantial challenge posed by Pix2Fact: the most advanced model achieves only 24.0% average accuracy, in stark contrast to human performance of 56%. This significant gap underscores the limitations of current models in replicating human-level visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the development of next-generation multimodal agents that combine fine-grained perception with robust, knowledge-based reasoning.
[261] Tune-Your-Style: Intensity-tunable 3D Style Transfer with Gaussian Splatting
Yian Zhao,Rushi Ye,Ruochong Zheng,Zesen Cheng,Chaoran Feng,Jiashu Yang,Pengchong Qiao,Chang Liu,Jie Chen
Main category: cs.CV
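TL;DR: This paper proposes Tune-Your-Style, a new 3D style transfer paradigm with tunable creative intensity. Gaussian neurons and a learnable style tuner let users set their own content-style balance, and tunable stylization guidance is provided through diffusion models and cross-view style alignment.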
Details
Motivation: Existing 3D style transfer methods follow a fixed-output paradigm that cannot meet different users' varied requirements for the balance between content and style. Method: Introduces Gaussian neurons to explicitly model style intensity and designs a learnable style tuner; proposes a tunable stylization guidance mechanism that uses diffusion models to generate multi-view consistent stylized views and applies a two-stage optimization strategy to modulate dynamically between full-style and zero-style guidance. Result: Experiments show visually appealing results and markedly greater flexibility and customizability for 3D style transfer. Conclusion: Tune-Your-Style provides the first practical 3D style transfer framework that supports continuous adjustment of creative intensity, strengthening user control and adaptability to applications. Abstract: 3D style transfer refers to the artistic stylization of 3D assets based on reference style images. Recently, 3DGS-based stylization methods have drawn considerable attention, primarily due to their markedly enhanced training and rendering speeds. However, a vital challenge for 3D style transfer is to strike a balance between the content and the patterns and colors of the style. Although the existing methods strive to achieve relatively balanced outcomes, the fixed-output paradigm struggles to adapt to the diverse content-style balance requirements from different users. In this work, we introduce a creative intensity-tunable 3D style transfer paradigm, dubbed Tune-Your-Style, which allows users to flexibly adjust the style intensity injected into the scene to match their desired content-style balance, thus enhancing the customizability of 3D style transfer. To achieve this goal, we first introduce Gaussian neurons to explicitly model the style intensity and parameterize a learnable style tuner to achieve intensity-tunable style injection. To facilitate the learning of tunable stylization, we further propose the tunable stylization guidance, which obtains multi-view consistent stylized views from diffusion models through cross-view style alignment, and then employs a two-stage optimization strategy to provide stable and efficient guidance by modulating the balance between full-style guidance from the stylized views and zero-style guidance from the initial rendering. Extensive experiments demonstrate that our method not only delivers visually appealing results, but also exhibits flexible customizability for 3D style transfer.
Project page is available at https://zhao-yian.github.io/TuneStyle.
[262] Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering
Guangtao Lyu,Xinyi Cheng,Qi Liu,Chenghao Xu,Jiexi Yan,Muli Yang,Fen Fang,Cheng Deng
Main category: cs.CV
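TL;DR: This paper proposes Contrastive Neuron Steering (CNS). It uses sparse autoencoders to analyze neuron types in LVLM visual representations, finds that abnormal activation of image-specific neurons is the main cause of hallucination, and applies contrastive steering to those neurons at the prefilling stage, markedly reducing hallucinations while remaining compatible with existing decoding methods.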
Details
Motivation: Existing hallucination mitigation methods for large vision-language models (LVLMs) mostly adjust the output level and do not examine the internal representation mechanisms, in particular the behavior of neurons in the visual embeddings. Method: Introduces sparse autoencoders (SAEs) to decompose visual embeddings and identify "always-on" and "image-specific" neurons; proposes Contrastive Neuron Steering (CNS), which locates image-specific neurons through contrastive analysis of clean and noisy inputs, then selectively amplifies informative activations and suppresses perturbation-induced ones. Result: CNS improves the robustness and semantic grounding of visual representations already at the prefilling stage; it markedly lowers hallucination rates on both hallucination-focused and general multimodal benchmarks while preserving overall multimodal understanding. Conclusion: Hallucinations stem from perturbed activation of image-specific neurons; controllable neuron-level intervention in the representation (such as CNS) is an effective, interpretable mitigation paradigm that is orthogonal to downstream methods. Abstract: LVLMs achieve remarkable multimodal understanding and generation but remain susceptible to hallucinations. Existing mitigation methods predominantly focus on output-level adjustments, leaving the internal mechanisms that give rise to these hallucinations largely unexplored. To gain a deeper understanding, we adopt a representation-level perspective by introducing sparse autoencoders (SAEs) to decompose dense visual embeddings into sparse, interpretable neurons. Through neuron-level analysis, we identify distinct neuron types, including always-on neurons and image-specific neurons. Our findings reveal that hallucinations often result from disruptions or spurious activations of image-specific neurons, while always-on neurons remain largely stable. Moreover, selectively enhancing or suppressing image-specific neurons enables controllable intervention in LVLM outputs, improving visual grounding and reducing hallucinations. Building on these insights, we propose Contrastive Neuron Steering (CNS), which identifies image-specific neurons via contrastive analysis between clean and noisy inputs. CNS selectively amplifies informative neurons while suppressing perturbation-induced activations, producing more robust and semantically grounded visual representations. This not only enhances visual understanding but also effectively mitigates hallucinations. By operating at the prefilling stage, CNS is fully compatible with existing decoding-stage methods.
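The contrastive idea described above (compare activations on clean and noisy inputs, amplify what looks informative, damp what looks perturbation-induced) can be made concrete with a toy sketch. The abstract does not give the actual rule, so everything here is an assumption for illustration: the function name `contrastive_steer`, the difference-threshold test, and the `amplify`, `suppress`, and `threshold` values are hypothetical and are not the authors' method.

```python
def contrastive_steer(clean_act, noisy_act, amplify=1.5, suppress=0.5, threshold=0.1):
    """Toy steering rule over per-neuron activations (hypothetical, not the paper's)."""
    steered = []
    for clean, noisy in zip(clean_act, noisy_act):
        diff = clean - noisy
        if diff > threshold:
            # Active on the clean input but lost under noise: treat as informative.
            steered.append(clean * amplify)
        elif diff < -threshold:
            # Stronger under noise than on the clean input: treat as perturbation-induced.
            steered.append(clean * suppress)
        else:
            # Roughly unchanged by noise (e.g. an always-on neuron): leave untouched.
            steered.append(clean)
    return steered

# Hypothetical sparse activations for four neurons.
clean_activations = [2.0, 0.1, 1.0, 0.0]
noisy_activations = [0.5, 1.2, 1.05, 0.0]
steered_activations = contrastive_steer(clean_activations, noisy_activations)
print(steered_activations)
```

In this sketch the steering touches only the representation, which mirrors the abstract's point that an intervention applied before decoding can coexist with decoding-stage methods.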
Extensive experiments on both hallucination-focused and general multimodal benchmarks demonstrate that CNS consistently reduces hallucinations while preserving overall multimodal understanding.
[263] FaceSnap: Enhanced ID-fidelity Network for Tuning-free Portrait Customization
Benxiang Zhai,Yifang Xu,Guofeng Zhang,Yang Li,Sidan Du
Main category: cs.CV
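TL;DR: This paper proposes FaceSnap, a Stable Diffusion-based method that generates highly consistent, high-fidelity customized portraits from a single reference image in a single inference pass, with no fine-tuning; it is plug-and-play and compatible across models.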
Details
Motivation: Existing personalized portrait generation methods either require time-consuming fine-tuning and generalize poorly, or fail to preserve facial details with high fidelity. Method: Proposes FaceSnap: 1) a Facial Attribute Mixer that fuses low-level details with high-level abstract features; 2) a Landmark Predictor that maintains identity consistency across poses and provides spatial control; 3) an ID-preserving module that injects this information into the UNet. Result: Experiments show that FaceSnap clearly outperforms existing state-of-the-art methods on personalized portrait generation, achieving high consistency and high-fidelity facial details. Conclusion: FaceSnap is an efficient, general, plug-and-play method for customized portrait generation that addresses the two core challenges of fine-tuning dependence and loss of detail. Abstract: Benefiting from the significant advancements in text-to-image diffusion models, research in personalized image generation, particularly customized portrait generation, has also made great strides recently. However, existing methods either require time-consuming fine-tuning and lack generalizability or fail to achieve high fidelity in facial details. To address these issues, we propose FaceSnap, a novel method based on Stable Diffusion (SD) that requires only a single reference image and produces extremely consistent results in a single inference stage. This method is plug-and-play and can be easily extended to different SD models. Specifically, we design a new Facial Attribute Mixer that can extract comprehensive fused information from both low-level specific features and high-level abstract features, providing better guidance for image generation. We also introduce a Landmark Predictor that maintains reference identity across landmarks with different poses, providing diverse yet detailed spatial control conditions for image generation. Then we use an ID-preserving module to inject these into the UNet. Experimental results demonstrate that our approach performs remarkably in personalized and customized portrait generation, surpassing other state-of-the-art methods in this domain.
[264] S$^3$POT: Contrast-Driven Face Occlusion Segmentation via Self-Supervised Prompt Learning
Lingsong Wang,Mancheng Meng,Ziyan Wu,Terrence Chen,Fan Yang,Dinggang Shen
Main category: cs.CV
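TL;DR: This paper proposes the S³POT framework, which combines face generation with self-supervised spatial prompting to achieve occlusion segmentation without supervision from ground-truth occlusion masks.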
Details
Motivation: Existing face parsing methods often misclassify occlusions as facial components. Because occlusion is a high-level concept, it is hard to build a real-world dataset covering every occlusion category, and precise annotation is labor-intensive. Method: Proposes the contrast-driven S³POT framework with three modules: Reference Generation (RF), Feature Enhancement (FE), and Prompt Selection (PS). A face generator reconstructs an occlusion-free reference image, and a foundation segmentation model (e.g., SAM) extracts precise occlusion masks under self-supervised prompts; no ground-truth occlusion masks are needed at any point, and training uses three novel, complementary objective functions. Result: Extensive experiments on a specially built dataset show that S³POT performs strongly and that each module is effective. Conclusion: S³POT achieves high-quality occlusion segmentation without ground-truth occlusion masks, offering a new paradigm for handling occlusion in face parsing. Abstract: Existing face parsing methods usually misclassify occlusions as facial components. This is because occlusion is a high-level concept; it does not refer to a concrete category of object. Thus, constructing a real-world face dataset covering all categories of occluding objects is almost impossible, and accurate mask annotation is labor-intensive. To deal with these problems, we present S$^3$POT, a contrast-driven framework synergizing face generation with self-supervised spatial prompting, to achieve occlusion segmentation. The framework is inspired by two insights: 1) modern face generators' ability to realistically reconstruct occluded regions, creating an image that preserves facial geometry while eliminating occlusion, and 2) foundation segmentation models' (e.g., SAM) capacity to extract precise masks when provided with appropriate prompts. In particular, S$^3$POT consists of three modules: Reference Generation (RF), Feature Enhancement (FE), and Prompt Selection (PS). First, a reference image is produced by RF using structural guidance from a parsed mask. Second, FE performs contrast of tokens between raw and reference images to obtain an initial prompt, then modifies image features with the prompt by cross-attention. Third, based on the enhanced features, PS constructs a set of positive and negative prompts and screens them with a self-attention network for a mask decoder. The network is learned under the guidance of three novel and complementary objective functions without any occlusion ground-truth masks.
Extensive experiments on a specially collected dataset demonstrate S$^3$POT's superior performance and the effectiveness of each module.
[265] VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning
Vivek Madhavaram,Vartika Sengar,Arkadipta De,Charu Sharma
Main category: cs.CV
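TL;DR: This paper proposes VIZOR, a training-free, end-to-end 3D scene graph generation method that builds viewpoint-invariant, dense, unambiguous scene graphs directly from raw 3D scenes. Spatial relationships are defined relative to each object's own orientation, open-vocabulary relationship inference is supported, and the method clearly outperforms existing approaches on zero-shot object grounding.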
Details
Motivation: Existing methods that build scene graphs from multimodal inputs (such as 2D images and depth maps) generalize poorly, and spatial relationships like "left/right" are inconsistent across viewpoints, so they fall short of what 3D scene understanding and reasoning require. Method: Proposes Viewpoint-Invariant Zero-shot scene graph generation for 3D scene Reasoning (VIZOR), a training-free, end-to-end framework that generates dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes; spatial relationships use each object's front-facing direction as the reference frame, and open-vocabulary spatial and proximity relationship inference is supported. Result: On the Replica and Nr3D datasets, VIZOR improves zero-shot grounding accuracy by 22% and 4.81%, respectively; quantitative and qualitative experiments both confirm its advantage in scene graph generation and downstream tasks. Conclusion: VIZOR addresses viewpoint dependence and annotation dependence in 3D scene graph generation, achieving more robust and general 3D scene understanding through viewpoint-invariant modeling and zero-shot inference. Abstract: Scene understanding and reasoning has been a fundamental problem in 3D computer vision, requiring models to identify objects, their properties, and spatial or comparative relationships among the objects. Existing approaches enable this by creating scene graphs using multiple inputs such as 2D images, depth maps, object labels, and annotated relationships from a specific reference view. However, these methods often struggle with generalization and produce inaccurate spatial relationships like "left/right", which become inconsistent across different viewpoints. To address these limitations, we propose Viewpoint-Invariant Zero-shot scene graph generation for 3D scene Reasoning (VIZOR). VIZOR is a training-free, end-to-end framework that constructs dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes. The generated scene graph is unambiguous, as spatial relationships are defined relative to each object's front-facing direction, making them consistent regardless of the reference view. Furthermore, it infers open-vocabulary relationships that describe spatial and proximity relationships among scene objects without requiring annotated training data. We conduct extensive quantitative and qualitative evaluations to assess the effectiveness of VIZOR in scene graph generation and downstream tasks, such as query-based object grounding.
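Defining a relationship relative to an object's own front-facing direction, as described above, can be illustrated with a small geometric sketch. This is not VIZOR's implementation: the function `relation_in_object_frame`, the top-down 2D simplification, the four relation labels, and the convention that "left" is the front direction rotated 90 degrees counter-clockwise are all assumptions made for illustration, and the front vector is assumed to be non-zero.

```python
import math

def relation_in_object_frame(anchor_pos, anchor_front, target_pos):
    """Where the target lies relative to the anchor's own front direction (top-down view)."""
    fx, fy = anchor_front
    length = math.hypot(fx, fy)  # assumed non-zero
    fx, fy = fx / length, fy / length
    # Assumed convention: the anchor's left axis is its front rotated 90 degrees counter-clockwise.
    lx, ly = -fy, fx
    dx = target_pos[0] - anchor_pos[0]
    dy = target_pos[1] - anchor_pos[1]
    forward = dx * fx + dy * fy    # offset along the anchor's front axis
    leftward = dx * lx + dy * ly   # offset along the anchor's left axis
    if abs(forward) >= abs(leftward):
        return "in front of" if forward > 0 else "behind"
    return "left of" if leftward > 0 else "right of"

def rotate(point, angle):
    """Rotate a 2D point about the origin; stands in for observing the scene from another view."""
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

# Hypothetical scene: a chair at the origin facing +y, and a lamp beside it.
chair, chair_front, lamp = (0.0, 0.0), (0.0, 1.0), (-2.0, 0.5)
relation = relation_in_object_frame(chair, chair_front, lamp)
angle = 1.1  # arbitrary rotation of the whole scene
rotated_relation = relation_in_object_frame(rotate(chair, angle), rotate(chair_front, angle), rotate(lamp, angle))
print(relation, "|", rotated_relation)
```

Because both offsets are measured in the anchor's own frame, rotating the whole scene leaves the label unchanged, which is the consistency property the abstract attributes to object-centric relationships.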
VIZOR outperforms state-of-the-art methods, showing clear improvements in scene graph generation and achieving 22% and 4.81% gains in zero-shot grounding accuracy on the Replica and Nr3D datasets, respectively.
[266] Diff-PC: Identity-preserving and 3D-aware Controllable Diffusion for Zero-shot Portrait Customization
Yifang Xu,Benxiang Zhai,Chenyu Zhang,Ming Li,Yang Li,Sidan Du
Main category: cs.CV
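TL;DR: This paper proposes Diff-PC, a diffusion-based framework for zero-shot portrait customization. Using 3D facial priors together with an ID encoder, an ID controller, and an ID injector, it achieves fine-grained control over facial attributes and diverse background generation while maintaining high identity fidelity.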
Details
Motivation: Existing methods fall short in identity (ID) preservation and facial controllability, and struggle to combine high-fidelity ID reconstruction with flexible editing of facial attributes. Method: Proposes the Diff-PC framework: a 3D face predictor extracts priors for the reference ID, target expression, and pose; an ID-Encoder fuses local and global facial features; ID-Ctrl aligns ID features under the guidance of the 3D face; an ID-Injector strengthens ID fidelity and facial controllability; and training on a self-collected ID-centric dataset improves ID similarity and text-to-image alignment. Result: Clearly outperforms existing state-of-the-art methods in ID preservation, facial control, and text-to-image consistency, and is compatible with multi-style foundation models. Conclusion: Diff-PC resolves the joint challenge of ID fidelity and facial controllability in zero-shot portrait customization, offering a new paradigm for high-quality personalized portrait generation. Abstract: Portrait customization (PC) has recently garnered significant attention due to its potential applications. However, existing PC methods lack precise identity (ID) preservation and face control. To address these issues, we propose Diff-PC, a diffusion-based framework for zero-shot PC, which generates realistic portraits with high ID fidelity, specified facial attributes, and diverse backgrounds. Specifically, our approach employs the 3D face predictor to reconstruct the 3D-aware facial priors encompassing the reference ID, target expressions, and poses. To capture fine-grained face details, we design ID-Encoder that fuses local and global facial features. Subsequently, we devise ID-Ctrl using the 3D face to guide the alignment of ID features. We further introduce ID-Injector to enhance ID fidelity and facial controllability. Finally, training on our collected ID-centric dataset improves face similarity and text-to-image (T2I) alignment. Extensive experiments demonstrate that Diff-PC surpasses state-of-the-art methods in ID preservation, facial control, and T2I consistency. Furthermore, our method is compatible with multi-style foundation models.
[267] A Hybrid Mamba-SAM Architecture for Efficient 3D Medical Image Segmentation
Mohammadreza Gholipour Shahraki,Mehdi Rezaeian,Mohammad Ghasemzadeh
Main category: cs.CV
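TL;DR: This paper proposes Mamba-SAM, an efficient hybrid architecture for 3D medical image segmentation that combines a frozen SAM encoder with Mamba state space models. It introduces 3D context modeling through either dual-branch fusion or lightweight adapters (TPMamba), designs Multi-Frequency Gated Convolution (MFGC) to strengthen feature representation, and on the ACDC dataset reaches accuracy comparable to UNet++ with better inference speed.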
Details
Motivation: Existing foundation models (such as SAM) are hard to apply directly to 3D medical image segmentation because of domain shift, their inherently 2D design, and the high cost of fine-tuning. Method: Proposes Mamba-SAM: 1) a dual-branch architecture that fuses frozen SAM features with features from a trainable VMamba encoder via cross-attention; 2) an adapter scheme that embeds 3D-aware Tri-Plane Mamba (TPMamba) modules in the frozen SAM ViT; plus Multi-Frequency Gated Convolution (MFGC), which combines 3D discrete cosine transforms with adaptive gating to improve feature representation. Result: On the ACDC cardiac MRI dataset, the dual-branch Mamba-SAM-Base reaches a mean Dice of 0.906 (Myocardium 0.910, Left Ventricle 0.971), comparable to UNet++; the TP-MFGC variant reaches 0.880 Dice at an inference speed of 4.77 FPS. Conclusion: Combining foundation models with efficient SSM architectures is a practical and effective approach to 3D medical image segmentation. Abstract: Accurate segmentation of 3D medical images such as MRI and CT is essential for clinical diagnosis and treatment planning. Foundation models like the Segment Anything Model (SAM) provide powerful general-purpose representations but struggle in medical imaging due to domain shift, their inherently 2D design, and the high computational cost of fine-tuning. To address these challenges, we propose Mamba-SAM, a novel and efficient hybrid architecture that combines a frozen SAM encoder with the linear-time efficiency and long-range modeling capabilities of Mamba-based State Space Models (SSMs). We investigate two parameter-efficient adaptation strategies. The first is a dual-branch architecture that explicitly fuses general features from a frozen SAM encoder with domain-specific representations learned by a trainable VMamba encoder using cross-attention. The second is an adapter-based approach that injects lightweight, 3D-aware Tri-Plane Mamba (TPMamba) modules into the frozen SAM ViT encoder to implicitly model volumetric context. Within this framework, we introduce Multi-Frequency Gated Convolution (MFGC), which enhances feature representation by jointly analyzing spatial and frequency-domain information via 3D discrete cosine transforms and adaptive gating. Extensive experiments on the ACDC cardiac MRI dataset demonstrate the effectiveness of the proposed methods.
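The Dice score used to report these ACDC results measures the overlap between a predicted mask and a reference mask: twice the intersection divided by the sum of the two mask sizes. The sketch below is the standard definition on flattened binary masks, not the authors' evaluation code; the function name `dice_score`, the empty-mask convention, and the toy masks are assumptions for illustration.

```python
def dice_score(pred_mask, true_mask):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total

# Hypothetical flattened masks for one structure.
predicted = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
score = dice_score(predicted, reference)
print(round(score, 3))
```

A mean Dice such as the one reported for a multi-structure dataset is typically the average of this per-structure score, although the exact averaging protocol is not stated in the abstract.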
The dual-branch Mamba-SAM-Base model achieves a mean Dice score of 0.906, comparable to UNet++ (0.907), while outperforming all baselines on Myocardium (0.910) and Left Ventricle (0.971) segmentation. The adapter-based TP MFGC variant offers superior inference speed (4.77 FPS) with strong accuracy (0.880 Dice). These results show that hybridizing foundation models with efficient SSM-based architectures provides a practical and effective solution for 3D medical image segmentation.
[268] Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment
Lukas Kuhn,Giuseppe Serra,Florian Buettner
Main category: cs.CV
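TL;DR: NOVA is a non-contrastive vision-language alignment framework that predicts text embeddings from images using joint embedding prediction with distributional regularization (SIGReg). It needs no negative sampling, momentum encoders, or stop-gradients and has only a single hyperparameter, and it markedly improves the stability and performance of zero-shot chest X-ray classification.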
Details
Motivation: Mainstream contrastive learning methods (such as CLIP) depend on large batches, carefully designed negative samples, and extensive hyperparameter tuning, which makes training complex and unstable. Method: Proposes the NOVA framework: using a frozen, domain-specific text encoder (such as ClinicalBERT), it predicts the corresponding text embedding from augmented image views and introduces Sketched Isotropic Gaussian Regularization (SIGReg) to force the image representations toward an isotropic Gaussian distribution, achieving non-contrastive alignment. Result: On zero-shot chest X-ray classification, with a ViT trained from scratch on MIMIC-CXR and ClinicalBERT as the text encoder, NOVA outperforms multiple baselines on three benchmark datasets, with more stable and more consistent training. Conclusion: Non-contrastive vision-language pretraining is a simpler, more stable, and more effective alternative that avoids much of the engineering burden of contrastive learning. Abstract: Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially more consistent training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.
[269] Schrödinger-Inspired Time-Evolution for 4D Deformation Forecasting
Ahsan Raza Siyal,Markus Haltmeier,Ruth Steiger,Elke Ruth Gizewski,Astrid Ellen Grams
Main category: cs.CV
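TL;DR: This paper proposes a Schrödinger-inspired, physics-guided neural architecture for forecasting complex three-dimensional spatiotemporal phenomena (4D). By embedding an explicit time-evolution operator in a deep convolutional framework and learning a complex wavefunction and its potential field, it achieves stable, interpretable, and anatomically consistent long-horizon prediction.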
Details
Motivation: Conventional unconstrained neural forecasting models tend to drift and accumulate error in long-horizon 4D prediction and lack physical interpretability, while fields such as medical imaging and fluid dynamics need spatiotemporal prediction that is stable, interpretable, and anatomically faithful. Method: Builds a time-evolution operator based on the Schrödinger equation and embeds it in a deep convolutional network; from volumetric sequences the model learns voxelwise amplitude A, phase φ, and potential V, constructs the complex wavefunction ψ = A e^{iφ}, and evolves it forward with a differentiable, unrolled Schrödinger time stepper. Result: On synthetic benchmarks that emulate realistic deformations and topological changes, the method predicts 4D states (volumetric intensities and deformation fields) accurately and stably; phase encodes transport dynamics, amplitude represents structural intensity, and the learned potential describes spatiotemporal interactions. Conclusion: The method is the first to bring a Schrödinger-type evolution operator end-to-end into 4D neural forecasting, combining the expressivity of deep networks with the robustness and interpretability of physical modeling. Abstract: Spatiotemporal forecasting of complex three-dimensional phenomena (4D: 3D + time) is fundamental to applications in medical imaging, fluid and material dynamics, and geophysics. In contrast to unconstrained neural forecasting models, we propose a Schrödinger-inspired, physics-guided neural architecture that embeds an explicit time-evolution operator within a deep convolutional framework for 4D prediction. From observed volumetric sequences, the model learns voxelwise amplitude, phase, and potential fields that define a complex-valued wavefunction $ψ = A e^{iφ}$, which is evolved forward in time using a differentiable, unrolled Schrödinger time stepper. This physics-guided formulation yields several key advantages: (i) temporal stability arising from the structured evolution operator, which mitigates drift and error accumulation in long-horizon forecasting; (ii) an interpretable latent representation, where phase encodes transport dynamics, amplitude captures structural intensity, and the learned potential governs spatiotemporal interactions; and (iii) natural compatibility with deformation-based synthesis, which is critical for preserving anatomical fidelity in medical imaging applications. By integrating physical priors directly into the learning process, the proposed approach combines the expressivity of deep networks with the robustness and interpretability of physics-based modeling.
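To make the idea of evolving $ψ = A e^{iφ}$ with a structured time stepper concrete, here is a minimal sketch of one standard choice, the split-step Fourier scheme, on a 1D periodic grid. It is not the paper's stepper: the abstract does not specify the discretization, and the units (ħ = m = 1), the grid, the hand-written amplitude, phase, and potential (which the model would learn instead), and the helper names `dft`, `schrodinger_step`, and `norm` are all assumptions for illustration.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform, kept dependency-free for the sketch."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for m in range(n):
            total += values[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
        out.append(total / n if inverse else total)
    return out

def schrodinger_step(psi, potential, dt, dx=1.0):
    """One split-step update: half potential kick, full kinetic step, half potential kick."""
    n = len(psi)
    half_kick = [cmath.exp(-0.5j * v * dt) for v in potential]
    psi = [p * h for p, h in zip(psi, half_kick)]
    psi_hat = dft(psi)
    for k in range(n):
        freq = k if k <= n // 2 else k - n            # signed frequency index
        wave_number = 2 * math.pi * freq / (n * dx)
        psi_hat[k] *= cmath.exp(-0.5j * wave_number ** 2 * dt)  # kinetic term k^2 / 2
    psi = dft(psi_hat, inverse=True)
    return [p * h for p, h in zip(psi, half_kick)]

def norm(psi):
    """Total probability mass sum |psi|^2."""
    return sum(abs(p) ** 2 for p in psi)

# Hypothetical fields on a 32-point grid.
n = 32
amplitude = [math.exp(-((i - n / 2) ** 2) / 18.0) for i in range(n)]
phase = [0.3 * i for i in range(n)]
potential = [0.01 * (i - n / 2) ** 2 for i in range(n)]
psi0 = [a * cmath.exp(1j * ph) for a, ph in zip(amplitude, phase)]

psi = psi0
for _ in range(10):
    psi = schrodinger_step(psi, potential, dt=0.1)
print(norm(psi0), norm(psi))
```

Every factor applied in a step has unit modulus, so the update is unitary and the norm stays constant up to rounding, a simple instance of the kind of structural stability the abstract attributes to an explicit evolution operator.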
We demonstrate accurate and stable prediction of future 4D states, including volumetric intensities and deformation fields, on synthetic benchmarks that emulate realistic shape deformations and topological changes. To our knowledge, this is the first end-to-end 4D neural forecasting framework to incorporate a Schrödinger-type evolution operator, offering a principled pathway toward interpretable, stable, and anatomically consistent spatiotemporal prediction.
[270] Improving Neuropathological Reconstruction Fidelity via AI Slice Imputation
Marina Crespo Aguirre,Jonathan Williams-Ramirez,Dina Zemlyanker,Xiaoling Hu,Lucas J. Deden-Binder,Rogeny Herisse,Mark Montine,Theresa R. Connors,Christopher Mount,Christine L. MacDonald,C. Dirk Keene,Caitlin S. Latimer,Derek H. Oakley,Bradley T. Hyman,Ana Lawry Aguila,Juan Eugenio Iglesias
Main category: cs.CV
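TL;DR: The paper proposes a computationally efficient super-resolution method that turns anisotropic (thick-slab) reconstructions from 2D dissection photographs into isotropic 3D brain volumes, improving anatomical accuracy and downstream automated segmentation, cortical surface reconstruction, and MRI registration.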
Details
Motivation: Existing methods for reconstructing 3D brain volumes from 2D dissection photographs yield overly smooth, coarse structures under high anisotropy (e.g., thick slabs), which limits anatomical delineation and morphometric accuracy. Method: A super-resolution step, trained on domain-randomized synthetic data, imputes slices in the anisotropic 3D reconstructions to produce anatomically consistent isotropic volumes. Result: The imputed volumes raise automated-segmentation Dice scores, particularly in cortical and white matter regions, and improve cortical surface reconstruction and MRI atlas registration. Conclusion: The method increases the resolution and anatomical fidelity of photograph-based reconstructions and helps bridge neuropathology and neuroimaging; the code is publicly available. Abstract: Neuropathological analyses benefit from spatially precise volumetric reconstructions that enhance anatomical delineation and improve morphometric accuracy. Our prior work has shown the feasibility of reconstructing 3D brain volumes from 2D dissection photographs. However, these outputs sometimes exhibit coarse, overly smooth reconstructions of structures, especially under high anisotropy (i.e., reconstructions from thick slabs). Here, we introduce a computationally efficient super-resolution step that imputes slices to generate anatomically consistent isotropic volumes from anisotropic 3D reconstructions of dissection photographs. By training on domain-randomized synthetic data, we ensure that our method generalizes across dissection protocols and remains robust to large slab thicknesses. The imputed volumes yield improved automated segmentations, achieving higher Dice scores, particularly in cortical and white matter regions. Validation on surface reconstruction and atlas registration tasks demonstrates more accurate cortical surfaces and MRI registration. By enhancing the resolution and anatomical fidelity of photograph-based reconstructions, our approach strengthens the bridge between neuropathology and neuroimaging. Our method is publicly available at https://surfer.nmr.mgh.harvard.edu/fswiki/mri_3d_photo_recon
[271] HPC: Hierarchical Point-based Latent Representation for Streaming Dynamic Gaussian Splatting Compression
Yangzhi Ma,Bojun Liu,Wenting Liao,Dong Liu,Zhu Li,Li Li
Main category: cs.CV
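TL;DR: The paper proposes HPC, a framework that improves streaming compression of dynamic Gaussian Splatting through a hierarchical point-based latent representation and by exploiting inter-frame correlation of neural network parameters, achieving a 67% storage reduction against its baseline while keeping reconstruction quality high.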
Details
Motivation: Existing streaming compression methods for dynamic Gaussian Splatting rely on either structured grid-based or unstructured point-based latent representations, which suffer from parameter redundancy or limited compactness respectively, making it hard to combine high rendering quality with a small memory footprint. Method: HPC (1) uses a hierarchical point-based latent representation that operates per Gaussian to avoid modeling unoccupied space; (2) applies a tailored aggregation scheme to make the latent points more compact; and (3) for the first time exploits inter-frame correlation of neural network parameters for compression, combining it with latent compression into an end-to-end framework. Result: HPC substantially outperforms state-of-the-art methods in the experiments, with a 67% storage reduction against its baseline at high reconstruction fidelity. Conclusion: With its hierarchical point-based latent representation and network-parameter compression, HPC addresses the core difficulty of balancing quality and efficiency in streaming dynamic Gaussian Splatting. Abstract: While dynamic Gaussian Splatting has driven significant advances in free-viewpoint video, maintaining its rendering quality with a small memory footprint for efficient streaming transmission still presents an ongoing challenge. Existing streaming dynamic Gaussian Splatting compression methods typically leverage a latent representation to drive the neural network for predicting Gaussian residuals between frames. Their core latent representations can be categorized into structured grid-based and unstructured point-based paradigms. However, the former incurs significant parameter redundancy by inevitably modeling unoccupied space, while the latter suffers from limited compactness as it fails to exploit local correlations. To address these limitations, we propose HPC, a novel streaming dynamic Gaussian Splatting compression framework. It employs a hierarchical point-based latent representation that operates on a per-Gaussian basis to avoid parameter redundancy in unoccupied space. Guided by a tailored aggregation scheme, these latent points achieve high compactness with low spatial redundancy. To improve compression efficiency, we further undertake the first investigation to compress neural networks for streaming dynamic Gaussian Splatting through mining and exploiting the inter-frame correlation of parameters. Combined with latent compression, this forms a fully end-to-end compression framework. Comprehensive experimental evaluations demonstrate that HPC substantially outperforms state-of-the-art methods.
It achieves a storage reduction of 67% against its baseline while maintaining high reconstruction fidelity.
[272] Video Understanding: Through A Temporal Lens
Thong Thanh Nguyen
Main category: cs.CV
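TL;DR: The thesis examines how temporal relations among video elements can improve video understanding and makes five contributions: an automatic annotation framework, a parameter-efficient fine-tuning strategy, the integration of state space layers, a fine-grained motion-moment contrastive learning framework, and an analysis of the temporal reasoning bottleneck in large vision-language models together with an improved recipe.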
Details
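Motivation: Existing video understanding methods are limited in modeling temporal relations and struggle to capture the dynamics and temporal structure of video content. Method: Five methods are proposed: an automatic annotation framework based on large vision-language models (with noise-robust contrastive learning and a subtractive angular margin); a parameter-efficient fine-tuning strategy using "recurrent adapters"; State Space Layers (SSL) for long-form video modeling, along with two new long-term benchmarks; a contrastive learning framework that explicitly models fine-grained relations between motions and video moments; and a systematic empirical study of large vision-language models that identifies the visual-language interface as the bottleneck for temporal reasoning and proposes a "temporal-oriented recipe". Result: Explicit temporal modeling is shown to improve a model's ability to represent and reason about the fluid nature of video content, with gains on several new benchmarks and in low-data regimes. Conclusion: Explicitly modeling temporal relations is a key route to better video understanding; the proposed methods are effective and practical for annotation efficiency, parameter efficiency, long-range modeling, fine-grained relation learning, and temporal reasoning in large models. Abstract: This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using "recurrent adapters" to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by the introduction of two new long-term benchmarks for egocentric and feature-length content; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large Vision-Language Models (LVLMs) that identifies the visual-language interface as a bottleneck for temporal reasoning, leading to a new "temporal-oriented recipe" for upscaled video understanding. Collectively, these contributions demonstrate that explicit temporal modeling significantly enhances a model's ability to represent and reason about the fluid nature of video content.
[273] V2X-DSC: Multi-Agent Collaborative Perception with Distributed Source Coding Guided Communication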
Yuankun Zeng,Shaohui Li,Zhi Li,Shulan Ruan,Yu Liu,You He
Main category: cs.CV
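TL;DR: The paper proposes V2X-DSC, a framework that applies distributed source coding to multi-agent collaborative perception, using a conditional codec to fuse BEV features efficiently under bandwidth limits and improve the accuracy-bandwidth trade-off.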
Details
Motivation: Intermediate-feature sharing faces strict bandwidth limits, yet observations from different agents are strongly correlated, so the receiver only needs the innovation beyond its local context. Method: V2X-DSC, a framework based on distributed source coding, includes a Conditional Codec (DCC): the sender compresses BEV features into compact codes, and the receiver reconstructs them conditionally using its local features as side information, allocating bits according to complementarity. Result: On DAIR-V2X, OPV2V, and V2X-Real, the method achieves state-of-the-art accuracy-bandwidth trade-offs under KB-level communication and works as a plug-and-play layer across multiple fusion backbones. Conclusion: The conditional codec structure improves communication efficiency and also regularizes learning, encouraging incremental representations and yielding lower-noise fused features. Abstract: Collaborative perception improves 3D understanding by fusing multi-agent observations, yet intermediate-feature sharing faces strict bandwidth constraints as dense BEV features saturate V2X links. We observe that collaborators view the same physical world, making their features strongly correlated; thus receivers only need innovation beyond their local context. Revisiting this from a distributed source coding perspective, we propose V2X-DSC, a framework with a Conditional Codec (DCC) for bandwidth-constrained fusion. The sender compresses BEV features into compact codes, while the receiver performs conditional reconstruction using its local features as side information, allocating bits to complementary cues rather than redundant content. This conditional structure regularizes learning, encouraging incremental representation and yielding lower-noise features. Experiments on DAIR-V2X, OPV2V, and V2X-Real demonstrate state-of-the-art accuracy-bandwidth trade-offs under KB-level communication, and the framework generalizes as a plug-and-play communication layer across multiple fusion backbones.
[274] JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning
Ruikui Wang,Jinheng Feng,Lang Tian,Huaishao Luo,Chaochao Li,Liangbo Zhou,Huan Zhang,Youzheng Wu,Xiaodong He
Main category: cs.CV
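TL;DR: The paper presents JoyAvatar, a framework that uses a twin-teacher enhanced training algorithm and dynamic modulation of multi-modal conditions to improve how well video avatar models follow complex text instructions (full-body motion, dynamic camera movement, background transitions, human-object interaction) and the quality of what they generate.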
Details
Motivation: Existing video avatar models align poorly with complex text instructions, such as large full-body movement, dynamic camera trajectories, background transitions, and human-object interactions. Method: JoyAvatar has two key techniques: (1) a twin-teacher enhanced training algorithm that retains the foundation model's text controllability while learning audio-visual synchronization; and (2) dynamic modulation, during training, of the strength of multi-modal conditions such as audio and text according to the denoising timestep, to mitigate conflicts between heterogeneous conditioning signals. Result: In GSB evaluation, JoyAvatar outperforms state-of-the-art models such as Omnihuman-1.5 and KlingAvatar 2.0, and it supports complex applications such as multi-person dialogues and non-human role-playing. Conclusion: JoyAvatar substantially extends the ability of video avatar models to understand and follow complex text instructions, generating natural, temporally coherent full-body motion and dynamic camera movement while preserving lip-sync and identity consistency. Abstract: Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, the majority of these methods exhibit limited alignment with respect to text instructions, particularly when the prompts involve complex elements including large full-body movement, dynamic camera trajectory, background transitions, or human-object interactions. To overcome this limitation, we present JoyAvatar, a framework capable of generating long-duration avatar videos, featuring two key technical innovations. Firstly, we introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text-controllability from the foundation model while simultaneously learning audio-visual synchronization. Secondly, during training, we dynamically modulate the strength of multi-modal conditions (e.g., audio and text) based on the distinct denoising timestep, aiming to mitigate conflicts between the heterogeneous conditioning signals. These two key designs serve to substantially expand the avatar model's capacity to generate natural, temporally coherent full-body motions and dynamic camera movements as well as preserve the basic avatar capabilities, such as accurate lip-sync and identity consistency. GSB evaluation results demonstrate that our JoyAvatar model outperforms state-of-the-art models such as Omnihuman-1.5 and KlingAvatar 2.0. Moreover, our approach enables complex applications including multi-person dialogues and role-playing with non-human subjects.
Some video samples are provided at https://joyavatar.github.io/.
[275] StomataSeg: Semi-Supervised Instance Segmentation for Sorghum Stomatal Components
Zhongtian Huang,Zhi Chen,Zi Huang,Xin Yu,Daniel Smith,Chaitanya Purushothama,Erik Van Oosterom,Alex Wu,William Salter,Yan Li,Scott Chapman
Main category: cs.CV
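TL;DR: The paper proposes a semi-supervised instance segmentation framework for high-throughput phenotyping that segments the small, morphologically variable stomatal structures of sorghum leaves (pore, guard cell, and complex area); patch splitting and pseudo-labelling yield substantial gains in both semantic and instance segmentation.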
Details
Motivation: Sorghum stomata are small (<40 μm) and vary widely in shape, and existing automated segmentation methods struggle with nested small structures and annotation bottlenecks, which holds back water-use-efficiency research and AI-driven phenotyping. Method: The authors build a sorghum stomata image dataset of 11,060 human-annotated patches; split high-resolution microscopy images into overlapping patches; apply pseudo-labelling to unannotated data, producing 56,428 pseudo-labelled patches; and design a semi-supervised instance segmentation framework tailored to stomatal components. Result: The top mIoU for semantic segmentation rises from 65.93% to 70.35% and the top AP for instance segmentation from 28.30% to 46.10%, confirming that patch-based preprocessing combined with semi-supervised learning is effective for fine stomatal structures. Conclusion: The framework enables high-throughput, high-precision extraction of stomatal traits and supports wider use of AI in crop science, particularly in breeding for climate resilience. Abstract: Sorghum is a globally important cereal grown widely in water-limited and stress-prone regions. Its strong drought tolerance makes it a priority crop for climate-resilient agriculture. Improving water-use efficiency in sorghum requires precise characterisation of stomatal traits, as stomatal control of gas exchange, transpiration and photosynthesis has a major influence on crop performance. Automated analysis of sorghum stomata is difficult because the stomata are small (often less than 40 $\mu$m in length in grasses such as sorghum) and vary in shape across genotypes and leaf surfaces. Automated segmentation contributes to high-throughput stomatal phenotyping, yet current methods still face challenges related to nested small structures and annotation bottlenecks. In this paper, we propose a semi-supervised instance segmentation framework tailored for analysis of sorghum stomatal components. We collect and annotate a sorghum leaf imagery dataset containing 11,060 human-annotated patches, covering the three stomatal components (pore, guard cell and complex area) across multiple genotypes and leaf surfaces. To improve the detection of tiny structures, we split high-resolution microscopy images into overlapping small patches. We then apply a pseudo-labelling strategy to unannotated images, producing an additional 56,428 pseudo-labelled patches. Benchmarking across semantic and instance segmentation models shows substantial performance gains: for semantic models the top mIoU increases from 65.93% to 70.35%, whereas for instance models the top AP rises from 28.30% to 46.10%.
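Splitting a large micrograph into overlapping patches, as this abstract describes, can be sketched as follows. This is a minimal illustration rather than the authors' code: the abstract gives no patch size, overlap, or border policy, so the 512-pixel patch, the 128-pixel overlap, and the choice to make the last patch flush with the image border are all assumptions, and the function names are hypothetical.

```python
def patch_origins(length, patch, overlap):
    """Start offsets of patches covering [0, length) that share at least `overlap` pixels."""
    if patch <= overlap:
        raise ValueError("patch must be larger than overlap")
    if length <= patch:
        return [0]  # the image is smaller than one patch along this axis
    stride = patch - overlap
    origins = list(range(0, length - patch, stride))
    origins.append(length - patch)  # final patch sits flush with the border
    return origins

def tile(height, width, patch=512, overlap=128):
    """Return (top, left, bottom, right) boxes for every patch of an image."""
    return [(top, left, min(top + patch, height), min(left + patch, width))
            for top in patch_origins(height, patch, overlap)
            for left in patch_origins(width, patch, overlap)]
```

For a 1000 × 1000 image with these defaults, each axis gets origins 0, 384, and 488, giving nine patches. Overlap means a small structure cut by one patch edge is likely to appear whole in a neighbouring patch.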
These results demonstrate that combining patch-based preprocessing with semi-supervised learning significantly improves the segmentation of fine stomatal structures. The proposed framework supports scalable extraction of stomatal traits and facilitates broader adoption of AI-driven phenotyping in crop science.
[276] Supervised makeup transfer with a curated dataset: Decoupling identity and makeup features for enhanced transformation
Qihe Pan,Yiming Wu,Xing Zhao,Liang Xie,Guodao Sun,Ronghua Liang
Main category: cs.CV
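TL;DR: The paper proposes a diffusion-based method for controllable makeup transfer that builds a high-quality dataset, disentangles identity and makeup features, and adds text guidance, improving fidelity, identity preservation, and editing flexibility.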
Details
Motivation: Existing makeup transfer methods are limited by small datasets, strong entanglement of identity and makeup features, and weak controllability. Method: (1) A curated high-quality dataset is built with a train-generate-filter-retrain strategy; (2) a diffusion-based framework disentangles identity and makeup features; (3) a text-guided mechanism gives fine-grained, region-specific natural-language control over eye, lip, and face makeup. Result: Experiments on benchmarks and real-world scenarios show improved fidelity, identity preservation, and editing flexibility. Conclusion: The method offers a new approach to controllable, high-fidelity, identity-consistent makeup transfer and has practical potential. Abstract: Diffusion models have recently shown strong progress in generative tasks, offering a more stable alternative to GAN-based approaches for makeup transfer. Existing methods often suffer from limited datasets, poor disentanglement between identity and makeup features, and weak controllability. To address these issues, we make three contributions. First, we construct a curated high-quality dataset using a train-generate-filter-retrain strategy that combines synthetic, realistic, and filtered samples to improve diversity and fidelity. Second, we design a diffusion-based framework that disentangles identity and makeup features, ensuring facial structure and skin tone are preserved while applying accurate and diverse cosmetic styles. Third, we propose a text-guided mechanism that allows fine-grained and region-specific control, enabling users to modify eyes, lips, or face makeup with natural language prompts. Experiments on benchmarks and real-world scenarios demonstrate improvements in fidelity, identity preservation, and flexibility. Examples of our dataset can be found at: https://makeup-adapter.github.io.
[277] Diffusion-Driven Inter-Outer Surface Separation for Point Clouds with Open Boundaries
Zhengyan Qin,Liyuan Qiu
Main category: cs.CV
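TL;DR: The paper proposes a diffusion-based algorithm that separates the inner and outer surfaces of double-layered point clouds, targeting the "double surface artifact" caused by asymmetric truncation in TSDF fusion. It handles point clouds with open boundaries, works robustly on both watertight and open-boundary models, and separates 20,000 inner and 20,000 outer points in about 10 seconds.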
Details
Motivation: TSDF fusion with asymmetric truncation thresholds produces a "double surface artifact": spurious inner and outer shells that lead to overlapping surfaces and disordered normals and degrade surface accuracy, especially in indoor and medical 3D reconstruction. Method: A diffusion-based algorithm, designed as post-processing for TSDF fusion output, separates the inner and outer surfaces of double-layered point clouds that have open boundaries (topological openings) rather than missing regions. Result: The true inner surface is extracted from 20,000 inner and 20,000 outer points in about 10 seconds; both watertight and open-boundary models are supported; the method suits practical settings such as indoor modeling and medical imaging. Conclusion: The method is a lightweight, post-hoc module for inner/outer shell separation; it does not replace a full reconstruction pipeline but markedly mitigates the double surface artifact, improving the accuracy of double-layered point-cloud surfaces while remaining efficient. Abstract: We propose a diffusion-based algorithm for separating the inner and outer layer surfaces from double-layered point clouds, particularly those exhibiting the "double surface artifact" caused by truncation in Truncated Signed Distance Function (TSDF) fusion during indoor or medical 3D reconstruction. This artifact arises from asymmetric truncation thresholds, leading to erroneous inner and outer shells in the fused volume, which our method addresses by extracting the true inner layer to mitigate challenges like overlapping surfaces and disordered normals. We focus on point clouds with open boundaries (i.e., sampled surfaces with topological openings/holes through which particles may escape), rather than point clouds with missing surface regions where no samples exist. Our approach enables robust processing of both watertight and open-boundary models, achieving extraction of the inner layer from 20,000 inner and 20,000 outer points in approximately 10 seconds. This solution is particularly effective for applications requiring accurate surface representations, such as indoor scene modeling and medical imaging, where double-layered point clouds are prevalent, and it accommodates both closed (watertight) and open-boundary surface geometries. Our goal is post-hoc inner/outer shell separation as a lightweight module after TSDF fusion; we do not aim to replace full variational or learning-based reconstruction pipelines.
[278] HSI-VAR: Rethinking Hyperspectral Restoration through Spatial-Spectral Visual Autoregression
Xiangming Wang,Benteng Sun,Yungeng Liu,Haijin Zeng,Yongyong Chen,Jingyong Su,Jie Liu
Main category: cs.CV
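TL;DR: The paper proposes HSI-VAR, which recasts hyperspectral image (HSI) restoration as autoregressive generation; with latent-condition alignment, degradation-aware guidance, and a spatial-spectral adaptation module, it recovers structural detail at much lower computational cost and outperforms existing diffusion and regression models.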
Details
Motivation: Real-world hyperspectral images often suffer from composite degradations such as noise, blur, and missing bands; existing generative methods such as diffusion models need many iterative steps and are computationally expensive, while regression models tend to oversmooth and lose key structural detail. Method: HSI-VAR comprises (1) a latent-condition alignment mechanism that ensures semantic consistency; (2) degradation-aware guidance that encodes mixed degradations as linear combinations in the embedding space for automatic control; and (3) a spatial-spectral adaptation module that jointly refines spatial and spectral detail during decoding. Result: State-of-the-art performance on nine all-in-one HSI restoration benchmarks, with a 3.77 dB PSNR gain on ICVL; inference up to 95.5× faster than diffusion-based methods and computational cost reduced by nearly 50%; better structure preservation. Conclusion: Through autoregressive modeling and three key design choices, HSI-VAR balances restoration quality, detail fidelity, and inference efficiency, making it an efficient, practical option for hyperspectral image restoration. Abstract: Hyperspectral images (HSIs) capture richer spatial-spectral information beyond RGB, yet real-world HSIs often suffer from a composite mix of degradations, such as noise, blur, and missing bands. Existing generative approaches for HSI restoration like diffusion models require hundreds of iterative steps, making them computationally impractical for high-dimensional HSIs. Regression models, in turn, tend to produce oversmoothed results, failing to preserve critical structural details. We break this impasse by introducing HSI-VAR, rethinking HSI restoration as an autoregressive generation problem, where spectral and spatial dependencies can be progressively modeled rather than globally reconstructed. HSI-VAR incorporates three key innovations: (1) Latent-condition alignment, which couples semantic consistency between latent priors and conditional embeddings for precise reconstruction; (2) Degradation-aware guidance, which uniquely encodes mixed degradations as linear combinations in the embedding space for automatic control, remarkably achieving a nearly 50% reduction in computational cost at inference; (3) A spatial-spectral adaptation module that refines details across both domains in the decoding phase.
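The idea in item (2), representing a mixed degradation as a linear combination in an embedding space, can be shown with a toy example. The abstract does not say how the basis embeddings or the mixing weights are obtained, so everything concrete below (three basis vectors, their names, fixed weights) is a hypothetical stand-in and not the HSI-VAR design.

```python
def mix_embeddings(basis, weights):
    """Return sum_k weights[k] * basis[k]: one embedding for a mixed degradation."""
    dim = len(next(iter(basis.values())))
    mixed = [0.0] * dim
    for name, weight in weights.items():
        for i, value in enumerate(basis[name]):
            mixed[i] += weight * value
    return mixed

# Hypothetical per-degradation basis embeddings (learned in a real system).
BASIS = {
    "noise": [1.0, 0.0, 0.0],
    "blur": [0.0, 1.0, 0.0],
    "missing_bands": [0.0, 0.0, 1.0],
}

# An input degraded equally by noise and blur, with no missing bands.
guidance = mix_embeddings(BASIS, {"noise": 0.5, "blur": 0.5})
```

A single conditioning vector of this kind can describe any mixture of the known degradations, so the restorer does not need a separate branch per combination.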
Extensive experiments on nine all-in-one HSI restoration benchmarks confirm HSI-VAR's state-of-the-art performance, achieving a 3.77 dB PSNR improvement on ICVL and offering superior structure preservation with an inference speed-up of up to 95.5× compared with diffusion-based methods, making it a highly practical solution for real-world HSI restoration.
[279] Evaluating Deep Learning-Based Nerve Segmentation in Brachial Plexus Ultrasound Under Realistic Data Constraints
Dylan Yves,Khush Agarwal,Jonathan Hoyin Chan,Patcharapit Promoppatum,Aroonkamon Pattanasiricharoen
Main category: cs.CV
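TL;DR: The study evaluates U-Net for nerve segmentation in brachial plexus ultrasound and finds that joint training on multi-device data has a regularizing effect but does not beat single-source training matched to the target domain, that multi-class supervision lowers nerve segmentation accuracy, and that nerve size correlates moderately and positively with segmentation accuracy.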
Details
Motivation: In ultrasound-guided regional anesthesia, manual nerve identification is difficult because of low image contrast, strong speckle noise, and large anatomical variation between patients, so reliable automatic nerve segmentation is needed. Method: A U-Net is used for nerve segmentation in brachial plexus ultrasound images; the study systematically compares strategies for combining data from different ultrasound machines (SIEMENS ACUSON NX3 Elite and Philips EPIQ5) and binary versus multi-class (artery, vein, nerve, muscle) annotation, and analyzes the correlation between nerve size and segmentation accuracy. Result: Joint multi-device training improves generalization on data from lower-performing devices but does not surpass single-source training matched to the target domain; multi-class supervision lowers nerve Dice scores by 9% to 61%; nerve size correlates moderately and positively with segmentation accuracy (r=0.587, p<0.001). Conclusion: Matching the data source and choosing the annotation granularity are critical to model performance; small nerves remain the main challenge; the study offers methodological guidance for building robust ultrasound nerve segmentation systems under clinical constraints. Abstract: Accurate nerve localization is critical for the success of ultrasound-guided regional anesthesia, yet manual identification remains challenging due to low image contrast, speckle noise, and inter-patient anatomical variability. This study evaluates deep learning-based nerve segmentation in ultrasound images of the brachial plexus using a U-Net architecture, with a focus on how dataset composition and annotation strategy influence segmentation performance. We find that training on combined data from multiple ultrasound machines (SIEMENS ACUSON NX3 Elite and Philips EPIQ5) provides regularization benefits for lower-performing acquisition sources, though it does not surpass single-source training when matched to the target domain. Extending the task from binary nerve segmentation to multi-class supervision (artery, vein, nerve, muscle) results in decreased nerve-specific Dice scores, with performance drops ranging from 9% to 61% depending on dataset, likely due to class imbalance and boundary ambiguity. Additionally, we observe a moderate positive correlation between nerve size and segmentation accuracy (Pearson r=0.587, p<0.001), indicating that smaller nerves remain a primary challenge. These findings provide methodological guidance for developing robust ultrasound nerve segmentation systems under realistic clinical data constraints.
[280] DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning
Wenhao Li,Xianjing Meng,Qiangchang Wang,Zhongyi Han,Zhibin Wu,Yilong Yin
Main category: cs.CV
TL;DR: 本文提出DVLA-RL方法,通过双层次语义构建(DSC)和强化学习门控注意力(RLA),实现视觉与语言从低级到高级语义的渐进式、自适应对齐,显著提升少样本学习性能。
Details
Motivation: Existing few-shot learning methods based on large language models overlook progressive, adaptive alignment between vision and language across semantic levels, which limits the semantic gains. Method: The DVLA-RL framework consists of Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA): DSC conditions an LLM on class names and support samples to generate and select discriminative attributes and synthesize hierarchical class descriptions; RLA treats cross-modal fusion as a sequential decision process and uses a lightweight REINFORCE policy to adjust the weights of self-attention and cross-attention in each network layer. Result: New state-of-the-art performance on nine benchmarks across three few-shot learning scenarios. Conclusion: With dual-level semantic construction and RL gating, DVLA-RL achieves finer vision-language alignment, combining fine-grained grounding with holistic class understanding from only a few support samples and improving few-shot generalization. Abstract: Few-shot learning (FSL) aims to generalize to novel categories with only a few samples. Recent approaches incorporate large language models (LLMs) to enrich visual representations with semantic embeddings derived from class names. However, they overlook progressive and adaptive alignment between vision and language from low-level to high-level semantics, resulting in limited semantic gains. To address these challenges, we propose Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL), which consists of Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA). Specifically, DSC conditions LLMs on both class names and support samples to generate discriminative attributes, progressively selects the most relevant ones, and then synthesizes them into coherent class descriptions. This process provides complementary low-level attributes and high-level descriptions, enabling both fine-grained grounding and holistic class understanding. To dynamically integrate dual-level semantics along with the visual network layers, RLA formulates cross-modal fusion as a sequential decision process. A lightweight policy trained with episodic REINFORCE adaptively adjusts the contributions of self-attention and cross-attention to integrate textual and visual tokens. As a result, shallow layers refine local attributes and deep layers emphasize global semantics, enabling more precise cross-modal alignment. This achieves class-specific discrimination and generalized representations with merely a few support samples.
DVLA-RL achieves new state-of-the-art performance across nine benchmarks in three diverse FSL scenarios.
[281] Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds
Xianzhe Fan,Shengliang Deng,Xiaoyang Wu,Yuxiang Lu,Zhuoling Li,Mi Yan,Yujia Zhang,Zhizheng Zhang,He Wang,Hengshuang Zhao
Main category: cs.CV
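TL;DR: The paper proposes Any3D-VLA, which fuses point clouds with 2D images to improve the 3D spatial understanding of vision-language-action (VLA) models and to ease the challenges of data scarcity and cross-environment domain gaps.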
Details
Motivation: Existing VLA models take 2D images as input, which limits spatial understanding; 3D information is needed to strengthen perception and decision-making in complex scenes. Method: The Any3D-VLA framework unifies point clouds from simulators, sensors, and model estimation, builds diverse 3D data, and learns domain-agnostic 3D representations that are fused with 2D representations. Result: In simulation and real-world experiments, Any3D-VLA improves task performance and mitigates the domain gap. Conclusion: Explicitly introducing point clouds and fusing 2D and 3D representations is an effective way to improve spatial understanding in VLA models, and Any3D-VLA offers a workable answer to 3D data scarcity and domain shift. Abstract: Existing Vision-Language-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can we incorporate 3D information to enhance VLA capabilities? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations. To address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases, we propose Any3D-VLA. It unifies simulator, sensor, and model-estimated point clouds within a training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations. Simulation and real-world experiments demonstrate Any3D-VLA's advantages in improving performance and mitigating the domain gap. Our project homepage is available at https://xianzhefan.github.io/Any3D-VLA.github.io.
[282] VVLoc: Prior-free 3-DoF Vehicle Visual Localization
Ze Huang,Zhongyang Xiao,Mingliang Song,Longan Yang,Hongyuan Yuan,Li Sun
Main category: cs.CV
TL;DR: VVLoc is a unified neural network pipeline for concurrent topological and metric vehicle localization using multi-camera inputs, providing confidence measures and requiring only paired visual data and ground-truth poses for efficient training.
Details
Motivation: Conventional localization methods handle topological and metric tasks separately, rely on single cameras, need extra 3D semantic or pose priors, and lack confidence quantification, which limits real-world industrial applicability. Method: VVLoc uses a single neural network on multi-camera inputs to jointly perform topological localization (via geo-proximity evaluation) and metric localization (via relative pose estimation using matching), while outputting confidence scores; it is trained efficiently on image pairs and ground-truth poses only. Result: VVLoc achieves state-of-the-art accuracy on public benchmarks and a more challenging self-collected dataset across diverse localization tasks. Conclusion: VVLoc demonstrates that unified, multi-camera, confidence-aware localization is feasible with simplified training requirements, enhancing practicality for autonomous driving systems. Abstract: Localization is a critical technology in autonomous driving, encompassing both topological localization, which identifies the most similar map keyframe to the current observation, and metric localization, which provides precise spatial coordinates. Conventional methods typically address these tasks independently, rely on single-camera setups, and often require additional 3D semantic or pose priors, while lacking mechanisms to quantify the confidence of localization results, making them less feasible for real industrial applications. In this paper, we propose VVLoc, a unified pipeline that employs a single neural network to concurrently achieve topological and metric vehicle localization using a multi-camera system. VVLoc first evaluates the geo-proximity between visual observations, then estimates their relative metric poses using a matching strategy, while also providing a confidence measure.
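The two-stage flow in this abstract (pick the closest map keyframe, then refine with a relative pose) ends in a standard 3-DoF pose composition, which can be sketched as follows. This is an illustration under stated assumptions, not VVLoc's code: it assumes poses are (x, y, yaw) tuples, that the relative pose is expressed in the keyframe's frame, and that a fixed confidence threshold gates the output. The candidate dictionary layout, the threshold value, and the function names are hypothetical.

```python
import math

def compose_se2(keyframe_pose, relative_pose):
    """Global (x, y, yaw) from a keyframe pose and a pose relative to that keyframe."""
    x, y, yaw = keyframe_pose
    dx, dy, dyaw = relative_pose
    # Rotate the relative translation into the map frame, then translate.
    gx = x + math.cos(yaw) * dx - math.sin(yaw) * dy
    gy = y + math.sin(yaw) * dx + math.cos(yaw) * dy
    # Wrap the summed heading back into (-pi, pi].
    gyaw = math.atan2(math.sin(yaw + dyaw), math.cos(yaw + dyaw))
    return gx, gy, gyaw

def localize(candidates, min_confidence=0.5):
    """Topological step (best proximity) followed by the metric step (composition)."""
    best = max(candidates, key=lambda c: c["proximity"])
    if best["confidence"] < min_confidence:
        return None  # report no fix rather than an unreliable pose
    return compose_se2(best["keyframe_pose"], best["relative_pose"])
```

Returning no pose when confidence is low is one simple use of a confidence measure; a production system might instead fall back to odometry or widen the keyframe search.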
Additionally, the training process for VVLoc is highly efficient, requiring only pairs of visual data and corresponding ground-truth poses, eliminating the need for complex supplementary data. We evaluate VVLoc not only on publicly available datasets, but also on a more challenging self-collected dataset, demonstrating its ability to deliver state-of-the-art localization accuracy across a wide range of localization tasks.
[283] Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval
Tong Wang,Yunhan Zhao,Shu Kong
Main category: cs.CV
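TL;DR: The paper proposes Paracosm, which uses a large multimodal model (LMM) to directly generate the "mental image" described by a query and generates a synthetic counterpart for each real database image to narrow the domain gap, enabling training-free zero-shot composed image retrieval (CIR).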
Details
Motivation: Existing CIR methods rely on an LMM to generate a textual description and then use a VLM for text-image matching, but the "mental image" itself is not physically available and only implicitly defined, so this indirect modeling limits performance; the paper instead works from first principles and generates and matches the "mental image" directly. Method: An LMM generates the "mental image" directly from the multimodal query (reference image plus modification text); a synthetic counterpart is generated for each real image in the database, forming a unified synthetic domain (the "paracosm") in which matching takes place; the whole pipeline requires no training. Result: Paracosm significantly outperforms existing zero-shot CIR methods on four challenging benchmarks, reaching state-of-the-art zero-shot CIR performance. Conclusion: Directly generating and matching the "mental image" is a more fundamental and effective paradigm for zero-shot CIR; Paracosm shows that constructing a synthetic domain with an LMM alone, without training, can substantially raise retrieval accuracy. Abstract: Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a "mental image", based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this "mental image" is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search the target image. In contrast, we address CIR from first principles by directly generating the "mental image" for more accurate matching. Particularly, we prompt an LMM to generate a "mental image" for a given multimodal query and propose to use this "mental image" to search for the target image. As the "mental image" has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses an LMM to construct a "paracosm", where it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method.
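Once the "mental image" and the synthetic counterparts exist, the retrieval step reduces to matching within one domain. The sketch below shows that final step only and rests on assumptions the abstract does not confirm: that every image has already been turned into an embedding vector by some encoder, and that cosine similarity is the matching score. The image generation itself is out of scope here, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_database(mental_embedding, synthetic_embeddings):
    """Rank database image ids by similarity of their synthetic counterparts to the query.

    mental_embedding: embedding of the generated mental image for the query.
    synthetic_embeddings: dict mapping each real image id to the embedding of
    its synthetic counterpart.
    """
    scores = {image_id: cosine(mental_embedding, emb)
              for image_id, emb in synthetic_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because both sides of the comparison are synthetic, the score is not skewed by the synthetic-to-real gap that a direct comparison with the original database images would introduce.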
It significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
[284] Edge-Native Generative De-identification: Inversion-Free Flow for Privacy-Preserving Federated Skin Image Analysis
Konstantinos Moutselos,Ilias Maglogiannis
Main category: cs.CV
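TL;DR: The paper proposes a privacy-preserving federated learning framework for clinical dermatology that uses inversion-free Rectified Flow Transformers (FlowEdit) for near real-time, high-fidelity identity anonymization on edge devices with pathological features preserved, and a "Segment-by-Synthesis" mechanism that generates healthy/pathological twin image pairs to extract differential erythema masks independent of biometric features, protecting patient privacy while maintaining diagnostic accuracy.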
Details
Motivation: Deploying federated learning in clinical dermatology is held back by the tension between protecting patient privacy and preserving diagnostic features: traditional de-identification degrades pathological fidelity, while mainstream generative editing relies on computationally intensive inversion that is impractical on resource-constrained edge devices. Method: A client-side privacy-preserving framework is built on inversion-free Rectified Flow Transformers (FlowEdit); a "Segment-by-Synthesis" mechanism generates healthy and pathological twin image pairs locally and extracts differential erythema masks decoupled from biometric markers and semantic artifacts (e.g., jewelry). Result: Pilot validation on high-resolution clinical samples shows erythema-mask IoU stability above 0.67 across synthetic identities; the system completes high-fidelity identity transformation in under 20 seconds, supporting deployment at the edge; the risk of gradient leakage is mitigated. Conclusion: The framework offers a secure, workable path to high-precision skin image analysis in federated settings that reconciles privacy compliance with pathological fidelity, and is especially suited to clinical nodes with limited edge compute. Abstract: The deployment of Federated Learning (FL) for clinical dermatology is hindered by the competing requirements of protecting patient privacy and preserving diagnostic features. Traditional de-identification methods often degrade pathological fidelity, while standard generative editing techniques rely on computationally intensive inversion processes unsuitable for resource-constrained edge devices. We propose a framework for identity-agnostic pathology preservation that serves as a client-side privacy-preserving utility. By leveraging inversion-free Rectified Flow Transformers (FlowEdit), the system performs high-fidelity identity transformation in near real-time (less than 20s), facilitating local deployment on clinical nodes. We introduce a "Segment-by-Synthesis" mechanism that generates counterfactual healthy and pathological twin pairs locally. This enables the extraction of differential erythema masks that are decoupled from biometric markers and semantic artifacts (e.g. jewelry). Pilot validation on high-resolution clinical samples demonstrates an Intersection over Union (IoU) stability greater than 0.67 across synthetic identities. By generating privacy-compliant synthetic surrogates at the edge, this framework mitigates the risk of gradient leakage at the source, providing a secure pathway for high-precision skin image analysis in federated environments.
[285] TransNormal: Dense Visual Semantics for Diffusion-based Transparent Object Normal Estimation
Mingwei Li,Hehe Fan,Yi Yang
Main category: cs.CV
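TL;DR: The paper proposes TransNormal, a framework that adapts pre-trained diffusion priors for single-step normal regression and combines DINOv3 semantic features, multi-task learning, and wavelet-based regularization to improve monocular normal estimation for transparent objects; it also introduces TransNormal-Synthetic, a high-quality synthetic dataset.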
Details
Motivation: Complex refraction and reflection in transparent objects cause conventional depth and normal sensors to fail, which limits the use of embodied AI in laboratory automation. Method: TransNormal (1) adapts pre-trained diffusion priors for single-step normal regression; (2) integrates dense visual semantics from DINOv3 via cross-attention to strengthen geometric cues; (3) adds a multi-task learning objective and wavelet-based regularization to preserve fine structural detail; and (4) introduces TransNormal-Synthetic, a physics-based dataset. Result: On the ClearGrasp benchmark, mean error falls by 24.4% and 11.25° accuracy improves by 22.8%; on ClearPose, mean error falls by 15.2%. Conclusion: TransNormal addresses the difficulty of normal estimation for transparent objects and provides reliable geometric perception for embodied AI in scientific environments. Abstract: Monocular normal estimation for transparent objects is critical for laboratory automation, yet it remains challenging due to complex light refraction and reflection. These optical properties often lead to catastrophic failures in conventional depth and normal sensors, hindering the deployment of embodied AI in scientific environments. We propose TransNormal, a novel framework that adapts pre-trained diffusion priors for single-step normal regression. To handle the lack of texture in transparent surfaces, TransNormal integrates dense visual semantics from DINOv3 via a cross-attention mechanism, providing strong geometric cues. Furthermore, we employ a multi-task learning objective and wavelet-based regularization to ensure the preservation of fine-grained structural details. To support this task, we introduce TransNormal-Synthetic, a physics-based dataset with high-fidelity normal maps for transparent labware. Extensive experiments demonstrate that TransNormal significantly outperforms state-of-the-art methods: on the ClearGrasp benchmark, it reduces mean error by 24.4% and improves 11.25° accuracy by 22.8%; on ClearPose, it achieves a 15.2% reduction in mean error. The code and dataset will be made publicly available at https://longxiang-ai.github.io/TransNormal.
[286] Invariance on Manifolds: Understanding Robust Visual Representations for Place Recognition
Jintao Cheng,Weibin Li,Zhijian He,Jin Wu,Chi Man Vong,Wei Zhang
Main category: cs.CV
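TL;DR: This paper proposes a training-free second-order geometric statistics framework for visual place recognition (VPR) that models scenes as covariance descriptors on the symmetric positive definite (SPD) manifold and uses Riemannian mappings to obtain a noise-robust linearized embedding.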
Details
Motivation: Existing VPR methods either depend on large amounts of annotated data or use only first-order statistics, neglecting the intrinsic structural correlations of scenes, and therefore struggle with drastic environmental and viewpoint changes. Method: Represents scenes as covariance descriptors on the SPD manifold and projects them into Euclidean space with geometry-aware Riemannian mappings; the whole framework is built on fixed pre-trained backbones and requires no parameter updates. Result: Achieves performance comparable to or better than state-of-the-art methods on multiple benchmarks, and stands out in zero-shot cross-domain scenarios. Conclusion: Second-order geometric statistics combined with manifold mappings improve the generalization and robustness of VPR without any training, offering a new paradigm for zero-shot visual localization. Abstract: Visual Place Recognition (VPR) demands representations robust to drastic environmental and viewpoint shifts. Current aggregation paradigms, however, either rely on data-hungry supervision or simplistic first-order statistics, often neglecting intrinsic structural correlations. In this work, we propose a Second-Order Geometric Statistics framework that inherently captures geometric stability without training. We conceptualize scenes as covariance descriptors on the Symmetric Positive Definite (SPD) manifold, where perturbations manifest as tractable congruence transformations. By leveraging geometry-aware Riemannian mappings, we project these descriptors into a linearized Euclidean embedding, effectively decoupling signal structure from noise. Our approach introduces a training-free framework built upon fixed, pre-trained backbones, achieving strong zero-shot generalization without parameter updates. Extensive experiments confirm that our method achieves highly competitive performance against state-of-the-art baselines, particularly excelling in challenging zero-shot scenarios.[287] Distill3R: A Pipeline for Democratizing 3D Foundation Models on Commodity Hardware
Brandon Leblanc,Charalambos Poullis
Main category: cs.CV
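TL;DR: Distill3R is a knowledge distillation framework that compresses the geometric reasoning of large 3D foundation models into a small student model trainable on a single workstation, substantially lowering the compute barrier.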
Details
Motivation: Large multi-view 3D reconstruction foundation models require massive compute to train, which keeps most academic labs out of this line of research; the compute barrier needs to be lowered. Method: Proposes the Distill3R framework with two core innovations: (1) an offline caching pipeline that decouples teacher inference from the training loop through compressed supervision signals; (2) a confidence-aware distillation loss that uses teacher uncertainty to enable low-cost training. A 72M-parameter student model is designed. Result: The student has 9x fewer parameters and 5x faster inference, and trains in under 3 days on a single workstation (the teacher requires large GPU clusters for up to a week), while preserving structural consistency and geometric understanding. Conclusion: Distill3R gives labs without large-scale compute a reproducible, low-cost entry point for 3D vision research and edge deployment; it is positioned as an accessible research baseline rather than a challenger to the state of the art. Abstract: While multi-view 3D reconstruction has shifted toward large-scale foundation models capable of inferring globally consistent geometry, their reliance on massive computational clusters for training has created a significant barrier to entry for most academic laboratories. To bridge this compute divide, we introduce Distill3R, a framework designed to distill the geometric reasoning of 3D foundation models into compact students fully trainable on a single workstation. Our methodology centers on two primary innovations: (1) an offline caching pipeline that decouples heavy teacher inference from the training loop through compressed supervision signals, and (2) a confidence-aware distillation loss that leverages teacher uncertainty to enable training on commodity hardware. We propose a 72M-parameter student model which achieves a 9x reduction in parameters and a 5x inference speedup compared to its 650M-parameter teacher. The student is fully trainable in under 3 days on a single workstation, whereas its teacher requires massive GPU clusters for up to a week. We demonstrate that the student preserves the structural consistency and qualitative geometric understanding required for functional 3D awareness. By providing a reproducible, single-workstation training recipe, Distill3R serves as an exploratory entry point for democratized 3D vision research and efficient edge deployment. 
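To make the confidence-aware distillation idea above concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes the loss is an L1 distance between student and teacher predictions, weighted by a per-element teacher confidence and normalized by the total confidence. The function name `confidence_weighted_l1` and the `eps` parameter are hypothetical, introduced only for this example.

```python
from typing import Sequence


def confidence_weighted_l1(
    student: Sequence[float],
    teacher: Sequence[float],
    confidence: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """Mean absolute error between student and teacher, weighted by teacher confidence.

    Assumption for illustration: elements where the teacher is uncertain
    (low confidence) contribute less to the loss.
    """
    if not (len(student) == len(teacher) == len(confidence)):
        raise ValueError("student, teacher and confidence must have equal length")
    weighted_error = sum(
        c * abs(s - t) for s, t, c in zip(student, teacher, confidence)
    )
    # Normalize by total confidence so the loss scale does not depend on
    # how many elements the teacher is confident about.
    return weighted_error / (sum(confidence) + eps)
```

With this weighting, an element the teacher marks with zero confidence is ignored entirely, so a large student-teacher disagreement there does not affect the gradient signal. Whether the paper normalizes the loss this way is not stated in the abstract.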
This work is not intended to compete with state-of-the-art foundation models, but to provide an accessible research baseline for laboratories without access to large-scale compute to train and specialize models on their own domain-specific data at minimal cost.[288] DIAMOND: Directed Inference for Artifact Mitigation in Flow Matching Models
Alicja Polowczyk,Agnieszka Polowczyk,Piotr Borycki,Joanna Waczyńska,Jacek Tabor,Przemysław Spurek
Main category: cs.CV
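TL;DR: This paper proposes DIAMOND, a training-free inference-time trajectory correction method that leaves model weights untouched and actively suppresses visual and anatomical artifacts during generation in models such as FLUX, improving image fidelity.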
Details
Motivation: Text-to-image models such as FLUX perform well but often produce visual and anatomical artifacts; current artifact-reduction methods mostly work post hoc, modify model weights, or depend on computationally expensive regional refinement, and they do not intervene effectively in the core stage of generation. Method: DIAMOND is a training-free method that corrects the latent trajectory at inference time: at every step it reconstructs an estimate of the clean sample and steers generation away from latent states that lead to artifacts; the method also extends to standard diffusion models. Result: Without additional training or weight modification, DIAMOND markedly reduces artifacts in text-to-image generation and improves image quality and anatomical plausibility, with zero-shot applicability. Conclusion: DIAMOND offers a lightweight, general, plug-and-play approach to artifact suppression, opening a retraining-free path to high-fidelity, professional-grade generative modeling. Abstract: Despite impressive results from recent text-to-image models like FLUX, visual and anatomical artifacts remain a significant hurdle for practical and professional use. Existing methods for artifact reduction typically work in a post-hoc manner, consequently failing to intervene effectively during the core image formation process. Notably, current techniques require problematic and invasive modifications to the model weights, or depend on a computationally expensive and time-consuming process of regional refinement. To address these limitations, we propose DIAMOND, a training-free method that applies trajectory correction to mitigate artifacts during inference. By reconstructing an estimate of the clean sample at every step of the generative trajectory, DIAMOND actively steers the generation process away from latent states that lead to artifacts. Furthermore, we extend the proposed method to standard Diffusion Models, demonstrating that DIAMOND provides a robust, zero-shot path to high-fidelity, artifact-free image synthesis without the need for additional training or weight modifications in modern generative architectures. Code is available at https://gmum.github.io/DIAMOND/[289] OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection
Kunal Mahatha,Ali Bahri,Pierre Marza,Sahar Dastani,Maria Vakalopoulou,Stergios Christodoulidis,Jose Dolz,Christian Desrosiers
Main category: cs.CV
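TL;DR: OCTOPUS is a new vision state space model that uses discrete recurrence along eight directions to capture both global context and local spatial structure in images while keeping linear complexity.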
Details
Motivation: Standard state space models (SSMs) are limited on vision tasks because their causal sequence modeling breaks the inherent spatial relationships of images and fails to capture correlations among neighboring pixels or patches. Method: Proposes the OCTOPUS architecture, which performs forward or backward discrete recurrence along eight principal orientations (horizontal, vertical, and diagonal), exchanging information across spatially connected regions while keeping unrelated regions independent, thereby supporting multi-directional recurrence. Result: On classification and segmentation benchmarks, OCTOPUS notably improves boundary preservation and region consistency and achieves better classification accuracy than existing V-SSM models. Conclusion: OCTOPUS provides a scalable foundation for vision models that are both spatially aware and computationally efficient, and establishes multi-directional recurrence as an effective core mechanism. Abstract: State space models (SSMs) have recently emerged as an alternative to transformers due to their ability to model global relationships in text with linear complexity. However, their success in vision tasks has been limited due to their causal formulation, which is suitable for sequential text but detrimental in the spatial domain where causality breaks the inherent spatial relationships among pixels or patches. As a result, standard SSMs fail to capture local spatial coherence, often linking non-adjacent patches while ignoring neighboring ones that are visually correlated. To address these limitations, we introduce OCTOPUS, a novel architecture that preserves both global context and local spatial structure within images, while maintaining the linear complexity of SSMs. OCTOPUS performs discrete recurrence along eight principal orientations, going forward or backward in the horizontal, vertical, and diagonal directions, allowing effective information exchange across all spatially connected regions while maintaining independence among unrelated patches. This design enables multi-directional recurrence, capturing both global context and local spatial structure with SSM-level efficiency. In our classification and segmentation benchmarks, OCTOPUS demonstrates notable improvements in boundary preservation and region consistency, as evident from the segmentation results, while maintaining relatively better classification accuracy compared to existing V-SSM based models. 
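The eight-orientation traversal described above can be sketched in a few lines of Python. This is one possible reading of the abstract, not the OCTOPUS implementation: it assumes each orientation splits the patch grid into independent chains (rows, columns, or diagonals), so that a recurrence run along a chain only links spatially connected patches. The names `DIRECTIONS` and `scan_chains` are hypothetical.

```python
from typing import Dict, List, Tuple

# Eight principal orientations as (row step, column step): forward and backward
# along the horizontal, the vertical, and the two diagonals.
DIRECTIONS: Dict[str, Tuple[int, int]] = {
    "east": (0, 1), "west": (0, -1),
    "south": (1, 0), "north": (-1, 0),
    "south_east": (1, 1), "north_west": (-1, -1),
    "south_west": (1, -1), "north_east": (-1, 1),
}


def scan_chains(
    height: int, width: int, direction: Tuple[int, int]
) -> List[List[Tuple[int, int]]]:
    """Split an H x W patch grid into independent chains along one orientation."""
    dr, dc = direction
    chains = []
    for r in range(height):
        for c in range(width):
            prev_r, prev_c = r - dr, c - dc
            if 0 <= prev_r < height and 0 <= prev_c < width:
                continue  # this patch has a predecessor, so it is not a chain start
            chain = []
            cur_r, cur_c = r, c
            while 0 <= cur_r < height and 0 <= cur_c < width:
                chain.append((cur_r, cur_c))
                cur_r, cur_c = cur_r + dr, cur_c + dc
            chains.append(chain)
    return chains
```

Under this reading, every patch appears in exactly one chain per orientation, so the total work per orientation stays linear in the number of patches, and patches on different rows, columns, or diagonals never exchange state within a single scan.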
These results suggest that OCTOPUS can serve as a foundation for multi-directional recurrence, a scalable and effective mechanism for building spatially aware and computationally efficient vision architectures.[290] ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models
Dhruv Parikh,Haoyang Fan,Rajgopal Kannan,Viktor Prasanna
Main category: cs.CV
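TL;DR: This paper proposes ConsensusDrop, a training-free visual token compression framework for vision-language models (VLMs) that fuses the coarse saliency of the vision encoder with query-aware cross-attention signals from the large language model (LLM) into a consensus ranking and combines it with token merging, sharply reducing the number of visual tokens while maintaining high accuracy.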
Details
Motivation: Existing visual token compression methods use either vision-encoder saliency (query-agnostic) or LLM cross-attention (query-aware but sparse and costly); neither is sufficient alone. Fusing the two should improve performance, but the signals become available at different times and are hard to combine. Method: ConsensusDrop is a training-free framework: it first reconciles vision-encoder saliency with query-aware cross-attention to produce a consensus ranking, then keeps the most informative tokens according to that ranking and compresses the remaining tokens through encoder-guided merging. Result: Evaluated on several open-source VLMs including LLaVA-1.5/NeXT and Video-LLaVA, ConsensusDrop consistently outperforms prior pruning methods under the same token budget and improves the accuracy-efficiency Pareto frontier: it retains near-baseline accuracy even under aggressive token reduction while lowering time to first token (TTFT) and KV cache footprint. Conclusion: Fusing multimodal saliency signals and properly reconciling their heterogeneous characteristics is key to efficient visual token compression; ConsensusDrop is a practical, general, training-free solution for lightweight VLM deployment. Abstract: Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. Existing token reduction methods typically exploit either vision-encoder saliency (broad but query-agnostic) or LLM cross-attention (query-aware but sparse and costly). We show that neither signal alone is sufficient: fusing them consistently improves performance compared to unimodal visual token selection (ranking). However, making such fusion practical is non-trivial: cross-modal saliency is usually only available inside the LLM (too late for efficient pre-LLM pruning), and the two signals are inherently asymmetric, so naive fusion underutilizes their complementary strengths. We propose ConsensusDrop, a training-free framework that derives a consensus ranking by reconciling vision encoder saliency with query-aware cross-attention, retaining the most informative tokens while compressing the remainder via encoder-guided token merging. Across LLaVA-1.5/NeXT, Video-LLaVA, and other open-source VLMs, ConsensusDrop consistently outperforms prior pruning methods under identical token budgets and delivers a stronger accuracy-efficiency Pareto frontier, preserving near-baseline accuracy even at aggressive token reductions while reducing TTFT and KV cache footprint. Our code will be open-sourced.[291] Data Augmentation for High-Fidelity Generation of CAR-T/NK Immunological Synapse Images
Xiang Zhang,Boxuan Zhang,Alireza Naghizadeh,Mohab Mohamed,Dongfang Liu,Ruixiang Tang,Dimitris Metaxas,Dongfang Liu
Main category: cs.CV
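TL;DR: This paper proposes a dual-framework data augmentation approach that combines Instance Aware Automatic Augmentation (IAAA) and Semantic-Aware AI Augmentation (SAAA) to generate high-quality, multi-modality synthetic images of the CAR-T/NK immunological synapse (IS) with matching segmentation masks, significantly improving ANN-based IS detection and segmentation in small-sample settings and supporting more reliable imaging biomarkers.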
Details
Motivation: In CAR-T/NK cell immunotherapy, the quality of the immunological synapse (IS) is a functional biomarker for predicting efficacy, but annotated microscopy datasets are small, which limits the generalization of artificial neural networks. Method: Combines two complementary data augmentation frameworks: 1) Instance Aware Automatic Augmentation (IAAA), which automatically optimizes augmentation policies, preserves instance integrity, and applies to multiple imaging modalities; 2) Semantic-Aware AI Augmentation (SAAA), which pairs a diffusion model that generates semantically plausible segmentation masks with a Pix2Pix image synthesizer that produces high-fidelity images aligned with those masks. Result: The synthetic images closely match real IS data visually and structurally, and significantly improve the accuracy and robustness of CAR-T/NK IS detection and segmentation. Conclusion: The dual augmentation strategy eases the small-sample bottleneck and makes quantitative IS analysis more reliable, providing key technical support for imaging-based biomarkers that predict CAR-T/NK efficacy. Abstract: Chimeric antigen receptor (CAR)-T and NK cell immunotherapies have transformed cancer treatment, and recent studies suggest that the quality of the CAR-T/NK cell immunological synapse (IS) may serve as a functional biomarker for predicting therapeutic efficacy. Accurate detection and segmentation of CAR-T/NK IS structures using artificial neural networks (ANNs) can greatly increase the speed and reliability of IS quantification. However, a persistent challenge is the limited size of annotated microscopy datasets, which restricts the ability of ANNs to generalize. To address this challenge, we integrate two complementary data-augmentation frameworks. First, we employ Instance Aware Automatic Augmentation (IAAA), an automated, instance-preserving augmentation method that generates synthetic CAR-T/NK IS images and corresponding segmentation masks by applying optimized augmentation policies to original IS data. IAAA supports multiple imaging modalities (e.g., fluorescence and brightfield) and can be applied directly to CAR-T/NK IS images derived from patient samples. In parallel, we introduce a Semantic-Aware AI Augmentation (SAAA) pipeline that combines a diffusion-based mask generator with a Pix2Pix conditional image synthesizer. This second method enables the creation of diverse, anatomically realistic segmentation masks and produces high-fidelity CAR-T/NK IS images aligned with those masks, further expanding the training corpus beyond what IAAA alone can provide. 
Together, these augmentation strategies generate synthetic images whose visual and structural properties closely match real IS data, significantly improving CAR-T/NK IS detection and segmentation performance. By enhancing the robustness and accuracy of IS quantification, this work supports the development of more reliable imaging-based biomarkers for predicting patient response to CAR-T/NK immunotherapy.[292] Hybrid Topological and Deep Feature Fusion for Accurate MRI-Based Alzheimer's Disease Severity Classification
Faisal Ahmed
Main category: cs.CV
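TL;DR: This paper proposes a hybrid deep learning framework that combines Topological Data Analysis (TDA) with DenseNet121 for four-class Alzheimer's disease diagnosis from structural MRI, reporting 99.93% accuracy and 100% AUC on the OASIS dataset.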
Details
Motivation: Early and accurate diagnosis of Alzheimer's disease (AD) remains a key challenge for neuroimaging-based clinical decision support systems, and conventional deep networks tend to overlook the topological properties of brain structures. Method: Fuses complementary topological features extracted by Topological Data Analysis (TDA) with the hierarchical spatial features that DenseNet121 learns from MRI slices to build a four-class AD staging classifier. Result: On the OASIS-1 Kaggle MRI dataset, the model reaches 99.93% accuracy and 100% AUC, outperforming existing CNN, transfer learning, ensemble, and multi-scale methods. Conclusion: Incorporating topological insight into deep learning pipelines can substantially improve AD diagnosis, and the proposed framework shows high robustness and potential for clinical use. Abstract: Early and accurate diagnosis of Alzheimer's disease (AD) remains a critical challenge in neuroimaging-based clinical decision support systems. In this work, we propose a novel hybrid deep learning framework that integrates Topological Data Analysis (TDA) with a DenseNet121 backbone for four-class Alzheimer's disease classification using structural MRI data from the OASIS dataset. TDA is employed to capture complementary topological characteristics of brain structures that are often overlooked by conventional neural networks, while DenseNet121 efficiently learns hierarchical spatial features from MRI slices. The extracted deep and topological features are fused to enhance class separability across the four AD stages. Extensive experiments conducted on the OASIS-1 Kaggle MRI dataset demonstrate that the proposed TDA+DenseNet121 model significantly outperforms existing state-of-the-art approaches. The model achieves an accuracy of 99.93% and an AUC of 100%, surpassing recently published CNN-based, transfer learning, ensemble, and multi-scale architectures. These results confirm the effectiveness of incorporating topological insights into deep learning pipelines and highlight the potential of the proposed framework as a robust and highly accurate tool for automated Alzheimer's disease diagnosis.[293] Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning
Meng Luo,Bobo Li,Shanqing Xu,Shize Zhang,Qiuchan Chen,Menglu Han,Wenhao Chen,Yanxiang Huang,Hao Fei,Mong-Li Lee,Wynne Hsu
Main category: cs.CV
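TL;DR: This paper proposes HitEmotion, a hierarchical emotion-understanding benchmark grounded in Theory of Mind (ToM), together with a ToM-guided reasoning chain and the reinforcement learning method TMPO, to strengthen the deep emotional understanding of multimodal large language models (MLLMs).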
Details
Motivation: Existing MLLMs have limited capacity for deep emotional understanding, while genuine affective intelligence has to rest on Theory of Mind (ToM) as its cognitive foundation. Method: Builds the ToM-driven hierarchical benchmark HitEmotion; designs a ToM-guided cross-modal reasoning chain; proposes TMPO, a reinforcement learning method that uses intermediate mental states as supervision signals. Result: HitEmotion exposes the weaknesses of state-of-the-art models on high-level cognitive emotion tasks; the ToM reasoning chain and TMPO improve task accuracy as well as the faithfulness and coherence of reasoning. Conclusion: The work offers a cognition-based paradigm for evaluating and enhancing emotional understanding in MLLMs, with the dataset and code released to support the community. Abstract: Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. We further introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. In conclusion, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs. Our dataset and code are available at: https://HitEmotion.github.io/.[294] Navigating Simply, Aligning Deeply: Winning Solutions for Mouse vs. AI 2025
Phu-Hoa Pham,Chi-Nguyen Tran,Dao Sy Duy Minh,Nguyen Lam Phu Quy,Huynh Trung Kiet
Main category: cs.CV
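TL;DR: This paper presents Team HCMUS_TheFangs' winning solutions for both tracks of the NeurIPS 2025 Mouse vs. AI competition: Track 1 (visual robustness) uses a lightweight two-layer CNN with Gated Linear Units and observation normalization and scores 95.4%; Track 2 (neural alignment) uses a 16-layer ResNet-like network with GLU gating and 17.8 million parameters and achieves the best neural prediction performance. The analysis finds a non-monotonic relationship between training steps and performance, with the optimum around 200K steps, and the results challenge the assumption that more complex models are always better.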
Details
Motivation: Address the key challenges of visual robustness and neural alignment in artificial visual agents so that they come closer to biological vision systems. Method: Track 1: a lightweight two-layer CNN with Gated Linear Units and observation normalization; Track 2: a 16-layer ResNet-like convolutional network with GLU gating; a systematic analysis of ten checkpoints (60K to 1.14M training steps), with ablation studies and failure case analysis. Result: Track 1 reaches a final score of 95.4%; Track 2 achieves top-1 neural prediction performance; performance peaks at around 200K training steps; simple architectures favor robustness, while deeper, larger models favor neural alignment. Conclusion: Higher model complexity is not always better; visual robustness and neural alignment call for different architectural strategies; training duration has an optimal range; the findings give practical guidance for building robust, brain-like visual agents. Abstract: Visual robustness and neural alignment remain critical challenges in developing artificial agents that can match biological vision systems. We present the winning approaches from Team HCMUS_TheFangs for both tracks of the NeurIPS 2025 Mouse vs. AI: Robust Visual Foraging Competition. For Track 1 (Visual Robustness), we demonstrate that architectural simplicity combined with targeted components yields superior generalization, achieving 95.4% final score with a lightweight two-layer CNN enhanced by Gated Linear Units and observation normalization. For Track 2 (Neural Alignment), we develop a deep ResNet-like architecture with 16 convolutional layers and GLU-based gating that achieves top-1 neural prediction performance with 17.8 million parameters. Our systematic analysis of ten model checkpoints trained from 60K to 1.14M steps reveals that training duration exhibits a non-monotonic relationship with performance, with optimal results achieved around 200K steps. Through comprehensive ablation studies and failure case analysis, we provide insights into why simpler architectures excel at visual robustness while deeper models with increased capacity achieve better neural alignment. Our results challenge conventional assumptions about model complexity in visuomotor learning and offer practical guidance for developing robust, biologically-inspired visual agents.[295] VAMOS-OCTA: Vessel-Aware Multi-Axis Orthogonal Supervision for Inpainting Motion-Corrupted OCT Angiography Volumes
Nick DiSanto,Ehsan Khodapanah Aghdam,Han Liu,Jacob Watson,Yuankai K. Tao,Hao Li,Ipek Oguz
Main category: cs.CV
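TL;DR: This paper proposes VAMOS-OCTA, a deep learning framework for B-scan inpainting that uses a vessel-aware multi-axis orthogonal supervision loss to correct severe motion artifacts in handheld OCTA, improving both B-scan sharpness and en face projection quality.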
Details
Motivation: Handheld OCTA is prone to motion artifacts in uncooperative or pediatric subjects, which causes missing B-scans and blank bands in en face images; existing methods focus mainly on en face reconstruction and neglect the quality of the B-scans themselves. Method: Proposes a 2.5D U-Net that takes neighboring B-scans as input to reconstruct a corrupted center B-scan, and designs a new Vessel-Aware Multi-Axis Orthogonal Supervision (VAMOS) loss that combines vessel-weighted intensity reconstruction with axial and lateral projection consistency constraints. Result: Trained and validated on synthetic and real motion-corrupted data, the method clearly outperforms prior work: capillaries in B-scans are sharper, vessel continuity is better restored, and en face projections are cleaner; multi-axis supervision proves to be an effective constraint for restoring 3D OCTA volumes. Conclusion: By jointly optimizing B-scan and volumetric projection quality through multi-axis supervision, VAMOS-OCTA provides a robust motion artifact correction solution for clinical use of handheld OCTA. Abstract: Handheld Optical Coherence Tomography Angiography (OCTA) enables noninvasive retinal imaging in uncooperative or pediatric subjects, but is highly susceptible to motion artifacts that severely degrade volumetric image quality. Sudden motion during 3D acquisition can lead to unsampled retinal regions across entire B-scans (cross-sectional slices), resulting in blank bands in en face projections. We propose VAMOS-OCTA, a deep learning framework for inpainting motion-corrupted B-scans using vessel-aware multi-axis supervision. We employ a 2.5D U-Net architecture that takes a stack of neighboring B-scans as input to reconstruct a corrupted center B-scan, guided by a novel Vessel-Aware Multi-Axis Orthogonal Supervision (VAMOS) loss. This loss combines vessel-weighted intensity reconstruction with axial and lateral projection consistency, encouraging vascular continuity in native B-scans and across orthogonal planes. Unlike prior work that focuses primarily on restoring the en face MIP, VAMOS-OCTA jointly enhances both cross-sectional B-scan sharpness and volumetric projection accuracy, even under severe motion corruptions. We trained our model on both synthetic and real-world corrupted volumes and evaluated its performance using both perceptual quality and pixel-wise accuracy metrics. VAMOS-OCTA consistently outperforms prior methods, producing reconstructions with sharp capillaries, restored vessel continuity, and clean en face projections. 
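The structure of the loss described above (a vessel-weighted reconstruction term plus projection consistency along two axes) can be illustrated with a small Python sketch. This is a simplified stand-in, not the published VAMOS loss: it assumes L1 errors, maximum-intensity projections, a single 2D B-scan given as nested lists (rows are depth, columns are lateral position), and a binary vessel mask. The function name `vamos_style_loss` and the parameters `vessel_weight` and `projection_weight` are hypothetical.

```python
from typing import List

Image = List[List[float]]


def vamos_style_loss(
    pred: Image,
    target: Image,
    vessel_mask: Image,
    vessel_weight: float = 2.0,
    projection_weight: float = 1.0,
) -> float:
    """Vessel-weighted L1 reconstruction plus axial and lateral projection terms."""
    rows, cols = len(target), len(target[0])

    # 1) Reconstruction term: pixels inside vessels count (1 + vessel_weight) times.
    weighted_error = 0.0
    weight_total = 0.0
    for r in range(rows):
        for c in range(cols):
            w = 1.0 + vessel_weight * vessel_mask[r][c]
            weighted_error += w * abs(pred[r][c] - target[r][c])
            weight_total += w
    reconstruction = weighted_error / weight_total

    # 2) Projection terms: compare maximum-intensity projections along each axis.
    def project_over_depth(img: Image) -> List[float]:
        return [max(img[r][c] for r in range(rows)) for c in range(cols)]

    def project_over_lateral(img: Image) -> List[float]:
        return [max(row) for row in img]

    axial = sum(
        abs(p - t) for p, t in zip(project_over_depth(pred), project_over_depth(target))
    ) / cols
    lateral = sum(
        abs(p - t) for p, t in zip(project_over_lateral(pred), project_over_lateral(target))
    ) / rows

    return reconstruction + projection_weight * (axial + lateral)
```

The projection terms penalize a prediction whose bright vessel pixels are missing even when the average pixel error is small, which is the intuition behind supervising along orthogonal axes. In the paper the projections are presumably taken over the full 3D volume rather than a single slice.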
These results demonstrate that multi-axis supervision offers a powerful constraint for restoring motion-degraded 3D OCTA data. Our source code is available at https://github.com/MedICL-VU/VAMOS-OCTA.[296] CortiNet: A Physics-Perception Hybrid Cortical-Inspired Dual-Stream Network for Gallbladder Disease Diagnosis from Ultrasound
Vagish Kumar,Souvik Chakraborty
Main category: cs.CV
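TL;DR: This paper proposes CortiNet, a lightweight, cortex-inspired dual-stream neural network for gallbladder disease diagnosis from ultrasound. It combines physically interpretable multi-scale signal decomposition with perception-driven feature learning, reaches high accuracy (98.74%) with far fewer parameters, and introduces a structure-aware explainability framework that improves robustness to speckle noise.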
Details
Motivation: Ultrasound images have low resolution and strong speckle noise, which hurts diagnostic reliability, and existing large CNN models are hard to deploy in routine clinical practice. Method: Proposes CortiNet, a dual-stream architecture that processes low-frequency structural information and high-frequency perceptual detail separately; it models frequency-selective representations rather than raw pixels, adds a cortical-style late fusion mechanism, and designs an explainability framework in which gradient-weighted class activation mapping (Grad-CAM) is applied only to the structural branch. Result: Reaches 98.74% diagnostic accuracy on 10,692 expert-annotated images covering nine gallbladder disease categories, with far fewer parameters than conventional deep CNNs. Conclusion: By combining physical priors with neural computation, CortiNet delivers high diagnostic performance while remaining lightweight and interpretable, making it suitable for clinical deployment. Abstract: Ultrasound imaging is the primary diagnostic modality for detecting gallbladder diseases due to its non-invasive nature, affordability, and wide accessibility. However, the low resolution and speckle noise inherent to ultrasound images hinder diagnostic reliability, prompting the use of large convolutional neural networks that are difficult to deploy in routine clinical settings. In this work, we propose CortiNet, a lightweight, cortical-inspired dual-stream neural architecture for gallbladder disease diagnosis that integrates physically interpretable multi-scale signal decomposition with perception-driven feature learning. Inspired by parallel processing pathways in the human visual cortex, CortiNet explicitly separates low-frequency structural information from high-frequency perceptual details and processes them through specialized encoding streams. By operating directly on structured, frequency-selective representations rather than raw pixel intensities, the architecture embeds strong physics-based inductive bias, enabling efficient feature learning with a significantly reduced parameter footprint. A late-stage cortical-style fusion mechanism integrates complementary structural and textural cues while preserving computational efficiency. Additionally, we propose a structure-aware explainability framework wherein gradient-weighted class activation mapping is only applied to the structural branch of the proposed CortiNet architecture. This choice allows the model to only focus on the structural features, making it robust against speckle noise. We evaluate CortiNet on 10,692 expert-annotated images spanning nine clinically relevant gallbladder disease categories. 
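The separation of low-frequency structure from high-frequency detail described above can be illustrated with a minimal Python sketch. The abstract does not specify the decomposition used, so this example substitutes a simple box (mean) filter for the low-pass branch and takes the residual as the high-frequency branch; the function name `split_frequency_bands` and the `radius` parameter are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]


def split_frequency_bands(image: Image, radius: int = 1) -> Tuple[Image, Image]:
    """Split an image into a low-frequency band and a high-frequency residual.

    Low band: local mean over a (2*radius+1)^2 window, truncated at the borders.
    High band: original minus low band, so low + high reproduces the input.
    """
    height, width = len(image), len(image[0])
    low = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            count = 0
            for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                    total += image[rr][cc]
                    count += 1
            low[r][c] = total / count
    high = [[image[r][c] - low[r][c] for c in range(width)] for r in range(height)]
    return low, high
```

In a dual-stream design of this kind, the `low` band would feed the structural encoder and the `high` band the detail encoder. Speckle noise is concentrated in the high band, which is consistent with the abstract's choice to compute class activation maps on the structural branch only.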
Experimental results demonstrate that CortiNet achieves high diagnostic accuracy (98.74%) with only a fraction of the parameters required by conventional deep convolutional models.[297] SRVAU-R1: Enhancing Video Anomaly Understanding via Reflection-Aware Learning
Zihao Zhao,Shengting Cao,Muchao Ye
Main category: cs.CV
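TL;DR: This paper proposes SRVAU-R1, a framework that introduces self-reflective reasoning and the first reflection-oriented chain-of-thought dataset for video anomaly understanding, strengthening the deep reasoning of multimodal large language models on this task.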
Details
Motivation: Existing MLLM-based methods stop at surface-level descriptions of anomalies and lack deep reasoning about abnormal behavior, such as explicit self-reflection and self-correction. Method: Proposes the SRVAU-R1 framework, comprising (1) the first reflection-oriented Chain-of-Thought dataset for VAU, annotated in three stages: initial reasoning, self-reflection, and revised reasoning; (2) a reflection-aware learning paradigm that combines supervised fine-tuning with reinforcement fine-tuning. Result: Clearly outperforms existing methods on multiple video anomaly benchmarks, improving both temporal anomaly localization accuracy and reasoning quality. Conclusion: Introducing a self-reflection mechanism strengthens the deep multimodal reasoning of MLLMs for video anomaly understanding and offers a new paradigm for VAU. Abstract: Multi-modal large language models (MLLMs) have demonstrated significant progress in reasoning capabilities and shown promising effectiveness in video anomaly understanding (VAU) tasks. However, existing MLLM-based approaches remain largely focused on surface-level descriptions of anomalies, lacking deep reasoning over abnormal behaviors like explicit self-reflection and self-correction. To address that, we propose Self-Reflection-Enhanced Reasoning for Video Anomaly Understanding (SRVAU-R1), a reflection-aware learning framework that incorporates reflection in MLLM reasoning. Specifically, SRVAU-R1 introduces the first reflection-oriented Chain-of-Thought dataset tailored for VAU, providing structured supervision with initial reasoning, self-reflection, and revised reasoning. Based on that, it includes a novel reflection-aware learning paradigm with supervised fine-tuning and reinforcement fine-tuning to enhance multi-modal reasoning for VAU. Extensive experiments on multiple video anomaly benchmarks demonstrate that SRVAU-R1 consistently outperforms existing methods, achieving significant improvements in both temporal anomaly localization accuracy and reasoning quality.[298] LocalScore: Local Density-Aware Similarity Scoring for Biometrics
Yiyang Su,Minchul Kim,Jie Zhu,Christopher Perry,Feng Liu,Anil Jain,Xiaoming Liu
Main category: cs.CV
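TL;DR: This paper proposes LocalScore, a scoring algorithm that improves open-set biometric recognition by estimating local density with k-nearest neighbors; it is plug-and-play, adds little computational overhead, and substantially improves retrieval and verification metrics across multiple modalities.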
Details
Motivation: Traditional biometric systems struggle to detect non-enrolled probes in open-set scenarios, and most existing methods collapse multi-sample galleries into a single global representation, ignoring intra-subject variability and producing poor decision boundaries and weak open-set robustness. Method: Proposes the LocalScore scoring algorithm, which scores a probe using the local density of the gallery feature distribution around it, estimated from its k-th nearest neighbors; it does not depend on a particular network architecture or loss function and can be integrated directly into existing systems. Result: In experiments across multiple modalities, LocalScore substantially improves open-set retrieval (FNIR@FPIR reduced from 53% to 40%) and verification (TAR@FAR improved from 51% to 74%), with theoretical and empirical analysis explaining where the gains come from. Conclusion: LocalScore is a simple, effective, and general open-set biometric scoring method that markedly improves discrimination of non-enrolled probes and suits real multi-sample deployments. Abstract: Open-set biometrics faces challenges with probe subjects who may not be enrolled in the gallery, as traditional biometric systems struggle to detect these non-mated probes. Despite the growing prevalence of multi-sample galleries in real-world deployments, most existing methods collapse intra-subject variability into a single global representation, leading to suboptimal decision boundaries and poor open-set robustness. To address this issue, we propose LocalScore, a simple yet effective scoring algorithm that explicitly incorporates the local density of the gallery feature distribution using the k-th nearest neighbors. LocalScore is architecture-agnostic, loss-independent, and incurs negligible computational overhead, making it a plug-and-play solution for existing biometric systems. Extensive experiments across multiple modalities demonstrate that LocalScore consistently achieves substantial gains in open-set retrieval (FNIR@FPIR reduced from 53% to 40%) and verification (TAR@FAR improved from 51% to 74%). We further provide theoretical analysis and empirical validation explaining when and why the method achieves the most significant gains based on dataset characteristics.[299] Effectiveness of Automatically Curated Dataset in Thyroid Nodules Classification Algorithms Using Deep Learning
Jichen Yang,Jikai Zhang,Benjamin Wildman-Tobriner,Maciej A. Mazurowski
Main category: cs.CV
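TL;DR: This study evaluates whether an automatically curated thyroid nodule ultrasound dataset improves deep learning performance. A model trained on the fully automatically curated dataset (AUC = 0.694) significantly outperforms one trained on the manually annotated dataset (AUC = 0.643), and the full automatic dataset does at least as well as its higher-accuracy subset.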
Details
Motivation: Training deep learning models is often limited by the scarcity of annotated thyroid nodule data; automatic curation methods exist, but the practical training value of the data they produce is unclear. Method: A comparative experiment: deep learning models are trained on the manually annotated dataset, on the fully automatically curated dataset, and on a higher-accuracy subset of the latter, and their AUC is evaluated. Result: The model trained on the fully automatically curated dataset reaches an AUC of 0.694, significantly higher than the manual dataset (0.643, P < .001); the higher-accuracy subset reaches 0.689 (P > .43), not significantly different from the full automatic dataset. Conclusion: An automatically curated thyroid nodule dataset can substantially improve model performance, and using all of the automatic data is recommended over filtering down to the higher-accuracy subset. Abstract: The diagnosis of thyroid nodule cancers commonly utilizes ultrasound images. Several studies showed that deep learning algorithms designed to classify benign and malignant thyroid nodules could match radiologists' performance. However, data availability for training deep learning models is often limited due to the significant effort required to curate such datasets. A previous study proposed a method to curate thyroid nodule datasets automatically. It was tested to have a 63% yield rate and 83% accuracy. However, the usefulness of the generated data for training deep learning models remains unknown. In this study, we conducted experiments to determine whether using an automatically curated dataset improves deep learning algorithms' performance. We trained deep learning models on the manually annotated and automatically curated datasets. We also trained with a smaller subset of the automatically curated dataset that has higher accuracy to explore the optimum usage of such a dataset. As a result, the deep learning model trained on the manually selected dataset has an AUC of 0.643 (95% confidence interval [CI]: 0.62, 0.66). It is significantly lower than the AUC of the deep learning model trained on the automatically curated dataset, 0.694 (95% confidence interval [CI]: 0.67, 0.73, P < .001). The AUC of the deep learning model trained on the accurate subset is 0.689 (95% confidence interval [CI]: 0.66, 0.72, P > .43), which is not significantly worse than the AUC of the full automatically curated dataset. 
In conclusion, we showed that using an automatically curated dataset can substantially increase the performance of deep learning algorithms, and it is suggested to use all the data rather than only using the accurate subset.[300] GMAC: Global Multi-View Constraint for Automatic Multi-Camera Extrinsic Calibration
Chentian Sun
Main category: cs.CV
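TL;DR: This paper proposes GMAC, a framework that estimates the extrinsics of multi-camera systems from the implicit geometric representations learned by multi-view reconstruction networks. It needs no calibration target or explicit 3D reconstruction, and achieves robust, online automatic calibration through a lightweight regression head and joint optimization of reprojection and cycle consistency.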
Details
Motivation: Existing methods rely on calibration targets, explicit geometric modeling, or task-specific neural networks; they lack robustness and applicability in complex dynamic or online scenarios and are hard to deploy in practice. Method: GMAC models extrinsics as global variables constrained by the latent multi-view geometric structure; it prunes and structurally reconfigures existing reconstruction networks so that their latent features directly support extrinsic prediction; it adds a lightweight regression head and jointly optimizes cross-view reprojection consistency and multi-view cycle consistency. Result: On synthetic and real multi-camera datasets, GMAC achieves accurate and stable extrinsic estimation without explicit 3D reconstruction or manual calibration. Conclusion: GMAC offers a new solution for efficient deployment and online calibration of multi-camera systems, making automatic calibration more practical and robust in dynamic environments. Abstract: Automatic calibration of multi-camera systems, namely the accurate estimation of spatial extrinsic parameters, is fundamental for 3D reconstruction, panoramic perception, and multi-view data fusion. Existing methods typically rely on calibration targets, explicit geometric modeling, or task-specific neural networks. Such approaches often exhibit limited robustness and applicability in complex dynamic environments or online scenarios, making them difficult to deploy in practical applications. To address this, this paper proposes GMAC, a multi-camera extrinsic estimation framework based on the implicit geometric representations learned by multi-view reconstruction networks. GMAC models extrinsics as global variables constrained by the latent multi-view geometric structure and prunes and structurally reconfigures existing networks so that their latent features can directly support extrinsic prediction through a lightweight regression head, without requiring a completely new network design. Furthermore, GMAC jointly optimizes cross-view reprojection consistency and multi-view cycle consistency, ensuring geometric coherence across cameras while improving prediction accuracy and optimization stability. Experiments on both synthetic and real-world multi-camera datasets demonstrate that GMAC achieves accurate and stable extrinsic estimation without explicit 3D reconstruction or manual calibration, providing a new solution for efficient deployment and online calibration of multi-camera systems.[301] FUSE-Flow: Scalable Real-Time Multi-View Point Cloud Reconstruction Using Confidence
Chentian Sun
Main category: cs.CV
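TL;DR: This paper proposes FUSE-Flow, a frame-wise, stateless, linearly scalable framework for streaming point cloud reconstruction that fuses multi-view depth maps using measurement confidence and 3D distance consistency weights, improving reconstruction quality and multi-camera scalability while staying real-time.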
Details
Motivation: Existing methods based on voxel fusion, temporal accumulation, or global optimization cannot deliver high reconstruction quality, low memory use, and multi-camera scalability together under strict real-time constraints. Method: Proposes the FUSE-Flow framework: each frame independently generates point cloud fragments, which are fused with two weights (measurement confidence and 3D distance consistency); an adaptive spatial-hashing weighted aggregation partitions 3D space dynamically by local point density, selects a representative point per cell, and performs weighted fusion; GPU parallelization yields linear complexity. Result: Experiments show that FUSE-Flow improves reconstruction stability and geometric fidelity in overlapping regions, depth-discontinuous regions, and dynamic scenes while maintaining real-time frame rates on modern GPUs. Conclusion: FUSE-Flow strikes a good balance among real-time performance, reconstruction quality, and multi-camera scalability, and is shown to be effective, robust, and scalable. Abstract: Real-time multi-view point cloud reconstruction is a core problem in 3D vision and immersive perception, with wide applications in VR, AR, robotic navigation, digital twins, and computer interaction. Despite advances in multi-camera systems and high-resolution depth sensors, fusing large-scale multi-view depth observations into high-quality point clouds under strict real-time constraints remains challenging. Existing methods relying on voxel-based fusion, temporal accumulation, or global optimization suffer from high computational complexity, excessive memory usage, and limited scalability, failing to simultaneously achieve real-time performance, reconstruction quality, and multi-camera extensibility. We propose FUSE-Flow, a frame-wise, stateless, and linearly scalable point cloud streaming reconstruction framework. Each frame independently generates point cloud fragments, which are fused via two weights, measurement confidence and 3D distance consistency, to suppress noise while preserving geometric details. For large-scale multi-camera efficiency, we introduce an adaptive spatial hashing-based weighted aggregation method: 3D space is adaptively partitioned by local point cloud density, representative points are selected per cell, and weighted fusion is performed to handle both sparse and dense regions. With GPU parallelization, FUSE-Flow achieves high-throughput, low-latency point cloud generation and fusion with linear complexity. 
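The weighted aggregation described above can be sketched in plain Python. This is a simplified illustration, not the FUSE-Flow implementation: it assumes a fixed cell size instead of the adaptive, density-driven partition, picks the highest-confidence point in each cell as the representative, and models 3D distance consistency as a Gaussian falloff around that representative. The function name `fuse_points` and the parameters `cell_size` and `sigma` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def fuse_points(
    points: Sequence[Point],
    confidences: Sequence[float],
    cell_size: float = 0.05,
    sigma: float = 0.02,
) -> List[Point]:
    """Fuse points per hash cell using confidence and distance-consistency weights."""
    # Spatial hashing: bucket every point by its integer cell coordinates.
    cells: Dict[Tuple[int, int, int], List[Tuple[Point, float]]] = defaultdict(list)
    for point, conf in zip(points, confidences):
        key = tuple(math.floor(v / cell_size) for v in point)
        cells[key].append((point, conf))

    fused: List[Point] = []
    for members in cells.values():
        # Representative point: the most confident measurement in the cell.
        representative, _ = max(members, key=lambda m: m[1])
        weight_sum = 0.0
        acc = [0.0, 0.0, 0.0]
        for point, conf in members:
            dist_sq = sum((a - b) ** 2 for a, b in zip(point, representative))
            # Combined weight: measurement confidence times distance consistency.
            w = conf * math.exp(-dist_sq / (2.0 * sigma ** 2))
            weight_sum += w
            for i in range(3):
                acc[i] += w * point[i]
        if weight_sum > 0.0:  # skip cells whose measurements all have zero confidence
            fused.append((acc[0] / weight_sum, acc[1] / weight_sum, acc[2] / weight_sum))
    return fused
```

Each point is visited once for hashing and once for fusion, so the cost grows linearly with the number of input points, and no state is carried between frames. Cells are independent of each other, which is what would make a per-cell GPU parallelization straightforward.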
Experiments demonstrate that the framework improves reconstruction stability and geometric fidelity in overlapping, depth-discontinuous, and dynamic scenes, while maintaining real-time frame rates on modern GPUs, verifying its effectiveness, robustness, and scalability.
[302] VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models
Guangshuo Qin,Zhiteng Li,Zheng Chen,Weihang Zhang,Linghe Kong,Yulun Zhang
Main category: cs.CV
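TL;DR: This paper proposes VEQ, a dual-aware post-training quantization method for MoE-architecture vision-language models (VLMs) that accounts for both modality differences and expert heterogeneity and significantly outperforms existing SOTA quantization methods on multiple benchmarks.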
Details
Motivation: MoE-architecture vision-language models perform well but incur heavy memory and compute costs; existing PTQ methods ignore the modality heterogeneity between vision and language tokens and the non-uniform contribution of different experts.
Method: The Visual Expert Quantization (VEQ) framework comprises 1) modality-expert-aware quantization, which uses expert activation frequency to prioritize reducing the quantization error of pivotal experts, and 2) modality-affinity-aware quantization, which combines token-expert affinity with modality information to build an enhanced Hessian matrix that guides calibration.
Result: Under the W3A16 configuration, VEQ improves average accuracy by 2.04% on Kimi-VL and 3.09% on Qwen3-VL, significantly outperforming existing SOTA quantization methods.
Conclusion: By modeling cross-modal differences and expert heterogeneity together, VEQ achieves more robust and efficient MoE-VLM quantization and offers a new direction for lightweight multimodal large models.
Abstract: Mixture-of-Experts (MoE) Vision-Language Models (VLMs) offer remarkable performance but incur prohibitive memory and computational costs, making compression essential. Post-Training Quantization (PTQ) is an effective training-free technique to address the massive memory and computation overhead. Existing quantization paradigms fall short as they are oblivious to two critical forms of heterogeneity: the inherent discrepancy between vision and language tokens, and the non-uniform contribution of different experts. To bridge this gap, we propose Visual Expert Quantization (VEQ), a dual-aware quantization framework designed to simultaneously accommodate cross-modal differences and heterogeneity between experts. Specifically, VEQ incorporates 1) Modality-expert-aware Quantization, which utilizes expert activation frequency to prioritize error minimization for pivotal experts, and 2) Modality-affinity-aware Quantization, which constructs an enhanced Hessian matrix by integrating token-expert affinity with modality information to guide the calibration process. Extensive experiments across diverse benchmarks verify that VEQ consistently outperforms state-of-the-art baselines. Specifically, under the W3A16 configuration, our method achieves significant average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL compared to the previous SOTA quantization methods, demonstrating superior robustness across various multimodal tasks. Our code will be available at https://github.com/guangshuoqin/VEQ.
[303] From Videos to Conversations: Egocentric Instructions for Task Assistance
Lavisha Aggarwal,Vikas Bahirwani,Andrea Colaco
Main category: cs.CV
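TL;DR: This paper proposes a fully automatic, LLM-based framework that converts single-person instructional videos into two-person multimodal task-guidance conversations, builds the HowToDIV dataset of 507 conversations, 6,636 question-answer pairs, and 24 hours of video, and reports baseline results for Gemma 3 and Qwen 2.5 on it.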
Details
Motivation: Progress on AR-assistance AI agents is limited by the lack of large-scale multimodal conversational datasets grounded in real task execution, and manual collection is costly and logistically complex.
Method: A fully automated pipeline uses large language models to convert single-person instructional videos into multi-turn expert-novice multimodal task-guidance conversations.
Result: The HowToDIV dataset (507 conversations, 6,636 QA pairs, 24 hours of video) is built, and Gemma 3 and Qwen 2.5 are evaluated on it to give initial benchmark results.
Conclusion: The framework offers a low-cost, scalable way to build multimodal conversational data grounded in real tasks, and HowToDIV is a useful resource for research on multimodal procedural task assistance.
Abstract: Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single-person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost-efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question-answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi-turn expert-novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.
[304] ReLayout: Versatile and Structure-Preserving Design Layout Editing via Relation-Aware Design Reconstruction
Jiawei Lin,Shizhao Sun,Danqing Huang,Ting Liu,Ji Li,Jiang Bian
Main category: cs.CV
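TL;DR: This paper proposes ReLayout, a framework for versatile, structure-preserving design layout editing that needs no triplet data; it learns in a self-supervised way through a relation graph and relation-aware design reconstruction (RADR), and significantly improves editing quality, accuracy, and layout structure preservation.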
Details
Motivation: User intents expressed in natural language are ambiguous, and design layout editing lacks annotated triplets (original design, editing operation, edited design).
Method: The ReLayout framework builds a relation graph that constrains the layout structure of unedited elements and adds a relation-aware design reconstruction (RADR) module, trained in a self-supervised way on top of a multimodal large language model, so that one model supports multiple editing actions.
Result: In qualitative and quantitative evaluations and user studies, ReLayout significantly outperforms baseline models in editing quality, accuracy, and layout structure preservation.
Conclusion: ReLayout achieves high-quality, structure-preserving automatic layout editing without manually annotated triplet data, offering a new paradigm for automated redesign.
Abstract: Automated redesign without manual adjustments marks a key step forward in the design workflow. In this work, we focus on a foundational redesign task termed design layout editing, which seeks to autonomously modify the geometric composition of a design based on user intents. To overcome the ambiguity of user needs expressed in natural language, we introduce four basic and important editing actions and standardize the format of editing operations. The underexplored task presents a unique challenge: satisfying specified editing operations while simultaneously preserving the layout structure of unedited elements. Besides, the scarcity of triplet (original design, editing operation, edited design) samples poses another formidable challenge. To this end, we present ReLayout, a novel framework for versatile and structure-preserving design layout editing that operates without triplet data. Specifically, ReLayout first introduces the relation graph, which contains the position and size relationships among unedited elements, as the constraint for layout structure preservation. Then, relation-aware design reconstruction (RADR) is proposed to bypass the data challenge. By learning to reconstruct a design from its elements, a relation graph, and a synthesized editing operation, RADR effectively emulates the editing process in a self-supervised manner. A multi-modal large language model serves as the backbone for RADR, unifying multiple editing actions within a single model and thus achieving versatile editing after fine-tuning.
Qualitative and quantitative results and user studies show that ReLayout significantly outperforms the baseline models in terms of editing quality, accuracy, and layout structure preservation.
[305] Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
Xinrong Chen,Xu Chu,Yingmin Qiu,Hengyuan Zhang,Jing Xiong,Shiyu Tang,Shuai Liu,Shaokang Yang,Cheng Yang,Hayden Kwok-Hay So,Ngai Wong
Main category: cs.CV
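TL;DR: This paper proposes Residual Decoding (ResDec), a training-free method that uses the internal implicit reasoning and token-logit evolution of large vision-language models (LVLMs), together with historical information during decoding, to suppress hallucinations caused by language priors and improve visual grounding.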
Details
Motivation: Large vision-language models (LVLMs) perform well on multimodal tasks but are easily swayed by language priors and produce hallucinated outputs that do not match the image content.
Method: The training-free Residual Decoding (ResDec) method uses the internal implicit reasoning mechanism and the token-logit evolution mechanism of LVLMs, combined with historical decoding information, to correct biases.
Result: ResDec markedly suppresses hallucinations induced by language priors, improves visual grounding, reduces object hallucinations, and performs well on multiple LVLM benchmarks.
Conclusion: ResDec is a general, efficient, training-free decoding enhancement that can be applied broadly to LVLMs to mitigate hallucinations.
Abstract: Large Vision-Language Models (LVLMs) can reason effectively from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote generated content that is grammatically and syntactically coherent, yet bears no match or direct relevance to actual visual input. To address this problem, we propose Residual Decoding (ResDec). It is a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and token logits evolution mechanism of LVLMs to correct biases. Extensive experiments demonstrate that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. In addition to mitigating hallucinations, ResDec also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.
[306] Baseline Method of the Foundation Model Challenge for Ultrasound Image Analysis
Bo Deng,Yitong Tang,Jiake Li,Yuxin Huang,Li Wang,Yu Zhang,Yufei Zhan,Hua Lu,Xiaoshen Zhang,Jieyun Bai
Main category: cs.CV
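TL;DR: This paper presents a unified multi-task foundation-model baseline for ultrasound image analysis; built on the MH-MTL framework, it supports 27 subtasks, combines EfficientNet-B4 with an FPN, uses task-adaptive loss and learning-rate strategies, and is validated as feasible and robust.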
Details
Motivation: Ultrasound images are highly heterogeneous across anatomical structures and acquisition protocols, and most existing methods are task-specific, which makes them unsuitable as clinically deployable foundation models.
Method: A unified Multi-Head Multi-Task Learning (MH-MTL) framework uses an ImageNet-pretrained EfficientNet-B4 backbone and a Feature Pyramid Network (FPN), a task-specific routing strategy, and training with a composite loss, task-adaptive learning-rate scaling, and a cosine annealing schedule.
Result: Validation results show that the unified design is feasible and robust, providing a strong and extensible baseline for ultrasound foundation model research.
Conclusion: The proposed MH-MTL baseline supports the FM_UIA 2026 multi-task benchmark and moves ultrasound image analysis toward general, clinically usable foundation models.
Abstract: Ultrasound (US) imaging exhibits substantial heterogeneity across anatomical structures and acquisition protocols, posing significant challenges to the development of generalizable analysis models. Most existing methods are task-specific, limiting their suitability as clinically deployable foundation models. To address this limitation, the Foundation Model Challenge for Ultrasound Image Analysis (FM_UIA 2026) introduces a large-scale multi-task benchmark comprising 27 subtasks across segmentation, classification, detection, and regression. In this paper, we present the official baseline for FM_UIA 2026 based on a unified Multi-Head Multi-Task Learning (MH-MTL) framework that supports all tasks within a single shared network. The model employs an ImageNet-pretrained EfficientNet-B4 backbone for robust feature extraction, combined with a Feature Pyramid Network (FPN) to capture multi-scale contextual information. A task-specific routing strategy enables global tasks to leverage high-level semantic features, while dense prediction tasks exploit spatially detailed FPN representations. Training incorporates a composite loss with task-adaptive learning rate scaling and a cosine annealing schedule. Validation results demonstrate the feasibility and robustness of this unified design, establishing a strong and extensible baseline for ultrasound foundation model research. The code and dataset are publicly available on GitHub at https://github.com/lijiake2408/Foundation-Model-Challenge-for-Ultrasound-Image-Analysis.
[307] Radioactive 3D Gaussian Ray Tracing for Tomographic Reconstruction
Ling Chen,Bao Yang
Main category: cs.CV
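TL;DR: This paper proposes a tomographic reconstruction framework based on 3D Gaussian ray tracing that replaces the affine-approximation splatting used in R2-Gaussian, giving a more physically consistent forward projection model and more flexible nonlinear geometric corrections.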
Details
Motivation: R2-Gaussian uses a local affine approximation for differentiable projection, but this approximation lowers quantitative reconstruction accuracy and makes it hard to incorporate nonlinear geometric corrections such as arc correction in PET.
Method: A 3D Gaussian ray tracing method analytically computes line integrals through 3D Gaussian primitives and gives explicit control over ray origins and directions, which supports precise nonlinear geometric corrections.
Result: Compared with splatting, the new method improves projection accuracy and is better suited to realistic tomographic systems such as CT and PET.
Conclusion: 3D Gaussian ray tracing offers a more physically sound, flexible, and extensible differentiable modeling paradigm for Gaussian-parameterized tomographic reconstruction.
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged in computer vision as a promising rendering technique. By adapting the principles of Elliptical Weighted Average (EWA) splatting to a modern differentiable pipeline, 3DGS enables real-time, high-quality novel view synthesis. Building upon this, R2-Gaussian extended the 3DGS paradigm to tomographic reconstruction by rectifying integration bias, achieving state-of-the-art performance in computed tomography (CT). To enable differentiability, R2-Gaussian adopts a local affine approximation: each 3D Gaussian is locally mapped to a 2D Gaussian on the detector and composed via alpha blending to form projections. However, the affine approximation can degrade the quantitative accuracy of reconstruction and complicate the incorporation of nonlinear geometric corrections. To address these limitations, we propose a tomographic reconstruction framework based on 3D Gaussian ray tracing. Our approach provides two key advantages over splatting-based models: (i) it computes the line integral through 3D Gaussian primitives analytically, avoiding the local affine collapse and thus yielding a more physically consistent forward projection model; and (ii) the ray-tracing formulation gives explicit control over ray origins and directions, which facilitates the precise application of nonlinear geometric corrections, e.g., arc-correction used in positron emission tomography (PET). These properties extend the applicability of Gaussian-based reconstruction to a wider range of realistic tomography systems while improving projection accuracy.
[308] DRFormer: A Dual-Regularized Bidirectional Transformer for Person Re-identification
Ying Shu,Pujian Zhan,Huiqi Yang,Hehe Fan,Youfang Lin,Kai Lv
Main category: cs.CV
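TL;DR: This paper proposes the Dual-Regularized Bidirectional Transformer (DRFormer), a framework that combines the local texture modeling of vision foundation models (such as DINO) with the global semantic understanding of vision-language models (such as CLIP) to improve person re-identification.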
Details
Motivation: Existing methods mostly rely on a single model paradigm (only a vision foundation model or only a vision-language model) and overlook how the two complement each other in fine-grained discriminative detail and global semantic features, which makes occlusion and pose variation hard to handle.
Method: The DRFormer framework introduces a dual-regularization mechanism that, inside a bidirectional Transformer, fuses local features extracted by DINO with global semantic features extracted by CLIP, keeping the features diverse and balancing the contributions of the two models.
Result: Experiments on five mainstream person re-identification benchmarks show that the method fuses local and global representations effectively and is competitive with the state of the art.
Conclusion: Vision foundation models and vision-language models are strongly complementary, and structured cooperation between them, as in DRFormer, can improve the robustness and accuracy of person re-identification.
Abstract: Both fine-grained discriminative details and global semantic features can contribute to solving person re-identification challenges, such as occlusion and pose variations. Vision foundation models (e.g., DINO) excel at mining local textures, and vision-language models (e.g., CLIP) capture strong global semantic differences. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework to synergize their strengths by a Dual-Regularized Bidirectional Transformer (DRFormer). The dual-regularization mechanism ensures diverse feature extraction and achieves a better balance in the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving competitive performance against state-of-the-art methods.
[309] PDE-Constrained Optimization for Neural Image Segmentation with Physics Priors
Seema K. Poudel,Sunny K. Khadka
Main category: cs.CV
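TL;DR: This paper proposes a microscopy image segmentation method that combines PDE-constrained optimization with deep learning; it introduces physical priors through variational regularization, adding reaction-diffusion equations and phase-field interface energies as differentiable residual losses on top of the data fidelity term, and improves segmentation accuracy, boundary quality, model stability, and low-sample generalization.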
Details
Motivation: Microscopy image segmentation is an ill-posed inverse problem affected by measurement noise, weak object boundaries, and scarce labeled data; deep networks trained by unconstrained empirical risk minimization tend to give unstable solutions and generalize poorly.
Method: Segmentation is modeled as a PDE-constrained optimization problem with a composite objective that combines a data fidelity term with variational regularizers based on reaction-diffusion equations and phase-field interface energies, embedded as differentiable residuals in deep networks such as UNet.
Result: On the LIVECell dataset, with cross-type evaluation on an unseen cell type, the PDE-regularized models consistently improve on the unconstrained UNet baseline in segmentation accuracy, boundary fidelity, training stability, and low-sample generalization.
Conclusion: Building physics-driven PDE priors into deep learning frameworks improves robustness and generalization and offers a principled way to connect variational methods, statistical learning, and scientific machine learning.
Abstract: Segmentation of microscopy images constitutes an ill-posed inverse problem due to measurement noise, weak object boundaries, and limited labeled data. Although deep neural networks provide flexible nonparametric estimators, unconstrained empirical risk minimization often leads to unstable solutions and poor generalization. In this work, image segmentation is formulated as a PDE-constrained optimization problem that integrates physically motivated priors into deep learning models through variational regularization. The proposed framework minimizes a composite objective function consisting of a data fidelity term and penalty terms derived from reaction-diffusion equations and phase-field interface energies, all implemented as differentiable residual losses. Experiments are conducted on the LIVECell dataset, a high-quality, manually annotated collection of phase-contrast microscopy images. Training is performed on two cell types, while evaluation is carried out on a distinct, unseen cell type to assess generalization. A UNet architecture is used as the unconstrained baseline model. Experimental results demonstrate consistent improvements in segmentation accuracy and boundary fidelity compared to unconstrained deep learning baselines. Moreover, the PDE-regularized models exhibit enhanced stability and improved generalization in low-sample regimes, highlighting the advantages of incorporating structured priors.
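A composite objective of the kind described above might look like the following sketch. The abstract does not state the exact equations, so the specific forms here are assumptions: a double-well potential W(u) = u²(1 − u)², a steady-state Allen-Cahn style reaction-diffusion residual ε²∇²u − W′(u), and a Ginzburg-Landau style interface energy. The function name, `eps`, `lam_rd`, and `lam_pf` are hypothetical, and plain Python stands in for the autodiff framework a real training loop would need.

```python
import math


def pde_regularized_loss(u, y, eps=1.0, lam_rd=0.1, lam_pf=0.1):
    """Return (total, data, rd, pf) for a predicted mask u and labels y.

    u, y: equally sized 2D lists with values in [0, 1].
    The PDE forms are illustrative assumptions, not taken from the paper.
    """
    h, w = len(u), len(u[0])
    n = h * w
    data = rd = pf = 0.0
    for i in range(h):
        for j in range(w):
            c = u[i][j]
            # Neighbours with replicated borders (zero-flux boundary).
            up, down = u[max(i - 1, 0)][j], u[min(i + 1, h - 1)][j]
            left, right = u[i][max(j - 1, 0)], u[i][min(j + 1, w - 1)]
            # Data fidelity: binary cross-entropy, clamped for log safety.
            p = min(max(c, 1e-7), 1.0 - 1e-7)
            data -= y[i][j] * math.log(p) + (1.0 - y[i][j]) * math.log(1.0 - p)
            # Steady-state reaction-diffusion residual: eps^2 * lap(u) - W'(u).
            lap = up + down + left + right - 4.0 * c
            dW = 2.0 * c * (1.0 - c) * (1.0 - 2.0 * c)
            rd += (eps ** 2 * lap - dW) ** 2
            # Phase-field interface energy: eps/2 * |grad u|^2 + W(u)/eps.
            gx, gy = right - c, down - c
            pf += 0.5 * eps * (gx * gx + gy * gy) + (c * (1.0 - c)) ** 2 / eps
    data, rd, pf = data / n, rd / n, pf / n
    return data + lam_rd * rd + lam_pf * pf, data, rd, pf
```

A flat, confident mask incurs no penalty from either physics term, while a noisy checkerboard is penalized by both, which is the smoothing and boundary-sharpening behaviour such priors are meant to encourage.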
The proposed approach illustrates how PDE-constrained optimization can strengthen data-driven learning frameworks, providing a principled bridge between variational methods, statistical learning, and scientific machine learning.
[310] PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers
Haopeng Li,Shitong Shao,Wenliang Zhong,Zikai Zhou,Lichen Bai,Hui Xiong,Zeke Xie
Main category: cs.CV
TL;DR: 本文提出PISA方法,通过块级泰勒展开近似非关键注意力块,在保持高质量的同时实现亚二次复杂度的稀疏注意力。
Details
Motivation: Existing block sparse attention degrades at high sparsity because it discards context; the authors find that attention scores of non-critical blocks are distributionally stable and can be approximated efficiently and accurately instead of being dropped.
Method: The training-free Piecewise Sparse Attention (PISA) computes critical blocks exactly and approximates non-critical blocks efficiently with a block-wise Taylor expansion, covering the full attention span.
Result: PISA gives 1.91x and 2.57x speedups on Wan2.1-14B and Hunyuan-Video respectively, and a 1.2x speedup for image generation on FLUX with no loss of visual quality; its quality is consistently better than that of other sparse attention methods.
Conclusion: By replacing the conventional keep-or-drop paradigm with an exact-or-approximate strategy, PISA narrows the gap between speed and quality in sparse attention.
Abstract: Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only critical key-value blocks, it suffers from degradation at high sparsity by discarding context. In this work, we discover that attention scores of non-critical blocks exhibit distributional stability, allowing them to be approximated accurately and efficiently rather than discarded, which is essential for sparse attention design. Motivated by this key insight, we propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity. Unlike the conventional keep-or-drop paradigm that directly drops the non-critical block information, PISA introduces a novel exact-or-approximate strategy: it maintains exact computation for critical blocks while efficiently approximating the remainder through block-wise Taylor expansion. This design allows PISA to serve as a faithful proxy to full attention, effectively bridging the gap between speed and quality. Experimental results demonstrate that PISA achieves 1.91 times and 2.57 times speedups on Wan2.1-14B and Hunyuan-Video, respectively, while consistently maintaining the highest quality among sparse attention methods. Notably, even for image generation on FLUX, PISA achieves a 1.2 times acceleration without compromising visual quality.
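The exact-or-approximate idea can be illustrated with a small single-query sketch. It is an assumption-laden toy, not the PISA kernel: it expands exp(q·k) to first order around each non-critical block's mean key, omits the usual 1/√d scaling and max-subtraction, and takes the set of critical blocks as given. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_stats(K, V):
    """Per-block summaries that can be precomputed once and reused for every query."""
    n = len(K)
    k_bar = [sum(col) / n for col in zip(*K)]
    v_sum = [sum(col) for col in zip(*V)]
    # M[a][b] = sum_i V[i][a] * (K[i][b] - k_bar[b])
    M = [[sum(v[a] * (k[b] - k_bar[b]) for k, v in zip(K, V))
          for b in range(len(k_bar))]
         for a in range(len(v_sum))]
    return n, k_bar, v_sum, M


def piecewise_attention(q, key_blocks, value_blocks, critical):
    """Softmax attention for one query: exact on critical blocks, Taylor elsewhere."""
    num = [0.0] * len(value_blocks[0][0])
    den = 0.0
    for b, (K, V) in enumerate(zip(key_blocks, value_blocks)):
        if b in critical:
            for k, v in zip(K, V):
                w = math.exp(dot(q, k))
                den += w
                num = [acc + w * x for acc, x in zip(num, v)]
        else:
            n, k_bar, v_sum, M = block_stats(K, V)
            base = math.exp(dot(q, k_bar))
            # exp(q.k_i) ~= base * (1 + q.(k_i - k_bar)); the linear term
            # sums to zero in the denominator because k_bar is the block mean.
            den += n * base
            num = [acc + base * (s + dot(row, q))
                   for acc, s, row in zip(num, v_sum, M)]
    return [x / den for x in num]
```

With the block statistics cached, the cost of a non-critical block no longer depends on how many keys it holds, which is where the saving over full attention would come from in this simplified setting.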
Code is available at: https://github.com/xie-lab-ml/piecewise-sparse-attention.
[311] MedAD-R1: Eliciting Consistent Reasoning in Interpretible Medical Anomaly Detection via Consistency-Reinforced Policy Optimization
Haitao Zhang,Yingying Wang,Jiaxiang Wang,Haote Xu,Hongyang Zhang,Yirong Chen,Yue Huang,Xinghao Ding
Main category: cs.CV
TL;DR: 本文提出MedAD-38K多中心多模态医学异常检测基准及MedAD-R1模型,通过认知注入与一致性分组相对策略优化(Con-GRPO)两阶段训练,显著提升推理一致性与诊断准确性,实现SOTA性能。
Details
Motivation: Existing medical anomaly detection methods rely on supervised fine-tuning on simple, fragmented datasets and struggle to deliver trustworthy, interpretable multimodal reasoning and generalization.
Method: The authors build MedAD-38K, the first large-scale multi-center, multi-modal benchmark, with diagnostic Chain-of-Thought (CoT) annotations and structured VQA pairs, and propose a two-stage training framework: the first stage, Cognitive Injection, uses SFT to instill medical knowledge and a think-then-answer paradigm; the second stage applies the Con-GRPO algorithm, which adds a consistency reward so that the reasoning process stays logically consistent with the final diagnosis.
Result: MedAD-R1 reaches SOTA on MedAD-38K, outperforming strong baselines by more than 10%, and produces transparent, logically consistent reasoning paths that improve the trustworthiness and interpretability of clinical decision support.
Conclusion: Combining a high-quality annotated benchmark with a consistency-constrained reinforcement learning strategy improves the reasoning reliability and clinical applicability of large models in medical anomaly detection.
Abstract: Medical Anomaly Detection (MedAD) presents a significant opportunity to enhance diagnostic accuracy using Large Multimodal Models (LMMs) to interpret and answer questions based on medical images. However, the reliance on Supervised Fine-Tuning (SFT) on simplistic and fragmented datasets has hindered the development of models capable of plausible reasoning and robust multimodal generalization. To overcome this, we introduce MedAD-38K, the first large-scale, multi-modal, and multi-center benchmark for MedAD featuring diagnostic Chain-of-Thought (CoT) annotations alongside structured Visual Question-Answering (VQA) pairs. On this foundation, we propose a two-stage training framework. The first stage, Cognitive Injection, uses SFT to instill foundational medical knowledge and align the model with a structured think-then-answer paradigm. Given that standard policy optimization can produce reasoning that is disconnected from the final answer, the second stage incorporates Consistency Group Relative Policy Optimization (Con-GRPO). This novel algorithm incorporates a crucial consistency reward to ensure the generated reasoning process is relevant and logically coherent with the final diagnosis. Our proposed model, MedAD-R1, achieves state-of-the-art (SOTA) performance on the MedAD-38K benchmark, outperforming strong baselines by more than 10%.
This superior performance stems from its ability to generate transparent and logically consistent reasoning pathways, offering a promising approach to enhancing the trustworthiness and interpretability of AI for clinical decision support.
[312] Differential Vector Erasure: Unified Training-Free Concept Erasure for Flow Matching Models
Zhiqi Zhang,Xinhao Zhong,Yi Sun,Shuoyang Sun,Bin Chen,Shu-Tao Xia,Xuan Wang
Main category: cs.CV
TL;DR: 本文提出了一种无需训练的概念擦除方法DVE,专用于流匹配(flow matching)文本到图像生成模型,通过分析速度场的方向结构来识别并移除特定语义概念(如NSFW内容、版权风格或特定物体),在保持图像质量与多样性的同时实现精准抑制。
Details
Motivation: Existing DDPM-based concept erasure methods do not apply to the newer flow matching models and mostly depend on costly fine-tuning; because flow matching models generate images differently, an efficient, training-free erasure scheme suited to their velocity field is needed.
Method: Differential Vector Erasure (DVE) builds a differential vector field from the directional difference in the velocity field between a target concept and an anchor concept, and at inference time projects the velocity field onto this differential direction to filter out concept-related components.
Result: Experiments on FLUX show that DVE consistently outperforms existing baselines on NSFW suppression, artistic style removal, and object erasure while preserving image quality and diversity.
Conclusion: DVE is a general, training-free concept erasure approach for flow matching models; it shows that semantic concepts are implicitly encoded in the directional structure of the velocity field and provides a practical tool for safe, controllable deployment of generative AI.
Abstract: Text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images, yet their tendency to reproduce undesirable concepts, such as NSFW content, copyrighted styles, or specific objects, poses growing concerns for safe and controllable deployment. While existing concept erasure approaches primarily focus on DDPM-based diffusion models and rely on costly fine-tuning, the recent emergence of flow matching models introduces a fundamentally different generative paradigm for which prior methods are not directly applicable. In this paper, we propose Differential Vector Erasure (DVE), a training-free concept erasure method specifically designed for flow matching models. Our key insight is that semantic concepts are implicitly encoded in the directional structure of the velocity field governing the generative flow. Leveraging this observation, we construct a differential vector field that characterizes the directional discrepancy between a target concept and a carefully chosen anchor concept. During inference, DVE selectively removes concept-specific components by projecting the velocity field onto the differential direction, enabling precise concept suppression without affecting irrelevant semantics. Extensive experiments on FLUX demonstrate that DVE consistently outperforms existing baselines on a wide range of concept erasure tasks, including NSFW suppression, artistic style removal, and object erasure, while preserving image quality and diversity.
[313] PandaPose: 3D Human Pose Lifting from a Single Image via Propagating 2D Pose Prior to 3D Anchor Space
Jinghong Zheng,Changlong Jiang,Yang Xiao,Jiaqi Li,Haohong Kuang,Hang Xu,Ran Wang,Zhiguo Cao,Min Du,Joey Tianyi Zhou
Main category: cs.CV
TL;DR: 本文提出PandaPose方法,通过将2D姿态先验传播到3D锚点空间作为统一中间表示,缓解2D姿态估计误差传播和自遮挡问题,显著提升单图3D人体姿态估计性能。
Details
Motivation: Existing methods build a direct joint-to-joint mapping from 2D features, which has two fundamental flaws: errors in the predicted 2D pose propagate to 3D, and self-occlusion is hard to handle.
Method: PandaPose builds a 3D anchor space comprising joint-wise 3D anchors, a depth-aware feature lifting module, and an anchor-feature interaction decoder; it generates unified anchor queries that combine the 3D anchors, visual cues, and geometric depth information, and uses them for anchor-to-joint ensemble prediction.
Result: The method achieves superior performance on the Human3.6M, MPI-INF-3DHP, and 3DPW benchmarks and reduces error by 14.7% relative to SOTA under the challenging conditions of Human3.6M.
Conclusion: By introducing a structured 3D anchor space as the intermediate representation, PandaPose mitigates error propagation and self-occlusion and improves the accuracy and robustness of single-image 3D pose estimation.
Abstract: 3D human pose lifting from a single RGB image is a challenging task in 3D vision. Existing methods typically establish a direct joint-to-joint mapping from 2D to 3D poses based on 2D features. This formulation suffers from two fundamental limitations: inevitable error propagation from input predicted 2D pose to 3D predictions and inherent difficulties in handling self-occlusion cases. In this paper, we propose PandaPose, a 3D human pose lifting approach via propagating 2D pose prior to 3D anchor space as the unified intermediate representation. Specifically, our 3D anchor space comprises: (1) Joint-wise 3D anchors in the canonical coordinate system, providing accurate and robust priors to mitigate 2D pose estimation inaccuracies. (2) Depth-aware joint-wise feature lifting that hierarchically integrates depth information to resolve self-occlusion ambiguities. (3) The anchor-feature interaction decoder that incorporates 3D anchors with lifted features to generate unified anchor queries encapsulating joint-wise 3D anchor set, visual cues and geometric depth information. The anchor queries are further employed to facilitate anchor-to-joint ensemble prediction. Experiments on three well-established benchmarks (i.e., Human3.6M, MPI-INF-3DHP and 3DPW) demonstrate the superiority of our proposition. The substantial reduction in error by 14.7% compared to SOTA methods on the challenging conditions of Human3.6M and qualitative comparisons further showcase the effectiveness and robustness of our approach.
[314] Robust Harmful Meme Detection under Missing Modalities via Shared Representation Learning
Felix Breiteneder,Mohammad Belal,Muhammad Saad Saeed,Shahed Masoudian,Usman Naseem,Kulshrestha Juhi,Markus Schedl,Shah Nawaz
Main category: cs.CV
TL;DR: 本文提出了一种针对模态不完整(如缺失文本)情况下的有害梗图检测新方法,通过独立投影多模态特征学习共享表征,提升了在文本缺失时的鲁棒性与性能。
Details
Motivation: Existing harmful meme detection methods depend on complete multimodal data (such as image and text), but in real settings text is often missing because of poor OCR quality and similar issues, which degrades performance; detection behavior under incomplete modalities needs study.
Method: A new baseline projects the features of each modality (such as image and text) independently and learns a shared representation that still supports detection when any one modality is missing.
Result: Experiments on two benchmark datasets show that the method outperforms existing approaches when text is missing, makes better use of visual features, reduces dependence on text, and improves robustness.
Conclusion: This is the first systematic study of harmful meme detection under incomplete modalities, and it advances practical deployment in complex real-world scenarios such as a missing modality.
Abstract: Internet memes are powerful tools for communication, capable of spreading political, psychological, and sociocultural ideas. However, they can be harmful and can be used to disseminate hate toward targeted individuals or groups. Although previous studies have focused on designing new detection methods, these often rely on modal-complete data, such as text and images. In real-world settings, however, modalities like text may be missing due to issues like poor OCR quality, making existing methods sensitive to missing information and leading to performance deterioration. To address this gap, in this paper, we present the first-of-its-kind work to comprehensively investigate the behavior of harmful meme detection methods in the presence of modal-incomplete data. Specifically, we propose a new baseline method that learns a shared representation for multiple modalities by projecting them independently. These shared representations can then be leveraged when data is modal-incomplete. Experimental results on two benchmark datasets demonstrate that our method outperforms existing approaches when text is missing. Moreover, these results suggest that our method allows for better integration of visual features, reducing dependence on text and improving robustness in scenarios where textual information is missing. Our work represents a significant step forward in enabling the real-world application of harmful meme detection, particularly in situations where a modality is absent.
[315] LightCity: An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions
Jingjing Wang,Qirui Hu,Chong Bao,Yuke Zhu,Hujun Bao,Zhaopeng Cui,Guofeng Zhang
Main category: cs.CV
TL;DR: 本文提出了LightCity,一个高质量的合成城市数据集,用于解决逆向渲染中的复杂光照条件挑战,并通过该数据集对城市环境中的三个基础任务进行了基准测试和全面分析。
Details
Motivation: Inverse rendering in urban scenes is challenged by complex illumination (multiple light sources, indirect light, and shadow effects), but no suitable dataset exists for studying how these affect intrinsic decomposition and 3D reconstruction.
Method: The authors build LightCity, a new high-quality synthetic urban dataset with more than 300 sky maps with highly controllable illumination, over 50K images from street-level and aerial perspectives, and rich properties such as depth, normals, material components, and direct and indirect light; they then use it to benchmark and analyze three fundamental tasks in urban environments.
Result: LightCity provides diverse illumination conditions and rich geometric and physical properties, supports systematic evaluation of tasks such as intrinsic decomposition and 3D reconstruction in urban environments, and lays a solid foundation for related research.
Conclusion: LightCity fills the gap of a high-quality, multi-illumination synthetic dataset for urban inverse rendering and advances research on understanding and modeling urban scenes under complex lighting.
Abstract: Inverse rendering in urban scenes is pivotal for applications like autonomous driving and digital twins. Yet, it faces significant challenges due to complex illumination conditions, including multi-illumination and indirect light and shadow effects. However, the effects of these challenges on intrinsic decomposition and 3D reconstruction have not been explored due to the lack of appropriate datasets. In this paper, we present LightCity, a novel high-quality synthetic urban dataset featuring diverse illumination conditions with realistic indirect light and shadow effects. LightCity encompasses over 300 sky maps with highly controllable illumination, over 50K images at varying scales from street-level and aerial perspectives, and rich properties such as depth, normal, material components, direct and indirect light, etc. Besides, we leverage LightCity to benchmark three fundamental tasks in the urban environments and conduct a comprehensive analysis of these benchmarks, laying a robust foundation for advancing related research.
[316] Koo-Fu CLIP: Closed-Form Adaptation of Vision-Language Models via Fukunaga-Koontz Linear Discriminant Analysis
Matej Suchanek,Klara Janouskova,Ondrej Vasatko,Jiri Matas
Main category: cs.CV
TL;DR: Koo-Fu CLIP是一种基于Fukunaga-Koontz线性判别分析的监督式CLIP适配方法,通过白化嵌入空间提升类间判别性、抑制类内差异,在ImageNet等基准上显著提升分类准确率并实现高效降维压缩。
Details
Motivation: Raw CLIP embeddings are not optimized for supervised classification and suffer from limited class separation and excessive dimensionality.
Method: Koo-Fu CLIP, based on Fukunaga-Koontz Linear Discriminant Analysis, constructs a closed-form linear projection in the whitened CLIP embedding space that reshapes the embedding geometry to improve class separability and reduce dimensionality.
Result: Top-1 accuracy on ImageNet-1K rises from 75.1% to 79.1%; the gains persist as the label space expands to 14K and 21K classes; 10-12x compression is supported with almost no loss in accuracy.
Conclusion: Koo-Fu CLIP provides a lightweight, efficient, and scalable supervised adaptation of CLIP that balances accuracy gains with computational efficiency.
Abstract: Visual-language models such as CLIP provide powerful general-purpose representations, but their raw embeddings are not optimized for supervised classification, often exhibiting limited class separation and excessive dimensionality. We propose Koo-Fu CLIP, a supervised CLIP adaptation method based on Fukunaga-Koontz Linear Discriminant Analysis, which operates in a whitened embedding space to suppress within-class variation and enhance between-class discrimination. The resulting closed-form linear projection reshapes the geometry of CLIP embeddings, improving class separability while performing effective dimensionality reduction, and provides a lightweight and efficient adaptation of CLIP representations. Across large-scale ImageNet benchmarks, nearest visual prototype classification in the Koo-Fu CLIP space improves top-1 accuracy from 75.1% to 79.1% on ImageNet-1K, with consistent gains persisting as the label space expands to 14K and 21K classes. The method supports substantial compression by up to 10-12x with little or no loss in accuracy, enabling efficient large-scale classification and retrieval.
[317] Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs
Daniel Yezid Guarnizo Orjuela,Leonardo Scappatura,Veronica Di Gennaro,Riccardo Andrea Izzo,Gianluca Bardaro,Matteo Matteucci
Main category: cs.CV
TL;DR: 本文提出了一种名为Corruption Restoration Transformer(CRT)的即插即用、模型无关的视觉Transformer,用于修复传感器层面的图像损坏(如噪声、坏点、镜头污染等),从而提升Vision-Language-Action(VLA)模型在真实场景中面对视觉扰动时的鲁棒性。实验表明CRT能显著恢复VLA模型性能,使其在严重图像损坏下仍保持接近基线的成功率。
Details
Motivation: Existing VLA models perform well in controlled environments but are highly vulnerable in real-world deployment to image-level sensor corruptions (such as electronic noise, dead pixels, and lens smudges); earlier work has focused on physical occlusion and overlooked this kind of low-level signal degradation.
Method: The Corruption Restoration Transformer (CRT) is a plug-and-play vision Transformer module trained with an adversarial objective that restores clean observations from corrupted images without fine-tuning the downstream VLA model.
Result: On the LIBERO and Meta-World benchmarks, CRT markedly improves the robustness of SOTA VLA models such as π₀.₅ and SmolVLA: under common image corruptions, success rates recover from as low as 2% to near the 90% baseline.
Conclusion: Sensor-level image corruption is a key obstacle to deploying VLA models in practice; CRT, a lightweight, general repair module that needs no fine-tuning, improves the resilience of VLA systems to real-world visual disturbances and offers a new direction for reliable robotic manipulation.
Abstract: Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation, unifying perception and control within a single end-to-end architecture. However, despite their success in controlled environments, reliable real-world deployment is severely hindered by their fragility to visual disturbances. While existing literature extensively addresses physical occlusions caused by scene geometry, a critical mode remains largely unexplored: image corruptions. These sensor-level artifacts, ranging from electronic noise and dead pixels to lens contaminants, directly compromise the integrity of the visual signal prior to interpretation. In this work, we quantify this vulnerability, demonstrating that state-of-the-art VLAs such as π₀.₅ and SmolVLA suffer catastrophic performance degradation, dropping from 90% success rates to as low as 2%, under common signal artifacts. To mitigate this, we introduce the Corruption Restoration Transformer (CRT), a plug-and-play and model-agnostic vision transformer designed to immunize VLA models against sensor disturbances. Leveraging an adversarial training objective, CRT restores clean observations from corrupted inputs without requiring computationally expensive fine-tuning of the underlying model.
Extensive experiments across the LIBERO and Meta-World benchmarks demonstrate that CRT effectively recovers lost performance, enabling VLAs to maintain near-baseline success rates, even under severe visual corruption.[318] Semantically Aware UAV Landing Site Assessment from Remote Sensing Imagery via Multimodal Large Language Models
Chunliang Hua,Zeyuan Yang,Lei Zhang,Jiayang Sun,Fengwen Chen,Chunlan Zeng,Xiao Hu
Main category: cs.CV
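TL;DR: This paper proposes a new framework for assessing UAV emergency landing sites that combines remote sensing imagery with multimodal large language models (MLLMs). A coarse-to-fine semantic risk identification pipeline substantially improves detection of complex semantic risks such as crowds and temporary structures, and the authors build a public benchmark dataset, ELSS.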
Details
Motivation: Traditional geometric sensors cannot perceive complex semantic risks in emergency landing (e.g., crowds, temporary structures), leaving safe-landing assessment inadequate. Method: Proposes a coarse-to-fine framework: a lightweight semantic segmentation module first pre-screens candidate areas; a vision-language reasoning agent then fuses image features with Point-of-Interest (POI) data to identify subtle hazards. Result: On the self-built ELSS benchmark, the method significantly outperforms geometric baselines in risk identification accuracy and produces human-like, interpretable justifications for its decisions. Conclusion: A semantically aware framework that fuses remote sensing imagery, MLLMs, and POI data can improve the safety and trustworthiness of global context-aware UAV emergency landing. Abstract: Safe UAV emergency landing requires more than just identifying flat terrain; it demands understanding complex semantic risks (e.g., crowds, temporary structures) invisible to traditional geometric sensors. In this paper, we propose a novel framework leveraging Remote Sensing (RS) imagery and Multimodal Large Language Models (MLLMs) for global context-aware landing site assessment. Unlike local geometric methods, our approach employs a coarse-to-fine pipeline: first, a lightweight semantic segmentation module efficiently pre-screens candidate areas; second, a vision-language reasoning agent fuses visual features with Point-of-Interest (POI) data to detect subtle hazards. To validate this approach, we construct and release the Emergency Landing Site Selection (ELSS) benchmark. Experiments demonstrate that our framework significantly outperforms geometric baselines in risk identification accuracy. Furthermore, qualitative results confirm its ability to generate human-like, interpretable justifications, enhancing trust in automated decision-making. The benchmark dataset is publicly accessible at https://anonymous.4open.science/r/ELSS-dataset-43D7.
[319] EEmo-Logic: A Unified Dataset and Multi-Stage Framework for Comprehensive Image-Evoked Emotion Assessment
Lancheng Gao,Ziheng Jia,Zixuan Xing,Wei Sun,Huiyu Duan,Guangtao Zhai,Xiongkuo Min
Main category: cs.CV
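TL;DR: This paper introduces the EEmoDB dataset and the EEmo-Logic multimodal large model, aiming to improve multi-dimensional, fine-grained understanding of image-evoked emotions.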
Details
Motivation: Existing models for image-evoked emotion understanding remain limited to coarse-grained perception or have insufficient reasoning ability, falling short of what machine empathy and human-computer interaction applications require. Method: Builds EEmoDB, the largest image-evoked emotion understanding dataset to date (with two subsets, EEmoDB-QA and EEmoDB-Assess), and proposes EEmo-Logic, a multimodal large language model trained via instruction fine-tuning and task-customized group relative preference optimization (GRPO). Result: EEmo-Logic shows robust performance on both in-domain and cross-domain datasets and excels at emotion question answering and fine-grained assessment. Conclusion: Together, EEmoDB and EEmo-Logic push image-evoked emotion understanding toward multi-dimensional, fine-grained, reasoning-capable models, offering a new paradigm for machine empathy. Abstract: Understanding the multi-dimensional attributes and intensity nuances of image-evoked emotions is pivotal for advancing machine empathy and empowering diverse human-computer interaction applications. However, existing models are still limited to coarse-grained emotion perception or deficient reasoning capabilities. To bridge this gap, we introduce EEmoDB, the largest image-evoked emotion understanding dataset to date. It features $5$ analysis dimensions spanning $5$ distinct task categories, facilitating comprehensive interpretation. Specifically, we compile $1.2M$ question-answering (QA) pairs (EEmoDB-QA) from $125k$ images via automated generation, alongside a $36k$ dataset (EEmoDB-Assess) curated from $25k$ images for fine-grained assessment. Furthermore, we propose EEmo-Logic, an all-in-one multimodal large language model (MLLM) developed via instruction fine-tuning and task-customized group relative preference optimization (GRPO) with novel reward design. Extensive experiments demonstrate that EEmo-Logic achieves robust performance in in-domain and cross-domain datasets, excelling in emotion QA and fine-grained assessment. The code is available at https://anonymous.4open.science/r/EEmoLogic.
[320] Refining Context-Entangled Content Segmentation via Curriculum Selection and Anti-Curriculum Promotion
Chunming He,Rihan Zhang,Fengyang Xiao,Dingming Zhang,Zhiwen Cao,Sina Farsiu
Main category: cs.CV
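TL;DR: This paper proposes CurriSeg, a framework that combines curriculum and anti-curriculum learning strategies to improve robustness and generalization in Context-Entangled Content Segmentation (CECS) without adding parameters or training time.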
Details
Motivation: Inspired by the easy-to-hard progression of biological learning, the work tackles the core difficulty of CECS, where targets and backgrounds share highly similar visual patterns, and addresses the neglect in existing methods of how learning dynamics affect robustness. Method: Proposes the dual-phase CurriSeg framework: (1) a Curriculum Selection phase that dynamically picks informative hard samples based on temporal statistics of sample losses; (2) an Anti-Curriculum Promotion phase that applies Spectral-Blindness Fine-Tuning to suppress high-frequency components and strengthen reliance on low-frequency structural and contextual cues. Result: Achieves consistent gains on multiple CECS benchmarks without increasing model parameters or total training time, supporting the value of designing learning progression and challenge for robust segmentation. Conclusion: CurriSeg offers a new paradigm for context-aware segmentation, showing that well-regulated progression of learning difficulty can markedly improve representation reliability and generalization under entangled distributions. Abstract: Biological learning proceeds from easy to difficult tasks, gradually reinforcing perception and robustness. Inspired by this principle, we address Context-Entangled Content Segmentation (CECS), a challenging setting where objects share intrinsic visual patterns with their surroundings, as in camouflaged object detection. Conventional segmentation networks predominantly rely on architectural enhancements but often ignore the learning dynamics that govern robustness under entangled data distributions. We introduce CurriSeg, a dual-phase learning framework that unifies curriculum and anti-curriculum principles to improve representation reliability. In the Curriculum Selection phase, CurriSeg dynamically selects training data based on the temporal statistics of sample losses, distinguishing hard-but-informative samples from noisy or ambiguous ones, thus enabling stable capability enhancement. In the Anti-Curriculum Promotion phase, we design Spectral-Blindness Fine-Tuning, which suppresses high-frequency components to enforce dependence on low-frequency structural and contextual cues and thus strengthens generalization. Extensive experiments demonstrate that CurriSeg achieves consistent improvements across diverse CECS benchmarks without adding parameters or increasing total training time, offering a principled view of how progression and challenge interplay to foster robust and context-aware segmentation. Code will be released.
[321] EMFormer: Efficient Multi-Scale Transformer for Accumulative Context Weather Forecasting
Hao Chen,Tao Han,Jie Zhang,Song Guo,Fenghua Ling,Lei Bai
Main category: cs.CV
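TL;DR: This paper proposes a new approach to long-term weather forecasting that combines an Efficient Multi-scale Transformer (EMFormer), accumulative context finetuning, and a sinusoidally weighted composite loss to address catastrophic forgetting, error accumulation, and high training overhead, yielding clear gains in both long-horizon accuracy and computational efficiency.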
Details
Motivation: Existing long-term weather forecasting methods are constrained by catastrophic forgetting, error accumulation, and high training overhead, making it hard to achieve both long-horizon accuracy and efficiency. Method: Proposes the EMFormer architecture (multi-scale feature extraction with a single convolution), an accumulative context finetuning strategy (improving temporal consistency without hurting short-term accuracy), and a dynamic composite loss with sinusoidal weighting, forming a complete pipeline across pretraining, finetuning, and forecasting. Result: Substantially improves long-term accuracy in weather forecasting and extreme event prediction; EMFormer also generalizes well to the ImageNet-1K and ADE20K vision benchmarks and runs 5.69x faster than conventional multi-scale modules. Conclusion: The approach improves long-context modeling and computational efficiency, suggesting a new direction for long-term weather prediction and cross-domain temporal modeling. Abstract: Long-term weather forecasting is critical for socioeconomic planning and disaster preparedness. While recent approaches employ finetuning to extend prediction horizons, they remain constrained by the issues of catastrophic forgetting, error accumulation, and high training overhead. To address these limitations, we present a novel pipeline across pretraining, finetuning and forecasting to enhance long-context modeling while reducing computational overhead. First, we introduce an Efficient Multi-scale Transformer (EMFormer) to extract multi-scale features through a single convolution in both training and inference. Based on the new architecture, we further employ an accumulative context finetuning to improve temporal consistency without degrading short-term accuracy. Additionally, we propose a composite loss that dynamically balances different terms via a sinusoidal weighting, thereby adaptively guiding the optimization trajectory throughout pretraining and finetuning. Experiments show that our approach achieves strong performance in weather forecasting and extreme event prediction, substantially improving long-term forecast accuracy. Moreover, EMFormer demonstrates strong generalization on vision benchmarks (ImageNet-1K and ADE20K) while delivering a 5.69x speedup over conventional multi-scale modules.
[322] Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis
Haoran Lai,Zihang Jiang,Kun Zhang,Qingsong Yao,Rongsheng Wang,Zhiyang He,Xiaodong Tao,Wei Wei,Shaohua Kevin Zhou
Main category: cs.CV
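TL;DR: This paper proposes Med3D-R1, a two-stage framework combining supervised fine-tuning and reinforcement learning to improve the clinical reasoning and interpretability of 3D medical vision-language models, reaching state-of-the-art performance on CT-RATE and RAD-ChestCT.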
Details
Motivation: Existing 3D vision-language models struggle with the complexity of volumetric medical imaging, a tendency to overfit superficial report patterns, and a lack of interpretability-aware reward designs, which makes robust clinical reasoning hard to achieve. Method: Proposes the Med3D-R1 framework: the SFT stage introduces a residual alignment mechanism to fuse 3D features with textual embeddings and an abnormality re-weighting strategy to emphasize clinically important tokens; the RL stage redesigns the consistency reward to explicitly promote coherent, step-by-step diagnostic reasoning. Result: On the two 3D diagnostic benchmarks CT-RATE and RAD-ChestCT, accuracy reaches 41.92% and 44.99% respectively, the best reported to date and clearly ahead of prior methods. Conclusion: Med3D-R1 improves the abnormality diagnosis ability and clinical reasoning reliability of 3D medical VLMs and has the potential to support real-world diagnostic workflows. Abstract: Developing 3D vision-language models with robust clinical reasoning remains a challenge due to the inherent complexity of volumetric medical imaging, the tendency of models to overfit superficial report patterns, and the lack of interpretability-aware reward designs. In this paper, we propose Med3D-R1, a reinforcement learning framework with a two-stage training process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). During the SFT stage, we introduce a residual alignment mechanism to bridge the gap between high-dimensional 3D features and textual embeddings, and an abnormality re-weighting strategy to emphasize clinically informative tokens and reduce structural bias in reports. In the RL stage, we redesign the consistency reward to explicitly promote coherent, step-by-step diagnostic reasoning. We evaluate our method on medical multiple-choice visual question answering using two 3D diagnostic benchmarks, CT-RATE and RAD-ChestCT, where our model attains state-of-the-art accuracies of 41.92\% on CT-RATE and 44.99\% on RAD-ChestCT. These results indicate improved abnormality diagnosis and clinical reasoning and outperform prior methods on both benchmarks. Overall, our approach holds promise for enhancing real-world diagnostic workflows by enabling more reliable and transparent 3D medical vision-language systems.
[323] Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment
Yunchuan Ma,Laiyun Qing,Guorong Li,Yuqing Liu,Yuankai Qi,Qingming Huang
Main category: cs.CV
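TL;DR: This paper proposes a Text Refinement and Alignment (TRA) framework that brings textual description information into point-supervised temporal action localization. It comprises two new modules, Point-based Text Refinement (PTR) and Point-based Multimodal Alignment (PMA), achieves state-of-the-art results on multiple benchmarks, and is feasible to deploy in practice.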
Details
Motivation: Existing methods use only visual features and ignore the potential benefit of textual semantic information for point-supervised action localization. Method: Proposes the TRA framework with two new modules: (1) Point-based Text Refinement (PTR), which refines the textual descriptions of video frames using point annotations and multiple pre-trained models; (2) Point-based Multimodal Alignment (PMA), which projects visual and textual features into a unified semantic space and applies point-level multimodal contrastive learning. The fused multimodal features are then fed into the action detector. Result: Clearly outperforms several state-of-the-art methods on five mainstream benchmarks; a computational overhead analysis shows the framework runs on a single 24 GB RTX 3090 GPU, indicating practicality and scalability. Conclusion: Introducing semantically rich textual features and aligning modalities at the point level improves the accuracy and robustness of point-supervised temporal action localization, underlining the value of cross-modal modeling in weakly supervised tasks. Abstract: Recently, point-supervised temporal action localization has gained significant attention for its effective balance between labeling costs and localization accuracy. However, current methods only consider features from visual inputs, neglecting helpful semantic information from the text side. To address this issue, we propose a Text Refinement and Alignment (TRA) framework that effectively utilizes textual features from visual descriptions to complement the visual features as they are semantically rich. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA). Specifically, we first generate descriptions for video frames using a pre-trained multimodal model. Next, PTR refines the initial descriptions by leveraging point annotations together with multiple pre-trained models. PMA then projects all features into a unified semantic space and leverages a point-level multimodal feature contrastive learning to reduce the gap between visual and linguistic modalities. Last, the enhanced multi-modal features are fed into the action detector for precise localization. Extensive experimental results on five widely used benchmarks demonstrate the favorable performance of our proposed framework compared to several state-of-the-art methods. Moreover, our computational overhead analysis shows that the framework can run on a single 24 GB RTX 3090 GPU, indicating its practicality and scalability.
[324] OASIS-DC: Generalizable Depth Completion via Output-level Alignment of Sparse-Integrated Monocular Pseudo Depth
Jaehyeon Cho,Jhonghyun An
Main category: cs.CV
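TL;DR: This paper proposes a method that combines a monocular foundation model with sparse range calibration, converting relative depth into a pseudo metric depth prior, and designs a refinement network that achieves accurate metric depth prediction from very few labeled samples, suiting real-world settings where labels are scarce.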
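To make the calibration step concrete, here is a minimal sketch, assuming the relative depth is aligned to the sparse range measurements with a single least-squares scale and shift. The abstract for this entry only states that relative depth is calibrated with sparse measurements, so the affine model, the function names `fit_scale_shift` and `pseudo_metric_prior`, and the toy values are illustrative assumptions, not the authors' implementation.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~= scale * relative + shift (assumed model)."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths at the anchors must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def pseudo_metric_prior(relative_depth, sparse_metric):
    """Calibrate a flattened relative depth map using sparse metric anchors.

    relative_depth: list of relative depths, one per pixel (hypothetical layout).
    sparse_metric:  dict mapping pixel index -> measured metric depth.
    """
    idx = sorted(sparse_metric)
    scale, shift = fit_scale_shift(
        [relative_depth[i] for i in idx],
        [sparse_metric[i] for i in idx],
    )
    # Apply the fitted affine map to every pixel, not only the anchors.
    return [scale * r + shift for r in relative_depth]


# Toy data: the anchors are consistent with metric = 5 * relative + 2.
relative_depth = [0.1, 0.2, 0.4, 0.8, 1.0]
sparse_metric = {0: 2.5, 2: 4.0, 4: 7.0}
prior = pseudo_metric_prior(relative_depth, sparse_metric)
print(prior)
```

In this sketch three range measurements are enough to give every pixel a metric value. A refinement network such as the one the entry describes would then correct the prior where a single global scale and shift is too crude.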
Details
Motivation: Existing monocular foundation models can estimate depth zero-shot, but their output is relative rather than metric, so it cannot be used directly in applications that need accurate scale, such as robotics and autonomous driving. Method: Exploits the fact that relative depth preserves global structure and boundaries, calibrating it with sparse range measurements to produce a pseudo metric depth prior; a refinement network then follows this prior in reliable regions and deviates from it where necessary to improve accuracy. Result: The method maintains stable scale and sharp edges with very few labeled samples and is especially robust when curated validation data are unavailable. Conclusion: Coupling foundation-model priors with sparse anchors is a practical route to robust, deployable depth completion under real-world label scarcity. Abstract: Recent monocular foundation models excel at zero-shot depth estimation, yet their outputs are inherently relative rather than metric, limiting direct use in robotics and autonomous driving. We leverage the fact that relative depth preserves global layout and boundaries: by calibrating it with sparse range measurements, we transform it into a pseudo metric depth prior. Building on this prior, we design a refinement network that follows the prior where reliable and deviates where necessary, enabling accurate metric predictions from very few labeled samples. The resulting system is particularly effective when curated validation data are unavailable, sustaining stable scale and sharp edges across few-shot regimes. These findings suggest that coupling foundation priors with sparse anchors is a practical route to robust, deployment-ready depth completion under real-world label scarcity.
[325] Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution
Xun Zhang,Kaicheng Yang,Hongliang Lu,Haotong Qin,Yong Guo,Yulun Zhang
Main category: cs.CV
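TL;DR: This paper proposes Q-DiT4SR, the first post-training quantization (PTQ) framework designed specifically for DiT-based Real-ISR. Using hierarchical SVD (H-SVD) and variance-aware spatio-temporal mixed precision (VaSMP/VaTMP), it achieves state-of-the-art performance under W4A6 and W4A4 settings while greatly reducing model size and computation.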
Details
Motivation: Diffusion Transformers (DiTs) perform well in Real-ISR but are expensive at inference; existing PTQ methods do not suit DiT-based Real-ISR, and applying them directly causes severe degradation of local textures. Method: Proposes the Q-DiT4SR framework, comprising: (1) H-SVD, a hierarchical SVD that combines a global low-rank branch with a local block-wise rank-1 branch; (2) VaSMP, a data-free cross-layer weight bit-width allocation based on rate-distortion theory; (3) VaTMP, a dynamic-programming schedule of activation precision across diffusion timesteps that needs only minimal calibration. Result: On multiple real-world datasets, Q-DiT4SR achieves state-of-the-art performance under both W4A6 and W4A4 settings; the W4A4 configuration reduces model size by 5.8× and computational operations by over 60×. Conclusion: Q-DiT4SR is the first efficient PTQ scheme for DiT-based Real-ISR, balancing accuracy and efficiency and showing the value of task-specific quantization design for deploying generative super-resolution models. Abstract: Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8$\times$ and computational operations by over 60$\times$. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR.
[326] TF-Lane: Traffic Flow Module for Robust Lane Perception
Yihan Xie,Han Xia,Zhen Yang
Main category: cs.CV
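TL;DR: This paper proposes TFM, a lane perception module based on real-time traffic flow information that improves the robustness of existing vision-based lane detection methods in challenging scenarios such as occlusion or missing lane markings at no additional cost, with clear gains on the NuScenes and OpenLaneV2 datasets.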
Details
Motivation: Existing vision-based lane detection methods degrade significantly under occlusion or missing lane markings; high-definition maps help but are costly and have limited real-time performance. Method: Proposes the TrafficFlow-aware Lane perception Module (TFM), which extracts real-time features from traffic flow and integrates them seamlessly with existing lane perception algorithms. Result: Experiments on four mainstream models and two public datasets (NuScenes and OpenLaneV2) show that TFM consistently improves performance, with a gain of up to +4.1% mAP on NuScenes. Conclusion: Traffic flow is a new low-cost, timely source of auxiliary information, and TFM improves the robustness and practicality of lane perception in complex scenarios. Abstract: Autonomous driving systems require robust lane perception capabilities, yet existing vision-based detection methods suffer significant performance degradation when visual sensors provide insufficient cues, such as in occluded or lane-missing scenarios. While some approaches incorporate high-definition maps as supplementary information, these solutions face challenges of high subscription costs and limited real-time performance. To address these limitations, we explore an innovative information source: traffic flow, which offers real-time capabilities without additional costs. This paper proposes a TrafficFlow-aware Lane perception Module (TFM) that effectively extracts real-time traffic flow features and seamlessly integrates them with existing lane perception algorithms. This solution originated from real-world autonomous driving conditions and was subsequently validated on open-source algorithms and datasets. Extensive experiments on four mainstream models and two public datasets (Nuscenes and OpenLaneV2) using standard evaluation metrics show that TFM consistently improves performance, achieving up to +4.1% mAP gain on the Nuscenes dataset.
[327] DSFC-Net: A Dual-Encoder Spatial and Frequency Co-Awareness Network for Rural Road Extraction
Zhengbo Zhang,Yihe Tian,Wanke Xia,Lin Chen,Yue Sun,Kun Ding,Ying Wang,Bing Xu,Shiming Xiang
Main category: cs.CV
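TL;DR: This paper proposes DSFC-Net, a dual-encoder network that fuses spatial and frequency-domain information. A CNN branch extracts local detail, a novel Spatial-Frequency Hybrid Transformer (SFT) models global topological relationships, and a Cross-Frequency Interaction Attention (CFIA) module and a Channel Feature Fusion Module (CFFM) are introduced, substantially improving the accuracy of rural road extraction from remote sensing imagery.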
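The abstract for this entry says the CFIA module decouples high- and low-frequency information with a Laplacian Pyramid strategy. As a rough illustration of what one pyramid level does, here is a minimal 1D sketch. The 3-tap blur, nearest-neighbour upsampling, and all function names are assumptions chosen for brevity; the paper's 2D feature-map version and the way CFIA consumes the two bands are not specified in this digest.

```python
def blur(x):
    """3-tap [1, 2, 1] / 4 smoothing with edge replication."""
    n = len(x)
    return [(x[max(i - 1, 0)] + 2 * x[i] + x[min(i + 1, n - 1)]) / 4
            for i in range(n)]


def downsample(x):
    """Keep every second sample."""
    return x[::2]


def upsample(x, n):
    """Nearest-neighbour upsampling back to length n."""
    return [x[min(i // 2, len(x) - 1)] for i in range(n)]


def laplacian_level(x):
    """Split x into a coarse low-frequency band and a high-frequency residual."""
    low = downsample(blur(x))
    up = upsample(low, len(x))
    high = [xi - ui for xi, ui in zip(x, up)]
    return low, high


def reconstruct(low, high):
    """Invert laplacian_level: upsampled low band plus the residual."""
    up = upsample(low, len(high))
    return [ui + hi for ui, hi in zip(up, high)]


# Toy "road profile": a narrow bright segment on a dark background.
signal = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 0.0, 0.0]
low, high = laplacian_level(signal)
restored = reconstruct(low, high)
print(low)
print(high)
```

The low band carries the coarse structure at half resolution, while the residual holds whatever the coarse band cannot represent, mostly the segment's edges here. Because the residual is defined against the upsampled low band, the two bands together reproduce the input up to floating-point rounding.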
Details
Motivation: Rural road extraction faces high intra-class variability, low inter-class separability, heavy vegetation occlusion, and narrow roads, and existing methods built for urban environments adapt poorly. Method: Proposes the dual-encoder DSFC-Net framework: a CNN branch captures local boundaries and short-range continuity; a Spatial-Frequency Hybrid Transformer (SFT) uses a Laplacian Pyramid to decouple high-frequency (detail) and low-frequency (structure) information, which interact dynamically through the CFIA module; the CFFM adaptively fuses the channel features of the two branches. Result: Experiments on the WHU-RuR+, DeepGlobe, and Massachusetts datasets show that DSFC-Net outperforms current state-of-the-art methods. Conclusion: A dual-encoder architecture that fuses spatial and frequency-domain information and models high and low frequencies separately handles the complex interference in rural road extraction more robustly, improving narrow-road continuity and segmentation accuracy. Abstract: Accurate extraction of rural roads from high-resolution remote sensing imagery is essential for infrastructure planning and sustainable development. However, this task presents unique challenges in rural settings due to several factors. These include high intra-class variability and low inter-class separability from diverse surface materials, frequent vegetation occlusions that disrupt spatial continuity, and narrow road widths that exacerbate detection difficulties. Existing methods, primarily optimized for structured urban environments, often underperform in these scenarios as they overlook such distinctive characteristics. To address these challenges, we propose DSFC-Net, a dual-encoder framework that synergistically fuses spatial and frequency-domain information. Specifically, a CNN branch is employed to capture fine-grained local road boundaries and short-range continuity, while a novel Spatial-Frequency Hybrid Transformer (SFT) is introduced to robustly model global topological dependencies against vegetation occlusions. Distinct from standard attention mechanisms that suffer from frequency bias, the SFT incorporates a Cross-Frequency Interaction Attention (CFIA) module that explicitly decouples high- and low-frequency information via a Laplacian Pyramid strategy. This design enables the dynamic interaction between spatial details and frequency-aware global contexts, effectively preserving the connectivity of narrow roads. Furthermore, a Channel Feature Fusion Module (CFFM) is proposed to bridge the two branches by adaptively recalibrating channel-wise feature responses, seamlessly integrating local textures with global semantics for accurate segmentation. Comprehensive experiments on the WHU-RuR+, DeepGlobe, and Massachusetts datasets validate the superiority of DSFC-Net over state-of-the-art approaches.
[328] Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons
Xianhui Zhang,Chengyu Xie,Linxia Zhu,Yonghui Yang,Weixiang Zhao,Zifeng Cheng,Cong Wang,Fei Shen,Tat-Seng Chua
Main category: cs.CV
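TL;DR: This paper finds that large language models contain a set of cross-lingual shared safety neurons (SS-Neurons) that bridge the transfer of safety capabilities between high-resource and low-resource languages. Locating, intervening on, and selectively fine-tuning this very small neuron subset significantly improves safety in non-high-resource languages without harming general capabilities.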
Details
Motivation: Multilingual safety is severely imbalanced, with non-high-resource languages more vulnerable to attack than high-resource ones; at the same time, the neural mechanisms behind safety alignment remain unclear. Method: First identifies monolingual safety neurons (MS-Neurons) and validates their causal role; then uses cross-lingual analysis to extract the subset shared between high- and low-resource languages (SS-Neurons); finally proposes a targeted SS-Neuron training strategy based on language resource distribution and model architecture. Result: Suppressing SS-Neurons causes simultaneous safety drops across multiple non-high-resource languages, while reinforcing them improves cross-lingual defensive consistency; fine-tuning only this tiny neuron subset outperforms existing state-of-the-art methods, significantly improving non-high-resource language safety while preserving general capabilities. Conclusion: SS-Neurons are a key neural basis for cross-lingual safety alignment in large language models, and lightweight neuron-centered intervention offers an interpretable, efficient, and well-generalizing paradigm for multilingual safety. Abstract: Multilingual safety remains significantly imbalanced, leaving non-high-resource (NHR) languages vulnerable compared to robust high-resource (HR) ones. Moreover, the neural mechanisms driving safety alignment remain unclear despite observed cross-lingual representation transfer. In this paper, we find that LLMs contain a set of cross-lingual shared safety neurons (SS-Neurons), a remarkably small yet critical neuronal subset that jointly regulates safety behavior across languages. We first identify monolingual safety neurons (MS-Neurons) and validate their causal role in safety refusal behavior through targeted activation and suppression. Our cross-lingual analyses then identify SS-Neurons as the subset of MS-Neurons shared between HR and NHR languages, serving as a bridge to transfer safety capabilities from HR to NHR domains. We observe that suppressing these neurons causes concurrent safety drops across NHR languages, whereas reinforcing them improves cross-lingual defensive consistency. Building on these insights, we propose a simple neuron-oriented training strategy that targets SS-Neurons based on language resource distribution and model architecture. Experiments demonstrate that fine-tuning this tiny neuronal subset outperforms state-of-the-art methods, significantly enhancing NHR safety while maintaining the model's general capabilities. The code and dataset will be available at https://github.com/1518630367/SS-Neuron-Expansion.
[329] Interacted Planes Reveal 3D Line Mapping
Zeran Ke,Bin Tan,Gui-Song Xia,Yujun Shen,Nan Xue
Main category: cs.CV
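TL;DR: This paper proposes LiP-Map, a framework that jointly optimizes 3D lines and planes. By explicitly modeling learnable line and plane primitives, it builds structured 3D line maps efficiently and accurately and improves line-assisted visual localization.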
Details
Motivation: From a physical and topological perspective, a 3D line is essentially the edge of a finite 3D plane, yet existing methods do not model this structural relationship. Method: Proposes the LiP-Map framework, which explicitly couples learnable line and plane primitives; rather than simply imposing coplanarity constraints, it constructs intrinsic interactions between the two. Result: On more than 100 scenes from five datasets including ScanNetV2, the line maps exceed state-of-the-art methods in both accuracy and completeness; line-assisted visual localization on 7Scenes also improves significantly. Conclusion: LiP-Map is the first to integrate planar topology into 3D line mapping, offering a new paradigm for structured reconstruction in man-made environments that is both efficient and practical. Abstract: 3D line mapping from multi-view RGB images provides a compact and structured visual representation of scenes. We study the problem from a physical and topological perspective: a 3D line most naturally emerges as the edge of a finite 3D planar patch. We present LiP-Map, a line-plane joint optimization framework that explicitly models learnable line and planar primitives. This coupling enables accurate and detailed 3D line mapping while maintaining strong efficiency (typically completing a reconstruction in 3 to 5 minutes per scene). LiP-Map pioneers the integration of planar topology into 3D line mapping, not by imposing pairwise coplanarity constraints but by explicitly constructing interactions between plane and line primitives, thus offering a principled route toward structured reconstruction in man-made environments. On more than 100 scenes from ScanNetV2, ScanNet++, Hypersim, 7Scenes, and Tanks\&Temple, LiP-Map improves both accuracy and completeness over state-of-the-art methods. Beyond line mapping quality, LiP-Map significantly advances line-assisted visual localization, establishing strong performance on 7Scenes. Our code is released at https://github.com/calmke/LiPMAP for reproducible research.
[330] Interaction-Consistent Object Removal via MLLM-Based Reasoning
Ching-Kai Huang,Wen-Chieh Lin,Yan-Cen Lee
Main category: cs.CV
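TL;DR: This paper formulates Interaction-Consistent Object Removal (ICOR) and proposes REORM, a reasoning framework driven by multimodal large language models (MLLMs) that removes not only the target object but also the elements interacting with it (such as lighting effects and physically connected objects), and validates its advantage on the newly built ICOREval benchmark.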
Details
Motivation: Existing image-based object removal methods delete only the named target and ignore its semantic interactions with other scene elements (such as lighting, connections, and contextual dependencies), producing semantically inconsistent edits. Method: Proposes Reasoning-Enhanced Object Removal with MLLM (REORM), which combines MLLM-based interaction reasoning, a mask-guided removal module, and a self-correction mechanism, and supports local deployment for resource-constrained settings; also builds the ICOREval benchmark for evaluation. Result: On ICOREval, REORM significantly outperforms existing image editing methods, producing edits that better respect semantic interaction consistency. Conclusion: Interaction consistency is key to high-quality object removal; by adding MLLM-driven reasoning, REORM models and removes the target object together with its associated interaction elements, offering a new paradigm for image editing. Abstract: Image-based object removal often erases only the named target, leaving behind interaction evidence that renders the result semantically inconsistent. We formalize this problem as Interaction-Consistent Object Removal (ICOR), which requires removing not only the target object but also associated interaction elements, such as lighting-dependent effects, physically connected objects, target-produced elements, and contextually linked objects. To address this task, we propose Reasoning-Enhanced Object Removal with MLLM (REORM), a reasoning-enhanced object removal framework that leverages multimodal large language models to infer which elements must be jointly removed. REORM features a modular design that integrates MLLM-driven analysis, mask-guided removal, and a self-correction mechanism, along with a local-deployment variant that supports accurate editing under limited resources. To support evaluation, we introduce ICOREval, a benchmark consisting of instruction-driven removals with rich interaction dependencies. On ICOREval, REORM outperforms state-of-the-art image editing systems, demonstrating its effectiveness in producing interaction-consistent results.
[331] ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation
Ayushman Sarkar,Zhenyu Yu,Chu Chen,Wei Tang,Kangning Cui,Mohd Yamani Idna Idris
Main category: cs.CV
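TL;DR: ReDiStory is a training-free framework for multi-frame visual story generation. It reorganizes text prompt embeddings at inference time, explicitly separating identity and frame-specific semantic components and decorrelating directions shared across frames, which improves subject identity consistency without harming semantic fidelity.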
Details
Motivation: Existing training-free methods concatenate identity and frame prompts, which causes inter-frame semantic interference and weakens subject identity consistency in complex stories. Method: At inference time, ReDiStory decomposes text embeddings into identity-related and frame-specific components and decorrelates the frame embeddings by suppressing directions shared across frames. Result: On the ConsiStory+ benchmark, ReDiStory consistently outperforms 1Prompt1Story on multiple identity consistency metrics, without modifying diffusion model parameters or adding supervision. Conclusion: ReDiStory shows that inference-time reorganization of prompt embeddings is an effective, lightweight, training-free way to improve identity consistency in multi-frame stories. Abstract: Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: https://github.com/YuZhenyuLindy/ReDiStory
[332] StoryState: Agent-Based State Control for Consistent and Editable Storybooks
Ayushman Sarkar,Zhenyu Yu,Wei Tang,Chu Chen,Kangning Cui,Mohd Yamani Idna Idris
Main category: cs.CV
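TL;DR: This paper proposes StoryState, an agent-based orchestration layer that adds an explicit, editable story state on top of training-free text-to-image generation to improve the consistency and controllability of multi-page storybook generation and editing.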
Details
Motivation: Existing large multimodal models can generate multi-page storybooks in one click, but the story state (such as characters, settings, and page-level objects) remains implicit, so edits are coarse-grained and easily break visual consistency. Method: StoryState models each story as a structured object (a character sheet, global settings, and per-page scene constraints) and uses a small set of LLM agents to maintain this state and derive 1Prompt1Story-style prompts for generation and editing; it operates purely through prompts and is therefore model-agnostic. Result: System-level experiments on multi-page editing tasks show that StoryState supports localized page edits, improves cross-page consistency, and reduces unintended changes, interaction turns, and editing time, while approaching the one-shot consistency of Gemini Storybook. Conclusion: Through explicit state modeling and agent collaboration, StoryState improves the controllability and consistency of multi-page illustrated story generation without depending on a specific generation model or fine-tuning. Abstract: Large multimodal models have enabled one-click storybook generation, where users provide a short description and receive a multi-page illustrated story. However, the underlying story state, such as characters, world settings, and page-level objects, remains implicit, making edits coarse-grained and often breaking visual consistency. We present StoryState, an agent-based orchestration layer that introduces an explicit and editable story state on top of training-free text-to-image generation. StoryState represents each story as a structured object composed of a character sheet, global settings, and per-page scene constraints, and employs a small set of LLM agents to maintain this state and derive 1Prompt1Story-style prompts for generation and editing. Operating purely through prompts, StoryState is model-agnostic and compatible with diverse generation backends. System-level experiments on multi-page editing tasks show that StoryState enables localized page edits, improves cross-page consistency, and reduces unintended changes, interaction turns, and editing time compared to 1Prompt1Story, while approaching the one-shot consistency of Gemini Storybook. Code is available at https://github.com/YuZhenyuLindy/StoryState
[333] DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling
Ayushman Sarkar,Zhenyu Yu,Mohd Yamani Idna Idris
Main category: cs.CV
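TL;DR: DeCorStory is a training-free, inference-time framework that uses prompt embedding decorrelation, singular value reweighting, and identity-preserving cross-attention to reduce inter-frame semantic interference in text-to-image story generation, improving both consistency and diversity.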
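The decorrelation step named in this entry's title is classical Gram-Schmidt orthogonalization. The sketch below applies it to toy per-frame embedding vectors to show the effect. It is only an illustration: which tokens DeCorStory orthogonalizes, in what order, and whether vector norms are restored afterwards are not stated in this digest, and the helper names and three-dimensional vectors are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(embeddings, eps=1e-12):
    """Orthogonalize vectors in order (modified Gram-Schmidt).

    Each vector loses its components along all earlier vectors, so a
    direction shared by several frames survives only in the first one.
    Output vectors are not renormalized.
    """
    basis = []   # unit vectors spanning what has been seen so far
    out = []
    for v in embeddings:
        w = list(v)
        for u in basis:
            c = dot(w, u)                      # projection on the residual
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:                         # skip numerically dependent vectors
            basis.append([wi / norm for wi in w])
        out.append(w)
    return out


# Toy frame embeddings that all share the first coordinate.
frames = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
]
ortho = gram_schmidt(frames)
for vec in ortho:
    print(vec)
```

After orthogonalization the second and third vectors keep only what is new relative to the earlier frames, which is the sense in which cross-frame correlation is removed. Because the result depends on the order of the vectors, any real use has to decide which frame keeps the shared direction.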
Details
Motivation: Existing training-free methods (such as One-Prompt-One-Story) concatenate all prompts into a single sequence, which makes the embeddings strongly correlated and leads to color leakage, background blending, and identity drift. Method: Applies Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, adds singular value reweighting to strengthen prompt-specific information, and introduces identity-preserving cross-attention to stabilize character identity during diffusion. Result: Achieves consistent improvements in prompt-image alignment, identity consistency, and visual diversity, giving the best performance among current training-free baselines. Conclusion: DeCorStory needs no model modification or fine-tuning and plugs into existing diffusion pipelines, markedly reducing cross-frame semantic interference and offering an efficient, practical approach to text-to-image story generation. Abstract: Maintaining visual and semantic consistency across frames is a key challenge in text-to-image storytelling. Existing training-free methods, such as One-Prompt-One-Story, concatenate all prompts into a single sequence, which often induces strong embedding correlation and leads to color leakage, background blending, and identity drift. We propose DeCorStory, a training-free inference-time framework that explicitly reduces inter-frame semantic interference. DeCorStory applies Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information and identity-preserving cross-attention to stabilize character identity during diffusion. The method requires no model modification or fine-tuning and can be seamlessly integrated into existing diffusion pipelines. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity, achieving state-of-the-art performance among training-free baselines. Code is available at: https://github.com/YuZhenyuLindy/DeCorStory
[334] FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching
Divya Jyoti Bajpai,Shubham Agarwal,Apoorv Saxena,Kuldeep Kulkarni,Subrata Mitra,Manjesh Kumar Hanawal
Main category: cs.CV
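TL;DR: This paper proposes FlowCast, a training-free speculative generation framework that exploits the constant-velocity property of flow matching models to accelerate inference, achieving more than 2.5x speedup with no loss in quality.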
Details
Motivation: Existing flow matching models are slow at inference because of the large number of denoising steps, which limits their use in real-time or interactive applications; existing acceleration methods degrade quality, require costly retraining, or generalize poorly. Method: FlowCast builds on the fact that flow matching models are trained to preserve constant velocity: it speculates future velocity by extrapolating the current velocity and accepts the speculation when it falls within a mean-squared error threshold, skipping redundant steps in stable regions; it needs no auxiliary networks and no retraining, and is plug-and-play. Result: Achieves over 2.5x speedup on image generation, video generation, and editing tasks, with quality on par with standard full-step generation and better than existing baselines; the paper also gives a theoretical bound on the worst-case deviation between speculative and full trajectories. Conclusion: FlowCast is an efficient, general, training-free acceleration framework for flow matching that markedly speeds up inference while preserving generation quality, with both practical and theoretical support. Abstract: Flow Matching (FM) has recently emerged as a powerful approach for high-quality visual generation. However, the prohibitively slow inference of FM models, caused by a large number of denoising steps, limits their potential use in real-time or interactive applications. Existing acceleration methods, like distillation, truncation, or consistency training, either degrade quality, incur costly retraining, or lack generalization. We propose FlowCast, a training-free speculative generation framework that accelerates inference by exploiting the fact that FM models are trained to preserve constant velocity. FlowCast speculates future velocity by extrapolating current velocity without incurring additional time cost, and accepts it if it is within a mean-squared error threshold. This constant-velocity forecasting allows redundant steps in stable regions to be aggressively skipped while retaining precision in complex ones. FlowCast is a plug-and-play framework that integrates seamlessly with any FM model and requires no auxiliary networks. We also present a theoretical analysis and bound the worst-case deviation between speculative and full FM trajectories. Empirical evaluations demonstrate that FlowCast achieves $>2.5\times$ speedup in image generation, video generation, and editing tasks, outperforming existing baselines with no quality loss as compared to standard full generation.
[335] What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom
Yan Ma,Weiyu Zhang,Tianle Li,Linge Du,Xuyang Shen,Pengfei Liu
Main category: cs.CV
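TL;DR: This paper proposes the MED framework to separate intrinsic capability gains from tool-use effects in vision tool-use reinforcement learning, and finds that current methods mainly reduce the harm caused by tools rather than truly mastering them.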
Details
Motivation: The source of performance gains in existing vision tool-use RL is unclear, making it hard to tell whether they come from improved intrinsic capabilities or from better tool use. Method: Proposes MED (Measure-Explain-Diagnose), a coarse-to-fine analysis framework that separates intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Result: Checkpoint-level analyses on two VLMs and six benchmarks show that improvements are dominated by intrinsic learning; tool-use RL mainly reduces tool-induced errors and interference, and does little to correct intrinsic failures with tools. Conclusion: Current vision tool-use RL essentially learns to coexist safely with tools rather than to master them. Abstract: Vision tool-use reinforcement learning (RL) can equip vision-language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic capabilities. We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, current vision tool-use RL learns to coexist safely with tools rather than master them.[336] Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning
Yu Xu,Yuxin Zhang,Juan Cao,Lin Gao,Chunyu Wang,Oliver Deussen,Tong-Yee Lee,Fan Tang
Main category: cs.CV
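TL;DR: This paper introduces the Visual Metaphor Transfer (VMT) task, in which a model autonomously extracts the abstract creative essence of a reference image and transfers its logic to a new target subject. To this end it designs a cognitive-science-inspired multi-agent framework based on Conceptual Blending Theory, with four cooperating agents for perception, transfer, generation, and diagnosis, and introduces a structured "Schema Grammar" to support cross-domain logic re-instantiation. Experiments show that the method significantly outperforms existing methods in metaphor consistency, analogy appropriateness, and visual creativity.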
Details
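Motivation: Existing generative AI models are confined to pixel-level instruction alignment and surface-level appearance preservation and struggle to capture the abstract logic needed for genuine visual metaphors, so a new paradigm that can model and transfer the "creative essence" is needed. Method: Introduces the Visual Metaphor Transfer (VMT) task; builds a multi-agent framework based on Conceptual Blending Theory (CBT) comprising a perception agent (extracts the schema), a transfer agent (maintains generic space invariance to discover apt carriers), a generation agent (high-fidelity synthesis), and a hierarchical diagnostic agent (closed-loop backtracking to correct errors); introduces a novel Schema Grammar ("G") as a structured intermediate representation that decouples relational invariants from specific visual entities. Result: Significantly surpasses SOTA baselines on multiple automatic metrics and in human evaluation, particularly in metaphor consistency, analogy appropriateness, and visual creativity, validating the cognitive-driven framework and the Schema Grammar. Conclusion: The VMT task and the proposed cognitive multi-agent framework offer a new path toward high-level abstract creativity in generative AI and advance automated applications in high-impact creative settings such as advertising and media. Abstract: A visual metaphor constitutes a high-order form of human creativity, employing cross-domain semantic fusion to transform abstract concepts into impactful visual rhetoric. Despite the remarkable progress of generative AI, existing models remain largely confined to pixel-level instruction alignment and surface-level appearance preservation, failing to capture the underlying abstract logic necessary for genuine metaphorical generation. To bridge this gap, we introduce the task of Visual Metaphor Transfer (VMT), which challenges models to autonomously decouple the "creative essence" from a reference image and re-materialize that abstract logic onto a user-specified target subject. We propose a cognitive-inspired, multi-agent framework that operationalizes Conceptual Blending Theory (CBT) through a novel Schema Grammar ("G"). This structured representation decouples relational invariants from specific visual entities, providing a rigorous foundation for cross-domain logic re-instantiation. Our pipeline executes VMT through a collaborative system of specialized agents: a perception agent that distills the reference into a schema, a transfer agent that maintains generic space invariance to discover apt carriers, a generation agent for high-fidelity synthesis, and a hierarchical diagnostic agent that mimics a professional critic, performing closed-loop backtracking to identify and rectify errors across abstract logic, component selection, and prompt encoding. 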
Extensive experiments and human evaluations demonstrate that our method significantly outperforms SOTA baselines in metaphor consistency, analogy appropriateness, and visual creativity, paving the way for automated high-impact creative applications in advertising and media. Source code will be made publicly available.[337] MTC-VAE: Multi-Level Temporal Compression with Content Awareness
Yubo Dong,Linchao Zhu
Main category: cs.CV
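TL;DR: This paper presents a technique for converting fixed-compression-rate VAEs into VAEs that support multi-level temporal compression, mitigating the performance drop at high compression rates with minimal fine-tuning, and verifies its effectiveness and compatibility with video diffusion models such as DiT.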
Details
Motivation: When continuous variational autoencoders (VAEs) raise the video compression rate without increasing the hidden channel dimension, performance drops markedly; an efficient, lightweight method for supporting multi-level temporal compression is needed. Method: Proposes a technique that converts a fixed-compression-rate VAE into one supporting multi-level temporal compression with only minimal fine-tuning; evaluates the effect of compression level on different video segments; and integrates it with diffusion generative models such as DiT for joint training. Result: The method effectively mitigates the performance drop at high compression rates; empirical results indicate robustness across video segments with different characteristics; compatibility and concurrent training with diffusion models such as DiT are achieved. Conclusion: A multi-level temporal compression VAE is a simple, efficient, and general solution that improves the flexibility and performance of LVDMs under different compression requirements and has practical potential. Abstract: Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations. For continuous VAEs, achieving higher compression rates is desirable; yet, the efficiency notably declines when extra sampling layers are added without expanding the dimensions of hidden channels. In this paper, we present a technique to convert fixed compression rate VAEs into models that support multi-level temporal compression, providing a straightforward and minimal fine-tuning approach to counteract performance decline at elevated compression rates. Moreover, we examine how varying compression levels impact model performance over video segments with diverse characteristics, offering empirical evidence on the effectiveness of our proposed approach. We also investigate the integration of our multi-level temporal compression VAE with diffusion-based generative models such as DiT, highlighting successful concurrent training and compatibility within these frameworks. This investigation illustrates the potential uses of multi-level temporal compression.[338] Adaptive Visual Autoregressive Acceleration via Dual-Linkage Entropy Analysis
Yu Zhang,Jingyi Liu,Feng Liu,Duoqian Miao,Qi Zhang,Kexue Fu,Changwei Wang,Longbing Cao
Main category: cs.CV
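TL;DR: This paper proposes NOVA, a framework that uses entropy analysis to provide training-free, adaptive token-reduction acceleration for visual autoregressive (VAR) models, substantially lowering computational cost while maintaining generation quality.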
Details
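Motivation: Existing VAR models are computationally expensive because of their massive token counts, and existing token reduction methods have three shortcomings (heuristic stage partition, non-adaptive schedules, and limited acceleration scope) that leave acceleration potential untapped. Method: NOVA is a training-free token-reduction acceleration framework built on the principle that entropy variation reflects how predictive uncertainty evolves; it identifies the inflection point of scale entropy growth online to adaptively determine the scale at which acceleration activates; through scale-linkage and layer-linkage ratio adjustment it dynamically computes a distinct token reduction ratio for each scale and layer, prunes low-entropy tokens, and reuses the cache derived from the residuals at the prior scale to accelerate inference. Result: Extensive experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework that reduces computational overhead while maintaining generation quality. Conclusion: By introducing an entropy-driven adaptive token reduction mechanism, NOVA overcomes the limitations of conventional VAR acceleration methods and offers a new paradigm for efficient visual autoregressive modeling. Abstract: Visual AutoRegressive modeling (VAR) suffers from substantial computational cost due to the massive token count involved. Failing to account for the continuous evolution of modeling dynamics, existing VAR token reduction methods face three key limitations: heuristic stage partition, non-adaptive schedules, and limited acceleration scope, thereby leaving significant acceleration potential untapped. Since entropy variation intrinsically reflects the transition of predictive uncertainty, it offers a principled measure to capture modeling dynamics evolution. Therefore, we propose NOVA, a training-free token reduction acceleration framework for VAR models via entropy analysis. NOVA adaptively determines the acceleration activation scale during inference by identifying online the inflection point of scale entropy growth. Through scale-linkage and layer-linkage ratio adjustment, NOVA dynamically computes distinct token reduction ratios for each scale and layer, pruning low-entropy tokens while reusing the cache derived from the residuals at the prior scale to accelerate inference and maintain generation quality. Extensive experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework.[339] T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation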
Xingzu Zhan,Chen Xie,Honghang Chen,Yixun Lin,Xiaochun Mai
Main category: cs.CV
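TL;DR: This paper proposes T2M Mamba, which uses a periodicity- and keyframe-saliency-aware Mamba architecture and a Periodic Differential Cross-modal Alignment Module (PDCAM) to address two problems in text-to-motion generation, drift in long sequences and poor robustness to semantically equivalent paraphrases, and achieves SOTA performance on the HumanML3D and KIT-ML datasets.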
Details
Motivation: Existing text-to-motion models have two core shortcomings: they ignore the coupling between motion periodicity and keyframe saliency, which causes drift in long-sequence generation; and they are not robust to semantically equivalent paraphrases (such as synonym substitution), which readily produce erroneous motions. Method: Proposes Periodicity-Saliency Aware Mamba, which estimates keyframe weights via enhanced Density Peaks Clustering and motion periodicity via FFT-accelerated autocorrelation; and builds a Periodic Differential Cross-modal Alignment Module (PDCAM) to make the alignment of text and motion embeddings more robust. Result: Experiments on HumanML3D and KIT-ML report an FID of 0.068 and consistent gains on all other metrics. Conclusion: T2M Mamba effectively mitigates long-sequence drift and sensitivity to semantic perturbations of the text, supporting the importance of modeling the periodicity-saliency coupling and of robust cross-modal alignment. Abstract: Text-to-motion generation, which converts motion language descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Though existing models have achieved significant fidelity, they still suffer from two core limitations: (i) They treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. (ii) They are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, propagating through the decoder and producing unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) proposing Periodicity-Saliency Aware Mamba, which utilizes novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation to capture coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) to enhance robust alignment of textual and motion embeddings. Extensive experiments on HumanML3D and KIT-ML datasets have been conducted, confirming the effectiveness of our approach, achieving an FID of 0.068 and consistent gains on all other metrics.[340] Exposing and Defending the Achilles' Heel of Video Mixture-of-Experts
Songping Wang,Qinglong Liu,Yueming Lyu,Ning Li,Ziwen He,Caifeng Shan
Main category: cs.CV
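TL;DR: This paper proposes a new adversarial attack and defense framework for Mixture-of-Experts (MoE) models in video understanding, comprising TLGA, J-TLGA, and J-TLAT, which are used respectively to expose the independent weaknesses of the router, expose the collaborative weaknesses of routers and experts, and improve overall robustness.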
Details
Motivation: Existing adversarial attacks treat MoE as a unified architecture and overlook both the independent weaknesses of routers and expert modules and their collaborative weaknesses, so MoE robustness remains underexplored. Method: Proposes Temporal Lipschitz-Guided Attacks (TLGA) to attack the independent weaknesses of the router; designs Joint TLGA (J-TLGA) to perturb routers and experts jointly and expose collaborative weaknesses; and on this basis proposes Joint Temporal Lipschitz Adversarial Training (J-TLAT) to strengthen robustness through joint adversarial training. Result: The framework significantly improves the adversarial robustness of MoE models across multiple datasets and architectures, while inference cost is more than 60% lower than that of dense models, and it is plug-and-play. Conclusion: MoE models have independent and collaborative vulnerabilities that can be systematically exposed; component-level adversarial analysis and joint training can effectively improve their robustness, offering a new approach to the safe deployment of video MoE. Abstract: Mixture-of-Experts (MoE) has demonstrated strong performance in video understanding tasks, yet its adversarial robustness remains underexplored. Existing attack methods often treat MoE as a unified architecture, overlooking the independent and collaborative weaknesses of key components such as routers and expert modules. To fill this gap, we propose Temporal Lipschitz-Guided Attacks (TLGA) to thoroughly investigate component-level vulnerabilities in video MoE models. We first design attacks on the router, revealing its independent weaknesses. Building on this, we introduce Joint Temporal Lipschitz-Guided Attacks (J-TLGA), which collaboratively perturb both routers and experts. This joint attack significantly amplifies adversarial effects and exposes the Achilles' Heel (collaborative weaknesses) of the MoE architecture. Based on these insights, we further propose Joint Temporal Lipschitz Adversarial Training (J-TLAT). J-TLAT performs joint training to further defend against collaborative weaknesses, enhancing component-wise robustness. Our framework is plug-and-play and reduces inference cost by more than 60% compared with dense models. It consistently enhances adversarial robustness across diverse datasets and architectures, effectively mitigating both the independent and collaborative weaknesses of MoE.[341] PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles
Leonardo Brusini,Cristian Sbrolli,Eugenio Lomurno,Toshihiko Yamasaki,Matteo Matteucci
Main category: cs.CV
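TL;DR: This paper proposes PolyGen, a framework that improves the quality and feature diversity of synthetic data for vision-language pre-training through joint training across multiple generators and a Programmatic Hard Negative curriculum, significantly outperforming single-source methods on several benchmarks.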
Details
Motivation: Existing vision-language pre-training relies on scaling up a single generative model, which tends to introduce generator-specific spectral biases and limits feature diversity. Method: Adopts a Polylithic strategy that trains jointly on the intersection of several architecturally distinct generators to remove model-specific artifacts; designs a Programmatic Hard Negative curriculum to strengthen fine-grained syntactic understanding; and reallocates the data budget from unique single-source captions to multi-source variations. Result: Improves over SynthCLIP by +19.0% on aggregate multi-task benchmarks and by +9.1% on the SugarCrepe++ compositionality benchmark. Conclusion: Structural diversity scales more data-efficiently than simply increasing the volume of single-source samples and is a key route to better synthetic data. Abstract: Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits feature diversity. In this work, we introduce PolyGen, a framework that redefines synthetic data construction by prioritizing manifold coverage and compositional rigor over simple dataset size. PolyGen employs a Polylithic approach to train on the intersection of architecturally distinct generators, effectively marginalizing out model-specific artifacts. Additionally, we introduce a Programmatic Hard Negative curriculum that enforces fine-grained syntactic understanding. By structurally reallocating the same data budget from unique captions to multi-source variations, PolyGen achieves a more robust feature space, outperforming the leading single-source baseline (SynthCLIP) by +19.0% on aggregate multi-task benchmarks and by +9.1% on the SugarCrepe++ compositionality benchmark. These results demonstrate that structural diversity is a more data-efficient scaling law than simply increasing the volume of single-source samples.[342] PromptRL: Prompt Matters in RL for Flow-Based Image Generation
Fu-Yun Wang,Han Zhang,Michael Gharbi,Hongsheng Li,Taesung Park
Main category: cs.CV
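TL;DR: This paper proposes PromptRL, a framework that embeds a language model as a trainable prompt refinement agent in the flow-based RL optimization loop to address the sample inefficiency and prompt overfitting of current flow matching models in text-to-image generation, and reaches SOTA performance on multiple benchmarks.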
Details
Motivation: Current flow matching models suffer from sample inefficiency and prompt overfitting during RL post-training, which hurts generalization and real-world deployment. Method: Proposes PromptRL, which embeds a language model as a trainable prompt rewriting agent in the RL optimization pipeline of the flow matching model, so that prompt refinement and image generation are trained together. Result: Scores 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore; on image editing with FLUX.1-Kontext, EditReward rises from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image and rivaling ReasonNet, which requires fine-grained annotations and multi-stage training; compared with flow-only RL, performance is higher with over 2× fewer rollouts. Conclusion: By introducing an LM-driven prompt refinement mechanism, PromptRL markedly improves the sample efficiency and generalization of flow matching models under RL alignment, providing a more efficient and robust RL training paradigm for T2I and image editing. Abstract: Flow matching models (FMs) have revolutionized text-to-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present PromptRL (Prompt Matters in RL for Flow-Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow-based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state-of-the-art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. 
Furthermore, we validate the effectiveness of our RL approach on large-scale image editing models, improving the EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and achieving comparable performance with ReasonNet (1.44), which relied on fine-grained data annotations along with a complex multi-stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2$\times$ fewer rollouts compared to naive flow-only RL. Our code is available at https://github.com/G-U-N/UniRL.[343] Stronger Semantic Encoders Can Harm Relighting Performance: Probing Visual Priors via Augmented Latent Intrinsics
Xiaoyan Xing,Xiao Zhang,Sezer Karaoglu,Theo Gevers,Anand Bhattad
Main category: cs.CV
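TL;DR: This paper proposes Augmented Latent Intrinsics (ALI), which fuses pixel-aligned visual encoder features with latent intrinsic representations and adds a self-supervised refinement strategy, significantly improving image relighting despite the scarcity of paired real-world data, especially for specular materials such as metal and glass.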
Details
Motivation: Existing relighting methods based on latent intrinsic representations perform poorly on complex materials such as metal and glass, and relying on strong semantic encoders actually lowers photometric fidelity, revealing a trade-off between semantic abstraction and photometric fidelity. Method: Proposes the ALI framework, which fuses features from a pixel-aligned visual encoder into a latent-intrinsic framework and designs a self-supervised refinement strategy to mitigate the scarcity of paired real-world data. Result: Trained only on unlabeled real-world image pairs with a dense, pixel-aligned visual prior, ALI achieves strong improvements in relighting, with the largest gains on complex, specular materials. Conclusion: Semantically stronger encoders do not necessarily help relighting; balancing semantic context with dense photometric structure, as ALI does, is more effective; self-supervision and pixel-aligned priors are key to handling data scarcity and challenging materials. Abstract: Image-to-image relighting requires representations that disentangle scene properties from illumination. Recent methods rely on latent intrinsic representations but remain under-constrained and often fail on challenging materials such as metal and glass. A natural hypothesis is that stronger pretrained visual priors should resolve these failures. We find the opposite: features from top-performing semantic encoders often degrade relighting quality, revealing a fundamental trade-off between semantic abstraction and photometric fidelity. We study this trade-off and introduce Augmented Latent Intrinsics (ALI), which balances semantic context and dense photometric structure by fusing features from a pixel-aligned visual encoder into a latent-intrinsic framework, together with a self-supervised refinement strategy to mitigate the scarcity of paired real-world data. Trained only on unlabeled real-world image pairs and paired with a dense, pixel-aligned visual prior, ALI achieves strong improvements in relighting, with the largest gains on complex, specular materials. Project page: https://augmented-latent-intrinsics.github.io[344] Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas
Christoffer Koo Øhrstrøm,Rafael I. Cabral Muchacho,Yifei Dong,Filippos Moumtzidellis,Ronja Güldenring,Florian T. Pokorny,Lazaros Nalpantidis
Main category: cs.CV
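TL;DR: This paper proposes PaPE, a parabola-based position encoding designed for vision modalities that accounts for translation invariance, rotation invariance, distance decay, directionality, and context awareness; it performs strongly on 8 datasets spanning 4 modalities and clearly outperforms existing methods in ImageNet-1K extrapolation experiments.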
Details
Motivation: Most existing vision position encodings are extended directly from 1D sequences in language to nD structures and do not fully account for the characteristics of vision modalities, such as geometric structure and spatial relations, so a position encoding better suited to vision tasks is needed. Method: Proposes PaPE, a position encoding based on parabolic functions that supports several kinds of vision tokens (images, point clouds, videos, event streams); further introduces a rotation-invariant variant, PaPE-RI; the design incorporates five principles: translation invariance, rotation invariance, distance decay, directionality, and context awareness. Result: On 8 datasets covering 4 vision modalities, either PaPE or PaPE-RI achieves the best performance on 7; in ImageNet-1K extrapolation experiments, the absolute improvement over the next-best position encoding reaches up to 10.5%. Conclusion: PaPE is a principled position encoding well adapted to vision modalities that significantly improves vision Transformers in both interpolation and extrapolation settings and gives vision attention more robust spatial modeling. Abstract: We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (such as images, point clouds, videos, or event camera streams), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D-sequences in language to nD-structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities. We find that either PaPE or PaPE-RI achieves the top performance on 7 out of 8 datasets. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.[345] BioTamperNet: Affinity-Guided State-Space Model Detecting Tampered Biomedical Images
Soumyaroop Nandi,Prem Natarajan
Main category: cs.CV
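TL;DR: BioTamperNet is a new framework for tamper detection in biomedical images that uses affinity-guided attention inspired by State Space Models (SSMs) to identify duplicated regions and their source locations.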
Details
Motivation: Forensic models trained on natural images perform poorly on biomedical images, where subtle manipulations can seriously compromise experimental validity, so detection methods dedicated to this domain are needed. Method: Proposes affinity-guided self-attention and cross-attention modules combined with lightweight SSM-inspired linear attention for efficient, fine-grained localization; the model is trained end-to-end and localizes tampered regions and their source regions simultaneously. Result: Significantly outperforms competitive baselines on benchmark bio-forensic datasets in accurately detecting duplicated regions. Conclusion: BioTamperNet offers an efficient and accurate new approach to biomedical image tamper detection and highlights the importance of domain-adapted modeling for digital forensics. Abstract: We propose BioTamperNet, a novel framework for detecting duplicated regions in tampered biomedical images, leveraging affinity-guided attention inspired by State Space Model (SSM) approximations. Existing forensic models, primarily trained on natural images, often underperform on biomedical data where subtle manipulations can compromise experimental validity. To address this, BioTamperNet introduces an affinity-guided self-attention module to capture intra-image similarities and an affinity-guided cross-attention module to model cross-image correspondences. Our design integrates lightweight SSM-inspired linear attention mechanisms to enable efficient, fine-grained localization. Trained end-to-end, BioTamperNet simultaneously identifies tampered regions and their source counterparts. Extensive experiments on benchmark bio-forensic datasets demonstrate significant improvements over competitive baselines in accurately detecting duplicated regions. Code: https://github.com/SoumyaroopNandi/BioTamperNet[346] Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles
Penghao Deng,Jidong J. Yang,Jiachen Bian
Main category: cs.CV
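TL;DR: This paper studies semantic identification of driver visual attention in driving scenes, comparing three approaches: YOLOv13, SAM2 paired with EfficientNetV2 or YOLOv13, and the Qwen2.5-VL family of multimodal large models. YOLOv13 and Qwen2.5-VL-32b perform best, with Macro F1-Scores above 0.84, and Qwen2.5-VL-32b is the most robust on small objects such as traffic lights at night, whereas the segmentation-assisted approach has low recall because of a "part-versus-whole" semantic gap; the study reveals a fundamental trade-off between real-time efficiency and contextual robustness.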
Details
Motivation: Understanding how drivers distribute their visual attention while driving is critical for improving road safety and developing next-generation driver-assistance systems. Method: Models the association between driver gaze points and road-scene semantics as a semantic identification task and compares three vision-based approaches: (1) direct object detection (YOLOv13); (2) segmentation-assisted classification (SAM2 + EfficientNetV2 or YOLOv13); (3) query-based vision-language models (Qwen2.5-VL-7b and Qwen2.5-VL-32b). Result: YOLOv13 and Qwen2.5-VL-32b both exceed a Macro F1-Score of 0.84; Qwen2.5-VL-32b is the most robust at night and on small objects such as traffic lights; the segmentation-assisted approach loses substantial recall because of the "part-versus-whole" semantic gap. Conclusion: Traditional detectors offer real-time efficiency while large VLMs offer richer contextual understanding and robustness, a fundamental trade-off; this conclusion provides key guidance for designing human-aware intelligent driver monitoring systems. Abstract: Understanding where drivers direct their visual attention during driving, as characterized by gaze behavior, is critical for developing next-generation advanced driver-assistance systems and improving road safety. This paper tackles this challenge as a semantic identification task from the road scenes captured by a vehicle's front-view camera. Specifically, the collocation of gaze points with object semantics is investigated using three distinct vision-based approaches: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based Vision-Language Models, VLMs (Qwen2.5-VL-7b versus Qwen2.5-VL-32b). The results demonstrate that the direct object detection (YOLOv13) and Qwen2.5-VL-32b significantly outperform other approaches, achieving Macro F1-Scores over 0.84. The large VLM (Qwen2.5-VL-32b), in particular, exhibited superior robustness and performance for identifying small, safety-critical objects such as traffic lights, especially in adverse nighttime conditions. Conversely, the segmentation-assisted paradigm suffers from a "part-versus-whole" semantic gap that leads to a large loss of recall. The results reveal a fundamental trade-off between the real-time efficiency of traditional detectors and the richer contextual understanding and robustness offered by large VLMs. 
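As a rough illustration of the direct object detection paradigm and the Macro F1-Score mentioned in this abstract, the sketch below maps a gaze point to the label of a detected box and scores predictions per class. It is not the authors' pipeline: the function names `label_at_gaze` and `macro_f1`, the `(label, (x1, y1, x2, y2))` detection format, and the rule of picking the smallest box that contains the gaze point are all assumptions made for this example.

```python
def label_at_gaze(gaze, detections):
    """Return the label of the smallest box containing the gaze point, or None.

    gaze: (x, y) in pixels.
    detections: list of (label, (x1, y1, x2, y2)) pairs (hypothetical format).
    Picking the smallest containing box is an assumed tie-breaking rule.
    """
    gx, gy = gaze
    best_label, best_area = None, float("inf")
    for label, (x1, y1, x2, y2) in detections:
        if x1 <= gx <= x2 and y1 <= gy <= y2:
            area = (x2 - x1) * (y2 - y1)
            if area < best_area:
                best_label, best_area = label, area
    return best_label


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy scene: a small traffic light box nested inside a large car box.
detections = [("car", (0, 0, 100, 100)), ("traffic light", (40, 10, 50, 30))]
print(label_at_gaze((45, 20), detections))  # traffic light
print(label_at_gaze((80, 80), detections))  # car
```

Because the macro average weights every class equally, a rare class such as traffic lights counts as much as a frequent one, which is one reason small-object failures show up clearly in this metric.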
These findings provide critical insights and practical guidance for the design of future human-aware intelligent driver monitoring systems.[347] Understanding vision transformer robustness through the lens of out-of-distribution detection
Joey Kuang,Alexander Wong
Main category: cs.CV
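TL;DR: This paper studies the performance of vision transformers under low-bit quantization, particularly on out-of-distribution (OOD) detection. The results suggest that large-scale pretraining (e.g., on ImageNet-22k) can actually weaken the OOD-detection robustness of 4-bit quantized models, and that data augmentation may be a better alternative.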
Details
Motivation: Vision transformers perform well, but low-bit quantization often degrades them; prior work mostly focuses on in-distribution (ID) tasks, and this paper proposes to understand quantization behaviour more deeply through the attention mechanism's behaviour in OOD settings. Method: Quantizes small vision transformers such as DeiT, DeiT3, and ViT to 4 bits and evaluates their detection performance (e.g., AUPR-out) on multiple OOD datasets, comparing the effects of pretraining scale (ImageNet-1k vs. ImageNet-22k) and data augmentation. Result: Models pretrained on ImageNet-22k lose much more OOD detection performance after 4-bit quantization (19.2% for DeiT3 and 15.0% for ViT) than their ImageNet-1k counterparts (12.0% and 9.5%, respectively); on the ID task, DeiT3 drops sharply by 17% after quantization and becomes one of the weakest 4-bit models. Conclusion: Large-scale pretraining may harm the OOD-detection robustness of low-bit quantized models; stronger data augmentation may be more effective than scaling up pretraining data for improving quantization robustness. Abstract: Vision transformers have shown remarkable performance in vision tasks, but enabling them for accessible and real-time use is still challenging. Quantization reduces memory and inference costs at the risk of performance loss. Strides have been made to mitigate low precision issues mainly by understanding in-distribution (ID) task behaviour, but the attention mechanism may provide insight on quantization attributes by exploring out-of-distribution (OOD) situations. We investigate the behaviour of quantized small-variant popular vision transformers (DeiT, DeiT3, and ViT) on common OOD datasets. ID analyses show the initial instabilities of 4-bit models, particularly of those trained on the larger ImageNet-22k, as the strongest FP32 model, DeiT3, sharply drops 17% from quantization error to become one of the weakest 4-bit models. While ViT shows reasonable quantization robustness for ID calibration, OOD detection reveals more: ViT and DeiT3 pretrained on ImageNet-22k respectively experienced a 15.0% and 19.2% average quantization delta in AUPR-out between full precision and 4-bit, while their ImageNet-1k-only counterparts experienced a 9.5% and 12.0% delta. Overall, our results suggest pretraining on large scale datasets may hinder low-bit quantization robustness in OOD detection and that data augmentation may be a more beneficial option.[348] Preserving Localized Patch Semantics in VLMs
Parsa Esmaeilkhani,Longin Jan Latecki
Main category: cs.CV
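TL;DR: This paper proposes Logit Lens Loss (LLL) to help image tokens in vision-language models retain localized visual information, improving the interpretability of Logit Lens and performance on vision tasks such as segmentation, without architectural modification or large-scale training.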
Details
Motivation: In autoregressive vision-language models, Logit Lens breaks down because the visual information of image tokens diffuses to language tokens, making visualizations unreliable; the localized visual representation of image tokens needs to be preserved. Method: Introduces Logit Lens Loss (LLL), complementary to next-token prediction (NTP), which constrains image token embeddings to align semantically with the textual concepts of their regions (e.g., "cat"), limiting the mixing of image and text tokens in self-attention without architectural modification or large-scale training. Result: LLL markedly improves Logit Lens's ability to produce meaningful object confidence heatmaps and improves performance on vision-centric tasks such as segmentation, without extra decoding heads. Conclusion: LLL is a lightweight, plug-and-play loss that restores the interpretability of Logit Lens in VLMs and brings gains on downstream vision tasks. Abstract: Logit Lens has been proposed for visualizing tokens that contribute most to LLM answers. Recently, Logit Lens was also shown to be applicable in autoregressive Vision-Language Models (VLMs), where it illustrates the conceptual content of image tokens in the form of heatmaps, e.g., which image tokens are likely to depict the concept of cat in a given image. However, the visual content of image tokens often gets diffused to language tokens, and consequently, the locality of visual information gets mostly destroyed, which renders Logit Lens visualization unusable for explainability. To address this issue, we introduce a complementary loss to next-token prediction (NTP) to prevent the visual tokens from losing the visual representation inherited from corresponding image patches. The proposed Logit Lens Loss (LLL) is designed to make visual token embeddings more semantically aligned with the textual concepts that describe their image regions (e.g., patches containing a cat with the word "cat"), without requiring any architectural modification or large-scale training. This way, LLL constrains the mixing of image and text tokens in the self-attention layers in order to prevent image tokens from losing their localized visual information. As our experiments show, LLL not only makes Logit Lens practically relevant by producing meaningful object confidence maps in images, but also improves performance on vision-centric tasks like segmentation without attaching any special heads.[349] Rotation-free Online Handwritten Character Recognition Using Linear Recurrent Units
Zhe Ling,Sicheng Yu,Danyu Yang
Main category: cs.CV
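TL;DR: This paper proposes the SW-PS+LRU framework, which extracts rotation-invariant local features with sliding-window path signatures and classifies them with lightweight Linear Recurrent Units, maintaining high recognition accuracy under large rotations (±180°).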
Details
Motivation: Online handwritten character recognition has advantages, but in practice rotational deformations disrupt the spatial layout of strokes and reduce accuracy; extracting rotation-invariant features remains an open problem. Method: Uses the Sliding Window Path Signature (SW-PS) to extract local structural features of characters and introduces lightweight Linear Recurrent Units (LRU) as the classifier, combining the incremental processing of RNNs with the efficient parallel training of state space models (SSMs). Result: On three subsets of CASIA-OLHWDB1.1 (digits, English uppercase letters, and Chinese radicals), accuracies after ensemble learning reach 99.62%, 96.67%, and 94.33%, respectively; the model outperforms competing methods in both convergence speed and test accuracy. Conclusion: The SW-PS+LRU framework copes effectively with strong rotation, combining robustness, efficiency, and high accuracy, and offers a new approach to online handwriting recognition. Abstract: Online handwritten character recognition leverages stroke order and dynamic features, which generally provide higher accuracy and robustness compared with offline recognition. However, in practical applications, rotational deformations can disrupt the spatial layout of strokes, substantially reducing recognition accuracy. Extracting rotation-invariant features therefore remains a challenging open problem. In this work, we employ the Sliding Window Path Signature (SW-PS) to capture local structural features of characters, and introduce the lightweight Linear Recurrent Units (LRU) as the classifier. The LRU combines the fast incremental processing capability of recurrent neural networks (RNN) with the efficient parallel training of state space models (SSM), while reliably modelling dynamic stroke characteristics. We conducted recognition experiments with random rotation angles up to $\pm 180^{\circ}$ on three subsets of the CASIA-OLHWDB1.1 dataset: digits, English uppercase letters, and Chinese radicals. The accuracies achieved after ensemble learning were $99.62\%$, $96.67\%$, and $94.33\%$, respectively. Experimental results demonstrate that the proposed SW-PS+LRU framework consistently surpasses competing models in both convergence speed and test accuracy.[350] Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars
Youliang Zhang,Zhengguang Zhou,Zhentao Yu,Ziyao Huang,Teng Hu,Sen Liang,Guozhen Zhang,Ziqiao Peng,Shunkai Li,Yi Chen,Zixiang Zhou,Yuan Zhou,Qinglin Lu,Xiu Li
Main category: cs.CV
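TL;DR: This paper proposes the InteractAvatar framework for generating talking avatars that interact with objects; it decouples perception and planning from video synthesis, uses the PIM and AIM modules to produce text-aligned interaction motions and vivid videos, and establishes the GroundedInter benchmark for evaluation.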
Details
Motivation: Existing methods can only generate full-body talking avatars with simple human motion and are hard to extend to text-aligned grounded human-object interaction (GHOI), which requires environmental perception and precise control, facing insufficient perception and a control-quality dilemma. Method: Proposes the dual-stream framework InteractAvatar: (1) a Perception and Interaction Module (PIM) that uses detection to enhance environmental perception and generates text-aligned interaction motions; (2) an Audio-Interaction Aware Generation Module (AIM) that synthesizes vivid talking avatars performing interactions; (3) a motion-to-video aligner that lets PIM and AIM share a similar structure and co-generate motions and videos in parallel. Result: Experiments on the self-built GroundedInter benchmark validate the method, showing significantly better quality and control precision in GHOI video generation and supporting text-driven, physically plausible interactive talking avatars. Conclusion: By decoupling perception and planning from video synthesis, InteractAvatar effectively eases the control-quality dilemma, offers a new paradigm for grounded human-object interaction video generation, and advances standardized evaluation in this direction. Abstract: Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io[351] FSCA-Net: Feature-Separated Cross-Attention Network for Robust Multi-Dataset Training
Yuehai Chen
Main category: cs.CV
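TL;DR: This paper proposes FSCA-Net, which explicitly disentangles domain-invariant and domain-specific features and adds cross-attention fusion and mutual information optimization to mitigate negative transfer in joint multi-dataset training, achieving SOTA performance on cross-dataset crowd counting.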
Details
Motivation: CNN and Transformer models degrade when applied across environments because of severe domain discrepancies; direct joint training on multiple datasets causes negative transfer because shared and domain-specific representations become entangled. Method: Proposes the FSCA-Net framework: (1) explicitly separates features into domain-invariant and domain-specific components; (2) designs a cross-attention fusion module that adaptively models their interaction; (3) introduces a mutual information optimization objective that maximizes consistency among domain-invariant features and minimizes redundancy among domain-specific ones. Result: Experiments on multiple crowd counting benchmarks show that FSCA-Net effectively mitigates negative transfer and achieves SOTA cross-dataset generalization. Conclusion: FSCA-Net provides a robust and scalable solution for real-world crowd analysis and validates explicit feature disentanglement and mutual-information-driven learning for cross-domain crowd counting. Abstract: Crowd counting plays a vital role in public safety, traffic regulation, and smart city management. However, despite the impressive progress achieved by CNN- and Transformer-based models, their performance often deteriorates when applied across diverse environments due to severe domain discrepancies. Direct joint training on multiple datasets, which intuitively should enhance generalization, instead results in negative transfer, as shared and domain-specific representations become entangled. To address this challenge, we propose the Feature Separation and Cross-Attention Network (FSCA-Net), a unified framework that explicitly disentangles feature representations into domain-invariant and domain-specific components. A novel cross-attention fusion module adaptively models interactions between these components, ensuring effective knowledge transfer while preserving dataset-specific discriminability. Furthermore, a mutual information optimization objective is introduced to maximize consistency among domain-invariant features and minimize redundancy among domain-specific ones, promoting complementary shared-private representations. Extensive experiments on multiple crowd counting benchmarks demonstrate that FSCA-Net effectively mitigates negative transfer and achieves state-of-the-art cross-dataset generalization, providing a robust and scalable solution for real-world crowd analysis.[352] Toward Cognitive Supersensing in Multimodal Large Language Model
Boyi Li,Yifan Shen,Yuanzhe Liu,Yifan Xu,Jiateng Liu,Xinzhuo Li,Zhengyuan Li,Jingyuan Zhu,Yunhan Zhong,Fangzhou Lan,Jianguo Cao,James M. Rehg,Heng Ji,Ismini Lourentzou,Xu Cao
Main category: cs.CV
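TL;DR: This paper proposes the Cognitive Supersensing training paradigm, which adds a Latent Visual Imagery Prediction (LVIP) head and reinforcement learning grounded in visual latents to give multimodal large language models human-like visual imagery, markedly improving performance on cognitive visual question answering.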
Details
Motivation: Existing MLLMs are limited on complex cognitive tasks, especially understanding abstract visual details that depend on visual memory; they rely mainly on chain-of-thought reasoning in text space and neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery.
Method: Proposes the Cognitive Supersensing training paradigm: 1) a Latent Visual Imagery Prediction (LVIP) head jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, forming vision-based internal reasoning chains; 2) a reinforcement learning stage grounded in this visual latent optimizes text reasoning paths. The CogSense-Bench evaluation benchmark is also built.
Result: On the self-built CogSense-Bench (covering five cognitive dimensions) and on out-of-domain mathematics and science VQA benchmarks, MLLMs trained with Cognitive Supersensing significantly outperform SOTA baselines.
Conclusion: Internal visual imagery is key to bridging the gap between perceptual recognition and cognitive understanding, and Cognitive Supersensing offers a new route to improving MLLM cognitive abilities.
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery. To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths based on this grounded visual latent. To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions. Extensive experiments demonstrate that MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench and exhibit superior generalization on out-of-domain mathematics and science VQA benchmarks, suggesting that internal visual imagery is potentially key to bridging the gap between perceptual recognition and cognitive understanding. We will open-source CogSense-Bench and our model weights.
[353] Combined Flicker-banding and Moire Removal for Screen-Captured Images
Libo Zhu,Zihan Zhou,Zhiyi Zhou,Yiyang Qu,Weihang Zhang,Keyu Shi,Yifan Fu,Yulun Zhang
Main category: cs.CV
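TL;DR: This paper proposes CLEAR, a unified restoration framework and the first systematic study of jointly removing moiré patterns and flicker-banding from screen-captured images; with a new dataset, an ISP-based flicker simulation pipeline, a frequency-domain decomposition and re-composition module, and a trajectory alignment loss, it markedly improves image quality under compound degradation.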
Details
Motivation: Moiré patterns and flicker-banding are strongly coupled in screen-captured images, and existing single-degradation methods fail to generalize to such compound degradation.
Method: Proposes the unified restoration framework CLEAR; builds a large-scale dataset containing both artifacts; designs an ISP-based flicker simulation pipeline; introduces a frequency-domain decomposition and re-composition module together with a trajectory alignment loss.
Result: Consistently outperforms existing image restoration methods across multiple evaluation metrics, validating its effectiveness in complex real-world scenarios.
Conclusion: CLEAR provides the first systematic solution for jointly removing moiré and flicker artifacts, markedly improving the visual quality and usability of screen-captured images.
Abstract: Capturing display screens with mobile devices has become increasingly common, yet the resulting images often suffer from severe degradations caused by the coexistence of moiré patterns and flicker-banding, leading to significant visual quality degradation. Due to the strong coupling of these two artifacts in real imaging processes, existing methods designed for single degradations fail to generalize to such compound scenarios. In this paper, we present the first systematic study on joint removal of moiré patterns and flicker-banding in screen-captured images, and propose a unified restoration framework, named CLEAR. To support this task, we construct a large-scale dataset containing both moiré patterns and flicker-banding, and introduce an ISP-based flicker simulation pipeline to stabilize model training and expand the degradation distribution. Furthermore, we design a frequency-domain decomposition and re-composition module together with a trajectory alignment loss to enhance the modeling of compound artifacts. Extensive experiments demonstrate that the proposed method consistently outperforms existing image restoration approaches across multiple evaluation metrics, validating its effectiveness in complex real-world scenarios.
[354] Multimodal UNcommonsense: From Odd to Ordinary and Ordinary to Odd
Yejin Son,Saejin Kim,Dongjun Min,Younjae Yu
Main category: cs.CV
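TL;DR: This paper introduces the Multimodal UNcommonsense (MUN) benchmark for evaluating commonsense reasoning in scenarios that violate typical visual or contextual expectations, and designs a retrieval-based in-context learning (R-ICL) framework whose Multimodal Ensemble Retriever (MER) improves small-model reasoning in atypical scenarios by 8.3% on average.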
Details
Motivation: Commonsense reasoning in multimodal settings remains a foundational AI challenge; existing benchmarks focus on typical scenarios and cannot evaluate atypical, culturally diverse, and low-frequency real-world scenarios.
Method: Builds the MUN benchmark, which pairs images with unexpected language descriptions; proposes the R-ICL framework, which uses a Multimodal Ensemble Retriever (MER) to retrieve semantically relevant exemplars even when image and text are discordant, transferring reasoning ability from large models to small ones without training.
Result: R-ICL improves over baseline ICL methods on MUN by 8.3% on average, validating its effectiveness in low-frequency, atypical scenarios.
Conclusion: MUN opens new directions for evaluating and improving the robustness and adaptability of multimodal models in real-world, culturally diverse, non-prototypical scenarios, and R-ICL offers a feasible path to giving lightweight models strong commonsense reasoning.
Abstract: Commonsense reasoning in multimodal contexts remains a foundational challenge in artificial intelligence. We introduce Multimodal UNcommonsense (MUN), a benchmark designed to evaluate models' ability to handle scenarios that deviate from typical visual or contextual expectations. MUN pairs visual scenes with surprising or unlikely outcomes described in natural language, prompting models to either rationalize seemingly odd images using everyday logic or uncover unexpected interpretations in ordinary scenes. To support this task, we propose a retrieval-based in-context learning (R-ICL) framework that transfers reasoning capabilities from larger models to smaller ones without additional training. Leveraging a novel Multimodal Ensemble Retriever (MER), our method identifies semantically relevant exemplars even when image and text pairs are deliberately discordant. Experiments show an average improvement of 8.3% over baseline ICL methods, highlighting the effectiveness of R-ICL in low-frequency, atypical settings. MUN opens new directions for evaluating and improving visual-language models' robustness and adaptability in real-world, culturally diverse, and non-prototypical scenarios.
[355] One-Step Diffusion for Perceptual Image Compression
Yiwen Jia,Hao Wei,Yanhui Zhou,Chenyang Ge
Main category: cs.CV
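TL;DR: This paper proposes a single-step diffusion image compression method that introduces a discriminator operating on compact feature representations, making inference 46× faster while preserving good perceptual quality.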
Details
Motivation: Existing diffusion-based image compression methods need many denoising steps at decoding time, causing high inference latency and heavy computation that hinder practical deployment.
Method: Designs a single-step diffusion process to sharply cut inference steps; introduces a discriminator that operates on compact feature representations rather than raw pixels, to better model high-level texture and structure and improve the perceptual quality of reconstructed images.
Result: Inference is 46× faster while compression performance stays comparable to recent diffusion-based methods.
Conclusion: Single-step diffusion combined with a feature-level discriminator is an efficient, high-quality paradigm for image compression that balances practicality and perceptual quality.
Abstract: Diffusion-based image compression methods have achieved notable progress, delivering high perceptual quality at low bitrates. However, their practical deployment is hindered by significant inference latency and heavy computational overhead, primarily due to the large number of denoising steps required during decoding. To address this problem, we propose a diffusion-based image compression method that requires only a single-step diffusion process, significantly improving inference speed. To enhance the perceptual quality of reconstructed images, we introduce a discriminator that operates on compact feature representations instead of raw pixels, leveraging the fact that features better capture high-level texture and structural details. Experimental results show that our method delivers comparable compression performance while offering a 46$\times$ faster inference speed compared to recent diffusion-based approaches. The source code and models are available at https://github.com/cheesejiang/OSDiff.
[356] SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models
Haobo Wang,Weiqi Luo,Xiaojun Jia,Xiaochun Cao
Main category: cs.CV
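TL;DR: This paper proposes SGHA-Attack, a semantic-guided hierarchical alignment adversarial attack that uses multiple reference anchors and intermediate-layer cross-modal feature alignment to improve transfer attacks on black-box large vision-language models.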
Details
Motivation: Existing transfer-based targeted attacks rely on a single reference and overfit the final-layer embedding space of the surrogate model, ignoring intermediate semantics, so they transfer poorly across heterogeneous VLMs.
Method: Proposes the SGHA-Attack framework: 1) a frozen text-to-image model generates a visually grounded reference pool, from which the Top-K most semantically relevant anchors are selected to form a weighted mixture for guidance; 2) intermediate visual representations are aligned at global and spatial granularities across multiple depths; 3) intermediate visual and textual features are synchronized in a shared latent subspace to provide early cross-modal supervision.
Result: Experiments on open-source and commercial black-box VLMs show that SGHA-Attack clearly outperforms prior methods, with stronger targeted transferability and robustness to preprocessing and purification defenses.
Conclusion: Hierarchical semantic alignment, especially intermediate-layer cross-modal coordination, is key to improving adversarial transferability on VLMs; SGHA-Attack provides a more effective and robust attack paradigm for black-box VLM security evaluation.
Abstract: Large vision-language models (VLMs) are vulnerable to transfer-based adversarial perturbations, enabling attackers to optimize on surrogate models and manipulate black-box VLM outputs. Prior targeted transfer attacks often overfit the surrogate-specific embedding space by relying on a single reference and emphasizing final-layer alignment, which underutilizes intermediate semantics and degrades transfer across heterogeneous VLMs. To address this, we propose SGHA-Attack, a Semantic-Guided Hierarchical Alignment framework that adopts multiple target references and enforces intermediate-layer consistency. Concretely, we generate a visually grounded reference pool by sampling a frozen text-to-image model conditioned on the target prompt, and then carefully select the Top-K most semantically relevant anchors under the surrogate to form a weighted mixture for stable optimization guidance. Building on these anchors, SGHA-Attack injects target semantics throughout the feature hierarchy by aligning intermediate visual representations at both global and spatial granularities across multiple depths, and by synchronizing intermediate visual and textual features in a shared latent subspace to provide early cross-modal supervision before the final projection. Extensive experiments on open-source and commercial black-box VLMs show that SGHA-Attack achieves stronger targeted transferability than prior methods and remains robust under preprocessing and purification defenses.
[357] HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation
Wencan Cheng,Gim Hee Lee
Main category: cs.CV
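TL;DR: This paper proposes HandMCM, a method built on the Mamba state space model that combines local information injection/filtering and correspondence modeling modules with multi-modal image features, markedly improving 3D hand pose estimation accuracy under severe occlusion.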
Details
Motivation: 3D hand pose estimation is crucial for human-computer interaction applications such as augmented reality, but it is challenged by self-occlusion of the hands and occlusion caused by interaction with objects.
Method: Proposes HandMCM, based on the Mamba state space model, with a local information injection/filtering module and a correspondence modeling module, and fuses multi-modal image features to strengthen the representational capacity and robustness of the input.
Result: Experiments on three benchmark datasets show that the method significantly outperforms current state-of-the-art methods, particularly under severe occlusion.
Conclusion: HandMCM improves the accuracy and reliability of 3D hand pose estimation and shows broad potential for practical applications.
Abstract: 3D hand pose estimation that involves accurate estimation of 3D human hand keypoint locations is crucial for many human-computer interaction applications such as augmented reality. However, this task poses significant challenges due to self-occlusion of the hands and occlusions caused by interactions with objects. In this paper, we propose HandMCM to address these challenges. Our HandMCM is a novel method based on the powerful state space model (Mamba). By incorporating modules for local information injection/filtering and correspondence modeling, the proposed correspondence Mamba effectively learns the highly dynamic kinematic topology of keypoints across various occlusion scenarios. Moreover, by integrating multi-modal image features, we enhance the robustness and representational capacity of the input, leading to more accurate hand pose estimation. Empirical evaluations on three benchmark datasets demonstrate that our model significantly outperforms current state-of-the-art methods, particularly in challenging scenarios involving severe occlusions. These results highlight the potential of our approach to advance the accuracy and reliability of 3D hand pose estimation in practical applications.
[358] Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages
Zhixiong Yue,Zixuan Ni,Feiyang Ye,Jinshan Zhang,Sheng Shen,Zhenpeng Mi
Main category: cs.CV
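TL;DR: This paper proposes the TAFS GRPO framework, which uses temperature annealing and Group Relative Policy Optimization to improve the human-preference alignment of flow matching text-to-image models under few-step sampling.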
Details
Motivation: Existing reinforcement-learning-based flow matching models depend on many denoising steps, and their reward signals are sparse and imprecise, which leads to poor alignment.
Method: Proposes Temperature Annealed Few step Sampling with Group Relative Policy Optimization (TAFS GRPO): adaptive temporal noise is iteratively injected into one-step samples to introduce stochasticity while preserving semantic integrity; combined with GRPO, a step-aware advantage integration removes the need for a differentiable reward function and provides dense, step-specific rewards.
Result: Performs strongly on few-step text-to-image generation and significantly improves the alignment of generated images with human preferences.
Conclusion: TAFS GRPO addresses the key limitations of existing RL flow matching models in few-step sampling and reward sparsity, offering a new paradigm for efficient, high-fidelity preference-aligned generation.
Abstract: Recent advances in flow matching models, particularly with reinforcement learning (RL), have significantly enhanced human preference alignment in few-step text-to-image generators. However, existing RL-based approaches for flow matching models typically rely on numerous denoising steps, while suffering from sparse and imprecise reward signals that often lead to suboptimal alignment. To address these limitations, we propose Temperature Annealed Few step Sampling with Group Relative Policy Optimization (TAFS GRPO), a novel framework for training flow matching text-to-image models into efficient few-step generators well aligned with human preferences. Our method iteratively injects adaptive temporal noise onto the results of one-step samples. By repeatedly annealing the model's sampled outputs, it introduces stochasticity into the sampling process while preserving the semantic integrity of each generated image. Moreover, its step-aware advantage integration mechanism incorporates GRPO to avoid the need for a differentiable reward function and to provide dense, step-specific rewards for stable policy optimization. Extensive experiments demonstrate that TAFS GRPO achieves strong performance in few-step text-to-image generation and significantly improves the alignment of generated images with human preferences. The code and models of this work will be made available to facilitate further research.
[359] Samba+: General and Accurate Salient Object Detection via A More Unified Mamba-based Framework
Wenzhuo Zhao,Keren Fu,Jiahao He,Xiaohong Liu,Qijun Zhao,Guangtao Zhai
Main category: cs.CV
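TL;DR: This paper proposes Saliency Mamba (Samba) and its enhanced version Samba+, a pure Mamba architecture based on the Mamba state space model for multiple salient object detection (SOD) tasks; it balances global receptive fields with computational efficiency and uses multi-task joint training to obtain a unified, general model.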
Details
Motivation: Existing SOD models are constrained by the limited receptive fields of CNNs and the high computational complexity of Transformers; a new architecture is needed that balances global modeling with efficiency and that handles multi-modal inputs, task specificity, and the modality conflicts and catastrophic forgetting that arise in continual learning.
Method: Proposes Samba, with a saliency-guided Mamba block (SGMB) that features a spatial neighborhood scanning (SNS) algorithm, and context-aware upsampling (CAU); further proposes Samba+, which is trained jointly across tasks and adds a hub-and-spoke graph attention (HGA) module and a modality-anchored continual learning (MACL) strategy.
Result: Samba outperforms existing methods on six SOD tasks across 22 datasets at lower computational cost; Samba+ achieves better results on all tasks and datasets with a single model; the framework is also shown to be effective in multi-modal and continual-adaptation settings.
Conclusion: The Mamba state space model is well suited as a base architecture for SOD; Samba and Samba+ form an efficient, general, and extensible saliency detection framework and offer a new paradigm for vision foundation models oriented toward multi-modal and continual learning.
Abstract: Existing salient object detection (SOD) models are generally constrained by the limited receptive fields of convolutional neural networks (CNNs) and quadratic computational complexity of Transformers. Recently, the emerging state-space model, namely Mamba, has shown great potential in balancing global receptive fields and computational efficiency. As a solution, we propose Saliency Mamba (Samba), a pure Mamba-based architecture that flexibly handles various distinct SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), RGB-D VSOD, and visible-depth-thermal SOD. Specifically, we rethink the scanning strategy of Mamba for SOD, and introduce a saliency-guided Mamba block (SGMB) that features a spatial neighborhood scanning (SNS) algorithm to preserve the spatial continuity of salient regions. A context-aware upsampling (CAU) method is also proposed to promote hierarchical feature alignment and aggregation by modeling contextual dependencies. As one step further, to avoid the "task-specific" problem as in previous SOD solutions, we develop Samba+, which is empowered by training Samba in a multi-task joint manner, leading to a more unified and versatile model. Two crucial components that collaboratively tackle challenges encountered in input of arbitrary modalities and continual adaptation are investigated. Specifically, a hub-and-spoke graph attention (HGA) module facilitates adaptive cross-modal interactive fusion, and a modality-anchored continual learning (MACL) strategy alleviates inter-modal conflicts together with catastrophic forgetting. Extensive experiments demonstrate that Samba individually outperforms existing methods across six SOD tasks on 22 datasets with lower computational cost, whereas Samba+ achieves even superior results on these tasks and datasets by using a single trained versatile model. Additional results further demonstrate the potential of our Samba framework.
[360] UV-M3TL: A Unified and Versatile Multimodal Multi-Task Learning Framework for Assistive Driving Perception
Wenzhuo Liu,Qiannan Guo,Zhen Wang,Wenshuo Wang,Lei Yang,Yicheng Qiao,Lening Wang,Zhiwei Li,Chen Lv,Shanghang Zhang,Junqiang Xi,Huaping Liu
Main category: cs.CV
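TL;DR: This paper proposes a unified and versatile multimodal multi-task learning framework (UV-M3TL) for jointly understanding driver behavior, driver emotion, vehicle behavior, and traffic context; a dual-branch spatial channel multimodal embedding (DB-SCME) and an adaptive feature-decoupled loss (AFD-Loss) mitigate inter-task negative transfer, reaching SOTA performance on AIDE and several public datasets.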
Details
Motivation: ADAS must understand heterogeneous driver behavior and navigation context at the same time, but joint learning easily causes inter-task negative transfer and hurts system performance.
Method: Proposes the UV-M3TL framework, which includes a dual-branch spatial channel multimodal embedding (DB-SCME) that explicitly models task-shared and task-specific features, and an adaptive feature-decoupled multi-task loss (AFD-Loss) based on learning dynamics and feature decoupling constraints.
Result: Reaches SOTA on all four tasks of the AIDE dataset; also performs strongly on the multi-task benchmarks BDD100K, CityScapes, NYUD-v2, and PASCAL-Context, with SOTA on most tasks.
Conclusion: UV-M3TL mitigates multi-task negative transfer, combines high performance with strong generalization, and offers a general, extensible multi-task learning paradigm for multimodal ADAS perception.
Abstract: Advanced Driver Assistance Systems (ADAS) need to understand human driver behavior while perceiving their navigation context, but jointly learning these heterogeneous tasks would cause inter-task negative transfer and impair system performance. Here, we propose a Unified and Versatile Multimodal Multi-Task Learning (UV-M3TL) framework to simultaneously recognize driver behavior, driver emotion, vehicle behavior, and traffic context, while mitigating inter-task negative transfer. Our framework incorporates two core components: dual-branch spatial channel multimodal embedding (DB-SCME) and adaptive feature-decoupled multi-task loss (AFD-Loss). DB-SCME enhances cross-task knowledge transfer while mitigating task conflicts by employing a dual-branch structure to explicitly model salient task-shared and task-specific features. AFD-Loss improves the stability of joint optimization while guiding the model to learn diverse multi-task representations by introducing an adaptive weighting mechanism based on learning dynamics and feature decoupling constraints. We evaluate our method on the AIDE dataset, and the experimental results demonstrate that UV-M3TL achieves state-of-the-art performance across all four tasks. To further prove the versatility, we evaluate UV-M3TL on additional public multi-task perception benchmarks (BDD100K, CityScapes, NYUD-v2, and PASCAL-Context), where it consistently delivers strong performance across diverse task combinations, attaining state-of-the-art results on most tasks.
[361] Rethinking Genomic Modeling Through Optical Character Recognition
Hongxin Xiang,Pengsen Ma,Yunkang Cao,Di Yu,Haowen Chen,Xinyu Yang,Xiangxiang Zeng
Main category: cs.CV
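TL;DR: OpticalDNA is a new vision-based framework for genomic modeling that renders DNA sequences into structured visual layouts and uses an OCR-style vision-language model for efficient, high-fidelity compression and understanding, substantially reducing computation and improving long-sequence modeling.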
Details
Motivation: Existing genomic foundation models treat DNA as a one-dimensional text sequence, which wastes computation on low-information background and makes understanding-driven compression of long contexts difficult.
Method: Proposes the OpticalDNA framework: DNA is converted into visual layouts and processed by a visual DNA encoder and a document decoder; the encoder produces compact, reconstructible visual tokens; prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion) are used to learn layout-aware representations.
Result: Consistently outperforms recent baselines across diverse genomic benchmarks; on sequences up to 450k bases it achieves the best overall performance with nearly 20× fewer effective tokens, and with only 256k trainable parameters it surpasses models with up to 985× more activated parameters.
Conclusion: Visual DNA representations and OCR-style modeling fit the sparse, discontinuous semantic structure of genomes better, enabling efficient, interpretable, high-performing long-sequence genomic understanding.
Abstract: Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision-language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly $20\times$ fewer effective tokens, and surpasses models with up to $985\times$ more activated parameters while tuning only 256k trainable parameters.
[362] Token Pruning for In-Context Generation in Diffusion Transformers
Junqing Lin,Xingyu Zheng,Pei Cheng,Bin Fu,Jingwei Sun,Guangzhong Sun
Main category: cs.CV
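TL;DR: This paper proposes ToPi, a training-free token pruning framework that addresses the long sequences and computational bottleneck caused by input concatenation in in-context generation with Diffusion Transformers (DiTs). ToPi identifies pivotal attention layers through offline-calibrated sensitivity analysis, then uses a new influence metric and a temporal update strategy to selectively prune reference-context tokens, speeding up inference by more than 30% while maintaining image quality.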
Details
Motivation: Existing token reduction methods mainly target text-to-image generation and apply uniform reduction strategies, overlooking the role asymmetry between reference contexts and target latents across spatial, temporal, and functional dimensions in in-context image generation, so they do little to relieve the computational bottleneck.
Method: ToPi is a training-free token pruning framework: it first identifies pivotal attention layers through offline calibration-driven sensitivity analysis; it then defines a new influence metric on these layers to quantify each reference-context token's contribution to generation; finally it prunes selectively with a temporal update strategy that adapts to the evolving diffusion process.
Result: ToPi achieves over 30% inference speedup on multiple complex image generation tasks while maintaining structural fidelity and visual consistency.
Conclusion: ToPi addresses the long-sequence computational bottleneck of in-context generation in DiTs; its pruning paradigm, based on sensitivity analysis and dynamic influence estimation, offers a new approach to efficient controllable image generation and, needing no additional training, is practical and generalizes well.
Abstract: In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples. However, the resulting input concatenation drastically increases sequence length, creating a substantial computational bottleneck. Existing token reduction techniques, primarily tailored for text-to-image synthesis, fall short in this paradigm as they apply uniform reduction strategies, overlooking the inherent role asymmetry between reference contexts and target latents across spatial, temporal, and functional dimensions. To bridge this gap, we introduce ToPi, a training-free token pruning framework tailored for in-context generation in DiTs. Specifically, ToPi utilizes offline calibration-driven sensitivity analysis to identify pivotal attention layers, serving as a robust proxy for redundancy estimation. Leveraging these layers, we derive a novel influence metric to quantify the contribution of each context token for selective pruning, coupled with a temporal update strategy that adapts to the evolving diffusion trajectory. Empirical evaluations demonstrate that ToPi can achieve over 30% speedup in inference while maintaining structural fidelity and visual consistency across complex image generation tasks.
[363] Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Yu Zeng,Wenxuan Huang,Zhen Fang,Shuang Chen,Yufan Shen,Yishuo Cai,Xiaoman Wang,Zhenfei Yin,Lin Chen,Zehui Chen,Shiting Huang,Yiming Zhao,Yao Hu,Philip Torr,Wanli Ouyang,Shaosheng Cao
Main category: cs.CV
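TL;DR: This paper introduces the Vision-DeepResearch benchmark (VDR-Bench) for more realistic evaluation of multimodal large models on joint visual-textual search, and proposes a multi-round cropped-search strategy to improve visual retrieval.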
Details
Motivation: Existing benchmarks have two major flaws: they are not visual-search-centric, so answers leak through textual cues or the model's prior knowledge; and their evaluation scenarios are overly idealized, with both image search and text search being too easy.
Method: Builds VDR-Bench, a benchmark of 2,000 VQA instances created through multi-stage manual curation and expert review; proposes a multi-round cropped-search workflow to strengthen visual retrieval.
Result: VDR-Bench reflects the behavior of Vision-DeepResearch systems under realistic conditions more faithfully; the multi-round cropped-search strategy markedly improves model performance on realistic visual retrieval tasks.
Conclusion: VDR-Bench provides a more reliable benchmark for evaluating multimodal deep-research systems, and the proposed method is a practical way to improve visual retrieval that can guide future system design.
Abstract: Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge in current MLLMs. Second, the evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench) comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, designed to assess the behavior of Vision-DeepResearch systems under realistic real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow. This strategy is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
[364] Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?
Susan Liang,Chao Huang,Filippos Bellos,Yolo Yunlong Tang,Qianxiang Shen,Jing Bi,Luchuan Song,Zeliang Zhang,Jason Corso,Chenliang Xu
Main category: cs.CV
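TL;DR: This paper introduces Omni-Judge, a study of whether omni-modal large language models (omni-LLMs) can serve as quality evaluators for text-conditioned audio-video generation; it finds that they perform well and interpretably on semantic alignment metrics but are limited by temporal resolution on high-frame-rate perceptual metrics.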
Details
Motivation: Existing automatic metrics (such as FVD, CLAP, and ViCLIP) cover only pairs of modalities, struggle with complex prompts, and lack interpretability; human evaluation is costly and hard to scale; omni-modal large models natively understand and reason over three modalities and may serve as a unified, interpretable evaluator.
Method: Builds the Omni-Judge framework, which uses omni-LLMs to score text-audio-video generations on nine perceptual and alignment metrics, compares their correlation against traditional metrics and human ratings, and analyzes their explanatory outputs.
Result: Omni-Judge matches or exceeds the correlation of traditional metrics on semantic tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence, but is weaker on high-FPS perceptual metrics such as video quality and audio-video synchronization; it produces interpretable feedback that exposes semantic or physical inconsistencies.
Conclusion: Omni-LLMs have the potential to become human-aligned, unified, interpretable evaluators for multi-modal generation, but they are currently limited by temporal modeling and need better fine-grained dynamic understanding.
Abstract: State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Driven by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation comparable to traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement.
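The agreement between a judge's scores and human ratings mentioned above can be measured with a rank correlation. The sketch below is a minimal illustration only: the abstract does not say which coefficient Omni-Judge uses, so Spearman's rho is an assumption, and the function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(judge_scores, human_scores):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))
```

Calling `spearman(judge_scores, human_scores)` on per-video scores for one metric yields a value in [-1, 1]; a value near 1 means the judge orders the videos the way human raters do.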
Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.
[365] PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards
Minh-Quan Le,Gaurav Mittal,Cheng Zhao,David Gu,Dimitris Samaras,Mei Chen
Main category: cs.CV
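TL;DR: This paper proposes PISCES, an annotation-free post-training algorithm for text-to-video generation whose Dual Optimal Transport (OT)-aligned Rewards module aligns text and video embeddings at the distributional and discrete token levels, improving the quality and semantic consistency of generated videos.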
Details
Motivation: Existing reward-based post-training methods depend on large-scale human preference annotations or use misaligned embeddings from pre-trained multimodal models, leading to poor scalability or suboptimal supervision.
Method: Proposes the PISCES algorithm with a Dual Optimal Transport (OT)-aligned Rewards module: a distributional OT-aligned quality reward (capturing overall visual quality and temporal coherence) and a discrete token-level OT-aligned semantic reward (enforcing spatio-temporal semantic correspondence between text and video tokens).
Result: On VBench short- and long-video generation, PISCES surpasses both annotation-based and annotation-free methods on Quality and Semantic scores; human preference studies further confirm its effectiveness; the module is compatible with direct backpropagation and reinforcement learning fine-tuning.
Conclusion: PISCES is the first work to bring optimal transport into annotation-free generative post-training to improve reward supervision, markedly improving the quality and semantic alignment of T2V generation.
Abstract: Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present PISCES, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, PISCES uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic, spatio-temporal correspondence between text and video tokens. To our knowledge, PISCES is the first to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that PISCES outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies further validating its effectiveness.
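To make the token-level alignment idea concrete, the following is a minimal sketch of entropy-regularized optimal transport (Sinkhorn iterations) between a few text-token and video-token embeddings. It is not the PISCES implementation: the cosine cost, the uniform marginals, the regularization strength `eps`, and all function names are assumptions chosen for illustration.

```python
import math

def cosine_cost(a, b):
    """Cost between two non-zero vectors: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def sinkhorn_plan(text_tokens, video_tokens, eps=0.1, iters=200):
    """Return (transport plan, cost matrix) under uniform marginals (assumed)."""
    n, m = len(text_tokens), len(video_tokens)
    cost = [[cosine_cost(t, v) for v in video_tokens] for t in text_tokens]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    r = [1.0 / n] * n  # mass on each text token
    c = [1.0 / m] * m  # mass on each video token
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [r[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return plan, cost

def ot_alignment_score(text_tokens, video_tokens):
    """Higher when text and video tokens can be matched cheaply."""
    plan, cost = sinkhorn_plan(text_tokens, video_tokens)
    transport_cost = sum(
        plan[i][j] * cost[i][j]
        for i in range(len(plan))
        for j in range(len(plan[0]))
    )
    return 1.0 - transport_cost
```

A score like `ot_alignment_score` could act as a semantic reward signal in the sense described above: it rises when every text token finds a nearby video token. How PISCES actually turns the transport plan into a reward is not specified in the abstract.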
We show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.
[366] Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks
Bohan Zeng,Kaixin Zhu,Daili Hua,Bozhou Li,Chengzhuo Tong,Yuran Wang,Xinyi Huang,Yifan Dai,Zixiang Zhang,Yifan Yang,Zhou Liu,Hao Liang,Xiaochen Ma,Ruichuan An,Tianyi Bai,Hongcheng Gao,Junbo Niu,Yang Shi,Xinlong Chen,Yue Ding,Minglei Shi,Kai Zeng,Yiwen Tang,Yuanxing Zhang,Pengfei Wan,Xintao Wang,Wentao Zhang
Main category: cs.CV
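TL;DR: This paper analyzes the fragmentation of current world model research and proposes a unified design specification, arguing that a world model should integrate interaction, perception, symbolic reasoning, and spatial representation to achieve more general, robust, and principled world modeling.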
Details
Motivation: Current world model research is too scattered, concentrating on isolated tasks without a unified definition or systematic framework, which makes holistic world understanding hard to support.
Method: Analyzes the limitations of existing fragmented approaches and proposes a unified design specification for world models, framing a world model as a normative framework that integrates interaction, perception, symbolic reasoning, and spatial representation.
Result: Presents a unified design specification for world models that sets out their core components and systematic requirements.
Conclusion: A world model should not be a loose collection of capabilities but a normative framework with internal consistency and structure, providing direction for future research.
Abstract: World models have emerged as a critical frontier in AI research, aiming to enhance large models by infusing them with physical dynamics and world knowledge. The core objective is to enable agents to understand, predict, and interact with complex environments. However, the current research landscape remains fragmented, with approaches predominantly focused on injecting world knowledge into isolated tasks, such as visual prediction, 3D estimation, or symbol grounding, rather than establishing a unified definition or framework. While these task-specific integrations yield performance gains, they often lack the systematic coherence required for holistic world understanding. In this paper, we analyze the limitations of such fragmented approaches and propose a unified design specification for world models. We suggest that a robust world model should not be a loose collection of capabilities but a normative framework that integrally incorporates interaction, perception, symbolic reasoning, and spatial representation. This work aims to provide a structured perspective to guide future research toward more general, robust, and principled models of the world.
[367] Federated Vision Transformer with Adaptive Focal Loss for Medical Image Classification
Xinyuan Zhao,Yihang Wu,Ahmad Chaddad,Tareef Daqqaq,Reem Kateb
Main category: cs.CV
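TL;DR: This paper proposes a federated learning framework that combines a dynamic adaptive focal loss (DAFL) with a client-aware aggregation strategy to handle data heterogeneity and class imbalance in settings such as medical imaging, markedly improving classification performance on several benchmark datasets.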
Details
Motivation: Data privacy regulations make sensitive data such as medical images hard to collect centrally, and conventional federated learning generalizes poorly when client data are heterogeneous and class-imbalanced.
Method: Proposes a dynamic adaptive focal loss (DAFL) with a dynamic class-imbalance coefficient based on each client's sample distribution and class distribution; designs a client-aware weighted aggregation strategy that adapts model aggregation weights to data size and characteristics.
Result: On the public ISIC, Ocular Disease, and RSNA-ICH datasets, the framework outperforms DenseNet121, ResNet50, the ViT family, FedCLIP, Swin Transformer, CoAtNet, and MixNet in most cases, with accuracy gains of 0.98% to 41.69%; ablation studies confirm the effectiveness of DAFL and the aggregation strategy.
Conclusion: The proposed federated learning framework alleviates the challenges of local data heterogeneity and class imbalance, improving the robustness and accuracy of the global model for medical image classification.
Abstract: While deep learning models like Vision Transformer (ViT) have achieved significant advances, they typically require large datasets. With data privacy regulations, access to many original datasets is restricted, especially medical images. Federated learning (FL) addresses this challenge by enabling global model aggregation without data exchange. However, the heterogeneity of the data and the class imbalance that exist in local clients pose challenges for the generalization of the model. This study proposes an FL framework leveraging a dynamic adaptive focal loss (DAFL) and a client-aware aggregation strategy for local training. Specifically, we design a dynamic class imbalance coefficient that adjusts based on each client's sample distribution and class data distribution, ensuring minority classes receive sufficient attention and preventing sparse data from being ignored. To address client heterogeneity, a weighted aggregation strategy is adopted, which adapts to data size and characteristics to better capture inter-client variations. The classification results on three public datasets (ISIC, Ocular Disease and RSNA-ICH) show that the proposed framework outperforms DenseNet121, ResNet50, ViT-S/16, ViT-L/32, FedCLIP, Swin Transformer, CoAtNet, and MixNet in most cases, with accuracy improvements ranging from 0.98% to 41.69%. Ablation studies on the imbalanced ISIC dataset validate the effectiveness of the proposed loss function and aggregation strategy compared to traditional loss functions and other FL approaches.
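The two ingredients described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's DAFL: the abstract does not give the formula for the dynamic class-imbalance coefficient, so inverse class frequency is used as a stand-in, the loss is the standard focal loss, and aggregation is weighted by sample count only (the paper's strategy also adapts to data characteristics). All function names are hypothetical.

```python
import math

def class_weights(class_counts):
    """Per-class weights from one client's label counts (assumed form).

    Inverse frequency scaled so that a balanced client gets weight 1.0
    for every class; classes absent on the client get weight 0.0.
    """
    total = sum(class_counts)
    k = len(class_counts)
    return [total / (k * n) if n > 0 else 0.0 for n in class_counts]

def focal_loss(probs, label, weights, gamma=2.0):
    """Standard focal loss for one sample, scaled by a per-class weight."""
    p_t = max(probs[label], 1e-12)  # guard against log(0)
    return -weights[label] * (1.0 - p_t) ** gamma * math.log(p_t)

def aggregate(client_params, client_sizes):
    """Average flat parameter vectors, weighting each client by its size."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[d] * n for p, n in zip(client_params, client_sizes)) / total
        for d in range(dim)
    ]
```

For a client with 90 samples of one class and 10 of another, `class_weights([90, 10])` gives the minority class a weight nine times larger, so its errors contribute more to the local loss. A dynamic variant would recompute these weights as training proceeds.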
The codes can be found at: https://github.com/AIPMLab/ViT-FLDAF.
[368] ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
Tianyu Yang,ChenWei He,Xiangzhao Hao,Tianyue Wang,Jiarui Guo,Haiyun Guo,Leigang Qu,Jinqiao Wang,Tat-Seng Chua
Main category: cs.CV
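TL;DR: This paper proposes ReCALL, a framework that uses a diagnose-generate-refine pipeline to address the degradation of fine-grained reasoning that occurs when a generative multimodal large language model (MLLM) is adapted into a discriminative retriever, achieving state-of-the-art performance on composed image retrieval.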
Details
Motivation: Existing methods that adapt a generative MLLM into a discriminative retriever suffer from a paradigm conflict that degrades the model's native fine-grained reasoning (Capability Degradation), leaving it ill-suited to the cross-modal compositional reasoning required by composed image retrieval (CIR).
Method: The authors propose the ReCALL framework: (1) diagnose the retriever's cognitive blind spots via self-guided informative instance mining; (2) use chain-of-thought prompting of the foundation MLLM to generate corrective instructions and triplets, with VQA-based consistency filtering for quality control; (3) continue training the retriever with a grouped contrastive scheme so that its embedding space is realigned with the MLLM's intrinsic compositional reasoning.
Result: Experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities, substantially improves retrieval performance, and reaches the state of the art.
Conclusion: ReCALL is a general, model-agnostic framework that mitigates capability degradation when converting generative MLLMs into discriminative retrievers, offering a new approach to combining multimodal retrieval with large models.
Abstract: Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with the cross-modality compositional reasoning required for this task. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation - the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose-generate-refine pipeline: Firstly, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by CoT prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the discriminative embedding space of the retriever with the intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance.
Code will be released soon.
[369] Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning
Yinchao Ma,Qiang Zhou,Zhibin Wang,Xianing Chen,Hanqing Yang,Jun Song,Bo Zheng
Main category: cs.CV
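TL;DR: This paper proposes CaCoVID, an algorithm that optimizes the video token selection policy with reinforcement learning to improve the efficiency and effectiveness of video understanding.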
Details
Motivation: Redundant video tokens make inference with video large language models computationally expensive, and the relationship between attention scores, which existing compression methods rely on, and the actual contribution to predictions is unclear.
Method: The authors propose CaCoVID, a contribution-aware video token compression algorithm comprising a reinforcement learning-based framework for optimizing the token selection policy and a combinatorial policy optimization algorithm with online combination space sampling.
Result: CaCoVID is validated on multiple video understanding benchmarks, substantially reducing computational overhead while maintaining or improving performance.
Conclusion: By explicitly modeling each token's contribution to correct predictions, CaCoVID achieves more efficient and more accurate video token compression and offers a new approach to deploying video large language models.
Abstract: Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference, limiting their practical deployment. Many compression algorithms are proposed to prioritize retaining features with the highest attention scores to minimize perturbations in attention computations. However, the correlation between attention scores and their actual contribution to correct answers remains ambiguous. To address the above limitation, we propose a novel Contribution-aware token Compression algorithm for VIDeo understanding (CaCoVID) that explicitly optimizes the token selection policy based on the contribution of tokens to correct predictions. First, we introduce a reinforcement learning-based framework that optimizes a policy network to select video token combinations with the greatest contribution to correct predictions. This paradigm shifts the focus from passive token preservation to active discovery of optimal compressed token combinations. Second, we propose a combinatorial policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations and accelerates the convergence speed of policy optimization. Extensive experiments on diverse video understanding benchmarks demonstrate the effectiveness of CaCoVID. Codes will be released.
[370] From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction
Xingyu Miao,Junting Dong,Qin Zhao,Yuhang Yang,Junhao Chen,Yang Long
Main category: cs.CV
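TL;DR: This paper proposes a human-centric dense prediction method for video sequences. It builds a scalable synthetic data pipeline that generates motion-aligned human videos with geometric annotations (depth, normals, masks) and designs a unified ViT-based dense predictor with an explicit human geometric prior (CSE embeddings) and a lightweight channel reweighting module; two-stage training with static pretraining followed by dynamic sequence supervision markedly improves temporal consistency and generalization.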
Details
Motivation: Existing models achieve good per-frame accuracy but flicker under motion, occlusion, and lighting changes, and paired human video supervision for multiple dense tasks is lacking.
Method: The authors build a synthetic data pipeline that generates photorealistic, motion-aligned video sequences with pixel-level geometric annotations (depth, normals, masks); design a unified ViT-based dense predictor that uses CSE embeddings as a human geometric prior and adds a lightweight channel reweighting module after feature fusion to improve the reliability of geometric features; and adopt a two-stage training strategy in which static pretraining learns spatial representations and dynamic sequence supervision then refines temporal consistency.
Result: The method achieves state-of-the-art performance on THuman2.1 and Hi4D and generalizes well to in-the-wild videos.
Conclusion: Geometry-aware modeling driven by synthetic data, combined with two-stage training, effectively improves the temporal consistency and cross-domain generalization of human-centric dense prediction.
Abstract: In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and they rarely have paired human video supervision for multiple dense tasks. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior static synthetic data pipelines, our pipeline provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences. Extensive experiments show that we achieve state-of-the-art performance on THuman2.1 and Hi4D and generalize effectively to in-the-wild videos.
[371] Moonworks Lunara Aesthetic II: An Image Variation Dataset
Yan Wang,Partho Hassan,Samiha Sadeka,Nada Soliman,M M Sayeef Abdullah,Sabit Hassan
Main category: cs.CV
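TL;DR: Lunara Aesthetic II is a public, ethically sourced image dataset of 2,854 anchor-linked variation pairs for evaluating and improving the contextual consistency, identity preservation, and aesthetic quality of image generation and editing systems.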
Details
Motivation: Existing large-scale web datasets fall short on contextual consistency, identity preservation, and aesthetic controllability, so high-quality, structured, interpretable supervision is needed to advance research on controllable and robust image generation and editing.
Method: The authors build a dataset of anchor-variation pairs derived from original images created by a professional art studio (Moonworks). Each pair applies a contextual transformation such as illumination, weather, viewpoint, composition, color tone, or mood while keeping the subject's identity stable and maintaining high aesthetic scores.
Result: The dataset shows high identity stability, strong realization of target attributes, and a robust aesthetic profile that exceeds large-scale web datasets; it is openly released and supports benchmarking, fine-tuning, and controllability analysis.
Conclusion: Lunara Aesthetic II provides the first public supervision dataset that combines identity stability, contextual controllability, and high aesthetic quality for image generation and editing systems, supporting research on controllable, interpretable, and robust visual generation.
Abstract: We introduce Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset comprises 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each variation pair applies contextual transformations, such as illumination, weather, viewpoint, scene composition, color tone, or mood, while preserving a stable underlying identity. Lunara Aesthetic II operationalizes identity-preserving contextual variation as a supervision signal while also retaining Lunara's signature high aesthetic scores. Results show high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds large-scale web datasets. Released under the Apache 2.0 license, Lunara Aesthetic II is intended for benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems with interpretable, relational supervision. The dataset is publicly available at: https://huggingface.co/datasets/moonworks/lunara-aesthetic-image-variations.
[372] Real-Time Loop Closure Detection in Visual SLAM via NetVLAD and Faiss
Enguang Fan
Main category: cs.CV
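TL;DR: This paper empirically evaluates NetVLAD as a loop closure detection (LCD) module for SLAM, compares it with the classic DBoW method, and shows on the KITTI dataset that it improves accuracy and robustness while remaining real-time.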
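To make the retrieval step behind this summary concrete, here is a minimal brute-force sketch of Top-K loop-closure candidate retrieval over global descriptors. It is a pure-Python stand-in and does not use Faiss or NetVLAD; the function names, the cosine-similarity measure, and the `min_sim` threshold are assumptions for illustration. The threshold is what allows a query to return zero matches, the situation that motivates a Top-K style precision-recall evaluation.

```python
import math

def normalize(v):
    """Scale a descriptor to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def top_k_matches(query, database, k=3, min_sim=0.8):
    """Return up to k (index, cosine similarity) pairs with similarity >= min_sim,
    best first. An empty list means no loop closure is proposed for this query."""
    q = normalize(query)
    scored = []
    for idx, desc in enumerate(database):
        d = normalize(desc)
        sim = sum(a * b for a, b in zip(q, d))
        if sim >= min_sim:
            scored.append((idx, sim))
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]

# Toy 2-D descriptors standing in for place descriptors of earlier keyframes.
database = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
revisit = top_k_matches([1.0, 0.0], database, k=2)   # two plausible matches
new_place = top_k_matches([-1.0, 0.0], database)     # no match at all
```

In a real system the linear scan would be replaced by an approximate or accelerated nearest-neighbor index, which is the role the paper assigns to Faiss.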
Details
Motivation: Classic bag-of-words (DBoW) loop closure detection degrades under appearance change and perceptual aliasing, while deep learning methods such as NetVLAD are more robust but are often considered too computationally expensive for real-time SLAM. The paper aims to verify whether deep VPR descriptors can deliver both real-time speed and strong performance as a practical LCD alternative.
Method: The authors empirically evaluate NetVLAD as an LCD module on the KITTI dataset and compare it with DBoW; propose a Fine-Grained Top-K precision-recall curve that better fits LCD settings where a query may have zero or multiple matches; and use Faiss to accelerate nearest-neighbor search.
Result: With Faiss acceleration, NetVLAD achieves real-time query speed and outperforms DBoW in both accuracy and robustness.
Conclusion: NetVLAD can serve as a practical, drop-in alternative for loop closure detection in SLAM, substantially improving performance without sacrificing real-time operation.
Abstract: Loop closure detection (LCD) is a core component of simultaneous localization and mapping (SLAM): it identifies revisited places and enables pose-graph constraints that correct accumulated drift. Classic bag-of-words approaches such as DBoW are efficient but often degrade under appearance change and perceptual aliasing. In parallel, deep learning-based visual place recognition (VPR) descriptors (e.g., NetVLAD and Transformer-based models) offer stronger robustness, but their computational cost is often viewed as a barrier to real-time SLAM. In this paper, we empirically evaluate NetVLAD as an LCD module and compare it against DBoW on the KITTI dataset. We introduce a Fine-Grained Top-K precision-recall curve that better reflects LCD settings where a query may have zero or multiple valid matches. With Faiss-accelerated nearest-neighbor search, NetVLAD achieves real-time query speed while improving accuracy and robustness over DBoW, making it a practical drop-in alternative for LCD in SLAM.
[373] VRGaussianAvatar: Integrating 3D Gaussian Avatars into VR
Hail Song,Boram Yoon,Seokhwan Yang,Seoyoung Kang,Hyunjeong Kim,Henning Metzmacher,Woontack Woo
Main category: cs.CV
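TL;DR: VRGaussianAvatar is a real-time full-body 3D Gaussian Splatting (3DGS) VR avatar system that is reconstructed from a single image and driven only by HMD tracking signals. It achieves efficient stereo rendering through a parallel frontend-backend architecture and a new Binocular Batching technique, and it outperforms image- and video-based mesh avatar baselines in both performance and subjective experience.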
Details
Motivation: Existing VR avatar systems are limited in real-time performance, appearance fidelity, and the naturalness of full-body motion; the goal is to address these limits without relying on extra sensors or dense video input.
Method: The authors build a parallel system with a VR Frontend (full-body pose estimation via inverse kinematics) and a GA Backend (stereoscopic rendering of a 3DGS avatar reconstructed from a single image), and propose Binocular Batching, which processes the left and right eye views jointly in one batch to improve rendering efficiency.
Result: In quantitative tests the system sustains interactive VR frame rates (≥60 FPS), and a user study shows significantly higher appearance similarity, embodiment, and plausibility than image- and video-based mesh avatar baselines.
Conclusion: Driving a high-quality, real-time, full-body 3DGS VR avatar from HMD tracking alone is feasible, and Binocular Batching offers a new approach to efficient stereo rendering of 3DGS in VR.
Abstract: We present VRGaussianAvatar, an integrated system that enables real-time full-body 3D Gaussian Splatting (3DGS) avatars in virtual reality using only head-mounted display (HMD) tracking signals. The system adopts a parallel pipeline with a VR Frontend and a GA Backend. The VR Frontend uses inverse kinematics to estimate full-body pose and streams the resulting pose along with stereo camera parameters to the backend. The GA Backend stereoscopically renders a 3DGS avatar reconstructed from a single image. To improve stereo rendering efficiency, we introduce Binocular Batching, which jointly processes left and right eye views in a single batched pass to reduce redundant computation and support high-resolution VR displays. We evaluate VRGaussianAvatar with quantitative performance tests and a within-subject user study against image- and video-based mesh avatar baselines. Results show that VRGaussianAvatar sustains interactive VR performance and yields higher perceived appearance similarity, embodiment, and plausibility. Project page and source code are available at https://vrgaussianavatar.github.io.
[374] SMTrack: State-Aware Mamba for Efficient Temporal Modeling in Visual Tracking
Yinchao Ma,Dengqing Yang,Zhangyu He,Wenfei Yang,Tianzhu Zhang
Main category: cs.CV
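TL;DR: This paper proposes SMTrack, a visual tracker based on a state space model that uses a selective state-aware mechanism and hidden state propagation to model long-range temporal dependencies at low computational cost.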
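The idea of hidden state propagation can be sketched with a generic diagonal linear state space recurrence, shown below. This is not SMTrack's selective state-aware model: the fixed parameters `a`, `b`, `c` and the function name `ssm_scan` are assumptions for illustration only. The point of the sketch is that each new frame costs a constant amount of work once the hidden state from earlier frames is carried forward, so the total cost grows linearly with sequence length.

```python
def ssm_scan(inputs, a, b, c, h0=None):
    """Diagonal linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = sum(c * h_t).

    `inputs` is a sequence of scalars (one per frame); `a`, `b`, `c` are lists
    with one entry per state dimension. Returns the outputs and the final
    hidden state, which can be passed back in as `h0` for later frames.
    """
    h = list(h0) if h0 is not None else [0.0] * len(a)
    outputs = []
    for x in inputs:
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        outputs.append(sum(ci * hi for ci, hi in zip(c, h)))
    return outputs, h

# Processing two frames at once ...
ys_full, _ = ssm_scan([1.0, 0.0], a=[0.5], b=[1.0], c=[2.0])
# ... matches processing them one at a time while carrying the hidden state.
y1, h1 = ssm_scan([1.0], a=[0.5], b=[1.0], c=[2.0])
y2, _ = ssm_scan([0.0], a=[0.5], b=[1.0], c=[2.0], h0=h1)
```

Carrying `h1` into the second call is the streaming analogue of letting each frame interact with previously tracked frames without reprocessing them.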
Details
Motivation: Conventional CNN and Transformer architectures struggle to model long-range temporal dependencies efficiently in visual tracking, often requiring complex customized modules or high computational cost.
Method: The authors propose the State-aware Mamba Tracker (SMTrack), which introduces a selective state-aware state space model and uses inter-frame hidden state propagation and updating to achieve temporal modeling with linear complexity.
Result: SMTrack achieves competitive tracking performance on multiple benchmarks while keeping computational cost low.
Conclusion: SMTrack offers a simple and efficient temporal modeling paradigm that needs no customized modules, providing a new approach to modeling long-range dependencies in visual tracking.
Abstract: Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is challenging especially in dynamic scenarios. Thus, numerous methods are proposed to introduce temporal cues to enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies in visual tracking, often necessitating either complex customized modules or substantial computational costs to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed State-aware Mamba Tracker (SMTrack), providing a neat pipeline for training and tracking without needing customized modules or substantial computational costs to build long-range temporal dependencies. It enjoys several merits. First, we propose a novel selective state-aware space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack facilitates long-range temporal interactions with linear computational complexity during training. Third, SMTrack enables each frame to interact with previously tracked frames via hidden state propagation and updating, which reduces the computational cost of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational costs.
[375] FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding
Kangcong Li,Peng Ye,Lin Zhang,Chao Wang,Huafeng Qin,Tao Chen
Main category: cs.CV
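TL;DR: This paper proposes FreshMem, a frequency-space hybrid memory network for online streaming video understanding that uses a multi-scale frequency memory module and a space thumbnail memory module to substantially improve multimodal large language models without any training.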
Details
Motivation: Existing methods lack flexible adaptivity, which causes irreversible detail loss and context fragmentation and hinders an effective transition from offline to online streaming video understanding.
Method: The authors propose the FreshMem network with two cooperating modules: Multi-scale Frequency Memory (MFM), which projects overflowing frames into frequency coefficients and retains residual details to reconstruct the historical "gist"; and Space Thumbnail Memory (STM), which uses an adaptive compression strategy to divide the continuous stream into episodic clusters and distill them into high-density space thumbnails.
Result: FreshMem improves on the Qwen2-VL baseline by 5.20%, 4.52%, and 2.34% on StreamingBench, OV-Bench, and OVO-Bench, respectively, and as a training-free solution it outperforms several fully fine-tuned methods.
Conclusion: FreshMem provides an efficient, training-free paradigm for long-horizon streaming video understanding that balances short-term fidelity with long-term coherence.
Abstract: Transitioning Multimodal Large Language Models (MLLMs) from offline to online streaming video understanding is essential for continuous perception. However, existing methods lack flexible adaptivity, leading to irreversible detail loss and context fragmentation. To resolve this, we propose FreshMem, a Frequency-Space Hybrid Memory network inspired by the brain's logarithmic perception and memory consolidation. FreshMem reconciles short-term fidelity with long-term coherence through two synergistic modules: Multi-scale Frequency Memory (MFM), which projects overflowing frames into representative frequency coefficients, complemented by residual details to reconstruct a global historical "gist"; and Space Thumbnail Memory (STM), which discretizes the continuous stream into episodic clusters by employing an adaptive compression strategy to distill them into high-density space thumbnails. Extensive experiments show that FreshMem significantly boosts the Qwen2-VL baseline, yielding gains of 5.20%, 4.52%, and 2.34% on StreamingBench, OV-Bench, and OVO-Bench, respectively. As a training-free solution, FreshMem outperforms several fully fine-tuned methods, offering a highly efficient paradigm for long-horizon streaming video understanding.
[376] Cross-Modal Alignment and Fusion for RGB-D Transmission-Line Defect Detection
Jiaming Cui,Shuai Zhou,Wenqiang Li,Ruifeng Qin,Feng Shen
Main category: cs.CV
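TL;DR: This paper proposes CMAFNet, a cross-modal alignment and fusion network that combines RGB appearance with depth geometry and uses feature purification and contextual semantic integration to improve detection of small transmission-line defects, clearly outperforming existing methods on the TLRGBD dataset.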
Details
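Motivation: Transmission-line defect detection in UAV inspection faces a high proportion of small targets, complex backgrounds, and large illumination changes, and existing RGB detectors struggle to separate geometrically subtle defects from similar background structures under low chromatic contrast.
Method: The authors propose CMAFNet, which consists of a Semantic Recomposition Module (dictionary-learning-based feature purification) and a Contextual Semantic Integration Framework (partial-channel attention that models global spatial dependencies). Position-wise normalization enforces explicit reconstruction-driven cross-modal alignment so that heterogeneous features are statistically compatible before fusion.
Result: On the TLRGBD benchmark (94.5% small objects), CMAFNet reaches 32.2% mAP@50 and 12.5% APs, exceeding the strongest baseline by 9.8 and 4.0 percentage points, respectively. A lightweight variant reaches 24.8% mAP@50 at 228 FPS with only 4.9M parameters, surpassing the YOLO family and matching Transformer-based methods at lower computational cost.
Conclusion: Through cross-modal alignment and purify-then-fuse processing, CMAFNet alleviates modality noise and structural confusion in small-object detection, providing an efficient and robust solution for real-time UAV inspection under resource constraints.
Abstract: Transmission line defect detection remains challenging for automated UAV inspection due to the dominance of small-scale defects, complex backgrounds, and illumination variations. Existing RGB-based detectors, despite recent progress, struggle to distinguish geometrically subtle defects from visually similar background structures under limited chromatic contrast. This paper proposes CMAFNet, a Cross-Modal Alignment and Fusion Network that integrates RGB appearance and depth geometry through a principled purify-then-fuse paradigm. CMAFNet consists of a Semantic Recomposition Module that performs dictionary-based feature purification via a learned codebook to suppress modality-specific noise while preserving defect-discriminative information, and a Contextual Semantic Integration Framework that captures global spatial dependencies using partial-channel attention to enhance structural semantic reasoning. Position-wise normalization within the purification stage enforces explicit reconstruction-driven cross-modal alignment, ensuring statistical compatibility between heterogeneous features prior to fusion. Extensive experiments on the TLRGBD benchmark, where 94.5% of instances are small objects, demonstrate that CMAFNet achieves 32.2% mAP@50 and 12.5% APs, outperforming the strongest baseline by 9.8 and 4.0 percentage points, respectively.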
A lightweight variant reaches 24.8% mAP@50 at 228 FPS with only 4.9M parameters, surpassing all YOLO-based detectors while matching transformer-based methods at substantially lower computational cost.
[377] Physics Informed Generative AI Enabling Labour Free Segmentation For Microscopy Analysis
Salma Zahran,Zhou Ao,Zhengyang Zhang,Chen Chi,Chenchen Yuan,Yanming Wang
Main category: cs.CV
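TL;DR: This paper proposes an annotation-free semantic segmentation framework that uses phase-field simulations to generate microstructure images with ground-truth labels and CycleGAN to translate them into realistic SEM images, so that a U-Net trained only on synthetic data performs well on real experimental images (Boundary F1 = 0.90, IoU = 0.88).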
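For readers unfamiliar with the IoU figure quoted in this summary, the following minimal sketch computes Intersection over Union for two binary masks. The nested-list mask format, the function name, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's evaluation code is not reproduced here.

```python
def iou(pred, target):
    """Intersection over Union of two same-sized binary masks (nested lists of 0/1).

    Returns 1.0 when both masks are empty, an assumed convention.
    """
    inter = union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0

# The prediction covers two pixels, the ground truth one of them.
pred = [[1, 1],
        [0, 0]]
target = [[1, 0],
          [0, 0]]
score = iou(pred, target)  # 1 shared pixel / 2 pixels in the union = 0.5
```

An IoU of 0.88 therefore means that the overlap between predicted and true regions covers 88% of their combined area, averaged as the evaluation protocol defines.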
Details
Motivation: Semantic segmentation of microscopy images is limited by expert-annotated data that is costly, subjective, and scarce, while purely physics-based simulated data lacks real noise and imaging artefacts, creating a significant domain shift that prevents generalization.
Method: Phase-field simulations generate microstructure images with perfect ground-truth masks; CycleGAN then performs unpaired image-to-image translation, turning the clean simulations into high-fidelity, realistic scanning electron microscope (SEM) images; finally, a U-Net is trained on this synthetic data alone.
Result: On unseen real experimental images, the U-Net achieves a mean Boundary F1-Score of 0.90 and an Intersection over Union (IoU) of 0.88; t-SNE and Shannon entropy analyses confirm that the synthetic images are indistinguishable from real data in feature space and statistical distribution.
Conclusion: The generative framework removes the dependence on manual annotation entirely, turning a data-scarce problem into a data-abundant one and providing a robust, fully automated solution for materials discovery and analysis.
Abstract: Semantic segmentation of microscopy images is a critical task for high-throughput materials characterisation, yet its automation is severely constrained by the prohibitive cost, subjectivity, and scarcity of expert-annotated data. While physics-based simulations offer a scalable alternative to manual labelling, models trained on such data historically fail to generalise due to a significant domain gap, lacking the complex textures, noise patterns, and imaging artefacts inherent to experimental data. This paper introduces a novel framework for labour-free segmentation that successfully bridges this simulation-to-reality gap. Our pipeline leverages phase-field simulations to generate an abundant source of microstructural morphologies with perfect, intrinsically-derived ground-truth masks. We then employ a Cycle-Consistent Generative Adversarial Network (CycleGAN) for unpaired image-to-image translation, transforming the clean simulations into a large-scale dataset of high-fidelity, realistic SEM images. A U-Net model, trained exclusively on this synthetic data, demonstrated remarkable generalisation when deployed on unseen experimental images, achieving a mean Boundary F1-Score of 0.90 and an Intersection over Union (IoU) of 0.88. Comprehensive validation using t-SNE feature-space projection and Shannon entropy analysis confirms that our synthetic images are statistically and featurally indistinguishable from the real data manifold.
By completely decoupling model training from manual annotation, our generative framework transforms a data-scarce problem into one of data abundance, providing a robust and fully automated solution to accelerate materials discovery and analysis.
[378] FastPhysGS: Accelerating Physics-based Dynamic 3DGS Simulation via Interior Completion and Adaptive Optimization
Yikun Ma,Yiqing Li,Jingwen Ye,Zhongkai Wu,Weidong Zhang,Lin Gao,Zhi Jin
Main category: cs.CV
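TL;DR: This paper proposes FastPhysGS, a framework for physics-based dynamic 3D Gaussian Splatting (3DGS) simulation that combines the MPM physics method with VLM predictions and uses two core modules, IPF and BGDO, to achieve efficient, robust, high-fidelity 4D physical simulation.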
Details
Motivation: Existing methods for extending 3DGS to 4D physical simulation depend on manual parameter tuning, generalize poorly, optimize inefficiently, suffer from a text/image-to-3D perceptual gap, produce unstable physical behavior, and ignore the surface structure of 3DGS.
Method: The authors propose the FastPhysGS framework: (1) Instance-aware Particle Filling (IPF) combined with Monte Carlo Importance Sampling (MCIS) efficiently fills interior particles while preserving geometric fidelity; (2) Bidirectional Graph Decoupling Optimization (BGDO) adaptively optimizes material parameters predicted by a vision-language model (VLM).
Result: Experiments show that FastPhysGS completes high-fidelity physical simulation within 1 minute using only 7 GB of runtime memory, outperforming prior methods.
Conclusion: FastPhysGS is a fast, robust, and scalable paradigm for physics-driven dynamic 3DGS simulation with broad application potential.
Abstract: Extending 3D Gaussian Splatting (3DGS) to 4D physical simulation remains challenging. Based on the Material Point Method (MPM), existing methods either rely on manual parameter tuning or distill dynamics from video diffusion models, limiting the generalization and optimization efficiency. Recent attempts using LLMs/VLMs suffer from a text/image-to-3D perceptual gap, yielding unstable physics behavior. In addition, they often ignore the surface structure of 3DGS, leading to implausible motion. We propose FastPhysGS, a fast and robust framework for physics-based dynamic 3DGS simulation: (1) Instance-aware Particle Filling (IPF) with Monte Carlo Importance Sampling (MCIS) to efficiently populate interior particles while preserving geometric fidelity; (2) Bidirectional Graph Decoupling Optimization (BGDO), an adaptive strategy that rapidly optimizes material parameters predicted from a VLM. Experiments show FastPhysGS achieves high-fidelity physical simulation in 1 minute using only 7 GB runtime memory, outperforming prior works with broad potential applications.
[379] DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-time Optical Flow and Stereo Estimation
Tushar Anand,Maheswar Bora,Antitza Dantcheva,Abhijit Das
Main category: cs.CV
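TL;DR: This paper proposes a new Mamba block called DenVisCoM and a hybrid architecture designed for joint estimation of optical flow and disparity, balancing real-time speed, memory efficiency, and accuracy.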
Details
Motivation: Optical flow and disparity estimation are both multi-view geometry and motion tasks and are fundamentally related, so modeling them in a unified way should improve joint estimation.
Method: The authors propose the DenVisCoM Mamba block and combine it with a Transformer attention mechanism in a hybrid architecture for joint real-time estimation of motion (optical flow) and dense 3D perception (disparity).
Result: Evaluation on multiple datasets shows a favorable trade-off between accuracy and real-time processing, with accurate optical flow and disparity estimated in real time.
Conclusion: The DenVisCoM hybrid architecture balances accuracy, inference speed, and memory footprint, offering a new paradigm for joint motion and 3D perception tasks.
Abstract: In this work, we propose a novel Mamba block, DenVisCoM, as well as a novel hybrid architecture specifically tailored for accurate and real-time estimation of optical flow and disparity. Given that such multi-view geometry and motion tasks are fundamentally related, we propose a unified architecture to tackle them jointly. Specifically, the proposed hybrid architecture is based on DenVisCoM and a Transformer-based attention block that efficiently addresses real-time inference, memory footprint, and accuracy at the same time for joint estimation of motion and 3D dense perception tasks. We extensively analyze the benchmark trade-off of accuracy and real-time processing on a large number of datasets. Our experimental results and related analysis suggest that our proposed model can accurately estimate optical flow and disparity in real time. All models and associated code are available at https://github.com/vimstereo/DenVisCoM.
[380] Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models
Yue Zhou,Xinan He,Kaiqing Lin,Bing Fan,Feng Ding,Bin Li
Main category: cs.CV
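TL;DR: This paper proposes a simple linear classifier on frozen features of modern vision foundation models for detecting AI-generated images. It clearly outperforms specialized detectors in real-world settings, shows that synthetic content in large-scale pretraining data contributes to detection ability, and identifies remaining limitations under recapture, transmission, VAE reconstruction, and localized editing.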
Details
Motivation: Existing detectors of AI-generated images perform well on standard benchmarks, but their performance drops sharply in real-world (in-the-wild) scenarios, so more robust and better-generalizing methods are needed.
Method: The authors extract image features with frozen modern vision foundation models (such as Perception Encoder, MetaCLIP 2, and DINOv3) and train only a simple linear classifier to identify AI-generated images.
Result: This simple method matches specialized detectors on standard benchmarks and improves accuracy by more than 30% on in-the-wild datasets. The study finds that VLMs internalize an explicit semantic concept of forgery, whereas SSL models implicitly acquire forensic features from pretraining data; however, the models remain insensitive to recapture, transmission, VAE reconstruction, and localized editing.
Conclusion: AI forensics should shift from overfitting to static benchmarks toward harnessing the evolving world knowledge of foundation models to improve real-world reliability.
Abstract: While specialized detectors for AI-Generated Images (AIGI) achieve near-perfect accuracy on curated benchmarks, they suffer from a dramatic performance collapse in realistic, in-the-wild scenarios. In this work, we demonstrate that simplicity prevails over complex architectural designs. A simple linear classifier trained on the frozen features of modern Vision Foundation Models, including Perception Encoder, MetaCLIP 2, and DINOv3, establishes a new state-of-the-art. Through a comprehensive evaluation spanning traditional benchmarks, unseen generators, and challenging in-the-wild distributions, we show that this baseline not only matches specialized detectors on standard benchmarks but also decisively outperforms them on in-the-wild datasets, boosting accuracy by striking margins of over 30%. We posit that this superior capability is an emergent property driven by the massive scale of pre-training data containing synthetic content. We trace the source of this capability to two distinct manifestations of data exposure: Vision-Language Models internalize an explicit semantic concept of forgery, while Self-Supervised Learning models implicitly acquire discriminative forensic features from the pretraining data. However, we also reveal persistent limitations: these models suffer from performance degradation under recapture and transmission, and remain blind to VAE reconstruction and localized editing.
We conclude by advocating for a paradigm shift in AI forensics, moving from overfitting on static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.
[381] Tail-Aware Post-Training Quantization for 3D Geometry Models
Sicheng Pan,Chen Tang,Shuzhao Xie,Ke Yang,Weixiang Zhang,Jiawei Li,Bin Chen,Shu-Tao Xia,Zhi Wang
Main category: cs.CV
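TL;DR: This paper proposes TAPTQ, a tail-aware post-training quantization method designed for 3D geometric learning that uses progressive calibration construction, ternary search over quantization intervals, and module-wise compensation guided by tail relative error to keep accuracy high while greatly reducing calibration overhead.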
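The complexity reduction mentioned in this summary, from a linear scan over N candidate intervals to O(log N) evaluations, can be illustrated with a generic discrete ternary search. The sketch below is not TAPTQ's solver: the symmetric uniform quantizer, the mean-squared-error cost, and all names are assumptions for illustration, and the search is only guaranteed to find the true minimum when the cost is strictly unimodal over the candidate indices.

```python
def ternary_argmin(cost, lo, hi):
    """Index in [lo, hi] that minimises cost(i), assuming cost is strictly unimodal.

    Each iteration discards roughly a third of the range, so the number of
    cost evaluations grows logarithmically with the range size.
    """
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if cost(m1) < cost(m2):
            hi = m2 - 1   # the minimum lies strictly left of m2
        else:
            lo = m1 + 1   # the minimum lies strictly right of m1
    return min(range(lo, hi + 1), key=cost)

def quantize(x, clip, bits=4):
    """Symmetric uniform quantisation of x with range [-clip, clip]."""
    levels = 2 ** (bits - 1) - 1
    step = clip / levels
    q = max(-levels, min(levels, round(x / step)))
    return q * step

def quant_error(values, clip, bits=4):
    """Mean squared error between values and their quantised versions."""
    return sum((v - quantize(v, clip, bits)) ** 2 for v in values) / len(values)

# Hypothetical activations with one long-tail outlier, and candidate clip values.
activations = [0.05, -0.1, 0.2, 0.15, -0.3, 0.25, 4.0]
candidates = [0.1 * i for i in range(1, 51)]
best = ternary_argmin(lambda i: quant_error(activations, candidates[i]), 0, len(candidates) - 1)
best_clip = candidates[best]
```

When the error curve has plateaus or several local minima, the result is only a local choice, so a practical solver would need a safeguard such as a coarse scan first; that safeguard is omitted here.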
Details
Motivation: Existing post-training quantization (PTQ) methods optimized for 2D Vision Transformers transfer poorly to 3D models because of complex feature distributions and very high calibration overhead, while deploying 3D models on resource-constrained platforms remains difficult.
Method: The authors propose TAPTQ: (1) a progressive coarse-to-fine strategy for constructing the calibration subset that balances statistical purity and geometric representativeness; (2) a formulation of quantization interval search as an optimization problem solved by ternary search, reducing complexity from O(N) to O(log N); (3) a Tail Relative Error (TRE) metric that guides module-wise error compensation to mitigate error accumulation caused by long-tailed activation outliers.
Result: On the VGGT and Pi3 benchmarks, TAPTQ consistently exceeds existing state-of-the-art PTQ methods in accuracy while substantially reducing calibration time.
Conclusion: TAPTQ is an efficient and accurate post-training quantization scheme suited to 3D geometric learning, addressing the poor transferability and high calibration cost of conventional PTQ in 3D settings.
Abstract: The burgeoning complexity and scale of 3D geometry models pose significant challenges for deployment on resource-constrained platforms. While Post-Training Quantization (PTQ) enables efficient inference without retraining, conventional methods, primarily optimized for 2D Vision Transformers, fail to transfer effectively to 3D models due to intricate feature distributions and prohibitive calibration overhead. To address these challenges, we propose TAPTQ, a Tail-Aware Post-Training Quantization pipeline specifically engineered for 3D geometric learning. Our contribution is threefold: (1) To overcome the data-scale bottleneck in 3D datasets, we develop a progressive coarse-to-fine calibration construction strategy that constructs a highly compact subset to achieve both statistical purity and geometric representativeness. (2) We reformulate the quantization interval search as an optimization problem and introduce a ternary-search-based solver, reducing the computational complexity from $\mathcal{O}(N)$ to $\mathcal{O}(\log N)$ for accelerated deployment. (3) To mitigate quantization error accumulation, we propose TRE-Guided Module-wise Compensation, which utilizes a Tail Relative Error (TRE) metric to adaptively identify and rectify distortions in modules sensitive to long-tailed activation outliers. Extensive experiments on the VGGT and Pi3 benchmarks demonstrate that TAPTQ consistently outperforms state-of-the-art PTQ methods in accuracy while significantly reducing calibration time. The code will be released soon.
[382] ObjEmbed: Towards Universal Multimodal Object Embeddings
Shenghao Fu,Yukun Su,Fengyun Rao,Jing Lyu,Xiaohua Xie,Wei-Shi Zheng
Main category: cs.CV
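TL;DR: ObjEmbed is a new multimodal large language model embedding method that decomposes an image into multiple regional embeddings (one per object) plus global embeddings to achieve fine-grained image-text alignment. It supports tasks such as visual grounding and local and global image retrieval, and performs strongly on 18 benchmarks.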
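The listing states that ObjEmbed's final matching score combines semantic similarity with a predicted IoU but does not give the fusion rule. The sketch below assumes a simple product, which is one plausible choice and not necessarily the paper's; the function names and the toy region tuples are hypothetical.

```python
def match_score(similarity, predicted_iou):
    """Assumed fusion rule: semantic similarity scaled by predicted localisation quality."""
    return similarity * predicted_iou

def best_region(regions):
    """regions: list of (name, similarity, predicted_iou); returns the best-scoring name."""
    return max(regions, key=lambda r: match_score(r[1], r[2]))[0]

# A loosely localised box with slightly higher text similarity loses to a
# tightly localised box once localisation quality is taken into account.
regions = [
    ("loose box", 0.90, 0.40),
    ("tight box", 0.85, 0.90),
]
winner = best_region(regions)
```

Under this assumption a region must both match the phrase and be well localised to rank first, which is the behaviour the two complementary embeddings are meant to enable.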
Details
Motivation: Existing vision-language models excel at global image-text alignment but fall short on fine-grained alignment between image regions and text phrases.
Method: The authors propose ObjEmbed, which decomposes the input image into multiple regional embeddings (one per object) and global embeddings. Each region receives two complementary embeddings: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final matching score fuses semantic similarity with the predicted IoU, and all regions and the full image are encoded in a single forward pass.
Result: ObjEmbed shows superior performance on 18 diverse benchmarks, confirming its strong semantic discrimination and its effectiveness on tasks such as visual grounding and local and global image retrieval.
Conclusion: Through object-oriented representation, task versatility, and efficient encoding, ObjEmbed substantially improves fine-grained vision-language alignment and offers a new approach to multimodal understanding.
Abstract: Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.
[383] Spot-Wise Smart Parking: An Edge-Enabled Architecture with YOLOv11 and Digital Twin Integration
Gustavo P. C. P. da Luz,Alvaro M. Aspilcueta Narvaez,Tiago Godoi Bannwart,Gabriel Massuyoshi Sato,Luis Fernando Gomez Gonzalez,Juliana Freitag Borin
Main category: cs.CV
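TL;DR: This paper proposes a spot-level smart parking monitoring system based on distance-aware matching and adaptive bounding box partitioning. With YOLOv11m on an edge device it reaches 98.80% accuracy with an 8-second inference time, and it adds a Digital Shadow and an application support server built on a repurposed TV box in support of sustainable smart cities.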
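As a rough picture of what distance-aware matching with a spatial tolerance might look like, the sketch below assigns detected vehicle boxes to known parking-spot centres. The greedy nearest-centre rule, the pixel tolerance, and all names are assumptions for illustration; the paper's matching method and its Adaptive Bounding Box Partitioning step are not reproduced.

```python
import math

def bbox_center(box):
    """Centre (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def match_spots(spots, detections, tolerance):
    """Mark each spot occupied if the nearest unused detection centre lies
    within `tolerance` pixels of the spot centre.

    spots: dict of spot id -> (x, y) centre; detections: list of boxes.
    Greedy and order-dependent, which is acceptable for a sketch.
    """
    centers = [bbox_center(b) for b in detections]
    used = set()
    occupancy = {}
    for spot_id, (sx, sy) in spots.items():
        best, best_dist = None, tolerance
        for i, (cx, cy) in enumerate(centers):
            d = math.hypot(cx - sx, cy - sy)
            if i not in used and d <= best_dist:
                best, best_dist = i, d
        if best is not None:
            used.add(best)
        occupancy[spot_id] = best is not None
    return occupancy

# Two spots, one detected vehicle slightly off-centre in spot A.
spots = {"A": (10.0, 10.0), "B": (50.0, 10.0)}
detections = [(6.0, 6.0, 16.0, 16.0)]
occupancy = match_spots(spots, detections, tolerance=5.0)
```

The tolerance absorbs small offsets between a vehicle's detected centre and the nominal spot centre, while the `used` set prevents one vehicle from occupying two spots.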
Details
Motivation: The earlier region-level vehicle counting approach was accurate but could not provide fine-grained, spot-level information, which limited support for advanced applications; edge deployment efficiency and system sustainability also needed improvement.
Method: The authors propose a distance-aware matching algorithm with spatial tolerance and an Adaptive Bounding Box Partitioning method for spot-level detection; integrate a Digital Shadow (a foundation for a Digital Twin) and an application support server based on a repurposed TV box; and optimize deployment of the YOLOv11m model on a resource-constrained edge device.
Result: Spot-level detection reaches 98.80% accuracy with an inference time of 8 seconds on the edge device; the Digital Shadow and TV box server are deployed successfully, enabling communication among cloud services, the parking totem, and a bot, as well as hardware reuse.
Conclusion: The extended system markedly improves the spatial granularity, responsiveness, and sustainability of smart parking, providing a scalable, low-cost, easy-to-deploy solution for campus and city-level parking and taking a key step toward a Digital Twin.
Abstract: Smart parking systems help reduce congestion and minimize users' search time, thereby contributing to smart city adoption and enhancing urban mobility. In previous works, we presented a system developed on a university campus to monitor parking availability by estimating the number of free spaces from vehicle counts within a region of interest. Although this approach achieved good accuracy, it restricted the system's ability to provide spot-level insights and support more advanced applications. To overcome this limitation, we extend the system with a spot-wise monitoring strategy based on a distance-aware matching method with spatial tolerance, enhanced through an Adaptive Bounding Box Partitioning method for challenging spaces. The proposed approach achieves a balanced accuracy of 98.80% while maintaining an inference time of 8 seconds on a resource-constrained edge device, enhancing the capabilities of YOLOv11m, a model that has a size of 40.5 MB. In addition, two new components were introduced: (i) a Digital Shadow that visually represents parking lot entities as a base to evolve to a full Digital Twin, and (ii) an application support server based on a repurposed TV box. The latter not only enables scalable communication among cloud services, the parking totem, and a bot that provides detailed spot occupancy statistics, but also promotes hardware reuse as a step towards greater sustainability.
[384] Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation
Jun He,Junyan Ye,Zilong Huang,Dongzhi Jiang,Chenjue Zhang,Leqi Zhu,Renrui Zhang,Xiang Zhang,Weijia Li
Main category: cs.CV
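TL;DR: This paper proposes Mind-Brush, a framework that models text-to-image generation as a dynamic, knowledge-driven 'think-research-create' agentic workflow, substantially improving the model's grasp of implicit intent and its ability to reason over complex knowledge.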
Details
Motivation: Most existing text-to-image models act as static text-to-pixel decoders and struggle to understand implicit user intent; unified understanding-generation models improve on this but still fall short on complex knowledge reasoning and on adapting to real-world dynamics.
Method: Mind-Brush, a unified agentic framework, simulates a human-like 'think-research-create' paradigm: it actively retrieves multimodal evidence to ground out-of-distribution concepts and calls reasoning tools to resolve implicit visual constraints. The authors also build Mind-Bench, a 500-sample benchmark covering real-time news, emerging concepts, and mathematical and geographic reasoning.
Result: On Mind-Bench, Mind-Brush gives the Qwen-Image baseline a zero-to-one capability leap, and it achieves superior results on established benchmarks such as WISE and RISE.
Conclusion: By introducing a dynamic, knowledge-driven mechanism and external tool use, Mind-Brush overcomes limits of the static generation paradigm and marks a key step toward text-to-image systems that understand and reason.
Abstract: While text-to-image generation has achieved unprecedented fidelity, the vast majority of existing models function fundamentally as static text-to-pixel decoders. Consequently, they often fail to grasp implicit user intentions. Although emerging unified understanding-generation models have improved intent comprehension, they still struggle to accomplish tasks involving complex knowledge reasoning within a single model. Moreover, constrained by static internal priors, these models remain unable to adapt to the evolving dynamics of the real world. To bridge these gaps, we introduce Mind-Brush, a unified agentic framework that transforms generation into a dynamic, knowledge-driven workflow. Simulating a human-like 'think-research-create' paradigm, Mind-Brush actively retrieves multimodal evidence to ground out-of-distribution concepts and employs reasoning tools to resolve implicit visual constraints. To rigorously evaluate these capabilities, we propose Mind-Bench, a comprehensive benchmark comprising 500 distinct samples spanning real-time news, emerging concepts, and domains such as mathematical and Geo-Reasoning. Extensive experiments demonstrate that Mind-Brush significantly enhances the capabilities of unified models, realizing a zero-to-one capability leap for the Qwen-Image baseline on Mind-Bench, while achieving superior results on established benchmarks like WISE and RISE.
[385] MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement
Hao Zhang,Yanping Zha,Zizhuo Li,Meiqi Gong,Jiayi Ma
Main category: cs.CV
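TL;DR: This paper proposes MagicFuse, a framework that uses diffusion models to generate a cross-spectral scene representation from a single low-quality visible image. This single-image fusion matches or even exceeds multi-modal fusion methods on visual and semantic tasks.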
Details
Motivation: How to keep benefiting from multi-modal image fusion under harsh conditions when only visible imaging sensors are available.
Method: Introduces the concept of single-image fusion and builds MagicFuse, which comprises (1) a diffusion-based intra-spectral knowledge reinforcement branch, (2) a cross-spectral knowledge generation branch, and (3) a multi-domain knowledge fusion branch that fuses the diffusion noise streams of the first two branches and samples a cross-spectral representation; visual and semantic constraints are then imposed.
Result: Using only a single degraded visible image, MagicFuse matches or exceeds state-of-the-art fusion methods that rely on multi-modal inputs, in both visual quality and semantic representation.
Conclusion: Single-image fusion is feasible and effective; MagicFuse offers a new paradigm for cross-spectral perception when no infrared sensor is available.
Abstract: This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image.
[386] GDPR-Compliant Person Recognition in Industrial Environments Using MEMS-LiDAR and Hybrid Data
Dennis Basile,Dennis Sprute,Helene Dörksen,Holger Flatt
Main category: cs.CV
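TL;DR: This paper presents a privacy-compliant person detection approach based on MEMS-LiDAR. It augments real data with synthetic point clouds generated in the CARLA simulator, which substantially improves detection precision and cuts manual annotation effort while remaining GDPR compliant.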
Details
Motivation: Conventional deep-learning vision methods are sensitive to lighting and visibility and risk privacy violations (for example, of the GDPR); collecting and annotating real LiDAR data is time-consuming, laborious and error-prone.
Method: MEMS-LiDAR captures anonymized 3D point clouds that avoid personal identification; high-quality synthetic scenes from the CARLA simulation framework augment the real data; an end-to-end point cloud object detection model is built.
Result: The model trained on hybrid data improves average precision (AP) by 44 percentage points over a model trained on real data only, and manual annotation effort drops by 50%.
Conclusion: The approach delivers accurate, low-cost, scalable and privacy-compliant person detection in industrial indoor settings, and shows that synthetic LiDAR data can combine strong performance with regulatory compliance.
Abstract: The reliable detection of unauthorized individuals in safety-critical industrial indoor spaces is crucial to avoid plant shutdowns, property damage, and personal hazards. Conventional vision-based methods that use deep-learning approaches for person recognition provide image information but are sensitive to lighting and visibility conditions and often violate privacy regulations, such as the General Data Protection Regulation (GDPR) in the European Union. Typically, detection systems based on deep learning require annotated data for training. Collecting and annotating such data, however, is highly time-consuming and, because it is done manually, not necessarily error-free. Therefore, this paper presents a privacy-compliant approach based on Micro-Electro-Mechanical Systems LiDAR (MEMS-LiDAR), which exclusively captures anonymized 3D point clouds and avoids personal identification features. To compensate for the large amount of time required to record real LiDAR data and for post-processing and annotation, real recordings are augmented with synthetically generated scenes from the CARLA simulation framework. The results demonstrate that the hybrid data improves the average precision by 44 percentage points compared to a model trained exclusively with real data while reducing the manual annotation effort by 50%. Thus, the proposed approach provides a scalable, cost-efficient alternative to purely real-data-based methods and systematically shows how synthetic LiDAR data can combine high performance in person detection with GDPR compliance in an industrial environment.
[387] DDP-WM: Disentangled Dynamics Prediction for Efficient World Models
Shicheng Yin,Kaixuan Yin,Weixing Chen,Yang Liu,Guanbin Li,Liang Lin
Main category: cs.CV
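TL;DR: This paper proposes DDP-WM, an efficient world model based on Disentangled Dynamics Prediction (DDP). By separating sparse primary dynamics driven by physical interactions from secondary, context-driven background updates, it keeps high fidelity while markedly improving inference speed and planning performance.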
Details
Motivation: Existing dense Transformer-based world models carry heavy computational overhead that hinders real-time deployment, so the efficiency-performance bottleneck needs to be addressed.
Method: Proposes the Disentangled Dynamics Prediction (DDP) principle and the DDP-WM architecture, which combines efficient historical processing with dynamic localization to isolate primary dynamics and uses a cross-attention mechanism for background updates, optimizing resource allocation and giving planners a smooth optimization landscape.
Result: Significant efficiency and performance gains on navigation, precise tabletop manipulation, and complex deformable or multi-body interaction tasks; on Push-T, inference is roughly 9 times faster and the MPC success rate rises from 90% to 98%.
Conclusion: DDP-WM offers a new path toward efficient, high-fidelity world models and advances real-time planning for autonomous robots.
Abstract: World models are essential for autonomous robotic planning. However, the substantial computational overhead of existing dense Transformer-based models significantly hinders real-time deployment. To address this efficiency-performance bottleneck, we introduce DDP-WM, a novel world model centered on the principle of Disentangled Dynamics Prediction (DDP). We hypothesize that latent state evolution in observed scenes is heterogeneous and can be decomposed into sparse primary dynamics driven by physical interactions and secondary context-driven background updates. DDP-WM realizes this decomposition through an architecture that integrates efficient historical processing with dynamic localization to isolate primary dynamics. By employing a cross-attention mechanism for background updates, the framework optimizes resource allocation and provides a smooth optimization landscape for planners. Extensive experiments demonstrate that DDP-WM achieves significant efficiency and performance across diverse tasks, including navigation, precise tabletop manipulation, and complex deformable or multi-body interactions. Specifically, on the challenging Push-T task, DDP-WM achieves an approximately 9 times inference speedup and improves the MPC success rate from 90% to 98% compared to state-of-the-art dense models. The results establish a promising path for developing efficient, high-fidelity world models. Codes will be available at https://github.com/HCPLabSYSU/DDP-WM.
[388] Automated Discontinuity Set Characterisation in Enclosed Rock Face Point Clouds Using Single-Shot Filtering and Cyclic Orientation Transformation
Dibyayan Patra,Pasindu Ranasinghe,Bikram Banerjee,Simit Raval
Main category: cs.CV
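TL;DR: This paper proposes a new method for automatic characterisation of structural discontinuity sets in exposed rock faces of underground mine cavities. It combines single-shot filtering, a cyclic orientation transformation and hierarchical clustering, and markedly improves orientation estimation accuracy on real stope data.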
Details
Motivation: UAV and mobile laser scanning collect rock-face point clouds efficiently, but a robust and efficient method for automatic discontinuity set characterisation in real-world settings such as fully enclosed cavities is still lacking.
Method: The approach has three parts: (1) a single-shot filtering strategy that uses a signal-processing technique to isolate planar regions and suppress noise and high-curvature artefacts in one pass; (2) a cyclic orientation transformation that maps polar orientations (dip angle, dip direction) accurately into Cartesian space; (3) hierarchical clustering that handles varying density distributions and identifies discontinuity sets without a preset number of clusters.
Result: On real mine stope data, the method achieves mean absolute errors of 1.95° in dip angle and 2.20° in dip direction, with dispersion errors below 3°, outperforming existing automated structure mapping techniques.
Conclusion: The method identifies discontinuity sets accurately and without manual intervention in complex real underground environments, supporting rock-mass stability assessment and mining safety.
Abstract: Characterisation of structural discontinuity sets in exposed rock faces of underground mine cavities is essential for assessing rock-mass stability, excavation safety, and operational efficiency. UAV and other mobile laser-scanning techniques provide efficient means of collecting point clouds from rock faces. However, the development of a robust and efficient approach for automatic characterisation of discontinuity sets in real-world scenarios, like fully enclosed rock faces in cavities, remains an open research problem. In this study, a new approach is proposed for automatic discontinuity set characterisation that uses a single-shot filtering strategy, an innovative cyclic orientation transformation scheme and a hierarchical clustering technique. The single-shot filtering step isolates planar regions while robustly suppressing noise and high-curvature artefacts in one pass using a signal-processing technique. To address the limitations of Cartesian clustering on polar orientation data, a cyclic orientation transformation scheme is developed, enabling accurate representation of dip angle and dip direction in Cartesian space. The transformed orientations are then characterised into sets using a hierarchical clustering technique, which handles varying density distributions and identifies clusters without requiring user-defined set numbers. The accuracy of the method is validated on real-world mine stope data against ground truth obtained from manually picked discontinuity planes identified with the Virtual Compass tool, and compared with widely used automated structure mapping techniques.
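To illustrate why clustering raw (dip angle, dip direction) pairs is problematic, the following minimal sketch converts an orientation to a unit normal vector using the textbook geological conversion. This is an assumption for illustration only; the abstract does not specify the authors' cyclic orientation transformation, and the function names here are hypothetical.

```python
import math

def orientation_to_normal(dip_deg, dip_dir_deg):
    """Map (dip angle, dip direction) in degrees to an upward unit normal
    expressed as (east, north, up). Standard conversion, not the paper's scheme."""
    dip = math.radians(dip_deg)
    azimuth = math.radians(dip_dir_deg)
    return (
        math.sin(dip) * math.sin(azimuth),  # east component
        math.sin(dip) * math.cos(azimuth),  # north component
        math.cos(dip),                      # up component
    )

def angle_between_deg(n1, n2):
    """Angle in degrees between two unit vectors."""
    dot = sum(a * b for a, b in zip(n1, n2))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(dot))

# Two nearly parallel planes whose dip directions straddle north.
plane_a = orientation_to_normal(60.0, 359.0)
plane_b = orientation_to_normal(60.0, 1.0)

naive_gap = abs(359.0 - 1.0)                    # raw dip directions look far apart
true_gap = angle_between_deg(plane_a, plane_b)  # the normals are under 2 degrees apart
```

In this sketch the raw dip directions differ by 358 while the normals are almost identical, which is the wrap-around problem that any cyclic-aware representation has to remove before distance-based clustering.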
The proposed approach outperforms the other techniques by exhibiting the lowest mean absolute error in estimating discontinuity set orientations in real-world stope data, with errors of 1.95° and 2.20° in nominal dip angle and dip direction, respectively, and dispersion errors lying below 3°.
[389] Spatio-Temporal Transformers for Long-Term NDVI Forecasting
Ido Faran,Nathan S. Netanyahu,Maxim Shoshany
Main category: cs.CV
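TL;DR: This paper proposes the STT-LTF framework, which combines spatial context modeling with temporal sequence prediction and applies self-supervised learning to 40 years of Landsat imagery. It clearly outperforms traditional methods for long-term remote sensing time series forecasting in heterogeneous Mediterranean landscapes.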
Details
Motivation: Long-term satellite image time series analysis in heterogeneous landscapes such as the Mediterranean is challenged by complex spatial patterns, seasonal variation and multi-decade environmental change interacting across scales.
Method: The Spatio-Temporal Transformer for Long Term Forecasting (STT-LTF) jointly models multi-scale spatial patches and long temporal sequences (up to 20 years), trained with the self-supervised strategies of spatial masking, temporal masking and horizon sampling; spatial patch embeddings, cyclical temporal encoding and geographic coordinates capture spatio-temporal dependencies.
Result: On Landsat data from 1984 to 2024, next-year prediction reaches MAE = 0.0328 and R^2 = 0.8412, surpassing statistical methods, CNNs, LSTMs and standard transformers; irregular sampling and variable prediction horizons are supported.
Conclusion: STT-LTF offers a new paradigm for long-term, robust and accurate remote sensing time series forecasting in heterogeneous ecosystems, especially in regions undergoing rapid ecological transitions.
Abstract: Long-term satellite image time series (SITS) analysis in heterogeneous landscapes faces significant challenges, particularly in Mediterranean regions where complex spatial patterns, seasonal variations, and multi-decade environmental changes interact across different scales. This paper presents the Spatio-Temporal Transformer for Long Term Forecasting (STT-LTF), an extended framework that advances beyond purely temporal analysis to integrate spatial context modeling with temporal sequence prediction. STT-LTF processes multi-scale spatial patches alongside temporal sequences (up to 20 years) through a unified transformer architecture, capturing both local neighborhood relationships and regional climate influences. The framework employs comprehensive self-supervised learning with spatial masking, temporal masking, and horizon sampling strategies, enabling robust model training from 40 years of unlabeled Landsat imagery. Unlike autoregressive approaches, STT-LTF directly predicts arbitrary future time points without error accumulation, incorporating spatial patch embeddings, cyclical temporal encoding, and geographic coordinates to learn complex dependencies across heterogeneous Mediterranean ecosystems. Experimental evaluation on Landsat data (1984-2024) demonstrates that STT-LTF achieves a Mean Absolute Error (MAE) of 0.0328 and R^2 of 0.8412 for next-year predictions, outperforming traditional statistical methods, CNN-based approaches, LSTM networks, and standard transformers.
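As a rough illustration of two ingredients named above, the sketch below shows a common sine/cosine form of cyclical temporal encoding and the standard definitions of MAE and R^2. The exact encoding used by STT-LTF is not given in the abstract, so the period of 365.25 days and all function names are assumptions.

```python
import math

def cyclical_encoding(day_of_year, period=365.25):
    """Encode a day of year as a point on the unit circle so that late
    December and early January end up close together (assumed form)."""
    angle = 2.0 * math.pi * day_of_year / period
    return (math.sin(angle), math.cos(angle))

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Assumes y_true is not constant, so SS_tot is non-zero."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Day 365 and day 1 are adjacent on the circle despite the large numeric gap.
dec_31 = cyclical_encoding(365)
jan_01 = cyclical_encoding(1)
seasonal_gap = math.dist(dec_31, jan_01)

# Toy NDVI-like values, invented purely to exercise the metric functions.
toy_mae = mean_absolute_error([0.5, 0.6], [0.4, 0.7])
```

The circular form is one reason such encodings suit seasonal signals: distances between encoded dates follow the calendar cycle rather than the raw day index.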
The framework's ability to handle irregular temporal sampling and variable prediction horizons makes it particularly suitable for analysis of heterogeneous landscapes experiencing rapid ecological transitions.
[390] Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention
Dvir Samuel,Issar Tzachor,Matan Levy,Micahel Green,Gal Chechik,Rami Ben-Ari
Main category: cs.CV
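TL;DR: This paper proposes a unified, training-free attention optimization framework (TempCache, AnnCA, AnnSA) that addresses the inference latency and GPU memory growth caused by the expanding KV cache in autoregressive video diffusion models, substantially improving the efficiency and stability of long-sequence generation.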
Details
Motivation: In long video generation, the KV cache of autoregressive video diffusion models grows linearly with the number of frames, which raises inference latency, inflates GPU memory use, limits the usable temporal context and harms long-range consistency.
Method: Building on an analysis of attention redundancy in video diffusion (near-duplicate keys across frames, slowly evolving semantic queries/keys, and long prompts in which only a few tokens matter per frame), the paper proposes three training-free modules: TempCache compresses the KV cache via temporal correspondence; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens with approximate nearest neighbor (ANN) matching; AnnSA restricts query-key matching in self-attention using a lightweight ANN.
Result: End-to-end speedups of 5x to 10x with nearly unchanged visual quality; over long rollouts, throughput stays stable and peak GPU memory stays nearly constant, whereas baselines slow down progressively and consume ever more memory.
Conclusion: The framework is a general, plug-and-play attention optimization that is compatible with existing autoregressive video diffusion models and video world models, and it relieves the inference bottleneck of long video generation.
Abstract: Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models.
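The idea that near-duplicate cached keys can be merged to bound cache growth can be sketched as follows. This is a toy illustration under stated assumptions, not the TempCache algorithm: the cosine-similarity threshold, the running-mean merge rule and the names `cosine_similarity` and `append_with_dedup` are all hypothetical, and keys are assumed to be non-zero vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def append_with_dedup(cache, key, value, threshold=0.99):
    """Add (key, value) to the cache, merging it into the most similar
    existing entry when the similarity reaches the threshold.
    Each cache entry is [mean_key, mean_value, merged_count]."""
    best_index, best_sim = -1, threshold
    for index, (cached_key, _, _) in enumerate(cache):
        sim = cosine_similarity(cached_key, key)
        if sim >= best_sim:
            best_index, best_sim = index, sim
    if best_index < 0:
        cache.append([list(key), list(value), 1])  # genuinely new entry
        return
    cached_key, cached_value, count = cache[best_index]
    # Running mean keeps one representative per group of near-duplicates.
    cache[best_index] = [
        [(k * count + x) / (count + 1) for k, x in zip(cached_key, key)],
        [(v * count + x) / (count + 1) for v, x in zip(cached_value, value)],
        count + 1,
    ]

# Three incoming keys, two of which are nearly identical across "frames".
cache = []
append_with_dedup(cache, [1.0, 0.0], [0.2, 0.4])
append_with_dedup(cache, [1.0, 0.01], [0.4, 0.6])
append_with_dedup(cache, [0.0, 1.0], [0.9, 0.1])
```

With this merge rule the cache size tracks the number of distinct key directions rather than the number of frames, which is the qualitative effect described for bounded cache growth; a real system would operate on batched tensors and per-head caches.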
Experiments demonstrate up to 5x-10x end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
[391] FlowBypass: Rectified Flow Trajectory Bypass for Training-Free Image Editing
Menglin Han,Zhangkai Ni
Main category: cs.CV
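TL;DR: This paper proposes FlowBypass, a new training-free image editing framework based on Rectified Flow. It builds a direct bypass between the inversion and reconstruction trajectories, which avoids error accumulation without feature manipulation and balances prompt alignment with image fidelity.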
Details
Motivation: Existing training-free editing methods rely on inversion-reconstruction trajectories, whose length creates an inherent trade-off between fidelity and prompt alignment; earlier fixes are mostly backbone-specific and generalize poorly.
Method: FlowBypass, grounded in Rectified Flow, formally derives the inversion and reconstruction trajectories and obtains an approximate bypass formulation together with its numerical solution, giving seamless transitions between trajectories without any feature manipulation.
Result: Across experiments, FlowBypass consistently outperforms state-of-the-art methods, improving prompt alignment while better preserving high-fidelity details in irrelevant regions.
Conclusion: FlowBypass provides a general, analytical, training-free editing paradigm that needs no backbone-specific feature design and mitigates trajectory error accumulation.
Abstract: Training-free image editing has attracted increasing attention for its efficiency and independence from training data. However, existing approaches predominantly rely on inversion-reconstruction trajectories, which impose an inherent trade-off: longer trajectories accumulate errors and compromise fidelity, while shorter ones fail to ensure sufficient alignment with the edit prompt. Previous attempts to address this issue typically employ backbone-specific feature manipulations, limiting general applicability. To address these challenges, we propose FlowBypass, a novel and analytical framework grounded in Rectified Flow that constructs a bypass directly connecting inversion and reconstruction trajectories, thereby mitigating error accumulation without relying on feature manipulations. We provide a formal derivation of two trajectories, from which we obtain an approximate bypass formulation and its numerical solution, enabling seamless trajectory transitions. Extensive experiments demonstrate that FlowBypass consistently outperforms state-of-the-art image editing methods, achieving stronger prompt alignment while preserving high-fidelity details in irrelevant regions.
[392] LDRNet: Large Deformation Registration Model for Chest CT Registration
Cheng Wang,Qiyu Gao,Fandong Zhang,Shu Zhang,Yizhou Yu
Main category: cs.CV
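TL;DR: This paper proposes LDRNet, a fast unsupervised deep learning method for large deformation registration of chest CT images. It uses a coarse-to-fine strategy with two new modules (a refine block and a rigid block) and shows state-of-the-art accuracy and speed on a private dataset and SegTHOR.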
Details
Motivation: Compared with brain registration, chest CT registration involves larger deformation, more complex backgrounds and region overlap; existing deep learning methods mostly target the brain and do not transfer directly.
Method: LDRNet first predicts a coarse-resolution registration field and then refines it from coarse to fine. It introduces a refine block (multi-resolution refinement) and a rigid block (which learns a transformation matrix from high-level features), and is trained without supervision.
Result: On a private dataset and the public SegTHOR dataset, LDRNet outperforms the deep models VoxelMorph, RCN and LapIRN as well as traditional registration methods, and is markedly faster.
Conclusion: LDRNet is an efficient and accurate unsupervised method for large deformation chest CT registration and offers a new tool for clinical chest image analysis.
Abstract: Most deep-learning-based medical image registration algorithms focus on brain image registration tasks. Compared with brain registration, chest CT registration has larger deformation, more complex background and region overlap. In this paper, we propose a fast unsupervised deep learning method, LDRNet, for large deformation image registration of chest CT images. We first predict a coarse resolution registration field, then refine it from coarse to fine. We propose two innovative technical components: 1) a refine block that is used to refine the registration field in different resolutions, 2) a rigid block that is used to learn transformation matrix from high-level features. We train and evaluate our model on a private dataset and the public SegTHOR dataset. We compare our performance with state-of-the-art traditional registration methods as well as deep learning registration models VoxelMorph, RCN, and LapIRN. The results demonstrate that our model achieves state-of-the-art performance for large deformation image registration and is much faster.
[393] GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation
Xiao Liang,Yunzhu Zhang,Linchao Zhu
Main category: cs.CV
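TL;DR: This paper proposes Guided Progressive Distillation (GPD), a framework that accelerates sampling in video diffusion models, greatly reducing the number of sampling steps (for example, from 48 to 6) while maintaining high generation quality.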
Details
Motivation: Diffusion models are computationally expensive for video generation, and existing acceleration methods often degrade quality significantly.
Method: GPD uses a progressive teacher-student distillation strategy: the teacher model generates training targets online, and frequency-domain constraints in the latent space balance efficiency with detail and temporal fidelity.
Result: On the Wan2.1 model, sampling steps drop from 48 to 6 while visual quality on VBench remains competitive; compared with other distillation methods, GPD has advantages in both pipeline simplicity and quality preservation.
Conclusion: GPD is an efficient, high-quality acceleration framework for video diffusion that balances speed and generation quality through progressive guidance and frequency-domain constraints.
Abstract: Diffusion models have achieved remarkable success in video generation; however, the high computational cost of the denoising process remains a major bottleneck. Existing approaches have shown promise in reducing the number of diffusion steps, but they often suffer from significant quality degradation when applied to video generation. We propose Guided Progressive Distillation (GPD), a framework that accelerates the diffusion process for fast and high-quality video generation. GPD introduces a novel training strategy in which a teacher model progressively guides a student model to operate with larger step sizes. The framework consists of two key components: (1) an online-generated training target that reduces optimization difficulty while improving computational efficiency, and (2) frequency-domain constraints in the latent space that promote the preservation of fine-grained details and temporal dynamics. Applied to the Wan2.1 model, GPD reduces the number of sampling steps from 48 to 6 while maintaining competitive visual quality on VBench. Compared with existing distillation methods, GPD demonstrates clear advantages in both pipeline simplicity and quality preservation.
[394] Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies
Wenjin Hou,Wei Liu,Han Hu,Xiaoxiao Sun,Serena Yeung-Levy,Hehe Fan
Main category: cs.CV
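TL;DR: This paper introduces VIA-Bench, a new benchmark for evaluating the robustness of multimodal large language models (MLLMs) on visual illusions and anomalies. It shows that current mainstream MLLMs are severely fragile on visual inputs that defy common-sense priors, and that chain-of-thought reasoning brings no clear improvement.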
Details
Motivation: Existing MLLM evaluations rely mostly on standard in-distribution data and do not examine robustness in scenarios that defy human common-sense priors, such as visual illusions, which leaves a blind spot.
Method: Builds the VIA-Bench benchmark with six categories of visual illusions and anomalies (color, motion, gestalt, geometric and spatial, general visual illusions, and visual anomalies), constructs over 1,000 high-quality question-answer pairs through human review, and systematically evaluates more than 20 state-of-the-art MLLMs.
Result: All tested MLLMs degrade markedly on VIA-Bench; Chain-of-Thought reasoning fails to improve robustness and instead yields "brittle mirages" that are logically self-consistent yet wrong; machine and human perception diverge fundamentally.
Conclusion: Current MLLMs face a fundamental bottleneck in coupling low-level perception with high-level reasoning, and robustness to visual illusions is one of the key challenges on the way to artificial general intelligence.
Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable proficiency on general-purpose vision-language benchmarks, reaching or even exceeding human-level performance. However, these evaluations typically rely on standard in-distribution data, leaving the robustness of MLLMs largely unexamined when faced with scenarios that defy common-sense priors. To address this gap, we introduce VIA-Bench, a challenging benchmark designed to probe model performance on visual illusions and anomalies. It includes six core categories: color illusions, motion illusions, gestalt illusions, geometric and spatial illusions, general visual illusions, and visual anomalies. Through careful human-in-the-loop review, we construct over 1K high-quality question-answer pairs that require nuanced visual reasoning. Extensive evaluation of over 20 state-of-the-art MLLMs, including proprietary, open-source, and reasoning-enhanced models, uncovers significant vulnerabilities. Notably, we find that Chain-of-Thought (CoT) reasoning offers negligible robustness, often yielding "brittle mirages" where the model's logic collapses under illusory stimuli. Our findings reveal a fundamental divergence between machine and human perception, suggesting that resolving such perceptual bottlenecks is critical for the advancement of artificial general intelligence. The benchmark data and code will be released.
[395] Efficient Cross-Country Data Acquisition Strategy for ADAS via Street-View Imagery
Yin Wu,Daniel Slieter,Carl Esselborn,Ahmed Abouelazm,Tsung Yuan Tseng,J. Marius Zöllner
Main category: cs.CV
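TL;DR: This paper proposes a street-view-guided data acquisition strategy that uses public street-view imagery to identify places of interest (POI), mitigating the perception degradation caused by domain shift when ADAS/ADS are deployed across countries. Two POI scoring methods (KNN feature distance and visual attribution) select the data, and on traffic sign detection the approach matches random sampling while using only half of the target-domain data.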
Details
Motivation: Deploying ADAS and ADS across countries faces domain shift caused by differences in legislation, road infrastructure and visual conventions; traditional on-road collection is costly and inefficient and struggles to cover representative locations.
Method: A street-view-guided data acquisition strategy: a co-located dataset is built by pairing the Zenseact Open Dataset with Mapillary street-view images; two POI scoring methods are designed, a KNN feature distance method based on a vision foundation model and a visual-attribution method based on a vision-language model; a collect-detect protocol enables repeatable evaluation.
Result: On traffic sign detection, the approach matches the performance of random sampling while using only 50% of the target-domain data; cost estimates indicate that large-scale street-view processing is economically feasible.
Conclusion: Street-view-guided data acquisition supports efficient, low-cost cross-country model adaptation and offers a new way to address domain shift.
Abstract: Deploying ADAS and ADS across countries remains challenging due to differences in legislation, traffic infrastructure, and visual conventions, which introduce domain shifts that degrade perception performance. Traditional cross-country data collection relies on extensive on-road driving, making it costly and inefficient to identify representative locations. To address this, we propose a street-view-guided data acquisition strategy that leverages publicly available imagery to identify places of interest (POI). Two POI scoring methods are introduced: a KNN-based feature distance approach using a vision foundation model, and a visual-attribution approach using a vision-language model. To enable repeatable evaluation, we adopt a collect-detect protocol and construct a co-located dataset by pairing the Zenseact Open Dataset with Mapillary street-view images. Experiments on traffic sign detection, a task particularly sensitive to cross-country variations in sign appearance, show that our approach achieves performance comparable to random sampling while using only half of the target-domain data. We further provide cost estimations for full-country analysis, demonstrating that large-scale street-view processing remains economically feasible. These results highlight the potential of street-view-guided data acquisition for efficient and cost-effective cross-country model adaptation.
[396] SPIRIT: Adapting Vision Foundation Models for Unified Single- and Multi-Frame Infrared Small Target Detection
Qian Xu,Xi Li,Fei Gao,Jie Guo,Haojuan Yuan,Shuaipeng Fan,Mingjin Zhang
Main category: cs.CV
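TL;DR: This paper proposes SPIRIT, a framework that adapts vision foundation models (VFMs) to infrared small target detection (IRSTD) through lightweight physics-informed plug-ins, achieving unified and efficient detection in both single-frame and video modes.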
Details
Motivation: Infrared small targets have weak signals and few semantic cues and differ markedly from visible-spectrum imagery, so semantics-oriented VFMs and appearance-driven cross-frame association are unreliable when used directly, leading to background clutter and spurious associations.
Method: The SPIRIT framework works on two axes. Spatially, the PIFR module approximates rank-sparsity decomposition to suppress structured background and enhance sparse target signals. Temporally, the PGMA module injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association, supporting video detection while remaining compatible with single-frame inference.
Result: Experiments on multiple IRSTD benchmarks show that SPIRIT consistently outperforms VFM-based baselines and reaches state-of-the-art (SOTA) performance.
Conclusion: SPIRIT is a unified, VFM-compatible framework for infrared small target detection that bridges the modality gap with lightweight physics-informed plug-ins and is robust in both single-frame and video modes.
Abstract: Infrared small target detection (IRSTD) is crucial for surveillance and early-warning, with deployments spanning both single-frame analysis and video-mode tracking. A practical solution should leverage vision foundation models (VFMs) to mitigate infrared data scarcity, while adopting a memory-attention-based temporal propagation framework that unifies single- and multi-frame inference. However, infrared small targets exhibit weak radiometric signals and limited semantic cues, which differ markedly from visible-spectrum imagery. This modality gap makes direct use of semantics-oriented VFMs and appearance-driven cross-frame association unreliable for IRSTD: hierarchical feature aggregation can submerge localized target peaks, and appearance-only memory attention becomes ambiguous, leading to spurious clutter associations. To address these challenges, we propose SPIRIT, a unified and VFM-compatible framework that adapts VFMs to IRSTD via lightweight physics-informed plug-ins. Spatially, PIFR refines features by approximating rank-sparsity decomposition to suppress structured background components and enhance sparse target-like signals. Temporally, PGMA injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association, enabling robust video detection while naturally reverting to single-frame inference when temporal context is absent. Experiments on multiple IRSTD benchmarks show consistent gains over VFM-based baselines and SOTA performance.
[397] CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions
Yuliang Zhan,Jian Li,Wenbing Huang,Wenbing Huang,Yang Liu,Hao Sun
Main category: cs.CV
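TL;DR: This paper introduces the Cloth Dynamics Grounding (CDG) scenario and the Cloth Dynamics Splatting (CloDS) framework for unsupervised learning of cloth dynamics from multi-view video, using a three-stage pipeline and dual-position opacity modulation to handle large deformations and self-occlusion.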
Details
Motivation: Existing deep learning methods for simulating complex dynamic systems need known physical properties as supervision or input, which limits them under unknown conditions; this paper targets that challenge.
Method: CloDS, an unsupervised dynamics learning framework, uses a three-stage pipeline that performs video-to-geometry grounding and then trains a dynamics model on the grounded meshes. During grounding, a dual-position opacity modulation combined with mesh-based Gaussian splatting provides a bidirectional mapping between 2D observations and 3D geometry.
Result: Experiments show that CloDS learns cloth dynamics effectively from visual data and generalizes well to unseen configurations.
Conclusion: CloDS offers a new paradigm for unsupervised learning of complex soft-body dynamics, removing the dependence on prior physical knowledge and improving modeling under unknown conditions.
Abstract: Deep learning has demonstrated remarkable capabilities in simulating complex dynamic systems. However, existing methods require known physical properties as supervision or inputs, limiting their applicability under unknown conditions. To explore this challenge, we introduce Cloth Dynamics Grounding (CDG), a novel scenario for unsupervised learning of cloth dynamics from multi-view visual observations. We further propose Cloth Dynamics Splatting (CloDS), an unsupervised dynamic learning framework designed for CDG. CloDS adopts a three-stage pipeline that first performs video-to-geometry grounding and then trains a dynamics model on the grounded meshes. To cope with large non-linear deformations and severe self-occlusions during grounding, we introduce a dual-position opacity modulation that supports bidirectional mapping between 2D observations and 3D geometry via mesh-based Gaussian splatting in the video-to-geometry grounding stage. It jointly considers the absolute and relative position of Gaussian components. Comprehensive experimental evaluations demonstrate that CloDS effectively learns cloth dynamics from visual data while maintaining strong generalization capabilities for unseen configurations. Our code is available at https://github.com/whynot-zyl/CloDS. Visualization results are available at https://github.com/whynot-zyl/CloDS_video.
[398] WS-IMUBench: Can Weakly Supervised Methods from Audio, Image, and Video Be Adapted for IMU-based Temporal Action Localization?
Pei Li,Jiaxi Yin,Lei Ouyang,Shihan Pan,Ge Wang,Han Ding,Fei Wang
Main category: cs.CV
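TL;DR: This paper presents WS-IMUBench, the first systematic benchmark study of weakly supervised IMU temporal action localization (WS-IMU-TAL). Using only sequence-level labels, it evaluates the transferability and effectiveness of seven weakly supervised methods on seven IMU datasets, reveals modality dependence, performance limits and dominant failure modes, and provides a unified evaluation framework and future directions for scalable weakly supervised IMU-TAL.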
Details
Motivation: IMU activity recognition is dominated by clip classification, which cannot model the rich temporal structure of real behavior; temporal action localization (IMU-TAL) is closer to practice but is held back by costly, hard-to-scale frame-level boundary annotation.
Method: Builds the WS-IMUBench benchmark and systematically evaluates how seven classic weakly supervised temporal localization methods from audio, image and video transfer to IMU data. The study covers seven public IMU datasets, with over 3,540 training runs and 7,080 inference evaluations, and is organized around three questions on transferability, effectiveness and insights.
Result: (i) Transfer is modality-dependent, with temporal-domain methods more stable than image-derived proposal-based approaches; (ii) on datasets with longer actions and higher-dimensional sensing, weakly supervised methods can rival fully supervised performance; (iii) failures stem mainly from short actions, temporal ambiguity and poor proposal quality.
Conclusion: WS-IMUBench establishes a reproducible evaluation template for weakly supervised IMU-TAL (data, protocols and analyses) and identifies IMU-specific proposal generation, boundary-aware objectives and stronger temporal modeling as key future directions toward scalable weakly supervised IMU-TAL.
Abstract: IMU-based Human Activity Recognition (HAR) has enabled a wide range of ubiquitous computing applications, yet its dominant clip classification paradigm cannot capture the rich temporal structure of real-world behaviors. This motivates a shift toward IMU Temporal Action Localization (IMU-TAL), which predicts both action categories and their start/end times in continuous streams. However, current progress is strongly bottlenecked by the need for dense, frame-level boundary annotations, which are costly and difficult to scale. To address this bottleneck, we introduce WS-IMUBench, a systematic benchmark study of weakly supervised IMU-TAL (WS-IMU-TAL) under only sequence-level labels. Rather than proposing a new localization algorithm, we evaluate how well established weakly supervised localization paradigms from audio, image, and video transfer to IMU-TAL under only sequence-level labels. We benchmark seven representative weakly supervised methods on seven public IMU datasets, resulting in over 3,540 model training runs and 7,080 inference evaluations. Guided by three research questions on transferability, effectiveness, and insights, our findings show that (i) transfer is modality-dependent, with temporal-domain methods generally more stable than image-derived proposal-based approaches; (ii) weak supervision can be competitive on favorable datasets (e.g., with longer actions and higher-dimensional sensing); and (iii) dominant failure modes arise from short actions, temporal ambiguity, and proposal quality.
Finally, we outline concrete directions for advancing WS-IMU-TAL (e.g., IMU-specific proposal generation, boundary-aware objectives, and stronger temporal reasoning). Beyond individual results, WS-IMUBench establishes a reproducible benchmarking template (datasets, protocols, and analyses) to accelerate community-wide progress toward scalable WS-IMU-TAL.
[399] How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing
Huanyu Zhang,Xuehai Bai,Chengzu Li,Chen Liang,Haochen Tian,Haodong Li,Ruichuan An,Yifan Zhang,Anna Korhonen,Zhang Zhang,Liang Wang,Tieniu Tan
Main category: cs.CV
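TL;DR: This paper proposes VIBE, a visual instruction benchmark for evaluating how well image editing models understand visual instructions such as sketches, together with a three-level interaction hierarchy and an LMM-as-a-judge evaluation framework; experiments show that existing models, especially open-source ones, still fall well short on complex visual instruction tasks.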
Details
Motivation: Existing image editing systems rely mainly on text guidance, whereas humans often convey spatial and structural intent efficiently through visual means such as sketches, so benchmarks and evaluation methods that support visual instructions are needed.
Method: Proposes VIBE, a visual instruction benchmark with a three-level interaction hierarchy (deictic grounding, morphological manipulation, causal reasoning) and a high-quality, diverse test set; also designs an automated LMM-as-a-judge evaluation framework with task-specific metrics.
Result: An evaluation of 17 representative image editing models shows that proprietary models exhibit early-stage visual instruction-following ability and outperform open-source models overall, but every model degrades markedly as task difficulty increases.
Conclusion: Current image editing models remain weak at understanding visual instructions, with clear bottlenecks in higher-order spatial and causal reasoning, and new methods are needed.
Abstract: Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.
[400] Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection
A S M Sharifuzzaman Sagar,Mohammed Bennamoun,Farid Boussaid,Naeha Sharif,Lian Xu,Shaaban Sahmoud,Ali Kishk
Main category: cs.CV
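TL;DR: This paper systematically analyzes the role of deepfake detectors in multimodal misinformation detection and finds that detectors relying only on pixel-level features contribute little to verifying image-text claims and can even degrade fact-checking systems by introducing misleading authenticity priors; fact-checking methods grounded in external evidence and semantic understanding perform better.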
Details
Motivation: Existing deepfake detectors mainly target pixel-level manipulation, but multimodal misinformation often arises from the semantic and contextual claim jointly expressed by image and text, and the practical value of such detectors in automated fact-checking (AFC) pipelines remains unclear.
Method: Uses two complementary benchmarks, MMFakeBench and DGM4, to evaluate three kinds of approach: (1) image-only deepfake detectors; (2) an evidence-driven fact-checking system that combines tool-guided retrieval via Monte Carlo Tree Search (MCTS) with Multi-Agent Debate (MAD) inference; (3) a hybrid system that injects detector outputs as auxiliary evidence.
Result: Pixel-level detectors achieve standalone F1 of only 0.26-0.53 on MMFakeBench and 0.33-0.49 on DGM4; adding them to the fact-checking pipeline lowers F1 by 0.04-0.08; the purely evidence-driven system reaches F1 of about 0.81 on MMFakeBench and 0.55 on DGM4.
Conclusion: Multimodal claim verification depends mainly on semantic understanding and external evidence; pixel-level forgery signals do not reliably improve reasoning over real-world image-text misinformation.
Abstract: In multimodal misinformation, deception usually arises not just from pixel-level manipulations in an image, but from the semantic and contextual claim jointly expressed by the image-text pair. Yet most deepfake detectors, engineered to detect pixel-level forgeries, do not account for claim-level meaning, despite their growing integration in automated fact-checking (AFC) pipelines. This raises a central scientific and practical question: Do pixel-level detectors contribute useful signal for verifying image-text claims, or do they instead introduce misleading authenticity priors that undermine evidence-based reasoning? We provide the first systematic analysis of deepfake detectors in the context of multimodal misinformation detection. Using two complementary benchmarks, MMFakeBench and DGM4, we evaluate: (1) state-of-the-art image-only deepfake detectors, (2) an evidence-driven fact-checking system that performs tool-guided retrieval via Monte Carlo Tree Search (MCTS) and engages in deliberative inference through Multi-Agent Debate (MAD), and (3) a hybrid fact-checking system that injects detector outputs as auxiliary evidence. Results across both benchmark datasets show that deepfake detectors offer limited standalone value, achieving F1 scores in the range of 0.26-0.53 on MMFakeBench and 0.33-0.49 on DGM4, and that incorporating their predictions into fact-checking pipelines consistently reduces performance by 0.04-0.08 F1 due to non-causal authenticity assumptions. In contrast, the evidence-centric fact-checking system achieves the highest performance, reaching F1 scores of approximately 0.81 on MMFakeBench and 0.55 on DGM4. Overall, our findings demonstrate that multimodal claim verification is driven primarily by semantic understanding and external evidence, and that pixel-level artifact signals do not reliably enhance reasoning over real-world image-text misinformation.
[401] Trust but Verify: Adaptive Conditioning for Reference-Based Diffusion Super-Resolution via Implicit Reference Correlation Modeling
Yuan Wang,Yuhao Wan,Siming Zheng,Bo Li,Qibin Hou,Peng-Tao Jiang
Main category: cs.CV
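TL;DR: This paper proposes Ada-RefSR, a reference-based super-resolution method built on a single-step diffusion model, which uses Adaptive Implicit Correlation Gating (AICG) for reliability-aware fusion of reference information, improving reconstruction quality and robustness.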
Details
Motivation: Under real-world degradations, existing reference-based super-resolution methods struggle to establish reliable correspondences between the low-quality input and the reference image, so reference information is either misused or under-utilized.
Method: Proposes the Ada-RefSR framework, whose core is Adaptive Implicit Correlation Gating (AICG): learnable summary tokens distill dominant reference patterns and implicitly model their correlation with low-quality features, and the gate is embedded in the attention backbone to give lightweight, adaptive control over reference guidance.
Result: Experiments on multiple datasets show that Ada-RefSR strikes a strong balance of fidelity, naturalness, and efficiency and stays robust as reference alignment varies.
Conclusion: Through the "Trust but Verify" principle and the AICG mechanism, Ada-RefSR addresses unreliable fusion of reference information in RefSR and offers a new approach to controllable guidance for diffusion models in image restoration.
Abstract: Recent works have explored reference-based super-resolution (RefSR) to mitigate hallucinations in diffusion-based image restoration. A key challenge is that real-world degradations make correspondences between low-quality (LQ) inputs and reference (Ref) images unreliable, requiring adaptive control of reference usage. Existing methods either ignore LQ-Ref correlations or rely on brittle explicit matching, leading to over-reliance on misleading references or under-utilization of valuable cues. To address this, we propose Ada-RefSR, a single-step diffusion framework guided by a "Trust but Verify" principle: reference information is leveraged when reliable and suppressed otherwise. Its core component, Adaptive Implicit Correlation Gating (AICG), employs learnable summary tokens to distill dominant reference patterns and capture implicit correlations with LQ features. Integrated into the attention backbone, AICG provides lightweight, adaptive regulation of reference guidance, serving as a built-in safeguard against erroneous fusion. Experiments on multiple datasets demonstrate that Ada-RefSR achieves a strong balance of fidelity, naturalness, and efficiency, while remaining robust under varying reference alignment.
[402] ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding
Ye Chen,Yupeng Zhu,Xiongzhen Zhang,Zhewen Wan,Yingzhe Li,Wenjun Zhang,Bingbing Ni
Main category: cs.CV
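TL;DR: This paper proposes a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes, enabling efficient, controllable image and video editing and physics-driven animation.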
Details
Motivation: Existing image representations (raster images, Gaussian primitives, latent images) either carry redundancy or lack a direct mapping from semantic instances or parts to latent variables, which makes editing difficult and control poor.
Method: Starting from a semantic-aware image decomposition, builds hierarchical proxy geometry through adaptive Bezier fitting and iterative region subdivision and meshing; embeds multi-scale implicit texture parameters in geometry-aware distributed proxy nodes; and introduces a locality-adaptive feature indexing mechanism to keep spatial texture coherent.
Result: Achieves SOTA rendering fidelity on ImageNet, OIR-Bench, and HumanEdit with significantly fewer parameters; supports intuitive, interactive, physically plausible editing; combined with Position-Based Dynamics, enables real-time physics-driven animation with lightweight implicit rendering, with better temporal consistency and visual realism than generative approaches.
Conclusion: The hierarchical proxy-based representation combines compactness, semantic interpretability, and physical editability, offering a new paradigm for controllable image and video editing and animation.
Abstract: Prevailing image representation methods, including explicit representations such as raster images and Gaussian primitives, as well as implicit representations such as latent images, either suffer from representation redundancy that leads to heavy manual editing effort, or lack a direct mapping from latent variables to semantic instances or parts, making fine-grained manipulation difficult. These limitations hinder efficient and controllable image and video editing. To address these issues, we propose a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent and manipulable parameter spaces. Based on a semantic-aware decomposition of the input image, our representation constructs hierarchical proxy geometries through adaptive Bezier fitting and iterative internal region subdivision and meshing. Multi-scale implicit texture parameters are embedded into the resulting geometry-aware distributed proxy nodes, enabling continuous high-fidelity reconstruction in the pixel domain and instance- or part-independent semantic editing. In addition, we introduce a locality-adaptive feature indexing mechanism to ensure spatial texture coherence, which further supports high-quality background completion without relying on generative models. Extensive experiments on image reconstruction and editing benchmarks, including ImageNet, OIR-Bench, and HumanEdit, demonstrate that our method achieves state-of-the-art rendering fidelity with significantly fewer parameters, while enabling intuitive, interactive, and physically plausible manipulation. Moreover, by integrating proxy nodes with Position-Based Dynamics, our framework supports real-time physics-driven animation using lightweight implicit rendering, achieving superior temporal consistency and visual realism compared with generative approaches.
[403] Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model
Jiedong Zhuang,Lu Lu,Ming Dai,Rui Hu,Jian Chen,Qiang Liu,Haoji Hu
Main category: cs.CV
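TL;DR: This paper proposes Lazy Attention, which shares similar attention patterns across layers to reduce redundant computation and KV cache overhead in multimodal large language model (MLLM) inference, substantially raising throughput while preserving accuracy.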
Details
Motivation: MLLM inference is costly and bottlenecked by the KV cache because of the large number of visual tokens; existing token-wise pruning methods tend to compromise KV cache integrity and hurt long-text generation.
Method: Observing that attention in more than half of the decode layers is semantically similar, proposes Lazy Attention: a lightweight layer-shared Q Cache reuses queries (Q) across adjacent layers, is compatible with Flash Attention and the KV cache, and can be used alone or together with token pruning.
Result: On multiple benchmarks, KV cache usage drops by over 35% and throughput improves 1.5x at a cost of about 1% performance; accuracy is preserved better than with SOTA token-wise methods.
Conclusion: Lazy Attention is an efficient, compatible, and orthogonal attention optimization that eases the MLLM inference bottleneck while balancing efficiency and accuracy.
Abstract: Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engender a substantial computational load and a key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, leveraging diverse intricate token pruning techniques to eliminate non-crucial visual tokens. Nevertheless, these methods often unavoidably undermine the integrity of the KV cache, resulting in failures in long-text generation tasks. To this end, we conduct an in-depth investigation into the attention mechanism of the model from a new perspective, and discern that attention within more than half of all decode layers is semantically similar. Upon this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. Consequently, we propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns. It ingeniously reduces layer-wise redundant computation in attention. In Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which facilitates the reuse of queries across adjacent layers. In particular, Q Cache is lightweight and fully compatible with existing inference frameworks, including Flash Attention and KV cache. Additionally, our method is highly flexible as it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches.
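As a rough illustration of the query-reuse idea described above, the toy sketch below runs one decode step through a stack of single-head attention layers and lets selected "lazy" layers reuse the query cached by the nearest preceding non-lazy layer instead of projecting their own. Everything here is an assumption made for illustration: the names `decode_step`, `lazy_layers`, and `q_cache`, the residual update, and the choice of which layers are lazy are all hypothetical and are not taken from the paper, whose actual mechanism is not specified in the abstract.

```python
import math

def matvec(weight, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])
    return [sum(w * val[d] for w, val in zip(weights, values)) for d in range(len(values[0]))]

def decode_step(hidden, layers, lazy_layers):
    """Run one decode step; layers in lazy_layers reuse the cached query (hypothetical scheme)."""
    q_cache = None          # query shared across adjacent layers
    q_projections = 0       # how many query projections were actually computed
    for idx, layer in enumerate(layers):
        if idx in lazy_layers and q_cache is not None:
            query = q_cache                      # skip the projection, reuse the cached query
        else:
            query = matvec(layer["w_q"], hidden)
            q_cache = query
            q_projections += 1
        context = attend(query, layer["keys"], layer["values"])
        hidden = [h + c for h, c in zip(hidden, context)]  # simple residual update
    return hidden, q_projections

identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [{"w_q": identity, "keys": identity, "values": identity} for _ in range(4)]
_, dense_count = decode_step([1.0, 0.0], layers, lazy_layers=set())
_, lazy_count = decode_step([1.0, 0.0], layers, lazy_layers={1, 3})
print(dense_count, lazy_count)  # prints: 4 2
```

In this toy setting, marking every other layer as lazy halves the number of query projections; it says nothing about the accuracy or cache savings reported in the abstract.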
Empirical evaluations on multiple benchmarks demonstrate that our method can reduce KV cache usage by over 35% and achieve 1.5x throughput improvement, while sacrificing only approximately 1% of performance on various MLLMs. Compared with SOTA token-wise methods, our technique achieves superior accuracy preservation.
[404] Learning Sparse Visual Representations via Spatial-Semantic Factorization
Theodore Zhengde Zhao,Sid Kiblawi,Jianwei Yang,Naoto Usuyama,Reuben Tan,Noel C Codella,Tristan Naumann,Hoifung Poon,Mu Wei
Main category: cs.CV
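TL;DR: STELLAR proposes a factorized visual feature representation that decouples semantic concepts from their spatial distributions, supporting both high-quality image reconstruction and strong semantic understanding and resolving the fundamental conflict between the two in self-supervised learning.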
Details
Motivation: In self-supervised learning, high-level semantic methods (e.g., DINO) discard spatial information in pursuit of augmentation alignment and so cannot reconstruct images, while generative methods (e.g., MAE) keep spatial structure but lack high-level semantic abstraction; the two are fundamentally at odds.
Method: Proposes the STELLAR framework, which factorizes visual features into a low-rank product of semantic concept vectors and a spatial distribution matrix, so the semantic tokens can be used for DINO-style augmentation alignment while the spatial matrix supports pixel-level reconstruction.
Result: With only 16 sparse tokens, achieves high-quality reconstruction (2.60 FID) and 79.10% ImageNet classification accuracy, on par with dense backbones.
Conclusion: By strategically separating semantic identity from spatial geometry, STELLAR provides a versatile sparse representation that bridges discriminative and generative visual learning.
Abstract: Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative SSL (e.g., MAE) preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy). Our results highlight STELLAR as a versatile sparse representation that bridges the gap between discriminative and generative vision by strategically separating semantic identity from spatial geometry. Code available at https://aka.ms/stellar.
[405] DSXFormer: Dual-Pooling Spectral Squeeze-Expansion and Dynamic Context Attention Transformer for Hyperspectral Image Classification
Farhan Ullah,Irfan Ullah,Khalil Khan,Giovanni Pau,JaKeoung Koo
Main category: cs.CV
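TL;DR: This paper proposes DSXFormer, a dual-pooling spectral squeeze-expansion transformer for hyperspectral image classification (HSIC); its Dual-Pooling Spectral Squeeze-Expansion (DSX) block and Dynamic Context Attention (DCA) mechanism improve spectral discriminability while reducing computational overhead, and it achieves SOTA performance on several benchmark datasets.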
Details
Motivation: Existing transformer-based HSIC methods struggle to combine sufficient spectral discriminability with computational efficiency, and face high spectral dimensionality, complex spectral-spatial correlations, and scarce labeled samples.
Method: Proposes DSXFormer, which contains a Dual-Pooling Spectral Squeeze-Expansion (DSX) block that combines global average and max pooling to adaptively recalibrate spectral channels, and a Dynamic Context Attention (DCA) mechanism embedded in a window-based transformer to dynamically model local spectral-spatial relationships, supported by a multi-scale patch processing strategy.
Result: Reaches classification accuracies of 99.95%, 98.91%, 99.85%, and 98.52% on the Salinas, Indian Pines, Pavia University, and Kennedy Space Center benchmarks, respectively, outperforming existing SOTA methods.
Conclusion: By jointly optimizing spectral emphasis and spatial contextual representation, DSXFormer improves HSIC performance while remaining computationally efficient, supporting the effectiveness and generality of its design.
Abstract: Hyperspectral image classification (HSIC) is a challenging task due to high spectral dimensionality, complex spectral-spatial correlations, and limited labeled training samples. Although transformer-based models have shown strong potential for HSIC, existing approaches often struggle to achieve sufficient spectral discriminability while maintaining computational efficiency. To address these limitations, we propose DSXFormer, a novel dual-pooling spectral squeeze-expansion transformer with Dynamic Context Attention for HSIC. The proposed DSXFormer introduces a Dual-Pooling Spectral Squeeze-Expansion (DSX) block, which exploits complementary global average and max pooling to adaptively recalibrate spectral feature channels, thereby enhancing spectral discriminability and inter-band dependency modeling. In addition, DSXFormer incorporates a Dynamic Context Attention (DCA) mechanism within a window-based transformer architecture to dynamically capture local spectral-spatial relationships while significantly reducing computational overhead. The joint integration of spectral dual-pooling squeeze-expansion and DCA enables DSXFormer to achieve an effective balance between spectral emphasis and spatial contextual representation. Furthermore, patch extraction, embedding, and patch merging strategies are employed to facilitate efficient multi-scale feature learning.
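To make the dual-pooling recalibration idea concrete, here is a minimal sketch assuming a design similar to squeeze-and-excitation or CBAM channel attention: each spectral band is summarized by its global average and global maximum, both summaries pass through a shared squeeze-then-expand bottleneck, and the summed result is turned into per-band gates by a sigmoid. The function names, the shared bottleneck, the ReLU, and the weights are all hypothetical; the abstract does not give the DSX block's actual layer design.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bottleneck(vec, w_squeeze, w_expand):
    """Squeeze to a smaller hidden size with ReLU, then expand back to one value per band."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w_squeeze]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_expand]

def dual_pool_recalibrate(bands, w_squeeze, w_expand):
    """Rescale each spectral band by a gate built from its average- and max-pooled summaries."""
    avg_desc = [sum(band) / len(band) for band in bands]   # global average pooling per band
    max_desc = [max(band) for band in bands]               # global max pooling per band
    avg_out = bottleneck(avg_desc, w_squeeze, w_expand)
    max_out = bottleneck(max_desc, w_squeeze, w_expand)
    gates = [sigmoid(a + m) for a, m in zip(avg_out, max_out)]
    scaled = [[g * p for p in band] for g, band in zip(gates, bands)]
    return scaled, gates

# Four spectral bands with two pixels each; weights are arbitrary illustrative values.
bands = [[0.2, 0.4], [1.0, 3.0], [0.0, 0.5], [2.0, 2.0]]
w_squeeze = [[0.5, -0.25, 0.0, 0.1], [0.0, 0.3, -0.2, 0.4]]     # 2 x 4: squeeze 4 bands to 2
w_expand = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, 0.5]]    # 4 x 2: expand back to 4 bands
scaled, gates = dual_pool_recalibrate(bands, w_squeeze, w_expand)
print([round(g, 3) for g in gates])
```

Each gate lies strictly between 0 and 1, so bands are attenuated rather than amplified; in a trained model the bottleneck weights would be learned.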
Extensive experiments conducted on four widely used hyperspectral benchmark datasets, including Salinas (SA), Indian Pines (IP), Pavia University (PU), and Kennedy Space Center (KSC), demonstrate that DSXFormer consistently outperforms state-of-the-art methods, achieving classification accuracies of 99.95%, 98.91%, 99.85%, and 98.52%, respectively.
[406] Enabling Progressive Whole-slide Image Analysis with Multi-scale Pyramidal Network
Shuyang Wu,Yifu Qiu,Ines P. Nearchou,Sandrine Prost,Jonathan A Fallowfield,Hakan Bilen,Timothy J Kendall
Main category: cs.CV
TL;DR: 本文提出了一种即插即用的多尺度金字塔网络(MSPN),用于增强基于注意力机制的多实例学习(MIL)在计算病理学中的性能,通过网格重映射和粗粒度引导网络实现高效、轻量的多尺度特征融合。
Details
Motivation: Existing multi-scale patch-based MIL methods depend on inputs at several fixed magnifications with late feature fusion, which loses cross-scale feature links, is inflexible, and is computationally expensive.
Method: Proposes MSPN, comprising (1) a grid-based remapping module that derives coarse features from high-magnification features and (2) a coarse guidance network (CGN) that learns coarse context; it plugs into various attention-based MIL frameworks.
Result: Evaluated on 4 clinically relevant tasks with 3 types of foundation model and a pre-trained MIL framework, MSPN consistently improves performance across configurations while staying lightweight and easy to use.
Conclusion: MSPN is a flexible, efficient, plug-and-play approach to multi-scale modeling that improves MIL on computational pathology tasks.
Abstract: Multiple-instance Learning (MIL) is commonly used to undertake computational pathology (CPath) tasks, and the use of multi-scale patches allows diverse features across scales to be learned. Previous studies using multi-scale features in clinical applications rely on multiple inputs across magnifications with late feature fusion. This does not retain the link between features across scales, and the inputs depend on arbitrary, manufacturer-defined magnifications, making these approaches inflexible and computationally expensive. In this paper, we propose the Multi-scale Pyramidal Network (MSPN), a plug-and-play module for attention-based MIL that introduces progressive multi-scale analysis of whole-slide images (WSIs). Our MSPN consists of (1) grid-based remapping that uses high magnification features to derive coarse features and (2) the coarse guidance network (CGN) that learns coarse contexts. We benchmark MSPN as an add-on module to 4 attention-based frameworks using 4 clinically relevant tasks across 3 types of foundation model, as well as the pre-trained MIL framework. We show that MSPN consistently improves MIL across the compared configurations and tasks, while being lightweight and easy-to-use.
[407] Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images
Shuai Yang,Ziyue Huang,Jiaxin Chen,Qingjie Liu,Yunhong Wang
Main category: cs.CV
TL;DR: 本文提出RS-MPOD框架,通过引入基于实例的视觉提示、文本提示及其多模态融合,提升遥感图像开放词汇目标检测中类别指定的鲁棒性与灵活性。
Details
Motivation: In open-vocabulary object detection for remote sensing images, relying on text prompts alone often makes category specification unstable because of task-specific semantics and distribution shift.
Method: Proposes the RS-MPOD framework, which includes a visual prompt encoder that extracts appearance cues from exemplar instances and a multimodal fusion module that uses visual and textual prompts jointly.
Result: Experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting is more reliable under semantic ambiguity and distribution shift, while multimodal prompting remains competitive when textual semantics are well aligned.
Conclusion: Prompting through both visual and textual paths makes open-vocabulary remote sensing detection more robust and adaptable, overcoming the limits of text-only prompting.
Abstract: Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal fusion module to integrate visual and textual information when both modalities are available. Extensive experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting yields more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting provides a flexible alternative that remains competitive when textual semantics are well aligned.
[408] Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated
Muli Yang,Gabriel James Goenawan,Henan Wang,Huaiyuan Qin,Chenghao Xu,Yanhua Yang,Fen Fang,Ying Sun,Joo-Hwee Lim,Hongyuan Zhu
Main category: cs.CV
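TL;DR: This paper proposes a post-hoc calibration framework based on Bayesian decision theory that mitigates the systematic bias AI-generated image detectors show under test-time distribution shift (such as misclassifying fake images as real); it calibrates the logits with only a small set of target-domain validation samples and requires no retraining or ground-truth labels.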
Details
Motivation: Although trained on balanced datasets, existing AI-generated image detectors often systematically misclassify fake images as real at test time, because of distribution shift in fake samples and implicit priors in the model; the root causes are overfitting to artifacts that do not generalize and a misaligned decision threshold.
Method: Proposes a theoretically grounded post-hoc calibration method: with the backbone frozen, a learnable scalar parameter corrects the model's logits; the parameter is optimized on a small validation set from the target distribution, does not depend on ground-truth labels, and realigns the decision boundary based on Bayesian decision theory.
Result: Substantially improves detection robustness on several challenging benchmarks without retraining, is computationally lightweight, and suits open-world settings; the code is open source.
Conclusion: The method is a principled, practical, and lightweight solution to test-time distribution shift that makes AI-generated image detectors more reliable and adaptive.
Abstract: Despite being trained on balanced datasets, existing AI-generated image detectors often exhibit systematic bias at test time, frequently misclassifying fake images as real. We hypothesize that this behavior stems from distributional shift in fake samples and implicit priors learned during training. Specifically, models tend to overfit to superficial artifacts that do not generalize well across different generation methods, leading to a misaligned decision threshold when faced with test-time distribution shift. To address this, we propose a theoretically grounded post-hoc calibration framework based on Bayesian decision theory. In particular, we introduce a learnable scalar correction to the model's logits, optimized on a small validation set from the target distribution while keeping the backbone frozen. This parametric adjustment compensates for distributional shift in model output, realigning the decision boundary even without requiring ground-truth labels. Experiments on challenging benchmarks show that our approach significantly improves robustness without retraining, offering a lightweight and principled solution for reliable and adaptive AI-generated image detection in the open world. Code is available at https://github.com/muliyangm/AIGI-Det-Calib.
[409] Enhancing Multi-Image Understanding through Delimiter Token Scaling
Minyoung Lee,Yeji Park,Dongjun Hwang,Yejin Kim,Seong Joon Oh,Junsuk Choe
Main category: cs.CV
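TL;DR: This paper proposes scaling the hidden states of delimiter tokens to mitigate cross-image information leakage in large vision-language models (LVLMs) with multi-image input; the method adds no training or inference cost and clearly improves performance on multi-image and multi-document understanding tasks.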
Details
Motivation: Existing LVLMs perform worse on multi-image input, mainly because of cross-image information leakage, and the delimiter tokens currently in use fail to prevent it.
Method: Proposes scaling the hidden states of delimiter tokens to strengthen intra-image interaction and suppress cross-image interaction, improving the model's ability to recognize image boundaries.
Result: Improves performance on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2, and on multi-document and multi-table tasks such as TQABench, MultiNews, and WCEP-10, without adding training or inference cost.
Conclusion: Scaling delimiter hidden states is a simple, efficient, plug-and-play way to reduce cross-image information leakage in LVLMs and improve both multimodal and text-only multi-source understanding.
Abstract: Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major reason is cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This enhances the model's ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model is better able to distinguish between images and reason over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clear distinction. The method improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training or inference cost.
[410] Leveraging Latent Vector Prediction for Localized Control in Image Generation via Diffusion Models
Pablo Domingo-Gregorio,Javier Ruiz-Hidalgo
Main category: cs.CV
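TL;DR: This paper proposes a method that gives users precise local control over specific regions of an image while the diffusion model generates the rest autonomously; it introduces masking features and an additional loss term to strengthen the correspondence between each diffusion step and the final sample in latent space.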
Details
Motivation: Existing methods based on text and image-level control struggle to provide fine-grained local control and depend on an inefficient trial-and-error process.
Method: Proposes a new training framework that combines masking features with an additional loss term; the loss uses the prediction of the initial latent vector at any diffusion step to improve consistency between the current step and the final sample in latent space.
Result: Experiments show that the method synthesizes high-quality images and supports local conditional control over user-defined regions.
Conclusion: The method makes diffusion models more flexible and precise for locally controllable image generation and offers a new approach to fine-grained editing.
Abstract: Diffusion models have emerged as a leading approach in text-to-image generation, producing high-quality images from textual descriptions. However, attempting to achieve detailed control to get a desired image solely through text remains a laborious trial-and-error endeavor. Recent methods have introduced image-level controls alongside text prompts, using prior images to extract conditional information such as edges, segmentation and depth maps. While effective, these methods apply conditions uniformly across the entire image, limiting localized control. In this paper, we propose a novel methodology to enable precise local control over user-defined regions of an image, while leaving to the diffusion model the task of autonomously generating the remaining areas according to the original prompt. Our approach introduces a new training framework that incorporates masking features and an additional loss term, which leverages the prediction of the initial latent vector at any diffusion step to enhance the correspondence between the current step and the final sample in the latent space. Extensive experiments demonstrate that our method effectively synthesizes high-quality images with controlled local conditions.
[411] SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors
Bing He,Jingnan Gao,Yunuo Chen,Ning Cao,Gang Chen,Zhengxue Cheng,Li Song,Wenjun Zhang
Main category: cs.CV
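TL;DR: This paper proposes SurfSplat, a feedforward framework based on 2D Gaussian Splatting (2DGS) that uses a surface continuity prior and a forced alpha blending strategy to improve geometric accuracy and texture fidelity in 3D reconstruction from sparse images, and introduces a new metric, HRRC, for assessing high-resolution reconstruction quality.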
Details
Motivation: Existing generalizable methods based on 3D Gaussian Splatting struggle to produce continuous surfaces and tend to yield discrete, color-biased point clouds that show severe artifacts in close-up views.
Method: Proposes SurfSplat, a framework based on 2D Gaussian Splatting (2DGS) that combines a surface continuity prior with a forced alpha blending strategy, and designs the High-Resolution Rendering Consistency (HRRC) evaluation metric.
Result: On RealEstate10K, DL3DV, and ScanNet, SurfSplat clearly outperforms prior methods on both standard metrics and HRRC.
Conclusion: SurfSplat offers a robust, efficient solution for high-fidelity 3D reconstruction from sparse images.
Abstract: Reconstructing 3D scenes from sparse images remains a challenging task due to the difficulty of recovering accurate geometry and texture without optimization. Recent approaches leverage generalizable models to generate 3D scenes using 3D Gaussian Splatting (3DGS) primitives. However, they often fail to produce continuous surfaces and instead yield discrete, color-biased point clouds that appear plausible at normal resolution but reveal severe artifacts under close-up views. To address this issue, we present SurfSplat, a feedforward framework based on 2D Gaussian Splatting (2DGS) primitives, which provide stronger anisotropy and higher geometric precision. By incorporating a surface continuity prior and a forced alpha blending strategy, SurfSplat reconstructs coherent geometry together with faithful textures. Furthermore, we introduce High-Resolution Rendering Consistency (HRRC), a new evaluation metric designed to evaluate high-resolution reconstruction quality. Extensive experiments on RealEstate10K, DL3DV, and ScanNet demonstrate that SurfSplat consistently outperforms prior methods on both standard metrics and HRRC, establishing a robust solution for high-fidelity 3D reconstruction from sparse inputs. Project page: https://hebing-sjtu.github.io/SurfSplat-website/
[412] UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving
Guosheng Zhao,Yaozeng Wang,Xiaofeng Wang,Zheng Zhu,Tingdong Yu,Guan Huang,Yongchen Zai,Ji Jiao,Changliang Xue,Xiaole Wang,Zhen Yang,Futang Zhu,Xingang Wang
Main category: cs.CV
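TL;DR: This paper proposes UniDriveDreamer, a single-stage unified multimodal world model that directly generates multimodal future observations (video and LiDAR) for autonomous driving without intermediate representations or cascaded modules; it aligns the modality latent spaces using LiDAR- and video-specific VAEs and Unified Latent Anchoring (ULA), models geometric and temporal dynamics with a diffusion transformer guided by scene layout conditions, and surpasses SOTA on both video and LiDAR generation.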
Details
Motivation: Most existing world models are limited to single-modality generation (video only or LiDAR only) and cannot fully exploit complementary multimodal information; many also use cascades or intermediate representations, which leads to error accumulation and unstable training.
Method: Proposes UniDriveDreamer: (1) separate LiDAR-specific and video-specific VAEs for encoding; (2) Unified Latent Anchoring (ULA) to explicitly align the latent distributions of the two modalities; (3) fusion of the aligned features, which are fed to a diffusion transformer that jointly models geometric correspondence and temporal evolution; (4) structured scene layout projected per modality as a conditioning signal to guide generation.
Result: Clearly outperforms previous SOTA methods on both video and LiDAR generation and brings measurable gains on downstream tasks.
Conclusion: UniDriveDreamer demonstrates the feasibility and advantages of single-stage, end-to-end multimodal world modeling; ULA and modality-aware conditioning are key to cross-modal consistency and generation quality.
Abstract: World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single-modality generation, typically focusing on either multi-camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving, which directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR-specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi-camera images. To ensure cross-modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer that jointly models their geometric correspondence and temporal evolution. Additionally, structured scene layout information is projected per modality as a conditioning signal to guide the synthesis. Extensive experiments demonstrate that UniDriveDreamer outperforms previous state-of-the-art methods in both video and LiDAR generation, while also yielding measurable improvements in downstream
[413] ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning
Gongli Xi,Kun Wang,Zeming Gao,Huahui Yi,Haolang Lu,Ye Tian,Wendong Wang
Main category: cs.CV
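TL;DR: This paper proposes ClueTracer, which suppresses hallucinations in large multimodal reasoning models by tracing how key visual clues propagate along the reasoning pathway, improving a range of reasoning architectures without any training.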
Details
Motivation: Existing large models are prone to hallucination in multimodal reasoning, mainly because of "reasoning drift": the model over-focuses on visual entities irrelevant to the question, so its reasoning detaches from the image evidence; existing intervention methods struggle to locate the true clues in reasoning settings.
Method: Proposes the ClueRecall metric and the ClueTracer plugin. ClueTracer requires no training, adds no parameters, and is architecture-agnostic; starting from the question, it traces how key clues propagate along the pathway question → outputs → visual tokens, localizing task-relevant image patches and suppressing attention to irrelevant regions.
Result: Without any additional training, ClueTracer improves reasoning architectures (such as R1-OneVision and Ocean-R1) by 1.21x on reasoning benchmarks; transferred to non-reasoning settings, it still yields a 1.14x gain.
Conclusion: Reasoning drift is a key cause of hallucination in large multimodal models; ClueTracer offers a general, lightweight, and effective training-free way to suppress it, improving visual clue retrieval and reasoning robustness.
Abstract: Large multimodal reasoning models solve challenging visual problems via explicit long-chain inference: they gather visual clues from images and decode clues into textual tokens. Yet this capability also increases hallucinations, where the model generates content that is not supported by the input image or the question. To understand this failure mode, we identify reasoning drift: during clue gathering, the model over-focuses on question-irrelevant entities, diluting focus on task-relevant cues and gradually decoupling the reasoning trace from visual grounding. As a consequence, many inference-time localization or intervention methods developed for non-reasoning models fail to pinpoint the true clues in reasoning settings. Motivated by these insights, we introduce ClueRecall, a metric for assessing visual clue retrieval, and present ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. ClueTracer starts from the question and traces how key clues propagate along the model's reasoning pathway (question → outputs → visual tokens), thereby localizing task-relevant patches while suppressing spurious attention to irrelevant regions. Remarkably, without any additional training, ClueTracer improves all reasoning architectures (including R1-OneVision, Ocean-R1, MM-Eureka, etc.) by 1.21x on reasoning benchmarks. When transferred to non-reasoning settings, it yields a 1.14x gain.
[414] One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation
Shuo Lu,Haohan Wang,Wei Feng,Weizhen Wang,Shen Zhang,Yaoyu Li,Ao Ma,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Bing Zhan,Yuan Xu,Huizai Yao,Yongcan Yu,Chenyang Si,Jian Liang
Main category: cs.CV
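TL;DR: This paper proposes the OSMF framework, which combines product-aware adaptive grouping and preference-conditioned image generation with Group-DPO fine-tuning to optimize group-level click-through rate in advertising image generation.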
Details
Motivation: Existing advertising image generation methods follow a "one-size-fits-all" strategy that ignores preference differences between user groups, which hurts performance for specific groups and limits targeted marketing.
Method: Proposes the unified OSMF framework: (1) product-aware adaptive grouping, which dynamically forms user groups and extracts collective preference features; (2) preference-conditioned image generation with a Group-aware Multimodal LLM (G-MLLM); (3) group-level preference alignment through Group-DPO fine-tuning; (4) GAIP, the first large-scale dataset of group-wise advertising image preferences, with about 600K groups.
Result: Achieves SOTA performance in both offline and online experiments; GAIP is the first large-scale public dataset of group-wise image preferences; code and datasets will be released.
Conclusion: OSMF addresses the problem of modeling diverse group preferences in advertising image generation, raises CTR across user groups, and advances generative advertising for segmented audiences.
Abstract: Advertising image generation has increasingly focused on online metrics like Click-Through Rate (CTR), yet existing approaches adopt a "one-size-fits-all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present One Size, Many Fits (OSMF), a unified framework that aligns diverse group-wise click preferences in large-scale advertising image generation. OSMF begins with product-aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference-conditioned image generation employs a Group-aware Multimodal Large Language Model (G-MLLM) to generate tailored images for each group. The G-MLLM is pre-trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine-tune the G-MLLM using our proposed Group-DPO for group-wise preference alignment, which effectively enhances each group's CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large-scale public dataset of group-wise image preferences, including around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves state-of-the-art performance in both offline and online settings. Our code and datasets will be released at https://github.com/JD-GenX/OSMF.
[415] Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models
Cristian Sbrolli,Matteo Matteucci,Toshihiko Yamasaki
Main category: cs.CV
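TL;DR: This paper proposes Auto-Comp, an automated synthetic benchmark generation framework for fine-grained evaluation of vision-language models (VLMs) on compositional reasoning tasks such as color binding and spatial relations. It finds that current mainstream VLMs commonly suffer from attribute confusion and susceptibility to low-entropy distractors, and it reveals a trade-off in which contextual information helps spatial reasoning but harms local attribute binding.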
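The attribute-swap probe mentioned in this entry can be illustrated with a minimal sketch. It is not the Auto-Comp pipeline itself: the helper names `minimal_caption` and `swapped_negative` are hypothetical, and the sketch only builds a caption and its color-swapped hard negative in the style of the abstract's "a red cube and a blue sphere" example.

```python
def minimal_caption(pairs):
    """Build a Minimal-style caption from (color, object) pairs."""
    return " and ".join(f"a {color} {obj}" for color, obj in pairs)


def swapped_negative(pairs):
    """Swap the colors of the two objects to form a hard negative caption."""
    (c1, o1), (c2, o2) = pairs
    return minimal_caption([(c2, o1), (c1, o2)])


pairs = [("red", "cube"), ("blue", "sphere")]
positive = minimal_caption(pairs)
negative = swapped_negative(pairs)

# Both captions contain exactly the same words, so a pure bag-of-words
# representation cannot tell them apart.
same_bag = sorted(positive.split()) == sorted(negative.split())
```

A model that binds colors to objects correctly should score the matching image higher for `positive` than for `negative`, even though `same_bag` is true.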
Details
Motivation: Modern vision-language models (VLMs) have serious deficiencies in compositional reasoning (such as color-object binding and spatial relation understanding), but the visual and linguistic roots of these failures are hard to separate; controllable, scalable, fine-grained benchmarks are needed to diagnose the problem systematically.
Method: Proposes Auto-Comp, a fully automated, controllable synthetic benchmark generation pipeline. It generates paired images from Minimal captions (stripped-down semantics) and LLM-generated Contextual captions (context-rich), building an A/B test that decouples core binding ability from multimodal complexity. Three new benchmarks are designed: Color Binding, Spatial Relations, and the Confusion Benchmark.
Result: Validated on 20 VLMs (including the CLIP and SigLIP families): compositional reasoning failures are widespread; the "Confusion Benchmark" shows that models are easily misled by low-entropy distractors such as repeated objects or colors; an unexpected trade-off is found in which context strengthens spatial reasoning but weakens color binding.
Conclusion: Compositional reasoning deficits are a systematic weakness of VLMs and cannot be attributed to bag-of-words limitations alone; Auto-Comp offers a new paradigm for interpretable, controllable VLM evaluation, and all tools and benchmarks are open-sourced to promote research.
Abstract: Modern Vision-Language Models (VLMs) exhibit a critical flaw in compositional reasoning, often confusing "a red cube and a blue sphere" with "a blue cube and a red sphere". Disentangling the visual and linguistic roots of these failures is a fundamental challenge for robust evaluation. To enable fine-grained, controllable analysis, we introduce Auto-Comp, a fully automated and synthetic pipeline for generating scalable benchmarks. Its controllable nature is key to dissecting and isolating different reasoning skills. Auto-Comp generates paired images from Minimal (e.g., "a monitor to the left of a bicycle on a white background") and LLM-generated Contextual captions (e.g., "In a brightly lit photography studio, a monitor is positioned to the left of a bicycle"), allowing a controlled A/B test to disentangle core binding ability from visio-linguistic complexity. Our evaluation of 20 VLMs on novel benchmarks for color binding and spatial relations reveals universal compositional failures in both CLIP and SigLIP model families. Crucially, our novel "Confusion Benchmark" reveals a deeper flaw beyond simple attribute swaps: models are highly susceptible to low-entropy distractors (e.g., repeated objects or colors), demonstrating their compositional failures extend beyond known bag-of-words limitations. We uncover a surprising trade-off: visio-linguistic context, which provides global scene cues, aids spatial reasoning but simultaneously hinders local attribute binding by introducing visual clutter. We release the Auto-Comp pipeline to facilitate future benchmark creation, alongside all our generated benchmarks (https://huggingface.co/AutoComp).
[416] Multi-View Stenosis Classification Leveraging Transformer-Based Multiple-Instance Learning Using Real-World Clinical Data
Nikola Cenikj,Özgün Turgut,Alexander Müller,Alexander Steger,Jan Kehrer,Marcus Brugger,Daniel Rueckert,Eimo Martens,Philip Müller
Main category: cs.CV
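TL;DR: This paper proposes SegmentMIL, a transformer-based multi-view multiple-instance learning framework that needs only patient-level labels to classify and localize coronary artery stenosis, without costly view-level annotations, and performs well in internal and external evaluations.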
Details
Motivation: Existing single-view deep learning models depend on hard-to-obtain view-level annotations and cannot model temporal dynamics and dependencies across views, whereas clinical diagnosis requires integrating information from multiple views.
Method: Proposes SegmentMIL, a transformer-based multi-view multiple-instance learning (MIL) framework trained end to end with patient-level supervision. It jointly performs patient-level stenosis classification and anatomical region localization (distinguishing the right and left coronary arteries and their segments).
Result: Validated on a real-world clinical dataset, SegmentMIL achieves high performance on internal and external tests and clearly outperforms single-view models and classical MIL baselines.
Conclusion: SegmentMIL is a clinically viable, scalable solution for coronary stenosis diagnosis that reduces reliance on fine-grained annotation and improves joint multi-view analysis.
Abstract: Coronary artery stenosis is a leading cause of cardiovascular disease, diagnosed by analyzing the coronary arteries from multiple angiography views. Although numerous deep-learning models have been proposed for stenosis detection from a single angiography view, their performance heavily relies on expensive view-level annotations, which are often not readily available in hospital systems. Moreover, these models fail to capture the temporal dynamics and dependencies among multiple views, which are crucial for clinical diagnosis. To address this, we propose SegmentMIL, a transformer-based multi-view multiple-instance learning framework for patient-level stenosis classification. Trained on a real-world clinical dataset, using patient-level supervision and without any view-level annotations, SegmentMIL jointly predicts the presence of stenosis and localizes the affected anatomical region, distinguishing between the right and left coronary arteries and their respective segments. SegmentMIL obtains high performance on internal and external evaluations and outperforms both view-level models and classical MIL baselines, underscoring its potential as a clinically viable and scalable solution for coronary stenosis diagnosis. Our code is available at https://github.com/NikolaCenic/mil-stenosis.
[417] UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction
Changbai Li,Haodong Zhu,Hanlin Chen,Xiuping Liang,Tongfei Chen,Shuwei Shao,Linlin Yang,Huobin Tan,Baochang Zhang
Main category: cs.CV
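TL;DR: UrbanGS is a scalable 3D Gaussian Splatting reconstruction framework for city-scale scenes. Through depth-consistent D-Normal regularization, spatially adaptive Gaussian pruning, and a unified partitioning and view assignment strategy, it significantly improves geometric consistency, memory efficiency, and computational scalability in large scenes.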
Details
Motivation: 3D Gaussian Splatting (3DGS) performs well on bounded scenes, but scaling it to large urban environments runs into key challenges: geometric inconsistency, high memory usage, and poor computational scalability.
Method: Three components are proposed: 1) Depth-Consistent D-Normal Regularization, which combines external depth supervision with D-Normal constraints and adds adaptive confidence weighting based on gradient consistency and inverse depth deviation; 2) Spatially Adaptive Gaussian Pruning (SAGP), which prunes dynamically according to local geometric complexity and visibility; 3) a unified spatial partitioning and view assignment scheme that eliminates boundary artifacts and balances computational load.
Result: Experiments on multiple urban datasets show that UrbanGS outperforms existing methods in rendering quality, geometric accuracy, and memory efficiency.
Conclusion: UrbanGS provides a systematic solution for high-fidelity large-scale scene reconstruction and overcomes the core bottlenecks of 3DGS at city scale.
Abstract: While 3D Gaussian Splatting (3DGS) enables high-quality, real-time rendering for bounded scenes, its extension to large-scale urban environments gives rise to critical challenges in terms of geometric consistency, memory efficiency, and computational scalability. To address these issues, we present UrbanGS, a scalable reconstruction framework that effectively tackles these challenges for city-scale applications. First, we propose a Depth-Consistent D-Normal Regularization module. Unlike existing approaches that rely solely on monocular normal estimators, which can effectively update rotation parameters yet struggle to update position parameters, our method integrates D-Normal constraints with external depth supervision. This allows for comprehensive updates of all geometric parameters. By further incorporating an adaptive confidence weighting mechanism based on gradient consistency and inverse depth deviation, our approach significantly enhances multi-view depth alignment and geometric coherence, which effectively resolves the issue of geometric accuracy in complex large-scale scenes. To improve scalability, we introduce a Spatially Adaptive Gaussian Pruning (SAGP) strategy, which dynamically adjusts Gaussian density based on local geometric complexity and visibility to reduce redundancy. Additionally, a unified partitioning and view assignment scheme is designed to eliminate boundary artifacts and optimize computational load. Extensive experiments on multiple urban datasets demonstrate that UrbanGS achieves superior performance in rendering quality, geometric accuracy, and memory efficiency, providing a systematic solution for high-fidelity large-scale scene reconstruction.
[418] FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space
FSVideo Team,Qingyu Chen,Zhiyuan Fang,Haibin Huang,Xinwei Huang,Tong Jin,Minxuan Lin,Bo Liu,Celong Liu,Chongyang Ma,Xing Mei,Xiaohui Shen,Yaojie Shen,Fuwen Tan,Angtian Wang,Xiao Yang,Yiding Yang,Jiamin Yuan,Lingxi Zhang,Yuxin Zhang
Main category: cs.CV
TL;DR: FSVideo is a fast speed transformer-based image-to-video diffusion framework featuring a highly-compressed video autoencoder, a diffusion transformer with enhanced layer memory, and a multi-resolution upsampling strategy, achieving competitive performance at significantly higher speed.
Details
Motivation: To address the computational inefficiency and slow generation speed of existing image-to-video diffusion models while maintaining competitive reconstruction quality and video fidelity.
Method: Proposes FSVideo with three key components: 1) a video autoencoder with 64×64×4 spatial-temporal compression; 2) a diffusion transformer (DIT) with novel layer memory design for improved inter-layer information flow; 3) a few-step DIT upsampler for multi-resolution generation. Uses a 14B DIT base model and a 14B DIT upsampler.
Result: Achieves competitive performance against popular open-source I2V models while being an order of magnitude faster.
Conclusion: The FSVideo framework demonstrates that efficient, high-fidelity image-to-video generation is feasible through architectural innovations in autoencoding, transformer design, and multi-resolution sampling.
Abstract: We introduce FSVideo, a fast speed transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: 1.) a new video autoencoder with highly-compressed latent space ($64\times64\times4$ spatial-temporal downsampling ratio), achieving competitive reconstruction quality; 2.) a diffusion transformer (DIT) architecture with a new layer memory design to enhance inter-layer information flow and context reuse within DIT, and 3.) a multi-resolution generation strategy via a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models, while being an order of magnitude faster. We discuss our model design as well as training strategies in this report.
[419] Teacher-Guided Student Self-Knowledge Distillation Using Diffusion Model
Yu Wang,Chuanguang Yang,Zhulin An,Weilun Feng,Jiarui Zhao,Chengqing Yu,Libo Huang,Boyu Diao,Yongjun Xu
Main category: cs.CV
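TL;DR: This paper proposes DSKD, a new knowledge distillation method that uses the teacher classifier to guide a diffusion denoising process on student features, combined with LSH-guided student self-distillation, to mitigate the mismatch between teacher and student feature distributions and significantly improve performance across models and datasets.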
Details
Motivation: Because teacher and student feature distributions differ, existing knowledge distillation methods may lead the student to learn knowledge that is incompatible with itself.
Method: Proposes teacher-guided student Diffusion Self-KD (DSKD): the teacher classifier guides a lightweight diffusion model that denoises student features, and locality-sensitive hashing (LSH) then guides knowledge distillation between the original and denoised student features.
Result: Across multiple visual recognition tasks, models, and datasets, DSKD significantly outperforms existing knowledge distillation methods.
Conclusion: By treating the denoised student features as a "proxy teacher", DSKD eliminates discrepancies in mapping manner and feature distribution between teacher and student, achieving more effective knowledge transfer without direct teacher-student alignment.
Abstract: Existing Knowledge Distillation (KD) methods often align feature information between teacher and student by exploring meaningful feature processing and loss functions. However, due to the difference in feature distributions between the teacher and student, the student model may learn incompatible information from the teacher. To address this problem, we propose teacher-guided student Diffusion Self-KD, dubbed as DSKD. Instead of the direct teacher-student alignment, we leverage the teacher classifier to guide the sampling process of denoising student features through a light-weight diffusion model. We then propose a novel locality-sensitive hashing (LSH)-guided feature distillation method between the original and denoised student features. The denoised student features encapsulate teacher knowledge and could be regarded as a teacher role. In this way, our DSKD method could eliminate discrepancies in mapping manners and feature distributions between the teacher and student, while learning meaningful knowledge from the teacher. Experiments on visual recognition tasks demonstrate that DSKD significantly outperforms existing KD methods across various models and datasets. Our code is attached in supplementary material.
[420] Enhancing Diffusion-Based Quantitatively Controllable Image Generation via Matrix-Form EDM and Adaptive Vicinal Training
Xin Ding,Yun Chen,Sen Zhang,Kao Zhang,Nenglun Chen,Peibei Cao,Yongwei Wang,Fei Wu
Main category: cs.CV
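TL;DR: This paper proposes iCCDM, which improves the Continuous Conditional Diffusion Model (CCDM) by adopting the Elucidated Diffusion Model (EDM) framework and introducing a matrix-form EDM formulation and an adaptive vicinal training strategy. It significantly improves generation quality and sampling efficiency on multiple datasets, surpassing existing diffusion and GAN methods.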
Details
Motivation: Although CCDM performs well for image generation conditioned on continuous labels, it is held back by an outdated diffusion framework and low sampling efficiency, and it has been surpassed by the GAN-based method CcGAN-AVAR, so it needs improvement.
Method: Proposes iCCDM: built on the Elucidated Diffusion Model (EDM) framework, it designs a matrix-form EDM formulation and introduces an adaptive vicinal training strategy to balance generation quality and sampling efficiency.
Result: On four benchmark datasets (64×64 to 256×256 resolution), iCCDM outperforms existing methods, including advanced text-to-image models such as Stable Diffusion 3, FLUX.1, and Qwen-Image, with higher generation quality and significantly lower sampling cost.
Conclusion: By combining a more advanced diffusion framework with a new training strategy, iCCDM overcomes the inherent weaknesses of CCDM and becomes the new SOTA method for continuous conditional image generation.
Abstract: Continuous Conditional Diffusion Model (CCDM) is a diffusion-based framework designed to generate high-quality images conditioned on continuous regression labels. Although CCDM has demonstrated clear advantages over prior approaches across a range of datasets, it still exhibits notable limitations and has recently been surpassed by a GAN-based method, namely CcGAN-AVAR. These limitations mainly arise from its reliance on an outdated diffusion framework and its low sampling efficiency due to long sampling trajectories. To address these issues, we propose an improved CCDM framework, termed iCCDM, which incorporates the more advanced Elucidated Diffusion Model (EDM) framework with substantial modifications to improve both generation quality and sampling efficiency. Specifically, iCCDM introduces a novel matrix-form EDM formulation together with an adaptive vicinal training strategy. Extensive experiments on four benchmark datasets, spanning image resolutions from $64\times64$ to $256\times256$, demonstrate that iCCDM consistently outperforms existing methods, including state-of-the-art large-scale text-to-image diffusion models (e.g., Stable Diffusion 3, FLUX.1, and Qwen-Image), achieving higher generation quality while significantly reducing sampling cost.
[421] MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos
Yangyi Cao,Yuanhang Li,Lan Chen,Qi Mao
Main category: cs.CV
TL;DR: MLV-Edit is a training-free, flow-based framework for minute-level video editing that uses segment-wise processing with Velocity Blend and Attention Sink modules to ensure temporal consistency and semantic fidelity.
Details
Motivation: Existing video editing methods struggle with long-duration videos due to high computational cost and difficulty maintaining global temporal consistency across thousands of frames.
Method: MLV-Edit adopts a divide-and-conquer strategy with two key modules: Velocity Blend aligns optical flow fields at segment boundaries to fix motion inconsistencies, and Attention Sink anchors local segment features to global reference frames to prevent structural drift.
Result: Extensive experiments show MLV-Edit outperforms state-of-the-art methods in temporal stability and semantic fidelity.
Conclusion: MLV-Edit effectively enables high-quality, training-free editing of minute-long videos while preserving global coherence and local detail.
Abstract: We propose MLV-Edit, a training-free, flow-based framework that addresses the unique challenges of minute-level video editing. While existing techniques excel in short-form video manipulation, scaling them to long-duration videos remains challenging due to prohibitive computational overhead and the difficulty of maintaining global temporal consistency across thousands of frames. To address this, MLV-Edit employs a divide-and-conquer strategy for segment-wise editing, facilitated by two core modules: Velocity Blend rectifies motion inconsistencies at segment boundaries by aligning the flow fields of adjacent chunks, eliminating flickering and boundary artifacts commonly observed in fragmented video processing; and Attention Sink anchors local segment features to global reference frames, effectively suppressing cumulative structural drift. Extensive quantitative and qualitative experiments demonstrate that MLV-Edit consistently outperforms state-of-the-art methods in terms of temporal stability and semantic fidelity.
[422] Toxicity Assessment in Preclinical Histopathology via Class-Aware Mahalanobis Distance for Known and Novel Anomalies
Olga Graf,Dhrupal Patel,Peter Groß,Charlotte Lempp,Matthias Hein,Fabian Heinemann
Main category: cs.CV
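TL;DR: This paper proposes an AI-based anomaly detection framework for histopathological whole-slide images (WSIs) in rodent liver toxicology studies. It identifies both known pathologies (supervised) and rare unknown pathologies (unsupervised OOD detection) by combining a LoRA-fine-tuned DINOv2 model, pixelwise annotated data, and class-specific Mahalanobis distance thresholds, achieving accurate healthy/pathological tissue segmentation and anomaly discovery.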
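A minimal sketch of OOD detection with class-specific Mahalanobis thresholds, as this entry describes it. It is an illustration under stated assumptions, not the authors' implementation: the features are random stand-ins for backbone features, the rule "compare against the threshold of the nearest class" is an assumption, and the threshold values are hypothetical.

```python
import numpy as np


def fit_class_gaussians(features, labels):
    """Estimate a mean and an inverse covariance matrix for each class."""
    stats = {}
    for c in np.unique(labels):
        x = features[labels == c]
        mean = x.mean(axis=0)
        # A small ridge term keeps the covariance invertible.
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
        stats[int(c)] = (mean, np.linalg.inv(cov))
    return stats


def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of one feature vector to a class Gaussian."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))


def is_ood(x, stats, thresholds):
    """Flag x as OOD when it exceeds the threshold of its nearest class."""
    dists = {c: mahalanobis(x, m, ci) for c, (m, ci) in stats.items()}
    nearest = min(dists, key=dists.get)
    return dists[nearest] > thresholds[nearest], nearest


# Toy data: two classes with different spread (stand-ins for real features).
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(5, 2, (200, 4))])
labels = np.array([0] * 200 + [1] * 200)
stats = fit_class_gaussians(feats, labels)
thresholds = {0: 4.0, 1: 4.0}  # hypothetical; the paper tunes these per class

flag_far, _ = is_ood(np.full(4, 50.0), stats, thresholds)
flag_near, cls_near = is_ood(np.zeros(4), stats, thresholds)
```

Keeping one threshold per class lets a class with naturally high feature variability tolerate larger distances than a tight class, which is the motivation the abstract gives for class-specific thresholds.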
Details
Motivation: Drug-induced toxicity is a leading cause of failure in preclinical and early clinical trials; conventional histopathological evaluation depends on experts and cannot meet the demand of large-scale screening, so automated, scalable AI-assisted methods are needed.
Method: Builds a rodent liver WSI dataset with pixelwise annotations of healthy tissue and known pathologies; fine-tunes a pre-trained Vision Transformer (DINOv2) with LoRA for tissue segmentation; computes class-specific Mahalanobis distances on the extracted features and optimizes class-dependent thresholds to make OOD detection more robust.
Result: Validated on mouse liver WSIs: only 0.35% of healthy tissue is misclassified as pathological and only 0.16% of pathological tissue is missed; the framework detects a range of toxicity-related anomalies, including rare OOD morphologies.
Conclusion: The framework can support preclinical toxicity assessment and improve early toxicity detection, with the potential to reduce late-stage failures in drug development and improve overall efficiency.
Abstract: Drug-induced toxicity remains a leading cause of failure in preclinical development and early clinical trials. Detecting adverse effects at an early stage is critical to reduce attrition and accelerate the development of safe medicines. Histopathological evaluation remains the gold standard for toxicity assessment, but it relies heavily on expert pathologists, creating a bottleneck for large-scale screening. To address this challenge, we introduce an AI-based anomaly detection framework for histopathological whole-slide images (WSIs) in rodent livers from toxicology studies. The system identifies healthy tissue and known pathologies (anomalies) for which training data is available. In addition, it can detect rare pathologies without training data as out-of-distribution (OOD) findings. We generate a novel dataset of pixelwise annotations of healthy tissue and known pathologies and use this data to fine-tune a pre-trained Vision Transformer (DINOv2) via Low-Rank Adaptation (LoRA) in order to do tissue segmentation. Finally, we extract features for OOD detection using the Mahalanobis distance. To better account for class-dependent variability in histological data, we propose the use of class-specific thresholds. We optimize the thresholds using the mean of the false negative and false positive rates, resulting in only 0.16% of pathological tissue classified as healthy and 0.35% of healthy tissue classified as pathological. Applied to mouse liver WSIs with known toxicological findings, the framework accurately detects anomalies, including rare OOD morphologies. This work demonstrates the potential of AI-driven histopathology to support preclinical workflows, reduce late-stage failures, and improve efficiency in drug development.
[423] Eliminating Registration Bias in Synthetic CT Generation: A Physics-Based Simulation Framework
Lukas Zimmermann,Michael Rauter,Maximilian Schmid,Dietmar Georg,Barbara Knäusl
Main category: cs.CV
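TL;DR: This paper proposes a physics-based CBCT simulation method that generates geometrically aligned training pairs, avoiding the bias introduced by conventional registration errors, and evaluates models with geometric alignment metrics (such as normalized mutual information) rather than intensity metrics. The results indicate that geometric fidelity better matches clinical requirements.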
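Normalized mutual information, the alignment metric named in this entry, can be estimated from a joint intensity histogram. The sketch below is illustrative only. Several NMI normalizations exist and the entry does not say which one the authors use, so the variant here, 2·I(A;B)/(H(A)+H(B)) with range [0, 1], and the bin count are assumptions.

```python
import numpy as np


def normalized_mutual_information(a, b, bins=32):
    """Estimate NMI = 2 * I(A;B) / (H(A) + H(B)) from a joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()   # joint distribution
    px = pxy.sum(axis=1)        # marginal of a
    py = pxy.sum(axis=0)        # marginal of b

    def entropy(p):
        p = p[p > 0]            # skip empty bins (0 * log 0 := 0)
        return float(-(p * np.log(p)).sum())

    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy.ravel())
    mutual_info = hx + hy - hxy
    return 2.0 * mutual_info / (hx + hy)


# Toy images: an image compared with itself and with unrelated noise.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
noise = rng.random((64, 64))
nmi_same = normalized_mutual_information(img, img)
nmi_indep = normalized_mutual_information(img, noise)
```

Under this normalization an image compared with itself scores 1, while statistically unrelated images score close to 0 (a small positive bias remains because the histogram is estimated from finitely many pixels).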
Details
Motivation: Conventional supervised synthetic CT generation relies on registered CBCT-CT training pairs, but perfect registration is unattainable in practice. Registration bias therefore contaminates both models and evaluation metrics, and high scores may reflect reproduction of registration artifacts rather than true anatomical fidelity.
Method: Uses physics-based CBCT simulation to generate training pairs that are geometrically aligned by construction, and evaluates with geometric alignment metrics (such as normalized mutual information) against the input CBCT instead of intensity metrics against registration-biased real CT.
Result: On two independent pelvic datasets, models trained on synthetic data show clearly better geometric alignment (NMI: 0.31 vs 0.22); intensity metrics correlate negatively with clinical assessment, while NMI correlates positively and significantly with observer preference (rho = 0.31, p < 0.001); clinical observers preferred the synthetic-trained outputs in 87% of cases.
Conclusion: Geometric fidelity reflects real clinical needs better than intensity agreement; evaluation based on intensity agreement with biased real CT should be replaced by a standard built on physics-based simulation and geometric alignment.
Abstract: Supervised synthetic CT generation from CBCT requires registered training pairs, yet perfect registration between separately acquired scans remains unattainable. This registration bias propagates into trained models and corrupts standard evaluation metrics. This may suggest that superior benchmark performance indicates better reproduction of registration artifacts rather than anatomical fidelity. We propose physics-based CBCT simulation to provide geometrically aligned training pairs by construction, combined with evaluation using geometric alignment metrics against input CBCT rather than biased ground truth. On two independent pelvic datasets, models trained on synthetic data achieved superior geometric alignment (Normalized Mutual Information: 0.31 vs 0.22) despite lower conventional intensity scores. Intensity metrics showed inverted correlations with clinical assessment for deformably registered data, while Normalized Mutual Information consistently predicted observer preference across registration methodologies (rho = 0.31, p < 0.001). Clinical observers preferred synthetic-trained outputs in 87% of cases, demonstrating that geometric fidelity, not intensity agreement with biased ground truth, aligns with clinical requirements.
[424] Deep learning enables urban change profiling through alignment of historical maps
Sidi Wu,Yizi Chen,Maurizio Gribaudi,Konrad Schindler,Clément Mallet,Julien Perret,Lorenz Hurni
Main category: cs.CV
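TL;DR: This paper proposes a fully automated deep learning framework for fine-grained urban change analysis from large collections of historical maps, covering dense map alignment, multi-temporal object detection, and change profiling, and substantially improving the capacity for quantitative analysis of historical maps.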
Details
Motivation: Historical maps are an important source for studying long-term urban evolution, but spatial misalignment, cartographic variation, and degraded document quality make it hard to extract consistent, fine-grained change information, so existing analyses are mostly small-scale or qualitative.
Method: Proposes a modular deep learning framework that integrates dense map alignment, multi-temporal object detection, and change profiling to enable automated, fine-grained urban change analysis from historical maps.
Result: Experiments confirm the robustness of the proposed alignment and object detection methods; applied to historical maps of Paris from 1868 to 1937, the framework reveals the spatial and temporal heterogeneity of urban transformation.
Conclusion: The framework moves historical map analysis from ad hoc visual comparison toward systematic, quantitative characterization of urban change, and it can adapt to different cartographic contexts and has application potential in the social sciences and humanities.
Abstract: Prior to modern Earth observation technologies, historical maps provide a unique record of long-term urban transformation and offer a lens on the evolving identity of cities. However, extracting consistent and fine-grained change information from historical map series remains challenging due to spatial misalignment, cartographic variation, and degrading document quality, limiting most analyses to small-scale or qualitative approaches. We propose a fully automated, deep learning-based framework for fine-grained urban change analysis from large collections of historical maps, built on a modular design that integrates dense map alignment, multi-temporal object detection, and change profiling. This framework shifts the analysis of historical maps from ad hoc visual comparison toward systematic, quantitative characterization of urban change. Experiments demonstrate the robust performance of the proposed alignment and object detection methods. Applied to Paris between 1868 and 1937, the framework reveals the spatial and temporal heterogeneity in urban transformation, highlighting its relevance for research in the social sciences and humanities. The modular design of our framework further supports adaptation to diverse cartographic contexts and downstream applications.
[425] LoopViT: Scaling Visual ARC with Looped Transformers
Wen-Jie Shu,Xuerui Qiu,Rui-Jie Zhu,Harold Haodong Chen,Yexin Liu,Harry Yang
Main category: cs.CV
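TL;DR: This paper proposes Loop-ViT, a vision transformer built on a weight-tied recurrent structure that uses a dynamic exit mechanism for adaptive iterative reasoning and outperforms larger models on ARC-AGI-1 with fewer parameters.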
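The entropy-based dynamic exit in this entry can be sketched as a loop that reapplies one shared step and stops once the prediction becomes confident. This is a toy illustration, not the Loop-ViT code: `step_fn` and `readout_fn` stand in for the weight-tied block and the prediction head, and the threshold `tau` and step budget `max_steps` are hypothetical values.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def predictive_entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())


def looped_inference(state, step_fn, readout_fn, max_steps=16, tau=0.1):
    """Reapply the same step until predictive entropy drops below tau."""
    for step in range(1, max_steps + 1):
        state = step_fn(state)               # same weights at every iteration
        probs = softmax(readout_fn(state))
        if predictive_entropy(probs) < tau:  # confident enough: exit early
            break
    return probs, step


def step_fn(state):
    """Toy stand-in for the shared block: each pass sharpens the state."""
    return state * 1.5


def readout_fn(state):
    """Toy stand-in for the prediction head: the state is used as logits."""
    return state


probs, steps_used = looped_inference(np.array([1.0, 0.5, 0.0]), step_fn, readout_fn)
```

Because the exit rule only reads the entropy of the current prediction, it adds no learnable parameters, which matches the "parameter-free" description in the abstract. Easy inputs use few iterations and hard ones use more, up to `max_steps`.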
Details
Motivation: The feed-forward structure of existing vision transformers struggles to model the iterative, algorithmic reasoning that human induction requires.
Method: Proposes Loop-ViT, which iterates a weight-tied Hybrid Block (combining local convolutions and global attention) and introduces a parameter-free dynamic exit mechanism based on predictive entropy, so the model stops inference automatically once the uncertainty of its internal state is low.
Result: On the ARC-AGI-1 benchmark, the 18M-parameter Loop-ViT reaches 65.8% accuracy, outperforming 73M-parameter ensembles.
Conclusion: Adaptive iterative computation is more efficient than simply widening the network and offers a new scaling path for visual reasoning.
Abstract: Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state "crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M model achieves 65.8% accuracy, outperforming massive 73M-parameter ensembles. These findings demonstrate that adaptive iterative computation offers a far more efficient scaling axis for visual reasoning than simply increasing network width. The code is available at https://github.com/WenjieShu/LoopViT.
[426] Reg4Pru: Regularisation Through Random Token Routing for Token Pruning
Julian Wyatt,Ronald Clark,Irina Voiculescu
Main category: cs.CV
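TL;DR: This paper proposes Reg4Pru, a training regularisation technique for segmentation that mitigates the performance loss caused by token pruning. On the FIVES blood vessel segmentation dataset it improves average precision by an absolute 46% over the same model trained without routing, in a configuration with a 29% relative wall-clock speedup.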
Details
Motivation: Token pruning improves computational efficiency, but the preserved representations become less stable, which degrades dense prediction performance at deeper layers.
Method: Proposes Reg4Pru, a training regularisation method that mitigates the performance loss of token pruning in segmentation tasks.
Result: On the FIVES dataset, Reg4Pru improves average precision by an absolute 46% and achieves a 29% relative inference speedup.
Conclusion: Reg4Pru is an effective regulariser for token reduction strategies and clearly improves the balance between performance and efficiency of pruned models in segmentation.
Abstract: Transformers are widely adopted in modern vision models due to their strong ability to scale with dataset size and generalisability. However, this comes with a major drawback: computation scales quadratically to the total number of tokens. Numerous methods have been proposed to mitigate this. For example, we consider token pruning with reactivating tokens from preserved representations, but the increased computational efficiency of this method results in decreased stability from the preserved representations, leading to poorer dense prediction performance at deeper layers. In this work, we introduce Reg4Pru, a training regularisation technique that mitigates token-pruning performance loss for segmentation. We compare our models on the FIVES blood vessel segmentation dataset and find that Reg4Pru improves average precision by an absolute 46% compared to the same model trained without routing. This increase is observed using a configuration that achieves a 29% relative speedup in wall-clock time compared to the non-pruned baseline. These findings indicate that Reg4Pru is a valuable regulariser for token reduction strategies.
[427] Lung Nodule Image Synthesis Driven by Two-Stage Generative Adversarial Networks
Lu Cao,Xiquan He,Junying Zeng,Chaoyun Mai,Min Luo
Main category: cs.CV
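TL;DR: This paper proposes a two-stage generative adversarial network (TSGAN) that decouples the morphological structure and texture features of lung nodules to improve the diversity and spatial controllability of synthetic CT data, thereby improving detection model performance.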
Details
Motivation: Lung nodule CT datasets are small and lack diversity, which limits the performance and generalization of detection models; existing generation methods lack diversity and controllability and tend to produce monotonous textures and distorted anatomical structures.
Method: Proposes a two-stage GAN: the first stage uses StyleGAN to generate semantic segmentation masks that control anatomical structure; the second stage uses DL-Pix2Pix, with local importance attention and dynamic weight multi-head window attention, to translate the masks into high-quality CT images.
Result: On the LUNA16 dataset, detection accuracy improves by 4.6% and mAP by 4% compared with training on the original data; both synthetic image quality and detection performance are shown to improve.
Conclusion: TSGAN improves the diversity, anatomical plausibility, and texture realism of synthetic lung nodule CT images, which in turn strengthens the performance and generalization of downstream detection models.
Abstract: The limited sample size and insufficient diversity of lung nodule CT datasets severely restrict the performance and generalization ability of detection models. Existing methods generate images with insufficient diversity and controllability, suffering from issues such as monotonous texture features and distorted anatomical structures. Therefore, we propose a two-stage generative adversarial network (TSGAN) to enhance the diversity and spatial controllability of synthetic data by decoupling the morphological structure and texture features of lung nodules. In the first stage, StyleGAN is used to generate semantic segmentation mask images, encoding lung nodules and tissue backgrounds to control the anatomical structure of lung nodule images. The second stage uses the DL-Pix2Pix model to translate the mask map into CT images, employing local importance attention to capture local features, while utilizing dynamic weight multi-head window attention to enhance the modeling capability of lung nodule texture and background. Compared to the original dataset, the accuracy improved by 4.6% and mAP by 4% on the LUNA16 dataset. Experimental results demonstrate that TSGAN can enhance the quality of synthetic images and the performance of detection models.
[428] CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization
Xinquan Yu,Wei Lu,Xiangyang Luo
Main category: cs.CV
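TL;DR: This paper proposes the CIEC framework, which uses coarse-grained image/sentence-level annotations for weakly supervised multimodal manipulation localization on image-text pairs. Two modules, TRPS and VCTG, handle image-level and text-level localization respectively, and several constraints are added to improve robustness and accuracy.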
Details
Motivation: Existing methods depend on costly, time-consuming fine-grained annotations (such as patch/token-level labels), which limits practical use; a weakly supervised approach based on coarse-grained annotations is needed.
Method: Proposes the Coupling Implicit and Explicit Cues (CIEC) framework, with an image branch (the TRPS module: text-guided refined patch selection, plus background silencing and spatial contrast constraints) and a text branch (the VCTG module: visual-deviation calibrated token grounding, plus asymmetric sparse and semantic consistency constraints).
Result: Performance is comparable to fully supervised methods on several evaluation metrics, confirming effectiveness in the weakly supervised setting.
Conclusion: CIEC achieves efficient, robust multimodal manipulation localization using only coarse image/sentence-level annotations, offering a practical new paradigm for misinformation detection.
Abstract: To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. Considering that current methods rely on costly and time-consuming fine-grained annotations, such as patch/token-level annotations, this paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which aims to achieve multimodal weakly-supervised manipulation localization for image-text pairs utilizing only coarse-grained image/sentence-level annotations. It comprises two branches, image-based and text-based weakly-supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module. It integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions aided by spatial priors. It is followed by background silencing and spatial contrast constraints that suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module. It focuses on meaningful content words and leverages relative visual bias to assist token localization. It is followed by asymmetric sparse and semantic consistency constraints that mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of our CIEC, yielding results comparable to fully supervised methods on several evaluation metrics.
[429] Learning Topology-Aware Implicit Field for Unified Pulmonary Tree Modeling with Incomplete Topological Supervision
Ziqiao Weng,Jiancheng Yang,Kangxian Xie,Bo Zhou,Weidong Cai
Main category: cs.CV
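TL;DR: This paper proposes TopoField, a topology-aware implicit modeling framework that repairs topological incompleteness of pulmonary trees in CT images and supports joint multi-task inference such as anatomical labeling and lung segment reconstruction, with both high accuracy and high efficiency.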
Details
Motivation: Pulmonary trees extracted from CT are often topologically incomplete (for example, missing or disconnected branches), which seriously affects downstream analysis; existing methods rely on dense volumetric processing or explicit graph reasoning and are inefficient and not robust.
Method: TopoField represents pulmonary anatomy with sparse surface and skeleton point clouds and learns a continuous implicit field. It is trained by introducing synthetic structural disruptions into already incomplete trees, which enables topology repair without explicit disconnection annotations. On top of the repaired implicit representation, task-specific implicit functions perform anatomical labeling and lung segment reconstruction in a single forward pass.
Result: On the Lung3D+ dataset, TopoField clearly improves topological completeness and achieves accurate anatomical labeling and lung segment reconstruction in severely incomplete scenarios; the full pipeline takes only about one second per case, which makes it clinically practical.
Conclusion: TopoField treats topology repair as a core modeling problem and uses an implicit representation to deliver efficient, robust, unified multi-task pulmonary tree analysis, offering a new paradigm for large-scale, time-sensitive clinical applications.
Abstract: Pulmonary trees extracted from CT images frequently exhibit topological incompleteness, such as missing or disconnected branches, which substantially degrades downstream anatomical analysis and limits the applicability of existing pulmonary tree modeling pipelines. Current approaches typically rely on dense volumetric processing or explicit graph reasoning, leading to limited efficiency and reduced robustness under realistic structural corruption. We propose TopoField, a topology-aware implicit modeling framework that treats topology repair as a first-class modeling problem and enables unified multi-task inference for pulmonary tree analysis. TopoField represents pulmonary anatomy using sparse surface and skeleton point clouds and learns a continuous implicit field that supports topology repair without relying on complete or explicit disconnection annotations, by training on synthetically introduced structural disruptions over already incomplete trees. Building upon the repaired implicit representation, anatomical labeling and lung segment reconstruction are jointly inferred through task-specific implicit functions within a single forward pass. Extensive experiments on the Lung3D+ dataset demonstrate that TopoField consistently improves topological completeness and achieves accurate anatomical labeling and lung segment reconstruction under challenging incomplete scenarios. Owing to its implicit formulation, TopoField attains high computational efficiency, completing all tasks in just over one second per case, highlighting its practicality for large-scale and time-sensitive clinical applications. Code and data will be available at https://github.com/HINTLab/TopoField.
[430] SSI-DM: Singularity Skipping Inversion of Diffusion Models
Chen Min,Enze Jiang,Jishen Peng,Zheng Ma
Main category: cs.CV
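TL;DR: This paper proposes SSI-DM, which adds a small amount of noise before standard inversion to avoid the ill-posedness caused by a mathematical singularity in diffusion model inversion, producing noise representations that are closer to Gaussian and more editable.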
Details
Motivation: Existing image inversion methods for diffusion models are inaccurate in the early noising steps, so the inverted noise is non-Gaussian and edits poorly; the root cause is a mathematical singularity in the inversion process that makes the problem fundamentally ill-posed.
Method: Proposes Singularity Skipping Inversion of Diffusion Models (SSI-DM), which adds a small amount of initial noise before the standard inversion procedure in order to bypass the singular region.
Result: The inverted noise is closer to a Gaussian distribution and reconstruction fidelity is high; on public image datasets the method outperforms existing approaches on reconstruction and interpolation tasks.
Conclusion: SSI-DM is a plug-and-play, principled, and efficient inversion scheme compatible with general diffusion models, and it addresses both the ill-posedness of diffusion model inversion and the non-Gaussian noise problem.
Abstract: Inverting real images into the noise space is essential for editing tasks using diffusion models, yet existing methods produce non-Gaussian noise with poor editability due to the inaccuracy in early noising steps. We identify the root cause: a mathematical singularity that renders inversion fundamentally ill-posed. We propose Singularity Skipping Inversion of Diffusion Models (SSI-DM), which bypasses this singular region by adding small noise before standard inversion. This simple approach produces inverted noise with natural Gaussian properties while maintaining reconstruction fidelity. As a plug-and-play technique compatible with general diffusion models, our method achieves superior performance on public image datasets for reconstruction and interpolation tasks, providing a principled and efficient solution to diffusion model inversion.
[431] MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models
Zheyuan Zhou,Liang Du,Zixun Sun,Xiaoyu Zhou,Ruimin Ye,Qihao Chen,Yinda Chen,Lemiao Qiu
Main category: cs.CV
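TL;DR: This paper proposes the MAIN-VLA framework, which explicitly models abstract representations of intention and environment so that decisions are driven by semantic alignment, markedly improving action decision quality, generalization, and inference efficiency in complex dynamic environments (e.g., Minecraft, Game for Peace, Valorant).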
Details
Motivation: In highly complex, dynamic environments such as 3D open worlds and large-scale PvP games, existing vision-language-action (VLA) methods struggle to efficiently extract action-critical signals from redundant sensor streams.
Method: MAIN-VLA comprises two core abstraction modules: Intention Abstraction (IA) compresses language instructions and their reasoning into explicit semantic primitives, and Environment Semantics Abstraction (ESA) maps visual streams into a structured, topological affordance representation. Aligning the two produces an attention-concentration effect that supports parameter-free token pruning to remove perceptual redundancy.
Result: In open-world and PvP environments including Minecraft, Game for Peace, and Valorant, MAIN-VLA sets a new SOTA in decision quality, generalization, and inference efficiency.
Conclusion: Deep semantic alignment suits VLA tasks in complex dynamic environments better than superficial pattern matching, and MAIN-VLA offers a scalable architectural paradigm for highly real-time, highly uncertain scenarios.
Abstract: Despite significant progress in Vision-Language-Action (VLA) models, in highly complex and dynamic environments that involve real-time unpredictable interactions (such as 3D open worlds and large-scale PvP games), existing approaches remain inefficient at extracting action-critical signals from redundant sensor streams. To tackle this, we introduce MAIN-VLA, a framework that explicitly Models the Abstraction of Intention and eNvironment to ground decision-making in deep semantic alignment rather than superficial pattern matching. Specifically, our Intention Abstraction (IA) condenses verbose linguistic instructions and their associated reasoning into compact, explicit semantic primitives, while the Environment Semantics Abstraction (ESA) projects overwhelming visual streams into a structured, topological affordance representation. Furthermore, aligning these two abstract modalities induces an emergent attention-concentration effect, enabling a parameter-free token-pruning strategy that filters out perceptual redundancy without degrading performance. Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace and Valorant) demonstrate that MAIN-VLA sets a new state of the art, achieving superior decision quality, stronger generalization, and cutting-edge inference efficiency.
[432] Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
Hongzhou Zhu,Min Zhao,Guande He,Hang Su,Chongxuan Li,Jun Zhu
Main category: cs.CV
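TL;DR: This paper proposes Causal Forcing, which uses an autoregressive (AR) teacher model for ODE initialization to resolve the architectural gap that arises when causal attention replaces full attention while distilling a bidirectional video diffusion model into an AR student, markedly improving real-time interactive video generation.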
Details
Motivation: When distilling a pretrained bidirectional video diffusion model into a few-step autoregressive model, existing methods face a theoretically unaddressed architectural gap because full attention is replaced by causal attention; moreover, initialization via ODE distillation requires frame-level injectivity, which cannot hold in bidirectional-to-autoregressive distillation, and this degrades performance.
Method: Proposes Causal Forcing, which uses an autoregressive teacher for ODE initialization so that frame-level injectivity holds, bridging the architectural gap and recovering the flow map more accurately.
Result: Surpasses the SOTA method Self Forcing by 19.3%, 8.7%, and 16.7% on Dynamic Degree, VisionReward, and Instruction Following, respectively.
Conclusion: ODE initialization with an AR teacher is the key to bridging the architectural gap between bidirectional and AR models, and Causal Forcing gives efficient, high-quality video generation a firmer theoretical and practical footing.
Abstract: To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing that uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and the code: https://thu-ml.github.io/CausalForcing.github.io/
[433] LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation
Bo Miao,Weijia Liu,Jun Luo,Lachlan Shinnick,Jian Liu,Thomas Hamilton-Smith,Yuhe Yang,Zijie Wu,Vanja Videnovic,Feras Dayoub,Anton van den Hengel
Main category: cs.CV
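TL;DR: This paper introduces HieraNav, a multi-granularity open-vocabulary goal navigation task, and LangMap, a large-scale benchmark, to evaluate how well AI agents follow natural language instructions to navigate hierarchically (scene/room/region/instance) in real 3D indoor environments.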
Details
Motivation: To support meaningful human-AI interaction grounded in object-language relationships and practically useful embodied intelligence; existing benchmarks fall short in annotation quality, semantic granularity, and instruction diversity.
Method: Defines the multi-granularity open-vocabulary navigation task HieraNav and presents LangMap, a large-scale benchmark built on real 3D indoor scans with goal annotations at four semantic levels (scene, room, region, instance), more than 18K navigation tasks, and human-verified concise and detailed descriptions; zero-shot and supervised models are evaluated and compared.
Result: LangMap's annotation quality clearly exceeds that of GOAT-Bench (23.8% higher discriminative accuracy with four times fewer words); experiments show that richer context and memory raise success rates, but long-tailed, small, context-dependent, and distant goals, as well as multi-goal completion, remain challenging.
Conclusion: HieraNav and LangMap provide a rigorous, comprehensive, high-quality testbed for language-driven embodied navigation, pushing the field toward more robust, general, and practical systems.
Abstract: The relationships between objects and language are fundamental to meaningful communication between humans and AI, and to practically useful embodied intelligence. We introduce HieraNav, a multi-granularity, open-vocabulary goal navigation task where agents interpret natural language instructions to reach targets at four semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), a large-scale benchmark built on real-world 3D indoor scans with comprehensive human-verified annotations and tasks spanning these levels. LangMap provides region labels, discriminative region descriptions, discriminative instance descriptions covering 414 object categories, and over 18K navigation tasks. Each target features both concise and detailed descriptions, enabling evaluation across different instruction styles. LangMap achieves superior annotation quality, outperforming GOAT-Bench by 23.8% in discriminative accuracy using four times fewer words. Comprehensive evaluations of zero-shot and supervised models on LangMap reveal that richer context and memory improve success, while long-tailed, small, context-dependent, and distant goals, as well as multi-goal completion, remain challenging. HieraNav and LangMap establish a rigorous testbed for advancing language-driven embodied navigation. Project: https://bo-miao.github.io/LangMap
[434] MIRROR: Manifold Ideal Reference ReconstructOR for Generalizable AI-Generated Image Detection
Ruiqi Liu,Manni Cui,Ziheng Qin,Zhiyuan Yan,Ruoxin Chen,Yi Han,Zhiheng Li,Junkai Chen,ZhiJin Chen,Kaiqing Lin,Jialiang Shen,Lubin Weng,Jing Dong,Yan Wang,Shu Wu
Main category: cs.CV
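TL;DR: This paper proposes MIRROR, which recasts AI-generated image detection as a reference-comparison problem: a learnable discrete memory bank encodes reality priors, a sparse linear combination produces an ideal reference, and the residual serves as the detection signal. It clearly outperforms existing methods on multiple benchmarks and even surpasses human experts.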
Details
Motivation: Existing AI-generated image detectors rely on artifact-based classification and generalize poorly, whereas human judgment rests on stable real-world regularities; deviation from the human cognitive manifold is a more general forgery signal.
Method: Proposes MIRROR, which explicitly encodes reality priors in a learnable discrete memory bank, projects the input onto a manifold-consistent ideal reference, and uses the residual as a robust detection signal; introduces the Human-AIGI benchmark, which probes the limits of human perception, for evaluation.
Result: Consistently leads across 14 benchmarks, gaining 2.1% on standard benchmarks and 8.1% on in-the-wild benchmarks; on Human-AIGI it reaches 89.6% accuracy across 27 generators, surpassing both lay users and visual experts, and approaches the human perceptual limit as pretrained backbones scale.
Conclusion: A detection paradigm based on consistency with the real-image manifold rather than specific forgery cues is more generalizable and robust; MIRROR reaches the "superhuman crossover" and offers a new direction for AIGI detection.
Abstract: High-fidelity generative models have narrowed the perceptual gap between synthetic and real images, posing serious threats to media security. Most existing AI-generated image (AIGI) detectors rely on artifact-based classification and struggle to generalize to evolving generative traces. In contrast, human judgment relies on stable real-world regularities, with deviations from the human cognitive manifold serving as a more generalizable signal of forgery. Motivated by this insight, we reformulate AIGI detection as a Reference-Comparison problem that verifies consistency with the real-image manifold rather than fitting specific forgery cues. We propose MIRROR (Manifold Ideal Reference ReconstructOR), a framework that explicitly encodes reality priors using a learnable discrete memory bank. MIRROR projects an input into a manifold-consistent ideal reference via sparse linear combination, and uses the resulting residuals as robust detection signals. To evaluate whether detectors reach the "superhuman crossover" required to replace human experts, we introduce the Human-AIGI benchmark, featuring a psychophysically curated human-imperceptible subset. Across 14 benchmarks, MIRROR consistently outperforms prior methods, achieving gains of 2.1% on six standard benchmarks and 8.1% on seven in-the-wild benchmarks. On Human-AIGI, MIRROR reaches 89.6% accuracy across 27 generators, surpassing both lay users and visual experts, and further approaching the human perceptual limit as pretrained backbones scale. 
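The reference-comparison idea in the MIRROR abstract (a sparse linear combination of memory-bank entries forms the reference, and the residual is the detection signal) can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the dot-product scoring, the top-k selection, the softmax weights, and the names `sparse_reference`, `memory_bank`, and `k` are all hypothetical.

```python
import math


def sparse_reference(feature, memory_bank, k=2):
    """Toy reference reconstruction: returns (reference, residual).

    Hypothetical sketch: the scoring rule, top-k sparsity, and softmax
    weighting are assumptions, not the paper's formulation.
    """
    # Score each memory slot by its dot product with the input feature.
    scores = [sum(f * m for f, m in zip(feature, slot)) for slot in memory_bank]
    # Keep only the k best-scoring slots so the combination is sparse.
    top = sorted(range(len(memory_bank)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores: non-negative weights that sum to 1.
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    # Reference = weighted sum of the selected slots; residual = input - reference.
    reference = [
        sum(weights[i] * memory_bank[i][d] for i in top)
        for d in range(len(feature))
    ]
    residual = [f - r for f, r in zip(feature, reference)]
    return reference, residual
```

In this toy setting, an input that coincides with a memory slot yields a zero residual, while an input that the bank cannot reproduce leaves a non-zero residual that a downstream classifier could consume.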
The code is publicly available at: https://github.com/349793927/MIRROR
[435] Evaluating OCR Performance for Assistive Technology: Effects of Walking Speed, Camera Placement, and Camera Type
Junchi Feng,Nikhil Ballem,Mahya Beheshti,Giles Hamilton-Fletcher,Todd Hudson,Maurizio Porfiri,William H. Seiple,John-Ross Rizzo
Main category: cs.CV
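TL;DR: This study systematically evaluates OCR under static and dynamic conditions and finds that recognition accuracy falls as walking speed and viewing angle increase; Google Vision performs best, PaddleOCR 3.0 is the strongest open-source alternative, and the phone's main camera and a shoulder-mounted placement perform best overall.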
Details
Motivation: Existing OCR evaluations mostly rely on static datasets that cannot capture the challenges of real mobile use (motion, varied viewing angles, device mounting position, and so on), so evaluation methods closer to actual use are needed.
Method: Static tests measure OCR detection range at distances of 1-7 meters and horizontal viewing angles of 0-75 degrees; dynamic tests vary walking speed (0.8-1.8 m/s) and compare head-mounted, shoulder-mounted, and hand-held camera positions; a smartphone (main and ultra-wide cameras) and smart glasses are used to benchmark four OCR engines: Google Vision, PaddleOCR 3.0, EasyOCR, and Tesseract; character-level accuracy is computed with the Levenshtein ratio against manually annotated ground truth.
Result: Recognition accuracy declines as walking speed and horizontal viewing angle increase; Google Vision has the highest overall accuracy and PaddleOCR 3.0 is the strongest open-source engine; the phone's main camera is more accurate than the ultra-wide camera and the smart glasses; shoulder mounting gives the highest average accuracy, but the differences among head, shoulder, and hand are not statistically significant.
Conclusion: Mobile OCR performance is strongly affected by motion and imaging geometry, so evaluation should include dynamic real-world scenarios; Google Vision or PaddleOCR 3.0, combined with the phone's main camera and shoulder mounting, is recommended for better robustness.
Abstract: Optical character recognition (OCR), which converts printed or handwritten text into machine-readable form, is widely used in assistive technology for people with blindness and low vision. Yet, most evaluations rely on static datasets that do not reflect the challenges of mobile use. In this study, we systematically evaluated OCR performance under both static and dynamic conditions. Static tests measured detection range across distances of 1-7 meters and viewing angles of 0-75 degrees horizontally. Dynamic tests examined the impact of motion by varying walking speed from slow (0.8 m/s) to very fast (1.8 m/s) and comparing three camera mounting positions: head-mounted, shoulder-mounted, and hand-held. We evaluated both a smartphone and smart glasses, using the phone's main and ultra-wide cameras. Four OCR engines were benchmarked to assess accuracy at different distances and viewing angles: Google Vision, PaddleOCR 3.0, EasyOCR, and Tesseract. PaddleOCR 3.0 was then used to evaluate accuracy at different walking speeds. Accuracy was computed at the character level using the Levenshtein ratio against manually defined ground truth. Results showed that recognition accuracy declined with increased walking speed and wider viewing angles. Google Vision achieved the highest overall accuracy, with PaddleOCR close behind as the strongest open-source alternative. 
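For readers unfamiliar with the metric, a character-level Levenshtein ratio can be sketched as follows. The abstract does not state which normalization the authors used, so this sketch assumes the common form 1 - distance / max(len); some libraries normalize differently. The function names are illustrative only.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(predicted: str, truth: str) -> float:
    """Similarity in [0, 1]; assumed normalization: 1 - distance / max length."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(predicted, truth) / longest
```

Under this assumed normalization, an OCR output of "EX1T" against the ground truth "EXIT" has one substitution in four characters and therefore scores 0.75.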
Across devices, the phone's main camera achieved the highest accuracy, and a shoulder-mounted placement yielded the highest average among body positions; however, differences among shoulder, head, and hand were not statistically significant.
[436] Show, Don't Tell: Morphing Latent Reasoning into Image Generation
Harold Haodong Chen,Xinxiang Yin,Wen-Jie Shu,Hongfei Zhang,Zixin Zhang,Chenfei Liao,Litao Guo,Qifeng Chen,Ying-Cong Chen
Main category: cs.CV
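TL;DR: This paper proposes LatentMorph, a framework that performs implicit reasoning in latent space to improve the quality, efficiency, and cognitive alignment of text-to-image generation.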
Details
Motivation: Existing text-to-image methods lack dynamic reasoning and self-refinement, and explicit reasoning paradigms suffer from inefficiency, information loss, and cognitive mismatch.
Method: Proposes LatentMorph, which has four lightweight components: a condenser (compresses intermediate states into visual memory), a translator (turns latent thoughts into guidance signals), a shaper (dynamically steers image token prediction), and an RL-trained invoker (adaptively decides when to reason); all reasoning takes place in continuous latent space.
Result: Improves the Janus-Pro model by 16% on GenEval and 25% on T2I-CompBench; outperforms explicit methods such as TwiG by 15% and 11% on abstract reasoning tasks such as WISE and IPV-Txt; cuts inference time by 44% and token consumption by 51%; reasoning invocation shows 71% cognitive alignment with human intuition.
Conclusion: By reasoning implicitly in latent space, LatentMorph overcomes the bottlenecks of explicit reasoning and improves performance, efficiency, and cognitive plausibility together.
Abstract: Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation--a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiencies, information loss, and cognitive mismatches. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) a condenser for summarizing intermediate generation states into compact visual memory, (ii) a translator for converting latent thoughts into actionable guidance, (iii) a shaper for dynamically steering next image token predictions, and (iv) an RL-trained invoker for adaptively determining when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. 
Extensive experiments demonstrate that LatentMorph (I) enhances the base model Janus-Pro by 16% on GenEval and 25% on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by 15% and 11% on abstract reasoning tasks like WISE and IPV-Txt; (III) reduces inference time by 44% and token consumption by 51%; and (IV) exhibits 71% cognitive alignment with human intuition on reasoning invocation.
[437] LiFlow: Flow Matching for 3D LiDAR Scene Completion
Andrea Matteazzi,Dietmar Tutsch
Main category: cs.CV
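TL;DR: This paper proposes LiFlow, a new flow-matching framework for 3D LiDAR scene completion that resolves the mismatch between training and inference initial distributions in diffusion models and achieves SOTA on multiple metrics.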
Details
Motivation: In autonomous driving, LiDAR point clouds are often affected by occlusion and long-range sparsity, which limits perception; existing methods based on denoising diffusion probabilistic models suffer from a mismatch between training and inference initial distributions.
Method: Proposes LiFlow, the first flow-matching framework for 3D LiDAR scene completion, which uses a nearest-neighbor flow matching loss and a Chamfer distance loss to align point clouds in both local structure and global coverage.
Result: LiFlow achieves SOTA performance on multiple evaluation metrics.
Conclusion: The flow-matching paradigm models point cloud generation more consistently and offers a better alternative to diffusion models for LiDAR scene completion.
Abstract: In autonomous driving scenarios, the collected LiDAR point clouds can be challenged by occlusion and long-range sparsity, limiting the perception of autonomous driving systems. Scene completion methods can infer the missing parts of incomplete 3D LiDAR scenes. Recent methods adopt local point-level denoising diffusion probabilistic models, which require predicting Gaussian noise, leading to a mismatch between training and inference initial distributions. This paper introduces the first flow matching framework for 3D LiDAR scene completion, improving upon diffusion-based methods by ensuring consistent initial distributions between training and inference. The model employs a nearest neighbor flow matching loss and a Chamfer distance loss to enhance both local structure and global coverage in the alignment of point clouds. LiFlow achieves state-of-the-art performance across multiple metrics. Code: https://github.com/matteandre/LiFlow.
[438] Enhancing Indoor Occupancy Prediction via Sparse Query-Based Multi-Level Consistent Knowledge Distillation
Xiang Li,Yupeng Zheng,Pengfei Li,Yilun Chen,Ya-Qin Zhang,Wenchao Ding
Main category: cs.CV
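TL;DR: This paper proposes DiScene, a sparse query-based multi-level knowledge distillation framework for efficient and robust occupancy prediction that reaches SOTA performance on multiple benchmarks.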
Details
Motivation: Existing dense methods waste considerable computation, while sparse query-based methods are not robust enough in complex indoor scenes; both efficiency and accuracy are needed.
Method: Proposes a multi-level consistent knowledge distillation strategy (alignment at four levels: encoder, query, prior, and anchor) and a teacher-guided initialization policy (optimized parameter warm-up).
Result: Reaches 23.2 FPS on Occ-Scannet and outperforms OPUS by 36.1%; with depth integrated, it surpasses EmbodiedOcc by 3.7% with 1.62× faster inference; it also performs well on Occ3D-nuScenes and in in-the-wild scenes.
Conclusion: Through multi-level distillation and improved initialization, DiScene markedly improves the efficiency, robustness, and generalization of sparse occupancy prediction, offering a new paradigm for robotic environment perception.
Abstract: Occupancy prediction provides critical geometric and semantic understanding for robotics but faces efficiency-accuracy trade-offs. Current dense methods suffer computational waste on empty voxels, while sparse query-based approaches lack robustness in diverse and complex indoor scenes. In this paper, we propose DiScene, a novel sparse query-based framework that leverages multi-level distillation to achieve efficient and robust occupancy prediction. In particular, our method incorporates two key innovations: (1) a Multi-level Consistent Knowledge Distillation strategy, which transfers hierarchical representations from large teacher models to lightweight students through coordinated alignment across four levels, including encoder-level feature alignment, query-level feature matching, prior-level spatial guidance, and anchor-level high-confidence knowledge transfer; and (2) a Teacher-Guided Initialization policy, employing optimized parameter warm-up to accelerate model convergence. Validated on the Occ-Scannet benchmark, DiScene achieves 23.2 FPS without depth priors while outperforming our baseline method, OPUS, by 36.1% and even surpassing the depth-enhanced version, OPUS†. With depth integration, DiScene† attains new SOTA performance, surpassing EmbodiedOcc by 3.7% with 1.62× faster inference speed. Furthermore, experiments on the Occ3D-nuScenes benchmark and in-the-wild scenarios demonstrate the versatility of our approach in various environments. Code and models can be accessed at https://github.com/getterupper/DiScene.
[439] VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations
Fatemeh Zargarbashi,Dhruv Agrawal,Jakob Buhmann,Martin Guay,Stelian Coros,Robert W. Sumner
Main category: cs.CV
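TL;DR: This paper proposes a new method that combines RVQ-VAE with contrastive learning and an information leakage loss to disentangle content and style in human motion data, and uses quantized code swapping to perform style transfer without fine-tuning.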
Details
Motivation: Human motion data is semantically rich and structurally complex, and its content and style are hard to separate, which limits downstream tasks such as style transfer.
Method: Uses Residual Vector Quantized Variational Autoencoders (RVQ-VAEs) to build a coarse-to-fine motion representation; introduces contrastive learning and a novel information leakage loss alongside codebook learning so that content and style are organized in different codebooks; designs Quantized Code Swapping, an inference-time technique for style transfer.
Result: Shows strong versatility across several motion applications (style transfer, style removal, motion blending) and supports zero-shot transfer to unseen styles without fine-tuning.
Conclusion: The framework achieves high-quality disentanglement of content and style, improves the interpretability and controllability of motion modeling, and offers a general, efficient paradigm for human motion generation and editing.
Abstract: Human motion data is inherently rich and complex, containing both semantic content and subtle stylistic features that are challenging to model. We propose a novel method for effective disentanglement of the style and content in human motion data to facilitate style transfer. Our approach is guided by the insight that content corresponds to coarse motion attributes while style captures the finer, expressive details. To model this hierarchy, we employ Residual Vector Quantized Variational Autoencoders (RVQ-VAEs) to learn a coarse-to-fine representation of motion. We further enhance the disentanglement by integrating contrastive learning and a novel information leakage loss with codebook learning to organize the content and the style across different codebooks. We harness this disentangled representation using our simple and effective inference-time technique Quantized Code Swapping, which enables motion style transfer without requiring any fine-tuning for unseen styles. Our framework demonstrates strong versatility across multiple inference applications, including style transfer, style removal, and motion blending.
[440] LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
Zhenpeng Huang,Jiaqi Li,Zihan Jia,Xinhao Li,Desen Meng,Lingxue Song,Xi Chen,Liang Li,Limin Wang
Main category: cs.CV
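TL;DR: LongVPO is a two-stage direct preference optimization framework that needs no long-video annotations and lets short-context vision-language models robustly understand ultra-long videos; using synthesized preference triples and recursive multi-segment reasoning tasks, it surpasses existing open-source models with only 16K synthetic examples.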
Details
Motivation: To address the difficulty short-context vision-language models have with ultra-long videos while avoiding reliance on expensive and scarce human annotation of long videos.
Method: Stage 1: generates preference triples by anchoring questions to short clips, inserting distractors, and filtering by visual similarity and question specificity, and approximates the reference model's scoring; Stage 2: uses recursive captioning to produce scene-level metadata, then uses a large language model to construct multi-segment reasoning questions and dispreferred responses for multi-segment reasoning preference alignment.
Result: Outperforms the current best open-source models on multiple long-video benchmarks while maintaining strong performance on short-video benchmarks (e.g., MVBench), using only 16K synthetic examples and no human annotation.
Conclusion: LongVPO offers a scalable, low-cost, high-performance paradigm for long-video understanding that preserves both long- and short-video capability.
Abstract: We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms the state-of-the-art open-source models on multiple long-video benchmarks, while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.
[441] Implicit neural representation of textures
Albert Kwok,Zheyuan Hu,Dounia Hammou
Main category: cs.CV
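TL;DR: This paper explores different neural network architectures as implicit neural representations (INRs) of continuous textures, evaluates them on image quality, memory footprint, and rendering inference time, analyzes the trade-offs among the three, and extends the study to applications such as real-time rendering, mipmap fitting, and INR-space generation.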
Details
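Motivation: Traditional texture representations are mostly discrete, whereas implicit neural representations (INRs) have shown high accuracy and efficiency in many domains; this work explores new INR designs for textures over the continuous UV coordinate space.
Method: Designs several neural network architectures as texture INRs over a continuous domain, modeling textures implicitly in UV coordinate space; evaluates image quality, memory overhead, and rendering inference time through systematic experiments; analyzes the trade-offs among the three; and extends to downstream tasks such as mipmap fitting and INR-space generation.
Result: The proposed texture INRs perform well on image quality but involve a trade-off between memory footprint and inference time; they show feasibility and promise in real-time rendering, mipmap fitting, and INR-space generation.
Conclusion: Continuous texture INRs are a promising alternative that must balance quality, efficiency, and resource consumption; the applications studied confirm their value for graphics and generative tasks.
Abstract: Implicit neural representation (INR) has proven to be accurate and efficient in various domains. In this work, we explore how different neural networks can be designed as a new texture INR, which operates in a continuous manner rather than a discrete one over the input UV coordinate space. Through thorough experiments, we demonstrate that these INRs perform well in terms of image quality, with considerable memory usage and rendering inference time. We analyze the balance between these objectives. In addition, we investigate various related applications in real-time rendering and downstream tasks, e.g., mipmap fitting and INR-space generation.
[442] NAB: Neural Adaptive Binning for Sparse-View CT reconstruction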
Wangduo Xie,Matthew B. Blaschko
Main category: cs.CV
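TL;DR: This paper proposes Neural Adaptive Binning (NAB), a new method that brings rectangular structural priors into sparse-view CT reconstruction through a learnable binning mechanism based on differences of hyperbolic tangent functions, optimized end to end, and markedly improves reconstruction accuracy.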
Details
Motivation: Classic implicit neural networks cannot exploit the rectangular structural priors common in industrial objects; sparse-view CT reconstruction must balance quality and cost.
Method: Proposes Neural Adaptive Binning (NAB): coordinate space is mapped into a binned vector space using a learnable binning mechanism based on differences of shifted hyperbolic tangent functions (with support for rotation around the normal vector), and a neural network then predicts CT attenuation coefficients; all binning parameters (position, size, steepness, rotation) are optimized end to end through gradients from the projection data.
Result: Outperforms existing methods on two industrial datasets; after extension, it also remains robust on medical datasets.
Conclusion: NAB offers a new paradigm for embedding geometric shape priors into neural reconstruction, combining interpretability and flexibility.
Abstract: Computed Tomography (CT) plays a vital role in inspecting the internal structures of industrial objects. Furthermore, achieving high-quality CT reconstruction from sparse views is essential for reducing production costs. While classic implicit neural networks have shown promising results for sparse reconstruction, they are unable to leverage shape priors of objects. Motivated by the observation that numerous industrial objects exhibit rectangular structures, we propose a novel Neural Adaptive Binning (NAB) method that effectively integrates rectangular priors into the reconstruction process. Specifically, our approach first maps coordinate space into a binned vector space. This mapping relies on an innovative binning mechanism based on differences between shifted hyperbolic tangent functions, with our extension enabling rotations around the input-plane normal vector. The resulting representations are then processed by a neural network to predict CT attenuation coefficients. This design enables end-to-end optimization of the encoding parameters -- including position, size, steepness, and rotation -- via gradient flow from the projection data, thus enhancing reconstruction accuracy. By adjusting the smoothness of the binning function, NAB can generalize to objects with more complex geometries. This research provides a new perspective on integrating shape priors into neural network-based reconstruction. Extensive experiments demonstrate that NAB achieves superior performance on two industrial datasets. It also remains robust on medical datasets when the binning function is extended to a more general expression. 
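To make the "difference between shifted hyperbolic tangent functions" concrete, here is a minimal sketch of a soft bin and a rotated soft rectangle. The exact parameterization in the paper is not given in the abstract, so the 0.5 scaling, the product of two 1D bins, and the names `soft_bin`, `soft_rectangle`, `center`, `width`, `steepness`, and `angle` are all assumptions chosen for illustration.

```python
import math


def soft_bin(x, center, width, steepness):
    """Smooth indicator of the interval [center - width/2, center + width/2].

    Built as half the difference of two shifted tanh functions (an assumed
    form): close to 1 inside the interval, close to 0 outside, 0.5 at the edges
    when steepness * width is large.
    """
    rising = math.tanh(steepness * (x - (center - width / 2)))
    falling = math.tanh(steepness * (x - (center + width / 2)))
    return 0.5 * (rising - falling)


def soft_rectangle(x, y, cx, cy, width, height, steepness, angle):
    """Smooth indicator of a rectangle centered at (cx, cy), rotated by angle."""
    # Express the query point in the rectangle's own (rotated) frame.
    dx, dy = x - cx, y - cy
    u = math.cos(angle) * dx + math.sin(angle) * dy
    v = -math.sin(angle) * dx + math.cos(angle) * dy
    # A point is inside the rectangle when it is inside both 1D bins.
    return soft_bin(u, 0.0, width, steepness) * soft_bin(v, 0.0, height, steepness)
```

Because every operation is differentiable in `center`, `width`, `steepness`, and `angle`, such a function could in principle be fitted by gradient descent, which is the property the abstract relies on; lowering `steepness` softens the edges.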
The code will be made available.
[443] Uncertainty-Aware Image Classification In Biomedical Imaging Using Spectral-normalized Neural Gaussian Processes
Uma Meleti,Jeffrey J. Nirschl
Main category: cs.CV
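TL;DR: This paper applies the Spectral-normalized Neural Gaussian Process (SNGP), which uses spectral normalization and replaces the final fully connected layer with a Gaussian process layer, to improve single-model uncertainty estimation and OOD detection in digital pathology, markedly increasing reliability while preserving in-distribution performance.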
Details
Motivation: Current deep learning models for digital pathology are often overconfident and poorly calibrated in out-of-distribution (OOD) settings, which limits clinical trust and adoption; safe, reliable models with intrinsic uncertainty awareness are needed.
Method: Applies spectral normalization and replaces the final fully connected layer with a Gaussian process (GP) layer to build the lightweight uncertainty-aware model SNGP, and compares it with deterministic models and Monte Carlo dropout on six datasets across three biomedical classification tasks.
Result: SNGP matches the baselines on in-distribution performance while markedly improving uncertainty estimation quality and OOD detection.
Conclusion: SNGP provides a practical, deployable uncertainty-aware classification framework for digital pathology that supports safe clinical use and builds pathologists' trust.
Abstract: Accurate histopathologic interpretation is key for clinical decision-making; however, current deep learning models for digital pathology are often overconfident and poorly calibrated in out-of-distribution (OOD) settings, which limits trust and clinical adoption. Safety-critical medical imaging workflows benefit from intrinsic uncertainty-aware properties that can accurately reject OOD input. We implement the Spectral-normalized Neural Gaussian Process (SNGP), a set of lightweight modifications that apply spectral normalization and replace the final dense layer with a Gaussian process layer to improve single-model uncertainty estimation and OOD detection. We evaluate SNGP vs. deterministic and Monte Carlo dropout on six datasets across three biomedical classification tasks: white blood cells, amyloid plaques, and colorectal histopathology. SNGP has comparable in-distribution performance while significantly improving uncertainty estimation and OOD detection. Thus, SNGP or related models offer a useful framework for uncertainty-aware classification in digital pathology, supporting safe deployment and building trust with pathologists.
[444] Unified Personalized Reward Model for Vision Generation
Yibin Wang,Yuhang Zang,Feng Han,Jiazi Bu,Yujie Zhou,Cheng Jin,Jiaqi Wang
Main category: cs.CV
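TL;DR: This paper proposes UnifiedReward-Flex, a unified personalized reward model for vision generation. It combines semantic intent understanding with grounding in visual evidence and dynamically builds hierarchical evaluation criteria, addressing the weak content sensitivity and limited modeling of subjective preferences in existing reward models. It is trained in two stages (SFT distillation followed by DPO), and its advantage in image and video synthesis is validated within the GRPO framework.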
Details
Motivation: Existing multimodal reward models follow a single paradigm or fixed evaluation rubrics and struggle to capture content-specific visual cues and subjective, context-dependent user preferences, leading to systematic misalignment.
Method: Proposes UnifiedReward-Flex: 1) given a prompt and the generated visual content, it jointly interprets semantic intent and grounds it in visual evidence; 2) it dynamically builds fine-grained hierarchical criteria under both predefined and self-generated high-level dimensions; 3) it is trained in two stages, first supervised fine-tuning (SFT) on structured reasoning traces distilled from closed-source VLMs, then direct preference optimization (DPO) on curated preference pairs.
Result: With UnifiedReward-Flex integrated into the GRPO framework, image and video synthesis results are better than those of existing methods, confirming stronger reasoning fidelity and discriminative alignment.
Conclusion: By introducing flexible, context-adaptive reasoning, UnifiedReward-Flex markedly improves the personalization and accuracy of reward modeling for vision generation and offers a new paradigm for building generative systems better aligned with human preferences.
Abstract: Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds on visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. 
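As background for the second training stage, the standard DPO objective for a single preference pair can be written in a few lines. The abstract does not say whether the authors use this exact form or any variant, so treat the sketch as the textbook loss only; the function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy and the
    frozen reference model. Loss = -log(sigmoid(margin)), where margin is
    beta times the difference of the two policy-vs-reference log-ratios.
    """
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and the reference agree, and it shrinks as the policy raises the chosen response's log-probability relative to the rejected one.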
To validate the effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.
[445] Personalized Image Generation via Human-in-the-loop Bayesian Optimization
Rajalaxmi Rajagopalan,Debottam Dutta,Yu-Lin Wei,Romit Roy Choudhury
Main category: cs.CV
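TL;DR: This paper proposes MultiBO, which uses multi-choice preferential Bayesian optimization to iteratively refine diffusion model outputs from users' relative preferences over generated images, substantially narrowing the semantic gap that language prompts cannot cover and enabling more precise personalized image generation.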
Details
Motivation: Language prompts hit an expressive limit in image generation, yet people can still judge how close an image is to the one they have in mind; existing methods do not make effective use of this ability.
Method: Proposes Multi-Choice Preferential Bayesian Optimization (MultiBO): starting from the image $x^{p*}$ generated by the initial prompts, each iteration generates $K$ new images, the user picks the one closest to the target $x^\ast$ (preferential feedback), and that feedback updates a surrogate model and guides the diffusion model in generating the next $K$ images, for $B$ rounds in total.
Result: In qualitative ratings from 30 users and quantitative comparisons against 5 baselines, MultiBO markedly improves the match between generated images and the user's target image, confirming the effectiveness of multi-choice preferential feedback.
Conclusion: Even when the generative model knows nothing about the target image $x^\ast$, a limited number of rounds of multi-choice preferential feedback can greatly improve accuracy; humans' ability to make relative judgments is a key optimization signal beyond language.
Abstract: Imagine Alice has a specific image $x^\ast$ in her mind, say, the view of the street in which she grew up during her childhood. To generate that exact image, she guides a generative model with multiple rounds of prompting and arrives at an image $x^{p*}$. Although $x^{p*}$ is reasonably close to $x^\ast$, Alice finds it difficult to close that gap using language prompts. This paper aims to narrow this gap by observing that even after language has reached its limits, humans can still tell when a new image $x^+$ is closer to $x^\ast$ than $x^{p*}$. Leveraging this observation, we develop MultiBO (Multi-Choice Preferential Bayesian Optimization) that carefully generates $K$ new images as a function of $x^{p*}$, gets preferential feedback from the user, uses the feedback to guide the diffusion model, and ultimately generates a new set of $K$ images. We show that within $B$ rounds of user feedback, it is possible to arrive much closer to $x^\ast$, even though the generative model has no information about $x^\ast$. Qualitative scores from $30$ users, combined with quantitative metrics compared across $5$ baselines, show promising results, suggesting that multi-choice feedback from humans can be effectively harnessed for personalized image generation.
[446] Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory
Ruiqi Wu,Xuanhua He,Meng Cheng,Tianyu Yang,Yong Zhang,Zhuoliang Kang,Xunliang Cai,Xiaoming Wei,Chunle Guo,Chongyi Li,Ming-Ming Cheng
Main category: cs.CV
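TL;DR: This paper proposes Infinite-World, a robust interactive world model that maintains coherent visual memory over 1000+ frames in complex real-world environments; pose-free memory compression, uncertainty-aware action labeling, and revisit-dense finetuning overcome the training difficulties caused by noisy poses and sparse viewpoint revisits in real videos.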
Details
Motivation: Existing world models work well on synthetic data but, on real-world videos, are held back by noisy pose estimation and sparse viewpoint revisits, and lack an effective training paradigm.
Method: Introduces three core components: 1) a Hierarchical Pose-free Memory Compressor (HPMC) that recursively compresses historical latents and is optimized jointly with the generative backbone; 2) an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic to suppress trajectory noise; 3) a finetuning strategy based on a small revisit-dense video set.
Result: Objective metrics and user studies both show that Infinite-World outperforms existing methods in visual quality, action controllability, and spatial consistency.
Conclusion: By removing the dependence on geometric priors, making action learning more robust, and efficiently activating long-range loop closure, Infinite-World markedly improves the practicality and scalability of world models in real scenes.
Abstract: We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradigm for real-world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model's long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.
[447] Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation
Xinshun Wang,Peiming Li,Ziyi Wang,Zhongbin Fang,Zhichao Deng,Songtao Wu,Jason Li,Mengyuan Liu
Main category: cs.CV
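TL;DR: This paper proposes Superman, a framework that unifies visual perception with skeleton-based temporal motion generation. A vision-guided motion tokenizer and a unified MLLM architecture address the split between perception and generation, the weak temporal modeling, and the disconnect between the visual and motion modalities in current human motion analysis.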
Details
Motivation: Existing methods have three problems: 1) perception models (which understand motion from video) are split from generation models (which cannot process visual input directly); 2) generative multimodal LLMs are limited to single-frame static poses and struggle to model temporal motion; 3) motion vocabularies are built from skeleton data alone, detached from the visual domain.
Method: The Superman framework has two parts: 1) a Vision-Guided Motion Tokenizer that exploits the geometric alignment between 3D skeletons and visual data for cross-modal joint learning, building a unified motion vocabulary; 2) a unified MLLM architecture grounded in this motion language that supports multiple tasks, including video-to-3D pose estimation (perception) and motion prediction and in-betweening (generation).
Result: On standard benchmarks such as Human3.6M, Superman achieves state-of-the-art or competitive performance on all motion analysis tasks, validating the effectiveness and scalability of unified modeling.
Conclusion: Superman offers a more efficient and scalable unified paradigm for skeleton-based generative motion analysis, bridging visual perception and motion generation.
Abstract: Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between "perception" models that understand motion from video but only output text, and "generation" models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Leveraging the natural geometric alignment between 3D skeletons and visual data, this module pioneers robust joint learning from both modalities, creating a unified, cross-modal motion vocabulary. Second, grounded in this motion language, a single, unified MLLM architecture is trained to handle all tasks. This module flexibly processes diverse, temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks. This showcases a more efficient and scalable path for generative motion analysis using skeletons.
[448] ReasonEdit: Editing Vision-Language Models using Human Reasoning
Jiaxing Qiu,Kaihua Hou,Roxana Daneshjou,Ahmed Alaa,Thomas Hartvigsen
Main category: cs.CV
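TL;DR: This paper proposes ReasonEdit, the first vision-language model (VLM) editor that lets users explain their reasoning during editing. It retrieves stored human reasoning through a topology-balanced multimodal embedding and achieves state-of-the-art edit generalization on multiple datasets.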
Details
Motivation: Existing VLM editing methods do not cover complex tasks that require humans and models to reason together, and they neither model nor use the human reasoning process.
Method: ReasonEdit continuously stores human reasoning in a codebook and uses a topology-balanced multimodal embedding method, inspired by network science, to retrieve the relevant facts at inference time.
Result: Across four VLMs and multiple rationale-based visual question answering datasets, ReasonEdit achieves the best editing performance, showing that using human reasoning significantly improves edit generalization.
Conclusion: Explicitly incorporating human reasoning into the VLM editing process is feasible and effective; ReasonEdit offers a new paradigm for model editing on reasoning-heavy tasks.
Abstract: Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason about images. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.
[449] Catalyst: Out-of-Distribution Detection via Elastic Scaling
Abid Hassan,Tuan Ngo,Saad Shafiq,Nenad Medvidovic
Main category: cs.CV
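TL;DR: This paper proposes Catalyst, a framework that uses channel statistics of the pre-pooling feature map (such as mean and standard deviation) to produce an input-dependent scaling factor $\gamma$ and elastically scale existing OOD detection scores, significantly improving the OOD detection performance of multiple baselines.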
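The elastic scaling summarized above can be illustrated with a minimal NumPy sketch. The abstract states only that the scaling factor is computed from raw channel statistics and multiplied into a baseline score; the function names, the `weights` parameter, and the rule that combines the statistics into one factor are assumptions made here for illustration and are not taken from the paper.

```python
import numpy as np

def channel_stats(feature_map):
    """Per-channel mean, std, and max of a pre-pooling feature map (C, H, W)."""
    flat = feature_map.reshape(feature_map.shape[0], -1)
    return flat.mean(axis=1), flat.std(axis=1), flat.max(axis=1)

def elastic_scale(base_score, feature_map, weights=(1.0, 1.0, 1.0)):
    """Multiply a baseline OOD score by an input-dependent factor gamma.

    Assumption: gamma is a weighted sum of the channel-averaged statistics.
    The paper's actual rule for computing gamma may differ.
    """
    mean, std, peak = channel_stats(feature_map)
    gamma = (weights[0] * mean.mean()
             + weights[1] * std.mean()
             + weights[2] * peak.mean())
    return gamma * base_score

# Usage: scale a precomputed baseline score (for example an energy score)
# using the activations of the last convolutional block for the same input.
rng = np.random.default_rng(0)
features = rng.random((8, 4, 4))  # hypothetical (C, H, W) feature map
scaled = elastic_scale(2.5, features)
```

Because the factor depends only on the feature map, the same function can wrap any scalar baseline score without retraining the network, which matches the post-hoc setting described in the entry.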
Details
Motivation: Existing post-hoc OOD detection methods rely too heavily on logits or on the feature vector after global average pooling, ignoring the rich channel-level statistical signal in the pre-pooling feature map.
Method: Catalyst computes channel statistics (such as mean, standard deviation, and maximum activation) from the pre-pooling feature map on the fly, produces an input-dependent scaling factor $\gamma$, and fuses it multiplicatively into the existing OOD score to achieve elastic scaling.
Result: Catalyst reduces the average false positive rate by 32.87%, 27.94%, and 22.25% on CIFAR-10 (ResNet-18), CIFAR-100 (ResNet-18), and ImageNet (ResNet-50) respectively, and is compatible with multiple baselines (such as Energy, ReAct, and KNN).
Conclusion: Channel statistics of the pre-pooling feature map are an undervalued but highly useful OOD detection signal; Catalyst is a general, plug-and-play enhancement framework that is complementary to existing methods.
Abstract: Out-of-distribution (OOD) detection is critical for the safe deployment of deep neural networks. State-of-the-art post-hoc methods typically derive OOD scores from the output logits or penultimate feature vector obtained via global average pooling (GAP). We contend that this exclusive reliance on the logit or feature vector discards a rich, complementary signal: the raw channel-wise statistics of the pre-pooling feature map lost in GAP. In this paper, we introduce Catalyst, a post-hoc framework that exploits these under-explored signals. Catalyst computes an input-dependent scaling factor ($\gamma$) on-the-fly from these raw statistics (e.g., mean, standard deviation, and maximum activation). This $\gamma$ is then fused with the existing baseline score, multiplicatively modulating it (an "elastic scaling") to push the ID and OOD distributions further apart. We demonstrate Catalyst is a generalizable framework: it seamlessly integrates with logit-based methods (e.g., Energy, ReAct, SCALE) and also provides a significant boost to distance-based detectors like KNN. As a result, Catalyst achieves substantial and consistent performance gains, reducing the average False Positive Rate by 32.87% on CIFAR-10 (ResNet-18), 27.94% on CIFAR-100 (ResNet-18), and 22.25% on ImageNet (ResNet-50). Our results highlight the untapped potential of pre-pooling statistics and demonstrate that Catalyst is complementary to existing OOD detection approaches.
[450] SelvaMask: Segmenting Trees in Tropical Forests and Beyond
Simon-Olivier Duguay,Hugo Baudchon,Etienne Laliberté,Helene Muller-Landau,Gonzalo Rivas-Torres,Arthur Ouaknine
Main category: cs.CV
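TL;DR: This paper introduces SelvaMask, a new dataset of more than 8,800 annotated tropical forest tree crowns, and proposes a modular detection-segmentation pipeline built on vision foundation models that achieves state-of-the-art performance on tropical forest tree crown segmentation.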
Details
Motivation: Existing transformer-based models for individual tree crown segmentation still perform poorly in complex scenes such as tropical forests, and the field lacks both a high-quality, representative annotated tropical forest dataset and methods adapted to it.
Method: Builds SelvaMask, a tropical forest tree crown dataset covering sites in Panama, Brazil, and Ecuador, and designs a modular detection-segmentation pipeline that combines a domain-specific detection prompter with vision foundation models (VFMs).
Result: The proposed method reaches state-of-the-art performance on SelvaMask, clearly outperforming zero-shot generalist models and fully supervised end-to-end methods, and its generalization is validated on external tropical and temperate datasets.
Conclusion: SelvaMask is both a challenging new benchmark and a driver of generalizable AI methods for monitoring tropical forests and forests more broadly.
Abstract: Tropical forests harbor most of the planet's tree biodiversity and are critical to global ecological balance. Canopy trees in particular play a disproportionate role in carbon storage and functioning of these ecosystems. Studying canopy trees at scale requires accurate delineation of individual tree crowns, typically performed using high-resolution aerial imagery. Despite advances in transformer-based models for individual tree crown segmentation, performance remains low in most forests, especially tropical ones. To this end, we introduce SelvaMask, a new tropical dataset containing over 8,800 manually delineated tree crowns across three Neotropical forest sites in Panama, Brazil, and Ecuador. SelvaMask features comprehensive annotations, including an inter-annotator agreement evaluation, capturing the dense structure of tropical forests and highlighting the difficulty of the task. Leveraging this benchmark, we propose a modular detection-segmentation pipeline that adapts vision foundation models (VFMs) using a domain-specific detection prompter. Our approach reaches state-of-the-art performance, outperforming both zero-shot generalist models and fully supervised end-to-end methods in dense tropical forests. We validate these gains on external tropical and temperate datasets, demonstrating that SelvaMask serves as both a challenging benchmark and a key enabler for generalized forest monitoring. Our code and dataset will be released publicly.
[451] UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing
Dianyi Wang,Chaofan Ma,Feng Han,Size Wu,Wei Song,Yibin Wang,Zhixiong Zhang,Tianhang Wang,Siyuan Wang,Zhongyu Wei,Jiaqi Wang
Main category: cs.CV
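TL;DR: This paper proposes UniReason, a framework that unifies text-to-image generation and image editing through a dual reasoning paradigm. Generation is treated as planning enhanced by world knowledge, and editing is used for fine-grained visual refinement through self-reflection. The authors also build a large-scale reasoning-oriented dataset and report advanced performance on several reasoning-intensive benchmarks.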
Details
Motivation: Existing unified multimodal models reason poorly on complex synthesis tasks and treat text-to-image generation and image editing as isolated capabilities, with no synergy or cognitive consistency between the tasks.
Method: The UniReason unified framework adopts a dual reasoning paradigm: 1) generation is modeled as world knowledge-enhanced planning to inject implicit constraints; 2) editing capabilities are used for fine-grained visual correction through self-reflection. The two share a representation. The authors also build a reasoning-centric dataset of about 300k samples covering five major knowledge domains, plus an agent-generated corpus for visual self-correction.
Result: UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench, and UniREditBench while maintaining superior general synthesis capabilities.
Conclusion: UniReason unifies generation and editing under a planning-then-refinement paradigm inspired by human cognition, confirming that joint modeling improves multimodal reasoning and synthesis.
Abstract: Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning to inject implicit constraints, and leverage editing capabilities for fine-grained visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense and physics) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.
[452] Multi-head automated segmentation by incorporating detection head into the contextual layer neural network
Edwin Kys,Febian Febian
Main category: cs.CV
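TL;DR: This paper proposes a gated multi-head Transformer architecture based on Swin U-Net that combines inter-slice context with a parallel detection head. Detection results gate the segmentation predictions to suppress false positives in anatomically invalid slices, significantly improving the anatomical plausibility and robustness of auto-segmentation in radiotherapy.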
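The detection-gated segmentation and Tversky loss mentioned in this entry can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the hard `threshold` used for gating, the default `alpha` and `beta` values, and all function names are hypothetical, and the paper may gate softly or weight the loss differently.

```python
import numpy as np

def gate_segmentation(seg_probs, det_probs, threshold=0.5):
    """Zero out segmentation maps on slices the detector marks as empty.

    seg_probs: (S, H, W) per-slice foreground probabilities.
    det_probs: (S,) per-slice probability that the structure is present.
    Assumption: a hard threshold decides which slices are kept.
    """
    keep = (det_probs >= threshold).astype(seg_probs.dtype)
    return seg_probs * keep[:, None, None]

def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """One minus the Tversky index TP / (TP + alpha*FP + beta*FN).

    With alpha = beta = 0.5 this reduces to the Dice loss.
    """
    tp = np.sum(pred * target)
    fp = np.sum(pred * (1.0 - target))
    fn = np.sum((1.0 - pred) * target)
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# Usage: two slices, the detector rejects the second one.
seg = np.full((2, 2, 2), 0.9)
det = np.array([0.9, 0.1])
gated = gate_segmentation(seg, det)
```

Raising `beta` above `alpha` penalizes false negatives more than false positives, which is the usual reason to prefer a Tversky loss when the foreground class is rare.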
Details
Motivation: Conventional deep learning auto-segmentation models tend to produce anatomically implausible false positives ("hallucinations") in slices that lack the target structure, which undermines clinical reliability in radiotherapy.
Method: A gated multi-head Transformer architecture based on Swin U-Net that integrates inter-slice context modeling with a parallel detection head (an MLP for slice-level structure detection and a context-enhanced stream for pixel-level segmentation). The detection output acts as a gating signal that suppresses segmentation responses in invalid slices, and a slice-wise Tversky loss mitigates class imbalance.
Result: On the Prostate-Anatomical-Edge-Cases dataset, the gated model achieves a mean Dice loss of 0.013 ± 0.036, substantially better than the non-gated baseline (0.732 ± 0.314). Detection probabilities correlate strongly with anatomical presence and effectively eliminate spurious segmentations, whereas the non-gated model shows high variability and persistent false positives.
Conclusion: Detection-driven gating significantly improves the anatomical plausibility and robustness of auto-segmentation, eliminating hallucinated predictions without degrading segmentation quality in valid slices, and may improve the reliability of clinical radiotherapy auto-contouring workflows.
Abstract: Deep learning-based auto-segmentation is increasingly used in radiotherapy, but conventional models often produce anatomically implausible false positives, or hallucinations, in slices lacking target structures. We propose a gated multi-head Transformer architecture based on Swin U-Net, augmented with inter-slice context integration and a parallel detection head, which jointly performs slice-level structure detection via a multi-layer perceptron and pixel-level segmentation through a context-enhanced stream. Detection outputs gate the segmentation predictions to suppress false positives in anatomically invalid slices, and training uses slice-wise Tversky loss to address class imbalance. Experiments on the Prostate-Anatomical-Edge-Cases dataset from The Cancer Imaging Archive demonstrate that the gated model substantially outperforms a non-gated segmentation-only baseline, achieving a mean Dice loss of $0.013 \pm 0.036$ versus $0.732 \pm 0.314$, with detection probabilities strongly correlated with anatomical presence, effectively eliminating spurious segmentations. In contrast, the non-gated model exhibited higher variability and persistent false positives across all slices. These results indicate that detection-based gating enhances robustness and anatomical plausibility in automated segmentation applications, reducing hallucinated predictions without compromising segmentation quality in valid slices, and offers a promising approach for improving the reliability of clinical radiotherapy auto-contouring workflows.
[453] PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss
Zehong Ma,Ruihan Xu,Shiliang Zhang
Main category: cs.CV
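TL;DR: PixelGen proposes a pixel-level diffusion generation framework with perceptual supervision. LPIPS and DINO perceptual losses guide the model to learn a more meaningful perceptual manifold, reaching an FID of 5.11 on ImageNet-256 without a VAE or latent representation, which simplifies the generative paradigm.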