Table of Contents
cs.CL [Back]
[1] Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents
Kaifeng Wu,Junyan Wu,Qiang Liu,Jiarui Zhang,Wen Xu
Main category: cs.CL
TL;DR: 本文提出了一种基于Qwen3-0.6B的判别式长文档主题分割模型,通过跨窗口上下文融合、边界分类头和重叠滑动窗口策略,支持单次输入长达13k token,并结合带标量校正的向量融合方法提升检索效率;在WIKI-727K数据集上F1优于基线模型且推理速度快两个数量级。
Details
Motivation: 现有方法在超长文本主题分割中存在明显不足:判别式模型受限于固定窗口、无法建模文档级语义;生成式大模型虽能输出段落边界但推理开销大、难以支持长输入。 Method: 基于Qwen3-0.6B构建判别式分割模型,引入跨窗口context融合层与边界分类头,采用重叠滑动窗口策略;并提出带标量校正的向量融合方法,将超长段落表征压缩为无损单向量。 Result: 在WIKI-727K数据集上,相比三个基于Qwen2-0.5B的生成式模型(Jina发布),本方法获得更高macro-F1,且推理速度提升两个数量级。 Conclusion: 所提方法兼顾高精度与高效率,显著提升了长文档主题分割的实用性与可扩展性,尤其适用于下游检索等实际应用场景。 Abstract: Long-document topic segmentation plays an important role in information retrieval and document understanding, yet existing methods still show clear shortcomings in ultra-long text settings. Traditional discriminative models are constrained by fixed windows and cannot model document-level semantics; generative large language models can output paragraph boundaries, but inference is expensive and long inputs are difficult to support. To address these issues, we propose a discriminative segmentation model based on Qwen3-0.6B. On top of the backbone network, we add a cross-window context fusion layer and a boundary classification head, and combine them with an overlapping sliding-window strategy. Our model supports single-pass inputs of up to 13k tokens and can be extended to ultra-long documents for paragraph boundary detection. To further enhance downstream retrieval efficiency, we derive a vector fusion method with scalar correction, which compresses the representation of ultra-long segments into a single vector without semantic loss. Experiments on the Wikipedia long-document topic segmentation dataset WIKI-727K show that, compared with three generative models based on Qwen2-0.5B released by Jina, our method achieves a better macro-averaged F1 and delivers two orders of magnitude faster inference, substantially improving the practicality and scalability of long-document processing.[2] Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages
Swati Sharma,Divya V. Sharma,Anubha Gupta
Main category: cs.CL
TL;DR: 本文提出Task-Lens,对印度50个语音数据集(涵盖26种语言)在9个下游语音任务中的适用性进行跨任务评估,发现大量数据集蕴含未被利用的元数据,可支持多任务;并指出当前资源严重缺失的任务与语言,为数据建设提供指导。
Details
Motivation: 低资源语言(尤其是印度等语言多样性高的国家)中,现有语音数据集缺乏跨任务的系统性评估,导致资源利用率低、数据稀缺问题加剧。 Method: 构建Task-Lens跨任务调查框架:1)分析50个印度语音数据集的元数据与属性对9类下游任务的适配性;2)提出面向任务的数据增强策略;3)识别任务与语言层面的资源缺口。 Result: 发现多数印度语音数据集含有可支撑多个下游任务的未被挖掘元数据;明确了若干严重缺乏资源的下游任务和印度语言。 Conclusion: Task-Lens为提升现有数据集复用率、指导针对性数据建设提供了系统性方法和实证依据,推动包容性多语言语音技术发展。 Abstract: The rising demand for inclusive speech technologies amplifies the need for multilingual datasets for Natural Language Processing (NLP) research. However, limited awareness of existing task-specific resources in low-resource languages hinders research. This challenge is especially acute in linguistically diverse countries, such as India. Cross-task profiling of existing Indian speech datasets can alleviate the data scarcity challenge. This involves investigating the utility of datasets across multiple downstream tasks rather than focusing on a single task. Prior surveys typically catalogue datasets for a single task, leaving comprehensive cross-task profiling as an open opportunity. Therefore, we propose Task-Lens, a cross-task survey that assesses the readiness of 50 Indian speech datasets spanning 26 languages for nine downstream speech tasks. First, we analyze which datasets contain metadata and properties suitable for specific tasks. Next, we propose task-aligned enhancements to unlock datasets to their full downstream potential. Finally, we identify tasks and Indian languages that are critically underserved by current resources. Our findings reveal that many Indian speech datasets contain untapped metadata that can support multiple downstream tasks. By uncovering cross-task linkages and gaps, Task-Lens enables researchers to explore the broader applicability of existing datasets and to prioritize dataset creation for underserved tasks and languages.[3] Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning
Chris Samarinas,Haw-Shiuan Chang,Hamed Zamani
Main category: cs.CL
TL;DR: 本文提出SLATE框架,通过截断式步级采样和密集的LLM-as-judge奖励机制,解决大语言模型在搜索引擎推理中信用分配困难的问题,显著降低策略梯度方差,提升多跳问答性能。
Details
Motivation: 现有基于强化学习的搜索引擎推理方法面临信用分配难题:稀疏奖励无法定位关键决策,而过程奖励依赖启发式指标且梯度方差高。 Method: SLATE框架包含两个核心设计:(1) 截断式步级采样——生成共享前缀、仅下一步不同的k条轨迹;(2) 密集LLM-as-judge奖励——由强LLM对每步推理、查询与答案质量进行细粒度评估。 Result: 理论证明截断采样可将优势估计方差最多降低T倍(T为步数);实验在7个QA基准上验证SLATE持续优于稀疏奖励和过程奖励基线,尤其在多跳任务和小模型上增益最大。 Conclusion: SLATE通过更精准的信用分配和更低方差的梯度更新,有效提升了语言模型在搜索增强推理中的训练效率与泛化能力。 Abstract: Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.[4] CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era
Zhengqing Yuan,Kaiwen Shi,Zheyuan Zhang,Lichao Sun,Nitesh V. Chawla,Yanfang Ye
Main category: cs.CL
TL;DR: 本文提出首个针对科学写作中幻觉引用(hallucinated citations)的综合基准与检测框架,通过多智能体验证流程实现对引用真实性的自动化、可解释评估,并构建了大规模人工验证数据集和统一评估指标。
Details
Motivation: 大型语言模型(LLMs)易生成看似合理但实际不存在的虚假引用,已影响顶会论文质量;而人工核查日益不可行,现有自动工具在格式鲁棒性和评估标准化方面存在严重不足。 Method: 设计多智能体验证流水线,将引用核查分解为声明提取、证据检索、段落匹配、推理判断和校准决策五个阶段;构建跨学科、人工标注的大规模数据集;定义引用忠实度(citation faithfulness)与证据对齐度(evidence alignment)等统一评估指标。 Result: 在多个SOTA LLM上实验表明其存在大量引用错误;所提框架在准确率与可解释性上显著优于先前方法。 Conclusion: 本工作提供了首个可扩展的LLM时代引文审计基础设施与实用工具,有助于提升科学研究中引用的可信度与完整性。 Abstract: Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues, exposing vulnerabilities in peer review. Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment to assess whether a cited source truly supports its claim. We construct a large-scale human-validated dataset across domains and define unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era and practical tools to improve the trustworthiness of scientific references.[5] FHIRPath-QA: Executable Question Answering over FHIR Electronic Health Records
Michael Frew,Nishit Bheda,Bryan Tripp
Main category: cs.CL
TL;DR: 本文提出FHIRPath-QA数据集与文本到FHIRPath查询合成范式,以提升患者对电子健康记录(EHR)的精准、可信问答能力,减少大语言模型(LLM)幻觉与计算开销,并验证监督微调可显著提升LLM在FHIRPath生成任务上的性能。
Details
Motivation: 现有EHR界面难以支持患者对自身健康数据的精准、可信问答;基于检索的LLM临床问答方法计算效率低、易产生幻觉、难部署于真实EHR系统。 Method: 构建首个面向患者特定问题的开放标准FHIRPath问答数据集(FHIRPath-QA),涵盖14,000+自然语言问题及其对应经验证的FHIRPath查询与答案;提出文本到FHIRPath的问答范式,将自由文本生成转为结构化查询合成;基于MIMIC-IV on FHIR Demo构建,并评估主流LLM在该任务上的表现及微调效果。 Result: 当前SOTA LLM在处理患者语言歧义和FHIRPath查询合成任务上表现较差,但经监督微调后性能显著提升;文本到FHIRPath合成可成为安全、高效、互操作的消费者健康应用的基础。 Conclusion: FHIRPath-QA数据集与文本到FHIRPath范式为临床问答提供了更可靠、可解释、易部署的新路径,推动LLM在真实医疗场景中的安全落地。 Abstract: Though patients are increasingly granted digital access to their electronic health records (EHRs), existing interfaces may not support precise, trustworthy answers to patient-specific questions. Large language models (LLM) show promise in clinical question answering (QA), but retrieval-based approaches are computationally inefficient, prone to hallucination, and difficult to deploy over real-life EHRs. In this work, we introduce FHIRPath-QA, the first open dataset and benchmark for patient-specific QA that includes open-standard FHIRPath queries over real-world clinical data. We propose a text-to-FHIRPath QA paradigm that shifts reasoning from free-text generation to FHIRPath query synthesis, significantly reducing LLM usage. Built on MIMIC-IV on FHIR Demo, the dataset pairs over 14k natural language questions in patient and clinician phrasing with validated FHIRPath queries and answers. Further, we demonstrate that state-of-the-art LLMs struggle to deal with ambiguity in patient language and perform poorly in FHIRPath query synthesis. However, they benefit strongly from supervised fine-tuning. Our results highlight that text-to-FHIRPath synthesis has the potential to serve as a practical foundation for safe, efficient, and interoperable consumer health applications, and our dataset and benchmark serve as a starting point for future research on the topic. The full dataset and generation code is available at: https://github.com/mooshifrew/fhirpath-qa.[6] IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation
Md Mofijul Islam,Md Sirajus Salekin,Joe King,Priyashree Roy,Vamsi Thilak Gudi,Spencer Romo,Akhil Nooney,Boyi Xie,Bob Strahan,Diego A. Socolinsky
Main category: cs.CL
TL;DR: 本文提出IDP Accelerator框架,通过四个模块(DocSplit、可配置抽取模块、智能分析模块、规则验证模块)实现端到端智能文档处理,在医疗等行业落地验证了高准确率、低延迟与低成本优势。
Details
Motivation: 传统文档处理流程难以应对多文档包、复杂推理和严格合规要求,而大语言模型虽支持零样本抽取,仍需系统性框架支撑工业级文档智能。 Method: 构建IDP Accelerator框架,包含:(1) DocSplit——基于BIO标注的多模态文档包分割数据集与分类器;(2) 可配置抽取模块——利用多模态大模型进行结构化信息抽取;(3) Agentic Analytics模块——遵循MCP协议,通过安全沙箱执行代码提供数据访问;(4) 规则验证模块——用LLM驱动逻辑替代确定性引擎完成复杂合规检查,并提供交互式Web演示界面。 Result: 在领先医疗服务商生产部署中达到98%分类准确率、处理延迟降低80%、运营成本下降77%;框架已开源并提供在线演示。 Conclusion: IDP Accelerator为工业级智能文档处理提供了可扩展、合规、高效且用户友好的代理式AI解决方案,显著优于传统流水线方法。 Abstract: Understanding and extracting structured insights from unstructured documents remains a foundational challenge in industrial NLP. While Large Language Models (LLMs) enable zero-shot extraction, traditional pipelines often fail to handle multi-document packets, complex reasoning, and strict compliance requirements. We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging to segment complex document packets; (2) configurable Extraction Module leveraging multimodal LLMs to transform unstructured content into structured data; (3) Agentic Analytics Module, compliant with the Model Context Protocol (MCP) providing data access through secure, sandboxed code execution; and (4) Rule Validation Module replacing deterministic engines with LLM-driven logic for complex compliance checks. The interactive demonstration enables users to upload document packets, visualize classification results, and explore extracted data through an intuitive web interface. We demonstrate effectiveness across industries, highlighting a production deployment at a leading healthcare provider achieving 98% classification accuracy, 80% reduced processing latency, and 77% lower operational costs over legacy baselines. IDP Accelerator is open-sourced with a live demonstration available to the community.[7] Humans and LLMs Diverge on Probabilistic Inferences
Gaurav Kamath,Sreenath Madathil,Sebastian Schuster,Marie-Catherine de Marneffe,Siva Reddy
Main category: cs.CL
TL;DR: 本文提出了ProbCOPA数据集,用于评估大语言模型在非确定性、概率性推理任务上的表现,并发现当前主流推理型大语言模型无法复现人类的概率判断分布。
Details
Motivation: 现有推理型大语言模型在逻辑和数学等确定性任务上表现优异,但在人类常见的、基于有限信息进行概率性推断的开放性推理任务上的行为尚不明确,亟需专门评测工具和深入分析。 Method: 构建了包含210个手工设计英文概率推理样本的ProbCOPA数据集,每个样本由25–30名人类参与者标注推理可能性;对比8个前沿推理型大语言模型与人类响应分布;并分析模型的推理链以识别其常用推理模式。 Result: 人类判断呈现显著的梯度化与多样性;所有被测LLM均未能拟合人类的概率响应分布;模型推理链中存在一种共通的(但非人类式的)推理模式。 Conclusion: 人类与LLM在概率性推理上存在系统性差异,提示推理能力评测必须超越确定性范式,纳入概率性、开放性场景。 Abstract: Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25--30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.[8] France or Spain or Germany or France: A Neural Account of Non-Redundant Redundant Disjunctions
Sasha Boguraev,Qing Yao,Kyle Mahowald
Main category: cs.CL
TL;DR: 本文探讨了看似冗余的句子(如'她将去法国或西班牙,或或许去德国或法国')在特定语境下为何变得可接受,并通过人类行为实验和大语言模型分析,揭示了Transformer模型中冗余避免的神经机制:绑定上下文相关信息与重复词汇,并通过归纳头选择性关注这些上下文许可的表征。
Details
Motivation: 解释看似冗余但实际可接受的句子现象,补充现有符号化分析,提供基于人工神经机制的新视角。 Method: 结合人类行为实验与大语言模型分析,探究Transformer中冗余避免的神经机制,重点关注上下文信息绑定与归纳头的注意力选择。 Result: 发现语言模型通过绑定上下文相关信息到重复词汇,并由Transformer归纳头选择性关注这些表征,实现冗余避免;该机制在人类和大模型中均稳健存在。 Conclusion: 冗余避免源于两种交互的神经机制,为语境敏感的语义解释提供了新洞见,并可与符号化分析互补。 Abstract: Sentences like "She will go to France or Spain, or perhaps to Germany or France." appear formally redundant, yet become acceptable in contexts such as "Mary will go to a philosophy program in France or Spain, or a mathematics program in Germany or France." While this phenomenon has typically been analyzed using symbolic formal representations, we aim to provide a complementary account grounded in artificial neural mechanisms. We first present new behavioral evidence from humans and large language models demonstrating the robustness of this apparent non-redundancy across contexts. We then show that, in language models, redundancy avoidance arises from two interacting mechanisms: models learn to bind contextually relevant information to repeated lexical items, and Transformer induction heads selectively attend to these context-licensed representations. We argue that this neural explanation sheds light on the mechanisms underlying context-sensitive semantic interpretation, and that it complements existing symbolic analyses.[9] Multi-Agent Causal Reasoning for Suicide Ideation Detection Through Online Conversations
Jun Li,Xiangmeng Wang,Haoyang Li,Yifei Yan,Shijie Zhang,Hong Va Leong,Ling Feng,Nancy Xiaonan Yu,Qing Li
Main category: cs.CL
TL;DR: 本文提出了一种多智能体因果推理(MACR)框架,通过结合认知评估理论生成反事实用户反应,并利用前门调整策略缓解隐藏偏差,以提升社交媒体中自杀风险识别的准确性与鲁棒性。
Details
Motivation: 现有基于社交媒体的自杀风险检测方法依赖预定义规则、忽略用户从众和模仿等隐藏影响,导致覆盖范围窄、偏差大。 Method: 提出多智能体因果推理(MACR)框架,包含推理智能体(基于认知评估理论生成多维度反事实用户反应)和偏差感知决策智能体(采用前门调整策略校正隐藏偏差)。 Result: 在真实对话数据集上的实验表明,MACR在自杀风险识别任务中显著优于基线方法,具备良好有效性和鲁棒性。 Conclusion: MACR通过引入反事实推理与因果偏差校正,有效缓解了隐藏社会影响带来的偏差,增强了用户交互的上下文建模能力,为公共卫生领域的在线风险监测提供了新范式。 Abstract: Suicide remains a pressing global public health concern. While social media platforms offer opportunities for early risk detection through online conversation trees, existing approaches face two major limitations: (1) They rely on predefined rules (e.g., quotes or relies) to log conversations that capture only a narrow spectrum of user interactions, and (2) They overlook hidden influences such as user conformity and suicide copycat behavior, which can significantly affect suicidal expression and propagation in online communities. To address these limitations, we propose a Multi-Agent Causal Reasoning (MACR) framework that collaboratively employs a Reasoning Agent to scale user interactions and a Bias-aware Decision-Making Agent to mitigate harmful biases arising from hidden influences. The Reasoning Agent integrates cognitive appraisal theory to generate counterfactual user reactions to posts, thereby scaling user interactions. It analyses these reactions through structured dimensions, i.e., cognitive, emotional, and behavioral patterns, with a dedicated sub-agent responsible for each dimension. The Bias-aware Decision-Making Agent mitigates hidden biases through a front-door adjustment strategy, leveraging the counterfactual user reactions produced by the Reasoning Agent. Through the collaboration of reasoning and bias-aware decision making, the proposed MACR framework not only alleviates hidden biases, but also enriches contextual information of user interactions with counterfactual knowledge. Extensive experiments on real-world conversational datasets demonstrate the effectiveness and robustness of MACR in identifying suicide risk.[10] BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation
Yun Wang,Xuansheng Wu,Jingyuan Huang,Lei Liu,Xiaoming Zhai,Ninghao Liu
Main category: cs.CL
TL;DR: 本文提出BRIDGE框架,通过合成高分英语学习者(ELL)样本以缓解自动化评分系统中的表征偏差问题,从而减少对ELL学生的预测偏差,提升评估公平性。
Details
Motivation: 自动化评分系统在教育评估中广泛应用,但存在偏差放大风险,尤其对英语学习者(ELL)等少数群体,因训练数据中高分ELL样本稀缺,导致模型偏向主流语言模式,低估ELL学生真实能力。 Method: 提出BRIDGE框架:将高分非ELL样本中符合评分标准的知识内容‘粘贴’到真实ELL语言模式中,合成高质量高分ELL样本;引入判别器保障合成样本质量。 Result: 在加州科学测试(CAST)数据集上的实验表明,BRIDGE显著降低对高分ELL学生的预测偏差,同时保持整体评分性能,其公平性提升效果媲美增加真实人工标注数据。 Conclusion: BRIDGE为低资源教育评估场景提供了一种高效、低成本的偏差缓解方案,有助于实现更公平的大规模自动化评分。 Abstract: In the field of educational assessment, automated scoring systems increasingly rely on deep learning and large language models (LLMs). However, these systems face significant risks of bias amplification, where model prediction gaps between student groups become larger than those observed in training data. This issue is especially severe for underrepresented groups such as English Language Learners (ELLs), as models may inherit and further magnify existing disparities in the data. We identify that this issue is closely tied to representation bias: the scarcity of minority (high-scoring ELL) samples makes models trained with empirical risk minimization favor majority (non-ELL) linguistic patterns. Consequently, models tend to under-predict ELL students who even demonstrate comparable domain knowledge but use different linguistic patterns, thereby undermining the fairness of automated scoring outcomes. To mitigate this, we propose BRIDGE, a Bias-Reducing Inter-group Data GEneration framework designed for low-resource assessment settings. Instead of relying on the limited minority samples, BRIDGE synthesizes high-scoring ELL samples by "pasting" construct-relevant (i.e., rubric-aligned knowledge and evidence) content from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns. We further introduce a discriminator model to ensure the quality of synthetic samples. Experiments on California Science Test (CAST) datasets demonstrate that BRIDGE effectively reduces prediction bias for high-scoring ELL students while maintaining overall scoring performance. Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.[11] LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering
Rafid Ishrak Jahan,Fahmid Shahriar Iqbal,Sagnik Ray Choudhury
Main category: cs.CL
TL;DR: 本文提出了LFQA-HP-1M数据集(含130万条人类成对偏好标注)和九项评估标准,证明基于这些标准的简单线性模型可媲美当前大语言模型评估器,并揭示了LLM评估器在一致性、位置与冗余偏差及对抗扰动方面的脆弱性。
Details
Motivation: 现有LFQA评估指标难以准确反映人类判断,亟需更可靠、透明且大规模的人类标注数据与评估框架。 Method: 构建LFQA-HP-1M大规模人类偏好数据集;设计九项答案质量评估细粒度标准;训练并对比线性模型与LLM评估器性能;分析LLM评估器的transitivity一致性、位置偏差、冗余偏差及对抗鲁棒性。 Result: 基于九项标准的简单线性模型在LFQA评估任务上表现媲美SOTA LLM评估器;发现主流LLM评估器存在显著的位置偏差、verbosity偏差及对抗脆弱性。 Conclusion: 人类偏好标注与结构化评分标准可为LFQA提供更透明、稳健且可解释的评估范式,挑战了单纯依赖黑箱LLM评估器的现状。 Abstract: Long-form question answering (LFQA) demands nuanced evaluation of multi-sentence explanatory responses, yet existing metrics often fail to reflect human judgment. We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA. We propose nine rubrics for answer quality evaluation, and show that simple linear models based on these features perform comparably to state-of-the-art LLM evaluators. We further examine transitivity consistency, positional bias, and verbosity biases in LLM evaluators and demonstrate their vulnerability to adversarial perturbations. Overall, this work provides one of the largest public LFQA preference datasets and a rubric-driven framework for transparent and reliable evaluation.[12] LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning
Yu Zhu,Kai Yang
Main category: cs.CL
TL;DR: 本文提出了一种基于大语言模型(LLM)的三层次优化框架,用于合成多轮、任务导向、贴近真实场景的对话数据,以构建更可靠、更具挑战性的逻辑推理评测基准,有效提升LLM在实际任务中的推理能力。
Details
Motivation: 现有推理评测基准过于简化、脱离真实任务流程与领域约束,且存在预训练数据污染和人工构建成本高、难扩展等问题,难以准确评估LLM在现实场景中的逻辑推理能力。 Method: 提出LLM驱动的三层次优化框架,自动生成基于真实任务场景、富含现实信息、上下文连贯的多轮任务型对话,并围绕其设计并迭代优化推理任务。 Result: 构建了一个高质量合成推理数据集,实验表明其推理任务具有非平凡挑战性,能有效支持LLM推理能力的提升。 Conclusion: 该框架克服了传统评测基准的局限性,为评估和增强LLM在真实任务环境下的逻辑推理能力提供了可靠、可扩展的新范式。 Abstract: The reasoning capability of large language models (LLMs), defined as their ability to analyze, infer, and make decisions based on input information, is essential for building intelligent task-oriented dialogue systems. However, existing benchmarks do not sufficiently reflect the complexity of real-world scenarios, which limits their effectiveness in evaluating and enhancing LLM reasoning in practical contexts. Many current reasoning datasets are overly simplistic and abstract, often disconnected from realistic task flows, domain constraints, and operational rules, making it difficult to effectively evaluate LLMs' logical reasoning ability. In addition, data contamination from pretraining corpora undermines the reliability of evaluation results, and traditional crowdsourcing methods for dataset construction are labor-intensive and difficult to scale. To address these challenges, we propose a LLM-driven framework for synthesizing multi-turn, task-oriented dialogues grounded in realistic reasoning scenarios, leveraging trilevel optimization to enhance dialogue quality. Our method generates dialogues grounded in authentic task scenarios, enriched with real-world information, and exhibiting strong contextual coherence. Corresponding reasoning tasks are carefully designed around these dialogues and iteratively refined to continuously improve the tasks' quality and challenge. The resulting dataset serves as a valuable benchmark for assessing and advancing the realistic logical reasoning capabilities of LLMs. Experimental results show that our synthetic data-based reasoning tasks introduce non-trivial reasoning challenges and provide meaningful support for improving the reasoning capabilities of LLMs.[13] TRIZ-RAGNER: A Retrieval-Augmented Large Language Model for TRIZ-Aware Named Entity Recognition in Patent-Based Contradiction Mining
Zitong Xu,Yuqing Wu,Yue Zhao
Main category: cs.CL
TL;DR: 本文提出TRIZ-RAGNER框架,结合检索增强与结构化大语言模型提示,提升专利中TRIZ技术矛盾参数的抽取精度与鲁棒性。
Details
Motivation: 现有TRIZ矛盾挖掘方法受限于语义歧义、领域依赖和泛化能力弱;而直接应用大语言模型易产生幻觉且缺乏TRIZ结构化知识支撑。 Method: 将矛盾挖掘建模为语义级命名实体识别任务,融合TRIZ知识库的稠密检索、上下文优化的交叉编码器重排序,以及结构化LLM提示机制。 Result: 在PaTRIZ数据集上达到85.6%精确率、82.9%召回率、84.2% F1值,较最强LLM基线F1提升7.3个百分点。 Conclusion: 检索增强的TRIZ知识注入显著提升LLM在专利矛盾挖掘中的准确性与一致性,验证了该范式在系统性创新支持中的有效性。 Abstract: TRIZ-based contradiction mining is a fundamental task in patent analysis and systematic innovation, as it enables the identification of improving and worsening technical parameters that drive inventive problem solving. However, existing approaches largely rely on rule-based systems or traditional machine learning models, which struggle with semantic ambiguity, domain dependency, and limited generalization when processing complex patent language. Recently, large language models (LLMs) have shown strong semantic understanding capabilities, yet their direct application to TRIZ parameter extraction remains challenging due to hallucination and insufficient grounding in structured TRIZ knowledge. To address these limitations, this paper proposes TRIZ-RAGNER, a retrieval-augmented large language model framework for TRIZ-aware named entity recognition in patent-based contradiction mining. TRIZ-RAGNER reformulates contradiction mining as a semantic-level NER task and integrates dense retrieval over a TRIZ knowledge base, cross-encoder reranking for context refinement, and structured LLM prompting to extract improving and worsening parameters from patent sentences. By injecting domain-specific TRIZ knowledge into the LLM reasoning process, the proposed framework effectively reduces semantic noise and improves extraction consistency. Experiments on the PaTRIZ dataset demonstrate that TRIZ-RAGNER consistently outperforms traditional sequence labeling models and LLM-based baselines. The proposed framework achieves a precision of 85.6%, a recall of 82.9%, and an F1-score of 84.2% in TRIZ contradiction pair identification. Compared with the strongest baseline using prompt-enhanced GPT, TRIZ-RAGNER yields an absolute F1-score improvement of 7.3 percentage points, confirming the effectiveness of retrieval-augmented TRIZ knowledge grounding for robust and accurate patent-based contradiction mining.[14] From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning
Seungdong Yoa,Sanghyu Yoon,Suhee Yoon,Dongmin Kim,Ye Seul Sim,Junhyun Lee,Woohyung Lim
Main category: cs.CL
TL;DR: 本文提出一种以智能体为中心的动态评估范式,通过教师、协调员和学生三类智能体协同生成、验证与求解问题,实现LLM能力的渐进式、自动化、可持续评估。
Details
Motivation: 传统静态数据集评估方法扩展性差,难以反映大语言模型不断演进的推理能力。 Method: 构建包含教师(生成问题)、协调员(验证问题并防御对抗攻击)、学生(求解问题)三类自主智能体的动态协议;问题经迭代修订与难度升级,支持跨句子逻辑推理的文本异常检测作为主要评估形式,并引入多维评估指标(如跨模型两两性能、初始与终版问题表现差异)。 Result: 该协议能系统性暴露传统基准无法发现的边缘案例推理错误,并支持随模型能力提升自动扩展难度。 Conclusion: 将评估重心从固定数据集转向动态协议,为持续演进的大语言模型提供了可持续的评估路径,并提出了以智能体协同演进为核心的新型基准研究方向。 Abstract: The evaluation of large language models (LLMs) has predominantly relied on static datasets, which offer limited scalability and fail to capture the evolving reasoning capabilities of recent models. To overcome these limitations, we propose an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in which autonomous agents iteratively generate, validate, and solve problems. Within this protocol, a teacher agent generates candidate problems, an orchestrator agent rigorously verifies their validity and guards against adversarial attacks, and a student agent attempts to solve the validated problems. An invalid problem is revised by the teacher agent until it passes validation. If the student correctly solves the problem, the orchestrator prompts the teacher to generate more challenging variants. Consequently, the benchmark scales in difficulty automatically as more capable agents are substituted into any role, enabling progressive evaluation of large language models without manually curated datasets. Adopting text anomaly detection as our primary evaluation format, which demands cross-sentence logical inference and resists pattern-matching shortcuts, we demonstrate that this protocol systematically exposes corner-case reasoning errors that conventional benchmarks fail to reveal. We further advocate evaluating systems along several complementary axes including cross-model pairwise performance and progress between the initial and orchestrator-finalized problems. By shifting the focus from fixed datasets to dynamic protocols, our approach offers a sustainable direction for evaluating ever-evolving language models and introduces a research agenda centered on the co-evolution of agent-centric benchmarks.[15] Structured Prompt Optimization for Few-Shot Text Classification via Semantic Alignment in Latent Space
Jiasen Zheng,Zijun Zhou,Huajun Zhang,Junjiang Lin,Jingyun Jia,Qi Wang
Main category: cs.CL
TL;DR: 本文提出了一种基于结构化提示的优化框架,用于解决少样本文本分类中的语义纠缠、标签结构不清和特征表示不足问题,通过多维语义因子提示、结构化标签嵌入与跨空间对齐等机制,提升语义理解与任务适应能力,并在多项指标上显著提升性能。
Details
Motivation: 解决少样本文本分类中语义纠缠、标签结构不清晰及特征表达不足的问题。 Method: 提出基于结构化提示的优化框架:利用预训练语言模型获取基础语义表示;引入含多维语义因子的结构化提示,通过可学习组合机制融合文本特征;构建结构化标签嵌入矩阵并采用跨空间对齐机制;施加提示正交性约束与联合优化目标以保障语义因子独立性。 Result: 显著缓解语义冲突与标签歧义,在准确率、精确率、召回率和AUC上明显提升,并展现出强跨任务适用性;通过学习率、提示长度与数据规模三类敏感性实验验证了框架的稳定性与鲁棒性。 Conclusion: 结构化提示优化框架能有效增强少样本条件下的语义理解与任务适配能力,提供透明可控的分类决策指导。 Abstract: This study addresses the issues of semantic entanglement, unclear label structure, and insufficient feature representation in few-shot text classification, and proposes an optimization framework based on structured prompts to enhance semantic understanding and task adaptation under low-resource conditions. The framework first uses a pretrained language model to encode the input text and obtain basic semantic representations. It then introduces structured prompts composed of multi-dimensional semantic factors and integrates them with text features through a learnable combination mechanism, which forms task-related representations with clear boundaries in the latent space. To further strengthen the consistency between text representations and label semantics, the method constructs a structured label embedding matrix and employs a cross-space alignment mechanism to ensure stable matching between textual features and label attributes. In addition, the model applies prompt orthogonality constraints and a joint optimization objective to maintain independence across different semantic factors in the prompts, allowing the structured prompts to provide transparent and controllable guidance for classification decisions. Three types of sensitivity experiments, including learning rate sensitivity, prompt length sensitivity, and data scale sensitivity, are designed to evaluate the stability and robustness of the framework under different conditions. Experimental results show that the proposed structured prompt optimization framework effectively alleviates semantic conflicts and label ambiguity in few-shot text classification. It significantly improves performance on accuracy, precision, recall, and AUC, and demonstrates strong cross-task applicability.[16] Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding
Xiangzhong Luo,Yilin An,Zhicheng Yu,Weichen Liu,Xu Yang
Main category: cs.CL
TL;DR: 本文提出DiCo方法,通过三阶段分治策略(划分、征服、定稿)释放扩散型大语言模型(dLLMs)的并行生成潜力,在保持生成质量的同时显著提升推理速度。
Details
Motivation: 当前扩散型大语言模型(dLLMs)虽理论上支持多token并行解码,但实践中仍多采用单token步进方式,因直接多token解码易导致质量下降和不稳定,存在理论并行性与实际性能间的显著差距。 Method: 提出自适应并行解码方法DiCo,包含三个阶段:(1)Divide阶段:识别掩码序列中的种子token,并构建局部簇;(2)Conquer阶段:在各局部簇上并行解码,交替执行Divide与Conquer直至收敛;(3)Finalize阶段:对剩余少量掩码token采用细粒度复合解码完成生成。 Result: 实验表明DiCo在维持具竞争力的生成质量前提下,实现了显著的推理加速。 Conclusion: DiCo有效弥合了dLLMs理论并行能力与实际性能之间的鸿沟,为高效、高质量的扩散式文本生成提供了新范式。 Abstract: Diffusion-based large language models (dLLMs) have shown promising performance across various reasoning tasks, establishing themselves as an alternative to autoregressive large language models (LLMs). Unlike autoregressive LLMs that generate one token per step based on all previous tokens, dLLMs theoretically enable parallel generation of multiple tokens at each decoding step. However, recent dLLMs still favor one-token-per-step generation in practice, as directly decoding multiple masked tokens often leads to degraded generation quality and stability. This reveals a substantial gap between the theoretical parallelism and practical performance of dLLMs. To bridge this gap, we introduce an adaptive parallel decoding approach, namely DiCo, which features a three-phase divide-and-conquer paradigm to unleash the inherent parallelism of dLLMs. During the Divide phase, DiCo first explores the input masked sequence and identifies masked tokens as seed tokens, which are then expanded to construct a set of local clusters. During the Conquer phase, DiCo performs parallel decoding across different local clusters constructed in the Divide phase. The divide-and-conquer process repeatedly alternates between the Divide and Conquer phases until convergence. During the Finalize phase, DiCo decodes the remaining few masked tokens using an effective fine-grained compound decoding scheme to finalize the generation. Extensive experiments demonstrate that DiCo can achieve significant inference speedups while maintaining competitive generation quality.[17] GLUScope: A Tool for Analyzing GLU Neurons in Transformer Language Models
Sebastian Gerstner,Hinrich Schütze
Main category: cs.CL
TL;DR: GLUScope是一个开源工具,用于分析基于Transformer的语言模型中的神经元,特别关注使用SwiGLU等门控激活函数的较新模型,通过展示四种符号组合的文本示例及出现频率来帮助理解神经元行为。
Details
Motivation: 现有工具未充分支持对采用SwiGLU等门控激活函数的较新Transformer模型的神经元进行细粒度解释;由于门控和输入激活均可正可负,需考虑四种符号组合,而不仅是正激活。 Method: 开发GLUScope工具,针对每个神经元可视化四种(gate, in)符号组合对应的文本示例及其统计频率,并结合案例说明其在发现新解释性规律中的应用。 Result: GLUScope成功支持对SwiGLU类模型中神经元的细粒度行为分析,提供了可交互的实证分析界面,并已公开演示。 Conclusion: GLUScope填补了当前可解释性工具在门控激活模型分析上的空白,为研究者提供了更精细、更具针对性的神经元分析能力。 Abstract: We present GLUScope, an open-source tool for analyzing neurons in Transformer-based language models, intended for interpretability researchers. We focus on more recent models than previous tools do; specifically we consider gated activation functions such as SwiGLU. This introduces a new challenge: understanding positive activations is not enough. Instead, both the gate and the in activation of a neuron can be positive or negative, leading to four different possible sign combinations that in some cases have quite different functionalities. Accordingly, for any neuron, our tool shows text examples for each of the four sign combinations, and indicates how often each combination occurs. We describe examples of how our tool can lead to novel insights. A demo is available at https: //sjgerstner.github.io/gluscope.[18] CLFEC: A New Task for Unified Linguistic and Factual Error Correction in paragraph-level Chinese Professional Writing
Jian Kai,Zidong Zhang,Jiwen Chen,Zhengxiang Wu,Songtao Sun,Fuyang Li,Yang Cao,Qiang Liu
Main category: cs.CL
TL;DR: 本文提出了CLFEC(中文语言与事实错误联合纠正)新任务,构建了涵盖时政、金融、法律和医学的多领域专业中文写作数据集,并系统评估了从提示工程到RAG及智能体工作流等LLM纠错范式,揭示了泛化性差、需证据支撑、混合错误难处理及过度纠正等挑战,验证了联合纠错优于解耦处理,且适配强基座模型的智能体流程更有效。
Details
Motivation: 中文专业写作中语言错误(拼写、语法、标点)与事实错误常共现并相互影响,传统分开处理方式难以满足实际需求,亟需统一建模的联合纠错任务。 Method: 提出CLFEC新任务;构建跨领域的混合中文专业写作数据集;系统对比分析提示工程、检索增强生成(RAG)和智能体工作流等LLM纠错范式。 Result: 发现专用纠错模型泛化能力有限、事实修复需证据支撑、混合错误段落纠错难度高、清洁文本易被过度纠正;联合上下文处理两类错误优于解耦流程;适配强基座模型的智能体工作流效果更优。 Conclusion: CLFEC任务及配套数据集为中文专业文本全自动校对系统提供了新基准和实践指导,推动工业级可靠校对系统的建设。 Abstract: Chinese text correction has traditionally focused on spelling and grammar, while factual error correction is usually treated separately. However, in paragraph-level Chinese professional writing, linguistic (word/grammar/punctuation) and factual errors frequently co-occur and interact, making unified correction both necessary and challenging. This paper introduces CLFEC (Chinese Linguistic & Factual Error Correction), a new task for joint linguistic and factual correction. We construct a mixed, multi-domain Chinese professional writing dataset spanning current affairs, finance, law, and medicine. We then conduct a systematic study of LLM-based correction paradigms, from prompting to retrieval-augmented generation (RAG) and agentic workflows. The analysis reveals practical challenges, including limited generalization of specialized correction models, the need for evidence grounding for factual repair, the difficulty of mixed-error paragraphs, and over-correction on clean inputs. Results further show that handling linguistic and factual Error within the same context outperform decoupled processes, and that agentic workflows can be effective with suitable backbone models. Overall, our dataset and empirical findings provide guidance for building reliable, fully automatic proofreading systems in industrial settings.[19] The Astonishing Ability of Large Language Models to Parse Jabberwockified Language
Gary Lupyan,Senyi Yang
Main category: cs.CL
TL;DR: 本文展示了大语言模型(LLMs)在严重退化英文文本中惊人地恢复语义的能力,揭示了句法结构等线索对词汇意义的强大约束力,并暗示语言处理需语法、词义与世界知识的高度整合。
Details
Motivation: 探究大语言模型在极端语义缺失条件下能否利用结构线索恢复文本原意,从而理解语言结构的本质及高效语言处理的机制。 Method: 通过将英文文本中的实词随机替换为无意义字符串(即'Jabberwockified'文本),测试LLMs将其翻译还原为通顺英文的能力,并分析还原结果与原文的语义接近程度。 Result: LLMs能在大量案例中准确还原严重退化文本的原始语义,表明形态句法和虚词等结构线索对词汇意义具有远超预期的强约束力。 Conclusion: 语言理解高度依赖语法结构、词汇语义与世界知识的紧密耦合;LLMs的超人表现不仅体现其强大能力,也为人类与人工语言处理机制提供了重要启示。 Abstract: We show that large language models (LLMs) have an astonishing ability to recover meaning from severely degraded English texts. Texts in which content words have been randomly substituted by nonsense strings, e.g., "At the ghybe of the swuint, we are haiveed to Wourge Phrear-gwurr, who sproles into an ghitch flount with his crurp", can be translated to conventional English that is, in many cases, close to the original text, e.g., "At the start of the story, we meet a man, Chow, who moves into an apartment building with his wife." These results show that structural cues (e.g., morphosyntax, closed-class words) constrain lexical meaning to a much larger degree than imagined. Although the abilities of LLMs to make sense of "Jabberwockified" English are clearly superhuman, they are highly relevant to understanding linguistic structure and suggest that efficient language processing either in biological or artificial systems likely benefits from very tight integration between syntax, lexical semantics, and general world knowledge.[20] Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language
Nischal Karki,Bipesh Subedi,Prakash Poudyal,Rupak Raj Ghimire,Bal Krishna Bal
Main category: cs.CL
TL;DR: 本研究评估了多种BERT变体在尼泊尔语主题分类任务上的性能,发现Indic系列模型(尤其是MuRIL-large)表现最佳,F1得分为90.60%,为尼泊尔语NLP提供了坚实基线。
Details
Motivation: 尼泊尔语是一种低资源语言,使用天城文书写,在NLP领域研究不足,亟需建立有效的基准模型。 Method: 对10种预训练BERT模型(包括mBERT、XLM-R、MuRIL、DevBERT、HindiBERT、IndicBERT和NepBERTa)在平衡的尼泊尔语数据集(25,006句,涵盖5个主题领域)上进行微调,并采用准确率、加权精确率、召回率、F1分数和AUROC等指标评估性能。 Result: MuRIL-large取得最高F1分数90.60%,NepBERTa达88.26%,均优于多语言和单语模型;Indic系列模型整体表现最优。 Conclusion: Indic系列BERT模型(特别是MuRIL-large)最适合尼泊尔语主题分类任务,该研究为尼泊尔语文档级分类及更广泛NLP应用建立了可靠基准。 Abstract: Transformer-based models such as BERT have significantly advanced Natural Language Processing (NLP) across many languages. However, Nepali, a low-resource language written in Devanagari script, remains relatively underexplored. This study benchmarks multilingual, Indic, Hindi, and Nepali BERT variants to evaluate their effectiveness in Nepali topic classification. Ten pre-trained models, including mBERT, XLM-R, MuRIL, DevBERT, HindiBERT, IndicBERT, and NepBERTa, were fine-tuned and tested on the balanced Nepali dataset containing 25,006 sentences across five conceptual domains and the performance was evaluated using accuracy, weighted precision, recall, F1-score, and AUROC metrics. The results reveal that Indic models, particularly MuRIL-large, achieved the highest F1-score of 90.60%, outperforming multilingual and monolingual models. NepBERTa also performed competitively with an F1-score of 88.26%. Overall, these findings establish a robust baseline for future document-level classification and broader Nepali NLP applications.[21] EDDA-Coordinata: An Annotated Dataset of Historical Geographic Coordinates
Ludovic Moncla,Pierre Nugues,Thierry Joliveau,Katherine McDonough
Main category: cs.CL
TL;DR: 本文构建了一个从十八世纪《百科全书》中提取并标准化地理坐标的金标准数据集,提出了一种基于Transformer的两步式坐标识别与归一化方法,在多个跨语言、跨领域的历史文本上验证了其有效性。
Details
Motivation: 自动从历史文本中恢复地理坐标具有挑战性,因其表达方式多样且精度不一;为提升类似早期现代数字化文本中的坐标检索效果,需构建高质量标注数据集并开发鲁棒模型。 Method: 构建地理坐标金标准数据集,训练基于Transformer的两步模型:第一步分类器识别含坐标条目,第二步模型提取并归一化坐标;采用编码器-解码器与纯解码器架构进行对比实验。 Result: 在《百科全书》内部交叉验证达86% EM分数;在法语《特雷武词典》(域外)上达61%,在英文《不列颠百科全书》第七版上达77%;验证了方法的跨语言、跨领域泛化能力。 Conclusion: 所构建的金标准数据集具有重要训练价值,提出的两步式方法在历史文本地理信息抽取任务中具备良好的鲁棒性与可迁移性。 Abstract: This paper introduces a dataset of enriched geographic coordinates retrieved from Diderot and d'Alembert's eighteenth-century Encyclopedie. Automatically recovering geographic coordinates from historical texts is a complex task, as they are expressed in a variety of ways and with varying levels of precision. To improve retrieval of coordinates from similar digitized early modern texts, we have created a gold standard dataset, trained models, published the resulting inferred and normalized coordinate data, and experimented applying these models to new texts. From 74,000 total articles in each of the digitized versions of the Encyclopedie from ARTFL and ENCCRE, we examined 15,278 geographical entries, manually identifying 4,798 containing coordinates, and 10,480 with descriptive but non-numerical references. Leveraging our gold standard annotations, we trained transformer-based models to retrieve and normalize coordinates. The pipeline presented here combines a classifier to identify coordinate-bearing entries and a second model for retrieval, tested across encoder-decoder and decoder architectures. Cross-validation yielded an 86% EM score. On an out-of-domain eighteenth-century Trevoux dictionary (also in French), our fine-tuned model had a 61% EM score, while for the nineteenth-century, 7th edition of the Encyclopaedia Britannica in English, the EM was 77%. These findings highlight the gold standard dataset's usefulness as training data, and our two-step method's cross-lingual, cross-domain generalizability.[22] MemEmo: Evaluating Emotion in Memory Systems of Agents
Peng Liu,Zhen Tao,Jihao Zhao,Ding Chen,Yansong Zhang,Cuiping Li,Zhiyu Li,Hong Chen
Main category: cs.CL
TL;DR: 本文提出了一种情感增强的记忆评估基准HLME,用于评估主流记忆系统在处理情绪信息方面的能力,发现现有系统在情绪信息提取、更新和问答三方面均表现不佳。
Details
Motivation: 现有记忆系统在处理情绪相关的信息方面效果尚不明确,与人类认知相比存在差距,因此需要一个专门评估情感记忆能力的基准。 Method: 构建了Human-Like Memory Emotion(HLME)数据集,从情绪信息提取、情绪记忆更新、情绪记忆问答三个维度评估记忆系统性能。 Result: 实验表明,所有被评估的记忆系统均未能在三个任务上实现稳健表现。 Conclusion: 当前记忆系统在处理情绪记忆方面存在明显缺陷,研究结果为未来改进提供了新方向。 Abstract: Memory systems address the challenge of context loss in Large Language Model during prolonged interactions. However, compared to human cognition, the efficacy of these systems in processing emotion-related information remains inconclusive. To address this gap, we propose an emotion-enhanced memory evaluation benchmark to assess the performance of mainstream and state-of-the-art memory systems in handling affective information. We developed the \textbf{H}uman-\textbf{L}ike \textbf{M}emory \textbf{E}motion (\textbf{HLME}) dataset, which evaluates memory systems across three dimensions: emotional information extraction, emotional memory updating, and emotional memory question answering. Experimental results indicate that none of the evaluated systems achieve robust performance across all three tasks. Our findings provide an objective perspective on the current deficiencies of memory systems in processing emotional memories and suggest a new trajectory for future research and system optimization.[23] The GRADIEND Python Package: An End-to-End System for Gradient-Based Feature Learning
Jonathan Drechsel,Steffen Herbold
Main category: cs.CL
TL;DR: 本文介绍了gradiend,一个开源Python包,用于实现GRADIEND方法,通过事实-反事实MLM和CLM梯度学习语言模型中的特征方向。
Details
Motivation: 为了提供一种统一的工作流程来操作化GRADIEND方法,以从语言模型中学习特征方向,并支持特征相关数据创建、训练、评估、可视化、模型重写及多特征比较。 Method: 提出并实现了gradiend开源Python包,该包支持GRADIEND方法,利用事实-反事实MLM和CLM梯度学习特征方向,并提供统一工作流支持各项功能。 Result: 成功演示了GRADIEND在英语代词范式上的应用,并在大规模特征比较中复现了先前的用例。 Conclusion: gradiend为研究语言模型中特征方向提供了高效、可扩展且易于使用的工具,有助于推动可解释性与可控性研究。 Abstract: We present gradiend, an open-source Python package that operationalizes the GRADIEND method for learning feature directions from factual-counterfactual MLM and CLM gradients in language models. The package provides a unified workflow for feature-related data creation, training, evaluation, visualization, persistent model rewriting via controlled weight updates, and multi-feature comparison. We demonstrate GRADIEND on an English pronoun paradigm and on a large-scale feature comparison that reproduces prior use cases.[24] Dialect and Gender Bias in YouTube's Spanish Captioning System
Iris Dania Jimenez,Christoph Kern
Main category: cs.CL
TL;DR: 本研究探讨了YouTube自动字幕系统在不同西班牙语方言中的表现,发现该系统存在针对某些方言的偏差,尤其在性别和地区维度上表现出系统性差异。
Details
Motivation: YouTube仅提供一种西班牙语自动字幕选项,而西班牙语在全球存在大量地域变体,这可能导致对某些方言的识别偏差,影响内容可及性。 Method: 通过对比不同地区(含男女说话人)的西班牙语方言样本在YouTube自动字幕系统中的识别质量,分析其错误率与偏差模式。 Result: 发现了与特定方言显著相关的系统性字幕质量差异,表明当前系统未充分适配西班牙语的地域多样性。 Conclusion: 数字平台部署的算法技术需针对用户群体的语言多样性进行校准,以保障公平性与可及性。 Abstract: Spanish is the official language of twenty-one countries and is spoken by over 441 million people. Naturally, there are many variations in how Spanish is spoken across these countries. Media platforms such as YouTube rely on automatic speech recognition systems to make their content accessible to different groups of users. However, YouTube offers only one option for automatically generating captions in Spanish. This raises the question: could this captioning system be biased against certain Spanish dialects? This study examines the potential biases in YouTube's automatic captioning system by analyzing its performance across various Spanish dialects. By comparing the quality of captions for female and male speakers from different regions, we identify systematic disparities which can be attributed to specific dialects. Our study provides further evidence that algorithmic technologies deployed on digital platforms need to be calibrated to the diverse needs and experiences of their user populations.[25] Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis
Donghao Huang,Zhaoxia Wang
Main category: cs.CL
TL;DR: 本文通过在不同粒度的情感分析数据集上评估504种大语言模型推理配置,发现推理效果高度依赖任务复杂度:简单任务(如二分类)中推理反而导致性能下降,而复杂任务(如27类情绪识别)则显著提升;此外,推理带来显著计算开销,仅在复杂任务中具备性价比;定性分析表明,推理在简单任务中引发系统性‘过度思考’错误。
Details
Motivation: 检验‘推理能力普遍提升语言任务性能’这一流行假设是否成立,探究推理效果是否具有任务依赖性。 Method: 在7个模型家族(含自适应、条件式及强化学习推理架构)的504种配置上,于二分类、五分类和27类情绪识别等不同粒度情感分析数据集上进行系统评估,并结合Pareto前沿分析、少样本提示实验与定性错误分析。 Result: (1)推理效果呈任务复杂度依赖:二分类F1最多下降19.9pp,27类情绪识别最多提升16.0pp;(2)蒸馏推理变体在简单任务中比基线低3–18pp,少样本提示可部分恢复;(3)少样本学习普遍优于零样本;(4)基线模型在效率-性能权衡中占优,推理仅在复杂情绪识别中合理,但需承担2.1x–54x计算开销;(5)定性分析揭示推理在简单任务中因系统性‘过度推演’而致错。 Conclusion: 推理并非普适增益机制,其价值严格受限于任务复杂度;盲目引入推理可能损害简单任务性能并浪费算力;未来应倡导任务驱动的推理设计与评估范式。 Abstract: Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across language tasks. We test this claim through a comprehensive evaluation of 504 configurations across seven model families--including adaptive, conditional, and reinforcement learning-based reasoning architectures--on sentiment analysis datasets of varying granularity (binary, five-class, and 27-class emotion). Our findings reveal that reasoning effectiveness is strongly task-dependent, challenging prevailing assumptions: (1) Reasoning shows task-complexity dependence--binary classification degrades up to -19.9 F1 percentage points (pp), while 27-class emotion recognition gains up to +16.0pp; (2) Distilled reasoning variants underperform base models by 3-18 pp on simpler tasks, though few-shot prompting enables partial recovery; (3) Few-shot learning improves over zero-shot in most cases regardless of model type, with gains varying by architecture and task complexity; (4) Pareto frontier analysis shows base models dominate efficiency-performance trade-offs, with reasoning justified only for complex emotion recognition despite 2.1x-54x computational overhead. We complement these quantitative findings with qualitative error analysis revealing that reasoning degrades simpler tasks through systematic over-deliberation, offering mechanistic insight beyond the high-level overthinking hypothesis.[26] Preference Packing: Efficient Preference Optimization for Large Language Models
Jaekyung Cho
Main category: cs.CL
TL;DR: 本文提出了一种名为“偏好打包(preference packing)”的新方法,用于提升奖励模型和DPO等偏好学习训练过程的资源效率,通过减少重复输入提示的注意力计算和KV缓存内存占用,在文本与图文数据集上实现至少37%训练时间下降,并可与其他优化技术(如batch sorting)协同达到3.22倍加速。
Details
Motivation: 随着大语言模型规模增大,资源高效训练愈发关键;现有batch packing主要用于预训练和SFT,但尚未有效适配偏好学习类任务(如DPO、reward modeling),其输入相同prompt、多个响应的数据结构未被充分优化。 Method: 偏好打包(preference packing):将同一输入prompt对应多个响应的数据样本打包进单个序列,共享prompt部分的token表示,仅对不同响应部分进行独立注意力计算,并复用prompt段的KV缓存,从而减少冗余计算与内存开销。 Result: 在文本及图文数据集上训练时间至少降低37%;与batch sorting结合后实现3.22倍整体加速;显著降低KV缓存内存占用和注意力操作量。 Conclusion: 偏好打包是一种轻量、通用且正交的优化技术,可无缝集成到现有偏好学习训练流程中,显著提升硬件资源利用率,为大规模偏好建模提供实用支撑。 Abstract: Resource-efficient training optimization techniques are becoming increasingly important as the size of large language models (LLMs) continues to grow. In particular, batch packing is commonly used in pre-training and supervised fine-tuning to achieve resource-efficient training. We propose preference packing, a method to enhance resource efficiency in training techniques that use data with different responses for the same input prompt, such as reward models or Direct Preference Optimization (DPO). Preference packing improves resource efficiency by reducing the attention operations for duplicate input prompts and decreasing KV cache memory usage. We conducted experiments on text-only datasets and image-included datasets and achieved at least 37% reduction in training time. Notably, this method can be applied alongside existing optimization techniques such as batch sorting, resulting in a 3.22x speedup.[27] ARGUS: Seeing the Influence of Narrative Features on Persuasion in Argumentative Texts
Sara Nabhani,Federico Pianzola,Khalid Al-Khatib,Malvina Nissim
Main category: cs.CL
TL;DR: 本文提出ARGUS框架,研究叙事特征如何影响在线辩论中的说服力,并构建了标注故事存在及六种关键叙事特征的新语料库。
Details
Motivation: 尽管故事常被视为有力的说服工具,但其在在线非结构化论证中的具体作用尚未得到充分探索。 Method: 提出ARGUS框架,构建新的ChangeMyView语料库并标注故事存在与六种关键叙事特征,结合编码器分类器和零样本大语言模型识别故事及叙事特征,并大规模分析其对说服效果的影响。 Result: 成功开发出ARGUS框架及配套标注语料库,验证了多种叙事维度对在线论证说服力具有可测量影响。 Conclusion: 叙事确实能提升论证说服力,其中特定叙事特征(如情节连贯性、人物塑造等)起关键作用,ARGUS为后续研究提供了可扩展的方法论基础。 Abstract: Can narratives make arguments more persuasive? And to this end, which narrative features matter most? Although stories are often seen as powerful tools for persuasion, their specific role in online, unstructured argumentation remains underexplored. To address this gap, we present ARGUS, a framework for studying the impact of narration on persuasion in argumentative discourse. ARGUS introduces a new ChangeMyView corpus annotated for story presence and six key narrative features, integrating insights from two established theoretical frameworks that capture both textual narrative features and their effects on recipients. Leveraging both encoder-based classifiers and zero-shot large language models (LLMs), ARGUS identifies stories and narrative features and applies them at scale to examine how different narrative dimensions influence persuasion success in online argumentation.[28] Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek
James L. Zainaldin,Cameron Pattison,Manuela Marai,Jacob Wu,Mark J. Schiefsky
Main category: cs.CL
TL;DR: 本研究首次系统地对大型语言模型(LLM)在古希腊语技术性散文机器翻译中的表现进行了无参考人工评估,涵盖Claude、Gemini和ChatGPT三款模型,评估文本来自盖伦的两部医学著作;结果显示LLM在已有英译本的文本上接近专家水平,但在未译文本上质量下降且受术语稀有度显著影响;自动化指标仅在质量跨度大时与人工判断中度相关,无法区分高质量译文。
Details
Motivation: 缺乏针对低资源古代语言(如古希腊语)的LLM机器翻译系统性、无参考人工评估,尤其在专业领域(如古典医学文献)中翻译质量与术语挑战尚不明确。 Method: 采用改进的多维质量指标(MQM)框架,由领域专家对60个LLM生成的古希腊语译文进行人工评分;同时使用BLEU、chrF++、METEOR、ROUGE-L、BERTScore、COMET和BLEURT等自动化指标对比分析;术语稀有度通过Diorisis古希腊语语料库频率量化。 Result: 在已有英译本的文本上,LLM平均MQM得分为95.2/100(近专家水平);在未译文本上为79.9/100,剔除两个术语极度密集段落后升至约91–95/100;术语稀有度与翻译质量呈极强负相关(r = −0.97);自动化指标仅在质量差异大的文本上中度相关,无法区分高分译文。 Conclusion: LLM在古希腊语技术翻译中展现出潜力,但其性能高度依赖术语常见度;当前自动化评估指标不足以替代人工评估,尤其在高质量、低资源语言场景下;需为古典学研究设计更适配的评估范式与术语增强策略。 Abstract: This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose. We evaluate translations by three commercial LLMs (Claude, Gemini, ChatGPT) of twenty paragraph-length passages from two works by the Greek physician Galen of Pergamum (ca. 129-216 CE): On Mixtures, which has two published English translations, and On the Composition of Drugs according to Kinds, which has never been fully translated into English. We assess translation quality using both standard automated evaluation metrics (BLEU, chrF++, METEOR, ROUGE-L, BERTScore, COMET, BLEURT) and expert human evaluation via a modified Multidimensional Quality Metrics (MQM) framework applied to all 60 translations by a team of domain specialists. On the previously translated expository text, LLMs achieved high translation quality (mean MQM score 95.2/100), with performance approaching expert level. On the untranslated pharmacological text, aggregate quality was lower (79.9/100) but with high variance driven by two passages presenting extreme terminological density; excluding these, scores converged to within 4 points of the translated text. Terminology rarity, operationalized via corpus frequency in the literary Diorisis Ancient Greek Corpus, emerged as a strong predictor of translation failure (r = -.97 for passage-level quality on the untranslated text). Automated metrics showed moderate correlation with human judgment overall on the text with a wide quality spread (Composition), but no metric discriminated among high-quality translations. We discuss implications for the use of LLMs in Classical scholarship and for the design of automated evaluation pipelines for low-resource ancient languages.[29] CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning
Yuxuan Liu,Weikai Xu,Kun Huang,Changyu Chen,Jiankun Zhao,Pengzhi Gao,Wei Liu,Jian Luan,Shuo Shang,Bo Du,Ji-Rong Wen,Rui Yan
Main category: cs.CL
TL;DR: 本文提出Channel-of-Mobile-Experts (CoME),一种面向移动智能体的新型专家通道架构,通过四个阶段专用专家、渐进式训练策略与信息增益驱动的DPO方法,提升混合能力推理性能与鲁棒性。
Details
Motivation: 现有移动智能体难以同时实现各推理能力(如屏幕理解、子任务规划、动作决策等)的解耦增强与均衡协同,且易受错误传播影响。 Method: 提出CoME架构:四个阶段对齐的专用专家,采用输出导向的专家激活机制;设计三阶段渐进式训练(Expert-FT、Router-FT、CoT-FT);引入InfoGain-Driven DPO(Info-DPO)缓解错误传播。 Result: 在AITZ和AMEX两个基准上,CoME显著优于密集型移动智能体及传统MoE方法。 Conclusion: CoME通过结构化专家分工、阶段适配训练与信息感知优化,有效提升了移动智能体的混合能力推理能力与泛化性。 Abstract: Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, action decision and action function. However, existing agents struggle to achieve both decoupled enhancement and balanced integration of these capabilities. To address these challenges, we propose Channel-of-Mobile-Experts (CoME), a novel agent architecture consisting of four distinct experts, each aligned with a specific reasoning stage, CoME activates the corresponding expert to generate output tokens in each reasoning stage via output-oriented activation. To empower CoME with hybrid-capabilities reasoning, we introduce a progressive training strategy: Expert-FT enables decoupling and enhancement of different experts' capability; Router-FT aligns expert activation with the different reasoning stage; CoT-FT facilitates seamless collaboration and balanced optimization across multiple capabilities. To mitigate error propagation in hybrid-capabilities reasoning, we propose InfoGain-Driven DPO (Info-DPO), which uses information gain to evaluate the contribution of each intermediate step, thereby guiding CoME toward more informative reasoning. Comprehensive experiments show that CoME outperforms dense mobile agents and MoE methods on both AITZ and AMEX datasets.[30] ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Adam Dejl,Deniz Gorur,Francesca Toni
Main category: cs.CL
TL;DR: 本文提出了一种基于论证式大语言模型(ArgLLM)的Web系统ArgLLM-App,用于支持可解释、可质疑的二元决策任务,并提供可视化解释与用户交互功能。
Details
Motivation: 提升LLM决策的可解释性与可质疑性,使人类用户能理解、验证并纠正系统推理过程。 Method: 构建模块化Web应用ArgLLM-App,集成ArgLLM代理,支持可视化论证过程、人机交互及接入可信外部信息源。 Result: 实现了公开可用的ArgLLM-App系统(https://argllm.app),含视频演示(https://youtu.be/vzwlGOr0sPM),支持用户识别并质疑系统推理错误。 Conclusion: ArgLLM-App为增强LLM决策透明度与人类可控性提供了实用、开放的技术框架。 Abstract: Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by humans. Here we propose a web-based system implementing ArgLLM-empowered agents for binary tasks. ArgLLM-App supports visualisation of the produced explanations and interaction with human users, allowing them to identify and contest any mistakes in the system's reasoning. It is highly modular and enables drawing information from trusted external sources. ArgLLM-App is publicly available at https://argllm.app, with a video demonstration at https://youtu.be/vzwlGOr0sPM.[31] Task-Centric Acceleration of Small-Language Models
Dor Tsur,Sharon Adar,Ran Levy
Main category: cs.CL
TL;DR: 本文提出了TASC框架,包括TASC-ft(用于SLM微调的词汇表扩展方法)和TASC-spec(无需训练的轻量级推测解码方法),显著提升了小语言模型在低变异性生成任务中的推理效率,同时保持任务性能。
Details
Motivation: 小语言模型(SLMs)常用于高吞吐、低延迟场景,但其推理效率仍有提升空间;现有方法存在训练开销大或词汇对齐约束等问题。 Method: 提出TASC框架:1)TASC-ft——在微调阶段迭代扩充分词器词汇表并微调模型;2)TASC-spec——在推理时基于任务输出语料构建n-gram草稿模型,融合任务与上下文n-gram信息,无需训练且无需词汇对齐。 Result: 在多个低输出变异性生成任务上验证了两种方法的有效性,均实现了推理效率的持续提升,同时保持原有任务性能。 Conclusion: TASC框架为小语言模型提供了高效、轻量、无需额外训练的加速方案,兼顾实用性与通用性。 Abstract: Small language models (SLMs) have emerged as efficient alternatives to large language models for task-specific applications. However, they are often employed in high-volume, low-latency settings, where efficiency is crucial. We propose TASC, Task-Adaptive Sequence Compression, a framework for SLM acceleration comprising two use-cases: When performing SLM fine-tuning, we propose TASC-ft, which iteratively enriches the tokenizer vocabulary with high-frequency output n-grams and then fine-tunes the model to utilize the expanded vocabulary. Next, we propose an inference-time method, termed TASC-spec. TASC-spec is a lightweight, training-free speculative decoding method that constructs an n-gram draft model from the task's output corpus, mixing task and context n-gram information.TASC-spec avoids any additional training, while bypassing draft-target vocabulary alignment constraints. We demonstrate the effectiveness of both methods across multiple low output-variability generation tasks. Our methods show consistent improvements in inference efficiency while maintaining task performance.[32] MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games
Jacob Eisenstein,Fantine Huot,Adam Fisch,Jonathan Berant,Mirella Lapata
Main category: cs.CL
TL;DR: 本文提出了一种可扩展的多轮交互式语言模型评估方法(MT-PingEval),基于需私有信息协作的游戏任务,发现当前大模型在多轮协作中难以超越单轮摘要基线,暴露出规划与执行协同对话的显著缺陷;分析表明人类对话更连贯、更高效,而模型尚无单一语言特征可解释其协作弱项。
Details
Motivation: 现实世界沟通常需主动管理私有信息,但现有语言模型评估多集中于单轮任务,缺乏对多轮协作能力的有效衡量,因此需要构建能反映真实协作挑战的可扩展评估框架。 Method: 提出MT-PingEval评估框架,设计一系列需共享私有信息的协作性多轮游戏;采用交互式缩放分析(固定总token预算,变量轮数);对比多轮交互与单轮摘要基线性能;并从奉承倾向(sycophancy)、信息密度、话语连贯性等维度分析对话语言特征。 Result: 多数语言模型在多轮交互中无法超越单轮摘要基线,即使存在明显提升空间;人类在同等任务成功率下使用更少token,且产出对话连贯性显著优于模型;未发现单一主导语言因素解释模型协作缺陷。 Conclusion: 当前最先进语言模型在规划与执行多轮协同对话方面仍存在根本性弱点,亟需提升对私有信息的主动管理与对话连贯性建模能力;MT-PingEval为该方向提供了可扩展的评估基准与研究推动力。 Abstract: We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective communication about private information. This enables an interactive scaling analysis, in which a fixed token budget is divided over a variable number of turns. We find that in many cases, language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario in which one agent attempts to summarize its information and the other agent immediately acts -- despite substantial headroom. This suggests that state-of-the-art models still suffer from significant weaknesses in planning and executing multi-turn collaborative conversations. We analyze the linguistic features of these dialogues, assessing the roles of sycophancy, information density, and discourse coherence. While there is no single linguistic explanation for the collaborative weaknesses of contemporary language models, we note that humans achieve comparable task success at superior token efficiency by producing dialogues that are more coherent than those produced by most language models. The proactive management of private information is a defining feature of real-world communication, and we hope that MT-PingEval will drive further work towards improving this capability.[33] Controllable Reasoning Models Are Private Thinkers
Haritz Puerto,Haonan Li,Xudong Han,Timothy Baldwin,Iryna Gurevych
Main category: cs.CL
TL;DR: 本文提出通过提升AI推理模型在推理过程中的指令遵循能力来增强隐私保护,设计了新的指令遵循数据集和解耦式生成策略(LoRA适配器),在多个模型和基准上显著提升了隐私保护效果(最高+51.9%),但可能牺牲部分任务性能。
Details
Motivation: AI代理在推理过程中难以控制其推理轨迹,易导致用户敏感信息意外泄露,亟需提升其在推理链中对隐私约束指令的遵循能力。 Method: 构建包含显式推理轨迹约束的指令遵循数据集,并采用双LoRA适配器解耦推理与答案生成过程进行微调。 Result: 在6个1.7B–14B参数模型上,指令遵循性能最高提升20.9分,隐私基准最高提升51.9个百分点;但存在推理性能与指令遵循能力之间的权衡。 Conclusion: 提升推理模型在推理轨迹中的指令遵循能力可显著增强隐私保护,为构建隐私感知型AI代理提供了可行路径。 Abstract: AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction following abilities in the reasoning traces can improve their privacy-preservation skills. To demonstrate this, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction-following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade-off between reasoning performance and instruction-following abilities. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models[34] Do LLMs Benefit From Their Own Words?
Jenny Y. Huang,Leshem Choshen,Ramon Astudillo,Tamara Broderick,Jacob Andreas
Main category: cs.CL
TL;DR: 本文探讨了在多轮对话中是否需要保留大语言模型(LLM)自身历史回复的问题,发现仅保留用户输入(user-turn-only)在多数情况下不损害回答质量,甚至可减少上下文长度达10倍,并缓解因过度依赖历史回复导致的错误传播(context pollution);据此提出一种选择性过滤助手端上下文的方法,在提升响应质量的同时降低内存开销。
Details
Motivation: 传统多轮对话中始终保留模型自身历史回复,但该设计未经充分验证;作者质疑这种默认做法是否真正必要或有益,尤其在推理型大模型中可能引入冗余甚至有害信息。 Method: 在真实多轮对话数据上,对比标准全上下文提示(full-context)与仅用户轮次提示(user-turn-only)的效果,涵盖3个开源推理模型和1个SOTA模型;分析自包含提示比例、上下文污染现象,并设计上下文过滤策略。 Result: 1)约36.4%的多轮提示为自包含;2)user-turn-only在多数轮次中效果不逊于full-context;3)部分情况下其表现更优,归因于避免了历史回复引发的错误传播;4)上下文长度最多可降低10倍;5)所提上下文过滤方法可进一步提升质量并节省内存。 Conclusion: 无条件保留助手历史回复并非最优策略;选择性省略助手端上下文不仅能显著压缩输入长度、降低计算开销,还能提升响应准确性与鲁棒性,为高效、可靠多轮交互提供新思路。 Abstract: Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses. Using in-the-wild, multi-turn conversations, we compare standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, across three open reasoning models and one state-of-the-art model. To our surprise, we find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side history can reduce cumulative context lengths by up to 10x. To explain this result, we find that multi-turn conversations consist of a substantial proportion (36.4%) of self-contained prompts, and that many follow-up prompts provide sufficient instruction to be answered using only the current user turn and prior user turns. When analyzing cases where user-turn-only prompting substantially outperforms full context, we identify instances of context pollution, in which models over-condition on their previous responses, introducing errors, hallucinations, or stylistic artifacts that propagate across turns. Motivated by these findings, we design a context-filtering approach that selectively omits assistant-side context. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.cs.CV [Back]
[35] DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation
Varun Gopal,Rishabh Jain,Aradhya Mathur,Nikitha SR,Sohan Patnaik,Sudhir Yarram,Mayur Hemani,Balaji Krishnamurthy,Mausoom Sarkar
Main category: cs.CV
TL;DR: 本文提出DesignSense-10k数据集和DesignSense模型,专用于图形布局质量评估,显著提升布局生成效果。
Details
Motivation: 现有布局生成模型难以契合人类审美判断;通用偏好数据集和文本到图像奖励模型无法泛化至布局评估任务,因布局质量取决于相同元素的空间排布。 Method: 构建含10235对人工标注偏好的DesignSense-10k数据集,采用五阶段流水线(语义分组、布局预测、过滤、聚类、VLM精修)生成高质量对比样本;设计基于VLM的四分类偏好判别模型DesignSense,并在RL训练与推理时重排序中验证其效用。 Result: DesignSense在宏F1上比最强商用基线提升54.6%;前沿VLM在四分类任务上表现不可靠甚至灾难性失败;在布局生成中,RL训练使胜率提升约3%,推理时多候选重排序提升3.6%。 Conclusion: 专用、偏好感知的布局评估模型对提升真实场景中图形布局生成质量具有实质性影响。 Abstract: Graphic layouts serve as an important and engaging medium for visual communication across different channels. While recent layout generation models have demonstrated impressive capabilities, they frequently fail to align with nuanced human aesthetic judgment. Existing preference datasets and reward models trained on text-to-image generation do not generalize to layout evaluation, where the spatial arrangement of identical elements determines quality. To address this critical gap, we introduce DesignSense-10k, a large-scale dataset of 10,235 human-annotated preference pairs for graphic layout evaluation. We propose a five-stage curation pipeline that generates visually coherent layout transformations across diverse aspect ratios, using semantic grouping, layout prediction, filtering, clustering, and VLM-based refinement to produce high-quality comparison pairs. Human preferences are annotated using a 4-class scheme (left, right, both good, both bad) to capture subjective ambiguity. Leveraging this dataset, we train DesignSense, a vision-language model-based classifier that substantially outperforms existing open-source and proprietary models across comprehensive evaluation metrics (54.6% improvement in Macro F1 over the strongest proprietary baseline). Our analysis shows that frontier VLMs remain unreliable overall and fail catastrophically on the full four-class task, underscoring the need for specialized, preference-aware models. Beyond the dataset, our reward model DesignSense yields tangible downstream gains in layout generation. Using our judge during RL based training improves generator win rate by about 3%, while inference-time scaling, which involves generating multiple candidates and selecting the best one, provides a 3.6% improvement. These results highlight the practical impact of specialized, layout-aware preference modeling on real-world layout generation quality.[36] Modelling and Simulation of Neuromorphic Datasets for Anomaly Detection in Computer Vision
Mike Middleton,Teymoor Ali,Hakan Kayan,Basabdatta Sen Bhattacharya,Charith Perera,Oliver Rhodes,Elena Gheorghiu,Mark Vousden,Martin A. Trefzer
Main category: cs.CV
TL;DR: 本文提出ANTShapes,一个基于Unity引擎的新型神经形态视觉数据集仿真框架,用于生成具有异常行为物体的3D场景数据,以解决动态视觉传感器(DVS)数据稀缺问题。
Details
Motivation: 动态视觉传感器(DVS)数据获取受限,现有神经形态视觉数据集样本量少、场景单一,缺乏综合性仿真工具。 Method: 基于Unity构建ANTShapes仿真框架,生成可配置的抽象3D场景,其中物体具有随机运动与旋转等行为;采用中心极限定理指导的统计方法对异常物体进行采样与标注;支持通过调节少量参数生成任意规模的数据集及对应标签和帧数据。 Result: 实现了可灵活定制、大规模生成带精确标注的神经形态事件数据的能力,支持目标识别、定位及异常检测等任务。 Conclusion: ANTShapes有效缓解了事件驱动视觉研究中数据匮乏的问题,为算法开发与评估提供了可控、可复现且多样化的仿真数据源。 Abstract: Limitations on the availability of Dynamic Vision Sensors (DVS) present a fundamental challenge to researchers of neuromorphic computer vision applications. In response, datasets have been created by the research community, but often contain a limited number of samples or scenarios. To address the lack of a comprehensive simulator of neuromorphic vision datasets, we introduce the Anomalous Neuromorphic Tool for Shapes (ANTShapes), a novel dataset simulation framework. Built in the Unity engine, ANTShapes simulates abstract, configurable 3D scenes populated by objects displaying randomly-generated behaviours describing attributes such as motion and rotation. The sampling of object behaviours, and the labelling of anomalously-acting objects, is a statistical process following central limit theorem principles. Datasets containing an arbitrary number of samples can be created and exported from ANTShapes, along with accompanying label and frame data, through the adjustment of a limited number of parameters within the software. ANTShapes addresses the limitations of data availability to researchers of event-based computer vision by allowing for the simulation of bespoke datasets to suit purposes including object recognition and localisation alongside anomaly detection.[37] All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark
Junjiang Wu,Liejun Wang,Zhiqing Guo
Main category: cs.CV
TL;DR: 本文提出了一种名为LIDMark的152维地标-身份水印框架,结合因子化解码器(FHD),统一实现深度伪造内容的检测、定位与溯源。
Details
Motivation: 现有主动取证方法将深度伪造检测、篡改定位和来源追踪视为独立任务,缺乏统一框架;而深度伪造技术快速发展,对隐私与社会安全构成严重威胁。 Method: 提出LIDMark水印(融合面部关键点与唯一源标识)及因子化头解码器(FHD),将骨干特征分解为回归头(用于检测与定位)和分类头(用于溯源)。 Result: 实验表明该框架在检测、定位和溯源任务上具有统一性、鲁棒性和不可感知性。 Conclusion: LIDMark提供了一种端到端、一体化的主动深度伪造取证解决方案,显著提升了多任务协同性能。 Abstract: With the rapid advancement of deepfake technology, malicious face manipulations pose a significant threat to personal privacy and social security. However, existing proactive forensics methods typically treat deepfake detection, tampering localization, and source tracing as independent tasks, lacking a unified framework to address them jointly. To bridge this gap, we propose a unified proactive forensics framework that jointly addresses these three core tasks. Our core framework adopts an innovative 152-dimensional landmark-identity watermark termed LIDMark, which structurally interweaves facial landmarks with a unique source identifier. To robustly extract the LIDMark, we design a novel Factorized-Head Decoder (FHD). Its architecture factorizes the shared backbone features into two specialized heads (i.e., regression and classification), robustly reconstructing the embedded landmarks and identifier, respectively, even when subjected to severe distortion or tampering. This design realizes an "all-in-one" trifunctional forensic solution: the regression head underlies an "intrinsic-extrinsic" consistency check for detection and localization, while the classification head robustly decodes the source identifier for tracing. Extensive experiments show that the proposed LIDMark framework provides a unified, robust, and imperceptible solution for the detection, localization, and tracing of deepfake content. The code is available at https://github.com/vpsg-research/LIDMark.[38] Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos
Ziqi Gao,Jieyu Zhang,Wisdom Oluchi Ikezogwo,Jae Sung Park,Tario G. You,Daniel Ogbu,Chenhao Zheng,Weikai Huang,Yinuo Yang,Winson Han,Quan Kong,Rajat Saini,Ranjay Krishna
Main category: cs.CV
TL;DR: 本文提出了大规模视频场景图数据集SVG2和视频场景图生成模型TRaSER,显著提升了视频中物体、属性和关系的检测性能,并验证了其在视频问答任务中的有效性。
Details
Motivation: 现有时空场景图数据集规模小、多样性不足,难以支撑复杂视频理解任务;需构建更大规模、更丰富的视频场景图基准并开发高效生成模型。 Method: 设计全自动流水线构建SVG2数据集,涵盖多尺度全景分割、轨迹跟踪与新物体发现、轨迹语义解析及GPT-5驱动的时空关系推理;提出TRaSER模型,引入轨迹对齐的token排列机制、物体-轨迹重采样器和时间窗重采样器,实现单次前向传播生成紧凑时空场景图。 Result: TRaSER在PVSG、VIPSeg、VidOR和SVG2测试集上,关系检测提升15–20%,物体预测提升30–40%(相比最强开源基线)和13%(相比GPT-5),属性预测提升15%;用于视频问答时,相较仅用视频或Qwen2.5-VL生成的场景图,绝对准确率提升1.5–4.6%。 Conclusion: SVG2是当前最大规模、最多样化的视频场景图数据集,TRaSER是首个支持端到端视频到场景图生成的高效模型,证明了显式时空场景图作为中间表征对视频理解具有重要价值。 Abstract: We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and new modules: an object-trajectory resampler and a temporal-window resampler to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR and SVG2 test datasets, TRaSER improves relation detection by +15 to 20%, object prediction by +30 to 40% over the strongest open-source baselines and by +13% over GPT-5, and attribute prediction by +15%. When TRaSER's generated scene graphs are sent to a VLM for video question answering, it delivers a +1.5 to 4.6% absolute accuracy gain over using video only or video augmented with Qwen2.5-VL's generated scene graphs, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.[39] LE-NeuS: Latency-Efficient Neuro-Symbolic Video Understanding via Adaptive Temporal Verification
Shawn Liang,Sahil Shah,Chengwei Zhou,SP Sharan,Harsh Goel,Arnab Sanyal,Sandeep Chinchali,Gourav Datta
Main category: cs.CV
TL;DR: 本文提出LE-NeuS,一种低延迟神经符号框架,用于长视频问答(LVQA),通过两阶段自适应采样和批量命题检测,在保持高精度的同时显著降低推理延迟。
Details
Motivation: 现有神经符号方法在长视频问答中虽准确率高,但推理延迟高达基线VLM的90倍,难以部署于边缘设备。 Method: 提出CLIP引导的两阶段自适应采样以跳过语义相似帧,并采用批量命题检测并行化VLM推理;理论推导了延迟上界。 Result: 在LongVideoBench和Video-MME上,LE-NeuS将延迟差距从90倍降至约10倍,同时在时序复杂问题上保持超10%的精度提升。 Conclusion: LE-NeuS在精度与延迟之间实现了更优平衡,为边缘端LVQA部署提供了可行方案。 Abstract: Neuro-symbolic approaches to long-form video question answering (LVQA) have demonstrated significant accuracy improvements by grounding temporal reasoning in formal verification. However, existing methods incur prohibitive latency overheads, up to 90x slower than base VLM prompting, rendering them impractical for latency-sensitive edge deployments. We present LE-NeuS, a latency-efficient neuro-symbolic framework that preserves the accuracy benefits of temporal logic-guided video understanding while drastically reducing inference latency. Our key insight is that the dominant computational bottleneck arises from sequential and dense proposition detection across video frames during automaton construction. We address this through two principled optimizations: (1) CLIP guided two-stage adaptive sampling that exploits visual redundancy to skip semantically similar frames while preserving temporal boundaries, and (2) batched proposition detection that parallelizes VLM inference across temporal windows. Theoretically, we derive latency bounds as a function of video length, proposition complexity, and sampling density, establishing conditions under which latency efficiency is achievable. Empirically, on LongVideoBench and Video-MME benchmarks deployed on NVIDIA H100 GPUs, LE-NeuS reduces the latency gap from 90x to approximately 10x while maintaining >10% accuracy gains on temporally complex queries.[40] No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
Cho-Ying Wu,Zixun Huang,Xinyu Huang,Liu Ren
Main category: cs.CV
TL;DR: 本文提出了一种跨传感器视图合成方法,解决RGB-X数据对齐难题,无需X传感器的3D先验,仅依赖低成本COLMAP处理RGB图像,通过匹配-稠密化-整合流程实现高效校准与视图合成。
Details
Motivation: 现有RGB-X研究大多假设已存在对齐的RGB-X数据对,但实际中获取对齐数据需要大量工程校准工作,这一问题被广泛忽视。 Method: 提出match-densify-consolidate方法:先进行RGB-X图像匹配,再进行置信度感知的引导点稠密化和自匹配滤波,最后将结果整合进3D高斯泼溅(3DGS)框架。 Result: 实现了无需X传感器3D先验、仅需低成本COLMAP处理RGB图像的跨模态视图合成,在保持性能的同时显著降低校准复杂度。 Conclusion: 该方法有效缓解了RGB-X数据采集中的校准瓶颈,为大规模真实场景下的跨传感器学习提供了可扩展解决方案,有望推动跨模态学习的实际落地与普及。 Abstract: We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: getting aligned RGB-X data, where most RGB-X prior work assumes such pairs exist and focuses on modality fusion, but it empirically requires huge engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate them in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for X-sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning by a scalable solution that breaks through the bottleneck in large-scale real-world RGB-X data collection.[41] Evidential Neural Radiance Fields
Ruxiao Duan,Alex Wong
Main category: cs.CV
TL;DR: 本文提出Evidential Neural Radiance Fields,一种能单次前向传播直接量化aleatoric和epistemic不确定性的概率化NeRF方法,在保持高质量渲染的同时提升了不确定性估计能力。
Details
Motivation: 现有NeRF的不确定性估计方法无法同时捕捉aleatoric和epistemic不确定性,或牺牲渲染质量,或带来高计算开销,限制了其在安全关键场景的应用。 Method: 提出Evidential Neural Radiance Fields,将证据深度学习引入NeRF框架,使其在标准NeRF渲染流程中无缝集成,并支持单次前向传播同时估计两类不确定性。 Result: 在三个标准基准上,该方法在场景重建保真度和不确定性估计质量两方面均达到当前最优水平。 Conclusion: Evidential NeRF为可信三维场景建模提供了高效、准确且全面的不确定性量化方案,有助于推动NeRF在安全敏感领域的实际部署。 Abstract: Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare multiple uncertainty quantification methods on three standardized benchmarks, where our approach demonstrates state-of-the-art scene reconstruction fidelity and uncertainty estimation quality.[42] CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird's-Eye-View Semantic Segmentation
Jeongbin Hong,Dooseop Choi,Taeg-Hyun An,Kyounghwan An,Kyoung-Wook Min
Main category: cs.CV
TL;DR: 本文提出CycleBEV框架,通过引入逆视角变换(IVT)网络和循环一致性损失来增强现有视角变换(VT)模型在鸟瞰图(BEV)语义分割任务中的性能,显著提升各类目标的mIoU,且不增加推理开销。
Details
Motivation: 解决自动驾驶中从俯视图(PV)到鸟瞰图(BEV)特征变换存在的深度模糊与遮挡问题,现有VT方法仍存在挑战。 Method: 提出CycleBEV正则化框架,设计逆视角变换(IVT)网络实现BEV→PV映射,并结合循环一致性损失;进一步将循环一致性拓展至几何空间和表征空间以挖掘IVT网络潜力。 Result: 在nuScenes数据集上对四种代表性VT模型验证,drivable area、vehicle、pedestrian类mIoU分别提升0.74、4.86、3.74,推理复杂度不变。 Conclusion: CycleBEV是一种通用、高效、即插即用的训练时正则化方法,能显著提升VT模型的BEV语义分割性能,且无需修改原有推理流程。 Abstract: Transforming image features from perspective view (PV) space to bird's-eye-view (BEV) space remains challenging in autonomous driving due to depth ambiguity and occlusion. Although several view transformation (VT) paradigms have been proposed, the challenge still remains. In this paper, we propose a new regularization framework, dubbed CycleBEV, that enhances existing VT models for BEV semantic segmentation. Inspired by cycle consistency, widely used in image distribution modeling, we devise an inverse view transformation (IVT) network that maps BEV segmentation maps back to PV segmentation maps and use it to regularize VT networks during training through cycle consistency losses, enabling them to capture richer semantic and geometric information from input PV images. To further exploit the capacity of the IVT network, we introduce two novel ideas that extend cycle consistency into geometric and representation spaces. We evaluate CycleBEV on four representative VT models covering three major paradigms using the large-scale nuScenes dataset. Experimental results show consistent improvements -- with gains of up to 0.74, 4.86, and 3.74 mIoU for drivable area, vehicle, and pedestrian classes, respectively -- without increasing inference complexity, since the IVT network is used only during training. The implementation code is available at https://github.com/JeongbinHong/CycleBEV.[43] Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning
Abhishek Dalvi,Vasant Honavar
Main category: cs.CV
TL;DR: 本文提出HDFLIM框架,利用超维计算在不更新预训练视觉与语言模型参数的前提下,实现跨模态对齐;通过将单模态嵌入映射至共享超维空间,并运用绑定、聚合与相似性检索等轻量符号操作,一次性构建关联式跨模态表征,显著降低计算开销并保持预训练语义结构。
Details
Motivation: 现有跨模态对齐方法依赖计算密集型的多模态微调,易扰动预训练表示;而独立训练的基础模型可能已具备潜在语义兼容性,因此亟需探索无需修改模型参数的对齐新范式。 Method: 提出HDFLIM框架:将冻结的视觉与语言模型输出的单模态嵌入投影至共享超维空间,利用超维计算中的绑定(binding)、聚合(bundling)和基于相似性的检索(similarity-based retrieval)等轻量符号操作,在单次数据遍历中构建跨模态关联表征;图像描述生成由高维记忆检索驱动,而非迭代梯度优化。 Result: HDFLIM在图像描述生成任务上达到与端到端视觉-语言训练方法相当的性能,且生成描述比零样本基线更具语义根基;验证了仅通过符号操作即可实现有效跨模态对齐。 Conclusion: 跨模态对齐可脱离参数调优,转而依赖超维编码上的结构化表征映射;该工作为大模型对齐提供了‘冻结模型+符号操作’的新范式,具有低资源、保语义、易扩展的优势。 Abstract: Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine-tuning. Such approaches depend on large-scale parameter updates, are resource intensive, and can perturb pretrained representations. Emerging evidence suggests, however, that independently trained foundation models may already exhibit latent semantic compatibility, reflecting shared structures in the data they model. This raises a fundamental question: can cross-modal alignment be achieved without modifying the models themselves? Here we introduce HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that establishes cross-modal mappings while keeping pretrained vision and language models fully frozen. HDFLIM projects unimodal embeddings into a shared hyperdimensional space and leverages lightweight symbolic operations -- binding, bundling, and similarity-based retrieval to construct associative cross-modal representations in a single pass over the data. Caption generation emerges from high-dimensional memory retrieval rather than iterative gradient-based optimization. We show that HDFLIM achieves performance comparable to end-to-end vision-language training methods and produces captions that are more semantically grounded than zero-shot baselines. By decoupling alignment from parameter tuning, our results suggest that semantic mapping across foundation models can be realized through symbolic operations on hyperdimensional encodings of the respective embeddings. More broadly, this work points toward an alternative paradigm for foundation model alignment in which frozen models are integrated through structured representational mappings rather than through large-scale retraining. The codebase for our implementation can be found at https://github.com/Abhishek-Dalvi410/HDFLIM.[44] Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models
Hiroshi Sasaki
Main category: cs.CV
TL;DR: 本文提出一种新的训练范式,通过引入由图渲染器生成的伪对比样本,提升视觉-语言模型对图表(如流程图)中细微结构差异的敏感性,从而增强图表理解能力。
Details
Motivation: 现有CLIP等多模态模型在处理小视觉差异蕴含大语义差异的领域(如图表理解)时表现不佳,因其对细粒度结构变化不敏感。 Method: 设计一种新训练范式:利用图渲染器生成含随机文本元素的合成图表作为伪对比样本,无需修改原始数据,将其融入训练目标以增强模型对结构细节的建模能力。 Result: 在流程图基准数据集上的实验表明,该方法在图文匹配和视觉问答任务上显著优于标准CLIP及难负例CLIP。 Conclusion: 针对特定领域(如图表理解)设计训练策略可有效提升视觉-语言模型性能,为多模态图表理解提供了新思路。 Abstract: Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance, such as diagram understanding, remain challenging due to the models' limited sensitivity to fine-grained structural variations. We propose a new training paradigm designed to enhance diagram comprehension in vision-language models. Our approach introduces pseudo contrastive samples generated by a diagram renderer that creates synthetic diagrams using randomly picked text elements. These samples highlight structural differences in diagrammatic imagery without requiring any modification or editing of the original data. By incorporating these pseudo contrastive samples into the training objective, the model learns to capture more precise and semantically consistent diagram structures. Empirical evaluations on a benchmark dataset of flowcharts demonstrate substantial improvements over standard CLIP and hard-negative CLIP training in both image-text matching and visual question answering tasks. The results underscore the value of domain-specific training strategies and contribute to advancing diagrammatic understanding within the broader context of vision-language learning.[45] Incremental dimension reduction for efficient and accurate visual anomaly detection
Teng-Yok Lee
Main category: cs.CV
TL;DR: 本文提出了一种增量式降维算法,用于加速视觉异常检测中高维特征的处理,通过分批更新截断奇异值分解(SVD)并重映射至全局子空间,在保持精度的同时显著降低内存与计算开销。
Details
Motivation: 现有基于深度神经网络的视觉异常检测方法提取的特征维度高,难以高效处理大规模图像数据(如数千张图像)。 Method: 提出一种增量式截断奇异值分解(SVD)算法:将高维特征向量分批处理,每批动态更新并维护全局近似SVD;各批次独立降维以节省内存;最后将各批次降维结果统一重映射至全局SVD子空间。 Result: 该算法显著加速了当前最优异常检测算法的训练过程,同时保持接近原始精度。 Conclusion: 增量式分批SVD降维是一种高效、低内存开销的方案,适用于大规模视觉异常检测任务,兼顾速度与精度。 Abstract: While nowadays visual anomaly detection algorithms use deep neural networks to extract salient features from images, the high dimensionality of extracted features makes it difficult to apply those algorithms to large data with 1000s of images. To address this issue, we present an incremental dimension reduction algorithm to reduce the extracted features. While our algorithm essentially computes truncated singular value decomposition of these features, other than processing all vectors at once, our algorithm groups the vectors into batches. At each batch, our algorithm updates the truncated singular values and vectors that represent all visited vectors, and reduces each batch by its own singular values and vectors so they can be stored in the memory with low overhead. After processing all batches, we re-transform these batch-wise singular vectors to the space spanned by the singular vectors of all features. We show that our algorithm can accelerate the training of state-of-the-art anomaly detection algorithm with close accuracy.[46] Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning
Jiacheng Yang,Anqi Chen,Yunkai Dang,Qi Fan,Cong Wang,Wenbin Li,Feng Miao,Yang Gao
Main category: cs.CV
TL;DR: 本文提出了一种无需额外标注的高分辨率视觉推理技术HART,通过闭环框架和AP-GRPO后训练方法,使大语言多模态模型能自主定位并验证关键图像区域,在高分辨率视觉任务上显著超越基线甚至更大模型。
Details
Motivation: 现有大语言多模态模型(LMMs)处理高分辨率图像时面临图像token数量随分辨率平方增长的问题,导致冗余与无关信息增多;依赖人工标注的外部视觉监督成本高,且如何在无额外标注下提升模型视觉定位与推理能力仍是开放问题。 Method: 提出无标注高分辨率推理技术HART:采用闭环框架,结合后训练范式与新设计的Advantage Preference Group Relative Policy Optimization(AP-GRPO)算法,引导模型精准定位关键图像区域,并支持可解释的推理路径与高效定位优化。 Result: HART在多种高分辨率视觉任务上显著提升性能,持续超越强基线;应用于Qwen2.5-VL-7B后训练后,在高分辨率视觉基准上甚至超越Qwen2.5-VL-72B和LLaVA-OneVision-72B等更大模型。 Conclusion: HART是一种有效、高效且无需额外标注的高分辨率视觉推理增强方法,提升了LMMs的自主定位、自我验证与可解释推理能力,为高分辨率多模态理解提供了新范式。 Abstract: Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model's grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training paradigm in which we design Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions. Notably, HART provides explainable reasoning pathways and enables efficient optimization of localization. Extensive experiments demonstrate that HART improves performance across a wide range of high-resolution visual tasks, consistently outperforming strong baselines. When applied to post-train Qwen2.5-VL-7B, HART even surpasses larger-scale models such as Qwen2.5-VL-72B and LLaVA-OneVision-72B on high-resolution, vision-centric benchmarks.[47] Egocentric Visibility-Aware Human Pose Estimation
Peng Dai,Yu Zhang,Yiqiang Feng,Zhen Fan,Yang Zhang
Main category: cs.CV
TL;DR: 本文提出Eva-3M数据集(含可见性标注的大规模自我中心人体姿态估计数据集)和EvaPose方法(显式建模关键点可见性的新姿态估计算法),显著提升自我中心场景下可见关键点的估计精度。
Details
Motivation: 现有自我中心人体姿态估计(HPE)方法忽视关键点可见性问题,且缺乏带可见性标注的数据集,导致对可见关键点的预测精度受限。 Method: 构建大规模可见性感知数据集Eva-3M(300万帧,43.5万帧含可见性标注)并扩充EMHI数据集;提出EvaPose方法,显式引入关键点可见性信息以提升姿态估计精度。 Result: 实验证明真实可见性标签对自我中心HPE具有重要价值,EvaPose在Eva-3M和EMHI数据集上均达到SOTA性能。 Conclusion: 关键点可见性建模是提升自我中心HPE精度的关键,Eva-3M和EvaPose为该方向提供了重要数据基础与方法范式。 Abstract: Egocentric human pose estimation (HPE) using a head-mounted device is crucial for various VR and AR applications, but it faces significant challenges due to keypoint invisibility. Nevertheless, none of the existing egocentric HPE datasets provide keypoint visibility annotations, and the existing methods often overlook the invisibility problem, treating visible and invisible keypoints indiscriminately during estimation. As a result, their capacity to accurately predict visible keypoints is compromised. In this paper, we first present Eva-3M, a large-scale egocentric visibility-aware HPE dataset comprising over 3.0M frames, with 435K of them annotated with keypoint visibility labels. Additionally, we augment the existing EMHI dataset with keypoint visibility annotations to further facilitate the research in this direction. Furthermore, we propose EvaPose, a novel egocentric visibility-aware HPE method that explicitly incorporates visibility information to enhance pose estimation accuracy. Extensive experiments validate the significant value of ground-truth visibility labels in egocentric HPE settings, and demonstrate that our EvaPose achieves state-of-the-art performance in both Eva-3M and EMHI datasets.[48] DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model
Shibo Hong,Boxian Ai,Jun Kuang,Wei Wang,FengJiao Chen,Zhongyuan Peng,Chenhao Huang,Yixin Cao
Main category: cs.CV
TL;DR: 本文提出了首个专门评估指令式图像编辑模型(IIEMs)在小尺度物体编辑能力的基准DeepLookEditBench(DLEBench),包含1889个样本,目标物体仅占图像面积1%-10%,并设计了减少主观性的双模式评估协议,实证发现现有10种IIEMs在此任务上存在显著性能差距。
Details
Motivation: 现有指令式图像编辑模型在主流基准上表现良好,但其对小物体(仅占图像1%-10%)的精细编辑能力尚未被系统评估,而该能力对局部编辑和细节优化至关重要。 Method: 构建首个面向小尺度物体编辑的基准DLEBench,含1889个样本、7类指令,涵盖部分遮挡与多物体编辑等复杂场景;提出兼顾客观性与可靠性的双模式评估协议(Tool-driven和Oracle-guided),改进Instruction Following与Visual Consistency评分标准。 Result: 在DLEBench上对10种IIEMs的实验表明,所有模型在小物体编辑任务上性能明显下降,验证了当前方法的局限性及新基准的必要性。 Conclusion: 小尺度物体编辑是IIEMs亟待突破的关键短板,DLEBench为该方向提供了标准化评估工具,推动模型向更精细、可控的图像编辑能力发展。 Abstract: Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.[49] BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
Tongyan Hua,Haoran Gong,Yuan Liu,Di Wang,Ying-Cong Chen,Wufan Zhao
Main category: cs.CV
TL;DR: 本文提出BuildAnyPoint框架,利用Loosely Cascaded Diffusion Transformer(Loca-DiT)从多样分布的点云(如机载LiDAR、SfM)中生成结构化3D建筑模型,通过两阶段流程:先用潜在扩散模型恢复点云分布,再用解码器-only Transformer自回归生成紧凑网格。
Details
Motivation: 现有方法难以在高度欠约束条件下(如稀疏、噪声点云)重建符合人工设计抽象风格的3D建筑结构,亟需引入显式3D生成先验以提升重建质量与泛化性。 Method: 提出Loosely Cascaded Diffusion Transformer(Loca-DiT):第一阶段训练条件潜在扩散模型以从输入点云中恢复底层点分布;第二阶段设计decoder-only transformer,基于恢复后的点云进行条件自回归网格生成。 Result: 在建筑抽象任务上显著优于先前方法;其恢复的点云在建筑点云补全基准上表现优异,表面精度和分布均匀性均提升。 Conclusion: 显式3D生成先验与级联扩散-自回归建模的有效结合,为多样化点云下的高质量、结构化建筑重建提供了新范式。 Abstract: We introduce BuildAnyPoint, a novel generative framework for structured 3D building reconstruction from point clouds with diverse distributions, such as those captured by airborne LiDAR and Structure-from-Motion. To recover artist-created building abstraction in this highly underconstrained setting, we capitalize on the role of explicit 3D generative priors in autoregressive mesh generation. Specifically, we design a Loosely Cascaded Diffusion Transformer (Loca-DiT) that initially recovers the underlying distribution from noisy or sparse points, followed by autoregressively encapsulating them into compact meshes. We first formulate distribution recovery as a conditional generation task by training latent diffusion models conditioned on input point clouds, and then tailor a decoder-only transformer for conditional autoregressive mesh generation based on the recovered point clouds. Our method delivers substantial qualitative and quantitative improvements over prior building abstraction methods. Furthermore, the effectiveness of our approach is evidenced by the strong performance of its recovered point clouds on building point cloud completion benchmarks, which exhibit improved surface accuracy and distribution uniformity.[50] 3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection
Haowen Zhu,Ning Yin,Xiaogen Zhou
Main category: cs.CV
TL;DR: 本文提出MedMAP,一种面向3D MRI的医学模态感知预训练框架,通过模态感知的视觉-语言对齐与跨模态特征融合,提升多器官异常检测性能。
Details
Motivation: 将视觉-语言模型(VLMs)应用于多器官医学影像存在两大挑战:模态特异性的视觉-语言对齐和跨模态特征融合。 Method: 提出MedMAP框架,包含模态感知的视觉-语言对齐预训练阶段和面向多器官异常检测的微调阶段;设计模态感知编码器隐式建模联合模态分布,并在微调中冻结文本编码器、仅微调视觉编码器;构建新数据集MedMoM-MRI3D(7392组3D MRI-报告对,覆盖12种MRI模态与9类异常)。 Result: 在MedMoM-MRI3D上大量实验表明,MedMAP在3D MRI多器官异常检测任务上显著优于现有VLMs。 Conclusion: MedMAP有效提升了VLM在3D医学影像中的表示学习能力,为多器官、多模态医学诊断提供了可扩展的预训练范式。 Abstract: Vision-language models (VLMs) show strong potential for complex diagnostic tasks in medical imaging. However, applying VLMs to multi-organ medical imaging introduces two principal challenges: (1) modality-specific vision-language alignment and (2) cross-modal feature fusion. In this work, we propose MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. During the pre-training stage, the modality-aware encoders implicitly capture the joint modality distribution and improve alignment between visual and textual representations. We then fine-tune the pre-trained vision encoders (while keeping the text encoder frozen) for downstream tasks. To this end, we curated MedMoM-MRI3D, comprising 7,392 3D MRI volume-report pairs spanning twelve MRI modalities and nine abnormalities tailored for various 3D medical analysis tasks. Extensive experiments on MedMoM-MRI3D demonstrate that MedMAP significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection. Our code is available at https://github.com/RomantiDr/MedMAP.[51] ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models
Wei Luo,Yangfan Ou,Jin Deng,Zeshuai Deng,Xiquan Yan,Zhiquan Wen,Mingkui Tan
Main category: cs.CV
TL;DR: 本文提出ProtoDCS框架,用于开放集测试时自适应(OSTTA),通过高斯混合模型验证和证据驱动的原型级更新,提升大模型在分布偏移下的鲁棒性与效率。
Details
Motivation: 现有基于VLM的测试时自适应方法假设封闭集,无法应对开放集场景中同时存在协变量偏移的ID和OOD样本的问题,导致误判和计算开销大。 Method: 提出Prototype-based Double-Check Separation(ProtoDCS):1)用概率性GMM验证替代硬阈值实现双检分离;2)采用不确定性感知损失和原型级参数更新进行证据驱动的自适应。 Result: 在CIFAR-10/100-C和Tiny-ImageNet-C上显著提升已知类准确率与OOD检测指标,达到SOTA性能。 Conclusion: ProtoDCS有效解决了开放集TTA中OOD干扰与ID自适应的双重挑战,在准确性、鲁棒性和效率上均优于现有方法。 Abstract: Large-scale Vision-Language Models (VLMs) exhibit strong zero-shot recognition, yet their real-world deployment is challenged by distribution shifts. While Test-Time Adaptation (TTA) can mitigate this, existing VLM-based TTA methods operate under a closed-set assumption, failing in open-set scenarios where test streams contain both covariate-shifted in-distribution (csID) and out-of-distribution (csOOD) data. This leads to a critical difficulty: the model must discriminate unknown csOOD samples to avoid interference while simultaneously adapting to known csID classes for accuracy. Current open-set TTA (OSTTA) methods rely on hard thresholds for separation and entropy minimization for adaptation. These strategies are brittle, often misclassifying ambiguous csOOD samples and inducing overconfident predictions, and their parameter-update mechanism is computationally prohibitive for VLMs. To address these limitations, we propose Prototype-based Double-Check Separation (ProtoDCS), a robust framework for OSTTA that effectively separates csID and csOOD samples, enabling safe and efficient adaptation of VLMs to csID data. Our main contributions are: (1) a novel double-check separation mechanism employing probabilistic Gaussian Mixture Model (GMM) verification to replace brittle thresholding; and (2) an evidence-driven adaptation strategy utilizing uncertainty-aware loss and efficient prototype-level updates, mitigating overconfidence and reducing computational overhead. Extensive experiments on CIFAR-10/100-C and Tiny-ImageNet-C demonstrate that ProtoDCS achieves state-of-the-art performance, significantly boosting both known-class accuracy and OOD detection metrics. Code will be available at https://github.com/O-YangF/ProtoDCS.[52] Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering
Ao Li,Rui Liu,Mingjie Li,Sheng Liu,Lei Wang,Xiaodan Liang,Lina Yao,Xiaojun Chang,Lei Xing
Main category: cs.CV
TL;DR: 本文提出了一种无需训练、在推理时控制视觉-语言模型生成放射科报告中历史比较幻觉的新方法SDLS,通过LLM驱动的语义分解与QR正交化构建语义无关干预向量,显著降低历史幻觉并提升临床准确性。
Details
Motivation: 现有视觉-语言模型在自动生成放射科报告时易产生‘先验比较幻觉’(即虚构当前影像不支持的历史发现),亟需一种能精准抑制该类错误而不损害临床准确性的方法。 Method: 提出语义解耦潜在引导(SDLS)框架:利用大语言模型进行语义分解,再通过QR分解实现正交化,构造仅针对‘历史比较’维度的语义无关干预向量,避免PCA方向中临床语义纠缠问题。 Result: 在MIMIC-CXR上,FilBERT幻觉评分从0.2373降至0.1889,CheXpert宏观F1从0.2242升至0.3208;在CheXpert Plus和IU-Xray上实现零样本迁移验证,且临床叙事结构完整性得以保持。 Conclusion: SDLS是一种有效、通用、训练无关的推理时控制机制,成功解耦并精准干预历史比较语义轴,在抑制幻觉与维持临床准确性之间取得更好平衡。 Abstract: Automated radiology report generation using vision-language models (VLMs) is limited by the risk of prior-comparison hallucination, where the model generates historical findings unsupported by the current study. We address this challenge with a training-free, inference-time control framework termed Semantically Decoupled Latent Steering (SDLS). Unlike generic activation steering, which often suffers from semantic entanglement, our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition followed by $QR$-based orthogonalization. This orthogonalization step is critical. It leverages geometric constraints to filter out the clinical semantics often entangled in standard principal component analysis (PCA) directions, ensuring that the steering vector targets only the ``historical comparison" axis. We validate our method on the BiomedGPT foundation model, demonstrating that it overcomes the trade-off between hallucination suppression and clinical accuracy. Extensive experiments on MIMIC-CXR, and zero-shot transfer evaluation on CheXpert Plus and IU-Xray, demonstrate the robustness of our approach. Quantitative evaluations on MIMIC-CXR show that our approach significantly reduces the probability of historical hallucinations (FilBERT score decreases from 0.2373 to 0.1889) and improves clinical label fidelity (CheXpert macro-F1 increases from 0.2242 to 0.3208). Supplementary evaluations confirm that the structural integrity of the clinical narrative is maintained.[53] Vision-Language Semantic Grounding for Multi-Domain Crop-Weed Segmentation
Nazia Hossain,Xintong Jiang,Yu Tian,Philippe Seguin,O. Grant Clark,Shangpeng Sun
Main category: cs.CV
TL;DR: 本文提出了一种名为VL-WS的视觉-语言杂草分割框架,通过融合冻结的CLIP文本嵌入与空间特征,并利用FiLM层进行语言引导的特征调制,实现跨域泛化和细粒度分割;在多源数据上训练,在四个基准数据集上显著优于CNN基线,尤其在最难杂草类别上提升达15.42%,且在少样本场景下表现稳健。
Details
Motivation: 现有深度学习模型依赖数据集特定的视觉特征,在异构农业环境中泛化能力差,难以支持精准农业中的靶向除草应用。 Method: 提出Vision-Language Weed Segmentation(VL-WS)框架:采用双编码器结构,冻结CLIP提取文本语义嵌入,与任务专用空间特征融合,并通过自然语言描述条件化的FiLM层进行通道级特征调制;训练数据涵盖地面机器人与无人机多源影像,覆盖多种作物、杂草种类、生长阶段及成像条件。 Result: 在四个基准数据集上平均Dice得分为91.64%,较CNN基线提升4.98%;最难杂草类别的Dice分数达80.45%,提升15.42%;在目标域标注有限时仍保持稳定性能,体现更强泛化性与数据效率。 Conclusion: 视觉-语言对齐可有效构建可扩展、标签高效、适用于多样化真实农业场景的分割模型。 Abstract: Fine-grained crop-weed segmentation is essential for enabling targeted herbicide application in precision agriculture. However, existing deep learning models struggle to generalize across heterogeneous agricultural environments due to reliance on dataset-specific visual features. We propose Vision-Language Weed Segmentation (VL-WS), a novel framework that addresses this limitation by grounding pixel-level segmentation in semantically aligned, domain-invariant representations. Our architecture employs a dual-encoder design, where frozen Contrastive Language-Image Pretraining (CLIP) embeddings and task-specific spatial features are fused and modulated via Feature-wise Linear Modulation (FiLM) layers conditioned on natural language captions. This design enables image level textual descriptions to guide channel-wise feature refinement while preserving fine-grained spatial localization. Unlike prior works restricted to training and evaluation on single-source datasets, VL-WS is trained on a unified corpus that includes close-range ground imagery (robotic platforms) and high-altitude UAV imagery, covering diverse crop types, weed species, growth stages, and sensing conditions. Experimental results across four benchmark datasets demonstrate the effectiveness of our framework, with VL-WS achieving a mean Dice score of 91.64% and outperforming the CNN baseline by 4.98%. The largest gains occur on the most challenging weed class, where VL-WS attains 80.45% Dice score compared to 65.03% for the best baseline, representing a 15.42% improvement. VL-WS further maintains stable weed segmentation performance under limited target-domain supervision, indicating improved generalization and data efficiency. These findings highlight the potential of vision-language alignment to enable scalable, label-efficient segmentation models deployable across diverse real-world agricultural domains.[54] Any Model, Any Place, Any Time: Get Remote Sensing Foundation Model Embeddings On Demand
Dingqi Ye,Daniel Kiv,Wei Hu,Jimeng Shi,Shaowen Wang
Main category: cs.CV
TL;DR: 本文提出rs-embed Python库,旨在统一遥感基础模型的嵌入获取接口,支持单行代码调用多模型、任意位置与时间范围的嵌入提取,并提供高效批量处理能力,以降低使用与评测成本。
Details
Motivation: 遥感领域基础模型快速增长,但模型发布格式、平台接口及输入数据规范差异大,导致嵌入获取、使用和基准测试成本高、难以公平比较。 Method: 设计并实现一个名为rs-embed的Python开源库,采用以兴趣区域(ROI)为中心的统一接口,支持多种模型集成与灵活时空查询,并内置高效批处理机制。 Result: 实现了跨模型、跨时空的嵌入一键调用与大规模批量生成能力,显著降低遥感基础模型的实际应用与评估门槛。 Conclusion: rs-embed为遥感基础模型生态提供了标准化接口基础设施,推动模型可复现性、可比性与实用化发展。 Abstract: The remote sensing community is witnessing a rapid growth of foundation models, which provide powerful embeddings for a wide range of downstream tasks. However, practical adoption and fair comparison remain challenging due to substantial heterogeneity in model release formats, platforms and interfaces, and input data specifications. These inconsistencies significantly increase the cost of obtaining, using, and benchmarking embeddings across models. To address this issue, we propose rs-embed, a Python library that offers a unified, region of interst (ROI) centric interface: with a single line of code, users can retrieve embeddings from any supported model for any location and any time range. The library also provides efficient batch processing to enable large-scale embedding generation and evaluation. The code is available at: https://github.com/cybergis/rs-embed[55] Towards Source-Aware Object Swapping with Initial Noise Perturbation
Jiahui Zhan,Xianbing Sun,Xiangnan Zhu,Yikun Ji,Ruitong Liu,Liqing Zhang,Jianfu Zhang
Main category: cs.CV
TL;DR: 本文提出SourceSwap,一种自监督、源感知的框架,用于对象交换任务,通过频率分离扰动合成伪配对数据,无需额外数据或微调,实现高质量的对象-场景和谐交换。
Details
Motivation: 现有方法要么需要针对每个对象进行微调且推理慢,要么依赖额外的配对数据,导致模型过度依赖背景线索而非学习跨对象对齐。 Method: 提出SourceSwap框架,利用频率分离扰动在初始噪声空间中合成高质量伪配对;采用双U-Net结构,结合全源条件控制和无噪声参考编码器,支持零样本推理与轻量迭代优化。 Result: 在新构建的高分辨率、多类别、丰富交互的SourceBench基准上,SourceSwap展现出更优的对象保真度、更强的场景保持能力与更自然的对象-场景和谐性,并能泛化至主体驱动精修与人脸交换等编辑任务。 Conclusion: SourceSwap是一种高效、通用且无需微调的对象交换新范式,显著提升了跨对象对齐能力与实际编辑灵活性。 Abstract: Object swapping aims to replace a source object in a scene with a reference object while preserving object fidelity, scene fidelity, and object-scene harmony. Existing methods either require per-object finetuning and slow inference or rely on extra paired data that mostly depict the same object across contexts, forcing models to rely on background cues rather than learning cross-object alignment. We propose SourceSwap, a self-supervised and source-aware framework that learns cross-object alignment. Our key insight is to synthesize high-quality pseudo pairs from any image via a frequency-separated perturbation in the initial-noise space, which alters appearance while preserving pose, coarse shape, and scene layout, requiring no videos, multi-view data, or additional images. We then train a dual U-Net with full-source conditioning and a noise-free reference encoder, enabling direct inter-object alignment, zero-shot inference without per-object finetuning, and lightweight iterative refinement. We further introduce SourceBench, a high-quality benchmark with higher resolution, more categories, and richer interactions. Experiments demonstrate that SourceSwap achieves superior fidelity, stronger scene preservation, and more natural harmony, and it transfers well to edits such as subject-driven refinement and face swapping.[56] HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit
Hao Wu,Yingqi Fan,Jinyang Dai,Junlong Tong,Yunpu Ma,Xiaoyu Shen
Main category: cs.CV
TL;DR: HiDrop是一种针对多模态大语言模型(MLLMs)视觉token处理高计算成本问题的高效压缩框架,通过分层对齐的视觉token剪枝策略,在保持性能的同时大幅加速训练与推理。
Details
Motivation: MLLMs中视觉token处理具有二次方计算复杂度,限制了其广泛应用;现有渐进式剪枝方法未能准确理解浅层功能且调度僵化,导致效率提升受限。 Method: 提出HiDrop框架:1)Late Injection机制,跳过被动浅层,仅在主动融合起始处注入视觉token;2)结合Early Exit的凹金字塔剪枝策略,利用层间相似性度量和可微top-k算子动态调整中深层剪枝率;并引入持久位置编码、FlashAttention兼容的token选择及视觉计算并行解耦以消除动态剪枝的隐藏开销。 Result: 实验表明HiDrop可压缩约90%视觉token,在保持原始性能的同时训练速度提升1.72倍。 Conclusion: HiDrop不仅在高效MLLM训练与推理上达到新SOTA,还揭示了多模态融合的层次性本质,为后续研究提供了重要启示。 Abstract: The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret shallow layer functions and use rigid schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism to dynamically adjust pruning rates across middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-k operator. To ensure practical efficiency, HiDrop further incorporates persistent positional encoding, FlashAttention-compatible token selection, and parallel decoupling of vision computation to eliminate hidden overhead associated with dynamic token reduction. Extensive experiments show that HiDrop compresses about 90% visual tokens while matching the original performance and accelerating training by 1.72 times. Our work not only sets a new state-of-the-art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion. The code is released at https://github.com/EIT-NLP/HiDrop.[57] EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
Shitong Sun,Ke Han,Yukai Huang,Weitong Cai,Jifei Song
Main category: cs.CV
TL;DR: 本文提出EgoGraph,一种无需训练、动态构建知识图谱的框架,用于处理超长第一人称视频,通过新颖的以自我为中心的模式和时序关系建模策略,实现跨实体、跨多日的长期语义理解和推理。
Details
Motivation: 现有方法依赖局部片段处理和有限的时序建模,难以对跨越多天的超长第一人称视频进行有效推理。 Method: 提出EgoGraph框架:1)设计以自我为中心的知识图谱模式,统一抽取与抽象人物、物体、地点、事件等核心实体及其属性与交互;2)引入时序关系建模策略,捕获实体间时序依赖并累积稳定长期记忆。 Result: 在EgoLifeQA和EgoR1-bench基准上,EgoGraph在长期视频问答任务中达到SOTA性能。 Conclusion: EgoGraph为超长第一人称视频理解提供了一种新范式,通过训练无关、动态知识图谱构建实现强长期时序推理能力。 Abstract: Ultra-long egocentric videos spanning multiple days present significant challenges for video understanding. Existing approaches still rely on fragmented local processing and limited temporal modeling, restricting their ability to reason over such extended sequences. To address these limitations, we introduce EgoGraph, a training-free and dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. EgoGraph employs a novel egocentric schema that unifies the extraction and abstraction of core entities, such as people, objects, locations, and events, and structurally reasons about their attributes and interactions, yielding a significantly richer and more coherent semantic representation than traditional clip-based video models. Crucially, we develop a temporal relational modeling strategy that captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning. Extensive experiments on the EgoLifeQA and EgoR1-bench benchmarks demonstrate that EgoGraph achieves state-of-the-art performance on long-term video question answering, validating its effectiveness as a new paradigm for ultra-long egocentric video understanding.[58] Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?
Hongbo Jiang,Jie Li,Yunhang Shen,Pingyang Dai,Xing Sun,Haoyu Cao,Liujuan Cao
Main category: cs.CV
TL;DR: 本文提出VGUBench框架,用于评估统一多模态大语言模型(U-MLLMs)在文本与图像模态间是否具备语义等价性;实验发现模型虽在文本推理和图像渲染上表现良好,但在需以图像作答的推理任务中性能显著下降,表明问题根源在于跨模态语义对齐失效,而非生成质量不足。
Details
Motivation: 现有U-MLLM评测通常割裂地评估理解与生成能力,忽视了语义等价性——即同一推理结果应能在不同模态下一致表达;本文旨在检验当前U-MLLM是否满足这一基本前提。 Method: 提出VGUBench评测框架,包含三个解耦任务:文本生成式理解(基线推理)、视觉生成式理解(图像作答能力)、视觉渲染控制任务(纯描述到图像的映射),以分离推理逻辑与生成保真度。 Result: U-MLLMs在文本理解与视觉渲染任务中表现强劲,但在视觉生成式理解任务中性能大幅下降;且视觉作答性能与基础渲染质量几乎无关。 Conclusion: U-MLLMs的视觉推理失败源于跨模态语义对齐缺陷,而非图像生成能力不足;该发现为未来统一生成与理解模型的设计提供了关键诊断依据。 Abstract: Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture. However, existing evaluations typically assess these capabilities separately, overlooking semantic equivalence, i.e., the ability to manifest consistent reasoning results regardless of the output modality. In this work, we investigate whether current U-MLLMs satisfy this premise. We observe that while models demonstrate robust textual reasoning, they fail to maintain semantic equivalence when required to render the same results in the image modality. To rigorously diagnose this discrepancy, we introduce VGUBench, a framework to decouple reasoning logic from generation fidelity. VGUBench comprises three diagnostic tasks: (1)Textual Generative Understanding, establishing a baseline for reasoning accuracy in textual response; (2)Visual Generative Understanding, evaluating the ability to generate visual responses that represent the correct answer; and (3)a Visual Rendering control task, which assesses the ability to directly render explicit visual descriptions into images without complex reasoning. Our evaluation reveals a significant disparity: despite strong performance in textual understanding and visual rendering, U-MLLMs exhibit a marked performance collapse when required to generate visual answers to questions. Furthermore, we find a negligible correlation between visual answering performance and basic rendering quality. These results suggest that the failure stems not from insufficient generation fidelity, but from a breakdown in cross-modal semantic alignment. We provide diagnostic insights to address this challenge in future Unified Generation and Understanding Models.[59] A Difference-in-Difference Approach to Detecting AI-Generated Images
Xinyi Qi,Kai Ye,Chengchun Shi,Ying Yang,Hongyi Zhou,Jin Zhu
Main category: cs.CV
TL;DR: 本文提出了一种基于二阶差分(差分之差)的新型检测方法,通过计算重建误差的变化量而非直接使用重建误差,以提升对高质量AI生成图像的检测准确率和泛化能力。
Details
Motivation: 现有基于重建误差的一阶检测方法在面对日益逼真的AI生成图像时性能下降,亟需更鲁棒的检测机制。 Method: 提出差分之差(difference-in-difference)方法,即计算不同条件下重建误差的二阶差分,用于降低方差并增强判别性。 Result: 实验表明该方法在多种生成模型和数据集上具有强泛化能力,显著提升了AI生成图像的检测准确率。 Conclusion: 二阶差分特征比传统一阶重建误差更具鲁棒性和判别力,为生成式AI内容鉴伪提供了新思路。 Abstract: Diffusion models are able to produce AI-generated images that are almost indistinguishable from real ones. This raises concerns about their potential misuse and poses substantial challenges for detecting them. Many existing detectors rely on reconstruction error -- the difference between the input image and its reconstructed version -- as the basis for distinguishing real from fake images. However, these detectors become less effective as modern AI-generated images become increasingly similar to real ones. To address this challenge, we propose a novel difference-in-difference method. Instead of directly using the reconstruction error (a first-order difference), we compute the difference in reconstruction error -- a second-order difference -- for variance reduction and improving detection accuracy. Extensive experiments demonstrate that our method achieves strong generalization performance, enabling reliable detection of AI-generated images in the era of generative AI.[60] UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking
Hao Wu,Xudong Wang,Jialiang Zhang,Junlong Tong,Xinghao Chen,Junyan Lin,Yunpu Ma,Xiaoyu Shen
Main category: cs.CV
TL;DR: 本文提出UTPTrack,一种统一的令牌剪枝框架,首次联合压缩搜索区域、动态模板和静态模板三个组件,通过注意力引导和令牌类型感知策略实现高效视觉目标跟踪,在保持高精度的同时显著减少计算开销。
Details
Motivation: 现有基于令牌剪枝的Transformer跟踪器通常孤立地剪枝各组件,忽略其间的依赖关系,导致剪枝效果不佳和精度下降。 Method: 提出UTPTrack框架,采用注意力引导、令牌类型感知的统一剪枝策略,联合压缩搜索区域、动态模板和静态模板,并支持多模态与语言引导任务的统一跟踪。 Result: 在10个基准上验证,UTPTrack在RGB跟踪中剪枝65.4%视觉令牌、在统一跟踪中剪枝67.5%,同时分别保持99.7%和100.5%基线性能,达到剪枝型跟踪器中精度-效率权衡的新SOTA。 Conclusion: UTPTrack为高效视觉跟踪提供了鲁棒基础,具备向多模态及语言引导跟踪扩展的能力,具有重要研究与应用价值。 Abstract: One-stream Transformer-based trackers achieve advanced performance in visual object tracking but suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, existing methods are fragmented. They typically prune the search region, dynamic template, and static template in isolation, overlooking critical inter-component dependencies, which yields suboptimal pruning and degraded accuracy. To address this, we introduce UTPTrack, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multimodal and language-guided tasks within a single model. Extensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state-of-the-art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4% of vision tokens in RGB-based tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively. This strong performance across both RGB and multimodal scenarios underlines its potential as a robust foundation for future research in efficient visual tracking. Code will be released at https://github.com/EIT-NLP/UTPTrack.[61] U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
Xiang Deng,Feng Gao,Yong Zhang,Youxin Pang,Xu Xiaoming,Zhuoliang Kang,Xiaoming Wei,Yebin Liu
Main category: cs.CV
TL;DR: U-Mind 是首个支持语言、语音、动作和视频实时联合生成的统一高智能多模态对话系统,通过分段对齐与回溯驱动学习提升跨模态同步性与推理能力。
Details
Motivation: 现有系统受限于单模态生成或存在推理能力下降、跨模态对齐差等问题,难以实现自然、感知 grounded 的实时多模态交互。 Method: 提出统一建模框架 U-Mind,包含:1)分段式跨模态对齐策略;2)回溯驱动学习(Rehearsal-Driven Learning)以保持推理能力;3)文本优先解码流程,含内部思维链规划与多模态时序同步生成;4)基于姿态与语音的实时视频渲染框架。 Result: 在多模态问答、指令跟随、动作生成等任务上达到 SOTA 性能。 Conclusion: U-Mind 为构建智能、沉浸式对话代理提供了可行路径,推动了全栈实时多模态交互的发展。 Abstract: Full-stack multimodal interaction in real-time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses two key challenges: enhancing cross-modal synchronization via a segment-wise alignment strategy, and preserving reasoning abilities through Rehearsal-Driven Learning. During inference, U-Mind adopts a text-first decoding pipeline that performs internal chain-of-thought planning followed by temporally synchronized generation across modalities. To close the loop, we implement a real-time video rendering framework conditioned on pose and speech, enabling expressive and synchronized visual feedback. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation, paving the way toward intelligent, immersive conversational agents.[62] Learning Accurate Segmentation Purely from Self-Supervision
Zuyao You,Zuxuan Wu,Yu-Gang Jiang
Main category: cs.CV
TL;DR: Selfment是一种完全自监督的图像分割框架,无需人工标注、预训练模型或后处理,通过构建patch级亲和图、NCut粗分割、迭代patch优化(IPO)及轻量分割头训练,实现高质量无监督前景分割,并在多个基准上达到SOTA性能,且具备强零样本泛化能力。
Details
Motivation: 准确分割图像中的物体而无需任何人工标注是计算机视觉的核心挑战之一。 Method: Selfment首先从自监督特征构建patch级亲和图并用NCut获得初始粗略前景-背景分离;然后引入迭代Patch优化(IPO),在特征空间中通过迭代patch聚类增强空间一致性和语义一致性;最后用优化后的掩码作为监督信号,以对比学习和区域一致性目标训练轻量分割头。 Result: 在ECSSD、HKUIS、PASCAL-S等基准上F_max显著超越先前无监督显著性检测方法(分别+4.0%、+4.6%、+5.7%);零样本迁移到伪装物体检测任务(如CHAMELEON和CAMO)表现优异,甚至媲美全监督SOTA方法。 Conclusion: Selfment证明了完全自监督方式可实现高性能、高泛化性的对象分割,为无监督视觉分割提供了新范式。 Abstract: Accurately segmenting objects without any manual annotations remains one of the core challenges in computer vision. In this work, we introduce Selfment, a fully self-supervised framework that segments foreground objects directly from raw images without human labels, pretrained segmentation models, or any post-processing. Selfment first constructs patch-level affinity graphs from self-supervised features and applies NCut to obtain an initial coarse foreground--background separation. We then introduce Iterative Patch Optimization (IPO), a feature-space refinement procedure that progressively enforces spatial coherence and semantic consistency through iterative patch clustering. The refined masks are subsequently used as supervisory signals to train a lightweight segmentation head with contrastive and region-consistency objectives, allowing the model to learn stable and transferable object representations. Despite its simplicity and complete absence of manual supervision, Selfment sets new state-of-the-art (SoTA) results across multiple benchmarks. It achieves substantial improvements on $F_{\max}$ over previous unsupervised saliency detection methods on ECSSD ($+4.0\%$), HKUIS ($+4.6\%$), and PASCAL-S ($+5.7\%$). Moreover, without any additional fine-tuning, Selfment demonstrates remarkable zero-shot generalization to camouflaged object detection tasks (e.g., $0.910$ $S_m$ on CHAMELEON and $0.792$ $F_β^ω$ on CAMO), outperforming all existing unsupervised approaches and even rivaling the SoTA fully supervised methods.[63] Diffusion Probe: Generated Image Result Prediction Using CNN Probes
Benlei Cui,Bukun Huang,Zhizeng Ye,Xuemei Dong,Tuo Chen,Hui Xue,Dingkang Yang,Longtao Huang,Jingqun Tang,Haiwen Hong
Main category: cs.CV
TL;DR: 本文提出Diffusion Probe框架,利用扩散模型早期交叉注意力分布预测图像质量,实现多生成场景下的高效质量评估与优化。
Details
Motivation: 文本到图像扩散模型缺乏高效的早期质量评估机制,导致在提示词迭代、基于代理的生成和流程图等多生成场景中计算成本高昂。 Method: 基于早期扩散交叉注意力分布与最终图像质量之间的强相关性,设计了一个轻量级预测器,将初始去噪步骤中提取的交叉注意力统计特性映射为最终图像整体质量预测。 Result: 在多个T2I模型、不同去噪阶段、分辨率和质量指标下均取得强相关性(PCC > 0.7)和高分类性能(AUC-ROC > 0.9),并显著降低计算开销、提升输出质量。 Conclusion: Diffusion Probe是一种模型无关、高效且通用的早期质量预测方法,可有效提升文本到图像生成的整体效率与实用性。 Abstract: Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation scenarios such as prompt iteration, agent-based generation, and flow-grpo. We reveal a strong correlation between early diffusion cross-attention distributions and final image quality. Based on this finding, we introduce Diffusion Probe, a framework that leverages internal cross-attention maps as predictive signals. We design a lightweight predictor that maps statistical properties of early-stage cross-attention extracted from initial denoising steps to the final image's overall quality. This enables accurate forecasting of image quality across diverse evaluation metrics long before full synthesis is complete. We validate Diffusion Probe across a wide range of settings. On multiple T2I models, across early denoising windows, resolutions, and quality metrics, it achieves strong correlation (PCC > 0.7) and high classification performance (AUC-ROC > 0.9). Its reliability translates into practical gains. By enabling early quality-aware decisions in workflows such as prompt optimization, seed selection, and accelerated RL training, the probe supports more targeted sampling and avoids computation on low-potential generations. This reduces computational overhead while improving final output quality. Diffusion Probe is model-agnostic, efficient, and broadly applicable, offering a practical solution for improving T2I generation efficiency through early quality prediction.[64] Fourier Angle Alignment for Oriented Object Detection in Remote Sensing
Changyu Gu,Linwei Chen,Lin Gu,Ying Fu
Main category: cs.CV
TL;DR: 本文提出基于傅里叶旋转等变性的傅里叶角度对齐(FAA)方法,通过FAAFusion模块在检测器neck层实现多尺度特征方向对齐与融合,以及FAA Head模块在检测头中对RoI特征进行预对齐,从而缓解遥感图像旋转目标检测中的方向不一致和任务冲突问题,在DOTA和HRSC2016数据集上达到SOTA性能。
Details
Motivation: 主流遥感旋转目标检测方法存在检测器neck层的方向不一致性与检测头的任务冲突两大瓶颈。 Method: 利用傅里叶旋转等变性,提出傅里叶角度对齐(FAA),包括两个即插即用模块:FAAFusion(在neck层对齐并融合高低层特征方向)和FAA Head(在检测头中对RoI特征预对齐至标准角度后与原始特征相加)。 Result: 在DOTA-v1.0、DOTA-v1.5和HRSC2016数据集上显著提升性能;单尺度训练/测试下,在DOTA-v1.0和DOTA-v1.5上分别达到78.72%和72.28% mAP,刷新SOTA。 Conclusion: 傅里叶角度对齐机制能有效缓解遥感旋转目标检测中的方向不一致与任务冲突问题,所提模块具有即插即用特性且显著提升检测精度。 Abstract: In remote sensing rotated object detection, mainstream methods suffer from two bottlenecks, directional incoherence at detector neck and task conflict at detecting head. Ulitising fourier rotation equivariance, we introduce Fourier Angle Alignment, which analyses angle information through frequency spectrum and aligns the main direction to a certain orientation. Then we propose two plug and play modules : FAAFusion and FAA Head. FAAFusion works at the detector neck, aligning the main direction of higher-level features to the lower-level features and then fusing them. FAA Head serves as a new detection head, which pre-aligns RoI features to a canonical angle and adds them to the original features before classification and regression. Experiments on DOTA-v1.0, DOTA-v1.5 and HRSC2016 show that our method can greatly improve previous work. Particularly, our method achieves new state-of-the-art results of 78.72% mAP on DOTA-v1.0 and 72.28% mAP on DOTA-v1.5 datasets with single scale training and testing, validating the efficacy of our approach in remote sensing object detection. The code is made publicly available at https://github.com/gcy0423/Fourier-Angle-Alignment .[65] See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent
Tianci Tang,Tielong Cai,Hongwei Wang,Gaoang Wang
Main category: cs.CV
TL;DR: Sea$^2$ 提出一种无需微调、无需下游标注的主动感知新范式,通过冻结预训练视觉模型并训练一个基于VLM的智能姿态控制代理,利用感知反馈自主导航至信息丰富视角,显著提升室内场景下的多种视觉任务性能。
Details
Motivation: 预训练感知模型在新型环境(如室内场景)中性能显著下降;传统微调方法会导致灾难性遗忘且依赖昂贵的场景特定标注。 Method: 提出Sea$^2$框架:1)保持所有感知模块冻结;2)将视觉语言模型(VLM)经两阶段训练转化为低层姿态控制器——先在基于规则的探索轨迹上微调,再通过无监督强化学习优化策略,奖励由感知模块输出及其置信度构建;3)仅需标量感知反馈,无需下游标签。 Result: 在ReplicaCAD数据集上,视觉定位、分割和3D框估计任务分别提升13.54%、15.92%和27.68%。 Conclusion: Sea$^2$解耦了感知模型与探索策略,可直接复用现成模型完成多任务,避免重训练,在少样本甚至零标注条件下实现高效自适应。 Abstract: Pre-trained perception models excel in generic image domains but degrade significantly in novel environments like indoor scenes. The conventional remedy is fine-tuning on downstream data which incurs catastrophic forgetting of prior knowledge and demands costly, scene-specific annotations. We propose a paradigm shift through Sea$^2$ (See, Act, Adapt): rather than adapting the perception modules themselves, we adapt how they are deployed through an intelligent pose-control agent. Sea$^2$ keeps all perception modules frozen, requiring no downstream labels during training, and uses only scalar perceptual feedback to navigate the agent toward informative viewpoints. Specially, we transform a vision-language model (VLM) into a low-level pose controller through a two-stage training pipeline: first fine-tuning it on rule-based exploration trajectories that systematically probe indoor scenes, and then refining the policy via unsupervised reinforcement learning that constructs rewards from the perception module's outputs and confidence. Unlike prior active perception methods that couple exploration with specific models or collect data for retraining them, Sea$^2$ directly leverages off-the-shelf perception models for various tasks without the need for retraining. We conducted experiments on three visual perception tasks, including visual grounding, segmentation and 3D box estimation, with performance improvements of 13.54%, 15.92% and 27.68% respectively on dataset ReplicaCAD.[66] Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation
Chongyang Xu,Haipeng Li,Shen Cheng,Jingyu Hu,Haoqiang Fan,Ziliang Feng,Shuaicheng Liu
Main category: cs.CV
TL;DR: 本文提出了一种基于预训练3D几何基础模型的双臂操作框架,通过融合几何感知隐状态、2D语义特征和本体感受,利用扩散模型联合预测动作序列与未来3D隐状态,仅用RGB图像即可实现高精度空间理解与预测,在仿真与真实机器人实验中均达到SOTA性能。
Details
Motivation: 现有双臂操作方法受限于2D特征空间感知弱或依赖难以可靠获取的显式点云;而新兴3D几何基础模型可从RGB图像鲁棒重建丰富3D结构,为摆脱传感器依赖提供了新路径。 Method: 构建端到端策略框架:以预训练3D几何基础模型为骨干,融合几何感知隐变量、2D语义特征和本体感受形成统一状态表征;采用扩散模型联合预测未来动作块及对应3D隐状态(可解码为稠密点图),显式建模3D场景演化与动作协同。 Result: 在RoboTwin仿真基准和真实机器人平台上验证有效;相比2D和点云基线,在操作成功率、双臂协调性及3D空间预测精度上均显著提升,达到当前最优水平。 Conclusion: 仅依赖RGB输入、结合3D几何先验与扩散建模的双臂操作策略,能有效增强空间推理与动态预测能力,是迈向通用具身智能的重要一步。 Abstract: Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap. By explicitly predicting how the 3D scene will evolve together with the action sequence, the policy gains strong spatial understanding and predictive capability using only RGB observations. We evaluate our method both in simulation on the RoboTwin benchmark and in real-world robot executions. Our approach consistently outperforms 2D-based and point-cloud-based baselines, achieving state-of-the-art performance in manipulation success, inter-arm coordination, and 3D spatial prediction accuracy. Code is available at https://github.com/Chongyang-99/GAP.git.[67] Footprint-Guided Exemplar-Free Continual Histopathology Report Generation
Pratibha Kumari,Daniel Reisenbüchler,Afshin Bozorgpour,yousef Sadegheih,Priyankar Choudhary,Dorit Merhof
Main category: cs.CV
TL;DR: 本文提出了一种无需存储样本的持续学习框架,用于从全切片图像(WSI)生成病理报告,通过构建紧凑的形态学‘领域足迹’实现生成式回放与风格自适应,避免灾难性遗忘。
Details
Motivation: 临床环境中新器官、新机构和报告规范不断出现,而传统静态训练或顺序微调易导致灾难性遗忘,亟需能持续适应的WSI报告生成方法。 Method: 构建冻结patch嵌入空间中的紧凑领域足迹:包括形态学token码本、切片级共现统计和轻量patch计数先验;结合教师快照生成伪报告,并蒸馏领域特异性语言风格为紧凑描述符以引导生成;推理时直接从WSI信号识别最匹配的风格描述符。 Result: 在多个公开持续学习基准上,该方法显著优于无样本及有限缓冲重演基线。 Conclusion: 基于足迹的生成式回放是一种适用于动态临床部署场景的实用持续学习方案。 Abstract: Rapid progress in vision-language modeling has enabled pathology report generation from gigapixel whole-slide images, but most approaches assume static training with simultaneous access to all data. In clinical deployment, however, new organs, institutions, and reporting conventions emerge over time, and sequential fine-tuning can cause catastrophic forgetting. We introduce an exemplar-free continual learning framework for WSI-to-report generation that avoids storing raw slides or patch exemplars. The core idea is a compact domain footprint built in a frozen patch-embedding space: a small codebook of representative morphology tokens together with slide-level co-occurrence summaries and lightweight patch-count priors. These footprints support generative replay by synthesizing pseudo-WSI representations that reflect domain-specific morphological mixtures, while a teacher snapshot provides pseudo-reports to supervise the updated model without retaining past data. To address shifting reporting conventions, we distill domain-specific linguistic characteristics into a compact style descriptor and use it to steer generation. At inference, the model identifies the most compatible descriptor directly from the slide signal, enabling domain-agnostic setup without requiring explicit domain identifiers. Evaluated across multiple public continual learning benchmarks, our approach outperforms exemplar-free and limited-buffer rehearsal baselines, highlighting footprint-based generative replay as a practical solution for deployment in evolving clinical settings.[68] NAU-QMUL: Utilizing BERT and CLIP for Multi-modal AI-Generated Image Detection
Xiaoyu Guo,Arkaitz Zubiaga
Main category: cs.CV
TL;DR: 本文提出了一种多模态多任务模型,用于检测AI生成图像并识别其生成模型,结合BERT和CLIP提取文本与图像特征,采用跨模态融合与定制多任务损失,并引入伪标签数据增强;在CT2竞赛中两项任务均获第五名,F1分数分别为83.16%和48.88%。
Details
Motivation: 检测AI生成图像并识别具体生成模型,以应对日益增长的AI伪造内容风险。 Method: 构建多模态多任务模型,使用预训练BERT和CLIP Vision编码器分别提取文本和图像特征,进行跨模态特征融合,并设计多任务损失函数;引入伪标签策略进行数据增强。 Result: 在CT2竞赛Task A和Task B中均获得第五名,F1分数分别为83.16%和48.88%。 Conclusion: 所提模型架构有效,具备在真实场景中推进AI生成内容检测的潜力。 Abstract: With the aim of detecting AI-generated images and identifying the specific models responsible for their generation, we propose a multi-modal multi-task model. The model leverages pre-trained BERT and CLIP Vision encoders for text and image feature extraction, respectively, and employs cross-modal feature fusion with a tailored multi-task loss function. Additionally, a pseudo-labeling-based data augmentation strategy was utilized to expand the training dataset with high-confidence samples. The model achieved fifth place in both Tasks A and B of the `CT2: AI-Generated Image Detection' competition, with F1 scores of 83.16\% and 48.88\%, respectively. These findings highlight the effectiveness of the proposed architecture and its potential for advancing AI-generated content detection in real-world scenarios. The source code for our method is published on https://github.com/xxxxxxxxy/AIGeneratedImageDetection.[69] Denoising-Enhanced YOLO for Robust SAR Ship Detection
Xiaojing Zhao,Shiyang Li,Zena Chu,Ying Zhang,Peinan Hao,Tianzi Yan,Jiajia Chen,Huicong Ning
Main category: cs.CV
TL;DR: 本文提出CPN-YOLO,一种基于YOLOv8的高精度SAR图像舰船检测框架,通过可学习大核去噪模块、PPA注意力增强多尺度特征提取、以及基于归一化Wasserstein距离的高斯相似性损失,显著提升复杂场景下的检测精度与小目标敏感性。
Details
Motivation: SAR图像舰船检测在复杂场景中面临杂波和斑点噪声导致虚警、小目标易漏检等挑战,需提升鲁棒性与精度。 Method: 基于YOLOv8构建CPN-YOLO:1)引入可学习大核去噪模块用于输入预处理;2)设计基于PPA注意力机制的特征增强策略以强化多尺度建模;3)采用基于归一化Wasserstein距离的高斯相似性损失函数优化边界框匹配。 Result: 在SSDD数据集上,CPN-YOLO达到97.0%精度、95.1%召回率、98.9% mAP,全面超越YOLOv8及其他主流深度学习检测器;在HRSID上也验证了有效性。 Conclusion: CPN-YOLO通过三方面协同改进,显著提升了SAR图像中舰船检测的精度、鲁棒性与小目标识别能力,为复杂海况下的遥感目标检测提供了有效解决方案。 Abstract: With the rapid advancement of deep learning, synthetic aperture radar (SAR) imagery has become a key modality for ship detection. However, robust performance remains challenging in complex scenes, where clutter and speckle noise can induce false alarms and small targets are easily missed. To address these issues, we propose CPN-YOLO, a high-precision ship detection framework built upon YOLOv8 with three targeted improvements. First, we introduce a learnable large-kernel denoising module for input pre-processing, producing cleaner representations and more discriminative features across diverse ship types. Second, we design a feature extraction enhancement strategy based on the PPA attention mechanism to strengthen multi-scale modeling and improve sensitivity to small ships. Third, we incorporate a Gaussian similarity loss derived from the normalized Wasserstein distance (NWD) to better measure similarity under complex bounding-box distributions and improve generalization. Extensive experiments on HRSID and SSDD demonstrate the effectiveness of our method. On SSDD, CPN-YOLO surpasses the YOLOv8 baseline, achieving 97.0% precision, 95.1% recall, and 98.9% mAP, and consistently outperforms other representative deep-learning detectors in overall performance.[70] Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
Qihua Dong,Kuo Yang,Lin Ju,Handong Zhao,Yitian Zhang,Yizhou Wang,Huimin Zeng,Jianglin Lu,Yun Fu
Main category: cs.CV
TL;DR: 本文提出Ref-Adv——一个更难、更注重视觉推理与语言 grounding 的新型指代表达理解(REC)基准,旨在抑制现有数据集中的捷径解法,并揭示多模态大模型在真实视觉推理能力上的不足。
Details
Motivation: 现有REC基准(如RefCOCO系列)存在表达过短、干扰项少、描述冗余等问题,导致模型依赖捷径而非真正理解语言与视觉的对应关系,无法有效评估视觉推理与grounding能力。 Method: 构建Ref-Adv数据集:基于真实图像,设计语言上复杂(含否定等推理要素)、视觉上高干扰(硬干扰项)、信息精炼(无冗余描述)的指代表达;通过词序扰动和描述符删除等消融实验验证其推理必要性;并在多种先进多模态大模型上进行评测。 Result: 当前主流多模态大模型在RefCOCO系列上表现优异,但在Ref-Adv上性能显著下降,证实其严重依赖数据集捷径,视觉推理与精细grounding能力存在明显缺陷。 Conclusion: Ref-Adv能更严格地评测模型的视觉语言推理与grounding能力,为未来多模态大模型在深层语义理解与视觉推理方向的研究提供新基准与改进方向。 Abstract: Referring Expression Comprehension (REC) links language to region level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word order perturbations and descriptor deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.[71] APPO: Attention-guided Perception Policy Optimization for Video Reasoning
Henghui Du,Chang Zhou,Xi Chen,Di Hu
Main category: cs.CV
TL;DR: 本文提出APPO算法,通过注意力引导的感知策略优化,在不依赖昂贵细粒度标注的情况下提升视频模型的细粒度感知能力,实验证明其在多个视频基准上优于现有方法。
Details
Motivation: 发现复杂视频推理性能提升更依赖于细粒度感知能力而非高级推理能力,而现有方法缺乏低成本增强感知的有效途径。 Method: 提出Attention-guided Perception Policy Optimization(APPO)算法,利用token级稠密奖励优化聚焦于关键视频帧的‘组内感知token’。 Result: 在多个视频基准和不同规模模型(3B/7B)上,APPO持续优于GRPO和DAPO,提升幅度达0.5%~4%。 Conclusion: 增强感知能力比增强推理能力对视频理解性能提升更为关键;APPO提供了一种低成本、无需细粒度标注的感知增强新范式。 Abstract: Complex video reasoning, actually, relies excessively on fine-grained perception rather than on expert (e.g., Ph.D, Science)-level reasoning. Through extensive empirical observation, we have recognized the critical impact of perception. In particular, when perception ability is almost fixed, enhancing reasoning from Qwen3-8B to OpenAI-o3 yields only 0.7% performance improvement. Conversely, even minimal change in perception model scale (from 7B to 32B) boosts performance by 1.4%, indicating enhancing perception, rather than reasoning, is more critical to improve performance. Therefore, exploring how to enhance perception ability through reasoning without the need for expensive fine-grained annotation information is worthwhile. To achieve this goal, we specially propose APPO, the Attention-guided Perception Policy Optimization algorithm that leverages token-level dense rewards to improve model's fine-grained perception. The core idea behind APPO is to optimize those tokens from different responses that primarily focus on the same crucial video frame (called intra-group perception tokens). Experimental results on diverse video benchmarks and models with different scales (3/7B) demonstrate APPO consistently outperforms GRPO and DAPO (0.5%~4%). We hope our work provides a promising approach to effectively enhance model's perception abilities through reasoning in a low-cost manner, serving diverse scenarios and demands.[72] Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification
Vikash Singh,Debargha Ganguly,Haotian Yu,Chengwei Zhou,Prerna Singh,Brandon Lee,Vipin Chaudhary,Gourav Datta
Main category: cs.CV
TL;DR: 本文提出了一种神经符号验证框架,利用SMT求解器(Z3)和临床知识库,自动将VLM生成的自由文本放射学发现形式化为命题逻辑证据,并严格验证诊断结论是否被感知发现所逻辑蕴含,从而检测并消除幻觉、遗漏等推理错误。
Details
Motivation: 现有视觉-语言模型(VLMs)在生成放射科报告时存在逻辑不一致问题(如诊断印象无依据或遗漏必然结论),而传统基于词汇匹配的评估指标无法捕捉此类演绎错误,尤其在无参考文本(reference-free)场景下。 Method: 构建神经符号验证框架:首先将VLM输出的自由文本发现自动形式化为结构化命题逻辑表达;然后结合临床知识库,使用Z3 SMT求解器对每个诊断主张进行可满足性检验,判定其为逻辑蕴含、幻觉或遗漏。 Result: 在五个胸部X光基准数据集上评估7个VLM,该验证器揭示了传统指标无法发现的特有推理失败模式(如保守观察、随机幻觉);在有标签数据上,施加求解器支持的蕴含约束作为后处理保障,显著提升了诊断合理性与精确率。 Conclusion: 该神经符号验证方法为临床推理提供了可证明的内部一致性保障,是提升生成式临床辅助系统可信度与安全性的关键路径。 Abstract: Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic impressions unsupported by their own perceptual findings or missing logically entailed conclusions. Standard lexical metrics heavily penalize clinical paraphrasing and fail to capture these deductive failures in reference-free settings. Toward guarantees for clinical reasoning, we introduce a neurosymbolic verification framework that deterministically audits the internal consistency of VLM-generated reports. Our pipeline autoformalizes free-text radiographic findings into structured propositional evidence, utilizing an SMT solver (Z3) and a clinical knowledge base to verify whether each diagnostic claim is mathematically entailed, hallucinated, or omitted. Evaluating seven VLMs across five chest X-ray benchmarks, our verifier exposes distinct reasoning failure modes, such as conservative observation and stochastic hallucination, that remain invisible to traditional metrics. On labeled datasets, enforcing solver-backed entailment acts as a rigorous post-hoc guarantee, systematically eliminating unsupported hallucinations to significantly increase diagnostic soundness and precision in generative clinical assistants.[73] AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation
Zhengren Wang,Dongsheng Ma,Huaping Zhong,Jiayu Li,Wentao Zhang,Bin Wang,Conghui He
Main category: cs.CV
TL;DR: 本文提出AgenticOCR,一种查询驱动的动态OCR解析范式,用于解决多模态RAG中处理复杂视觉文档(如财报)时页面级分块导致的冗余上下文和视觉令牌压缩引发幻觉的问题。它通过图像感知布局分析实现按需区域识别,提升视觉文档RAG的效率与准确性。
Details
Motivation: 现有RAG在处理视觉文档时依赖页面级分块,导致生成器注意力过载、关键证据稀释,且视觉令牌压缩易引发幻觉。 Method: 提出AgenticOCR,将OCR从静态全量识别转变为查询驱动、按需区域识别;通过‘以图思考’方式自主分析文档布局,动态提取感兴趣区域,实现视觉令牌的精准按需解压。 Result: 实验表明AgenticOCR显著提升视觉RAG系统的效率与准确性,在长文档理解任务中达到专家级性能。 Conclusion: AgenticOCR可作为视觉文档RAG栈的第三大基础模块,与嵌入和重排序模块协同工作,突破页面级粒度限制,增强多模态RAG能力。 Abstract: The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge for processing complex visual documents, such as financial reports. While page-level chunking and retrieval is a natural starting point, it creates a critical bottleneck: delivering entire pages to the generator introduces excessive extraneous context. This not only overloads the generator's attention mechanism but also dilutes the most salient evidence. Moreover, compressing these information-rich pages into a limited visual token budget further increases the risk of hallucinations. To address this, we introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full-text process into a query-driven, on-demand extraction system. By autonomously analyzing document layout in a "thinking with images" manner, AgenticOCR identifies and selectively recognizes regions of interest. This approach performs on-demand decompression of visual tokens precisely where needed, effectively decoupling retrieval granularity from rigid page-level chunking. AgenticOCR has the potential to serve as the "third building block" of the visual document RAG stack, operating alongside and enhancing standard Embedding and Reranking modules. Experimental results demonstrate that AgenticOCR improves both the efficiency and accuracy of visual RAG systems, achieving expert-level performance in long document understanding. Code and models are available at https://github.com/OpenDataLab/AgenticOCR.[74] Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition
Mohammadreza Heidarianbaei,Mareike Dorozynski,Hubert Kanyamahanga,Max Mehltretter,Franz Rottensteiner
Main category: cs.CV
TL;DR: 本文提出ReSeg-CLIP,一种无需训练的开放词汇语义分割方法,专用于遥感数据,通过引入SAM生成的多尺度掩码约束自注意力交互,并采用加权平均多个遥感专用CLIP变体参数的模型组合策略,在三个遥感基准上达到SOTA性能。
Details
Motivation: 解决CLIP等视觉语言模型在遥感语义分割任务中因自注意力层内不恰当交互导致的性能问题。 Method: 提出分层方案,利用SAM生成的掩码在多尺度上约束自注意力交互;设计模型组合方法,通过新权重方案评估不同文本提示下的表征质量,对多个遥感专用CLIP变体参数进行加权平均。 Result: 在三个遥感基准数据集上实现无需额外训练的最先进(SOTA)性能。 Conclusion: ReSeg-CLIP是一种高效、无需训练的开放词汇语义分割方法,在遥感领域展现出显著优势,验证了结合掩码引导与模型集成策略的有效性。 Abstract: In this paper, we propose ReSeg-CLIP, a new training-free Open-Vocabulary Semantic Segmentation method for remote sensing data. To compensate for the problems of vision language models, such as CLIP in semantic segmentation caused by inappropriate interactions within the self-attention layers, we introduce a hierarchical scheme utilizing masks generated by SAM to constrain the interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, taking advantage of a new weighting scheme that evaluates representational quality using varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.[75] Bandwidth-adaptive Cloud-Assisted 360-Degree 3D Perception for Autonomous Vehicles
Faisal Hawladera,Rui Meireles,Gamal Elghazaly,Ana Aguiar,Raphaël Frank
Main category: cs.CV
TL;DR: 本文提出了一种基于V2X通信的车-云协同计算框架,利用Transformer模型融合多相机数据生成BEV表示,通过动态划分计算负载、量化与特征压缩降低延迟,并设计自适应优化算法在实时性约束下提升3D目标检测精度。
Details
Motivation: 自动驾驶需在严格延迟约束下实现实时环境感知,但车载算力有限,尤其在复杂城市环境中易产生延迟问题。 Method: 采用Transformer模型进行多相机数据融合生成BEV;动态划分车载与云端计算(按层数与特征量化等级);引入特征向量裁剪与压缩以降低传输负载;设计动态优化算法根据网络状况实时调整分割点和量化等级。 Result: 相比纯车载方案,端到端延迟降低72%;在带宽波动下,自适应策略较静态配置在相同延迟下检测精度提升最高达20%。 Conclusion: 车-云协同与动态适配机制可有效平衡延迟与精度,为资源受限场景下的实时3D感知提供了可行路径。 Abstract: A key challenge for autonomous driving lies in maintaining real-time situational awareness regarding surrounding obstacles under strict latency constraints. The high processing requirements coupled with limited onboard computational resources can cause delay issues, particularly in complex urban settings. To address this, we propose leveraging Vehicle-to-Everything (V2X) communication to partially offload processing to the cloud, where compute resources are abundant, thus reducing overall latency. Our approach utilizes transformer-based models to fuse multi-camera sensor data into a comprehensive Bird's-Eye View (BEV) representation, enabling accurate 360-degree 3D object detection. The computation is dynamically split between the vehicle and the cloud based on the number of layers processed locally and the quantization level of the features. To further reduce network load, we apply feature vector clipping and compression prior to transmission. In a real-world experimental evaluation, our hybrid strategy achieved a 72 \% reduction in end-to-end latency compared to a traditional onboard solution. To adapt to fluctuating network conditions, we introduce a dynamic optimization algorithm that selects the split point and quantization level to maximize detection accuracy while satisfying real-time latency constraints. Trace-based evaluation under realistic bandwidth variability shows that this adaptive approach improves accuracy by up to 20 \% over static parameterization with the same latency performance.[76] Altitude-Aware Visual Place Recognition in Top-Down View
Xingyu Shao,Mengfan He,Chunyu Li,Liangzheng Sun,Ziyang Meng
Main category: cs.CV
TL;DR: 本文提出了一种无需额外硬件的、基于视觉的海拔自适应空中视觉地点识别(VPR)方法,通过地面特征密度分析估计相对海拔,并结合裁剪与分类策略提升VPR精度,在显著海拔变化下大幅提高R@1和R@5指标。
Details
Motivation: 解决空中视觉地点识别(VPR)在显著海拔变化下的性能下降问题,避免依赖易受干扰或增加成本的硬件(如气压计、ToF传感器)。 Method: 提出一种海拔自适应VPR方法:首先通过图像中地面特征密度估计相对海拔;然后据此进行相对海拔驱动的图像裁剪以生成标准查询图像;最后采用分类式VPR策略完成定位。 Result: 在多种地形与海拔条件下实验表明,该方法显著提升海拔估计与VPR精度:相比纯VPR检索,R@1和R@5分别提升29.85%和60.20%;相比单目深度估计(MMDE)方法,平均误差降低202.1米,R@1和R@5分别提升31.4%和44%。 Conclusion: 本方法构建了一个鲁棒、纯视觉、三维VPR框架,适用于传感器受限的小/中型空中平台,在城乡等多样化环境中具备实用性和可扩展性。 Abstract: To address the challenge of aerial visual place recognition (VPR) problem under significant altitude variations, this study proposes an altitude-adaptive VPR approach that integrates ground feature density analysis with image classification techniques. The proposed method estimates airborne platforms' relative altitude by analyzing the density of ground features in images, then applies relative altitude-based cropping to generate canonical query images, which are subsequently used in a classification-based VPR strategy for localization. Extensive experiments across diverse terrains and altitude conditions demonstrate that the proposed approach achieves high accuracy and robustness in both altitude estimation and VPR under significant altitude changes. Compared to conventional methods relying on barometric altimeters or Time-of-Flight (ToF) sensors, this solution requires no additional hardware and offers a plug-and-play solution for downstream applications, {making it suitable for small- and medium-sized airborne platforms operating in diverse environments, including rural and urban areas.} Under significant altitude variations, incorporating our relative altitude estimation module into the VPR retrieval pipeline boosts average R@1 and R@5 by 29.85\% and 60.20\%, respectively, compared with applying VPR retrieval alone. Furthermore, compared to traditional {Monocular Metric Depth Estimation (MMDE) methods}, the proposed method reduces the mean error by 202.1 m, yielding average additional improvements of 31.4\% in R@1 and 44\% in R@5. These results demonstrate that our method establishes a robust, vision-only framework for three-dimensional visual place recognition, offering a practical and scalable solution for accurate airborne platforms localization under large altitude variations and limited sensor availability.[77] DACESR: Degradation-Aware Conditional Embedding for Real-World Image Super-Resolution
Xiaoyan Lei,Wenlong Zhang,Biao Luo,Hui Liang,Weifeng Cao,Qiuting Lin
Main category: cs.CV
TL;DR: 本文提出了一种结合Real Embedding Extractor(REE)和Conditional Feature Modulator(CFM)的新方法,利用Mamba架构提升退化图像的超分辨率重建效果,在保真度与感知质量间取得更好平衡。
Details
Motivation: 现有基于多模态大模型的图像超分辨率方法在处理退化图像时性能受限,尤其是Recognize Anything Model(RAM)在退化空间中直接对比学习效果不佳。 Method: 提出退化选择策略构建Real Embedding Extractor(REE),通过对比学习提升退化图像识别能力;设计Conditional Feature Modulator(CFM)将REE的高层语义信息融入基于Mamba的超分网络中,增强纹理恢复能力。 Result: REE显著提升了退化图像内容识别性能;所提方法在真实场景超分辨率任务中有效平衡了重建保真度与感知质量,验证了Mamba在该任务中的潜力。 Conclusion: 引入语义感知的嵌入提取与条件调制机制可显著增强退化图像超分辨率性能,Mamba架构具备应用于真实世界图像复原任务的良好前景。 Abstract: Multimodal large models have shown excellent ability in addressing image super-resolution in real-world scenarios by leveraging language class as condition information, yet their abilities in degraded images remain limited. In this paper, we first revisit the capabilities of the Recognize Anything Model (RAM) for degraded images by calculating text similarity. We find that directly using contrastive learning to fine-tune RAM in the degraded space is difficult to achieve acceptable results. To address this issue, we employ a degradation selection strategy to propose a Real Embedding Extractor (REE), which achieves significant recognition performance gain on degraded image content through contrastive learning. Furthermore, we use a Conditional Feature Modulator (CFM) to incorporate the high-level information of REE for a powerful Mamba-based network, which can leverage effective pixel information to restore image textures and produce visually pleasing results. Extensive experiments demonstrate that the REE can effectively help image super-resolution networks balance fidelity and perceptual quality, highlighting the great potential of Mamba in real-world applications. The source code of this work will be made publicly available at: https://github.com/nathan66666/DACESR.git[78] AoE: Always-on Egocentric Human Video Collection for Embodied AI
Bowen Yang,Zishuo Li,Yang Sun,Changtao Miao,Yifan Yang,Man Luo,Xiaotong Yan,Feng Jiang,Jinchuan Shi,Yankai Fu,Ning Chen,Junkai Zhao,Pengwei Wang,Guocai Yao,Shanghang Zhang,Hao Chen,Zhe Li,Kai Zhu
Main category: cs.CV
TL;DR: 本文提出Always-on Egocentric (AoE)系统,利用全球分布的人类及其智能手机采集低成本、可持续的自上而下(egocentric)真实世界交互数据,以缓解具身基础模型的数据稀缺问题。
Details
Motivation: 现有具身模型数据收集方法存在高成本、硬件依赖复杂、交互范围有限等问题,难以规模化;而人类作为天然具身智能体,结合智能手机可提供低成本、可持续、广覆盖的数据来源。 Method: 设计了基于颈挂式手机支架与云边协同架构的AoE系统:1)采用人体佩戴智能手机实现低门槛、大规模第一视角数据采集;2)开发跨平台APP,端侧实时处理+云端自动标注与过滤;3)支持任何人、任何时间、任何地点的分布式采集。 Result: 实验验证AoE在数据预处理质量及下游任务上的有效性,表明高质量第一视角数据显著提升模型在真实世界的泛化能力。 Conclusion: AoE系统为具身基础模型提供了可扩展、低成本、可持续的真实世界交互数据采集新范式,有效缓解数据瓶颈。 Abstract: Embodied foundation models require large-scale, high-quality real-world interaction data for pre-training and scaling. However, existing data collection methods suffer from high infrastructure costs, complex hardware dependencies, and limited interaction scope, making scalable expansion challenging. In fact, humans themselves are ideal physically embodied agents. Therefore, obtaining egocentric real-world interaction data from globally distributed "human agents" offers advantages of low cost and sustainability. To this end, we propose the Always-on Egocentric (AoE) data collection system, which aims to simplify hardware dependencies by leveraging humans themselves and their smartphones, enabling low-cost, highly efficient, and scene-agnostic real-world interaction data collection to address the challenge of data scarcity. Specifically, we first employ an ergonomic neck-mounted smartphone holder to enable low-barrier, large-scale egocentric data collection through a cloud-edge collaborative architecture. Second, we develop a cross-platform mobile APP that leverages on-device compute for real-time processing, while the cloud hosts automated labeling and filtering pipelines that transform raw videos into high-quality training data. Finally, the AoE system supports distributed Ego video data collection by anyone, anytime, and anywhere. We evaluate AoE on data preprocessing quality and downstream tasks, demonstrating that high-quality egocentric data significantly boosts real-world generalization.[79] SelfOccFlow: Towards end-to-end self-supervised 3D Occupancy Flow prediction
Xavier Timoneda,Markus Herb,Fabian Duerr,Daniel Goehring
Main category: cs.CV
TL;DR: 本文提出了一种无需人工标注或外部光流监督的自监督3D占据与运动估计方法,通过分离静态/动态隐式场并利用特征余弦相似度构建自监督光流线索,在多个数据集上验证了有效性。
Details
Motivation: 现有3D占据与运动估计方法依赖昂贵的3D标注、边界框速度标签或预训练光流模型,限制了其可扩展性与实用性。 Method: 将场景解耦为静态和动态符号距离场,通过时序聚合隐式学习运动,并引入基于特征余弦相似度的强自监督光流线索。 Result: 在SemanticKITTI、KITTI-MOT和nuScenes数据集上验证了所提方法的有效性,实现了高质量的3D占据流估计。 Conclusion: 该自监督框架显著降低了对标注数据的依赖,提升了3D动态场景理解的实用性与泛化能力。 Abstract: Estimating 3D occupancy and motion at the vehicle's surroundings is essential for autonomous driving, enabling situational awareness in dynamic environments. Existing approaches jointly learn geometry and motion but rely on expensive 3D occupancy and flow annotations, velocity labels from bounding boxes, or pretrained optical flow models. We propose a self-supervised method for 3D occupancy flow estimation that eliminates the need for human-produced annotations or external flow supervision. Our method disentangles the scene into separate static and dynamic signed distance fields and learns motion implicitly through temporal aggregation. Additionally, we introduce a strong self-supervised flow cue derived from features' cosine similarities. We demonstrate the efficacy of our 3D occupancy flow method on SemanticKITTI, KITTI-MOT, and nuScenes.[80] Experience-Guided Self-Adaptive Cascaded Agents for Breast Cancer Screening and Diagnosis with Reduced Biopsy Referrals
Pramit Saha,Mohammad Alsharid,Joshua Strong,J. Alison Noble
Main category: cs.CV
TL;DR: 本文提出了一种名为BUSD-Agent的经验引导级联多智能体框架,用于乳腺超声筛查与诊断,通过两阶段决策减少不必要的活检 referrals,并利用历史决策轨迹进行检索增强的上下文自适应,显著降低诊断升级率和活检率,同时提升特异性。
Details
Motivation: 旨在解决乳腺超声诊断中诊断升级过度和不必要的活检 referrals 问题,提升筛查效率与临床合理性。 Method: 构建两级级联多智能体框架:轻量级‘筛查诊所’智能体初步过滤低风险病例;高风险病例交由具备丰富感知与放射学描述能力的‘诊断诊所’智能体进行二次决策;引入结构化记忆库存储病理确认的决策轨迹,并基于图像、模型响应与置信度相似性实现检索增强的上下文自适应,动态调整模型信任度与升级阈值。 Result: 在10个乳腺超声数据集上,相比无轨迹条件的相同架构,诊断升级率从84.95%降至58.72%,活检率从59.50%降至37.08%,筛查特异性提升68.48%,诊断特异性提升6.33%。 Conclusion: 经验引导的级联多智能体框架能有效优化乳腺超声临床决策流程,在不更新模型参数的前提下,通过检索增强的上下文自适应显著提升诊断效率与准确性。 Abstract: We propose an experience-guided cascaded multi-agent framework for Breast Ultrasound Screening and Diagnosis, called BUSD-Agent, that aims to reduce diagnostic escalation and unnecessary biopsy referrals. Our framework models screening and diagnosis as a two-stage, selective decision-making process. A lightweight `screening clinic' agent, restricted to classification models as tools, selectively filters out benign and normal cases from further diagnostic escalation when malignancy risk and uncertainty are estimated as low. Cases that have higher risks are escalated to the `diagnostic clinic' agent, which integrates richer perception and radiological description tools to make a secondary decision on biopsy referral. To improve agent performance, past records of pathology-confirmed outcomes along with image embeddings, model predictions, and historical agent actions are stored in a memory bank as structured decision trajectories. For each new case, BUSD-Agent retrieves similar past cases based on image, model response and confidence similarity to condition the agent's current decision policy. This enables retrieval-conditioned in-context adaptation that dynamically adjusts model trust and escalation thresholds from prior experiences without parameter updates. Evaluation across 10 breast ultrasound datasets shows that the proposed experience-guided workflow reduces diagnostic escalation in BUSD-Agent from 84.95% to 58.72% and overall biopsy referrals from 59.50% to 37.08%, compared to the same architecture without trajectory conditioning, while improving average screening specificity by 68.48% and diagnostic specificity by 6.33%.[81] SegMate: Asymmetric Attention-Based Lightweight Architecture for Efficient Multi-Organ Segmentation
Andrei-Alexandru Bunea,Dan-Matei Popovici,Radu Tudor Ionescu
Main category: cs.CV
TL;DR: SegMate是一种高效的2.5D医学图像分割框架,在保持SOTA精度的同时显著降低计算与显存开销。
Details
Motivation: 现有医学图像分割模型精度高但计算资源需求大,难以部署于临床资源受限环境。 Method: 提出SegMate框架,融合非对称架构、注意力机制、多尺度特征融合、基于切片的位置编码及多任务优化;在三个主流骨干网络(EfficientNetV2-M、MambaOut-Tiny、FastViT-T12)上验证;实验涵盖TotalSegmentator、SegTHOR和AMOS22三个数据集。 Result: 相比基线模型,计算量(GFLOPs)最多降低2.5倍,显存占用(VRAM)最多降低2.1倍,Dice分数平均提升约1%;在TotalSegmentator上达93.51% Dice且仅需295MB峰值GPU显存;跨数据集零样本泛化在SegTHOR和AMOS22上分别达86.85%和89.35% Dice。 Conclusion: SegMate在精度与效率间取得更优平衡,具备强泛化能力与临床落地潜力,并已开源代码。 Abstract: State-of-the-art models for medical image segmentation achieve excellent accuracy but require substantial computational resources, limiting deployment in resource-constrained clinical settings. We present SegMate, an efficient 2.5D framework that achieves state-of-the-art accuracy, while considerably reducing computational requirements. Our efficient design is the result of meticulously integrating asymmetric architectures, attention mechanisms, multi-scale feature fusion, slice-based positional conditioning, and multi-task optimization. We demonstrate the efficiency-accuracy trade-off of our framework across three modern backbones (EfficientNetV2-M, MambaOut-Tiny, FastViT-T12). We perform experiments on three datasets: TotalSegmentator, SegTHOR and AMOS22. Compared with the vanilla models, SegMate reduces computation (GFLOPs) by up to 2.5x and memory footprint (VRAM) by up to 2.1x, while generally registering performance gains of around 1%. On TotalSegmentator, we achieve a Dice score of 93.51% with only 295MB peak GPU memory. Zero-shot cross-dataset evaluations on SegTHOR and AMOS22 demonstrate strong generalization, with Dice scores of up to 86.85% and 89.35%, respectively. We release our open-source code at https://github.com/andreibunea99/SegMate.[82] Half-Truths Break Similarity-Based Retrieval
Bora Kargi,Arnas Uselis,Seong Joon Oh
Main category: cs.CV
TL;DR: 本文发现CLIP等双编码器模型在文本描述中添加错误但看似合理的细节时,图像-文本相似度反而上升(即‘半真半假’问题),并提出CS-CLIP方法通过组件级监督提升对实体与关系的细粒度对齐,显著改善半真半假识别与组合泛化能力。
Details
Motivation: CLIP类模型缺乏对文本各组成部分(如实体、关系)的显式监督,导致其在描述中加入错误但合理的细节时仍给出高相似度,违背基本直觉。 Method: 提出CS-CLIP:将caption分解为实体和关系单元,为每个单元构造最小编辑的干扰项(foil),并在保持双编码器结构的前提下进行细粒度对比学习。 Result: 在COCO上,半真半假识别准确率从40.6%提升至69.3%,组合性基准平均性能提升5.7分。 Conclusion: 组件级监督能有效缓解双编码器模型的半真半假问题,并促进更鲁棒的组合语义理解。 Abstract: When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP[83] The Geometry of Transfer: Unlocking Medical Vision Manifolds for Training-Free Model Ranking
Jiaqi Tang,Shaoyang Zhang,Xiaoqi Wang,Jiaying Zhou,Yang Liu,Qingchao Chen
Main category: cs.CV
TL;DR: 本文提出了一种基于拓扑结构的迁移性评估框架,用于高效选择适用于医学图像分割任务的基础模型,无需微调即可实现比现有方法高31%的相对性能提升。
Details
Motivation: 现有迁移性评估指标主要面向分类任务,依赖全局统计假设,难以刻画密集预测(如分割)所需的拓扑复杂性;而医学基础模型数量激增,亟需高效、免训练的模型选择方法。 Method: 提出拓扑驱动的迁移性评估框架,包含三部分:(1) 全局表征拓扑差异(GRTD),用最小生成树量化特征-标签结构同构性;(2) 局部边界感知拓扑一致性(LBTC),聚焦关键解剖边界的流形可分性;(3) 任务自适应融合,依据目标任务语义基数动态加权整合全局与局部指标。 Result: 在OpenMind大规模基准上验证,相较SOTA基线,加权Kendall相关系数相对提升约31%,显著提升了免训练模型选择的准确性与鲁棒性。 Conclusion: 拓扑视角比传统统计视角更能刻画医学分割任务中模型迁移性的本质,所提框架为高效、可解释的基础模型选型提供了新范式。 Abstract: The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal medical foundation models for specific segmentation tasks remains a computational bottleneck. Existing Transferability Estimation (TE) metrics, primarily designed for classification, rely on global statistical assumptions and fail to capture the topological complexity essential for dense prediction. We propose a novel Topology-Driven Transferability Estimation framework that evaluates manifold tractability rather than statistical overlap. Our approach introduces three components: (1) Global Representation Topology Divergence (GRTD), utilizing Minimum Spanning Trees to quantify feature-label structural isomorphism; (2) Local Boundary-Aware Topological Consistency (LBTC), which assesses manifold separability specifically at critical anatomical boundaries; and (3) Task-Adaptive Fusion, which dynamically integrates global and local metrics based on the semantic cardinality of the target task. Validated on the large-scale OpenMind benchmark across diverse anatomical targets and SSL foundation models, our approach significantly outperforms state-of-the-art baselines by around \textbf{31\%} relative improvement in the weighted Kendall, providing a robust, training-free proxy for efficient model selection without the cost of fine-tuning. The code will be made publicly available upon acceptance.[84] Leveraging Geometric Prior Uncertainty and Complementary Constraints for High-Fidelity Neural Indoor Surface Reconstruction
Qiyu Feng,Jiwei Shan,Shing Shin Cheng,Hesheng Wang
Main category: cs.CV
TL;DR: 本文提出GPU-SDF框架,通过显式估计几何先验不确定性并设计不确定性引导损失与互补约束(边缘距离场和多视角一致性正则化),提升室内场景神经隐式表面重建中细粒度结构的恢复能力。
Details
Motivation: 现有基于符号距离函数的神经隐式表面重建方法在恢复薄结构和复杂几何细节方面仍存在挑战,主要由于几何先验不可靠或含噪;而依赖优化过程中隐式产生的不确定性进行过滤的方式间接、低效,且在高不确定性区域屏蔽监督会加剧优化欠约束问题。 Method: 提出GPU-SDF框架:1)引入自监督模块显式估计几何先验不确定性;2)设计不确定性引导损失,动态调制而非丢弃先验影响;3)加入边缘距离场增强边界监督,以及多视角一致性正则化保障几何连贯性。 Result: 大量实验表明GPU-SDF显著提升了细粒度结构重建质量,并可作为即插即用模块增强现有框架性能。 Conclusion: 显式建模与利用先验不确定性,结合互补几何约束,是提升神经隐式重建精度与鲁棒性的有效途径;GPU-SDF为室内场景高质量表面重建提供了新思路。 Abstract: Neural implicit surface reconstruction with signed distance function has made significant progress, but recovering fine details such as thin structures and complex geometries remains challenging due to unreliable or noisy geometric priors. Existing approaches rely on implicit uncertainty that arises during optimization to filter these priors, which is indirect and inefficient, and masking supervision in high-uncertainty regions further leads to under-constrained optimization. To address these issues, we propose GPU-SDF, a neural implicit framework for indoor surface reconstruction that leverages geometric prior uncertainty and complementary constraints. We introduce a self-supervised module that explicitly estimates prior uncertainty without auxiliary networks. Based on this estimation, we design an uncertainty-guided loss that modulates prior influence rather than discarding it, thereby retaining weak but informative cues. To address regions with high prior uncertainty, GPU-SDF further incorporates two complementary constraints: an edge distance field that strengthens boundary supervision and a multi-view consistency regularization that enforces geometric coherence. Extensive experiments confirm that GPU-SDF improves the reconstruction of fine details and serves as a plug-and-play enhancement for existing frameworks. Source code will be available at https://github.com/IRMVLab/GPU-SDF[85] PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning
Dongxu Zhang,Yiding Sun,Pengcheng Li,Yumou Liu,Hongqiang Lin,Haoran Xu,Xiaoxuan Mu,Liang Lin,Wenbiao Yan,Ning Yang,Chaowei Fang,Juanjuan Zhao,Jihua Zhu,Conghui He,Cheng Tan
Main category: cs.CV
TL;DR: 本文提出PointCoT框架,通过显式链式思维(CoT)推理增强多模态大语言模型对3D点云的理解能力,强调‘观察、思考、回答’范式,并构建了大规模带分层CoT标注的数据集Point-Reason-Instruct,显著提升复杂几何推理任务性能。
Details
Motivation: 现有MLLMs在3D点云理解中将几何推理视为隐式映射,易产生几何幻觉,缺乏对结构细节的精确 grounding。 Method: 提出PointCoT框架,采用双流多模态架构融合语义外观与几何真实,并构建含约8.6万样本的Point-Reason-Instruct数据集,监督模型生成几何 grounded 的CoT推理过程。 Result: 在复杂3D几何推理任务上达到SOTA性能。 Conclusion: 显式CoT推理能有效提升MLLMs对3D点云的感知与推理能力,'Look, Think, then Answer'范式为3D多模态理解提供了新路径。 Abstract: While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models. However, they typically treat geometric reasoning as an implicit mapping process. These methods bypass intermediate logical steps and consequently suffer from geometric hallucinations. They confidently generate plausible responses that fail to ground in precise structural details. To bridge this gap, we present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. We advocate for a \textit{Look, Think, then Answer} paradigm. In this approach, the model is supervised to generate geometry-grounded rationales before predicting final answers. To facilitate this, we construct Point-Reason-Instruct, a large-scale benchmark comprising $\sim$86k instruction-tuning samples with hierarchical CoT annotations. By leveraging a dual-stream multi-modal architecture, our method synergizes semantic appearance with geometric truth. Extensive experiments demonstrate that PointCoT achieves state-of-the-art performance on complex reasoning tasks.[86] Micro-expression Recognition Based on Dual-branch Feature Extraction and Fusion
Mingjie Zhang,Bo Li,Wanting Liu,Hongyan Cui,Yue Li,Qingwen Li,Hong Li,Ge Gao
Main category: cs.CV
TL;DR: 本文提出了一种结合并行注意力机制的双分支微表情特征提取网络,通过残差结构、Inception模块和自适应特征融合模块提升识别性能,在CASME II数据集上达到74.67%准确率。
Details
Motivation: 微表情具有短暂性和细微性,现有基于光流的方法难以有效识别。 Method: 构建双分支网络,包含残差网络(缓解梯度消失与网络退化)、Inception网络(增强表征能力并抑制无关区域干扰)以及自适应特征融合模块(整合双分支特征),并引入平行注意力机制。 Result: 在CASME II数据集上准确率达到74.67%,优于LBP-TOP(+11.26%)和MSMMT(+3.36%)等方法。 Conclusion: 所提双分支并行注意力网络能更有效地提取微表情特征,显著提升识别精度。 Abstract: Micro-expressions, characterized by transience and subtlety, pose challenges to existing optical flow-based recognition methods. To address this, this paper proposes a dual-branch micro-expression feature extraction network integrated with parallel attention. Key contributions include: 1) a residual network designed to alleviate gradient anishing and network degradation; 2) an Inception network constructed to enhance model representation and suppress interference from irrelevant regions; 3) an adaptive feature fusion module developed to integrate dual-branch features. Experiments on the CASME II dataset demonstrate that the proposed method achieves 74.67% accuracy, outperforming LBP-TOP (by 11.26%), MSMMT (by 3.36%), and other comparative methods.[87] AHAP: Reconstructing Arbitrary Humans from Arbitrary Perspectives with Geometric Priors
Xiaozhen Qiao,Wenjia Wang,Zhiyuan Zhao,Jiacheng Sun,Ping Luo,Hongyuan Zhang,Xuelong Li
Main category: cs.CV
TL;DR: 本文提出AHAP框架,无需相机标定即可从任意视角图像中重建三维人体,通过多视图几何融合提升跨视角人体关联、重建与定位精度,并在多个数据集上实现高效且具竞争力的性能。
Details
Motivation: 传统多视角三维人体重建依赖预标定(如棋盘格或MVS算法),限制了其在真实场景中的可扩展性与适用性。 Method: 提出AHAP:包含跨视角身份关联模块(基于可学习人物查询与对比学习)、人体头部特征融合模块(结合多视角特征与场景上下文预测SMPL参数,并由重投影损失约束姿态一致性),并利用多视图三角测量消除单目深度歧义。 Result: 在EgoHumans和EgoExo4D数据集上,AHAP在世界坐标系人体重建与相机位姿估计任务中达到有竞争力的性能,且比基于优化的方法快180倍。 Conclusion: AHAP是一种高效、免标定的前馈式多视角三维人体重建框架,有效融合多视图几何信息,在精度与速度间取得良好平衡。 Abstract: Reconstructing 3D humans from images captured at multiple perspectives typically requires pre-calibration, like using checkerboards or MVS algorithms, which limits scalability and applicability in diverse real-world scenarios. In this work, we present \textbf{AHAP} (Reconstructing \textbf{A}rbitrary \textbf{H}umans from \textbf{A}rbitrary \textbf{P}erspectives), a feed-forward framework for reconstructing arbitrary humans from arbitrary camera perspectives without requiring camera calibration. Our core lies in the effective fusion of multi-view geometry to assist human association, reconstruction and localization. Specifically, we use a Cross-View Identity Association module through learnable person queries and soft assignment, supervised by contrastive learning to resolve cross-view human identity association. A Human Head fuses cross-view features and scene context for SMPL prediction, guided by cross-view reprojection losses to enforce body pose consistency. Additionally, multi-view geometry eliminates the depth ambiguity inherent in monocular methods, providing more precise 3D human localization through multi-view triangulation. Experiments on EgoHumans and EgoExo4D demonstrate that AHAP achieves competitive performance on both world-space human reconstruction and camera pose estimation, while being 180$\times$ faster than optimization-based approaches.[88] CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering
Yuyang Hong,Jiaqi Gu,Yujin Lou,Lubin Fan,Qi Yang,Ying Wang,Kun Ding,Yue Wu,Shiming Xiang,Jieping Ye
Main category: cs.CV
TL;DR: 本文提出CC-VQA方法,通过视觉中心的上下文冲突推理与相关性引导的编解码机制,在无需训练的前提下有效缓解知识库视觉问答中的参数化知识与检索知识之间的冲突,显著提升多个基准上的性能。
Details
Motivation: 现有KB-VQA方法难以协调VLM中静态参数化知识与动态检索知识之间的冲突,且忽视视觉信息在冲突中的作用,同时存在冗余检索上下文问题。 Method: 提出无训练的CC-VQA方法,包含两个核心模块:(1) 视觉中心的上下文冲突推理,进行内外知识的视觉语义冲突分析;(2) 相关性引导的编码与解码,采用位置编码压缩低相关陈述,并基于相关性加权的冲突评分实现自适应解码。 Result: 在E-VQA、InfoSeek和OK-VQA基准上达到SOTA,绝对准确率提升3.3%–6.4%。 Conclusion: CC-VQA通过显式建模视觉语义冲突与知识相关性,有效缓解KB-VQA中的知识冲突问题,为无需微调的知识融合提供了新范式。 Abstract: Knowledge-based visual question answering (KB-VQA) demonstrates significant potential for handling knowledge-intensive tasks. However, conflicts arise between static parametric knowledge in vision language models (VLMs) and dynamically retrieved information due to the static model knowledge from pre-training. The outputs either ignore retrieved contexts or exhibit inconsistent integration with parametric knowledge, posing substantial challenges for KB-VQA. Current knowledge conflict mitigation methods primarily adapted from language-based approaches, focusing on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, which impair accurate conflict identification and effective mitigation. To address these limitations, we propose \textbf{CC-VQA}: a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, featuring positional encoding compression for low-correlation statements and adaptive decoding using correlation-weighted conflict scoring. Extensive evaluations on E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate that CC-VQA achieves state-of-the-art performance, yielding absolute accuracy improvements of 3.3\% to 6.4\% compared to existing methods. Code is available at https://github.com/cqu-student/CC-VQA.[89] GDA-YOLO11: Amodal Instance Segmentation for Occlusion-Robust Robotic Fruit Harvesting
Caner Beldek,Emre Sariyildiz,Son Lam Phung,Gursel Alici
Main category: cs.CV
TL;DR: 本文提出了一种基于新型无模态分割模型GDA-YOLO11的机器人柑橘采摘框架,通过改进网络结构和引入非对称掩码损失,提升了遮挡条件下的果实检测与完整掩码预测能力,并实现了从感知到动作的端到端集成,在真实遮挡场景中显著提高了采摘成功率。
Details
Motivation: 遮挡是农业机器人采摘中的关键挑战,易导致果实漏检或定位不准,造成严重作物损失。 Method: 提出GDA-YOLO11无模态实例分割模型,结合架构改进与新型不对称掩码损失;在改进的公开柑橘数据集上训练,并在不同遮挡程度子集上评估;利用欧氏距离变换估计采摘点,并投影至3D空间驱动机械臂执行采摘。 Result: GDA-YOLO11在精度、mAP@50和mAP@50:95上分别达0.844、0.914和0.636,优于YOLO11n;框架在零至高遮挡下采摘成功率分别为92.59%、85.18%、48.14%和22.22%,中高遮挡下提升3.5%。 Conclusion: GDA-YOLO11有效增强了遮挡鲁棒性分割能力,并优化了感知-动作链路,为农业自主系统提供了更可靠的解决方案;本研究首次在机器人水果采摘中实现了无模态实例分割的实用化验证。 Abstract: Occlusion remains a critical challenge in robotic fruit harvesting, as undetected or inaccurately localised fruits often results in substantial crop losses. To mitigate this issue, we propose a harvesting framework using a new amodal segmentation model, GDA-YOLO11, which incorporates architectural improvements and an updated asymmetric mask loss. The proposed model is trained on a modified version of a public citrus dataset and evaluated on both the base dataset and occlusion-sensitive subsets with varying occlusion levels. Within the framework, full fruit masks, including invisible regions, are inferred by GDA-YOLO11, and picking points are subsequently estimated using the Euclidean distance transform. These points are then projected into 3D coordinates for robotic harvesting execution. Experiments were conducted using real citrus fruits in a controlled environment simulating occlusion scenarios. Notably, to the best of our knowledge, this study provides the first practical demonstration of amodal instance segmentation in robotic fruit harvesting. GDA-YOLO11 achieves a precision of 0.844, recall of 0.846, mAP@50 of 0.914, and mAP@50:95 of 0.636, outperforming YOLO11n by 5.1%, 1.3%, and 1.0% in precision, mAP@50, and mAP@50:95, respectively. The framework attains harvesting success rates of 92.59%, 85.18%, 48.14%, and 22.22% at zero to high occlusion levels, improving success by 3.5% under medium and high occlusion. These findings demonstrate that GDA-YOLO11 enhances occlusion robust segmentation and streamlines perception-to-action integration, paving the way for more reliable autonomous systems in agriculture.[90] SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
Qianxun Xu,Chenxi Song,Yujun Cai,Chi Zhang
Main category: cs.CV
TL;DR: 本文提出SwitchCraft框架,通过Event-Aligned Query Steering(EAQS)和Auto-Balance Strength Solver(ABSS)实现无需训练的多事件视频生成,显著提升提示对齐性、事件清晰度与场景一致性。
Details
Motivation: 现有文本到视频扩散模型在处理多事件提示时缺乏显式时间定位,易导致场景混淆或坍缩,难以保持叙事连贯性。 Method: 提出无需训练的SwitchCraft框架,包含两个核心模块:1)事件对齐查询引导(EAQS),使帧级注意力匹配对应事件提示;2)自适应平衡强度求解器(ABSS),动态调节引导强度以兼顾时间一致性和视觉保真度。 Result: 实验表明,SwitchCraft在提示对齐性、事件清晰度和场景一致性方面显著优于现有基线方法。 Conclusion: SwitchCraft为多事件视频生成提供了一种简单而有效的训练-free解决方案,解决了当前模型在复杂时序叙事生成中的关键瓶颈。 Abstract: Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent videos synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.[91] Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought
Kesen Zhao,Beier Zhu,Junbao Zhou,Xingyu Zhu,Zhongqi Yue,Hanwang Zhang
Main category: cs.CV
TL;DR: 本文提出NV-CoT框架,使多模态大语言模型(MLLMs)能用连续数值坐标进行图像区域推理,提升定位精度与答案准确率,并加速训练收敛。
Details
Motivation: 现有视觉链式推理方法存在模态不匹配、语义碎片化或固定粒度补丁限制精确区域选择等问题,且常需复杂架构修改。 Method: 提出Numerical Visual Chain-of-Thought(NV-CoT),将MLLM动作空间从离散词表扩展至连续欧氏空间,直接生成边界框坐标;采用高斯/拉普拉斯策略替代分类token策略,并通过重参数化采样引入随机性,兼容GRPO式策略优化;支持监督微调与强化学习。 Result: 在三个基准上对比八个主流视觉推理基线,NV-CoT显著提升定位精度和最终答案准确率,并加快训练收敛速度。 Conclusion: 连续动作空间的视觉推理范式在MLLMs中有效可行,仅需最小架构改动即可实现性能提升。 Abstract: Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions via either textified coordinates-causing modality mismatch and semantic fragmentation or fixed-granularity patches that both limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical token policies with a Gaussian (or Laplace) policy over coordinates and introduce stochasticity via reparameterized sampling, making NV-CoT fully compatible with GRPO-style policy optimization. Extensive experiments on three benchmarks against eight representative visual reasoning baselines demonstrate that NV-CoT significantly improves localization precision and final answer accuracy, while also accelerating training convergence, validating the effectiveness of continuous-action visual reasoning in MLLMs. The code is available in https://github.com/kesenzhao/NV-CoT.[92] SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking
Qiuyang Zhang,Jiujun Cheng,Qichao Mao,Cong Liu,Yu Fang,Yuhong Li,Mengying Ge,Shangce Gao
Main category: cs.CV
TL;DR: 本文提出了SpikeTrack,一种面向RGB目标跟踪的脉冲驱动框架,通过非对称时间步扩展和单向信息流设计,结合受神经推理机制启发的记忆检索模块,在保持高精度的同时大幅降低能耗。
Details
Motivation: 现有脉冲神经网络(SNN)跟踪方法难以兼顾脉冲驱动计算的纯粹性与神经元时空动力学的充分利用,导致效率与精度难以兼得。 Method: 提出SpikeTrack框架,采用非对称时间步扩展与单向信息流结构;设计受神经推理机制启发的记忆检索模块,通过模板初始化紧凑记忆并递归查询以增强目标感知。 Result: 在多个基准上达到SNN跟踪器最优性能,且在LaSOT数据集上超越TransT,能耗仅为后者的1/26;首次实现RGB跟踪中SNN在精度与能效上的双重领先。 Conclusion: SpikeTrack是首个真正实现RGB视觉跟踪中高精度与高能效协同的纯脉冲驱动框架,为SNN在实时视觉任务中的实用化提供了新范式。 Abstract: Spiking Neural Networks (SNNs) promise energy-efficient vision, but applying them to RGB visual tracking remains difficult: Existing SNN tracking frameworks either do not fully align with spike-driven computation or do not fully leverage neurons' spatiotemporal dynamics, leading to a trade-off between efficiency and accuracy. To address this, we introduce SpikeTrack, a spike-driven framework for energy-efficient RGB object tracking. SpikeTrack employs a novel asymmetric design that uses asymmetric timestep expansion and unidirectional information flow, harnessing spatiotemporal dynamics while cutting computation. To ensure effective unidirectional information transfer between branches, we design a memory-retrieval module inspired by neural inference mechanisms. This module recurrently queries a compact memory initialized by the template to retrieve target cues and sharpen target perception over time. Extensive experiments demonstrate that SpikeTrack achieves the state-of-the-art among SNN-based trackers and remains competitive with advanced ANN trackers. Notably, it surpasses TransT on LaSOT dataset while consuming only 1/26 of its energy. To our knowledge, SpikeTrack is the first spike-driven framework to make RGB tracking both accurate and energy efficient. The code and models are available at https://github.com/faicaiwawa/SpikeTrack.[93] Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping
Tianxiang Du,Hulingxiao He,Yuxin Peng
Main category: cs.CV
TL;DR: 本文提出AesGuide数据集和Venus框架,旨在解决现有MLLMs在摄影美学指导(AG)能力上的不足,首次实现从拍摄过程中的美学问题识别与指导到后期美学裁剪的全流程支持。
Details
Motivation: 普通用户与专业摄影师之间存在美学判断与拍摄指导能力的差距,而现有多模态大语言模型(MLLMs)缺乏有效的美学指导(AG)能力,无法识别构图问题或提供可操作建议,也难以胜任美学裁剪任务。 Method: 构建首个大规模美学指导(AG)数据集AesGuide(含10,748张照片及美学评分、分析与指导标注);提出两阶段框架Venus:第一阶段通过渐进式美学问答增强MLLMs的AG能力,第二阶段结合思维链(CoT)推理激活其美学裁剪能力。 Result: Venus在美学指导能力上显著提升,并在美学裁剪任务中达到当前最优(SOTA)性能,支持可解释、可交互的摄影全流程美学优化。 Conclusion: AesGuide数据集和Venus框架填补了计算美学中美学指导这一关键空白,推动MLLMs从被动描述迈向主动美学创作辅助。 Abstract: The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG) -- an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. Extensive experiments show that Venus substantially improves AG capability and achieves state-of-the-art (SOTA) performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across both stages of photo creation. Code is available at https://github.com/PKU-ICST-MIPL/Venus_CVPR2026.[94] Accelerating Masked Image Generation by Learning Latent Controlled Dynamics
Kaiwen Zhu,Quansheng Zeng,Yuandong Pu,Shuo Cao,Xiaohui Li,Yi Xin,Qi Qin,Jiayang Li,Yu Qiao,Jinjin Gu,Yihao Liu
Main category: cs.CV
TL;DR: 本文提出MIGM-Shortcut方法,通过学习一个轻量级模型来预测特征演化的平均速度场,从而加速掩码图像生成模型(MIGMs),在保持生成质量的同时实现超4倍加速。
Details
Motivation: 现有掩码图像生成模型(MIGMs)因多步双向注意力而效率低下,且在离散token采样中丢失连续特征语义;已有缓存特征的方法在高速加速下近似误差大,主因是表达能力有限且未利用采样信息。 Method: 提出轻量级模型,融合历史特征与已采样token,回归特征演化的平均速度场,以跳过冗余计算步骤;将该方法应用于Lumina-DiMOO等MIGM架构。 Result: 在Lumina-DiMOO上实现文本到图像生成超4倍加速,同时保持图像质量,显著提升掩码图像生成的效率-质量Pareto前沿。 Conclusion: MIGM-Shortcut是一种高效、轻量且通用的加速方案,有效缓解MIGMs的计算冗余问题,为实际部署提供了新思路。 Abstract: Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the multiple steps of bi-directional attention. In fact, there exists notable redundancy in their computation: when sampling discrete tokens, the rich semantics contained in the continuous features are lost. Some existing works attempt to cache the features to approximate future features. However, they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and the failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens, and regresses the average velocity field of feature evolution. The model has moderate complexity that suffices to capture the subtle dynamics while keeping lightweight compared to the original base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.[95] Ordinal Diffusion Models for Color Fundus Images
Gustav Schmidt,Philipp Berens,Sarah Müller
Main category: cs.CV
TL;DR: 本文提出了一种序数潜在扩散模型,用于生成具有糖尿病视网膜病变(DR)严重程度连续结构的彩色眼底图像,通过标量疾病表征实现相邻阶段间的平滑过渡,并在EyePACS数据集上验证了其在视觉真实性和临床一致性上的优越性。
Details
Motivation: 现有条件扩散模型将疾病阶段视为独立类别,忽略了疾病进展的连续性,而医学影像中(如DR)常仅有离散但有序的粗粒度标签,导致建模失配。 Method: 提出序数潜在扩散模型,用标量疾病表征替代传统分类条件,显式建模DR严重程度的有序结构,并在EyePACS数据集上进行视觉真实性(FID)与临床一致性(二次加权κ)评估及插值实验。 Result: 相比标准条件扩散模型,该模型在5个DR阶段中的4个阶段FID降低,二次加权κ从0.79提升至0.87;插值实验证明其能从有序粗标签中学习到连续的疾病进展谱。 Conclusion: 引入序数结构的扩散模型更贴合医学影像中疾病连续进展的本质,可提升生成图像的临床相关性与实用性。 Abstract: It has been suggested that generative image models such as diffusion models can improve performance on clinically relevant tasks by offering deep learning models supplementary training data. However, most conditional diffusion models treat disease stages as independent classes, ignoring the continuous nature of disease progression. This mismatch is problematic in medical imaging because continuous pathological processes are typically only observed through coarse, discrete but ordered labels as in ophthalmology for diabetic retinopathy (DR). We propose an ordinal latent diffusion model for generating color fundus images that explicitly incorporates the ordered structure of DR severity into the generation process. Instead of categorical conditioning, we used a scalar disease representation, enabling a smooth transition between adjacent stages. We evaluated our approach using visual realism metrics and classification-based clinical consistency analysis on the EyePACS dataset. Compared to a standard conditional diffusion model, our model reduced the Fréchet inception distance for four of the five DR stages and increased the quadratic weighted $κ$ from 0.79 to 0.87. Furthermore, interpolation experiments showed that the model captured a continuous spectrum of disease progression learned from ordered, coarse class labels.[96] Interpretable Debiasing of Vision-Language Models for Social Fairness
Na Min An,Yoonna Jang,Yusuke Hirota,Ryo Hachiuma,Isabelle Augenstein,Hyunjung Shim
Main category: cs.CV
TL;DR: 本文提出DeBiasLens框架,利用稀疏自编码器(SAE)在视觉-语言模型(VLMs)中定位与社会属性相关的神经元,并通过选择性失活这些神经元来缓解社会偏见,同时保留语义能力。
Details
Motivation: 现有去偏方法多关注表层偏差信号,忽视模型内部机制;VLMs黑箱推理可能引发社会偏见,亟需可解释、模型无关的内部偏差审计与缓解方法。 Method: 提出DeBiasLens框架:在VLM多模态编码器上应用稀疏自编码器(SAE),在无社会属性标签的面部图像或文本数据上训练,利用SAE的解耦能力识别对特定人口统计特征(包括少数群体)高度响应的神经元;再针对每类群体,选择性失活最强关联的‘社会神经元’。 Result: 成功定位并干预VLM中与社会属性相关的神经元,在不损害语义性能的前提下有效缓解多种社会偏见行为;为后续AI公平性审计工具提供基础。 Conclusion: 内部神经元级可解释性是实现VLM社会公平的关键路径;DeBiasLens证明了无需标签、模型无关的偏差定位与干预的可行性与有效性。 Abstract: The rapid advancement of Vision-Language models (VLMs) has raised growing concerns that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, while leaving the internal dynamics of the model largely unexplored. In this work, we introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building upon the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are underrepresented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.[97] SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
Xiang Feng,Xiangbo Wang,Tieshi Zhong,Chengkai Wang,Yiting Zhao,Tianxiang Xu,Zhenzhong Kuang,Feiwei Qin,Xuefei Yin,Yanming Zhu
Main category: cs.CV
TL;DR: 本文提出SR3R,一种前馈式3D超分辨率框架,直接从稀疏低分辨率多视角图像生成高分辨率3D高斯点阵表示,无需每场景优化,显著提升重建保真度、跨场景泛化性和实时性。
Details
Motivation: 现有3D超分辨率方法依赖密集输入和逐场景优化,受限于2D超分模型提供的高频先验,导致重建精度、泛化性和实时性不足。 Method: 将3DSR重构为稀疏LR视图到HR 3DGS的直接前馈映射;提出SR3R框架,包含高斯偏移学习与特征精炼模块;可即插即用地与任意前馈3DGS重建骨干网络协同工作。 Result: 在三个3D基准上超越现有SOTA方法,实现强零样本跨场景泛化,甚至在未见场景上优于SOTA逐场景优化方法。 Conclusion: SR3R通过端到端学习3D特异性高频几何与外观,从根本上改变了3DSR获取高频知识的方式,实现了高效、通用且高保真的3D重建。 Abstract: 3D super-resolution (3DSR) aims to reconstruct high-resolution (HR) 3D scenes from low-resolution (LR) multi-view images. Existing methods rely on dense LR inputs and per-scene optimization, which restricts the high-frequency priors for constructing HR 3D Gaussian Splatting (3DGS) to those inherited from pretrained 2D super-resolution (2DSR) models. This severely limits reconstruction fidelity, cross-scene generalization, and real-time usability. We propose to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. This fundamentally changes how 3DSR acquires high-frequency knowledge and enables robust generalization to unseen scenes. Specifically, we introduce SR3R, a feed-forward framework that directly predicts HR 3DGS representations from sparse LR views via the learned mapping network. To further enhance reconstruction fidelity, we introduce Gaussian offset learning and feature refinement, which stabilize reconstruction and sharpen high-frequency details. SR3R is plug-and-play and can be paired with any feed-forward 3DGS reconstruction backbone: the backbone provides an LR 3DGS scaffold, and SR3R upscales it to an HR 3DGS. Extensive experiments across three 3D benchmarks demonstrate that SR3R surpasses state-of-the-art (SOTA) 3DSR methods and achieves strong zero-shot generalization, even outperforming SOTA per-scene optimization methods on unseen scenes.[98] Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection
Zhaolin Cai,Fan Li,Huiyu Duan,Lijun He,Guangtao Zhai
Main category: cs.CV
TL;DR: 本文提出SteerVAD框架,通过主动干预多模态大语言模型(MLLM)的内部表征,提升无调优视频异常检测(VAD)性能;其核心是基于梯度无关的表征可分性分析识别关键注意力头,并利用分层元控制器生成动态校正信号,对异常相关维度进行各向异性缩放。
Details
Motivation: 现有基于冻结MLLM的VAD方法受限于预训练偏差和缺乏对特定视频上下文的表征自适应能力,难以检测细微或模糊异常。 Method: 提出SteerVAD框架:1)用梯度无关的表征可分性分析(RSA)识别最具判别力的注意力头作为潜在异常专家(LAEs);2)设计分层元控制器(HMC),联合全局上下文与LAE输出生成动态校正信号;3)对LAE表征流形执行目标导向、各向异性的缩放操作,增强异常维度、抑制固有偏置。 Result: 在主流基准上达到无调优VAD方法的最优性能,仅需1%训练数据,显著优于现有冻结MLLM方法。 Conclusion: SteerVAD开创了从被动读取到主动引导和校正MLLM内部表征的新范式,为高效、低资源视频异常检测提供了新方向。 Abstract: Video anomaly detection (VAD) aims to identify abnormal events in videos. Traditional VAD methods generally suffer from the high costs of labeled data and full training, thus some recent works have explored leveraging frozen multi-modal large language models (MLLMs) in a tuning-free manner to perform VAD. However, their performance is limited as they directly inherit pre-training biases and cannot adapt internal representations to specific video contexts, leading to difficulties in handling subtle or ambiguous anomalies. To address these limitations, we propose a novel intervention framework, termed SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our approach first leverages the gradient-free representational separability analysis (RSA) to identify top attention heads as latent anomaly experts (LAEs) which are most discriminative for VAD. Then a hierarchical meta-controller (HMC) generates dynamic rectification signals by jointly conditioning on global context and these LAE outputs. The signals execute targeted, anisotropic scaling directly upon the LAE representation manifolds, amplifying anomaly-relevant dimensions while suppressing inherent biases. Extensive experiments on mainstream benchmarks demonstrate our method achieves state-of-the-art performance among tuning-free approaches requiring only 1% of training data, establishing it as a powerful new direction for video anomaly detection. The code will be released upon the publication.[99] GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models
Xingyu Zhu,Beier Zhu,Junfeng Fang,Shuo Wang,Yin Zhang,Xiang Wang,Xiangnan He
Main category: cs.CV
TL;DR: 本文提出GuardAlign,一种无需训练的防御框架,通过OT增强的安全检测和跨模态注意力校准,提升大视觉语言模型(LVLMs)的安全性,显著降低不安全响应率并保持模型效用。
Details
Motivation: 现有基于输入侧的防御方法在复杂场景下检测不准,且解码过程中安全信号不稳定,难以保障LVLMs的安全性。 Method: GuardAlign包含两个策略:1)OT增强的安全检测,利用最优传输度量图像块与不安全语义间的分布距离,精准定位恶意区域;2)跨模态注意力校准,自适应重分配各层注意力以增强安全前缀的影响,确保安全信号在生成全程持续激活。 Result: 在六个主流多模态大语言模型(MLLMs)上的实验表明,GuardAlign在SPA-VL数据集上将不安全响应率最高降低39%,同时在VQAv2上将性能从78.51%提升至79.21%,兼顾安全性与实用性。 Conclusion: GuardAlign是一种高效、免训练的LVLM安全防御框架,在不牺牲模型性能的前提下显著提升了安全性,为视觉语言模型的可信部署提供了新思路。 Abstract: Large vision-language models (LVLMs) have achieved remarkable progress in vision-language reasoning tasks, yet ensuring their safety remains a critical challenge. Recent input-side defenses detect unsafe images with CLIP and prepend safety prefixes to prompts, but they still suffer from inaccurate detection in complex scenes and unstable safety signals during decoding. To address these issues, we propose GuardAlign, a training-free defense framework that integrates two strategies. First, OT-enhanced safety detection leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. Second, cross-modal attentive calibration strengthens the influence of safety prefixes by adaptively reallocating attention across layers, ensuring that safety signals remain consistently activated throughout generation. Extensive evaluations on six representative MLLMs demonstrate that GuardAlign reduces unsafe response rates by up to 39% on SPA-VL, while preserving utility, achieving an improvement on VQAv2 from 78.51% to 79.21%.[100] Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation
Xingyu Zhu,Kesen Zhao,Liang Yi,Shuo Wang,Zhicai Wang,Beier Zhu,Hanwang Zhang
Main category: cs.CV
TL;DR: 本文提出了一种无需训练的自适应视觉强化(AIR)框架,通过原型驱动的视觉标记压缩和最优传输(OT)引导的图像块强化,提升多模态大语言模型(MLLMs)对关键视觉信息的依赖,从而有效缓解幻觉问题。
Details
Motivation: 现有MLLMs易产生幻觉,且已有缓解方法要么依赖高成本监督训练,要么增加推理延迟;视觉增强方法常 indiscriminately 注入全部视觉标记,引入背景干扰、削弱关键线索。 Method: 提出AIR框架:1)基于原型的标记缩减,压缩冗余视觉标记;2)OT引导的图像块强化,量化隐藏状态与图像块嵌入对齐度,并选择性地将最一致的图像块注入前馈层。 Result: 在多个代表性MLLMs上实验表明,AIR显著降低幻觉率,同时保持模型通用能力。 Conclusion: AIR是一种高效、即插即用、无需训练的视觉强化方法,为构建可靠MLLMs提供了新思路。 Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning, yet they remain vulnerable to hallucination, where generated content deviates from visual evidence. Existing mitigation strategies either require costly supervision during training or introduce additional latency at inference time. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding, but they typically inject all tokens indiscriminately, which causes interference from background regions and distracts the model from critical cues. To overcome this challenge, we propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs. AIR consists of two components. Prototype-based token reduction condenses the large pool of visual tokens into a compact subset to suppress redundancy. OT-guided patch reinforcement quantifies the alignment between hidden states and patch embeddings to selectively integrate the most consistent patches into feed-forward layers. As a result, AIR enhances the model's reliance on salient visual information and effectively mitigates hallucination. Extensive experiments across representative MLLMs demonstrate that AIR substantially reduces hallucination while preserving general capabilities, establishing it as an effective solution for building reliable MLLMs.[101] Spatio-Temporal Garment Reconstruction Using Diffusion Mapping via Pattern Coordinates
Yingxuan You,Ren Li,Corentin Dumery,Cong Cao,Hao Li,Pascal Fua
Main category: cs.CV
TL;DR: 本文提出了一种统一框架,结合隐式裁剪图案(ISP)与生成扩散模型,在2D UV空间学习服装形状先验,并通过映射模型实现单张图像和视频中高保真3D服装重建,尤其在宽松服装上表现优异。
Details
Motivation: 从单目图像/视频中高精度重建3D服装(尤其是宽松服装)仍是开放挑战,现有方法难以兼顾几何细节与动态一致性。 Method: 融合隐式缝纫图案(ISP)与生成扩散模型学习2D UV空间服装先验;构建像素-UV-3D映射模型实现单图重建;引入时空扩散机制与测试时引导保证视频时序一致性;设计解析投影约束以保持可见区域对齐与遮挡区域连贯补全。 Result: 在合成数据上训练但泛化至真实图像,对紧身与宽松服装均优于现有方法;重建结果保留精细几何细节并呈现真实动态运动,支持纹理编辑、服装重定向与动画等下游任务。 Conclusion: 该框架有效解决了单目条件下高保真、动态一致的3D服装重建难题,为虚拟试穿、数字人与混合现实提供了可靠基础。 Abstract: Reconstructing 3D clothed humans from monocular images and videos is a fundamental problem with applications in virtual try-on, avatar creation, and mixed reality. Despite significant progress in human body recovery, accurately reconstructing garment geometry, particularly for loose-fitting clothing, remains an open challenge. We propose a unified framework for high-fidelity 3D garment reconstruction from both single images and video sequences. Our approach combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn expressive garment shape priors in 2D UV space. Leveraging these priors, we introduce a mapping model that establishes correspondences between image pixels, UV pattern coordinates, and 3D geometry, enabling accurate and detailed garment reconstruction from single images. We further extend this formulation to dynamic reconstruction by introducing a spatio-temporal diffusion scheme with test-time guidance to enforce long-range temporal consistency. We also develop analytic projection-based constraints that preserve image-aligned geometry in visible regions while enforcing coherent completion in occluded areas over time. Although trained exclusively on synthetically simulated cloth data, our method generalizes well to real-world imagery and consistently outperforms existing approaches on both tight- and loose-fitting garments. The reconstructed garments preserve fine geometric detail while exhibiting realistic dynamic motion, supporting downstream applications such as texture editing, garment retargeting, and animation.[102] Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
Chenwei Jia,Baoting Li,Xuchong Zhang,Mingzhuo Wei,Bochen Lin,Hongbin Sun
Main category: cs.CV
TL;DR: 本文提出Quant Experts (QE),一种针对视觉语言模型(VLMs)的后训练量化(PTQ)方法,通过将重要通道分为token-independent和token-dependent两类,并分别设计共享专家与路由专家进行自适应误差补偿,显著提升量化后模型精度。
Details
Motivation: 现有PTQ方法忽略重要通道在不同输入、模态及token间的分布差异,导致量化效果不佳。 Method: 提出Quant Experts(QE),将重要通道分为token-independent(用共享低秩适配器补偿全局误差)和token-dependent(用多个路由低秩适配器补偿局部token相关误差)两类,并结合混合专家机制实现token-aware自适应误差补偿。 Result: QE在2B至70B参数规模的VLMs上,多种量化设置下均显著提升任务准确率,性能接近全精度模型。 Conclusion: QE通过建模通道重要性的token级动态性,有效提升了VLMs的后训练量化效果,为大模型高效部署提供了新思路。 Abstract: Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights and activations without retraining the full model. Existing PTQ methods primarily rely on static identification and global compensation of sensitive or outlier channels, yet they often overlook the distributional differences of these important channels across inputs, leading to unsatisfactory quantization. In this work, we observe that the distributions and occurrence frequencies of important channels vary significantly both across modalities and among tokens, even within the same modality. Accordingly, we propose \textbf{Quant Experts (QE)}, a token-aware adaptive error compensation with mixture-of-experts for VLMs quantization. QE divides the important channels into token-independent and token-dependent groups. For the former, a shared expert is designed for most tokens to compensate for global quantization error using a low-rank adapter. For the latter, routed experts including multiple routed low-rank adapters are elaborated to compensate for local quantization error related to specific tokens. Extensive experiments demonstrate that QE consistently enhances task accuracy across various quantization settings and model scales, ranging from 2B to 70B parameters, while maintaining performance comparable to full-precision models.[103] EvalMVX: A Unified Benchmarking for Neural 3D Reconstruction under Diverse Multiview Setups
Zaiyan Yang,Jieji Ren,Xiangyi Wang,zonglin li,Xu Cao,Heng Guo,Zhanyu Ma,Boxin Shi
Main category: cs.CV
TL;DR: 本文提出了EvalMVX真实世界数据集,用于定量评估多视角立体(MVS)、多视角光度立体(MVPS)和多视角偏振形状恢复(MVSfP)三种技术,并在该数据集上评测了13种最新方法,揭示了不同几何细节与反射特性下的性能差异与开放问题。
Details
Motivation: 现有真实世界数据集主要针对基于RGB输入的多视角立体(MVS)进行评测,而同样重要的多视角光度立体(MVPS)和多视角偏振形状恢复(MVSfP)缺乏统一、定量的基准评估;需明确各类MVX技术的有效适用范围。 Method: 构建EvalMVX数据集:包含25个物体,每个物体用偏振相机在20个视角、17种光照条件(含OLAT与自然光)下采集共8500张图像,并提供配准好的真值3D网格;在此基础上对13种近年MVX方法进行系统定量评测。 Result: 首次实现了MVS、MVPS和MVSfP三类方法在统一真实数据上的同步定量比较;识别出各方法在不同几何复杂度和材质反射特性下的性能瓶颈与优势场景;明确了若干尚未解决的关键问题。 Conclusion: EvalMVX填补了多视角三维重建领域在高保真表面重建与稀疏输入场景下的基准评测空白;所揭示的性能规律与开放问题将推动MVX方法的进一步发展与融合。 Abstract: Recent advancements in neural surface reconstruction have significantly enhanced 3D reconstruction. However, current real world datasets mainly focus on benchmarking multiview stereo (MVS) based on RGB inputs. Multiview photometric stereo (MVPS) and multiview shape from polarization (MVSfP), though indispensable on high-fidelity surface reconstruction and sparse inputs, have not been quantitatively assessed together with MVS. To determine the working range of different MVX (MVS, MVSfP, and MVPS) techniques, we propose EvalMVX, a real-world dataset containing $25$ objects, each captured with a polarized camera under $20$ varying views and $17$ light conditions including OLAT and natural illumination, leading to $8,500$ images. Each object includes aligned ground-truth 3D mesh, facilitating quantitative benchmarking of MVX methods simultaneously. Based on our EvalMVX, we evaluate $13$ MVX methods published in recent years, record the best-performing methods, and identify open problems under diverse geometric details and reflectance types. We hope EvalMVX and the benchmarking results can inspire future research on multiview 3D reconstruction.[104] FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
Matteo Ballegeer,Dries F. Benoit
Main category: cs.CV
TL;DR: 本文提出FoV-Net,首个在旋转不变性下建模B-rep局部几何与全局结构的框架,通过LRF UV-grid和FoV grid实现旋转鲁棒性,在分类与分割任务中达到SOTA并减少数据需求。
Details
Motivation: 现有B-rep学习方法依赖绝对坐标和法向量,对旋转高度敏感,在任意SO(3)旋转下性能急剧下降(从>95%降至~10%)。 Method: 提出FoV-Net:每个面用Local Reference Frame (LRF) UV-grid编码局部几何,用Field-of-View (FoV) grids通过射线投射记录邻面交点以捕获全局上下文;使用轻量CNN提取面特征,并通过图注意力网络在B-rep图上传播特征。 Result: 在B-rep分类与分割基准上达到SOTA性能,对任意旋转鲁棒,且所需训练数据更少。 Conclusion: FoV-Net首次实现了B-rep学习中局部几何与全局结构的旋转不变建模,显著提升了模型鲁棒性与数据效率。 Abstract: Learning directly from boundary representations (B-reps) has significantly advanced 3D CAD analysis. However, state-of-the-art B-rep learning methods rely on absolute coordinates and normals to encode global context, making them highly sensitive to rotations. Our experiments reveal that models achieving over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary $\mathbf{SO}(3)$ rotations. To address this, we introduce FoV-Net, the first B-rep learning framework that captures both local surface geometry and global structural context in a rotation-invariant manner. Each face is represented by a Local Reference Frame (LRF) UV-grid that encodes its local surface geometry, and by Field-of-View (FoV) grids that capture the surrounding 3D context by casting rays and recording intersections with neighboring faces. Lightweight CNNs extract per-face features, which are propagated over the B-rep graph using a graph attention network. FoV-Net achieves state-of-the-art performance on B-rep classification and segmentation benchmarks, demonstrating robustness to arbitrary rotations while also requiring less training data to achieve strong results.[105] DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
Yuxuan Zhang,Katarína Tóthová,Zian Wang,Kangxue Yin,Haithem Turki,Riccardo de Lutio,Yen-Yu Chang,Or Litany,Sanja Fidler,Zan Gojcic
Main category: cs.CV
TL;DR: 本文提出DiffusionHarmonizer,一种基于扩散模型的在线生成式增强框架,用于提升神经重建场景(如NeRF、3D高斯泼溅)在自动驾驶仿真中的视觉真实感与时间一致性,尤其改善动态物体插入、新视角渲染伪影及光照不一致等问题。
Details
Motivation: 现有神经重建方法(如NeRF、3D Gaussian Splatting)虽能从真实数据自动构建仿真场景,但在新视角渲染时易出现伪影,且难以真实融合跨场景采集的动态物体,限制了其在高保真仿真中的应用。 Method: 提出DiffusionHarmonizer:将预训练多步图像扩散模型蒸馏为单步、时序条件增强器,并设计专用合成-真实配对数据构建流程,聚焦外观协调、伪影修正与光照真实性。 Result: 该框架可在单GPU上实时运行于在线仿真器中,显著提升渲染结果的时间一致性与视觉真实感,在研究与生产级仿真环境中展现出强可扩展性与实用性。 Conclusion: DiffusionHarmonizer为神经重建驱动的仿真系统提供了高效、轻量、高保真的后处理增强方案,弥合了神经渲染质量与实际仿真需求之间的关键差距。 Abstract: Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when they were captured from different scenes. To overcome these limitations, we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into temporally consistent outputs while improving their realism. At its core is a single-step temporally-conditioned enhancer that is converted from a pretrained multi-step image diffusion model, capable of running in online simulators on a single GPU. The key to training it effectively is a custom data curation pipeline that constructs synthetic-real pairs emphasizing appearance harmonization, artifact correction, and lighting realism. The result is a scalable system that significantly elevates simulation fidelity in both research and production environments.[106] FocusTrack: One-Stage Focus-and-Suppress Framework for 3D Point Cloud Object Tracking
Sifan Zhou,Jiahao Nie,Ziyu Zhao,Yichao Cao,Xiaobo Lu
Main category: cs.CV
TL;DR: 本文提出FocusTrack,一种用于3D点云目标跟踪的一阶段框架,通过统一建模运动与语义信息,避免传统两阶段方法中的误差累积和计算瓶颈,在KITTI、nuScenes和Waymo等基准上达到SOTA性能并实现105 FPS实时速度。
Details
Motivation: 现有基于运动的两阶段3D点云跟踪方法存在误差累积(因先分割后估计运动)和计算效率低(顺序处理)两大问题。 Method: 提出FocusTrack一阶段框架,包含两个核心模块:(1) 时序差分Siamese编码器实现的帧间运动建模(IMM);(2) 基于IMM输出的运动显著特征门控与背景抑制机制的Focus-and-Suppress注意力机制,无需显式前景分割即可联合优化运动与语义。 Result: 在KITTI、nuScenes和Waymo数据集上达到新的SOTA性能,并以105 FPS高速运行。 Conclusion: FocusTrack通过运动-语义协同建模的一阶段端到端设计,有效克服了传统两阶段方法的固有缺陷,兼顾精度与效率,为3D点云跟踪提供了新范式。 Abstract: In 3D point cloud object tracking, the motion-centric methods have emerged as a promising avenue due to its superior performance in modeling inter-frame motion. However, existing two-stage motion-based approaches suffer from fundamental limitations: (1) error accumulation due to decoupled optimization caused by explicit foreground segmentation prior to motion estimation, and (2) computational bottlenecks from sequential processing. To address these challenges, we propose FocusTrack, a novel one-stage paradigms tracking framework that unifies motion-semantics co-modeling through two core innovations: Inter-frame Motion Modeling (IMM) and Focus-and-Suppress Attention. The IMM module employs a temp-oral-difference siamese encoder to capture global motion patterns between adjacent frames. The Focus-and-Suppress attention that enhance the foreground semantics via motion-salient feature gating and suppress the background noise based on the temporal-aware motion context from IMM without explicit segmentation. Based on above two designs, FocusTrack enables end-to-end training with compact one-stage pipeline. Extensive experiments on prominent 3D tracking benchmarks, such as KITTI, nuScenes, and Waymo, demonstrate that the FocusTrack achieves new SOTA performance while running at a high speed with 105 FPS.[107] Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
Haoran Wang,Guoxi Huang,Fan Zhang,David Bull,Nantheera Anantrasirichai
Main category: cs.CV
TL;DR: 本文提出了一种面向重建质量的自适应剪枝策略和一种新型3D Difference-of-Gaussians基元,显著压缩3D高斯溅射模型(最高减少90%高斯数量),同时保持甚至提升渲染质量。
Details
Motivation: 3D高斯溅射(3DGS)虽实现高质量实时渲染,但需大量基元,导致冗余、资源消耗高、难以扩展至复杂大场景,亟需高效剪枝与更具表达力的紧凑基元。 Method: 1)提出重建感知的集成剪枝策略,根据重建质量自适应决定剪枝时机与优化间隔;2)引入3D Difference-of-Gaussians基元,单个基元联合建模正负密度,增强紧凑配置下的表达能力。 Result: 模型高斯数量最多减少90%,渲染视觉质量媲美或优于当前最优方法。 Conclusion: 所提剪枝策略与新基元协同提升了3DGS的紧凑性与渲染保真度,为大规模场景的实际部署提供了有效解决方案。 Abstract: Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with photorealistic quality. 3DGS often requires a large number of primitives to achieve high fidelity, leading to redundant representations and high resource consumption, thereby limiting its scalability for complex or large-scale scenes. Consequently, effective pruning strategies and more expressive primitives that can reduce redundancy while preserving visual quality are crucial for practical deployment. We propose an efficient, integrated reconstruction-aware pruning strategy that adaptively determines pruning timing and refining intervals based on reconstruction quality, thus reducing model size while enhancing rendering quality. Moreover, we introduce a 3D Difference-of-Gaussians primitive that jointly models both positive and negative densities in a single primitive, improving the expressiveness of Gaussians under compact configurations. Our method significantly improves model compactness, achieving up to 90\% reduction in Gaussian-count while delivering visual quality that is similar to, or in some cases better than, that produced by state-of-the-art methods. Code will be made publicly available.[108] Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics
Omar Mohamed,Edoardo Fazzari,Ayah Al-Naji,Hamdan Alhadhrami,Khalfan Hableel,Saif Alkindi,Cesare Stefanini
Main category: cs.CV
TL;DR: 本文提出了一种无需手术特异性预训练的无监督方法TASOT,通过结合视频生成的文本信息与视觉特征,利用多模态最优传输实现手术阶段和步骤识别,在多个基准数据集上显著优于现有零样本方法。
Details
Motivation: 质疑当前依赖大规模标注手术视频预训练的高计算与数据成本策略,探索是否必须进行如此繁重的预训练。 Method: 提出Text-Augmented Action Segmentation Optimal Transport (TASOT),将时序动作分割建模为多模态最优传输问题,联合建模基于帧外观的视觉代价与基于视频生成文本的语义代价,并采用时间一致的非平衡Gromov-Wasserstein正则化。 Result: 在StrasBypass70、BernBypass70、Cholec80和AutoLaparo等基准数据集上,相比现有零样本方法分别提升23.7、4.5、16.5和19.6个百分点。 Conclusion: 仅利用标准视觉与文本表征中已有的信息,即可实现细粒度手术理解,无需日益复杂的预训练流程。 Abstract: Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on large-scale pre-training on thousands of labeled surgical videos, followed by zero-shot transfer to specific procedures. While effective, this strategy incurs substantial computational and data collection costs. In this work, we question whether such heavy pre-training is truly necessary. We propose Text-Augmented Action Segmentation Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition that extends Action Segmentation Optimal Transport (ASOT) by incorporating textual information generated directly from the videos. TASOT formulates temporal action segmentation as a multimodal optimal transport problem, where the matching cost is defined as a weighted combination of visual and text-based costs. The visual term captures frame-level appearance similarity, while the text term provides complementary semantic cues, and both are jointly regularized through a temporally consistent unbalanced Gromov-Wasserstein formulation. This design enables effective alignment between video frames and surgical actions without surgical-specific pretraining or external web-scale supervision. We evaluate TASOT on multiple benchmark surgical datasets and observe consistent and substantial improvements over existing zero-shot methods, including StrasBypass70 (+23.7), BernBypass70 (+4.5), Cholec80 (+16.5), and AutoLaparo (+19.6). These results demonstrate that fine-grained surgical understanding can be achieved by exploiting information already present in standard visual and textual representations, without resorting to increasingly complex pre-training pipelines. The code will be available at https://github.com/omar8ahmed9/TASOT.[109] Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation
Muquan Li,Hang Gou,Yingyi Ma,Rongzheng Wang,Ke Qin,Tao He
Main category: cs.CV
TL;DR: 本文提出RETA框架,通过动态检索连接(DRC)和持续拓扑对齐(PTA)改进解耦式数据集蒸馏,显著提升合成图像的多样性与泛化能力,在ImageNet-1K上以每类50张图像达到64.3% top-1准确率,超越先前最优方法3.1%。
Details
Motivation: 现有解耦式数据集蒸馏方法依赖静态真实图像块进行残差匹配,导致拟合复杂度失配和‘拉向锚点’效应,损害类内多样性和泛化性能。 Method: 提出RETA框架:1)动态检索连接(DRC),在教师特征空间中基于拟合-复杂度评分从预建池中动态选取真实图像块,并通过残差连接注入以平衡拟合精度与复杂度;2)持续拓扑对齐(PTA),利用持久同调构建互k近邻特征图、计算持久性图像(刻画连通分量与环结构),并惩罚真实与合成数据集间的拓扑差异。 Result: 在CIFAR-100、Tiny-ImageNet、ImageNet-1K及多个ImageNet子集上,RETA在相近时间与内存开销下持续优于各类基线;在ImageNet-1K上使用ResNet-18、每类50张合成图像时达64.3% top-1准确率,较此前最优方法提升3.1%。 Conclusion: RETA通过动态检索与拓扑感知正则化有效缓解了拟合复杂度失配与拉锚效应,提升了合成数据的表征多样性与下游泛化能力,为高效数据集蒸馏提供了新范式。 Abstract: Decoupled dataset distillation (DD) compresses large corpora into a few synthetic images by matching a frozen teacher's statistics. However, current residual-matching pipelines rely on static real patches, creating a fit-complexity gap and a pull-to-anchor effect that reduce intra-class diversity and hurt generalization. To address these issues, we introduce RETA -- a Retrieval and Topology Alignment framework for decoupled DD. First, Dynamic Retrieval Connection (DRC) selects a real patch from a prebuilt pool by minimizing a fit-complexity score in teacher feature space; the chosen patch is injected via a residual connection to tighten feature fit while controlling injected complexity. Second, Persistent Topology Alignment (PTA) regularizes synthesis with persistent homology: we build a mutual k-NN feature graph, compute persistence images of components and loops, and penalize topology discrepancies between real and synthetic sets, mitigating pull-to-anchor effect. Across CIFAR-100, Tiny-ImageNet, ImageNet-1K, and multiple ImageNet subsets, RETA consistently outperforms various baselines under comparable time and memory, especially reaching 64.3% top-1 accuracy on ImageNet-1K with ResNet-18 at 50 images per class, +3.1% over the best prior.[110] HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation
Keito Suzuki,Kunyao Chen,Lei Wang,Bang Du,Runfa Blark Li,Peng Liu,Ning Bi,Truong Nguyen
Main category: cs.CV
TL;DR: 本文提出HumanOrbit,一种基于视频扩散模型的多视角人像生成方法,能从单张图像生成360°环绕视频,并重建高保真度的纹理化三维网格。
Details
Motivation: 现有基于图像扩散模型的多视角合成方法在视角间一致性及身份保持方面表现不佳;而视频扩散模型在生成与提示对齐的逼真结果上展现出优势,因此作者尝试将其应用于多视角人像生成任务。 Method: 提出HumanOrbit——一种专用于多视角人像生成的视频扩散模型,支持连续相机绕人旋转建模;并设计后续重建流程,利用生成的多视角帧恢复带纹理的三维网格。 Result: HumanOrbit在多视角图像生成上效果显著,生成结果几何一致、身份保持良好;重建的3D模型在完整性与保真度上优于当前最优基线方法。 Conclusion: 视频扩散模型适配于多视角人体建模任务,HumanOrbit为单图生成360°环绕视频及高质量3D重建提供了新范式。 Abstract: We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity. In contrast, recent video diffusion models have demonstrated their ability in generating photorealistic results that align well with the given prompts. Inspired by these results, we propose HumanOrbit, a video diffusion model for multi-view human image generation. Our approach enables the model to synthesize continuous camera rotations around the subject, producing geometrically consistent novel views while preserving the appearance and identity of the person. Using the generated multi-view frames, we further propose a reconstruction pipeline that recovers a textured mesh of the subject. Experimental results validate the effectiveness of HumanOrbit for multi-view image generation and that the reconstructed 3D models exhibit superior completeness and fidelity compared to those from state-of-the-art baselines.[111] RAViT: Resolution-Adaptive Vision Transformer
Martial Guidez,Stefan Duffner,Christophe Garcia
Main category: cs.CV
TL;DR: 本文提出RAViT框架,通过多分支网络和早期退出机制,在保持视觉Transformer精度的同时显著降低计算成本。
Details
Motivation: 视觉Transformer在计算机视觉中表现出色,但其计算成本远高于卷积神经网络,因此需要一种既能降低计算开销又能保持精度的方法。 Method: 提出基于多分支网络的RAViT框架,对同一图像的不同分辨率副本进行处理,并引入早期退出机制,使模型能在运行时动态权衡精度与计算成本。 Result: 在CIFAR-10、Tiny ImageNet和ImageNet数据集上评估表明,RAViT达到与经典视觉Transformer相当的精度,仅需约70%的FLOPs。 Conclusion: RAViT是一种高效且自适应的图像分类框架,有效缓解了视觉Transformer高计算成本的问题,适用于资源受限场景。 Abstract: Vision transformers have recently made a breakthrough in computer vision showing excellent performance in terms of precision for numerous applications. However, their computational cost is very high compared to alternative approaches such as Convolutional Neural Networks. To address this problem, we propose a novel framework for image classification called RAViT based on a multi-branch network that operates on several copies of the same image with different resolutions to reduce the computational cost while preserving the overall accuracy. Furthermore, our framework includes an early exit mechanism that makes our model adaptive and allows to choose the appropriate trade-off between accuracy and computational cost at run-time. For example in a two-branch architecture, the original image is first resized to reduce its resolution, then a prediction is performed on it using a first transformer and the resulting prediction is reused together with the original-size image to perform a final prediction on a second transformer with less computation than a classical Vision transformer architecture. The early-exit process allows the model to make a final prediction at intermediate branches, saving even more computation. We evaluated our approach on CIFAR-10, Tiny ImageNet, and ImageNet. We obtained an equivalent accuracy to the classical Vision transformer model with only around 70% of FLOPs.[112] Manifold-Preserving Superpixel Hierarchies and Embeddings for the Exploration of High-Dimensional Images
Alexander Vieth,Boudewijn Lelieveldt,Elmar Eisemann,Anna Vilanova,Thomas Höllt
Main category: cs.CV
TL;DR: 本文提出了一种面向高维图像的超像素层次结构,该结构在构建过程中同时考虑高维属性流形和像素空间布局,从而实现图像空间与属性空间的一致性探索。
Details
Motivation: 现有分层降维方法仅基于属性信息构建层次结构,忽略了像素的空间布局,导致图像空间中感兴趣区域与属性层次中对应抽象之间缺乏一致性,阻碍了高维图像的有效探索。 Method: 提出一种图像引导的超像素层次结构构建方法,在层次生成过程中融入高维属性流形信息,使空间邻近性与属性相似性协同作用。 Result: 所提方法在两个用例中相较于传统基于分层嵌入的图像探索方法展现出更优的探索一致性与有效性。 Conclusion: 融合空间布局与属性结构的图像引导层次结构,能显著提升高维图像在图像空间与属性空间中的一致性探索能力。 Abstract: High-dimensional images, or images with a high-dimensional attribute vector per pixel, are commonly explored with coordinated views of a low-dimensional embedding of the attribute space and a conventional image representation. Nowadays, such images can easily contain several million pixels. For such large datasets, hierarchical embedding techniques are better suited to represent the high-dimensional attribute space than flat dimensionality reduction methods. However, available hierarchical dimensionality reduction methods construct the hierarchy purely based on the attribute information and ignore the spatial layout of pixels in the images. This impedes the exploration of regions of interest in the image space, since there is no congruence between a region of interest in image space and the associated attribute abstractions in the hierarchy. In this paper, we present a superpixel hierarchy for high-dimensional images that takes the high-dimensional attribute manifold into account during construction. Through this, our method enables consistent exploration of high-dimensional images in both image and attribute space. We show the effectiveness of this new image-guided hierarchy in the context of embedding exploration by comparing it with classical hierarchical embedding-based image exploration in two use cases.[113] GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
Chao Xu,Xiaochen Zhao,Xiang Deng,Jingxiang Sun,Zhuo Su,Donglin Di,Yebin Liu
Main category: cs.CV
TL;DR: 本文提出了一种基于几何感知扩散模型的框架,用于从单张肖像图像重建高保真、可动画的4D头像,通过联合生成图像与法线图,并结合姿态无关的表情编码器和3D高斯表示,实现几何准确、实时渲染的头像重建。
Details
Motivation: 从单张肖像图像重建真实感强、可驱动的4D头像仍具挑战性;现有扩散模型方法依赖2D先验,难以保证一致的3D几何结构。 Method: 提出几何感知扩散框架:联合合成肖像图像与对应表面法线;引入姿态无关的表情编码器提取隐式表情表征;将合成图像与表情潜变量融入基于3D高斯的头像表示中。 Result: 在视觉质量、表情保真度和跨身份泛化能力上显著优于SOTA方法,并支持实时渲染。 Conclusion: 几何感知扩散建模能有效学习强几何先验,为单图驱动的高保真4D头像重建提供了新范式。 Abstract: Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to learn strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are incorporated into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.[114] A Mixed Diet Makes DINO An Omnivorous Vision Encoder
Rishabh Kabra,Maks Ovsjanikov,Drew A. Hudson,Ye Xia,Skanda Koppula,Andre Araujo,Joao Carreira,Niloy J. Mitra
Main category: cs.CV
TL;DR: 本文提出Omnivorous Vision Encoder,通过联合优化跨模态对齐与知识蒸馏(以DINOv2为冻结教师模型),学习模态无关的统一特征空间,使同一场景的不同模态输入(如RGB、深度图等)生成高度一致且语义丰富的嵌入。
Details
Motivation: 预训练视觉编码器(如DINOv2)在单模态任务中表现优异,但其跨模态特征对齐差——例如同一场景的RGB图像与深度图的特征余弦相似度接近随机图像对,限制了多模态理解能力。 Method: 提出Omnivorous Vision Encoder框架,采用双重训练目标:(1)最大化同一场景不同模态输入(如RGB/深度/分割图)的特征对齐;(2)以冻结的DINOv2等强教师模型输出为锚点进行知识蒸馏。 Result: 所学学生编码器能为同一场景的不同模态输入生成高度一致、语义判别力强的嵌入,在跨模态理解任务中展现出鲁棒性,同时保留原始基础模型的判别能力。 Conclusion: 模态无关的统一特征空间可通过显式跨模态对齐与教师引导的蒸馏协同实现,为构建真正‘多食性’(omnivorous)通用视觉编码器提供了可行路径。 Abstract: Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.[115] A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification
Yixuan Liu,Kanwal K. Bhatia,Ahmed E. Fetit
Main category: cs.CV
TL;DR: 本文提出首个面向医疗应用的多模态自动审计框架,用于发现和解释机器学习医学图像分类器的系统性故障,实验表明其在故障发现与解释生成方面表现优异,并验证了多模态信息对提升审计效果的重要性。
Details
Motivation: 现有基于单模态特征或元数据子组分析的审计方法在可解释性和捕捉隐藏系统性故障方面存在局限,难以保障医学图像分类器的安全性与可靠性。 Method: 提出一种扩展切片发现方法至多模态表征的自动化审计框架,并在MIMIC-CXR-JPG数据集上针对常见故障场景开展综合实验。 Result: 该框架在故障发现与解释生成方面展现出强大能力;多模态信息能实现更全面有效的分类器审计,而资源受限场景下非图像单模态变体也表现出较强潜力。 Conclusion: 多模态审计框架显著提升了医学图像分类器审计的可解释性与有效性,为临床部署提供了更可靠的安全保障手段。 Abstract: Despite advances in machine learning-based medical image classifiers, the safety and reliability of these systems remain major concerns in practical settings. Existing auditing approaches mainly rely on unimodal features or metadata-based subgroup analyses, which are limited in interpretability and often fail to capture hidden systematic failures. To address these limitations, we introduce the first automated auditing framework that extends slice discovery methods to multimodal representations specifically for medical applications. Comprehensive experiments were conducted under common failure scenarios using the MIMIC-CXR-JPG dataset, demonstrating the framework's strong capability in both failure discovery and explanation generation. Our results also show that multimodal information generally allows more comprehensive and effective auditing of classifiers, while unimodal variants beyond image-only inputs exhibit strong potential in scenarios where resources are constrained.[116] SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
Yasaman Haghighi,Alexandre Alahi
Main category: cs.CV
TL;DR: 本文提出了一种基于敏感性分析的动态缓存策略SenCache,用于加速扩散模型视频生成的推理过程,在保持计算预算不变的情况下提升了视觉质量。
Details
Motivation: 扩散模型视频生成推理开销大,现有无训练加速方法(如缓存)依赖经验启发式选择缓存时间步,需大量调参且缺乏理论依据。 Method: 通过分析去噪输入(噪声隐变量和时间步)扰动对模型输出的影响,形式化定义缓存误差,并提出敏感性感知的动态缓存策略SenCache,按样本自适应选择缓存时间步。 Result: 在Wan 2.1、CogVideoX和LTX-Video上实验表明,SenCache在相似计算预算下比现有缓存方法获得更优视觉质量。 Conclusion: SenCache为自适应缓存提供了理论基础,解释了已有启发式方法部分有效的原因,并将其推广为动态、样本特定的缓存策略。 Abstract: Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoising steps. This has motivated a growing line of research on accelerating diffusion inference. Among training-free acceleration methods, caching reduces computation by reusing previously computed model outputs across timesteps. Existing caching methods rely on heuristic criteria to choose cache/reuse timesteps and require extensive tuning. We address this limitation with a principled sensitivity-aware caching framework. Specifically, we formalize the caching error through an analysis of the model output sensitivity to perturbations in the denoising inputs, i.e., the noisy latent and the timestep, and show that this sensitivity is a key predictor of caching error. Based on this analysis, we propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. Our framework provides a theoretical basis for adaptive caching, explains why prior empirical heuristics can be partially effective, and extends them to a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX, and LTX-Video show that SenCache achieves better visual quality than existing caching methods under similar computational budgets.[117] MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
Albert Dominguez Mantes,Gioele La Manno,Martin Weigert
Main category: cs.CV
TL;DR: 本文提出MuViT,一种专为显微镜图像多分辨率分析设计的Transformer架构,通过将不同分辨率的图像块映射到共享的世界坐标系并扩展旋转位置编码,实现宽视场上下文与高分辨率细节的有效融合,在多个显微镜图像任务中显著优于ViT和CNN基线。
Details
Motivation: 现代显微镜图像具有多尺度结构,但现有视觉模型大多单分辨率操作或仅从单一视图提取多尺度特征,难以充分利用其固有的多分辨率特性。 Method: 提出MuViT架构:将不同分辨率的图像块嵌入共享世界坐标系,并扩展旋转位置编码以支持跨尺度注意力;结合多分辨率掩码自编码器(MAE)进行预训练。 Result: 在合成数据集、肾脏组织病理学及小鼠脑高分辨显微图像上,MuViT持续优于强ViT和CNN基线;多分辨率MAE预训练进一步生成尺度一致表征,提升下游任务性能。 Conclusion: 显式建模世界坐标是一种简单而强大的机制,可有效利用大规模显微图像中的多分辨率信息。 Abstract: Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modelling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.[118] Enhancing Spatial Understanding in Image Generation via Reward Modeling
Zhenyu Tang,Chaoran Feng,Yufan Deng,Jie Wu,Xiaojie Li,Rui Wang,Yunpeng Chen,Daquan Zhou
Main category: cs.CV
TL;DR: 本文提出一种增强文本到图像生成模型空间理解能力的新方法,构建了包含8万对偏好数据的SpatialReward-Dataset,并训练出高性能空间评估奖励模型SpatialScore,进而通过在线强化学习显著提升模型对复杂空间关系的生成能力。
Details
Motivation: 现有文本到图像生成模型在处理复杂空间关系时提示词要求高、成功率低,需多次采样,亟需提升其空间理解能力。 Method: 构建大规模空间关系偏好数据集SpatialReward-Dataset;训练专用空间评估奖励模型SpatialScore;将该奖励模型用于在线强化学习以优化生成模型。 Result: SpatialScore在空间关系评估上超越主流闭源模型;在线强化学习后,多个基准测试中空间理解性能获得显著且一致提升。 Conclusion: 专用空间奖励建模与在线强化学习可有效增强文本到图像模型对复杂空间关系的理解与生成能力,为细粒度可控生成提供新路径。 Abstract: Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity-particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for the complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.[119] Joint Geometric and Trajectory Consistency Learning for One-Step Real-World Super-Resolution
Chengyan Deng,Zhangquan Chen,Li Yu,Kai Zhang,Xue Zhou,Wang Zhang
Main category: cs.CV
TL;DR: 本文提出GTASR方法,通过轨迹对齐与双参考结构校正机制,解决一致性模型在真实图像超分中的一致性漂移与几何解耦问题,实现高效高质量重建。
Details
Motivation: 扩散模型在真实图像超分中计算开销大;蒸馏方法参数量大且受限于教师模型能力;一致性模型虽轻量但存在一致性漂移和几何解耦问题。 Method: 提出GTASR框架,包含轨迹对齐(TA)策略(通过全路径投影校正切向量场)和双参考结构校正(DRSR)机制(施加严格结构约束)。 Result: GTASR在保持低延迟的同时,在多个基准上显著优于代表性方法。 Conclusion: GTASR是一种简单有效的一致性训练范式,能兼顾真实图像超分的生成质量、结构保真度与推理效率。 Abstract: Diffusion-based Real-World Image Super-Resolution (Real-ISR) achieves impressive perceptual quality but suffers from high computational costs due to iterative sampling. While recent distillation approaches leveraging large-scale Text-to-Image (T2I) priors have enabled one-step generation, they are typically hindered by prohibitive parameter counts and the inherent capability bounds imposed by teacher models. As a lightweight alternative, Consistency Models offer efficient inference but struggle with two critical limitations: the accumulation of consistency drift inherent to transitive training, and a phenomenon we term "Geometric Decoupling" - where the generative trajectory achieves pixel-wise alignment yet fails to preserve structural coherence. To address these challenges, we propose GTASR (Geometric Trajectory Alignment Super-Resolution), a simple yet effective consistency training paradigm for Real-ISR. Specifically, we introduce a Trajectory Alignment (TA) strategy to rectify the tangent vector field via full-path projection, and a Dual-Reference Structural Rectification (DRSR) mechanism to enforce strict structural constraints. Extensive experiments verify that GTASR delivers superior performance over representative baselines while maintaining minimal latency. The code and model will be released at https://github.com/Blazedengcy/GTASR.[120] Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models
Arnas Uselis,Andrea Dittadi,Seong Joon Oh
Main category: cs.CV
TL;DR: 本文提出并形式化了组合泛化所需的三个理想性质(可分性、可迁移性、稳定性),证明其对表征几何结构施加了必要约束:概念表征需线性分解且跨概念正交,从而为‘线性表征假说’提供了理论依据,并通过多视觉模型实证验证了该结构与组合泛化能力的相关性。
Details
Motivation: 现代大模型虽训练于海量数据,但仍无法覆盖组合输入的指数级空间,因此需探究支持组合泛化的表征应具备何种结构。 Method: 形式化定义组合泛化的三个核心性质(divisibility, transferability, stability),推导其对表征几何的必要约束(线性可分解性与跨概念正交性),并结合理论分析与实证检验(在CLIP、SigLIP、DINO等模型上测量线性因子结构及其与泛化性能的相关性)。 Result: 理论证明组合泛化要求表征必须线性分解且各概念成分近似正交;实验发现主流视觉模型表征呈现部分线性因子化、低秩、近正交特性,且该结构程度与组合泛化能力显著正相关。 Conclusion: 线性表征结构并非偶然现象,而是组合泛化的必然结果;该工作为理解神经网络表征几何与泛化能力的关系提供了理论基础和实证支持。 Abstract: Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of what structure representations must have to support generalization to unseen combinations. We formalize three desiderata for compositional generalization under standard training (divisibility, transferability, stability) and show they impose necessary geometric constraints: representations must decompose linearly into per-concept components, and these components must be orthogonal across concepts. This provides theoretical grounding for the Linear Representation Hypothesis: the linear structure widely observed in neural representations is a necessary consequence of compositional generalization. We further derive dimension bounds linking the number of composable concepts to the embedding geometry. Empirically, we evaluate these predictions across modern vision models (CLIP, SigLIP, DINO) and find that representations exhibit partial linear factorization with low-rank, near-orthogonal per-concept factors, and that the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the representational geometry they may converge to. Code is available at https://github.com/oshapio/necessary-compositionality.[121] Hierarchical Action Learning for Weakly-Supervised Action Segmentation
Junxian Huang,Ruichu Cai,Hao Zhu,Juntao Fang,Boyan Xu,Weilin Chen,Zijian Li,Shenghua Gao
Main category: cs.CV
TL;DR: 本文提出了一种名为HAL的分层动作学习模型,用于弱监督动作分割,通过建模高低层潜在变量的不同演化速率,并利用分层因果生成过程与稀疏转移约束,实现了对高层动作变量的严格可识别性,在多个基准上显著优于现有方法。
Details
Motivation: 人类能通过关键过渡点在多抽象层次上理解动作,而机器依赖视觉特征易过度分割;高低层潜在变量演化速率不同(低层快、高层慢),为分层推理提供了新思路。 Method: 提出HAL模型:构建分层因果数据生成过程,高层动作潜变量控制底层视觉特征动态;引入确定性过程对齐时序;采用分层金字塔Transformer提取特征与潜变量;施加稀疏转移约束以强制高层变量慢演化;并在一定假设下证明高层动作变量的严格可识别性。 Result: 在多个基准数据集上,HAL模型在弱监督动作分割任务中显著优于现有方法。 Conclusion: HAL模型通过建模多尺度时间动态和引入可识别性理论保障,有效提升了弱监督视频动作分割性能,验证了分层因果建模在视频理解中的有效性。 Abstract: Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that lower-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (\textbf{HAL}) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent action governs the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The \textbf{HAL} model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the \textbf{HAL} model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.[122] Mode Seeking meets Mean Seeking for Fast Long Video Generation
Shengqu Cai,Weili Nie,Chao Liu,Julius Berner,Lvmin Zhang,Nanye Ma,Hansheng Chen,Maneesh Agrawala,Leonidas Guibas,Gordon Wetzstein,Arash Vahdat
Main category: cs.CV
TL;DR: 本文提出了一种结合模式寻求(mode seeking)与均值寻求(mean seeking)的解耦扩散Transformer训练范式,以解决长视频生成中长程连贯性与局部高保真难以兼顾的问题。
Details
Motivation: 短时视频数据丰富且高保真,但长时连贯视频数据稀缺且领域受限,导致视频生成难以从秒级扩展到分钟级。 Method: 设计解耦扩散Transformer:全局Flow Matching头通过监督学习在少量长视频上建模叙事结构;局部Distribution Matching头通过反向KL散度对齐滑动窗口与冻结的短视频教师模型,实现模式寻求。 Result: 实现了分钟级视频的快速生成(few-step),在局部清晰度、运动自然性和长程一致性三方面同步提升,显著缩小了保真度与生成时长之间的鸿沟。 Conclusion: 该解耦范式有效分离并协同优化了长视频生成中的局部真实感与全局连贯性,为长时视频生成提供了可扩展、高效且高质量的新路径。 Abstract: Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos that learns long-range coherence and motions from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to a frozen short-video teacher, resulting in a few-step fast long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion and long-range consistency. Project website: https://primecai.github.io/mmm/.[123] UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images
Junhwa Hur,Charles Herrmann,Songyou Peng,Philipp Henzler,Zeyu Ma,Todd Zickler,Deqing Sun
Main category: cs.CV
TL;DR: UFO-4D 是一种统一的前馈框架,能从一对无位姿图像中直接重建密集、显式的动态 4D 场景(即带运动的 3D 高斯点阵),通过共享几何基元实现外观、深度与运动的联合自监督优化,显著提升几何、运动和相机位姿估计精度,并支持高质量时空新视角合成。