cs.CL

[1] QueryPlot: Generating Geological Evidence Layers using Natural Language Queries for Mineral Exploration

Meng Ye,Xiao Lin,Georgina Lukoczki,Graham W. Lederer,Yi Yao

Main category: cs.CL

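TL;DR: This paper presents QueryPlot, a framework that uses natural language processing to fuse textual geological knowledge with geospatial geologic map data, enabling mineral prospectivity mapping through semantic retrieval, with interactive querying and GIS-compatible output.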

Details Motivation: Traditional mineral prospectivity mapping is manual and knowledge-intensive, and it struggles to integrate heterogeneous geological knowledge such as textual deposit models and geospatial data. Method: The authors curate descriptive models for over 120 deposit types; convert State Geologic Map Compilation (SGMC) polygons into structured text; encode user natural language queries and region descriptions with a pretrained embedding model and compute semantic similarity to produce continuous evidence layers; support compositional queries and multi-criteria analysis; and feed similarity scores as features into supervised learning pipelines. Result: In a tungsten skarn case study, the method achieves high recall of known occurrences, and its predictions closely align with expert-defined permissive tracts; adding similarity features measurably improves classification performance. Conclusion: QueryPlot offers a scalable, interactive, and reusable approach to knowledge-driven mineral exploration; the source code and datasets are publicly available. Abstract: Mineral prospectivity mapping requires synthesizing heterogeneous geological knowledge, including textual deposit models and geospatial datasets, to identify regions likely to host specific mineral deposit types. This process is traditionally manual and knowledge-intensive. We present QueryPlot, a semantic retrieval and mapping framework that integrates large-scale geological text corpora with geologic map data using modern Natural Language Processing techniques. We curate descriptive deposit models for over 120 deposit types and transform the State Geologic Map Compilation (SGMC) polygons into structured textual representations. Given a user-defined natural language query, the system encodes both queries and region descriptions using a pretrained embedding model and computes semantic similarity scores to rank and spatially visualize regions as continuous evidence layers. QueryPlot supports compositional querying over deposit characteristics, enabling aggregation of multiple similarity-derived layers for multi-criteria prospectivity analysis. In a case study on tungsten skarn deposits, we demonstrate that embedding-based retrieval achieves high recall of known occurrences and produces prospective regions that closely align with expert-defined permissive tracts. Furthermore, similarity scores can be incorporated as additional features in supervised learning pipelines, yielding measurable improvements in classification performance. QueryPlot is implemented as a web-based system supporting interactive querying, visualization, and export of GIS-compatible prospectivity layers. To support future research, we have made the source code and datasets used in this study publicly available.
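The retrieval step described in this abstract can be illustrated with a minimal sketch. It assumes the embeddings already exist (the 3-dimensional vectors and the region ids `poly_1` to `poly_3` below are hypothetical placeholders, not SGMC data), and it aggregates layers with an unweighted mean, which is an assumed rule since the abstract does not state how layers are combined.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def evidence_layer(query_vec, region_vecs):
    """Score every region against one query; returns {region_id: similarity}."""
    return {rid: cosine(query_vec, vec) for rid, vec in region_vecs.items()}

def combine_layers(layers):
    """Aggregate several layers with an unweighted mean (an assumed rule)."""
    region_ids = layers[0].keys()
    return {rid: sum(layer[rid] for layer in layers) / len(layers)
            for rid in region_ids}

# Hypothetical embeddings; a real system would obtain these from a pretrained encoder.
regions = {
    "poly_1": [1.0, 0.0, 0.0],
    "poly_2": [0.0, 1.0, 0.0],
    "poly_3": [1.0, 1.0, 0.0],
}
query_a = [1.0, 0.0, 0.0]  # stands in for one deposit characteristic
query_b = [0.0, 1.0, 0.0]  # stands in for a second characteristic

layer_a = evidence_layer(query_a, regions)
layer_b = evidence_layer(query_b, regions)
combined = combine_layers([layer_a, layer_b])

# Regions ranked from most to least prospective under both criteria.
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

In this toy setup the region that partially matches both queries ranks above regions that match only one, which is the intended effect of a multi-criteria layer.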

[2] Neural Synchrony Between Socially Interacting Language Models

Zhining Zhang,Wentao Zhu,Chi Han,Yizhou Wang,Heng Ji

Main category: cs.CL

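TL;DR: This paper proposes analyzing neural synchrony between large language models (LLMs) during social simulations to assess whether they exhibit human-like "social minds"; experiments show that neural synchrony between LLMs is strongly correlated with social performance, revealing unexpected parallels between their internal dynamics and human social interaction.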

Details Motivation: Social minds have traditionally been regarded as exclusive to living beings, and although LLMs can approximate human behavior, whether they are social in any meaningful sense remains contested; this paper aims to provide quantifiable empirical evidence for that debate. Method: Introduces neural synchrony during social simulations as a new proxy for LLM sociality, quantifies synchrony at the representational level through carefully designed multi-LLM interaction experiments, and analyzes its relationship to social engagement and temporal alignment. Result: Neural synchrony between LLMs reliably reflects engagement and temporal alignment in their social interactions and is strongly correlated with social task performance; the effect is consistent across models, tasks, and interaction settings. Conclusion: Neural synchrony can serve as an effective indicator for assessing the "social minds" of LLMs, and it reveals deep similarities between LLMs and humans in the dynamics underlying social interaction. Abstract: Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction. Traditionally, social minds have been regarded as an exclusive property of living beings. Although large language models (LLMs) are widely accepted as powerful approximations of human behavior, with multi-LLM systems being extensively explored to enhance their capabilities, it remains controversial whether they can be meaningfully compared to human social minds. In this work, we explore neural synchrony between socially interacting LLMs as empirical evidence for this debate. Specifically, we introduce neural synchrony during social simulations as a novel proxy for analyzing the sociality of LLMs at the representational level. Through carefully designed experiments, we demonstrate that it reliably reflects both social engagement and temporal alignment in their interactions. Our findings indicate that neural synchrony between LLMs is strongly correlated with their social performance, highlighting an important link between neural synchrony and the social behaviors of LLMs. Our work offers a new perspective to examine the "social minds" of LLMs, highlighting surprising parallels in the internal dynamics that underlie human and LLM social interaction.

[3] On the scaling relationship between cloze probabilities and language model next-token prediction

Cassandra L. Jacobs,Morgan Grobol

Main category: cs.CL

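TL;DR: This paper examines how well large language models predict eye movement and reading time data, finding that larger models, while less sensitive to the low-level information relevant for word recognition, align better semantically with human cloze responses, which improves prediction quality.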

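Details Motivation: To investigate why larger language models better predict eye movement and reading time data, and how this relates to human language processing. Method: Analyzes how well large language models predict eye movement, reading time, and cloze data, comparing their sensitivity to lexical co-occurrence statistics and their degree of semantic alignment. Result: Larger language models assign higher-quality next-token probability estimates in cloze tasks and align better semantically with human responses, but are less sensitive to low-level information such as lexical co-occurrence. Conclusion: The greater memorization capacity of larger models helps them guess more semantically appropriate words, but reduces their sensitivity to the low-level information needed for word recognition. Abstract: Recent work has shown that larger language models have better predictive power for eye movement and reading time data. While even the best models under-allocate probability mass to human responses, larger models assign higher-quality estimates of next tokens and their likelihood of production in cloze data because they are less sensitive to lexical co-occurrence statistics while being better aligned semantically to human cloze responses. The results provide support for the claim that the greater memorization capacity of larger models helps them guess more semantically appropriate words, but makes them less sensitive to low-level information that is relevant for word recognition.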

[4] Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations

Joschka Braun

Main category: cs.CL

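TL;DR: This paper studies the reliability of steering vectors for controlling language model behavior. Reliability is higher when training activation differences have high cosine similarity and when positive and negative activations are well separated along the steering direction; steering vectors trained on different prompt variations point in different directions yet perform similarly. The results suggest that steering is unreliable when a linear steering direction cannot effectively approximate a non-linear latent representation of the target behavior.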

Details Motivation: Steering vectors are lightweight and effective on average, but their effect varies across samples and target behaviors; the causes of this variation and the influence of training data need to be understood. Method: An empirical study of the cosine similarity between training activation differences, the separation of positive and negative activations along the steering direction, and the relationship between direction and performance for steering vectors trained on different prompt variations. Result: (1) Higher cosine similarity between training activation differences predicts more reliable steering; (2) better separation of positive and negative activations along the steering direction predicts more reliable steering; (3) steering vectors trained on different prompt variations differ in direction but perform similarly and show correlated efficacy. Conclusion: Steering vectors are unreliable when a linear direction cannot approximate a non-linear latent behavior representation; the findings offer a practical diagnostic for unreliability and motivate more robust steering methods that explicitly model non-linear representations. Abstract: Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time. Although effective on average, steering effect sizes vary across samples and are unreliable for many target behaviors. In my thesis, I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. First, I find that higher cosine similarity between training activation differences predicts more reliable steering. Second, I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable. Finally, steering vectors trained on different prompt variations are directionally distinct, yet perform similarly well and exhibit correlated efficacy across datasets. My findings suggest that steering vectors are unreliable when the latent target behavior representation is not effectively approximated by the linear steering direction. Taken together, these insights offer a practical diagnostic for steering unreliability and motivate the development of more robust steering methods that explicitly account for non-linear latent behavior representations.
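The two geometric predictors named in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the thesis's code: the steering vector is taken to be the mean of paired activation differences, consistency is measured as the mean pairwise cosine similarity of those differences, and separation is measured as the gap between mean projections in units of pooled standard deviation. The function names and the 2-dimensional toy activations are hypothetical.

```python
import math
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    denom = norm(u) * norm(v)
    return dot(u, v) / denom if denom else 0.0

def steering_vector(pos, neg):
    """Return (mean difference vector, list of paired differences pos[i] - neg[i])."""
    diffs = [[a - b for a, b in zip(p, n)] for p, n in zip(pos, neg)]
    dim = len(diffs[0])
    mean = [sum(d[j] for d in diffs) / len(diffs) for j in range(dim)]
    return mean, diffs

def mean_pairwise_cosine(diffs):
    """Average cosine similarity over all pairs of activation differences."""
    pairs = list(combinations(diffs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def separation(pos, neg, direction):
    """Gap between mean projections onto a non-zero direction, in pooled-std units."""
    length = norm(direction)
    unit = [x / length for x in direction]
    p = [dot(a, unit) for a in pos]
    n = [dot(a, unit) for a in neg]
    mean_p, mean_n = sum(p) / len(p), sum(n) / len(n)
    var = (sum((x - mean_p) ** 2 for x in p)
           + sum((x - mean_n) ** 2 for x in n)) / (len(p) + len(n))
    return (mean_p - mean_n) / math.sqrt(var) if var > 0 else float("inf")

# Toy activations (hypothetical): positives are shifted along the first axis.
pos = [[2.0, 0.0], [2.0, 1.0], [3.0, 0.5]]
neg = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.5]]

vec, diffs = steering_vector(pos, neg)
consistency = mean_pairwise_cosine(diffs)
gap = separation(pos, neg, vec)
print(vec, consistency, gap)
```

Under the abstract's findings, a dataset with low `consistency` or a small `gap` would be a candidate for unreliable steering; the thresholds one would use in practice are not given in the source.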

[5] Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions

Raymond Li,Amirhossein Abaskohi,Chuyuan Li,Gabriel Murray,Giuseppe Carenini

Main category: cs.CL

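TL;DR: This paper proposes improving neural topic models with semantically enriched soft labels generated by a language model; training the topic model to reconstruct these soft labels improves topic quality and document retrieval.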

Details Motivation: Traditional neural topic models are trained only to reconstruct Bag-of-Words (BoW) representations, which ignores contextual information and suffers from data sparsity. Method: A language model (LM), conditioned on a specialized prompt, predicts next-token probabilities, which are projected onto a pre-defined vocabulary to form semantically rich soft labels; the topic model is then trained to reconstruct these soft labels from the LM hidden states. Result: On three datasets, the method substantially improves topic coherence and purity and outperforms existing methods on a new retrieval-based metric. Conclusion: The method improves both topic modeling quality and semantic retrieval, giving neural topic models a more robust and semantically grounded supervision signal. Abstract: Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we propose a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals. By training the topic models to reconstruct the soft labels using the LM hidden states, our method produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Experiments on three datasets show that our method achieves substantial improvements in topic coherence and purity over existing baselines. Additionally, we introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.
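The projection step in this abstract can be pictured with a small sketch. It assumes the simplest possible projection, keeping the probability mass of in-vocabulary tokens and renormalizing; the paper may map subword tokens to vocabulary words differently. The function name `project_to_vocab` and the toy probabilities are hypothetical.

```python
def project_to_vocab(next_token_probs, vocab):
    """Restrict an LM next-token distribution to `vocab` and renormalize.

    next_token_probs: dict mapping token string -> probability.
    vocab: ordered list of words in the topic model's vocabulary.
    Returns a list of soft-label probabilities aligned with `vocab`.
    """
    mass = [next_token_probs.get(word, 0.0) for word in vocab]
    total = sum(mass)
    if total == 0.0:
        # No overlap with the vocabulary: fall back to a uniform target.
        return [1.0 / len(vocab)] * len(vocab)
    return [m / total for m in mass]

# Toy LM output (hypothetical); "the" is outside the topic vocabulary.
lm_probs = {"cat": 0.5, "dog": 0.25, "the": 0.25}
vocab = ["cat", "dog", "fish"]

soft_label = project_to_vocab(lm_probs, vocab)
print(soft_label)
```

Unlike a BoW target, the resulting distribution can put mass on words that never appear in the document, which is one way a soft label can ease data sparsity.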

[6] Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

Jash Rajesh Parekh,Wonbin Kweon,Joey Chan,Rezarta Islamaj,Robert Leaman,Pengcheng Jiang,Chih-Hsuan Wei,Zhizheng Wang,Zhiyong Lu,Jiawei Han

Main category: cs.CL

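TL;DR: This paper proposes the CondMedQA benchmark and the Condition-Gated Reasoning (CGR) framework to address the fact that existing biomedical question answering systems ignore patient-specific conditions such as comorbidities and contraindications. It is the first work to focus on conditional clinical reasoning, and it improves the reliability of answer selection by building condition-aware knowledge graphs.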

Details Motivation: Existing biomedical QA systems assume that medical knowledge applies universally, overlooking that clinical decisions depend heavily on patient-specific conditions such as comorbidities and contraindications; benchmarks for evaluating conditional reasoning and mechanisms for condition-appropriate knowledge use are both lacking. Method: Proposes CondMedQA, the first benchmark of multi-hop questions for conditional biomedical QA, and Condition-Gated Reasoning (CGR), a framework that builds condition-aware knowledge graphs and activates or prunes reasoning paths based on query conditions. Result: CGR selects condition-appropriate answers more reliably on CondMedQA while matching or exceeding state-of-the-art performance on mainstream biomedical QA benchmarks. Conclusion: Explicitly modeling the conditionality of medical knowledge is essential for robust, trustworthy clinical reasoning; CondMedQA and CGR provide a new benchmark and method for conditional biomedical AI. Abstract: Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to the given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.
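The idea of gating reasoning paths on query conditions can be pictured with a toy sketch. It is not the CGR implementation: the edge schema (a fourth field listing conditions under which an edge is blocked), the single-hop lookup, and every entity name (`condition_X`, `drug_A`, `comorbidity_Y`) are hypothetical placeholders with no clinical meaning.

```python
# Each edge: (head, relation, tail, conditions under which the edge is blocked).
EDGES = [
    ("condition_X", "treated_by", "drug_A", {"comorbidity_Y"}),
    ("condition_X", "treated_by", "drug_B", set()),
]

def gated_edges(edges, patient_conditions):
    """Keep only edges whose blocking conditions do not apply to this patient."""
    return [
        (head, relation, tail)
        for head, relation, tail, blocked_if in edges
        if not (blocked_if & patient_conditions)
    ]

def answers(edges, patient_conditions, start, relation):
    """Single-hop lookup over the gated graph, sorted for stable output."""
    return sorted(
        tail
        for head, rel, tail in gated_edges(edges, patient_conditions)
        if head == start and rel == relation
    )

# Same question, different patient context, different answer set.
no_comorbidity = answers(EDGES, set(), "condition_X", "treated_by")
with_comorbidity = answers(EDGES, {"comorbidity_Y"}, "condition_X", "treated_by")
print(no_comorbidity, with_comorbidity)
```

The sketch shows why an unconditional graph lookup can return an answer that is correct in general but inappropriate for a specific patient.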

[7] Analyzing LLM Instruction Optimization for Tabular Fact Verification

Xiaotang Du,Giwon Hong,Wai-Chung Kwan,Rohit Saxena,Ivan Titov,Pasquale Minervini,Emily Allaway

Main category: cs.CL

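TL;DR: This paper presents the first systematic comparison of DSPy-based instruction optimization for tabular fact verification. It evaluates four prompting techniques (direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution) and three DSPy optimizers (COPRO, MiPROv2, and SIMBA). Instruction optimization consistently improves accuracy: MiPROv2 gives the most stable gains for CoT, SIMBA gives the largest gains for ReAct at larger model scales, and the analysis shows that SIMBA encourages more direct reasoning paths and fewer unnecessary tool calls.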

Details Motivation: No systematic comparison of instruction optimization exists for tabular fact verification, and its behavior across combinations of prompting techniques and optimizers is unclear. Method: Using the DSPy framework, the authors evaluate four prompting techniques (direct prediction, CoT, ReAct with SQL, CodeAct with Python) and three optimizers (COPRO, MiPROv2, SIMBA) on four benchmarks and three model families, and conduct behavioral analyses to understand how optimization works. Result: Instruction optimization consistently improves tabular fact verification accuracy; MiPROv2 is the most stable for CoT, and SIMBA yields the largest gains for ReAct, especially with larger models; SIMBA applies heuristics that encourage more direct reasoning paths, improving numerical comparison and reducing unnecessary tool calls; CoT remains advantageous with smaller models, whereas ReAct needs careful optimization to become competitive with larger models. Conclusion: Instruction optimization is an effective, lightweight way to improve tabular fact verification; different prompting techniques suit different optimizers, so the combination should be chosen according to model scale and task; SIMBA is particularly well suited to settings involving complex reasoning and tool calls. Abstract: Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization, based on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting techniques that cover both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution. We study three optimizers from the DSPy framework -- COPRO, MiPROv2, and SIMBA -- across four benchmarks and three model families. We find that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales. Behavioral analyses reveal that SIMBA encourages more direct reasoning paths by applying heuristics, thereby improving numerical comparison abilities in CoT reasoning and helping avoid unnecessary tool calls in ReAct agents. Across different prompting techniques, CoT remains effective for tabular fact checking, especially with smaller models. Although ReAct agents built with larger models can achieve competitive performance, they require careful instruction optimization.

[8] CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Victoria Blake,Mathew Miller,Jamie Novak,Sze-yuan Ooi,Blanca Gallego

Main category: cs.CL

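TL;DR: This paper presents CUICurate, a framework that uses graph-based retrieval-augmented generation (GraphRAG), combining a UMLS knowledge graph with large language models (GPT-5 and GPT-5-mini), to build clinically relevant UMLS concept sets automatically, matching human precision while substantially improving coverage and scalability.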

Details Motivation: Existing clinical named entity recognition tools map text to single UMLS CUIs, but downstream tasks often need concept sets of semantically related synonyms, subtypes, and supertypes; building such sets by hand is time-consuming, inconsistent, and poorly supported by tools. Method: A UMLS knowledge graph is constructed and semantically embedded; for each target concept, candidate CUIs are retrieved from the graph and then passed through two LLM stages, filtering and classification, using GPT-5 and GPT-5-mini; the framework is evaluated on five heterogeneous clinical concepts. Result: CUICurate produces concept sets that are more complete (higher recall) than the manual benchmarks at comparable precision; GPT-5-mini achieves higher recall in filtering, while GPT-5 classifications align more closely with clinician judgements; outputs are stable and computationally inexpensive. Conclusion: CUICurate is a scalable, reproducible approach to automated UMLS concept set curation that reduces manual effort by combining graph retrieval with LLM reasoning and can be adapted to a range of clinical NLP tasks. Abstract: Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and supertypes. Constructing such concept sets is labour-intensive, inconsistently performed, and poorly supported by existing tools, particularly for NLP pipelines that operate directly on UMLS CUIs. Methods: We present CUICurate, a Graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. For each target concept, candidate CUIs were retrieved from the KG, followed by large language model (LLM) filtering and classification steps comparing two LLMs (GPT-5 and GPT-5-mini). The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets. Results: Across all concepts, CUICurate produced substantially larger and more complete concept sets than the manual benchmarks whilst matching human precision. Comparisons between the two LLMs found that GPT-5-mini achieved higher recall during filtering, while GPT-5 produced classifications that more closely aligned with clinician judgements. Outputs were stable across repeated runs and computationally inexpensive. Conclusions: CUICurate offers a scalable and reproducible approach to support UMLS concept set curation that substantially reduces manual effort. By integrating graph-based retrieval with LLM reasoning, the framework produces focused candidate concept sets that can be adapted to clinical NLP pipelines for different phenotyping and analytic requirements.

[9] Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

Amine Kobeissi,Philippe Langlais

Main category: cs.CL

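TL;DR: This paper studies the reliability of retrieval-augmented generation (RAG) for high-stakes financial question answering, focusing on "within-document retrieval failure", a common but overlooked failure mode in which the correct document is retrieved but the specific page or chunk containing the answer is not. The authors propose multi-granularity (document, page, chunk) retrieval evaluation and an oracle analysis, and they design a page-level bi-encoder scorer that exploits the semantic coherence of pages in financial filings, which significantly improves page recall and chunk retrieval on a FinanceBench subset.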

Details Motivation: Financial QA systems often generate wrong answers because they fail to locate the answer-bearing page or chunk within an already retrieved document; this within-document retrieval failure has received little systematic study and undermines reliability in high-stakes settings. Method: Proposes a multi-granularity (document, page, chunk) retrieval evaluation framework and an oracle analysis; compares dense, sparse, hybrid, and hierarchical retrieval strategies; and introduces a page-level bi-encoder scorer fine-tuned on the financial domain that treats pages as an intermediate retrieval unit between documents and chunks. Result: On a 150-question subset of FinanceBench, the proposed page-level scorer significantly improves page recall and chunk retrieval; the oracle analysis shows that page- and chunk-level retrieval still have clear headroom. Conclusion: Pages are valuable as a semantically coherent intermediate retrieval unit; fine-tuning a page-level retrieval model for financial regulatory filings mitigates within-document retrieval failure and improves the reliability of RAG systems for high-stakes financial QA. Abstract: Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high-stakes settings. We study a frequent failure mode in which the correct document is retrieved but the page or chunk that contains the answer is missed, leading the generator to extrapolate from incomplete context. Despite its practical significance, this within-document retrieval failure mode has received limited systematic attention in the Financial Question Answering (QA) literature. We evaluate retrieval at multiple levels of granularity (document, page, and chunk) and introduce an oracle-based analysis to provide empirical upper bounds on retrieval and generative performance. On a 150-question subset of FinanceBench, we reproduce and compare diverse retrieval strategies including dense, sparse, hybrid, and hierarchical methods with reranking and query reformulation. Across methods, gains in document discovery tend to translate into stronger page recall, yet oracle performance still suggests headroom for page- and chunk-level retrieval. To target this gap, we introduce a domain fine-tuned page scorer that treats pages as an intermediate retrieval unit between documents and chunks. Unlike prior passage-based hierarchical retrieval, we fine-tune a bi-encoder specifically for page-level relevance on financial filings, exploiting the semantic coherence of pages. Overall, our results demonstrate a significant improvement in page recall and chunk retrieval.
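Evaluation at document, page, and chunk granularity, as described in this abstract, can be sketched with a small recall computation. The data layout is an assumption: each retrieved item and each gold location is a `(document, page, chunk)` tuple, and a hit at a given level means the first one, two, or three fields match. The filing ids and page numbers are made up.

```python
LEVEL_DEPTH = {"document": 1, "page": 2, "chunk": 3}

def hit(retrieved, gold, level):
    """True if any retrieved item matches the gold location at this granularity."""
    depth = LEVEL_DEPTH[level]
    return any(item[:depth] == gold[:depth] for item in retrieved)

def recall_at_k(runs, level, k):
    """Fraction of questions whose top-k results contain a hit at `level`."""
    hits = sum(hit(retrieved[:k], gold, level) for retrieved, gold in runs)
    return hits / len(runs)

# Each run: (ranked retrieved locations, gold location). All ids are hypothetical.
runs = [
    ([("10K_a", 12, 3), ("10K_a", 40, 1)], ("10K_a", 12, 3)),  # chunk found
    ([("10K_b", 5, 0), ("10K_b", 6, 2)], ("10K_b", 30, 1)),    # right document, wrong page
    ([("10K_c", 7, 1)], ("10K_d", 7, 1)),                      # wrong document
    ([("10K_e", 9, 0)], ("10K_e", 9, 4)),                      # right page, wrong chunk
]

doc_recall = recall_at_k(runs, "document", k=2)
page_recall = recall_at_k(runs, "page", k=2)
chunk_recall = recall_at_k(runs, "chunk", k=2)
print(doc_recall, page_recall, chunk_recall)
```

The gap between `doc_recall` and `page_recall` is one simple way to quantify the within-document failure mode: questions where the right filing was found but the answer-bearing page was not.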

[10] Towards More Standardized AI Evaluation: From Models to Agents

Ali El Filali,Inès Bedar

Main category: cs.CL

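TL;DR: This paper argues that as AI systems evolve from static models into compound, tool-using agents, evaluation is no longer the final step of the machine learning lifecycle but a core control function; traditional evaluation based on static benchmarks and aggregate scores no longer suffices, and evaluation should be repositioned as a measurement discipline that supports trust, iteration, and governance.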

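Details Motivation: AI systems are shifting from static models to dynamic, tool-using agents, yet evaluation practice still relies on static assumptions from the model-centric era, producing distorted results that mislead teams and fail to address trustworthiness under change and at scale. Method: A critical re-examination of the role of evaluation, analyzing the silent failure modes that evaluation pipelines introduce, why high benchmark scores often mislead development teams, and how agentic systems fundamentally change what performance measurement means. Result: The paper exposes the limits of mainstream evaluation practice, arguing that it increasingly obscures rather than illuminates system behavior, and that evaluation should move from "performance theater" to a measurement discipline supporting trust, iteration, and governance. Conclusion: In the AI era, and especially for agents, the core purpose of evaluation is not to produce a score but to act as a rigorous measurement discipline in service of system trustworthiness, continuous iteration, and effective governance. Abstract: Evaluation is no longer a final checkpoint in the machine learning lifecycle. As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function. The question is no longer "How good is the model?" but "Can we trust the system to behave as intended, under change, at scale?". Yet most evaluation practices remain anchored in assumptions inherited from the model-centric era: static benchmarks, aggregate scores, and one-off success criteria. This paper argues that such approaches increasingly obscure rather than illuminate system behavior. We examine how evaluation pipelines themselves introduce silent failure modes, why high benchmark scores routinely mislead teams, and how agentic systems fundamentally alter the meaning of performance measurement. Rather than proposing new metrics or harder benchmarks, we aim to clarify the role of evaluation in the AI era, and especially for agents: not as performance theater, but as a measurement discipline that conditions trust, iteration, and governance in non-deterministic systems.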

[11] Perceived Political Bias in LLMs Reduces Persuasive Abilities

Matthew DiGiuseppe,Joshua Robison

Main category: cs.CL

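TL;DR: Using a U.S. survey experiment, this study finds that when users are told a chatbot (ChatGPT) is biased against their political party, its persuasive effect drops by 28%, indicating that the political persuasiveness of conversational AI depends heavily on users' perception of its political neutrality.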

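Details Motivation: As large language models (LLMs) are increasingly drawn into partisan conflict, elites frequently portray them as ideologically aligned, which may undermine their credibility and effectiveness as tools for correcting public misconceptions; this paper tests whether such credibility attacks materially reduce LLM persuasiveness. Method: A preregistered U.S. online survey experiment (N=2144) in which participants held a three-round conversation with ChatGPT about an economic policy misconception they personally held; the treatment group received a message stating that the LLM was biased against their party, while the control group did not; persuasion measures and transcript analysis were combined to assess the effect. Result: Compared with the neutral control, participants told that the LLM was biased against their party showed 28% less belief revision; transcript analysis shows that the message led participants to push back more and engage less receptively. Conclusion: The persuasive power of conversational AI is not politically neutral but contingent on political context; if the public sees it as a partisan tool, its potential for public education and for correcting misinformation will be severely limited. Abstract: Conversational AI has been proposed as a scalable way to correct public misconceptions and counter the spread of misinformation. Yet its effectiveness may depend on perceptions of its political neutrality. As LLMs enter partisan conflict, elites increasingly portray them as ideologically aligned. We test whether these credibility attacks reduce LLM-based persuasion. In a preregistered U.S. survey experiment (N=2144), participants completed a three-round conversation with ChatGPT about a personally held economic policy misconception. Compared to a neutral control, a short message indicating that the LLM was biased against the respondent's party attenuated persuasion by 28%. Transcript analysis indicates that the warnings alter the interaction: respondents push back more and engage less receptively. These findings suggest that the persuasive impact of conversational AI is politically contingent, constrained by perceptions of partisan alignment.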

[12] Agentic Adversarial QA for Improving Domain-Specific LLMs

Vincent Grari,Ciprian Tomoiaga,Sylvain Lamprier,Tatsunori Hashimoto,Marcin Detyniecki

Main category: cs.CL

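TL;DR: This paper proposes an adversarial question-generation framework that efficiently produces a small set of semantically challenging synthetic questions to improve the adaptation of large language models to specialized domains such as law, addressing the weak reasoning support and poor sample efficiency of existing synthetic data methods.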

Details Motivation: Large language models adapt poorly to specialized domains, and high-quality annotated data is scarce; current synthetic data methods such as paraphrasing and knowledge extraction strengthen factual recall but provide little support for interpretive reasoning, and they produce highly redundant data with poor sample efficiency. Method: An adversarial question-generation framework that compares the outputs of the model to be adapted with those of an expert model grounded in reference documents, identifies comprehension gaps through iterative feedback, and generates a compact set of semantically challenging questions. Result: Experiments on specialized subsets of LegalBench show that the method achieves higher accuracy with fewer synthetic samples. Conclusion: The framework improves the sample efficiency and reasoning ability of domain-specific fine-tuning and offers a new approach to LLM adaptation in low-resource specialized settings. Abstract: Large Language Models (LLMs), despite extensive pretraining on broad internet corpora, often struggle to adapt effectively to specialized domains. There is growing interest in fine-tuning these models for such domains; however, progress is constrained by the scarcity and limited coverage of high-quality, task-relevant data. To address this, synthetic data generation methods such as paraphrasing or knowledge extraction are commonly applied. Although these approaches excel at factual recall and conceptual knowledge, they suffer from two critical shortcomings: (i) they provide minimal support for interpretive reasoning capabilities in these specialized domains, and (ii) they often produce synthetic corpora that are excessively large and redundant, resulting in poor sample efficiency. To overcome these gaps, we propose an adversarial question-generation framework that produces a compact set of semantically challenging questions. These questions are constructed by comparing the outputs of the model to be adapted and a robust expert model grounded in reference documents, using an iterative, feedback-driven process designed to reveal and address comprehension gaps. Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.

[13] Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention

Siya Qi,Yudong Chen,Runcong Zhao,Qinglin Zhu,Zhanghao Hu,Wei Liu,Yulan He,Zheng Yuan,Lin Gui

Main category: cs.CL

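TL;DR: This paper proposes a lightweight hallucination detection method based on the high-frequency components of attention: attention distributions are modeled as discrete signals and their high-frequency components are extracted to capture fine-grained instabilities during generation, which helps identify hallucinated tokens.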

Details Motivation: Existing attention-based hallucination detection methods rely on coarse summaries and struggle to capture fine-grained instabilities in attention during generation; a finer signal-analysis perspective is needed. Method: Attention distributions are modeled as discrete signals, signal processing ideas are used to extract high-frequency components that characterize rapid local changes in attention, and a lightweight hallucination detector is built on these features. Result: On the RAGTruth and HalluRAG benchmarks, the method outperforms verification-based, internal-representation-based, and conventional attention-based methods across multiple models and tasks. Conclusion: High-frequency attention energy is a key signal of hallucinated generation, and the proposed frequency-aware approach provides a new and effective intrinsic cue for LLM hallucination detection. Abstract: Hallucination detection is critical for ensuring the reliability of large language models (LLMs) in context-based generation. Prior work has explored intrinsic signals available during generation, among which attention offers a direct view of grounding behavior. However, existing approaches typically rely on coarse summaries that fail to capture fine-grained instabilities in attention. Inspired by signal processing, we introduce a frequency-aware perspective on attention by analyzing its variation during generation. We model attention distributions as discrete signals and extract high-frequency components that reflect rapid local changes in attention. Our analysis reveals that hallucinated tokens are associated with high-frequency attention energy, reflecting fragmented and unstable grounding behavior. Based on this insight, we develop a lightweight hallucination detector using high-frequency attention features. Experiments on the RAGTruth and HalluRAG benchmarks show that our approach achieves performance gains over verification-based, internal-representation-based, and attention-based methods across models and tasks.
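One way to picture a high-frequency attention feature is the share of a signal's spectral energy that lies above a cutoff frequency. The sketch below is an illustration only: it uses a plain discrete Fourier transform, treats a 1-D list of attention weights as the signal, and picks a cutoff of half the Nyquist frequency; none of these choices is taken from the paper, and the function names are hypothetical.

```python
import cmath

def dft_energy(signal):
    """Squared magnitude of the one-sided DFT (bins 0 .. n // 2)."""
    n = len(signal)
    energies = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        energies.append(abs(coeff) ** 2)
    return energies

def high_frequency_ratio(signal, cutoff=0.5):
    """Share of spectral energy in bins at or above `cutoff` times the top bin."""
    energies = dft_energy(signal)
    total = sum(energies)
    top = len(energies) - 1  # index of the highest one-sided bin
    if total == 0 or top == 0:
        return 0.0
    high = sum(e for k, e in enumerate(energies) if k / top >= cutoff)
    return high / total

# Toy attention traces (hypothetical): one stable, one rapidly alternating.
stable = [0.5] * 8
jagged = [0.9, 0.1] * 4

stable_score = high_frequency_ratio(stable)
jagged_score = high_frequency_ratio(jagged)
print(stable_score, jagged_score)
```

A detector in the spirit of the abstract would feed features like `jagged_score` to a lightweight classifier; how the paper constructs the signal and which bands it uses are not specified in the abstract.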

[14] The Statistical Signature of LLMs

Ortal Hadad,Edoardo Loru,Jacopo Nudo,Niccolò Di Marco,Matteo Cinelli,Walter Quattrociocchi

Main category: cs.CL

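TL;DR: This paper proposes lossless compression as a model-agnostic measure that reveals the structural regularity (higher compressibility) of text generated by large language models, and it tests the generality and scale dependence of this signature across several information ecosystems.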

Details Motivation: Large language models generate text by probabilistic sampling from high-dimensional distributions, but how this process reshapes the statistical structure of language is unclear; a simple, model-agnostic analysis that relies only on surface text is needed. Method: Lossless compression is used as a measure of structural regularity, and the compression behavior of LLM-generated and human text is compared across three information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure, and fully synthetic social interaction. Result: In most settings, LLM-generated text is more structurally regular (more compressible) than human text, but the difference attenuates in fine-grained, fragmented interaction environments; the effect holds across models, tasks, and domains and requires neither access to model internals nor semantic evaluation. Conclusion: Lossless compression provides a simple and robust framework for quantifying the structural effect of generative systems on text production and sheds light on the statistical underpinnings of how the complexity of generated language evolves. Abstract: Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.
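A compression-based regularity measure of the kind described here can be sketched with the standard library. The choice of `zlib` (DEFLATE) and the definition of the ratio as compressed bytes over raw bytes are assumptions for illustration; the abstract does not say which compressor or normalization the authors use.

```python
import zlib

def compression_ratio(text, level=9):
    """Compressed size divided by raw size; lower means more regular text."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0  # convention for empty input: treat as incompressible
    return len(zlib.compress(raw, level)) / len(raw)

# Toy comparison (hypothetical inputs): a repetitive string versus ordinary prose.
repetitive = "the model said the model said " * 30
prose = ("Mineral maps, reading times, steering vectors and clickbait "
         "headlines rarely share vocabulary, so this sentence has few "
         "repeated substrings for the compressor to exploit.")

print(compression_ratio(repetitive), compression_ratio(prose))
```

Very short texts compress poorly regardless of origin because of fixed header overhead, which is in line with the abstract's observation that the separation attenuates in fragmented, small-scale settings.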

[15] FENCE: A Financial and Multimodal Jailbreak Detection Dataset

Mirae Kim,Seonghun Jeong,Youngjun Kwak

Main category: cs.CL

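TL;DR: This paper presents FENCE, a bilingual (Korean-English) multimodal jailbreak detection dataset for the financial domain, used to train and evaluate jailbreak detectors for large language models (LLMs) and vision language models (VLMs); experiments show that existing VLMs are consistently vulnerable to jailbreaks in financial scenarios, while a baseline detector trained on FENCE achieves high accuracy and strong generalization.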

Details Motivation: Resources for jailbreak detection are scarce, especially in finance, and VLMs, which process both text and images, present a broader attack surface and higher risk; a domain-specific, realistic detection dataset is needed. Method: The authors build FENCE, a bilingual multimodal jailbreak dataset of finance-relevant queries paired with image-grounded threats, run jailbreak experiments on several commercial and open-source VLMs, and train and evaluate a baseline detector on the dataset. Result: Models such as GPT-4o show measurable attack success rates in financial jailbreak tests, and open-source VLMs are more exposed; a detector trained on FENCE reaches 99% in-distribution accuracy and remains robust on external benchmarks. Conclusion: FENCE fills a data gap in financial multimodal jailbreak detection and supports the development of safer, more reliable AI systems, particularly in sensitive domains. Abstract: Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.

[16] Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models

Wojciech Michaluk,Tymoteusz Urban,Mateusz Kubita,Soveatin Kuntur,Anna Wroblewska

Main category: cs.CL

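TL;DR: This paper proposes a hybrid clickbait detection method that combines transformer text embeddings with linguistically motivated informativeness features; an XGBoost model over the augmented embeddings reaches a 91% F1-score and improves interpretability.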

Details Motivation: Clickbait headlines degrade the quality of online information and undermine user trust, so more accurate and interpretable detection methods are needed. Method: Transformer text embeddings are combined with 15 linguistically motivated informativeness features (such as second-person pronouns, superlatives, numerals, and attention-oriented punctuation) and classified with tree-based models such as XGBoost; baselines include TF-IDF, Word2Vec, GloVe, and LLM prompt-based classification. Result: The best model (XGBoost over augmented embeddings) reaches an F1-score of 91% and outperforms all baselines; the feature set improves interpretability and prediction calibration. Conclusion: Combining linguistic features with deep embeddings delivers both strong performance and interpretability, and the released code and models support reproducible clickbait detection research. Abstract: Clickbait headlines degrade the quality of online information and undermine user trust. We present a hybrid approach to clickbait detection that combines transformer-based text embeddings with linguistically motivated informativeness features. Using natural language processing techniques, we evaluate classical vectorizers, word embedding baselines, and large language model embeddings paired with tree-based classifiers. Our best-performing model, XGBoost over embeddings augmented with 15 explicit features, achieves an F1-score of 91%, outperforming TF-IDF, Word2Vec, GloVe, LLM prompt-based classification, and feature-only baselines. The proposed feature set enhances interpretability by highlighting salient linguistic cues such as second-person pronouns, superlatives, numerals, and attention-oriented punctuation, enabling transparent and well-calibrated clickbait predictions. We release code and trained models to support reproducible research.
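The four linguistic cues named in this abstract can be counted with a few lines of code. The sketch below is an assumption-laden illustration: the paper uses 15 features whose exact definitions are not given here, the word lists are small hand-picked lexicons, and the function name `informativeness_features` is hypothetical.

```python
import re

# Small illustrative lexicons; a real feature set would be broader.
SECOND_PERSON = {"you", "your", "yours", "yourself"}
SUPERLATIVES = {"best", "worst", "most", "least", "biggest", "greatest"}

def informativeness_features(headline):
    """Count four surface cues in a headline and return them as a dict."""
    tokens = re.findall(r"[a-z']+|\d+", headline.lower())
    return {
        "second_person": sum(t in SECOND_PERSON for t in tokens),
        "superlatives": sum(t in SUPERLATIVES for t in tokens),
        "numerals": sum(t.isdigit() for t in tokens),
        "attention_punct": sum(headline.count(c) for c in "!?"),
    }

clickbait = informativeness_features("You won't believe the 10 best tricks!")
neutral = informativeness_features("Central bank holds interest rates steady")
print(clickbait, neutral)
```

In the hybrid setup the abstract describes, counts like these would be concatenated with a text embedding before being passed to a tree-based classifier, which also makes feature importances directly readable.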

[17] Improving Sampling for Masked Diffusion Models via Information Gain

Kaisen Yang,Jayden Teoh,Kaicheng Yang,Yitong Zhang,Alex Lamb

Main category: cs.CL

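TL;DR: This paper proposes the Info-Gain Sampler, a new decoding framework for masked diffusion models (MDMs) that balances immediate uncertainty against information gain over future masked positions, substantially improving generation quality.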

Details Motivation: Existing MDM samplers use greedy heuristics that ignore the downstream impact of current decoding choices on later steps, fail to minimize cumulative uncertainty, and do not fully exploit the non-causal nature of MDMs. Method: The Info-Gain Sampler is a decoding framework that weighs immediate uncertainty against information gain over future masked tokens, using the non-causal nature of MDMs to evaluate how a decoding decision affects uncertainty across all remaining masked positions. Result: The sampler outperforms existing samplers on reasoning, coding, creative writing, and image generation: average accuracy on reasoning tasks improves by 3.6%, the creative writing win-rate reaches 63.1%, and cumulative uncertainty on reasoning tasks falls from 78.4 to 48.6. Conclusion: The Info-Gain Sampler offers a more principled decoding approach for MDMs that mitigates the limitations of greedy strategies and proves effective across a range of generation tasks. Abstract: Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty to decode at each step. Through failure case analysis, we identify a fundamental limitation of this approach: it neglects the downstream impact of current decoding choices on subsequent steps and fails to minimize cumulative uncertainty. In particular, these methods do not fully exploit the non-causal nature of MDMs, which enables evaluating how a decoding decision reshapes token probabilities/uncertainty across all remaining masked positions. To bridge this gap, we propose the Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers for MDMs. For instance, it achieves a 3.6% improvement in average accuracy on reasoning tasks and a 63.1% win-rate in creative writing. Notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin. The code will be available at https://github.com/yks23/Information-Gain-Sampler.
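The trade-off this abstract describes, immediate uncertainty versus information gain over the remaining masked positions, can be illustrated with a toy scoring rule. This is not the authors' algorithm: the score `weight * gain - entropy`, the use of Shannon entropy, and the precomputed `lookahead` distributions (which a real sampler would obtain by re-running the model after tentatively committing a token) are all assumptions made for the sketch.

```python
import math

def entropy(dist):
    """Shannon entropy in nats of a probability list."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def choose_position(current, lookahead, weight=1.0):
    """Pick the masked position with the best gain-versus-uncertainty score.

    current:   {position: distribution over tokens}
    lookahead: {position: {other_position: distribution after committing
                           the most likely token at `position`}}
    """
    best_pos, best_score = None, float("-inf")
    for pos, dist in current.items():
        before = sum(entropy(d) for q, d in current.items() if q != pos)
        after = sum(entropy(d) for d in lookahead[pos].values())
        gain = before - after                  # uncertainty removed elsewhere
        score = weight * gain - entropy(dist)  # penalize uncertain commitments
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos

# Toy case (hypothetical numbers): A is locally more certain, but committing B
# resolves A completely, whereas committing A tells us nothing about B.
current = {"A": [0.9, 0.1], "B": [0.7, 0.3]}
lookahead = {"A": {"B": [0.7, 0.3]}, "B": {"A": [1.0, 0.0]}}

greedy_pick = choose_position(current, lookahead, weight=0.0)
info_gain_pick = choose_position(current, lookahead, weight=1.0)
print(greedy_pick, info_gain_pick)
```

With `weight=0.0` the rule reduces to the greedy "most certain position first" heuristic that the abstract criticizes; a positive weight lets a less certain position win when decoding it clarifies the rest of the sequence.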

[18] Information-Theoretic Storage Cost in Sentence Comprehension

Kohei Kajikawa,Shinnosuke Isono,Ethan Gotlieb Wilcox

Main category: cs.CL

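TL;DR: This paper proposes a continuous, information-theoretic measure of storage cost in sentence processing, which uses pre-trained neural language models to estimate how much information previous words carry about future context, and shows that it accounts for a range of reading-behavior phenomena.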

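Details Motivation: Measures of working-memory load in traditional psycholinguistics mostly rely on discrete, symbolic-grammar-based methods, which lack continuity and theory neutrality and struggle to capture the uncertainty of real language comprehension. Method: An information-theoretic measure of processing storage cost, defined as the amount of information previous words carry about future context under uncertainty; it can be estimated from pre-trained neural language models without explicit syntactic annotation. Result: In English, the measure reproduces the processing asymmetries in center embeddings and relative clauses, correlates significantly with a grammar-based storage cost metric in a syntactically annotated corpus, and predicts reading-time variance beyond baseline models on two large-scale naturalistic reading datasets. Conclusion: A continuous storage cost measure in an information-theoretic framework is a more flexible, theory-neutral, and empirically effective alternative that helps bridge computational models and empirical research on human language comprehension. Abstract: Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have been formalized, largely, using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of information previous words carry about future context, under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically-annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors.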

[19] Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning

Lexiang Tang,Weihao Gao,Bingchen Zhao,Lu Ma,Qiao jin,Bang Yang,Yuexian Zou

Main category: cs.CL

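TL;DR: This paper proposes Thinking by Subtraction, a confidence-driven contrastive decoding method (CCD) that locates and intervenes on low-confidence tokens during decoding to improve the accuracy and efficiency of LLM reasoning.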

Details Motivation: Existing test-time scaling methods usually assume that uniformly adding inference compute improves correctness, but reasoning uncertainty is in fact highly localized: a small number of low-confidence tokens drive most errors and redundant output. Method: Confidence-Driven Contrastive Decoding (CCD) detects low-confidence tokens during decoding, builds a contrastive reference distribution by replacing high-confidence tokens with minimal placeholders, and subtracts that reference distribution at low-confidence positions to correct the prediction. Result: CCD significantly improves accuracy on mathematical reasoning benchmarks while substantially shortening outputs and keeping KV-cache overhead minimal; it is a lightweight, training-free intervention without computational redundancy. Conclusion: CCD shows that targeted token-level intervention is more efficient and reliable than uniform compute scaling, offering a new paradigm for improving LLM reasoning reliability. Abstract: Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, a confidence-driven contrastive decoding approach that improves reasoning reliability through targeted token-level intervention. Our method, Confidence-Driven Contrastive Decoding, detects low-confidence tokens during decoding and intervenes selectively at these positions. It constructs a contrastive reference by replacing high-confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low-confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead. As a training-free method, CCD enhances reasoning reliability through targeted low-confidence intervention without computational redundancy. Our code will be made available at: https://github.com/bolo-web/CCD.
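The "subtract a reference distribution only where confidence is low" idea can be sketched for a single decoding step as follows. This is a minimal illustration, not the CCD implementation: the confidence test (maximum softmax probability below `conf_threshold`), the log-space subtraction, and the `alpha` weight are assumptions, and `ref_logits` stands in for whatever the placeholder-based reference pass would produce.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def contrastive_step(logits, ref_logits, conf_threshold=0.5, alpha=1.0):
    """Return (token_id, intervened) for one decoding step.

    High-confidence steps are left untouched. At low-confidence steps the
    reference log-probabilities are subtracted (assumed form:
    log p - alpha * log p_ref) before taking the argmax.
    """
    probs = softmax(logits)
    if probs.max() >= conf_threshold:
        return int(probs.argmax()), False
    adjusted = np.log(probs) - alpha * np.log(softmax(ref_logits))
    return int(adjusted.argmax()), True
```

For example, with `logits = [1.0, 0.9, 0.0]` the top probability is about 0.44, so the step counts as low-confidence; a reference that strongly prefers token 0 (`ref_logits = [2.0, 0.0, 0.0]`) then shifts the choice to token 1. Practical contrastive decoders usually also restrict the candidate set to plausible tokens, which this sketch omits.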

[20] Simplifying Outcomes of Language Model Component Analyses with ELIA

Aaron Louis Eidt,Nils Feldhus

Main category: cs.CL

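TL;DR: This paper presents ELIA, an interactive web application that integrates attribution analysis, function vector analysis, and circuit tracing, and uses a multimodal large model to generate natural language explanations automatically, lowering the barrier to using LLM interpretability techniques. A user study shows that its interactive design and AI explanations significantly improve non-experts' understanding and remove the effect of differences in prior experience.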

Details Motivation: Mechanistic interpretability tools are powerful, but their complexity makes them hard to access and limits them to experts; simplified, visual solutions for a broader audience are needed. Method: Design and implement the ELIA system, which integrates attribution analysis, function vector analysis, and circuit tracing, and introduces a vision-language model to generate natural language explanations (NLEs) of the resulting visualizations; evaluate it with a mixed-methods user study. Result: The user study shows that (1) users clearly prefer interactive, explorable interfaces to static figures; (2) AI-generated explanations help bridge the knowledge gap for non-experts; and (3) statistical analysis finds no significant correlation between users' LLM experience and their comprehension scores, indicating that the system weakens the dependence on experience. Conclusion: AI can simplify complex model analyses, but its real value comes from combining it with user-centered design that emphasizes interactivity, specificity, and narrative guidance. Abstract: While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques -- Attribution Analysis, Function Vector Analysis, and Circuit Tracing -- and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI-powered explanations helped bridge the knowledge gap for non-experts; a statistical analysis showed no significant correlation between a user's prior LLM experience and their comprehension scores, suggesting that the system reduced barriers to comprehension across experience levels. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.

[21] PsihoRo: Depression and Anxiety Romanian Text Corpus

Alexandra Ciobotaru,Ana-Maria Bucur,Liviu P. Dinu

Main category: cs.CL

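TL;DR: This paper introduces PsihoRo, the first Romanian corpus for depression and anxiety, built from the texts of 205 respondents collected through open-ended questions combined with the PHQ-9 and GAD-7 scales, and analyzed with statistical analysis, Romanian LIWC, emotion detection, and topic modeling.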

Details Motivation: Romanian lacks an open-source mental health corpus, and existing social media data is easily biased by the collectors' own assumptions; a more pragmatic data construction method validated by self-report scales is needed. Method: Design a form with 6 open-ended questions plus the standardized PHQ-9 and GAD-7 questionnaires and collect texts from 205 Romanian-speaking respondents; then run statistical analysis, Romanian LIWC dictionary analysis, emotion detection, and LDA topic modeling. Result: PsihoRo, the first Romanian corpus for depression and anxiety, is built, and its key linguistic features in emotional expression, use of psychological-process vocabulary, and topic distribution are identified. Conclusion: PsihoRo fills the gap in Romanian mental health NLP resources and provides a base resource and analysis template for later psycholinguistic research, clinical support tools, and multilingual mental health model training. Abstract: Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health. These texts allow researchers to study psychological constructs, detect mental health issues and analyze emotional language. However, mental health data can be difficult to collect correctly from social media, due to suppositions made by the collectors. A more pragmatic strategy involves gathering data through open-ended questions and then assessing this information with self-report screening surveys. This method was employed successfully for English, a language with a lot of psychological NLP resources. However, this cannot be stated for Romanian, which currently has no open-source mental health corpus. To address this gap, we have created the first corpus for depression and anxiety in Romanian, by utilizing a form with 6 open-ended questions along with the standardized PHQ-9 and GAD-7 screening questionnaires. Although it may seem small, consisting of the texts of 205 respondents, PsihoRo is a first step towards understanding and analyzing texts regarding the mental health of the Romanian population. We employ statistical analysis, text analysis using Romanian LIWC, emotion detection and topic modeling to show the most important features of this newly introduced resource to the NLP community.

[22] Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning

Tao Wu,Adam Kapelner

Main category: cs.CL

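TL;DR: This paper proposes a modern deep learning system that automatically selects highly informative context sentences for first-language vocabulary instruction for high school students, and evaluates three modeling approaches with a newly proposed Retention Competency Curve metric; the model combining handcrafted features with supervised fine-tuned Qwen3 embeddings performs best.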

Details Motivation: Produce a large supply of high-quality context sentences for high school first-language vocabulary instruction efficiently and cheaply, since manual selection is slow and hard to scale. Method: Compare three approaches: (i) an unsupervised similarity-based strategy using MPNet's uniformly contextualized embeddings; (ii) a supervised framework built on instruction-aware, fine-tuned Qwen3 embeddings with a nonlinear regression head; and (iii) model (ii) plus handcrafted context features; also propose a new evaluation metric, the Retention Competency Curve. Result: Model (iii) performs best, reaching a good-to-bad context ratio of 440 while discarding only 70% of the good contexts; this shows that modern embedding models guided by human supervision can reliably produce near-perfect teaching contexts. Conclusion: Combining human supervision with advanced embedding models (such as Qwen3) and neural network architectures can generate high-quality vocabulary teaching contexts for diverse target words at scale and at low cost. Abstract: We describe a modern deep learning system that automatically identifies informative contextual examples ("contexts") for first language vocabulary instruction for high school students. Our paper compares three modeling approaches: (i) an unsupervised similarity-based strategy using MPNet's uniformly contextualized embeddings, (ii) a supervised framework built on instruction-aware, fine-tuned Qwen3 embeddings with a nonlinear regression head and (iii) model (ii) plus handcrafted context features. We introduce a novel metric called the Retention Competency Curve to visualize trade-offs between the discarded proportion of good contexts and the "good-to-bad" contexts ratio, providing a compact, unified lens on model performance. Model (iii) delivers the most dramatic gains, reaching a good-to-bad ratio of 440 while only throwing out 70% of the good contexts. In summary, we demonstrate that a modern embedding model with a neural network architecture, when guided by human supervision, results in a low-cost large supply of near-perfect contexts for teaching vocabulary for a variety of target words.
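The trade-off this entry describes, the share of good contexts discarded against the good-to-bad ratio among the contexts kept, can be traced with a simple threshold sweep. The sketch below is one plausible reading of such a curve, not the paper's definition of the Retention Competency Curve; the function name and the choice to sweep over the ranked list are assumptions.

```python
def retention_curve(scores, is_good):
    """Sweep a cut-off down the ranked list of contexts.

    For each prefix of the list (highest scores first) return a pair:
    (proportion of all good contexts discarded, good-to-bad ratio kept).
    The ratio is infinite while no bad context has been kept.
    """
    ranked = sorted(zip(scores, is_good), key=lambda pair: pair[0], reverse=True)
    total_good = sum(1 for good in is_good if good)
    if total_good == 0:
        raise ValueError("need at least one good context")
    curve, good, bad = [], 0, 0
    for _, label in ranked:
        if label:
            good += 1
        else:
            bad += 1
        discarded = 1 - good / total_good
        ratio = good / bad if bad else float("inf")
        curve.append((discarded, ratio))
    return curve


# Four scored contexts; the third one is a bad context.
curve = retention_curve([0.9, 0.8, 0.7, 0.6], [True, True, False, True])
```

Reading the reported operating point this way, a ratio of 440 at 70% discarded would mean the model keeps the top-scored 30% of good contexts and admits roughly one bad context per 440 good ones.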

[23] Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System

Pavithra PM Nair,Preethu Rose Anish

Main category: cs.CL

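TL;DR: This paper proposes Vichara, a framework designed for the Indian judicial system that predicts and explains appellate judgments. It decomposes English-language appellate case documents into structured "decision points" and uses an IRAC-style explanation format adapted to Indian legal reasoning, improving both prediction accuracy and interpretability for legal professionals. Experiments show that it surpasses existing benchmarks across several datasets and large language models.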

Details Motivation: Indian courts face a severe case backlog, with a particularly large number of appellate cases, so AI assistance is needed to improve judicial efficiency and interpretability; existing methods struggle to combine prediction accuracy with explanations that follow local legal reasoning. Method: The Vichara framework parses appellate case documents into "decision points" that capture the legal issue, deciding authority, outcome, reasoning, and temporal context; generates explanations in an adapted IRAC structure; and is evaluated on the PredEx and ILDC_expert datasets using GPT-4o mini, Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B. Result: Vichara surpasses existing judgment prediction benchmarks on both datasets; GPT-4o mini performs best (F1 = 81.5 on PredEx, F1 = 80.3 on ILDC_expert); human evaluation rates its explanations highest on Clarity, Linking, and Usefulness. Conclusion: Vichara combines accurate prediction with interpretability that fits Indian legal practice, offering a feasible path for deploying judicial AI in a heavily loaded common-law system. Abstract: In jurisdictions like India, where courts face an extensive backlog of cases, artificial intelligence offers transformative potential for legal judgment prediction. A critical subset of this backlog comprises appellate cases, which are formal decisions issued by higher courts reviewing the rulings of lower courts. To this end, we present Vichara, a novel framework tailored to the Indian judicial system that predicts and explains appellate judgments. Vichara processes English-language appellate case proceeding documents and decomposes them into decision points. Decision points are discrete legal determinations that encapsulate the legal issue, deciding authority, outcome, reasoning, and temporal context. The structured representation isolates the core determinations and their context, enabling accurate predictions and interpretable explanations. Vichara's explanations follow a structured format inspired by the IRAC (Issue-Rule-Application-Conclusion) framework and adapted for Indian legal reasoning. This enhances interpretability, allowing legal professionals to assess the soundness of predictions efficiently. We evaluate Vichara on two datasets, PredEx and the expert-annotated subset of the Indian Legal Documents Corpus (ILDC_expert), using four large language models: GPT-4o mini, Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B. Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B. 
Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini's superior interpretability.

[24] Validating Political Position Predictions of Arguments

Jordan Robinson,Angus R. Williams,Katie Atkinson,Anthony G. Cohn

Main category: cs.CL

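TL;DR: This paper proposes a dual-scale validation framework (combining pointwise and pairwise annotation) to assess how well language models capture subjective, continuous attributes in political position prediction, and builds a large-scale structured argumentation knowledge base.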

Details Motivation: Real-world knowledge representation often has to capture subjective, continuous attributes such as political position, but such attributes fit poorly with pairwise validation, the accepted gold standard for human evaluation, so a new method is needed. Method: A dual-scale human annotation framework combining pointwise and pairwise judgments; 22 language models predict political positions for 23,228 arguments from the UK television programme Question Time, and agreement is measured with Krippendorff's α. Result: Pointwise evaluation shows moderate agreement (α = 0.578), reflecting subjectivity; pairwise evaluation shows strong ranking agreement (α = 0.86 for the best model); a validated structured argumentation knowledge base is constructed. Conclusion: Ordinal structure can be extracted from language models' pointwise predictions on subjective discourse; the dual-scale validation framework balances scalability with reliability and advances knowledge representation in domains where symbolic or categorical approaches fall short. Abstract: Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation. We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation. Using 22 language models, we construct a large-scale knowledge base of political position predictions for 23,228 arguments drawn from 30 debates that appeared on the UK political television programme Question Time. Pointwise evaluation shows moderate human-model agreement (Krippendorff's $α=0.578$), reflecting intrinsic subjectivity, while pairwise validation reveals substantially stronger alignment between human- and model-derived rankings ($α=0.86$ for the best model). This work contributes: (i) a practical validation methodology for subjective continuous knowledge that balances scalability with reliability; (ii) a validated structured argumentation knowledge base enabling graph-based reasoning and retrieval-augmented generation in political domains; and (iii) evidence that ordinal structure can be extracted from pointwise language model predictions on inherently subjective real-world discourse, advancing knowledge representation capabilities for domains where traditional symbolic or categorical approaches are insufficient.

[25] SPQ: An Ensemble Technique for Large Language Model Compression

Jiamin Yao,Eren Gultepe

Main category: cs.CL

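TL;DR: This paper proposes SPQ, an ensemble technique for LLM compression that combines SVD, activation-based pruning, and 8-bit linear quantization, achieving up to 75% memory reduction and up to 1.9x faster inference while maintaining or even improving model performance.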

Details Motivation: Make large language models (LLMs) deployable in memory-constrained environments by using complementary compression techniques that jointly target the redundancy and efficiency bottlenecks of different modules. Method: The SPQ ensemble compression method: (1) activation-based pruning of MLP layers removes redundant neurons; (2) variance-retained SVD decomposes attention projections into low-rank factors; (3) 8-bit post-training linear quantization uniformly compresses all linear layers. Result: On LLaMA-2-7B, SPQ reduces memory by 75%, lowers WikiText-2 perplexity from 5.47 to 4.91, and maintains or improves accuracy on downstream tasks such as C4, TruthfulQA, and GSM8K; memory use is 6.86 GB (versus 7.16 GB for GPTQ), and inference throughput improves by up to 1.9x. Conclusion: Through layer-aware, complementary compression, SPQ clearly outperforms single compression methods at matched compression ratios and offers an efficient, practical route to deploying LLMs in resource-constrained settings. Abstract: This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared to strong baselines like GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real-world deployment. By combining layer-aware and complementary compression techniques, SPQ may make deployment of LLMs practical in memory-constrained environments. Code is available at: https://github.com/JiaminYao/SPQ_LLM_Compression/
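Two of the three SPQ ingredients have compact textbook forms, sketched below with NumPy. This is not the authors' code: choosing the rank by cumulative squared singular values is an assumed reading of "variance-retained", and the symmetric per-tensor scale is only one common form of 8-bit linear quantization. The function names are hypothetical.

```python
import numpy as np


def svd_rank_for_variance(weight, keep=0.9):
    """Smallest rank whose squared singular values reach `keep` of the total."""
    s = np.linalg.svd(weight, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    rank = int(np.searchsorted(energy, keep)) + 1
    return min(rank, len(s))


def low_rank_factors(weight, rank):
    """Factor weight (m x n) into A (m x rank) and B (rank x n), weight ~ A @ B."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]


def quantize_int8(weight):
    """Symmetric per-tensor 8-bit linear quantization: weight ~ q * scale."""
    max_abs = float(np.abs(weight).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to floating point."""
    return q.astype(np.float64) * scale
```

A low-rank pair replaces `m * n` parameters with `rank * (m + n)`, so it only saves memory when the retained rank is well below `m * n / (m + n)`; the int8 step then stores one byte per remaining weight plus a single scale.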

[26] RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering

Deniz Qian,Hung-Ting Chen,Eunsol Choi

Main category: cs.CL

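TL;DR: This paper proposes retrieve-verify-retrieve (RVR), a multi-round retrieval framework that iterates between retrieval and verification to improve answer coverage and recall for multi-answer queries.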

Details Motivation: Existing retrieval methods struggle to cover complex queries that admit many valid answers, so complete recall of answers needs to improve. Method: The RVR framework has three stages: the first round retrieves a candidate document set with the original query; a verifier selects a high-quality subset; and subsequent rounds fold the verified documents into the query to steer retrieval toward answers not yet covered. It works with off-the-shelf retrievers and also supports fine-tuning retrievers for this procedure. Result: On the QAMPARI dataset, complete recall improves by at least 10% relative (3% absolute), with consistent gains across different base retrievers on two out-of-domain datasets, QUEST and WebQuestionsSP. Conclusion: RVR provides an efficient, general, and extensible iterative retrieval paradigm that, through verification and query augmentation, significantly improves retrieval comprehensiveness in multi-answer settings. Abstract: Comprehensively retrieving diverse documents is crucial to address queries that admit a wide range of valid answers. We introduce retrieve-verify-retrieve (RVR), a multi-round retrieval framework designed to maximize answer coverage. Initially, a retriever takes the original query and returns a candidate document set, followed by a verifier that identifies a high-quality subset. For subsequent rounds, the query is augmented with previously verified documents to uncover answers that are not yet covered in previous rounds. RVR is effective even with off-the-shelf retrievers, and fine-tuning retrievers for our inference procedure brings further gains. Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI). We also see consistent gains on two out-of-domain datasets (QUEST and WebQuestionsSP) across different base retrievers. Our work presents a promising iterative approach for comprehensive answer recall leveraging a verifier and adapting retrievers to a new inference scenario.
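The retrieve, verify, retrieve-again loop can be sketched with a toy word-overlap retriever. Everything below is illustrative and assumed rather than taken from the paper: the `retrieve` and `rvr` functions, the way verified documents are folded into the query (by adding their words), and the `verifier` callable are hypothetical stand-ins for a dense retriever and a learned verifier.

```python
def retrieve(query_terms, corpus, k, exclude):
    """Toy retriever: rank unseen documents by word overlap with the query."""
    scored = []
    for doc_id, text in corpus.items():
        if doc_id in exclude:
            continue
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((-overlap, doc_id))
    return [doc_id for _, doc_id in sorted(scored)[:k]]


def rvr(query, corpus, verifier, rounds=2, k=2):
    """Multi-round retrieval: retrieve, keep verified docs, augment, repeat."""
    query_terms = set(query.lower().split())
    verified, seen = [], set()
    for _ in range(rounds):
        candidates = retrieve(query_terms, corpus, k, exclude=seen)
        if not candidates:
            break
        seen.update(candidates)
        accepted = [d for d in candidates if verifier(query, corpus[d])]
        verified.extend(accepted)
        # Assumed augmentation: add the words of verified documents to the query.
        for doc_id in accepted:
            query_terms |= set(corpus[doc_id].lower().split())
    return verified


corpus = {
    "a": "paris hosted the olympics",
    "b": "london hosted the olympics",
    "c": "tokyo hosted the games in 2021",
    "d": "bananas are yellow",
}
hosted = lambda query, text: "hosted" in text  # toy verifier
```

With `k=2`, a single round returns only documents `a` and `b`; the second round skips what has already been seen and surfaces `c`, which is the coverage effect the multi-round design is after.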

[27] VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning

Harshul Raj Surana,Arijit Maji,Aryan Vats,Akash Ghosh,Sriparna Saha,Amit Sheth

Main category: cs.CL

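TL;DR: This paper proposes VIRAASAT, a semi-automated method for building a multi-hop question-answering dataset on Indian culture, and designs the Symbolic Chain-of-Manipulation (SCoM) framework to improve large models' reasoning over cultural knowledge, significantly outperforming standard CoT.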

Details Motivation: Existing cultural benchmarks are manually built, limited to single-hop questions, and hard to scale, and in particular do not evaluate multi-hop reasoning in diverse cultural contexts such as India. Method: VIRAASAT semi-automatically generates more than 3,200 multi-hop questions from a knowledge graph of more than 700 expert-curated cultural artifacts (covering 13 attribute categories, 28 states, and 8 Union Territories); the SCoM framework trains the model to simulate atomic knowledge graph manipulations and topological traversal internally. Result: Evaluating mainstream LLMs on VIRAASAT reveals weak multi-hop cultural reasoning; in supervised fine-tuning, SCoM improves on standard CoT by up to 20%. Conclusion: VIRAASAT fills a gap in evaluating culturally aware reasoning, and SCoM offers a new paradigm for improving chained reasoning over low-probability cultural facts, moving toward genuinely culturally aware large language models. Abstract: Large Language Models (LLMs) have made significant progress in reasoning tasks across various domains such as mathematics and coding. However, their performance deteriorates in tasks requiring rich socio-cultural knowledge and diverse local contexts, particularly those involving Indian culture. Existing cultural benchmarks are (i) manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured. To address this, we introduce VIRAASAT, a novel, semi-automated approach for generating a culture-specific multi-hop Question-Answering dataset for Indian culture. VIRAASAT leverages a Knowledge Graph comprising more than 700 expert-curated cultural artifacts, covering 13 key attributes of Indian culture (history, festivals, etc). VIRAASAT spans all 28 states and 8 Union Territories, yielding more than 3,200 multi-hop questions that necessitate chained cultural reasoning. We evaluate current State-of-the-Art (SOTA) LLMs on VIRAASAT and identify key limitations in reasoning wherein fine-tuning on Chain-of-Thought (CoT) traces fails to ground and synthesize low-probability facts. To bridge this gap, we propose a novel framework named Symbolic Chain-of-Manipulation (SCoM). Adapting the Chain-of-Manipulation paradigm, we train the model to simulate atomic Knowledge Graph manipulations internally. SCoM teaches the model to reliably traverse the topological structure of the graph. Experiments on Supervised Fine-Tuning (SFT) demonstrate that SCoM outperforms standard CoT baselines by up to 20%. 
We release the VIRAASAT dataset along with our findings, laying a strong foundation towards building Culturally Aware Reasoning Models.

cs.CV [Back]

[28] KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding

Boda Lin,Yongjie Zhu,Xiaocheng Gong,Wenyu Qin,Meng Wang

Main category: cs.CV

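TL;DR: This paper proposes a kinematics-based automated annotation pipeline, builds the KPM-Bench dataset, and designs the MoPE algorithm together with a new hallucination evaluation metric, significantly improving how reliably video captioning models understand and describe fine-grained human motion.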

Details Motivation: Existing video captioning models fall well short in describing fine-grained motion details and in avoiding hallucination, and struggle to describe limb dynamics accurately in motion-centric videos. Method: Build an automated annotation pipeline that combines kinematic computation with linguistic parsing; release the KPM-Bench benchmark (fine-grained video-caption pairs, motion-understanding question-answer pairs, and a hallucination evaluation subset); propose the linguistically grounded Motion Parsing and Extraction (MoPE) algorithm and a hallucination metric that does not depend on large models; integrate MoPE into the GRPO post-training framework to mitigate hallucination. Result: KPM-Bench is released as the first open-source benchmark for fine-grained human motion understanding; MoPE provides an interpretable, lightweight algorithm and a dedicated hallucination evaluation method; experiments on several video captioning models show reduced motion-related hallucination and more accurate descriptions. Conclusion: Through kinematics-driven data construction and linguistically guided parsing, the work systematically advances the reliability and interpretability of motion-centric video captioning and provides a new paradigm and practical tools for fine-grained motion understanding. Abstract: Despite recent advancements, video captioning models still face significant limitations in accurately describing fine-grained motion details and suffer from severe hallucination issues. These challenges become particularly prominent when generating captions for motion-centric videos, where precise depiction of intricate movements and limb dynamics is crucial yet often neglected. To close this gap, we introduce an automated annotation pipeline that integrates kinematic-based motion computation with linguistic parsing, enabling detailed decomposition and description of complex human motions. Based on this pipeline, we construct and release the Kinematic Parsing Motion Benchmark (KPM-Bench), a novel open-source dataset designed to facilitate fine-grained motion understanding. KPM-Bench consists of (i) fine-grained video-caption pairs that comprehensively illustrate limb-level dynamics in complex actions, (ii) diverse and challenging question-answer pairs focusing specifically on motion understanding, and (iii) a meticulously curated evaluation set specifically designed to assess hallucination phenomena associated with motion descriptions. Furthermore, to address hallucination issues systematically, we propose the linguistically grounded Motion Parsing and Extraction (MoPE) algorithm, capable of accurately extracting motion-specific attributes directly from textual captions. Leveraging MoPE, we introduce a precise hallucination evaluation metric that functions independently of large-scale vision-language or language-only models. 
By integrating MoPE into the GRPO post-training framework, we effectively mitigate hallucination problems, significantly improving the reliability of motion-centric video captioning models.

[29] CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild

Balamurugan Thambiraja,Omid Taheri,Radek Danecek,Giorgio Becherini,Gerard Pons-Moll,Justus Thies

Main category: cs.CV

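TL;DR: This paper presents 3D-HIW, the first in-the-wild 3D hand motion dataset, together with the CLUTCH generation system; a new VQ-VAE architecture, SHIFT, and a geometric refinement stage significantly improve text-motion alignment and animation fidelity.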

Details Motivation: Existing hand motion modelling methods depend on expensive, limited studio data, generalize poorly to complex real-world scenes, and fall short on text-motion alignment and animation quality. Method: Build the large-scale 3D-HIW dataset (32K sequences), annotated automatically by combining VLMs with 3D hand trackers; propose the CLUTCH system, which has two parts: (a) SHIFT, a part-modality decomposed VQ-VAE that tokenizes motion, and (b) a geometric refinement stage that fine-tunes the LLM with an added reconstruction loss on the decoded hand parameters. Result: State-of-the-art performance on text-to-motion and motion-to-text tasks, and the first scalable benchmark for in-the-wild hand motion modelling. Conclusion: 3D-HIW and CLUTCH provide a high-quality data foundation and an efficient architecture for understanding and generating hand motion in real-world settings, moving the field toward practical use. Abstract: Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text-motion alignment. To address this, we (1) introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in-the-wild, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.

[30] Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision

Xinwei Ju,Rema Daher,Danail Stoyanov,Sophia Bano,Francisco Vasconcelos

Main category: cs.CV

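TL;DR: This paper proposes the PRISM framework, which combines edge detection with luminance decoupling so that self-supervised learning can use anatomical and illumination priors to improve monocular depth and pose estimation in colonoscopy.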

Details Motivation: Monocular depth and pose estimation are essential for colonoscopy-assisted navigation but are challenged by texture-less surfaces, complex illumination, deformation, and the lack of high-quality in-vivo ground truth. Method: PRISM (Pose-Refinement with Intrinsic Shading and edge Maps) is a self-supervised learning framework that uses a learning-based edge detector (e.g., DexiNed or HED) to produce edge maps and an intrinsic decomposition module to separate shading from reflectance, extracting shading cues for depth estimation. Result: State-of-the-art performance on multiple real and synthetic datasets; ablations show that self-supervised training on real data outperforms supervised training on phantom data, and that video frame rate strongly affects performance, so sampling must be tailored to each dataset. Conclusion: Domain realism matters more than ground-truth labels; the video sampling strategy is an important practical factor for monocular geometry estimation in colonoscopy; PRISM integrates structural and illumination priors effectively and improves robustness and accuracy. Abstract: Monocular depth and pose estimation play an important role in the development of colonoscopy-assisted navigation, as they enable improved screening by reducing blind spots, minimizing the risk of missed or recurrent lesions, and lowering the likelihood of incomplete examinations. However, this task remains challenging due to the presence of texture-less surfaces, complex illumination patterns, deformation, and a lack of in-vivo datasets with reliable ground truth. In this paper, we propose **PRISM** (Pose-Refinement with Intrinsic Shading and edge Maps), a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning. Our approach uniquely incorporates edge detection and luminance decoupling for structural guidance. Specifically, edge maps are derived using a learning-based edge detector (e.g., DexiNed or HED) trained to capture thin and high-frequency boundaries, while luminance decoupling is obtained through an intrinsic decomposition module that separates shading and reflectance, enabling the model to exploit shading cues for depth estimation. Experimental results on multiple real and synthetic datasets demonstrate state-of-the-art performance. We further conduct a thorough ablation study on training data selection to establish best practices for pose and depth estimation in colonoscopy. 
This analysis yields two practical insights: (1) self-supervised training on real-world data outperforms supervised training on realistic phantom data, underscoring the superiority of domain realism over ground truth availability; and (2) video frame rate is an extremely important factor for model performance, where dataset-specific video frame sampling is necessary for generating high quality training data.

[31] LGD-Net: Latent-Guided Dual-Stream Network for HER2 Scoring with Task-Specific Domain Knowledge

Peide Zhu,Linbin Lu,Zhiqin Chen,Xiong Chen

Main category: cs.CV

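TL;DR: This paper proposes LGD-Net, which predicts HER2 expression levels directly from H&E images through cross-modal feature hallucination (rather than pixel-level virtual staining), regularized with clinical prior knowledge, and reaches state-of-the-art performance on the BCI dataset.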

Details Motivation: Conventional IHC staining is slow, expensive, and resource-dependent, which limits its availability, while existing HER2 scoring methods based on virtual IHC images are computationally heavy and suffer from reconstruction artifacts. Method: The Latent-Guided Dual-Stream Network (LGD-Net) maps H&E morphological features into the molecular latent space under the guidance of a teacher IHC encoder, and regularizes training with clinical knowledge through lightweight auxiliary tasks on nuclei distribution and membrane staining intensity. Result: On the public BCI dataset, LGD-Net significantly outperforms baseline methods while enabling efficient inference from single-modality H&E input. Conclusion: LGD-Net offers a reliable, efficient, and clinically interpretable approach to accurate HER2 assessment without IHC staining. Abstract: Accurately evaluating HER2 expression level is a critical task for breast cancer evaluation and the selection of targeted therapy. However, the standard multi-step Immunohistochemistry (IHC) staining is resource-intensive, expensive, and time-consuming, and is often unavailable in many areas. Consequently, predicting HER2 levels directly from H&E slides has emerged as a potential alternative solution. Generating virtual IHC images from H&E images has been shown to be effective for automatic HER2 scoring. However, the pixel-level virtual staining methods are computationally expensive and prone to reconstruction artifacts that can propagate diagnostic errors. To address these limitations, we propose the Latent-Guided Dual-Stream Network (LGD-Net), a novel framework that employs cross-modal feature hallucination instead of explicit pixel-level image generation. LGD-Net learns to map morphological H&E features directly to the molecular latent space, guided by a teacher IHC encoder during training. To ensure the hallucinated features capture clinically relevant phenotypes, we explicitly regularize the model training with task-specific domain knowledge, specifically nuclei distribution and membrane staining intensity, via lightweight auxiliary regularization tasks. Extensive experiments on the public BCI dataset demonstrate that LGD-Net achieves state-of-the-art performance, significantly outperforming baseline methods while enabling efficient inference using single-modality H&E inputs.

[32] Enabling Training-Free Text-Based Remote Sensing Segmentation

Jose Sosa,Danila Rukhovich,Anis Kacem,Djamila Aouada

Main category: cs.CV

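TL;DR: This paper proposes a method for text-guided segmentation of remote sensing imagery that needs no additional training: by combining CLIP, GPT-5/Qwen-VL, and SAM, it achieves state-of-the-art open-vocabulary, referring, and reasoning segmentation in a zero-shot or lightweight LoRA-tuned setting.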

Details Motivation: Most existing text-guided remote sensing segmentation methods rely on trainable modules, which limits generalization and practicality; this work explores a fully training-free or lightly tuned solution that uses only existing foundation models. Method: A contrastive VLM (CLIP) serves as the mask selector for SAM's grid-based proposals, while generative VLMs (GPT-5 zero-shot, Qwen-VL with LoRA tuning) generate click prompts for SAM to enable referring and reasoning segmentation. Result: State-of-the-art performance on 19 remote sensing benchmarks covering open-vocabulary, referring, and reasoning tasks, with the LoRA-tuned Qwen-VL variant performing best. Conclusion: Text-guided remote sensing segmentation can be done effectively with existing foundation models alone and without retraining, supporting the feasibility and strong generalization of the training-free paradigm. Abstract: Recent advances in Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have opened new opportunities for zero-shot text-guided segmentation of remote sensing imagery. However, most existing approaches still rely on additional trainable components, limiting their generalisation and practical applicability. In this work, we investigate to what extent text-based remote sensing segmentation can be achieved without additional training, by relying solely on existing foundation models. We propose a simple yet effective approach that integrates contrastive and generative VLMs with the Segment Anything Model (SAM), enabling a fully training-free or lightweight LoRA-tuned pipeline. Our contrastive approach employs CLIP as mask selector for SAM's grid-based proposals, achieving state-of-the-art open-vocabulary semantic segmentation (OVSS) in a completely zero-shot setting. In parallel, our generative approach enables reasoning and referring segmentation by generating click prompts for SAM using GPT-5 in a zero-shot setting and a LoRA-tuned Qwen-VL model, with the latter yielding the best results. Extensive experiments across 19 remote sensing benchmarks, including open-vocabulary, referring, and reasoning-based tasks, demonstrate the strong capabilities of our approach. Code will be released at https://github.com/josesosajs/trainfree-rs-segmentation.

[33] VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

Narges Norouzi,Idil Esen Zulfikar,Niccolò Cavagnero,Tommie Kerssies,Bastian Leibe,Gijs Dubbelman,Daan de Geus

Main category: cs.CV

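TL;DR: This paper proposes VidEoMT, an encoder-only video segmentation model that uses a lightweight query propagation and fusion mechanism to achieve efficient, accurate video segmentation without a dedicated tracking module, running 5x to 10x faster and at up to 160 FPS.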

Details Motivation: Inspired by plain ViTs performing well on image segmentation without specialized modules, the authors aim for a simpler, more efficient video segmentation architecture that needs no complex tracking module. Method: The Video Encoder-only Mask Transformer (VidEoMT) uses a plain ViT encoder; a lightweight query propagation mechanism reuses the previous frame's queries, and these are fused with temporally-agnostic learned queries to model cross-frame information while adapting to new content. Result: VidEoMT keeps competitive accuracy while running 5x to 10x faster than existing methods, reaching up to 160 FPS with a ViT-L backbone. Conclusion: An encoder-only structure with query propagation and fusion is enough to model temporal information in video without complex tracking modules, significantly improving efficiency and simplicity. Abstract: Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x--10x faster, running at up to 160 FPS with a ViT-L backbone. Code: https://www.tue-mps.org/videomt/
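The propagate-then-fuse pattern described in this entry can be sketched as a per-frame loop. This is a schematic, not the VidEoMT architecture: fusing by element-wise addition is an assumption (the abstract only says the two query sets are combined), and `encoder_step` is a hypothetical stand-in for the ViT encoder that updates queries from frame features.

```python
import numpy as np


def fuse_queries(propagated, learned):
    """Assumed fusion: element-wise sum of propagated and learned queries."""
    return propagated + learned


def run_video(frame_features, learned_queries, encoder_step):
    """Process frames in order, carrying queries from one frame to the next.

    The first frame starts from the learned queries alone; every later
    frame starts from the previous frame's output queries fused with the
    same temporally-agnostic learned queries.
    """
    outputs, previous = [], None
    for features in frame_features:
        if previous is None:
            queries = learned_queries
        else:
            queries = fuse_queries(previous, learned_queries)
        previous = encoder_step(features, queries)
        outputs.append(previous)
    return outputs


def toy_encoder_step(features, queries):
    """Toy update: damp the queries and add the mean frame feature."""
    return 0.5 * queries + features.mean()


learned = np.ones((2, 3))                      # 2 queries, 3 channels
frames = [np.zeros((4, 3)), np.ones((4, 3))]   # 2 frames, 4 tokens each
outputs = run_video(frames, learned, toy_encoder_step)
```

Because each frame's queries already carry the previous frame's state, object identity can persist across frames without a separate tracker, while the learned queries keep a fixed entry point for content that appears for the first time.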

[34] VQPP: Video Query Performance Prediction Benchmark

Adrian Catalin Lutu,Eduard Poesina,Radu Tudor Ionescu

Main category: cs.CV

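TL;DR: This paper presents VQPP, the first benchmark for query performance prediction in content-based video retrieval (CBVR). It comprises two text-to-video retrieval datasets and two CBVR systems, with 56K text queries and 51K videos and official training/validation/test splits. The authors explore multiple pre-retrieval and post-retrieval predictors, show that pre-retrieval predictors are effective, and use one as a reward model for training an LLM to reformulate queries.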

Details Motivation: Query performance prediction (QPP) has been studied extensively for text and image retrieval but remains largely unexplored for content-based video retrieval (CBVR), so a dedicated benchmark is needed to advance this direction. Method: Builds the first VQPP benchmark, covering two text-to-video datasets and two CBVR systems with standard splits; designs and evaluates multiple pre-retrieval and post-retrieval predictors; and uses the best pre-retrieval predictor as a reward model to train a large language model for query reformulation via direct preference optimization (DPO). Result: Pre-retrieval predictors reach competitive performance, allowing prediction without running the retrieval step; the benchmark is released publicly, and its usefulness for downstream tasks such as query reformulation is demonstrated. Conclusion: VQPP fills the gap in QPP research for the video domain, providing a standardized, reproducible benchmark and methodology as a basis for future work on video retrieval performance prediction and its applications, such as LLM-driven query optimization. Abstract: Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content-based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text-to-video retrieval datasets and two CBVR systems, respectively. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation and test splits, fostering direct comparisons and reproducible results. We explore multiple pre-retrieval and post-retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre-retrieval predictors obtain competitive performance, enabling applications before performing the retrieval step. We also demonstrate the applicability of VQPP by employing the best performing pre-retrieval predictor as reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at https://github.com/AdrianLutu/VQPP.

[35] On the Evaluation Protocol of Gesture Recognition for UAV-based Rescue Operation based on Deep Learning: A Subject-Independence Perspective

Domonkos Varga

Main category: cs.CV

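TL;DR: This paper shows that the evaluation protocol of the gesture-recognition approach by Liu and Szirányi suffers from severe data leakage: its frame-level random train-test split places samples from the same subjects in both sets, which inflates the reported accuracy. The author stresses that applications facing new users, such as UAV-human interaction, require subject-independent data splits.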

Details Motivation: To expose the data leakage in the evaluation protocol of Liu and Szirányi's gesture-recognition method and to stress the importance of subject-independent evaluation for real applications such as UAV-human interaction. Method: Analyses the original paper's confusion matrix, learning curves, and dataset construction to check whether its frame-level random train-test split leaks samples from the same subjects across both sets. Result: Confirms that the near-perfect accuracy stems from severe data leakage and that the evaluation does not reflect how well the model generalizes to gestures from unseen individuals. Conclusion: Vision-based gesture-recognition research, especially for scenarios involving new users, must adopt subject-independent data partitioning to ensure valid evaluation and reliable results. Abstract: This paper presents a methodological analysis of the gesture-recognition approach proposed by Liu and Szirányi, with a particular focus on the validity of their evaluation protocol. We show that the reported near-perfect accuracy metrics result from a frame-level random train-test split that inevitably mixes samples from the same subjects across both sets, causing severe data leakage. By examining the published confusion matrix, learning curves, and dataset construction, we demonstrate that the evaluation does not measure generalization to unseen individuals. Our findings underscore the importance of subject-independent data partitioning in vision-based gesture-recognition research, especially for applications - such as UAV-human interaction - that require reliable recognition of gestures performed by previously unseen people.
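
The difference between a frame-level random split and a subject-independent split can be made concrete with a short sketch. The data layout (pairs of subject ID and frame name) and all function names are illustrative assumptions, not taken from the paper or the dataset it analyses.

```python
import random
from collections import defaultdict

def subject_independent_split(samples, test_fraction=0.2, seed=0):
    """Split (subject_id, frame) pairs so that no subject appears in both sets."""
    by_subject = defaultdict(list)
    for subject_id, frame in samples:
        by_subject[subject_id].append((subject_id, frame))
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [s for sid in subjects if sid not in test_subjects for s in by_subject[sid]]
    test = [s for sid in subjects if sid in test_subjects for s in by_subject[sid]]
    return train, test

def frame_level_split(samples, test_fraction=0.2, seed=0):
    """Shuffle individual frames, ignoring which subject they came from."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def shared_subjects(train, test):
    """Subjects that contribute frames to both sets (a sign of leakage)."""
    return {sid for sid, _ in train} & {sid for sid, _ in test}

# Toy dataset: 5 subjects with 100 frames each.
samples = [(f"subject_{i}", f"frame_{j}") for i in range(5) for j in range(100)]
leaky_train, leaky_test = frame_level_split(samples)
clean_train, clean_test = subject_independent_split(samples)
```

With the frame-level split, `shared_subjects(leaky_train, leaky_test)` is non-empty, so the test set contains people the model has already seen; the group-wise split keeps that set empty, at the cost of holding out whole subjects.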

[36] Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models

Yuxiao Chen,Jue Wang,Zhikang Zhang,Jingru Yi,Xu Zhang,Yang Zou,Zhaowei Cai,Jianbo Yuan,Xinyu Li,Hao Yang,Davide Modolo

Main category: cs.CV

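TL;DR: This paper proposes an end-to-end framework for long-form video understanding that combines an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) with a multimodal large language model (MLLM), achieving efficient compression and understanding while preserving discriminative information.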

Details Motivation: Existing models struggle with two challenges of long videos, which have many frames and high redundancy: memory constraints and the difficulty of extracting key information. Method: Proposes an end-to-end framework consisting of an adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC), integrated with a multimodal large language model (MLLM). Result: Performs strongly on multiple long-form and standard video understanding benchmarks, showing that it combines high compression rates with good preservation of information. Conclusion: The framework adapts to videos of different lengths and balances efficiency with discriminative power, making it versatile and practical for long-form video understanding. Abstract: With recent advancements in video backbone architectures, combined with the remarkable achievements of large language models (LLMs), the analysis of long-form videos spanning tens of minutes has become both feasible and increasingly prevalent. However, the inherently redundant nature of video sequences poses significant challenges for contemporary state-of-the-art models. These challenges stem from two primary aspects: 1) efficiently incorporating a larger number of frames within memory constraints, and 2) extracting discriminative information from the vast volume of input data. In this paper, we introduce a novel end-to-end schema for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two major advantages: it adaptively and effectively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information. The proposed framework demonstrates promising performance across various benchmarks, excelling in both long-form video understanding tasks and standard video understanding benchmarks. These results underscore the versatility and efficacy of our approach, particularly in managing the complexities of prolonged video sequences.

[37] Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models

Dhruba Ghosh,Yuhui Zhang,Ludwig Schmidt

Main category: cs.CV

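TL;DR: This paper investigates why vision-language models (VLMs) underperform on fine-grained image classification and finds that vision encoder quality and the pretraining strategy, in particular whether the language model weights are unfrozen, are especially important for fine-grained performance.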

Details Motivation: Although VLMs perform well on many visual question answering tasks, they still trail on traditional fine-grained image classification benchmarks; this paper investigates why. Method: Evaluates a large number of recent VLMs on several fine-grained classification benchmarks and runs ablation experiments on the effect of the vision encoder, the language model, and the pretraining setup. Result: A better vision encoder disproportionately improves fine-grained classification, whereas a stronger language model improves all benchmarks equally; unfreezing the language model weights during pretraining is vital for fine-grained performance. Conclusion: Improving the fine-grained visual understanding of VLMs requires attention to high-quality vision encoders and specific pretraining strategies, not only to larger language models. Abstract: Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue. These improvements are evident in a wide range of VLMs built on a variety of base models, alignment architectures, and training data. However, recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge. We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks. Through a series of ablation experiments, we find that using a better LLM improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. Furthermore, we find that the pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining. These insights pave the way for enhancing fine-grained visual understanding and vision-centric capabilities in VLMs.

[38] A Single Image and Multimodality Is All You Need for Novel View Synthesis

Amirhosein Javadi,Chi-Shiang Gau,Konstantinos D. Polyzos,Tara Javidi

Main category: cs.CV

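TL;DR: This paper proposes reconstructing dense depth maps from sparse multimodal range data (such as radar or LiDAR) to improve the geometric consistency and visual quality of diffusion-based single-image novel view synthesis, without modifying the generative model itself.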

Details Motivation: Monocular depth estimation is unreliable under real-world conditions such as low texture, adverse weather, and heavy occlusion, which limits the quality of diffusion-based novel view synthesis. Method: Proposes a multimodal depth reconstruction framework that models sparse range data with a localized Gaussian Process in the angular domain, efficiently producing dense depth maps with uncertainty estimates that serve as a drop-in replacement for monocular depth estimators. Result: In experiments on real-world multimodal driving scenes, the method substantially improves the geometric consistency and visual quality of novel-view video generation. Conclusion: Reliable geometric priors are essential for diffusion-based novel view synthesis, and even extremely sparse multimodal sensing brings clear practical benefits. Abstract: Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low texture, adverse weather, and occlusion-heavy real-world conditions. In this work, we show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome these limitations. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis. Our approach models depth in an angular domain using a localized Gaussian Process formulation, enabling computationally efficient inference while explicitly quantifying uncertainty in regions with limited observations. The reconstructed depth and uncertainty are used as a drop-in replacement for monocular depth estimators in existing diffusion-based rendering pipelines, without modifying the generative model itself. Experiments on real-world multimodal driving scenes demonstrate that replacing vision-only depth with our sparse range-based reconstruction substantially improves both geometric consistency and visual quality in single-image novel-view video generation. These results highlight the importance of reliable geometric priors for diffusion-based view synthesis and demonstrate the practical benefits of multimodal sensing even at extreme levels of sparsity.

[39] ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging

Athanasios Angelakis

Main category: cs.CV

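TL;DR: ZACH-ViT is a lightweight Vision Transformer that removes positional embeddings and the [CLS] token and uses global average pooling, making it permutation invariant. It suits medical images whose spatial structure is weak or inconsistent, performs robustly in few-shot settings, and is suitable for edge deployment.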

Details Motivation: Standard Vision Transformers rely on positional embeddings and class tokens, whose fixed spatial priors may hurt generalization in settings such as medical imaging where spatial layout is weak or inconsistent. Method: Proposes ZACH-ViT, which removes positional embeddings and the [CLS] token and aggregates patch features with global average pooling; adaptive residual projections keep training stable at a small parameter count. Result: Evaluated on seven MedMNIST datasets (50 samples per class): the advantage is strongest on BloodMNIST, performance is competitive with TransMIL on PathMNIST, and the relative advantage shrinks on OCTMNIST and OrganAMNIST, which have strong anatomical priors; the model has only 0.25M parameters, needs no pretraining, and runs inference in under a second. Conclusion: Architectural inductive bias should be aligned with the structure of the data instead of chasing universal benchmark performance; ZACH-ViT offers an efficient, compact, and robust ViT alternative for resource-constrained clinical settings. Abstract: Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors. While effective for natural images, these priors may hinder generalization when spatial layout is weakly informative or inconsistent, a frequent condition in medical imaging and edge-deployed clinical systems. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes both positional embeddings and the [CLS] token, achieving permutation invariance through global average pooling over patch representations. The term "Zero-token" specifically refers to removing the dedicated [CLS] aggregation token and positional embeddings; patch tokens remain unchanged and are processed normally. Adaptive residual projections preserve training stability in compact configurations while maintaining a strict parameter budget. Evaluation is performed across seven MedMNIST datasets spanning binary and multi-class tasks under a strict few-shot protocol (50 samples per class, fixed hyperparameters, five random seeds). The empirical analysis demonstrates regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and remains competitive with TransMIL on PathMNIST, while its relative advantage decreases on datasets with strong anatomical priors (OCTMNIST, OrganAMNIST), consistent with the architectural hypothesis. These findings support the view that aligning architectural inductive bias with data structure can be more important than pursuing universal benchmark dominance. Despite its minimal size and lack of pretraining, ZACH-ViT achieves competitive performance while maintaining sub-second inference times, supporting deployment in resource-constrained clinical environments. Code and models are available at https://github.com/Bluesman79/ZACH-ViT.
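
The permutation-invariance claim in the abstract above follows from a general property: self-attention without positional embeddings is permutation equivariant, and averaging over tokens removes the remaining dependence on order. The sketch below checks this numerically on a toy single-head attention with identity projections; it is a simplified illustration, not the ZACH-ViT architecture.

```python
import math
import random

def self_attention(tokens):
    """Single-head attention with identity projections and no positional embedding."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def global_average_pool(tokens):
    """Mean over the token axis; replaces the [CLS] token as the aggregate."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]

def encode(tokens):
    return global_average_pool(self_attention(tokens))

rng = random.Random(0)
patches = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(9)]
shuffled = patches[:]
rng.shuffle(shuffled)

original_repr = encode(patches)
shuffled_repr = encode(shuffled)
```

The two representations agree up to floating-point rounding, so the model cannot exploit patch order. That is helpful when layout carries little signal and a limitation when anatomy is strongly structured, which matches the regime-dependent results the abstract reports.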

[40] ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

Guoheng Sun,Tingting Du,Kaixi Feng,Chenxiang Luo,Xingguo Ding,Zheyu Shen,Ziyao Wang,Yexiao He,Ang Li

Main category: cs.CV

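TL;DR: This paper proposes ROCKET, a residual-oriented multi-layer representation alignment framework that uses a shared projector to align a 2D VLA model with a 3D vision foundation model at multiple layers, mitigating gradient interference and reaching state-of-the-art-level robotic manipulation success rates at very low compute cost.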

Details Motivation: Existing VLA models lack 3D spatial understanding; single-layer representation alignment cannot fully exploit information spread across layers, while naive multi-layer alignment tends to cause gradient interference. Method: Proposes ROCKET, which uses a shared projector to align multiple layers of the VLA backbone with a 3D vision foundation model in a residual-oriented way; introduces a Matryoshka-style sparse activation scheme to balance the multiple alignment losses; and combines this with a training-free layer selection strategy. Result: Achieves a 98.5% state-of-the-art success rate on LIBERO with only about 4% of the compute budget, and shows generalization on LIBERO-Plus, RoboTwin, and multiple VLA models. Conclusion: Multi-layer residual alignment with a shared projector is an effective and efficient way to improve the 3D understanding of VLA models, substantially reducing compute cost while maintaining high performance. Abstract: Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and further propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving 98.5% state-of-the-art success rate on LIBERO. We further demonstrate the superior performance of ROCKET across LIBERO-Plus and RoboTwin, as well as multiple VLA models. The code and model weights can be found at https://github.com/CASE-Lab-UMD/ROCKET-VLA.

[41] Image Quality Assessment: Exploring Quality Awareness via Memory-driven Distortion Patterns Matching

Xuting Lan,Mingliang Zhou,Xuekai Wei,Jielu Yan,Yueting Huang,Huayan Pu,Jun Luo,Weijia Jia

Main category: cs.CV

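TL;DR: This paper proposes a memory-driven quality-aware framework (MQAF) inspired by human visual memory. It builds a memory bank of distortion patterns and switches dynamically between reference-based and reference-free modes, reducing reliance on high-quality reference images while performing strongly on both full-reference and no-reference image quality assessment.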

Details Motivation: Existing full-reference image quality assessment methods are constrained by the quality of the reference image and are hard to apply in real-world settings; the human visual system, by contrast, can assess quality from long-term visual memory, which motivates a framework that imitates this mechanism. Method: Proposes the memory-driven quality-aware framework (MQAF), which builds a memory bank of distortion patterns and uses a dual-mode assessment strategy: with a reference, it combines reference information with memory-bank matching; without a reference, it infers quality from the memory bank alone. Result: Outperforms state-of-the-art methods on multiple datasets while supporting both full-reference (FR-IQA) and no-reference (NR-IQA) assessment. Conclusion: MQAF reduces the dependence on ideal reference images and improves the applicability and robustness of image quality assessment in real-world scenarios. Abstract: Existing full-reference image quality assessment (FR-IQA) methods achieve high-precision evaluation by analysing feature differences between reference and distorted images. However, their performance is constrained by the quality of the reference image, which limits real-world applications where ideal reference sources are unavailable. Notably, the human visual system has the ability to accumulate visual memory, allowing image quality assessment on the basis of long-term memory storage. Inspired by this biological memory mechanism, we propose a memory-driven quality-aware framework (MQAF), which establishes a memory bank for storing distortion patterns and dynamically switches between dual-mode quality assessment strategies to reduce reliance on high-quality reference images. When reference images are available, MQAF obtains reference-guided quality scores by adaptively weighting reference information and comparing the distorted image with stored distortion patterns in the memory bank. When the reference image is absent, the framework relies on distortion patterns in the memory bank to infer image quality, enabling no-reference quality assessment (NR-IQA). The experimental results show that our method outperforms state-of-the-art approaches across multiple datasets while adapting to both no-reference and full-reference tasks.

[42] MUOT_3M: A 3 Million Frame Multimodal Underwater Benchmark and the MUTrack Tracking Method

Ahsan Baidar Bakht,Mohamad Alansari,Muhayy Ud Din,Muzammal Naseer,Sajid Javed,Irfan Hussain,Jiri Matas,Arif Mahmood

Main category: cs.CV

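TL;DR: This paper introduces MUOT_3M, the first pseudo-multimodal underwater object tracking benchmark (3 million frames, 4 modalities), and MUTrack, a SAM-based multimodal-to-unimodal tracker that clearly outperforms state-of-the-art methods on several benchmarks while running in real time.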

Details Motivation: Existing underwater object tracking datasets are small and single-modality (RGB only), making it hard to handle severe color distortion, turbidity, and low visibility underwater. Method: Builds MUOT_3M, a large-scale pseudo-multimodal UOT benchmark with RGB, enhanced RGB, estimated depth, and language modalities; proposes the MUTrack tracker, which combines visual geometric alignment, vision-language fusion, and four-level knowledge distillation to transfer multimodal knowledge into a unimodal student model. Result: Across five UOT benchmarks, MUTrack improves AUC by up to 8.40% and precision by up to 7.80%, running at 24 FPS. Conclusion: MUOT_3M and MUTrack establish a new foundation for scalable underwater tracking that is trained multimodally yet remains practical to deploy. Abstract: Underwater Object Tracking (UOT) is crucial for efficient marine robotics, large scale ecological monitoring, and ocean exploration; however, progress has been hindered by the scarcity of large, multimodal, and diverse datasets. Existing benchmarks remain small and RGB only, limiting robustness under severe color distortion, turbidity, and low visibility conditions. We introduce MUOT_3M, the first pseudo multimodal UOT benchmark comprising 3 million frames from 3,030 videos (27.8h) annotated with 32 tracking attributes, 677 fine grained classes, and synchronized RGB, estimated enhanced RGB, estimated depth, and language modalities validated by a marine biologist. Building upon MUOT_3M, we propose MUTrack, a SAM-based multimodal to unimodal tracker featuring visual geometric alignment, vision language fusion, and four level knowledge distillation that transfers multimodal knowledge into a unimodal student model. Extensive evaluations across five UOT benchmarks demonstrate that MUTrack achieves up to 8.40% higher AUC and 7.80% higher precision than the strongest SOTA baselines while running at 24 FPS. MUOT_3M and MUTrack establish a new foundation for scalable, multimodally trained yet practically deployable underwater tracking.

[43] Towards LLM-centric Affective Visual Customization via Efficient and Precise Emotion Manipulating

Jiamin Luo,Xuqian Gu,Jingjing Wang,Jiahong Lu

Main category: cs.CV

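TL;DR: This paper proposes an LLM-centric Affective Visual Customization (L-AVC) task, which aims to modify the subjective emotion of images via a multimodal LLM, and designs an Efficient Inter-emotion Converting (EIC) module and a Precise Exter-emotion Retaining (PER) module (together, EPEM), which outperform existing methods on a self-constructed dataset.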

Details Motivation: Existing visual customization research overlooks subjective emotional content and lacks general-purpose foundation models for affective customization; two challenges must be addressed: converting emotion semantics and retaining emotion-agnostic content. Method: Proposes the L-AVC task and the EPEM approach, consisting of an Efficient Inter-emotion Converting (EIC) module and a Precise Exter-emotion Retaining (PER) module, with emotion editing built on a multimodal LLM. Result: On the self-constructed L-AVC dataset, EPEM clearly outperforms several state-of-the-art baselines, supporting both the importance of emotion information and the efficiency and precision of EPEM in manipulating it. Conclusion: Emotion information is important for visual customization; EPEM manipulates the subjective emotion of images efficiently and precisely, offering a new approach to affective generation. Abstract: Previous studies on visual customization primarily rely on the objective alignment between various control signals (e.g., language, layout and canny) and the edited images, which largely ignore the subjective emotional contents, and more importantly lack general-purpose foundation models for affective visual customization. With this in mind, this paper proposes an LLM-centric Affective Visual Customization (L-AVC) task, which focuses on generating images by modifying their subjective emotions via Multimodal LLM. Further, this paper contends that how to make the model efficiently align emotion conversion in semantics (named inter-emotion semantic conversion) and how to precisely retain emotion-agnostic contents (named exter-emotion semantic retaining) are rather important and challenging in this L-AVC task. To this end, this paper proposes an Efficient and Precise Emotion Manipulating (EPEM) approach for editing subjective emotions in images. Specifically, an Efficient Inter-emotion Converting (EIC) module is tailored to make the LLM efficiently align emotion conversion in semantics before and after editing, followed by a Precise Exter-emotion Retaining (PER) module to precisely retain the emotion-agnostic contents. Comprehensive experimental evaluations on our constructed L-AVC dataset demonstrate the clear advantage of the proposed EPEM approach on the L-AVC task over several state-of-the-art baselines. This justifies the importance of emotion information for L-AVC and the effectiveness of EPEM in efficiently and precisely manipulating such information.

[44] DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE

Yujie Jin,Wenxin Zhang,Jingjing Wang,Guodong Zhou

Main category: cs.CV

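TL;DR: This paper introduces DeepSVU, a new security-oriented video understanding task that aims not only to detect and locate threats but also to attribute and evaluate their causes, and proposes a Unified Physical-world Regularized MoE (UPRM) approach whose effectiveness is validated on self-constructed datasets.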

Details Motivation: Existing work on Security-oriented Video Understanding (SVU) focuses mainly on threat detection and localization and lacks the ability to generate and evaluate the causes of threats, leaving a clear gap. Method: Proposes the Unified Physical-world Regularized MoE (UPRM) approach with two core components: the Unified Physical-world Enhanced MoE (UPE) Block, which models coarse-to-fine physical-world information (such as behavior, object interactions, and background), and the Physical-world Trade-off Regularizer (PTR), which adaptively balances these factors. Result: Experiments on the self-constructed DeepSVU instruction datasets (UCF-C instructions and CUVA instructions) show that UPRM outperforms several advanced Video-LLMs and non-VLM approaches, supporting the effectiveness of modeling and trading off physical-world information. Conclusion: Coarse-to-fine physical-world information is essential for in-depth security video understanding; UPRM models it effectively and balances it adaptively, offering a new approach to the SVU task. Abstract: In the literature, prior research on Security-oriented Video Understanding (SVU) has predominantly focused on detecting and localizing the threats (e.g., shootings, robberies) in videos, while largely lacking the effective capability to generate and evaluate the threat causes. Motivated by these gaps, this paper introduces a new chat-paradigm SVU task, i.e., In-depth Security-oriented Video Understanding (DeepSVU), which aims to not only identify and locate the threats but also attribute and evaluate the causes of threatening segments. Furthermore, this paper reveals two key challenges in the proposed task: 1) how to effectively model the coarse-to-fine physical-world information (e.g., human behavior, object interactions and background context) to boost the DeepSVU task; and 2) how to adaptively trade off these factors. To tackle these challenges, this paper proposes a new Unified Physical-world Regularized MoE (UPRM) approach. Specifically, UPRM incorporates two key components: the Unified Physical-world Enhanced MoE (UPE) Block and the Physical-world Trade-off Regularizer (PTR), to address the above two challenges, respectively. Extensive experiments conducted on our DeepSVU instructions datasets (i.e., UCF-C instructions and CUVA instructions) demonstrate that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches. These results justify the importance of the coarse-to-fine physical-world information in the DeepSVU task and demonstrate the effectiveness of our UPRM in capturing such information.

[45] UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

Jiabing Yang,Yixiang Chen,Yuan Xu,Peiyan Li,Xiangnan Wu,Zichen Wen,Bowen Fang,Tao Yu,Zhengbo Zhang,Yingda Li,Kai Wang,Jing Liu,Nianfeng Liu,Yan Huang,Liang Wang

Main category: cs.CV

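TL;DR: This paper proposes UAOR, a training-free, plug-and-play module that reinjects observation information into the FFN when a language model layer shows high uncertainty, improving the accuracy and robustness of action generation in VLA models without extra data or modules.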

Details Motivation: To improve performance, existing VLA models often rely on extra observation cues or auxiliary modules, which bring high data collection and training costs; inspired by the finding that the FFN in language models can act as "key-value memory", the authors explore a training-free enhancement. Method: Proposes the Uncertainty-aware Observation Reinjection (UAOR) module: Action Entropy measures the uncertainty of the current layer, and when uncertainty is high, key observation information is reinjected into the next layer's FFN through attention retrieval. Result: UAOR consistently improves a range of VLA models on simulation and real-world tasks with minimal overhead and without extra observation inputs or auxiliary modules. Conclusion: UAOR is an efficient, general, and practical plug-and-play module that offers a lightweight, training-free way to improve VLA inference. Abstract: Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed-Forward Network (FFN) through attention retrieval. This mechanism helps VLAs better attend to observations during inference, enabling more confident and faithful action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.
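
The gating logic described in the abstract above (measure uncertainty as entropy, reinject observations only when it is high) can be sketched as follows. The threshold value, the additive update, and all function names are hypothetical illustrations; the abstract does not give the exact definition of Action Entropy or the form of the reinjection.

```python
import math

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a less confident distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def retrieve(hidden, observations):
    """Attention-weighted average of observation vectors, queried by the hidden state."""
    d = len(hidden)
    scores = [sum(h * o for h, o in zip(hidden, obs)) / math.sqrt(d) for obs in observations]
    weights = softmax(scores)
    return [sum(w * obs[j] for w, obs in zip(weights, observations)) for j in range(d)]

def maybe_reinject(hidden, action_logits, observations, threshold=1.0):
    """Add retrieved observation features only when action entropy exceeds the threshold."""
    if entropy(softmax(action_logits)) <= threshold:
        return hidden  # confident layer: leave the hidden state untouched
    retrieved = retrieve(hidden, observations)
    return [h + r for h, r in zip(hidden, retrieved)]  # additive update is an assumption

hidden = [1.0, 0.0]
observations = [[1.0, 0.0], [0.0, 1.0]]
confident = maybe_reinject(hidden, [10.0, 0.0, 0.0, 0.0], observations)
uncertain = maybe_reinject(hidden, [0.0, 0.0, 0.0, 0.0], observations)
```

Because the gate only reads quantities that already exist at inference time, a scheme of this kind needs no additional training, in line with the plug-and-play framing of the abstract.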

[46] Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers

Guandong Li,Mengxia Ye

Main category: cs.CV

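TL;DR: This paper proposes Dual-Channel Attention Guidance (DCAG), a training-free image editing method for diffusion models that manipulates both the Key and Value channels of DiT attention to give fine-grained, decoupled control over editing intensity, clearly improving editing fidelity on PIE-Bench.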

Details Motivation: Existing DiT-based diffusion editing models lack training-free fine-grained control over editing intensity; prior attention manipulation methods adjust only the Key space and ignore the role of the Value space in feature aggregation. Method: Observes that both Key and Value projections in DiT's multi-modal attention layers show a pronounced bias-delta structure; builds the DCAG framework on this, jointly manipulating the Key channel (coarse control of attention routing through softmax) and the Value channel (fine-grained control of feature aggregation through linear weighting), yielding a two-dimensional control space (δ_k, δ_v). Result: On PIE-Bench (700 images, 10 categories), DCAG outperforms Key-only guidance on all fidelity metrics, reducing LPIPS by 4.9% for object deletion and 3.2% for object addition. Conclusion: Jointly controlling the Key and Value channels gives a better editing-fidelity trade-off, confirming the value of the Value space for training-free diffusion editing. Abstract: Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space -- which governs feature aggregation -- entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(δ_k, δ_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
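
The contrast the abstract draws between the two channels (Key changes pass through softmax, Value changes enter the output linearly) can be checked on a toy attention step. The sketch assumes one possible reading of the bias-delta structure: the bias is estimated as the mean token vector and each token's deviation from it is scaled by `1 + delta`. The actual DCAG update rule is not given in the abstract, so `rescale_around_bias` and `attend` are hypothetical.

```python
import math

def rescale_around_bias(vectors, delta):
    """Scale each vector's deviation from the mean (treated as the bias) by 1 + delta."""
    n, d = len(vectors), len(vectors[0])
    bias = [sum(v[j] for v in vectors) / n for j in range(d)]
    return [[bias[j] + (1.0 + delta) * (v[j] - bias[j]) for j in range(d)] for v in vectors]

def attend(query, keys, values, delta_k=0.0, delta_v=0.0):
    """Single-query attention with separately guided Key and Value channels."""
    keys = rescale_around_bias(keys, delta_k)
    values = rescale_around_bias(values, delta_v)
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

query = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]

base = attend(query, keys, values)
small_v = attend(query, keys, values, delta_v=0.1)
large_v = attend(query, keys, values, delta_v=0.2)
```

Doubling `delta_v` doubles the change in the output, because the attention weights are unaffected and the output is a linear function of the values. Changing `delta_k` instead alters the weights through softmax, so its effect is not proportional in general.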

[47] Spatio-temporal Decoupled Knowledge Compensator for Few-Shot Action Recognition

Hongyu Qu,Xiangbo Shu,Rui Yan,Hailiang Gao,Wenguan Wang,Jinhui Tang

Main category: cs.CV

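TL;DR: This paper proposes the DiST framework, which decouples action names into spatio-temporal attribute descriptions and uses spatial and temporal knowledge compensators to learn multi-granularity prototypes, improving few-shot action recognition.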

Details Motivation: Existing methods use only coarse action names as auxiliary context, which provides too little background knowledge to capture novel spatial and temporal concepts in actions. Method: Proposes the DiST framework: in the decomposition stage, action names are decoupled into spatial and temporal attribute descriptions; in the incorporation stage, Spatial/Temporal Knowledge Compensators (SKC/TKC) learn object-level and frame-level prototypes, respectively. Result: Achieves state-of-the-art performance on five standard few-shot action recognition datasets. Conclusion: Decoupled spatio-temporal knowledge from large language models effectively strengthens the modeling of fine-grained spatial details and diverse temporal patterns in few-shot action recognition. Abstract: Few-Shot Action Recognition (FSAR) is a challenging task that requires recognizing novel action categories with a few labeled videos. Recent works typically apply semantically coarse category names as auxiliary contexts to guide the learning of discriminative visual features. However, such context provided by the action names is too limited to provide sufficient background knowledge for capturing novel spatial and temporal concepts in actions. In this paper, we propose DiST, an innovative Decomposition-incorporation framework for FSAR that makes use of decoupled Spatial and Temporal knowledge provided by large language models to learn expressive multi-granularity prototypes. In the decomposition stage, we decouple vanilla action names into diverse spatio-temporal attribute descriptions (action-related knowledge). Such commonsense knowledge complements semantic contexts from spatial and temporal perspectives. In the incorporation stage, we propose Spatial/Temporal Knowledge Compensators (SKC/TKC) to discover discriminative object-level and frame-level prototypes, respectively. In SKC, object-level prototypes adaptively aggregate important patch tokens under the guidance of spatial knowledge. Moreover, in TKC, frame-level prototypes utilize temporal attributes to assist in inter-frame temporal relation modeling. These learned prototypes thus provide transparency in capturing fine-grained spatial details and diverse temporal patterns. Experimental results show DiST achieves state-of-the-art results on five standard FSAR datasets.

[48] CityGuard: Graph-Aware Private Descriptors for Bias-Resilient Identity Search Across Urban Cameras

Rong Fu,Wenxin Zhang,Yibo Meng,Jia Yee Tan,Jiaxuan Lu,Rui Lu,Jiekai Wu,Zhaolu Kang,Simon Fong

Main category: cs.CV

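TL;DR: This paper proposes CityGuard, a topology-aware transformer framework for city-scale cross-camera person re-identification that combines privacy protection (differential privacy) with robustness to viewpoint change, occlusion, and domain shift, using dispersion-adaptive metric learning, spatially conditioned attention, and differentially private embedding maps for efficient and secure retrieval.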

Details Motivation: City-scale cross-camera person re-identification must cope with severe appearance changes (viewpoint, occlusion, domain shift) while complying with data protection rules that prohibit sharing raw imagery. Method: Proposes the CityGuard framework with three components: 1) a dispersion-adaptive metric learner that adjusts instance-level margins according to feature spread; 2) spatially conditioned attention that injects coarse geometry (such as GPS or floor plans) into graph-based self-attention, enabling cross-view alignment without survey-grade calibration; 3) differentially private embedding maps combined with compact approximate indexes for privacy and deployment efficiency. Result: On Market-1501 and other public benchmarks, and in database-scale retrieval studies, CityGuard shows consistent gains in retrieval precision and query throughput over strong baselines, supporting its practicality for privacy-critical urban identity matching. Conclusion: CityGuard achieves robust person re-identification with a tunable privacy-utility balance under rigorous differential-privacy accounting, offering a workable privacy-preserving solution for distributed urban surveillance systems. Abstract: City-scale person re-identification across distributed cameras must handle severe appearance changes from viewpoint, occlusion, and domain shift while complying with data protection rules that prevent sharing raw imagery. We introduce CityGuard, a topology-aware transformer for privacy-preserving identity retrieval in decentralized surveillance. The framework integrates three components. A dispersion-adaptive metric learner adjusts instance-level margins according to feature spread, increasing intra-class compactness. Spatially conditioned attention injects coarse geometry, such as GPS or deployment floor plans, into graph-based self-attention to enable projectively consistent cross-view alignment using only coarse geometric priors without requiring survey-grade calibration. Differentially private embedding maps are coupled with compact approximate indexes to support secure and cost-efficient deployment. Together these designs produce descriptors robust to viewpoint variation, occlusion, and domain shifts, and they enable a tunable balance between privacy and utility under rigorous differential-privacy accounting. Experiments on Market-1501 and additional public benchmarks, complemented by database-scale retrieval studies, show consistent gains in retrieval precision and query throughput over strong baselines, confirming the practicality of the framework for privacy-critical urban identity matching.

[49] Temporal Consistency-Aware Text-to-Motion Generation

Hongsong Wang,Wenjing Yan,Qiuxia Lai,Xin Geng

Main category: cs.CV

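TL;DR: This paper proposes the TCA-T2M framework, which combines a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE), a masked motion transformer, and a kinematic constraint block to improve cross-sequence temporal consistency and physical plausibility in text-to-motion generation, reaching state-of-the-art results on HumanML3D and KIT-ML.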

Details Motivation: Existing two-stage text-to-motion (T2M) methods use discrete motion representations but neglect the temporal structure shared across different instances of the same action (cross-sequence temporal consistency), leading to semantic misalignment and physically implausible motions. Method: Proposes the TCA-T2M framework: 1) a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment; 2) a masked motion transformer for text-conditioned motion generation; 3) a kinematic constraint block that mitigates discretization artifacts and ensures physical plausibility. Result: Achieves state-of-the-art performance on the HumanML3D and KIT-ML benchmarks, supporting the importance of temporal consistency for robust, coherent T2M generation. Conclusion: Cross-sequence temporal consistency is an important factor in T2M generation quality; by jointly modeling temporal structure and kinematic constraints, TCA-T2M addresses both semantic alignment and physical plausibility. Abstract: Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignments and physically implausible motions. To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance, highlighting the importance of temporal consistency in robust and coherent T2M generation.

[50] 3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis

Ziyue Wang,Linghan Cai,Chang Han Low,Haofeng Liu,Junde Wu,Jingyu Wang,Rui Wang,Lei Song,Jiang Bian,Jingjing Fu,Yueming Jin

Main category: cs.CV

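TL;DR: This paper proposes 3DMedAgent, a unified agent that enables 2D multimodal large language models (MLLMs) to perform general 3D CT analysis without 3D fine-tuning; it coordinates heterogeneous tools, decomposes tasks in stages, and relies on a structured long-term memory for evidence-driven multi-step reasoning, outperforming existing MLLMs of various kinds on 40+ tasks.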

Details Motivation: Existing 3D CT analysis methods are limited to isolated task-specific modeling or task-agnostic end-to-end paradigms, which makes it hard to accumulate perceptual evidence systematically; meanwhile, mainstream MLLMs are mostly designed for 2D inputs and cannot handle volumetric medical data effectively. Method: Proposes the 3DMedAgent framework: a flexible MLLM agent coordinates visual and textual tools and progressively decomposes complex 3D analysis into subtasks that move from global to local views, from 3D volumes to key 2D slices, and from visual evidence to structured text; a structured long-term memory aggregates intermediate results to support query-adaptive, evidence-driven multi-step reasoning; the DeepChestVQA benchmark is built to evaluate perception-to-understanding capabilities in 3D thoracic imaging. Result: Across more than 40 3D medical analysis tasks, 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs, pointing to a scalable path toward general-purpose 3D clinical assistants. Conclusion: 3DMedAgent bridges the gap between 2D MLLMs and 3D medical image analysis, showing that unified 3D clinical analysis with strong generalization is achievable without 3D fine-tuning. Abstract: 3D CT analysis spans a continuum from low-level perception to high-level clinical understanding. Existing 3D-oriented analysis methods adopt either isolated task-specific modeling or task-agnostic end-to-end paradigms to produce one-hop outputs, impeding the systematic accumulation of perceptual evidence for downstream reasoning. In parallel, recent multimodal large language models (MLLMs) exhibit improved visual perception and can integrate visual and textual information effectively, yet their predominantly 2D-oriented designs fundamentally limit their ability to perceive and analyze volumetric medical data. To bridge this gap, we propose 3DMedAgent, a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent, progressively decomposing complex 3D analysis into tractable subtasks that transition from global to regional views, from 3D volumes to informative 2D slices, and from visual evidence to structured textual representations. Central to this design, 3DMedAgent maintains a long-term structured memory that aggregates intermediate tool outputs and supports query-adaptive, evidence-driven multi-step reasoning. We further introduce the DeepChestVQA benchmark for evaluating unified perception-to-understanding capabilities in 3D thoracic imaging.
Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs, highlighting a scalable path toward general-purpose 3D clinical assistants. Code and data are available at https://github.com/jinlab-imvr/3DMedAgent.

[51] Faster Training, Fewer Labels: Self-Supervised Pretraining for Fine-Grained BEV Segmentation

Daniel Busch,Christian Bohn,Thomas Kurbiel,Klaus Friedrichs,Richard Meyes,Tobias Meisen

Main category: cs.CV

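TL;DR: This paper proposes a two-stage training strategy, self-supervised pretraining followed by supervised fine-tuning on half of the labeled data, that improves road-marking segmentation in BEV semantic maps while reducing both annotation requirements and training time.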

Details Motivation: Existing multi-camera BEV semantic mapping methods depend on expensive and inconsistently annotated BEV ground-truth labels, which limits their scalability and practicality. Method: Stage one is self-supervised pretraining: BEVFormer predictions are differentiably reprojected into the image plane and aligned with multi-view semantic pseudo-labels generated by Mask2Former, with an additional temporal consistency loss. Stage two is supervised fine-tuning that uses only 50% of the labeled data. Result: On nuScenes, mIoU improves by up to +2.5 percentage points, annotation usage is halved, and total training time is reduced by up to two thirds. Conclusion: Differentiable reprojection combined with image-view pseudo-labels learns transferable BEV features and offers a scalable path toward autonomous-driving perception with lower labeling requirements. Abstract: Dense Bird's Eye View (BEV) semantic maps are central to autonomous driving, yet current multi-camera methods depend on costly, inconsistently annotated BEV ground truth. We address this limitation with a two-phase training strategy for fine-grained road marking segmentation that removes full supervision during pretraining and halves the amount of training data during fine-tuning while still outperforming the comparable supervised baseline model. During the self-supervised pretraining, BEVFormer predictions are differentiably reprojected into the image plane and trained against multi-view semantic pseudo-labels generated by the widely used semantic segmentation model Mask2Former. A temporal loss encourages consistency across frames. The subsequent supervised fine-tuning phase requires only 50% of the dataset and significantly less training time. With our method, fine-tuning benefits from rich priors learned during pretraining, which boost performance and BEV segmentation quality (up to +2.5pp mIoU over the fully supervised baseline) on nuScenes. It simultaneously halves the usage of annotation data and reduces total training time by up to two thirds. The results demonstrate that differentiable reprojection plus camera perspective pseudo labels yields transferable BEV features and a scalable path toward reduced-label autonomous perception.

[52] Comparative Assessment of Multimodal Earth Observation Data for Soil Moisture Estimation

Ioannis Kontogiorgakis,Athanasios Askitopoulos,Iason Tsardanidis,Dimitrios Bormpoudakis,Ilias Tsoumas,Fotios Balampanis,Charalampos Kontoes

Main category: cs.CV

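TL;DR: This paper presents a machine learning framework that combines Sentinel-1 SAR, Sentinel-2 optical imagery, and ERA-5 reanalysis data to estimate soil moisture at 10 m resolution over vegetated areas in Europe; with multi-source fusion and temporal matching, validated on sparse ground stations, it reaches R²≈0.518, and traditional spectral indices with tree-based models remain competitive with foundation-model embeddings.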

Details Motivation: Existing satellite soil moisture products are too coarse (>1km) for farm-level applications, so high-resolution (e.g., 10m) estimation methods that generalize well are needed. Method: Fuses Sentinel-1 SAR, Sentinel-2 optical imagery, and ERA-5 reanalysis data using machine learning (in particular tree-based ensembles), with temporal matching strategies (such as Sentinel-2 current-day data + Sentinel-1 descending orbit + a 10-day ERA5 lookback window) and spatial cross-validation; compares hand-crafted spectral features with embeddings from the IBM-NASA Prithvi foundation model. Result: The best combination (Sentinel-2 current-day + Sentinel-1 descending orbit + 10-day ERA5 lookback) reaches R²=0.518; Prithvi embeddings give only a marginal gain (R²=0.515 vs. 0.514), so traditional features remain competitive. Conclusion: For field-scale soil moisture retrieval with sparse ground observations, domain-specific spectral indices combined with tree-based ensembles are an efficient, practical, and operationally deployable option; foundation-model embeddings show no notable advantage on this task. Abstract: Accurate soil moisture (SM) estimation is critical for precision agriculture, water resources management and climate monitoring. Yet, existing satellite SM products are too coarse (>1km) for farm-level applications. We present a high-resolution (10m) SM estimation framework for vegetated areas across Europe, combining Sentinel-1 SAR, Sentinel-2 optical imagery and ERA-5 reanalysis data through machine learning. Using 113 International Soil Moisture Network (ISMN) stations spanning diverse vegetated areas, we compare modality combinations with temporal parameterizations, using spatial cross-validation, to ensure geographic generalization. We also evaluate whether foundation model embeddings from IBM-NASA's Prithvi model improve upon traditional hand-crafted spectral features. Results demonstrate that hybrid temporal matching - Sentinel-2 current-day acquisitions with Sentinel-1 descending orbit - achieves R^2=0.514, with 10-day ERA5 lookback window improving performance to R^2=0.518. Foundation model (Prithvi) embeddings provide negligible improvement over hand-crafted features (R^2=0.515 vs. 0.514), indicating traditional feature engineering remains highly competitive for sparse-data regression tasks. Our findings suggest that domain-specific spectral indices combined with tree-based ensemble methods offer a practical and computationally efficient solution for operational pan-European field-scale soil moisture monitoring.
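
The evaluation protocol named in the abstract, R^2 scored under spatial cross-validation, can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the function names `r2_score` and `station_folds`, the round-robin fold assignment, and the toy station IDs in the usage note are all assumptions made for the example. What it shows is that whole stations are held out together, so no station contributes samples to both sides of a fold.

```python
from collections import defaultdict


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def station_folds(station_ids, n_folds):
    """Yield (train_idx, test_idx) with every station kept inside one fold.

    Stations are assigned to folds round-robin in sorted order, which is
    an illustrative choice; any station-level grouping would do.
    """
    by_station = defaultdict(list)
    for idx, sid in enumerate(station_ids):
        by_station[sid].append(idx)
    stations = sorted(by_station)
    for k in range(n_folds):
        held_out = set(stations[k::n_folds])
        test = [i for s in stations if s in held_out for i in by_station[s]]
        train = [i for s in stations if s not in held_out for i in by_station[s]]
        yield train, test
```

With `station_ids = ["a", "a", "b", "c", "c", "b"]` and `n_folds=3`, each fold tests on exactly one station and trains on the other two. Splitting by sample instead of by station would let near-duplicate measurements from one site appear on both sides, which tends to inflate R^2.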

[53] DohaScript: A Large-Scale Multi-Writer Dataset for Continuous Handwritten Hindi Text

Kunwar Arpit Singh,Ankush Prakash,Haroon R Lone

Main category: cs.CV

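TL;DR: This paper introduces DohaScript, a large-scale, multi-writer dataset of handwritten Hindi text in Devanagari with samples from 531 contributors, all transcribing the same six fixed traditional Hindi couplets (dohas), intended to support handwriting recognition, writer identification, style analysis, and generative modeling.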

Details Motivation: Existing Devanagari handwriting datasets are small, fragmented in content, and lack writer diversity and controlled lexical design, so they fail to reflect real writing characteristics such as continuous cursive strokes, shirorekha connections, and rich ligatures. Method: Builds a parallel stylistic corpus named DohaScript: 531 writers transcribe the same six fixed Hindi dohas; the data is accompanied by anonymized demographic metadata, quality filtering based on sharpness and resolution, and page-level layout difficulty annotations. Result: Baseline experiments show that the dataset clearly separates quality levels and generalizes strongly to unseen writers. Conclusion: DohaScript provides a standardized, reproducible benchmark for research on continuous handwritten Devanagari text in low-resource script settings. Abstract: Despite having hundreds of millions of speakers, handwritten Devanagari text remains severely underrepresented in publicly available benchmark datasets. Existing resources are limited in scale, focus primarily on isolated characters or short words, and lack controlled lexical content and writer-level diversity, which restricts their utility for modern data-driven handwriting analysis. As a result, they fail to capture the continuous, fused, and structurally complex nature of Devanagari handwriting, where characters are connected through a shared shirorekha (horizontal headline) and exhibit rich ligature formations. We introduce DohaScript, a large-scale, multi-writer dataset of handwritten Hindi text collected from 531 unique contributors. The dataset is designed as a parallel stylistic corpus, in which all writers transcribe the same fixed set of six traditional Hindi dohas (couplets). This controlled design enables systematic analysis of writer-specific variation independent of linguistic content, and supports tasks such as handwriting recognition, writer identification, style analysis, and generative modeling. The dataset is accompanied by non-identifiable demographic metadata, rigorous quality curation based on objective sharpness and resolution criteria, and page-level layout difficulty annotations that facilitate stratified benchmarking. Baseline experiments demonstrate clear quality separation and strong generalization to unseen writers, highlighting the dataset's reliability and practical value.
DohaScript is intended to serve as a standardized and reproducible benchmark for advancing research on continuous handwritten Devanagari text in low-resource script settings.

[54] Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers

Hanshuai Cui,Zhiqing Tang,Qianli Ma,Zhi Yao,Weijia Jia

Main category: cs.CV

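TL;DR: This paper proposes PrediT, a training-free acceleration framework for Diffusion Transformers (DiT) that uses linear multistep prediction and a dynamic correction mechanism to cut latency substantially while preserving generation quality.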

Details Motivation: Diffusion Transformers (DiT) perform well, but their iterative denoising process is computationally expensive; existing training-free acceleration methods rely on feature caching and reuse, which can cause latent drift and visual degradation. Method: Proposes the PrediT framework: feature prediction is formulated as a linear multistep problem and classical linear multistep methods forecast model outputs; a corrector that activates in high-dynamics regions prevents error accumulation; a dynamic step modulation mechanism adapts the prediction horizon to the feature change rate. Result: Achieves up to 5.54× latency reduction across various DiT-based image and video generation models, with negligible loss in image/video quality. Conclusion: PrediT is an efficient, general, training-free DiT acceleration method that combines substantial speedup with high-fidelity generation. Abstract: Diffusion Transformers (DiT) have emerged as a widely adopted backbone for high-fidelity image and video generation, yet their iterative denoising process incurs high computational costs. Existing training-free acceleration methods rely on feature caching and reuse under the assumption of temporal stability. However, reusing features for multiple steps may lead to latent drift and visual degradation. We observe that model outputs evolve smoothly along much of the diffusion trajectory, enabling principled predictions rather than naive reuse. Based on this insight, we propose PrediT, a training-free acceleration framework that formulates feature prediction as a linear multistep problem. We employ classical linear multistep methods to forecast future model outputs from historical information, combined with a corrector that activates in high-dynamics regions to prevent error accumulation. A dynamic step modulation mechanism adaptively adjusts the prediction horizon by monitoring the feature change rate. Together, these components enable substantial acceleration while preserving generation fidelity. Extensive experiments validate that our method achieves up to 5.54× latency reduction across various DiT-based image and video generation models, while incurring negligible quality degradation.
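
To make the contrast between forecasting and plain feature reuse concrete, here is a sketch built on the two-step Adams-Bashforth formula, one classical linear multistep method. The abstract does not say which multistep formula, step size, or change-rate measure PrediT uses, so `ab2_step`, `relative_change`, and the list-of-floats representation are assumptions for illustration only.

```python
import math


def ab2_step(x, f_curr, f_prev, h):
    """Two-step Adams-Bashforth update: x + h * (3/2 * f_curr - 1/2 * f_prev).

    x, f_curr and f_prev are equal-length lists of floats; f_curr and
    f_prev stand for model outputs at the current and previous step.
    """
    return [xi + h * (1.5 * fc - 0.5 * fp)
            for xi, fc, fp in zip(x, f_curr, f_prev)]


def relative_change(f_curr, f_prev, eps=1e-12):
    """Relative L2 change between consecutive outputs (hypothetical monitor)."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_curr, f_prev)))
    norm = math.sqrt(sum(b * b for b in f_prev))
    return diff / (norm + eps)
```

A scheduler could skip a model call and rely on the extrapolated update while `relative_change` stays small, then fall back to a real evaluation when it grows. How PrediT actually makes that decision is described only at a high level in the abstract.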

[55] OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models

Ling Lin,Yang Bai,Heng Su,Congcong Zhu,Yaoxing Wang,Yang Zhou,Huazhu Fu,Jingrun Chen

Main category: cs.CV

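TL;DR: This paper proposes OODBench, an automated benchmark for evaluating vision-language models (VLMs) on out-of-distribution (OOD) data; it shows that current VLMs degrade notably in OOD settings and proposes an automated evaluation metric based on questions of progressively increasing difficulty.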

Details Motivation: Existing VLMs are mostly trained under the IID assumption, but OOD data is common in real-world scenarios and mishandling it can create safety risks (e.g., in autonomous driving or medical assistance); a comprehensive benchmark for assessing VLM robustness to OOD data is currently lacking. Method: Proposes OODBench, a predominantly automated benchmark construction method with supplementary human verification, containing 40K instance-level OOD instance-category pairs; designs an automated evaluation metric based on a Basic-to-Advanced sequence of prompted questions to measure the impact of OOD data on questions of varying difficulty. Result: Experiments show that current mainstream VLMs degrade notably on OODBench, even when the image categories themselves are common; the proposed metric reflects the impact of OOD data more fully; key findings and insights on OOD data acquisition and evaluation are summarized. Conclusion: OODBench fills a gap in OOD evaluation for VLMs, supports systematic study of out-of-distribution generalization, and offers a new tool and direction for improving VLM robustness in real safety-sensitive scenarios. Abstract: Existing Visual-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs in response to OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to assess the impact of OOD data on questions of varying difficulty more fully. Lastly, we summarize substantial findings and insights to facilitate future research in the acquisition and evaluation of OOD data.

[56] Evaluating Graphical Perception Capabilities of Vision Transformers

Poonam Poonam,Pere-Pau Vázquez,Timo Ropinski

Main category: cs.CV

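TL;DR: This paper studies how Vision Transformers (ViTs) perform on graphical perception tasks and finds that, despite strong results on general vision tasks, they show a clear gap in alignment with human graphical perception.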

Details Motivation: CNNs have already been evaluated on graphical perception tasks, but the perceptual capabilities of ViTs in this area have not been systematically explored. Method: Building on the classic graphical perception studies of Cleveland and McGill, designs a series of controlled graphical perception tasks and benchmarks ViTs, CNNs, and human participants. Result: ViTs perform strongly on general vision tasks, but their human-like graphical perception in the visualization domain is limited, exposing key perceptual gaps. Conclusion: Applying ViTs in visualization systems and graphical perception modeling requires careful consideration of their perceptual limitations. Abstract: Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional neural networks (CNNs) in a variety of image-based tasks. While CNNs have previously been evaluated for their ability to perform graphical perception tasks, which are essential for interpreting visualizations, the perceptual capabilities of ViTs remain largely unexplored. In this work, we investigate the performance of ViTs in elementary visual judgment tasks inspired by the foundational studies of Cleveland and McGill, which quantified the accuracy of human perception across different visual encodings. Following their study, we benchmark ViTs against CNNs and human participants in a series of controlled graphical perception tasks. Our results reveal that, although ViTs demonstrate strong performance in general vision tasks, their alignment with human-like graphical perception in the visualization domain is limited. This study highlights key perceptual gaps and points to important considerations for the application of ViTs in visualization systems and graphical perceptual modeling.

[57] BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards

Yiran Yang,Zhaowei Liu,Yuan Yuan,Yukun Song,Xiong Ma,Yinghao Song,Xiangji Zeng,Lu Sun,Yulu Wang,Hai Zhou,Shuai Cui,Zhaohan Gong,Jiefei Zhang

Main category: cs.CV

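TL;DR: This paper proposes the BLM-Guard framework, which combines Chain-of-Thought reasoning, rule-driven data synthesis, and critic-guided reinforcement learning for fine-grained moderation of deceptive multimodal content in short-video ads.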

Details Motivation: Multimodal ads on short-video platforms can be deceptive at the visual, speech, and subtitle levels, and existing community safety filters cannot meet the need for fine-grained, policy-driven moderation. Method: Proposes the BLM-Guard framework: 1) a rule-driven ICoT data-synthesis pipeline that generates scene descriptions, reasoning chains, and labels; 2) reinforcement learning with a composite reward that balances causal coherence and policy adherence; 3) a multitask architecture that models intra-modal manipulations and cross-modal mismatches. Result: Experiments on real short-video ad data show that BLM-Guard outperforms strong baselines in accuracy, consistency, and generalization. Conclusion: BLM-Guard offers an explainable, policy-aligned, and robust multimodal solution for commercial ad content moderation. Abstract: Short-video platforms now host vast multimodal ads whose deceptive visuals, speech and subtitles demand finer-grained, policy-driven moderation than community safety filters. We present BLM-Guard, a content-audit framework for commercial ads that fuses Chain-of-Thought reasoning with rule-based policy principles and a critic-guided reward. A rule-driven ICoT data-synthesis pipeline jump-starts training by generating structured scene descriptions, reasoning chains and labels, cutting annotation costs. Reinforcement learning then refines the model using a composite reward balancing causal coherence with policy adherence. A multitask architecture models intra-modal manipulations (e.g., exaggerated imagery) and cross-modal mismatches (e.g., subtitle-speech drift), boosting robustness. Experiments on real short-video ads show BLM-Guard surpasses strong baselines in accuracy, consistency and generalization.

[58] A Self-Supervised Approach on Motion Calibration for Enhancing Physical Plausibility in Text-to-Motion

Gahyeon Shim,Soogeun Park,Hyemin Ahn

Main category: cs.CV

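TL;DR: This paper proposes the Distortion-aware Motion Calibrator (DMC), a post-hoc module that corrects common physical implausibilities in text-generated human motion (such as foot floating and penetration) while preserving semantic consistency; it is self-supervised and data-driven, needs no complex physical modeling, and significantly improves both physical realism and semantic alignment across several text-to-motion models.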

Details Motivation: Text-to-human-motion generation has progressed rapidly, but existing methods struggle to guarantee semantic alignment and physical realism at the same time (e.g., foot floating, joint penetration), which calls for a lightweight, general post-hoc correction mechanism that does not depend on physical modeling. Method: Proposes DMC, a distortion-aware motion calibrator used as a post-hoc module; it takes an intentionally distorted motion and the original text description as input and learns, in a self-supervised way, to map them to a physically plausible and semantically consistent motion; it introduces no explicit physics model and is entirely data-driven. Result: Reduces FID by 42.74% on T2M and 13.20% on T2M-GPT while achieving the highest R-Precision; on MoMask it reduces penetration by 33.0% and clearly mitigates foot floating; it applies to a range of text-to-motion models. Conclusion: DMC is an efficient, general post-hoc framework that jointly addresses semantics and physics and can be plugged into text-to-motion models to improve physical plausibility and semantic fidelity. Abstract: Generating semantically aligned human motion from textual descriptions has made rapid progress, but ensuring both semantic and physical realism in motion remains a challenge. In this paper, we introduce the Distortion-aware Motion Calibrator (DMC), a post-hoc module that refines physically implausible motions (e.g., foot floating) while preserving semantic consistency with the original textual description. Rather than relying on complex physical modeling, we propose a self-supervised and data-driven approach, whereby DMC learns to obtain physically plausible motions when an intentionally distorted motion and the original textual descriptions are given as inputs. We evaluate DMC as a post-hoc module to improve motions obtained from various text-to-motion generation models and demonstrate its effectiveness in improving physical plausibility while enhancing semantic consistency. The experimental results show that DMC reduces FID score by 42.74% on T2M and 13.20% on T2M-GPT, while also achieving the highest R-Precision. When applied to high-quality models like MoMask, DMC improves the physical plausibility of motions by reducing penetration by 33.0% as well as adjusting floating artifacts closer to the ground-truth reference. These results highlight that DMC can serve as a promising post-hoc motion refinement framework for any kind of text-to-motion models by incorporating textual semantics and physical plausibility.

[59] On the Adversarial Robustness of Discrete Image Tokenizers

Rishika Bhagwatkar,Irina Rish,Nicolas Flammarion,Francesco Croce

Main category: cs.CV

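TL;DR: This paper is the first to study the vulnerability of discrete image tokenizers in multimodal systems to adversarial attacks, and proposes a label-free unsupervised adversarial training method that significantly improves their robustness on tasks such as classification, cross-modal retrieval, and image captioning.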

Details Motivation: Discrete image tokenizers are increasingly important in multimodal models, but their adversarial robustness has not been studied, whereas related work already exists for encoders such as CLIP; this gap needs to be filled. Method: 1) Constructs efficient, application-agnostic adversarial attacks that perturb tokenizer features; 2) proposes a defense based on unsupervised adversarial training that fine-tunes only the tokenizer, freezes all other modules, and uses unlabeled images. Result: The proposed attacks are effective across multiple tasks; the proposed defense significantly improves tokenizer robustness to both unsupervised and end-to-end supervised attacks and generalizes to unseen tasks and data. Conclusion: Tokenizer robustness is critical for downstream multimodal tasks; this work is an important step toward safe multimodal foundation models. Abstract: Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike CLIP encoders, their vulnerability to adversarial attacks has not been explored. Ours being the first work studying this topic, we first formulate attacks that aim to perturb the features extracted by discrete tokenizers, and thus change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. While unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, our approach can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and presents an important step in the development of safe multimodal foundation models.

[60] DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control

Shiyan Du,Conghan Yue,Xinyu Cheng,Dongyu Zhang

Main category: cs.CV

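TL;DR: This paper proposes the DEIG framework, which uses an Instance Detail Extractor (IDE) and a Detail Fusion Module (DFM) to improve fine-grained semantic understanding and attribute binding for complex text descriptions in multi-instance generation, and builds a new benchmark, DEIG-Bench, to support region-level supervision and evaluation.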

Details Motivation: When handling complex textual descriptions, existing multi-instance generation methods struggle with fine-grained semantic understanding and accurate attribute binding, and are prone to attribute leakage and spatial inconsistency. Method: Proposes the DEIG framework, comprising an Instance Detail Extractor (IDE) that produces compact, instance-aware text representations and a Detail Fusion Module (DFM) that uses instance-based masked attention to prevent attribute leakage across instances; also builds a dataset with fine-grained compositional captions generated by VLMs, along with the DEIG-Bench benchmark. Result: DEIG significantly outperforms existing methods on multiple benchmarks, with gains in spatial consistency, semantic accuracy, and compositional generalization; it can also be integrated into diffusion pipelines as a plug-and-play module. Conclusion: DEIG addresses fine-grained semantic modeling and controllability in multi-instance generation, offering more precise and robust instance-level control for text-to-image generation. Abstract: Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose DEIG, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce DEIG-Bench, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.

[61] Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation

Ziyue Liu,Davide Talon,Federico Girella,Zanxi Ruan,Mattia Mondo,Loris Bazzani,Yiming Wang,Marco Cristani

Main category: cs.CV

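TL;DR: This paper proposes the LOTS framework, which improves fashion image generation through multi-level conditioning and pair guidance within a diffusion model, combining a global sketch with localized text-sketch pairs, and builds Sketchy, the first fashion dataset that provides multiple text-sketch pairs per image.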

Details Motivation: Existing methods struggle to fuse the structural information in sketches with the localized semantic detail in text for fashion image generation; both global structural consistency and local attribute guidance are needed. Method: Proposes the LOTS framework, consisting of a Multi-level Conditioning Stage (which encodes local features independently while keeping global structure coordinated) and a Diffusion Pair Guidance stage (which fuses local and global conditions through attention during diffusion denoising); builds the Sketchy dataset, with two subsets of professional and non-expert sketches. Result: Outperforms state-of-the-art methods on fashion image generation, with stronger global structural adherence and local semantic guidance; the Sketchy dataset, platform, and code are publicly available. Conclusion: LOTS addresses structure-semantics alignment when generating jointly from sketch and text, supporting the effectiveness of multi-level localized conditioning for fashion design generation. Abstract: Sketches offer designers a concise yet expressive medium for early-stage fashion ideation by specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches to convey material, color, and stylistic details. Effectively combining textual and visual modalities requires adherence to the sketch visual structure when leveraging the guidance of localized attributes from text. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. Then, the Diffusion Pair Guidance stage integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we develop Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Sketchy provides high-quality, clean sketches with a professional look and consistent structure. To assess robustness beyond this setting, we also include an "in the wild" split with non-expert sketches, featuring higher variability and imperfections. Experiments demonstrate that our method strengthens global structural adherence while leveraging richer localized semantic guidance, achieving improvement over state-of-the-art. The dataset, platform, and code are publicly available.

[62] Diff2DGS: Reliable Reconstruction of Occluded Surgical Scenes via 2D Gaussian Splatting

Tianyi Song,Danail Stoyanov,Evangelos Mazomenos,Francisco Vasconcelos

Main category: cs.CV

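TL;DR: This paper proposes the Diff2DGS framework, which combines diffusion-based video inpainting with 2D Gaussian Splatting and a learnable deformation model to improve real-time, high-accuracy 3D reconstruction of occluded regions in da Vinci surgical videos, and adds a quantitative depth-accuracy evaluation on the SCARED dataset.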

Details Motivation: Existing methods reconstruct occluded regions in surgical videos poorly, and because benchmarks such as EndoNeRF and StereoMIS lack 3D ground truth, depth accuracy has not been fully assessed, which limits support for robotic surgical navigation and automation. Method: Proposes the two-stage Diff2DGS: stage one uses a diffusion-based video module with temporal priors to inpaint tissue occluded by instruments; stage two combines 2D Gaussian Splatting (2DGS) with a Learnable Deformation Model (LDM) to model dynamic tissue deformation and anatomical geometry; a quantitative depth-accuracy analysis is carried out on the SCARED dataset. Result: Reaches 38.02 dB PSNR on EndoNeRF and 34.40 dB on StereoMIS, surpassing SOTA; experiments show that optimizing image quality alone does not guarantee 3D geometric accuracy, so depth quality is further optimized to improve geometric fidelity. Conclusion: Diff2DGS improves both appearance and geometric accuracy of real-time 3D reconstruction in occluded surgical scenes and underlines the need to optimize image and depth quality jointly, giving robotic surgery a more reliable basis for 3D perception. Abstract: Real-time reconstruction of deformable surgical scenes is vital for advancing robotic surgery, improving surgeon guidance, and enabling automation. Recent methods achieve dense reconstructions from da Vinci robotic surgery videos, with Gaussian Splatting (GS) offering real-time performance via graphics acceleration. However, reconstruction quality in occluded regions remains limited, and depth accuracy has not been fully assessed, as benchmarks like EndoNeRF and StereoMIS lack 3D ground truth. We propose Diff2DGS, a novel two-stage framework for reliable 3D reconstruction of occluded surgical scenes. In the first stage, a diffusion-based video module with temporal priors inpaints tissue occluded by instruments with high spatial-temporal consistency. In the second stage, we adapt 2D Gaussian Splatting (2DGS) with a Learnable Deformation Model (LDM) to capture dynamic tissue deformation and anatomical geometry. We also extend evaluation beyond prior image-quality metrics by performing quantitative depth accuracy analysis on the SCARED dataset. Diff2DGS outperforms state-of-the-art approaches in both appearance and geometry, reaching 38.02 dB PSNR on EndoNeRF and 34.40 dB on StereoMIS. Furthermore, our experiments demonstrate that optimizing for image quality alone does not necessarily translate into optimal 3D reconstruction accuracy. To address this, we further optimize the depth quality of the reconstructed 3D results, ensuring more faithful geometry in addition to high-fidelity appearance.
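
For readers unfamiliar with the metric, the PSNR values quoted above follow the standard definition PSNR = 10 * log10(MAX^2 / MSE). The sketch below assumes images flattened to lists of floats in [0, 1]; the function name `psnr` and the `max_val` default are choices made for this example and are not taken from the paper's code.

```python
import math


def psnr(reference, rendered, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Under this definition, 38.02 dB corresponds to a mean squared error of roughly 1.6e-4 on a [0, 1] intensity scale. PSNR only compares rendered pixels with reference pixels, which is consistent with the abstract's point that a high image-quality score does not by itself establish accurate 3D geometry.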

[63] Unifying Color and Lightness Correction with View-Adaptive Curve Adjustment for Robust 3D Novel View Synthesis

Ziteng Cui,Shuhong Liu,Xiaoyu Dong,Xuangeng Chu,Lin Gu,Ming-Hsuan Yang,Tatsuya Harada

Main category: cs.CV

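TL;DR: This paper proposes Luminance-GS++, a framework based on 3D Gaussian Splatting (3DGS) that combines a global view-adaptive lightness adjustment with local pixel-wise residual refinement and unsupervised multi-objective optimization to achieve robust novel view synthesis under complex illumination, while keeping the explicit representation and real-time rendering efficiency of the original 3DGS.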

Details Motivation: In real environments, multi-view image capture is affected by complex illumination changes and differences between camera ISPs, which cause photometric and chromatic inconsistencies that violate the photometric-consistency assumption behind novel view synthesis methods such as NeRF and 3DGS, degrading reconstruction and rendering quality. Method: Proposes the Luminance-GS++ framework: 1) a global view-adaptive lightness adjustment; 2) a local pixel-wise residual color correction; 3) unsupervised losses that jointly constrain lightness correction, multi-view geometric consistency, and photometric consistency. Result: Achieves state-of-the-art performance in challenging scenarios such as low light, overexposure, and complex luminance/chromatic variations; it keeps the explicit 3DGS representation and improves reconstruction fidelity without sacrificing real-time rendering efficiency. Conclusion: Luminance-GS++ mitigates multi-view illumination inconsistency without modifying the underlying 3DGS representation, offering a practical and efficient solution for robust novel view synthesis in real scenes. Abstract: High-quality image acquisition in real-world environments remains challenging due to complex illumination variations and inherent limitations of camera imaging pipelines. These issues are exacerbated in multi-view capture, where differences in lighting, sensor responses, and image signal processor (ISP) configurations introduce photometric and chromatic inconsistencies that violate the assumptions of photometric consistency underlying modern 3D novel view synthesis (NVS) methods, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), leading to degraded reconstruction and rendering quality. We propose Luminance-GS++, a 3DGS-based framework for robust NVS under diverse illumination conditions. Our method combines a globally view-adaptive lightness adjustment with a local pixel-wise residual refinement for precise color correction. We further design unsupervised objectives that jointly enforce lightness correction and multi-view geometric and photometric consistency. Extensive experiments demonstrate state-of-the-art performance across challenging scenarios, including low-light, overexposure, and complex luminance and chromatic variations. Unlike prior approaches that modify the underlying representation, our method preserves the explicit 3DGS formulation, improving reconstruction fidelity while maintaining real-time rendering efficiency.

[64] G-LoG Bi-filtration for Medical Image Classification

Qingsong Wang,Jiaxing He,Bingzhe Hou,Tieru Wu,Yang Cao,Cailing Yao

Main category: cs.CV

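TL;DR: This paper proposes a new G-LoG bi-filtration for multi-parameter persistent homology analysis of medical images and proves its stability; experiments show that a simple MLP trained on the topological features it extracts performs comparably to complex deep learning models.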

Details Motivation: Building practical filtrations that detect topological and geometric features of data is a key task in TDA; existing single-parameter filtrations have limited representational power for medical images, so a filtration better suited to multi-parameter persistent homology is needed. Method: Uses the LoG operator to enhance boundaries in medical images and defines the G-LoG bi-filtration; models volumetric images as bounded functions and proves that the resulting persistence modules are stable with respect to the maximum norm. Result: On the MedMNIST dataset, the G-LoG bi-filtration significantly outperforms single-parameter filtration; an MLP trained only on the extracted topological features performs comparably to deep models such as Google AutoML Vision and ResNet. Conclusion: The G-LoG bi-filtration is a stable and effective tool for multi-parameter topological feature extraction that matches complex deep learning with a lightweight model, making TDA more practical for medical image analysis. Abstract: Building practical filtrations on objects to detect topological and geometric features is an important task in the field of Topological Data Analysis (TDA). In this paper, leveraging the ability of the Laplacian of Gaussian operator to enhance the boundaries of medical images, we define the G-LoG (Gaussian-Laplacian of Gaussian) bi-filtration to generate features better suited to multi-parameter persistence modules. By modeling volumetric images as bounded functions, we then prove that the interleaving distance on the persistence modules obtained from our bi-filtrations is stable with respect to the maximum norm of the bounded functions. Finally, we conduct experiments on the MedMNIST dataset, comparing our bi-filtration against single-parameter filtration and established deep learning baselines, including Google AutoML Vision, ResNet, AutoKeras and auto-sklearn. Experimental results demonstrate that our bi-filtration significantly outperforms single-parameter filtration. Notably, a simple Multi-Layer Perceptron (MLP) trained on the topological features generated by our bi-filtration achieves performance comparable to complex deep learning models trained on the original dataset.

[65] Self-Aware Object Detection via Degradation Manifolds

Stefan Becker,Simon Weiss,Wolfgang Hübner,Michael Arens

Main category: cs.CV

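TL;DR: This paper proposes a degradation-aware self-aware object detection framework that builds a degradation-based feature manifold to represent the type and severity of image degradation geometrically without degradation labels, and uses the geometric deviation from a prototype of clean samples as a self-awareness signal that is independent of detection confidence.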

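Details Motivation: Object detectors can fail silently under non-ideal imaging conditions (such as blur, noise, compression, adverse weather, and resolution changes); in safety-critical settings, producing predictions is not enough, and the system must judge whether the input is still within the model's nominal operating range, i.e., be "self-aware". Method: Proposes a self-awareness framework based on degradation manifolds: a lightweight embedding head is added to the detection backbone and trained with multi-layer contrastive learning, so that images with the same degradation are pulled together in embedding space and different degradation configurations are pushed apart; a "pristine prototype" is estimated from clean training embeddings as the nominal operating point, and the geometric distance from it serves as the self-awareness signal. Result: On synthetic corruption benchmarks, cross-dataset zero-shot transfer, and natural weather-induced distribution shifts, the method shows strong pristine-degraded separability, consistent behavior across detector architectures, and robust generalization under semantic shift. Conclusion: Degradation-aware representation geometry provides a practical, detector-agnostic foundation for self-aware object detection, giving an intrinsic, image-level assessment of degradation sensitivity without degradation labels or explicit density modeling. Abstract: Object detectors achieve strong performance under nominal imaging conditions but can fail silently when exposed to blur, noise, compression, adverse weather, or resolution changes. In safety-critical settings, it is therefore insufficient to produce predictions without assessing whether the input remains within the detector's nominal operating regime. We refer to this capability as self-aware object detection. We introduce a degradation-aware self-awareness framework based on degradation manifolds, which explicitly structure a detector's feature space according to image degradation rather than semantic content. Our method augments a standard detection backbone with a lightweight embedding head trained via multi-layer contrastive learning. Images sharing the same degradation composition are pulled together, while differing degradation configurations are pushed apart, yielding a geometrically organized representation that captures degradation type and severity without requiring degradation labels or explicit density modeling. To anchor the learned geometry, we estimate a pristine prototype from clean training embeddings, defining a nominal operating point in representation space. Self-awareness emerges as geometric deviation from this reference, providing an intrinsic, image-level signal of degradation-induced shift that is independent of detection confidence.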
Extensive experiments on synthetic corruption benchmarks, cross-dataset zero-shot transfer, and natural weather-induced distribution shifts demonstrate strong pristine-degraded separability, consistent behavior across multiple detector architectures, and robust generalization under semantic shift. These results suggest that degradation-aware representation geometry provides a practical and detector-agnostic foundation.
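
The prototype-and-deviation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the mean-of-normalized-embeddings estimator, the cosine distance, and the synthetic 4-dimensional embeddings are all assumptions chosen only to make the geometry concrete.

```python
import numpy as np

def pristine_prototype(clean_embeddings):
    """Estimate a prototype as the re-normalized mean of L2-normalized clean
    embeddings. This estimator is an assumption, not the paper's procedure."""
    z = clean_embeddings / np.linalg.norm(clean_embeddings, axis=1, keepdims=True)
    p = z.mean(axis=0)
    return p / np.linalg.norm(p)

def deviation_score(embedding, prototype):
    """Cosine distance to the prototype: near 0 when aligned, larger when the
    embedding points away from the nominal operating point."""
    e = embedding / np.linalg.norm(embedding)
    return 1.0 - float(e @ prototype)

# Synthetic stand-ins for embedding-head outputs (hypothetical data).
rng = np.random.default_rng(0)
nominal_direction = np.array([1.0, 0.0, 0.0, 0.0])
clean = nominal_direction + 0.05 * rng.normal(size=(200, 4))

prototype = pristine_prototype(clean)
clean_score = deviation_score(nominal_direction, prototype)
degraded_score = deviation_score(np.array([0.2, 1.0, 0.0, 0.0]), prototype)
print(clean_score, degraded_score)
```

The score depends only on the embedding and the prototype, so it can be computed per image without looking at the detector's confidence outputs.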

[66] Latent Equivariant Operators for Robust Object Recognition: Promise and Challenges

Minh Dinh,Stéphane Deny

Main category: cs.CV

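TL;DR: This paper explores how learned equivariant operators can improve neural networks' generalization to unseen symmetric transformations (such as rotation and translation), particularly in out-of-distribution classification.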

Details Motivation: Conventional deep learning models perform poorly on group-symmetric transformations rarely seen during training (such as unusual poses, scales, and positions), while existing equivariant networks require the transformation type to be known in advance and so lack flexibility. Method: Uses an architecture that learns equivariant operators in a latent space from examples of symmetric transformations, and validates it on rotated and translated noisy MNIST. Result: On out-of-distribution classification, the architecture overcomes the limitations of both traditional networks and networks with predefined equivariance, showing good generalization. Conclusion: Learning equivariance is a promising way to improve generalization, but scaling it to more complex datasets remains challenging. Abstract: Despite the successes of deep learning in computer vision, difficulties persist in recognizing objects that have undergone group-symmetric transformations rarely seen during training, for example objects seen in unusual poses, scales, positions, or combinations thereof. Equivariant neural networks are a solution to the problem of generalizing across symmetric transformations, but require knowledge of transformations a priori. An alternative family of architectures proposes to learn equivariant operators in a latent space from examples of symmetric transformations. Here, using simple datasets of rotated and translated noisy MNIST, we illustrate how such architectures can successfully be harnessed for out-of-distribution classification, thus overcoming the limitations of both traditional and equivariant networks. While conceptually enticing, we discuss challenges ahead on the path of scaling these architectures to more complex datasets.

[67] Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Linxi Xie,Lisong C. Sun,Ashley Neall,Tong Wu,Shengqu Cai,Gordon Wetzstein

Main category: cs.CV

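TL;DR: This paper proposes a human-centric video world model conditioned jointly on 3D head pose and joint-level hand poses to support fine-grained embodied interaction in XR. A bidirectional teacher model is distilled into a causal, real-time generation system, and a user study shows gains in task performance and perceived control.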

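Details Motivation: Existing video world models accept only coarse control signals such as text or keyboard input, which falls short of the fine-grained, embodied interaction that extended reality (XR) requires based on the user's real motion (such as head and hand poses). Method: Evaluates existing diffusion transformer conditioning strategies and proposes an effective joint control mechanism for 3D head and hand poses. A bidirectional video diffusion model is trained as the teacher and distilled into a causal, interactive, real-time generation system that produces egocentric virtual environments. Result: In human-subject experiments, the generated reality system significantly improves task performance and users' perceived degree of control over their actions compared with baselines. Conclusion: Video world models conditioned on high-precision human motion signals (head and hands) are a key route to natural, controllable XR interaction, and the proposed control mechanism and distillation framework offer a feasible way to build real-time embodied AI systems. Abstract: Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived control over the performed actions compared with relevant baselines.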

[68] CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

Xia Su,Ruiqi Chen,Benlin Liu,Jingwei Ma,Zonglin Di,Ranjay Krishna,Jon Froehlich

Main category: cs.CV

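TL;DR: This paper introduces the Capability-Conditioned Navigation (CapNav) benchmark for evaluating how well vision-language models (VLMs) navigate indoor spaces under an agent's physical and operational capability constraints. Experiments show that existing VLMs degrade markedly under strict mobility constraints and struggle in particular with obstacles that require reasoning about spatial dimensions.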

Details Motivation: Real-world navigation is limited by the agent's own mobility (for example, a sweeping robot cannot climb stairs), yet existing VLM navigation research does not adequately model such capability constraints. Method: Builds the CapNav benchmark, which includes 5 types of agents with different physical dimensions, mobility capabilities, and environmental interaction abilities, covers 45 real indoor scenes, 473 navigation tasks, and 2365 QA pairs, and systematically evaluates 13 modern VLMs. Result: Navigation performance of current VLMs drops sharply as mobility constraints tighten, and even state-of-the-art models struggle with obstacle types that require reasoning about spatial dimensions. Conclusion: CapNav exposes the shortcomings of existing VLMs in capability-aware navigation and embodied spatial reasoning, and provides a new direction and benchmark for future research. Abstract: Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent's mobility constraints. For example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent's specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test if VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLMs' navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning on spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at https://github.com/makeabilitylab/CapNav
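
To make "capability-conditioned" concrete, the following toy sketch checks whether a route is feasible for a given agent. It does not reflect CapNav's actual data format or evaluation protocol: the `Agent` and `Segment` fields, the agents, and every number are hypothetical, chosen only to mirror the stairs example and the spatial-dimension reasoning mentioned in the abstract.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    # Hypothetical capability description, not CapNav's schema.
    name: str
    width_m: float
    can_climb_stairs: bool

@dataclass(frozen=True)
class Segment:
    # One step of a route: "corridor", "doorway", or "stairs".
    kind: str
    clear_width_m: float

def can_traverse(agent, segment):
    """A segment is passable if the agent fits and has the required mobility."""
    if segment.kind == "stairs" and not agent.can_climb_stairs:
        return False
    return agent.width_m <= segment.clear_width_m

def route_feasible(agent, route):
    """A route is feasible only if every segment is passable for this agent."""
    return all(can_traverse(agent, seg) for seg in route)

sweeper = Agent("sweeping robot", width_m=0.35, can_climb_stairs=False)
quadruped = Agent("quadruped", width_m=0.50, can_climb_stairs=True)
route = [Segment("corridor", 1.2), Segment("stairs", 1.0), Segment("doorway", 0.8)]

for agent in (sweeper, quadruped):
    print(agent.name, route_feasible(agent, route))
```

The same route gets a different answer for each agent, which is the property a capability-conditioned benchmark probes.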

[69] SARAH: Spatially Aware Real-time Agentic Humans

Evonne Ng,Siwei Zhang,Zhang Chen,Michael Zollhoefer,Alexander Richard

Main category: cs.CV

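TL;DR: This paper proposes the first real-time, fully causal method for generating spatially aware conversational motion, deployable on a streaming VR headset. It combines a causal transformer-based VAE with a flow matching model, takes user position and audio as input, and produces natural full-body motion with orientation and gaze control. It reaches state-of-the-art performance on the Embody 3D dataset (over 300 FPS) and is validated on a live VR system.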

Details Motivation: Existing motion generation methods for conversational agents lack spatial awareness: they cannot adjust orientation in real time according to the user's position, respond to the user's movement, or maintain natural gaze, which falls short of the needs of VR, telepresence, and digital human applications. Method: A real-time, fully causal architecture that combines a causal transformer-based variational autoencoder (VAE) for streaming latent-space inference with a flow matching model conditioned on user trajectory and audio. A gaze scoring mechanism based on classifier-free guidance decouples learning from control, so that gaze intensity can be adjusted at inference time. Result: Achieves state-of-the-art motion quality on the Embody 3D dataset at over 300 FPS (3x faster than non-causal baselines) and is deployed on a live VR system, demonstrating that spatially aware conversational behavior is feasible and natural. Conclusion: The work fills the technical gap in real-time, spatially aware conversational motion generation and offers a scalable, controllable, and efficient solution for deploying embodied agents naturally in VR and other interactive settings. Abstract: As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS -- 3x faster than non-causal baselines -- while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see https://evonneng.github.io/sarah/ for details.
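
The inference-time control described above relies on classifier-free guidance. The sketch below shows only the standard guidance combination rule applied to two model outputs. How SARAH conditions on its gaze score is not specified in the abstract, so the vectors and the function name `cfg_combine` are illustrative assumptions.

```python
import numpy as np

def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance: start from the unconditional
    prediction and move toward (or past) the conditional one.
    scale = 0 ignores the condition, scale = 1 reproduces the conditional
    prediction, and scale > 1 exaggerates the condition's effect."""
    return uncond + scale * (cond - uncond)

# Hypothetical velocity predictions from a flow matching model,
# without and with the gaze condition.
v_uncond = np.array([0.0, 1.0])
v_gaze = np.array([1.0, 1.0])

for scale in (0.0, 1.0, 2.0):
    print(scale, cfg_combine(v_uncond, v_gaze, scale))
```

Because the scale is applied only when combining predictions, it can be changed per user at inference time without retraining, which matches the "decouple learning from control" framing in the abstract.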

[70] Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

Vatsal Agarwal,Saksham Suri,Matthew Gwilliam,Pulkit Kumar,Abhinav Shrivastava

Main category: cs.CV

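TL;DR: This paper proposes MemStream, which improves streaming video understanding and question answering by scaling the token budget, using an adaptive selection strategy, and adding a training-free retrieval mixture-of-experts.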

Details Motivation: Existing streaming video understanding methods rely on key-value caching but use a limited number of tokens per frame, which loses fine-grained visual detail. In addition, their feature encoding causes query-frame similarity to increase over time, biasing retrieval toward later frames. Method: 1) Scale the token budget to support more granular spatiotemporal understanding; 2) introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information; 3) design a training-free retrieval mixture-of-experts (MoE) that uses external models to better identify relevant frames. Result: Improves over ReKV (with Qwen2.5-VL-7B) by +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long). Conclusion: Scaling the token budget, combined with adaptive selection and a retrieval mechanism assisted by external models, substantially improves the robustness and accuracy of streaming video VQA. Abstract: Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
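
The retrieval step that the abstract discusses, ranking cached frames by their similarity to a query, can be sketched generically as follows. This is plain cosine-similarity top-k retrieval, not MemStream's adaptive selection or its mixture-of-experts. The function name `top_k_frames` and the 3-dimensional frame features are hypothetical.

```python
import numpy as np

def top_k_frames(query, frame_feats, k):
    """Return indices of the k cached frames most similar to the query,
    along with all cosine similarity scores."""
    q = query / np.linalg.norm(query)
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sims = f @ q
    order = np.argsort(-sims)[:k]  # highest similarity first
    return order, sims

# Hypothetical per-frame features, one row per cached frame in stream order.
frame_feats = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.6, 0.8, 0.0],
])
query = np.array([0.0, 1.0, 0.0])

indices, sims = top_k_frames(query, frame_feats, k=2)
print(indices, sims)
```

If similarity scores drift upward with frame index, as the abstract reports for existing encoders, a ranking of this kind would favor later frames regardless of content, which is the bias the paper sets out to address.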