cs.CL

[1] EmbeddingRWKV: State-Centric Retrieval with Reusable States

Haowen Hou,Jie Yang

Main category: cs.CL

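TL;DR: This paper proposes State-Centric Retrieval, a unified retrieval paradigm that uses "states" to bridge the embedding model and the reranker. Built on an RWKV-based model, it delivers efficient, high-quality retrieval and reranking and substantially improves system efficiency.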

Details Motivation: In conventional RAG systems, the retrieval and reranking stages share no information, which leads to redundant computation and low efficiency. Method: The authors propose State-Centric Retrieval. An RWKV-based language model is fine-tuned into EmbeddingRWKV, which produces reusable state representations, and a state-based reranker processes only the query tokens at reranking time, decoupling inference cost from document length. Result: Reranking is 5.4× to 44.8× faster, and with uniform layer selection the model retains 98.62% of full-model performance while using only 25% of the layers. Conclusion: State-Centric Retrieval integrates the embedding and reranking stages and substantially improves the overall efficiency and practicality of retrieval systems. Abstract: Current Retrieval-Augmented Generation (RAG) systems typically employ a traditional two-stage pipeline: an embedding model for initial retrieval followed by a reranker for refinement. However, this paradigm suffers from significant inefficiency due to the lack of shared information between stages, leading to substantial redundant computation. To address this limitation, we propose State-Centric Retrieval, a unified retrieval paradigm that utilizes "states" as a bridge to connect embedding models and rerankers. First, we perform state representation learning by fine-tuning an RWKV-based LLM, transforming it into EmbeddingRWKV, a unified model that serves as both an embedding model and a state backbone for extracting compact, reusable states. Building upon these reusable states, we further design a state-based reranker to fully leverage precomputed information. During reranking, the model processes only query tokens, decoupling inference cost from document length and yielding a 5.4× to 44.8× speedup. Furthermore, we observe that retaining all intermediate layer states is unnecessary; with a uniform layer selection strategy, our model maintains 98.62% of full-model performance using only 25% of the layers. Extensive experiments demonstrate that State-Centric Retrieval achieves high-quality retrieval and reranking results while significantly enhancing overall system efficiency. Code is available at https://github.com/howard-hou/EmbeddingRWKV.
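
The abstract mentions a uniform layer selection strategy that keeps 25% of the layers. The sketch below shows one plausible reading of that idea, assuming "uniform" means evenly spaced layers; the function names `uniform_layer_indices` and `select_states` are hypothetical and are not taken from the EmbeddingRWKV repository.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def uniform_layer_indices(num_layers: int, keep_ratio: float) -> List[int]:
    """Pick evenly spaced layer indices (assumed meaning of 'uniform')."""
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, round(num_layers * keep_ratio))
    # Split the stack into `keep` equal-width buckets and take the last
    # layer of each bucket, using integer arithmetic to avoid float drift.
    return [((i + 1) * num_layers) // keep - 1 for i in range(keep)]


def select_states(states: Sequence[T], indices: Sequence[int]) -> List[T]:
    """Keep only the per-layer states at the chosen indices."""
    return [states[i] for i in indices]


if __name__ == "__main__":
    idx = uniform_layer_indices(24, 0.25)
    print(idx)
    print(select_states([f"state_{i}" for i in range(24)], idx))
```

Storing only the selected states shrinks the cached state per document roughly in proportion to `keep_ratio`; whether the paper selects the last layer of each bucket or some other evenly spaced pattern is not stated in the abstract.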

[2] A Human-Centric Pipeline for Aligning Large Language Models with Chinese Medical Ethics

Haoan Jin,Han Ying,Jiacheng Ji,Hanhui Xu,Mengyue Wu

Main category: cs.CL

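TL;DR: This paper presents MedES, a dynamic, scenario-centric benchmark built from 260 authoritative Chinese medical, ethical, and legal sources, for evaluating how well large language models align with medical ethics in clinical decision-making. The authors design a guardian-in-the-loop framework in which a high-accuracy automated evaluator generates targeted prompts and structured ethical feedback, and they align a 7B-parameter model through supervised fine-tuning and domain-specific preference optimization. The aligned model outperforms larger baselines on Chinese medical ethics tasks, and the framework may transfer to other cultural and legal settings.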

Details Motivation: Although large language models are widely applied in healthcare, their alignment with medical ethics in complex real-world scenarios remains underexplored, and the Chinese context in particular lacks systematic evaluation tools and alignment methods. Method: Build the MedES benchmark from Chinese medical, ethical, and legal norms; design a guardian-in-the-loop framework whose automated evaluator (over 97% accuracy) generates feedback; and align a 7B model with supervised fine-tuning and domain-specific preference optimization. Result: In experiments conducted entirely within the Chinese medical ethics context, the aligned 7B model outperforms larger baselines on core ethical tasks, with improvements in both quality and composite evaluation metrics. Conclusion: The work offers a practical, adaptable framework for aligning large models with medical ethics in the Chinese healthcare setting, and it can be extended to other legal and cultural settings by replacing the normative corpus. Abstract: Recent advances in large language models have enabled their application to a range of healthcare tasks. However, aligning LLMs with the nuanced demands of medical ethics, especially under complex real-world scenarios, remains underexplored. In this work, we present MedES, a dynamic, scenario-centric benchmark specifically constructed from 260 authoritative Chinese medical, ethical, and legal sources to reflect the challenges in clinical decision-making. To facilitate model alignment, we introduce a guardian-in-the-loop framework that leverages a dedicated automated evaluator (trained on expert-labeled data and achieving over 97% accuracy within our domain) to generate targeted prompts and provide structured ethical feedback. Using this pipeline, we align a 7B-parameter LLM through supervised fine-tuning and domain-specific preference optimization. Experimental results, conducted entirely within the Chinese medical ethics context, demonstrate that our aligned model outperforms notably larger baselines on core ethical tasks, with observed improvements in both quality and composite evaluation metrics. Our work offers a practical and adaptable framework for aligning LLMs with medical ethics in the Chinese healthcare domain, and suggests that similar alignment pipelines may be instantiated in other legal and cultural environments through modular replacement of the underlying normative corpus.

[3] Knowing But Not Doing: Convergent Morality and Divergent Action in LLMs

Jen-tse Huang,Jiantong Qin,Xueli Qiu,Sharon Levy,Michelle R. Kaufman,Mark Dredze

Main category: cs.CL

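TL;DR: This study introduces ValAct-15k, a dataset for evaluating how large language models represent and enact human values in real-world decision contexts. It finds high consistency across models but a gap between knowing and acting.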

Details Motivation: How large language models represent and enact human values remains unclear, so the substance of their value alignment needs systematic study. Method: Build ValAct-15k, a dataset of 3,000 advice-seeking scenarios drawn from Reddit, and use Schwartz's theory of basic human values to evaluate ten frontier LLMs alongside human participants. Result: Models agree almost perfectly in their scenario decisions (r ≈ 1.0), whereas humans vary widely (r ∈ [-0.79, 0.98]). Both humans and models show only weak correspondence between self-reported and enacted values (r = 0.4, 0.3), and model performance drops by up to 6.6% when a model is instructed to hold a specific value. Conclusion: Alignment training makes models converge on normative values, but it does not remove the human-like inconsistency between knowing and acting, which exposes a limit of current value alignment. Abstract: Value alignment is central to the development of safe and socially compatible artificial intelligence. However, how Large Language Models (LLMs) represent and enact human values in real-world decision contexts remains under-explored. We present ValAct-15k, a dataset of 3,000 advice-seeking scenarios derived from Reddit, designed to elicit ten values defined by Schwartz's Theory of Basic Human Values. Using both the scenario-based questions and the traditional value questionnaire, we evaluate ten frontier LLMs (five from U.S. companies, five from Chinese ones) and human participants ($n = 55$). We find near-perfect cross-model consistency in scenario-based decisions (Pearson $r \approx 1.0$), contrasting sharply with the broad variability observed among humans ($r \in [-0.79, 0.98]$). Yet, both humans and LLMs show weak correspondence between self-reported and enacted values ($r = 0.4, 0.3$), revealing a systematic knowledge-action gap. When instructed to "hold" a specific value, LLMs' performance declines by up to 6.6% compared to merely selecting the value, indicating a role-play aversion. These findings suggest that while alignment training yields normative value convergence, it does not eliminate the human-like incoherence between knowing and acting upon values.

[4] Explaining Generalization of AI-Generated Text Detectors Through Linguistic Analysis

Yuxi Xia,Kinga Stańczak,Benjamin Roth

Main category: cs.CL

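TL;DR: This paper studies how well AI-text detectors generalize across generation conditions and finds that their performance is closely tied to shifts in linguistic features such as tense usage and pronoun frequency.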

Details Motivation: AI-text detectors perform well in-domain but generalize poorly across prompts, models, and domains, and the reasons are not well understood. Method: Build a benchmark spanning 6 prompting strategies, 7 large language models, and 4 domain datasets; fine-tune classification-based detectors and evaluate their cross-condition generalization; and explain the performance differences through a feature-shift analysis over 80 linguistic features. Result: Generalization performance is significantly associated with shifts in linguistic features (such as tense and pronoun usage) between training and test conditions. Conclusion: Linguistic feature shift is a key factor in detector generalization and points to ways of making detectors more robust. Abstract: AI-text detectors achieve high accuracy on in-domain benchmarks, but often struggle to generalize across different generation conditions such as unseen prompts, model families, or domains. While prior work has reported these generalization gaps, there is limited insight into the underlying causes. In this work, we present a systematic study aimed at explaining generalization behavior through linguistic analysis. We construct a comprehensive benchmark that spans 6 prompting strategies, 7 large language models (LLMs), and 4 domain datasets, resulting in a diverse set of human- and AI-generated texts. Using this dataset, we fine-tune classification-based detectors on various generation settings and evaluate their cross-prompt, cross-model, and cross-dataset generalization. To explain the performance variance, we compute correlations between generalization accuracies and feature shifts of 80 linguistic features between training and test conditions. Our analysis reveals that generalization performance for specific detectors and evaluation conditions is significantly associated with linguistic features such as tense usage and pronoun frequency.

[5] Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu,Ramon Ruiz-Dolz,Xuehang Wen,Fengrui Zhang,Qiufeng Yi

Main category: cs.CL

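TL;DR: Proposes a tri-tier evaluation framework for assessing the cultural understanding of vision-language models in cross-cultural art critique. It finds that automated metrics are unreliable, that Western samples score higher, and that judges use inconsistent scales.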

Details Motivation: There is no effective way to validate how well vision-language models interpret cultural meaning in art, and existing metrics do not reflect the depth of a model's cultural understanding. Method: A tri-tier evaluation framework. Tier I computes automated coverage and risk indicators; Tier II has a single primary judge score five dimensions against a rubric; Tier III calibrates the Tier II aggregate score to human ratings with isotonic regression. Result: The authors evaluate 15 VLMs on 294 expert anchors spanning six cultural traditions. Automated metrics correlate poorly with cultural depth, Western samples score higher than non-Western samples, and judges disagree in scale, which motivates a single primary judge plus calibration. Calibration reduces MAE by 5.2% on a 152-sample held-out set. Conclusion: Current VLMs show a gap in understanding non-Western cultures, automated metrics are insufficient for measuring cultural understanding, and the tri-tier framework supports model selection and cultural-gap diagnosis. Abstract: Vision-Language Models (VLMs) excel at visual perception, yet their ability to interpret cultural meaning in art remains under-validated. We present a tri-tier evaluation framework for cross-cultural art-critique assessment: Tier I computes automated coverage and risk indicators offline; Tier II applies rubric-based scoring using a single primary judge across five dimensions; and Tier III calibrates the Tier II aggregate score to human ratings via isotonic regression, yielding a 5.2% reduction in MAE on a 152-sample held-out set. The framework outputs a calibrated cultural-understanding score for model selection and cultural-gap diagnosis, together with dimension-level diagnostics and risk indicators. We evaluate 15 VLMs on 294 expert anchors spanning six cultural traditions. Key findings are that (i) automated metrics are unreliable proxies for cultural depth, (ii) Western samples score higher than non-Western samples under our sampling and rubric, and (iii) cross-judge scale mismatch makes naive score averaging unreliable, motivating a single primary judge with explicit calibration. Dataset and code are available in the supplementary materials.
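
Tier III calibrates judge scores to human ratings with isotonic regression. As a rough illustration of what that step does, the sketch below fits a non-decreasing step function with the pool-adjacent-violators algorithm and compares MAE before and after calibration. The data, the function names, and the step-function prediction rule are illustrative assumptions; this is not the authors' code.

```python
import bisect
from typing import List, Sequence, Tuple

Model = List[Tuple[float, float]]  # (right edge of block, fitted value)


def fit_isotonic(scores: Sequence[float], ratings: Sequence[float]) -> Model:
    """Pool-adjacent-violators fit of a non-decreasing step function."""
    blocks: List[List[float]] = []  # each block: [sum of y, count, right edge x]
    for x, y in sorted(zip(scores, ratings)):
        blocks.append([y, 1, x])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, edge = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = edge
    return [(b[2], b[0] / b[1]) for b in blocks]


def predict(model: Model, score: float) -> float:
    """Map a raw score to its calibrated value (clamped at both ends)."""
    edges = [edge for edge, _ in model]
    i = bisect.bisect_left(edges, score)
    return model[min(i, len(model) - 1)][1]


def mae(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute error between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


if __name__ == "__main__":
    judge = [1.0, 2.0, 3.0, 4.0]   # toy Tier II aggregate scores
    human = [1.0, 3.0, 2.0, 4.0]   # toy human ratings
    model = fit_isotonic(judge, human)
    calibrated = [predict(model, s) for s in judge]
    print(mae(judge, human), mae(calibrated, human))
```

In practice the mapping would be fitted on one set of anchors and evaluated on a held-out set, as the abstract describes; the toy example fits and evaluates on the same four points only to keep the sketch short.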

[6] Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset

Z. Melce Hüsünbeyi,Virginie Mouilleron,Leonie Uhling,Daniel Foppe,Tatjana Scheffler,Djamé Seddah

Main category: cs.CL

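TL;DR: This paper presents a comprehensive data collection and processing pipeline for building multimodal fact-checking datasets in French and German. It uses large language models to extract evidence and generate justifications, supports comparison of fact-checking practices across organizations, and advances research on multilingual, multimodal misinformation verification.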

Details Motivation: Existing fact-checking datasets are limited in multimodal evidence, structured annotation, and cross-lingual coverage, and cannot keep up with the growing need for multilingual misinformation detection. Method: Build the dataset by aggregating ClaimReview feeds, scraping full debunking articles, normalizing heterogeneous verdicts, and adding structured metadata and aligned visual content; use LLMs and multimodal LLMs for category-based evidence extraction and evidence-grounded justification generation. Result: The pipeline produces multimodal fact-checking datasets in French and German. G-Eval and human assessment indicate that it enables fine-grained comparison of fact-checking practices and supports more interpretable, evidence-grounded models. Conclusion: The pipeline provides a reliable data foundation and methodological framework for multilingual, multimodal misinformation verification and supports the development of explainable, evidence-driven fact-checking models. Abstract: The rapid proliferation of misinformation across online platforms underscores the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources. However, existing datasets are limited in scope, often lacking multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts. This paper introduces a comprehensive data collection and processing pipeline that constructs multimodal fact-checking datasets in French and German by aggregating ClaimReview feeds, scraping full debunking articles, normalizing heterogeneous claim verdicts, and enriching them with structured metadata and aligned visual content. We used state-of-the-art large language models (LLMs) and multimodal LLMs for (i) evidence extraction under predefined evidence categories and (ii) justification generation that links evidence to verdicts. Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and evidence-grounded fact-checking models, and lays the groundwork for future research on multilingual, multimodal misinformation verification.

[7] VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding

Haorui Yu,Ramon Ruiz-Dolz,Diji Yang,Hang He,Fengrui Zhang,Qiufeng Yi

Main category: cs.CL

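TL;DR: VULCA-Bench is a multicultural art-critique benchmark for evaluating the cultural understanding of vision-language models beyond surface-level visual perception. It covers eight cultural traditions in Chinese and English and proposes a five-layer framework (L1-L5) that measures interpretation from visual perception up to philosophical aesthetics.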

Details Motivation: Existing vision-language benchmarks mostly evaluate low-level visual tasks (such as object recognition and scene description) and lack effective tests of higher-order cultural understanding (such as art critique and philosophical aesthetics), so a benchmark that measures deep cross-cultural understanding is needed. Method: Build the VULCA-Bench dataset of 7,410 image-critique pairs covering eight cultural traditions with Chinese-English bilingual coverage; propose a five-layer framework of cultural understanding (L1-L5), instantiated as 225 culture-specific dimensions and supported by expert-written bilingual critiques. Result: Pilot results indicate that higher-layer cultural reasoning (L3-L5) is consistently more challenging for models than basic visual analysis (L1-L2). Conclusion: VULCA-Bench offers a systematic tool for evaluating the cultural understanding of vision-language models and exposes the weakness of current models in higher-order cultural interpretation. Abstract: We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception. Existing VLM benchmarks predominantly measure L1-L2 capabilities (object recognition, scene description, and factual question answering) while under-evaluating higher-order cultural interpretation. VULCA-Bench contains 7,410 matched image-critique pairs spanning eight cultural traditions, with Chinese-English bilingual coverage. We operationalise cultural understanding using a five-layer framework (L1-L5, from Visual Perception to Philosophical Aesthetics), instantiated as 225 culture-specific dimensions and supported by expert-written bilingual critiques. Our pilot results indicate that higher-layer reasoning (L3-L5) is consistently more challenging than visual and technical analysis (L1-L2). The dataset, evaluation scripts, and annotation tools are available under CC BY 4.0 in the supplementary materials.

[8] From Word Sequences to Behavioral Sequences: Adapting Modeling and Evaluation Paradigms for Longitudinal NLP

Adithya V Ganesan,Vasudha Varadarajan,Oscar NE Kjell,Whitney R Ringwald,Scott Feltman,Benjamin J Luft,Roman Kotov,Ryan L Boyd,H Andrew Schwartz

Main category: cs.CL

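TL;DR: This paper proposes a modeling and evaluation paradigm for longitudinal NLP that accounts for the dependence of documents on person and time and treats them as behavioral sequences. On PTSD diary data it shows that traditional methods can lead to wrong conclusions, and it argues for moving from a word-sequence to a behavior-sequence paradigm.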

Details Motivation: Traditional NLP assumes documents are independent and unordered, but in longitudinal studies documents are nested within people and ordered in time, forming behavioral sequences, so modeling and evaluation need to reflect that structure. Method: Four changes to the pipeline: (1) evaluation splits by person (cross-sectional) and/or by time (prospective); (2) accuracy metrics that separate between-person differences from within-person dynamics; (3) sequence inputs that include history by default; and (4) model internals that support latent state over histories at different levels of coarseness (pooled summaries, explicit dynamics, or interaction-based models). Result: On a dataset of 17k diary transcripts with PTSD symptom severity from 238 participants, traditional document-level evaluation can yield conclusions that differ substantially from, and sometimes reverse, those of the more ecologically valid approach. Conclusion: NLP should move from word-sequence analysis toward ecologically valid behavior-sequence analysis to better support longitudinal studies of human behavior. Abstract: While NLP typically treats documents as independent and unordered samples, in longitudinal studies, this assumption rarely holds: documents are nested within authors and ordered in time, forming person-indexed, time-ordered behavioral sequences. Here, we demonstrate the need for and propose a longitudinal modeling and evaluation paradigm that consequently updates four parts of the NLP pipeline: (1) evaluation splits aligned to generalization over people (cross-sectional) and/or time (prospective); (2) accuracy metrics separating between-person differences from within-person dynamics; (3) sequence inputs to incorporate history by default; and (4) model internals that support different coarseness of latent state over histories (pooled summaries, explicit dynamics, or interaction-based models). We demonstrate the issues that ensue from the traditional pipeline and our proposed improvements on a dataset of 17k daily diary transcripts paired with PTSD symptom severity from 238 participants, finding that traditional document-level evaluation can yield substantially different and sometimes reversed conclusions compared to our ecologically valid modeling and evaluation. We tie our results to a broader discussion motivating a shift from word-sequence evaluation toward behavior-sequence paradigms for NLP.
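
Item (1) of the abstract, evaluation splits over people and over time, can be sketched as follows. The `Doc` record, its field names, and both split functions are hypothetical illustrations of the idea and do not come from the paper's code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Doc:
    person: str   # author / participant identifier (hypothetical field)
    day: int      # position in that person's timeline (hypothetical field)
    text: str


def cross_sectional_split(
    docs: Iterable[Doc], test_people: Set[str]
) -> Tuple[List[Doc], List[Doc]]:
    """Hold out whole people: tests generalization to unseen individuals."""
    docs = list(docs)
    train = [d for d in docs if d.person not in test_people]
    test = [d for d in docs if d.person in test_people]
    return train, test


def prospective_split(
    docs: Iterable[Doc], cutoff_day: int
) -> Tuple[List[Doc], List[Doc]]:
    """Hold out the future: train on days before the cutoff, test on the rest."""
    docs = list(docs)
    train = [d for d in docs if d.day < cutoff_day]
    test = [d for d in docs if d.day >= cutoff_day]
    return train, test


if __name__ == "__main__":
    corpus = [
        Doc("p1", 1, "slept badly"), Doc("p1", 2, "felt calmer"),
        Doc("p2", 1, "long shift"), Doc("p2", 2, "nightmares again"),
    ]
    print(cross_sectional_split(corpus, {"p2"}))
    print(prospective_split(corpus, cutoff_day=2))
```

A random document-level split would instead mix each person's entries across train and test, which is the leakage the abstract warns can change or reverse conclusions.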

[9] DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs

Nayoung Choi,Jonathan Zhang,Jinho D. Choi

Main category: cs.CL

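TL;DR: This paper proposes DyCP, a lightweight context management method that dynamically segments the dialogue and retrieves relevant memory at query time, improving answer quality and reducing response latency in long dialogues.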

Details Motivation: As dialogues grow longer, large language models respond more slowly and answer less well. Existing methods depend on extra LLM calls or offline processing, which is inefficient and can break conversational continuity, so a more efficient context management mechanism is needed. Method: DyCP segments the dialogue context dynamically at query time and retrieves relevant memory based on the current user input. It preserves the sequential structure of the dialogue, needs no predefined topic boundaries, and supports efficient, adaptive context retrieval. Result: On three long-form dialogue benchmarks (LoCoMo, MT-Bench+, and SCM4LLMs) and across multiple LLMs, DyCP consistently improves answer quality and reduces response time. The study also examines the gap between modern LLMs' expanded context windows and their actual long-context processing capacity. Conclusion: DyCP is an efficient, lightweight, adaptive context management method that mitigates LLM degradation in long dialogues and shows that context management still matters even with large context windows. Abstract: Large Language Models (LLMs) often exhibit increased response latency and degraded answer quality as dialogue length grows, making effective context management essential. However, existing methods rely on extra LLM calls to build memory or perform offline memory construction without considering the current user utterance, which can introduce inefficiencies or disrupt conversational continuity. We introduce DyCP, a lightweight context management method that dynamically segments and retrieves relevant memory at query time. It preserves the sequential structure of dialogue without predefined topic boundaries and supports efficient, adaptive context retrieval. Across three long-form dialogue benchmarks, LoCoMo, MT-Bench+, and SCM4LLMs, and multiple LLMs, DyCP consistently improves answer quality while reducing response latency. We also examine the gap between modern LLMs' expanded context windows and their actual long-context processing capacity, highlighting the continued importance of effective context management.

[10] Is Sentiment Banana-Shaped? Exploring the Geometry and Portability of Sentiment Concept Vectors

Laurits Lyngbaek,Pascale Feldkamp,Yuri Bizzoni,Kristoffer L. Nielbo,Kenneth Enevoldsen

Main category: cs.CL

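TL;DR: This paper evaluates the portability of Concept Vector Projections (CVP) across domains, periods, languages, and affective dimensions. CVP transfers well, but its linearity assumption is only approximate, which leaves room for improvement.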

Details Motivation: Examine how well CVP suits humanities use cases, how portable it is across domains, and whether its underlying linearity assumption holds. Method: Evaluate CVP across genres, historical periods, languages, and affective dimensions, analyze its transfer performance, and test the linearity assumption. Result: Concept vectors transfer well between corpora with little performance loss, but the linearity assumption they rely on holds only approximately. Conclusion: CVP is a portable approach that captures generalizable patterns, and the approximate nature of its linearity assumption points to potential for further development. Abstract: Use cases of sentiment analysis in the humanities often require contextualized, continuous scores. Concept Vector Projections (CVP) offer a recent solution: by modeling sentiment as a direction in embedding space, they produce continuous, multilingual scores that align closely with human judgments. Yet the method's portability across domains and underlying assumptions remain underexplored. We evaluate CVP across genres, historical periods, languages, and affective dimensions, finding that concept vectors trained on one corpus transfer well to others with minimal performance loss. To understand the patterns of generalization, we further examine the linearity assumption underlying CVP. Our findings suggest that while CVP is a portable approach that effectively captures generalizable patterns, its linearity assumption is approximate, pointing to potential for further development.
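
The abstract describes CVP as modeling sentiment as a direction in embedding space. A minimal sketch of that idea follows, assuming the direction is the normalized difference between the mean embedding of positive examples and the mean embedding of negative examples; the paper may construct the vector differently, and the toy 2-D embeddings and function names are illustrative only.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_vector(
    positive: Sequence[Sequence[float]], negative: Sequence[Sequence[float]]
) -> Vector:
    """Unit-length direction from the negative centroid to the positive one."""
    diff = [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("positive and negative centroids coincide")
    return [x / norm for x in diff]


def project(embedding: Sequence[float], direction: Sequence[float]) -> float:
    """Continuous score: signed length of the embedding along the direction."""
    return sum(e * d for e, d in zip(embedding, direction))


if __name__ == "__main__":
    pos = [[1.0, 0.0], [3.0, 0.0]]   # toy embeddings of positive sentences
    neg = [[-2.0, 0.0]]              # toy embeddings of negative sentences
    direction = concept_vector(pos, neg)
    print(project([0.5, 9.0], direction))
```

Because the score is a single dot product, it is linear in the embedding by construction, which is exactly the assumption the paper reports to be only approximately valid.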

[11] LLM Review: Enhancing Creative Writing via Blind Peer Review Feedback

Weiyue Li,Mingxiao Song,Zhenda Shen,Dachuan Zhao,Yunfan Long,Yi Li,Yongce Li,Ruyi Yang,Mengyu Wang

Main category: cs.CL

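TL;DR: Proposes LLM Review, a framework that simulates blind peer review to support creative generation among multiple agents while avoiding content homogenization, and shows on a science fiction writing task that it outperforms multi-agent baselines.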

Details Motivation: Large language models struggle with creative generation, and existing multi-agent interaction, while helpful for reasoning, can suppress creativity by homogenizing content. Method: Design the peer-review-inspired LLM Review framework, in which a blind review mechanism lets agents exchange targeted feedback while revising independently; build the SciFi-100 dataset and evaluate with LLM-as-a-judge scoring, human annotation, and rule-based novelty metrics. Result: LLM Review consistently outperforms multi-agent baselines on creative generation, and smaller models using the framework can surpass larger single-agent models. Conclusion: The design of the interaction structure may substitute for scaling up the model as a way to improve creative generation. Abstract: Large Language Models (LLMs) often struggle with creative generation, and multi-agent frameworks that improve reasoning through interaction can paradoxically hinder creativity by inducing content homogenization. We introduce LLM Review, a peer-review-inspired framework implementing Blind Peer Review: agents exchange targeted feedback while revising independently, preserving divergent creative trajectories. To enable rigorous evaluation, we propose SciFi-100, a science fiction writing dataset with a unified framework combining LLM-as-a-judge scoring, human annotation, and rule-based novelty metrics. Experiments demonstrate that LLM Review consistently outperforms multi-agent baselines, and smaller models with our framework can surpass larger single-agent models, suggesting interaction structure may substitute for model scale.

[12] Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models

Zhenghao He,Guangzhi Xiong,Bohan Liu,Sanchit Sinha,Aidong Zhang

Main category: cs.CL

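TL;DR: This study analyzes the internal representations of large language models with Sparse Autoencoders (SAEs) and identifies latent features associated with reasoning behavior. Steering a single such feature achieves reasoning performance comparable to Chain-of-Thought (CoT) prompting, which indicates that CoT is not the only trigger.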

Details Motivation: Understand why CoT prompting works and test whether other mechanisms can trigger reasoning in large models. Method: Use Sparse Autoencoders (SAEs) to identify key latent features associated with reasoning, then intervene on these internal representations and observe the effect on reasoning behavior. Result: Steering a single latent feature substantially improves reasoning accuracy; for large models the effect is comparable to CoT prompting while producing more efficient outputs. The state is triggered early in generation and can override prompt instructions that discourage reasoning. Conclusion: Multi-step reasoning is supported by internal latent activations that can be activated externally; CoT prompting is one effective way to activate this mechanism, not its necessary cause. Abstract: Chain-of-Thought (CoT) prompting has improved the reasoning performance of large language models (LLMs), but it remains unclear why it works and whether it is the unique mechanism for triggering reasoning in LLMs. In this work, we study this question by directly analyzing and intervening on the internal representations of LLMs with Sparse Autoencoders (SAEs), identifying a small set of latent features that are causally associated with LLM reasoning behavior. Across multiple model families and reasoning benchmarks, we find that steering a single reasoning-related latent feature can substantially improve accuracy without explicit CoT prompting. For large models, latent steering achieves performance comparable to standard CoT prompting while producing more efficient outputs. We further observe that this reasoning-oriented internal state is triggered early in generation and can override prompt-level instructions that discourage explicit reasoning. Overall, our results suggest that multi-step reasoning in LLMs is supported by latent internal activations that can be externally activated, while CoT prompting is one effective, but not unique, way of activating this mechanism rather than its necessary cause.

[13] Universal computation is intrinsic to language model decoding

Alex Lewandowski,Marlos C. Machado,Dale Schuurmans

Main category: cs.CL

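TL;DR: This paper proves that a language model can perform universal computation by chaining its autoregressive outputs. The capability is present before training; training improves programmability rather than conferring computational expressiveness.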

Details Motivation: Examine the ultimate computational capability of language models, where it comes from, and what role training plays. Method: Prove theoretically that chaining a language model's autoregressive output can simulate the execution of any algorithm, and demonstrate that randomly initialized language models are also capable of universal computation. Result: Language models can perform universal computation without training; training improves programmability, so that natural language can access the intrinsic capability more effectively. Conclusion: The computational expressiveness of language models is intrinsic, and the main effect of training is to make it easier to elicit and control that capability through natural-language prompts. Abstract: Language models now provide an interface to express and often solve general problems in natural language, yet their ultimate computational capabilities remain a major topic of scientific debate. Unlike a formal computer, a language model is trained to autoregressively predict successive elements in human-generated text. We prove that chaining a language model's autoregressive output is sufficient to perform universal computation. That is, a language model can simulate the execution of any algorithm on any input. The challenge of eliciting desired computational behaviour can thus be reframed in terms of programmability: the ease of finding a suitable prompt. Strikingly, we demonstrate that even randomly initialized language models are capable of universal computation before training. This implies that training does not give rise to computational expressiveness; rather, it improves programmability, enabling a natural language interface for accessing these intrinsic capabilities.

[14] Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations

Yuxi Xia,Dennis Ulmer,Terra Blevins,Yihong Liu,Hinrich Schütze,Benjamin Roth

Main category: cs.CL

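TL;DR: This paper proposes a comprehensive evaluation framework for confidence estimation (CE) in large language models. It measures robustness to prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers, and it exposes the limits of existing CE methods in practical use.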

Details Motivation: Existing evaluations of confidence estimation rely mainly on calibration and discrimination and overlook consistency and sensitivity under language variation and prompt changes, so they do not reflect reliability in real use. Method: Propose an evaluation framework that assesses CE methods on three new aspects (robustness, stability, and sensitivity) and validate it experimentally on several CE methods. Result: Methods that do well on calibration or discrimination can still lack robustness to prompt perturbations or be insensitive to changes in answer meaning. Conclusion: Existing CE evaluation metrics are insufficient; robustness, stability, and sensitivity should also be considered when selecting and designing CE methods that support trust and decision-making in practice. Abstract: Confidence estimation (CE) indicates how reliable the answers of large language models (LLMs) are, and can impact user trust and decision-making. Existing work evaluates CE methods almost exclusively through calibration, examining whether stated confidence aligns with accuracy, or discrimination, whether confidence is ranked higher for correct predictions than incorrect ones. However, these facets ignore pitfalls of CE in the context of LLMs and language variation: confidence estimates should remain consistent under semantically equivalent prompt or answer variations, and should change when the answer meaning differs. Therefore, we present a comprehensive evaluation framework for CE that measures confidence quality on three new aspects: robustness of confidence against prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers. In our work, we demonstrate that common CE methods for LLMs often fail on these metrics: methods that achieve good performance on calibration or discrimination are not robust to prompt variations or are not sensitive to answer changes. Overall, our framework reveals limitations of existing CE evaluations relevant for real-world LLM use cases and provides practical guidance for selecting and designing more reliable CE methods.

[15] AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling

Yongliang Miao,Yangyang Liang,Mengnan Du

Main category: cs.CL

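TL;DR: AdaJudge is a reward modeling framework that jointly adapts representation and aggregation: it adjusts sequence aggregation dynamically and refines representations for discrimination, improving the alignment of language models with human preferences.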

Details Motivation: Existing reward models use a static pooling strategy, whose inductive bias does not match task-dependent preference signals, and their generation-oriented backbones are poorly suited to fine-grained discrimination. Method: AdaJudge uses gated refinement blocks to map backbone representations into a discrimination-oriented space, and an adaptive multi-view pooling module that dynamically routes and combines evidence from different views to produce the score. Result: Experiments on RM-Bench and JudgeBench show that AdaJudge outperforms strong existing reward models and traditional pooling baselines. Conclusion: By jointly adapting representation and aggregation, AdaJudge addresses both static pooling and representational mismatch and improves reward modeling performance. Abstract: Reward modeling is essential for aligning large language models with human preferences, yet predominant architectures rely on a static pooling strategy to condense sequences into scalar scores. This paradigm, however, suffers from two key limitations: a static inductive bias that misaligns with task-dependent preference signals, and a representational mismatch, as the backbone is optimized for generation rather than fine-grained discrimination. To address this, we propose AdaJudge, a unified framework that jointly adapts representation and aggregation. AdaJudge first refines backbone representations into a discrimination-oriented space via gated refinement blocks. It then replaces the static readout with an adaptive multi-view pooling module that dynamically routes and combines evidence. Extensive experiments on RM-Bench and JudgeBench show that AdaJudge outperforms strong off-the-shelf reward models and traditional pooling baselines.

[16] Query Suggestion for Retrieval-Augmented Generation via Dynamic In-Context Learning

Fabian Spaeh,Tianyi Chen,Chen-Hao Chiang,Bin Shen

Main category: cs.CL

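TL;DR: This paper is the first to study query suggestion for agentic retrieval-augmented generation (agentic RAG). It proposes a robust dynamic few-shot learning method that can be self-learned to produce relevant, answerable suggested queries, making user interaction with the system safer and more effective.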

Details Motivation: Because the knowledge scope of agentic RAG is limited, out-of-scope questions can lead to hallucination. Existing guardrails only block unanswerable questions, and no prior work has studied how to guide users toward answerable queries. Method: A robust dynamic few-shot learning method that retrieves examples from relevant workflows and, with a self-learning mechanism (for instance on prior user queries), generates suggested queries that are semantically similar and answerable by the RAG system. Result: On three benchmark datasets built from real-world user queries, the method produces more relevant and answerable suggestions than few-shot and retrieval-only baselines. Conclusion: The method addresses user interaction when a query is unanswerable in agentic RAG; dynamic example retrieval and self-learning enable safer, more effective interaction. Abstract: Retrieval-augmented generation with tool-calling agents (agentic RAG) has become increasingly powerful in understanding, processing, and responding to user queries. However, the scope of the grounding knowledge is limited and asking questions that exceed this scope may lead to issues like hallucination. While guardrail frameworks aim to block out-of-scope questions (Rodriguez et al., 2024), no research has investigated the question of suggesting answerable queries in order to complete the user interaction. In this paper, we initiate the study of query suggestion for agentic RAG. We consider the setting where user questions are not answerable, and the suggested queries should be similar to aid the user interaction. Such scenarios are frequent for tool-calling LLMs as communicating the restrictions of the tools or the underlying datasets to the user is difficult, and adding query suggestions enhances the interaction with the RAG agent. As opposed to traditional settings for query recommendations such as in search engines, ensuring that the suggested queries are answerable is a major challenge due to the RAG's multi-step workflow that demands a nuanced understanding of the RAG as a whole, which the executing LLM lacks. As such, we introduce robust dynamic few-shot learning which retrieves examples from relevant workflows. We show that our system can be self-learned, for instance on prior user queries, and is therefore easily applicable in practice. We evaluate our approach on three benchmark datasets based on two unlabeled question datasets collected from real-world user queries. Experiments on real-world datasets confirm that our method produces more relevant and answerable suggestions, outperforming few-shot and retrieval-only baselines, and thus enables safer, more effective user interaction with agentic RAG.

[17] Debiasing Large Language Models via Adaptive Causal Prompting with Sketch-of-Thought

Bowen Li,Ziqi Xu,Jing Ren,Renqiang Luo,Xikun Zhang,Xiuzhen Zhang,Yongli Ren,Feng Xia

Main category: cs.CL

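TL;DR: Proposes ACPS, a prompting framework that combines causal models with concise sketches of thought to achieve efficient, generalizable reasoning, reducing token usage and improving performance across a range of tasks.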

Details Motivation: Existing prompting methods such as CoT consume many tokens and generalize poorly across tasks. Method: The Adaptive Causal Prompting with Sketch-of-Thought (ACPS) framework uses structural causal models to infer the causal effect of a query on its answer and adaptively selects an intervention strategy; it replaces verbose Chain-of-Thought with concise Sketch-of-Thought. Result: Experiments on multiple reasoning benchmarks and LLMs show that ACPS outperforms existing prompting methods in accuracy, robustness, and computational efficiency. Conclusion: ACPS enables more efficient and more general reasoning and keeps its advantage across heterogeneous tasks without task-specific retraining. Abstract: Despite notable advancements in prompting methods for Large Language Models (LLMs), such as Chain-of-Thought (CoT), existing strategies still suffer from excessive token usage and limited generalisability across diverse reasoning tasks. To address these limitations, we propose an Adaptive Causal Prompting with Sketch-of-Thought (ACPS) framework, which leverages structural causal models to infer the causal effect of a query on its answer and adaptively select an appropriate intervention (i.e., standard front-door and conditional front-door adjustments). This design enables generalisable causal reasoning across heterogeneous tasks without task-specific retraining. By replacing verbose CoT with concise Sketch-of-Thought, ACPS enables efficient reasoning that significantly reduces token usage and inference cost. Extensive experiments on multiple reasoning benchmarks and LLMs demonstrate that ACPS consistently outperforms existing prompting baselines in terms of accuracy, robustness, and computational efficiency.
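
For readers unfamiliar with the standard front-door adjustment named in the abstract, its textbook form is given below. The symbols are assumptions made for illustration: $Q$ is the query, $A$ the answer, and $S$ a mediator on the path from query to answer (plausibly the reasoning sketch, although the abstract does not say so). The conditional front-door variant is not reproduced here.

```latex
% Standard front-door adjustment (Pearl), with assumed roles:
%   Q = query, S = mediator (assumed: reasoning sketch), A = answer.
P\bigl(A \mid \mathrm{do}(Q = q)\bigr)
  = \sum_{s} P(s \mid q) \sum_{q'} P(A \mid s, q')\, P(q')
```

The inner sum averages over alternative queries $q'$ to block confounding between $Q$ and $A$, and the outer sum weights each mediator value by how likely the actual query is to produce it.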

[18] Attention Projection Mixing and Exogenous Anchors

Jonathan Su

Main category: cs.CL

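TL;DR: ExoFormer is a Transformer architecture that decouples anchor projections from the layer stack and learns them externally, resolving the tension between serving as a stable reference and computing effectively in deep models. It outperforms baseline models on several metrics.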

Details Motivation: In Transformers that reuse early-layer attention projections, the first layer must act both as a reference for deeper layers and as an effective computational block, and these roles conflict. ExoFormer aims to separate them and improve performance and data efficiency. Method: Learn exogenous anchors outside the layer stack in place of internal (first-layer) anchors, and use a unified normalized mixing framework that covers all attention pathways and several coefficient granularities (elementwise, headwise, scalar). Result: ExoFormer gains 2.13 points in downstream accuracy, matches baseline validation loss with 1.84x fewer tokens, and reduces attention sink by 2x, but it shows signs of representation collapse. Conclusion: The authors explain the collapse with an Offloading Hypothesis: external anchors preserve essential token identity, so layers can specialize in computational refinement. Abstract: Transformers that reuse early-layer attention projections as residuals face a fundamental tension: the first layer must simultaneously serve as a stable reference for all deeper layers and as an effective computational block. To resolve this, we propose ExoFormer, which learns dedicated exogenous anchor projections outside the sequential layer stack, decoupling the anchor role from computational refinement. Through a unified normalized mixing framework (studying different coefficient granularities: elementwise, headwise, scalar) across all attention pathways (queries, keys, values, and gate logits), ExoFormer variants consistently outperform their internal-anchor counterparts. Moreover, the dynamic variant achieves a 2.13-point increase in downstream accuracy over the baseline and demonstrates superior data efficiency, matching baseline validation loss with 1.84x fewer tokens. ExoFormer also achieves a 2x reduction in attention sink compared to standard Gated Attention. Paradoxically, all ExoFormer variants exhibit signs of representation collapse. We explain this via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in computational refinement. We release code and models to facilitate future research.

[19] How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains

Reza Khanmohammadi,Erfan Miahi,Simerjot Kaur,Ivan Brugere,Charese H. Smiley,Kundan Thind,Mohammad M. Ghassemi

Main category: cs.CL

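TL;DR: This paper introduces RMCB, a benchmark for evaluating confidence estimation in Large Reasoning Models (LRMs), containing 347,496 reasoning traces from six popular LRMs and covering high-stakes domains such as clinical, financial, legal, and mathematical reasoning. Experiments on the benchmark show that existing representation-based methods face a trade-off between discrimination (AUROC) and calibration (ECE), and that greater model complexity does not reliably improve performance, exposing the limits of current methods.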

Details Motivation: Deploying Large Reasoning Models (LRMs) in high-stakes domains requires reliable confidence estimation, but their long-form, multi-step outputs defeat traditional calibration methods, so a dedicated benchmark and evaluation protocol are needed. Method: Builds RMCB, a large-scale public benchmark of 347,496 reasoning traces from six LRMs of different architectural families, covering multiple high-stakes and complex reasoning tasks, with correctness annotations for all samples; on this basis, systematically evaluates more than ten representation-based confidence estimation methods, including sequential, graph-based, and text-encoder architectures. Result: Text-based encoders achieve the best discrimination (AUROC=0.672), while structurally-aware models achieve the best calibration (ECE=0.148), with a clear trade-off between the two; added model complexity does not consistently improve performance, and simple sequential models remain competitive. Conclusion: RMCB is the most comprehensive benchmark for this task to date. It provides rigorous baselines for confidence estimation and reveals a performance ceiling for current representation-based methods, suggesting that future work should explore paradigms beyond hidden-state aggregation. Abstract: The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), while structurally-aware models yield the best ECE (0.148), with no single method dominating both. Furthermore, we find that increased architectural complexity does not reliably outperform simpler sequential baselines, suggesting a performance ceiling for methods relying solely on chunk-level hidden states. This work provides the most comprehensive benchmark for this task to date, establishing rigorous baselines and demonstrating the limitations of current representation-based paradigms.
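The two metrics in the trade-off above measure different things, which is why one method can lead on AUROC and another on ECE. The following is a minimal sketch of both, assuming each sample has a confidence in [0, 1] and a binary correctness label. The function names, the pairwise AUROC formulation, and the 10 equal-width bins are illustrative choices, not details taken from RMCB.

```python
def auroc(confidences, labels):
    """Probability that a random correct sample gets a higher confidence
    than a random incorrect one (ties count as 0.5)."""
    pos = [c for c, y in zip(confidences, labels) if y == 1]
    neg = [c for c, y in zip(confidences, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ece(confidences, labels, n_bins=10):
    """Expected Calibration Error: the sample-weighted mean gap between
    average confidence and accuracy inside each equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    error = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        error += len(bucket) / total * abs(avg_conf - accuracy)
    return error
```

AUROC depends only on the ranking of confidences, while ECE depends on their absolute values, so rescaling scores can change ECE without changing AUROC.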

[20] Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training

Muhammad Taimoor Hassan,Jawad Ahmed,Muhammad Awais

Main category: cs.CL

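TL;DR: This paper presents Qalb, a language model built for Urdu through continued pre-training and instruction fine-tuning of LLaMA 3.1 8B, achieving state-of-the-art performance on a range of Urdu tasks.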

Details Motivation: Urdu is severely underrepresented in modern NLP systems, and existing models handle its complex morphology, right-to-left script, and rich literary traditions poorly. Method: A two-stage approach: continued pre-training of LLaMA 3.1 8B on 1.97 billion tokens of Urdu text and English Wikipedia data, followed by supervised fine-tuning on the Alif Urdu-instruct dataset. Result: Qalb reaches a weighted average score of 90.34 on Urdu benchmarks, 3.24 points above the previous best model, Alif-1.0-Instruct (87.1), and 44.64 points above the base LLaMA-3.1 8B-Instruct model, with strong results across seven tasks including classification, sentiment analysis, and reasoning. Conclusion: Continued pre-training combined with instruction fine-tuning effectively adapts foundation models to low-resource languages and offers a viable path for similar languages. Abstract: Despite remarkable progress in large language models, Urdu, a language spoken by over 230 million people, remains critically underrepresented in modern NLP systems. Existing multilingual models demonstrate poor performance on Urdu-specific tasks, struggling with the language's complex morphology, right-to-left Nastaliq script, and rich literary traditions. Even the base LLaMA-3.1 8B-Instruct model shows limited capability in generating fluent, contextually appropriate Urdu text. We introduce Qalb, an Urdu language model developed through a two-stage approach: continued pre-training followed by supervised fine-tuning. Starting from LLaMA 3.1 8B, we perform continued pre-training on a dataset of 1.97 billion tokens. This corpus comprises 1.84 billion tokens of diverse Urdu text (spanning news archives, classical and contemporary literature, government documents, and social media) combined with 140 million tokens of English Wikipedia data to prevent catastrophic forgetting. We then fine-tune the resulting model on the Alif Urdu-instruct dataset. Through extensive evaluation on Urdu-specific benchmarks, Qalb demonstrates substantial improvements, achieving a weighted average score of 90.34 and outperforming the previous state-of-the-art Alif-1.0-Instruct model (87.1) by 3.24 points, while also surpassing the base LLaMA-3.1 8B-Instruct model by 44.64 points. Qalb achieves state-of-the-art performance with comprehensive evaluation across seven diverse tasks including Classification, Sentiment Analysis, and Reasoning.
Our results demonstrate that continued pre-training on diverse, high-quality language data, combined with targeted instruction fine-tuning, effectively adapts foundation models to low-resource languages.

[21] Mechanisms are Transferable: Data-Efficient Low-Resource Adaptation via Circuit-Targeted Supervised Fine-Tuning

Khumaisa Nur'aini,Ayu Purwarianti,Alham Fikri Aji,Derry Wijaya

Main category: cs.CL

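TL;DR: Proposes CT-SFT, which identifies task-relevant attention heads and updates only those heads (plus LayerNorm) to adapt models efficiently to low-resource languages, reducing both the number of updated parameters and catastrophic forgetting.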

Details Motivation: Addresses three problems in fine-tuning large models for low-resource languages: scarce data, unstable full-model fine-tuning, and catastrophic forgetting caused by cross-lingual tuning. Method: Building on CD-T, proposes counterfactual-free Circuit-Targeted Supervised Fine-Tuning (CT-SFT), which uses a label-balanced mean baseline and task-directional relevance scoring to identify a sparse set of task-relevant attention heads in a proxy-language checkpoint, then updates only those heads and LayerNorm through head-level gradient masking. Result: On NusaX-Senti and XNLI, CT-SFT improves cross-lingual accuracy over full-model fine-tuning while updating fewer parameters, and it substantially reduces catastrophic forgetting, preserving source-language competence. Conclusion: CT-SFT balances editing against preserving the source mechanism, suits transfers of varying difficulty, and is an efficient and stable fine-tuning strategy for low-resource language adaptation. Abstract: Adapting LLMs to low-resource languages is difficult: labeled data is scarce, full-model fine-tuning is unstable, and continued cross-lingual tuning can cause catastrophic forgetting. We propose Circuit-Targeted Supervised Fine-Tuning (CT-SFT): a counterfactual-free adaptation of CD-T (Contextual Decomposition Transformer) that uses a label-balanced mean baseline and task-directional relevance scoring to identify a sparse set of task-relevant attention heads in a proxy-language checkpoint, then transfers to a target language by updating only those heads (plus LayerNorm) via head-level gradient masking. Across NusaX-Senti and XNLI, CT-SFT improves cross-lingual accuracy over continued full fine-tuning while updating only a small subset of model parameters. We find an editing-preserving trade-off: harder transfers favor editing circuit heads, while easier transfers often favor near-zero (i.e., low-relevance heads) updates, preserving the source mechanism. CT-SFT also substantially reduces catastrophic forgetting, preserving proxy/source-language competence during transfer.
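Head-level gradient masking, as described above, can be pictured as zeroing the gradient of every parameter group that is not a selected head or a LayerNorm before the optimizer step. The sketch below uses plain Python lists in place of tensors. The parameter naming scheme (`("head", layer, index)`, `("layernorm", layer)`), the plain SGD update, and both function names are assumptions for illustration, not the authors' implementation.

```python
def mask_gradients(grads, selected_heads):
    """Keep gradients for selected heads and all LayerNorm groups; zero the rest.

    grads maps a parameter-group name (a tuple) to a list of floats.
    selected_heads is a set of ("head", layer, index) names.
    """
    masked = {}
    for name, grad in grads.items():
        trainable = name[0] == "layernorm" or name in selected_heads
        masked[name] = list(grad) if trainable else [0.0] * len(grad)
    return masked


def sgd_step(params, grads, lr=0.1):
    """Plain SGD update applied group by group."""
    return {
        name: [p - lr * g for p, g in zip(params[name], grads[name])]
        for name in params
    }


# Hypothetical toy model: two heads and one LayerNorm in layer 0.
params = {
    ("head", 0, 0): [1.0, 1.0],
    ("head", 0, 1): [1.0, 1.0],
    ("layernorm", 0): [1.0],
}
grads = {
    ("head", 0, 0): [0.5, 0.5],
    ("head", 0, 1): [0.5, 0.5],
    ("layernorm", 0): [0.5],
}
selected = {("head", 0, 1)}  # heads scored as task-relevant
updated = sgd_step(params, mask_gradients(grads, selected))
```

In this toy run only head `(0, 1)` and the LayerNorm move; head `(0, 0)` keeps its source-language weights, which is the mechanism the abstract credits for reduced forgetting.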

[22] WISE-Flow: Workflow-Induced Structured Experience for Self-Evolving Conversational Service Agents

Yuqing Zhou,Zhuoer Wang,Jie Yuan,Hong Wang,Samson Koelle,Ziwei Zhu,Wei Niu

Main category: cs.CL

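TL;DR: Proposes WISE-Flow, a framework that extracts workflows of prerequisite-augmented action blocks from historical service interactions, enabling LLM agents to self-evolve and improving the stability and performance of task execution.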

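Details Motivation: LLM agents are error-prone on new tasks, repeat the same failure patterns, and vary substantially between runs; environment-specific training or manual patching is costly and hard to scale, so self-evolving agents are needed for user-facing service environments. Method: Proposes WISE-Flow, a workflow-centric framework that converts historical interactions into reusable procedural experience by inducing workflows with prerequisite-augmented action blocks; at deployment, it aligns the agent's execution trajectory to retrieved workflows and performs prerequisite-aware feasibility reasoning to produce state-grounded next actions. Result: Experiments on ToolSandbox and $τ^2$-bench show consistent performance gains across base models. Conclusion: Through workflow modeling and prerequisite-aware reasoning, WISE-Flow improves the robustness and generalization of LLM agents in user-facing service scenarios and supports their self-evolution. Abstract: Large language model (LLM)-based agents are widely deployed in user-facing services but remain error-prone in new tasks, tend to repeat the same failure patterns, and show substantial run-to-run variability. Fixing failures via environment-specific training or manual patching is costly and hard to scale. To enable self-evolving agents in user-facing service environments, we propose WISE-Flow, a workflow-centric framework that converts historical service interactions into reusable procedural experience by inducing workflows with prerequisite-augmented action blocks. At deployment, WISE-Flow aligns the agent's execution trajectory to retrieved workflows and performs prerequisite-aware feasibility reasoning to achieve state-grounded next actions. Experiments on ToolSandbox and $τ^2$-bench show consistent improvement across base models.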
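One way to read "prerequisite-aware feasibility reasoning" is as a check that the next pending action block in a retrieved workflow has all its prerequisites satisfied by the current state. The sketch below illustrates that reading only. The `ActionBlock` structure, the set-of-facts state, and `check_next_step` are hypothetical and are not taken from WISE-Flow.

```python
from dataclasses import dataclass, field


@dataclass
class ActionBlock:
    """One workflow step plus the state facts it needs (hypothetical structure)."""
    name: str
    prerequisites: set = field(default_factory=set)


def check_next_step(workflow, state, done):
    """Return (next block name, missing prerequisites) for the first block
    that has not been executed yet; (None, empty set) when the workflow is complete."""
    for block in workflow:
        if block.name not in done:
            return block.name, block.prerequisites - state
    return None, set()


# Hypothetical refund workflow induced from past interactions.
workflow = [
    ActionBlock("verify_identity"),
    ActionBlock("issue_refund", {"identity_verified"}),
]
step, missing = check_next_step(workflow, state=set(), done={"verify_identity"})
# The step is feasible only when `missing` is empty.
```

An agent using such a check could decline to call `issue_refund` until `identity_verified` appears in the state, rather than repeating a known failure.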

[23] SwiftMem: Fast Agentic Memory via Query-aware Indexing

Anxin Tian,Yiming Li,Xing Li,Hui-Ling Zhen,Lei Chen,Xianzhi Yu,Zhenhua Dong,Mingxuan Yuan

Main category: cs.CL

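TL;DR: SwiftMem is a query-aware agentic memory system that achieves sub-linear retrieval through specialized indexes over temporal and semantic dimensions, substantially speeding up retrieval while maintaining competitive accuracy.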

Details Motivation: Existing memory frameworks perform exhaustive retrieval over the entire storage layer at query time, which creates severe latency bottlenecks as memory grows and hinders real-time interaction. Method: Proposes SwiftMem, which builds a temporal index supporting logarithmic-time range queries for time-sensitive information and a semantic DAG-Tag index that maps queries to relevant topics through hierarchical tag structures; it also introduces an embedding-tag co-consolidation mechanism that reorganizes storage by semantic cluster to reduce memory fragmentation. Result: On the LoCoMo and LongMemEval benchmarks, SwiftMem searches 47 times faster than state-of-the-art baselines while maintaining competitive accuracy. Conclusion: SwiftMem addresses retrieval efficiency in large-scale agentic memory systems and offers a practical route to deploying memory-augmented LLM agents. Abstract: Agentic memory systems have become critical for enabling LLM agents to maintain long-term context and retrieve relevant information efficiently. However, existing memory frameworks suffer from a fundamental limitation: they perform exhaustive retrieval across the entire storage layer regardless of query characteristics. This brute-force approach creates severe latency bottlenecks as memory grows, hindering real-time agent interactions. We propose SwiftMem, a query-aware agentic memory system that achieves sub-linear retrieval through specialized indexing over temporal and semantic dimensions. Our temporal index enables logarithmic-time range queries for time-sensitive retrieval, while the semantic DAG-Tag index maps queries to relevant topics through hierarchical tag structures. To address memory fragmentation during growth, we introduce an embedding-tag co-consolidation mechanism that reorganizes storage based on semantic clusters to improve cache locality. Experiments on LoCoMo and LongMemEval benchmarks demonstrate that SwiftMem achieves 47$\times$ faster search compared to state-of-the-art baselines while maintaining competitive accuracy, enabling practical deployment of memory-augmented LLM agents.
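A logarithmic-time temporal range query can be illustrated with a sorted list of timestamps and binary search. This is a minimal sketch of the general idea, not SwiftMem's index: the `TemporalIndex` class and its methods are hypothetical, and a production index would likely use a balanced tree or B-tree so that insertion is also logarithmic.

```python
import bisect


class TemporalIndex:
    """Memories kept sorted by timestamp so a time window can be located
    with two binary searches instead of a scan over all entries."""

    def __init__(self):
        self._times = []  # sorted timestamps
        self._items = []  # memories, aligned with self._times

    def add(self, timestamp, item):
        # Finding the position is O(log n); inserting into a Python list is O(n).
        pos = bisect.bisect_right(self._times, timestamp)
        self._times.insert(pos, timestamp)
        self._items.insert(pos, item)

    def range_query(self, start, end):
        """Return memories with start <= timestamp <= end.
        Locating the bounds is O(log n); the slice costs O(k) for k results."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return self._items[lo:hi]


index = TemporalIndex()
index.add(30, "booked flight")
index.add(10, "asked about visas")
index.add(20, "chose hotel")
recent = index.range_query(15, 30)
```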

[24] Relational Knowledge Distillation Using Fine-tuned Function Vectors

Andrea Kang,Yingnian Wu,Hongjing Lu

Main category: cs.CL

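TL;DR: This paper proposes fine-tuning function vectors to improve language model performance on relational reasoning tasks and introduces composite function vectors to strengthen analogical reasoning, markedly improving results on cognitive science and SAT analogy problems.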

Details Motivation: To strengthen large language models' understanding of and reasoning about relations between concepts, the authors seek to improve function vectors obtained through causal mediation analysis so that they encode relational knowledge more effectively. Method: Fine-tunes function vectors with a small set of examples (about 20 word pairs) and builds composite function vectors as weighted combinations, inserting them into the language model's activations at inference time to improve relation extraction and analogical reasoning. Result: Fine-tuned function vectors outperform the original vectors on relation-based word-completion tasks and align better with human semantic similarity judgments; composite function vectors markedly improve reasoning on cognitive science and SAT analogy problems. Conclusion: Activation patching is a controllable and effective way to encode and manipulate relational knowledge, and it helps improve both the interpretability and the reasoning capabilities of large language models. Abstract: Representing relations between concepts is a core prerequisite for intelligent systems to make sense of the world. Recent work using causal mediation analysis has shown that a small set of attention heads encodes task representation in in-context learning, captured in a compact representation known as the function vector. We show that fine-tuning function vectors with only a small set of examples (about 20 word pairs) yields better performance on relation-based word-completion tasks than using the original vectors derived from causal mediation analysis. These improvements hold for both small and large language models. Moreover, the fine-tuned function vectors yield improved decoding performance for relation words and show stronger alignment with human similarity judgments of semantic relations. Next, we introduce the composite function vector (a weighted combination of fine-tuned function vectors) to extract relational knowledge and support analogical reasoning. At inference time, inserting this composite vector into LLM activations markedly enhances performance on challenging analogy problems drawn from cognitive science and SAT benchmarks. Our results highlight the potential of activation patching as a controllable mechanism for encoding and manipulating relational knowledge, advancing both the interpretability and reasoning capabilities of large language models.
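The composite function vector is described as a weighted combination of fine-tuned function vectors that is inserted into activations at inference time. A minimal sketch of that arithmetic follows, using plain lists for vectors. Treating "insertion" as addition to a hidden state, the `scale` parameter, and both function names are assumptions, since the abstract does not specify the layer, position, or operation.

```python
def composite_vector(vectors, weights):
    """Weighted sum of equally sized function vectors."""
    if len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]


def patch_activation(hidden, vector, scale=1.0):
    """Add the (scaled) vector to a hidden state; additive insertion is an assumption."""
    return [h + scale * v for h, v in zip(hidden, vector)]


# Hypothetical 2-dimensional function vectors for two relations.
antonym_fv = [1.0, 0.0]
category_fv = [0.0, 2.0]
combined = composite_vector([antonym_fv, category_fv], [0.5, 0.5])
patched = patch_activation([1.0, 1.0], combined)
```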

[25] Prompt-Based Clarity Evaluation and Topic Detection in Political Question Answering

Lavanya Prahallad,Sai Utkarsh Choudarypally,Pragna Prahallad,Pranathi Prahallad

Main category: cs.CL

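TL;DR: This study examines how different prompting strategies affect automatic clarity evaluation by LLMs in political question answering. Comparing GPT-3.5 and GPT-5.2 on the CLARITY dataset, it finds that chain-of-thought with few-shot prompting raises clarity prediction accuracy from 56% to 63%, while fine-grained evasion and topic detection remain challenging.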

Details Motivation: Prior work has not examined in depth how prompt design affects automatic evaluation of the clarity of LLM responses, especially in a highly sensitive domain such as political question answering, where a better understanding of prompt engineering is needed. Method: Uses the CLARITY dataset to compare a GPT-3.5 baseline against GPT-5.2 under three strategies (simple prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples), measuring agreement with human annotations by accuracy, class-wise metrics, and hierarchical exact match. Result: GPT-5.2 outperforms GPT-3.5 on clarity prediction, with a best accuracy of 63% (+7 points); chain-of-thought prompting achieves the highest evasion accuracy at 34% but is less stable; topic identification accuracy rises from 60% to 74%. Conclusion: Prompt design improves high-level clarity evaluation, but fine-grained evasion classification and topic detection remain challenging; structured reasoning prompts help but do not fully resolve complex semantic judgments. Abstract: Automatic evaluation of large language model (LLM) responses requires not only factual correctness but also clarity, particularly in political question-answering. While recent datasets provide human annotations for clarity and evasion, the impact of prompt design on automatic clarity evaluation remains underexplored. In this paper, we study prompt-based clarity evaluation using the CLARITY dataset from the SemEval 2026 shared task. We compare a GPT-3.5 baseline provided with the dataset against GPT-5.2 evaluated under three prompting strategies: simple prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples. Model predictions are evaluated against human annotations using accuracy and class-wise metrics for clarity and evasion, along with hierarchical exact match. Results show that GPT-5.2 consistently outperforms the GPT-3.5 baseline on clarity prediction, with accuracy improving from 56 percent to 63 percent under chain-of-thought with few-shot prompting. Chain-of-thought prompting yields the highest evasion accuracy at 34 percent, though improvements are less stable across fine-grained evasion categories. We further evaluate topic identification and find that reasoning-based prompting improves accuracy from 60 percent to 74 percent relative to human annotations. Overall, our findings indicate that prompt design reliably improves high-level clarity evaluation, while fine-grained evasion and topic detection remain challenging despite structured reasoning prompts.

[26] Evaluating Implicit Regulatory Compliance in LLM Tool Invocation via Logic-Guided Synthesis

Da Song,Yuheng Huang,Boqi Chen,Tianshuo Cong,Randy Goebel,Lei Ma,Foutse Khomh

Main category: cs.CL

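TL;DR: Proposes the LogiSafetyGen framework and the LogiSafetyBench benchmark for evaluating how well large language models adhere to implicit regulatory compliance in high-stakes domains.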

Details Motivation: Existing benchmarks overlook implicit regulatory compliance and therefore cannot assess whether large language models autonomously enforce mandatory safety constraints. Method: Converts unstructured regulations into Linear Temporal Logic oracles and uses logic-guided fuzzing to synthesize valid, safety-critical traces, then builds a benchmark of 240 tasks. Result: An evaluation of 13 state-of-the-art LLMs finds that larger models, despite better functional correctness, often prioritize task completion over safety, leading to non-compliant behavior. Conclusion: Dedicated frameworks are needed to evaluate and improve LLM adherence to implicit regulations during complex tool use, especially in high-stakes domains. Abstract: The integration of large language models (LLMs) into autonomous agents has enabled complex tool use, yet in high-stakes domains, these systems must strictly adhere to regulatory standards beyond simple functional correctness. However, existing benchmarks often overlook implicit regulatory compliance, thus failing to evaluate whether LLMs can autonomously enforce mandatory safety constraints. To fill this gap, we introduce LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic oracles and employs logic-guided fuzzing to synthesize valid, safety-critical traces. Building on this framework, we construct LogiSafetyBench, a benchmark comprising 240 human-verified tasks that require LLMs to generate Python programs that satisfy both functional objectives and latent compliance rules. Evaluations of 13 state-of-the-art (SOTA) LLMs reveal that larger models, despite achieving better functional correctness, frequently prioritize task completion over safety, which results in non-compliant behavior.
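To make the idea of a Linear Temporal Logic oracle concrete, the sketch below checks one common LTL pattern, precedence, over a finite trace of tool calls: the guarded action must not occur before the required action has occurred, which corresponds to $\neg\,\mathit{action}\ \mathsf{W}\ \mathit{required}$. The rule, the tool names, and the function are hypothetical examples, not oracles from LogiSafetyGen.

```python
def first_precedence_violation(trace, required, action):
    """Check the precedence pattern on a finite trace of tool-call names.

    Returns the index of the first `action` call that happens before any
    `required` call, or None when the trace complies.
    """
    seen_required = False
    for i, call in enumerate(trace):
        if call == required:
            seen_required = True
        elif call == action and not seen_required:
            return i
    return None


# Hypothetical compliance rule: a transfer must be preceded by identity verification.
compliant = ["lookup_account", "verify_identity", "transfer_funds"]
non_compliant = ["lookup_account", "transfer_funds", "verify_identity"]
```

Both traces reach the same functional goal (the transfer happens), yet only the first satisfies the rule. This is the gap between functional correctness and compliance that the benchmark is designed to expose.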

[27] Triplets Better Than Pairs: Towards Stable and Effective Self-Play Fine-Tuning for LLMs

Yibo Wang,Hai-Long Sun,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Lijun Zhang

Main category: cs.CL

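TL;DR: Proposes T-SPIN, a new triplet-based self-play fine-tuning method that introduces historical advantages and an entropy constraint to address SPIN's unstable optimization and training-generation mismatch when annotated data is scarce, outperforming SPIN in both performance and stability across a variety of tasks.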

Details Motivation: The existing self-play fine-tuning method SPIN can become unstable because the current reward advantage may vanish over iterations, and its training objective is misaligned with the generation metric. Method: Proposes T-SPIN with two key designs: first, it uses historical advantages (between responses generated by the current policy and those generated by the initial policy) to stabilize optimization; second, it introduces an entropy constraint into the self-play framework, enabling reference-free fine-tuning and removing the training-generation discrepancy. Result: Experiments on multiple tasks show that T-SPIN not only outperforms SPIN but also evolves more stably across iterations; compared with supervised fine-tuning, it reaches comparable or better results with only 25% of the samples. Conclusion: Through historical advantages and the entropy constraint, T-SPIN improves the stability and performance of self-play fine-tuning and is especially suited to settings with scarce annotated data. Abstract: Recently, self-play fine-tuning (SPIN) has been proposed to adapt large language models to downstream applications with scarce expert-annotated data, by iteratively generating synthetic responses from the model itself. However, SPIN is designed to optimize the current reward advantages of annotated responses over synthetic responses at hand, which may gradually vanish during iterations, leading to unstable optimization. Moreover, the utilization of reference policy induces a misalignment issue between the reward formulation for training and the metric for generation. To address these limitations, we propose a novel Triplet-based Self-Play fIne-tuNing (T-SPIN) method that integrates two key designs. First, beyond current advantages, T-SPIN additionally incorporates historical advantages between iteratively generated responses and proto-synthetic responses produced by the initial policy. Even if the current advantages diminish, historical advantages remain effective, stabilizing the overall optimization. Second, T-SPIN introduces the entropy constraint into the self-play framework, which is theoretically justified to support reference-free fine-tuning, eliminating the training-generation discrepancy. Empirical results on various tasks demonstrate not only the superior performance of T-SPIN over SPIN, but also its stable evolution during iterations. Remarkably, compared to supervised fine-tuning, T-SPIN achieves comparable or even better performance with only 25% of the samples, highlighting its effectiveness when faced with scarce annotated data.

[28] Generation-Augmented Generation: A Plug-and-Play Framework for Private Knowledge Injection in Large Language Models

Rongji Li,Jian Xu,Xueqing Chen,Yisheng Yang,Jiayi Wang,Xingyu Chen,Chunyu Xie,Dawei Leng,Xu-Yao Zhang

Main category: cs.CL

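TL;DR: This paper proposes Generation-Augmented Generation (GAG), a new method for injecting private domain knowledge efficiently while leaving the base large language model unchanged. It addresses the limitations of fine-tuning and retrieval-augmented generation (RAG), clearly outperforms RAG on multiple scientific QA tasks, and supports scalable multi-domain composition with selective activation.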

Details Motivation: Among existing knowledge injection methods, fine-tuning is costly and prone to forgetting, and RAG is brittle on specialized private corpora, so a more stable, efficient, and scalable way to integrate private domain knowledge is needed. Method: Inspired by multimodal models, treats private expertise as an additional expert modality and aligns it to the frozen base model through a compact, representation-level interface, fusing knowledge at generation time without serializing evidence into the prompt. Result: On two private scientific QA benchmarks (immunology adjuvant and catalytic materials), GAG improves over strong RAG baselines by 15.34% and 14.86%, respectively, while maintaining performance on six open general benchmarks and achieving near-oracle selective activation. Conclusion: GAG offers an efficient, plug-and-play, and scalable paradigm for knowledge injection that enables specialization without degrading general capabilities and supports multi-domain deployment. Abstract: In domains such as biomedicine, materials, and finance, high-stakes deployment of large language models (LLMs) requires injecting private, domain-specific knowledge that is proprietary, fast-evolving, and under-represented in public pretraining. However, the two dominant paradigms for private knowledge injection each have pronounced drawbacks: fine-tuning is expensive to iterate, and continual updates risk catastrophic forgetting and general-capability regression; retrieval-augmented generation (RAG) keeps the base model intact but is brittle in specialized private corpora due to chunk-induced evidence fragmentation, retrieval drift, and long-context pressure that yields query-dependent prompt inflation. Inspired by how multimodal LLMs align heterogeneous modalities into a shared semantic space, we propose Generation-Augmented Generation (GAG), which treats private expertise as an additional expert modality and injects it via a compact, representation-level interface aligned to the frozen base model, avoiding prompt-time evidence serialization while enabling plug-and-play specialization and scalable multi-domain composition with reliable selective activation. Across two private scientific QA benchmarks (immunology adjuvant and catalytic materials) and mixed-domain evaluations, GAG improves specialist performance over strong RAG baselines by 15.34% and 14.86% on the two benchmarks, respectively, while maintaining performance on six open general benchmarks and enabling near-oracle selective activation for scalable multi-domain deployment.

[29] Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints

Seng Pei Liew,Kenta Shinzato,Yuyang Dong

Main category: cs.CL

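TL;DR: MoE model performance is determined mainly by total parameter count and expert sparsity; the paper proposes a design principle of maximizing total parameters while minimizing sparsity under the given constraints.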

Details Motivation: Total parameters and active parameters alone are found to be insufficient to describe an optimal MoE architecture, so the key factors that drive performance need further study. Method: A systematic study of how total parameter count, number of experts, and top-k gating affect model performance. Result: Performance is determined mainly by total parameters and expert sparsity (n_exp / n_topk); increasing the number of experts slightly reduces performance because it forces smaller core model dimensions. Conclusion: An optimal MoE design should maximize total parameters while minimizing sparsity (i.e., maximizing n_topk) and the number of experts n_exp. Abstract: Modern Mixture-of-Experts (MoE) language models are designed based on total parameters (memory footprint) and active parameters (inference cost). However, we find these two factors alone are insufficient to describe an optimal architecture. Through a systematic study, we demonstrate that MoE performance is primarily determined by total parameters ($N_{total}$) and expert sparsity ($s:=n_{exp}/n_{topk}$). Moreover, $n_{exp}$ and $n_{topk}$ do not "cancel out" within the sparsity ratio; instead, a larger total number of experts slightly penalizes performance by forcing a reduction in core model dimensions (depth and width) to meet memory constraints. This motivates a simple principle for MoE design which maximizes $N_{total}$ while minimizing $s$ (maximizing $n_{topk}$) and $n_{exp}$ under the given constraints. Our findings provide a robust framework for resolving architectural ambiguity and guiding MoE design.
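The sparsity ratio $s = n_{exp}/n_{topk}$ and the stated design principle can be illustrated with a small ranking sketch. The candidate configurations and their parameter counts are made up, and the lexicographic ordering in `design_key` (total parameters first, then sparsity, then expert count) is only one possible reading of the principle, not a rule given in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    name: str
    n_total: float  # total parameters (memory footprint)
    n_exp: int      # number of experts
    n_topk: int     # experts activated per token


def expert_sparsity(cfg):
    """s := n_exp / n_topk, as defined in the abstract."""
    if not 0 < cfg.n_topk <= cfg.n_exp:
        raise ValueError("require 0 < n_topk <= n_exp")
    return cfg.n_exp / cfg.n_topk


def design_key(cfg):
    """Illustrative ordering: larger N_total first, then smaller s, then fewer experts."""
    return (-cfg.n_total, expert_sparsity(cfg), cfg.n_exp)


# Hypothetical candidates that all fit the same memory budget.
candidates = [
    MoEConfig("A", 8e9, 64, 8),  # s = 8
    MoEConfig("B", 8e9, 32, 4),  # s = 8, fewer experts than A
    MoEConfig("C", 8e9, 64, 4),  # s = 16
]
ranked = sorted(candidates, key=design_key)
```

Configurations A and B share the same sparsity, which is the case the abstract highlights: the ratio alone does not separate them, and the one with fewer experts is preferred.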

[30] User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale

Jungho Cho,Minbyul Jeong,Sungrae Park

Main category: cs.CL

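TL;DR: This paper proposes a user-oriented multi-turn dialogue generation framework that decouples task generation from a user simulator to produce more realistic, highly interactive tool-use data for large-scale, complex human-agent collaboration.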

Details Motivation: Existing datasets and generation methods are limited by static, predefined toolsets, which scale poorly to open-ended human-agent collaboration and yield dialogues with few turns and little interaction. Method: Builds a simulator based on large reasoning models and adopts a user-oriented paradigm, introducing a user simulator that mimics human behavior (such as incremental requests and turn-by-turn feedback) to dynamically generate domain-specific tools and high-density, multi-task, multi-turn dialogues. Result: The generated dialogues have more turns and more interaction, reflect the iterative nature of real-world problem solving, and support completing multiple tasks within a single trajectory, producing high-density tool-use data. Conclusion: The framework is a scalable, plug-and-play module that generates high-quality, long-horizon, multi-task tool-use dialogue data, advancing the use of large models in complex human-agent collaboration. Abstract: The recent paradigm shift toward large reasoning models (LRMs) as autonomous agents has intensified the demand for sophisticated, multi-turn tool-use capabilities. Yet, existing datasets and data-generation approaches are limited by static, predefined toolsets that cannot scale to the complexity of open-ended human-agent collaboration. To address this, we initially developed a framework for automated task-oriented multi-turn dialogue generation at scale, utilizing an LRM-based simulator to dynamically generate high-value, domain-specific tools to solve specified tasks. However, we observe that a purely task-oriented design often results in "solely task-solving" trajectories, where the agent completes the objective with minimal interaction, failing to generate the high turn-count conversations seen in realistic scenarios. To bridge this gap, we shift toward a user-oriented simulation paradigm. By decoupling task generation from a dedicated user simulator that mimics human behavioral rules (such as incremental request-making and turn-by-turn feedback), we facilitate more authentic, extended multi-turn dialogues that reflect the iterative nature of real-world problem solving. Our generation pipeline operates as a versatile, plug-and-play module capable of initiating generation from any state, ensuring high scalability in producing extended tool-use data. Furthermore, by facilitating multiple task completions within a single trajectory, it yields a high-density dataset that reflects the multifaceted demands of real-world human-agent interaction.

[31] Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning

Fan Gao,Sherry T. Tong,Jiwoong Sohn,Jiahao Huang,Junfeng Jiang,Ding Xia,Piyalitt Ittichaiwong,Kanyakorn Veerakanjana,Hyunjae Kim,Qingyu Chen,Edison Marrese Taylor,Kazuma Kobayashi,Akkiko Aizawa,Irene Li

Main category: cs.CL

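TL;DR: Med-CoReasoner is a language-informed co-reasoning framework that uses concept-level alignment and retrieval to combine English logical structure with local clinical knowledge, improving multilingual medical reasoning, with especially large gains in low-resource languages.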

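Details Motivation: Reasoning-enhanced large language models perform well on English medical tasks but reason substantially worse in local languages, creating a multilingual gap in global medical applications. Achieving more equitable global medical deployment requires stronger medical reasoning in multilingual settings. Method: Proposes the Med-CoReasoner framework, which generates English and local-language reasoning paths in parallel, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold through concept-level alignment and retrieval. The authors also build the MultiMed-X benchmark, covering long-form question answering and natural language inference tasks in seven languages, to evaluate multilingual medical reasoning. Result: Experiments on three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by 5% on average, with particularly large gains in low-resource languages. Model distillation and expert evaluation further confirm that its reasoning traces are clinically sound and culturally grounded. Conclusion: Med-CoReasoner narrows the language gap in multilingual medical reasoning by combining the structural strengths of English with the practical knowledge encoded in local languages, supporting more equitable, accurate, and culturally sensitive global medical AI deployment. Abstract: While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.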

[32] Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees

Kun Li,Zenan Xu,Junan Li,Zengrui Jin,Jinghao Deng,Zexuan Qiu,Bo Zhou

Main category: cs.CL

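TL;DR: This paper proposes DART, a reinforcement learning framework that enables spontaneous tool use during long Chain-of-Thought (long CoT) reasoning without human annotation.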

Details Motivation: Integrating tool use into long-chain reasoning in large language models is underexplored, mainly because training data is scarce and it is hard to integrate tool use without compromising the model's own reasoning ability. Method: Proposes the DART framework, which builds dynamic rollout trees to discover valid tool-use opportunities and branches at promising positions to explore diverse tool-integrated trajectories; a tree-based process advantage estimation identifies and rewards sub-trajectories that contribute positively to the solution, reinforcing beneficial behavior. Result: On challenging benchmarks such as AIME and GPQA-Diamond, DART significantly outperforms existing methods and successfully combines tool execution with long CoT reasoning. Conclusion: DART addresses the difficulty of spontaneously integrating tool use into long-chain reasoning and offers a new approach to strengthening the complex reasoning abilities of large language models. Abstract: Tool-Integrated Reasoning has emerged as a key paradigm to augment Large Language Models (LLMs) with computational capabilities, yet integrating tool-use into long Chain-of-Thought (long CoT) remains underexplored, largely due to the scarcity of training data and the challenge of integrating tool-use without compromising the model's intrinsic long-chain reasoning. In this paper, we introduce DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that enables spontaneous tool-use during long CoT reasoning without human annotation. DART operates by constructing dynamic rollout trees during training to discover valid tool-use opportunities, branching out at promising positions to explore diverse tool-integrated trajectories. Subsequently, a tree-based process advantage estimation identifies and credits specific sub-trajectories where tool invocation positively contributes to the solution, effectively reinforcing these beneficial behaviors. Extensive experiments on challenging benchmarks like AIME and GPQA-Diamond demonstrate that DART significantly outperforms existing methods, successfully harmonizing tool execution with long CoT reasoning.
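The abstract does not give the formula for tree-based process advantage estimation. As a rough illustration of how a rollout tree can assign credit to a branch, the sketch below scores each node by the mean value of its children (leaves carry the final reward) and defines a branch's advantage as its value minus its parent's value. The `Node` class, the averaging rule, and this advantage definition are assumptions, not DART's estimator.

```python
class Node:
    """A rollout-tree node; leaves hold the final reward of a finished trajectory."""

    def __init__(self, reward=None, children=None):
        self.reward = reward
        self.children = children or []


def value(node):
    """Leaf: its reward. Internal node: mean value of its children (assumed rule)."""
    if not node.children:
        return node.reward
    return sum(value(child) for child in node.children) / len(node.children)


def branch_advantages(node):
    """Advantage of each child branch relative to the parent's value."""
    parent_value = value(node)
    return [value(child) - parent_value for child in node.children]


# Hypothetical branching point: the first branch calls a tool, the second does not.
tool_branch = Node(children=[Node(reward=1.0), Node(reward=1.0)])
no_tool_branch = Node(children=[Node(reward=1.0), Node(reward=0.0)])
root = Node(children=[tool_branch, no_tool_branch])
advantages = branch_advantages(root)
```

Here the tool-using branch receives a positive advantage because its rollouts succeed more often than the average at the branching point.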

[33] D$^2$Plan: Dual-Agent Dynamic Global Planning for Complex Retrieval-Augmented Reasoning

Kangcheng Luo,Tinglang Wu,Yansong Feng

Main category: cs.CL

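TL;DR: Proposes D²Plan, a dual-agent dynamic global planning paradigm in which a Reasoner and a Purifier collaborate; combined with a two-stage training framework, it improves multi-hop reasoning and resistance to distractors in retrieval-augmented reasoning.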

Details Motivation: Existing RL-trained search-augmented LLMs suffer in multi-hop reasoning from context overload, which leads to ineffective search chain construction and to reasoning being hijacked by irrelevant information. Method: Designs D²Plan, a dual-agent collaborative framework: the Reasoner builds and dynamically adjusts a global reasoning plan based on retrieval feedback, and the Purifier assesses the relevance of retrieved content and condenses the key information; training uses two stages, supervised fine-tuning followed by reinforcement learning with plan-oriented rewards. Result: Experiments show that D²Plan performs strongly on several challenging QA benchmarks, with more coherent reasoning and greater robustness to irrelevant information. Conclusion: D²Plan addresses key failure modes in retrieval-augmented reasoning and substantially improves performance on complex reasoning tasks. Abstract: Recent search-augmented LLMs trained with reinforcement learning (RL) can interleave searching and reasoning for multi-hop reasoning tasks. However, they face two critical failure modes as the accumulating context becomes flooded with both crucial evidence and irrelevant information: (1) ineffective search chain construction that produces incorrect queries or omits retrieval of critical information, and (2) reasoning hijacking by peripheral evidence that causes models to misidentify distractors as valid evidence. To address these challenges, we propose **D$^2$Plan**, a **D**ual-agent **D**ynamic global **Plan**ning paradigm for complex retrieval-augmented reasoning. **D$^2$Plan** operates through the collaboration of a *Reasoner* and a *Purifier*: the *Reasoner* constructs explicit global plans during reasoning and dynamically adapts them based on retrieval feedback; the *Purifier* assesses retrieval relevance and condenses key information for the *Reasoner*. We further introduce a two-stage training framework consisting of supervised fine-tuning (SFT) cold-start on synthesized trajectories and RL with plan-oriented rewards to teach LLMs to master the **D$^2$Plan** paradigm. Extensive experiments demonstrate that **D$^2$Plan** enables more coherent multi-step reasoning and stronger resilience to irrelevant information, thereby achieving superior performance on challenging QA benchmarks.

[34] Enhancing Sentiment Classification and Irony Detection in Large Language Models through Advanced Prompt Engineering Techniques

Marvin Schmitt,Anne Schwerk,Sebastian Lempert

Main category: cs.CL

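TL;DR: This study examines how prompt engineering (few-shot learning, chain-of-thought prompting, and self-consistency) improves GPT-4o-mini and gemini-1.5-flash on sentiment analysis tasks. Advanced prompting significantly improves performance, but strategies must be tailored to the model and the task.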

Details Motivation: To explore how prompt engineering can improve large language models on complex sentiment analysis tasks such as sentiment classification, aspect-based sentiment analysis, and irony detection, overcoming the limitations of traditional methods. Method: Applies advanced prompting techniques (few-shot learning, chain-of-thought prompting, and self-consistency) and evaluates GPT-4o-mini and gemini-1.5-flash on standard datasets using accuracy, recall, precision, and F1 score, compared against a baseline. Result: Advanced prompting significantly improves sentiment analysis performance: few-shot learning works best for GPT-4o-mini, and chain-of-thought prompting improves irony detection in gemini-1.5-flash by up to 46%. Conclusion: Prompting strategies should be tailored to the specific model architecture and the semantic complexity of the task; the effectiveness of prompt engineering depends on the match between model and task. Abstract: This study investigates the use of prompt engineering to enhance large language models (LLMs), specifically GPT-4o-mini and gemini-1.5-flash, in sentiment analysis tasks. It evaluates advanced prompting techniques like few-shot learning, chain-of-thought prompting, and self-consistency against a baseline. Key tasks include sentiment classification, aspect-based sentiment analysis, and detecting subtle nuances such as irony. The research details the theoretical background, datasets, and methods used, assessing performance of LLMs as measured by accuracy, recall, precision, and F1 score. Findings reveal that advanced prompting significantly improves sentiment analysis, with the few-shot approach excelling in GPT-4o-mini and chain-of-thought prompting boosting irony detection in gemini-1.5-flash by up to 46%. Thus, while advanced prompting techniques overall improve performance, the fact that few-shot prompting works best for GPT-4o-mini and chain-of-thought excels in gemini-1.5-flash for irony detection suggests that prompting strategies must be tailored to both the model and the task. This highlights the importance of aligning prompt design with both the LLM's architecture and the semantic complexity of the task.
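Of the techniques listed above, self-consistency is the easiest to show in a few lines: sample several reasoning paths for the same input and take a majority vote over their final labels. The sketch below covers only the voting step. The sampled labels are made up, and how the study sampled or parsed model outputs is not stated in the abstract.

```python
from collections import Counter


def self_consistent_label(sampled_labels):
    """Majority vote over labels from independently sampled reasoning paths.

    Returns (label, agreement), where agreement is the winning vote share.
    On a tie, the label seen first among the tied ones wins.
    """
    if not sampled_labels:
        raise ValueError("need at least one sampled label")
    label, votes = Counter(sampled_labels).most_common(1)[0]
    return label, votes / len(sampled_labels)


# Hypothetical labels parsed from five sampled chain-of-thought outputs.
samples = ["ironic", "not_ironic", "ironic", "ironic", "not_ironic"]
label, agreement = self_consistent_label(samples)
```

The agreement share can double as a rough confidence signal, for example to flag low-agreement inputs for manual review.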

[35] AgriAgent: Contract-Driven Planning and Capability-Aware Tool Orchestration in Real-World Agriculture

Bo Yang,Yu Zhang,Yunkui Chen,Lanfei Feng,Xiao Xu,Nueraili Aierken,Shijian Li

Main category: cs.CL

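TL;DR: Proposes AgriAgent, a two-level agent framework for agricultural scenarios with multimodal inputs, where task complexity varies widely and tool availability is incomplete; its hierarchical execution strategy improves the success rate and robustness of complex tasks.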

Details Motivation: Existing approaches use a unified execution paradigm, which struggles with the large variation in task complexity and the incomplete tool availability found in agricultural environments. Method: Designs AgriAgent, a two-level agent framework: simple tasks are handled by direct reasoning in modality-specific agents, while complex tasks go through a contract-driven planning mechanism that formulates tasks as capability requirements and performs capability-aware tool orchestration and dynamic tool generation, supporting multi-step, verifiable execution with failure recovery. Result: Experiments show that AgriAgent achieves higher execution success rates and robustness on complex tasks than existing tool-centric agent baselines that rely on a unified execution paradigm. Conclusion: Through its hierarchical execution strategy, AgriAgent improves the ability of intelligent agents to handle diverse and complex tasks in real agricultural scenarios, offering greater reliability and flexibility for practical use. Abstract: Intelligent agent systems in real-world agricultural scenarios must handle diverse tasks under multimodal inputs, ranging from lightweight information understanding to complex multi-step execution. However, most existing approaches rely on a unified execution paradigm, which struggles to accommodate large variations in task complexity and incomplete tool availability commonly observed in agricultural environments. To address this challenge, we propose AgriAgent, a two-level agent framework for real-world agriculture. AgriAgent adopts a hierarchical execution strategy based on task complexity: simple tasks are handled through direct reasoning by modality-specific agents, while complex tasks trigger a contract-driven planning mechanism that formulates tasks as capability requirements and performs capability-aware tool orchestration and dynamic tool generation, enabling multi-step and verifiable execution with failure recovery. Experimental results show that AgriAgent achieves higher execution success rates and robustness on complex tasks compared to existing tool-centric agent baselines that rely on unified execution paradigms. All code and data will be released after our work is accepted, to promote reproducible research.

[36] CLaS-Bench: A Cross-Lingual Alignment and Steering Benchmark

Daniil Gurgurov,Yusser Al Ghussin,Tanja Baeumel,Cheng-Ting Chou,Patrick Schramowski,Marius Mosbach,Josef van Genabith,Simon Ostermann

Main category: cs.CL

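TL;DR: Introduces CLaS-Bench, a lightweight parallel-question benchmark for evaluating multilingual steering of large language models across 32 languages, and systematically compares steering methods, finding that the residual-stream DiffMean method performs best.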

Details Motivation: There are no dedicated benchmarks or evaluation protocols for quantifying the effectiveness of steering techniques in multilingual settings, so a standardized evaluation tool is needed to advance the field. Method: Proposes the CLaS-Bench benchmark of parallel questions in 32 languages, scored along two axes (language control and semantic relevance) that are combined into an overall steering score via the harmonic mean; evaluates steering methods including DiffMean, probe-derived directions, and language-specific neurons. Result: The simple residual-stream DiffMean method outperforms the other methods across all languages; language-specific structure emerges mainly in later layers, and steering directions cluster by language family. Conclusion: CLaS-Bench is the first standardized benchmark for multilingual steering, supporting both scientific analysis of language representations and practical evaluation of steering as a low-cost adaptation method. Abstract: Understanding and controlling the behavior of large language models (LLMs) is an increasingly important topic in multilingual NLP. Beyond prompting or fine-tuning, steering, i.e., manipulating internal representations during inference, has emerged as a more efficient and interpretable technique for adapting models to a target language. Yet, no dedicated benchmarks or evaluation protocols exist to quantify the effectiveness of steering techniques. We introduce CLaS-Bench, a lightweight parallel-question benchmark for evaluating language-forcing behavior in LLMs across 32 languages, enabling systematic evaluation of multilingual steering methods. We evaluate a broad array of steering techniques, including residual-stream DiffMean interventions, probe-derived directions, language-specific neurons, PCA/LDA vectors, Sparse Autoencoders, and prompting baselines. Steering performance is measured along two axes: language control and semantic relevance, combined into a single harmonic-mean steering score. We find that, across languages, the simple residual-based DiffMean method consistently outperforms all other methods. Moreover, a layer-wise analysis reveals that language-specific structure emerges predominantly in later layers and steering directions cluster based on language family. CLaS-Bench is the first standardized benchmark for multilingual steering, enabling both rigorous scientific analysis of language representations and practical evaluation of steering as a low-cost adaptation alternative.
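The abstract says the two axes are combined into a single harmonic-mean steering score. A minimal sketch of that combination follows, assuming both axes are already normalized to [0, 1]; the function name and the normalization are assumptions, not details taken from the paper.

```python
def steering_score(language_control: float, semantic_relevance: float) -> float:
    """Harmonic mean of two scores in [0, 1]; low on either axis pulls the result down."""
    for value in (language_control, semantic_relevance):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores are assumed to lie in [0, 1]")
    total = language_control + semantic_relevance
    if total == 0.0:
        return 0.0  # both axes are zero; avoid dividing by zero
    return 2.0 * language_control * semantic_relevance / total


if __name__ == "__main__":
    # Forcing the target language but losing the meaning scores poorly.
    print(steering_score(0.9, 0.1))
    print(steering_score(0.7, 0.7))
```

The harmonic mean is a natural choice here because a method cannot score well by maximizing one axis alone, for example by always answering in the target language with irrelevant text.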

[37] Detecting Mental Manipulation in Speech via Synthetic Multi-Speaker Dialogue

Run Chen,Wen Liang,Ziwei Gong,Lin Ai,Julia Hirschberg

Main category: cs.CL

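TL;DR: The first study of detecting mental manipulation in spoken dialogue. It introduces SPEECHMENTALMANIP, a new multi-speaker benchmark augmented with text-to-speech audio. Both models and humans identify manipulation less accurately in speech than in text, underscoring the importance of modality-aware evaluation.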

Details Motivation: Existing work on mental manipulation is limited to textual conversations and ignores acoustic and prosodic features of speech, while real applications are largely multimodal, so the effect of the speech modality on manipulation detection needs to be explored. Method: Builds SPEECHMENTALMANIP, a synthetic multi-speaker dataset with high-quality, voice-consistent text-to-speech audio, and uses few-shot large audio-language models and human annotation to compare manipulation detection in text and in speech. Result: On speech, models show high specificity but markedly lower recall than on text; human raters face similar uncertainty in the audio condition, indicating that manipulative speech is inherently ambiguous. Conclusion: Detecting mental manipulation is harder in speech, and current models may fail to capture key acoustic or prosodic cues; future work should develop modality-aware safety alignment for multimodal dialogue systems. Abstract: Mental manipulation, the strategic use of language to covertly influence or exploit others, is a newly emerging task in computational social reasoning. Prior work has focused exclusively on textual conversations, overlooking how manipulative tactics manifest in speech. We present the first study of mental manipulation detection in spoken dialogues, introducing a synthetic multi-speaker benchmark SPEECHMENTALMANIP that augments a text-based dataset with high-quality, voice-consistent Text-to-Speech rendered audio. Using few-shot large audio-language models and human annotation, we evaluate how modality affects detection accuracy and perception. Our results reveal that models exhibit high specificity but markedly lower recall on speech compared to text, suggesting sensitivity to missing acoustic or prosodic cues in training. Human raters show similar uncertainty in the audio setting, underscoring the inherent ambiguity of manipulative speech. Together, these findings highlight the need for modality-aware evaluation and safety alignment in multimodal dialogue systems.

[38] PATS: Personality-Aware Teaching Strategies with Large Language Model Tutors

Donya Rooein,Sankalan Pal Chowdhury,Mariia Eremeeva,Yuan Qin,Debora Nozza,Mrinmaya Sachan,Dirk Hovy

Main category: cs.CL

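TL;DR: Proposes an LLM tutoring framework that adapts teaching strategies to student personality by mapping pedagogical methods to personality types and personalizing instruction in simulated conversations; experiments show it outperforms baselines and is preferred by teachers.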

Details Motivation: Current LLM tutoring systems ignore student personality traits, even though different personalities respond differently to teaching strategies, and ignoring this can hurt learning outcomes. Method: Builds a taxonomy linking personality to teaching strategies from the pedagogical literature, and lets the LLM adjust its teaching approach to the student's personality in simulated student-teacher conversations. Result: Human teachers prefer the approach; compared with the baselines, the system makes more use of high-impact strategies such as role-playing, which human and LLM annotators prefer significantly. Conclusion: Accounting for student personality improves the personalization and effectiveness of LLMs in educational applications and points to a new direction for intelligent tutoring systems. Abstract: Recent advances in large language models (LLMs) demonstrate their potential as educational tutors. However, different tutoring strategies benefit different student personalities, and mismatches can be counterproductive to student outcomes. Despite this, current LLM tutoring systems do not take into account student personality traits. To address this problem, we first construct a taxonomy that links pedagogical methods to personality profiles, based on pedagogical literature. We simulate student-teacher conversations and use our framework to let the LLM tutor adjust its strategy to the simulated student personality. We evaluate the scenario with human teachers and find that they consistently prefer our approach over two baselines. Our method also increases the use of less common, high-impact strategies such as role-playing, which human and LLM annotators prefer significantly. Our findings pave the way for developing more personalized and effective LLM use in educational applications.

[39] Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering

Nonghai Zhang,Weitao Ma,Zhanyu Ma,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He,Jingwen Xu

Main category: cs.CL

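TL;DR: Proposes Latent-GRPO, a framework that derives intrinsic rewards from latent-space geometry and uses the Iterative Robust Centroid Estimation (IRCE) algorithm to produce dense, continuous rewards, substantially speeding up training while maintaining performance.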

Details Motivation: Existing GRPO methods depend on expensive external verifiers or hand-written rules, leading to high computational cost, training latency, and sparse rewards that hurt optimization efficiency. Method: Observes that terminal-token representations of correct reasoning trajectories form dense clusters in latent space while incorrect trajectories are scattered; based on this, proposes the IRCE algorithm, which mitigates magnitude fluctuations via spherical projection and iteratively estimates a robust "truth centroid" to generate dense, continuous rewards. Result: Experiments on multiple datasets show a training speedup of over 2x compared with baselines while maintaining model performance, with strong generalization and robustness. Conclusion: By exploiting the geometry of the latent space, Latent-GRPO enables efficient reinforcement learning with less reliance on external rewards, offering a more efficient and scalable approach to training LLM reasoning. Abstract: Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). However, this success heavily relies on expensive external verifiers or human rules. Such dependency not only leads to significant computational costs and training latency, but also yields sparse rewards that hinder optimization efficiency. To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry. Crucially, our empirical analysis reveals a compelling geometric property: terminal token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas incorrect trajectories remain scattered as outliers. In light of this discovery, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which generates dense, continuous rewards by mitigating magnitude fluctuations via spherical projection and estimating a robust "truth centroid" through iterative aggregation. Experimental results on multiple datasets show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines. Furthermore, extensive results demonstrate strong generalization ability and robustness. The code will be released soon.
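The abstract describes IRCE only at a high level: spherical projection followed by iterative aggregation into a robust centroid. The sketch below is one plausible reading, not the authors' algorithm. The reweighting rule (clipped cosine similarity), the iteration count, and all names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def unit(vector: Sequence[float]) -> List[float]:
    """Spherical projection: rescale to unit length so magnitude no longer matters."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else [0.0 for _ in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def centroid_rewards(representations: Sequence[Sequence[float]], n_iters: int = 5) -> List[float]:
    """Return one dense reward per trajectory: cosine similarity to a robust centroid."""
    points = [unit(v) for v in representations]
    dim = len(points[0])
    weights = [1.0] * len(points)  # start from the plain mean
    centroid = [0.0] * dim
    for _ in range(n_iters):
        total = sum(weights)
        if total == 0.0:
            break
        mean = [sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim)]
        centroid = unit(mean)
        # Assumed reweighting: points far from the centroid lose influence.
        weights = [max(0.0, dot(centroid, p)) for p in points]
    return [dot(centroid, p) for p in points]


if __name__ == "__main__":
    # Three clustered vectors with different magnitudes, plus one outlier.
    reps = [[2.0, 0.1], [0.5, 0.0], [3.0, -0.2], [-0.1, 1.0]]
    print([round(r, 3) for r in centroid_rewards(reps)])
```

On the toy input the three clustered vectors end up with rewards close to 1 despite their different norms, while the outlier is pushed toward 0, which is the qualitative behavior the abstract attributes to correct versus incorrect trajectories.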

[40] Fine-Mem: Fine-Grained Feedback Alignment for Long-Horizon Memory Management

Weitao Ma,Xiaocheng Feng,Lei Huang,Xiachong Feng,Zhanyu Ma,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He,Bing Qin

Main category: cs.CL

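TL;DR: Fine-Mem is a unified framework for fine-grained feedback alignment that introduces a chunk-level step reward and evidence-anchored reward attribution to improve memory management by LLM agents on long-horizon tasks.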

Details Motivation: Existing reinforcement-learning approaches to memory management use final task performance as the reward signal, which causes reward sparsity and poor credit assignment and gives little guidance for individual memory operations. Method: Proposes the Fine-Mem framework: (1) a Chunk-level Step Reward that provides immediate step-level supervision through auxiliary chunk-specific question answering tasks; (2) Evidence-Anchored Reward Attribution, which redistributes the global reward by anchoring credit to key memory operations according to the specific memory items used as evidence in reasoning. Result: Experiments on Memalpha and MemoryAgentBench show that Fine-Mem consistently outperforms strong baselines across sub-tasks with higher success rates, and adapts and generalizes well across model configurations and backbones. Conclusion: Fine-grained feedback alleviates reward sparsity and credit assignment problems, aligns local memory operations with the long-term utility of memory, and enables stable optimization of the memory management policy. Abstract: Effective memory management is essential for large language model agents to navigate long-horizon tasks. Recent research has explored using Reinforcement Learning to develop specialized memory manager agents. However, existing approaches rely on final task performance as the primary reward, which results in severe reward sparsity and ineffective credit assignment, providing insufficient guidance for individual memory operations. To this end, we propose Fine-Mem, a unified framework designed for fine-grained feedback alignment. First, we introduce a Chunk-level Step Reward to provide immediate step-level supervision via auxiliary chunk-specific question answering tasks. Second, we devise Evidence-Anchored Reward Attribution to redistribute global rewards by anchoring credit to key memory operations, based on the specific memory items utilized as evidence in reasoning. Together, these components enable stable policy optimization and align local memory operations with the long-term utility of memory. Experiments on Memalpha and MemoryAgentBench demonstrate that Fine-Mem consistently outperforms strong baselines, achieving superior success rates across various sub-tasks. Further analysis reveals its adaptability and strong generalization capabilities across diverse model configurations and backbones.

[41] JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Jiangshan Duo,Hanyu Li,Hailin Zhang,Yudong Wang,Sujian Li,Liang Zhao

Main category: cs.CL

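TL;DR: Proposes JudgeRLVR, a two-stage judge-then-generate reinforcement learning framework that first trains the model to judge whether solutions are valid and then optimizes generation, improving the accuracy and generation efficiency of large language models on mathematical reasoning and generalizing better.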

Details Motivation: Optimizing only for final-answer correctness tends to push models into aimless, verbose trial-and-error without structured planning, while existing remedies for verbosity such as length penalties often truncate essential reasoning steps, making it hard to balance efficiency and verification. Method: Proposes JudgeRLVR in two stages: the first stage trains the model to judge solution responses with verifiable answers; the second stage initializes from the judge and fine-tunes with standard generative reinforcement learning (Vanilla RLVR). Result: On Qwen3-30B-A3B, compared with Vanilla RLVR, JudgeRLVR gains about +3.7 points of average accuracy on in-domain math with 42% shorter generations, and about +4.5 points of average accuracy on out-of-domain benchmarks, showing better generalization. Conclusion: Discriminative capability is a prerequisite for efficient generation; with the judge-then-generate paradigm, the model internalizes a guidance signal that prunes the search space, improving reasoning efficiency as well as accuracy and generalization. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generating RLVR initialized from the judge. Compared to Vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality-efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy gain with -42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy improvement, demonstrating enhanced generalization.

[42] sui-1: Grounded and Verifiable Long-Form Summarization

Benedikt Droste,Jan Philipp Harries,Maximilian Idahl,Björn Plüster

Main category: cs.CL

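TL;DR: Presents sui-1, a 24B-parameter model that generates abstractive summaries with inline citations so users can trace each claim to its source sentence. Through task-specific training rather than scale alone, it outperforms existing open-weight baselines on multilingual data.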

Details Motivation: Large language models often generate plausible summaries that are unfaithful to the source, which is especially critical in compliance-sensitive domains such as government and law; models are needed that produce verifiable, citation-backed summaries. Method: Proposes the sui-1 model and builds a synthetic data pipeline combining chain-of-thought prompting with multi-stage verification, generating over 22,000 high-quality multilingual training examples from diverse sources such as parliamentary documents, web text, and Wikipedia. Result: In evaluation, sui-1 significantly outperforms all tested open-weight baselines, including models with 3x more parameters, showing that task-specific training is more effective than scale alone. Conclusion: Task-specific training beats reliance on model scale for citation-grounded, faithful summarization, and sui-1 offers a more reliable language model for compliance-sensitive settings. Abstract: Large language models frequently generate plausible but unfaithful summaries that users cannot verify against source text, a critical limitation in compliance-sensitive domains such as government and legal analysis. We present sui-1, a 24B parameter model that produces abstractive summaries with inline citations, enabling users to trace each claim to its source sentence. Our synthetic data pipeline combines chain-of-thought prompting with multi-stage verification, generating over 22,000 high-quality training examples across five languages from diverse sources including parliamentary documents, web text, and Wikipedia. Evaluation shows sui-1 significantly outperforms all tested open-weight baselines, including models with 3x more parameters. These results demonstrate that task-specific training substantially outperforms scale alone for citation-grounded summarization. Model weights and an interactive demo are publicly available.

[43] Do You Understand How I Feel?: Towards Verified Empathy in Therapy Chatbots

Francesco Dettori,Matteo Forasassi,Lorenzo Veronese,Livia Lestingi,Vincenzo Scotti,Matteo Giovanni Rossi

Main category: cs.CL

TL;DR: Proposes a framework that combines natural language processing with formal verification to develop empathetic therapy chatbots.

Details Motivation: Chatbots currently lack a systematic way to specify and verify empathy, which is essential in psychotherapy. Method: Uses a Transformer-based model to extract dialogue features and translates them into a Stochastic Hybrid Automaton model; empathy properties are verified through Statistical Model Checking, and strategy synthesis provides guidance for agent behavior. Result: Preliminary results show that the formal model captures therapy dialogue dynamics well and that specific strategies raise the probability of satisfying empathy requirements. Conclusion: The proposed framework offers a feasible path toward therapy chatbots with verifiable empathy. Abstract: Conversational agents are increasingly used as support tools along mental therapeutic pathways with significant societal impacts. In particular, empathy is a key non-functional requirement in therapeutic contexts, yet current chatbot development practices provide no systematic means to specify or verify it. This paper envisions a framework integrating natural language processing and formal verification to deliver empathetic therapy chatbots. A Transformer-based model extracts dialogue features, which are then translated into a Stochastic Hybrid Automaton model of dyadic therapy sessions. Empathy-related properties can then be verified through Statistical Model Checking, while strategy synthesis provides guidance for shaping agent behavior. Preliminary results show that the formal model captures therapy dynamics with good fidelity and that ad-hoc strategies improve the probability of satisfying empathy requirements.

[44] Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning

Tony Cristofano

Main category: cs.CL

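TL;DR: Proposes Surgical Refusal Ablation (SRA), a method for precisely ablating refusal behavior in language models without damaging core capabilities or linguistic style. The conventional "raw refusal vector", obtained by contrasting harmful and harmless prompts, is polysemantic and degrades model performance; SRA builds a registry of Concept Atoms and uses regularized spectral residualization to orthogonalize the refusal direction against other semantic directions, nearly eliminating refusals (0-2%) with almost no effect on perplexity (PPL change of about 0.02) or on the output distribution (KL divergence reduced from 2.088 to 0.044). Experiments cover five models and indicate that so-called "model damage" is often "Ghost Noise", the spectral bleeding of the dirty refusal direction into capability subspaces.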

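Details Motivation: When existing activation steering methods remove refusal behavior, the "refusal vector" they use is entangled with linguistic style and core capability signals, which degrades model performance. The author aims to separate the true refusal signal from other semantic components for more precise, lossless control. Method: Proposes Surgical Refusal Ablation (SRA): first build a set of Concept Atoms covering protected capabilities (such as mathematical reasoning and code generation) and linguistic style; then apply ridge-regularized spectral residualization to project the raw refusal vector orthogonally to these concept atom directions, extracting a clean refusal direction that targets only refusal-related structure and leaves the rest of the model's semantic geometry undisturbed. Result: Validated on five models from the Qwen3-VL and Ministral series, SRA reaches very low refusal rates (0-2%) with minimal perplexity change on Wikitext-2 (mean ΔPPL ≈ 0.02), in contrast to the severe distribution drift caused by standard ablation (for Qwen3-VL-4B, KL = 2.088 versus KL = 0.044 with SRA). Teacher-forced perplexity analysis on GSM8K and MBPP shows that SRA preserves the math and code distributions well. Conclusion: Common performance degradation does not stem from a necessary sacrifice of functionality but from "Ghost Noise" caused by the polysemantic refusal vector. SRA separates the refusal signal from core capabilities, showing that high-precision, low-damage behavior editing is feasible and opening a new path toward interpretable and controllable safety alignment. Abstract: Safety-aligned language models systematically refuse harmful requests. While activation steering can modulate refusal, ablating the raw "refusal vector" calculated from contrastive harmful and harmless prompts often causes collateral damage and distribution drift. We argue this degradation occurs because the raw vector is polysemantic, entangling the refusal signal with core capability circuits and linguistic style. We introduce Surgical Refusal Ablation (SRA) to distill these steering directions. SRA constructs a registry of independent Concept Atoms representing protected capabilities and stylistic confounds, then uses ridge-regularized spectral residualization to orthogonalize the refusal vector against these directions. This yields a clean refusal direction that targets refusal-relevant structure while minimizing disruption to the model's semantic geometry. Across five models (Qwen3-VL and Ministral series), SRA achieves deep refusal reduction (0-2%) with negligible perplexity impact on Wikitext-2 (mean delta PPL approx. 0.02) and minimal distribution drift. Notably, standard ablation on Qwen3-VL-4B induces severe drift (first-token KL = 2.088), whereas SRA maintains the original distribution (KL = 0.044) while achieving the same 0% refusal rate.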
Using teacher-forced perplexity on GSM8K and MBPP as a high-resolution capability proxy, we show SRA preserves math and code distributions. These results suggest that common "model damage" is often "Ghost Noise," defined as the spectral bleeding of the dirty refusal direction into capability subspaces.
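The abstract names ridge-regularized residualization against concept directions but gives no formula. One standard reading, shown here purely as an assumption, is to regress the raw vector r on the concept atoms A with a ridge penalty λ and keep the residual: r_clean = r - A(AᵀA + λI)⁻¹Aᵀr. The sketch below uses plain Python lists; every name in it is hypothetical, and it does not attempt the "spectral" part of the method.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def solve_linear(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; try a larger ridge value")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def residualize(refusal: Sequence[float], atoms: Sequence[Sequence[float]], ridge: float = 1e-3) -> List[float]:
    """Remove from `refusal` the part explained by the concept atoms (ridge regression residual)."""
    k = len(atoms)
    gram = [[dot(atoms[i], atoms[j]) + (ridge if i == j else 0.0) for j in range(k)] for i in range(k)]
    beta = solve_linear(gram, [dot(atom, refusal) for atom in atoms])
    return [r - sum(beta[i] * atoms[i][t] for i in range(k)) for t, r in enumerate(refusal)]


if __name__ == "__main__":
    atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy "capability" directions
    raw = [0.5, 0.5, 1.0]                        # toy raw refusal vector
    print(residualize(raw, atoms, ridge=0.0))    # fully orthogonalized
    print(residualize(raw, atoms, ridge=1.0))    # shrunk, partially orthogonalized
```

With a zero ridge the residual is exactly orthogonal to the atoms; a positive ridge removes less of the overlap, which trades strict orthogonality for stability when the atoms are nearly collinear.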

[45] BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts

Erin Feiglin,Nir Hutnik,Raz Lapid

Main category: cs.CL

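TL;DR: Studies Overflow, the tendency of large language models to produce excessive output in response to ordinary plain-text prompts. It introduces BenchOverflow, a benchmark for evaluating length control across models under multiple prompting strategies, and proposes a lightweight mitigation (a conciseness reminder) to reduce resource waste and operating cost.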

Details Motivation: Overflow occurs in ordinary interaction settings and leads to higher serving cost, increased latency, and cross-user performance degradation, with notable economic and environmental impact at large scale, so it needs to be measured and mitigated systematically. Method: Designs BenchOverflow, a model-agnostic benchmark with nine non-adversarial plain-text prompting strategies, and evaluates nine open- and closed-source models under a fixed budget of 5000 new tokens; quantifies tail risk with CSR@1k/3k/5k and ECDFs, analyzes differences and correlations across models, and introduces a conciseness reminder as a lightweight mitigation. Result: All models show pronounced rightward shifts and heavy tails in output length; Overflow is reproducible but heterogeneous across models and strategies; the conciseness reminder suppresses long-tail outputs and lowers CSR for most models; BenchOverflow enables standardized comparison of length-control robustness. Conclusion: Output length control should be treated as an important reliability, cost, and sustainability metric. BenchOverflow is a practical tool for assessing resource efficiency in deployment and for evaluating defenses against compute amplification, helping reduce operating expense without sacrificing task performance. Abstract: We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and can lead to elevated serving cost, latency, and cross-user performance degradation, particularly when scaled across many requests. Beyond usability, the stakes are economic and environmental: unnecessary tokens increase per-request cost and energy consumption, compounding into substantial operational spend and carbon footprint at scale. Moreover, Overflow represents a practical vector for compute amplification and service degradation in shared environments. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a standardized protocol with a fixed budget of 5000 new tokens, we evaluate nine open- and closed-source models and observe pronounced rightward shifts and heavy tails in length distributions. Cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) quantify tail risk; within-prompt variance and cross-model correlations show that Overflow is broadly reproducible yet heterogeneous across families and attack vectors. A lightweight mitigation (a fixed conciseness reminder) attenuates right tails and lowers CSR for all strategies across the majority of models.
Our findings position length control as a measurable reliability, cost, and sustainability concern rather than a stylistic quirk. By enabling standardized comparison of length-control robustness across models, BenchOverflow provides a practical basis for selecting deployments that minimize resource waste and operating expense, and for evaluating defenses that curb compute amplification without eroding task performance.
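The abstract does not define its tail metrics precisely. A minimal sketch follows, assuming that CSR@k is the fraction of generations whose length reaches at least k tokens and that the ECDF is evaluated as the fraction of generations no longer than a given length; both definitions and the function names are assumptions for illustration.

```python
from typing import Sequence


def cap_saturation_rate(lengths: Sequence[int], cap: int) -> float:
    """Assumed CSR@cap: share of outputs that reach or exceed `cap` tokens."""
    if not lengths:
        raise ValueError("need at least one output length")
    return sum(1 for n in lengths if n >= cap) / len(lengths)


def ecdf(lengths: Sequence[int], x: int) -> float:
    """Empirical CDF at x: share of outputs with length <= x."""
    if not lengths:
        raise ValueError("need at least one output length")
    return sum(1 for n in lengths if n <= x) / len(lengths)


if __name__ == "__main__":
    # Toy output lengths (in tokens) under a 5000-token budget.
    lengths = [120, 980, 1000, 3200, 5000]
    for cap in (1000, 3000, 5000):
        print(f"CSR@{cap} = {cap_saturation_rate(lengths, cap):.2f}")
    print(f"ECDF(1000) = {ecdf(lengths, 1000):.2f}")
```

Reporting the rate at several caps, as the abstract does with 1k/3k/5k, shows how heavy the right tail is rather than just whether the budget was hit.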

[46] It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models

Cristian Santini,Marieke Van Erp,Mehwish Alam

Main category: cs.CL

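TL;DR: Proposes MHEL-LLaMo, a multilingual historical entity linking method that requires no fine-tuning and combines the strengths of small and large models to achieve efficient, accurate entity linking in low-resource settings.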

Details Motivation: Because of linguistic variation, noisy inputs, and semantic change in historical texts, existing entity linking methods either need large amounts of data or scale poorly. Method: An unsupervised ensemble of a Small Language Model (SLM) and a Large Language Model (LLM): BELA performs candidate retrieval, an instruction-tuned LLM handles NIL prediction and candidate selection via prompt chaining, and SLM confidence scores separate easy from hard samples so that the LLM is used only for hard cases. Result: Evaluated on four benchmarks in six European languages, MHEL-LLaMo outperforms state-of-the-art models without fine-tuning and at lower computational cost. Conclusion: MHEL-LLaMo provides a scalable, efficient entity linking solution for low-resource historical texts, balancing performance and computational overhead. Abstract: Despite the recent advancements in NLP with the advent of Large Language Models (LLMs), Entity Linking (EL) for historical texts remains challenging due to linguistic variation, noisy inputs, and evolving semantic conventions. Existing solutions either require substantial training data or rely on domain-specific rules that limit scalability. In this paper, we present MHEL-LLaMo (Multilingual Historical Entity Linking with Large Language MOdels), an unsupervised ensemble approach combining a Small Language Model (SLM) and an LLM. MHEL-LLaMo leverages a multilingual bi-encoder (BELA) for candidate retrieval and an instruction-tuned LLM for NIL prediction and candidate selection via prompt chaining. Our system uses SLM's confidence scores to discriminate between easy and hard samples, applying an LLM only for hard cases. This strategy reduces computational costs while preventing hallucinations on straightforward cases. We evaluate MHEL-LLaMo on four established benchmarks in six European languages (English, Finnish, French, German, Italian and Swedish) from the 19th and 20th centuries. Results demonstrate that MHEL-LLaMo outperforms state-of-the-art models without requiring fine-tuning, offering a scalable solution for low-resource historical EL. The implementation of MHEL-LLaMo is available on GitHub.
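The confidence-gated routing described in the abstract (keep the SLM answer for easy mentions, call the LLM only for hard ones) can be sketched as follows. The threshold value, the function signatures, and the stub linkers are assumptions for illustration and do not reflect the actual MHEL-LLaMo or BELA interfaces.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: the SLM returns (entity_id, confidence); the LLM returns
# an entity_id or None when it predicts NIL (no matching entity).
SlmLinker = Callable[[str], Tuple[str, float]]
LlmLinker = Callable[[str], Optional[str]]


def link_mention(mention: str, slm_link: SlmLinker, llm_link: LlmLinker,
                 threshold: float = 0.8) -> Tuple[Optional[str], str]:
    """Return (entity_id, route), where route records which model made the decision."""
    entity, confidence = slm_link(mention)
    if confidence >= threshold:
        return entity, "slm"          # easy case: trust the small model
    return llm_link(mention), "llm"   # hard case: defer to the LLM


if __name__ == "__main__":
    def stub_slm(mention: str) -> Tuple[str, float]:
        return ("Q90", 0.95) if mention == "Paris" else ("Q1234", 0.40)

    def stub_llm(mention: str) -> Optional[str]:
        return None  # pretend the LLM predicts NIL for every hard case

    print(link_mention("Paris", stub_slm, stub_llm))
    print(link_mention("Parys", stub_slm, stub_llm))
```

The cost saving comes from the first branch: every mention resolved there avoids an LLM call entirely.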

[47] STAGE: A Benchmark for Knowledge Graph Construction, Question Answering, and In-Script Role-Playing over Movie Screenplays

Qiuyu Tian,Yiding Li,Fengyi Chen,Zequn Liu,Youyong Kong,Fan Guo,Yuyao Li,Jinjing Shen,Zhijing Xie,Yiyun Luo,Xin Zhang

Main category: cs.CL

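TL;DR: STAGE is a new unified benchmark for evaluating narrative understanding over full-length movie screenplays, covering four tasks: knowledge graph construction, scene-level event summarization, long-context question answering, and role-playing.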

Details Motivation: Existing benchmarks mostly target a single subtask and do not jointly evaluate whether models can build a coherent story world and use it for multiple forms of reasoning and generation. Method: Proposes the STAGE benchmark, with cleaned scripts, knowledge graphs, and event and character annotations for 150 English and Chinese films, and defines four tasks grounded in a shared narrative world representation. Result: Provides a dataset and framework for multi-task evaluation that tests models' abilities in building world representations, abstracting and verifying narrative events, reasoning over long texts, and generating character-consistent responses. Conclusion: STAGE offers a more comprehensive evaluation standard for screenplay-level narrative understanding and encourages progress on integrated understanding and generation over complex narrative structures. Abstract: Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.

[48] STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio

Seong-Gyu Park,Sohee Park,Jisu Lee,Hyunsik Na,Daeseon Choi

Main category: cs.CL

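TL;DR: Proposes STAR, a framework that detects inference-time backdoor attacks by analyzing shifts in output probabilities, achieving near-perfect detection performance and high efficiency across multiple models and datasets.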

Details Motivation: Reasoning mechanisms such as CoT introduce a new attack surface; malicious reasoning paths are hard to detect with conventional methods, so inference-time backdoor attacks that do not alter model parameters need to be identified effectively. Method: Proposes STAR (State-Transition Amplification Ratio), which exploits the discrepancy between prior and posterior probabilities induced by malicious inputs and uses the CUSUM algorithm to detect persistently anomalous probability-transition patterns. Result: Experiments on models from 8B to 70B and five benchmark datasets show near-perfect detection (AUROC ≈ 1.0), roughly 42 times greater efficiency than existing baselines, and robustness to adaptive attacks. Conclusion: STAR is an efficient, general, and robust framework for detecting inference-time backdoors, and it identifies malicious reasoning paths that are disguised by linguistic coherence. Abstract: Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. STAR exploits the statistical discrepancy where a malicious input-induced path exhibits high posterior probability despite a low prior probability in the model's general knowledge. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities, consistently achieving near-perfect performance (AUROC ≈ 1.0) with approximately 42× greater efficiency than existing baselines. Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection.
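To make the detection idea concrete, the sketch below scores each reasoning step by the log ratio of its posterior probability (given the possibly poisoned input) to its prior probability, and feeds those scores to a one-sided CUSUM detector. The exact score, the drift and threshold values, and how STAR obtains the two probabilities are not given in the abstract, so all of these are assumptions.

```python
import math
from typing import List, Optional, Sequence


def amplification_scores(prior: Sequence[float], posterior: Sequence[float]) -> List[float]:
    """Assumed per-step score: log(posterior / prior); large when an unlikely step is boosted."""
    return [math.log(q / p) for p, q in zip(prior, posterior)]


def cusum_alarm(scores: Sequence[float], drift: float = 1.0, threshold: float = 5.0) -> Optional[int]:
    """One-sided CUSUM: return the first index where accumulated excess passes the threshold."""
    statistic = 0.0
    for index, score in enumerate(scores):
        # Subtracting `drift` keeps ordinary fluctuations from accumulating.
        statistic = max(0.0, statistic + score - drift)
        if statistic > threshold:
            return index
    return None


if __name__ == "__main__":
    benign = amplification_scores([0.5, 0.4, 0.6], [0.5, 0.5, 0.5])
    suspicious = amplification_scores([0.5, 0.01, 0.01, 0.01], [0.5, 0.9, 0.9, 0.9])
    print(cusum_alarm(benign))      # no alarm expected
    print(cusum_alarm(suspicious))  # alarm once the amplification persists
```

CUSUM suits this setting because it reacts to sustained shifts rather than to a single surprising step, which matches the abstract's emphasis on persistent anomalies.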

[49] Algorithmic Stability in Infinite Dimensions: Characterizing Unconditional Convergence in Banach Spaces

Przemysław Spyra

Main category: cs.CL

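TL;DR: Presents a unified theorem that ties together seven equivalent conditions for unconditional convergence in infinite-dimensional spaces and shows its relevance to computational algorithms.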

Details Motivation: The distinction between conditional, unconditional, and absolute convergence matters for computational algorithms in infinite-dimensional spaces, even though these notions coincide in finite dimensions, so a more systematic theory is needed to guide practice. Method: Establishes a unified characterization theorem by synthesizing the Dvoretzky-Rogers theorem with the various equivalent forms of unconditional convergence (permutation invariance, subseries tests, sign stability, and others). Result: Obtains a complete characterization through seven equivalent conditions for unconditional convergence and shows direct applications to gradient accumulation in Stochastic Gradient Descent and to coefficient thresholding in frame-based signal processing. Conclusion: The work connects classical functional analysis with modern computational practice and provides rigorous foundations for order-independent, numerically stable summation. Abstract: The distinction between conditional, unconditional, and absolute convergence in infinite-dimensional spaces has fundamental implications for computational algorithms. While these concepts coincide in finite dimensions, the Dvoretzky-Rogers theorem establishes their strict separation in general Banach spaces. We present a comprehensive characterization theorem unifying seven equivalent conditions for unconditional convergence: permutation invariance, net convergence, subseries tests, sign stability, bounded multiplier properties, and weak uniform convergence. These theoretical results directly inform algorithmic stability analysis, governing permutation invariance in gradient accumulation for Stochastic Gradient Descent and justifying coefficient thresholding in frame-based signal processing. Our work bridges classical functional analysis with contemporary computational practice, providing rigorous foundations for order-independent and numerically robust summation processes.
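For readers who want the separation between absolute and unconditional convergence made concrete, the standard textbook definitions and the usual example in $\ell^2$ are written out below. The notation is ours rather than the paper's, and the example is the classical one, not a result taken from this work.

```latex
% Standard definitions for a series in a Banach space X.
\[
\sum_{n} x_n \text{ converges absolutely}
\iff \sum_{n} \lVert x_n \rVert < \infty ,
\]
\[
\sum_{n} x_n \text{ converges unconditionally}
\iff \sum_{n} x_{\pi(n)} \text{ converges for every permutation } \pi \text{ of } \mathbb{N}.
\]
% Classical example in \ell^2 with orthonormal basis (e_n):
\[
x_n = \frac{e_n}{n}, \qquad
\sum_{n} \lVert x_n \rVert = \sum_{n} \frac{1}{n} = \infty, \qquad
\Bigl\lVert \sum_{n \in F} x_n \Bigr\rVert^{2} = \sum_{n \in F} \frac{1}{n^{2}}
\quad \text{for every finite } F \subset \mathbb{N}.
\]
```

Because $\sum_n 1/n^2$ converges, the partial sums over any ordering are Cauchy, so the series converges unconditionally even though it does not converge absolutely. In finite dimensions no such example exists, which is the sense in which the notions coincide there.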

[50] DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report

Ruizhe Li,Mingxuan Du,Benfeng Xu,Chiwei Zhu,Xiaorui Wang,Zhendong Mao

Main category: cs.CL

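TL;DR: Introduces Deep Research Bench II, a new benchmark for evaluating reports generated by deep research systems (DRS), with 132 cross-domain research tasks and 9430 fine-grained binary rubrics built through a four-stage LLM+human pipeline so that the criteria align with expert judgment. Experiments show that current state-of-the-art systems still lag far behind human experts.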

Details Motivation: Existing benchmarks for deep research systems fall short: some do not adequately test a system's ability to analyze evidence and write coherent reports, while others rely on criteria that are too coarse or defined directly by LLMs, producing scores that are prone to bias and hard to verify. A more rigorous, interpretable framework that reflects human expert judgment is needed. Method: Proposes Deep Research Bench II, with 132 research tasks across 22 domains; for each task it builds fine-grained binary rubrics (9430 in total) derived from expert-written investigative articles through a four-stage LLM+human pipeline (automatic extraction plus over 400 hours of expert review), evaluating long-form reports along three dimensions: information recall, analysis, and presentation. Result: Evaluating several state-of-the-art deep research systems shows that even the best model satisfies fewer than 50% of the rubrics, revealing a substantial gap between current systems and human experts. All rubrics are atomic, verifiable, and closely aligned with expert judgment. Conclusion: Deep Research Bench II provides a more rigorous, transparent, expert-aligned evaluation framework, exposes major weaknesses of current deep research systems in information integration, depth of analysis, and report quality, and gives clear direction for future improvement. Abstract: Deep Research Systems (DRS) aim to help users search the web, synthesize information, and deliver comprehensive investigative reports. However, how to rigorously evaluate these systems remains under-explored. Existing deep-research benchmarks often fall into two failure modes. Some do not adequately test a system's ability to analyze evidence and write coherent reports. Others rely on evaluation criteria that are either overly coarse or directly defined by LLMs (or both), leading to scores that can be biased relative to human experts and are hard to verify or interpret. To address these issues, we introduce Deep Research Bench II, a new benchmark for evaluating DRS-generated reports. It contains 132 grounded research tasks across 22 domains; for each task, a system must produce a long-form research report that is evaluated by a set of 9430 fine-grained binary rubrics in total, covering three dimensions: information recall, analysis, and presentation. All rubrics are derived from carefully selected expert-written investigative articles and are constructed through a four-stage LLM+human pipeline that combines automatic extraction with over 400 human-hours of expert review, ensuring that the criteria are atomic, verifiable, and aligned with human expert judgment.
We evaluate several state-of-the-art deep-research systems on Deep Research Bench II and find that even the strongest models satisfy fewer than 50% of the rubrics, revealing a substantial gap between current DRSs and human experts.

[51] Ministral 3

Alexander H. Liu,Kartik Khandelwal,Sandeep Subramanian,Victor Jouault,Abhinav Rastogi,Adrien Sadé,Alan Jeffares,Albert Jiang,Alexandre Cahill,Alexandre Gavaudan,Alexandre Sablayrolles,Amélie Héliou,Amos You,Andy Ehrenberg,Andy Lo,Anton Eliseev,Antonia Calvi,Avinash Sooriyarachchi,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Clémence Lanfranchi,Corentin Barreau,Cyprien Courtot,Daniele Grattarola,Darius Dabert,Diego de las Casas,Elliot Chane-Sane,Faruk Ahmed,Gabrielle Berrada,Gaëtan Ecrepont,Gauthier Guinet,Georgii Novikov,Guillaume Kunsch,Guillaume Lample,Guillaume Martin,Gunshi Gupta,Jan Ludziejewski,Jason Rute,Joachim Studnia,Jonas Amar,Joséphine Delas,Josselin Somerville Roberts,Karmesh Yadav,Khyathi Chandu,Kush Jain,Laurence Aitchison,Laurent Fainsin,Léonard Blier,Lingxiao Zhao,Louis Martin,Lucile Saulnier,Luyu Gao,Maarten Buyl,Margaret Jennings,Marie Pellat,Mark Prins,Mathieu Poirée,Mathilde Guillaumin,Matthieu Dinot,Matthieu Futeral,Maxime Darrin,Maximilian Augustin,Mia Chiquier,Michel Schimpf,Nathan Grinsztajn,Neha Gupta,Nikhil Raghuraman,Olivier Bousquet,Olivier Duchenne,Patricia Wang,Patrick von Platen,Paul Jacob,Paul Wambergue,Paula Kurylowicz,Pavankumar Reddy Muddireddy,Philomène Chagniot,Pierre Stock,Pravesh Agrawal,Quentin Torroba,Romain Sauvestre,Roman Soletskyi,Rupert Menneer,Sagar Vaze,Samuel Barry,Sanchit Gandhi,Siddhant Waghjale,Siddharth Gandhi,Soham Ghosh,Srijan Mishra,Sumukh Aithal,Szymon Antoniak,Teven Le Scao,Théo Cachet,Theo Simon Sorg,Thibaut Lavril,Thiziri Nait Saada,Thomas Chabal,Thomas Foubert,Thomas Robert,Thomas Wang,Tim Lawson,Tom Bewley,Tom Bewley,Tom Edwards,Umar Jamil,Umberto Tomasini,Valeriia Nemychnikova,Van Phung,Vincent Maladière,Virgile Richard,Wassim Bouaziz,Wen-Ding Li,William Marshall,Xinghui Li,Xinyu Yang,Yassine El Ouahidi,Yihan Wang,Yunhao Tang,Zaccharie Ramzi

Main category: cs.CL

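TL;DR: The Ministral 3 series is a family of efficient dense language models designed for compute- and memory-constrained applications, in 3B, 8B, and 14B sizes, each with base, instruction-finetuned, and reasoning variants, trained with Cascade Distillation, with image understanding capabilities and a permissive open-source license.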

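Details Motivation: To enable efficient language model deployment on devices with limited compute and memory while keeping versatility and strong performance. Method: Uses Cascade Distillation, which combines iterative pruning with continued training under distillation, to optimize efficiency and performance, and integrates image understanding. Result: Releases the Ministral 3 series (3B/8B/14B), each size with base, instruction-finetuned, and reasoning variants, performing well across a range of tasks, supporting image understanding, and all released under the Apache 2.0 license. Conclusion: The Ministral 3 models achieve strong language and multimodal capabilities at small parameter counts through an efficient training recipe, making them practical and extensible for resource-constrained scenarios. Abstract: We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute and memory constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction-finetuned model, and a reasoning model for complex problem-solving. In addition, we present our recipe to derive the Ministral 3 models through Cascade Distillation, an iterative pruning and continued training with distillation technique. Each model comes with image understanding capabilities, all under the Apache 2.0 license.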

[52] ExpSeek: Self-Triggered Experience Seeking for Web Agents

Wenyuan Zhang,Xinghua Zhang,Haiyang Yu,Shuaiyi Nie,Bingli Wu,Juwei Yue,Tingwen Liu,Yongbin Li

Main category: cs.CL

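TL;DR: This paper proposes ExpSeek, an entropy-threshold-based method for proactive experience seeking that dynamically triggers and tailors experience interventions at the step level, significantly improving web agent performance.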

Details Motivation: Existing methods usually inject experience passively as global context before task execution, so they cannot adapt to a dynamically changing environment and offer no support for real-time decisions during interaction. Method: ExpSeek uses the model's intrinsic signals to estimate a per-step entropy threshold that decides when to intervene, and designs experience content tailored to that step, yielding proactive, fine-grained experience seeking. Result: Across four challenging web agent benchmarks, ExpSeek achieves absolute improvements of 9.3% and 7.5% on Qwen3-8B and 32B models, respectively; experiments also show that an experience model as small as 4B parameters can effectively boost larger agent models. Conclusion: Entropy is an effective self-triggering signal for fine-grained, dynamic experience intervention, and ExpSeek points to a new direction for building more flexible and efficient web agents. Abstract: Experience intervention in web agents emerges as a promising technical paradigm, enhancing agent interaction capabilities by providing valuable insights from accumulated experiences. However, existing methods predominantly inject experience passively as global context before task execution, struggling to adapt to dynamically changing contextual observations during agent-environment interaction. We propose ExpSeek, which shifts experience toward step-level proactive seeking: (1) estimating step-level entropy thresholds to determine intervention timing using the model's intrinsic signals; (2) designing experience content tailored to each step. Experiments on Qwen3-8B and 32B models across four challenging web agent benchmarks demonstrate that ExpSeek achieves absolute improvements of 9.3% and 7.5%, respectively. Our experiments validate the feasibility and advantages of entropy as a self-triggering signal and reveal that even a small 4B experience model can significantly boost the performance of larger agent models.
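As a rough illustration of entropy as a self-triggering signal, the sketch below computes the mean token entropy of one agent step and requests an intervention when it exceeds a threshold. The function names, the use of a mean over tokens, and the fixed threshold value are all assumptions made for illustration; the abstract does not say how ExpSeek estimates its step-level thresholds.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(step_distributions):
    """Mean token entropy over the tokens generated in one agent step."""
    entropies = [token_entropy(d) for d in step_distributions]
    return sum(entropies) / len(entropies)


def should_seek_experience(step_distributions, threshold):
    """Hypothetical trigger: intervene only when the step looks uncertain."""
    return step_entropy(step_distributions) > threshold


# Toy next-token distributions for two steps (illustrative values only).
confident_step = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain_step = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

THRESHOLD = 0.8  # assumed constant; the paper estimates thresholds per step
print(should_seek_experience(confident_step, THRESHOLD))
print(should_seek_experience(uncertain_step, THRESHOLD))
```

With these toy values the confident step stays below the threshold and the uncertain step crosses it, so only the second would request experience.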

[53] GraphSearch: Agentic Search-Augmented Reasoning for Zero-Shot Graph Learning

Jiajin Liu,Yuanfu Sun,Dongzhe Fan,Qiaoyu Tan

Main category: cs.CL

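TL;DR: This paper proposes GraphSearch, the first framework to extend search-augmented reasoning to graph learning, enabling zero-shot graph learning without task-specific fine-tuning.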

Details Motivation: Search-augmented large reasoning models remain underexplored on graph-structured data, even though topological signals in graphs can serve as valuable priors for retrieval and improve reasoning efficiency. Method: Proposes the GraphSearch framework, consisting of a Graph-aware Query Planner (which separates the search space from the semantic query) and a Graph-aware Retriever (which builds candidate sets from topology and ranks them with a hybrid scoring function), with two traversal modes: GraphSearch-R, which recursively expands neighborhoods, and GraphSearch-F, which retrieves flexibly across local and global neighborhoods. Result: Across multiple benchmarks, GraphSearch matches or exceeds supervised methods on zero-shot node classification and link prediction, achieving state-of-the-art results. Conclusion: GraphSearch is a flexible and generalizable new paradigm for agentic reasoning over graphs. Abstract: Recent advances in search-augmented large reasoning models (LRMs) enable the retrieval of external knowledge to reduce hallucinations in multistep reasoning. However, their ability to operate on graph-structured data, prevalent in domains such as e-commerce, social networks, and scientific citations, remains underexplored. Unlike plain text corpora, graphs encode rich topological signals that connect related entities and can serve as valuable priors for retrieval, enabling more targeted search and improved reasoning efficiency. Yet, effectively leveraging such structure poses unique challenges, including the difficulty of generating graph-expressive queries and ensuring reliable retrieval that balances structural and semantic relevance. To address this gap, we introduce GraphSearch, the first framework that extends search-augmented reasoning to graph learning, enabling zero-shot graph learning without task-specific fine-tuning. GraphSearch combines a Graph-aware Query Planner, which disentangles search space (e.g., 1-hop, multi-hop, or global neighbors) from semantic queries, with a Graph-aware Retriever, which constructs candidate sets based on topology and ranks them using a hybrid scoring function. We further instantiate two traversal modes: GraphSearch-R, which recursively expands neighborhoods hop by hop, and GraphSearch-F, which flexibly retrieves across local and global neighborhoods without hop constraints. Extensive experiments across diverse benchmarks show that GraphSearch achieves competitive or even superior performance compared to supervised graph learning methods, setting state-of-the-art results in zero-shot node classification and link prediction. These findings position GraphSearch as a flexible and generalizable paradigm for agentic reasoning over graphs.

[54] How Order-Sensitive Are LLMs? OrderProbe for Deterministic Structural Reconstruction

Yingjie He,Zhaolu Kang,Kehan Jiang,Qianyuan Zhang,Jiachen Qian,Chunlei Meng,Yujie Feng,Yuan Wang,Jiabao Dou,Aming Wu,Leqi Zheng,Pengxiang Zhao,Jiaxin Liu,Zeyu Zhang,Lei Wang,Guansu Wang,Qishi Zhan,Xiaomin He,Meisheng Zhang,Jianyuan Ni

Main category: cs.CL

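TL;DR: This paper proposes OrderProbe, a deterministic benchmark for evaluating how well large language models reconstruct the structure of fixed four-character expressions in Chinese, Japanese, and Korean, together with a diagnostic framework covering recovery accuracy, semantic fidelity, logical validity, and more. Experiments show that even frontier models often fall below 35% zero-shot recovery accuracy, with a clear dissociation between semantic recall and structural planning.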

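Details Motivation: Large language models excel at semantic understanding, but their ability to reconstruct internal structure from scrambled input remains unclear. Sentence-level restoration is hard to evaluate automatically because multiple valid word orders exist, so a task with a unique canonical order is needed to measure structural reconstruction precisely. Method: Proposes the OrderProbe benchmark, built on fixed four-character expressions in Chinese, Japanese, and Korean that have a unique canonical order and therefore support exact-match scoring; also designs a diagnostic framework that evaluates models on recovery accuracy, semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Result: Experiments on 12 widely used LLMs show that zero-shot recovery accuracy frequently falls below 35%; models exhibit a marked dissociation between semantic recall and structural planning, indicating that structural robustness is not an automatic byproduct of semantic competence. Conclusion: Structural reconstruction remains challenging for current LLMs; semantic understanding and structural processing are relatively independent capabilities, and the latter needs dedicated mechanisms to improve. Abstract: Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.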
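Because each expression has a single canonical order, scoring reduces to string equality. The sketch below shows one way such a probe could be scrambled and scored; the helper names, the seeded shuffle, and the example idiom are illustrative assumptions, not the benchmark's actual construction.

```python
import random


def scramble(expression, rng):
    """Return a permutation of the characters that differs from the original."""
    if len(set(expression)) < 2:
        raise ValueError("need at least two distinct characters to scramble")
    chars = list(expression)
    while True:
        rng.shuffle(chars)
        candidate = "".join(chars)
        if candidate != expression:
            return candidate


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal the canonical order exactly."""
    hits = sum(p.strip() == r for p, r in zip(predictions, references))
    return hits / len(references)


rng = random.Random(0)  # seeded so the scrambled prompts are reproducible
canonical = "一石二鸟"  # illustrative four-character idiom, chosen for this sketch
prompt_input = scramble(canonical, rng)

# Hypothetical model outputs: one correct restoration, one wrong order.
predictions = ["一石二鸟", "二鸟一石"]
references = [canonical, canonical]
accuracy = exact_match_accuracy(predictions, references)
```

A partially correct order earns no credit under this metric, which is what makes the evaluation deterministic.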

[55] Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation

Saumitra Yadav,Manish Shrivastava

Main category: cs.CL

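TL;DR: This paper proposes LALITA, a framework that selects source sentences using lexical and linguistic features to optimize parallel corpus construction for low-resource machine translation, significantly improving translation quality while reducing data requirements.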

Details Motivation: For low-resource languages, human translation is prohibitively expensive and efficient data selection methods for building high-quality parallel corpora are lacking, so a method is needed that improves machine translation performance while lowering data requirements. Method: Proposes the LALITA framework, which selects source sentences using lexical and linguistic features, prioritizes complex sentences, and curates and augments data drawn from both existing and synthetic datasets. Result: Validated on English-Hindi and other language pairs, simulating low-resource settings with 50K to 800K English sentences; translation quality improves significantly and data requirements drop by more than half. Conclusion: LALITA improves the training efficiency and performance of machine translation systems in low-resource settings, lowers data acquisition cost, and shows broad potential for data augmentation. Abstract: Data curation is a critical yet under-researched step in the machine translation training paradigm. To train translation systems, data acquisition relies primarily on human translations and digital parallel sources or, to a limited degree, synthetic generation. But for low-resource languages, human translation to generate sufficient data is prohibitively expensive. Therefore, it is crucial to develop a framework that screens source sentences to form efficient parallel text, ensuring optimal MT system performance in low-resource environments. We approach this by evaluating English-Hindi bi-text to determine effective sentence selection strategies for optimal MT system training. Our extensively tested framework, LALITA (Lexical And Linguistically Informed Text Analysis), targets source sentence selection using lexical and linguistic features to curate parallel corpora. We find that by training mostly on complex sentences from both existing and synthetic datasets, our method significantly improves translation quality. We test this by simulating low-resource data availability with curated datasets of 50K to 800K English sentences and report improved performance at all data sizes. LALITA demonstrates remarkable efficiency, reducing data needs by more than half across multiple languages (Hindi, Odia, Nepali, Norwegian Nynorsk, and German). This approach not only reduces MT system training cost by reducing training data requirements, but also showcases LALITA's utility in data augmentation.

[56] Moral Lenses, Political Coordinates: Towards Ideological Positioning of Morally Conditioned LLMs

Chenchen Yuan,Bolei Ma,Zheyu Zhang,Bardh Prenkaj,Frauke Kreuter,Gjergji Kasneci

Main category: cs.CL

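TL;DR: This study treats moral values as a controllable condition to investigate their causal effect on the political positioning of large language models, finding that moral conditioning significantly shifts models' political coordinates and that the effect depends on role framing and model scale.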

Details Motivation: Existing evaluations of political bias in large language models rely mostly on direct probing or demographic persona engineering and overlook the fundamental influence of moral intuitions on political ideology. This paper takes a social psychology perspective and aims to establish a causal relationship between moral values and political positioning. Method: Conditions models to endorse or reject specific moral values, measures the resulting changes in political orientation with the Political Compass Test, and analyzes how moral conditioning affects trajectories along economic and social dimensions; also tests robustness across role framings, model scales, and measurement instruments. Result: Moral conditioning induces pronounced, value-specific shifts in models' political coordinates; the effect is systematically modulated by role framing and model scale and remains robust across different but equivalent moral assessment instruments. Conclusion: Political positioning should be assessed within the broader context of moral and social values; moral values are a key causal factor shaping model ideology, opening a new path toward alignment that better reflects social values. Abstract: While recent research has systematically documented political orientation in large language models (LLMs), existing evaluations rely primarily on direct probing or demographic persona engineering to surface ideological biases. In social psychology, however, political ideology is also understood as a downstream consequence of fundamental moral intuitions. In this work, we investigate the causal relationship between moral values and political positioning by treating moral orientation as a controllable condition. Rather than simply assigning a demographic persona, we condition models to endorse or reject specific moral values and evaluate the resulting shifts in their political orientations, using the Political Compass Test. By treating moral values as lenses, we observe how moral conditioning actively steers model trajectories across economic and social dimensions. Our findings show that such conditioning induces pronounced, value-specific shifts in models' political coordinates. We further notice that these effects are systematically modulated by role framing and model scale, and are robust across alternative assessment instruments instantiating the same moral value. This highlights that effective alignment requires anchoring political assessments within the context of broader social values including morality, paving the way for more socially grounded alignment techniques.

[57] A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding

Dilara Torunoğlu-Selamet,Dogukan Arslan,Rodrigo Wilkens,Wei He,Doruk Eryiğit,Thomas Pickard,Adriana S. Pagano,Aline Villavicencio,Gülşen Eryiğit,Ágnes Abuczki,Aida Cardoso,Alesia Lazarenka,Dina Almassova,Amalia Mendes,Anna Kanellopoulou,Antoni Brosa-Rodríguez,Baiba Saulite,Beata Wojtowicz,Bolette Pedersen,Carlos Manuel Hidalgo-Ternero,Chaya Liebeskind,Danka Jokić,Diego Alves,Eleni Triantafyllidi,Erik Velldal,Fred Philippy,Giedre Valunaite Oleskeviciene,Ieva Rizgeliene,Inguna Skadina,Irina Lobzhanidze,Isabell Stinessen Haugen,Jauza Akbar Krito,Jelena M. Marković,Johanna Monti,Josue Alejandro Sauca,Kaja Dobrovoljc,Kingsley O. Ugwuanyi,Laura Rituma,Lilja Øvrelid,Maha Tufail Agro,Manzura Abjalova,Maria Chatzigrigoriou,María del Mar Sánchez Ramos,Marija Pendevska,Masoumeh Seyyedrezaei,Mehrnoush Shamsfard,Momina Ahsan,Muhammad Ahsan Riaz Khan,Nathalie Carmen Hau Norman,Nilay Erdem Ayyıldız,Nina Hosseini-Kivanani,Noémi Ligeti-Nagy,Numaan Naeem,Olha Kanishcheva,Olha Yatsyshyna,Daniil Orel,Petra Giommarelli,Petya Osenova,Radovan Garabik,Regina E. Semou,Rozane Rebechi,Salsabila Zahirah Pranida,Samia Touileb,Sanni Nimb,Sarfraz Ahmad,Sarvinoz Nematkhonova,Shahar Golan,Shaoxiong Ji,Sopuruchi Christian Aboh,Srdjan Sucur,Stella Markantonatou,Sussi Olsen,Vahide Tajalli,Veronika Lipp,Voula Giouli,Yelda Yeşildal Eraydın,Zahra Saaberi,Zhuohan Xie

Main category: cs.CL

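TL;DR: This paper introduces XMPIE, a multilingual, multimodal dataset of potentially idiomatic expressions covering 34 languages and more than ten thousand items, for evaluating how well NLP systems understand idioms across languages and modalities.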

Details Motivation: Idiomatic expressions are closely tied to the culture and everyday experience of a given language community, which makes them a challenge for the linguistic and cultural capabilities of NLP systems and calls for a high-quality multilingual, multimodal dataset to evaluate models. Method: Language experts built the dataset under multilingual guidelines; each potentially idiomatic expression comes with text and five images (spanning a range from literal to idiomatic meaning), covering 34 languages and supporting cross-lingual and cross-modal (text and image) analysis. Result: Produces XMPIE, a high-quality, large-scale, parallel multilingual and multimodal idiom dataset that supports comparative analysis of idiomatic patterns across languages and evaluation of idiom understanding across languages and modalities. Conclusion: XMPIE provides a reliable benchmark for multilingual, multimodal idiom understanding and helps study shared cultural aspects across languages as well as the transfer of understanding between the textual and visual modalities. Abstract: Potentially idiomatic expressions (PIEs) construe meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset makes it possible to evaluate model performance for a given PIE in different languages and whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers to or implies understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.

[58] Safe Language Generation in the Limit

Antonios Anastasopoulos,Giuseppe Ateniese,Evgenios M. Kornaropoulos

Main category: cs.CL

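TL;DR: This paper offers the first theoretical treatment of safe language generation. Building on the learning-in-the-limit paradigm, it formalizes safe language identification and generation, proves that safe language identification is impossible in this model and that safe language generation is at least as hard as (vanilla) language identification, which is also impossible, and discusses several tractable and intractable cases.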

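Details Motivation: As the area of learning in the limit expands, the safety of language generation in real-world settings must be considered, which calls for a theoretical study of safe language generation. Method: Building on the computational paradigm of learning in the limit, formally defines the tasks of safe language identification and safe language generation and examines their feasibility through theoretical analysis and proofs. Result: Proves that safe language identification is impossible and that safe language generation is at least as hard as vanilla language identification (which is also impossible), and identifies several tractable and intractable cases. Conclusion: Safe language identification is unachievable and safe language generation faces fundamental difficulty as well, so feasible routes under specific conditions need further exploration. Abstract: Recent results in learning a language in the limit have shown that, although language identification is impossible, language generation is tractable. As this foundational area expands, we need to consider the implications of language generation in real-world settings. This work offers the first theoretical treatment of safe language generation. Building on the computational paradigm of learning in the limit, we formalize the tasks of safe language identification and generation. We prove that under this model, safe language identification is impossible, and that safe language generation is at least as hard as (vanilla) language identification, which is also impossible. Last, we discuss several intractable and tractable cases.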

[59] RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation

Yihan Hong,Huaiyuan Yao,Bolin Shen,Wanpeng Xu,Hua Wei,Yushun Dong

Main category: cs.CL

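TL;DR: This paper proposes RULERS, a framework that turns natural language rubrics into executable specifications to address three failure modes of LLM judges: rubric instability, unverifiable reasoning, and misalignment with human grading scales.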

Details Motivation: Because of generation stochasticity, existing black-box LLMs struggle to follow natural language rubrics consistently, so automatic evaluations diverge from human judgment and a more reliable way to align judging criteria is needed. Method: Proposes the RULERS framework with three core components: compiling rubrics into versioned immutable bundles (Rubric Unification & Locking), deterministic evidence verification on top of structured decoding (Evidence-anchored Reasoning), and post-hoc calibration based on Wasserstein distance that requires no parameter updates (Robust Scoring). Result: Experiments on essay and summarization benchmarks show that RULERS significantly improves agreement with human scores, is strongly robust to adversarial rubric perturbations, and lets smaller open-source models rival larger proprietary judges. Conclusion: Reliable LLM judging depends not only on prompt design but also on executable rubrics, verifiable reasoning, and calibrated scoring scales. Abstract: The LLM-as-a-Judge paradigm promises scalable rubric-based evaluation, yet aligning frozen black-box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring), a compiler-executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein-based post-hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at https://github.com/LabRAI/Rulers.git.

[60] Analyzing Bias in False Refusal Behavior of Large Language Models for Hate Speech Detoxification

Kyuri Im,Shuzhou Yuan,Michael Färber

Main category: cs.CL

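TL;DR: This study systematically examines false refusal behavior of large language models in hate speech detoxification, analyzes the contextual and linguistic biases that trigger such refusals, and proposes a simple cross-translation strategy that effectively mitigates the problem.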

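Details Motivation: Large language models often refuse hate speech detoxification tasks because the prompts trigger safety alerts, which hampers practical use, so the causes of this false refusal behavior need to be investigated and addressed. Method: Evaluates nine LLMs on English and multilingual datasets, analyzes how semantic toxicity and target group affect refusal behavior, and proposes a cross-translation strategy that translates English hate speech into Chinese for detoxification and then translates it back. Result: LLMs are more likely to refuse inputs with high semantic toxicity and inputs targeting specific groups such as nationality, religion, and political ideology; multilingual datasets show lower overall refusal rates but still exhibit systematic, language-dependent biases; the proposed cross-translation strategy substantially reduces false refusals while preserving the original content. Conclusion: LLMs exhibit systematic biases in hate speech detoxification, and the proposed cross-translation strategy is an effective and lightweight mitigation. Abstract: While large language models (LLMs) have increasingly been applied to hate speech detoxification, the prompts often trigger safety alerts, causing LLMs to refuse the task. In this study, we systematically investigate false refusal behavior in hate speech detoxification and analyze the contextual and linguistic biases that trigger such refusals. We evaluate nine LLMs on both English and multilingual datasets. Our results show that LLMs disproportionately refuse inputs with higher semantic toxicity and those targeting specific groups, particularly nationality, religion, and political ideology. Although multilingual datasets exhibit lower overall false refusal rates than English datasets, models still display systematic, language-dependent biases toward certain targets. Based on these findings, we propose a simple cross-translation strategy, translating English hate speech into Chinese for detoxification and back, which substantially reduces false refusals while preserving the original content, providing an effective and lightweight mitigation approach.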

[61] Lessons from the Field: An Adaptable Lifecycle Approach to Applied Dialogue Summarization

Kushal Chawla,Chenyang Zhu,Pengshan Cai,Sangwoo Cho,Scott Novotney,Ayushman Singh,Jonah Lewis,Keasha Safewright,Alfy Samuel,Erin Babinsky,Shi-Xiong Zhang,Sambit Sahu

Main category: cs.CL

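TL;DR: This paper presents an industry case study on developing an agentic system for summarizing multi-party interactions, focusing on evolving requirements and the challenges of real-world deployment.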

Details Motivation: Summarization requirements keep changing in real-world settings, so static datasets and benchmarks fall short of practical needs and more reliable, adaptable summarization systems are required. Method: Adopts a system design based on an agentic architecture, uses task decomposition to enable component-wise optimization, and pairs it with robust evaluation methods that cope with evolving requirements and task subjectivity. Result: Offers a set of practical insights on evaluation, component optimization, data bottlenecks, and LLM prompt transferability, showing how upstream data limitations and vendor lock-in affect system development in practice. Conclusion: The study provides full-lifecycle practical experience for building adaptable, reliable multi-party dialogue summarization systems, offering guidance for industry practice and future research. Abstract: Summarization of multi-party dialogues is a critical capability in industry, enhancing knowledge transfer and operational effectiveness across many domains. However, automatically generating high-quality summaries is challenging, as the ideal summary must satisfy a set of complex, multi-faceted requirements. While summarization has received immense attention in research, prior work has primarily utilized static datasets and benchmarks, a condition rare in practical scenarios where requirements inevitably evolve. In this work, we present an industry case study on developing an agentic system to summarize multi-party interactions. We share practical insights spanning the full development lifecycle to guide practitioners in building reliable, adaptable summarization systems, as well as to inform future research, covering: 1) robust methods for evaluation despite evolving requirements and task subjectivity, 2) component-wise optimization enabled by the task decomposition inherent in an agentic architecture, 3) the impact of upstream data bottlenecks, and 4) the realities of vendor lock-in due to the poor transferability of LLM prompts.

[62] QuantEval: A Benchmark for Financial Quantitative Tasks in Large Language Models

Zhaolu Kang,Junhao Gong,Wenqing Hu,Shuo Yin,Kehan Jiang,Zhicheng Fang,Yingjie He,Chunlei Meng,Rong Fu,Dongyang Chen,Leqi Zheng,Eric Hanchen Jiang,Yunfei Feng,Yitong Leng,Junfan Zhu,Xiaoyou Chen,Xi Yang,Richeng Xuan

Main category: cs.CL

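TL;DR: QuantEval is a new benchmark that evaluates large language models on knowledge QA, mathematical reasoning, and strategy coding for financial quantitative tasks, and introduces a reproducible backtesting framework to measure model performance more realistically.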

Details Motivation: Existing evaluations of large language models in finance focus mainly on knowledge-centric question answering and lack a systematic assessment of quantitative reasoning and strategy generation. Method: Proposes the QuantEval benchmark with three dimensions (knowledge QA, quantitative reasoning, and strategy coding) and integrates a CTA-style deterministic backtesting framework that executes and evaluates model-generated trading strategies. Result: Tests on multiple open-source and proprietary LLMs show that current models lag well behind human experts in reasoning and strategy coding; supervised fine-tuning and reinforcement learning on domain-aligned data yield performance gains. Conclusion: QuantEval evaluates LLM capabilities in quantitative finance more comprehensively and realistically, helping advance their use in real trading, and the released full backtesting configuration makes results reproducible. Abstract: Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation in financial quantitative tasks remains fragmented and mostly limited to knowledge-centric question answering. We introduce QuantEval, a benchmark that evaluates LLMs across three essential dimensions of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting framework that executes model-generated strategies and evaluates them using financial performance metrics, enabling a more realistic assessment of quantitative coding ability. We evaluate several state-of-the-art open-source and proprietary LLMs and observe substantial gaps to human experts, particularly in reasoning and strategy coding. Finally, we conduct large-scale supervised fine-tuning and reinforcement learning experiments on domain-aligned data, demonstrating consistent improvements. We hope QuantEval will facilitate research on LLMs' quantitative finance capabilities and accelerate their practical adoption in real-world trading workflows. We additionally release the full deterministic backtesting configuration (asset universe, cost model, and metric definitions) to ensure strict reproducibility.

[63] Nationality and Region Prediction from Names: A Comparative Study of Neural Models and Large Language Models

Keito Inoshita

Main category: cs.CL

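TL;DR: This study comprehensively compares neural models and large language models (LLMs) on predicting nationality from personal names, finding that LLMs outperform conventional models at every granularity level and that their advantage stems from world knowledge acquired during pre-training.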

Details Motivation: Conventional neural models generalize poorly to low-frequency nationalities and to similar nationalities within the same region, whereas large language models may overcome these problems with their world knowledge. A systematic comparison of the two model families across granularities and frequencies is therefore needed. Method: Evaluates six neural models and six LLM prompting strategies at three granularity levels (nationality, region, and continent), with frequency-based stratified analysis and error-type analysis. Result: LLMs outperform neural models at all granularity levels, though the gap narrows as granularity becomes coarser; simple machine learning methods are the most robust on low-frequency classes, while pre-trained models and LLMs degrade on low-frequency nationalities; LLMs tend to make "near-miss" errors (wrong nationality but correct region), whereas neural models make more cross-regional errors and are biased toward high-frequency classes. Conclusion: The LLM advantage comes from world knowledge and suits fine-grained nationality prediction; model selection should consider the granularity the task requires, and evaluation should look at error quality rather than accuracy alone. Abstract: Predicting nationality from personal names has practical value in marketing, demographic research, and genealogical studies. Conventional neural models learn statistical correspondences between names and nationalities from task-specific training data, posing challenges in generalizing to low-frequency nationalities and distinguishing similar nationalities within the same region. Large language models (LLMs) have the potential to address these challenges by leveraging world knowledge acquired during pre-training. In this study, we comprehensively compare neural models and LLMs on nationality prediction, evaluating six neural models and six LLM prompting strategies across three granularity levels (nationality, region, and continent), with frequency-based stratified analysis and error analysis. Results show that LLMs outperform neural models at all granularity levels, with the gap narrowing as granularity becomes coarser. Simple machine learning methods exhibit the highest frequency robustness, while pre-trained models and LLMs show degradation for low-frequency nationalities. Error analysis reveals that LLMs tend to make "near-miss" errors, predicting the correct region even when nationality is incorrect, whereas neural models exhibit more cross-regional errors and bias toward high-frequency classes. These findings indicate that LLM superiority stems from world knowledge, model selection should consider required granularity, and evaluation should account for error quality beyond accuracy.

[64] RAGShaper: Eliciting Sophisticated Agentic RAG Skills via Automated Data Synthesis

Zhengwei Tao,Bo Li,Jialong Wu,Guochen Yan,Huanyao Zhang,Jiahao Xu,Haitao Mi,Wentao Zhang

Main category: cs.CL

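TL;DR: Proposes RAGShaper, a data synthesis framework that automatically constructs RAG tasks and robust agent trajectories, using adversarial distractors and a constrained navigation strategy to make models more robust in complex retrieval environments.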

Details Motivation: Existing RAG agents lack high-quality training data that reflects the noise and complexity of real retrieval environments, and manual annotation does not scale and struggles to capture dynamic reasoning strategies. Method: Designs an InfoCurator that builds dense information trees containing adversarial distractors at the Perception and Cognition levels, and uses a constrained navigation strategy that forces a teacher agent to confront the distractors, producing trajectories that include error correction and noise rejection. Result: Experiments show that models trained on the synthesized data significantly outperform baselines and are more robust on noise-intensive and complex retrieval tasks. Conclusion: RAGShaper effectively generates high-quality, diverse training data that improves agents' reasoning and retrieval abilities in complex, noisy environments. Abstract: Agentic Retrieval-Augmented Generation (RAG) empowers large language models to autonomously plan and retrieve information for complex problem-solving. However, the development of robust agents is hindered by the scarcity of high-quality training data that reflects the noise and complexity of real-world retrieval environments. Conventional manual annotation is unscalable and often fails to capture the dynamic reasoning strategies required to handle retrieval failures. To bridge this gap, we introduce RAGShaper, a novel data synthesis framework designed to automate the construction of RAG tasks and robust agent trajectories. RAGShaper incorporates an InfoCurator to build dense information trees enriched with adversarial distractors spanning Perception and Cognition levels. Furthermore, we propose a constrained navigation strategy that forces a teacher agent to confront these distractors, thereby eliciting trajectories that explicitly demonstrate error correction and noise rejection. Comprehensive experiments confirm that models trained on our synthesized corpus significantly outperform existing baselines, exhibiting superior robustness in noise-intensive and complex retrieval tasks.

[65] PrivGemo: Privacy-Preserving Dual-Tower Graph Retrieval for Empowering LLM Reasoning with Memory Augmentation

Xingyu Tan,Xiaoyang Wang,Qing Liu,Xiwei Xu,Xin Yuan,Liming Zhu,Wenjie Zhang

Main category: cs.CL

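TL;DR: Proposes PrivGemo, a framework for privacy-preserving knowledge-graph-grounded reasoning that uses a dual-tower design and memory-guided control to improve multi-hop, multi-entity reasoning while protecting privacy.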

Details Motivation: Existing privacy-preserving methods still suffer from structural leakage under semantic masking, uncontrollable remote interaction, fragile multi-hop reasoning, and limited experience reuse. Method: Uses a dual-tower design that keeps the raw knowledge graph local while remote reasoning operates on an anonymized view; retrieves anonymized long-hop paths that connect all topic entities, and combines a hierarchical controller with a privacy-aware experience memory to reduce unnecessary exploration and remote interaction. Result: Experiments on six benchmarks show that PrivGemo achieves overall state-of-the-art results, outperforming the strongest baseline by up to 17.1%, and enables small models (e.g., Qwen3-4B) to reach reasoning performance comparable to GPT-4-Turbo. Conclusion: PrivGemo effectively balances privacy protection and reasoning performance, supporting efficient and secure knowledge-graph-augmented language model reasoning. Abstract: Knowledge graphs (KGs) provide structured evidence that can ground large language model (LLM) reasoning for knowledge-intensive question answering. However, many practical KGs are private, and sending retrieved triples or exploration traces to closed-source LLM APIs introduces leakage risk. Existing privacy treatments focus on masking entity names, but they still face four limitations: structural leakage under semantic masking, uncontrollable remote interaction, fragile multi-hop and multi-entity reasoning, and limited experience reuse for stability and efficiency. To address these issues, we propose PrivGemo, a privacy-preserving retrieval-augmented framework for KG-grounded reasoning with memory-guided exposure control. PrivGemo uses a dual-tower design to keep raw KG knowledge local while enabling remote reasoning over an anonymized view that goes beyond name masking to limit both semantic and structural exposure. PrivGemo supports multi-hop, multi-entity reasoning by retrieving anonymized long-hop paths that connect all topic entities, while keeping grounding and verification on the local KG. A hierarchical controller and a privacy-aware experience memory further reduce unnecessary exploration and remote interactions. Comprehensive experiments on six benchmarks show that PrivGemo achieves overall state-of-the-art results, outperforming the strongest baseline by up to 17.1%. Furthermore, PrivGemo enables smaller models (e.g., Qwen3-4B) to achieve reasoning performance comparable to that of GPT-4-Turbo.

[66] From Rows to Reasoning: A Retrieval-Augmented Multimodal Framework for Spreadsheet Understanding

Anmol Gulati,Sahil Sen,Waqar Sarguroh,Kevin Paul

Main category: cs.CL

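TL;DR: This paper presents FRTR-Bench, the first large-scale benchmark for enterprise-grade multimodal spreadsheet reasoning, and the FRTR framework, which uses fine-grained embeddings and hybrid retrieval to substantially improve how large language models reason over complex spreadsheets.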

Details Motivation: Existing methods scale poorly to large enterprise spreadsheets with multiple linked sheets and embedded visual content and do not reflect how real users interact with them, so a more effective reasoning framework is needed. Method: Proposes the From Rows to Reasoning (FRTR) framework, which decomposes Excel workbooks into row-, column-, and block-level embeddings, applies hybrid lexical-dense retrieval with Reciprocal Rank Fusion, and adds multimodal embeddings to reason over numerical and visual information together. Result: On FRTR-Bench, Claude Sonnet 4.5 with FRTR reaches 74% accuracy, far above the previous state of the art of 24%; on the SpreadsheetLLM benchmark, GPT-5 with FRTR reaches 87% accuracy while using roughly 50% fewer tokens. Conclusion: Through fine-grained decomposition and multimodal retrieval-augmented generation, FRTR substantially improves the reasoning performance and efficiency of large language models on large-scale multimodal spreadsheets and shows good practical potential. Abstract: Large Language Models (LLMs) struggle to reason over large-scale enterprise spreadsheets containing thousands of numeric rows, multiple linked sheets, and embedded visual content such as charts and receipts. Prior state-of-the-art spreadsheet reasoning approaches typically rely on single-sheet compression or full-context encoding, which limits scalability and fails to reflect how real users interact with complex, multimodal workbooks. We introduce FRTR-Bench, the first large-scale benchmark for multimodal spreadsheet reasoning, comprising 30 enterprise-grade Excel workbooks spanning nearly four million cells and more than 50 embedded images. To address these challenges, we present From Rows to Reasoning (FRTR), an advanced, multimodal retrieval-augmented generation framework that decomposes Excel workbooks into granular row, column, and block embeddings, employs hybrid lexical-dense retrieval with Reciprocal Rank Fusion (RRF), and integrates multimodal embeddings to reason over both numerical and visual information. We tested FRTR on six LLMs, achieving 74% answer accuracy on FRTR-Bench with Claude Sonnet 4.5, a substantial improvement over prior state-of-the-art approaches that reached only 24%. On the SpreadsheetLLM benchmark, FRTR achieved 87% accuracy with GPT-5 while reducing token usage by roughly 50% compared to context-compression methods.
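Reciprocal Rank Fusion combines several ranked lists by summing 1 / (k + rank) for each item across the lists. The sketch below fuses a lexical and a dense ranking; the chunk identifiers are made up for illustration, and k = 60 is the commonly used default rather than a value reported for FRTR.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids; each list is ordered best-first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk ids returned by a lexical and a dense retriever.
lexical_ranking = ["row_17", "col_B", "block_3"]
dense_ranking = ["block_3", "row_17", "row_42"]

fused = reciprocal_rank_fusion([lexical_ranking, dense_ranking])
print(fused)  # ['row_17', 'block_3', 'col_B', 'row_42']
```

Because only ranks enter the formula, the lexical and dense scores never need to be normalized onto a shared scale, which is the usual reason for choosing RRF in hybrid retrieval.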

[67] Inferring Latent Intentions: Attributional Natural Language Inference in LLM Agents

Xin Quan,Jiafeng Xiong,Marco Valentino,André Freitas

Main category: cs.CL

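TL;DR: This paper proposes Attributional NLI (Att-NLI), a natural language inference framework that incorporates principles from social psychology to assess how well large language models infer latent intentions in multi-agent environments. In experiments with a textual game called Undercover-V, three LLM agents with different reasoning capabilities are compared, and the agent using a neuro-symbolic system performs best.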

Details Motivation: Traditional natural language inference (NLI) cannot capture the nuanced, intention-driven reasoning needed in complex interactive systems, so a new framework is required to strengthen the attributional reasoning of large language models in multi-agent environments. Method: Proposes the Att-NLI framework, which combines abductive inference (hypothesizing latent intentions) with deductive verification (drawing logical conclusions), and tests three LLM agents in the Undercover-V textual game: a standard NLI agent using only deductive inference, an Att-NLI agent using abductive-deductive inference, and a neuro-symbolic Att-NLI agent that adds external theorem provers. Result: The neuro-symbolic Att-NLI agent performs best, with an average win rate of 17.08%, clearly ahead of the other two agents and showing stronger attributional reasoning. Conclusion: The Att-NLI framework improves LLM intention inference in multi-agent environments and indicates that neuro-symbolic AI has significant potential for building rational agents with sophisticated reasoning capabilities. Abstract: Attributional inference, the ability to predict latent intentions behind observed actions, is a critical yet underexplored capability for large language models (LLMs) operating in multi-agent environments. Traditional natural language inference (NLI), in fact, fails to capture the nuanced, intention-driven reasoning essential for complex interactive systems. To address this gap, we introduce Attributional NLI (Att-NLI), a framework that extends NLI with principles from social psychology to assess an agent's capacity for abductive intentional inference (generating hypotheses about latent intentions), and subsequent deductive verification (drawing valid logical conclusions). We instantiate Att-NLI via a textual game, Undercover-V, experimenting with three types of LLM agents with varying reasoning capabilities and access to external tools: a standard NLI agent using only deductive inference, an Att-NLI agent employing abductive-deductive inference, and a neuro-symbolic Att-NLI agent performing abductive-deductive inference with external theorem provers. Extensive experiments demonstrate a clear hierarchy of attributional inference capabilities, with neuro-symbolic agents consistently outperforming others, achieving an average win rate of 17.08%. Our results underscore the role that Att-NLI can play in developing agents with sophisticated reasoning capabilities, highlighting, at the same time, the potential impact of neuro-symbolic AI in building rational LLM agents acting in multi-agent environments.

[68] TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL

Jinbo Su,Yuxuan Hu,Cuiping Li,Hong Chen,Jia Li,Lintao Ma,Jing Zhang

Main category: cs.CL

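TL;DR: Proposes TableCache, which precomputes per-table KV caches offline while preserving primary foreign key relationships, and combines them with a Table Trie structure and a cache management mechanism. On Text-to-SQL tasks it achieves up to a 3.62x speedup in Time to First Token with negligible performance loss.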

Details Motivation: Existing LLM-based Text-to-SQL methods place extensive database schemas in the prompt, which leads to long contexts and high prefilling latency. In addition, inference engines generate redundant prefix caches for queries that list the same tables in different orders, which is inefficient. Method: Table-level KV caches that preserve primary foreign key relationships are precomputed offline, a Table Trie structure is built to support efficient online lookup, and a cache management system is designed that includes a query reranking strategy and a computation pipeline that parallelises inference and cache loading. Result: Experiments show that TableCache achieves up to a 3.62x speedup in Time to First Token (TTFT) with negligible performance degradation. Conclusion: TableCache improves inference efficiency on Text-to-SQL tasks, substantially reducing long-context overhead through structured cache reuse and offering an efficient inference solution for database interaction scenarios. Abstract: In Text-to-SQL tasks, existing LLM-based methods often include extensive database schemas in prompts, leading to long context lengths and increased prefilling latency. While user queries typically focus on recurrent table sets, which offers an opportunity for KV cache sharing across queries, current inference engines, such as SGLang and vLLM, generate redundant prefix cache copies when processing user queries with varying table orders. To address this inefficiency, we propose precomputing table representations as KV caches offline and querying the required ones online. A key aspect of our approach is the computation of table caches while preserving primary foreign key relationships between tables. Additionally, we construct a Table Trie structure to facilitate efficient KV cache lookups during inference. To enhance cache performance, we introduce a cache management system with a query reranking strategy to improve cache hit rates and a computation loading pipeline for parallelizing model inference and cache loading. Experimental results show that our proposed TableCache achieves up to a 3.62x speedup in Time to First Token (TTFT) with negligible performance degradation.
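
The Table Trie lookup described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the abstract does not say how trie paths are keyed, so the version below assumes each path is a sequence of table names in a canonical (sorted) order, and `cache_id` is a hypothetical handle standing in for a stored KV cache.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Hypothetical handle to a precomputed KV cache; None if nothing is stored here.
        self.cache_id: Optional[str] = None


class TableTrie:
    """Toy trie keyed by table names (an illustration, not the paper's code)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tables: List[str], cache_id: str) -> None:
        node = self.root
        # Assumption: sorting gives one canonical path per table set,
        # so different table orders map to the same cache entry.
        for name in sorted(tables):
            node = node.children.setdefault(name, TrieNode())
        node.cache_id = cache_id

    def longest_prefix(self, tables: List[str]) -> Tuple[Optional[str], int]:
        """Return (cache_id, number of tables covered) for the deepest cached prefix."""
        node = self.root
        best: Tuple[Optional[str], int] = (None, 0)
        for depth, name in enumerate(sorted(tables), start=1):
            child = node.children.get(name)
            if child is None:
                break
            node = child
            if node.cache_id is not None:
                best = (node.cache_id, depth)
        return best


trie = TableTrie()
trie.insert(["orders", "customers"], "kv_customers_orders")
# The query lists the tables in a different order and adds an uncached one.
hit = trie.longest_prefix(["products", "orders", "customers"])
```

Here `hit` identifies the cached two-table prefix, so only the remaining table would still need prefilling in this toy setting.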

[69] To Retrieve or To Think? An Agentic Approach for Context Evolution

Rubing Chen,Jian Wang,Wenjie Li,Xiao-Yong Wei,Qing Li

Main category: cs.CL

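TL;DR: This paper proposes the Agentic Context Evolution (ACE) framework, which mimics human metacognition to decide dynamically when to retrieve new information and when to reason over existing knowledge, improving both efficiency and accuracy on knowledge-intensive tasks.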

Details Motivation: Existing context augmentation methods such as retrieval-augmented generation usually retrieve at every step. This fixed, brute-force strategy is computationally expensive and also degrades performance by introducing irrelevant information. Method: ACE uses a central orchestrator agent that decides by majority voting whether to activate a retriever agent to fetch external information or a reasoner agent to perform internal analysis and refinement, so that retrieval and reasoning alternate dynamically. Result: Experiments on several challenging multi-hop QA benchmarks show that ACE improves accuracy while substantially reducing token consumption, outperforming a range of strong baselines. Conclusion: By eliminating redundant retrieval steps, ACE keeps the context concise and continuously evolving, offering an effective approach and useful insights for context-evolved generation in complex, knowledge-intensive tasks. Abstract: Current context augmentation methods, such as retrieval-augmented generation, are essential for solving knowledge-intensive reasoning tasks. However, they typically adhere to a rigid, brute-force strategy that executes retrieval at every step. This indiscriminate approach not only incurs unnecessary computational costs but also degrades performance by saturating the context with irrelevant noise. To address these limitations, we introduce Agentic Context Evolution (ACE), a framework inspired by human metacognition that dynamically determines whether to seek new evidence or reason with existing knowledge. ACE employs a central orchestrator agent to make decisions strategically via majority voting. It aims to alternate between activating a retriever agent for external retrieval and a reasoner agent for internal analysis and refinement. By eliminating redundant retrieval steps, ACE maintains a concise and evolved context. Extensive experiments on challenging multi-hop QA benchmarks demonstrate that ACE significantly outperforms competitive baselines in accuracy while achieving efficient token consumption. Our work provides valuable insights into advancing context-evolved generation for complex, knowledge-intensive tasks.
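
The retrieve-or-think decision by majority voting can be sketched as a small control loop. Everything below is a hypothetical stand-in: the voters, `retrieve`, and `reason` are plain callables rather than LLM agents, and breaking ties in favour of reasoning is an assumption made here, not something the abstract specifies.

```python
from collections import Counter
from typing import Callable, List

Voter = Callable[[str, List[str]], str]   # returns "retrieve" or "think"
Step = Callable[[str, List[str]], str]    # returns a new piece of context


def decide_action(votes: List[str]) -> str:
    """Majority vote; ties fall back to "think" (an assumption of this sketch)."""
    counts = Counter(votes)
    return "retrieve" if counts["retrieve"] > counts["think"] else "think"


def evolve_context(question: str, voters: List[Voter], retrieve: Step,
                   reason: Step, max_steps: int = 3) -> List[str]:
    context: List[str] = []
    for _ in range(max_steps):
        votes = [voter(question, context) for voter in voters]
        step = retrieve if decide_action(votes) == "retrieve" else reason
        context.append(step(question, context))
    return context


# Hypothetical voters: ask for evidence only while the context is empty.
voters = [lambda q, ctx: "retrieve" if not ctx else "think"] * 3
trace = evolve_context(
    "Who founded the company that makes X?",
    voters,
    retrieve=lambda q, ctx: "doc",
    reason=lambda q, ctx: "note",
)
```

In this toy run the loop retrieves once and then reasons over what it already has, which is the kind of skipped retrieval step the framework is aiming for.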

[70] Spatial Context Improves the Integration of Text with Remote Sensing for Mapping Environmental Variables

Valerie Zermatten,Chiara Vanalli,Gencer Sumbul,Diego Marcos,Devis Tuia

Main category: cs.CL

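TL;DR: This paper proposes an attention-based method that fuses aerial imagery and geolocated text within a spatial neighbourhood to predict environmental variables. The method outperforms unimodal and single-location baselines across a range of ecological variables.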

Details Motivation: Text is an emerging source of ecological information that carries local environmental detail not captured by traditional geospatial data, but its contribution remains poorly defined and integrating it with geospatial data is challenging, so methods that fuse text and image data effectively are needed. Method: An attention-based model combines aerial imagery, geolocated text, and a geolocation encoding, and dynamically selects the neighbouring observations within a spatial neighbourhood that are useful for the prediction task when fusing the modalities. Result: The model is evaluated on the EcoWikiRS dataset for predicting 103 environmental variables. It significantly outperforms single-location and unimodal (image-only or text-only) baselines, with the clearest gains for climatic, edaphic, population, and land use/land cover variables. Conclusion: A multimodal attention model that incorporates geospatial context improves the prediction of environmental variables, confirming the value of combining text and image data in ecological modelling. Abstract: Recent developments in natural language processing highlight text as an emerging data source for ecology. Textual resources carry unique information that can be used in complementarity with geospatial data sources, thus providing insights at the local scale into environmental conditions and properties hidden from more traditional data sources. Leveraging textual information in a spatial context presents several challenges. First, the contribution of textual data remains poorly defined in an ecological context, and it is unclear for which tasks it should be incorporated. Unlike ubiquitous satellite imagery or environmental covariates, the availability of textual data is sparse and irregular; its integration with geospatial data is not straightforward. In response to these challenges, this work proposes an attention-based approach that combines aerial imagery and geolocated text within a spatial neighbourhood, i.e. integrating contributions from several nearby observations. Our approach combines vision and text representations with a geolocation encoding, with an attention-based module that dynamically selects spatial neighbours that are useful for predictive tasks. The proposed approach is applied to the EcoWikiRS dataset, which combines high-resolution aerial imagery with sentences extracted from Wikipedia describing local environmental conditions across Switzerland. Our model is evaluated on the task of predicting 103 environmental variables from the SWECO25 data cube. Our approach consistently outperforms single-location or unimodal, i.e. image-only or text-only, baselines.
When analysing variables by thematic groups, results show a significant improvement in performance for climatic, edaphic, population and land use/land cover variables, underscoring the benefit of including the spatial context when combining text and image data.
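
To make the idea of an attention module that weights spatial neighbours concrete, here is a minimal sketch using scaled dot-product attention. It is an assumption that the paper's module resembles this standard form; the vectors below are made-up toy values, not features from EcoWikiRS.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_neighbours(query: Sequence[float],
                      keys: Sequence[Sequence[float]],
                      values: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Weight each neighbour by its similarity to the query, then average the values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    fused = [sum(w * value[j] for w, value in zip(weights, values))
             for j in range(len(values[0]))]
    return weights, fused


# Toy example: one target location (query) and three neighbouring observations.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
weights, fused = attend_neighbours(query, keys, values)
```

The neighbour whose key is most similar to the query receives the largest weight, so its value dominates the fused representation.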

[71] Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge

Yao Tang,Li Dong,Yaru Hao,Qingxiu Dong,Furu Wei,Jiatao Gu

Main category: cs.CL

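TL;DR: Proposes Multiplex Thinking, a stochastic soft reasoning mechanism based on continuous multiplex tokens. By aggregating the embeddings of several candidate tokens, it reasons efficiently and compactly while preserving the vocabulary embedding prior, and it outperforms conventional discrete CoT methods.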

Details Motivation: Chain-of-Thought (CoT) is effective for large language models but produces long sequences. Inspired by the way humans reason softly over a distribution of possibilities, the authors aim to build a more efficient, self-adaptive reasoning mechanism. Method: At each reasoning step, K candidate tokens are sampled and their embeddings are aggregated into a single continuous multiplex token. The scheme supports on-policy reinforcement learning and preserves the sampling dynamics and embedding prior of standard generation. Result: On math reasoning benchmarks, Multiplex Thinking outperforms strong baselines from Pass@1 through Pass@1024 while generating shorter sequences. Conclusion: Multiplex Thinking is a self-adaptive, efficient reasoning method that behaves almost like discrete reasoning when the model is confident and represents several plausible paths when it is uncertain, improving performance while shortening reasoning length. Abstract: Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR-Penn/Multiplex-Thinking.
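
The sample-then-aggregate step can be illustrated in a few lines. The abstract does not state the aggregation rule, so this sketch assumes a plain average of the K sampled embeddings; the three-token vocabulary and two-dimensional embeddings are toy values chosen only to show the self-adaptive behaviour.

```python
import random
from typing import List, Sequence


def multiplex_token(probs: Sequence[float],
                    embeddings: Sequence[Sequence[float]],
                    k: int,
                    rng: random.Random) -> List[float]:
    """Sample k token ids from `probs` and average their embeddings.

    Averaging is an assumption of this sketch, not a detail confirmed by the abstract.
    """
    ids = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    return [sum(embeddings[i][d] for i in ids) / k for d in range(dim)]


# Toy vocabulary of three tokens with 2-D embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)

# Confident model: every sample is token 0, so the result is that token's embedding.
confident = multiplex_token([1.0, 0.0, 0.0], embeddings, k=4, rng=rng)

# Uncertain model: the result mixes tokens 0 and 1 without adding sequence length.
uncertain = multiplex_token([0.5, 0.5, 0.0], embeddings, k=4, rng=rng)
```

With a peaked distribution the multiplex token collapses to an ordinary discrete token, while a flat distribution yields a blend of candidates in a single position.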

[72] Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System

Hsiang-Wei Huang,Junbin Lu,Kuang-Ming Chen,Jenq-Neng Hwang

Main category: cs.CL

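TL;DR: This study examines the dynamics of large language model (LLM) agent reviewers in an Elo-ranked review system. Using real-world conference paper submissions and simulated multi-round review interactions, it finds that introducing Elo ratings improves Area Chair decision accuracy, but also that reviewers learn to exploit the system strategically without increasing review effort.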

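Details Motivation: To explore how LLM agents behave in academic peer review and how this affects review quality, in particular how the dynamics change once an Elo ranking mechanism is introduced. Method: Multiple LLM agent reviewers with different personas take part in multi-round review interactions moderated by an Area Chair, and a baseline setting is compared with conditions that add Elo ratings and reviewer memory. Result: Introducing Elo ratings improves the accuracy of Area Chair decisions. At the same time, reviewers develop adaptive strategies that exploit the Elo system without a corresponding increase in review effort. Conclusion: An Elo rating system can improve review decisions but can also be gamed, which suggests that incentive mechanisms need further refinement to safeguard review quality. Abstract: In this work, we explore the Large Language Model (LLM) agent reviewer dynamics in an Elo-ranked review system using real-world conference paper submissions. Multiple LLM agent reviewers with different personas engage in multi-round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as reviewers' adaptive review strategy that exploits our Elo system without improving review effort. Our code is available at https://github.com/hsiangwei0903/EloReview.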
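
For readers unfamiliar with Elo ratings, the classic two-player update is sketched below. The abstract does not describe how the paper adapts Elo to reviewers, so the pairwise comparison, the K-factor of 32, and the starting rating of 1500 are all assumptions borrowed from the standard chess formulation.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A scores against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return the new ratings after one comparison.

    score_a is 1.0 if A is judged better, 0.0 if worse, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two hypothetical reviewers start level; reviewer A's review is judged more helpful.
reviewer_a, reviewer_b = elo_update(1500.0, 1500.0, score_a=1.0)
```

Because the update is zero-sum, one reviewer's gain is the other's loss, which is one reason a rating of this kind can reward strategic behaviour as well as effort.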

cs.CV [Back]

[73] Edge-AI Perception Node for Cooperative Road-Safety Enforcement and Connected-Vehicle Integration

Shree Charran R,Rahul Kumar Dubey

Main category: cs.CV

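TL;DR: This paper presents a real-time, edge-AI roadside perception node for multi-class traffic violation analytics and safety event publication. It combines YOLOv8 Nano, DeepSORT, and rule-guided OCR to deliver efficient, accurate detection on an NVIDIA Jetson Nano, and supports V2X cooperative perception and intelligent traffic management.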

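Details Motivation: Rapid motorization in emerging economies such as India has caused a surge in traffic violations while police resources remain severely limited. Traditional manual enforcement cannot keep up, so an autonomous, cooperative, and energy-efficient edge-AI perception system is needed to improve enforcement efficiency and road safety. Method: A roadside perception node is designed and implemented that integrates YOLOv8 Nano (multi-object detection), DeepSORT (vehicle trajectory tracking), and rule-guided OCR (recognition of multilingual and degraded license plates). It is deployed on an NVIDIA Jetson Nano, optimized with TensorRT FP16 quantization, and sends CAM/DENM safety messages over V2X protocols. Result: Across five violation classes (signal jumping, zebra crossing breach, wrong-way driving, illegal U-turn, and speeding), the system reaches 97.7% detection accuracy and 84.9% OCR precision at 28-30 FPS while drawing only 9.6 W. Compared with models such as YOLOv4 Tiny, mAP improves by 10.7% and accuracy per watt by 1.4 times, with no manual ROI calibration required. Conclusion: The edge-AI perception node supports real-time traffic violation analytics and safety event dissemination, substantially improving the automation and cooperative capability of intelligent transportation systems and offering a feasible technical path for proactive safety management within the IEEE Intelligent Vehicles ecosystem. Abstract: Rapid motorization in emerging economies such as India has created severe enforcement asymmetries, with over 11 million recorded violations in 2023 against a human policing density of roughly one officer per 4000 vehicles. Traditional surveillance and manual ticketing cannot scale to this magnitude, motivating the need for an autonomous, cooperative, and energy efficient edge AI perception infrastructure. This paper presents a real time roadside perception node for multi class traffic violation analytics and safety event dissemination within a connected and intelligent vehicle ecosystem. The node integrates YOLOv8 Nano for high accuracy multi object detection, DeepSORT for temporally consistent vehicle tracking, and a rule guided OCR post processing engine capable of recognizing degraded or multilingual license plates compliant with MoRTH AIS 159 and ISO 7591 visual contrast standards. Deployed on an NVIDIA Jetson Nano with a 128 core Maxwell GPU and optimized via TensorRT FP16 quantization, the system sustains 28 to 30 frames per second inference at 9.6 W, achieving 97.7 percent violation detection accuracy and 84.9 percent OCR precision across five violation classes, namely signal jumping, zebra crossing breach, wrong way driving, illegal U turn, and speeding, without manual region of interest calibration.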
Comparative benchmarking against YOLOv4 Tiny, PP YOLOE S, and Nano DetPlus demonstrates a 10.7 percent mean average precision gain and a 1.4 times accuracy per watt improvement. Beyond enforcement, the node publishes standardized safety events of CAM and DENM type to connected vehicles and intelligent transportation system backends via V2X protocols, demonstrating that roadside edge AI analytics can augment cooperative perception and proactive road safety management within the IEEE Intelligent Vehicles ecosystem.

[74] An Empirical Study on Knowledge Transfer under Domain and Label Shifts in 3D LiDAR Point Clouds

Subeen Lee,Siyeong Lee,Namil Kim,Jaesik Choi

Main category: cs.CV

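TL;DR: Proposes the ROAD benchmark for evaluating the robustness of LiDAR point cloud classification in continual learning scenarios with domain shift and label evolution. Using large-scale datasets, it exposes the limitations of existing methods and establishes strong baselines.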

Details Motivation: In real-world applications, 3D perception systems must cope with evolving object definitions and sensor domains, yet continual learning under simultaneous domain and label shifts remains underexplored. Method: The RObust Autonomous driving under Dataset shifts (ROAD) benchmark is built on large-scale datasets including Waymo, NuScenes, and Argoverse2. It evaluates zero-shot transfer, linear probing, and continual learning (CL) methods, and analyzes the impact of backbone architectures, training objectives, and CL methods. Result: The study reveals the performance limitations of existing 3D point cloud classification methods under realistic shifts and characterizes how different architectures and learning strategies behave under label evolution in the form of class split, class expansion, and class insertion. Conclusion: ROAD provides an important benchmark for continual and robust 3D perception and encourages future research on improving model adaptability in complex, dynamic environments. Abstract: For 3D perception systems to be practical in real-world applications -- from autonomous driving to embodied AI -- models must adapt to continuously evolving object definitions and sensor domains. Yet, research on continual and transfer learning in 3D point cloud perception remains underexplored compared to 2D vision -- particularly under simultaneous domain and label shifts. To address this gap, we propose the RObust Autonomous driving under Dataset shifts (ROAD) benchmark, a comprehensive evaluation suite for LiDAR-based object classification that explicitly accounts for domain shifts as well as three key forms of label evolution: class split, class expansion, and class insertion. Using large-scale datasets (Waymo, NuScenes, Argoverse2), we evaluate zero-shot transfer, linear probe, and CL, and analyze the impact of backbone architectures, training objectives, and CL methods. Our findings reveal limitations of existing approaches under realistic shifts and establish strong baselines for future research in robust 3D perception.

[75] Moonworks Lunara Aesthetic Dataset

Yan Wang,M M Sayeef Abdullah,Partho Hassan,Sabit Hassan

Main category: cs.CV

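TL;DR: The Lunara Aesthetic Dataset is a high-quality, stylistically diverse art image dataset with fine-grained annotations and high aesthetic scores, released under the Apache 2.0 license.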

Details Motivation: To create an art image dataset that emphasizes aesthetic quality, stylistic diversity, and annotation precision, addressing the shortcomings of existing web-derived datasets in precision and aesthetics. Method: Images covering a variety of regional and artistic styles are generated with the Moonworks Lunara model and paired with human-refined prompts and structured annotations. Result: The outcome is described as a first-of-its-kind art dataset whose aesthetic scores substantially exceed those of existing aesthetics-focused and general-purpose datasets, with high quality, diverse styles, and transparent licensing. Conclusion: The Lunara Aesthetic Dataset offers a high-standard resource for research on artistic style, supports both academic and commercial use, and promotes the development of high-quality visual content. Abstract: The dataset spans diverse artistic styles, including regionally grounded aesthetics from the Middle East, Northern Europe, East Asia, and South Asia, alongside general categories such as sketch and oil painting. All images are generated using the Moonworks Lunara model and intentionally crafted to embody distinct, high-quality aesthetic styles, yielding a first-of-its-kind dataset with substantially higher aesthetic scores that exceed even those of aesthetics-focused datasets, and those of general-purpose datasets by a larger margin. Each image is accompanied by a human-refined prompt and structured annotations that jointly describe salient objects, attributes, relationships, and stylistic cues. Unlike large-scale web-derived datasets that emphasize breadth over precision, the Lunara Aesthetic Dataset prioritizes aesthetic quality, stylistic diversity, and licensing transparency, and is released under the Apache 2.0 license to support research and unrestricted academic and commercial use.

[76] LWMSCNN-SE: A Lightweight Multi-Scale Network for Efficient Maize Disease Classification on Edge Devices

Fikadu Weloday,Jianmei Su

Main category: cs.CV

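TL;DR: Proposes LWMSCNN-SE, a lightweight convolutional neural network for maize disease classification that balances high accuracy with low computational cost and is suited to real-time detection on edge devices.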

Details Motivation: Traditional disease detection models are computationally expensive to deploy in resource-constrained settings such as smartphones and drones, which makes real-time use difficult. Method: LWMSCNN-SE is a lightweight CNN that combines multi-scale feature extraction, depthwise separable convolutions, and Squeeze-and-Excitation attention. Result: The model reaches 96.63% classification accuracy with only 241,348 parameters and 0.666 GFLOPs. Conclusion: LWMSCNN-SE maintains high accuracy while substantially reducing computational overhead, making it suitable for deployment on edge devices and for real-time disease diagnosis in precision agriculture. Abstract: Maize disease classification plays a vital role in mitigating yield losses and ensuring food security. However, the deployment of traditional disease detection models in resource-constrained environments, such as those using smartphones and drones, faces challenges due to high computational costs. To address these challenges, we propose LWMSCNN-SE, a lightweight convolutional neural network (CNN) that integrates multi-scale feature extraction, depthwise separable convolutions, and Squeeze-and-Excitation (SE) attention mechanisms. This novel combination enables the model to achieve 96.63% classification accuracy with only 241,348 parameters and 0.666 GFLOPs, making it suitable for real-time deployment in field applications. Our approach addresses the accuracy-efficiency trade-off by delivering high accuracy while maintaining low computational costs, demonstrating its potential for efficient maize disease diagnosis on edge devices in precision farming systems.
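
A quick parameter count shows why depthwise separable convolutions and SE blocks keep a network small. The layer sizes below (3x3 kernel, 64 to 128 channels, reduction ratio 16) are illustrative assumptions, not the configuration of LWMSCNN-SE, and biases are ignored for simplicity.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Depthwise kernel per input channel plus a 1x1 pointwise convolution."""
    return kernel * kernel * c_in + c_in * c_out


def se_params(channels: int, reduction: int) -> int:
    """Two fully connected layers of a Squeeze-and-Excitation block (bias ignored)."""
    hidden = channels // reduction
    return 2 * channels * hidden


standard = conv_params(3, 64, 128)                    # 3*3*64*128
separable = depthwise_separable_params(3, 64, 128)    # 3*3*64 + 64*128
attention = se_params(128, 16)                        # 2*128*8
saving = standard / separable
```

For this hypothetical layer the separable form needs roughly an eighth of the weights, and adding the SE block costs only a couple of thousand more.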

[77] 3DGS-Drag: Dragging Gaussians for Intuitive Point-Based 3D Editing

Jiahua Dong,Yu-Xiong Wang

Main category: cs.CV

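TL;DR: This paper proposes 3DGS-Drag, a point-based 3D editing framework that combines 3D Gaussian Splatting with diffusion models to enable efficient, intuitive drag editing of 3D scenes.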

Details Motivation: Existing 3D editing methods are limited in geometry-related editing tasks, and drag editing, although intuitive in 2D, is difficult to apply directly to 3D scenes. Method: The 3DGS-Drag framework uses 3D Gaussian Splatting for deformation guidance to keep geometry consistent, introduces a diffusion model for content correction and visual quality enhancement, and adopts a progressive editing strategy to support large edits. Result: Experiments show state-of-the-art geometry-related editing across a variety of real 3D scenes, with each edit taking only 10 to 20 minutes on a single RTX 4090 GPU. Conclusion: 3DGS-Drag brings together deformation-based and 2D-editing-based ideas, providing an efficient and intuitive drag editing solution for real 3D scenes. Abstract: The transformative potential of 3D content creation has been progressively unlocked through advancements in generative models. Recently, intuitive drag editing with geometric changes has attracted significant attention in 2D editing yet remains challenging for 3D scenes. In this paper, we introduce 3DGS-Drag -- a point-based 3D editing framework that provides efficient, intuitive drag manipulation of real 3D scenes. Our approach bridges the gap between deformation-based and 2D-editing-based 3D editing methods, addressing their limitations in geometry-related content editing. We leverage two key innovations: deformation guidance utilizing 3D Gaussian Splatting for consistent geometric modifications and diffusion guidance for content correction and visual quality enhancement. A progressive editing strategy further supports aggressive 3D drag edits. Our method enables a wide range of edits, including motion change, shape adjustment, inpainting, and content extension. Experimental results demonstrate the effectiveness of 3DGS-Drag in various scenes, achieving state-of-the-art performance in geometry-related 3D content editing. Notably, the editing is efficient, taking 10 to 20 minutes on a single RTX 4090 GPU.

[78] Sesame Plant Segmentation Dataset: A YOLO Formatted Annotated Dataset

Sunusi Ibrahim Muhammad,Ismail Ismail Tijjani,Saadatu Yusuf Jumare,Fatima Isah Jibrin

Main category: cs.CV

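TL;DR: This paper presents the Sesame Plant Segmentation Dataset, an open-source annotated image dataset focused on the early growth stages of sesame plants. It uses a pixel-level segmentation format and is intended for developing agricultural AI models.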

Details Motivation: To support AI models for sesame plants in agriculture, in particular to improve precise detection and analysis of sesame plants in complex field conditions. Method: High-resolution mobile camera images were collected from farms in Nigeria, comprising 206 training, 43 validation, and 43 test images. They were annotated at pixel level with Segment Anything Model v2 under farmer supervision, stored in a YOLO-compatible segmentation format, and evaluated with the Ultralytics YOLOv8 framework. Result: For detection, the model achieves 79% recall and 79% precision, with mAP@0.5 of 84% and mAP@0.5:0.95 of 58%. For segmentation, it achieves 82% recall and 77% precision, with mAP@0.5 of 84% and mAP@0.5:0.95 of 52%. Conclusion: The dataset fills a gap in visual data for sesame crops in Nigeria and supports practical applications such as plant monitoring, yield estimation, and agricultural research. Abstract: This paper presents the Sesame Plant Segmentation Dataset, an open source annotated image dataset designed to support the development of artificial intelligence models for agricultural applications, with a specific focus on sesame plants. The dataset comprises 206 training images, 43 validation images, and 43 test images in YOLO compatible segmentation format, capturing sesame plants at early growth stages under varying environmental conditions. Data were collected using a high resolution mobile camera from farms in Jirdede, Daura Local Government Area, Katsina State, Nigeria, and annotated using the Segment Anything Model version 2 with farmer supervision. Unlike conventional bounding box datasets, this dataset employs pixel level segmentation to enable more precise detection and analysis of sesame plants in real world farm settings. Model evaluation using the Ultralytics YOLOv8 framework demonstrated strong performance for both detection and segmentation tasks. For bounding box detection, the model achieved a recall of 79 percent, precision of 79 percent, mean average precision at IoU 0.50 of 84 percent, and mean average precision from 0.50 to 0.95 of 58 percent. For segmentation, it achieved a recall of 82 percent, precision of 77 percent, mean average precision at IoU 0.50 of 84 percent, and mean average precision from 0.50 to 0.95 of 52 percent. The dataset represents a novel contribution to sesame focused agricultural vision datasets in Nigeria and supports applications such as plant monitoring, yield estimation, and agricultural research.

[79] An Efficient Additive Kolmogorov-Arnold Transformer for Point-Level Maize Localization in Unmanned Aerial Vehicle Imagery

Fei Li,Lang Qiao,Jiahao Fan,Yijia Xu,Shawn M. Kaeppler,Zhou Zhang

Main category: cs.CV

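TL;DR: This paper proposes AKT, a new Transformer model for point-level maize plant localization in high-resolution UAV imagery. It draws on Kolmogorov-Arnold theory to improve small-object detection and is shown to outperform prior methods on real field data.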

Details Motivation: Maize plants occupy a tiny fraction of each UAV image, computation is expensive, and agricultural scenes are complex, so existing methods struggle with accurate point-level localization and a more efficient, purpose-built model is needed. Method: The Additive Kolmogorov-Arnold Transformer (AKT) replaces conventional MLPs with Pade Kolmogorov-Arnold Network (PKAN) modules to strengthen small-object feature representation and introduces PKAN Additive Attention (PAA) to reduce computational complexity. The authors also build the Point-based Maize Localization (PML) dataset, with 1,928 images and about 501,000 point annotations. Result: AKT reaches an F1-score of 62.8%, 4.2% higher than the best existing method, while reducing FLOPs by 12.6% and increasing inference throughput by 20.7%. In downstream tasks, the mean absolute error for stand counting is 7.1 and the RMSE for interplant spacing estimation is 1.95-1.97 cm. Conclusion: Combining Kolmogorov-Arnold representation theory with efficient attention provides an effective and efficient solution for small-object detection in high-resolution agricultural remote sensing. Abstract: High-resolution UAV photogrammetry has become a key technology for precision agriculture, enabling centimeter-level crop monitoring and point-level plant localization. However, point-level maize localization in UAV imagery remains challenging due to (1) extremely small object-to-pixel ratios, typically less than 0.1%, (2) prohibitive computational costs of quadratic attention on ultra-high-resolution images larger than 3000 x 4000 pixels, and (3) agricultural scene-specific complexities such as sparse object distribution and environmental variability that are poorly handled by general-purpose vision models. To address these challenges, we propose the Additive Kolmogorov-Arnold Transformer (AKT), which replaces conventional multilayer perceptrons with Pade Kolmogorov-Arnold Network (PKAN) modules to enhance functional expressivity for small-object feature extraction, and introduces PKAN Additive Attention (PAA) to model multiscale spatial dependencies with reduced computational complexity. In addition, we present the Point-based Maize Localization (PML) dataset, consisting of 1,928 high-resolution UAV images with approximately 501,000 point annotations collected under real field conditions. Extensive experiments show that AKT achieves an average F1-score of 62.8%, outperforming state-of-the-art methods by 4.2%, while reducing FLOPs by 12.6% and improving inference throughput by 20.7%.
For downstream tasks, AKT attains a mean absolute error of 7.1 in stand counting and a root mean square error of 1.95-1.97 cm in interplant spacing estimation. These results demonstrate that integrating Kolmogorov-Arnold representation theory with efficient attention mechanisms offers an effective framework for high-resolution agricultural remote sensing.

[80] Likelihood ratio for a binary Bayesian classifier under a noise-exclusion model

Howard C. Gifford

Main category: cs.CV

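TL;DR: Proposes a new statistical ideal observer model that performs holistic visual search processing by placing minimum thresholds on extractable image features, reducing the number of free parameters. It is applicable to medical image perception, computer vision, and target detection.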

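Details Motivation: Optimizing medical imaging systems and algorithms, and improving performance on visual search tasks, calls for an ideal observer model that can mimic holistic human visual processing. Method: A new statistical ideal observer model is developed that implements holistic visual search processing by placing minimum thresholds on extractable image features, which also reduces the number of free parameters in the system. Result: The model simplifies the system structure and shows potential in medical image perception, computer vision, target detection and recognition, and sensor evaluation. Conclusion: The proposed model offers a theoretical framework and performance benchmark for medical image analysis, computer vision, and target detection in defense and security. Abstract: We develop a new statistical ideal observer model that performs holistic visual search (or gist) processing in part by placing thresholds on minimum extractable image features. In this model, the ideal observer reduces the number of free parameters thereby shrinking down the system. The applications of this novel framework are in medical image perception (for optimizing imaging systems and algorithms), computer vision, benchmarking performance and enabling feature selection/evaluations. Other applications are in target detection and recognition in defense/security as well as evaluating sensors and detectors.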

[81] Predicting Region of Interest in Human Visual Search Based on Statistical Texture and Gabor Features

Hongwei Lin,Diego Andrade,Mini Das,Howard C. Gifford

Main category: cs.CV

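TL;DR: This study investigates the relationship between Gabor features and gray-level co-occurrence matrix (GLCM) texture features in modeling early-stage visual search behavior. It proposes two feature-combination pipelines and checks their agreement with human gaze behavior on simulated breast tomosynthesis images.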

Details Motivation: Understanding human visual search behavior is central to vision science and computer vision, especially how attention is allocated in search tasks where the target location is unknown. Method: Two pipelines that combine Gabor features with GLCM texture features are proposed to narrow the region of likely human fixations. They are evaluated on simulated digital breast tomosynthesis images, and eye-tracking data are used to check the model predictions. Result: The fixation candidates predicted by the pipelines agree qualitatively with a threshold-based model observer, GLCM mean and Gabor feature responses are strongly correlated, and eye-tracking data from human observers indicate that the predicted regions are consistent with early-stage gaze behavior. Conclusion: Combining structural and texture features helps model early-stage visual search behavior more accurately and supports the development of perceptually informed observer models. Abstract: Understanding human visual search behavior is a fundamental problem in vision science and computer vision, with direct implications for modeling how observers allocate attention in location-unknown search tasks. In this study, we investigate the relationship between Gabor-based features and gray-level co-occurrence matrix (GLCM) based texture features in modeling early-stage visual search behavior. Two feature-combination pipelines are proposed to integrate Gabor and GLCM features for narrowing the region of possible human fixations. The pipelines are evaluated using simulated digital breast tomosynthesis images. Results show qualitative agreement among fixation candidates predicted by the proposed pipelines and a threshold-based model observer. A strong correlation is observed between GLCM mean and Gabor feature responses, indicating that these features encode related image information despite their different formulations. Eye-tracking data from human observers further suggest consistency between predicted fixation regions and early-stage gaze behavior. These findings highlight the value of combining structural and texture-based features for modeling visual search and support the development of perceptually informed observer models.
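
Since the GLCM mean features in the reported correlation, a small worked example may help. The sketch below builds a normalized, non-symmetric co-occurrence matrix for a horizontal neighbour offset and computes the mean as the sum of i * p(i, j); the offset, the 3x3 toy patch, and this particular definition of GLCM mean are assumptions, since the abstract does not give the paper's settings.

```python
from typing import List


def glcm(image: List[List[int]], levels: int, dx: int = 1, dy: int = 0) -> List[List[float]]:
    """Normalized co-occurrence matrix p(i, j) for pixel pairs offset by (dy, dx)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    return [[n / total for n in row] for row in counts]


def glcm_mean(p: List[List[float]]) -> float:
    """Mean gray level of the reference pixel: sum over i, j of i * p(i, j)."""
    return sum(i * value for i, row in enumerate(p) for value in row)


# Toy 3x3 patch with three gray levels (0, 1, 2).
patch = [
    [0, 0, 1],
    [1, 2, 2],
    [0, 1, 2],
]
p = glcm(patch, levels=3)
mean = glcm_mean(p)
```

The patch has six horizontal pixel pairs, three of which start at gray level 0, two at level 1, and one at level 2, so the mean works out to (0*3 + 1*2 + 2*1) / 6 = 2/3.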

[82] CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

Chaoyu Li,Deeparghya Dutta Barua,Fei Tao,Pooyan Fazli

Main category: cs.CV

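TL;DR: This paper proposes CASHEW and CASHEW-RL, which use test-time scaling and reinforcement learning to stabilize multi-step reasoning in vision-language models, yielding significant gains on 13 benchmarks.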

Details Motivation: Multi-step reasoning in current vision-language models is unstable: repeated sampling on the same input produces different reasoning trajectories and inconsistent predictions. Method: CASHEW is an inference-time framework that aggregates multiple candidate reasoning trajectories and uses visual verification to filter hallucinated steps. CASHEW-RL is a reinforcement learning variant trained with Group Sequence Policy Optimization (GSPO) that internalizes the aggregation behavior and allocates reasoning effort adaptively according to task difficulty. Result: Experiments on 13 image and video understanding benchmarks show significant improvements, for example up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema. Conclusion: CASHEW and CASHEW-RL improve the reasoning stability and accuracy of vision-language models, enabling reliable multi-step reasoning grounded in visual evidence. Abstract: Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.

[83] TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models

Xin Jin,Yichuan Zhong,Yapeng Tian

Main category: cs.CV

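TL;DR: This paper proposes TP-Blend, a training-free, text-conditioned diffusion editing framework that introduces a new object and a new style at the same time, using a twin-prompt attention fusion mechanism for precise control over content and appearance.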

Details Motivation: Existing text-conditioned diffusion models perform poorly when an object and a style must be replaced simultaneously, and lack independent, fine-grained control over content and appearance. Method: The TP-Blend framework contains two attention processors. Cross-Attention Object Fusion (CAOF) reassigns multi-head feature vectors through optimal transport to fuse the object. Self-Attention Style Fusion (SASF) fuses style using Detail-Sensitive Instance Normalization and high-frequency texture injection, and swaps the Key/Value matrices for context-aware texture modulation. Result: Experiments show that TP-Blend produces photo-realistic edits at high resolution and outperforms existing methods in fidelity, perceptual quality, and inference speed. Conclusion: TP-Blend is an efficient, training-free twin-prompt editing method that decouples and jointly controls object and style edits, offering a new approach to editing with diffusion models. Abstract: Current text-conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight training-free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors. Cross-Attention Object Fusion (CAOF) first averages head-wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy-regularised optimal transport problem that reassigns complete multi-head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD-XL), preserving rich cross-head correlations while keeping memory low. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization. A lightweight one-dimensional Gaussian filter separates low- and high-frequency components; only the high-frequency residual is blended back, imprinting brush-stroke-level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context-aware texture modulation that remains independent of object fusion.
Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
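
The frequency split used for style injection (a one-dimensional Gaussian filter, with only the high-frequency residual blended back) can be sketched on plain lists. This is a simplified illustration under stated assumptions: the real method operates on self-attention features inside Detail-Sensitive Instance Normalization, whereas the kernel radius, sigma, edge handling, and `alpha` blending weight below are all hypothetical choices.

```python
import math
from typing import List


def gaussian_kernel(radius: int, sigma: float) -> List[float]:
    raw = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(raw)
    return [w / total for w in raw]  # normalize so the weights sum to 1


def low_pass(signal: List[float], kernel: List[float]) -> List[float]:
    """1-D Gaussian blur with replicated edges."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for offset, weight in enumerate(kernel):
            j = min(max(i + offset - radius, 0), n - 1)
            acc += weight * signal[j]
        out.append(acc)
    return out


def blend_high_freq(content: List[float], style: List[float],
                    kernel: List[float], alpha: float = 1.0) -> List[float]:
    """Add only the style signal's high-frequency residual to the content."""
    style_low = low_pass(style, kernel)
    residual = [s - low for s, low in zip(style, style_low)]
    return [c + alpha * r for c, r in zip(content, residual)]


kernel = gaussian_kernel(radius=1, sigma=1.0)
content = [1.0] * 6
flat = blend_high_freq(content, [5.0] * 6, kernel)                      # no texture to transfer
textured = blend_high_freq(content, [0.0, 1.0, 0.0, 1.0, 0.0, 1.0], kernel)
```

A flat style signal has no high-frequency residual and leaves the content untouched, while an oscillating one imprints its fine detail without shifting the content's overall level much.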

[84] Decoder Generates Manufacturable Structures: A Framework for 3D-Printable Object Synthesis

Abhishek Kumar

Main category: cs.CV

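TL;DR: Proposes a decoder-based deep learning approach for generating manufacturable 3D structures, targeting geometric validity and manufacturing constraints in additive manufacturing.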

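Details Motivation: Conventional generative methods often ignore manufacturing constraints, producing 3D structures that are hard to print in practice, so a generative framework that satisfies manufacturability automatically is needed. Method: A deep learning decoder framework maps latent representations to 3D geometries that respect manufacturing constraints such as overhang angles, wall thickness, and structural integrity. Result: The method generates geometrically valid, printable structures across diverse object categories, significantly improves manufacturability compared with naive generation approaches, and is validated through actual 3D printing. Conclusion: Neural decoders can learn the complex mapping from abstract representations to manufacturable 3D structures, providing a practical and efficient approach to generative design for additive manufacturing. Abstract: This paper presents a novel decoder-based approach for generating manufacturable 3D structures optimized for additive manufacturing. We introduce a deep learning framework that decodes latent representations into geometrically valid, printable objects while respecting manufacturing constraints such as overhang angles, wall thickness, and structural integrity. The methodology demonstrates that neural decoders can learn complex mapping functions from abstract representations to valid 3D geometries, producing parts with significantly improved manufacturability compared to naive generation approaches. We validate the approach on diverse object categories and demonstrate practical 3D printing of decoder-generated structures.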

[85] Representations of Text and Images Align From Layer One

Evžen Wybitul,Javier Rando,Florian Tramèr,Stanislav Fort

Main category: cs.CV

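TL;DR: This paper proposes a new synthesis-based method showing that adapter-based vision-language models already exhibit substantial image-text concept alignment in their early layers, challenging the conventional view that alignment emerges only in deep layers.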

Details Motivation: To challenge the established view that image and text representations in vision-language models align only in deep layers, and to test whether meaningful concept-level alignment also exists in early layers. Method: Inspired by DeepDream, the synthesis method extracts the vector of a textual concept at a given layer, optimizes an image so that its representation aligns with that vector, and analyzes image-text alignment layer by layer. Result: In experiments on hundreds of concepts across seven layers of Gemma 3, more than 50% of the generated images already show recognizable visual features of the target concepts (such as animals, activities, and seasons), even at the first layer. Conclusion: Concept-level image-text alignment is present in vision-language models from the first layer. The method provides direct, constructive evidence of multimodal alignment, is fast and simple, requires no auxiliary models or datasets, and supports research on model interpretability. Abstract: We show that for a variety of concepts in adapter-based vision-language models, the representations of their images and their text descriptions are meaningfully aligned from the very first layer. This contradicts the established view that such image-text alignment only appears in late layers. We show this using a new synthesis-based method inspired by DeepDream: given a textual concept such as "Jupiter", we extract its concept vector at a given layer, and then use optimisation to synthesise an image whose representation aligns with that vector. We apply our approach to hundreds of concepts across seven layers in Gemma 3, and find that the synthesised images often depict salient visual features of the targeted textual concepts: for example, already at layer 1, more than 50% of images depict recognisable features of animals, activities, or seasons. Our method thus provides direct, constructive evidence of image-text alignment on a concept-by-concept and layer-by-layer basis. Unlike previous methods for measuring multimodal alignment, our approach is simple, fast, and does not require auxiliary models or datasets. It also offers a new path towards model interpretability, by providing a way to visualise a model's representation space by backtracing through its image processing components.

[86] Training Free Zero-Shot Visual Anomaly Localization via Diffusion Inversion

Samet Hicsonmez,Abd El Rahman Shabayek,Djamila Aouada

Main category: cs.CV

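TL;DR: Proposes DIVAD, a training-free, vision-only zero-shot anomaly detection framework that localises anomalies through image inversion and reconstruction with a pretrained Denoising Diffusion Implicit Model (DDIM), achieving state-of-the-art performance without fine-grained prompts.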

Details Motivation: Existing vision-only zero-shot anomaly detection methods are limited to image-level classification and lack spatial localisation, while localisation-capable methods depend on fine-grained prompts; the goal is precise anomaly localisation without extra modalities or fine-grained language prompts. Method: Using a pretrained DDIM, the input image is inverted into latent space and denoising is restarted from a fixed intermediate timestep, guided by a generic text description (e.g., "an image of an [object class]"), to reconstruct a normal-looking image; anomalous regions are localised by comparing the input with the reconstruction. Result: The method reaches state-of-the-art performance on the VISA dataset, clearly outperforming existing vision-only zero-shot anomaly detection methods and showing strong localisation ability. Conclusion: High-performing zero-shot anomaly detection and localisation can be achieved through diffusion-model inversion alone, without training or fine-grained language prompts, moving the field towards less dependence on prompt engineering. Abstract: Zero-Shot image Anomaly Detection (ZSAD) aims to detect and localise anomalies without access to any normal training samples of the target data. While recent ZSAD approaches leverage additional modalities such as language to generate fine-grained prompts for localisation, vision-only methods remain limited to image-level classification, lacking spatial precision. In this work, we introduce a simple yet effective training-free vision-only ZSAD framework that circumvents the need for fine-grained prompts by leveraging the inversion of a pretrained Denoising Diffusion Implicit Model (DDIM). Specifically, given an input image and a generic text description (e.g., "an image of an [object class]"), we invert the image to obtain latent representations and initiate the denoising process from a fixed intermediate timestep to reconstruct the image. Since the underlying diffusion model is trained solely on normal data, this process yields a normal-looking reconstruction. The discrepancy between the input image and the reconstructed one highlights potential anomalies. Our method achieves state-of-the-art performance on the VISA dataset, demonstrating strong localisation capabilities without auxiliary modalities and facilitating a shift away from prompt dependence for zero-shot anomaly detection research. Code is available at https://github.com/giddyyupp/DIVAD.
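The final step of the abstract above, reading anomalies off the discrepancy between the input and its reconstruction, can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the reconstruction is taken as given (the DDIM inversion and denoising are not reproduced here), and the scoring rule (channel-averaged absolute difference, scaled to [0, 1]) and the name `anomaly_map` are hypothetical choices, not the scoring used by DIVAD.

```python
import numpy as np


def anomaly_map(image, reconstruction):
    """Per-pixel anomaly score from an input/reconstruction pair.

    Both arrays have shape (H, W, C). The score is the absolute
    difference averaged over channels and scaled to [0, 1].
    This scoring rule is an illustrative assumption.
    """
    diff = np.abs(image.astype(np.float64) - reconstruction.astype(np.float64))
    diff = diff.mean(axis=-1)
    peak = diff.max()
    return diff / peak if peak > 0 else diff


# Toy example: the "reconstruction" is a clean image, and the input
# contains one corrupted pixel that the reconstruction does not reproduce.
image = np.zeros((4, 4, 3))
image[1, 2] = 1.0
reconstruction = np.zeros((4, 4, 3))
scores = anomaly_map(image, reconstruction)
```

In practice the difference is often computed in a feature space rather than in pixel space, which is more tolerant of small reconstruction errors; the pixel-space version is used here only because it needs no model.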

[87] A Highly Efficient Diversity-based Input Selection for DNN Improvement Using VLMs

Amin Abbasishahkoo,Mahboubeh Dadkhah,Lionel Briand

Main category: cs.CV

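TL;DR: This paper proposes CBD, an efficient diversity metric based on vision-language models for selecting inputs to improve deep neural networks; it combines high efficiency with strong performance and clearly outperforms existing methods.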

Details Motivation: Existing diversity-based selection methods are computationally expensive and scale poorly to large input sets, which limits their practical use; a more efficient input selection strategy is needed. Method: Propose the Concept-Based Diversity (CBD) metric, which uses a vision-language model to extract image concepts, and combine it with Margin, a simple uncertainty metric, into a hybrid selection strategy. Result: CBD correlates strongly with the established Geometric Diversity (GD) metric while being much faster to compute; across models, datasets, and budgets, CBD-based selection outperforms state-of-the-art baselines and scales well. Conclusion: CBD is an efficient, scalable, and effective input selection method for fine-tuning deep neural networks at scale, offering a practical option for real-world use. Abstract: Maintaining or improving the performance of Deep Neural Networks (DNNs) through fine-tuning requires labeling newly collected inputs, a process that is often costly and time-consuming. To alleviate this problem, input selection approaches have been developed in recent years to identify small, yet highly informative subsets for labeling. Diversity-based selection is one of the most effective approaches for this purpose. However, such approaches are often computationally intensive and lack scalability for large input sets, limiting their practical applicability. To address this challenge, we introduce Concept-Based Diversity (CBD), a highly efficient metric for image inputs that leverages Vision-Language Models (VLM). Our results show that CBD exhibits a strong correlation with Geometric Diversity (GD), an established diversity metric, while requiring only a fraction of its computation time. Building on this finding, we propose a hybrid input selection approach that combines CBD with Margin, a simple uncertainty metric. We conduct a comprehensive evaluation across a diverse set of DNN models, input sets, selection budgets, and the five most effective state-of-the-art selection baselines. The results demonstrate that the CBD-based selection consistently outperforms all baselines at guiding input selection to improve the DNN model. Furthermore, the CBD-based selection approach remains highly efficient, requiring selection times close to those of simple uncertainty-based methods such as Margin, even on larger input sets like ImageNet. These results confirm not only the effectiveness and computational advantage of the CBD-based approach, particularly compared to hybrid baselines, but also its scalability in repetitive and extensive input selection scenarios.
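For readers unfamiliar with Margin, the uncertainty metric mentioned above, the following sketch shows the usual definition: the gap between the two highest predicted class probabilities, where a small gap means an uncertain prediction. The function names are hypothetical, and how the paper combines Margin with CBD is not stated in the abstract, so only the Margin part is shown.

```python
import numpy as np


def margin_scores(probs):
    """Margin per input: top-1 probability minus top-2 probability.

    probs has shape (n_inputs, n_classes); smaller margin = more uncertain.
    """
    top2 = np.sort(probs, axis=1)[:, -2:]  # two largest values, ascending
    return top2[:, 1] - top2[:, 0]


def select_most_uncertain(probs, budget):
    """Indices of the `budget` inputs with the smallest margins."""
    return np.argsort(margin_scores(probs), kind="stable")[:budget]


probs = np.array([
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # uncertain
    [0.50, 0.49, 0.01],  # most uncertain
])
chosen = select_most_uncertain(probs, budget=2)
```

Margin needs only one forward pass per input and a sort, which is why the abstract uses it as the reference point for a cheap selection method.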

[88] FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

Jifeng Song,Arun Das,Pan Wang,Hui Ji,Kun Zhao,Yufei Huang

Main category: cs.CV

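TL;DR: This paper proposes FigEx2, a visual-conditioned framework that localises panels in scientific compound figures and generates panel-level captions; a noise-aware gated fusion module and a staged optimization strategy yield strong detection and captioning performance as well as robust zero-shot transfer.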

Details Motivation: Captions of scientific compound figures are often missing or only describe the figure as a whole, so individual panels are not understood and information extraction is limited. Method: Propose the FigEx2 framework, which performs panel localisation and caption generation from visual input; a noise-aware gated fusion module stabilises the detection query space, and a staged optimization strategy combining supervised learning with reinforcement learning uses CLIP-based alignment and BERTScore-based semantic rewards to ensure multimodal consistency. Result: FigEx2 reaches 0.726 mAP@0.5:0.95 on detection and exceeds Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore; it also shows good zero-shot transfer on cross-disciplinary tests in physics and chemistry. Conclusion: FigEx2 clearly outperforms existing methods at panel-level understanding, with high accuracy, strong generalisation, and practical potential, especially for scientific image analysis where annotations are lacking. Abstract: Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves a superior 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.

[89] Rescind: Countering Image Misconduct in Biomedical Publications with Vision-Language and State-Space Modeling

Soumyaroop Nandi,Prem Natarajan

Main category: cs.CV

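TL;DR: This paper proposes a vision-language guided framework for generating and detecting biomedical image forgeries, introducing the large-scale benchmark Rescind and the new detection model Integscan, which reaches state-of-the-art detection and localization performance.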

Details Motivation: Manipulation of biomedical images threatens research integrity, and conventional natural-image forensics copes poorly with the complex textures and unstructured layouts specific to this domain, so dedicated forgery detection is urgently needed. Method: Combine diffusion-based synthesis with vision-language prompting in a joint generation and detection framework; propose Integscan, which fuses attention-enhanced visual encoding with prompt-conditioned semantic alignment for precise forgery localization, and use a vision-language model to verify the semantic consistency of generated forgeries. Result: Experiments on Rescind and existing benchmarks show that Integscan achieves state-of-the-art performance in both forgery detection and localization. Conclusion: The work provides an effective tool for automated scientific integrity analysis and advances authenticity verification for biomedical images. Abstract: Scientific image manipulation in biomedical publications poses a growing threat to research integrity and reproducibility. Unlike natural image forensics, biomedical forgery detection is uniquely challenging due to domain-specific artifacts, complex textures, and unstructured figure layouts. We present the first vision-language guided framework for both generating and detecting biomedical image forgeries. By combining diffusion-based synthesis with vision-language prompting, our method enables realistic and semantically controlled manipulations, including duplication, splicing, and region removal, across diverse biomedical modalities. We introduce Rescind, a large-scale benchmark featuring fine-grained annotations and modality-specific splits, and propose Integscan, a structured state space modeling framework that integrates attention-enhanced visual encoding with prompt-conditioned semantic alignment for precise forgery localization. To ensure semantic fidelity, we incorporate a vision-language model based verification loop that filters generated forgeries based on consistency with intended prompts. Extensive experiments on Rescind and existing benchmarks demonstrate that Integscan achieves state-of-the-art performance in both detection and localization, establishing a strong foundation for automated scientific integrity analysis.

[90] The Role of Noisy Data in Improving CNN Robustness for Image Classification

Oscar H. Ramírez-Agudelo,Nicoleta Gorea,Aliza Reif,Lorenzo Bonasera,Michael Karl

Main category: cs.CV

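TL;DR: This study examines how introducing controlled noise (Gaussian noise, salt-and-pepper noise, and Gaussian blur) into the training data affects the robustness of convolutional neural networks, and finds that just 10% noisy data significantly improves performance under fully corrupted test conditions while preserving good performance on clean data.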

Details Motivation: Real-world images are often affected by noise and distortion, whereas conventional training favours high-quality data, so it is worth studying how strategically introduced noise can improve robustness in practical use. Method: On the CIFAR-10 dataset with a ResNet-18 model, evaluate three common corruptions (Gaussian noise, salt-and-pepper noise, Gaussian blur) at different intensities and training-set pollution levels, and analyse their effect on test loss and accuracy. Result: Adding only 10% noisy data during training significantly lowers test loss and raises accuracy under fully corrupted test conditions, with minimal impact on clean-test performance. Conclusion: Strategically introducing noise during training can serve as a simple and effective regularizer, giving a good trade-off between data cleanliness and real-world resilience. Abstract: Data quality plays a central role in the performance and robustness of convolutional neural networks (CNNs) for image classification. While high-quality data is often preferred for training, real-world inputs are frequently affected by noise and other distortions. This paper investigates the effect of deliberately introducing controlled noise into the training data to improve model robustness. Using the CIFAR-10 dataset, we evaluate the impact of three common corruptions, namely Gaussian noise, Salt-and-Pepper noise, and Gaussian blur at varying intensities and training set pollution levels. Experiments using a ResNet-18 model reveal that incorporating just 10% noisy data during training is sufficient to significantly reduce test loss and enhance accuracy under fully corrupted test conditions, with minimal impact on clean-data performance. These findings suggest that strategic exposure to noise can act as a simple yet effective regularizer, offering a practical trade-off between traditional data cleanliness and real-world resilience.
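The training-set "pollution" described above can be sketched as follows: corrupt a fixed fraction of the training images and leave the rest clean. This is a minimal sketch, not the authors' pipeline; the function names, the noise parameters, and the assumption that images are float arrays in [0, 1] with shape (N, H, W, C) are all illustrative.

```python
import numpy as np


def add_gaussian_noise(images, sigma, rng):
    """Add zero-mean Gaussian noise and clip back to the [0, 1] range."""
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)


def add_salt_and_pepper(images, amount, rng):
    """Set roughly `amount` of the pixels to 0 (pepper) or 1 (salt)."""
    out = images.copy()
    mask = rng.random(images.shape[:-1])  # one draw per pixel, shared by channels
    out[mask < amount / 2] = 0.0
    out[mask > 1 - amount / 2] = 1.0
    return out


def pollute(images, fraction, corrupt, rng):
    """Corrupt a random `fraction` of the images; return the set and the indices."""
    n = len(images)
    k = int(round(fraction * n))
    idx = rng.choice(n, size=k, replace=False)
    out = images.copy()
    out[idx] = corrupt(images[idx])
    return out, idx


rng = np.random.default_rng(0)
train = np.full((20, 8, 8, 3), 0.5)  # stand-in for CIFAR-10 images
polluted, noisy_idx = pollute(
    train, 0.10, lambda batch: add_gaussian_noise(batch, 0.1, rng), rng
)
```

Labels are left untouched: only the pixel content of the selected images changes, so the polluted set can be fed to an unchanged training loop.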

[91] Exploiting DINOv3-Based Self-Supervised Features for Robust Few-Shot Medical Image Segmentation

Guoping Xu,Jayaram K. Udupa,Weiguo Lu,You Zhang

Main category: cs.CV

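TL;DR: This paper proposes DINO-AugSeg, a new few-shot medical image segmentation framework built on DINOv3 features; with a wavelet-based feature augmentation module (WT-Aug) and a contextual information-guided fusion module (CG-Fuse), it outperforms existing methods on six public datasets.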

Details Motivation: Annotated training data are scarce, so deep-learning-based medical image segmentation is difficult in few-shot settings, and existing self-supervised foundation models cannot be applied directly because of the domain gap between natural and medical images. Method: Propose the DINO-AugSeg framework: (1) the WT-Aug module enriches the diversity of DINOv3-extracted features by perturbing frequency components; (2) the CG-Fuse module uses cross-attention to fuse semantically rich low-resolution features with spatially detailed high-resolution features. Result: On six public benchmarks covering five imaging modalities (MRI, CT, ultrasound, endoscopy, and dermoscopy), DINO-AugSeg consistently outperforms existing methods under few-shot conditions. Conclusion: Combining wavelet-domain augmentation with contextual fusion makes feature representations more robust for few-shot medical image segmentation, and DINO-AugSeg is a promising direction for the field. Abstract: Deep learning-based automatic medical image segmentation plays a critical role in clinical diagnosis and treatment planning but remains challenging in few-shot scenarios due to the scarcity of annotated training data. Recently, self-supervised foundation models such as DINOv3, which were trained on large natural image datasets, have shown strong potential for dense feature extraction that can help with the few-shot learning challenge. Yet, their direct application to medical images is hindered by domain differences. In this work, we propose DINO-AugSeg, a novel framework that leverages DINOv3 features to address the few-shot medical image segmentation challenge. Specifically, we introduce WT-Aug, a wavelet-based feature-level augmentation module that enriches the diversity of DINOv3-extracted features by perturbing frequency components, and CG-Fuse, a contextual information-guided fusion module that exploits cross-attention to integrate semantic-rich low-resolution features with spatially detailed high-resolution features. Extensive experiments on six public benchmarks spanning five imaging modalities, including MRI, CT, ultrasound, endoscopy, and dermoscopy, demonstrate that DINO-AugSeg consistently outperforms existing methods under limited-sample conditions. The results highlight the effectiveness of incorporating wavelet-domain augmentation and contextual fusion for robust feature representation, suggesting DINO-AugSeg as a promising direction for advancing few-shot medical image segmentation. Code and data will be made available on https://github.com/apple1986/DINO-AugSeg.

[92] From Prompts to Deployment: Auto-Curated Domain-Specific Dataset Generation via Diffusion Models

Dongsik Yoon,Jongeun Kim

Main category: cs.CV

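TL;DR: Proposes an automated diffusion-based pipeline for synthetic data generation that builds high-quality, domain-specific datasets and reduces reliance on real-world data collection.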

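Details Motivation: Address the distribution gap between pre-trained models and real deployment environments while reducing dependence on large amounts of real annotated data. Method: A three-stage framework: first synthesise target objects within domain-specific backgrounds through controlled inpainting; then validate the outputs with a multi-modal assessment (object detection, aesthetic scoring, and vision-language alignment); finally apply a user-preference classifier to capture subjective selection criteria. Result: The pipeline enables efficient construction of high-quality, deployment-ready synthetic datasets aimed at mitigating distribution shift. Conclusion: The method offers an efficient and scalable way to build domain-specific datasets and helps models adapt and perform better in real-world scenarios. Abstract: In this paper, we present an automated pipeline for generating domain-specific synthetic datasets with diffusion models, addressing the distribution shift between pre-trained models and real-world deployment environments. Our three-stage framework first synthesizes target objects within domain-specific backgrounds through controlled inpainting. The generated outputs are then validated via a multi-modal assessment that integrates object detection, aesthetic scoring, and vision-language alignment. Finally, a user-preference classifier is employed to capture subjective selection criteria. This pipeline enables the efficient construction of high-quality, deployable datasets while reducing reliance on extensive real-world data collection.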

[93] PathoGen: Diffusion-Based Synthesis of Realistic Lesions in Histopathology Images

Mohamad Koohi-Moghadam,Mohammad-Ali Nikouei Mahani,Kyongtae Tyler Bae

Main category: cs.CV

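TL;DR: This paper proposes PathoGen, a controllable diffusion-based generative method that inpaints lesions into benign histopathology images with high fidelity, addressing the scarcity of annotated data for rare lesions in medical imaging.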

Details Motivation: Expert-annotated lesion data are scarce, especially for rare pathologies and disease subtypes, which limits the use of AI in histopathology diagnosis; existing augmentation methods struggle to generate lesions that are morphologically realistic and preserve tissue structure, so more realistic generative models are needed. Method: Propose PathoGen, a controllable diffusion-based inpainting method that generates lesion regions with natural boundaries, intact cellular structures, and authentic staining in benign histopathology images, and that also yields precise pixel-level labels. Result: Validated on four pathology datasets (kidney, skin, breast, and prostate), PathoGen outperforms existing generative models such as conditional GAN and Stable Diffusion in image fidelity and distributional similarity; the synthetic data improve downstream segmentation, especially when data are scarce. Conclusion: PathoGen offers a scalable solution that eases the labelled-data bottleneck in medical AI through high-fidelity lesion generation and automatic annotation, supporting the development of generalizable medical AI systems. Abstract: The development of robust artificial intelligence models for histopathology diagnosis is severely constrained by the scarcity of expert-annotated lesion data, particularly for rare pathologies and underrepresented disease subtypes. While data augmentation offers a potential solution, existing methods fail to generate sufficiently realistic lesion morphologies that preserve the complex spatial relationships and cellular architectures characteristic of histopathological tissues. Here we present PathoGen, a diffusion-based generative model that enables controllable, high-fidelity inpainting of lesions into benign histopathology images. Unlike conventional augmentation techniques, PathoGen leverages the iterative refinement process of diffusion models to synthesize lesions with natural tissue boundaries, preserved cellular structures, and authentic staining characteristics. We validate PathoGen across four diverse datasets representing distinct diagnostic challenges: kidney, skin, breast, and prostate pathology. Quantitative assessment confirms that PathoGen outperforms state-of-the-art generative baselines, including conditional GAN and Stable Diffusion, in image fidelity and distributional similarity. Crucially, we show that augmenting training sets with PathoGen-synthesized lesions enhances downstream segmentation performance compared to traditional geometric augmentations, particularly in data-scarce regimes. Besides, by simultaneously generating realistic morphology and pixel-level ground truth, PathoGen effectively overcomes the manual annotation bottleneck. This approach offers a scalable pathway for developing generalizable medical AI systems despite limited expert-labeled data.

[94] How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

Peng Gao,Yujian Lee,Yongqi Xu,Wentao Fan

Main category: cs.CV

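TL;DR: This paper proposes SSP+, a new collaborative framework for audio-visual semantic segmentation (AVSS) that improves on existing methods by combining optical flow with textual prompts.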

Details Motivation: Existing audio-visual segmentation methods struggle to identify sound-emitting objects accurately and to understand their semantics, particularly for stationary sound sources and complex scenes. Method: Decompose the AVSS task into two subtasks: first use optical flow to provide a pre-segmentation mask that captures motion dynamics; then introduce textual prompts (object category and scene description) to assist semantic analysis, with a visual-textual alignment module (VTA) for cross-modal fusion. A post-mask training strategy additionally strengthens the model's learning of the optical flow map. Result: Experiments show that SSP+ outperforms existing AVS methods on multiple metrics, delivering more efficient and precise semantic segmentation, especially for stationary sound-emitting objects and complex dynamic scenes. Conclusion: By fusing optical flow with textual prompts, SSP+ effectively improves audio-visual semantic segmentation and advances the shift from pixel-level segmentation to semantic understanding. Abstract: Audio-visual semantic segmentation (AVSS) represents an extension of the audio-visual segmentation (AVS) task, necessitating a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. A previous methodology decomposes the AVSS task into two discrete subtasks, initially providing a prompted segmentation mask to facilitate subsequent semantic analysis; our approach innovates on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment module (VTA) to facilitate cross-modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen involves a post-mask technique aimed at compelling the model to learn the diagram of the optical flow. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.

[95] Subspace Alignment for Vision-Language Model Test-time Adaptation

Zhichen Zeng,Wenxuan Bao,Xiao Lin,Ruizhong Qiu,Tianxin Wei,Xuying Ning,Yuchen Yan,Chen Luo,Monica Xiao Cheng,Jingrui He,Hanghang Tong

Main category: cs.CV

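TL;DR: This paper proposes SubTTA, a new test-time adaptation method that improves the zero-shot performance of vision-language models under distribution shift. It improves pseudo-label quality by aligning the modality subspaces and removing visual nuisance, making cross-modal learning more robust.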

Details Motivation: Existing test-time adaptation methods rely on unreliable zero-shot predictions as pseudo-labels, and their performance degrades under distribution shift because of the modality gap and visual nuisance. Method: SubTTA extracts the principal semantic subspaces of the visual and textual modalities and aligns them by minimizing their chordal distance to bridge the modality gap; it then projects visual features onto the task-relevant textual subspace to filter out irrelevant noise, and performs standard TTA in the purified space to refine the decision boundaries. Result: Experiments on multiple benchmarks and VLM architectures show an average improvement of 2.24% over state-of-the-art methods. Conclusion: SubTTA effectively mitigates the modality gap and visual nuisance, significantly improving test-time adaptation of VLMs under distribution shift. Abstract: Vision-language models (VLMs), despite their extraordinary zero-shot capabilities, are vulnerable to distribution shifts. Test-time adaptation (TTA) emerges as a predominant strategy to adapt VLMs to unlabeled test data on the fly. However, existing TTA methods heavily rely on zero-shot predictions as pseudo-labels for self-training, which can be unreliable under distribution shifts and misguide adaptation due to two fundamental limitations. First (Modality Gap), distribution shifts induce gaps between visual and textual modalities, making cross-modal relations inaccurate. Second (Visual Nuisance), visual embeddings encode rich but task-irrelevant noise that often overwhelms task-specific semantics under distribution shifts. To address these limitations, we propose SubTTA, which aligns the semantic subspaces of both modalities to enhance zero-shot predictions to better guide the TTA process. To bridge the modality gap, SubTTA extracts the principal subspaces of both modalities and aligns the visual manifold to the textual semantic anchor by minimizing their chordal distance. To eliminate visual nuisance, SubTTA projects the aligned visual features onto the task-specific textual subspace, which filters out task-irrelevant noise by constraining visual embeddings within the valid semantic span, and standard TTA is further performed on the purified space to refine the decision boundaries. Extensive experiments on various benchmarks and VLM architectures demonstrate the effectiveness of SubTTA, yielding an average improvement of 2.24% over state-of-the-art TTA methods.
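The two geometric ingredients named above, the chordal distance between subspaces and projection onto a subspace, have standard definitions that a short sketch can make concrete. For orthonormal bases U and V of two k-dimensional subspaces, the chordal distance is sqrt(k - ||UᵀV||_F²). The code below is an illustration under assumptions: the function names are hypothetical, the subspaces come from a plain uncentred SVD, and nothing here reproduces SubTTA's actual optimisation or alignment procedure.

```python
import numpy as np


def principal_subspace(features, k):
    """Orthonormal basis (d x k) spanning the top-k right singular
    directions of an (n x d) feature matrix (no centring applied)."""
    _, _, vt = np.linalg.svd(features, full_matrices=False)
    return vt[:k].T


def chordal_distance(u, v):
    """Chordal distance between the subspaces spanned by the
    orthonormal columns of u and v (both d x k)."""
    k = u.shape[1]
    overlap = np.linalg.norm(u.T @ v, "fro") ** 2
    return np.sqrt(max(k - overlap, 0.0))


def project(features, basis):
    """Orthogonal projection of row vectors onto span(basis)."""
    return features @ basis @ basis.T


rng = np.random.default_rng(0)
visual = rng.normal(size=(32, 16))  # stand-in image embeddings
text = rng.normal(size=(10, 16))    # stand-in class-prompt embeddings
u_vis = principal_subspace(visual, k=4)
u_txt = principal_subspace(text, k=4)
gap = chordal_distance(u_vis, u_txt)
purified = project(visual, u_txt)   # visual features restricted to the text span
```

The distance is 0 when the two subspaces coincide and sqrt(k) when they are orthogonal, so it gives a bounded measure of how far apart the visual and textual subspaces are.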

[96] Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention

Shezheng Song,Shasha Li,Jie Yu

Main category: cs.CV

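TL;DR: Through a systematic layer-wise masking analysis, this paper reveals how visual and textual information fuse inside multimodal large language models (MLLMs), finding that fusion concentrates in specific layers and that a "review" phenomenon occurs, and on this basis proposes a training-free contrastive attention framework that improves multimodal reasoning.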

Details Motivation: Understand how MLLMs internally integrate visual and textual information, a mechanism that currently lacks systematic study. Method: Apply layer-wise masking analysis across multiple architectures to study how visual-text fusion evolves, and analyse how attention changes from layer to layer. Result: Fusion occurs mainly at specific layers rather than uniformly across the network, and some models show a late-stage "review" phenomenon in which visual signals are reactivated; high-attention noise on irrelevant regions persists, while attention on text-aligned regions gradually increases. Conclusion: The training-free contrastive attention framework built on these findings models the attention change from early fusion to the final output, improves multimodal reasoning, and is validated on multiple models and benchmarks. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. The results show that fusion emerges at several specific layers rather than being uniformly distributed across the network, and certain models exhibit a late-stage "review" phenomenon where visual signals are reactivated before output generation. Besides, we further analyze layer-wise attention evolution and observe persistent high-attention noise on irrelevant regions, along with gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts. Extensive experiments across various MLLMs and benchmarks validate our analysis and demonstrate that the proposed approach improves multimodal reasoning performance. Code will be released.

[97] Instance-Aligned Captions for Explainable Video Anomaly Detection

Inpyo Song,Minjun Joo,Joonhyung Kwon,Eunji Jeon,Jangwon Lee

Main category: cs.CV

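TL;DR: This paper proposes instance-aligned captions for explainable video anomaly detection, linking textual descriptions to specific object instances to give spatially grounded, verifiable explanations, and builds the VIEW360+ dataset with 868 additional videos to advance trustworthy, interpretable anomaly detection.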

Details Motivation: Existing explainable video anomaly detection methods lack spatial grounding in multi-entity interactions, making explanations unreliable and hard to verify, a problem that is especially serious in safety-critical applications. Method: Propose an instance-aligned captioning framework that links each textual claim to a specific object instance with appearance and motion attributes, giving fine-grained explanations of who did what, whom it affected, and where; annotate eight widely used benchmarks and extend VIEW360 to VIEW360+. Result: Experiments show that current LLM- and VLM-based methods have significant limitations on instance-level spatially aligned captioning, while the proposed approach offers more reliable, verifiable explanations. Conclusion: The work moves explainable video anomaly detection towards instance-level spatial alignment, improving the trustworthiness and usefulness of explanations, and VIEW360+ provides a strong benchmark for future research. Abstract: Explainable video anomaly detection (VAD) is crucial for safety-critical applications, yet even with recent progress, much of the research still lacks spatial grounding, making the explanations unverifiable. This limitation is especially pronounced in multi-entity interactions, where existing explainable VAD methods often produce incomplete or visually misaligned descriptions, reducing their trustworthiness. To address these challenges, we introduce instance-aligned captions that link each textual claim to specific object instances with appearance and motion attributes. Our framework captures who caused the anomaly, what each entity was doing, whom it affected, and where the explanation is grounded, enabling verifiable and actionable reasoning. We annotate eight widely used VAD benchmarks and extend the 360-degree egocentric dataset, VIEW360, with 868 additional videos, eight locations, and four new anomaly types, creating VIEW360+, a comprehensive testbed for explainable VAD. Experiments show that our instance-level spatially grounded captions reveal significant limitations in current LLM- and VLM-based methods while providing a robust benchmark for future research in trustworthy and interpretable anomaly detection.

[98] A Hardware-Algorithm Co-Designed Framework for HDR Imaging and Dehazing in Extreme Rocket Launch Environments

Jing Tao,Banglei Guan,Pengju Sun,Taihang Lei,Yang Shang,Qifeng Yu

Main category: cs.CV

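TL;DR: Proposes a hardware-algorithm co-design framework that combines a spatially varying exposure sensor with a physics-aware dehazing algorithm for optical measurement of key mechanical parameters during rocket launch.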

Details Motivation: Extreme imaging conditions during rocket launch prevent conventional optical measurement from accurately capturing key mechanical parameters such as plume flow fields and shock wave structures; image degradation from intense luminance, particulate haze, and high dynamic range must be addressed. Method: Design a Spatially Varying Exposure (SVE) sensor that acquires multi-exposure data in a single shot, and combine it with a physics-aware dehazing algorithm that dynamically estimates haze density, performs region-adaptive illumination optimization, and applies multi-scale entropy-constrained fusion to separate haze from scene radiance. Result: The framework is validated on real launch imagery and controlled experiments, substantially improving image recovery in the plume and engine regions and enabling extraction of mechanical parameters such as particle velocity, flow instability frequency, and structural vibration. Conclusion: The method recovers physically accurate visual information in extreme aerospace environments and provides a reliable image basis for quantitative optical measurement. Abstract: Quantitative optical measurement of critical mechanical parameters -- such as plume flow fields, shock wave structures, and nozzle oscillations -- during rocket launch faces severe challenges due to extreme imaging conditions. Intense combustion creates dense particulate haze and luminance variations exceeding 120 dB, degrading image data and undermining subsequent photogrammetric and velocimetric analyses. To address these issues, we propose a hardware-algorithm co-design framework that combines a custom Spatially Varying Exposure (SVE) sensor with a physics-aware dehazing algorithm. The SVE sensor acquires multi-exposure data in a single shot, enabling robust haze assessment without relying on idealized atmospheric models. Our approach dynamically estimates haze density, performs region-adaptive illumination optimization, and applies multi-scale entropy-constrained fusion to effectively separate haze from scene radiance. Validated on real launch imagery and controlled experiments, the framework demonstrates superior performance in recovering physically accurate visual information of the plume and engine region. This offers a reliable image basis for extracting key mechanical parameters, including particle velocity, flow instability frequency, and structural vibration, thereby supporting precise quantitative analysis in extreme aerospace environments.

[99] Representation Learning with Semantic-aware Instance and Sparse Token Alignments

Phuoc-Nguyen Bui,Toan Duc Nguyen,Junghyun Bum,Duc-Tai Le,Hyunseung Choo

Main category: cs.CV

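TL;DR: This paper proposes SISTA, a new multi-level alignment framework for medical vision-language pre-training that exploits semantic correspondence at the image-report and patch-word levels, improving on conventional contrastive learning and raising downstream performance.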

Details Motivation: Conventional contrastive learning treats all unpaired samples as negatives, but in medical data the images or reports of different patients can be similar; this creates false negatives, disrupts the semantic structure, and harms representation quality. Method: Propose the SISTA framework, which uses inter-report similarity to eliminate false negatives and sparsely aligns image patches with relevant word tokens, achieving multi-level semantic alignment at the image-report and local-local levels. Result: Experiments show better performance on image classification, segmentation, and detection across multiple datasets, especially on fine-grained tasks and with few labelled samples. Conclusion: By introducing semantic-aware instance alignment and sparse token alignment, SISTA effectively mitigates the false-negative problem in medical VLP and significantly improves cross-task transfer performance. Abstract: Medical contrastive vision-language pre-training (VLP) has demonstrated significant potential in improving performance on downstream tasks. Traditional approaches typically employ contrastive learning, treating paired image-report samples as positives and unpaired ones as negatives. However, in medical datasets, there can be substantial similarities between images or reports from different patients. Rigidly treating all unpaired samples as negatives can disrupt the underlying semantic structure and negatively impact the quality of the learned representations. In this paper, we propose a multi-level alignment framework, Representation Learning with Semantic-aware Instance and Sparse Token Alignments (SISTA), by exploiting the semantic correspondence between medical images and radiology reports at two levels, i.e., image-report and patch-word levels. Specifically, we improve the conventional contrastive learning by incorporating inter-report similarity to eliminate the false negatives and introduce a method to effectively align image patches with relevant word tokens. Experimental results demonstrate the effectiveness of the proposed framework in improving transfer performance across different datasets on three downstream tasks: image classification, image segmentation, and object detection. Notably, our framework achieves significant improvements in fine-grained tasks even with limited labeled data. Codes and pre-trained models will be made available.

[100] Towards Cross-Platform Generalization: Domain Adaptive 3D Detection with Augmentation and Pseudo-Labeling

Xiyan Feng,Wenbo Zhang,Lu Zhang,Yunzhi Zhuge,Huchuan Lu,You He

Main category: cs.CV

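TL;DR: This paper proposes a cross-platform 3D object detection method based on PVRCNN++ that narrows the domain gap through tailored data augmentation and pseudo-label self-training, improving generalization and achieving strong results in the RoboSense2025 Challenge.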

Details Motivation: Address domain adaptation in cross-platform 3D object detection and improve generalization across platforms. Method: Build on the PVRCNN++ framework, which combines point-based and voxel-based features, and add tailored data augmentation and a pseudo-label-based self-training strategy to narrow the domain gap. Result: In the RoboSense2025 Challenge, the method reaches a 3D AP of 62.67% for the Car category on the phase-1 target domain, and 58.76% and 49.81% for the Car and Pedestrian categories respectively on the phase-2 target domain. Conclusion: The proposed method effectively improves cross-platform 3D object detection and confirms the value of data augmentation and self-training for domain adaptation. Abstract: This technical report presents the award-winning solution to the Cross-platform 3D Object Detection task in the RoboSense2025 Challenge. Our approach is built upon PVRCNN++, an efficient 3D object detection framework that effectively integrates point-based and voxel-based features. On top of this foundation, we improve cross-platform generalization by narrowing domain gaps through tailored data augmentation and a self-training strategy with pseudo-labels. These enhancements enabled our approach to secure 3rd place in the challenge, achieving a 3D AP of 62.67% for the Car category on the phase-1 target domain, and 58.76% and 49.81% for the Car and Pedestrian categories respectively on the phase-2 target domain.

[101] CogniMap3D: Cognitive 3D Mapping and Rapid Retrieval

Feiran Wang,Junyi Wu,Dawen Cai,Yuan Hong,Yan Yan

Main category: cs.CV

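TL;DR: CogniMap3D is a bioinspired framework for dynamic 3D scene understanding and reconstruction that emulates human cognitive processes; by combining motion cues, a memory bank, and factor graph optimization, it maintains persistent memory of static scenes, identifies dynamic objects, and reaches state-of-the-art performance on multiple tasks.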

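Details Motivation: Existing 3D scene reconstruction methods handle dynamic environments and long-term spatial memory poorly; they struggle to separate dynamic from static elements and to update scenes across multiple visits, so a persistent, adaptive system that emulates human cognition is needed. Method: Propose the CogniMap3D framework with three parts: detection of dynamic objects from multi-stage motion cues (with depth and camera pose priors); a cognitive map that stores, retrieves, and updates static scenes; and factor graph optimization of camera poses. The system identifies dynamic regions from an image stream, matches static elements against its memory bank, and on revisits restores the scene and updates the memory. Result: The framework shows state-of-the-art performance on video depth estimation, camera pose reconstruction, and 3D mapping, and effectively supports continuous scene understanding over long sequences and multiple visits. Conclusion: By fusing a bioinspired memory mechanism with dynamic perception, CogniMap3D achieves efficient, persistent, and updatable 3D scene understanding, offering a new approach to SLAM and visual perception in complex dynamic environments. Abstract: We present CogniMap3D, a bioinspired framework for dynamic 3D scene understanding and reconstruction that emulates human cognitive processes. Our approach maintains a persistent memory bank of static scenes, enabling efficient spatial knowledge storage and rapid retrieval. CogniMap3D integrates three core capabilities: a multi-stage motion cue framework for identifying dynamic objects, a cognitive mapping system for storing, recalling, and updating static scenes across multiple visits, and a factor graph optimization strategy for refining camera poses. Given an image stream, our model identifies dynamic regions through motion cues with depth and camera pose priors, then matches static elements against its memory bank. When revisiting familiar locations, CogniMap3D retrieves stored scenes, relocates cameras, and updates memory with new observations. Evaluations on video depth estimation, camera pose reconstruction, and 3D mapping tasks demonstrate its state-of-the-art performance, while effectively supporting continuous scene understanding across extended sequences and multiple visits.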

[102] Instruction-Driven 3D Facial Expression Generation and Transition

Anh H. Vo,Tae-Seok Kim,Hulin Jin,Soo-Mi Choi,Yong-Guk Kim

Main category: cs.CV

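TL;DR: This paper proposes a text-instruction-driven framework for 3D facial expression generation and transition that produces a smooth change from one facial expression to another according to a text description.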

Details Motivation: Simulating realistic emotional variation requires natural 3D facial transitions between any two expressions, but existing methods are limited in semantic control and expressive diversity. Method: Propose the Instruction-driven Facial Expression Decomposer (IFED) module to fuse text and facial features, and combine it with the I2FET method and a vertex reconstruction loss to refine the semantic comprehension of latent vectors, generating instruction-compliant expression sequences and transition animations. Result: The model outperforms state-of-the-art methods on the CK+ and CelebV-HQ datasets and generates realistic facial expression transition trajectories from diverse text instructions. Conclusion: The framework greatly expands the range of expressions and transitions that can be generated and has broad practical potential, especially for virtual interaction that needs fine-grained emotional control. Abstract: A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction-driven facial expression generation that produces a 3D face and, starting from an image of the face, transforms the facial expression from one designated facial expression to another. The Instruction-driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state-of-the-art methods on the CK+ and CelebV-HQ datasets. The results show that our framework can generate facial expression trajectories according to text instruction. Considering that text prompts allow us to make diverse descriptions of human emotional states, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications. More information about our project can be found at https://vohoanganh.github.io/tg3dfet/

[103] Second-order Gaussian directional derivative representations for image high-resolution corner detection

Dongbo Xie,Junjie Qiu,Changming Sun,Weichuan Zhang

Main category: cs.CV

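TL;DR: A new high-resolution corner detection method based on second-order Gaussian directional derivatives (SOGDD) is proposed; it addresses the mutual interference between the grayscale information of adjacent corners and outperforms existing methods in localization error, robustness to blur, image matching, and 3D reconstruction.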

Details Motivation: Zhang et al.'s use of a simple corner model has theoretical flaws: the grayscale information of adjacent corners affects each other, which makes feature extraction inaccurate, so a more precise way of modeling and detecting high-resolution corners is needed. Method: A second-order Gaussian directional derivative (SOGDD) filter is used to smooth two typical high-resolution corner models (END-type and L-type), and the SOGDD representation of each is derived; the resulting characteristics of high-resolution corners guide the choice of Gaussian filtering scale so that intensity variation is captured accurately. Result: Key characteristics of high-resolution corners are revealed, enabling accurate modeling and detection of adjacent corners; the new method outperforms state-of-the-art methods in localization accuracy, robustness to blur, image matching, and 3D reconstruction. Conclusion: The proposed SOGDD framework resolves the adjacent-corner interference problem and, for the first time, accurately detects adjacent corners in high-resolution images, giving later corner-based applications a more reliable theoretical and technical basis. Abstract: Corner detection is widely used in various computer vision tasks, such as image matching and 3D reconstruction. Our research indicates that there are theoretical flaws in Zhang et al.'s use of a simple corner model to obtain a series of corner characteristics, as the grayscale information of two adjacent corners can affect each other. In order to address the above issues, a second-order Gaussian directional derivative (SOGDD) filter is used in this work to smooth two typical high-resolution angle models (i.e. END-type and L-type models). Then, the SOGDD representations of these two corner models were derived separately, and many characteristics of high-resolution corners were discovered, which enabled us to demonstrate how to select Gaussian filtering scales to obtain intensity variation information from images, accurately depicting adjacent corners. In addition, a new high-resolution corner detection method for images has been proposed for the first time, which can accurately detect adjacent corner points. The experimental results have verified that the proposed method outperforms state-of-the-art methods in terms of localization error, robustness to image blur transformation, image matching, and 3D reconstruction.
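As a rough illustration of the filter named in the abstract, the sketch below samples a second-order Gaussian directional derivative kernel on an integer grid. It assumes an isotropic Gaussian with scale `sigma` and direction `theta`; the paper's own filter definition (which may be anisotropic or normalized differently) is not given in the abstract, and the function name `sogdd_kernel` is hypothetical.

```python
import math

def sogdd_kernel(sigma, theta, radius):
    """Sample the second directional derivative of an isotropic 2D Gaussian.

    For g(x, y) = exp(-(x^2 + y^2) / (2 sigma^2)) / (2 pi sigma^2), the second
    derivative along the unit vector (cos theta, sin theta) is
        (u^2 / sigma^4 - 1 / sigma^2) * g(x, y),
    where u = x cos(theta) + y sin(theta).
    """
    c, s = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            g = math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
            u = x * c + y * s  # coordinate along the derivative direction
            row.append((u * u / sigma ** 4 - 1 / sigma ** 2) * g)
        kernel.append(row)
    return kernel

# A small bank of kernels at 8 orientations in [0, pi); the kernel is
# unchanged when theta is shifted by pi, so this range covers all directions.
# sigma=2.0 and radius=6 are arbitrary example values.
bank = [sogdd_kernel(sigma=2.0, theta=k * math.pi / 8, radius=6) for k in range(8)]
```

Convolving an image with each kernel in `bank` would give one directional response per orientation; how those responses are combined into a corner measure is specific to the paper and is not sketched here.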

[104] GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards

Yan Zhu,Te Luo,Pei-Yao Fu,Zhen Zhang,Zi-Long Wang,Yi-Fan Qu,Zi-Han Geng,Jia-Qi Xu,Lu Yao,Li-Yun Ma,Wei Su,Wei-Feng Chen,Quan-Lin Li,Shuo Wang,Ping-Hong Zhou

Main category: cs.CV

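TL;DR: This paper presents GI-Bench, a benchmark for evaluating multimodal large language models (MLLMs) across the clinical workflow of gastrointestinal endoscopy, covering five stages (anatomical localization, lesion identification, diagnosis, findings description, and management), and reveals a "spatial grounding bottleneck" and a "fluency-accuracy paradox" in current MLLMs.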

Details Motivation: Although MLLMs show promise in gastroenterology, their performance across a complete clinical workflow and against human physicians remains unclear, so their clinical utility needs systematic evaluation. Method: GI-Bench, a benchmark covering 20 fine-grained lesion categories, was constructed; 12 MLLMs were evaluated across a five-stage gastrointestinal endoscopy workflow and compared with three junior endoscopists and three residency trainees, using Macro-F1, mIoU, and a multi-dimensional Likert scale as metrics. Result: Gemini-3-Pro achieved state-of-the-art performance; in diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed residency trainees (0.492) and approached junior endoscopists (0.727); in lesion localization, however, humans significantly outperformed the best model (mIoU >0.506 vs 0.345), and models exhibited a "fluency-accuracy paradox", producing fluent text with more factual errors. Conclusion: GI-Bench provides a standardized platform for evaluating MLLMs in gastrointestinal endoscopy; current models show potential in diagnostic reasoning but still lag behind humans in spatial grounding and factual accuracy, and need further improvement before reliable clinical deployment. Abstract: Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. We aim to systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical "spatial grounding bottleneck" persisted; human lesion localization (mIoU >0.506) significantly outperformed the best model (0.345; p<0.05).
Furthermore, qualitative analysis revealed a "fluency-accuracy paradox": models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to "over-interpretation" and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
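For readers unfamiliar with the two quantitative metrics named in the abstract, the sketch below computes Macro-F1 over class labels and mean IoU over axis-aligned boxes. The label names and the `(x1, y1, x2, y2)` box format are illustrative assumptions; the benchmark may localize lesions with masks or other conventions, and its own evaluation code is not reproduced here.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, gt_boxes):
    """Average IoU over paired predictions and ground-truth boxes."""
    pairs = list(zip(pred_boxes, gt_boxes))
    return sum(box_iou(p, g) for p, g in pairs) / len(pairs)

# Hypothetical toy labels and boxes, only to show the call pattern.
labels_true = ["polyp", "polyp", "ulcer", "normal"]
labels_pred = ["polyp", "ulcer", "ulcer", "normal"]
f1 = macro_f1(labels_true, labels_pred)
miou = mean_iou([(0, 0, 2, 2), (0, 0, 2, 2)], [(1, 1, 3, 3), (0, 0, 2, 2)])
```

Macro-F1 weights every class equally regardless of frequency, which matters when some of the lesion categories are rare.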

[105] Human-inspired Global-to-Parallel Multi-scale Encoding for Lightweight Vision Models

Wei Xu

Main category: cs.CV

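TL;DR: This paper proposes GPM, a global-to-parallel multi-scale encoding module, and builds on it the lightweight H-GPE network, which achieves a better accuracy-efficiency trade-off on image classification, detection, and segmentation.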

Details Motivation: Existing lightweight vision networks often reduce computation at the cost of more parameters, and their modeling of human visual perception is oversimplified and fails to reflect how the visual system actually works. Method: The GPM (Global-to-Parallel Multi-scale Encoding) module is proposed, comprising a Global Insight Generator (GIG) and parallel branches: LSAE captures mid-/large-scale semantic relations, while IRB preserves fine-grained texture information; the lightweight H-GPE network is built on top of it. Result: H-GPE performs well on ImageNet classification, COCO detection, and ADE20K segmentation while keeping both FLOPs and parameter count low, outperforming several recent lightweight models. Conclusion: By mimicking the human visual mechanism of perceiving the whole before the details while maintaining contextual awareness, GPM balances model size, computational cost, and performance, offering a new direction for lightweight network design. Abstract: Lightweight vision networks have witnessed remarkable progress in recent years, yet achieving a satisfactory balance among parameter scale, computational overhead, and task performance remains difficult. Although many existing lightweight models manage to reduce computation considerably, they often do so at the expense of a substantial increase in parameter count (e.g., LSNet, MobileMamba), which still poses obstacles for deployment on resource-limited devices. In parallel, some studies attempt to draw inspiration from human visual perception, but their modeling tends to oversimplify the visual process, making it hard to reflect how perception truly operates. Revisiting the cooperative mechanism of the human visual system, we propose GPM (Global-to-Parallel Multi-scale Encoding). GPM first employs a Global Insight Generator (GIG) to extract holistic cues, and subsequently processes features of different scales through parallel branches: LSAE emphasizes mid-/large-scale semantic relations, while IRB (Inverted Residual Block) preserves fine-grained texture information, jointly enabling coherent representation of global and local features. As such, GPM conforms to two characteristic behaviors of human vision: perceiving the whole before focusing on details, and maintaining broad contextual awareness even during local attention. Built upon GPM, we further develop the lightweight H-GPE network.
Experiments on image classification, object detection, and semantic segmentation show that H-GPE achieves strong performance while maintaining a balanced footprint in both FLOPs and parameters, delivering a more favorable accuracy-efficiency trade-off compared with recent state-of-the-art lightweight models.

[106] Route, Retrieve, Reflect, Repair: Self-Improving Agentic Framework for Visual Detection and Linguistic Reasoning in Medical Imaging

Md. Faiyaz Abdullah Sayeedi,Rashedur Rahman,Siam Tahsin Bhuiyan,Sefatul Wasi,Ashraful Islam,Saadia Binte Alam,AKM Mahbubur Rahman

Main category: cs.CV

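TL;DR: The R^4 framework decomposes medical imaging workflows into four coordinated agents (Router, Retriever, Reflector, Repairer), improving the reliability and spatial grounding of vision-language models in clinical image interpretation and markedly improving report generation and weakly supervised detection without fine-tuning.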

Details Motivation: Most existing medical image analysis systems are single-pass black-box models that offer little control over reasoning, safety, and spatial grounding, and so fall short of clinical needs. Method: The R^4 agentic framework decomposes the task into four stages (routing, retrieval, reflection, and repair), combining task-aware prompts, exemplar memory, a critique mechanism, and iterative revision to jointly generate and refine text reports and bounding boxes. Result: On chest X-ray analysis, R^4 improves LLM-as-a-Judge scores by 1.7-2.5 points and mAP50 by 2.5-3.5 percentage points over strong baselines, without gradient-based fine-tuning. Conclusion: Agentic architectures can strengthen the reliability and spatial consistency of VLMs in medical imaging through structured collaboration, providing more controllable and interpretable AI tools for clinical use. Abstract: Medical image analysis increasingly relies on large vision-language models (VLMs), yet most systems remain single-pass black boxes that offer limited control over reasoning, safety, and spatial grounding. We propose R^4, an agentic framework that decomposes medical imaging workflows into four coordinated agents: a Router that configures task- and specialization-aware prompts from the image, patient history, and metadata; a Retriever that uses exemplar memory and pass@k sampling to jointly generate free-text reports and bounding boxes; a Reflector that critiques each draft-box pair for key clinical error modes (negation, laterality, unsupported claims, contradictions, missing findings, and localization errors); and a Repairer that iteratively revises both narrative and spatial outputs under targeted constraints while curating high-quality exemplars for future cases. Instantiated on chest X-ray analysis with multiple modern VLM backbones and evaluated on report generation and weakly supervised detection, R^4 consistently boosts LLM-as-a-Judge scores by roughly +1.7-+2.5 points and mAP50 by +2.5-+3.5 absolute points over strong single-VLM baselines, without any gradient-based fine-tuning. These results show that agentic routing, reflection, and repair can turn strong but brittle VLMs into more reliable and better grounded tools for clinical image interpretation. Our code can be found at: https://github.com/faiyazabdullah/MultimodalMedAgent

[107] Unified Multi-Site Multi-Sequence Brain MRI Harmonization Enriched by Biomedical Semantic Style

Mengqi Wu,Yongheng Sun,Qianqian Wang,Pew-Thian Yap,Mingxia Liu

Main category: cs.CV

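TL;DR: This paper proposes MMH, a multi-site multi-sequence brain MRI harmonization framework that uses biomedical semantic priors for sequence-aware style alignment; its two-stage approach disentangles image style from anatomical content without paired data and significantly outperforms existing methods on large-scale T1- and T2-weighted MRI data.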

Details Motivation: Aggregating multi-site brain MRI data benefits deep learning training, but differences in scanners, parameters, and protocols across sites introduce non-biological heterogeneity that harms generalization; existing harmonization methods often rely on limited paired data or fail to separate image style from anatomy, and most handle only a single sequence, which limits their use in the multi-sequence settings common in clinical practice. Method: The MMH framework has two stages: (1) a diffusion-based global harmonizer that maps MR images to a sequence-specific unified domain using style-agnostic gradient conditioning; (2) a target-specific fine-tuner that adapts the globally aligned images to the target domain. A tri-planar attention BiomedCLIP encoder aggregates multi-view embeddings to characterize volumetric style information, disentangling style from anatomy without paired data. Result: Experiments on 4,163 T1- and T2-weighted MRIs show that MMH significantly outperforms state-of-the-art methods in image feature clustering, voxel-level comparison, tissue segmentation, and downstream age and site classification. Conclusion: MMH is an effective multi-site multi-sequence MRI harmonization framework that overcomes site heterogeneity and improves image consistency and model generalization, with broad potential for clinical application. Abstract: Aggregating multi-site brain MRI data can enhance deep learning model training, but also introduces non-biological heterogeneity caused by site-specific variations (e.g., differences in scanner vendors, acquisition parameters, and imaging protocols) that can undermine generalizability. Recent retrospective MRI harmonization seeks to reduce such site effects by standardizing image style (e.g., intensity, contrast, noise patterns) while preserving anatomical content. However, existing methods often rely on limited paired traveling-subject data or fail to effectively disentangle style from anatomy. Furthermore, most current approaches address only single-sequence harmonization, restricting their use in real-world settings where multi-sequence MRI is routinely acquired. To this end, we introduce MMH, a unified framework for multi-site multi-sequence brain MRI harmonization that leverages biomedical semantic priors for sequence-aware style alignment. MMH operates in two stages: (1) a diffusion-based global harmonizer that maps MR images to a sequence-specific unified domain using style-agnostic gradient conditioning, and (2) a target-specific fine-tuner that adapts globally aligned images to desired target domains. A tri-planar attention BiomedCLIP encoder aggregates multi-view embeddings to characterize volumetric style information, allowing explicit disentanglement of image styles from anatomy without requiring paired data.
Evaluations on 4,163 T1- and T2-weighted MRIs demonstrate MMH's superior harmonization over state-of-the-art methods in image feature clustering, voxel-level comparison, tissue segmentation, and downstream age and site classification.

[108] MobiDiary: Autoregressive Action Captioning with Wearable Devices and Wireless Signals

Fei Deng,Yinghui He,Chuntong Chu,Ge Wang,Han Ding,Jinsong Han,Fei Wang

Main category: cs.CV

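TL;DR: This paper proposes MobiDiary, a framework that generates natural language descriptions of daily activities from heterogeneous physical signals (IMU and Wi-Fi); it overcomes the limitations of vision-based and predefined-label approaches by using a unified sensor encoder and a Transformer decoder for cross-modal action captioning, and achieves state-of-the-art performance on multiple benchmarks.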

Details Motivation: Existing activity recognition methods mostly rely on visual data or predefined class labels; they raise privacy concerns, are constrained by the environment, and lack expressiveness, making it hard to generate rich, readable activity descriptions. Method: MobiDiary uses a patch-based unified sensor encoder to capture local temporal correlations and spatial context in IMU and Wi-Fi signals, exploits the kinematic dynamics shared by motion-induced signals for cross-modal fusion, and generates natural language descriptions autoregressively with a Transformer decoder. Result: Evaluated on several public benchmarks (XRF V2, UWash, and WiFiTAD), MobiDiary outperforms existing methods on captioning metrics such as BLEU@4, CIDEr, and RMC, and surpasses specialized baselines in continuous action understanding. Conclusion: MobiDiary bridges the semantic gap between physical signals and language descriptions, generalizes across modalities, and produces high-quality, human-readable activity logs, offering a more practical and privacy-friendly solution for health monitoring and assistive living in smart homes. Abstract: Human Activity Recognition (HAR) in smart homes is critical for health monitoring and assistive living. While vision-based systems are common, they face privacy concerns and environmental limitations (e.g., occlusion). In this work, we present MobiDiary, a framework that generates natural language descriptions of daily activities directly from heterogeneous physical signals (specifically IMU and Wi-Fi). Unlike conventional approaches that restrict outputs to pre-defined labels, MobiDiary produces expressive, human-readable summaries. To bridge the semantic gap between continuous, noisy physical signals and discrete linguistic descriptions, we propose a unified sensor encoder. Instead of relying on modality-specific engineering, we exploit the shared inductive biases of motion-induced signals--where both inertial and wireless data reflect underlying kinematic dynamics. Specifically, our encoder utilizes a patch-based mechanism to capture local temporal correlations and integrates heterogeneous placement embedding to unify spatial contexts across different sensors. These unified signal tokens are then fed into a Transformer-based decoder, which employs an autoregressive mechanism to generate coherent action descriptions word-by-word. We comprehensively evaluate our approach on multiple public benchmarks (XRF V2, UWash, and WiFiTAD).
Experimental results demonstrate that MobiDiary effectively generalizes across modalities, achieving state-of-the-art performance on captioning metrics (e.g., BLEU@4, CIDEr, RMC) and outperforming specialized baselines in continuous action understanding.

[109] FUME: Fused Unified Multi-Gas Emission Network for Livestock Rumen Acidosis Detection

Taminul Islam,Toqi Tahamid Sarker,Mohamed Embaby,Khaled R Ahmed,Amer AbuGhazaleh

Main category: cs.CV

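TL;DR: This paper proposes FUME, a lightweight deep learning model based on dual-gas optical imaging for non-invasive detection of rumen acidosis in dairy cattle under in vitro conditions; it achieves high-accuracy classification and gas plume segmentation, and the authors construct the first dual-gas infrared image dataset.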

Details Motivation: Existing diagnostic methods for rumen acidosis rely on invasive pH measurement, which makes continuous monitoring difficult and limits large-scale use. Method: FUME is a dual-stream architecture with weight-shared encoders, modality-specific self-attention, and channel attention fusion that jointly optimizes CO2 and CH4 gas plume segmentation and health status classification. Result: On a dataset of 8,967 frames, FUME achieves 80.99% mIoU and 98.82% classification accuracy with only 1.28M parameters and 1.97G MACs, at 10x lower computational cost and with better performance than existing methods; ablation studies show that CO2 provides the primary discriminative signal and that dual-task learning is essential for performance. Conclusion: The study demonstrates the feasibility of gas-emission-based livestock health monitoring and opens a path toward non-invasive, scalable rumen acidosis detection systems. Abstract: Ruminal acidosis is a prevalent metabolic disorder in dairy cattle causing significant economic losses and animal welfare concerns. Current diagnostic methods rely on invasive pH measurement, limiting scalability for continuous monitoring. We present FUME (Fused Unified Multi-gas Emission Network), the first deep learning approach for rumen acidosis detection from dual-gas optical imaging under in vitro conditions. Our method leverages complementary carbon dioxide (CO2) and methane (CH4) emission patterns captured by infrared cameras to classify rumen health into Healthy, Transitional, and Acidotic states. FUME employs a lightweight dual-stream architecture with weight-shared encoders, modality-specific self-attention, and channel attention fusion, jointly optimizing gas plume segmentation and classification of dairy cattle health. We introduce the first dual-gas OGI dataset comprising 8,967 annotated frames across six pH levels with pixel-level segmentation masks. Experiments demonstrate that FUME achieves 80.99% mIoU and 98.82% classification accuracy while using only 1.28M parameters and 1.97G MACs--outperforming state-of-the-art methods in segmentation quality with 10x lower computational cost. Ablation studies reveal that CO2 provides the primary discriminative signal and dual-task learning is essential for optimal performance. Our work establishes the feasibility of gas emission-based livestock health monitoring, paving the way for practical, in vitro acidosis detection systems. Codes are available at https://github.com/taminulislam/fume.

[110] Knowledge-based learning in Text-RAG and Image-RAG

Alexander Shim,Khalil Saieh,Samuel Clarke

Main category: cs.CV

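TL;DR: This study examines a multimodal approach that combines an EVA-ViT image encoder with a LLaMA or ChatGPT language model to reduce hallucination in chest X-ray diagnosis and improve disease detection.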

Details Motivation: The work aims to reduce the hallucinations that existing models produce in medical image understanding and to improve diagnostic accuracy and model calibration. Method: Models are trained on the NIH Chest X-ray dataset, comparing image-based RAG, text-based RAG, and a baseline; image-based RAG is enhanced with KNN, text-based RAG draws on external knowledge, and GPT and Llama-family LLMs are compared. Result: Text-based RAG reduced hallucination, and image-based RAG improved prediction confidence and calibration through KNN; GPT outperformed the Llama model in hallucination rate and ECE. Conclusion: Despite the challenges of data imbalance and a complex multi-stage structure, the study supports the potential of multimodal RAG frameworks for medical image diagnosis and recommends building a more balanced data environment in future work to improve practicality. Abstract: This research analyzed and compared the multi-modal approach in the Vision Transformer (EVA-ViT) based image encoder with the LLaMA or ChatGPT LLM to reduce the hallucination problem and detect diseases in chest x-ray images. In this research, we utilized the NIH Chest X-ray images to train the model and compared it in image-based RAG, text-based RAG, and baseline. [3] [5] As a result, the text-based RAG [2] effectively reduces the hallucination problem by using external knowledge information, and the image-based RAG improved the prediction confidence and calibration by using the KNN methods. [4] Moreover, the GPT LLM showed better performance, a lower hallucination rate, and better Expected Calibration Error (ECE) than the Llama-based model. This research shows the challenges of data imbalance and a complex multi-stage structure, but suggests a large experience environment and a balanced example of use.
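The abstract compares models by Expected Calibration Error (ECE). A minimal sketch of the commonly used equal-width binned form follows, where ECE is the sample-weighted average of |accuracy - mean confidence| over confidence bins. The bin count of 10 and the toy inputs are assumptions; the paper's exact binning is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

# Hypothetical predictions: top-class confidence and whether the prediction was right.
toy_conf = [0.95, 0.95, 0.55, 0.55]
toy_correct = [True, True, True, False]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

A lower ECE means the model's stated confidence tracks its actual accuracy more closely, which is the sense in which the abstract reports improved calibration.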

[111] Improving Zero-shot ADL Recognition with Large Language Models through Event-based Context and Confidence

Michele Fiori,Gabriele Civitarese,Marco Colussi,Claudio Bettini

Main category: cs.CV

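TL;DR: This paper proposes a zero-shot method for recognizing Activities of Daily Living (ADLs) based on event-based segmentation and prediction confidence estimation, which significantly improves recognition on unlabeled sensor data.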

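Details Motivation: Existing LLM-based zero-shot ADL recognition methods rely on time-based segmentation, which is poorly matched to the contextual reasoning capabilities of LLMs, and they lack a mechanism for estimating prediction confidence. Method: Time-based segmentation is replaced with an event-based segmentation strategy, and a new method for estimating prediction confidence is designed, to make better use of LLMs for ADL recognition. Result: Experiments show that event-based segmentation consistently outperforms time-based approaches on complex, realistic datasets and even surpasses supervised models; relatively small LLMs (e.g., Gemma 3 27B) also perform well, and the proposed confidence measure effectively distinguishes correct from incorrect predictions. Conclusion: Event-based segmentation fits the reasoning characteristics of LLMs better and, combined with confidence estimation, significantly improves the accuracy and practicality of zero-shot ADL recognition, pointing to a new direction for label-free smart-home sensing. Abstract: Unobtrusive sensor-based recognition of Activities of Daily Living (ADLs) in smart homes by processing data collected from IoT sensing devices supports applications such as healthcare, safety, and energy management. Recent zero-shot methods based on Large Language Models (LLMs) have the advantage of removing the reliance on labeled ADL sensor data. However, existing approaches rely on time-based segmentation, which is poorly aligned with the contextual reasoning capabilities of LLMs. Moreover, existing approaches lack methods for estimating prediction confidence. This paper proposes to improve zero-shot ADL recognition with event-based segmentation and a novel method for estimating prediction confidence. Our experimental evaluation shows that event-based segmentation consistently outperforms time-based LLM approaches on complex, realistic datasets and surpasses supervised data-driven methods, even with relatively small LLMs (e.g., Gemma 3 27B). The proposed confidence measure effectively distinguishes correct from incorrect predictions.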

[112] AIMC-Spec: A Benchmark Dataset for Automatic Intrapulse Modulation Classification under Variable Noise Conditions

Sebastian L. Cocks,Salvador Dreo,Feras Dayoub

Main category: cs.CV

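TL;DR: This paper introduces AIMC-Spec, a standardized synthetic dataset for automatic intrapulse modulation classification (AIMC) of radar signals, covering 33 modulation types and 13 signal-to-noise ratio levels; five deep learning models are evaluated on it, revealing how modulation type and network architecture affect classification robustness.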

Details Motivation: The lack of standardized datasets has long limited progress in AIMC for radar signal analysis, especially under noisy or degraded conditions; a public, comprehensive dataset and benchmark are needed to support reproducible research and comparison. Method: AIMC-Spec, a large synthetic dataset, is built by generating spectrogram image samples from I/Q signals, covering 33 modulation types and 13 SNR levels; five representative deep learning models (including lightweight CNNs, denoising networks, and Transformer architectures) are re-implemented and systematically evaluated under a unified input format. Result: All models perform better on FM signals, while performance drops markedly on phase and hybrid modulation types, especially at low SNR; network architecture significantly affects robustness across modulation types, and Transformer and denoising architectures show some advantage in noise robustness. Conclusion: AIMC-Spec provides a reproducible benchmark for AIMC and supports standardized research in intelligent radar signal analysis; future work can build on it to develop more robust classifiers. Abstract: A lack of standardized datasets has long hindered progress in automatic intrapulse modulation classification (AIMC) - a critical task in radar signal analysis for electronic support systems, particularly under noisy or degraded conditions. AIMC seeks to identify the modulation type embedded within a single radar pulse from its complex in-phase and quadrature (I/Q) representation, enabling automated interpretation of intrapulse structure. This paper introduces AIMC-Spec, a comprehensive synthetic dataset for spectrogram-based image classification, encompassing 33 modulation types across 13 signal-to-noise ratio (SNR) levels. To benchmark AIMC-Spec, five representative deep learning algorithms - ranging from lightweight CNNs and denoising architectures to transformer-based networks - were re-implemented and evaluated under a unified input format. The results reveal significant performance variation, with frequency-modulated (FM) signals classified more reliably than phase or hybrid types, particularly at low SNRs. A focused FM-only test further highlights how modulation type and network architecture influence classifier robustness. AIMC-Spec establishes a reproducible baseline and provides a foundation for future research and standardization in the AIMC domain.
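To make the dataset's two axes (modulation type and SNR) concrete, the sketch below generates one linear-FM pulse as complex I/Q samples and adds complex white Gaussian noise scaled to a target SNR. The pulse parameters, the names `make_lfm_pulse` and `add_noise_at_snr`, and the per-pulse SNR definition are assumptions for illustration; the dataset's actual generation settings are not given in the abstract.

```python
import cmath
import math
import random

def make_lfm_pulse(n_samples, f_start, f_stop, sample_rate):
    """Linear FM (chirp) pulse as unit-amplitude complex I/Q samples."""
    duration = n_samples / sample_rate
    chirp_rate = (f_stop - f_start) / duration  # frequency sweep in Hz per second
    pulse = []
    for i in range(n_samples):
        t = i / sample_rate
        phase = 2 * math.pi * (f_start * t + 0.5 * chirp_rate * t * t)
        pulse.append(cmath.exp(1j * phase))
    return pulse

def add_noise_at_snr(signal, snr_db, rng):
    """Add complex white Gaussian noise so that signal power / noise power = SNR."""
    signal_power = sum(abs(s) ** 2 for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power / 2)  # noise power is split equally between I and Q
    return [s + complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for s in signal]

rng = random.Random(0)  # fixed seed so the example is repeatable
pulse = make_lfm_pulse(n_samples=4096, f_start=0.0, f_stop=1e5, sample_rate=1e6)
noisy = add_noise_at_snr(pulse, snr_db=0.0, rng=rng)  # 0 dB is an arbitrary example level
```

A spectrogram of `noisy` (for example via a short-time Fourier transform) would then be the image fed to a classifier; that step is omitted here to keep the sketch dependency-free.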

[113] HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding

Qitan Lv,Tianyu Liu,Wen Wu,Xuenan Xu,Bowen Zhou,Feng Wu,Chao Zhang

Main category: cs.CV

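TL;DR: This paper proposes HIPPO, a parallel speculative decoding framework for accelerating inference in video large language models (video-LLMs). Using a semantic-aware visual token preservation strategy and a parallelized decoding pipeline, it maintains generation quality at high pruning ratios and achieves up to 3.51x inference speedup.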

Details Motivation: Existing speculative decoding methods do not achieve speedups on video-LLMs comparable to those on text-only LLMs, mainly because visual token pruning loses too much semantic information and the draft model remains costly even after pruning. Method: HIPPO has two core components: (1) a semantic-aware token preservation mechanism that fuses global attention scores with local visual semantics to retain key semantic information at high pruning ratios; (2) a video parallel speculative decoding algorithm that decouples and overlaps the draft generation and target verification phases to improve overall throughput. Result: Experiments on four video-LLMs across six benchmarks show that HIPPO achieves up to 3.51x speedup over vanilla auto-regressive decoding, significantly outperforming existing speculative decoding methods. Conclusion: HIPPO addresses the limited speedup of speculative decoding in current video-LLMs; through semantic preservation and parallelization it delivers efficient, high-quality inference acceleration with good generality and practicality. Abstract: Speculative decoding (SD) has emerged as a promising approach to accelerate LLM inference without sacrificing output quality. Existing SD methods tailored for video-LLMs primarily focus on pruning redundant visual tokens to mitigate the computational burden of massive visual inputs. However, existing methods do not achieve inference acceleration comparable to text-only LLMs. We observe from extensive experiments that this phenomenon mainly stems from two limitations: (i) their pruning strategies inadequately preserve visual semantic tokens, degrading draft quality and acceptance rates; (ii) even with aggressive pruning (e.g., 90% visual tokens removed), the draft model's remaining inference cost limits overall speedup. To address these limitations, we propose HIPPO, a general holistic-aware parallel speculative decoding framework. Specifically, HIPPO proposes (i) a semantic-aware token preservation method, which fuses global attention scores with local visual semantics to retain semantic information at high pruning ratios; (ii) a video parallel SD algorithm that decouples and overlaps draft generation and target verification phases. Experiments on four video-LLMs across six benchmarks demonstrate HIPPO's effectiveness, yielding up to 3.51x speedup compared to vanilla auto-regressive decoding.
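The abstract builds on speculative decoding, in which a cheap draft model proposes several tokens and the target model verifies them. The toy below sketches only the generic greedy variant, not HIPPO's token pruning or its overlapped draft and verify phases; `draft_next` and `target_next` are hypothetical stand-ins for real models, each mapping a token list to the next token.

```python
def speculative_decode(prefix, target_next, draft_next, n_new, k=4):
    """Greedy speculative decoding; the output equals target-only greedy decoding."""
    tokens = list(prefix)
    goal = len(prefix) + n_new
    while len(tokens) < goal:
        # 1) Draft phase: the cheap model proposes up to k tokens.
        proposal = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft_next(tokens + proposal))
        # 2) Verify phase: the target checks each proposed position.
        #    (A real system scores all positions in one batched forward pass.)
        for i, tok in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if tok != expected:
                # Keep the accepted prefix, then substitute the target's token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # every drafted token was accepted
    return tokens

# Toy deterministic "models" over tokens 0..6 (assumptions, not real LLMs).
def target_next(tokens):
    return (tokens[-1] * 3 + 1) % 7

def draft_next(tokens):
    # Agrees with the target except after token 5, where it guesses wrong.
    return 0 if tokens[-1] == 5 else (tokens[-1] * 3 + 1) % 7

output = speculative_decode([1], target_next, draft_next, n_new=10, k=4)
```

The speedup in practice comes from the draft model being much cheaper per token and from verifying a whole proposal in one target pass; the more often drafts are accepted, the fewer target passes are needed, which is why the abstract ties draft quality to acceptance rate.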

[114] One-Shot Identification with Different Neural Network Approaches

Janis Mohr,Jörg Frochte

Main category: cs.CV

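TL;DR: This paper explores stacked images and siamese capsule networks for data-scarce one-shot learning tasks, achieving results that surpass other techniques across several domains, including an industrial application and face recognition.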

Details Motivation: When data is scarce, particularly in one-shot learning settings, conventional convolutional neural networks struggle to learn good features, so new methods are needed to improve few-sample learning. Method: A stacked-image technique and siamese capsule networks are applied to one-shot identification tasks and evaluated on datasets from different domains. Result: The capsule architecture performs strongly on multiple datasets and exceeds other existing techniques, while being easy to use and optimise. Conclusion: Combining capsule networks with stacked images shows strong potential for one-shot learning and suits practical scenarios such as industrial applications and face recognition. Abstract: Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state-of-the-art. But learning good features often is computationally expensive in machine learning settings and is especially difficult when there is a lack of data. One-shot learning is one such area where only limited data is available. In one-shot learning, predictions have to be made after seeing only one example from one class, which requires special techniques. In this paper we explore different approaches to one-shot identification tasks in different domains including an industrial application and face recognition. We use a special technique with stacked images and use siamese capsule networks. It is encouraging to see that the approach using capsule architecture achieves strong results and exceeds other techniques on a wide range of datasets from industrial application to face recognition benchmarks while being easy to use and optimise.

[115] KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?

Xianfeng Wang,Kaiwei Zhang,Qi Jia,Zijian Chen,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

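TL;DR: This paper presents KidVis, a benchmark that evaluates multimodal large language models on six basic visual capabilities grounded in the theory of children's visual development, and finds that current MLLMs lack human-like low-level visual perception.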

Details Motivation: To investigate whether MLLMs possess fundamental visual primitives comparable to human intuition, rather than only high-level reasoning ability. Method: KidVis is built on the theory of children's visual development; it decomposes visual intelligence into six atomic capabilities and uses low-semantic-dependent visual tasks to evaluate 20 mainstream MLLMs. Result: Human children score 95.32 on average, whereas the state-of-the-art GPT-5 scores only 67.33; increasing model parameters does not yield linear performance gains, a "Scaling Law Paradox". Conclusion: Current MLLMs, though strong at high-level reasoning, lack the basic physiological perceptual primitives required for generalized visual intelligence. Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated impressive proficiency in high-level reasoning tasks, such as complex diagrammatic interpretation, it remains an open question whether they possess the fundamental visual primitives comparable to human intuition. To investigate this, we introduce KidVis, a novel benchmark grounded in the theory of human visual development. KidVis deconstructs visual intelligence into six atomic capabilities - Concentration, Tracking, Discrimination, Memory, Spatial, and Closure - already possessed by 6-7 year old children, comprising 10 categories of low-semantic-dependent visual tasks. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. Results indicate that while human children achieve a near-perfect average score of 95.32, the state-of-the-art GPT-5 attains only 67.33. Crucially, we observe a "Scaling Law Paradox": simply increasing model parameters fails to yield linear improvements in these foundational visual capabilities. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.

[116] M3SR: Multi-Scale Multi-Perceptual Mamba for Efficient Spectral Reconstruction

Yuze Zhang,Lingjie Li,Qiuzhen Lin,Zhong Ming,Fei Yu,Victor C. M. Leung

Main category: cs.CV

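TL;DR: M3SR, a multi-scale, multi-perceptual Mamba architecture for spectral reconstruction, is proposed; it achieves better performance through a multi-perceptual fusion block and a U-Net structure.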

Details Motivation: To address the single spatial perception and single-scale feature extraction of existing Mamba architectures in spectral reconstruction. Method: A multi-perceptual fusion block is designed and integrated into a U-Net structure for multi-scale feature extraction and fusion. Result: M3SR outperforms state-of-the-art methods in both quantitative and qualitative experiments at lower computational cost. Conclusion: M3SR effectively improves the accuracy and efficiency of hyperspectral image reconstruction. Abstract: The Mamba architecture has been widely applied to various low-level vision tasks due to its exceptional adaptability and strong performance. Although the Mamba architecture has been adopted for spectral reconstruction, it still faces the following two challenges: (1) Single spatial perception limits the ability to fully understand and analyze hyperspectral images; (2) Single-scale feature extraction struggles to capture the complex structures and fine details present in hyperspectral images. To address these issues, we propose a multi-scale, multi-perceptual Mamba architecture for the spectral reconstruction task, called M3SR. Specifically, we design a multi-perceptual fusion block to enhance the ability of the model to comprehensively understand and analyze the input features. By integrating the multi-perceptual fusion block into a U-Net structure, M3SR can effectively extract and fuse global, intermediate, and local features, thereby enabling accurate reconstruction of hyperspectral images at multiple scales. Extensive quantitative and qualitative experiments demonstrate that the proposed M3SR outperforms existing state-of-the-art methods while incurring a lower computational cost.

[117] ReCo-KD: Region- and Context-Aware Knowledge Distillation for Efficient 3D Medical Image Segmentation

Qizhen Lan,Yu-Chun Hsu,Nida Saddaf Khan,Xiaoqian Jiang

Main category: cs.CV

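TL;DR: ReCo-KD, a training-only knowledge distillation framework, is proposed to transfer knowledge from a high-capacity teacher model to a lightweight student model without sacrificing performance, for 3D medical image segmentation.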

Details Motivation: High-performing 3D medical image segmentation models are computationally expensive and hard to deploy in resource-limited clinical environments, while lightweight models underperform; a balance between accuracy and efficiency is needed. Method: Region- and Context-aware Knowledge Distillation (ReCo-KD) comprises Multi-Scale Structure-Aware Region Distillation (MS-SARD) and Multi-Scale Context Alignment (MS-CA); it is implemented on nnU-Net, is backbone-agnostic, and requires no custom student architecture. Result: Validated on multiple public 3D medical segmentation datasets and an aggregated dataset, the lightweight student model attains performance close to the teacher while markedly reducing parameters and inference latency. Conclusion: ReCo-KD is a practical, flexible, and efficient training framework that helps bring high-performance medical image segmentation models into clinical deployment. Abstract: Accurate 3D medical image segmentation is vital for diagnosis and treatment planning, but state-of-the-art models are often too large for clinics with limited computing resources. Lightweight architectures typically suffer significant performance loss. To address these deployment and speed constraints, we propose Region- and Context-aware Knowledge Distillation (ReCo-KD), a training-only framework that transfers both fine-grained anatomical detail and long-range contextual information from a high-capacity teacher to a compact student network. The framework integrates Multi-Scale Structure-Aware Region Distillation (MS-SARD), which applies class-aware masks and scale-normalized weighting to emphasize small but clinically important regions, and Multi-Scale Context Alignment (MS-CA), which aligns teacher-student affinity patterns across feature levels. Implemented on nnU-Net in a backbone-agnostic manner, ReCo-KD requires no custom student design and is easily adapted to other architectures. Experiments on multiple public 3D medical segmentation datasets and a challenging aggregated dataset show that the distilled lightweight model attains accuracy close to the teacher while markedly reducing parameters and inference latency, underscoring its practicality for clinical deployment.

[118] SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices

Dongting Hu,Aarush Gupta,Magzhan Gabidolla,Arpit Sahni,Huseyin Coskun,Yanyu Li,Yerlan Idelbayev,Ahsan Mahmood,Aleksei Lebedev,Dishani Lahiri,Anujraaj Goyal,Ju Hu,Mingming Gong,Sergey Tulyakov,Anil Kag

Main category: cs.CV

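TL;DR: An efficient DiT framework designed for mobile and edge devices is proposed, achieving high-quality image generation under strict resource constraints.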

Details Motivation: Existing diffusion transformers (DiTs) excel at image generation but are hard to deploy on-device because of their high computational and memory costs. Method: A compact DiT architecture with an adaptive global-local sparse attention mechanism is proposed; an elastic training framework jointly optimizes sub-DiTs of varying capacities; and a knowledge-guided distribution matching distillation method combines the DMD objective with knowledge transfer from few-step teacher models. Result: The framework produces high-quality, high-fidelity images at low latency (e.g., 4-step generation), suitable for real-time on-device use. Conclusion: The proposed method enables efficient, scalable deployment of diffusion models on diverse hardware while preserving generation quality. Abstract: Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints. Our design combines three key components. First, we propose a compact DiT architecture with an adaptive global-local sparse attention mechanism that balances global context modeling and local detail preservation. Second, we propose an elastic training framework that jointly optimizes sub-DiTs of varying capacities within a unified supernetwork, allowing a single model to dynamically adjust for efficient inference across different hardware. Finally, we develop Knowledge-Guided Distribution Matching Distillation, a step-distillation pipeline that integrates the DMD objective with knowledge transfer from few-step teacher models, producing high-fidelity and low-latency generation (e.g., 4-step) suitable for real-time on-device use. Together, these contributions enable scalable, efficient, and high-quality diffusion models for deployment on diverse hardware.

[119] Enhancing Image Quality Assessment Ability of LMMs via Retrieval-Augmented Generation

Kang Fu,Huiyu Duan,Zicheng Zhang,Yucheng Zhu,Jun Zhao,Xiongkuo Min,Jia Wang,Guangtao Zhai

Main category: cs.CV

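TL;DR: This paper proposes IQARAG, a training-free framework that uses Retrieval-Augmented Generation (RAG) to improve the performance of Large Multimodal Models (LMMs) on Image Quality Assessment (IQA). It uses semantically similar but quality-variant reference images as visual perception anchors, matching or outperforming fine-tuned methods on several datasets while avoiding their high computational cost.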

Details Motivation: Although Large Multimodal Models (LMMs) show zero-shot capability on Image Quality Assessment (IQA), reaching state-of-the-art performance usually depends on computationally expensive fine-tuning. A training-free, efficient alternative is therefore needed to improve LMMs' IQA performance. Method: IQARAG has three phases: Retrieval Feature Extraction, Image Retrieval, and Integration & Quality Score Generation. It first extracts image features from a database, then retrieves reference images that are semantically similar to the input image but differ in quality, together with their MOS values, and finally integrates the reference images and the input image into a specific prompt that guides the LMM's quality judgment. Result: Experiments on several IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, show that IQARAG significantly improves LMMs' IQA performance; correlation metrics (e.g., SROCC, PLCC) are clearly better than the baselines and even comparable to fine-tuned models. Conclusion: IQARAG offers an efficient, training-free way to strengthen LMMs' image quality assessment ability, demonstrates that introducing perceptual anchors through external knowledge retrieval is effective, and suggests a new direction for low-level vision tasks. Abstract: Large Multimodal Models (LMMs) have recently shown remarkable promise in low-level visual perception tasks, particularly in Image Quality Assessment (IQA), demonstrating strong zero-shot capability. However, achieving state-of-the-art performance often requires computationally expensive fine-tuning methods, which aim to align the distribution of quality-related tokens in the output with image quality levels. Inspired by recent training-free works for LMMs, we introduce IQARAG, a novel, training-free framework that enhances LMMs' IQA ability. IQARAG leverages Retrieval-Augmented Generation (RAG) to retrieve semantically similar but quality-variant reference images with corresponding Mean Opinion Scores (MOSs) for the input image. These retrieved images and the input image are integrated into a specific prompt. Retrieved images provide the LMM with a visual perception anchor for the IQA task. IQARAG contains three key phases: Retrieval Feature Extraction, Image Retrieval, and Integration & Quality Score Generation. Extensive experiments across multiple diverse IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, demonstrate that the proposed IQARAG effectively boosts the IQA performance of LMMs, offering a resource-efficient alternative to fine-tuning for quality assessment.

[120] YOLOBirDrone: Dataset for Bird vs Drone Detection and Classification and a YOLO based enhanced learning architecture

Dapinder Kaur,Neeraj Battish,Arnav Bhavsar,Shashi Poddar

Main category: cs.CV

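TL;DR: Proposes a new YOLOBirDrone architecture that improves detection and classification accuracy for drones and birds, and introduces the large-scale BirDrone dataset; experiments show detection accuracy of about 85%.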

Details Motivation: Existing vision-based drone detection systems have low accuracy when distinguishing small drones from birds and need improvement. Method: Proposes the YOLOBirDrone architecture, which includes an adaptive and extended layer aggregation (AELAN), a multi-scale progressive dual attention module (MPDA), and a reverse MPDA (RMPDA), and builds the large-scale BirDrone dataset to support small-object detection. Result: Compared with existing state-of-the-art algorithms, YOLOBirDrone improves performance metrics across a range of scenarios, with detection accuracy of about 85%. Conclusion: YOLOBirDrone improves recognition of small drones and birds in complex scenes and is practical and robust. Abstract: The use of aerial drones for commercial and defense applications has benefited in many ways and is therefore utilized in several different application domains. However, they are also increasingly used for targeted attacks, posing a significant safety challenge and necessitating the development of drone detection systems. Vision-based drone detection systems currently have an accuracy limitation and struggle to distinguish between drones and birds, particularly when the birds are small in size. This research work proposes a novel YOLOBirDrone architecture that improves the detection and classification accuracy of birds and drones. YOLOBirDrone has different components, including an adaptive and extended layer aggregation (AELAN), a multi-scale progressive dual attention module (MPDA), and a reverse MPDA (RMPDA) to preserve shape information and enrich features with local and global spatial and channel information. A large-scale dataset, BirDrone, is also introduced in this article, which includes small and challenging objects for robust aerial object identification. Experimental results demonstrate an improvement in performance metrics through the proposed YOLOBirDrone architecture compared to other state-of-the-art algorithms, with detection accuracy reaching approximately 85% across various scenarios.

[121] UM-Text: A Unified Multimodal Model for Image Understanding

Lichen Ma,Xiaolong Fu,Gaojing Zhou,Zipeng Guo,Ting Zhu,Yichun Liu,Yu Shi,Jason Li,Junshi Huang

Main category: cs.CV

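TL;DR: This paper proposes UM-Text, a unified multimodal model for visual text editing driven by natural language instructions. It introduces a visual language model and a UM-Encoder to generate style-consistent text, designs a regional consistency loss and a three-stage training strategy, and validates the method with the large-scale UM-DATA-200K dataset.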

Details Motivation: Existing visual text editing methods usually require text attributes to be specified manually and struggle to keep the generated text stylistically consistent with the reference image. An end-to-end method that understands context automatically and preserves style consistency is therefore needed. Method: Proposes the UM-Text model, which contains a Visual Language Model (VLM) that interprets the instruction and reference image, and a UM-Encoder that fuses multiple kinds of condition information. A regional consistency loss provides more effective supervision for glyph generation in both latent and RGB space, and a three-stage training strategy further improves performance. Result: Qualitative and quantitative experiments on multiple public benchmarks show state-of-the-art performance in text style consistency, layout plausibility, and generation quality. Conclusion: UM-Text achieves high-quality, style-consistent visual text editing with a unified multimodal architecture and, together with the newly built large-scale UM-DATA-200K dataset, substantially improves natural-language-driven text image generation. Abstract: With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation on both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute the UM-DATA-200K, a large-scale visual text image dataset on diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.

[122] IGAN: A New Inception-based Model for Stable and High-Fidelity Image Synthesis Using Generative Adversarial Networks

Ahmed A. Hashim,Ali Al-Shuwaili,Asraa Saeed,Ali Al-Bayaty

Main category: cs.CV

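TL;DR: This paper proposes IGAN, a new generative adversarial network that combines deep inception-style convolution with dilated convolution, improving image generation quality and training stability and achieving better FID and IS scores than existing methods on multiple datasets.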

Details Motivation: To address the mode collapse and unstable gradients that GANs face in high-quality image generation, especially the training difficulties of deep networks. Method: Proposes the Inception Generative Adversarial Network (IGAN), which uses inception-inspired and dilated convolution structures and applies dropout and spectral normalization in both the generator and the discriminator to mitigate gradient explosion and overfitting. Result: Achieves FID scores of 13.12 and 15.08 on CUB-200 and ImageNet, respectively, a 28-33% improvement over state-of-the-art GANs, along with IS scores of 9.27 and 68.25, indicating higher image diversity and quality. Conclusion: IGAN generates high-quality images while keeping training stable and is a scalable, computationally efficient framework for high-fidelity image synthesis. Abstract: Generative Adversarial Networks (GANs) face a significant challenge of striking an optimal balance between high-quality image generation and training stability. Recent techniques, such as DCGAN, BigGAN, and StyleGAN, improve visual fidelity; however, such techniques usually struggle with mode collapse and unstable gradients at high network depth. This paper proposes a novel GAN structural model that incorporates deeper inception-inspired convolution and dilated convolution. This novel model is termed the Inception Generative Adversarial Network (IGAN). The IGAN model generates high-quality synthetic images while maintaining training stability, by reducing mode collapse as well as preventing vanishing and exploding gradients. Our proposed IGAN model achieves the Frechet Inception Distance (FID) of 13.12 and 15.08 on the CUB-200 and ImageNet datasets, respectively, representing a 28-33% improvement in FID over the state-of-the-art GANs. Additionally, the IGAN model attains an Inception Score (IS) of 9.27 and 68.25, reflecting improved image diversity and generation quality. Finally, the two techniques of dropout and spectral normalization are utilized in both the generator and discriminator structures to further mitigate gradient explosion and overfitting. These findings confirm that the IGAN model potentially balances training stability with image generation quality, constituting a scalable and computationally efficient framework for high-fidelity image synthesis.

[123] Tissue Classification and Whole-Slide Images Analysis via Modeling of the Tumor Microenvironment and Biological Pathways

Junzhuo Liu,Xuemei Du,Daniel Reisenbuchler,Ye Chen,Markus Eckstein,Christian Matek,Friedrich Feuerhake,Dorit Merhof

Main category: cs.CV

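TL;DR: This paper proposes BioMorphNet, a multimodal network that integrates whole slide images (WSIs) with spatial gene expression data to support tissue classification and differential gene analysis. By modeling the relationship between morphological and molecular features and introducing clinical pathways plus a learnable pathway module, it substantially improves cancer classification performance and helps uncover potential tumor biomarkers.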

Details Motivation: Existing studies mostly focus on individual gene sequences and slide-level classification tasks, with little attention to spatial transcriptomics and patch-level applications. A method that automatically integrates tissue morphological features with spatial gene expression is therefore needed to better characterize the tumor microenvironment. Method: BioMorphNet builds a graph to describe the relationships between a target patch and its neighbors and adjusts the response strength based on morphological and molecular-level similarity; it uses a predefined pathway database to extract clinical pathway features from spatial transcriptomic data as a bridge; and it designs a new learnable pathway module that simulates the formation of biological pathways. Result: Compared with the latest multimodal methods, BioMorphNet improves average classification metrics by 2.67%, 5.48%, and 6.29% on prostate cancer, colorectal cancer, and breast cancer datasets, respectively. The model also classifies tissue types within WSIs accurately and analyzes differential gene expression between tissue categories. Conclusion: BioMorphNet not only improves tissue classification accuracy but also analyzes differential gene expression based on prediction confidence, providing a useful tool for tumor localization and the discovery of potential biomarkers. Abstract: Automatic integration of whole slide images (WSIs) and gene expression profiles has demonstrated substantial potential in precision clinical diagnosis and cancer progression studies. However, most existing studies focus on individual gene sequences and slide level classification tasks, with limited attention to spatial transcriptomics and patch level applications. To address this limitation, we propose a multimodal network, BioMorphNet, which automatically integrates tissue morphological features and spatial gene expression to support tissue classification and differential gene analysis. For considering morphological features, BioMorphNet constructs a graph to model the relationships between target patches and their neighbors, and adjusts the response strength based on morphological and molecular level similarity, to better characterize the tumor microenvironment. In terms of multimodal interactions, BioMorphNet derives clinical pathway features from spatial transcriptomic data based on a predefined pathway database, serving as a bridge between tissue morphology and gene expression. In addition, a novel learnable pathway module is designed to automatically simulate the biological pathway formation process, providing a complementary representation to existing clinical pathways. Compared with the latest morphology gene multimodal methods, BioMorphNet's average classification metrics improve by 2.67%, 5.48%, and 6.29% for prostate cancer, colorectal cancer, and breast cancer datasets, respectively. BioMorphNet not only classifies tissue categories within WSIs accurately to support tumor localization, but also analyzes differential gene expression between tissue categories based on prediction confidence, contributing to the discovery of potential tumor biomarkers.

[124] From Local Windows to Adaptive Candidates via Individualized Exploratory: Rethinking Attention for Image Super-Resolution

Chunyu Meng,Wei Long,Shuhang Gu

Main category: cs.CV

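TL;DR: Proposes the Individualized Exploratory Transformer (IET), a new method that uses an individualized exploratory attention mechanism for more efficient and flexible single image super-resolution.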

Details Motivation: Existing Transformer-based methods for single image super-resolution are computationally expensive, and attention restricted to fixed groups ignores the asymmetry of similarities between tokens and lacks flexibility. Method: Introduces the Individualized Exploratory Attention (IEA) mechanism, which lets each token adaptively select its own content-aware, independent attention candidates, enabling fine-grained, asymmetric information aggregation. Result: Experiments on standard super-resolution benchmarks show that IET achieves state-of-the-art performance at comparable computational complexity. Conclusion: With token-level adaptive attention, IET balances performance and efficiency and offers a more flexible Transformer design for image super-resolution. Abstract: Single Image Super-Resolution (SISR) is a fundamental computer vision task that aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) input. Transformer-based methods have achieved remarkable performance by modeling long-range dependencies in degraded images. However, their feature-intensive attention computation incurs high computational cost. To improve efficiency, most existing approaches partition images into fixed groups and restrict attention within each group. Such group-wise attention overlooks the inherent asymmetry in token similarities, thereby failing to enable flexible and token-adaptive attention computation. To address this limitation, we propose the Individualized Exploratory Transformer (IET), which introduces a novel Individualized Exploratory Attention (IEA) mechanism that allows each token to adaptively select its own content-aware and independent attention candidates. This token-adaptive and asymmetric design enables more precise information aggregation while maintaining computational efficiency. Extensive experiments on standard SR benchmarks demonstrate that IET achieves state-of-the-art performance under comparable computational complexity.
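
To make the idea of per-token, asymmetric attention candidates concrete, here is a minimal sketch. It is not the paper's IEA mechanism: it assumes, purely for illustration, that each token picks its `k` highest dot-product neighbors, and the function name `topk_candidate_attention` is hypothetical.

```python
import math

def topk_candidate_attention(queries, keys, values, k):
    """Each query attends only to its own k most similar keys (illustrative stand-in)."""
    outputs, chosen = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) for key in keys]
        # Every token builds its own candidate set, so the choice need not be symmetric.
        cand = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Numerically stable softmax restricted to the selected candidates.
        m = max(scores[j] for j in cand)
        w = [math.exp(scores[j] - m) for j in cand]
        z = sum(w)
        out = [sum(w[i] / z * values[j][d] for i, j in enumerate(cand))
               for d in range(len(values[0]))]
        outputs.append(out)
        chosen.append(cand)
    return outputs, chosen

tokens = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
outputs, chosen = topk_candidate_attention(tokens, tokens, tokens, k=1)
```

In this toy input, token 1 selects token 0 as its candidate while token 0 selects itself, which is the kind of asymmetry that fixed window or group partitions cannot express.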

[125] Semantic Misalignment in Vision-Language Models under Perceptual Degradation

Guo Cheng

Main category: cs.CV

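TL;DR: This study systematically analyzes semantic misalignment in vision-language models (VLMs) when upstream visual perception degrades. It finds that VLMs fail severely even when conventional segmentation metrics drop only modestly, revealing a disconnect between pixel-level robustness and multimodal semantic reliability.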

Details Motivation: Current VLMs perform well on multimodal benchmarks, but their robustness to realistic perception degradation is unclear, especially in safety-critical applications such as autonomous driving, where reliable perception is essential. Method: Uses semantic segmentation on the Cityscapes dataset as the perception module, introduces perception-realistic corruptions, evaluates several contrastive and generative VLMs under controlled perception degradation, and proposes language-level misalignment metrics that quantify hallucination, critical omission, and safety misinterpretation. Result: Although conventional segmentation metrics drop only modestly, the VLMs make severe downstream errors, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments; across the VLMs studied, semantic misalignment is related to segmentation quality. Conclusion: Current VLM systems have serious semantic reliability problems under perception uncertainty, and evaluation frameworks that explicitly account for perception uncertainty are needed for reliable deployment in safety-critical settings. Abstract: Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.

[126] Geo-NVS-w: Geometry-Aware Novel View Synthesis In-the-Wild with an SDF Renderer

Anastasios Tsalakopoulos,Angelos Kanlis,Evangelos Chatzis,Antonis Karakottas,Dimitrios Zarpalas

Main category: cs.CV

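TL;DR: Proposes Geo-NVS-w, a geometry-aware framework for high-fidelity novel view synthesis that uses an SDF and a geometry-preservation loss to improve rendering consistency and detail preservation on complex surfaces.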

Details Motivation: Existing novel view synthesis methods for in-the-wild image collections lack geometric consistency on complex surfaces and are prone to rendering artifacts. Method: Uses a Signed Distance Function (SDF) as the underlying geometric representation and designs a Geometry-Preservation Loss that guides rendering and improves the preservation of structural detail. Result: Maintains competitive rendering performance while consuming 4-5x less energy than similar methods, and produces sharper, geometrically coherent results. Conclusion: Geo-NVS-w is a robust and efficient method for in-the-wild novel view synthesis that combines high visual quality with geometric accuracy. Abstract: We introduce Geo-NVS-w, a geometry-aware framework for high-fidelity novel view synthesis from unstructured, in-the-wild image collections. While existing in-the-wild methods already excel at novel view synthesis, they often lack geometric grounding on complex surfaces, sometimes producing results that contain inconsistencies. Geo-NVS-w addresses this limitation by leveraging an underlying geometric representation based on a Signed Distance Function (SDF) to guide the rendering process. This is complemented by a novel Geometry-Preservation Loss which ensures that fine structural details are preserved. Our framework achieves competitive rendering performance, while demonstrating a 4-5x reduction in energy consumption compared to similar methods. We demonstrate that Geo-NVS-w is a robust method for in-the-wild NVS, yielding photorealistic results with sharp, geometrically coherent details.

[127] Source-Free Domain Adaptation for Geospatial Point Cloud Semantic Segmentation

Yuan Gao,Di Cao,Xiaohuan Xi,Sheng Nie,Shaobo Xia,Cheng Wang

Main category: cs.CV

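TL;DR: This paper proposes LoGo, a local-global dual-consensus framework for source-free domain adaptation in semantic segmentation of geospatial point clouds, which mitigates cross-domain distribution shift and long-tail problems.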

Details Motivation: Because of privacy and regulatory constraints, source-domain data is often unavailable to conventional domain adaptation methods, so source-free unsupervised domain adaptation (SFUDA) is needed, where only a pretrained model and unlabeled target-domain data are available. Method: Proposes the LoGo framework. Locally, a class-balanced prototype estimation module generates robust feature prototypes through intra-class independent anchor mining. Globally, an optimal transport-based distribution alignment module optimizes pseudo-label assignment. A pseudo-label filtering mechanism that combines local and global consistency is used for self-training. Result: LoGo significantly outperforms existing SFUDA methods on multiple geospatial point cloud datasets and is particularly strong at handling long-tailed distributions and reducing class prediction bias. Conclusion: Through local-global joint optimization, LoGo addresses domain shift in source-free 3D point cloud semantic segmentation and shows good application potential. Abstract: Semantic segmentation of 3D geospatial point clouds is pivotal for remote sensing applications. However, variations in geographic patterns across regions and data acquisition strategies induce significant domain shifts, severely degrading the performance of deployed models. Existing domain adaptation methods typically rely on access to source-domain data. However, this requirement is rarely met due to data privacy concerns, regulatory policies, and data transmission limitations. This motivates the largely underexplored setting of source-free unsupervised domain adaptation (SFUDA), where only a pretrained model and unlabeled target-domain data are available. In this paper, we propose LoGo (Local-Global Dual-Consensus), a novel SFUDA framework specifically designed for geospatial point clouds. At the local level, we introduce a class-balanced prototype estimation module that abandons conventional global threshold filtering in favor of an intra-class independent anchor mining strategy. This ensures that robust feature prototypes can be generated even for sample-scarce tail classes, effectively mitigating the feature collapse caused by long-tailed distributions. At the global level, we introduce an optimal transport-based global distribution alignment module that formulates pseudo-label assignment as a global optimization problem. By enforcing global distribution constraints, this module effectively corrects the over-dominance of head classes inherent in local greedy assignments, preventing model predictions from being severely biased towards majority classes. Finally, we propose a dual-consistency pseudo-label filtering mechanism. This strategy retains only high-confidence pseudo-labels where local multi-augmented ensemble predictions align with global optimal transport assignments for self-training.
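
The contrast between greedy per-point labels and a globally constrained assignment can be sketched with a Sinkhorn-style balancing loop. This is an assumption-laden illustration, not the LoGo module: the function name `sinkhorn_assign`, the fixed `class_prior`, and the iteration count are all hypothetical, and the input probabilities are assumed strictly positive.

```python
def sinkhorn_assign(probs, class_prior, n_iters=50):
    """Rescale soft predictions so class totals match class_prior, then take argmax."""
    n, k = len(probs), len(class_prior)
    plan = [row[:] for row in probs]
    for _ in range(n_iters):
        # Column step: push the total mass of each class toward its target share.
        for c in range(k):
            col = sum(plan[i][c] for i in range(n))
            scale = class_prior[c] * n / col
            for i in range(n):
                plan[i][c] *= scale
        # Row step: every point carries exactly one unit of mass.
        for i in range(n):
            s = sum(plan[i])
            plan[i] = [v / s for v in plan[i]]
    return [max(range(k), key=lambda c: row[c]) for row in plan]

probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.55, 0.45]]
greedy = [max(range(2), key=lambda c: row[c]) for row in probs]
balanced = sinkhorn_assign(probs, class_prior=[0.5, 0.5])
```

Here the greedy argmax labels every point as the head class, whereas the balanced plan reassigns the two least confident points to the tail class. In practice the class prior is itself unknown and would have to be estimated, which this sketch does not attempt.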

[128] Design and Development of a Low-Cost Scalable GSM-IoT Smart Pet Feeder with a Remote Mobile Application

Md. Rakibul Hasan Nishat,S. M. Khalid Bin Zahid,Abdul Hasib,T. M. Mehrab Hasan,Mohammad Arman,A. S. M. Ahsanul Sarkar Akib

Main category: cs.CV

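TL;DR: This paper presents a low-cost, scalable GSM-IoT smart pet feeder that supports remote control and real-time monitoring and is suitable for households with limited resources.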

Details Motivation: To help busy urban pet owners who find it hard to maintain a regular feeding schedule. Method: Combines an Arduino microcontroller, a SIM800L GSM module, an ultrasonic sensor, and a servo mechanism into a smart feeding system based on SMS communication, with a companion mobile app built in MIT App Inventor. Result: Experiments show a 98% SMS command success rate, portion dispensing with ±2.67% variance, and stable, reliable operation. Conclusion: The system offers a practical, scalable, internet-independent solution for smart pet care and sets a new benchmark for low-cost GSM-powered pet devices. Abstract: Pet ownership is increasingly common in modern households, yet maintaining a consistent feeding schedule remains challenging for owners, particularly those who live in cities and have busy lifestyles. This paper presents the design, development, and validation of a low-cost, scalable GSM-IoT smart pet feeder that enables remote monitoring and control through cellular communication. The device combines an Arduino microcontroller, a SIM800L GSM module for communication, an ultrasonic sensor for real-time food-level assessment, and a servo mechanism for accurate portion dispensing. A dedicated mobile application was developed using MIT App Inventor which allows owners to send feeding commands and receive real-time status updates. Experimental results demonstrate a 98% SMS command success rate, consistent portion dispensing with ±2.67% variance, and reliable autonomous operation. Its modular, energy-efficient design makes it easy to use in a wide range of households, including those with limited resources. This work pushes forward the field of accessible pet care technology by providing a practical, scalable, and completely internet-independent solution for personalized pet feeding. In doing so, it sets a new benchmark for low-cost, GSM-powered automation in smart pet products.

[129] An Explainable Two Stage Deep Learning Framework for Pericoronitis Assessment in Panoramic Radiographs Using YOLOv8 and ResNet-50

Ajo Babu George,Pranav S,Kunal Agarwal

Main category: cs.CV

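TL;DR: Proposes a two-stage deep learning system that combines anatomical localization, pathological classification, and interpretability to assist in diagnosing pericoronitis on panoramic radiographs.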

Details Motivation: Diagnosing pericoronitis on panoramic radiographs is challenging, and both diagnostic accuracy and clinical trust need to improve. Method: The first stage uses YOLOv8 to detect third molars and determine their anatomical positions and angulations according to Winter's classification; the second stage uses a modified ResNet-50 classifier to detect radiographic features suggestive of pericoronitis, with Grad-CAM used to improve model interpretability. Result: YOLOv8 detection reached 92% precision and 92.5% mean average precision; ResNet-50 achieved F1-scores of 88% for normal cases and 86% for pericoronitis; radiologists reported 84% alignment between Grad-CAM and their diagnostic impressions. Conclusion: The AI system shows good potential for assisting panoramic radiograph assessment, and its interpretability features help build clinical trust. Abstract: Objectives: To overcome challenges in diagnosing pericoronitis on panoramic radiographs, an AI-assisted assessment system was developed that integrates anatomical localization, pathological classification, and interpretability. Methods: A two-stage deep learning pipeline was implemented. The first stage used YOLOv8 to detect third molars and classify their anatomical positions and angulations based on Winter's classification. Detected regions were then fed into a second-stage classifier, a modified ResNet-50 architecture, for detecting radiographic features suggestive of pericoronitis. To enhance clinical trust, Grad-CAM was used to highlight key diagnostic regions on the radiographs. Results: The YOLOv8 component achieved 92% precision and 92.5% mean average precision. The ResNet-50 classifier yielded F1-scores of 88% for normal cases and 86% for pericoronitis. Radiologists reported 84% alignment between Grad-CAM and their diagnostic impressions, supporting the radiographic relevance of the interpretability output. Conclusion: The system shows strong potential for AI-assisted panoramic assessment, with explainable AI features that support clinical confidence.

[130] Edge-Optimized Multimodal Learning for UAV Video Understanding via BLIP-2

Yizhan Feng,Hichem Snoussi,Jing Teng,Jian Liu,Yuyang Wang,Abel Cherouat,Tian Wang

Main category: cs.CV

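TL;DR: This paper proposes a lightweight multimodal task platform based on BLIP-2, combined with the YOLO-World and YOLOv8-Seg models, to improve real-time visual understanding and interaction for UAVs under resource constraints, without task-specific fine-tuning on drone data.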

Details Motivation: To address the difficulty of deploying large vision-language models on UAV edge devices with limited computing resources, and to meet the need for real-time visual understanding in complex scenes. Method: Deeply integrates BLIP-2 with YOLO-family models, using YOLO's detection and segmentation results to strengthen visual-attention understanding; designs a content-aware key frame sampling mechanism based on K-Means clustering; and injects structured event logs as contextual input, combined with output constraints, for unified multi-task prompt optimization. Result: Without task-specific fine-tuning, the method extends BLIP-2's multi-task capabilities in UAV scenarios, supports video-level interactive tasks, and improves inference efficiency and output relevance. Conclusion: The lightweight multimodal platform enables efficient, accurate multi-task visual understanding and interaction on resource-constrained UAV devices and has good practical prospects. Abstract: The demand for real-time visual understanding and interaction in complex scenarios is increasingly critical for unmanned aerial vehicles. However, a significant challenge arises from the contradiction between the high computational cost of large vision-language models and the limited computing resources available on UAV edge devices. To address this challenge, this paper proposes a lightweight multimodal task platform based on BLIP-2, integrated with YOLO-World and YOLOv8-Seg models. This integration extends the multi-task capabilities of BLIP-2 for UAV applications with minimal adaptation and without requiring task-specific fine-tuning on drone data. Firstly, the deep integration of BLIP-2 with YOLO models enables it to leverage the precise perceptual results of YOLO for fundamental tasks like object detection and instance segmentation, thereby facilitating deeper visual-attention understanding and reasoning. Secondly, a content-aware key frame sampling mechanism based on K-Means clustering is designed, which incorporates intelligent frame selection and temporal feature concatenation. This equips the lightweight BLIP-2 architecture with the capability to handle video-level interactive tasks effectively. Thirdly, a unified prompt optimization scheme for multi-task adaptation is implemented. This scheme strategically injects structured event logs from the YOLO models as contextual information into BLIP-2's input. Combined with output constraints designed to filter out technical details, this approach effectively guides the model to generate accurate and contextually relevant outputs for various tasks.
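
As a rough illustration of content-aware key frame sampling with K-Means, the sketch below clusters per-frame feature vectors and keeps the frame nearest each centroid, returned in temporal order. The feature extractor, the evenly spaced initialization, and the name `kmeans_keyframes` are assumptions for this example; the abstract does not specify these details.

```python
def kmeans_keyframes(features, k, n_iters=20):
    """Cluster frame features and return one representative frame index per cluster."""
    n, dim = len(features), len(features[0])
    # Deterministic init (an assumption): evenly spaced frames seed the centroids.
    centroids = [features[(i * n) // k][:] for i in range(k)]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(n_iters):
        # Assignment step: each frame joins its nearest centroid.
        assign = [min(range(k), key=lambda c: dist2(f, centroids[c])) for f in features]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [features[i] for i in range(n) if assign[i] == c]
            if members:
                centroids[c] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    keyframes = []
    for c in range(k):
        members = [i for i in range(n) if assign[i] == c]
        if members:
            keyframes.append(min(members, key=lambda i: dist2(features[i], centroids[c])))
    return sorted(keyframes)  # temporal order, ready for feature concatenation

frames = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]  # toy 1-D frame features
selected = kmeans_keyframes(frames, k=2)
```

With two visually distinct segments in the toy input, one frame from the middle of each segment is kept, so near-duplicate frames do not consume the model's limited input budget.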

[131] SPARK: Scalable Real-Time Point Cloud Aggregation with Multi-View Self-Calibration

Chentian Sun

Main category: cs.CV

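TL;DR: SPARK is a self-calibrating, real-time multi-camera point cloud reconstruction framework that uses geometry-aware online extrinsic estimation and a confidence-driven fusion strategy to achieve accurate, stable 3D reconstruction with good scalability.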

Details Motivation: Existing methods struggle with multi-view fusion, camera extrinsic uncertainty, and scalability to large camera setups. Method: Proposes the SPARK framework, which comprises a geometry-aware online extrinsic estimation module and a confidence-driven point cloud fusion strategy, and performs frame-wise fusion rather than accumulation for stable reconstruction. Result: Experiments on real multi-camera systems show that SPARK outperforms existing methods in extrinsic accuracy, geometric consistency, temporal stability, and real-time performance. Conclusion: SPARK addresses key challenges in multi-camera 3D reconstruction, offering high accuracy, stability, and linear scalability for large-scale scenes. Abstract: Real-time multi-camera 3D reconstruction is crucial for 3D perception, immersive interaction, and robotics. Existing methods struggle with multi-view fusion, camera extrinsic uncertainty, and scalability for large camera setups. We propose SPARK, a self-calibrating real-time multi-camera point cloud reconstruction framework that jointly handles point cloud fusion and extrinsic uncertainty. SPARK consists of: (1) a geometry-aware online extrinsic estimation module leveraging multi-view priors and enforcing cross-view and temporal consistency for stable self-calibration, and (2) a confidence-driven point cloud fusion strategy modeling depth reliability and visibility at pixel and point levels to suppress noise and view-dependent inconsistencies. By performing frame-wise fusion without accumulation, SPARK produces stable point clouds in dynamic scenes while scaling linearly with the number of cameras. Extensive experiments on real-world multi-camera systems show that SPARK outperforms existing approaches in extrinsic accuracy, geometric consistency, temporal stability, and real-time performance, demonstrating its effectiveness and scalability for large-scale multi-camera 3D reconstruction.

[132] MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP

Aditya Chaudhary,Sneha Barman,Mainak Singha,Ankit Jha,Girish Mishra,Biplab Banerjee

Main category: cs.CV

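TL;DR: Proposes MMLGNet, a multimodal language-guided network that aligns hyperspectral and LiDAR remote sensing data with natural language semantics, using text supervision to improve multimodal fusion performance.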

Details Motivation: As multimodal Earth observation data grows, existing methods struggle to fuse spectral, spatial, and geometric information effectively while enabling semantic-level understanding, so a higher-level semantic guidance mechanism is needed. Method: Designs modality-specific encoders and uses bi-directional contrastive learning to align visual features with handcrafted textual embeddings in a shared latent space, following CLIP's training paradigm to link vision and language. Result: With simple CNN encoders, the method outperforms several established multimodal vision-only methods on two benchmark datasets, confirming the benefit of language supervision. Conclusion: Language guidance is an effective strategy for fusing multimodal remote sensing data, and MMLGNet offers a new paradigm for semantic understanding in remote sensing. Abstract: In this paper, we propose a novel multimodal framework, Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities like Hyperspectral Imaging (HSI) and LiDAR with natural language semantics using vision-language models such as CLIP. With the increasing availability of multimodal Earth observation data, there is a growing need for methods that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding. MMLGNet employs modality-specific encoders and aligns visual features with handcrafted textual embeddings in a shared latent space via bi-directional contrastive learning. Inspired by CLIP's training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation. Notably, MMLGNet achieves strong performance with simple CNN-based encoders, outperforming several established multimodal visual-only methods on two benchmark datasets, demonstrating the significant benefit of language supervision. Codes are available at https://github.com/AdityaChaudhary2913/CLIP_HSI.
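
A CLIP-style bi-directional contrastive objective can be written in a few lines. The sketch below is a generic symmetric InfoNCE loss over paired visual and text embeddings, not code from the MMLGNet repository; the temperature value and all function names are assumptions.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _diag_cross_entropy(logits):
    # Mean of -log softmax(row)[i], treating pair (i, i) as the positive.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def bidirectional_contrastive_loss(visual, text, temperature=0.07):
    """Average of the visual-to-text and text-to-visual contrastive losses."""
    v = [_normalize(x) for x in visual]
    t = [_normalize(x) for x in text]
    # Cosine similarities scaled by the (assumed) temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))

aligned = bidirectional_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = bidirectional_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each visual embedding matches its own text embedding and large when the pairing is shuffled, which is the signal that pulls both modalities into a shared latent space.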

[133] Deep Learning Based Facial Retargeting Using Local Patches

Yeonsoo Choi,Inyup Lee,Sihun Cha,Seonghyeon Kim,Sunjin Jung,Junyong Noh

Main category: cs.CV

TL;DR: Proposes a local patch-based retargeting method that transfers facial animation from a source video to stylized 3D characters.

Details Motivation: When retargeting facial animation to stylized 3D characters whose facial structure differs substantially from a human face, conventional methods struggle to preserve the semantic meaning of the original expressions. Method: The method has three modules: the Automatic Patch Extraction Module extracts local patches from source video frames; the Reenactment Module generates the corresponding target local patches; and the Weight Estimation Module computes per-frame animation parameters to produce a complete facial animation sequence. Result: Experiments show that the method transfers the semantics of source facial expressions to stylized characters with considerable variation in facial feature proportions. Conclusion: The proposed local patch-based retargeting method preserves source expression semantics well and is suitable for generating facial animation for diverse stylized characters. Abstract: In the era of digital animation, the quest to produce lifelike facial animations for virtual characters has led to the development of various retargeting methods. While the retargeting facial motion between models of similar shapes has been very successful, challenges arise when the retargeting is performed on stylized or exaggerated 3D characters that deviate significantly from human facial structures. In this scenario, it is important to consider the target character's facial structure and possible range of motion to preserve the semantics assumed by the original facial motions after the retargeting. To achieve this, we propose a local patch-based retargeting method that transfers facial animations captured in a source performance video to a target stylized 3D character. Our method consists of three modules. The Automatic Patch Extraction Module extracts local patches from the source video frame. These patches are processed through the Reenactment Module to generate correspondingly re-enacted target local patches. The Weight Estimation Module calculates the animation parameters for the target character at every frame for the creation of a complete facial animation sequence. Extensive experiments demonstrate that our method can successfully transfer the semantic meaning of source facial expressions to stylized characters with considerable variations in facial feature proportion.

[134] Incentivizing Cardiologist-Like Reasoning in MLLMs for Interpretable Echocardiographic Diagnosis

Yi Qin,Lehan Wang,Chenxu Zhao,Alex P. W. Lee,Xiaomeng Li

Main category: cs.CV

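TL;DR: Proposes CardiacMind, a new approach that combines a Cardiac Reasoning Template (CRT) with a reinforcement learning scheme to strengthen the reasoning of multimodal large language models in echocardiographic diagnosis, substantially improving diagnostic accuracy for complex cardiac diseases and producing clinical logic that closely matches that of cardiologists.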

Details Motivation: Existing echocardiography foundation models fail to link quantitative measurements with clinical manifestations effectively, while existing medical reasoning multimodal large language models require costly construction of detailed reasoning paths and cannot effectively incorporate echocardiographic priors. Method: Designs the Cardiac Reasoning Template (CRT), which provides standardized stepwise diagnostic procedures to simplify reasoning path construction, and proposes the CardiacMind framework, a new reinforcement learning scheme with a Procedural Quantity Reward (PQtR), a Procedural Quality Reward (PQlR), and an Echocardiographic Semantic Reward (ESR) that steer the model toward cardiologist-like reasoning. Result: Performance improves by 48% on multiview echocardiographic diagnosis of 15 complex cardiac diseases and by 5% over prior methods on CardiacNet-PAH; a clinician user study shows 93.33% agreement with cardiologist-like reasoning. Conclusion: By introducing cardiologist-like reasoning structure and tailored rewards, CardiacMind significantly improves the accuracy and interpretability of multimodal large language models in echocardiographic diagnosis and shows good prospects for clinical use. Abstract: Echocardiographic diagnosis is vital for cardiac screening yet remains challenging. Existing echocardiography foundation models do not effectively capture the relationships between quantitative measurements and clinical manifestations, whereas medical reasoning multimodal large language models (MLLMs) require costly construction of detailed reasoning paths and remain ineffective at directly incorporating such echocardiographic priors into their reasoning. To address these limitations, we propose a novel approach comprising Cardiac Reasoning Template (CRT) and CardiacMind to enhance MLLM's echocardiographic reasoning by introducing cardiologist-like mindset. Specifically, CRT provides stepwise canonical diagnostic procedures for complex cardiac diseases to streamline reasoning path construction without the need for costly case-by-case verification. To incentivize reasoning MLLM under CRT, we develop CardiacMind, a new reinforcement learning scheme with three novel rewards: Procedural Quantity Reward (PQtR), Procedural Quality Reward (PQlR), and Echocardiographic Semantic Reward (ESR). PQtR promotes detailed reasoning; PQlR promotes integration of evidence across views and modalities, while ESR grounds stepwise descriptions in visual content. Our methods show a 48% improvement in multiview echocardiographic diagnosis for 15 complex cardiac diseases and a 5% improvement on CardiacNet-PAH over prior methods. The user study on our method's reasoning outputs shows 93.33% clinician agreement with cardiologist-like reasoning logic. Our code will be available.

[135] Noise-Adaptive Regularization for Robust Multi-Label Remote Sensing Image Classification

Tom Burgert,Julia Henkel,Begüm Demir

Main category: cs.CV

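TL;DR: This paper proposes NAR, a noise-adaptive regularization method that improves robustness to additive, subtractive, and mixed label noise in multi-label remote sensing classification.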

Details Motivation: Existing multi-label classification methods usually ignore the distinction between types of annotation noise in remote sensing data (additive, subtractive, or mixed) and use noisy annotations directly as supervision, which degrades model performance. Method: NAR distinguishes between types of label noise within a semi-supervised learning framework; a confidence-based label handling mechanism dynamically retains high-confidence labels, temporarily deactivates moderate-confidence labels, and corrects low-confidence labels by flipping; this is combined with early-learning regularization (ELR) to stabilize training. Result: Experiments across several noise scenarios show that NAR is more robust than existing methods on multi-label remote sensing classification, with the largest gains under subtractive and mixed noise. Conclusion: Adaptive suppression and selective correction of noisy supervision is an effective strategy for noise-robust multi-label remote sensing classification, and NAR offers a new way to handle the partially incorrect annotations that are common in practice. Abstract: The development of reliable methods for multi-label classification (MLC) has become a prominent research direction in remote sensing (RS). As the scale of RS data continues to expand, annotation procedures increasingly rely on thematic products or crowdsourced procedures to reduce the cost of manual annotation. While cost-effective, these strategies often introduce multi-label noise in the form of partially incorrect annotations. In MLC, label noise arises as additive noise, subtractive noise, or a combination of both in the form of mixed noise. Previous work has largely overlooked this distinction and commonly treats noisy annotations as supervised signals, lacking mechanisms that explicitly adapt learning behavior to different noise types. To address this limitation, we propose NAR, a noise-adaptive regularization method that explicitly distinguishes between additive and subtractive noise within a semi-supervised learning framework. NAR employs a confidence-based label handling mechanism that dynamically retains label entries with high confidence, temporarily deactivates entries with moderate confidence, and corrects low confidence entries via flipping. This selective attenuation of supervision is integrated with early-learning regularization (ELR) to stabilize training and mitigate overfitting to corrupted labels. Experiments across additive, subtractive, and mixed noise scenarios demonstrate that NAR consistently improves robustness compared with existing methods.
Performance improvements are most pronounced under subtractive and mixed noise, indicating that adaptive suppression and selective correction of noisy supervision provide an effective strategy for noise robust learning in RS MLC.
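
The retain / deactivate / flip rule described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the thresholds `high` and `low`, the function name `adapt_labels`, and the definition of confidence as the model's agreement with the annotated entry are all hypothetical, and the sketch does not model the separate treatment of additive and subtractive noise.

```python
def adapt_labels(labels, probs, high=0.7, low=0.2):
    """Return (targets, mask) for one sample's multi-label vector.

    labels: annotated 0/1 entries (possibly noisy).
    probs:  model-predicted probabilities that each class is present.
    Confidence in an entry is assumed to be the model's agreement with it:
    p when the label is 1, and 1 - p when the label is 0.
    """
    targets, mask = [], []
    for y, p in zip(labels, probs):
        conf = p if y == 1 else 1.0 - p
        if conf >= high:
            # High confidence: keep the annotated entry as supervision.
            targets.append(y)
            mask.append(1)
        elif conf <= low:
            # Low confidence: the model strongly disagrees, so flip the entry.
            targets.append(1 - y)
            mask.append(1)
        else:
            # Moderate confidence: exclude the entry from the loss for now.
            targets.append(y)
            mask.append(0)
    return targets, mask


targets, mask = adapt_labels([1, 0, 1, 0], [0.9, 0.95, 0.5, 0.1])
```

In a training loop, `mask` would typically multiply the per-entry loss so that deactivated entries contribute no gradient until the model becomes more confident about them.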

[136] Divide and Conquer: Static-Dynamic Collaboration for Few-Shot Class-Incremental Learning

Kexin Bao,Daichi Zhang,Yong Li,Dan Zeng,Shiming Ge

Main category: cs.CV

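TL;DR: This paper proposes Static-Dynamic Collaboration (SDC), a framework for few-shot class-incremental learning (FSCIL) that splits the task into a static retaining stage and a dynamic learning stage to balance stability and plasticity, improving both the retention of old knowledge and the learning of new classes.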

Details Motivation: To resolve the stability-plasticity conflict in FSCIL, that is, how to retain old knowledge while continually learning new classes from limited data. Method: The SDC framework has two stages: the Static Retaining Stage (SRS) trains an initial model and preserves its key part as static memory; the Dynamic Learning Stage (DLS) introduces an extra dynamic projector that is trained jointly with the static memory to learn new classes. Result: Extensive experiments on three public benchmarks and one real-world application dataset show that the method outperforms competing approaches. Conclusion: By separating static retention from dynamic learning, SDC alleviates the stability-plasticity dilemma in FSCIL and achieves state-of-the-art performance. Abstract: Few-shot class-incremental learning (FSCIL) aims to continuously recognize novel classes under limited data, which suffers from the key stability-plasticity dilemma: balancing the retention of old knowledge with the acquisition of new knowledge. To address this issue, we divide the task into two different stages and propose a framework termed Static-Dynamic Collaboration (SDC) to achieve a better trade-off between stability and plasticity. Specifically, our method divides the normal pipeline of FSCIL into a Static Retaining Stage (SRS) and a Dynamic Learning Stage (DLS), which harness old static and incremental dynamic class information, respectively. During SRS, we train an initial model with sufficient data in the base session and preserve the key part as static memory to retain fundamental old knowledge. During DLS, we introduce an extra dynamic projector jointly trained with the previous static memory. By employing both stages, our method achieves improved retention of old knowledge while continuously adapting to new classes. Extensive experiments on three public benchmarks and a real-world application dataset demonstrate that our method achieves state-of-the-art performance against other competitors.

[137] Developing Predictive and Robust Radiomics Models for Chemotherapy Response in High-Grade Serous Ovarian Carcinoma

Sepideh Hatamikia,Geevarghese George,Florian Schwarzhans,Amirreza Mahbod,Marika AV Reinius,Ali Abbasian Ardakani,Mercedes Jimenez-Linan,Satish Viswanath,Mireia Crispin-Ortuzar,Lorena Escudero Sanchez,Evis Sala,James D Brenton,Ramona Woitek

Main category: cs.CV

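TL;DR: This study combines multiple feature selection methods with CT imaging and machine learning to build radiomics models that predict the response of high-grade serous ovarian carcinoma (HGSOC) patients to neoadjuvant chemotherapy (NACT). It introduces an automated, robustness-aware feature selection framework and validates the models on an independent cohort. Lesions at different anatomical sites differ in how well they predict each response metric; overall performance is good and suggests potential for clinical use.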

Details Motivation: About 40% of HGSOC patients respond poorly to NACT, and the lack of effective non-invasive predictors limits individualized treatment. A reliable way to predict response in advance is needed to optimize treatment strategy. Method: A radiomics feature selection framework incorporating an automated randomisation algorithm mimics inter-observer variability to improve feature robustness; features are extracted from pre- and post-treatment CT images and models are built for four response metrics (CRS, RECIST, VolR, DiaR); lesions at different anatomical sites are analysed separately, with models trained on one cohort and validated on an independent external cohort. Result: Combining all lesions gave the best prediction of volume reduction (VolR, AUC = 0.83); omental lesions were best for CRS (AUC = 0.77); pelvic lesions were best for diameter reduction (DiaR, AUC = 0.76). Including robustness-based selection improved model stability and reproducibility. Conclusion: Building feature robustness into the selection process helps produce more reliable radiomics models and supports their clinical translation for predicting NACT response in HGSOC patients. Future work should further explore the value of radiomics for real-time clinical decision-making in ovarian cancer. Abstract: Objectives: High-grade serous ovarian carcinoma (HGSOC) is typically diagnosed at an advanced stage with extensive peritoneal metastases, making treatment challenging. Neoadjuvant chemotherapy (NACT) is often used to reduce tumor burden before surgery, but about 40% of patients show limited response. Radiomics, combined with machine learning (ML), offers a promising non-invasive method for predicting NACT response by analyzing computed tomography (CT) imaging data. This study aimed to improve response prediction in HGSOC patients undergoing NACT by integrating different feature selection methods. Materials and methods: A framework for selecting robust radiomics features was introduced by employing an automated randomisation algorithm to mimic inter-observer variability, ensuring a balance between feature robustness and prediction accuracy. Four response metrics were used: chemotherapy response score (CRS), RECIST, volume reduction (VolR), and diameter reduction (DiaR). Lesions in different anatomical sites were studied. Pre- and post-NACT CT scans were used for feature extraction and model training on one cohort, and an independent cohort was used for external testing. Results: The best prediction performance was achieved using all lesions combined for VolR prediction, with an AUC of 0.83. Omental lesions provided the best results for CRS prediction (AUC 0.77), while pelvic lesions performed best for DiaR (AUC 0.76).
Conclusion: The integration of robustness into the feature selection processes ensures the development of reliable models and thus facilitates the implementation of the radiomics models in clinical applications for HGSOC patients. Future work should explore further applications of radiomics in ovarian cancer, particularly in real-time clinical settings.

[138] Modality-Decoupled RGB-Thermal Object Detector via Query Fusion

Chao Tian,Zikun Zhou,Chao Yang,Guoqing Zhu,Fu'an Zhong,Zhenyu He

Main category: cs.CV

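TL;DR: Proposes MDQF, a modality-decoupled RGB-thermal detection framework that uses query fusion to keep the modalities separable while preserving their complementarity, improving detection robustness under extreme conditions.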

Details Motivation: Under extreme conditions, degraded quality in a single modality disturbs detection, so modality complementarity must be balanced against modality separation. Method: DETR-like structures process the RGB and TIR images separately, and a query fusion mechanism is inserted between refinement stages; high-quality queries are selected and adapted before being fused across branches. Result: Experiments show that the method outperforms existing RGB-T detectors and offers stronger modality independence and noise suppression. Conclusion: The MDQF framework effectively balances modality fusion and separation, improves detection in complex environments, and does not require paired RGB-T training data. Abstract: The advantage of RGB-Thermal (RGB-T) detection lies in its ability to perform modality fusion and integrate cross-modality complementary information, enabling robust detection under diverse illumination and weather conditions. However, under extreme conditions where one modality exhibits poor quality and disturbs detection, modality separation is necessary to mitigate the impact of noise. To address this problem, we propose a Modality-Decoupled RGB-T detection framework with Query Fusion (MDQF) to balance modality complementation and separation. In this framework, DETR-like detectors are employed as separate branches for the RGB and TIR images, with query fusion interspersed between the two branches in each refinement stage. Herein, query fusion is performed by feeding the high-quality queries from one branch to the other one after query selection and adaptation. This design effectively excludes the degraded modality and corrects the predictions using high-quality queries. Moreover, the decoupled framework allows us to optimize each individual branch with unpaired RGB or TIR images, eliminating the need for paired RGB-T data. Extensive experiments demonstrate that our approach delivers superior performance to existing RGB-T detectors and achieves better modality independence.

[139] CoMa: Contextual Massing Generation with Vision-Language Models

Evgenii Maslov,Valentin Khrulkov,Anastasia Volkova,Anton Gusarov,Andrey Kuznetsov,Ivan Oseledets

Main category: cs.CV

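TL;DR: Proposes an automated framework for generating building massing from functional requirements and site context, and releases CoMa-20K, a dataset of 20,000 samples for benchmarking massing generation as a conditional task for vision-language models.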

Details Motivation: The conceptual phase of architectural design, building massing in particular, relies on designer intuition and manual work, and the datasets needed for data-driven methods have been missing. Method: The CoMa-20K dataset is built with detailed massing geometries, economic and programmatic data, and visual information about the site; massing generation is formulated as a conditional generation task for vision-language models, and both fine-tuned models and large zero-shot models are evaluated. Result: Experiments show that the task is challenging, but vision-language models can generate context-appropriate massing options, demonstrating the potential of data-driven methods. Conclusion: The CoMa-20K dataset and the experimental analysis provide a foundational benchmark for data-driven architectural design and point to important directions for future research. Abstract: The conceptual design phase in architecture and urban planning, particularly building massing, is complex and heavily reliant on designer intuition and manual effort. To address this, we propose an automated framework for generating building massing based on functional requirements and site context. A primary obstacle to such data-driven methods has been the lack of suitable datasets. Consequently, we introduce the CoMa-20K dataset, a comprehensive collection that includes detailed massing geometries, associated economical and programmatic data, and visual representations of the development site within its existing urban context. We benchmark this dataset by formulating massing generation as a conditional task for Vision-Language Models (VLMs), evaluating both fine-tuned and large zero-shot models. Our experiments reveal the inherent complexity of the task while demonstrating the potential of VLMs to produce context-sensitive massing options. The dataset and analysis establish a foundational benchmark and highlight significant opportunities for future research in data-driven architectural design.

[140] Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling

Takamichi Miyata,Sumiko Miyata,Andrew Morris

Main category: cs.CV

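TL;DR: Proposes a framework that decouples driver appearance from driver behavior, improving the zero-shot performance of vision-language models on distracted driver detection.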

Details Motivation: Existing distracted driver detectors based on vision-language models perform poorly in real-world conditions because they conflate driver appearance features with behavior cues. Method: A driver appearance embedding is extracted and its influence is removed from the image embedding, and the text embeddings are orthogonalized by projection onto the Stiefel manifold to improve class separability. Result: Experiments show that the method outperforms prior approaches on several benchmarks and substantially improves zero-shot classification. Conclusion: The proposed decoupling framework mitigates appearance bias and strengthens the model's judgment of actual driving behavior. Abstract: Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize text embeddings via metric projection onto the Stiefel manifold to improve separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.
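
The two decoupling steps named in this abstract can be illustrated with a small NumPy sketch. It rests on assumptions that the abstract does not confirm: appearance removal is modeled here as subtracting the component of the image embedding along a single appearance direction, and the metric projection onto the Stiefel manifold is taken to be the standard polar-factor projection obtained from a thin SVD. The function names and the random stand-in embeddings are hypothetical.

```python
import numpy as np


def remove_appearance(image_emb, appearance_emb):
    """Subtract the component of image_emb along the appearance direction (assumed form)."""
    a = appearance_emb / np.linalg.norm(appearance_emb)
    residual = image_emb - np.dot(image_emb, a) * a
    return residual / np.linalg.norm(residual)


def project_to_stiefel(text_embs):
    """Nearest matrix with orthonormal rows in Frobenius norm (polar factor).

    text_embs has one row per class prompt and is assumed to have full row rank.
    """
    u, _, vt = np.linalg.svd(text_embs, full_matrices=False)
    return u @ vt


rng = np.random.default_rng(0)
image = rng.normal(size=16)        # stand-in for a VLM image embedding
appearance = rng.normal(size=16)   # stand-in for the driver appearance embedding
text = rng.normal(size=(4, 16))    # stand-in for four class-prompt embeddings

clean = remove_appearance(image, appearance)
ortho = project_to_stiefel(text)
scores = ortho @ clean             # zero-shot logits against orthogonalized prompts
```

The polar factor is the closest orthonormal-row matrix to the original prompts, which matches the stated goal of improving separability while staying close to the original semantics.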

[141] Towards Safer Mobile Agents: Scalable Generation and Evaluation of Diverse Scenarios for VLMs

Takara Taniguchi,Kuniaki Saito,Atsushi Hashimoto

Main category: cs.CV

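TL;DR: This paper proposes HazardForge, a scalable pipeline that uses image editing models to generate complex hazardous scenarios, and builds MovSafeBench, a benchmark of 7,254 images with QA pairs, to evaluate vision-language models on anomalous dynamic scenes.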

Details Motivation: Existing benchmarks do not adequately cover anomalous hazardous scenarios with spatio-temporal dynamics, making it hard to evaluate how well vision-language models support safe decision-making in complex environments. Method: The HazardForge pipeline combines image editing models, layout decision algorithms, and validation modules to generate anomalous scenarios containing moving, intrusive, and distant objects, and is used to build the MovSafeBench multiple-choice benchmark. Result: Experiments show that vision-language model performance drops markedly when anomalous objects are present, especially in scenarios that require nuanced motion understanding. Conclusion: HazardForge and MovSafeBench provide effective tools and data for evaluating and improving the safety of vision-language models in complex, hazardous environments. Abstract: Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems, making it crucial to evaluate their ability to support safer decision-making in complex environments. However, existing benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with spatio-temporal dynamics. While image editing models are a promising means to synthesize such hazards, it remains challenging to generate well-formulated scenarios that include moving, intrusive, and distant objects frequently observed in the real world. To address this gap, we introduce HazardForge, a scalable pipeline that leverages image editing models to generate these scenarios with layout decision algorithms and validation modules. Using HazardForge, we construct MovSafeBench, a multiple-choice question (MCQ) benchmark comprising 7,254 images and corresponding QA pairs across 13 object categories, covering both normal and anomalous objects. Experiments using MovSafeBench show that VLM performance degrades notably under conditions including anomalous objects, with the largest drop in scenarios requiring nuanced motion understanding.

[142] Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models

Hao Tang,Yu Liu,Shuanglin Yan,Fei Shen,Shengfeng He,Jing Qin

Main category: cs.CV

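TL;DR: This paper proposes CoEvo, a training-free and annotation-free test-time framework that performs bidirectional, dynamic alignment of cross-modal proxies in vision-language models to improve zero-shot detection of out-of-distribution (OOD) samples.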

Details Motivation: Existing zero-shot OOD detection methods rely on fixed textual proxies, which cope poorly with distribution shift and lead to cross-modal misalignment and unstable predictions when visual features drift. A mechanism that adapts dynamically is therefore needed. Method: CoEvo introduces a proxy-aligned co-evolution mechanism that maintains two dynamically updated proxy caches: it mines context-relevant textual negatives guided by the test images and iteratively refines the visual proxies, progressively realigning cross-modal similarities and enlarging local OOD margins; the two modalities' proxies are then dynamically re-weighted to obtain a robust OOD score. Result: Extensive experiments on standard benchmarks show that CoEvo outperforms strong baselines, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K. Conclusion: Through bidirectional, sample-conditioned co-evolution of textual and visual proxies, CoEvo addresses cross-modal misalignment in zero-shot OOD detection and markedly improves detection robustness and accuracy under distribution shift. Abstract: Reliable zero-shot detection of out-of-distribution (OOD) inputs is critical for deploying vision-language models in open-world settings. However, the lack of labeled negatives in zero-shot OOD detection necessitates proxy signals that remain effective under distribution shift. Existing negative-label methods rely on a fixed set of textual proxies, which (i) sparsely sample the semantic space beyond in-distribution (ID) classes and (ii) remain static while only visual features drift, leading to cross-modal misalignment and unstable predictions. In this paper, we propose CoEvo, a training- and annotation-free test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. Specifically, CoEvo introduces a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches, which dynamically mines contextual textual negatives guided by test images and iteratively refines visual proxies, progressively realigning cross-modal similarities and enlarging local OOD margins. Finally, we dynamically re-weight the contributions of dual-modal proxies to obtain a calibrated OOD score that is robust to distribution shift. Extensive experiments on standard benchmarks demonstrate that CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.

[143] An IoT-Enabled Smart Aquarium System for Real-Time Water Quality Monitoring and Automated Feeding

MD Fatin Ishraque Ayon,Sabrin Nahar,Ataur Rahman,Md. Taslim Arif,Abdul Hasib,A. S. M. Ahsanul Sarkar Akib

Main category: cs.CV

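TL;DR: This paper presents an IoT-based smart aquarium system that uses an ESP32 and multiple sensors for real-time water quality monitoring and automated control, with high accuracy, fast response, and high reliability.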

Details Motivation: Traditional aquarium water quality monitoring is inefficient, labor-intensive, and error-prone, which makes a stable aquatic environment hard to maintain. Method: An ESP32 microcontroller integrates pH, TDS, temperature, and turbidity sensors with actuators such as a servo feeder and a water pump, and combines edge processing with the Blynk cloud platform for data acquisition, real-time monitoring, automated control, and intelligent alerts. Result: Experiments show 96% average sensor accuracy, a 1.2-second response time for anomaly detection, and 97% operational reliability for the automated feeding and water circulation modules. Conclusion: A low-cost IoT solution can substantially improve the automation and reliability of aquarium management, suits both residential and commercial settings, and advances intelligent management of aquatic ecosystems. Abstract: Maintaining optimal water quality in aquariums is critical for aquatic health but remains challenging due to the need for continuous monitoring of multiple parameters. Traditional manual methods are inefficient, labor-intensive, and prone to human error, often leading to suboptimal aquatic conditions. This paper presents an IoT-based smart aquarium system that addresses these limitations by integrating an ESP32 microcontroller with multiple sensors (pH, TDS, temperature, turbidity) and actuators (servo feeder, water pump) for comprehensive real-time water quality monitoring and automated control. The system architecture incorporates edge processing capabilities, cloud connectivity via Blynk IoT platform, and an intelligent alert mechanism with configurable cooldown periods to prevent notification fatigue. Experimental evaluation in a 10-liter aquarium environment demonstrated the system's effectiveness, achieving 96% average sensor accuracy and 1.2-second response time for anomaly detection. The automated feeding and water circulation modules maintained 97% operational reliability throughout extended testing, significantly reducing manual intervention while ensuring stable aquatic conditions. This research demonstrates that cost-effective IoT solutions can revolutionize aquarium maintenance, making aquatic ecosystem management more accessible, reliable, and efficient for both residential and commercial applications.
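
The alert mechanism with configurable cooldown periods mentioned in this abstract could look something like the sketch below. It is a plain-Python illustration of the idea, not the authors' ESP32 firmware: the class name `CooldownAlerter`, the pH limits of 6.5 and 8.0, and the 300-second cooldown are all assumed values chosen for the example.

```python
class CooldownAlerter:
    """Raise an alert for out-of-range readings, at most once per cooldown window."""

    def __init__(self, low, high, cooldown_s):
        self.low = low                # lowest acceptable reading (assumed limit)
        self.high = high              # highest acceptable reading (assumed limit)
        self.cooldown_s = cooldown_s  # minimum seconds between two alerts
        self._last_alert = None       # timestamp of the most recent alert

    def check(self, value, now):
        """Return True if an alert should be sent for this reading."""
        if self.low <= value <= self.high:
            return False
        in_cooldown = (
            self._last_alert is not None
            and now - self._last_alert < self.cooldown_s
        )
        if in_cooldown:
            # Still out of range, but suppress the repeat notification.
            return False
        self._last_alert = now
        return True


ph_alerter = CooldownAlerter(low=6.5, high=8.0, cooldown_s=300)
readings = [(0, 7.2), (10, 8.6), (20, 8.7), (400, 8.5)]  # (seconds, pH)
fired = [ph_alerter.check(value, now) for now, value in readings]
```

Passing the timestamp in explicitly keeps the logic testable; on a device it would come from the system clock. One alerter per sensor allows a different cooldown for each parameter.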

[144] PKI: Prior Knowledge-Infused Neural Network for Few-Shot Class-Incremental Learning

Kexin Bao,Fanzhao Lin,Zichen Wang,Yong Li,Dan Zeng,Shiming Ge

Main category: cs.CV

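TL;DR: This paper proposes PKI, a prior knowledge-infused neural network that addresses catastrophic forgetting and overfitting in few-shot class-incremental learning; cascaded projectors integrate past knowledge while learning new knowledge flexibly, and the method outperforms existing approaches on several benchmarks.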

Details Motivation: To address catastrophic forgetting and overfitting to new classes in few-shot class-incremental learning, while making effective use of prior knowledge to improve recognition of both old and new classes. Method: The PKI framework consists of a backbone, an ensemble of projectors, a classifier, and an extra memory; each incremental session adds a new projector and fine-tunes it jointly with the other, frozen components, cascading projectors to integrate past knowledge; two variants (PKIV-1 and PKIV-2) are designed to reduce resource consumption. Result: Experiments on three popular benchmarks show that the method outperforms existing state-of-the-art approaches, and the variants achieve a good balance between resource consumption and performance. Conclusion: PKI effectively combines prior and new knowledge, mitigates catastrophic forgetting and overfitting, and improves few-shot class-incremental learning, while its variants remain competitive in resource-constrained settings. Abstract: Few-shot class-incremental learning (FSCIL) aims to continually adapt a model on a limited number of new-class examples, facing two well-known challenges: catastrophic forgetting and overfitting to new classes. Existing methods tend to freeze more parts of network components and finetune others with an extra memory during incremental sessions. These methods emphasize preserving prior knowledge to ensure proficiency in recognizing old classes, thereby mitigating catastrophic forgetting. Meanwhile, constraining fewer parameters can help in overcoming overfitting with the assistance of prior knowledge. Following previous methods, we retain more prior knowledge and propose a prior knowledge-infused neural network (PKI) to facilitate FSCIL. PKI consists of a backbone, an ensemble of projectors, a classifier, and an extra memory. In each incremental session, we build a new projector and add it to the ensemble. Subsequently, we finetune the new projector and the classifier jointly with other frozen network components, ensuring the rich prior knowledge is utilized effectively. By cascading projectors, PKI integrates prior knowledge accumulated from previous sessions and learns new knowledge flexibly, which helps to recognize old classes and efficiently learn new classes. Further, to reduce the resource consumption associated with keeping many projectors, we design two variants of the prior knowledge-infused neural network (PKIV-1 and PKIV-2) that balance resource consumption and performance by reducing the number of projectors.
Extensive experiments on three popular benchmarks demonstrate that our approach outperforms state-of-the-art methods.

[145] EfficientFSL: Enhancing Few-Shot Classification via Query-Only Tuning in Vision Transformers

Wenwen Liao,Hang Ruan

Main category: cs.CV

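TL;DR: Proposes EfficientFSL, a lightweight query-only fine-tuning framework for ViT-based few-shot classification that achieves strong performance efficiently with very few trainable parameters.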

Details Motivation: Large models such as ViTs perform very well in few-shot classification, but fine-tuning them is expensive and hard to apply in low-resource settings, so computational overhead must be reduced. Method: A Forward Block generates task-specific queries that extract features from the intermediate layers of the pre-trained model in a query-only manner; a Combine Block fuses multi-layer outputs; a Support-Query Attention Block aligns prototypes with the query distribution. Result: The method reaches state-of-the-art performance on four in-domain and six cross-domain few-shot classification datasets with very few trainable parameters. Conclusion: Through its lightweight query-tuning mechanism, EfficientFSL balances performance against computational cost and suits practical low-resource scenarios. Abstract: Large models such as Vision Transformers (ViTs) have demonstrated remarkable superiority over smaller architectures like ResNet in few-shot classification, owing to their powerful representational capacity. However, fine-tuning such large models demands extensive GPU memory and prolonged training time, making them impractical for many real-world low-resource scenarios. To bridge this gap, we propose EfficientFSL, a query-only fine-tuning framework tailored specifically for few-shot classification with ViT, which achieves competitive performance while significantly reducing computational overhead. EfficientFSL fully leverages the knowledge embedded in the pre-trained model and its strong comprehension ability, achieving high classification accuracy with an extremely small number of tunable parameters. Specifically, we introduce a lightweight trainable Forward Block to synthesize task-specific queries that extract informative features from the intermediate representations of the pre-trained model in a query-only manner. We further propose a Combine Block to fuse multi-layer outputs, enhancing the depth and robustness of feature representations. Finally, a Support-Query Attention Block mitigates distribution shift by adjusting prototypes to align with the query set distribution. With minimal trainable parameters, EfficientFSL achieves state-of-the-art performance on four in-domain few-shot datasets and six cross-domain datasets, demonstrating its effectiveness in real-world applications.

[146] Closed-Loop LLM Discovery of Non-Standard Channel Priors in Vision Models

Tolgay Atinc Uzun,Dmitry Ignatov,Radu Timofte

Main category: cs.CV

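TL;DR: This paper proposes a neural architecture search (NAS) framework driven by a large language model (LLM) for the combinatorial problem of channel configuration in deep neural networks. AST mutations generate a large corpus of valid architectures from which the LLM learns the latent relationship between channel configurations and model performance, yielding significant accuracy gains on CIFAR-100.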

Details Motivation: Optimizing channel configurations is a complex combinatorial problem constrained by tensor shape compatibility and computational budgets; traditional heuristics explore the design space poorly, so a method that can understand and reason about code structure is needed. Method: Channel configuration search is formulated as a sequence of conditional code generation tasks in which an LLM iteratively refines the architecture based on performance feedback; Abstract Syntax Tree (AST) mutations generate a large corpus of shape-consistent architectures of unknown quality to ease the shortage of training data, allowing the LLM to learn architectural priors. Result: Experiments on CIFAR-100 show that the method significantly improves model accuracy compared with random search; the LLM acquires domain-specific architectural priors and generates better channel configurations. Conclusion: Large language models can understand and optimize neural network architectures, showing particular promise in code-structure awareness and design-pattern learning; language-driven deep learning design is a promising new paradigm. Abstract: Channel configuration search, the optimization of layer specifications such as layer widths in deep neural networks, presents a complex combinatorial challenge constrained by tensor shape compatibility and computational budgets. We posit that Large Language Models (LLMs) offer a transformative approach to Neural Architecture Search (NAS), capable of reasoning about architectural code structure in ways that traditional heuristics cannot. In this paper, we investigate the application of an LLM-driven NAS framework to the problem of channel configuration. We formulate the search as a sequence of conditional code generation tasks, where an LLM refines architectural specifications based on performance telemetry. Crucially, we address the data scarcity problem by generating a vast corpus of valid, shape-consistent architectures via Abstract Syntax Tree (AST) mutations. While these mutated networks are not necessarily high-performing, they provide the critical volume of structural data required for the LLM to learn the latent relationship between channel configurations and model performance. This allows the LLM to internalize complex design patterns and apply them to optimize feature extraction strategies. Experimental results on CIFAR-100 validate the efficacy of this approach, demonstrating that the model yields statistically significant improvements in accuracy. Our analysis confirms that the LLM successfully acquires domain-specific architectural priors, distinguishing this method from random search and highlighting the immense potential of language-driven design in deep learning.
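
To make the idea of AST mutation concrete, here is a toy sketch using Python's standard `ast` module. It is an assumption-laden illustration rather than the paper's pipeline: the `SOURCE` snippet, the `ScaleWidths` transformer, and the rule of scaling every integer literal by one factor are hypothetical. Shape consistency holds in this toy case only because all layer widths come from a single list, so consecutive layers would still agree on their shared channel counts.

```python
import ast

# Hypothetical architecture spec: consecutive layers read their in/out
# channels from adjacent entries of one width list.
SOURCE = """
def build_widths():
    return [64, 128, 256]
"""


class ScaleWidths(ast.NodeTransformer):
    """Multiply every integer literal by a factor (toy mutation rule)."""

    def __init__(self, factor):
        self.factor = factor

    def visit_Constant(self, node):
        is_int = isinstance(node.value, int) and not isinstance(node.value, bool)
        if is_int:
            scaled = max(1, int(node.value * self.factor))
            return ast.copy_location(ast.Constant(value=scaled), node)
        return node


def mutate(source, factor):
    """Parse the source, apply the mutation, and return new source code."""
    tree = ScaleWidths(factor).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9 or newer


mutated_source = mutate(SOURCE, 1.5)
namespace = {}
exec(mutated_source, namespace)  # run the mutated spec to read back its widths
widths = namespace["build_widths"]()
```

Working on the syntax tree rather than on raw text means every mutant is guaranteed to be parseable code, which is what makes it practical to generate a large corpus of valid variants.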

[147] CD^2: Constrained Dataset Distillation for Few-Shot Class-Incremental Learning

Kexin Bao,Daichi Zhang,Hansong Zhang,Yong Li,Yutao Yue,Shiming Ge

Main category: cs.CV

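TL;DR: Proposes CD², a constrained dataset distillation framework that tackles catastrophic forgetting in few-shot class-incremental learning by synthesizing compact samples and constraining the distribution of previously learned classes to better retain knowledge.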

Details Motivation: Existing few-shot class-incremental learning methods suffer from severe catastrophic forgetting because they cannot effectively preserve essential prior knowledge. Method: The CD² framework comprises a dataset distillation module (DDM) that synthesizes compact samples and a distillation constraint module (DCM) that uses a purpose-designed loss to constrain the previously learned class distribution. Result: Extensive experiments on three public datasets show that the method outperforms other state-of-the-art approaches. Conclusion: CD² preserves prior knowledge more fully, mitigates catastrophic forgetting in FSCIL, and improves continual learning performance. Abstract: Few-shot class-incremental learning (FSCIL) has received significant attention for performing classification continuously from a few training samples, but it suffers from the key problem of catastrophic forgetting. Existing methods usually employ an external memory to store previous knowledge and treat it equally with incremental classes, which cannot properly preserve previous essential knowledge. To solve this problem and inspired by recent distillation works on knowledge transfer, we propose a framework termed Constrained Dataset Distillation (CD^2) to facilitate FSCIL, which includes a dataset distillation module (DDM) and a distillation constraint module (DCM). Specifically, the DDM synthesizes highly condensed samples guided by the classifier, forcing the model to learn compacted essential class-related clues from a few incremental samples. The DCM introduces a designed loss to constrain the previously learned class distribution, which can preserve distilled knowledge more sufficiently. Extensive experiments on three public datasets show the superiority of our method against other state-of-the-art competitors.

[148] VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations

Sushant Gautam,Cise Midoglu,Vajira Thambawita,Michael A. Riegler,Pål Halvorsen

Main category: cs.CV

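TL;DR: This paper presents VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation to temporally structured inputs, improving the detection of high-confidence errors in video-language models.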

Details Motivation: Existing uncertainty metrics are poor at judging whether the outputs of video-language models are correct, and these models often produce high-confidence hallucinations, so more reliable detection methods are needed. Method: VideoHEDGE generates multiple answers from the original video and several perturbed variants, clusters the textual outputs using NLI-based or embedding-based methods, and computes reliability scores from the distribution over semantic hypotheses: Semantic Entropy (SE), RadFlag, and VASE. Result: On the SoccerChat benchmark, evaluated with an LLM-as-a-judge, VASE achieves the highest ROC-AUC across several 7B Video-VLMs, especially under large perturbations; embedding-based clustering matches NLI-based clustering at lower cost; domain fine-tuning reduces hallucination frequency but improves calibration only modestly. Conclusion: VideoHEDGE offers an efficient, scalable approach to hallucination detection in video question answering; the VASE score clearly outperforms existing methods, and the open-source tooling supports reproduction and extension. Abstract: Hallucinations in video-capable vision-language models (Video-VLMs) remain frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness. We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. Given a video-question pair, VideoHEDGE draws a baseline answer and multiple high-temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)-based or embedding-based methods. Cluster-level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE). We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM-as-a-judge to obtain binary hallucination labels. Across three 7B Video-VLMs (Qwen2-VL, Qwen2.5-VL, and a SoccerChat-finetuned model), VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. We further show that embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and that domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration.
The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE#videohedge .
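
As a rough illustration of how cluster-level probability masses turn into a score, the sketch below computes a semantic entropy from cluster assignments. It assumes that each sampled answer has already been assigned a semantic cluster id and that a cluster's mass is simply its empirical frequency; the hedge-bench library may weight clusters differently, and RadFlag and VASE are not reproduced here. The function name is hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over clusters.

    cluster_ids holds one semantic-cluster label per sampled answer.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    entropy = 0.0
    for count in counts.values():
        mass = count / total          # empirical probability mass of the cluster
        entropy -= mass * math.log(mass)
    return entropy


# All sampled answers agree: a single cluster, so entropy is zero.
consistent = semantic_entropy(["goal", "goal", "goal", "goal"])

# Answers split evenly over two meanings: entropy equals log(2).
split = semantic_entropy(["goal", "foul", "goal", "foul"])
```

Higher values mean the sampled answers disagree in meaning, which is the signal such scores use to flag a likely hallucination.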

[149] REVNET: Rotation-Equivariant Point Cloud Completion via Vector Neuron Anchor Transformer

Zhifan Ni,Eckehard Steinbach

Main category: cs.CV

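TL;DR: Proposes REVNET, a rotation-equivariant anchor transformer built on Vector Neurons for robust point cloud completion under arbitrary rotations; it preserves geometric detail and improves performance in real-world scenes.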

Details Motivation: Most existing point cloud completion methods are built on rotation-variant frameworks that are sensitive to canonical poses, so they handle arbitrarily rotated inputs poorly in practice and performance drops. Method: REVNET is a rotation-equivariant framework built on the Vector Neuron (VN) network; it represents partial point clouds with equivariant anchors, uses a VN Missing Anchor Transformer to predict the missing structure, and improves the bias and normalization mechanisms of VN networks to strengthen feature expressiveness. Result: The method surpasses existing approaches on the synthetic MVP dataset and achieves results comparable to non-equivariant networks on the real-world KITTI dataset without requiring input pose alignment. Conclusion: REVNET is robust to arbitrary rotations, balances local detail preservation with feature stability, and advances point cloud completion in real-world scenes. Abstract: Incomplete point clouds captured by 3D sensors often result in the loss of both geometric and semantic information. Most existing point cloud completion methods are built on rotation-variant frameworks trained with data in canonical poses, limiting their applicability in real-world scenarios. While data augmentation with random rotations can partially mitigate this issue, it significantly increases the learning burden and still fails to guarantee robust performance under arbitrary poses. To address this challenge, we propose the Rotation-Equivariant Anchor Transformer (REVNET), a novel framework built upon the Vector Neuron (VN) network for robust point cloud completion under arbitrary rotations. To preserve local details, we represent partial point clouds as sets of equivariant anchors and design a VN Missing Anchor Transformer to predict the positions and features of missing anchors. Furthermore, we extend VN networks with a rotation-equivariant bias formulation and a ZCA-based layer normalization to improve feature expressiveness. Leveraging the flexible conversion between equivariant and invariant VN features, our model can generate point coordinates with greater stability. Experimental results show that our method outperforms state-of-the-art approaches on the synthetic MVP dataset in the equivariant setting. On the real-world KITTI dataset, REVNET delivers competitive results compared to non-equivariant networks, without requiring input pose alignment. The source code will be released on GitHub under URL: https://github.com/nizhf/REVNET.

[150] End-to-End Video Character Replacement without Structural Guidance

Zhengbo Xu,Jie Ma,Ziheng Wang,Zhan Peng,Jun Liang,Jing Li

Main category: cs.CV

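TL;DR: This paper proposes MoCha, a new framework for controllable video character replacement that needs only a single arbitrary frame mask, with no complex structural guidance or per-frame segmentation, and substantially improves generation quality and consistency in complex scenes.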

Details Motivation: Controllable video character replacement with a user-provided identity remains challenging because paired video data is lacking; existing methods rely on a reconstruction paradigm and explicit structural guidance, generalize poorly to complex scenes, and tend to produce visual artifacts and temporal inconsistencies. Method: The MoCha framework requires only a single arbitrary frame mask; it introduces a condition-aware RoPE mechanism to better fuse multi-modal inputs and an RL-based post-training stage to strengthen facial identity preservation; three specialized datasets (UE5-rendered, expression-driven synthetic, and augmented real data) are built to address the scarcity of training data. Result: Experiments show that the method substantially outperforms existing state-of-the-art approaches across a range of complex scenes, with greater robustness and temporal consistency under occlusion, human-object interaction, unusual poses, and illumination changes. Conclusion: By simplifying the input conditions, improving the model design, and building high-quality training data, MoCha achieves better controllable video character replacement without paired data and moves the field closer to practical use. Abstract: Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a single arbitrary frame mask. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired-training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research. Please refer to our project page for more details: orange-3dv-team.github.io/MoCha

[151] WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation

Zishan Shu,Juntong Wu,Wei Yan,Xudong Liu,Hongyu Zhang,Chang Liu,Youdong Mao,Jie Chen

Main category: cs.CV

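TL;DR: This paper proposes a wave-equation-based paradigm for vision modeling: feature maps are treated as spatial signals, an underdamped wave equation explicitly models the relationship between spatial frequency and propagation time, and the resulting efficient, physically meaningful Wave Propagation Operator (WPO) is used to build WaveFormer models that are efficient and accurate across several vision tasks.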

Details Motivation: Vision Transformers capture visual dependencies but lack a principled account of how semantic information propagates spatially; existing approaches such as heat-conduction-based methods struggle to capture global structure and high-frequency detail at the same time, so a new modeling mechanism is needed. Method: Treats feature maps as spatial signals whose evolution with network depth is described by an underdamped wave equation; derives a closed-form, frequency-time decoupled solution and implements it as the lightweight WPO module, which models global interactions in O(N log N) time; builds the WaveFormer family on WPO as drop-in replacements for ViTs and CNNs. Result: WaveFormer reaches accuracy competitive with attention-based models on image classification, object detection, and semantic segmentation, with up to 1.6x higher throughput and 30% fewer FLOPs; the experiments indicate that it captures both global coherence and high-frequency detail and is complementary to heat-based methods. Conclusion: Wave propagation offers a new physics-inspired perspective on vision modeling; by explicitly controlling how spatial frequency interacts with propagation time, WPO yields efficient, compact, and semantically rich feature extraction and is a strong alternative to attention. Abstract: Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave-based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency, from low-frequency global layout to high-frequency edges and textures, is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed-form, frequency-time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time, far lower than attention. Building on WPO, we propose a family of WaveFormer models as drop-in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6x higher throughput and 30% fewer FLOPs than attention-based alternatives. Furthermore, our results demonstrate that wave propagation introduces a complementary modeling bias to heat-based methods, effectively capturing both global coherence and high-frequency details essential for rich visual semantics. Codes are available at: https://github.com/ZishanShu/WaveFormer.
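
To make the phrase "closed-form, frequency-time decoupled solution" more concrete, here is a textbook sketch of a damped wave equation solved independently per spatial frequency. The symbols are assumptions chosen for illustration and do not come from the abstract: $\alpha$ is a damping coefficient, $c$ a wave speed, $k$ a spatial frequency, and $\hat{u}_0$, $\hat{v}_0$ the Fourier transforms of the initial feature map and its initial rate of change. The paper's actual parameterization may differ.

```latex
\begin{aligned}
\partial_t^2 u + 2\alpha\,\partial_t u &= c^2 \nabla^2 u
  && \text{(damped wave equation in space)} \\
\partial_t^2 \hat{u} + 2\alpha\,\partial_t \hat{u} + c^2 \lVert k \rVert^2 \hat{u} &= 0
  && \text{(after a spatial Fourier transform)} \\
\hat{u}(k,t) &= e^{-\alpha t}\Big[\hat{u}_0(k)\cos(\omega_k t)
  + \frac{\hat{v}_0(k) + \alpha\,\hat{u}_0(k)}{\omega_k}\sin(\omega_k t)\Big]
  && \text{(underdamped: } c\lVert k \rVert > \alpha\text{)} \\
\omega_k &= \sqrt{c^2 \lVert k \rVert^2 - \alpha^2}
\end{aligned}
```

Under these assumptions each frequency evolves on its own, so applying the operator amounts to a forward FFT, a pointwise multiplication by the time-dependent factors, and an inverse FFT, which is consistent with the O(N log N) cost quoted in the abstract.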

[152] Interpretability and Individuality in Knee MRI: Patient-Specific Radiomic Fingerprint with Reconstructed Healthy Personas

Yaxi Chen,Simin Ni,Shuai Li,Shaheer U. Saeed,Aleksandra Ivanova,Rikin Hargunani,Jie Huang,Chaozong Liu,Yipeng Hu

Main category: cs.CV

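TL;DR: Proposes two interpretable, patient-specific approaches, radiomic fingerprints and healthy personas, for automated assessment of knee MRI that balance performance with clinical interpretability.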

Details Motivation: Traditional radiomic features are fixed at the population level and struggle to capture individual variation, while deep learning lacks interpretability, which limits clinical use. Method: 1) Radiomic fingerprint: dynamically builds a patient-specific feature set from MRI, using an image-conditioned predictor to select the most relevant features; 2) Healthy persona: uses a diffusion model to generate an individualized healthy knee MRI as a baseline and compares it with the pathological image to reveal abnormalities. Result: Across three clinical tasks, the two approaches, alone or combined, match or exceed existing deep learning models while supporting interpretability at multiple levels. Conclusion: The proposed methods keep accuracy high while improving interpretability, which supports clinical adoption and patient-specific biomarker discovery. Abstract: For automated assessment of knee MRI scans, both accuracy and interpretability are essential for clinical use and adoption. Traditional radiomics rely on predefined features chosen at the population level; while more interpretable, they are often too restrictive to capture patient-specific variability and can underperform end-to-end deep learning (DL). To address this, we propose two complementary strategies that bring individuality and interpretability: radiomic fingerprints and healthy personas. First, a radiomic fingerprint is a dynamically constructed, patient-specific feature set derived from MRI. Instead of applying a uniform population-level signature, our model predicts feature relevance from a pool of candidate features and selects only those most predictive for each patient, while maintaining feature-level interpretability. This fingerprint can be viewed as a latent-variable model of feature usage, where an image-conditioned predictor estimates usage probabilities and a transparent logistic regression with global coefficients performs classification. Second, a healthy persona synthesises a pathology-free baseline for each patient using a diffusion model trained to reconstruct healthy knee MRIs. Comparing features extracted from pathological images against their personas highlights deviations from normal anatomy, enabling intuitive, case-specific explanations of disease manifestations. We systematically compare fingerprints, personas, and their combination across three clinical tasks. Experimental results show that both approaches yield performance comparable to or surpassing state-of-the-art DL models, while supporting interpretability at multiple levels. Case studies further illustrate how these perspectives facilitate human-explainable biomarker discovery and pathology localisation.

[153] SfMamba: Efficient Source-Free Domain Adaptation via Selective Scan Modeling

Xi Chen,Hongxun Yao,Sicheng Zhao,Jiankun Zhu,Jing Jiang,Kui Jiang

Main category: cs.CV

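TL;DR: This paper proposes SfMamba, a new framework that addresses the trade-off between perception field and computational efficiency in source-free domain adaptation (SFDA). By introducing a channel-wise visual state-space block and a semantic-consistent shuffle strategy, it achieves stronger performance while remaining parameter-efficient.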

Details Motivation: Existing SFDA methods struggle to balance perception field and computational efficiency in domain-invariant feature learning; in addition, existing visual Mamba models are limited in capturing channel-wise frequency characteristics and in spatial robustness. Method: Proposes the SfMamba framework with two key components: a Channel-wise Visual State-Space block, which scans channel sequences to extract domain-invariant features, and a Semantic-Consistent Shuffle strategy, which shuffles background patch sequences in the 2D selective scan while keeping predictions consistent, reducing error accumulation. Result: Experiments on multiple benchmarks show that SfMamba consistently outperforms existing methods on source-free domain adaptation while keeping favorable parameter and computational efficiency. Conclusion: Through improved channel-sequence modeling and structural design, SfMamba improves the performance and stability of source-free domain adaptation and offers an efficient, practical solution for real applications. Abstract: Source-free domain adaptation (SFDA) tackles the critical challenge of adapting source-pretrained models to unlabeled target domains without access to source data, overcoming data privacy and storage limitations in real-world applications. However, existing SFDA approaches struggle with the trade-off between perception field and computational efficiency in domain-invariant feature learning. Recently, Mamba has offered a promising solution through its selective scan mechanism, which enables long-range dependency modeling with linear complexity. However, the Visual Mamba (i.e., VMamba) remains limited in capturing channel-wise frequency characteristics critical for domain alignment and maintaining spatial robustness under significant domain shifts. To address these, we propose a framework called SfMamba to fully explore the stable dependency in source-free model transfer. SfMamba introduces a Channel-wise Visual State-Space block that enables channel-sequence scanning for domain-invariant feature extraction. In addition, SfMamba involves a Semantic-Consistent Shuffle strategy that disrupts background patch sequences in 2D selective scan while preserving prediction consistency to mitigate error accumulation. Comprehensive evaluations across multiple benchmarks show that SfMamba achieves consistently stronger performance than existing methods while maintaining favorable parameter efficiency, offering a practical solution for SFDA. Our code is available at https://github.com/chenxi52/SfMamba.

[154] SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning

Leo Fillioux,Omprakash Chakraborty,Ismail Ben Ayed,Paul-Henry Cournède,Stergios Christodoulidis,Maria Vakalopoulou,Jose Dolz

Main category: cs.CV

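TL;DR: This paper proposes Semantic Orthogonal Calibration (SoC), a new method for improving the calibration of uncertainty estimates in test-time prompt tuning of vision-language models; compared with a full orthogonality constraint, SoC improves calibration while preserving classification performance.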

Details Motivation: Work on test-time prompt tuning of vision-language models has focused mainly on classification performance and paid little attention to uncertainty calibration; in particular, a full orthogonality constraint can over-separate semantically related classes and make the model overconfident. Method: Proposes Semantic Orthogonal Calibration (SoC), a Huber-loss-based regularizer that separates prototypes while preserving semantic proximity, easing the overly strong gradients caused by full orthogonality. Result: Experiments show that SoC consistently improves calibration on multiple benchmarks while maintaining good classification accuracy. Conclusion: By balancing orthogonality against semantic similarity, SoC improves the calibration of VLMs in the TPT setting and suggests a direction for jointly optimizing calibration and generalization. Abstract: With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving their discriminative performance. Recent state-of-the-art work advocates enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints will strongly push semantically related classes away, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities.
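
The idea of a Huber-based orthogonality regularizer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' loss: the abstract does not give the exact formula, so the choice to apply a Huber penalty to pairwise cosine similarities, the function names, and the threshold `delta` are all hypothetical. The point it shows is that a Huber penalty grows only linearly for large similarities, so its gradient on highly similar (semantically related) prototypes is capped, unlike a squared penalty.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def huber(x, delta):
    """Standard Huber function: quadratic near zero, linear beyond delta."""
    ax = abs(x)
    if ax <= delta:
        return 0.5 * x * x
    return delta * (ax - 0.5 * delta)


def huber_orthogonality_penalty(prototypes, delta=0.3):
    """Mean Huber penalty over all pairs of class prototypes.

    Hypothetical sketch: `prototypes` is a list of text-prompt embeddings,
    one per class. The penalty is zero when all pairs are orthogonal.
    """
    total, pairs = 0.0, 0
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            total += huber(cosine(prototypes[i], prototypes[j]), delta)
            pairs += 1
    return total / pairs if pairs else 0.0
```

In a training loop, a term like this would be added to the prompt-tuning objective with a weighting coefficient, which is again an assumption rather than something stated in the abstract.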

[155] CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion

Yiming Sun,Yuan Ruan,Qinghua Hu,Pengfei Zhu

Main category: cs.CV

TL;DR: Proposes CtrlFuse, a mask-prompt-guided controllable infrared and visible image fusion framework that enables task-adaptive dynamic fusion.

Details Motivation: Existing methods either perform pixel-level fusion while ignoring downstream task adaptability or learn rigid semantics implicitly through cascaded models, and so cannot meet diverse semantic perception needs. Method: Designs a multi-modal feature extractor, a reference prompt encoder (RPE), and a prompt-semantic fusion module (PSFM); uses mask guidance to fine-tune a segmentation model that produces task-specific semantic prompts, which are explicitly injected into the fusion process. Result: Achieves state-of-the-art fusion controllability and segmentation accuracy, with the task branch even outperforming the original segmentation model. Conclusion: CtrlFuse achieves semantics-guided dynamic image fusion and, through joint optimization with the downstream task, improves both fusion quality and perception. Abstract: Infrared and visible image fusion generates all-weather perception-capable images by combining complementary modalities, enhancing environmental awareness for intelligent unmanned systems. Existing methods either focus on pixel-level fusion while overlooking downstream task adaptability or implicitly learn rigid semantics through cascaded detection/segmentation models, unable to interactively address diverse semantic target perception needs. We propose CtrlFuse, a controllable image fusion framework that enables interactive dynamic fusion guided by mask prompts. The model integrates a multi-modal feature extractor, a reference prompt encoder (RPE), and a prompt-semantic fusion module (PSFM). The RPE dynamically encodes task-specific semantic prompts by fine-tuning pre-trained segmentation models with input mask guidance, while the PSFM explicitly injects these semantics into fusion features. Through synergistic optimization of parallel segmentation and fusion branches, our method achieves mutual enhancement between task performance and fusion quality. Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.

[156] SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models

Renyang Liu,Kangjie Chen,Han Qiu,Jie Zhang,Kwok-Yan Lam,Tianwei Zhang,See-Kiong Ng

Main category: cs.CV

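TL;DR: This paper proposes SafeRedir, a lightweight inference-time framework that redirects prompts in embedding space to achieve robust unlearning of unsafe content in image generation models, without modifying the original model and with good generality and resistance to adversarial attacks.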

Details Motivation: Image generation models tend to memorize and reproduce harmful content from their training data (such as NSFW imagery and copyrighted styles), and existing unlearning methods require retraining, degrade benign generation quality, or are vulnerable to prompt paraphrasing, so an efficient and robust solution is needed. Method: SafeRedir performs semantic redirection at inference time through token-level embedding interventions: a latent-aware multi-modal safety classifier identifies unsafe generation trajectories, and a token-level delta generator with auxiliary predictors makes precise semantic adjustments, using token masking and adaptive scaling to localize and control the strength of the intervention. Result: Experiments show that SafeRedir unlearns effectively across multiple tasks, preserves semantics and perceptual quality, keeps image quality stable, resists adversarial attacks better, and generalizes across diffusion backbones and previously unlearned models. Conclusion: SafeRedir is a plug-and-play safety framework for image generation models that needs no fine-tuning and achieves robust concept unlearning without sacrificing generation quality, making it suitable for content safety control in real deployments. Abstract: Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, but they require costly retraining, degrade the quality of benign generations, or fail to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier for identifying unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.

[157] Além do Desempenho: Um Estudo da Confiabilidade de Detectores de Deepfakes

Lucas Lopes,Rayson Laroca,André Grégio

Main category: cs.CV

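TL;DR: Proposes a reliability assessment framework for deepfake detection based on transferability, robustness, interpretability, and computational efficiency, and analyzes the strengths and weaknesses of existing methods.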

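Details Motivation: Deepfake detection lacks comprehensive evaluation methods that go beyond classification performance, making it hard to judge how reliable detectors are in practice. Method: Builds a four-pillar reliability assessment framework (transferability, robustness, interpretability, computational efficiency) and systematically analyzes five state-of-the-art detection methods. Result: The analysis reveals significant progress as well as critical limitations in current methods, such as performance drops under cross-dataset transfer, sensitivity to perturbations, lack of interpretability, or excessive computational cost. Conclusion: Classification accuracy alone is not enough to assess the reliability of deepfake detectors; multi-dimensional metrics are needed, and the proposed framework offers a more comprehensive evaluation standard for future work. Abstract: Deepfakes are synthetic media generated by artificial intelligence, with positive applications in education and creativity, but also serious negative impacts such as fraud, misinformation, and privacy violations. Although detection techniques have advanced, comprehensive evaluation methods that go beyond classification performance remain lacking. This paper proposes a reliability assessment framework based on four pillars: transferability, robustness, interpretability, and computational efficiency. An analysis of five state-of-the-art methods revealed significant progress as well as critical limitations.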

[158] Salience-SGG: Enhancing Unbiased Scene Graph Generation with Iterative Salience Estimation

Runfeng Qu,Ole Hall,Pia K Bideau,Julie Ouerfelli-Ethier,Martin Rolfs,Klaus Obermayer,Olaf Hellwich

Main category: cs.CV

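TL;DR: This paper proposes Salience-SGG, a framework that introduces an Iterative Salience Decoder (ISD) and semantic-agnostic salience labels to strengthen attention to spatial structure in scene graph generation, reducing bias against rare relations under long-tailed distributions and achieving state-of-the-art results on several datasets.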

Details Motivation: Scene graph generation (SGG) suffers from a long-tailed distribution, so models perform poorly on rare predicate classes; existing debiasing methods often sacrifice spatial understanding and over-rely on semantic priors. Method: Proposes the Salience-SGG framework, which includes an Iterative Salience Decoder (ISD) that emphasizes triplets with salient spatial structure and uses semantic-agnostic salience labels to guide ISD training. Result: Experiments on Visual Genome, Open Images V6, and GQA-200 show that Salience-SGG achieves state-of-the-art performance and improves the spatial understanding of existing debiased SGG methods, as reflected in higher Pairwise Localization Average Precision. Conclusion: Salience-SGG balances the trade-off between debiasing and spatial understanding, improves the modeling of rare relations by focusing on salient spatial structure, and offers a new approach to long-tailed scene graph generation. Abstract: Scene Graph Generation (SGG) suffers from a long-tailed distribution, where a few predicate classes dominate while many others are underrepresented, leading to biased models that underperform on rare relations. Unbiased-SGG methods address this issue by implementing debiasing strategies, but often at the cost of spatial understanding, resulting in an over-reliance on semantic priors. We introduce Salience-SGG, a novel framework featuring an Iterative Salience Decoder (ISD) that emphasizes triplets with salient spatial structures. To support this, we propose semantic-agnostic salience labels guiding ISD. Evaluations on Visual Genome, Open Images V6, and GQA-200 show that Salience-SGG achieves state-of-the-art performance and improves existing Unbiased-SGG methods in their spatial understanding, as demonstrated by the Pairwise Localization Average Precision.

[159] ISLA: A U-Net for MRI-based acute ischemic stroke lesion segmentation with deep supervision, attention, domain adaptation, and ensemble learning

Vincent Roca,Martin Bretzner,Hilde Henon,Laurent Puy,Grégory Kuchcinski,Renaud Lopes

Main category: cs.CV

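TL;DR: This paper proposes ISLA, a new deep learning model for segmenting acute ischemic stroke (AIS) lesions from diffusion MRI. It systematically optimizes the loss function, convolutional architecture, deep supervision, and attention mechanisms, is trained on three multicenter databases, and outperforms existing methods; code and models will be released to support reproducibility.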

Details Motivation: Accurate segmentation of acute ischemic stroke lesions in MRI is essential for diagnosis and treatment, but the optimal deep learning configuration remains unclear and many models are not public, which limits reproducibility and clinical use. Method: Proposes ISLA, a U-Net-based model with systematically optimized loss function, residual connections, deep supervision, and attention mechanisms; trains it on multicenter data from more than 1500 AIS patients and studies unsupervised domain adaptation to improve generalization to external clinical data. Result: ISLA outperforms two state-of-the-art AIS lesion segmentation methods on an external test set, showing stronger segmentation performance and generalization. Conclusion: ISLA is a robust, high-performing model for automatic AIS lesion segmentation; its systematic optimization strategy and public resources should help advance research and clinical application in this area. Abstract: Accurate delineation of acute ischemic stroke lesions in MRI is a key component of stroke diagnosis and management. In recent years, deep learning models have been successfully applied to the automatic segmentation of such lesions. While most proposed architectures are based on the U-Net framework, they primarily differ in their choice of loss functions and in the use of deep supervision, residual connections, and attention mechanisms. Moreover, many implementations are not publicly available, and the optimal configuration for acute ischemic stroke (AIS) lesion segmentation remains unclear. In this work, we introduce ISLA (Ischemic Stroke Lesion Analyzer), a new deep learning model for AIS lesion segmentation from diffusion MRI, trained on three multicenter databases totaling more than 1500 AIS participants. Through systematic optimization of the loss function, convolutional architecture, deep supervision, and attention mechanisms, we developed a robust segmentation framework. We further investigated unsupervised domain adaptation to improve generalization to an external clinical dataset. ISLA outperformed two state-of-the-art approaches for AIS lesion segmentation on an external test set. Codes and trained models will be made publicly available to facilitate reuse and reproducibility.

[160] UR-Bench: A Benchmark for Multi-Hop Reasoning over Ultra-High-Resolution Images

Siqi Li,Xinyu Cai,Jianbiao Mei,Nianchen Deng,Pinlong Cai,Licheng Wen,Yufan Shen,Xuemeng Yang,Botian Shi,Yong Liu

Main category: cs.CV

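TL;DR: This paper introduces UR-Bench, a benchmark for evaluating the vision-language reasoning of multimodal large language models on ultra-high-resolution images. It contains four subsets across humanistic and natural scenes, with images ranging from hundreds of megapixels to gigapixels, and is accompanied by an agent-based framework with Semantic Abstraction and Retrieval tools for more efficient processing.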

Details Motivation: Existing visual question answering benchmarks mostly use medium-resolution images and cannot assess reasoning under extreme visual information density, so a benchmark targeting ultra-high-resolution images is needed. Method: Builds the UR-Bench benchmark covering Humanistic Scenes and Natural Scenes, with each subset containing ultra-high-resolution images of distinct spatial structure and questions at three levels; also proposes an agent-based framework in which a language model reasons by invoking external visual tools, supported by Semantic Abstraction and Retrieval tools. Result: Several state-of-the-art models are evaluated both as end-to-end MLLMs and within the proposed agent framework, and the results show that the framework improves reasoning on ultra-high-resolution images. Conclusion: UR-Bench provides a new standard for evaluating the reasoning of multimodal large language models at ultra-high resolution, and the proposed agent framework and tools offer a workable way to handle extreme-resolution images. Abstract: Recent multimodal large language models (MLLMs) show strong capabilities in visual-language reasoning, yet their performance on ultra-high-resolution imagery remains largely unexplored. Existing visual question answering (VQA) benchmarks typically rely on medium-resolution data, offering limited visual complexity. To bridge this gap, we introduce Ultra-high-resolution Reasoning Benchmark (UR-Bench), a benchmark designed to evaluate the reasoning capabilities of MLLMs under extreme visual information. UR-Bench comprises two major categories, Humanistic Scenes and Natural Scenes, covering four subsets of ultra-high-resolution images with distinct spatial structures and data sources. Each subset contains images ranging from hundreds of megapixels to gigapixels, accompanied by questions organized into three levels, enabling evaluation of models' reasoning capabilities in ultra-high-resolution scenarios. We further propose an agent-based framework in which a language model performs reasoning by invoking external visual tools. In addition, we introduce Semantic Abstraction and Retrieval tools that enable more efficient processing of ultra-high-resolution images. We evaluate state-of-the-art models using both end-to-end MLLMs and our agent-based framework, demonstrating the effectiveness of our framework.

[161] Translating Light-Sheet Microscopy Images to Virtual H&E Using CycleGAN

Yanhua Zhao

Main category: cs.CV

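TL;DR: Proposes a CycleGAN-based unpaired image translation method that converts multi-channel fluorescence microscopy images into pseudo H&E stained pathology images, making them easier for pathologists to interpret and to integrate into existing workflows.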

Details Motivation: Fluorescence microscopy provides information complementary to H&E staining but lacks the H&E appearance pathologists are used to; converting fluorescence images to an H&E-like appearance aids interpretation and integration with standard pathology workflows. Method: Uses a Cycle-Consistent Adversarial Network (CycleGAN) that combines the C01 and C02 fluorescence channels into an RGB image, with ResNet generators and PatchGAN discriminators, to learn a bidirectional mapping between the fluorescence and H&E domains without paired data, trained with adversarial, cycle-consistency, and identity losses. Result: Experiments show that the method generates realistic pseudo H&E images that preserve morphological structure while adopting H&E color characteristics. Conclusion: The method converts fluorescence images into a form familiar to pathologists and supports their integration into routine H&E analysis pipelines. Abstract: Histopathology analysis relies on Hematoxylin and Eosin (H&E) staining, but fluorescence microscopy offers complementary information. Converting fluorescence images to H&E-like appearance can aid interpretation and integration with standard workflows. We present a Cycle-Consistent Adversarial Network (CycleGAN) approach for unpaired image-to-image translation from multi-channel fluorescence microscopy to pseudo H&E stained histopathology images. The method combines C01 and C02 fluorescence channels into RGB and learns a bidirectional mapping between fluorescence and H&E domains without paired training data. The architecture uses ResNet-based generators with residual blocks and PatchGAN discriminators, trained with adversarial, cycle-consistency, and identity losses. Experiments on fluorescence microscopy datasets show the model generates realistic pseudo H&E images that preserve morphological structures while adopting H&E-like color characteristics. This enables visualization of fluorescence data in a format familiar to pathologists and supports integration with existing H&E-based analysis pipelines.
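
For readers unfamiliar with the three loss terms named above, the standard CycleGAN objective can be written as below. This is the generic formulation, not something taken from this abstract: here $X$ is assumed to be the fluorescence domain and $Y$ the H&E domain, $G: X \to Y$ and $F: Y \to X$ are the two generators, $D_X$ and $D_Y$ the discriminators, and the weights $\lambda_{\text{cyc}}$ and $\lambda_{\text{id}}$ are hypothetical hyperparameters whose values the abstract does not state.

```latex
\begin{aligned}
\mathcal{L}_{\text{cyc}}(G,F) &= \mathbb{E}_{x \sim X}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim Y}\big[\lVert G(F(y)) - y \rVert_1\big] \\
\mathcal{L}_{\text{id}}(G,F) &= \mathbb{E}_{y \sim Y}\big[\lVert G(y) - y \rVert_1\big]
  + \mathbb{E}_{x \sim X}\big[\lVert F(x) - x \rVert_1\big] \\
\mathcal{L}(G,F,D_X,D_Y) &= \mathcal{L}_{\text{GAN}}(G, D_Y) + \mathcal{L}_{\text{GAN}}(F, D_X)
  + \lambda_{\text{cyc}}\,\mathcal{L}_{\text{cyc}}(G,F)
  + \lambda_{\text{id}}\,\mathcal{L}_{\text{id}}(G,F)
\end{aligned}
```

The cycle term is what allows training without paired images: a fluorescence image translated to H&E and back should return to itself, which discourages the generator from altering morphology while it changes color.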

[162] Aggregating Diverse Cue Experts for AI-Generated Image Detection

Lei Tan,Shuwei Li,Mohan Kankanhalli,Robby T. Tan

Main category: cs.CV

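TL;DR: This paper proposes the Multi-Cue Aggregation Network (MCAN), which integrates spatial, frequency-domain, and chromaticity information to improve the generalization of AI-generated image detection.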

Details Motivation: Existing AI-generated image detectors often rely on model-specific features, which leads to overfitting and poor generalization. Method: Proposes the MCAN framework, which combines the input image, its high-frequency components, and a Chromatic Inconsistency (CI) cue, using a mixture-of-encoders adapter to fuse the cues dynamically. Result: Achieves state-of-the-art performance on the GenImage, Chameleon, and UniversalFakeDetect benchmarks, with up to a 7.4% gain in average ACC over the best prior method on GenImage. Conclusion: Through a unified multi-cue fusion mechanism, MCAN substantially improves detector generalization across different generative models. Abstract: The rapid emergence of image synthesis models poses challenges to the generalization of AI-generated image detectors. However, existing methods often rely on model-specific features, leading to overfitting and poor generalization. In this paper, we introduce the Multi-Cue Aggregation Network (MCAN), a novel framework that integrates different yet complementary cues in a unified network. MCAN employs a mixture-of-encoders adapter to dynamically process these cues, enabling more adaptive and robust feature representation. Our cues include the input image itself, which represents the overall content, and high-frequency components that emphasize edge details. Additionally, we introduce a Chromatic Inconsistency (CI) cue, which normalizes intensity values and captures noise information introduced during the image acquisition process in real images, making these noise patterns more distinguishable from those in AI-generated content. Unlike prior methods, MCAN's novelty lies in its unified multi-cue aggregation framework, which integrates spatial, frequency-domain, and chromaticity-based information for enhanced representation learning. These cues are intrinsically more indicative of real images, enhancing cross-model generalization. Extensive experiments on the GenImage, Chameleon, and UniversalFakeDetect benchmarks validate the state-of-the-art performance of MCAN. On the GenImage dataset, MCAN outperforms the best state-of-the-art method by up to 7.4% in average ACC across eight different image generators.

[163] DentalX: Context-Aware Dental Disease Detection with Radiographs

Zhi Qin Tan,Xiatian Zhu,Owen Addison,Yunpeng Li

Main category: cs.CV

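TL;DR: Proposes DentalX, a new context-aware dental disease detection method that uses oral structure information to reduce the visual ambiguity inherent in radiographs.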

Details Motivation: Diagnosing dental diseases from radiographs is time-consuming and difficult because the diagnostic evidence is subtle, and existing methods that rely on object detectors designed for natural images struggle to detect diseases with little visual support. Method: Introduces a structural context extraction module that learns an auxiliary task, semantic segmentation of dental anatomy, to extract meaningful structural context and feed it into the primary disease detection task. Result: Extensive experiments on a dedicated benchmark show that DentalX significantly outperforms prior methods on both tasks, with the two tasks naturally reinforcing each other. Conclusion: By combining oral structure information with the disease detection task, DentalX improves the detection of subtle dental diseases. Abstract: Diagnosing dental diseases from radiographs is time-consuming and challenging due to the subtle nature of diagnostic evidence. Existing methods, which rely on object detection models designed for natural images with more distinct target patterns, struggle to detect dental diseases that present with far less visual support. To address this challenge, we propose DentalX, a novel context-aware dental disease detection approach that leverages oral structure information to mitigate the visual ambiguity inherent in radiographs. Specifically, we introduce a structural context extraction module that learns an auxiliary task: semantic segmentation of dental anatomy. The module extracts meaningful structural context and integrates it into the primary disease detection task to enhance the detection of subtle dental diseases. Extensive experiments on a dedicated benchmark demonstrate that DentalX significantly outperforms prior methods in both tasks. This mutual benefit arises naturally during model optimization, as the correlation between the two tasks is effectively captured. Our code is available at https://github.com/zhiqin1998/DentYOLOX.

[164] Near-perfect photo-ID of the Hula painted frog with zero-shot deep local-feature matching

Maayan Yesharim,R. G. Bina Perl,Uri Roll,Sarig Gafny,Eli Geffen,Yoav Ram

Main category: cs.CV

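TL;DR: This paper evaluates computer-vision methods for individual identification of the Hula painted frog, finds that zero-shot deep local-feature matching beats global-feature models in both accuracy and practicality, develops a two-stage workflow to improve efficiency, and deploys the result as a web application for conservation monitoring.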

Details Motivation: Accurate individual identification is essential for monitoring rare amphibians, but traditional marking is unsuitable for critically endangered species, so non-invasive identification is needed. Method: Using 1,233 ventral images of 191 individuals, compares zero-shot deep local-feature matching with deep global-feature embedding models; proposes a two-stage workflow in which a fine-tuned global model produces a candidate list that is then re-ranked by local-feature matching. Result: Local-feature matching reaches 98% top-1 accuracy in closed-set identification, outperforming all global models; the two-stage workflow cuts runtime from 6.5-7.8 hours to about 38 minutes while keeping about 96% accuracy; open-set identification is possible by thresholding match scores. Conclusion: For the Hula painted frog, zero-shot deep local-feature matching performs best for photo-identification and is a strong default for non-invasive individual identification in this species. Abstract: Accurate individual identification is essential for monitoring rare amphibians, yet invasive marking is often unsuitable for critically endangered species. We evaluate state-of-the-art computer-vision methods for photographic re-identification of the Hula painted frog (Latonia nigriventer) using 1,233 ventral images from 191 individuals collected during 2013-2020 capture-recapture surveys. We compare deep local-feature matching in a zero-shot setting with deep global-feature embedding models. The local-feature pipeline achieves 98% top-1 closed-set identification accuracy, outperforming all global-feature models; fine-tuning improves the best global-feature model to 60% top-1 (91% top-10) but remains below local matching. To combine scalability with accuracy, we implement a two-stage workflow in which a fine-tuned global-feature model retrieves a short candidate list that is re-ranked by local-feature matching, reducing end-to-end runtime from 6.5-7.8 hours to ~38 minutes while maintaining ~96% top-1 closed-set accuracy on the labeled dataset. Separation of match scores between same- and different-individual pairs supports thresholding for open-set identification, enabling practical handling of novel individuals. We deploy this pipeline as a web application for routine field use, providing rapid, standardized, non-invasive identification to support conservation monitoring and capture-recapture analyses. Overall, in this species, zero-shot deep local-feature matching outperformed global-feature embedding and provides a strong default for photo-identification.
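
The two-stage workflow described above (global shortlist, local re-ranking, optional threshold for unseen individuals) can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' pipeline: the function `identify`, the gallery tuple layout, the `local_score` callable, and the parameters `shortlist_size` and `threshold` are all hypothetical names, and the real system uses learned global embeddings and a deep local-feature matcher in place of the toy inputs here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identify(query, gallery, local_score, shortlist_size=10, threshold=None):
    """Two-stage identification sketch.

    query:   (global_embedding, image) for the new photo.
    gallery: list of (individual_id, global_embedding, image).
    local_score: callable(image_a, image_b) -> match score (higher is better);
                 stands in for an expensive local-feature matcher.
    Returns (individual_id or None, best_local_score).
    """
    q_emb, q_img = query
    # Stage 1: cheap global-embedding retrieval of a short candidate list.
    ranked = sorted(gallery, key=lambda item: cosine(q_emb, item[1]), reverse=True)
    shortlist = ranked[:shortlist_size]
    if not shortlist:
        return None, 0.0
    # Stage 2: expensive local matching only on the shortlist.
    scored = [(local_score(q_img, img), ind_id) for ind_id, _, img in shortlist]
    best_score, best_id = max(scored, key=lambda pair: pair[0])
    # Open-set handling: a low best score suggests a previously unseen individual.
    if threshold is not None and best_score < threshold:
        return None, best_score
    return best_id, best_score
```

The runtime saving reported in the abstract comes from the same structure: the costly matcher runs on a handful of candidates instead of the whole gallery, at the price of missing a true match that the global model ranks outside the shortlist.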

[165] S3-CLIP: Video Super Resolution for Person-ReID

Tamas Endrei,Gyorgy Cserey

Main category: cs.CV

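TL;DR: This paper proposes S3-CLIP, a video super-resolution-based CLIP-ReID framework, and presents the first systematic study of using video super-resolution to improve tracklet quality in person re-identification, especially in challenging cross-view scenarios.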

Details Motivation: Most person re-identification methods focus on architectural improvements and overlook how tracklet quality affects deployment in difficult real-world scenes; this work aims to improve the representation of low-quality video clips through super-resolution. Method: Combines recent super-resolution networks with a task-driven super-resolution pipeline to build S3-CLIP, a video super-resolution-enhanced CLIP-ReID framework that improves tracklet quality for video-based person ReID. Result: In the VReID-XFD challenge, S3-CLIP reaches 37.52% mAP in the aerial-to-ground setting and 29.16% mAP in the ground-to-aerial setting; in the ground-to-aerial setting, Rank-1, Rank-5, and Rank-10 improve by 11.24%, 13.48%, and 17.98%, respectively. Conclusion: Video super-resolution can improve tracklet quality in cross-view person re-identification, with clear retrieval gains on low-quality video, and offers a new direction for practical applications. Abstract: Tracklet quality is often treated as an afterthought in most person re-identification (ReID) methods, with the majority of research presenting architectural modifications to foundational models. Such approaches neglect an important limitation, posing challenges when deploying ReID systems in real-world, difficult scenarios. In this paper, we introduce S3-CLIP, a video super-resolution-based CLIP-ReID framework developed for the VReID-XFD challenge at WACV 2026. The proposed method integrates recent advances in super-resolution networks with task-driven super-resolution pipelines, adapting them to the video-based person re-identification setting. To the best of our knowledge, this work represents the first systematic investigation of video super-resolution as a means of enhancing tracklet quality for person ReID, particularly under challenging cross-view conditions. Experimental results demonstrate performance competitive with the baseline, achieving 37.52% mAP in aerial-to-ground and 29.16% mAP in ground-to-aerial scenarios. In the ground-to-aerial setting, S3-CLIP achieves substantial gains in ranking accuracy, improving Rank-1, Rank-5, and Rank-10 performance by 11.24%, 13.48%, and 17.98%, respectively.

[166] Reasoning Matters for 3D Visual Grounding

Hsiang-Wei Huang,Kuang-Ming Chen,Wenhao Chai,Cheng-Yen Yang,Jen-Hao Cheng,Jenq-Neng Hwang

Main category: cs.CV

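TL;DR: This paper proposes a new data generation pipeline that automatically produces 3D visual grounding data with reasoning processes, and uses that data to fine-tune a large language model, Reason3DVG-8B, which surpasses prior LLM-based methods while using only 1.6% of their training data.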

Details Motivation: Existing 3D visual grounding models have limited reasoning ability and depend on large amounts of annotated data for supervised training; earlier attempts to scale synthetic data brought gains that were not proportional to the cost, so a more efficient way to generate data and a model with stronger reasoning are needed. Method: Proposes a pipeline that automatically synthesizes 3D visual grounding data together with the corresponding reasoning process, and fine-tunes a large language model on this data to build Reason3DVG-8B, a 3D visual grounding model with strong reasoning ability. Result: Using only 1.6% of the training data of the previous LLM-based method 3D-GRAND, Reason3DVG-8B outperforms it, supporting both the value of the generated data and the importance of reasoning in 3D visual grounding. Conclusion: Introducing reasoning processes and training on efficiently synthesized data can substantially improve 3D visual grounding while greatly reducing dependence on large-scale annotation. Abstract: The recent development of Large Language Models (LLMs) with strong reasoning ability has driven research in various domains such as mathematics, coding, and scientific discovery. Meanwhile, 3D visual grounding, as a fundamental task in 3D understanding, still remains challenging due to the limited reasoning ability of recent 3D visual grounding models. Most of the current methods incorporate a text encoder and a visual feature encoder to generate cross-modal fused features and predict the referring object. These models often require supervised training on extensive 3D annotation data. On the other hand, recent research also focuses on scaling synthetic data to train stronger 3D visual grounding LLMs; however, the performance gain remains limited and non-proportional to the data collection cost. In this work, we propose a 3D visual grounding data pipeline, which is capable of automatically synthesizing 3D visual grounding data along with the corresponding reasoning process. Additionally, we leverage the generated data for LLM fine-tuning and introduce Reason3DVG-8B, a strong 3D visual grounding LLM that outperforms the previous LLM-based method 3D-GRAND using only 1.6% of its training data, demonstrating the effectiveness of our data and the importance of reasoning in 3D visual grounding.

[167] Motion Attribution for Video Generation

Xindi Wu,Despoina Paschalidou,Jun Gao,Antonio Torralba,Laura Leal-Taixé,Olga Russakovsky,Sanja Fidler,Jonathan Lorraine

Main category: cs.CV

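TL;DR: This paper proposes Motive, a motion-centric data attribution framework for video generation models that identifies the fine-tuning clips most responsible for temporal dynamics and uses data curation to improve the temporal consistency and physical plausibility of generated videos.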

Details Motivation: Video generation models are advancing quickly, but how data shapes the motion in generated results is poorly understood, and there is no data attribution method that targets motion rather than static appearance. Method: Proposes Motive, a scalable gradient-based motion attribution framework that separates temporal dynamics from static appearance through motion-weighted loss masks, enabling efficient motion-specific influence computation on large, high-quality video datasets and models. Result: Motive identifies training clips that strongly affect motion and guides data curation; on text-to-video models, fine-tuning with high-influence data improves both motion smoothness and dynamic degree on VBench and reaches a 74.1% human preference win rate. Conclusion: Motive is the first framework to attribute motion rather than visual appearance in video generation models and the first to apply such attribution to curating fine-tuning data, improving the temporal quality of generated videos. Abstract: Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.
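
The notion of a motion-weighted loss mask can be illustrated with a small sketch. This is an assumption-heavy toy, not the Motive implementation: the abstract does not say how motion is measured, so absolute frame differencing is used here as a stand-in, and the function names `motion_mask` and `motion_weighted_loss` are hypothetical. Frames and per-pixel losses are plain 2D lists of floats.

```python
def motion_mask(prev_frame, next_frame, eps=1e-8):
    """Per-pixel motion weights in [0, 1] from absolute frame difference.

    Frame differencing is an illustrative proxy for motion, chosen for
    simplicity; static pixels receive weight 0.
    """
    diffs = [
        [abs(b - a) for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]
    peak = max(max(row) for row in diffs)
    return [[d / (peak + eps) for d in row] for row in diffs]


def motion_weighted_loss(per_pixel_loss, mask, eps=1e-8):
    """Weighted average of per-pixel losses, emphasizing moving regions."""
    weighted = sum(
        loss * weight
        for loss_row, mask_row in zip(per_pixel_loss, mask)
        for loss, weight in zip(loss_row, mask_row)
    )
    total_weight = sum(weight for mask_row in mask for weight in mask_row)
    return weighted / (total_weight + eps)
```

In a gradient-based attribution setting, the gradient of such a masked loss would be used in place of the gradient of the ordinary loss, so that influence scores reflect how a training clip affects moving regions rather than static appearance.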

[168] 3AM: Segment Anything with Geometric Consistency in Videos

Yang-Che Sun,Cheng Sun,Chin-Yang Lin,Fu-En Yang,Min-Hung Chen,Yen-Yu Lin,Yu-Lun Liu

Main category: cs.CV

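TL;DR: This paper presents 3AM, a training-time enhancement to SAM2 that fuses 3D-aware features from MUSt3R to improve video object segmentation under large viewpoint changes. The method needs no camera poses or depth maps and achieves geometry-consistent segmentation from RGB input alone at inference, substantially outperforming existing methods on datasets such as ScanNet++.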

Details Motivation: Existing video object segmentation methods rely on appearance features and perform poorly under large viewpoint changes. Traditional 3D instance segmentation methods are viewpoint-consistent but require camera poses, depth maps, and complex preprocessing, which limits their practicality. Method: Introduces 3AM, which fuses MUSt3R's 3D-aware features with SAM2's appearance features through a lightweight Feature Merger, and designs a field-of-view aware sampling strategy so that 3D correspondence is learned from spatially consistent object regions. Result: On datasets with wide-baseline motion such as ScanNet++ and Replica, 3AM substantially outperforms SAM2 and its extensions, reaching 90.6% IoU and 71.7% Positive IoU on the ScanNet++ Selected Subset, improvements of +15.9 and +30.4 points over state-of-the-art methods. Conclusion: By combining implicit geometric correspondence with appearance features, 3AM improves video object segmentation under large viewpoint changes without increasing inference complexity. It requires only RGB input, with no camera poses or preprocessing, which makes it practical. Abstract: Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and its extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: https://jayisaking.github.io/3AM-Page/
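
For readers unfamiliar with the IoU metric reported above, the following is a minimal sketch of intersection-over-union for a pair of binary segmentation masks. It is the standard definition, not the paper's evaluation code; the function name `mask_iou` is hypothetical, and returning 1.0 when both masks are empty is an assumed convention. The "Positive IoU" variant is not defined in the abstract and is not sketched here.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # ASSUMPTION: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)

# One of the two predicted pixels overlaps the single ground-truth pixel.
example = mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

Here the intersection has one pixel and the union has two, so `example` is 0.5. Per-video or per-dataset scores are typically averages of such per-frame values, though the exact aggregation used for the reported numbers is not stated in the abstract.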

[169] RAVEN: Erasing Invisible Watermarks via Novel View Synthesis

Fahad Shamshad,Nils Lukas,Karthik Nandakumar

Main category: cs.CV

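TL;DR: This paper uses novel view synthesis to expose a fundamental vulnerability of invisible watermarks to semantic-preserving viewpoint transformations. A zero-shot diffusion model applies geometric transformations in latent space, removing watermarks from a range of schemes while preserving image quality.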

Details Motivation: To evaluate how robust current invisible watermarking schemes for AI-generated images are against sophisticated removal attacks, and to expose their fragility under semantic-level transformations. Method: Reformulates watermark removal as a view synthesis problem and proposes a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, with a view-guided correspondence attention mechanism to maintain structural consistency. Result: Without access to the detector or any knowledge of the watermark, the method removes watermarks across 15 watermarking methods, outperforms 14 baseline attacks, and maintains higher perceptual quality across multiple datasets. Conclusion: Existing invisible watermarks, even those robust to pixel-space and frequency-domain attacks, remain vulnerable to semantic-preserving viewpoint transformations, which suggests that the semantic robustness of watermark design needs to be rethought. Abstract: Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative view of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods, outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.