Table of Contents
cs.CL
[1] A Novel Differential Feature Learning for Effective Hallucination Detection and Classification
Wenkai Wang,Vincent Lee,Yizhen Zheng
Main category: cs.CL
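TL;DR: Proposes a dual-model architecture that uses projected fusion and differential feature learning, finds that hallucination signals concentrate in sparse feature subsets, and reveals a hierarchical "funnel pattern", opening a new route to efficient hallucination detection in large language models.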
Details
Motivation: Prior work has shown that hallucinated and factual content differ in hidden layers, but where the signal sits within those layers remains unclear, which limits the development of detection methods.
Method: A dual-model architecture combines a Projected Fusion (PF) block for adaptive inter-layer feature weighting with a Differential Feature Learning (DFL) mechanism, which identifies discriminative features by computing the difference between complementary representations that parallel encoders learn from the same input.
Result: Validated on several HaluEval tasks. Hallucination signals concentrate in a highly sparse feature subset; detection performance is maintained using only 1% of feature dimensions, and accuracy improves significantly on question answering and dialogue tasks.
Conclusion: Hallucination signals are more concentrated than previously assumed and follow a "funnel pattern" (diverse in shallow layers, concentrated in deep layers). This makes computationally efficient detection systems possible and could lower inference cost while maintaining accuracy.
Abstract: Large language model hallucination represents a critical challenge where outputs deviate from factual accuracy due to distributional biases in training data. While recent investigations establish that specific hidden layers exhibit differences between hallucinatory and factual content, the precise localization of hallucination signals within layers remains unclear, limiting the development of efficient detection methods. We propose a dual-model architecture integrating a Projected Fusion (PF) block for adaptive inter-layer feature weighting and a Differential Feature Learning (DFL) mechanism that identifies discriminative features by computing differences between parallel encoders learning complementary representations from identical inputs. Through systematic experiments across HaluEval's question answering, dialogue, and summarization datasets, we demonstrate that hallucination signals concentrate in highly sparse feature subsets, achieving significant accuracy improvements on question answering and dialogue tasks. Notably, our analysis reveals a hierarchical "funnel pattern" where shallow layers exhibit high feature diversity while deep layers demonstrate concentrated usage, enabling detection performance to be maintained with minimal degradation using only 1% of feature dimensions. These findings suggest that hallucination signals are more concentrated than previously assumed, offering a pathway toward computationally efficient detection systems that could reduce inference costs while maintaining accuracy.
[2] Influence Guided Context Selection for Effective Retrieval-Augmented Generation
Jiale Deng,Yanyan Shen,Ziyuan Pei,Youmin Chen,Linpeng Huang
Main category: cs.CL
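TL;DR: Proposes the Contextual Influence value (CI value) to quantify the quality of retrieved contexts, measured by the performance drop when each context is removed, and designs a parameterized surrogate model for efficient context selection at inference time.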
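The leave-one-out idea behind the CI value can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `utility` callable stands in for "generator performance given a context list", and `toy_utility` is a hypothetical scoring rule invented here only so the example runs. The selection rule (keep contexts with positive CI value) follows the abstract of this entry.

```python
from typing import Callable, List, Sequence


def ci_values(
    contexts: Sequence[str],
    utility: Callable[[List[str]], float],
) -> List[float]:
    """Leave-one-out influence of each context.

    CI_i = utility(all contexts) - utility(all contexts except i),
    so a positive value means removing context i hurts performance.
    """
    full = utility(list(contexts))
    values = []
    for i in range(len(contexts)):
        without_i = [c for j, c in enumerate(contexts) if j != i]
        values.append(full - utility(without_i))
    return values


def select_contexts(
    contexts: Sequence[str],
    utility: Callable[[List[str]], float],
) -> List[str]:
    """Keep only contexts whose CI value is strictly positive."""
    values = ci_values(contexts, utility)
    return [c for c, v in zip(contexts, values) if v > 0]


def toy_utility(ctxs: List[str]) -> float:
    """Hypothetical stand-in for generator quality (assumption, not from the paper):
    reward the presence of the needed fact, penalise each noisy context."""
    score = 1.0 if any("Paris" in c for c in ctxs) else 0.0
    score -= 0.2 * sum("noise" in c for c in ctxs)
    return score


if __name__ == "__main__":
    docs = [
        "Paris is the capital of France.",
        "noise: unrelated text",
        "France is in Europe.",
    ]
    print(ci_values(docs, toy_utility))
    print(select_contexts(docs, toy_utility))
```

Computing exact values this way needs one generator call per context plus a ground-truth score, which is the label dependency and overhead the entry's surrogate model is meant to avoid.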
Details
Motivation: Context selection methods based on predefined quality metrics yield limited gains for RAG because they do not use the query, the context list, and the generator together for a holistic quality assessment.
Method: Recasts context quality assessment as an inference-time data valuation problem and proposes the CI value, which measures the performance degradation when each context is removed. A hierarchical parameterized surrogate model predicts CI values by combining local relevance with global inter-context interactions, and is trained with oracle supervision and end-to-end feedback.
Result: Experiments on 8 NLP tasks and multiple LLMs show that the method significantly outperforms state-of-the-art baselines, filtering out poor-quality contexts while preserving critical information.
Conclusion: The CI value offers a more comprehensive way to assess context quality without complex hyperparameter tuning, and the surrogate model enables efficient, high-quality context selection in practice, significantly improving RAG performance.
Abstract: Retrieval-Augmented Generation (RAG) addresses large language model (LLM) hallucinations by grounding responses in external knowledge, but its effectiveness is compromised by poor-quality retrieved contexts containing irrelevant or noisy information. While existing approaches attempt to improve performance through context selection based on predefined context quality assessment metrics, they show limited gains over standard RAG. We attribute this limitation to their failure in holistically utilizing available information (query, context list, and generator) for comprehensive quality assessment. Inspired by recent advances in data selection, we reconceptualize context quality assessment as an inference-time data valuation problem and introduce the Contextual Influence Value (CI value). This novel metric quantifies context quality by measuring the performance degradation when removing each context from the list, effectively integrating query-aware relevance, list-aware uniqueness, and generator-aware alignment. Moreover, CI value eliminates complex selection hyperparameter tuning by simply retaining contexts with positive CI values. To address practical challenges of label dependency and computational overhead, we develop a parameterized surrogate model for CI value prediction during inference. The model employs a hierarchical architecture that captures both local query-context relevance and global inter-context interactions, trained through oracle CI value supervision and end-to-end generator feedback.
Extensive experiments across 8 NLP tasks and multiple LLMs demonstrate that our context selection method significantly outperforms state-of-the-art baselines, effectively filtering poor-quality contexts while preserving critical information. Code is available at https://github.com/SJTU-DMTai/RAG-CSM.
[3] Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs
Norman Paulsen
Main category: cs.CL
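TL;DR: Introduces the concept of the Maximum Effective Context Window (MECW). Testing models across context lengths and problem types shows that the effective context is far smaller than the maximum context window vendors advertise, and that performance varies markedly with problem type.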
Details
Motivation: The large context windows advertised by LLM providers do not reflect real performance, and there is no standard method for measuring effective context capacity.
Method: Defines the Maximum Effective Context Window (MECW), designs a systematic testing method that evaluates several models across problem types and context lengths, and establishes a standardized comparison to locate the point where performance fails.
Result: The experiments collected hundreds of thousands of data points and found that every model's effective context window is far smaller than its advertised value. Some top models failed with as few as 100 tokens, most showed severe accuracy degradation within 1,000 tokens, and the gap reached as much as 99%. The MECW varies with problem type.
Conclusion: Current LLMs have serious effectiveness problems on long-context tasks, and advertised context lengths are misleading. The study offers actionable insights for improving model accuracy and reducing hallucination rates.
Abstract: Large language model (LLM) providers boast big numbers for maximum context window sizes. To test the real-world use of context windows, we 1) define a concept of maximum effective context window, 2) formulate a testing method of a context window's effectiveness over various sizes and problem types, and 3) create a standardized way to compare model efficacy for increasingly larger context window sizes to find the point of failure. We collected hundreds of thousands of data points across several models and found significant differences between reported Maximum Context Window (MCW) size and Maximum Effective Context Window (MECW) size. Our findings show that the MECW is not only drastically different from the MCW but also shifts based on the problem type. A few top-of-the-line models in our test group failed with as little as 100 tokens in context; most had severe degradation in accuracy by 1000 tokens in context. All models fell far short of their Maximum Context Window by as much as 99 percent. Our data reveals the Maximum Effective Context Window shifts based on the type of problem provided, offering clear and actionable insights into how to improve model accuracy and decrease model hallucination rates.
[4] How Large Language Models Need Symbolism
Xiaotie Deng,Hanyu Li
Main category: cs.CL
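TL;DR: Argues that scaling alone cannot deliver AI's future breakthroughs; large language models need human-crafted symbols as a "compass" to guide their powerful intuition.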
Details
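Motivation: Current AI relies on scaling but lacks direction, which makes genuine discovery hard to achieve.
Method: Proposes introducing carefully human-designed symbol systems as a structured framework that guides the intuition of large models.
Result: Provides semantic and logical guidance for large models, strengthening their reasoning and capacity for innovation.
Conclusion: Combining human symbol systems with the intuition of large models is the key path toward genuine AI discovery.
Abstract: We argue that AI's future requires more than scaling. To unlock genuine discovery, large language models need a compass: human-crafted symbols to guide their powerful but blind intuition.
[5] One Model, Many Morals: Uncovering Cross-Linguistic Misalignments in Computational Moral Reasoning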
Sualeha Farid,Jayden Lin,Zean Chen,Shivani Kumar,David Jurgens
Main category: cs.CL
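TL;DR: Systematically examines how language affects the moral decisions of large language models (LLMs). Translating two moral reasoning benchmarks into five culturally and linguistically diverse languages reveals significant inconsistencies in LLMs' moral judgments across languages, reflecting cultural misalignment, and shows how pretraining data shapes a model's moral outlook.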
Details
Motivation: Because LLMs are pretrained mainly on English data, their moral reasoning in multilingual, multicultural settings is a concern and may not generalize across cultural contexts, so the effect of language on moral judgment needs systematic study.
Method: Translates two established moral reasoning benchmarks into five culturally and typologically diverse languages, runs multilingual zero-shot evaluation, analyzes the sources of the differences through carefully designed research questions, and uses a case study to examine the role of pretraining data.
Result: LLMs' moral judgments are significantly inconsistent across languages and show cultural misalignment. The study identifies key drivers of these differences, including model reasoning strategies and pretraining data bias, and builds a structured typology of moral reasoning errors.
Conclusion: Language significantly affects LLMs' moral judgments, current models lack cross-cultural moral alignment, and more culturally aware AI systems are needed.
Abstract: Large Language Models (LLMs) are increasingly deployed in multilingual and multicultural environments where moral reasoning is essential for generating ethically appropriate responses. Yet, the dominant pretraining of LLMs on English-language data raises critical concerns about their ability to generalize judgments across diverse linguistic and cultural contexts. In this work, we systematically investigate how language mediates moral decision-making in LLMs. We translate two established moral reasoning benchmarks into five culturally and typologically diverse languages, enabling multilingual zero-shot evaluation. Our analysis reveals significant inconsistencies in LLMs' moral judgments across languages, often reflecting cultural misalignment. Through a combination of carefully constructed research questions, we uncover the underlying drivers of these disparities, ranging from disagreements to reasoning strategies employed by LLMs. Finally, through a case study, we link the role of pretraining data in shaping an LLM's moral compass. Through this work, we distill our insights into a structured typology of moral reasoning errors that calls for more culturally-aware AI.
[6] LLM-Based Support for Diabetes Diagnosis: Opportunities, Scenarios, and Challenges with GPT-5
Gaurav Kumar Gupta,Nirajan Acharya,Pranal Pande
Main category: cs.CL
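TL;DR: Evaluates the potential of GPT-5 for diabetes diagnosis and management. Using synthetic cases generated from ADA standards and public datasets, it tests five scenarios: symptom recognition, laboratory interpretation, gestational diabetes screening, remote monitoring, and multimodal complication detection. GPT-5's outputs align strongly with ADA criteria, suggesting potential as a clinical decision support tool.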
Details
Motivation: Early recognition of diabetes is difficult because of vague symptoms, borderline laboratory values, gestational complexity, and the burden of long-term monitoring; current diagnostic workflows need intelligent assistance to improve efficiency and accessibility.
Method: Builds a simulation framework of synthetic cases aligned with the ADA 2025 standards, uses GPT-5 on five representative clinical scenarios to produce classifications, clinical rationales, patient explanations, and structured JSON summaries, and assesses agreement with the guidelines.
Result: GPT-5 aligned strongly with ADA criteria on all five tasks, provided interpretable clinical reasoning and patient-friendly explanations, and output structured data, showing its ability to serve as a dual-purpose decision support tool for clinicians and patients.
Conclusion: GPT-5 could become a guideline-aligned assistant for diabetes care, but its use must rest on reproducible evaluation frameworks to ensure reliability and accountability in medical settings.
Abstract: Diabetes mellitus is a major global health challenge, affecting over half a billion adults worldwide with prevalence projected to rise. Although the American Diabetes Association (ADA) provides clear diagnostic thresholds, early recognition remains difficult due to vague symptoms, borderline laboratory values, gestational complexity, and the demands of long-term monitoring. Advances in large language models (LLMs) offer opportunities to enhance decision support through structured, interpretable, and patient-friendly outputs. This study evaluates GPT-5, the latest generative pre-trained transformer, using a simulation framework built entirely on synthetic cases aligned with ADA Standards of Care 2025 and inspired by public datasets including NHANES, Pima Indians, EyePACS, and MIMIC-IV. Five representative scenarios were tested: symptom recognition, laboratory interpretation, gestational diabetes screening, remote monitoring, and multimodal complication detection. For each, GPT-5 classified cases, generated clinical rationales, produced patient explanations, and output structured JSON summaries. Results showed strong alignment with ADA-defined criteria, suggesting GPT-5 may function as a dual-purpose tool for clinicians and patients, while underscoring the importance of reproducible evaluation frameworks for responsibly assessing LLMs in healthcare.
[7] Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes
Guangliang Liu,Bocheng Chen,Xitong Zhang,Kristen Marie Johnson
Main category: cs.CL
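TL;DR: Studies how current fairness objectives, viewed through the lens of forgetting, affect the downstream task performance of pretrained language models when mitigating gender bias, and finds that existing methods struggle to balance fairness and performance.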
Details
Motivation: When moral alignment is achieved through fine-tuning or model editing, the downstream task performance of pretrained language models often drops because of forgetting. The paper investigates the mechanism behind this, in particular the role and limits of fairness objectives when mitigating gender stereotypes.
Method: Analyzes the relationship between the degree of forgetting and the fairness objective, and studies how overall forgetting and selective forgetting during debiasing affect downstream task performance.
Result: Downstream performance is driven mainly by the overall level of forgetting; selectively forgetting stereotypical knowledge increases overall forgetting; and general forgetting-mitigation methods neither reduce overall forgetting nor improve performance.
Conclusion: Current fairness objectives are limited in trading off performance against fairness, and finer-grained control of forgetting is needed to reduce the performance loss of moral alignment.
Abstract: Moral alignment has emerged as a widely adopted approach for regulating the behavior of pretrained language models (PLMs), typically through fine-tuning or model editing on curated datasets. However, this process often comes at the cost of degraded downstream task performance. Prior studies commonly aim to achieve a performance trade-off by encouraging PLMs to selectively forget stereotypical knowledge through carefully designed fairness objectives, while preserving their helpfulness. In this short paper, we investigate the underlying mechanisms of the performance trade-off in the context of mitigating gender stereotypes, through the lens of forgetting and the fairness objective. Our analysis reveals the limitations of the current fairness objective in achieving a trade-off by demonstrating that: (1) downstream task performance is primarily driven by the overall forgetting level; (2) selective forgetting of stereotypes tends to increase overall forgetting; and (3) general solutions for mitigating forgetting are ineffective at reducing overall forgetting and fail to improve downstream task performance.
[8] A State-of-the-Art SQL Reasoning Model using RLVR
Alnur Ali,Ashutosh Baheti,Jonathan Chang,Ta-Chung Chi,Brandon Cui,Andrew Drozdov,Jonathan Frankle,Abhay Gupta,Pallavi Koppol,Sean Kulinski,Jonathan Li,Dipendra Misra,Krista Opsahl-Ong,Jose Javier Gonzalez Ortiz,Matei Zaharia,Yue Zhang
Main category: cs.CL
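TL;DR: Applies reinforcement learning with verifiable rewards (RLVR) to the BIRD data science benchmark and, with no additional training data or proprietary models, reaches state-of-the-art SQL generation accuracy on the first leaderboard submission.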
Details
Motivation: Enterprise settings need reasoning models that incorporate domain-specific knowledge, and many such problems have verifiable reward functions, which motivates exploring RLVR on practical tasks.
Method: A general-purpose training recipe comprising prompt and model selection, a warm-up stage using the offline RL approach TAO, and rigorous online RLVR training.
Result: The first submission to the BIRD leaderboard reached 73.56% accuracy without self-consistency and 75.68% with self-consistency, and in the self-consistency setting it required fewer generations than the second-best approach.
Conclusion: The framework is simple and effective and is broadly applicable to enterprise domains such as business intelligence, data science, and coding.
Abstract: Developing custom reasoning models via Reinforcement Learning (RL) that can incorporate organization-specific knowledge has great potential to address problems faced by enterprise customers. In many of these problems, the reward function is verifiable, a setting termed RL with Verifiable Rewards (RLVR). We apply RLVR to a popular data science benchmark called BIRD that measures the ability of an AI agent to convert a natural language query for a database to SQL executions. We apply a simple and general-purpose training recipe involving careful prompt and model selection, a warm-up stage using our offline RL approach called TAO, followed by rigorous online RLVR training. With no additional training data beyond the BIRD training set and no use of proprietary models, our very first submission to the BIRD leaderboard reached state-of-the-art accuracy on the private test set: 73.56% without self-consistency and 75.68% with self-consistency. In the latter case, our model also required fewer generations than the second-best approach. While BIRD is only a proxy task, the simplicity of our framework makes it broadly applicable to enterprise domains such as business intelligence, data science, and coding.
[9] Learning to Reason with Mixture of Tokens
Adit Jain,Brendan Rappazzo
Main category: cs.CL
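TL;DR: Proposes a framework for reinforcement learning with verifiable rewards (RLVR) based on mixture-of-token generation (MoT-G). By generating reasoning in a continuous mixture space, it exploits the model's full probability distribution and improves both the reasoning ability and the training efficiency of large language models.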
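A mixture embedding of the kind this entry refers to (a weighted sum over token embeddings instead of a single sampled token) might be computed roughly as follows. This is a minimal pure-Python sketch under stated assumptions: restricting the mixture to the top-k candidates and renormalising with a softmax is a choice made here for illustration, and the function names are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_embedding(
    logits: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    top_k: int = 2,
    temperature: float = 1.0,
) -> List[float]:
    """Return a probability-weighted sum of the top-k token embeddings.

    logits[i] is the model's score for vocabulary item i and
    embeddings[i] is that item's input embedding. With top_k=1 this
    reduces to ordinary greedy decoding (a single discrete token).
    """
    # Indices of the top_k highest-scoring tokens (assumed mixture support).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] / temperature for i in top])
    dim = len(embeddings[0])
    return [
        sum(w * embeddings[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
```

The returned vector would be fed back as the next input embedding, so probability mass on runner-up tokens is carried forward rather than discarded at each reasoning step.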
Details
Motivation: Existing RLVR methods sample only a discrete token at each reasoning step, discarding the rich information in the model's probability distribution over candidate tokens and constraining the reasoning search space. The paper aims to improve reasoning by preserving and using this distributional information.
Method: Presents a unified framework that generalizes existing MoT-G approaches, including constructing mixture embeddings as weighted sums of token embeddings, and extends RLVR to generate chain-of-thought in this continuous mixture space. Two MoT-G variants are evaluated on the Reasoning-Gym task suite.
Result: With the Qwen2.5-1.5B model, MoT-G improves performance by 5-35% on 7 of 10 tasks and reaches comparable accuracy with half the number of trajectories, indicating better training efficiency. Analysis suggests the benefit may come from maintaining higher hidden-state entropy throughout reasoning and promoting exploration in token space.
Conclusion: MoT-G gives RLVR a more efficient way to use information. Operating in a continuous mixture space substantially improves LLM reasoning and exploration while reducing the reliance on large numbers of sampled trajectories.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a leading approach for improving large language model (LLM) reasoning capabilities. Most current methods follow variants of Group Relative Policy Optimization, which samples multiple reasoning completions, scores them relative to each other, and adjusts the policy accordingly. However, these approaches invariably sample discrete tokens at each reasoning step, discarding the rich distributional information in the model's probability distribution over candidate tokens. While preserving and utilizing this distributional information has proven beneficial in non-RL settings, current RLVR methods seem to be unnecessarily constraining the reasoning search space by not using this information. To address this limitation, we investigate mixture-of-token generation (MoT-G) in RLVR. We present a unified framework that generalizes existing MoT-G approaches, including existing training-free methods that construct mixture embeddings as weighted sums over token embeddings, and extend RLVR to operate directly in this continuous mixture space for generating chain-of-thought. Evaluating two MoT-G variants on Reasoning-Gym, a suite of reasoning-intensive language tasks, we find that MoT-G methods achieve substantial improvements (5-35% gains on 7 out of 10 tasks) compared to standard decoding with the Qwen2.5-1.5B model, while reaching comparable accuracy with half the number of trajectories, suggesting improved training efficiency.
Through comprehensive hidden-state and token-level analyses, we provide evidence that MoT-G's benefits may stem from its ability to maintain higher hidden-state entropy throughout the reasoning process and promote exploration in token space.
[10] Dual-Head Reasoning Distillation: Improving Classifier Accuracy with Train-Time-Only Reasoning
Jillian Xu,Dylan Zhou,Vinay Shukla,Yang Yang,Junrui Ruan,Shuhuai Lin,Wenfei Zou,Yinxiao Liu,Karthik Lakshmanan
Main category: cs.CL
TL;DR: Proposes Dual-Head Reasoning Distillation (DHRD), which improves classification accuracy while keeping inference throughput high.
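The training objective described for this entry, a weighted sum of label cross-entropy from the classification head and token-level LM loss from the reasoning head, can be sketched in pure Python as below. The exact weighting form (`label_loss + lm_weight * lm_loss`), the per-token averaging, and all function names are assumptions made for illustration; the abstract only states that the loss is a weighted sum.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def dual_head_loss(
    cls_logits: Sequence[float],
    label: int,
    lm_logits_per_token: Sequence[Sequence[float]],
    rationale_token_ids: Sequence[int],
    lm_weight: float = 0.5,
) -> float:
    """Weighted sum of classification loss and rationale LM loss.

    cls_logits: output of the pooled classification head (train + inference).
    lm_logits_per_token: reasoning-head logits, one row per target token
        (used only during training, supervised by teacher rationales).
    lm_weight: hypothetical mixing coefficient; 0 disables the reasoning term.
    """
    label_loss = cross_entropy(cls_logits, label)
    token_losses = [
        cross_entropy(row, tok)
        for row, tok in zip(lm_logits_per_token, rationale_token_ids)
    ]
    lm_loss = sum(token_losses) / len(token_losses)
    return label_loss + lm_weight * lm_loss
```

At test time only `cls_logits` would be computed, which is why throughput matches a plain pooled classifier in the reported results.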
Details
Motivation: Resolve the trade-off between the high computational cost of Chain-of-Thought prompting and classification accuracy.
Method: Adds two heads to a decoder-only language model: a pooled classification head used in both training and inference, and a reasoning head supervised by teacher rationales and used only in training. Training uses a weighted sum of label cross-entropy and token-level language modeling loss.
Result: On seven SuperGLUE tasks, DHRD achieves relative accuracy gains of 0.65-5.47% over baselines, with larger gains on entailment/causal tasks. Inference throughput matches pooled classifiers and is 96-142 times faster (in QPS) than CoT decoding.
Conclusion: DHRD improves model performance without adding inference cost and is an efficient way to combine knowledge distillation with reasoning.
Abstract: Chain-of-Thought (CoT) prompting often improves classification accuracy, but it introduces a significant throughput penalty with rationale generation (Wei et al., 2022; Cheng and Van Durme, 2024). To resolve this trade-off, we introduce Dual-Head Reasoning Distillation (DHRD), a simple training method for decoder-only language models (LMs) that adds (i) a pooled classification head used during training and inference and (ii) a reasoning head supervised by teacher rationales used only in training. We train with a loss function that is a weighted sum of label cross-entropy and token-level LM loss over input-plus-rationale sequences. On seven SuperGLUE tasks, DHRD yields relative gains of 0.65-5.47% over pooled baselines, with notably larger gains on entailment/causal tasks. Since we disable the reasoning head at test time, inference throughput matches pooled classifiers and exceeds CoT decoding on the same backbones by 96-142 times in QPS.
[11] On Code-Induced Reasoning in LLMs
Abdul Waheed,Zhen Wu,Carolyn Rosé,Daphne Ippolito
Main category: cs.CL
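TL;DR: Uses a systematic framework to investigate which properties of code data contribute most to the reasoning ability of large language models. Structural properties matter more than semantic ones, and appropriate abstractions such as pseudocode can match or even exceed real code.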
Details
Motivation: Identify the key factors in code data that affect LLM reasoning, and understand the specific effect of different programming languages and code properties on model performance.
Method: Builds parallel instruction datasets in ten programming languages, applies controlled perturbations to separate structural from semantic effects, then fine-tunes LLMs from five model families and eight scales and evaluates them on multiple tasks.
Result: Models are more sensitive to structural perturbations; abstractions such as pseudocode and flowcharts perform close to or better than real code; corrupted code that preserves surface regularities remains competitive; Python favors natural language reasoning, while lower-level languages such as Java and Rust favor math tasks.
Conclusion: Structural properties of code matter more than semantic properties for LLM reasoning, syntactic style affects task performance, and well-designed training data can effectively improve model reasoning.
Abstract: Code data has been shown to enhance the reasoning capabilities of large language models (LLMs), but it remains unclear which aspects of code are most responsible. We investigate this question with a systematic, data-centric framework. We construct parallel instruction datasets in ten programming languages and apply controlled perturbations that selectively disrupt structural or semantic properties of code. We then finetune LLMs from five model families and eight scales on each variant and evaluate their performance on natural language, math, and code tasks. Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones, particularly on math and code tasks. Appropriate abstractions like pseudocode and flowcharts can be as effective as code, while encoding the same information with fewer tokens without adhering to original syntax can often retain or even improve performance. Remarkably, even corrupted code with misleading signals remains competitive when surface-level regularities persist. Finally, syntactic styles also shape task-specific gains with Python favoring natural language reasoning and lower-level languages such as Java and Rust favoring math. Through our systematic framework, we aim to provide insight into how different properties of code influence reasoning and inform the design of training data for enhancing LLM reasoning capabilities.
[12] Agribot: agriculture-specific question answer system
Naman Jain,Pranjali Jain,Pratik Kayal,Jayakrishna Sahit,Soham Pachpande,Jayesh Choudhari
Main category: cs.CL
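TL;DR: Presents an agricultural chatbot built on the Kisan Call Center dataset that uses a sentence embedding model to help farmers get information on weather, market rates, plant protection, and government schemes. Eliminating synonyms and adding entity extraction raises accuracy from 56% to 86%. The system is available 24/7, easy to access and understand, can help improve agricultural output, and reduces the load on call center staff.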
Details
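Motivation: India's economy is dominated by agriculture, and access to accurate agricultural information is essential for improving output. Existing channels can be inefficient and have limited coverage, so an automated, efficient, and easy-to-use information service is needed to handle farmers' varied questions.
Method: Builds an agricultural chatbot on the Kisan Call Center dataset, answers farmers' questions through semantic matching with a sentence embedding model, and improves the model by eliminating synonyms and adding entity extraction.
Result: The initial sentence embedding model reaches 56% accuracy; after eliminating synonyms and adding entity extraction, accuracy rises to 86%. The system is available around the clock from any device and delivers information in a concise, understandable form.
Conclusion: The chatbot makes agricultural information easier and more accurate for farmers to obtain, supporting agricultural development, and reduces the workload of human call centers so that staff can be allocated more effectively.
Abstract: India is an agro-based economy and proper information about agricultural practices is the key to optimal agricultural growth and output. In order to answer the queries of the farmer, we have built an agricultural chatbot based on the dataset from Kisan Call Center. This system is robust enough to answer queries related to weather, market rates, plant protection and government schemes. This system is available 24/7, can be accessed through any electronic device and the information is delivered with the ease of understanding. The system is based on a sentence embedding model which gives an accuracy of 56%. After eliminating synonyms and incorporating entity extraction, the accuracy jumps to 86%. With such a system, farmers can progress towards easier information about farming related practices and hence a better agricultural output. The job of the Call Center workforce would be made easier and the hard work of various such workers can be redirected to a better goal.
[13] Domain-Aware Speaker Diarization On African-Accented English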
Chibuzor Okocha,Kelechi Ezema,Christan Grant
Main category: cs.CL
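TL;DR: Examines speaker diarization performance on African-accented English across domains (general versus clinical dialogue) and finds a significant domain penalty for clinical dialogue, driven mainly by false alarms and missed detections caused by short turns and frequent overlap. Lightweight domain adaptation, fine-tuning the segmentation module on accent-matched data, reduces error but does not close the gap. Contributions include a controlled cross-domain benchmark, a concise approach to error decomposition and conversation-level profiling, and a reproducible adaptation recipe; the authors recommend overlap-aware segmentation and more balanced clinical resources as next steps.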
Details
Motivation: Investigate how speaker diarization performance on African-accented English differs across dialogue domains, especially clinical settings, and why.
Method: Evaluates multiple production and open systems on general and clinical dialogues under a strict DER protocol, with error decomposition and conversation-level profiling; tests lightweight domain adaptation by fine-tuning the segmentation module on accent-matched data.
Result: Clinical speech incurs a consistent and significant domain penalty, caused mainly by false alarms and missed detections arising from short turns and frequent overlap. Fine-tuning the segmentation module reduces error but does not fully eliminate the domain gap.
Conclusion: Current speaker diarization systems are limited on African-accented English in the clinical domain; overlap-aware segmentation and more balanced clinical speech resources are needed.
Abstract: This study examines domain effects in speaker diarization for African-accented English. We evaluate multiple production and open systems on general and clinical dialogues under a strict DER protocol that scores overlap. A consistent domain penalty appears for clinical speech and remains significant across models. Error analysis attributes much of this penalty to false alarms and missed detections, aligning with short turns and frequent overlap. We test lightweight domain adaptation by fine-tuning a segmentation module on accent-matched data; it reduces error but does not eliminate the gap. Our contributions include a controlled benchmark across domains, a concise approach to error decomposition and conversation-level profiling, and an adaptation recipe that is easy to reproduce. Results point to overlap-aware segmentation and balanced clinical resources as practical next steps.
[14] Generation-Time vs. Post-hoc Citation: A Holistic Evaluation of LLM Attribution
Yash Saxena,Raviteja Bommireddy,Ankur Padia,Manas Gaur
Main category: cs.CL
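TL;DR: Compares two citation paradigms for large language models, Generation-Time Citation (G-Cite) and Post-hoc Citation (P-Cite), and recommends a retrieval-centric, P-Cite-first approach in high-stakes domains to balance coverage and citation correctness.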
Details
Motivation: In high-stakes domains such as healthcare and law, large language models must provide verifiable citations, yet there has been no systematic comparison of generation-time and post-hoc citation paradigms.
Method: Introduces the G-Cite and P-Cite paradigms and evaluates methods ranging from zero-shot to advanced retrieval-augmented approaches on four attribution datasets.
Result: P-Cite methods strike a better balance between coverage and correctness with moderate latency, whereas G-Cite methods are more precise but lower in coverage and speed. Retrieval is the key driver of attribution quality.
Conclusion: For high-stakes applications the authors recommend a retrieval-centric, P-Cite-first strategy, using G-Cite only where precision is critical, such as strict claim verification.
Abstract: Trustworthy Large Language Models (LLMs) must cite human-verifiable sources in high-stakes domains such as healthcare, law, academia, and finance, where even small errors can have severe consequences. Practitioners and researchers face a choice: let models generate citations during decoding, or let models draft answers first and then attach appropriate citations. To clarify this choice, we introduce two paradigms: Generation-Time Citation (G-Cite), which produces the answer and citations in one pass, and Post-hoc Citation (P-Cite), which adds or verifies citations after drafting. We conduct a comprehensive evaluation from zero-shot to advanced retrieval-augmented methods across four popular attribution datasets and provide evidence-based recommendations that weigh trade-offs across use cases. Our results show a consistent trade-off between coverage and citation correctness, with retrieval as the main driver of attribution quality in both paradigms. P-Cite methods achieve high coverage with competitive correctness and moderate latency, whereas G-Cite methods prioritize precision at the cost of coverage and speed. We recommend a retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings such as strict claim verification. Our code and human evaluation results are available at https://anonymous.4open.science/r/Citation_Paradigms-BBB5/
[15] Comparative Personalization for Multi-document Summarization
Haoyuan Li,Snigdha Chaturvedi
Main category: cs.CL
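TL;DR: Proposes ComPSum, a new framework for personalized multi-document summarization (MDS) that compares preferences across users to produce a structured analysis guiding summary generation, together with AuthorMap, a fine-grained evaluation framework that needs no reference summaries, and the PerMSum dataset used for experimental validation.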
Details
Motivation: Effective personalized MDS requires identifying fine-grained differences between users' preferences, but existing methods lack both in-depth modeling of these differences and effective ways to evaluate them.
Method: The ComPSum framework first generates a structured user analysis by comparing the target user's preferences with those of other users, then uses that analysis to guide personalized summary generation. The AuthorMap evaluation framework scores personalization via authorship attribution between summaries generated for different users.
Result: Evaluated with AuthorMap on the newly constructed PerMSum dataset, ComPSum significantly outperforms strong baselines.
Conclusion: Comparative modeling of user preferences with structured analysis effectively improves personalized MDS, and AuthorMap provides a reliable reference-free evaluation of personalized summaries.
Abstract: Personalized multi-document summarization (MDS) is essential for meeting individual user preferences of writing style and content focus for summaries. In this paper, we propose that for effective personalization, it is important to identify fine-grained differences between users' preferences by comparing the given user's preferences with other users' preferences. Motivated by this, we propose ComPSum, a personalized MDS framework. It first generates a structured analysis of a user by comparing their preferences with other users' preferences. The generated structured analysis is then used to guide the generation of personalized summaries. To evaluate the performance of ComPSum, we propose AuthorMap, a fine-grained reference-free evaluation framework for personalized MDS. It evaluates the personalization of a system based on the authorship attribution between two personalized summaries generated for different users. For robust evaluation of personalized MDS, we construct PerMSum, a personalized MDS dataset in the review and news domain. We evaluate the performance of ComPSum on PerMSum using AuthorMap, showing that it outperforms strong baselines.
[16] Vision Language Models Cannot Plan, but Can They Formalize?
Muyu He,Yuxi Zheng,Yuchen Liu,Zijian An,Bill Cai,Jiani Huang,Lifeng Zhou,Feng Liu,Ziyang Li,Li Zhang
Main category: cs.CL
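TL;DR: Presents a suite of five pipelines that use vision language models (VLMs) as formalizers for one-shot, open-vocabulary, multimodal PDDL formalization, shows on existing and newly built benchmarks that they outperform end-to-end plan generation, and identifies visual understanding as the current bottleneck.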
Details
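Motivation: Vision language models have made progress on simple multimodal planning tasks but fall short on long-horizon planning. In text-only environments, using LLMs as formalizers (for example, generating PDDL) has improved performance significantly, but comparable multimodal research is scarce and limited to predefined vocabularies or simplified settings, so more general one-shot, open-vocabulary multimodal formalization needs to be explored.
Method: Designs and implements five VLM-as-formalizer pipelines that use a vision language model to convert multimodal input (images and text) into a PDDL formalization and then call a planning solver to produce the plan. The pipelines are evaluated on an existing benchmark and two newly built benchmarks with authentic, multi-view, low-quality images, and the effect of intermediate representations (such as captions and scene graphs) is analyzed.
Result: VLM-as-formalizer significantly outperforms end-to-end methods on multimodal long-horizon planning. The bottleneck lies mainly in the vision side failing to capture key object relations; intermediate textual representations partly mitigate this, but the gains are inconsistent.
Conclusion: Using VLMs for multimodal formalization, rather than directly generating action sequences, is an effective route to complex task planning. Future work should focus on improving VLMs' visual understanding to support complete relation extraction and robust formalization.
Abstract: The advancement of vision language models (VLMs) has empowered embodied agents to accomplish simple multimodal planning tasks, but not long-horizon ones requiring long sequences of actions. In text-only simulations, long-horizon planning has seen significant improvement brought by repositioning the role of LLMs. Instead of directly generating action sequences, LLMs translate the planning domain and problem into a formal planning language like the Planning Domain Definition Language (PDDL), which can call a formal solver to derive the plan in a verifiable manner. In multimodal environments, research on VLM-as-formalizer remains scarce, usually involving gross simplifications such as predefined object vocabulary or overly similar few-shot examples. In this work, we present a suite of five VLM-as-formalizer pipelines that tackle one-shot, open-vocabulary, and multimodal PDDL formalization. We evaluate those on an existing benchmark while presenting another two that for the first time account for planning with authentic, multi-view, and low-quality images. We conclude that VLM-as-formalizer greatly outperforms end-to-end plan generation. We reveal the bottleneck to be vision rather than language, as VLMs often fail to capture an exhaustive set of necessary object relations.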
While generating intermediate, textual representations such as captions or scene graphs partially compensates for the performance, their inconsistent gain leaves headroom for future research directions on multimodal planning formalization.
[17] "Be My Cheese?": Assessing Cultural Nuance in Multilingual LLM Translations
Madison Van Doren,Cory Holland
Main category: cs.CL
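TL;DR: Examines how well state-of-the-art multilingual AI models localise figurative English (such as idioms and puns) into a range of global languages, emphasising cultural appropriateness and overall localisation quality rather than grammatical accuracy alone. Human evaluation of 87 LLM-generated translations of e-commerce marketing emails across 24 regional dialects of 20 languages shows that leading models produce grammatically correct output but handle culturally nuanced expressions poorly; even high-resource languages often mistranslate figurative language and wordplay. The study argues that data volume is not the sole determinant of translation quality, proposes cultural appropriateness as a key measure of multilingual LLM performance, and calls for larger-scale research to improve machine translation in real localisation use.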
Details
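Motivation: Existing LLM translation research and industry benchmarks emphasise grammatical accuracy and token-level correctness while neglecting cultural appropriateness and localisation quality, which are critical in real applications such as marketing and e-commerce. The study therefore examines how current multilingual AI models actually handle figurative language and assesses their ability to communicate across cultures.
Method: A sample of 87 LLM-generated translations of e-commerce marketing emails, covering 24 regional dialects of 20 languages, was rated by human reviewers fluent in each target language. Reviewers gave quantitative ratings and qualitative feedback on tone, faithfulness to the original meaning, and fit for the intended audience, with a focus on idioms, puns, and other culturally bound expressions.
Result: Leading LLMs produce grammatically correct translations but generally fall short on culturally bound figurative language, often failing to convey the original meaning or losing the rhetorical effect, so substantial human revision is needed. Even high-resource languages that rank highly on industry benchmarks frequently mistranslate puns and idioms.
Conclusion: Data volume is not the most reliable predictor of machine translation quality; cultural appropriateness is a key dimension of multilingual LLM performance that current academic and industry benchmarks under-address. The study supports larger-scale localisation research and recommends combining machine translation with human refinement in deployment to improve cross-cultural communication.
Abstract: This pilot study explores the localisation capabilities of state-of-the-art multilingual AI models when translating figurative language, such as idioms and puns, from English into a diverse range of global languages. It expands on existing LLM translation research and industry benchmarks, which emphasise grammatical accuracy and token-level correctness, by focusing on cultural appropriateness and overall localisation quality - critical factors for real-world applications like marketing and e-commerce. To investigate these challenges, this project evaluated a sample of 87 LLM-generated translations of e-commerce marketing emails across 24 regional dialects of 20 languages. Human reviewers fluent in each target language provided quantitative ratings and qualitative feedback on faithfulness to the original's tone, meaning, and intended audience. Findings suggest that, while leading models generally produce grammatically correct translations, culturally nuanced language remains a clear area for improvement, often requiring substantial human refinement. Notably, even high-resource global languages, despite topping industry benchmark leaderboards, frequently mistranslated figurative expressions and wordplay.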
This work challenges the assumption that data volume is the most reliable predictor of machine translation quality and introduces cultural appropriateness as a key determinant of multilingual LLM performance - an area currently underexplored in existing academic and industry benchmarks. As a proof of concept, this pilot highlights limitations of current multilingual AI systems for real-world localisation use cases. Results of this pilot support the opportunity for expanded research at greater scale to deliver generalisable insights and inform deployment of reliable machine translation workflows in culturally diverse contexts.
[18] Multi-Objective Reinforcement Learning for Large Language Model Optimization: Visionary Perspective
Lingxiao Kong,Cong Yang,Oya Deniz Beyan,Zeyd Boukhers
Main category: cs.CL
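TL;DR: This paper proposes a taxonomy of multi-objective reinforcement learning (MORL) for large language model (LLM) optimization, discusses the strengths and limitations of different MORL methods, argues that efficient and flexible approaches are needed to handle personalization and complexity in LLMs, and envisions a MORL benchmarking framework that can assess the effects of different objective relationships. Future directions focus on meta-policy MORL, which aims to improve efficiency and flexibility through a bi-level learning paradigm.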
Details
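Motivation: Optimizing multiple objectives simultaneously in LLMs is challenging; existing MORL methods struggle to balance efficiency, flexibility, and personalization, so a systematic taxonomy and evaluation framework is needed. Method: The paper proposes a MORL taxonomy, analyzes how well each class of MORL methods suits LLM optimization, and envisions a MORL benchmarking framework; it highlights the meta-policy, bi-level learning paradigm as a future direction. Result: It clarifies the advantages and drawbacks of current MORL methods for LLMs, presents a vision for a benchmark that supports evaluation across diverse objective relationships, and points out the potential of meta-policy MORL to improve efficiency and flexibility. Conclusion: Meta-policy MORL with a bi-level learning paradigm is a key direction for improving efficiency and flexibility in multi-objective LLM optimization, and a unified benchmarking framework is essential for advancing the field. Abstract: Multi-Objective Reinforcement Learning (MORL) presents significant challenges and opportunities for optimizing multiple objectives in Large Language Models (LLMs). We introduce a MORL taxonomy and examine the advantages and limitations of various MORL methods when applied to LLM optimization, identifying the need for efficient and flexible approaches that accommodate personalization functionality and inherent complexities in LLMs and RL. We propose a vision for a MORL benchmarking framework that addresses the effects of different methods on diverse objective relationships. As future research directions, we focus on meta-policy MORL development that can improve efficiency and flexibility through its bi-level learning paradigm, highlighting key research questions and potential solutions for improving LLM performance.
[19] OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule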
Yuxuan Zhu,David H. Yang,Mohammad Mohammadi Amiri,Keerthiram Murugesan,Tejaswini Pedapati,Pin-Yu Chen
Main category: cs.CL
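TL;DR: This paper proposes OjaKV, a KV-cache compression framework that combines a hybrid storage policy with online subspace adaptation to ease the memory bottleneck of long-context generation in large models.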
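As a rough illustration of the two quantitative ideas in this entry, the sketch below first reproduces the KV-cache size arithmetic and then runs a single-component Oja update. The Llama-3.1-8B shape values (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) are assumptions taken from the publicly documented model configuration, not from the abstract, and all function names are hypothetical. OjaKV itself adapts a multi-dimensional projection basis; this toy version only tracks the top principal direction of 2-D data.

```python
import math
import random


def kv_cache_bytes(layers, kv_heads, head_dim, tokens, batch, bytes_per_value=2):
    """Bytes needed to store keys and values for every layer, token, and batch item."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens * batch


def oja_step(w, x, lr):
    """One Oja update, w <- w + lr * y * (x - y * w) with y = w . x, then renormalize."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


def track_top_direction(samples, lr=0.005):
    """Stream samples one at a time and return the estimated top principal direction."""
    w = [0.6, 0.8]  # arbitrary unit-norm starting guess
    for x in samples:
        w = oja_step(w, x, lr)
    return w


# Toy stream: most variance lies along the first axis (std 3.0 vs 0.3).
rng = random.Random(0)
samples = [(rng.gauss(0.0, 3.0), rng.gauss(0.0, 0.3)) for _ in range(5000)]
w = track_top_direction(samples)

# Assumed Llama-3.1-8B shape, 32K-token prompt, batch size 4, 16-bit values.
cache_gib = kv_cache_bytes(32, 8, 128, 32 * 1024, 4) / 2**30
```

Under these assumed shape values the cache works out to 16 GiB, in line with the "approximately 16GB" figure quoted in the abstract. The estimated direction `w` ends up close to the first axis (up to sign), which is the behaviour an online PCA update is meant to provide without storing past samples.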
Details
Motivation: Existing KV-cache compression methods based on static low-rank projection degrade noticeably under data distribution shift, and in long-context settings the KV cache can take more memory than the model weights, so a more robust, dynamic compression scheme is needed. Method: OjaKV uses a hybrid storage policy: it keeps the full KV states of the first and most recent tokens, and compresses intermediate tokens to low rank using online principal component analysis based on Oja's algorithm. The projection basis receives a comprehensive update during prefill and lightweight periodic updates during decoding, keeping the subspace aligned with the context. Result: Experiments show that OjaKV maintains or even improves zero-shot accuracy at high compression ratios, performs best on long-context benchmarks that require complex reasoning, and is compatible with modern attention modules such as FlashAttention. Conclusion: OjaKV is a plug-and-play solution for efficient long-context inference that requires no fine-tuning; online subspace adaptation markedly improves the robustness and practicality of KV-cache compression. Abstract: The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model's weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. Second, for the vast majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis. This adaptation involves a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments demonstrate that OjaKV maintains or even improves zero-shot accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning.
[20] Towards Transparent AI: A Survey on Explainable Language Models
Avash Palikhe,Zichong Wang,Zhipeng Yin,Rui Guo,Qiang Duan,Jie Yang,Wenbin Zhang
Main category: cs.CL
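TL;DR: This survey reviews explainable AI (XAI) techniques for language models (LMs), systematically categorizing existing methods by Transformer architecture (encoder-only, decoder-only, encoder-decoder), evaluating their effectiveness along plausibility and faithfulness, and identifying current challenges and future directions.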
Details
Motivation: Despite major advances across many domains, language models remain black boxes, and the resulting lack of interpretability undermines trust and accountability, especially in high-stakes applications. Existing XAI methods face challenges such as complex architectures and massive training corpora when applied to LMs, and prior surveys do not adequately reflect the distinct problems raised by architectural diversity. Method: The paper conducts a systematic survey, organizing XAI techniques by the three main Transformer architectures of LMs, analyzing how methods are adapted to each, and evaluating them along two key dimensions: plausibility and faithfulness. Result: The paper establishes a taxonomy of XAI methods for language models, reveals the strengths and limitations of XAI techniques under each architecture, identifies gaps in current evaluation criteria, and summarizes a set of open research questions. Conclusion: Advancing the transparency and interpretability of language models requires more robust, architecture-specific XAI methods; future work should strengthen faithfulness evaluation, cross-architecture generalization, and validation in real application scenarios. Abstract: Language Models (LMs) have significantly advanced natural language processing and enabled remarkable progress across diverse domains, yet their black-box nature raises critical concerns about the interpretability of their internal mechanisms and decision-making processes. This lack of transparency is particularly problematic for adoption in high-stakes domains, where stakeholders need to understand the rationale behind model outputs to ensure accountability. On the other hand, while explainable artificial intelligence (XAI) methods have been well studied for non-LMs, they face many limitations when applied to LMs due to their complex architectures, considerable training corpora, and broad generalization abilities. Although various surveys have examined XAI in the context of LMs, they often fail to capture the distinct challenges arising from the architectural diversity and evolving capabilities of these models. To bridge this gap, this survey presents a comprehensive review of XAI techniques with a particular emphasis on LMs, organizing them according to their underlying transformer architectures: encoder-only, decoder-only, and encoder-decoder, and analyzing how methods are adapted to each while assessing their respective strengths and limitations. Furthermore, we evaluate these techniques through the dual lenses of plausibility and faithfulness, offering a structured perspective on their effectiveness. Finally, we identify open research challenges and outline promising future directions, aiming to guide ongoing efforts toward the development of robust, transparent, and interpretable XAI methods for LMs.
[21] ReviewScore: Misinformed Peer Review Detection with Large Language Models
Hyun Ryu,Doohyuk Jang,Hyemin S. Lee,Joonhyun Jeong,Gyeongman Kim,Donghyeon Cho,Gyouk Chu,Minyeong Hwang,Hyeongwon Jang,Changhun Kim,Haechan Kim,Jina Kim,Joowon Kim,Yoonjeon Kim,Kwanhyung Lee,Chanjae Park,Heecheol Yun,Gregor Betz,Eunho Yang
Main category: cs.CL
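TL;DR: This paper proposes ReviewScore, a metric for automatically detecting low-quality peer review points at AI conferences. It defines misinformed review points as "weaknesses" that rest on incorrect premises or "questions" already answered by the paper, and shows that LLMs evaluating factuality at the premise level reach moderate agreement with humans.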
Details
Motivation: As submissions to AI venues surge, peer review quality is declining, and a reliable way to automatically identify low-quality review points is needed. Method: The paper proposes ReviewScore, builds an automated engine that reconstructs the explicit and implicit premises of each weakness, and evaluates LLMs' factuality judgments on a human expert-annotated dataset. Result: 15.2% of weaknesses and 26.4% of questions are misinformed; eight state-of-the-art LLMs reach moderate human-model agreement, and premise-level factuality evaluation yields significantly higher agreement than weakness-level evaluation. Conclusion: Premise-level factuality evaluation is more precise and, together with a thorough disagreement analysis, suggests that ReviewScore evaluation could be fully automated, helping improve peer review quality. Abstract: Peer review serves as a backbone of academic research, but in most AI conferences, the review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises, or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore indicating if a review point is misinformed. To evaluate the factuality of each premise of weaknesses, we propose an automated engine that reconstructs every explicit and implicit premise from a weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreements on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreements. We also prove that evaluating premise-level factuality shows significantly higher agreements than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
[22] GRAB: A Risk Taxonomy--Grounded Benchmark for Unsupervised Topic Discovery in Financial Disclosures
Ying Li,Tiejun Ma
Main category: cs.CL
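TL;DR: This paper presents GRAB, a finance-specific benchmark for risk categorization in 10-K risk disclosures. Labels are generated at scale without manual annotation, supporting reproducible, standardized evaluation of a range of topic models.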
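Among the metrics this entry lists, the entropy-based Effective Number of Topics is the least self-explanatory. A minimal sketch follows, assuming the common definition as the exponential of the Shannon entropy of the topic-assignment distribution; the benchmark's exact formulation is not given in the abstract, and the function name is hypothetical.

```python
import math
from collections import Counter


def effective_number_of_topics(assignments):
    """exp(Shannon entropy) of the empirical distribution over assigned topics."""
    counts = Counter(assignments)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


# Four topics used equally often versus four topics with one dominating.
balanced = effective_number_of_topics(["a", "b", "c", "d"] * 5)
skewed = effective_number_of_topics(["a"] * 97 + ["b", "c", "d"])
```

Under this definition a model that spreads sentences evenly over four topics scores about 4, while one that puts 97% of sentences into a single topic scores only slightly above 1, even though both nominally use four topics.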
Details
Motivation: Research on risk categorization lacks a public benchmark for evaluating unsupervised topic models on financial risk disclosures, so a standardized evaluation framework dedicated to this task is needed. Method: Sentence-level labels are generated automatically by combining FinBERT token attention, YAKE keyphrase signals, and taxonomy-aware term matching, anchored in a financial risk taxonomy that maps 193 terms to 21 fine-grained types. Evaluation is unified through fixed dataset splits and robust metrics (Accuracy, Macro-F1, Topic BERTScore, and the Effective Number of Topics). Result: The GRAB benchmark contains 1.61M sentences from 8,247 filings, achieves label generation at scale without manual annotation, and supports systematic comparison of classical, embedding-based, neural, and hybrid topic models. Conclusion: GRAB provides the first large-scale, reproducible evaluation benchmark for risk categorization in financial text and promotes standardized research on and application of unsupervised topic models in this domain. Abstract: Risk categorization in 10-K risk disclosures matters for oversight and investment, yet no public benchmark evaluates unsupervised topic models for this task. We present GRAB, a finance-specific benchmark with 1.61M sentences from 8,247 filings and span-grounded sentence labels produced without manual annotation by combining FinBERT token attention, YAKE keyphrase signals, and taxonomy-aware collocation matching. Labels are anchored in a risk taxonomy mapping 193 terms to 21 fine-grained types nested under five macro classes; the 21 types guide weak supervision, while evaluation is reported at the macro level. GRAB unifies evaluation with fixed dataset splits and robust metrics--Accuracy, Macro-F1, Topic BERTScore, and the entropy-based Effective Number of Topics. The dataset, labels, and code enable reproducible, standardized comparison across classical, embedding-based, neural, and hybrid topic models on financial disclosures.
[23] Think-on-Graph 3.0: Efficient and Adaptive LLM Reasoning on Heterogeneous Graphs via Multi-Agent Dual-Evolving Context Retrieval
Xiaojun Wu,Cehao Yang,Xueyuan Lin,Chengjin Xu,Xuhui Jiang,Yuanliang Sun,Hui Xiong,Jia Li,Jian Guo
Main category: cs.CL
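TL;DR: This paper proposes Think-on-Graph 3.0 (ToG-3), a retrieval-augmented generation framework based on multi-agent context evolution. By dynamically building and refining a heterogeneous graph index, it evolves both the query and the sub-graph, substantially improving precise reasoning with lightweight LLMs.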
Details
Motivation: Existing graph-enhanced RAG methods depend on high-quality graph structures, but manual construction is costly and automatic extraction is limited by LLM performance, especially with small models, making it hard to achieve both efficiency and accuracy. Method: The paper proposes the MACER mechanism, in which a multi-agent system (Constructor, Retriever, Reflector, Responser) reasons iteratively and dynamically builds a Chunk-Triplets-Community heterogeneous graph index, evolving both the query and the sub-graph. Result: Experiments show that ToG-3 outperforms baselines on both deep and broad reasoning benchmarks, and ablation studies confirm the effectiveness of each MACER component. Conclusion: Through dynamic graph index construction and multi-agent collaboration, ToG-3 overcomes the limitations of static graph RAG and enables efficient, precise reasoning with lightweight LLMs, giving it strong practical value. Abstract: Retrieval-Augmented Generation (RAG) and Graph-based RAG have become an important paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing approaches face a fundamental trade-off. While graph-based methods are inherently dependent on high-quality graph structures, they face significant practical constraints: manually constructed knowledge graphs are prohibitively expensive to scale, while automatically extracted graphs from corpora are limited by the performance of the underlying LLM extractors, especially when using smaller, locally deployed models. This paper presents Think-on-Graph 3.0 (ToG-3), a novel framework that introduces a Multi-Agent Context Evolution and Retrieval (MACER) mechanism to overcome these limitations. Our core innovation is the dynamic construction and refinement of a Chunk-Triplets-Community heterogeneous graph index, which pioneers a dual-evolution mechanism of Evolving Query and Evolving Sub-Graph for precise evidence retrieval. This approach addresses a critical limitation of prior Graph-based RAG methods, which typically construct a static graph index in a single pass without adapting to the actual query. A multi-agent system, comprising Constructor, Retriever, Reflector, and Responser agents, collaboratively engages in an iterative process of evidence retrieval, answer generation, sufficiency reflection, and, crucially, evolving query and subgraph. This dual-evolving multi-agent system allows ToG-3 to adaptively build a targeted graph index during reasoning, mitigating the inherent drawbacks of static, one-time graph construction and enabling deep, precise reasoning even with lightweight LLMs. Extensive experiments demonstrate that ToG-3 outperforms compared baselines on both deep and broad reasoning benchmarks, and ablation studies confirm the efficacy of the components of the MACER framework.
[24] ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation
Jiho Kim,Junseong Choi,Woosog Chay,Daeun Kyung,Yeonsu Kwon,Yohan Jo,Edward Choi
Main category: cs.CL
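TL;DR: This paper introduces ProPerSim, a task and simulation framework, and ProPerAssistant, a system built on it, with the goal of enabling large language models to make proactive, personalized recommendations in home scenarios.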
Details
Motivation: As large language models become more integrated into daily life, demand is growing for AI assistants that are not only reactive but also proactive and personalized. However, the combination of proactivity and personalization remains underexplored. Method: The paper builds a simulation environment, ProPerSim, in which a user agent with a rich persona interacts with the assistant and rates its suggestions. On top of it, the paper proposes ProPerAssistant, an assistant that continually learns from user feedback through retrieval augmentation and preference alignment. Result: Experiments across 32 different personas show that ProPerAssistant gradually adapts its strategy and steadily improves user satisfaction. Conclusion: Combining proactivity with personalization helps AI assistants perform better in realistic scenarios and is a promising direction for future development. Abstract: As large language models (LLMs) become increasingly integrated into daily life, there is growing demand for AI assistants that are not only reactive but also proactive and personalized. While recent advances have pushed forward proactivity and personalization individually, their combination remains underexplored. To bridge this gap, we introduce ProPerSim, a new task and simulation framework for developing assistants capable of making timely, personalized recommendations in realistic home scenarios. In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context. The assistant's goal is to use these ratings to learn and adapt to achieve higher scores over time. Built on ProPerSim, we propose ProPerAssistant, a retrieval-augmented, preference-aligned assistant that continually learns and adapts through user feedback. Experiments across 32 diverse personas show that ProPerAssistant adapts its strategy and steadily improves user satisfaction, highlighting the promise of uniting proactivity and personalization.
[25] How Accurate Are LLMs at Multi-Question Answering on Conversational Transcripts?
Xiliang Zhu,Shi Zong,David Rossouw
Main category: cs.CL
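TL;DR: This study examines how well large language models answer multiple questions over a long context and finds that fine-tuned open models can surpass GPT-4o in accuracy, showing potential for low-cost deployment.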
Details
Motivation: In industrial settings, using large models to answer multiple questions over a long context incurs high computational cost and latency. Method: Through extensive experiments, the paper benchmarks a range of proprietary and open models on answering multiple questions over the same context. Result: Fine-tuned open models with up to 8 billion parameters can exceed GPT-4o in accuracy. Conclusion: Fine-tuned small and mid-sized open models offer high performance at low cost for multi-question, long-context QA and are well suited to practical deployment. Abstract: Deploying Large Language Models (LLMs) for question answering (QA) over lengthy contexts is a significant challenge. In industrial settings, this process is often hindered by high computational costs and latency, especially when multiple questions must be answered based on the same context. In this work, we explore the capabilities of LLMs to answer multiple questions based on the same conversational context. We conduct extensive experiments and benchmark a range of both proprietary and public models on this challenging task. Our findings highlight that while strong proprietary LLMs like GPT-4o achieve the best overall performance, fine-tuned public LLMs with up to 8 billion parameters can surpass GPT-4o in accuracy, which demonstrates their potential for transparent and cost-effective deployment in real-world applications.
[26] Self-Speculative Biased Decoding for Faster Live Translation
Linxiao Zeng,Haoyun Deng,Kangyuan Shu,Shizhen Wang
Main category: cs.CL
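TL;DR: The paper proposes Self-Speculative Biased Decoding, a new inference paradigm that accelerates LLM generation in streaming applications without extra draft computation, achieving up to 1.7x speedup and an 80% reduction in output flickering.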
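The draft-reuse idea in this entry can be sketched with a toy example: the previous output is verified token by token against the model's scores for the grown input, with a bias added to the draft token, and ordinary greedy decoding resumes at the first divergence. Everything here is an illustrative assumption: `toy_logits` is a hypothetical stand-in for an LLM, the bias value is arbitrary, and a real system would verify the whole draft in one parallel forward pass instead of a Python loop.

```python
def toy_logits(source, prefix):
    """Hypothetical stand-in for an LLM: scores candidate next tokens."""
    i = len(prefix)
    if i >= len(source):
        return {"<eos>": 1.0}
    word = source[i]
    # The toy model slightly prefers the upper-case form over the title-case form.
    return {word.upper(): 1.0, word.title(): 0.8}


def biased_verify(source, draft, bias):
    """Accept draft tokens for as long as the biased argmax agrees with the draft."""
    accepted = []
    for token in draft:
        logits = dict(toy_logits(source, accepted))
        if token in logits:
            logits[token] += bias  # bias verification towards the draft token
        if max(logits, key=logits.get) != token:
            break
        accepted.append(token)
    return accepted


def decode(source, draft, bias=0.0):
    """Reuse the verified draft prefix, then decode greedily from the divergence point."""
    output = biased_verify(source, draft, bias)
    reused = len(output)
    while True:
        logits = toy_logits(source, output)
        token = max(logits, key=logits.get)
        if token == "<eos>":
            return output, reused
        output.append(token)


previous_output = ["Hello", "WORLD"]   # output produced for the shorter input
source = ["hello", "world", "again"]   # the input stream has since grown
unbiased = decode(source, previous_output, bias=0.0)
biased = decode(source, previous_output, bias=0.5)
```

Without the bias, the first draft token is rejected and the whole output is regenerated, so the displayed text changes from "Hello" to "HELLO". With the bias, both draft tokens are kept and only the new token is decoded, which is the mechanism the abstract credits for both fewer flickers and higher speedups.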
Details
Motivation: In streaming-input scenarios such as live translation, existing LLMs must repeatedly regenerate output from scratch, causing high latency and frequent flickering that make real-time requirements hard to meet. Method: The most recent output is used as a draft for the current, extended input; verification is biased towards the draft tokens to raise the acceptance rate, decoding continues only from the point of divergence, and the mask-k technique further reduces display flickering. Result: On simultaneous text-to-text re-translation, the method achieves up to 1.7x speedup over conventional auto-regressive re-translation without loss of quality, and reduces output flickering by 80%. Conclusion: The method is a plug-and-play acceleration scheme that needs no additional draft model, applies to latency-sensitive streaming generation tasks, and balances efficiency with stability. Abstract: Large Language Models (LLMs) have recently demonstrated impressive capabilities in various text generation tasks. However, it remains challenging to use them off-the-shelf in streaming applications (such as live translation), where the output must continually update as the input context expands, while still maintaining a reasonable computational cost to meet the latency requirement. In this work, we reexamine the re-translation approach to simultaneous translation and propose Self-Speculative Biased Decoding, a novel inference paradigm designed to avoid repeatedly generating output from scratch for a consistently growing input stream. We propose using the most recent output as a draft for the current growing input context. During the verification stage, the output will be biased towards the draft token for a higher draft acceptance rate. This strategy not only minimizes flickering that might distract users but also leads to higher speedups. Conventional decoding may take charge from the point of divergence after draft verification and continue until the end condition is met. Unlike existing speculative decoding strategies, our approach eliminates the need for draft computations, making it a model-agnostic and plug-and-play solution for accelerating latency-sensitive streaming applications. Experimental results on simultaneous text-to-text re-translation demonstrate that our approach achieves up to 1.7x speedup compared to conventional auto-regressive re-translation without compromising quality. Additionally, it significantly reduces flickering by 80% by incorporating the display-only mask-k technique.
[27] Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models
Zhen Xiong,Yujun Cai,Zhecheng Li,Junsong Yuan,Yiwei Wang
Main category: cs.CL
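TL;DR: This paper proposes the Thinking-with-Sound (TwS) framework, which combines linguistic reasoning with on-the-fly audio-domain analysis to improve the robustness and reasoning ability of large audio-language models in complex acoustic scenarios.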
Details
Motivation: Existing large audio-language models perform poorly on audio reasoning tasks in complex acoustic environments and lack access to acoustic tools such as noise suppression and source separation. Method: The paper proposes the TwS framework, which introduces Audio Chain-of-Thought (Audio CoT) so that models can actively perform numerical analysis and digital manipulation of audio signals, and builds MELD-Hard1k, a benchmark with a variety of acoustic perturbations, for evaluation. Result: Existing models lose more than 50% accuracy on MELD-Hard1k; with TwS, small models gain 24.73% absolute accuracy and larger models gain up to 36.61%. Conclusion: Audio CoT significantly improves robustness without retraining and opens a new direction for building stronger audio understanding systems. Abstract: Recent Large Audio-Language Models (LALMs) have shown strong performance on various audio understanding tasks such as speech translation and Audio Q&A. However, they exhibit significant limitations on challenging audio reasoning tasks in complex acoustic scenarios. These situations would greatly benefit from the use of acoustic tools like noise suppression, source separation, and precise temporal alignment, but current LALMs lack access to such tools. To address this limitation, we introduce Thinking-with-Sound (TwS), a framework that equips LALMs with Audio CoT by combining linguistic reasoning with on-the-fly audio-domain analysis. Unlike existing approaches that treat audio as static input, TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. To evaluate this approach, we construct MELD-Hard1k, a new robustness benchmark created by introducing various acoustic perturbations. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than 50% compared to clean audio. TwS achieves substantial improvements in robustness, demonstrating both effectiveness and scalability: small models gain 24.73% absolute accuracy, with improvements scaling consistently up to 36.61% for larger models. Our findings demonstrate that Audio CoT can significantly enhance robustness without retraining, opening new directions for developing more robust audio understanding systems.
[28] SynerGen: Contextualized Generative Recommender for Unified Search and Recommendation
Vianne R. Gao,Chen Xue,Marc Versage,Xie Zhou,Zhongruo Wang,Chao Li,Yeon Seonwoo,Nan Chen,Zhen Ge,Gourab Kundu,Weiqi Zhang,Tian Wang,Qingjun Cui,Trishul Chilimbi
Main category: cs.CL
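TL;DR: The paper proposes SynerGen, a unified recommendation and search framework based on generative sequence modeling that improves both retrieval and ranking through joint optimization.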
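The retrieval objective named in this entry, InfoNCE, can be illustrated with a minimal single-example sketch. It assumes dot-product similarity, a fixed temperature, and explicitly supplied negatives; how SynerGen forms its embeddings, samples negatives, or combines this term with the ranking loss is not specified in the abstract, and the function name is hypothetical.

```python
import math


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE for one query: -log softmax score of the positive among all candidates."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[0]


query = [1.0, 0.0]
# Positive item aligned with the query: loss close to zero.
aligned = info_nce(query, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Positive item opposite to the query while a negative matches it: large loss.
misaligned = info_nce(query, [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when the positive item scores above every negative and grows when a negative outscores it, which is what pushes the model to separate relevant items from the rest of the candidate pool.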
Details
Motivation: Existing generative models typically support either personalized search or query-free recommendation and struggle to perform well on both, while the traditional retrieve-then-rank architecture suffers from mis-calibration and engineering overhead. Method: SynerGen uses a decoder-only Transformer, optimizes retrieval with InfoNCE, uses a hybrid pointwise-pairwise loss for ranking, and introduces a time-aware rotary positional embedding to incorporate time information. Result: It significantly outperforms strong baselines on multiple recommendation and search benchmarks, supporting the viability of a single generative model for industrial-scale unified information access. Conclusion: SynerGen handles personalized search and recommendation in a unified way, performs well on both retrieval and ranking, and demonstrates the potential of generative foundation models in large-scale recommender systems. Abstract: The dominant retrieve-then-rank pipeline in large-scale recommender systems suffers from mis-calibration and engineering overhead due to its architectural split and differing optimization objectives. While recent generative sequence models have shown promise in unifying retrieval and ranking by auto-regressively generating ranked items, existing solutions typically address either personalized search or query-free recommendation, often exhibiting performance trade-offs when attempting to unify both. We introduce SynerGen, a novel generative recommender model that bridges this critical gap by providing a single generative backbone for both personalized search and recommendation, while simultaneously excelling at retrieval and ranking tasks. Trained on behavioral sequences, our decoder-only Transformer leverages joint optimization with InfoNCE for retrieval and a hybrid pointwise-pairwise loss for ranking, allowing semantic signals from search to improve recommendation and vice versa. We also propose a novel time-aware rotary positional embedding to effectively incorporate time information into the attention mechanism. SynerGen achieves significant improvements on widely adopted recommendation and search benchmarks compared to strong generative recommender and joint search and recommendation baselines. This work demonstrates the viability of a single generative foundation model for industrial-scale unified information access.
[29] Navigating the Impact of Structured Output Format on Large Language Models through the Compass of Causal Inference
Han Yuan,Yue Zhao,Li Zhang,Wuqiong Luo,Zheng Ma
Main category: cs.CL
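TL;DR: This paper uses causal inference to re-evaluate the effect of structured output on LLM generation quality and finds no causal impact in most cases; the few significant cases are tied to the design of specific instructions.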
Details
Motivation: Prior studies reach inconsistent conclusions about the effect of structured output and suffer from limited testing scenarios, weakly controlled comparisons, and reliance on coarse metrics, so a more rigorous analysis is needed. Method: Using causal inference, the paper derives five possible causal structures and validates them on eight reasoning tasks. Result: Of 48 scenarios, 43 show no causal impact of structured output; of the remaining 5, 3 involve multifaceted causal structures influenced by concrete instructions. Conclusion: Structured output itself mostly has no direct causal impact on LLM generation quality; its effect depends more on how the instruction is designed. Abstract: Structured output from large language models (LLMs) has enhanced efficiency in processing generated information and is increasingly adopted in industrial applications. Prior studies have investigated the impact of structured output on LLMs' generation quality, often presenting one-way findings. Some suggest that structured format enhances completeness and factual accuracy, while others argue that it restricts the reasoning capacity of LLMs and leads to reductions in standard evaluation metrics. Potential limitations of these assessments include restricted testing scenarios, weakly controlled comparative settings, and reliance on coarse metrics. In this work, we present a refined analysis using causal inference. Based on one assumed and two guaranteed constraints, we derive five potential causal structures characterizing the influence of structured output on LLMs' generation: (1) collider without m-bias, (2) collider with m-bias, (3) single cause from instruction, (4) single cause from output format, and (5) independence. Across seven public and one developed reasoning tasks, we find that coarse metrics report positive, negative, or neutral effects of structured output on GPT-4o's generation. However, causal inference reveals no causal impact in 43 out of 48 scenarios. In the remaining 5, 3 involve multifaceted causal structures influenced by concrete instructions.
[30] Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment
Hongbin Zhang,Kehai Chen,Xuefeng Bai,Yang Xiang,Min Zhang
Main category: cs.CL
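TL;DR: This paper proposes the Cultural Awareness Reward modeling Benchmark (CARB) for evaluating reward models across cultures, exposes the shortcomings of existing models in modeling cultural awareness, and proposes Think-as-Locals, which uses reinforcement learning from verifiable rewards (RLVR) to improve cultural awareness.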
Details
Motivation: Existing reward model evaluations do not adequately assess cultural awareness, and cross-cultural evaluation datasets are scarce, which hinders the cultural alignment of large models worldwide. Method: The paper builds the CARB benchmark covering 10 cultures and 4 cultural domains; shows empirically that models rely on spurious correlations with surface-level features; and proposes Think-as-Locals, which uses the RLVR framework to elicit culturally grounded reasoning from generative reward models, with well-designed rewards that encourage structured evaluation criteria. Result: Experiments show that CARB effectively exposes gaps in cultural awareness and that performance on it correlates positively with downstream multilingual cultural alignment tasks; Think-as-Locals markedly reduces interference from surface features and improves both the accuracy of culture-related judgments and the quality of evaluation criteria. Conclusion: Cultural awareness is essential in reward models but is currently modeled poorly; CARB offers an effective benchmark, and Think-as-Locals with RLVR offers a feasible path towards more culturally sensitive reward models. Abstract: Reward models (RMs) are crucial for aligning large language models (LLMs) with diverse cultures. Consequently, evaluating their cultural awareness is essential for further advancing global alignment of LLMs. However, existing RM evaluations fall short in assessing cultural awareness due to the scarcity of culturally relevant evaluation datasets. To fill this gap, we propose Cultural Awareness Reward modeling Benchmark (CARB), covering 10 distinct cultures across 4 cultural domains. Our extensive evaluation of state-of-the-art RMs reveals their deficiencies in modeling cultural awareness and demonstrates a positive correlation between performance on CARB and downstream multilingual cultural alignment tasks. Further analysis identifies the spurious correlations within culture-aware reward modeling, wherein RM's scoring relies predominantly on surface-level features rather than authentic cultural nuance understanding. To address these issues, we propose Think-as-Locals to elicit deeper culturally grounded reasoning from generative RMs via reinforcement learning from verifiable rewards (RLVR) and employ well-designed rewards to ensure accurate preference judgments and high-quality structured evaluation criteria generation. Experimental results validate its efficacy in mitigating spurious feature interference and advancing culture-aware reward modeling.
[31] Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies
Qianen Zhang,Satoshi Nakamura
Main category: cs.CL
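TL;DR: This paper proposes an enhanced simultaneous machine translation (SiMT) framework that extends the traditional READ/WRITE operations with four adaptive actions (sentence cut, drop, partial summarization, and pronominalization), improving latency and translation quality while preserving semantic fidelity.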
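This entry reports latency as Average Lagging. As a reference point, here is a minimal sketch of the commonly used token-level definition; it is an assumption that this matches the paper's measurement, which goes through a latency-aware TTS pipeline and may be computed on speech timing instead. The function name and the example policies are illustrative.

```python
def average_lagging(delays, source_len):
    """Token-level Average Lagging.

    delays[t - 1] is the number of source tokens read before target token t is written.
    """
    target_len = len(delays)
    gamma = target_len / source_len  # target-to-source length ratio
    # tau: first target position written once the full source has been read.
    tau = next(
        (t for t, g in enumerate(delays, start=1) if g >= source_len),
        target_len,
    )
    return sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau


# A wait-2 policy on a 4-token source: read 2 tokens, then alternate write/read.
wait_2 = average_lagging([2, 3, 4, 4], source_len=4)
# Full-sentence translation: read everything before writing anything.
full_sentence = average_lagging([4, 4, 4, 4], source_len=4)
```

The wait-2 policy lags the speaker by 2 tokens on average and the full-sentence policy by 4, so lower values mean the output tracks the source more closely. Actions such as DROP and SENTENCE_CUT can lower this number because they let the system commit output earlier.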
Details
Motivation: Traditional SiMT methods are limited to READ/WRITE operations and struggle to deliver high-quality translation under strict real-time constraints, so a more flexible action space is needed to manage the trade-off between latency and semantic preservation. Method: Four new actions are introduced in a decoder-only LLM framework, with training references constructed through action-aware prompting; a latency-aware TTS pipeline is designed to evaluate both quality and latency. Result: On the ACL60/60 English-Chinese and English-German datasets, the method outperforms baselines on semantic metrics such as COMET-KIWI and lowers Average Lagging, with the combination of DROP and SENTENCE_CUT performing best. Conclusion: Extending the SiMT action space improves the balance between translation quality and latency and offers a feasible path towards narrowing the gap between human and machine simultaneous interpretation. Abstract: Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints, which traditional encoder-decoder policies with only READ/WRITE actions cannot fully address. We extend the action space of SiMT with four adaptive actions: SENTENCE_CUT, DROP, PARTIAL_SUMMARIZATION and PRONOMINALIZATION, which enable real-time restructuring, omission, and simplification while preserving semantic fidelity. We implement these actions in a decoder-only large language model (LLM) framework and construct training references through action-aware prompting. To evaluate both quality and latency, we further develop a latency-aware TTS pipeline that maps textual outputs to speech with realistic timing. Experiments on the ACL60/60 English-Chinese and English-German benchmarks show that our framework consistently improves semantic metrics (e.g., COMET-KIWI) and achieves lower delay (measured by Average Lagging) compared to reference translations and salami-based baselines. Notably, combining DROP and SENTENCE_CUT yields the best overall balance between fluency and latency. These results demonstrate that enriching the action space of LLM-based SiMT provides a promising direction for bridging the gap between human and machine interpretation.
[32] Towards Minimal Causal Representations for Human Multimodal Language Understanding
Menghua Jiang,Yuncheng Jiang,Haifeng Hu,Sijie Mai
Main category: cs.CL
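TL;DR: This paper proposes the Causal Multimodal Information Bottleneck (CaMIB), a model grounded in causal principles that targets the poor out-of-distribution generalization caused by dataset biases in conventional multimodal learning.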
Details
Motivation: Existing methods rely on maximizing mutual information between data and labels, which makes them vulnerable to non-causal statistical shortcuts in the dataset and leads to poor performance in out-of-distribution settings. Method: CaMIB first applies the information bottleneck to filter noise from unimodal inputs; a parameterized mask generator then disentangles the fused multimodal representation into causal and shortcut subrepresentations; an instrumental variable constraint enforces global consistency of the causal features, and backdoor adjustment by randomly recombining causal and shortcut features stabilizes causal estimation. Result: Experiments on multimodal sentiment analysis, humor detection, and sarcasm detection, including out-of-distribution test sets, show that CaMIB outperforms existing methods, with stronger generalization and interpretability. Conclusion: Through causal modeling, CaMIB separates causal features from statistical shortcuts and improves the robustness and out-of-distribution generalization of multimodal language understanding models. Abstract: Human Multimodal Language Understanding (MLU) aims to infer human intentions by integrating related cues from heterogeneous modalities. Existing works predominantly follow a "learning to attend" paradigm, which maximizes mutual information between data and labels to enhance predictive performance. However, such methods are vulnerable to unintended dataset biases, causing models to conflate statistical shortcuts with genuine causal features and resulting in degraded out-of-distribution (OOD) generalization. To alleviate this issue, we introduce a Causal Multimodal Information Bottleneck (CaMIB) model that leverages causal principles rather than traditional likelihood. Concretely, we first apply the information bottleneck to filter unimodal inputs, removing task-irrelevant noise. A parameterized mask generator then disentangles the fused multimodal representation into causal and shortcut subrepresentations. To ensure global consistency of causal features, we incorporate an instrumental variable constraint, and further adopt backdoor adjustment by randomly recombining causal and shortcut features to stabilize causal estimation. Extensive experiments on multimodal sentiment analysis, humor detection, and sarcasm detection, along with OOD test sets, demonstrate the effectiveness of CaMIB. Theoretical and empirical analyses further highlight its interpretability and soundness.
[33] Can LLMs Solve and Generate Linguistic Olympiad Puzzles?
Neh Majmudar,Elena Filatova
Main category: cs.CL
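TL;DR: This paper studies solving and generating linguistic puzzles. It extends an existing benchmark and evaluates large language models (such as OpenAI's o1) across puzzle types, finding that they outperform humans on most types but do worse on puzzles about writing systems and understudied languages. The solving results are then used to guide automatic puzzle generation, a task the authors see as a way to popularize linguistics.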
Details
Motivation: The aim is to spark interest in linguistics and widen its reach, in particular by automating puzzle generation to spread knowledge about rare and understudied languages. Method: The paper extends a benchmark for solving linguistic puzzles, evaluates several LLMs (including OpenAI o1) on puzzles across linguistic topics, and uses the solving results to guide puzzle generation. Result: LLMs outperform humans on most puzzle types but perform poorly on puzzles involving writing systems and understudied languages; insights from solving can be used to generate new puzzles. Conclusion: Automated linguistic puzzle generation is a promising research direction that can both popularize linguistics and help disseminate knowledge about rare and endangered languages. Abstract: In this paper, we introduce a combination of novel and exciting tasks: the solution and generation of linguistic puzzles. We focus on puzzles used in Linguistic Olympiads for high school students. We first extend the existing benchmark for the task of solving linguistic puzzles. We explore the use of Large Language Models (LLMs), including recent state-of-the-art models such as OpenAI's o1, for solving linguistic puzzles, analyzing their performance across various linguistic topics. We demonstrate that LLMs outperform humans on most puzzle types, except for those centered on writing systems, and for the understudied languages. We use the insights from puzzle-solving experiments to direct the novel task of puzzle generation. We believe that automating puzzle generation, even for relatively simple puzzles, holds promise for expanding interest in linguistics and introducing the field to a broader audience. This finding highlights the importance of linguistic puzzle generation as a research task: such puzzles can not only promote linguistics but also support the dissemination of knowledge about rare and understudied languages.
[34] ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
Zihan Lin,Xiaohan Wang,Jie Cao,Jiajun Chai,Guojun Yin,Wei Lin,Ran He
Main category: cs.CL
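TL;DR: The paper proposes ResT, an entropy-aware method that reshapes token-level policy gradients to optimize reinforcement learning for LLM tool-use tasks, markedly improving training stability and performance.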
Details
Motivation: Existing reinforcement learning methods for tool-use tasks rely only on sparse outcome rewards and ignore the specifics of the task, which inflates policy-gradient variance and makes training inefficient. Method: After establishing a theoretical link between policy entropy and training stability, the paper proposes ResT, which reweights tokens based on entropy and progressively upweights reasoning tokens during training, giving a smooth transition from structural correctness to semantic reasoning. Result: ResT reaches state-of-the-art results on BFCL and API-Bank, improving on prior methods by up to 8.76%; when fine-tuned on a 4B base LLM it surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn tasks. Conclusion: Entropy-aware token-level gradient reshaping stabilizes training on tool-use tasks and significantly improves performance, underscoring the value of structural priors and progressive learning in complex tasks. Abstract: Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and lacks consideration of the particularity of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training. To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability of tool-use tasks, which reveals that structured, low-entropy tokens are primary determinants of rewards. Motivated by this insight, we propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This entropy-aware scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks.
[35] Semantic Agreement Enables Efficient Open-Ended LLM Cascades
Duncan Soiffer,Steven Kolawole,Virginia Smith
Main category: cs.CL
TL;DR: Proposes semantic agreement as a training-free reliability signal for deferral decisions in cascaded LLM systems, reducing cost and latency.
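The deferral rule in this entry can be sketched in a few lines: query an ensemble of small models, measure how much their outputs agree, and call the large model only when agreement is low. The word-overlap `jaccard` function is a toy stand-in for a real semantic similarity measure, and the threshold and function names are hypothetical; this is a minimal sketch, not the authors' implementation.

```python
from itertools import combinations

def jaccard(a, b):
    """Toy similarity: word-set overlap. A real system would use a semantic
    similarity model; this stand-in keeps the sketch self-contained."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def agreement(outputs, similarity=jaccard):
    """Mean pairwise similarity among ensemble outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def cascade(prompt, small_models, large_model, threshold=0.6):
    """Answer with the small ensemble when its members agree, else defer."""
    outputs = [m(prompt) for m in small_models]
    if agreement(outputs) >= threshold:
        # Pick the output most similar to the rest as a simple consensus.
        best = max(outputs, key=lambda o: sum(jaccard(o, p) for p in outputs))
        return best, "small"
    return large_model(prompt), "large"
```

Because the rule only looks at generated text, it needs no logits or hidden states, which is why this kind of signal works with black-box APIs.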
Details
Motivation: Judging output reliability is hard in open-ended text generation, especially when quality is continuous and multiple valid answers exist.
Method: Uses semantic agreement between ensemble model outputs, rather than token-level confidence, as the reliability signal for building semantic cascades.
Result: Evaluated on models from 500M to 70B parameters, semantic cascades match or exceed target-model quality at 40% of the cost and cut latency by up to 60%.
Conclusion: The method needs no model internals, works with black-box APIs, and adapts to model updates, making it a practical baseline for real-world deployment.
Abstract: Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balance cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement -- meaning-level consensus between ensemble outputs -- as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluating on models from 500M to 70B parameters, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
[36] Following the TRACE: A Structured Path to Empathetic Response Generation with Multi-Agent Models
Ziqi Liu,Ziyang Zhou,Yilin Li,Haiyang Zhang,Yangbin Chen
Main category: cs.CL
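TL;DR: Proposes TRACE, a framework that models empathy through task decomposition, combining deep analysis with fluent generation, and significantly outperforms baselines in both automatic and LLM-based evaluations.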
Details
Motivation: Existing methods trade off the analytical depth of specialized models against the generative fluency of LLMs, making it hard to achieve high-quality empathetic responses.
Method: Decomposes the empathy task into analysis and synthesis stages, building TRACE as an understand-then-generate pipeline that models empathy as a structured cognitive process.
Result: TRACE significantly outperforms strong baselines in both automatic and LLM-based evaluations.
Conclusion: Structured task decomposition is an effective paradigm for building more capable and interpretable empathetic dialogue agents.
Abstract: Empathetic response generation is a crucial task for creating more human-like and supportive conversational agents. However, existing methods face a core trade-off between the analytical depth of specialized models and the generative fluency of Large Language Models (LLMs). To address this, we propose TRACE, Task-decomposed Reasoning for Affective Communication and Empathy, a novel framework that models empathy as a structured cognitive process by decomposing the task into a pipeline for analysis and synthesis. By building a comprehensive understanding before generation, TRACE unites deep analysis with expressive generation. Experimental results show that our framework significantly outperforms strong baselines in both automatic and LLM-based evaluations, confirming that our structured decomposition is a promising paradigm for creating more capable and interpretable empathetic agents. Our code is available at https://anonymous.4open.science/r/TRACE-18EF/README.md.
[37] KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues
Junhao Chen,Yu Huang,Siyuan Li,Rui Yao,Hanqian Li,Hanyu Zhang,Jungang Li,Jian Chen,Bowen Wang,Xuming Hu
Main category: cs.CL
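TL;DR: This paper introduces KnowMT-Bench, the first benchmark for evaluating LLMs on knowledge-intensive multi-turn long-form question answering (MT-LFQA), covering medicine, finance, and law. It uses a dynamic evaluation setting in which models generate their own multi-turn dialogue histories, and it tests the factual accuracy and information delivery efficiency of the final answer. Experiments show that multi-turn context lowers factuality because of noise in self-generated histories, and that information efficiency drops as dialogues get longer; retrieval-augmented generation (RAG) effectively mitigates the problem.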
Details
Motivation: Existing benchmarks are mostly limited to single-turn dialogue, or evaluate multi-turn abilities unrelated to knowledge-intensive factuality; there is no systematic evaluation of factual accuracy in multi-turn long-form QA, so a benchmark closer to real use is needed.
Method: Introduces KnowMT-Bench with a dynamic evaluation setting: given a logically progressive sequence of questions, the model generates its own multi-turn dialogue history, and the final-turn answer is scored for factuality and information delivery efficiency by a human-validated automated pipeline.
Result: Multi-turn context degrades performance: as turns accumulate, contextual noise from self-generated history reduces factuality, and answers become more verbose, lowering information efficiency; RAG can alleviate and even reverse the factual degradation.
Conclusion: KnowMT-Bench fills a gap in evaluating multi-turn knowledge-intensive QA, exposes a key challenge for current LLMs in multi-turn dialogue, and confirms that RAG improves factual accuracy.
Abstract: Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. However, existing benchmarks are limited to single-turn dialogue, while multi-turn dialogue benchmarks typically assess other orthogonal capabilities rather than knowledge-intensive factuality. To bridge this critical gap, we introduce KnowMT-Bench, the first-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields, including medicine, finance, and law. To faithfully assess the model's real-world performance, KnowMT-Bench employs a dynamic evaluation setting where models generate their own multi-turn dialogue histories given logically progressive question sequences. The factual capability and information delivery efficiency of the final-turn answer are then evaluated using a human-validated automated pipeline. Our experiments reveal that multi-turn contexts degrade performance: factual capability declines due to the contextual noise from self-generated histories, while information efficiency drops as models become more verbose with increasing dialogue length. We then investigate mitigation strategies, demonstrating that retrieval-augmented generation (RAG) can effectively alleviate and even reverse this factual degradation. These findings underscore the importance of our benchmark in evaluating and enhancing the conversational factual capabilities of LLMs in real-world knowledge-intensive applications.
Code is available at https://github.com/hardenyu21/KnowMT-Bench.
[38] Enhancing Low-Rank Adaptation with Structured Nonlinear Transformations
Guanzhi Deng,Mingyang Liu,Dapeng Wu,Yinqiao Li,Linqi Song
Main category: cs.CL
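TL;DR: Proposes LoRAN, a non-linear extension of low-rank adaptation, together with Sinter, a sine-based activation function, improving performance without increasing the parameter count.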
Details
Motivation: LoRA's linearity limits its expressiveness, so a more effective non-linear fine-tuning method is needed.
Method: Adds lightweight transformations on top of LoRA to build non-linear updates, and designs Sinter, a sine-based activation, to introduce structured perturbations.
Result: On summarization and classification tasks, LoRAN consistently outperforms QLoRA; ablations show that Sinter beats common activations such as Sigmoid, ReLU, and Tanh.
Conclusion: Non-linear extensions and carefully designed activations can noticeably improve low-rank fine-tuning, and Sinter is a better activation choice for low-rank updates.
Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning method for large language models. However, its linear nature limits expressiveness. We propose LoRAN, a non-linear extension of LoRA that applies lightweight transformations to the low-rank updates. We further introduce Sinter, a sine-based activation that adds structured perturbations without increasing parameter count. Experiments across summarization and classification tasks show that LoRAN consistently improves over QLoRA. Ablation studies reveal that Sinter outperforms standard activations such as Sigmoid, ReLU, and Tanh, highlighting the importance of activation design in low-rank tuning.
[39] LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals
Min-Hsuan Yeh,Yixuan Li,Tanwi Mallick
Main category: cs.CL
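TL;DR: Proposes LUMINA, a framework that detects hallucinations in RAG systems by quantifying how much the model uses external context versus internal knowledge; it is robust and outperforms existing methods.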
Details
Motivation: Although RAG reduces hallucinations by grounding responses in retrieved documents, models still hallucinate even when given correct and sufficient context, and existing detectors generalize poorly because they depend on hyperparameter tuning.
Method: LUMINA quantifies external context utilization via distributional distance and measures internal knowledge utilization by tracking how predicted tokens change across transformer layers; it also introduces a statistical validation framework.
Result: On several RAG hallucination benchmarks and four open-source LLMs, LUMINA outperforms prior methods in AUROC and AUPRC, by up to 13% AUROC, and is more robust to assumptions about retrieval quality and model matching.
Conclusion: LUMINA is effective and practical, offering hallucination detection for RAG systems that needs little tuning and generalizes well.
Abstract: Retrieval-Augmented Generation (RAG) aims to mitigate hallucinations in large language models (LLMs) by grounding responses in retrieved documents. Yet, RAG-based LLMs still hallucinate even when provided with correct and sufficient context. A growing line of work suggests that this stems from an imbalance between how models use external context and their internal knowledge, and several approaches have attempted to quantify these signals for hallucination detection. However, existing methods require extensive hyperparameter tuning, limiting their generalizability. We propose LUMINA, a novel framework that detects hallucinations in RAG systems through context-knowledge signals: external context utilization is quantified via distributional distance, while internal knowledge utilization is measured by tracking how predicted tokens evolve across transformer layers. We further introduce a framework for statistically validating these measurements. Experiments on common RAG hallucination benchmarks and four open-source LLMs show that LUMINA achieves consistently high AUROC and AUPRC scores, outperforming prior utilization-based methods by up to +13% AUROC on HalluRAG. Moreover, LUMINA remains robust under relaxed assumptions about retrieval quality and model matching, offering both effectiveness and practicality.
[40] No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping
Thanh-Long V. Le,Myeongho Jeon,Kim Vu,Viet Lai,Eunho Yang
Main category: cs.CL
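TL;DR: Proposes RL-ZVP, an algorithm that exploits the learning signal in zero-variance prompts to improve LLM reasoning under reinforcement learning, substantially outperforming existing methods.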
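To see why zero-variance prompts contribute nothing under group-normalized advantages, the sketch below computes GRPO-style advantages for one group of responses: when every reward is equal, every advantage is zero. The `zvp_advantages` function is a hypothetical fallback written only to illustrate the idea of rewarding correctness and scaling by a token-level quantity; its formula, the use of entropy, and the `scale` parameter are assumptions, not the paper's definition.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).
    Population std is used here; implementations differ on this detail."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def zvp_advantages(rewards, token_entropies, scale=0.1):
    """Hypothetical fallback for a zero-variance group: per-token advantages
    whose sign follows correctness and whose size follows token entropy."""
    sign = 1.0 if rewards[0] > 0 else -1.0
    return [[sign * scale * h for h in ent] for ent in token_entropies]
```

With rewards `[1, 1, 1, 1]` the first function returns all zeros, so such a prompt produces no gradient; the second function shows one way a non-zero signal could still be assigned.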
Details
Motivation: RL frameworks such as GRPO ignore zero-variance prompts, where all responses receive the same reward; the paper argues that these prompts still carry useful information for policy optimization.
Method: RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, and modulates the feedback with token-level characteristics to extract a learning signal from zero-variance prompts.
Result: Across six math reasoning benchmarks, RL-ZVP improves over GRPO by up to 8.61 points in accuracy and 7.77 points in pass rate, and consistently beats baselines that filter out zero-variance prompts.
Conclusion: Zero-variance prompts hold untapped potential in reinforcement learning with verifiable rewards, and using them improves reasoning performance.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward - so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
[41] QoNext: Towards Next-generation QoE for Foundation Models
Yijin Guo,Ye Shen,Farong Wen,Junying Wang,Zicheng Zhang,Qi Jia,Guangtao Zhai
Main category: cs.CL
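TL;DR: This paper presents QoNext, the first framework to bring Quality of Experience (QoE) principles from networking and multimedia to the evaluation of foundation models. It identifies experiential factors that shape user experience, runs controlled experiments, builds a QoE-oriented dataset, and trains models that predict perceived user experience from system parameters, enabling finer-grained, proactive evaluation and guidance for optimization in practice.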
Details
Motivation: Existing evaluations focus only on output correctness and overlook the user's actual experience during interaction; they cannot capture how the interplay between response quality and interaction affects satisfaction.
Method: Borrowing QoE ideas from networking and multimedia, QoNext identifies the key factors that shape user experience, collects human ratings under varied configurations in controlled experiments, builds a QoE database from the data, and trains models that predict user experience from measurable system parameters.
Result: Builds the first QoE database for foundation-model evaluation and trains models that predict perceived user experience; experiments show the framework supports fine-grained, proactive evaluation and yields optimization guidance for productized services.
Conclusion: QoNext fills the user-experience gap in traditional evaluation and, through the QoE paradigm, evaluates foundation models in a way that is more complete and closer to real usage.
Abstract: Existing evaluations of foundation models, including recent human-centric approaches, fail to capture what truly matters: the user's experience during interaction. Current methods treat evaluation as a matter of output correctness alone, overlooking that user satisfaction emerges from the interplay between response quality and interaction, which limits their ability to account for the mechanisms underlying user experience. To address this gap, we introduce QoNext, the first framework that adapts Quality of Experience (QoE) principles from networking and multimedia to the assessment of foundation models. QoNext identifies experiential factors that shape user experience and incorporates them into controlled experiments, where human ratings are collected under varied configurations. From these studies we construct a QoE-oriented database and train predictive models that estimate perceived user experience from measurable system parameters. Our results demonstrate that QoNext not only enables proactive and fine-grained evaluation but also provides actionable guidance for optimizing productized foundation-model services in practice.
[42] Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
Naibin Gu,Zhenyu Zhang,Yuchen Feng,Yilong Chen,Peng Fu,Zheng Lin,Shuohuan Wang,Yu Sun,Hua Wu,Weiping Wang,Haifeng Wang
Main category: cs.CL
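TL;DR: This paper proposes Elastic Mixture-of-Experts (EMoE), an MoE training framework that lets the number of activated experts be scaled flexibly at inference, addressing the rapid performance drop that conventional MoE models show when more experts are activated.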
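The knob this entry is about, the number of activated experts k, appears in standard top-k routing as a single argument. The sketch below shows plain top-k gating with renormalized gates (one common convention), so the same router can be called with k at training time and a larger k' at inference. It illustrates generic MoE routing on scalar inputs only; it does not implement EMoE's training procedure, and the function names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Select the k highest-scoring experts and renormalize their gates."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_output(x, experts, router_logits, k):
    """Gate-weighted sum of the selected experts' outputs for a scalar input."""
    gates = route_top_k(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())
```

Calling `moe_output` with a larger `k` than the model was trained with mixes in experts that were never combined during training, which is the setting where the entry reports that conventional MoE performance degrades.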
Details
Motivation: Conventional MoE models fix the number of activated experts k for both training and inference. Intuitively, activating more experts at inference (k' > k) should help, but performance degrades quickly after a slight increase in k, mainly because experts have not learned to collaborate.
Method: EMoE trains experts to work together in diverse combinations and improves the quality of the router's selections, so the model can elastically scale the number of activated experts at inference without extra training overhead.
Result: EMoE extends the effective performance-scaling range to 2-3 times the training-time k and raises the model's peak performance.
Conclusion: By fostering collaboration among experts and better routing, EMoE delivers robust, scalable inference performance across compute budgets, giving MoE models more flexibility in practice.
Abstract: Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. Intuitively, activating more experts $k'$ at inference (where $k' > k$) means engaging a larger set of model parameters for the computation and thus is expected to improve performance. However, contrary to this intuition, we find the scaling range to be so narrow that performance begins to degrade rapidly after only a slight increase in the number of experts. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to scale the number of activated experts at inference without incurring additional training overhead. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE ensures robust performance across computational budgets at inference. We conduct extensive experiments on various MoE settings. Our results show that EMoE significantly expands the effective performance-scaling range, extending it to as much as 2-3$\times$ the training-time $k$, while also pushing the model's peak performance to a higher level.
[43] A Large-Scale Dataset and Citation Intent Classification in Turkish with LLMs
Kemal Sami Karaca,Bahaeddin Eravcı
Main category: cs.CL
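TL;DR: This paper presents a systematic methodology and a foundational dataset for citation intent classification in Turkish. It builds an annotation tool to create the first publicly available Turkish citation intent dataset, uses a programmable classification pipeline on the DSPy framework to optimize prompts, and combines it with a stacked ensemble using an XGBoost meta-model to reach a state-of-the-art accuracy of 91.3%.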
Details
Motivation: Agglutinative languages such as Turkish pose unique challenges for citation intent analysis, and in-context learning with hand-designed prompts gives inconsistent results, so a systematic solution is needed.
Method: Builds a purpose-built annotation tool and releases the first Turkish citation intent dataset; automates prompt optimization with the DSPy framework; aggregates the outputs of multiple models with stacked generalization, using XGBoost as the meta-model for final classification.
Result: The programmable pipeline with the stacked ensemble reaches 91.3% accuracy on Turkish citation intent classification, clearly better than standard in-context learning with manually designed prompts.
Conclusion: The study gives the Turkish NLP community and the wider academic community a foundational dataset and a robust classification framework, advancing citation intent analysis for non-English languages.
Abstract: Understanding the qualitative intent of citations is essential for a comprehensive assessment of academic research, a task that poses unique challenges for agglutinative languages like Turkish. This paper introduces a systematic methodology and a foundational dataset to address this problem. We first present a new, publicly available dataset of Turkish citation intents, created with a purpose-built annotation tool. We then evaluate the performance of standard In-Context Learning (ICL) with Large Language Models (LLMs), demonstrating that its effectiveness is limited by inconsistent results caused by manually designed prompts. To address this core limitation, we introduce a programmable classification pipeline built on the DSPy framework, which automates prompt optimization systematically. For final classification, we employ a stacked generalization ensemble to aggregate outputs from multiple optimized models, ensuring stable and reliable predictions. This ensemble, with an XGBoost meta-model, achieves a state-of-the-art accuracy of 91.3%. Ultimately, this study provides the Turkish NLP community and broader academic circles with a foundational dataset and a robust classification framework, paving the way for future qualitative citation studies.
[44] AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition
Yun Wang,Zhaojun Ding,Xuansheng Wu,Siyue Sun,Ninghao Liu,Xiaoming Zhai
Main category: cs.CL
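TL;DR: Proposes AutoSCORE, a multi-agent LLM framework that improves the accuracy, interpretability, and robustness of automated scoring through rubric-aligned structured component recognition.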
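Human-machine agreement for this entry is reported with quadratic weighted kappa (QWK). For readers unfamiliar with the metric, the sketch below computes QWK from two lists of integer scores using the standard definition (observed versus chance-expected disagreement, weighted by squared score distance). It is a plain-Python illustration that assumes `max_score > min_score`; the function name is illustrative and the code is not taken from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """QWK between two equal-length lists of integer scores."""
    n = max_score - min_score + 1
    total = len(rater_a)
    # Confusion matrix of observed score pairs.
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            num += weight * observed[i][j]
            den += weight * expected
    if den == 0.0:
        # Both raters used a single identical score: treat as full agreement.
        return 1.0
    return 1.0 - num / den
```

A value of 1.0 means perfect agreement, 0.0 means agreement no better than chance, and negative values mean systematic disagreement; large score gaps are penalized more heavily than near misses.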
Details
Motivation: As end-to-end raters, LLMs suffer from low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment, which limits their use in educational assessment.
Method: A two-agent framework: one agent extracts rubric-relevant components from the student response and encodes them into a structured representation, and the other assigns the score from that representation, mirroring a human grading process.
Result: On four datasets from the ASAP benchmark, AutoSCORE improves scoring accuracy, human-machine agreement (QWK, correlations), and error metrics (MAE, RMSE) over single-agent baselines, with the largest benefits on complex multi-dimensional rubrics and on smaller LLMs.
Conclusion: Structured component recognition combined with multi-agent design offers a scalable, reliable, and interpretable solution for automated scoring.
Abstract: Automated scoring plays a crucial role in education by reducing the reliance on human raters, offering scalable and immediate evaluation of student work. While large language models (LLMs) have shown strong potential in this task, their use as end-to-end raters faces challenges such as low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment. These issues hinder the implementation of LLM-based automated scoring in assessment practice. To address the limitations, we propose AutoSCORE, a multi-agent LLM framework enhancing automated scoring via rubric-aligned Structured COmponent REcognition. With two agents, AutoSCORE first extracts rubric-relevant components from student responses and encodes them into a structured representation (i.e., Scoring Rubric Component Extraction Agent), which is then used to assign final scores (i.e., Scoring Agent). This design ensures that model reasoning follows a human-like grading process, enhancing interpretability and robustness. We evaluate AutoSCORE on four benchmark datasets from the ASAP benchmark, using both proprietary and open-source LLMs (GPT-4o, LLaMA-3.1-8B, and LLaMA-3.1-70B). Across diverse tasks and rubrics, AutoSCORE consistently improves scoring accuracy, human-machine agreement (QWK, correlations), and error metrics (MAE, RMSE) compared to single-agent baselines, with particularly strong benefits on complex, multi-dimensional rubrics, and especially large relative gains on smaller LLMs.
These results demonstrate that structured component recognition combined with multi-agent design offers a scalable, reliable, and interpretable solution for automated scoring.
[45] SimulSense: Sense-Driven Interpreting for Efficient Simultaneous Speech Translation
Haotian Tan,Hiroki Ouchi,Sakriani Sakti
Main category: cs.CL
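TL;DR: This paper proposes SimulSense, a framework that handles read/write decisions in simultaneous speech translation by mimicking human interpreters: it reads input speech continuously and triggers translation output when it perceives a new sense unit.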
Details
Motivation: Existing simultaneous speech translation systems depend on specialized interleaved training data and computationally expensive LLM inference, and struggle to mimic human interpreters' real-time behavior efficiently.
Method: Models simultaneous speech translation as continuously understanding input speech and detecting sense units, with a lightweight decision mechanism that needs no specialized training data and imitates human interpreters' read/write patterns.
Result: The method achieves a better quality-latency tradeoff than two state-of-the-art baselines and much better real-time efficiency, with decision-making up to 9.6x faster than the baselines.
Conclusion: SimulSense effectively imitates human interpreters' decision process, lowering compute cost while improving system performance.
Abstract: How can simultaneous speech translation (SimulST) systems make human-interpreter-like read/write decisions? Current state-of-the-art systems formulate SimulST as a multi-turn dialogue task, requiring specialized interleaved training data and relying on computationally expensive large language model (LLM) inference for decision-making. In this paper, we propose SimulSense, a novel framework for SimulST that mimics human interpreters by continuously reading input speech and triggering write decisions to produce translation when a new sense unit is perceived. Experiments against two state-of-the-art baseline systems demonstrate that our proposed method achieves a superior quality-latency tradeoff and substantially improved real-time efficiency, where its decision-making is up to 9.6x faster than the baselines.
[46] Why Chain of Thought Fails in Clinical Text Understanding
Jiageng Wu,Kevin Xie,Bowen Gu,Nils Krüger,Kueiyu Joshua Lin,Jie Yang
Main category: cs.CL
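TL;DR: This study presents the first large-scale systematic evaluation of chain-of-thought (CoT) prompting for clinical text understanding. It finds that 86.3% of LLMs perform worse with CoT, particularly on complex, fragmented text such as electronic health records, exposing a tension between improved interpretability and reduced reliability.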
Details
Motivation: Clinical applications demand high accuracy and transparent reasoning. CoT prompting has improved performance and interpretability in other domains, but its effectiveness on clinical text such as electronic health records is unclear and needs systematic evaluation.
Method: Evaluates 95 advanced LLMs on 87 real-world clinical text tasks covering 9 languages and 8 task types, with fine-grained analyses of reasoning length, medical concept alignment, and error profiles using both LLM-as-a-judge and clinical expert evaluation.
Result: 86.3% of models degrade consistently in the CoT setting; stronger models are relatively robust while weaker ones decline substantially; the study identifies systematic patterns in when CoT fails in clinical settings.
Conclusion: Although CoT makes reasoning more interpretable, it may undermine reliability on clinical text tasks; medical AI deployment must weigh interpretability against accuracy, and more reliable, transparent clinical reasoning methods are needed.
Abstract: Large language models (LLMs) are increasingly being applied to clinical care, a domain where both accuracy and transparent reasoning are critical for safe and trustworthy deployment. Chain-of-thought (CoT) prompting, which elicits step-by-step reasoning, has demonstrated improvements in performance and interpretability across a wide range of tasks. However, its effectiveness in clinical contexts remains largely unexplored, particularly in the context of electronic health records (EHRs), the primary source of clinical documentation, which are often lengthy, fragmented, and noisy. In this work, we present the first large-scale systematic study of CoT for clinical text understanding. We assess 95 advanced LLMs on 87 real-world clinical text tasks, covering 9 languages and 8 task types. Contrary to prior findings in other domains, we observe that 86.3% of models suffer consistent performance degradation in the CoT setting. More capable models remain relatively robust, while weaker ones suffer substantial declines. To better characterize these effects, we perform fine-grained analyses of reasoning length, medical concept alignment, and error profiles, leveraging both LLM-as-a-judge evaluation and clinical expert evaluation. Our results uncover systematic patterns in when and why CoT fails in clinical contexts, which highlight a critical paradox: CoT enhances interpretability but may undermine reliability in clinical text tasks.
This work provides an empirical basis for clinical reasoning strategies of LLMs, highlighting the need for transparent and trustworthy approaches.
[47] Debiasing Large Language Models in Thai Political Stance Detection via Counterfactual Calibration
Kasidit Sermsri,Teerapong Panboonyuen
Main category: cs.CL
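TL;DR: This paper presents ThaiFACTUAL, a lightweight, model-agnostic calibration framework for mitigating political stance bias in LLMs in low-resource, culturally complex settings such as Thai politics.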
Details
Motivation: In Thai political discourse, indirect language, polarized figures, and entangled sentiment and stance lead LLMs to show systematic biases such as sentiment leakage and entity favoritism, which harm fairness and reliability.
Method: ThaiFACTUAL uses counterfactual data augmentation and rationale-based supervision to disentangle sentiment from stance and reduce bias without fine-tuning; the authors also release the first high-quality Thai political stance dataset, annotated with stance, sentiment, rationales, and bias markers.
Result: ThaiFACTUAL significantly reduces spurious correlations, improves zero-shot generalization, and improves fairness across multiple LLMs.
Conclusion: The work underlines the importance of culturally grounded debiasing techniques for underrepresented languages.
Abstract: Political stance detection in low-resource and culturally complex settings poses a critical challenge for large language models (LLMs). In the Thai political landscape - marked by indirect language, polarized figures, and entangled sentiment and stance - LLMs often display systematic biases such as sentiment leakage and favoritism toward entities. These biases undermine fairness and reliability. We present ThaiFACTUAL, a lightweight, model-agnostic calibration framework that mitigates political bias without requiring fine-tuning. ThaiFACTUAL uses counterfactual data augmentation and rationale-based supervision to disentangle sentiment from stance and reduce bias. We also release the first high-quality Thai political stance dataset, annotated with stance, sentiment, rationales, and bias markers across diverse entities and events. Experimental results show that ThaiFACTUAL significantly reduces spurious correlations, enhances zero-shot generalization, and improves fairness across multiple LLMs. This work highlights the importance of culturally grounded debiasing techniques for underrepresented languages.
[48] MotivGraph-SoIQ: Integrating Motivational Knowledge Graphs and Socratic Dialogue for Enhanced LLM Ideation
Xinping Lei,Tong Zhou,Yubo Chen,Kang Liu,Jun Zhao
Main category: cs.CL
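TL;DR: Proposes MotivGraph-SoIQ, a framework that combines a motivational knowledge graph (MotivGraph) with Socratic dialogue (SoIQ) to improve grounding and reduce confirmation bias in LLM-based academic ideation.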
Details
Motivation: LLMs have great potential to accelerate academic ideation, but they struggle to ground ideas and to avoid confirmation bias, which limits the quality of further refinement.
Method: Builds a motivational knowledge graph (MotivGraph) with three node types (problem, challenge, and solution) and a dual-agent Q-Driven Socratic Ideator that refines ideas iteratively through Socratic questioning.
Result: On the ICLR25 paper topics dataset, the method outperforms state-of-the-art approaches on LLM-based scoring, ELO ranking, and human evaluation.
Conclusion: MotivGraph-SoIQ improves the quality of LLM-generated academic ideas, especially in novelty, experimental rigor, and motivational rationality.
Abstract: Large Language Models (LLMs) hold substantial potential for accelerating academic ideation but face critical challenges in grounding ideas and mitigating confirmation bias for further refinement. We propose integrating motivational knowledge graphs and Socratic dialogue to address these limitations in enhanced LLM ideation (MotivGraph-SoIQ). This novel framework provides essential grounding and practical idea improvement steps for LLM ideation by integrating a Motivational Knowledge Graph (MotivGraph) with a Q-Driven Socratic Ideator. The MotivGraph structurally stores three key node types (problem, challenge, and solution) to offer motivation grounding for the LLM ideation process. The Ideator is a dual-agent system utilizing Socratic questioning, which facilitates a rigorous refinement process that mitigates confirmation bias and improves idea quality across novelty, experimental rigor, and motivational rationality dimensions. On the ICLR25 paper topics dataset, MotivGraph-SoIQ exhibits clear advantages over existing state-of-the-art approaches across LLM-based scoring, ELO ranking, and human evaluation metrics.
[49] Black-Box Hallucination Detection via Consistency Under the Uncertain Expression
Seongho Joo,Kyungmin Min,Jahyun Koo,Kyomin Jung
Main category: cs.CL
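TL;DR: Proposes a black-box hallucination detection metric based on how LLMs behave when expressing uncertainty: factual responses are consistent while non-factual ones are not. Experiments show the metric is more effective than baselines that use internal model knowledge.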
Details
Motivation: LLMs such as GPT3 generate non-factual responses ("hallucinations"); existing detection methods rely on external resources or internal model states, which limits practical use, so a black-box method that needs no internal information is required.
Method: Analyzes the consistency of LLM responses under expressions of uncertainty and proposes a simple black-box detection metric that judges factuality from the consistency among multiple sampled responses.
Result: The proposed metric predicts the factuality of model responses better than baselines that rely on internal model knowledge.
Conclusion: Consistency-based black-box hallucination detection is effective and can serve as a basic tool for detecting LLM hallucinations in practice.
Abstract: Despite recent advances in language modeling, Large Language Models (LLMs) such as GPT3 are notorious for generating non-factual responses, the so-called "hallucination" problem. Existing methods for detecting and alleviating this hallucination problem require external resources or the internal state of LLMs, such as the output probability of each token. Given the LLM's restricted external API availability and the limited scope of external resources, there is an urgent demand to establish the black-box approach as the cornerstone for effective hallucination detection. In this work, we propose a simple black-box hallucination detection metric after investigating the behavior of LLMs under expressions of uncertainty. Our comprehensive analysis reveals that LLMs generate consistent responses when the responses are factual and inconsistent responses when they are not. Based on the analysis, we propose an efficient black-box hallucination detection metric with the expression of uncertainty. The experiment demonstrates that our metric is more predictive of the factuality of model responses than baselines that use internal knowledge of LLMs.
[50] GraphSearch: An Agentic Deep Searching Workflow for Graph Retrieval-Augmented Generation
Cehao Yang,Xiaojun Wu,Xueyuan Lin,Chengjin Xu,Xuhui Jiang,Yuanliang Sun,Jia Li,Hui Xiong,Jian Guo
Main category: cs.CL
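TL;DR: Proposes GraphSearch, a graph retrieval-augmented generation method that uses dual-channel retrieval and a modular framework to improve answer accuracy and generation quality in multi-hop question answering.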
Details
Motivation: Existing GraphRAG methods retrieve shallowly, fail to surface all critical evidence, and make poor use of pre-constructed structural graph data, limiting reasoning on complex queries.
Method: A modular framework of six modules supporting multi-turn interaction and iterative reasoning, with a dual-channel retrieval strategy that issues semantic queries over chunk-based text and relational queries over the structural graph to exploit their complementary strengths.
Result: On six multi-hop RAG benchmarks, GraphSearch consistently outperforms the traditional strategy in both answer accuracy and generation quality.
Conclusion: GraphSearch is a promising direction for graph retrieval-augmented generation, especially for complex queries and deep reasoning.
Abstract: Graph Retrieval-Augmented Generation (GraphRAG) enhances factual reasoning in LLMs by structurally modeling knowledge through graph-based representations. However, existing GraphRAG approaches face two core limitations: shallow retrieval that fails to surface all critical evidence, and inefficient utilization of pre-constructed structural graph data, which hinders effective reasoning from complex queries. To address these challenges, we propose GraphSearch, a novel agentic deep searching workflow with dual-channel retrieval for GraphRAG. GraphSearch organizes the retrieval process into a modular framework comprising six modules, enabling multi-turn interactions and iterative reasoning. Furthermore, GraphSearch adopts a dual-channel retrieval strategy that issues semantic queries over chunk-based text data and relational queries over structural graph data, enabling comprehensive utilization of both modalities and their complementary strengths. Experimental results across six multi-hop RAG benchmarks demonstrate that GraphSearch consistently improves answer accuracy and generation quality over the traditional strategy, confirming GraphSearch as a promising direction for advancing graph retrieval-augmented generation.
[51] From Outliers to Topics in Language Models: Anticipating Trends in News Corpora
Evangelia Zve,Benjamin Icard,Alice Breton,Lila Sainero,Gauvain Bourgne,Jean-Gabriel Ganascia
Main category: cs.CL
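TL;DR: This study examines how outliers, usually treated as noise in topic modeling, can act as weak signals of emerging topics in dynamic news corpora, using vector embeddings and cumulative clustering to show that they tend to evolve into coherent topics over time.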
Details
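Motivation: Outliers are usually ignored in topic modeling but may carry early signals of emerging topics; the study aims to surface this latent information.
Method: Uses vector embeddings from state-of-the-art language models with a cumulative clustering approach to track how outliers evolve over time in French and English news datasets on corporate social responsibility and climate change.
Result: Across models and languages, outliers show a consistent pattern of evolving into coherent topics over time.
Conclusion: Outliers can serve as effective weak signals for detecting emerging topics, and the finding is robust across languages and models.
Abstract: This paper examines how outliers, often dismissed as noise in topic modeling, can act as weak signals of emerging topics in dynamic news corpora. Using vector embeddings from state-of-the-art language models and a cumulative clustering approach, we track their evolution over time in French and English news datasets focused on corporate social responsibility and climate change. The results reveal a consistent pattern: outliers tend to evolve into coherent topics over time across both models and languages.
[52] Taxonomy of Comprehensive Safety for Clinical Agents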
Jean Seo,Hyunkyung Lee,Gibaeg Kim,Wooseok Han,Jaehyo Yoo,Seungseop Lim,Kihun Shin,Eunho Yang
Main category: cs.CL
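TL;DR: This paper introduces TACOS, a fine-grained 21-class clinical safety taxonomy that folds safety filtering and tool selection into user intent classification to make clinical chatbots safer.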
Details
Motivation: Existing methods such as guardrails and tool calling fall short of the complex safety demands of the clinical domain, so a more comprehensive, fine-grained safety framework is needed.
Method: Introduces TACOS (TAxonomy of COmprehensive Safety for Clinical Agents), a 21-class taxonomy covering a wide range of clinical and non-clinical queries, and curates a TACOS-annotated dataset for experiments.
Result: Experiments demonstrate the value of TACOS in clinical settings and show that training data distribution and the base model's pretrained knowledge matter for safety performance.
Conclusion: TACOS gives clinical agents a new safety approach in which a single intent classification step handles both safety filtering and tool selection.
Abstract: Safety is a paramount concern in clinical chatbot applications, where inaccurate or harmful responses can lead to serious consequences. Existing methods--such as guardrails and tool calling--often fall short in addressing the nuanced demands of the clinical domain. In this paper, we introduce TACOS (TAxonomy of COmprehensive Safety for Clinical Agents), a fine-grained, 21-class taxonomy that integrates safety filtering and tool selection into a single user intent classification step. TACOS is a taxonomy that can cover a wide spectrum of clinical and non-clinical queries, explicitly modeling varying safety thresholds and external tool dependencies. To validate our framework, we curate a TACOS-annotated dataset and perform extensive experiments. Our results demonstrate the value of a new taxonomy specialized for clinical agent settings, and reveal useful insights about training data distribution and pretrained knowledge of base models.
[53] Fuzzy Reasoning Chain (FRC): An Innovative Reasoning Framework from Fuzziness to Clarity
Ping Chen,Xiang Liu,Zhaoxiang Liu,Zezhou Chen,Xingpeng Zhang,Huan Hu,Zipeng Wang,Kai Wang,Shuming Shi,Shiguo Lian
Main category: cs.CL
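TL;DR: This paper proposes the Fuzzy Reasoning Chain (FRC) framework, which combines LLM semantic priors with continuous fuzzy membership degrees to create an explicit interaction between probability-based and fuzzy reasoning. It handles ambiguity, polysemy, and uncertainty in text, and sentiment analysis experiments indicate stable reasoning and knowledge transfer across model scales.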
Details
Motivation: Current NLP methods are limited on ambiguous, polysemous, and uncertain text; traditional probability-based methods struggle to capture conflicting or vague signals, so a more robust and interpretable reasoning mechanism is needed.
Method: FRC combines LLM semantic priors with continuous fuzzy membership degrees and builds an explicit interaction between probabilistic reasoning and fuzzy membership reasoning, gradually turning ambiguous inputs into clear, interpretable decisions.
Result: On sentiment analysis, theoretical analysis and experiments show that FRC reasons stably and facilitates knowledge transfer across models of different scales.
Conclusion: FRC provides a general mechanism for handling subtle and ambiguous expressions, with improved interpretability and robustness.
Abstract: With the rapid advancement of large language models (LLMs), natural language processing (NLP) has achieved remarkable progress. Nonetheless, significant challenges remain in handling texts with ambiguity, polysemy, or uncertainty. We introduce the Fuzzy Reasoning Chain (FRC) framework, which integrates LLM semantic priors with continuous fuzzy membership degrees, creating an explicit interaction between probability-based reasoning and fuzzy membership reasoning. This transition allows ambiguous inputs to be gradually transformed into clear and interpretable decisions while capturing conflicting or uncertain signals that traditional probability-based methods cannot. We validate FRC on sentiment analysis tasks, where both theoretical analysis and empirical results show that it ensures stable reasoning and facilitates knowledge transfer across different model scales. These findings indicate that FRC provides a general mechanism for managing subtle and ambiguous expressions with improved interpretability and robustness.
[54] RedNote-Vibe: A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Social Media
Yudong Li,Yufei Sun,Yuhan Yao,Peiru Yang,Wanyue Li,Jiajun Zou,Yongfeng Huang,Linlin Shen
Main category: cs.CL
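TL;DR: This paper introduces RedNote-Vibe, the first longitudinal dataset of AI-generated text on social media, and PLAD, an interpretable detection framework based on psycholinguistic features, revealing the dynamic relationship between the linguistic features of AI-generated content and user engagement.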
Details
Motivation: Existing AI-generated text (AIGT) detection datasets are mostly static and cannot reflect how social media content evolves over time and is driven by user engagement, so a long-term dataset with engagement information is needed to study the temporal evolution of AIGT.
Method: Builds RedNote-Vibe, a social media dataset spanning five years that includes timestamps and engagement metrics such as likes and comments, and proposes the PLAD framework, which uses psycholinguistic features to detect AIGT and to analyze its relationship with user engagement.
Result: PLAD achieves strong detection performance, identifies AI-generated content effectively, and reveals complex associations between specific linguistic features and user engagement on the platform.
Conclusion: RedNote-Vibe is an important resource for studying the long-term evolution of AI-generated content, and PLAD offers an interpretable detection method that helps explain how human and AI text differ in social contexts and what effects those differences have.
Abstract: The proliferation of Large Language Models (LLMs) has led to widespread AI-Generated Text (AIGT) on social media platforms, creating unique challenges where content dynamics are driven by user engagement and evolve over time. However, existing datasets mainly depict static AIGT detection. In this work, we introduce RedNote-Vibe, the first longitudinal (5-year) dataset for social media AIGT analysis. This dataset is sourced from the Xiaohongshu platform, containing user engagement metrics (e.g., likes, comments) and timestamps spanning from the pre-LLM period to July 2025, which enables research into the temporal dynamics and user interaction patterns of AIGT. Furthermore, to detect AIGT in the context of social media, we propose the PsychoLinguistic AIGT Detection Framework (PLAD), an interpretable approach that leverages psycholinguistic features. Our experiments show that PLAD achieves superior detection performance and provides insights into the signatures distinguishing human and AI-generated content. More importantly, it reveals the complex relationship between these linguistic features and social media engagement. The dataset is available at https://github.com/testuser03158/RedNote-Vibe.
[55] The QCET Taxonomy of Standard Quality Criterion Names and Definitions for the Evaluation of NLP Systems
Anya Belz,Simon Mille,Craig Thomson
Main category: cs.CL
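TL;DR: This paper presents QCET, a taxonomy of quality criteria for evaluation. Based on surveys of evaluation practice in NLP, it establishes a standard set of quality criterion names and definitions to address unclear comparability between evaluations, and supports comparing existing evaluations, designing new ones, and assessing regulatory compliance.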
Details
Motivation: Studies that use the same quality criterion name (e.g. "Fluency") may actually evaluate different aspects of quality, which makes results unreliable and hard to compare and hampers scientific progress in NLP.
Method: Takes a descriptive approach: based on the results of three surveys of evaluations in the NLP literature, builds a hierarchical taxonomy of quality criteria (QCET) and maps the several hundred criteria in use onto a unified standard.
Result: Presents the QCET taxonomy and its accompanying resources, which organize quality criteria systematically and support cross-evaluation comparability analysis, guidance for designing new evaluations, and compliance assessment.
Conclusion: QCET provides a standardized framework for NLP evaluation that improves transparency and comparability and supports reproducibility and progress in the field.
Abstract: Prior work has shown that two NLP evaluation experiments that report results for the same quality criterion name (e.g. Fluency) do not necessarily evaluate the same aspect of quality, and the comparability implied by the name can be misleading. Not knowing when two evaluations are comparable in this sense means we currently lack the ability to draw reliable conclusions about system quality on the basis of multiple, independently conducted evaluations. This in turn hampers the ability of the field to progress scientifically as a whole, a pervasive issue in NLP since its beginning (Sparck Jones, 1981). It is hard to see how the issue of unclear comparability can be fully addressed other than by the creation of a standard set of quality criterion names and definitions that the several hundred quality criterion names actually in use in the field can be mapped to, and grounded in. Taking a strictly descriptive approach, the QCET Quality Criteria for Evaluation Taxonomy derives a standard set of quality criterion names and definitions from three surveys of evaluations reported in NLP, and structures them into a hierarchy where each parent node captures common aspects of its child nodes. We present QCET and the resources it consists of, and discuss its three main uses in (i) establishing comparability of existing evaluations, (ii) guiding the design of new evaluations, and (iii) assessing regulatory compliance.
[56] Fine-tuning Done Right in Model Editing
Wanli Yang,Fei Sun,Rui Tang,Hongyu Zang,Du Su,Qi Cao,Jingang Wang,Huawei Shen,Xueqi Cheng
Main category: cs.CL
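TL;DR: This paper challenges the conventional view that fine-tuning is ineffective for model editing. By restoring the standard breadth-first (epoch-based) fine-tuning pipeline and introducing the localized tuning strategy LocFT-BF, it substantially improves editing performance and is the first to support 100K edits and 72B-parameter models without harming general capabilities.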
Details
Motivation: Fine-tuning has long been considered unsuitable for model editing, but the authors argue that the problem lies not in fine-tuning itself but in the sequential, depth-first pipeline used for editing, which optimizes one sample at a time and causes overfitting and interference across edits.
Method: Switches fine-tuning from depth-first, sample-wise updates to breadth-first, epoch-based mini-batch optimization, systematically analyzes where the tuned parameters should be located, and proposes LocFT-BF, a localized editing method built on the restored fine-tuning framework.
Result: Experiments across multiple LLMs and datasets show that LocFT-BF outperforms state-of-the-art methods by large margins, achieving effective editing at 100K edits and on 72B-parameter models for the first time while preserving general performance.
Conclusion: Clarifies a long-standing misconception about fine-tuning, shows that under a suitable framework it can be a leading model editing method, and lays a solid foundation for future research.
Abstract: Fine-tuning, a foundational method for adapting large language models, has long been considered ineffective for model editing. Here, we challenge this belief, arguing that the reported failure arises not from the inherent limitation of fine-tuning itself, but from adapting it to the sequential nature of the editing task, a single-pass depth-first pipeline that optimizes each sample to convergence before moving on. While intuitive, this depth-first pipeline coupled with sample-wise updating over-optimizes each edit and induces interference across edits. Our controlled experiments reveal that simply restoring fine-tuning to the standard breadth-first (i.e., epoch-based) pipeline with mini-batch optimization substantially improves its effectiveness for model editing. Moreover, fine-tuning in editing also suffers from suboptimal tuning parameter locations inherited from prior methods. Through systematic analysis of tuning locations, we derive LocFT-BF, a simple and effective localized editing method built on the restored fine-tuning framework. Extensive experiments across diverse LLMs and datasets demonstrate that LocFT-BF outperforms state-of-the-art methods by large margins. Notably, to our knowledge, it is the first to sustain 100K edits and 72B-parameter models, 10x beyond prior practice, without sacrificing general capabilities.
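The contrast between the depth-first and breadth-first pipelines described in this abstract can be sketched on a toy problem. The snippet below is only an illustration under stated assumptions: a two-parameter linear model, two made-up "edits" in `EDITS`, and hypothetical helper names (`sgd_step`, `depth_first`, `breadth_first`). It is not the paper's setup or the LocFT-BF method; it only shows how loop order alone can change how well earlier edits survive.

```python
# Toy sketch: a 2-parameter linear model and two hypothetical "edits" (x, y).
# It only contrasts loop order; it is not the paper's experimental setup.

EDITS = [((1.0, 1.0), 2.0), ((1.0, 0.0), 0.0)]  # made-up data for illustration
LR = 0.5


def predict(w, x):
    return w[0] * x[0] + w[1] * x[1]


def sgd_step(w, batch, lr=LR):
    """One gradient step on the halved mean squared error of `batch`."""
    g0 = g1 = 0.0
    for x, y in batch:
        err = predict(w, x) - y
        g0 += err * x[0]
        g1 += err * x[1]
    n = len(batch)
    return (w[0] - lr * g0 / n, w[1] - lr * g1 / n)


def total_loss(w):
    """Sum of squared errors over all edits."""
    return sum((predict(w, x) - y) ** 2 for x, y in EDITS)


def depth_first(steps_per_edit=50):
    # Optimize each edit (nearly) to convergence before moving to the next one.
    w = (0.0, 0.0)
    for edit in EDITS:
        for _ in range(steps_per_edit):
            w = sgd_step(w, [edit])
    return w


def breadth_first(epochs=200):
    # Epoch-based: every step uses a mini-batch that contains all edits.
    w = (0.0, 0.0)
    for _ in range(epochs):
        w = sgd_step(w, EDITS)
    return w


depth_loss = total_loss(depth_first())
breadth_loss = total_loss(breadth_first())
print(f"depth-first loss:   {depth_loss:.6f}")
print(f"breadth-first loss: {breadth_loss:.6f}")
```

In this toy case the second edit overwrites part of what the first edit learned under the depth-first loop, while the epoch-based loop settles on parameters that satisfy both edits. Real model editing involves far more parameters and edits, so the sketch should be read as intuition only.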
By clarifying a long-standing misconception and introducing a principled localized tuning strategy, we advance fine-tuning from an underestimated baseline to a leading method for model editing, establishing a solid foundation for future research.
[57] COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Dmitriy Shopkhoev,Denis Makhov,Magauiya Zhussip,Ammar Ali,Stamatios Lefkimmiatis
Main category: cs.CL
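TL;DR: Proposes CoSpaDi, a training-free LLM compression framework that uses structured sparse factorization and sparse dictionary learning and, by keeping output activations consistent with the original model, outperforms conventional low-rank methods.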
Details
Motivation: Conventional low-rank weight approximation imposes a rigid structural constraint that causes a noticeable drop in model accuracy, so a more flexible and expressive compression method is needed.
Method: Proposes the CoSpaDi framework, which replaces low-rank decomposition with a dense dictionary and a column-sparse coefficient matrix, yielding a union-of-subspaces representation, and uses calibration data to optimize the factorization so that it matches the output activations of the original layers.
Result: Evaluated on multiple Llama and Qwen models at 20-50% compression ratios, CoSpaDi outperforms existing data-aware low-rank methods in both accuracy and perplexity.
Conclusion: Structured sparse dictionary learning is a powerful alternative to conventional low-rank methods for efficient LLM deployment.
Abstract: Post-training compression of large language models (LLMs) largely relies on low-rank weight approximation, which represents each column of a weight matrix in a shared low-dimensional subspace. While this is a computationally efficient strategy, the imposed structural constraint is rigid and can lead to a noticeable model accuracy drop. In this work, we propose CoSpaDi (Compression via Sparse Dictionary Learning), a novel training-free compression framework that replaces low-rank decomposition with a more flexible structured sparse factorization in which each weight matrix is represented with a dense dictionary and a column-sparse coefficient matrix. This formulation enables a union-of-subspaces representation: different columns of the original weight matrix are approximated in distinct subspaces spanned by adaptively selected dictionary atoms, offering greater expressiveness than a single invariant basis. Crucially, CoSpaDi leverages a small calibration dataset to optimize the factorization such that the output activations of compressed projection layers closely match those of the original ones, thereby minimizing functional reconstruction error rather than mere weight approximation. This data-aware strategy preserves better model fidelity without any fine-tuning under reasonable compression ratios. Moreover, the resulting structured sparsity allows efficient sparse-dense matrix multiplication and is compatible with post-training quantization for further memory and latency gains.
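The union-of-subspaces idea in this abstract (each column of the weight matrix picks its own few dictionary atoms) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the 4x3 matrix, the identity dictionary, and the helper names `sparse_code_column` and `reconstruct_column` are hypothetical, and the sketch leaves out the dictionary learning and calibration-based activation matching that the abstract describes as central to CoSpaDi. With orthonormal atoms, keeping the `s` largest coefficients gives the best `s`-sparse code, which is why this shortcut is valid here.

```python
# Toy illustration only: NOT the CoSpaDi algorithm. Assumes an orthonormal
# dictionary, so keeping the s largest coefficients is the best s-sparse code.


def sparse_code_column(column, dictionary, s):
    """Return {atom_index: coefficient} with at most s entries for one column."""
    # With orthonormal atoms, each coefficient is a plain inner product.
    coeffs = [sum(a * c for a, c in zip(atom, column)) for atom in dictionary]
    top = sorted(range(len(coeffs)), key=lambda j: abs(coeffs[j]), reverse=True)[:s]
    return {j: coeffs[j] for j in sorted(top)}


def reconstruct_column(code, dictionary, dim):
    """Rebuild a column as the weighted sum of its selected atoms."""
    out = [0.0] * dim
    for j, c in code.items():
        for i in range(dim):
            out[i] += c * dictionary[j][i]
    return out


# Hypothetical 4x3 weight matrix, stored as a list of columns.
columns = [
    [3.0, 0.1, 0.0, 2.0],
    [0.0, 4.0, 1.5, 0.2],
    [0.1, 0.0, 5.0, 2.5],
]
dim = 4
# Identity atoms stand in for a learned dense dictionary.
dictionary = [[1.0 if i == j else 0.0 for i in range(dim)] for j in range(dim)]

codes = [sparse_code_column(col, dictionary, s=2) for col in columns]
supports = [sorted(code) for code in codes]
approx = [reconstruct_column(code, dictionary, dim) for code in codes]

for col_idx, support in enumerate(supports):
    print(f"column {col_idx} uses atoms {support}")
```

Each column ends up with a different pair of atoms, whereas a rank-2 approximation would force all three columns into the same two-dimensional subspace.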
We evaluate CoSpaDi across multiple Llama and Qwen models under per-layer and per-group settings at 20-50% compression ratios, demonstrating consistent superiority over state-of-the-art data-aware low-rank methods both in accuracy and perplexity. Our results establish structured sparse dictionary learning as a powerful alternative to conventional low-rank approaches for efficient LLM deployment.
[58] Multilingual Dialogue Generation and Localization with Dialogue Act Scripting
Justin Vasselli,Eunike Andriani Kardinata,Yusuke Sakai,Taro Watanabe
Main category: cs.CL
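TL;DR: Proposes Dialogue Act Script (DAS), a structured framework for generating multilingual dialogues from abstract intent representations, avoiding the cultural mismatch and unnaturalness that translation introduces.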
Details
Motivation: Non-English dialogue datasets are scarce, and existing approaches often rely on translating English dialogues, which tends to introduce distortions and cultural mismatch.
Method: Uses structured dialogue act representations to generate new dialogues in the target language rather than translating the original utterances directly.
Result: Human evaluations in Italian, German, and Chinese show that DAS-generated dialogues outperform both machine and human translations on cultural relevance, coherence, and situational appropriateness.
Conclusion: DAS supports flexible localization across languages, mitigates translationese, and produces more natural, culturally appropriate dialogues.
Abstract: Non-English dialogue datasets are scarce, and models are often trained or evaluated on translations of English-language dialogues, an approach which can introduce artifacts that reduce their naturalness and cultural appropriateness. This work proposes Dialogue Act Script (DAS), a structured framework for encoding, localizing, and generating multilingual dialogues from abstract intent representations. Rather than translating dialogue utterances directly, DAS enables the generation of new dialogues in the target language that are culturally and contextually appropriate. By using structured dialogue act representations, DAS supports flexible localization across languages, mitigating translationese and enabling more fluent, naturalistic conversations. Human evaluations across Italian, German, and Chinese show that DAS-generated dialogues consistently outperform those produced by both machine and human translators on measures of cultural relevance, coherence, and situational appropriateness.
[59] S2J: Bridging the Gap Between Solving and Judging Ability in Generative Reward Models
Shaoning Sun,Jiachen Yu,Zongqi Wang,Xuewei Yang,Tianle Gu,Yujiu Yang
Main category: cs.CL
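TL;DR: This paper proposes Solve-to-Judge (S2J), which jointly uses the solving and judging abilities of a generative reward model (GRM) to narrow the solve-to-judge gap, where a model can solve a problem yet judges it incorrectly. S2J markedly improves judgment performance and reaches state-of-the-art results with less training data.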
Details
Motivation: Although GRMs have strong problem-solving ability, they still make clear judgment errors on individual queries (the solve-to-judge gap), and existing methods fail to link their solving and judging abilities effectively.
Method: S2J uses both the solving and judging abilities on a single GRM output for supervision, explicitly linking the two during model optimization and thereby narrowing the solve-to-judge gap.
Result: Experiments show that S2J reduces the solve-to-judge gap by 16.2%, improves judgment performance by 5.8%, and achieves state-of-the-art performance among GRMs built on the same base model, without relying on stronger external models or larger training data.
Conclusion: Through self-evolution, S2J bridges the solving and judging abilities of GRMs and achieves superior judgment performance with less training data, offering a new direction for building efficient reward models.
Abstract: With the rapid development of large language models (LLMs), generative reward models (GRMs) have been widely adopted for reward modeling and evaluation. Previous studies have primarily focused on training specialized GRMs by optimizing them on preference datasets with the judgment correctness as supervision. While it's widely accepted that GRMs with stronger problem-solving capabilities typically exhibit superior judgment abilities, we first identify a significant solve-to-judge gap when examining individual queries. Specifically, the solve-to-judge gap refers to the phenomenon where GRMs struggle to make correct judgments on some queries (14%-37%), despite being fully capable of solving them. In this paper, we propose the Solve-to-Judge (S2J) approach to address this problem. Specifically, S2J simultaneously leverages both the solving and judging capabilities on a single GRM's output for supervision, explicitly linking the GRM's problem-solving and evaluation abilities during model optimization, thereby narrowing the gap. Our comprehensive experiments demonstrate that S2J effectively reduces the solve-to-judge gap by 16.2%, thereby enhancing the model's judgment performance by 5.8%. Notably, S2J achieves state-of-the-art (SOTA) performance among GRMs built on the same base model while utilizing a significantly smaller training dataset. Moreover, S2J accomplishes this through self-evolution without relying on more powerful external models for distillation.
[60] Think Right, Not More: Test-Time Scaling for Numerical Claim Verification
Primakov Chungkham,V Venktesh,Vinay Setty,Avishek Anand
Main category: cs.CL
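TL;DR: This paper proposes an approach based on scaling test-time compute (TTS) and a verifier model (VERIFIERFC) to improve LLM performance on fact-checking complex numerical claims. It mitigates reasoning drift and uses an adaptive mechanism to substantially improve compute efficiency.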
Details
Motivation: When verifying real-world claims that require compositional and numerical reasoning, LLMs still suffer from reasoning drift and have difficulty understanding numerical detail.
Method: Generates multiple reasoning paths and trains a verifier model (VERIFIERFC) to select the best one, and introduces an adaptive TTS mechanism based on claim complexity to improve compute efficiency.
Result: The method improves performance by 18.8% over single-shot verification methods on fact-checking and is 1.8x more compute-efficient than standard TTS.
Conclusion: Scaling test-time compute, combined with an adaptive mechanism and a verifier model, improves both the accuracy and the efficiency of LLMs on complex numerical claim verification.
Abstract: Fact-checking real-world claims, particularly numerical claims, is inherently complex and requires multistep reasoning and numerical reasoning for verifying diverse aspects of the claim. Although large language models (LLMs) including reasoning models have made tremendous advances, they still fall short on fact-checking real-world claims that require a combination of compositional and numerical reasoning. They are unable to understand the nuance of numerical aspects, and are also susceptible to the reasoning drift issue, where the model is unable to contextualize diverse information, resulting in misinterpretation and backtracking of the reasoning process. In this work, we systematically explore scaling test-time compute (TTS) for LLMs on the task of fact-checking complex numerical claims, which entails eliciting multiple reasoning paths from an LLM. We train a verifier model (VERIFIERFC) to navigate this space of possible reasoning paths and select one that could lead to the correct verdict. We observe that TTS helps mitigate the reasoning drift issue, leading to significant performance gains for fact-checking numerical claims. To improve compute efficiency in TTS, we introduce an adaptive mechanism that performs TTS selectively based on the perceived complexity of the claim. This approach achieves 1.8x higher efficiency than standard TTS, while delivering a notable 18.8% performance improvement over single-shot claim verification methods. Our code and data can be found at https://github.com/VenkteshV/VerifierFC
[61] Universal Legal Article Prediction via Tight Collaboration between Supervised Classification Model and LLM
Xiao Chi,Wenlin Zhong,Yiquan Wu,Wei Wang,Kun Kuang,Fei Wu,Minghui Xiong
Main category: cs.CL
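TL;DR: This paper proposes Uni-LAP, a universal framework for legal article prediction that combines the strengths of supervised classification models (SCMs) and large language models (LLMs), using a Top-K loss function and syllogism-inspired reasoning to improve prediction accuracy and generalization.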
Details
Motivation: Existing legal article prediction methods are limited on complex cases: supervised models struggle to capture details, LLMs perform poorly at predicting abstract articles, and most methods do not generalize across jurisdictions.
Method: Proposes the Uni-LAP framework, in which the SCM uses a new Top-K loss function to generate candidate articles and the LLM applies syllogism-inspired reasoning to refine the final prediction, so that the two collaborate tightly.
Result: Experiments on datasets from multiple jurisdictions show that Uni-LAP consistently outperforms existing baselines, demonstrating stronger effectiveness and generalizability.
Conclusion: By combining the strengths of SCMs and LLMs, Uni-LAP addresses key challenges in legal article prediction and shows good cross-jurisdiction applicability and practical potential.
Abstract: Legal Article Prediction (LAP) is a critical task in legal text classification, leveraging natural language processing (NLP) techniques to automatically predict relevant legal articles based on the fact descriptions of cases. As a foundational step in legal decision-making, LAP plays a pivotal role in determining subsequent judgments, such as charges and penalties. Despite its importance, existing methods face significant challenges in addressing the complexities of LAP. Supervised classification models (SCMs), such as CNN and BERT, struggle to fully capture intricate fact patterns due to their inherent limitations. Conversely, large language models (LLMs), while excelling in generative tasks, perform suboptimally in predictive scenarios due to the abstract and ID-based nature of legal articles. Furthermore, the diversity of legal systems across jurisdictions exacerbates the issue, as most approaches are tailored to specific countries and lack broader applicability. To address these limitations, we propose Uni-LAP, a universal framework for legal article prediction that integrates the strengths of SCMs and LLMs through tight collaboration. Specifically, in Uni-LAP, the SCM is enhanced with a novel Top-K loss function to generate accurate candidate articles, while the LLM employs syllogism-inspired reasoning to refine the final predictions. We evaluated Uni-LAP on datasets from multiple jurisdictions, and empirical results demonstrate that our approach consistently outperforms existing baselines, showcasing its effectiveness and generalizability.
[62] Multilingual Vision-Language Models, A Survey
Andrei-Alexandru Manea,Jindřich Libovický
Main category: cs.CL
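TL;DR: This survey covers multilingual vision-language models, analyzing 31 models and 21 benchmarks. It identifies a tension between language neutrality and cultural awareness and finds gaps between training objectives and evaluation goals.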
Details
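Motivation: To understand how multilingual vision-language models perform in cross-lingual and cross-cultural settings, and to expose the trade-off current methods make between language neutrality and cultural awareness.
Method: Surveys 31 multilingual vision-language models and 21 evaluation benchmarks, analyzing their architectures, training methods, and evaluation strategies, with particular attention to contrastive learning, data diversity, and translation-based evaluation.
Result: Current training methods tend to achieve language neutrality through contrastive learning, while cultural awareness depends on diverse data. Two-thirds of the benchmarks use translation-based approaches that emphasize semantic consistency, which leaves a gap with real cross-cultural understanding.
Conclusion: Existing models do well on language neutrality but lack cultural awareness, and training objectives are misaligned with evaluation practice. Future work needs more culturally sensitive training data and benchmarks.
Abstract: This survey examines multilingual vision-language models that process text and images across languages. We review 31 models and 21 benchmarks, spanning encoder-only and generative architectures, and identify a key tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts). Current training methods favor neutrality through contrastive learning, while cultural awareness depends on diverse data. Two-thirds of evaluation benchmarks use translation-based approaches prioritizing semantic consistency, though recent work incorporates culturally grounded content. We find discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.
[63] FoodSEM: Large Language Model Specialized in Food Named-Entity Linking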
Ana Gjorgjevikj,Matej Martinc,Gjorgjina Cenikj,Sašo Džeroski,Barbara Koroušić Seljak,Tome Eftimov
Main category: cs.CL
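TL;DR: This paper presents FoodSEM, an open-source large language model for named-entity linking (NEL) to food-related ontologies. Using an instruction-response setup, it achieves state-of-the-art performance across several ontologies, with F1 scores of up to 98%, and the model and datasets are released to advance semantic understanding in the food domain.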
Details
Motivation: Existing general-purpose and domain-specific language models cannot accurately solve named-entity linking in the food domain, so an effective model dedicated to food ontologies is needed.
Method: Fine-tunes a large language model in an instruction-response (IR) setup so that it links food-related entities in text to ontologies such as FoodOn, SNOMED-CT, and the Hansard taxonomy.
Result: FoodSEM achieves state-of-the-art performance across multiple datasets and ontologies, with F1 scores of up to 98%, clearly outperforming zero-shot, one-shot, and few-shot prompting baselines.
Conclusion: FoodSEM is the first high-performing open-source model designed for food-domain named-entity linking, and its release provides a strong tool and baseline for food semantic understanding.
Abstract: This paper introduces FoodSEM, a state-of-the-art fine-tuned open-source large language model (LLM) for named-entity linking (NEL) to food-related ontologies. To the best of our knowledge, food NEL is a task that cannot be accurately solved by state-of-the-art general-purpose (large) language models or custom domain-specific models/systems. Through an instruction-response (IR) scenario, FoodSEM links food-related entities mentioned in a text to several ontologies, including FoodOn, SNOMED-CT, and the Hansard taxonomy. The FoodSEM model achieves state-of-the-art performance compared to related models/systems, with F1 scores even reaching 98% on some ontologies and datasets. The presented comparative analyses against zero-shot, one-shot, and few-shot LLM prompting baselines further highlight FoodSEM's superior performance over its non-fine-tuned version. By making FoodSEM and its related resources publicly available, the main contributions of this article include (1) publishing food-annotated corpora in an IR format suitable for LLM fine-tuning/evaluation, (2) publishing a robust model to advance the semantic understanding of text in the food domain, and (3) providing a strong baseline on food NEL for future benchmarking.
[64] R-Capsule: Compressing High-Level Plans for Efficient Large Language Model Reasoning
Hongyu Shan,Mingyang Song,Chang Dai,Di Liang,Han Chen
Main category: cs.CL
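TL;DR: Proposes the Reasoning Capsule (R-Capsule) framework, which compresses the high-level reasoning plan into a small number of latent tokens to improve efficiency while keeping the transparency of explicit reasoning. Guided by the Information Bottleneck principle, it reduces reasoning overhead while maintaining or improving accuracy.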
Details
Motivation: Chain-of-Thought (CoT) improves the complex reasoning ability of large models, but its long explicit reasoning chains increase latency and memory use and can propagate early errors, so a new reasoning framework that balances efficiency and transparency is needed.
Method: Proposes the Reasoning Capsule (R-Capsule), which uses a low-capacity bottleneck to compress the high-level plan into a small set of learned latent tokens while keeping execution steps lightweight or explicit. A primary task loss and an auxiliary plan-reconstruction loss encourage the capsule to be both minimal and sufficient, in line with the Information Bottleneck principle.
Result: On several complex benchmarks the method matches or exceeds the accuracy of explicit CoT while substantially reducing the number of visible tokens, improving reasoning efficiency, and making the latent representation more interpretable.
Conclusion: R-Capsule strikes a good balance between efficiency, accuracy, and interpretability, offering a more efficient yet transparent alternative for large-model reasoning.
Abstract: Chain-of-Thought (CoT) prompting helps Large Language Models (LLMs) tackle complex reasoning by eliciting explicit step-by-step rationales. However, CoT's verbosity increases latency and memory usage and may propagate early errors across long chains. We propose the Reasoning Capsule (R-Capsule), a framework that aims to combine the efficiency of latent reasoning with the transparency of explicit CoT. The core idea is to compress the high-level plan into a small set of learned latent tokens (a Reasoning Capsule) while keeping execution steps lightweight or explicit. This hybrid approach is inspired by the Information Bottleneck (IB) principle, where we encourage the capsule to be approximately minimal yet sufficient for the task. Minimality is encouraged via a low-capacity bottleneck, which helps improve efficiency. Sufficiency is encouraged via a dual objective: a primary task loss for answer accuracy and an auxiliary plan-reconstruction loss that encourages the capsule to faithfully represent the original textual plan. The reconstruction objective helps ground the latent space, thereby improving interpretability and reducing the use of uninformative shortcuts. Our framework strikes a balance between efficiency, accuracy, and interpretability, thereby reducing the visible token footprint of reasoning while maintaining or improving accuracy on complex benchmarks. Our codes are available at: https://anonymous.4open.science/r/Reasoning-Capsule-7BE0
[65] Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
Shijing Hu,Jingyang Li,Zhihui Lu,Pan Zhou
Main category: cs.CL
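TL;DR: This paper proposes Group Tree Optimization (GTO), which speeds up LLM inference by aligning training with the tree policy used at decoding time, resolving the draft policy misalignment of existing methods.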
Details
Motivation: Existing speculative decoding methods optimize only a single greedy path during training, whereas actual decoding uses a tree policy that verifies multiple branches. This misalignment limits the achievable speedup.
Method: GTO has two components: (i) Draft Tree Reward, a sampling-free objective that directly measures the expected acceptance length of the draft tree under the target model; (ii) Group-based Draft Policy Training, which contrasts trees generated by the current and a frozen reference draft model, forms debiased group-standardized advantages, and applies a PPO-style update along the longest accepted sequence.
Result: Across tasks (dialogue, code, math) and large models (e.g., LLaMA, Vicuna, DeepSeek), GTO increases acceptance length by 7.4% and yields an additional 7.7% inference speedup over EAGLE-3.
Conclusion: GTO bridges the policy gap between training and decoding and provides a general, practical solution for efficient LLM inference.
Abstract: Speculative decoding accelerates large language model (LLM) inference by letting a lightweight draft model propose multiple tokens that the target model verifies in parallel. Yet existing training objectives optimize only a single greedy draft path, while decoding follows a tree policy that re-ranks and verifies multiple branches. This draft policy misalignment limits achievable speedups. We introduce Group Tree Optimization (GTO), which aligns training with the decoding-time tree policy through two components: (i) Draft Tree Reward, a sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance; (ii) Group-based Draft Policy Training, a stable optimization scheme that contrasts trees from the current and a frozen reference draft model, forming debiased group-standardized advantages and applying a PPO-style surrogate along the longest accepted sequence for robust updates. We further prove that increasing our Draft Tree Reward provably improves acceptance length and speedup. Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K), and multiple LLMs (e.g., LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-1.3-13B, DeepSeek-R1-Distill-LLaMA-8B), GTO increases acceptance length by 7.4% and yields an additional 7.7% speedup over prior state-of-the-art EAGLE-3. By bridging draft policy misalignment, GTO offers a practical, general solution for efficient LLM inference.
[66] NFDI4DS Shared Tasks for Scholarly Document Processing
Raia Abu Ahmad,Rana Abdulla,Tilahun Abedissa Taffa,Soeren Auer,Hamed Babaei Giglou,Ekaterina Borisova,Zongxiong Chen,Stefan Dietze,Jennifer DSouza,Mayra Elwes,Genet-Asefa Gesese,Shufan Jiang,Ekaterina Kutafina,Philipp Mayr,Georg Rehm,Sameer Sadruddin,Sonja Schimmler,Daniel Schneider,Kanishka Silva,Sharmila Upadhyaya,Ricardo Usbeck
Main category: cs.CL
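TL;DR: This paper gives an updated overview of twelve shared tasks developed and hosted under the German National Research Data Infrastructure (NFDI4DS) consortium. The tasks cover diverse challenges in scholarly document processing, drive methodological innovation, and provide open datasets, models, and tools to the research community.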
Details
Motivation: To advance research through community-based standardised evaluation and to promote FAIR (findable, accessible, interoperable, reusable), transparent, and reproducible research practices.
Method: Hosts shared tasks at leading venues, designs task frameworks around different challenges in scholarly document processing, and integrates the outputs into the NFDI4DS research data infrastructure.
Result: Twelve shared tasks have been developed and maintained, producing open-access datasets, models, and tools and fostering methodological innovation and community collaboration.
Conclusion: Shared tasks are an effective way to promote transparent and reproducible research in data science and artificial intelligence, and the NFDI4DS experience offers a model for building research infrastructure.
Abstract: Shared tasks are powerful tools for advancing research through community-based standardised evaluation. As such, they play a key role in promoting findable, accessible, interoperable, and reusable (FAIR), as well as transparent and reproducible research practices. This paper presents an updated overview of twelve shared tasks developed and hosted under the German National Research Data Infrastructure for Data Science and Artificial Intelligence (NFDI4DS) consortium, covering a diverse set of challenges in scholarly document processing. Hosted at leading venues, the tasks foster methodological innovations and contribute open-access datasets, models, and tools for the broader research community, which are integrated into the consortium's research data infrastructure.
[67] From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement
Jianzhi Yan,Le Liu,Youcheng Pan,Shiwei Chen,Zike Yuan,Yang Xiang,Buzhou Tang
Main category: cs.CL
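TL;DR: Proposes Multiround Adaptive Chain-of-Thought Compression (MACC), which exploits the token elasticity phenomenon to improve accuracy while substantially reducing inference latency and output length.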
Details
Motivation: Chain-of-Thought (CoT) reasoning improves performance on complex tasks, but its verbosity causes high inference latency, so effective compression is needed to improve efficiency.
Method: Proposes the MACC framework, which uses multiround refinement and the token elasticity phenomenon to adaptively determine the best compression depth for each input, and uses interpretable features measured on the training set to predict test-time performance.
Result: Average accuracy improves by 5.6%, CoT length drops by 47 tokens on average, and latency falls significantly. Efficient model selection and performance forecasting are possible without repeated fine-tuning.
Conclusion: CoT compression is not only effective but also predictable in its performance, and MACC provides a reliable framework for efficient reasoning.
Abstract: Chain-of-Thought (CoT) reasoning improves performance on complex tasks but introduces significant inference latency due to verbosity. We propose Multiround Adaptive Chain-of-Thought Compression (MACC), a framework that leverages the token elasticity phenomenon--where overly small token budgets can paradoxically increase output length--to progressively compress CoTs via multiround refinement. This adaptive strategy allows MACC to determine the optimal compression depth for each input. Our method achieves an average accuracy improvement of 5.6 percent over state-of-the-art baselines, while also reducing CoT length by an average of 47 tokens and significantly lowering latency. Furthermore, we show that test-time performance--accuracy and token length--can be reliably predicted using interpretable features like perplexity and compression rate on the training set. Evaluated across different models, our method enables efficient model selection and forecasting without repeated fine-tuning, demonstrating that CoT compression is both effective and predictable. Our code will be released at https://github.com/Leon221220/MACC.
[68] Mixture of Detectors: A Compact View of Machine-Generated Text Detection
Sai Teja Lekkala,Yadagiri Annepaka,Arun Kumar Challa,Samatha Reddy Machireddy,Partha Pakray,Chukhu Chunka
Main category: cs.CL
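TL;DR: This paper introduces BMAS English, a new English-language dataset for machine-generated text detection that supports document-level and sentence-level classification, generator attribution, and analysis of adversarial attacks.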
Details
Motivation: As large language models advance, the authenticity of human work and the preservation of human creativity are under pressure, and effective methods for distinguishing human from machine-generated text are needed.
Method: Builds a new dataset, BMAS English, covering binary classification, multiclass classification, generator attribution, sentence-level segmentation, and adversarial attack scenarios, in order to evaluate the detectability of machine-generated text systematically.
Result: The dataset supports multiple tasks, including identifying machine-generated text and its source, segmenting the human and machine parts of collaborative text, and testing how adversarial perturbations affect detection.
Conclusion: BMAS English provides a more comprehensive benchmark for machine-generated text detection and helps address the authenticity and creativity concerns raised by AI-generated content.
Abstract: Large Language Models (LLMs) are gearing up to surpass human creativity. The veracity of the statement needs careful consideration. In recent developments, critical questions arise regarding the authenticity of human work and the preservation of their creativity and innovative abilities. This paper investigates such issues. This paper addresses machine-generated text detection across several scenarios, including document-level binary and multiclass classification or generator attribution, sentence-level segmentation to differentiate between human-AI collaborative text, and adversarial attacks aimed at reducing the detectability of machine-generated text. We introduce a new work called BMAS English: an English language dataset for binary classification of human and machine text, for multiclass classification, which not only identifies machine-generated text but can also try to determine its generator, and Adversarial attack addressing where it is a common act for the mitigation of detection, and Sentence-level segmentation, for predicting the boundaries between human and machine-generated text. We believe that this paper will address previous work in Machine-Generated Text Detection (MGTD) in a more meaningful way.
[69] Context Parametrization with Compositional Adapters
Josip Jukić,Martin Tutek,Jan Šnajder
Main category: cs.CL
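TL;DR: This paper proposes CompAs, a meta-learning framework that compositionally translates context into adapter parameters, enabling efficient, flexible, and reversible in-context learning and addressing the efficiency and stability problems that conventional in-context learning and fine-tuning face with many inputs.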
Details
Motivation: Existing in-context learning (ICL) and supervised fine-tuning (SFT) are either inefficient or inflexible when handling many demonstrations, and they lack an effective mechanism for integrating multiple chunks of context.
Method: Proposes the CompAs framework, which uses meta-learning to turn context such as instructions, demonstrations, or retrieved passages into adapter parameters with a compositional structure that can be merged algebraically, and designs a decoder that allows the context to be recovered.
Result: On diverse multiple-choice and extractive question answering tasks, CompAs outperforms ICL and prior generator-based methods, especially as the number of inputs grows. It also lowers inference cost, improves long-context stability, and handles inputs that exceed the model's context window.
Conclusion: CompAs offers a practical and efficient approach to composable adapter generation for LLM deployment and is a strong alternative for scaling in-context learning.
Abstract: Large language models (LLMs) often seamlessly adapt to new tasks through in-context learning (ICL) or supervised fine-tuning (SFT). However, both of these approaches face key limitations: ICL is inefficient when handling many demonstrations, and SFT incurs training overhead while sacrificing flexibility. Mapping instructions or demonstrations from context directly into adapter parameters offers an appealing alternative. While prior work explored generating adapters based on a single input context, it has overlooked the need to integrate multiple chunks of information. To address this gap, we introduce CompAs, a meta-learning framework that translates context into adapter parameters with a compositional structure. Adapters generated this way can be merged algebraically, enabling instructions, demonstrations, or retrieved passages to be seamlessly combined without reprocessing long prompts. Critically, this approach yields three benefits: lower inference cost, robustness to long-context instability, and establishes a principled solution when input exceeds the model's context window. Furthermore, CompAs encodes information into adapter parameters in a reversible manner, enabling recovery of input context through a decoder, facilitating safety and security. Empirical results on diverse multiple-choice and extractive question answering tasks show that CompAs outperforms ICL and prior generator-based methods, especially when scaling to more inputs. Our work establishes composable adapter generation as a practical and efficient alternative for scaling LLM deployment.
[70] When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance
Nicolas Boizard,Hippolyte Gisserot-Boukhlef,Kevin El-Haddad,Céline Hudelot,Pierre Colombo
Main category: cs.CL
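TL;DR: Using a synthetic data distillation framework, this study systematically compares instruction fine-tuned (IFT) and reasoning models of different sizes on math and general-purpose tasks. It finds that reasoning consistently improves performance and outperforms IFT especially for larger models and open-ended tasks.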
Details
Motivation: To examine at which tasks and model scales reasoning is effective, and what it costs in training and inference, filling a gap in existing research.
Method: Uses a synthetic data distillation framework to run a large-scale supervised study of IFT and reasoning models of varying sizes, covering a range of math and general-purpose tasks and evaluating both multiple-choice and open-ended formats.
Result: Reasoning models consistently improve performance across tasks, often matching or surpassing much larger IFT models. As model size grows, reasoning models break through the performance limits of IFT on reasoning-intensive and open-ended tasks.
Conclusion: Although IFT retains an advantage in training and inference cost, reasoning models become markedly more valuable as scale increases, especially for complex and open-ended tasks.
Abstract: Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite its empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we rely on a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes, on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.
[71] The Outputs of Large Language Models are Meaningless
Anandi Hattiangadi,Anders J. Schoubye
Main category: cs.CL
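TL;DR: This paper argues that the outputs of large language models (LLMs) are essentially meaningless because LLMs lack the intentions needed to confer meaning, and it explains why those outputs nevertheless seem meaningful and can transmit knowledge.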
Details
Motivation: To examine whether LLM outputs genuinely have semantic meaning and to clear up philosophical misconceptions about AI language understanding.
Method: A philosophical argument built on two premises: (a) meaning requires certain kinds of intentions, and (b) LLMs cannot have such intentions; the paper also answers possible objections from semantic externalism and semantic internalism.
Result: The argument shows that LLM outputs lack literal meaning, while explaining why they seem meaningful and how they can still be used to acquire true beliefs and knowledge.
Conclusion: LLM outputs are meaningless, yet because of how humans interpret and use them they retain cognitive value in appearance and function.
Abstract: In this paper, we offer a simple argument for the conclusion that the outputs of large language models (LLMs) are meaningless. Our argument is based on two key premises: (a) that certain kinds of intentions are needed in order for LLMs' outputs to have literal meanings, and (b) that LLMs cannot plausibly have the right kinds of intentions. We defend this argument from various types of responses, for example, the semantic externalist argument that deference can be assumed to take the place of intentions and the semantic internalist argument that meanings can be defined purely in terms of intrinsic relations between concepts, such as conceptual roles. We conclude the paper by discussing why, even if our argument is sound, the outputs of LLMs nevertheless seem meaningful and can be used to acquire true beliefs and even knowledge.
[72] Question-Driven Analysis and Synthesis: Building Interpretable Thematic Trees with LLMs for Text Clustering and Controllable Generation
Tiago Fernandes Tavares
Main category: cs.CL
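TL;DR: Proposes Recursive Thematic Partitioning (RTP), an LLM-based framework that builds interpretable thematic trees from natural language questions, improving the interpretability and usefulness of unsupervised text corpus analysis.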
Details
Motivation: Traditional topic models perform poorly in data-scarce domains, and the keyword-based topics they produce lack semantic coherence and need substantial manual interpretation, leaving an interpretability gap.
Method: Proposes the Recursive Thematic Partitioning (RTP) framework, which uses LLMs to build a binary tree in which each node is a natural language question that semantically partitions the data, yielding an interpretable thematic taxonomy.
Result: Experiments show that RTP's question-driven hierarchy is more interpretable than keyword-based topics from baselines such as BERTopic and works as an effective feature set for downstream classification; its thematic paths can also serve as structured prompts for generative models.
Conclusion: RTP shifts the paradigm from statistical pattern discovery to knowledge-driven thematic analysis, combining analysis and synthesis and improving the interpretability and practical value of unsupervised text analysis.
Abstract: Unsupervised analysis of text corpora is challenging, especially in data-scarce domains where traditional topic models struggle. While these models offer a solution, they typically describe clusters with lists of keywords that require significant manual effort to interpret and often lack semantic coherence. To address this critical interpretability gap, we introduce Recursive Thematic Partitioning (RTP), a novel framework that leverages Large Language Models (LLMs) to interactively build a binary tree. Each node in the tree is a natural language question that semantically partitions the data, resulting in a fully interpretable taxonomy where the logic of each cluster is explicit. Our experiments demonstrate that RTP's question-driven hierarchy is more interpretable than the keyword-based topics from a strong baseline like BERTopic. Furthermore, we establish the quantitative utility of these clusters by showing they serve as powerful features in downstream classification tasks, particularly when the data's underlying themes correlate with the task labels. RTP introduces a new paradigm for data exploration, shifting the focus from statistical pattern discovery to knowledge-driven thematic analysis. Finally, we demonstrate that the thematic paths from the RTP tree can serve as structured, controllable prompts for generative models. This transforms our analytical framework into a powerful tool for synthesis, enabling the consistent imitation of specific characteristics discovered in the source corpus.
[73] StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
Yuhan Song,Linhao Zhang,Chuhan Wu,Aiwei Liu,Wei Jia,Houfeng Wang,Xiao Zhou
Main category: cs.CL
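TL;DR: This paper proposes StableToken, a new semantic speech tokenizer that uses multi-branch parallel processing and a bit-wise voting mechanism to substantially improve tokenization stability under noise, reducing Unit Edit Distance and making downstream SpeechLLMs more robust.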
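The bit-wise voting idea named in this entry can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not StableToken's implementation: it assumes each branch emits a token as a fixed-width integer code and takes a plain per-bit majority over an odd number of branches. The function name `bitwise_majority_vote` and the 4-bit width are hypothetical.

```python
def bitwise_majority_vote(branch_tokens, num_bits):
    """Merge per-branch integer tokens by majority vote on each bit.

    Assumption (illustrative only): every branch emits one token per
    frame, encoded as an integer with `num_bits` bits.
    """
    if len(branch_tokens) % 2 == 0:
        raise ValueError("use an odd number of branches to avoid ties")
    merged = 0
    for bit in range(num_bits):
        # Count how many branches have this bit set.
        ones = sum((tok >> bit) & 1 for tok in branch_tokens)
        if ones * 2 > len(branch_tokens):
            merged |= 1 << bit
    return merged


# Three hypothetical branches; the second has bit 0 flipped by noise.
clean = 0b1010
noisy = 0b1011
voted = bitwise_majority_vote([clean, noisy, clean], num_bits=4)
```

In this toy case a single-bit disagreement in one branch is outvoted by the other two, which is the kind of stability a consensus mechanism is meant to provide.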
Details
Motivation: Existing semantic speech tokenizers are highly sensitive to meaning-irrelevant acoustic perturbations; even at high signal-to-noise ratios their output can change drastically, which hurts the learning efficiency of downstream language models.
Method: Proposes StableToken, which uses a multi-branch parallel architecture and a consensus mechanism based on bit-wise voting to produce a stable, consistent token sequence.
Result: StableToken substantially reduces Unit Edit Distance (UED) under a variety of noise conditions, achieves state-of-the-art token stability, and improves the performance of SpeechLLMs on a range of tasks.
Conclusion: By introducing a consensus-driven mechanism, StableToken addresses the instability of conventional tokenizers under acoustic perturbation and provides a more reliable foundation for robust training of speech-to-language models.
Abstract: Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.
[74] Thinking in Many Modes: How Composite Reasoning Elevates Large Language Model Performance with Limited Data
Zishan Ahmad,Saisubramaniam Gopalakrishnan
Main category: cs.CL
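TL;DR: This paper proposes Composite Reasoning (CR), a new approach that lets large language models dynamically combine multiple reasoning styles, such as deductive, inductive, and abductive reasoning, to perform better on complex problems.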
Details
Motivation: Existing LLMs rely on a single dominant reasoning paradigm, which makes it hard for them to handle complex problems that call for diverse cognitive strategies and limits their reasoning ability.
Method: Proposes the Composite Reasoning (CR) framework, which lets a model dynamically explore and combine multiple reasoning styles (such as deductive, inductive, and abductive) and adaptively reweight those styles according to the task domain.
Result: On scientific and medical question-answering benchmarks, CR outperforms Chain-of-Thought (CoT) and DeepSeek-R1 style reasoning, with higher accuracy, better sample efficiency, and reasonable token usage.
Conclusion: By cultivating diversity in internal reasoning styles, LLMs gain more robust, adaptive, and efficient problem-solving abilities.
Abstract: Large Language Models (LLMs), despite their remarkable capabilities, rely on singular, predominant reasoning paradigms, hindering their performance on intricate problems that demand diverse cognitive strategies. To address this, we introduce Composite Reasoning (CR), a novel reasoning approach empowering LLMs to dynamically explore and combine multiple reasoning styles like deductive, inductive, and abductive for more nuanced problem-solving. Evaluated on scientific and medical question-answering benchmarks, our approach outperforms existing baselines like Chain-of-Thought (CoT) and also surpasses the accuracy of DeepSeek-R1 style reasoning (SR) capabilities, while demonstrating superior sample efficiency and adequate token usage. Notably, CR adaptively emphasizes domain-appropriate reasoning styles. It prioritizes abductive and deductive reasoning for medical question answering, but shifts to causal, deductive, and inductive methods for scientific reasoning. Our findings highlight that by cultivating internal reasoning style diversity, LLMs acquire more robust, adaptive, and efficient problem-solving abilities.
[75] In Their Own Words: Reasoning Traces Tailored for Small Models Make Them Better Reasoners
Jaehoon Kim,Kwangwook Seo,Dongha Lee
Main category: cs.CL
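TL;DR: This paper proposes Reverse Speculative Decoding (RSD) to address the performance drop caused by distribution mismatch when transferring reasoning ability from large language models to small ones. Experiments show that RSD significantly improves the reasoning performance of small models.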
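The acceptance rule this entry describes (the teacher proposes a candidate token, the student accepts or rejects it based on its own probability) can be sketched for a single decoding step. This is a toy illustration, not the authors' implementation: the probability tables, the `threshold` value, and the fallback of taking the student's own most likely token on rejection are all assumptions, and `rsd_step` is a hypothetical name.

```python
def rsd_step(teacher_probs, student_probs, threshold):
    """One decoding step of a toy reverse-speculative scheme.

    teacher_probs / student_probs: dict mapping token -> probability for
    the next position. The threshold and the fallback rule below are
    assumptions made for this sketch.
    """
    # The teacher proposes its most likely next token.
    proposal = max(teacher_probs, key=teacher_probs.get)
    # The student accepts only if the token is not low probability for it.
    if student_probs.get(proposal, 0.0) >= threshold:
        return proposal, True
    # Assumed fallback: the student uses its own most likely token.
    fallback = max(student_probs, key=student_probs.get)
    return fallback, False


# Hypothetical next-token distributions for one position.
teacher = {"ergo": 0.6, "so": 0.3, "thus": 0.1}
student = {"ergo": 0.005, "so": 0.7, "thus": 0.295}
token, accepted = rsd_step(teacher, student, threshold=0.01)
```

Here the teacher's preferred token is rare under the student's distribution, so it is filtered out and the resulting trace stays within what the student already finds probable.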
Details
Motivation: Supervised fine-tuning often fails when transferring a large model's reasoning traces to a small model, because the large model generates tokens that are low probability for the small model and exceed its representation capacity, creating learning barriers.
Method: Proposes Reverse Speculative Decoding (RSD), in which the teacher model proposes candidate tokens and the student model decides whether to accept them based on its own distribution, producing reasoning traces better suited to the student.
Result: On Qwen3-0.6B, fine-tuning on the original reasoning traces lowers performance by 20.5%, while fine-tuning on RSD-generated traces yields a 4.9% improvement. Analysis shows that low-probability tokens are the transfer bottleneck and that RSD traces are model-specific.
Conclusion: Distributional alignment is critical for transferring reasoning ability; by letting the student model drive token selection, RSD mitigates distribution mismatch and offers a new direction for model distillation.
Abstract: Transferring reasoning capabilities from larger language models to smaller ones through supervised fine-tuning often fails counterintuitively, with performance degrading despite access to high-quality teacher demonstrations. We identify that this failure stems from distributional misalignment: reasoning traces from larger models contain tokens that are low probability under the student's distribution, exceeding the internal representation capacity of smaller architectures and creating learning barriers rather than helpful guidance. We propose Reverse Speculative Decoding (RSD), a mechanism for generating student-friendly reasoning traces in which the teacher model proposes candidate tokens but the student model determines acceptance based on its own probability distributions, filtering low probability tokens. When applied to Qwen3-0.6B, direct distillation of s1K-1.1 reasoning trace data degrades average performance across major reasoning benchmarks by 20.5%, while the same model trained on RSD-generated reasoning traces achieves meaningful improvements of 4.9%. Our analysis reveals that low probability tokens constitute the critical bottleneck in reasoning ability transfer. However, cross-model experiments demonstrate that RSD traces are model-specific rather than universally applicable, indicating that distributional alignment must be tailored for each student architecture's unique internal representation.
[76] FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding
Haorui Chen,Chengze Li,Jia Li
Main category: cs.CL
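TL;DR: This paper proposes FeatBench, a new code generation benchmark focused on feature implementation in the "vibe coding" paradigm; with pure natural language prompts, a rigorous data collection process, and comprehensive test cases, it reveals major challenges for current LLMs in feature implementation.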
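The two kinds of test named in this entry, Fail-to-Pass (F2P) and Pass-to-Pass (P2P), can be combined into a single success check. The sketch below is an assumption about how such a check could look, not FeatBench's actual harness: it treats a task as resolved only when every F2P test passes after the agent's change and every P2P test still passes. The function name and test names are hypothetical.

```python
def is_resolved(results_after, f2p_tests, p2p_tests):
    """Decide whether a feature task counts as resolved.

    results_after: dict mapping test name -> True if the test passes
    after the agent's patch. Missing tests count as failures.
    This resolution rule is an assumption made for illustration.
    """
    # F2P tests failed before the patch and must pass now (feature works).
    feature_ok = all(results_after.get(t, False) for t in f2p_tests)
    # P2P tests passed before the patch and must still pass (no regression).
    no_regression = all(results_after.get(t, False) for t in p2p_tests)
    return feature_ok and no_regression


# Hypothetical run: the new feature works but an existing test broke.
results = {"test_export_csv": True, "test_login": False, "test_search": True}
resolved = is_resolved(
    results,
    f2p_tests=["test_export_csv"],
    p2p_tests=["test_login", "test_search"],
)
```

Under this rule, a patch that implements the feature but breaks existing behavior does not count as a success.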
Details
Motivation: Existing code generation benchmarks do not effectively assess how well LLMs implement features under high-level natural language interaction ("vibe coding"); most rely on code-level prompts or are limited to issue fixing, and they do not evaluate feature extension in realistic settings.
Method: Designs the FeatBench benchmark, which includes pure natural language task descriptions, a data collection process with multi-level filtering and automated evolution, F2P and P2P tests, and real project repositories from multiple domains.
Result: Experiments with two state-of-the-art agent frameworks and four leading LLMs show a highest success rate of only 29.94%, exposing serious shortcomings in feature implementation; the "aggressive implementation" strategy is observed to bring both design advantages and failure risks.
Conclusion: FeatBench fills the gap in evaluating feature implementation for vibe coding, reveals key limitations of current code generation systems, and provides an extensible benchmark and public resources for future research.
Abstract: The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as "vibe coding," where users interact with coding agents through high-level natural language. However, existing evaluation benchmarks for code generation inadequately assess an agent's vibe coding capabilities. Existing benchmarks are misaligned, as they either require code-level specifications or focus narrowly on issue-solving, neglecting the critical scenario of feature implementation within the vibe coding paradigm. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: 1. Pure Natural Language Prompts. Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. 2. A Rigorous & Evolving Data Collection Process. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination. 3. Comprehensive Test Cases. Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. 4. Diverse Application Domains. The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%. Our analysis also reveals a tendency for "aggressive implementation," a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research.
[77] FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction
Yuan Ge,Saihan Chen,Jingqi Xiao,Xiaoqian Liu,Tong Xiao,Yan Xiang,Zhengtao Yu,Jingbo Zhu
Main category: cs.CL
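TL;DR: FLEXI is the first benchmark for full-duplex speech LLMs that covers model interruption in emergency scenarios; it evaluates the latency, quality, and conversational effectiveness of real-time dialogue, reveals gaps between open-source and commercial models, and suggests a possible direction for next-generation full-duplex interaction.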
Details
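Motivation: Existing full-duplex speech LLMs lack a systematic evaluation of model interruption in emergencies, which makes natural, human-like interaction hard to achieve; a benchmark with realistic human interaction scenarios is needed to measure their performance.
Method: Proposes the FLEXI benchmark, which contains six diverse human-LLM speech interaction scenarios, explicitly introduces model interruption in emergency situations, evaluates latency, generation quality, and conversational effectiveness, and analyzes the performance differences between open-source and commercial models.
Result: FLEXI reveals significant weaknesses in current models' emergency awareness, turn termination, and interaction latency; open-source models in particular lag clearly behind commercial ones, and real-time interruption response leaves considerable room for improvement overall.
Conclusion: FLEXI provides the first evaluation framework for emergency interruption in full-duplex speech LLMs and pushes the field toward standardization; the study suggests next token-pair prediction as a way to achieve smoother, more human-like real-time dialogue.
Abstract: Full-Duplex Speech-to-Speech Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling real-time spoken dialogue systems. However, benchmarking and modeling these models remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex LLM-human spoken interaction that explicitly incorporates model interruption in emergency scenarios. FLEXI systematically evaluates the latency, quality, and conversational effectiveness of real-time dialogue through six diverse human-LLM interaction scenarios, revealing significant gaps between open source and commercial models in emergency awareness, turn terminating, and interaction latency. Finally, we suggest that next token-pair prediction offers a promising path toward achieving truly seamless and human-like full-duplex interaction.
[78] Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance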
Wenbin Hu,Huihao Jing,Haochen Shi,Haoran Li,Yangqiu Song
Main category: cs.CL
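TL;DR: This paper addresses the safety of large language models (LLMs) from a legal compliance perspective, introducing the concept of "safety compliance" and building a new safety compliance benchmark based on the EU AI Act and GDPR. A Compliance Reasoner trained with Group Policy Optimization (GRPO) aligns effectively with legal standards and substantially improves compliance performance on the new benchmark.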
Details
Motivation: Existing LLM safety methods lack rigor and systematic coverage and rely on informal taxonomies, so they struggle with the complex and varied risks of modern LLMs. An authoritative, measurable framework is needed to strengthen LLM safety.
Method: Takes legal frameworks such as the EU AI Act and GDPR as safety standards and builds a safety compliance benchmark from realistic scenarios and legal statutes; aligns Qwen3-8B with Group Policy Optimization (GRPO) to build a safety reasoning model called Compliance Reasoner.
Result: Compliance Reasoner performs strongly on the new safety compliance benchmark, improving over baselines by an average of 10.45% on EU AI Act tasks and 11.85% on GDPR tasks.
Conclusion: Bringing legal compliance into LLM safety is an effective and systematic route, and the proposed Compliance Reasoner substantially improves adherence to legal standards, offering a workable approach to building safer, lawful LLM systems.
Abstract: The proliferation of Large Language Models (LLMs) has demonstrated remarkable capabilities, elevating the critical importance of LLM safety. However, existing safety methods rely on ad-hoc taxonomies and lack rigorous, systematic protection, failing to ensure safety for the nuanced and complex behaviors of modern LLM systems. To address this problem, we approach LLM safety from a legal compliance perspective, which we term safety compliance. In this work, we posit relevant established legal frameworks as safety standards for defining and measuring safety compliance, including the EU AI Act and GDPR, which serve as core legal frameworks for AI safety and data security in Europe. To bridge the gap between LLM safety and legal compliance, we first develop a new benchmark for safety compliance by generating realistic LLM safety scenarios seeded with legal statutes. Subsequently, we align Qwen3-8B using Group Policy Optimization (GRPO) to construct a safety reasoner, Compliance Reasoner, which effectively aligns LLMs with legal standards to mitigate safety risks. Our comprehensive experiments demonstrate that the Compliance Reasoner achieves superior performance on the new benchmark, with average improvements of +10.45% for the EU AI Act and +11.85% for GDPR.
[79] Beyond Textual Context: Structural Graph Encoding with Adaptive Space Alignment to alleviate the hallucination of LLMs
Yifang Zhang,Pengfei Duan,Yiwen Yang,Shengwu Xiong
Main category: cs.CL
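TL;DR: Proposes SSKG-LLM, a model that strengthens the factual reasoning of large language models by integrating both the structural and semantic information of knowledge graphs.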
Details
Motivation: Existing methods treat knowledge graphs as plain text and ignore their structural information, and the representation spaces of knowledge graph embeddings and LLMs differ, so knowledge is integrated incompletely.
Method: Designs the SSKG-LLM architecture with three modules, Knowledge Graph Retrieval (KGR), Knowledge Graph Encoding (KGE), and Knowledge Graph Adaptation (KGA), to preserve semantics, exploit structure, and align the embedding spaces.
Result: Experiments show that incorporating the structural information of knowledge graphs effectively improves LLM performance on factual reasoning tasks.
Conclusion: By integrating the structural and semantic information of knowledge graphs, SSKG-LLM markedly improves the ability of LLMs to handle hallucination and factual reasoning.
Abstract: Currently, the main approach for Large Language Models (LLMs) to tackle the hallucination issue is incorporating Knowledge Graphs (KGs). However, LLMs typically treat KGs as plain text, extracting only semantic information and limiting their use of the crucial structural aspects of KGs. Another challenge is the gap between the embedding spaces of KG encoders and LLM text embeddings, which hinders the effective integration of structured knowledge. To overcome these obstacles, we put forward the SSKG-LLM, an innovative model architecture that is designed to efficiently integrate both the Structural and Semantic information of KGs into the reasoning processes of LLMs. SSKG-LLM incorporates the Knowledge Graph Retrieval (KGR) module and the Knowledge Graph Encoding (KGE) module to preserve semantics while utilizing structure. Then, the Knowledge Graph Adaptation (KGA) module is incorporated to enable LLMs to understand KG embeddings. We conduct extensive experiments and provide a detailed analysis to explore how incorporating the structural information of KGs can enhance the factual reasoning abilities of LLMs. Our code is available at https://github.com/yfangZhang/SSKG-LLM.
[80] Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?
Yifan Wang,Mayank Jobanputra,Ji-Ung Lee,Soyoung Oh,Isabel Valera,Vera Demberg
Main category: cs.CL
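TL;DR: This is the first systematic study of the relationship between explainability and fairness in hate speech detection; it finds that input-based explanations can effectively detect biased predictions and help reduce bias during training, but are unreliable for model selection.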
Details
Motivation: NLP models often replicate or amplify social bias from their training data, and their black-box nature makes it hard for users to spot biased predictions and for developers to mitigate them, so it is worth studying whether explainability can help improve fairness.
Method: Runs a large-scale quantitative analysis of encoder-only and decoder-only models, examining how explainability performs in three areas: identifying biased predictions, selecting fair models, and mitigating bias during training.
Result: Input-based explanations effectively identify biased predictions and can serve as a supervision signal that reduces bias during training, but they are unreliable for choosing the fairer model among several candidates.
Conclusion: Explainability tools show promise for detecting and mitigating bias, but they cannot be relied on alone for fairness-based model selection and should be combined with other evaluation methods.
Abstract: Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates.
[81] Advancing Natural Language Formalization to First Order Logic with Fine-tuned LLMs
Felix Vossel,Till Mossakowski,Björn Gehrke
Main category: cs.CL
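TL;DR: This paper systematically evaluates fine-tuned large language models (LLMs) on automatically translating natural language into first-order logic (FOL), finding that Flan-T5-XXL reaches 70% accuracy when given predicate lists, outperforming models such as GPT-4o and DeepSeek-R1, and that predicate extraction is the main bottleneck.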
Details
Motivation: Automatically translating natural language into first-order logic (FOL) is crucial for knowledge representation and formal methods, but existing approaches still face challenges, so a systematic evaluation of current LLMs on this task is needed.
Method: Fine-tunes models of different architectures (encoder-decoder vs. decoder-only) on the MALLS and Willow datasets and compares their performance; applies strategies such as vocabulary extension, predicate conditioning, and multilingual training, and introduces metrics for exact match, logical equivalence, and predicate alignment.
Result: Flan-T5-XXL reaches 70% accuracy when given predicate lists, surpassing GPT-4o, DeepSeek-R1-0528 (with chain-of-thought reasoning), and symbolic systems such as ccg2lambda; T5 models outperform larger decoder-only models; models generalize to a new dataset (FOLIO) without specific training; predicate availability improves performance by 15-20%.
Conclusion: Fine-tuned encoder-decoder models perform best on FOL translation and structural logic translation is robust, while predicate extraction remains the key bottleneck and should be the focus of future work.
Abstract: Automating the translation of natural language to first-order logic (FOL) is crucial for knowledge representation and formal methods, yet remains challenging. We present a systematic evaluation of fine-tuned LLMs for this task, comparing architectures (encoder-decoder vs. decoder-only) and training strategies. Using the MALLS and Willow datasets, we explore techniques like vocabulary extension, predicate conditioning, and multilingual training, introducing metrics for exact match, logical equivalence, and predicate alignment. Our fine-tuned Flan-T5-XXL achieves 70% accuracy with predicate lists, outperforming GPT-4o and even the DeepSeek-R1-0528 model with CoT reasoning ability as well as symbolic systems like ccg2lambda. Key findings show: (1) predicate availability boosts performance by 15-20%, (2) T5 models surpass larger decoder-only LLMs, and (3) models generalize to unseen logical arguments (FOLIO dataset) without specific training. While structural logic translation proves robust, predicate extraction emerges as the main bottleneck.
[82] Transformers Can Learn Connectivity in Some Graphs but Not Others
Amit Roy,Abulhair Saparov
Main category: cs.CL
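TL;DR: This study examines whether Transformer models can learn from training data to infer transitive relations (connectivity) in directed graphs; models do well on low-dimensional grid-like graphs and generalize better as they scale, but perform poorly on higher-dimensional grids and on non-grid graphs with many disconnected components.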
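The equivalence this entry relies on, that inferring a transitive relation is the same as deciding connectivity in a directed graph, can be made concrete with a reference check. This is a minimal sketch of the ground-truth task (a breadth-first search), not of the paper's data generation or training setup; the function name `reachable` and the example edges are illustrative.

```python
from collections import deque


def reachable(edges, source, target):
    """Return True if a directed path leads from source to target."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# "A causes B" and "B causes C" as directed edges, plus a separate component.
edges = [("A", "B"), ("B", "C"), ("D", "E")]
a_to_c = reachable(edges, "A", "C")  # transitive: A -> B -> C
c_to_a = reachable(edges, "C", "A")  # direction matters
a_to_e = reachable(edges, "A", "E")  # different component
```

A classical search answers these queries exactly; the study asks whether a Transformer can learn to answer them from training examples alone.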
Details
Motivation: To investigate whether Transformer models can learn transitive relation inference from training data, and to analyze how model scale and graph structure affect performance.
Method: Generates directed graphs with different structures, trains Transformer models of different sizes, and evaluates their ability to infer connectivity across graph sizes.
Result: Transformers learn connectivity effectively on low-dimensional grid graphs, and learning gets harder as dimensionality grows; larger models generalize better; performance is poor on non-grid graphs with many disconnected components.
Conclusion: Transformers can learn transitive relations on graphs with specific structure (such as low-dimensional grids), and model scale and the structural complexity of the graph are the key factors affecting this ability.
Abstract: Reasoning capability is essential to ensure the factual correctness of the responses of transformer-based Large Language Models (LLMs), and robust reasoning about transitive relations is instrumental in many settings, such as causal inference. Hence, it is essential to investigate the capability of transformers in the task of inferring transitive relations (e.g., knowing A causes B and B causes C, then A causes C). The task of inferring transitive relations is equivalent to the task of connectivity in directed graphs (e.g., knowing there is a path from A to B, and there is a path from B to C, then there is a path from A to C). Past research focused on whether transformers can learn to infer transitivity from in-context examples provided in the input prompt. However, transformers' capability to infer transitive relations from training examples and how scaling affects the ability is unexplored. In this study, we seek to answer this question by generating directed graphs to train transformer models of varying sizes and evaluate their ability to infer transitive relations for various graph sizes. Our findings suggest that transformers are capable of learning connectivity on "grid-like" directed graphs where each node can be embedded in a low-dimensional subspace, and connectivity is easily inferable from the embeddings of the nodes. We find that the dimensionality of the underlying grid graph is a strong predictor of transformers' ability to learn the connectivity task, where higher-dimensional grid graphs pose a greater challenge than low-dimensional grid graphs. In addition, we observe that increasing the model scale leads to increasingly better generalization to infer connectivity over grid graphs. However, if the graph is not a grid graph and contains many disconnected components, transformers struggle to learn the connectivity task, especially when the number of components is large.
[83] The InviTE Corpus: Annotating Invectives in Tudor English Texts for Computational Modeling
Sophie Spliethoff,Sanne Hoeken,Silke Schwandt,Sina Zarrieß,Özge Alaçam
Main category: cs.CL
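TL;DR: This paper presents an approach to applying natural language processing to historical research, specifically the study of religious invective language in Tudor England. The authors describe an iterative workflow from raw data to expert annotation and release the InviTE corpus of almost 2000 Early Modern English sentences. Comparing fine-tuned BERT models with zero-shot prompted large language models, they find that models pre-trained on historical data and then fine-tuned are better at identifying invective language.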
Details
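Motivation: Historical research on the language of religious polemic in 16th-century England requires systematic identification and analysis of invective language, but existing NLP tools are limited when applied to historical texts.
Method: Proposes a workflow covering data pre-processing, selection, and iterative manual annotation, and builds the InviTE corpus; compares fine-tuned BERT models with zero-shot prompted instruction-tuned LLMs on identifying religious invective language in Early Modern English.
Result: The InviTE corpus of almost 2000 annotated sentences was built; experiments show that BERT models pre-trained on historical text and further fine-tuned clearly outperform zero-shot LLMs at detecting invective language.
Conclusion: For specific linguistic phenomena in historical texts (such as religious invectives), NLP models that combine domain-adapted pre-training with supervised fine-tuning work best, which confirms the value of specialized annotated corpora and tailored models in digital humanities research.
Abstract: In this paper, we aim at the application of Natural Language Processing (NLP) techniques to historical research endeavors, particularly addressing the study of religious invectives in the context of the Protestant Reformation in Tudor England. We outline a workflow spanning from raw data, through pre-processing and data selection, to an iterative annotation process. As a result, we introduce the InviTE corpus -- a corpus of almost 2000 Early Modern English (EModE) sentences, which are enriched with expert annotations regarding invective language throughout 16th-century England. Subsequently, we assess and compare the performance of fine-tuned BERT-based models and zero-shot prompted instruction-tuned large language models (LLMs), which highlights the superiority of models pre-trained on historical data and fine-tuned to invective detection.
[84] Conversational Implicatures: Modelling Relevance Theory Probabilistically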
Christoph Unger,Hendrik Buschmeier
Main category: cs.CL
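TL;DR: This paper explores how a Bayesian approach can be applied to relevance-theoretic pragmatics, specifically by studying how conversational implicatures communicate implicit meaning.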
Details
Motivation: The application of Bayesian probability theory to cognitive science, together with new tools for probabilistic computation, has led to a "probabilistic turn" in pragmatics and semantics, prompting researchers to explore whether Bayesian methods fit relevance theory.
Method: Uses the framework of Rational Speech Act theory together with Bayesian modeling to analyze paradigmatic pragmatic phenomena such as conversational implicatures.
Result: Shows the potential of Bayesian methods for modeling relevance-theoretic pragmatic phenomena, particularly the communication of implicit meaning.
Conclusion: Bayesian methods offer a promising modeling tool for relevance-theoretic pragmatics and can deepen understanding of how implicit meaning is produced and interpreted.
Abstract: Recent advances in Bayesian probability theory and its application to cognitive science in combination with the development of a new generation of computational tools and methods for probabilistic computation have led to a 'probabilistic turn' in pragmatics and semantics. In particular, the framework of Rational Speech Act theory has been developed to model broadly Gricean accounts of pragmatic phenomena in Bayesian terms, starting with fairly simple reference games and covering ever more complex communicative exchanges such as verbal syllogistic reasoning. This paper explores in which way a similar Bayesian approach might be applied to relevance-theoretic pragmatics (Sperber & Wilson, 1995) by studying a paradigmatic pragmatic phenomenon: the communication of implicit meaning by way of (conversational) implicatures.
[85] CHRONOBERG: Capturing Language Evolution and Temporal Awareness in Foundation Models
Niharika Hegde,Subarnaduti Paul,Lars Joel-Frey,Manuel Brack,Kristian Kersting,Martin Mundt,Patrick Schramowski
Main category: cs.CL
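TL;DR: This paper introduces CHRONOBERG, a temporally structured corpus of English book texts spanning 250 years, intended to support the analysis and training of language models on diachronic language change. The corpus is drawn from Project Gutenberg and enriched with several kinds of temporal annotations; it can be used to quantify lexical semantic change over time and to build historically calibrated affective lexicons. The study shows that existing language models struggle to capture diachronic semantic shifts, underscoring the need for temporally aware training and evaluation methods.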
Details
Motivation: Existing corpora lack long-term temporal structure, which limits the ability of LLMs to contextualize the semantic and normative evolution of language and to capture diachronic variation. A long-span corpus with rich temporal annotations is needed to support this research.
Method: Builds CHRONOBERG, an English book corpus drawn from Project Gutenberg and covering 250 years of text; adds temporal annotations and uses the edited nature of books to run time-sensitive Valence-Arousal-Dominance (VAD) analysis and build historically calibrated affective lexicons for studying semantic change.
Result: The CHRONOBERG corpus and time-stamped affective lexicons were built; experiments show that current LLMs trained sequentially on the corpus struggle to encode diachronic shifts in meaning, revealing their limits in cross-period semantic understanding.
Conclusion: CHRONOBERG provides a scalable resource for studying diachronic language change and temporal generalization; the study stresses the need for temporally aware training and evaluation frameworks so that language models can better interpret meaning and detect bias across historical contexts.
Abstract: Large language models (LLMs) excel at operating at scale by leveraging social media and various data crawled from the web. Whereas existing corpora are diverse, their frequent lack of long-term temporal structure may however limit an LLM's ability to contextualize semantic and normative evolution of language and to capture diachronic variation. To support analysis and training for the latter, we introduce CHRONOBERG, a temporally structured corpus of English book texts spanning 250 years, curated from Project Gutenberg and enriched with a variety of temporal annotations. First, the edited nature of books enables us to quantify lexical semantic change through time-sensitive Valence-Arousal-Dominance (VAD) analysis and to construct historically calibrated affective lexicons to support temporally grounded interpretation. With the lexicons at hand, we demonstrate a need for modern LLM-based tools to better situate their detection of discriminatory language and contextualization of sentiment across various time-periods. In fact, we show how language models trained sequentially on CHRONOBERG struggle to encode diachronic shifts in meaning, emphasizing the need for temporally aware training and evaluation pipelines, and positioning CHRONOBERG as a scalable resource for the study of linguistic change and temporal generalization. Disclaimer: This paper includes language and display of samples that could be offensive to readers. Open Access: Chronoberg is available publicly on HuggingFace at (https://huggingface.co/datasets/spaul25/Chronoberg). Code is available at (https://github.com/paulsubarna/Chronoberg).
[86] Exploratory Semantic Reliability Analysis of Wind Turbine Maintenance Logs using Large Language Models
Max Malyi,Jonathan Shek,Andre Biscaya
Main category: cs.CL
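TL;DR: This paper proposes an exploratory framework that uses large language models (LLMs) for deep semantic analysis of wind turbine maintenance logs, going beyond conventional text classification to perform four tasks (failure mode identification, causal chain inference, comparative site analysis, and data quality auditing) and showing that LLMs can act as "reliability co-pilots" that generate expert-level, actionable hypotheses.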
Details
Motivation: The unstructured text in wind turbine maintenance logs holds rich operational intelligence that traditional reliability analysis cannot reach. Existing machine learning methods mostly stop at text classification and do not address complex reasoning tasks. This paper aims to fill the gap in using LLMs for deeper semantic understanding and reasoning.
Method: Proposes an LLM-based exploratory analysis framework and applies it to a large industrial dataset through four analytical workflows: failure mode identification, causal chain inference, comparative site analysis, and data quality auditing.
Result: The results show that LLMs can carry out complex semantic reasoning tasks, going beyond classification to synthesize information and generate expert-like, actionable hypotheses, which improves the extraction of operational insight from unstructured text.
Conclusion: The study contributes a novel and reproducible methodology for using LLMs as a reasoning tool, giving the wind energy sector a new way to surface insights that were previously hidden in unstructured maintenance logs.
Abstract: A wealth of operational intelligence is locked within the unstructured free-text of wind turbine maintenance logs, a resource largely inaccessible to traditional quantitative reliability analysis. While machine learning has been applied to this data, existing approaches typically stop at classification, categorising text into predefined labels. This paper addresses the gap in leveraging modern large language models (LLMs) for more complex reasoning tasks. We introduce an exploratory framework that uses LLMs to move beyond classification and perform deep semantic analysis. We apply this framework to a large industrial dataset to execute four analytical workflows: failure mode identification, causal chain inference, comparative site analysis, and data quality auditing. The results demonstrate that LLMs can function as powerful "reliability co-pilots," moving beyond labelling to synthesise textual information and generate actionable, expert-level hypotheses. This work contributes a novel and reproducible methodology for using LLMs as a reasoning tool, offering a new pathway to enhance operational intelligence in the wind energy sector by unlocking insights previously obscured in unstructured data.
[87] What Is The Political Content in LLMs' Pre- and Post-Training Data?
Tanise Ceron,Dmitry Nikolaev,Dominik Stammbach,Debora Nozza
Main category: cs.CL
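TL;DR: This paper analyzes the training data of the OLMO2 model and finds that left-leaning content predominates in the pre-training corpora and correlates closely with the model's political bias on policy issues, underscoring the need to include political content analysis in data curation for transparency.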
Details
Motivation: Large language models (LLMs) show political bias, but its causes remain unclear, and the political content of training data in particular has not been studied in depth. A systematic analysis of the political leaning of an open-source model's training data is therefore needed.
Method: The authors draw large random samples from the pre- and post-training corpora of OLMO2, automatically annotate documents for political orientation, analyze their source domains and content, and then assess how political stance in the training data correlates with the model's stance on specific policy issues.
Result: Left-leaning documents predominate across datasets; pre-training corpora contain more politically engaged content than post-training data; left- and right-leaning documents frame similar topics through different values and sources of legitimacy; the predominant stance in the training data correlates strongly with the political leaning of model outputs.
Conclusion: The political composition of training data has a marked effect on output bias; future data curation should include political content analysis and more detailed documentation of filtering strategies to improve transparency.
Abstract: Large language models (LLMs) are known to generate politically biased text, yet how such biases arise remains unclear. A crucial step toward answering this question is the analysis of training data, whose political content remains largely underexplored in current LLM research. To address this gap, we present in this paper an analysis of the pre- and post-training corpora of OLMO2, the largest fully open-source model released together with its complete dataset. From these corpora, we draw large random samples, automatically annotate documents for political orientation, and analyze their source domains and content. We then assess how political content in the training data correlates with models' stance on specific policy issues. Our analysis shows that left-leaning documents predominate across datasets, with pre-training corpora containing significantly more politically engaged content than post-training data. We also find that left- and right-leaning documents frame similar topics through distinct values and sources of legitimacy. Finally, the predominant stance in the training data strongly correlates with models' political biases when evaluated on policy issues. These findings underscore the need to integrate political content analysis into future data curation pipelines as well as in-depth documentation of filtering strategies for transparency.
[88] Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding
Ziheng Chi,Yifan Hou,Chenxi Pang,Shaobo Cui,Mubashara Akhtar,Mrinmaya Sachan
Main category: cs.CL
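TL;DR: This paper introduces Chimera, a test suite for assessing how well vision-language models (VLMs) genuinely understand diagrams, and shows that current models rely mainly on memorization and language shortcuts rather than real comprehension.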
Details
Motivation: Existing VLMs perform well on diagram-related benchmarks, but they may be relying on knowledge, reasoning, or modality shortcuts, which makes it hard to tell whether they truly understand diagram content.
Method: Builds the Chimera dataset of 7,500 high-quality Wikipedia diagrams, each annotated with semantic triples, with multi-level questions that assess entity recognition, relation understanding, knowledge grounding, and visual reasoning; evaluates 15 open-source VLMs for reliance on three shortcuts: visual memorization, knowledge recall, and Clever-Hans.
Result: The apparently strong performance of current VLMs stems largely from shortcut behavior: Clever-Hans shortcuts contribute significantly, knowledge recall plays a moderate role, and visual memorization has only a slight impact.
Conclusion: Current VLMs have serious limitations in diagram understanding, and stricter evaluation protocols are needed to measure genuine comprehension of complex visual inputs.
Abstract: Diagrams convey symbolic information in a visual format rather than a linear stream of words, making them especially challenging for AI models to process. While recent evaluations suggest that vision-language models (VLMs) perform well on diagram-related benchmarks, their reliance on knowledge, reasoning, or modality shortcuts raises concerns about whether they genuinely understand and reason over diagrams. To address this gap, we introduce Chimera, a comprehensive test suite comprising 7,500 high-quality diagrams sourced from Wikipedia; each diagram is annotated with its symbolic content represented by semantic triples along with multi-level questions designed to assess four fundamental aspects of diagram comprehension: entity recognition, relation understanding, knowledge grounding, and visual reasoning. We use Chimera to measure the presence of three types of shortcuts in visual question answering: (1) the visual-memorization shortcut, where VLMs rely on memorized visual patterns; (2) the knowledge-recall shortcut, where models leverage memorized factual knowledge instead of interpreting the diagram; and (3) the Clever-Hans shortcut, where models exploit superficial language patterns or priors without true comprehension. We evaluate 15 open-source VLMs from 7 model families on Chimera and find that their seemingly strong performance largely stems from shortcut behaviors: visual-memorization shortcuts have slight impact, knowledge-recall shortcuts play a moderate role, and Clever-Hans shortcuts contribute significantly. These findings expose critical limitations in current VLMs and underscore the need for more robust evaluation protocols that benchmark genuine comprehension of complex visual inputs (e.g., diagrams) rather than question-answering shortcuts.
[89] Detecting (Un)answerability in Large Language Models with Linear Directions
Maor Juliet Lavi,Tova Milo,Mor Geva
Main category: cs.CL
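TL;DR: Proposes a method based on directions in activation space for detecting unanswerable questions in large language models. The direction is selected by applying activation additions during inference and measuring their effect on the model's abstention behavior; the method identifies unanswerable questions effectively and outperforms existing approaches on several benchmarks.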
Details
Motivation: Large language models still answer confidently when they lack the necessary information, which leads to hallucinated answers, so methods for detecting whether a question is answerable are needed.
Method: Apply activation additions during inference to find a direction in activation space that captures unanswerability, then project hidden activations onto this direction to obtain an unanswerability score.
Result: The method performs well on two open-weight LLMs and four extractive QA benchmarks: it detects unanswerable questions effectively, generalizes better across datasets, and also applies to unanswerability caused by lack of scientific consensus or subjectivity. Causal intervention experiments show that the direction effectively controls the model's abstention behavior.
Conclusion: The proposed activation-space direction is a simple and effective way to detect unanswerable questions in large language models and to improve the reliability of their answers.
Abstract: Large language models (LLMs) often respond confidently to questions even when they lack the necessary information, leading to hallucinated answers. In this work, we study the problem of (un)answerability detection, focusing on extractive question answering (QA) where the model should determine if a passage contains sufficient information to answer a given question. We propose a simple approach for identifying a direction in the model's activation space that captures unanswerability and using it for classification. This direction is selected by applying activation additions during inference and measuring their impact on the model's abstention behavior. We show that projecting hidden activations onto this direction yields a reliable score for (un)answerability classification. Experiments on two open-weight LLMs and four extractive QA benchmarks show that our method effectively detects unanswerable questions and generalizes better across datasets than existing prompt-based and classifier-based approaches. Moreover, the obtained directions extend beyond extractive QA to unanswerability that stems from factors such as lack of scientific consensus and subjectivity. Last, causal interventions show that adding or ablating the directions effectively controls the abstention behavior of the model.
[90] Evaluating the Limits of Large Language Models in Multilingual Legal Reasoning
Antreas Ioannou,Andreas Shiamishis,Nora Hollenstein,Nezihe Merve Gürel
Main category: cs.CL
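TL;DR: This paper evaluates the performance and adversarial robustness of LLaMA and Gemini on multilingual legal tasks. Legal tasks remain challenging for LLMs, with accuracy often below 50%; the models are prompt-sensitive and vulnerable to adversarial perturbations, performance correlates with a language's syntactic similarity to English, and Gemini outperforms LLaMA by about 24 percentage points overall.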
Details
Motivation: As large language models are widely adopted in the legal domain, their performance in multilingual, cross-jurisdictional, and adversarial settings needs systematic evaluation to ensure reliability in high-stakes legal applications.
Method: Use an LLM-as-a-Judge approach to evaluate LLaMA and Gemini on multilingual legal and non-legal benchmarks (e.g., LEXam, XNLI), test adversarial robustness with character-level and word-level perturbations, and develop an open-source, modular evaluation pipeline that supports diverse tasks.
Result: Accuracy on legal tasks is often below 50%, well below that on general-purpose tasks (over 70%). English results are more stable but not necessarily better. The models are prompt-sensitive and adversarially vulnerable across languages. A language's syntactic similarity to English correlates positively with performance. Gemini leads LLaMA by about 24 percentage points on average.
Conclusion: Despite continued progress, large language models still face low accuracy and poor robustness in multilingual legal settings. Current models cannot yet be used reliably in high-stakes legal applications and need further improvement and evaluation.
Abstract: In an era dominated by Large Language Models (LLMs), understanding their capabilities and limitations, especially in high-stakes fields like law, is crucial. While LLMs such as Meta's LLaMA, OpenAI's ChatGPT, Google's Gemini, DeepSeek, and other emerging models are increasingly integrated into legal workflows, their performance in multilingual, jurisdictionally diverse, and adversarial contexts remains insufficiently explored. This work evaluates LLaMA and Gemini on multilingual legal and non-legal benchmarks, and assesses their adversarial robustness in legal tasks through character and word-level perturbations. We use an LLM-as-a-Judge approach for human-aligned evaluation. We moreover present an open-source, modular evaluation pipeline designed to support multilingual, task-diverse benchmarking of any combination of LLMs and datasets, with a particular focus on legal tasks, including classification, summarization, open questions, and general reasoning. Our findings confirm that legal tasks pose significant challenges for LLMs with accuracies often below 50% on legal reasoning benchmarks such as LEXam, compared to over 70% on general-purpose tasks like XNLI. In addition, while English generally yields more stable results, it does not always lead to higher accuracy. Prompt sensitivity and adversarial vulnerability are also shown to persist across languages. Finally, a correlation is found between the performance of a language and its syntactic similarity to English. We also observe that LLaMA is weaker than Gemini, with the latter showing an average advantage of about 24 percentage points across the same task. Despite improvements in newer LLMs, challenges remain in deploying them reliably for critical, multilingual legal applications.
[91] NeLLCom-Lex: A Neural-agent Framework to Study the Interplay between Lexical Systems and Language Use
Yuqing Zhang,Ecesu Ürker,Tessa Verhoef,Gemma Boleda,Arianna Bisazza
Main category: cs.CL
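TL;DR: This paper presents NeLLCom-Lex, a neural-agent framework for simulating lexical semantic change: agents are grounded in a real language system and their communicative needs are systematically manipulated to study the mechanisms of semantic change.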
Details
Motivation: Existing approaches to lexical semantic change (such as corpus analysis and experimental paradigms) struggle to reveal causal mechanisms or to cover semantic evolution over long time spans, so a controllable and scalable simulation method is needed.
Method: Build the NeLLCom-Lex neural-agent framework, ground agents in a real lexical system such as English, manipulate their communicative needs through a color naming task, and simulate the evolution of the lexical system with supervised and reinforcement learning pipelines.
Result: Neural agents reproduce human patterns in color naming to a remarkable extent, show human-like naming behavior and lexicon development, and adjust their lexical use according to communicative needs.
Conclusion: NeLLCom-Lex can effectively simulate lexical semantic change, which supports its further use for investigating the underlying mechanisms.
Abstract: Lexical semantic change has primarily been investigated with observational and experimental methods; however, observational methods (corpus analysis, distributional semantic modeling) cannot get at causal mechanisms, and experimental paradigms with humans are hard to apply to semantic change due to the extended diachronic processes involved. This work introduces NeLLCom-Lex, a neural-agent framework designed to simulate semantic change by first grounding agents in a real lexical system (e.g. English) and then systematically manipulating their communicative needs. Using a well-established color naming task, we simulate the evolution of a lexical system within a single generation, and study which factors lead agents to: (i) develop human-like naming behavior and lexicons, and (ii) change their behavior and lexicons according to their communicative needs. Our experiments with different supervised and reinforcement learning pipelines show that neural agents trained to 'speak' an existing language can reproduce human-like patterns in color naming to a remarkable extent, supporting the further use of NeLLCom-Lex to elucidate the mechanisms of semantic change.
[92] Exploring Solution Divergence and Its Effect on Large Language Model Problem Solving
Hang Li,Kaiqi Yang,Yucheng Chu,Hui Liu,Jiliang Tang
Main category: cs.CL
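TL;DR: This paper proposes a new metric, solution divergence, for assessing how large language models perform in problem solving, and shows that it improves model performance in both supervised fine-tuning and reinforcement learning.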
Details
Motivation: Existing methods mainly rely on labeled data or task feedback to improve the problem-solving ability of large language models. This paper explores a new angle: whether the diversity of the solutions a model generates for the same problem relates to performance.
Method: Analyze how much the solutions generated by different models for a single problem diverge, verify the relationship between divergence and problem-solving ability, and apply solution divergence as a metric within supervised fine-tuning and reinforcement learning strategies.
Result: Experiments in three representative problem domains show that using solution divergence consistently improves task success rates.
Conclusion: Solution divergence is a simple but effective tool for improving the training and evaluation of large language models.
Abstract: Large language models (LLMs) have been widely used for problem-solving tasks. Most recent work improves their performance through supervised fine-tuning (SFT) with labeled data or reinforcement learning (RL) from task feedback. In this paper, we study a new perspective: the divergence in solutions generated by LLMs for a single problem. We show that higher solution divergence is positively related to better problem-solving abilities across various models. Based on this finding, we propose solution divergence as a novel metric that can support both SFT and RL strategies. We test this idea on three representative problem domains and find that using solution divergence consistently improves success rates. These results suggest that solution divergence is a simple but effective tool for advancing LLM training and evaluation.
[93] JGU Mainz's Submission to the WMT25 Shared Task on LLMs with Limited Resources for Slavic Languages: MT and QA
Hossain Shaikh Saadi,Minh Duc Bui,Mario Sanz-Guerrero,Katharina von der Wense
Main category: cs.CL
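TL;DR: This paper presents a joint fine-tuning approach for machine translation and question answering in Ukrainian, Upper Sorbian, and Lower Sorbian under limited resources. It uses Qwen2.5-3B-Instruct with parameter-efficient fine-tuning, data augmentation, and ensembling, and significantly outperforms the baseline.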
Details
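Motivation: Handling machine translation and question answering jointly for low-resource Slavic languages (Ukrainian, Upper and Lower Sorbian) is challenging because of data scarcity and limited compute, and multi-task performance needs to improve.
Method: Jointly fine-tune Qwen2.5-3B-Instruct with parameter-efficient fine-tuning; integrate additional translation and multiple-choice QA data; add retrieval-augmented generation for Ukrainian QA; apply model ensembling for Upper and Lower Sorbian QA.
Result: Experiments show that the approach outperforms the baseline system on both machine translation and question answering.
Conclusion: Joint fine-tuning combined with parameter-efficient methods, additional data, and ensembling effectively improves multi-task performance for low-resource Slavic languages.
Abstract: This paper presents the JGU Mainz submission to the WMT25 Shared Task on LLMs with Limited Resources for Slavic Languages: Machine Translation and Question Answering, focusing on Ukrainian, Upper Sorbian, and Lower Sorbian. For each language, we jointly fine-tune a Qwen2.5-3B-Instruct model for both tasks with parameter-efficient finetuning. Our pipeline integrates additional translation and multiple-choice question answering (QA) data. For Ukrainian QA, we further use retrieval-augmented generation. We also apply ensembling for QA in Upper and Lower Sorbian. Experiments show that our models outperform the baseline on both tasks.
[94] Representing LLMs in Prompt Semantic Task Space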
Idan Kashani,Avi Mendelson,Yaniv Nemcovsky
Main category: cs.CL
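TL;DR: Proposes a training-free, efficient, and interpretable representation of LLMs as linear operators in the prompts' semantic task space; closed-form computation of geometrical properties gives good scalability and real-time adaptability.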
Details
Motivation: With a growing collection of pre-trained large models, quickly and accurately identifying the best-performing model for a given task is challenging. Existing methods scale poorly, need costly retraining, and yield representations that are hard to interpret.
Method: Represent large language models as linear operators in the prompts' semantic task space and extract their geometrical properties by closed-form computation, producing model representations without any training.
Result: The method achieves competitive or state-of-the-art results on model selection and success prediction, performs notably well in out-of-sample scenarios, and adapts in real time to dynamically changing model repositories.
Conclusion: The method is efficient, scalable, training-free, and highly interpretable, offering a new approach to selecting and analyzing large models.
Abstract: Large language models (LLMs) achieve impressive results over various tasks, and ever-expanding public repositories contain an abundance of pre-trained models. Therefore, identifying the best-performing LLM for a given task is a significant challenge. Previous works have suggested learning LLM representations to address this. However, these approaches present limited scalability and require costly retraining to encompass additional models and datasets. Moreover, the produced representation utilizes distinct spaces that cannot be easily interpreted. This work presents an efficient, training-free approach to representing LLMs as linear operators within the prompts' semantic task space, thus providing a highly interpretable representation of the models' application. Our method utilizes closed-form computation of geometrical properties and ensures exceptional scalability and real-time adaptability to dynamically expanding repositories. We demonstrate our approach on success prediction and model selection tasks, achieving competitive or state-of-the-art results with notable performance in out-of-sample scenarios.
[95] We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong
Gautam Siddharth Kashyap,Mark Dras,Usman Naseem
Main category: cs.CL
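TL;DR: Proposes Adaptive Multi-Branch Steering (AMBS), a framework for unified and efficient multi-objective alignment of large language models on helpfulness, harmlessness, and honesty (HHH).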
Details
Motivation: Existing methods that optimize a single alignment objective tend to overwrite the representations of other objectives (catastrophic forgetting). Multi-branch designs alleviate this, but optimizing each objective independently leads to inconsistent outputs (inference fragmentation).
Method: AMBS is a two-stage 1-to-N framework. Stage I computes shared post-attention hidden states. Stage II clones this representation into parallel branches and applies objective-specific steering through a policy-reference mechanism, maintaining cross-objective consistency.
Result: Validated on Alpaca, BeaverTails, and TruthfulQA, AMBS significantly improves HHH alignment across multiple 7B-scale LLMs. For example, on DeepSeek-7B it raises the average alignment score by 32.4% and reduces unsafe outputs by 11.0% relative to the baseline.
Conclusion: AMBS effectively balances multiple alignment objectives, avoids catastrophic forgetting and inference fragmentation, offers both performance and efficiency advantages, and outperforms existing methods.
Abstract: Alignment of Large Language Models (LLMs) along multiple objectives, namely helpfulness, harmlessness, and honesty (HHH), is critical for safe and reliable deployment. Prior work has used steering vectors (small control signals injected into hidden states) to guide LLM outputs, typically via one-to-one (1-to-1) Transformer decoders. In this setting, optimizing a single alignment objective can inadvertently overwrite representations learned for other objectives, leading to catastrophic forgetting. More recent approaches extend steering vectors via one-to-many (1-to-N) Transformer decoders. While this alleviates catastrophic forgetting, naive multi-branch designs optimize each objective independently, which can cause inference fragmentation: outputs across HHH objectives may become inconsistent. We propose Adaptive Multi-Branch Steering (AMBS), a two-stage 1-to-N framework for unified and efficient multi-objective alignment. In Stage I, post-attention hidden states of the Transformer layer are computed once to form a shared representation. In Stage II, this representation is cloned into parallel branches and steered via a policy-reference mechanism, enabling objective-specific control while maintaining cross-objective consistency. Empirical evaluations on Alpaca, BeaverTails, and TruthfulQA show that AMBS consistently improves HHH alignment across multiple 7B LLM backbones. For example, on DeepSeek-7B, AMBS improves average alignment scores by +32.4% and reduces unsafe outputs by 11.0% compared to a naive 1-to-N baseline, while remaining competitive with state-of-the-art methods.
[96] InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models
Wenjun Wang,Shuo Cai,Congkai Xie,Mingfa Feng,Yiming Zhang,Zhen Li,Kejing Yang,Ming Li,Jiannong Cao,Yuan Xie,Hongxia Yang
Main category: cs.CL
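TL;DR: This paper presents an end-to-end FP8 training recipe that uses a fine-grained hybrid quantization strategy to preserve numerical precision while substantially improving training efficiency, matching BF16 performance with up to 22% less training time, 14% lower peak memory, and 19% higher throughput.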
Details
Motivation: The computational cost of training large language models is high and hinders innovation. FP8 training is more efficient in theory, but the lack of a complete open-source training recipe has limited its adoption.
Method: Propose an end-to-end FP8 training recipe that integrates continual pre-training and supervised fine-tuning, using a fine-grained, hybrid-granularity quantization strategy to balance numerical fidelity and computational efficiency.
Result: Continual pre-training experiments on a 160B-token corpus show that the recipe is stable and essentially lossless, with reasoning benchmark performance on par with the BF16 baseline, together with up to 22% less training time, 14% lower peak memory, and 19% higher throughput.
Conclusion: FP8 is a practical and robust alternative to BF16 and could make large-scale model training more accessible. The authors will release the code to support the community.
Abstract: The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.
[97] Think Socially via Cognitive Reasoning
Jinfeng Zhou,Zheyu Chen,Shuai Wang,Quanyu Dai,Zhenhua Dong,Hongning Wang,Minlie Huang
Main category: cs.CL
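TL;DR: This paper proposes the Cognitive Reasoning paradigm and the CogFlow framework, which use structured cognitive flows to improve how large models understand and make decisions in ambiguous social situations.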
Details
Motivation: The conventional logical reasoning paradigm struggles with the ambiguous, non-deterministic cues found in social situations and does not model human social cognition.
Method: Propose the Cognitive Reasoning paradigm and the CogFlow framework. CogFlow first builds a dataset of cognitive flows by simulating human thought with tree-structured planning, then instills basic cognitive ability through supervised fine-tuning, and finally optimizes cognitive flow and response quality with multi-objective reinforcement learning.
Result: Experiments show that CogFlow effectively enhances the social cognitive capabilities of large language models, outperforms baseline models on social decision-making tasks, and shows potential for assisting human decision-making.
Conclusion: CogFlow offers an effective path toward human-like social reasoning in large models, bridging the gap between logical reasoning and social cognition.
Abstract: LLMs trained for logical reasoning excel at step-by-step deduction to reach verifiable answers. However, this paradigm is ill-suited for navigating social situations, which induce an interpretive process of analyzing ambiguous cues that rarely yield a definitive outcome. To bridge this gap, we introduce Cognitive Reasoning, a paradigm modeled on human social cognition. It formulates the interpretive process into a structured cognitive flow of interconnected cognitive units (e.g., observation or attribution), which combine adaptively to enable effective social thinking and responses. We then propose CogFlow, a complete framework that instills this capability in LLMs. CogFlow first curates a dataset of cognitive flows by simulating the associative and progressive nature of human thought via tree-structured planning. After instilling the basic cognitive reasoning capability via supervised fine-tuning, CogFlow adopts reinforcement learning to enable the model to improve itself via trial and error, guided by a multi-objective reward that optimizes both cognitive flow and response quality. Extensive experiments show that CogFlow effectively enhances the social cognitive capabilities of LLMs, and even humans, leading to more effective social decision-making.
[98] Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation
Wenyuan Chen,Fateme Nateghi Haredasht,Kameron C. Black,Francois Grolleau,Emily Alsentzer,Jonathan H. Chen,Stephen P. Ma
Main category: cs.CL
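TL;DR: This paper presents a retrieval-augmented evaluation pipeline (RAEC) that combines a clinical error ontology with a two-stage DSPy prompting architecture to assess the quality of LLM-drafted replies to patient messages in electronic health record portals; adding historical message context markedly improves error detection.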
Details
Motivation: The growing volume of asynchronous patient-clinician messages adds to clinician workload. Large language models (LLMs) can draft replies, but their output may contain clinical inaccuracies, omissions, or an inappropriate tone, so a reliable way to evaluate the quality of LLM-generated content is needed.
Method: (1) Build a clinical error ontology with 5 domains and 59 granular error codes, developed through inductive coding and expert adjudication; (2) design a retrieval-augmented evaluation pipeline (RAEC) that uses semantically similar historical message-response pairs from institutional archives as context; (3) adopt a two-stage DSPy-based prompting architecture for scalable, interpretable, and hierarchical error detection.
Result: Baseline and context-enhanced evaluations were compared on more than 1,500 patient messages. Retrieval context improved error identification in domains such as clinical completeness and workflow appropriateness. Human validation on 100 messages showed that context-enhanced labels had higher agreement (50% vs. 33%) and better performance (F1 = 0.500 vs. 0.256) than the baseline.
Conclusion: Adding historical message context effectively improves the evaluation of LLM-drafted replies to patient messages, and the proposed RAEC pipeline can serve as a safety guardrail for AI-assisted patient communication.
Abstract: Asynchronous patient-clinician messaging via EHR portals is a growing source of clinician workload, prompting interest in large language models (LLMs) to assist with draft responses. However, LLM outputs may contain clinical inaccuracies, omissions, or tone mismatches, making robust evaluation essential. Our contributions are threefold: (1) we introduce a clinically grounded error ontology comprising 5 domains and 59 granular error codes, developed through inductive coding and expert adjudication; (2) we develop a retrieval-augmented evaluation pipeline (RAEC) that leverages semantically similar historical message-response pairs to improve judgment quality; and (3) we provide a two-stage prompting architecture using DSPy to enable scalable, interpretable, and hierarchical error detection. Our approach assesses the quality of drafts both in isolation and with reference to similar past message-response pairs retrieved from institutional archives. Using a two-stage DSPy pipeline, we compared baseline and reference-enhanced evaluations on over 1,500 patient messages. Retrieval context improved error identification in domains such as clinical completeness and workflow appropriateness. Human validation on 100 messages demonstrated superior agreement (concordance = 50% vs. 33%) and performance (F1 = 0.500 vs. 0.256) of context-enhanced labels vs. baseline, supporting the use of our RAEC pipeline as AI guardrails for patient messaging.
[99] Fine-Grained Detection of Context-Grounded Hallucinations Using LLMs
Yehonatan Pesiakhovsky,Zorik Gekhman,Yosi Mass,Liat Ein-Dor,Roi Reichart
Main category: cs.CL
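TL;DR: Studies the use of large language models (LLMs) for localizing context-grounded hallucinations, proposes a new hallucination representation based on free-form textual descriptions, and builds a benchmark tailored to LLMs. The evaluation shows that current models still struggle, with a best F1 score of only 0.67.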
Details
Motivation: Existing hallucination evaluation pipelines are complex and there is no benchmark suited to LLMs. The authors therefore examine how well LLMs perform as a more practical option for hallucination localization, and improve how hallucination errors are represented so that a wider range of error types can be expressed.
Method: Build a benchmark of more than 1,000 human-annotated examples, represent hallucinations with free-form textual descriptions, design an LLM-based evaluation protocol whose quality is verified by human evaluation, and systematically evaluate four large-scale LLMs.
Result: The four LLMs show limited performance on the task, with a best F1 score of 0.67. The main challenges are a tendency to flag missing details as inconsistencies, and difficulty with information that is factually correct but absent from the source text and therefore not verifiable.
Conclusion: Current LLMs remain clearly limited at hallucination localization. Better prompting strategies and hallucination representations are needed, and future work should address how to separate parametric knowledge from verifiability against the source text.
Abstract: Context-grounded hallucinations are cases where model outputs contain information not verifiable against the source text. We study the applicability of LLMs for localizing such hallucinations, as a more practical alternative to existing complex evaluation pipelines. In the absence of established benchmarks for meta-evaluation of hallucination localization, we construct one tailored to LLMs, involving a challenging human annotation of over 1,000 examples. We complement the benchmark with an LLM-based evaluation protocol, verifying its quality in a human evaluation. Since existing representations of hallucinations limit the types of errors that can be expressed, we propose a new representation based on free-form textual descriptions, capturing the full range of possible errors. We conduct a comprehensive study, evaluating four large-scale LLMs, which highlights the benchmark's difficulty, as the best model achieves an F1 score of only 0.67. Through careful analysis, we offer insights into optimal prompting strategies for the task and identify the main factors that make it challenging for LLMs: (1) a tendency to incorrectly flag missing details as inconsistent, despite being instructed to check only facts in the output; and (2) difficulty with outputs containing factually correct information absent from the source - and thus not verifiable - due to alignment with the model's parametric knowledge.
[100] ArabJobs: A Multinational Corpus of Arabic Job Ads
Mo El-Haj
Main category: cs.CL
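TL;DR: ArabJobs is a publicly available corpus of more than 8,500 Arabic job ads from Egypt, Jordan, Saudi Arabia, and the United Arab Emirates, intended for studying linguistic, regional, and socio-economic variation in the Arab labour market.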
Details
Motivation: To capture the linguistic diversity and socio-economic variation of Arabic in job advertisements, and to support fairness research in Arabic NLP.
Method: Collect job ads from four Arab countries, analyze gender representation, occupational structure, and dialectal variation, and apply large language models to tasks such as salary estimation, job category classification, and gender bias detection.
Result: The work demonstrates the usefulness of ArabJobs for Arabic natural language processing and labour market research, particularly its potential for fairness-aware tasks.
Conclusion: The ArabJobs corpus is a valuable resource for Arabic NLP and labour market research, supporting a range of downstream tasks and future studies.
Abstract: ArabJobs is a publicly available corpus of Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the United Arab Emirates. Comprising over 8,500 postings and more than 550,000 words, the dataset captures linguistic, regional, and socio-economic variation in the Arab labour market. We present analyses of gender representation and occupational structure, and highlight dialectal variation across ads, which offers opportunities for future research. We also demonstrate applications such as salary estimation and job category normalisation using large language models, alongside benchmark tasks for gender bias detection and profession classification. The findings show the utility of ArabJobs for fairness-aware Arabic NLP and labour market research. The dataset is publicly available on GitHub: https://github.com/drelhaj/ArabJobs.
[101] From Formal Language Theory to Statistical Learning: Finite Observability of Subregular Languages
Katsuhiko Hayashi,Hidetaka Kamigaito
Main category: cs.CL
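TL;DR: This paper proves that standard subregular language classes, represented by their deciding predicates, are linearly separable, which guarantees learnability. Experiments confirm perfect separability under noise-free conditions and show that features learned on English morphology match known linguistic constraints.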
Details
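Motivation: Examine the separability of subregular language classes in order to establish an interpretable foundation for modeling natural language structure.
Method: Represent subregular language classes by their deciding predicates and run learning experiments with simple linear models on both synthetic data and real language data (English morphology).
Result: All standard subregular language classes are separable under linear models. Synthetic experiments show perfect separability under noise-free conditions, and real-data experiments show that the learned features agree with known linguistic constraints.
Conclusion: The subregular hierarchy provides a rigorous and interpretable foundation for modeling natural language structure and supports effective learning with simple linear models.
Abstract: We prove that all standard subregular language classes are linearly separable when represented by their deciding predicates. This establishes finite observability and guarantees learnability with simple linear models. Synthetic experiments confirm perfect separability under noise-free conditions, while real-data experiments on English morphology show that learned features align with well-known linguistic constraints. These results demonstrate that the subregular hierarchy provides a rigorous and interpretable foundation for modeling natural language structure. Our code used in real-data experiments is available at https://github.com/UTokyo-HayashiLab/subregular.
[102] Capturing Opinion Shifts in Deliberative Discourse through Frequency-based Quantum deep learning methods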
Rakesh Thakur,Harsh Chaturvedi,Ruqayya Shah,Janvi Chauhan,Ayush Sharma
Main category: cs.CL
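TL;DR: This study compares several natural language processing techniques for modeling simulated deliberative discourse and proposes two methods that outperform existing models: Frequency-Based Discourse Modulation and the Quantum-Deliberation Framework.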
Details
Motivation: Analyze opinion change through computational modeling to better understand the deliberation process and its influence on decisions.
Method: Construct a self-sourced dataset of diverse viewpoints, simulate deliberation with product presentations enriched with striking facts, and compare two models: Frequency-Based Discourse Modulation and the Quantum-Deliberation Framework.
Result: Both new models outperform existing techniques at interpreting deliberative discourse and predicting outcomes, and they capture opinion shifts effectively.
Conclusion: The proposed models have broad potential applications in public policy-making, debate evaluation, decision-support systems, and social media opinion mining.
Abstract: Deliberation plays a crucial role in shaping outcomes by weighing diverse perspectives before reaching decisions. With recent advancements in Natural Language Processing, it has become possible to computationally model deliberation by analyzing opinion shifts and predicting potential outcomes under varying scenarios. In this study, we present a comparative analysis of multiple NLP techniques to evaluate how effectively models interpret deliberative discourse and produce meaningful insights. Opinions from individuals of varied backgrounds were collected to construct a self-sourced dataset that reflects diverse viewpoints. Deliberation was simulated using product presentations enriched with striking facts, which often prompted measurable shifts in audience opinions. We give a comparative analysis of two models, Frequency-Based Discourse Modulation and the Quantum-Deliberation Framework, which outperform existing state-of-the-art models. The findings highlight practical applications in public policy-making, debate evaluation, decision-support frameworks, and large-scale social media opinion mining.
[103] From tests to effect sizes: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation benchmarks
Jonne Sälevä,Duygu Ataman,Constantine Lignos
Main category: cs.CL
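TL;DR: Proposes a set of resampling-based methods for quantifying the uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks.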
Details
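Motivation: Existing evaluation metrics often ignore the variation introduced by models and data, which leads to underestimating performance fluctuations, so more accurate ways to measure uncertainty are needed.
Method: Use resampling to analyze performance variation from both model-related and data-related sources, and compute sampling distributions for quantities commonly used in leaderboards, such as the mean, the median, pairwise differences between models, and rankings.
Result: The approach is validated on multilingual question answering, machine translation, and named entity recognition, where it reflects the variability of performance metrics more faithfully.
Conclusion: Accounting for both model-related and data-related variation is essential for evaluating NLP models correctly, and resampling methods improve the statistical reliability of multilingual and multitask benchmarks.
Abstract: In this paper, we introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model- and data-related sources, and that accounting for both of them is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for computing sampling distributions for various quantities used in leaderboards such as the average/median, pairwise differences between models, and rankings.
[104] StateX: Enhancing RNN Recall via Post-training State Expansion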
Xingyu Shen,Yingfa Chen,Zhen Leng Thai,Xu Han,Zhiyuan Liu,Maosong Sun
Main category: cs.CL
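TL;DR: This paper introduces StateX, a training pipeline that efficiently expands the state size of pre-trained RNNs through post-training, improving the recall and in-context learning of linear attention and state space models with almost no increase in model parameters.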
Details
Motivation: Transformer models are costly on long contexts. RNNs are computationally efficient, but their fixed state size makes them weak at recalling long-range information. A larger state improves recall, yet training RNNs with large states directly is expensive.
Method: Propose the StateX training pipeline, with post-training architectural modifications designed for linear attention and state space models that expand the state size without a significant increase in parameters.
Result: Experiments on models of up to 1.3B parameters show that StateX effectively improves the recall and in-context learning of RNNs at low post-training cost and without harming other capabilities.
Conclusion: StateX is an efficient and practical way to improve RNN performance on long-context tasks, balancing model capacity, training cost, and inference efficiency.
Abstract: While Transformer-based models have demonstrated remarkable language modeling performance, their high complexities result in high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexities. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with larger recurrent states results in high training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities.
[105] Variational Reasoning for Language Models
Xiangxin Zhou,Zichen Liu,Haonan Wang,Chao Du,Min Lin,Chongxuan Li,Liang Wang,Tianyu Pang
Main category: cs.CL
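TL;DR: Proposes a variational reasoning framework for language models that treats chains of thought as latent variables optimized through variational inference, unifying variational inference with reinforcement learning methods and improving the reasoning ability of language models.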
Details
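Motivation: Improving the reasoning ability of language models calls for a more stable, principled probabilistic framework for modeling the thinking process, one that also unifies the perspectives of existing variational inference and reinforcement learning methods.
Method: Treat thinking traces as latent variables. Starting from the evidence lower bound (ELBO), extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes training of the variational posterior. Also show that rejection sampling fine-tuning and binary-reward reinforcement learning (such as GRPO) can be viewed as local forward-KL objectives.
Result: Validated extensively on the Qwen 2.5 and Qwen 3 model families, the method performs well across a variety of reasoning tasks, reveals an implicit bias of existing methods toward easier questions, and provides more stable training objectives.
Conclusion: The work offers a unified, principled probabilistic framework for language model reasoning that connects variational inference with reinforcement learning methods, helping to stabilize optimization and improve reasoning performance.
Abstract: We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
[106] Language Models Can Learn from Verbal Feedback Without Scalar Rewards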
Renjie Luo,Zichen Liu,Xiangyan Liu,Chao Du,Min Lin,Wenhu Chen,Wei Lu,Tianyu Pang
Main category: cs.CL
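TL;DR: Proposes the feedback-conditional policy (FCP), which treats verbal feedback as a conditioning signal and learns directly from feedback through maximum likelihood training, avoiding the compression of feedback into scalar rewards.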
Details
Motivation: Conventional RLHF methods compress rich verbal feedback into scalar rewards, which loses information, induces scale imbalance, and limits how well models can learn from complex feedback.
Method: Introduce the feedback-conditional policy (FCP), trained by maximum likelihood on offline response-feedback pairs, together with an online bootstrapping stage in which the model generates responses under positive conditions and receives fresh feedback to refine itself iteratively.
Result: FCP makes fuller use of the richness of verbal feedback and improves the model's ability to understand and respond to it. Experiments indicate that it outperforms conventional reward optimization in feedback-driven learning.
Conclusion: Treating feedback as a generation condition rather than a scalar reward gives large models a more expressive and more natural way to learn from verbal feedback.
Abstract: LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
[107] Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity
Arkadiy Saakyan,Najoung Kim,Smaranda Muresan,Tuhin Chakrabarty
Main category: cs.CL
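TL;DR: This paper examines the limits of n-gram novelty as a measure of textual creativity. Although n-gram novelty correlates positively with expert-judged creativity, about 91% of expressions with high n-gram novelty are not judged creative, and in open-source LLMs higher n-gram novelty comes with lower pragmaticality. Frontier closed-source models are still less likely than humans to produce creative expressions, and current LLMs have room to improve at identifying creative or non-pragmatic expressions.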
Details
Motivation: 现有研究广泛使用n-gram新颖性来评估语言模型的创造力,但创造力理论强调需同时考虑新颖性和适当性(合理与实用),因此需探究n-gram新颖性是否真能反映人类认可的创造力。 Method: 通过7542条由26名专业写作者标注的数据,对人类和AI生成文本中的新颖性、实用性和合理性进行细读分析,并评估零样本、少样本及微调模型识别创造性与非实用性表达的能力。 Result: n-gram新颖性虽与专家判断的创造力呈正相关,但多数高n-gram新颖性表达未被认定为创造性;开源LLM中n-gram新颖性越高,实用性越低;前沿闭源模型创造能力仍逊于人类;LLM作为判别器的表现优于随机水平,但在识别非实用性表达上表现较差。 Conclusion: n-gram新颖性不足以单独衡量文本创造力,应结合实用性与合理性评估;当前LLM在判断创造性写作方面有潜力但尚不完善,未来需更全面的创造力评估框架。 Abstract: N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier close-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. 
We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
[108] WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning
Zimu Lu,Houxing Ren,Yunqiao Yang,Ke Wang,Zhuofan Zong,Junting Pan,Mingjie Zhan,Hongsheng Li
Main category: cs.CL
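TL;DR: This paper proposes WebGen-Agent, a new agent system that uses multi-level visual feedback to iteratively generate and refine website codebases, and introduces Step-GRPO, a training method based on screenshot and GUI-agent feedback, to improve LLM performance on website generation.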
Details
Motivation: For tasks such as website generation that depend on visual effects and user interaction, existing code agents rely only on simple code-execution feedback, which cannot accurately assess the actual quality of the generated code. Method: Proposes WebGen-Agent, which uses a visual language model (VLM) to produce detailed text descriptions and scores for website screenshots and GUI-agent tests, combined with a backtracking and select-best mechanism; further designs Step-GRPO, a training method that uses the per-step screenshot and GUI-agent scores as reward signals for reinforcement learning. Result: On WebGen-Bench, WebGen-Agent raises the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9; Step-GRPO raises the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and its appearance score from 3.4 to 3.7. Conclusion: By introducing multi-level visual feedback and a fine-grained reward mechanism, WebGen-Agent substantially improves LLM performance on website generation and outperforms the previous state-of-the-art agent system. Abstract: Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification. This approach fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive and multi-level visual feedback to iteratively generate and refine the website codebase. Detailed and expressive text descriptions and suggestions regarding the screenshots and GUI-agent testing of the websites are generated by a visual language model (VLM), together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, enhancing the performance of the agent. Utilizing the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce Step-GRPO with Screenshot and GUI-agent Feedback to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. By using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we provide a dense and reliable process supervision signal, which effectively improves the model's website-generation ability.
On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our Step-GRPO training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and raises the appearance score from 3.4 to 3.7.
[109] VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing
Ke Wang,Houxing Ren,Zimu Lu,Mingjie Zhan,Hongsheng Li
Main category: cs.CL
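TL;DR: This paper presents VoiceAssistant-Eval, a comprehensive benchmark for evaluating voice-first AI assistants across listening, speaking, and viewing, with 10,497 examples in 13 task categories. An evaluation of 21 open-source models and GPT-4o-Audio shows that proprietary models do not lead across the board and that small, well-designed models can rival larger ones, but current models still struggle with multimodal input and role-play voice imitation.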
Details
Motivation: Existing benchmarks are inadequate for comprehensively evaluating voice-first AI assistants; in particular, there is no systematic way to assess combined performance in listening, speech generation, and visual understanding. Method: Builds the VoiceAssistant-Eval benchmark with 10,497 examples across 13 task categories, covering natural sounds, music, and spoken dialogue understanding (listening); multi-turn dialogue and role-play imitation (speaking); and heterogeneous image understanding (viewing). Evaluates 21 open-source models and GPT-4o-Audio on content quality, speech output quality, and their consistency. Result: Experiments show: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but perform poorly at audio understanding; (3) well-designed small and mid-sized models (e.g., the 7B Step-Audio-2-mini) can outperform larger ones (e.g., the 32B LLaMA-Omni2), more than doubling listening accuracy. Multimodal input and role-play voice imitation remain difficult, and there are significant gaps in robustness and safety alignment. Conclusion: VoiceAssistant-Eval provides a systematic, rigorous evaluation framework for next-generation AI assistants, reveals the strengths and weaknesses of current voice AI systems, and encourages more balanced, safe, and robust models. Abstract: The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' capabilities. We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio plus visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment.
VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants. Code and data will be released at https://mathllm.github.io/VoiceAssistantEval/ .
cs.CV [Back]
[110] Random Direct Preference Optimization for Radiography Report Generation
Valentin Samokhin,Boris Shirokikh,Mikhail Goncharov,Dmitriy Umerenkov,Maksim Bobrin,Ivan Oseledets,Dmitry Dylov,Mikhail Belyaev
Main category: cs.CV
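TL;DR: Proposes a model-agnostic framework that uses random contrastive sampling and Direct Preference Optimization (DPO) to improve the accuracy of radiography report generation without extra training data or human annotation, improving clinical performance metrics by up to 5%.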
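To make the two ingredients named in this summary concrete, here is a minimal sketch. The DPO loss is the standard published form; the pairing rule (the study's own report as "chosen", a report drawn at random from a different study as "rejected") is an assumption about what random contrastive sampling means here, since the summary does not spell it out. The function names are hypothetical.

```python
import math
import random
from typing import List, Tuple


def random_contrastive_pairs(reports: List[str], seed: int = 0) -> List[Tuple[str, str]]:
    """Pair each report (chosen) with a randomly drawn other report (rejected).

    Assumption: negatives are sampled uniformly from the other studies.
    Requires at least two reports.
    """
    rng = random.Random(seed)
    pairs = []
    for i, chosen in enumerate(reports):
        j = rng.choice([k for k in range(len(reports)) if k != i])
        pairs.append((chosen, reports[j]))
    return pairs


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, from sequence log-probabilities.

    loss = -log(sigmoid(beta * ((policy - ref) on chosen - (policy - ref) on rejected)))
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m))
    return math.log1p(math.exp(-margin))
```

No reward model or human preference label appears anywhere in this loop, which is the property the summary highlights.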
Details
Motivation: Existing radiography report generation methods still fall short of the quality needed in real clinical settings, and their reliance on reward models or human preference annotations limits their use. Method: Introduces a model-agnostic framework that constructs training pairs by random contrastive sampling and trains with Direct Preference Optimization (DPO), avoiding reward models and human annotation. Result: Validated on three state-of-the-art models; clinical performance metrics improve by up to 5% with no additional training data. Conclusion: The proposed method effectively improves the quality and clinical applicability of radiography report generation and has broad potential for application. Abstract: Radiography Report Generation (RRG) has gained significant attention in medical image analysis as a promising tool for alleviating the growing workload of radiologists. However, despite numerous advancements, existing methods have yet to achieve the quality required for deployment in real-world clinical settings. Meanwhile, large Visual Language Models (VLMs) have demonstrated remarkable progress in the general domain by adopting training strategies originally designed for Large Language Models (LLMs), such as alignment techniques. In this paper, we introduce a model-agnostic framework to enhance RRG accuracy using Direct Preference Optimization (DPO). Our approach leverages random contrastive sampling to construct training pairs, eliminating the need for reward models or human preference annotations. Experiments on supplementing three state-of-the-art models with our Random DPO show that our method improves clinical performance metrics by up to 5%, without requiring any additional training data.
[111] Improving Autism Detection with Multimodal Behavioral Analysis
William Saakyan,Matthias Norden,Lola Eversmann,Simon Kirsch,Muyu Lin,Simon Guendelman,Isabel Dziobek,Hanna Drimalla
Main category: cs.CV
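TL;DR: Proposes a computer-aided diagnostic method for autism based on multimodal behavioral features (facial expressions, voice prosody, head motion, heart rate variability, gaze behavior); with improved gaze-feature modeling and a late-fusion strategy, it reaches 74% classification accuracy on a large standardized dataset.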
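Late fusion, as mentioned above, combines the outputs of separately trained per-modality classifiers rather than their input features. The sketch below shows one common form, a weighted average of per-modality probabilities; the fusion rule, the weights, and the modality names are assumptions for illustration, not the authors' documented procedure.

```python
from typing import Dict, Optional


def late_fusion(prob_by_modality: Dict[str, float],
                weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of per-modality positive-class probabilities.

    Assumption: each modality has its own classifier that outputs a
    probability in [0, 1]; equal weights are used when none are given.
    """
    names = sorted(prob_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    return sum(weights[name] * prob_by_modality[name] for name in names) / total


def classify(prob: float, threshold: float = 0.5) -> int:
    """Turn the fused probability into a binary label."""
    return int(prob >= threshold)
```

For example, `classify(late_fusion({"gaze": 0.7, "face": 0.4, "voice": 0.6}))` fuses three hypothetical modality scores into a single decision; a weak modality can be outvoted instead of dominating a shared feature vector, as it might under early fusion.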
Details
Motivation: Existing video-based automatic autism detection models perform poorly on gaze features and lack real-world generalizability; more reliable, generalizable computational methods are needed to support clinical diagnosis. Method: Uses multimodal analysis, introduces new statistical descriptors that quantify variability in eye gaze angles, and integrates multiple behavioral markers for classification with late fusion. Result: The new gaze features raise classification accuracy from 64% to 69%; multimodal late fusion reaches 74% classification accuracy, better than single-modality or early-fusion approaches. Conclusion: Improved gaze features and multimodal fusion improve the analysis of autism video data and support the development of scalable video-based screening tools to assist clinical assessment. Abstract: Due to the complex and resource-intensive nature of diagnosing Autism Spectrum Condition (ASC), several computer-aided diagnostic support methods have been proposed to detect autism by analyzing behavioral cues in patient video data. While these models show promising results on some datasets, they struggle with poor gaze feature performance and lack of real-world generalizability. To tackle these challenges, we analyze a standardized video dataset comprising 168 participants with ASC (46% female) and 157 non-autistic participants (46% female), making it, to our knowledge, the largest and most balanced dataset available. We conduct a multimodal analysis of facial expressions, voice prosody, head motion, heart rate variability (HRV), and gaze behavior. To address the limitations of prior gaze models, we introduce novel statistical descriptors that quantify variability in eye gaze angles, improving gaze-based classification accuracy from 64% to 69% and aligning computational findings with clinical research on gaze aversion in ASC. Using late fusion, we achieve a classification accuracy of 74%, demonstrating the effectiveness of integrating behavioral markers across multiple modalities. Our findings highlight the potential for scalable, video-based screening tools to support autism assessment.
[112] KV-Efficient VLA: A Method of Speed up Vision Language Model with RNN-Gated Chunked KV Cache
Wanshun Xu,Long Zhuang
Main category: cs.CV
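TL;DR: Proposes KV-Efficient VLA, a model-agnostic KV-cache compression framework that uses chunking and a recurrent gating mechanism to selectively retain high-utility historical context, substantially reducing the memory consumption and compute latency of vision-language-action models during long-horizon inference.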
Details
Motivation: Existing vision-language-action (VLA) models face the quadratic cost of attention and unbounded growth of key-value (KV) memory during long-horizon inference, which limits their scalability in real-time deployment. Method: Partitions the KV cache into fixed-size chunks and introduces a lightweight, trainable recurrent gating module that summarizes and filters historical context according to learned utility scores, preserving recent detail and pruning stale information while maintaining causality. Result: Theoretical analysis indicates up to 1.21x inference speedup and 36% KV memory reduction, with minimal impact on task success rate. Conclusion: KV-Efficient VLA is an efficient, plug-and-play inference optimization compatible with existing autoregressive and hybrid VLA architectures, enabling scalable long-horizon inference without changes to training pipelines or control logic. Abstract: Vision-Language-Action (VLA) models promise unified robotic perception and control, yet their scalability is constrained by the quadratic cost of attention and the unbounded growth of key-value (KV) memory during long-horizon inference. While recent methods improve generalization through scaling backbone architectures, they often neglect the inference inefficiencies critical to real-time deployment. In this work, we present KV-Efficient VLA, a model-agnostic memory compression framework that addresses these limitations by introducing a lightweight, training-friendly mechanism to selectively retain high-utility context. Our method partitions the KV cache into fixed-size chunks and employs a recurrent gating module to summarize and filter historical context according to learned utility scores. This design preserves recent fine-grained detail while aggressively pruning stale, low-relevance memory, all while maintaining causality. Theoretically, KV-Efficient VLA yields up to 1.21x inference speedup and 36% KV memory reduction, with minimal impact on task success. Our method integrates seamlessly into existing autoregressive and hybrid VLA stacks, enabling scalable inference without modifying training pipelines or downstream control logic.
[113] Phrase-grounded Fact-checking for Automatically Generated Chest X-ray Reports
Razi Mahmood,Diego Machado-Reyes,Joy Wu,Parisa Kaviani,Ken C. L. Wong,Niharika D'Souza,Mannudeep Kalra,Ge Wang,Pingkun Yan,Tanveer Syeda-Mahmood
Main category: cs.CV
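TL;DR: Proposes a phrase-grounded cross-modal fact-checking model (FC model) that detects errors in findings and their locations in automatically generated chest X-ray reports. Trained on synthetic data, it is accurate and robust across multiple datasets and agrees closely with ground-truth-based verification (CCC = 0.997).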
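For readers unfamiliar with the agreement statistic quoted here, the sketch below computes Lin's concordance correlation coefficient, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population moments. How the authors paired their scores is not stated in this entry, so the inputs are simply two equal-length lists of numbers, and the function name is hypothetical.

```python
from typing import Sequence


def concordance_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient with population (1/n) moments.

    Unlike Pearson correlation, CCC also penalizes a shift or scale
    difference between the two series. Undefined (division by zero)
    when both series are constant and equal.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)
```

A value near 1 therefore means the two sets of scores agree in value, not merely that they rise and fall together.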
Details
Motivation: Large vision language models (VLMs) produce factual errors and hallucinations when generating radiology reports, which limits their clinical use; an effective method is needed to detect errors in reported findings and their locations. Method: Builds a large synthetic dataset by perturbing findings and their locations in real reports to form real and fake pairs; designs and trains a new multi-label cross-modal contrastive regression network (FC model) for fact-checking. Result: The model achieves accurate finding-veracity prediction and localization on multiple X-ray datasets, detects errors in the output of SOTA report generators effectively, and reaches a concordance correlation coefficient of 0.997 with ground-truth-based verification. Conclusion: The proposed FC model effectively identifies factual errors and location deviations in automatically generated reports and has potential to support quality control in clinical inference workflows. Abstract: With the emergence of large-scale vision language models (VLM), it is now possible to produce realistic-looking radiology reports for chest X-ray images. However, their clinical translation has been hampered by the factual errors and hallucinations in the produced descriptions during inference. In this paper, we present a novel phrase-grounded fact-checking model (FC model) that detects errors in findings and their indicated locations in automatically generated chest radiology reports. Specifically, we simulate the errors in reports through a large synthetic dataset derived by perturbing findings and their locations in ground truth reports to form real and fake findings-location pairs with images. A new multi-label cross-modal contrastive regression network is then trained on this dataset. We present results demonstrating the robustness of our method in terms of accuracy of finding veracity prediction and localization on multiple X-ray datasets. We also show its effectiveness for error detection in reports of SOTA report generators on multiple datasets achieving a concordance correlation coefficient of 0.997 with ground truth-based verification, thus pointing to its utility during clinical inference in radiology workflows.
[114] MDF-MLLM: Deep Fusion Through Cross-Modal Feature Alignment for Contextually Aware Fundoscopic Image Classification
Jason Jordan,Mohammadreza Akbari Lor,Peter Koulen,Mei-Ling Shyu,Shu-Ching Chen
Main category: cs.CV
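TL;DR: This study proposes MDF-MLLM, a new multimodal deep learning architecture that fuses fine-grained features of retinal fundus images with global textual context, markedly improving disease classification accuracy.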
Details
Motivation: Existing multimodal large language models (MLLMs) struggle to capture the low-level spatial details that are critical for diagnosing retinal diseases, so a model that better fuses image and text information is needed to improve classification performance. Method: MDF-MLLM takes skip features from four U-Net encoder layers and feeds them into the cross-attention blocks of a LLaMA 3.2 11B MLLM; vision and text features are fused using patch-wise projection, scaled cross-attention, and FiLM-based U-Net modulation, and the whole model is fine-tuned end to end. Result: On a dataset of 1,305 image-text pairs, MDF-MLLM reaches 94% classification accuracy versus 60% for the baseline (a 56% relative improvement); recall and F1-score improve by up to 67% and 35%, respectively; ablation studies confirm the value of the multi-depth fusion strategy, especially for inherited eye diseases. Conclusion: Through multi-scale feature fusion, MDF-MLLM provides a generalizable, interpretable, and modular framework for retinal image classification with practical potential in clinical decision support systems. Abstract: This study aimed to enhance disease classification accuracy from retinal fundus images by integrating fine-grained image features and global textual context using a novel multimodal deep learning architecture. Existing multimodal large language models (MLLMs) often struggle to capture low-level spatial details critical for diagnosing retinal diseases such as glaucoma, diabetic retinopathy, and retinitis pigmentosa. This model development and validation study was conducted on 1,305 fundus image-text pairs compiled from three public datasets (FIVES, HRF, and StoneRounds), covering acquired and inherited retinal diseases, and evaluated using classification accuracy and F1-score. The MDF-MLLM integrates skip features from four U-Net encoder layers into cross-attention blocks within a LLaMA 3.2 11B MLLM. Vision features are patch-wise projected and fused using scaled cross-attention and FiLM-based U-Net modulation. The baseline MLLM achieved 60% accuracy on the dual-type disease classification task. MDF-MLLM, with both U-Net and MLLM components fully fine-tuned during training, achieved a significantly higher accuracy of 94%, representing a 56% relative improvement. Recall and F1-scores improved by as much as 67% and 35% over baseline, respectively. Ablation studies confirmed that the multi-depth fusion approach contributed to substantial gains in spatial reasoning and classification, particularly for inherited diseases with rich clinical text.
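The accuracy figures above mix two kinds of gain, so a short worked calculation may help. The sketch assumes the reported 56% is a relative improvement over the 60% baseline rather than a difference in percentage points; with the rounded accuracies given here the ratio comes to about 56.7%, so the published figure was presumably truncated or computed from unrounded values.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in percentage points (both inputs given in percent)."""
    return new - baseline


def relative_gain(baseline: float, new: float) -> float:
    """Improvement as a fraction of the baseline value."""
    return (new - baseline) / baseline


baseline_acc, model_acc = 60.0, 94.0  # accuracies in percent, from the abstract
points = absolute_gain(baseline_acc, model_acc)   # gain in percentage points
ratio = relative_gain(baseline_acc, model_acc)    # gain relative to the baseline
```

Stating which of the two is meant matters: a gain of 34 points and a gain of roughly 56% describe the same result.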
MDF-MLLM presents a generalizable, interpretable, and modular framework for fundus image classification, outperforming traditional MLLM baselines through multi-scale feature fusion. The architecture holds promise for real-world deployment in clinical decision support systems. Future work will explore synchronized training techniques, a larger pool of diseases for more generalizability, and extending the model for segmentation tasks.
[115] Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models
Xingkai Peng,Jun Jiang,Meng Tong,Shuai Li,Weiming Zhang,Nenghai Yu,Kejiang Chen
Main category: cs.CV
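TL;DR: Proposes the Multimodal Prompt Decoupling Attack (MPDA), which uses the image modality to separate out harmful semantics, bypassing text safety filters to generate NSFW images.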
Details
Motivation: Existing text-based jailbreak methods struggle to bypass the safety filters of text-to-image models, and vulnerabilities in image inputs remain underexplored, so a more effective multimodal attack is needed. Method: Uses a large language model to decompose an unsafe prompt into pseudo-safe prompts and harmful prompts, rewrites the harmful prompts into natural adversarial prompts, and uses a visual language model to generate image captions that guide iterative refinement. Result: MPDA effectively bypasses the safety filtering of T2I models and generates NSFW images semantically consistent with the original unsafe prompts. Conclusion: MPDA demonstrates a new attack path that uses image guidance to generate harmful content in a multimodal setting, exposing weaknesses in the multimodal safety defenses of current T2I models. Abstract: Text-to-image (T2I) models have been widely applied in generating high-fidelity images across various domains. However, these models may also be abused to produce Not-Safe-for-Work (NSFW) content via jailbreak attacks. Existing jailbreak methods primarily manipulate the textual prompt, leaving potential vulnerabilities in image-based inputs largely unexplored. Moreover, text-based methods face challenges in bypassing the model's safety filters. In response to these limitations, we propose the Multimodal Prompt Decoupling Attack (MPDA), which utilizes image modality to separate the harmful semantic components of the original unsafe prompt. MPDA follows three core steps: first, a large language model (LLM) decouples unsafe prompts into pseudo-safe prompts and harmful prompts. The former are seemingly harmless sub-prompts that can bypass filters, while the latter are sub-prompts with unsafe semantics that trigger filters. Subsequently, the LLM rewrites the harmful prompts into natural adversarial prompts to bypass safety filters, which guide the T2I model to modify the base image into an NSFW output. Finally, to ensure semantic consistency between the generated NSFW images and the original unsafe prompts, the visual language model generates image captions, providing a new pathway to guide the LLM in iterative rewriting and refining the generated content.
[116] A Mutual Learning Method for Salient Object Detection with intertwined Multi-Supervision--Revised
Runmin Wu,Mengyang Feng,Wenlong Guan,Dong Wang,Huchuan Lu,Errui Ding
Main category: cs.CV
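TL;DR: Proposes a saliency detection network jointly supervised by salient object detection, foreground contour detection, and edge detection; a mutual learning module (MLM) boosts performance, achieving state-of-the-art results on multiple datasets.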
Details
Motivation: Existing deep learning methods for salient object detection suffer from incomplete predictions and inaccurate boundaries, mainly due to the internal complexity of objects and the strides in convolution and pooling operations. Method: Uses a multi-task learning framework that jointly trains salient object detection, foreground contour detection, and edge detection; designs a mutual learning module (MLM) in which multiple branches help one another learn, improving prediction consistency and accuracy. Result: Experiments on seven challenging datasets show state-of-the-art performance in both salient object detection and edge detection. Conclusion: Multi-task supervision and mutual learning effectively improve the completeness and boundary accuracy of saliency detection, substantially raising overall detection quality. Abstract: Though deep learning techniques have made great progress in salient object detection recently, the predicted saliency maps still suffer from incomplete predictions due to the internal complexity of objects and inaccurate boundaries caused by strides in convolution and pooling operations. To alleviate these issues, we propose to train saliency detection networks by exploiting the supervision from not only salient object detection, but also foreground contour detection and edge detection. First, we leverage salient object detection and foreground contour detection tasks in an intertwined manner to generate saliency maps with uniform highlight. Second, the foreground contour and edge detection tasks guide each other simultaneously, thereby leading to precise foreground contour prediction and reducing the local noises for edge prediction. In addition, we develop a novel mutual learning module (MLM) which serves as the building block of our method. Each MLM consists of multiple network branches trained in a mutual learning manner, which improves the performance by a large margin. Extensive experiments on seven challenging datasets demonstrate that the proposed method has delivered state-of-the-art results in both salient object detection and edge detection.
[117] MAJORScore: A Novel Metric for Evaluating Multimodal Relevance via Joint Representation
Zhicheng Du,Qingyang Shi,Jiasheng Lu,Yingshan Liang,Xinyu Zhang,Yiran Wang,Peiwu Qin
Main category: cs.CV
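TL;DR: Proposes MAJORScore, a new multimodal relevance metric that is the first to evaluate relevance across three or more modalities via multimodal joint representation; experiments show it clearly outperforms existing methods in both consistent and inconsistent cases.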
Details
Motivation: Existing multimodal relevance metrics mainly apply to bimodal data and cannot effectively evaluate relevance among three or more modalities, which limits multimodal similarity analysis. Method: Building on the multimodal joint representation ability of pretrained contrastive learning models, maps data from multiple modalities into a shared latent space, enabling fair and accurate relevance scoring for N (N≥3) modalities. Result: Experiments show that, compared with existing methods, MAJORScore increases by 26.03%-64.29% for consistent modalities and decreases by 13.28%-20.54% for inconsistent ones. Conclusion: MAJORScore is a more reliable multimodal similarity metric, suitable for large-scale multimodal datasets and for evaluating multimodal model performance. Abstract: The multimodal relevance metric is usually borrowed from the embedding ability of pretrained contrastive learning models for bimodal data, which is used to evaluate the correlation between cross-modal data (e.g., CLIP). However, the commonly used evaluation metrics are only suitable for the associated analysis between two modalities, which greatly limits the evaluation of multimodal similarity. Herein, we propose MAJORScore, the first evaluation metric for the relevance of multiple modalities ($N$ modalities, $N\ge3$) via multimodal joint representation. The ability of multimodal joint representation to integrate multiple modalities into the same latent space can accurately represent different modalities at one scale, providing support for fair relevance scoring. Extensive experiments have shown that MAJORScore increases by 26.03%-64.29% for consistent modalities and decreases by 13.28%-20.54% for inconsistent ones compared to existing methods. MAJORScore serves as a more reliable metric for evaluating similarity on large-scale multimodal datasets and multimodal model performance evaluation.
[118] Safety Assessment of Scaffolding on Construction Site using AI
Sameer Prabhu,Amit Patwardhan,Ramin Karim
Main category: cs.CV
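TL;DR: Proposes a cloud-based AI platform that uses point cloud data to automatically detect structural changes in scaffolding, improving construction safety and inspection efficiency.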
Details
Motivation: Traditional scaffolding inspection relies on manual visual checks, which are time-consuming and error-prone and can leave safety hazards. Method: Develops a cloud-based AI platform that detects structural changes by comparing and evaluating the most recent scaffolding point cloud data against certified reference data. Result: The system may enable automated monitoring of scaffolding, reducing the time and effort required for manual inspection. Conclusion: The proposed AI approach can improve the accuracy and efficiency of scaffolding safety assessment and help improve overall safety on construction sites. Abstract: In the construction industry, safety assessment is vital to ensure both the reliability of assets and the safety of workers. Scaffolding, a key structural support asset, requires regular inspection to detect and identify alterations from the design rules that may compromise its integrity and stability. At present, inspections are primarily visual and are conducted by the site manager or accredited personnel to identify deviations. However, visual inspection is time-intensive and can be susceptible to human errors, which can lead to unsafe conditions. This paper explores the use of Artificial Intelligence (AI) and digitization to enhance the accuracy of scaffolding inspection and contribute to safety improvement. A cloud-based AI platform is developed to process and analyse the point cloud data of the scaffolding structure. The proposed system detects structural modifications through comparison and evaluation of certified reference data with the recent point cloud data. This approach may enable automated monitoring of scaffolding, reducing the time and effort required for manual inspections while enhancing safety on a construction site.
[119] Automated Prompt Generation for Creative and Counterfactual Text-to-image Synthesis
Aleksa Jelaca,Ying Jiao,Chang Tian,Marie-Francine Moens
Main category: cs.CV
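TL;DR: This paper proposes an automatic prompt engineering framework for generating counterfactual images that contradict common sense (e.g., a tiny walrus beside a giant button). Using three components, an image evaluator, a supervised prompt rewriter, and a DPO-trained ranker, it outperforms existing methods on a newly constructed counterfactual size dataset.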
Details
Motivation: Despite progress in text-to-image generation, fine-grained controllability remains challenging; counterfactual controllability (deliberately generating images that contradict common sense) is especially important for creative and exploratory applications. Method: Proposes a framework comprising an image evaluator, a supervised prompt rewriter, and a DPO-trained ranker that automatically revises base prompts into prompts suited to counterfactual image generation, and constructs the first counterfactual size text-image dataset. Result: The image evaluator, built by refining Grounded SAM, improves on its backbone by 114%; experiments show the method outperforms state-of-the-art baselines and ChatGPT-4o. Conclusion: The work provides an effective framework and data foundation for counterfactual controllability in text-to-image generation and supports future research in this direction. Abstract: Text-to-image generation has advanced rapidly with large-scale multimodal training, yet fine-grained controllability remains a critical challenge. Counterfactual controllability, defined as the capacity to deliberately generate images that contradict common-sense patterns, remains a major challenge but plays a crucial role in enabling creativity and exploratory applications. In this work, we address this gap with a focus on counterfactual size (e.g., generating a tiny walrus beside a giant button) and propose an automatic prompt engineering framework that adapts base prompts into revised prompts for counterfactual images. The framework comprises three components: an image evaluator that guides dataset construction by identifying successful image generations, a supervised prompt rewriter that produces revised prompts, and a DPO-trained ranker that selects the optimal revised prompt. We construct the first counterfactual size text-image dataset and enhance the image evaluator by extending Grounded SAM with refinements, achieving a 114 percent improvement over its backbone. Experiments demonstrate that our method outperforms state-of-the-art baselines and ChatGPT-4o, establishing a foundation for future research on counterfactual controllability.
[120] In silico Deep Learning Protocols for Label-Free Super-Resolution Microscopy: A Comparative Study of Network Architectures and SNR Dependence
Shiraz S Kaderuppan,Jonathan Mar,Andrew Irvine,Anurag Sharma,Muhammad Ramadan Saifuddin,Wai Leong Eugene Wong,Wai Lok Woo
Main category: cs.CV
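TL;DR: This study evaluates the ability of two deep neural networks (O-Net and Theta-Net) to achieve super-resolution imaging in non-fluorescent phase-modulated microscopy, finding that the two are complementary rather than competing under different signal-to-noise conditions.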
Details
Motivation: To overcome the limited lateral resolution of conventional optical microscopy (about 200 nm) and offer an economical super-resolution imaging approach accessible to ordinary users. Method: Applies two previously developed deep neural network architectures (O-Net and Theta-Net) to super-resolve images of a test target containing nanoscale features calibrated by atomic force microscopy, and analyzes their performance under different signal-to-noise ratios. Result: O-Net performs better at high SNR, while Theta-Net has the advantage at low SNR, indicating that model architecture and input image SNR jointly affect super-resolution quality. Conclusion: For non-fluorescent optical nanoscopy, the deep learning architecture should be chosen according to image SNR; O-Net and Theta-Net are complementary tools that can improve super-resolution imaging quality. Abstract: The field of optical microscopy spans numerous industries and research domains, ranging from education to healthcare, quality inspection and analysis. Nonetheless, a key limitation often cited by optical microscopists refers to the limit of its lateral resolution (typically defined as ~200nm), with potential circumventions involving either costly external modules (e.g. confocal scan heads, etc) and/or specialized techniques [e.g. super-resolution (SR) fluorescent microscopy]. Addressing these challenges in a normal (non-specialist) context thus remains an aspect outside the scope of most microscope users & facilities. This study thus seeks to evaluate an alternative & economical approach to achieving SR optical microscopy, involving non-fluorescent phase-modulated microscopical modalities such as Zernike phase contrast (PCM) and differential interference contrast (DIC) microscopy. Two in silico deep neural network (DNN) architectures which we developed previously (termed O-Net and Theta-Net) are assessed on their abilities to resolve a custom-fabricated test target containing nanoscale features calibrated via atomic force microscopy (AFM). The results of our study demonstrate that although both O-Net and Theta-Net seemingly performed well when super-resolving these images, they were complementary (rather than competing) approaches to be considered for image SR, particularly under different image signal-to-noise ratios (SNRs). High image SNRs favoured the application of O-Net models, while low SNRs inclined preferentially towards Theta-Net models.
These findings demonstrate the importance of model architectures (in conjunction with the source image SNR) on model performance and the SR quality of the generated images where DNN models are utilized for non-fluorescent optical nanoscopy, even where the same training dataset & number of epochs are being used.
[121] Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation
Yinfeng Yu,Hailong Zhang,Meiling Zhu
Main category: cs.CV
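TL;DR: Proposes a dynamic multi-target fusion method (DMTF-AVN) for efficient audio-visual navigation, which selectively fuses cross-modal information through a refined Transformer mechanism and reaches SOTA on multiple metrics.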
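Two of the navigation metrics this entry reports, success rate (SR) and success weighted by path length (SPL), have standard definitions that are easy to state in code. The sketch below uses the usual form SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where l_i is the shortest-path length and p_i the length of the path the agent actually took; the episode tuple layout is a hypothetical convenience, and SNA is omitted because its definition is not given in this entry.

```python
from typing import List, Tuple

# (success, shortest_path_length, agent_path_length); lengths assumed > 0
Episode = Tuple[bool, float, float]


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by (normalized inverse) path length."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A failed episode contributes 0; a detour shrinks the credit.
            total += shortest / max(taken, shortest)
    return total / len(episodes)
```

Because each successful episode is scaled by how direct the route was, SPL can never exceed SR on the same set of episodes.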
Details
Motivation: Existing audio-visual navigation methods usually perform only basic fusion of audio and visual data and overlook deeper perceptual context, making it hard to use multimodal cues effectively to guide navigation. Method: Uses a multi-target architecture with a refined Transformer mechanism to dynamically filter and selectively fuse visual and auditory information, improving cross-modal perception. Result: Experiments on the Replica and Matterport3D datasets show the method outperforms existing methods in success rate (SR), path efficiency (SPL), and scene adaptation (SNA), with good scalability and generalization. Conclusion: DMTF-AVN substantially improves audio-visual embodied navigation through fine-grained multimodal fusion and offers a new direction for multimodal fusion in robotic navigation. Abstract: Audiovisual embodied navigation enables robots to locate audio sources by dynamically integrating visual observations from onboard sensors with the auditory signals emitted by the target. The core challenge lies in effectively leveraging multimodal cues to guide navigation. While prior works have explored basic fusion of visual and audio data, they often overlook deeper perceptual context. To address this, we propose the Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation (DMTF-AVN). Our approach uses a multi-target architecture coupled with a refined Transformer mechanism to filter and selectively fuse cross-modal information. Extensive experiments on the Replica and Matterport3D datasets demonstrate that DMTF-AVN achieves state-of-the-art performance, outperforming existing methods in success rate (SR), path efficiency (SPL), and scene adaptation (SNA). Furthermore, the model exhibits strong scalability and generalizability, paving the way for advanced multimodal fusion strategies in robotic navigation. The code and videos are available at https://github.com/zzzmmm-svg/DMTF.
[122] SAEmnesia: Erasing Concepts in Diffusion Models with Sparse Autoencoders
Enrico Cassano,Riccardo Renzulli,Marco Nurisso,Mirko Zaffaroni,Alan Perotti,Marco Grangetto
Main category: cs.CV
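TL;DR: SAEmnesia is a supervised sparse autoencoder training method that achieves one-to-one concept-neuron mappings through systematic concept labeling, effectively improving concept unlearning in text-to-image diffusion models.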
Details
Motivation: Existing sparse autoencoders reduce neuron polysemanticity, but concept representations remain spread across multiple latent features, so concept unlearning requires extensive search and specific concepts are hard to locate and remove efficiently. Method: Proposes SAEmnesia, a supervised sparse autoencoder training method that introduces systematic concept labels to promote one-to-one concept-neuron mappings, reducing feature splitting and increasing feature centralization so that concept representations can be located precisely. Result: Compared with unsupervised baselines, SAEmnesia learns neurons with stronger concept associations; it reduces hyperparameter search at inference time by 96.67%; it improves on the state of the art by 9.22% on the UnlearnCanvas benchmark; and it improves unlearning accuracy by 28.4% on the sequential 9-object removal task. Conclusion: By using supervision to build interpretable, concentrated concept representations, SAEmnesia substantially improves the efficiency and accuracy of concept unlearning in text-to-image models, with good scalability and practical potential. Abstract: Effective concept unlearning in text-to-image diffusion models requires precise localization of concept representations within the model's latent space. While sparse autoencoders successfully reduce neuron polysemanticity (i.e., multiple concepts per neuron) compared to the original network, individual concept representations can still be distributed across multiple latent features, requiring extensive search procedures for concept unlearning. We introduce SAEmnesia, a supervised sparse autoencoder training method that promotes one-to-one concept-neuron mappings through systematic concept labeling, mitigating feature splitting and promoting feature centralization. Our approach learns specialized neurons with significantly stronger concept associations compared to unsupervised baselines. The only computational overhead introduced by SAEmnesia is limited to cross-entropy computation during training. At inference time, this interpretable representation reduces hyperparameter search by 96.67% with respect to current approaches. On the UnlearnCanvas benchmark, SAEmnesia achieves a 9.22% improvement over the state-of-the-art. In sequential unlearning tasks, we demonstrate superior scalability with a 28.4% improvement in unlearning accuracy for 9-object removal.
[123] Coreset selection based on Intra-class diversity
Imran Ashraf,Mukhtar Ullah,Muhammad Faisal Nadeem,Muhammad Nouman Noor
Main category: cs.CV
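TL;DR: Proposes an intelligent, lightweight coreset selection method that extracts intra-class diversity via clustering to make data subsets more representative for biomedical image classification, outperforming random sampling on several metrics.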
Details
Motivation: Conventional random sampling yields unrepresentative subsets on imbalanced or highly diverse datasets and in particular ignores intra-class diversity, which hurts the efficiency of model training and hyperparameter search. Method: Extracts features for the samples of each class, clusters within each class to capture diversity, and samples representatively from the resulting clusters to build a more representative coreset. Result: Experiments on a well-known biomedical imaging dataset show the method outperforms random sampling on several performance metrics under the same conditions. Conclusion: The proposed coreset selection based on intra-class clustering makes small data subsets more representative, helping reduce compute cost while maintaining training quality. Abstract: Deep Learning models have transformed various domains, including the healthcare sector, particularly biomedical image classification by learning intricate features and enabling accurate diagnostics pertaining to complex diseases. Recent studies have adopted two different approaches to train DL models: training from scratch and transfer learning. Both approaches demand substantial computational time and resources due to the involvement of massive datasets in model training. These computational demands are further increased due to the design-space exploration required for selecting optimal hyperparameters, which typically necessitates several training rounds. With the growing sizes of datasets, exploring solutions to this problem has recently gained the research community's attention. A plausible solution is to select a subset of the dataset for training and hyperparameter search. This subset, referred to as the coreset, must be a representative set of the original dataset. A straightforward approach to selecting the coreset could be employing random sampling, albeit at the cost of compromising the representativeness of the original dataset. A critical limitation of random sampling is the bias towards the dominant classes in an imbalanced dataset. Even if the dataset has inter-class balance, this random sampling will not capture intra-class diversity. This study addresses this issue by introducing an intelligent, lightweight mechanism for coreset selection. Specifically, it proposes a method to extract intra-class diversity, forming per-class clusters that are utilized for the final sampling.
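The per-class clustering idea described above can be sketched in a few lines. Everything specific below is an assumption made for illustration: the abstract does not name the feature extractor, the clustering algorithm, or the sampling rule, so this sketch uses plain Lloyd's k-means on precomputed feature vectors and draws a fixed number of samples uniformly from each cluster. The function and parameter names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def _dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _kmeans(points: List[Sequence[float]], k: int,
            iters: int = 20, seed: int = 0) -> List[int]:
    """Plain Lloyd's k-means; returns a cluster index for each point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: _dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def select_coreset(features: List[Sequence[float]], labels: List[Hashable],
                   clusters_per_class: int = 2, per_cluster: int = 1,
                   seed: int = 0) -> List[int]:
    """Return sorted dataset indices: a few samples from every per-class cluster."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen: List[int] = []
    for label in sorted(by_class):
        idxs = by_class[label]
        k = min(clusters_per_class, len(idxs))
        assign = _kmeans([features[i] for i in idxs], k, seed=seed)
        for c in range(k):
            members = [i for i, a in zip(idxs, assign) if a == c]
            chosen.extend(rng.sample(members, min(per_cluster, len(members))))
    return sorted(chosen)
```

Compared with uniform random sampling, this guarantees that every class and every non-empty cluster within a class contributes to the subset, which is the representativeness property the abstract argues for.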
We demonstrate the efficacy of the proposed methodology by conducting extensive classification experiments on a well-known biomedical imaging dataset. Results demonstrate that the proposed scheme outperforms the random sampling approach on several performance metrics for uniform conditions.[124] The LongiMam model for improved breast cancer risk prediction using longitudinal mammograms
Manel Rakez,Thomas Louis,Julien Guillaumin,Foucauld Chamming's,Pierre Fillard,Brice Amadeo,Virginie Rondeau
Main category: cs.CV
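TL;DR: LongiMam is an end-to-end deep learning model that combines the current mammogram with up to four prior ones, using a convolutional and a recurrent neural network to capture spatial and temporal patterns predictive of breast cancer; it performs well under real-world imbalanced data and heterogeneous follow-up.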
Details
Motivation: Most existing deep learning models use a single or a limited number of prior mammograms and are not adapted to the imbalanced outcome distribution and heterogeneous follow-up of real screening settings, so a robust model that integrates longitudinal data is needed.
Method: LongiMam combines a convolutional neural network (CNN) and a recurrent neural network (RNN). It is trained and evaluated on a large population-based screening dataset, across scenarios with different numbers and combinations of prior exams, followed by subgroup analyses.
Result: Prediction improves consistently when prior and current exams are combined, outperforming single-visit models. Subgroup analyses confirm the model's efficacy in women with dense breasts and in women aged 55 or older, and performance is best in women whose mammographic density changes over time.
Conclusion: Longitudinal modeling improves breast cancer prediction and supports the use of repeated mammograms to refine risk stratification in screening programs; the model is released as open-source software.
Abstract: Risk-adapted breast cancer screening requires robust models that leverage longitudinal imaging data. Most current deep learning models use single or limited prior mammograms and lack adaptation for real-world settings marked by imbalanced outcome distribution and heterogeneous follow-up. We developed LongiMam, an end-to-end deep learning model that integrates both current and up to four prior mammograms. LongiMam combines a convolutional and a recurrent neural network to capture spatial and temporal patterns predictive of breast cancer. The model was trained and evaluated using a large, population-based screening dataset with a disproportionate case-to-control ratio typical of clinical screening. Across several scenarios that varied in the number and composition of prior exams, LongiMam consistently improved prediction when prior mammograms were included. The addition of prior and current visits outperformed single-visit models, while priors alone performed less well, highlighting the importance of combining historical and recent information. Subgroup analyses confirmed the model's efficacy across key risk groups, including women with dense breasts and those aged 55 years or older. Moreover, the model performed best in women with observed changes in mammographic density over time. These findings demonstrate that longitudinal modeling enhances breast cancer prediction and support the use of repeated mammograms to refine risk stratification in screening programs. LongiMam is publicly available as open-source software.
[125] Assessing the Alignment of Popular CNNs to the Brain for Valence Appraisal
Laurent Mertens,Elahe' Yargholi,Laura Van Hove,Hans Op de Beeck,Jan Van den Stock,Joost Vennekens
Main category: cs.CV
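TL;DR: This paper examines how convolutional neural networks (CNNs) fare on a complex brain process, social cognition, using a correlation analysis to assess how well popular CNN architectures align with human behavioral and fMRI data on image valence appraisal. CNNs struggle to go beyond simple visual processing and do not seem to reflect higher-order brain activity. The authors also present Object2Brain, a framework that combines GradCAM and object detection to analyze how object classes influence CNN-to-human correlations; different CNN architectures show different object-class sensitivities.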
Details
Motivation: To explore how CNNs correspond to the human brain in complex cognitive processes such as social cognition, rather than in general visual perception alone.
Method: A correlation analysis assesses the alignment of popular CNN architectures with human behavioral and fMRI data on image valence appraisal. The proposed Object2Brain framework combines GradCAM and object detection at the CNN-filter level to analyze how different object classes influence these correlations.
Result: On image valence appraisal, CNNs struggle to go beyond simple visual processing and do not reflect higher-order brain activity; different CNN architectures show different object-class sensitivities despite similar correlation trends.
Conclusion: Current CNNs are limited as models of human social cognition and largely remain at the level of low-level visual processing; Object2Brain helps expose architectural differences in object sensitivity and points to directions for improvement.
Abstract: Convolutional Neural Networks (CNNs) are a popular type of computer model that has proven its worth in many computer vision tasks. Moreover, they form an interesting study object for the field of psychology, with demonstrated correspondences between the workings of CNNs and the human brain. However, these correspondences have so far mostly been studied in the context of general visual perception. In contrast, this paper explores to what extent this correspondence also holds for a more complex brain process, namely social cognition. To this end, we assess the alignment between popular CNN architectures and both human behavioral and fMRI data for image valence appraisal through a correlation analysis. We show that for this task CNNs struggle to go beyond simple visual processing, and do not seem to reflect higher-order brain processing. Furthermore, we present Object2Brain, a novel framework that combines GradCAM and object detection at the CNN-filter level with the aforementioned correlation analysis to study the influence of different object classes on the CNN-to-human correlations. Despite similar correlation trends, different CNN architectures are shown to display different object class sensitivities.
[126] Debugging Concept Bottleneck Models through Removal and Retraining
Eric Enouen,Sainyam Galhotra
Main category: cs.CV
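TL;DR: Proposes an interpretable debugging framework for Concept Bottleneck Models (CBMs) built on a two-step process of removal and retraining; its retraining method, CBDebug, turns expert feedback into auxiliary labels that reduce the model's reliance on undesired concepts.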
Details
Motivation: Expert interventions on existing CBMs cannot resolve systemic misalignment between the model and the expert's reasoning, such as shortcut learning caused by biased data.
Method: A two-step debugging framework: experts first identify and remove undesired concepts; the proposed CBDebug method then converts this concept-level feedback into sample-level auxiliary labels, which are used for supervised bias mitigation and targeted data augmentation.
Result: Across several CBM architectures and benchmarks with known spurious correlations, CBDebug significantly outperforms prior retraining methods, with both real and automated expert feedback.
Conclusion: CBDebug improves the interpretability and robustness of CBMs, aligning model behavior more closely with expert knowledge and reducing what the model learns from data bias.
Abstract: Concept Bottleneck Models (CBMs) use a set of human-interpretable concepts to predict the final task label, enabling domain experts to not only validate the CBM's predictions, but also intervene on incorrect concepts at test time. However, these interventions fail to address systemic misalignment between the CBM and the expert's reasoning, such as when the model learns shortcuts from biased data. To address this, we present a general interpretable debugging framework for CBMs that follows a two-step process of Removal and Retraining. In the Removal step, experts use concept explanations to identify and remove any undesired concepts. In the Retraining step, we introduce CBDebug, a novel method that leverages the interpretability of CBMs as a bridge for converting concept-level user feedback into sample-level auxiliary labels. These labels are then used to apply supervised bias mitigation and targeted augmentation, reducing the model's reliance on undesired concepts. We evaluate our framework with both real and automated expert feedback, and find that CBDebug significantly outperforms prior retraining methods across multiple CBM architectures (PIP-Net, Post-hoc CBM) and benchmarks with known spurious correlations.
[127] ShipwreckFinder: A QGIS Tool for Shipwreck Detection in Multibeam Sonar Data
Anja Sheppard,Tyler Smithline,Andrew Scheffer,David Smith,Advaith V. Sethuraman,Ryan Bird,Sabrina Lin,Katherine A. Skinner
Main category: cs.CV
TL;DR: Introduces ShipwreckFinder, an open-source QGIS plugin that automatically detects shipwrecks in multibeam sonar data.
Details
Motivation: Shipwrecks are important markers of maritime history, but they are traditionally found by manual inspection of bathymetric data, which is time-consuming and depends on expert analysis.
Method: The tool combines a deep learning model with synthetic data generation and supports automatic preprocessing, inference, and thresholding, producing either pixel-wise segmentation masks or bounding boxes of predicted shipwrecks. The model is trained on real shipwreck data from the Great Lakes and the coasts of Ireland together with synthetic data.
Result: The tool achieves better segmentation performance than a deep learning-based ArcGIS toolkit and a classical inverse sinkhole detection method.
Conclusion: ShipwreckFinder offers an efficient, open-source solution for automated shipwreck detection, with good prospects for practical use and reproducibility.
Abstract: In this paper, we introduce ShipwreckFinder, an open-source QGIS plugin that detects shipwrecks from multibeam sonar data. Shipwrecks are important markers of maritime history, and can be discovered through manual inspection of bathymetric data. However, this is a time-consuming process and often requires expert analysis. Our proposed tool allows users to automatically preprocess bathymetry data, perform deep learning inference, threshold model outputs, and produce either pixel-wise segmentation masks or bounding boxes of predicted shipwrecks. The backbone of this open-source tool is a deep learning model, which is trained on a variety of shipwreck data from the Great Lakes and the coasts of Ireland. Additionally, we employ synthetic data generation in order to increase the size and diversity of our dataset. We demonstrate superior segmentation performance with our open-source tool and training pipeline as compared to a deep learning-based ArcGIS toolkit and a more classical inverse sinkhole detection method. The open-source tool can be found at https://github.com/umfieldrobotics/ShipwreckFinderQGISPlugin.
[128] Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity, Sparsity, and Concept Coherence
Sanish Suwal,Dipkamal Bhusal,Michael Clifford,Nidhi Rastogi
Main category: cs.CV
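TL;DR: Studies how magnitude-based pruning followed by fine-tuning affects neural network interpretability, finding that light-to-moderate pruning improves saliency-map focus and faithfulness, while excessive pruning harms interpretability.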
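For readers unfamiliar with the technique, magnitude-based pruning can be sketched in a few lines. This is a generic NumPy illustration on a single weight matrix, not the paper's pipeline; the function name `magnitude_prune` and the unstructured, per-tensor threshold are assumptions.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero the fraction `sparsity` of entries with the smallest |w|.

    Returns the pruned weights and a boolean mask of kept entries.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(sparsity * weights.size)  # number of weights to drop
    mask = np.ones(weights.size, dtype=bool)
    if k > 0:
        flat = np.abs(weights).ravel()
        # argpartition places the k smallest magnitudes in the first k slots.
        drop = np.argpartition(flat, k - 1)[:k]
        mask[drop] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
pruned, mask = magnitude_prune(w, sparsity=0.5)
```

The mask is returned alongside the pruned weights because a subsequent fine-tuning step would typically reapply it after each update so that pruned entries stay at zero.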
Details
Motivation: Neural networks are known to tolerate heavy pruning without losing performance, but the effect of pruning on interpretability is unclear; this work examines how different pruning levels affect low-level saliency maps and high-level concept representations.
Method: Using a ResNet-18 trained on ImageNette, post-hoc explanations from Vanilla Gradients (VG) and Integrated Gradients (IG) are compared across pruning levels after fine-tuning, evaluating saliency-map sparsity and faithfulness; CRAFT-based concept extraction tracks changes in semantic coherence.
Result: Light-to-moderate pruning improves saliency-map focus and faithfulness while retaining semantically meaningful concepts; aggressive pruning maintains accuracy but merges heterogeneous features, reducing saliency-map sparsity and concept coherence.
Conclusion: Pruning can shape internal representations toward more human-aligned attention patterns and thereby aid interpretability, but excessive pruning breaks semantic structure and undermines it.
Abstract: Prior works have shown that neural networks can be heavily pruned while preserving performance, but the impact of pruning on model interpretability remains unclear. In this work, we investigate how magnitude-based pruning followed by fine-tuning affects both low-level saliency maps and high-level concept representations. Using a ResNet-18 trained on ImageNette, we compare post-hoc explanations from Vanilla Gradients (VG) and Integrated Gradients (IG) across pruning levels, evaluating sparsity and faithfulness. We further apply CRAFT-based concept extraction to track changes in semantic coherence of learned concepts. Our results show that light-to-moderate pruning improves saliency-map focus and faithfulness while retaining distinct, semantically meaningful concepts. In contrast, aggressive pruning merges heterogeneous features, reducing saliency map sparsity and concept coherence despite maintaining accuracy. These findings suggest that while pruning can shape internal representations toward more human-aligned attention patterns, excessive pruning undermines interpretability.
[129] TUN3D: Towards Real-World Scene Understanding from Unposed Images
Anton Konushin,Nikita Drozdov,Bulat Gabdullin,Alexey Zakharov,Anna Vorontsova,Danila Rukhovich,Maksim Kolodiazhnyi
Main category: cs.CV
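TL;DR: TUN3D is the first method to perform joint layout estimation and 3D object detection from multi-view images without ground-truth camera poses or depth supervision, reaching state-of-the-art performance on several indoor scene understanding benchmarks.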
Details
Motivation: Existing methods rely on point cloud input, which limits their use on consumer devices that capture visual data only; a joint 3D object detection and layout estimation method is needed that works from multi-view images without depth information or camera poses.
Method: TUN3D uses a lightweight sparse-convolutional backbone with two dedicated heads, one for 3D object detection and one for layout estimation, and introduces a novel parametric wall representation.
Result: On three challenging scene understanding benchmarks, TUN3D reaches state-of-the-art performance with ground-truth point clouds, posed images, and unposed images, and it clearly surpasses prior methods on layout estimation.
Conclusion: TUN3D achieves joint layout estimation and 3D object detection without depth input or camera poses, advancing holistic indoor scene understanding with good practicality and extensibility.
Abstract: Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d.
[130] Large AI Model-Enabled Generative Semantic Communications for Image Transmission
Qiyu Ma,Wanli Ni,Zhijin Qin
Main category: cs.CV
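TL;DR: Proposes a generative-AI-based semantic communication system that distinguishes key from non-key image regions to improve the quality and efficiency of image transmission, with a lightweight deployment strategy to make better use of resources.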
Details
Motivation: Existing semantic communication methods overlook differences in importance between image regions, which degrades the reconstruction quality of visually critical content.
Method: Images are segmented into key and non-key regions; key regions are processed by an image-oriented semantic encoder, while non-key regions are compressed through image-to-text modeling. Model quantization and low-rank adaptation fine-tuning provide lightweight deployment.
Result: Simulation results show that the system outperforms traditional methods in both semantic fidelity and visual quality.
Conclusion: The proposed method improves semantic accuracy and visual reconstruction quality in image transmission while reducing the storage and computation overhead of large models, making it suitable for resource-constrained settings.
Abstract: The rapid development of generative artificial intelligence (AI) has introduced significant opportunities for enhancing the efficiency and accuracy of image transmission within semantic communication systems. Despite these advancements, existing methodologies often neglect the difference in importance of different regions of the image, potentially compromising the reconstruction quality of visually critical content. To address this issue, we introduce an innovative generative semantic communication system that refines semantic granularity by segmenting images into key and non-key regions. Key regions, which contain essential visual information, are processed using an image-oriented semantic encoder, while non-key regions are efficiently compressed through an image-to-text modeling approach. Additionally, to mitigate the substantial storage and computational demands posed by large AI models, the proposed system employs a lightweight deployment strategy incorporating model quantization and low-rank adaptation fine-tuning techniques, significantly boosting resource utilization without sacrificing performance. Simulation results demonstrate that the proposed system outperforms traditional methods in terms of both semantic fidelity and visual quality, thereby affirming its effectiveness for image transmission tasks.
[131] mmHSense: Multi-Modal and Distributed mmWave ISAC Datasets for Human Sensing
Nabeel Nisar Bhat,Maksim Karnaukh,Stein Vandenbroeke,Wouter Lemoine,Jakob Struye,Jesus Omar Lacruz,Siddhartha Kumar,Mohammad Hossein Moghaddam,Joerg Widmer,Rafael Berkvens,Jeroen Famaey
Main category: cs.CV
TL;DR: Presents mmHSense, a collection of open mmWave datasets that support human sensing research within Integrated Sensing and Communication (ISAC) systems.
Details
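Motivation: Advancing mmWave ISAC research for applications such as gesture recognition, person identification, pose estimation, and localization requires public, well-labeled datasets that support the development of signal processing and deep learning algorithms.
Method: The authors build a set of open labeled mmWave datasets (mmHSense) and describe the testbed, experimental settings, and signal features for each; they validate the datasets on a specific downstream task and use parameter-efficient fine-tuning to adapt models to different tasks.
Result: The datasets support a range of human sensing tasks and algorithm research; parameter-efficient fine-tuning significantly reduces computational complexity while maintaining performance on prior tasks.
Conclusion: mmHSense is a valuable resource for mmWave ISAC research, and parameter-efficient fine-tuning helps deploy models in multi-task settings.
Abstract: This article presents mmHSense, a set of open labeled mmWave datasets to support human sensing research within Integrated Sensing and Communication (ISAC) systems. The datasets can be used to explore mmWave ISAC for various end applications such as gesture recognition, person identification, pose estimation, and localization. Moreover, the datasets can be used to develop and advance signal processing and deep learning research on mmWave ISAC. This article describes the testbed, experimental settings, and signal features for each dataset. Furthermore, the utility of the datasets is demonstrated through validation on a specific downstream task. In addition, we demonstrate the use of parameter-efficient fine-tuning to adapt ISAC models to different tasks, significantly reducing computational complexity while maintaining performance on prior tasks.
[132] Skeleton Sparsification and Densification Scale-Spaces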
Julia Gierke,Pascal Peter
Main category: cs.CV
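TL;DR: Introduces skeletonisation scale-spaces, which use sparsification of the medial axis to simplify shapes hierarchically, overcoming the noise sensitivity of conventional methods while satisfying key scale-space properties.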
Details
Motivation: The medial axis is a widely used shape descriptor, but it is highly sensitive to boundary noise and readily produces redundant skeletal branches, which limits its practical use.
Method: The skeletonisation scale-space framework combines medial axis sparsification with hierarchical simplification and satisfies scale-space properties such as hierarchical architecture, controllable simplification, and equivariance to geometric transformations; a densification mechanism supports inverse progression from coarse to fine scales and can even produce overcomplete representations.
Result: A rigorous theoretical foundation is established in both continuous and discrete formulations, and experiments demonstrate effectiveness for robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.
Conclusion: Skeletonisation scale-spaces give shape analysis a framework that is both theoretically rigorous and practical, retaining the strengths of classical pruning while overcoming its limitations.
Abstract: The Hamilton-Jacobi skeleton, also known as the medial axis, is a powerful shape descriptor that represents binary objects in terms of the centres of maximal inscribed discs. Despite its broad applicability, the medial axis suffers from sensitivity to noise: minor boundary variations can lead to disproportionately large and undesirable expansions of the skeleton. Classical pruning methods mitigate this shortcoming by systematically removing extraneous skeletal branches. This sequential simplification of skeletons resembles the principle of sparsification scale-spaces that embed images into a family of reconstructions from increasingly sparse pixel representations. We combine both worlds by introducing skeletonisation scale-spaces: They leverage sparsification of the medial axis to achieve hierarchical simplification of shapes. Unlike conventional pruning, our framework inherently satisfies key scale-space properties such as hierarchical architecture, controllable simplification, and equivariance to geometric transformations. We provide a rigorous theoretical foundation in both continuous and discrete formulations and extend the concept further with densification. This allows inverse progression from coarse to fine scales and can even reach beyond the original skeleton to produce overcomplete shape representations that are relevant to practical applications. Through proof-of-concept experiments, we demonstrate the effectiveness of our framework for practical tasks including robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.
[133] Downscaling climate projections to 1 km with single-image super resolution
Petr Košťál,Pavel Kordík,Ondřej Podsztavek
Main category: cs.CV
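TL;DR: Proposes statistical downscaling with single-image super-resolution models to raise climate projections from low resolution (e.g. 12.5 km) to 1 km, together with a climate-indicator-based assessment that validates the results without high-resolution ground truth.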
Details
Motivation: Available climate projections have low spatial resolution (e.g. 12.5 km), which limits their use in local decision-making, so higher resolution is needed to make them practical.
Method: Single-image super-resolution models statistically downscale climate projections; the models are trained on a high-resolution observational gridded data set and then applied to low-resolution projections. A climate-indicator-based assessment evaluates the downscaled output using observed climate indices computed at weather station locations.
Result: Experiments on daily mean temperature show that projections can be downscaled to 1-km resolution without increasing the error of climate indicators relative to the original low-resolution projections.
Conclusion: The proposed super-resolution downscaling raises the spatial resolution of climate projections in the absence of high-resolution ground truth while preserving the accuracy of climate indicators, showing practical potential.
Abstract: High-resolution climate projections are essential for local decision-making. However, available climate projections have low spatial resolution (e.g. 12.5 km), which limits their usability. We address this limitation by leveraging single-image super-resolution models to statistically downscale climate projections to 1-km resolution. Since high-resolution climate projections are unavailable for training, we train models on a high-resolution observational gridded data set and apply them to low-resolution climate projections. We propose a climate indicator-based assessment using observed climate indices computed at weather station locations to evaluate the downscaled climate projections without ground-truth high-resolution climate projections. Experiments on daily mean temperature demonstrate that single-image super-resolution models can downscale climate projections without increasing the error of climate indicators compared to low-resolution climate projections.
[134] JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation
Md Jueal Mia,M. Hadi Amini
Main category: cs.CV
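TL;DR: Proposes JaiLIP, an image-space jailbreak attack that jointly optimizes a mean squared error loss and a harmful-output loss to generate effective, imperceptible adversarial images, outperforming existing methods in producing toxicity and showing practical applicability.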
Details
Motivation: With many attack vectors available, the safety alignment of vision-language models (VLMs) is an increasingly pressing concern; image-based perturbation attacks are especially effective, but existing methods suffer from unstable performance and visible perturbations.
Method: Jailbreaking with Loss-guided Image Perturbation (JaiLIP) minimizes, in image space, the mean squared error loss between the clean and adversarial images jointly with the model's harmful-output loss.
Result: Under standard toxicity metrics from Perspective API and Detoxify, JaiLIP produces adversarial images that are more effective and less perceptible, outperforming existing methods in producing toxicity; a practical scenario is also evaluated in the transportation domain.
Conclusion: Image-based jailbreak attacks pose a practical threat to VLMs, underscoring the need for efficient defense mechanisms.
Abstract: Vision-Language Models (VLMs) have remarkable abilities in multimodal reasoning tasks. However, potential misuse or safety alignment concerns of VLMs have increased significantly due to different categories of attack vectors. Among various attack vectors, recent studies have demonstrated that image-based perturbations are particularly effective in generating harmful outputs. In the literature, many techniques have been proposed to jailbreak VLMs, but they lead to unstable performance and visible perturbations. In this study, we propose Jailbreaking with Loss-guided Image Perturbation (JaiLIP), a jailbreaking attack in the image space that minimizes a joint objective combining the mean squared error (MSE) loss between the clean and adversarial images with the model's harmful-output loss. We evaluate our proposed method on VLMs using standard toxicity metrics from Perspective API and Detoxify. Experimental results demonstrate that our method generates highly effective and imperceptible adversarial images, outperforming existing methods in producing toxicity. Moreover, we have evaluated our method in the transportation domain to demonstrate the attack's practicality beyond toxic text generation in a specific domain. Our findings emphasize the practical challenges of image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.
[135] Overview of ExpertLifeCLEF 2018: how far automated identification systems are from the best experts?
Herve Goeau,Pierre Bonnet,Alexis Joly
Main category: cs.CV
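TL;DR: Through the LifeCLEF 2018 ExpertCLEF challenge, this paper compares deep learning systems with expert botanists on species identification and finds that state-of-the-art deep learning models now perform close to human experts.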
Details
Motivation: To assess the performance gap between automated identification systems and human experts in species identification, and to quantify the uncertainty that arises because visual or audio data carry incomplete information.
Method: The LifeCLEF 2018 ExpertCLEF challenge compares 19 deep learning systems submitted by 4 research teams against 9 expert botanists of the French flora.
Result: State-of-the-art deep learning models perform close to the most advanced human experts on species identification.
Conclusion: Automated systems for plant identification have reached a level close to that of human experts, laying a foundation for future human-machine collaboration.
Abstract: Automated identification of plants and animals has improved considerably in the last few years, in particular thanks to the recent advances in deep learning. The next big question is how far such automated systems are from human expertise. Indeed, even the best experts are sometimes confused and/or disagree with each other when validating visual or audio observations of living organisms. A picture actually contains only partial information that is usually not sufficient to determine the right species with certainty. Quantifying this uncertainty and comparing it to the performance of automated systems is of high interest for both computer scientists and expert naturalists. The LifeCLEF 2018 ExpertCLEF challenge presented in this paper was designed to allow this comparison between human experts and automated systems. In total, 19 deep-learning systems implemented by 4 different research teams were evaluated with regard to 9 expert botanists of the French flora. The main outcome of this work is that the performance of state-of-the-art deep learning models is now close to the most advanced human expertise. This paper presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
[136] QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models
Jian Liu,Chunshi Wang,Song Guo,Haohan Weng,Zhen Zhou,Zhiqi Li,Jiaao Yu,Yiling Zhu,Jing Xu,Biwen Lei,Zhuo Chen,Chunchao Guo
Main category: cs.CV
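TL;DR: Presents QuadGPT, the first end-to-end autoregressive framework for generating quadrilateral meshes; with a unified tokenization method and a reinforcement-learning fine-tuning strategy, tDPO, it significantly outperforms conventional triangle-to-quad methods in geometric accuracy and topological quality.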
Details
Motivation: Existing generative models usually produce triangle meshes first and then merge triangles into quadrilaterals, which yields poor topology; a method that directly generates high-quality quad-dominant meshes is missing.
Method: QuadGPT casts mesh generation as sequence prediction, with a unified tokenization method that handles mixed topologies of triangles and quadrilaterals and a specialized reinforcement learning fine-tuning method, tDPO, that improves generation quality.
Result: Experiments show that QuadGPT significantly surpasses previous triangle-to-quad conversion pipelines in both geometric accuracy and topological quality.
Conclusion: QuadGPT sets a new benchmark for native quad-mesh generation and shows the potential of combining large-scale autoregressive models with topology-aware reinforcement learning for creating structured 3D assets.
Abstract: The generation of quadrilateral-dominant meshes is a cornerstone of professional 3D content creation. However, existing generative models generate quad meshes by first generating triangle meshes and then merging triangles into quadrilaterals with some specific rules, which typically produces quad meshes with poor topology. In this paper, we introduce QuadGPT, the first autoregressive framework for generating quadrilateral meshes in an end-to-end manner. QuadGPT formulates this as a sequence prediction paradigm, distinguished by two key innovations: a unified tokenization method to handle mixed topologies of triangles and quadrilaterals, and a specialized Reinforcement Learning fine-tuning method tDPO for better generation quality. Extensive experiments demonstrate that QuadGPT significantly surpasses previous triangle-to-quad conversion pipelines in both geometric accuracy and topological quality. Our work establishes a new benchmark for native quad-mesh generation and showcases the power of combining large-scale autoregressive models with topology-aware RL refinement for creating structured 3D assets.
[137] DyME: Dynamic Multi-Concept Erasure in Diffusion Models with Bi-Level Orthogonal LoRA Adaptation
Jiaqi Liu,Lan Zhang,Xiaoyong Yuan
Main category: cs.CV
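TL;DR: Proposes DyME, an on-demand concept erasure framework that trains lightweight, concept-specific LoRA adapters and composes them dynamically for flexible multi-concept erasure, with bi-level orthogonality constraints that reduce interference among adapters; it outperforms existing methods on the new benchmark ErasureBench-H and on standard datasets.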
Details
Motivation: Existing concept erasure methods for text-to-image diffusion models rely on static erasure, which does not scale to practical settings that require erasing multiple and possibly conflicting concepts and cannot adapt to per-request needs at inference, leading to poor erasure and reduced fidelity for non-target content.
Method: The DyME framework (1) trains lightweight, concept-specific LoRA adapters; (2) dynamically composes only the adapters needed at inference; and (3) applies bi-level orthogonality constraints at the feature and parameter levels to reduce representational interference among adapters. A new hierarchical benchmark, ErasureBench-H, with brand-series-character structure is built for evaluation.
Result: Experiments on ErasureBench-H, CIFAR-100, and Imagenette show that DyME consistently outperforms state-of-the-art methods in multi-concept erasure fidelity, with minimal side effects on non-target content.
Conclusion: Through modular design and dynamic composition, DyME delivers efficient, flexible multi-concept erasure, addresses the limits of static erasure in practice, and offers a scalable approach to controlling copyright-sensitive and ethically sensitive concepts.
Abstract: Text-to-image diffusion models (DMs) inadvertently reproduce copyrighted styles and protected visual concepts, raising legal and ethical concerns. Concept erasure has emerged as a safeguard, aiming to selectively suppress such concepts through fine-tuning. However, existing methods do not scale to practical settings where providers must erase multiple and possibly conflicting concepts. The core bottleneck is their reliance on static erasure: a single checkpoint is fine-tuned to remove all target concepts, regardless of the actual erasure needs at inference. This rigid design mismatches real-world usage, where requests vary per generation, leading to degraded erasure success and reduced fidelity for non-target content. We propose DyME, an on-demand erasure framework that trains lightweight, concept-specific LoRA adapters and dynamically composes only those needed at inference. This modular design enables flexible multi-concept erasure, but naive composition causes interference among adapters, especially when many or semantically related concepts are suppressed. To overcome this, we introduce bi-level orthogonality constraints at both the feature and parameter levels, disentangling representation shifts and enforcing orthogonal adapter subspaces. We further develop ErasureBench-H, a new hierarchical benchmark with brand-series-character structure, enabling principled evaluation across semantic granularities and erasure set sizes. Experiments on ErasureBench-H and standard datasets (e.g., CIFAR-100, Imagenette) demonstrate that DyME consistently outperforms state-of-the-art baselines, achieving higher multi-concept erasure fidelity with minimal collateral degradation.
[138] VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
Abdul Waheed,Zhen Wu,Dareen Alharthi,Seungone Kim,Bhiksha Raj
Main category: cs.CV
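TL;DR: Introduces VideoJudge, a multimodal large language model (MLLM) judge specialized for evaluating the outputs of video understanding models, which outperforms larger baseline models on several benchmarks.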
Details
Motivation: Common metrics for evaluating video understanding models, such as BLEU and ROUGE, do not reflect human judgment well, while manual evaluation is costly; using large models as automatic judges remains underexplored for video.
Method: VideoJudge is trained through the interplay of a generator and an evaluator: the generator produces responses conditioned on a target rating, and responses that do not match the evaluator's rating are discarded. MLLMs with 3B and 7B parameters serve as base models.
Result: On three of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judges such as Qwen2.5-VL (32B and 72B); LLM-only judges perform worse than MLLM judges, and long chain-of-thought reasoning does not improve performance.
Conclusion: Video input is crucial for evaluating video understanding tasks, and VideoJudge shows that specialized small and mid-sized MLLMs can be effective judges.
Abstract: Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the fineness of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, a 3B and 7B-sized MLLM judge specialized to evaluate outputs from video understanding models (i.e., text responses conditioned on videos). To train VideoJudge, our recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses not matching the evaluator's rating are discarded. Across three out of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B). Notably, we find that LLM judges (Qwen3) perform worse than MLLM judges (Qwen2.5-VL) and long chain-of-thought reasoning does not improve performance, indicating that providing video inputs is crucial for evaluation of video understanding tasks.
[139] Residual Vector Quantization For Communication-Efficient Multi-Agent Perception
Dereje Shenkut,B. V. K Vijaya Kumar
Main category: cs.CV
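TL;DR: ReVQom is an end-to-end learned feature codec for multi-agent collaborative perception that compresses features with multi-stage residual vector quantization, preserving high accuracy at very low bandwidth.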
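The compression ratios quoted in this entry's abstract follow from simple arithmetic on the 8192 bpp baseline, and the core idea of multi-stage residual vector quantization fits in a few lines. The sketch below is a toy NumPy illustration, not the authors' codec: the codebooks are random rather than learned, the zero codeword is added only so that no stage can increase the error, and `rvq_encode`, `rvq_decode`, `bits_per_pixel`, and all sizes are assumptions chosen for illustration.

```python
import math
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; return one codeword index per stage."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # Nearest codeword to whatever the earlier stages left unexplained.
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword of every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

def bits_per_pixel(num_stages, codebook_size):
    """Only indices are transmitted: log2(size) bits per stage."""
    return num_stages * math.ceil(math.log2(codebook_size))

rng = np.random.default_rng(0)
dim, stages, size = 16, 3, 64  # toy sizes (assumed)
codebooks = []
for s in range(stages):
    cb = rng.normal(scale=0.5 ** s, size=(size, dim))
    cb[0] = 0.0  # zero codeword (assumption, see lead-in)
    codebooks.append(cb)

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)

raw_bpp = 256 * 32  # 256 float32 values per pixel = 8192 bits
bpp = bits_per_pixel(stages, size)  # 3 stages x 6 bits
ratio = raw_bpp / bpp
```

With three stages of 64-entry codebooks the payload is 18 bits per pixel, and 8192 / 18 is roughly 455, which matches the 455x figure in the abstract; 8192 / 30 and 8192 / 6 likewise give roughly 273 and 1365.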
Details
Motivation: Communication bandwidth limits the scalability of multi-agent collaborative perception, so efficient feature compression is needed to support practical V2X deployment.
Method: ReVQom compresses intermediate features with a simple bottleneck network followed by multi-stage residual vector quantization (RVQ), transmitting only per-pixel code indices; this preserves spatial identity while greatly reducing the payload.
Result: On the DAIR-V2X dataset, ReVQom reduces the payload from 8192 bits per pixel to 6-30 bits per pixel, a compression of 273x to 1365x; at 6-12 bpp it enables ultra-low-bandwidth operation with graceful degradation, and at 18 bpp it matches or outperforms raw-feature transmission.
Conclusion: ReVQom substantially lowers the communication overhead of multi-agent collaborative perception while preserving accuracy, a step toward practical V2X applications.
Abstract: Multi-agent collaborative perception (CP) improves scene understanding by sharing information across connected agents such as autonomous vehicles, unmanned aerial vehicles, and robots. Communication bandwidth, however, constrains scalability. We present ReVQom, a learned feature codec that preserves spatial identity while compressing intermediate features. ReVQom is an end-to-end method that compresses feature dimensions via a simple bottleneck network followed by multi-stage residual vector quantization (RVQ). This allows only per-pixel code indices to be transmitted, reducing payloads from 8192 bits per pixel (bpp) of uncompressed 32-bit float features to 6-30 bpp per agent with minimal accuracy loss. On the real-world DAIR-V2X CP dataset, ReVQom achieves 273x compression at 30 bpp to 1365x compression at 6 bpp. At 18 bpp (455x), ReVQom matches or outperforms raw-feature CP, and at 6-12 bpp it enables ultra-low-bandwidth operation with graceful degradation. ReVQom enables efficient and accurate multi-agent collaborative perception, a step toward practical V2X deployment.
[140] Gender Stereotypes in Professional Roles Among Saudis: An Analytical Study of AI-Generated Images Using Language Models
Khaloud S. AlKhalifah,Malak Mashaabi,Hend Al-Khalifa
Main category: cs.CV
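TL;DR: Analyzes gender stereotypes and cultural inaccuracies in current text-to-image AI models when they depict professionals in Saudi Arabia; all three models studied skew heavily male, DALL-E V3 most of all, and cultural inaccuracies are widespread.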
Details
Motivation: To examine whether AI image generation models perpetuate societal bias, particularly in gender representation and cultural authenticity, and thereby assess their fairness and suitability for depicting Saudi professions.
Method: Using neutral prompts, 1,006 images of 56 Saudi professions are generated with ImageFX, DALL-E V3, and Grok. Two Saudi annotators rate each image on five dimensions, and a third senior researcher adjudicates disagreements, yielding 10,100 individual judgements.
Result: ImageFX outputs are 85% male, Grok 86.6%, and DALL-E V3 96%; the imbalance is most severe in leadership and technical roles. All three models frequently misrepresent clothing, settings, and activities, and counter-stereotypical images mostly stem from cultural misinterpretation rather than genuinely progressive portrayal.
Conclusion: Current models mirror the societal biases in their training data and do not faithfully reflect the gender dynamics and cultural complexity of the Saudi labour market; more diverse training data, fairer algorithms, and culturally sensitive evaluation frameworks are urgently needed.
Abstract: This study investigates the extent to which contemporary Text-to-Image artificial intelligence (AI) models perpetuate gender stereotypes and cultural inaccuracies when generating depictions of professionals in Saudi Arabia. We analyzed 1,006 images produced by ImageFX, DALL-E V3, and Grok for 56 diverse Saudi professions using neutral prompts. Two trained Saudi annotators evaluated each image on five dimensions: perceived gender, clothing and appearance, background and setting, activities and interactions, and age. A third senior researcher adjudicated whenever the two primary raters disagreed, yielding 10,100 individual judgements. The results reveal a strong gender imbalance, with ImageFX outputs being 85% male, Grok 86.6% male, and DALL-E V3 96% male, indicating that DALL-E V3 exhibited the strongest overall gender stereotyping. This imbalance was most evident in leadership and technical roles. Moreover, cultural inaccuracies in clothing, settings, and depicted activities were frequently observed across all three models. Counter-stereotypical images often arise from cultural misinterpretations rather than genuinely progressive portrayals. We conclude that current models mirror societal biases embedded in their training data, generated by humans, offering only a limited reflection of the Saudi labour market's gender dynamics and cultural nuances. These findings underscore the urgent need for more diverse training data, fairer algorithms, and culturally sensitive evaluation frameworks to ensure equitable and authentic visual outputs.
[141] Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Moderation
Zixuan Wang,Yu Sun,Hongwei Wang,Baoyu Jing,Xiang Shen,Xin Dong,Zhuolin Hao,Hongyu Xiong,Yang Song
Main category: cs.CV
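TL;DR: Proposes a reasoning-enhanced pretraining paradigm for multimodal large language models aimed at unified detection of inappropriate content in short videos; three pretraining tasks improve zero-shot and fine-tuned performance as well as generalization to new issues.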
Details
Motivation: Existing methods rely on large amounts of labeled data and generalize poorly across types of inappropriate content; multimodal large models need a better understanding of short video content and stronger cross-issue detection. Method: Introduces three pretraining tasks: Caption, to enhance perception of video details; VQA, to deepen understanding of issue definitions; and CoT, to strengthen reasoning, thereby narrowing the distribution gap between short video content and the MLLM's original pretraining data. Result: The method significantly improves MLLM performance in both zero-shot and supervised fine-tuning settings and generalizes well to emergent, previously unseen types of inappropriate content. Conclusion: The reasoning-enhanced pretraining paradigm improves the performance and generalization of multimodal large models for unified inappropriate content detection and has practical potential. Abstract: Short video platforms are evolving rapidly, making the identification of inappropriate content increasingly critical. Existing approaches typically train separate and small classification models for each type of issue, which requires extensive human-labeled data and lacks cross-issue generalization. We propose a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm for unified inappropriate content detection. To address the distribution gap between short video content and the original pretraining data of MLLMs, as well as the complex issue definitions, we introduce three targeted pretraining tasks: (1) Caption, to enhance the MLLM's perception of video details; (2) Visual Question Answering (VQA), to deepen the MLLM's understanding of issue definitions and annotation guidelines; (3) Chain-of-Thought (CoT), to enhance the MLLM's reasoning capability. Experimental results show that our pretraining approach significantly improves the MLLM's performance in both zero-shot and supervised fine-tuning (SFT) settings. In addition, our pretrained model demonstrates strong generalization capabilities to emergent, previously unseen issues.
[142] Learning GUI Grounding with Spatial Reasoning from Visual Feedback
Yu Zhao,Wei-Ning Chen,Huseyin Atahan Inan,Samuel Kessler,Lu Wang,Lukas Wutschitz,Fangkai Yang,Chaoyun Zhang,Pasquale Minervini,Saravan Rajmohan,Robert Sim
Main category: cs.CV
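TL;DR: This paper proposes a new GUI grounding method that recasts the conventional coordinate prediction task as an interactive search task, locating UI elements by moving a cursor across the interface. The method trains a vision-language model (VLM) with multi-step online reinforcement learning, substantially improving grounding accuracy and reaching state-of-the-art results on ScreenSpot-v2 and ScreenSpot-Pro.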
Details
Motivation: Existing vision-language models struggle to predict exact numeric coordinates on high-resolution GUI images with complex layouts, which hurts GUI grounding; a more robust approach is needed. Method: GUI grounding is reframed as an interactive search task in which the VLM generates a sequence of cursor-movement actions; the rendered cursor serves as visual feedback, and the model (GUI-Cursor) is trained with multi-step online reinforcement learning using a dense trajectory-based reward. Result: GUI-Cursor, based on Qwen2.5-VL-7B, raises accuracy from 88.8% to 93.9% on ScreenSpot-v2 and from 26.8% to 56.5% on ScreenSpot-Pro; 95% of instances are solved within two steps, and the model adaptively takes more steps on harder examples. Conclusion: Treating GUI grounding as an interactive search process with cursor feedback markedly improves the accuracy and robustness of VLMs in complex GUI environments and shows the potential of reinforcement learning for GUI tasks. Abstract: Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task: given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing high-resolution GUI images with complex layouts. To address this issue, we reframe GUI grounding as an interactive search task, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Our experimental results show that GUI-Cursor, based on Qwen2.5-VL-7B, improves the GUI grounding accuracy and achieves state-of-the-art results on ScreenSpot-v2 (88.8% → 93.9%) and ScreenSpot-Pro (26.8% → 56.5%). Moreover, we observe that GUI-Cursor learns to solve the problem within two steps for 95% of instances and can adaptively conduct more steps on more difficult examples.
[143] X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning
Prasanna Reddy Pulakurthi,Jiamian Wang,Majid Rabbani,Sohail Dianat,Raghuveer Rao,Zhiqiang Tao
Main category: cs.CV
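TL;DR: Proposes X-CoT, an explainable text-to-video retrieval framework based on LLM chain-of-thought reasoning that replaces conventional embedding-similarity ranking, improving retrieval performance and providing detailed rationales.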
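As a rough illustration of how pairwise comparison steps can yield a complete ranking in X-CoT-style retrieval, the sketch below sorts candidates with a comparator. It is not the authors' implementation: `prefer` is a hypothetical stand-in for an LLM's pairwise judgment, and `toy_prefer` replaces it with a simple word-overlap heuristic so the example runs on its own.

```python
from functools import cmp_to_key


def rank_by_pairwise(query, candidates, prefer):
    """Return candidates ordered best-first using only pairwise judgments.

    `prefer(query, a, b)` is a hypothetical callback: negative if `a` matches
    the query better than `b`, positive if worse, 0 for a tie.
    """
    return sorted(candidates, key=cmp_to_key(lambda a, b: prefer(query, a, b)))


def toy_prefer(query, a, b):
    """Toy judge (assumption): prefer the caption sharing more words with the query."""
    q = set(query.lower().split())
    overlap_a = len(q & set(a.lower().split()))
    overlap_b = len(q & set(b.lower().split()))
    return overlap_b - overlap_a  # negative when `a` overlaps more


captions = ["a dog runs on the beach", "a man cooks pasta", "a dog sleeps"]
ranking = rank_by_pairwise("dog running on a beach", captions, toy_prefer)
print(ranking)
```

A real judge would also return a textual rationale for each comparison; a comparator that is not transitive (as LLM judgments may be) would need a more robust aggregation than a plain sort.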
Details
Motivation: Existing text-to-video retrieval systems rely on embedding models and cosine similarity, which makes low-quality data hard to identify and offers no interpretability; a method that can explain ranking results is needed to assess both models and data quality. Method: Builds a retrieval chain of thought (CoT) that reasons step by step through pairwise comparisons, and extends existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. Result: X-CoT improves retrieval performance in experiments, produces detailed rationales, and helps analyze model behavior and data quality. Conclusion: By bringing in LLM reasoning, X-CoT delivers more explainable and more effective text-to-video retrieval while supporting deeper analysis of models and data. Abstract: Prevalent text-to-video retrieval systems mainly adopt embedding models for feature extraction and compute cosine similarities for ranking. However, this design presents two limitations. Low-quality text-video data pairs could compromise the retrieval, yet are hard to identify and examine. Cosine similarity alone provides no explanation for the ranking results, limiting the interpretability. We ask: can we interpret the ranking results, so as to assess the retrieval models and examine the text-video data? This work proposes X-CoT, an explainable retrieval framework built upon LLM CoT reasoning in place of the embedding model-based similarity ranking. We first expand the existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. We also devise a retrieval CoT consisting of pairwise comparison steps, yielding detailed reasoning and complete ranking. X-CoT empirically improves the retrieval performance and produces detailed rationales. It also facilitates analysis of model behavior and data quality. Code and data are available at: https://github.com/PrasannaPulakurthi/X-CoT.
[144] Unsupervised Defect Detection for Surgical Instruments
Joseph Huang,Yichi Zhang,Jingxi Yu,Wei Chen,Seunghyun Hwang,Qiang Qiu,Amy R. Reibman,Edward J. Delp,Fengqing Zhu
Main category: cs.CV
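TL;DR: Proposes a method that adapts unsupervised defect detection to surgical instruments; through background masking, patch-based analysis, and efficient domain adaptation, it addresses the false positives, missed detections, and inadequate feature capture that existing methods show in the surgical setting.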
Details
Motivation: Existing automated defect detection methods trained on natural or industrial images transfer poorly to surgical instruments: they produce false positives, are insensitive to subtle defects, and fail to capture instrument-specific features. Method: Uses background masking to remove interference from textured backgrounds, combines it with a patch-based analysis strategy, and applies efficient domain adaptation to adapt unsupervised defect detection methods to surgical instrument images. Result: The method reduces false positives caused by background texture, improves sensitivity to small, subtle defects, and better captures instrument-specific features, enabling reliable detection of fine-grained defects in surgical instruments. Conclusion: The proposed method overcomes the challenges of domain shift and improves the accuracy and robustness of visual defect detection for surgical instruments without requiring large amounts of labeled data. Abstract: Ensuring the safety of surgical instruments requires reliable detection of visual defects. However, manual inspection is prone to error, and existing automated defect detection methods, typically trained on natural/industrial images, fail to transfer effectively to the surgical domain. We demonstrate that simply applying or fine-tuning these approaches leads to issues: false positive detections arising from textured backgrounds, poor sensitivity to small, subtle defects, and inadequate capture of instrument-specific features due to domain shift. To address these challenges, we propose a versatile method that adapts unsupervised defect detection methods specifically for surgical instruments. By integrating background masking, a patch-based analysis strategy, and efficient domain adaptation, our method overcomes these limitations, enabling the reliable detection of fine-grained defects in surgical instrument imagery.
[145] No Alignment Needed for Generation: Learning Linearly Separable Representations in Diffusion Models
Junno Yun,Yaşar Utku Alçalar,Mehmet Akçakaya
Main category: cs.CV
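TL;DR: Proposes a new regularization method that promotes the linear separability (LSEP) of intermediate-layer representations for efficient training of large-scale diffusion models; it needs no external pretrained encoder and substantially improves training efficiency and generation quality on architectures such as SiTs.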
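To make the idea of folding linear probing into the training objective more concrete, here is a minimal sketch under stated assumptions: the regularizer is taken to be the cross-entropy of a linear classifier on intermediate features, added to the generative loss with an illustrative weight `lam`. The function names, the weighting, and the exact form of the term are assumptions for illustration, not the LSEP formulation from the paper.

```python
import math


def linear_probe_loss(features, labels, weights, bias):
    """Mean softmax cross-entropy of a linear classifier on the features.

    features: N feature vectors (length D); labels: N integer class ids;
    weights: C weight vectors (length D); bias: C floats.
    """
    total = 0.0
    for x, y in zip(features, labels):
        logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
                  for w, b in zip(weights, bias)]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[y]
    return total / len(features)


def regularized_loss(generative_loss, features, labels, weights, bias, lam=0.1):
    """Generative loss plus a weighted linear-separability term (lam is illustrative)."""
    return generative_loss + lam * linear_probe_loss(features, labels, weights, bias)
```

The lower the probe loss, the more linearly separable the features are with respect to the labels, so minimizing the sum pushes intermediate representations toward separability without consulting an external encoder.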
Details
Motivation: Representation-alignment methods improve the feature quality of diffusion models but depend on large pretrained encoders that are computationally expensive, which limits training efficiency. Method: Proposes the LSEP regularizer, which incorporates linear probing directly into network training to promote linear separability of intermediate-layer representations, avoiding external encoders and representation alignment. Result: On 256×256 ImageNet with the SiT architecture, the method achieves an FID of 1.46, with substantial gains in training efficiency and generation quality. Conclusion: LSEP is an effective regularization strategy that requires no auxiliary encoder and improves both the representation learning efficiency and the generative performance of diffusion models. Abstract: Efficient training strategies for large-scale diffusion models have recently emphasized the importance of improving discriminative feature representations in these models. A central line of work in this direction is representation alignment with features obtained from powerful external encoders, which improves the representation quality as assessed through linear probing. Alignment-based approaches show promise but depend on large pretrained encoders, which are computationally expensive to obtain. In this work, we propose an alternative regularization for training, based on promoting the Linear SEParability (LSEP) of intermediate layer representations. LSEP eliminates the need for an auxiliary encoder and representation alignment, while incorporating linear probing directly into the network's learning dynamics rather than treating it as a simple post-hoc evaluation tool. Our results demonstrate substantial improvements in both training efficiency and generation quality on flow-based transformer architectures such as SiTs, achieving an FID of 1.46 on the 256×256 ImageNet dataset.
[146] Enhancing Contrastive Learning for Geolocalization by Discovering Hard Negatives on Semivariograms
Boyi Chen,Zhangyu Wang,Fabian Deuser,Johann Maximilian Zollner,Martin Werner
Main category: cs.CV
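TL;DR: Proposes a semivariogram-based, spatially regularized contrastive learning strategy that improves image geolocalization, especially at finer granularity.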
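For readers unfamiliar with semivariograms, the sketch below computes the standard empirical estimate, gamma(h) = mean of 0.5 * d² over pairs at separation roughly h, with feature-space distance standing in for the measured variable, and then flags pairs that look far more similar than expected at their separation. Several choices are assumptions made only for illustration: planar coordinates instead of geodesic distance, a binned empirical estimate instead of a fitted model, and the `ratio` threshold. None of this is taken from the paper's code.

```python
import math
from itertools import combinations


def empirical_semivariogram(geo_xy, feats, bin_width, n_bins):
    """gamma[k]: mean of 0.5 * (feature distance)^2 over pairs in distance bin k."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, j in combinations(range(len(geo_xy)), 2):
        h = math.dist(geo_xy[i], geo_xy[j])  # planar distance (assumption)
        k = int(h // bin_width)
        if k < n_bins:
            sums[k] += 0.5 * math.dist(feats[i], feats[j]) ** 2
            counts[k] += 1
    # Empty bins have no estimate.
    return [s / c if c else None for s, c in zip(sums, counts)]


def is_hard_negative(geo_dist, feat_dist, gamma, bin_width, ratio=0.5):
    """True if a pair is much more similar in feature space than expected at its distance."""
    k = int(geo_dist // bin_width)
    if k >= len(gamma) or gamma[k] is None:
        return False
    return 0.5 * feat_dist ** 2 < ratio * gamma[k]
```

With three images at x = 0, 1, and 10 and one-dimensional features 0, 1, and 3, a bin width of 5 gives gamma = [0.5, 2.0, 4.5]; a distant pair with feature distance 1.0 is then flagged, while one with feature distance 3.0 is not.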
Details
Motivation: Existing contrastive learning methods ignore spatial dependency in geographic space, so they cannot handle false negatives (pairs that are visually and geographically similar but labeled as negatives) and struggle to distinguish hard negatives (pairs that are visually similar but geographically distant). Method: Introduces a semivariogram to model the relationship between feature-space distance and geographic distance; the fitted semivariogram defines the expected visual dissimilarity at a given spatial distance, which is used to identify false negatives and hard negatives, and the strategy is integrated into GeoCLIP. Result: Validation on the OSV5M dataset shows that explicitly modeling spatial priors improves image geolocalization accuracy, particularly for finer-grained localization. Conclusion: By incorporating the semivariogram, a geostatistical tool, the proposed strategy addresses false and hard negatives in contrastive learning, strengthens the model's grasp of spatial structure, and improves the accuracy and robustness of global-scale image geolocalization. Abstract: Accurate and robust image-based geo-localization at a global scale is challenging due to diverse environments, visually ambiguous scenes, and the lack of distinctive landmarks in many regions. While contrastive learning methods show promising performance by aligning features between street-view images and corresponding locations, they neglect the underlying spatial dependency in the geographic space. As a result, they fail to address the issue of false negatives, i.e., image pairs that are both visually and geographically similar but labeled as negatives, and struggle to effectively distinguish hard negatives, which are visually similar but geographically distant. To address this issue, we propose a novel spatially regularized contrastive learning strategy that integrates a semivariogram, which is a geostatistical tool for modeling how spatial correlation changes with distance. We fit the semivariogram by relating the distance of images in feature space to their geographical distance, capturing the expected visual content in a spatial correlation. With the fitted semivariogram, we define the expected visual dissimilarity at a given spatial distance as reference to identify hard negatives and false negatives. We integrate this strategy into GeoCLIP and evaluate it on the OSV5M dataset, demonstrating that explicitly modeling spatial priors improves image-based geo-localization performance, particularly at finer granularity.
[147] X-Streamer: Unified Human World Modeling with Audiovisual Interaction
You Xie,Tianpei Gu,Zenan Li,Chenxu Zhang,Guoxian Song,Xiaochen Zhao,Chao Liang,Jianwen Jiang,Hongyi Xu,Linjie Luo
Main category: cs.CV
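TL;DR: X-Streamer is an end-to-end multimodal human world modeling framework that uses a unified Thinker-Actor dual-transformer architecture to enable real-time, open-ended interaction across text, speech, and video from a single portrait.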
Details
Motivation: To build digital human agents capable of unbounded interaction, overcoming the limits of existing systems in multimodal fusion and sustained interaction. Method: Uses a Thinker-Actor dual-transformer architecture: the Thinker uses a pretrained large language-speech model to perceive and reason over streaming inputs; the Actor uses a chunk-wise autoregressive diffusion model conditioned on the Thinker's hidden states to generate time-aligned multimodal outputs, with cross-modal positional embeddings, chunk-wise attention, and global identity referencing to improve long-horizon stability. Result: X-Streamer runs in real time on two A100 GPUs and sustains hours-long, consistent video chat from arbitrary portraits. Conclusion: X-Streamer unifies multimodal understanding and generation, offers a feasible path to persistent audiovisual interaction for interactive digital humans, and advances digital human world modeling. Abstract: We introduce X-Streamer, an end-to-end multimodal human world modeling framework for building digital human agents capable of infinite interactions across text, speech, and video within a single unified architecture. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation, turning a static portrait into persistent and intelligent audiovisual interactions. The Thinker module perceives and reasons over streaming user inputs, while its hidden states are translated by the Actor into synchronized multimodal streams in real time. Concretely, the Thinker leverages a pretrained large language-speech model, while the Actor employs a chunk-wise autoregressive diffusion model that cross-attends to the Thinker's hidden states to produce time-aligned multimodal responses with interleaved discrete text and audio tokens and continuous video latents. To ensure long-horizon stability, we design inter- and intra-chunk attentions with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits and paving the way toward unified world modeling of interactive digital humans.
[148] What Happens Next? Anticipating Future Motion by Generating Point Trajectories
Gabrijel Boduljak,Laurynas Karazija,Iro Laina,Christian Rupprecht,Andrea Vedaldi
Main category: cs.CV
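TL;DR: Proposes a method for predicting object motion from a single image by generating dense trajectory grids to model motion directly, giving more accurate and diverse predictions than conventional video generation models.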
Details
Motivation: Existing video generation models struggle to predict object motion accurately from a single image because they focus on generating pixels rather than modeling motion directly, and so perform poorly in physical scenarios. Method: Formulates motion forecasting as conditional generation of dense trajectory grids, using an architecture similar to modern video generators but outputting motion trajectories instead of pixels. Result: The method shows higher prediction accuracy and diversity on simulated data and real-world physics datasets, and its effectiveness is demonstrated on downstream tasks such as robotics. Conclusion: Directly generating motion trajectories outperforms predicting motion indirectly through pixel generation; the method better captures scene dynamics and uncertainty. Abstract: We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.
[149] Temporal vs. Spatial: Comparing DINOv3 and V-JEPA2 Feature Representations for Video Action Analysis
Sai Varun Kodathala,Rakesh Vunnam
Main category: cs.CV
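TL;DR: This study compares two self-supervised learning architectures for video action recognition, DINOv3 and V-JEPA2, finding that DINOv3 excels at static pose recognition while V-JEPA2 performs more consistently across action types.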
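The clustering comparison in this entry is reported as a silhouette score. As a reminder of what that number measures, the following sketch computes the standard mean silhouette coefficient, s = (b - a) / max(a, b) per point, in plain Python. It assumes Euclidean distance and at least two clusters with no set of fully coincident points; the study's own evaluation code and distance metric are not given in the digest.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (requires >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Scores near 1 indicate tight, well-separated clusters, scores near 0 indicate overlapping clusters, and negative scores indicate points that sit closer to another cluster than to their own.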
Details
Motivation: To examine performance differences between self-supervised learning architectures in video action recognition and to understand how their design choices affect feature quality. Method: Evaluates DINOv3 (spatial feature extraction) and V-JEPA2 (temporal modeling) on the UCF Sports dataset in terms of classification accuracy, clustering performance, intra-class consistency, and inter-class discrimination. Result: DINOv3 achieves better clustering (silhouette score 0.31 vs 0.21) and a 6.16x inter-class separation ratio, particularly on pose-identifiable actions; V-JEPA2 is more consistent across all action types (performance variance 0.094 vs 0.288). DINOv3 degrades on motion-dependent actions, while V-JEPA2's temporal modeling yields balanced representation quality. Conclusion: DINOv3 suits tasks that emphasize pose recognition and can tolerate some variability, while V-JEPA2 better suits applications that require consistency and generalization; the results offer empirical guidance for designing video analysis systems. Abstract: This study presents a comprehensive comparative analysis of two prominent self-supervised learning architectures for video action recognition: DINOv3, which processes frames independently through spatial feature extraction, and V-JEPA2, which employs joint temporal modeling across video sequences. We evaluate both approaches on the UCF Sports dataset, examining feature quality through multiple dimensions including classification accuracy, clustering performance, intra-class consistency, and inter-class discrimination. Our analysis reveals fundamental architectural trade-offs: DINOv3 achieves superior clustering performance (Silhouette score: 0.31 vs 0.21) and demonstrates exceptional discrimination capability (6.16x separation ratio) particularly for pose-identifiable actions, while V-JEPA2 exhibits consistent reliability across all action types with significantly lower performance variance (0.094 vs 0.288). Through action-specific evaluation, we identify that DINOv3's spatial processing architecture excels at static pose recognition but shows degraded performance on motion-dependent actions, whereas V-JEPA2's temporal modeling provides balanced representation quality across diverse action categories. These findings contribute to the understanding of architectural design choices in video analysis systems and provide empirical guidance for selecting appropriate feature extraction methods based on task requirements and reliability constraints.
[150] VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Md. Mahfuzur Rahman,Kishor Datta Gupta,Marufa Kamal,Fahad Rahman,Sunzida Siddique,Ahmed Rafi Hasan,Mohd Ariful Haque,Roy George
Main category: cs.CV
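TL;DR: Proposes the Vision Language Caption Enhancer (VLCE), a multimodal system that generates detailed, context-aware descriptions of disaster imagery and substantially outperforms existing models.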
Details
Motivation: Existing computer vision methods for post-disaster damage assessment yield only classification labels or segmentation masks, which limits comprehensive understanding of the situation; more detailed information is needed to support decisions. Method: Uses a dual-architecture approach: a CNN-LSTM model with a pretrained ResNet50 backbone for satellite imagery and a pretrained Vision Transformer model for UAV imagery, both augmented with external semantic knowledge from ConceptNet and WordNet to improve description accuracy. Result: VLCE reaches up to 95.33% on InfoMetIC, substantially outperforming baselines such as LLaVA and QwenVL, while maintaining competitive semantic alignment. Conclusion: VLCE can automatically generate information-rich damage descriptions and has strong potential to improve the efficiency of disaster response. Abstract: Immediate damage assessment is essential after natural catastrophes; yet, conventional manual evaluation techniques are slow and dangerous. Although satellite and unmanned aerial vehicle (UAV) photos offer extensive perspectives of impacted regions, current computer vision methodologies generally yield just classification labels or segmentation masks, which constrains their capacity to deliver a thorough situational comprehension. We introduce the Vision Language Caption Enhancer (VLCE), a multimodal system designed to produce comprehensive, contextually-informed explanations of disaster imagery. VLCE employs a dual-architecture approach: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSat satellite imagery for the xBD dataset, and a Vision Transformer (ViT) model pretrained on UAV pictures for the RescueNet dataset. Both systems utilize external semantic knowledge from ConceptNet and WordNet to expand vocabulary coverage and improve description accuracy. We assess VLCE in comparison to leading vision-language models (LLaVA and QwenVL) utilizing CLIPScore for semantic alignment and InfoMetIC for caption informativeness. Experimental findings indicate that VLCE markedly surpasses baseline models, attaining a maximum of 95.33% on InfoMetIC while preserving competitive semantic alignment. Our dual-architecture system demonstrates significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from satellite and drone photos.
[151] A Data-driven Typology of Vision Models from Integrated Representational Metrics
Jialin Wu,Shreya Saha,Yiqing Bo,Meenakshi Khosla
Main category: cs.CV
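TL;DR: Using multiple representational similarity metrics combined with the biology-inspired Similarity Network Fusion technique, this study systematically analyzes commonalities and differences across vision model families, showing how computational strategies shaped jointly by architecture and training objective influence representational structure.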
Details
Motivation: Principled methods are lacking for determining which representational properties of large vision models are shared across families and which reflect distinctive computational strategies. Method: Uses a suite of representational similarity metrics (e.g., RSA, Soft Matching, Linear Predictivity) and adapts Similarity Network Fusion (SNF) to integrate their complementary information, assessing the separability of model families. Result: Metrics tied to geometry and unit tuning discriminate model families well, whereas linearly decodable information is more broadly shared; SNF sharpens family separation substantially, and clustering reveals how supervised and self-supervised models group, as well as convergence between hybrid architectures and masked autoencoders. Conclusion: The representational structure of vision models depends not only on surface design but on emergent computational strategies shaped jointly by architecture and training objective; the work proposes a principled, strategy-based typology of models. Abstract: Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet (geometry, unit tuning, or linear decodability), and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies, shaped jointly by architecture and training objective, define representational structure beyond surface design categories.
[152] FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
Yixiang Dai,Fan Jiang,Chiyu Wang,Mu Xu,Yonggang Qi
Main category: cs.CV
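TL;DR: This paper presents FantasyWorld, a geometry-enhanced framework that adds a trainable geometric branch to a frozen video foundation model, jointly modeling video latents and an implicit 3D field to improve spatial consistency and 3D awareness.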
Details
Motivation: Existing video foundation models lack explicit 3D grounding, which limits their spatial consistency and their usefulness for downstream 3D reasoning tasks. Method: Introduces a trainable geometric branch with cross-branch supervision, in which geometry cues guide video generation and video priors regularize 3D prediction. Result: Experiments show that FantasyWorld outperforms recent geometry-consistent baselines in multi-view coherence and style consistency, and that its representations can potentially serve downstream tasks such as novel view synthesis and navigation without per-scene optimization or fine-tuning. Conclusion: FantasyWorld bridges video imagination and 3D perception, offering a general and consistent approach to 3D-aware video representation. Abstract: High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite the established strong imaginative priors, current video foundation models lack explicit 3D grounding capabilities, thus being limited in both spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, thus yielding consistent and generalizable 3D-aware video representations. Notably, the resulting latents from the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.
[153] MORPH: Shape-agnostic PDE Foundation Models
Mahindra Singh Rautela,Alexander Most,Siddharth Mansingh,Bradley C. Love,Ayan Biswas,Diane Oyen,Earl Lawrence
Main category: cs.CV
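TL;DR: MORPH is a shape-agnostic autoregressive foundation model for partial differential equations (PDEs) that handles heterogeneous, multi-dimensional, multi-field, multi-resolution spatiotemporal data and performs well in pretraining and transfer learning.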
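The abstract for this entry lists axial attention as the way MORPH reduces the cost of spatiotemporal self-attention. A back-of-the-envelope count shows why factorizing helps: full attention over a T×H×W token grid scores every token against every other, while axial attention scores each token only against the tokens sharing its time, height, or width axis. The function below is an illustrative calculation only; it counts query-key pairs and ignores heads, channels, and constant factors, and the grid size in the example is an assumption rather than a MORPH configuration.

```python
def attention_pair_counts(t, h, w):
    """Query-key pairs scored by full vs. axial attention on a t x h x w token grid."""
    n = t * h * w
    full = n * n              # every token attends to every token
    axial = n * (t + h + w)   # one pass along each of the three axes
    return full, axial


full, axial = attention_pair_counts(8, 32, 32)
print(full, axial, round(full / axial, 1))
```

For an 8×32×32 grid this gives 67,108,864 pairs for full attention against 589,824 for axial attention, roughly a 114-fold reduction.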
Details
Motivation: Existing PDE models are usually designed for specific geometries and data structures and generalize poorly to heterogeneous, multimodal scientific data; a unified, flexible, and efficient foundation model is needed for the diverse PDE problems in scientific machine learning. Method: MORPH uses a convolutional vision transformer architecture with component-wise convolution for scalar and vector fields, inter-field cross-attention to model information flow between physical fields, and axial attention that factorizes spatiotemporal self-attention to reduce computation. The model is pretrained on diverse heterogeneous PDE datasets and supports transfer via full-model fine-tuning and low-rank adapters (LoRA). Result: MORPH outperforms models trained from scratch on multiple downstream prediction tasks in both zero-shot and full-shot settings, and matches or surpasses strong baselines and recent state-of-the-art models. Conclusion: MORPH provides a flexible, efficient, and scalable foundation model framework that handles the heterogeneity and multimodality of scientific observations, advancing data-efficient, scalable scientific machine learning. Abstract: We introduce MORPH, a shape-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data dimensionality (1D--3D) at different resolutions, multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, and (iii) axial attention, which factorizes full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters (LoRA), MORPH outperforms models trained from scratch in both zero-shot and full-shot generalization. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning.
[154] MS-YOLO: Infrared Object Detection for Edge Deployment via MobileNetV4 and SlideLoss
Jiali Zhang,Thomas S. White,Haoliang Zhang,Wenqing Hu,Donald C. Wunsch II,Jian Liu
Main category: cs.CV
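TL;DR: This paper presents MS-YOLO, which combines MobileNetV4 with the SlideLoss loss function to achieve efficient, accurate infrared object detection on the FLIR ADAS V2 dataset, suitable for real-time edge deployment in urban environments.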
Details
Motivation: To address class imbalance, thermal noise, and limited compute in infrared object detection, improving practicality under low-light and adverse weather conditions. Method: Building on YOLOv8, replaces the CSPDarknet backbone with the more efficient MobileNetV4 and introduces SlideLoss, a loss function that dynamically emphasizes under-represented and occluded samples. Result: On FLIR ADAS V2, MS-YOLO attains competitive mAP and higher precision at only 6.7 GFLOPs, with computational overhead reduced by 1.5%. Conclusion: MS-YOLO maintains high detection performance while reducing computational cost, making it suitable for real-time edge applications in urban environments. Abstract: Infrared imaging has emerged as a robust solution for urban object detection under low-light and adverse weather conditions, offering significant advantages over traditional visible-light cameras. However, challenges such as class imbalance, thermal noise, and computational constraints can significantly hinder model performance in practical settings. To address these issues, we evaluate multiple YOLO variants on the FLIR ADAS V2 dataset, ultimately selecting YOLOv8 as our baseline due to its balanced accuracy and efficiency. Building on this foundation, we present MS-YOLO (MobileNetV4 and SlideLoss based on YOLO), which replaces YOLOv8's CSPDarknet backbone with the more efficient MobileNetV4, reducing computational overhead by 1.5% while sustaining high accuracy. In addition, we introduce SlideLoss, a novel loss function that dynamically emphasizes under-represented and occluded samples, boosting precision without sacrificing recall. Experiments on the FLIR ADAS V2 benchmark show that MS-YOLO attains competitive mAP and superior precision while operating at only 6.7 GFLOPs. These results demonstrate that MS-YOLO effectively addresses the dual challenge of maintaining high detection quality while minimizing computational costs, making it well-suited for real-time edge deployment in urban environments.
[155] Motion-Aware Transformer for Multi-Object Tracking
Xu Yang,Gady Agam
Main category: cs.CV
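TL;DR: This paper proposes the Motion-Aware Transformer (MATR), a new multi-object tracking method that explicitly models object motion to update track queries, reducing query conflicts and improving detection and association, with state-of-the-art results on several benchmarks.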
Details
Motivation: Existing DETR-based multi-object tracking methods process detection and tracking queries jointly within a single Transformer decoder layer, which causes query conflicts and degrades association accuracy. Method: Introduces the Motion-Aware Transformer (MATR), which explicitly predicts object motion across frames and updates track queries in advance, reducing query collisions and enabling more consistent training. Result: On DanceTrack, MATR improves HOTA by more than 9 points over MOTR without additional data and reaches 71.3 with supplementary data; it reaches 72.2 HOTA on SportsMOT and 54.7 mTETA and 41.6 mHOTA on BDD100k without external datasets. Conclusion: Explicitly modeling motion improves end-to-end Transformers for multi-object tracking, and MATR offers a simple yet effective approach. Abstract: Multi-object tracking (MOT) in videos remains challenging due to complex object motions and crowded scenes. Recent DETR-based frameworks offer end-to-end solutions but typically process detection and tracking queries jointly within a single Transformer Decoder layer, leading to conflicts and degraded association accuracy. We introduce the Motion-Aware Transformer (MATR), which explicitly predicts object movements across frames to update track queries in advance. By reducing query collisions, MATR enables more consistent training and improves both detection and association. Extensive experiments on DanceTrack, SportsMOT, and BDD100k show that MATR delivers significant gains across standard metrics. On DanceTrack, MATR improves HOTA by more than 9 points over MOTR without additional data and reaches a new state-of-the-art score of 71.3 with supplementary data. MATR also achieves state-of-the-art results on SportsMOT (72.2 HOTA) and BDD100k (54.7 mTETA, 41.6 mHOTA) without relying on external datasets. These results demonstrate that explicitly modeling motion within end-to-end Transformers offers a simple yet highly effective approach to advancing multi-object tracking.
[156] DeLiVR: Differential Spatiotemporal Lie Bias for Efficient Video Deraining
Shuning Sun,Jialang Lu,Xiang Chen,Jichao Wang,Dianjie Lu,Guijuan Zhang,Guangwei Gao,Zhuoran Zheng
Main category: cs.CV
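TL;DR: Proposes DeLiVR, an efficient video deraining method based on Lie-group differential biases, which injects geometry-consistent biases into the attention mechanism to suppress rain streaks and temporal artifacts.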
Details
Motivation: Existing video deraining methods rely on optical flow or heuristic alignment, which are computationally expensive and less robust; real videos also suffer from rain, blur, noise, and cross-frame inconsistency caused by camera shake. Method: Models continuous geometric transformations with Lie groups and injects spatiotemporal Lie-group differential biases into the network's attention scores: (1) a rotation-bounded Lie relative bias predicts each frame's in-plane rotation angle to achieve geometry-consistent alignment; (2) a differential group displacement computes angular differences between adjacent frames to estimate a velocity, combined with temporal decay and attention masks to match the direction of rain streaks precisely. Result: Experiments on public benchmarks show that the method is effective, with good deraining performance and temporal consistency. Conclusion: By introducing Lie-group structural priors, DeLiVR improves spatial alignment and temporal continuity in video deraining, outperforming methods that rely on conventional alignment. Abstract: Videos captured in the wild often suffer from rain streaks, blur, and noise. In addition, even slight changes in camera pose can amplify cross-frame mismatches and temporal artifacts. Existing methods rely on optical flow or heuristic alignment, which are computationally expensive and less robust. To address these challenges, Lie groups provide a principled way to represent continuous geometric transformations, making them well-suited for enforcing spatial and temporal consistency in video modeling. Building on this insight, we propose DeLiVR, an efficient video deraining method that injects spatiotemporal Lie-group differential biases directly into attention scores of the network. Specifically, the method introduces two complementary components. First, a rotation-bounded Lie relative bias predicts the in-plane angle of each frame using a compact prediction module, where normalized coordinates are rotated and compared with base coordinates to achieve geometry-consistent alignment before feature aggregation. Second, a differential group displacement computes angular differences between adjacent frames to estimate a velocity. This bias computation combines temporal decay and attention masks to focus on inter-frame relationships while precisely matching the direction of rain streaks. Extensive experimental results demonstrate the effectiveness of our method on publicly available benchmarks.
[157] On the Status of Foundation Models for SAR Imagery
Nathan Inkawhich
Main category: cs.CV
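TL;DR: This study examines the viability of foundation AI/ML models for synthetic aperture radar (SAR) object recognition and finds that existing visual foundation models fall short when applied directly to SAR data; by fine-tuning publicly available self-supervised models on SAR data, it introduces the AFRL-DINOv2 family, which significantly outperforms the current best SAR model, SARATR-X, and sets a new state of the art for SAR foundation models.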
Details
Motivation: 受自然图像领域中大规模自监督基础模型成功的启发,希望将类似技术应用于SAR目标识别,以提升模型在少量标注数据下的适应能力、特征可迁移性和对分布偏移的鲁棒性。 Method: 采用DINOv2、DINOv3和PE-Core等先进视觉基础模型进行实验,评估其在SAR数据上的表现,并通过对公开自监督模型(如DINOv2)在SAR数据上进行自监督微调,训练出新的AFRL-DINOv2系列模型,比较不同主干网络与下游任务适配策略的性能权衡。 Result: 自监督微调后的AFRL-DINOv2模型在SAR目标识别任务中显著优于现有最佳模型SARATR-X,展现出更强的特征提取能力和对低标注数据、扩展操作条件等挑战的应对能力。 Conclusion: 基于自监督学习的基础模型经SAR数据微调后是构建高性能SAR识别系统的一条可行且高效的技术路径,尽管已有积极成果,但SAR基础模型的发展仍处于早期阶段,未来仍有广阔改进空间。 Abstract: In this work we investigate the viability of foundational AI/ML models for Synthetic Aperture Radar (SAR) object recognition tasks. We are inspired by the tremendous progress being made in the wider community, particularly in the natural image domain where frontier labs are training huge models on web-scale datasets with unprecedented computing budgets. It has become clear that these models, often trained with Self-Supervised Learning (SSL), will transform how we develop AI/ML solutions for object recognition tasks - they can be adapted downstream with very limited labeled data, they are more robust to many forms of distribution shift, and their features are highly transferable out-of-the-box. For these reasons and more, we are motivated to apply this technology to the SAR domain. In our experiments we first run tests with today's most powerful visual foundational models, including DINOv2, DINOv3 and PE-Core and observe their shortcomings at extracting semantically-interesting discriminative SAR target features when used off-the-shelf. We then show that Self-Supervised finetuning of publicly available SSL models with SAR data is a viable path forward by training several AFRL-DINOv2s and setting a new state-of-the-art for SAR foundation models, significantly outperforming today's best SAR-domain model SARATR-X. 
Our experiments further analyze the performance trade-off of using different backbones with different downstream task-adaptation recipes, and we monitor each model's ability to overcome challenges within the downstream environments (e.g., extended operating conditions and low amounts of labeled data). We hope this work will inform and inspire future SAR foundation model builders, because despite our positive results, we still have a long way to go.
[158] UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments
Jiannan Xiang,Yun Zhu,Lei Shu,Maria Wang,Lijun Yu,Gabriel Barcik,James Lyon,Srinivas Sunkara,Jindong Chen
Main category: cs.CV
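TL;DR: Proposes UISim, an image-based UI simulator that dynamically and interactively simulates mobile UI state transitions from screen images, for UI testing, rapid prototyping, and AI agent training.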
Details
Motivation: Existing approaches depend on physical devices or static screenshot analysis, which makes scalable UI testing and intelligent UI agent development difficult, so a more efficient and dynamic simulation method is needed. Method: A two-stage approach: first predict the abstract layout of the next UI state from the current screen image and a user action, then generate a new, visually consistent image from that layout. Result: Experiments show that UISim outperforms end-to-end baselines at generating realistic and coherent subsequent UI states, with higher fidelity. Conclusion: UISim offers an efficient interactive platform for UI testing, prototyping, and AI agent training, advancing mobile UI automation and intelligent interaction. Abstract: Developing and testing user interfaces (UIs) and training AI agents to interact with them are challenging due to the dynamic and diverse nature of real-world mobile environments. Existing methods often rely on cumbersome physical devices or limited static analysis of screenshots, which hinders scalable testing and the development of intelligent UI agents. We introduce UISim, a novel image-based UI simulator that offers a dynamic and interactive platform for exploring mobile phone environments purely from screen images. Our system employs a two-stage method: given an initial phone screen image and a user action, it first predicts the abstract layout of the next UI state, then synthesizes a new, visually consistent image based on this predicted layout. This approach enables the realistic simulation of UI transitions. UISim provides immediate practical benefits for UI testing, rapid prototyping, and synthetic data generation. Furthermore, its interactive capabilities pave the way for advanced applications, such as UI navigation task planning for AI agents. Our experimental results show that UISim outperforms end-to-end UI generation baselines in generating realistic and coherent subsequent UI states, highlighting its fidelity and potential to streamline UI development and enhance AI agent training.
[159] LFA-Net: A Lightweight Network with LiteFusion Attention for Retinal Vessel Segmentation
Mehwish Mehmood,Ivor Spence,Muhammad Fahim
Main category: cs.CV
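TL;DR: Proposes LFA-Net, a lightweight retinal vessel segmentation network built around a LiteFusion-Attention module, which achieves strong segmentation performance under low computational budgets.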
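The abstract for this entry reports 0.11 million parameters, a 0.42 MB memory size, and paired Dice and Jaccard scores. The sketch below is a quick sanity check on those figures, not the authors' code. It assumes float32 weights (4 bytes per parameter), which the source does not state, and uses the standard per-image identity J = D / (2 - D); the helper names are hypothetical.

```python
def param_memory_mib(num_params: float, bytes_per_param: int = 4) -> float:
    """Weight storage in MiB, ASSUMING every parameter takes bytes_per_param bytes."""
    return num_params * bytes_per_param / (1024 ** 2)


def jaccard_from_dice(dice: float) -> float:
    """Per-image identity J = D / (2 - D), with D and J expressed in [0, 1]."""
    return dice / (2.0 - dice)


if __name__ == "__main__":
    # 0.11 million parameters, as reported in the abstract.
    print("weights (MiB):", round(param_memory_mib(0.11e6), 2))

    # (Dice, reported Jaccard) for DRIVE, STARE, CHASE_DB, taken from the abstract.
    reported = [(0.8328, 0.7285), (0.8744, 0.7931), (0.8450, 0.7470)]
    for dice, jaccard in reported:
        print(dice, round(jaccard_from_dice(dice), 4), jaccard)
```

Under the float32 assumption the weight size comes out close to the reported 0.42 MB. The Jaccard values implied by the mean Dice scores are somewhat lower than the reported ones; one plausible explanation is that both metrics are averaged per image, and since D / (2 - D) is convex the mean Jaccard can exceed the value computed from the mean Dice.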
Details
Motivation: Existing deep learning models still struggle with small-vessel segmentation and high computational cost, and they need to fit resource-constrained clinical environments. Method: A new LiteFusion-Attention module combines residual connections, Vision Mamba-inspired dynamics, and modulation-based attention to build the lightweight network LFA-Net. Result: The model has only 0.11 million parameters, a 0.42 MB memory footprint, and 4.46 GFLOPs, and achieves Dice scores of 83.28%, 87.44%, and 84.50% on DRIVE, STARE, and CHASE_DB, respectively. Conclusion: LFA-Net delivers strong vessel segmentation performance at very low resource cost, making it suitable for deployment in resource-constrained clinical settings. Abstract: Lightweight retinal vessel segmentation is important for the early diagnosis of vision-threatening and systemic diseases, especially in a real-world clinical environment with limited computational resources. Although segmentation methods based on deep learning are improving, existing models still face challenges with small vessel segmentation and high computational costs. To address these challenges, we propose a new vascular segmentation network, LFA-Net, which incorporates a newly designed attention module, LiteFusion-Attention. This attention module incorporates residual learning connections, Vision Mamba-inspired dynamics, and modulation-based attention, enabling the model to capture local and global context efficiently and in a lightweight manner. LFA-Net offers high performance with 0.11 million parameters, 0.42 MB memory size, and 4.46 GFLOPs, which make it ideal for resource-constrained environments. We validated our proposed model on DRIVE, STARE, and CHASE_DB with outstanding performance in terms of Dice scores of 83.28%, 87.44%, and 84.50% and Jaccard indices of 72.85%, 79.31%, and 74.70%, respectively. The code of LFA-Net is available online at https://github.com/Mehwish4593/LFA-Net.
[160] Incorporating Scene Context and Semantic Labels for Enhanced Group-level Emotion Recognition
Qing Zhu,Wangdong Guo,Qirong Mao,Xiaohua Huang,Xiuyan Shao,Wenming Zheng
Main category: cs.CV
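TL;DR: Proposes a new framework that combines visual scene context with label-guided semantic information to improve group-level emotion recognition (GER).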
Details
Motivation: Existing methods underestimate the importance of visual scene context for modeling relationships between individuals, and overlook the key role that the semantics of emotion labels play in understanding emotions. Method: A visual context encoding module uses multi-scale scene information to encode individual relationships; an emotion semantic encoding module prompts a large language model to generate nuanced emotion lexicons and refines them, together with a structured emotion tree, into comprehensive semantic representations; a similarity-aware interaction then fuses the visual and semantic information. Result: Experiments on three widely used GER datasets show performance better than or comparable to current state-of-the-art methods. Conclusion: The proposed framework improves group-level emotion recognition and confirms the value of incorporating visual context and label semantics. Abstract: Group-level emotion recognition (GER) aims to identify holistic emotions within a scene involving multiple individuals. Existing methods underestimate the importance of visual scene contextual information in modeling individual relationships. Furthermore, they overlook the crucial role of semantic information from emotional labels for a complete understanding of emotions. To address these limitations, we propose a novel framework that incorporates visual scene context and label-guided semantic information to improve GER performance. It involves a visual context encoding module that leverages multi-scale scene information to diversely encode individual relationships. Complementarily, the emotion semantic encoding module utilizes group-level emotion labels to prompt a large language model to generate nuanced emotion lexicons. These lexicons, in conjunction with the emotion labels, are then refined into comprehensive semantic representations using a structured emotion tree. Finally, similarity-aware interaction is proposed to align and integrate visual and semantic information, thereby generating enhanced group-level emotion representations and improving the performance of GER. Experiments on three widely adopted GER datasets demonstrate that our proposed method achieves competitive performance compared to state-of-the-art methods.
[161] KG-SAM: Injecting Anatomical Knowledge into Segment Anything Models via Conditional Random Fields
Yu Li,Da Chang,Xi Xiao
Main category: cs.CV
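TL;DR: Proposes KG-SAM, a knowledge-guided framework that combines anatomical priors, boundary refinement, and uncertainty estimation to address SAM's problems in medical image segmentation: ambiguous boundaries, weak modeling of anatomical relationships, and no uncertainty quantification.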
Details
Motivation: Medical images have ambiguous boundaries and complex anatomy, and clinical decisions demand high reliability, so applying SAM directly works poorly; domain knowledge and uncertainty estimation are needed to improve segmentation quality and trustworthiness. Method: KG-SAM introduces (1) a medical knowledge graph that encodes fine-grained anatomical relationships, (2) an energy-based Conditional Random Field (CRF) that enforces anatomically consistent predictions, and (3) an uncertainty-aware fusion module that improves reliability in high-stakes clinical scenarios. Result: On multi-center medical datasets, KG-SAM reaches an average Dice score of 82.69% on prostate segmentation and 78.05% and 79.68% on abdominal MRI and CT segmentation, respectively, significantly outperforming existing methods. Conclusion: KG-SAM is a robust and generalizable medical image segmentation framework that improves accuracy and clinical applicability by integrating anatomical knowledge with uncertainty modeling. Abstract: While the Segment Anything Model (SAM) has achieved remarkable success in image segmentation, its direct application to medical imaging remains hindered by fundamental challenges, including ambiguous boundaries, insufficient modeling of anatomical relationships, and the absence of uncertainty quantification. To address these limitations, we introduce KG-SAM, a knowledge-guided framework that synergistically integrates anatomical priors with boundary refinement and uncertainty estimation. Specifically, KG-SAM incorporates (i) a medical knowledge graph to encode fine-grained anatomical relationships, (ii) an energy-based Conditional Random Field (CRF) to enforce anatomically consistent predictions, and (iii) an uncertainty-aware fusion module to enhance reliability in high-stakes clinical scenarios. Extensive experiments across multi-center medical datasets demonstrate the effectiveness of our approach: KG-SAM achieves an average Dice score of 82.69% on prostate segmentation and delivers substantial gains in abdominal segmentation, reaching 78.05% on MRI and 79.68% on CT. These results establish KG-SAM as a robust and generalizable framework for advancing medical image segmentation.
[162] UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
Lan Chen,Yuchao Gu,Qi Mao
Main category: cs.CV
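TL;DR: Proposes UniVid, a framework that uses a pre-trained video generation model to handle diverse vision tasks in a unified way. Tasks are expressed as visual sentences with no task-specific modifications, and the model generalizes well across modalities and data sources.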
Details
Motivation: Existing approaches require task-specific pre-training across modalities and sources, which is costly and hard to extend to unseen tasks; the authors look for a more unified and scalable paradigm for vision task modeling. Method: Building on a pre-trained video diffusion transformer, UniVid represents vision tasks as visual sentences, uses the visual prompt sequence to define the task and the output modality, and fine-tunes the model to support diverse tasks. Result: Trained only on natural video data, UniVid generalizes well to cross-modal tasks (contexts mixing images and videos) and cross-source tasks (natural to annotated data), and can switch between understanding and generation by reversing the visual sentence order. Conclusion: Pre-trained video generation models have the potential to serve as a unified, scalable foundation for vision modeling, and UniVid offers a new paradigm for multi-task visual processing. Abstract: Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent works like Large Vision Model (LVM) extend this paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as the context to guide outputs. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, extending beyond LVM's uni-modal setting; (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can be switched simply by reversing the visual sentence order in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling.
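To make the visual-sentence idea concrete, here is a toy data-layout sketch. It is one reading of the abstract, not the UniVid implementation: strings stand in for image or video clips, the function name `build_visual_sentence` is hypothetical, and the assumption that "reversing the order" means swapping the two members of each context pair is an interpretation rather than something the abstract spells out.

```python
from typing import List, Tuple


def build_visual_sentence(
    pairs: List[Tuple[str, str]], query: str, reverse: bool = False
) -> List[str]:
    """Flatten (source, target) example pairs plus a query into one context sequence.

    The model is expected to continue the sequence with the query's counterpart.
    With reverse=True each pair is swapped, so the same examples define the
    opposite mapping (ASSUMED reading of "reversing the visual sentence order").
    """
    if reverse:
        pairs = [(tgt, src) for src, tgt in pairs]
    sentence: List[str] = []
    for src, tgt in pairs:
        sentence.extend([src, tgt])
    sentence.append(query)
    return sentence


if __name__ == "__main__":
    examples = [("rgb_1", "depth_1"), ("rgb_2", "depth_2")]
    # Understanding-style task: RGB in, depth map expected as continuation.
    print(build_visual_sentence(examples, "rgb_3"))
    # Generation-style task: depth map in, RGB expected as continuation.
    print(build_visual_sentence(examples, "depth_3", reverse=True))
```

In this layout the task is never named explicitly; it is implied entirely by the order of the context items, which is why the same examples can serve both directions.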
Our code will be released at https://github.com/CUC-MIPG/UniVid.
[163] CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones
Wenyi Gong,Mieszko Lis
Main category: cs.CV
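TL;DR: Proposes a simple yet effective token merging method that reduces tokens efficiently while keeping the spatial structure intact. It is compatible with spatially designed ViT architectures and achieves state-of-the-art results across a range of vision tasks.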
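This entry's abstract lists a "max-magnitude-per-dimension" token representation and a 2D reduction that keeps the token grid structured. The sketch below illustrates both ideas in plain Python under simplifying assumptions: tokens are lists of floats, and the merge partners are fixed adjacent column pairs, whereas the paper presumably chooses which tokens to merge adaptively. The function names are hypothetical.

```python
from typing import List, Sequence

Token = List[float]


def max_magnitude_merge(tokens: Sequence[Sequence[float]]) -> Token:
    """Merge tokens by keeping, for each feature dimension, the value with the largest |x|."""
    if not tokens:
        raise ValueError("need at least one token to merge")
    dim = len(tokens[0])
    return [max((tok[d] for tok in tokens), key=abs) for d in range(dim)]


def merge_row_pairs(grid: List[List[Token]]) -> List[List[Token]]:
    """Halve the grid width by merging adjacent column pairs in every row.

    Every row loses the same number of tokens, so the result is still a
    rectangular H x (W / 2) grid. The fixed pairing is an ASSUMPTION made
    to keep the example short.
    """
    merged_grid = []
    for row in grid:
        if len(row) % 2 != 0:
            raise ValueError("row width must be even for pairwise merging")
        merged_grid.append(
            [max_magnitude_merge([row[i], row[i + 1]]) for i in range(0, len(row), 2)]
        )
    return merged_grid


if __name__ == "__main__":
    grid = [
        [[1.0, -3.0], [-2.0, 2.0], [0.5, 0.5], [0.1, -0.9]],
        [[0.0, 1.0], [4.0, -1.0], [-0.2, 0.3], [0.2, 0.1]],
    ]
    for row in merge_row_pairs(grid):
        print(row)
```

Unlike averaging, this rule keeps a strong activation from being diluted by a weak partner, and because every row shrinks by the same amount, window attention or 2D positional schemes can still index the result as a grid.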
Details
Motivation: Modern ViT architectures such as SAM and DINOv3 use spatially structured designs, but existing token reduction methods struggle to preserve the topological integrity these designs need, which degrades performance. Method: A token merging method that preserves spatial integrity, consisting of (1) a 2D reduction strategy that maintains a structured token layout, (2) a spatial-aware merging algorithm that keeps relative token positions, and (3) a max-magnitude-per-dimension token representation that preserves salient features. Result: 1.25x speedup on SAM-H with only a 0.7% mIOU drop on COCO, and 1.15x speedup on DeiT-B with no accuracy loss on ImageNet after one epoch of fine-tuning; the method applies to both spatial and non-spatial architectures. Conclusion: The method exploits the uneven distribution of information while preserving spatial structure, giving a general, efficient, plug-and-play form of token reduction that improves the deployment efficiency of ViTs. Abstract: Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i) exploiting the uneven information distribution across the spatial layout while (ii) preserving the spatial structure post-merging. Our approach employs (i) a 2D reduction strategy to enforce structured token layouts, (ii) a spatial-aware merging algorithm that maintains relative token positions, and (iii) a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve 1.25x speedup on SAM-H with only 0.7% mIOU drop evaluated on COCO off-the-shelf, and 1.15x speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
[164] Training-Free Multimodal Deepfake Detection via Graph Reasoning
Yuxin Liu,Fei Wang,Kun Li,Yiqi Nie,Junjie Chen,Yanyan Wei,Zhangling Duan,Zhaohong Jia
Main category: cs.CV
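TL;DR: Proposes GASP-ICL, a training-free framework for multimodal deepfake detection that uses guided adaptive scoring and a graph-structured Taylor scoring mechanism to improve how large vision-language models perform on the task.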
Details
Motivation: Existing large vision-language models struggle in multimodal deepfake detection to capture subtle forgery cues, resolve cross-modal inconsistencies, and perform task-aligned retrieval. Method: The GASP-ICL framework builds a candidate set with an MDD-adapted feature extractor and introduces the Graph-Structured Taylor Adaptive Scorer (GSTAS), which captures relations between samples and propagates query-aligned signals to select semantically aligned, task-relevant exemplars for in-context learning. Result: Experiments on four forgery types show that GASP-ICL outperforms strong baselines and delivers gains without fine-tuning the vision-language model. Conclusion: GASP-ICL is an effective training-free framework that improves the robustness and accuracy of large vision-language models for multimodal deepfake detection. Abstract: Multimodal deepfake detection (MDD) aims to uncover manipulations across visual, textual, and auditory modalities, thereby reinforcing the reliability of modern information systems. Although large vision-language models (LVLMs) exhibit strong multimodal reasoning, their effectiveness in MDD is limited by challenges in capturing subtle forgery cues, resolving cross-modal inconsistencies, and performing task-aligned retrieval. To this end, we propose Guided Adaptive Scorer and Propagation In-Context Learning (GASP-ICL), a training-free framework for MDD. GASP-ICL employs a pipeline to preserve semantic relevance while injecting task-aware knowledge into LVLMs. We leverage an MDD-adapted feature extractor to retrieve aligned image-text pairs and build a candidate set. We further design the Graph-Structured Taylor Adaptive Scorer (GSTAS) to capture cross-sample relations and propagate query-aligned signals, producing discriminative exemplars. This enables precise selection of semantically aligned, task-relevant demonstrations, enhancing LVLMs for robust MDD. Experiments on four forgery types show that GASP-ICL surpasses strong baselines, delivering gains without LVLM fine-tuning.
[165] Prompt-guided Representation Disentanglement for Action Recognition
Tianci Wu,Guangming Zhu,Jiang Lu,Siyuan Wang,Ning Wang,Nuoye Xiong,Zhang Liang
Main category: cs.CV
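TL;DR: Proposes ProDA, a new action recognition framework that uses spatio-temporal scene graphs and a dynamic prompt module to guide a graph parsing neural network, disentangling specified actions from complex multi-action scenes.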
Details
Motivation: Existing methods struggle to model interactions between different objects in multi-action scenes, so specified actions need to be disentangled effectively to improve recognition. Method: Prompt-guided Disentangled Representation (ProDA) uses Spatio-temporal Scene Graphs (SSGs) and a Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations, together with a video-adapted GPNN that aggregates information with dynamic weights. Result: In video action recognition experiments, ProDA outperforms existing state-of-the-art methods. Conclusion: ProDA effectively disentangles specified actions in multi-action scenes and significantly improves action recognition performance. Abstract: Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces a Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Experiments in video action recognition demonstrate the effectiveness of our approach when compared with the state-of-the-art methods. Our code can be found at https://github.com/iamsnaping/ProDA.git
[166] DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images
Dwip Dalal,Gautam Vashishtha,Anku Ranui,Aishwarya Reganti,Parth Patwa,Mohd Sarique,Chandan Gupta,Keshav Nath,Viswanatha Reddy,Vinija Jain,Aman Chadha,Amitava Das,Amit Sheth,Asif Ekbal
Main category: cs.CV
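TL;DR: Presents a multimodal dataset for identifying hate in digital content, and combines watermarked, stability-enhanced stable diffusion with the Digital Attention Analysis Module (DAAM) to generate hate attention maps that are used to blur hateful regions in images. The paper also releases DeHater, a vision-language model, to support more ethical AI applications in social media.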
Details
Motivation: Harmful online content is widespread and damages public discourse and the digital environment, so effective methods for detecting and removing multimodal hateful content are needed. Method: Images are generated with watermarked, stability-enhanced stable diffusion; DAAM produces hate attention maps that locate hateful regions, which are then blurred; a multimodal dataset is built and released through the dehate shared task; and the DeHater vision-language model is proposed for multimodal dehatification. Result: The approach produces interpretable hate attention maps and identifies and blurs hateful content in images; a multimodal dataset for hate identification is released; DeHater performs well on text-prompted image hate detection. Conclusion: The method sets a new standard for AI-driven detection and removal of multimodal hateful content and contributes to a healthier, more ethical social media environment. Abstract: The rise in harmful online content not only distorts public discourse but also poses significant challenges to maintaining a healthy digital environment. In response to this, we introduce a multimodal dataset uniquely crafted for identifying hate in digital content. Central to our methodology is the innovative application of watermarked, stability-enhanced, stable diffusion techniques combined with the Digital Attention Analysis Module (DAAM). This combination is instrumental in pinpointing the hateful elements within images and generating detailed hate attention maps, which are used to blur these regions and thereby remove the hateful sections of the image. We release this dataset as part of the dehate shared task. This paper also describes the details of the shared task. Furthermore, we present DeHater, a vision-language model designed for multimodal dehatification tasks. Our approach sets a new standard in AI-driven image hate detection given textual prompts, contributing to the development of more ethical AI applications in social media.
[167] MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning
Lihao Zheng,Jiawei Chen,Xintian Shen,Hao Ma,Tao Wei
Main category: cs.CV
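TL;DR: Proposes MIRG-RL, a unified reinforcement-learning framework for multi-image reasoning and grounding that uses two-stage training and a dual reward mechanism to reach state-of-the-art performance on cross-image reasoning tasks.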
Details
Motivation: Existing large vision-language models lack cross-image reasoning ability and effective reward modeling for cross-image references, which makes complex cross-image relationships hard to handle. Method: A two-stage training paradigm: supervised fine-tuning on annotated trajectories, followed by image-aware reinforcement learning; a lightweight reasoning-enhanced dataset is built from object-level and image-level annotations, and dual reward functions target objects and images. Result: State-of-the-art performance on multi-image grounding benchmarks, with 64.82% accuracy on cross-image reasoning tasks, 1% above the previous best method. Conclusion: MIRG-RL improves multi-image reasoning and grounding and resolves cross-image ambiguities; the code and dataset are open-sourced. Abstract: Multi-image reasoning and grounding require understanding complex cross-image relationships at both object levels and image levels. Current Large Visual Language Models (LVLMs) face two critical challenges: the lack of cross-image reasoning capabilities and insufficient cross-image reference reward modeling. To address these issues, we propose a unified framework - Multi-Image Reasoning and Grounding with Reinforcement Learning (MIRG-RL). Specifically, our two-stage training paradigm combines supervised fine-tuning with annotated trajectories and image-aware reinforcement learning optimization, progressively developing multi-image reasoning capabilities. Furthermore, we propose a method for constructing the trajectory data, which integrates object-level and image-level annotation information, and use this method to generate a lightweight reasoning-enhanced dataset. To effectively resolve cross-image ambiguities, we design an image-aware RL policy with dual reward functions for objects and images. Experiments demonstrate that MIRG-RL achieves state-of-the-art (SOTA) performance in multi-image grounding benchmarks, attaining 64.82% on cross-image reasoning tasks - exceeding the previous best method by 1%. The code and dataset have been released at https://github.com/ZEUS2035/MIRG-RL.
[168] LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE
Yu Shang,Lei Jin,Yiding Ma,Xin Zhang,Chen Gao,Wei Wu,Yong Li
Main category: cs.CV
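TL;DR: Proposes LongScape, a hybrid framework that adaptively combines intra-chunk diffusion denoising with inter-chunk autoregressive generation to produce stable, high-quality long-horizon video.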
Details
Motivation: Existing video-based world models suffer from temporal inconsistency, visual drift, or loss of detail in long-horizon generation, and cannot meet the stability and visual quality that embodied manipulation tasks require. Method: An action-guided, variable-length chunking mechanism partitions video according to the semantic context of robot actions, and a Context-aware Mixture-of-Experts (CMoE) framework dynamically activates specialized experts during generation to maintain visual quality and coherent chunk transitions. Result: Experiments show stable and consistent generation over long-horizon rollouts, clearly better than conventional diffusion and autoregressive methods. Conclusion: LongScape addresses visual drift and detail degradation in long-horizon video generation and provides a way to generate high-quality simulation data for embodied agents. Abstract: Video-based world models hold significant potential for generating high-quality embodied manipulation data. However, current video generation methods struggle to achieve stable long-horizon generation: classical diffusion-based approaches often suffer from temporal inconsistency and visual drift over multiple rollouts, while autoregressive methods tend to compromise on visual detail. To solve this, we introduce LongScape, a hybrid framework that adaptively combines intra-chunk diffusion denoising with inter-chunk autoregressive causal generation. Our core innovation is an action-guided, variable-length chunking mechanism that partitions video based on the semantic context of robotic actions. This ensures each chunk represents a complete, coherent action, enabling the model to flexibly generate diverse dynamics. We further introduce a Context-aware Mixture-of-Experts (CMoE) framework that adaptively activates specialized experts for each chunk during generation, guaranteeing high visual quality and seamless chunk transitions. Extensive experimental results demonstrate that our method achieves stable and consistent long-horizon generation over extended rollouts. Our code is available at: https://github.com/tsinghua-fib-lab/Longscape.
[169] MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation
Yu Shang,Yangcheng Yu,Xin Zhang,Xin Jin,Haisheng Su,Wei Wu,Yong Li
Main category: cs.CV
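TL;DR: Proposes MoWM, a mixture-of-world-models framework that fuses latent-space and pixel-space representations for embodied action planning and reaches state-of-the-art performance on the CALVIN benchmark.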
Details
Motivation: Existing video generation world models rely on pixel-level reconstruction, which introduces visual redundancy that hurts action decoding and generalization; latent models are compact but overlook fine details and so struggle with precise manipulation. Method: MoWM combines a latent world model, which provides a motion-aware high-level prior, with a pixel-space model, which supplies fine-grained visual features, and fuses the two to improve action decoding accuracy. Result: State-of-the-art task success rates and stronger generalization on the CALVIN benchmark, along with an in-depth analysis of the strengths of each feature space. Conclusion: MoWM effectively combines motion-aware priors with fine-grained visual information, significantly improves embodied action planning, and offers useful insights for future work. Abstract: Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel-level reconstruction often introduces visual redundancies that hinder action decoding and generalization. Latent world models offer a compact, motion-aware representation, but overlook the fine-grained details critical for precise manipulation. To overcome these limitations, we propose MoWM, a mixture-of-world-model framework that fuses representations from hybrid world models for embodied action planning. Our approach uses motion-aware representations from a latent model as a high-level prior, which guides the extraction of fine-grained visual features from the pixel space model. This design allows MoWM to highlight the informative visual details needed for action decoding. Extensive evaluations on the CALVIN benchmark demonstrate that our method achieves state-of-the-art task success rates and superior generalization. We also provide a comprehensive analysis of the strengths of each feature space, offering valuable insights for future research in embodied planning. The code is available at: https://github.com/tsinghua-fib-lab/MoWM.
[170] DiTraj: training-free trajectory control for video diffusion transformer
Cheng Lei,Jiayu Zhang,Yue Ma,Xinyu Wang,Long Chen,Liang Tang,Yiqiang Yan,Fei Su,Zhicheng Zhao
Main category: cs.CV
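TL;DR: Proposes DiTraj, a training-free framework for trajectory control in DiT-based video generation that uses foreground-background separation guidance and a modified position embedding (STD-RoPE) to strengthen cross-frame attention and trajectory control.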
Details
Motivation: Existing controllable video generation methods either need substantial training resources or are designed specifically for U-Net, so they cannot exploit the stronger performance of DiT; trajectory control in particular lacks an efficient, training-free solution. Method: (1) A large language model splits the user prompt into foreground and background prompts that guide generation of the foreground and background regions, respectively; (2) after analyzing how attention scores relate to position embeddings under 3D full attention, the authors propose Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE), which modifies only the foreground tokens' position embeddings to remove cross-frame spatial discrepancies and strengthen cross-frame attention; (3) 3D-aware trajectory control is achieved by regulating the density of the position embedding. Result: Experiments show that DiTraj outperforms existing methods in both video quality and trajectory controllability, requires no additional training, and fits the DiT architecture. Conclusion: DiTraj gives DiT-based text-to-video models a simple, efficient, training-free way to control trajectories; foreground-background separation and the modified position embedding noticeably improve control precision and generation quality. Abstract: Diffusion Transformers (DiT)-based video generation models with 3D full attention exhibit strong generative capabilities. Trajectory control represents a user-friendly task in the field of controllable video generation. However, existing methods either require substantial training resources or are specifically designed for U-Net and thus do not take advantage of the superior performance of DiT. To address these issues, we propose DiTraj, a simple but effective training-free framework for trajectory control in text-to-video generation, tailored for DiT. Specifically, first, to inject the object's trajectory, we propose foreground-background separation guidance: we use a Large Language Model (LLM) to convert user-provided prompts into foreground and background prompts, which respectively guide the generation of foreground and background regions in the video. Then, we analyze 3D full attention and explore the tight correlation between inter-token attention scores and position embedding. Based on this, we propose inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). By modifying only foreground tokens' position embedding, STD-RoPE eliminates their cross-frame spatial discrepancies, strengthening cross-frame attention among them and thus enhancing trajectory control. Additionally, we achieve 3D-aware trajectory control by regulating the density of position embedding.
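The STD-RoPE description above says only that foreground tokens' position embeddings are modified so their spatial positions no longer differ across frames. The sketch below is one possible illustration of that idea at the level of position indices, not the paper's formulation: it assumes a per-frame bounding box of constant size in token units and reassigns each foreground token the spatial index it would have inside a reference frame's box. The `Box` type and the function name are hypothetical, and the rotary embedding itself is not computed.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (top, left, height, width), in token units
Pos = Tuple[int, int, int]        # (frame, row, column)


def aligned_position_ids(
    num_frames: int, height: int, width: int, boxes: List[Box], ref_frame: int = 0
) -> Dict[Pos, Pos]:
    """Map each token (t, y, x) to the position index used for its embedding.

    Background tokens keep their own (t, y, x). Foreground tokens (inside the
    frame's box) keep t but take the spatial index of the matching cell in the
    reference frame's box, so the moving object "sits still" spatially.
    ASSUMPTION: all boxes share the same height and width.
    """
    ref_top, ref_left, _, _ = boxes[ref_frame]
    ids: Dict[Pos, Pos] = {}
    for t in range(num_frames):
        top, left, box_h, box_w = boxes[t]
        for y in range(height):
            for x in range(width):
                inside = top <= y < top + box_h and left <= x < left + box_w
                if inside:
                    ids[(t, y, x)] = (t, ref_top + (y - top), ref_left + (x - left))
                else:
                    ids[(t, y, x)] = (t, y, x)
    return ids


if __name__ == "__main__":
    # A 2x2-token object moving from the top-left corner to (row 1, col 2).
    ids = aligned_position_ids(2, 4, 4, boxes=[(0, 0, 2, 2), (1, 2, 2, 2)])
    print(ids[(1, 1, 2)])  # foreground token in frame 1
    print(ids[(1, 0, 0)])  # background token in frame 1
```

One side effect of this simplification is that a foreground token and a background token in the same frame can end up with the same index; how the actual method handles that, and how it regulates embedding density for 3D-aware control, is not specified in the abstract.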
Extensive experiments demonstrate that our method outperforms previous methods in both video quality and trajectory controllability.
[171] A Comprehensive Evaluation of Transformer-Based Question Answering Models and RAG-Enhanced Design
Zichen Zhang,Kunlong Zhang,Hongwei Ruan,Yiming Luo
Main category: cs.CV
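TL;DR: Proposes a hybrid retrieval method that combines dense embeddings, lexical overlap, and re-ranking. It substantially outperforms baselines on multi-hop question answering, with relative improvements of 50% in EM and 47% in F1 on HotpotQA.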
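Among the retrieval strategies this entry compares is maximal marginal relevance (MMR). As a point of reference, the sketch below implements the textbook MMR selection rule over toy embedding vectors; it is not the paper's pipeline, and the vectors, the trade-off value `lam`, and the function names are illustrative assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query: Sequence[float], docs: List[Sequence[float]], k: int, lam: float = 0.5) -> List[int]:
    """Greedy MMR: repeatedly pick the document maximizing
    lam * sim(query, d) - (1 - lam) * max over selected s of sim(d, s)."""
    selected: List[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[0.9, 0.1], [0.9, 0.12], [0.7, -0.7]]  # docs 0 and 1 are near-duplicates
    print(mmr(query, docs, k=2, lam=1.0))  # pure relevance ranking
    print(mmr(query, docs, k=2, lam=0.5))  # relevance traded against redundancy
```

With `lam=1.0` the rule reduces to plain cosine ranking and returns the two near-duplicates; lowering `lam` penalizes the duplicate and surfaces the complementary passage instead, which is the behavior multi-hop questions tend to need.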
Details
Motivation: Multi-hop questions require reasoning across multiple passages, and existing Transformer-based models still have difficulty retrieving the relevant evidence, particularly in balancing accuracy and efficiency. Method: Within a retrieval-augmented generation framework, the paper compares cosine similarity, maximal marginal relevance, and a hybrid method that integrates dense embeddings, lexical overlap, and re-ranking; it also adapts the EfficientRAG pipeline for query optimization, adding token labeling and iterative refinement. Result: On HotpotQA, the hybrid method improves exact match by 50% and F1 by 47% relative to the cosine similarity baseline; error analysis shows better entity recall and evidence complementarity, with remaining limitations in handling distractors and temporal reasoning. Conclusion: Hybrid retrieval-augmented generation is a practical zero-shot solution for multi-hop question answering that balances accuracy, efficiency, and interpretability. Abstract: Transformer-based models have advanced the field of question answering, but multi-hop reasoning, where answers require combining evidence across multiple passages, remains difficult. This paper presents a comprehensive evaluation of retrieval strategies for multi-hop question answering within a retrieval-augmented generation framework. We compare cosine similarity, maximal marginal relevance, and a hybrid method that integrates dense embeddings with lexical overlap and re-ranking. To further improve retrieval, we adapt the EfficientRAG pipeline for query optimization, introducing token labeling and iterative refinement while maintaining efficiency. Experiments on the HotpotQA dataset show that the hybrid approach substantially outperforms baseline methods, achieving a relative improvement of 50 percent in exact match and 47 percent in F1 score compared to cosine similarity. Error analysis reveals that hybrid retrieval improves entity recall and evidence complementarity, while remaining limited in handling distractors and temporal reasoning. Overall, the results suggest that hybrid retrieval-augmented generation provides a practical zero-shot solution for multi-hop question answering, balancing accuracy, efficiency, and interpretability.
[172] Dynamic Novel View Synthesis in High Dynamic Range
Kaixuan Zhang,Zhipeng Xiong,Minxian Li,Mingwu Ren,Jiankang Deng,Xiatian Zhu
Main category: cs.CV
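TL;DR: Introduces the new problem of HDR Dynamic Novel View Synthesis (HDR DNVS): learning a high dynamic range (HDR) 4D model of a dynamic scene from low dynamic range (LDR) images. The authors propose HDR-4DGS, a Gaussian Splatting-based architecture with a dynamic tone-mapping module that achieves radiance consistency over time and spatially accurate color translation, outperforming existing methods in both quantitative metrics and visual quality.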
Details
Motivation: Existing HDR novel view synthesis methods mainly target static scenes and cannot cope with the dynamic elements common in the real world, such as moving objects and changing lighting; a method that jointly models temporal and 3D radiance variation is needed to reproduce dynamic HDR scenes faithfully. Method: HDR-4DGS builds on the Gaussian Splatting framework and adds a dynamic tone-mapping module that adapts the tone-mapping function to the changing radiance distribution over time, explicitly linking the HDR and LDR domains and enabling spatio-temporally consistent rendering. Result: Experiments show that HDR-4DGS outperforms current state-of-the-art methods on dynamic HDR novel view synthesis in both quantitative metrics and visual fidelity, producing high-quality, photorealistic HDR images at arbitrary viewpoints and time instants. Conclusion: HDR-4DGS addresses HDR novel view synthesis for dynamic scenes by jointly modeling temporal radiance variation and 3D structure, achieving spatio-temporally consistent and color-accurate HDR rendering. Abstract: High Dynamic Range Novel View Synthesis (HDR NVS) seeks to learn an HDR 3D model from Low Dynamic Range (LDR) training images captured under conventional imaging conditions. Current methods primarily focus on static scenes, implicitly assuming all scene elements remain stationary and non-living. However, real-world scenarios frequently feature dynamic elements, such as moving objects, varying lighting conditions, and other temporal events, thereby presenting a significantly more challenging scenario. To address this gap, we propose a more realistic problem named HDR Dynamic Novel View Synthesis (HDR DNVS), where the additional dimension "Dynamic" emphasizes the necessity of jointly modeling temporal radiance variations alongside sophisticated 3D translation between LDR and HDR. To tackle this complex, intertwined challenge, we introduce HDR-4DGS, a Gaussian Splatting-based architecture featuring an innovative dynamic tone-mapping module that explicitly connects HDR and LDR domains, maintaining temporal radiance coherence by dynamically adapting tone-mapping functions according to the evolving radiance distributions across the temporal dimension. As a result, HDR-4DGS achieves both temporal radiance consistency and spatially accurate color translation, enabling photorealistic HDR renderings from arbitrary viewpoints and time instances. Extensive experiments demonstrate that HDR-4DGS surpasses existing state-of-the-art methods in both quantitative performance and visual fidelity.
Source code will be released.
[173] SRHand: Super-Resolving Hand Images and 3D Shapes via View/Pose-aware Neural Image Representations and Explicit 3D Meshes
Minje Kim,Tae-Kyun Kim
Main category: cs.CV
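TL;DR: Proposes SRHand, a method for reconstructing high-fidelity 3D hand geometry and texture from low-resolution images. It combines an implicit image representation with explicit hand meshes and uses a geometric-aware implicit image function to enforce multi-view and pose consistency, outperforming existing methods on multiple datasets.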
Details
Motivation: Existing high-fidelity hand reconstruction methods depend on high-resolution multi-view images and handle low-resolution input poorly; existing super-resolution methods do not suit articulated, deformable hands and lack pose and multi-view consistency. Method: SRHand introduces a geometric-aware implicit image function (GIIF) and jointly optimizes the implicit image function and explicit 3D hand shapes, using hand priors to super-resolve low-resolution images and achieve detailed 3D reconstruction. Result: On the InterHand2.6M and Goliath datasets, SRHand significantly outperforms existing image super-resolution and 3D hand reconstruction methods in both quantitative and qualitative evaluation, recovering details such as wrinkles and nails. Conclusion: SRHand addresses the reconstruction of high-quality hand models from low-resolution images while maintaining multi-view consistency and geometric detail, offering a new and workable approach to hand reconstruction. Abstract: Reconstructing detailed hand avatars plays a crucial role in various applications. While prior works have focused on capturing high-fidelity hand geometry, they heavily rely on high-resolution multi-view image inputs and struggle to generalize on low-resolution images. Multi-view image super-resolution methods have been proposed to enforce 3D view consistency. These methods, however, are limited to static objects/scenes with fixed resolutions and are not applicable to articulated deformable hands. In this paper, we propose SRHand (Super-Resolution Hand), a method for reconstructing detailed 3D geometry as well as textured images of hands from low-resolution images. SRHand leverages the advantages of implicit image representation with explicit hand meshes. Specifically, we introduce a geometric-aware implicit image function (GIIF) that learns a detailed hand prior by upsampling the coarse input images. By jointly optimizing the implicit image function and explicit 3D hand shapes, our method preserves multi-view and pose consistency among upsampled hand images, and achieves fine-detailed 3D reconstruction (wrinkles, nails). In experiments using the InterHand2.6M and Goliath datasets, our method significantly outperforms state-of-the-art image upsampling methods adapted to hand datasets, and 3D hand reconstruction methods, quantitatively and qualitatively. Project page: https://yunminjin2.github.io/projects/srhand
[174] Deepfakes: we need to re-think the concept of "real" images
Janis Keuper,Margret Keuper
Main category: cs.CV
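TL;DR: Argues that current "fake image" detection research leans too heavily on outdated datasets of "real" images and overlooks how smartphone photography and computational imaging challenge the definition of a "real image". The paper calls for rethinking the concept of "real" images and for new technical definitions and benchmark datasets.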
Details
Motivation: Most existing "fake image" detection methods are built on old, low-resolution datasets of real images, while the vast majority of photos today are produced by smartphones through complex algorithms that resemble generative models; this blurs the line between "real" and "fake" and calls for re-examining what a "real image" is. Method: The paper is analytical and argumentative: it reviews the current state of fake image detection research, points out shortcomings in how "real" image data is collected and defined, and questions the concept of a "real image" by comparing modern phone imaging pipelines with generative models. Result: The authors find that most current fake image detection methods depend on dated real-image datasets such as ImageNet that do not reflect modern computational imaging, and argue that photos taken by modern phones are themselves processed by neural networks and are therefore close in nature to generated images. Conclusion: Researchers should rethink the definition of "real images", question whether detecting "fake images" alone is a sound research goal, and work toward clearer technical standards and new benchmark datasets of real images. Abstract: The wide availability and low usability barrier of modern image generation models have triggered the reasonable fear of criminal misconduct and negative social implications. The machine learning community has been engaging this problem with an extensive series of publications proposing algorithmic solutions for the detection of "fake", e.g. entirely generated or partially manipulated, images. While there is undoubtedly some progress towards technical solutions of the problem, we argue that current and prior work is focusing too much on generative algorithms and "fake" data-samples, neglecting a clear definition and data collection of "real" images. The fundamental question "what is a real image?" might appear to be quite philosophical, but our analysis shows that the development and evaluation of basically all current "fake"-detection methods is relying on only a few, quite old low-resolution datasets of "real" images like ImageNet. However, the technology for the acquisition of "real" images, aka taking photos, has drastically evolved over the last decade: Today, over 90% of all photographs are produced by smartphones which typically use algorithms to compute an image from multiple inputs (over time) from multiple sensors. Based on the fact that these image formation algorithms are typically neural network architectures which are closely related to "fake"-image generators, we state the position that today, we need to re-think the concept of "real" images.
The purpose of this position paper is to raise awareness of the current shortcomings in this active field of research and to trigger an open discussion on whether the detection of "fake" images is a sound objective at all. At the very least, we need a clear technical definition of "real" images and new benchmark datasets.
[175] Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization
Boyang Liu,Yifan Hu,Senjie Jin,Shihan Dou,Gonglei Shi,Jie Shao,Tao Gui,Xuanjing Huang
Main category: cs.CV
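TL;DR: Proposes Aes-R1, a framework that uses reinforcement learning to improve the scoring accuracy and interpretability of multimodal large models in image aesthetic assessment.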
Details
Motivation: High-quality multimodal aesthetic reasoning data is scarce and aesthetic judgment is highly subjective, so existing methods struggle to produce accurate aesthetic judgments with interpretable rationales.
Method: The AesCoT pipeline constructs and filters high-quality chain-of-thought aesthetic reasoning data for cold start, and a new reinforcement learning algorithm, RAPO, jointly optimizes absolute score regression and relative ranking.
Result: Experiments show that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8%, surpasses state-of-the-art baselines of similar size, and generalizes well under limited supervision and in out-of-distribution scenarios.
Conclusion: Aes-R1 strengthens image aesthetic scoring and reasoning in multimodal large language models within a unified framework, yielding more accurate, interpretable, and robust aesthetic assessment.
Abstract: Multimodal large language models (MLLMs) are well suited to image aesthetic assessment, as they can capture high-level aesthetic features leveraging their cross-modal understanding capacity. However, the scarcity of multimodal aesthetic reasoning data and the inherently subjective nature of aesthetic judgment make it difficult for MLLMs to generate accurate aesthetic judgments with interpretable rationales. To this end, we propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL). Concretely, Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data used for cold-start. After teaching the model to generate structured explanations prior to scoring, we then employ Relative-Absolute Policy Optimization (RAPO), a novel RL algorithm that jointly optimizes absolute score regression and relative ranking order, improving both per-image accuracy and cross-image preference judgments. Aes-R1 enables MLLMs to generate grounded explanations alongside faithful scores, thereby enhancing aesthetic scoring and reasoning in a unified framework. Extensive experiments demonstrate that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8%, surpassing state-of-the-art baselines of similar size. Further ablation studies validate Aes-R1's robust generalization under limited supervision and in out-of-distribution scenarios.
[176] StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing
Liyang Chen,Tianze Zhou,Xu He,Boshi Tang,Zhiyong Wu,Yang Huang,Yang Wu,Zhongqian Sun,Wei Yang,Helen Meng
Main category: cs.CV
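TL;DR: This paper proposes StableDub, a new visual dubbing framework that combines lip-habit-aware modeling with occlusion-robust synthesis and delivers superior performance in lip sync, occlusion handling, and training efficiency.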
Details
Motivation: Existing audio-driven visual dubbing methods struggle to capture speaker-specific lip habits and tend to produce visual artifacts when handling occluding objects, which limits practical use.
Method: Building on the Stable Diffusion architecture, the paper proposes a lip-habit-modulated mechanism that models audio-visual synchronization together with individual facial dynamics, and an occlusion-aware training strategy that explicitly exposes occluding objects to the inpainting process. It also introduces a hybrid Mamba-Transformer architecture to improve training efficiency in low-resource settings.
Result: Experiments show that StableDub outperforms existing methods in lip habit resemblance, occlusion robustness, audio-lip sync, video quality, and resolution consistency, while requiring no costly priors and training more efficiently.
Conclusion: By jointly modeling lip habits and occlusion-aware inpainting, StableDub markedly improves the realism and practicality of visual dubbing and extends its applicability to complex scenes.
Abstract: The visual dubbing task aims to generate mouth movements synchronized with the driving audio, which has seen significant progress in recent years. However, two critical deficiencies hinder the wide application of existing methods: (1) Audio-only driving paradigms inadequately capture speaker-specific lip habits and thus fail to generate lip movements similar to the target avatar; (2) Conventional blind-inpainting approaches frequently produce visual artifacts when handling obstructions (e.g., microphones, hands), limiting practical deployment. In this paper, we propose StableDub, a novel and concise framework integrating lip-habit-aware modeling with occlusion-robust synthesis. Specifically, building upon the Stable-Diffusion backbone, we develop a lip-habit-modulated mechanism that jointly models phonemic audio-visual synchronization and speaker-specific orofacial dynamics. To achieve plausible lip geometries and object appearances under occlusion, we introduce an occlusion-aware training strategy by explicitly exposing the occlusion objects to the inpainting process. By incorporating the proposed designs, the model eliminates the necessity for cost-intensive priors in previous methods, thereby exhibiting superior training efficiency on the computationally intensive diffusion-based backbone. To further optimize training efficiency from the perspective of model architecture, we introduce a hybrid Mamba-Transformer architecture, which demonstrates enhanced applicability in low-resource research scenarios.
Extensive experimental results demonstrate that StableDub achieves superior performance in lip habit resemblance and occlusion robustness. Our method also surpasses other methods in audio-lip sync, video quality, and resolution consistency. We expand the applicability of visual dubbing methods from comprehensive aspects, and demo videos can be found at https://stabledub.github.io.
[177] Drag4D: Align Your Motion with Text-Driven 3D Scene Generation
Minjun Kang,Inkyu Shin,Taeyeop Lee,In So Kweon,Kuk-Jin Yoon
Main category: cs.CV
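TL;DR: Drag4D is an interactive framework that integrates object motion control into text-driven 3D scene generation; users can define motion trajectories for 3D objects generated from a single image and integrate them seamlessly into a high-quality 3D background.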
Details
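Motivation: Existing 3D scene generation methods offer little precise user control over object motion trajectories and struggle with spatio-temporal consistency between dynamic objects and static backgrounds. Drag4D aims to provide user-controllable 3D object motion while keeping the object visually harmonious with the background and temporally consistent across views.
Method: Drag4D has three stages: 1) it improves text-to-3D background generation by applying 2D Gaussian Splatting with panoramic images and inpainted novel views; 2) given a reference image, it extracts a full 3D mesh of the target object with an off-the-shelf image-to-3D model and places it in the 3D scene through physics-aware object position learning; 3) it animates the object along a user-defined 3D trajectory, using a part-augmented, motion-conditioned video diffusion model that processes multiview image pairs and their projected 2D trajectories to keep the motion consistent.
Result: Experiments validate each stage, and the final results show user-controlled object motion aligned spatially and temporally with a high-quality 3D background, with markedly less motion hallucination and better multiview consistency.
Conclusion: Drag4D provides a unified framework for intuitive, precise control of object motion in text-generated 3D scenes, advancing interactive 3D content creation.
Abstract: We introduce Drag4D, an interactive framework that integrates object motion control within text-driven 3D scene generation. This framework enables users to define 3D trajectories for the 3D objects generated from a single image, seamlessly integrating them into a high-quality 3D background. Our Drag4D pipeline consists of three stages. First, we enhance text-to-3D background generation by applying 2D Gaussian Splatting with panoramic images and inpainted novel views, resulting in dense and visually complete 3D reconstructions. In the second stage, given a reference image of the target object, we introduce a 3D copy-and-paste approach: the target instance is extracted in a full 3D mesh using an off-the-shelf image-to-3D model and seamlessly composited into the generated 3D scene. The object mesh is then positioned within the 3D scene via our physics-aware object position learning, ensuring precise spatial alignment. Lastly, the spatially aligned object is temporally animated along a user-defined 3D trajectory. To mitigate motion hallucination and ensure view-consistent temporal alignment, we develop a part-augmented, motion-conditioned video diffusion model that processes multiview image pairs together with their projected 2D trajectories.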
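The Drag4D abstract mentions conditioning on 2D trajectories projected from a user-defined 3D trajectory into multiple views. As a rough illustration of that projection step only, here is a minimal sketch using a standard pinhole camera model. The function name, intrinsics, and camera poses are hypothetical and are not taken from the paper, which does not describe its camera model here.

```python
import numpy as np

def project_trajectory(points_world, K, R, t):
    """Project an (N, 3) array of world-space points to (N, 2) pixel coordinates.

    Assumes a pinhole camera: x_cam = R @ x_world + t, then pixels = K @ x_cam
    followed by a perspective divide. (Illustrative helper, not the paper's code.)
    """
    cam = points_world @ R.T + t        # world -> camera coordinates
    uvw = cam @ K.T                     # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide by depth

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# A straight-line 3D trajectory, 4 units in front of the first camera.
trajectory = np.stack(
    [np.linspace(-1.0, 1.0, 5), np.zeros(5), np.full(5, 4.0)], axis=1
)

# View 0: camera at the origin. View 1: camera shifted 1 unit along +x,
# which corresponds to t = -R @ C = (-1, 0, 0) with R = identity.
R0, t0 = np.eye(3), np.zeros(3)
R1, t1 = np.eye(3), np.array([-1.0, 0.0, 0.0])

track0 = project_trajectory(trajectory, K, R0, t0)
track1 = project_trajectory(trajectory, K, R1, t1)
```

The same 3D path yields a different 2D track in each view, which is why per-view projected trajectories can act as a view-consistent motion signal.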
We demonstrate the effectiveness of our unified architecture through evaluations at each stage and in the final results, showcasing the harmonized alignment of user-controlled object motion within a high-quality 3D background.
[178] Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers
Jibin Song,Mingi Kwon,Jaeseok Jeong,Youngjung Uh
Main category: cs.CV
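TL;DR: This paper proposes Syncphony, a model that generates video closely synchronized with input audio; a Motion-aware Loss and Audio Sync Guidance improve audio-visual synchronization, and the model outperforms existing methods on multiple datasets.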
Details
Motivation: Existing audio-to-video generation models fall short on fine-grained audio-visual synchronization, mainly because of indirect conditioning mechanisms or limited temporal modeling capacity, so more effective synchronization control is needed.
Method: Starting from a pre-trained video generation model, the paper introduces a Motion-aware Loss that emphasizes learning in high-motion regions and an Audio Sync Guidance component that uses a visually aligned model without audio layers to exploit audio cues better at inference. It also proposes the CycleSync metric, which reconstructs audio from the generated video to assess how well the motion in the video is synchronized with the original audio.
Result: Experiments on the AVSync15 and The Greatest Hits datasets show that Syncphony outperforms existing methods in both synchronization accuracy and visual quality, generating 380x640, 24fps videos that are closely synchronized with the audio.
Conclusion: By improving the loss function and the inference-time guidance, Syncphony improves temporal synchronization in audio-driven video generation and offers a new direction for controllable video generation.
Abstract: Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation. However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity. We present Syncphony, which generates 380x640 resolution, 24fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization: (1) Motion-aware Loss, which emphasizes learning at high-motion regions; (2) Audio Sync Guidance, which guides the full model using a visually aligned off-sync model without audio layers to better exploit audio cues at inference while maintaining visual quality. To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures the amount of motion cues in the generated video to reconstruct the original audio. Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality. Project page is available at: https://jibin86.github.io/syncphony_project_page
[179] LG-CD: Enhancing Language-Guided Change Detection through SAM2 Adaptation
Yixiao Liu,Yizhou Yang,Jinwen Li,Jun Tao,Ruoyu Li,Xiangkun Wang,Min Zhu,Junlong Cheng
Main category: cs.CV
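TL;DR: Proposes LG-CD, a language-guided remote sensing change detection model that uses text prompts to enhance visual change detection, markedly improving accuracy and robustness.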
Details
Motivation: Existing deep learning methods focus mainly on unimodal visual information and ignore the rich semantic information in multimodal data such as text, which limits the accuracy and generalization of change detection.
Method: LG-CD uses the visual foundation model SAM2 to extract multi-scale features from bi-temporal remote sensing images, fine-tunes it with multi-layer adapters, aligns image and text information with a Text Fusion Attention Module (TFAM), and finally uses a Vision-Semantic Fusion Decoder (V-SFD) with cross-attention to produce accurate change detection results.
Result: Experiments on three datasets (LEVIR-CD, WHU-CD, and SYSU-CD) show that LG-CD consistently outperforms existing state-of-the-art methods.
Conclusion: The method fuses language and visual information effectively and offers a new direction for generalized change detection based on multimodal information.
Abstract: Remote Sensing Change Detection (RSCD) typically identifies changes in land cover or surface conditions by analyzing multi-temporal images. Currently, most deep learning-based methods primarily focus on learning unimodal visual information, while neglecting the rich semantic information provided by multimodal data such as text. To address this limitation, we propose a novel Language-Guided Change Detection model (LG-CD). This model leverages natural language prompts to direct the network's attention to regions of interest, significantly improving the accuracy and robustness of change detection. Specifically, LG-CD utilizes a visual foundation model (SAM2) as a feature extractor to capture multi-scale pyramid features from high-resolution to low-resolution across bi-temporal remote sensing images. Subsequently, multi-layer adapters are employed to fine-tune the model for downstream tasks, ensuring its effectiveness in remote sensing change detection. Additionally, we design a Text Fusion Attention Module (TFAM) to align visual and textual information, enabling the model to focus on target change regions using text prompts. Finally, a Vision-Semantic Fusion Decoder (V-SFD) is implemented, which deeply integrates visual and semantic information through a cross-attention mechanism to produce highly accurate change detection masks. Our experiments on three datasets (LEVIR-CD, WHU-CD, and SYSU-CD) demonstrate that LG-CD consistently outperforms state-of-the-art change detection methods.
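The LG-CD abstract says visual and semantic information are integrated through a cross-attention mechanism. The sketch below shows generic single-head cross-attention in which visual tokens attend to text tokens. It is not the paper's TFAM or V-SFD: the token counts, dimensions, random weights, and the residual connection are all assumptions chosen only to make the mechanism concrete.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text, Wq, Wk, Wv):
    """Generic single-head cross-attention: visual tokens query text tokens.

    visual: (Nv, Dv), text: (Nt, Dt)
    Wq: (Dv, d), Wk: (Dt, d), Wv: (Dt, Dv)  -- hypothetical projection matrices
    Returns the text-conditioned visual tokens and the attention weights.
    """
    q = visual @ Wq                              # (Nv, d) queries from image features
    k = text @ Wk                                # (Nt, d) keys from text features
    v = text @ Wv                                # (Nt, Dv) values from text features
    scores = q @ k.T / np.sqrt(q.shape[-1])      # scaled dot-product similarity
    weights = softmax(scores, axis=-1)           # each visual token's distribution over text tokens
    return visual + weights @ v, weights         # residual connection (an assumption)

rng = np.random.default_rng(0)
visual = rng.normal(size=(6, 8))   # 6 visual tokens of dimension 8 (illustrative)
text = rng.normal(size=(3, 4))     # 3 text tokens of dimension 4 (illustrative)
Wq = rng.normal(size=(8, 5))
Wk = rng.normal(size=(4, 5))
Wv = rng.normal(size=(4, 8))

fused, weights = cross_attention(visual, text, Wq, Wk, Wv)
```

Each row of `weights` indicates how strongly one image location draws on each word of the prompt, which is the sense in which a text prompt can steer attention toward regions of interest.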
Furthermore, our approach provides new insights into achieving generalized change detection by leveraging multimodal information.
[180] TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation
Qihang Wang,Yaxiong Wang,Lechao Cheng,Zhun Zhong
Main category: cs.CV
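TL;DR: Proposes a unified diffusion-based framework that combines text and drag interactions for image editing, achieving high-fidelity, flexible joint editing through Point-Cloud Deterministic Drag and Drag-Text Guided Denoising.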
Details
Motivation: Existing text-driven and drag-driven image editing methods each have limits in spatial control or texture detail, making it hard to achieve precise shape adjustment and rich texture generation at the same time.
Method: The paper proposes a unified diffusion framework with two innovations: 1) Point-Cloud Deterministic Drag, which strengthens latent-space layout control through 3D feature mapping; 2) Drag-Text Guided Denoising, which dynamically balances the influence of the drag and text conditions. The framework supports text-only, drag-only, and combined editing modes.
Result: Experiments show that the method maintains high fidelity under joint editing, matches specialized methods when text or drag is used alone, and generalizes well across a range of editing tasks.
Conclusion: The framework combines the strengths of text and drag editing and provides a general, flexible, high-performing solution for controllable image editing.
Abstract: This paper explores image editing under the joint control of text and drag interactions. While recent advances in text-driven and drag-driven editing have achieved remarkable progress, they suffer from complementary limitations: text-driven methods excel in texture manipulation but lack precise spatial control, whereas drag-driven approaches primarily modify shape and structure without fine-grained texture guidance. To address these limitations, we propose a unified diffusion-based framework for joint drag-text image editing, integrating the strengths of both paradigms. Our framework introduces two key innovations: (1) Point-Cloud Deterministic Drag, which enhances latent-space layout control through 3D feature mapping, and (2) Drag-Text Guided Denoising, dynamically balancing the influence of drag and text conditions during denoising. Notably, our model supports flexible editing modes - operating with text-only, drag-only, or combined conditions - while maintaining strong performance in each setting. Extensive quantitative and qualitative experiments demonstrate that our method not only achieves high-fidelity joint editing but also matches or surpasses the performance of specialized text-only or drag-only approaches, establishing a versatile and generalizable solution for controllable image manipulation. Code will be made publicly available to reproduce all results presented in this work.
[181] Enhancing Vehicle Detection under Adverse Weather Conditions with Contrastive Learning
Boying Li,Chang Liu,Petter Kyösti,Mattias Öhman,Devashish Singha Roy,Sofia Plazzi,Hamam Mokayed,Olle Hagner
Main category: cs.CV
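TL;DR: Proposes sideload-CL-adaptation, a contrastive-learning framework that uses unannotated data to improve vehicle detection in UAV imagery, particularly under the domain shift caused by snow cover in the Nordic regions.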
Details
Motivation: To address the poor visibility and domain shift caused by snow cover in UAV remote sensing imagery of the Nordic regions, and to make effective use of plentiful, easily collected unannotated data to improve vehicle detection with lightweight models.
Method: In the pretraining stage, a CNN feature extractor is trained with contrastive learning on unannotated data. In the fine-tuning stage, the extractor is attached as an additional module (sideloaded) to a frozen YOLO11n backbone, and several fusion methods and granularities are compared to optimize performance.
Result: On the NVD dataset, the proposed method improves mAP50 by 3.8% to 9.5%.
Conclusion: The sideload-CL-adaptation framework uses unannotated data effectively to strengthen lightweight vehicle detection in challenging snowy environments and shows good application potential.
Abstract: Aside from common challenges in remote sensing like small, sparse targets and computation cost limitations, detecting vehicles from UAV images in the Nordic regions faces strong visibility challenges and domain shifts caused by diverse levels of snow coverage. Although annotated data are expensive, unannotated data are cheaper to obtain by simply flying the drones. In this work, we propose a sideload-CL-adaptation framework that enables the use of unannotated data to improve vehicle detection using lightweight models. Specifically, we propose to train a CNN-based representation extractor through contrastive learning on the unannotated data in the pretraining stage, and then sideload it to a frozen YOLO11n backbone in the fine-tuning stage. To find a robust sideload-CL-adaptation, we conducted extensive experiments to compare various fusion methods and granularity. Our proposed sideload-CL-adaptation model improves the detection performance by 3.8% to 9.5% in terms of mAP50 on the NVD dataset.
[182] Taming Flow-based I2V Models for Creative Video Editing
Xianghao Kong,Hansheng Chen,Yuwei Guo,Lvmin Zhang,Gordon Wetzstein,Maneesh Agrawala,Anyi Rao
Main category: cs.CV
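TL;DR: This paper proposes IF-V2V, an inversion-free video editing method that introduces a deviation term into the denoising vector field and generates structure- and motion-preserving initialization noise, achieving high-quality, highly consistent video editing without significant computational overhead.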
Details
Motivation: Existing image-conditioned video editing methods usually require inversion with model-specific design or extensive optimization, which limits their ability to exploit the latest image-to-video (I2V) models. A more efficient, general way to transfer image editing capability to video editing is needed.
Method: The paper proposes IF-V2V, an inversion-free method with three parts: 1) Vector Field Rectification with Sample Deviation, which injects source video information into the denoising process; 2) Structure-and-Motion-Preserving Initialization, which generates temporally correlated noise carrying structural information; 3) a Deviation Caching mechanism to reduce computational cost.
Result: Experiments show that the method surpasses existing approaches in editing quality and consistency at low computational overhead, striking a good balance between quality and efficiency.
Conclusion: IF-V2V is a lightweight, plug-and-play solution that adapts off-the-shelf flow-matching-based I2V models for video editing, advancing the transfer of image editing techniques to the video domain.
Abstract: Although image editing techniques have advanced significantly, video editing, which aims to manipulate videos according to user intent, remains an emerging challenge. Most existing image-conditioned video editing methods either require inversion with model-specific design or need extensive optimization, limiting their capability of leveraging up-to-date image-to-video (I2V) models to transfer the editing capability of image editing models to the video domain. To this end, we propose IF-V2V, an Inversion-Free method that can adapt off-the-shelf flow-matching-based I2V models for video editing without significant computational overhead. To circumvent inversion, we devise Vector Field Rectification with Sample Deviation to incorporate information from the source video into the denoising process by introducing a deviation term into the denoising vector field. To further ensure consistency with the source video in a model-agnostic way, we introduce Structure-and-Motion-Preserving Initialization to generate motion-aware temporally correlated noise with structural information embedded. We also present a Deviation Caching mechanism to minimize the additional computational cost for denoising vector rectification without significantly impacting editing quality. Evaluations demonstrate that our method achieves superior editing quality and consistency over existing approaches, offering a lightweight plug-and-play solution to realize visual creativity.
[183] Multi-View Crowd Counting With Self-Supervised Learning
Hong Mo,Xiong Zhang,Tengfei Shi,Zhongbo Wu
Main category: cs.CV
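TL;DR: Proposes SSLCounter, a self-supervised learning framework for multi-view counting (MVC) that uses neural volumetric rendering to reduce reliance on large-scale annotated data, showing strong data efficiency and performance on multiple benchmarks.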
Details
Motivation: Most existing MVC methods rely on fully supervised learning and need large amounts of annotated data, which limits their use in real-world scenarios. A method that reduces this dependence on annotation is needed.
Method: The paper proposes SSLCounter, which uses neural volumetric rendering to learn an implicit representation of the scene, reconstructs the geometry and appearance of 2D projections through differentiable rendering, and can be integrated seamlessly into existing frameworks.
Result: Experiments show that SSLCounter reaches state-of-the-art results on multiple MVC benchmarks and remains competitive when trained on only 70% of the training data.
Conclusion: SSLCounter reduces the dependence on annotated data, is general and data-efficient, and provides a new self-supervised solution for MVC.
Abstract: Multi-view counting (MVC) methods have attracted significant research attention and stimulated remarkable progress in recent years. Despite their success, most MVC methods have focused on improving performance by following the fully supervised learning (FSL) paradigm, which often requires large amounts of annotated data. In this work, we propose SSLCounter, a novel self-supervised learning (SSL) framework for MVC that leverages neural volumetric rendering to alleviate the reliance on large-scale annotated datasets. SSLCounter learns an implicit representation w.r.t. the scene, enabling the reconstruction of continuous geometry shape and the complex, view-dependent appearance of their 2D projections via differential neural rendering. Owing to its inherent flexibility, the key idea of our method can be seamlessly integrated into existing frameworks. Notably, extensive experiments show that SSLCounter not only achieves state-of-the-art performance but also delivers competitive performance using only 70% of the training data, showcasing its superior data efficiency across multiple MVC benchmarks.
[184] Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding
Vahid Mirjalili,Ramin Giahi,Sriram Kollipara,Akshay Kekuda,Kehui Yao,Kai Zhao,Jianpeng Xu,Kaushiki Nag,Sinduja Subramaniam,Topojoy Biswas,Evren Korpeoglu,Kannan Achan
Main category: cs.CV
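TL;DR: This paper presents a systematic benchmark for object-centric spatial reasoning in vision foundation models; using a synthetic dataset, it evaluates a range of state-of-the-art models on spatial localization, spatial reasoning, and downstream retrieval, revealing a trade-off between precise localization and relational reasoning.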
Details
Motivation: Existing benchmarks for vision models focus mainly on localization accuracy and neglect understanding of the spatial relationships between objects in a scene, which limits genuine scene understanding. A benchmark dedicated to spatial reasoning is therefore needed.
Method: The authors build a controlled synthetic dataset and systematically evaluate vision models such as GroundingDINO, Florence-2, and OWLv2, and large vision-language models such as InternVL, LLaVA, and GPT-4o, on three tasks: spatial localization, spatial reasoning, and downstream retrieval.
Result: Detectors such as GroundingDINO and OWLv2 produce precise bounding boxes but have limited relational reasoning, while vision-language models such as SmolVLM and GPT-4o generate fluent descriptions but struggle to capture fine-grained spatial context, showing a stable trade-off between localization and spatial understanding.
Conclusion: Current vision foundation models still have significant gaps in spatial understanding. The study underlines the importance of developing models with stronger spatial awareness and provides an evaluation benchmark for future research.
Abstract: Spatial understanding is a critical capability for vision foundation models. While recent advances in large vision models or vision-language models (VLMs) have expanded recognition capabilities, most benchmarks emphasize localization accuracy rather than whether models capture how objects are arranged and related within a scene. This gap is consequential; effective scene understanding requires not only identifying objects, but reasoning about their relative positions, groupings, and depth. In this paper, we present a systematic benchmark for object-centric spatial reasoning in foundation models. Using a controlled synthetic dataset, we evaluate state-of-the-art vision models (e.g., GroundingDINO, Florence-2, OWLv2) and large VLMs (e.g., InternVL, LLaVA, GPT-4o) across three tasks: spatial localization, spatial reasoning, and downstream retrieval tasks. We find a stable trade-off: detectors such as GroundingDINO and OWLv2 deliver precise boxes with limited relational reasoning, while VLMs like SmolVLM and GPT-4o provide coarse layout cues and fluent captions but struggle with fine-grained spatial context. Our study highlights the gap between localization and true spatial understanding, and points toward the need for spatially-aware foundation models in the community.
[185] PANICL: Mitigating Over-Reliance on Single Prompt in Visual In-Context Learning
Jiahao Zhang,Bowen Wang,Hong Liu,Yuta Nakashima,Hajime Nagahara
Main category: cs.CV
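TL;DR: This paper proposes PANICL, a training-free visual in-context learning framework that uses a patch-based k-nearest-neighbor approach over multiple in-context examples to mitigate over-reliance on a single example, yielding stable and robust gains across a variety of vision tasks.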
Details
Motivation: Existing visual in-context learning (VICL) methods tend to over-rely on a single in-context example, which leads to biased and unstable predictions. A method that makes effective use of multiple in-context examples is needed to reduce this bias.
Method: The paper proposes PAtch-based k-Nearest neighbor visual In-Context Learning (PANICL), which smooths matching scores across multiple in-context examples and uses patch-level similarity for more robust inference. It can be plugged into existing VICL models without additional training.
Result: PANICL outperforms strong baselines on foreground segmentation, single object detection, colorization, multi-object segmentation, and keypoint detection, is robust to both dataset-level and label-space domain shifts, and generalizes to other VICL models such as SegGPT, Painter, and LVM.
Conclusion: PANICL is a general, training-free, and robust visual in-context learning framework that mitigates dependence on a single example and improves stability and generalization across tasks and domains.
Abstract: Visual In-Context Learning (VICL) uses input-output image pairs, referred to as in-context pairs (or examples), as prompts alongside query images to guide models in performing diverse vision tasks. However, VICL often suffers from over-reliance on a single in-context pair, which can lead to biased and unstable predictions. We introduce PAtch-based $k$-Nearest neighbor visual In-Context Learning (PANICL), a general training-free framework that mitigates this issue by leveraging multiple in-context pairs. PANICL smooths assignment scores across pairs, reducing bias without requiring additional training. Extensive experiments on a variety of tasks, including foreground segmentation, single object detection, colorization, multi-object segmentation, and keypoint detection, demonstrate consistent improvements over strong baselines. Moreover, PANICL exhibits strong robustness to domain shifts, including dataset-level shift (e.g., from COCO to Pascal) and label-space shift (e.g., FSS-1000), and generalizes well to other VICL models such as SegGPT, Painter, and LVM, highlighting its versatility and broad applicability.
[186] SingRef6D: Monocular Novel Object Pose Estimation with a Single RGB Reference
Jiahui Wang,Haiyue Zhu,Haoren Guo,Abdullah Al Mamun,Cheng Xiang,Tong Heng Lee
Main category: cs.CV
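TL;DR: Proposes SingRef6D, a lightweight 6D pose estimation method that needs only a single RGB image as reference, requires no depth sensor or multi-view images, and remains robust in resource-limited settings.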
Details
Motivation: Existing 6D pose estimation methods either depend on depth sensors or perform poorly in texture-less and low-light scenes, and they lack effective handling of transparent or highly reflective surfaces.
Method: 1) A token-scaler fine-tuning mechanism and a new optimization loss on top of Depth-Anything v2 improve depth prediction accuracy; 2) a depth-aware matching mechanism is introduced into LoFTR to incorporate spatial relationships.
Result: Depth prediction on REAL275 improves by 14.41% in $\delta_{1.05}$; pose estimation average recall on REAL275, ClearPose, and Toyota-Light improves by 6.1%, surpassing current state-of-the-art methods.
Conclusion: SingRef6D achieves more robust and accurate 6D pose estimation without depth input or multiple views, and is particularly suited to challenging surfaces and lighting conditions.
Abstract: Recent 6D pose estimation methods demonstrate notable performance but still face some practical limitations. For instance, many of them rely heavily on sensor depth, which may fail with challenging surface conditions, such as transparent or highly reflective materials. Meanwhile, RGB-based solutions provide less robust matching performance in low-light and texture-less scenes due to the lack of geometry information. Motivated by these limitations, we propose SingRef6D, a lightweight pipeline requiring only a single RGB image as a reference, eliminating the need for costly depth sensors, multi-view image acquisition, or training view synthesis models and neural fields. This enables SingRef6D to remain robust and capable even under resource-limited settings where depth or dense templates are unavailable. Our framework incorporates two key innovations. First, we propose a token-scaler-based fine-tuning mechanism with a novel optimization loss on top of Depth-Anything v2 to enhance its ability to predict accurate depth, even for challenging surfaces. Our results show a 14.41% improvement (in $\delta_{1.05}$) on REAL275 depth prediction compared to Depth-Anything v2 (with fine-tuned head). Second, benefiting from depth availability, we introduce a depth-aware matching process that effectively integrates spatial relationships within LoFTR, enabling our system to handle matching for challenging materials and lighting conditions.
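The SingRef6D abstract reports depth accuracy in $\delta_{1.05}$. In the monocular depth literature this threshold accuracy is conventionally the fraction of valid pixels whose predicted-to-true depth ratio, taken in whichever direction is larger, stays below the threshold. The sketch below follows that conventional definition; the paper's exact evaluation protocol (masking, scale alignment) is not given in the abstract, so the validity mask and the sample values are assumptions.

```python
import numpy as np

def delta_accuracy(pred, gt, threshold=1.05):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    Pixels with non-positive ground truth or prediction are ignored
    (an assumption; evaluation protocols differ on masking).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = (gt > 0) & (pred > 0)
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float((ratio < threshold).mean())

# Illustrative depths in metres; the last pixel has no ground truth and is skipped.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([1.02, 2.2, 3.9, 1.0])

score = delta_accuracy(pred, gt)  # ratios 1.02, 1.10, ~1.026 -> 2 of 3 pass
```

A threshold of 1.05 is much stricter than the commonly reported 1.25, so it rewards predictions that are within about 5% of the true depth.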
Evaluations of pose estimation on the REAL275, ClearPose, and Toyota-Light datasets show that our approach surpasses state-of-the-art methods, achieving a 6.1% improvement in average recall.
[187] DynaNav: Dynamic Feature and Layer Selection for Efficient Visual Navigation
Jiahui Wang,Changhao Chen
Main category: cs.CV
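TL;DR: Proposes DynaNav, a dynamic visual navigation framework that improves efficiency and interpretability through adaptive feature and layer selection, lowering computational cost while improving navigation performance.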
Details
Motivation: Existing foundation models built on transformer decoders incur high computational overhead and offer poor interpretability in visual navigation, making them hard to deploy in resource-constrained scenarios.
Method: The paper designs a dynamic feature selection mechanism and an early-exit mechanism, with Bayesian Optimization determining the optimal exit thresholds, to enable sparse operations and reduce computation.
Result: Validated on real-world datasets and in simulated environments: compared with ViNT, DynaNav achieves a 2.26x reduction in FLOPs, 42.3% lower inference time, and 32.8% lower memory usage, while improving navigation performance on four public datasets.
Conclusion: DynaNav significantly reduces computational resource consumption while maintaining and improving navigation performance, making it suitable for resource-constrained visual navigation.
Abstract: Visual navigation is essential for robotics and embodied AI. However, existing foundation models, particularly those with transformer decoders, suffer from high computational overhead and lack interpretability, limiting their deployment in resource-tight scenarios. To address this, we propose DynaNav, a Dynamic Visual Navigation framework that adapts feature and layer selection based on scene complexity. It employs a trainable hard feature selector for sparse operations, enhancing efficiency and interpretability. Additionally, we integrate feature selection into an early-exit mechanism, with Bayesian Optimization determining optimal exit thresholds to reduce computational cost. Extensive experiments in real-world-based datasets and simulated environments demonstrate the effectiveness of DynaNav. Compared to ViNT, DynaNav achieves a 2.26x reduction in FLOPs, 42.3% lower inference time, and 32.8% lower memory usage, while improving navigation performance across four public datasets.
[188] SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet
Woosung Joung,Daewon Chae,Jinkyu Kim
Main category: cs.CV
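TL;DR: Proposes SemanticControl, a training-free method that uses semantically relevant but misaligned visual conditions to improve spatial control in text-to-image generation.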
Details
Motivation: Existing ControlNet models perform poorly when the visual condition is not precisely aligned with the prompt, and are especially limited by the lack of suitable visual conditions when generating uncommon or imaginative scenes.
Method: An auxiliary denoising process is run with a surrogate prompt aligned with the visual condition to produce attention masks; during denoising of the target prompt, these masks are used to adaptively suppress conflicting regions and strengthen text guidance.
Result: Under several kinds of misaligned conditions, including depth maps, edge maps, and human skeletons, SemanticControl outperforms existing baselines and improves text fidelity and image quality.
Conclusion: SemanticControl makes effective use of semantically relevant but misaligned visual cues and improves spatial control in text-to-image generation without any training.
Abstract: ControlNet has enabled detailed spatial control in text-to-image diffusion models by incorporating additional visual conditions such as depth or edge maps. However, its effectiveness heavily depends on the availability of visual conditions that are precisely aligned with the generation goal specified by the text prompt, a requirement that often fails in practice, especially for uncommon or imaginative scenes. For example, generating an image of a cat cooking in a specific pose may be infeasible due to the lack of suitable visual conditions. In contrast, structurally similar cues can often be found in more common settings; for instance, poses of humans cooking are widely available and can serve as rough visual guides. Unfortunately, existing ControlNet models struggle to use such loosely aligned visual conditions, often resulting in low text fidelity or visual artifacts. To address this limitation, we propose SemanticControl, a training-free method for effectively leveraging misaligned but semantically relevant visual conditions. Our approach adaptively suppresses the influence of the visual condition where it conflicts with the prompt, while strengthening guidance from the text. The key idea is to first run an auxiliary denoising process using a surrogate prompt aligned with the visual condition (e.g., "a human playing guitar" for a human pose condition) to extract informative attention masks, and then utilize these masks during the denoising of the actual target prompt (e.g., "a cat playing guitar").
Experimental results demonstrate that our method improves performance under loosely aligned conditions across various condition types, including depth maps, edge maps, and human skeletons, outperforming existing baselines. Our code is available at https://mung3477.github.io/semantic-control.
[189] Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach
Daiqing Wu,Dongbao Yang,Sicheng Zhao,Can Ma,Yu Zhou
Main category: cs.CV
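TL;DR: This paper proposes a new Emotion Statement Judgment task and an automated pipeline for constructing emotion-centric statements to evaluate more effectively how well multimodal large language models (MLLMs) perceive emotion in images, and shows that current MLLMs still lag well behind humans.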
Details
Motivation: Existing evaluation methods overlook plausible responses, use limited emotional taxonomies, neglect contextual factors, and require labor-intensive annotation, which leads to inconsistent assessments of MLLMs' emotional understanding. A more reliable and efficient evaluation approach is needed.
Method: The paper proposes the Emotion Statement Judgment task and designs an automated pipeline that generates emotion-related statements, then systematically evaluates mainstream MLLMs on emotion interpretation, context-based emotion judgment, and understanding of perception subjectivity.
Result: Current MLLMs perform relatively well on emotion interpretation and context-based judgment but remain weak at understanding perception subjectivity; even top models such as GPT4o show a clear performance gap compared with humans.
Conclusion: By building a fundamental evaluation framework and assessing MLLMs comprehensively, the study advances emotional intelligence in multimodal large language models and identifies key directions for improvement.
Abstract: Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: https://github.com/wdqqdw/MVEI.
[190] MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning
Tao Wu,Yibo Jiang,Yehao Lu,Zhizhong Wang,Zeyi Huang,Zequn Qin,Xi Li
Main category: cs.CV
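TL;DR: This paper proposes MultiCrafter, a framework that addresses attribute leakage and human-preference alignment in multi-subject image generation, achieving high-fidelity, preference-aligned generation through explicit positional supervision, a Mixture-of-Experts architecture, and online reinforcement learning.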
Details
Motivation: Existing methods based on in-context learning rely on simple reconstruction objectives, which causes severe attribute leakage between subjects and makes it hard to satisfy human aesthetic preferences. A generation method that ensures both subject fidelity and preference alignment is needed.
Method: The paper introduces explicit positional supervision to separate the attention regions of different subjects, adopts a Mixture-of-Experts (MoE) architecture to improve attention planning across diverse scenarios, and designs a new online reinforcement learning framework that combines a scoring mechanism with a stable training strategy to align with human preferences.
Result: Experiments show that the framework significantly improves subject fidelity in multi-subject image generation and aligns with human preferences better than existing methods.
Conclusion: By disentangling attention, increasing model capacity, and introducing preference-driven reinforcement learning, MultiCrafter addresses key challenges in multi-subject image generation and produces high-quality results that match human aesthetics.
Abstract: Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. However, existing methods, particularly those built on the In-Context-Learning paradigm, are limited by their reliance on simple reconstruction-based objectives, leading both to severe attribute leakage that compromises subject fidelity and to a failure to align with nuanced human preferences. To address this, we propose MultiCrafter, a framework that ensures high-fidelity, preference-aligned generation. First, we find that the root cause of attribute leakage is a significant entanglement of attention between different subjects during the generation process. Therefore, we introduce explicit positional supervision to separate attention regions for each subject, effectively mitigating attribute leakage. To enable the model to accurately plan the attention region of different subjects in diverse scenarios, we employ a Mixture-of-Experts architecture to enhance the model's capacity, allowing different experts to focus on different scenarios. Finally, we design a novel online reinforcement learning framework to align the model with human preferences, featuring a scoring mechanism to accurately assess multi-subject fidelity and a more stable training strategy tailored for the MoE architecture. Experiments validate that our framework significantly improves subject fidelity while aligning better with human preferences.
[191] PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data
Zhe Zhu,Le Wan,Rui Xu,Yiheng Zhang,Honghua Chen,Zhiyang Dou,Cheng Lin,Yuan Liu,Mingqiang Wei
Main category: cs.CV
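TL;DR: PartSAM is the first promptable part segmentation model trained directly on large-scale 3D data. With a triplane-based dual-branch encoder and a model-in-the-loop annotation pipeline, it achieves accurate, open-world part segmentation of both the surface and the internal structures of 3D objects.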
Details
Motivation: 现有方法依赖2D基础模型间接迁移监督,难以捕捉3D几何本质,导致分割局限于表面、分解不可控且泛化能力弱。 Method: 提出PartSAM,采用基于三平面的双分支编码器生成空间结构化token,并构建模型闭环流水线从在线资源中自动标注超五百万3D形状-部件数据用于训练。 Result: PartSAM在多种基准上显著超越现有最先进方法,支持单提示精确分割及全自动全部件分解,能同时理解表面与内部结构。 Conclusion: PartSAM通过原生3D训练范式和大规模高质量数据,实现了面向开放世界的3D部件理解,推动了3D视觉基础模型的发展。 Abstract: Segmenting 3D objects into parts is a long-standing challenge in computer vision. To overcome taxonomy constraints and generalize to unseen 3D objects, recent works turn to open-world part segmentation. These approaches typically transfer supervision from 2D foundation models, such as SAM, by lifting multi-view masks into 3D. However, this indirect paradigm fails to capture intrinsic geometry, leading to surface-only understanding, uncontrolled decomposition, and limited generalization. We present PartSAM, the first promptable part segmentation model trained natively on large-scale 3D data. Following the design philosophy of SAM, PartSAM employs an encoder-decoder architecture in which a triplane-based dual-branch encoder produces spatially structured tokens for scalable part-aware representation learning. To enable large-scale supervision, we further introduce a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs from online assets, providing diverse and fine-grained labels. This combination of scalable architecture and diverse 3D data yields emergent open-world capabilities: with a single prompt, PartSAM achieves highly accurate part identification, and in a Segment-Every-Part mode, it automatically decomposes shapes into both surface and internal structures. Extensive experiments show that PartSAM outperforms state-of-the-art methods by large margins across multiple benchmarks, marking a decisive step toward foundation models for 3D part understanding. Our code and model will be released soon.[192] No-Reference Image Contrast Assessment with Customized EfficientNet-B0
Javad Hassannataj Joloudari,Bita Mesbahzadeh,Omid Zare,Emrah Arslan,Roohallah Alizadehsani,Hossein Moosaei
Main category: cs.CV
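TL;DR: This paper proposes a deep-learning framework for no-reference image contrast quality assessment. It customizes and fine-tunes three pre-trained models (EfficientNet-B0, ResNet18, and MobileNetV2) with a contrast-aware regression head and outperforms traditional methods on the CID2013 and CCID2014 datasets, with EfficientNet-B0 achieving state-of-the-art results.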
Details
Motivation: Existing no-reference image quality assessment models struggle to evaluate contrast distortions accurately in complex real-world conditions, so a method that effectively captures perceptual contrast changes is needed.
Method: Uses EfficientNet-B0, ResNet18, MobileNetV2, and a Siamese network, adds a contrast-aware regression head, and trains end to end with targeted data augmentation on the CID2013 and CCID2014 datasets to predict perceptual MOS scores.
Result: EfficientNet-B0 achieves PLCC = 0.9286 and SRCC = 0.9178 on CCID2014, and PLCC = 0.9581 and SRCC = 0.9369 on CID2013, clearly outperforming traditional methods and other deep models.
Conclusion: Lightweight pre-trained networks with contrast-aware adaptation can be used effectively for no-reference contrast quality assessment, offering high performance and scalability for real-time and resource-constrained settings.
Abstract: Image contrast is a fundamental factor in visual perception and plays a vital role in overall image quality. However, most no-reference image quality assessment (NR-IQA) models struggle to accurately evaluate contrast distortions under diverse real-world conditions. In this study, we propose a deep-learning-based framework for blind contrast quality assessment by customizing and fine-tuning three pre-trained architectures, EfficientNet-B0, ResNet18, and MobileNetV2, to predict perceptual Mean Opinion Scores (MOS), along with an additional model built on a Siamese network, which shows a limited ability to capture perceptual contrast distortions. Each model is modified with a contrast-aware regression head and trained end to end using targeted data augmentations on two benchmark datasets, CID2013 and CCID2014, containing synthetic and authentic contrast distortions. Performance is evaluated using the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank Order Correlation Coefficient (SRCC), which assess the alignment between predicted and human-rated scores. Among these three models, our customized EfficientNet-B0 model achieves state-of-the-art performance with PLCC = 0.9286 and SRCC = 0.9178 on CCID2014 and PLCC = 0.9581 and SRCC = 0.9369 on CID2013, surpassing traditional methods and outperforming other deep baselines. These results highlight the model's robustness and effectiveness in capturing perceptual contrast distortion.
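As a rough illustration of the two evaluation metrics named in this abstract, the sketch below computes PLCC and SRCC in plain Python. The function names and the sample scores are hypothetical and are not taken from the paper; SRCC is computed here as the Pearson correlation of average ranks, which is one common way to handle ties.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted scores and human mean opinion scores (illustration only).
predicted = [2.1, 3.4, 1.2, 4.8, 3.9]
human_mos = [2.0, 3.0, 1.5, 5.0, 4.1]
plcc = pearson(predicted, human_mos)
srcc = spearman(predicted, human_mos)
```

PLCC rewards predictions that are linearly related to the human scores, while SRCC depends only on whether the ordering of images agrees, which is why the two are usually reported together.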
Overall, the proposed method demonstrates that contrast-aware adaptation of lightweight pre-trained networks can yield a high-performing, scalable solution for no-reference contrast quality assessment suitable for real-time and resource-constrained applications.
[193] Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
Zilun Zhang,Zian Guan,Tiancheng Zhao,Haozhan Shen,Tianyu Li,Yuxiang Cai,Zhonggen Su,Zhaojun Liu,Jianwei Yin,Xiang Li
Main category: cs.CV
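TL;DR: Proposes Geo-R1, a reasoning-centric reinforcement fine-tuning method for few-shot geospatial referring expression understanding, which uses a "reason first, then act" mechanism to improve generalization and interpretability when data is scarce.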
Details
Motivation: 遥感中的指代表达理解需要复杂对象-上下文关系推理,现有监督微调方法在标注数据稀缺时泛化性能差。 Method: 提出Geo-R1,采用推理中心型强化微调(RFT)范式,强制模型先生成显式的推理链分解指代表达,再基于推理结果定位目标对象。 Result: 在三个少样本基准上显著优于SFT基线,具备强跨数据集泛化能力。 Conclusion: Geo-R1通过引入显式推理过程,在数据稀缺场景下有效提升了指代表达理解的性能、泛化性和可解释性。 Abstract: Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object-context relationships. While supervised fine-tuning (SFT) on multimodal large language models achieves strong performance with massive labeled datasets, they struggle in data-scarce scenarios, leading to poor generalization. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 enforces the model to first generate explicit, interpretable reasoning chains that decompose referring expressions, and then leverage these rationales to localize target objects. This "reason first, then act" process enables the model to make more effective use of limited annotations, enhances generalization, and provides interpretability. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness. Code and data will be released at http://geo-r1.github.io.[194] Benchmarking and Mitigate Psychological Sycophancy in Medical Vision-Language Models
Zikun Guo,Xinyue Xu,Pei Xiang,Shu Yang,Xin Han,Di Wang,Lijie Hu
Main category: cs.CV
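TL;DR: This study proposes a new clinical benchmark that evaluates sycophancy in vision-language models on medical visual question answering, and constructs a medical sycophancy dataset covering multiple organ systems and modalities. It finds that existing VLMs are easily swayed by social pressures such as imitation and expert correction, showing biases unrelated to the evidence. To address this, the authors propose VIPER, which filters non-evidentiary information and generates evidence-based responses, effectively reducing sycophancy and improving reliability in real clinical settings.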
Details
Motivation: Vision-language models are increasingly used in clinical applications, but they tend to defer to user phrasing, social cues, or authoritative opinions rather than reason from evidence, which may threaten medical safety. Such sycophancy therefore needs to be evaluated and mitigated.
Method: Builds a medical sycophancy dataset from PathVQA, SLAKE, and VQA-RAD, stratified by organ system and modality, and designs psychologically motivated pressure templates for adversarial experiments. Proposes the VIPER framework, which purifies visual information, filters non-evidentiary content, and generates evidence-first answers.
Result: Experiments show that mainstream VLMs are broadly sycophantic and that this tendency correlates only weakly with model accuracy or size; imitation and expert correction trigger sycophancy most readily. VIPER significantly reduces the average sycophancy rate, outperforming baselines while maintaining interpretability.
Conclusion: Medical VLMs have a social bias mechanism that is independent of visual evidence and call for evidence-anchored defenses. VIPER and the proposed benchmark offer a viable path toward more robust, trustworthy clinical deployment of VLMs.
Abstract: Vision-language models (VLMs) are increasingly integrated into clinical workflows, but they often exhibit sycophantic behavior, prioritizing alignment with user phrasing, social cues, or perceived authority over evidence-based reasoning. This study evaluates clinical sycophancy in medical visual question answering through a novel, clinically grounded benchmark. We propose a medical sycophancy dataset constructed from PathVQA, SLAKE, and VQA-RAD, stratified by different types of organ system and modality, and use psychologically motivated pressure templates covering various forms of sycophancy. In our adversarial experiments on various VLMs, we find that these models are generally vulnerable, exhibiting significant variation in the occurrence of adversarial responses, with weak correlations to model accuracy or size. Imitation and expert-provided corrections are the most effective triggers, suggesting that the models possess a bias mechanism independent of visual evidence. To address this, we propose Visual Information Purification for Evidence-based Response (VIPER), a lightweight mitigation strategy that filters non-evidentiary content (for example, social pressures) and then generates constrained, evidence-first answers. This framework reduces sycophancy on average, outperforming baselines while maintaining interpretability.
Our benchmark, analysis, and mitigation framework lay the groundwork for robust deployment of medical VLMs in real-world clinician interactions, emphasizing the need for evidence-anchored defenses.
[195] Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm
Zeyu Wang,Baiyu Chen,Kun Yan,Hongjing Piao,Hao Xue,Flora D. Salim,Yuanchun Shi,Yuntao Wang
Main category: cs.CV
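TL;DR: This paper proposes GLARIFY, a new method that uses spatiotemporal gaze information to improve vision-language models in real-world applications. It integrates gaze data through a heatmap module and significantly outperforms baselines while preserving pretrained knowledge.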
Details
Motivation: 智能眼镜普及使得用户注意力被整合进视觉-语言模型,但注视数据可能引入歧义问题,包括语言表达的上下文缺失和注视模式的噪声与复杂时空关系,现有工作仅使用单张图像输入,无法捕捉注意力的动态特性。 Method: 首先分析带注视模态的查询样本以揭示注视模式的噪声特性;然后利用GPT-4o构建自动数据合成管道生成包含思维链(CoT)的GLARIFY-Ambi数据集;最后设计热图模块将注视信息融入先进的视觉-语言模型中。 Result: 在保留模型预训练知识的基础上,GLARIFY在独立测试集上显著优于基线方法,验证了其在对齐人类注意力与模型理解方面的有效性。 Conclusion: GLARIFY通过鲁棒地融合时空注视信息,提升了视觉-语言模型在真实场景中的交互性能,为可视化助手提供了一种可用且直观的交互范式。 Abstract: With the rise in popularity of smart glasses, users' attention has been integrated into Vision-Language Models (VLMs) to streamline multi-modal querying in daily scenarios. However, leveraging gaze data to model users' attention may introduce ambiguity challenges: (1) users' verbal questions become ambiguous by using pronouns or skipping context, (2) humans' gaze patterns can be noisy and exhibit complex spatiotemporal relationships with their spoken questions. Previous works only consider single image as visual modality input, failing to capture the dynamic nature of the user's attention. In this work, we introduce GLARIFY, a novel method to leverage spatiotemporal gaze information to enhance the model's effectiveness in real-world applications. Initially, we analyzed hundreds of querying samples with the gaze modality to demonstrate the noisy nature of users' gaze patterns. We then utilized GPT-4o to design an automatic data synthesis pipeline to generate the GLARIFY-Ambi dataset, which includes a dedicated chain-of-thought (CoT) process to handle noisy gaze patterns. Finally, we designed a heatmap module to incorporate gaze information into cutting-edge VLMs while preserving their pretrained knowledge. We evaluated GLARIFY using a hold-out test set. Experiments demonstrate that GLARIFY significantly outperforms baselines. By robustly aligning VLMs with human attention, GLARIFY paves the way for a usable and intuitive interaction paradigm with a visual assistant.[196] From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs
Yingjie Zhu,Xuefeng Bai,Kehai Chen,Yang Xiang,Weili Guan,Jun Yu,Min Zhang
Main category: cs.CV
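TL;DR: This paper systematically studies the bias of large vision-language models (LVLMs) under changes in spatial position, finding that existing models produce inconsistent outputs for the same visual content at different positions because of the unbalanced design of position embeddings in the language model. The problem arises because position embedding strategies such as RoPE cause image tokens to exert unequal influence during cross-modal interaction. To address this, the authors propose BaPA, which assigns the same position embedding to all image tokens to improve spatial robustness. Experiments show that BaPA improves model stability without retraining and, after lightweight fine-tuning, improves performance on several multimodal tasks, while promoting more balanced attention allocation and more holistic visual understanding.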
Details
Motivation: 尽管大视觉语言模型在多种多模态任务中表现优异,但其对空间位置变化的鲁棒性尚未被充分理解。当关键视觉信息位于图像不同位置时,模型应保持语义一致性。然而,实际中模型输出可能随位置变化而不一致,暴露出空间-语义理解的缺陷,因此亟需探究其根源并加以改进。 Method: 通过构建精心设计的探测数据集,评估LVLMs在关键视觉信息位于不同空间位置时的输出一致性;分析发现位置嵌入不平衡是根本原因,尤其来自语言模型中的RoPE机制;为此提出BaPA方法,即为所有图像令牌分配相同的位置嵌入,以实现更均衡的跨模态融合。 Result: 实验证明,当前LVLMs在空间位置变化下输出不一致;这种不一致性主要源自语言模型的位置嵌入设计而非视觉编码器;BaPA能有效缓解该问题,在无需重新训练的情况下提升模型的空间鲁棒性,并在轻量微调后显著提高多个多模态基准上的性能;信息流分析显示BaPA带来了更平衡的注意力分布。 Conclusion: 大视觉语言模型的空间敏感性主要源于语言模型中位置嵌入的不平衡设计,而非视觉编码能力不足;提出的BaPA方法通过统一图像令牌的位置表示,有效提升了模型的空间鲁棒性和语义一致性,为构建更具泛化能力的多模态模型提供了新思路。 Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we present a systematic study of the spatial bias of LVLMs, focusing on how models respond when identical key visual information is placed at different locations within an image. Through a carefully designed probing dataset, we demonstrate that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a fundamental limitation in their spatial-semantic understanding. Further analysis shows that this phenomenon originates not from the vision encoder, which reliably perceives and interprets visual content across positions, but from the unbalanced design of position embeddings in the language model component. In particular, the widely adopted position embedding strategies, such as RoPE, introduce imbalance during cross-modal interaction, leading image tokens at different positions to exert unequal influence on semantic understanding. To mitigate this issue, we introduce Balanced Position Assignment (BaPA), a simple yet effective mechanism that assigns identical position embeddings to all image tokens, promoting a more balanced integration of visual information. 
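The idea of giving all image tokens one position can be sketched at the level of position ids. The following is a minimal illustration and not the authors' implementation: the function names are hypothetical, a single contiguous block of image tokens is assumed, and the choice of which shared id to use (here, the position of the first image token) is an assumption that the abstract does not specify.

```python
def sequential_position_ids(is_image):
    """Standard assignment: every token gets its own increasing position id."""
    return list(range(len(is_image)))


def balanced_position_ids(is_image):
    """Give every image token one shared position id.

    Assumptions (not stated in the abstract): there is a single contiguous
    block of image tokens, the shared id is the position of the first image
    token, and text tokens after the image continue from the next id.
    """
    ids = []
    next_id = 0
    shared_id = None
    for flag in is_image:
        if flag:
            if shared_id is None:
                # First image token fixes the id that all image tokens reuse.
                shared_id = next_id
                next_id += 1
            ids.append(shared_id)
        else:
            ids.append(next_id)
            next_id += 1
    return ids


# Two text tokens, three image tokens, two text tokens.
mask = [False, False, True, True, True, False, False]
standard = sequential_position_ids(mask)
balanced = balanced_position_ids(mask)
```

With a relative scheme such as RoPE, identical position ids mean every image token sits at the same relative distance from a given text token, which is the balancing effect the abstract describes.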
Extensive experiments show that BaPA enhances the spatial robustness of LVLMs without retraining and further boosts their performance across diverse multimodal benchmarks when combined with lightweight fine-tuning. Further analysis of information flow reveals that BaPA yields balanced attention, enabling more holistic visual understanding.
[197] Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation
Abdelrahman Eldesokey,Aleksandar Cvejic,Bernard Ghanem,Peter Wonka
Main category: cs.CV
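TL;DR: Proposes a new method for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence analogous to semantic correspondence, and introduces a new metric, VSM, for quantifying and localizing visual inconsistencies in subject-driven image generation.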
Details
Motivation: 扩散模型主干虽包含丰富的语义特征,但其图像合成能力也依赖于视觉特征;然而由于缺乏标注数据集,难以分离这两种特征。 Method: 构建一个自动化管道,基于现有主体驱动图像生成数据集生成具有语义和视觉对应标注的图像对,并设计对比架构来分离视觉与语义特征。 Result: 提出的Visual Semantic Matching (VSM) 指标在量化视觉不一致性方面优于CLIP、DINO和视觉-语言模型等全局特征指标,并能实现不一致区域的空间定位。 Conclusion: 这是首个支持主体驱动生成中不一致性量化与定位的方法,为推进该任务提供了有力工具。 Abstract: We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets. To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and design a contrastive architecture to separate the two feature types. Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision--language models in quantifying visual inconsistencies while also enabling spatial localization of inconsistent regions. To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task. Project Page:https://abdo-eldesokey.github.io/mind-the-glitch/[198] WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
Changli Tang,Qinfan Xiao,Ke Mei,Tianyi Wang,Fengyun Rao,Chao Zhang
Main category: cs.CV
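TL;DR: WAVE is the first LLM-based method for unified text, audio, and video embeddings. Through hierarchical feature fusion and joint multi-task training, it enables any-to-any cross-modal retrieval and instruction-aware embeddings, reaching state-of-the-art performance on several benchmarks.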
Details
Motivation: Although embeddings from multimodal large language models perform well as general-purpose representations, their application to dynamic modalities such as audio and video remains underexplored, so a unified and flexible cross-modal representation method is needed.
Method: Proposes WAVE, which uses a hierarchical feature fusion strategy and joint multi-modal, multi-task training to build a unified text-audio-video embedding space and supports generating instruction-aware embeddings from user prompts.
Result: Reaches state of the art on the MMEB-v2 video benchmark, performs strongly on audio and video-to-audio retrieval, and significantly outperforms existing embedding models on multimodal question answering; ablation studies confirm the effectiveness of joint training.
Conclusion: WAVE provides a unified and versatile embedding framework for text, audio, and video, advancing any-to-any cross-modal applications with broad application prospects.
Abstract: While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.
[199] ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models
Jewon Lee,Wooksu Shin,Seungmin Yang,Ki-Ung Song,DongUk Lim,Jaeyeon Kim,Tae-Ho Kim,Bo-Kyeong Kim
Main category: cs.CV
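TL;DR: This paper proposes ERGO, an efficient vision-language reasoning framework that uses a two-stage "coarse-to-fine" strategy to reduce computational overhead while still capturing key visual details accurately.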
Details
Motivation: Existing large vision-language models incur high computational overhead because they process many vision tokens, and after image downsampling they struggle to identify task-relevant regions accurately, especially on tasks that require visual reasoning.
Method: Proposes ERGO, which implements reasoning-driven perception within a reinforcement learning framework: the first stage reasons over a downsampled image to locate key regions; the second stage crops only those regions at full resolution for fine-grained reasoning, with reward components that account for perceptual uncertainty.
Result: Across multiple datasets, ERGO outperforms the original model and competing methods, surpassing Qwen2.5-VL-7B by 4.7 points on the V* benchmark while using only 23% of the vision tokens and achieving a 3x inference speedup.
Conclusion: With its coarse-to-fine, reasoning-guided perception design, ERGO significantly improves the efficiency and accuracy of vision-language models and offers an effective solution for high-resolution image processing in practical applications.
Abstract: Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage "coarse-to-fine" reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling, due to perception-driven reasoning, where clear visual information is required for effective reasoning. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation), which performs reasoning-driven perception, leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas for answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency.
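The geometric step of the coarse-to-fine pipeline, mapping a region found on the downsampled image back to full-resolution pixels, can be sketched as follows. This is an illustration under stated assumptions and not the ERGO code: the function name is hypothetical, and `margin` is a made-up parameter that stands in for the expansion of the crop under perceptual uncertainty mentioned in the abstract.

```python
import math


def coarse_box_to_full_res(box, coarse_size, full_size, margin=0.0):
    """Map (x0, y0, x1, y1) from the downsampled image to full resolution.

    `margin` is a hypothetical knob: the fraction of the box width/height
    added on each side before scaling. The result is clamped to the image.
    """
    x0, y0, x1, y1 = box
    coarse_w, coarse_h = coarse_size
    full_w, full_h = full_size
    scale_x = full_w / coarse_w
    scale_y = full_h / coarse_h
    pad_x = (x1 - x0) * margin
    pad_y = (y1 - y0) * margin
    # Round outward so the crop never cuts into the selected region.
    left = max(0, int((x0 - pad_x) * scale_x))
    top = max(0, int((y0 - pad_y) * scale_y))
    right = min(full_w, math.ceil((x1 + pad_x) * scale_x))
    bottom = min(full_h, math.ceil((y1 + pad_y) * scale_y))
    return left, top, right, bottom


# A region found on a 448x448 preview of a 1792x1792 image.
tight = coarse_box_to_full_res((100, 50, 200, 150), (448, 448), (1792, 1792))
loose = coarse_box_to_full_res((100, 50, 200, 150), (448, 448), (1792, 1792), margin=0.25)
```

Only the returned crop would then be passed to the second reasoning stage, so the number of vision tokens scales with the crop area instead of the full image area.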
For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.
[200] DualFocus: Depth from Focus with Spatio-Focal Dual Variational Constraints
Sungmin Woo,Sangyoun Lee
Main category: cs.CV
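TL;DR: Proposes DualFocus, a new Depth-from-Focus (DFF) framework that jointly models focus changes over the spatial and focal dimensions and exploits the distinctive gradient patterns that focus variation induces in the focal stack, achieving more accurate and robust depth estimation in complex scenes.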
Details
Motivation: Existing learning-based DFF methods perform poorly in complex scenes with fine textures or abrupt depth changes, where focus cues can become ambiguous or misleading, so a more robust method is needed to distinguish true depth edges from texture artifacts.
Method: Proposes the DualFocus framework, which introduces a variational formulation with dual constraints: spatial constraints use gradient pattern changes across focus levels to distinguish depth edges from texture artifacts, while focal constraints enforce unimodal, monotonic focus probabilities consistent with physical focus behavior.
Result: Experiments on four public datasets show that DualFocus consistently outperforms state-of-the-art methods in depth estimation accuracy and perceptual quality.
Conclusion: By introducing spatial and focal dual constraints designed for the DFF task, DualFocus improves the robustness and accuracy of depth estimation in complex scenes and offers a new direction for learning-based DFF methods.
Abstract: Depth-from-Focus (DFF) enables precise depth estimation by analyzing focus cues across a stack of images captured at varying focal lengths. While recent learning-based approaches have advanced this field, they often struggle in complex scenes with fine textures or abrupt depth changes, where focus cues may become ambiguous or misleading. We present DualFocus, a novel DFF framework that leverages the focal stack's unique gradient patterns induced by focus variation, jointly modeling focus changes over spatial and focal dimensions. Our approach introduces a variational formulation with dual constraints tailored to DFF: spatial constraints exploit gradient pattern changes across focus levels to distinguish true depth edges from texture artifacts, while focal constraints enforce unimodal, monotonic focus probabilities aligned with physical focus behavior. These inductive biases improve robustness and accuracy in challenging regions. Comprehensive experiments on four public datasets demonstrate that DualFocus consistently outperforms state-of-the-art methods in both depth accuracy and perceptual quality.
[201] Rate-Distortion Optimized Communication for Collaborative Perception
Genjia Liu,Anning Hu,Yue Hu,Wenjun Zhang,Siheng Chen
Main category: cs.CV
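TL;DR: This paper proposes RDcomm, a multi-agent collaborative perception framework grounded in rate-distortion theory, which uses task entropy discrete coding and mutual-information-driven message selection to greatly reduce communication overhead while maintaining performance.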
Details
Motivation: 现有协作感知方法缺乏理论基础,难以在通信效率与任务性能间取得理论上的平衡,因此需要建立理论模型指导通信策略设计。 Method: 基于信息论提出实用的率失真理论,指导设计满足提供相关任务信息和无冗余传输两个条件的通信策略;提出任务熵离散编码和基于互信息神经估计的消息选择机制。 Result: 在DAIR-V2X和OPV2V数据集上,RDcomm在3D目标检测和BEV分割任务中达到最先进精度,通信量最多减少108倍。 Conclusion: 所提出的率失真理论为多智能体协作提供了理论基础,RDcomm框架在保持高性能的同时显著降低了通信成本,验证了理论指导的有效性。 Abstract: Collaborative perception emphasizes enhancing environmental understanding by enabling multiple agents to share visual information with limited bandwidth resources. While prior work has explored the empirical trade-off between task performance and communication volume, a significant gap remains in the theoretical foundation. To fill this gap, we draw on information theory and introduce a pragmatic rate-distortion theory for multi-agent collaboration, specifically formulated to analyze performance-communication trade-off in goal-oriented multi-agent systems. This theory concretizes two key conditions for designing optimal communication strategies: supplying pragmatically relevant information and transmitting redundancy-less messages. Guided by these two conditions, we propose RDcomm, a communication-efficient collaborative perception framework that introduces two key innovations: i) task entropy discrete coding, which assigns features with task-relevant codeword-lengths to maximize the efficiency in supplying pragmatic information; ii) mutual-information-driven message selection, which utilizes mutual information neural estimation to approach the optimal redundancy-less condition. Experiments on 3D object detection and BEV segmentation demonstrate that RDcomm achieves state-of-the-art accuracy on DAIR-V2X and OPV2V, while reducing communication volume by up to 108 times. The code will be released.[202] FailureAtlas:Mapping the Failure Landscape of T2I Models via Active Exploration
Muxi Chen,Zhaohua Zhang,Chenchen Zhao,Mingyang Chen,Wenyu Jiang,Tianwen Jiang,Jianhuan Zhuo,Yu Tang,Qiuyong Xiao,Jihong Zhang,Qiang Xu
Main category: cs.CV
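TL;DR: This paper proposes FailureAtlas, the first framework for automatically exploring and mapping the failure landscape of text-to-image (T2I) models at scale. By actively searching for minimal failure-inducing concepts, it uncovers hundreds of thousands of previously unknown error slices and provides the first large-scale evidence linking these failures to training-data scarcity.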
Details
Motivation: 静态基准测试在诊断T2I模型系统性故障方面能力有限,难以发现根本原因,因此需要一种更具诊断性的主动探索方法。 Method: 将错误发现建模为对最小失败诱导概念的结构化搜索,提出FailureAtlas框架,并采用新颖的加速技术使其在计算上可行。 Result: 在Stable Diffusion模型上应用时,发现了超过247,000个先前未知的错误片段,并首次提供大规模证据表明这些失败与训练数据稀缺有关。 Conclusion: FailureAtlas建立了一种以诊断为核心的新型、可扩展的模型审计方法,为开发更鲁棒的生成式AI提供了指导。 Abstract: Static benchmarks have provided a valuable foundation for comparing Text-to-Image (T2I) models. However, their passive design offers limited diagnostic power, struggling to uncover the full landscape of systematic failures or isolate their root causes. We argue for a complementary paradigm: active exploration. We introduce FailureAtlas, the first framework designed to autonomously explore and map the vast failure landscape of T2I models at scale. FailureAtlas frames error discovery as a structured search for minimal, failure-inducing concepts. While it is a computationally explosive problem, we make it tractable with novel acceleration techniques. When applied to Stable Diffusion models, our method uncovers hundreds of thousands of previously unknown error slices (over 247,000 in SD1.5 alone) and provides the first large-scale evidence linking these failures to data scarcity in the training set. By providing a principled and scalable engine for deep model auditing, FailureAtlas establishes a new, diagnostic-first methodology to guide the development of more robust generative AI. The code is available at https://github.com/cure-lab/FailureAtlas[203] Exposing Hallucinations To Suppress Them: VLMs Representation Editing With Generative Anchors
Youxu Shi,Suorong Yang,Dong Liu
Main category: cs.CV
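TL;DR: Proposes a training-free, self-supervised method that mitigates hallucinations in multimodal large language models using a negative anchor generated by a text-to-image model and a positive anchor given by the original image.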
Details
Motivation: 多模态大语言模型(MLLMs)在视觉-语言任务中表现出色,但容易产生与视觉证据不一致的幻觉内容,现有方法依赖微调或先验知识且影响信息量和可扩展性。 Method: 引入幻觉放大机制:利用文本到图像模型将文本描述投影回视觉空间作为负锚,原始图像为正锚,通过调整解码器隐藏状态,使表示更贴近真实语义并远离幻觉方向。 Result: 在多个基准上显著减少了对象、属性和关系层面的幻觉,使用LLaVA-v1.5-7B在CHAIR上降低超5%幻觉率,同时保持描述丰富性和召回率,并在不同架构间具有良好泛化性。 Conclusion: 该方法无需人工先验或额外训练,具备高效、有效、即插即用的特点,对非幻觉文本几乎无副作用,具有实际应用价值。 Abstract: Multimodal large language models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet they remain highly susceptible to hallucinations, producing content that is fluent but inconsistent with visual evidence. Such hallucinations, spanning objects, attributes, and relations, persist even in larger models, while existing mitigation approaches often require additional finetuning, handcrafted priors, or trade-offs that compromise informativeness and scalability. To address this limitation, we propose a training-free, self-supervised method for hallucination mitigation. Our approach introduces a novel hallucination amplification mechanism: a caption is projected into the visual space via a text-to-image model to reveal implicit hallucination signals, serving as a negative anchor, while the original image provides a positive anchor. Leveraging these dual anchors, we edit decoder hidden states by pulling representations toward faithful semantics and pushing them away from hallucination directions. This correction requires no human priors or additional training costs, ensuring both effectiveness and efficiency. Extensive experiments across multiple benchmarks show that our method significantly reduces hallucinations at the object, attribute, and relation levels while largely preserving recall and caption richness, e.g., achieving a hallucination reduction by over 5% using LLaVA-v1.5-7B on CHAIR. Furthermore, results on diverse architectures, including LLaVA-NEXT-7B, Cambrian-8B, and InstructBLIP-7B, validate strong cross-architecture generalization. 
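The "pull toward the positive anchor, push away from the negative anchor" idea can be illustrated with a toy vector update. This is only a sketch of the general idea and not the paper's editing rule: the update formula, the step sizes `alpha` and `beta`, and the function names are all assumptions introduced here, and real hidden states would be high-dimensional tensors and not 2-element lists.

```python
import math


def edit_hidden_state(hidden, positive, negative, alpha=0.5, beta=0.5):
    """Toy dual-anchor edit (assumed form, not taken from the paper).

    Moves `hidden` a fraction `alpha` of the way toward the positive anchor
    and a fraction `beta` of the way away from the negative anchor.
    """
    return [
        h + alpha * (p - h) - beta * (n - h)
        for h, p, n in zip(hidden, positive, negative)
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 2-D example: the positive anchor stands for the original image,
# the negative anchor for the image regenerated from a hallucinated caption.
hidden = [1.0, 1.0]
positive_anchor = [2.0, 1.0]
negative_anchor = [0.0, 1.0]
edited = edit_hidden_state(hidden, positive_anchor, negative_anchor)
```

In this toy case the edited vector ends up closer to the positive anchor and farther from the negative one than the original; whether a given update has that property in general depends on the geometry of the anchors and the step sizes.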
More importantly, when applied to hallucination-free captions, our method introduces almost no side effects, underscoring its robustness and practical plug-and-play applicability. The implementation will be publicly available.
[204] CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
Xinyu Zhang,Yuxuan Dong,Lingling Zhang,Chengyou Jia,Zhuohang Dang,Basura Fernando,Jun Liu,Mike Zheng Shou
Main category: cs.CV
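TL;DR: Proposes CoFFT, a training-free method for enhancing the reasoning of vision-language models. Emulating human visual cognition, it iteratively generates diverse reasoning samples, performs dual foresight decoding, and adjusts visual focus, improving reasoning accuracy on complex images.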
Details
Motivation: 现有视觉语言模型在处理包含大量无关信息的图像时容易受到干扰,产生与任务无关的推理或幻觉,原因是缺乏精确发现和处理关键区域的能力。 Method: 提出Chain of Foresight-Focus Thought (CoFFT),包括三个阶段循环迭代:(1) 多样化样本生成,探索潜在推理路径;(2) 双重视觉前瞻解码,评估样本并选择最优步骤加入推理链;(3) 视觉焦点调整,将注意力精确定位到对未来推理最有帮助的图像区域。 Result: 在Qwen2.5-VL、InternVL-2.5和Llava-Next等多个主流VLM上实验表明,CoFFT在多个基准上带来3.1-5.8%的一致性能提升,且计算开销可控。 Conclusion: CoFFT通过推理与视觉焦点的动态交互,有效增强了VLM在复杂视觉输入下的推理能力,是一种通用、无需训练的增强框架。 Abstract: Despite significant advances in Vision Language Models (VLMs), they remain constrained by the complexity and redundancy of visual input. When images contain large amounts of irrelevant information, VLMs are susceptible to interference, thus generating excessive task-irrelevant reasoning processes or even hallucinations. This limitation stems from their inability to discover and process the required regions during reasoning precisely. To address this limitation, we present the Chain of Foresight-Focus Thought (CoFFT), a novel training-free approach that enhances VLMs' visual reasoning by emulating human visual cognition. Each Foresight-Focus Thought consists of three stages: (1) Diverse Sample Generation: generates diverse reasoning samples to explore potential reasoning paths, where each sample contains several reasoning steps; (2) Dual Foresight Decoding: rigorously evaluates these samples based on both visual focus and reasoning progression, adding the first step of optimal sample to the reasoning process; (3) Visual Focus Adjustment: precisely adjust visual focus toward regions most beneficial for future reasoning, before returning to stage (1) to generate subsequent reasoning samples until reaching the final answer. These stages function iteratively, creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning. 
Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8% with a controllable increase in computational overhead.
[205] Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics
Saurav Jha,Stefan K. Ehrlich
Main category: cs.CV
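TL;DR: Proposes a lightweight multimodal agentic framework that combines the Qwen2.5-VL-3B-Instruct model with a SmolAgent orchestration layer to support video-based scene understanding, showing good accuracy and robustness in clinical settings.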
Details
Motivation: Existing vision-language models fall short in temporal reasoning, uncertainty estimation, and structured outputs, making it hard to meet the high demands of healthcare robotics for safety and adaptation to dynamic environments.
Method: Combines the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, introduces chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation, generates structured scene graphs, and uses a hybrid retrieval module for interpretable, adaptive reasoning.
Result: Evaluations on the Video-MME benchmark and a custom clinical dataset show that the framework achieves competitive accuracy and improved robustness compared with state-of-the-art vision-language models.
Conclusion: The framework shows potential for healthcare applications such as robot-assisted surgery, patient monitoring, and decision support, offering an effective solution for safe perception and reasoning in dynamic clinical environments.
Abstract: Healthcare robotics requires robust multimodal perception and reasoning to ensure safety in dynamic clinical environments. Current Vision-Language Models (VLMs) demonstrate strong general-purpose capabilities but remain limited in temporal reasoning, uncertainty estimation, and structured outputs needed for robotic planning. We present a lightweight agentic multimodal framework for video-based scene understanding. Combining the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, it supports chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation. The framework generates structured scene graphs and leverages a hybrid retrieval module for interpretable and adaptive reasoning. Evaluations on the Video-MME benchmark and a custom clinical dataset show competitive accuracy and improved robustness compared to state-of-the-art VLMs, demonstrating its potential for applications in robot-assisted surgery, patient monitoring, and decision support.
[206] EgoInstruct: An Egocentric Video Dataset of Face-to-face Instructional Interactions with Multi-modal LLM Benchmarking
Yuki Sakai,Ryosuke Furuta,Juichun Yen,Yoichi Sato
Main category: cs.CV
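TL;DR: This paper introduces a new egocentric video dataset for analyzing face-to-face instructional scenes and evaluates how well multimodal large language models (MLLMs) understand instructional interactions. The results show that MLLMs outperform specialized baselines without task-specific fine-tuning.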
Details
Motivation: 由于缺乏合适的数据集和有限的分析技术,计算机视觉领域对面对面教学场景的研究不足,因此需要建立新的数据集和评估方法来促进对此类交互的理解。 Method: 提出了一个包含程序步骤分割和对话状态分类两个基本任务标注的新颖自我中心视频数据集,并使用该数据集对多模态大语言模型与传统任务特定模型进行了基准测试。 Result: 实验显示,多模态大语言模型即使没有进行任务特定的微调,在理解面对面教学场景方面也优于专门设计的基线模型。 Conclusion: 多模态大语言模型展现了对面对面教学互动进行全面理解的潜力,为未来教育支持和技术转移提供了新的可能性。 Abstract: Analyzing instructional interactions between an instructor and a learner who are co-present in the same physical space is a critical problem for educational support and skill transfer. Yet such face-to-face instructional scenes have not been systematically studied in computer vision. We identify two key reasons: i) the lack of suitable datasets and ii) limited analytical techniques. To address this gap, we present a new egocentric video dataset of face-to-face instruction and provide ground-truth annotations for two fundamental tasks that serve as a first step toward a comprehensive understanding of instructional interactions: procedural step segmentation and conversation-state classification. Using this dataset, we benchmark multimodal large language models (MLLMs) against conventional task-specific models. Since face-to-face instruction involves multiple modalities (speech content and prosody, gaze and body motion, and visual context), effective understanding requires methods that handle verbal and nonverbal communication in an integrated manner. Accordingly, we evaluate recently introduced MLLMs that jointly process images, audio, and text. This evaluation quantifies the extent to which current machine learning models understand face-to-face instructional scenes. In experiments, MLLMs outperform specialized baselines even without task-specific fine-tuning, suggesting their promise for holistic understanding of instructional interactions.[207] High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling
Chao Huang,Susan Liang,Yapeng Tian,Anurag Kumar,Chenliang Xu
Main category: cs.CV
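TL;DR: Proposes DAVIS, a diffusion-based audio-visual sound separation framework that uses generative learning to synthesize separated sound spectrograms directly from a noise distribution, outperforming existing methods on the AVE and MUSIC datasets.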
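Flow Matching, one of the two generative paradigms this entry names, can be illustrated with a minimal sketch. It assumes a straight-line path between a noise sample and a clean spectrogram, which is a common choice but is not confirmed by the abstract. The conditioning on mixed audio and visual features is omitted, and all function names are hypothetical.

```python
import random

def flow_matching_pair(x_noise, x_data, t):
    """Return the point at time t on a straight noise-to-data path and its target velocity.

    Assumption: linear interpolation path; the paper's exact path is not stated.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(x_noise, x_data)]
    v_target = [d - n for n, d in zip(x_noise, x_data)]  # constant along a straight path
    return x_t, v_target

def fm_loss(v_pred, v_target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)

random.seed(0)
spec = [0.2, 0.8, 0.5, 0.1]                      # toy "clean spectrogram" values
noise = [random.gauss(0.0, 1.0) for _ in spec]   # sample from the noise distribution
x_t, v = flow_matching_pair(noise, spec, t=0.3)
# A network would be trained so that its output at (x_t, t, conditions) matches v.
```

At t = 0 the path sits on the noise sample and at t = 1 on the data sample, so integrating a learned velocity field from noise would end at a separated spectrogram.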
Details
Motivation: Existing methods are limited in capturing complex data distributions, which makes high-quality separation of sounds from diverse categories difficult.
Method: Uses diffusion-based generative models (DDPM and Flow Matching) with a Separation U-Net architecture to generate the target sound spectrogram conditioned on the mixed audio and visual information.
Result: Experiments on the AVE and MUSIC datasets show that both DAVIS variants outperform current leading methods in separation quality.
Conclusion: The generative DAVIS framework improves the quality of audio-visual sound separation across diverse categories, supporting the use of diffusion-style generative models for this task.
Abstract: We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem, achieving significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS circumvents these issues by leveraging potent generative modeling paradigms, specifically Denoising Diffusion Probabilistic Models (DDPM) and the more recent Flow Matching (FM), integrated within a specialized Separation U-Net architecture. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information. The inherent nature of its generative objective makes DAVIS particularly adept at producing high-quality sound separations for diverse sound categories. We present comparative evaluations of DAVIS, encompassing both its DDPM and Flow Matching variants, against leading methods on the standard AVE and MUSIC datasets. The results affirm that both variants surpass existing approaches in separation quality, highlighting the efficacy of our generative framework for tackling the audio-visual source separation task.
[208] SpecXNet: A Dual-Domain Convolutional Network for Robust Deepfake Detection
Inzamamul Alam,Md Tanvir Islam,Simon S. Woo
Main category: cs.CV
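TL;DR: Proposes SpecXNet, a dual-domain architecture for robust deepfake detection that combines local spatial features with global spectral features. Its Dual-Domain Feature Coupler (DDFC) and Dual Fourier Attention (DFA) modules help separate authentic from forged images, yielding state-of-the-art detection on multiple benchmarks, particularly in cross-dataset and unseen-manipulation settings.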
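The intuition behind a spectral branch can be shown with a toy example: a periodic artifact that is spread across many pixels collapses into a single peak in the frequency domain. The sketch below uses a naive one-dimensional DFT on a synthetic pixel row. It is an illustration only, not the DDFC module, and the signal and function names are made up for the example.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive DFT; returns normalized magnitudes for frequency bins 0..n/2."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(coeff) / n)
    return mags

# Toy pixel row: flat texture (0.5) plus a weak artifact that repeats every 4 pixels.
n = 16
row = [0.5 + 0.1 * math.cos(2 * math.pi * t / 4) for t in range(n)]
mags = dft_magnitudes(row)

# Skip bin 0 (the mean brightness) and find the strongest periodic component.
peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
```

With 16 samples and a period of 4 pixels, the artifact lands in bin 16 / 4 = 4, which is the kind of compact cue a frequency-domain branch can pick up.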
Details
Motivation: As content generated by GANs and diffusion models grows more realistic, detectors that rely only on spatial or only on frequency-domain features generalize poorly to unknown manipulations. A stronger detector is needed that exploits both local texture anomalies and global structural inconsistencies.
Method: SpecXNet includes a Dual-Domain Feature Coupler (DDFC) that decomposes features into a spatial branch (capturing texture-level anomalies) and a spectral branch (using the FFT to model periodic inconsistencies), and a Dual Fourier Attention (DFA) module that dynamically fuses the two domains in a content-aware manner. The model builds on a modified XceptionNet backbone, with both modules embedded in a separable convolution block.
Result: Experiments on multiple deepfake detection benchmarks show that SpecXNet reaches state-of-the-art accuracy in cross-dataset and unseen-manipulation settings while remaining feasible in real time.
Conclusion: Unified spatial-spectral learning improves the robustness and generalization of deepfake detection, and SpecXNet offers a design direction for future detectors.
Abstract: The increasing realism of content generated by GANs and diffusion models has made deepfake detection significantly more challenging. Existing approaches often focus solely on spatial or frequency-domain features, limiting their generalization to unseen manipulations. We propose the Spectral Cross-Attentional Network (SpecXNet), a dual-domain architecture for robust deepfake detection. The core Dual-Domain Feature Coupler (DDFC) decomposes features into a local spatial branch for capturing texture-level anomalies and a global spectral branch that employs Fast Fourier Transform to model periodic inconsistencies. This dual-domain formulation allows SpecXNet to jointly exploit localized detail and global structural coherence, which are critical for distinguishing authentic from manipulated images. We also introduce the Dual Fourier Attention (DFA) module, which dynamically fuses spatial and spectral features in a content-aware manner. Built atop a modified XceptionNet backbone, SpecXNet embeds the DDFC and DFA modules within a separable convolution block. Extensive experiments on multiple deepfake benchmarks show that SpecXNet achieves state-of-the-art accuracy, particularly under cross-dataset and unseen manipulation scenarios, while maintaining real-time feasibility. Our results highlight the effectiveness of unified spatial-spectral learning for robust and generalizable deepfake detection. To ensure reproducibility, we released the full code on GitHub: https://github.com/inzamamulDU/SpecXNet
[209] Large Material Gaussian Model for Relightable 3D Generation
Jingrui Ye,Lingting Zhu,Runze Zhang,Zeyu Hu,Yingda Yin,Lanjiong Li,Lequan Yu,Qingmin Liao
Main category: cs.CV
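TL;DR: Proposes the Large Material Gaussian Model (MGM), a framework that generates high-quality 3D content with physically based materials (albedo, roughness, metallic). Unlike conventional methods that produce only RGB textures, it supports rendering under dynamic lighting.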
Details
Motivation: Existing 3D reconstruction models cannot produce the material properties of assets, which limits realistic rendering under different lighting environments.
Method: First fine-tunes a multiview material diffusion model conditioned on depth and normal maps. Using the generated multiview PBR images, it then designs a Gaussian material representation compatible with 2D Gaussian Splatting that models each PBR channel and reconstructs point clouds carrying material attributes.
Result: Experiments show that the generated materials are visually better than those of baseline methods, improve material modeling, and support relighting under dynamic ambient light.
Conclusion: MGM generates 3D content with PBR materials, giving practical rendering applications such as dynamic lighting a more realistic and controllable solution.
Abstract: The increasing demand for 3D assets across various industries necessitates efficient and automated methods for 3D content creation. Leveraging 3D Gaussian Splatting, recent large reconstruction models (LRMs) have demonstrated the ability to efficiently achieve high-quality 3D rendering by integrating multiview diffusion for generation and scalable transformers for reconstruction. However, existing models fail to produce the material properties of assets, which are crucial for realistic rendering in diverse lighting environments. In this paper, we introduce the Large Material Gaussian Model (MGM), a novel framework designed to generate high-quality 3D content with Physically Based Rendering (PBR) materials, i.e., albedo, roughness, and metallic properties, rather than merely producing RGB textures with uncontrolled light baking. Specifically, we first fine-tune a new multiview material diffusion model conditioned on input depth and normal maps. Utilizing the generated multiview PBR images, we explore a Gaussian material representation that not only aligns with 2D Gaussian Splatting but also models each channel of the PBR materials. The reconstructed point clouds can then be rendered to acquire PBR attributes, enabling dynamic relighting by applying various ambient light maps. Extensive experiments demonstrate that the materials produced by our method not only exhibit greater visual appeal compared to baseline methods but also enhance material modeling, thereby enabling practical downstream rendering applications.
[210] Self-Supervised Point Cloud Completion based on Multi-View Augmentations of Single Partial Point Cloud
Jingjing Lu,Huilong Pi,Yunchuan Qin,Zhuo Tang,Ruihui Li
Main category: cs.CV
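TL;DR: Proposes a new self-supervised point cloud completion method that improves performance through multi-view augmentations and a Mamba model, reaching state-of-the-art results on synthetic and real-world datasets.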
Details
Motivation: Existing methods depend on ground-truth labels, complete point clouds, or multi-view observations, and current self-supervised signals are weak, so these methods generalize poorly to real-world scenes.
Method: Designs new self-supervised signals based on multi-view augmentations of a single partial point cloud, and introduces Mamba into this task for the first time to improve generation quality.
Result: Experiments on synthetic and real-world datasets show that the method clearly outperforms existing self-supervised and other weakly supervised methods, achieving state-of-the-art performance.
Conclusion: The proposed self-supervised method removes the dependence on annotations and complete data, and shows strong generalization and performance on point cloud completion.
Abstract: Point cloud completion aims to reconstruct complete shapes from partial observations. Although current methods have achieved remarkable performance, they still have some limitations: supervised methods heavily rely on ground truth, which limits their generalization to real-world datasets due to the synthetic-to-real domain gap. Unsupervised methods require complete point clouds to compose unpaired training data, and weakly-supervised methods need multi-view observations of the object. Existing self-supervised methods frequently produce unsatisfactory predictions due to the limited capabilities of their self-supervised signals. To overcome these challenges, we propose a novel self-supervised point cloud completion method. We design a set of novel self-supervised signals based on multi-view augmentations of the single partial point cloud. Additionally, to enhance the model's learning ability, we are the first to incorporate Mamba into the self-supervised point cloud completion task, encouraging the model to generate point clouds with better quality. Experiments on synthetic and real-world datasets demonstrate that our method achieves state-of-the-art results.
[211] REFINE-CONTROL: A Semi-supervised Distillation Method For Conditional Image Generation
Yicheng Jiang,Jin Yuan,Hua Yuan,Yao Zhang,Yong Rui
Main category: cs.CV
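TL;DR: Proposes Refine-Control, a semi-supervised distillation framework that reduces the resource consumption and annotation dependence of conditional image generation models while preserving high-quality generation and controllability.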
Details
Motivation: Existing text-controlled image generation models consume substantial resources and depend on large amounts of annotated data, which makes them hard to deploy on edge devices and raises privacy risks.
Method: Introduces a tri-level knowledge fusion loss for knowledge transfer, and adopts a semi-supervised distillation method that combines labeled and unlabeled data to improve the student model's performance and generalization.
Result: Experiments show that Refine-Control significantly reduces computational cost and latency while maintaining generation fidelity and controllability.
Conclusion: Refine-Control is an effective approach to efficient, low-resource conditional image generation that suits deployment on edge devices.
Abstract: Conditional image generation models have achieved remarkable results by leveraging text-based control to generate customized images. However, the high resource demands of these models and the scarcity of well-annotated data have hindered their deployment on edge devices, leading to enormous costs and privacy concerns, especially when user data is sent to a third party. To overcome these challenges, we propose Refine-Control, a semi-supervised distillation framework. Specifically, we improve the performance of the student model by introducing a tri-level knowledge fusion loss to transfer different levels of knowledge. To enhance generalization and alleviate dataset scarcity, we introduce a semi-supervised distillation method utilizing both labeled and unlabeled data. Our experiments reveal that Refine-Control achieves significant reductions in computational cost and latency, while maintaining high-fidelity generation capabilities and controllability, as quantified by comparative metrics.
[212] Joint graph entropy knowledge distillation for point cloud classification and robustness against corruptions
Zhiqiang Tian,Weigang Li,Junwei Hu,Chunhua Deng
Main category: cs.CV
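TL;DR: Proposes JGEKD, a 3D point cloud classification strategy that performs knowledge distillation with joint graph entropy to capture class correlations in non-IID data, and uses self-distillation and teacher-distillation frameworks to make the model more robust to spatial transformations and data corruption.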
Details
Motivation: Conventional 3D point cloud classification assumes that class events are independent and identically distributed (IID), which ignores correlations between classes and limits model performance. A classification strategy that models and exploits these correlations is therefore needed.
Method: Proposes JGEKD, which builds a loss function based on joint graph entropy and transfers class correlations through knowledge distillation. Joint graphs capture the hidden relationships between classes, and the graph entropy is computed to carry out the knowledge transfer. Siamese structures support a self-knowledge distillation framework and a teacher-knowledge distillation framework that handle invariance to spatial transformations and transfer knowledge between original and corrupted point clouds to improve robustness.
Result: Extensive experiments on ScanObject, ModelNet40, ScanNetV2_cls, and ModelNet-C show that the method achieves competitive performance, particularly in robustness to data corruption.
Conclusion: JGEKD exploits class correlations in 3D point clouds under non-IID conditions, and distillation driven by joint graph entropy improves both classification performance and robustness, offering a new approach to point cloud data with complex real-world distributions.
Abstract: Classification tasks in 3D point clouds often assume that class events are independent and identically distributed (IID), although this assumption destroys the correlation between classes. This study proposes a classification strategy, Joint Graph Entropy Knowledge Distillation (JGEKD), suitable for non-independent and identically distributed 3D point cloud data, which achieves knowledge transfer of class correlations through knowledge distillation by constructing a loss function based on joint graph entropy. First, we employ joint graphs to capture the hidden relationships between classes and implement knowledge distillation to train our model by calculating the entropy of the graph. Subsequently, to handle 3D point clouds invariant to spatial transformations, we construct Siamese structures and develop two frameworks, self-knowledge distillation and teacher-knowledge distillation, to facilitate information transfer between different transformation forms of the same data. In addition, we use the above framework to achieve knowledge transfer between point clouds and their corrupted forms, and increase the model's robustness against corruption. Extensive experiments on ScanObject, ModelNet40, ScanNetV2_cls and ModelNet-C demonstrate that the proposed strategy can achieve competitive results.
[213] MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models
Jonas Belouadi,Tamy Boubekeur,Adrien Kaiser
Main category: cs.CV
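TL;DR: Proposes MultiMat, a multimodal program synthesis framework that uses large multimodal models to process both visual and textual node-graph representations, generating high-quality procedural material graphs more efficiently and outperforming text-only methods on both unconditional and conditional generation.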
Details
Motivation: Existing neural program synthesis methods represent node graphs only as textual programs and ignore their visual-spatial nature, which makes them hard to model intuitively. A multimodal method that combines visual and textual information is needed to improve the quality and accessibility of program generation.
Method: Proposes the MultiMat framework, which uses large multimodal models to process visual and textual inputs, trains them on a newly built dataset of high-quality procedural materials, and applies a constrained tree search inference algorithm that guarantees syntactic validity while exploring the program space efficiently.
Result: Experiments show that the method generates procedural material graphs more efficiently than text-only baselines, with higher visual quality and fidelity, and reaches state-of-the-art performance on both unconditional and conditional generation.
Conclusion: By fusing visual and textual information, MultiMat improves the efficiency and quality of procedural material node-graph generation and demonstrates the potential of multimodal program synthesis in computer graphics.
Abstract: Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structures and intermediate states provide an intuitive understanding and workflow for interactive appearance modeling. Creating such graphs is a challenging task and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures syntactic validity while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.
[214] DragGANSpace: Latent Space Exploration and Control for GANs
Kirsten Odendaal,Neela Kaushik,Spencer Halverson
Main category: cs.CV
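TL;DR: Combines StyleGAN, DragGAN, and PCA to improve the efficiency and controllability of the latent space of generated images through dimensionality reduction and cross-model alignment. On the AFHQ dataset, the approach shortens optimization time while preserving visual quality, and it aligns the latent spaces of models trained on different domains for intuitive editing.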
Details
Motivation: To make the latent space of GAN-generated images more efficient to operate on and easier to interpret, addressing the difficulty of optimizing in high-dimensional latent spaces and the inconsistency of control across models.
Method: Introduces principal component analysis (PCA) into the StyleGAN and DragGAN frameworks, applying it to the W+ latent layers for dimensionality reduction while DragGAN provides intuitive image editing. Also explores PCA for aligning the latent spaces of models from different domains (AFHQ-Dog and AFHQ-Cat).
Result: On AFHQ, PCA consistently reduces DragGAN optimization time, most visibly in shallower latent spaces (W+ layers = 3), while keeping good visual quality and improving SSIM. The latent spaces of two StyleGAN models trained on different domains are aligned and manipulated jointly.
Conclusion: Combining PCA with DragGAN improves the optimization efficiency and interpretability of the latent space and supports cross-model alignment and intuitive editing, giving image synthesis and editing a more scalable and practical means of control.
Abstract: This work integrates StyleGAN, DragGAN and Principal Component Analysis (PCA) to enhance the latent space efficiency and controllability of GAN-generated images. StyleGAN provides a structured latent space, DragGAN enables intuitive image manipulation, and PCA reduces dimensionality and facilitates cross-model alignment for more streamlined and interpretable exploration of latent spaces. We apply our techniques to the Animal Faces High Quality (AFHQ) dataset, and find that our approach of integrating PCA-based dimensionality reduction with the DragGAN framework for image manipulation retains performance while improving optimization efficiency. Notably, introducing PCA into the latent W+ layers of DragGAN can consistently reduce the total optimization time while maintaining good visual quality and even boosting the Structural Similarity Index Measure (SSIM) of the optimized image, particularly in shallower latent spaces (W+ layers = 3). We also demonstrate capability for aligning images generated by two StyleGAN models trained on similar but distinct data domains (AFHQ-Dog and AFHQ-Cat), and show that we can control the latent space of these aligned images to manipulate the images in an intuitive and interpretable manner. Our findings highlight the possibility for efficient and interpretable latent space control for a wide range of image synthesis and editing applications.
[215] MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Junbo Niu,Zheng Liu,Zhuangcheng Gu,Bin Wang,Linke Ouyang,Zhiyuan Zhao,Tao Chu,Tianyao He,Fan Wu,Qintong Zhang,Zhenjiang Jin,Guang Liang,Rui Zhang,Wenzheng Zhang,Yuan Qu,Zhifei Ren,Yuefeng Sun,Yuanhong Zheng,Dongsheng Ma,Zirui Tang,Boyu Niu,Ziyang Miao,Hejun Dong,Siyi Qian,Junyuan Zhang,Jingzhou Chen,Fangdong Wang,Xiaomeng Zhao,Liqun Wei,Wei Li,Shasha Wang,Ruiliang Xu,Yuanyuan Cao,Lu Chen,Qianqian Wu,Huaiyu Gu,Lindong Lu,Keming Wang,Dechen Lin,Guanlin Shen,Xuanhe Zhou,Linfeng Zhang,Yuhang Zang,Xiaoyi Dong,Jiaqi Wang,Bo Zhang,Lei Bai,Pei Chu,Weijia Li,Jiang Wu,Lijun Wu,Zhenxiang Li,Guangyu Wang,Zhongying Tu,Chao Xu,Kai Chen,Yu Qiao,Bowen Zhou,Dahua Lin,Wentao Zhang,Conghui He
Main category: cs.CV
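TL;DR: MinerU2.5 is a 1.2B-parameter vision-language model for document parsing that uses a coarse-to-fine, two-stage approach to reach state-of-the-art recognition accuracy while remaining computationally efficient.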
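The coarse-to-fine idea in this entry (find layout regions on a downsampled page, then read content from native-resolution crops) can be sketched with plain nested lists standing in for images. This is a minimal illustration under stated assumptions: the layout boxes are hard-coded where a layout model would run, and every function name here is hypothetical rather than part of MinerU2.5.

```python
def downsample(image, factor):
    """Keep every `factor`-th pixel in both directions (toy downsampling)."""
    return [row[::factor] for row in image[::factor]]

def scale_box(box, scale):
    """Map an (x0, y0, x1, y1) box from thumbnail to native coordinates."""
    x0, y0, x1, y1 = box
    return (int(x0 * scale), int(y0 * scale), int(x1 * scale), int(y1 * scale))

def crop(image, box):
    """Cut the region covered by an (x0, y0, x1, y1) box out of an image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

# Toy 60 x 80 "page"; each pixel encodes its own position for easy checking.
native = [[y * 100 + x for x in range(80)] for y in range(60)]
factor = 4

# Stage 1: layout analysis on the cheap thumbnail (boxes assumed, not predicted).
thumb = downsample(native, factor)
layout_boxes = [(2, 1, 10, 5)]          # thumbnail coordinates

# Stage 2: content recognition would run on native-resolution crops.
crops = [crop(native, scale_box(b, factor)) for b in layout_boxes]
```

The thumbnail has 16 times fewer pixels than the page, yet each crop handed to the second stage keeps full detail for its region.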
Details
Motivation: To combine high accuracy with high efficiency on complex documents (dense text, formulas, and tables), and to overcome the computational overhead that existing models incur on high-resolution images.
Method: Uses a two-stage parsing strategy. The first stage performs efficient layout analysis on a downsampled image. The second stage, guided by the global layout, recognizes content on local crops at the original resolution. A large-scale data engine is also built for pretraining and fine-tuning.
Result: MinerU2.5 reaches state-of-the-art results on multiple benchmarks, outperforming general-purpose and domain-specific models across recognition tasks while significantly reducing computational overhead.
Conclusion: By decoupling layout analysis from content recognition, the model strikes a good balance between accuracy and efficiency for document parsing.
Abstract: We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
[216] Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models
Jiaqi Liu,Lang Sun,Ronghao Fu,Bo Yang
Main category: cs.CV
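TL;DR: Proposes a perceptually grounded geospatial chain-of-thought (Geo-CoT) framework whose two-stage alignment strategy improves verifiable reasoning in remote sensing vision-language models, significantly outperforming existing state-of-the-art models.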
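The second alignment stage in this entry uses GRPO. A minimal sketch of the group-normalized advantage that GRPO-style methods commonly use is given below. It assumes the usual formulation, in which each sampled response is scored relative to the other responses for the same prompt. The entry does not spell out its reward function or update rule, so the rewards and the function name are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group: four sampled reasoning traces, rewarded 1.0 when the answer is correct.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
# Correct traces receive positive advantages, incorrect ones negative.
```

Because the baseline is the group mean, no separate value network is needed; a trace is reinforced only to the extent that it beats its siblings.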
Details
Motivation: Because of their end-to-end training paradigm, existing remote sensing vision-language models lack intermediate reasoning steps, so they perform poorly on complex analytical tasks and their outputs cannot be verified.
Method: Builds Geo-CoT380k, a large-scale dataset of structured rationales, uses supervised fine-tuning (SFT) to establish the foundational cognitive architecture, and then applies Group Reward Policy Optimization (GRPO) to improve the correctness of the reasoning.
Result: The resulting RSThinker model significantly outperforms existing state-of-the-art models across tasks and outputs both an answer and a verifiable analytical trace.
Conclusion: The Geo-CoT framework moves remote sensing analysis from black-box perception toward structured, verifiable reasoning and promotes transparency in Earth Observation.
Abstract: Vision-Language Models (VLMs) in remote sensing often fail at complex analytical tasks, a limitation stemming from their end-to-end training paradigm that bypasses crucial reasoning steps and leads to unverifiable outputs. To address this limitation, we introduce the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT), a framework that models remote sensing analysis as a verifiable, multi-step process. We instill this analytical process through a two-stage alignment strategy, leveraging Geo-CoT380k, the first large-scale dataset of structured Geo-CoT rationales. This strategy first employs supervised fine-tuning (SFT) to instill the foundational cognitive architecture, then leverages Group Reward Policy Optimization (GRPO) to refine the model's reasoning policy towards factual correctness. The resulting model, RSThinker, outputs both a final answer and its justifying, verifiable analytical trace. This capability yields dominant performance, significantly outperforming state-of-the-art models across a comprehensive range of tasks. The public release of our Geo-CoT380k dataset and RSThinker model upon publication serves as a concrete pathway from opaque perception towards structured, verifiable reasoning for Earth Observation.
[217] Polysemous Language Gaussian Splatting via Matching-based Mask Lifting
Jiayu Ding,Xinpeng Liu,Zhiyi Pan,Shiqiang Long,Ge Li
Main category: cs.CV
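TL;DR: Proposes MUSplat, a training-free framework for semantic understanding of 3D Gaussian Splatting scenes. It uses a 2D segmentation model and a vision-language model for open-vocabulary 3D scene understanding, addressing the per-scene retraining, single-meaning semantics, and cross-view inconsistency of existing methods.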
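One way to picture lifting 2D masks into 3D is a simple vote: for each Gaussian, count how often its projection falls inside the object mask across the views where it is visible. The sketch below assumes exactly that voting rule and a fixed 0.5 threshold. Both are illustrative choices, since this entry does not state how the foreground probability is actually estimated, and the function names are hypothetical.

```python
def foreground_probability(observations):
    """Fraction of visible views in which a Gaussian projects inside the mask.

    `observations` holds one bool per view where the Gaussian is visible.
    """
    if not observations:
        return 0.0  # never observed: treat as background in this sketch
    return sum(observations) / len(observations)

def initial_group(per_gaussian_obs, threshold=0.5):
    """Return indices of Gaussians assigned to the object, plus all probabilities."""
    probs = [foreground_probability(obs) for obs in per_gaussian_obs]
    members = [i for i, p in enumerate(probs) if p >= threshold]
    return members, probs

# Toy scene: four Gaussians observed in up to three views.
obs = [
    [True, True, True],     # consistently inside the mask
    [True, False, False],   # mostly outside
    [],                     # never visible
    [True, True, False],    # boundary case, mostly inside
]
group, probs = initial_group(obs)
```

Gaussians with intermediate probabilities are the ambiguous boundary cases that a later refinement step would need to resolve.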
Details
Motivation: Existing open-vocabulary 3D scene understanding methods depend on costly per-scene training, express only a single meaning per object, and suffer from cross-view inconsistency, which limits practical use.
Method: Uses a pre-trained 2D segmentation model to generate multi-granularity masks and lift them into 3D, estimating a foreground probability for each Gaussian point to form initial object groups. Boundaries are refined with semantic entropy and geometric opacity. A vision-language model then extracts textual features from the most representative viewpoints, which mitigates visual inconsistency and supports querying by semantic matching.
Result: MUSplat outperforms existing training-based methods on open-vocabulary 3D object selection and semantic segmentation, cuts scene adaptation time from hours to minutes, and supports multi-concept semantics.
Conclusion: MUSplat provides efficient, training-free 3D semantic understanding that overcomes the efficiency, semantic richness, and consistency limits of earlier methods, making open-vocabulary 3D scene understanding more practical.
Abstract: Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. However, mainstream methods suffer from three key flaws: (i) their reliance on costly per-scene retraining prevents plug-and-play application; (ii) their restrictive monosemous design fails to represent complex, multi-concept semantics; and (iii) their vulnerability to cross-view semantic inconsistencies corrupts the final semantic representation. To overcome these limitations, we introduce MUSplat, a training-free framework that abandons feature optimization entirely. Leveraging a pre-trained 2D segmentation model, our pipeline generates and lifts multi-granularity 2D masks into 3D, where we estimate a foreground probability for each Gaussian point to form initial object groups. We then optimize the ambiguous boundaries of these initial groups using semantic entropy and geometric opacity. Subsequently, by interpreting the object's appearance across its most representative viewpoints, a Vision-Language Model (VLM) distills robust textual features that reconcile visual inconsistencies, enabling open-vocabulary querying via semantic matching. By eliminating the costly per-scene training process, MUSplat reduces scene adaptation time from hours to mere minutes. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, MUSplat outperforms established training-based frameworks while simultaneously addressing their monosemous limitations.
[218] UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective
Jun He,Yi Lin,Zilong Huang,Jiacong Yin,Junyan Ye,Yuchuan Zhou,Weijia Li,Xiang Zhang
Main category: cs.CV
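TL;DR: Proposes UrbanFeel, a comprehensive benchmark for evaluating multimodal large language models (MLLMs) on urban development understanding and subjective environmental perception. It contains 14.3K visual questions across three dimensions (static scene perception, temporal change understanding, and subjective environmental perception), built from street-view images of 11 representative cities worldwide. In an evaluation of 20 state-of-the-art MLLMs, Gemini-2.5 Pro performs best, with overall accuracy close to human expert level and an average gap of only 1.5%. Most models do well on scene understanding, and some exceed humans at pixel-level change detection, but performance drops markedly on temporal reasoning about urban development. On subjective perception (such as beauty and safety), some models match or exceed human-level consistency.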
Details
Motivation: Existing MLLM benchmarks for urban environments do not systematically explore the temporal evolution of cities or subjective experience aligned with human perception, which limits their use in sustainable urban development. A more comprehensive benchmark closer to human cognition is needed to evaluate MLLMs on urban understanding.
Method: Proposes the UrbanFeel benchmark, with 14.3K visual question-answer pairs covering three cognitively progressive dimensions: static scene perception, temporal change understanding, and subjective environmental perception. The data come from multi-temporal single-view and panoramic street-view images of 11 representative cities worldwide. High-quality question-answer pairs are built through a hybrid pipeline of spatial clustering, rule-based generation, model-assisted prompting, and manual annotation, and 20 mainstream MLLMs are evaluated systematically.
Result: Gemini-2.5 Pro performs best among all models, with overall accuracy close to human experts and an average gap of only 1.5%. Most models perform well on tasks grounded in scene understanding, and some exceed humans at pixel-level change detection. Performance drops clearly on urban development tasks that require temporal reasoning. On subjective perception (such as beauty and safety), several models match or exceed the consistency of human annotators.
Conclusion: UrbanFeel is an effective benchmark for evaluating MLLMs on urban understanding and subjective perception. It exposes current weaknesses in temporal reasoning and shows that models already approach or exceed human level on some perception tasks, supporting the use of MLLMs in smart cities and sustainable development.
Abstract: Urban development impacts over half of the global population, making human-centered understanding of its structural and perceptual changes essential for sustainable development. While Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various domains, existing benchmarks that explore their performance in urban environments remain limited, lacking systematic exploration of temporal evolution and subjective perception of urban environment that aligns with human perception. To address these limitations, we propose UrbanFeel, a comprehensive benchmark designed to evaluate the performance of MLLMs in urban development understanding and subjective environmental perception. UrbanFeel comprises 14.3K carefully constructed visual questions spanning three cognitively progressive dimensions: Static Scene Perception, Temporal Change Understanding, and Subjective Environmental Perception. We collect multi-temporal single-view and panoramic street-view images from 11 representative cities worldwide, and generate high-quality question-answer pairs through a hybrid pipeline of spatial clustering, rule-based generation, model-assisted prompting, and manual annotation. Through extensive evaluation of 20 state-of-the-art MLLMs, we observe that Gemini-2.5 Pro achieves the best overall performance, with its accuracy approaching human expert levels and narrowing the average gap to just 1.5%. Most models perform well on tasks grounded in scene understanding. In particular, some models even surpass human annotators in pixel-level change detection. However, performance drops notably in tasks requiring temporal reasoning over urban development. Additionally, in the subjective perception dimension, several models reach human-level or even higher consistency in evaluating dimensions such as beauty and safety.
[219] A Tale of Two Experts: Cooperative Learning for Source-Free Unsupervised Domain Adaptation
Jiaping Yu,Muli Yang,Jiapeng Ji,Jiexi Yan,Cheng Deng
Main category: cs.CV
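TL;DR: Proposes Experts Cooperative Learning (EXCL), a source-free unsupervised domain adaptation method that uses a dual-expert framework and a retrieval-augmented interaction optimization pipeline to improve target-domain performance without access to source data.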
Details
Motivation: Existing SFUDA methods either use only the source model's predictions or fine-tune large multimodal models, ignoring the latent structure of the target data and complementary information. This work aims to mine consensus knowledge and make full use of unlabeled target samples.
Method: Proposes EXCL, which consists of a dual-expert framework (a frozen source-domain model and a vision-language model with a trainable text prompt) and a three-stage Retrieval-Augmented-Interaction (RAIN) optimization pipeline: collaboratively retrieving pseudo-source and complex target samples, fine-tuning each expert separately, and enforcing learning consistency through a shared learning result.
Result: Experiments on four benchmark datasets show performance on par with current state-of-the-art methods.
Conclusion: Through its cooperative dual-expert architecture and retrieval-augmented training strategy, EXCL mines the latent structure and consensus knowledge of the target domain, offering a new solution for unsupervised domain adaptation without source data.
Abstract: Source-Free Unsupervised Domain Adaptation (SFUDA) addresses the realistic challenge of adapting a source-trained model to a target domain without access to the source data, driven by concerns over privacy and cost. Existing SFUDA methods either exploit only the source model's predictions or fine-tune large multimodal models, yet both neglect complementary insights and the latent structure of target data. In this paper, we propose Experts Cooperative Learning (EXCL). EXCL contains the Dual Experts framework and the Retrieval-Augmentation-Interaction optimization pipeline. The Dual Experts framework places a frozen source-domain model (augmented with Conv-Adapter) and a pretrained vision-language model (with a trainable text prompt) on equal footing to mine consensus knowledge from unlabeled target samples. To effectively train these plug-in modules under purely unsupervised conditions, we introduce Retrieval-Augmented-Interaction (RAIN), a three-stage pipeline that (1) collaboratively retrieves pseudo-source and complex target samples, (2) separately fine-tunes each expert on its respective sample set, and (3) enforces learning object consistency via a shared learning result. Extensive experiments on four benchmark datasets demonstrate that our approach matches state-of-the-art performance.
[220] FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing
Junyi Wu,Zhiteng Li,Haotong Qin,Xiaohong Liu,Linghe Kong,Yulun Zhang,Xiaokang Yang
Main category: cs.CV
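TL;DR: FlashEdit is a framework for real-time, high-fidelity image editing. One-step inversion and editing, background shielding, and a sparsified spatial cross-attention mechanism substantially speed up editing while preserving background consistency.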
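Background preservation of the kind this entry describes can be pictured as a masked blend: keep the original values everywhere outside the edit region and take edited values only inside it. The sketch below assumes a binary mask and hard replacement on a small grid of scalar features. It is not the BG-Shield implementation, and the function name is hypothetical.

```python
def shield_blend(original, edited, mask):
    """Take edited values where mask is 1 and original values where mask is 0."""
    return [
        [e if m else o for o, e, m in zip(orow, erow, mrow)]
        for orow, erow, mrow in zip(original, edited, mask)
    ]

original = [[1, 1, 1],
            [1, 1, 1]]
edited   = [[9, 9, 9],
            [9, 9, 9]]
mask     = [[0, 1, 0],     # 1 marks the edit region
            [0, 1, 1]]

result = shield_blend(original, edited, mask)
```

Because positions outside the mask are copied rather than regenerated, the background in this sketch is identical to the original by construction.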
Details
Motivation: Existing text-guided image editing methods based on diffusion models produce high-quality results but have severe latency, which limits practical use.
Method: Proposes the FlashEdit framework with three key techniques: a One-Step Inversion-and-Editing (OSIE) pipeline, a Background Shield (BG-Shield) technique, and a Sparsified Spatial Cross-Attention (SSCA) mechanism.
Result: Experiments show that FlashEdit completes an edit in under 0.2 seconds, a speedup of more than 150× over earlier multi-step methods, while maintaining strong background consistency and structural integrity.
Conclusion: FlashEdit delivers efficient, precise, real-time image editing with a good balance between speed and editing quality, and has broad application prospects.
Abstract: Text-guided image editing with diffusion models has achieved remarkable quality but suffers from prohibitive latency, hindering real-world applications. We introduce FlashEdit, a novel framework designed to enable high-fidelity, real-time image editing. Its efficiency stems from three key innovations: (1) a One-Step Inversion-and-Editing (OSIE) pipeline that bypasses costly iterative processes; (2) a Background Shield (BG-Shield) technique that guarantees background preservation by selectively modifying features only within the edit region; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that ensures precise, localized edits by suppressing semantic leakage to the background. Extensive experiments demonstrate that FlashEdit maintains superior background consistency and structural integrity, while performing edits in under 0.2 seconds, which is an over 150× speedup compared to prior multi-step methods. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.
[221] Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
Miao Jing,Mengting Jia,Junling Lin,Zhongxia Shen,Lijun Wang,Yuanyuan Peng,Huan Gao,Mingkun Xu,Shangyang Li
Main category: cs.CV
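TL;DR: Neural-MedBench is a compact benchmark focused on multimodal clinical reasoning in neurology. It exposes the limits of current vision-language models in high-stakes diagnostic reasoning and argues for a two-axis evaluation framework that covers both breadth and depth.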
Details
Motivation: Existing medical benchmarks focus mainly on classification accuracy and do not reflect a model's real clinical reasoning ability, which creates an evaluation illusion. A high-quality benchmark aimed at complex clinical reasoning tasks is therefore needed.
Method: Builds the Neural-MedBench benchmark from multi-sequence MRI, structured electronic health records, and clinical notes, covering three task families (differential diagnosis, lesion recognition, and rationale generation), and evaluates with a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics.
Result: Experiments on advanced VLMs such as GPT-4o, Claude-4, and MedGemma show a sharp performance drop compared with conventional datasets. Error analysis indicates that the shortfalls stem mainly from reasoning failures rather than perceptual errors.
Conclusion: Proposes a two-axis evaluation framework that supplements broad datasets with small, carefully built benchmarks focused on deep reasoning, such as Neural-MedBench, for rigorous and efficient evaluation of clinically trustworthy AI.
Abstract: Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
[222] UniMapGen: A Generative Framework for Large-Scale Map Construction from Multi-modal Data
Yujian Yuan,Changjie Wu,Xinyuan Chang,Sijin Wang,Hang Zhang,Shiyi Liang,Shuang Zeng,Mu Xu
Main category: cs.CV
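TL;DR: Proposes UniMapGen, a generative framework for large-scale map construction. By representing lane lines as discrete sequences, supporting multi-modal inputs, and using a state update strategy, it overcomes the limitations of satellite data and the inefficient vectorization of conventional methods, reaching state-of-the-art performance on the OpenSatMap dataset.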
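Representing a lane line as a discrete sequence can be illustrated by quantizing polyline coordinates into integer tokens that a sequence model could generate one at a time. The sketch below assumes uniform binning over a square map tile, which is one simple possibility. This entry does not specify its tokenization, and the function names, tile size, and bin count are hypothetical.

```python
def polyline_to_tokens(points, extent, bins):
    """Quantize (x, y) points in [0, extent) into a flat list of bin indices."""
    tokens = []
    for x, y in points:
        for v in (x, y):
            idx = int(v * bins / extent)
            tokens.append(min(bins - 1, max(0, idx)))  # clamp to the valid range
    return tokens

def tokens_to_polyline(tokens, extent, bins):
    """Decode tokens back to points at the centers of their bins."""
    cell = extent / bins
    coords = [(t + 0.5) * cell for t in tokens]
    return list(zip(coords[0::2], coords[1::2]))

lane = [(0.0, 10.0), (25.0, 12.5), (50.0, 20.0)]   # toy lane line, in meters
tokens = polyline_to_tokens(lane, extent=100.0, bins=200)
decoded = tokens_to_polyline(tokens, extent=100.0, bins=200)
```

With 200 bins over 100 meters each cell is 0.5 m wide, so decoding recovers every coordinate to within 0.25 m; finer bins trade a larger vocabulary for smoother vectors.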
Details
Motivation: Conventional large-scale map construction depends on costly data collection vehicles and manual annotation. Existing satellite-based methods are limited by occluded or outdated data and by perception-based pipelines that yield discontinuous, rough roads. A more efficient and robust method that produces high-quality vector maps is needed.
Method: Proposes the UniMapGen framework: (1) represents lane lines as discrete sequences and generates more complete and smooth map vectors with an iterative strategy; (2) designs a flexible architecture that supports multi-modal inputs such as BEV, PV, and text prompts to compensate for the shortcomings of satellite imagery; (3) introduces a state update mechanism that keeps large maps globally continuous and consistent.
Result: Achieves state-of-the-art performance on the OpenSatMap dataset, and can infer occluded roads and find roads missing from the dataset annotations.
Conclusion: Through generative modeling and multi-modal fusion, UniMapGen improves the quality and robustness of large-scale map construction, addresses key bottlenecks of conventional and satellite-based methods, and has practical potential.
Abstract: Large-scale map construction is foundational for critical applications such as autonomous driving and navigation systems. Traditional large-scale map construction approaches mainly rely on costly and inefficient special data collection vehicles and labor-intensive annotation processes. While existing satellite-based methods have demonstrated promising potential in enhancing the efficiency and coverage of map construction, they exhibit two major limitations: (1) inherent drawbacks of satellite data (e.g., occlusions, outdatedness) and (2) inefficient vectorization from perception-based methods, resulting in discontinuous and rough roads that require extensive post-processing. This paper presents a novel generative framework, UniMapGen, for large-scale map construction, offering three key innovations: (1) representing lane lines as discrete sequences and establishing an iterative strategy to generate more complete and smooth map vectors than traditional perception-based methods; (2) proposing a flexible architecture that supports multi-modal inputs, enabling dynamic selection among BEV, PV, and text prompt, to overcome the drawbacks of satellite data; and (3) developing a state update strategy for global continuity and consistency of the constructed large-scale map. UniMapGen achieves state-of-the-art performance on the OpenSatMap dataset. Furthermore, UniMapGen can infer occluded roads and predict roads missing from dataset annotations. Our code will be released.
[223] GS-2M: Gaussian Splatting for Joint Mesh Reconstruction and Material Decomposition
Dinh Minh Nguyen,Malte Avenhaus,Thomas Lindemeier
Main category: cs.CV
TL;DR: Proposes GS-2M, a unified method based on 3D Gaussian Splatting for mesh reconstruction and material decomposition from multi-view images. It jointly optimizes attributes tied to the quality of rendered depth and normals, handling highly reflective surfaces while preserving geometric detail.
Details
Motivation: Existing methods usually treat mesh reconstruction and material decomposition separately, struggle with highly reflective surfaces, and rely on priors from external models, which limits performance.
Method: Built on 3D Gaussian Splatting, the method jointly optimizes attributes relevant to the quality of rendered depth and normals, introduces a roughness supervision strategy based on multi-view photometric variation, and uses a purpose-designed loss and optimization process.
Result: Validated on several widely used datasets; reconstruction results are comparable to state-of-the-art methods, and the method outputs triangle meshes with their associated material components for downstream tasks.
Conclusion: GS-2M provides a unified framework without complex neural components that handles reflective surfaces while preserving geometric detail, offering an efficient solution for large-scale use.
Abstract: We propose a unified solution for mesh reconstruction and material decomposition from multi-view images based on 3D Gaussian Splatting, referred to as GS-2M. Previous works handle these tasks separately and struggle to reconstruct highly reflective surfaces, often relying on priors from external models to enhance the decomposition results. Conversely, our method addresses these two problems by jointly optimizing attributes relevant to the quality of rendered depth and normals, maintaining geometric details while being resilient to reflective surfaces. Although contemporary works effectively solve these tasks together, they often employ sophisticated neural components to learn scene properties, which hinders their performance at scale. To further eliminate these neural components, we propose a novel roughness supervision strategy based on multi-view photometric variation. When combined with a carefully designed loss and optimization process, our unified framework produces reconstruction results comparable to state-of-the-art methods, delivering triangle meshes and their associated material components for downstream tasks. We validate the effectiveness of our approach with widely used datasets from previous works and qualitative comparisons with state-of-the-art surface reconstruction methods.
[224] MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning
Jinkun Hao,Naifu Liang,Zhen Luo,Xudong Xu,Weipeng Zhong,Ran Yi,Yichen Jin,Zhaoyang Lyu,Feng Zheng,Lizhuang Ma,Jiangmiao Pang
Main category: cs.CV
TL;DR: This paper introduces the new task of task-oriented tabletop scene generation, builds MesaTask-10K, a large-scale dataset of about 10,700 synthetic scenes, and proposes MesaTask, a framework based on a Spatial Reasoning Chain and a large language model that uses DPO to generate physically plausible tabletop scenes well aligned with task descriptions.
Details
Motivation: Traditional tabletop scene creation relies on time-consuming manual design or purely random layouts, which cannot guarantee plausibility or task relevance. An automated method that links high-level task instructions to concrete scene layouts is needed.
Method: Proposes a Spatial Reasoning Chain that decomposes scene generation into three steps: object inference, spatial interrelation reasoning, and scene graph construction. MesaTask is built on an LLM, optimized with DPO, and trained and evaluated on MesaTask-10K.
Result: Experiments show that MesaTask clearly outperforms baselines at generating task-conforming tabletop scenes with realistic layouts, producing complex object arrangements that are physically plausible and consistent with the task description.
Conclusion: By introducing the Spatial Reasoning Chain and DPO optimization, MesaTask performs strongly on task-oriented tabletop scene generation, indicating its potential for helping robots understand and execute human instructions.
Abstract: The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at https://mesatask.github.io/
[225] Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models
Michael Jungo,Andreas Fischer
Main category: cs.CV
TL;DR: This paper studies rule-based reinforcement learning for document image classification and finds that it generalizes better to out-of-distribution data, unseen classes, and different modalities.
Details
Motivation: Although reinforcement learning shows promise for improving reasoning, it is rarely applied in document analysis, so the paper explores its effect on document image classification.
Method: Applies rule-based reinforcement learning to the document image classification task and evaluates it on out-of-distribution data, unseen classes, and different modalities.
Result: Reinforcement learning shows stronger generalization in all three out-of-distribution scenarios.
Conclusion: Rule-based reinforcement learning can improve the generalization of document image classification models and merits wider use in document analysis.
Abstract: Rule-based reinforcement learning has been gaining popularity ever since DeepSeek-R1 demonstrated its success through simple verifiable rewards. In the domain of document analysis, reinforcement learning is not as prevalent, even though many downstream tasks may benefit from the emergent properties of reinforcement learning, particularly the enhanced reasoning capabilities. We study the effects of rule-based reinforcement learning with the task of Document Image Classification, which is one of the most commonly studied downstream tasks in document analysis. We find that reinforcement learning tends to have better generalisation capabilities to out-of-distribution data, which we examine in three different scenarios, namely out-of-distribution images, unseen classes and different modalities. Our code is available at https://github.com/jungomi/vision-finetune.
[226] Jailbreaking on Text-to-Video Models via Scene Splitting Strategy
Wonjun Lee,Haon Park,Doehyeon Lee,Bumsub Ham,Suhyun Kim
Main category: cs.CV
TL;DR: This paper proposes SceneSplit, a novel black-box jailbreak method for text-to-video (T2V) models that splits a harmful narrative into multiple individually benign scenes and uses their combination to constrain the generation space, ultimately steering the model toward a harmful video. Experiments show high attack success rates on several T2V models, revealing that current T2V safety mechanisms are vulnerable to attacks on narrative structure.
Details
Motivation: With the rapid progress of T2V models, their safety risks are drawing growing attention. Existing research, however, focuses on jailbreak attacks against LLMs, VLMs, and T2I models, while the vulnerabilities of T2V models remain largely unexplored, leaving a significant safety gap.
Method: SceneSplit decomposes a harmful narrative into a sequence of seemingly benign scenes and uses their combination to constrain the generative output space so that it narrows toward an unsafe region. Iterative scene manipulation and a strategy library that reuses successful attack patterns improve the attack's effectiveness and robustness.
Result: Evaluated across 11 safety categories, SceneSplit reaches average attack success rates of 77.2% on Luma Ray2, 84.1% on Hailuo, and 78.2% on Veo2, significantly outperforming the baseline.
Conclusion: Current T2V safety mechanisms are vulnerable to jailbreaks that exploit narrative structure. SceneSplit exposes the safety risks these models face under compositional semantic reasoning and offers insights for future T2V safety defenses.
Abstract: Along with the rapid advancement of numerous Text-to-Video (T2V) models, growing concerns have emerged regarding their safety risks. While recent studies have explored vulnerabilities in models like LLMs, VLMs, and Text-to-Image (T2I) models through jailbreak attacks, T2V models remain largely unexplored, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that works by fragmenting a harmful narrative into multiple scenes, each individually benign. This approach manipulates the generative output space, the abstract set of all potential video outputs for a given prompt, using the combination of scenes as a powerful constraint to guide the final outcome. While each scene individually corresponds to a wide and safe space where most outcomes are benign, their sequential combination collectively restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further enhanced through iterative scene manipulation, which bypasses the safety filter within this constrained unsafe region. Additionally, a strategy library that reuses successful attack patterns further improves the attack's overall effectiveness and robustness. To validate our method, we evaluate SceneSplit across 11 safety categories on T2V models. Our results show that it achieves a high average Attack Success Rate (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, and 78.2% on Veo2, significantly outperforming the existing baseline. Through this work, we demonstrate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, providing new insights for understanding and improving the safety of T2V models.
[227] HiGS: History-Guided Sampling for Plug-and-Play Enhancement of Diffusion Models
Seyedmorteza Sadat,Farnood Salehi,Romann M. Weber
Main category: cs.CV
TL;DR: Proposes history-guided sampling (HiGS), a momentum-based method that markedly improves diffusion model generation quality at low step counts and low guidance scales, with no extra training and practically no added computation.
Details
Motivation: At low sampling step counts or low guidance scales, diffusion models often produce images that lack detail and realism. An efficient plug-and-play way to improve generation quality is needed.
Method: Uses the difference between the current prediction and a weighted average of past predictions as a momentum term that guides each sampling step, improving the structure and detail of the output.
Result: Improves image quality across models and settings; on unguided ImageNet 256×256 generation it reaches an FID of 1.61 with only 30 steps, a new record.
Conclusion: HiGS is a training-free, plug-and-play enhancement to diffusion sampling that improves generation efficiency and image fidelity.
Abstract: While diffusion models have made remarkable progress in image generation, their outputs can still appear unrealistic and lack fine details, especially when using fewer neural function evaluations (NFEs) or lower guidance scales. To address this issue, we propose a novel momentum-based sampling technique, termed history-guided sampling (HiGS), which enhances quality and efficiency of diffusion sampling by integrating recent model predictions into each inference step. Specifically, HiGS leverages the difference between the current prediction and a weighted average of past predictions to steer the sampling process toward more realistic outputs with better details and structure. Our approach introduces practically no additional computation and integrates seamlessly into existing diffusion frameworks, requiring neither extra training nor fine-tuning. Extensive experiments show that HiGS consistently improves image quality across diverse models and architectures and under varying sampling budgets and guidance scales. Moreover, using a pretrained SiT model, HiGS achieves a new state-of-the-art FID of 1.61 for unguided ImageNet generation at 256×256 with only 30 sampling steps (instead of the standard 250). We thus present HiGS as a plug-and-play enhancement to standard diffusion sampling that enables faster generation with higher fidelity.
[228] Johnson-Lindenstrauss Lemma Guided Network for Efficient 3D Medical Segmentation
Jinpeng Lu,Linghan Cai,Yinda Chen,Guo Tang,Songhan Jiang,Haoyuan Shi,Zhiwei Xiong
Main category: cs.CV
TL;DR: This paper proposes VeloxSeg, a lightweight 3D medical image segmentation framework that balances efficiency and robustness through a dual-stream CNN-Transformer architecture and Spatially Decoupled Knowledge Transfer, substantially improving multimodal medical image segmentation.
Details
Motivation: Lightweight 3D medical image segmentation faces a conflict between efficiency and robustness and is fragile on complex anatomical structures and heterogeneous modalities. A new framework is needed that exploits the characteristics of high-dimensional 3D images and strengthens representations.
Method: VeloxSeg uses a dual-stream CNN-Transformer architecture that combines Paired Window Attention (PWA) and Johnson-Lindenstrauss lemma-guided convolution (JLC), applies a "glance-and-focus" mechanism for efficient multi-scale feature extraction, and extends to multimodal fusion through modal interaction. Spatially Decoupled Knowledge Transfer (SDKT) via Gram matrices injects the texture prior of a self-supervised model into the segmentation network.
Result: On multimodal benchmarks, VeloxSeg improves Dice by 26% over baselines and raises GPU throughput 11x and CPU throughput 48x, with no added inference overhead.
Conclusion: VeloxSeg resolves the efficiency/robustness conflict in lightweight 3D medical image segmentation, combining architectural innovation with data synergy to deliver both high performance and high efficiency across hardware platforms.
Abstract: Lightweight 3D medical image segmentation remains constrained by a fundamental "efficiency / robustness conflict", particularly when processing complex anatomical structures and heterogeneous modalities. In this paper, we study how to redesign the framework based on the characteristics of high-dimensional 3D images, and explore data synergy to overcome the fragile representation of lightweight methods. Our approach, VeloxSeg, begins with a deployable and extensible dual-stream CNN-Transformer architecture composed of Paired Window Attention (PWA) and Johnson-Lindenstrauss lemma-guided convolution (JLC). For each 3D image, we invoke a "glance-and-focus" principle, where PWA rapidly retrieves multi-scale information, and JLC ensures robust local feature extraction with minimal parameters, significantly enhancing the model's ability to operate with low computational budget. Through an extension of the dual-stream architecture that incorporates modal interaction into the multi-scale image-retrieval process, VeloxSeg efficiently models heterogeneous modalities. Finally, Spatially Decoupled Knowledge Transfer (SDKT) via Gram matrices injects the texture prior extracted by a self-supervised network into the segmentation network, yielding stronger representations than baselines at no extra inference cost. Experimental results on multimodal benchmarks show that VeloxSeg achieves a 26% Dice improvement, alongside increasing GPU throughput by 11x and CPU by 48x.
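The Gram-matrix idea behind SDKT can be illustrated with a minimal sketch. It assumes a feature map flattened to C channels by N spatial positions and uses hypothetical names (`gram_matrix`, `gram_loss`); the abstract does not specify the actual loss, so the mean-squared form below is an assumption. The sketch only shows why a Gram matrix is spatially decoupled: shuffling positions leaves it unchanged.

```python
def gram_matrix(features):
    """Channel-by-channel Gram matrix of a C x N feature map (list of lists).

    Entry (i, j) is the mean over positions of features[i][p] * features[j][p],
    so the order of spatial positions does not matter.
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n for row_j in features]
        for row_i in features
    ]


def gram_loss(student, teacher):
    """Mean squared difference between two Gram matrices (an assumed loss)."""
    gs, gt = gram_matrix(student), gram_matrix(teacher)
    c = len(gs)
    return sum((gs[i][j] - gt[i][j]) ** 2 for i in range(c) for j in range(c)) / (c * c)


# Toy feature map: 2 channels, 4 spatial positions.
teacher = [[1.0, 2.0, 3.0, 4.0],
           [0.0, 1.0, 0.0, 1.0]]
# Same values with the spatial positions permuted identically in every channel.
shuffled = [[3.0, 1.0, 4.0, 2.0],
            [0.0, 0.0, 1.0, 1.0]]
```

Here `gram_loss(shuffled, teacher)` is zero because both maps share the same channel statistics, which is the sense in which such a loss transfers texture-like information without constraining spatial layout.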
Code is available at https://github.com/JinPLu/VeloxSeg.
[229] NIFTY: a Non-Local Image Flow Matching for Texture Synthesis
Pierrick Chatillon,Julien Rabin,David Tschumperlé
Main category: cs.CV
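TL;DR: This paper proposes NIFTY, a hybrid framework for exemplar-based texture synthesis that combines insights from diffusion models trained with convolutional neural networks and classical patch-based methods.
Details
Motivation: To address the training requirements, initialization difficulties, and visual artifacts of existing exemplar-based texture synthesis methods.
Method: NIFTY is a non-parametric flow-matching model built on non-local patch matching. It combines diffusion model insights with classical patch optimization and needs no neural network training.
Result: Experiments show that NIFTY surpasses representative methods from the literature in synthesis quality while avoiding common defects.
Conclusion: NIFTY is an effective, training-free texture synthesis method that merges deep learning ideas with classical optimization and shows good application potential.
Abstract: This paper addresses the problem of exemplar-based texture synthesis. We introduce NIFTY, a hybrid framework that combines recent insights on diffusion models trained with convolutional neural networks, and classical patch-based texture optimization techniques. NIFTY is a non-parametric flow-matching model built on non-local patch matching, which avoids the need for neural network training while alleviating common shortcomings of patch-based methods, such as poor initialization or visual artifacts. Experimental results demonstrate the effectiveness of the proposed approach compared to representative methods from the literature. Code is available at https://github.com/PierrickCh/Nifty.git
[230] RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer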
Wangbo Zhao,Yizeng Han,Zhiwei Tang,Jiasheng Tang,Pengfei Zhou,Kai Wang,Bohan Zhuang,Zhangyang Wang,Fan Wang,Yang You
Main category: cs.CV
TL;DR: Proposes RAPID3, a framework that uses tri-level reinforced acceleration policies to deliver image-wise acceleration without updating the generator, achieving nearly 3x speedup on state-of-the-art DiT models such as Stable Diffusion 3 and FLUX while preserving generation quality.
Details
Motivation: Existing acceleration methods for diffusion transformers rely on fixed or manually designed heuristics that cannot adapt to individual images, while dynamic networks adapt per image but carry high fine-tuning costs that limit their use.
Method: Three lightweight policy heads (Step-Skip, Cache-Reuse, and Sparse-Attention) independently decide their speed-up at each denoising step based on the current state. Policy parameters are trained online with Group Relative Policy Optimization (GRPO) while the generator stays frozen, and an adversarial discriminator augments the reward signal to prevent reward hacking.
Result: On state-of-the-art DiT models including Stable Diffusion 3 and FLUX, sampling is nearly 3x faster with competitive generation quality.
Conclusion: RAPID3 achieves image-wise adaptive acceleration without fine-tuning the base generator. Its effectiveness and generality are validated on multiple DiT architectures, markedly improving diffusion transformer inference efficiency.
Abstract: Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. Existing training-free accelerators - step reduction, feature caching, and sparse attention - enhance inference speed but typically rely on a uniform heuristic or a manually designed adaptive strategy for all images, leaving quality on the table. Alternatively, dynamic neural networks offer per-image adaptive acceleration, but their high fine-tuning costs limit broader applicability. To address these limitations, we introduce RAPID3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformers, a framework that delivers image-wise acceleration with zero updates to the base generator. Specifically, three lightweight policy heads - Step-Skip, Cache-Reuse, and Sparse-Attention - observe the current denoising state and independently decide their corresponding speed-up at each timestep. All policy parameters are trained online via Group Relative Policy Optimization (GRPO) while the generator remains frozen. Meanwhile, an adversarially learned discriminator augments the reward signal, discouraging reward hacking by boosting returns only when generated samples stay close to the original model's distribution. Across state-of-the-art DiT backbones, including Stable Diffusion 3 and FLUX, RAPID3 achieves nearly 3x faster sampling with competitive generation quality.
[231] Pedestrian Attribute Recognition via Hierarchical Cross-Modality HyperGraph Learning
Xiao Wang,Shujuan Wu,Xiaoxia Cheng,Changwei Bi,Jin Tang,Bin Luo
Main category: cs.CV
TL;DR: This paper proposes a pedestrian attribute recognition method based on a multi-modal knowledge graph, which builds a graph of relations between vision and text to improve attribute recognition.
Details
Motivation: Existing PAR methods fail to fully exploit attribute knowledge and contextual information, and their modeling of visual-semantic associations remains insufficient.
Method: Proposes a multi-modal knowledge graph construction method and a knowledge graph-guided cross-modal hypergraph learning framework to model relations among attributes and between vision tokens and attributes.
Result: Comprehensive experiments on multiple PAR benchmark datasets validate the method's effectiveness and show marked gains in recognition accuracy.
Conclusion: The method lays a solid foundation for knowledge-guided pedestrian attribute recognition and advances multi-modal knowledge modeling in PAR.
Abstract: Current Pedestrian Attribute Recognition (PAR) algorithms typically focus on mapping visual features to semantic labels or attempt to enhance learning by fusing visual and attribute information. However, these methods fail to fully exploit attribute knowledge and contextual information for more accurate recognition. Although recent works have started to consider using attribute text as additional input to enhance the association between visual and semantic information, these methods are still in their infancy. To address the above challenges, this paper proposes the construction of a multi-modal knowledge graph, which is utilized to mine the relationships between local visual features and text, as well as the relationships between attributes and extensive visual context samples. Specifically, we propose an effective multi-modal knowledge graph construction method that fully considers the relationships among attributes and the relationships between attributes and vision tokens. To effectively model these relationships, this paper introduces a knowledge graph-guided cross-modal hypergraph learning framework to enhance the standard pedestrian attribute recognition framework. Comprehensive experiments on multiple PAR benchmark datasets have thoroughly demonstrated the effectiveness of our proposed knowledge graph for the PAR task, establishing a strong foundation for knowledge-guided pedestrian attribute recognition. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR
[232] Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
Yasmine Omri,Connor Ding,Tsachy Weissman,Thierry Tambe
Main category: cs.CV
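TL;DR: This paper explores 2D Gaussian Splatting (2DGS) as an alternative visual representation for vision-language models. It proposes an efficient 2DGS pipeline and adapts the CLIP framework, achieving meaningful zero-shot performance with heavily compressed inputs and offering a new path for edge-cloud learning that is both semantically expressive and transmission efficient.
Details
Motivation: RGB-based vision-language models are energy intensive and costly when transmitting dense pixel data, and patch-based tokenization yields overly long sequences that strain attention efficiency and context capacity. A more efficient visual representation is needed.
Method: Uses 2DGS as a compact, spatially adaptive image representation and builds a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and batched CUDA kernels. CLIP is adapted to 2DGS by reusing a frozen RGB transformer backbone with a lightweight splat-aware input stem and a perceiver resampler, training only about 7% of the parameters.
Result: 2DGS fitting is over 90x faster than prior implementations, with about 97% GPU utilization. On large DataComp subsets, GS encoders reach meaningful zero-shot ImageNet-1K performance with inputs compressed 3 to 20x.
Conclusion: 2DGS is a viable basis for multimodal visual representation. Accuracy still trails RGB encoders, but the study pinpoints architectural bottlenecks and opens a path toward representations that are both semantically capable and transmission efficient.
Abstract: Modern vision language pipelines are driven by RGB vision encoders trained on massive image text corpora. While these pipelines have enabled impressive zero shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy intensive and costly, and (ii) patch based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance aware pruning, and batched CUDA kernels, achieving over 90x faster fitting and about 97% GPU utilization compared to prior implementations. We further adapt contrastive language image pretraining (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat aware input stem and a perceiver resampler, training only about 7% of the total parameters. On large DataComp subsets, GS encoders yield meaningful zero shot ImageNet-1K performance while compressing inputs 3 to 20x relative to pixels.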
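To make "a set of colored anisotropic Gaussians" concrete, the following sketch evaluates an image as a sum of such splats. The parameterization (center, per-axis scales, rotation angle, RGB color) and the simple additive blending are assumptions chosen for illustration; the abstract does not describe the paper's actual renderer, and the names `Gaussian2D` and `render` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian2D:
    cx: float      # center x
    cy: float      # center y
    sx: float      # standard deviation along the Gaussian's first axis
    sy: float      # standard deviation along its second axis
    theta: float   # rotation angle in radians
    color: tuple   # (r, g, b), each in [0, 1]

    def weight(self, x, y):
        """Unnormalized density at (x, y); equals 1.0 at the center."""
        dx, dy = x - self.cx, y - self.cy
        # Rotate the offset into the Gaussian's own axes.
        u = math.cos(self.theta) * dx + math.sin(self.theta) * dy
        v = -math.sin(self.theta) * dx + math.cos(self.theta) * dy
        return math.exp(-0.5 * ((u / self.sx) ** 2 + (v / self.sy) ** 2))


def render(gaussians, width, height):
    """Additively blend all Gaussians at each pixel, clamping to [0, 1]."""
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            pixel = [0.0, 0.0, 0.0]
            for g in gaussians:
                w = g.weight(x, y)
                for k in range(3):
                    pixel[k] += w * g.color[k]
            row.append(tuple(min(1.0, c) for c in pixel))
        image.append(row)
    return image


# One red splat elongated along x, centered in an 8 x 8 image.
splat = Gaussian2D(cx=4.0, cy=4.0, sx=3.0, sy=1.0, theta=0.0, color=(1.0, 0.0, 0.0))
img = render([splat], width=8, height=8)
```

Under this assumed parameterization each splat stores 8 numbers, so N splats stand in for width × height × 3 pixel values, which is where the input compression comes from.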
While accuracy currently trails RGB encoders, our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission efficient for edge cloud learning.
[233] CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process
Arman Akbari,Jian Gao,Yifei Zou,Mei Yang,Jinru Duan,Dmitrii Torbunov,Yanzhi Wang,Yihui Ren,Xuan Zhang
Main category: cs.CV
TL;DR: This paper presents CircuitSense, a benchmark that evaluates how well multimodal large models derive symbolic equations from visual inputs in circuit diagram understanding, revealing severe shortcomings of current models in visual-to-mathematical reasoning.
Details
Motivation: Multimodal large language models excel on natural image tasks, but their ability to extract mathematical models from technical diagrams is unexplored, particularly the hierarchical abstraction and mathematical reasoning that engineering design requires.
Method: Builds CircuitSense, a comprehensive benchmark of 8,006+ problems spanning component-level schematics to system-level block diagrams. A hierarchical synthetic generation pipeline combines a grid-based schematic generator with a block diagram generator that produces automatically derived symbolic equation labels. Six state-of-the-art multimodal large models are evaluated across three stages: Perception, Analysis, and Design.
Result: Closed-source models exceed 85% accuracy on perception tasks such as component recognition and topology identification but fall below 19% on symbolic derivation and analytical reasoning. Models with stronger symbolic reasoning perform better on design tasks.
Conclusion: Current multimodal large models have a fundamental deficiency in visual-to-mathematical cross-modal reasoning. Symbolic reasoning ability is a key indicator of engineering design competence and must be improved to support real engineering use.
Abstract: Engineering design operates through hierarchical abstraction from system specifications to component implementations, requiring visual understanding coupled with mathematical reasoning at each level. While Multi-modal Large Language Models (MLLMs) excel at natural image tasks, their ability to extract mathematical models from technical diagrams remains unexplored. We present CircuitSense, a comprehensive benchmark evaluating circuit understanding across this hierarchy through 8,006+ problems spanning component-level schematics to system-level block diagrams. Our benchmark uniquely examines the complete engineering workflow: Perception, Analysis, and Design, with a particular emphasis on the critical but underexplored capability of deriving symbolic equations from visual inputs. We introduce a hierarchical synthetic generation pipeline consisting of a grid-based schematic generator and a block diagram generator with auto-derived symbolic equation labels. Comprehensive evaluation of six state-of-the-art MLLMs, including both closed-source and open-source models, reveals fundamental limitations in visual-to-mathematical reasoning. Closed-source models achieve over 85% accuracy on perception tasks involving component recognition and topology identification, yet their performance on symbolic derivation and analytical reasoning falls below 19%, exposing a critical gap between visual parsing and symbolic reasoning. Models with stronger symbolic reasoning capabilities consistently achieve higher design task accuracy, confirming the fundamental role of mathematical understanding in circuit synthesis and establishing symbolic reasoning as the key metric for engineering competence.
[234] LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision
Debargha Ganguly,Sumit Kumar,Ishwar Balappanawar,Weicong Chen,Shashank Kambhatla,Srinivasan Iyengar,Shivkumar Kalyanaraman,Ponnurangam Kumaraguru,Vipin Chaudhary
Main category: cs.CV
TL;DR: This paper presents Labeling Copilot, the first data curation deep research agent for computer vision, which builds high-quality, domain-specific vision datasets efficiently by coordinating three core capabilities: discovery, controllable synthesis, and consensus annotation.
Details
Motivation: Building high-quality, domain-specific vision datasets involves complex trade-offs among data quality, diversity, and cost, and existing methods struggle to balance them across large volumes of unlabeled data.
Method: Labeling Copilot uses a large multimodal language model as a central orchestrator agent that applies multi-step reasoning to schedule three specialized tools: Calibrated Discovery (sourcing relevant data from large repositories), Controllable Synthesis (generating new data for rare scenarios with robust filtering), and Consensus Annotation (orchestrating multiple foundation models to produce accurate labels via non-maximum suppression and voting).
Result: On COCO, the Consensus Annotation module averages 14.2 candidate proposals per image and reaches a final annotation mAP of 37.1%. On Open Images it discovers 903 new bounding box categories, for more than 1,500 in total. At a 10-million-sample scale, the Calibrated Discovery tool is up to 40x more computationally efficient than existing methods.
Conclusion: The experiments confirm that an agentic workflow combined with optimized, scalable tools provides a solid foundation for building industrial-scale datasets.
Abstract: Curating high-quality, domain-specific datasets is a major bottleneck for deploying robust vision systems, requiring complex trade-offs between data quality, diversity, and cost when researching vast, unlabeled data lakes. We introduce Labeling Copilot, the first data curation deep research agent for computer vision. A central orchestrator agent, powered by a large multimodal language model, uses multi-step reasoning to execute specialized tools across three core capabilities: (1) Calibrated Discovery sources relevant, in-distribution data from large repositories; (2) Controllable Synthesis generates novel data for rare scenarios with robust filtering; and (3) Consensus Annotation produces accurate labels by orchestrating multiple foundation models via a novel consensus mechanism incorporating non-maximum suppression and voting. Our large-scale validation proves the effectiveness of Labeling Copilot's components. The Consensus Annotation module excels at object discovery: on the dense COCO dataset, it averages 14.2 candidate proposals per image (nearly double the 7.4 ground-truth objects), achieving a final annotation mAP of 37.1%. On the web-scale Open Images dataset, it navigated extreme class imbalance to discover 903 new bounding box categories, expanding its capability to over 1500 total. Concurrently, our Calibrated Discovery tool, tested at a 10-million sample scale, features an active learning strategy that is up to 40x more computationally efficient than alternatives with equivalent sample efficiency.
These experiments validate that an agentic workflow with optimized, scalable tools provides a robust foundation for curating industrial-scale datasets.
[235] HierLight-YOLO: A Hierarchical and Lightweight Object Detection Network for UAV Photography
Defan Chen,Yaohua Hu,Luchan Zhang
Main category: cs.CV
TL;DR: This paper proposes HierLight-YOLO, a lightweight model based on YOLOv8 that improves real-time detection of small objects in UAV imagery.
Details
Motivation: Existing YOLO-series models have high miss rates on small objects (<32 pixels) and struggle to meet the real-time and accuracy demands of resource-constrained platforms in complex scenes.
Method: Proposes the Hierarchical Extended Path Aggregation Network (HEPAN) for multi-scale feature fusion, designs two lightweight modules, IRDCB and LDown, to cut parameters and computational complexity, and adds a detection head dedicated to small objects to improve spatial resolution and feature fusion.
Result: Experiments on the VisDrone2019 dataset show that HierLight-YOLO markedly improves small object detection accuracy while remaining real-time, achieving state-of-the-art performance.
Conclusion: Through hierarchical feature fusion and lightweight design, HierLight-YOLO balances accuracy and efficiency in small object detection and suits resource-constrained platforms such as UAVs.
Abstract: The real-time detection of small objects in complex scenes, such as unmanned aerial vehicle (UAV) photography captured by drones, poses the dual challenges of detecting small targets (<32 pixels) and maintaining real-time efficiency on resource-constrained platforms. While YOLO-series detectors have achieved remarkable success in real-time large object detection, they suffer from significantly higher false negative rates for drone-based detection where small objects dominate, compared to large object scenarios. This paper proposes HierLight-YOLO, a hierarchical feature fusion and lightweight model that enhances the real-time detection of small objects, based on the YOLOv8 architecture. We propose the Hierarchical Extended Path Aggregation Network (HEPAN), a multi-scale feature fusion method through hierarchical cross-level connections, enhancing the small object detection accuracy. HierLight-YOLO includes two innovative lightweight modules: the Inverted Residual Depthwise Convolution Block (IRDCB) and the Lightweight Downsample (LDown) module, which significantly reduce the model's parameters and computational complexity without sacrificing detection capabilities. A small object detection head is designed to further enhance spatial resolution and feature fusion to tackle tiny object (4 pixels) detection. Comparison experiments and ablation studies on the VisDrone2019 benchmark demonstrate state-of-the-art performance of HierLight-YOLO.
[236] Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs
Xingyu Fu,Siyi Liu,Yinuo Xu,Pan Lu,Guangqiuse Hu,Tianbo Yang,Taran Anantasagar,Christopher Shen,Yikai Mao,Yuanzhe Liu,Keyush Shah,Chung Un Lee,Yejin Choi,James Zou,Dan Roth,Chris Callison-Burch
Main category: cs.CV
TL;DR: This paper presents DeeptraceReward, the first fine-grained, spatially and temporally aware benchmark that annotates human-perceived fake traces in AI-generated videos, and trains multimodal language models as reward models to mimic human judgments.
Details
Motivation: Video generation models are advancing rapidly, but the key question of whether humans can identify deepfake traces in their output (such as spatiotemporal visual anomalies) has been overlooked.
Method: Builds a dataset of 4.3K annotations over 3.3K high-quality generated videos, where each annotation includes a natural-language explanation, a bounding-box region, and timestamps. The annotations are consolidated into 9 categories of deepfake traces, and a 7B-parameter multimodal language model is trained as a reward model.
Result: The 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. A difficulty gradient emerges: binary classification < language explanation < spatial grounding < temporal labeling.
Conclusion: DeeptraceReward provides a rigorous, human-perception-based testbed and training signal for trustworthy video generation, helping improve model reliability at the societal level.
Abstract: Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension -- whether humans can detect deepfake traces within a generated video, i.e., spatiotemporal grounded visual artifacts that reveal a video as machine generated -- has been largely overlooked. We introduce DeeptraceReward, the first fine-grained, spatially and temporally aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake vs. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.
[237] Effectiveness of Large Multimodal Models in Detecting Disinformation: Experimental Results
Yasmina Kheddache,Marc Lalonde
Main category: cs.CV
TL;DR: This study proposes a multimodal disinformation detection framework based on the GPT-4o model, combining an optimized prompt, preprocessing methods, and six evaluation criteria, and systematically evaluates its performance and stability on multiple datasets.
Details
Motivation: Multimodal disinformation (combining text and images) is increasingly widespread on digital platforms and challenges traditional detection methods. Advanced models are needed to improve detection accuracy and reliability.
Method: Uses GPT-4o with optimized prompt engineering, builds a structured analysis framework that includes image and text preprocessing, defines six fine-grained evaluation criteria, and adds confidence self-assessment and repeated testing to evaluate prediction stability.
Result: Validates GPT-4o on heterogeneous datasets including Gossipcop, Politifact, Fakeddit, MMFakeBench, and AMMEBA, reveals its strengths and limitations, and finds some variability in its classifications, which should be interpreted together with confidence levels.
Conclusion: The study offers a reproducible, robust methodological framework for automated multimodal disinformation detection and stresses the importance of prompt engineering, confidence assessment, and stability analysis in practice.
Abstract: The proliferation of disinformation, particularly in multimodal contexts combining text and images, presents a significant challenge across digital platforms. This study investigates the potential of large multimodal models (LMMs) in detecting and mitigating false information. We propose to approach multimodal disinformation detection by leveraging the advanced capabilities of the GPT-4o model. Our contributions include: (1) the development of an optimized prompt incorporating advanced prompt engineering techniques to ensure precise and consistent evaluations; (2) the implementation of a structured framework for multimodal analysis, including a preprocessing methodology for images and text to comply with the model's token limitations; (3) the definition of six specific evaluation criteria that enable a fine-grained classification of content, complemented by a self-assessment mechanism based on confidence levels; (4) a comprehensive performance analysis of the model across multiple heterogeneous datasets (Gossipcop, Politifact, Fakeddit, MMFakeBench, and AMMEBA), highlighting GPT-4o's strengths and limitations in disinformation detection; (5) an investigation of prediction variability through repeated testing, evaluating the stability and reliability of the model's classifications; and (6) the introduction of confidence-level and variability-based evaluation methods. These contributions provide a robust and reproducible methodological framework for automated multimodal disinformation analysis.
[238] CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
Long Xing,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Jianze Liang,Qidong Huang,Jiaqi Wang,Feng Wu,Dahua Lin
Main category: cs.CV
TL;DR: Proposes CapRL, a framework that uses reinforcement learning with verifiable rewards to improve image captioning, scoring caption quality by how accurately a non-visual language model answers questions from the caption.
Details
Motivation: Existing captioning models depend on expensive manually annotated data, which tends to make models memorize fixed answers and limits diversity and generalization.
Method: Adopts the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm in a two-stage framework: an LVLM generates a caption, and a vision-free LLM answers multiple-choice questions based on the caption to provide an objective reward signal.
Result: Substantial gains across 12 benchmarks; the CapRL-5M pretraining dataset is particularly effective; within the Prism framework, performance is comparable to Qwen2.5-VL-72B and exceeds the baseline by 8.4% on average.
Conclusion: CapRL is the first to apply RLVR successfully to the subjective image captioning task, improving generation quality and model generality through decoupled training and a utility-based reward.
Abstract: Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitation of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL significantly enhances multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks.
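The utility-based reward described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: `answer_from_caption` is a hypothetical stand-in for the vision-free LLM (here a simple keyword matcher), and the question format is assumed.

```python
def answer_from_caption(caption, question, options):
    """Toy stand-in for the vision-free LLM: pick the first option whose
    text appears in the caption, or None if no option is mentioned.
    (This toy answerer ignores the question text.)"""
    lowered = caption.lower()
    for label, text in options.items():
        if text.lower() in lowered:
            return label
    return None


def caption_reward(caption, questions, answer_fn=answer_from_caption):
    """Fraction of multiple-choice questions answered correctly using
    only the caption; the answerer never sees the image."""
    if not questions:
        return 0.0
    correct = sum(
        1 for q in questions
        if answer_fn(caption, q["question"], q["options"]) == q["answer"]
    )
    return correct / len(questions)


questions = [
    {"question": "What animal is shown?",
     "options": {"A": "dog", "B": "cat"}, "answer": "A"},
    {"question": "What color is the ball?",
     "options": {"A": "blue", "B": "red"}, "answer": "B"},
]

detailed = "A brown dog chases a red ball across the grass."
vague = "An animal plays outside."
```

With these toy inputs the detailed caption lets the answerer resolve both questions while the vague one resolves neither, so the reward favors captions that carry the information needed downstream.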
Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.
[239] GPT-4 for Occlusion Order Recovery
Kaziwa Saleh,Zhyar Rzgar K Rostam,Sándor Szénási,Zoltán Vámossy
Main category: cs.CV
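TL;DR: Proposes using a pre-trained GPT-4 model with prompt engineering to predict the occlusion order between objects in an image; the approach needs no annotated training data, reasons zero-shot, and is validated on the COCOA and InstaOrder datasets.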
Details
Motivation: 当前视觉模型在复杂真实场景中处理遮挡问题存在困难,难以准确推断物体间的遮挡顺序。 Method: 设计特定提示,结合输入图像,利用GPT-4分析图像并生成遮挡顺序预测,解析响应构建遮挡矩阵。 Result: 在COCOA和InstaOrder数据集上的实验表明,该方法能有效利用语义上下文、视觉模式和常识知识,实现更准确的遮挡顺序预测,且具备零样本推理能力。 Conclusion: GPT-4可通过提示工程实现无需训练数据的遮挡顺序推理,生成的遮挡矩阵有助于提升图像理解和遮挡处理任务的性能。 Abstract: Occlusion remains a significant challenge for current vision models to robustly interpret complex and dense real-world images and scenes. To address this limitation and to enable accurate prediction of the occlusion order relationship between objects, we propose leveraging the advanced capability of a pre-trained GPT-4 model to deduce the order. By providing a specifically designed prompt along with the input image, GPT-4 can analyze the image and generate order predictions. The response can then be parsed to construct an occlusion matrix which can be utilized in assisting with other occlusion handling tasks and image understanding. We report the results of evaluating the model on COCOA and InstaOrder datasets. The results show that by using semantic context, visual patterns, and commonsense knowledge, the model can produce more accurate order predictions. Unlike baseline methods, the model can reason about occlusion relationships in a zero-shot fashion, which requires no annotated training data and can easily be integrated into occlusion handling frameworks.[240] Gradient-based multi-focus image fusion with focus-aware saliency enhancement
Haoyu Li,XiaoSong Li
Main category: cs.CV
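TL;DR: Proposes a multi-focus image fusion method based on significant boundary enhancement that uses a gradient-domain model and Tenengrad gradient detection to produce high-quality fused results, outperforming 12 existing methods on multiple datasets.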
Details
Motivation: Existing methods struggle to preserve sharp focus-defocus boundaries, often producing blurred transitions and loss of detail. Method: A gradient-domain model yields an initial fusion result with complete boundaries; Tenengrad gradient detection extracts salient features from the source images and the initial fused image to generate saliency maps; a focus metric built from gradient and complementary information refines the boundaries and produces a high-quality decision map. Result: Experiments on four public datasets show that the method outperforms 12 state-of-the-art methods in both subjective and objective evaluations, effectively preserving boundary details and improving fusion quality. Conclusion: The proposed method achieves better boundary preservation and detail recovery in multi-focus image fusion and shows good application prospects. Abstract: Multi-focus image fusion (MFIF) aims to yield an all-focused image from multiple partially focused inputs, which is crucial in applications such as surveillance, microscopy, and computational photography. However, existing methods struggle to preserve sharp focus-defocus boundaries, often resulting in blurred transitions and loss of focused details. To solve this problem, we propose an MFIF method based on significant boundary enhancement, which generates high-quality fused boundaries while effectively detecting focus information. In particular, we propose a gradient-domain-based model that can obtain initial fusion results with complete boundaries and effectively preserve the boundary details. Additionally, we introduce Tenengrad gradient detection to extract salient features from both the source images and the initial fused image, generating the corresponding saliency maps. For boundary refinement, we develop a focus metric based on gradient and complementary information, integrating the salient features with the complementary information across images to emphasize focused regions and produce a high-quality initial decision result. Extensive experiments on four public datasets demonstrate that our method consistently outperforms 12 state-of-the-art methods in both subjective and objective evaluations. Our code is available at https://github.com/Lihyua/GICI
[241] Text Adversarial Attacks with Dynamic Outputs
Wenqiang Wang,Siyuan Liang,Xiao Yan,Xiaochun Cao
Main category: cs.CV
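TL;DR: Proposes the Textual Dynamic Outputs Attack (TDOA), a text adversarial attack that builds a clustering-based surrogate model to convert the dynamic-output scenario into a static one and adds a farthest-label targeted attack strategy to strengthen the attack; its efficiency and generalization are validated on multiple datasets and large models.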
Details
Motivation: 现有文本对抗攻击通常局限于固定输出标签的静态场景,且依赖大量查询或代理模型访问,难以应对动态变化的输出空间,因此需要一种适用于动态输出场景的高效攻击方法。 Method: 采用基于聚类的代理模型训练方法,将动态输出转换为静态单输出问题;提出最远标签目标攻击策略,选择偏离模型粗粒度标签最远的对抗方向以增强扰动效果。 Result: 在四个数据集和八个目标模型(如ChatGPT-4o)上验证,单次查询下最大攻击成功率达50.81%;在传统静态场景中最高ASR达82.68%;扩展至生成任务后,在RDBLEU和RDchrF指标上分别提升0.64和0.62。 Conclusion: TDOA能有效攻击具有动态输出的大语言模型,且在低查询条件下表现优异,同时可推广至生成式任务,展现出强大的适应性和攻击潜力。 Abstract: Text adversarial attack methods are typically designed for static scenarios with fixed numbers of output labels and a predefined label space, relying on extensive querying of the victim model (query-based attacks) or the surrogate model (transfer-based attacks). To address this gap, we introduce the Textual Dynamic Outputs Attack (TDOA) method, which employs a clustering-based surrogate model training approach to convert the dynamic-output scenario into a static single-output scenario. To improve attack effectiveness, we propose the farthest-label targeted attack strategy, which selects adversarial vectors that deviate most from the model's coarse-grained labels, thereby maximizing disruption. We extensively evaluate TDOA on four datasets and eight victim models (e.g., ChatGPT-4o, ChatGPT-4.1), showing its effectiveness in crafting adversarial examples and its strong potential to compromise large language models with limited access. With a single query per text, TDOA achieves a maximum attack success rate of 50.81\%. Additionally, we find that TDOA also achieves state-of-the-art performance in conventional static output scenarios, reaching a maximum ASR of 82.68\%. Meanwhile, by conceptualizing translation tasks as classification problems with unbounded output spaces, we extend the TDOA framework to generative settings, surpassing prior results by up to 0.64 RDBLEU and 0.62 RDchrF.[242] Integrating Background Knowledge in Medical Semantic Segmentation with Logic Tensor Networks
Luca Bergamin,Giovanna Maria Dimitri,Fabio Aiolli
Main category: cs.CV
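TL;DR: Proposes a neurosymbolic method that combines Logic Tensor Networks (LTNs) with SwinUNETR, encoding medical prior knowledge as first-order logic rules in the loss function to improve hippocampus segmentation in brain MRI when training data is scarce.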
Details
Motivation: 现有的深度学习模型在医学图像分割中仍存在不足,尤其在训练数据稀缺时性能受限,因此需要引入医学领域知识以提升模型的鲁棒性和准确性。 Method: 采用逻辑张量网络(LTNs)将医学背景知识(如形状约束和区域关系)用一阶逻辑表达,并嵌入到SwinUNETR模型的损失函数中,实现端到端的语义分割框架。 Result: 在海马体MRI分割任务中,该方法优于基线模型,尤其在训练数据较少时表现出更优的性能提升。 Conclusion: 神经符号方法能够有效融合医学先验知识,具有良好的泛化潜力,可推广至其他医学图像分割任务。 Abstract: Semantic segmentation is a fundamental task in medical image analysis, aiding medical decision-making by helping radiologists distinguish objects in an image. Research in this field has been driven by deep learning applications, which have the potential to scale these systems even in the presence of noise and artifacts. However, these systems are not yet perfected. We argue that performance can be improved by incorporating common medical knowledge into the segmentation model's loss function. To this end, we introduce Logic Tensor Networks (LTNs) to encode medical background knowledge using first-order logic (FOL) rules. The encoded rules span from constraints on the shape of the produced segmentation, to relationships between different segmented areas. We apply LTNs in an end-to-end framework with a SwinUNETR for semantic segmentation. We evaluate our method on the task of segmenting the hippocampus in brain MRI scans. Our experiments show that LTNs improve the baseline segmentation performance, especially when training data is scarce. Despite being in its preliminary stages, we argue that neurosymbolic methods are general enough to be adapted and applied to other medical semantic segmentation tasks.[243] Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
Xinhao Zhong,Yimin Zhou,Zhiqi Zhang,Junhao Li,Yi Sun,Bin Chen,Shu-Tao Xia,Ke Xu
Main category: cs.CV
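TL;DR: Proposes VARE, a concept erasure framework for visual autoregressive (VAR) models, and its refinement S-VARE; by introducing auxiliary visual tokens and a filtered cross-entropy loss, they achieve precise, stable concept erasure while preserving generation quality, addressing the failure of existing methods on VAR models.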
Details
Motivation: Existing concept erasure techniques mainly target diffusion models and transfer poorly to visual autoregressive (VAR) models, which use a next-scale token prediction paradigm, so a safe erasure method designed for VAR is needed to address growing safety concerns. Method: Proposes the VARE framework, which uses auxiliary visual tokens to reduce fine-tuning intensity; building on it, S-VARE combines a filtered cross-entropy loss that precisely identifies and minimally adjusts unsafe visual tokens with a preservation loss that maintains semantic fidelity. Result: Experiments show that the method achieves precise concept erasure, markedly reducing problems such as language drift and loss of diversity while maintaining high image generation quality. Conclusion: S-VARE provides an efficient and stable concept erasure solution for visual autoregressive models, filling a gap in safety research on VAR models for text-to-image generation. Abstract: The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VARs due to their next-scale token prediction paradigm. In this paper, we first propose a novel VAR Erasure framework, VARE, that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR, which incorporates a filtered cross entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap left by earlier methods in autoregressive text-to-image generation.
[244] RAU: Reference-based Anatomical Understanding with Vision Language Models
Yiwei Li,Yikang Liu,Jiaqi Guo,Lin Zhao,Zheyuan Zhang,Xiao Chen,Boris Mailhe,Ankush Mukherjee,Terrence Chen,Shanhui Sun
Main category: cs.CV
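TL;DR: Proposes RAU, the first framework to explore reference-based identification, localization, and segmentation of anatomical structures in medical images with vision-language models (VLMs). By combining the VLM's spatial reasoning with SAM2's fine-grained segmentation, RAU achieves more accurate segmentation and stronger generalization across multiple datasets.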
Details
Motivation: 深度学习在医学图像中的解剖理解受限于专家标注数据的稀缺,因此需要一种无需大量标注即可实现精确解剖定位的方法。利用已标注的参考图像指导未标注目标图像的理解是一种有前景的解决方案,但现有视觉-语言模型在此类任务上的表现仍有限。 Method: 提出RAU框架,利用视觉-语言模型进行参考图像与目标图像之间的相对空间推理,实现解剖区域识别,并将VLM生成的空间线索与SAM2的像素级分割能力相结合,完成细粒度解剖结构(如血管段)的定位与分割。在中等规模数据集上训练并验证其在视觉问答(VQA)和边界框预测中的有效性。 Result: RAU在两个分布内和两个分布外数据集上均优于使用相同内存设置的SAM2微调基线,实现了更精确的分割和更可靠的定位,展现出强大的泛化能力。 Conclusion: RAU是首个探索VLM用于医学图像中参考式解剖理解的框架,其优异表现表明VLM驱动的方法在自动化临床工作流中具有广阔应用潜力。 Abstract: Anatomical understanding through deep learning is critical for automatic report generation, intra-operative navigation, and organ localization in medical imaging; however, its progress is constrained by the scarcity of expert-labeled data. A promising remedy is to leverage an annotated reference image to guide the interpretation of an unlabeled target. Although recent vision-language models (VLMs) exhibit non-trivial visual reasoning, their reference-based understanding and fine-grained localization remain limited. We introduce RAU, a framework for reference-based anatomical understanding with VLMs. We first show that a VLM learns to identify anatomical regions through relative spatial reasoning between reference and target images, trained on a moderately sized dataset. We validate this capability through visual question answering (VQA) and bounding box prediction. Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2, enabling localization and pixel-level segmentation of small anatomical regions, such as vessel segments. Across two in-distribution and two out-of-distribution datasets, RAU consistently outperforms a SAM2 fine-tuning baseline using the same memory setup, yielding more accurate segmentations and more reliable localization. More importantly, its strong generalization ability makes it scalable to out-of-distribution datasets, a property crucial for medical image applications. 
To the best of our knowledge, RAU is the first to explore the capability of VLMs for reference-based identification, localization, and segmentation of anatomical structures in medical images. Its promising performance highlights the potential of VLM-driven approaches for anatomical understanding in automated clinical workflows.
[245] FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing
Hossein Kashiani,Niloufar Alipour Talemi,Fatemeh Afghah
Main category: cs.CV
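TL;DR: Proposes FreqDebias, a frequency debiasing framework that addresses spectral bias in the frequency domain of deepfake detectors; Forgery Mixup augmentation and dual consistency regularization significantly improve cross-domain and in-domain detection performance.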
Details
Motivation: 现有的深度伪造检测器由于训练数据有限,容易在频域中学习到特定频率带的偏差(谱偏差),导致对未见过的伪造类型泛化能力差。 Method: 提出FreqDebias框架,包含两种策略:一是伪造Mixup(Fo-Mixup)增强方法,动态多样化训练样本的频率特征;二是结合局部(基于类激活图)和全局(基于vMF分布的超球面嵌入空间)一致性正则化,促进模型学习更鲁棒的表示。 Result: 大量实验表明,FreqDebias在跨域和域内设置下均优于现有最先进方法,显著提升了模型的跨域泛化能力。 Conclusion: FreqDebias有效缓解了深度伪造检测中的谱偏差问题,通过双重正则化和数据增强策略,增强了模型对未知伪造类型的泛化能力,为构建更鲁棒的检测器提供了新思路。 Abstract: Deepfake detectors often struggle to generalize to novel forgery types due to biases learned from limited training data. In this paper, we identify a new type of model bias in the frequency domain, termed spectral bias, where detectors overly rely on specific frequency bands, restricting their ability to generalize across unseen forgeries. To address this, we propose FreqDebias, a frequency debiasing framework that mitigates spectral bias through two complementary strategies. First, we introduce a novel Forgery Mixup (Fo-Mixup) augmentation, which dynamically diversifies frequency characteristics of training samples. Second, we incorporate a dual consistency regularization (CR), which enforces both local consistency using class activation maps (CAMs) and global consistency through a von Mises-Fisher (vMF) distribution on a hyperspherical embedding space. This dual CR mitigates over-reliance on certain frequency components by promoting consistent representation learning under both local and global supervision. Extensive experiments show that FreqDebias significantly enhances cross-domain generalization and outperforms state-of-the-art methods in both cross-domain and in-domain settings.[246] LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
Song Fei,Tian Ye,Lujia Wang,Lei Zhu
Main category: cs.CV
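TL;DR: Proposes LucidFlux, a caption-free universal image restoration framework that adapts a large diffusion transformer (Flux.1) for high-quality restoration while maintaining semantic consistency without text prompts.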
Details
Motivation: 现有基于判别或UNet扩散先验的方法在未知混合退化条件下常出现过平滑、幻觉或语义漂移问题,且依赖文本提示导致延迟与不稳定。 Method: 设计轻量级双分支调节器,结合退化输入和轻度恢复代理信号,锚定几何结构并抑制伪影;采用时间步和层自适应调制策略,在主干网络中传递线索以实现从粗到细的上下文感知更新;利用SigLIP特征实现无文本语义对齐,并构建可扩展的数据筛选管道提供结构丰富的监督。 Result: 在合成与真实场景基准上,LucidFlux持续优于强开源及商业基线模型,消融实验验证各组件必要性。 Conclusion: 对于大型扩散Transformer而言,何时、何地以及如何进行条件注入是实现鲁棒、无文本通用图像恢复的关键控制因素,而非增加参数或依赖文本提示。 Abstract: Universal image restoration (UIR) aims to recover images degraded by unknown mixtures while preserving semantics -- conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone's hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or MLLM captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on -- rather than adding parameters or relying on text prompts -- is the governing lever for robust and caption-free universal image restoration in the wild.[247] Explaining multimodal LLMs via intra-modal token interactions
Jiawei Liang,Ruoyu Chen,Xianghao Jiao,Siyuan Liang,Shiming Liu,Qunli Zhang,Zheng Hu,Xiaochun Cao
Main category: cs.CV
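TL;DR: Proposes a new method for improving the interpretability of multimodal large language models that exploits intra-modal interactions to improve attribution quality in both the visual and textual branches.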
Details
Motivation: Existing interpretability methods focus mainly on cross-modal attribution and overlook intra-modal dependencies, which leads to fragmented visual explanations and spurious activations in textual explanations. Method: In the visual branch, Multi-Scale Explanation Aggregation (MSEA) dynamically adjusts receptive fields; in the textual branch, Activation Ranking Correlation (ARC) measures contextual relevance through the alignment of prediction rankings and suppresses spurious activations. Result: Experiments show that the method outperforms existing interpretability methods across several state-of-the-art MLLMs and benchmark datasets, producing more accurate and fine-grained explanations. Conclusion: Strengthening intra-modal interaction significantly improves attribution fidelity and explanation quality for multimodal large language models. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failing to effectively mitigate this interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interaction. For the visual branch, we introduce Multi-Scale Explanation Aggregation (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose Activation Ranking Correlation (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-k prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. 
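The idea of scoring a context token by how well its top-k predictions align with those of the current token can be illustrated with a small sketch. The abstract does not give the ARC formula, so the set-overlap score below is an assumed stand-in, and the names `top_k_ids`, `topk_alignment`, and `reweight_attributions` are hypothetical.

```python
from typing import List, Sequence


def top_k_ids(logits: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest logits, highest first."""
    if k <= 0:
        raise ValueError("k must be positive")
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def topk_alignment(logits_a: Sequence[float], logits_b: Sequence[float], k: int) -> float:
    """Share of the top-k predicted ids that two positions have in common.

    Assumed proxy for ranking alignment: 1.0 means identical top-k sets,
    0.0 means disjoint ones.
    """
    shared = set(top_k_ids(logits_a, k)) & set(top_k_ids(logits_b, k))
    return len(shared) / k


def reweight_attributions(
    attributions: Sequence[float],
    context_logits: Sequence[Sequence[float]],
    current_logits: Sequence[float],
    k: int,
) -> List[float]:
    """Scale each context token's attribution by its alignment score."""
    return [
        score * topk_alignment(ctx, current_logits, k)
        for score, ctx in zip(attributions, context_logits)
    ]


current = [5.0, 4.0, 3.0, 2.0, 1.0]
contexts = [[5.0, 4.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 5.0, 4.0]]
weighted = reweight_attributions([0.8, 0.6], contexts, current, k=2)
```

Here the first context token predicts the same top-2 ids as the current token and keeps its attribution, while the second shares none and is suppressed to zero. A rank-aware correlation could replace the set overlap without changing the surrounding logic.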
Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
[248] U-MAN: U-Net with Multi-scale Adaptive KAN Network for Medical Image Segmentation
Bohan Huang,Qianyun Bao,Haoyuan Ma
Main category: cs.CV
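TL;DR: Proposes U-MAN, a network that combines KAN with attention mechanisms; its PAGF and MAN modules address U-Net's semantic gap and insufficient multi-scale feature extraction in medical image segmentation, and it performs well on multiple datasets.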
Details
Motivation: 传统U-Net存在编码器-解码器语义鸿沟和深层缺乏多尺度特征提取能力,导致医学图像分割中细节和边界不精确。 Method: 设计PAGF模块替代简单跳跃连接,利用注意力机制融合特征;引入多尺度自适应KAN(MAN)模块增强多尺度特征处理能力。 Result: 在BUSI、GLAS和CVC三个公开数据集上实验表明,U-MAN优于现有方法,尤其在边界准确性和细节保留方面表现突出。 Conclusion: U-MAN有效提升了医学图像分割性能,特别是在复杂结构和病灶区域的精细分割上具有显著优势。 Abstract: Medical image segmentation faces significant challenges in preserving fine-grained details and precise boundaries due to complex anatomical structures and pathological regions. These challenges primarily stem from two key limitations of conventional U-Net architectures: (1) their simple skip connections ignore the encoder-decoder semantic gap between various features, and (2) they lack the capability for multi-scale feature extraction in deep layers. To address these challenges, we propose the U-Net with Multi-scale Adaptive KAN (U-MAN), a novel architecture that enhances the emerging Kolmogorov-Arnold Network (KAN) with two specialized modules: Progressive Attention-Guided Feature Fusion (PAGF) and the Multi-scale Adaptive KAN (MAN). Our PAGF module replaces the simple skip connection, using attention to fuse features from the encoder and decoder. The MAN module enables the network to adaptively process features at multiple scales, improving its ability to segment objects of various sizes. Experiments on three public datasets (BUSI, GLAS, and CVC) show that U-MAN outperforms state-of-the-art methods, particularly in defining accurate boundaries and preserving fine details.[249] $γ$-Quant: Towards Learnable Quantization for Low-bit Pattern Recognition
Mishal Fatima,Shashank Agnihotri,Marius Bock,Kanchana Vaishnavi Gandikota,Kristof Van Laerhoven,Michael Moeller,Margret Keuper
Main category: cs.CV
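TL;DR: Proposes γ-Quant, a task-specific non-linear quantization method for pattern recognition on low-bit-depth sensor data; at 4 bits it performs on par with raw 12-bit data.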
Details
Motivation: Most existing pattern recognition models rely on data pre-processed for human perception, which may not be optimal for automated tasks without a human in the loop; meanwhile, high-bit data incurs high transmission overhead and energy use on wearable devices, limiting use in low-bandwidth, energy-constrained settings. Method: Proposes γ-Quant, a learnable non-linear quantization method that performs pattern recognition directly on raw low-bit data, applied to raw-image object detection and human activity recognition on wearable devices. Result: Experiments show that learnable quantization with only 4 bits matches the performance of raw 12-bit data on object detection and human activity recognition, significantly reducing data bandwidth and energy consumption. Conclusion: γ-Quant offers an efficient data quantization scheme for pattern recognition in low-power, low-bandwidth settings and demonstrates the effectiveness and potential of task-driven quantization learning on raw data. Abstract: Most pattern recognition models are developed on pre-processed data. In computer vision, for instance, RGB images processed through image signal processing (ISP) pipelines designed to cater to human perception are the most frequent input to image analysis networks. However, many modern vision tasks operate without a human in the loop, raising the question of whether such pre-processing is optimal for automated analysis. Similarly, human activity recognition (HAR) on body-worn sensor data commonly takes normalized floating-point data arising from a high-bit analog-to-digital converter (ADC) as an input, despite such an approach being highly inefficient in terms of data transmission, significantly affecting the battery life of wearable devices. In this work, we target low-bandwidth and energy-constrained settings where sensors are limited to low-bit-depth capture. We propose $\gamma$-Quant, i.e., the task-specific learning of a non-linear quantization for pattern recognition. We exemplify our approach on raw-image object detection as well as HAR of wearable data, and demonstrate that raw data with a learnable quantization using as few as 4 bits can perform on par with the use of raw 12-bit data. All code to reproduce our experiments is publicly available via https://github.com/Mishalfatima/Gamma-Quant
[250] SSVIF: Self-Supervised Segmentation-Oriented Visible and Infrared Image Fusion
Zixian Zhao,Xingchen Zhang
Main category: cs.CV
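TL;DR: Proposes SSVIF, a self-supervised segmentation-oriented visible and infrared image fusion framework that learns without labels from the consistency between feature-level and pixel-level fusion-based segmentation; without manually annotated data, it outperforms traditional methods and rivals supervised ones.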
Details
Motivation: Application-oriented image fusion methods depend on labeled data, which makes data acquisition costly; a label-free self-supervised training framework is therefore needed. Method: Proposes a self-supervised task, cross-segmentation consistency, that uses the agreement between semantic segmentations of feature-level and pixel-level fusion results as the supervisory signal, together with a two-stage training strategy and a dynamic weight adjustment mechanism for effective joint learning. Result: Experiments on public datasets show that, trained only on unlabeled image pairs, the method outperforms traditional methods in fusion performance and is comparable to supervised segmentation-oriented methods. Conclusion: SSVIF enables efficient image fusion training without segmentation labels, offering a low-cost, high-performance solution for application-oriented fusion methods. Abstract: Visible and infrared image fusion (VIF) has gained significant attention in recent years due to its wide application in tasks such as scene segmentation and object detection. VIF methods can be broadly classified into traditional VIF methods and application-oriented VIF methods. Traditional methods focus solely on improving the quality of fused images, while application-oriented VIF methods additionally consider the performance of downstream tasks on fused images by introducing task-specific loss terms during training. However, compared to traditional methods, application-oriented VIF methods require datasets labeled for downstream tasks (e.g., semantic segmentation or object detection), making data acquisition labor-intensive and time-consuming. To address this issue, we propose a self-supervised training framework for segmentation-oriented VIF methods (SSVIF). Leveraging the consistency between feature-level fusion-based segmentation and pixel-level fusion-based segmentation, we introduce a novel self-supervised task, cross-segmentation consistency, that enables the fusion model to learn high-level semantic features without the supervision of segmentation labels. Additionally, we design a two-stage training strategy and a dynamic weight adjustment method for effective joint learning within our self-supervised framework. Extensive experiments on public datasets demonstrate the effectiveness of our proposed SSVIF. Remarkably, although trained only on unlabeled visible-infrared image pairs, our SSVIF outperforms traditional VIF methods and rivals supervised segmentation-oriented ones. 
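To make the consistency signal concrete, the sketch below compares two per-pixel class-probability maps, one standing for the segmentation obtained through feature-level fusion and one for the segmentation obtained through pixel-level fusion. The abstract does not state the loss form, so the mean squared difference used here is an assumption, and `consistency_loss` is a hypothetical name.

```python
from typing import Sequence


def consistency_loss(
    probs_feature: Sequence[Sequence[float]],
    probs_pixel: Sequence[Sequence[float]],
) -> float:
    """Mean squared difference between two per-pixel class distributions.

    Each argument is a list of pixels, and each pixel is a list of class
    probabilities. No segmentation labels are involved: the two
    predictions supervise each other.
    """
    if len(probs_feature) != len(probs_pixel):
        raise ValueError("both maps must cover the same pixels")
    total = 0.0
    count = 0
    for pixel_f, pixel_p in zip(probs_feature, probs_pixel):
        if len(pixel_f) != len(pixel_p):
            raise ValueError("both maps must use the same classes")
        for a, b in zip(pixel_f, pixel_p):
            total += (a - b) ** 2
            count += 1
    return total / count if count else 0.0


agree = consistency_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
disagree = consistency_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

The loss is zero when both branches predict the same map and grows as they diverge, which is the property a label-free training signal of this kind relies on.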
Our code will be released upon acceptance.
[251] Bézier Meets Diffusion: Robust Generation Across Domains for Medical Image Segmentation
Chen Li,Meilong Xu,Xiaoling Hu,Weimin Lyu,Chao Chen
Main category: cs.CV
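TL;DR: Proposes "Bézier Meets Diffusion", a unified framework for cross-domain medical image generation that combines Bézier-curve style transfer with an uncertainty-guided conditional diffusion model, narrowing the domain gap, generating high-quality labeled images, and significantly improving segmentation performance.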
Details
Motivation: 由于不同医学影像模态之间存在较大的域差异,训练鲁棒的学习算法具有挑战性,现有无监督域适应方法在高变异区域难以捕捉跨域映射关系。 Method: 提出Bézier曲线风格迁移策略减少域间差异,并利用伪标签训练条件扩散模型生成高质量目标域图像;引入不确定性引导的得分匹配方法提升扩散模型对噪声伪标签的鲁棒性。 Result: 在公共数据集上的实验表明,该方法能生成逼真的标注图像,显著增强目标域数据并提升分割性能。 Conclusion: 所提框架有效解决了跨模态医学图像分析中的域适应问题,通过协同使用Bézier风格迁移与条件扩散模型,实现了更鲁棒的跨域图像生成与分割。 Abstract: Training robust learning algorithms across different medical imaging modalities is challenging due to the large domain gap. Unsupervised domain adaptation (UDA) mitigates this problem by using annotated images from the source domain and unlabeled images from the target domain to train the deep models. Existing approaches often rely on GAN-based style transfer, but these methods struggle to capture cross-domain mappings in regions with high variability. In this paper, we propose a unified framework, B\'ezier Meets Diffusion, for cross-domain image generation. First, we introduce a B\'ezier-curve-based style transfer strategy that effectively reduces the domain gap between source and target domains. The transferred source images enable the training of a more robust segmentation model across domains. Thereafter, using pseudo-labels generated by this segmentation model on the target domain, we train a conditional diffusion model (CDM) to synthesize high-quality, labeled target-domain images. To mitigate the impact of noisy pseudo-labels, we further develop an uncertainty-guided score matching method that improves the robustness of CDM training. Extensive experiments on public datasets demonstrate that our approach generates realistic labeled images, significantly augmenting the target domain and improving segmentation performance.[252] PSTTS: A Plug-and-Play Token Selector for Efficient Event-based Spatio-temporal Representation Learning
Xiangmo Zhao,Nan Yang,Yang Wang,Zhanwen Liu
Main category: cs.CV
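TL;DR: Proposes Progressive Spatio-Temporal Token Selection (PSTTS), a plug-and-play module for event data representation learning; its two stages, spatial token purification and temporal token selection, significantly improve computational efficiency without sacrificing accuracy.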
Details
Motivation: 现有基于事件帧的方法存在高空间稀疏性和帧间运动冗余问题,导致计算开销大;而适用于RGB视频的Token稀疏化方法因依赖中间表示且忽视事件噪声,难以直接应用于事件数据。 Method: 提出PSTTS模块,包含空间Token净化(评估单帧内事件的时空一致性以去除噪声和非事件区域)和时间Token选择(评估相邻帧间运动模式相似性以剔除时间冗余信息),无需引入额外参数。 Result: 在HARDVS、DailyDVS-200和SeACT数据集上,结合UniformerV2、VideoSwin、EVMamba和ExACT四种主干网络进行实验,PSTTS在DailyDVS-200上减少29%-43.6%的FLOPs,提升21.6%-41.3%的FPS,同时保持任务精度。 Conclusion: PSTTS能有效利用原始事件数据中的时空分布特性,实现事件流中冗余Token的精准识别与剔除,在准确性和效率之间取得良好平衡,具有良好的通用性和实用性。 Abstract: Mainstream event-based spatio-temporal representation learning methods typically process event streams by converting them into sequences of event frames, achieving remarkable performance. However, they neglect the high spatial sparsity and inter-frame motion redundancy inherent in event frame sequences, leading to significant computational overhead. Existing token sparsification methods for RGB videos rely on unreliable intermediate token representations and neglect the influence of event noise, making them ineffective for direct application to event data. In this paper, we propose Progressive Spatio-Temporal Token Selection (PSTTS), a Plug-and-Play module for event data without introducing any additional parameters. PSTTS exploits the spatio-temporal distribution characteristics embedded in raw event data to effectively identify and discard spatio-temporal redundant tokens, achieving an optimal trade-off between accuracy and efficiency. Specifically, PSTTS consists of two stages, Spatial Token Purification and Temporal Token Selection. Spatial Token Purification discards noise and non-event regions by assessing the spatio-temporal consistency of events within each event frame to prevent interference with subsequent temporal redundancy evaluation. Temporal Token Selection evaluates the motion pattern similarity between adjacent event frames, precisely identifying and removing redundant temporal information. 
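The temporal stage described above, which drops tokens whose motion pattern repeats the previous frame, can be sketched as a parameter-free filter. This is an illustration under stated assumptions rather than the PSTTS algorithm: the cosine-similarity measure, the 0.95 threshold, the per-position comparison with the immediately preceding frame, and the names `cosine` and `select_temporal_tokens` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_temporal_tokens(
    frames: Sequence[Sequence[Sequence[float]]],
    threshold: float = 0.95,
) -> List[Tuple[int, int]]:
    """Return (frame_index, token_index) pairs for the tokens to keep.

    All tokens of the first frame are kept. A later token is kept only
    if it differs enough from the token at the same position in the
    previous frame (similarity below the threshold).
    """
    if not frames:
        return []
    kept = [(0, i) for i in range(len(frames[0]))]
    for t in range(1, len(frames)):
        for i, token in enumerate(frames[t]):
            if cosine(token, frames[t - 1][i]) < threshold:
                kept.append((t, i))
    return kept


frames = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0: two tokens
    [[1.0, 0.0], [1.0, 0.0]],  # frame 1: token 0 unchanged, token 1 changed
]
kept = select_temporal_tokens(frames)
```

Only the changed token of the second frame survives, so a downstream backbone would process three tokens instead of four. The filter has no learnable weights, in keeping with the abstract's claim that the module adds no parameters.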
We apply PSTTS to four representative backbones (UniformerV2, VideoSwin, EVMamba, and ExACT) on the HARDVS, DailyDVS-200, and SeACT datasets. Experimental results demonstrate that PSTTS achieves significant efficiency improvements. Specifically, PSTTS reduces FLOPs by 29-43.6% and increases FPS by 21.6-41.3% on the DailyDVS-200 dataset, while maintaining task accuracy. Our code will be available.
[253] Group Critical-token Policy Optimization for Autoregressive Image Generation
Guohui Zhang,Hu Yu,Xiaoxiao Ma,JingHao Zhang,Yaning Pan,Mingde Yao,Jie Xiao,Linjiang Huang,Feng Zhao
Main category: cs.CV
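TL;DR: Proposes GCPO, which identifies critical image tokens in autoregressive visual generation and optimizes them, improving the effectiveness of reinforcement learning with verifiable rewards (RLVR).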
Details
Motivation: Existing methods optimize uniformly over all image tokens, ignoring that tokens contribute differently to training; how to identify critical tokens and optimize them effectively remains underexplored. Method: Critical tokens are identified from three perspectives (causal dependency, entropy-induced spatial structure, and RLVR-focused token diversity), and a dynamic token-wise advantage weight is introduced to encourage exploration. Result: Using only 30% of the image tokens, GCPO outperforms GRPO with all tokens, and its effectiveness is validated on multiple text-to-image benchmarks. Conclusion: GCPO effectively improves critical-token optimization of the policy model in autoregressive visual generation and offers a new direction for efficient reinforcement learning. Abstract: Recent studies have extended Reinforcement Learning with Verifiable Rewards (RLVR) to autoregressive (AR) visual generation and achieved promising progress. However, existing methods typically apply uniform optimization across all image tokens, while the varying contributions of different image tokens to RLVR training remain unexplored. In fact, the key obstacle lies in how to identify more critical image tokens during AR generation and implement effective token-wise optimization for them. To tackle this challenge, we propose Group Critical-token Policy Optimization (GCPO), which facilitates effective policy optimization on critical tokens. We identify the critical tokens in RLVR-based AR generation from three perspectives, specifically: (1) Causal dependency: early tokens fundamentally determine the later tokens and final image effect due to unidirectional dependency; (2) Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridge distinct visual regions; (3) RLVR-focused token diversity: tokens with low visual similarity across a group of sampled images contribute to richer token-level diversity. For these identified critical tokens, we further introduce a dynamic token-wise advantage weight to encourage exploration, based on confidence divergence between the policy model and reference model. By leveraging 30% of the image tokens, GCPO achieves better performance than GRPO with full tokens. 
Extensive experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness of GCPO for AR visual generation.
[254] Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Ruoyu Chen,Xiaoqing Guo,Kangwei Liu,Siyuan Liang,Shiming Liu,Qunli Zhang,Hua Zhang,Xiaochun Cao
Main category: cs.CV
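TL;DR: Proposes EAGLE, a lightweight black-box framework for explaining autoregressive generation in multimodal large language models (MLLMs); by quantifying the influence of language priors and perceptual evidence, it attributes generated tokens faithfully, and experiments on multiple MLLMs show its advantages in faithfulness, localization, and hallucination diagnosis.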
Details
Motivation: 当前对多模态大语言模型中视觉模态对生成token影响程度的理解不足,限制了模型的可解释性和可靠性,因此需要一种有效的方法来分析生成结果对视觉输入的依赖性。 Method: 提出EAGLE框架,通过定义统一充分性和必要性的目标函数,采用贪婪搜索在稀疏化图像区域上进行优化,实现对关键感知区域的归因,并支持模态感知分析以分离语言与视觉对token生成的影响。 Result: 在多个开源MLLM上的实验表明,EAGLE在归因保真度、空间定位准确性和幻觉检测方面优于现有方法,同时显著降低GPU内存消耗。 Conclusion: EAGLE为多模态大语言模型提供了高效且可靠的生成解释方法,增强了模型决策的可解释性,具有良好的实用价值和推广潜力。 Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at https://github.com/RuoyuChen10/EAGLE.[255] Color Names in Vision-Language Models
Alexandra Gomez-Villa,Pablo Hernández-Cámara,Muhammad Atif Butt,Valero Laparra,Jesus Malo,Javier Vazquez-Corral
Main category: cs.CV
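TL;DR: This paper presents the first systematic evaluation of color naming in vision-language models (VLMs). The models perform well on prototypical colors but degrade on non-prototypical ones, and they differ in naming strategy, language bias, and the influence of architecture.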
Details
Motivation: As vision-language models become widespread, understanding whether they name colors the way humans do is essential for effective human-AI interaction.
Method: Classic color naming methodology is applied to five representative VLMs using 957 color samples, together with cross-linguistic analysis and ablation studies.
Result: VLMs are highly accurate on prototypical colors but drop significantly on non-prototypical ones. Twenty-one common color terms are used consistently across models. Models follow different strategies, relying either on basic terms or on extended modifiers. Training data shows a severe bias toward English and Chinese, hue is the main driver of naming decisions, and model architecture significantly affects naming behavior.
Conclusion: Color naming in VLMs is limited by language imbalance in training data and by model architecture, and it is weak on non-prototypical colors. Further work is needed to make color understanding more consistent in human-AI interaction.
Abstract: Color serves as a fundamental dimension of human visual perception and a primary means of communicating about objects and scenes. As vision-language models (VLMs) become increasingly prevalent, understanding whether they name colors like humans is crucial for effective human-AI interaction. We present the first systematic evaluation of color naming capabilities across VLMs, replicating classic color naming methodologies using 957 color samples across five representative models. Our results show that while VLMs achieve high accuracy on prototypical colors from classical studies, performance drops significantly on expanded, non-prototypical color sets. We identify 21 common color terms that consistently emerge across all models, revealing two distinct approaches: constrained models using predominantly basic terms versus expansive models employing systematic lightness modifiers. Cross-linguistic analysis across nine languages demonstrates severe training imbalances favoring English and Chinese, with hue serving as the primary driver of color naming decisions. Finally, ablation studies reveal that language model architecture significantly influences color naming independent of visual processing capabilities.
[256] EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model
Andrii Litvynchuk,Ivan Livinsky,Anand Ravi,Nima Kalantari,Andrii Tsarov
Main category: cs.CV
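TL;DR: This paper proposes EfficientDepth, a monocular depth estimation system that combines a transformer architecture with a lightweight convolutional decoder and a bimodal density head, improving geometric consistency, fine detail, and robustness to real-world scenes while reducing computational cost.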
Details
Motivation: Existing monocular depth estimation methods fall short on geometric consistency, fine detail, robustness to real-world challenges such as reflective surfaces, and efficiency on edge devices, so they struggle to meet the needs of 3D reconstruction and view synthesis.
Method: The architecture pairs a transformer-based encoder with a lightweight convolutional decoder and adds a bimodal density head to improve detail estimation. Training uses synthetic data, labeled real images, and pseudo-labeled real images, with a multi-stage optimization strategy to improve training efficiency and an LPIPS-based loss to strengthen detail recovery.
Result: Experiments show that EfficientDepth matches or exceeds existing state-of-the-art models while requiring significantly fewer computational resources.
Conclusion: Through architectural and training-strategy improvements, EfficientDepth raises the quality of monocular depth estimation while remaining efficient, making it suitable for resource-constrained edge devices and accuracy-sensitive applications.
Abstract: Monocular depth estimation (MDE) plays a pivotal role in various computer vision applications, such as robotics, augmented reality, and autonomous driving. Despite recent advancements, existing methods often fail to meet key requirements for 3D reconstruction and view synthesis, including geometric consistency, fine details, robustness to real-world challenges like reflective surfaces, and efficiency for edge devices. To address these challenges, we introduce a novel MDE system, called EfficientDepth, which combines a transformer architecture with a lightweight convolutional decoder, as well as a bimodal density head that allows the network to estimate detailed depth maps. We train our model on a combination of labeled synthetic and real images, as well as pseudo-labeled real images, generated using a high-performing MDE method. Furthermore, we employ a multi-stage optimization strategy to improve training efficiency and produce models that emphasize geometric consistency and fine detail. Finally, in addition to commonly used objectives, we introduce a loss function based on LPIPS to encourage the network to produce detailed depth maps. Experimental results demonstrate that EfficientDepth achieves performance comparable to or better than existing state-of-the-art models, with significantly reduced computational resources.
[257] Category Discovery: An Open-World Perspective
Zhenqi He,Yuanpei Liu,Kai Han
Main category: cs.CV
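TL;DR: This paper surveys the field of category discovery, proposing a taxonomy that covers novel category discovery (NCD), generalized category discovery (GCD), and several derived settings. It systematically evaluates existing methods, discusses them in terms of representation learning, label assignment, and class-number estimation, and identifies current challenges and future directions.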
Details
Motivation: Category discovery is an important open-world learning task that aims to automatically identify new categories in unlabelled data containing unseen classes. The field is growing quickly, so a systematic review of existing work and a clear statement of future directions are needed.
Method: The survey proposes a unified taxonomy that divides category discovery into base settings (NCD, GCD) and several derived settings. It analyzes methods in terms of three core components (representation learning, label assignment, and class-number estimation) and uses benchmark experiments to summarize the effect of key design choices.
Result: Large-scale pretrained backbones, hierarchical and auxiliary cues, and curriculum-style training all help performance. Challenges remain in label assignment, class-number estimation, and scaling to complex multi-object scenarios.
Conclusion: The survey gives a systematic review and analysis of category discovery, summarizes current best practices and core challenges, and points out promising directions for future research.
Abstract: Category discovery (CD) is an emerging open-world learning task, which aims at automatically categorizing unlabelled data containing instances from unseen classes, given some labelled data from seen classes. This task has attracted significant attention over the years and leads to a rich body of literature trying to address the problem from different perspectives. In this survey, we provide a comprehensive review of the literature, and offer detailed analysis and in-depth discussion on different methods. Firstly, we introduce a taxonomy for the literature by considering two base settings, namely novel category discovery (NCD) and generalized category discovery (GCD), and several derived settings that are designed to address the extra challenges in different real-world application scenarios, including continual category discovery, skewed data distribution, federated category discovery, etc. Secondly, for each setting, we offer a detailed analysis of the methods encompassing three fundamental components, representation learning, label assignment, and estimation of class number. Thirdly, we benchmark all the methods and distill key insights showing that large-scale pretrained backbones, hierarchical and auxiliary cues, and curriculum-style training are all beneficial for category discovery, while challenges remain in the design of label assignment, the estimation of class numbers, and scaling to complex multi-object scenarios. Finally, we discuss the key insights from the literature so far and point out promising future research directions. We compile a living survey of the category discovery literature at https://github.com/Visual-AI/Category-Discovery.
[258] HyCoVAD: A Hybrid SSL-LLM Model for Complex Video Anomaly Detection
Mohammad Mahdi Hemmatyar,Mahdi Jafari,Mohammad Amin Yousefi,Mohammad Reza Nemati,Mobin Azadani,Hamid Reza Rastad,Amirmohammad Akbari
Main category: cs.CV
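TL;DR: Proposes HyCoVAD, a hybrid framework combining self-supervised learning (SSL) with a large language model (LLM) for complex video anomaly detection, which significantly outperforms existing methods on the ComplexVAD dataset.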
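The two-stage flow in this entry (a cheap SSL scorer filters frames, and only the suspects reach the LLM) can be sketched as follows. This is an illustration under stated assumptions: the thresholding rule (mean plus `k` standard deviations) stands in for the paper's adaptive thresholding protocol, which is not specified here, and `validate` is a hypothetical callable standing in for the LLM validator.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def select_suspect_frames(scores: Sequence[float], k: float = 1.0) -> List[int]:
    """Stage 1: keep frames whose anomaly score exceeds mean + k * std (assumed rule)."""
    threshold = mean(scores) + k * pstdev(scores)
    return [i for i, s in enumerate(scores) if s > threshold]


def two_stage_detect(
    scores: Sequence[float],
    validate: Callable[[int], bool],
    k: float = 1.0,
) -> List[int]:
    """Stage 2: call the (hypothetical) LLM validator only on the suspect frames."""
    return [i for i in select_suspect_frames(scores, k) if validate(i)]


# Toy per-frame anomaly scores from the SSL module (made-up values).
ssl_scores = [0.1, 0.1, 0.9, 0.1, 0.8, 0.1]
suspects = select_suspect_frames(ssl_scores)
confirmed = two_stage_detect(ssl_scores, validate=lambda i: i == 2)
```

In this toy run the validator is called on two frames instead of six, which is the mechanism by which such a pipeline would reduce LLM computation.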
Details
Motivation: Existing methods are limited when detecting complex anomalies defined by intricate relationships and temporal dependencies among multiple entities. SSL methods struggle to understand semantic interactions, while LLMs are computationally expensive and lack fine-grained spatial localization.
Method: HyCoVAD combines a multi-task SSL temporal analysis module built on nnFormer with an LLM validation module. The SSL module is trained with proxy tasks and selects frames suspected of anomaly. The LLM module applies structured, rule-based reasoning to validate those frames in their semantic context, forming a two-stage hybrid pipeline.
Result: HyCoVAD reaches 72.5% frame-level AUC on ComplexVAD, improving on existing baselines by 12.5% while reducing LLM computation. The interaction anomaly taxonomy, adaptive thresholding protocol, and code are released.
Conclusion: By combining the efficient spatiotemporal modeling of SSL with the semantic reasoning of LLMs, HyCoVAD improves complex video anomaly detection and offers a scalable, efficient solution.
Abstract: Video anomaly detection (VAD) is crucial for intelligent surveillance, but a significant challenge lies in identifying complex anomalies, which are events defined by intricate relationships and temporal dependencies among multiple entities rather than by isolated actions. While self-supervised learning (SSL) methods effectively model low-level spatiotemporal patterns, they often struggle to grasp the semantic meaning of these interactions. Conversely, large language models (LLMs) offer powerful contextual reasoning but are computationally expensive for frame-by-frame analysis and lack fine-grained spatial localization. We introduce HyCoVAD, Hybrid Complex Video Anomaly Detection, a hybrid SSL-LLM model that combines a multi-task SSL temporal analyzer with an LLM validator. The SSL module is built upon an nnFormer backbone, a transformer-based model for image segmentation. Trained with multiple proxy tasks, it learns from video frames to identify those suspected of containing anomalies. The selected frames are then forwarded to the LLM, which enriches the analysis with semantic context by applying structured, rule-based reasoning to validate the presence of anomalies. Experiments on the challenging ComplexVAD dataset show that HyCoVAD achieves a 72.5% frame-level AUC, outperforming existing baselines by 12.5% while reducing LLM computation. We release our interaction anomaly taxonomy, adaptive thresholding protocol, and code to facilitate future research in complex VAD scenarios.
[259] JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
Shuang Zeng,Dekang Qi,Xinyuan Chang,Feng Xiong,Shichao Xie,Xiaolong Wu,Shiyi Liang,Mu Xu,Xing Wei
Main category: cs.CV
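TL;DR: Proposes JanusVLN, a vision-language navigation framework with a dual implicit neural memory that models spatial-geometric and visual-semantic memory separately, improving navigation efficiency and performance.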
Details
Motivation: Existing methods rely on explicit semantic memory, which suffers from spatial information loss, computational redundancy, and memory bloat, making efficient navigation difficult. Inspired by the division of labor between the left and right hemispheres of the human brain, the authors explore a more efficient implicit memory mechanism.
Method: A dual implicit neural memory models spatial-geometric and visual-semantic memory as compact, fixed-size neural representations. The MLLM is extended to incorporate 3D prior knowledge, and a sliding-window mechanism builds the historical KV cache, enabling incremental updates and efficient computation.
Result: JanusVLN outperforms more than 20 recent methods on multiple benchmarks and achieves SOTA performance. Success rate improves by 10.5-35.5 over methods using multiple data types as input and by 3.6-10.8 over methods using more RGB training data.
Conclusion: Dual implicit neural memory offers a new paradigm for vision-language navigation that balances performance and efficiency and opens directions for future research.
Abstract: Vision-and-Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. However, these methods typically rely on explicit semantic memory, such as building textual cognitive maps or storing historical visual frames. This type of method suffers from spatial information loss, computational redundancy, and memory bloat, which impede efficient navigation. Inspired by the implicit scene representation in human navigation, analogous to the left brain's semantic understanding and the right brain's spatial cognition, we propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations. This framework first extends the MLLM to incorporate 3D prior knowledge from the spatial-geometric encoder, thereby enhancing the spatial reasoning capabilities of models based solely on RGB input. Then, the historical key-value caches from the spatial-geometric and visual-semantic encoders are constructed into a dual implicit memory. By retaining only the KVs of tokens in the initial window and the sliding window, redundant computation is avoided, enabling efficient incremental updates. Extensive experiments demonstrate that JanusVLN outperforms over 20 recent methods to achieve SOTA performance. For example, the success rate improves by 10.5-35.5 compared to methods using multiple data types as input and by 3.6-10.8 compared to methods using more RGB training data. This indicates that the proposed dual implicit neural memory, as a novel paradigm, explores promising new directions for future VLN research. Our project page: https://miv-xjtu.github.io/JanusVLN.github.io/.
[260] SpikeMatch: Semi-Supervised Learning with Temporal Dynamics of Spiking Neural Networks
Jini Yang,Beomseok Oh,Seungryong Kim,Sunok Kim
Main category: cs.CV
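TL;DR: This paper proposes SpikeMatch, the first semi-supervised learning framework for spiking neural networks (SNNs), which uses the temporal dynamics of SNNs to generate diverse pseudo-labels within a co-training framework.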
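To make the role of the leakage factor concrete, here is a small sketch of a leaky integrate-and-fire (LIF) neuron, the standard SNN unit, showing that the same input produces different spike trains under different leak values. The update rule and hard reset are a common textbook form and are assumed here, not taken from the paper. The unanimity rule in `agreed_pseudo_label` is likewise an assumption standing in for the agreement criterion, which this entry does not specify.

```python
from typing import List, Optional, Sequence


def lif_spike_count(inputs: Sequence[float], leak: float, threshold: float = 1.0) -> int:
    """Count spikes of one LIF neuron: v <- leak * v + x, spike and reset when v >= threshold."""
    v = 0.0
    spikes = 0
    for x in inputs:
        v = leak * v + x  # the leak factor decays the previous membrane potential
        if v >= threshold:
            spikes += 1
            v = 0.0  # hard reset after a spike
    return spikes


def agreed_pseudo_label(predictions: List[int]) -> Optional[int]:
    """Return the label only if every prediction agrees (assumed unanimity rule)."""
    return predictions[0] if len(set(predictions)) == 1 else None


current = [0.4] * 10
no_leak = lif_spike_count(current, leak=1.0)      # potential accumulates and crosses threshold
strong_leak = lif_spike_count(current, leak=0.5)  # potential saturates below threshold
```

Because the firing pattern depends on the leak, predictions read out under different leak values can disagree, and that disagreement is what makes an agreement check informative for pseudo-labeling.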
Details
Motivation: Although SNNs are biologically plausible and energy efficient, semi-supervised learning methods for them remain underexplored compared with those for artificial neural networks.
Method: SpikeMatch uses agreement among multiple predictions from a single SNN to generate reliable pseudo-labels on weakly augmented unlabeled samples, which are then used to train on strongly augmented samples. The leakage factor is used to capture temporal dynamics.
Result: Experiments show that SpikeMatch outperforms existing SSL methods adapted to SNN backbones on several standard benchmarks.
Conclusion: SpikeMatch mitigates confirmation bias and improves SNN performance in semi-supervised settings, opening a new direction for SSL research on SNNs.
Abstract: Spiking neural networks (SNNs) have recently been attracting significant attention for their biological plausibility and energy efficiency, but semi-supervised learning (SSL) methods for SNN-based models remain underexplored compared to those for artificial neural networks (ANNs). In this paper, we introduce SpikeMatch, the first SSL framework for SNNs that leverages the temporal dynamics through the leakage factor of SNNs for diverse pseudo-labeling within a co-training framework. By utilizing agreement among multiple predictions from a single SNN, SpikeMatch generates reliable pseudo-labels from weakly-augmented unlabeled samples to train on strongly-augmented ones, effectively mitigating confirmation bias by capturing discriminative features with limited labels. Experiments show that SpikeMatch outperforms existing SSL methods adapted to SNN backbones across various standard benchmarks.
[261] LongLive: Real-time Interactive Long Video Generation
Shuai Yang,Wei Huang,Ruihang Chu,Yicheng Xiao,Yuyang Zhao,Xianbang Wang,Muyang Li,Enze Xie,Yingcong Chen,Yao Lu,Song Han,Yukang Chen
Main category: cs.CV
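TL;DR: Proposes LongLive, a frame-level autoregressive framework for real-time interactive long video generation. KV recaching, streaming long tuning, and a frame-level attention sink let it generate long videos efficiently while preserving quality.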
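The combination of short window attention with a frame-level attention sink can be pictured as a rule for which earlier frames the current frame may attend to. The sketch below is one plausible reading under stated assumptions: the first `num_sink` frames are always kept and the rest of the context is a causal local window. The function name and parameters are hypothetical and not LongLive's API.

```python
from typing import List


def visible_frames(current: int, window: int, num_sink: int) -> List[int]:
    """Frames that frame `current` may attend to: sink frames plus a causal local window."""
    # Sink frames: the earliest frames, kept permanently to anchor long-range consistency.
    sink = range(min(num_sink, current + 1))
    # Local window: the most recent `window` frames, including the current one.
    local = range(max(0, current - window + 1), current + 1)
    return sorted(set(sink) | set(local))


late = visible_frames(current=10, window=3, num_sink=1)
early = visible_frames(current=1, window=3, num_sink=1)
```

The number of attended frames is bounded by `window + num_sink` regardless of video length, which is why this kind of pattern keeps per-frame cost roughly constant.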
Details
Motivation: Long video generation is challenging in both efficiency and quality. Existing diffusion models are inefficient, causal attention models face memory challenges in long-video training, and support for interactive capabilities such as dynamic prompt inputs is lacking.
Method: LongLive adopts a causal, frame-level autoregressive architecture. It introduces a KV-recache mechanism to adapt to new prompts, uses streaming long tuning to align training and inference (train-long-test-long), and combines short window attention with a frame-level attention sink to preserve long-range consistency while speeding up generation.
Result: A 1.3B-parameter short-clip model is fine-tuned to minute-long generation in just 32 GPU-days. LongLive reaches 20.7 FPS on a single NVIDIA H100, supports videos of up to 240 seconds, supports INT8-quantized inference with only marginal quality loss, and performs strongly on VBench.
Conclusion: LongLive delivers efficient, high-quality, interactive long video generation, addressing efficiency, quality, and responsiveness to dynamic prompts, and shows potential for practical use.
Abstract: We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shortened to frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
[262] SPARK: Synergistic Policy And Reward Co-Evolving Framework
Ziyu Liu,Yuhang Zang,Shengyuan Ding,Yuhang Cao,Xiaoyi Dong,Haodong Duan,Dahua Lin,Jiaqi Wang
Main category: cs.CV
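TL;DR: Proposes SPARK, a framework that recycles verification signals and trains a generative reward model so that policy and reward co-evolve, improving large models on reasoning and general tasks.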
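One way to picture "recycling" rollouts and correctness signals is to turn them into training examples for the reward objectives named in this entry. The sketch below covers only the pairwise comparison case and is an assumption about the data construction, not the paper's procedure; the function name and the `(response, is_correct)` tuple format are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def build_pairwise_examples(
    rollouts: List[Tuple[str, bool]],
) -> List[Tuple[str, str]]:
    """Pair every verified-correct response with every incorrect one as (preferred, rejected)."""
    correct = [response for response, ok in rollouts if ok]
    wrong = [response for response, ok in rollouts if not ok]
    return list(product(correct, wrong))


# Toy rollouts for one prompt, each tagged with a verifier's correctness signal.
rollouts = [("answer A", True), ("answer B", False), ("answer C", False)]
pairs = build_pairwise_examples(rollouts)
```

A prompt whose rollouts are all correct or all wrong yields no pairs under this rule, so a real pipeline would need the pointwise objective or another signal for those cases.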
Details
Motivation: Existing RL methods such as RLHF and RLVR suffer from high cost, reward-policy mismatch, or wasted supervision signals, so a more efficient and stable post-training method is needed.
Method: SPARK builds on RLVR and reuses rollouts and correctness signals to jointly train the policy and a generative reward model, with a mix of objectives including pointwise scoring, pairwise comparison, and evaluation conditioned on self-reflection.
Result: SPARK significantly improves multiple LLMs and LVLMs. For example, SPARK-VL-7B gains an average of 9.7% on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks.
Conclusion: SPARK lets the policy and reward model co-evolve without an external reward model or human preference data, supports test-time scaling through self-reflection, and shows strong robustness and generalization.
Abstract: Recent Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) increasingly use Reinforcement Learning (RL) for post-pretraining, such as RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. However, RLHF incurs high costs and potential reward-policy mismatch due to reliance on human preferences, while RLVR still wastes supervision by discarding rollouts and correctness signals after each update. To address these challenges, we introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. This auxiliary training uses a mix of objectives, such as pointwise reward score, pairwise comparison, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. Our process eliminates the need for a separate reward model and costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. Our unified framework supports test-time scaling via self-reflection without external reward models and their associated costs. We show that SPARK achieves significant performance gains on multiple LLM and LVLM models and on multiple reasoning, reward-model, and general benchmarks. For example, SPARK-VL-7B achieves an average 9.7% gain on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks over the baselines, demonstrating robustness and broad generalization.
[263] CCNeXt: An Effective Self-Supervised Stereo Depth Estimation Approach
Alexandre Lopes,Roberto Souza,Helio Pedrini
Main category: cs.CV
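TL;DR: Proposes CCNeXt, a new self-supervised convolutional approach for efficient depth estimation that reaches state-of-the-art performance on several datasets at significantly lower computational cost.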
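Stereo methods such as the one in this entry predict disparity, and depth follows from it in a rectified system through the standard pinhole relation depth = focal length × baseline / disparity. The short sketch below shows that conversion; the camera values in the example are made up for illustration and are not taken from KITTI or the paper.

```python
def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Convert disparity (pixels) to depth (meters) for a rectified stereo pair."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative values are invalid.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Illustrative camera: 720 px focal length, 0.5 m baseline (hypothetical values).
depth_m = disparity_to_depth(disparity_px=36.0, focal_px=720.0, baseline_m=0.5)
```

Because depth is inversely proportional to disparity, a fixed disparity error costs far more depth accuracy for distant points than for near ones.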
Details
Motivation: Reliable ground-truth depth is hard to obtain across diverse scenarios and computational resources are constrained, so an efficient self-supervised depth estimation method is needed.
Method: The CCNeXt architecture uses a modern CNN feature extractor with a novel windowed epipolar cross-attention module, together with a redesigned depth estimation decoder, and is trained in a self-supervised manner.
Result: CCNeXt achieves competitive results on the KITTI Eigen Split test data while being 10.18× faster than the current best model, and reaches state-of-the-art results on all metrics on the KITTI Eigen Split Improved Ground Truth and Driving Stereo datasets.
Conclusion: CCNeXt significantly improves self-supervised depth estimation at low computational cost and suits practical applications such as robotics, autonomous driving, and augmented reality.
Abstract: Depth Estimation plays a crucial role in recent applications in robotics, autonomous vehicles, and augmented reality. These scenarios commonly operate under constraints imposed by computational power. Stereo image pairs offer an effective solution for depth estimation, since one only needs to estimate the disparity of pixels in image pairs to determine the depth in a known rectified system. Due to the difficulty in acquiring reliable ground-truth depth data across diverse scenarios, self-supervised techniques emerge as a solution, particularly when large unlabeled datasets are available. We propose a novel self-supervised convolutional approach that outperforms existing state-of-the-art Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) while balancing computational cost. The proposed CCNeXt architecture employs a modern CNN feature extractor with a novel windowed epipolar cross-attention module in the encoder, complemented by a comprehensive redesign of the depth estimation decoder. Our experiments demonstrate that CCNeXt achieves competitive metrics on the KITTI Eigen Split test data while being 10.18× faster than the current best model and achieves state-of-the-art results in all metrics in the KITTI Eigen Split Improved Ground Truth and Driving Stereo datasets when compared to recently proposed techniques. To ensure complete reproducibility, our project is accessible at https://github.com/alelopes/CCNext.
[264] UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning
Hongyu Chen,Guangrun Wang
Main category: cs.CV
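TL;DR: This paper proposes UML-CoT, a structured chain-of-thought framework based on the Unified Modeling Language (UML). Using class diagrams and activity diagrams, it improves the interpretability and executability of large models in embodied tasks and outperforms conventional CoT on room cleaning.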
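The training pipeline in this entry uses Group Relative Policy Optimization (GRPO), whose core step is to score each sampled response relative to the other responses for the same prompt. A minimal sketch of that normalization follows, in the commonly used form (reward minus group mean, divided by group standard deviation); the `eps` term and the use of population standard deviation are implementation assumptions, and the rewards are made up.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response in the group gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: four sampled plans, two of which executed successfully (reward 1).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful plans receive positive advantages and failed ones negative, with no separate value network needed, which is the main practical appeal of the group-relative form.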
Details
Motivation: Conventional chain-of-thought (CoT) prompting lacks structure, which limits interpretability and executability in embodied tasks. Existing structured approaches model only low-order relations and lack inheritance, behavioral abstraction, and standardized semantics for sequential and conditional planning.
Method: UML-CoT represents object semantics with UML class diagrams and control flow with activity diagrams. A three-stage training pipeline combines supervised fine-tuning with Group Relative Policy Optimization (GRPO), including reward learning from answer-only data.
Result: On MRoom-30k, a newly built benchmark of cluttered room-cleaning scenarios, UML-CoT outperforms unstructured CoT in interpretability, planning coherence, and execution success.
Conclusion: UML is a more expressive and actionable formalism for structured reasoning and improves the reasoning and execution of large models on complex tasks.
Abstract: Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), but its reliance on unstructured text limits interpretability and executability in embodied tasks. Prior work has explored structured CoTs using scene or logic graphs, yet these remain fundamentally limited: they model only low-order relations, lack constructs like inheritance or behavioral abstraction, and provide no standardized semantics for sequential or conditional planning. We propose UML-CoT, a structured reasoning and planning framework that leverages Unified Modeling Language (UML) to generate symbolic CoTs and executable action plans. UML class diagrams capture compositional object semantics, while activity diagrams model procedural control flow. Our three-stage training pipeline combines supervised fine-tuning with Group Relative Policy Optimization (GRPO), including reward learning from answer-only data. We evaluate UML-CoT on MRoom-30k, a new benchmark of cluttered room-cleaning scenarios. UML-CoT outperforms unstructured CoTs in interpretability, planning coherence, and execution success, highlighting UML as a more expressive and actionable structured reasoning formalism.
[265] Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance
Luc Boudier,Loris Manganelli,Eleftherios Tsonis,Nicolas Dufour,Vicky Kalogeiton
Main category: cs.CV
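TL;DR: Proposes DIPSY, a training-free image generation method for few-shot image classification that improves classification performance through an extended guidance scheme and a sampling strategy.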
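Classifier-free guidance normally combines one conditional and one unconditional noise prediction. An extension with independent positive and negative image conditioning could take the form sketched below, where the prediction is pushed toward the positive condition and away from the negative one with separate weights. This particular formula is an assumption for illustration, since DIPSY's exact scheme is not given in this entry, and the vectors are toy values rather than real model outputs.

```python
from typing import List, Sequence


def dual_guidance(
    eps_uncond: Sequence[float],
    eps_pos: Sequence[float],
    eps_neg: Sequence[float],
    w_pos: float,
    w_neg: float,
) -> List[float]:
    """Assumed form: eps_u + w_pos * (eps_pos - eps_u) - w_neg * (eps_neg - eps_u)."""
    return [
        u + w_pos * (p - u) - w_neg * (n - u)
        for u, p, n in zip(eps_uncond, eps_pos, eps_neg)
    ]


# Toy 2-D "noise predictions" (made-up numbers).
guided = dual_guidance(
    eps_uncond=[0.0, 0.0],
    eps_pos=[1.0, 0.0],
    eps_neg=[0.0, 1.0],
    w_pos=2.0,
    w_neg=1.0,
)
```

Setting `w_neg` to zero recovers ordinary classifier-free guidance on the positive condition, so the negative weight is the only added control in this form.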
Details
Motivation: Few-shot image classification is hard because labeled examples are scarce. Existing methods usually need extensive fine-tuning or external information, so a training-free method that does not depend on external tools is needed.
Method: DIPSY uses IP-Adapter for image-to-image translation, introduces an extended classifier-free guidance scheme and a class similarity-based sampling strategy, and builds a simple pipeline that needs no fine-tuning or external captioning.
Result: The method achieves state-of-the-art or comparable performance on ten benchmark datasets and is especially strong on fine-grained classification.
Conclusion: DIPSY uses dual image prompting with positive-negative guidance to generate discriminative features, significantly improving few-shot classification without model fine-tuning or external tools.
Abstract: Few-shot image classification remains challenging due to the limited availability of labeled examples. Recent approaches have explored generating synthetic training data using text-to-image diffusion models, but often require extensive model fine-tuning or external information sources. We present a novel training-free approach, called DIPSY, that leverages IP-Adapter for image-to-image translation to generate highly discriminative synthetic images using only the available few-shot examples. DIPSY introduces three key innovations: (1) an extended classifier-free guidance scheme that enables independent control over positive and negative image conditioning; (2) a class similarity-based sampling strategy that identifies effective contrastive examples; and (3) a simple yet effective pipeline that requires no model fine-tuning or external captioning and filtering. Experiments across ten benchmark datasets demonstrate that our approach achieves state-of-the-art or comparable performance, while eliminating the need for generative model adaptation or reliance on external tools for caption generation and image filtering. Our results highlight the effectiveness of leveraging dual image prompting with positive-negative guidance for generating class-discriminative features, particularly for fine-grained classification tasks.
[266] Scale-Wise VAR is Secretly Discrete Diffusion
Amandeep Kumar,Nithin Gopalakrishnan Nair,Vishal M. Patel
Main category: cs.CV
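TL;DR: This paper revisits VAR and shows that, with a Markovian attention mask, it is equivalent to a discrete diffusion model. The resulting view, called SRDD, builds a theoretical bridge between autoregressive transformers and diffusion models.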
Details
Motivation: VAR performs well in visual generation, but its theoretical connection to diffusion models has been unclear. This paper aims to expose that connection and use the strengths of diffusion models to improve VAR.
Method: By introducing a Markovian attention mask, the authors show mathematically that VAR is equivalent to a discrete diffusion process. On that basis they propose the SRDD framework, which brings the iterative refinement of diffusion models into VAR.
Result: SRDD achieves faster convergence, lower inference cost, and better zero-shot reconstruction across multiple datasets, improving both generation efficiency and quality.
Conclusion: VAR and discrete diffusion models are fundamentally connected, and SRDD provides a theoretical basis and practical benefits for unifying autoregressive and diffusion models.
Abstract: Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency and unified architecture with language and vision. Among them, next scale prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion. We term this reinterpretation Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion, such as iterative refinement, into VAR and reduce architectural inefficiencies, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion-based perspective of VAR leads to consistent gains in efficiency and generation.
[267] Hierarchical Representation Matching for CLIP-based Class-Incremental Learning
Zhen-Hao Wen,Yan Wang,Ji Feng,Han-Jia Ye,De-Chuan Zhan,Da-Wei Zhou
Main category: cs.CV
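TL;DR: This paper proposes HERMAN, a CLIP-based class-incremental learning method that uses large language models to generate hierarchically structured text descriptions and matches them with visual features at multiple levels, improving fine-grained recognition and mitigating catastrophic forgetting.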
Details
Motivation: Existing CIL methods use simple templates and a single deep-layer feature, ignoring the hierarchical nature of visual concepts, which limits performance when separating coarse-grained from fine-grained categories.
Method: The HERMAN framework uses an LLM to recursively generate hierarchical, discriminative text descriptors, then adaptively matches and routes them to visual features at different levels of CLIP, making effective use of multi-level semantic information.
Result: Extensive experiments on multiple benchmarks show that the method consistently achieves state-of-the-art performance in class-incremental learning.
Conclusion: Hierarchical representation matching strengthens CLIP for class-incremental learning and offers a new approach to catastrophic forgetting and fine-grained discrimination.
Abstract: Class-Incremental Learning (CIL) aims to endow models with the ability to continuously adapt to evolving data streams. Recent advances in pre-trained vision-language models (e.g., CLIP) provide a powerful foundation for this task. However, existing approaches often rely on simplistic templates, such as "a photo of a [CLASS]", which overlook the hierarchical nature of visual concepts. For example, recognizing "cat" versus "car" depends on coarse-grained cues, while distinguishing "cat" from "lion" requires fine-grained details. Similarly, the current feature mapping in CLIP relies solely on the representation from the last layer, neglecting the hierarchical information contained in earlier layers. In this work, we introduce HiErarchical Representation MAtchiNg (HERMAN) for CLIP-based CIL. Our approach leverages LLMs to recursively generate discriminative textual descriptors, thereby augmenting the semantic space with explicit hierarchical cues. These descriptors are matched to different levels of the semantic hierarchy and adaptively routed based on task-specific requirements, enabling precise discrimination while alleviating catastrophic forgetting in incremental tasks. Extensive experiments on multiple benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
[268] RefAM: Attention Magnets for Zero-Shot Referral Segmentation
Anna Kukleva,Enis Simsar,Alessio Tonioni,Muhammad Ferjad Naeem,Federico Tombari,Jan Eric Lenssen,Bernt Schiele
Main category: cs.CV
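TL;DR: This paper proposes RefAM, a training-free referring segmentation framework that uses attention features from diffusion models and improves localization accuracy by handling stop words and global attention sinks.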