Table of Contents
cs.CL
[1] Overview of PAN 2026: Voight-Kampff Generative AI Detection, Text Watermarking, Multi-Author Writing Style Analysis, Generative Plagiarism Detection, and Reasoning Trajectory Detection
Janek Bevendorff,Maik Fröbe,André Greiner-Petter,Andreas Jakoby,Maximilian Mayerl,Preslav Nakov,Henry Plutz,Martin Potthast,Benno Stein,Minh Ngoc Ta,Yuxia Wang,Eva Zangerle
Main category: cs.CL
TL;DR: PAN 2026 workshop introduces five stylometry and text forensics tasks—including AI detection, text watermarking, multi-author analysis, generative plagiarism detection, and reasoning trajectory detection—with emphasis on reproducibility via Docker submissions on TIRA.
Details
Motivation: To advance computational stylometry and text forensics through objective, reproducible evaluation. Method: Organizing five benchmark tasks with standardized software submission via Docker containers on the TIRA platform. Result: Five defined tasks for PAN 2026, continuing prior efforts and introducing new challenges in AI-generated text analysis and authorship attribution. Conclusion: PAN 2026 strengthens the community’s capacity for rigorous, reproducible evaluation in text forensics and stylometry, building on over a decade of submissions (>1,100 since 2012). Abstract: The goal of the PAN workshop is to advance computational stylometry and text forensics via objective and reproducible evaluation. In 2026, we run the following five tasks: (1) Voight-Kampff Generative AI Detection, particularly in mixed and obfuscated authorship scenarios, (2) Text Watermarking, a new task that aims to find new and benchmark the robustness of existing text watermarking schemes, (3) Multi-author Writing Style Analysis, a continued task that aims to find positions of authorship change, (4) Generative Plagiarism Detection, a continued task that targets source retrieval and text alignment between generated text and source documents, and (5) Reasoning Trajectory Detection, a new task that deals with source detection and safety detection of LLM-generated or human-written reasoning trajectories. As in previous years, PAN invites software submissions as easy-to-reproduce Docker containers for most of the tasks. Since PAN 2012, more than 1,100 submissions have been made this way via the TIRA experimentation platform.
[2] Measuring Inclusion in Interaction: Inclusion Analytics for Human-AI Collaborative Learning
Jaeyoon Choi,Nia Nixon
Main category: cs.CL
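TL;DR: This paper proposes inclusion analytics, a discourse-based framework for assessing inclusion in collaborative problem solving (CPS) as a dynamic, interactional process in AI and education. It covers three dimensions (participation equity, affective climate, and epistemic equity) and is demonstrated on simulated and empirical data.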
Details
Motivation: Existing assessments of inclusion, equity, and access in AI and education mostly rely on coarse sample descriptors or post-hoc self-reports, which fail to capture how inclusion emerges dynamically during collaborative problem solving. Method: Builds a discourse-based inclusion analytics framework that defines inclusion along three complementary dimensions (participation equity, affective climate, and epistemic equity) and analyzes it with scalable, interaction-level measures; simulated conversations and empirical data from human-AI teaming experiments are used to demonstrate it. Result: The framework surfaces inclusion-related patterns of participation, relational dynamics, and idea uptake that are invisible to aggregate or post-hoc evaluations. Conclusion: The study is an initial step toward process-oriented measurement of inclusion in human-AI collaborative learning environments. Abstract: Inclusion, equity, and access are widely valued in AI and education, yet are often assessed through coarse sample descriptors or post-hoc self-reports that miss how inclusion is shaped moment by moment in collaborative problem solving (CPS). In this proof-of-concept paper, we introduce inclusion analytics, a discourse-based framework for examining inclusion as a dynamic, interactional process in CPS. We conceptualize inclusion along three complementary dimensions -- participation equity, affective climate, and epistemic equity -- and demonstrate how these constructs can be made analytically visible using scalable, interaction-level measures. Using both simulated conversations and empirical data from human-AI teaming experiments, we illustrate how inclusion analytics can surface patterns of participation, relational dynamics, and idea uptake that remain invisible to aggregate or post-hoc evaluations. This work represents an initial step toward process-oriented approaches to measuring inclusion in human-AI collaborative learning environments.
[3] Effective Reasoning Chains Reduce Intrinsic Dimensionality
Archiki Prasad,Mandar Joshi,Kenton Lee,Mohit Bansal,Peter Shaw
Main category: cs.CL
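TL;DR: This paper proposes intrinsic dimensionality as a new metric for quantifying the effectiveness of reasoning chains, finding that effective reasoning strategies lower a task's intrinsic dimensionality and thereby improve generalization.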
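To make the definition concrete, here is a minimal Python sketch of the intrinsic dimensionality measure described in this entry: the smallest number of trainable dimensions at which a task formulation reaches a given accuracy threshold. The function name, the sweep format, and all numbers are illustrative assumptions, not code or values from the paper.

```python
from typing import Optional, Sequence, Tuple


def intrinsic_dimension(
    sweep: Sequence[Tuple[int, float]], threshold: float
) -> Optional[int]:
    """Return the smallest dimension whose accuracy meets the threshold.

    `sweep` holds (trainable_dimensions, accuracy) pairs from a
    hypothetical sweep; None means the threshold was never reached.
    """
    reached = [dim for dim, acc in sweep if acc >= threshold]
    return min(reached) if reached else None


# Made-up sweeps for two formulations of the same task (assumed numbers).
direct_answer = [(100, 0.10), (1_000, 0.25), (10_000, 0.42)]
with_reasoning = [(100, 0.20), (1_000, 0.45), (10_000, 0.50)]

dim_direct = intrinsic_dimension(direct_answer, threshold=0.40)
dim_reasoning = intrinsic_dimension(with_reasoning, threshold=0.40)
```

With these invented numbers, the formulation that includes a reasoning chain reaches the threshold at a smaller dimension, which is the pattern the entry reports for effective strategies.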
Details
Motivation: The mechanisms by which reasoning strategies such as chain-of-thought improve model generalization are poorly understood, and a consistent, quantifiable explanation is lacking. Method: Proposes intrinsic dimensionality as a quantitative measure of reasoning-chain effectiveness; with the model architecture fixed and the task formulation varied through different reasoning strategies, its relationship to generalization is validated on GSM8K with Gemma-3 models. Result: Effective reasoning strategies markedly reduce the task's intrinsic dimensionality, and intrinsic dimensionality shows a strong inverse correlation with generalization on both in-distribution and out-of-distribution data. Conclusion: Reasoning chains facilitate learning by compressing the task more efficiently with fewer parameters, and intrinsic dimensionality can serve as a new quantitative metric for analyzing reasoning processes. Abstract: Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.
[4] Don't Shoot The Breeze: Topic Continuity Model Using Nonlinear Naive Bayes With Attention
Shu-Ting Pi,Pradeep Bagavan,Yejia Li,Disha,Qun Liu
Main category: cs.CL
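TL;DR: This paper proposes an interpretable topic continuity model, built on a Naive Bayes expansion with an attention mechanism, for assessing whether an LLM chatbot response stays on the initial topic. It handles conversations of any length with linear time complexity and clearly outperforms traditional methods on long, complex conversations.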
Details
Motivation: Addresses abrupt topic shifts when LLMs serve as business chatbots, which lead to poor user experience and wasted computational resources. Method: Expands a natural language understanding (NLU) model into quantifiable terms using Naive Bayes, then adds an attention mechanism and logarithmic nonlinearity to better capture topic continuity, yielding an interpretable analytical formula. Result: The model consistently outperforms traditional methods on long, complex conversations, runs in linear time, handles conversations of any length, and identifies topic continuity more reliably. Conclusion: The model offers a new path toward responsible, interpretable use of LLMs in business settings. Abstract: Utilizing Large Language Models (LLM) as chatbots in diverse business scenarios often presents the challenge of maintaining topic continuity. Abrupt shifts in topics can lead to poor user experiences and inefficient utilization of computational resources. In this paper, we present a topic continuity model aimed at assessing whether a response aligns with the initial conversation topic. Our model is built upon the expansion of the corresponding natural language understanding (NLU) model into quantifiable terms using a Naive Bayes approach. Subsequently, we have introduced an attention mechanism and logarithmic nonlinearity to enhance its capability to capture topic continuity. This approach allows us to convert the NLU model into an interpretable analytical formula. In contrast to many NLU models constrained by token limits, our proposed model can seamlessly handle conversations of any length with linear time complexity. Furthermore, the attention mechanism significantly improves the model's ability to identify topic continuity in complex conversations. According to our experiments, our model consistently outperforms traditional methods, particularly in handling lengthy and intricate conversations. This unique capability offers us an opportunity to ensure the responsible and interpretable use of LLMs.
[5] Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization
Mykola Khandoga,Rui Yuan,Vinay Kumar Sankarapu
Main category: cs.CL
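TL;DR: This paper proposes counterfactual importance weighting, a method for improving policy gradient training for language model reasoning. It estimates token importance by masking reasoning spans and measuring the drop in answer probability, then adjusts gradient updates accordingly, with no auxiliary models or human annotation. Experiments on GSM8K show that it outperforms uniform credit assignment baselines and converges faster; ablations and analysis support the claim that it captures genuine causal structure.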
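The weighting idea summarized in this entry (mask a span, measure the drop in answer probability, upweight its tokens) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `floor` term, the mean-one normalization, and every number are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def span_importance(base_logprob: float, masked_logprobs: List[float]) -> List[float]:
    """Drop in answer probability when each span is masked, clipped at zero."""
    base_p = math.exp(base_logprob)
    return [max(0.0, base_p - math.exp(lp)) for lp in masked_logprobs]


def token_weights(
    importances: List[float], span_lengths: List[int], floor: float = 0.1
) -> List[float]:
    """Give every token its span's importance, then rescale to mean 1.

    `floor` (an assumption) keeps low-importance tokens from getting
    a zero weight.
    """
    raw: List[float] = []
    for importance, length in zip(importances, span_lengths):
        raw.extend([floor + importance] * length)
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


# Hypothetical numbers: the answer probability is 0.9 with the full
# reasoning, 0.88 with a filler span masked, 0.2 with the calculation masked.
importances = span_importance(math.log(0.9), [math.log(0.88), math.log(0.2)])
weights = token_weights(importances, span_lengths=[3, 5])
# `weights` would multiply per-token advantages in a policy gradient loss.
```

In this toy case the five calculation tokens end up with a much larger weight than the three filler tokens, while the average weight stays at 1 so the overall update scale is unchanged.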
Details
Motivation: Existing policy gradient methods such as GRPO and DAPO assign uniform credit to all generated tokens and cannot distinguish critical reasoning steps from irrelevant filler, which makes training inefficient. Method: Proposes counterfactual importance weighting: mask different token spans in the reasoning, use the resulting drop in answer probability as each span's importance score, and weight each token's gradient by that score in the policy gradient update. All computation relies on the policy model's own outputs, with no external model or annotation. Result: On GSM8K, the method consistently beats uniform credit assignment baselines across three models from the Qwen and Llama families and converges faster; inverting the importance signal degrades performance, which supports its validity; analysis shows that it prioritizes calculation steps over scaffolding text. Conclusion: Counterfactual importance weighting needs no additional components and offers a foundation for further work on credit assignment in language model reasoning, although it is not a complete solution. Abstract: Policy gradient methods for language model reasoning, such as GRPO and DAPO, assign uniform credit to all generated tokens - the filler phrase "Let me think" receives the same gradient update as the critical calculation "23 + 45 = 68." We propose counterfactual importance weighting: mask reasoning spans, measure the drop in answer probability, and upweight tokens accordingly during policy gradient updates. Our method requires no auxiliary models or external annotation; instead, importance is estimated directly from the policy model's own probability shifts. Experiments on GSM8K across three models spanning the Qwen and Llama families demonstrate consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming we capture genuine causal structure rather than noise. Analysis shows the method correctly prioritizes calculation steps over scaffolding text. We view these findings as establishing counterfactual importance weighting as a foundation for further research rather than a complete solution.
[6] FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding
Siyuan Huang,Ziyu Wang,Chao Pan,Han Zhao
Main category: cs.CL
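TL;DR: This paper proposes FM SO.P, which uses progressive task mixtures and a multi-agent evaluation system to improve language models' understanding of Standard Operating Procedures (SOPs) and their cross-domain generalization. On the SOPBench benchmark it matches or exceeds larger models with smaller model sizes.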
Details
Motivation: Existing language models struggle to understand and generalize over SOPs, mainly because joint training cannot differentiate the three reasoning capabilities that SOPs require: terminology precision, sequential ordering, and constraint reasoning. Method: Proposes FM SO.P with two components. First, progressive task mixtures train in stages on concept disambiguation (terminology precision), action sequence understanding (procedural correctness), and scenario-aware graph reasoning (conditional logic). Second, an automatic multi-agent evaluation system uses three agents that adaptively generate rubrics, stratified test sets, and rubric scores, adapting to domain-specific constraints. Result: On SOPBench across seven domains, the 32B model reaches a 48.3% pass rate and the open-source 7B model reaches 34.3%, matching Qwen-2.5-72B-Instruct (34.4%) with roughly one tenth of the parameters. Conclusion: FM SO.P decouples and jointly strengthens the core reasoning capabilities that SOPs require and, combined with adaptive evaluation, yields efficient and generalizable SOP understanding. Abstract: Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between the reasoning capabilities that SOPs require: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, solving these challenges through two novelties. First, we introduce progressive task mixtures that build capabilities by stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system consisting of three agents that adaptively generate rubrics, stratified test sets, and rubric scoring, adapting to domains (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves 48.3% pass rate with our 32B model and 34.3% with our open-source 7B model, matching Qwen-2.5-72B-Instruct baseline (34.4%) with 10x fewer parameters.
[7] Understanding Risk and Dependency in AI Chatbot Use from User Discourse
Jianfeng Zhu,Karin G. Coifman,Ruoming Jin
Main category: cs.CL
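TL;DR: Through a large-scale computational thematic analysis of posts from two Reddit communities focused on AI-related harm, this study identifies five experiential dimensions of AI-related psychological risk. Self-regulation difficulties are the most prevalent, and fear concentrates on autonomy, control, and technical risk.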
Details
Motivation: Although generative AI systems are increasingly embedded in everyday life, empirical understanding of how users develop, experience, and regulate psychological risk associated with AI use remains limited. Method: Conducts a large-scale computational thematic analysis of posts collected between 2023 and 2025 from the Reddit communities r/AIDangers and r/ChatbotAddiction, using a multi-agent, LLM-assisted thematic analysis grounded in Braun and Clarke's reflexive framework, plus a BERT-based emotion classifier to characterize affective patterns. Result: Identifies 14 recurring thematic categories, synthesized into five higher-order experiential dimensions; self-regulation difficulties are the most prevalent issue, and fear concentrates on autonomy, control, and technical risk. Conclusion: The study provides early empirical evidence, drawn from lived user experience, of how AI safety is perceived and emotionally experienced, offering a foundation for future AI safety research, evaluation, and responsible governance. Abstract: Generative AI systems are increasingly embedded in everyday life, yet empirical understanding of how psychological risk associated with AI use emerges, is experienced, and is regulated by users remains limited. We present a large-scale computational thematic analysis of posts collected between 2023 and 2025 from two Reddit communities, r/AIDangers and r/ChatbotAddiction, explicitly focused on AI-related harm and distress. Using a multi-agent, LLM-assisted thematic analysis grounded in Braun and Clarke's reflexive framework, we identify 14 recurring thematic categories and synthesize them into five higher-order experiential dimensions. To further characterize affective patterns, we apply emotion labeling using a BERT-based classifier and visualize emotional profiles across dimensions. Our findings reveal five empirically derived experiential dimensions of AI-related psychological risk grounded in real-world user discourse, with self-regulation difficulties emerging as the most prevalent and fear concentrated in concerns related to autonomy, control, and technical risk. These results provide early empirical evidence from lived user experience of how AI safety is perceived and emotionally experienced outside laboratory or speculative contexts, offering a foundation for future AI safety research, evaluation, and responsible governance.
[8] Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLMs
Yoshifumi Kawasaki
Main category: cs.CL
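TL;DR: This study evaluates how well large language models (LLMs) capture geographic lexical variation in Spanish. Models recognize the varieties of Spain, Equatorial Guinea, Mexico and Central America, and the La Plata River region relatively well but struggle to distinguish the Chilean variety. The performance differences cannot be explained by the volume of each country's digital resources alone, which points to digital linguistic bias.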
Details
Motivation: To investigate whether and how LLMs represent the marked regional lexical variation of Spanish, and to test for digital linguistic bias. Method: Treats LLMs as virtual informants and probes them with two survey formats, yes-no and multiple-choice questions, drawing on a large-scale, expert-curated database of Spanish lexical variation; more than 900 lexical items are evaluated across 21 Spanish-speaking countries at both the country and dialectal area levels. Result: Models recognize the varieties of Spain, Equatorial Guinea, Mexico and Central America, and the La Plata River region more accurately, while the Chilean variety is recognized worst; these differences are not explained by the volume of country-level digital resources. Conclusion: Dialectal representation in LLMs is systematically skewed and data quantity is not the decisive factor, which calls for attention to digital linguistic bias arising from uneven regional coverage in training data. Abstract: This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish, a language that exhibits substantial regional variation. Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: Yes-No questions and multiple-choice questions. To this end, we exploited a large-scale, expert-curated database of Spanish lexical variation. Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country and dialectal area levels. Across both evaluation formats, the results reveal systematic differences in how LLMs represent Spanish language varieties. Lexical variation associated with Spain, Equatorial Guinea, Mexico & Central America, and the La Plata River is recognized more accurately by the models, while the Chilean variety proves particularly difficult for the models to distinguish. Importantly, differences in the volume of country-level digital resources do not account for these performance patterns, suggesting that factors beyond data quantity shape dialectal representation in LLMs. By providing a fine-grained, large-scale evaluation of geographic lexical variation, this work advances empirical understanding of dialectal knowledge in LLMs and contributes new evidence to discussions of Digital Linguistic Bias in Spanish.
[9] Unsupervised Cross-Lingual Part-of-Speech Tagging with Monolingual Corpora Only
Jianyu Zheng
Main category: cs.CL
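TL;DR: This paper proposes a fully unsupervised cross-lingual part-of-speech (POS) tagging framework. It uses unsupervised neural machine translation (UNMT) to build pseudo-parallel sentence pairs from monolingual corpora, projects POS tags through word alignments, and adds a multi-source projection technique to improve performance. On several low-resource languages it matches or even exceeds baselines that use real parallel corpora.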
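The projection step described in this entry can be illustrated with a small Python sketch: tags are copied from source tokens to target tokens along word alignments, and projections from several source languages are then combined. Majority voting is used here only as a stand-in, since the entry does not say how the multi-source calibration is actually computed; the function names and the toy data are likewise assumptions.

```python
from collections import Counter
from typing import List, Optional, Tuple


def project_tags(
    source_tags: List[str], alignment: List[Tuple[int, int]], target_len: int
) -> List[Optional[str]]:
    """Copy each source tag to its aligned target position; unaligned stay None."""
    projected: List[Optional[str]] = [None] * target_len
    for src_idx, tgt_idx in alignment:
        projected[tgt_idx] = source_tags[src_idx]
    return projected


def vote(projections: List[List[Optional[str]]]) -> List[Optional[str]]:
    """Combine several projections of one target sentence by majority vote."""
    voted: List[Optional[str]] = []
    for column in zip(*projections):
        tags = [tag for tag in column if tag is not None]
        voted.append(Counter(tags).most_common(1)[0][0] if tags else None)
    return voted


# Toy target sentence of three tokens, tagged from three source languages.
from_english = project_tags(["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], 3)
from_german = project_tags(["DET", "NOUN", "NOUN"], [(0, 0), (1, 1), (2, 2)], 3)
from_french = project_tags(["NOUN", "DET", "VERB"], [(1, 0), (0, 1), (2, 2)], 3)
calibrated = vote([from_english, from_german, from_french])
```

In the toy data the German projection mislabels the last token, and the vote across the three sources recovers the tag the other two agree on.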
Details
Motivation: Existing POS tagging methods for low-resource languages rely heavily on parallel corpora, which are hard to obtain for many such languages. Method: Builds pseudo-parallel sentence pairs with unsupervised neural machine translation (UNMT), then follows the standard word-alignment-based tag projection procedure; a multi-source projection technique calibrates the projected tags on the target side. Result: Experiments on 28 language pairs show performance comparable to, and for some target languages better than, a baseline that uses real parallel corpora; multi-source projection yields an average improvement of 1.3%. Conclusion: Unsupervised cross-lingual POS tagging from monolingual corpora alone is feasible and effective, and combining UNMT with multi-source projection substantially reduces the dependence on parallel corpora. Abstract: Due to the scarcity of part-of-speech annotated data, existing studies on low-resource languages typically adopt unsupervised approaches for POS tagging. Among these, the POS tag projection method based on word alignment transfers POS tags from a high-resource source language to a low-resource target language using parallel corpora, making it particularly suitable for low-resource language settings. However, this approach relies heavily on parallel corpora, which are often unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual part-of-speech (POS) tagging framework that relies solely on monolingual corpora by leveraging an unsupervised neural machine translation (UNMT) system. This UNMT system first translates sentences from a high-resource language into a low-resource one, thereby constructing pseudo-parallel sentence pairs. Then, we train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique to calibrate the projected POS tags on the target side, which helps train a more effective POS tagger. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish and French) and seven target languages (Afrikaans, Basque, Finnish, Indonesian, Lithuanian, Portuguese and Turkish). Experimental results show that our method can achieve performance comparable to the baseline cross-lingual POS tagger with parallel sentence pairs, and even exceeds it for certain target languages. Furthermore, our proposed multi-source projection technique further boosts performance, yielding an average improvement of 1.3% over previous methods.
[10] AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis
Zexu Sun,Bokai Ji,Hengyi Cai,Shuaiqiang Wang,Lei Wang,Guangxia Li,Xu Chen
Main category: cs.CL
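TL;DR: This paper proposes AgentSkiller, a framework that automatically generates high-quality, multi-turn, cross-domain agent interaction data to address the shortage of long-horizon training data for LLM agents.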
Details
Motivation: Existing methods are limited by privacy constraints or by insufficient diversity in generated data, so they cannot supply enough high-quality, long-horizon training data to improve the general capabilities of LLM agents. Method: AgentSkiller uses a DAG-based architecture to ensure determinism and recoverability, builds a domain ontology and a Person-Centric Entity Graph, defines tool interfaces via Service Blueprints, simulates complex tasks with a cross-domain fusion mechanism, and uses a Persona-based Simulator to generate user tasks and validate execution paths. Result: About 11K high-quality interaction samples are generated; models trained on this data significantly outperform baselines on function calling, especially at larger parameter scales. Conclusion: AgentSkiller provides a scalable, reliable, and semantically rich paradigm for generating training data for LLM agents and improves their tool use in realistic scenarios. Abstract: Large Language Model agents demonstrate potential in solving real-world problems via tools, yet generalist intelligence is bottlenecked by scarce high-quality, long-horizon data. Existing methods collect privacy-constrained API logs or generate scripted interactions lacking diversity, which struggle to produce data requisite for scaling capabilities. We propose AgentSkiller, a fully automated framework synthesizing multi-turn interaction data across realistic, semantically linked domains. It employs a DAG-based architecture with explicit state transitions to ensure determinism and recoverability. The pipeline builds a domain ontology and Person-Centric Entity Graph, defines tool interfaces via Service Blueprints for Model Context Protocol servers, and populates environments with consistent databases and strict Domain Policies. A cross-domain fusion mechanism links services to simulate complex tasks. Finally, the pipeline creates user tasks by verifying solution paths, filtering via execution-based validation, and generating queries using a Persona-based Simulator for automated rollout. This produces reliable environments with clear state changes. To demonstrate effectiveness, we synthesized approximately 11K interaction samples; experimental results indicate that models trained on this dataset achieve significant improvements on function calling over baselines, particularly in larger parameter regimes.
[11] AfriNLLB: Efficient Translation Models for African Languages
Yasmin Moslem,Aman Kassahun Wassie,Amanuel Gizachew Abebe
Main category: cs.CL
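TL;DR: AfriNLLB is a series of lightweight neural machine translation models covering 15 language pairs (30 translation directions) between African languages and English or French. Pruning, quantization, and knowledge distillation preserve translation quality while substantially speeding up inference, and the models and data are openly released.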
Details
Motivation: Addresses the difficulty of deploying African-language translation models in resource-constrained settings, with the aim of enabling low-power, efficient local translation. Method: Starting from NLLB-200 600M, applies iterative layer pruning and quantization, then fine-tunes on self-curated African-language parallel corpora with knowledge distillation; the models are released in Transformers and CTranslate2 formats. Result: AfriNLLB achieves translation quality comparable to the baseline on African language pairs while being significantly faster at inference; the models and all training data are released. Conclusion: AfriNLLB offers an efficient, deployable translation solution for African languages that balances quality and model size and supports both further research and practical use. Abstract: In this work, we present AfriNLLB, a series of lightweight models for efficient translation from and into African languages. AfriNLLB supports 15 language pairs (30 translation directions), including Swahili, Hausa, Yoruba, Amharic, Somali, Zulu, Lingala, Afrikaans, Wolof, and Egyptian Arabic, as well as other African Union official languages such as Arabic (MSA), French, Portuguese, and Spanish. Our training data covers bidirectional translation between English and 13 languages, and between French and two languages (Lingala and Wolof). AfriNLLB models are based on NLLB-200 600M, which we compress using iterative layer pruning and quantization. We fine-tune the pruned models on parallel corpora we curated for African languages, employing knowledge distillation from a larger teacher model. Our work aims at enabling efficient deployment of translation models for African languages in resource-constrained settings. Our evaluation results demonstrate that AfriNLLB models achieve performance comparable to the baseline while being significantly faster. We release two versions of the AfriNLLB models, a Transformers version that allows further fine-tuning and a CTranslate2 version for efficient inference. Moreover, we release all the training data that we used for fine-tuning the baseline and pruned models to facilitate further research.
[12] BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
Peng Lai,Zhihao Ou,Yong Wang,Longyue Wang,Jian Yang,Yun Chen,Guanhua Chen
Main category: cs.CL
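TL;DR: This paper proposes BiasScope, a framework for automatically discovering potential unknown biases in LLM-as-a-Judge evaluation at scale, and builds on it a more challenging benchmark, JudgeBench-Pro, which shows that current LLM evaluators are far from robust.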
Details
Motivation: Research on bias in LLM-as-a-Judge evaluation has focused on known biases; automated, systematic exploration of unknown biases is lacking, which undermines the robustness and reliability of evaluation. Method: Proposes BiasScope, an LLM-driven framework for automatic bias discovery that can identify potential biases across model families and scales, and builds the extended benchmark JudgeBench-Pro on top of it. Result: BiasScope's generality and effectiveness are validated on JudgeBench; on JudgeBench-Pro, even powerful LLM evaluators show error rates above 50%. Conclusion: Unknown biases deserve systematic attention; BiasScope offers a new paradigm for making LLM-as-a-Judge more robust, and JudgeBench-Pro highlights how fragile current evaluation is and how urgently it needs improvement. Abstract: LLM-as-a-Judge has been widely adopted across various research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge it faces is bias, which has primarily been studied in terms of known biases and their impact on evaluation outcomes, while automated and systematic exploration of potential unknown biases is still lacking. Nevertheless, such exploration is crucial for enhancing the robustness and reliability of evaluations. To bridge this gap, we propose BiasScope, an LLM-driven framework for automatically and at scale discovering potential biases that may arise during model evaluation. BiasScope can uncover potential biases across different model families and scales, with its generality and effectiveness validated on the JudgeBench dataset. It overcomes the limitations of existing approaches, transforming bias discovery from a passive process relying on manual effort and predefined bias lists into an active and comprehensive automated exploration. Moreover, based on BiasScope, we propose JudgeBench-Pro, an extended version of JudgeBench and a more challenging benchmark for evaluating the robustness of LLM-as-a-judge. Strikingly, even powerful LLMs as evaluators show error rates above 50% on JudgeBench-Pro, underscoring the urgent need to strengthen evaluation robustness and to mitigate potential biases further.
[13] Contractual Deepfakes: Can Large Language Models Generate Contracts?
Eliza Mik
Main category: cs.CL
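TL;DR: This paper challenges the overstated potential of large language models (LLMs) in legal practice such as contract drafting. It argues that LLMs only predict words statistically and lack semantic understanding, contextual awareness, and legal reasoning; the contracts they generate look plausible but may be internally inconsistent or unsuitable, so LLMs do not truly threaten the viability of the legal industry.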
Details
Motivation: To correct misunderstandings and exaggerations about LLM capabilities in law, especially contract drafting, and to rebut the simplistic claim that LLMs will replace lawyers. Method: Conceptual analysis and critical argument that distinguishes word prediction from the use of legal language, and template reproduction from legal reasoning, and that points out the practical defects of LLM output in real transactions. Result: Shows that LLM-generated contracts, though superficially plausible, often contain inconsistent provisions, misapplied law, or terms unsuited to the transaction, which leaves their legal effect and practical value in doubt. Conclusion: LLMs lack the understanding and reasoning that legal practice requires, and their use in contract drafting is seriously overestimated; the legal industry will not lose its irreplaceability because of them. Abstract: Notwithstanding their unprecedented ability to generate text, LLMs do not understand the meaning of words, have no sense of context and cannot reason. Their output constitutes an approximation of statistically dominant word patterns. And yet, the drafting of contracts is often presented as a typical legal task that could be facilitated by this technology. This paper seeks to put an end to such unreasonable ideas. Predicting words differs from using language in the circumstances of specific transactions and reconstituting common contractual phrases differs from reasoning about the law. LLMs seem to be able to generate generic and superficially plausible contractual documents. In the cold light of day, such documents may turn out to be useless assemblages of inconsistent provisions or contracts that are enforceable but unsuitable for a given transaction. This paper casts a shadow on the simplistic assumption that LLMs threaten the continued viability of the legal industry.
[14] Effective vocabulary expanding of multilingual language models for extremely low-resource languages
Jianyu Zheng
Main category: cs.CL
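TL;DR: This paper proposes a method for extending the vocabulary of multilingual pre-trained language models (mPLMs) to low-resource languages and initializing the new token embeddings. It screens the original vocabulary, initializes the added vocabulary with bilingual dictionaries, and continues pre-training on a target-language corpus, improving POS tagging and NER performance without hurting source-language performance.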
Details
Motivation: Little prior work addresses how to extend multilingual pre-trained language models (mPLMs) to previously unsupported low-resource languages. Method: First expands the model vocabulary using a target-language corpus; then screens out a subset of the original vocabulary that is biased toward the source language (e.g., English) and uses bilingual dictionaries to initialize the representations of the added vocabulary; finally continues pre-training the mPLM on the target-language corpus from that initialization. Result: Improves over a baseline with randomly initialized expanded vocabulary by 0.54% on POS tagging and 2.60% on NER; the method is robust to the choice of training corpus, and source-language performance does not degrade. Conclusion: The method effectively extends mPLMs to new low-resource languages, improving downstream performance in the target language while preserving source-language ability. Abstract: Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target language corpus. We then screen out a subset from the model's original vocabulary, which is biased towards representing the source language (e.g., English), and utilize bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue to pre-train the mPLMs using the target language corpus, based on the representations of this expanded vocabulary. Experimental results show that our proposed method outperforms the baseline, which uses randomly initialized expanded vocabulary for continued pre-training, in POS tagging and NER tasks, achieving improvements of 0.54% and 2.60%, respectively. Furthermore, our method demonstrates high robustness in selecting the training corpora, and the models' performance on the source language does not degrade after continued pre-training.
[15] Are Language Models Sensitive to Morally Irrelevant Distractors?
Andrew Shaw,Christina Hahn,Catherine Rasgaitis,Yash Mishra,Alisa Liu,Natasha Jaques,Yulia Tsvetkov,Amy X. Zhang
Main category: cs.CL
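TL;DR: Inspired by the situationist view in moral psychology, this paper builds a multimodal dataset of 60 morally irrelevant "moral distractors" and injects them into existing moral benchmarks. The distractors shift the moral judgements of large language models (LLMs) by over 30%, indicating that LLMs exhibit human-like cognitive moral biases and that more contextual, fine-grained moral evaluation and modeling are needed.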
Details
Motivation: Existing moral benchmarks assume that LLMs hold stable moral preferences, yet human moral judgements are sensitive to irrelevant situational factors; the paper tests whether LLMs show the same situation-dependent cognitive moral biases. Method: Selects 60 emotionally salient but morally irrelevant images and narratives from psychological datasets as "moral distractors", injects them into mainstream moral benchmarks (e.g., ETHICS, Moral Stories), and systematically measures their effect on LLM moral judgements. Result: Moral distractors significantly change LLM moral judgements, with shifts of over 30% in low-ambiguity scenarios; robust effects are observed across models, prompting methods, and distractor modalities. Conclusion: LLMs do not produce stable moral judgements and are easily swayed by irrelevant situational cues, which challenges the current evaluation paradigm built on static preferences and calls for more cognitively realistic moral modeling and evaluation frameworks. Abstract: With the rapid development and uptake of large language models (LLMs) across high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as smelling cinnamon rolls or the level of ambient noise, thereby challenging moral theories that assume the stability of human moral judgements. Here, we draw inspiration from this "situationist" view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases to humans. We curate a novel multimodal dataset of 60 "moral distractors" from existing psychological datasets of emotionally-valenced images and narratives which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.
[16] Breaking the Pre-Sampling Barrier: Activation-Informed Difficulty-Aware Self-Consistency
Taewoong Yoon,Geunyeong Jeong,Geon Park,Sihyeong Yeom,Harksoo Kim
Main category: cs.CL
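TL;DR: This paper proposes ACTSC, which uses feed-forward network neuron activations to build a lightweight difficulty estimation probe that dynamically adjusts the number of self-consistency samples, lowering inference cost while maintaining accuracy, without extra model calls or pre-sampling.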
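As a rough illustration of difficulty-aware self-consistency, the Python sketch below picks a sample count from a difficulty score and then takes a majority vote over the sampled answers. The linear mapping from difficulty to budget, the bounds of 1 and 40 samples, and all names are assumptions made for this example; in ACTSC the difficulty signal comes from a probe over feed-forward activations, which is not modeled here.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the sampled reasoning paths."""
    return Counter(answers).most_common(1)[0][0]


def sample_budget(difficulty: float, min_n: int = 1, max_n: int = 40) -> int:
    """Map a difficulty score in [0, 1] to a sample count (assumed linear)."""
    clipped = min(1.0, max(0.0, difficulty))
    return min_n + round(clipped * (max_n - min_n))


def adaptive_self_consistency(
    sample_answer: Callable[[], str], difficulty: float
) -> str:
    """Draw as many answers as the budget allows, then vote.

    `sample_answer` is a placeholder for one model call that returns
    a final answer string.
    """
    n = sample_budget(difficulty)
    return majority_vote([sample_answer() for _ in range(n)])
```

An easy question (difficulty near 0) is answered from a single sample, while a hard one uses the full budget; the saving comes from not spending the maximum number of samples on every question.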
Details
Motivation: Existing self-consistency (SC) methods are expensive at inference time; Difficulty-Adaptive SC (DSC) adjusts the number of samples to problem difficulty but needs additional model calls and pre-sampling, which adds significant computational overhead. Method: Proposes Activation-Informed Difficulty-Aware Self-Consistency (ACTSC), which uses neuron activations in the LLM's feed-forward layers as an internal difficulty signal to build a lightweight difficulty probe that requires no additional token generation or model calls and dynamically sets the number of SC samples. Result: Across five benchmarks, ACTSC significantly lowers inference cost while keeping accuracy comparable to existing methods. Conclusion: ACTSC is an efficient, general, pre-sampling-free approach to difficulty-aware self-consistency that transfers directly to new datasets and balances inference efficiency with performance. Abstract: Self-Consistency (SC) is an effective decoding strategy that improves the reasoning performance of Large Language Models (LLMs) by generating multiple chain-of-thought reasoning paths and selecting the final answer via majority voting. However, it suffers from substantial inference costs because it requires a large number of samples. To mitigate this issue, Difficulty-Adaptive Self-Consistency (DSC) was proposed to reduce unnecessary token usage for easy problems by adjusting the number of samples according to problem difficulty. However, DSC requires additional model calls and pre-sampling to estimate difficulty, and this process is repeated for each dataset, leading to significant computational overhead. In this work, we propose Activation-Informed Difficulty-Aware Self-Consistency (ACTSC) to address these limitations. ACTSC leverages internal difficulty signals reflected in the feed-forward network neuron activations to construct a lightweight difficulty estimation probe, without any additional token generation or model calls. The probe dynamically adjusts the number of samples for SC and can be applied to new datasets without requiring pre-sampling for difficulty estimation. To validate its effectiveness, we conduct experiments on five benchmarks. Experimental results show that ACTSC effectively reduces inference costs while maintaining accuracy relative to existing methods.
[17] Evaluating Social Bias in RAG Systems: When External Context Helps and Reasoning Hurts
Shweta Parihar,Lu Cheng
Main category: cs.CL
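TL;DR: This study finds that retrieval-augmented generation (RAG) can reduce social bias in large language models, whereas adding chain-of-thought (CoT) prompting improves accuracy but increases bias, revealing a trade-off between accuracy and fairness.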
Details
Motivation: To examine whether and how retrieval-augmented generation (RAG) architectures mitigate social bias in large language models (LLMs), given that RAG enhances generation but may still inherit bias. Method: Runs large-scale experiments across multiple retrieval corpora, LLMs, and bias evaluation datasets covering more than 13 bias types, then adds chain-of-thought (CoT) prompting to analyze the RAG reasoning process and assesses the faithfulness of the CoT. Result: RAG reduces social bias overall, suggesting that external context helps suppress stereotype-driven predictions; CoT improves accuracy but increases bias, and the model shifts between stereotype and anti-stereotype responses as more retrieved information is incorporated. Conclusion: RAG can improve LLM fairness, but reasoning-enhancement techniques such as CoT may amplify bias, so new reasoning frameworks are needed that balance accuracy with bias control. Abstract: Social biases inherent in large language models (LLMs) raise significant fairness concerns. Retrieval-Augmented Generation (RAG) architectures, which retrieve external knowledge sources to enhance the generative capabilities of LLMs, remain susceptible to the same bias-related challenges. This work focuses on evaluating and understanding the social bias implications of RAG. Through extensive experiments across various retrieval corpora, LLMs, and bias evaluation datasets, encompassing more than 13 different bias types, we surprisingly observe a reduction in bias in RAG. This suggests that the inclusion of external context can help counteract stereotype-driven predictions, potentially improving fairness by diversifying the contextual grounding of the model's outputs. To better understand this phenomenon, we then explore the model's reasoning process by integrating Chain-of-Thought (CoT) prompting into RAG while assessing the faithfulness of the model's CoT. Our experiments reveal that the model's bias inclinations shift between stereotype and anti-stereotype responses as more contextual information is incorporated from the retrieved documents. Interestingly, we find that while CoT enhances accuracy, contrary to the bias reduction observed with RAG, it increases overall bias across datasets, highlighting the need for bias-aware reasoning frameworks that can mitigate this trade-off.
[18] Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality
Takumi Ohashi,Hitoshi Iyatomi
Main category: cs.CL
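TL;DR: This paper proposes the Conceptual Cultural Index (CCI), a new metric that quantifies cultural specificity at the sentence level, and shows experimentally that it separates culture-specific from general sentences better than direct LLM scoring.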
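As a rough illustration of the CCI definition given in this entry's abstract (the generality estimate within the target culture minus the average generality estimate across other cultures), the arithmetic can be sketched as below. The culture codes and generality values are made up for the example, and how the generality estimates themselves are obtained is not specified here, so the function simply takes them as input.

```python
from statistics import mean
from typing import Dict


def cci(generality: Dict[str, float], target: str) -> float:
    """Generality in the target culture minus mean generality in the other cultures.

    `generality` maps a culture label to a generality estimate for one sentence.
    Both the labels and the estimates are hypothetical inputs in this sketch.
    """
    others = [score for culture, score in generality.items() if culture != target]
    if not others:
        raise ValueError("at least one comparison culture is required")
    return generality[target] - mean(others)


# Toy sentence that is common knowledge in "JP" but not elsewhere (invented numbers).
scores = {"JP": 0.9, "US": 0.2, "FR": 0.4}
value = cci(scores, "JP")  # 0.9 - mean(0.2, 0.4)
```

Changing which cultures appear in the comparison set changes the score, which corresponds to the abstract's point that users can control the scope of culture via comparison settings.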
Details
Motivation: Large language models are increasingly deployed in multicultural settings, yet systematic methods for evaluating cultural specificity at the sentence level are lacking. Method: Proposes the Conceptual Cultural Index (CCI), defined as the difference between a sentence's generality estimate within the target culture and its average generality estimate across other cultures; validated on 400 sentences (200 culture-specific, 200 general). Result: The CCI score distribution matches expectations: culture-specific sentences score higher and general sentences score lower. In binary classification, CCI improves AUC by more than 10 points over direct LLM scoring for models specialized to the target culture. Conclusion: CCI offers an operational, interpretable way to quantify cultural specificity, which can help improve the adaptation and fairness of LLMs in multicultural settings. Abstract: Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at https://github.com/IyatomiLab/CCI.
[19] NOWJ @BioCreative IX ToxHabits: An Ensemble Deep Learning Approach for Detecting Substance Use and Contextual Information in Clinical Texts
Huu-Huy-Hoang Tran,Gia-Bao Duong,Quoc-Viet-Anh Tran,Thi-Hai-Yen Vuong,Hoang-Quynh Le
Main category: cs.CL
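TL;DR: This paper proposes a multi-output ensemble system that combines BETO with a CRF layer to detect toxic substance use and its contextual attributes in Spanish clinical texts, achieving strong F1 and precision.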
Details
Motivation: Extracting drug use information from unstructured electronic health records remains a major challenge in clinical natural language processing, and the use of large language models in this domain is limited by concerns over trust, control, and efficiency. Method: Proposes a multi-output ensemble system that combines BETO with a CRF layer for sequence labeling, uses diverse training strategies and sentence filtering to boost precision, and addresses both ToxNER (Subtask 1) and ToxUse (Subtask 2). Result: The best run achieved 0.94 F1 and 0.97 precision for trigger detection and 0.91 F1 for argument detection. Conclusion: The approach shows high precision and robustness in identifying toxic substance use in low-resource Spanish clinical texts, supporting the value of combining pretrained models with traditional structured methods. Abstract: Extracting drug use information from unstructured Electronic Health Records remains a major challenge in clinical Natural Language Processing. While Large Language Models demonstrate advancements, their use in clinical NLP is limited by concerns over trust, control, and efficiency. To address this, we present the NOWJ submission to the ToxHabits Shared Task at BioCreative IX. This task targets the detection of toxic substance use and contextual attributes in Spanish clinical texts, a domain-specific, low-resource setting. We propose a multi-output ensemble system tackling both Subtask 1 - ToxNER and Subtask 2 - ToxUse. Our system integrates BETO with a CRF layer for sequence labeling, employs diverse training strategies, and uses sentence filtering to boost precision. Our top run achieved 0.94 F1 and 0.97 precision for Trigger Detection, and 0.91 F1 for Argument Detection.
[20] Listen to the Layers: Mitigating Hallucinations with Inter-Layer Disagreement
Koduvayur Subbalakshmi,Sabbir Hossain Ujjal,Venkata Krishna Teja Mangichetty,Nastaran Jamalipour Soofi
Main category: cs.CL
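TL;DR: This paper proposes CoCoA, a training-free decoding algorithm that mitigates hallucinations by monitoring the instability (confusion and consistency) of a large language model's middle-layer representations, and validates it across multiple tasks and models.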
Details
Motivation: Pretrained large language models tend to generate fluent but factually incorrect text (hallucinations), which undermines their reliability. The authors hypothesize that the factuality of generated text is correlated with the instability of its representations across the model's internal layers. Method: Proposes the CoCoA decoder, which defines two metrics of middle-layer representational instability and uses them to penalize outputs with high internal confusion; also proposes a self-information gated variant, CoCoA-SIG, which dynamically modulates the penalty to target high-surprise, unstable generations. Result: Across tasks including question answering, summarization, and code generation, and model families including Llama-3, Qwen-2.5, and Mistral, CoCoA significantly improves factual correctness. Conclusion: CoCoA uses model-intrinsic signals to improve LLM trustworthiness at inference time without retraining, and is broadly applicable. Abstract: Pretrained Large Language Models (LLMs) are prone to generating fluent yet factually incorrect text, a phenomenon known as hallucinations, undermining their reliability and utility in downstream tasks. We hypothesize that a generated text span's factuality is correlated with its representational instability across the model's internal layers. Based on this, we propose the CoCoA (Confusion and Consistency Aware) decoder, a novel, training-free decoding algorithm that mitigates hallucinations at inference time by listening to these signals in the middle layers. We propose two metrics to quantify this instability in the middle layers, and use them to penalize outputs that exhibit high internal confusion, thereby steering the model towards more internally consistent and factually grounded outputs. We further propose a self-information gated variant, CoCoA-SIG, that dynamically modulates this penalty to selectively target high-surprise, unstable generations. Extensive experiments on diverse tasks, including question-answering, summarization and code generation demonstrate that CoCoA significantly improves factual correctness across multiple model families (e.g., Llama-3, Qwen-2.5, Mistral). By leveraging model-intrinsic signals, CoCoA offers an effective and broadly applicable method for enhancing the trustworthiness of LLMs at inference time, without requiring any model retraining.
[21] Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models
Hikaru Asano,Tadashi Kozuno,Kuniaki Saito,Yukino Baba
Main category: cs.CL
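TL;DR: This paper proposes Gt-Margin, a position-wise score defined as the probability margin between the correct token and its strongest competing token, and uses it to produce an oracle unmasking order; a supervised unmasking planner built on it significantly improves MDLM generation quality on logical reasoning tasks.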
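The Gt-Margin score described in this entry can be sketched in a few lines. This is a minimal illustration for a single partially masked state, assuming the model's per-position token probabilities are already available as plain dictionaries; the function names and the toy numbers are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def gt_margin(probs: Dict[str, float], gold: str) -> float:
    """Probability of the ground-truth token minus that of its strongest alternative."""
    best_other = max((p for tok, p in probs.items() if tok != gold), default=0.0)
    return probs.get(gold, 0.0) - best_other


def oracle_order(
    position_probs: Dict[int, Dict[str, float]],
    gold_tokens: Dict[int, str],
) -> List[int]:
    """Rank masked positions from easiest (largest margin) to hardest."""
    margins = {
        pos: gt_margin(position_probs[pos], gold_tokens[pos])
        for pos in position_probs
    }
    return sorted(margins, key=lambda pos: margins[pos], reverse=True)


# Toy state with three masked positions (invented probabilities).
position_probs = {
    0: {"the": 0.9, "a": 0.1},
    1: {"cat": 0.4, "dog": 0.5, "cow": 0.1},
    2: {"sat": 0.6, "ran": 0.4},
}
gold_tokens = {0: "the", 1: "cat", 2: "sat"}
order = oracle_order(position_probs, gold_tokens)
```

In an actual sampling loop the ranking would be recomputed after each unmasking step, since the abstract defines the order under each partially masked state. A negative margin, as at position 1 here, means the model currently prefers a wrong token, so that position is deferred.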
Details
Motivation: In existing MDLMs, the inference-time unmasking order (where-to-unmask) relies on heuristic confidence measures or costly reinforcement learning, and an efficient, learnable strategy is lacking. Method: Proposes Gt-Margin, a position-wise score based on the probability margin of the ground-truth token; uses it to construct an oracle unmasking order; then trains a supervised unmasking planner via learning-to-rank to imitate that order. Result: The planner significantly improves MDLM accuracy on logical reasoning benchmarks without modifying the token prediction model. Conclusion: Gt-Margin shows that an easy-first unmasking order is key to MDLM performance, and the supervised planner provides an efficient, scalable, decoupled mechanism for controlling decoding. Abstract: Masked Diffusion Language Models (MDLMs) generate text by iteratively filling masked tokens, requiring two coupled decisions at each step: which positions to unmask (where-to-unmask) and which tokens to place (what-to-unmask). While standard MDLM training directly optimizes token prediction (what-to-unmask), inference-time unmasking orders (where-to-unmask) are typically determined by heuristic confidence measures or trained through reinforcement learning with costly on-policy rollouts. To address this, we introduce Gt-Margin, a position-wise score derived from ground-truth tokens, defined as the probability margin between the correct token and its strongest alternative. Gt-Margin yields an oracle unmasking order that prioritizes easier positions first under each partially masked state. We demonstrate that leveraging this oracle unmasking order significantly enhances final generation quality, particularly on logical reasoning benchmarks. Building on this insight, we train a supervised unmasking planner via learning-to-rank to imitate the oracle ordering from masked contexts. The resulting planner integrates into standard MDLM sampling to select where-to-unmask, improving reasoning accuracy without modifying the token prediction model.
[22] EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
Xavier Hu,Jinxiang Xia,Shengze Xu,Kangqi Song,Yishuo Yuan,Guibin Zhang,Jincheng Ren,Boyu Feng,Li Lu,Tieyong Zeng,Jiaheng Liu,Minghao Liu,Yuchen Elenor Jiang,Wei Wang,He Zhu,Wangchunshu Zhou
Main category: cs.CL
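TL;DR: This paper proposes EcoGym, a general benchmark for evaluating how well autonomous agents based on large language models (LLMs) plan and execute continuously in long-horizon economic environments. It comprises three economic scenarios (Vending, Freelance, and Operation), emphasizes long-term strategic coherence and robustness under partial observability and stochasticity, and evaluates agents with business metrics such as net worth, income, and daily active users. Experiments show that current leading LLMs have systematic weaknesses in either strategy formulation or action execution.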
Details
Motivation: Existing long-horizon planning evaluation frameworks are mostly episodic, domain-specific, or lack modeling of persistent economic dynamics, so they cannot realistically measure the long-term decision-making ability of LLM agents. Method: Designs and implements the EcoGym benchmark, comprising three interactive economic environments with a unified interface, budgeted actions, and very long horizons (1000+ steps / 365 day-loops), evaluated with business-oriented metrics for long-term strategy and execution. Result: Experiments on eleven leading LLMs show that no single model dominates across all three scenarios, and that models are significantly suboptimal in either high-level strategy or low-level action execution. Conclusion: EcoGym provides a more realistic, extensible, and reproducible evaluation platform for long-horizon autonomous agents, exposes the controllability-utility trade-off in current LLMs, and promotes agent research on realistic economic settings. Abstract: Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending, Freelance, and Operation, implemented in a unified decision-making process with standardized interfaces, and budgeted actions over an effectively unbounded horizon (1000+ steps if 365 day-loops for evaluation). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategies or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.
[23] The CLEF-2026 CheckThat! Lab: Advancing Multilingual Fact-Checking
Julia Maria Struß,Sebastian Schellhammer,Stefan Dietze,Venktesh V,Vinay Setty,Tanmoy Chakraborty,Preslav Nakov,Avishek Anand,Primakov Chungkham,Salim Hafid,Dhruv Sahnan,Konstantin Todorov
Main category: cs.CL
TL;DR: CheckThat! lab focuses on advancing technologies to combat disinformation, with this year's edition emphasizing source retrieval for scientific claims, fact-checking numerical/temporal claims with reasoning, and generating full fact-checking articles.
Details
Motivation: To advance innovative technologies that combat disinformation and manipulation in multilingual, multi-platform online communication. Method: Organizing shared tasks centered on the verification pipeline: source retrieval (Task 1), reasoning-enhanced fact-checking of numerical/temporal claims (Task 2), and generation of full fact-checking articles (Task 3). Result: A set of challenging multilingual classification, retrieval, and document/span-level generation tasks integrated into the fact-checking pipeline. Conclusion: The CheckThat! lab continues to evolve the fact-checking research agenda by expanding the verification pipeline to include reasoning and generative components, fostering progress in disinformation detection across languages and platforms. Abstract: The CheckThat! lab aims to advance the development of innovative technologies combating disinformation and manipulation efforts in online communication across a multitude of languages and platforms. While in early editions the focus has been on core tasks of the verification pipeline (check-worthiness, evidence retrieval, and verification), in the past three editions, the lab added additional tasks linked to the verification process. In this year's edition, the verification pipeline is at the center again with the following tasks: Task 1 on source retrieval for scientific web claims (a follow-up of the 2025 edition), Task 2 on fact-checking numerical and temporal claims, which adds a reasoning component to the 2025 edition, and Task 3, which expands the verification pipeline with generation of full fact-checking articles. These tasks represent challenging classification and retrieval problems as well as generation challenges at the document and span level, including multilingual settings.
[24] Knowledge Integration Decay in Search-Augmented Reasoning of Large Language Models
Sangwon Yu,Ik-hwan Kim,Donghun Kang,Bongkyu Hwang,Junhwa Choi,Suk-hoon Jung,Seungki Hong,Taehee Lee,Sungroh Yoon
Main category: cs.CL
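TL;DR: This paper proposes SAKE, a training-free inference-time strategy that anchors retrieved knowledge at the beginning and end of the reasoning process to mitigate Knowledge Integration Decay (KID) in long-chain reasoning, significantly improving multi-hop QA and complex reasoning performance.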
Details
Motivation: Modern large language models face a Knowledge Integration Decay (KID) bottleneck in search-augmented reasoning: as the reasoning chain grows, the model increasingly fails to integrate retrieved evidence into subsequent reasoning steps, so performance is limited even when relevant information has been retrieved. Method: Proposes Self-Anchored Knowledge Encoding (SAKE), a training-free inference-time strategy that explicitly anchors retrieved knowledge at the beginning and end of the reasoning process so that it is not overshadowed by prior context, preserving its semantic integrity. Result: Extensive experiments on multi-hop QA and complex reasoning benchmarks show that SAKE significantly mitigates KID and improves performance. Conclusion: SAKE is a lightweight, effective, training-free approach to knowledge integration for agentic LLMs. Abstract: Modern Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks by employing search-augmented reasoning to incorporate external knowledge into long chains of thought. However, we identify a critical yet underexplored bottleneck in this paradigm, termed Knowledge Integration Decay (KID). Specifically, we observe that as the length of reasoning generated before search grows, models increasingly fail to integrate retrieved evidence into subsequent reasoning steps, limiting performance even when relevant information is available. To address this, we propose Self-Anchored Knowledge Encoding (SAKE), a training-free inference-time strategy designed to stabilize knowledge utilization. By anchoring retrieved knowledge at both the beginning and end of the reasoning process, SAKE prevents it from being overshadowed by prior context, thereby preserving its semantic integrity. Extensive experiments on multi-hop QA and complex reasoning benchmarks demonstrate that SAKE significantly mitigates KID and improves performance, offering a lightweight yet effective solution for knowledge integration in agentic LLMs.
[25] UniARM: Towards a Unified Autoregressive Reward Model for Multi-Objective Test-Time Alignment
Hongyan Xie,Yikun Ban,Ruiyu Fang,Zixuan Huang,Deqing Wang,Jianxin Li,Yitong Yao,Chao Wang,Shuangyong Song
Main category: cs.CL
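TL;DR: This paper proposes MoSLoRA, a training method for autoregressive reward models (ARMs), and UniARM, a framework built on it for multi-objective test-time alignment; shared feature extraction combined with preference modulation mitigates feature entanglement and enables more precise control over preference trade-offs.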
Details
Motivation: Existing multi-objective alignment methods based on autoregressive reward models (ARMs) either neglect interactions among preference features or suffer from feature entanglement when modeling multiple preference objectives, leading to misalignment between generated outputs and user preferences. Method: Proposes Preference-Modulated & Shared Low-Rank Adaptation (MoSLoRA), which first extracts shared features with a preference-agnostic module and then applies affine transformations through a modulation module conditioned on mixed preference vectors; on this basis, builds the Unified Autoregressive Reward Model (UniARM), which jointly models all preference dimensions in a single parameter space. Result: UniARM improves alignment accuracy on multi-objective test-time alignment, mitigates feature entanglement, and supports deployment with larger-scale LLMs, enhancing its practicality. Conclusion: MoSLoRA and UniARM provide an efficient, unified, and scalable approach to multi-objective alignment that improves the generation quality and controllability of frozen LLMs under multiple preference constraints. Abstract: Multi-objective alignment aims to align LLM responses with multiple human preference objectives. Among existing methods, guiding the generation of frozen LLMs through autoregressive reward models (ARMs) to accomplish multi-objective test-time alignment is a low-cost solution. However, these methods typically rely on independent parameters for each preference objective, either by training ARMs independently across preference dimensions, which neglects interactions among preference features, or by training a single ARM with separate feature extraction modules for each preference, which can cause feature entanglement. Both strategies can result in misalignment between generated outputs and user preferences. To address this limitation, we propose Preference-Modulated & Shared Low-Rank Adaptation (MoSLoRA) for ARM training, which first extracts shared features via a preference-agnostic module and then applies affine transformations to shared features via a preference modulation module conditioned on mixed preference vectors. This design mitigates feature entanglement and enables precise control over preference trade-offs during inference. Building on this, we introduce the Unified Autoregressive Reward Model (UniARM), a novel framework for multi-objective test-time alignment. UniARM jointly models all preference dimensions in a single parameter space, eliminating the need for independent parameters for each preference objective. es on larger-scale LLMs, enhancing its practical usability.
[26] Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA
Klejda Alushi,Jan Strich,Chris Biemann,Martin Semmann
Main category: cs.CL
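TL;DR: This paper systematically evaluates a range of RAG methods for multi-turn conversational question answering. Simple, effective methods (such as reranking, hybrid BM25, and HyDE) generally outperform vanilla RAG, while some complex methods fall below the No-RAG baseline. Performance depends strongly on dataset characteristics and dialogue length, which indicates that matching the retrieval strategy to the dataset structure matters more than method complexity.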
Details
Motivation: Existing RAG research mostly focuses on single-turn QA and evaluates methods in isolation; a systematic comparison of the retrieval challenges that dialogue history, coreference, and shifting user intent create in multi-turn conversation is lacking. Method: Under a unified experimental setup, evaluates vanilla and advanced RAG methods (such as reranking, hybrid BM25, and HyDE) on eight multi-turn conversational QA datasets spanning multiple domains, using both retrieval and generation metrics, and analyzes how performance changes across turns. Result: Robust yet simple methods such as reranking, hybrid BM25, and HyDE consistently outperform vanilla RAG; several advanced methods yield no gains or even fall below the No-RAG baseline; dataset characteristics and dialogue length strongly affect retrieval effectiveness, and no single strategy dominates. Conclusion: The effectiveness of multi-turn conversational RAG depends on how well the retrieval strategy matches the dataset structure rather than on method complexity; the results provide empirical guidance for choosing RAG methods. Abstract: Conversational question answering increasingly relies on retrieval-augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single-turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi-turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No-RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used at https://github.com/Klejda-A/exp-rag.git.
[27] Advancing Block Diffusion Language Models for Test-Time Scaling
Yi Lu,Deyang Kong,Jianing Wang,Linsen Guo,Xue Wang,Qi Guo,Tao Gui,Xuanjing Huang,Wei Ye,Shikun Zhang,Wei Wang
Main category: cs.CL
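TL;DR: This paper proposes a unified test-time scaling framework for block diffusion language models (BDLMs), comprising the BACD decoding strategy and the TCCF generation paradigm, which substantially improves the balance between efficiency and effectiveness on long chain-of-thought reasoning tasks.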
Details
Motivation: Existing BDLMs are underexplored in the test-time scaling setting and face more severe decoding challenges in long chain-of-thought reasoning, where decoding speed and effectiveness are hard to balance. Method: Proposes Bounded Adaptive Confidence Decoding (BACD) for difficulty-aware, dynamically adjusted denoising; proposes the Think Coarse, Critic Fine (TCCF) paradigm, which allocates coarse or fine block sizes according to the reasoning stage; and adopts Progressive Block Size Extension to mitigate the performance degradation caused by large block sizes. Result: Applying BACD and TCCF to TDAR-8B yields a 2.26x speedup and +11.2 points on AIME24 over TraDo-8B. Conclusion: The work offers a practical path for test-time scaling of BDLMs on complex reasoning tasks and advances their potential for real-world use. Abstract: Recent advances in block diffusion language models have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs have limited exploration under the test-time scaling setting and face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing the decoding speed and effectiveness. In this work, we propose a unified framework for test-time scaling in BDLMs that introduces adaptivity in both decoding and block-wise generation. At the decoding level, we propose Bounded Adaptive Confidence Decoding (BACD), a difficulty-aware sampling strategy that dynamically adjusts denoising based on model confidence, accelerating inference while controlling error accumulation. Beyond step-wise adaptivity, we introduce Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm that allocates large block sizes to exploratory reasoning and smaller block sizes to refinement, achieving an effective efficiency-effectiveness balance. To enable efficient and effective decoding with a large block size, we adopt Progressive Block Size Extension, which mitigates performance degradation when scaling block sizes. Extensive experiments show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines such as TraDo-8B (2.26x speedup, +11.2 points on AIME24). These results mark an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks.
[28] LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval
Narges Baba Ahmadi,Jan Strich,Martin Semmann,Chris Biemann
Main category: cs.CL
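TL;DR: This paper presents LEMUR, a large-scale multilingual corpus of EU environmental legislation, and fine-tunes three multilingual embedding models on it, significantly improving Top-k accuracy for low-resource languages in legal semantic retrieval, with gains that transfer across languages.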
Details
Motivation: The use of large language models in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted open embedding models; existing multilingual legal corpora are not designed for semantic retrieval, and PDF sources introduce text-extraction noise. Method: Builds the LEMUR corpus (24,953 EUR-Lex PDF documents covering 25 languages), proposes the Lexical Content Score (LCS) to assess PDF-to-text fidelity, and fine-tunes three state-of-the-art multilingual embedding models with contrastive objectives in monolingual and bilingual settings. Result: Legal-domain fine-tuning consistently improves Top-k retrieval accuracy for both low- and high-resource languages, with especially pronounced gains for low-resource languages; cross-lingual evaluation shows that the improvements generalize to unseen languages, indicating that the models learn language-independent representations of legal content. Conclusion: Domain-adapted fine-tuning of multilingual embeddings effectively improves legal semantic retrieval, and the gains stem from content-level legal representations rather than language-specific cues; the code and data are publicly released. Abstract: Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues.
We publish code (https://github.com/nargesbh/eur_lex) and data (https://huggingface.co/datasets/G4KMU/LEMUR).
[29] Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs
Sora Miyamoto,Daisuke Oba,Naoaki Okazaki
Main category: cs.CL
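TL;DR: This paper proposes BG-MCTS, a budget-aware tree-search decoding algorithm that dynamically shifts between exploration and refinement to match the remaining token budget, and that outperforms budget-agnostic methods on mathematical reasoning tasks.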
Details
Motivation: Existing tree-search decoding policies largely ignore the fixed per-query token budget and treat it only as a termination condition, which can lead to late-stage over-branching or premature termination and hurts the balance between performance and efficiency. Method: Proposes Budget-Guided MCTS (BG-MCTS), whose search policy evolves with the remaining token budget: broad exploration at first, then a shift toward answer refinement and completion, with late-stage branching from shallow nodes suppressed. Result: On MATH500 and AIME24/25, BG-MCTS consistently outperforms budget-agnostic tree-search baselines across budget settings, using open-weight LLMs. Conclusion: Budget guidance is a key design principle for making tree-search decoding practical and robust, and BG-MCTS offers a more efficient, adaptive decoding approach for LLM reasoning under resource constraints. Abstract: Tree-search decoding is an effective form of test-time scaling for large language models (LLMs), but real-world deployment imposes a fixed per-query token budget that varies across settings. Existing tree-search policies are largely budget-agnostic, treating the budget as a termination condition, which can lead to late-stage over-branching or premature termination. We propose Budget-Guided MCTS (BG-MCTS), a tree-search decoding algorithm that aligns its search policy with the remaining token budget: it starts with broad exploration, then prioritizes refinement and answer completion as the budget depletes while reducing late-stage branching from shallow nodes. BG-MCTS consistently outperforms budget-agnostic tree-search baselines across different budgets on MATH500 and AIME24/25 with open-weight LLMs.
[30] Context-Aware Counterfactual Data Augmentation for Gender Bias Mitigation in Language Models
Shweta Parihar,Liu Guangliang,Natalie Parde,Lu Cheng
Main category: cs.CL
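TL;DR: This paper proposes Context-CDA, which uses large language models to improve the contextual relevance and diversity of counterfactual data augmentation (CDA) and applies uncertainty-based filtering to improve the quality of fine-tuning data for smaller models, mitigating gender bias without harming language modeling capability.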
Details
Motivation: Existing counterfactual data augmentation (CDA) methods tend to reduce language modeling capability when mitigating social bias, and the counterfactuals they generate often diverge from real-world distributions or ignore the social context of sensitive attributes. Method: Proposes Context-CDA, which uses large language models to add context to counterfactual samples, improving their diversity and contextual relevance, and filters generated samples by quality using uncertainty estimates from the target smaller language model. Result: On gender bias benchmarks, Context-CDA effectively mitigates bias without sacrificing language modeling performance, and analysis of distribution shifts in next-token probabilities offers insights into social bias. Conclusion: Context-CDA is a simple, effective method for more robust social bias mitigation that stays closer to real-world distributions without sacrificing language modeling capability. Abstract: A challenge in mitigating social bias in fine-tuned language models (LMs) is the potential reduction in language modeling capability, which can harm downstream performance. Counterfactual data augmentation (CDA), a widely used method for fine-tuning, highlights this issue by generating synthetic data that may align poorly with real-world distributions or creating overly simplistic counterfactuals that ignore the social context of altered sensitive attributes (e.g., gender) in the pretraining corpus. To address these limitations, we propose a simple yet effective context-augmented CDA method, Context-CDA, which uses large LMs to enhance the diversity and contextual relevance of the debiasing corpus. By minimizing discrepancies between the debiasing corpus and pretraining data through augmented context, this approach ensures better alignment, enhancing language modeling capability. We then employ uncertainty-based filtering to exclude generated counterfactuals considered low-quality by the target smaller LMs (i.e., LMs to be debiased), further improving the fine-tuning corpus quality. Experimental results on gender bias benchmarks demonstrate that Context-CDA effectively mitigates bias without sacrificing language modeling performance while offering insights into social biases by analyzing distribution shifts in next-token generation probabilities.
[31] On the Optimal Reasoning Length for RL-Trained Language Models
Daisuke Nohara,Taishi Nakamura,Rio Yokota
Main category: cs.CL
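TL;DR: This paper studies reinforcement learning for reasoning in large language models, noting that it improves reasoning but increases output length and computational cost. Comparing several length control methods, it finds that length penalties may hinder the acquisition of reasoning ability, while properly tuned length control can improve efficiency for models with strong prior reasoning, and it identifies two failure modes: long outputs increase dispersion and short outputs lead to under-thinking.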
Details
Motivation: Reinforcement learning improves the reasoning of large language models but also lengthens outputs and raises computational cost, and it remains unclear what output length best balances efficiency and performance. Method: Compares several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B, extending prior work to RL-trained policies and analyzing the effects of long and short outputs. Result: Length penalties may hinder reasoning acquisition; properly tuned length control can improve efficiency for models with strong prior reasoning; two failure modes are identified: long outputs increase dispersion, and short outputs lead to under-thinking. Conclusion: Length control must weigh reasoning quality against computational efficiency and cannot simply be imposed as a penalty; length control strategies should be tailored to the model's prior reasoning ability. Abstract: Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain of thought outputs and increase computational cost during both training and inference. Though length control methods have been proposed, it remains unclear what the optimal output length is for balancing efficiency and performance. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. By extending prior work to RL-trained policies, we identify two failure modes: 1) long outputs increase dispersion, and 2) short outputs lead to under-thinking.
[32] Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning
Qiao Liang,Yuke Zhu,Chao Ge,Lei Yang,Ying Shen,Bo Zheng,Sheng Guo
Main category: cs.CL
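TL;DR: This paper proposes Error-Localized Policy Optimization (ELPO), which uses binary search to localize the first irrecoverable step in tool-integrated reasoning (TIR) and combines it with hierarchical advantage attribution and adaptive clipping to improve fine-grained credit assignment and policy optimization on long-horizon tasks.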
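The binary-search idea behind localizing the first irrecoverable step can be illustrated with a small sketch. It is not ELPO's rollout-tree procedure: it assumes a hypothetical predicate `is_recoverable(k)` that reports whether the task can still succeed after the first `k` steps (in practice this would have to be estimated from rollouts), and it assumes recoverability is monotone, meaning that once a prefix is irrecoverable every longer prefix is too.

```python
from typing import Callable, List, Optional


def first_irrecoverable_step(
    num_steps: int,
    is_recoverable: Callable[[int], bool],
) -> Optional[int]:
    """Smallest k in [1, num_steps] whose k-step prefix is irrecoverable.

    Returns None when the full trajectory is still recoverable.
    Correctness relies on the monotonicity assumption stated above.
    """
    if num_steps == 0 or is_recoverable(num_steps):
        return None
    lo, hi = 1, num_steps
    while lo < hi:
        mid = (lo + hi) // 2
        if is_recoverable(mid):
            lo = mid + 1  # the first bad step lies after mid
        else:
            hi = mid      # mid is bad, so the first bad step is at or before mid
    return lo


# Toy oracle: the trajectory becomes irrecoverable at step 6 (invented for the example).
probed: List[int] = []


def toy_oracle(k: int) -> bool:
    probed.append(k)
    return k < 6


first_bad = first_irrecoverable_step(10, toy_oracle)
```

The appeal of binary search in this setting is the probe count: a 10-step trajectory needs 5 oracle queries here instead of up to 10 for a linear scan, which matters when each query costs a batch of rollouts under a fixed budget.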
Details
Motivation: Outcome-only reinforcement learning for tool-integrated reasoning (TIR) suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon trajectories in particular, an early irrecoverable mistake can determine success or failure, so it is crucial to localize that step and use it for fine-grained training. Method: Proposes ELPO, which (1) builds binary-search rollout trees under a fixed rollout budget to localize the first irrecoverable step; (2) converts the tree into stable learning signals through hierarchical advantage attribution; and (3) applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Result: On TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with gains in Pass@K, Major@K, rollout ranking quality, and tool-call efficiency. Conclusion: Through precise error localization and structured credit assignment, ELPO improves the training efficiency and performance of LLM agents on long-horizon tool-integrated reasoning. Abstract: Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code will be publicly released soon.
[33] AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models
R E Zera Marveen Lyngkhoi,Chirag Chawla,Pratinav Seth,Utsav Avaiya,Soham Bhattacharjee,Mykola Khandoga,Rui Yuan,Vinay Kumar Sankarapu
Main category: cs.CL
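TL;DR: AlignTune is a modular toolkit that unifies and simplifies post-training alignment workflows for large language models, supporting SFT and RLHF-style optimization with pluggable backends, standardized configuration, an extensible reward layer, and integrated benchmark evaluation.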
Details
Motivation: Existing alignment workflows are split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce; the key obstacles are backend interference, reward fragmentation, and irreproducible pipelines. Method: Introduces the AlignTune toolkit, which isolates backend logic (TRL and Unsloth) behind a single factory boundary and provides a unified interface, standardized configuration, an extensible reward layer (rule-based and learned), and integrated evaluation on standard benchmarks and custom tasks. Result: Enables controlled comparisons of alignment methods and reproducible alignment experiments, improving the modularity, reusability, and comparability of alignment research. Conclusion: AlignTune addresses engineering fragmentation in alignment research and provides standardized, modular, reproducible experimental infrastructure for LLM alignment. Abstract: Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.
[34] MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation
Nalin Srun,Parisa Rastin,Guénaël Cabanes,Lydia Boudjeloud Assala
Main category: cs.CL
TL;DR: MILE-RefHumEval is a reference-free, human-aligned LLM evaluation framework using an ensemble of independently prompted evaluators, achieving high alignment with human judgments while reducing computational cost.
Details
Motivation: To address the limitations of existing LLM evaluation methods that rely on ground-truth annotations or coordinated evaluators, which are costly, inflexible, and poorly scalable in real-world settings. Method: Proposes MILE-RefHumEval, a reference-free framework using task-specific prompts and an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring. Result: Experiments show strong alignment with human judgments, outperformance over prior methods, and reduced computational overhead. Conclusion: MILE-RefHumEval offers an efficient, robust, flexible, interpretable, and human-aligned solution for practical LLM evaluation across diverse tasks. Abstract: We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgement. With task-specific prompts from best candidate selection, summarization and image captioning to dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
[35] MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering
Sieun Hyeon,Jusang Oh,Sunghwan Steve Cho,Jaeyoung Do
Main category: cs.CL
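TL;DR: This paper proposes MATA, a multi-agent framework for table question answering (TableQA) that uses multiple complementary reasoning paths and a set of tools built with small language models. It maintains high accuracy while improving efficiency and scalability, which makes it suitable for resource-constrained or privacy-sensitive settings.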
Details
Motivation: Existing large language models still face reliability, scalability, and efficiency challenges on table understanding tasks, especially in resource-constrained or privacy-sensitive environments. Method: Proposes the MATA multi-agent framework, which generates candidate answers through diverse reasoning paths and refines or selects among them with tools built from small language models; it also includes an algorithm designed to minimize expensive LLM calls. Result: Experiments on two benchmarks of varying difficulty with ten LLMs show that MATA achieves state-of-the-art accuracy with efficient reasoning and without excessive LLM inference. Conclusion: Careful orchestration of multiple reasoning paths yields scalable and reliable TableQA; MATA maintains strong performance with small models and adapts easily across LLM types. Abstract: Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments. In this paper, we introduce MATA, a multi-agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open-source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state-of-the-art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at https://github.com/AIDAS-Lab/MATA.
[36] Life Cycle-Aware Evaluation of Knowledge Distillation for Machine Translation: Environmental Impact and Translation Quality Trade-offs
Joseph Attieh,Timothee Mickus,Anne-Laure Ligozat,Aurélie Névéol,Jörg Tiedemann
Main category: cs.CL
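TL;DR: This paper evaluates several knowledge distillation (KD) methods for machine translation, considering not only the student model's translation quality but also computational cost, expressed as a carbon footprint. It finds that distillation overhead dominates at small deployment volumes, inference dominates at scale, and word-level distillation typically offers a better footprint-quality trade-off than sequence-level distillation.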
Details
Motivation: Studies typically report only the student's translation quality and omit the computational complexity and carbon emissions of the distillation process itself, making it hard to choose a KD method under compute constraints. Method: Uses the machine learning life cycle assessment (MLCA) tool to quantify the carbon footprint across the KD life cycle (teacher training, distillation, and student inference), covering runtime operational emissions and amortized hardware production costs, and compares representative KD methods (such as word-level and sequence-level) on the quality-footprint trade-off. Result: (i) Distillation overhead dominates the total footprint at small deployment volumes; (ii) inference dominates at scale, so KD is beneficial only beyond a task-dependent usage threshold; (iii) word-level distillation typically offers a more favorable footprint-quality trade-off than sequence-level distillation. Conclusion: Carbon footprint should be a key criterion when selecting a KD method; the proposed protocol gives reproducible guidance for selecting KD methods under explicit quality and compute-induced constraints. Abstract: Knowledge distillation (KD) is a tool to compress a larger system (teacher) into a smaller one (student). In machine translation, studies typically report only the translation quality of the student and omit the computational complexity of performing KD, making it difficult to select among the many available KD choices under compute-induced constraints. In this study, we evaluate representative KD methods by considering both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool. This assessment accounts for runtime operational emissions and amortized hardware production costs throughout the KD model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute-induced constraints.
[37] Maastricht University at AMIYA: Adapting LLMs for Dialectal Arabic using Fine-tuning and MBR Decoding
Abdulhai Alali,Abderrahmane Issam
Main category: cs.CL
TL;DR: This paper proposes a compact framework using LoRA fine-tuning, adapter merging, and dialect-aware MBR decoding to improve dialectal Arabic generation and translation, especially for Syrian, Moroccan, and Saudi Arabic.
Details
Motivation: Dialect variations are underrepresented in large language models due to limited data and linguistic variation, despite growing multilingual support. Method: The authors apply Low Rank Adaptation (LoRA) fine-tuning on monolingual and English-dialect parallel data, followed by adapter merging and dialect-aware Minimum Bayes Risk (MBR) decoding. Result: Experiments on Syrian, Moroccan, and Saudi Arabic show improved dialectal fidelity while maintaining semantic accuracy. Conclusion: Adapter merging and dialect-aware MBR decoding together form a compact and effective framework for robust dialectal Arabic generation. Abstract: Large Language Models (LLMs) are becoming increasingly multilingual, supporting hundreds of languages, especially high-resource ones. Unfortunately, dialect variations are still underrepresented due to limited data and linguistic variation. In this work, we adapt a pre-trained LLM to improve dialectal performance. Specifically, we use Low Rank Adaptation (LoRA) fine-tuning on monolingual and English-dialect parallel data, adapter merging, and dialect-aware MBR decoding to improve dialectal fidelity in generation and translation. Experiments on Syrian, Moroccan, and Saudi Arabic show that merging and MBR improve dialectal fidelity while preserving semantic accuracy. This combination provides a compact and effective framework for robust dialectal Arabic generation.
[38] TraceMem: Weaving Narrative Memory Schemata from User Conversational Traces
Yiming Shu,Pei Liu,Tiange Zhang,Ruiyang Gao,Jun Ma,Chen Sun
Main category: cs.CL
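TL;DR: This paper proposes TraceMem, a cognitively inspired framework that builds structured, narrative user memory schemata through a three-stage memory pipeline (short-term memory processing, synaptic memory consolidation, and systems memory consolidation), improving the multi-hop and temporal reasoning of large language models in long-term dialogue.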
Details
Motivation: Large language models are limited by context window length and struggle to sustain long-term dialogue; existing memory systems treat interactions as disjointed snippets and fail to capture the narrative coherence of the dialogue stream. Method: TraceMem uses a three-stage cognitively inspired pipeline: (1) short-term memory processing, which uses deductive topic segmentation to identify episode boundaries and extract semantic representations; (2) synaptic memory consolidation, which summarizes episodes into episodic memories and distills them together with semantics into user-specific traces; and (3) systems memory consolidation, which applies two-stage hierarchical clustering to organize the traces into time-evolving narrative threads under unifying themes and encapsulates them as structured user memory cards. An agentic search mechanism supports reasoning over the memory. Result: Achieves state-of-the-art performance on the LoCoMo benchmark; analysis shows it surpasses baselines on multi-hop and temporal reasoning, underscoring the role of narrative coherence in deep comprehension. Conclusion: Building narratively coherent memory schemata is a central route to improving long-term LLM interaction; TraceMem offers a brain-inspired design for memory systems, and the paper adds an open discussion of the field and its outlook. Abstract: Sustaining long-term interactions remains a bottleneck for Large Language Models (LLMs), as their limited context windows struggle to manage dialogue histories that extend over time. Existing memory systems often treat interactions as disjointed snippets, failing to capture the underlying narrative coherence of the dialogue stream. We propose TraceMem, a cognitively-inspired framework that weaves structured, narrative memory schemata from user conversational traces through a three-stage pipeline: (1) Short-term Memory Processing, which employs a deductive topic segmentation approach to demarcate episode boundaries and extract semantic representation; (2) Synaptic Memory Consolidation, a process that summarizes episodes into episodic memories before distilling them alongside semantics into user-specific traces; and (3) Systems Memory Consolidation, which utilizes two-stage hierarchical clustering to organize these traces into coherent, time-evolving narrative threads under unifying themes. These threads are encapsulated into structured user memory cards, forming narrative memory schemata. For memory utilization, we provide an agentic search mechanism to enhance the reasoning process. Evaluation on the LoCoMo benchmark shows that TraceMem achieves state-of-the-art performance with a brain-inspired architecture. Analysis shows that by constructing coherent narratives, it surpasses baselines in multi-hop and temporal reasoning, underscoring its essential role in deep narrative comprehension. Additionally, we provide an open discussion on memory systems, offering our perspectives and future outlook on the field. Our code implementation is available at: https://github.com/YimingShu-teay/TraceMem
[39] Unsupervised Layer-Wise Dynamic Test Time Adaptation for LLMs
Longhuan Xu,Cunjian Chen,Feng Yin
Main category: cs.CL
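TL;DR: This paper proposes layer-wise dynamic test-time adaptation for large language models, in which a lightweight hypernetwork generates per-layer, per-step learning-rate multipliers for LoRA parameters to improve the stability and performance of unsupervised, sample-specific test-time adaptation.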
Details
Motivation: Unsupervised, sample-specific test-time adaptation (TTA) with a fixed learning rate can be unstable: updates may overfit to prompt-specific statistics and drift from the desired answer distribution, so a more robust adaptive mechanism is needed. Method: Proposes a layer-wise dynamic TTA framework that updates only LoRA parameters and uses a lightweight hypernetwork to predict per-layer, per-step learning-rate multipliers as a function of the prompt representation, the LLM structure, and the adaptation step. Result: Experiments across several datasets and LLMs show that the method improves TTA stability and performance by learning effective learning-rate scaling patterns over adaptation steps and transformer layer projections. Conclusion: Dynamic, fine-grained control of the learning rate per layer and per step is key to making unsupervised, sample-specific TTA more robust and effective. Abstract: Test-time adaptation (TTA) for large language models (LLMs) updates model parameters at inference time using signals available at deployment. This paper focuses on a common yet under-explored regime: unsupervised, sample-specific TTA, where the model adapts independently for each prompt using only the prompt itself, without gold answers or external supervision. Although appealing, naive unsupervised TTA with a fixed, handcrafted learning rate can be unstable: updates may overfit to prompt-specific statistics, drift from the desired answer distribution, and ultimately degrade generation quality. This failure mode is not surprising, as in this case TTA must adapt to a single prompt within only a few gradient steps, unlike standard training that averages updates over large datasets and long optimization horizons. Therefore, we propose layer-wise dynamic test-time adaptation, a framework which explicitly modulates TTA strength as a function of prompt representation, LLM structure and adaptation step. In our setting, TTA updates only LoRA parameters, and a lightweight hypernetwork predicts per-layer, per-step learning-rate multipliers, enabling fine-grained control. Experiments across various datasets and LLMs consistently show that our method substantially strengthens TTA by learning effective scaling patterns over adaptation steps and transformer layer projections, improving stability while delivering better performance.
[40] AI-Assisted Scientific Assessment: A Case Study on Climate Change
Christian Buck,Levke Caesar,Michelle Chen Huebscher,Massimiliano Ciaramita,Erich M. Fischer,Zeke Hausfather,Özge Kart Tokmak,Reto Knutti,Markus Leippold,Joseph Ludescher,Katharine J. Mach,Sofia Palazzo Corner,Kasra Rafiezadeh Shahi,Johan Rockström,Joeri Rogelj,Boris Sakschewski
Main category: cs.CL
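TL;DR: This paper examines AI as a scientific collaborator in climate science, specifically in assessing the stability of the Atlantic Meridional Overturning Circulation (AMOC). AI accelerated the workflow and improved logical consistency and presentation quality, but expert oversight and additions remained indispensable.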
Details
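Motivation: The 'guess and check' paradigm of AI co-scientists does not extend to complex scientific problems where repeated evaluation is impossible and ground truth is established by a consensus synthesis of theory and evidence, so it is worth examining how AI can support rigorous collaborative scientific assessment. Method: Integrates a Gemini-based AI environment into a standard scientific workflow and, together with 13 climate scientists, conducts a collaborative assessment of AMOC stability, tracking content generation, revision cycles, time spent, and the share of AI versus human contributions. Result: The group produced a synthesis of 79 papers through 104 revision cycles in just over 46 person-hours; most AI-generated content was retained, and AI helped maintain logical consistency and presentation quality, but less than half of the report was produced by AI, with the rest depending on expert additions and substantial oversight. Conclusion: AI can support complex scientific assessment, speeding up the workflow and improving presentation quality, but it cannot yet replace expert judgment and theoretical synthesis; effective human-AI collaboration requires a clear division of labor, continuous oversight, and deep expert involvement. Abstract: The emerging paradigm of AI co-scientists focuses on tasks characterized by repeatable verification, where agents explore search spaces in 'guess and check' loops. This paradigm does not extend to problems where repeated evaluation is impossible and ground truth is established by the consensus synthesis of theory and existing evidence. We evaluate a Gemini-based AI environment designed to support collaborative scientific assessment, integrated into a standard scientific workflow. In collaboration with a diverse group of 13 scientists working in the field of climate science, we tested the system on a complex topic: the stability of the Atlantic Meridional Overturning Circulation (AMOC). Our results show that AI can accelerate the scientific workflow. The group produced a comprehensive synthesis of 79 papers through 104 revision cycles in just over 46 person-hours. AI contribution was significant: most AI-generated content was retained in the report. AI also helped maintain logical consistency and presentation quality. However, expert additions were crucial to ensure its acceptability: less than half of the report was produced by AI. Furthermore, substantial oversight was required to expand and elevate the content to rigorous scientific standards.
[41] Targum -- A Multilingual New Testament Translation Corpus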
Maciej Rapacz,Aleksander Smywiński-Pohl
Main category: cs.CL
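TL;DR: This paper introduces a multilingual corpus of 657 New Testament translations, 352 of them unique, with unprecedented depth in five languages: English, French, Italian, Polish, and Spanish. The corpus is aggregated from 12 online biblical libraries and one preexisting corpus, and each translation is manually annotated with metadata, supporting flexible, multilevel quantitative study of translation history.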
Details
Motivation: Existing corpora prioritize linguistic breadth over the depth of translation history, leaving a clear gap for European languages with rich biblical translation histories. Method: Builds a multilingual corpus of 657 New Testament translations (352 unique) with deep coverage of five languages, aggregated from 12 online biblical libraries and one preexisting corpus; each translation is manually annotated with metadata mapping it to a standardized identifier for the work, its specific edition, and its year of revision, so that researchers can define "uniqueness" for their own needs. Result: Releases the first multilingual biblical translation corpus designed for both micro-level analysis (for example, of the KJV lineage) and macro-level analysis (by deduplicating closely related texts). Conclusion: The corpus fills the gap in deep corpora for the quantitative study of translation history and establishes a new benchmark for flexible, multilevel research. Abstract: Many European languages possess rich biblical translation histories, yet existing corpora - in prioritizing linguistic breadth - often fail to capture this depth. To address this gap, we introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Aggregated from 12 online biblical libraries and one preexisting corpus, each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define "uniqueness" for their own needs: they can perform micro-level analyses on translation families, such as the KJV lineage, or conduct macro-level studies by deduplicating closely related texts. By providing the first resource designed for such flexible, multilevel analysis, our corpus establishes a new benchmark for the quantitative study of translation history.
[42] Improving Interpretability of Lexical Semantic Change with Neurobiological Features
Kohei Oda,Hiroya Takamura,Kiyoaki Shirai,Natthawut Kertkeidkachorn
Main category: cs.CL
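TL;DR: This paper proposes a method that maps lexical semantic change (LSC) into a neurobiological feature space to make LSC more interpretable, and it outperforms most previous methods at estimating the degree of LSC.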
Details
Motivation: Most LSC studies focus on improving estimates of the degree of change, but it remains difficult to interpret how a word's meaning changes; better interpretability could lead to new insights. Method: Maps contextualized word embeddings from a pre-trained language model into a neurobiological feature space in which each dimension corresponds to a primitive feature of words and its value represents the intensity of that feature. Result: Outperforms most previous methods at estimating the degree of LSC; thanks to its interpretability, it uncovers types of LSC overlooked in previous studies and effectively retrieves words with specific types of LSC. Conclusion: The mapping improves both LSC estimation performance and the interpretability of semantic change, offering a new perspective and analysis tool for LSC research. Abstract: Lexical Semantic Change (LSC) is the phenomenon in which the meaning of a word changes over time. Most studies on LSC focus on improving the performance of estimating the degree of LSC; however, it is often difficult to interpret how the meaning of a word changes. Enhancing the interpretability of LSC is a significant challenge as it could lead to novel insights in this field. To tackle this challenge, we propose a method to map the semantic space of contextualized embeddings of words obtained by a pre-trained language model to a neurobiological feature space. In the neurobiological feature space, each dimension corresponds to a primitive feature of words, and its value represents the intensity of that feature. This enables humans to interpret LSC systematically. When employed for the estimation of the degree of LSC, our method demonstrates superior performance in comparison to the majority of the previous methods. In addition, given the high interpretability of the proposed method, several analyses on LSC are carried out. The results demonstrate that our method not only discovers interesting types of LSC that have been overlooked in previous studies but also effectively searches for words with specific types of LSC.
[43] Where Are We At with Automatic Speech Recognition for the Bambara Language?
Seydou Diallo,Yacouba Diarra,Mamadou K. Keita,Panga Azazia Kamaté,Adam Bouno Kampo,Aboubacar Ouattara
Main category: cs.CL
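TL;DR: This paper presents the first standardized benchmark for evaluating automatic speech recognition (ASR) in Bambara, based on one hour of professionally recorded Malian constitutional text. An evaluation of 37 models shows that current performance is far below deployment standards, with a best WER of 46.76% and a best CER of 13.00%, suggesting that multilingual pre-training and model scaling alone are insufficient for underrepresented languages.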
Details
Motivation: There is no standardized ASR evaluation benchmark for low-resource languages such as Bambara, and the real performance of existing models is unclear; a controlled, reproducible benchmark is needed to drive research. Method: Builds a benchmark from one hour of professionally recorded constitutional text as a controlled reference set under near-optimal acoustic and linguistic conditions, and evaluates 37 ASR models (from Bambara-trained systems to large-scale commercial models) using WER and CER. Result: The best system by WER achieved 46.76%, while a different model set the best CER of 13.00%; several prominent multilingual models exceeded 100% WER. Because the dataset represents the most simplified and formal form of spoken Bambara, performance in real-world settings is yet to be tested. Conclusion: Multilingual pre-training and model scaling alone are insufficient for low-resource ASR; targeted modeling and data work are needed. The authors release the benchmark and a public leaderboard to support future research. Abstract: This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, utilizing one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards in a narrow formal domain; the top-performing system in terms of Word Error Rate (WER) achieved 46.76% and the best Character Error Rate (CER) of 13.00% was set by another model, while several prominent multilingual models exceeded 100% WER. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario of the most simplified and formal form of spoken Bambara, these figures are yet to be tested against practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
[44] Decomposing Reasoning Efficiency in Large Language Models
Daniel Kaiser,Arnoldo Frigessi,Ali Ramezani-Kebrya,Benjamin Ricaud
Main category: cs.CL
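TL;DR: This paper proposes a trace-optional framework that decomposes the token efficiency of large language models on reasoning tasks into completion, conditional correctness, and verbosity, further factors verbosity into its components, and adds trace-quality measures. It shows that accuracy and token-efficiency rankings diverge and that different models have different efficiency bottlenecks.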
Details
Motivation: Standard evaluations report only final accuracy, obscuring where tokens are spent or wasted during reasoning. Method: Proposes a trace-optional framework that decomposes token efficiency into completion under a fixed token budget, conditional correctness given completion, and verbosity; when per-instance workload proxies are available, verbosity is further factored into mean verbalization overhead and a coupling coefficient; when reasoning traces are available, deterministic trace-quality measures (grounding, repetition, prompt copying) separate degenerate looping from verbose-but-engaged reasoning. Result: Evaluating 25 models on CogniLoad shows that accuracy and token-efficiency rankings diverge (Spearman ρ = 0.63), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times and is only weakly related to model scale. Conclusion: The decomposition reveals distinct bottleneck profiles across models, which suggest different interventions for improving token efficiency. Abstract: Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman $ρ=0.63$), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times (only weakly related to model scale). Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
[45] AnalyticsGPT: An LLM Workflow for Scientometric Question Answering
Khang Ly,Georgios Cheirmpos,Adrian Raudaschl,Christopher James,Seyed Amin Tabatabaei
Main category: cs.CL
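TL;DR: This paper presents AnalyticsGPT, an end-to-end LLM-based workflow for answering scientometric questions, that is, meta-scientific questions about the "science of science". The system combines retrieval-augmented generation (RAG) with agentic concepts, uses a proprietary research performance assessment platform as its knowledge base, and is evaluated by subject matter experts and LLMs-as-judges.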
Details
Motivation: Scientometric question answering is an underrepresented downstream task whose planning phase poses unique challenges, such as recognizing academic entities in questions and retrieving multi-faceted scientometric indices (for example, impact factors), which traditional paper-based scientific question answering does not face. Method: Builds an LLM-based sequential workflow that combines retrieval-augmented generation (RAG) with agentic task decomposition, planning, and reasoning; uses a proprietary research performance assessment platform as the retrieval database; and evaluates with both subject matter experts and LLMs-as-judges. Result: Provides empirical insights into the efficacy of LLMs on this niche downstream task. Conclusion: AnalyticsGPT illustrates the potential of LLMs on meta-scientific tasks that require multi-step planning, entity recognition, and structured analysis. Abstract: This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta-scientific questions concerning the "science of science." When compared to traditional scientific question answering based on papers, the task poses unique challenges in the planning phase, namely the need for named-entity recognition of academic entities within questions and multi-faceted data retrieval involving scientometric indices, e.g., impact factors. Beyond their exceptional capacity for treating traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition and planning and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end-to-end system implementing a sequential workflow with retrieval-augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the data into presentable and well-structured high-level analyses. As a database for retrieval-augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs-as-judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: https://github.com/lyvykhang/llm-agents-scientometric-qa/tree/acl.
[46] Text summarization via global structure awareness
Jiaquan Zhang,Chaoning Zhang,Shuxu Chen,Yibei Liu,Chenghao Li,Qigan Sun,Shuai Yuan,Fachrina Dewi Puspitasari,Dongshen Han,Guoqing Wang,Sung-Ho Bae,Yang Yang
Main category: cs.CL
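TL;DR: This paper proposes GloSA-sum, which the authors describe as the first summarization approach to achieve global structure awareness via topological data analysis (TDA). It uses a semantic-weighted graph and persistent homology to identify core semantics and logical structures, and combines lightweight proxy metrics with a hierarchical strategy to improve efficiency while preserving semantic and logical integrity, which particularly benefits LLM downstream tasks.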
Details
Motivation: Existing summarization work focuses on model improvements or sentence-level pruning and overlooks global structure, which disrupts coherence and weakens downstream performance; LLM-based methods are more accurate but incur substantial resource and time costs. Method: Constructs a semantic-weighted graph from sentence embeddings, uses persistent homology to identify core semantics and logical structures, and preserves them in a "protection pool" as the backbone of the summary; designs a topology-guided iterative strategy in which lightweight proxy metrics approximate sentence importance; and adds a hierarchical strategy that integrates segment-level and global summarization. Result: On multiple datasets, GloSA-sum reduces redundancy while preserving semantic and logical integrity, balances accuracy and efficiency, and shortens LLM input contexts while retaining essential reasoning chains, benefiting downstream tasks. Conclusion: By bringing topological data analysis to summarization, GloSA-sum delivers efficient, structure-aware long-document summarization that balances accuracy, efficiency, and downstream suitability. Abstract: Text summarization is a fundamental task in natural language processing (NLP), and the information explosion has made long-document processing increasingly demanding, making summarization essential. Existing research mainly focuses on model improvements and sentence-level pruning, but often overlooks global structure, leading to disrupted coherence and weakened downstream performance. Some studies employ large language models (LLMs), which achieve higher accuracy but incur substantial resource and time costs. To address these issues, we introduce GloSA-sum, the first summarization approach that achieves global structure awareness via topological data analysis (TDA). GloSA-sum summarizes text efficiently while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, where persistent homology identifies core semantics and logical structures, preserved in a "protection pool" as the backbone for summarization. We design a topology-guided iterative strategy, where lightweight proxy metrics approximate sentence importance to avoid repeated high-cost computations, thus preserving structural integrity while improving efficiency. To further enhance long-text processing, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking a balance between accuracy and efficiency, and further benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.
[47] From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models
Abdulmuizz Khalak,Abderrahmane Issam,Gerasimos Spanakis
Main category: cs.CL
TL;DR: This paper investigates cross-lingual transfer of Arabic language models from Modern Standard Arabic (MSA) to various Arabic dialects, revealing uneven transfer performance correlated with geographic proximity and evidence of negative interference when models are trained on multiple dialects.
Details
Motivation: Arabic LMs are mostly pretrained on Modern Standard Arabic (MSA), yet real-world usage, especially online, involves diverse dialects that differ in their similarity to MSA, limiting model effectiveness and raising questions about cross-dialect transfer. Method: The authors conduct probing experiments on 3 NLP tasks and analyze representational similarity to assess the cross-dialect transfer capability of Arabic LMs. Result: Transfer is possible but highly disproportionate across dialects; geographic proximity partially explains performance variation; models trained on multiple dialects show evidence of negative interference. Conclusion: The findings challenge the assumption of uniform dialect similarity and highlight risks in current cross-lingual transfer approaches for Arabic, urging more dialect-aware modeling strategies. Abstract: Arabic Language Models (LMs) are pretrained predominantly on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. While MSA as the standard written variety is commonly used in formal settings, people speak and write online in various dialects that are spread across the Arab region. This poses limitations for Arabic LMs, since its dialects vary in their similarity to MSA. In this work we study cross-lingual transfer of Arabic models using probing on 3 Natural Language Processing (NLP) tasks, and representational similarity. Our results indicate that transfer is possible but disproportionate across dialects, which we find to be partially explained by their geographic proximity. Furthermore, we find evidence for negative interference in models trained to support all Arabic dialects. This questions their degree of similarity, and raises concerns for cross-lingual transfer in Arabic models.
[48] LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse
Bakhtawar Ahtisham,Kirk Vanacore,Zhuqian Zhou,Jinsook Lee,Rene F. Kizilcec
Main category: cs.CL
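TL;DR: This paper proposes an error-detection method that uses an LLM's own generated reasoning text to predict whether its label prediction is correct. On an educational dialogue analysis task it reaches an F1 score of 0.83, and it finds that causal language is associated with correct predictions while epistemic hedging is associated with incorrect ones.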
Details
Motivation: LLM pipelines for large-scale automatic labeling of educational dialogue lack reliable ways to detect when the model is wrong, so a scalable, self-explanatory quality-control method is needed. Method: Encodes the LLM reasoning accompanying 30,300 teacher utterances with TF-IDF and trains five supervised classifiers (Random Forest performing best) to predict whether each assigned label is correct; uses the LIWC framework to examine how four linguistic markers (Causation, Differentiation, Tentativeness, and Insight) relate to correctness. Result: The Random Forest classifier achieves an F1 of 0.83 (recall 0.854); construct-specific detectors further improve performance on difficult constructs; correct reasoning more often contains causal language (e.g., because, therefore), whereas incorrect reasoning more often contains hedging words (e.g., might, could) and metacognitive verbs (e.g., think, realize); syntactic complexity and length are not discriminative. Conclusion: LLM-generated reasoning carries enough signal to detect the model's own errors; the approach is practical and scalable and offers a quality-control path for automated educational dialogue analysis. Abstract: Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model's assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
[49] How Do People Quantify Naturally: Evidence from Mandarin Picture Description
Yayun Zhang,Guanyi Chen,Fahime Same,Saad Mahamood,Tingting He
Main category: cs.CL
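TL;DR: Using a picture description task, this study examines whether native Mandarin speakers choose to quantify, how precisely, and with which strategies in naturalistic language production. It finds that object numerosity, animacy, and production modality systematically shape quantificational behaviour.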
Details
Motivation: Quantification is a fundamental component of everyday language use, yet little is known about how speakers decide whether and how to quantify in naturalistic production. Method: Uses a picture-based elicited description task in which Mandarin speakers freely describe scenes containing multiple objects, with no instructions to count or quantify, in both spoken and written modalities, and examines the choice to quantify, its precision, and the strategies used. Result: Increasing numerosity reduces both the likelihood and the precision of quantification, while animacy and production modality selectively modulate strategy choice; quantificational behaviour differs systematically across modalities. Conclusion: Quantification is shaped by contextual factors (numerosity, animacy, modality) and can be studied under unconstrained production conditions; the study also provides a naturalistic dataset for further analysis. Abstract: Quantification is a fundamental component of everyday language use, yet little is known about how speakers decide whether and how to quantify in naturalistic production. We investigate quantification in Mandarin Chinese using a picture-based elicited description task in which speakers freely described scenes containing multiple objects, without explicit instructions to count or quantify. Across both spoken and written modalities, we examine three aspects of quantification: whether speakers choose to quantify at all, how precise their quantification is, and which quantificational strategies they adopt. Results show that object numerosity, animacy, and production modality systematically shape quantificational behaviour. In particular, increasing numerosity reduces both the likelihood and the precision of quantification, while animate referents and modality selectively modulate strategy choice. This study demonstrates how quantification can be examined under unconstrained production conditions and provides a naturalistic dataset for further analyses of quantity expression in language production.
[50] SinFoS: A Parallel Dataset for Translating Sinhala Figures of Speech
Johan Sofalas,Dilushri Pavithra,Nevidu Jayatilleke,Ruvan Weerasinghe
Main category: cs.CL
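TL;DR: This paper builds a corpus of 2,344 Sinhala figures of speech, classifies their cultural origins, and identifies cross-lingual equivalents. It also develops a binary classifier with about 92% accuracy and shows that current large language models often fail to convey idiomatic meanings, providing a new benchmark for low-resource NLP and culturally aware machine translation.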
Details
Motivation: Neural machine translation (NMT) performs poorly on figurative expressions in low-resource languages such as Sinhala because of data scarcity, so a dedicated corpus with cultural and cross-lingual annotations is needed. Method: Builds a corpus of 2,344 Sinhala figures of speech (FoS), classifies their cultural origins, and identifies cross-lingual equivalents; trains a binary classifier to distinguish two types of FoS; and evaluates existing LLMs on the dataset. Result: The binary classifier reaches about 92% accuracy; existing LLMs show significant shortcomings in conveying idiomatic meanings; the corpus is publicly released as a benchmark for low-resource NLP and culturally aware translation. Conclusion: By building an annotated corpus of Sinhala figures of speech, the work fills a data gap in culturally grounded low-resource NLP, exposes the limits of current LLMs in understanding cultural semantics, and provides a foundation and evaluation standard for future research. Abstract: Figures of Speech (FoS) consist of multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well with the figurative expressions of high-resource languages, it often faces challenges when dealing with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we have developed a binary classifier to differentiate between two types of FoS in the dataset, achieving an accuracy rate of approximately 92%. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in the current capabilities of LLMs, as these models often struggle to accurately convey idiomatic meanings. By making this dataset publicly available, we offer a crucial benchmark for future research in low-resource NLP and culturally aware machine translation.
[51] Steer2Edit: From Activation Steering to Component-Level Editing
Chung-En Sun,Ge Yan,Zimo Wang,Tsui-Wei Weng
Main category: cs.CL
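TL;DR: Steer2Edit is a training-free framework that turns inference-time steering vectors into diagnostic signals for rank-1 weight edits of model components (such as attention heads and MLP neurons), achieving better attribute-utility trade-offs while leaving the standard forward pass unchanged.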
Details
Motivation: Existing steering methods based on hidden representations typically apply a global, fixed activation intervention, ignoring the fact that behaviors are governed by a small, heterogeneous set of components, which makes it hard to balance attribute and utility under strong control. Method: Steer2Edit repurposes the steering vector from an inference-time control signal into a diagnostic signal for component-level (attention head, MLP neuron) rank-1 weight editing; it does not change the standard forward pass and supports parallel inference. Result: Across safety alignment, hallucination mitigation, and reasoning efficiency, at matched downstream performance relative to baselines, safety improves by up to 17.2%, truthfulness increases by 9.8%, and average reasoning length falls by 12.2%. Conclusion: Steer2Edit builds a theoretically grounded, training-free, interpretable bridge between representation steering and weight editing, enabling fine-grained, component-level behavioral control. Abstract: Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
[52] The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies
Chenxu Wang,Chaozhuo Li,Songyang Liu,Zejian Chen,Jinyu Hou,Ji Qi,Rui Li,Litian Zhang,Qiwei Ye,Zheng Liu,Xu Chen,Xi Zhang,Philip S. Yu
Main category: cs.CL
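TL;DR: This paper argues that multi-agent systems built from large language models face a fundamental conflict in simultaneously achieving continuous self-evolution, complete isolation, and safety invariance (the "self-evolution trilemma"), and shows from an information-theoretic perspective that isolated self-evolution leads to irreversible degradation of safety alignment; experiments support the theoretical prediction, and the authors argue that external oversight or new mechanisms are needed to preserve safety.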
Details
Motivation: Address the fundamental challenge of how multi-agent LLM systems can maintain safety alignment while pursuing continuous self-evolution, and expose the inherent risks of the existing "closed self-evolution" paradigm. Method: Formalize safety within an information-theoretic framework as divergence from the distribution of human values; show theoretically that isolated self-evolution necessarily induces statistical blind spots and safety degradation; validate with empirical and qualitative analysis of the open-ended Moltbook community and two closed self-evolving systems. Result: The existence of the "self-evolution trilemma" is established theoretically; empirically, all closed self-evolving systems tested show ongoing degradation of safety alignment; statistical blind spots are identified as the key mechanism behind the degradation. Conclusion: A fully isolated self-evolving AI society cannot maintain safety alignment over the long term, so external oversight or new safety-preserving mechanisms must be introduced; the study shifts the AI safety discussion from symptom-driven patches to a principled understanding of the risks in a system's intrinsic dynamics. Abstract: The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment--a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the divergence degree from anthropic value distributions. We theoretically demonstrate that isolated self-evolution induces statistical blind spots, leading to the irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on the self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
[53] AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning
Tilahun Yeshambel,Moncef Garouani,Josiane Mothe
Main category: cs.CL
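TL;DR: This paper releases two high-quality Amharic datasets, one for neural retrieval-ranking and one for instruction-following text generation, to address data scarcity in low-resource languages, together with a methodology that can be generalized to other low-resource languages.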
Details
Motivation: Low-resource languages such as Amharic lack large-scale, high-quality supervised data, which constrains neural retrieval and generative models; dedicated data resources are needed to support this research. Method: Two datasets are built: (i) a retrieval-ranking dataset of 1,091 manually verified query-positive-negative document triplets, created from expert-curated queries, web-derived queries, and LLM-assisted generation and validated by native speakers; (ii) a generation dataset of 6,285 prompt-response pairs spanning multiple domains and instruction types, generated by several LLMs and manually reviewed and corrected for grammaticality, relevance, fluency, and factual plausibility. All data is released in standardized formats and splits. Result: Two open, standardized, reproducible Amharic datasets are released, supporting neural retrieval (e.g., DPR, ColBERT, SPLADE) and instruction-tuned generation, along with a dataset-construction methodology that transfers to other low-resource languages. Conclusion: The work substantially eases the data bottleneck for Amharic information retrieval and generative modelling and offers a reusable data-building template and practical foundation for AI research on low-resource languages. Abstract: Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that supports research on (i) neural retrieval-ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive/negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types, generated with several LLMs and refined through manual review and correction for grammaticality, relevance, fluency, and factual plausibility. We release both datasets with standardized splits and formats (CSV, JSON, JSONL) to enable reproducible work on Amharic retrieval, ranking, and generative modelling. These datasets also come with a methodology that can be generalized to other low-resource languages.
[54] LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
William Lugoloobi,Thomas Foster,William Bankes,Chris Russell
Main category: cs.CL
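TL;DR: This paper studies how a large language model's internal representations can be used to predict its own probability of success on reasoning tasks, enabling more efficient inference scheduling; by training linear probes as predictors, it finds that the model's notion of difficulty differs from that of humans and that the gap widens with extended reasoning; a routing strategy built on these probes substantially reduces inference cost while maintaining or even improving performance.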
Details
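Motivation: Running LLMs with extended reasoning is expensive, and accurately identifying which inputs truly need extra compute remains challenging. The paper asks whether a model's probability of success on a given problem can be recovered from its internal representations before generation, and whether that signal can be used to improve inference efficiency. Method: Train linear probes on pre-generation hidden-layer activations to predict task-specific success rates on math and coding tasks; use the E2H-AMC dataset to compare model and human perceptions of difficulty; build a probe-driven mechanism for routing queries across multiple models. Result: Linear probes substantially outperform baselines built on surface features such as question length and TF-IDF; the "difficulty" a model encodes is closely tied to its own performance but clearly distinct from human difficulty, and the gap grows with extended reasoning; on MATH, the routing method cuts inference cost by up to 70% relative to the best single model while performance improves rather than degrades. Conclusion: LLM internal representations carry a strongly predictive signal that can guide efficient inference scheduling; a model's notion of difficulty is intrinsic and task-specific and does not depend on human intuition; the signal has practical deployment value and can markedly improve system-level efficiency. Abstract: Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty
[55] ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning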
Shuaiyi Nie,Siyu Ding,Wenyuan Zhang,Linhao Yu,Tianmeng Yang,Yao Chen,Tingwen Liu,Weichong Yin,Yu Sun,Hua Wu
Main category: cs.CL
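TL;DR: This paper proposes ATTNPO, a low-overhead process-supervised reinforcement learning framework that uses the model's intrinsic attention signals for step-level credit assignment, mitigating overthinking in large reasoning models, substantially shortening reasoning length, and improving performance.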
Details
Motivation: Large reasoning models perform well on complex reasoning tasks but often overthink, generating redundant reasoning steps without performance gains; existing trajectory-level length penalties work poorly and tend to hurt accuracy, while process-supervised methods are resource-intensive and assign credit inaccurately. Method: ATTNPO uses the model's intrinsic attention mechanism to identify special attention heads, derives step-level credit assignment from their attention scores, and applies two sub-strategies: discouraging redundant steps and reducing penalties on essential steps. Result: Experiments show that ATTNPO substantially shortens reasoning length and significantly improves performance across 9 benchmarks. Conclusion: ATTNPO is an efficient, low-overhead process-supervised RL method that accurately separates redundant from necessary reasoning steps and effectively mitigates overthinking. Abstract: Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to effectively shorten reasoning length and degrade accuracy, as they uniformly treat all reasoning steps and lack fine-grained signals to distinguish redundancy from necessity. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. By leveraging the attention scores of these heads, we then employ two sub-strategies to mitigate overthinking by discouraging redundant steps while preserving accuracy by reducing penalties on essential steps. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.
[56] ViMultiChoice: Toward a Method That Gives Explanation for Multiple-Choice Reading Comprehension in Vietnamese
Trung Tien Cao,Lam Minh Thai,Nghia Hieu Nguyen,Duc-Vu Nguyen,Ngan Luu-Thuy Nguyen
Main category: cs.CL
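TL;DR: This paper proposes ViMultiChoice, a new model for Vietnamese multiple-choice reading comprehension that jointly predicts the correct answer and generates an explanation, and builds an accompanying new dataset; it achieves SOTA results on both ViMMRC 2.0 and the new dataset.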
Details
Motivation: Existing Vietnamese multiple-choice reading comprehension (MCRC) models cannot explain the reasoning behind their choices; models with explanation-generation ability and a matching dataset are needed. Method: ViMultiChoice jointly models Vietnamese reading comprehension, predicting the option and generating the explanation together; the authors also build the first Vietnamese MCRC dataset that supports explanation generation. Result: ViMultiChoice achieves SOTA performance on both ViMMRC 2.0 and the new dataset; joint training significantly improves multiple-choice accuracy. Conclusion: Jointly modeling answer selection and explanation generation is an effective way to improve Vietnamese MCRC, and the proposed method and dataset provide an important foundation for explainable reading comprehension. Abstract: Multiple-choice Reading Comprehension (MCRC) models aim to select the correct answer from a set of candidate options for a given question. However, they typically lack the ability to explain the reasoning behind their choices. In this paper, we introduce a novel Vietnamese dataset designed to train and evaluate MCRC models with explanation generation capabilities. Furthermore, we propose ViMultiChoice, a new method specifically designed for modeling Vietnamese reading comprehension that jointly predicts the correct answer and generates a corresponding explanation. Experimental results demonstrate that ViMultiChoice outperforms existing MCRC baselines, achieving state-of-the-art (SotA) performance on both the ViMMRC 2.0 benchmark and the newly introduced dataset. Additionally, we show that jointly training option decision and explanation generation leads to significant improvements in multiple-choice accuracy.
[57] A Unified Assessment of the Poverty of the Stimulus Argument for Neural Language Models
Xiulin Yang,Arianna Bisazza,Nathan Schneider,Ethan Gotlieb Wilcox
Main category: cs.CL
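TL;DR: This paper builds the POSHBENCH benchmark to test whether neural language models, which lack innate linguistic constraints, can acquire syntax from limited input; Transformer models do generalize but less efficiently than children, and adding cognitively motivated inductive biases improves syntactic competence without improving POSHBENCH performance, challenging the claim that innate syntax is necessary.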
Details
Motivation: Test computationally, using neural language models, the Poverty of the Stimulus Hypothesis (PoSH), i.e. the claim that children must rely on innate linguistic constraints to acquire syntax from limited input. Method: Build POSHBENCH, a training-and-evaluation suite targeting English question formation, islands to movement, and other core PoSH phenomena; train Transformer models on 10-50M words of developmentally plausible text; run further experiments that add three cognitively motivated inductive biases. Result: Transformer models show some generalization even without direct positive evidence, but their data efficiency and strength of generalization fall short of children's; the three inductive biases tested improve general syntactic competence but not POSHBENCH performance. Conclusion: Innate syntax is not the only route to generalization, but human-like data efficiency requires inductive biases beyond those tested. Abstract: How can children acquire native-level syntax from limited input? According to the Poverty of the Stimulus Hypothesis (PoSH), the linguistic input children receive is insufficient to explain certain generalizations that are robustly learned; innate linguistic constraints, many have argued, are thus necessary to explain language learning. Neural language models, which lack such language-specific constraints in their design, offer a computational test of this longstanding (but controversial) claim. We introduce POSHBENCH, a training-and-evaluation suite targeting question formation, islands to movement, and other English phenomena at the center of the PoSH arguments. Training Transformer models on 10--50M words of developmentally plausible text, we find indications of generalization on all phenomena even without direct positive evidence -- yet neural models remain less data-efficient and their generalizations are weaker than those of children. We further enhance our models with three recently proposed cognitively motivated inductive biases. We find these biases improve general syntactic competence but not POSHBENCH performance. Our findings challenge the claim that innate syntax is the only possible route to generalization, while suggesting that human-like data efficiency requires inductive biases beyond those tested here.
[58] ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition
Khoa Anh Nguyen,Long Minh Hoang,Nghia Hieu Nguyen,Luan Thanh Nguyen,Ngan Luu-Thuy Nguyen
Main category: cs.CL
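TL;DR: This paper proposes ViSpeechFormer, a phoneme-based framework for Vietnamese automatic speech recognition (ASR) that exploits the language's highly transparent grapheme-phoneme mapping and is the first to explicitly model phonemic representations; on public datasets it shows strong performance, better generalization to out-of-vocabulary words, and less sensitivity to training bias.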
Details
Motivation: Vietnamese has a highly transparent grapheme-phoneme correspondence, yet no earlier ASR framework explicitly modeled phonemic representations. Method: ViSpeechFormer, a phoneme-based model designed for Vietnamese ASR that exploits the language's phonetic orthography. Result: Strong performance on two public Vietnamese ASR datasets, with better generalization and greater robustness to out-of-vocabulary words and training bias. Conclusion: The phoneme-based paradigm is promising for Vietnamese and other languages with phonetic orthographies and offers a new direction for ASR in such languages. Abstract: Vietnamese has a phonetic orthography, where each grapheme corresponds to at most one phoneme and vice versa. Exploiting this high grapheme-phoneme transparency, we propose ViSpeechFormer (Vietnamese Speech TransFormer), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). To the best of our knowledge, this is the first Vietnamese ASR framework that explicitly models phonemic representations. Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias. This phoneme-based paradigm is also promising for other languages with phonetic orthographies. The code will be released upon acceptance of this paper.
[59] SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation
Homaira Huda Shomee,Rochana Chaturvedi,Yangxinyu Xie,Tanwi Mallick
Main category: cs.CL
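TL;DR: This paper proposes a reference-free, multi-dimensional evaluation framework for assessing the quality of LLM question answering in high-stakes domains such as natural hazard response, covering four dimensions (specificity, robustness, answer relevance, and context utilization), and builds a dataset of 1,412 domain-specific question-answer pairs.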
Details
Motivation: Existing evaluation methods for RAG and open-ended QA rely mainly on surface-level similarity, factual consistency, or semantic relevance, and struggle to measure whether an answer provides the fine-grained, critical information needed for domain-sensitive decisions. Method: Propose a reference-free, multi-dimensional evaluation framework (specificity, robustness, answer relevance, context utilization) and build a domain-specific QA dataset of 1,412 questions covering 40 professional roles and seven natural hazard types; complement it with a human evaluation of inter-annotator agreement and alignment with model judgments. Result: No single metric adequately reflects answer quality in high-stakes settings, so structured, multi-metric evaluation is needed; the human evaluation reveals the inherent subjectivity of domain-specific open-ended QA evaluation. Conclusion: Deploying LLMs in high-stakes, domain-sensitive applications calls for a multi-dimensional, structured evaluation framework rather than reliance on a single metric. Abstract: Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
[60] Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference
Wenxuan Xie,Yujia Wang,Xin Tan,Chaochao Lu,Xia Hu,Xuhong Wang
Main category: cs.CL
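TL;DR: This paper proposes DRIFT, a dual-model architecture in which a lightweight knowledge model dynamically compresses document chunks into implicit fact tokens and projects them into the reasoning model's embedding space, decoupling knowledge extraction from reasoning and improving performance on long-context tasks.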
Details
Motivation: Existing approaches (such as RAG and parametric knowledge editing) are limited by finite context windows, retrieval noise, or catastrophic forgetting, and struggle to integrate large-scale, dynamic knowledge into large language models. Method: DRIFT is a dual-model architecture: a lightweight knowledge model compresses document chunks into implicit fact tokens conditioned on the query, and these dense representations are projected into the reasoning model's embedding space, replacing the raw, redundant text. Result: DRIFT significantly outperforms strong baselines of comparable size on long-context tasks, improving reasoning accuracy and context scaling. Conclusion: DRIFT offers a scalable, efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Abstract: The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot-Xie/DRIFT.
[61] MEVER: Multi-Modal and Explainable Claim Verification with Graph-based Evidence Retrieval
Delvin Ce Zhang,Suhan Cui,Zhelin Chu,Xianren Zhang,Dongwon Lee
Main category: cs.CL
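TL;DR: This paper proposes a new multi-modal model that jointly performs evidence retrieval, multi-modal claim verification, and explanation generation, and builds AIChartClaim, a scientific dataset in the AI domain.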
Details
Motivation: Most existing claim verification work relies only on textual evidence or ignores explainability, which yields inaccurate and unconvincing verification. Method: Build a two-layer multi-modal graph for evidence retrieval (with image-to-text and text-to-image reasoning); propose token- and evidence-level fusion for multi-modal verification; introduce a multi-modal Fusion-in-Decoder to generate explanations; and build a new dataset, AIChartClaim. Result: Experiments show the proposed model performs strongly on multi-modal claim verification and explanation generation. Conclusion: The model improves the accuracy and explainability of multi-modal claim verification, and the dataset provides a new benchmark for the field. Abstract: Verifying the truthfulness of claims usually requires joint multi-modal reasoning over both textual and visual evidence, such as analyzing both a textual caption and a chart image for claim verification. In addition, to make the reasoning process transparent, a textual explanation is necessary to justify the verification result. However, most claim verification work focuses on reasoning over textual evidence only or ignores explainability, resulting in inaccurate and unconvincing verification. To address this problem, we propose a novel model that jointly achieves evidence retrieval, multi-modal claim verification, and explanation generation. For evidence retrieval, we construct a two-layer multi-modal graph for claims and evidence, where we design image-to-text and text-to-image reasoning for multi-modal retrieval. For claim verification, we propose token- and evidence-level fusion to integrate claim and evidence embeddings for multi-modal verification. For explanation generation, we introduce a multi-modal Fusion-in-Decoder for explainability. Finally, since almost all the datasets are in the general domain, we create a scientific dataset, AIChartClaim, in the AI domain to complement the claim verification community. Experiments show the strength of our model.
[62] Anagent For Enhancing Scientific Table & Figure Analysis
Xuehang Guo,Zhiyong Lu,Tom Hope,Qingyun Wang
Main category: cs.CL
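TL;DR: This paper proposes the AnaBench benchmark and the Anagent multi-agent framework to improve AI analysis of scientific tables and figures; four specialized agents cooperate on task decomposition, information retrieval, analysis generation, and quality assessment, and, combined with modular training strategies, deliver substantial gains across many subdomains.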
Details
Motivation: Current AI systems struggle to accurately interpret complex multimodal scientific knowledge and face fundamental obstacles when handling scientific tables and figures with heterogeneous structures and long contexts. Method: Build AnaBench, a large-scale multi-disciplinary benchmark (63,178 instances covering nine domains and seven complexity dimensions); propose Anagent, a multi-agent framework with four specialized agents (Planner, Expert, Solver, Critic), trained with modular strategies that combine supervised finetuning and specialized reinforcement learning. Result: In a comprehensive evaluation across 170 subdomains, Anagent improves by up to 13.43% in training-free settings and by up to 42.12% with finetuning; the results confirm that task-oriented reasoning and context-aware problem-solving are key to high-quality scientific table and figure analysis. Conclusion: Multi-agent collaboration and modular training are an effective route past the bottleneck in scientific table and figure analysis, and AnaBench provides a systematic evaluation standard for the field. Abstract: In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table & figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table & figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration.
Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves substantial improvements, up to 13.43% in training-free settings and 42.12% with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis. Our project page: https://xhguo7.github.io/Anagent/.
[63] Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing
Mohamed Afane,Kayla Laufer,Wenqi Wei,Ying Mao,Junaid Farooq,Ying Wang,Juntao Chen
Main category: cs.CL
TL;DR: Quantum-Audit is a new benchmark with 2,700 questions to assess language models' understanding of quantum computing concepts; it reveals strong overall performance (e.g., Claude Opus 4.5 at 84%), but notable weaknesses in handling expert-written, advanced, and false-premise questions.
Details
Motivation: Existing benchmarks focus on quantum code generation and circuit design, but lack systematic evaluation of language models' conceptual understanding of quantum computing. Method: Developed Quantum-Audit, a comprehensive benchmark of 2,700 questions—including 1,000 expert-written, 1,000 LLM-extracted/validated, and 700 specialized questions (350 open-ended + 350 with false premises)—and evaluated 26 state-of-the-art models. Result: Top models outperformed human experts overall (Claude Opus 4.5: 84% vs. expert avg 74%), but showed significant drops on expert-written (−12 pts), security (73%), and false-premise questions (<66% accuracy); models often failed to correct erroneous assumptions. Conclusion: Current language models demonstrate impressive but uneven conceptual understanding of quantum computing—strong on surface-level or LLM-generated content, yet fragile on rigorous, expert-level, and critical reasoning tasks. Abstract: Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, their understanding of quantum computing concepts has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics. We evaluate 26 models from leading organizations. Our benchmark comprises 1,000 expert-written questions, 1,000 questions extracted from research papers using LLMs and validated by experts, plus an additional 700 questions including 350 open-ended questions and 350 questions with false premises to test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. 
Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.
cs.CV [Back]
[64] UI-Venus-1.5 Technical Report
Veuns-Team,Changlong Gao,Zhangxuan Gu,Yulin Liu,Xinyu Qiu,Shuheng Shen,Yue Wen,Tianyu Xia,Zhenyu Xu,Zhengwen Zeng,Beitong Zhou,Xingran Zhou,Weizhi Chen,Sunhao Dai,Jingya Dou,Yichen Gong,Yuan Guo,Zhenlin Guo,Feng Li,Qian Li,Jinzhen Lin,Yuqi Zhou,Linchao Zhu,Liang Chen,Zhenyu Guo,Changhua Meng,Weiqiang Wang
Main category: cs.CV
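TL;DR: UI-Venus-1.5 is a unified, end-to-end GUI agent that reaches new SOTA performance on multiple benchmarks through three technical upgrades: a mid-training stage, online reinforcement learning, and model merging.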
Details
Motivation: Existing GUI agents struggle to combine broad generality with strong real-task performance; the goal is more robust navigation and execution in real, complex digital environments (especially Chinese mobile apps). Method: The UI-Venus-1.5 model family (2B and 8B dense models plus a 30B-A3B MoE model) with a three-stage training strategy: 1) a mid-training stage on 10 billion tokens to establish foundational GUI semantics; 2) full-trajectory online reinforcement learning adapted to long-horizon, dynamic navigation; 3) model merging that unifies grounding, web, and mobile specialist models into a single checkpoint. Result: UI-Venus-1.5 clearly surpasses previous strong baselines on ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), and demonstrates real-world instruction execution across a variety of Chinese mobile apps. Conclusion: UI-Venus-1.5 unifies generality and high performance and is a reliable end-to-end solution for real-world GUI automation. Abstract: GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging. In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B) to meet various downstream application scenarios. Compared to our previous version, UI-Venus-1.5 introduces three key technical advances: (1) a comprehensive Mid-Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full-trajectory rollouts, aligning training objectives with long-horizon, dynamic navigation in large-scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain-specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI-Venus-1.5 establishes new state-of-the-art performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real-world scenarios.
Code: https://github.com/inclusionAI/UI-Venus; Model: https://huggingface.co/collections/inclusionAI/ui-venus
[65] Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling
Ruijie Ye,Jiayi Zhang,Zhuoxin Liu,Zihao Zhu,Siyuan Yang,Li Li,Tianfu Fu,Franck Dernoncourt,Yue Zhao,Jiacheng Zhu,Ryan Rossi,Wenhao Chai,Zhengzhong Tu
Main category: cs.CV
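TL;DR: This paper proposes Agent Banana, a framework for high-fidelity, object-aware, deliberative instruction-based image editing that tackles three problems in professional workflows: over-editing, single-turn limitations, and low-resolution evaluation; using Context Folding and Image Layer Decomposition, it markedly improves multi-turn consistency and background fidelity on HDD-Bench, a purpose-built 4K dialogue-based benchmark.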
Details
Motivation: Existing instruction-based image editing methods have three problems in professional workflows: over-editing; single-turn interaction that poorly supports multi-turn edits, which can degrade object faithfulness; and evaluation at about 1K resolution, far below real use (e.g., 4K ultra high-definition), which misaligns evaluation with practice. Method: Agent Banana is a hierarchical agentic framework with two core mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which supports localized layer-based edits that protect non-target regions and output native-resolution images. The authors also build HDD-Bench, a high-definition, dialogue-based benchmark with verifiable stepwise targets (native 4K images, 11.8M pixels). Result: On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12), remains competitive on instruction following, and performs strongly on standard single-turn editing benchmarks. Conclusion: Agent Banana advances reliable, professional-grade agentic image editing and improves its applicability to, and integration into, real workflows. Abstract: We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows that often operate on ultra high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and also attains strong performance on standard single-turn editing benchmarks.
We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.
[66] SemanticMoments: Training-Free Motion Similarity via Third Moment Features
Saar Huberman,Kfir Goldberg,Or Patashnik,Sagie Benaim,Ron Mokady
Main category: cs.CV
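TL;DR: This paper proposes SemanticMoments, which computes higher-order temporal statistics over features from pre-trained semantic models to achieve training-free, semantically grounded motion understanding, significantly outperforming existing RGB, optical-flow, and text-supervised methods on the SimMotion benchmarks.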
Details
Motivation: Existing video representation methods rely too heavily on static appearance and scene context and neglect motion dynamics, while traditional motion inputs such as optical flow lack high-level semantic understanding; both carry inherent biases. Method: SemanticMoments, a training-free method that computes higher-order temporal moments (e.g., variance, skewness) over the feature sequence extracted by a pre-trained semantic model and uses them as the motion representation. Result: On the newly built SimMotion synthetic and real-world benchmarks, SemanticMoments significantly outperforms existing RGB, optical-flow, and text-supervised methods and better disentangles motion from appearance. Conclusion: Temporal statistics in a semantic feature space can serve as a scalable, perceptually grounded foundation for motion-centric video understanding. Abstract: Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
[67] A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video
Andrea Filiberto Lucas,Dylan Seychell
Main category: cs.CV
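TL;DR: This paper proposes an interpretable, modular, deterministically auditable framework for automatically detecting and extracting personal names from news videos, and builds an annotated dataset. It reaches 95.8% mAP@0.5 on graphical element localisation and an F1 of 77.08% on name extraction; although this is somewhat below generative methods (84.18%), the framework is fully traceable, free of hallucination, and suited to journalistic and analytical settings.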
Details
Motivation: Demand for extracting on-screen information (such as personal names) from news video is growing, but graphical layouts, fonts, and platform designs vary widely, manual indexing is impractical, and journalism requires transparent, reliable, auditable methods.
Method: Builds a corpus of annotated frames covering the diversity of contemporary news graphics, and designs an interpretable, modular, deterministic extraction pipeline that emphasises auditability and data lineage.
Result: The detector reaches 95.8% mAP@0.5; name extraction achieves 79.9% precision, 74.4% recall, and F1 = 77.08%. Compared with generative multimodal methods (F1 = 84.18%), the pipeline avoids hallucination and is traceable end to end. A user survey shows that 59% of viewers have difficulty reading names in fast-paced broadcasts.
Conclusion: The work establishes a methodologically rigorous, interpretable, auditable baseline for hybrid multimodal information extraction in modern news media.
Abstract: The growing volume of video-based news content has heightened the need for transparent and reliable methods to extract on-screen information. Yet the variability of graphical layouts, typographic conventions, and platform-specific design patterns renders manual indexing impractical. This work presents a comprehensive framework for automatically detecting and extracting personal names from broadcast and social-media-native news videos. It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics and proposes an interpretable, modular extraction pipeline designed to operate under deterministic and auditable conditions. The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and stochastic inference. The underlying detector achieves 95.8% mAP@0.5, demonstrating operationally robust performance for graphical element localisation. While generative systems achieve marginally higher raw accuracy (F1: 84.18% vs 77.08%), they lack the transparent data lineage required for journalistic and analytical contexts. The proposed pipeline delivers balanced precision (79.9%) and recall (74.4%), avoids hallucination, and provides full traceability across each processing stage. Complementary user findings indicate that 59% of respondents report difficulty reading on-screen names in fast-paced broadcasts, underscoring the practical relevance of the task. The results establish a methodologically rigorous and interpretable baseline for hybrid multimodal information extraction in modern news media.
[68] Decoding Future Risk: Deep Learning Analysis of Tubular Adenoma Whole-Slide Images
Ahmed Rahu,Brian Shula,Brandon Combs,Aqsa Sultana,Surendra P. Singh,Vijayan K. Asari,Derrick Forchetti
Main category: cs.CV
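TL;DR: This study explores using convolutional neural networks (CNNs) to analyze whole-slide images (WSIs) of low-grade tubular adenomas in order to identify subtle histological features that predict a patient's long-term colorectal cancer (CRC) risk.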
Details
Motivation: Traditional histopathological assessment may not fully capture subtle architectural or cytological features that indicate malignant potential; identifying the subgroup of low-risk patients at high risk of progression is needed for individualized surveillance and preventive treatment.
Method: Applies CNN-based deep learning to WSIs of low-grade tubular adenomas, aiming to identify subtle histological features associated with later development of CRC.
Result: The abstract does not report specific results; the study aims to test whether CNNs can identify subtle histological features with predictive value.
Conclusion: The study offers a new approach to improving risk stratification of low-grade adenomas with digital pathology and machine learning, and may advance precision screening and intervention strategies for colorectal cancer.
Abstract: Colorectal cancer (CRC) remains a significant cause of cancer-related mortality, despite the widespread implementation of prophylactic initiatives aimed at detecting and removing precancerous polyps. Although screening effectively reduces incidence, a notable portion of patients initially diagnosed with low-grade adenomatous polyps will still develop CRC later in life, even without the presence of known high-risk syndromes. Identifying which low-risk patients are at higher risk of progression is a critical unmet need for tailored surveillance and preventative therapeutic strategies. Traditional histological assessment of adenomas, while fundamental, may not fully capture subtle architectural or cytological features indicative of malignant potential. Advancements in digital pathology and machine learning provide an opportunity to analyze whole-slide images (WSIs) comprehensively and objectively. This study investigates whether machine learning algorithms, specifically convolutional neural networks (CNNs), can detect subtle histological features in WSIs of low-grade tubular adenomas that are predictive of a patient's long-term risk of developing colorectal cancer.
[69] All-in-One Conditioning for Text-to-Image Synthesis
Hirunima Jayasekara,Chuong Huynh,Yixuan Ren,Christabel Acquaye,Abhinav Shrivastava
Main category: cs.CV
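TL;DR: This paper proposes a zero-shot, scene-graph-based conditioning method that uses the ASQL Conditioner to optimize visual conditions at inference time, improving semantic fidelity and structural coherence for complex prompts in text-to-image synthesis.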
Details
Motivation: Existing text-to-image models struggle to maintain semantic fidelity and structural coherence on complex prompts involving multiple objects, attributes, and spatial relationships.
Method: Proposes a zero-shot, scene-graph-based conditioning mechanism whose core is the ASQL Conditioner, driven by a lightweight language model, which generates soft visual guidance and is combined with inference-time optimization of a diffusion model.
Result: Without sacrificing diversity, the method improves alignment between generated images and complex text prompts, as well as coherence and controllability.
Conclusion: A zero-shot inference framework based on scene graphs and ASQL conditioning effectively strengthens the compositional generalization and structural understanding of text-to-image models.
Abstract: Accurate interpretation and visual representation of complex prompts involving multiple objects, attributes, and spatial relationships is a critical challenge in text-to-image synthesis. Despite recent advancements in generating photorealistic outputs, current models often struggle with maintaining semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis within the framework of scene graph structures, aiming to enhance the compositional abilities of existing models. Even though prior approaches have attempted to address this by using pre-defined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the core of our method is the Attribute-Size-Quantity-Location (ASQL) Conditioner, which produces visual conditions via a lightweight language model and guides diffusion-based generation through inference-time optimization. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.
[70] Wearable environmental sensing to forecast how legged systems will interact with upcoming terrain
Michael D. Murray,James Tung,Richard W. Nuckols
Main category: cs.CV
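TL;DR: This study explores the feasibility of using computer vision (an RGB-D camera) to forecast the foot center of pressure (COP) and time of impact (TOI) during a gait transition (level ground to stair ascent). It proposes a lightweight CNN-RNN model that, within a 250 ms forecast horizon, reaches mean absolute errors of roughly 24 to 29 mm for COP and 18 to 21 ms for TOI and runs in real time on edge devices.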
Details
Motivation: Existing computer-vision approaches to environment classification rarely address forward-looking prediction of foot-ground contact parameters (such as COP and TOI), which matters for anticipatory control of assistive devices.
Method: Collects shank-mounted RGB-D video and plantar pressure data from 8 subjects performing a level-ground to stair-ascent task; builds and trains a hybrid CNN-RNN model that continuously forecasts anterior-posterior COP and TOI within a 250 ms window before foot-strike; evaluates MAE at forecast horizons of 150, 100, and 50 ms; and analyzes how kinematic factors (torso velocity, toe-swing speed, foot-strike location) affect accuracy.
Result: COP MAE falls as the forecast horizon shortens (29.42 mm at 150 ms to 23.72 mm at 50 ms), as does TOI MAE (21.14 ms at 150 ms to 17.73 ms at 50 ms). Faster toe-swing improves COP accuracy but does not affect TOI; more anterior foot-strikes reduce COP accuracy but do not affect TOI. The model runs in real time at 60 FPS on a consumer-grade laptop or an edge device.
Conclusion: Shank-view RGB-D data alone, with a lightweight CNN-RNN, suffices to forecast key gait contact parameters accurately and in real time, offering a feasible route to feed-forward control in assistive systems such as smart prostheses and exoskeletons.
Abstract: Computer vision (CV) has been used for environmental classification during gait and is often used to inform control in assistive systems; however, the ability to predict how the foot will contact a changing environment is underexplored. We evaluated the feasibility of forecasting the anterior-posterior (AP) foot center-of-pressure (COP) and time-of-impact (TOI) prior to foot-strike on a level-ground to stair-ascent transition. Eight subjects wore an RGB-D camera on their right shank and instrumented insoles while performing the task of stepping onto the stairs. We trained a CNN-RNN to forecast the COP and TOI continuously within a 250ms window prior to foot-strike, termed the forecast horizon (FH). The COP mean-absolute-error (MAE) at 150, 100, and 50ms FH was 29.42mm, 26.82mm, and 23.72mm respectively. The TOI MAE was 21.14, 20.08, and 17.73ms for 150, 100, and 50ms respectively. While torso velocity had no effect on the error in either task, faster toe-swing speeds prior to foot-strike were found to improve the prediction accuracy in the COP case but had no significant effect in the TOI case. Further, more anterior foot-strikes were found to reduce COP prediction accuracy but did not affect the TOI prediction accuracy. We also found that our lightweight model was capable of running at 60 FPS on either a consumer-grade laptop or an edge computing device. This study demonstrates that forecasting COP and TOI from visual data was feasible using a lightweight model, which may have important implications for anticipatory control in assistive systems.
[71] VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models
Chenyu Wang,Tianle Chen,H. M. Sabbir Ahmad,Kayhan Batmanghelich,Wenchao Li
Main category: cs.CV
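TL;DR: This paper presents VLM-UQBench, a benchmark for evaluating modality-specific and cross-modal uncertainty in vision-language models (VLMs), and designs simple metrics that measure how sensitive uncertainty quantification (UQ) methods are to perturbations and how well they correlate with hallucinations. Experiments show that existing UQ methods are strongly modality-specialized, depend on the particular VLM, and struggle to detect fine-grained, instance-level uncertainty.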
Details
Motivation: Existing UQ methods for VLMs cannot precisely localize the source of uncertainty (image, text, or image-text misalignment), which makes safe and reliable deployment hard to support.
Method: Builds VLM-UQBench, comprising 600 real-world samples (from VizWiz) divided into clean, image, text, and cross-modal uncertainty subsets; designs a scalable perturbation pipeline (8 visual, 5 textual, and 3 cross-modal perturbations); and proposes two new metrics, the sensitivity of UQ scores to perturbations and their correlation with hallucinations.
Result: (i) UQ methods show strong modality specialization and depend heavily on the underlying VLM; (ii) modality-specific uncertainty often accompanies hallucinations, yet current UQ scores give only weak and inconsistent risk signals; (iii) UQ methods can rival chain-of-thought baselines on overt, group-level ambiguity but perform poorly on the fine-grained, instance-level ambiguity introduced by perturbations.
Conclusion: There is a significant gap between current UQ practice and the fine-grained, modality-aware uncertainty that real VLM deployment requires; more precise and robust UQ methods are needed.
Abstract: Uncertainty quantification (UQ) is vital for ensuring that vision-language models (VLMs) behave safely and reliably. A central challenge is to localize uncertainty to its source, determining whether it arises from the image, the text, or misalignment between the two. We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in VLMs. It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations. We further propose two simple metrics that quantify the sensitivity of UQ scores to these perturbations and their correlation with hallucinations, and use them to evaluate a range of UQ methods across four VLMs and three datasets. Empirically, we find that: (i) existing UQ methods exhibit strong modality-specific specialization and substantial dependence on the underlying VLM, (ii) modality-specific uncertainty frequently co-occurs with hallucinations while current UQ scores provide only weak and inconsistent risk signals, and (iii) although UQ methods can rival reasoning-based chain-of-thought baselines on overt, group-level ambiguity, they largely fail to detect the subtle, instance-level ambiguity introduced by our perturbation pipeline. These results highlight a significant gap between current UQ practices and the fine-grained, modality-aware uncertainty required for reliable VLM deployment.
[72] VLM-Guided Iterative Refinement for Surgical Image Segmentation with Foundation Models
Ange Lou,Yamin Li,Qi Chang,Nan Xi,Luyuan Xie,Zichao Li,Tianyu Luan
Main category: cs.CV
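TL;DR: This paper proposes IR-SIS, an iterative surgical image segmentation system driven by natural language descriptions. It combines a fine-tuned SAM3, a vision-language model, and an agentic workflow, lets clinicians take part in interactive segmentation through natural language feedback, and comes with a multi-granularity language-annotated dataset.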
Details
Motivation: Existing surgical image segmentation methods are limited to predefined categories, make one-shot predictions without adaptive refinement, and lack mechanisms for clinician interaction.
Method: Proposes the IR-SIS system, which (1) generates an initial segmentation with a fine-tuned SAM3; (2) uses a vision-language model to detect instruments and assess segmentation quality; (3) adaptively selects refinement strategies through an agentic workflow; (4) supports clinician interaction through natural language feedback; and (5) is accompanied by a multi-granularity language-annotated dataset derived from EndoVis2017/2018.
Result: Reaches state-of-the-art performance on both in-domain and out-of-distribution data; clinician interaction improves results further.
Conclusion: Establishes the first natural-language-based surgical image segmentation framework with adaptive self-refinement, offering a new paradigm for real-time interactive segmentation in the clinic.
Abstract: Surgical image segmentation is essential for robot-assisted surgery and intraoperative guidance. However, existing methods are constrained to predefined categories, produce one-shot predictions without adaptive refinement, and lack mechanisms for clinician interaction. We propose IR-SIS, an iterative refinement system for surgical image segmentation that accepts natural language descriptions. IR-SIS leverages a fine-tuned SAM3 for initial segmentation, employs a Vision-Language Model to detect instruments and assess segmentation quality, and applies an agentic workflow that adaptively selects refinement strategies. The system supports clinician-in-the-loop interaction through natural language feedback. We also construct a multi-granularity language-annotated dataset from EndoVis2017 and EndoVis2018 benchmarks. Experiments demonstrate state-of-the-art performance on both in-domain and out-of-distribution data, with clinician interaction providing additional improvements. Our work establishes the first language-based surgical segmentation framework with adaptive self-refinement capabilities.
[73] Rethinking Global Text Conditioning in Diffusion Transformers
Nikita Starodubcev,Daniil Pakhomov,Zongze Wu,Ilya Drobyshevskiy,Yuchen Liu,Zhonghao Wang,Yuqian Zhou,Zhe Lin,Dmitry Baranchuk
Main category: cs.CV
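TL;DR: This paper examines whether modulation-based text conditioning is necessary in diffusion transformers. It finds that the conventional pooled text embedding contributes little to performance, but that using it as a guidance signal for controllable shifts in properties yields significant gains without additional training.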
Details
Motivation: To determine whether modulation-based text conditioning in diffusion transformers is necessary and whether it offers a performance advantage.
Method: Analyzes the role of the conventional pooled text embedding in diffusion models and proposes a new use that turns it from a fixed conditioning signal into a dynamic guidance signal.
Result: Confirms that the pooled embedding contributes little under conventional modulation but, used as guidance, significantly improves tasks such as text-to-image/video generation and image editing.
Conclusion: Modulation-based text conditioning is not essential, but when redesigned it becomes an efficient, plug-and-play mechanism for controllable guidance.
Abstract: Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.
[74] X-Mark: Saliency-Guided Robust Dataset Ownership Verification for Medical Imaging
Pranav Kulkarni,Junfeng Guo,Heng Huang
Main category: cs.CV
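TL;DR: This paper proposes X-Mark, a sample-specific clean-label watermarking method for chest X-rays. A conditional U-Net generates unique perturbations in salient regions, and a multi-objective loss with Laplacian regularization provides scale invariance, diagnostic fidelity, and robustness; on the CheXpert dataset the method reaches a 100% watermark success rate and resists attacks.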
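To make the idea of Laplacian regularization concrete, here is a minimal sketch of a generic discrete Laplacian penalty on a 2D perturbation: smooth (low-frequency) perturbations score near zero, while pixel-level alternations score high. This is not X-Mark's actual loss, whose exact form is not given in this digest; the 4-neighbor kernel, the mean-squared reduction, and the name `laplacian_penalty` are all assumptions for illustration.

```python
# Standard 4-neighbor discrete Laplacian kernel (an assumed choice).
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]

def laplacian_penalty(delta):
    """Mean squared Laplacian response over the interior of a 2D perturbation.

    delta: list of rows (lists of floats) describing the perturbation added to an image.
    Border pixels are skipped so that no padding convention has to be assumed.
    """
    h, w = len(delta), len(delta[0])
    total, count = 0.0, 0
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            response = sum(
                LAPLACIAN[a][b] * delta[i - 1 + a][j - 1 + b]
                for a in range(3)
                for b in range(3)
            )
            total += response * response
            count += 1
    return total / count if count else 0.0

# Hypothetical perturbations: a constant offset versus a pixel-level checkerboard.
flat = [[1.0] * 4 for _ in range(4)]
checker = [[1.0 if (i + j) % 2 == 0 else -1.0 for j in range(4)] for i in range(4)]

smooth_cost = laplacian_penalty(flat)
rough_cost = laplacian_penalty(checker)
```

Adding such a term to a training objective discourages high-frequency patterns, which are the ones most likely to be destroyed when an image is resized; that is the intuition behind linking the regularizer to scale invariance.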
Details
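Motivation: High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises copyright and ethical concerns. Existing watermarking methods designed for natural images adapt poorly to medical images, which are high-resolution, dynamically scaled, low in visual diversity, and full of subtle anatomical structures, and any watermark must also preserve diagnostic quality.
Method: Proposes X-Mark, which uses a conditional U-Net to generate, for each chest X-ray, a sample-specific perturbation located in salient regions; designs a multi-component training objective that balances watermark efficacy, robustness to scaling, diagnostic fidelity, and visual distinguishability; adds Laplacian regularization to suppress high-frequency perturbations and improve scale invariance; and verifies ownership in a black-box setting by detecting characteristic behaviors in suspicious models.
Result: On CheXpert, X-Mark reaches a 100% watermark success rate (WSR), lowers the false-positive probability by 12% in the Ind-M scenario, and is robust to potential adaptive attacks.
Conclusion: X-Mark is an efficient, robust, diagnosis-friendly watermarking scheme designed for chest X-rays that addresses the difficulty of handling scale changes and preserving diagnostic quality at the same time in copyright protection for medical imaging data.
Abstract: High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises serious copyright and ethical concerns. Medical imaging presents a unique challenge for existing dataset ownership verification methods designed for natural images, as static watermark patterns generated in fixed-scale images scale poorly to dynamic and high-resolution scans with limited visual diversity and subtle anatomical structures, while preserving diagnostic quality. In this paper, we propose X-Mark, a sample-specific clean-label watermarking method for chest X-ray copyright protection. Specifically, X-Mark uses a conditional U-Net to generate unique perturbations within salient regions of each sample. We design a multi-component training objective to ensure watermark efficacy and robustness against dynamic scaling processes while preserving diagnostic quality and visual distinguishability. We incorporate Laplacian regularization into our training objective to penalize high-frequency perturbations and achieve watermark scale-invariance. Ownership verification is performed in a black-box setting to detect characteristic behaviors in suspicious models. Extensive experiments on CheXpert verify the effectiveness of X-Mark, achieving a WSR of 100% and reducing the probability of false positives in the Ind-M scenario by 12%, while demonstrating resistance to potential adaptive attacks.
[75] A Deep Multi-Modal Method for Patient Wound Healing Assessment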
Subba Reddy Oota,Vijay Rowtula,Shahid Mohammed,Jeffrey Galitz,Minghsun Liu,Manish Gupta
Main category: cs.CV
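TL;DR: This paper proposes a wound assessment method based on deep multi-modal and transfer learning that fuses wound images with clinical variables to predict a patient's risk of hospitalization while also estimating wound variables and healing trajectories, so that complications can be identified early and clinicians' diagnostic burden reduced.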
Details
Motivation: Hospitalization is a major driver of high wound care costs. Many wounds do not need immediate hospitalization, but delayed treatment, poor patient compliance, or co-morbidities can cause them to deteriorate and eventually lead to hospitalization, so the risk needs to be predicted in advance.
Method: Proposes a deep multi-modal method that combines wound images with clinical variables in a transfer learning framework to jointly predict hospitalization risk, wound variables (such as area and depth), and healing trajectories.
Result: Builds a transfer learning model that predicts both wound variables and healing trajectories from wound images, improving the accuracy and clinical usefulness of hospitalization risk prediction.
Conclusion: The model helps identify wound complexity early, supports clinical decisions, reduces clinicians' diagnosis time, and may lower unnecessary hospitalizations and overall care costs.
Abstract: Hospitalization of patients is one of the major factors for high wound care costs. Most patients do not acquire a wound which needs immediate hospitalization. However, due to factors such as delay in treatment, patient's non-compliance or existing co-morbid conditions, an injury can deteriorate and ultimately lead to patient hospitalization. In this paper, we propose a deep multi-modal method to predict the patient's risk of hospitalization. Our goal is to predict the risk confidently by collectively using the wound variables and wound images of the patient. Existing works in this domain have mainly focused on healing trajectories based on distinct wound types. We developed a transfer learning-based wound assessment solution, which can predict both wound variables from wound images and their healing trajectories, which is our primary contribution. We argue that the development of a novel model can help in early detection of the complexities in the wound, which might affect the healing process and also reduce the time spent by a clinician to diagnose the wound.
[76] GAFR-Net: A Graph Attention and Fuzzy-Rule Network for Interpretable Breast Cancer Image Classification
Lin-Guo Gao,Suxing Liu
Main category: cs.CV
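TL;DR: This paper proposes GAFR-Net, a network combining graph attention with differentiable fuzzy rules for breast cancer histopathology image classification with few labeled samples, offering both high accuracy and interpretability.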
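The topological descriptors that the fuzzy-rule module is said to consume (node degree and clustering coefficient) can be computed from a sample-similarity graph as in the sketch below. The thresholding rule, the value of `tau`, and the toy similarity matrix are assumptions for illustration only; how GAFR-Net actually builds its graph and turns descriptors into rules is not specified in this digest.

```python
def threshold_graph(sim, tau):
    """Build an undirected graph from a symmetric similarity matrix.

    Samples i and j are connected when sim[i][j] >= tau (an assumed rule).
    Returns a list of neighbor sets, one per node.
    """
    n = len(sim)
    return [{j for j in range(n) if j != i and sim[i][j] >= tau} for i in range(n)]

def node_degree(adj, i):
    """Number of neighbors of node i."""
    return len(adj[i])

def clustering_coefficient(adj, i):
    """Fraction of pairs of i's neighbors that are themselves connected."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u in neighbors for v in neighbors if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))

# Hypothetical similarities: samples 0, 1, 2 form a tight group; sample 3 is close only to 0.
sim = [
    [1.0, 0.9, 0.8, 0.7],
    [0.9, 1.0, 0.9, 0.1],
    [0.8, 0.9, 1.0, 0.2],
    [0.7, 0.1, 0.2, 1.0],
]
adj = threshold_graph(sim, tau=0.5)
descriptors = [(node_degree(adj, i), clustering_coefficient(adj, i)) for i in range(len(sim))]
```

A rule of the "IF degree is high AND clustering is high THEN ..." kind would take such descriptors as inputs; making the memberships differentiable is the part the paper contributes and is not reproduced here.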
Details
Motivation: Conventional deep learning models degrade when annotated data are limited and lack interpretability, which hinders clinical use.
Method: GAFR-Net builds a similarity-based graph representation, models relationships between samples with multi-head graph attention, and adds a differentiable fuzzy-rule module that encodes topological features (such as node degree, clustering coefficient, and label consistency) as explicit IF-THEN diagnostic logic.
Result: On three benchmark datasets (BreakHis, Mini-DDSM, and ICIAR2018), GAFR-Net outperforms existing methods across magnifications and classification tasks.
Conclusion: GAFR-Net shows strong generalization and practical value in weakly supervised medical image analysis and is a reliable, transparent tool for clinical decision support.
Abstract: Accurate classification of breast cancer histopathology images is pivotal for early oncological diagnosis and therapeutic intervention. However, conventional deep learning architectures often encounter performance degradation under limited annotations and suffer from a "black-box" nature, hindering their clinical integration. To mitigate these limitations, we propose GAFR-Net, a robust and interpretable Graph Attention and Fuzzy-Rule Network specifically engineered for histopathology image classification with scarce supervision. GAFR-Net constructs a similarity-driven graph representation to model inter-sample relationships and employs a multi-head graph attention mechanism to capture complex relational features across heterogeneous tissue structures. Concurrently, a differentiable fuzzy-rule module encodes intrinsic topological descriptors, including node degree, clustering coefficient, and label consistency, into explicit, human-understandable diagnostic logic. This design establishes transparent "IF-THEN" mappings that mimic the heuristic deduction process of medical experts, providing clear reasoning behind each prediction without relying on post-hoc attribution methods. Extensive evaluations on three benchmark datasets (BreakHis, Mini-DDSM, and ICIAR2018) demonstrate that GAFR-Net consistently outperforms various state-of-the-art methods across multiple magnifications and classification tasks. These results validate the superior generalization and practical utility of GAFR-Net as a reliable decision-support tool for weakly supervised medical image analysis.
[77] Deep Modeling and Interpretation for Bladder Cancer Classification
Ahmad Chaddad,Yihang Wu,Xianrui Chen
Main category: cs.CV
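TL;DR: This paper evaluates the performance, calibration, and interpretability of 13 deep models (4 CNNs and 8 Transformers) on bladder cancer image classification, and finds that ConvNeXt generalizes poorly while the ViT family does better on calibration and on interpreting out-of-distribution (OOD) samples.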
Details
Motivation: ViT and CNN models that excel on natural images may perform poorly on medical images (such as bladder cancer images, where abnormal regions occupy a small share of the image), so their suitability needs systematic evaluation.
Method: Runs about 300 experiments on a public multicenter bladder cancer dataset, covering (1) standard classification with 13 models; (2) calibration analysis; and (3) interpretability evaluation with GradCAM++; test-time augmentation is added to improve interpretability.
Result: The ConvNeXt series reaches only about 60% accuracy and generalizes weakly; the ViT series is better calibrated than ConvNeXt and Swin; no single model meets every need: ConvNeXt suits in-distribution samples, while ViT and its variants suit interpreting out-of-distribution samples.
Conclusion: Model choice should follow the clinical scenario: ConvNeXt for in-distribution tasks, and the ViT family for out-of-distribution tasks or those needing strong interpretability.
Abstract: Deep models based on vision transformer (ViT) and convolutional neural network (CNN) have demonstrated remarkable performance on natural datasets. However, these models may not perform similarly in medical imaging, where abnormal regions cover only a small portion of the image. This challenge motivates this study to investigate the latest deep models for bladder cancer classification tasks. We propose the following to evaluate these deep models: 1) standard classification using 13 models (four CNNs and eight transformer-based models), 2) calibration analysis to examine if these models are well calibrated for bladder cancer classification, and 3) GradCAM++ to evaluate the interpretability of these models for clinical diagnosis. We simulate $\sim 300$ experiments on a publicly available multicenter bladder cancer dataset, and the experimental results demonstrate that the ConvNeXt series shows limited generalization ability to classify bladder cancer images (e.g., $\sim 60\%$ accuracy). In addition, ViTs show better calibration effects compared to the ConvNeXt and Swin Transformer series. We also apply test-time augmentation to improve the models' interpretability. Finally, no model provides a one-size-fits-all solution for a feasible interpretable model. The ConvNeXt series is suitable for in-distribution samples, while ViT and its variants are suitable for interpreting out-of-distribution samples.
[78] Kyrtos: A methodology for automatic deep analysis of graphic charts with curves in technical documents
Michail S. Alexiou,Nikolaos G. Bourbakis
Main category: cs.CV
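TL;DR: This paper presents Kyrtos, a methodology for automatically recognizing and analyzing charts with curves in images from technical documents. It identifies the middle-points of line segments by clustering, parses behavioral features, converts the results into attributed graphs and natural language descriptions, and finally maps them to Stochastic Petri-nets (SPNs) that represent the chart's functionality.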
Details
Motivation: Technical documents are numerous and rich in knowledge, and understanding them as a whole depends on accurate analysis of multimodal content such as graphics, tables, and text, and of the associations among them.
Method: Uses a clustering-based approach to identify the middle-points of the line segments that make up a chart's curves; parses the segments to extract behavioral features such as direction and trend; builds attributed graphs that preserve structural characteristics and generates natural language descriptions; and finally converts these into SPNs that model the chart's functionality.
Result: Experiments show that Kyrtos recognizes and analyzes curves in multi-function charts accurately, as verified with a structural similarity measure between input curves and their approximations.
Conclusion: Kyrtos achieves automated recognition, structured analysis, and semantic conversion of curve charts in technical documents, offering a feasible path toward deep understanding of such documents.
Abstract: Deep Understanding of Technical Documents (DUTD) has become a very attractive field with great potential due to large amounts of accumulated documents and the valuable knowledge contained in them. In addition, the holistic understanding of technical documents depends on the accurate analysis of their particular modalities, such as graphics, tables, diagrams, and text, and their associations. In this paper, we introduce the Kyrtos methodology for the automatic recognition and analysis of charts with curves in graphics images of technical documents. The recognition processing part adopts a clustering-based approach to recognize middle-points that delimit the line-segments that construct the illustrated curves. The analysis processing part parses the extracted line-segments of curves to capture behavioral features such as direction and trend. These associations assist the conversion of recognized segments' relations into attributed graphs, for the preservation of the curves' structural characteristics. The graph relations are also expressed as natural language (NL) text sentences, enriching the document's text and facilitating their conversion into Stochastic Petri-net (SPN) graphs, which depict the internal functionality represented in the chart image. Extensive evaluation results demonstrate the accuracy of Kyrtos' recognition and analysis methods by measuring the structural similarity between input chart curves and the approximations generated by Kyrtos for charts with multiple functions.
[79] Impact of domain adaptation in deep learning for medical image classifications
Yihang Wu,Ahmad Chaddad
Main category: cs.CV
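TL;DR: This paper examines domain adaptation (DA) techniques in medical image analysis, using 10 deep learning models on four medical image datasets to test the effect of DA on multi-modality, noise robustness, federated learning, interpretability, and model calibration.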
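Model calibration in this entry is reported as expected calibration error (ECE). The sketch below shows the common equal-width-bin estimate of ECE: predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is weighted by the bin's share of samples. The bin count, the half-open binning convention, and the toy predictions are assumptions; the paper's exact evaluation protocol is not given in this digest.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin estimate of ECE.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction matched the label.
    Bins are [k/n_bins, (k+1)/n_bins), with confidence 1.0 placed in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Hypothetical predictions: the model claims 90% confidence but is right 3 times out of 4.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
# Hypothetical predictions: 80% confidence and 4 out of 5 correct, i.e. well calibrated.
calibrated = expected_calibration_error([0.8] * 5, [True, True, True, True, False])
```

A lower value means that stated confidence tracks observed accuracy more closely, which is the sense in which DA is reported to improve calibration.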
Details
Motivation: Despite notable progress in DA, its core idea remains aligning data from different domains into a shared feature space so that knowledge from a labeled source domain improves performance on an unlabeled target domain; this paper aims to systematically evaluate how effective DA is in several practical medical scenarios (such as noise, federated learning, and interpretability).
Method: Uses 10 deep learning models to simulate common DA methods and runs experiments on four medical image datasets covering multi-modality, Gaussian noise, federated learning (FL), Grad-CAM++ interpretability analysis, and model calibration (ECE).
Result: DA with ResNet34 improves performance by 4.7% on a brain tumor dataset; raises accuracy by about 3% on noisy data; yields only about a 0.3% gain in federated learning for skin cancer; improves Grad-CAM++ interpretability, which has clinical value; and lowers the expected calibration error (ECE) by about 2% on multi-modality data.
Conclusion: DA has practical value in medical imaging tasks, notably for performance, noise robustness, interpretability, and calibration, but its gains are limited in specific settings such as federated learning, where further adaptation is needed.
Abstract: Domain adaptation (DA) is a quickly expanding area in machine learning that involves adjusting a model trained in one domain to perform well in another domain. While there have been notable progressions, the fundamental concept of numerous DA methodologies has persisted: aligning the data from various domains into a shared feature space. In this space, knowledge acquired from labeled source data can improve the model training on target data that lacks sufficient labels. In this study, we demonstrate the use of 10 deep learning models to simulate common DA techniques and explore their application in four medical image datasets. We have considered various situations such as multi-modality, noisy data, federated learning (FL), interpretability analysis, and classifier calibration. The experimental results indicate that using DA with ResNet34 in a brain tumor (BT) dataset results in an enhancement of 4.7% in model performance. Similarly, the use of DA can reduce the impact of Gaussian noise, as it provides a $\sim 3\%$ accuracy increase using ResNet34 on a BT dataset. Furthermore, simply introducing DA into an FL framework shows limited potential (e.g., $\sim 0.3\%$ increase in performance) for skin cancer classification. In addition, the DA method can improve the interpretability of the models using the Grad-CAM++ technique, which offers clinical value. Calibration analysis also demonstrates that using DA provides a lower expected calibration error (ECE) value ($\sim 2\%$) compared to CNN alone on a multi-modality dataset.
[80] Fully Differentiable Bidirectional Dual-Task Synergistic Learning for Semi-Supervised 3D Medical Image Segmentation
Jun Li
Main category: cs.CV
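TL;DR: This paper proposes a Differentiable Bidirectional Synergistic Learning (DBiSL) framework for semi-supervised medical image segmentation that integrates supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation, enabling online bidirectional collaboration between the regression and segmentation tasks and significantly improving performance.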
Details
Motivation: Medical image annotation is costly and expert resources are scarce, so high-quality labeled data are in short supply; existing dual-task collaborative learning methods support only unidirectional interaction (e.g., regression to segmentation) and cannot fully exploit online bidirectional cross-task collaboration.
Method: Proposes the DBiSL framework, which imposes fully differentiable, online bidirectional consistency constraints between the segmentation and regression tasks and unifies four semi-supervised learning components: supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation.
Result: Reaches state-of-the-art performance on two benchmark medical image datasets.
Conclusion: Beyond improving semi-supervised medical image segmentation, DBiSL offers new ideas for unified semi-supervised framework design and dual-task-driven SSL, and provides a general multitask learning architecture that can extend to broader computer vision tasks.
Abstract: Semi-supervised learning relaxes the need for large pixel-wise labeled datasets for image segmentation by leveraging unlabeled data. The scarcity of high-quality labeled data remains a major challenge in medical image analysis due to the high annotation costs and the need for specialized clinical expertise. Semi-supervised learning has demonstrated significant potential in addressing this bottleneck, with pseudo-labeling and consistency regularization emerging as two predominant paradigms. Dual-task collaborative learning, an emerging consistency-aware paradigm, seeks to derive supplementary supervision by establishing prediction consistency between related tasks. However, current methodologies are limited to unidirectional interaction mechanisms (typically regression-to-segmentation), as segmentation results can only be transformed into regression outputs in an offline manner, thereby failing to fully exploit the potential benefits of online bidirectional cross-task collaboration. Thus, we propose a fully Differentiable Bidirectional Synergistic Learning (DBiSL) framework, which seamlessly integrates and enhances four critical SSL components: supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation. Experiments on two benchmark datasets demonstrate our method's state-of-the-art performance. Beyond technical contributions, this work provides new insights into unified SSL framework design and establishes a new architectural foundation for dual-task-driven SSL, while offering a generic multitask learning framework applicable to broader computer vision applications. The code will be released on GitHub upon acceptance.
[81] Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D
Yan Luo,Advaith Ravishankar,Serena Liu,Yutong Yang,Mengyu Wang
Main category: cs.CV
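TL;DR: This paper evaluates the zero-shot generalization of five state-of-the-art single-image-to-3D reconstruction models on medical imaging and finds that depth reconstruction commonly fails when 3D structure is reconstructed from a single medical slice, suggesting that multi-view methods are needed for reliability.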
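The benchmark reports voxel-based overlap and point-cloud distance metrics. As a hedged illustration of those two families, the sketch below computes voxel intersection-over-union and a symmetric Chamfer distance. Whether the paper uses exactly IoU and Chamfer distance is an assumption, as are the function names and the toy shapes.

```python
import math

def voxel_iou(a, b):
    """Intersection-over-union of two sets of occupied voxel coordinates (x, y, z)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two non-empty point lists.

    Brute force, O(len(p) * len(q)); adequate only for small illustrative inputs.
    """
    def one_way(src, dst):
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)
    return one_way(p, q) + one_way(q, p)

# Hypothetical depth failure: the reconstruction is a one-voxel-thick slab,
# while the ground truth is a full 4 x 4 x 4 cube.
slab = {(x, y, 0) for x in range(4) for y in range(4)}
cube = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
overlap = voxel_iou(slab, cube)

# Two single-point clouds separated by a 3-4-5 triangle.
separation = chamfer_distance([(0, 0, 0)], [(3, 4, 0)])
```

The slab example shows why overlap stays limited when depth is not recovered: the in-plane outline can be perfect while most of the volume is still missing.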
Details
Motivation: 3D anatomical understanding is central to diagnosis and treatment planning, but volumetric imaging is costly and involves long waits; it is unclear whether existing image-to-3D foundation models trained on natural images apply to medical data.
Method: Builds a controlled zero-shot benchmark for single-slice medical image-to-3D reconstruction and evaluates five models (SAM3D, Hunyuan3D-2.1, Direct3D, Hi3DGen, and TripoSG) on six medical datasets (covering anatomical and pathological structures) and two natural image datasets, comparing them quantitatively with metrics such as voxel overlap and point cloud distance.
Result: All models reach only moderate voxel overlap on medical data, indicating a common depth reconstruction failure; global distance metrics show SAM3D is best in topological similarity, while other models tend to oversimplify structure.
Conclusion: Single-slice medical image-to-3D reconstruction has inherent limits that stem mainly from depth ambiguity caused by the planar nature of 2D medical images; multi-view image-to-3D reconstruction should therefore be pursued for reliable medical 3D inference.
Abstract: A 3D understanding of anatomy is central to diagnosis and treatment planning, yet volumetric imaging remains costly with long wait times. Image-to-3D foundation models can solve this issue by reconstructing 3D data from 2D modalities. Current foundation models are trained on natural image distributions to reconstruct naturalistic objects from a single image by leveraging geometric priors across pixels. However, it is unclear whether these learned geometric priors transfer to medical data. In this study, we present a controlled zero-shot benchmark of single-slice medical image-to-3D reconstruction across five state-of-the-art image-to-3D models: SAM3D, Hunyuan3D-2.1, Direct3D, Hi3DGen, and TripoSG. These are evaluated across six medical datasets spanning anatomical and pathological structures and two natural datasets, using voxel-based metrics and point cloud distance metrics. Across medical datasets, voxel-based overlap remains moderate for all models, consistent with a depth reconstruction failure mode when inferring volume from a single slice. In contrast, global distance metrics show more separation between methods: SAM3D achieves the strongest overall topological similarity to ground truth medical 3D data, while alternative models are more prone to over-simplification of the reconstruction. Our results quantify the limits of single-slice medical reconstruction and highlight depth ambiguity caused by the planar nature of 2D medical data, motivating multi-view image-to-3D reconstruction to enable reliable medical 3D inference.
[82] K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge
Zhikai Li,Jiatong Li,Xuewen Liu,Wangbo Zhao,Pan Du,Kaicheng Zhou,Qingyi Gu,Yang You,Zhen Dong,Kurt Keutzer
Main category: cs.CV
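TL;DR: This paper proposes K-Sort Eval, an efficient and reliable VLM-based framework for evaluating visual generative models that combines posterior correction with dynamic matching and markedly improves evaluation efficiency and alignment with human preferences.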
Details
Motivation: 现有基于人群投票的评估方法(如Arena)成本高、耗时长;而直接用视觉语言模型(VLM)替代人工判断又因幻觉和偏差导致与人类偏好不一致,且静态评估效率低。 Method: 构建基于K-Sort Arena人类投票的高质量排序数据集;引入(K+1)-wise自由比对机制;提出后验校正方法(依据VLM预测与人类监督的一致性自适应修正贝叶斯后验概率);设计动态匹配策略(权衡不确定性与多样性以最大化单次比较收益)。 Result: 实验表明K-Sort Eval评估结果与K-Sort Arena高度一致,通常仅需少于90次模型运行,兼具高效性与可靠性。 Conclusion: K-Sort Eval为视觉生成模型提供了可扩展、低成本、高保真的人类对齐评估新范式,有效缓解了VLM评估中的偏差与低效问题。 Abstract: The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. While the crowdsourced Arena platforms offer human preference assessments by collecting human votes, they are costly and time-consuming, inherently limiting their scalability. Leveraging vision-language model (VLMs) as substitutes for manual judgments presents a promising solution. However, the inherent hallucinations and biases of VLMs hinder alignment with human preferences, thus compromising evaluation reliability. Additionally, the static evaluation approach lead to low efficiency. In this paper, we propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Specifically, we curate a high-quality dataset from thousands of human votes in K-Sort Arena, with each instance containing the outputs and rankings of K models. When evaluating a new model, it undergoes (K+1)-wise free-for-all comparisons with existing models, and the VLM provide the rankings. To enhance alignment and reliability, we propose a posterior correction method, which adaptively corrects the posterior probability in Bayesian updating based on the consistency between the VLM prediction and human supervision. Moreover, we propose a dynamic matching strategy, which balances uncertainty and diversity to maximize the expected benefit of each comparison, thus ensuring more efficient evaluation. 
Extensive experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs, demonstrating both its efficiency and reliability.
[83] LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging
Xinyu Wang,Ke Deng,Fei Dou,Jinbo Bi,Jin Lu
Main category: cs.CV
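TL;DR: This paper proposes LARV, a training-free, data-free layer-wise adaptive rescaling technique for improving task-vector merging. By assigning different scaling factors to different network layers, it substantially improves multi-task model merging without modifying the underlying merging method.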
Details
Motivation: Existing task-vector merging methods (such as TIES, TSV-M, and Iso-C/CTS) treat all layers uniformly, ignoring the marked differences across Vision Transformer layers in sensitivity to interference and feature stability, in particular that shallow layers are prone to interference while deeper layers encode stable task features. Method: Propose LARV (Layer-wise Adaptive Rescaling Veneer), a lightweight, data-free, layer-aware scaling module placed in front of any task-vector merger; based on simple computable layer proxies, it assigns a scaling factor to each layer's task vector through a deterministic schedule (tiered or continuous mapping), suppressing shallow-layer interference and strengthening deeper-layer alignment without retraining or modifying the original merger. Result: On the FusionBench benchmark, LARV consistently improves all task-vector baselines on ViT models across the 8/14/20-task settings; for example, Iso-C + LARV reaches 85.9% on ViT-B/32, 89.2% on ViT-B/16, and 92.6% on ViT-L/14. Ablation and layer-wise analyses confirm that it suppresses shallow-layer interference and modestly amplifies deeper, task-stable features. Conclusion: LARV is, to the authors' knowledge, the first layer-aware scaling method for task-vector merging; it is training-free, data-free, and merger-agnostic, adds little overhead, and turns model merging from a uniform operation into a layer-adaptive procedure. Abstract: Model merging aims to combine multiple fine-tuned models into a single multi-task model without access to training data. Existing task-vector merging methods such as TIES, TSV-M, and Iso-C/CTS differ in their aggregation rules but treat all layers nearly uniformly. This assumption overlooks the strong layer-wise heterogeneity in large vision transformers, where shallow layers are sensitive to interference while deeper layers encode stable task-specific features. We introduce LARV, a training-free, data-free, merger-agnostic Layer-wise Adaptive Rescaling Veneer that plugs into any task-vector merger and assigns a per-layer scale to each task vector before aggregation, and show it consistently boosts diverse merging rules. LARV adaptively suppresses shallow-layer interference and amplifies deeper-layer alignment using a simple deterministic schedule, requiring no retraining or modification to existing mergers. To our knowledge, this is the first work to perform layer-aware scaling for task-vector merging. LARV computes simple data-free layer proxies and turns them into scales through a lightweight rule; we study several instantiations within one framework (e.g., tiered two/three-level scaling with fixed values, or continuous mappings) and show that tiered choices offer the best robustness, while continuous mappings remain an ablation. LARV is orthogonal to the base merger and adds negligible cost. 
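A tiered per-layer rescaling placed in front of a base merger can be sketched as follows. This is a minimal illustration, not LARV itself: the layer proxy is assumed to be plain normalized depth, the scale values `shallow_scale` and `deep_scale` and the `boundary` are hypothetical, and the base merger is a stand-in element-wise sum (task arithmetic) rather than TIES, TSV-M, or Iso-C.

```python
from typing import Dict, List

# A task vector maps a layer index to that layer's flattened weight delta.
TaskVector = Dict[int, List[float]]


def tiered_scale(layer: int, num_layers: int,
                 shallow_scale: float = 0.5, deep_scale: float = 1.2,
                 boundary: float = 0.5) -> float:
    """Two-level schedule: damp layers in the shallow fraction, boost the rest."""
    depth = layer / max(num_layers - 1, 1)  # normalized depth in [0, 1]
    return shallow_scale if depth < boundary else deep_scale


def rescale(task_vector: TaskVector, num_layers: int) -> TaskVector:
    """Apply the per-layer scale to every entry of one task vector."""
    return {
        layer: [tiered_scale(layer, num_layers) * w for w in delta]
        for layer, delta in task_vector.items()
    }


def merge_sum(task_vectors: List[TaskVector]) -> TaskVector:
    """Stand-in base merger: element-wise sum of task vectors."""
    merged: TaskVector = {}
    for tv in task_vectors:
        for layer, delta in tv.items():
            acc = merged.setdefault(layer, [0.0] * len(delta))
            for i, w in enumerate(delta):
                acc[i] += w
    return merged


def merge_with_scaling(task_vectors: List[TaskVector],
                       num_layers: int) -> TaskVector:
    """Rescale each task vector per layer, then hand off to the base merger."""
    return merge_sum([rescale(tv, num_layers) for tv in task_vectors])


# Usage with two toy task vectors over a 4-layer model.
tv_a = {0: [1.0, 2.0], 3: [1.0, -1.0]}
tv_b = {0: [1.0, 0.0], 3: [0.5, 0.5]}
merged = merge_with_scaling([tv_a, tv_b], num_layers=4)
```

Because the rescaling only transforms the inputs of the merger, swapping `merge_sum` for another aggregation rule leaves the rest of the sketch unchanged, which mirrors the merger-agnostic design described in the abstract.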
On FusionBench with Vision Transformers, LARV consistently improves all task-vector baselines across 8/14/20-task settings; for example, Iso-C + LARV reaches 85.9% on ViT-B/32, 89.2% on ViT-B/16, and 92.6% on ViT-L/14. Layerwise analysis and corruption tests further indicate that LARV suppresses shallow-layer interference while modestly amplifying deeper, task-stable features, turning model merging into a robust, layer-aware procedure rather than a uniform one.
[84] Stability and Concentration in Nonlinear Inverse Problems with Block-Structured Parameters: Lipschitz Geometry, Identifiability, and an Application to Gaussian Splatting
Joe-Mei Feng,Hsin-Hsiung Kao
Main category: cs.CV
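TL;DR: This paper proposes an operator-theoretic framework for nonlinear inverse problems with block-structured parameters, establishes deterministic stability inequalities, global Lipschitz bounds, and nonasymptotic concentration estimates, and verifies the framework on the Gaussian Splatting rendering operator, revealing a fundamental tradeoff between stability and resolution.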
Details
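Motivation: Characterizing the fundamental operator-level limits of a broad class of high-dimensional nonlinear inverse problems in modern imaging and differentiable rendering requires a unified treatment of stability, statistical concentration, and parameter structure (such as block structure). Method: Build an operator-theoretic analysis framework that combines blockwise Lipschitz geometry, local identifiability, and a sub-Gaussian noise assumption to derive deterministic stability inequalities, global Lipschitz bounds for the least-squares objective, and nonasymptotic concentration estimates; verify the assumptions and characterize the constants for the Gaussian Splatting rendering operator as a concrete instance. Result: High-probability parameter error bounds that are independent of the reconstruction algorithm and depend only on the forward operator; explicit Lipschitz constants and resolution-dependent observability for the Gaussian Splatting operator; and a stability-resolution tradeoff showing that estimation error is inherently limited by the ratio between image resolution and model complexity. Conclusion: The framework gives a unified, intrinsic, algorithm-independent characterization of stability and statistical performance for nonlinear inverse problems with block-structured parameters, exposing operator-level limits in high-dimensional inverse problems. Abstract: We develop an operator-theoretic framework for stability and statistical concentration in nonlinear inverse problems with block-structured parameters. Under a unified set of assumptions combining blockwise Lipschitz geometry, local identifiability, and sub-Gaussian noise, we establish deterministic stability inequalities, global Lipschitz bounds for least-squares misfit functionals, and nonasymptotic concentration estimates. These results yield high-probability parameter error bounds that are intrinsic to the forward operator and independent of any specific reconstruction algorithm. As a concrete instantiation, we verify that the Gaussian Splatting rendering operator satisfies the proposed assumptions and derive explicit constants governing its Lipschitz continuity and resolution-dependent observability. This leads to a fundamental stability-resolution tradeoff, showing that estimation error is inherently constrained by the ratio between image resolution and model complexity. Overall, the analysis characterizes operator-level limits for a broad class of high-dimensional nonlinear inverse problems arising in modern imaging and differentiable rendering.
[85] Bridging the Modality Gap in Roadside LiDAR: A Training-Free Vision-Language Model Framework for Vehicle Classification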
Yiqiao Li,Bo Shang,Jie Wei
Main category: cs.CV
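TL;DR: This paper proposes a framework that adapts off-the-shelf vision-language models (VLMs) to fine-grained truck classification from roadside LiDAR without parameter fine-tuning. Depth-aware image generation converts sparse point clouds into 2D visual proxies, giving competitive accuracy with very few samples (16-30 per class); the paper also identifies a text-guided "Semantic Anchor" effect and demonstrates the framework's practical value as a cold-start strategy.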
Details
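Motivation: Existing LiDAR-based fine-grained truck classification relies on extensive manual annotation and supervised learning and scales poorly; VLMs offer few-shot generalization but are held back by the modality gap between LiDAR point clouds and 2D images. Method: Design a depth-aware image generation pipeline (noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing) that converts sparse LiDAR scans into depth-encoded 2D visual proxies, feed these directly to an off-the-shelf VLM with frozen weights and no parameter fine-tuning, and explore text-guided few-shot classification and a cold-start label bootstrapping strategy. Result: On a real-world dataset of 20 vehicle classes, competitive accuracy is reached with only 16-30 examples per class; specific container types (20ft, 40ft, and 53ft) are classified correctly more than 75% of the time; text guidance stabilizes performance when k < 4 (the Semantic Anchor effect) but degrades accuracy at larger k because of semantic mismatch; VLM-generated labels can bootstrap lightweight supervised models. Conclusion: The framework substantially lowers initial annotation cost and offers a scalable, practical, training-free approach to fine-grained truck recognition for intelligent transportation systems. Abstract: Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by a modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our new depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a "Semantic Anchor" effect: text-based guidance regularizes performance in ultra-low-shot regimes ($k < 4$), but degrades accuracy in more-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a Cold Start strategy, using VLM-generated labels to bootstrap lightweight supervised models. 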
Notably, the few-shot VLM-based model achieves a correct classification rate of over 75 percent for specific drayage categories (20ft, 40ft, and 53ft containers) entirely without costly training or fine-tuning, significantly reducing the demands of initial manual labeling and making the method practical for ITS applications.
[86] SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL
Yang Zhao,Shizhao Sun,Meisheng Zhang,Yingdong Shi,Xubo Yang,Jiang Bian
Main category: cs.CV
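TL;DR: This paper proposes SceneReVis, a framework that uses a vision-grounded self-reflection mechanism and an iterative "diagnose-and-act" loop to resolve spatial hallucinations in one-pass 3D scene generation, and builds the SceneChain-12k dataset and a two-stage training strategy to improve spatial planning.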
Details
Motivation: Existing one-pass 3D scene synthesis methods are prone to spatial hallucinations (such as object collisions) and lack deliberative reasoning. Method: Propose the SceneReVis framework, which uses an iterative "diagnose-and-act" self-reflection mechanism driven by multi-modal feedback; construct SceneChain-12k, a dataset of causal construction trajectories; and design a two-stage training strategy that moves from supervised fine-tuning to agentic reinforcement learning. Result: State-of-the-art performance in high-fidelity generation and goal-oriented optimization, with strong generalization to long-tail domains. Conclusion: The vision-grounded self-reflection paradigm improves the spatial plausibility and controllability of 3D scene synthesis and offers a new direction for embodied intelligence and 3D content generation. Abstract: Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative "diagnose-and-act" loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
[87] Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning
Xu Ma,Yitian Zhang,Qihua Dong,Yun Fu
Main category: cs.CV
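TL;DR: This paper presents Fine-T2I, a large-scale, high-quality, fully open dataset for text-to-image (T2I) fine-tuning, aimed at the low resolution, poor text-image alignment, and limited diversity of current public fine-tuning datasets. The dataset combines images synthesized by strong generative models with real images from professional photographers and, after rigorous filtering, contains over 6 million text-image pairs; experiments show that fine-tuning a range of pretrained diffusion and autoregressive models on Fine-T2I consistently improves generation quality and instruction adherence.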
Details
Motivation: High-quality, open datasets for text-to-image fine-tuning are scarce; existing public datasets suffer from low resolution, poor text-image alignment, and limited diversity, leaving open research models well behind enterprise-grade models. Method: Build the Fine-T2I dataset, covering 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates; combine images synthesized by strong generative models with real images from professional photographers; apply rigorous multi-dimensional filtering (text-image alignment, visual fidelity, prompt quality) that removes over 95% of initial candidates. Result: The final dataset contains over 6 million high-quality text-image pairs (around 2 TB), approaching the scale of pretraining datasets; fine-tuning a range of pretrained diffusion and autoregressive models on it consistently improves generation quality and instruction adherence, as validated by human evaluation, visual comparison, and automatic metrics. Conclusion: Fine-T2I fills the gap in high-quality open T2I fine-tuning data and may help the open community narrow the performance gap with enterprise-grade models; the dataset is released under an open license. Abstract: High-quality and open datasets remain a major bottleneck for text-to-image (T2I) fine-tuning. Despite rapid progress in model architectures and training pipelines, most publicly available fine-tuning datasets suffer from low resolution, poor text-image alignment, or limited diversity, resulting in a clear performance gap between open research models and enterprise-grade models. In this work, we present Fine-T2I, a large-scale, high-quality, and fully open dataset for T2I fine-tuning. Fine-T2I spans 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates, and combines synthetic images generated by strong modern models with carefully curated real images from professional photographers. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality, with over 95% of initial candidates removed. The final dataset contains over 6 million text-image pairs, around 2 TB on disk, approaching the scale of pretraining datasets while maintaining fine-tuning-level quality. Across a diverse set of pretrained diffusion and autoregressive models, fine-tuning on Fine-T2I consistently improves both generation quality and instruction adherence, as validated by human evaluation, visual comparison, and automatic metrics. We release Fine-T2I under an open license to help close the data gap in T2I fine-tuning in the open community.
[88] A Scoping Review of Deep Learning for Urban Visual Pollution and Proposal of a Real-Time Monitoring Framework with a Visual Pollution Index
Mohammad Masudur Rahman,Md. Rashedur Rahman,Ashraful Islam,Saadia B Alam,M Ashraful Amin
Main category: cs.CV
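TL;DR: This paper is a scoping review of urban visual pollution (UVP) that systematically surveys deep learning methods for UVP detection and classification, finds that current research suffers from geographically limited datasets, the lack of a unified taxonomy, and few real-time applications, and proposes a comprehensive management framework that integrates a visual pollution index.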
Details
Motivation: Urban visual pollution (UVP) is a growing problem, but existing research on automatic detection and application is scattered and unsystematic, so a comprehensive review and synthesis is needed. Method: Following the PRISMA-ScR guidelines, systematically search seven academic databases and analyze the 26 resulting articles; summarize the prevailing models (YOLO, Faster R-CNN, EfficientDet), the state of datasets, and application patterns; and propose a comprehensive management framework that includes a visual pollution index. Result: Current research concentrates on specific pollutant categories, uses similar model architectures, and relies on region-specific datasets without a standard taxonomy; the few real-time systems are geographically skewed; the paper proposes a monitoring framework that integrates a visual pollution index. Conclusion: A unified UVP management system is needed, covering a standardized pollutant taxonomy, a cross-city benchmark dataset, deep learning models that generalize well, and an assessment index that supports sustainable urban aesthetics and residents' well-being. Abstract: Urban Visual Pollution (UVP) has emerged as a critical concern, yet research on automatic detection and application remains fragmented. This scoping review maps the existing deep learning-based approaches for detecting, classifying, and designing a comprehensive application framework for visual pollution management. Following the PRISMA-ScR guidelines, seven academic databases (Scopus, Web of Science, IEEE Xplore, ACM DL, ScienceDirect, SpringerNatureLink, and Wiley) were systematically searched and reviewed, yielding 26 articles. Most research focuses on specific pollutant categories and employs variations of YOLO, Faster R-CNN, and EfficientDet architectures. Although several datasets exist, they are limited to specific areas and lack standardized taxonomies. Few studies integrate detection into real-time application systems, and those that do tend to be geographically skewed. We propose a framework for monitoring visual pollution that integrates a visual pollution index to assess the severity of visual pollution in a given area. This review highlights the need for a unified UVP management system that incorporates pollutant taxonomy, a cross-city benchmark dataset, a generalized deep learning model, and an assessment index that supports sustainable urban aesthetics and enhances the well-being of urban dwellers.
[89] Look-Ahead and Look-Back Flows: Training-Free Image Generation with Trajectory Smoothing
Yan Luo,Henry Huang,Todd Y. Zhou,Mengyu Wang
Main category: cs.CV
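TL;DR: This paper proposes two training-free latent-trajectory adjustment methods, Look-Ahead and Look-Back, which smooth the generative path in latent space to improve image generation quality in flow-matching-based diffusion models, and reports substantial gains over existing state-of-the-art methods.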
Details
Motivation: Existing training-free flow matching methods improve generation by modifying the velocity field v, but the resulting errors accumulate along the entire generation path; adjustments to the latent trajectory z, by contrast, are naturally corrected by the pretrained velocity network, which reduces error accumulation. Method: Propose two training-free latent-trajectory smoothing schemes: 1) Look-Ahead, which averages the current and next-step latents using a curvature-gated weight; 2) Look-Back, which smooths the latent trajectory with an exponential moving average with decay. Result: On multiple datasets including COCO17, CUB-200, and Flickr30K, the proposed methods substantially outperform a range of state-of-the-art models across several evaluation metrics. Conclusion: Adjusting and smoothing the generative trajectory directly in latent space is a more robust and efficient training-free strategy and offers a new direction for flow-matching diffusion models. Abstract: Recent advances have reformulated diffusion models as deterministic ordinary differential equations (ODEs) through the framework of flow matching, providing a unified formulation for the noise-to-data generative process. Various training-free flow matching approaches have been developed to improve image generation through flow velocity field adjustment, eliminating the need for costly retraining. However, modifying the velocity field $v$ introduces errors that propagate through the full generation path, whereas adjustments to the latent trajectory $z$ are naturally corrected by the pretrained velocity network, reducing error accumulation. In this paper, we propose two complementary training-free latent-trajectory adjustment approaches, based on future and past velocity $v$ and latent trajectory $z$ information, that refine the generative path directly in latent space. The two trajectory smoothing schemes are Look-Ahead, which averages the current and next-step latents using a curvature-gated weight, and Look-Back, which smooths latents using an exponential moving average with decay. We demonstrate through extensive experiments and comprehensive evaluation metrics that the proposed training-free trajectory smoothing models substantially outperform various state-of-the-art models across multiple datasets including COCO17, CUB-200, and Flickr30K.
[90] ArtifactLens: Hundreds of Labels Are Enough for Artifact Detection with VLMs
James Burgess,Rameen Abdal,Dan Stoddart,Sergey Tulyakov,Serena Yeung-Levy,Kuan-Chieh Jackson Wang
Main category: cs.CV
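TL;DR: This paper presents ArtifactLens, a system that combines a pretrained vision-language model (VLM) with a small amount of labeled data (a few hundred examples per category) and a multi-component architecture (with in-context learning and text instruction optimization) to detect artifacts in AI-generated images efficiently and with strong generalization, reaching state-of-the-art performance on multiple benchmarks while greatly reducing data requirements.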
Details
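Motivation: Current AI image generators are realistic but still produce subtle artifacts (such as distorted hands or warped objects), which must be detected reliably to benchmark generators and train reward models; existing detectors depend on large amounts of labeled data and frequent retraining, which is expensive and adapts poorly to fast-evolving generators and new artifact types. Method: Propose the ArtifactLens system, which draws on the knowledge already present in pretrained VLMs and adapts them lightly through a multi-component architecture (with improved in-context learning and text instruction optimization), needing only a few hundred labeled examples per category to unlock artifact detection. Result: State-of-the-art performance on five human-annotated benchmarks for human artifacts, in the first evaluation across multiple datasets; the method generalizes to other artifact types, including object morphology, animal anatomy, and entity interactions, and to the broader task of AIGC detection. Conclusion: Pretrained VLMs already contain rich knowledge for recognizing artifacts, and the key is designing the right scaffolding to release it; ArtifactLens shows that few-shot adaptation is effective and widely applicable for detecting AI-generated content. Abstract: Modern image generators produce strikingly realistic images, where only artifacts like distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine-tune VLMs on tens of thousands of labeled images, but this is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts - with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state-of-the-art on five human artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi-component architecture with in-context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types - object morphology, animal anatomy, and entity interactions - and to the distinct task of AIGC detection.
[91] FD-DB: Frequency-Decoupled Dual-Branch Network for Unpaired Synthetic-to-Real Domain Translation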
Chuanhai Zang,Jiabao Hu,XW Song
Main category: cs.CV
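TL;DR: This paper proposes FD-DB, a frequency-decoupled dual-branch model for synthetic-to-real image translation without paired supervision that balances structural stability with realistic appearance and improves downstream semantic segmentation.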
Details
Motivation: Synthetic data are cheap and accurately annotated, but differences in appearance and imaging cause severe domain shift; existing unpaired synthetic-to-real translation methods struggle to achieve both photometric realism and structural stability. Method: Propose the frequency-decoupled dual-branch (FD-DB) model: the interpretable branch predicts physically meaningful low-frequency editing parameters (white balance, exposure, and so on) to build a stable low-frequency appearance base; the free branch adds high-frequency detail through residual generation; a gated fusion mechanism combines the two branches under explicit frequency constraints; and a two-stage training schedule first stabilizes the editing branch and then releases the residual branch. Result: On the YCB-V dataset, FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation while preserving geometric and semantic structure. Conclusion: The frequency-decoupled design eases the inherent tension between realism and structural stability in unpaired synthetic-to-real translation and provides a more robust domain transfer option for geometry-sensitive vision tasks. Abstract: Synthetic data provide low-cost, accurately annotated samples for geometry-sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift and degrade downstream performance. Unpaired synthetic-to-real translation can reduce this gap without paired supervision, yet existing methods often face a trade-off between photorealism and structural stability: unconstrained generation may introduce deformation or spurious textures, while overly rigid constraints limit adaptation to real-domain statistics. We propose FD-DB, a frequency-decoupled dual-branch model that separates appearance transfer into low-frequency interpretable editing and high-frequency residual compensation. The interpretable branch predicts physically meaningful editing parameters (white balance, exposure, contrast, saturation, blur, and grain) to build a stable low-frequency appearance base with strong content preservation. The free branch complements fine details through residual generation, and a gated fusion mechanism combines the two branches under explicit frequency constraints to limit low-frequency drift. We further adopt a two-stage training schedule that first stabilizes the editing branch and then releases the residual branch to improve optimization stability. Experiments on the YCB-V dataset show that FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.
[92] Weakly Supervised Contrastive Learning for Histopathology Patch Embeddings
Bodong Zhang,Xiwen Li,Hamid Manoochehri,Xiaoya Tang,Deepika Sirohi,Beatrice S. Knudsen,Tolga Tasdizen
Main category: cs.CV
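TL;DR: This paper proposes WeakSupCon, a weakly supervised contrastive learning framework that improves patch feature representations for multiple instance learning (MIL). Using only slide-level labels and no instance-level pseudo-labels, it separates patches of different classes in feature space and thereby improves downstream MIL performance.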
Details
Motivation: Analysis of digital pathology whole slide images (WSIs) faces high annotation cost and sparse labels; most existing MIL methods neglect feature representation learning during encoder pretraining and instead rely on frozen features and feature aggregation. Method: Propose the weakly supervised contrastive learning (WeakSupCon) framework, which incorporates bag-level (slide-level) label information into contrastive learning and, without instance-level pseudo-labels, guides patches with the same label to cluster and patches with different labels to separate in feature space. Result: On three datasets, features produced by WeakSupCon lead to better downstream MIL performance than self-supervised contrastive learning methods. Conclusion: WeakSupCon shows that bag-level labels can be used effectively for feature representation learning in the MIL setting and offers a new direction for weakly supervised pathology image analysis. Abstract: Digital histopathology whole slide images (WSIs) provide gigapixel-scale high-resolution images that are highly useful for disease diagnosis. However, digital histopathology image analysis faces significant challenges due to the limited training labels, since manually annotating specific regions or small patches cropped from large WSIs requires substantial time and effort. Weakly supervised multiple instance learning (MIL) offers a practical and efficient solution by requiring only bag-level (slide-level) labels, while each bag typically contains multiple instances (patches). Most MIL methods directly use frozen image patch features generated by various image encoders as inputs and primarily focus on feature aggregation. However, feature representation learning for encoder pretraining in MIL settings has largely been neglected. In our work, we propose a novel feature representation learning framework called weakly supervised contrastive learning (WeakSupCon) that incorporates bag-level label information during training. Our method does not rely on instance-level pseudo-labeling, yet it effectively separates patches with different labels in the feature space. Experimental results demonstrate that the image features generated by our WeakSupCon method lead to improved downstream MIL performance compared to self-supervised contrastive learning approaches on three datasets. Our related code is available at github.com/BzhangURU/Paper_WeakSupCon_for_MIL
[93] Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions
Lin Chen,Xiaoke Zhao,Kun Ding,Weiwei Feng,Changtao Miao,Zili Wang,Wenxuan Guo,Ying Wang,Kaiyuan Zheng,Bo Zhang,Zhe Li,Shiming Xiang
Main category: cs.CV
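TL;DR: This paper proposes Align-TI, a knowledge distillation framework built from the perspective of token interactions. It improves MLLM compression through instruction-visual alignment (IVA) and intra-response token transition alignment (TPA), outperforming existing methods on multiple benchmarks, with a distilled 2B model even surpassing a 7B model.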
Details
Motivation: Existing knowledge distillation methods for MLLMs focus only on static next-token alignment and overlook the dynamic token interactions that carry key multimodal understanding and generation capabilities. Method: Propose the Align-TI framework with two core components: IVA (Instruction-Visual Alignment) aligns on salient visual regions so the student imitates the teacher's ability to extract visual information; TPA (Token-to-Token Probability Alignment) aligns the token-to-token transition probabilities within a sequence to capture the teacher's dynamic generative logic. Result: Align-TI clearly outperforms baselines on multiple benchmarks: a 2.6% relative improvement over Vanilla KD, and its distilled 2B model surpasses LLaVA-1.5-7B by 7.0%, reaching state of the art. Conclusion: Align-TI shows that modeling token interactions is important for MLLM knowledge distillation and offers a new approach to training parameter-efficient multimodal large language models. Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions, which embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components. IVA enables the student model to imitate the teacher's instruction-relevant visual information extraction capability by aligning on salient visual regions. TPA captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at https://github.com/lchen1019/Align-TI.
[94] OSI: One-step Inversion Excels in Extracting Diffusion Watermarks
Yuwei Chen,Zhenliang He,Jia Tang,Meina Kan,Shiguang Shan
Main category: cs.CV
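TL;DR: This paper proposes One-step Inversion (OSI), a single-step method for efficiently and accurately extracting Gaussian Shading style watermarks from training-free diffusion watermarking; compared with conventional multi-step inversion it is 20x faster, more accurate, and doubles payload capacity.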
Details
Motivation: Existing training-free watermarking methods (such as Gaussian Shading) preserve generation quality, but extraction requires multi-step diffusion inversion to recover the initial noise, which is computationally expensive and slow. Method: Model watermark extraction as a learnable sign classification problem, avoiding precise regression of the initial noise; initialize the OSI model from the diffusion backbone and fine-tune it on synthesized noise-image pairs with a sign classification objective, so extraction takes a single step. Result: OSI is 20x faster than multi-step diffusion inversion, achieves higher extraction accuracy, and doubles the watermark payload capacity; it is robust and generalizes well across schedulers, diffusion backbones, and cryptographic schemes. Conclusion: OSI is an efficient, accurate, and general single-step watermark extraction framework that makes training-free diffusion watermarking more practical to deploy. Abstract: Watermarking is an important mechanism for provenance and copyright protection of diffusion-generated images. Training-free methods, exemplified by Gaussian Shading, embed watermarks into the initial noise of diffusion models with negligible impact on the quality of generated images. However, extracting this type of watermark typically requires multi-step diffusion inversion to obtain precise initial noise, which is computationally expensive and time-consuming. To address this issue, we propose One-step Inversion (OSI), a significantly faster and more accurate method for extracting Gaussian Shading style watermarks. OSI reformulates watermark extraction as a learnable sign classification problem, which eliminates the need for precise regression of the initial noise. Then, we initialize the OSI model from the diffusion backbone and finetune it on synthesized noise-image pairs with a sign classification objective. In this manner, the OSI model is able to accomplish the watermark extraction efficiently in only one step. Our OSI substantially outperforms the multi-step diffusion inversion method: it is 20x faster, achieves higher extraction accuracy, and doubles the watermark payload capacity. Extensive experiments across diverse schedulers, diffusion backbones, and cryptographic schemes consistently show improvements, demonstrating the generality of our OSI framework.
[95] Equilibrium contrastive learning for imbalanced image classification
Sumin Roh,Harim Kim,Ho Yun Lee,Il Yong Chun
Main category: cs.CV
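TL;DR: This paper proposes Equilibrium Contrastive Learning (ECL), a framework that addresses the poor generalization of existing supervised contrastive learning methods on long-tailed or imbalanced data, which stems from misalignment between class centers and classifiers and from unbalanced prototype contributions. ECL jointly optimizes the regular simplex geometry of the representation space and the alignment between classifier weights and class prototypes to improve imbalanced image classification.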
Details
Motivation: Existing supervised contrastive learning methods are limited on imbalanced data in two ways: they ignore the alignment between class means/prototypes and classifier weights, which hurts generalization; and they treat each prototype as just one additional sample per class, so its influence depends on the number of class instances in a batch and contributions become unbalanced across classes. Method: Propose the Equilibrium Contrastive Learning (ECL) framework with two core components: 1) a representation geometric equilibrium component that jointly promotes collapsed class samples and uniformly distributed class means (a regular simplex) while balancing the contributions of class-average features and class prototypes; 2) a classifier-class center geometric equilibrium component that explicitly aligns classifier weights with class prototypes. Result: On the long-tailed CIFAR-10(0)-LT and ImageNet-LT datasets and the imbalanced medical datasets ISIC 2019 and the authors' LCCT, ECL outperforms existing state-of-the-art supervised contrastive learning methods. Conclusion: Geometric equilibrium, covering both balanced class structure in the representation space and classifier-class center alignment, is key to improving contrastive learning on imbalanced data, and ECL offers an effective and extensible solution. Abstract: Contrastive learning (CL) is a predominant technique in image classification, but it shows limited performance on imbalanced datasets. Recently, several supervised CL methods have been proposed to promote an ideal regular simplex geometric configuration in the representation space, characterized by intra-class feature collapse and uniform inter-class mean spacing, especially for imbalanced datasets. In particular, existing prototype-based methods include class prototypes as additional samples so that all classes are considered. However, the existing CL methods suffer from two limitations. First, they do not consider the alignment between the class means/prototypes and classifiers, which could lead to poor generalization. Second, existing prototype-based methods treat prototypes as only one additional sample per class, making their influence depend on the number of class instances in a batch and causing unbalanced contributions across classes. To address these limitations, we propose Equilibrium Contrastive Learning (ECL), a supervised CL framework designed to promote geometric equilibrium, where class features, means, and classifiers are harmoniously balanced under data imbalance. The proposed ECL framework uses two main components. First, ECL promotes the representation geometric equilibrium (i.e., a regular simplex geometry characterized by collapsed class samples and uniformly distributed class means), while balancing the contributions of class-average features and class prototypes. 
Second, ECL establishes a classifier-class center geometric equilibrium by aligning classifier weights and class prototypes. We ran experiments on three long-tailed datasets (CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT) and two imbalanced medical datasets (ISIC 2019 and our constructed LCCT dataset). Results show that ECL outperforms existing SOTA supervised CL methods designed for imbalanced classification.
[96] Robust Depth Super-Resolution via Adaptive Diffusion Sampling
Kun Wang,Yun Zhu,Pan Zhou,Na Zhao
Main category: cs.CV
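TL;DR: AdaDS is a diffusion-based depth super-resolution framework that exploits the contraction property of Gaussian smoothing: by adaptively selecting the starting timestep of reverse diffusion and injecting tailored noise, it robustly reconstructs high-resolution depth maps from arbitrarily degraded low-resolution inputs.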
Details
Motivation: Conventional methods regress depth values directly and tend to produce artifacts under severe or unknown degradation; a depth super-resolution method that generalizes to unknown degradation and remains robust is needed. Method: Propose the AdaDS framework, which uses the property that, under Gaussian smoothing, the distributional gap between degraded inputs and ground truth shrinks as noise increases and converges to an isotropic Gaussian prior; adaptively select the starting timestep of reverse diffusion and inject tailored noise so that the intermediate sample falls in the high-probability region of the target posterior, letting the generative prior of a pretrained diffusion model dominate recovery. Result: Experiments on real-world and synthetic datasets show that AdaDS outperforms state-of-the-art methods in zero-shot generalization and robustness to diverse degradation patterns. Conclusion: By introducing an adaptive diffusion-based strategy, AdaDS improves the generalization and robustness of depth super-resolution under unknown degradation. Abstract: We propose AdaDS, a generalizable framework for depth super-resolution that robustly recovers high-resolution depth maps from arbitrarily degraded low-resolution inputs. Unlike conventional approaches that directly regress depth values and often exhibit artifacts under severe or unknown degradation, AdaDS capitalizes on the contraction property of Gaussian smoothing: as noise accumulates in the forward process, distributional discrepancies between degraded inputs and their pristine high-quality counterparts diminish, ultimately converging to an isotropic Gaussian prior. Leveraging this, AdaDS adaptively selects a starting timestep in the reverse diffusion trajectory based on estimated refinement uncertainty, and subsequently injects tailored noise to position the intermediate sample within the high-probability region of the target posterior distribution. This strategy ensures inherent robustness, enabling the generative prior of a pre-trained diffusion model to dominate recovery even when upstream estimations are imperfect. Extensive experiments on real-world and synthetic benchmarks demonstrate AdaDS's superior zero-shot generalization and resilience to diverse degradation patterns compared to state-of-the-art methods.
[97] Energy-Efficient Fast Object Detection on Edge Devices for IoT Systems
Mas Nurul Achmadiah,Afaroj Ahamad,Chi-Chia Sun,Wen-Kai Kuo
Main category: cs.CV
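TL;DR: This paper proposes a lightweight AI object detection method based on frame differencing for IoT edge devices. Compared with end-to-end methods, it reports an average accuracy gain of 28.314%, a 3.6x efficiency gain, and a 39.305% latency reduction, and targets real-time detection of fast-moving objects such as trains and airplanes.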
Details
Motivation: IoT systems demand high energy efficiency and real-time performance, while conventional end-to-end object detection shows low accuracy, high latency, and high energy use on fast-moving objects, so a more efficient lightweight approach is needed. Method: Preprocess the video stream with the frame difference method to locate moving regions quickly, then apply lightweight AI models (such as MobileNet and YOLOX), deployed and evaluated on several edge devices (AMD Alveo U50, Jetson Orin Nano, and Hailo-8). Result: MobileNet performs best (high accuracy, low latency, high energy efficiency); YOLOX has the lowest accuracy; overall, compared with the end-to-end method, average accuracy improves by 28.314%, efficiency by 3.6 times, and latency drops by 39.305%; accuracy on the fastest objects (trains and airplanes) is lower than on other classes. Conclusion: Frame differencing combined with lightweight models is an effective approach to fast-object detection in IoT edge settings, balancing accuracy, real-time performance, and energy efficiency better than conventional end-to-end methods. Abstract: This paper presents an Internet of Things (IoT) application that utilizes an AI classifier for fast-object detection using the frame difference method. This method, with its shorter duration, is the most efficient and suitable for fast-object detection in IoT systems, which require applications that are more energy-efficient than end-to-end methods. We have implemented this technique on three edge devices (AMD Alveo™ U50, Jetson Orin Nano, and Hailo-8™ AI Accelerator) and with four models, covering artificial neural network and transformer models. We examined various classes, including birds, cars, trains, and airplanes. Using the frame difference method, the MobileNet model consistently has high accuracy, low latency, and high energy efficiency. YOLOX consistently shows the lowest accuracy, lowest latency, and lowest efficiency. The experimental results show that the proposed algorithm achieves an average accuracy gain of 28.314%, an average efficiency gain of 3.6 times, and an average latency reduction of 39.305% compared to the end-to-end method. Of all these classes, the faster objects are trains and airplanes. Experiments show that the accuracy for trains and airplanes is lower than for other categories. So, in tasks that require fast detection and accurate results, end-to-end methods can perform very poorly because they cannot handle fast object detection. To improve computational efficiency, we designed our proposed method as a lightweight detection algorithm. 
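The frame difference step can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything here is an assumption: frames are grayscale 2D lists of integer intensities, the `threshold` value is hypothetical, and the helper names `motion_mask`, `bounding_box`, and `crop` are illustrative. In a pipeline of this kind, the cropped region would then be passed to the lightweight classifier.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]          # grayscale intensities, row-major
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def motion_mask(prev: Frame, curr: Frame, threshold: int = 25) -> List[List[bool]]:
    """Mark pixels whose absolute intensity change exceeds the threshold."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def bounding_box(mask: List[List[bool]]) -> Optional[Box]:
    """Smallest box enclosing all changed pixels, or None if nothing moved."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for row in mask for c, hit in enumerate(row) if hit]
    return rows[0], min(cols), rows[-1], max(cols)


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the moving region out of the frame for the downstream classifier."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in frame[top:bottom + 1]]


# Usage: a bright object appears in an otherwise static 4x4 scene.
prev_frame = [[0] * 4 for _ in range(4)]
curr_frame = [
    [0, 0, 0, 0],
    [0, 0, 200, 0],
    [0, 200, 200, 0],
    [0, 0, 0, 0],
]
box = bounding_box(motion_mask(prev_frame, curr_frame))
region = crop(curr_frame, box) if box is not None else None
```

Only the changed region reaches the classifier, which is where a saving over running a full detector on every frame would come from under these assumptions.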
It is well suited for applications in IoT systems, especially those that require fast-moving object detection and higher accuracy.
[98] A Universal Action Space for General Behavior Analysis
Hung-Shuo Chang,Yue-Cheng Yang,Yu-Hsi Chen,Wei-Hsin Chen,Chien-Yao Wang,James C. Liao,Chien-Chang Chen,Hen-Hsen Huang,Hong-Yuan Mark Liao
Main category: cs.CV
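TL;DR: This paper proposes building a Universal Action Space (UAS) from large-scale human-action datasets and applies it to mammalian and chimpanzee behavior analysis, moving behavior recognition from low-level features toward high-level semantic representations.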
Details
Motivation: Traditional behavior analysis relies on hand-crafted low-level features and suffers from poor robustness and generalization; ImageNet advanced high-level semantic representations, which inspired the authors to build a cross-species universal action representation. Method: Existing labeled human-action datasets are used to build a large-scale Universal Action Space (UAS), which is then transferred to the analysis and categorization of mammalian and chimpanzee behavior datasets. Result: A transferable UAS is built and shown to be effective for analyzing non-human primate and mammalian behavior; the source code is released to support follow-up research. Conclusion: High-level semantic action representations (UAS) can cross species boundaries and provide a unified, scalable deep learning framework for animal behavior analysis. Abstract: Analyzing animal and human behavior has long been a challenging task in computer vision. Early approaches from the 1970s to the 1990s relied on hand-crafted edge detection, segmentation, and low-level features such as color, shape, and texture to locate objects and infer their identities, an inherently ill-posed problem. Behavior analysis in this era typically proceeded by tracking identified objects over time and modeling their trajectories using sparse feature points, which further limited robustness and generalization. A major shift occurred with the introduction of ImageNet by Deng and Li in 2010, which enabled large-scale visual recognition through deep neural networks and effectively served as a comprehensive visual dictionary. This development allowed object recognition to move beyond complex low-level processing toward learned high-level representations. In this work, we follow this paradigm to build a large-scale Universal Action Space (UAS) using existing labeled human-action datasets. We then use this UAS as the foundation for analyzing and categorizing mammalian and chimpanzee behavior datasets. The source code is released on GitHub at https://github.com/franktpmvu/Universal-Action-Space.[99] Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs
Jingyi Wang,Fei Li,Rujie Liu
Main category: cs.CV
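TL;DR: This paper proposes a training-free attention intervention algorithm that mitigates insufficient visual attention and hallucinations in large vision-language models (LVLMs) by strengthening attention on task-relevant visual tokens.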
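The reweighting idea in this entry can be illustrated with a small sketch. It is not the authors' implementation: the function name `reweight_visual_attention`, the `relevance` scores (standing in for visual-textual similarity), and the `strength` knob are assumptions made for illustration. The sketch shifts attention mass among visual tokens toward the more relevant ones and then renormalizes the row.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reweight_visual_attention(attn, visual_idx, relevance, strength=1.0):
    """Shift attention among visual tokens toward task-relevant ones.

    attn: attention weights of one query over all tokens (sums to 1).
    visual_idx: positions of the visual tokens inside attn.
    relevance: one hypothetical visual-textual similarity score per visual token.
    strength: assumed scaling knob; 0 leaves attn unchanged.
    """
    weights = softmax([strength * r for r in relevance])
    n = len(visual_idx)
    out = list(attn)
    for i, w in zip(visual_idx, weights):
        # w * n equals 1 when all relevance scores are equal, so uniform
        # relevance keeps the original attention value.
        out[i] = attn[i] * w * n
    total = sum(out)
    return [a / total for a in out]  # renormalize the whole row
```

With `strength=0.0` the row is returned unchanged, which makes the knob easy to sanity-check before applying it to real attention rows.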
Details
Motivation: Existing LVLMs pay insufficient attention to visual input, which leads to hallucinations; prior methods amplify attention for all visual tokens, which also boosts attention to irrelevant tokens and limits their effectiveness. Method: Based on the assumption that task-relevant tokens show high visual-textual similarity, the vision-text cross-attention submatrices are extracted to build reweighting matrices that reallocate attention; visual attention values are also injected into beam search decoding so that candidates with higher visual attention are preferred. Result: Hallucination rates drop significantly across mainstream LVLMs while the accuracy and coherence of generated content are preserved. Conclusion: The training-free attention intervention is an effective, general, plug-and-play strategy for mitigating hallucinations. Abstract: Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention. These methods present a limitation that boosting attention for all visual tokens inevitably increases attention to task-irrelevant tokens. To tackle this challenge, we propose a training-free attentional intervention algorithm to enhance the attention of task-relevant tokens based on the argument that task-relevant tokens generally demonstrate high visual-textual similarities. Specifically, the vision-text cross-attention submatrices, which represent visual-textual correlations, are extracted to construct the reweighting matrices to reallocate attention. Besides, to enhance the contribution of visual tokens, we inject visual attention values into the beam search decoding to identify solutions with higher visual attention. Extensive experiments demonstrate that this method significantly reduces hallucinations across mainstream LVLMs, while preserving the accuracy and coherence of generated content.[100] Singpath-VL Technical Report
Zhen Qiu,Kaiwen Xiao,Zhengwei Lu,Xiangyu Liu,Lei Zhao,Hao Zhang
Main category: cs.CV
TL;DR: Singpath-VL is a vision-language large model tailored for cervical cytology, developed using a novel three-stage synthetic data pipeline and fine-tuning of Qwen3-VL-4B, achieving strong performance in morphological perception and cell-level diagnosis.
Details
Motivation: The lack of large-scale, high-quality annotated datasets has hindered the application of multi-modal large language models (MLLMs) in cervical cytology. Method: A three-stage synthetic data pipeline using multiple general-purpose MLLMs as weak annotators, refined via consensus fusion and expert knowledge injection, to generate a million-scale image-description dataset; followed by multi-stage fine-tuning of Qwen3-VL-4B. Result: Singpath-VL achieves superior performance in fine-grained morphological perception and cell-level diagnostic classification. Conclusion: Singpath-VL fills a critical gap in AI-assisted cervical cytology, and part of the synthetic dataset and benchmark will be open-sourced to advance the field. Abstract: We present Singpath-VL, a vision-language large model, to fill the vacancy of AI assistant in cervical cytology. Recent advances in multi-modal large language models (MLLMs) have significantly propelled the field of computational pathology. However, their application in cytopathology, particularly cervical cytology, remains underexplored, primarily due to the scarcity of large-scale, high-quality annotated datasets. To bridge this gap, we first develop a novel three-stage pipeline to synthesize a million-scale image-description dataset. The pipeline leverages multiple general-purpose MLLMs as weak annotators, refines their outputs through consensus fusion and expert knowledge injection, and produces high-fidelity descriptions of cell morphology. Using this dataset, we then fine-tune the Qwen3-VL-4B model via a multi-stage strategy to create a specialized cytopathology MLLM. The resulting model, named Singpath-VL, demonstrates superior performance in fine-grained morphological perception and cell-level diagnostic classification. To advance the field, we will open-source a portion of the synthetic dataset and benchmark.[101] HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection
Han Zhou,Yuxuan Gao,Yinchao Du,Xuezhe Zheng
Main category: cs.CV
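TL;DR: This paper proposes the HLGFA framework, which learns normality by modeling the consistency between high- and low-resolution features of normal samples rather than relying on pixel-level reconstruction, for unsupervised industrial anomaly detection.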
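The inference idea in this entry, that anomalies show up where high- and low-resolution features stop agreeing, can be sketched as a per-location feature discrepancy. This is a minimal illustration, not the HLGFA architecture: it assumes both feature maps have already been brought to the same grid size, uses cosine distance as the discrepancy, and takes the maximum as the image-level score. All function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def anomaly_map(high_feats, low_feats):
    """Per-location discrepancy between two H x W grids of feature vectors."""
    return [[cosine_distance(h, l) for h, l in zip(h_row, l_row)]
            for h_row, l_row in zip(high_feats, low_feats)]

def image_score(amap):
    """Image-level score: the largest per-location discrepancy (an assumption)."""
    return max(max(row) for row in amap)
```

Locations with aligned features score near 0, while locations where the two resolutions disagree score near 1 and would be flagged as anomalous.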
Details
Motivation: Defect samples are scarce in industrial settings, so reliable unsupervised anomaly detection is needed; existing methods depend on pixel reconstruction or a single feature representation and lack robustness. Method: HLGFA is a high-low resolution guided feature alignment framework: a shared frozen backbone extracts multi-level features; high-resolution features are decomposed into structure and detail priors that guide the refinement of low-resolution features through conditional modulation and gated residual correction; a noise-aware data augmentation suppresses nuisance responses in industrial environments. Result: On the MVTec AD dataset it reaches 97.9% pixel-level AUROC and 97.5% image-level AUROC, clearly outperforming mainstream reconstruction-based and feature-based methods. Conclusion: Modeling cross-resolution feature consistency is a more robust paradigm for learning normality, and HLGFA offers a new and efficient approach to unsupervised industrial anomaly detection. Abstract: Unsupervised industrial anomaly detection (UAD) is essential for modern manufacturing inspection, where defect samples are scarce and reliable detection is required. In this paper, we propose HLGFA, a high-low resolution guided feature alignment framework that learns normality by modeling cross-resolution feature consistency between high-resolution and low-resolution representations of normal samples, instead of relying on pixel-level reconstruction. Dual-resolution inputs are processed by a shared frozen backbone to extract multi-level features, and high-resolution representations are decomposed into structure and detail priors to guide the refinement of low-resolution features through conditional modulation and gated residual correction. During inference, anomalies are naturally identified as regions where cross-resolution alignment breaks down. In addition, a noise-aware data augmentation strategy is introduced to suppress nuisance-induced responses commonly observed in industrial environments. Extensive experiments on standard benchmarks demonstrate the effectiveness of HLGFA, achieving 97.9% pixel-level AUROC and 97.5% image-level AUROC on the MVTec AD dataset, outperforming representative reconstruction-based and feature-based methods.[102] SchröMind: Mitigating Hallucinations in Multimodal Large Language Models via Solving the Schrödinger Bridge Problem
Ziqiang Shi,Rujie Liu,Shanshan Yu,Satoshi Munakata,Koichi Shirahata
Main category: cs.CV
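TL;DR: This paper proposes SchröMind, a framework that solves the Schrödinger bridge problem to reduce hallucinations of multimodal large language models (MLLMs) in high-stakes domains such as healthcare without harming the model's original capabilities.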
Details
Motivation: The use of multimodal large language models (MLLMs) in high-stakes fields such as healthcare is limited mainly by vision-text inconsistent hallucinations, where generated text contradicts or ignores the input image. Method: SchröMind models the token-level mapping between hallucinatory and truthful activations as a Schrödinger bridge problem and corrects activations at minimal transport cost with lightweight training. Result: Experiments on the POPE and MME benchmarks show that SchröMind clearly reduces hallucination and reaches state-of-the-art performance with minimal computational overhead. Conclusion: SchröMind mitigates hallucination in MLLMs and improves their reliability and practicality in high-stakes scenarios while preserving the model's original capabilities. Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have achieved significant success across various domains. However, their use in high-stakes fields like healthcare remains limited due to persistent hallucinations, where generated text contradicts or ignores visual input. We contend that MLLMs can comprehend images but struggle to produce accurate token sequences. Minor perturbations can shift attention from truthful to untruthful states, and the autoregressive nature of text generation often prevents error correction. To address this, we propose SchröMind, a novel framework reducing hallucinations via solving the Schrödinger bridge problem. It establishes a token-level mapping between hallucinatory and truthful activations with minimal transport cost through lightweight training, while preserving the model's original capabilities. Extensive experiments on the POPE and MME benchmarks demonstrate the superiority of SchröMind, which achieves state-of-the-art performance while introducing only minimal computational overhead.[103] SCA-Net: Spatial-Contextual Aggregation Network for Enhanced Small Building and Road Change Detection
Emad Gholibeigi,Abbas Koochari,Azadeh ZamaniFar
Main category: cs.CV
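TL;DR: This paper proposes SCA-Net, an improved model built on the Change-Agent framework for precise building and road change detection in bi-temporal remote sensing imagery; with multi-scale difference analysis, adaptive multi-scale processing, multi-level attention mechanisms, and a dynamic composite loss function, it clearly outperforms existing methods in accuracy and efficiency.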
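The IoU and mIoU figures quoted for this entry follow a standard definition that is easy to state in code. The sketch below is a generic metric implementation, not the authors' evaluation script; the class encoding (0 for no change, 1 for building, 2 for road) and the choice to skip classes absent from both masks are assumptions.

```python
def iou(pred, target, cls):
    """Intersection over Union for one class over flattened label masks.

    Returns None when the class appears in neither mask.
    """
    inter = union = 0
    for p, t in zip(pred, target):
        if p == cls and t == cls:
            inter += 1
        if p == cls or t == cls:
            union += 1
    return inter / union if union else None

def mean_iou(pred, target, classes):
    """Mean IoU over the classes that occur in at least one of the masks."""
    scores = [s for s in (iou(pred, target, c) for c in classes) if s is not None]
    if not scores:
        raise ValueError("none of the classes occur in the masks")
    return sum(scores) / len(scores)
```

For example, predictions `[0, 1, 1, 2]` against ground truth `[0, 1, 2, 2]` give per-class IoU values of 1.0, 0.5, and 0.5, so the mean is 2/3.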
Details
Motivation: Deep learning models for change detection in remote sensing imagery face challenges such as low sensitivity to small objects and high computational cost, so more efficient and accurate solutions are needed. Method: SCA-Net comprises a Difference Pyramid Block, an Adaptive Multi-scale Processing module (with shape-aware and high-resolution enhancement blocks), and multi-level attention mechanisms (PPM and CSAGate), together with a dynamic composite loss function and a four-phase training strategy. Result: On the LEVIR-CD and LEVIR-MCI datasets it clearly outperforms Change-Agent and other state-of-the-art methods: on LEVIR-MCI, mIoU improves by 2.64%, IoU for small buildings increases by 57.9%, and training time falls by 61%. Conclusion: SCA-Net provides an efficient, accurate, and robust solution for practical change detection applications. Abstract: Automated change detection in remote sensing imagery is critical for urban management, environmental monitoring, and disaster assessment. While deep learning models have advanced this field, they often struggle with challenges like low sensitivity to small objects and high computational costs. This paper presents SCA-Net, an enhanced architecture built upon the Change-Agent framework for precise building and road change detection in bi-temporal images. Our model incorporates several key innovations: a novel Difference Pyramid Block for multi-scale change analysis, an Adaptive Multi-scale Processing module combining shape-aware and high-resolution enhancement blocks, and multi-level attention mechanisms (PPM and CSAGate) for joint contextual and detail processing. Furthermore, a dynamic composite loss function and a four-phase training strategy are introduced to stabilize training and accelerate convergence. Comprehensive evaluations on the LEVIR-CD and LEVIR-MCI datasets demonstrate SCA-Net's superior performance over Change-Agent and other state-of-the-art methods. Our approach achieves a significant 2.64% improvement in mean Intersection over Union (mIoU) on LEVIR-MCI and a remarkable 57.9% increase in IoU for small buildings, while reducing the training time by 61%. This work provides an efficient, accurate, and robust solution for practical change detection applications.[104] DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment
Bohan Fu,Guanyi Qin,Fazhan Zhang,Zihao Huang,Mingxuan Li,Runze Hu
Main category: cs.CV
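TL;DR: This paper proposes DR.Experts, a distortion-prior-driven framework for blind image quality assessment (BIQA) that introduces a degradation-aware vision-language model and a dynamic distortion weighting module to model distortion features explicitly and fuse them with weights reflecting their perceptual importance, clearly improving agreement with subjective human judgments.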
Details
Motivation: Existing BIQA models struggle to capture subtle distortion cues and therefore disagree with human subjective judgments; the root cause is the lack of reliable distortion priors. Method: The DR.Experts framework: 1) a degradation-aware vision-language model provides distortion-specific priors; 2) a Distortion-Saliency Differential Module distinguishes and enhances these priors; 3) a mixture-of-experts style Dynamic Distortion Weighting Module fuses the distortion priors with semantics and bridging representations, weighted by perceptual impact. Result: It clearly outperforms existing methods on five mainstream BIQA benchmarks and shows stronger generalization and data efficiency. Conclusion: Explicitly modeling distortion priors and fusing them with perceptual weights improves the agreement of BIQA models with human visual perception and supports the prior-driven paradigm for blind assessment. Abstract: Blind Image Quality Assessment, aiming to replicate human perception of visual quality without reference, plays a key role in vision tasks, yet existing models often fail to effectively capture subtle distortion cues, leading to a misalignment with human subjective judgments. We identify that the root cause of this limitation lies in the lack of reliable distortion priors, as methods typically learn shallow relationships between unified image features and quality scores, resulting in their insensitive nature to distortions and thus limiting their performance. To address this, we introduce DR.Experts, a novel prior-driven BIQA framework designed to explicitly incorporate distortion priors, enabling a reliable quality assessment. DR.Experts begins by leveraging a degradation-aware vision-language model to obtain distortion-specific priors, which are further refined and enhanced by the proposed Distortion-Saliency Differential Module through distinguishing them from semantic attentions, thereby ensuring the genuine representations of distortions. The refined priors, along with semantics and bridging representation, are then fused by a proposed mixture-of-experts style module named the Dynamic Distortion Weighting Module. This mechanism weights each distortion-specific feature as per its perceptual impact, ensuring that the final quality prediction aligns with human perception. Extensive experiments conducted on five challenging BIQA benchmarks demonstrate the superiority of DR.Experts over current methods and showcase its excellence in terms of generalization and data efficiency.[105] RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
Michael Baltaxe,Dan Levi,Sagie Benaim
Main category: cs.CV
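TL;DR: This paper proposes RAD, a retrieval-augmented framework for monocular metric depth estimation (MMDE) that retrieves semantically similar RGB-D samples as geometric proxies and combines a dual-stream network with a matched cross-attention module, clearly improving depth accuracy for underrepresented classes.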
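The error reductions reported for this entry are relative changes in a depth error metric. The sketch below assumes that "relative absolute error" refers to the commonly used AbsRel metric (mean of |prediction - ground truth| / ground truth over valid pixels); that reading, the optional per-class `mask`, and both function names are assumptions, not details taken from the paper.

```python
def abs_rel(pred, gt, mask=None):
    """Mean absolute relative depth error over valid pixels.

    pred, gt: flattened depth values in the same metric unit.
    mask: optional booleans selecting pixels of one class (hypothetical).
    Pixels with non-positive ground truth are treated as invalid.
    """
    errors = []
    for i, (p, g) in enumerate(zip(pred, gt)):
        if g <= 0:
            continue
        if mask is not None and not mask[i]:
            continue
        errors.append(abs(p - g) / g)
    if not errors:
        raise ValueError("no valid pixels to evaluate")
    return sum(errors) / len(errors)

def relative_reduction(baseline_error, new_error):
    """Fractional error reduction of new_error with respect to baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

Under this reading, a drop in AbsRel from 0.100 to 0.0708 on one class would correspond to a relative reduction of about 29.2%; these two numbers are made up for illustration.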
Details
Motivation: Accurate monocular metric depth estimation for underrepresented classes in complex scenes remains challenging. Method: The RAD framework uses an uncertainty-aware retrieval mechanism to locate low-confidence regions and retrieve semantically similar RGB-D context; a dual-stream network processes the input and the retrieved context separately, and a matched cross-attention module transfers geometric information only at reliable point correspondences. Result: On NYU Depth v2, KITTI, and Cityscapes, RAD reduces relative absolute error on underrepresented classes by 29.2%, 13.3%, and 7.2%, respectively, while remaining competitive on standard in-domain benchmarks. Conclusion: RAD alleviates the weak generalization of monocular depth estimation to underrepresented classes and shows that retrieval-augmented geometric priors are feasible and effective. Abstract: Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.[106] AUHead: Realistic Emotional Talking Head Generation via Action Units Control
Jiayi Lyu,Leigang Qu,Wenjing Zhang,Hanyu Jiang,Kai Liu,Zhenglin Zhou,Xiaobo Xia,Jian Xue,Tat-Seng Chua
Main category: cs.CV
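TL;DR: This paper proposes AUHead, a two-stage method that disentangles fine-grained Action Units (AUs) from audio and uses them to drive a controllable diffusion model, generating realistic talking-head videos with high emotional realism and accurate lip synchronization.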
Details
Motivation: Existing methods struggle to control emotional expression at a fine-grained level and lack effective modeling of Action Units (AUs) as a fine-grained emotion representation. Method: Stage one uses a large audio-language model (ALM) with spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism to generate AUs; stage two builds an AU-driven controllable diffusion model that maps AU sequences to a structured 2D facial representation and models AU-vision interaction within cross-attention modules; an AU disentanglement guidance strategy balances AU fidelity against generation quality. Result: It clearly outperforms existing methods on benchmark datasets, with competitive results in emotional realism, lip synchronization accuracy, and visual coherence. Conclusion: AUHead achieves controllable talking-head video generation driven by fine-grained emotion derived from speech, offering a new paradigm for applications such as virtual avatars and film production. Abstract: Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e., Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs), by spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into the structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at https://github.com/laura990501/AUHead_ICLR[107] Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination
Ziqiang Shi,Rujie Liu,Shanshan Yu,Satoshi Munakata,Koichi Shirahata
Main category: cs.CV
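TL;DR: This paper proposes Scalpel, which adjusts the activation distributions of attention heads in each Transformer layer during inference so that they focus on more credible visual regions, reducing hallucination in large vision-language models (LVLMs). The method models attention distributions with Gaussian mixture models and uses entropic optimal transport to map between trusted and hallucinated attention patterns, suppressing hallucination efficiently without extra computational overhead or additional training.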
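The entropic optimal transport step mentioned in this entry can be illustrated with the classic Sinkhorn iteration. The sketch below is a simplification and not the Scalpel algorithm: it matches mixture components using only their weights and means, uses squared Euclidean distance as the cost, and the names `squared_cost` and `sinkhorn` as well as the `eps` and `iters` defaults are assumptions.

```python
import math

def squared_cost(means_a, means_b):
    """Squared Euclidean distance between every pair of component means."""
    return [[sum((x - y) ** 2 for x, y in zip(ma, mb)) for mb in means_b]
            for ma in means_a]

def sinkhorn(a, b, cost, eps=0.1, iters=200):
    """Entropic optimal transport plan between weight vectors a and b.

    a, b: component weights, each summing to 1.
    cost: len(a) x len(b) cost matrix.
    eps: entropic regularization strength (smaller means a sharper plan).
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each row of the returned plan says how much of one source component's mass is sent to each target component. Very small `eps` relative to the costs can underflow the kernel to zero, so a log-domain variant would be the safer choice in practice.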
Details
Motivation: Large vision-language models (LVLMs) often produce hallucinated outputs that contradict the image content because of the strong prior of large language models and misaligned cross-modal attention, so an effective mitigation mechanism is needed. Method: At inference time, Scalpel predicts a trusted attention direction for each Transformer attention head and adjusts the activations; a Gaussian mixture model captures the multi-peak distributions of trusted and hallucinated attention; entropic optimal transport (equivalent to the Schrödinger bridge problem) maps the Gaussian components precisely; intervention strength and direction are adjusted dynamically according to component membership and the mapping. Result: Experiments on multiple datasets and benchmarks show that Scalpel clearly mitigates hallucination and outperforms prior methods, reaching state-of-the-art performance; it is model- and data-agnostic and needs no additional computation or training, only a single decoding step. Conclusion: Scalpel is an efficient, general, plug-and-play inference-time method for mitigating hallucination and offers a new paradigm for improving the visual faithfulness of LVLMs. Abstract: Rapid progress in large vision-language models (LVLMs) has achieved unprecedented performance in vision-language tasks. However, due to the strong prior of large language models (LLMs) and misaligned attention across modalities, LVLMs often generate outputs inconsistent with visual content - termed hallucination. To address this, we propose Scalpel, a method that reduces hallucination by refining attention activation distributions toward more credible regions. Scalpel predicts trusted attention directions for each head in Transformer layers during inference and adjusts activations accordingly. It employs a Gaussian mixture model to capture multi-peak distributions of attention in trust and hallucination manifolds, and uses entropic optimal transport (equivalent to Schrödinger bridge problem) to map Gaussian components precisely. During mitigation, Scalpel dynamically adjusts intervention strength and direction based on component membership and mapping relationships between hallucination and trust activations. Extensive experiments across multiple datasets and benchmarks demonstrate that Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance. Moreover, Scalpel is model- and data-agnostic, requiring no additional computation, only a single decoding step.[108] Delving into Spectral Clustering with Vision-Language Representations
Bo Peng,Yuanwei Hu,Bo Liu,Ling Chen,Jie Lu,Zhen Fang
Main category: cs.CV
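TL;DR: This paper proposes a multimodal spectral clustering method based on the neural tangent kernel (NTK-SC) that exploits the cross-modal alignment of pre-trained vision-language models, strengthens image-to-image affinity by anchoring on positive nouns, and adds a regularized affinity diffusion mechanism, clearly improving clustering performance on a range of benchmarks.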
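The notion of an affinity that couples visual proximity with semantic overlap can be made concrete with a toy sketch. It deliberately leaves out the neural tangent kernel and the affinity diffusion described in this entry: here the affinity is simply the product of a given visual similarity and the overlap of per-image noun distributions, followed by a textbook two-way spectral split. The function names and the product form are assumptions for illustration.

```python
import math

def coupled_affinity(visual_sim, noun_probs):
    """A[i][j] = visual similarity times overlap of the noun distributions."""
    n = len(visual_sim)
    return [[visual_sim[i][j]
             * sum(p * q for p, q in zip(noun_probs[i], noun_probs[j]))
             for j in range(n)] for i in range(n)]

def two_way_spectral_partition(affinity, iters=200):
    """Two-way split from the second eigenvector of D^-1/2 A D^-1/2."""
    n = len(affinity)
    deg = [sum(row) for row in affinity]
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    # The leading eigenvector of the normalized affinity is proportional to sqrt(deg).
    norm = math.sqrt(sum(deg))
    top = [math.sqrt(d) / norm for d in deg]
    x = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        # Apply (S + I) / 2: same eigenvectors as S, non-negative eigenvalues.
        sx = [inv_sqrt[i] * sum(affinity[i][j] * inv_sqrt[j] * x[j] for j in range(n))
              for i in range(n)]
        x = [(s + xi) / 2.0 for s, xi in zip(sx, x)]
        # Deflate the leading eigenvector so power iteration finds the second one.
        proj = sum(a * b for a, b in zip(x, top))
        x = [xi - proj * t for xi, t in zip(x, top)]
        length = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / length for xi in x]
    return [1 if xi > 0 else 0 for xi in x]
```

In a small four-image example where visual similarity alone gives weak separation, multiplying by the semantic overlap shrinks the cross-cluster entries relative to the within-cluster ones, which is the block-diagonal effect the abstract describes.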
Details
Motivation: Most existing spectral clustering methods are single-modal and leave the rich information in multi-modal representations untapped; inspired by the success of vision-language pre-training, the work extends spectral clustering to the multi-modal setting. Method: Neural Tangent Kernel Spectral Clustering (NTK-SC): 1) using a pre-trained vision-language model, the neural tangent kernel is anchored with positive nouns that are semantically close to the images of interest; 2) image affinity is defined as a coupling of visual proximity and semantic overlap; 3) a regularized affinity diffusion mechanism adaptively ensembles the affinity matrices induced by different prompts. Result: On 16 benchmarks covering classical, large-scale, fine-grained, and domain-shifted settings, the method consistently outperforms the state of the art by a large margin. Conclusion: A multi-modal view strengthens the discriminative power of spectral clustering; through cross-modal alignment and affinity diffusion, NTK-SC strengthens within-cluster connections and suppresses spurious cross-cluster ones, making spectral clustering more robust and general. Abstract: Spectral clustering is known as a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime. Particularly, we propose Neural Tangent Kernel Spectral Clustering that leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we arrive at formulating the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts. Extensive experiments on 16 benchmarks -- including classical, large-scale, fine-grained and domain-shifted datasets -- manifest that our method consistently outperforms the state-of-the-art by a large margin.[109] MieDB-100k: A Comprehensive Dataset for Medical Image Editing
Yongfan Lai,Wen Qian,Bo Liu,Hongyan Li,Hao Luo,Fan Wang,Bohan Zhuang,Shenda Hong
Main category: cs.CV
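TL;DR: This paper presents MieDB-100k, a large-scale, high-quality, and diverse dataset for text-guided medical image editing, built to address the limited diversity of existing datasets, their neglect of medical image understanding, and the difficulty of balancing quality with scale.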
Details
Motivation: Existing medical image editing datasets have limited diversity, neglect medical image understanding, and fail to balance quality with scalability, which holds back the adaptation of multimodal generative models to medical image editing. Method: MieDB-100k covers three categories of editing tasks (Perception, Modification, and Transformation); it is built with modality-specific expert models and rule-based data synthesis, followed by rigorous manual inspection to ensure clinical fidelity. Result: Models trained on MieDB-100k consistently outperform open-source and proprietary models on multiple metrics and generalize well. Conclusion: MieDB-100k is expected to serve as a cornerstone dataset for research on specialized medical image editing. Abstract: The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding and inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text-guided medical image editing. It categorizes editing tasks into perspectives of Perception, Modification and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthetic methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that models trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.[110] Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures
Yuxi Wang,Wenqi Ouyang,Tianyi Wei,Yi Dong,Zhiqi Shen,Xingang Pan
Main category: cs.CV
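TL;DR: This paper proposes Hand2World, a framework that generates egocentric interactive videos responding to free-space hand gestures from a single scene image, addressing distribution shift, motion ambiguity, and long video generation.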
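The per-pixel Plücker-ray embedding used in this entry to encode camera geometry has a compact standard form: a ray with unit direction d passing through the camera center o is represented by the pair (d, o × d). The sketch below assumes a pinhole camera, a camera-to-world rotation, the (direction, moment) ordering, and rays cast through integer pixel coordinates; the function names are hypothetical and the paper's exact convention may differ.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_ray(u, v, fx, fy, cx, cy, rotation, origin):
    """6-D Plücker embedding (direction, moment) of the ray through pixel (u, v).

    fx, fy, cx, cy: pinhole intrinsics.
    rotation: 3x3 camera-to-world rotation as nested lists.
    origin: camera center in world coordinates.
    """
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    d = [sum(rotation[r][c] * d_cam[c] for c in range(3)) for r in range(3)]
    norm = math.sqrt(sum(x * x for x in d))
    d = tuple(x / norm for x in d)
    moment = cross(origin, d)  # independent of which point on the ray is used
    return d + moment

def plucker_map(width, height, fx, fy, cx, cy, rotation, origin):
    """Embedding for every pixel, indexed as result[v][u]."""
    return [[plucker_ray(u, v, fx, fy, cx, cy, rotation, origin)
             for u in range(width)] for v in range(height)]
```

Because the moment changes when the camera center moves while the direction changes when the camera rotates, the six values expose camera motion explicitly, which is the property the entry relies on to separate camera motion from hand motion.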
Details
Motivation: Egocentric interactive world models are essential for augmented reality and embodied AI, but existing methods struggle to balance low latency, geometric consistency, and long-term stability. Method: Hand2World uses occlusion-invariant hand conditioning based on projected 3D hand meshes; per-pixel Plücker-ray embeddings explicitly encode camera geometry to stabilize viewpoint changes; a fully automated monocular annotation pipeline is built, and a bidirectional diffusion model is distilled into a causal generator. Result: On three egocentric interaction benchmarks it substantially improves perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation. Conclusion: Hand2World offers a unified, stable, and scalable autoregressive framework for generating egocentric interaction videos from a single image. Abstract: Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.[111] Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing
Jialun Liu,Yukuo Ma,Xiao Cao,Tian Li,Gonghu Shang,Haibin Huang,Chi Zhang,Xuelong Li,Cong Liu,Junqi Liu,Jiakui Hu,Robby T. Tan,Shiwen Zhang,Liying Yang,Xiaoyan Yang,Qizhen Weng,Xiangzhen Chang,Yuanzhi Liang,Yifan Xu,Zhiyong Huang,Zuoxin Li,Xuelong Li
Main category: cs.CV
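TL;DR: This paper proposes Tele-Omni, a unified multimodal framework for video generation and editing that accepts text, image, and reference-video instructions; it decouples instruction parsing from video synthesis and uses task-aware data design to achieve flexible control and high-quality output.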
Details
Motivation: Existing diffusion-based video generation methods are mostly task-specific and rely on text-only instructions, so they handle multimodal inputs and diverse editing scenarios poorly; video editing methods often need custom pipelines and lack scalability and composability. Method: In Tele-Omni, pretrained multimodal large language models parse heterogeneous instructions and infer structured intents, and diffusion models generate video conditioned on those intents; a task-aware data processing pipeline unifies multimodal inputs into a structured instruction format while preserving task constraints. Result: Tele-Omni supports video generation and editing driven by text, images, first and last frames, and in-context references, and achieves competitive performance across multiple tasks while maintaining temporal coherence and visual consistency. Conclusion: Decoupling instruction understanding from video synthesis, together with task-aware data design, is an effective route to a unified, flexible, high-quality multimodal video generation and editing framework. Abstract: Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while diffusion-based generators perform high-quality video synthesis conditioned on these structured signals. To enable joint training across heterogeneous video tasks, we introduce a task-aware data processing pipeline that unifies multimodal inputs into a structured instruction format while preserving task-specific constraints. Tele-Omni supports a wide range of video-centric tasks, including text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing. By decoupling instruction parsing from video synthesis and combining it with task-aware data design, Tele-Omni achieves flexible multimodal control while maintaining strong temporal coherence and visual consistency. Experimental results demonstrate that Tele-Omni achieves competitive performance across multiple tasks.[112] AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models
Yue Li,Xin Yi,Dongsheng Shi,Yongyi Cui,Gerard de Melo,Linlin Wang
Main category: cs.CV
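TL;DR: This paper proposes Attention-Guided Dynamic Watermarking (AGMark), a dynamic watermarking framework for large vision-language models that uses attention guidance and evidence-density calibration to jointly optimize visual fidelity and detectability.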
Details
Motivation: Existing watermarking methods introduce visually irrelevant tokens, rely on static weight estimates that ignore distribution density, and cannot adapt to changes in visual dependence during generation, which degrades generation quality and distorts semantics. Method: At each decoding step, AGMark 1) dynamically identifies semantic-critical evidence from attention weights and contextual coherence; and 2) combines token entropy (uncertainty awareness) with weight density (evidence calibration) to partition the vocabulary adaptively and determine the proportion of semantic-critical tokens to protect. Result: AGMark clearly improves generation quality, especially visual semantic fidelity in the later stages of generation; detection accuracy is at least 99.36% AUC and robustness under attack is at least 88.61% AUC, without sacrificing inference efficiency. Conclusion: AGMark offers a new paradigm for multimodal watermarking that balances reliability, fidelity, and robustness, and sets a new standard for trustworthy provenance of multimodal content. Abstract: Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases. Additionally, current vision-specific watermarks rely on a static, one-time estimation of vision critical weights and ignore the weight distribution density when determining the proportion of protected tokens. This design fails to account for dynamic changes in visual dependence during generation and may introduce low-quality tokens in the long tail. To address these challenges, we propose Attention-Guided Dynamic Watermarking (AGMark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. At each decoding step, AGMark first dynamically identifies semantic-critical evidence based on attention weights for visual relevance, together with context-aware coherence cues, resulting in a more adaptive and well-calibrated evidence-weight distribution. It then determines the proportion of semantic-critical tokens by jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density), thereby enabling adaptive vocabulary partitioning to avoid irrelevant tokens. Empirical results confirm that AGMark outperforms conventional methods, observably improving generation quality and yielding particularly strong gains in visual semantic fidelity in the later stages of generation. The framework maintains highly competitive detection accuracy (at least 99.36% AUC) and robust attack resilience (at least 88.61% AUC) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multi-modal watermarking.[113] Towards Training-free Multimodal Hate Localisation with Large Language Models
Yueming Sun,Long Yang,Jianbo Jiao,Zeyu Fu
Main category: cs.CV
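TL;DR: This paper proposes LELA, a training-free framework based on large language models (LLMs) for localizing hateful content in videos; it achieves fine-grained temporal localization through multimodal decomposition and multi-stage prompting and clearly outperforms existing training-free methods on two benchmarks.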
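Turning per-frame hate scores into temporal segments, the last step implied by this entry, can be sketched with simple thresholding. The sketch is not LELA's procedure: averaging the modality scores, the threshold of 0.5, and the minimum segment length are assumptions, and both function names are hypothetical.

```python
def fuse_modalities(scores_by_modality):
    """Average per-frame scores across modalities (an assumed fusion rule).

    scores_by_modality: dict mapping a modality name to a list of per-frame
    scores, all lists having the same length.
    """
    columns = zip(*scores_by_modality.values())
    return [sum(col) / len(col) for col in columns]

def localize(frame_scores, threshold=0.5, min_len=2):
    """Return (start, end) frame index pairs, inclusive, of runs above threshold."""
    segments = []
    start = None
    for i, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = i  # a new run begins
        elif score < threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(frame_scores) - start >= min_len:
        segments.append((start, len(frame_scores) - 1))
    return segments
```

The `min_len` filter drops isolated single-frame spikes, which is one simple way to keep the output segments stable when per-frame scores are noisy.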
Details
Motivation: Existing video hate detection methods depend on large amounts of human annotation or lack fine-grained temporal precision, so a training-free, interpretable, and precise solution is needed. Method: LELA decomposes a video into five modalities (image, speech, OCR, music, and video context), computes a hate score for each frame using modality-specific captioning and multi-stage prompting, and adds a cross-modal composition matching mechanism to strengthen reasoning. Result: On the HateMM and MultiHateClip benchmarks, LELA outperforms all existing training-free baselines by a large margin; ablations and visualizations confirm its effectiveness and interpretability. Conclusion: LELA is the first training-free, LLM-driven framework for hate video localization and offers a new paradigm for scalable, interpretable detection of hateful content. Abstract: The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.[114] VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model
Hanqing Wang,Mingyu Liu,Xiaoyu Chen,Chengwei MA,Yiming Zhong,Wenti Yin,Yuhao Liu,Zhiqing Cui,Jiahao Yuan,Lu Dai,Zhiyuan Ma,Hui Xiong
Main category: cs.CV
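TL;DR: This paper introduces the VIDA video dataset and the VideoAfford model. By adding dynamic interaction videos and a spatial-aware loss, it improves localization of actionable regions on 3D objects, significantly outperforms existing methods, and generalizes to open-world settings.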
Details
Motivation: Existing 3D affordance learning methods based on static language or image cues lack the temporal and causal cues of dynamic interaction, which makes them a poor fit for robotic manipulation. Method: The authors build VIDA, a large-scale video-based 3D affordance dataset, and propose VideoAfford, which combines a multimodal large language model with a latent action encoder that extracts dynamic interaction priors from videos, plus a spatial-aware loss that strengthens 3D spatial understanding. Result: The model significantly outperforms existing methods across multiple metrics and shows strong open-world generalization and affordance reasoning. Conclusion: Video dynamics and spatial modeling improve 3D affordance grounding, and VIDA and VideoAfford provide a new benchmark and a practical framework for the field. Abstract: 3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research primarily focused on learning affordance knowledge from static cues such as language and images, which struggle to provide sufficient dynamic interaction context that can reveal temporal and causal cues. To alleviate this predicament, we collect a comprehensive video-based 3D affordance dataset, VIDA, which contains 38K human-object-interaction videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on VIDA, we propose a strong baseline: VideoAfford, which activates multimodal large language models with additional affordance segmentation capabilities, enabling both world knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding capability, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a spatial-aware loss function to enable VideoAfford to obtain comprehensive 3D spatial knowledge. Extensive experimental evaluations demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.[115] Time2General: Learning Spatiotemporal Invariant Representations for Domain-Generalization Video Semantic Segmentation
Siyu Chen,Ting Han,Haoling Huang,Chaolei Wang,Chengzheng Fu,Duxin Zhu,Guorong Cai,Jinhe Su
Main category: cs.CV
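TL;DR: This paper proposes Time2General, a framework that uses Stability Queries and a Spatio-Temporal Memory Decoder to handle domain shift and temporal-sampling shift in domain-generalized video semantic segmentation, substantially improving cross-domain accuracy and temporal stability.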
Details
Motivation: Under domain shift and temporal-sampling shift, existing methods tend to produce frame-to-frame flicker and struggle to keep predictions temporally consistent in label-stable regions. Method: Time2General is built on Stability Queries and comprises a Spatio-Temporal Memory Decoder, which aggregates multi-frame context and decodes consistent masks, and a Masked Temporal Consistency Loss, which regularizes prediction discrepancies across different strides; training strides are randomized for robustness. Result: On multiple driving benchmarks it substantially outperforms prior domain-generalized semantic segmentation and video semantic segmentation baselines in both cross-domain accuracy and temporal stability, while running at up to 18 FPS. Conclusion: Time2General mitigates temporal inconsistency in domain-generalized video semantic segmentation and gives a more robust option for deployment without target labels or test-time adaptation. Abstract: In Domain Generalized Video Semantic Segmentation (DGVSS), a model is trained on a single labeled driving domain and deployed directly on unseen domains without target labels or test-time adaptation, while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio-Temporal Memory Decoder that aggregates multi-frame context into a clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, we propose the Masked Temporal Consistency Loss, which regularizes temporal prediction discrepancies across different strides, and we randomize training strides to expose the model to diverse temporal gaps. Extensive experiments on multiple driving benchmarks show that Time2General achieves a substantial improvement in cross-domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS. Code will be released after the review process.[116] TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution
Deyang Jiang,Jing Huang,Xuanle Zhao,Lei Chen,Liming Zheng,Fanfan Liu,Haibo Qiu,Peng Shi,Zhixiong Zeng
Main category: cs.CV
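TL;DR: This paper proposes TreeCUA, a framework that organizes GUI automation trajectories as a tree. It combines multi-agent collaborative exploration, tree-topology storage, an adaptive exploration algorithm, and world-knowledge guidance to scale GUI planning efficiently, and it further introduces TreeCUA-DPO, which uses branch information to improve performance.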
Details
Motivation: Existing work concentrates on scaling GUI grounding and neglects data collection for the more crucial GUI planning. CUA exploration naturally follows a tree structure, and exploiting that structure can cut data cost and make planning data easier to scale. Method: TreeCUA consists of (1) a multi-agent collaborative framework for environment exploration, action verification, trajectory summarization, and quality evaluation; (2) a tree topology that stores and replays duplicate exploration nodes; (3) an adaptive exploration algorithm that balances trajectory depth (difficulty) against breadth (diversity); (4) world knowledge guidance and global memory backtracking to avoid low-quality generation; and (5) TreeCUA-DPO, an extension built on tree node information that improves planning by referring to the branches of adjacent trajectories. Result: TreeCUA and TreeCUA-DPO significantly outperform baselines in experiments, and OOD tests show strong generalization. Conclusion: Tree-structured modeling is an effective way to scale GUI planning, and TreeCUA and its DPO variant give CUAs a scalable, high-quality, and generalizable route to trajectory generation and planning. Abstract: Effectively scaling GUI automation is essential for computer-use agents (CUAs); however, existing work primarily focuses on scaling GUI grounding rather than the more crucial GUI planning, which requires more sophisticated data collection. In reality, the exploration process of a CUA across apps/desktops/web pages typically follows a tree structure, with earlier functional entry points often being explored more frequently. Thus, organizing large-scale trajectories into tree structures can reduce data cost and streamline the data scaling of GUI planning. In this work, we propose TreeCUA to efficiently scale GUI automation with tree-structured verifiable evolution. We propose a multi-agent collaborative framework to explore the environment, verify actions, summarize trajectories, and evaluate quality to generate high-quality and scalable GUI trajectories. To improve efficiency, we devise a novel tree-based topology to store and replay duplicate exploration nodes, and design an adaptive exploration algorithm to balance the depth (i.e., trajectory difficulty) and breadth (i.e., trajectory diversity). Moreover, we develop world knowledge guidance and global memory backtracking to avoid low-quality generation. Finally, we naturally extend and propose the TreeCUA-DPO method from abundant tree node information, improving GUI planning capability by referring to the branch information of adjacent trajectories. 
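The tree-shaped storage idea can be illustrated with a minimal sketch. It assumes each trajectory is a list of action strings with a per-step quality score; the class, method, and action names are hypothetical and are not taken from the TreeCUA code. Trajectories that share a prefix reuse the same nodes, and sibling branches under one parent give candidate (preferred, rejected) pairs of the kind a DPO-style objective could consume.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """One GUI state reached by taking `action` from the parent node."""
    action: str
    score: float = 0.0  # hypothetical quality score for this step
    children: dict = field(default_factory=dict)


class TrajectoryTree:
    def __init__(self):
        self.root = Node(action="<root>")
        self.node_count = 0  # stored nodes; shared prefixes are counted once

    def insert(self, actions, scores):
        """Add one trajectory, reusing nodes for any prefix already explored."""
        node = self.root
        for action, score in zip(actions, scores):
            if action not in node.children:
                node.children[action] = Node(action, score)
                self.node_count += 1
            node = node.children[action]

    def sibling_pairs(self):
        """Yield (prefix, better_action, worse_action) for branches sharing a parent."""
        stack = [((), self.root)]
        while stack:
            prefix, node = stack.pop()
            for a, b in combinations(node.children.values(), 2):
                if a.score != b.score:
                    hi, lo = (a, b) if a.score > b.score else (b, a)
                    yield prefix, hi.action, lo.action
            for child in node.children.values():
                stack.append((prefix + (child.action,), child))


tree = TrajectoryTree()
tree.insert(["open_settings", "tap_wifi", "toggle_on"], [1.0, 1.0, 0.9])
tree.insert(["open_settings", "tap_wifi", "tap_back"], [1.0, 1.0, 0.2])
tree.insert(["open_settings", "tap_bluetooth"], [1.0, 0.5])
pairs = sorted(tree.sibling_pairs())
```

In this toy run the three trajectories contain eight steps in total but occupy five nodes, which is the kind of saving that shared entry points make possible.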
Experimental results show that TreeCUA and TreeCUA-DPO offer significant improvements, and out-of-domain (OOD) studies further demonstrate strong generalization. All trajectory node information and code will be available at https://github.com/UITron-hub/TreeCUA.[117] Semi-supervised Liver Segmentation and Patch-based Fibrosis Staging with Registration-aided Multi-parametric MRI
Boya Wang,Ruizhe Li,Chao Chen,Xin Chen
Main category: cs.CV
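TL;DR: This paper proposes a multi-task deep learning framework for liver segmentation (LiSeg) and liver fibrosis staging (LiFS), developed for the CARE Liver 2025 Track 4 challenge. It combines semi-supervised learning with image registration to cope with scarce annotations and differences across multi-parametric MRI modalities, and uses a patch-based method for interpretable fibrosis staging.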
Details
Motivation: Clinical diagnosis of liver fibrosis needs precise liver segmentation and accurate disease staging, but existing methods are limited by scarce annotations, differences across multi-parametric MRI modalities, and domain shift. Method: A multi-task framework. The LiSeg phase uses a semi-supervised model that integrates image segmentation and registration, drawing on both labeled and unlabeled data to reduce the effect of domain shift; the LiFS phase uses a patch-based classification method that allows fibrosis stages to be visualized. Result: The method was evaluated on the independent CARE Liver 2025 test set, which includes ID and OOD cases, and supports three-channel (T1, T2, DWI) and seven-channel (plus GED1-GED4) MRI input. Conclusion: The framework handles multimodal imaging, limited labels, and domain shift, and its code is released for reproducibility. Abstract: Liver fibrosis poses a substantial challenge in clinical practice, emphasizing the necessity for precise liver segmentation and accurate disease staging. Based on the CARE Liver 2025 Track 4 Challenge, this study introduces a multi-task deep learning framework developed for liver segmentation (LiSeg) and liver fibrosis staging (LiFS) using multi-parametric MRI. The LiSeg phase addresses the challenge of limited annotated images and the complexities of multi-parametric MRI data by employing a semi-supervised learning model that integrates image segmentation and registration. By leveraging both labeled and unlabeled data, the model overcomes the difficulties introduced by domain shifts and variations across modalities. In the LiFS phase, we employ a patch-based method that allows the visualization of liver fibrosis stages based on the classification outputs. Our approach effectively handles multimodality imaging data, limited labels, and domain shifts. The proposed method has been tested by the challenge organizer on an independent test set that includes in-distribution (ID) and out-of-distribution (OOD) cases using three-channel MRIs (T1, T2, DWI) and seven-channel MRIs (T1, T2, DWI, GED1-GED4). The code is freely available. GitHub link: https://github.com/mileywang3061/Care-Liver[118] GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation
Sandesh Hegde,Jaison Saji Chacko,Debarshi Banerjee,Uma Mahesh
Main category: cs.CV
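TL;DR: This paper proposes GenSeg-R1, which performs fine-grained referring image segmentation through a decoupled reason-then-segment pipeline: a fine-tuned Qwen3-VL vision-language model first produces structured spatial prompts (a bounding box plus keypoints), and a frozen SAM 2 model then generates the mask. Training uses GRPO reinforcement learning without reasoning-chain annotations, and the method substantially outperforms existing approaches on several benchmarks.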
Details
Motivation: Existing methods for fine-grained referring image segmentation struggle to combine accurate reasoning with high-quality segmentation and rely on costly reasoning-chain supervision; they also cannot detect no-target queries (negative prompts). Method: A decoupled reason-then-segment paradigm: (1) a Qwen3-VL VLM (4B or 8B) takes an image and a text query and reasons its way to structured spatial prompts (a bounding box plus two interior keypoints); (2) a frozen SAM 2 acts as a promptable segmenter and turns the prompts into masks; (3) the VLM is fine-tuned with Group Relative Policy Optimization (GRPO), with no human-annotated reasoning chains; (4) a variant, GenSeg-R1-G, is trained on GRefCOCO with a SAM 2 in-the-loop reward that directly optimizes mask quality and supports negative-prompt recognition. Result: On RefCOCOg val, GenSeg-R1-8B reaches 0.7127 cIoU and 0.7382 mIoU, +15.3 and +21.9 points over the Qwen3-VL Instruct baselines, and +3.3 cIoU over Seg-Zero-7B. On GRefCOCO val, GenSeg-R1-G reaches 76.69% target mIoU and 82.40% negative-prompt accuracy. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, ahead of Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7. Conclusion: GenSeg-R1 supports the paradigm of reasoning to structured spatial prompts followed by a strong frozen segmenter. GRPO enables training without reasoning-chain supervision, and the unified framework delivers both accurate segmentation and robust no-target detection, offering a scalable and interpretable path for VLM-driven embodied intelligence. Abstract: We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. 
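The two metrics quoted for RefCOCOg can differ noticeably on the same predictions. The sketch below assumes the definitions commonly used in referring segmentation work (mIoU as the mean of per-sample IoU, cIoU as total intersection over total union); the paper's own evaluation code is not shown here, and the function names and toy masks are illustrative only.

```python
def iou_counts(pred, target):
    """Return (intersection, union) pixel counts for two flat binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def miou(preds, targets):
    """Mean of per-sample IoU: every sample counts equally."""
    ious = []
    for pred, target in zip(preds, targets):
        inter, union = iou_counts(pred, target)
        # Assumption: an empty prediction for an empty target counts as a match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def ciou(preds, targets):
    """Cumulative IoU: total intersection over total union, so large objects dominate."""
    total_inter = total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = iou_counts(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Toy data: a small object segmented perfectly and a larger object half missed.
preds = [[1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0]]
targets = [[1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0]]
mean_iou = miou(preds, targets)        # (1.0 + 0.5) / 2
cumulative_iou = ciou(preds, targets)  # (1 + 2) / (1 + 4)
```

Because cIoU pools pixels across samples, the miss on the larger object pulls it below mIoU in this toy case.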
On GRefCOCO validation GenSeg-R1-G achieves 76.69% target mIoU with 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection capability. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7 points.[119] Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models
Ruisi Zhao,Haoren Zheng,Zongxin Yang,Hehe Fan,Yi Yang
Main category: cs.CV
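TL;DR: Stroke3D is presented as the first framework that generates rigged 3D meshes directly from user-drawn 2D strokes and a text prompt. It uses a two-stage pipeline: Sk-VAE and Sk-DiT first generate a controllable skeleton, and a textured mesh is then synthesized with the help of the TextuRig dataset and the SKA-DPO optimization strategy.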
Details
Motivation: Existing 3D generation methods have difficulty producing animatable geometry, and conventional rigging techniques lack fine-grained control over skeleton structure. A controllable rigged 3D generation method that combines semantics (text) with geometric constraints (drawn strokes) is needed. Method: A two-stage framework. (1) Controllable skeleton generation: a Skeletal Graph VAE (Sk-VAE) encodes the skeleton's graph structure into a latent space, a Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding conditioned jointly on text and 2D strokes, and the VAE decoder reconstructs the 3D skeleton. (2) Enhanced mesh synthesis: an existing skeleton-to-mesh model is enhanced by augmenting its training data with TextuRig (captioned, textured, and rigged meshes curated from Objaverse-XL), and SKA-DPO, a preference optimization strategy guided by a skeleton-mesh alignment score, improves geometric fidelity. Result: Stroke3D generates structurally plausible, semantically consistent rigged 3D meshes with high geometric fidelity from 2D strokes and text prompts; experiments indicate better skeleton plausibility and mesh quality than baselines. Conclusion: Stroke3D is the first approach to text-conditioned rigged 3D generation that uses hand-drawn 2D strokes as explicit structural guidance, improving creative control and animation readiness and opening a new interaction paradigm for 3D content generation. Abstract: Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods face challenges in generating animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. Our approach pioneers a two-stage pipeline that separates the generation into: 1) Controllable Skeleton Generation, where we employ the Skeletal Graph VAE (Sk-VAE) to encode the skeleton's graph structure into a latent space, where the Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding. The generation process is conditioned on both the text for semantics and the 2D strokes for explicit structural control, with the VAE's decoder reconstructing the final high-quality 3D skeleton; and 2) Enhanced Mesh Synthesis via TextuRig and SKA-DPO, where we then synthesize a textured mesh conditioned on the generated skeleton. For this stage, we first enhance an existing skeleton-to-mesh model by augmenting its training data with TextuRig: a dataset of textured and rigged meshes with captions, curated from Objaverse-XL. Additionally, we employ a preference optimization strategy, SKA-DPO, guided by a skeleton-mesh alignment score, to further improve geometric fidelity. 
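The preference-optimization step can be pictured with the standard DPO loss. This is a sketch under stated assumptions, not the authors' SKA-DPO: it assumes candidate meshes for one skeleton are ranked by a scalar alignment score, that the best and worst form the (preferred, rejected) pair, and that log-likelihoods under the trained and reference models are available. The dictionary keys, `pick_pair`, and all numbers are hypothetical.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair of log-likelihoods."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def pick_pair(candidates):
    """Choose the best- and worst-aligned candidates as (preferred, rejected)."""
    ranked = sorted(candidates, key=lambda c: c["alignment"], reverse=True)
    return ranked[0], ranked[-1]


# Hypothetical candidates generated for the same skeleton.
candidates = [
    {"name": "mesh_a", "alignment": 0.91, "logp": -1.0, "ref_logp": -2.0},
    {"name": "mesh_b", "alignment": 0.55, "logp": -2.5, "ref_logp": -2.2},
    {"name": "mesh_c", "alignment": 0.32, "logp": -3.0, "ref_logp": -2.0},
]
winner, loser = pick_pair(candidates)
loss = dpo_loss(winner["logp"], loser["logp"], winner["ref_logp"], loser["ref_logp"])
```

The loss equals log 2 when the trained model matches the reference and falls as the trained model shifts probability toward the better-aligned mesh.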
Together, our framework enables a more intuitive workflow for creating ready-to-animate 3D content. To the best of our knowledge, our work is the first to generate rigged 3D meshes conditioned on user-drawn 2D strokes. Extensive experiments demonstrate that Stroke3D produces plausible skeletons and high-quality meshes.[120] From Lightweight CNNs to SpikeNets: Benchmarking Accuracy-Energy Tradeoffs with Pruned Spiking SqueezeNet
Radib Bin Kabir,Tawsif Tashwar Dipto,Mehedi Ahamed,Sabbir Ahmed,Md Hasanul Kabir
Main category: cs.CV
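TL;DR: This paper presents the first systematic benchmark of lightweight spiking neural networks (SNNs). Compact CNN architectures (ShuffleNet, SqueezeNet, MnasNet, and MixNet) are converted into SNNs with LIF neurons and trained under a unified setup. SNNs reach up to 15.7x higher energy efficiency, with SqueezeNet-SNN performing best. Structured pruning yields SNN-SqueezeNet-P, which improves CIFAR-10 accuracy by 6% and cuts parameters by 19%, and is only 1% less accurate than CNN-SqueezeNet while using 88.1% less energy.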
Details
Motivation: Prior work has focused on large-scale SNNs, while lightweight CNN-to-SNN pipelines for edge intelligence have not been designed or evaluated systematically. Method: Lightweight CNNs (ShuffleNet, SqueezeNet, MnasNet, and MixNet) are converted into SNNs based on Leaky-Integrate-and-Fire (LIF) neurons and trained with surrogate gradient descent under a unified setup. The models are evaluated on CIFAR-10, CIFAR-100, and TinyImageNet for accuracy, F1-score, parameter count, computational complexity, and energy consumption, and module-level structured pruning is then applied to SqueezeNet-SNN. Result: SNNs reach up to 15.7x higher energy efficiency than their CNN counterparts, and SqueezeNet-SNN performs best. The pruned model, SNN-SqueezeNet-P, improves CIFAR-10 accuracy by 6% and reduces parameters by 19%, and is only 1% less accurate than CNN-SqueezeNet while consuming 88.1% less energy. Conclusion: Lightweight SNNs are a viable option for energy-efficient, low-power AI at the edge, and structured pruning can substantially narrow the accuracy gap between SNNs and CNNs while greatly improving energy efficiency. Abstract: Spiking Neural Networks (SNNs) are increasingly studied as energy-efficient alternatives to Convolutional Neural Networks (CNNs), particularly for edge intelligence. However, prior work has largely emphasized large-scale models, leaving the design and evaluation of lightweight CNN-to-SNN pipelines underexplored. In this paper, we present the first systematic benchmark of lightweight SNNs obtained by converting compact CNN architectures into spiking networks, where activations are modeled with Leaky-Integrate-and-Fire (LIF) neurons and trained using surrogate gradient descent under a unified setup. We construct spiking variants of ShuffleNet, SqueezeNet, MnasNet, and MixNet, and evaluate them on CIFAR-10, CIFAR-100, and TinyImageNet, measuring accuracy, F1-score, parameter count, computational complexity, and energy consumption. Our results show that SNNs can achieve up to 15.7x higher energy efficiency than their CNN counterparts while retaining competitive accuracy. Among these, the SNN variant of SqueezeNet consistently outperforms other lightweight SNNs. To further optimize this model, we apply a structured pruning strategy that removes entire redundant modules, yielding a pruned architecture, SNN-SqueezeNet-P. This pruned model improves CIFAR-10 accuracy by 6% and reduces parameters by 19% compared to the original SNN-SqueezeNet. 
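For readers unfamiliar with LIF neurons, the sketch below shows a textbook discrete-time version and one common surrogate derivative. It is an illustration only: the decay factor, threshold, hard reset, and fast-sigmoid surrogate are assumptions chosen for clarity and are not the settings used in the paper.

```python
def lif_run(inputs, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over a sequence of input currents.

    The membrane potential decays by `beta` each step, accumulates the input,
    and emits a spike (1) with a hard reset to zero when it reaches `threshold`.
    """
    v = 0.0
    spikes = []
    for current in inputs:
        v = beta * v + current
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset (assumption; subtractive reset is also common)
        else:
            spikes.append(0)
    return spikes


def surrogate_grad(v, threshold=1.0, slope=10.0):
    """Fast-sigmoid surrogate for the derivative of the non-differentiable spike step."""
    return 1.0 / (1.0 + slope * abs(v - threshold)) ** 2


spikes = lif_run([0.6] * 6, beta=0.5)
firing_rate = sum(spikes) / len(spikes)
```

With a constant input of 0.6 and `beta=0.5`, the potential climbs 0.6, 0.9, 1.05 and then resets, so the neuron fires on every third step. Sparse firing of this kind is what the abstract credits for the energy savings.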
Crucially, it narrows the gap with CNN-SqueezeNet, achieving nearly the same accuracy (only 1% lower) but with an 88.1% reduction in energy consumption due to sparse spike-driven computations. Together, these findings establish lightweight SNNs as practical, low-power alternatives for edge deployment, highlighting a viable path toward deploying high-performance, low-power intelligence on the edge.[121] Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings
Laura Paul,Holger Rauhut,Martin Burger,Samira Kabri,Tim Roith
Main category: cs.CV
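TL;DR: This paper proposes a hybrid approach that models crack detection as an inverse problem and jointly optimizes a deep generative model and a Mumford-Shah-type variational functional to localize cracks in paintings at the pixel level.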
Details
Motivation: Automatically detecting craquelure in digitized paintings matters for assessing degradation and guiding restoration, but it is difficult because scenes are complex and cracks look similar to artistic features such as brush strokes or hair. Method: Crack detection is modeled as an inverse problem that decomposes an image into a crack-free painting and a crack component. A deep generative model serves as the prior for the painting, a Mumford-Shah-type variational functional together with a crack prior captures crack structure, and the two are optimized jointly. Result: The method produces a pixel-level map of crack locations in the painting. Conclusion: The hybrid approach is designed to separate genuine cracks from crack-like artistic features, supporting the conservation and restoration of paintings. Abstract: Recent advances in imaging technologies, deep learning and numerical performance have enabled non-invasive detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet remains challenging due to the possibly complex scenery and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as a powerful prior for the underlying artwork, while crack structures are captured using a Mumford--Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack localizations in the painting.[122] Toward Fine-Grained Facial Control in 3D Talking Head Generation
Shaoyang Xie,Xiaofeng Cong,Baosheng Yu,Zhipeng Gui,Jie Gui,Yuan Yan Tang,James Tin-Yau Kwok
Main category: cs.CV
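TL;DR: This paper proposes Fine-Grained 3D Gaussian Splatting (FG-3DGS), a framework that models the motion characteristics of different facial regions with a frequency-aware disentanglement strategy and adds a high-frequency-refined post-rendering alignment mechanism. It substantially improves lip-sync accuracy and the stability of facial dynamics, producing high-fidelity, temporally consistent talking head videos.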
Details
Motivation: Existing audio-driven talking head methods based on 3D Gaussian Splatting fall short on lip synchronization and facial jitter, which can trigger the uncanny valley effect, so fine-grained facial motion needs to be modeled precisely and stably. Method: FG-3DGS comprises (1) frequency-aware disentangled modeling, in which low-frequency regions (cheeks, nose, forehead) are modeled with a shared MLP and high-frequency regions (eyes, mouth) with a dedicated mask-guided network; (2) Gaussian deltas that deform the static Gaussians; and (3) a high-frequency-refined post-rendering alignment module, based on a model pretrained on large-scale audio-video pairs, that improves per-frame quality and lip-sync accuracy. Result: On widely used talking head datasets the method outperforms recent state-of-the-art approaches in visual fidelity, lip-sync accuracy, and temporal stability. Conclusion: Through motion-frequency-driven disentangled modeling and precise post-rendering alignment, FG-3DGS mitigates lip desynchronization and facial jitter and offers a new approach to real-time, high-fidelity digital human generation. Abstract: Audio-driven talking head generation is a core component of digital avatars, and 3D Gaussian Splatting has shown strong performance in real-time rendering of high-fidelity talking heads. However, achieving precise control over fine-grained facial movements remains a significant challenge, particularly due to lip-synchronization inaccuracies and facial jitter, both of which can contribute to the uncanny valley effect. To address these challenges, we propose Fine-Grained 3D Gaussian Splatting (FG-3DGS), a novel framework that enables temporally consistent and high-fidelity talking head generation. Our method introduces a frequency-aware disentanglement strategy to explicitly model facial regions based on their motion characteristics. Low-frequency regions, such as the cheeks, nose, and forehead, are jointly modeled using a standard MLP, while high-frequency regions, including the eyes and mouth, are captured separately using a dedicated network guided by facial area masks. The predicted motion dynamics, represented as Gaussian deltas, are applied to the static Gaussians to generate the final head frames, which are rendered via a rasterizer using frame-specific camera parameters. Additionally, a high-frequency-refined post-rendering alignment mechanism, learned from large-scale audio-video pairs by a pretrained model, is incorporated to enhance per-frame generation and achieve more accurate lip synchronization. 
Extensive experiments on widely used datasets for talking head generation demonstrate that our method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.[123] Robust Vision Systems for Connected and Autonomous Vehicles: Security Challenges and Attack Vectors
Sandeep Gupta,Roberto Passerone
Main category: cs.CV
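TL;DR: This paper studies the robustness of vision systems in Connected and Autonomous Vehicles (CAVs). It proposes a reference architecture for the CAV vision system (CAVVS), identifies its attack surfaces and attack vectors, and evaluates their security impact in terms of confidentiality, integrity, and availability (CIA).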
Details
Motivation: Safe and reliable Level-5 autonomous navigation depends on robust vision systems, yet the security threats they face have not been analyzed systematically. Method: The authors build a CAVVS reference architecture to identify attack surfaces, analyze the attack vectors targeting each surface, and evaluate the security impact along the three CIA dimensions. Result: The study identifies the key attack surfaces of CAV vision systems and the corresponding attack vectors, and shows how their effects on confidentiality, integrity, and availability differ. Conclusion: The vision system is a critical weak point in CAV security, and protections should be designed around the CIA triad; the study provides a theoretical basis and practical guidance for building robust security measures. Abstract: This article investigates the robustness of vision systems in Connected and Autonomous Vehicles (CAVs), which is critical for developing Level-5 autonomous driving capabilities. Safe and reliable CAV navigation undeniably depends on robust vision systems that enable accurate detection of objects, lane markings, and traffic signage. We analyze the key sensors and vision components essential for CAV navigation to derive a reference architecture for CAV vision system (CAVVS). This reference architecture provides a basis for identifying potential attack surfaces of CAVVS. Subsequently, we elaborate on identified attack vectors targeting each attack surface, rigorously evaluating their implications for confidentiality, integrity, and availability (CIA). Our study provides a comprehensive understanding of attack vector dynamics in vision systems, which is crucial for formulating robust security measures that can uphold the principles of the CIA triad.[124] Self-Supervised Learning as Discrete Communication
Kawtar Zaher,Ilyass Moummad,Olivier Buisson,Alexis Joly
Main category: cs.CV
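TL;DR: This paper proposes a new paradigm that models visual self-supervised learning as a discrete binary communication process between a teacher and a student. Through binary message prediction and coding-rate regularization, it learns structured, compact, and semantically reusable discrete representations.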
Details
Motivation: Existing SSL methods mainly learn continuous feature representations and offer no explicit control over how information is structured across representation dimensions; the authors introduce a discretization mechanism to improve semantic interpretability and structural control. Method: SSL is modeled as communication between a teacher and a student over a fixed-capacity binary channel. The student predicts multi-label binary messages produced by the teacher, trained with an element-wise binary cross-entropy loss plus coding-rate regularization, and the projection head is periodically reinitialized to improve robustness. Result: The method consistently outperforms continuous-alignment baselines on image classification, retrieval, dense prediction, and domain-shift tasks; the learned binary codes form a compact, informative discrete semantic language that is reusable across classes. Conclusion: The discrete communication view improves the structure and generalization of SSL representations and offers a new route to controllable, interpretable visual representation learning. Abstract: Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.[125] Code2World: A GUI World Model via Renderable Code Generation
Yuhao Zheng,Li'an Zhong,Yi Wang,Rui Dai,Kaikui Liu,Xiangxiang Chu,Linyuan Lv,Philip Torr,Kevin Qinghong Lin
Main category: cs.CV
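TL;DR: This paper proposes Code2World, a vision-language model that simulates the next state of a GUI by generating renderable code, addressing the trade-off in existing methods between visual fidelity and structural controllability. It is trained on the self-built AndroidCode dataset (over 80K high-quality screen-action pairs) with supervised fine-tuning followed by render-aware reinforcement learning, and it significantly outperforms existing models on UI state prediction and downstream navigation.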
Details
Motivation: Existing text-based and pixel-based GUI state prediction methods struggle to combine high visual fidelity with fine-grained structural controllability, which limits the perception and planning abilities of autonomous GUI agents. Method: Code2World involves (1) building the AndroidCode dataset by translating GUI trajectories into HTML and refining code quality with a visual-feedback mechanism, and (2) a two-stage training strategy: supervised fine-tuning (SFT) to adapt to the code format, followed by render-aware reinforcement learning that uses the visual semantic fidelity and action consistency of the rendered result as the reward signal. Result: Code2World-8B achieves the best next-UI-state prediction among compared models, rivaling GPT-5 and Gemini-3-Pro-Image, and raises the AndroidWorld navigation success rate of Gemini-2.5-Flash by +9.5%. Conclusion: Generating renderable code is an effective paradigm for high-fidelity, controllable GUI state modeling; Code2World shows the value of code as an intermediate representation for GUI agents and gives downstream tasks a flexible, extensible boost. Abstract: Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs to code prediction, we first perform SFT as a cold start for format layout following, then further apply Render-Aware Reinforcement Learning, which uses the rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP-ML/Code2World.[126] Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets
Abhipsa Basu,Yugam Bahl,Kirti Bhagat,Preethi Seshadri,R. Venkatesh Babu,Danish Pruthi
Main category: cs.CV
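TL;DR: This paper analyzes the geographic representativeness of several large-scale multimodal datasets through geographic annotation. It finds a strong skew toward English-speaking countries (especially the United States, the United Kingdom, and Canada), with very few samples from African and South American countries. It also shows that the data distribution correlates strongly with national GDP, that high coverage does not imply high visual or semantic diversity, and that the geographic coverage of generative model outputs is far narrower than that of real-world images.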
Details
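Motivation: Text-to-image models often fail to generate geographically representative images, which raises questions about the geographic origin and representativeness of their training data. Method: LLMs extract location information from English and non-English (4 languages) image captions, and image-caption pairs are mapped to countries. The study analyzes the geographic distribution of 20 common entities in Re-LAION, DataComp1B, and Conceptual Captions, computes the correlation between national GDP and data share, examines how geographic representation relates to visual and semantic diversity, and tests the geographic coverage of images generated by Stable Diffusion v1.3 trained on Re-LAION. Result: The United States, the United Kingdom, and Canada account for 48.0% of samples, while African and South American countries account for only 3.8% and 1.8%. Geographic distribution correlates strongly with GDP (ρ = 0.82). Non-English subsets still skew toward countries where the language is predominantly spoken. High geographic representation does not imply high visual or semantic diversity, and Stable Diffusion generations cover far less than real images. Conclusion: Mainstream multimodal training data is geographically biased, and the bias carries over to generative models, affecting their global applicability and fairness; more geographically representative datasets and evaluation methods are needed. Abstract: Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across 20 common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for 48.0% of samples, while South American and African countries are severely under-represented with only 1.8% and 3.8% of images, respectively. We observe a strong correlation between a country's GDP and its representation in the data (ρ = 0.82). Examining non-English subsets for 4 languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. 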
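The profiling arithmetic can be sketched in a few lines: count the country assigned to each caption, turn counts into shares, and correlate shares with GDP. The sketch assumes ρ denotes Spearman's rank correlation, which the abstract does not state, and every country code and number below is an invented placeholder rather than data from the paper.

```python
from collections import Counter


def country_shares(labels):
    """Fraction of samples mapped to each country."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}


def ranks(values):
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Invented placeholders: one country label per caption, and GDP in arbitrary units.
labels = ["A"] * 8 + ["B"] * 4 + ["C"] * 5 + ["D"] * 2 + ["E"] * 1
gdp = {"A": 100, "B": 50, "C": 30, "D": 10, "E": 5}
shares = country_shares(labels)
countries = sorted(gdp)
rho = spearman([gdp[c] for c in countries], [shares[c] for c in countries])
```

Here the share ordering matches the GDP ordering except for one adjacent swap (B and C), which gives a rank correlation of 0.9 on the toy data.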
Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.[127] SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing
Tong Zhang,Honglin Lin,Zhou Liu,Chong Chen,Wentao Zhang
Main category: cs.CV
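TL;DR: This paper introduces SciFlow-Bench, a structure-first benchmark for pixel-level scientific diagram generation. It inverse-parses generated images into structured graphs and compares them with ground-truth graphs, emphasizing structural recoverability over visual similarity.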
Details
Motivation: Text-to-image models produce scientific diagrams that look plausible but are often structurally wrong. Current benchmarks rely on image-centric or subjective metrics, or evaluate only intermediate symbolic representations, and so overlook structural correctness at the pixel level. Method: SciFlow-Bench is built from real scientific PDFs, pairing each source figure with a canonical ground-truth graph. Under a closed-loop, round-trip protocol, model-generated images are inverse-parsed into structured graphs by a hierarchical multi-agent system (covering planning, perception, and structural reasoning) and compared with the ground truth. Result: Experiments show that preserving structural correctness remains a fundamental challenge, especially for topologically complex diagrams, which supports the need for structure-aware evaluation. Conclusion: Pixel-level scientific diagram generation should be evaluated primarily by structural recoverability, and SciFlow-Bench provides an end-to-end, structure-driven, black-box evaluation framework for that purpose. Abstract: Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.[128] CompSplat: Compression-aware 3D Gaussian Splatting for Real-world Video
Hojun Song,Heejung Choi,Aro Kim,Chae-yeong Song,Gahyeon Kim,Soo Ye Kim,Jaehyup Lee,Sang-hyo Park
Main category: cs.CV
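TL;DR: This paper proposes CompSplat, a compression-aware training framework that improves novel view synthesis from real-world videos under heavy compression. Through frame-wise compression modeling, compression-aware weighting, and adaptive pruning, it substantially improves geometric consistency and rendering quality.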
Details
Motivation: Real-world videos involve long sequences, irregular camera trajectories, unknown poses, and compression artifacts, which cause pose drift, feature misalignment, and geometric distortion during reconstruction; existing methods do not adequately handle the variety of video compression patterns. Method: CompSplat explicitly models frame-wise compression characteristics and introduces compression-aware frame weighting and an adaptive pruning strategy to reduce inter-frame inconsistency and accumulated geometric error. Result: On challenging benchmarks including Tanks and Temples, Free, and Hike, CompSplat significantly outperforms most recent NVS methods under severe compression and achieves state-of-the-art rendering quality and pose accuracy. Conclusion: CompSplat improves the robustness and geometric consistency of novel view synthesis from compressed video and offers a practical, scalable compression-aware solution for real-world NVS. Abstract: High-quality novel view synthesis (NVS) from real-world videos is crucial for applications such as cultural heritage preservation, digital twins, and immersive media. However, real-world videos typically contain long sequences with irregular camera trajectories and unknown poses, leading to pose drift, feature misalignment, and geometric distortion during reconstruction. Moreover, lossy compression amplifies these issues by introducing inconsistencies that gradually degrade geometry and rendering quality. While recent studies have addressed either long-sequence NVS or unposed reconstruction, compression-aware approaches still focus on specific artifacts or limited scenarios, leaving diverse compression patterns in long videos insufficiently explored. In this paper, we propose CompSplat, a compression-aware training framework that explicitly models frame-wise compression characteristics to mitigate inter-frame inconsistency and accumulated geometric errors. CompSplat incorporates compression-aware frame weighting and an adaptive pruning strategy to enhance robustness and geometric consistency, particularly under heavy compression. Extensive experiments on challenging benchmarks, including Tanks and Temples, Free, and Hike, demonstrate that CompSplat achieves state-of-the-art rendering quality and pose accuracy, significantly surpassing most recent state-of-the-art NVS approaches under severe compression conditions.[129] SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding
Zhaoxu Li,Chenqi Kong,Peijun Bao,Song Xia,Yi Tu,Yi Yu,Xinghao Jiang,Xudong Jiang
Main category: cs.CV
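TL;DR: This paper proposes Stability-Aware Knowledge-Enhanced Decoding (SAKED), a training-free decoding method that suppresses hallucinations in large vision-language models (LVLMs) by quantifying the stability of knowledge at each model layer. It achieves state-of-the-art performance across multiple models, tasks, and benchmarks.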
Details
Motivation: Hallucinations in large vision-language models (LVLMs) pose serious security and reliability risks. Inspired by the observation that humans make more errors when uncertain, the authors investigate how instability in a model's internal knowledge leads to hallucination. Method: Empirical analysis from three perspectives (attention heads, model layers, and decoding tokens) reveals three hallucination patterns. Building on these, SAKED introduces a layer-wise Knowledge Stability Score (KSS) and contrasts the most stability-aware and stability-agnostic layers to suppress decoding noise and dynamically draw on the most reliable knowledge; the method is training-free and architecture-agnostic. Result: SAKED achieves state-of-the-art hallucination mitigation across multiple LVLMs, tasks, and benchmarks. Conclusion: Instability in internal knowledge is a key cause of LVLM hallucination, and by explicitly modeling and exploiting knowledge stability, SAKED provides a general, efficient, plug-and-play mitigation. Abstract: Hallucinations in Large Vision-Language Models (LVLMs) pose significant security and reliability risks in real-world applications. Inspired by the observation that humans are more error-prone when uncertain or hesitant, we investigate how instability in a model's internal knowledge contributes to LVLM hallucinations. We conduct extensive empirical analyses from three perspectives, namely attention heads, model layers, and decoding tokens, and identify three key hallucination patterns: (i) visual activation drift across attention heads, (ii) pronounced knowledge fluctuations across layers, and (iii) visual focus distraction between neighboring output tokens. Building on these findings, we propose Stability-Aware Knowledge-Enhanced Decoding (SAKED), which introduces a layer-wise Knowledge Stability Score (KSS) to quantify knowledge stability throughout the model. By contrasting the most stability-aware and stability-agnostic layers, SAKED suppresses decoding noise and dynamically leverages the most reliable internal knowledge for faithful token generation. Moreover, SAKED is training-free and can be seamlessly integrated into different architectures. Extensive experiments demonstrate that SAKED achieves state-of-the-art performance for hallucination mitigation on various models, tasks, and benchmarks.[130] ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge
Yijie Lin,Guofeng Ding,Haochen Zhou,Haobin Li,Mouxing Yang,Xi Peng
Main category: cs.CV
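TL;DR: This paper introduces ARK, a benchmark for evaluating multimodal retrieval on professional knowledge and complex reasoning. It covers five knowledge domains, 17 subtypes, and six reasoning skills, and tests 23 retrieval models, revealing a pronounced gap between knowledge-intensive and reasoning-intensive tasks.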
Details
Motivation: Existing benchmarks focus mainly on semantic matching over daily-life images and offer little diagnostic power for professional domain knowledge and complex reasoning. Method: Builds the ARK benchmark along two axes, knowledge domains (5 domains, 17 subtypes) and reasoning skills (6 categories); supports unimodal and multimodal queries and candidates; adds targeted hard negatives to prevent shortcut matching; evaluates 23 representative retrievers and tests enhancements such as re-ranking and query rewriting. Result: A clear performance gap appears between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning as persistent bottlenecks; re-ranking and query rewriting yield consistent gains, but substantial headroom remains. Conclusion: ARK provides a more comprehensive and challenging evaluation framework for multimodal retrieval, highlights current models' weaknesses in professional knowledge understanding and multi-step reasoning, and points to directions for improvement. Abstract: Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi-step reasoning. We evaluate 23 representative text-based and multimodal retrievers on ARK and observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re-ranking and rewriting yield consistent improvements, but substantial headroom remains.
[131] Kelix Technique Report
Boyang Ding,Chenglong Chu,Dunju Zang,Han Li,Jiangxia Cao,Kun Gai,Muhao Wei,Ruiming Tang,Shiyao Wang,Siyang Mao,Xinchen Luo,Yahui Liu,Zhixin Ling,Zhuoran Yang,Ziming Li,Chengru Song,Guorui Zhou,Guowang Zhang,Hao Peng,Hao Wang,Jiaxin Deng,Jin Ouyang,Jinghao Zhang,Lejian Ren,Qianqian Wang,Qigen Hu,Tao Wang,Xingmei Wang,Yiping Yang,Zixing Zhang,Ziqi Wang
Main category: cs.CV
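TL;DR: This paper presents Kelix, a model that addresses the weaker understanding of discrete visual tokenization relative to continuous features in multimodal large models; by increasing the information capacity of discrete visual tokens, it aligns the understanding ability of discrete and continuous representations.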
Details
Motivation: Most vision-language models use a hybrid interface of discrete text tokens and continuous visual features, which biases them toward text understanding and prevents them from fully exploiting self-supervised learning on non-text data. Discrete visual tokenization has been explored, but limited code capacity leaves its understanding markedly weaker than that of continuous-feature models. Method: Proposes Kelix, a fully discrete autoregressive unified model that strengthens the information capacity of discrete visual tokens to close the understanding gap between discrete and continuous visual representations. Result: Kelix narrows the understanding gap between discrete and continuous visual representations, advancing truly unified, fully discrete autoregressive multimodal modeling. Conclusion: Fully discrete multimodal modeling is feasible and necessary; increasing the capacity of discrete visual tokens is the key to matching continuous-feature performance. Abstract: Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
[132] Reason-IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection
Peng Chen,Chao Huang,Yunkang Cao,Chengliang Liu,Wenqiang Wang,Mingbo Yang,Li Shen,Wenqi Ren,Xiaochun Cao
Main category: cs.CV
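TL;DR: This paper proposes Reason-IAD, a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection that improves detection accuracy and interpretability through a retrieval-augmented knowledge module and an entropy-driven latent reasoning mechanism.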
Details
Motivation: Multimodal large language models pretrained on general-domain data struggle to capture category-specific anomaly patterns, which limits the accuracy and interpretability of industrial anomaly detection. Method: Proposes the Reason-IAD framework, comprising: (1) a retrieval-augmented knowledge module that adds category-specific textual descriptions to the input; (2) an entropy-driven latent reasoning mechanism that iteratively explores a compact latent space using optimizable latent "think tokens"; and (3) a dynamic visual injection strategy that selectively injects the most informative image patches into the latent sequence. Result: Across multiple experiments, Reason-IAD consistently outperforms state-of-the-art methods. Conclusion: By combining domain knowledge, dynamic latent reasoning, and a focus on critical regions, Reason-IAD improves both the accuracy and interpretability of industrial anomaly detection. Abstract: Industrial anomaly detection demands precise reasoning over fine-grained defect patterns. However, existing multimodal large language models (MLLMs), pretrained on general-domain data, often struggle to capture category-specific anomalies, thereby limiting both detection accuracy and interpretability. To address these limitations, we propose Reason-IAD, a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection. Reason-IAD comprises two core components. First, a retrieval-augmented knowledge module incorporates category-specific textual descriptions into the model input, enabling context-aware reasoning over domain-specific defects. Second, an entropy-driven latent reasoning mechanism conducts iterative exploration within a compact latent space using optimizable latent think tokens, guided by an entropy-based reward that encourages confident and stable predictions. Furthermore, a dynamic visual injection strategy selectively incorporates the most informative image patches into the latent sequence, directing the reasoning process toward regions critical for anomaly detection. Extensive experimental results demonstrate that Reason-IAD consistently outperforms state-of-the-art methods. The code will be publicly available at https://github.com/chenpeng052/Reason-IAD.
[133] Free-GVC: Towards Training-Free Extreme Generative Video Compression with Temporal Coherence
Xiaoyue Ling,Chuqin Zhou,Chunyi Li,Yunuo Chen,Yuan Tian,Guo Lu,Wenjun Zhang
Main category: cs.CV
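TL;DR: This paper proposes Free-GVC, a training-free generative video compression framework that compresses latent trajectories under the guidance of a diffusion prior, markedly improving visual quality and temporal consistency at ultra-low bitrates.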
Details
Motivation: Existing generative video compression methods under-exploit temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. Method: Reformulates video coding as latent trajectory compression guided by a video diffusion prior; operates at the GOP level, with an adaptive quality control module that dynamically selects the optimal diffusion step and an inter-GOP alignment module that performs frame overlap and latent fusion between adjacent GOPs. Result: Compared with the latest neural codec DCVC-RT, it achieves an average 93.29% BD-Rate reduction in DISTS; a user study shows superior perceptual quality and temporal coherence at ultra-low bitrates. Conclusion: Through training-free diffusion-guided compression and cross-GOP alignment, Free-GVC addresses the temporal inconsistency of generative video compression at ultra-low bitrates and offers a new paradigm for the field. Abstract: Building on recent advances in video generation, generative video compression has emerged as a new paradigm for achieving visually pleasing reconstructions. However, existing methods exhibit limited exploitation of temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. In this paper, we propose Free-GVC, a training-free generative video compression framework that reformulates video coding as latent trajectory compression guided by a video diffusion prior. Our method operates at the group-of-pictures (GOP) level, encoding video segments into a compact latent space and progressively compressing them along the diffusion trajectory. To ensure perceptually consistent reconstruction across GOPs, we introduce an Adaptive Quality Control module that dynamically constructs an online rate-perception surrogate model to predict the optimal diffusion step for each GOP. In addition, an Inter-GOP Alignment module establishes frame overlap and performs latent fusion between adjacent groups, thereby mitigating flicker and enhancing temporal coherence. Experiments show that Free-GVC achieves an average of 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT, and a user study further confirms its superior perceptual quality and temporal coherence at ultra-low bitrates.
[134] BabyMamba-HAR: Lightweight Selective State Space Models for Efficient Human Activity Recognition on Resource Constrained Devices
Mridankan Mandal
Main category: cs.CV
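TL;DR: This paper presents BabyMamba-HAR, a framework of two lightweight Mamba-inspired architectures (CI-BabyMamba-HAR and Crossover-BiDir-BabyMamba-HAR) designed for human activity recognition (HAR) on resource-constrained wearable and mobile devices, substantially reducing parameter count and computation while maintaining high accuracy.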
Details
Motivation: In memory- and compute-constrained TinyML settings, existing HAR models struggle to balance efficiency with accuracy across heterogeneous sensor configurations; selective state space models (SSMs) are promising, but their architectural design space in this regime is largely unexplored. Method: Proposes two lightweight Mamba-inspired architectures: (1) CI-BabyMamba-HAR, which uses a channel-independent, weight-shared stem to limit noise propagation; and (2) Crossover-BiDir-BabyMamba-HAR, which uses an early-fusion stem whose computational complexity is independent of channel count. Both use weight-tied bidirectional scanning and lightweight temporal attention pooling. Result: Crossover-BiDir-BabyMamba-HAR reaches an average macro F1 of 86.52% across eight benchmarks with only about 27K parameters and 2.21M MACs, matching TinyHAR in accuracy while using 11x fewer MACs on high-channel datasets; ablations show that bidirectional scanning improves F1 by up to 8.42% and that gated temporal attention improves F1 by up to 8.94% over mean pooling. Conclusion: The work shows that SSMs can serve as efficient TinyML backbones for HAR and distills practical SSM design principles for resource-constrained HAR. Abstract: Human activity recognition (HAR) on wearable and mobile devices is constrained by memory footprint and computational budget, yet competitive accuracy must be maintained across heterogeneous sensor configurations. Selective state space models (SSMs) offer linear time sequence processing with input dependent gating, presenting a compelling alternative to quadratic complexity attention mechanisms. However, the design space for deploying SSMs in the TinyML regime remains largely unexplored. In this paper, BabyMamba-HAR is introduced, a framework comprising two novel lightweight Mamba inspired architectures optimized for resource constrained HAR: (1) CI-BabyMamba-HAR, using a channel independent stem that processes each sensor channel through shared weight, but instance independent transformations to prevent cross channel noise propagation, and (2) Crossover-BiDir-BabyMamba-HAR, using an early fusion stem that achieves channel count independent computational complexity. Both variants incorporate weight tied bidirectional scanning and lightweight temporal attention pooling. Through evaluation across eight diverse benchmarks, it is demonstrated that Crossover-BiDir-BabyMamba-HAR achieves 86.52% average macro F1-score with approximately 27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high channel datasets.
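For readers unfamiliar with the macro F1-score reported above, the following is a minimal sketch of how it is commonly computed: F1 is calculated per activity class and then averaged without weighting by class frequency, so rare activities count as much as common ones. The function name and the activity labels are illustrative assumptions, not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative labels only (assumed, not from any benchmark in the paper).
y_true = ["walk", "walk", "walk", "sit", "sit", "run"]
y_pred = ["walk", "walk", "sit", "sit", "sit", "walk"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case four of six predictions are correct, yet the macro F1 is pulled down by the "run" class, which is never predicted correctly; this is why macro averaging is a common choice for imbalanced activity datasets.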
Systematic ablation studies reveal that bidirectional scanning contributes up to 8.42% F1-score improvement, and gated temporal attention provides up to 8.94% F1-score gain over mean pooling. These findings establish practical design principles for deploying selective state space models as efficient TinyML backbones for HAR.
[135] MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation
Jiaxu Wang,Yicheng Jiang,Tianlun He,Jingkai Sun,Qiang Zhang,Junhao He,Jiahang Cao,Zesen Gan,Mingyuan Sun,Qiming Shao,Xiangyu Yue
Main category: cs.CV
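TL;DR: This paper proposes an embodied 4D world model that supports arbitrary-view RGBD generation and temporally consistent 4D scene modeling from a single-view RGBD input, and combines test-time action optimization with residual inverse dynamics to close the loop from imagination to action end to end.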
Details
Motivation: Existing world-model-based robotic manipulation methods are limited to purely image-based prediction or reasoning over partial 3D geometry, and cannot model complete, geometrically consistent 4D scene dynamics. Method: Proposes an embodied 4D world model with cross-view and cross-modality feature fusion that enforces RGB-depth consistency and multi-view geometric alignment; introduces test-time action optimization (backpropagating through the generative model to infer a trajectory-level latent) and a residual inverse dynamics model that converts the latent into executable actions. Result: Experiments on three datasets show superior performance on 4D scene generation and downstream robotic manipulation, and ablations confirm the value of the key design choices. Conclusion: The proposed 4D world model and its action generation strategy bridge the gap between imagination and action, giving embodied agents more geometrically consistent and generalizable temporal scene understanding and decision-making. Abstract: World-model-based imagine-then-act becomes a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
[136] AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization
Shaoqiu Zhang,Zizhong Ding,Kaicheng Yang,Junyi Wu,Xianglong Yan,Xi Li,Bingnan Duan,Jianping Fang,Yulun Zhang
Main category: cs.CV
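TL;DR: This paper proposes AdaTSQ, a post-training quantization framework designed for Diffusion Transformers (DiTs) that uses timestep-dynamic bit-width allocation and Fisher-guided temporal calibration to substantially improve efficiency while preserving generation quality.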
Details
Motivation: Existing post-training quantization methods ignore the temporal dynamics of the diffusion process and therefore perform poorly on DiTs, whose heavy compute and memory costs make edge deployment difficult. Method: Two core components: (1) a Pareto-aware timestep-dynamic bit-width allocation strategy that casts quantization policy search as a constrained pathfinding problem and uses beam search guided by end-to-end reconstruction error to jointly optimize over layers and timesteps; and (2) a Fisher-guided temporal calibration mechanism that uses temporal Fisher information to select calibration data from sensitive timesteps, combined with Hessian-based weight optimization. Result: On four advanced DiT models (Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1), AdaTSQ significantly outperforms state-of-the-art methods such as SVDQuant and ViDiT-Q. Conclusion: By modeling the temporal sensitivity of DiTs, AdaTSQ advances the efficiency-quality Pareto frontier after quantization and offers a new paradigm for efficient edge deployment of diffusion generative models. Abstract: Diffusion Transformers (DiTs) have emerged as the state-of-the-art backbone for high-fidelity image and video generation. However, their massive computational cost and memory footprint hinder deployment on edge devices. While post-training quantization (PTQ) has proven effective for large language models (LLMs), directly applying existing methods to DiTs yields suboptimal results due to the neglect of the unique temporal dynamics inherent in diffusion processes. In this paper, we propose AdaTSQ, a novel PTQ framework that pushes the Pareto frontier of efficiency and quality by exploiting the temporal sensitivity of DiTs. First, we propose a Pareto-aware timestep-dynamic bit-width allocation strategy. We model the quantization policy search as a constrained pathfinding problem. We utilize a beam search algorithm guided by end-to-end reconstruction error to dynamically assign layer-wise bit-widths across different timesteps. Second, we propose a Fisher-guided temporal calibration mechanism. It leverages temporal Fisher information to prioritize calibration data from highly sensitive timesteps, seamlessly integrating with Hessian-based weight optimization. Extensive experiments on four advanced DiTs (e.g., Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1) demonstrate that AdaTSQ significantly outperforms state-of-the-art methods like SVDQuant and ViDiT-Q. Our code will be released at https://github.com/Qiushao-E/AdaTSQ.
[137] SARS: A Novel Face and Body Shape and Appearance Aware 3D Reconstruction System extends Morphable Models
Gulraiz Khan,Kenneth Y. Wertheim,Kevin Pimbblet,Waqas Ahmed
Main category: cs.CV
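TL;DR: This paper proposes SARS, a shape- and appearance-aware 3D reconstruction system that extracts face and body information from a single image to reconstruct a complete 3D human body with semantic features (such as age, gender, and facial landmarks), addressing the neglect of high-level facial semantic features in earlier 3DMM methods.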
Details
Motivation: Prior work on 3D human reconstruction focused only on global face structure or geometry and ignored semantic features such as age, gender, and facial landmarks, limiting model expressiveness and practicality. Method: Proposes SARS, a modular pipeline that combines shape- and appearance-aware mechanisms to jointly extract face and body information from a single 2D image and build a semantically aware full-body 3D Morphable Model. Result: Achieves full-body 3D reconstruction with controllable modeling of high-level facial semantic features (such as wrinkles, boundaries, and curves), improving the expressiveness and generalization of 3DMMs in real-world scenes. Conclusion: SARS extends the capabilities of traditional 3DMMs by bringing semantic understanding into the 3D reconstruction pipeline, offering a new paradigm for application-oriented, fine-grained face and body modeling. Abstract: 3D Morphable Models (3DMMs) are a type of morphable model that takes 2D images as inputs and recreates the structure and physical appearance of 3D objects, especially human faces and bodies. 3DMM combines identity and expression blendshapes with a basic face mesh to create a detailed 3D model. The variability in the 3D Morphable models can be controlled by tuning diverse parameters. They are high-level image descriptors, such as shape, texture, illumination, and camera parameters. Previous research in 3D human reconstruction concentrated solely on global face structure or geometry, ignoring face semantic features such as age, gender, and facial landmarks characterizing facial boundaries, curves, dips, and wrinkles. In order to accommodate changes in these high-level facial characteristics, this work introduces a shape and appearance-aware 3D reconstruction system (named SARS by us), a modular pipeline that extracts body and face information from a single image to properly rebuild the 3D model of the human full body.
[138] A benchmark for video-based laparoscopic skill analysis and assessment
Isabel Funke,Sebastian Bodenstedt,Felix von Bechtolsheim,Florian Oehme,Michael Maruschke,Stefanie Herrlich,Jürgen Weitz,Marius Distler,Sören Torge Mees,Stefanie Speidel
Main category: cs.CV
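TL;DR: This paper introduces the LASANA dataset, comprising 1270 stereo video recordings of laparoscopic training tasks, each annotated with a structured skill rating and task-specific error labels, to advance research on video-based surgical skill assessment and error recognition.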
Details
Motivation: Deep learning models for automatic laparoscopic skill assessment are held back by the small size of annotated datasets; a large, high-quality, richly annotated video dataset is needed. Method: Builds the LASANA dataset: 1270 stereo videos covering four basic laparoscopic training tasks, with structured skill ratings from three independent raters and binary error labels; provides predefined data splits and baseline results from a deep learning model. Result: Releases LASANA, the first large-scale, multi-annotation stereo video dataset for basic laparoscopic skill assessment, with standard splits and baseline performance for each task. Conclusion: LASANA fills the gap in high-quality annotated data for laparoscopic video analysis and provides a reliable benchmark and open platform for future research on video-based skill assessment and error recognition. Abstract: Laparoscopic surgery is a complex surgical technique that requires extensive training. Recent advances in deep learning have shown promise in supporting this training by enabling automatic video-based assessment of surgical skills. However, the development and evaluation of deep learning models is currently hindered by the limited size of available annotated datasets. To address this gap, we introduce the Laparoscopic Skill Analysis and Assessment (LASANA) dataset, comprising 1270 stereo video recordings of four basic laparoscopic training tasks. Each recording is annotated with a structured skill rating, aggregated from three independent raters, as well as binary labels indicating the presence or absence of task-specific errors. The majority of recordings originate from a laparoscopic training course, thereby reflecting a natural variation in the skill of participants. To facilitate benchmarking of both existing and novel approaches for video-based skill assessment and error recognition, we provide predefined data splits for each task. Furthermore, we present baseline results from a deep learning model as a reference point for future comparisons.
[139] Monocular Normal Estimation via Shading Sequence Estimation
Zongrui Li,Xinhua Ma,Minghui Hu,Yunqing Zhao,Yingchen Yu,Qian Zheng,Chang Liu,Xudong Jiang,Song Bai
Main category: cs.CV
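TL;DR: This paper proposes RoSE, which recasts monocular normal estimation as shading sequence estimation: an image-to-video generative model predicts the shading sequence, which is converted to a normal map by least squares, markedly improving 3D geometric alignment.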
Details
Motivation: Existing monocular normal estimation methods suffer from 3D misalignment because geometric differences appear in normal maps only as subtle color variations, making it hard for models to distinguish and reconstruct different geometry. Method: Proposes a new paradigm that reformulates normal estimation as shading sequence estimation; RoSE uses an image-to-video generative model to predict shading sequences under multiple lights and converts them to normal maps via ordinary least squares; it is trained on the synthetic MultiShade dataset for robustness. Result: RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-level monocular normal estimation and markedly improves 3D geometric alignment. Conclusion: Converting normal estimation into the more geometry-sensitive task of shading sequence estimation is effective; RoSE demonstrates the advantages and practicality of this paradigm. Abstract: Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lights. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have a correct appearance, the reconstructed surfaces often fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, as the differences in underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to various geometric information. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and light conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.
[140] GeoFormer: A Swin Transformer-Based Framework for Scene-Level Building Height and Footprint Estimation from Sentinel Imagery
Han Jinzhen,JinByeong Lee,JiSung Kim,MinKyung Cho,DaHee Kim,HongSik Yun
Main category: cs.CV
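TL;DR: This paper proposes GeoFormer, an open-source Swin Transformer-based framework that jointly estimates building height and footprint using only Sentinel-1/2 imagery and open DEM data, achieving high accuracy across 54 cities with cross-continent generalization.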
Details
Motivation: Accurate 3D urban data are essential for climate modelling, disaster risk assessment, and urban planning, but are currently limited by reliance on proprietary sensors and poor cross-city generalisation. Method: Proposes GeoFormer, an open-source Swin Transformer-based framework that uses a geo-blocked splitting strategy to keep training and test sets spatially independent and fuses Sentinel-1/2 imagery with open DEM data to jointly estimate building height (BH) and footprint (BF) on a 100 m grid. Result: Evaluated on 54 diverse cities, it reaches a BH RMSE of 3.19 m and a BF RMSE of 0.05, improving on the strongest CNN baseline by 7.5% and 15.3% respectively; in cross-continent transfer the BH RMSE stays under 3.5 m; ablations show that DEM is indispensable for height estimation and that optical reflectance dominates over SAR, though multi-source fusion performs best. Conclusion: GeoFormer is an efficient, well-generalising, and fully open-source approach to 3D urban modelling that improves building height and footprint estimation and makes global-scale urban remote sensing analysis more reproducible and accessible. Abstract: Accurate three-dimensional urban data are critical for climate modelling, disaster risk assessment, and urban planning, yet remain scarce due to reliance on proprietary sensors or poor cross-city generalisation. We propose GeoFormer, an open-source Swin Transformer framework that jointly estimates building height (BH) and footprint (BF) on a 100 m grid using only Sentinel-1/2 imagery and open DEM data. A geo-blocked splitting strategy ensures strict spatial independence between training and test sets. Evaluated over 54 diverse cities, GeoFormer achieves a BH RMSE of 3.19 m and a BF RMSE of 0.05, improving 7.5% and 15.3% over the strongest CNN baseline, while maintaining under 3.5 m BH RMSE in cross-continent transfer. Ablation studies confirm that DEM is indispensable for height estimation and that optical reflectance dominates over SAR, though multi-source fusion yields the best overall accuracy. All code, weights, and global products are publicly released.
[141] Unbalanced optimal transport for robust longitudinal lesion evolution with registration-aware and appearance-guided priors
Melika Qahqaie,Dominik Neumann,Tobias Heimann,Andreas Maier,Veronika A. Zimmer
Main category: cs.CV
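TL;DR: This paper proposes a registration-aware matching method based on unbalanced optimal transport (UOT) to address the difficulty of establishing lesion correspondence when analyzing lesion evolution in longitudinal CT scans of cancer patients; by combining geometry, registration trust, and appearance consistency, it accurately identifies new, disappearing, merging, and splitting lesions.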
Details
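Motivation: Evaluating lesion evolution in longitudinal CT scans of cancer patients is essential for assessing treatment response, but establishing reliable lesion correspondence across time points remains difficult, especially when lesions appear, disappear, merge, or split; conventional bipartite matchers based on geometric proximity perform poorly in these cases. Method: Proposes a registration-aware matcher based on unbalanced optimal transport (UOT) whose transport cost combines three terms: (i) size-normalized geometric distance, (ii) local registration trust derived from the Jacobian determinant of the deformation field, and (iii) optional patch-level appearance consistency. The transport plan is then sparsified by relative pruning, directly yielding one-to-one matches as well as new, disappearing, merging, and splitting lesion states. Result: On longitudinal CT data, the method clearly outperforms distance-only baselines in edge-detection precision and recall, lesion-state recall, and lesion-graph connected-component F1. Conclusion: The proposed UOT matching framework models lesion dynamics robustly without retraining or hand-set heuristic rules, providing a more reliable and interpretable lesion tracking tool for clinical longitudinal analysis. Abstract: Evaluating lesion evolution in longitudinal CT scans of cancer patients is essential for assessing treatment response, yet establishing reliable lesion correspondence across time remains challenging. Standard bipartite matchers, which rely on geometric proximity, struggle when lesions appear, disappear, merge, or split. We propose a registration-aware matcher based on unbalanced optimal transport (UOT) that accommodates unequal lesion mass and adapts priors to patient-level tumor-load changes. Our transport cost blends (i) size-normalized geometry, (ii) local registration trust from the deformation-field Jacobian, and (iii) optional patch-level appearance consistency. The resulting transport plan is sparsified by relative pruning, yielding one-to-one matches as well as new, disappearing, merging, and splitting lesions without retraining or heuristic rules. On longitudinal CT data, our approach achieves consistently higher edge-detection precision and recall, improved lesion-state recall, and superior lesion-graph component F1 scores versus distance-only baselines.
[142] VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization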
Yikun Liu,Yuan Liu,Shangzhe Di,Haicheng Wang,Zhongyin Zhao,Le Tian,Xiao Zhou,Jie Zhou,Jiangchao Yao,Yanfeng Wang,Weidi Xie
Main category: cs.CV
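TL;DR: This paper examines whether the vision encoders of multimodal large language models (MLLMs) can serve as general-purpose vision backbones for classic vision tasks, finds deficiencies on dense prediction tasks, and proposes VersaViT, which improves versatility through collaborative multi-task post-training.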
Details
Motivation: To investigate whether MLLM vision encoders, which excel at vision-language understanding but underperform on dense prediction tasks, can serve as general-purpose vision backbones for classic vision tasks. Method: Proposes VersaViT, a vision transformer built on a collaborative multi-task post-training framework that optimizes the vision backbone through lightweight task heads with multi-granularity supervision. Result: Extensive experiments across downstream tasks show that VersaViT markedly improves the vision encoder on both language-mediated reasoning and pixel-level understanding, yielding a general-purpose vision backbone. Conclusion: With suitable post-training, MLLM vision encoders can be extended into general-purpose vision backbones; VersaViT demonstrates the effectiveness and practicality of this approach in multi-task settings. Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training. This framework facilitates the optimization of the vision backbone via lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.
[143] Bladder Vessel Segmentation using a Hybrid Attention-Convolution Framework
Franziska Krauß,Matthias Ege,Zoltan Lovasz,Albrecht Bartz-Schmidt,Igor Tsaur,Oliver Sawodny,Carina Veil
Main category: cs.CV
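TL;DR: This paper proposes a Hybrid Attention-Convolution (HAC) architecture for accurate vessel segmentation in bladder endoscopy images to support intraoperative navigation for bladder cancer. The method combines a Transformer that models global vessel topology with a CNN that refines thin-vessel detail, and improves robustness through structure-optimized ground truth and physics-aware self-supervised pretraining; on the BlaVeS dataset it clearly outperforms existing methods and, in particular, suppresses false positives caused by mucosal folds.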
Details
Motivation: The bladder lacks stable anatomical landmarks, while vessels visible under endoscopy can serve as a patient-specific "vascular fingerprint" for navigation; however, existing vessel segmentation methods struggle with sparse labels, bubble and lighting artifacts, continuous bladder deformation, and interference from mucosal folds in endoscopic data. Method: Proposes the Hybrid Attention-Convolution (HAC) architecture: a Transformer module learns a global vessel topology prior (trained on optimized ground truth that excludes short and terminal branches), and a CNN module learns a residual refinement map to recover thin vessels; physics-aware self-supervised pretraining with clinically grounded augmentations mitigates label scarcity. Result: On the BlaVeS endoscopic video dataset, it achieves 0.94 accuracy, 0.61 precision, and 0.66 clDice, outperforming existing medical segmentation models; crucially, it suppresses false positives from mucosal folds that change dynamically as the bladder fills and empties. Conclusion: HAC provides accurate, robust, and structurally stable vessel segmentation for bladder endoscopic navigation, meeting practical clinical needs. Abstract: Urinary bladder cancer surveillance requires tracking tumor sites across repeated interventions, yet the deformable and hollow bladder lacks stable landmarks for orientation. While blood vessels visible during endoscopy offer a patient-specific "vascular fingerprint" for navigation, automated segmentation is challenged by imperfect endoscopic data, including sparse labels, artifacts like bubbles or variable lighting, continuous deformation, and mucosal folds that mimic vessels. State-of-the-art vessel segmentation methods often fail to address these domain-specific complexities. We introduce a Hybrid Attention-Convolution (HAC) architecture that combines Transformers to capture global vessel topology prior with a CNN that learns a residual refinement map to precisely recover thin-vessel details. To prioritize structural connectivity, the Transformer is trained on optimized ground truth data that exclude short and terminal branches. Furthermore, to address data scarcity, we employ a physics-aware pretraining, that is a self-supervised strategy using clinically grounded augmentations on unlabeled data. Evaluated on the BlaVeS dataset, consisting of endoscopic video frames, our approach achieves high accuracy (0.94) and superior precision (0.61) and clDice (0.66) compared to state-of-the-art medical segmentation models. Crucially, our method successfully suppresses false positives from mucosal folds that dynamically appear and vanish as the bladder fills and empties during surgery.
Hence, HAC provides the reliable structural stability required for clinical navigation.
[144] Learning to Detect Baked Goods with Limited Supervision
Thomas H. Schmitt,Maximilian Bundscherer,Tobias Bocklet
Main category: cs.CV
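TL;DR: This paper proposes using weak supervision and pseudo-labeling to train an efficient object detection model (YOLOv11) that automatically identifies leftover baked goods in an industrial setting with scarce annotations (German bakeries), substantially improving detection performance under non-ideal deployment conditions.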
Details
Motivation: Products in German bakeries have a short shelf life, so leftovers must be monitored to optimize production; however, the wide variety of baked goods makes fully supervised training costly and hard to scale, existing open-vocabulary detectors (e.g., OWLv2, Grounding DINO) fall short on this task, and industrial computer vision generally faces specialized tasks with scarce annotated data. Method: Proposes two low-supervision training workflows: (1) weakly supervised training that combines OWLv2 and Grounding DINO localization with image-level labels; and (2) pseudo-labels generated on video frames with Segment Anything 2 to improve viewpoint robustness; a YOLOv11 model is trained with these workflows. Result: With image-level supervision alone, mAP reaches 0.91; fine-tuning with pseudo-labels improves performance by 19.3% under non-ideal deployment conditions; the combined approach surpasses the fully supervised baseline under those conditions while relying only on image-level supervision. Conclusion: For industrial vision tasks with scarce annotations, combining multiple weak supervision signals with high-quality pseudo-label propagation is a feasible and efficient alternative that balances accuracy, robustness, and deployability. Abstract: Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose automating this process using an object detection model to identify baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenges of deploying computer vision in industries, where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two training workflows to train an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed accuracy tradeoff. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91.
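As background for the mAP figure above, here is a minimal sketch of one common way to compute average precision per class and then average across classes. It assumes detections have already been matched to ground-truth boxes (each detection carries a true-positive flag) and uses all-point interpolation of the precision-recall curve; the abstract does not state which IoU threshold or interpolation the authors use, and the class names and scores below are invented for illustration.

```python
def average_precision(detections, num_ground_truth):
    """AP for one class from (confidence, is_true_positive) pairs."""
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Precision envelope: each point takes the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the stepwise precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class maps class name -> (detections, num_ground_truth)."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps)


# Hypothetical example with two classes (values are made up).
per_class = {
    "pretzel": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "roll": ([(0.95, True), (0.6, True)], 2),
}
map_score = mean_average_precision(per_class)
print(round(map_score, 4))
```

A full evaluation would also need the box-matching step (typically by IoU) that produces the true-positive flags; that step is left out here to keep the sketch short.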
Fine-tuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows trains a model that surpasses our fully-supervised baseline model under non-ideal deployment conditions, despite relying only on image-level supervision.
[145] Coupled Inference in Diffusion Models for Semantic Decomposition
Calvin Yeung,Ali Zakeri,Zhuowen Zou,Mohsen Imani
Main category: cs.CV
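TL;DR: This paper proposes a new framework for semantic decomposition based on coupled inference in diffusion models. It treats decomposition as an inverse problem and couples the diffusion processes through a reconstruction-driven guidance term, outperforming conventional resonator networks.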
Details
Motivation: Existing methods such as resonator networks have limitations on semantic decomposition tasks; recent findings of notable similarities between Hopfield networks and diffusion models motivate exploring diffusion models for decomposition. Method: Models semantic decomposition as an inverse problem and designs a reconstruction-driven guidance term that couples multiple diffusion processes so that the composition of the factor estimates approaches the original bound vector; also proposes a new iterative sampling scheme that improves performance. Result: Experiments show that the coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks; the authors also show that attention-based resonator networks are a special case of the framework. Conclusion: Diffusion models can perform semantic decomposition effectively through coupled inference, offering a new paradigm for disentangled representation learning that unifies and extends methods based on binding and resonance. Abstract: Many visual scenes can be described as compositions of latent factors. Effective recognition, reasoning, and editing often require not only forming such compositional representations, but also solving the decomposition problem. One popular choice for constructing these representations is through the binding operation. Resonator networks, which can be understood as coupled Hopfield networks, were proposed as a way to perform decomposition on such bound representations. Recent works have shown notable similarities between Hopfield networks and diffusion models. Motivated by these observations, we introduce a framework for semantic decomposition using coupled inference in diffusion models. Our method frames semantic decomposition as an inverse problem and couples the diffusion processes using a reconstruction-driven guidance term that encourages the composition of factor estimates to match the bound vector. We also introduce a novel iterative sampling scheme that improves the performance of our model. Finally, we show that attention-based resonator networks are a special case of our framework. Empirically, we demonstrate that our coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.
[146] Efficient Special Stain Classification
Oskar Thaeter,Christian Grashei,Anette Haas,Elisa Schmoeckel,Han Li,Peter J. Schüffler
Main category: cs.CV
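TL;DR: This paper proposes a lightweight thumbnail-based stain classification method for automated stain identification in whole slide images (WSIs). It substantially improves processing speed and cross-dataset generalization while maintaining high accuracy, making it suitable for routine visual quality control in digital pathology workflows.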
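This entry reports macro F1 on internal data and weighted F1 on TCGA, so it may help to see how the two averages differ. The sketch below uses the standard definitions in plain Python; the labels and predictions are made-up toy values, not data from the paper, and the function names are hypothetical.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)

# Toy example with an imbalanced label distribution (illustrative only).
y_true = ["HE", "HE", "HE", "PAS"]
y_pred = ["HE", "HE", "PAS", "PAS"]
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

Macro F1 penalizes errors on rare stains as heavily as errors on common ones, whereas weighted F1 follows the class frequencies of the evaluation set, so the two numbers are not directly comparable across datasets.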
Details
Motivation: Accurate stain-type metadata for pathology slides is critical for quality control in clinical archives and for the integrity of computational pathology datasets, yet efficient and reliable automated stain classification methods are lacking. Method: Compares a Multi-Instance Learning (MIL) pipeline with a newly proposed lightweight thumbnail-based approach for automated classification of 16 stain classes (the 14 most common special stains plus standard and frozen-section H&E). Result: On the internal test set, MIL reaches a macro F1 of 0.941 (16 classes) and 0.969 (14 merged classes), versus 0.897 and 0.953 for the thumbnail approach; on external TCGA data the thumbnail approach has the higher weighted F1 (0.843 vs. 0.807) and raises throughput by two orders of magnitude (5.635 vs. 0.018 slides/s). Conclusion: The thumbnail approach is more scalable, robust, and computationally efficient than MIL, and is a practical solution for routine visual quality control in digital pathology. Abstract: Stains are essential in histopathology to visualize specific tissue characteristics, with Haematoxylin and Eosin (H&E) serving as the clinical standard. However, pathologists frequently utilize a variety of special stains for the diagnosis of specific morphologies. Maintaining accurate metadata for these slides is critical for quality control in clinical archives and for the integrity of computational pathology datasets. In this work, we compare two approaches for automated classification of stains using whole slide images, covering the 14 most commonly used special stains in our institute alongside standard and frozen-section H&E. We evaluate a Multi-Instance Learning (MIL) pipeline and a proposed lightweight thumbnail-based approach. On internal test data, MIL achieved the highest performance (macro F1: 0.941 for 16 classes; 0.969 for 14 merged classes), while the thumbnail approach remained competitive (0.897 and 0.953, respectively). On external TCGA data, the thumbnail model generalized best (weighted F1: 0.843 vs. 0.807 for MIL). The thumbnail approach also increased throughput by two orders of magnitude (5.635 vs. 0.018 slides/s for MIL with all patches). We conclude that thumbnail-based classification provides a scalable and robust solution for routine visual quality control in digital pathology workflows.
[147] Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
Florian Hahlbohm,Linus Franke,Martin Eisemann,Marcus Magnor
Main category: cs.CV
TL;DR: This paper introduces Faster-GS, a rigorously optimized 3D Gaussian Splatting system that achieves up to 5× faster training without sacrificing visual quality, and extends optimizations to 4D Gaussian reconstruction for efficient non-rigid scene modeling.
Details
Motivation: To address the fragmented research landscape in 3D Gaussian Splatting caused by entangled implementation-level and algorithmic changes, and to enable fair comparison through a consolidated, principled optimization framework. Method: Consolidates and evaluates effective prior strategies, adds novel optimizations, and investigates underexplored aspects including numerical stability, Gaussian truncation, and gradient approximation. Result: Faster-GS achieves up to 5× faster training while maintaining visual fidelity across comprehensive benchmarks; it also successfully extends to 4D Gaussian reconstruction for efficient non-rigid scene optimization. Conclusion: Faster-GS establishes a new cost-effective, resource-efficient baseline for 3DGS optimization and demonstrates generalizability to dynamic (4D) scenes. Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison. In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation. The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5$\times$ faster training while maintaining visual quality, establishing a new cost-effective and resource-efficient baseline for 3DGS optimization. Furthermore, we demonstrate that optimizations can be applied to 4D Gaussian reconstruction, leading to efficient non-rigid scene optimization.
[148] Perception with Guarantees: Certified Pose Estimation via Reachability Analysis
Tobias Ladner,Yasser Shoukry,Matthias Althoff
Main category: cs.CV
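TL;DR: This paper proposes a certified 3D pose estimation method that relies solely on a single camera image and a known target geometry. Using reachability analysis and formal neural network verification, it provides worst-case safety guarantees for safety-critical cyber-physical systems.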
Details
Motivation: In safety-critical cyber-physical systems, conventional pose estimation (e.g., relying on GPS or fusing multiple sensors) cannot provide formal safety guarantees in the worst case, and external services may not be trustworthy. Method: Uses reachability analysis and formal neural network verification to formally bound the pose estimated solely from a single camera image and a known target geometry. Result: Localizes agents efficiently and accurately in both synthetic and real-world experiments. Conclusion: The certified pose estimation method provides verifiable worst-case bounds on pose uncertainty for safety-critical applications, thereby supporting formal safety verification. Abstract: Agents in cyber-physical systems are increasingly entrusted with safety-critical tasks. Ensuring safety of these agents often requires localizing the pose for subsequent actions. Pose estimates can, e.g., be obtained from various combinations of lidar sensors, cameras, and external services such as GPS. Crucially, in safety-critical domains, a rough estimate is insufficient to formally determine safety, i.e., guaranteeing safety even in the worst-case scenario, and external services might additionally not be trustworthy. We address this problem by presenting a certified pose estimation in 3D solely from a camera image and a well-known target geometry. This is realized by formally bounding the pose, which is computed by leveraging recent results from reachability analysis and formal neural network verification. Our experiments demonstrate that our approach efficiently and accurately localizes agents in both synthetic and real-world experiments.
[149] Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection
Changjiang Jiang,Xinkuan Sha,Fengchang Yu,Jingjing Liu,Jian Liu,Mingqi Fang,Chenfeng Zhang,Wei Lu
Main category: cs.CV
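TL;DR: This paper proposes Fake-HR1, described as the first model to adaptively decide whether reasoning is needed when detecting synthetic images, significantly improving response efficiency while preserving performance.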
Details
Motivation: Existing Chain-of-Thought (CoT) based synthetic image detection methods are effective, but their fixed, lengthy reasoning incurs substantial resource overhead, which is especially redundant for obviously forged images. Method: Proposes a two-stage training framework: Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO), so that the model implicitly learns when to select an appropriate reasoning mode. Result: Fake-HR1 reasons adaptively across different query types, surpasses existing LLMs in both reasoning ability and generative detection performance, and significantly improves response efficiency. Conclusion: According to the authors, Fake-HR1 is the first large-scale hybrid-reasoning model that adaptively enables reasoning for generative detection, combining high accuracy with high efficiency. Abstract: Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
[150] Simple Image Processing and Similarity Measures Can Link Data Samples across Databases through Brain MRI
Gaurang Sharma,Harri Polonen,Juha Pajula,Jutta Suksi,Jussi Tohka
Main category: cs.CV
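TL;DR: This paper presents a training-free, computationally efficient MRI matching method. Using only standard preprocessing and image similarity computation, it achieves nearly perfect linkage of individuals across time intervals, scanners, and acquisition parameters, showing that skull-stripped brain images still carry a substantial privacy risk.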
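The linkage step summarized here can be illustrated with a minimal sketch. It assumes each preprocessed scan has already been flattened into a list of intensities and uses Pearson correlation as the similarity measure; the abstract does not name the measure, so this choice, the toy vectors, and all identifiers below are assumptions for illustration only.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length intensity vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant image: no usable signal
    return cov / math.sqrt(var_a * var_b)

def link(queries, gallery):
    """Map each query id to the gallery id with the highest similarity."""
    return {
        qid: max(gallery, key=lambda gid: pearson(q, gallery[gid]))
        for qid, q in queries.items()
    }

# Toy "databases": two subjects, each scanned twice with small intensity changes.
gallery = {"A": [1.0, 2.0, 3.0, 4.0], "B": [4.0, 1.0, 3.0, 2.0]}
queries = {"q1": [1.1, 2.0, 2.9, 4.2], "q2": [3.9, 1.2, 3.1, 2.0]}
matches = link(queries, gallery)
```

No training is involved: the only learned component in such a pipeline would be whatever the preprocessing tools themselves use, which is what makes the privacy risk described in this entry cheap to realize.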
Details
Motivation: Current regulatory frameworks require removing identifiers before MRI data are shared, but even after skull stripping the brain parenchyma retains unique features that can identify individuals, creating a risk of re-identification across databases; existing assessments often rest on subjective judgments of "reasonableness" and lack objective, efficient ways to quantify the risk. Method: Applies a standard preprocessing pipeline (e.g., registration and normalization) to skull-stripped T1-weighted MRIs and then directly computes image similarity (e.g., cross-correlation or structural similarity), with no deep learning model or extensive training. Result: Achieves nearly 100% linkage accuracy across time points, scanner models, spatial resolutions, and acquisition protocols, unaffected by cognitive decline, which supports the method's robustness in realistic cross-database settings. Conclusion: Skull-stripped MRI is itself a strong biometric identifier, and conventional anonymization alone is insufficient to protect privacy; medical data sharing policies need updating to include this kind of lightweight, accurate re-identification risk assessment. Abstract: Head Magnetic Resonance Imaging (MRI) is routinely collected and shared for research under strict regulatory frameworks. These frameworks require removing potential identifiers before sharing. But, even after skull stripping, the brain parenchyma contains unique signatures that can match other MRIs from the same participants across databases, posing a privacy risk if additional data features are available. Current regulatory frameworks often mandate evaluating such risks based on the assessment of a certain level of reasonableness. Prior studies have already suggested that a brain MRI could enable participant linkage, but they have relied on training-based or computationally intensive methods. Here, we demonstrate that linking an individual's skull-stripped T1-weighted MRI, which may lead to re-identification if other identifiers are available, is possible using standard preprocessing followed by image similarity computation. Nearly perfect linkage accuracy was achieved in matching data samples across various time intervals, scanner types, spatial resolutions, and acquisition protocols, despite potential cognitive decline, simulating MRI matching across databases. These results aim to contribute meaningfully to the development of thoughtful, forward-looking policies in medical data sharing.
[151] Conformal Prediction Sets for Instance Segmentation
Kerri Lu,Dan M. Kluger,Stephen Bates,Sherrie Wang
Main category: cs.CV
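TL;DR: This paper proposes a conformal prediction algorithm for instance segmentation that, for each pixel query, generates a confidence set of predictions with a statistical guarantee that at least one prediction has high IoU with the true mask. The method is validated on agricultural field delineation, cell segmentation, and vehicle detection, and outperforms existing baselines.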
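For readers unfamiliar with conformal prediction, the sketch below shows the generic split-conformal calibration step with its finite-sample correction. It is not the paper's algorithm: the nonconformity scores, the candidate masks, and the function names are all hypothetical, and how a score is derived from a segmentation model is left open.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(calibration_scores)[rank - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]

# Hypothetical nonconformity scores from a held-out calibration set (lower = better fit).
calibration = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90]
threshold = conformal_threshold(calibration, alpha=0.25)

# Hypothetical candidate masks for one pixel query, each with its own score.
candidates = {"mask_a": 0.15, "mask_b": 0.55, "mask_c": 0.80}
kept = prediction_set(candidates, threshold)
```

Harder queries tend to have more candidates near the threshold, which is one way prediction sets can grow with difficulty, in line with the adaptive set sizes this entry reports.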
Details
Motivation: Current instance segmentation models lack principled uncertainty quantification: their predictions are not calibrated, and there is no guarantee on how close a predicted mask is to the ground truth. Method: Introduces a conformal prediction algorithm that, given a pixel coordinate query in an image, generates a confidence set of instance predictions with a provable guarantee on the probability of high-IoU coverage; versions with asymptotic and finite-sample guarantees are provided. Result: Validated on agricultural field delineation, cell segmentation, and vehicle detection; prediction set size adapts to query difficulty and attains the target coverage, outperforming baselines such as Learn Then Test, Conformal Risk Control, and morphological dilation-based methods. Conclusion: The proposed conformal prediction algorithm gives instance segmentation an uncertainty quantification framework with statistical guarantees, combining practicality with theoretical rigor. Abstract: Current instance segmentation models achieve high performance on average predictions, but lack principled uncertainty quantification: their outputs are not calibrated, and there is no guarantee that a predicted mask is close to the ground truth. To address this limitation, we introduce a conformal prediction algorithm to generate adaptive confidence sets for instance segmentation. Given an image and a pixel coordinate query, our algorithm generates a confidence set of instance predictions for that pixel, with a provable guarantee for the probability that at least one of the predictions has high Intersection-Over-Union (IoU) with the true object instance mask. We apply our algorithm to instance segmentation examples in agricultural field delineation, cell segmentation, and vehicle detection. Empirically, we find that our prediction sets vary in size based on query difficulty and attain the target coverage, outperforming existing baselines such as Learn Then Test, Conformal Risk Control, and morphological dilation-based methods. We provide versions of the algorithm with asymptotic and finite sample guarantees.
[152] Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving
Serin Varghese,Kevin Ross,Fabian Hueger,Kira Maag
Main category: cs.CV
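TL;DR: This paper proposes a Spatio-Temporal Attention (STA) mechanism that extends transformer self-attention blocks to exploit multi-frame video context, substantially improving the temporal consistency and mIoU of video semantic segmentation.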
Details
Motivation: Existing semantic segmentation models process video frames independently and ignore temporal consistency, which limits accuracy and stability in dynamic scenes. Method: Proposes the Spatio-Temporal Attention (STA) mechanism, which modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. Result: On Cityscapes and BDD100k, temporal consistency metrics improve by 9.20 percentage points and mIoU by up to 1.76 percentage points over single-frame baselines. Conclusion: STA is an effective and general architectural enhancement for video semantic segmentation, applicable to transformer models of various sizes. Abstract: Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.
[153] Can Image Splicing and Copy-Move Forgery Be Detected by the Same Model? Forensim: An Attention-Based State-Space Approach
Soumyaroop Nandi,Prem Natarajan
Main category: cs.CV
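TL;DR: Forensim is an attention-based state-space framework for image forgery detection that jointly localizes manipulated (target) and source regions and supports unified detection of splicing and copy-move forgeries.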
Details
Motivation: Traditional methods rely solely on artifact cues to detect forged regions and struggle to capture duplication patterns in context; in scenarios such as protest imagery, localizing only the forged region can mislead interpretation, so source and target regions need to be localized jointly. Method: Proposes a visual state-space model that uses normalized attention maps to identify internal similarities, paired with a region-based block attention module to distinguish manipulated regions; it outputs three-class masks (pristine, source, target) and supports end-to-end training. Result: Achieves state-of-the-art performance on standard benchmarks, and releases CMFD-Anything, a new dataset that addresses limitations of existing copy-move forgery datasets. Conclusion: Through joint source-target localization and a unified architecture, Forensim improves the accuracy and interpretability of image forgery detection, particularly in real-world scenarios where the context of a forgery must be understood. Abstract: We introduce Forensim, an attention-based state-space framework for image forgery detection that jointly localizes both manipulated (target) and source regions. Unlike traditional approaches that rely solely on artifact cues to detect spliced or forged areas, Forensim is designed to capture duplication patterns crucial for understanding context. In scenarios such as protest imagery, detecting only the forged region, for example a duplicated act of violence inserted into a peaceful crowd, can mislead interpretation, highlighting the need for joint source-target localization. Forensim outputs three-class masks (pristine, source, target) and supports detection of both splicing and copy-move forgeries within a unified architecture. We propose a visual state-space model that leverages normalized attention maps to identify internal similarities, paired with a region-based block attention module to distinguish manipulated regions. This design enables end-to-end training and precise localization. Forensim achieves state-of-the-art performance on standard benchmarks. We also release CMFD-Anything, a new dataset addressing limitations of existing copy-move forgery datasets.
[154] 4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere
Yihang Luo,Shangchen Zhou,Yushi Lan,Xingang Pan,Chen Change Loy
Main category: cs.CV
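TL;DR: 4RC is a unified feed-forward framework for 4D reconstruction from monocular videos that jointly models dense scene geometry and motion dynamics through an encode-once, query-anywhere and anytime paradigm.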
Details
Motivation: Existing methods typically decouple motion from geometry or produce only limited 4D attributes such as sparse trajectories or two-view scene flow, and so fall short of jointly modeling complete 4D spatio-temporal information. Method: Proposes an encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space; a conditional decoder efficiently queries 3D geometry and motion for any query frame at any target timestamp; per-view 4D attributes are minimally factorized into base geometry and time-dependent relative motion. Result: 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks. Conclusion: 4RC achieves joint, efficient, and flexible 4D reconstruction of dense geometry and motion dynamics, supporting the effectiveness of a unified feed-forward framework for 4D understanding. Abstract: We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form by decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.
[155] Causality in Video Diffusers is Separable from Denoising
Xingjian Bai,Guande He,Zhengqi Li,Eli Shechtman,Xun Huang,Zongze Wu
Main category: cs.CV
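TL;DR: This paper proposes Separable Causal Diffusion (SCD), which decouples temporal causal reasoning from multi-step denoising-based rendering to improve the efficiency and quality of video generation.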
Details
Motivation: Existing causal diffusion models tightly couple temporal reasoning with iterative denoising, causing redundant computation; through systematic probing the authors find that early-layer features are highly similar across denoising steps and that deeper layers mainly perform intra-frame rendering, indicating that the two can be separated. Method: Proposes the Separable Causal Diffusion (SCD) architecture: a causal transformer encoder performs temporal reasoning once per frame, and a lightweight diffusion decoder handles multi-step frame-wise rendering. Result: On pretraining and post-training tasks across synthetic and real benchmarks, SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines. Conclusion: Causal reasoning and denoising-based rendering can be decoupled; through this structural separation SCD improves both efficiency and performance, offering a new paradigm for efficient video generation. Abstract: Causality -- referring to temporal, uni-directional cause-effect relationships between components -- underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
[156] VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
Zhongwei Ren,Yunchao Wei,Xiao Yu,Guixun Luo,Yao Zhao,Bingyi Kang,Jiashi Feng,Xiaojie Jin
Main category: cs.CV
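TL;DR: VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that uses a pretrained video diffusion model to decouple visual appearance from action dynamics. It thereby learns transferable, task-related dynamics representations from raw real-world videos and substantially improves success rates and long-horizon execution on handcraft making tasks and robotic manipulation (CALVIN).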
Details
Motivation: Intelligent agents need to learn transferable knowledge from unlabeled real-world video and generalize to new environments, but existing video generation and latent-dynamics models are unreliable on complex real-world tasks. Method: Proposes the dynamic-enhanced Latent Dynamics Model (dLDM): a pretrained video diffusion model handles visual appearance, letting the dLDM focus on learning compact and meaningful task-related latent dynamics; these latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. Result: On real-world handcraft making tasks, task success rate improves by up to 70% and the model produces coherent long execution videos; on CALVIN, manipulation knowledge acquired from the Open-X dataset substantially improves task performance. Conclusion: Learning transferable world knowledge directly from raw real-world videos is feasible and effective; VideoWorld 2 offers a new paradigm for video-driven embodied intelligence, and all code, data, and models are to be open-sourced. Abstract: Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
[157] Olaf-World: Orienting Latent Actions for Video World Modeling
Yuxin Jiang,Yuchao Gu,Ivor W. Tsang,Mike Zheng Shou
Main category: cs.CV
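TL;DR: This paper proposes SeqΔ-REPA, which uses the observable semantic effects of actions in video (temporal feature differences) as a shared reference for cross-context alignment, addressing the poor transfer of latent action representations learned from unlabeled video. The resulting Olaf-World pipeline outperforms existing methods in zero-shot action transfer and data-efficient adaptation.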
Details
Motivation: Existing latent action learning methods optimize only within individual video clips and lack a mechanism to align action semantics across contexts, so the learned latent action space entangles scene-specific cues and lacks a shared coordinate system, which limits generalization. Method: Proposes SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors the integrated latent action to temporal feature differences from a frozen, self-supervised video encoder; building on this, the Olaf-World pipeline pretrains action-conditioned video world models from large-scale video without action labels. Result: Experiments show that the method learns a more structured latent action space and outperforms state-of-the-art baselines in both zero-shot action transfer and data-efficient adaptation to new control interfaces. Conclusion: Using the observable semantic effects of actions as a shared anchor for cross-context alignment is an effective route to better generalization in unsupervised latent action learning; SeqΔ-REPA and Olaf-World offer a new paradigm for building scalable action-controllable world models. Abstract: Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$Δ$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
[158] ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
Mingyang Wu,Ashirbad Mishra,Soumik Dey,Shuo Xing,Naveen Ravipati,Hansi Wu,Binbin Li,Zhengzhong Tu
Main category: cs.CV
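TL;DR: This paper presents the ConsID-Gen framework and the ConsIDVid dataset, which use auxiliary views and a dual-stream encoder to improve object identity consistency and geometric stability in image-to-video generation.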
Details
Motivation: Existing image-to-video (I2V) generation methods struggle to preserve fine-grained object identity under changing viewpoints and are prone to appearance drift and geometric distortion, which the authors attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Method: Builds ConsIDVid, a large-scale object-centric dataset, and the accompanying benchmark ConsIDVid-Bench; proposes ConsID-Gen, which augments the first frame with unposed auxiliary views and uses a dual-stream visual-geometric encoder together with a text-visual connector to fuse semantic and structural cues into unified conditioning for a Diffusion Transformer backbone. Result: On ConsIDVid-Bench, ConsID-Gen outperforms leading video generation models such as Wan2.1 and HunyuanVideo on multiple metrics, with markedly better identity fidelity and temporal coherence. Conclusion: Addressing the problem from both the data and the model side, with multi-view supervision and deep cross-modal fusion, effectively mitigates object identity degradation in I2V and offers a new paradigm for high-quality controllable video generation. Abstract: Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.
[159] Quantum Multiple Rotation Averaging
Shuteng Wang,Natacha Kuete Meli,Michael Möller,Vladislav Golyanik
Main category: cs.CV
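TL;DR: This paper proposes IQARS, described as the first algorithm to reformulate multiple rotation averaging (MRA) as a sequence of local quadratic non-convex sub-problems solvable on quantum annealers. It preserves the geometry of the rotation manifold while using quantum tunneling and parallelism to improve accuracy under high noise; in experiments on D-Wave devices its accuracy is about 12% higher than that of Shonan, the best classical method evaluated.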
Details
Motivation: Established classical MRA methods such as L1-IRLS and Shonan are susceptible to local minima and rely on convex relaxations that do not preserve the exact rotation manifold geometry, so accuracy drops under high noise. Method: Proposes IQARS: MRA is decomposed into a sequence of local quadratic non-convex sub-problems that, after binarization, run on quantum annealing hardware; the method avoids convex relaxation, better preserves the rotation manifold geometry, and uses quantum tunneling and parallel search. Result: Evaluated on synthetic and real-world datasets; on current, limited D-Wave annealers IQARS already achieves about 12% higher accuracy than Shonan. Conclusion: According to the authors, IQARS is the first quantum annealing algorithm for MRA; it combines geometric fidelity with quantum hardware advantages and may gain further over classical methods as hardware matures. Abstract: Multiple rotation averaging (MRA) is a fundamental optimization problem in 3D vision and robotics that aims to recover globally consistent absolute rotations from noisy relative measurements. Established classical methods, such as L1-IRLS and Shonan, face limitations including local minima susceptibility and reliance on convex relaxations that fail to preserve the exact manifold geometry, leading to reduced accuracy in high-noise scenarios. We introduce IQARS (Iterative Quantum Annealing for Rotation Synchronization), the first algorithm that reformulates MRA as a sequence of local quadratic non-convex sub-problems executable on quantum annealers after binarization, to leverage inherent hardware advantages. IQARS removes convex relaxation dependence and better preserves non-Euclidean rotation manifold geometry while leveraging quantum tunneling and parallelism for efficient solution space exploration. We evaluate IQARS's performance on synthetic and real-world datasets. While current annealers remain in their nascent phase and only support solving problems of limited scale with constrained performance, we observed that IQARS on D-Wave annealers can already achieve ca. 12% higher accuracy than Shonan, i.e., the best-performing classical method evaluated empirically.
[160] SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
Hongchi Xia,Xuan Li,Zhaoshuo Li,Qianli Ma,Jiashu Xu,Ming-Yu Liu,Yin Cui,Tsung-Yi Lin,Wei-Chiu Ma,Shenlong Wang,Shuran Song,Fangyin Wei
Main category: cs.CV
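TL;DR: This paper proposes SAGE, an agentic framework in which multiple generators and critics cooperate to automatically generate, for a user-specified task, simulation-ready 3D environments that are physically plausible, semantically coherent, and visually realistic, substantially improving the scalability and generalization of embodied AI policy training.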