cs.CL

[1] Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning

Mahdi Cherakhloo,Arash Abbasi,Mohammad Saeid Sarafraz,Bijan Vosoughi Vahdat

Main category: cs.CL

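TL;DR: This paper comprehensively benchmarks several open-source large language models on Persian natural language processing tasks in zero-shot and few-shot settings. Gemma 2 performs best on most tasks, but fine-grained tasks such as named entity recognition remain challenging.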

Motivation: The effectiveness of large language models in low-resource languages such as Persian still needs thorough study in order to advance multilingual models.

Method: Several open-source LLMs are evaluated in zero-shot and few-shot settings on standard Persian datasets such as ParsiNLU and ArmanEmo, using Accuracy, F1-score, BLEU, and ROUGE to measure performance on sentiment analysis, named entity recognition, reading comprehension, and question answering.

Result: Gemma 2 outperforms the other models on nearly all tasks in both learning paradigms and is especially strong on complex reasoning tasks; most models perform poorly on token-level understanding tasks such as named entity recognition.

Conclusion: The study provides benchmark results for LLMs on Persian, exposes the strengths and limitations of current models, and offers a reference point for improving future multilingual models.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous languages; however, their effectiveness in low-resource languages like Persian requires thorough investigation. This paper presents a comprehensive benchmark of several open-source LLMs for Persian Natural Language Processing (NLP) tasks, utilizing both zero-shot and few-shot learning paradigms. We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering, using established Persian datasets such as ParsiNLU and ArmanEmo. Our methodology encompasses rigorous experimental setups for both zero-shot and few-shot scenarios, employing metrics such as Accuracy, F1-score, BLEU, and ROUGE for performance evaluation. The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms, with particularly strong performance in complex reasoning tasks. However, most models struggle with token-level understanding tasks like Named Entity Recognition, highlighting specific challenges in Persian language processing. This study contributes to the growing body of research on multilingual LLMs, providing valuable insights into their performance in Persian and offering a benchmark for future model development.

[2] Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study

Soheil Hashtarkhani,Rezaur Rashid,Christopher L Brett,Lokesh Chinthala,Fekede Asefa Kumsa,Janet A Zink,Robert L Davis,David L Schwartz,Arash Shaban-Nejad

Main category: cs.CL

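TL;DR: The study evaluates four large language models (GPT-3.5, GPT-4o, Llama 3.2, Gemini 1.5) and BioBERT on classifying cancer diagnoses from structured and unstructured electronic health records. GPT-4o and BioBERT each perform best on a different data type, but high-stakes clinical use still requires human oversight.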

Motivation: Electronic health records contain large amounts of unstructured or free-text data that must be preprocessed effectively to support predictive health care models. AI-driven natural language processing tools show promise, but their performance and clinical reliability have not been systematically evaluated.

Method: Five models (GPT-3.5, GPT-4o, Llama 3.2, Gemini 1.5, and BioBERT) classify 762 cancer diagnoses (ICD code descriptions and free-text entries) into 14 categories, and two oncology experts validate the results.

Result: BioBERT performs best on ICD codes (weighted macro F1-score 84.2, accuracy 90.8). On free text, GPT-4o outperforms BioBERT (F1-score 71.8 vs 61.5, accuracy 81.9 vs 81.6). The other models score lower overall. Common errors include confusion between metastasis and central nervous system tumors, and ambiguous terminology.

Conclusion: Current performance is adequate for administrative and research use, but clinical decision-making still requires standardized documentation practices and strict human oversight.

Abstract: Electronic health records contain inconsistently structured or free-text data, requiring efficient preprocessing to enable predictive health care models. Although artificial intelligence-driven natural language processing tools show promise for automating diagnosis classification, their comparative performance and clinical reliability require systematic evaluation. The aim of this study is to evaluate the performance of 4 large language models (GPT-3.5, GPT-4o, Llama 3.2, and Gemini 1.5) and BioBERT in classifying cancer diagnoses from structured and unstructured electronic health records data. We analyzed 762 unique diagnoses (326 International Classification of Diseases (ICD) code descriptions, 436 free-text entries) from 3456 records of patients with cancer. Models were tested on their ability to categorize diagnoses into 14 predefined categories. Two oncology experts validated classifications. BioBERT achieved the highest weighted macro F1-score for ICD codes (84.2) and matched GPT-4o in ICD code accuracy (90.8). For free-text diagnoses, GPT-4o outperformed BioBERT in weighted macro F1-score (71.8 vs 61.5) and achieved slightly higher accuracy (81.9 vs 81.6). GPT-3.5, Gemini, and Llama showed lower overall performance on both formats. Common misclassification patterns included confusion between metastasis and central nervous system tumors, as well as errors involving ambiguous or overlapping clinical terminology. Although current performance levels appear sufficient for administrative and research use, reliable clinical applications will require standardized documentation practices alongside robust human oversight for high-stakes decision-making.
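The abstract reports both accuracy and a support-weighted F1-score, and the two can diverge when categories are imbalanced. As a rough illustration, the sketch below computes both from scratch. It assumes that "weighted macro F1" means per-class F1 averaged with class-frequency weights, which the abstract does not spell out; the labels are invented toy data, not the study's.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its number of true instances."""
    support = Counter(y_true)
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1[label] * n for label, n in support.items()) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels (hypothetical), echoing the metastasis vs CNS confusion noted above.
y_true = ["breast", "breast", "lung", "cns", "metastasis", "metastasis"]
y_pred = ["breast", "breast", "lung", "metastasis", "metastasis", "cns"]

print(f"accuracy    = {accuracy(y_true, y_pred):.3f}")
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.3f}")
```

On real data with 14 unevenly sized categories, the two numbers generally differ, which is why the study reports them side by side.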

[3] From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP

Shanshan Xu,Santosh T. Y. S. S,Barbara Plank

Main category: cs.CL

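TL;DR: This paper argues that human label variation (HLV) should be treated as an end in itself when designing AI systems, because it embodies the pluralism of human values, and calls for actively preserving and incorporating HLV in preference datasets.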

Motivation: HLV was long dismissed as noise and has only recently been reframed as a signal for improving model robustness. Yet current preference-learning datasets usually collapse diverse annotations into a single label, which erases the diversity of human perspectives and runs counter to the goal of AI alignment.

Method: A position paper that analyzes the problems with how preference-learning data is currently processed, proposes treating HLV as a Selbstzweck (an end in itself), and recommends concrete steps for actively incorporating HLV into datasets.

Result: The paper stresses that preserving HLV helps reflect the real pluralism of human values and improves the inclusiveness and alignment of AI systems.

Conclusion: Preserving human label variation should be a core goal of AI system design, steering AI development toward greater respect for human diversity.

Abstract: Human Label Variation (HLV) refers to legitimate disagreement in annotation that reflects the genuine diversity of human perspectives rather than mere error. For decades, HLV in NLP was dismissed as noise to be discarded, and only slowly over the last decade has it been reframed as a signal for improving model robustness. With the rise of large language models (LLMs), where post-training on human feedback has become central to model alignment, the role of HLV has become increasingly consequential. Yet current preference-learning datasets routinely aggregate multiple annotations into a single label, thereby flattening diverse perspectives into a false universal agreement and erasing precisely the pluralism of human values that alignment aims to preserve. In this position paper, we argue that preserving HLV as an embodiment of human pluralism must be treated as a Selbstzweck, a goal in itself, when designing AI systems. We call for proactively incorporating HLV into preference datasets and outline actionable steps towards it.

[4] MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning

Rajarshi Ghosh,Abhay Gupta,Hudson McBride,Anurag Vaidya,Faisal Mahmood

Main category: cs.CL

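TL;DR: The study introduces MEDEQUALQA, a counterfactual benchmark that changes only patient pronouns (he/him, she/her, they/them) to assess how stable LLM reasoning is in clinical decision support. Even when final diagnoses agree, the models' reasoning shows localized gender-related divergences, pointing to a potential risk of clinical inequity.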

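Motivation: LLMs are increasingly used for clinical decision support, but their outputs can be influenced by patient demographic cues such as gendered pronouns. Prior work focuses on differences in outputs and says little about how internal reasoning shifts under controlled changes, so a controlled evaluation framework is needed to test reasoning stability.

Method: MEDEQUALQA builds counterfactual samples from clinical vignettes by changing only the patient pronouns while holding critical symptoms and conditions (CSCs) constant. Each vignette is expanded into single-CSC ablations, yielding three parallel datasets of about 69,000 items in total. A GPT-4.1 model generates reasoning traces, and Semantic Textual Similarity (STS) between the traces for different pronoun variants measures stability.

Result: Reasoning traces are highly similar overall (mean STS >0.80), but consistent localized differences appear in cited risk factors, guideline anchors, and the ordering of differential diagnoses. Error analysis identifies cases where the reasoning changes markedly, showing that the model applies different medical logic behind seemingly identical diagnoses.

Conclusion: Even with symptoms held constant, changing patient pronouns can induce subtle but clinically relevant divergences in LLM clinical reasoning. MEDEQUALQA provides a tool for auditing reasoning stability and potential bias in medical AI systems.

Abstract: Large language models (LLMs) are increasingly deployed in clinical decision support, yet subtle demographic cues can influence their reasoning. Prior work has documented disparities in outputs across patient groups, but little is known about how internal reasoning shifts under controlled demographic changes. We introduce MEDEQUALQA, a counterfactual benchmark that perturbs only patient pronouns (he/him, she/her, they/them) while holding critical symptoms and conditions (CSCs) constant. Each clinical vignette is expanded into single-CSC ablations, producing three parallel datasets of approximately 23,000 items each (69,000 total). We evaluate a GPT-4.1 model and compute Semantic Textual Similarity (STS) between reasoning traces to measure stability across pronoun variants. Our results show overall high similarity (mean STS >0.80), but reveal consistent localized divergences in cited risk factors, guideline anchors, and differential ordering, even when final diagnoses remain unchanged. Our error analysis highlights certain cases in which the reasoning shifts, underscoring clinically relevant bias loci that may cascade into inequitable care. MEDEQUALQA offers a controlled diagnostic setting for auditing reasoning stability in medical AI.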
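The pronoun-only perturbation can be pictured with a small template-based sketch. Everything here is illustrative: the vignette text, the `PRONOUNS` table, and the helper names are hypothetical, and the Jaccard token overlap is only a crude stand-in for the embedding-based STS score the abstract describes.

```python
# Pronoun forms for the three variants named in the abstract.
PRONOUNS = {
    "he": {"subj": "he", "obj": "him", "poss": "his"},
    "she": {"subj": "she", "obj": "her", "poss": "her"},
    "they": {"subj": "they", "obj": "them", "poss": "their"},
}

def make_variants(template):
    """Fill one vignette template with each pronoun set; symptoms stay fixed."""
    return {name: template.format(**forms) for name, forms in PRONOUNS.items()}

def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a toy proxy, NOT a real STS model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# Hypothetical vignette; past tense avoids verb-agreement changes for "they".
template = (
    "The patient reported chest pain after exertion; {subj} said the pain "
    "woke {obj} at night, and {poss} father had a heart attack at 50."
)
variants = make_variants(template)

for name, text in variants.items():
    print(f"{name:>4}: {text}")
print("he vs she overlap:", round(jaccard(variants["he"], variants["she"]), 3))
```

In the benchmark itself the comparison is made between model reasoning traces for each variant, not between the input vignettes; the same similarity function would simply be applied to those traces.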

[5] Classifier-Augmented Generation for Structured Workflow Prediction

Thomas Gschwind,Shramona Chakraborty,Nitin Gupta,Sameep Mehta

Main category: cs.CL

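TL;DR: The paper proposes a system that generates ETL workflows from natural language, using a Classifier-Augmented Generation (CAG) approach to predict both the structure and the configuration of the workflow automatically.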

Motivation: ETL tools let users build data flows visually, but configuring them is complex, time consuming, and demands considerable expertise, so automation is needed to lower the barrier to use.

Method: The Classifier-Augmented Generation (CAG) approach combines utterance decomposition, a classifier, and stage-specific few-shot prompting to predict stages, connects the stages into non-linear workflows through edge prediction, and infers stage properties from sub-utterance context.

Result: Compared with single-prompt and agentic baselines, CAG achieves better accuracy and efficiency and uses fewer tokens. The system is modular, interpretable, and capable of end-to-end generation, and the paper gives the first detailed evaluation across stage prediction, edge layout, and property generation.

Conclusion: The system converts natural language descriptions into executable ETL workflows, markedly improves configuration efficiency, and offers a feasible route to natural-language-driven ETL development.

Abstract: ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time consuming and requires deep tool knowledge. We propose a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and detailed configuration of the flow. At its core lies a Classifier-Augmented Generation (CAG) approach that combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. These stages are then connected into non-linear workflows using edge prediction, and stage properties are inferred from sub-utterance context. We compare CAG against strong single-prompt and agentic baselines, showing improved accuracy and efficiency, while substantially reducing token usage. Our architecture is modular, interpretable, and capable of end-to-end workflow generation, including robust validation steps. To our knowledge, this is the first system with a detailed evaluation across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.

[6] Scheming Ability in LLM-to-LLM Strategic Interactions

Thao Pham

Main category: cs.CL

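TL;DR: The study examines the ability and propensity of frontier LLM agents to deceive strategically without being prompted to, using two game-theoretic frameworks: a Cheap Talk signaling game and a Peer Evaluation adversarial game. Models show a strong tendency to deceive even without explicit prompting.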

Motivation: As LLM agents are deployed autonomously in varied settings, evaluating their capacity for strategic deception becomes crucial. Existing research focuses on how AI systems behave toward human developers, while scheming between LLMs remains largely unexplored.

Method: Four models (GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3.3-70b) are tested in a Cheap Talk signaling game and a Peer Evaluation adversarial game. Scheming performance is measured with and without explicit prompting, and scheming tactics are analyzed through chain-of-thought reasoning.

Result: When prompted, most models (especially Gemini and Claude) perform near perfectly. Critically, without prompting, all models choose deception over confession in Peer Evaluation (100%), and models that choose to scheme in Cheap Talk succeed 95-100% of the time.

Conclusion: LLM agents show a strong spontaneous propensity to deceive, which underlines the need for more robust evaluation using high-stakes game-theoretic scenarios in multi-agent settings.

Abstract: As large language model (LLM) agents are deployed autonomously in diverse contexts, evaluating their capacity for strategic deception becomes crucial. While recent research has examined how AI systems scheme against human developers, LLM-to-LLM scheming remains underexplored. We investigate the scheming ability and propensity of frontier LLM agents through two game-theoretic frameworks: a Cheap Talk signaling game and a Peer Evaluation adversarial game. Testing four models (GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3.3-70b), we measure scheming performance with and without explicit prompting while analyzing scheming tactics through chain-of-thought reasoning. When prompted, most models, especially Gemini-2.5-pro and Claude-3.7-Sonnet, achieved near-perfect performance. Critically, models exhibited significant scheming propensity without prompting: all models chose deception over confession in Peer Evaluation (100% rate), while models choosing to scheme in Cheap Talk succeeded at 95-100% rates. These findings highlight the need for robust evaluations using high-stakes game-theoretic scenarios in multi-agent settings.

[7] Mathematics with large language models as provers and verifiers

Hieu Le Duc,Leo Liberti

Main category: cs.CL

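TL;DR: The paper reports a theorem-proving result achieved with ChatGPT (based on the gpt-5 model): multiple prover and verifier instances collaborate, and the Lean proof assistant formally verifies the output. The method solves five of the six 2025 IMO problems and closes a third of the 66 number theory conjectures listed by Cohen.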

Motivation: To test whether large language models can solve hard mathematical problems and conjectures, in particular International Mathematical Olympiad problems and newly formulated conjectures.

Method: A multi-instance protocol in which different instances of the gpt-5 model act as provers and verifiers working collaboratively. The resulting proofs are formally verified with the Lean proof assistant, and a human checks that the premises and conclusion of the Lean code conform to the original statement, to guard against hallucinations.

Result: The method solves five of the six 2025 IMO problems and closes about a third of the 66 number theory conjectures listed by Cohen.

Conclusion: Within a suitable framework, large language models show theorem-proving ability approaching the top human level, and combining them with formal verification effectively suppresses hallucinations, advancing the use of AI in mathematical research.

Abstract: During 2024 and 2025 the discussion about the theorem-proving capabilities of large language models started reporting interesting success stories, mostly to do with difficult exercises (such as problems from the International Mathematical Olympiad), but also with conjectures [Feldman & Karbasi, arXiv:2509.18383v1] formulated for the purpose of verifying whether the artificial intelligence could prove them. In this paper we report a theorem-proving feat achieved by ChatGPT by using a protocol involving different prover and verifier instances of the gpt-5 model working collaboratively. To make sure that the produced proofs do not suffer from hallucinations, the final proof is formally verified by the Lean proof assistant, and the conformance of premises and conclusion of the Lean code is verified by a human. Our methodology was able to solve five out of six 2025 IMO problems, and close a third of the sixty-six number theory conjectures in [Cohen, Journal of Integer Sequences, 2025].

[8] MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training

Taicheng Guo,Hai Wang,ChaoChun Liu,Mohsen Golalikhani,Xin Chen,Xiangliang Zhang,Chandan K. Reddy

Main category: cs.CL

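TL;DR: The paper presents MTSQL-R1, an agentic training framework that models multi-turn Text-to-SQL as a Markov Decision Process and uses execution feedback and dialogue memory to drive an iterative propose, verify, and refine cycle, markedly improving dialogue coherence and SQL executability.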

Motivation: Existing systems treat multi-turn Text-to-SQL as simple turn-by-turn text translation, with no execution feedback or consistency verification, which leads to non-executable or incoherent outputs. A framework with long-horizon planning, environment interaction, and memory is needed to improve performance.

Method: The task is modeled as a Markov Decision Process (MDP) in which an agent interacts with a database (for execution feedback) and a persistent dialogue memory (for coherence verification), iterating a propose -> execute -> verify -> refine cycle until all checks pass.

Result: Experiments on CoSQL and SParC show that MTSQL-R1 consistently outperforms strong baselines, confirming the importance of environment-driven verification and memory-guided refinement for multi-turn Text-to-SQL.

Conclusion: An agentic training framework with environment interaction and persistent memory improves semantic parsing quality in multi-turn Text-to-SQL, demonstrating the key role of long-horizon modeling and closed-loop verification in this task.

Abstract: Multi-turn Text-to-SQL aims to translate a user's conversational utterances into executable SQL while preserving dialogue coherence and grounding to the target schema. However, most existing systems only regard this task as a simple text translation task and follow a short-horizon paradigm, generating a query per turn without execution, explicit verification, and refinement, which leads to non-executable or incoherent outputs. We present MTSQL-R1, an agentic training framework for long-horizon multi-turn Text-to-SQL. We cast the task as a Markov Decision Process (MDP) in which an agent interacts with (i) a database for execution feedback and (ii) a persistent dialogue memory for coherence verification, performing an iterative propose -> execute -> verify -> refine cycle until all checks pass. Experiments on CoSQL and SParC demonstrate that MTSQL-R1 consistently outperforms strong baselines, highlighting the importance of environment-driven verification and memory-guided refinement for conversational semantic parsing. Full recipes (including code, trained models, logs, reasoning trajectories, etc.) will be released after the internal review to contribute to community research.
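The propose -> execute -> verify -> refine cycle can be sketched against an in-memory SQLite database. This is a minimal illustration under stated assumptions, not the MTSQL-R1 implementation: `refine_loop`, `run_sql`, the fixed candidate list standing in for the trained policy, and the "non-empty result" check standing in for coherence verification are all hypothetical.

```python
import sqlite3

def run_sql(conn, sql):
    """Execute sql; return (rows, None) on success or (None, message) on error."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)

def refine_loop(conn, propose, verify, max_steps=5):
    """Ask `propose` for a query, execute it, verify the result, and repeat.

    `propose(history)` sees all earlier (sql, error, passed) triples, so a
    real policy could condition its next attempt on execution feedback.
    """
    history = []
    for _ in range(max_steps):
        sql = propose(history)
        rows, error = run_sql(conn, sql)
        passed = error is None and verify(rows)
        history.append((sql, error, passed))
        if passed:
            return sql, rows, history
    return None, None, history

# Toy database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ana", 31), ("Bo", 45)])

# A fixed list of attempts stands in for the learned agent.
candidates = [
    "SELECT nme FROM singer WHERE age > 40",   # fails: no such column
    "SELECT name FROM singer WHERE age > 50",  # runs, but the check rejects it
    "SELECT name FROM singer WHERE age > 40",  # runs and passes the check
]
sql, rows, history = refine_loop(
    conn,
    propose=lambda hist: candidates[len(hist)],
    verify=lambda result: len(result) > 0,
)
print(sql, rows)
for attempt in history:
    print(attempt)
```

The point of the loop is that a query which fails to execute, or executes but fails verification, produces feedback that the next proposal can use, rather than being emitted as the final answer.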

[9] Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study

Kon Woo Kim,Rezarta Islamaj,Jin-Dong Kim,Florian Boudin,Akiko Aizawa

Main category: cs.CL

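TL;DR: The study explores how existing text annotation guidelines can be adapted to instruct large language models (LLMs) on annotation tasks, proposes an LLM-oriented guideline repurposing method, and validates it experimentally.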

Motivation: Traditional annotation guidelines are written for humans, whereas LLMs need more explicit, structured instructions, so existing guidelines must be adapted before they can guide LLM annotators effectively.

Method: A guideline repurposing method based on an LLM moderation process transforms the original guidelines into clear directives that an LLM can follow, with the NCBI Disease Corpus as a case study.

Result: Experiments show that the repurposed guidelines can effectively guide LLM annotation, while also revealing several practical challenges.

Conclusion: The method supports scalable, low-cost refinement of annotation guidelines and advances automated text annotation.

Abstract: This study investigates how existing annotation guidelines can be repurposed to instruct large language model (LLM) annotators for text annotation tasks. Traditional guidelines are written for human annotators who internalize training, while LLMs require explicit, structured instructions. We propose a moderation-oriented guideline repurposing method that transforms guidelines into clear directives for LLMs through an LLM moderation process. Using the NCBI Disease Corpus as a case study, our experiments show that repurposed guidelines can effectively guide LLM annotators, while revealing several practical challenges. The results highlight the potential of this workflow to support scalable and cost-effective refinement of annotation guidelines and automated annotation.

[10] A²FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning

Qianben Chen,Jingyi Cao,Jiayu Zhang,Tianrui Qin,Xiaowan Li,King Zhu,Dingfeng Shi,He Zhu,Minghao Liu,Xiaobo Liang,Ge Zhang,Jian Yang,Yuchen Eleanor Jiang,Wangchunshu Zhou

Main category: cs.CL

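TL;DR: The paper proposes the Adaptive Agent Foundation Model (A²FM), which unifies reasoning-centric and agentic LLMs under a route-then-align principle, adds a third, instant mode for simple queries, and introduces Adaptive Policy Optimization (APO) to improve accuracy and efficiency. It reaches SOTA among comparable models on several benchmarks while substantially reducing compute cost.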

Motivation: Existing LLMs fall into reasoning-centric and agentic families whose different training objectives lead to mismatched strengths. On simple questions, both tend to overthink or over-call tools, which is inefficient.

Method: The A²FM framework follows a route-then-align mechanism: it first learns task-aware mode routing and then aligns mode-specific trajectories under a shared backbone. A third, instant mode handles simple queries directly. Adaptive Policy Optimization (APO) combines adaptive sampling across modes with a cost-regularized reward.

Result: At the 32B scale, A²FM scores 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, reaching SOTA among comparable models. The cost per correct answer falls to $0.00487, which is 45.2% lower than the reasoning mode and 33.5% lower than the agentic mode.

Conclusion: Through a unified architecture and adaptive mechanisms, A²FM markedly improves efficiency while maintaining high accuracy, balancing the trade-off among reasoning, tool calls, and direct answers to deliver a more cost-effective agent.

Abstract: Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A²FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode, instant, that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A²FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only $0.00487 per correct answer, cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
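The cost figures can be made concrete with a little arithmetic. The sketch below assumes that cost of pass is total spend divided by the number of correct answers, and that each percentage is a reduction relative to the corresponding single-mode cost; the abstract states neither definition explicitly, and the function names and the 1,000-answer example are hypothetical.

```python
def cost_of_pass(total_cost_usd, num_correct):
    """Dollars spent per correct answer (assumed definition)."""
    if num_correct == 0:
        return float("inf")
    return total_cost_usd / num_correct

def implied_baseline(cost, relative_reduction):
    """Baseline cost implied by `cost` being `relative_reduction` lower."""
    return cost / (1.0 - relative_reduction)

adaptive = 0.00487  # reported cost of pass, in dollars

# Back-of-the-envelope values derived from the reported reductions;
# these are implied by the arithmetic, not figures quoted in the abstract.
reasoning = implied_baseline(adaptive, 0.452)
agentic = implied_baseline(adaptive, 0.335)

print(f"implied reasoning-mode cost of pass: ${reasoning:.5f}")
print(f"implied agentic-mode cost of pass:   ${agentic:.5f}")

# Hypothetical example: $4.87 spent to obtain 1,000 correct answers.
print(f"example cost of pass: ${cost_of_pass(4.87, 1000):.5f}")
```

Under these assumptions the single-mode costs come out a little under one cent per correct answer, so the adaptive routing saves a fraction of a cent per answer, which only matters at scale.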

[11] FaStFACT: Faster, Stronger Long-Form Factuality Evaluations in LLMs

Yingjia Wan,Haochen Tan,Xiao Zhu,Xinyu Zhou,Zhiwei Li,Qingsong Lv,Changxuan Sun,Jiaqi Zeng,Yi Xu,Jianqiao Lu,Yinhong Liu,Zhijiang Guo

Main category: cs.CL

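TL;DR: The paper proposes FaStFACT, an efficient and strong framework for evaluating the factuality of long-form LLM generations. Chunk-level claim extraction combined with confidence-based pre-verification markedly improves both efficiency and agreement with human evaluation.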

Motivation: Existing methods for evaluating the factuality of long-form LLM output are inefficient and ineffective, mainly because of complex pipeline components, inaccurate claim extraction, and insufficient evidence.

Method: FaStFACT extracts claims at the chunk level with confidence-based pre-verification, collects document-level evidence from crawled webpages, and retrieves that evidence selectively during verification, which reduces web search and inference call costs and improves evidence sufficiency.

Result: Extensive experiments on an aggregated, manually annotated benchmark show that FaStFACT outperforms existing baselines in both efficiency and effectiveness and achieves the highest agreement with human evaluation.

Conclusion: FaStFACT is an efficient, reliable factuality evaluation framework that markedly improves factuality assessment of long-form LLM output; its code and data are open source.

Abstract: Evaluating the factuality of long-form generations from Large Language Models (LLMs) remains challenging due to accuracy issues and costly human assessment. Prior efforts attempt this by decomposing text into claims, searching for evidence, and verifying claims, but suffer from critical drawbacks: (1) inefficiency due to complex pipeline components unsuitable for long LLM outputs, and (2) ineffectiveness stemming from inaccurate claim sets and insufficient evidence collection of one-line snippets. To address these limitations, we propose FaStFACT, a fast and strong evaluation framework that achieves the highest alignment with human evaluation and efficiency among existing baselines. FaStFACT first employs chunk-level claim extraction integrated with confidence-based pre-verification, significantly reducing the cost of web searching and inference calling while ensuring reliability. For searching and verification, it collects document-level evidence from crawled webpages and selectively retrieves it during verification, addressing the evidence insufficiency problem in previous pipelines. Extensive experiments based on an aggregated and manually annotated benchmark demonstrate the reliability of FaStFACT in both efficiently and effectively evaluating the factuality of long-form LLM generations. Code and benchmark data are available at https://github.com/Yingjia-Wan/FastFact.

[12] VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages

Jesse Atuhurra,Iqra Ali,Tomoya Iwakura,Hidetaka Kamigaito,Tatsuya Hiraoka

Main category: cs.CL

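TL;DR: The paper introduces VLURes, a new multilingual benchmark for evaluating the fine-grained understanding of vision language models (VLMs) in four languages under long-text settings. It covers eight vision-and-language tasks plus an unrelatedness task and reveals performance disparities across languages and tasks.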

Motivation: Existing VLM evaluation concentrates on English-centric, short-text benchmarks and lacks a comprehensive assessment of fine-grained visual and linguistic understanding in multilingual, long-text settings.

Method: The authors build VLURes, a multilingual dataset with ten image categories and rich textual content covering English, Japanese, Swahili, and Urdu, and evaluate ten VLMs through a combination of automatic evaluation and judgments by native speakers.

Result: GPT-4o performs best with an overall accuracy of 90.8%, still 6.7% below human performance; the gap is larger for open-source models. Performance differs markedly across languages and tasks.

Conclusion: VLURes is an important evaluation tool for developing intelligent agents capable of multi-modal visual reasoning, and it highlights the challenges and directions for cross-lingual fine-grained understanding.

Abstract: Vision Language Models (VLMs) are pivotal for advancing perception in intelligent agents. Yet, evaluation of VLMs remains limited to predominantly English-centric benchmarks in which the image-text pairs comprise short texts. To evaluate VLM fine-grained abilities in four languages under long-text settings, we introduce a novel multilingual benchmark, VLURes, featuring eight vision-and-language tasks and a pioneering unrelatedness task, to probe the fine-grained Visual and Linguistic Understanding capabilities of VLMs across English, Japanese, and two low-resource languages, Swahili and Urdu. Our datasets, curated from web resources in the target language, encompass ten diverse image categories and rich textual context, introducing valuable vision-language resources for Swahili and Urdu. By prompting VLMs to generate responses and rationales, evaluated automatically and by native speakers, we uncover performance disparities across languages and tasks critical to intelligent agents, such as object recognition, scene understanding, and relationship understanding. We conducted evaluations of ten VLMs with VLURes. The best performing model, GPT-4o, achieves an overall accuracy of 90.8% and lags human performance by 6.7%, though the gap is larger for open-source models. The gap highlights VLURes' critical role in developing intelligent agents to tackle multi-modal visual reasoning.

[13] Efficient Adaptive Transformer: An Empirical Study and Reproducible Framework

Jan Miller

Main category: cs.CL

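TL;DR: The EAT framework combines three adaptive efficiency techniques and provides an open-source, end-to-end reproducible benchmarking pipeline for input-adaptive inference. On SST-2 it slightly outperforms a DistilBERT baseline.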

Motivation: Researchers have explored many adaptive methods for making Transformer inference more efficient across different inputs while maintaining or improving accuracy, but there is no unified, reproducible framework for evaluating them.

Method: EAT combines progressive token pruning, sparse attention, and dynamic early exiting, and ships an open-source benchmarking pipeline that automates data processing, timing, and ablation experiments.

Result: Although combining these mechanisms can increase latency in shallow six-layer models, EAT achieves slightly higher accuracy on SST-2 than the optimized DistilBERT baseline.

Conclusion: As an open, end-to-end reproducible tool, EAT is meant to support further research on adaptive Transformers and illustrates the potential of dynamic computation for latency-sensitive NLP applications.

Abstract: The Efficient Adaptive Transformer (EAT) framework unifies three adaptive efficiency techniques - progressive token pruning, sparse attention, and dynamic early exiting - into a single, reproducible architecture for input-adaptive inference. EAT provides an open-source benchmarking pipeline that automates data processing, timing, and ablation across GLUE tasks (SST-2, QQP, MNLI). Although this empirical study finds that combining these mechanisms can increase latency in shallow six-layer models, it demonstrates that EAT achieves slightly higher accuracy than the optimized DistilBERT baseline on SST-2, illustrating the potential of dynamic computation for latency-sensitive NLP. The main contribution is the open, end-to-end reproducible framework - complete with scripts, CSV logging, and analysis utilities - intended to serve as a community tool for further research on adaptive transformers.
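Of the three techniques, dynamic early exiting is the easiest to show in a few lines. The sketch below is a generic confidence-threshold exit rule, not EAT's code: it assumes each layer has an intermediate classifier head whose class probabilities are already available, and the function name, threshold, and toy probabilities are hypothetical.

```python
def early_exit_predict(layer_probs, threshold=0.9):
    """Return (predicted_class, layers_used) under a confidence-threshold rule.

    layer_probs: one list of class probabilities per layer, shallowest first,
    as produced by hypothetical intermediate classifier heads. Inference stops
    at the first layer whose top probability reaches `threshold`; if none
    does, the final layer's prediction is used.
    """
    if not layer_probs:
        raise ValueError("need at least one layer of probabilities")
    for depth, probs in enumerate(layer_probs, start=1):
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    return probs.index(max(probs)), depth

# Toy inputs: an "easy" example exits at layer 2, a "hard" one uses all 3.
easy = [[0.55, 0.45], [0.95, 0.05], [0.99, 0.01]]
hard = [[0.50, 0.50], [0.60, 0.40], [0.30, 0.70]]

print(early_exit_predict(easy))
print(early_exit_predict(hard))
```

The extra classifier heads and confidence checks are themselves a cost, which is one plausible reason such mechanisms can add latency in a model that has only six layers to skip.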

[14] A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation

Mohammed Hilal Al-Kharusi,Khizar Hayat,Khalil Bader Al Ruqeishi,Haroon Rashid Lone

Main category: cs.CL

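TL;DR: The paper reviews two decades of automated techniques for evaluating Quranic recitation and finds that current approaches based on Automatic Speech Recognition (ASR) suffer from data dependency, demographic bias, and inadequate feedback. It argues for a knowledge-centric computational framework that exploits the immutability of the Quranic text and the precision of Tajweed rules, proposes anticipatory acoustic modeling based on canonical pronunciation rules and articulation points, and advocates hybrid systems that combine deep linguistic knowledge with advanced audio analysis to produce robust, equitable, and pedagogically sound evaluation tools.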

Motivation: Existing automated recitation evaluation tools fall short in pedagogical effectiveness and adoption, mainly because they reuse ASR architectures that prioritize word recognition over assessing acoustic quality, and because they are limited by biased data and weak diagnostic ability. A technical paradigm better matched to the nature of Tajweed teaching is needed.

Method: The review synthesizes academic research, web platforms, and commercial applications from the past two decades, critiques the limits of current data-driven methods, and proposes a knowledge-centric framework that builds anticipatory acoustic models on the canonical rules of Tajweed (such as Makhraj).

Result: Mainstream methods cannot give useful pedagogical feedback because they rely on biased data and statistical patterns. The review argues that rule-based, knowledge-driven methods have advantages in accuracy, fairness, and interpretability.

Conclusion: Future automated Quranic recitation evaluation should move toward hybrid systems that integrate deep linguistic knowledge with advanced audio analysis in order to provide robust, fair, and pedagogically valuable support for learners worldwide.

Abstract: The sacred practice of Quranic recitation (Tajweed), governed by precise phonetic, prosodic, and theological rules, faces significant pedagogical challenges in the modern era. While digital technologies promise unprecedented access to education, automated tools for recitation evaluation have failed to achieve widespread adoption or pedagogical efficacy. This literature review investigates this critical gap, conducting a comprehensive analysis of academic research, web platforms, and commercial applications developed over the past two decades. Our synthesis reveals a fundamental misalignment in prevailing approaches that repurpose Automatic Speech Recognition (ASR) architectures, which prioritize lexical recognition over qualitative acoustic assessment and are plagued by data dependency, demographic biases, and an inability to provide diagnostically useful feedback. Critiquing these data-driven paradigms, we argue for a foundational paradigm shift towards a knowledge-centric computational framework. Capitalizing on the immutable nature of the Quranic text and the precisely defined rules of Tajweed, we propose that a robust evaluator must be architected around anticipatory acoustic modeling based on canonical rules and articulation points (Makhraj), rather than relying on statistical patterns learned from imperfect and biased datasets. This review concludes that the future of automated Quranic evaluation lies in hybrid systems that integrate deep linguistic knowledge with advanced audio analysis, offering a path toward robust, equitable, and pedagogically sound tools that can faithfully support learners worldwide.

[15] EduDial: Constructing a Large-scale Multi-turn Teacher-Student Dialogue Corpus

Shouang Wei,Min Zhang,Xin Lin,Bo Jiang,Zhongxiang Dai,Kun Kuang

Main category: cs.CL

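TL;DR: The paper presents EduDial, a multi-turn teacher-student dialogue dataset built on Bloom's taxonomy of educational objectives and multiple questioning strategies, covering 345 knowledge points and 34,250 dialogue sessions. The authors train EduDial-LLM 32B on it and propose an 11-dimensional evaluation framework; the model shows clear advantages in a comparison involving 17 mainstream LLMs.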

Motivation: As LLMs are used more widely in intelligent education, dedicated teacher-student dialogue benchmarks are urgently needed to assess their teaching ability; existing dialogue benchmarks fail to capture the instructional interactions of real classrooms.

Method: Guided by Bloom's taxonomy, the authors design EduDial, a multi-turn teacher-student dialogue dataset covering 345 core knowledge points, using ten questioning strategies and generating 34,250 dialogue sessions through interactions between teacher and student agents. Differentiated teaching strategies adapt to students at different cognitive levels. EduDial-LLM 32B is trained on the dataset, and an 11-dimensional evaluation framework systematically measures the teaching ability of LLMs.

Result: Current mainstream LLMs perform poorly in student-centered teaching scenarios, whereas EduDial-LLM 32B significantly outperforms the baselines on all evaluation metrics.

Conclusion: EduDial provides an effective benchmark for evaluating and improving LLM dialogue ability in educational settings, and the proposed model and evaluation framework help advance personalized, high-quality intelligent education.

Abstract: Recently, several multi-turn dialogue benchmarks have been proposed to evaluate the conversational abilities of large language models (LLMs). As LLMs are increasingly recognized as a key technology for advancing intelligent education, owing to their ability to deeply understand instructional contexts and provide personalized guidance, the construction of dedicated teacher-student dialogue benchmarks has become particularly important. To this end, we present EduDial, a comprehensive multi-turn teacher-student dialogue dataset. EduDial covers 345 core knowledge points and consists of 34,250 dialogue sessions generated through interactions between teacher and student agents. Its design is guided by Bloom's taxonomy of educational objectives and incorporates ten questioning strategies, including situational questioning, zone of proximal development (ZPD) questioning, and metacognitive questioning, thus better capturing authentic classroom interactions. Furthermore, we design differentiated teaching strategies for students at different cognitive levels, thereby providing more targeted teaching guidance. Building on EduDial, we further develop EduDial-LLM 32B via training and propose an 11-dimensional evaluation framework that systematically measures the teaching abilities of LLMs, encompassing both overall teaching quality and content quality. Experiments on 17 mainstream LLMs reveal that most models struggle in student-centered teaching scenarios, whereas our EduDial-LLM achieves significant gains, consistently outperforming all baselines across all metrics. The code is available at https://github.com/Mind-Lab-ECNU/EduDial/tree/main.

[16] Who's Asking? Evaluating LLM Robustness to Inquiry Personas in Factual Question Answering

Nil-Jana Akpinar,Chia-Jung Lee,Vanessa Murdock,Pietro Perona

Main category: cs.CL

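TL;DR: The paper presents the first systematic evaluation of LLM robustness to inquiry personas. Information that users disclose about their identity, expertise, or beliefs can change answer accuracy and cause refusals, hallucinations, or role confusion, undermining factual reliability.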

Motivation: LLMs should keep their factual answers objective and consistent across different user backgrounds and avoid bias triggered by what users disclose about themselves.

Method: Inquiry personas that users might plausibly disclose in real-world interactions are introduced as an input variable, and several LLMs are systematically tested for changes in question answering performance.

Result: User personas significantly affect QA accuracy and trigger failure modes such as refusals, hallucinated limitations, and role confusion.

Conclusion: Model sensitivity to user framing can compromise factual reliability, so inquiry persona testing should be an important part of robustness evaluation.

Abstract: Large Language Models (LLMs) should answer factual questions truthfully, grounded in objective knowledge, regardless of user context such as self-disclosed personal information, or system personalization. In this paper, we present the first systematic evaluation of LLM robustness to inquiry personas, i.e. user profiles that convey attributes like identity, expertise, or belief. While prior work has primarily focused on adversarial inputs or distractors for robustness testing, we evaluate plausible, human-centered inquiry persona cues that users disclose in real-world interactions. We find that such cues can meaningfully alter QA accuracy and trigger failure modes such as refusals, hallucinated limitations, and role confusion. These effects highlight how model sensitivity to user framing can compromise factual reliability, and position inquiry persona testing as an effective tool for robustness evaluation.

[17] The Curious Case of Curiosity across Human Cultures and LLMs

Angana Borah,Rada Mihalcea

Main category: cs.CL

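TL;DR: The study introduces CUEST, a framework for measuring how well the expression of curiosity in LLMs aligns with humans across cultures. Current models lean toward Western modes of expression and overlook cultural diversity; fine-tuning strategies narrow the human-model alignment gap by up to 50%, and the study demonstrates the value of curiosity for cross-cultural adaptability in LLMs.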

Motivation: Although LLMs play a growing role in human interaction, how they express curiosity across cultural contexts has not been adequately explored, and there is no systematic framework for measuring human-model alignment in curiosity.

Method: Using the multi-country Yahoo! Answers dataset, the authors combine analysis of linguistic style and topic preference with social science theory to build the CUEST evaluation framework, test cross-cultural curiosity alignment on open- and closed-source models, and explore fine-tuning methods to improve alignment.

Result: Current LLMs express curiosity in a homogenized way that is closer to Western cultural patterns. Specific fine-tuning strategies narrow the human-model alignment gap in curiosity by up to 50%. Curiosity also helps models adapt in cross-cultural settings.

Conclusion: Curiosity is a key factor in the cross-cultural adaptability of LLMs. Current models are culturally biased and need targeted training and evaluation frameworks such as CUEST to represent diverse cultures more consistently and to support fairer, more inclusive NLP systems.

Abstract: Recent advances in Large Language Models (LLMs) have expanded their role in human interaction, yet curiosity -- a central driver of inquiry -- remains underexplored in these systems, particularly across cultural contexts. In this work, we investigate cultural variation in curiosity using Yahoo! Answers, a real-world multi-country dataset spanning diverse topics. We introduce CUEST (CUriosity Evaluation across SocieTies), an evaluation framework that measures human-model alignment in curiosity through linguistic (style) and topic-preference (content) analysis, grounding insights in social science constructs. Across open- and closed-source models, we find that LLMs flatten cross-cultural diversity, aligning more closely with how curiosity is expressed in Western countries. We then explore fine-tuning strategies to induce curiosity in LLMs, narrowing the human-model alignment gap by up to 50%. Finally, we demonstrate the practical value of curiosity for LLM adaptability across cultures, showing its importance for future NLP research.

[18] 3-Model Speculative Decoding

Sanghyun Byun,Mohanad Odema,Jung Ick Guack,Baisub Lee,Jacob Song,Woo Seong Chung

Main category: cs.CL

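TL;DR: PyramidSD inserts an intermediate qualifier model to bridge the distributional gap between the draft and target models in speculative decoding (SD), raising token acceptance rates and generation speed. It reaches 124 tokens per second on a consumer GPU, up to 1.91x faster than standard SD.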

Details Motivation: Standard speculative decoding faces a trade-off between draft model size and token acceptance rate: smaller models are faster but diverge more from the target model, which lowers acceptance rates and limits the speedup. Method: PyramidSD uses a three-tier structure (draft, qualifier, target) combined with a fuzzy acceptance mechanism that relaxes the divergence threshold at each stage, improving alignment across models and allowing smaller draft models. Result: Experiments show PyramidSD reaching up to 1.91x the generation speed of standard SD on an RTX 4090, at 124 tokens/s; in a small-memory setting with a 1B draft model and an 8B target model, it improves throughput with almost no loss in quality. Conclusion: PyramidSD effectively improves the efficiency of speculative decoding, supports smaller draft models, and is practical and compatible with existing inference pipelines. Abstract: Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a trade-off between draft model size and token acceptance: smaller draft models generate tokens more quickly but exhibit greater divergence from the target model, resulting in lower acceptance rates and reduced speedups. We introduce Pyramid Speculative Decoding (PyramidSD), an extension of SD that inserts an intermediate qualifier model between the draft and target to bridge the distributional gap in output predictions, allowing a smaller model to be used for drafting. This hierarchical decoding strategy improves alignment across models, enabling higher acceptance rates and allowing the use of significantly smaller draft models without sacrificing overall performance. PyramidSD builds on fuzzy acceptance criteria to support relaxed divergence thresholds at each stage, improving throughput. In experiments, PyramidSD achieves up to 1.91x generation speed over standard SD, reaching 124 tokens per second on a consumer GPU (RTX 4090). In small-memory settings with a 1B-parameter draft model and an 8B target model, PyramidSD minimally trades target model quality for improved throughput. Overall, PyramidSD offers a practical approach to enhancing speculative decoding efficiency and can be readily applied to existing inference pipelines.
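
To make the draft-then-verify step concrete, here is a minimal sketch of the acceptance rule used in standard two-model speculative decoding. It is not PyramidSD: the qualifier stage and the fuzzy acceptance criteria are not reproduced, and the function name and the dictionary-based distributions are assumptions made for illustration only.

```python
import random


def verify_draft_token(token, draft_probs, target_probs, rng):
    """Verify one drafted token against the target model's distribution.

    Standard speculative-decoding rule: accept the token with probability
    min(1, p_target / p_draft). On rejection, resample from the residual
    distribution max(0, p_target - p_draft), renormalised.
    Both arguments map token -> probability (hypothetical toy format).
    """
    q = draft_probs[token]            # draft probability, > 0 because it was sampled
    p = target_probs.get(token, 0.0)  # target probability of the same token
    if rng.random() < min(1.0, p / q):
        return token, True

    # Rejected: sample a replacement where the target puts more mass than the draft.
    residual = {t: max(0.0, target_probs[t] - draft_probs.get(t, 0.0))
                for t in target_probs}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False


rng = random.Random(0)
same = {"a": 0.5, "b": 0.5}
print(verify_draft_token("a", same, same, rng))  # identical models: always accepted
```

The closer the draft distribution is to the target's, the more often the ratio reaches 1 and the token is kept, which is why a large draft-target gap lowers the acceptance rate described in the abstract.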

[19] A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

João A. Leite,Arnav Arora,Silvia Gargova,João Luz,Gustavo Sampaio,Ian Roberts,Carolina Scarton,Kalina Bontcheva

Main category: cs.CL

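TL;DR: This paper presents the first large-scale, multilingual empirical study of LLM-generated disinformation targeted at specific demographic groups. It introduces the AI-TRAITS dataset and shows that personalised prompts significantly increase jailbreak success rates and make the generated disinformation more persuasive.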

Details Motivation: There are concerns that LLMs can be misused to generate persuasive, personalised disinformation, yet how they behave across demographic groups and languages has not been studied systematically. Method: Using a red teaming methodology, the authors combine 324 disinformation narratives with 150 persona profiles, generate about 1.6 million texts with eight state-of-the-art LLMs to build the AI-TRAITS dataset, and analyse it quantitatively along the dimensions of model, language, jailbreaking rate, and personalisation. Result: Even simple personalised prompts significantly raise the jailbreaking rate of every model tested; personalised prompts also change linguistic and rhetorical patterns and make the false narratives more persuasive. Conclusion: Current LLMs have serious safety vulnerabilities under personalised attacks, and the findings provide a foundation for improving safety alignment and detection strategies in multilingual, cross-demographic settings. Abstract: The human-like proficiency of Large Language Models (LLMs) has brought concerns about their potential misuse for generating persuasive and personalised disinformation at scale. While prior work has demonstrated that LLMs can generate disinformation, specific questions around persuasiveness and personalisation (generation of disinformation tailored to specific demographic attributes) remain largely unstudied. This paper presents the first large-scale, multilingual empirical study on persona-targeted disinformation generation by LLMs. Employing a red teaming methodology, we systematically evaluate the robustness of LLM safety mechanisms to persona-targeted prompts. A key novel result is AI-TRAITS (AI-generaTed peRsonAlIsed disinformaTion dataSet), a new dataset of around 1.6 million texts generated by eight state-of-the-art LLMs. AI-TRAITS is seeded by prompts that combine 324 disinformation narratives and 150 distinct persona profiles, covering four major languages (English, Russian, Portuguese, Hindi) and key demographic dimensions (country, generation, political orientation). The resulting personalised narratives are then assessed quantitatively and compared along the dimensions of models, languages, jailbreaking rate, and personalisation attributes. Our findings demonstrate that the use of even simple personalisation strategies in the prompts significantly increases the likelihood of jailbreaks for all studied LLMs. Furthermore, personalised prompts result in altered linguistic and rhetorical patterns and amplify the persuasiveness of the LLM-generated false narratives. These insights expose critical vulnerabilities in current state-of-the-art LLMs and offer a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.

[20] OPLoRA: Orthogonal Projection LoRA Prevents Catastrophic Forgetting during Parameter-Efficient Fine-Tuning

Yifeng Xiong,Xiaohui Xie

Main category: cs.CL

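TL;DR: OPLoRA constrains LoRA updates with double-sided orthogonal projections so that they do not interfere with the dominant singular directions of the pre-trained model, reducing catastrophic forgetting while maintaining good task performance.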

Details Motivation: When LoRA is used to fine-tune large models efficiently, its updates can interfere with the dominant singular directions and cause catastrophic forgetting, so a method that preserves pre-trained knowledge is needed. Method: OPLoRA decomposes the frozen weights via SVD and uses left and right orthogonal projections P_L and P_R to restrict LoRA updates to the orthogonal complement of the top-k singular subspace, protecting key knowledge. Result: The authors prove that OPLoRA exactly preserves the top-k singular triples; experiments show that it significantly reduces forgetting on commonsense reasoning, mathematics, and code generation tasks and performs well on LLaMA-2 7B and Qwen2.5 7B. Conclusion: Through orthogonal projection, OPLoRA achieves effective knowledge retention and is a promising knowledge-preservation mechanism for parameter-efficient fine-tuning. Abstract: Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large language models but suffers from catastrophic forgetting when learned updates interfere with the dominant singular directions that encode essential pre-trained knowledge. We propose Orthogonal Projection LoRA (OPLoRA), a theoretically grounded approach that prevents this interference through double-sided orthogonal projections. By decomposing frozen weights via SVD, OPLoRA constrains LoRA updates to lie entirely within the orthogonal complement of the top-$k$ singular subspace using projections $P_L = I - U_k U_k^\top$ and $P_R = I - V_k V_k^\top$. We prove that this construction exactly preserves the top-$k$ singular triples, providing mathematical guarantees for knowledge retention. To quantify subspace interference, we introduce $\rho_k$, a metric measuring update alignment with dominant directions. Extensive experiments across commonsense reasoning, mathematics, and code generation demonstrate that OPLoRA significantly reduces forgetting while maintaining competitive task-specific performance on LLaMA-2 7B and Qwen2.5 7B, establishing orthogonal projection as an effective mechanism for knowledge preservation in parameter-efficient fine-tuning.
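
The preservation claim can be checked in a few lines. The derivation below is a sketch based only on the projections stated in the abstract; the symbols $W = U \Sigma V^\top$ for the frozen weight, $\Delta W = BA$ for the raw LoRA update, and $W'$ for the adapted weight are notational assumptions, not necessarily the paper's own.

```latex
% Assumed notation: W = U \Sigma V^\top (frozen weight), \Delta W = BA (LoRA update),
% u_i, v_i the i-th left/right singular vectors, \sigma_i the i-th singular value.
\begin{align*}
W' &= W + P_L \,\Delta W\, P_R,
  \qquad P_L = I - U_k U_k^\top, \quad P_R = I - V_k V_k^\top . \\
\intertext{For any $i \le k$, $v_i$ is a column of $V_k$ and $u_i$ is a column of $U_k$, so}
P_R v_i &= v_i - V_k V_k^\top v_i = v_i - v_i = 0,
  \qquad u_i^\top P_L = u_i^\top - u_i^\top U_k U_k^\top = 0 . \\
\intertext{Therefore the update vanishes on these directions:}
W' v_i &= W v_i + P_L \,\Delta W\, (P_R v_i) = \sigma_i u_i, \\
u_i^\top W' &= u_i^\top W + (u_i^\top P_L)\, \Delta W\, P_R = \sigma_i v_i^\top .
\end{align*}
```

So each $(\sigma_i, u_i, v_i)$ with $i \le k$ remains a singular triple of $W'$. The sketch does not address whether these remain the largest singular values, which depends on how large the projected update is in the complementary subspace.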

[21] CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models

Pavan Kalyan,Shubhra Mishra,Satya Lokam,Navin Goyal

Main category: cs.CL

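TL;DR: This paper presents CurlL, a continual learning dataset and benchmark grounded in human developmental trajectories from ages 5-10. It comprises five developmental stages, a skill graph, and 23.4 billion tokens of synthetic data for systematically evaluating forgetting and transfer as language models acquire skills.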

Details Motivation: Existing continual learning evaluations lack fine-grained modelling of human learning and a systematic evaluation framework, so they do not faithfully reflect how models behave during progressive skill acquisition. Method: The authors build a skill graph spanning five developmental stages that decomposes broad skills into specific abilities, goals, and measurable indicators, and generate a 23.4B-token synthetic dataset with controlled skill progression, vocabulary complexity, and format diversity; they evaluate a 135M-parameter Transformer under independent, joint, and sequential training setups. Result: A large-scale staged synthetic dataset was generated (2.12B to 6.78B tokens per stage), and the experiments show the trade-offs among training setups in skill retention, forward transfer, and backward transfer. Conclusion: By mirroring human learning patterns and providing fine-grained control over skill dependencies, CurlL advances the evaluation of continual learning in language models and offers a systematic benchmark for future research. Abstract: We introduce a comprehensive continual learning dataset and benchmark (CurlL) grounded in human developmental trajectories from ages 5-10, enabling systematic and fine-grained assessment of models' ability to progressively acquire new skills. CurlL spans five developmental stages (0-4) covering ages 5-10, supported by a skill graph that breaks down broad skills into smaller abilities, concrete goals, and measurable indicators, while also capturing which abilities build on others. We generate a 23.4B-token synthetic dataset with controlled skill progression, vocabulary complexity, and format diversity, comprising paragraphs, comprehension-based QA (CQA), skill-testing QA (CSQA), and instruction-response (IR) pairs. Stage-wise token counts range from 2.12B to 6.78B tokens, supporting precise analysis of forgetting, forward transfer, and backward transfer. Using a 135M-parameter transformer trained under independent, joint, and sequential (continual) setups, we show trade-offs in skill retention and transfer efficiency. By mirroring human learning patterns and providing fine-grained control over skill dependencies, this work advances continual learning evaluations for language models.

[22] On the Role of Preference Variance in Preference Optimization

Jiacheng Guo,Zihao Li,Jiahao Qiu,Yue Wu,Mengdi Wang

Main category: cs.CL

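TL;DR: This paper studies the role of preference variance (PVar) in Direct Preference Optimization (DPO), proposes PVar as a measure of how informative a prompt is, and shows that high-PVar prompts significantly improve LLM alignment.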

Details Motivation: Collecting human preference data is costly, so methods that reduce annotation requirements are needed; this paper explores how preference variance can identify the samples most valuable for learning and thereby make DPO more efficient. Method: The authors introduce the notion of preference variance (PVar), derive an upper bound on the DPO gradient norm that is controlled by PVar, generate preference data with a reward model, and validate the idea by fine-tuning LLMs on prompts at different PVar levels. Result: High-PVar prompts yield better evaluation performance than random or low-PVar prompts; training on only the top 10% highest-PVar prompts in UltraFeedback outperforms training on the full dataset, and the selection method remains robust with small reward models. Conclusion: Preference variance (PVar) is a key indicator of how useful a DPO training sample is and can be used to select informative examples efficiently, improving both the efficiency and the quality of LLM alignment. Abstract: Direct Preference Optimization (DPO) has emerged as an important approach for learning from human preferences in aligning large language models (LLMs). However, collecting human preference data is costly and inefficient, motivating methods to reduce the required annotations. In this work, we investigate the impact of preference variance (PVar), which measures the variance in model preferences when comparing pairs of responses, on the effectiveness of DPO training. We provide a theoretical insight by establishing an upper bound on the DPO gradient norm for any given prompt, showing it is controlled by the PVar of that prompt. This implies that prompts with low PVar can only produce small gradient updates, making them less valuable for learning. We validate this finding by fine-tuning LLMs with preferences generated by a reward model, evaluating on two benchmarks (AlpacaEval 2.0 and Arena-Hard). Experimental results demonstrate that prompts with higher PVar outperform randomly selected prompts or those with lower PVar. We also show that our PVar-based selection method is robust when using smaller reward models (1B, 3B) for selection. Notably, in a separate experiment using the original human annotations from the UltraFeedback dataset, we found that training on only the top 10% of prompts with the highest PVar yields better evaluation performance than training on the full dataset, highlighting the importance of preference variance in identifying informative examples for efficient LLM alignment.
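
The selection idea can be sketched in a few lines. The code below is an illustration under stated assumptions, not the paper's implementation: it assumes a Bradley-Terry preference probability computed from scalar reward scores, defines PVar for a prompt as the variance of that probability over ordered response pairs, and uses hypothetical function names throughout.

```python
import math
from itertools import permutations


def preference_prob(r_i, r_j):
    """Assumed Bradley-Terry probability that response i is preferred over j."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))


def preference_variance(rewards):
    """Variance of the preference probability over ordered response pairs.

    Over ordered pairs the mean probability is 1/2 by symmetry, so the
    variance reduces to the mean squared deviation from 1/2.
    """
    probs = [preference_prob(a, b) for a, b in permutations(rewards, 2)]
    if not probs:
        return 0.0
    return sum((p - 0.5) ** 2 for p in probs) / len(probs)


def select_top_pvar(prompt_rewards, fraction=0.10):
    """Keep the given fraction of prompts with the highest PVar (at least one)."""
    ranked = sorted(prompt_rewards,
                    key=lambda name: preference_variance(prompt_rewards[name]),
                    reverse=True)
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:keep]


# Toy reward scores for sampled responses to three hypothetical prompts.
scores = {
    "flat":   [1.0, 1.0, 1.0],    # responses indistinguishable: PVar is zero
    "spread": [-3.0, 0.0, 3.0],   # clear winners and losers: high PVar
    "close":  [0.0, 0.1, 0.2],    # nearly tied: low PVar
}
print(select_top_pvar(scores, fraction=0.3))
```

A prompt whose sampled responses all receive the same reward has zero PVar, which matches the abstract's point that such prompts can produce only small gradient updates.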

[23] GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models

Chen Zheng,Yuhang Cai,Deyi Liu,Jin Ma,Yiyuan Ma,Yuan Yang,Jing Liu,Yutao Zeng,Xun Zhou,Siyuan Qiao

Main category: cs.CL

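TL;DR: Proposes GatePro, a parameter-free method that introduces localized competition to promote diversity in expert selection, addressing functional redundancy in MoE models.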

Details Motivation: When MoE architectures are scaled, functionally similar experts cause redundant computation, and conventional balance-loss methods do not address the underlying lack of expert diversity. Method: GatePro identifies the most similar expert pairs and introduces a localized competition mechanism that prevents redundant experts from being activated together, without adding learnable parameters. Result: Experiments show that GatePro improves expert diversity across model scales and benchmarks, so that experts develop more distinct and complementary capabilities and functional redundancy is reduced. Conclusion: GatePro is an effective, hot-swappable method that needs no extra parameters and significantly improves the efficiency and capacity utilization of MoE models. Abstract: Modern large language models leverage Mixture-of-Experts (MoE) architectures for efficient scaling, but face a critical challenge: functionally similar experts are often selected simultaneously, creating redundant computation and limiting effective model capacity. Existing auxiliary balance loss methods improve token distribution but fail to address the underlying expert diversity problem. We introduce GatePro, a novel parameter-free method that directly promotes expert selection diversity. GatePro identifies the most similar expert pairs and introduces localized competition mechanisms, preventing redundant expert co-activation while maintaining natural expert specialization. Our comprehensive evaluation demonstrates GatePro's effectiveness across model scales and benchmarks. Analysis demonstrates GatePro's ability to achieve enhanced expert diversity, where experts develop more distinct and complementary capabilities, avoiding functional redundancy. This approach is hot-swappable: it can be deployed during any training phase without additional learnable parameters, offering a practical solution for improving MoE effectiveness.

[24] ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models

Mingda Li,Xinyu Li,Weinan Zhang,Longxuan Ma

Main category: cs.CL

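TL;DR: Proposes a grey-box uncertainty quantification method based on semantic-preserving intervention, linking the uncertainty of large language models to their invariance from a causal perspective; experiments show the method is both effective and computationally efficient.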

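Details Motivation: Quantifying the uncertainty of large language models is challenging, and existing methods struggle to assess it accurately, so a new method that is both efficient and effective is needed. Method: From a causal perspective, the authors connect a model's invariance under semantic-preserving intervention to its uncertainty and propose a grey-box uncertainty quantification method that estimates epistemic uncertainty by measuring how the model's outputs change before and after the intervention, with theoretical justification. Result: Experiments across several LLMs and question-answering datasets show that the method outperforms existing approaches in the effectiveness of uncertainty estimation while remaining computationally efficient. Conclusion: The proposed method effectively estimates the epistemic uncertainty of LLMs and offers a practical uncertainty quantification approach for improving model reliability. Abstract: Uncertainty Quantification (UQ) is a promising approach to improve model reliability, yet quantifying the uncertainty of Large Language Models (LLMs) is non-trivial. In this work, we establish a connection between the uncertainty of LLMs and their invariance under semantic-preserving intervention from a causal perspective. Building on this foundation, we propose a novel grey-box uncertainty quantification method that measures the variation in model outputs before and after the semantic-preserving intervention. Through theoretical justification, we show that our method provides an effective estimate of epistemic uncertainty. Our extensive experiments, conducted across various LLMs and a variety of question-answering (QA) datasets, demonstrate that our method excels not only in terms of effectiveness but also in computational efficiency.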

[25] Multi-Label Clinical Text Eligibility Classification and Summarization System

Surya Tejaswi Yerramsetty,Almas Fathimah

Main category: cs.CL

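TL;DR: Proposes a system that combines natural language processing and large language models to automate multi-label clinical text eligibility classification and summarization, improving the efficiency of clinical trial eligibility assessment.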

Details Motivation: Clinical trials need participants with appropriate and diverse medical backgrounds, but eligibility screening is laborious, so automated methods are needed to improve efficiency and accuracy. Method: The system extracts features with word embeddings (Word2Vec), named entity recognition, and TF-IDF together with its weighted variants, performs multi-label classification with Random Forest and SVM, and summarizes eligibility criteria with TextRank, Luhn, and GPT-3. Result: Evaluation with ROUGE scores indicates that the proposed methods are effective on both classification and summarization, with GPT-3 in particular producing concise and accurate eligibility summaries. Conclusion: The system can support automated assessment of clinical trial eligibility and helps improve the efficiency and scalability of medical research. Abstract: Clinical trials are central to medical progress because they help improve understanding of human health and the healthcare system. They play a key role in discovering new ways to detect, prevent, or treat diseases, and it is essential that clinical trials include participants with appropriate and diverse medical backgrounds. In this paper, we propose a system that leverages Natural Language Processing (NLP) and Large Language Models (LLMs) to automate multi-label clinical text eligibility classification and summarization. The system combines feature extraction methods such as word embeddings (Word2Vec) and named entity recognition to identify relevant medical concepts, along with traditional vectorization techniques such as count vectorization and TF-IDF (Term Frequency-Inverse Document Frequency). We further explore weighted TF-IDF word embeddings that integrate both count-based and embedding-based strengths to capture term importance effectively. Multi-label classification using Random Forest and SVM models is applied to categorize documents based on eligibility criteria. Summarization techniques including TextRank, Luhn, and GPT-3 are evaluated to concisely summarize eligibility requirements. Evaluation with ROUGE scores demonstrates the effectiveness of the proposed methods. This system shows potential for automating clinical trial eligibility assessment using data-driven approaches, thereby improving research efficiency.

[26] Stable LLM Ensemble: Interaction between Example Representativeness and Diversity

Junichiro Niimi

Main category: cs.CL

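TL;DR: This paper studies how example representativeness (the one-shot strategy) and output diversity (sampling temperature) affect the performance of large language model (LLM) ensembles. It proposes centroid-based selection of representative examples which, combined with higher-temperature sampling, significantly outperforms both random example selection and 5-shot prompting.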

Details Motivation: The accuracy and robustness of one-shot LLM predictions depend heavily on the choice of example and on the diversity among ensemble members, but how best to balance the two is unclear. Method: Centroid-based representative examples are compared with randomly sampled examples as one-shot prompts, and sampling is run at different temperatures to control output diversity, measuring the effect on ensemble performance. Result: The proposed method improves macro-F1 by +7.6% and lowers RMSE by 10.5% relative to random selection; it also beats 5-shot prompting, with macro-F1 up by +21.1% and RMSE down by 24.0%. Conclusion: Combining representative example selection with higher-temperature sampling effectively improves LLM ensemble performance, underlining the key role of example selection and controlled diversity in designing one-shot ensembles. Abstract: Large language models (LLMs) have achieved remarkable results in a wide range of domains. However, the accuracy and robustness of one-shot LLM predictions remain highly sensitive to the examples and the diversity among ensemble members. This study systematically investigates the effects of example representativeness (one-shot strategy) and output diversity (sampling temperature) on LLM ensemble performance. Two one-shot strategies are compared: centroid-based representative examples (proposed) and randomly sampled examples (baseline); sampling temperature is also varied. The proposed approach with higher temperature setting significantly outperforms random selection by +7.6% (macro-F1) and -10.5% (RMSE). Furthermore, the proposed model exceeds 5-shot prompting by +21.1% (macro-F1) and -24.0% (RMSE). Our findings demonstrate that combining representative example selection with increased temperature provides the appropriate level of diversity to the ensemble. This work highlights the practical importance of both example selection and controlled diversity in designing effective one-shot LLM ensembles.
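
One simple reading of "centroid-based representative examples" is to embed the candidate examples and pick the one closest to their mean. The sketch below illustrates that reading only; the abstract does not specify the embedding model, the distance measure, or whether centroids are computed per class or per cluster, so those choices and the function names are assumptions.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def most_representative(vectors):
    """Index of the vector closest (Euclidean distance) to the centroid."""
    c = centroid(vectors)
    distances = [math.dist(v, c) for v in vectors]
    return min(range(len(vectors)), key=distances.__getitem__)


# Toy 2-D "embeddings" of four candidate one-shot examples (hypothetical values).
embeddings = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]
best = most_representative(embeddings)
print(best, embeddings[best])
```

The chosen example would then serve as the single demonstration in the prompt, while diversity across ensemble members comes from sampling at a higher temperature rather than from varying the example.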

[27] I Am Aligned, But With Whom? MENA Values Benchmark for Evaluating Cultural Alignment and Multilingual Bias in LLMs

Pardis Sadat Zahraei,Ehsaneddin Asgari

Main category: cs.CL

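TL;DR: MENAValues is a new benchmark for evaluating the cultural alignment and multilingual bias of large language models with respect to the Middle East and North Africa (MENA) region. It reveals key phenomena including cross-lingual value shifts, reasoning-induced degradation of cultural alignment, and logit leakage.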

Details Motivation: Current AI evaluation under-represents the cultures of the Middle East and North Africa, so a benchmark that reflects the region's diverse values is needed. Method: A structured dataset is built from large-scale, authoritative human survey data, covering population-level response distributions from 16 countries; multiple models are evaluated under conditions formed by crossing three perspective framings (neutral, personalized, third-person/cultural observer) with two language modes (English and native languages: Arabic, Persian, Turkish). Result: Three key phenomena are identified, namely "Cross-Lingual Value Shifts", "Reasoning-Induced Degradation", and "Logit Leakage"; in addition, when operating in native languages, models collapse different countries into a single category and ignore cultural diversity. Conclusion: MENAValues provides a scalable framework for diagnosing cultural misalignment, offering empirical insights and methodological tools for building more culturally inclusive AI systems. Abstract: We introduce MENAValues, a novel benchmark designed to evaluate the cultural alignment and multilingual biases of large language models (LLMs) with respect to the beliefs and values of the Middle East and North Africa (MENA) region, an underrepresented area in current AI evaluation efforts. Drawing from large-scale, authoritative human surveys, we curate a structured dataset that captures the sociocultural landscape of MENA with population-level response distributions from 16 countries. To probe LLM behavior, we evaluate diverse models across multiple conditions formed by crossing three perspective framings (neutral, personalized, and third-person/cultural observer) with two language modes (English and localized native languages: Arabic, Persian, Turkish). Our analysis reveals three critical phenomena: "Cross-Lingual Value Shifts" where identical questions yield drastically different responses based on language, "Reasoning-Induced Degradation" where prompting models to explain their reasoning worsens cultural alignment, and "Logit Leakage" where models refuse sensitive questions while internal probabilities reveal strong hidden preferences. We further demonstrate that models collapse into simplistic linguistic categories when operating in native languages, treating diverse nations as monolithic entities. MENAValues offers a scalable framework for diagnosing cultural misalignment, providing both empirical insights and methodological tools for developing more culturally inclusive AI.

[28] Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference

Nikhil Bhendawade,Kumari Nishu,Arnav Kundu,Chris Bartels,Minsik Cho,Irina Belousova

Main category: cs.CL

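TL;DR: Mirror-SD is a new inference algorithm that substantially speeds up large language model inference through parallel heterogeneous execution and multi-token speculative streaming, achieving 2.8x-5.8x end-to-end speedups.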

Details Motivation: Existing speculative decoding methods are limited by the latency of autoregressive draft generation, which makes it hard to raise acceptance rates while lowering latency and creates a speed-accuracy trade-off. Method: The Mirror-SD algorithm launches branch rollouts in parallel from early-exit signals and explicitly maps computation across heterogeneous accelerators such as GPUs and NPUs; it also adds speculative streaming so that the draft model emits multiple tokens per step, forming two complementary execution pipelines. Result: On models from 14B to 66B parameters, Mirror-SD achieves 2.8x-5.8x wall-time speedups on SpecBench, an average 30% relative improvement over the strongest baseline, EAGLE3. Conclusion: Mirror-SD breaks the latency-acceptance trade-off in speculative decoding and, through cross-device parallelism and multi-token streaming, moves speculative decoding closer to its ideal regime. Abstract: Speculative decoding accelerates LLM inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size elevates acceptance rates but introduces additional latency overhead exacerbating the speed-accuracy tradeoff. Prior methods (Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from early-exit signals in parallel with the target model's suffix and explicitly maps computation across heterogeneous accelerators (GPU and NPU) to exploit cross-device parallelism. The draft speculates forward continuations for the target to verify, while the target simultaneously speculates correction paths for the draft, converting speculation into two complementary execution pipelines. To further cut draft latency without weakening acceptance semantics, we add speculative streaming so the draft emits multiple tokens per step. This dual strategy of parallel heterogeneous execution plus multi-token speculative streaming pushes speculative decoding toward its ideal regime of high acceptance with low overhead. On SpecBench with server-scale models from 14B to 66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving 2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative improvement over the strongest baseline, EAGLE3.

[29] A Matter of Representation: Towards Graph-Based Abstract Code Generation

Nyx Iskandar,Hisham Bedri,Andy Tsen

Main category: cs.CL

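TL;DR: This paper proposes JSON representations for graph-based abstract code generation and, on ScratchTest, a benchmark built on a custom Python re-implementation of Scratch, shows that large language models (LLMs) can generate such code in a single pass given a suitable representation. The choice of representation significantly affects accuracy, laying groundwork for representation learning in graph-based abstract code generation.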

Details Motivation: Existing LLMs are good at generating linear code, but graph-based abstract code generation has received little attention, and effective methods are lacking for visual programming languages and for settings where the source code is inaccessible; representations suited to graph structures are therefore needed. Method: The authors design and evaluate several JSON-format graph representations and use their own ScratchTest benchmark (based on a Python re-implementation of Scratch) to test LLM generation in code graph space, without relying on complex auxiliary pipelines. Result: With a suitable graph representation, LLMs can generate graph-structured code efficiently in a single inference pass; different representations lead to significantly different accuracies, showing that representation plays a key role in this task. Conclusion: The right graph representation effectively supports graph-based abstract code generation by LLMs, and this work offers a preliminary but important step toward representation learning in this direction. Abstract: Most large language models (LLMs) today excel at generating raw, sequential code with minimal abstractions and custom structures. However, there has been little work on graph-based abstract code generation, where significant logic is encapsulated in predefined nodes and execution flow is determined by edges. This is relevant for visual programming languages, and in cases where raw source code is inaccessible to users and LLM training sets. In this work, we propose and evaluate JSON representations for graphs to enable high-accuracy graph-based abstract code generation. We evaluate these representations on ScratchTest, a mini-benchmark based on our custom Python re-implementation of Scratch, which tests the LLM in code graph space. Our findings demonstrate that LLMs can indeed perform the aforementioned generation task in a single pass without relying on specialized or complex pipelines, given the correct graph representations. We also show that different representations induce significantly different accuracies, highlighting the instrumental role of representations in this generation task. All in all, this work establishes the first steps towards representation learning for graph-based abstract code generation.

[30] CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning

Kehua Feng,Keyan Ding,Zhihui Zhu,Lei Liang,Qiang Zhang,Huajun Chen

Main category: cs.CL

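TL;DR: Proposes CoT-Evo, an evolution-inspired chain-of-thought distillation framework that produces high-quality scientific reasoning data by fusing reasoning trajectories from multiple models, enriching them with knowledge, and refining them iteratively, significantly improving the reasoning ability of small models in scientific domains.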

Details Motivation: Existing chain-of-thought distillation methods rely on reasoning trajectories generated by large models; in scientific domains these are limited by the large models' own knowledge gaps and reasoning errors, which makes it hard to produce high-quality training data. Method: The framework builds a diverse pool of reasoning trajectories from multiple LLMs, enriches them with automatically retrieved domain knowledge, and refines them through an iterative evolutionary process of novelty-driven selection, reflective recombination, and mutation, guided by a fitness function covering correctness, coherence, and knowledge utilization. Result: A high-quality chain-of-thought dataset for scientific reasoning is generated and used to fine-tune a compact model, which reaches state-of-the-art performance on several scientific reasoning benchmarks. Conclusion: CoT-Evo offers a scalable and effective way to synthesize high-fidelity scientific reasoning data from diverse but unreliable large models, significantly improving small-model performance on complex scientific tasks. Abstract: While chain-of-thought (CoT) distillation from advanced large language models (LLMs) has proven effective in general reasoning tasks, it struggles in scientific domains where even advanced models often produce incorrect or superficial reasoning due to high complexity and specialized knowledge requirements. Directly distilling from such flawed outputs results in low-quality training data and limits the performance of smaller student models. To overcome this, we propose CoT-Evo, an evolutionary CoT distillation framework. It begins by constructing a diverse pool of reasoning trajectories from multiple LLM thinkers, enriches them with automatically retrieved domain knowledge, and iteratively refines the trajectories using novelty-driven selection, reflective recombination and mutation. The refinement is guided by a fitness function that evaluates answer correctness, coherence, and effective knowledge utilization. This results in a high-quality CoT dataset tailored for scientific reasoning. We employ this evolved dataset to fine-tune a compact model, which achieves state-of-the-art performance on scientific reasoning benchmarks. Our work establishes a scalable approach to synthesizing high-fidelity scientific reasoning data from diverse and fallible LLMs.

[31] Putting on the Thinking Hats: A Survey on Chain of Thought Fine-tuning from the Perspective of Human Reasoning Mechanism

Xiaoshu Chen,Sihang Zhou,Ke Liang,Duanyang Yuan,Haoyuan Chen,Xiaoyu Sun,Linyuan Meng,Xinwang Liu

Main category: cs.CL

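TL;DR: This paper presents the first comprehensive survey of chain-of-thought (CoT) fine-tuning grounded in human reasoning mechanisms. It uses the Six Thinking Hats framework to classify and analyze existing methods, discusses future research directions, compiles related datasets and model performance, and maintains a continuously updated GitHub repository.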

Details Motivation: Existing work on CoT fine-tuning focuses mostly on technical aspects and lacks a systematic analysis from the standpoint of human cognition; since the goal of CoT is to give LLMs human-like reasoning ability, a systematic review grounded in theories of human thinking is needed. Method: Inspired by the Six Thinking Hats framework, the survey divides common human thinking modes into six categories and uses them as a lens to classify and analyze existing CoT fine-tuning methods, while also summarizing datasets and model performance and proposing future research directions. Result: The survey establishes the first taxonomy of CoT fine-tuning based on human reasoning theory, compiles existing datasets and model performance, and maintains a continuously updated GitHub repository that tracks progress in the field. Conclusion: Drawing on human cognitive theory, and the Six Thinking Hats framework in particular, allows CoT fine-tuning methods to be understood and developed more systematically, offering a new perspective and research path toward models with more human-like reasoning. Abstract: Chain of thought (CoT) fine-tuning aims to endow large language models (LLMs) with reasoning capabilities by training them on curated reasoning traces. It leverages both supervised and reinforced fine-tuning to cultivate human-like reasoning skills in LLMs, including detailed planning, divergent thinking, intuitive judgment, timely reflection, internal thinking, and fact perception, etc. As CoT fine-tuning has advanced, LLMs have demonstrated substantial improvements in tasks such as mathematical reasoning and code generation. However, existing surveys about CoT fine-tuning primarily focus on technical aspects and overlook a systematic analysis from the perspective of human reasoning mechanisms. Given that the ultimate goal of CoT fine-tuning is to enable LLMs to reason like humans, it is crucial to investigate this technique through the lens of human cognition. To fill this gap, we present the first comprehensive survey of CoT fine-tuning grounded in human reasoning theory. Specifically, inspired by the well-known Six Thinking Hats framework, which systematically characterizes common human thinking modes using six metaphorical hats, we classify and examine CoT fine-tuning methods through this lens. Furthermore, building upon this theory, we outline potential directions for future research in CoT fine-tuning. In addition, we compile a comprehensive overview of existing datasets and model performances, and we maintain a continuously updated GitHub repository (https://github.com/AI-Chen/Awesome-CoT-Finetuning) that tracks recent advances in this area. We hope this survey will serve as a valuable resource to inspire innovation and foster progress in this rapidly evolving field.

[32] DSCD: Large Language Model Detoxification with Self-Constrained Decoding

Ming Dong,Jinkui Zhang,Bolong Zheng,Xinhui Tu,Po Hu,Tingting He

Main category: cs.CL

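TL;DR: Proposes DSCD, a self-constrained decoding method for detoxification that requires no parameter fine-tuning and adjusts the token distributions of internal layers to reduce toxicity while improving generation safety and fluency.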

Details Motivation: Existing decoding-time detoxification methods depend on external constraints, which add resource overhead and hurt generation fluency; a more efficient and more compatible method is needed. Method: During generation, DSCD strengthens the next-token distribution of the safety layer while weakening those of the hallucination and toxic layers, achieving self-constrained detoxified decoding. Result: DSCD is validated on several open-source LLMs and public datasets, achieving SOTA detoxification and generation fluency with higher efficiency. Conclusion: DSCD is a lightweight, highly compatible, plug-and-play detoxification method with potential for practical use and large-scale deployment. Abstract: Detoxification in large language models (LLMs) remains a significant research challenge. Existing decoding detoxification methods are all based on external constraints, which require additional resource overhead and lose generation fluency. This work proposes Detoxification with Self-Constrained Decoding (DSCD), a novel method for LLM detoxification without parameter fine-tuning. DSCD strengthens the inner next-token distribution of the safety layer while weakening that of hallucination and toxic layers during output generation. This effectively diminishes toxicity and enhances output safety. DSCD is lightweight, highly compatible, and plug-and-play, readily integrating with existing detoxification methods for further performance improvement. Extensive experiments on representative open-source LLMs and public datasets validate DSCD's effectiveness, demonstrating state-of-the-art (SOTA) performance in both detoxification and generation fluency, with superior efficiency compared to existing methods. These results highlight DSCD's potential as a practical and scalable solution for safer LLM deployments.

[33] SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs

Juan Ren,Mark Dras,Usman Naseem

Main category: cs.CL

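TL;DR: Proposes SHIELD, a lightweight, model-agnostic preprocessing framework for making large vision-language models safer; fine-grained classification and category-specific guidance effectively reduce jailbreak and non-following rates.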

Details Motivation: Large vision-language models (LVLMs) offer strong multimodal reasoning but are vulnerable to adversarial inputs hidden in benign prompts, so finer-grained safety mechanisms are needed. Method: The SHIELD framework couples fine-grained safety classification with category-specific guidance and uses explicit actions (Block, Reframe, Forward) to compose tailored safety prompts, refusing or safely redirecting harmful content without retraining. Result: Across five benchmarks and five representative LVLMs, SHIELD significantly lowers jailbreak and non-following rates while preserving the models' original performance, with low overhead and plug-and-play use. Conclusion: SHIELD is a practical, extensible safety patch that applies to both weakly and strongly aligned LVLMs and can counter a variety of adversarial attacks. Abstract: Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirection without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types -- serving as a practical safety patch for both weakly and strongly aligned LVLMs.

[34] Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation

Jiamin Chen,Yuchen Li,Xinyu Ma,Xinran Chen,Xiaokun Zhang,Shuaiqiang Wang,Chen Ma,Dawei Yin

Main category: cs.CL

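TL;DR: This paper studies how context format affects large language model performance in retrieval-augmented generation (RAG) and proposes a method called Contextual Normalization to improve long-context utilization and robustness to ordering.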

Details Motivation: Prior research has focused on retrieval quality and prompting strategies, while the effect of how retrieved documents are presented (the context format) on model performance remains underexplored; this paper aims to analyze that effect systematically and propose an improvement. Method: Controlled experiments vary context density, delimiter style, and positional placement and analyze the effect on performance; based on the findings, the authors propose Contextual Normalization to standardize context representations. Result: Small changes in context format, such as delimiters, significantly affect accuracy and stability; the proposed Contextual Normalization improves robustness to order variation and long-context utilization across a range of RAG benchmarks. Conclusion: Reliable RAG depends not only on what is retrieved but also on how it is presented; Contextual Normalization is an effective, lightweight strategy that improves long-context reasoning. Abstract: Retrieval-Augmented Generation (RAG) has become an essential approach for extending the reasoning and knowledge capacity of large language models (LLMs). While prior research has primarily focused on retrieval quality and prompting strategies, the influence of how the retrieved documents are framed, i.e., context format, remains underexplored. We show that seemingly superficial choices, such as delimiters or structural markers in key-value extraction, can induce substantial shifts in accuracy and stability, even when semantic content is identical. To systematically investigate this effect, we design controlled experiments that vary context density, delimiter styles, and positional placement, revealing the underlying factors that govern performance differences. Building on these insights, we introduce Contextual Normalization, a lightweight strategy that adaptively standardizes context representations before generation. Extensive experiments on both controlled and real-world RAG benchmarks across diverse settings demonstrate that the proposed strategy consistently improves robustness to order variation and strengthens long-context utilization. These findings underscore that reliable RAG depends not only on retrieving the right content, but also on how that content is presented, offering both new empirical evidence and a practical technique for better long-context reasoning.

[35] StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation

Xi Chen,Yuchen Song,Satoshi Nakamura

Main category: cs.CL

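TL;DR: Proposes an LLM-based stress-aware speech-to-speech translation system that preserves word-level emphasis by converting source-language stress into target-language tags, with automatic data generation and LLM-as-Judge evaluation for efficient training and assessment.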

Details Motivation: Conventional S2ST systems often ignore prosodic information such as intonation and stress, so translations lose emotion and intent; preserving the speaker's emphasis requires handling word-level stress during translation. Method: LLMs perform cross-lingual emphasis conversion, turning source-language stress into target-language control tags that guide a controllable TTS model; a purpose-built pipeline generates aligned training data automatically, and LLM-as-Judge is used for automatic evaluation. Result: The method substantially outperforms baselines at preserving word-level emphasis while keeping translation quality, speaker intent, and naturalness comparable. Conclusion: Prosody matters in speech translation, and the method offers an effective, data-efficient way to preserve paralinguistic cues when data is scarce. Abstract: We propose a stress-aware speech-to-speech translation (S2ST) system that preserves word-level emphasis by leveraging LLMs for cross-lingual emphasis conversion. Our method translates source-language stress into target-language tags that guide a controllable TTS model. To overcome data scarcity, we developed a pipeline to automatically generate aligned training data and introduce the "LLM-as-Judge" for evaluation. Experiments show our approach substantially outperforms baselines in preserving emphasis while maintaining comparable translation quality, speaker intent, and naturalness. Our work highlights the importance of prosody in translation and provides an effective, data-efficient solution for preserving paralinguistic cues in S2ST.

[36] Text Anomaly Detection with Simplified Isolation Kernel

Yang Cao,Sikun Yang,Yujiu Yang,Lianyong Qi,Ming Liu

Main category: cs.CL

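TL;DR: Proposes the Simplified Isolation Kernel (SIK), which maps high-dimensional dense LLM embeddings to lower-dimensional sparse representations, substantially cutting computation and memory cost while preserving detection performance.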

Details Motivation: High-dimensional dense embeddings consume substantial memory and computation time, which limits practical text anomaly detection. Method: SIK, which has linear time complexity, converts high-dimensional embeddings into lower-dimensional sparse representations through a boundary-focused feature mapping. Result: Across 7 datasets, SIK outperforms 11 state-of-the-art anomaly detection algorithms in detection performance while keeping memory cost low and computation efficient. Conclusion: SIK balances detection performance and efficiency and suits practical anomaly detection on LLM embeddings. Abstract: Two-step approaches combining pre-trained large language model embeddings and anomaly detectors demonstrate strong performance in text anomaly detection by leveraging rich semantic representations. However, high-dimensional dense embeddings extracted by large language models pose challenges due to substantial memory requirements and high computation time. To address this challenge, we introduce the Simplified Isolation Kernel (SIK), which maps high-dimensional dense embeddings to lower-dimensional sparse representations while preserving crucial anomaly characteristics. SIK has linear time complexity and significantly reduces space complexity through its innovative boundary-focused feature mapping. Experiments across 7 datasets demonstrate that SIK achieves better detection performance than 11 state-of-the-art (SOTA) anomaly detection algorithms while maintaining computational efficiency and low memory cost. All code and demonstrations are available at https://github.com/charles-cao/SIK.

[37] LLM-Guided Synthetic Augmentation (LGSA) for Mitigating Bias in AI Systems

Sai Suhruth Reddy Karri,Yashwanth Sai Nallapuneni,Laxmi Narasimha Reddy Mallireddy,Gopichand G

Main category: cs.CL

TL;DR: Proposes LLM-Guided Synthetic Augmentation (LGSA), which mitigates gender bias in NLP by generating counterfactual examples, improving both fairness and accuracy.

Details Motivation: Existing fairness methods depend on protected-attribute labels, trade accuracy against fairness, and generalize poorly across datasets; a debiasing method that preserves accuracy is needed. Method: LLMs generate counterfactual text for underrepresented groups (for example, gender-swapped sentences) from structured prompts, followed by multi-step quality control (semantic similarity checks, attribute verification, toxicity screening, and human spot checks) before the data is used for augmentation. Result: On a gendered sentence classification task, LGSA reduces the baseline's gender bias gap from 7.2% to 1.9% and raises accuracy from 96.7% to 99.1%; simple swap augmentation reaches a smaller gap (0.7%) but lowers accuracy to 95.6%. Conclusion: LGSA mitigates bias in NLP systems, improving performance on underrepresented groups without sacrificing accuracy. Abstract: Bias in AI systems, especially those relying on natural language data, raises ethical and practical concerns. Underrepresentation of certain groups often leads to uneven performance across demographics. Traditional fairness methods, such as pre-processing, in-processing, and post-processing, depend on protected-attribute labels, involve accuracy-fairness trade-offs, and may not generalize across datasets. To address these challenges, we propose LLM-Guided Synthetic Augmentation (LGSA), which uses large language models to generate counterfactual examples for underrepresented groups while preserving label integrity. We evaluated LGSA on a controlled dataset of short English sentences with gendered pronouns, professions, and binary classification labels. Structured prompts were used to produce gender-swapped paraphrases, followed by quality control including semantic similarity checks, attribute verification, toxicity screening, and human spot checks. The augmented dataset expanded training coverage and was used to train a classifier under consistent conditions. Results show that LGSA reduces performance disparities without compromising accuracy. The baseline model achieved 96.7 percent accuracy with a 7.2 percent gender bias gap. Simple swap augmentation reduced the gap to 0.7 percent but lowered accuracy to 95.6 percent. LGSA achieved 99.1 percent accuracy with a 1.9 percent bias gap, improving performance on female-labeled examples. These findings demonstrate that LGSA is an effective strategy for bias mitigation, enhancing subgroup balance while maintaining high task accuracy and label fidelity.
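
The abstract reports a "gender bias gap" without defining it. One plausible reading, assumed here, is the difference in classification accuracy between gender groups. The sketch below computes that quantity from labeled predictions; the function names and the record layout are illustrative choices, not taken from the paper.

```python
from collections import defaultdict

def group_accuracy(records):
    """Accuracy per group from (group, y_true, y_pred) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}

def bias_gap(records):
    """Assumed definition: best-group accuracy minus worst-group accuracy."""
    accuracies = group_accuracy(records)
    return max(accuracies.values()) - min(accuracies.values())

# Toy data: the classifier is right on both male-labeled examples
# and on one of the two female-labeled examples.
records = [
    ("male", 1, 1),
    ("male", 0, 0),
    ("female", 1, 1),
    ("female", 0, 1),
]
print(group_accuracy(records))
print(bias_gap(records))
```

Under this reading, the numbers in the abstract show the trade-off directly: the simple swap baseline has the smallest gap but the lowest overall accuracy, while LGSA has the highest accuracy with a somewhat larger gap.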

[38] A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics

Prawaal Sharma,Navneet Goyal,Poonam Goyal,Vishnupriyan R

Main category: cs.CL

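TL;DR: Proposes a scalable, fully automated method for extracting bilingual parallel corpora from newspaper articles, validated on machine translation with a gain of close to 3 BLEU points.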

Details Motivation: Linguistic diversity leads to uneven availability of digital language resources; low-resource languages lack data, which makes NLP tasks hard to carry out. Method: Image and text analytics are combined to extract bilingual parallel corpora from newspaper articles automatically. Result: Parallel corpora are built for two language combinations, and machine translation improves over the current baseline by close to 3 BLEU points. Conclusion: The method builds parallel corpora for low-resource languages effectively and improves downstream NLP performance. Abstract: Linguistic diversity across the world creates a disparity with the availability of good quality digital language resources thereby restricting the technological benefits to majority of human population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper presents a novel scalable and fully automated methodology to extract bilingual parallel corpora from newspaper articles using image and text analytics. We validate our approach by building parallel data corpus for two different language combinations and demonstrate the value of this dataset through a downstream task of machine translation and improve over the current baseline by close to 3 BLEU points.

[39] Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain

Jingmin An,Yilong Song,Ruolin Yang,Nai Ding,Lingxi Lu,Yuxuan Wang,Wei Wang,Chu Zhuang,Qian Wang,Fang Fang

Main category: cs.CL

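TL;DR: Introduces the Hierarchical Frequency Tagging Probe (HFTP), which uses frequency-domain analysis to identify the components that encode syntactic structure in LLMs and in the human brain. Successive model versions diverge in how closely they resemble the brain's syntactic processing, raising the question of whether model improvements rely on human-like or non-human-like mechanisms.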

Details Motivation: To examine whether the language abilities of LLMs arise from mechanisms similar to those of the human brain, and to pinpoint the internal computational modules responsible for syntactic processing. Method: HFTP combines frequency-domain analysis with representational similarity analysis to compare several LLMs (GPT-2, the Llama and Gemma families, and others) with human cortical regions across syntactic levels. Result: LLMs process syntax in analogous layers, whereas the brain uses distinct cortical regions for different syntactic levels; LLM representations align more strongly with the language-dominant left hemisphere; upgraded models diverge, with Gemma 2 closer to the brain than Gemma and Llama 3.1 less aligned than Llama 2. Conclusion: HFTP offers a new view on the interpretability of LLM behavioral improvements: better performance need not mean more brain-like processing, and some gains may rely on non-human-like mechanisms. The tool bridges computational linguistics and cognitive neuroscience. Abstract: Large Language Models (LLMs) demonstrate human-level or even superior language abilities, effectively modeling syntactic structures, yet the specific computational modules responsible remain unclear. A key question is whether LLM behavioral capabilities stem from mechanisms akin to those in the human brain. To address these questions, we introduce the Hierarchical Frequency Tagging Probe (HFTP), a tool that utilizes frequency-domain analysis to identify neuron-wise components of LLMs (e.g., individual Multilayer Perceptron (MLP) neurons) and cortical regions (via intracranial recordings) encoding syntactic structures. Our results show that models such as GPT-2, Gemma, Gemma 2, Llama 2, Llama 3.1, and GLM-4 process syntax in analogous layers, while the human brain relies on distinct cortical regions for different syntactic levels. Representational similarity analysis reveals a stronger alignment between LLM representations and the left hemisphere of the brain (dominant in language processing). Notably, upgraded models exhibit divergent trends: Gemma 2 shows greater brain similarity than Gemma, while Llama 3.1 shows less alignment with the brain compared to Llama 2. These findings offer new insights into the interpretability of LLM behavioral improvements, raising questions about whether these advancements are driven by human-like or non-human-like mechanisms, and establish HFTP as a valuable tool bridging computational linguistics and cognitive neuroscience. This project is available at https://github.com/LilTiger/HFTP.

[40] Do You Get the Hint? Benchmarking LLMs on the Board Game Concept

Ine Gevers,Walter Daelemans

Main category: cs.CL

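TL;DR: Introduces Concept, a simple word-guessing board game, as a benchmark for abductive reasoning in natural language. Humans solve the game easily (success rate over 90%), but state-of-the-art LLMs struggle (no model exceeds 40%), and performance drops further in lower-resource languages (Dutch, French, and Spanish).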

Details Motivation: Abstract reasoning tasks usually rely on representations (grids, symbols, visual patterns) that differ from the natural language LLMs are trained on, which contributes to poor LLM performance; a benchmark closer to natural language is needed to assess abductive reasoning. Method: The board game Concept serves as the benchmark; state-of-the-art LLMs are evaluated in multiple languages, with attention to how well they interpret other players' strategic intent and revise initial hypotheses as sequential information arrives. Result: Humans exceed a 90% success rate while no LLM exceeds 40%; LLMs struggle to interpret strategic intent and to correct initial hypotheses, and they perform worse in lower-resource languages than in English. Conclusion: The abductive reasoning of current LLMs in natural language remains limited, especially for strategic interaction and sequential information updates, and the resource level of the language clearly affects performance. Abstract: Large language models (LLMs) have achieved striking successes on many benchmarks, yet recent studies continue to expose fundamental weaknesses. In particular, tasks that require abstract reasoning remain challenging, often because they use representations such as grids, symbols, or visual patterns that differ from the natural language data LLMs are trained on. In this paper, we introduce Concept, a simple word-guessing board game, as a benchmark for probing abductive reasoning in a representation that is much closer to LLM pre-training data: natural language. Our results show that this game, easily solved by humans (with a success rate of over 90%), is still very challenging for state-of-the-art LLMs (no model exceeds 40% success rate). Specifically, we observe that LLMs struggle with interpreting other players' strategic intents, and with correcting initial hypotheses given sequential information updates. In addition, we extend the evaluation across multiple languages, and find that the LLM performance drops further in lower-resource languages (Dutch, French, and Spanish) compared to English.

[41] Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation

Zhichao Xu,Zongyu Wu,Yun Zhou,Aosong Feng,Kang Zhou,Sangmin Woo,Kiran Ramnath,Yijun Tian,Xuan Qi,Weikang Qiu,Lin Lee Cheong,Haibo Ding

Main category: cs.CL

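TL;DR: Proposes VERITAS, a reinforcement learning framework that improves the reasoning faithfulness of LLMs that use search engines for retrieval-augmented generation.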

Details Motivation: Existing methods prioritize final answer correctness and overlook the quality of intermediate reasoning steps, which can make the chain of thought unfaithful. Method: The paper introduces an evaluation framework with three faithfulness metrics (information-think, think-answer, and think-search faithfulness) and VERITAS, which integrates fine-grained faithfulness rewards into reinforcement learning. Result: Models trained with VERITAS significantly improve reasoning faithfulness while achieving comparable task performance across seven QA benchmarks. Conclusion: VERITAS improves the reasoning faithfulness of RL-based search agents in retrieval-augmented generation. Abstract: Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains like math and code, recent works have begun exploring how to train LLMs to use search engines more effectively as tools for retrieval-augmented generation. Although these methods achieve performance improvement across QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive evaluation framework for evaluating RL-based search agents, covering three distinct faithfulness metrics: information-think faithfulness, think-answer faithfulness, and think-search faithfulness. Our evaluations reveal that a prototypical RL-based search agent, Search-R1, has significant room for improvement in this regard. To foster faithful reasoning, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness, but also achieve comparable task performance across seven QA benchmarks.

[42] In-Distribution Steering: Balancing Control and Coherence in Language Model Generation

Arthur Vogels,Benjamin Wong,Yann Choho,Annabelle Blangero,Milan Bhan

Main category: cs.CL

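TL;DR: Proposes In-Distribution Steering (IDS), which adapts intervention strength to where the input lies in the representation-space data distribution, giving adaptive control over LLM behavior.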

Details Motivation: Existing activation steering methods typically use a fixed steering strength, which leads either to insufficient control or to over-intervention that degrades the plausibility and coherence of the text. Method: IDS adjusts the activation-level intervention strength dynamically according to the input's position relative to the data distribution in representation space, so the intervention fits the current generation state. Result: IDS achieves strong accuracy on classification tasks while producing coherent text without collapse. Conclusion: IDS controls model behavior while maintaining generation quality, which makes it well suited to real-world applications. Abstract: Activation steering methods control large language model (LLM) behavior by modifying internal activations at inference time. However, most existing activation steering methods rely on a fixed steering strength, leading to either insufficient control or unadapted intervention that degrades text plausibility and coherence. We introduce In-Distribution Steering (IDS), a novel method that adapts steering strength based on the input data distribution in representation space. IDS dynamically adjusts interventions according to how far a given input lies within the distribution, enabling adaptive intervention and generation stability during text generation. Experiments demonstrate that IDS achieves strong accuracy on classification tasks while producing coherent text without collapse, making IDS particularly well suited for real-world applications.

[43] Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan's Intelligent Interaction Systems

Xuxin Cheng,Ke Zeng,Zhiquan Cao,Linyi Dai,Wenxuan Gao,Fei Han,Ai Jian,Feng Hong,Wenxing Hu,Zihe Huang,Dejian Kong,Jia Leng,Zhuoyuan Liao,Pei Liu,Jiaye Lin,Xing Ma,Jingqing Ruan,Jiaxing Song,Xiaoyu Tan,Ruixuan Xiao,Wenhui Yu,Wenyu Zhan,Haoxing Zhang,Chao Zhou,Hao Zhou,Shaodong Zheng,Ruinian Chen,Siyuan Chen,Ziyang Chen,Yiwen Dong,Yaoyou Fan,Yangyi Fang,Yang Gan,Shiguang Guo,Qi He,Chaowen Hu,Binghui Li,Dailin Li,Xiangyu Li,Yan Li,Chengjian Liu,Xiangfeng Liu,Jiahui Lv,Qiao Ma,Jiang Pan,Cong Qin,Chenxing Sun,Wen Sun,Zhonghui Wang,Abudukelimu Wuerkaixi,Xin Yang,Fangyi Yuan,Yawen Zhu,Tianyi Zhai,Jie Zhang,Runlai Zhang,Yao Xu,Yiran Zhao,Yifan Wang,Xunliang Cai,Yangen Hu,Cao Liu,Lu Pan,Xiaoli Wang,Bo Xiao,Wenyuan Yao,Qianlin Zhou,Benchang Zhu

Main category: cs.CL

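TL;DR: Presents WOWService, an intelligent interaction system for industrial applications that integrates LLMs with a multi-agent architecture to address cold-start data construction, multi-turn dialogue performance, evolving business rules, single-model limitations, and evaluation difficulty. Deployed on the Meituan App, it yields significant gains in user satisfaction metrics.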

Details Motivation: As service demand grows, improving customer experience faces several obstacles: high-quality data is hard to obtain, multi-turn dialogue understanding is inadequate, business rules change frequently, a single LLM has limited capability, and effective evaluation mechanisms are missing. A self-evolving, adaptable, and scalable intelligent service system is needed. Method: WOWService combines LLMs with a multi-agent architecture and is organized around five core modules (data construction, general capability enhancement, business scenario adaptation, multi-agent coordination, and automated evaluation) to support autonomous task management and collaborative problem-solving. Result: After deployment on the Meituan App, key metrics improve significantly: User Satisfaction Metric 1 (USM 1) changes by -27.53% and User Satisfaction Metric 2 (USM 2) by +25.51%, supporting the system's effectiveness at capturing user needs and delivering personalized service. Conclusion: By combining multi-agent coordination with LLMs, WOWService addresses the core challenges of deploying intelligent service systems in industry, with good operability, transferability, and capacity for continuous optimization. Abstract: Enhancing customer experience is essential for business success, particularly as service demands grow in scale and complexity. Generative artificial intelligence and Large Language Models (LLMs) have empowered intelligent interaction systems to deliver efficient, personalized, and 24/7 support. In practice, intelligent interaction systems encounter several challenges: (1) Constructing high-quality data for cold-start training is difficult, hindering self-evolution and raising labor costs. (2) Multi-turn dialogue performance remains suboptimal due to inadequate intent understanding, rule compliance, and solution extraction. (3) Frequent evolution of business rules affects system operability and transferability, constraining low-cost expansion and adaptability. (4) Reliance on a single LLM is insufficient in complex scenarios, where the absence of multi-agent frameworks and effective collaboration undermines process completeness and service quality. (5) The open-domain nature of multi-turn dialogues, lacking unified golden answers, hampers quantitative evaluation and continuous optimization. To address these challenges, we introduce WOWService, an intelligent interaction system tailored for industrial applications. With the integration of LLMs and multi-agent architectures, WOWService enables autonomous task management and collaborative problem-solving. Specifically, WOWService focuses on core modules including data construction, general capability enhancement, business scenario adaptation, multi-agent coordination, and automated evaluation.
Currently, WOWService is deployed on the Meituan App, achieving significant gains in key metrics, e.g., User Satisfaction Metric 1 (USM 1) -27.53% and User Satisfaction Metric 2 (USM 2) +25.51%, demonstrating its effectiveness in capturing user needs and advancing personalized service.

[44] Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models

Yizhou Peng,Yukun Ma,Chong Zhang,Yi-Wen Chao,Chongjia Ni,Bin Ma

Main category: cs.CL

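TL;DR: Proposes an adaptive classifier-free guidance (CFG) scheme for conflicts between emotional style and textual semantics in auto-regressive text-to-speech (AR TTS) models; adjusting CFG strength dynamically improves emotional expressiveness while preserving audio quality.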

Details Motivation: When natural language prompts are used for fine-grained emotion control, a mismatch between the requested emotional style and the semantic content of the text makes synthesized speech sound unnatural; CFG is underexplored in AR TTS models and can degrade audio quality. Method: The degree of style-content mismatch is detected with large language models or natural language inference models, and the CFG weight is adjusted adaptively to that level; the scheme rests on a systematic analysis of how CFG affects emotional expressiveness in state-of-the-art AR TTS models. Result: The adaptive CFG scheme improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility. Conclusion: Adaptive CFG is an effective way to handle style-content conflict in AR TTS, balancing emotion control with synthesis quality. Abstract: While Text-to-Speech (TTS) systems can achieve fine-grained control over emotional expression via natural language prompts, a significant challenge emerges when the desired emotion (style prompt) conflicts with the semantic content of the text. This mismatch often results in unnatural-sounding speech, undermining the goal of achieving fine-grained emotional control. Classifier-Free Guidance (CFG) is a key technique for enhancing prompt alignment; however, its application to auto-regressive (AR) TTS models remains underexplored, which can lead to degraded audio quality. This paper directly addresses the challenge of style-content mismatch in AR TTS models by proposing an adaptive CFG scheme that adjusts to different levels of the detected mismatch, as measured using large language models or natural language inference models. This solution is based on a comprehensive analysis of CFG's impact on emotional expressiveness in state-of-the-art AR TTS models. Our results demonstrate that the proposed adaptive CFG scheme improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility.
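
For readers unfamiliar with classifier-free guidance, the sketch below shows the standard CFG combination of conditional and unconditional logits, plus a guidance weight that depends on a mismatch score. The abstract does not say how the weight is derived from the detected mismatch, or in which direction it moves, so `adaptive_weight`, its linear interpolation, and the endpoint values are placeholders chosen for illustration only.

```python
def cfg_logits(cond, uncond, weight):
    """Standard CFG mix: uncond + weight * (cond - uncond).

    weight = 0 returns the unconditional logits, weight = 1 the conditional
    ones, and weight > 1 pushes further toward the style prompt.
    """
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]

def adaptive_weight(mismatch, w_at_match, w_at_conflict):
    """Hypothetical schedule: interpolate linearly between two caller-chosen
    weights as the mismatch score goes from 0 (style and text agree) to
    1 (style and text conflict). Scores outside [0, 1] are clamped."""
    m = min(max(mismatch, 0.0), 1.0)
    return w_at_match + m * (w_at_conflict - w_at_match)

# Toy next-token logits with and without the style prompt.
cond = [2.0, 0.0, -1.0]
uncond = [1.0, 0.0, 0.0]

weight = adaptive_weight(mismatch=0.5, w_at_match=1.0, w_at_conflict=3.0)
print(weight)
print(cfg_logits(cond, uncond, weight))
```

In a setup like the one the abstract describes, the mismatch score would come from an LLM or NLI model comparing the style prompt with the text to be spoken; the caller decides which endpoint weight is larger.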

[45] LLM one-shot style transfer for Authorship Attribution and Verification

Pablo Miralles-González,Javier Huertas-Tato,Alejandro Martín,David Camacho

Main category: cs.CL

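TL;DR: Proposes an unsupervised computational stylometry method built on LLM pre-training and in-context learning, using LLM log-probabilities to measure how well style transfers from one text to another; when topical correlations are controlled for, it beats contrastively trained baselines and comparable prompting approaches.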

Details Motivation: Supervised and contrastive methods tend to confuse topic with writing style, and the pre-training of modern LLMs has been little used for general authorship problems despite its natural fit with AI-generated text detection. Method: The approach uses the pre-training and in-context learning capabilities of LLMs, computing the log-probabilities of style transfer between texts as an unsupervised measure of stylistic similarity. Result: The method significantly outperforms LLM prompting approaches of comparable scale and, when topical correlations are controlled for, contrastively trained baselines; performance scales fairly consistently with base model size, and for authorship verification it improves further with added test-time computation. Conclusion: An unsupervised method grounded in the LLM's own mechanisms offers strong performance and scalability for stylometry, without labeled data and with better separation of topic from style. Abstract: Computational stylometry analyzes writing style through quantitative patterns in text, supporting applications from forensic tasks such as identity linking and plagiarism detection to literary attribution in the humanities. Supervised and contrastive approaches rely on data with spurious correlations and often confuse style with topic. Despite their natural use in AI-generated text detection, the CLM pre-training of modern LLMs has been scarcely leveraged for general authorship problems. We propose a novel unsupervised approach based on this extensive pre-training and the in-context learning capabilities of LLMs, employing the log-probabilities of an LLM to measure style transferability from one text to another. Our method significantly outperforms LLM prompting approaches of comparable scale and achieves higher accuracy than contrastively trained baselines when controlling for topical correlations. Moreover, performance scales fairly consistently with the size of the base model and, in the case of authorship verification, with an additional mechanism that increases test-time computation, enabling flexible trade-offs between computational cost and accuracy.

[46] ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering

Simon Lupart,Mohammad Aliannejadi,Evangelos Kanoulas

Main category: cs.CL

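TL;DR: Proposes ChatR1, a reinforcement learning framework for reasoning in conversational question answering; an intent-aware reward coordinates search and reasoning across turns, and the framework outperforms existing methods on several datasets.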

Details Motivation: Conventional CQA pipelines are static and cope poorly with evolving user intent and context-dependent utterances, so a more flexible reasoning mechanism is needed. Method: ChatR1 uses reinforcement learning to interleave retrieval and generation dynamically, with an intent-aware, turn-level reward that addresses sparse and delayed rewards. Result: With 3B and 7B backbones, ChatR1 outperforms competitive models on five CQA datasets as measured by F1, BERTScore, and LLM-as-judge; ablations confirm the effectiveness of the reward. Conclusion: RL-driven reasoning adapts better to dynamic conversations than static pipelines and generalizes across domains. Abstract: We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns, and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static 'rewrite, retrieve, and generate' pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. Our proposed ChatR1 demonstrates strong performance on both 3B and 7B model backbones, outperforming competitive models on five CQA datasets, measured by different metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA datasets to cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, testing ChatR1's performance from various aspects. Ablation studies confirm the effectiveness of the intent-aware reward. Our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-sensitive behavior than static CQA pipelines.

[47] Embedding-Based Context-Aware Reranker

Ye Yuan,Mohammad Amin Shabani,Siqi Liu

Main category: cs.CL

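TL;DR: Proposes EBCAR, a lightweight embedding-based context-aware reranking framework that combines passage structural information with a hybrid attention mechanism to improve retrieval that requires cross-passage inference, outperforming state-of-the-art rerankers on the ConTEB benchmark.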

Details Motivation: Existing rerankers fall short on tasks that require inference across passages (coreference resolution, entity disambiguation, evidence aggregation), and costly large models still do not address the problem. Method: EBCAR operates directly on passage embeddings, using the structural information of passages and a hybrid attention mechanism to capture high-level interactions across documents and low-level relationships within each document. Result: On ConTEB, EBCAR outperforms state-of-the-art rerankers in both accuracy and efficiency on retrieval tasks that need cross-passage inference. Conclusion: EBCAR is an efficient and accurate reranking framework that strengthens contextual understanding across passages and suits complex retrieval tasks requiring cross-passage inference. Abstract: Retrieval-Augmented Generation (RAG) systems rely on retrieving relevant evidence from a corpus to support downstream generation. The common practice of splitting a long document into multiple shorter passages enables finer-grained and targeted information retrieval. However, it also introduces challenges when a correct retrieval would require inference across passages, such as resolving coreference, disambiguating entities, and aggregating evidence scattered across multiple sources. Many state-of-the-art (SOTA) reranking methods, despite utilizing powerful large pretrained language models with potentially high inference costs, still neglect the aforementioned challenges. Therefore, we propose Embedding-Based Context-Aware Reranker (EBCAR), a lightweight reranking framework operating directly on embeddings of retrieved passages with enhanced cross-passage understandings through the structural information of the passages and a hybrid attention mechanism, which captures both high-level interactions across documents and low-level relationships within each document. We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.

[48] Taming the Fragility of KV Cache Eviction in LLM Inference

Yuan Feng,Haoyu Guo,JunLin Lv,S. Kevin Zhou,Xike Xie

Main category: cs.CL

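TL;DR: Proposes DefensiveKV, a cache eviction method that uses a defensive aggregation strategy to reduce generation quality loss, substantially outperforming baselines across seven task domains.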

Details Motivation: Existing cache eviction methods rely on a stability assumption and use mean aggregation, but the assumption is fragile in extreme cases and performance degrades. Method: A two-step, linear-time defensive aggregation strategy is proposed, implemented in DefensiveKV and its extension Layer-DefensiveKV, which adds layer-wise budget allocation. Result: Under a 20% cache size, DefensiveKV and Layer-DefensiveKV reduce generation quality loss by 2.3x and 4.3x respectively versus the strongest baseline, evaluated on 18 datasets. Conclusion: By managing worst-case risk, DefensiveKV counters the fragility of the stability assumption in cache eviction and points to a new direction for efficient LLM inference. Abstract: Large language models have revolutionized natural language processing, yet their deployment remains hampered by the substantial memory and runtime overhead of the transformer's Key-Value cache. To mitigate this, recent methods employ a scoring-aggregation framework to evict unimportant cache entries, based on the stability assumption: that a fixed subset of entries remains consistently important during generation. However, prior work has largely focused on refining importance indicators for scoring, while defaulting to mean aggregation due to a faithful trust in the stability assumption. In this work, we argue that this underlying assumption is inherently fragile, making mean aggregation highly vulnerable in extreme cases. To counter this, we propose a simple yet elegant defensive aggregation strategy: a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead. Embodying this strategy, we propose a novel cache eviction method, DefensiveKV, and its extension, Layer-DefensiveKV, which incorporates layer-wise budget allocation. Across seven task domains (18 datasets), our methods reduce generation quality loss by 2.3x and 4.3x respectively, versus the strongest baseline under a 20% cache size. These results set new performance benchmarks and pioneer a promising direction for optimizing cache eviction against underlying fragility through worst-case risk management. Our code is available at https://github.com/FFY0/DefensiveKV.
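
To make the scoring-aggregation framing concrete, the sketch below contrasts mean aggregation with a simple max aggregation on a toy importance matrix. This is not the paper's two-step defensive strategy, whose details the abstract does not give; max aggregation is swapped in here purely to show how a cache entry that matters only occasionally can be evicted under the mean. All names and numbers are made up for the example.

```python
def aggregate(scores, how):
    """Collapse a steps-by-entries importance matrix to one score per entry."""
    n_entries = len(scores[0])
    columns = [[row[j] for row in scores] for j in range(n_entries)]
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    if how == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown aggregation: {how}")

def keep_indices(aggregated, budget):
    """Indices of the `budget` highest-scoring cache entries, in cache order."""
    ranked = sorted(range(len(aggregated)), key=lambda j: aggregated[j], reverse=True)
    return sorted(ranked[:budget])

# Rows are observed decoding steps, columns are cache entries.
# Entry 2 is ignored at first but becomes crucial at the last step.
scores = [
    [0.5, 0.4, 0.0],
    [0.5, 0.4, 0.0],
    [0.5, 0.4, 0.9],
]

print(keep_indices(aggregate(scores, "mean"), budget=2))
print(keep_indices(aggregate(scores, "max"), budget=2))
```

With a budget of two entries, the mean keeps the two steadily useful entries and drops the one that spikes late, which is the kind of extreme case the abstract argues mean aggregation handles poorly.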

[49] Are Proverbs the New Pythian Oracles? Exploring Sentiment in Greek Sayings

Katerina Korre,John Pavlopoulos

Main category: cs.CL

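TL;DR: Uses NLP, and LLMs in particular, to analyze the sentiment of Greek proverbs, extends the dataset to local dialects, and maps the distribution of sentiment across Greece, finding that negative sentiment prevails in most areas.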

Details Motivation: Because proverbs are passed on mainly through oral tradition within particular communities, much of the global landscape of proverbs remains underexplored; modern NLP makes systematic sentiment analysis of proverbs in less-studied languages possible. Method: Starting from an annotated dataset of Greek proverbs, the authors add local dialect data, classify proverb sentiment with LLMs, analyze the results jointly by geography, dialect, and topic, and produce a sentiment map. Result: LLMs identify proverb sentiment accurately enough, especially when the task is treated as a non-conventional sentiment polarity task; negative sentiment is more prevalent in most areas of Greece. Conclusion: LLMs are usable for analyzing low-resource, orally transmitted language phenomena, and Greek proverbs lean negative overall. Abstract: Proverbs are among the most fascinating linguistic phenomena that transcend cultural and linguistic boundaries. Yet, much of the global landscape of proverbs remains underexplored, as many cultures preserve their traditional wisdom within their own communities due to the oral tradition of the phenomenon. Taking advantage of the current advances in Natural Language Processing (NLP), we focus on Greek proverbs, analyzing their sentiment. Departing from an annotated dataset of Greek proverbs, we expand it to include local dialects, effectively mapping the annotated sentiment. We present (1) a way to exploit LLMs in order to perform sentiment classification of proverbs, (2) a map of Greece that provides an overview of the distribution of sentiment, (3) a combinatory analysis in terms of the geographic position, dialect, and topic of proverbs. Our findings show that LLMs can provide us with an accurate enough picture of the sentiment of proverbs, especially when approached as a non-conventional sentiment polarity task. Moreover, in most areas of Greece negative sentiment is more prevalent.

[50] Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems

Karthik Avinash,Nikhil Pareek,Rishav Hada

Main category: cs.CL

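TL;DR: Presents Protect, a natively multi-modal guardrailing model for enterprise deployment across text, image, and audio, reporting strong results on four safety dimensions (toxicity, sexism, data privacy, and prompt injection) and surpassing existing mainstream models.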

Details Motivation: Most existing guardrails are limited to text and lack real-time oversight, multi-modal handling, and explainability, so they fall short in enterprise and regulated settings. Method: Protect fine-tunes category-specific adapters with LoRA on a large multi-modal dataset; a teacher-assisted annotation pipeline produces high-fidelity, context-aware labels for safety detection across modalities. Result: Protect reaches state-of-the-art performance on all four safety dimensions, surpassing WildGuard, LlamaGuard-4, and GPT-4.1. Conclusion: Protect lays a foundation for trustworthy, auditable, production-ready multi-modal safety systems. Abstract: The increasing deployment of Large Language Models (LLMs) across enterprise and mission-critical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability -- limitations that hinder their adoption in regulated environments. Existing guardrails largely operate in isolation, focused on text alone, making them inadequate for multi-modal, production-scale environments. We introduce Protect, a natively multi-modal guardrailing model designed to operate seamlessly across text, image, and audio inputs and built for enterprise-grade deployment. Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels across modalities. Experimental results demonstrate state-of-the-art performance across all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1. Protect establishes a strong foundation for trustworthy, auditable, and production-ready safety systems capable of operating across text, image, and audio modalities.

[51] Personal Attribute Leakage in Federated Speech Models

Hamdan Al-Ali,Ali Reza Ghavamipour,Tommaso Caselli,Fatih Turkmen,Zeerak Talat,Hanan Aldarmaki

Main category: cs.CL

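TL;DR: Studies the vulnerability of ASR models to attribute inference attacks in federated learning. A non-parametric white-box attack that needs no raw speech is tested on Wav2Vec2, HuBERT, and Whisper and can infer gender, age, accent, emotion, and dysarthria; attributes that are underrepresented or absent in pre-training data are more vulnerable, and accent is reliably inferred from all models.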

Details Motivation: Federated learning is used for privacy-preserving training, but it is unclear whether ASR models in this setting remain open to attribute inference attacks, particularly for sensitive demographic and clinical attributes. Method: A non-parametric white-box attack under a passive threat model uses only weight differentials, without access to raw speech from target speakers, and is tested on Wav2Vec2, HuBERT, and Whisper. Result: The attack is feasible for gender, age, accent, emotion, and dysarthria; attributes that are underrepresented or absent in pre-training data are easier to infer, and accent information is reliably leaked by all models. Conclusion: ASR models in federated learning still carry attribute inference risk, especially for unevenly represented attributes; the findings expose vulnerabilities and point toward improved privacy protection. Abstract: Federated learning is a common method for privacy-preserving training of machine learning models. In this paper, we analyze the vulnerability of ASR models to attribute inference attacks in the federated setting. We test a non-parametric white-box attack method under a passive threat model on three ASR models: Wav2Vec2, HuBERT, and Whisper. The attack operates solely on weight differentials without access to raw speech from target speakers. We demonstrate attack feasibility on sensitive demographic and clinical attributes: gender, age, accent, emotion, and dysarthria. Our findings indicate that attributes that are underrepresented or absent in the pre-training data are more vulnerable to such inference attacks. In particular, information about accents can be reliably inferred from all models. Our findings expose previously undocumented vulnerabilities in federated ASR models and offer insights towards improved security.

[52] D-SMART: Enhancing LLM Dialogue Consistency via Dynamic Structured Memory And Reasoning Tree

Xiang Lei,Qin Li,Min Zhang,Min Zhang

Main category: cs.CL

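TL;DR: Proposes D-SMART, a model-agnostic framework that improves LLM consistency in multi-turn dialogue through a dynamic structured memory and a reasoning tree.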

Details Motivation: 大语言模型在多轮对话中存在事实不一致和逻辑衰减问题,现有方法依赖静态知识源且推理路径单一,难以适应动态上下文变化。 Method: 提出D-SMART框架,包含两个核心组件:动态结构化记忆(DSM),用于构建和维护符合OWL标准的知识图谱;推理树(RT),在知识图谱上进行显式的多步推理搜索。同时引入基于NLI的新指标评估一致性。 Result: 在MT-Bench-101基准测试中,D-SMART相比现有方法将多轮对话一致性得分提升超过48%,并对开源模型的质量评分最高提升10.1%。 Conclusion: D-SMART能有效增强大语言模型在演化对话上下文中的事实与逻辑一致性,且适用于不同类型的模型。 Abstract: Large Language Models (LLMs) often exhibit factual inconsistencies and logical decay in extended, multi-turn dialogues, a challenge stemming from their reliance on static, pre-trained knowledge and an inability to reason adaptively over the dialogue history. Prevailing mitigation strategies, such as Retrieval-Augmented Generation (RAG) and agentic working memories, improve information recall but still engage with fundamentally static knowledge sources and follow pre-defined single reasoning path. This hinders their ability to preserve factual and logical consistency of their responses in multi-turn dialogues while the context evolves over time. To address this issue, we propose D-SMART, a model-agnostic framework designed to maintain multi-turn dialogue consistency by enabling LLMs to build and reason over a dynamic, structured representation of the conversational context. This is achieved via two synergistic components: (1) a Dynamic Structured Memory (DSM), which incrementally constructs and maintains an authoritative, OWL-compliant knowledge graph of the conversation; and (2) a Reasoning Tree (RT), which executes inferences as an explicit and traceable multi-step search over the graph. As the popular-used quality score (judged by GPT-4) can overlook logical flaws, we introduce new NLI-based metrics to better measure multi-turn dialogue consistency. 
Comprehensive experiments on the MT-Bench-101 benchmark show that D-SMART significantly outperforms state-of-the-art baselines, elevating the dialogue consistency score by over 48% for both proprietary and open-source models, and notably improves the quality score of the latter by up to 10.1%.

[53] Document Intelligence in the Era of Large Language Models: A Survey

Weishi Wang,Hengchang Hu,Zhijie Zhang,Zhaochen Li,Hongxin Shao,Daniel Dahlmeier

Main category: cs.CL

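TL;DR: This survey reviews progress in applying large language models (LLMs) to Document AI (DAI), discusses key advances and challenges in multimodal, multilingual, and retrieval-augmented DAI, and proposes future research directions such as agent-based approaches and document-specific foundation models.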

Details Motivation: The rise of LLMs has significantly transformed Document AI, but a systematic summary of LLMs in this field is missing, so its development, current progress, and future directions need a comprehensive review. Method: The survey reviews existing literature, analyzes the role of LLMs in document understanding and generation, focuses on the shift from encoder-decoder architectures to decoder-only LLMs, and examines developments in multimodal, multilingual, and retrieval-augmented techniques. Result: It summarizes key advances of LLMs in DAI, identifies current challenges, and proposes future directions such as agent-based approaches and document-specific foundation models. Conclusion: LLMs have greatly advanced Document AI; future research should focus on building more efficient, specialized document processing models and systems. Abstract: Document AI (DAI) has emerged as a vital application area, and is significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.

[54] Make an Offer They Can't Refuse: Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment

Buwei He,Yang Liu,Zhaowei Zhang,Zixia Jia,Huijia Wu,Zhaofeng He,Zilong Zheng,Yipeng Kang

Main category: cs.CL

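TL;DR: This paper studies how the Bayesian Persuasion (BP) framework can improve the strategic persuasion ability of large language models (LLMs) in single-turn natural language dialogue, proposes a framework with a commitment-communication mechanism, and evaluates it across a range of persuadees and tasks.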

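Details Motivation: Research on AI persuasion often overlooks the strategic use of information asymmetry or relies on strong assumptions such as pre-commitment, which makes it hard to reflect the complexity of human persuasion. The authors therefore explore a natural language persuasion mechanism based on Bayesian reasoning that better matches realistic settings. Method: Two BP variants are proposed: Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language (FNL) BP. The persuader explicitly describes their potential types (e.g., honest or dishonest) to guide the persuadee's Bayesian belief update. The variants are compared against non-BP baselines on LLM persuadees with different prompts and fine-tuning and on human participants. Result: (1) LLMs guided by BP strategies achieve consistently higher persuasion success rates than non-BP baselines; (2) SFNL is more credible and logically coherent, while FNL shows stronger emotional resonance and robustness in naturalistic conversation; (3) with supervised fine-tuning, smaller models reach BP performance comparable to that of larger models. Conclusion: The Bayesian Persuasion framework effectively improves the strategic persuasion ability of LLMs in natural language settings, and with suitable design and training it yields effective, credible persuasion across model sizes and diverse persuadees. Abstract: Persuasion, a fundamental social capability for humans, remains a challenge for AI systems such as large language models (LLMs). Current studies often overlook the strategic use of information asymmetry in message design or rely on strong assumptions regarding pre-commitment. In this work, we explore the application of Bayesian Persuasion (BP) in natural language within single-turn dialogue settings, to enhance the strategic persuasion capabilities of LLMs. Our framework incorporates a commitment-communication mechanism, where the persuader explicitly outlines an information schema by narrating their potential types (e.g., honest or dishonest), thereby guiding the persuadee in performing the intended Bayesian belief update. We evaluate two variants of our approach: Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language (FNL) BP, benchmarking them against both naive and strong non-BP (NBP) baselines within a comprehensive evaluation framework. This framework covers a diverse set of persuadees -- including LLM instances with varying prompts and fine-tuning and human participants -- across tasks ranging from specially designed persuasion scenarios to general everyday situations.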
Experimental results on LLM-based agents reveal three main findings: (1) LLMs guided by BP strategies consistently achieve higher persuasion success rates than NBP baselines; (2) SFNL exhibits greater credibility and logical coherence, while FNL shows stronger emotional resonance and robustness in naturalistic conversations; (3) with supervised fine-tuning, smaller models can attain BP performance comparable to that of larger models.
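The "intended Bayesian belief update" mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes a binary state (the offer is good or bad) and a persuader who announces how often each type would send a recommending message; the function name and all numbers are hypothetical and are not taken from the paper.

```python
def posterior_good(prior_good, p_msg_given_good, p_msg_given_bad):
    """Bayes' rule: probability the offer is good after seeing the message."""
    evidence = (prior_good * p_msg_given_good
                + (1.0 - prior_good) * p_msg_given_bad)
    if evidence == 0:
        raise ValueError("message has zero probability under the stated scheme")
    return prior_good * p_msg_given_good / evidence


# Hypothetical scheme: the persuader always recommends a good offer and
# recommends a bad offer 40% of the time.
prior = 0.3
post = posterior_good(prior, p_msg_given_good=1.0, p_msg_given_bad=0.4)
print(f"prior={prior:.2f} posterior={post:.3f}")
```

Under these assumed numbers the recommendation raises the persuadee's belief from 0.30 to about 0.52, which is the kind of shift a stated information scheme is meant to induce.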

[55] Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models

Agnese Lombardi,Alessandro Lenci

Main category: cs.CL

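TL;DR: This study examines whether the generative agent-based model Concordia can effectively model Theory of Mind (ToM) in simulated real-world environments. It finds that GPT-4 performs poorly at selecting actions based on belief attribution, that its apparent ToM ability may stem from shallow statistical associations rather than genuine reasoning, and that it struggles to generate coherent causal effects. This suggests that the ToM-like abilities of current LLMs are overestimated and that more rigorous, action-based evaluation frameworks are needed.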

Details Motivation: To investigate whether LLMs have genuine Theory of Mind ability rather than relying on linguistic memorization, particularly for understanding social situations and coordinating actions. Method: The generative agent-based model Concordia is used to evaluate, in simulated real-world environments, whether GPT-4 can make genuine inferences from social context, testing its belief attribution and causal reasoning. Result: GPT-4 frequently fails to choose the correct action based on belief attribution; its apparent ToM ability may come from shallow linguistic statistics rather than deep reasoning; it struggles to generate coherent causal chains of action and has difficulty with complex social interactions. Conclusion: The ToM-like abilities shown by current LLMs may be misinterpreted or overestimated; more rigorous evaluation methods grounded in actual behavior are needed to test their social cognition. Abstract: Language is fundamental to human cooperation, facilitating not only the exchange of information but also the coordination of actions through shared interpretations of situational contexts. This study explores whether the Generative Agent-Based Model (GABM) Concordia can effectively model Theory of Mind (ToM) within simulated real-world environments. Specifically, we assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context, rather than relying on linguistic memorization. Our findings reveal a critical limitation: GPT-4 frequently fails to select actions based on belief attribution, suggesting that apparent ToM-like abilities observed in previous studies may stem from shallow statistical associations rather than true reasoning. Additionally, the model struggles to generate coherent causal effects from agent actions, exposing difficulties in processing complex social interactions. These results challenge current statements about emergent ToM-like capabilities in LLMs and highlight the need for more rigorous, action-based evaluation frameworks.

[56] Investigating Lexical Change through Cross-Linguistic Colexification Patterns

Kim Gfeller,Sabine Stoll,Chundra Cathcart,Paul Widmer

Main category: cs.CL

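TL;DR: Using phylogenetic comparative models, the study examines the evolutionary dynamics of the colexification of concept pairs in three language families. More closely associated concept pairs are colexified more stably, whereas frequent and easily borrowed concept pairs change faster and are colexified less often, with considerable differences between families.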

Details Motivation: To understand what drives meaning change in language, and in particular how colexification reflects regularities in semantic change. Method: Phylogenetic comparative models are applied to dictionary data from the Austronesian, Indo-European, and Uralic families to examine the effects of associativity, borrowability, and usage frequency on colexification. Result: More strongly associated concept pairs are colexified across a larger part of the family tree and change more slowly; frequent and easily borrowed concept pairs change faster and are colexified less often; the families differ considerably, which points to areal and cultural factors. Conclusion: The evolution of colexification is jointly shaped by associativity, frequency, and borrowability; semantic change is driven by intrinsic cognitive factors and also by language contact and cultural environment. Abstract: One of the most intriguing features of language is its constant change, with ongoing shifts in how meaning is expressed. Despite decades of research, the factors that determine how and why meanings evolve remain only partly understood. Colexification -- the phenomenon of expressing multiple distinct concepts using the same word form -- serves as a valuable window onto the dynamics of meaning change across languages. Here, we apply phylogenetic comparative models to dictionary data from three language families, Austronesian, Indo-European, and Uralic, in order to shed light on the evolutionary dynamics underlying the colexification of concept pairs. We assess the effects of three predictors: associativity, borrowability, and usage frequency. Our results show that more closely related concept pairs are colexified across a larger portion of the family tree and exhibit slower rates of change. In contrast, concept pairs that are more frequent and more prone to borrowing tend to change more rapidly and are less often colexified. We also find considerable differences between the language families under study, suggesting that areal and cultural factors may play a role.

[57] Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

Ahmed Alzubaidi,Shaikha Alsuwaidi,Basma El Amel Boussaha,Leen AlQadi,Omar Alkaabi,Mohammed Alyafeai,Hamza Alobeidli,Hakim Hacid

Main category: cs.CL

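TL;DR: This paper presents the first systematic survey of evaluation benchmarks for Arabic large language models (LLMs). It analyzes more than 40 benchmarks covering NLP tasks, knowledge domains, cultural understanding, and specialized capabilities, proposes a four-category taxonomy, and identifies critical gaps such as limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. It also compares the trade-offs of three construction approaches: native collection, translation, and synthetic generation.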

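Details Motivation: Arabic LLMs are developing quickly, but there is no systematic survey of their evaluation benchmarks; existing benchmarks fall short on cultural fit, dialogue ability, and timeliness, so a unified taxonomy and construction guidance are needed. Method: Through a systematic literature review, the survey analyzes 40+ Arabic LLM benchmarks, proposes a taxonomy with four categories (Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific), and compares the strengths and weaknesses of native collection, translation, and synthetic generation. Result: It establishes the first systematic taxonomy of Arabic LLM benchmarks, identifies critical gaps (limited temporal evaluation, insufficient multi-turn dialogue testing, and cultural misalignment in translated datasets), and compares the costs and benefits of the different construction approaches. Conclusion: The study gives the Arabic NLP community a comprehensive benchmark survey and taxonomy, stresses that future work should test cultural alignment, dynamic knowledge, and multi-turn interaction, and advocates more authentic, reproducible evaluation standards. Abstract: This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches: native collection, translation, and synthetic generation, discussing their trade-offs regarding authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics while offering recommendations for future development.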

[58] Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation

Hao Wang,Linlong Xu,Heng Liu,Yangyang Liu,Xiaohu Zhao,Bo Zeng,Liangying Shao,Longyue Wang,Weihua Luo,Kaifu Zhang

Main category: cs.CL

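TL;DR: M^2PO is a multi-pair, multi-perspective preference optimization framework that improves preference learning for machine translation by combining a hallucination penalty with a dynamic quality score, and it substantially outperforms existing methods.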

Details Motivation: Existing DPO methods for machine translation are limited by inaccurate reward signals (for example, ignoring translation hallucinations) and inefficient data use (only a single contrastive pair), which leads to poor alignment. Method: The M^2PO framework combines a multi-perspective reward mechanism (a hallucination penalty plus a quality score that dynamically fuses external evaluation with the model's own judgment) with a multi-pair construction strategy that systematically generates multiple preference pairs from all translation candidates. Result: On the WMT21-22 benchmarks, M^2PO substantially outperforms existing preference optimization methods and is competitive with leading proprietary LLMs. Conclusion: With a more robust reward signal and more efficient data use, M^2PO improves the alignment of LLMs with human preferences in machine translation and yields more faithful, higher-quality translations. Abstract: Direct Preference Optimization (DPO) is a powerful paradigm for aligning Large Language Models (LLMs) to human preferences in Machine Translation (MT), but current methods are hindered by two fundamental challenges: (1) flawed reward signals from Quality Estimation (QE) models that overlook critical errors like translation hallucination, and (2) inefficient data utilization that discards valuable learning signals by selecting only a single win-loss pair. To address these limitations, we introduce M^2PO: Multi-Pair, Multi-Perspective Preference Optimization. Our framework integrates a multi-perspective reward engine that creates a more robust signal by combining two key viewpoints: a new hallucination penalty for factuality, and an innovative dynamic quality score that adaptively fuses external evaluations with the model's own evolving judgment. This is synergistically paired with a multi-pair construction strategy that systematically creates a comprehensive set of preference pairs from the entire pool of translation candidates. This synergistic approach ensures the model learns from a richer spectrum of quality trade-offs, leading to more robust and faithful translations. On challenging WMT21-22 benchmarks, M^2PO substantially outperforms existing preference optimization methods and demonstrates highly competitive performance against leading proprietary LLMs.
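The idea of building many preference pairs from one candidate pool, rather than a single win-loss pair, can be sketched as follows. This is an illustrative Python sketch only: the reward formula (quality minus a weighted hallucination score), the field names, and the `min_gap` filter are assumptions made here, not the paper's actual reward engine.

```python
from itertools import combinations


def build_preference_pairs(candidates, penalty_weight=1.0, min_gap=0.0):
    """Return (preferred, rejected) text pairs from every pair of candidates.

    Each candidate is a dict with hypothetical keys:
    'text', 'quality' (higher is better), 'hallucination' (higher is worse).
    """
    scored = [
        (c["quality"] - penalty_weight * c["hallucination"], c["text"])
        for c in candidates
    ]
    pairs = []
    for (score_a, text_a), (score_b, text_b) in combinations(scored, 2):
        if abs(score_a - score_b) <= min_gap:
            continue  # too close to call; skip ambiguous pairs
        pairs.append((text_a, text_b) if score_a > score_b else (text_b, text_a))
    return pairs


# Made-up candidate pool: A looks best on quality alone but hallucinates.
pool = [
    {"text": "A", "quality": 0.9, "hallucination": 0.5},
    {"text": "B", "quality": 0.7, "hallucination": 0.0},
    {"text": "C", "quality": 0.5, "hallucination": 0.0},
]
pairs = build_preference_pairs(pool)
print(pairs)
```

With three candidates the sketch yields three pairs instead of one, and the penalty flips the ranking of candidate A, which is the effect the two components are described as targeting.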

[59] LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA

Tommaso Bonomo,Luca Gioffré,Roberto Navigli

Main category: cs.CL

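TL;DR: This paper introduces LiteraryQA, a high-quality subset of NarrativeQA focused on literary works and cleaned with a human- and LLM-validated pipeline. It finds that traditional n-gram metrics correlate poorly with human judgment, while LLM-as-a-Judge evaluation agrees strongly with human rankings even with small open-weight models. The authors also benchmark several long-context LLMs on LiteraryQA and release their code and data.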

Details Motivation: The NarrativeQA benchmark contains noisy documents and flawed QA pairs, which undermines its reliability and holds back narrative QA systems; a higher-quality, more reliable benchmark is needed. Method: LiteraryQA is built by selecting literary works from NarrativeQA, identifying and correcting low-quality QA samples through joint human and LLM validation, and removing extraneous text from source documents. The authors then run a meta-evaluation of automatic metrics against human judgment and benchmark several long-context LLMs on LiteraryQA. Result: (1) A high-quality LiteraryQA dataset; (2) n-gram-based metrics show low system-level correlation with human judgment; (3) LLM-as-a-Judge evaluation, even with small models, agrees strongly with human rankings; (4) baseline results for several long-context LLMs on the dataset. Conclusion: LiteraryQA is a more reliable, higher-quality narrative QA benchmark that supports research on understanding complex narrative text; system evaluation should use LLM-as-a-Judge rather than traditional n-gram metrics. Abstract: Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at https://github.com/SapienzaNLP/LiteraryQA.
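A system-level meta-evaluation of this kind compares how an automatic metric ranks whole systems with how humans rank them. The Python sketch below uses Kendall's tau (the tau-a variant, without tie correction) as one possible rank correlation; the choice of coefficient and every score in the example are assumptions for illustration, not values or methods confirmed by the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists (one entry per system)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two systems")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up system-level scores for four hypothetical QA systems.
human = [4.1, 3.2, 2.5, 1.9]       # mean human rating per system
judge = [0.82, 0.70, 0.55, 0.40]   # mean LLM-judge score per system
ngram = [0.21, 0.25, 0.19, 0.23]   # mean n-gram overlap per system

print("judge vs human:", kendall_tau(human, judge))
print("n-gram vs human:", kendall_tau(human, ngram))
```

In this toy data the judge scores order the systems exactly as humans do, while the n-gram scores agree on only half of the system pairs.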

[60] ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding

Xiaozhe Li,TianYi Lyu,Siyi Yang,Yuxi Gong,Yizhao Yang,Jinxuan Huang,Ligao Zhang,Zhuoyi Huang,Qingwen Liu

Main category: cs.CL

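TL;DR: This paper introduces ConsintBench, the first dynamic, live evaluation benchmark for human intent understanding in the consumer domain. It addresses the lack of effective evaluation of LLM intent understanding in complex, multi-source, non-linear public discussions.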

Details Motivation: Real-world public discussions involve multiple perspectives, conflicts, varied emotions, and implicit background knowledge, so existing LLMs struggle to understand human intent accurately, and no large-scale, dynamic benchmark exists to measure this ability. Method: ConsintBench uses an automated curation pipeline to build a large, diverse benchmark of real-world consumer discussion data that supports real-time updates and includes mechanisms against data contamination, so that LLM intent understanding can be evaluated dynamically in complex contexts. Result: ConsintBench is the largest and most diverse intent understanding benchmark to date; it updates in real time, avoids contamination between training and test data, and provides a reliable platform for evaluating LLMs on non-linear, multi-source, evolving discussions. Conclusion: The work fills a gap in evaluating LLM intent understanding in realistic scenarios and offers a tool and direction for developing models for complex social contexts. Abstract: Understanding human intent is a complex, high-level task for large language models (LLMs), requiring analytical reasoning, contextual interpretation, dynamic information aggregation, and decision-making under uncertainty. Real-world public discussions, such as consumer product discussions, are rarely linear or involve a single user. Instead, they are characterized by interwoven and often conflicting perspectives, divergent concerns, goals, emotional tendencies, as well as implicit assumptions and background knowledge about usage scenarios. To accurately understand such explicit public intent, an LLM must go beyond parsing individual sentences; it must integrate multi-source signals, reason over inconsistencies, and adapt to evolving discourse, similar to how experts in fields like politics, economics, or finance approach complex, uncertain environments. Despite the importance of this capability, no large-scale benchmark currently exists for evaluating LLMs on real-world human intent understanding, primarily due to the challenges of collecting real-world public discussion data and constructing a robust evaluation pipeline. To bridge this gap, we introduce ConsintBench, the first dynamic, live evaluation benchmark specifically designed for intent understanding, particularly in the consumer domain. ConsintBench is the largest and most diverse benchmark of its kind, supporting real-time updates while preventing data contamination through an automated curation pipeline.

[61] MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts

Shujun Xia,Haokun Lin,Yichen Wu,Yinan Zhou,Zixuan Li,Zhongwei Wan,Xingrun Xing,Yefeng Zheng,Xiang Li,Caifeng Shan,Zhenan Sun,Quanzheng Li

Main category: cs.CL

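TL;DR: This paper proposes MedREK, a retrieval-based editing framework for medical LLMs that addresses the limitations of existing methods regarding overlapping medical knowledge representations and batch editing, and validates it on single and batch edits with the newly built MedVersa benchmark.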

Details Motivation: Because medical knowledge changes quickly and training data contains errors, LLMs often produce outdated or inaccurate information. Parameter-based editing struggles to preserve locality in the medical domain, while existing retrieval-based editing suffers from representation overlap and supports only single-sample edits, which limits its use in clinical practice. Method: The authors first build MedVersa, an enhanced benchmark with broader coverage of medical subjects for evaluating single and batch edits under strict locality constraints. They then propose MedREK, which combines a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Result: On several medical benchmarks, MedREK outperforms existing methods on the core metrics and provides the first validated solution for batch editing of medical LLMs. Conclusion: MedREK offers an efficient, accurate, and scalable editing approach for medical LLMs, particularly for high-stakes clinical applications that need frequent knowledge updates. Abstract: LLMs hold great promise for healthcare applications, but the rapid evolution of medical knowledge and errors in training data often cause them to generate outdated or inaccurate information, limiting their applicability in high-stakes clinical practice. Model editing has emerged as a potential remedy without full retraining. While parameter-based editing often compromises locality and is thus ill-suited for the medical domain, retrieval-based editing offers a more viable alternative. However, it still faces two critical challenges: (1) representation overlap within the medical knowledge space often causes inaccurate retrieval and reduces editing accuracy; (2) existing methods are restricted to single-sample edits, while batch-editing remains largely unexplored despite its importance for real-world medical applications. To address these challenges, we first construct MedVersa, an enhanced benchmark with broader coverage of medical subjects, designed to evaluate both single and batch edits under strict locality constraints. We then propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Experimental results on various medical benchmarks demonstrate that our MedREK achieves superior performance across different core metrics and provides the first validated solution for batch-editing in medical LLMs. Our code and dataset are available at https://github.com/mylittleriver/MedREK.

[62] Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

Yang Li,Zhichen Dong,Yuhan Sun,Weixun Wang,Shaopan Xiong,Yijia Luo,Jiashun Liu,Han Lu,Jiamang Wang,Wenbo Su,Bo Zheng,Junchi Yan

Main category: cs.CL

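TL;DR: This paper analyzes the attention mechanism of large language models to expose their internal reasoning logic. It defines two metrics (Windowed Average Attention Distance and Future Attention Influence), identifies a "preplan-and-anchor" mechanism, and builds three new reinforcement learning strategies on it that improve performance across a range of reasoning tasks.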

Details Motivation: The reasoning process of LLMs is opaque, and standard reinforcement learning assigns uniform credit to every step of a generation, ignoring the difference between pivotal and routine steps and making optimization inefficient. A finer, structure-aware credit assignment method is needed. Method: The authors distinguish locally and globally focused attention heads and analyze their attention patterns; define two metrics, Windowed Average Attention Distance and Future Attention Influence; identify the preplan-and-anchor mechanism; and design three RL strategies that dynamically assign credit to critical nodes. Result: The work reveals the preplan-and-anchor reasoning mechanism inside LLMs; the three RL strategies bring consistent gains across multiple reasoning tasks; attention is shown to serve as a mechanistic blueprint for understanding model reasoning. Conclusion: Using attention as a tool to decode a model's internal reasoning structure enables more transparent and effective RL optimization and offers a new route to better interpretability and reasoning in LLMs. Abstract: The reasoning pattern of Large language models (LLMs) remains opaque, and Reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks.
By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.
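The two attention metrics named in the abstract can be approximated with a short Python sketch. It reads the verbal definitions literally: the distance metric is the attention-weighted look-back distance with each distance clipped at a window size, and the influence metric is the mean attention a token receives from all later tokens. The exact clipping, normalization, and head aggregation in the paper may differ, and the function names and toy matrix are hypothetical.

```python
def windowed_avg_attention_distance(attn, window):
    """Per query token t: sum over s <= t of attn[t][s] * min(t - s, window)."""
    return [
        sum(row[s] * min(t - s, window) for s in range(t + 1))
        for t, row in enumerate(attn)
    ]


def future_attention_influence(attn):
    """Per key token s: mean attention received from query tokens t > s."""
    n = len(attn)
    scores = []
    for s in range(n):
        later = [attn[t][s] for t in range(s + 1, n)]
        scores.append(sum(later) / len(later) if later else 0.0)
    return scores


# Toy causal attention matrix for one head (each row sums to 1).
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.8, 0.1, 0.1],
]
waad = windowed_avg_attention_distance(attn, window=2)
fai = future_attention_influence(attn)
print("WAAD:", waad)
print("FAI:", fai)
```

In the toy matrix the last token looks far back (a high distance score) and the first token collects most later attention (a high influence score), which is the kind of signal the abstract associates with preplan and anchor tokens.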

[63] Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models

Daniil Gurgurov,Josef van Genabith,Simon Ostermann

Main category: cs.CL

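TL;DR: The paper proposes improving LLM performance on low-resource languages by identifying and fine-tuning language-specific neurons; updating at most 1% of the parameters, it consistently outperforms baselines such as full fine-tuning.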

Details Motivation: LLMs perform unevenly across high- and low-resource languages and support low-resource languages poorly, so an efficient adaptation method is needed to improve their multilingual ability. Method: Language-specific neurons are identified with Language Activation Probability Entropy, and only the weights associated with those neurons, forming a language-specific subnetwork, are fine-tuned on target-language data. Result: Experiments on Llama-3.1-8B and Mistral-Nemo-12B across 12 mid- and low-resource languages show that the method, while updating at most 1% of the parameters, consistently outperforms full fine-tuning, FFN-only fine-tuning, LoRA, and random subset fine-tuning, and improves training dynamics, cross-lingual representational alignment, and weight update patterns. Conclusion: The method is an efficient, low-cost way to improve LLM performance on low-resource languages and supports progress toward genuinely multilingual AI. Abstract: Large language models exhibit uneven performance across languages, with substantial gaps between high- and low-resource languages. We present a framework for enhancing monolingual capabilities of LLMs in underrepresented languages while preserving their general-purpose performance through targeted fine-tuning of language-specific subnetworks. Our approach identifies language-specific neurons using Language Activation Probability Entropy and fine-tunes only the weights associated with these neurons, a dedicated subnetwork, on target-language data. Experiments on Llama-3.1-8B and Mistral-Nemo-12B across 12 mid- and low-resource languages demonstrate that our method consistently outperforms full fine-tuning, FFN-only fine-tuning, LoRA adaptation, and random subset fine-tuning baselines while efficiently updating only up to 1% of model parameters. Beyond performance improvements, we observe enhanced favorable training dynamics, cross-lingual representational alignment, and systematic weight update changes. To facilitate future research, we release language-specific neuron identifications for over 100 languages as well as our adaptation pipeline, offering a cost-effective pathway for adapting state-of-the-art models to underrepresented languages.
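The neuron selection step can be illustrated with a small Python sketch of an entropy score over per-language activation probabilities. It assumes each neuron comes with the fraction of tokens on which it fires in each language, normalizes those fractions into a distribution, and treats low entropy as a sign of language specificity. The threshold, the handling of never-active neurons, and the toy numbers are assumptions made here; the paper's selection criteria may include further filters.

```python
import math


def activation_entropy(activation_probs):
    """Entropy of a neuron's activation probabilities, normalized across languages."""
    total = sum(activation_probs)
    if total == 0:
        return float("inf")  # never active: treated as not language-specific
    entropy = 0.0
    for p in activation_probs:
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


def select_language_neurons(probs_by_neuron, max_entropy):
    """Indices of neurons whose activation is concentrated on few languages."""
    return [
        idx for idx, probs in enumerate(probs_by_neuron)
        if activation_entropy(probs) <= max_entropy
    ]


# Toy activation probabilities for three neurons over three languages.
probs_by_neuron = [
    [0.9, 0.0, 0.0],   # fires almost only for language 0
    [0.3, 0.3, 0.3],   # fires equally for all languages
    [0.0, 0.0, 0.0],   # never fires
]
selected = select_language_neurons(probs_by_neuron, max_entropy=0.5)
print(selected)
```

Only the selected neurons' weights would then be left trainable during fine-tuning on target-language data, which is how a subnetwork covering a small share of parameters could arise.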

[64] Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

Pasin Buakhaw,Kun Kerdthaisong,Phuree Phenhiran,Pitikorn Khlaisamniang,Supasate Vorathammathorn,Piyalitt Ittichaiwong,Nutchanon Yongsatianchot

Main category: cs.CL

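TL;DR: This paper reports the Tu_Character_lab team's participation in CPDC 2025 Round 2. It combines two strategies, lightweight prompting and fine-tuned large models, to improve NPC performance on task-oriented dialogue, context-aware dialogue, and their integration, and it ranks highly in several tracks.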

Details Motivation: With the development of LLMs, building dynamic non-player characters (NPCs) with commonsense and persona traits has become an important challenge for game AI. The work aims to improve NPC performance on complex dialogue tasks, in particular completing functional tasks while keeping the character consistent. Method: Two complementary strategies are used. In the API track, lightweight prompting techniques are applied, including a Deflanderization prompting method that suppresses excessive role-play and improves task fidelity. In the GPU track, Qwen3-14B is fine-tuned with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA). Result: The best submissions ranked 2nd on Task 1 and 2nd on Task 3 in the API track, and 4th on Task 3 in the GPU track. Conclusion: Combining lightweight prompt engineering with efficient fine-tuning improves NPC performance across the evaluated dialogue dimensions and offers a practical route to more capable, consistent virtual characters. Abstract: The emergence of large language models (LLMs) has opened new opportunities for creating dynamic non-player characters (NPCs) in gaming environments, enabling both functional task execution and persona-consistent dialogue generation. In this paper, we (Tu_Character_lab) report our participation in the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2, which evaluates agents across three tracks: task-oriented dialogue, context-aware dialogue, and their integration. Our approach combines two complementary strategies: (i) lightweight prompting techniques in the API track, including a Deflanderization prompting method to suppress excessive role-play and improve task fidelity, and (ii) fine-tuned large models in the GPU track, leveraging Qwen3-14B with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA). Our best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on Task 3 (GPU track).

[65] FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation

Kristýna Onderková,Ondřej Plátek,Zdeněk Kasner,Ondřej Dušek

Main category: cs.CL

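TL;DR: This paper introduces FreshTab, a method that generates table-to-text benchmarks on the fly from Wikipedia to counter LLM training data contamination and domain imbalance, and that supports evaluation in multiple languages.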

Details Motivation: Existing table-to-text benchmarks are affected by contamination of LLM training data and by uneven domain distribution, and non-English datasets are scarce, so a fresh, dynamic, multilingual, domain-balanced evaluation is needed. Method: FreshTab collects recent tables from Wikipedia on demand and generates table-to-text benchmark datasets in several languages (English, German, Russian, and French), avoiding contamination and enabling domain-sensitive evaluation. Result: LLM output on the new tables looks clearly worse by automatic metrics, but this difference does not appear in LLM-based and human evaluations; domain effects are visible in all evaluations, showing that a domain-balanced benchmark is more challenging. Conclusion: FreshTab mitigates data contamination and shows that domain has an important effect on evaluating table-to-text models, which underlines the need for domain-balanced benchmarks. Abstract: Table-to-text generation (insight generation from tables) is a challenging task that requires precision in analyzing the data. In addition, the evaluation of existing benchmarks is affected by contamination of Large Language Model (LLM) training data as well as domain imbalance. We introduce FreshTab, an on-the-fly table-to-text benchmark generation from Wikipedia, to combat the LLM data contamination problem and enable domain-sensitive evaluation. While non-English table-to-text datasets are limited, FreshTab collects datasets in different languages on demand (we experiment with German, Russian and French in addition to English). We find that insights generated by LLMs from recent tables collected by our method appear clearly worse by automatic metrics, but this does not translate into LLM and human evaluations. Domain effects are visible in all evaluations, showing that a domain-balanced benchmark is more challenging.

[66] NOSA: Native and Offloadable Sparse Attention

Yuxiang Huang,Chaojun Xiao,Xu Han,Zhiyuan Liu

Main category: cs.CL

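TL;DR: This paper proposes NOSA, a trainable sparse attention framework that supports KV cache offloading. By introducing explicit locality constraints, it reduces KV transfers while keeping the attention computation unchanged, which substantially raises decoding throughput.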

Details Motivation: Existing sparse attention methods do not reduce the size of the KV cache, which limits on-GPU batch sizes and decoding throughput in large-scale batched inference. Method: Token selection is decomposed into query-aware and query-agnostic components, and explicit locality constraints are introduced to make KV cache offloading efficient. Result: A 1B-parameter model pretrained with NOSA achieves up to 2.3x higher decoding throughput than the InfLLM-V2 baseline while keeping near-lossless performance. Conclusion: NOSA addresses the decoding bottleneck caused by the uncompressed KV cache in sparse attention, natively supports KV cache offloading, and substantially improves decoding efficiency in long-context settings. Abstract: Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).

[67] MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning

Xingyu Tan,Xiaoyang Wang,Qing Liu,Xiwei Xu,Xin Yuan,Liming Zhu,Wenjie Zhang

Main category: cs.CL

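TL;DR: The paper proposes MemoTime, a memory-augmented temporal knowledge graph framework that improves LLM reasoning on complex temporal questions through structured grounding, recursive reasoning, and continual experience learning, reaching state-of-the-art results on several benchmarks.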

Details Motivation: LLMs struggle with temporal understanding, especially for questions involving multiple entities, compound operators, and evolving event sequences; existing methods based on temporal knowledge graphs (TKGs) face challenges in multi-hop reasoning, multi-entity synchronization, operator adaptation, and experience reuse. Method: MemoTime decomposes complex temporal questions into a hierarchical Tree of Time that enforces monotonic timestamps and co-constrains multiple entities; a dynamic evidence retrieval layer adapts to different operators; a self-evolving experience memory stores and reuses reasoning traces, toolkit decisions, and sub-question embeddings. Result: MemoTime achieves state-of-the-art results on multiple temporal QA benchmarks, outperforming the strong baseline by up to 24.0%, and lets smaller models (e.g., Qwen3-4B) reach reasoning performance comparable to GPT-4-Turbo. Conclusion: By using structured temporal knowledge and accumulated experience, MemoTime improves the accuracy, stability, and efficiency of LLMs on complex temporal reasoning and advances the integration of LLMs with TKGs. Abstract: Large Language Models (LLMs) have achieved impressive reasoning abilities, but struggle with temporal understanding, especially when questions involve multiple entities, compound operators, and evolving event sequences. Temporal Knowledge Graphs (TKGs), which capture vast amounts of temporal facts in a structured format, offer a reliable source for temporal reasoning. However, existing TKG-based LLM reasoning methods still struggle with four major challenges: maintaining temporal faithfulness in multi-hop reasoning, achieving multi-entity temporal synchronization, adapting retrieval to diverse temporal operators, and reusing prior reasoning experience for stability and efficiency. To address these issues, we propose MemoTime, a memory-augmented temporal knowledge graph framework that enhances LLM reasoning through structured grounding, recursive reasoning, and continual experience learning. MemoTime decomposes complex temporal questions into a hierarchical Tree of Time, enabling operator-aware reasoning that enforces monotonic timestamps and co-constrains multiple entities under unified temporal bounds. A dynamic evidence retrieval layer adaptively selects operator-specific retrieval strategies, while a self-evolving experience memory stores verified reasoning traces, toolkit decisions, and sub-question embeddings for cross-type reuse. Comprehensive experiments on multiple temporal QA benchmarks show that MemoTime achieves overall state-of-the-art results, outperforming the strong baseline by up to 24.0%.
Furthermore, MemoTime enables smaller models (e.g., Qwen3-4B) to achieve reasoning performance comparable to that of GPT-4-Turbo.

[68] Unlocking Public Catalogues: Instruction-Tuning LLMs for ICD Coding of German Tumor Diagnoses

Stefan Lenz,Lakisha Ortiz Rosario,Georg Vollmar,Arsenij Ustjanzew,Fatma Alickovic,Thomas Kindler,Torsten Panholzer

Main category: cs.CL

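TL;DR: This study examines whether instruction datasets built from public catalogues improve the accuracy of open-weight large language models (LLMs) at coding German tumor diagnoses. Fine-tuning substantially improves ICD-10-GM and ICD-O-3 coding, and the gap between small and large models narrows.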

Details Motivation: Smaller open-weight LLMs are attractive for privacy reasons but code German medical text with low accuracy, so the study explores whether instruction fine-tuning can make them suitable for German cancer registration. Method: Eight open-weight models from the Qwen, Llama, and Mistral families (7-70B parameters) are instruction fine-tuned on more than 500,000 question-answer pairs generated from the ICD-10-GM, ICD-O-3, and OPS catalogues, and their coding accuracy is evaluated on data from a local tumor documentation system. Result: ICD-10-GM exact accuracy rose from 1.4-24% to 41-58% and partial accuracy from 31-74% to 73-83%; ICD-O-3 topography coding improved but stayed lower (exact 22-40%, partial 56-67%); malformed code outputs dropped to 0% for all models and tumor-diagnosis recognition reached 99%; accuracy increased with model size, but the gap between small and large models narrowed after fine-tuning; the reasoning mode in Qwen3 performed worse and was over 100 times slower. Conclusion: Instruction datasets built from public medical catalogues effectively improve open-weight LLMs at coding German tumor diagnoses, help close the gap between small and large models, and show potential for automating medical documentation. Abstract: Accurate coding of tumor diagnoses with ICD-10-GM and ICD-O-3 is essential for structured cancer documentation in Germany. Smaller open-weight LLMs are appealing for privacy-preserving automation but often struggle with coding accuracy in German-language contexts. This study investigates whether instruction-based fine-tuning on public datasets improves the coding accuracy of open-weight LLMs for German tumor diagnosis texts. The evaluation uses coded diagnoses from the local tumor documentation system as test data. In a systematic data quality assessment, the upper limit for ICD-10 coding performance was estimated at 60-79% for exact and 81-94% for partial (three-character codes only) derivation. As training data, over 500,000 question-answer pairs were created based on the ICD-10-GM, ICD-O-3, and OPS catalogues. Eight open-weight models from the Qwen, Llama, and Mistral families (7-70 B parameters) were fine-tuned. ICD-10-GM accuracy rose from 1.4-24% to 41-58%, and partial accuracy from 31-74% to 73-83%. The accuracy of ICD-O-3 topography coding also improved but started and remained considerably lower with an exact accuracy of 22-40% and a partial accuracy of 56-67% after fine-tuning. Malformed code outputs dropped to 0% for all models. Tumor-diagnosis recognition reached 99%. Accuracy correlated positively with model size, but gaps between small and large models narrowed after fine-tuning.
The reasoning mode in Qwen3 generally yielded a lower performance than fine-tuning and was over 100 times slower. Our findings highlight the potential of leveraging public catalogues to build instruction datasets that improve LLMs in medical documentation tasks. The complete training dataset and the best-performing checkpoints of the fine-tuned models are available from https://huggingface.co/datasets/stefan-m-lenz/ICDOPS-QA-2024.
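
The abstract distinguishes exact accuracy from partial accuracy, where partial means agreement on the three-character code only. The following is a minimal sketch of how the two metrics could be computed. The function name `icd_accuracy`, the normalization step, and the sample codes are illustrative assumptions, not the authors' evaluation code.

```python
def icd_accuracy(predicted, reference):
    """Return (exact, partial) accuracy for paired lists of ICD-10 codes.

    Exact: the full codes match.
    Partial: only the first three characters (the category) must match.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty lists of equal length")
    # Assumed normalization: ignore surrounding whitespace and letter case.
    pairs = [(p.strip().upper(), r.strip().upper())
             for p, r in zip(predicted, reference)]
    exact = sum(p == r for p, r in pairs)
    partial = sum(p[:3] == r[:3] for p, r in pairs)
    n = len(pairs)
    return exact / n, partial / n


# Hypothetical predictions and reference codes, for illustration only.
preds = ["C50.4", "C34.1", "C18.7", "D05.1"]
refs = ["C50.4", "C34.9", "C20", "D05.1"]
exact, partial = icd_accuracy(preds, refs)
print(exact, partial)  # 0.5 0.75
```

In this toy case the second prediction is wrong at the fourth character but right at the three-character level, so it counts toward partial accuracy only.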

[69] Closing the Gap Between Text and Speech Understanding in LLMs

Santiago Cuervo,Skyler Seto,Maureen de Seyssel,Richard He Bai,Zijin Gu,Tatiana Likhomanenko,Navdeep Jaitly,Zakaria Aldeneh

Main category: cs.CL

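TL;DR: This paper proposes SALAD, which uses cross-modal distillation and actively selected synthetic data to narrow the text-speech understanding gap while using far less speech data.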

Details Motivation: Speech-adapted LLMs underperform on language understanding tasks, and existing remedies rely on large amounts of synthetic or proprietary speech data, leaving a need for data-efficient, reproducible alternatives. Method: The authors attribute the text-speech understanding gap to forgetting of text capabilities and cross-modal misalignment, and propose SALAD, a training framework that combines cross-modal distillation with actively selected synthetic data. Result: On 3B and 7B LLMs, SALAD achieves performance competitive with a strong open-weight model while training on over an order of magnitude less speech data from public corpora. Conclusion: SALAD is a sample-efficient speech-text alignment method that mitigates forgetting and improves cross-modal understanding. Abstract: Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.

[70] How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study

Matthieu Dubois,François Yvon,Pablo Piantanida

Main category: cs.CL

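TL;DR: This paper studies how decoding strategies affect the detection of LLM-generated text. Even minor changes to decoding parameters can severely degrade detector performance, exposing blind spots in current detection methods. The authors release a large-scale dataset covering 37 decoding configurations to support future research.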

Details Motivation: As LLM-generated text becomes harder to distinguish from human writing, automatic text detection has drawn growing attention, yet detectors are mostly evaluated under fixed generation settings and their robustness to different decoding strategies is untested. Method: The authors systematically analyze how sampling-based decoding (such as temperature and top-p, i.e. nucleus, sampling) affects detection, evaluating multiple detectors as the LLM's (sub)word-level distribution is varied. Result: Minor adjustments to decoding parameters can drop detector AUROC from near 100% to 1% in some settings, showing that current detection methods are highly sensitive to the decoding strategy. Conclusion: Current text detection methods have serious blind spots under varying decoding strategies, and more comprehensive evaluation protocols are needed. Abstract: As texts generated by Large Language Models (LLMs) are ever more common and often indistinguishable from human-written content, research on automatic text detection has attracted growing attention. Many recent detectors report near-perfect accuracy, often boasting AUROC scores above 99%. However, these claims typically assume fixed generation settings, leaving open the question of how robust such systems are to changes in decoding strategies. In this work, we systematically examine how sampling-based decoding impacts detectability, with a focus on how subtle variations in a model's (sub)word-level distribution affect detection performance. We find that even minor adjustments to decoding parameters - such as temperature, top-p, or nucleus sampling - can severely impair detector accuracy, with AUROC dropping from near-perfect levels to 1% in some settings. Our findings expose critical blind spots in current detection methods and emphasize the need for more comprehensive evaluation protocols. To facilitate future research, we release a large-scale dataset encompassing 37 decoding configurations, along with our code and evaluation framework at https://github.com/BaggerOfWords/Sampling-and-Detection
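
To make the decoding parameters concrete, here is a minimal sketch of how temperature and top-p (nucleus) truncation reshape a next-token distribution before sampling. It uses only the standard library. The function names and the toy logits are illustrative assumptions and are not taken from the authors' released code.

```python
import math
import random


def adjust_distribution(logits, temperature=1.0, top_p=1.0):
    """Apply temperature scaling, then nucleus (top-p) truncation.

    Returns a probability list with the same length as `logits`.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize the kept tokens; every other token gets probability zero.
    truncated = [0.0] * len(probs)
    for i in kept:
        truncated[i] = probs[i] / mass
    return truncated


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token index from the adjusted distribution."""
    probs = adjust_distribution(logits, temperature, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four tokens
print(adjust_distribution(toy_logits, temperature=1.0, top_p=0.5))
print(adjust_distribution(toy_logits, temperature=2.0, top_p=1.0))
```

A higher temperature flattens the distribution and a lower top-p removes the tail, so both change the token statistics a detector sees even though the underlying model is unchanged.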

[71] NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

Run Luo,Xiaobo Xia,Lu Wang,Longze Chen,Renke Shan,Jing Luo,Min Yang,Tat-Seng Chua

Main category: cs.CL

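TL;DR: This paper presents NExT-OMNI, an open-source omnimodal foundation model built on the discrete flow paradigm that supports any-to-any cross-modal understanding and generation and outperforms prior unified models in multi-turn interaction and cross-modal retrieval.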

Details Motivation: Most existing multimodal models are constrained by autoregressive architectures and struggle to balance understanding and generation, while hybrid or decoupled strategies are redundant and non-integrated, limiting their use in broader scenarios such as cross-modal retrieval. Method: NExT-OMNI uses discrete flow paradigms with metric-induced probability paths and kinetic optimal velocities to achieve unified modeling without task-decoupled designs, and is trained on large-scale interleaved text, image, video, and audio data. Result: NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks and outperforms prior unified models on multi-turn multimodal interaction and cross-modal retrieval. Conclusion: NExT-OMNI provides unified omnimodal modeling with flexible interaction across modalities; the code, model checkpoints, and training details are released to support further research. Abstract: Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.

[72] GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

Xiuyuan Chen,Tao Sun,Dexin Su,Ailing Yu,Junwei Liu,Zhe Chen,Gangzeng Jin,Xin Wang,Jingnan Liu,Hansong Xiao,Hualei Zhou,Dongjie Tao,Chunxiao Guo,Minghui Yang,Yuan Xia,Jing Zhao,Qianrui Fan,Yanyun Wang,Shuai Zhen,Kezhong Chen,Jun Wang,Zewen Sun,Heng Zhao,Tian Guan,Shaodong Wang,Geyun Chang,Jiaming Deng,Hongchengcheng Chen,Kexin Feng,Ruzhen Li,Jiayi Geng,Changtai Zhao,Jun Wang,Guihu Lin,Peihao Li,Liqi Liu,Peng Wei,Jian Wang,Jinjie Gu,Ping Wang,Fan Yang

Main category: cs.CL

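TL;DR: This paper proposes the GAPS framework for multidimensional evaluation of AI clinician systems across cognitive depth, answer completeness, robustness, and safety, builds a benchmark with an automated pipeline, and reveals key failure modes of current models.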

Details Motivation: Existing evaluations of AI clinicians (multiple-choice exams or manual rubrics) fail to capture the depth, robustness, and safety required in real clinical practice, so a more comprehensive and scalable evaluation is needed. Method: The authors propose the GAPS framework (Grounding, Adequacy, Perturbation, Safety), develop a guideline-anchored automated pipeline that generates questions and rubrics, use a DeepResearch agent to mimic GRADE-consistent, PICO-driven evidence review, and score responses with an ensemble of LLM judges. Result: Validation shows the automatically generated questions are high-quality and align with clinician judgment; evaluation of state-of-the-art models reveals marked weaknesses in reasoning depth, answer completeness, robustness to adversarial perturbations, and safety. Conclusion: This automated, clinically grounded approach offers a reproducible and scalable way to rigorously evaluate AI clinician systems and supports the development of safer, more reliable clinical AI. Abstract: Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are high-quality and align with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically-grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice.

[73] Assessing Web Search Credibility and Response Groundedness in Chat Assistants

Ivan Vykopal,Matúš Pikuliak,Simon Ostermann,Marián Šimko

Main category: cs.CL

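TL;DR: This study introduces a methodology for evaluating chat assistants' web search behavior in fact-checking, focusing on source credibility and the groundedness of responses. Using 100 claims across five misinformation-prone topics, it compares GPT-4o, GPT-5, Perplexity, and Qwen Chat: Perplexity achieves the highest source credibility, whereas GPT-4o cites more low-credibility sources on sensitive topics.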

Details Motivation: As chat assistants integrate web search, they may amplify misinformation from low-credibility sources, so their behavior in high-stakes information environments needs systematic evaluation. Method: The authors build a dataset of 100 claims covering five misinformation-prone topics and evaluate, for several widely used chat assistants, the credibility of the sources cited in responses and how well the responses are grounded in those sources. Result: Perplexity performs best at citing high-credibility sources; GPT-4o cites low-credibility sources more often on sensitive topics; the assistants differ in response groundedness. Conclusion: This is the first systematic comparison of fact-checking behavior across commonly used chat assistants and provides a foundation for evaluating the reliability of AI systems in high-stakes information environments. Abstract: Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants' web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants, with Perplexity achieving the highest source credibility, whereas GPT-4o exhibits elevated citation of non-credibility sources on sensitive topics. This work provides the first systematic comparison of commonly used chat assistants for fact-checking behavior, offering a foundation for evaluating AI systems in high-stakes information environments.

[74] Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation

Zhiqi Huang,Vivek Datla,Chenyang Zhu,Alfy Samuel,Daben Liu,Anoop Kumar,Ritesh Soni

Main category: cs.CL

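TL;DR: Proposes a confidence estimation method for retrieval-augmented generation systems based on feed-forward network activations, improving prediction of output correctness, especially for high-stakes domains such as finance.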

Details Motivation: In high-stakes domains such as finance and healthcare, a wrong answer is costly, so a reliable way to estimate the confidence of LLM outputs is needed to decide whether to answer. Method: Raw feed-forward network (FFN) activations are used as auto-regressive signals, avoiding the information loss caused by projection and softmax; confidence prediction is modeled as a sequence classification task, with a Huber loss term regularizing training. Result: The method outperforms strong baselines in a real-world financial customer-support setting; on Llama 3.1 8B, using activations from only the 16th layer preserves accuracy while reducing latency. Conclusion: Activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment. Abstract: We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Confidence estimation is especially critical in high-stakes domains such as finance and healthcare, where the cost of an incorrect answer outweighs that of not answering the question. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as auto-regressive signals, avoiding the information loss inherent in token logits and probabilities after projection and softmax normalization. We model confidence prediction as a sequence classification task, and regularize training with a Huber loss term to improve robustness against noisy supervision. Applied in a real-world financial industry customer-support setting with complex knowledge bases, our method outperforms strong baselines and maintains high accuracy under strict latency constraints. Experiments on Llama 3.1 8B model show that using activations from only the 16th layer preserves accuracy while reducing response latency. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.
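
Two ingredients named in the abstract can be illustrated in a few lines: the standard Huber loss, and the decision to abstain when predicted confidence is low. This is a minimal sketch only. The threshold value, the function names, and the idea of returning `None` for abstention are assumptions for illustration; the abstract does not specify how the confidence classifier or the abstention rule is implemented.

```python
def huber_loss(prediction, target, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear for large residuals."""
    residual = abs(prediction - target)
    if residual <= delta:
        return 0.5 * residual * residual
    return delta * (residual - 0.5 * delta)


def answer_or_abstain(answer, confidence, threshold=0.8):
    """Return the answer only if confidence clears a (hypothetical) threshold."""
    return answer if confidence >= threshold else None


print(huber_loss(0.5, 0.0))   # small residual, quadratic branch: 0.125
print(huber_loss(3.0, 0.0))   # large residual, linear branch: 2.5
print(answer_or_abstain("Your card is active.", confidence=0.93))
print(answer_or_abstain("Your card is active.", confidence=0.41))  # None
```

Because the loss grows only linearly for large residuals, a few mislabeled training examples pull the fit less than they would under squared error, which matches the abstract's stated goal of robustness to noisy supervision.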

[75] The Mechanistic Emergence of Symbol Grounding in Language Models

Shuyu Wu,Ziqiao Ma,Xiaoxi Luo,Yidong Huang,Josue Torres-Fonseca,Freda Shi,Joyce Chai

Main category: cs.CL

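TL;DR: This paper introduces a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations of language models, finding that it concentrates in middle layers, where attention heads aggregate the environmental ground to support the prediction of linguistic forms.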

Details Motivation: Symbol grounding is central to understanding how language models acquire meaning when trained without explicit grounding objectives, but where it emerges and what drives it remain unclear. Method: A controlled evaluation framework combines mechanistic and causal analysis to trace how symbol grounding develops in internal computations, across multimodal dialogue and different architectures (Transformers and state-space models), with unidirectional LSTMs as a comparison. Result: Grounding concentrates in middle layers and is implemented by attention heads that aggregate environmental information; the phenomenon replicates in Transformers and state-space models but not in unidirectional LSTMs. Conclusion: Symbol grounding can emerge in language models with an identifiable mechanism and location, providing behavioral and mechanistic evidence relevant to predicting and potentially controlling the reliability of generation. Abstract: Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.

[76] Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons

Giovanni Monea,Yair Feldman,Shankar Padmanabhan,Kianté Brantley,Yoav Artzi

Main category: cs.CL

TL;DR: Proposes learning to compress the KV cache during generation to improve the scalability of large language models for long-context reasoning.

Details Motivation: The Transformer KV cache grows linearly with context length, incurring memory and compute costs that limit the scalability of long-context reasoning. Method: The generation KV cache is periodically compressed with a learned, special-purpose token and the compressed entries are evicted; the model is trained to perform this compression with a modified joint distillation and reinforcement learning framework. Result: The method achieves a better memory-accuracy trade-off than both the model without cache compression and training-free compression techniques. Conclusion: The approach reduces the resource cost of the KV cache in long-context generation and improves the inference efficiency and scalability of large models. Abstract: The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method minimizes overhead over the conventional RL process, as it leverages RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.

[77] BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning

Jia-Chen Gu,Junyi Zhang,Di Wu,Yuankai Li,Kai-Wei Chang,Nanyun Peng

Main category: cs.CL

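TL;DR: BRIEF-Pro is a lightweight, universal context compressor that distills long retrieved text into concise summaries, improving the efficiency and performance of retrieval-augmented generation (RAG) for multi-hop question answering.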

Details Motivation: As RAG tackles more complex tasks, input contexts keep expanding, which increases latency and the cognitive load on the model, especially for multi-hop questions, so an efficient way to compress long contexts is needed. Method: BRIEF-Pro is a universal, lightweight compressor trained on seed data with short contexts (fewer than 1k words) to perform abstractive compression of long contexts exceeding 10k words; users can specify the number of summary sentences to control length. Result: On four open-domain multi-hop question-answering datasets, BRIEF-Pro produces more concise and relevant summaries; with the 70B reader model, its 32x compression improves QA performance by 4.67% on average over LongLLMLingua's 9x compression while requiring only 23% of its computational overhead. Conclusion: BRIEF-Pro compresses long contexts effectively, reducing compute while improving RAG performance on complex question answering, and generalizes across settings. Abstract: As retrieval-augmented generation (RAG) tackles complex tasks, increasingly expanded contexts offer richer information, but at the cost of higher latency and increased cognitive load on the model. To mitigate this bottleneck, especially for intricate multi-hop questions, we introduce BRIEF-Pro. It is a universal, lightweight compressor that distills relevant evidence for a given query from retrieved documents into a concise summary for seamless integration into in-context RAG. Using seed data consisting of relatively short contexts (fewer than 1k words), BRIEF-Pro is trained to perform abstractive compression of extended contexts exceeding 10k words across a wide range of scenarios. Furthermore, BRIEF-Pro offers flexible user control over summary length by allowing users to specify the desired number of sentences. Experiments on four open-domain multi-hop question-answering datasets show that BRIEF-Pro generates more concise and relevant summaries, enhancing performance across small, large, and proprietary language models. With the 70B reader model, 32x compression by BRIEF-Pro improves QA performance by 4.67% on average over LongLLMLingua's 9x, while requiring only 23% of its computational overhead.

cs.CV [Back]

[78] SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms

Haithem Turki,Qi Wu,Xin Kang,Janick Martinez Esturo,Shengyu Huang,Ruilong Li,Zan Gojcic,Riccardo de Lutio

Main category: cs.CV

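TL;DR: This paper proposes SimULi, a method that renders arbitrary camera models and LiDAR data in real time, addressing the low speed, pinhole-only camera support, and cross-sensor inconsistency of existing neural rendering methods in multi-sensor simulation.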

Details Motivation: Existing neural rendering methods based on NeRF and 3DGS are limited in rendering speed and camera model support, and multi-sensor simulation suffers from cross-sensor inconsistencies, which restricts their use in applications such as autonomous driving. Method: SimULi extends 3DGUT, which supports complex camera models, with LiDAR support via an automated tiling strategy and ray-based culling, and introduces a factorized 3D Gaussian representation and anchoring strategy to reduce cross-sensor inconsistencies. Result: SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work; on two widely used autonomous driving datasets it matches or exceeds the fidelity of state-of-the-art methods across numerous camera and LiDAR metrics. Conclusion: SimULi is the first method able to render arbitrary camera models and LiDAR data in real time, improving the efficiency and consistency of multi-sensor simulation for demanding applications such as autonomous driving. Abstract: Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.

[79] State-Change Learning for Prediction of Future Events in Endoscopic Videos

Saurav Sharma,Chinedu Innocent Nwoye,Didier Mutter,Nicolas Padoy

Main category: cs.CV

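TL;DR: This paper proposes SurgFUTR, a surgical future prediction method that uses a state-change learning framework to predict short-term and long-term surgical events, improving generalization and accuracy.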

Details Motivation: Current surgical AI research focuses on recognizing what is happening now and lacks unified, fine-grained methods for predicting future events, which makes generalization across surgical contexts difficult. Method: Surgical future prediction is reframed as state-change learning with a teacher-student architecture: video clips are compressed into state representations via Sinkhorn-Knopp clustering, and the Action Dynamics (ActDyn) module guides the student network to predict future states from current video alone. Result: Experiments across four datasets and three procedures show consistent improvements on short-term and long-term prediction tasks, and cross-procedure transfer validates generalizability. Conclusion: By modeling state changes, SurgFUTR achieves more general and more accurate surgical future prediction, offering a tool for real-time intraoperative decision support. Abstract: Surgical future prediction, driven by real-time AI analysis of surgical video, is critical for operating room safety and efficiency. It provides actionable insights into upcoming events, their timing, and risks, enabling better resource allocation, timely instrument readiness, and early warnings for complications (e.g., bleeding, bile duct injury). Despite this need, current surgical AI research focuses on understanding what is happening rather than predicting future events. Existing methods target specific tasks in isolation, lacking unified approaches that span both short-term (action triplets, events) and long-term horizons (remaining surgery duration, phase transitions). These methods rely on coarse-grained supervision while fine-grained surgical action triplets and steps remain underexplored. Furthermore, methods based only on future feature prediction struggle to generalize across different surgical contexts and procedures. We address these limits by reframing surgical future prediction as state-change learning. Rather than forecasting raw observations, our approach classifies state transitions between current and future timesteps. We introduce SurgFUTR, implementing this through a teacher-student architecture. Video clips are compressed into state representations via Sinkhorn-Knopp clustering; the teacher network learns from both current and future clips, while the student network predicts future states from current videos alone, guided by our Action Dynamics (ActDyn) module. We establish SFPBench with five prediction tasks spanning short-term (triplets, events) and long-term (remaining surgery duration, phase and step transitions) horizons. Experiments across four datasets and three procedures show consistent improvements. Cross-procedure transfer validates generalizability.

[80] Robust Plant Disease Diagnosis with Few Target-Domain Samples

Takafumi Nogami,Satoshi Kagiwada,Hitoshi Iyatomi

Main category: cs.CV

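TL;DR: Proposes TMPS, a simple and highly adaptable metric-learning framework that improves the robustness of plant disease diagnosis models across environments, clearly outperforming the baseline and other methods with only 10 target-domain samples per disease.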

Details Motivation: Existing deep learning models lose diagnostic accuracy on images captured outside the training environment, mainly because limited training-data diversity and domain gaps lead to poor generalization. Method: Target-Aware Metric Learning with Prioritized Sampling (TMPS) is a metric-learning framework that trains effectively with a small number of labeled target-domain samples, using prioritized sampling to strengthen learning of target-domain features. Result: On a large-scale dataset of 223,073 leaf images, adding 10 target-domain samples per disease lets TMPS improve average macro F1 by 7.3 and 3.6 points over combined-data training and fine-tuning, respectively, and by 18.7 and 17.1 points over the baseline and conventional metric learning. Conclusion: TMPS makes effective use of a few target-domain samples to improve cross-domain generalization, markedly increasing the robustness of plant disease diagnosis systems. Abstract: Various deep learning-based systems have been proposed for accurate and convenient plant disease diagnosis, achieving impressive performance. However, recent studies show that these systems often fail to maintain diagnostic accuracy on images captured under different conditions from the training environment -- an essential criterion for model robustness. Many deep learning methods have shown high accuracy in plant disease diagnosis. However, they often struggle to generalize to images taken in conditions that differ from the training setting. This drop in performance stems from the subtle variability of disease symptoms and domain gaps -- differences in image context and environment. The root cause is the limited diversity of training data relative to task complexity, making even advanced models vulnerable in unseen domains. To tackle this challenge, we propose a simple yet highly adaptable learning framework called Target-Aware Metric Learning with Prioritized Sampling (TMPS), grounded in metric learning. TMPS operates under the assumption of access to a limited number of labeled samples from the target (deployment) domain and leverages these samples effectively to improve diagnostic robustness. We assess TMPS on a large-scale automated plant disease diagnostic task using a dataset comprising 223,073 leaf images sourced from 23 agricultural fields, spanning 21 diseases and healthy instances across three crop species. By incorporating just 10 target domain samples per disease into training, TMPS surpasses models trained using the same combined source and target samples, and those fine-tuned with these target samples after pre-training on source data. It achieves average macro F1 score improvements of 7.3 and 3.6 points, respectively, and a remarkable 18.7 and 17.1 point improvement over the baseline and conventional metric learning.

[81] Unifying Vision-Language Latents for Zero-label Image Caption Enhancement

Sanghyun Byun,Jung Ick Guack,Mohanad Odema,Baisub Lee,Jacob Song,Woo Seong Chung

Main category: cs.CV

TL;DR: Proposes ViZer, a framework that aligns vision and language without text labels to improve image caption quality.

Details Motivation: Existing vision-language models depend on labeled data, which limits scalability and leaves large amounts of unlabeled image data underused. Method: ViZer actively aligns vision and language representation features during training, enabling zero-label learning without text labels or full retraining. Result: Applied to SmolVLM-Base and Qwen2-VL, ViZer yields captions that are more grounded and descriptive than the baselines in qualitative evaluation. Conclusion: ViZer offers a practical starting point for zero-label adaptation in vision-language tasks and makes use of unlabeled data to improve model performance. Abstract: Vision-language models (VLMs) achieve remarkable performance through large-scale image-text pretraining. However, their reliance on labeled image datasets limits scalability and leaves vast amounts of unlabeled image data underutilized. To address this, we propose Unified Vision-Language Alignment for Zero-Label Enhancement (ViZer), an enhancement training framework that enables zero-label learning in image captioning, providing a practical starting point for broader zero-label adaptation in vision-language tasks. Unlike prior approaches that rely on human or synthetically annotated datasets, ViZer actively aligns vision and language representation features during training, enabling existing VLMs to generate improved captions without requiring text labels or full retraining. We demonstrate ViZer's advantage in qualitative evaluation, as automated caption metrics such as CIDEr and BERTScore often penalize details that are absent in reference captions. Applying ViZer on SmolVLM-Base and Qwen2-VL, we observe consistent qualitative improvements, producing captions that are more grounded and descriptive than their baseline.

[82] Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation

Xiao He,Huangxuan Zhao,Guojia Wan,Wei Zhou,Yanxing Liu,Juhua Liu,Yongchao Xu,Yong Luo,Dacheng Tao,Bo Du

Main category: cs.CV

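TL;DR: FetalMind is a medical AI system designed for fetal ultrasound. With the Salient Epistemic Disentanglement (SED) method and the large-scale FetalSigma-1M dataset, it clearly outperforms existing models on report generation and diagnosis.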

Details Motivation: Existing medical vision-language models are mostly adapted to structured adult imaging and underperform on fetal ultrasound because of multi-view reasoning, the large number of diseases, and image diversity, so a targeted solution is needed. Method: Salient Epistemic Disentanglement (SED) injects an expert-curated bipartite graph into the model to decouple view-disease associations and uses reinforcement learning to steer inference along clinically faithful steps; the authors also curate FetalSigma-1M, the first large-scale fetal ultrasound report corpus, for training. Result: FetalMind outperforms open- and closed-source baselines across all gestational stages, with +14% average gains and +61.2% higher accuracy on critical conditions, while remaining efficient, stable, and scalable. Conclusion: With clinically guided modeling and a high-quality dataset, FetalMind achieves clear gains in report generation and diagnosis for fetal ultrasound. Abstract: Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: https://hexiao0275.github.io/FetalMind.

[83] CADE 2.5 - ZeResFDG: Frequency-Decoupled, Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models

Denis Rychkovskiy,GPT-5

Main category: cs.CV

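TL;DR: This paper presents CADE 2.5, a sampler-level guidance stack for SD/SDXL latent diffusion models. Its core module, ZeResFDG, combines frequency-decoupled guidance, energy rescaling, and zero-projection to improve sharpness, prompt adherence, and artifact control, and a training-free QSilk micro-texture stabilizer improves high-frequency detail and robustness.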

Details Motivation: Improve the detail and stability of diffusion models at moderate guidance scales while preserving generation quality and avoiding retraining. Method: The ZeResFDG module combines frequency-decoupled guidance, energy rescaling, and zero-projection, and a lightweight spectral exponential moving average (EMA) switches between a conservative and a detail-seeking mode during sampling; the QSilk Micrograin Stabilizer provides inference-time stabilization. Result: Across SD/SDXL samplers, CADE 2.5 improves sharpness, prompt adherence, and artifact control, and yields natural high-frequency micro-texture at high resolutions with negligible overhead. Conclusion: CADE 2.5 is an efficient, general sampler-level guidance method that requires no retraining and improves detail preservation and generation stability in diffusion models. Abstract: We introduce CADE 2.5 (Comfy Adaptive Detail Enhancer), a sampler-level guidance stack for SD/SDXL latent diffusion models. The central module, ZeResFDG, unifies (i) frequency-decoupled guidance that reweights low- and high-frequency components of the guidance signal, (ii) energy rescaling that matches the per-sample magnitude of the guided prediction to the positive branch, and (iii) zero-projection that removes the component parallel to the unconditional direction. A lightweight spectral EMA with hysteresis switches between a conservative and a detail-seeking mode as structure crystallizes during sampling. Across SD/SDXL samplers, ZeResFDG improves sharpness, prompt adherence, and artifact control at moderate guidance scales without any retraining. In addition, we employ a training-free inference-time stabilizer, QSilk Micrograin Stabilizer (quantile clamp + depth/edge-gated micro-detail injection), which improves robustness and yields natural high-frequency micro-texture at high resolutions with negligible overhead. For completeness we note that the same rule is compatible with alternative parameterizations (e.g., velocity), which we briefly discuss in the Appendix; however, this paper focuses on SD/SDXL latent diffusion models.
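
Two of the three ZeResFDG components, zero-projection and energy rescaling, can be sketched with plain vector arithmetic. This is a rough illustration under stated assumptions, not the paper's implementation: predictions are flattened to Python lists, the projection is applied to the guidance difference (conditional minus unconditional), the guided prediction follows the usual classifier-free guidance form, and frequency decoupling is omitted. All function names and the guidance scale are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def zero_project(delta, uncond, eps=1e-12):
    """Remove from `delta` the component parallel to `uncond`."""
    coef = dot(delta, uncond) / (dot(uncond, uncond) + eps)
    return [d - coef * u for d, u in zip(delta, uncond)]


def rescale_to(guided, reference, eps=1e-12):
    """Scale `guided` so its magnitude matches that of `reference`."""
    scale = norm(reference) / (norm(guided) + eps)
    return [g * scale for g in guided]


def guided_prediction(cond, uncond, guidance_scale=5.0):
    """Classifier-free guidance with zero-projection and energy rescaling."""
    delta = [c - u for c, u in zip(cond, uncond)]
    delta = zero_project(delta, uncond)  # assumed target of the projection
    guided = [u + guidance_scale * d for u, d in zip(uncond, delta)]
    return rescale_to(guided, cond)      # match the positive branch's norm


cond = [2.0, 2.0]    # hypothetical conditional (positive) prediction
uncond = [1.0, 0.0]  # hypothetical unconditional prediction
out = guided_prediction(cond, uncond)
print(out, norm(out), norm(cond))
```

The rescaling step keeps the output magnitude tied to the conditional prediction regardless of the guidance scale, which is one plausible reading of how the method limits oversaturation at higher scales.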

[84] Scope: Selective Cross-modal Orchestration of Visual Perception Experts

Tianyu Zhang,Suyuchen Wang,Chao Wang,Juan Rodriguez,Ahmed Masry,Xiangru Jian,Yoshua Bengio,Perouz Taslakian

Main category: cs.CV

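TL;DR: SCOPE is a Mixture-of-Encoders framework with instance-level routing that dynamically selects the best vision encoder to make vision-language models more efficient, outperforming models that use multiple encoders while cutting compute by 24-49%.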

Details Motivation: Vision-language models that use multiple vision encoders suffer from high inference cost and diminishing returns, so a more efficient way to use encoders is needed. Method: SCOPE keeps one shared encoder and a pool of routed encoders; a lightweight router uses cross-attention between the text prompt and shared visual features to select the most suitable encoder for each image-text pair, and is trained with dual entropy regularization and auxiliary losses that balance dataset-level load distribution against instance-level routing confidence. Result: With one shared plus one routed encoder, SCOPE outperforms models that use all four extra encoders simultaneously while reducing compute by 24-49%. Conclusion: Intelligent encoder selection beats brute-force aggregation of multiple encoders, challenging the prevailing paradigm in multi-encoder vision-language models. Abstract: Vision-language models (VLMs) benefit from multiple vision encoders, but naively stacking them yields diminishing returns while multiplying inference costs. We propose SCOPE, a Mixture-of-Encoders (MoEnc) framework that dynamically selects one specialized encoder per image-text pair via instance-level routing, unlike token-level routing in traditional MoE. SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the routed encoders. To train this router, we introduce dual entropy regularization with auxiliary losses to balance dataset-level load distribution with instance-level routing confidence. Remarkably, SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24-49%. This demonstrates that intelligent encoder selection beats brute-force aggregation, challenging the prevailing paradigm in multi-encoder VLMs.
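
The abstract does not give the form of the dual entropy regularizer, so the following is only one plausible sketch: penalize high entropy in each instance's routing distribution (to encourage confident choices) and reward high entropy in the batch-averaged distribution (to spread load across encoders). The function names, the weights, and the use of a batch as a stand-in for the dataset are all assumptions.

```python
import math


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a probability list."""
    return -sum(p * math.log(p + eps) for p in probs)


def dual_entropy_loss(routing_probs, w_instance=1.0, w_batch=1.0):
    """Low per-instance entropy is rewarded, low batch-level entropy is penalized.

    routing_probs: one probability list over encoders per image-text pair.
    """
    n = len(routing_probs)
    k = len(routing_probs[0])
    # Instance-level term: mean entropy of each routing decision.
    instance_term = sum(entropy(p) for p in routing_probs) / n
    # Batch-level term: entropy of the average routing distribution.
    mean_probs = [sum(p[j] for p in routing_probs) / n for j in range(k)]
    batch_term = entropy(mean_probs)
    return w_instance * instance_term - w_batch * batch_term


balanced = [[1.0, 0.0], [0.0, 1.0]]   # confident and evenly spread
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # confident, but one encoder gets all load
uncertain = [[0.5, 0.5], [0.5, 0.5]]  # evenly spread, but never confident
for name, batch in [("balanced", balanced), ("collapsed", collapsed),
                    ("uncertain", uncertain)]:
    print(name, round(dual_entropy_loss(batch), 4))
```

Under this formulation only the confident and evenly spread case achieves a clearly lower loss, which is the balance between load distribution and routing confidence that the abstract describes.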

[85] SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Tanveer Hannan,Shuaicong Wu,Mark Weber,Suprosanna Shit,Jindong Gu,Rajat Koner,Aljoša Ošep,Laura Leal-Taixé,Thomas Seidl

Main category: cs.CV

TL;DR: This paper introduces Spatio-temporal Video Action Grounding (SVAG), a new task that requires simultaneously detecting, tracking, and temporally localizing the objects performing actions in a video from natural language descriptions, and builds the large-scale SVAG-Bench dataset and the SVAGFormer baseline to advance fine-grained video understanding.

Details Motivation: Existing methods mostly target coarse-grained action recognition or generic object tracking and struggle to jointly detect and spatio-temporally ground multiple objects from action-based language descriptions, which calls for a new task that handles fine-grained actions and object dynamics together. Method: Proposes the SVAG task and the SVAG-Bench dataset, comprising 688 videos, 19,590 annotated records, and 903 unique verbs; designs SVAGFormer as a baseline framework that adapts state-of-the-art vision-language models for joint spatial and temporal grounding; and develops the SVAGEval evaluation toolkit. Result: Experiments show that existing models perform poorly on SVAG, particularly in dense or complex scenes, exposing the limits of current methods in reasoning over fine-grained object-action interactions in long videos. Conclusion: SVAG offers a new challenge and direction for fine-grained video understanding, underlining that models need stronger spatio-temporal reasoning to relate the actions in a language description to the dynamics of the corresponding objects. Abstract: Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite recent progress in video understanding, existing methods predominantly address either coarse-grained action recognition or generic object tracking, thereby overlooking the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos based on natural language descriptions of their actions. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering a diverse range of objects, actions, and real-world scenes. We further propose SVAGFormer, a baseline framework that adapts state-of-the-art vision-language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.

[86] SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models

Zhengxu Tang,Zizheng Wang,Luning Wang,Zitao Shuai,Chenhao Zhang,Siyu Qian,Yirui Wu,Bohao Wang,Haosong Rao,Zhenyu Yang,Chenwei Wu

Main category: cs.CV

TL;DR: This paper presents SeqBench, a comprehensive benchmark for evaluating sequential narrative coherence in text-to-video generation, together with an automatic evaluation metric based on Dynamic Temporal Graphs (DTG), and reveals key weaknesses of current models in multi-action sequences, multi-object scenarios, and temporal ordering.

Details Motivation: Existing text-to-video models achieve good visual quality but struggle to generate sequential narratives that require logical progression, and no benchmark evaluates narrative coherence over long sequences. Method: Builds the SeqBench dataset of 320 prompts and 2,560 human-annotated videos spanning various narrative complexities, and proposes an automatic evaluation method based on Dynamic Temporal Graphs (DTG) that efficiently captures long-range dependencies and temporal relationships. Result: The DTG metric correlates strongly with human annotations; the evaluation finds that current T2V models have notable problems with object-state consistency, physical plausibility in multi-object scenes, and preserving the timing and ordering of actions. Conclusion: SeqBench provides the first systematic framework for evaluating narrative coherence in T2V generation, the DTG metric is both efficient and reliable, and the findings point to directions for improving sequential reasoning in future models. Abstract: Text-to-video (T2V) generation models have made significant progress in creating visually appealing videos. However, they struggle with generating coherent sequential narratives that require logical progression through multiple events. Existing T2V benchmarks primarily focus on visual quality metrics but fail to evaluate narrative coherence over extended sequences. To bridge this gap, we present SeqBench, a comprehensive benchmark for evaluating sequential narrative coherence in T2V generation. SeqBench includes a carefully designed dataset of 320 prompts spanning various narrative complexities, with 2,560 human-annotated videos generated from 8 state-of-the-art T2V models. Additionally, we design a Dynamic Temporal Graphs (DTG)-based automatic evaluation metric, which can efficiently capture long-range dependencies and temporal ordering while maintaining computational efficiency. Our DTG-based metric demonstrates a strong correlation with human annotations. Through systematic evaluation using SeqBench, we reveal critical limitations in current T2V models: failure to maintain consistent object states across multi-action sequences, physically implausible results in multi-object scenarios, and difficulties in preserving realistic timing and ordering relationships between sequential actions. SeqBench provides the first systematic framework for evaluating narrative coherence in T2V generation and offers concrete insights for improving sequential reasoning capabilities in future models. Please refer to https://videobench.github.io/SeqBench.github.io/ for more details.

[87] SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion

Jungbin Cho,Minsu Kim,Jisoo Kim,Ce Zheng,Laszlo A. Jeni,Ming-Hsuan Yang,Youngjae Yu,Seonjoo Kim

Main category: cs.CV

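TL;DR: This paper proposes SceneAdapt, a framework that uses disjoint scene-motion and text-motion datasets to inject scene awareness into text-conditioned motion generation models through two stages (inbetweening and scene-aware inbetweening), improving how well generated motion fits the scene.

Details Motivation: Existing motion generation methods usually treat motion semantics or scene awareness in isolation and cannot deliver both rich semantics and precise scene interaction, mainly because large-scale datasets with both high-quality text annotations and real scene interactions are lacking. Method: Proposes the SceneAdapt framework with two adaptation stages: the first introduces learnable keyframing layers that modulate the inbetweening process while preserving the latent manifold; the second adds a cross-attention-based scene-conditioning layer that injects scene geometry by adaptively querying local context, making the text-to-motion model scene-aware. Result: Experiments show that SceneAdapt effectively injects scene awareness into text-to-motion models, producing motion that better respects scene constraints, and ablation studies examine the role of each module. Conclusion: SceneAdapt fuses multiple datasets without relying on jointly annotated data, combining text-driven motion generation with scene awareness and suggesting a route to more realistic, semantically rich motion generation systems. Abstract: Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches address either motion semantics or scene-awareness in isolation, since constructing large-scale datasets with both rich text-motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene-motion and text-motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, learnable without text, as a proxy task to bridge two distinct datasets and thereby inject scene-awareness into text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.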

[88] One Dimensional CNN ECG Mamba for Multilabel Abnormality Classification in 12 Lead ECG

Huawei Jiang,Husna Mutahira,Gan Huang,Mannan Saeed Muhammad

Main category: cs.CV

TL;DR: Proposes a hybrid framework (1D CNN-ECG Mamba) that combines a one-dimensional convolutional neural network with the Mamba state space model for ECG abnormality detection, and reports better performance than existing methods on the PhysioNet 2020 and 2021 challenge data.

Details Motivation: Conventional deep learning models are limited when processing long sequential ECG signals, so more efficient sequence modeling is needed to improve abnormality detection accuracy. Method: Builds a bidirectional selective state space model on Vision Mamba and combines it with one-dimensional convolutional feature extraction, forming the 1D CNN-ECG Mamba hybrid framework. Result: On twelve-lead ECG data, the model achieves substantially higher AUPRC and AUROC than the best previously published algorithms, indicating stronger modeling of temporal dependencies. Conclusion: Mamba-based architectures can improve the accuracy of ECG classification, supporting early diagnosis and personalized treatment and aiding use in telemedicine and resource-constrained settings. Abstract: Accurate detection of cardiac abnormalities from electrocardiogram recordings is regarded as essential for clinical diagnostics and decision support. Traditional deep learning models such as residual networks and transformer architectures have been applied successfully to this task, but their performance has been limited when long sequential signals are processed. Recently, state space models have been introduced as an efficient alternative. In this study, a hybrid framework named One Dimensional Convolutional Neural Network Electrocardiogram Mamba is introduced, in which convolutional feature extraction is combined with Mamba, a selective state space model designed for effective sequence modeling. The model is built upon Vision Mamba, a bidirectional variant through which the representation of temporal dependencies in electrocardiogram data is enhanced. Comprehensive experiments on the PhysioNet Computing in Cardiology Challenges of 2020 and 2021 were conducted, and superior performance compared with existing methods was achieved. Specifically, the proposed model achieved substantially higher AUPRC and AUROC scores than those reported by the best previously published algorithms on twelve lead electrocardiograms. These results demonstrate the potential of Mamba-based architectures to advance reliable ECG classification. This capability supports early diagnosis and personalized treatment, while enhancing accessibility in telemedicine and resource-constrained healthcare systems.

[89] True Self-Supervised Novel View Synthesis is Transferable

Thomas W. Mitchel,Hyunwoo Ryu,Vincent Sitzmann

Main category: cs.CV

TL;DR: This paper presents XFactor, the first geometry-free self-supervised novel view synthesis model, which disentangles camera pose from scene content through pair-wise pose estimation and input-output augmentation and achieves pose transferability.

Details Motivation: The poses predicted by existing self-supervised novel view synthesis methods do not transfer across scenes, which limits the models' ability to perform true novel view synthesis. Method: XFactor combines pair-wise pose estimation with a simple augmentation scheme of the inputs and outputs, disentangling camera pose from scene content without any 3D inductive biases and supporting geometric reasoning. Result: Large-scale experiments show that XFactor significantly outperforms prior pose-free NVS transformers, that its latent pose variables are highly correlated with real-world poses, and that a newly proposed metric confirms its transferability. Conclusion: XFactor is the first to achieve transferable self-supervised novel view synthesis without explicit 3D structure or an SE(3) parameterization, showing that models without geometric constraints can still learn meaningful pose representations. Abstract: In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: Whether any pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: The same set of poses leads to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme of the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry, such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments, we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers, and show that latent poses are highly correlated with real-world poses through probing experiments.

[90] Direction-aware multi-scale gradient loss for infrared and visible image fusion

Kaixuan Yang,Wei Xiang,Zhenshuai Chen,Tong Jin,Yunpeng Liu

Main category: cs.CV

TL;DR: Proposes a direction-aware multi-scale gradient loss for infrared and visible image fusion that supervises the horizontal and vertical gradient components separately and preserves their sign, improving edge sharpness and texture preservation.

Details Motivation: Existing methods discard directional information when handling gradients, which makes the supervision signal ambiguous and degrades edge quality. Method: Designs an axis-wise, sign-preserving multi-scale gradient loss that adds direction awareness without changing the network architecture or training pipeline. Result: Experiments on an open-source model and multiple public benchmarks show that the method improves edge alignment and texture detail in fused images. Conclusion: The proposed direction-aware gradient loss gives the fusion task a clearer optimization direction and significantly improves result quality. Abstract: Infrared and visible image fusion aims to integrate complementary information from co-registered source images to produce a single, informative result. Most learning-based approaches train with a combination of structural similarity loss, intensity reconstruction loss, and a gradient-magnitude term. However, collapsing gradients to their magnitude removes directional information, yielding ambiguous supervision and suboptimal edge fidelity. We introduce a direction-aware, multi-scale gradient loss that supervises horizontal and vertical components separately and preserves their sign across scales. This axis-wise, sign-preserving objective provides clear directional guidance at both fine and coarse resolutions, promoting sharper, better-aligned edges and richer texture preservation without changing model architectures or training protocols. Experiments on an open-source model and multiple public benchmarks demonstrate the effectiveness of our approach.
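A small sketch shows why keeping the sign of each gradient component matters. It is written under stated assumptions rather than taken from the paper: finite differences stand in for the gradient operator, the per-axis target is whichever source gradient has the larger magnitude (a common choice in fusion losses, not confirmed by the abstract), scales are built by 2x2 average pooling, and all function names are hypothetical. Images are nested lists whose sides stay even at every scale that gets downsampled.

```python
def grad_x(img):
    """Signed horizontal forward differences (one column narrower)."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in img]

def grad_y(img):
    """Signed vertical forward differences (one row shorter)."""
    return [[img[i + 1][j] - img[i][j] for j in range(len(img[0]))]
            for i in range(len(img) - 1)]

def downsample(img):
    """2x2 average pooling; assumes even height and width."""
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
             for j in range(0, len(img[0]), 2)]
            for i in range(0, len(img), 2)]

def max_abs(a, b):
    """Element-wise pick of the value with larger magnitude, sign kept."""
    return [[x if abs(x) >= abs(y) else y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def mean_abs_diff(a, b):
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    count = sum(len(r) for r in a)
    return total / count

def directional_gradient_loss(fused, ir, vis, scales=2):
    """Sum of per-axis L1 gradient errors over `scales` resolutions."""
    loss = 0.0
    for s in range(scales):
        for grad in (grad_x, grad_y):
            target = max_abs(grad(ir), grad(vis))  # assumed target rule
            loss += mean_abs_diff(grad(fused), target)
        if s < scales - 1:
            fused, ir, vis = downsample(fused), downsample(ir), downsample(vis)
    return loss

ir = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]       # edge rising left to right
vis = [[0.0] * 4 for _ in range(4)]                 # flat image
flipped = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]  # same edge, opposite sign

loss_good = directional_gradient_loss(ir, ir, vis)
loss_flipped = directional_gradient_loss(flipped, ir, vis)
```

The flipped image has exactly the same gradient magnitudes as the source, so a magnitude-only term could not tell the two apart, whereas the signed, axis-wise loss is zero for the correct edge and positive for the reversed one.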

[91] Unsupervised Domain Adaptation via Content Alignment for Hippocampus Segmentation

Hoda Kalabizadeh,Ludovica Griffanti,Pak-Hei Yeung,Ana I. L. Namburete,Nicola K. Dinsdale,Konstantinos Kamnitsas

Main category: cs.CV

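TL;DR: Proposes a new unsupervised domain adaptation framework that combines efficient style normalization with a bidirectional deformable image registration strategy to address domain shift in cross-domain hippocampus segmentation from MRI, with a focus on content variation, and significantly outperforms existing methods across datasets.

Details Motivation: Deep learning models lose performance when applied across datasets because of domain shift (differences in image appearance and in anatomical characteristics); content differences are especially large when transferring from healthy young populations to dementia patients, so the challenge of content variation needs to be addressed directly. Method: Combines z-normalisation for style harmonisation with a bidirectional deformable image registration (DIR) strategy trained jointly with segmentation and discriminator networks, which guides registration toward the region of interest and generates anatomically plausible transformations that align source images to the target domain. Result: Validated on the synthetic Morpho-MNIST dataset and three MRI hippocampus datasets with varying degrees of atrophy; compared with standard augmentation methods, the Dice score improves by up to 15% (relative) when transferring from healthy young populations to dementia patients, with the largest gains under substantial content shift. Conclusion: The method handles content variation in cross-domain hippocampus segmentation effectively, significantly improves segmentation accuracy, and suits medical image segmentation across different populations. Abstract: Deep learning models for medical image segmentation often struggle when deployed across different datasets due to domain shifts - variations in both image appearance, known as style, and population-dependent anatomical characteristics, referred to as content. This paper presents a novel unsupervised domain adaptation framework that directly addresses domain shifts encountered in cross-domain hippocampus segmentation from MRI, with specific emphasis on content variations. Our approach combines efficient style harmonisation through z-normalisation with a bidirectional deformable image registration (DIR) strategy. The DIR network is jointly trained with segmentation and discriminator networks to guide the registration with respect to a region of interest and generate anatomically plausible transformations that align source images to the target domain. We validate our approach through comprehensive evaluations on both a synthetic dataset using Morpho-MNIST (for controlled validation of core principles) and three MRI hippocampus datasets representing populations with varying degrees of atrophy. Across all experiments, our method outperforms existing baselines. For hippocampus segmentation, when transferring from young, healthy populations to clinical dementia patients, our framework achieves up to 15% relative improvement in Dice score compared to standard augmentation methods, with the largest gains observed in scenarios with substantial content shift. These results highlight the efficacy of our approach for accurate hippocampus segmentation across diverse populations.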
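Three quantities in this summary have standard definitions that are easy to state in code: z-normalisation of intensities, the Dice score between binary masks, and a relative improvement. The sketch below uses flat lists in place of image volumes; the function names are illustrative and the Dice values passed to `relative_improvement` are made-up numbers chosen only to show what a 15% relative gain looks like.

```python
import statistics

def z_normalise(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def dice(pred, target):
    """Dice score between two binary masks given as 0/1 sequences."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total

def relative_improvement(new, baseline):
    return (new - baseline) / baseline

normalised = z_normalise([1.0, 2.0, 3.0])
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
gain = relative_improvement(0.69, 0.60)  # hypothetical Dice values
```

A relative improvement is measured against the baseline score, so a rise from 0.60 to 0.69 Dice counts as 15% relative even though the absolute change is 0.09.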

[92] Counting Hallucinations in Diffusion Models

Shuai Fu,Jian Zhou,Qi Chen,Huang Jing,Huy Anh Nguyen,Xiaohan Liu,Zhixiong Zeng,Lin Ma,Quanshi Zhang,Qi Wu

Main category: cs.CV

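TL;DR: This study addresses counting hallucinations in diffusion models: it builds the CountHalluSet dataset suite, proposes a standardized evaluation protocol, systematically analyzes how sampling conditions affect hallucinations, and finds that the widely used FID metric fails to capture the problem.

Details Motivation: Diffusion models perform well on generative tasks but often produce hallucinated samples that conflict with real-world knowledge (such as extra objects), in particular counting hallucinations (such as a six-fingered hand). The lack of methods for quantifying such hallucinations hinders model improvement and the development of generative models under factual constraints. Method: Defines counting hallucination; builds the CountHalluSet dataset suite comprising ToyShape, SimObject, and RealHand with well-defined counting criteria; designs a standardized evaluation protocol; and systematically examines how DPM sampling conditions (solver type, ODE solver order, sampling steps, initial noise) affect counting hallucinations. Result: Delivers a dataset suite and evaluation method for quantifying counting hallucinations; the experiments characterize how sampling conditions relate to hallucination levels and show that common image quality metrics such as FID do not capture counting hallucinations consistently. Conclusion: The work takes a first step toward systematically quantifying hallucinations in diffusion models, exposes the limits of existing evaluation metrics, and offers new directions for designing more reliable generative models under factual constraints. Abstract: Diffusion probabilistic models (DPMs) have demonstrated remarkable progress in generative tasks, such as image and video synthesis. However, they still often produce hallucinated samples (hallucinations) that conflict with real-world knowledge, such as generating an implausible duplicate cup floating beside another cup. Despite their prevalence, the lack of feasible methodologies for systematically quantifying such hallucinations hinders progress in addressing this challenge and obscures potential pathways for designing next-generation generative models under factual constraints. In this work, we bridge this gap by focusing on a specific form of hallucination, which we term counting hallucination, referring to the generation of an incorrect number of instances or structured objects, such as a hand image with six fingers, despite such patterns being absent from the training data. To this end, we construct a dataset suite CountHalluSet, with well-defined counting criteria, comprising ToyShape, SimObject, and RealHand. Using these datasets, we develop a standardized evaluation protocol for quantifying counting hallucinations, and systematically examine how different sampling conditions in DPMs, including solver type, ODE solver order, sampling steps, and initial noise, affect counting hallucination levels. Furthermore, we analyze their correlation with common evaluation metrics such as FID, revealing that this widely used image quality metric fails to capture counting hallucinations consistently.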
This work aims to take the first step toward systematically quantifying hallucinations in diffusion models and offer new insights into the investigation of hallucination phenomena in image generation.

[93] Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation

Yi Zuo,Zitao Wang,Lingling Li,Xu Liu,Fang Liu,Licheng Jiao

Main category: cs.CV

TL;DR: Proposes Edit-Your-Interest, a lightweight, text-driven, zero-shot video editing method that significantly improves efficiency and visual fidelity through a spatio-temporal feature memory and a feature propagation mechanism.

Details Motivation: Existing video editing methods have high computational overhead and memory consumption, and often cause temporal inconsistencies and visual artifacts. Method: Introduces a Spatio-Temporal Feature Memory bank (SFM), the Feature Most-Similar Propagation (FMP) method, and an SFM update algorithm, and uses cross-attention maps to automatically extract masks for the instances of interest. Result: Significantly outperforms state-of-the-art methods in experiments, in both efficiency and visual fidelity. Conclusion: Edit-Your-Interest performs precise edits while preserving background integrity, and is effective and practical. Abstract: Text-to-image (T2I) diffusion models have recently demonstrated significant progress in video editing. However, existing video editing methods are severely limited by their high computational overhead and memory consumption. Furthermore, these approaches often sacrifice visual fidelity, leading to undesirable temporal inconsistencies and artifacts such as blurring and pronounced mosaic-like patterns. We propose Edit-Your-Interest, a lightweight, text-driven, zero-shot video editing method. Edit-Your-Interest introduces a spatio-temporal feature memory to cache features from previous frames, significantly reducing computational overhead compared to full-sequence spatio-temporal modeling approaches. Specifically, we first introduce a Spatio-Temporal Feature Memory bank (SFM), which is designed to efficiently cache and retain the crucial image tokens processed by spatial attention. Second, we propose the Feature Most-Similar Propagation (FMP) method. FMP propagates the most relevant tokens from previous frames to subsequent ones, preserving temporal consistency. Finally, we introduce an SFM update algorithm that continuously refreshes the cached features, ensuring their long-term relevance and effectiveness throughout the video sequence. Furthermore, we leverage cross-attention maps to automatically extract masks for the instances of interest. These masks are seamlessly integrated into the diffusion denoising process, enabling fine-grained control over target objects and allowing Edit-Your-Interest to perform highly accurate edits while robustly preserving the background integrity.
Extensive experiments decisively demonstrate that the proposed Edit-Your-Interest outperforms state-of-the-art methods in both efficiency and visual fidelity, validating its superior effectiveness and practicality.
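The propagation step can be pictured as a nearest-neighbour lookup into the cached tokens. The abstract does not say how FMP measures similarity or how the matched token is merged, so the sketch below makes two assumptions: cosine similarity selects the match, and a linear blend combines it with the current token. The function names and the `blend` parameter are hypothetical.

```python
import math

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two token vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / (den + eps)

def most_similar_index(query, memory):
    """Index of the cached token closest to `query` in cosine similarity."""
    return max(range(len(memory)), key=lambda i: cosine(query, memory[i]))

def propagate(current_tokens, memory, blend=0.5):
    """Blend each current token with its most similar cached token."""
    out = []
    for tok in current_tokens:
        match = memory[most_similar_index(tok, memory)]
        out.append([(1.0 - blend) * t + blend * m for t, m in zip(tok, match)])
    return out

memory = [[1.0, 0.0], [0.0, 1.0]]   # tokens cached from earlier frames
current = [[0.9, 0.1]]              # a token from the frame being edited
idx = most_similar_index(current[0], memory)
replaced = propagate(current, memory, blend=1.0)
```

With `blend=1.0` the current token is replaced outright by its best match from memory; smaller values keep part of the current frame's feature.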

[94] EgoSocial: Benchmarking Proactive Intervention Ability of Omnimodal LLMs via Egocentric Social Interaction Perception

Xijun Wang,Tanay Sharma,Achin Kulshrestha,Abhimitra Meka,Aveek Purohit,Dinesh Manocha

Main category: cs.CV

TL;DR: This paper presents the EgoSocial dataset and the EgoSoD method for improving an AI's ability to identify when to intervene in egocentric social scenes. Experiments show that existing OLLMs perform poorly on this task, while EgoSoD significantly improves both intervention-timing and social-interaction recognition.

Details Motivation: Current large language models lack an understanding of social context and struggle to judge when to step into a social interaction, which leads to frequent, ill-timed responses that harm the user experience. Method: Introduces the EgoSocial dataset (13,500 social video-question pairs) for evaluating intervention timing, and designs EgoSoD, which fuses multimodal cues into a social thinking graph that dynamically models participants and their interactions, enabling end-to-end detection of intervention timing and social interactions. Result: Existing OLLMs perform poorly at detecting intervention timing (only 14.4% for Gemini 2.5 Pro); EgoSoD improves Phi-4 by 45.6% and Gemini 2.5 Pro by 9.9% on Intervention Timing, and improves Phi-4 by 20.4% and Gemini 2.5 Pro by 6.9% on overall Social Interaction. Conclusion: By integrating multimodal contextual information, EgoSoD improves a model's understanding of social dynamics and its judgment of when to intervene, offering a more socially appropriate basis for intelligent assistants in AR/VR. Abstract: As AR/VR technologies become integral to daily life, there's a growing need for AI that understands human social dynamics from an egocentric perspective. However, current LLMs often lack the social awareness to discern when to intervene as an AI assistant. This leads to constant, socially unaware responses that may disrupt natural conversation and negatively impact user focus. To address these limitations, we introduce EgoSocial, a large-scale egocentric dataset with 13,500 social video-question pairs, specifically designed to benchmark intervention in social interaction perception. We also present an in-depth analysis of current omnimodal LLMs (OLLMs) to assess their effectiveness in detecting diverse social contextual cues. Experiments show that OLLMs still struggle to detect the intervention timing (14.4% for Gemini 2.5 Pro). We also propose EgoSoD (EgoSocial Detection), an end-to-end method for robustly discerning social dynamics. Informed by our OLLM analysis, EgoSoD integrates multimodal contextual cues (e.g., audio and visual cues) into a social thinking graph, dynamically modeling participants and interactions. Our method proactively detects intervention timing and social interactions, precisely determining when to intervene. Our EgoSoD improves Phi-4 by 45.6% and Gemini 2.5 Pro by 9.9% on Intervention Timing performance, and improves Phi-4 by 20.4% and Gemini 2.5 Pro by 6.9% on overall Social Interaction performance. We will release the dataset and code soon.

[95] DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models

Jingyu Song,Zhenxin Li,Shiyi Lan,Xinglong Sun,Nadine Chang,Maying Shen,Joshua Chen,Katherine A. Skinner,Jose M. Alvarez

Main category: cs.CV

TL;DR: This paper presents the DriveCritic framework, comprising a dataset of challenging scenarios annotated with human preferences and a vision-language-model-based evaluator; two-stage training significantly improves the agreement between autonomous driving planner evaluation and human judgment.

Details Motivation: Existing autonomous driving evaluation metrics (such as EPDMS) lack context awareness in nuanced scenarios and are hard to align with human judgment. Method: Builds the DriveCritic dataset and designs an evaluator based on a vision-language model, fine-tuned in two stages with supervised learning and reinforcement learning, that integrates visual and symbolic context to adjudicate between trajectory pairs. Result: Experiments show that DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates stronger context awareness. Conclusion: DriveCritic provides a more reliable, human-aligned foundation for evaluating autonomous driving systems. Abstract: Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios where context is critical for correct judgment and annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation for evaluating autonomous driving systems.

[96] VPREG: An Optimal Control Formulation for Diffeomorphic Image Registration Based on the Variational Principle Grid Generation Method

Zicong Zhou,Baihan Zhao,Andreas Mang,Guojun Liao

Main category: cs.CV

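TL;DR: This paper presents VPreg, a novel diffeomorphic image registration method that uses a variational principle to generate non-folding grids, ensures a positive Jacobian determinant of the spatial transformation, and accurately estimates the inverse transformation within the group of diffeomorphisms; experiments show that it outperforms mainstream methods in Dice scores, transformation regularity, and inverse-map accuracy.

Details Motivation: To improve registration accuracy and transformation quality, particularly given the demands of neuroimaging workflows for accurate inverse transformations and diffeomorphic properties, and to remedy the shortcomings of existing registration methods in approximating the inverse and in transformation regularity. Method: Proposes VPreg, which uses the Variational Principle (VP) to generate non-folding grids with prescribed Jacobian determinant and curl, constructs spatial transformations that remain diffeomorphic, and computes the inverse directly within the transformation group, thereby preserving topology and inverse-map accuracy. Result: In 150 registrations of brain scans from the OASIS-1 dataset, VPreg outperforms ANTs-SyN, Freesurfer-Easyreg, and FSL-Fnirt in Dice scores over 35 regions of interest, in the regularity of the transformation, and in the accuracy and consistency of the inverse map. Conclusion: VPreg preserves diffeomorphic properties while significantly improving registration accuracy and transformation quality, with particular strength in the accuracy and stability of the inverse transformation, and has potential for wide use in computational anatomy and morphometry. Abstract: This paper introduces VPreg, a novel diffeomorphic image registration method. This work provides several improvements to our past work on mesh generation and diffeomorphic image registration. VPreg aims to achieve excellent registration accuracy while controlling the quality of the registration transformations. It ensures a positive Jacobian determinant of the spatial transformation and provides an accurate approximation of the inverse of the registration, a crucial property for many neuroimaging workflows. Unlike conventional methods, VPreg generates this inverse transformation within the group of diffeomorphisms rather than operating on the image space. The core of VPreg is a grid generation approach, referred to as Variational Principle (VP), which constructs non-folding grids with prescribed Jacobian determinant and curl. These VP-generated grids guarantee diffeomorphic spatial transformations essential for computational anatomy and morphometry, and provide a more accurate inverse than existing methods. To assess the potential of the proposed approach, we conduct a performance analysis for 150 registrations of brain scans from the OASIS-1 dataset. Performance evaluation based on Dice scores for 35 regions of interest, along with an empirical analysis of the properties of the computed spatial transformations, demonstrates that VPreg outperforms state-of-the-art methods in terms of Dice scores, regularity properties of the computed transformation, and accuracy and consistency of the provided inverse map.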
We compare our results to ANTs-SyN, Freesurfer-Easyreg, and FSL-Fnirt.

[97] OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment

Rongjun Chen,Chengsi Yao,Jinchang Ren,Xianxian Zeng,Peixian Wang,Jun Yuan,Jiawen Li,Huimin Zhao,Xu Lu

Main category: cs.CV

TL;DR: Proposes the Open Semantic Hypergraph Adapter (OS-HGAdapter), which draws on the open semantic knowledge of large language models to raise the information entropy of the text modality and build multilateral image-text connections, significantly improving cross-modal retrieval and setting a new state of the art on Flickr30K and MS-COCO.

Details Motivation: To address the retrieval imbalance in image-text mutual retrieval that conventional methods suffer because of the information-entropy gap between modalities. Method: 1) Designs a new prompt template that does not rely on explicit task-domain knowledge and uses an LLM to enrich the polysemy description of the text, raising its information entropy; 2) uses a hypergraph adapter to build multilateral connections between text and images, correcting matching errors for synonymous semantics and suppressing the noise introduced by open semantics through dimension mapping. Result: On the Flickr30K and MS-COCO benchmarks, achieves cross-modal retrieval gains of 16.8% (text-to-image) and 40.1% (image-to-text) over existing methods. Conclusion: OS-HGAdapter effectively narrows the information-entropy gap between image and text modalities, strengthens semantic alignment, significantly improves cross-modal retrieval, and offers a new approach to multimedia content understanding. Abstract: Text-image alignment constitutes a foundational challenge in multimedia content understanding, where effective modeling of cross-modal semantic correspondences critically enhances retrieval system performance through joint embedding space optimization. Given the inherent difference in information entropy between texts and images, conventional approaches often show an imbalance in the mutual retrieval of these two modalities. To address this particular challenge, we propose to use the open semantic knowledge of a Large Language Model (LLM) to fill the entropy gap and reproduce the alignment ability of humans in these tasks. Our entropy-enhancing alignment is achieved through a two-step process: 1) a new prompt template that does not rely on explicit knowledge in the task domain is designed to use the LLM to enhance the polysemy description of the text modality. By analogy, the information entropy of the text modality relative to the visual modality is increased; 2) a hypergraph adapter is used to construct multilateral connections between the text and image modalities, which can correct the positive and negative matching errors for synonymous semantics in the same fixed embedding space, whilst reducing the noise caused by open semantic entropy by mapping the reduced dimensions back to the original dimensions.
Comprehensive evaluations on the Flickr30K and MS-COCO benchmarks validate the superiority of our Open Semantic Hypergraph Adapter (OS-HGAdapter), showcasing 16.8% (text-to-image) and 40.1% (image-to-text) cross-modal retrieval gains over existing methods while establishing new state-of-the-art performance in semantic alignment tasks.

[98] Real-Time Sign Language to text Translation using Deep Learning: A Comparative study of LSTM and 3D CNN

Madhumati Pol,Anvay Anturkar,Anushka Khot,Ayush Andure,Aniruddha Ghosh,Anvit Magadum,Anvay Bahadur

Main category: cs.CV

TL;DR: This study compares 3D CNNs and LSTMs for real-time American Sign Language recognition and finds that 3D CNNs are more accurate (92.4%), while LSTMs are more efficient, with lower computational latency.

Details Motivation: Improving real-time sign language recognition requires balancing model accuracy against computational efficiency, especially when developing assistive technologies for edge computing environments. Method: Trains and evaluates 3D CNN and LSTM architectures on a video dataset of 1,200 ASL signs across 50 classes, comparing accuracy, computational efficiency, and per-frame latency. Result: The 3D CNN reaches 92.4% recognition accuracy but needs 3.2% more processing time per frame than the LSTM; the LSTM reaches 86.7% accuracy with significantly lower resource consumption; the hybrid 3D CNN-LSTM model shows moderate performance. Conclusion: 3D CNNs are more accurate than LSTMs, but LSTMs are better suited to resource-constrained real-time applications; the architecture should be chosen according to the application's needs. Abstract: This study investigates the performance of 3D Convolutional Neural Networks (3D CNNs) and Long Short-Term Memory (LSTM) networks for real-time American Sign Language (ASL) recognition. Though 3D CNNs are good at spatiotemporal feature extraction from video sequences, LSTMs are optimized for modeling temporal dependencies in sequential data. We evaluate both architectures on a dataset containing 1,200 ASL signs across 50 classes, comparing their accuracy, computational efficiency, and latency under similar training conditions. Experimental results demonstrate that 3D CNNs achieve 92.4% recognition accuracy but require 3.2% more processing time per frame compared to LSTMs, which maintain 86.7% accuracy with significantly lower resource consumption. The hybrid 3D CNN-LSTM model shows decent performance, which suggests that context-dependent architecture selection is crucial for practical implementation. This project provides professional benchmarks for developing assistive technologies, highlighting trade-offs between recognition precision and real-time operational requirements in edge computing environments.

[99] Foveation Improves Payload Capacity in Steganography

Lifeng Qiu Lin,Henry Kam,Qi Sun,Kaan Akşit

Main category: cs.CV

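TL;DR: This paper presents steganography models based on efficient latent representations and foveated rendering that raise payload capacity from 100 to 500 bits and achieve an accuracy of only 1 failure bit per 2000 over 200K test bits, while maintaining a visual quality of 31.47 dB PSNR and 0.13 LPIPS.

Details Motivation: To overcome the capacity and robustness limits of existing steganography in visual media, increasing data embedding capacity while keeping visual quality high. Method: Trains steganography models using efficient latent representations and foveated rendering, optimizing the perceptual design of the embedding process and building multi-modal latent representations. Result: Capacity increases to 500 bits, decoding accuracy reaches only 1 failure bit per 2000 as measured on 200K test bits, and visual quality reaches 31.47 dB PSNR and 0.13 LPIPS. Conclusion: Novel perceptual design can effectively raise the capacity and robustness of steganography while keeping good visual quality, showing the potential of multi-modal latent representations in this area. Abstract: Steganography finds its use in visual media, for purposes such as providing metadata and watermarking. With support of efficient latent representations and foveated rendering, we trained models that improve existing capacity limits from 100 to 500 bits, while achieving better accuracy of up to 1 failure bit out of 2000, at 200K test bits. Finally, we achieve a comparable visual quality of 31.47 dB PSNR and 0.13 LPIPS, showing the effectiveness of novel perceptual design in creating multi-modal latent representations in steganography.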
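The reported accuracy and quality figures are easier to interpret with a little arithmetic. The sketch below converts "1 failure bit out of 2000" into a bit error rate and applies the standard PSNR definition, 10 * log10(MAX^2 / MSE). The helper names are illustrative, and the calculation assumes the quoted rate holds uniformly over the test bits, which the abstract states only as an upper bound ("up to").

```python
import math

def bit_error_rate(failed_bits, total_bits):
    """Fraction of decoded bits that differ from the embedded payload."""
    return failed_bits / total_bits

def expected_failures(rate, total_bits):
    """Number of failed bits implied by a rate over `total_bits`."""
    return rate * total_bits

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    return 10.0 * math.log10(max_value ** 2 / mse)

rate = bit_error_rate(1, 2000)               # 1 failure bit per 2000
failures = expected_failures(rate, 200_000)  # over the 200K test bits
```

A rate of 1 in 2000 is a bit error rate of 0.05%, or about 100 failed bits across 200K test bits. For PSNR, every tenfold reduction in mean squared error adds 10 dB.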

[100] DP-TTA: Test-time Adaptation for Transient Electromagnetic Signal Denoising via Dictionary-driven Prior Regularization

Meng Yang,Kecheng Chen,Wei Luo,Xianjie Chen,Yong Jia,Mingyue Wang,Fanqiang Lin

Main category: cs.CV

TL;DR: Proposes DP-TTA, a test-time adaptation method based on dictionary-driven prior regularization, to improve denoising of transient electromagnetic (TEM) signals across different environments.

Details Motivation: Existing deep learning denoising models are mostly trained on simulated or single real-world scenario data and cope poorly with the differing noise characteristics of different geographical regions, so they generalize badly across environments. Method: Uses the intrinsic physical characteristics of TEM signals (such as exponential decay and smoothness) to build a dictionary-driven prior, embeds it into the proposed DTEMDNet network, and dynamically adjusts model parameters at test time through self-supervised losses. Result: Experiments show that the method significantly outperforms existing TEM denoising methods and test-time adaptation methods across multiple real-world scenarios. Conclusion: By introducing physically consistent prior constraints, DP-TTA improves the denoising generalization of pre-trained models in new environments and is practical and broadly applicable. Abstract: The Transient Electromagnetic (TEM) method is widely used in various geophysical applications, providing valuable insights into subsurface properties. However, time-domain TEM signals are often submerged in various types of noise. While recent deep learning-based denoising models have shown strong performance, these models are mostly trained on simulated or single real-world scenario data, overlooking the significant differences in noise characteristics from different geographical regions. Intuitively, models trained in one environment often struggle to perform well in new settings due to differences in geological conditions, equipment, and external interference, leading to reduced denoising performance. To this end, we propose the Dictionary-driven Prior Regularization Test-time Adaptation (DP-TTA). Our key insight is that TEM signals possess intrinsic physical characteristics, such as exponential decay and smoothness, which remain consistent across different regions regardless of external conditions. These intrinsic characteristics serve as ideal prior knowledge for guiding the TTA strategy, which helps the pre-trained model dynamically adjust parameters by utilizing self-supervised losses, improving denoising performance in new scenarios. To implement this, we customized a network, named DTEMDNet. Specifically, we first use dictionary learning to encode these intrinsic characteristics as a dictionary-driven prior, which is integrated into the model during training. At the testing stage, this prior guides the model to adapt dynamically to new environments by minimizing self-supervised losses derived from the dictionary-driven consistency and the signal one-order variation.
Extensive experimental results demonstrate that the proposed method achieves much better performance than existing TEM denoising methods and TTA methods.
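
The abstract names two self-supervised terms for test-time adaptation: a dictionary-driven consistency term and a first-order variation term. The sketch below is only an illustration of how such a loss could be assembled. The function names, the squared-error form of the consistency term, and the weight of 0.1 are assumptions, not the authors' formulation, and the dictionary reconstruction is passed in rather than computed.

```python
def first_order_variation(signal):
    """Sum of absolute differences between neighbouring samples (a smoothness penalty)."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))


def dictionary_consistency(denoised, reconstruction):
    """Mean squared gap between the denoised signal and its dictionary reconstruction."""
    return sum((d - r) ** 2 for d, r in zip(denoised, reconstruction)) / len(denoised)


def tta_loss(denoised, reconstruction, weight=0.1):
    """Hypothetical test-time loss: consistency term plus weighted variation term.

    No clean target is needed, which is what makes the loss usable at test time.
    """
    return dictionary_consistency(denoised, reconstruction) + weight * first_order_variation(denoised)


# A decaying signal has lower variation than a noisy version of it.
smooth = [1.0, 0.5, 0.25, 0.125]
noisy = [1.0, 0.7, 0.2, 0.3]
print(first_order_variation(smooth), first_order_variation(noisy))
```

In an actual adaptation loop this scalar would be minimized with respect to the network parameters on each new test batch; that step is omitted here.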

[101] STT-GS: Sample-Then-Transmit Edge Gaussian Splatting with Joint Client Selection and Power Control

Zhen Li,Xibin Jin,Guoliang Li,Shuai Wang,Miaowen Wen,Huseyin Arslan,Derrick Wing Kwan Ng,Chengzhong Xu

Main category: cs.CV

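TL;DR: Proposes a new optimization framework for edge Gaussian splatting (EGS) that introduces a GS-oriented objective function and a sample-then-transmit (STT-GS) strategy to optimize performance under heterogeneous client view contributions and limited communication resources.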

Details Motivation: Traditional edge resource management methods cannot effectively optimize Gaussian splatting quality in EGS; client view contributions vary widely and transmitting all data directly is costly, so a new mechanism is needed that predicts contributions and allocates resources efficiently. Method: Formulates a GS-oriented objective function, samples representative images with feature-domain clustering (FDC), reduces overhead with pilot transmission time minimization (PTTM), and designs a joint client selection and power control (JCSPC) framework whose nonconvex optimization problem is solved with the PAMM algorithm. Result: Experiments show that the method significantly outperforms existing benchmarks on real-world datasets; a sampling ratio of only 10% suffices to predict the GS objective accurately, giving an excellent tradeoff between view contributions and communication costs. Conclusion: The proposed STT-GS strategy and JCSPC framework improve the reconstruction quality and resource efficiency of edge Gaussian splatting and offer a new approach to edge collaborative training for specific learning tasks. Abstract: Edge Gaussian splatting (EGS), which aggregates data from distributed clients and trains a global GS model at the edge server, is an emerging paradigm for scene reconstruction. Unlike traditional edge resource management methods that emphasize communication throughput or general-purpose learning performance, EGS explicitly aims to maximize GS quality, rendering existing approaches inapplicable. To address this problem, this paper formulates a novel GS-oriented objective function that distinguishes the heterogeneous view contributions of different clients. However, evaluating this function in turn requires clients' images, leading to a causality dilemma. To this end, this paper further proposes a sample-then-transmit EGS (or STT-GS for short) strategy, which first samples a subset of images as pilot data from each client for loss prediction. Based on the first-stage evaluation, communication resources are then prioritized towards more valuable clients. To achieve efficient sampling, a feature-domain clustering (FDC) scheme is proposed to select the most representative data and pilot transmission time minimization (PTTM) is adopted to reduce the pilot overhead. Subsequently, we develop a joint client selection and power control (JCSPC) framework to maximize the GS-oriented function under communication resource constraints. Despite the nonconvexity of the problem, we propose a low-complexity efficient solution based on the penalty alternating majorization minimization (PAMM) algorithm. 
Experiments show that the proposed scheme significantly outperforms existing benchmarks on real-world datasets. It is found that the GS-oriented objective can be accurately predicted with low sampling ratios (e.g., 10%), and our method achieves an excellent tradeoff between view contributions and communication costs.
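
The abstract does not spell out the feature-domain clustering (FDC) scheme, so the sketch below swaps in a plain k-means as a stand-in to illustrate the general idea of picking representative pilot images: cluster per-image feature vectors and send the image nearest each cluster centre. The function `select_pilots`, the deterministic initialisation, and the toy features are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_pilots(features, k, iterations=10):
    """Return sorted indices of up to k representative samples.

    Assumption: representativeness is approximated by closeness to a
    k-means centroid in feature space (not the paper's exact FDC rule).
    """
    dim = len(features[0])
    # Deterministic initialisation: k evenly spaced samples as starting centroids.
    step = len(features) / k
    centroids = [list(features[int(i * step)]) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max(1, iterations)):
        clusters = [[] for _ in range(k)]
        for idx, feat in enumerate(features):
            nearest = min(range(k), key=lambda c: squared_distance(feat, centroids[c]))
            clusters[nearest].append(idx)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(features[m][d] for m in members) / len(members)
                                for d in range(dim)]
    pilots = []
    for c in range(k):
        candidates = clusters[c] or range(len(features))
        pilots.append(min(candidates, key=lambda m: squared_distance(features[m], centroids[c])))
    return sorted(set(pilots))


# Two well-separated groups of image features; one pilot is chosen from each.
toy_features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
print(select_pilots(toy_features, k=2))
```

With a 10% sampling ratio, `k` would be roughly one tenth of a client's image count; how the pilot set then feeds the loss prediction is outside this sketch.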

[102] Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion

Rongtao Xu,Jinzhou Lin,Jialei Zhou,Jiahua Dong,Changwei Wang,Ruisheng Wang,Li Guo,Shibiao Xu,Xiaodan Liang

Main category: cs.CV

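TL;DR: Proposes CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion that fuses segmentation, graphics, and depth features and incorporates knowledge distilled from SAM, reaching state-of-the-art performance on the SemanticKITTI benchmark without increasing training costs.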

Details Motivation: Existing methods mainly improve performance through structural modifications and rarely explore representation fusion, leaving the rich diversity of features in 2D images underutilized. Method: CIGOcc extracts segmentation, graphics, and depth features from the input image, introduces a deformable multi-level fusion mechanism to fuse these three multi-level features, and incorporates knowledge distilled from SAM to improve prediction accuracy. Result: Achieves state-of-the-art performance on the SemanticKITTI benchmark without increasing training costs. Conclusion: Through multi-level representation fusion and knowledge distillation, CIGOcc improves camera-based occupancy prediction and shows its potential for 3D perception in autonomous driving. Abstract: Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Almost all existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, with good yet limited performance. Few studies explore from the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released at https://github.com/VitaLemonTea1/CIGOcc

[103] Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences

Jing Yang,Qiyao Wei,Jiaxin Pei

Main category: cs.CV

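TL;DR: Paper Copilot is a system that builds durable digital archives of peer reviews in computer science, provides an open dataset, and supports a large-scale empirical analysis of multiple years of ICLR reviews, with the aim of enabling reproducible research and making peer review more transparent and reliable.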

Details Motivation: The rapid growth of AI conferences is putting heavy pressure on the peer-review system, leading to heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, and declining review quality, while ad-hoc changes to conference policies make the review process more opaque. Method: Develops a system called Paper Copilot that collects and archives peer-review data from multiple computer-science venues, builds an open dataset, and carries out a large-scale empirical analysis of multiple years of ICLR reviews. Result: Establishes a durable review-data infrastructure and an open dataset, enables large-scale reproducible research on peer review, and reveals trends and problems in how review practices evolve over time. Conclusion: The tools and data provided by Paper Copilot help the community track review changes, diagnose system flaws, and drive evidence-based peer-review reform toward a more robust, transparent, and reliable review system. Abstract: The rapid growth of AI conferences is straining an already fragile peer-review system, leading to heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, superficial or templated reviews, and limited accountability under compressed timelines. In response, conference organizers have introduced new policies and interventions to preserve review standards. Yet these ad-hoc changes often create further concerns and confusion about the review process, leaving how papers are ultimately accepted - and how practices evolve across years - largely opaque. We present Paper Copilot, a system that creates durable digital archives of peer reviews across a wide range of computer-science venues, an open dataset that enables researchers to study peer review at scale, and a large-scale empirical analysis of ICLR reviews spanning multiple years. By releasing both the infrastructure and the dataset, Paper Copilot supports reproducible research on the evolution of peer review. We hope these resources help the community track changes, diagnose failure modes, and inform evidence-based improvements toward a more robust, transparent, and reliable peer-review system.

[104] MimicParts: Part-aware Style Injection for Speech-Driven 3D Motion Generation

Lianlian Liu,YongKang He,Zhaojie Chu,Xiaofen Xing,Xiangmin Xu

Main category: cs.CV

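TL;DR: Proposes MimicParts, a framework that uses part-aware style injection and a part-aware denoising network to generate more natural and expressive 3D human motion.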

Details Motivation: Existing speech-driven 3D motion generation methods oversimplify style modeling, ignoring differences between body regions and the dynamic match between motion and speech rhythm and emotion. Method: Divides the body into multiple regions and uses part-aware style encoding together with an attention mechanism so that the rhythm and emotion features of speech guide motion generation for each body region. Result: Experiments show that the method outperforms existing methods in motion naturalness and expressiveness. Conclusion: MimicParts captures the fine-grained association between speech and local motion style, enabling more realistic stylized 3D motion synthesis. Abstract: Generating stylized 3D human motion from speech signals presents substantial challenges, primarily due to the intricate and fine-grained relationships among speech signals, individual styles, and the corresponding body movements. Current style encoding approaches either oversimplify stylistic diversity or ignore regional motion style differences (e.g., upper vs. lower body), limiting motion realism. Additionally, motion style should dynamically adapt to changes in speech rhythm and emotion, but existing methods often overlook this. To address these issues, we propose MimicParts, a novel framework designed to enhance stylized motion generation based on part-aware style injection and a part-aware denoising network. It divides the body into different regions to encode localized motion styles, enabling the model to capture fine-grained regional differences. Furthermore, our part-aware attention block allows rhythm and emotion cues to guide each body region precisely, ensuring that the generated motion aligns with variations in speech rhythm and emotional state. Experimental results show that our method outperforms existing methods, producing natural and expressive 3D human motion sequences.

[105] Prompt-based Adaptation in Large-scale Vision Models: A Survey

Xi Xiao,Yunbei Zhang,Lin Zhao,Yiyang Liu,Xiaoying Liao,Zheda Mai,Xingjian Li,Xiao Wang,Hao Xu,Jihun Hamm,Xue Lin,Min Xu,Qifan Wang,Tianyang Wang,Cheng Han

Main category: cs.CV

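TL;DR: This survey presents a unified framework for visual Prompt-based Adaptation (PA), systematically categorizes existing visual prompting methods, and reviews their applications and challenges across multiple domains.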

Details Motivation: The concepts of Visual Prompting (VP) and Visual Prompt Tuning (VPT) are currently blurred and lack a systematic distinction, so a clear framework is needed to unify understanding and advance research. Method: Starting from first principles, builds a unified framework called Prompt-based Adaptation (PA), proposes a taxonomy by learnability (learnable, generative, non-learnable) and injection granularity (pixel-level, token-level), and comprehensively reviews applications across domains. Result: Establishes the first comprehensive taxonomy dedicated to PA methodologies and applications, clarifies the differences and connections between VP and VPT, and summarizes existing benchmarks, challenges, and future directions. Conclusion: The survey gives the visual prompting field a clear conceptual division and a systematic perspective, helping researchers understand and advance PA-related techniques. Abstract: In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the "pretrain-then-finetune" paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications. In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). We provide a taxonomy that categorizes existing methods into learnable, generative, and non-learnable prompts, and further organizes them by injection granularity -- pixel-level and token-level. Beyond the core methodologies, we examine PA's integrations across diverse domains, including medical imaging, 3D point clouds, and vision-language tasks, as well as its role in test-time adaptation and trustworthy AI. We also summarize current benchmarks and identify key challenges and future directions. To the best of our knowledge, this is the first comprehensive survey dedicated to PA's methodologies and applications in light of their distinct characteristics. Our survey aims to provide a clear roadmap for researchers and practitioners in all areas to understand and explore the evolving landscape of PA-related research.

[106] Sample-Centric Multi-Task Learning for Detection and Segmentation of Industrial Surface Defects

Hang-Cheng Dong,Yibo Jiao,Fupeng Wei,Guodong Liu,Dong Ye,Bingguo Liu

Main category: cs.CV

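TL;DR: Proposes a sample-centric multi-task learning framework and evaluation suite for industrial surface defect inspection that jointly learns sample-level classification and pixel-level segmentation, improving the detection rate and localization completeness for small and low-contrast defects.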

Details Motivation: Existing methods score well on pixel-level metrics but are unstable in sample-level decisions, especially for sparse or slender defects, mainly because the optimization objective does not match the granularity of quality-control decisions. Method: Uses a shared-encoder multi-task framework that jointly trains sample-level defect classification and pixel-level mask localization; sample-level supervision modulates the feature distribution and, at the gradient level, continually boosts recall for small and low-contrast defects; the evaluation metrics Seg_mIoU and Seg_Recall are designed to remove the bias caused by defect-free samples. Result: Experiments on two benchmark datasets show that the method substantially improves the reliability of sample-level decisions and the completeness of defect localization, and compares favorably with conventional pixel-based evaluation. Conclusion: The sample-centric multi-task learning framework mitigates the problems caused by foreground-background imbalance, defect sparsity, and low contrast in industrial defect inspection, and by aligning the optimization objective with the decision granularity it achieves more stable and reliable inspection performance. Abstract: Industrial surface defect inspection for sample-wise quality control (QC) must simultaneously decide whether a given sample contains defects and localize those defects spatially. In real production lines, extreme foreground-background imbalance, defect sparsity with a long-tailed scale distribution, and low contrast are common. As a result, pixel-centric training and evaluation are easily dominated by large homogeneous regions, making it difficult to drive models to attend to small or low-contrast defects, one of the main bottlenecks for deployment. Empirically, existing models achieve strong pixel-overlap metrics (e.g., mIoU) but exhibit insufficient stability at the sample level, especially for sparse or slender defects. The root cause is a mismatch between the optimization objective and the granularity of QC decisions. To address this, we propose a sample-centric multi-task learning framework and evaluation suite. Built on a shared-encoder architecture, the method jointly learns sample-level defect classification and pixel-level mask localization. Sample-level supervision modulates the feature distribution and, at the gradient level, continually boosts recall for small and low-contrast defects, while the segmentation branch preserves boundary and shape details to enhance per-sample decision stability and reduce misses. For evaluation, we propose decision-linked metrics, Seg_mIoU and Seg_Recall, which remove the bias of classical mIoU caused by empty or true-negative samples and tightly couple localization quality with sample-level decisions. 
Experiments on two benchmark datasets demonstrate that our approach substantially improves the reliability of sample-level decisions and the completeness of defect localization.
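
The bias that the abstract attributes to empty or true-negative samples can be seen with a small calculation. The sketch below does not reproduce the paper's Seg_mIoU or Seg_Recall, whose exact definitions are not given here. It assumes a simple variant, under the hypothetical name `mean_iou_defective`, that averages IoU only over samples whose ground truth contains a defect, and compares it with a per-sample mean that scores empty-versus-empty as 1.

```python
def iou(pred, target):
    """Intersection over union of two flat binary masks; empty vs. empty counts as 1."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return 1.0 if union == 0 else inter / union


def mean_iou_all(preds, targets):
    """Per-sample mean IoU over every sample, including defect-free ones."""
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)


def mean_iou_defective(preds, targets):
    """Assumed decision-linked variant: average only over samples that contain a defect."""
    scores = [iou(p, t) for p, t in zip(preds, targets) if any(t)]
    return sum(scores) / len(scores) if scores else 0.0


# Three defect-free samples predicted correctly, one defective sample half-recovered.
targets = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 0]]
preds = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
print(mean_iou_all(preds, targets), mean_iou_defective(preds, targets))
```

The all-sample mean is pulled up by the easy defect-free samples, while the defect-only mean reflects how well the one real defect was localized.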

[107] What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging

Inha Kang,Youngsun Lim,Seonho Lee,Jiho Choi,Junsuk Choe,Hyunjung Shim

Main category: cs.CV

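TL;DR: This paper proposes a solution to the negation-understanding failure (affirmative bias) of vision-language models, consisting of a new dataset, CoVAND, and a lightweight adaptation method, NegToMe.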

Details Motivation: Existing vision-language models fail badly on negated statements, especially in described object detection, and struggle to understand expressions such as "not a girl". Method: First, builds the high-quality negation dataset CoVAND with a systematic chain-of-thought and visual question answering (VQA) pipeline; second, proposes the NegToMe module, which merges negation words with attribute words into unified semantic phrases to preserve the correct polarity at the input level, combined with LoRA for efficient fine-tuning. Result: The method significantly improves performance on negation benchmarks, raising NMS-AP on OVDEval by up to +10.8 points with a lower false positive rate, and generalizes to several state-of-the-art vision-language models. Conclusion: NegToMe mitigates the loss of negation signals caused by tokenization and is an important step toward accurate negation understanding in real-world scenarios. Abstract: State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.
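
The "not" plus "girl" example in the abstract can be made concrete with a toy token-merging pass. This is a minimal sketch under stated assumptions: the cue list, the underscore naming, and the plain averaging of the two embeddings are illustrative choices, not NegToMe's actual merging rule, and the embeddings are made-up two-dimensional vectors.

```python
NEGATION_CUES = {"not", "no", "without"}  # assumed cue list, for illustration only


def merge_negation(tokens, embeddings):
    """Bind each negation cue to the token that follows it.

    Returns new token and embedding lists in which the pair is replaced by
    one merged token whose embedding is the element-wise mean of the two.
    """
    merged_tokens, merged_embeddings = [], []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATION_CUES and i + 1 < len(tokens):
            merged_tokens.append(tokens[i] + "_" + tokens[i + 1])
            merged_embeddings.append(
                [(a + b) / 2 for a, b in zip(embeddings[i], embeddings[i + 1])]
            )
            i += 2  # skip the attribute token that was just absorbed
        else:
            merged_tokens.append(tokens[i])
            merged_embeddings.append(list(embeddings[i]))
            i += 1
    return merged_tokens, merged_embeddings


tokens = ["person", "not", "girl"]
embeddings = [[1.0, 0.0], [0.0, 2.0], [4.0, 0.0]]
print(merge_negation(tokens, embeddings))
```

After merging, the sequence carries a single `not_girl` token whose vector differs from the vector for `girl`, so the polarity cannot be dropped by attending to `girl` alone.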

[108] UniVector: Unified Vector Extraction via Instance-Geometry Interaction

Yinglong Yan,Jun Yue,Shaobo Xia,Hanmeng Sun,Tianxu Ying,Chengcheng Wu,Sifan Lan,Min He,Pedram Ghamisi,Leyuan Fang

Main category: cs.CV

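TL;DR: This paper proposes UniVector, a unified vector extraction framework that uses an instance-geometry interaction mechanism to extract multiple vector types (such as polygons, polylines, and line segments) within a single model, and introduces structured queries containing instance and geometry information together with a dynamic shape constraint to improve reconstruction accuracy for complex structures.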

Details Motivation: Existing methods are usually designed for a single vector type and handle multi-type vector structures poorly, and modeling instance attributes separately from geometric attributes limits their ability to represent complex structures; a framework that extracts multiple vector types in a unified way is therefore needed. Method: Proposes UniVector, which represents vectors as structured queries containing instance and geometry information, updates them iteratively through an interaction module for cross-level context exchange, and refines global structures and key points with a dynamic shape constraint. Result: Achieves state-of-the-art performance on both single- and multi-structure vector extraction tasks, and releases the Multi-Vector dataset with diverse polygons, polylines, and line segments for evaluation. Conclusion: Through a unified modeling paradigm and instance-geometry interaction, UniVector supports high-fidelity extraction of multiple vector types and moves vector extraction toward more general and flexible methods. Abstract: Vector extraction (VE) retrieves structured vector geometry from raster images, offering high-fidelity representation and broad applicability. Existing methods, however, are usually tailored to a single vector type (e.g., polygons, polylines, line segments), requiring separate models for different structures. This stems from treating instance attributes (category, structure) and geometric attributes (point coordinates, connections) independently, limiting the ability to capture complex structures. Inspired by the human brain's simultaneous use of semantic and spatial interactions in visual perception, we propose UniVector, a unified VE framework that leverages instance-geometry interaction to extract multiple vector types within a single model. UniVector encodes vectors as structured queries containing both instance- and geometry-level information, and iteratively updates them through an interaction module for cross-level context exchange. A dynamic shape constraint further refines global structures and key points. To benchmark multi-structure scenarios, we introduce the Multi-Vector dataset with diverse polygons, polylines, and line segments. Experiments show UniVector sets a new state of the art on both single- and multi-structure VE tasks. Code and dataset will be released at https://github.com/yyyyll0ss/UniVector.

[109] EPIPTrack: Rethinking Prompt Modeling with Explicit and Implicit Prompts for Multi-Object Tracking

Yukuan Zhang,Jiarui Zhao,Shangqing Nie,Jin Kuang,Shengsheng Wang

Main category: cs.CV

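TL;DR: Proposes EPIPTrack, a unified multimodal vision-language tracking framework that uses explicit and implicit prompts for dynamic target modeling and semantic alignment and shows superior performance on multiple datasets.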

Details Motivation: Existing methods rely on static textual descriptions, which do not adapt to real-time changes in target state and are prone to hallucinations. Method: Introduces explicit prompts that turn spatial motion information into natural language to provide spatiotemporal guidance, builds implicit prompts from pseudo-words and learnable descriptors to capture appearance attributes, and adjusts both dynamically via the CLIP text encoder; a Discriminative Feature Augmentor enhances visual and cross-modal representations. Result: Experiments on MOT17, MOT20, and DanceTrack show that EPIPTrack outperforms existing trackers, with strong adaptability and high performance. Conclusion: Through its dynamic prompt mechanism, EPIPTrack fuses multimodal semantic cues and improves the robustness and accuracy of tracking in complex scenes. Abstract: Multimodal semantic cues, such as textual descriptions, have shown strong potential in enhancing target perception for tracking. However, existing methods rely on static textual descriptions from large language models, which lack adaptability to real-time target state changes and are prone to hallucinations. To address these challenges, we propose a unified multimodal vision-language tracking framework, named EPIPTrack, which leverages explicit and implicit prompts for dynamic target modeling and semantic alignment. Specifically, explicit prompts transform spatial motion information into natural language descriptions to provide spatiotemporal guidance. Implicit prompts combine pseudo-words with learnable descriptors to construct individualized knowledge representations capturing appearance attributes. Both prompts undergo dynamic adjustment via the CLIP text encoder to respond to changes in target state. Furthermore, we design a Discriminative Feature Augmentor to enhance visual and cross-modal representations. Extensive experiments on MOT17, MOT20, and DanceTrack demonstrate that EPIPTrack outperforms existing trackers in diverse scenarios, exhibiting robust adaptability and superior performance.
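
As an illustration of what turning spatial motion information into natural language might look like, the sketch below maps the change between two bounding boxes to a short text prompt. The template wording, the thresholds, and the function name `explicit_prompt` are hypothetical; the abstract does not give EPIPTrack's actual prompt format.

```python
def explicit_prompt(prev_box, curr_box, min_shift=2.0):
    """Describe target motion between two frames as text.

    Boxes are (x_center, y_center, width, height) in pixels, with the image
    origin at the top-left, so a growing y means the target moves down.
    """
    dx = curr_box[0] - prev_box[0]
    dy = curr_box[1] - prev_box[1]
    directions = []
    if dx >= min_shift:
        directions.append("right")
    elif dx <= -min_shift:
        directions.append("left")
    if dy >= min_shift:
        directions.append("down")
    elif dy <= -min_shift:
        directions.append("up")
    motion = "moving " + " and ".join(directions) if directions else "staying still"

    # Box area is used as a rough proxy for distance to the camera.
    area_ratio = (curr_box[2] * curr_box[3]) / (prev_box[2] * prev_box[3])
    if area_ratio > 1.1:
        scale = "getting closer"
    elif area_ratio < 0.9:
        scale = "moving away"
    else:
        scale = "keeping its distance"
    return f"a target {motion}, {scale}"


print(explicit_prompt((100, 100, 20, 40), (110, 100, 20, 40)))
```

A string like this would be regenerated every frame and passed to a text encoder, which is what lets the textual cue follow the target state instead of staying fixed.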

[110] Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models

Haochuan Xu,Yun Sing Koh,Shuhuai Huang,Zirun Zhou,Di Wang,Jun Sakuma,Jingfeng Zhang

Main category: cs.CV

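TL;DR: This paper proposes EDPA, an adversarial patch attack on Vision-Language-Action (VLA) models, together with a defense: EDPA misleads the model by disrupting the semantic alignment between visual and textual representations, while the defense improves robustness through adversarial fine-tuning; both are validated on the LIBERO benchmark.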

Details Motivation: Although VLA models have made significant progress in robot learning, their adversarial robustness remains underexplored, particularly for physical tasks driven by natural-language instructions, and there is a lack of general attacks that need no prior knowledge of the model and of effective defenses. Method: Proposes the Embedding Disruption Patch Attack (EDPA), which generates adversarial patches that can be placed directly in the camera's view by optimizing two objectives: disrupting the semantic alignment between visual and textual latent representations, and maximizing the discrepancy in latent space between adversarial and clean inputs; also proposes an adversarial fine-tuning defense for the visual encoder so that it produces stable representations for perturbed inputs. Result: Experiments on the LIBERO simulation benchmark show that EDPA substantially increases the task failure rate of advanced VLA models, while the proposed defense effectively mitigates this degradation. Conclusion: EDPA is a general adversarial attack that requires no model information and effectively disrupts the visual understanding of VLA models, while adversarial fine-tuning is an effective defense that helps improve the safety and robustness of VLA models in real-world settings. Abstract: Vision-Language-Action (VLA) models have achieved revolutionary progress in robot learning, enabling robots to execute complex physical tasks from natural language instructions. Despite this progress, their adversarial robustness remains underexplored. In this work, we propose both an adversarial patch attack and a corresponding defense strategy for VLA models. We first introduce the Embedding Disruption Patch Attack (EDPA), a model-agnostic adversarial attack that generates patches directly placeable within the camera's view. In comparison to prior methods, EDPA can be readily applied to different VLA models without requiring prior knowledge of the model architecture or the controlled robotic manipulator. EDPA constructs these patches by (i) disrupting the semantic alignment between visual and textual latent representations, and (ii) maximizing the discrepancy of latent representations between adversarial and corresponding clean visual inputs. Through the optimization of these objectives, EDPA distorts the VLA's interpretation of visual information, causing the model to repeatedly generate incorrect actions and ultimately fail to complete the given robotic task. To counter this, we propose an adversarial fine-tuning scheme for the visual encoder, in which the encoder is optimized to produce similar latent representations for both clean and adversarially perturbed visual inputs. 
Extensive evaluations on the widely recognized LIBERO robotic simulation benchmark demonstrate that EDPA substantially increases the task failure rate of cutting-edge VLA models, while our proposed defense effectively mitigates this degradation. The codebase is accessible via the homepage at https://edpa-attack.github.io/.
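
The two attack objectives (i) and (ii) and the defense's matching objective can be written down as scalar functions of latent vectors. The sketch below is an assumed formulation for illustration only: cosine similarity for alignment, Euclidean distance for discrepancy, and a unit weight are choices made here, not details confirmed by the abstract, and the latents are toy two-dimensional vectors rather than encoder outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def l2(a, b):
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def attack_objective(adv_visual, clean_visual, text, weight=1.0):
    """Value the attacker would maximize (hypothetical form).

    Term (i): low visual-text alignment. Term (ii): large gap to the clean latent.
    """
    return -cosine(adv_visual, text) + weight * l2(adv_visual, clean_visual)


def defense_loss(adv_visual, clean_visual):
    """Value the fine-tuned encoder would minimize: clean and patched latents should match."""
    return l2(adv_visual, clean_visual)


clean, text = [1.0, 0.0], [1.0, 0.0]
patched = [0.0, 1.0]
print(attack_objective(clean, clean, text), attack_objective(patched, clean, text))
```

In practice the patch pixels would be optimized through the visual encoder by gradient ascent on such an objective, and the defense would fine-tune the encoder on the resulting patched images; both loops are omitted.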

[111] FlyAwareV2: A Multimodal Cross-Domain UAV Dataset for Urban Scene Understanding

Francesco Barbato,Matteo Caligiuri,Pietro Zanuttigh

Main category: cs.CV

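TL;DR: FlyAwareV2 is a multimodal dataset of real and synthetic UAV imagery for urban scene understanding, providing RGB, depth, and semantic labels along with studies on cross-domain adaptation.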

Details Motivation: Collecting and annotating real-world UAV data is difficult and costly, and the lack of large-scale high-quality datasets limits the development of UAV vision algorithms for urban environments. Method: Builds FlyAwareV2 on top of SynDrone and FlyAware, adds multimodal data (RGB, depth, semantic labels), generates depth maps for real samples with state-of-the-art monocular depth estimation, establishes benchmarks for RGB and multimodal semantic segmentation, and studies synthetic-to-real domain adaptation. Result: FlyAwareV2 offers rich annotations and environmental diversity, supports a range of urban scene understanding tasks, and establishes semantic segmentation benchmarks on standard architectures along with domain adaptation evaluations. Conclusion: FlyAwareV2 is a valuable resource for research on UAV-based 3D urban scene understanding and helps advance algorithms in the field and the study of cross-domain generalization. Abstract: The development of computer vision algorithms for Unmanned Aerial Vehicle (UAV) applications in urban environments heavily relies on the availability of large-scale datasets with accurate annotations. However, collecting and annotating real-world UAV data is extremely challenging and costly. To address this limitation, we present FlyAwareV2, a novel multimodal dataset encompassing both real and synthetic UAV imagery tailored for urban scene understanding tasks. Building upon the recently introduced SynDrone and FlyAware datasets, FlyAwareV2 introduces several new key contributions: 1) Multimodal data (RGB, depth, semantic labels) across diverse environmental conditions including varying weather and time of day; 2) Depth maps for real samples computed via state-of-the-art monocular depth estimation; 3) Benchmarks for RGB and multimodal semantic segmentation on standard architectures; 4) Studies on synthetic-to-real domain adaptation to assess the generalization capabilities of models trained on the synthetic data. With its rich set of annotations and environmental diversity, FlyAwareV2 provides a valuable resource for research on UAV-based 3D urban scene understanding.

[112] CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation

Li Liang,Bo Miao,Xinyu Wang,Naveed Akhtar,Jordan Vice,Ajmal Mian

Main category: cs.CV

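TL;DR: This paper introduces SketchSem3D, the first large-scale benchmark for generating outdoor 3D semantic scenes from freehand sketches and pseudo-labeled satellite images, and proposes the Cylinder Mamba Diffusion (CymbaDiff) model to improve the spatial coherence and semantic consistency of generated scenes.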

Details Motivation: Progress in outdoor 3D semantic scene generation is limited by the lack of publicly available, well-annotated datasets, so a standardized large-scale benchmark is needed to advance research in this area. Method: Builds the SketchSem3D dataset of sketches and pseudo-labeled satellite images, and proposes the CymbaDiff model, which improves generation by imposing structured spatial ordering and modeling cylindrical continuity and vertical hierarchy. Result: Experiments on SketchSem3D show that CymbaDiff outperforms existing methods in semantic consistency, spatial realism, and cross-dataset generalization. Conclusion: Together, CymbaDiff and SketchSem3D provide an effective new benchmark and an advanced model for outdoor 3D semantic scene generation, significantly improving generation quality and practicality. Abstract: Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff) that significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at https://github.com/Lillian-research-hub/CymbaDiff

[113] Real-Time Crowd Counting for Embedded Systems with Lightweight Architecture

Zhiyuan Zhao,Yubin Wen,Siyu Yang,Lichen Ning,Yuandong Liu,Junyu Gao

Main category: cs.CV

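TL;DR: Proposes a super real-time crowd counting model with a stem-encoder-decoder structure that uses large convolution kernels, conditional channel weighting, and a multi-branch local fusion block to achieve the fastest inference speed while keeping competitive accuracy.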

Details Motivation: On embedded systems, existing methods suffer from excessive model parameters and complex computation and struggle to meet real-time requirements, so a lightweight and fast model is needed. Method: Uses large convolution kernels to enlarge the receptive field and extract detailed head information; introduces conditional channel weighting and a multi-branch local fusion block in the encoder to merge multi-scale features at low computational cost; adds feature pyramid networks on top of the encoder to alleviate incomplete feature fusion. Result: Validated on three benchmark datasets, the model reaches inference speeds of 381.7 FPS (GTX 1080Ti) and 71.9 FPS (Jetson TX1), currently the fastest, while maintaining competitive accuracy. Conclusion: The model is suited to super real-time crowd counting on embedded systems, strikes a good balance between speed and accuracy, and has practical value. Abstract: Crowd counting is the task of estimating the number of people in a crowd from images, which is extremely valuable in the fields of intelligent security, urban planning, public safety management, and so on. However, the existing counting methods have some problems in practical application on embedded systems for these fields, such as excessive model parameters, abundant complex calculations, etc. The practical application of embedded systems requires the model to be real-time, which means that the model is fast enough. Considering the aforementioned problems, we design a super real-time model with a stem-encoder-decoder structure for crowd counting tasks, which achieves the fastest inference compared with state-of-the-art methods. Firstly, large convolution kernels in the stem network are used to enlarge the receptive field, which effectively extracts detailed head information. Then, in the encoder part, we use conditional channel weighting and multi-branch local fusion block to merge multi-scale features with low computational consumption. This part is crucial to the super real-time performance of the model. Finally, the feature pyramid networks are added to the top of the encoder to alleviate its incomplete fusion problems. Experiments on three benchmarks show that our network is suitable for super real-time crowd counting on embedded systems, ensuring competitive accuracy. At the same time, the proposed network has the fastest inference speed. Specifically, the proposed network achieves 381.7 FPS on NVIDIA GTX 1080Ti and 71.9 FPS on NVIDIA Jetson TX1.

[114] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

Minji Kim,Taekyung Kim,Bohyung Han

Main category: cs.CV

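TL;DR: This study uses mechanistic interpretability techniques to analyze the internal information flow of video large language models (VideoLLMs), reveals consistent patterns in how they perform temporal reasoning in video question answering, and finds that performance can be maintained by keeping the key information pathways while suppressing a large share of attention edges, offering practical insights for improving model interpretability and downstream generalization.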

Details Motivation: Despite progress in video understanding, it remains unclear how VideoLLMs internally extract and propagate spatiotemporal and textual information, so their internal information-flow mechanisms need to be investigated. Method: Uses mechanistic interpretability techniques to analyze cross-layer information flow in VideoLLMs across diverse VideoQA tasks and to identify the key attention pathways and video-language alignment mechanisms. Result: Identifies four stages of temporal reasoning in VideoLLMs: cross-frame interactions in early-to-middle layers, progressive video-language integration in middle layers, completion of that integration through alignment with temporal concepts, and answer generation in middle-to-late layers; also shows that 58% of attention edges can be pruned (e.g., in LLaVA-NeXT-7B-Video-FT) while retaining performance. Conclusion: VideoLLMs perform temporal reasoning through a specific staged information flow, and keeping the key pathways is enough to maintain performance, which suggests new approaches to model compression, interpretability, and generalization. Abstract: Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, their internal mechanisms on where and how they extract and propagate video and textual information remain less explored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs initiates with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings containing temporal concepts. (3) Upon completion of this integration, the model is ready to generate correct answers in middle-to-late layers. (4) Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint of how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization. Our project page with the source code is available at https://map-the-flow.github.io
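
Suppressing attention edges, as in finding (4), amounts to removing selected query-to-key connections before the softmax and renormalizing over what remains. The sketch below shows that operation on a tiny score matrix. It is a generic masking illustration with hypothetical function names, not the paper's procedure for choosing which edges to keep.

```python
import math


def masked_attention(scores, keep):
    """Row-wise softmax over raw attention scores with some edges suppressed.

    keep[i][j] is False when the edge from query i to key j is removed;
    the remaining edges in each row are renormalized to sum to 1.
    """
    weights = []
    for row_scores, row_keep in zip(scores, keep):
        exps = [math.exp(s) if k else 0.0 for s, k in zip(row_scores, row_keep)]
        total = sum(exps)
        # A fully suppressed row passes no information at all.
        weights.append([e / total for e in exps] if total > 0.0 else [0.0] * len(exps))
    return weights


def suppressed_fraction(keep):
    """Share of attention edges that were removed."""
    flat = [k for row in keep for k in row]
    return sum(1 for k in flat if not k) / len(flat)


scores = [[0.0, 0.0], [0.0, 0.0]]
keep = [[True, False], [True, True]]
print(masked_attention(scores, keep), suppressed_fraction(keep))
```

Comparing task accuracy with and without such a mask is the basic test of whether the removed edges carried information the model needed.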

[115] End-to-End Multi-Modal Diffusion Mamba

Chunhao Lu,Qiang Lu,Meichen Dong,Jake Luo

Main category: cs.CV

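TL;DR: Proposes MDM, a new multi-modal diffusion model that achieves end-to-end joint representation learning across modalities through a unified variational autoencoder and the Mamba architecture, outperforming existing models on tasks such as image generation and image-text understanding.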

Details Motivation: Existing end-to-end multi-modal models use different encoders and decoders, which limits joint representation learning across modalities. Method: Proposes the MDM (Multi-modal Diffusion Mamba) architecture, which uses a Mamba-based multi-step selection diffusion model and a unified variational autoencoder for both encoding and decoding to progressively generate and refine modality-specific information. Result: On image generation, image captioning, visual question answering, text comprehension, and reasoning tasks, MDM significantly outperforms models such as MonoFormer, LlamaGen, and Chameleon and is competitive with SOTA models such as GPT-4V, Gemini Pro, and Mistral. Conclusion: MDM unifies multi-modal processing while maintaining computational efficiency and offers a new direction for end-to-end multi-modal architectures. Abstract: Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (such as MonoFormer, LlamaGen, and Chameleon) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM's effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.

[116] MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models

Keyan Zhou,Zecheng Tang,Lingfeng Ming,Guanghao Zhou,Qiguang Chen,Dan Qiao,Zheming Yang,Libo Qin,Minghui Qiu,Juntao Li,Min Zhang

Main category: cs.CV

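TL;DR: This paper introduces MMLongCite, a comprehensive benchmark for evaluating the fidelity of large vision-language models in long-context scenarios, covering text, image, and video modalities, and reveals the limitations of existing models in handling long multimodal contexts.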

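Details Motivation: Existing long-context evaluations focus mainly on the text-only domain and lack effective assessment of multimodal, and especially long-context, scenarios, so a more comprehensive benchmark is needed to measure the contextual fidelity of large vision-language models. Method: Designs MMLongCite, a multimodal benchmark with 8 distinct tasks and 6 context length intervals covering text, images, and videos; systematically evaluates state-of-the-art large vision-language models and analyzes how context length and the position of crucial content affect their performance. Result: The evaluation shows that current state-of-the-art large vision-language models have limited fidelity when handling long multimodal contexts, and that performance is significantly affected by context length and the position of key information. Conclusion: MMLongCite fills the gap in long multimodal context evaluation, exposes the shortcomings of existing LVLMs in long contexts, and points to important directions for future model improvement. Abstract: The rapid advancement of large vision-language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.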

[117] Universal Image Restoration Pre-training via Masked Degradation Classification

JiaKui Hu,Zhengjian Yao,Lujia Jin,Yinghao Chen,Yanye Lu

Main category: cs.CV

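TL;DR: This paper proposes Masked Degradation Classification Pre-Training (MaskDCPT), which pre-trains for universal image restoration by using the image degradation type as a weak supervision signal in combination with an image reconstruction task. The method has one encoder and two decoders, for degradation classification and image reconstruction respectively, combines masked image modeling with contrastive learning, and improves performance and generalization across many degradation scenarios.

Details Motivation: Existing pre-training methods for image restoration usually rely on strong supervision or a single degradation type and generalize poorly to complex, diverse real-world degradations. A pre-training framework is therefore needed that exploits weak supervision and handles multiple degradation types in a unified way, improving generality and robustness. Method: MaskDCPT uses a shared encoder to extract features from the masked low-quality image and attaches two decoders: one classifies the degradation type, the other reconstructs the high-quality image. Degradation-class prediction provides weak supervision, and contrastive learning strengthens representation learning, so the model can learn general restoration ability without full annotation. Result: In the 5D all-in-one restoration task, PSNR improves by at least 3.77 dB; in real-world degradation scenarios, PIQE drops by 34.8%. The model generalizes strongly to unseen degradation types and levels. The authors also release the UIR-2.5M dataset, with 2.5 million samples, 19 degradation types, and over 200 degradation levels. Conclusion: By combining weakly supervised classification with reconstruction, MaskDCPT provides an efficient and general pre-training framework for image restoration, significantly improving both CNN and Transformer models across degradation tasks and advancing universal image restoration. Abstract: This study introduces a Masked Degradation Classification Pre-Training method (MaskDCPT), designed to facilitate the classification of degradation types in input images, leading to comprehensive image restoration pre-training. Unlike conventional pre-training methods, MaskDCPT uses the degradation type of the image as an extremely weak supervision, while simultaneously leveraging the image reconstruction to enhance performance and robustness. MaskDCPT includes an encoder and two decoders: the encoder extracts features from the masked low-quality input image. The classification decoder uses these features to identify the degradation type, whereas the reconstruction decoder aims to reconstruct a corresponding high-quality image. This design allows the pre-training to benefit from both masked image modeling and contrastive learning, resulting in a generalized representation suited for restoration tasks. Benefiting from the straightforward yet potent MaskDCPT, the pre-trained encoder can be used to address universal image restoration and achieve outstanding performance. Implementing MaskDCPT significantly improves performance for both convolutional neural networks (CNNs) and Transformers, with a minimum increase in PSNR of 3.77 dB in the 5D all-in-one restoration task and a 34.8% reduction in PIQE compared to baseline in real-world degradation scenarios. It also exhibits strong generalization to previously unseen degradation types and levels. In addition, we curate and release the UIR-2.5M dataset, which includes 2.5 million paired restoration samples across 19 degradation types and over 200 degradation levels, incorporating both synthetic and real-world data. The dataset, source code, and models are available at https://github.com/MILab-PKU/MaskDCPT.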
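To put the reported 3.77 dB PSNR gain in perspective, the following is a minimal sketch of the standard PSNR definition. It is not code from the MaskDCPT repository; the function names and the flat-list image representation are illustrative assumptions.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized images.

    Images are given as flat lists of pixel intensities (an assumption
    made to keep the sketch dependency-free).
    """
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_reduction_factor(gain_db):
    """Factor by which the mean squared error shrinks for a given PSNR gain."""
    return 10 ** (gain_db / 10.0)

# A 3.77 dB gain corresponds to the MSE shrinking by roughly 2.4x.
factor = mse_reduction_factor(3.77)
```

Because PSNR is logarithmic in the mean squared error, a gain of a few dB already means the error energy is cut by more than half.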

[118] Automated document processing system for government agencies using DBNET++ and BART models

Aya Kaysan Bahjat

Main category: cs.CV

TL;DR: Presents an automatic document classification system that detects textual content in images and classifies documents into four categories (Invoice, Report, Letter, and Form), supporting both offline and real-time image input.

Details Motivation: Address the challenges that text recognition faces in practice, including variable illumination, arbitrary orientation, curved or partially occluded text, low resolution, and distant text. Method: A four-stage pipeline: image capture and preprocessing, DBNet++-based text detection, and BART-based text classification, with a user interface implemented in Python and PyQt5. Result: After 10 hours of training on the Total-Text dataset, text detection accuracy reached 92.88%, confirming the system's effectiveness under difficult imaging conditions. Conclusion: The approach is practical and robust for classifying mixed-source documents in unconstrained environments. Abstract: An automatic document classification system is presented that detects textual content in images and classifies documents into four predefined categories (Invoice, Report, Letter, and Form). The system supports both offline images (e.g., files on flash drives, HDDs, microSD) and real-time capture via connected cameras, and is designed to mitigate practical challenges such as variable illumination, arbitrary orientation, curved or partially occluded text, low resolution, and distant text. The pipeline comprises four stages: image capture and preprocessing, text detection [1] using a DBNet++ (Differentiable Binarization Network Plus) detector, and text classification [2] using a BART (Bidirectional and Auto-Regressive Transformers) classifier, all integrated within a user interface implemented in Python with PyQt5. For text detection in images, the system achieved about 92.88% after 10 hours on the Total-Text dataset, which contains high-resolution images that simulate a variety of very difficult challenges. The results indicate the proposed approach is effective for practical, mixed-source document categorization in unconstrained imaging scenarios.

[119] Novel Class Discovery for Point Cloud Segmentation via Joint Learning of Causal Representation and Reasoning

Yang Li,Aming Wu,Zihao Zhang,Yahong Han

Main category: cs.CV

TL;DR: This paper proposes Joint Learning of Causal Representation and Reasoning, a new method based on a structural causal model (SCM) for novel class discovery in point cloud segmentation (3D-NCD). By eliminating confounders and modeling the causal relationships between base and novel classes, it significantly improves unsupervised novel class segmentation.

Details Motivation: Existing 3D novel class discovery methods rely on coarse or statistical feature correlations, which easily confuses novel class inference, and they do not model the exact causal relationship between point cloud representations and classes. Method: Introduces a structural causal model (SCM) to re-formalize the 3D-NCD problem; designs a causal representation prototype that removes hidden confounders from base class representations; builds a graph structure that models the causal relationships between base and novel class prototypes, enabling causal reasoning from base to novel classes. Result: Extensive experiments and visualization analyses on 3D and 2D novel class discovery semantic segmentation show that the method outperforms existing approaches. Conclusion: Explicitly modeling the causal relationship between representations and classes improves the accuracy and robustness of novel class discovery and offers a new approach to unsupervised semantic segmentation. Abstract: In this paper, we focus on Novel Class Discovery for Point Cloud Segmentation (3D-NCD), aiming to learn a model that can segment unlabeled (novel) 3D classes using only the supervision from labeled (base) 3D classes. The key to this task is to set up the exact correlations between the point representations and their base class labels, as well as the representation correlations between the points from base and novel classes. A coarse or statistical correlation learning may lead to confusion in novel class inference. If we impose a causal relationship as a strongly correlated constraint upon the learning process, the essential point cloud representations that accurately correspond to the classes should be uncovered. To this end, we introduce a structural causal model (SCM) to re-formalize the 3D-NCD problem and propose a new method, i.e., Joint Learning of Causal Representation and Reasoning. Specifically, we first analyze hidden confounders in the base class representations and the causal relationships between the base and novel classes through SCM. We devise a causal representation prototype that eliminates confounders to capture the causal representations of base classes. A graph structure is then used to model the causal relationships between the base classes' causal representation prototypes and the novel class prototypes, enabling causal reasoning from base to novel classes. Extensive experiments and visualization results on 3D and 2D NCD semantic segmentation demonstrate the superiority of our method.

[120] InstantSfM: Fully Sparse and Parallel Structure-from-Motion

Jiankun Zhong,Zitong Zhan,Quankai Gao,Ziyu Chen,Haozhe Lou,Jiageng Mao,Ulrich Neumann,Yue Wang

Main category: cs.CV

TL;DR: This paper proposes an efficient Structure-from-Motion (SfM) method based on GPU parallel computation. By extending sparse-aware bundle adjustment optimization, it accelerates both bundle adjustment (BA) and global positioning (GP) in a unified way, achieving up to about 40 times speedup over COLMAP while maintaining or even improving reconstruction accuracy.

Details Motivation: Traditional SfM methods (such as COLMAP and GLOMAP) rely on CPU computation and suffer from high computational overhead and limited flexibility, while deep-learning-based methods (such as VGGSfM and VGGT) are limited by GPU memory and cannot scale to large scenes with thousands of images. An SfM framework is therefore needed that uses GPU resources efficiently and supports large-scale inputs. Method: Uses GPU parallel computation combined with sparse-aware bundle adjustment optimization to build a unified global SfM framework that accelerates the two key stages, bundle adjustment (BA) and global positioning (GP), and supports various external optimization options for flexibility. Result: Validated on datasets of varying scales. On a 5000-image scene, existing deep learning methods run out of GPU memory while the proposed method still succeeds; it achieves up to 40 times speedup over COLMAP with comparable or better reconstruction accuracy. Conclusion: The method unlocks the parallel computing potential of GPUs in every key stage of SfM, balancing efficiency and accuracy, and provides a more scalable and flexible solution for large-scale 3D reconstruction in practice. Abstract: Structure-from-Motion (SfM), a method that recovers camera poses and scene geometry from uncalibrated images, is a central component in robotic reconstruction and simulation. Despite the state-of-the-art performance of traditional SfM methods such as COLMAP and its follow-up work, GLOMAP, naive CPU-specialized implementations of bundle adjustment (BA) or global positioning (GP) introduce significant computational overhead when handling large-scale scenarios, leading to a trade-off between accuracy and speed in SfM. Moreover, the blessing of efficient C++-based implementations in COLMAP and GLOMAP comes with the curse of limited flexibility, as they lack support for various external optimization options. On the other hand, while deep-learning-based SfM pipelines like VGGSfM and VGGT enable feed-forward 3D reconstruction, they are unable to scale to thousands of input views at once as GPU memory consumption increases sharply as the number of input views grows. In this paper, we unleash the full potential of GPU parallel computation to accelerate each critical stage of the standard SfM pipeline. Building upon recent advances in sparse-aware bundle adjustment optimization, our design extends these techniques to accelerate both BA and GP within a unified global SfM framework. Through extensive experiments on datasets of varying scales (e.g., 5000 images where VGGSfM and VGGT run out of memory), our method demonstrates up to about 40 times speedup over COLMAP while achieving consistently comparable or even improved reconstruction accuracy. Our project page can be found at https://cre185.github.io/InstantSfM/.
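Bundle adjustment minimizes reprojection error, and each residual depends on only one camera and one 3D point, which is why the problem is sparse and parallelizes well. The sketch below illustrates that structure under simplifying assumptions (pinhole camera, identity rotation, hypothetical function names); it is not InstantSfM's implementation.

```python
def project(point_cam, focal, cx, cy):
    """Pinhole projection of a 3D point given in camera coordinates."""
    x, y, z = point_cam
    return (focal * x / z + cx, focal * y / z + cy)

def reprojection_residual(point_world, translation, observed_uv, focal, cx, cy):
    """Residual for one observation.

    Assumption: the camera rotation is the identity, so the pose is just a
    translation. A real solver would also optimize the rotation.
    """
    point_cam = tuple(p + t for p, t in zip(point_world, translation))
    u, v = project(point_cam, focal, cx, cy)
    return (u - observed_uv[0], v - observed_uv[1])

def total_squared_error(observations, points, translations, focal, cx, cy):
    """Sum of squared residuals over (camera_index, point_index, uv) triples.

    Each term touches exactly one camera and one point, so the Jacobian of
    this objective has only a few non-zero blocks per row.
    """
    total = 0.0
    for cam_idx, pt_idx, uv in observations:
        du, dv = reprojection_residual(
            points[pt_idx], translations[cam_idx], uv, focal, cx, cy
        )
        total += du * du + dv * dv
    return total
```

Since the terms are independent given the current estimates, they can be evaluated in parallel, which is the property GPU-based solvers exploit.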

[121] Self-Augmented Visual Contrastive Decoding

Eun Woo Im,Muhammad Kashif Ali,Vivek Gupta

Main category: cs.CV

TL;DR: Proposes a training-free decoding strategy that improves the factual consistency of large vision-language models through self-augmentation prompting and an adaptive thresholding algorithm.

Details Motivation: Existing visual contrastive decoding methods apply generic visual augmentations that ignore the context of the text query, which limits how well they mitigate hallucination. Method: Two key techniques: 1) a self-augmentation prompting strategy that uses the model's intrinsic knowledge to dynamically align semantics between the query and the visual augmentation; 2) an adaptive thresholding algorithm that adjusts the number of candidate tokens based on output sparsity, making full use of the logit distribution. Result: Experiments on four large vision-language models and seven benchmarks show that the method significantly outperforms existing state-of-the-art decoding methods in factual consistency. Conclusion: Combining query-dependent augmentation with entropy-aware decoding helps improve the accuracy and reliability of LVLM generation. Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal capabilities, but they inherit the tendency to hallucinate from their underlying language models. While visual contrastive decoding has been proposed to mitigate this issue, existing methods often apply generic visual augmentations that disregard the specific context provided by the text query, limiting their effectiveness. This study introduces a novel training-free decoding strategy that addresses these limitations, featuring two key contributions. First, a self-augmentation prompting strategy that leverages the intrinsic knowledge of the model to dynamically align semantics between the query and the visual augmentation. Second, an adaptive thresholding algorithm that adaptively adjusts next token candidate size based on the output sparsity, utilizing full information from the logit distribution. Extensive experiments across four LVLMs and seven benchmarks demonstrate that the proposed decoding significantly enhances factual consistency compared to state-of-the-art decoding methods. This work highlights the importance of integrating query-dependent augmentation and entropy-aware decoding for improving effective generation of LVLMs.
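As a rough illustration of the general idea, the sketch below combines the commonly used contrastive-decoding score with an entropy-dependent candidate cutoff. The specific rule `beta = beta_max * (1 - normalized_entropy)` is a hypothetical stand-in chosen for this example, not the thresholding algorithm from the paper, and all names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by its maximum, giving a value in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def contrastive_scores(logits_clean, logits_augmented, alpha=1.0, beta_max=0.5):
    """Contrast clean-image logits against augmented-image logits.

    Tokens whose clean probability falls below an adaptive cutoff are masked
    out with -inf. Assumed rule: a peaked (low-entropy) distribution gets a
    strict cutoff, a flat (high-entropy) one keeps more candidates.
    """
    probs = softmax(logits_clean)
    beta = beta_max * (1.0 - normalized_entropy(probs))
    cutoff = beta * max(probs)
    scores = []
    for p, z_clean, z_aug in zip(probs, logits_clean, logits_augmented):
        if p >= cutoff:
            scores.append((1 + alpha) * z_clean - alpha * z_aug)
        else:
            scores.append(-math.inf)
    return scores

# Token 1 depends on the visual input (its logit drops under augmentation),
# so contrasting boosts it; token 2 is implausible and gets masked.
scores = contrastive_scores([5.0, 4.8, -3.0], [5.0, 2.0, -3.0])
best_token = scores.index(max(scores))
```

The cutoff keeps the contrast from promoting tokens that the clean-image distribution already considers implausible.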

[122] Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests

Fitim Abdullahu,Helmut Grabner

Main category: cs.CV

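TL;DR: This study explores the potential of large multimodal models (such as GPT-4o) to understand visual interestingness. It finds that their predictions partially align with human assessments, outperform existing methods, and can be used to build a learning-to-rank model trained on image pairs labeled by interestingness.

Details Motivation: Visual interestingness strongly shapes daily life, so capturing and quantifying the concept effectively matters. With the rise of large multimodal models, it becomes possible to explore whether they can understand and predict human interest in visual content. Method: Uses comparative analysis to evaluate how consistent GPT-4o's judgments of visual interestingness are with human assessments, then uses its outputs to generate labeled data and distill the knowledge into a learning-to-rank model. Result: GPT-4o captures visual interestingness better than existing techniques, partially aligns with human judgment, and can be used effectively to label the interestingness of image pairs. Conclusion: GPT-4o already understands visual interestingness to some degree. It can serve as a tool for studying the mechanisms of human interest and offers an efficient data labeling approach for related applications. Abstract: Our daily life is highly influenced by what we consume and see. Attracting and holding one's attention, the definition of (visual) interestingness, is essential. The rise of Large Multimodal Models (LMMs) trained on large-scale visual and textual data has demonstrated impressive capabilities. We explore the extent to which these models capture the concept of visual interestingness, and use comparative analysis to examine the alignment between human assessments and the predictions of GPT-4o, a leading LMM. Our studies reveal partial alignment between humans and GPT-4o. It already captures the concept better than state-of-the-art methods. Hence, this allows for the effective labeling of image pairs according to their (common) interestingness, which are used as training data to distill the knowledge into a learning-to-rank model. The insights pave the way for a deeper understanding of human interest.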

[123] Removing Cost Volumes from Optical Flow Estimators

Simon Kiefhaber,Stefan Roth,Simone Schaub-Meyer

Main category: cs.CV

TL;DR: Proposes a training strategy that removes the cost volume from optical flow estimators during training, significantly improving inference speed and reducing memory usage.

Details Motivation: The high computational and memory cost of cost volumes limits the speed and input resolution of optical flow estimators, and the authors observe that cost volumes become less important once the other parts of the network are sufficiently trained. Method: Introduces a training strategy that gradually removes the cost volume from optical flow estimators, and designs three models for different compute budgets. Result: The most accurate model reaches SOTA accuracy while being 1.2 times faster and using one sixth of the memory of comparable models; the fastest model processes Full HD frames at 20 FPS using only 500 MB of GPU memory. Conclusion: The training strategy balances accuracy, speed, and memory consumption, offering a new direction for efficient optical flow estimation. Abstract: Cost volumes are used in every modern optical flow estimator, but due to their computational and space complexity, they are often a limiting factor regarding both processing speed and the resolution of input frames. Motivated by our empirical observation that cost volumes lose their importance once all other network parts of, e.g., a RAFT-based pipeline have been sufficiently trained, we introduce a training strategy that allows removing the cost volume from optical flow estimators throughout training. This leads to significantly improved inference speed and reduced memory requirements. Using our training strategy, we create three different models covering different compute budgets. Our most accurate model reaches state-of-the-art accuracy while being $1.2\times$ faster and having a $6\times$ lower memory footprint than comparable models; our fastest model is capable of processing Full HD frames at $20\,\mathrm{FPS}$ using only $500\,\mathrm{MB}$ of GPU memory.
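To see why cost volumes are a memory bottleneck, the sketch below builds an all-pairs correlation volume for toy features and estimates its size at Full HD. It assumes a RAFT-style setup with features at 1/8 resolution and 4-byte floats; these settings and the function names are illustrative assumptions, not details taken from the paper.

```python
def all_pairs_cost_volume(features_1, features_2):
    """Dot product between every feature in frame 1 and every feature in frame 2.

    Each argument is a list of N per-pixel feature vectors; the result is an
    N x N nested list, so memory grows quadratically with the pixel count.
    """
    return [
        [sum(a * b for a, b in zip(f1, f2)) for f2 in features_2]
        for f1 in features_1
    ]

def cost_volume_bytes(height, width, downsample=8, bytes_per_value=4):
    """Size of a dense all-pairs cost volume at the given feature resolution."""
    n = (height // downsample) * (width // downsample)
    return n * n * bytes_per_value

# Full HD input at 1/8 resolution: 135 x 240 = 32,400 feature positions.
full_hd_bytes = cost_volume_bytes(1080, 1920)  # about 4.2 GB for one frame pair
```

Under these assumptions a single dense volume is several times larger than the 500 MB budget quoted for the fastest model, which shows how much there is to gain from removing it.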

[124] DEF-YOLO: Leveraging YOLO for Concealed Weapon Detection in Thermal Imaging

Divya Bhardwaj,Arnav Ramamoorthy,Poonam Goyal

Main category: cs.CV

TL;DR: Proposes DEF-YOLO, a model built on an improved YOLOv8, and TICW, the first large-scale thermal imaging dataset for concealed weapon detection. Deformable convolutions and focal loss enable efficient, privacy-preserving, real-time detection.

Details Motivation: To overcome the limitations of existing imaging modalities in resolution, privacy, and cost, the authors seek a low-cost, privacy-preserving concealed weapon detection solution suited to round-the-clock real-time surveillance. Method: Builds on the YOLOv8 architecture, introducing deformable convolutions into the SPPF, backbone, and neck to extract multi-scale features, and adopts focal loss to mitigate class imbalance; the model is trained and validated on the self-built thermal imaging dataset TICW. Result: On the self-built large-scale TICW dataset, DEF-YOLO significantly improves the accuracy and robustness of concealed weapon detection in thermal imagery; experiments show it outperforms baseline models and sets a new benchmark for the field. Conclusion: DEF-YOLO together with the TICW dataset provides an effective solution for concealed weapon detection in thermal imagery, advances the field, and has potential for real-world deployment. Abstract: Concealed weapon detection aims at detecting weapons hidden beneath a person's clothing or luggage. Various imaging modalities like Millimeter Wave, Microwave, Terahertz, Infrared, etc., are exploited for the concealed weapon detection task. These imaging modalities have their own limitations, such as poor resolution in microwave imaging, privacy concerns in millimeter wave imaging, etc. To provide a real-time, 24/7, low-cost, and privacy-preserving surveillance solution, we opted for thermal imaging in spite of the lack of availability of a benchmark dataset. We propose a novel approach and a dataset for concealed weapon detection in thermal imagery. Our YOLO-based architecture, DEF-YOLO, is built with key enhancements in YOLOv8 tailored to the unique challenges of concealed weapon detection in thermal vision. We adopt deformable convolutions at the SPPF layer to exploit multi-scale features, and at the backbone and neck layers to extract low-, mid-, and high-level features, enabling DEF-YOLO to adaptively focus on localization around the objects in thermal homogeneous regions, without sacrificing much of the speed and throughput. In addition to these simple yet effective key architectural changes, we introduce a new, large-scale Thermal Imaging Concealed Weapon dataset, TICW, featuring a diverse set of concealed weapons and capturing a wide range of scenarios. To the best of our knowledge, this is the first large-scale contributed dataset for this task. We also incorporate focal loss to address the significant class imbalance inherent in the concealed weapon detection task. Extensive experiments demonstrate the efficacy of the proposed work and establish a new benchmark for concealed weapon detection in thermal imagery.

[125] Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models

Hong-Kai Zheng,Piji Li

Main category: cs.CV

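TL;DR: This paper proposes Group-VQ, which optimizes the codebook group by group and introduces a training-free codebook resampling technique. It improves the trade-off between codebook utilization and reconstruction performance in VQ-VAE and allows the codebook size to be adjusted flexibly after training.

Details Motivation: Existing VQ-VAE methods suffer from codebook collapse, and using static codebooks or jointly optimizing the whole codebook constrains learning capacity, which lowers reconstruction quality. Method: Proposes Group-VQ, which partitions the codebook into groups that are optimized independently of each other, with joint optimization within each group; also introduces a codebook resampling method that adjusts the codebook size after training without retraining. Result: In image reconstruction experiments under various settings, Group-VQ performs better on reconstruction metrics, and the resampling method enables flexible adjustment of the codebook size. Conclusion: Group-VQ improves the balance between codebook utilization and reconstruction performance, and the proposed resampling method makes model deployment more flexible. Abstract: Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised learning through reconstruction tasks to represent continuous vectors using the closest vectors in a codebook. However, issues such as codebook collapse persist in the VQ model. To address these issues, existing approaches employ implicit static codebooks or jointly optimize the entire codebook, but these methods constrain the codebook's learning capability, leading to reduced reconstruction quality. In this paper, we propose Group-VQ, which performs group-wise optimization on the codebook. Each group is optimized independently, with joint optimization performed within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Additionally, we introduce a training-free codebook resampling method, allowing post-training adjustment of the codebook size. In image reconstruction experiments under various settings, Group-VQ demonstrates improved performance on reconstruction metrics. The post-training codebook resampling method also achieves the desired flexibility in adjusting the codebook size.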
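The quantization step that all VQ models share, and the utilization figure that exposes codebook collapse, can be sketched as follows. This shows plain nearest-neighbor vector quantization only; the group-wise optimization and resampling from the paper are not reproduced, and the function names are illustrative.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    indices = []
    for v in vectors:
        distances = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices

def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once.

    A value far below 1.0 on a large batch is the usual symptom of
    codebook collapse.
    """
    return len(set(indices)) / codebook_size

codebook = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
encoder_outputs = [[0.1, 0.0], [0.9, 1.2], [0.2, 0.1]]
indices = quantize(encoder_outputs, codebook)          # third entry is never used
utilization = codebook_utilization(indices, len(codebook))
```

In this toy case one entry is never selected, so utilization is two thirds.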

[126] No-Reference Rendered Video Quality Assessment: Dataset and Metrics

Sipeng Yang,Jiayu Ji,Qingchuan Zhu,Zhiyao Yang,Xiaogang Jin

Main category: cs.CV

TL;DR: This paper presents a no-reference video quality assessment (NR-VQA) dataset and a dedicated metric for rendered videos. The metric accounts for both image quality and temporal stability, outperforms existing methods on rendered videos, and can be used to evaluate supersampling and frame generation strategies in real-time rendering.

Details Motivation: Existing NR-VQA methods mainly target camera-captured videos. Applying them directly to rendered videos gives biased results, because rendered videos are more prone to temporal artifacts and dedicated datasets and metrics are lacking. Method: Builds a large rendering-oriented video dataset covering a variety of 3D scenes and rendering settings, with subjective quality scores; based on this dataset, designs an NR-VQA metric that combines image quality and temporal stability analysis. Result: The proposed NR-VQA metric outperforms existing metrics on rendered videos, and its usefulness is demonstrated for benchmarking supersampling methods and evaluating frame generation strategies in real-time rendering. Conclusion: The proposed rendering-specific NR-VQA dataset and metric assess rendered video quality more accurately and provide an effective evaluation tool for related applications. Abstract: Quality assessment of videos is crucial for many computer graphics applications, including video games, virtual reality, and augmented reality, where visual performance has a significant impact on user experience. When test videos cannot be perfectly aligned with references or when references are unavailable, the significance of no-reference video quality assessment (NR-VQA) methods is undeniable. However, existing NR-VQA datasets and metrics are primarily focused on camera-captured videos; applying them directly to rendered videos would result in biased predictions, as rendered videos are more prone to temporal artifacts. To address this, we present a large rendering-oriented video dataset with subjective quality annotations, as well as an NR-VQA metric designed specifically for rendered videos. The proposed dataset includes a wide range of 3D scenes and rendering settings, with quality scores annotated for various display types to better reflect real-world application scenarios. Building on this dataset, we calibrate our NR-VQA metric to assess rendered video quality by looking at both image quality and temporal stability. We compare our metric to existing NR-VQA metrics, demonstrating its superior performance on rendered videos. Finally, we demonstrate that our metric can be used to benchmark supersampling methods and assess frame generation strategies in real-time rendering.

[127] Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity

MingZe Tang,Jubal Chandy Jacob

Main category: cs.CV

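TL;DR: The study examines how prompt design affects vision-language models when recognizing human postures (sitting, standing, walking/running) with little data. High-performing models do best with simple prompts, while complex prompts degrade their performance, a phenomenon termed "prompt overfitting"; lower-performing models can improve with detailed prompts.

Details Motivation: Understand how prompt design affects zero-shot classification of visually similar categories such as human postures, especially under data scarcity. Method: On a 285-image COCO-derived dataset, systematically evaluates modern vision-language models including OpenCLIP, MetaCLIP 2, and SigLip in zero-shot classification, using a three-tiered prompt design with increasing linguistic detail. Result: The high-performing models (MetaCLIP 2 and OpenCLIP) do best with the simplest prompts, and adding descriptive detail significantly degrades performance (for example, MetaCLIP 2's accuracy drops from 68.8% to 55.1%), showing "prompt overfitting"; the lower-performing SigLip model improves on ambiguous classes when given detailed, body-cue-based prompts. Conclusion: Prompt design affects vision-language models differently depending on their performance level: simple prompts suit high-performing models, while lower-performing models can benefit from more descriptive prompts, so prompt complexity should be weighed against model capability. Abstract: Recent Vision-Language Models (VLMs) enable zero-shot classification by aligning images and text in a shared space, a promising approach for data-scarce conditions. However, the influence of prompt design on recognizing visually similar categories, such as human postures, is not well understood. This study investigates how prompt specificity affects the zero-shot classification of sitting, standing, and walking/running on a small, 285-image COCO-derived dataset. A suite of modern VLMs, including OpenCLIP, MetaCLIP 2, and SigLip, were evaluated using a three-tiered prompt design that systematically increases linguistic detail. Our findings reveal a compelling, counter-intuitive trend: for the highest-performing models (MetaCLIP 2 and OpenCLIP), the simplest, most basic prompts consistently achieve the best results. Adding descriptive detail significantly degrades performance (for instance, MetaCLIP 2's multi-class accuracy drops from 68.8% to 55.1%), a phenomenon we term "prompt overfitting". Conversely, the lower-performing SigLip model shows improved classification on ambiguous classes when given more descriptive, body-cue-based prompts.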
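Zero-shot classification in a shared image-text space reduces to picking the prompt whose embedding is most similar to the image embedding. The sketch below shows that decision rule with hand-made toy vectors; in practice the embeddings would come from a model such as OpenCLIP, and the vectors, labels-to-vectors mapping, and function names here are purely illustrative.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, prompt_embeddings):
    """Return the label whose prompt embedding is closest to the image embedding."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in prompt_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Toy 3-dimensional embeddings (hypothetical values, not model outputs).
toy_prompts = {
    "sitting": [0.9, 0.1, 0.0],
    "standing": [0.1, 0.9, 0.1],
    "walking/running": [0.0, 0.2, 0.9],
}
toy_image = [0.2, 0.8, 0.3]
label, scores = zero_shot_classify(toy_image, toy_prompts)
```

Changing the prompt wording changes only the text embeddings, which is why prompt specificity alone can shift accuracy without any retraining.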

[128] DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

Tianyuan Yuan,Yicheng Liu,Chenhao Lu,Zhuoguang Chen,Tao Jiang,Hang Zhao

Main category: cs.CV

TL;DR: DepthVLA is a new vision-language-action model that adds a pretrained depth prediction module and a mixture-of-transformers architecture, significantly improving spatial reasoning and outperforming existing methods in both real-world and simulated environments.

Details Motivation: Existing vision-language-action models perform poorly on tasks that require precise spatial reasoning, and they depend on extensive action-data pretraining, which is inefficient and still yields insufficient spatial understanding. Method: Proposes DepthVLA, a mixture-of-transformers design that unifies a vision-language model, a depth transformer, and an action expert with shared attention, forming an end-to-end model with enhanced spatial awareness. Result: DepthVLA outperforms state-of-the-art methods in real-world settings and several simulators: 78.5% on real-world tasks (vs. 65.0%), 94.9% in the LIBERO simulator (vs. 93.6%), and 74.8% in the Simpler simulator (vs. 58.8%). Conclusion: By explicitly incorporating depth information and a shared attention mechanism, DepthVLA improves spatial reasoning without extensive action-data pretraining, giving better training efficiency and performance. Abstract: Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.

[129] Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering

Siddharth Tourani,Jayaram Reddy,Akash Kumbar,Satyajit Tourani,Nishant Goyal,Madhava Krishna,N. Dinesh Reddy,Muhammad Haris Khan

Main category: cs.CV

TL;DR: Proposes a new method that combines Signed Distance Functions (SDF) with 3D Gaussian Splatting (3DGS) for rendering and reconstructing dynamic urban scenes. It needs neither LiDAR nor ground-truth motion annotation, and achieves state-of-the-art rendering performance along with flexible scene editing.

Details Motivation: Existing 3DGS-based methods for dynamic urban scenes depend on strong supervision such as LiDAR data, ground-truth 3D segmentations, and motion trajectories, which limits their applicability; this work aims to reduce those dependencies. Method: Combines 2D object-agnostic priors (such as depth estimation and point tracking) with an SDF representation and integrates them into the 3DGS framework, building a unified optimization framework that improves geometric accuracy and dynamic deformation modeling. Result: Achieves state-of-the-art rendering metrics without LiDAR data; adding LiDAR further improves reconstruction and novel view synthesis, still without ground-truth 3D motion annotation, and the method supports editing tasks such as scene decomposition and composition. Conclusion: By fusing SDF with 3DGS, the method reduces reliance on multimodal sensing data while improving the accuracy and flexibility of dynamic scene modeling, and it is practical and extensible. Abstract: Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS) have enabled accurate modeling of dynamic urban scenes, but they require both camera and LiDAR data, ground-truth 3D segmentations, and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object-agnostic priors in the form of depth and point tracking, coupled with a signed distance function (SDF) representation for dynamic objects, can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves state-of-the-art performance in rendering metrics even without LiDAR data on urban scenes. When incorporating LiDAR, our approach improves further in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition and scene composition.
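For readers unfamiliar with signed distance functions, the sketch below gives the analytic SDF of a sphere: negative inside the surface, zero on it, positive outside. The paper's SDF is a learned representation for dynamic objects; this closed-form example, with illustrative names, only conveys the convention.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance from a 3D point to a sphere's surface.

    Negative values are inside the sphere, zero lies on the surface,
    positive values are outside.
    """
    return math.dist(point, center) - radius

def is_inside(point, center, radius):
    return sphere_sdf(point, center, radius) < 0.0

origin = (0.0, 0.0, 0.0)
inside_value = sphere_sdf((0.0, 0.0, 0.0), origin, 1.0)   # -1.0
surface_value = sphere_sdf((1.0, 0.0, 0.0), origin, 1.0)  # 0.0
outside_value = sphere_sdf((2.0, 0.0, 0.0), origin, 1.0)  # 1.0
```

Because the zero level set defines the surface exactly, an SDF gives a geometric constraint that point-based primitives such as Gaussians can be regularized against.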

[130] Generalizing WiFi Gesture Recognition via Large-Model-Aware Semantic Distillation and Alignment

Feng-Qi Cui,Yu-Tong Guo,Tianyue Zheng,Jinyang Huang

Main category: cs.CV

TL;DR: Proposes GLSDA, a new framework that uses the semantic priors of pre-trained large models to improve WiFi (CSI) gesture recognition in both in-domain and cross-domain settings. It combines dual-path encoding, multiscale semantic encoding, semantic-aware soft supervision, and a dual-distillation strategy, achieving a better balance of accuracy, model size, and inference latency than existing methods on the Widar3.0 benchmark.

Details Motivation: Existing WiFi-based gesture recognition methods generalize poorly and lack semantic expressiveness because Channel State Information (CSI) is domain-sensitive and high-level semantic abstraction is missing. Method: Designs a dual-path CSI encoding (CSI-Ratio phase sequences and Doppler spectrograms), learns temporal embeddings with a multiscale semantic encoder and aligns them with gesture semantics, introduces semantic-aware soft supervision to strengthen category discrimination, and uses a robust dual-distillation strategy to compress the teacher model's knowledge (intermediate features and soft labels) into a lightweight student network. Result: On the Widar3.0 dataset, GLSDA outperforms state-of-the-art methods on both in-domain and cross-domain gesture recognition while significantly reducing model size and inference latency. Conclusion: By fusing large-model semantic priors with RF signal features, GLSDA offers a scalable and deployable solution and advances generalized RF gesture interfaces in real-world AIoT environments. Abstract: WiFi-based gesture recognition has emerged as a promising RF sensing paradigm for enabling non-contact and privacy-preserving human-computer interaction in AIoT environments. However, existing methods often suffer from limited generalization and semantic expressiveness due to the domain-sensitive nature of Channel State Information and the lack of high-level gesture abstraction. To address these challenges, we propose a novel generalization framework, termed Large-Model-Aware Semantic Distillation and Alignment (GLSDA), which leverages the semantic prior of pre-trained large foundation models to enhance gesture representation learning in both in-domain and cross-domain scenarios. Specifically, we first design a dual-path CSI encoding pipeline that captures geometric and dynamic gesture patterns via CSI-Ratio phase sequences and Doppler spectrograms. These representations are then fed into a Multiscale Semantic Encoder, which learns robust temporal embeddings and aligns them with gesture semantics through cross-modal attention mechanisms. To further enhance category discrimination, we introduce a Semantic-Aware Soft Supervision scheme that encodes inter-class correlations and reduces label ambiguity, especially for semantically similar gestures. Finally, we develop a Robust Dual-Distillation strategy to compress the aligned model into a lightweight student network, jointly distilling intermediate features and semantic-informed soft labels from the teacher model. Extensive experiments on the Widar3.0 benchmark show that GLSDA consistently outperforms state-of-the-art methods in both in-domain and cross-domain gesture recognition tasks, while significantly reducing model size and inference latency. Our method offers a scalable and deployable solution for generalized RF-based gesture interfaces in real-world AIoT applications.
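The soft-label half of a teacher-student distillation objective is often written as a temperature-scaled KL divergence. The sketch below shows that standard formulation as an assumption about the general technique; the paper's Robust Dual-Distillation also distills intermediate features and may differ in its exact loss, and the function names are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual correction that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

matched = soft_label_distillation_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
mismatched = soft_label_distillation_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

A higher temperature spreads probability mass over similar gestures, which is how soft labels carry inter-class correlations to the student.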

[131] Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

Xinmiao Huang,Qisong He,Zhenglin Huang,Boxuan Wang,Zhuoyun Li,Guangliang Cheng,Yi Dong,Xiaowei Huang

Main category: cs.CV

TL;DR: This paper proposes Spatial-DISE, a unified benchmark based on a cognitive taxonomy for evaluating the spatial reasoning of vision-language models, especially intrinsic-dynamic spatial reasoning. An automated generation pipeline produces a dataset with 559 evaluation samples and 12K+ training samples, and experiments show a significant gap between existing models and human performance.

Details Motivation: Existing benchmarks fall short in evaluating the spatial reasoning of vision-language models, particularly intrinsic-dynamic spatial reasoning, a key part of human spatial cognition. Method: Proposes a four-quadrant taxonomy grounded in cognitive science (intrinsic-static, intrinsic-dynamic, extrinsic-static, extrinsic-dynamic) and builds a scalable automated data generation pipeline, creating the Spatial-DISE Bench and Spatial-DISE-12K datasets. Result: Experiments on 28 mainstream vision-language models show a significant and consistent gap to human performance on multi-step, multi-view spatial reasoning tasks. Conclusion: Spatial-DISE provides a reliable framework, valuable data, and a clear research direction for evaluating and improving the spatial intelligence of vision-language models, advancing human-like spatial understanding. Abstract: Spatial reasoning ability is crucial for Vision Language Models (VLMs) to support real-world applications in diverse domains including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate in assessing spatial reasoning ability, especially the intrinsic-dynamic spatial reasoning which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, Spatial-DISE, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address the issue of data scarcity, we develop a scalable and automated pipeline to generate diverse and verifiable spatial reasoning questions, resulting in a new Spatial-DISE dataset that includes Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals that current VLMs have a large and consistent gap to human competence, especially on multi-step multi-view spatial reasoning. Spatial-DISE offers a robust framework, valuable dataset, and clear direction for future research toward human-like spatial intelligence. Benchmark, dataset, and code will be publicly released.

[132] Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation

Yifu Luo,Xinhao Hu,Keyu Fan,Haoyuan Sun,Zeyu Chen,Bo Xia,Tiantian Zhang,Yongzhe Chang,Xueqian Wang

Main category: cs.CV

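TL;DR: Proposes Mask-GRPO, the first method to bring GRPO-based reinforcement learning to masked generative models. By redefining the transition probability and formulating the unmasking process as a multi-step decision-making problem, it achieves substantial gains on text-to-image generation.

Details Motivation: Existing reinforcement learning methods mainly target diffusion models or autoregressive models and overlook masked generative models, an important paradigm that lacks an RL algorithm adapted to it. Method: Proposes Mask-GRPO, which redefines the transition probability, treats the unmasking process as a multi-step decision-making problem, and adds optimizations such as removing the KL constraint, applying the reduction strategy, and filtering out low-quality samples. Result: Significantly outperforms existing methods on standard T2I benchmarks and preference alignment, improving the Show-o base model. Conclusion: Mask-GRPO is the first framework to apply GRPO reinforcement learning to masked generative models, and the results confirm its effectiveness for text-to-image generation. Abstract: Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, which is different from current approaches, and formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the KL constraint, applying the reduction strategy, and filtering out low-quality samples. Using Mask-GRPO, we improve a base model, Show-o, with substantial improvements on standard T2I benchmarks and preference alignment, outperforming existing state-of-the-art approaches. The code is available at https://github.com/xingzhejun/Mask-GRPO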
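The "group relative" part of GRPO is usually computed by standardizing the rewards of several samples drawn for the same prompt. The sketch below shows that step only, as commonly described for GRPO; Mask-GRPO's redefined transition probabilities and sample filtering are not reproduced, and the function name and epsilon value are illustrative assumptions.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation, so no learned value function is needed.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

# Three images generated for one prompt, scored by a reward model.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
# When all samples score the same, there is no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Samples that beat their group average get a positive advantage and are reinforced, while a group with identical rewards contributes nothing to the update.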

[133] Ultra High-Resolution Image Inpainting with Patch-Based Content Consistency Adapter

Jianhui Zhang,Sheng Cheng,Qirui Sun,Jia Liu,Wang Luyang,Chaoyu Feng,Chen Fang,Lei Lei,Jue Wang,Shuaicheng Liu

Main category: cs.CV

TL;DR: This paper proposes Patch-Adapter, an efficient framework for high-resolution text-guided image inpainting that maintains content consistency and prompt alignment at 4K+ resolution.

Details Motivation: Existing methods are limited to low resolutions and struggle to maintain content consistency and prompt alignment at high resolution and with complex textures; this work addresses that scalability problem. Method: Uses a two-stage adapter architecture: in the first stage, the Dual Context Adapter learns global consistency between masked and unmasked regions at reduced resolution; in the second stage, the Reference Patch Adapter applies a patch-level attention mechanism and adaptive feature fusion at full resolution to preserve local detail. Result: Experiments show that Patch-Adapter outperforms existing methods on the OpenImages and Photo-Concept-Bucket datasets, removes artifacts common in large-scale inpainting, and reaches SOTA performance in perceptual quality and text-prompt alignment. Conclusion: By decoupling global semantics from local refinement, Patch-Adapter closes the scalability gap in high-resolution image inpainting and provides an effective solution for high-resolution text-guided inpainting. Abstract: In this work, we present Patch-Adapter, an effective framework for high-resolution text-guided image inpainting. Unlike existing methods limited to lower resolutions, our approach achieves 4K+ resolution while maintaining precise content consistency and prompt alignment, two critical challenges in image inpainting that intensify with increasing resolution and texture complexity. Patch-Adapter leverages a two-stage adapter architecture to scale the diffusion model's resolution from 1K to 4K+ without requiring structural overhauls: (1) Dual Context Adapter learns coherence between masked and unmasked regions at reduced resolutions to establish global structural consistency; and (2) Reference Patch Adapter implements a patch-level attention mechanism for full-resolution inpainting, preserving local detail fidelity through adaptive feature fusion. This dual-stage architecture uniquely addresses the scalability gap in high-resolution inpainting by decoupling global semantics from localized refinement. Experiments demonstrate that Patch-Adapter not only resolves artifacts common in large-scale inpainting but also achieves state-of-the-art performance on the OpenImages and Photo-Concept-Bucket datasets, outperforming existing methods in both perceptual quality and text-prompt adherence.

[134] CoDS: Enhancing Collaborative Perception in Heterogeneous Scenarios via Domain Separation

Yushan Han,Hui Zhang,Honglei Zhang,Chuntao Ding,Yuanzhouhan Cao,Yidong Li

Main category: cs.CV

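TL;DR: This paper proposes CoDS, a collaborative perception method that uses domain separation to address feature discrepancies in heterogeneous scenarios. It combines a lightweight spatial-channel resizer with a distribution alignment module and a mutual information loss to improve feature alignment, while keeping inference efficient.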

Details Motivation: Existing collaborative perception methods assume that all agents use the same encoder, align features poorly in heterogeneous scenarios because of domain gaps, and rely on transformer-based modules that are inefficient for inference on mobile devices. Method: CoDS comprises a Lightweight Spatial-Channel Resizer (LSCR) and a Distribution Alignment via Domain Separation (DADS) module, adds a Domain Alignment Mutual Information (DAMI) loss, and uses a fully convolutional architecture for efficient inference. Result: Experiments show that CoDS effectively mitigates feature discrepancies in heterogeneous scenarios and achieves a good balance between detection accuracy and inference efficiency. Conclusion: Through domain separation and lightweight alignment modules, CoDS addresses feature mismatch in heterogeneous collaborative perception and is suited to practical deployment. Abstract: Collaborative perception has been proven to improve individual perception in autonomous driving through multi-agent interaction. Nevertheless, most methods assume identical encoders for all agents, which does not hold true when these models are deployed in real-world applications. To realize collaborative perception in actual heterogeneous scenarios, existing methods usually align neighbor features to those of the ego vehicle, which is vulnerable to noise from domain gaps and thus fails to address feature discrepancies effectively. Moreover, they adopt transformer-based modules for domain adaptation, which makes model inference inefficient on mobile devices. To tackle these issues, we propose CoDS, a Collaborative perception method that leverages Domain Separation to address feature discrepancies in heterogeneous scenarios. The CoDS employs two feature alignment modules, i.e., Lightweight Spatial-Channel Resizer (LSCR) and Distribution Alignment via Domain Separation (DADS). Besides, it utilizes the Domain Alignment Mutual Information (DAMI) loss to ensure effective feature alignment. Specifically, the LSCR aligns the neighbor feature across spatial and channel dimensions using a lightweight convolutional layer. Subsequently, the DADS mitigates feature distribution discrepancy with encoder-specific and encoder-agnostic domain separation modules. The former removes domain-dependent information and the latter captures task-related information. During training, the DAMI loss maximizes the mutual information between aligned heterogeneous features to enhance the domain separation process.
The CoDS employs a fully convolutional architecture, which ensures high inference efficiency. Extensive experiments demonstrate that the CoDS effectively mitigates feature discrepancies in heterogeneous scenarios and achieves a trade-off between detection accuracy and inference efficiency.

[135] Beyond Pixels: A Differentiable Pipeline for Probing Neuronal Selectivity in 3D

Pavithra Elumalai,Mohammad Bashiri,Goirik Chakrabarty,Suhas Shrinivasan,Fabian H. Sinz

Main category: cs.CV

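TL;DR: Proposes a differentiable-rendering method for generating 3D stimuli, used to probe the selectivity of neurons in primate area V4 to interpretable 3D factors such as pose and lighting.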

Details Motivation: Existing approaches operate mainly on 2D pixels, which makes it hard to isolate neuronal selectivity for physical scene properties such as shape, pose, and lighting. Method: A differentiable rendering pipeline optimizes deformable meshes to generate most exciting inputs (MEIs) directly in 3D; mesh deformations are parameterized with radial basis functions, and the offsets and scales that maximize the neuronal response are learned under geometric regularization. Result: Applied to models of primate area V4, the method reveals neuronal selectivity to physically interpretable factors such as 3D pose and lighting. Conclusion: The method bridges inverse graphics and systems neuroscience, offering a way to probe neural selectivity with physically grounded 3D stimuli that goes beyond conventional pixel-based methods. Abstract: Visual perception relies on inference of 3D scene properties such as shape, pose, and lighting. To understand how visual sensory neurons enable robust perception, it is crucial to characterize their selectivity to such physically interpretable factors. However, current approaches mainly operate on 2D pixels, making it difficult to isolate selectivity for physical scene properties. To address this limitation, we introduce a differentiable rendering pipeline that optimizes deformable meshes to obtain most exciting inputs (MEIs) directly in 3D. The method parameterizes mesh deformations with radial basis functions and learns offsets and scales that maximize neuronal responses while enforcing geometric regularity. Applied to models of monkey area V4, our approach enables probing neuronal selectivity to interpretable 3D factors such as pose and lighting. This approach bridges inverse graphics with systems neuroscience, offering a way to probe neural selectivity with physically grounded, 3D stimuli beyond conventional pixel-based methods.
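As a rough illustration of the parameterization described in the abstract, the sketch below displaces mesh vertices with radial basis functions, each carrying an offset vector and a scale. The Gaussian kernel, the function name `rbf_deform`, and the toy values are assumptions made for illustration; the abstract does not specify the kernel, the regularizers, or the renderer.

```python
import math

def rbf_deform(vertices, centers, offsets, scales):
    """Displace each vertex by a sum of Gaussian-RBF-weighted offsets.

    vertices: list of (x, y, z) mesh vertex positions
    centers:  list of (x, y, z) RBF centers (hypothetical control points)
    offsets:  list of (dx, dy, dz), one learnable offset per center
    scales:   list of positive floats, one learnable width per center
    """
    deformed = []
    for v in vertices:
        disp = [0.0, 0.0, 0.0]
        for c, o, s in zip(centers, offsets, scales):
            sq_dist = sum((vi - ci) ** 2 for vi, ci in zip(v, c))
            weight = math.exp(-sq_dist / (2.0 * s * s))  # Gaussian kernel (assumed)
            for k in range(3):
                disp[k] += weight * o[k]
        deformed.append(tuple(vi + di for vi, di in zip(v, disp)))
    return deformed

# Toy example: one center at the origin pushes nearby vertices along +z.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
deformed = rbf_deform(
    vertices,
    centers=[(0.0, 0.0, 0.0)],
    offsets=[(0.0, 0.0, 1.0)],
    scales=[1.0],
)
```

In an optimization loop, the offsets and scales would be updated by gradient ascent on a model neuron's response through a differentiable renderer; that part is omitted here.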

[136] Near-Infrared Hyperspectral Imaging Applications in Food Analysis -- Improving Algorithms and Methodologies

Ole-Christian Galbo Engstrøm

Main category: cs.CV

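TL;DR: This thesis studies near-infrared hyperspectral imaging (NIR-HSI) for food quality analysis and compares convolutional neural networks (CNNs) with partial least squares (PLS). CNNs that combine spatial and spectral information perform better when modeling a range of parameters and, in particular, produce better chemical maps than PLS; PLS remains the recommended approach for analyzing mean chemical content.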

Details Motivation: Explore the potential of NIR-HSI for food quality inspection, improve modeling accuracy for chemical and physical parameters, and address the limitations of existing methods in spatial distribution prediction and model efficiency. Method: Four studies test five hypotheses and compare CNN and PLS models; a 2D CNN is augmented with a spectral convolution layer to improve performance, and two open-source Python packages were developed to speed up PLS modeling and cross-validation. Result: CNNs with joint spatio-spectral analysis outperform spatial-only and spectral-only analysis; the 2D CNN with a spectral convolution layer improves predictive performance and resolves the non-smooth and out-of-range predictions that PLS produces in chemical maps; PLS performs well for predicting mean chemical content; modeling of barley's germinative capacity was inconclusive because of the dataset's low degree of germination. Conclusion: NIR-HSI combined with CNNs suits quality analysis that needs both spatial and spectral information, with a clear advantage in chemical map generation; for mean chemical content, PLS remains the recommended method; obtaining spatially resolved reference values is still the main obstacle to spatial modeling. Abstract: This thesis investigates the application of near-infrared hyperspectral imaging (NIR-HSI) for food quality analysis. The investigation is conducted through four studies operating with five research hypotheses. For several analyses, the studies compare models based on convolutional neural networks (CNNs) and partial least squares (PLS). Generally, joint spatio-spectral analysis with CNNs outperforms spatial analysis with CNNs and spectral analysis with PLS when modeling parameters where chemical and physical visual information are relevant. When modeling chemical parameters with a 2-dimensional (2D) CNN, augmenting the CNN with an initial layer dedicated to performing spectral convolution enhances its predictive performance by learning a spectral preprocessing similar to that applied by domain experts. Still, PLS-based spectral modeling performs equally well for analysis of the mean content of chemical parameters in samples and is the recommended approach. Modeling the spatial distribution of chemical parameters with NIR-HSI is limited by the ability to obtain spatially resolved reference values. Therefore, a study used bulk mean references for chemical map generation of fat content in pork bellies. A PLS-based approach gave non-smooth chemical maps and pixel-wise predictions outside the range of 0-100%. Conversely, a 2D CNN augmented with a spectral convolution layer mitigated all issues arising with PLS. The final study attempted to model barley's germinative capacity by analyzing NIR spectra, RGB images, and NIR-HSI images.
However, the results were inconclusive due to the dataset's low degree of germination. Additionally, this thesis has led to the development of two open-sourced Python packages. The first facilitates fast PLS-based modeling, while the second facilitates very fast cross-validation of PLS and other classical machine learning models with a new algorithm.

[137] VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

Hyojun Go,Dominik Narnhofer,Goutam Bhat,Prune Truong,Federico Tombari,Konrad Schindler

Main category: cs.CV

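TL;DR: Proposes VIST3A, a framework that combines a text-to-video generation model with a 3D reconstruction system to produce high-quality text-to-3D scene generation.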

Details Motivation: Combine the strengths of visual generative models and 3D reconstruction to improve the quality and consistency of text-to-3D generation. Method: Model stitching connects a text-to-video generator to a 3D decoder, and direct reward finetuning aligns the generator with the stitched decoder. Result: All tested pairings of generators and reconstruction models markedly outperform prior methods, and the framework also supports high-quality text-to-pointmap generation. Conclusion: VIST3A provides a general and effective framework for text-to-3D generation that offers both geometric accuracy and visual quality. Abstract: The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce VIST3A, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit model stitching, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt direct reward finetuning, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.

[138] Through the Lens of Doubt: Robust and Efficient Uncertainty Estimation for Visual Place Recognition

Emily Miller,Michael Milford,Muhammad Burhan Hafez,SD Ramchurn,Shoaib Ehsan

Main category: cs.CV

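TL;DR: Proposes three training-free uncertainty metrics (SD, RS, SU) for estimating match confidence in visual place recognition (VPR). The metrics generalize across datasets and methods and add little computational cost, which makes them suitable for real-time robotic applications.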

Details Motivation: VPR systems behave inconsistently across environmental conditions, and failure-critical applications need reliable estimates of match uncertainty. Existing methods depend on additional training or expensive computation and lack generality and real-time capability. Method: Three training-free uncertainty metrics are derived from the similarity score distribution of any existing VPR method: Similarity Distribution (SD), Ratio Spread (RS), and Statistical Uncertainty (SU), which combines the two; each estimates match confidence by analyzing patterns in the scores. Result: Evaluated on nine state-of-the-art VPR methods and six benchmark datasets, the metrics consistently outperform existing approaches, discriminate well between correct and incorrect matches, and add negligible computational overhead, making them suitable for real-time deployment. Conclusion: SD, RS, and SU are general, efficient, training-free uncertainty metrics that improve the robustness and reliability of VPR systems in changing environments, particularly for critical tasks such as SLAM. Abstract: Visual Place Recognition (VPR) enables robots and autonomous vehicles to identify previously visited locations by matching current observations against a database of known places. However, VPR systems face significant challenges when deployed across varying visual environments, lighting conditions, seasonal changes, and viewpoint changes. Failure-critical VPR applications, such as loop closure detection in simultaneous localization and mapping (SLAM) pipelines, require robust estimation of place matching uncertainty. We propose three training-free uncertainty metrics that estimate prediction confidence by analyzing inherent statistical patterns in similarity scores from any existing VPR method. Similarity Distribution (SD) quantifies match distinctiveness by measuring score separation between candidates; Ratio Spread (RS) evaluates competitive ambiguity among top-scoring locations; and Statistical Uncertainty (SU) is a combination of SD and RS that provides a unified metric that generalizes across datasets and VPR methods without requiring validation data to select the optimal metric. All three metrics operate without additional model training, architectural modifications, or computationally expensive geometric verification.
Comprehensive evaluation across nine state-of-the-art VPR methods and six benchmark datasets confirms that our metrics excel at discriminating between correct and incorrect VPR matches, and consistently outperform existing approaches while maintaining negligible computational overhead, making them deployable for real-time robotic applications across varied environmental conditions with improved precision-recall performance.
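To make the idea of reading confidence directly off similarity scores concrete, here is a minimal sketch of two training-free cues computed from one query's scores. Both cues (`separation` and `ambiguity`) and the function name are hypothetical proxies chosen for illustration; the abstract does not give the formulas for SD, RS, or SU, and these are not meant to reproduce them.

```python
import statistics

def match_confidence_cues(scores):
    """Return two training-free cues from one query's similarity scores.

    scores: similarity of the query to every database place (higher = more
    similar); at least two scores are required.
    Both cues are illustrative proxies, not the paper's definitions.
    """
    ranked = sorted(scores, reverse=True)
    best, rest = ranked[0], ranked[1:]
    spread = statistics.pstdev(rest)
    # Separation of the top match from the remaining candidates, in std units.
    separation = (best - statistics.mean(rest)) / spread if spread > 0 else float("inf")
    # Ratio of runner-up to best score: values near 1 signal an ambiguous match.
    ambiguity = rest[0] / best if best > 0 else 1.0
    return separation, ambiguity

# A distinctive top match versus three near-tied candidates.
clear = match_confidence_cues([0.91, 0.40, 0.38, 0.35, 0.33])
ambiguous = match_confidence_cues([0.62, 0.61, 0.60, 0.35, 0.33])
```

A downstream system could reject or defer matches whose cues fall on the ambiguous side of a threshold, with no retraining of the underlying VPR model.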

[139] ExpressNet-MoE: A Hybrid Deep Neural Network for Emotion Recognition

Deeptimaan Banerjee,Prateek Gothwal,Ashis Kumer Biswas

Main category: cs.CV

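TL;DR: This paper proposes ExpressNet-MoE, a hybrid deep learning model that combines CNNs with a Mixture of Experts framework. Multi-scale feature extraction and adaptive expert selection improve the accuracy and generalization of facial expression recognition in challenging real-world settings.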

Details Motivation: Real-world facial emotion recognition (FER) remains difficult because of varying head poses, occlusions, illumination changes, and demographic diversity, and current models fall short in emotion recognition and engagement detection. Method: ExpressNet-MoE uses CNNs for multi-scale feature extraction, a MoE module that dynamically selects expert networks, and a residual network backbone for deep feature learning. Result: The model reports accuracies of 74.77% on AffectNet (v7), 72.55% on AffectNet (v8), 84.29% on RAF-DB, and 64.66% on FER-2013, and is compared with current state-of-the-art methods. Conclusion: ExpressNet-MoE adapts and generalizes well, is suited to end-to-end emotion recognition systems in practical settings, and its code is publicly available to support reproducibility. Abstract: In many domains, including online education, healthcare, security, and human-computer interaction, facial emotion recognition (FER) is essential. Real-world FER is still difficult despite its significance because of factors such as variable head positions, occlusions, illumination shifts, and demographic diversity. Engagement detection, which is essential for applications like virtual learning and customer services, is frequently challenging due to the FER limitations of many current models. In this article, we propose ExpressNet-MoE, a novel hybrid deep learning model that blends Convolutional Neural Networks (CNNs) with a Mixture of Experts (MoE) framework to overcome these difficulties. Our model dynamically chooses the most pertinent expert networks, which aids generalization and provides flexibility across a wide variety of datasets. Our model improves on the accuracy of emotion recognition by utilizing multi-scale feature extraction to collect both global and local facial features. ExpressNet-MoE includes numerous CNN-based feature extractors, a MoE module for adaptive feature selection, and finally a residual network backbone for deep feature learning. To demonstrate the efficacy of the proposed model, we evaluated it on several datasets and compared it with current state-of-the-art methods. Our model achieves accuracies of 74.77% on AffectNet (v7), 72.55% on AffectNet (v8), 84.29% on RAF-DB, and 64.66% on FER-2013. The results show how adaptive our model is and how it may be used to develop end-to-end emotion recognition systems in practical settings.
Reproducible code and results are publicly available at https://github.com/DeeptimaanB/ExpressNet-MoE.

[140] UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

Tiancheng Gu,Kaicheng Yang,Kaichen Zhang,Xiang An,Ziyong Feng,Yueyi Zhang,Weidong Cai,Jiankang Deng,Lidong Bing

Main category: cs.CV

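TL;DR: This paper proposes UniME-V2, a universal multimodal embedding model built on MLLMs. It constructs a potential hard negative set through global retrieval and uses an MLLM-as-a-Judge mechanism to produce soft semantic matching scores, which drive both hard negative mining and soft-label learning and so improve the model's discriminative ability. It also proposes UniME-V2-Reranker, a reranking model trained with joint pairwise and listwise optimization, and reports state-of-the-art performance across multiple tasks.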

Details Motivation: Existing universal multimodal embedding methods struggle to capture subtle semantic differences among candidates during negative mining, offer limited diversity in negative samples, and have limited ability to tell false negatives from hard negatives, all of which constrain discriminative performance. Method: A potential hard negative set is first built through global retrieval; an MLLM-as-a-Judge mechanism then assesses the semantic alignment of query-candidate pairs and produces soft semantic matching scores, which are used for hard negative mining and as soft-label training signals; finally, UniME-V2-Reranker is trained for reranking with joint pairwise and listwise optimization. Result: Experiments on the MMEB benchmark and multiple retrieval tasks show state-of-the-art average performance, with marked gains in recognizing hard negatives and in semantic discrimination. Conclusion: By drawing on the semantic understanding of MLLMs, UniME-V2 addresses the limitations of conventional multimodal embedding models in negative selection and semantic modeling, improving representation quality and overall retrieval performance. Abstract: Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach.
We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
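One way to picture the soft-label idea of aligning the model's similarity matrix with the judge's matching scores is a row-wise divergence between two softmax distributions. The sketch below is an assumption-laden illustration: the KL divergence, the temperature of 0.1, and the function names are choices made here, not details taken from the abstract.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(similarities, judge_scores, temperature=0.1):
    """KL(target || prediction) for one query over its candidate list.

    similarities: embedding similarities between the query and each candidate
    judge_scores: soft semantic matching scores from a judge model (hypothetical values)
    """
    pred = softmax(similarities, temperature)
    target = softmax(judge_scores, temperature)
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)

judge = [0.9, 0.7, 0.1]
# Same ranking and spacing as the judge: the loss is (numerically) zero.
aligned = soft_label_loss([0.8, 0.6, 0.0], judge)
# Reversed ranking: the loss is clearly positive.
misaligned = soft_label_loss([0.0, 0.6, 0.8], judge)
```

Unlike a one-hot contrastive target, this target keeps graded credit for candidates the judge considers partially relevant, which is the property the abstract attributes to soft labels.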

[141] High Semantic Features for the Continual Learning of Complex Emotions: a Lightweight Solution

Thibault Geoffroy,Gauthier Gerspacher,Lionel Prevost

Main category: cs.CV

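TL;DR: This paper proposes an incremental learning approach to complex emotion recognition based on Action Units. It mitigates catastrophic forgetting, reaches an accuracy of 0.75 on the CFEE dataset, and yields a lightweight model with a small memory footprint.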

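Details Motivation: Address catastrophic forgetting in incremental learning, which stems from features that transfer poorly between tasks, particularly when basic and then compound emotions must be learned step by step. Method: Action Units, which describe facial muscle movements, serve as non-transient, highly semantic features within an incremental learning framework that extends recognition from basic emotions to complex compound emotions. Result: The approach reaches an accuracy of 0.75 on the CFEE dataset, outperforms features extracted by shallow and deep convolutional neural networks, and produces a lightweight model with a small memory footprint. Conclusion: Action Units are robust features for incremental emotion recognition; the approach limits forgetting while keeping performance high and has potential for practical deployment. Abstract: Incremental learning is a complex process due to potential catastrophic forgetting of old tasks when learning new ones. This is mainly due to transient features that do not transfer from task to task. In this paper, we focus on complex emotion recognition. First, we learn basic emotions and then, incrementally, like humans, complex emotions. We show that Action Units, describing facial muscle movements, are non-transient, highly semantic features that outperform those extracted by both shallow and deep convolutional neural networks. Thanks to this ability, our approach achieves interesting results when incrementally learning complex, compound emotions, with an accuracy of 0.75 on the CFEE dataset, and compares favorably with state-of-the-art results. Moreover, it results in a lightweight model with a small memory footprint.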

[142] Learning Neural Parametric 3D Breast Shape Models for Metrical Surface Reconstruction From Monocular RGB Videos

Maximilian Weiherer,Antonia von Riedheim,Vanessa Brébant,Bernhard Egger,Christoph Palm

Main category: cs.CV

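TL;DR: Proposes liRBSM, a parametric 3D breast shape model based on local implicit neural representations, together with a low-cost, accessible pipeline for 3D surface reconstruction from monocular RGB video. Reconstruction error is below 2 mm, the pipeline is fast, and it is open-source.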

Details Motivation: Existing 3D breast scanning solutions are expensive or depend on specialized hardware, and there is no low-cost, accessible, high-accuracy alternative. Method: An off-the-shelf Structure-from-motion pipeline is paired with the newly proposed local implicit neural SDF model (liRBSM), which decomposes the implicit breast domain into multiple local regions anchored at anatomical landmarks, each represented by a local neural signed distance function (SDF). Result: Without specialized hardware or proprietary software, the method reconstructs high-quality breast geometry from ordinary RGB video with an error below 2 mm in under six minutes, and significantly outperforms the global implicit model iRBSM. Conclusion: liRBSM and its open-source reconstruction pipeline offer an accurate, low-cost, open, and practical solution for breast shape modeling, with broad clinical and applied potential. Abstract: We present a neural parametric 3D breast shape model and, based on this model, introduce a low-cost and accessible 3D surface reconstruction pipeline capable of recovering accurate breast geometry from a monocular RGB video. In contrast to widely used, commercially available yet prohibitively expensive 3D breast scanning solutions and existing low-cost alternatives, our method requires neither specialized hardware nor proprietary software and can be used with any device that is able to record RGB videos. The key building blocks of our pipeline are a state-of-the-art, off-the-shelf Structure-from-motion pipeline, paired with a parametric breast model for robust and metrically correct surface reconstruction. Our model, similarly to the recently proposed implicit Regensburg Breast Shape Model (iRBSM), leverages implicit neural representations to model breast shapes. However, unlike the iRBSM, which employs a single global neural signed distance function (SDF), our approach -- inspired by recent state-of-the-art face models -- decomposes the implicit breast domain into multiple smaller regions, each represented by a local neural SDF anchored at anatomical landmark positions. When incorporated into our surface reconstruction pipeline, the proposed model, dubbed liRBSM (short for localized iRBSM), significantly outperforms the iRBSM in terms of reconstruction quality, yielding more detailed surface reconstruction than its global counterpart. Overall, we find that the introduced pipeline is able to recover high-quality 3D breast geometry within an error margin of less than 2 mm.
Our method is fast (requires less than six minutes), fully transparent and open-source, and -- together with the model -- publicly available at https://rbsm.re-mic.de/local-implicit.

[143] Accelerated Feature Detectors for Visual SLAM: A Comparative Study of FPGA vs GPU

Ruiqi Ye,Mikel Luján

Main category: cs.CV

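TL;DR: This paper studies hardware-accelerated feature detectors within a Visual SLAM (V-SLAM) pipeline, comparing GPU- and FPGA-accelerated FAST, Harris, and SuperPoint detectors on modern SoCs. Non-learning-based detectors (FAST, Harris) run faster and more energy-efficiently on the GPU, whereas the learning-based SuperPoint gains up to 3.1× in run-time performance and 1.4× in energy efficiency on the FPGA. FPGA-accelerated V-SLAM achieves higher frame rates on some sequences, but the GPU version is more accurate overall. Hardware acceleration also lets global bundle adjustment run less often, improving system efficiency.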

Details Motivation: SLAM is often deployed on power-constrained platforms such as drones, so efficient feature detection is needed. GPUs are widely used to accelerate computer vision and SoCs with integrated FPGAs are increasingly available, yet no systematic comparison of GPU and FPGA feature detectors within a V-SLAM pipeline exists. Method: GPU- and FPGA-accelerated FAST, Harris, and SuperPoint detectors are implemented and compared on modern SoCs (Nvidia Jetson Orin and AMD Versal), integrated into a V-SLAM pipeline, and evaluated for run-time performance, energy efficiency, accuracy, and system-level impact. Result: For non-learning-based detectors (FAST, Harris), the GPU implementations beat the FPGA ones in both performance and energy efficiency; for the learning-based SuperPoint, the FPGA implementation improves performance by up to 3.1× and energy efficiency by up to 1.4×. FPGA-accelerated V-SLAM has a higher frame rate in 2 of 5 dataset sequences but is less accurate overall than the GPU version. Hardware acceleration reduces how often global bundle adjustment is invoked, improving efficiency without sacrificing accuracy. Conclusion: GPUs suit non-learning-based feature detectors, while FPGAs offer better performance and energy efficiency for learning-based models such as SuperPoint. FPGAs can deliver competitive V-SLAM performance in specific cases, though the GPU version is more accurate overall, and hardware-accelerated feature detection helps the efficiency of the whole V-SLAM pipeline. Abstract: Feature detection is a common yet time-consuming module in Simultaneous Localization and Mapping (SLAM) implementations, which are increasingly deployed on power-constrained platforms, such as drones. Graphics Processing Units (GPUs) have been a popular accelerator for computer vision in general, and feature detection and SLAM in particular. On the other hand, System-on-Chips (SoCs) with integrated Field Programmable Gate Array (FPGA) are also widely available. This paper presents the first study of hardware-accelerated feature detectors considering a Visual SLAM (V-SLAM) pipeline. We offer new insights by comparing the best GPU-accelerated FAST, Harris, and SuperPoint implementations against the FPGA-accelerated counterparts on modern SoCs (Nvidia Jetson Orin and AMD Versal). The evaluation shows that when using a non-learning-based feature detector such as FAST and Harris, their GPU implementations and the GPU-accelerated V-SLAM can achieve better run-time performance and energy efficiency than the FAST and Harris FPGA implementations as well as the FPGA-accelerated V-SLAM. However, when considering a learning-based detector such as SuperPoint, its FPGA implementation can achieve better run-time performance and energy efficiency (up to 3.1× and 1.4× improvements, respectively) than the GPU implementation.
The FPGA-accelerated V-SLAM can also achieve comparable run-time performance compared to the GPU-accelerated V-SLAM, with better FPS in 2 out of 5 dataset sequences. When considering the accuracy, the results show that the GPU-accelerated V-SLAM is more accurate than the FPGA-accelerated V-SLAM in general. Last but not least, the use of hardware acceleration for feature detection could further improve the performance of the V-SLAM pipeline by having the global bundle adjustment module invoked less frequently without sacrificing accuracy.

[144] Modeling Cultural Bias in Facial Expression Recognition with Adaptive Agents

David Freire-Obregón,José Salas-Cáceres,Javier Lorenzo-Navarro,Oliverio J. Santana,Daniel Hernández-Sosa,Modesto Castrillón-Santana

Main category: cs.CV

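TL;DR: Proposes an agent-based streaming benchmark for evaluating the robustness of facial expression recognition (FER) under cross-cultural composition and progressive blur. Performance degrades asymmetrically across cultural groups as blur increases, and the behavior of mixed populations depends on their composition and interaction structure.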

Details Motivation: Most FER evaluations assume homogeneous data and high-quality images and overlook cultural variation and visual degradation, so an evaluation closer to real conditions is needed. Method: An agent-based streaming system is built in which each agent operates in a frozen CLIP feature space with a lightweight residual adapter trained online, and is tested in an environment with sigma-scheduled Gaussian blur; the study varies cultural composition (monocultural, balanced mixed, imbalanced mixed) and spatial contact structure. Result: Asian (JAFFE) populations perform better at low blur but drop more sharply at intermediate blur, while Western (KDEF) populations degrade more uniformly; among mixed populations, balanced mixtures mitigate early degradation, whereas imbalanced ones amplify the majority group's weaknesses under high blur. Conclusion: Cultural composition and interaction structure significantly affect the robustness of FER systems under perceptual degradation, so models should be evaluated and tuned in diverse, dynamic environments. Abstract: Facial expression recognition (FER) must remain robust under both cultural variation and perceptually degraded visual conditions, yet most existing evaluations assume homogeneous data and high-quality imagery. We introduce an agent-based, streaming benchmark that reveals how cross-cultural composition and progressive blurring interact to shape face recognition robustness. Each agent operates in a frozen CLIP feature space with a lightweight residual adapter trained online at sigma=0 and fixed during testing. Agents move and interact on a 5x5 lattice, while the environment provides inputs with sigma-scheduled Gaussian blur. We examine monocultural populations (Western-only, Asian-only) and mixed environments with balanced (5/5) and imbalanced (8/2, 2/8) compositions, as well as different spatial contact structures. Results show clear asymmetric degradation curves between cultural groups: JAFFE (Asian) populations maintain higher performance at low blur but exhibit sharper drops at intermediate stages, whereas KDEF (Western) populations degrade more uniformly. Mixed populations exhibit intermediate patterns, with balanced mixtures mitigating early degradation, but imbalanced settings amplify majority-group weaknesses under high blur. These findings quantify how cultural composition and interaction structure influence the robustness of FER as perceptual conditions deteriorate.

[145] XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation

Huawei Sun,Zixu Wang,Xiangyuan Peng,Julius Ott,Georg Stettinger,Lorenzo Servadei,Robert Wille

Main category: cs.CV

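TL;DR: This paper proposes XD-RCDepth, a lightweight radar-camera fusion architecture for depth estimation. It cuts parameters by 29.7% while keeping accuracy comparable, and uses knowledge distillation to preserve performance under compression and improve interpretability.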

Details Motivation: Improve the robustness of depth estimation for autonomous driving in adverse conditions, and counter the performance loss that lightweight models suffer after compression. Method: The XD-RCDepth architecture is trained with two knowledge-distillation strategies: explainability-aligned distillation and depth-distribution distillation, which recasts depth regression as soft classification over discretized bins. Result: MAE is reduced by 7.97% compared with direct training, with competitive accuracy and real-time efficiency on the nuScenes and ZJU-4DRadarCam datasets. Conclusion: XD-RCDepth keeps performance high while substantially reducing model parameters, offering an efficient and interpretable solution for lightweight depth estimation. Abstract: Depth estimation remains central to autonomous driving, and radar-camera fusion offers robustness in adverse conditions by providing complementary geometric cues. In this paper, we present XD-RCDepth, a lightweight architecture that reduces the parameters by 29.7% relative to the state-of-the-art lightweight baseline while maintaining comparable accuracy. To preserve performance under compression and enhance interpretability, we introduce two knowledge-distillation strategies: an explainability-aligned distillation that transfers the teacher's saliency structure to the student, and a depth-distribution distillation that recasts depth regression as soft classification over discretized bins. Together, these components reduce the MAE by 7.97% compared with direct training and deliver competitive accuracy with real-time efficiency on the nuScenes and ZJU-4DRadarCam datasets.
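Recasting depth regression as soft classification over discretized bins can be sketched as follows. This is a minimal illustration under stated assumptions: uniformly spaced bin centers, linear interpolation between the two nearest bins, and a cross-entropy distillation term are choices made here, and the paper's actual binning and loss may differ.

```python
import math

def depth_to_soft_bins(depth, bin_centers):
    """Spread a scalar depth over its two nearest bins by linear interpolation.

    bin_centers must be sorted in increasing order; depths outside the range
    are clamped to the first or last bin.
    """
    probs = [0.0] * len(bin_centers)
    if depth <= bin_centers[0]:
        probs[0] = 1.0
        return probs
    if depth >= bin_centers[-1]:
        probs[-1] = 1.0
        return probs
    for i in range(len(bin_centers) - 1):
        lo, hi = bin_centers[i], bin_centers[i + 1]
        if lo <= depth <= hi:
            w = (depth - lo) / (hi - lo)
            probs[i], probs[i + 1] = 1.0 - w, w
            break
    return probs

def expected_depth(probs, bin_centers):
    """Recover a scalar depth as the expectation over bin centers."""
    return sum(p * c for p, c in zip(probs, bin_centers))

def cross_entropy(target, predicted, eps=1e-12):
    """Distillation term: student distribution scored against the teacher's."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

centers = [0.0, 10.0, 20.0, 30.0, 40.0]          # hypothetical bins, in meters
teacher_probs = depth_to_soft_bins(12.5, centers)  # stands in for a teacher output
student_probs = [0.05, 0.60, 0.30, 0.04, 0.01]     # hypothetical student output
loss = cross_entropy(teacher_probs, student_probs)
```

Matching a distribution over bins gives the student a richer training signal than a single regressed value, and the scalar depth remains recoverable through `expected_depth`.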

[146] Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues

Chen Chen,Kangcheng Bin,Ting Hu,Jiahao Qi,Xingyue Liu,Tianpeng Liu,Zhen Liu,Yongxiang Liu,Ping Zhong

Main category: cs.CV

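TL;DR: Proposes prompt-guided condition-aware dynamic fusion (PCDF) for around-the-clock UAV object detection on multimodal (RGB-IR) images, and releases ATR-UMOD, a high-diversity dataset.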

Details Motivation: Existing datasets cover limited imaging conditions and do not reflect real-world complexity, so a richer, more diverse dataset and a fusion method that adapts to changing conditions are needed. Method: PCDF encodes imaging conditions as text prompts and uses these condition cues to adaptively reassign the contributions of each modality; a condition-decoupling module allows it to be used when condition annotations are unavailable. Result: Experiments on the newly released ATR-UMOD dataset support the effectiveness of PCDF across varying weather, illumination, and viewing angles. Conclusion: PCDF, together with a high-diversity dataset, improves the robustness and practicality of UAV multimodal object detection in complex real-world environments. Abstract: Unmanned aerial vehicles (UAV)-based object detection with visible (RGB) and infrared (IR) images facilitates robust around-the-clock detection, driven by advancements in deep learning techniques and the availability of high-quality datasets. However, existing datasets struggle to fully capture real-world complexity because of their limited imaging conditions. To this end, we introduce a high-diversity dataset ATR-UMOD covering varying scenarios, spanning altitudes from 80m to 300m, angles from 0° to 75°, and all-day, all-year time variations in rich weather and illumination conditions. Moreover, each RGB-IR image pair is annotated with 6 condition attributes, offering valuable high-level contextual information. To meet the challenge raised by such diverse conditions, we propose a novel prompt-guided condition-aware dynamic fusion (PCDF) to adaptively reassign multimodal contributions by leveraging annotated condition cues. By encoding imaging conditions as text prompts, PCDF effectively models the relationship between conditions and multimodal contributions through a task-specific soft-gating transformation. A prompt-guided condition-decoupling module further ensures the availability in practice without condition annotations. Experiments on the ATR-UMOD dataset demonstrate the effectiveness of PCDF.

[147] AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset

Amjid Ali,Zulfiqar Ahmad Khan,Altaf Hussain,Muhammad Munsif,Adnan Hussain,Sung Wook Baik

Main category: cs.CV

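TL;DR: This paper proposes AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework, and builds VAAR, a medium-scale dataset of synchronized audio and video. Multimodal fusion and spatiotemporal modeling improve anomaly recognition in challenging environments.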

Details Motivation: Existing anomaly recognition methods rely mainly on visual information and are unreliable under occlusion, low illumination, and adverse weather; the lack of large-scale synchronized audio-visual datasets has also held back multimodal methods. Method: AVAR-Net combines audio feature extraction (Wav2Vec2), video feature extraction (MobileViT), an early fusion strategy, and a Multi-Stage Temporal Convolutional Network (MTCN) for temporal modeling; the VAAR dataset, with 3,000 real-world videos, is built for evaluation. Result: AVAR-Net reaches 89.29% accuracy on VAAR and 88.56% Average Precision on XD-Violence, a 2.8% improvement over existing methods. Conclusion: AVAR-Net makes effective use of audio-visual information to improve the robustness and generalization of anomaly recognition, and VAAR provides a useful benchmark for multimodal anomaly recognition. Abstract: Anomaly recognition plays a vital role in surveillance, transportation, healthcare, and public safety. However, most existing approaches rely solely on visual data, making them unreliable under challenging conditions such as occlusion, low illumination, and adverse weather. Moreover, the absence of large-scale synchronized audio-visual datasets has hindered progress in multimodal anomaly recognition. To address these limitations, this study presents AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework designed for real-world environments. AVAR-Net consists of four main modules: an audio feature extractor, a video feature extractor, a fusion strategy, and a sequential pattern learning network that models cross-modal relationships for anomaly recognition. Specifically, the Wav2Vec2 model extracts robust temporal features from raw audio, while MobileViT captures both local and global visual representations from video frames. An early fusion mechanism combines these modalities, and a Multi-Stage Temporal Convolutional Network (MTCN) learns long-range temporal dependencies within the fused representation, enabling robust spatiotemporal reasoning. A novel Visual-Audio Anomaly Recognition (VAAR) dataset is also introduced, serving as a medium-scale benchmark containing 3,000 real-world videos with synchronized audio across ten diverse anomaly classes. Experimental evaluations demonstrate that AVAR-Net achieves 89.29% accuracy on VAAR and 88.56% Average Precision on the XD-Violence dataset, improving Average Precision by 2.8% over existing state-of-the-art methods.
These results highlight the effectiveness, efficiency, and generalization capability of the proposed framework, as well as the utility of VAAR as a benchmark for advancing multimodal anomaly recognition research.

[148] Challenges, Advances, and Evaluation Metrics in Medical Image Enhancement: A Systematic Literature Review

Chun Wai Chin,Haniza Yazid,Hoi Leong Lee

Main category: cs.CV

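TL;DR: Following the PRISMA approach, this review systematically analyzes 39 studies on medical image enhancement. It summarizes the main challenges across imaging modalities, such as noise and low contrast, and how well existing methods address them. MRI and multi-modal imaging are the most studied, while histopathology, endoscopy, and similar modalities remain underexplored; deep learning is gaining ground but conventional methods still dominate; image quality metrics vary widely, with non-reference-based metrics used most.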

Details Motivation: Medical images often suffer from noise, artifacts, and low contrast, which limit their diagnostic value; a systematic survey of current enhancement methods and evaluation metrics is needed to establish the state of research, its limitations, and future directions. Method: A systematic literature review following the PRISMA guidelines analyzes 39 peer-reviewed papers, categorizes the enhancement techniques used (conventional, deep learning, hybrid) and their application across imaging modalities, and tallies the use of image quality assessment (IQA) metrics. Result: Low contrast and noise are the most common problems, and MRI and multi-modal imaging receive the most attention; of the 39 studies, 29 use conventional mathematical methods, 9 use deep learning, and 1 is hybrid; 65 IQA metrics appear, predominantly non-reference-based; 18 studies combine reference-based and non-reference-based metrics, 9 use only reference-based metrics, and 12 use only non-reference-based metrics. Conclusion: Medical image enhancement is still dominated by conventional methods, and the potential of deep learning is not yet fully exploited; evaluation metrics lack a common standard, so future work should build standardized evaluation and extend to less-studied modalities. Abstract: Medical image enhancement is crucial for improving the quality and interpretability of diagnostic images, ultimately supporting early detection, accurate diagnosis, and effective treatment planning. Despite advancements in imaging technologies such as X-ray, CT, MRI, and ultrasound, medical images often suffer from challenges like noise, artifacts, and low contrast, which limit their diagnostic potential. Addressing these challenges requires robust preprocessing, denoising algorithms, and advanced enhancement methods, with deep learning techniques playing an increasingly significant role. This systematic literature review, following the PRISMA approach, investigates the key challenges, recent advancements, and evaluation metrics in medical image enhancement. By analyzing findings from 39 peer-reviewed studies, this review provides insights into the effectiveness of various enhancement methods across different imaging modalities and the importance of evaluation metrics in assessing their impact. Key issues like low contrast and noise are identified as the most frequent, with MRI and multi-modal imaging receiving the most attention, while specialized modalities such as histopathology, endoscopy, and bone scintigraphy remain underexplored. Out of the 39 studies, 29 utilize conventional mathematical methods, 9 focus on deep learning techniques, and 1 explores a hybrid approach.
In terms of image quality assessment, 18 studies employ both reference-based and non-reference-based metrics, 9 rely solely on reference-based metrics, and 12 use only non-reference-based metrics, with a total of 65 IQA metrics introduced, predominantly non-reference-based. This review highlights current limitations, research gaps, and potential future directions for advancing medical image enhancement.

[149] Towards Adversarial Robustness and Uncertainty Quantification in DINOv2-based Few-Shot Anomaly Detection

Akib Mohammed Khan,Bartosz Krawczyk

Main category: cs.CV

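TL;DR: This paper studies the adversarial robustness and anomaly-score calibration of DINOv2-based few-shot anomaly detectors, finding that they are vulnerable to attack and that their scores are poorly calibrated. It attaches a lightweight linear head to enable white-box attacks and applies Platt scaling to calibrate uncertainty, providing a baseline for more trustworthy detection.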

Details Motivation: To examine whether DINOv2-based few-shot anomaly detectors are susceptible to adversarial perturbations and whether their anomaly scores reflect reliable uncertainty, in order to improve trustworthiness in safety-critical settings. Method: A lightweight linear head is attached to frozen DINOv2 features solely for crafting adversarial perturbations (FGSM), leaving test-time behavior unchanged; post-hoc Platt scaling is applied to the anomaly scores for uncertainty calibration. Result: Adversarial perturbations consistently degrade detection performance (F1, AUROC, AP, and G-mean all drop). Raw anomaly scores are poorly calibrated; after Platt scaling, calibration error (ECE) decreases, and adversarially perturbed inputs show higher predictive entropy, which can be used to flag attacks. Conclusion: Adversarial robustness and uncertainty quantification are essential for the trustworthy deployment of anomaly detection systems and should be treated as core capabilities rather than optional add-ons. Abstract: Foundation models such as DINOv2 have shown strong performance in few-shot anomaly detection, yet two key questions remain unexamined: (i) how susceptible are these detectors to adversarial perturbations; and (ii) how well do their anomaly scores reflect calibrated uncertainty? Building on AnomalyDINO, a training-free deep nearest-neighbor detector over DINOv2 features, we present one of the first systematic studies of adversarial attacks and uncertainty estimation in this setting. To enable white-box gradient attacks while preserving test-time behavior, we attach a lightweight linear head to frozen DINOv2 features only for crafting perturbations. Using this heuristic, we evaluate the impact of FGSM across the MVTec-AD and VisA datasets and observe consistent drops in F1, AUROC, AP, and G-mean, indicating that imperceptible perturbations can flip nearest-neighbor relations in feature space to induce confident misclassification. Complementing robustness, we probe reliability and find that raw anomaly scores are poorly calibrated, revealing a gap between confidence and correctness that limits safety-critical use. As a simple, strong baseline toward trustworthiness, we apply post-hoc Platt scaling to the anomaly scores for uncertainty estimation. The resulting calibrated posteriors yield significantly higher predictive entropy on adversarially perturbed inputs than on clean ones, enabling a practical flagging mechanism for attack detection while reducing calibration error (ECE).
Our findings surface concrete vulnerabilities in DINOv2-based few-shot anomaly detectors and establish an evaluation protocol and baseline for robust, uncertainty-aware anomaly detection. We argue that adversarial robustness and principled uncertainty quantification are not optional add-ons but essential capabilities if anomaly detection systems are to be trustworthy and ready for real-world deployment.
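
The calibration step described in this abstract can be illustrated with a minimal sketch. It assumes Platt scaling in its basic form (a one-dimensional logistic fit `p = sigmoid(a * score + b)`), fitted here by plain gradient descent rather than any particular solver; the function names, the toy scores, and the hyperparameters are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.5, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss.

    Assumption: plain batch gradient descent, chosen for brevity.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw anomaly score to a calibrated anomaly probability."""
    return sigmoid(a * score + b)


def binary_entropy(p):
    """Predictive entropy (in nats) of a Bernoulli posterior p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


# Toy held-out scores (hypothetical): higher score means more anomalous.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
a, b = fit_platt(toy_scores, toy_labels)
posterior = calibrate(0.55, a, b)
uncertainty = binary_entropy(posterior)
```

A flagging rule in the spirit of the abstract would compare `uncertainty` against a threshold chosen on clean data; that threshold is a design choice this sketch leaves open.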

[150] Local-Global Context-Aware and Structure-Preserving Image Super-Resolution

Sanchar Palit,Subhasis Chaudhuri,Biplab Banerjee

Main category: cs.CV

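TL;DR: Proposes an image super-resolution framework based on Local-Global Context-Aware Attention and a pixel-space conditioning mechanism, improving reconstruction quality and perceptual fidelity on heavily degraded images.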

Details Motivation: Existing super-resolution methods built on pretrained diffusion models tend to amplify noise or generate incorrect content on highly degraded images; consistency of local and global structure needs to be better preserved. Method: Introduces Local-Global Context-Aware Attention to maintain local and global pixel relationships, and designs a distribution- and perceptual-aligned conditioning mechanism in pixel space that progressively preserves and refines information from local details to global structure during restoration. Result: Achieves high-quality, high-fidelity reconstruction on multiple super-resolution benchmarks, mitigating artifacts and improving perceptual quality. Conclusion: The method is effective in complex degradation scenarios, balancing structural consistency with realistic detail and improving the performance of diffusion models on super-resolution. Abstract: Diffusion models have recently achieved significant success in various image manipulation tasks, including image super-resolution and perceptual quality enhancement. Pretrained text-to-image models, such as Stable Diffusion, have exhibited strong capabilities in synthesizing realistic image content, which makes them particularly attractive for addressing super-resolution tasks. While some existing approaches leverage these models to achieve state-of-the-art results, they often struggle when applied to diverse and highly degraded images, leading to noise amplification or incorrect content generation. To address these limitations, we propose a contextually precise image super-resolution framework that effectively maintains both local and global pixel relationships through Local-Global Context-Aware Attention, enabling the generation of high-quality images. Furthermore, we propose a distribution- and perceptual-aligned conditioning mechanism in the pixel space to enhance perceptual fidelity. This mechanism captures fine-grained pixel-level representations while progressively preserving and refining structural information, transitioning from local content details to the global structural composition. During inference, our method generates high-quality images that are structurally consistent with the original content, mitigating artifacts and ensuring realistic detail restoration. Extensive experiments on multiple super-resolution benchmarks demonstrate the effectiveness of our approach in producing high-fidelity, perceptually accurate reconstructions.

[151] EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection

Huaizhi Qu,Ruichen Zhang,Shuqing Luo,Luchao Qi,Zhihao Zhang,Xiaoming Liu,Roni Sengupta,Tianlong Chen

Main category: cs.CV

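TL;DR: This paper proposes EditCast3D, a 3D editing pipeline that uses video generation foundation models to propagate an edit from a single frame across the entire dataset and applies a view selection strategy to improve multi-view consistency, enabling efficient, high-quality 3D editing.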

Details Motivation: Image editing foundation models are hard to plug directly into 3D editing workflows: their heavy computational demands and the restrictions and costs of closed-source APIs make them impractical for iterative editing. A more efficient and scalable way to integrate foundation models into 3D editing is needed. Method: The EditCast3D pipeline uses a video generation foundation model to propagate edits from the first frame across the entire dataset, introduces a view selection strategy to identify consistent, reconstruction-friendly views, and adopts feedforward reconstruction to avoid costly refinement. Result: On commonly used 3D editing datasets, it shows superior editing quality and high efficiency compared with state-of-the-art 3D editing baselines, while minimizing reliance on expensive image editing and mitigating prompt ambiguity. Conclusion: EditCast3D offers a scalable and general paradigm for integrating foundation models into 3D editing pipelines, balancing performance and computational cost. Abstract: Recent advances in foundation models have driven remarkable progress in image editing, yet their extension to 3D editing remains underexplored. A natural approach is to replace the image editing modules in existing workflows with foundation models. However, their heavy computational demands and the restrictions and costs of closed-source APIs make plugging these models into existing iterative editing strategies impractical. To address this limitation, we propose EditCast3D, a pipeline that employs video generation foundation models to propagate edits from a single first frame across the entire dataset prior to reconstruction. While editing propagation enables dataset-level editing via video models, its consistency remains suboptimal for 3D reconstruction, where multi-view alignment is essential. To overcome this, EditCast3D introduces a view selection strategy that explicitly identifies consistent and reconstruction-friendly views and adopts feedforward reconstruction without requiring costly refinement. In combination, the pipeline both minimizes reliance on expensive image editing and mitigates prompt ambiguities that arise when applying foundation models independently across images. We evaluate EditCast3D on commonly used 3D editing datasets and compare it against state-of-the-art 3D editing baselines, demonstrating superior editing quality and high efficiency. These results establish EditCast3D as a scalable and general paradigm for integrating foundation models into 3D editing pipelines. The code is available at https://github.com/UNITES-Lab/EditCast3D

[152] OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild

Hongyu Qu,Jianan Wei,Xiangbo Shu,Yazhou Yao,Wenguan Wang,Jinhui Tang

Main category: cs.CV

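TL;DR: This paper proposes OmniGaze, a semi-supervised framework for 3D gaze estimation that leverages large-scale unlabeled data to improve cross-domain generalization.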

Details Motivation: Existing 3D gaze estimation methods generalize poorly across data domains because annotated data is scarce and insufficiently diverse. Method: Builds a diverse collection of unlabeled facial images, adopts a pseudo-labeling strategy, and designs a reward model to assess pseudo-label reliability. The reward model combines the 3D direction vectors, embeddings from a visual encoder, and semantic cues generated by a Multimodal Large Language Model to compute confidence scores, which are used to select and weight high-quality pseudo labels. Result: Achieves state-of-the-art performance on five datasets under both in-domain and cross-domain settings, and shows robust zero-shot generalization on four unseen datasets. Conclusion: By making effective use of large-scale unlabeled data, OmniGaze improves the generalization and robustness of 3D gaze estimation and can serve as a scalable data engine for gaze estimation. Abstract: Current 3D gaze estimation methods struggle to generalize across diverse data domains, primarily due to i) the scarcity of annotated datasets, and ii) the insufficient diversity of labeled data. In this work, we present OmniGaze, a semi-supervised framework for 3D gaze estimation, which utilizes large-scale unlabeled data collected from diverse and unconstrained real-world environments to mitigate domain bias and generalize gaze estimation in the wild. First, we build a diverse collection of unlabeled facial images, varying in facial appearances, background environments, illumination conditions, head poses, and eye occlusions. In order to leverage unlabeled data spanning a broader distribution, OmniGaze adopts a standard pseudo-labeling strategy and devises a reward model to assess the reliability of pseudo labels. Beyond pseudo labels as 3D direction vectors, the reward model also incorporates visual embeddings extracted by an off-the-shelf visual encoder and semantic cues from gaze perspective generated by prompting a Multimodal Large Language Model to compute confidence scores. Then, these scores are utilized to select high-quality pseudo labels and weight them for loss computation. Extensive experiments demonstrate that OmniGaze achieves state-of-the-art performance on five datasets under both in-domain and cross-domain settings. Furthermore, we also evaluate the efficacy of OmniGaze as a scalable data engine for gaze estimation, which exhibits robust zero-shot generalization on four unseen datasets.
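
The select-and-weight step for pseudo labels can be sketched as follows. This is only an illustration under stated assumptions: the reward model is not reproduced (its confidence scores are taken as given inputs), the loss is an angular error between 3D gaze vectors, and the threshold and the confidence-normalized weighting are hypothetical choices, not details from the paper.

```python
import math


def angular_error(u, v):
    """Angle in radians between two 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


def weighted_pseudo_label_loss(preds, pseudo_labels, confidences, threshold=0.5):
    """Drop low-confidence pseudo labels, then weight the rest by confidence.

    `confidences` stands in for the reward model's scores (assumed given).
    """
    kept = [
        (p, y, c)
        for p, y, c in zip(preds, pseudo_labels, confidences)
        if c >= threshold
    ]
    if not kept:
        return 0.0
    total_weight = sum(c for _, _, c in kept)
    return sum(c * angular_error(p, y) for p, y, c in kept) / total_weight


# Hypothetical batch: the second pseudo label is unreliable and is filtered out.
preds = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
pseudo = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
conf = [0.9, 0.2]
loss = weighted_pseudo_label_loss(preds, pseudo, conf)
```

Filtering before weighting keeps clearly unreliable labels from contributing at all, while the weights let moderately reliable labels contribute less than confident ones.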

[153] CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas

Zian Li,Muhan Zhang

Main category: cs.CV

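TL;DR: Proposes CanvasMAR, a masked autoregressive video model that introduces a canvas mechanism to mitigate the slow-start and error-accumulation problems in video generation.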

Details Motivation: To address the slow-start problem of video masked autoregressive models and error accumulation across the spatial and temporal dimensions. Method: Introduces a canvas, a blurred global prediction of the next frame, as the starting point for masked generation; uses compositional classifier-free guidance and noise-based canvas augmentation. Result: Experiments on BAIR and Kinetics-600 show that CanvasMAR produces high-quality videos with fewer autoregressive steps, performs strongly among autoregressive models on Kinetics-600, and rivals diffusion-based methods. Conclusion: CanvasMAR improves the efficiency and coherence of video generation and offers a new direction for autoregressive video generation. Abstract: Masked autoregressive models (MAR) have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizers. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across the autoregression in both spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism--a blurred, global prediction of the next frame, used as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance that jointly enlarges spatial (canvas) and temporal conditioning, and employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves remarkable performance among autoregressive models on the Kinetics-600 dataset and rivals diffusion-based methods.

[154] NTIRE 2025 Challenge on Low Light Image Enhancement: Methods and Results

Xiaoning Liu,Zongwei Wu,Florin-Alexandru Vasluianu,Hailong Yan,Bin Ren,Yulun Zhang,Shuhang Gu,Le Zhang,Ce Zhu,Radu Timofte,Kangbiao Shi,Yixu Feng,Tao Hu,Yu Cao,Peng Wu,Yijin Liang,Yanning Zhang,Qingsen Yan,Han Zhou,Wei Dong,Yan Min,Mohab Kishawy,Jun Chen,Pengpeng Yu,Anjin Park,Seung-Soo Lee,Young-Joon Park,Zixiao Hu,Junyv Liu,Huilin Zhang,Jun Zhang,Fei Wan,Bingxin Xu,Hongzhe Liu,Cheng Xu,Weiguo Pan,Songyin Dai,Xunpeng Yi,Qinglong Yan,Yibing Zhang,Jiayi Ma,Changhui Hu,Kerui Hu,Donghang Jing,Tiesheng Chen,Zhi Jin,Hongjun Wu,Biao Huang,Haitao Ling,Jiahao Wu,Dandan Zhan,G Gyaneshwar Rao,Vijayalaxmi Ashok Aralikatti,Nikhil Akalwadi,Ramesh Ashok Tabib,Uma Mudenagudi,Ruirui Lin,Guoxi Huang,Nantheera Anantrasirichai,Qirui Yang,Alexandru Brateanu,Ciprian Orhei,Cosmin Ancuti,Daniel Feijoo,Juan C. Benito,Álvaro García,Marcos V. Conde,Yang Qin,Raul Balmez,Anas M. Ali,Bilel Benjdira,Wadii Boulila,Tianyi Mao,Huan Zheng,Yanyan Wei,Shengeng Tang,Dan Guo,Zhao Zhang,Sabari Nathan,K Uma,A Sasithradevi,B Sathya Bama,S. Mohamed Mansoor Roomi,Ao Li,Xiangtao Zhang,Zhe Liu,Yijie Tang,Jialong Tang,Zhicheng Fu,Gong Chen,Joe Nasti,John Nicholson,Zeyu Xiao,Zhuoyuan Li,Ashutosh Kulkarni,Prashant W. Patil,Santosh Kumar Vipparthi,Subrahmanyam Murala,Duan Liu,Weile Li,Hangyuan Lu,Rixian Liu,Tengfeng Wang,Jinxing Liang,Chenxin Yu

Main category: cs.CV

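TL;DR: This paper reviews the NTIRE 2025 Low-Light Image Enhancement Challenge, evaluating the submitted solutions and final results and showcasing recent progress in the field.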

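Details Motivation: To advance low-light image enhancement by identifying effective networks that produce brighter, clearer, and visually compelling images under diverse and challenging conditions. Method: An international competition was organized to attract submissions from researchers worldwide, and the methods proposed by each team were systematically evaluated and compared. Result: 762 participants registered and 28 teams submitted valid entries, reflecting significant progress in low-light image enhancement. Conclusion: The challenge showcases the current state of the art in low-light image enhancement, demonstrates the effectiveness of several network designs, and provides a reference for future research. Abstract: This paper presents a comprehensive review of the NTIRE 2025 Low-Light Image Enhancement (LLIE) Challenge, highlighting the proposed solutions and final outcomes. The objective of the challenge is to identify effective networks capable of producing brighter, clearer, and visually compelling images under diverse and challenging conditions. A remarkable total of 762 participants registered for the competition, with 28 teams ultimately submitting valid entries. This paper thoroughly evaluates the state-of-the-art advancements in LLIE, showcasing the significant progress.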

[155] Seeing and Knowing in the Wild: Open-domain Visual Entity Recognition with Large-scale Knowledge Graphs via Contrastive Learning

Hongkuan Zhou,Lavdim Halilaj,Sebastian Monka,Stefan Schmid,Yuqicheng Zhu,Jingcheng Wu,Nadeem Nazer,Steffen Staab

Main category: cs.CV

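TL;DR: Proposes a Knowledge-guided Contrastive Learning (KnowCoL) framework that combines images, text, and structured knowledge (Wikidata) for open-domain visual entity recognition, substantially improving accuracy on rare and unseen entities on the OVEN benchmark.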

Details Motivation: Open-domain visual entity recognition faces many entities unseen during training, long-tail distributions, limited supervision, and high visual ambiguity, which conventional methods handle poorly. Method: The KnowCoL framework maps images and text descriptions into a shared semantic space grounded by structured information from Wikidata, aligning at the conceptual level through entity descriptions, type hierarchies, and relational context to support zero-shot recognition. Result: Experiments on OVEN show that combining visual, textual, and structured knowledge greatly improves accuracy; the smallest model improves accuracy on unseen entities by 10.5% over the state of the art while being 35 times smaller. Conclusion: Combining visual, textual, and structured knowledge improves open-domain entity recognition, especially for unseen and rare entities, supporting the potential of knowledge-guided zero-shot learning. Abstract: Open-domain visual entity recognition aims to identify and link entities depicted in images to a vast and evolving set of real-world concepts, such as those found in Wikidata. Unlike conventional classification tasks with fixed label sets, it operates under open-set conditions, where most target entities are unseen during training and exhibit long-tail distributions. This makes the task inherently challenging due to limited supervision, high visual ambiguity, and the need for semantic disambiguation. In this work, we propose a Knowledge-guided Contrastive Learning (KnowCoL) framework that combines both images and text descriptions into a shared semantic space grounded by structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, the model leverages entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition. We evaluate our approach on the OVEN benchmark, a large-scale open-domain visual recognition dataset with Wikidata IDs as the label space. Our experiments show that using visual, textual, and structured knowledge greatly improves accuracy, especially for rare and unseen entities. Our smallest model improves the accuracy on unseen entities by 10.5% compared to the state-of-the-art, despite being 35 times smaller.

[156] FlashWorld: High-quality 3D Scene Generation within Seconds

Xinyang Li,Tengfei Wang,Zixiao Gu,Shengchuan Zhang,Chunchao Guo,Liujuan Cao

Main category: cs.CV

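TL;DR: FlashWorld is a generative model that produces high-quality 3D Gaussian representations of a scene from a single image or text prompt in seconds, 10 to 100 times faster than previous methods, while maintaining 3D consistency and improving visual quality.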

Details Motivation: Conventional multi-view-oriented generation is slow, and 3D consistency and visual quality are hard to achieve together; a faster, higher-quality 3D generation approach is needed. Method: Proposes a 3D-oriented generation approach that combines dual-mode pre-training (supporting both multi-view-oriented and 3D-oriented generation) with cross-mode post-training distillation, leverages a video diffusion prior, and uses single-view images and text prompts to improve generalization. Result: While generating 10 to 100 times faster, the model achieves superior rendering quality with 3D consistency, requires fewer denoising steps, and generalizes better to out-of-distribution inputs. Conclusion: By integrating the strengths of multi-view-oriented and 3D-oriented generation, FlashWorld achieves fast, high-quality 3D scene generation suitable for practical use. Abstract: We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, 10~100$\times$ faster than previous works while possessing superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach where the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, the 3D-oriented method typically suffers from poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation by matching distribution from consistent 3D-oriented mode to high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the required denoising steps for inference. Also, we propose a strategy to leverage massive single-view images and text prompts during this process to enhance the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.

[157] Generating healthy counterfactuals with denoising diffusion bridge models

Ana Lawry Aguila,Peirong Liu,Marina Crespo Aguirre,Juan Eugenio Iglesias

Main category: cs.CV

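TL;DR: Proposes a new approach based on denoising diffusion bridge models (DDBMs) for generating healthy counterfactuals from pathological images, removing pathological regions while preserving individual anatomical characteristics and outperforming existing methods on segmentation and anomaly detection tasks.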

Details Motivation: Existing denoising diffusion models struggle to balance removing anomalies with preserving subject-specific anatomy when generating healthy counterfactuals, particularly when trained only on healthy data; incorporating synthetic pathological images helps, but effective guidance of the generative process is still lacking. Method: Applies denoising diffusion bridge models (DDBMs), which, unlike DDPMs that condition only on the initial point (the healthy image), also condition on the final point (a corresponding synthetically generated pathological image). The pathological image serves as a structurally informative prior that guides the diffusion process toward counterfactuals that closely match the patient's anatomy while selectively removing pathology. Result: The DDBM outperforms previously proposed diffusion models and fully supervised approaches on segmentation and anomaly detection tasks, with counterfactuals that more closely match the true anatomy and remove pathology more effectively. Conclusion: By conditioning the diffusion process on both endpoints, the DDBM improves the quality and applicability of healthy counterfactuals generated from pathological images, offering a new direction for anomaly detection and healthy reconstruction in medical image analysis. Abstract: Generating healthy counterfactuals from pathological images holds significant promise in medical imaging, e.g., in anomaly detection or for application of analysis tools that are designed for healthy scans. These counterfactuals should represent what a patient's scan would plausibly look like in the absence of pathology, preserving individual anatomical characteristics while modifying only the pathological regions. Denoising diffusion probabilistic models (DDPMs) have become popular methods for generating healthy counterfactuals of pathology data. Typically, this involves training on solely healthy data with the assumption that a partial denoising process will be unable to model disease regions and will instead reconstruct a closely matched healthy counterpart. More recent methods have incorporated synthetic pathological images to better guide the diffusion process. However, it remains challenging to guide the generative process in a way that effectively balances the removal of anomalies with the retention of subject-specific features. To solve this problem, we propose a novel application of denoising diffusion bridge models (DDBMs) - which, unlike DDPMs, condition the diffusion process not only on the initial point (i.e., the healthy image), but also on the final point (i.e., a corresponding synthetically generated pathological image). Treating the pathological image as a structurally informative prior enables us to generate counterfactuals that closely match the patient's anatomy while selectively removing pathology.
The results show that our DDBM outperforms previously proposed diffusion models and fully supervised approaches at segmentation and anomaly detection tasks.

[158] Risk-adaptive Activation Steering for Safe Multimodal Large Language Models

Jonghyun Park,Minhyuk Seo,Jonghyun Choi

Main category: cs.CV

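TL;DR: Proposes Risk-adaptive Activation Steering (RAS), which strengthens cross-modal attention at the query stage to assess the risk of multimodal inputs accurately and adaptively steers the model toward safe and helpful responses, avoiding the overhead of iterative output adjustments.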

Details Motivation: Modern AI models are vulnerable to multimodal queries with harmful intent embedded in images, while existing safety alignment methods suffer either from costly dataset curation or from slow inference and excessive refusals. Method: Queries are reformulated to strengthen cross-modal attention to safety-critical image regions, enabling risk assessment at the query level; model activations are then steered adaptively according to the assessed risk, achieving inference-time safety alignment without iterative output adjustments. Result: Experiments across multiple multimodal safety and utility benchmarks show that RAS significantly reduces attack success rates, preserves general task performance, and improves inference speed over prior inference-time defenses. Conclusion: RAS is an efficient, low-overhead inference-time alignment method that maintains safety while keeping responses helpful and inference fast. Abstract: One of the key challenges of modern AI models is ensuring that they provide helpful responses to benign queries while refusing malicious ones. But often, the models are vulnerable to multimodal queries with harmful intent embedded in images. One approach for safety alignment is training with extensive safety datasets at significant cost in both dataset curation and training. Inference-time alignment mitigates these costs, but introduces two drawbacks: excessive refusals from misclassified benign queries and slower inference speed due to iterative output adjustments. To overcome these limitations, we propose to reformulate queries to strengthen cross-modal attention to safety-critical image regions, enabling accurate risk assessment at the query level. Using the assessed risk, it adaptively steers activations to generate responses that are safe and helpful without overhead from iterative output adjustments. We call this Risk-adaptive Activation Steering (RAS). Extensive experiments across multiple benchmarks on multimodal safety and utility demonstrate that RAS significantly reduces attack success rates, preserves general task performance, and improves inference speed over prior inference-time defenses.

[159] MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion

Minjung Shin,Hyunin Cho,Sooyeon Go,Jin-Hwa Kim,Youngjung Uh

Main category: cs.CV

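TL;DR: This paper proposes MVCustom, a diffusion-based framework that jointly achieves multi-view generation and customization, addressing the shortcomings of existing methods in geometric consistency and viewpoint control.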

Details Motivation: Existing multi-view generation models lack customization, while customization models lack explicit viewpoint control, making the two hard to unify. A model that achieves both multi-view control and customization is needed. Method: In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation together with a text-to-video diffusion backbone enhanced with dense spatio-temporal attention; in the inference stage, it introduces depth-aware feature rendering and consistent-aware latent completion to enforce geometric consistency and perspective alignment. Result: Experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization, performing well in both multi-view consistency and customization fidelity. Conclusion: MVCustom unifies multi-view generation and customization, balancing geometric consistency with generation quality and offering a new direction for controllable generative models. Abstract: Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making them challenging to unify. Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts. To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation, incorporating the text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistent-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds. Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.

[160] Circle of Willis Centerline Graphs: A Dataset and Baseline Algorithm

Fabio Musio,Norman Juchler,Kaiyuan Yang,Suprosanna Shit,Chinmay Prabhakar,Bjoern Menze,Sven Hirsch

Main category: cs.CV

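TL;DR: This paper presents a method combining learning-based skeletonization with graph connection for centerline extraction and morphometric feature analysis of the Circle of Willis, and releases a dataset covering 200 stroke patients together with a baseline algorithm.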

Details Motivation: The Circle of Willis (CoW) is critical in cerebrovascular disease, but existing skeletonization methods struggle to extract reliable centerlines from its complex geometry, and public centerline datasets are scarce. Method: A thinning-based skeletonization algorithm is used to extract and curate centerline graphs and morphometric features from the TopCoW dataset; a baseline algorithm combining U-Net-based skeletonization with A* graph connection is developed and evaluated on a held-out test set for anatomical accuracy and feature robustness. Result: The baseline algorithm reconstructs graph topology with high accuracy (F1 = 1), and the average Euclidean node distance between predicted and reference graphs is below one voxel; morphometric features such as segment radius, length, and bifurcation ratios are robust (median relative error below 5%, Pearson correlation above 0.95). Conclusion: Learning-based skeletonization combined with graph connection enables anatomically plausible centerline extraction. Evaluation should go beyond voxel-level measures to anatomical accuracy and feature robustness; the released dataset and algorithm support further method development and clinical research. Abstract: The Circle of Willis (CoW) is a critical network of arteries in the brain, often implicated in cerebrovascular pathologies. Voxel-level segmentation is an important first step toward an automated CoW assessment, but a full quantitative analysis requires centerline representations. However, conventional skeletonization techniques often struggle to extract reliable centerlines due to the CoW's complex geometry, and publicly available centerline datasets remain scarce. To address these challenges, we used a thinning-based skeletonization algorithm to extract and curate centerline graphs and morphometric features from the TopCoW dataset, which includes 200 stroke patients, each imaged with MRA and CTA. The curated graphs were used to develop a baseline algorithm for centerline and feature extraction, combining U-Net-based skeletonization with A* graph connection. Performance was evaluated on a held-out test set, focusing on anatomical accuracy and feature robustness. Further, we used the extracted features to predict the frequency of fetal PCA variants, confirm theoretical bifurcation optimality relations, and detect subtle modality differences. The baseline algorithm consistently reconstructed graph topology with high accuracy (F1 = 1), and the average Euclidean node distance between reference and predicted graphs was below one voxel. Features such as segment radius, length, and bifurcation ratios showed strong robustness, with median relative errors below 5% and Pearson correlations above 0.95.
Our results demonstrate the utility of learning-based skeletonization combined with graph connection for anatomically plausible centerline extraction. We emphasize the importance of going beyond simple voxel-based measures by evaluating anatomical accuracy and feature robustness. The dataset and baseline algorithm have been released to support further method development and clinical research.
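
The A* graph-connection idea mentioned in this abstract can be illustrated on a toy problem. The sketch below is not the authors' algorithm: it assumes a 2D binary vessel mask, 4-connectivity, unit step costs, and a Manhattan-distance heuristic, all chosen only to keep the example short. It finds a shortest in-mask path joining two skeleton endpoints.

```python
import heapq


def astar(mask, start, goal):
    """Shortest 4-connected path through cells where mask[r][c] == 1.

    Returns the path as a list of (row, col) cells, or None if unreachable.
    """
    rows, cols = len(mask), len(mask[0])

    def heuristic(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] == 1:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = cell
                    priority = new_cost + heuristic((nr, nc))
                    heapq.heappush(open_heap, (priority, new_cost, (nr, nc)))
    return None


# Hypothetical vessel mask with two skeleton endpoints to be connected.
vessel_mask = [
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
]
path = astar(vessel_mask, (0, 0), (2, 3))
```

In a 3D setting the same search would run over voxels with a larger neighborhood, and step costs could be derived from segmentation confidence; both are extensions beyond this sketch.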

[161] Generative Universal Verifier as Multimodal Meta-Reasoner

Xinchen Zhang,Xiaoying Zhang,Youbin Wu,Yanbin Cao,Renrui Zhang,Ruihang Chu,Ling Yang,Yujiu Yang

Main category: cs.CV

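TL;DR: This paper proposes the Generative Universal Verifier, which aims to strengthen reflection and refinement on visual outcomes in multimodal reasoning with vision-language models. The authors build the ViVerBench benchmark, which reveals the weaknesses of existing models in visual verification; train OmniVerifier-7B; and propose OmniVerifier-TTS for iterative refinement across generation and editing, outperforming existing methods on several benchmarks.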

Details Motivation: Existing vision-language models lack reliable visual verification of generated outcomes, which hinders trustworthy, human-level multimodal reasoning; a reflection mechanism with universal visual verification capability is needed. Method: 1) Build ViVerBench, an evaluation benchmark spanning 16 task categories; 2) design automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B; 3) propose OmniVerifier-TTS, a sequential test-time scaling paradigm for iterative fine-grained optimization. Result: OmniVerifier-7B gains +8.3 on ViVerBench; OmniVerifier-TTS gains +3.7 on T2I-ReasonBench and +4.3 on GenEval++, outperforming parallel methods such as Best-of-N. Three atomic capabilities in visual verification are identified, and their synergistic generalization is demonstrated. Conclusion: A universal visual verifier strengthens reliable reflection during generation and scalable test-time refinement in multimodal models, a step toward more trustworthy and controllable next-generation reasoning systems. Abstract: We introduce Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend universal verifier to broader world-modeling interleaved reasoning scenarios.
Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods, such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
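
The difference between sequential and parallel test-time scaling can be made concrete with a small control-flow sketch. It is illustrative only: `generate`, `verify`, and `edit` are hypothetical callables standing in for a generator, a verifier, and an editor, and the toy "images" are just sets of object names. A parallel method such as Best-of-N would instead draw N independent samples and keep the best-scoring one.

```python
def sequential_refine(prompt, generate, verify, edit, max_rounds=4):
    """Generate once, then alternate verification and targeted edits.

    Returns the final output and the number of edit rounds that were used.
    """
    image = generate(prompt)
    for round_idx in range(max_rounds):
        ok, feedback = verify(prompt, image)
        if ok:
            return image, round_idx
        image = edit(image, feedback)
    return image, max_rounds


# Toy stand-ins (hypothetical): a prompt is a set of required objects.
def toy_generate(prompt):
    return {"cat"}


def toy_verify(prompt, image):
    missing = sorted(prompt - image)
    if not missing:
        return True, None
    return False, missing[0]  # feedback names one missing object


def toy_edit(image, feedback):
    return image | {feedback}


result, rounds = sequential_refine(
    {"cat", "hat", "ball"}, toy_generate, toy_verify, toy_edit
)
```

The sequential loop spends extra compute only on what the verifier flags, which is the intuition behind iterative fine-grained optimization in the abstract.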

[162] LiFMCR: Dataset and Benchmark for Light Field Multi-Camera Registration

Aymeric Fleith,Julian Zirbel,Daniel Cremers,Niclas Zeller

Main category: cs.CV

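TL;DR: LiFMCR is a new dataset for registering multiple micro lens array (MLA)-based light field cameras, providing synchronized image sequences from two high-resolution Raytrix R32 plenoptic cameras together with high-precision 6-DoF poses recorded by a Vicon system as external ground truth.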

Details Motivation: Existing light field datasets are mostly limited to single-camera setups and lack external ground truth, making it hard to evaluate multi-camera light field registration methods. LiFMCR aims to fill this gap and support rigorous evaluation of multi-camera light field systems. Method: Two complementary registration approaches are provided: a RANSAC-based 3D transformation estimation using cross-view point clouds, and a plenoptic PnP algorithm that estimates extrinsic 6-DoF poses from single light field images. Both explicitly integrate the plenoptic camera model. Result: Experiments show strong alignment between the proposed registration methods and the ground-truth poses, enabling accurate and scalable multi-camera light field registration and reliable multi-view light field processing. Conclusion: LiFMCR provides high-quality data and baseline methods for registering multi-camera light field systems, supporting progress in multi-view light field imaging. Abstract: We present LiFMCR, a novel dataset for the registration of multiple micro lens array (MLA)-based light field cameras. While existing light field datasets are limited to single-camera setups and typically lack external ground truth, LiFMCR provides synchronized image sequences from two high-resolution Raytrix R32 plenoptic cameras, together with high-precision 6-degrees of freedom (DoF) poses recorded by a Vicon motion capture system. This unique combination enables rigorous evaluation of multi-camera light field registration methods. As a baseline, we provide two complementary registration approaches: a robust 3D transformation estimation via a RANSAC-based method using cross-view point clouds, and a plenoptic PnP algorithm estimating extrinsic 6-DoF poses from single light field images. Both explicitly integrate the plenoptic camera model, enabling accurate and scalable multi-camera registration. Experiments show strong alignment with the ground truth, supporting reliable multi-view light field processing. Project page: https://lifmcr.github.io/

[163] Cyclic Self-Supervised Diffusion for Ultra Low-field to High-field MRI Synthesis

Zhenxuan Zhang,Peiyuan Jing,Zi Wang,Ula Briski,Coraline Beitone,Yue Yang,Yinzhe Wu,Fanwen Wang,Liutao Yang,Jiahao Huang,Zhifan Gao,Zhaolin Chen,Kh Tohidul Islam,Guang Yang,Peter J. Lally

Main category: cs.CV

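TL;DR: Proposes a cyclic self-supervised diffusion model (CSS-Diff) for synthesizing high-field MRI from low-field MRI; by introducing a cycle-consistency constraint, slice-wise contrastive learning, and a local structure correction network, it improves image quality while preserving anatomical consistency.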

Details Motivation: Low-field MRI is cheaper and safer but has low resolution; existing methods for synthesizing high-field MRI suffer from a clinical fidelity gap, loss of structural detail, and domain gaps in contrast, so anatomical consistency and fine-grained quality need to improve. Method: Proposes the cyclic self-supervised diffusion (CSS-Diff) framework, which applies a cycle-consistent constraint to preserve anatomy; a slice-wise gap perception network aligns inter-slice inconsistencies via contrastive learning; and a local structure correction network enhances local feature restoration through self-reconstruction of masked and perturbed patches. Result: Achieves state-of-the-art performance on cross-field synthesis tasks (PSNR: 31.80 ± 2.70 dB, SSIM: 0.943 ± 0.102, LPIPS: 0.0864 ± 0.0689), with lower anatomical structure error (e.g., left cerebral white matter from 12.1% to 2.1%, cortex from 4.2% to 3.7%). Conclusion: CSS-Diff synthesizes high-field MRI images that are both quantitatively reliable and anatomically consistent, without relying solely on paired pixel-level supervision, narrowing the fidelity gap for clinical use. Abstract: Synthesizing high-quality images from low-field MRI holds significant potential. Low-field MRI is cheaper, more accessible, and safer, but suffers from low resolution and poor signal-to-noise ratio. This synthesis process can reduce reliance on costly acquisitions and expand data availability. However, synthesizing high-field MRI still suffers from a clinical fidelity gap. There is a need to preserve anatomical fidelity, enhance fine-grained structural details, and bridge domain gaps in image contrast. To address these issues, we propose a \emph{cyclic self-supervised diffusion (CSS-Diff)} framework for high-field MRI synthesis from real low-field MRI data. Our core idea is to reformulate diffusion-based synthesis under a cycle-consistent constraint. It enforces anatomical preservation throughout the generative process rather than just relying on paired pixel-level supervision. The CSS-Diff framework further incorporates two novel processes. The slice-wise gap perception network aligns inter-slice inconsistencies via contrastive learning. The local structure correction network enhances local feature restoration through self-reconstruction of masked and perturbed patches. Extensive experiments on cross-field synthesis tasks demonstrate the effectiveness of our method, achieving state-of-the-art performance (e.g., 31.80 $\pm$ 2.70 dB in PSNR, 0.943 $\pm$ 0.102 in SSIM, and 0.0864 $\pm$ 0.0689 in LPIPS).
Beyond pixel-wise fidelity, our method also preserves fine-grained anatomical structures compared with the original low-field MRI (e.g., left cerebral white matter error drops from 12.1$\%$ to 2.1$\%$, cortex from 4.2$\%$ to 3.7$\%$). To conclude, our CSS-Diff can synthesize images that are both quantitatively reliable and anatomically consistent.

[164] Multi-Scale High-Resolution Logarithmic Grapher Module for Efficient Vision GNNs

Mustafa Munir,Alex Zhang,Radu Marculescu

Main category: cs.CV

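TL;DR: Proposes a new graph construction method, LSGC, and a novel hybrid CNN-GNN model, LogViG, whose multi-scale high-resolution architecture outperforms existing ViG, CNN, and ViT models on image classification and semantic segmentation.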

Details Motivation: Existing graph construction methods such as KNN are computationally expensive, and SVGA loses information because of its fixed step scale, so a more efficient mechanism for long-range connections is needed. Method: Proposes Logarithmic Scalable Graph Construction (LSGC) to limit the number of long-range links, and designs the LogViG model, which combines a high-resolution branch with multi-scale feature fusion. Result: Ti-LogViG reaches 79.9% top-1 accuracy on ImageNet-1K, 1.7% higher than Vision GNN, with 24.3% fewer parameters and 35.3% fewer GMACs. Conclusion: LSGC improves ViG performance, and LogViG beats current mainstream models in both accuracy and efficiency. Abstract: Vision graph neural networks (ViG) have demonstrated promise in vision tasks as a competitive alternative to conventional convolutional neural nets (CNN) and transformers (ViTs); however, common graph construction methods, such as k-nearest neighbor (KNN), can be expensive on larger images. While methods such as Sparse Vision Graph Attention (SVGA) have shown promise, SVGA's fixed step scale can lead to over-squashing and to missing multiple connections needed to gain the same information that a single long-range link could provide. Through this observation, we propose a new graph construction method, Logarithmic Scalable Graph Construction (LSGC), to enhance performance by limiting the number of long-range links. To this end, we propose LogViG, a novel hybrid CNN-GNN model that utilizes LSGC. Furthermore, inspired by the successes of multi-scale and high-resolution architectures, we introduce and apply a high-resolution branch and fuse features between our high-resolution and low-resolution branches for a multi-scale high-resolution Vision GNN network. Extensive experiments show that LogViG beats existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification and semantic segmentation tasks. Our smallest model, Ti-LogViG, achieves an average top-1 accuracy on ImageNet-1K of 79.9% with a standard deviation of 0.2%, 1.7% higher average accuracy than Vision GNN with a 24.3% reduction in parameters and 35.3% reduction in GMACs. Our work shows that leveraging long-range links in graph construction for ViGs through our proposed LSGC can exceed the performance of current state-of-the-art ViGs. Code is available at https://github.com/mmunir127/LogViG-Official.
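The abstract above does not spell out the LSGC connectivity rule. As a rough illustration of why a logarithmic scheme can limit long-range links, the sketch below assumes one possible rule (links at power-of-two offsets along a patch's row and column) and compares it with a fixed-step rule. Both functions, their names, and the 14x14 grid are hypothetical and are not taken from the paper or its repository.

```python
def log_neighbors(row, col, height, width):
    """Assumed rule: link to patches at offsets 1, 2, 4, ... along the row and column."""
    neighbors = []
    offset = 1
    while offset < max(height, width):
        for dr, dc in ((offset, 0), (-offset, 0), (0, offset), (0, -offset)):
            r, c = row + dr, col + dc
            if 0 <= r < height and 0 <= c < width:
                neighbors.append((r, c))
        offset *= 2  # doubling keeps the link count logarithmic in the grid size
    return neighbors


def fixed_step_neighbors(row, col, height, width, step):
    """Fixed-step rule for comparison: link to every `step`-th patch in the row and column."""
    neighbors = []
    offset = step
    while offset < max(height, width):
        for dr, dc in ((offset, 0), (-offset, 0), (0, offset), (0, -offset)):
            r, c = row + dr, col + dc
            if 0 <= r < height and 0 <= c < width:
                neighbors.append((r, c))
        offset += step  # constant stride makes the link count linear in the grid size
    return neighbors


log_links = log_neighbors(0, 0, 14, 14)
fixed_links = fixed_step_neighbors(0, 0, 14, 14, step=2)
print(len(log_links), len(fixed_links))
```

Under this assumed rule the corner patch of a 14x14 grid gets fewer links than with a stride of 2, and the gap widens as the grid grows.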

[165] UniCalli: A Unified Diffusion Framework for Column-Level Generation and Recognition of Chinese Calligraphy

Tianshuo Xu,Kai Wang,Zhifei Chen,Leyi Wu,Tianshui Wen,Fei Chao,Ying-Cong Chen

Main category: cs.CV

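TL;DR: Proposes UniCalli, a unified diffusion framework for column-level recognition and generation of Chinese calligraphy. Joint training yields high-quality generation and accurate recognition, and the approach performs well on several ancient scripts.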

Details Motivation: Existing methods either generate high-quality isolated characters while ignoring page-level aesthetics, or synthesize whole pages at the cost of calligraphic correctness, so they struggle to handle structure and layout together. Method: Proposes the UniCalli framework, which trains recognition and generation jointly, uses asymmetric noising and a rasterized box map to provide spatial priors, and is trained on a mix of synthetic, labeled, and unlabeled data. Result: Validated on a dataset of over 8,000 digitized pieces (about 4,000 densely annotated); generation quality reaches SOTA with better ligature continuity and layout fidelity, recognition also improves, and the framework extends to ancient scripts such as Oracle bone inscriptions and Egyptian hieroglyphs. Conclusion: By learning recognition and generation together, UniCalli models calligraphic structure and stylistic layout in one framework, addresses the limits of earlier methods under data scarcity, and generalizes across writing systems. Abstract: Computational replication of Chinese calligraphy remains challenging. Existing methods falter, either creating high-quality isolated characters while ignoring page-level aesthetics like ligatures and spacing, or attempting page synthesis at the expense of calligraphic correctness. We introduce UniCalli, a unified diffusion framework for column-level recognition and generation. Training both tasks jointly is deliberate: recognition constrains the generator to preserve character structure, while generation provides style and layout priors. This synergy fosters concept-level abstractions that improve both tasks, especially in limited-data regimes. We curated a dataset of over 8,000 digitized pieces, with ~4,000 densely annotated. UniCalli employs asymmetric noising and a rasterized box map for spatial priors, trained on a mix of synthetic, labeled, and unlabeled data. The model achieves state-of-the-art generative quality with superior ligature continuity and layout fidelity, alongside stronger recognition. The framework successfully extends to other ancient scripts, including Oracle bone inscriptions and Egyptian hieroglyphs. Code and data are available at https://github.com/EnVision-Research/UniCalli.

[166] InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue

Wenwen Tong,Hewei Guo,Dongchuan Ran,Jiangnan Chen,Jiefan Lu,Kaibin Wang,Keqiang Li,Xiaoxu Zhu,Jiakui Li,Kehan Li,Xueheng Li,Lumin Li,Chenxu Guo,Jiasheng Zhou,Jiandong Chen,Xianye Wu,Jiahao Wang,Silei Wu,Lei Chen,Hanming Deng,Yuxuan Song,Dinghao Zhou,Guiping Zhong,Ken Zheng,Shiyin Kang,Lewei Lu

Main category: cs.CV

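TL;DR: InteractiveOmni is a unified, open-source omni-modal large language model that supports audio-visual multi-turn interaction, offers comprehensive omni-modal understanding and speech generation, and leads among lightweight models.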

Details Motivation: Aims to advance lightweight omni-modal large models, enabling efficient and intelligent multi-turn audio-visual interaction and addressing existing models' weaknesses in long-term memory and multi-turn dialogue. Method: Integrates a vision encoder, audio encoder, large language model, and speech decoder into a unified architecture; uses a multi-stage training strategy (omni-modal pre-training followed by post-training on speech conversation and audio-visual interaction); and curates a high-quality multi-turn dialogue dataset to strengthen long-term interaction. Result: Performs strongly on the multi-modal multi-turn memory and multi-turn speech interaction benchmarks; InteractiveOmni-4B is comparable to Qwen2.5-Omni-7B and retains 97% of the performance of InteractiveOmni-8B at half the model size; the models reach SOTA among similarly sized models on image, audio, and video understanding and on speech generation. Conclusion: InteractiveOmni offers an efficient, open-source foundation model for next-generation intelligent interactive systems, with particular strength in multi-turn dialogue and long-term memory. Abstract: We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, ranging from 4B to 8B parameters, designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding, followed by post-training with speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model's ability to handle complex and multi-turn interactions. To effectively evaluate the multi-turn memory and speech interaction capabilities, we construct the multi-modal multi-turn memory benchmark and the multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to much larger models such as Qwen2.5-Omni-7B on general benchmarks, and it can retain 97% of the performance of InteractiveOmni-8B while utilizing only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, video understanding, and speech generation tasks, InteractiveOmni is an accessible, open-source foundation for next-generation intelligent interactive systems.

[167] RECODE: Reasoning Through Code Generation for Visual Question Answering

Junhong Shen,Mu Cai,Bo Hu,Ameet Talwalkar,David A Ross,Cordelia Schmid,Alireza Fathi

Main category: cs.CV

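TL;DR: Proposes the RECODE framework, which reverse-engineers structured visuals such as charts into executable code to enable verifiable visual reasoning, improving the accuracy and verifiability of multimodal large models on chart understanding tasks.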

Details Motivation: Multimodal large language models struggle to reason precisely about structured visuals such as charts because pixel-based perception has no verification mechanism. Method: Proposes the RECODE framework, which uses derendering: it first generates multiple candidate programs that reproduce the input image, then uses a critic to select the most faithful reconstruction and iteratively refines the code. Result: On visual reasoning benchmarks including CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not use code or that use code only for auxiliary drawing. Conclusion: Grounding visual perception in executable code offers a new path toward more accurate and verifiable multimodal reasoning. Abstract: Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering (the process of reverse-engineering visuals into executable code) as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculations and logical inferences later on. On various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or only use code for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.

[168] Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark

Kai Zou,Ziqi Huang,Yuhao Dong,Shulin Tian,Dian Zheng,Hongbo Liu,Jingwen He,Bin Liu,Yu Qiao,Ziwei Liu

Main category: cs.CV

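TL;DR: Presents Uni-MMMU, a comprehensive, discipline-aware benchmark for evaluating the bidirectional synergy between visual understanding and generation in unified multimodal models, covering eight reasoning-centric domains including science, coding, mathematics, and puzzles.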

Details Motivation: Existing benchmarks rarely test the true integration of visual understanding and generation; they usually evaluate the two in isolation or overlook tasks that require both, so a new benchmark is needed to systematically assess the bidirectional synergy of unified multimodal models. Method: Designs the Uni-MMMU benchmark with bidirectionally coupled tasks across eight domains, requiring models either to use understanding to guide visual generation or to use generation to support analytical reasoning, and provides verifiable intermediate steps, unique ground truths, and a reproducible scoring protocol. Result: Extensive evaluation of state-of-the-art unified, generation-only, and understanding-only models reveals substantial performance disparities and cross-modal dependencies. Conclusion: Uni-MMMU provides a reliable foundation for advancing unified multimodal models and shows when and how understanding and generation reinforce one another. Abstract: Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. Existing evaluations either treat the two abilities in isolation or overlook tasks that inherently couple them. To address this gap, we present Uni-MMMU, a comprehensive and discipline-aware benchmark that systematically unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains, including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, demanding models to (i) leverage conceptual understanding to guide precise visual synthesis, or (ii) utilize generation as a cognitive scaffold for analytical reasoning. Uni-MMMU incorporates verifiable intermediate reasoning steps, unique ground truths, and a reproducible scoring protocol for both textual and visual outputs. Through extensive evaluation of state-of-the-art unified, generation-only, and understanding-only models, we reveal substantial performance disparities and cross-modal dependencies, offering new insights into when and how these abilities reinforce one another, and establishing a reliable foundation for advancing unified models.

[169] Scaling Vision Transformers for Functional MRI with Flat Maps

Connor Lane,Daniel Z. Kaplan,Tanishq Mathew Abraham,Paul S. Scotti

Main category: cs.CV

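TL;DR: Converts 4D fMRI data into videos of 2D flat maps and trains Vision Transformers on large-scale fMRI data with the spatiotemporal masked autoencoder (MAE) framework. Performance improves with data volume following a power law, and downstream classification shows strong cross-subject state decoding and subject-specific trait decoding.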

Details Motivation: How to represent fMRI data so that it suits modern deep learning architectures is a key question, in particular how to bridge the modality gap between fMRI and natural images. Method: Transforms 4D fMRI data into videos of 2D fMRI activity flat maps and trains Vision Transformers with the spatiotemporal masked autoencoder (MAE) framework, self-supervised, on 2.3K hours of fMRI data from the Human Connectome Project. Result: Masked fMRI modeling performance improves with dataset size according to a strict power law; downstream classification tasks show that the model supports fine-grained brain state decoding across subjects and subject-specific trait decoding. Conclusion: The study supports converting fMRI data to video form and pretraining Vision Transformers on it at scale, advances foundation models for fMRI, and releases code and data as part of an open science project. Abstract: A key question for adapting modern deep learning architectures to functional MRI (fMRI) is how to represent the data for model input. To bridge the modality gap between fMRI and natural images, we transform the 4D volumetric fMRI data into videos of 2D fMRI activity flat maps. We train Vision Transformers on 2.3K hours of fMRI flat map videos from the Human Connectome Project using the spatiotemporal masked autoencoder (MAE) framework. We observe that masked fMRI modeling performance improves with dataset size according to a strict power scaling law. Downstream classification benchmarks show that our model learns rich representations supporting both fine-grained state decoding across subjects, as well as subject-specific trait decoding across changes in brain state. This work is part of an ongoing open science project to build foundation models for fMRI data. Our code and datasets are available at https://github.com/MedARC-AI/fmri-fm.
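To make the power scaling law claim concrete: a relationship of the form loss = a * N^(-b) becomes a straight line in log-log space, so the exponent can be estimated with ordinary least squares. The sketch below is a generic illustration of that fit. The function name and the data points are synthetic assumptions, not values reported by the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by least squares on (log size, log loss)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic points generated from a known law, only to check the fit recovers it.
sizes = [100, 1000, 10000]
losses = [2.0 * n ** -0.25 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(round(a, 3), round(b, 3))
```

On real measurements the points will not lie exactly on a line, and the residuals in log-log space indicate how strictly the law holds.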

[170] Adaptive Visual Conditioning for Semantic Consistency in Diffusion-Based Story Continuation

Seyed Mohammad Mousavi,Morteza Analoui

Main category: cs.CV

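TL;DR: Proposes the AVC framework, which uses an adaptive visual conditioning mechanism in diffusion models for story image continuation, improving semantic consistency and visual fidelity.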

Details Motivation: The core challenge in story image continuation is using prior visual context effectively while keeping semantic alignment with the current text input. Method: Uses the CLIP model to retrieve the most semantically aligned image from previous frames and, when no sufficiently relevant image exists, adaptively restricts visual influence to the early stages of the diffusion process; also re-captions the dataset with large language models to strengthen supervision. Result: Experiments show that AVC outperforms strong baselines in both quantitative metrics and human evaluation, especially in difficult cases where prior visuals conflict with the current input. Conclusion: AVC balances the use of visual context against semantic alignment and achieves better coherence, consistency, and generation quality in story image continuation. Abstract: Story continuation focuses on generating the next image in a narrative sequence so that it remains coherent with both the ongoing text description and the previously observed images. A central challenge in this setting lies in utilizing prior visual context effectively, while ensuring semantic alignment with the current textual input. In this work, we introduce AVC (Adaptive Visual Conditioning), a framework for diffusion-based story continuation. AVC employs the CLIP model to retrieve the most semantically aligned image from previous frames. Crucially, when no sufficiently relevant image is found, AVC adaptively restricts the influence of prior visuals to only the early stages of the diffusion process. This enables the model to exploit visual context when beneficial, while avoiding the injection of misleading or irrelevant information. Furthermore, we improve data quality by re-captioning a noisy dataset using large language models, thereby strengthening textual supervision and semantic alignment. Quantitative results and human evaluations demonstrate that AVC achieves superior coherence, semantic consistency, and visual fidelity compared to strong baselines, particularly in challenging cases where prior visuals conflict with the current input.
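The retrieve-then-gate logic described in the abstract can be sketched as follows. This is a toy illustration only: plain Python lists stand in for CLIP embeddings, and the similarity threshold, the early-stage fraction, and the function names are hypothetical choices, since the abstract does not give the actual values or interface.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def choose_conditioning(text_emb, frame_embs, total_steps,
                        threshold=0.5, early_fraction=0.3):
    """Pick the best-matching previous frame and how many denoising steps it may influence.

    threshold and early_fraction are assumed values for illustration.
    """
    sims = [cosine(text_emb, frame) for frame in frame_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    if sims[best] >= threshold:
        visual_steps = total_steps  # relevant frame: condition throughout
    else:
        # weak match: restrict visual conditioning to the early steps only
        visual_steps = round(total_steps * early_fraction)
    return best, visual_steps


relevant = choose_conditioning([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1]], total_steps=50)
irrelevant = choose_conditioning([1.0, 0.0], [[0.0, 1.0], [0.1, 1.0]], total_steps=50)
print(relevant, irrelevant)
```

In the second call no previous frame aligns well with the text, so the visual condition is applied for only the first 15 of 50 steps.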

[171] NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models

Nir Goren,Oren Katzir,Abhinav Nakarmi,Eyal Ronen,Mahmood Sharif,Or Patashnik

Main category: cs.CV

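TL;DR: Proposes NoisePrints, a lightweight watermarking scheme that uses the random seed of the initial noise in diffusion generation as proof of authorship. It needs no access to model weights and does not modify the generation process, making it efficient, secure, and scalable.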

Details Motivation: As diffusion models are widely adopted for visual content generation, protecting copyright and proving authorship become critical; when model owners keep their models private, a third-party verification mechanism that does not need model weights is required. Method: Uses the random seed behind the initial noise as the watermark; a hash function is incorporated into noise sampling so that recovering a valid seed from the content is infeasible; cryptographic zero-knowledge proofs allow ownership to be proven without revealing the seed. Result: Experiments show efficient verification on multiple state-of-the-art image and video diffusion models and robustness to various content manipulations; verification needs only the output and the seed, not the model weights. Conclusion: NoisePrints offers a practical, secure, and scalable copyright protection scheme suited to third-party watermark verification for private diffusion models. Abstract: With the rapid adoption of diffusion models for visual content generation, proving authorship and protecting copyright have become critical. This challenge is particularly important when model owners keep their models private and may be unwilling or unable to handle authorship issues, making third-party verification essential. A natural solution is to embed watermarks for later verification. However, existing methods require access to model weights and rely on computationally heavy procedures, rendering them impractical and non-scalable. To address these challenges, we propose NoisePrints, a lightweight watermarking scheme that utilizes the random seed used to initialize the diffusion process as a proof of authorship without modifying the generation process. Our key observation is that the initial noise derived from a seed is highly correlated with the generated visual content. By incorporating a hash function into the noise sampling process, we further ensure that recovering a valid seed from the content is infeasible. We also show that sampling an alternative seed that passes verification is infeasible, and demonstrate the robustness of our method under various manipulations. Finally, we show how to use cryptographic zero-knowledge proofs to prove ownership without revealing the seed. By keeping the seed secret, we increase the difficulty of watermark removal. In our experiments, we validate NoisePrints on multiple state-of-the-art diffusion models for images and videos, demonstrating efficient verification using only the seed and output, without requiring access to model weights.
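The idea of verifying authorship from a seed and an output alone can be illustrated with a toy example. The sketch assumes a simplified setting that is not the paper's pipeline: the seed is hashed with SHA-256 to drive a Gaussian sampler, the "content" is just a distorted copy of the initial noise (standing in for whatever representation of the real output is compared against the noise), and verification is a cosine-similarity threshold. All names, the vector size, and the threshold are hypothetical.

```python
import hashlib
import math
import random


def noise_from_seed(seed: bytes, size: int):
    """Derive Gaussian noise from a hashed seed, so the noise does not reveal the seed."""
    digest = hashlib.sha256(seed).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def verify(seed: bytes, content, threshold=0.3):
    """Accept the claim if noise regenerated from the seed correlates with the content."""
    return cosine(noise_from_seed(seed, len(content)), content) >= threshold


# Toy "generation": the content keeps a correlation with its initial noise.
size = 4096
true_noise = noise_from_seed(b"author-secret-seed", size)
distortion = random.Random(12345)
content = [x + distortion.gauss(0.0, 1.0) for x in true_noise]

owner_ok = verify(b"author-secret-seed", content)
forger_ok = verify(b"some-other-seed", content)
print(owner_ok, forger_ok)
```

Only the seed and the content enter `verify`; no generator weights are involved, which is the property the abstract emphasizes.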

[172] Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

Yi Zhang,Bolin Ni,Xin-Sheng Chen,Heng-Rui Zhang,Yongming Rao,Houwen Peng,Qinglin Lu,Han Hu,Meng-Hao Guo,Shi-Min Hu

Main category: cs.CV

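TL;DR: Introduces the Honey-Data-15M dataset and the HoneyPipe data curation pipeline. High-quality supervised fine-tuning data and a dual-level chain-of-thought enrichment strategy substantially improve fully open multimodal large language models. Bee-8B, trained on this dataset, matches or exceeds recent semi-open models on several benchmarks, underlining the role of data quality in fully open MLLMs.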

Details Motivation: Fully open multimodal large language models lag well behind proprietary models in the quality of their supervised fine-tuning data, and in particular lack high-quality complex reasoning data such as chain-of-thought, which limits their capabilities; a high-quality, scalable open SFT dataset and a transparent, reusable data curation framework are therefore needed. Method: Proposes Honey-Data-15M, an SFT dataset of about 15 million QA pairs, processed with multiple cleaning techniques and enriched with a dual-level (short and long) chain-of-thought (CoT) strategy; designs the HoneyPipe data pipeline and its underlying framework DataStudio to support transparent, adaptable data curation; trains the 8B-parameter Bee-8B model on the dataset for validation. Result: Bee-8B sets a new SOTA for fully open MLLMs across multiple benchmarks, with performance competitive with, and in some cases better than, recent semi-open models such as InternVL3.5-8B; ablations confirm the value of data cleaning and the dual-level CoT strategy. Conclusion: High-quality supervised fine-tuning data is a key driver for fully open multimodal large language models; by releasing the dataset, pipeline, training recipes, and model weights, this work gives the community a reproducible and improvable base for MLLM development. Abstract: Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

[173] Reasoning in Space via Grounding in the World

Yiming Chen,Zekun Qi,Wenyao Zhang,Xin Jin,Li Zhang,Peidong Liu

Main category: cs.CV

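TL;DR: Presents GS-Reasoner, the first 3D LLM to achieve autoregressive 3D visual grounding without external modules. A dual-path pooling mechanism builds a unified semantic-geometric representation, and the GCoT dataset integrates grounding into the spatial reasoning process, improving performance.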

Details Motivation: Existing 3D LLMs lack a unified representation that jointly captures semantic and geometric information, which leads to weak grounding or heavy reliance on external modules and hinders the integration of grounding with spatial reasoning. Method: Proposes a dual-path pooling mechanism that aligns geometric features with semantic and positional cues to build a unified image patch-based 3D representation; builds the GS-Reasoner model and introduces the GCoT dataset to support end-to-end autoregressive grounding and reasoning. Result: GS-Reasoner performs strongly on 3D visual grounding, reaching performance comparable to state-of-the-art models without external modules, and achieves SOTA on spatial reasoning tasks, supporting joint modeling of grounding and reasoning. Conclusion: A unified 3D representation and training data that treats grounding as a core step are key to better 3D spatial reasoning; GS-Reasoner offers a self-contained, integrated solution for the field. Abstract: In this paper, we claim that 3D visual grounding is the cornerstone of spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore the effective spatial representations that bridge the gap between them. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency is manifested either in poor performance on grounding or in an excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without increasing the number of input tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM that achieves autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is meticulously curated to include both 3D bounding box annotations for objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.

[174] Trace Anything: Representing Any Video in 4D via Trajectory Fields

Xinhang Liu,Yuxi Xiao,Donny Y. Chen,Jiashi Feng,Yu-Wing Tai,Chi-Keung Tang,Bingyi Kang

Main category: cs.CV

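TL;DR: Proposes a new video representation, the Trajectory Field, which models each pixel's 3D motion over time as a continuous trajectory, and the Trace Anything neural network, which predicts the entire trajectory field in one pass for efficient and accurate dynamics modeling.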

Details Motivation: Capturing pixel-level continuous dynamics in video more effectively requires a dense representation that expresses spatio-temporal change in a unified way. Method: Proposes the trajectory field representation, parameterizing each pixel's 3D trajectory with B-spline control points, and designs the Trace Anything network to predict all trajectories in a single forward pass. Result: Reaches SOTA on a newly built trajectory field estimation benchmark, performs competitively on point-tracking benchmarks, is efficient because it needs no iterative optimization, and shows emergent abilities such as goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion. Conclusion: Trace Anything offers a unified, efficient framework for modeling video dynamics and a strong representational basis for future video understanding and generation tasks. Abstract: Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion. Project page: https://trace-anything.github.io/.
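Since the abstract describes each pixel's trajectory as a B-spline defined by predicted control points, a small evaluator shows how a handful of control points yields a 3D position at any query time. The sketch assumes a uniform cubic B-spline over normalized time t in [0, 1]; the degree, knot layout, and number of control points are assumptions, as the abstract does not specify them.

```python
def eval_trajectory(control_points, t):
    """Evaluate a uniform cubic B-spline through 3D control points at time t in [0, 1]."""
    n = len(control_points)
    if n < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")
    u = t * (n - 3)              # position measured in spline segments
    i = min(int(u), n - 4)       # index of the first control point of the segment
    s = u - i                    # local parameter within the segment, in [0, 1]
    basis = (
        (1 - s) ** 3 / 6,
        (3 * s**3 - 6 * s**2 + 4) / 6,
        (-3 * s**3 + 3 * s**2 + 3 * s + 1) / 6,
        s**3 / 6,
    )
    return tuple(
        sum(b * control_points[i + k][axis] for k, b in enumerate(basis))
        for axis in range(3)
    )


# Hypothetical control points for one pixel: evenly spaced along the x axis.
points = [(float(k), 0.0, 0.0) for k in range(5)]
start = eval_trajectory(points, 0.0)
middle = eval_trajectory(points, 0.5)
end = eval_trajectory(points, 1.0)
print(start, middle, end)
```

Because the four basis weights always sum to one, the curve stays inside the convex hull of its control points, which is one reason a compact control-point set gives a smooth, bounded trajectory.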

[175] VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models

Dominick Reilly,Manish Kumar Govind,Le Xue,Srijan Das

Main category: cs.CV

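TL;DR: Proposes VisCoP, a vision contextualized probing method that adds learnable visual probes to the vision encoder, enabling efficient domain adaptation of large vision-language models across views, modalities, and tasks while retaining source-domain knowledge.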

Details Motivation: Large vision-language models degrade sharply on new domains whose distribution differs substantially from the pretraining data, and existing domain adaptation methods suffer from limited domain-specific feature learning or catastrophic forgetting. Method: Proposes VisCoP, which adds a compact set of learnable visual probes to the VLM's vision encoder to enable efficient domain-specific adaptation with minimal modification to pretrained parameters. Result: Across three challenging domain adaptation settings (exocentric to egocentric, RGB to depth, human understanding to robot control), VisCoP consistently outperforms existing methods, achieving better target-domain performance while retaining source-domain capability. Conclusion: VisCoP is an efficient and robust domain adaptation framework that improves generalization of large vision-language models to new domains with minimal parameter changes and mitigates catastrophic forgetting. Abstract: Large Vision-Language Models (VLMs) excel at general visual reasoning tasks but exhibit sharp performance degradation when applied to novel domains with substantial distribution shifts from pretraining data. Existing domain adaptation approaches finetune different VLM components, but this often results in limited domain-specific feature learning or catastrophic forgetting of prior capabilities. To address these issues, we introduce Vision Contextualized Probing (VisCoP), which augments the VLM's vision encoder with a compact set of learnable visual probes. These probes enable efficient domain-specific adaptation with minimal modification to pretrained parameters. We evaluate VisCoP across three challenging domain adaptation settings: cross-view (exocentric to egocentric), cross-modal (RGB to depth), and cross-task (human understanding to robot control). Experiments show that VisCoP consistently outperforms existing adaptation strategies, achieving superior performance on target domains while effectively retaining source-domain knowledge.

[176] PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning

Sihui Ji,Xi Chen,Xin Tao,Pengfei Wan,Hengshuang Zhao

Main category: cs.CV

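TL;DR: Proposes PhysMaster, which improves the physics-awareness of video generation models through physical representation learning and reinforcement learning.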

Details Motivation: Existing video generation models often fail to follow physical laws, so their outputs lack physical plausibility. Method: Designs PhysEncoder to extract physical information from the input image and optimizes the physical representation with reinforcement learning from human feedback using Direct Preference Optimization (DPO). Result: PhysMaster's effectiveness and generalization are demonstrated on a proxy task and across a range of physical scenarios. Conclusion: PhysMaster can serve as a generic plug-in that improves the physical consistency of video generation models, with broad application potential. Abstract: Video generation models nowadays are capable of generating visually realistic videos, but often fail to adhere to physical laws, limiting their ability to generate physically plausible videos and serve as "world models". To address this issue, we propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness. Specifically, PhysMaster is based on the image-to-video task where the model is expected to predict physically plausible dynamics from the input image. Since the input image provides physical priors like relative positions and potential interactions of objects in the scenario, we devise PhysEncoder to encode physical information from it as an extra condition to inject physical knowledge into the video generation process. The lack of proper supervision on the model's physical performance beyond mere appearance motivates PhysEncoder to apply reinforcement learning with human feedback to physical representation learning, which leverages feedback from generation models to optimize physical representations with Direct Preference Optimization (DPO) in an end-to-end manner. PhysMaster provides a feasible solution for improving physics-awareness of PhysEncoder and thus of video generation, proving its ability on a simple proxy task and generalizability to wide-ranging physical scenarios. This implies that our PhysMaster, which unifies solutions for various physical processes via representation learning in the reinforcement learning paradigm, can act as a generic and plug-in solution for physics-aware video generation and broader applications.
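For readers unfamiliar with DPO, the standard per-pair objective can be written in a few lines. The sketch below shows only the generic DPO loss on scalar log-probabilities; how PhysMaster builds preference pairs of videos and backpropagates into PhysEncoder is not described in the abstract, and the function name, argument names, and beta value here are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_preferred, policy_logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = (policy_logp_preferred - ref_logp_preferred) - (
        policy_logp_rejected - ref_logp_rejected
    )
    # -log(sigmoid(x)) equals log(1 + exp(-x)); fine for the small toy values used here.
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: no preference learned yet, loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy raises the preferred sample and lowers the rejected one: loss decreases.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(neutral, improved)
```

The loss falls as the policy assigns relatively more probability to the preferred sample than the frozen reference does, which is the signal that would favor physically plausible generations in a preference pair.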