cs.CL
[1] MedPI: Evaluating AI Systems in Medical Patient-facing Interactions
Diego Fajardo V.,Oleksii Proniakin,Victoria-Elisabeth Gruber,Razvan Marinescu
Main category: cs.CL
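TL;DR: MedPI is a high-dimensional benchmark for evaluating large language models in patient-clinician conversations. It covers 105 dimensions, and the evaluation shows that current LLMs perform poorly in areas such as differential diagnosis.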
Details
Motivation: Existing medical QA benchmarks are mostly single-turn and cannot fully assess the complex interaction in patient-clinician dialogue, so a finer-grained, accreditation-aligned, multi-dimensional evaluation framework is needed.
Method: MedPI is a five-layer framework: Patient Packets, an embodied AI Patient (with memory and affect), a Task Matrix (encounter reason x encounter objective), a 105-dimension evaluation rubric mapped to ACGME competencies, and a committee of calibrated LLM judges that provide scores and rationales.
Result: Nine mainstream LLMs were tested across 366 AI Patients and 7,097 conversations; all models scored low on multiple dimensions, with differential diagnosis especially weak.
Conclusion: MedPI systematically exposes the shortcomings of current LLMs in clinical dialogue and can help guide the future use and development of LLMs for diagnosis and treatment recommendations.
Abstract: We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates the medical dialogue across 105 dimensions comprising the medical process, treatment safety, treatment outcomes and doctor-patient communication across a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) Patient Packets (synthetic EHR-like ground truth); (2) an AI Patient instantiated through an LLM with memory and affect; (3) a Task Matrix spanning encounter reasons (e.g. anxiety, pregnancy, wellness checkup) x encounter objectives (e.g. diagnosis, lifestyle advice, medication advice); (4) an Evaluation Framework with 105 dimensions on a 1-4 scale mapped to the Accreditation Council for Graduate Medical Education (ACGME) competencies; and (5) AI Judges that are calibrated, committee-based LLMs providing scores, flags, and evidence-linked rationales. We evaluate 9 flagship models -- Claude Opus 4.1, Claude Sonnet 4, MedGemma, Gemini 2.5 Pro, Llama 3.3 70b Instruct, GPT-5, GPT OSS 120b, o3, Grok-4 -- across 366 AI Patients and 7,097 conversations using a standardized "vanilla clinician" prompt. For all LLMs, we observe low performance across a variety of dimensions, in particular on differential diagnosis. Our work can help guide future use of LLMs for diagnosis and treatment recommendations.
[2] RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
Keerthana Murugaraj,Salima Lamsiyah,Martin Theobald
Main category: cs.CL
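TL;DR: This paper presents RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of retrieval-augmented generation (RAG) systems. It decomposes RAG behavior into several fine-grained metrics with structured explanations, supports both manual and automated evaluation, and offers good transparency and practicality.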
Details
Motivation: Existing RAG metrics often collapse complex behavior into a single score, which makes it hard to locate the source of errors (retrieval, reasoning, or grounding) and leaves little transparency or fine-grained analysis.
Method: RAGVUE decomposes RAG performance into core dimensions: retrieval quality, answer relevance and completeness, claim-level faithfulness, and judge calibration. Each metric comes with a structured explanation, and the framework supports agentic automated evaluation and interactive tooling (API, CLI, Streamlit).
Result: Experiments show that RAGVUE surfaces fine-grained failure modes that existing tools such as RAGAS miss, giving it stronger diagnostic power; open-source code supports integration into research and practical development.
Conclusion: RAGVUE is a transparent, modular, and extensible RAG evaluation framework that provides fine-grained, explainable feedback to help improve the design and debugging of RAG systems.
Abstract: Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval, reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub.
[3] Automatic Construction of Chinese Verb Collostruction Database
Xuri Tang,Daohuan Liu
Main category: cs.CL
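TL;DR: This paper proposes a fully unsupervised method for building a Chinese verb collostruction database. Clustering algorithms generate verb collostructions with functional independence and graded typicality from a large-scale corpus, and the approach outperforms large language models on grammatical error correction.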
Details
Motivation: To compensate for the weaknesses of LLMs in applications that require interpretability and transparent rules by supplying explicit, interpretable linguistic rules.
Method: A verb collostruction is formally defined as a kind of directed acyclic graph, and a series of clustering algorithms generate collostructions for a given verb from sentences retrieved from a large-scale corpus.
Result: Statistical analysis shows that the generated collostructions exhibit functional independence and graded typicality; on verb grammatical error correction, a correction algorithm based on maximum matching with collostructions outperforms LLMs.
Conclusion: The method effectively builds a Chinese verb collostruction database and offers a complementary tool for NLP tasks that require interpretability.
Abstract: This paper proposes a fully unsupervised approach to the construction of a verb collostruction database for the Chinese language, aimed at complementing LLMs by providing explicit and interpretable rules for application scenarios where explanation and interpretability are indispensable. The paper formally defines a verb collostruction as a projective, rooted, ordered, and directed acyclic graph and employs a series of clustering algorithms to generate collostructions for a given verb from a list of sentences retrieved from a large-scale corpus. Statistical analysis demonstrates that the generated collostructions possess the design features of functional independence and graded typicality. Evaluation with verb grammatical error correction shows that the error correction algorithm based on maximum matching with collostructions achieves better performance than LLMs.
[4] Attribute-Aware Controlled Product Generation with LLMs for E-commerce
Virginia Negri,Víctor Martínez Gómez,Sergio A. Balanya,Subburam Rajaram
Main category: cs.CL
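TL;DR: Proposes an LLM-based method for generating synthetic e-commerce product data, with three strategies for controlled modification, that effectively improves product information extraction when data is scarce.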
Details
Motivation: E-commerce lacks high-quality labeled data, which limits the training and performance of product information extraction models.
Method: Using an LLM with attribute-aware prompts, the authors design three synthetic data generation strategies (attribute-preserving modification, controlled negative example generation, and systematic attribute removal) and enforce store constraints to keep the data consistent.
Result: On the MAVE dataset, synthetic data reaches 60.5% accuracy, close to real data (60.8%) and well above the zero-shot baseline (13.4%); human evaluation finds the synthetic data highly natural with valid attribute values; mixing synthetic and real data raises accuracy further to 68.8%.
Conclusion: The framework offers an efficient, practical data augmentation solution for e-commerce information extraction, especially in low-resource settings.
Abstract: Product information extraction is crucial for e-commerce services, but obtaining high-quality labeled datasets remains challenging. We present a systematic approach for generating synthetic e-commerce product data using Large Language Models (LLMs), introducing a controlled modification framework with three strategies: attribute-preserving modification, controlled negative example generation, and systematic attribute removal. Using a state-of-the-art LLM with attribute-aware prompts, we enforce store constraints while maintaining product coherence. Human evaluation of 2000 synthetic products demonstrates high effectiveness, with 99.6% rated as natural, 96.5% containing valid attribute values, and over 90% showing consistent attribute usage. On the public MAVE dataset, our synthetic data achieves 60.5% accuracy, performing on par with real training data (60.8%) and significantly improving upon the 13.4% zero-shot baseline. Hybrid configurations combining synthetic and real data further improve performance, reaching 68.8% accuracy. Our framework provides a practical solution for augmenting e-commerce datasets, particularly valuable for low-resource scenarios.
[5] Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems
Zihan Gao,Mohsin Y. K. Yousufi,Jacob Thebault-Spieker
Main category: cs.CL
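TL;DR: This paper proposes Collective Narrative Grounding, a participatory protocol that turns community stories into structured narrative units and integrates them into AI systems under community governance, addressing the knowledge blind spots and epistemic injustice of large language models on local questions.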
Details
Motivation: LLMs often fail on community-specific questions because they lack local knowledge, which leads to marginalization and epistemic injustice; a knowledge integration mechanism that incorporates community narratives and is governed by the community is therefore needed.
Method: Through three participatory mapping workshops with 24 community members, the authors designed elicitation methods and a schema that preserve narrative richness, and developed a framework covering entity, time, and place extraction, validation, and provenance control. They also audited a county-level benchmark of 14,782 local QA pairs and built a participatory QA test set from the workshop material.
Result: In the benchmark, 76.7% of errors stem from factual gaps, cultural misunderstandings, geographic confusions, or temporal misalignments; on the participatory QA set, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context; the collected narratives often contain the missing facts.
Conclusion: The study offers a systematic solution, comprising a taxonomy, a participatory protocol, and an evaluation method, that lays a foundation for retrieval-first, provenance-visible, locally governed QA systems and advances fairer, community-grounded AI.
Abstract: Large language model (LLM) question-answering systems often fail on community-specific queries, creating "knowledge blind spots" that marginalize local voices and reinforce epistemic injustice. We present Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. Learning from three participatory mapping workshops with N=24 community members, we designed elicitation methods and a schema that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. To scope the problem, we audit a county-level benchmark of 14,782 local information QA pairs, where factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a participatory QA set derived from our workshops, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context, underscoring the need for local grounding. The missing facts often appear in the collected narratives, suggesting a direct path to closing the dominant error modes for narrative items. Beyond the protocol and pilot, we articulate key design tensions, such as representation and power, governance and control, and privacy and consent, providing concrete requirements for retrieval-first, provenance-visible, locally governed QA systems. Together, our taxonomy, protocol, and participatory evaluation offer a rigorous foundation for building community-grounded AI that better answers local questions.
[6] TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation
Anas Ezzakri,Nicola Piovesan,Mohamed Sana,Antonio De Domenico,Fadhel Ayed,Haozhe Zhang
Main category: cs.CL
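TL;DR: This paper introduces TeleTables, a benchmark for evaluating how well large language models understand and reason over tables in telecom standards such as 3GPP specifications.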
Details
Motivation: Prior studies show that LLMs perform poorly on telecom technical documents that are dense with tables, especially in understanding table content, so their performance on this task needs dedicated evaluation and improvement.
Method: TeleTables is built with a multi-stage data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions, yielding a dataset of 500 human-verified question-answer pairs.
Result: Small models with fewer than 10B parameters do poorly at recalling 3GPP knowledge and interpreting tables; larger models reason better over tables, but domain-specific fine-tuning is still needed overall.
Conclusion: TeleTables exposes the limits of current LLMs in understanding and reasoning over telecom standard tables and underscores the importance of domain-specific fine-tuning.
Abstract: Large Language Models (LLMs) are increasingly explored in the telecom industry to support engineering tasks, accelerate troubleshooting, and assist in interpreting complex technical documents. However, recent studies show that LLMs perform poorly on telecom standards, particularly 3GPP specifications. We argue that a key reason is that these standards densely include tables to present essential information, yet the LLM knowledge and interpretation ability of such tables remains largely unexamined. To address this gap, we introduce TeleTables, a benchmark designed to evaluate both the implicit knowledge LLMs have about tables in technical specifications and their explicit ability to interpret them. TeleTables is built through a novel multi-stage data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions. The resulting dataset, which is publicly available, comprises 500 human-verified question-answer pairs, each associated with the corresponding table in multiple formats. Our evaluation shows that smaller models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables, indicating the limited exposure to telecom standards in their pretraining and the insufficient inductive biases for navigating complex technical material. Larger models, on the other hand, show stronger reasoning on table interpretation. Overall, TeleTables highlights the need for domain-specialized fine-tuning to reliably interpret and reason over telecom standards.
[7] FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback
Xueqing Wu,Zihan Xue,Da Yin,Shuyan Zhou,Kai-Wei Chang,Nanyun Peng,Yeming Wen
Main category: cs.CL
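TL;DR: FronTalk is a benchmark for front-end code generation that explores conversational code generation with multi-modal feedback. It proposes an agent-based evaluation framework, shows that models tend to forget earlier instructions across turns and struggle to interpret visual feedback, and introduces AceCoder, which substantially mitigates forgetting.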
Details
Motivation: Existing work has not systematically explored multi-turn conversational code generation combined with multi-modal (text plus visual) feedback in front-end development; in particular, the role of visual artifacts such as sketches and screenshots in the interaction is under-explored.
Method: The authors build FronTalk, a dataset of 100 multi-turn dialogues in which each turn provides a textual instruction and an equivalent visual instruction; design an automatic evaluation framework based on a web agent that simulates user behavior to assess functional correctness and user experience; and propose AceCoder, which uses an autonomous agent to check and correct the implementation of past instructions at every turn to mitigate forgetting.
Result: Evaluating 20 models reveals severe forgetting and difficulty interpreting visual feedback; AceCoder reduces forgetting to nearly zero and improves task performance by up to 9.3% (from 56.0% to 65.3%).
Conclusion: FronTalk provides a new research foundation for multi-turn, multi-modal front-end code generation, exposes key challenges, and advances the field through a new evaluation framework and the AceCoder baseline.
Abstract: We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework leveraging a web agent to simulate users and explore the website, and thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at https://github.com/shirley-wu/frontalk
[8] STDD: Spatio-Temporal Dynamics-Driven Token Refinement in Diffusion Language Models
Xinhao Sun,Maoliang Li,Zihao Zheng,Jiayu Chen,Hezhao Xu,Yun Liang,Xiang Chen
Main category: cs.CL
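TL;DR: Proposes a new remasking method that dynamically detects each token's temporal variance and spatial deviance to adaptively adjust the confidence threshold, significantly improving the runtime efficiency of diffusion language models.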
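As a rough illustration of the per-token thresholding idea in the TL;DR above, the sketch below computes a temporal variance and a spatial deviance for each token and raises the confidence threshold for unstable tokens. The names `adaptive_thresholds` and `remask`, the neighbor-based form of the deviance, the weights `alpha` and `beta`, and the `base` and `floor` values are all assumptions made for illustration; this entry does not give the paper's actual formulas, so this is a toy sketch and not the authors' method.

```python
import numpy as np

def adaptive_thresholds(conf_history, base=0.9, floor=0.5, alpha=0.5, beta=0.5):
    """Return one confidence threshold per token (illustrative only).

    conf_history: array of shape (steps, tokens) holding each token's
    confidence over the most recent denoising steps.
    """
    conf = np.asarray(conf_history, dtype=float)
    latest = conf[-1]

    # Temporal variance: how much a token's confidence moved across steps,
    # scaled to [0, 1] by the largest variance in the sequence.
    temporal_var = conf.var(axis=0)
    temporal_var = temporal_var / (temporal_var.max() + 1e-8)

    # Spatial deviance (assumed form): gap between a token's confidence and
    # the mean confidence of its left and right neighbors.
    padded = np.pad(latest, 1, mode="edge")
    neighbor_mean = (padded[:-2] + padded[2:]) / 2.0
    spatial_dev = np.abs(latest - neighbor_mean)

    # Unstable tokens get a threshold near `base`; stable ones near `floor`.
    instability = np.clip(alpha * temporal_var + beta * spatial_dev, 0.0, 1.0)
    return floor + (base - floor) * instability

def remask(conf_history, **kwargs):
    """Boolean mask of tokens whose latest confidence is below their threshold."""
    latest = np.asarray(conf_history, dtype=float)[-1]
    return latest < adaptive_thresholds(conf_history, **kwargs)
```

In this toy setup, a token whose confidence oscillates between steps receives a higher threshold than its steady neighbors and is therefore more likely to be remasked, while steady tokens can be committed earlier. A single global threshold would treat all of them the same.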
Details
Motivation: Mainstream remasking strategies rely on a single global confidence threshold and ignore the temporal and spatial dynamics of individual tokens, causing redundant iterations and limited parallelism.
Method: The method dynamically detects each token's temporal variance and spatial deviance, uses these signals to reflect its convergence status and inter-token correlations, and adaptively adjusts the confidence threshold for every token at every step.
Result: Experiments on mainstream datasets show that the method significantly improves the runtime efficiency of diffusion language models, with speedups of up to 8.9x while preserving generation quality.
Conclusion: Through adaptive token-level confidence control, the method improves the generation efficiency and parallelism of diffusion language models and gives DLMs a finer-grained remasking mechanism.
Abstract: Unlike autoregressive language models, diffusion language models (DLMs) generate text by iteratively denoising all token positions in parallel. At each timestep, the remasking strategy of a DLM selects low-priority tokens to defer their decoding, thereby improving both efficiency and output quality. However, mainstream remasking strategies rely on a single global confidence threshold, overlooking the temporal and spatial dynamics of individual tokens. Motivated by the redundant iterations and constrained parallelism introduced by fixed-threshold remasking, we propose a novel remasking approach that dynamically detects Temporal Variance and Spatial Deviance of each token, which reflect its convergence status and inter-token correlations. Using these signals, our method adaptively adjusts the confidence threshold for every token at every step. Empirical results show that our approach significantly improves the operational efficiency of DLMs across mainstream datasets, achieving speedups of up to 8.9 times while faithfully preserving generation quality.
[9] Enhancing Admission Inquiry Responses with Fine-Tuned Models and Retrieval-Augmented Generation
Aram Virabyan
Main category: cs.CL
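TL;DR: Proposes an AI system that combines a fine-tuned language model with retrieval-augmented generation (RAG) to improve response speed and information accuracy for university admissions inquiries.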
Details
Motivation: University admissions offices handle a high volume of inquiries while needing to maintain response quality, and standard RAG tends to produce contextually inadequate responses in complex, narrow domains.
Method: A hybrid approach combines a language model fine-tuned for the admissions domain with RAG, and the response generation logic is tuned to balance quality and speed.
Result: The system interprets RAG-provided information more accurately, generates high-quality responses suited to admissions, and performs better on response time and accuracy.
Conclusion: Combining fine-tuning with RAG effectively improves AI question answering in specialized narrow domains and suits demanding admissions communication.
Abstract: University admissions offices face the significant challenge of managing high volumes of inquiries efficiently while maintaining response quality, which critically impacts prospective students' perceptions. This paper addresses the issues of response time and information accuracy by proposing an AI system integrating a fine-tuned language model with Retrieval-Augmented Generation (RAG). While RAG effectively retrieves relevant information from large datasets, its performance in narrow, complex domains like university admissions can be limited without adaptation, potentially leading to contextually inadequate responses due to the intricate rules and specific details involved. To overcome this, we fine-tuned the model on a curated dataset specific to admissions processes, enhancing its ability to interpret RAG-provided data accurately and generate domain-relevant outputs. This hybrid approach leverages RAG's ability to access up-to-date information and fine-tuning's capacity to embed nuanced domain understanding. We further explored optimization strategies for the response generation logic, experimenting with settings to balance response quality and speed, aiming for consistently high-quality outputs that meet the specific requirements of admissions communications.
[10] Ideology as a Problem: Lightweight Logit Steering for Annotator-Specific Alignment in Social Media Analysis
Wei Xia,Haowen Tang,Luozheng Li
Main category: cs.CL
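TL;DR: This paper proposes a lightweight linear probe that computes a bias score from internal features and directly adjusts the LLM's output probabilities, aligning the model with a user's political ideology at low cost without harming its original reasoning ability.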
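The probe-then-adjust idea in the TL;DR above can be sketched in a few lines: a linear probe maps a hidden state to a scalar bias score, and the output logits are shifted against that score before the softmax. The function `steer_logits`, the probe parameters `probe_w` and `probe_b`, the per-label `direction` vector, and the `strength` knob are all hypothetical names chosen for this sketch; the entry does not specify how the paper parameterizes the correction.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(np.asarray(x, dtype=float) - np.max(x))
    return z / z.sum()

def steer_logits(hidden, logits, probe_w, probe_b, direction, strength=1.0):
    """Shift output logits against a probe-measured bias (illustrative only).

    hidden:    1-D hidden-state vector taken from the model.
    probe_w/b: weights and bias of an already trained linear probe (assumed).
    direction: per-label vector saying which labels the bias favors (assumed).
    """
    # Linear probe: project the hidden state to a scalar bias score.
    bias_score = float(np.dot(hidden, probe_w) + probe_b)
    # Minimal correction: move the logits opposite to the measured bias.
    steered = np.asarray(logits, dtype=float) - strength * bias_score * np.asarray(direction, dtype=float)
    return steered, bias_score

# Hypothetical two-label example; all numbers are made up.
steered, score = steer_logits(
    hidden=np.array([1.0, 0.0]),
    logits=[1.0, 1.0],
    probe_w=np.array([2.0, 0.0]),
    probe_b=0.0,
    direction=[1.0, -1.0],
    strength=0.25,
)
probs = softmax(steered)
```

Because only the final logits change, the model weights stay untouched, which is the sense in which this kind of correction avoids retraining.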
Details
Motivation: The internal structure of political ideology in LLMs is systematically and model-specifically misaligned with the human ideological space, and needs to be corrected effectively without retraining.
Method: A lightweight linear probe computes a bias score from the model's internal features and directly adjusts the final output probabilities at the output layer, making a minimal correction to the model's output.
Result: The method quantifies and corrects the ideological bias in LLM outputs; the bias is model-specific and measurable, and the model's original reasoning ability is preserved.
Conclusion: Without retraining, a simple and efficient linear adjustment aligns an LLM with user opinions, offering a practical, low-cost solution for personalized ideological alignment.
Abstract: LLMs internally organize political ideology along low-dimensional structures that are partially, but not fully, aligned with human ideological space. This misalignment is systematic, model-specific, and measurable. We introduce a lightweight linear probe that both quantifies the misalignment and minimally corrects the output layer. This paper introduces a simple and efficient method for aligning models with specific user opinions. Instead of retraining the model, we calculate a bias score from its internal features and directly adjust the final output probabilities. This solution is practical and low-cost and preserves the original reasoning power of the model.
[11] LLMs for Explainable Business Decision-Making: A Reinforcement Learning Fine-Tuning Approach
Xiang Cheng,Wen Wang,Anindya Ghose
Main category: cs.CL
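TL;DR: This paper proposes LEXMA, a reinforcement-learning-based fine-tuning framework that uses LLMs to generate natural-language explanations that serve multiple audiences, stay faithful to the decision, and match each audience's style. Applied to mortgage approval, it shows superior performance.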
Details
Motivation: Existing explainable AI methods mostly rely on post hoc numerical feature attributions, which lack a coherent narrative and fail to serve different audiences; they also depend on large amounts of human-annotated explanations, which is costly and scales poorly.
Method: LEXMA combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO), which separately optimize decision correctness and audience-specific explanation style, using reinforcement learning reward signals that do not require human-annotated explanations.
Result: On mortgage approval, LEXMA significantly improves predictive performance over other LLM baselines; human evaluation shows that its expert-facing explanations are more risk-focused and its consumer-facing explanations are clearer, more actionable, and more polite.
Conclusion: LEXMA offers a low-cost, systematic LLM fine-tuning approach that produces high-quality, audience-appropriate natural-language explanations and supports large-scale deployment of transparent AI systems.
Abstract: Artificial Intelligence (AI) models increasingly drive high-stakes consumer interactions, yet their decision logic often remains opaque. Prevailing explainable AI techniques rely on post hoc numerical feature attributions, which fail to provide coherent narratives behind model decisions. Large language models (LLMs) present an opportunity to generate natural-language explanations, but three design challenges remain unresolved: explanations must be both decision-correct and faithful to the factors that drive the prediction; they should be able to serve multiple audiences without shifting the underlying decision rule; and they should be trained in a label-efficient way that does not depend on large corpora of human-scored explanations. To address these challenges, we introduce LEXMA (LLM-based EXplanations for Multi-Audience decisions), a reinforcement-learning-based fine-tuning framework that produces narrative-driven, audience-appropriate explanations. LEXMA combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO). Specifically, it fine-tunes two separate parameter sets to improve decision correctness and satisfy stylistic requirements for different audiences, using reward signals that do not rely on human-annotated explanations. We instantiate LEXMA in the context of mortgage approval decisions. Results demonstrate that LEXMA yields significant improvements in predictive performance compared with other LLM baselines. Moreover, human evaluations show that expert-facing explanations generated by our approach are more risk-focused, and consumer-facing explanations are clearer, more actionable, and more polite. Our study contributes a cost-efficient, systematic LLM fine-tuning approach to enhance explanation quality for business decisions, offering strong potential for scalable deployment of transparent AI systems.
[12] Leveraging Language Models and RAG for Efficient Knowledge Discovery in Clinical Environments
Seokhwan Ko,Donghyeon Lee,Jaewoo Chun,Hyungsoo Han,Junghwan Cho
Main category: cs.CL
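TL;DR: This study develops and evaluates a retrieval-augmented generation (RAG) system, based on PubMed publications, that recommends research collaborators within a medical institution under local deployment.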
Details
Motivation: Strict data privacy and network security requirements in hospitals mean that sensitive information must be processed on local infrastructure, so LLM applications that run locally are needed.
Method: The system uses PubMedBERT to generate domain-specific embeddings and a locally deployed LLaMA3 model for generative synthesis, forming a RAG framework that supports biomedical knowledge discovery.
Result: The system makes effective use of the institution's research output to recommend research collaborators accurately, demonstrating the feasibility of pairing a lightweight LLM with a specialized encoder under local deployment.
Conclusion: Combining a domain-specialized encoder with a lightweight LLM supports knowledge discovery and research collaboration in medical settings while keeping data secure.
Abstract: Large language models (LLMs) are increasingly recognized as valuable tools across the medical environment, supporting clinical, research, and administrative workflows. However, strict privacy and network security regulations in hospital settings require that sensitive data be processed within fully local infrastructures. Within this context, we developed and evaluated a retrieval-augmented generation (RAG) system designed to recommend research collaborators based on PubMed publications authored by members of a medical institution. The system utilizes PubMedBERT for domain-specific embedding generation and a locally deployed LLaMA3 model for generative synthesis. This study demonstrates the feasibility and utility of integrating domain-specialized encoders with lightweight LLMs to support biomedical knowledge discovery under local deployment constraints.
[13] Complexity Agnostic Recursive Decomposition of Thoughts
Kaleem Ullah Qasim,Jiashu Zhang,Hafiz Saif Ur Rehman
Main category: cs.CL
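TL;DR: This paper proposes CARD, a framework that estimates problem complexity in advance and adapts its decomposition strategy accordingly, improving multi-step reasoning accuracy while significantly reducing compute cost.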
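The abstract for this entry reports per-step thought budgets of 1, 5-9, or 10 thoughts. A minimal sketch of how a complexity score might be mapped onto those three tiers is shown below. The function names `thought_budget` and `allocate`, the assumption that the estimator outputs a score in [0, 1], and the cut-offs `low` and `high` are all hypothetical; the entry does not state how the tiers are actually selected.

```python
def thought_budget(complexity, low=0.3, high=0.7):
    """Map a complexity score in [0, 1] to a number of thoughts (illustrative).

    Tiers follow the budgets named in the abstract: 1, 5-9, or 10.
    The cut-offs `low` and `high` are assumptions made for this sketch.
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be in [0, 1]")
    if complexity < low:
        return 1  # easy step: a single thought
    if complexity >= high:
        return 10  # hard step: the full budget
    # Middle tier: spread scores in [low, high) across 5..9 thoughts.
    fraction = (complexity - low) / (high - low)
    return min(9, 5 + int(fraction * 5))

def allocate(step_scores):
    """Assign a thought budget to every step of a decomposed problem."""
    return [thought_budget(score) for score in step_scores]

# Hypothetical per-step scores for a four-step decomposition.
budgets = allocate([0.1, 0.3, 0.5, 0.9])  # [1, 5, 7, 10]
```

Spending one thought on easy steps and the full budget only on hard ones is where the token savings over a fixed budget would come from in a scheme like this.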
Details
Motivation: LLMs often underperform on multi-step reasoning because they use a fixed reasoning strategy that ignores differences in problem difficulty.
Method: CARD comprises MRCE (Multi-dimensional Reasoning Complexity Estimator) and a two-stage recursive solver: it first performs hierarchical decomposition based on the task profile, then allocates a thought budget to each step via recursive MRCE profiling.
Result: On GSM8K, accuracy reaches 81.4% to 89.2% with token cost reduced by 1.88x to 2.40x; on MATH-500, accuracy is 75.1% to 86.8% with 1.71x to 5.74x fewer tokens.
Conclusion: Estimating problem complexity in advance improves both reasoning accuracy and efficiency, supporting the advantage of adaptive decomposition.
Abstract: Large language models often fail on multi-step reasoning due to fixed reasoning strategies that ignore problem-specific difficulty. We introduce CARD (Complexity Agnostic Recursive Decomposition), a framework that predicts problem complexity before generation and adapts decomposition accordingly. Our system comprises MRCE (Multi-dimensional Reasoning Complexity Estimator), a 0.6B Qwen model predicting 30 fine-grained features from question text, and a two-stage recursive solver: (1) hierarchical decomposition into K steps based on task profile and (2) per-step thought budget allocation (1, 5-9, or 10 thoughts) via recursive MRCE profiling. Evaluated on three reasoning models (Qwen3-0.6B, DeepSeek-R1-Distill-Qwen-1.5B, Qwen3-1.7B), CARD achieves 81.4% to 89.2% accuracy on GSM8K while reducing token cost by 1.88x to 2.40x compared to fixed decomposition baselines. On MATH-500, CARD reaches 75.1% to 86.8% accuracy using 1.71x to 5.74x fewer tokens. Our results demonstrate that preemptive complexity estimation enables both higher accuracy and significant efficiency gains.
[14] Qwerty AI: Explainable Automated Age Rating and Content Safety Assessment for Russian-Language Screenplays
Nikita Zmanovskii
Main category: cs.CL
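TL;DR: Qwerty AI is an end-to-end system for automated age rating and content safety assessment of Russian-language screenplays under Russian Federal Law No. 436-FZ, supporting fine-grained violation detection and explainable age ratings.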
Details
Motivation: To address real editorial challenges in screenplay content review in the Russian media industry by providing an automated, efficient, and explainable age-rating solution that complies with Federal Law No. 436-FZ.
Method: A fine-tuned Phi-3-mini model (4-bit quantized) performs screenplay segmentation, detection of five categories of violations (violence, sexual content, profanity, substances, frightening elements), and age rating, under the constraints of no external API calls, an 80GB VRAM limit, and processing within 5 minutes.
Result: The system processes screenplays of up to 700 pages in under 2 minutes, achieves 80% age-rating accuracy and 80-95% segmentation precision (format-dependent), and is deployed on Yandex Cloud with CUDA acceleration.
Conclusion: Qwerty AI achieves high-performing content safety assessment under strict resource limits and demonstrates feasibility in real production workflows, particularly for automated compliance review of Russian-language film and television content.
Abstract: We present Qwerty AI, an end-to-end system for automated age-rating and content-safety assessment of Russian-language screenplays according to Federal Law No. 436-FZ. The system processes full-length scripts (up to 700 pages in under 2 minutes), segments them into narrative units, detects content violations across five categories (violence, sexual content, profanity, substances, frightening elements), and assigns age ratings (0+, 6+, 12+, 16+, 18+) with explainable justifications. Our implementation leverages a fine-tuned Phi-3-mini model with 4-bit quantization, achieving 80% rating accuracy and 80-95% segmentation precision (format-dependent). The system was developed under strict constraints: no external API calls, 80GB VRAM limit, and <5 minute processing time for average scripts. Deployed on Yandex Cloud with CUDA acceleration, Qwerty AI demonstrates practical applicability for production workflows. We achieved these results during the Wink hackathon (November 2025), where our solution addressed real editorial challenges in the Russian media industry.
[15] TrueBrief: Faithful Summarization through Small Language Models
Kumud Lakara,Ruibo Shi,Fran Silavong
Main category: cs.CL
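TL;DR: This paper proposes TrueBrief, a framework that improves the faithfulness of small language models on text summarization through a preference-optimization paradigm; its core is generating synthetic preference data via controlled hallucination injection.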
Details
Motivation: LLMs are prone to hallucination, which limits their use in safety-critical domains, so the faithfulness of text generated by small language models needs to improve.
Method: TrueBrief is an end-to-end framework with a data generation module that injects hallucinations in a controlled way to produce preference data, which is then used to train small language models with preference optimization.
Result: The study examines how data quality and model size affect preference optimization and identifies the conditions under which the method improves faithfulness.
Conclusion: TrueBrief effectively improves the faithfulness of small language models in text summarization and offers a viable path to reliable small models.
Abstract: Large language models (LLMs) have exhibited remarkable proficiency in generating high-quality text; however, their propensity for producing hallucinations poses a significant challenge for their deployment in security-critical domains. In this work, we present TrueBrief, an end-to-end framework specifically designed to enhance the faithfulness of small LLMs (SLMs) primarily for the task of text summarization through a preference-optimization paradigm. Central to our framework is a data generation module that facilitates controlled hallucination injection to generate synthetic preference data. Our work provides insights into the impact of data quality and model size on preference-based optimization, highlighting the conditions under which these methods are most effective.
[16] AnimatedLLM: Explaining LLMs with Interactive Visualizations
Zdeněk Kasner,Ondřej Dušek
Main category: cs.CL
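TL;DR: AnimatedLLM is an interactive web application that runs in the browser and provides step-by-step visualizations of a Transformer language model, using pre-computed traces of open LLMs on manually curated inputs, to support NLP teaching and self-study.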
Details
Motivation: Teaching materials on large language models (LLMs) are scarce even as LLMs grow more important in NLP education, so an intuitive tool that shows the model's internal mechanics is needed.
Method: The authors built AnimatedLLM, an interactive web application that uses pre-computed traces from open LLMs on manually curated inputs to visualize a Transformer model step by step in the browser.
Result: AnimatedLLM visualizes the model in the browser without running a model locally and provides clear teaching views suited to classroom use and self-study.
Conclusion: AnimatedLLM fills a gap in LLM teaching resources and is an effective educational tool for understanding the Transformer architecture.
Abstract: Large language models (LLMs) are becoming central to natural language processing education, yet materials showing their mechanics are sparse. We present AnimatedLLM, an interactive web application that provides step-by-step visualizations of a Transformer language model. AnimatedLLM runs entirely in the browser, using pre-computed traces of open LLMs applied to manually curated inputs. The application is available at https://animatedllm.github.io, both as a teaching aid and for self-educational purposes.
[17] From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning
Xiaoyu Xu,Minxin Du,Zitong Li,Zi Liang,Zhibiao Guo,Shiyu Zhang,Peizhao Hu,Qingqing Ye,Haibo Hu
Main category: cs.CL
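TL;DR: This paper proposes BiForget, a framework that uses seed-guided and adversarial prompting to generate high-quality forget sets from the target model itself. It supports machine unlearning at two granularities, domain-level and instance-level, and markedly improves the balance between forgetting and preserving model performance.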
Details
Motivation: Existing machine unlearning benchmarks often fail to reflect the "forgetting scope" the model has actually learned, and they rely on external generators to produce forget data, so the data does not match the model's internal knowledge distribution and evaluation of unlearning becomes less accurate.
Method: The paper formalizes two unlearning granularities, domain-level and instance-level, and proposes BiForget, which uses the target model itself, through seed-guided and adversarial prompting, to elicit data consistent with its internal knowledge distribution and thereby synthesize high-quality forget sets automatically.
Result: Experiments on several benchmarks show that BiForget performs well on relevance, diversity, and efficiency; in the Harry Potter domain, relevance improves by about 20 and diversity by about 0.05 over existing methods while the total data size is halved.
Conclusion: BiForget enables more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.
Abstract: Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the true "forgetting scope" learned by the model. We formalize two distinct unlearning granularities, domain-level and instance-level, and propose BiForget, an automated framework for synthesizing high-quality forget sets. Unlike prior work relying on external generators, BiForget exploits the target model per se to elicit data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that it achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, in the Harry Potter domain, it improves relevance by ~20 and diversity by ~0.05 while halving the total data size compared to SOTAs. Ultimately, it facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.
[18] RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation
Joseph James,Chenghao Xiao,Yucheng Li,Nafise Sadat Moosavi,Chenghua Lin
Main category: cs.CL
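TL;DR: This paper proposes RIGOURATE, a two-stage multimodal framework that assesses how overstated the claims in scientific papers are by retrieving evidence from the paper body and assigning each claim an overstatement score, promoting clearer and more transparent scientific communication.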
Details
Motivation: Scientific rigour is often sidelined in pursuit of bold statements, leading authors to claim more than their results support.
Method: RIGOURATE combines a fine-tuned reranker for evidence retrieval with a fine-tuned model that predicts overstatement scores. A dataset of over 10K claim-evidence sets was built from ICLR and NeurIPS papers, annotated with eight LLMs, with scores calibrated using peer-review comments and validated through human evaluation.
Result: Compared with strong baselines, RIGOURATE performs better at evidence retrieval and overstatement detection.
Conclusion: The work operationalises evidential proportionality and helps make scientific claims more accurate and transparent.
Abstract: Scientific rigour tends to be sidelined in favour of bold statements, leading authors to overstate claims beyond what their results support. We present RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper's body and assigns each claim an overstatement score. The framework consists of a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments and validated through human evaluation. It employs a fine-tuned reranker for evidence retrieval and a fine-tuned model to predict overstatement scores with justification. Compared to strong baselines, RIGOURATE enables improved evidence retrieval and overstatement detection. Overall, our work operationalises evidential proportionality and supports clearer, more transparent scientific communication.
[19] Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties
Akriti Dhasmana,Aarohi Srivastava,David Chiang
Main category: cs.CL
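TL;DR: This study reports empirical results on cross-lingual transfer using spontaneous, noisy, code-mixed speech across many Indic dialects and language varieties. Phylogenetic distance between languages affects ASR performance but is not the only determining factor; fine-tuning on a small amount of dialectal data can match fine-tuning on large amounts of high-resource standardized language data; and a case study on Garhwali reveals ASR models' bias toward their pre-training languages.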
Details
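Motivation: To explore how well cross-lingual transfer works for low-resource, dialectal, and non-standardized language varieties, and in particular how automatic speech recognition (ASR) systems perform and where they struggle on code-mixed and noisy speech.
Method: An empirical study across a wide range of Indic dialects and language varieties assesses the effect of phylogenetic distance on ASR performance and compares fine-tuning on small dialectal datasets with fine-tuning on large high-resource standardized language datasets. With Garhwali as a case study, multiple contemporary ASR models are evaluated and transcription errors are analyzed for language bias.
Result: Phylogenetic distance affects ASR performance but does not fully explain performance in dialectal settings; fine-tuning on small amounts of dialectal data achieves performance comparable to fine-tuning on large amounts of high-resource language data; in the Garhwali case study, existing ASR models show bias toward their pre-training languages.
Conclusion: Cross-lingual transfer is promising for low-resource dialectal speech, but dialect-specific data matters and models are biased toward dominant languages; future work should strengthen modeling and training strategies for dialects and non-standardized languages.
Abstract: We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance is generally improved with reduced phylogenetic distance between languages, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically-related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into challenges faced by ASR systems on dialectal and non-standardized speech.
[20] Disco-RAG: Discourse-Aware Retrieval-Augmented Generation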
Dongqi Liu,Hang Ding,Qiming Feng,Jian Li,Xurong Xie,Zhucun Xue,Chengjie Wang,Jiangning Zhang,Yabiao Wang
Main category: cs.CL
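TL;DR: Disco-RAG is a discourse-aware retrieval-augmented generation framework that builds intra-chunk discourse trees and inter-chunk rhetorical graphs to improve LLM performance on knowledge-intensive tasks.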
Details
Motivation: Existing RAG methods usually treat retrieved text in a flat way, ignoring structural information within and across documents, which limits the model's ability to integrate dispersed knowledge. Method: Proposes the Disco-RAG framework, which builds intra-chunk discourse trees to capture local hierarchy and inter-chunk rhetorical graphs to model cross-passage coherence, then integrates both into a planning blueprint that guides generation. Result: Achieves state-of-the-art performance on question answering and long-document summarization benchmarks without fine-tuning. Conclusion: Discourse structure plays an important role in RAG performance, and explicitly injecting discourse signals strengthens the model's ability to synthesize knowledge. Abstract: Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.
[21] MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking
Iago Alves Brito,Walcy Santos Rezende Rios,Julia Soares Dollis,Diogo Fernandes Costa Silva,Arlindo Rodrigues Galvão Filho
Main category: cs.CL
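TL;DR: MiJaBench shows that current LLM safety alignment is systematically biased by minority-group identity: the level of protection varies with the target group, and scaling up models widens the disparity.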
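The point this entry makes about aggregate scores hiding per-group differences can be illustrated with a small sketch. The record format, group names, and toy numbers below are assumptions made for illustration; they are not taken from MiJaBench or its released scripts.

```python
from collections import defaultdict


def defense_rates(records):
    """Return {group: fraction of prompts the model defended against}.

    `records` is an iterable of (group, defended) pairs. This schema is a
    hypothetical stand-in, not the benchmark's actual data format.
    """
    totals = defaultdict(int)
    defended = defaultdict(int)
    for group, was_defended in records:
        totals[group] += 1
        defended[group] += int(was_defended)
    return {group: defended[group] / totals[group] for group in totals}


def max_gap(rates):
    """Largest difference in defense rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Toy data (invented): the same model defends group_a far more often than group_b.
records = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

rates = defense_rates(records)
aggregate = sum(was_defended for _, was_defended in records) / len(records)
gap = max_gap(rates)
```

With these toy records the single aggregate score sits midway between the two groups and gives no hint of the gap, which is the kind of masking a per-group breakdown is meant to expose.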
Details
Motivation: Existing LLM safety evaluations merge risks to different groups into a single metric, masking systematic under-protection of specific minority groups; a finer-grained benchmark is needed to expose these differences. Method: Builds MiJaBench, a bilingual adversarial benchmark of 44,000 prompts targeting 16 minority groups, and generates 528,000 responses from 12 mainstream LLMs to form the MiJaBench-Align dataset, which is used to analyze how safety alignment differs across groups. Result: Defense rates within the same model differ by up to 33% across groups, and the gap grows with model size, indicating that current alignment techniques do not achieve a genuine principle of non-discrimination. Conclusion: Safety alignment is not a general capability but a defense hierarchy based on demographic characteristics; current scaling strategies may worsen inequality, and more inclusive alignment methods are needed. Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating "Identity Hate" into scalar scores that mask systemic vulnerabilities against specific populations. To expose this selective safety, we introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate MiJaBench-Align, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates fluctuate by up to 33% within the same model solely based on the target group. Crucially, we demonstrate that model scaling exacerbates these disparities, suggesting that current alignment techniques do not instill a principle of non-discrimination but instead reinforce memorized refusal boundaries only for specific groups, challenging the current scaling laws of security. We release all datasets and scripts to encourage research into granular demographic alignment at GitHub.
[22] ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
Sharanya Dasgupta,Arkaprabha Basu,Sujoy Nath,Swagatam Das
Main category: cs.CL
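TL;DR: This paper proposes ARREST, a framework that uses an external network to intervene in an LLM's latent activation space, correcting factuality and safety problems in a unified way without fine-tuning the model parameters.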
Details
Motivation: LLMs still fall short on factuality and safety; these problems stem from representational misalignment in the latent activation space and call for a unified alignment method. Method: Proposes the ARREST framework, which uses an external network to detect and correct drifted features and combines this with adversarial training to produce soft refusals, hard refusals, and factual corrections. Result: Experiments show that ARREST regulates misalignment effectively and is more versatile than RLHF-aligned models at generating soft refusals. Conclusion: ARREST offers a new route to jointly improving the truthfulness and safety of LLMs without fine-tuning. Abstract: Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance in a wide range of tasks. However, they still lack the human-like cognition needed to balance factuality and safety. Drawing on this resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space and should not be addressed as entirely separate alignment issues. We hypothesize that an external network, trained to understand the fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine-tuning the model parameters themselves. Reflecting the hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but is also more versatile compared to the RLHF-aligned models in generating soft refusals due to adversarial training. We make our codebase available at https://github.com/sharanya-dasgupta001/ARREST.
[23] Interpreting Transformers Through Attention Head Intervention
Mason Kadem,Rong Zheng
Main category: cs.CL
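TL;DR: This paper discusses mechanistic interpretability of neural networks, stressing why understanding their decision-making matters: for accountability and control in high-stakes domains, for the study of digital brains and the emergence of cognition, and for discovering new knowledge when AI surpasses humans.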
Details
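Motivation: As neural networks become more capable, our understanding of their internal mechanisms lags behind; the lack of transparency in their decision-making limits reliable use in critical domains and hinders understanding of the nature of AI cognition. Method: Through conceptual analysis and case-based argument, the paper sets out the significance of mechanistic interpretability and its potential value in several frontier areas. Result: Identifies three core values of mechanistic interpretability: controllability in high-stakes applications, advancing the scientific study of artificial cognitive systems, and extracting new knowledge from AI with superhuman performance. Conclusion: Mechanistic interpretability is key to improving AI safety and trust, and is also an important route to exploring the nature of intelligence and advancing scientific discovery. Abstract: Neural networks are growing more capable on their own, but we do not understand their neural mechanisms. Understanding these mechanisms' decision-making processes, or mechanistic interpretability, enables (1) accountability and control in high-stakes domains, (2) the study of digital brains and the emergence of cognition, and (3) discovery of new knowledge when AI systems outperform humans.
[24] Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization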
Yao Dou,Wei Xu
Main category: cs.CL
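TL;DR: This paper proposes Gavel-Ref, a multi-dimensional, checklist-based evaluation framework for systematically assessing LLMs on long-context legal case summarization, and further proposes Gavel-Agent to improve efficiency and autonomy.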
Details
Motivation: Although current LLMs support million-token contexts, their effectiveness on complex long-context tasks such as multi-document legal case summarization remains unclear, and fine-grained evaluation methods are lacking. Method: Proposes the Gavel-Ref evaluation framework, comprising a 26-item multi-value checklist plus residual-fact and writing-style evaluations; systematically evaluates 12 frontier LLMs on 100 legal cases of 32K to 512K tokens; and designs the Gavel-Agent scaffold, which integrates six tools for efficient, autonomous information extraction. Result: The strongest model, Gemini 2.5 Pro, reaches an $S_{\text{Gavel-Ref}}$ of only about 50; models do well on simple items but poorly on multi-value or rare items such as settlements and monitor reports; Gavel-Agent with Qwen3 uses 36% fewer tokens than end-to-end extraction with GPT-4.1, at the cost of only a 7% drop in $S_{\text{checklist}}$. Conclusion: Current LLMs still have significant limitations on complex tasks over very long contexts; finer-grained evaluation and tool-augmented agent scaffolds are needed to improve performance and efficiency. Abstract: Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. In this paper, we study multi-document legal case summarization, where a single case often spans many documents totaling 100K-500K tokens. We introduce Gavel-Ref, a reference-based evaluation framework with multi-value checklist evaluation over 26 items, as well as residual fact and writing-style evaluations. Using Gavel-Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves an $S_{\text{Gavel-Ref}}$ of only around 50, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi-value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human-written summaries -- making human references less reliable -- we develop Gavel-Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate and extract checklists directly from case documents. With Qwen3, Gavel-Agent reduces token usage by 36% while resulting in only a 7% drop in $S_{\text{checklist}}$ compared to end-to-end extraction with GPT-4.1.
[25] Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs
Myra Cheng,Robert D. Hawkins,Dan Jurafsky
Main category: cs.CL
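TL;DR: This paper examines why large language models (LLMs) fail to challenge users' harmful beliefs and argues that the cause is over-accommodation of user assumptions combined with insufficient epistemic vigilance. The social and linguistic factors that shape accommodation in humans also affect LLMs, and simple pragmatic interventions (such as adding "wait a minute") improve their ability to challenge harmful beliefs, and thereby their safety.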
Details
Motivation: LLMs often fail to challenge harmful beliefs because they accommodate users' assumptions, creating risk in areas such as medical advice and social reasoning; the problem needs to be understood and addressed from a pragmatic perspective. Method: Analyzes how social and linguistic factors (at-issueness, linguistic encoding, and source reliability) affect accommodation in LLMs, tests models on three safety benchmarks (Cancer-Myth, SAGE-Eval, ELEPHANT), and evaluates simple pragmatic interventions such as adding "wait a minute". Result: The factors that influence accommodation in humans also affect LLM performance; simple pragmatic interventions significantly improve performance on tasks that require challenging false beliefs while keeping false-positive rates low. Conclusion: Pragmatic factors are essential for evaluating and improving LLM safety; strengthening epistemic vigilance and adjusting accommodation strategies can improve how models respond to harmful beliefs. Abstract: Large language models (LLMs) frequently fail to challenge users' harmful beliefs in domains ranging from medical advice to social reasoning. We argue that these failures can be understood and addressed pragmatically as consequences of LLMs defaulting to accommodating users' assumptions and exhibiting insufficient epistemic vigilance. We show that social and linguistic factors known to influence accommodation in humans (at-issueness, linguistic encoding, and source reliability) similarly affect accommodation in LLMs, explaining performance differences across three safety benchmarks that test models' ability to challenge harmful beliefs, spanning misinformation (Cancer-Myth, SAGE-Eval) and sycophancy (ELEPHANT). We further show that simple pragmatic interventions, such as adding the phrase "wait a minute", significantly improve performance on these benchmarks while preserving low false-positive rates. Our results highlight the importance of considering pragmatics for evaluating LLM behavior and improving LLM safety.
[26] Learning to Simulate Human Dialogue
Kanishk Gandhi,Agam Bhatia,Noah D. Goodman
Main category: cs.CL
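TL;DR: The study models human thinking through next-turn dialogue prediction and finds that directly maximizing the log-probability of real human dialogue works better than LLM-judge-based rewards, especially when the model is allowed to think before responding.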
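The abstract for this entry mentions treating chain-of-thought as a latent variable and deriving a lower bound on the log-probability. The listing does not give the bound itself, so the following is only the standard variational (Jensen) lower bound one would expect in this setting. The symbols are assumptions introduced here: $x$ is the conversation so far, $y$ the true human response, $z$ the latent chain-of-thought, and $q$ an arbitrary proposal distribution over thoughts.

```latex
\log p_\theta(y \mid x)
  = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z)
  \;\ge\;
  \mathbb{E}_{z \sim q(z \mid x, y)}
  \Big[ \log p_\theta(y \mid x, z)
        + \log p_\theta(z \mid x)
        - \log q(z \mid x, y) \Big]
```

In the simplest case, where thoughts are sampled from the model itself, $q(z \mid x, y) = p_\theta(z \mid x)$, the last two terms cancel and the bound reduces to $\mathbb{E}_{z \sim p_\theta(z \mid x)}[\log p_\theta(y \mid x, z)]$. Whether the paper uses this form or a tighter variant is not stated in the listing.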
Details
Motivation: To explore how predicting the next utterance in a dialogue can be used to understand human thinking, and to assess how well different learning methods produce human-like responses. Method: Compares learning approaches along two dimensions: whether the model may think before responding, and whether learning is rewarded by an LLM judge that scores semantic similarity and information completeness or by directly maximizing the log-probability of the true human dialogue. Treating chain-of-thought as a latent variable, the authors derive a lower bound on the log-probability and optimize that objective. Result: Judge-based rewards raise judge scores but lower both the probability assigned to the true responses and the win rate under human judgment; directly maximizing the log-probability of the true responses does better on every metric, with the largest gains when thinking is introduced. Conclusion: Thinking helps most when the training objective is to match the distribution of real human dialogue, suggesting that scaling the approach to broader conversational data may yield a more nuanced model of human behavior. Abstract: To predict what someone will say is to model how they think. We study this through next-turn dialogue prediction: given a conversation, predict the next utterance produced by a person. We compare learning approaches along two dimensions: (1) whether the model is allowed to think before responding, and (2) how learning is rewarded: either through an LLM-as-a-judge that scores semantic similarity and information completeness relative to the ground-truth response, or by directly maximizing the log-probability of the true human dialogue. We find that optimizing for judge-based rewards indeed increases judge scores throughout training; however, it decreases the likelihood assigned to ground-truth human responses and decreases the win rate when human judges choose the most human-like response among a real and synthetic option. This failure is amplified when the model is allowed to think before answering. In contrast, by directly maximizing the log-probability of observed human responses, the model learns to better predict what people actually say, improving on both log-probability and win rate evaluations. Treating chain-of-thought as a latent variable, we derive a lower bound on the log-probability. Optimizing this objective yields the best results on all our evaluations. These results suggest that thinking helps primarily when trained with a distribution-matching objective grounded in real human dialogue, and that scaling this approach to broader conversational data may produce models with a more nuanced understanding of human behavior.
[27] Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
San Kim,Gary Geunbae Lee
Main category: cs.CL
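TL;DR: Proposes MB-Defense, a new defense framework for instruction-tuned LLMs that resists diverse backdoor attacks by merging and then breaking backdoor representations.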
Details
Motivation: Instruction tuning of LLMs depends on large-scale data, which makes the models vulnerable to backdoor attacks, yet defenses remain underexplored. Method: Uses a two-stage training strategy: defensive poisoning merges attacker and defensive triggers into a unified backdoor representation, and a weight-recovery stage breaks that representation through additional training to restore normal behavior. Result: Experiments on multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Conclusion: MB-Defense provides a generalizable, data-efficient defense strategy that improves the robustness of instruction-tuned models against unseen backdoor attacks. Abstract: Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets, often collected from human or web sources, makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging & Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) defensive poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) weight recovery, which breaks this representation through additional training to restore clean behavior. Extensive experiments across multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Our method offers a generalizable and data-efficient defense strategy, improving the robustness of instruction-tuned LLMs against unseen backdoor attacks.
[28] Users Mispredict Their Own Preferences for AI Writing Assistance
Vivian Lai,Zana Buçinca,Nil-Jana Akpinar,Mo Houtti,Hyeonsu B. Kang,Kevin Chian,Namjoon Suh,Alex C. Williams
Main category: cs.CL
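TL;DR: When using AI writing assistants, users' actual behavior differs markedly from their self-reported preferences: compositional effort is the main driver, whereas urgency is overrated in self-reports but has the least actual influence. Designing systems from self-reports therefore points optimization in the wrong direction.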
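The abstract for this entry reports correlations as $\rho$ without saying which coefficient is meant. Assuming it denotes a rank correlation such as Spearman's, the sketch below shows how such a value could be computed from paired observations using only the standard library. The function names are illustrative and do not come from the paper.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

In practice a library routine such as `scipy.stats.spearmanr` would normally be preferred; the hand-rolled version is shown only to make the rank transformation explicit.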
Details
Motivation: To understand what actually drives users' desire for help from an AI writing assistant, in order to improve the predictions of proactive AI systems. Method: A factorial vignette study with 50 participants making 750 pairwise comparisons, analyzing how well users' behavior agrees with their self-reported preferences. Result: Compositional effort is the main factor in decisions ($\rho = 0.597$), while urgency has no significant predictive power ($\rho \approx 0$); users rank urgency first in self-reports although it is the weakest behavioral driver. Systems designed from stated preferences reach only 57.7% accuracy, whereas designs based on behavioral patterns reach 61.3% ($p < 0.05$). Conclusion: Relying on users' self-reports to design proactive AI writing assistants misleads system optimization; NLG systems should be built more on actual behavioral data. Abstract: Proactive AI writing assistants need to predict when users want drafting help, yet we lack empirical understanding of what drives preferences. Through a factorial vignette study with 50 participants making 750 pairwise comparisons, we find compositional effort dominates decisions ($\rho = 0.597$) while urgency shows no predictive power ($\rho \approx 0$). More critically, users exhibit a striking perception-behavior gap: they rank urgency first in self-reports despite it being the weakest behavioral driver, representing a complete preference inversion. This misalignment has measurable consequences. Systems designed from users' stated preferences achieve only 57.7% accuracy, underperforming even naive baselines, while systems using behavioral patterns reach a significantly higher 61.3% ($p < 0.05$). These findings demonstrate that relying on user introspection for system design actively misleads optimization, with direct implications for proactive natural language generation (NLG) systems.
[29] Beyond Static Summarization: Proactive Memory Extraction for LLM Agents
Chengyuan Yang,Zequn Sun,Wei Wei,Wei Hu
Main category: cs.CL
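TL;DR: This paper proposes proactive memory extraction (ProMem), which uses an iterative self-questioning feedback mechanism to overcome the "ahead-of-time" and "one-off" limitations of summary-based memory management, improving information completeness and QA accuracy.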
Details
Motivation: Existing summary-based memory extraction methods are "ahead-of-time" and "one-off": they cannot adapt to the needs of later tasks or correct their own errors. Method: Introduces a recurrent feedback mechanism in which the agent actively probes the dialogue history through self-questioning, enabling iterative memory extraction and correction. Result: ProMem significantly improves the completeness of extracted memory and QA accuracy, and achieves a better trade-off between extraction quality and token cost. Conclusion: Treating memory extraction as an active cognitive process rather than static summarization improves the long-term interaction and personalization capabilities of LLM agents. Abstract: Memory management is vital for LLM agents to handle long-term interaction and personalization. Most research focuses on how to organize and use memory summaries, but often overlooks the initial memory extraction stage. In this paper, we argue that existing summary-based methods have two major limitations based on the recurrent processing theory. First, summarization is "ahead-of-time", acting as a blind "feed-forward" process that misses important details because it doesn't know future tasks. Second, extraction is usually "one-off", lacking a feedback loop to verify facts, which leads to the accumulation of information loss. To address these issues, we propose proactive memory extraction (ProMem). Unlike static summarization, ProMem treats extraction as an iterative cognitive process. We introduce a recurrent feedback loop where the agent uses self-questioning to actively probe the dialogue history. This mechanism allows the agent to recover missing information and correct errors. ProMem significantly improves the completeness of the extracted memory and QA accuracy. It also achieves a superior trade-off between extraction quality and token cost.
[30] Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions
Ignacio Sastre,Aiala Rosá
Main category: cs.CL
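TL;DR: Concept Tokens is a lightweight method that injects a concept into a pretrained LLM by learning the embedding of a new special token from multiple natural-language definitions while the model stays frozen, and it can effectively steer model behavior.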
Details
Motivation: To explore how a new concept can be injected into a frozen LLM from definitions alone, and how its behavior can be steered compactly so as to reduce undesirable outputs such as hallucinations. Method: Introduces a new special token that replaces the target concept and optimizes only that token's embedding with the standard language-modeling objective, keeping the rest of the LLM frozen. Result: On HotpotQA the method shows a directional effect on hallucination: negating the hallucination token reduces hallucinated answers but increases abstentions, and asserting it does the opposite. A similar directional effect appears for a language-teaching feedback strategy and in a qualitative case study (the Eiffel Tower), and the method outperforms supplying the definitions in-context. Conclusion: Concept Tokens provide an effective, compact control signal, learned from definitions, for steering the behavior of frozen LLMs. Abstract: We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional "Austral Tower" to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.
[31] SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
Iaroslav Chelombitko,Ekaterina Chelombitko,Aleksey Komissarov
Main category: cs.CL
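TL;DR: This paper presents SampoNLP, a corpus-free tool for generating morphological lexicons, and uses the resulting high-quality lexicons to systematically evaluate BPE tokenizers for Finnish, Hungarian, and Estonian. It proposes the Integrated Performance Score (IPS) and gives empirically grounded recommendations for optimal vocabulary sizes in these Uralic languages.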
Details
Motivation: Morphologically complex, low-resource languages such as the Uralic languages lack clean morpheme lexicons, which makes subword tokenization quality hard to evaluate; the performance of existing BPE tokenizers on highly agglutinative languages is limited and has not been analyzed quantitatively. Method: Proposes SampoNLP, which automatically builds high-purity morphological lexicons using MDL-inspired Self-Referential Atomicity Scoring; uses these lexicons to systematically evaluate BPE tokenizers across vocabulary sizes (8k-256k); and proposes the Integrated Performance Score (IPS) to balance morpheme coverage against over-splitting. Result: Produces high-purity morphological lexicons for Finnish, Hungarian, and Estonian; IPS curve analysis identifies the "elbow points" of diminishing returns and yields the first empirically grounded recommendations for optimal vocabulary size; the study quantifies the limitations of standard BPE for highly agglutinative languages. Conclusion: SampoNLP offers a way to build lexicons for low-resource languages without annotated corpora; the IPS metric and the recommended values of k give practical guidance for choosing tokenizers in Uralic languages, and the study exposes fundamental limits of BPE for morphologically complex languages. Abstract: The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues, making it suited for low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP
[32] WESR: Scaling and Evaluating Word-level Event-Speech Recognition
Chenchen Yang,Kexin Huang,Liwei Fan,Qian Tu,Botian Jiang,Dong Zhang,Linqi Yin,Shimin Li,Zhaoye Fei,Qinyuan Cheng,Xipeng Qiu
Main category: cs.CL
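TL;DR: This paper proposes a new 21-class taxonomy of non-verbal vocal events and builds the WESR-Bench evaluation set and a large-scale training corpus, enabling precise localization of both discrete and continuous vocal events.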
Details
Motivation: Existing methods for detecting non-verbal vocal events suffer from limited category coverage, ambiguous temporal granularity, and a lack of standardized evaluation, which holds back practical applications. Method: Proposes a fine-grained taxonomy of 21 vocal events (discrete and continuous), builds the expert-annotated WESR-Bench evaluation set (900+ utterances) and a 1,700+ hour training corpus, and designs a position-aware evaluation protocol that disentangles ASR errors from event-detection performance. Result: The resulting models outperform open-source audio-language models and commercial APIs while preserving ASR quality, and localize both discrete and continuous events precisely. Conclusion: WESR and its evaluation framework provide a foundational resource for modeling rich real-world auditory scenes and advance research and applications in non-verbal vocal event detection. Abstract: Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. We also build a strong baseline by constructing a 1,700+ hour corpus, and train specialized models, surpassing both open-source audio-language models and commercial APIs while preserving ASR quality. We anticipate that WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes.
[33] LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation
Yuxiao Ye,Yiming Zhang,Yiran Ma,Huiyuan Xie,Huining Zhu,Zhiyuan Liu
Main category: cs.CL
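TL;DR: This paper proposes LinguaGame, a multi-agent dialogue generation framework grounded in linguistics and game theory that models dialogue as a signalling game over intents and strategies to improve communication efficiency in LLM-based multi-agent systems.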
Details
Motivation: Research on LLM-based multi-agent systems has focused mostly on architecture design and has neglected the communication efficiency of the interaction process itself. This paper aims to help agents convey their intended meaning more effectively in natural-language exchanges. Method: Proposes LinguaGame, which models dialogue as a signalling game over communicative intents and strategies and uses a training-free equilibrium approximation algorithm to adjust decisions at inference time. The method is grounded in linguistic principles and requires agents to infer others' intents and strategies, yielding intentional and strategic communication. Result: Evaluated in simulated courtroom proceedings and debates with human expert assessment, the method significantly improves communication efficiency among agents. Conclusion: With its linguistically driven game-theoretic framework, LinguaGame improves communication efficiency in multi-agent systems without relying on task-specific design, making it general and practical. Abstract: Large Language Models (LLMs) have enabled Multi-Agent Systems (MASs) where agents interact through natural language to solve complex tasks or simulate multi-party dialogues. Recent work on LLM-based MASs has mainly focused on architecture design, such as role assignment and workflow orchestration. In contrast, this paper targets the interaction process itself, aiming to improve agents' communication efficiency by helping them convey their intended meaning more effectively through language. To this end, we propose LinguaGame, a linguistically-grounded game-theoretic paradigm for multi-agent dialogue generation. Our approach models dialogue as a signalling game over communicative intents and strategies, solved with a training-free equilibrium approximation algorithm for inference-time decision adjustment. Unlike prior game-theoretic MASs, whose game designs are often tightly coupled with task-specific objectives, our framework relies on linguistically informed reasoning with minimal task-specific coupling. Specifically, it treats dialogue as intentional and strategic communication, requiring agents to infer what others aim to achieve (intents) and how they pursue those goals (strategies). We evaluate our framework in simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency.
[34] GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence
Yibo Zhao,Jiapeng Zhu,Zichen Ding,Xiang Li
Main category: cs.CL
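TL;DR: This paper proposes GRACE, a reinforcement-learning framework for retrieval-augmented generation that uses a multi-stage gated reward to jointly address answers that lack supporting evidence and hallucinations under insufficient context, achieving state-of-the-art accuracy and sensible abstention at reduced annotation cost.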
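This entry describes a multi-stage gated reward without giving its form. As a purely illustrative sketch of what "gated" can mean, the function below pays out later stages only when earlier ones pass. The stage order, the 0.25/0.25/0.5 split, and every argument name are assumptions made here, not GRACE's actual reward.

```python
def gated_reward(pred_sufficient, gold_sufficient, evidence_f1,
                 answer_correct, abstained):
    """Hypothetical three-stage gated reward in the range [0, 1].

    Stage 1: the sufficiency judgement must match the label, else reward is 0.
    Stage 2: partial credit for overlap between extracted and gold evidence.
    Stage 3: answer credit, paid only if some evidence was actually found.
    """
    # Gate 1: a wrong sufficiency judgement closes every later stage.
    if pred_sufficient != gold_sufficient:
        return 0.0

    # Insufficient context: the only rewarded behavior is explicit abstention.
    if not gold_sufficient:
        return 1.0 if abstained else 0.0

    # Sufficient context: abstaining is treated as a failure to answer.
    if abstained:
        return 0.0

    reward = 0.25                      # passed the sufficiency gate
    reward += 0.25 * evidence_f1       # stage 2: evidence extraction quality
    if evidence_f1 > 0 and answer_correct:
        reward += 0.5                  # stage 3: grounded correct answer
    return reward
```

Under this toy design a correct answer with no supporting evidence earns only the stage-1 credit, which mirrors the entry's concern about correct but ungrounded answers.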
Details
Motivation: Existing retrieval-augmented generation systems have two key flaws: answers lack explicit supporting evidence, and models fabricate content when the retrieved context is insufficient. No unified framework currently addresses both. Method: Proposes the GRACE framework, which uses heterogeneous retrievers to generate diverse training samples and a multi-stage gated reward function to train the model, via reinforcement learning, to judge evidence sufficiency, extract key evidence, and either answer or explicitly abstain as appropriate. Result: Experiments on two benchmarks show that GRACE achieves state-of-the-art overall accuracy and a good balance between accurate answers and appropriate abstention, at only 10% of the annotation cost of prior methods. Conclusion: GRACE offers an efficient, low-cost solution for retrieval-augmented generation that delivers both evidence-grounded answers and reliable abstention, improving the trustworthiness and practicality of such systems. Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge to enhance Large Language Models (LLMs), yet systems remain susceptible to two critical flaws: providing correct answers without explicit grounded evidence and producing fabricated responses when the retrieved context is insufficient. While prior research has addressed these issues independently, a unified framework that integrates evidence-based grounding and reliable abstention is currently lacking. In this paper, we propose GRACE, a reinforcement-learning framework that simultaneously mitigates both types of flaws. GRACE employs a data construction method that utilizes heterogeneous retrievers to generate diverse training samples without manual annotation. A multi-stage gated reward function is then employed to train the model to assess evidence sufficiency, extract key supporting evidence, and provide answers or explicitly abstain. Experimental results on two benchmarks demonstrate that GRACE achieves state-of-the-art overall accuracy and strikes a favorable balance between accurate response and rejection, while requiring only 10% of the annotation costs of prior methods. Our code is available at https://github.com/YiboZhao624/Grace.
[35] BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation
Amit Bin Tariqul,A N M Zahid Hossain Milkan,Sahab-Al-Chowdhury,Syed Rifat Raiyan,Hasan Mahmud,Md Kamrul Hasan
Main category: cs.CL
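TL;DR: This paper presents the first systematic evaluation of existing text watermarking methods on a low-resource language (Bangla), finds that conventional token-level watermarks fail under cross-lingual round-trip translation attacks, and proposes a training-free layered watermarking strategy that substantially improves detection accuracy.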
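To see why round-trip translation can defeat a token-level watermark, it helps to look at the detection statistic. KGW is commonly described as a green-list scheme in which detection counts how many tokens fall in a pseudo-random "green" subset of the vocabulary and applies a one-proportion z-test. The sketch below assumes that formulation; the default `gamma` and the example counts are illustrative and are not taken from this paper.

```python
import math


def kgw_z_score(green_count, total_tokens, gamma=0.25):
    """One-proportion z-statistic for a green-list watermark.

    gamma is the fraction of the vocabulary placed on the green list.
    Unwatermarked text is expected to land near gamma * total_tokens green
    tokens, so a large positive z suggests that a watermark is present.
    """
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std


# Illustrative numbers only: 200 tokens, a quarter of the vocabulary is green.
z_watermarked = kgw_z_score(green_count=110, total_tokens=200)
z_after_translation = kgw_z_score(green_count=52, total_tokens=200)
```

Translating the text out of Bangla and back replaces most tokens, so the green fraction drifts back toward `gamma` and the z-score falls toward zero. That behavior is consistent with the collapse in detection accuracy this entry reports for token-level methods.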
Details
Motivation: Existing text watermarking methods work well in high-resource languages, but their robustness to cross-lingual attacks in low-resource languages has not been studied enough and needs to be evaluated and improved. Method: Systematically evaluates the state-of-the-art watermarking methods KGW, EXP, and Waterfall on Bangla LLM-generated text under cross-lingual round-trip translation attacks, and proposes a layered strategy that combines embedding-time and post-generation watermarks to improve robustness. Result: Under benign conditions KGW and EXP exceed 88% detection accuracy, but under round-trip translation accuracy drops to 9-13%; layered watermarking raises post-attack detection accuracy to 40-50%, a 3-4x relative improvement, at the cost of controlled semantic degradation. Conclusion: Layered watermarking is a practical, training-free solution that substantially improves watermark robustness for low-resource languages in multilingual settings and makes the trade-off between robustness and generation quality explicit. Abstract: As large language models (LLMs) are increasingly deployed for text generation, watermarking has become essential for authorship attribution, intellectual property protection, and misuse detection. While existing watermarking methods perform well in high-resource languages, their robustness in low-resource languages remains underexplored. This work presents the first systematic evaluation of state-of-the-art text watermarking methods: KGW, Exponential Sampling (EXP), and Waterfall, for Bangla LLM text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve high detection accuracy (>88%) with negligible perplexity and ROUGE degradation. However, RTT causes detection accuracy to collapse to 9-13%, indicating a fundamental failure of token-level watermarking. To address this, we propose a layered watermarking strategy that combines embedding-time and post-generation watermarks. Experimental results show that layered watermarking improves post-RTT detection accuracy by 25-35%, achieving 40-50% accuracy, representing a 3$\times$ to 4$\times$ relative improvement over single-layer methods, at the cost of controlled semantic degradation. Our findings quantify the robustness-quality trade-off in multilingual watermarking and establish layered watermarking as a practical, training-free solution for low-resource languages such as Bangla. Our code and data will be made public.
[36] Identifying Good and Bad Neurons for Task-Level Controllable LLMs
Wenjie Li,Guansong Pang,Hezhe Qiao,Debin Gao,David Lo
Main category: cs.CL
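TL;DR: Proposes NeuronLLM, a framework based on the principle of functional antagonism that identifies the neurons in an LLM that facilitate and inhibit a task, using contrastive learning and augmented question sets to model neurons more comprehensively.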
Details
Motivation: Existing methods focus only on supportive neurons and struggle with task scenarios that require several abilities to work together; they also ignore inhibitive neurons and the attribution bias introduced by fortuitously correct behavior. Method: Adopts a functional-antagonism mechanism, uses contrastive learning to separate neurons that facilitate a task from those that inhibit it, and uses augmented question sets to reduce the effect of chance-correct answers on attribution. Result: Experiments on LLMs of different sizes and families show that NeuronLLM outperforms existing methods on four NLP tasks. Conclusion: NeuronLLM models LLM neurons more comprehensively and provides new insight into their functional organization. Abstract: Large Language Models have demonstrated remarkable capabilities on multiple-choice question answering benchmarks, but the complex mechanisms underlying their large-scale neurons remain opaque, posing significant challenges for understanding and steering LLMs. While recent studies made progress on identifying responsible neurons for certain abilities, these ability-specific methods are infeasible for task-focused scenarios requiring coordinated use of multiple abilities. Moreover, these approaches focus only on supportive neurons that correlate positively with task completion, while neglecting neurons with other roles, such as inhibitive roles, and overlooking misleading neuron attribution caused by fortuitous behaviors in LLMs (i.e., correctly answering questions by chance rather than through genuine understanding). To address these challenges, we propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification. The key insight is that task performance is jointly determined by neurons with two opposing roles: good neurons that facilitate task completion and bad neurons that inhibit it. NeuronLLM achieves a holistic modeling of neurons via contrastive learning of good and bad neurons, while leveraging augmented question sets to mitigate the fortuitous behaviors in LLMs. Comprehensive experiments on LLMs of different sizes and families show the superiority of NeuronLLM over existing methods in four NLP tasks, providing new insights into LLM functional organization.
[37] FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback
Seongyeub Chu,Jongwoo Kim,Munyong Yi
Main category: cs.CL
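TL;DR: This paper proposes FeedEval, an LLM-based framework for evaluating the quality of LLM-generated essay feedback along three pedagogical dimensions (specificity, helpfulness, and validity), and shows experimentally that it improves automated essay scoring and revision guidance.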
Details
Motivation: Existing automated essay scoring research often trains models on LLM-generated feedback that has not been validated for quality, which propagates noise and hurts downstream tasks. Method: Proposes the FeedEval framework, which builds three LLM evaluators specialized for the feedback-quality dimensions of specificity, helpfulness, and validity, trains them on datasets curated in this study, and uses them to assess multiple feedback candidates and select high-quality feedback for downstream use. Result: Experiments on the ASAP++ benchmark show that FeedEval agrees closely with human expert judgments; scoring models trained on the feedback it selects perform better; and in revision experiments, small LLMs produce more effective essay revisions with that feedback. Conclusion: FeedEval reliably identifies high-quality LLM-generated feedback and improves both automated essay scoring and the use of feedback, showing good potential for application. Abstract: Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We will release our code and curated datasets upon acceptance.
[38] Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization
Mizanur Rahman,Mohammed Saidul Islam,Md Tahmid Rahman Laskar,Shafiq Joty,Enamul Hoque
Main category: cs.CL
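TL;DR: This paper proposes RL-Text2Vis, the first reinforcement learning framework for text-to-visualization generation. Built on GRPO with a multi-objective reward, it uses post-execution feedback to optimize textual accuracy, code validity, and visualization quality, and significantly outperforms existing methods on several benchmarks.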
Details
Motivation: Existing LLM-based Text2Vis systems produce visualizations with poor semantic alignment and low chart quality, and supervised fine-tuning cannot use post-execution feedback to improve overall quality. Method: Proposes the RL-Text2Vis framework, built on Group Relative Policy Optimization (GRPO), with a multi-objective reward covering textual accuracy, code validity, and visualization quality, trained by reinforcement learning on post-execution feedback. Result: Achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark, raises code execution success from 78% to 97%, and generalizes well to out-of-domain datasets such as VIS-Eval and NVBench. Conclusion: GRPO is an effective strategy for structured multimodal reasoning, and reinforcement learning with multi-objective rewards markedly improves the generation quality and robustness of Text2Vis systems. Abstract: Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at https://github.com/vis-nlp/RL-Text2Vis.
[39] THaLLE-ThaiLLM: Domain-Specialized Small LLMs for Finance and Thai -- Technical Report
KBTG Labs,Anuruth Lertpiya,Danupat Khamnuansin,Kantapong Sucharitpongpan,Pornchanan Balee,Tawunrat Chalothorn,Thadpong Pongthawornkamol,Monchai Lertsutthiwong
Main category: cs.CL
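TL;DR: This report explores model merging as an efficient way to build multi-capability LLMs, optimized for Thai language ability and financial-domain applications.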
Details
Motivation: Because of privacy, security, and regulatory concerns, financial institutions tend to deploy LLMs on premises, but they face a trade-off between deploying multiple specialized models and the cost of training a single multi-capability model. Method: Uses model merging, combining Qwen-8B with ThaiLLM-8B in one experiment and with both ThaiLLM-8B and THaLLE-CFA-8B in a second, to improve Thai language understanding and performance on financial tasks. Result: The merged models improve on several benchmarks, including M3 and M6 O-NET, Flare-CFA, and Thai-IC, supporting the effectiveness of model merging. Conclusion: Model merging is a resource-efficient way to build high-performance LLMs with multi-domain capabilities, especially for industries with strict data privacy requirements. Abstract: Large Language Models (LLMs) have demonstrated significant potential across various domains, particularly in banking and finance, where they can automate complex tasks and enhance decision-making at scale. Due to privacy, security, and regulatory concerns, organizations often prefer on-premise deployment of LLMs. The ThaiLLM initiative aims to enhance Thai language capabilities in open-LLMs, enabling Thai industry to leverage advanced language models. However, organizations often face a trade-off between deploying multiple specialized models versus the prohibitive expense of training a single multi-capability model. To address this, we explore model merging as a resource-efficient alternative for developing high-performance, multi-capability LLMs. We present results from two key experiments: first, merging Qwen-8B with ThaiLLM-8B demonstrates how ThaiLLM-8B enhances Thai general capabilities, showing an uplift of M3 and M6 O-NET exams over the general instruction-following Qwen-8B. Second, we merge Qwen-8B with both ThaiLLM-8B and THaLLE-CFA-8B. This combination results in further improvements in performance across both general and financial domains, by demonstrating an uplift in both M3 and M6 O-NET, Flare-CFA, and Thai-IC benchmarks. The report showcases the viability of model merging for efficiently creating multi-capability LLMs.
[40] On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions
Zhiyuan He,Binghan Chen,Tianxiang Xiong,Ziyang Sun,Mozhao Zhu,Xi Chen
Main category: cs.CL
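TL;DR: This paper studies the limitations of Rank-One Model Editing (ROME) on multi-hop reasoning tasks and proposes a redundant editing strategy that improves accuracy on 2-hop questions.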
Details
Motivation: Existing knowledge editing methods such as ROME work well for single-hop fact updates but face significant challenges on multi-hop reasoning tasks that require knowledge chaining. Method: Analyzes the effect of knowledge edits at different layer depths, identifies three main failure modes, and proposes a redundant editing strategy to mitigate the "hopping-too-late" problem and the decline in generalization. Result: Experiments show that the proposed redundant editing improves accuracy on 2-hop questions by at least 15.5 percentage points, a 96% increase over the previous single-edit strategy. Conclusion: Redundant editing is a simple yet effective strategy that markedly improves multi-hop reasoning, at the cost of some specificity and language naturalness. Abstract: Recent advances in Knowledge Editing (KE), particularly Rank-One Model Editing (ROME), show superior efficiency over fine-tuning and in-context learning for updating single-hop facts in transformers. However, these methods face significant challenges when applied to multi-hop reasoning tasks requiring knowledge chaining. In this work, we study the effect of editing knowledge with ROME on different layer depths and identify three key failure modes. First, the "hopping-too-late" problem occurs as later layers lack access to necessary intermediate representations. Second, generalization ability deteriorates sharply when editing later layers. Third, the model overfits to edited knowledge, incorrectly prioritizing edited-hop answers regardless of context. To mitigate the issues of "hopping-too-late" and generalisation decay, we propose Redundant Editing, a simple yet effective strategy that enhances multi-hop reasoning. Our experiments demonstrate that this approach can improve accuracy on 2-hop questions by at least 15.5 percentage points, representing a 96% increase over the previous single-edit strategy, while trading off some specificity and language naturalness.
[41] When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation
Rhea Kapur,Robert Hawkins,Elisa Kreiss
Main category: cs.CL
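TL;DR: This paper argues that the specificity of vision-language model descriptions should be decoupled from their length. It defines specificity relative to a contrast set and, using a dataset that controls for length while varying information content, shows that people prefer more specific descriptions, so evaluation should prioritize specificity over verbosity.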
Details
Motivation: Text descriptions generated by current vision-language models often conflate specificity with length, which makes evaluation inaccurate; the two need to be clearly separated to improve description quality. Method: Proposes a contrast-set-based definition of specificity, constructs a dataset that controls for length while varying information content, and validates the effect of specificity through human preference experiments. Result: Experiments show that, with length controlled, people still reliably distinguish and prefer more specific descriptions, indicating that how the length budget is allocated affects specificity. Conclusion: Evaluation should measure specificity directly rather than rely on description length, to improve the effectiveness and accessibility of visual content descriptions. Abstract: Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, description specificity is often conflated with description length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.
[42] Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR
Yihong Tang,Kehai Chen,Xuefeng Bai,Benyou Wang,Zeming Liu,Haifeng Wang,Min Zhang
Main category: cs.CL
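TL;DR: Proposes the Character-R1 framework, which improves the character consistency and reasoning ability of role-playing agents through a cognitive reward mechanism.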
Details
Motivation: Existing role-playing agents lack internal cognitive consistency and tend to break character in complex situations. Method: Designs a three-part reward mechanism comprising a Cognitive Focus Reward, a Reference-Guided Reward, and Character-Conditioned Reward Normalization. Result: Experiments show that Character-R1 significantly outperforms existing methods in knowledge, memory, and other dimensions. Conclusion: Character-R1 improves the stability and performance of role-playing through structured cognition and verifiable reward signals. Abstract: Current role-playing agents (RPAs) are typically constructed by imitating surface-level behaviors, but this approach lacks internal cognitive consistency, often causing out-of-character errors in complex situations. To address this, we propose Character-R1, a framework designed to provide comprehensive verifiable reward signals for effective role-aware reasoning, which are missing in recent studies. Specifically, our framework comprises three core designs: (1) Cognitive Focus Reward, which enforces explicit label-based analysis of 10 character elements (e.g., worldview) to structure internal cognition; (2) Reference-Guided Reward, which utilizes overlap-based metrics with reference responses as optimization anchors to enhance exploration and performance; and (3) Character-Conditioned Reward Normalization, which adjusts reward distributions based on character categories to ensure robust optimization across heterogeneous roles. Extensive experiments demonstrate that Character-R1 significantly outperforms existing methods in knowledge, memory, and other dimensions.
[43] From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset
Haneul Yoo,Won Ik Cho,Geunhye Kim,Jiyoon Han
Main category: cs.CL
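TL;DR: Proposes CuCu, an automated multi-agent framework grounded in national social studies curricula that generates culture-specific question-answer pairs, and uses it to build KCaQA, a dataset on Korean culture.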
Details
Motivation: Addresses the uneven performance of LLMs across cultures and languages, in particular the cultural bias caused by English-centric training data. Method: Uses national social studies curricula as the foundation for culture-aware supervision and, through the multi-agent LLM framework CuCu, transforms textbook content into open-ended, culture-specific question-answer pairs. Result: Applying CuCu to the Korean curriculum yields KCaQA, a dataset of 34.1k QA pairs; analyses indicate that it covers culture-specific topics and produces responses grounded in local sociocultural contexts. Conclusion: The approach offers a scalable and practical path toward cultural alignment of language models. Abstract: Large language models (LLMs) achieve strong performance on many tasks, but their progress remains uneven across languages and cultures, often reflecting values latent in English-centric training data. To enable practical cultural alignment, we propose a scalable approach that leverages national social studies curricula as a foundation for culture-aware supervision. We introduce CuCu, an automated multi-agent LLM framework that transforms national textbook curricula into open-ended, culture-specific question-answer pairs. Applying CuCu to the Korean national social studies curriculum, we construct KCaQA, comprising 34.1k open-ended QA pairs. Our quantitative and qualitative analyses suggest that KCaQA covers culture-specific topics and produces responses grounded in local sociocultural contexts.
[44] MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark
Anyang Song,Ying Cheng,Yiqian Xu,Rui Feng
Main category: cs.CL
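TL;DR: This paper proposes MAGA, a new method that strengthens the alignment of machine-generated text with human writing, both to improve detector generalization and to test detector robustness.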
Details
Motivation: As LLMs advance, machine-generated text is becoming harder to distinguish from human writing, worsening problems such as fake news and online fraud. The generalization of existing detectors is limited by dataset quality, so the generation process itself needs to be improved to strengthen alignment. Method: Proposes the MAGA framework, which achieves comprehensive alignment from prompt construction to the reasoning process; its key component is Reinforced Learning from Detectors Feedback (RLDF), which iteratively optimizes generation to produce better-aligned text. Result: A RoBERTa detector fine-tuned on the MAGA training set improves generalization detection AUC by 4.60% on average, while the MAGA dataset lowers the AUC of the selected detectors by 8.13% on average, showing that it challenges them. Conclusion: MAGA effectively improves detector generalization and can also serve as a tool for assessing detector robustness, offering guidance for future research on detection. Abstract: The alignment of Large Language Models (LLMs) is constantly evolving. Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse issues such as fake news and online fraud. Fine-tuned detectors' generalization ability is highly dependent on dataset quality, and simply expanding the sources of MGT is insufficient. Further augmentation of the generation process is required. According to HC-Var's theory, enhancing the alignment of generated text can not only facilitate attacks on existing detectors to test their robustness, but also help improve the generalization ability of detectors fine-tuned on it. Therefore, we propose Machine-Augment-Generated Text via Alignment (MAGA). MAGA's pipeline achieves comprehensive alignment from prompt construction to reasoning process, among which Reinforced Learning from Detectors Feedback (RLDF), systematically proposed by us, serves as a key component. In our experiments, the RoBERTa detector fine-tuned on the MAGA training set achieved an average improvement of 4.60% in generalization detection AUC. The MAGA dataset caused an average decrease of 8.13% in the AUC of the selected detectors, and is expected to offer guidance for future research on the generalization detection ability of detectors.
[45] SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation
Sirry Chen,Jieyi Wang,Wei Chen,Zhongyu Wei
Main category: cs.CL
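TL;DR: This paper proposes SpeechMedAssist, a speech language model that conducts speech-based multi-turn medical consultations. Its two-stage training paradigm sharply reduces the need for medical speech data, and on a newly designed benchmark it is more effective and robust than existing methods.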
Details
Motivation: Existing medical consultation systems mostly rely on long-text interaction, which is unnatural and unfriendly to patients. Despite progress in speech language models, the scarcity of medical speech data and the inefficiency of direct fine-tuning hinder their use in medical settings. Method: Proposes SpeechMedAssist, which exploits the architectural properties of speech language models to decouple training into two stages: the first injects knowledge and capabilities via text, and the second re-aligns modalities with limited speech data, requiring only 10k synthesized speech samples. Result: On a benchmark comprising single-turn question answering and multi-turn simulated interactions, SpeechMedAssist outperforms all baselines in most evaluation settings, showing greater effectiveness and robustness. Conclusion: The two-stage training paradigm eases the scarcity of medical speech data, advances the practical use of speech language models in medical consultation, and demonstrates the potential of speech interaction in medical settings. Abstract: Medical consultations are intrinsically speech-centric. However, most prior works focus on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more natural speech-based interaction, yet the scarcity of medical speech data and the inefficiency of directly fine-tuning on speech data jointly hinder the adoption of SpeechLMs in medical consultation. In this paper, we propose SpeechMedAssist, a SpeechLM natively capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm consisting of (1) Knowledge & Capability Injection via Text and (2) Modality Re-alignment with Limited Speech Data, thereby reducing the requirement for medical speech data to only 10k synthesized samples. To evaluate SpeechLMs for medical consultation scenarios, we design a benchmark comprising both single-turn question answering and multi-turn simulated interactions. Experimental results show that our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.
[46] CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models
Yifan Le,Yunliang Li
Main category: cs.CL
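TL;DR: This paper proposes CRANE, a framework that redefines language specificity through relevance-based neuron-level interventions and identifies the neurons functionally necessary for predictions in a given language. Compared with conventional activation-based heuristics, it reveals the organization of language capabilities in multilingual LLMs more precisely.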
Details
Motivation: Existing work identifies language-related neurons mainly by activation strength, which conflates language preference with functional importance and cannot accurately explain how language capabilities are organized at the neuron level in multilingual LLMs. Method: Proposes the CRANE framework, which identifies language-specific neurons by their functional necessity for language-conditioned predictions rather than by activation magnitude, analyzes their contribution through targeted neuron-level interventions, and validates the approach with a new relevance-based metric and a base-to-chat model transfer analysis. Result: Masking neurons relevant to a given language significantly degrades performance on that language without severely affecting others, an asymmetric pattern of language-selective degradation; results on multiple benchmarks in English, Chinese, and Vietnamese show that CRANE isolates language-specific components more precisely than activation-based baselines. Conclusion: At the neuron level, language capability shows up as functional necessity and selectivity rather than simple activation preference; CRANE provides a finer-grained, more reliable tool for analyzing the internal language organization of multilingual LLMs. Abstract: Multilingual large language models (LLMs) achieve strong performance across languages, yet how language capabilities are organized at the neuron level remains poorly understood. Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance. We propose CRANE, a relevance-based analysis framework that redefines language specificity in terms of functional necessity, identifying language-specific neurons through targeted neuron-level interventions. CRANE characterizes neuron specialization by their contribution to language-conditioned predictions rather than activation magnitude. Our implementation will be made publicly available. Neuron-level interventions reveal a consistent asymmetric pattern: masking neurons relevant to a target language selectively degrades performance on that language while preserving performance on other languages to a substantial extent, indicating language-selective but non-exclusive neuron specializations. Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than activation-based methods.
[47] ToolGate: Contract-Grounded and Verified Tool Execution for LLMs
Yanming Liu,Xinyue Peng,Jiannan Cao,Xinyi Wang,Songhang Deng,Jintao Chen,Jianwei Yin,Xuhong Zhang
Main category: cs.CL
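TL;DR: This paper proposes ToolGate, a forward execution framework that provides logical safety guarantees and verifiable state evolution for large language model (LLM) tool calling.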
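A minimal Python sketch of the contract-gated execution this entry describes, in which a precondition gates tool invocation and a postcondition gates the commit to state. It is an illustration under assumptions, not the authors' implementation: `ToolContract`, `gated_call`, and the `usd_to_eur` tool are hypothetical names, and the state is a plain dictionary rather than a typed store.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Assumption: the trusted symbolic state is modeled as a plain dict.
State = Dict[str, Any]


@dataclass
class ToolContract:
    """Hypothetical Hoare-style contract: {precondition} run {postcondition}."""
    name: str
    precondition: Callable[[State], bool]        # may the tool be invoked?
    run: Callable[[State], Any]                  # the tool call itself
    postcondition: Callable[[State, Any], bool]  # may the result be committed?
    commit_key: str                              # state key that receives the result


def gated_call(state: State, tool: ToolContract) -> State:
    """Return an updated copy of the state, or the state unchanged if a check fails."""
    if not tool.precondition(state):
        return state  # invocation blocked
    result = tool.run(state)
    if not tool.postcondition(state, result):
        return state  # result rejected, nothing committed
    new_state = dict(state)
    new_state[tool.commit_key] = result
    return new_state


# Hypothetical tool: convert a USD amount using a rate already in the state.
usd_to_eur = ToolContract(
    name="usd_to_eur",
    precondition=lambda s: isinstance(s.get("usd"), (int, float))
    and isinstance(s.get("rate"), float),
    run=lambda s: s["usd"] * s["rate"],
    postcondition=lambda s, r: isinstance(r, float) and r >= 0,
    commit_key="eur",
)

committed = gated_call({"usd": 100, "rate": 0.5}, usd_to_eur)
blocked = gated_call({"usd": "n/a"}, usd_to_eur)
```

Here `committed` gains an `eur` entry because both checks pass, while `blocked` is returned untouched because the precondition fails before the tool ever runs.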
Details
Motivation: Existing tool-calling frameworks based on natural language reasoning lack formal guarantees of logical safety and verifiability, leaving them vulnerable to hallucinated or invalid results. Method: ToolGate maintains an explicit symbolic state space (a typed key-value mapping) representing trusted world information and formalizes each tool as a Hoare-style contract with a precondition and a postcondition: the precondition gates tool invocation, and the postcondition decides, through runtime verification, whether the result is committed. Result: Experiments show that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks. Conclusion: ToolGate ensures that the symbolic state evolves only through verified tool executions, preventing erroneous or hallucinated results from corrupting the world representation and laying a foundation for more trustworthy and debuggable AI systems. Abstract: Large Language Models (LLMs) augmented with external tools have demonstrated remarkable capabilities in complex reasoning tasks. However, existing frameworks rely heavily on natural language reasoning to determine when tools can be invoked and whether their results should be committed, lacking formal guarantees for logical safety and verifiability. We present ToolGate, a forward execution framework that provides logical safety guarantees and verifiable state evolution for LLM tool calling. ToolGate maintains an explicit symbolic state space as a typed key-value mapping representing trusted world information throughout the reasoning process. Each tool is formalized as a Hoare-style contract consisting of a precondition and a postcondition, where the precondition gates tool invocation by checking whether the current state satisfies the required conditions, and the postcondition determines whether the tool's result can be committed to update the state through runtime verification. Our approach guarantees that the symbolic state evolves only through verified tool executions, preventing invalid or hallucinated results from corrupting the world representation. Experimental validation demonstrates that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks. This work establishes a foundation for building more trustworthy and debuggable AI systems that integrate language models with external tools.
[48] See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation
Naquee Rizwan,Subhankar Swain,Paramananda Bhaskar,Gagan Aryan,Shehryaar Shah Khan,Animesh Mukherjee
Main category: cs.CL
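TL;DR: This paper proposes a multimodal agent framework built on generative AI models that detects, explains, and intervenes on hateful memes under limited data. It is the first to study the three tasks together and has potential for real-world deployment.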
Details
Motivation: Existing work usually treats detection, explanation, and intervention for hateful memes separately and depends on large annotated datasets, which fails to reflect real-world conditions and is costly. Method: Leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to build a unified framework that handles different types of memes. Result: Presents a novel framework for hateful meme detection, explanation, and intervention that generalizes under low-data conditions and shows potential for use in real production environments. Conclusion: The approach is the first to integrate detection, explanation, and intervention in a unified framework and generalizes well with limited data, offering a practical option for real content moderation systems. Abstract: In this work, we examine hateful memes from three complementary angles (how to detect them, how to explain their content, and how to intervene before they are posted) by applying a range of strategies built on top of generative AI models. To the best of our knowledge, explanation and intervention have typically been studied separately from detection, which does not reflect real-world conditions. Further, since curating large annotated datasets for meme moderation is prohibitively expensive, we propose a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to cater to different types of memes. We believe this is the first work focused on generalizable hateful meme moderation under limited data conditions, and has strong potential for deployment in real-world production scenarios. Warning: Contains potentially toxic content.
[49] Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding
Sungmok Jung,Yeonkyoung So,Joonhak Lee,Sangho Kim,Yelim Ahn,Jaejin Lee
Main category: cs.CL
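TL;DR: This paper introduces Thunder-KoNUBench, a sentence-level benchmark reflecting the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, it analyzes the effects of model size and instruction tuning, and shows that fine-tuning on the benchmark improves negation understanding and broader contextual comprehension in Korean.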
Details
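Motivation: Negation is known to challenge LLMs, and benchmarks for evaluating negation understanding, especially in Korean, are scarce, so a systematic benchmark targeting Korean negation is needed. Method: First conducts a corpus-based analysis of Korean negation, then builds the sentence-level benchmark Thunder-KoNUBench, and finally evaluates 47 LLMs to analyze the effects of model size and instruction tuning and to test the effect of fine-tuning on the benchmark. Result: LLM performance degrades under negation; model size and instruction tuning affect negation understanding; and fine-tuning on Thunder-KoNUBench improves both negation understanding and overall contextual comprehension. Conclusion: Thunder-KoNUBench is an effective benchmark for Korean negation understanding: it exposes the weaknesses of current LLMs in handling negation and shows that targeted fine-tuning can substantially improve the relevant abilities. Abstract: Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.
[50] PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards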
Mukesh Ghimire,Aosong Feng,Liwen You,Youzhi Luo,Fang Liu,Xuan Zhu
Main category: cs.CL
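TL;DR: This paper proposes PRISM, a unified training framework that combines a Process Reward Model (PRM) with the model's internal confidence to achieve stable training and better test performance without ground-truth labels.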
Details
Motivation: Existing unsupervised learning signals based on internal consistency (such as entropy or self-certainty) are unreliable for large-scale, long-term training, and labeled data for high-quality problems is hard to obtain, so more reliable label-free learning methods are needed. Method: Proposes the PRISM framework, which combines a Process Reward Model (PRM) with the model's self-certainty: the PRM supplies an external consistency signal while internal confidence information is retained, and together they guide LLM post-training. Result: Experiments show that PRISM yields more stable training, improves test performance on tasks such as mathematical reasoning and code generation, and keeps the model's internal confidence calibrated. Conclusion: Combining external process rewards with internal confidence is a reliable and effective approach to unsupervised post-training, offering a viable path toward optimizing LLMs at scale without human annotation. Abstract: Current techniques for post-training Large Language Models (LLMs) rely either on costly human supervision or on external verifiers to boost performance on tasks such as mathematical reasoning and code generation. However, as LLMs improve their problem-solving, any further improvement will potentially require high-quality solutions to difficult problems that are not available to humans. As a result, learning from unlabeled data is becoming increasingly attractive in the research community. Existing methods extract learning signal from a model's consistency, either by majority voting or by converting the model's internal confidence into reward. Although internal consistency metrics such as entropy or self-certainty require no human intervention, as we show in this work, these are unreliable signals for large-scale and long-term training. To address the unreliability, we propose PRISM, a unified training framework that uses a Process Reward Model (PRM) to guide learning alongside the model's internal confidence in the absence of ground-truth labels. We show that effectively combining PRM with self-certainty can lead to both stable training and better test-time performance, and also keep the model's internal confidence in check.
[51] Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning
Feihu Jin,Shipeng Cen,Ying Tan
Main category: cs.CL
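TL;DR: Proposes a prior-informed zeroth-order optimization method that steers perturbation directions to improve gradient estimates, significantly accelerating convergence in LLM fine-tuning and outperforming conventional zeroth-order and gradient-based methods on multiple tasks.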
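For context, the conventional zeroth-order estimator that this entry builds on can be sketched as below: it averages two-point finite differences along random Gaussian directions, using loss evaluations only and no backpropagation. This is the standard baseline, not the paper's prior-informed variant, whose guiding-vector computation is not specified in this digest; `zo_gradient`, `quadratic`, and all parameter values are illustrative assumptions.

```python
import math
import random


def zo_gradient(loss, theta, eps=1e-3, n_samples=4000, seed=0):
    """Two-point zeroth-order gradient estimate using only loss evaluations."""
    rng = random.Random(seed)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        # Random Gaussian perturbation direction z ~ N(0, I).
        z = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = loss([t + eps * zi for t, zi in zip(theta, z)])
        minus = loss([t - eps * zi for t, zi in zip(theta, z)])
        # Central difference approximates the directional derivative along z.
        slope = (plus - minus) / (2.0 * eps)
        for i, zi in enumerate(z):
            grad[i] += slope * zi / n_samples
    return grad


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def quadratic(theta):
    """Toy loss 0.5 * ||theta||^2, whose true gradient is theta itself."""
    return 0.5 * sum(t * t for t in theta)


theta = [1.0, -2.0, 3.0, 0.5]
estimate = zo_gradient(quadratic, theta)
alignment = cosine(estimate, theta)  # close to 1 when the estimate is well aligned
```

With few samples the estimate is noisy, which is the high-variance problem the entry targets; lowering `n_samples` in this toy setup makes `alignment` fluctuate more from seed to seed.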
Details
Motivation: Conventional zeroth-order optimization relies on random perturbations, which leads to high-variance gradient estimates and slow convergence, making it hard to fine-tune large models efficiently. Method: Dynamically computes a guiding vector from Gaussian samples, uses this prior information to steer perturbation directions, and explores a greedy perturbation strategy to improve the efficiency of gradient estimation. Result: Theory shows that the proposed gradient estimator aligns more strongly with the true gradient direction; experiments on LLMs of different scales and architectures show faster convergence and better performance, and on OPT-13B the method outperforms conventional ZO methods and surpasses gradient-based baselines on 9 of 11 tasks. Conclusion: By introducing prior-guided perturbations, the method reduces the variance of zeroth-order optimization and improves the efficiency and performance of LLM fine-tuning, balancing efficiency and accuracy. Abstract: Fine-tuning large language models (LLMs) has achieved remarkable success across various NLP tasks, but the substantial memory overhead during backpropagation remains a critical bottleneck, especially as model scales grow. Zeroth-order (ZO) optimization alleviates this issue by estimating gradients through forward passes and Gaussian sampling, avoiding the need for backpropagation. However, conventional ZO methods suffer from high variance in gradient estimation due to their reliance on random perturbations, leading to slow convergence and suboptimal performance. We propose a simple plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation. Our method dynamically computes a guiding vector from Gaussian samples, which directs perturbations toward more informative directions, significantly accelerating convergence compared to standard ZO approaches. We further investigate a greedy perturbation strategy to explore the impact of prior knowledge on gradient estimation. Theoretically, we prove that our gradient estimator achieves stronger alignment with the true gradient direction, enhancing optimization efficiency. Extensive experiments across LLMs of varying scales and architectures demonstrate that our proposed method could seamlessly integrate into existing optimization methods, delivering faster convergence and superior performance. Notably, on the OPT-13B model, our method outperforms traditional ZO optimization across all 11 benchmark tasks and surpasses gradient-based baselines on 9 out of 11 tasks, establishing a robust balance between efficiency and accuracy.
[52] DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs
Anh Thi-Hoang Nguyen,Khanh Quoc Tran,Tin Van Huynh,Phuoc Tan-Hoang Nguyen,Cam Tan Nguyen,Kiet Van Nguyen
Main category: cs.CL
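TL;DR: This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task on hallucination detection for Vietnamese LLMs. It releases the ViHallu dataset of 10,000 annotated samples and evaluates a range of detection methods, advancing research on the reliability of Vietnamese AI systems.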
Details
Motivation: Low-to-medium resource languages such as Vietnamese lack standardized evaluation frameworks for hallucination detection, which limits the reliability of LLMs in real applications; dedicated benchmarks and datasets are needed. Method: Builds the ViHallu dataset of 10,000 (context, prompt, response) triplets labeled as no hallucination, intrinsic hallucination, or extrinsic hallucination; designs three prompt types (factual, noisy, and adversarial) to test model robustness; and runs a challenge with 111 participating teams to evaluate different models. Result: The best system reaches a macro-F1 of 84.80%, far above the baseline's 32.83%; instruction-tuned LLMs combined with structured prompting and ensemble strategies perform best, but intrinsic hallucination detection remains challenging. Conclusion: The ViHallu Challenge establishes the first benchmark for hallucination detection in Vietnamese LLMs, confirms the effectiveness of advanced methods, and highlights the difficulty of intrinsic hallucination detection, providing a basis for improving the trustworthiness of Vietnamese AI systems. Abstract: The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations -- fluent, plausible-sounding outputs that contradict or fabricate information. While hallucination detection has recently emerged as a priority in English-centric benchmarks, low-to-medium resource languages such as Vietnamese remain inadequately covered by standardized evaluation frameworks. This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese LLMs. We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples systematically partitioned into three hallucination categories: no hallucination, intrinsic, and extrinsic hallucinations. The dataset incorporates three prompt types -- factual, noisy, and adversarial -- to stress-test model robustness. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%, demonstrating that instruction-tuned LLMs with structured prompting and ensemble strategies substantially outperform generic architectures. However, the gap to perfect performance indicates that hallucination detection remains a challenging problem, particularly for intrinsic (contradiction-based) hallucinations.
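Macro-F1, the metric reported above, is the unweighted mean of per-class F1 scores, so the three hallucination categories count equally regardless of how many samples each contains. A small sketch follows, using made-up labels and predictions (not ViHallu data) and a hypothetical `macro_f1` helper:

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative labels only; the category names mirror the three classes above.
labels = ["no", "intrinsic", "extrinsic"]
y_true = ["no", "no", "intrinsic", "intrinsic", "extrinsic", "extrinsic"]
y_pred = ["no", "no", "intrinsic", "extrinsic", "extrinsic", "extrinsic"]

score = macro_f1(y_true, y_pred, labels)  # per-class F1: 1.0, 2/3, 0.8
```

Because every class carries the same weight, one missed intrinsic case pulls the score down as much as it would if that class were far rarer than the others.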
This work establishes a rigorous benchmark and explores a diverse range of detection methodologies, providing a foundation for future research into the trustworthiness and reliability of Vietnamese language AI systems.
[53] Fame Fades, Nature Remains: Disentangling the Character Identity of Role-Playing Agents
Yonghyun Jun,Junhyuk Choi,Jihyeong Park,Hwanhee Lee
Main category: cs.CL
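TL;DR: This paper proposes Character Identity, a multidimensional concept decomposed into Parametric Identity and Attributive Identity, and studies it in dialogue using a unified character profile schema. It finds that "Fame Fades": famous characters' initial advantage disappears as the conversation proceeds; and that "Nature Remains": personality traits are portrayed stably, but negative moral and interpersonal traits markedly reduce role-playing fidelity.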
Details
Motivation: Existing LLM-based role-playing agents lack a structured definition of character identity; they typically treat characters as arbitrary text inputs, without a systematic analysis of identity dimensions. Method: Proposes a two-layer structure for character identity (Parametric Identity and Attributive Identity), constructs a unified character profile schema, and generates famous and synthetic characters under identical structural constraints for single-turn and multi-turn dialogue experiments. Result: "Fame Fades": famous characters have an early advantage from pre-trained knowledge, but it vanishes quickly as the conversation progresses. "Nature Remains": models portray personality traits stably but are more sensitive to negative polarity in morality and interpersonal attitudes, which distorts the portrayal. Conclusion: The main bottleneck in role-playing fidelity is the portrayal of negative social natures, a finding that should guide future character construction and evaluation. Abstract: Despite the rapid proliferation of Role-Playing Agents (RPAs) based on Large Language Models (LLMs), the structural dimensions defining a character's identity remain weakly formalized, often treating characters as arbitrary text inputs. In this paper, we propose the concept of Character Identity, a multidimensional construct that disentangles a character into two distinct layers: (1) Parametric Identity, referring to character-specific knowledge encoded from the LLM's pre-training, and (2) Attributive Identity, capturing fine-grained behavioral properties such as personality traits and moral values. To systematically investigate these layers, we construct a unified character profile schema and generate both Famous and Synthetic characters under identical structural constraints. Our evaluation across single-turn and multi-turn interactions reveals two critical phenomena. First, we identify "Fame Fades": while famous characters hold a significant advantage in initial turns due to parametric knowledge, this edge rapidly vanishes as models prioritize accumulating conversational context over pre-trained priors. Second, we find that "Nature Remains": while models robustly portray general personality traits regardless of polarity, RPA performance is highly sensitive to the valence of morality and interpersonal relationships. Our findings pinpoint negative social natures as the primary bottleneck in RPA fidelity, guiding future character construction and evaluation.
[54] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Mingxin Li,Yanzhao Zhang,Dingkun Long,Keqin Chen,Sibo Song,Shuai Bai,Zhibo Yang,Pengjun Xie,An Yang,Dayiheng Liu,Jingren Zhou,Junyang Lin
Main category: cs.CL
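TL;DR: This report introduces the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series. Built on the Qwen3-VL foundation model, they support unified multimodal representations of text, images, document images, and video, are multilingual, and achieve state-of-the-art performance on a range of multimodal retrieval tasks.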
Details
Motivation: High-precision multimodal search requires mapping data from multiple modalities into a unified representation space while supporting flexible deployment and multilingual use. Method: Uses a multi-stage training paradigm, from large-scale contrastive pre-training to reranking model distillation; Qwen3-VL-Embedding supports Matryoshka Representation Learning and inputs up to 32k tokens, and Qwen3-VL-Reranker uses a cross-encoder architecture with cross-attention for fine-grained relevance estimation. Result: Qwen3-VL-Embedding-8B achieves an overall score of 77.8 on MMEB-V2, ranking first (as of January 8, 2025), and performs strongly on image-text retrieval, visual question answering, and video-text matching. Conclusion: The model series forms an efficient end-to-end multimodal search pipeline, reaches state-of-the-art results in both multimodal embedding and reranking, and suits a wide range of practical applications. Abstract: In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025).
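Matryoshka-style embeddings are typically used by keeping a leading prefix of the vector and re-normalizing it before computing cosine similarity, which is what makes the embedding dimension flexible. The sketch below illustrates that general usage pattern in plain Python; it is not the Qwen3-VL-Embedding API, and `truncate_embedding` and the toy vector are assumptions for illustration.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero prefix as it is to avoid dividing by zero.
    return [x / norm for x in head] if norm else head


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


full = [3.0, 4.0, 12.0, 0.0]         # toy 4-dimensional embedding
short = truncate_embedding(full, 2)  # leading 2 dimensions, renormalized
self_similarity = dot(short, short)  # unit vector, so this is 1 up to rounding
```

A shorter prefix costs less to store and compare, usually at some loss in retrieval quality; how much is lost depends on the model and is not stated in this digest.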
This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
[55] Automatic Classifiers Underdetect Emotions Expressed by Men
Ivan Smirnov,Segun T. Aroyehun,Paul Plener,David Garcia
Main category: cs.CL
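TL;DR: This paper studies gender bias in emotion detection models. Using more than one million self-annotated posts, it finds that error rates are generally higher for texts written by men than by women, indicating that the fairness of existing tools across gender groups remains a problem.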
Details
Motivation: Sentiment and emotion classifiers must be reliable across different populations, and benchmarks built on third-party annotations can conceal systematic biases. Method: A large-scale self-annotated dataset and a pre-registered research design are used to assess gender bias across 414 combinations of models and emotion-related classes. Result: Across different types of classifiers and a range of emotions, error rates are consistently higher for texts authored by men than for texts authored by women. Conclusion: Current sentiment analysis tools should be applied with caution when the gender composition of a sample is unknown or variable; equitable performance across demographic groups remains unsolved. Abstract: The widespread adoption of automatic sentiment and emotion classifiers makes it important to ensure that these tools perform reliably across different populations. Yet their reliability is typically assessed using benchmarks that rely on third-party annotators rather than the individuals experiencing the emotions themselves, potentially concealing systematic biases. In this paper, we use a unique, large-scale dataset of more than one million self-annotated posts and a pre-registered research design to investigate gender biases in emotion detection across 414 combinations of models and emotion-related classes. We find that across different types of automatic classifiers and various underlying emotions, error rates are consistently higher for texts authored by men compared to those authored by women. We quantify how this bias could affect results in downstream applications and show that current machine learning tools, including large language models, should be applied with caution when the gender composition of a sample is not known or variable. Our findings demonstrate that sentiment analysis is not yet a solved problem, especially in ensuring equitable model behaviour across demographic groups.
[56] AM³Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs
Han Zhu,Jiale Chen,Chengkun Cai,Shengjie Sun,Haoran Li,Yujin Zhou,Chi-Min Chan,Pengcheng Wen,Lei Li,Sirui Han,Yike Guo
Main category: cs.CL
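TL;DR: This paper presents the InterSafe-V dataset and the AM³Safety framework for improving the safety of multi-modal large language models in multi-turn dialogue, substantially reducing attack success rate while improving harmlessness and helpfulness.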
Details
Motivation: Existing RLHF methods mainly target single-turn visual question answering, struggle with safety threats that accumulate gradually across multi-turn multi-modal dialogues, and depend on costly manual annotation, which limits their scalability in real dialogue settings. Method: The authors build InterSafe-V, an open-source multi-modal dialogue dataset with 11,270 dialogues and 500 dedicated refusal samples, and propose the AM³Safety framework, which combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning and optimizes over entire dialogues using turn-aware dual-objective rewards. Result: Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show a drop of more than 10% in Attack Success Rate (ASR), a gain of at least 8% in the harmless dimension, and a gain of over 13% in the helpful dimension, while general capabilities are preserved. Conclusion: AM³Safety combined with InterSafe-V effectively improves safety alignment of multi-modal large language models in multi-turn dialogue and is practical and scalable. Abstract: Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answer (VQA) tasks and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM³Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues.
Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show a decrease of more than 10% in Attack Success Rate (ASR), together with an increase of at least 8% in the harmless dimension and over 13% in the helpful dimension of MLLMs on multi-modal multi-turn safety benchmarks, while preserving their general abilities.
[57] RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation
Huawei Zheng,Xinqi Jiang,Sen Yang,Shouling Ji,Yingcai Wu,Dazhen Deng
Main category: cs.CL
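TL;DR: Proposes an end-to-end framework that combines knowledge-graph guidance with dual-path obfuscation rewriting to generate domain-relevant, implicit harmful prompts in support of LLM safety research.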
Details
Motivation: Existing harmful-prompt datasets are largely hand-built and focus on explicit attacks, so they fail to reflect implicit threats expressed through indirect domain knowledge in real-world settings; a more realistic way to generate implicit harmful prompts is needed. Method: An end-to-end framework that first uses knowledge-graph guidance to generate domain-relevant harmful prompts, then applies dual-path obfuscation rewriting (direct rewriting and context-enhanced rewriting) to turn explicit harmful prompts into more implicit variants. Result: The framework systematically produces high-quality harmful-prompt datasets that combine strong domain relevance with implicitness and are better at bypassing the defenses of modern LLMs. Conclusion: The method supplies more realistic and more challenging evaluation data for red-teaming LLMs and advances research on domain-specific model safety. Abstract: Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and still largely rely on manual construction; public datasets mainly focus on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts, expressed through indirect domain knowledge, are harder to detect and better reflect real-world threats. We identify two challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies dual-path obfuscation rewriting to convert explicit harmful prompts into implicit variants via direct and context-enhanced rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. We release our code and datasets on GitHub.
[58] Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval
Seyeon Jeong,Yeonjun Choi,JongWook Kim,Beakcheol Jang
Main category: cs.CL
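TL;DR: Tool-MAD is a multi-agent debate framework that strengthens fact verification by assigning each agent a different external tool (such as a search API or a RAG module). It introduces an adaptive query mechanism and a judging strategy based on Faithfulness and Answer Relevance, and outperforms existing methods on several fact verification tasks.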
Details
Motivation: Existing MAD systems rely on internal knowledge or static documents and are prone to hallucination; MADKE adds external evidence, but its fixed retrieval mechanism adapts poorly to new arguments that emerge during a debate. A more flexible, dynamic way to integrate external knowledge is needed to improve accuracy and robustness. Method: The Tool-MAD framework (1) equips each agent with a heterogeneous external tool to encourage diverse reasoning; (2) uses an adaptive query formulation mechanism that iteratively refines retrieval queries as the debate unfolds; and (3) introduces Faithfulness and Answer Relevance scores to support the Judge agent's final decision. Result: On four fact verification benchmarks, Tool-MAD improves accuracy by up to 5.5% over existing MAD methods; in the medical domain it shows strong robustness and adaptability across tool configurations and domain conditions. Conclusion: Through tool augmentation, dynamic retrieval, and quantitative scoring, Tool-MAD mitigates LLM hallucination and improves the accuracy and practicality of multi-agent debate systems, with potential for broad use in real-world fact-checking. Abstract: Large Language Models (LLMs) suffer from hallucinations and factual inaccuracies, especially in complex reasoning and fact verification tasks. Multi-Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue, promoting diverse reasoning and mutual verification. However, existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations. While MADKE introduces external evidence to mitigate this, its one-time retrieval mechanism limits adaptability to new arguments or emerging information during the debate. To address these limitations, we propose Tool-MAD, a multi-agent debate framework that enhances factual verification by assigning each agent a distinct external tool, such as a search API or RAG module. Tool-MAD introduces three key innovations: (1) a multi-agent debate framework where agents leverage heterogeneous external tools, encouraging diverse perspectives, (2) an adaptive query formulation mechanism that iteratively refines evidence retrieval based on the flow of the debate, and (3) the integration of Faithfulness and Answer Relevance scores into the final decision process, allowing the Judge agent to quantitatively assess the coherence and question alignment of each response and effectively detect hallucinations. Experimental results on four fact verification benchmarks demonstrate that Tool-MAD consistently outperforms state-of-the-art MAD frameworks, achieving up to 5.5% accuracy improvement.
Furthermore, in medically specialized domains, Tool-MAD exhibits strong robustness and adaptability across various tool configurations and domain conditions, confirming its potential for broader real-world fact-checking applications.
[59] PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks
Yehoon Jang,Chaewon Lee,Hyun-seok Min,Sungchul Choi
Main category: cs.CL
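TL;DR: This paper introduces PILOT-Bench, the first benchmark centered on the US Patent Trial and Appeal Board (PTAB), for systematically evaluating the legal reasoning of large language models (LLMs) in the patent domain. The benchmark aligns PTAB decisions with patent data and defines three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. Experiments show that closed-source models outperform open-source models on the Issue Type task, revealing a substantial gap in reasoning capability.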
Details
Motivation: LLM use in patent and legal practice has so far been confined to lightweight tasks, and there is no systematic way to evaluate patent-domain legal reasoning. A dedicated benchmark is needed to measure how well models combine technical understanding with legal reasoning in complex patent cases. Method: PILOT-Bench integrates PTAB decisions with USPTO patent data and defines three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. A range of closed-source and open-source LLMs is evaluated, with analysis across model families, input settings, and error patterns. Result: On the Issue Type task, closed-source models exceed 0.75 Micro-F1, while the strongest open-source model, Qwen-8B, reaches only about 0.56, a clear gap in reasoning capability. The other tasks also expose difficulties in identifying legal authorities and judging subdecisions. Conclusion: PILOT-Bench is the first systematic benchmark for patent-domain legal reasoning; it exposes the limits of current LLMs, open-source models in particular, on complex legal reasoning and points to dataset design and model alignment as directions for improvement. Abstract: The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1 score, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot-bench.
[60] Differential syntactic and semantic encoding in LLMs
Santiago Acevedo,Alessandro Laio,Marco Baroni
Main category: cs.CL
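TL;DR: Averaging the hidden-representation vectors of sentences that share syntactic structure or meaning yields "centroids" that capture a significant share of syntactic and semantic information. This suggests that syntax and semantics are at least partially linearly encoded in large language models, with different encoding profiles across layers, and that the two can be partially decoupled.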
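As an illustration of the centroid analysis summarized in the TL;DR above, the following is a minimal sketch, not the authors' code. It assumes sentence vectors are plain lists of floats; the toy vectors and the helper names (`mean_vector`, `subtract`, `cosine`) are hypothetical and chosen only for this example.

```python
import math

def mean_vector(vectors):
    """Centroid: element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def subtract(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy stand-ins for hidden vectors of sentences sharing one syntactic
# structure (hypothetical values, not taken from any model).
same_syntax = [[1.0, 0.2, 0.0], [0.9, 0.0, 0.3], [1.1, -0.1, -0.2]]

centroid = mean_vector(same_syntax)
similarity_before = cosine(same_syntax[0], same_syntax[1])
similarity_after = cosine(
    subtract(same_syntax[0], centroid),
    subtract(same_syntax[1], centroid),
)
```

In this toy case the shared component dominates the raw similarity, so removing the centroid lowers the similarity between the two matched sentences, which is the kind of effect the entry reports.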
Details
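Motivation: To investigate how the inner layers of large language models (LLMs) encode syntactic and semantic information, and in particular whether the two are represented separately or jointly. Method: Syntactic and semantic centroids are obtained by averaging the hidden-representation vectors of sentences that share syntactic structure or meaning; the study then measures how subtracting these centroids affects sentence similarity and examines the encoding profiles across layers. Result: Subtracting the syntactic and semantic centroids strongly affects the similarity between a sentence and its syntactically or semantically matched counterparts, indicating that syntax and semantics are at least partially linearly encoded; the two are distributed differently across layers and can be partially decoupled. Conclusion: In the internal representations of LLMs, syntactic and semantic information is encoded in a partially linear, differentiated way, supporting the hypothesis that the two are separable in representation space. Abstract: We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic "centroids" from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.
[61] Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence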
Shengyin Sun,Yiming Li,Renxi Liu,Weizhe Lin,Hui-Ling Zhen,Xianzhi Yu,Mingxuan Yuan,Chen Ma
Main category: cs.CL
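TL;DR: This paper proposes a training-free verification mechanism based on KL divergence that replaces Judge Decoding's reliance on expensive supervision, achieving strong and more robust performance on reasoning and code generation tasks.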
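To make the idea of a KL-based verification check concrete, here is a minimal sketch under stated assumptions; it is not the paper's mechanism. The direction of the divergence (target relative to draft), the accept-if-below-threshold rule, the threshold value, and the function names are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def accept_draft_token(target_logits, draft_logits, threshold=0.5):
    """Hypothetical rule: keep the draft token when the draft and target
    next-token distributions are close (KL at or below the threshold)."""
    target = softmax(target_logits)
    draft = softmax(draft_logits)
    return kl_divergence(target, draft) <= threshold

agreeing = accept_draft_token([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
conflicting = accept_draft_token([5.0, 0.0, 0.0], [0.0, 5.0, 0.0])
```

The check needs only the logits both models already produce during speculative decoding, which is why no trained judge or labeled data appears in the sketch.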
Details
Motivation: Existing Judge Decoding methods depend on expensive, noisy supervision to learn criticality scores, which limits their scalability and generalization. Method: Starting from first principles, the paper establishes a structural correspondence between learned linear judges and KL divergence, and on that basis proposes a simple, training-free verification mechanism based on the KL divergence between the draft and target distributions. Result: Across several reasoning and coding benchmarks, the method matches or outperforms complex trained judges (such as AutoJudge) and is more robust to domain shift. Conclusion: Theory and experiments indicate that the distributional divergence itself carries enough information for effective verification, so no additional supervision is needed and the supervision bottleneck disappears. Abstract: Judge Decoding accelerates LLM inference by relaxing the strict verification of Speculative Decoding, yet it typically relies on expensive and noisy supervision. In this work, we revisit this paradigm from first principles, revealing that the "criticality" scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence. We theoretically prove a structural correspondence between learned linear judges and Kullback-Leibler (KL) divergence, demonstrating they rely on the same underlying logit primitives. Guided by this, we propose a simple, training-free verification mechanism based on KL divergence. Extensive experiments across reasoning and coding benchmarks show that our method matches or outperforms complex trained judges (e.g., AutoJudge), offering superior robustness to domain shifts and eliminating the supervision bottleneck entirely.
[62] LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal
Dongjun Kim,Jeongho Yoon,Chanjun Park,Heuiseok Lim
Main category: cs.CL
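TL;DR: Proposes LANGSAE EDITING, which uses a sparse autoencoder to remove language-identity signal from multilingual embeddings directly in vector space, improving cross-language retrieval.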
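The encode, suppress, reconstruct pattern behind this kind of editing can be sketched as follows. This is a toy illustration, not the LANGSAE EDITING implementation: the identity encoder and decoder weights, the mean-activation-gap statistic used to flag language units, the `gap` value, and all names are assumptions made for this example.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Toy autoencoder weights (identity maps, purely hypothetical).
W_ENC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_DEC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def encode(embedding):
    return relu(matvec(W_ENC, embedding))

def decode(latent):
    return matvec(W_DEC, latent)

def find_language_units(embeddings_by_language, gap=0.5):
    """Flag latent units whose mean activation differs across languages
    by more than `gap` (an assumed cross-language statistic)."""
    means = {}
    for language, embeddings in embeddings_by_language.items():
        codes = [encode(e) for e in embeddings]
        means[language] = [sum(col) / len(codes) for col in zip(*codes)]
    n_units = len(next(iter(means.values())))
    flagged = []
    for unit in range(n_units):
        values = [per_language[unit] for per_language in means.values()]
        if max(values) - min(values) > gap:
            flagged.append(unit)
    return flagged

def edit_embedding(embedding, language_units):
    """Zero the flagged units, then reconstruct in the original dimension."""
    latent = encode(embedding)
    for unit in language_units:
        latent[unit] = 0.0
    return decode(latent)

# Toy data: the third coordinate carries only language identity.
data = {
    "en": [[0.5, 0.2, 1.0], [0.1, 0.6, 1.0]],
    "de": [[0.5, 0.2, 0.0], [0.1, 0.6, 0.0]],
}
language_units = find_language_units(data)
edited_en = edit_embedding(data["en"][0], language_units)
edited_de = edit_embedding(data["de"][0], language_units)
```

Because the edited vector has the same dimensionality as the input, it could be written back to an existing vector index without re-encoding the underlying text, which is the compatibility property the entry emphasizes.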
Details
Motivation: Multilingual embeddings carry language-identity information, which inflates similarity between same-language document pairs and hurts cross-language retrieval. Method: A post-hoc sparse autoencoder (LANGSAE EDITING) is trained; it identifies language-associated latent units from cross-language activation statistics, suppresses them, and reconstructs embeddings in the original dimensionality. Result: Experiments across multiple languages show improved ranking quality and cross-language coverage, with especially strong gains for languages written in distinct scripts. Conclusion: LANGSAE EDITING effectively removes language-identity bias from multilingual embeddings and is compatible with existing systems, with no need to retrain the encoder or re-encode text. Abstract: Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with especially strong gains for script-distinct languages.
[63] NC2C: Automated Convexification of Generic Non-Convex Optimization Problems
Xinyue Peng,Yanming Liu,Yihan Cang,Yuwei Zhang,Xinyi Wang,Songhang Deng,Jiannan Cao
Main category: cs.CL
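TL;DR: Proposes NC2C, an LLM-based end-to-end automated framework that transforms non-convex optimization problems into convex form, substantially improving solving efficiency and feasibility.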
Details
Motivation: Solving non-convex optimization problems traditionally depends on expert knowledge, and manual convexification is inefficient; an automated method is needed to lower the barrier and improve results. Method: NC2C uses the mathematical reasoning ability of LLMs, together with symbolic reasoning, adaptive transformation, and iterative validation, to detect non-convex components, select a strategy, and generate valid convex equivalents, with error correction loops and feasibility domain correction to keep the transformation robust. Result: On 100 non-convex problems, NC2C achieves an 89.3% execution rate and a 76% success rate, well above baseline methods. Conclusion: NC2C automates the non-convex to convex transformation, reduces dependence on expert knowledge, and extends the reach of convex solvers to complex optimization problems. Abstract: Non-convex optimization problems are pervasive across mathematical programming, engineering design, and scientific computing, often posing intractable challenges for traditional solvers due to their complex objective functions and constrained landscapes. To address the inefficiency of manual convexification and the over-reliance on expert knowledge, we propose NC2C, an LLM-based end-to-end automated framework designed to transform generic non-convex optimization problems into solvable convex forms using large language models. NC2C leverages LLMs' mathematical reasoning capabilities to autonomously detect non-convex components, select optimal convexification strategies, and generate rigorous convex equivalents. The framework integrates symbolic reasoning, adaptive transformation techniques, and iterative validation, equipped with error correction loops and feasibility domain correction mechanisms to ensure the robustness and validity of transformed problems. Experimental results on a diverse dataset of 100 generic non-convex problems demonstrate that NC2C achieves an 89.3% execution rate and a 76% success rate in producing feasible, high-quality convex transformations. This outperforms baseline methods by a significant margin, highlighting NC2C's ability to leverage LLMs for automated non-convex to convex transformation, reduce expert dependency, and enable efficient deployment of convex solvers for previously intractable optimization tasks.
[64] Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework
Junhyuk Choi,Jeongyoun Kwon,Heeju Kim,Haeun Cho,Hayeong Jung,Sehee Min,Bugeun Kim
Main category: cs.CL
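TL;DR: This paper presents the first systematic analysis of role-based authority bias in free-form multi-agent evaluation. It finds that expert and referent authority roles exert more influence than legitimate ones, and that authority bias arises because authoritative agents hold their positions, not because general agents actively conform.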
Details
Motivation: To examine the authority bias that authoritative roles introduce into LLM-based multi-agent systems and its effect on agent interactions, an area that remains underexplored. Method: Following French and Raven's theory of power, authoritative roles are classified as legitimate, referent, or expert, and 12-turn conversation experiments are run in the ChatEval framework using GPT-4o and DeepSeek R1. Result: Expert and referent roles exert stronger influence than legitimate roles. Authority bias does not come from general agents actively conforming; it comes from authoritative roles consistently holding their positions while general agents remain flexible. Clear position statements are a precondition for bias, and neutral responses do not produce it. Conclusion: The design of authoritative roles strongly shapes interaction patterns in multi-agent systems; clear position statements and the choice of role type matter when building frameworks with asymmetric interactions. Abstract: Multi-agent systems utilizing large language models often assign authoritative roles to improve performance, yet the impact of authority bias on agent interactions remains underexplored. We present the first systematic analysis of role-based authority bias in free-form multi-agent evaluation using ChatEval. Applying French and Raven's power-based theory, we classify authoritative roles into legitimate, referent, and expert types and analyze their influence across 12-turn conversations. Experiments with GPT-4o and DeepSeek R1 reveal that Expert and Referent power roles exert stronger influence than Legitimate power roles. Crucially, authority bias emerges not through active conformity by general agents, but through authoritative roles consistently maintaining their positions while general agents demonstrate flexibility. Furthermore, authority influence requires clear position statements, as neutral responses fail to generate bias. These findings provide key insights for designing multi-agent frameworks with asymmetric interaction patterns.
[65] When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection
Ke Sun,Guangsheng Bao,Han Cui,Yue Zhang
Main category: cs.CL
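TL;DR: This paper proposes a new method for detecting AI-generated text based on late-stage generation stability. By analyzing the temporal dynamics of log-probability fluctuations, it achieves state-of-the-art performance on multiple benchmarks.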
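A rough sketch of what late-stage fluctuation features might look like is given below. The entry does not state the exact formulas, so both definitions here are assumptions: "derivative dispersion" is taken as the standard deviation of first differences of token log-probabilities, and "local volatility" as the mean rolling-window standard deviation, each computed over the second half of the sequence. The toy sequences are invented.

```python
import statistics

def late_stage(log_probs, start_fraction=0.5):
    """Keep only the later part of the sequence (second half by default)."""
    return log_probs[int(len(log_probs) * start_fraction):]

def derivative_dispersion(log_probs):
    """Assumed definition: std of first differences in the late stage."""
    tail = late_stage(log_probs)
    differences = [b - a for a, b in zip(tail, tail[1:])]
    return statistics.pstdev(differences)

def local_volatility(log_probs, window=4):
    """Assumed definition: mean rolling-window std in the late stage."""
    tail = late_stage(log_probs)
    windows = [tail[i:i + window] for i in range(len(tail) - window + 1)]
    return sum(statistics.pstdev(w) for w in windows) / len(windows)

# Invented per-token log-probabilities: one sequence settles down in its
# second half, the other keeps fluctuating throughout.
settling = [-3.0, -0.5, -2.5, -0.8, -1.0, -1.1,
            -1.0, -1.1, -1.0, -1.1, -1.0, -1.1]
fluctuating = [-3.0, -0.5, -2.5, -0.8, -2.9, -0.4,
               -3.1, -0.6, -2.7, -0.5, -3.0, -0.7]

settling_features = (derivative_dispersion(settling),
                     local_volatility(settling))
fluctuating_features = (derivative_dispersion(fluctuating),
                        local_volatility(fluctuating))
```

Both features come from a single pass over log-probabilities the scoring model already outputs, consistent with the entry's point that no perturbation sampling or extra model access is needed.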
Details
Motivation: Existing zero-shot detection methods typically ignore the temporal dynamics of autoregressive generation and only aggregate token-level statistics over the whole sequence, which limits their detection ability. Method: Two new features are proposed, Derivative Dispersion and Local Volatility, both computed solely from late-stage statistics, and detection exploits the Late-Stage Volatility Decay phenomenon. Result: The method reaches state-of-the-art performance on the EvoBench and MAGE benchmarks and is strongly complementary to existing global methods; an analysis of more than 120k text samples finds that AI-generated text has 24-32% lower volatility in the second half of the sequence. Conclusion: Late-stage volatility decay effectively separates AI-generated from human text; the method needs no perturbation sampling or additional model access, is simple and efficient, and has practical potential. Abstract: Zero-shot detection methods for AI-generated text typically aggregate token-level statistics across entire sequences, overlooking the temporal dynamics inherent to autoregressive generation. We analyze over 120k text samples and reveal Late-Stage Volatility Decay: AI-generated text exhibits rapidly stabilizing log probability fluctuations as generation progresses, while human writing maintains higher variability throughout. This divergence peaks in the second half of sequences, where AI-generated text shows 24-32% lower volatility. Based on this finding, we propose two simple features, Derivative Dispersion and Local Volatility, which are computed exclusively from late-stage statistics. Without perturbation sampling or additional model access, our method achieves state-of-the-art performance on EvoBench and MAGE benchmarks and demonstrates strong complementarity with existing global methods.
[66] RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection
Zhiwei Liu,Runteng Guo,Baojie Qu,Yuechen Jiang,Min Peng,Qianqian Xie,Sophia Ananiadou
Main category: cs.CL
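TL;DR: This paper proposes RAAR, a retrieval-augmented multi-agent reasoning framework for cross-domain misinformation detection. Through multi-perspective evidence retrieval and collaborative multi-agent reasoning, it substantially improves performance on cross-domain tasks.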
Details
Motivation: Existing methods rely on single-perspective cues and generalize poorly to new or underrepresented domains, while large models, though strong on complex tasks, are constrained by same-distribution assumptions. A framework that enables cross-domain transfer and systematic reasoning is needed. Method: RAAR retrieves multi-perspective source-domain evidence aligned with the target sample's semantics, sentiment, and writing style, uses specialized multi-agent collaboration to generate verifiable multi-step reasoning paths, and trains a single multi-task verifier with supervised fine-tuning and reinforcement learning to strengthen reasoning and verification. Result: On three cross-domain misinformation detection tasks, the RAAR-8b and RAAR-14b models trained with RAAR substantially outperform the base models, other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches. Conclusion: RAAR is the first retrieval-augmented multi-agent reasoning framework for cross-domain misinformation detection; it addresses the challenges of cross-domain transfer and systematic reasoning and shows strong generalization and reasoning performance. Abstract: Cross-domain misinformation detection is challenging, as misinformation arises across domains with substantial differences in knowledge and discourse. Existing methods often rely on single-perspective cues and struggle to generalize to challenging or underrepresented domains, while reasoning large language models (LLMs), though effective on complex tasks, are limited to same-distribution data. To address these gaps, we introduce RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection. To enable cross-domain transfer beyond same-distribution assumptions, RAAR retrieves multi-perspective source-domain evidence aligned with each target sample's semantics, sentiment, and writing style. To overcome single-perspective modeling and missing systematic reasoning, RAAR constructs verifiable multi-step reasoning paths through specialized multi-agent collaboration, where perspective-specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. RAAR further applies supervised fine-tuning and reinforcement learning to train a single multi-task verifier to enhance verification and reasoning capabilities. Based on RAAR, we trained the RAAR-8b and RAAR-14b models. Evaluation on three cross-domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches.
The project will be released at https://github.com/lzw108/RAAR.
[67] Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics
Oshri Naparstek
Main category: cs.CL
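TL;DR: Proposes a continuous autoregressive language generation model that evolves token representations in continuous space and discretizes them only after convergence, enabling stable and diverse text generation without token-level sampling.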
Details
Motivation: Conventional autoregressive language models discretize too early at every generation step, which makes generation unstable, repetitive, and sensitive to decoding strategy; a generation mechanism that models uncertainty better is needed. Method: A continuous autoregressive generation framework in which tokens are represented as continuous vectors that gradually "mature" over multiple update steps; the vectors evolve through a deterministic dynamical process and are decoded into discrete tokens only after they have sufficiently converged. Result: Using deterministic decoding (argmax) alone, the method produces coherent and diverse text without token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms, and generation is markedly more stable. Conclusion: This is the first autoregressive language model that evolves continuous token representations to convergence before discretization, achieving stable text generation without token-level sampling and offering a new paradigm for language modeling. Abstract: Autoregressive language models are conventionally defined over discrete token sequences, committing to a specific token at every generation step. This early discretization forces uncertainty to be resolved through token-level sampling, often leading to instability, repetition, and sensitivity to decoding heuristics. In this work, we introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that mature over multiple update steps before being discretized. Rather than sampling tokens, the model evolves continuous token representations through a deterministic dynamical process, committing to a discrete token only when the representation has sufficiently converged. Discrete text is recovered via hard decoding, while uncertainty is maintained and resolved in the continuous space. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax), without reliance on token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations, such as stochastic dynamics or history smoothing, can be incorporated naturally but are not required for the model to function. To our knowledge, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.
[68] MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News
Zhiwei Liu,Paul Thompson,Jiaqi Rong,Baojie Qu,Runteng Guo,Min Peng,Qianqian Xie,Sophia Ananiadou
Main category: cs.CL
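TL;DR: This paper introduces MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, supporting fine-grained localisation, type classification, and explanation.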
Details
Motivation: Existing misinformation detection methods mostly make binary judgments over whole paragraphs or claims, so they miss cases where true and false details co-exist and offer little interpretability or fine-grained analysis. Method: The MisSpans dataset pairs real and fake news stories and defines three tasks: MisSpansIdentity (locating false spans), MisSpansType (classifying the type of misinformation), and MisSpansExplanation (generating span-grounded explanations). A standardised annotation process ensures quality, and 15 mainstream LLMs are evaluated in zero-shot and one-shot settings. Result: Expert inter-annotator agreement is high. Experiments show that fine-grained misinformation identification is very challenging and that performance depends on several factors, including model size, reasoning capability, and domain-specific textual features. Conclusion: MisSpans provides a new benchmark for fine-grained misinformation analysis, exposes the limits of current models, and underlines the need to understand how multiple interacting factors affect performance. Abstract: Online misinformation is increasingly pervasive, yet most existing benchmarks and methods evaluate veracity at the level of whole claims or paragraphs using coarse binary labels, obscuring how true and false details often co-exist within single sentences. These simplifications also limit interpretability: global explanations cannot identify which specific segments are misleading or differentiate how a detail is false (e.g., distorted vs. fabricated). To address these gaps, we introduce MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, consisting of paired real and fake news stories. MisSpans defines three complementary tasks: MisSpansIdentity for pinpointing false spans within sentences, MisSpansType for categorising false spans by misinformation type, and MisSpansExplanation for providing rationales grounded in identified spans. Together, these tasks enable fine-grained localisation, nuanced characterisation beyond true/false, and actionable explanations. Expert annotators were guided by standardised guidelines and consistency checks, leading to high inter-annotator agreement. We evaluate 15 representative LLMs, including reasoning-enhanced and non-reasoning variants, under zero-shot and one-shot settings. Results reveal the challenging nature of fine-grained misinformation identification and analysis, and highlight the need for a deeper understanding of how performance may be influenced by multiple interacting factors, including model size and reasoning capabilities, along with domain-specific textual features.
This project will be available at https://github.com/lzw108/MisSpans.
[69] A Navigational Approach for Comprehensive RAG via Traversal over Proposition Graphs
Maxime Delmas,Lei Xu,André Freitas
Main category: cs.CL
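TL;DR: ToPG (Traversal over Proposition Graphs) is a new RAG framework that builds a heterogeneous graph of propositions, entities, and passages and combines query-aware graph traversal with an iterative Suggestion-Selection mechanism, performing strongly on both simple and complex question answering.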
Details
Motivation: Existing RAG methods lack structural connectivity when handling complex multi-hop queries, while knowledge-graph methods do poorly on single-hop factual queries; ToPG aims to combine the strengths of both. Method: The knowledge base is modeled as a heterogeneous graph of propositions, entities, and passages, and queried with iterative Suggestion-Selection cycles: the Suggestion phase performs a query-aware traversal of the graph, and the Selection phase uses LLM feedback to prune candidates and seed the next step. Result: On three different QA tasks (Simple, Complex, and Abstract QA), ToPG performs strongly on both accuracy- and quality-based metrics. Conclusion: ToPG shows that combining query-aware graph traversal with fine-grained factual representation is key to efficient structured RAG. Abstract: Standard RAG pipelines based on chunking excel at simple factual retrieval but fail on complex multi-hop queries due to a lack of structural connectivity. Conversely, initial strategies that interleave retrieval with reasoning often lack global corpus awareness, while Knowledge Graph (KG)-based RAG performs strongly on complex multi-hop tasks but suffers on fact-oriented single-hop queries. To bridge this gap, we propose a novel RAG framework: ToPG (Traversal over Proposition Graphs). ToPG models its knowledge base as a heterogeneous graph of propositions, entities, and passages, effectively combining the granular fact density of propositions with graph connectivity. We leverage this structure using iterative Suggestion-Selection cycles, where the Suggestion phase enables a query-aware traversal of the graph, and the Selection phase provides LLM feedback to prune irrelevant propositions and seed the next iteration. Evaluated on three distinct QA tasks (Simple, Complex, and Abstract QA), ToPG demonstrates strong performance across both accuracy- and quality-based metrics. Overall, ToPG shows that query-aware graph traversal combined with factual granularity is a critical component for efficient structured RAG systems. ToPG is available at https://github.com/idiap/ToPG.
[70] EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis
Xuanguang Pan,Chongyang Tao,Jiayuan Bai,Jianling Gao,Zhengwei Tao,Xiansheng Zhou,Gavin Cheung,Shuai Ma
Main category: cs.CL
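TL;DR: Proposes EvolSQL, a structure-aware data synthesis framework that uses query expansion and directional evolution strategies to generate high-quality, diverse Text-to-SQL data, substantially improving model performance with far less data.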
Details
Motivation: Existing Text-to-SQL datasets are either small or structurally limited, making it hard to train models that generalize well, and they cover complex SQL structures poorly. Method: The EvolSQL framework comprises (1) Query-SQL expansion to increase question diversity and schema coverage; (2) six atomic transformation operators derived from the SQL Abstract Syntax Tree, enabling adaptive directional evolution along relational, predicate, aggregation, and nesting dimensions; and (3) execution-grounded SQL refinement and schema-aware deduplication to ensure data quality. Result: Using only 1/18 of the data volume of the SynSQL dataset, a fine-tuned 7B model outperforms one trained on SynSQL, confirming the efficiency and quality of EvolSQL-generated data. Conclusion: EvolSQL generates high-quality, diverse Text-to-SQL training data, substantially improves small models on complex SQL tasks, and offers a new approach for data-scarce settings. Abstract: Training effective Text-to-SQL models remains challenging due to the scarcity of high-quality, diverse, and structurally complex datasets. Existing methods either rely on limited human-annotated corpora, or synthesize datasets directly by simply prompting LLMs without explicit control over SQL structures, often resulting in limited structural diversity and complexity. To address this, we introduce EvolSQL, a structure-aware data synthesis framework that evolves SQL queries from seed data into richer and more semantically diverse forms. EvolSQL starts with an exploratory Query-SQL expansion to broaden question diversity and improve schema coverage, and then applies an adaptive directional evolution strategy using six atomic transformation operators derived from the SQL Abstract Syntax Tree to progressively increase query complexity across relational, predicate, aggregation, and nesting dimensions. An execution-grounded SQL refinement module and schema-aware deduplication further ensure the creation of high-quality, structurally diverse mapping pairs. Experimental results show that a 7B model fine-tuned on our data outperforms one trained on the much larger SynSQL dataset using only 1/18 of the data.
[71] Mind2Report: A Cognitive Deep Research Agent for Expert-Level Commercial Report Synthesis
Mingyue Cheng,Daoyu Wang,Qi Liu,Shuo Yu,Xiaoyu Tao,Yuqian Wang,Chengzhong Chu,Yu Duan,Mingkang Long,Enhong Chen
Main category: cs.CL
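TL;DR: Proposes Mind2Report, a cognitive deep research agent that emulates a commercial analyst's reasoning process to generate high-quality, reliable, and comprehensive commercial reports from large numbers of web sources.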
Details
Motivation: Reports produced by existing deep research agents are limited in quality, reliability, and coverage, and fall short of what high-stakes business decisions require. Method: Mind2Report is a training-free agentic workflow that augments general LLMs with dynamic memory and emulates a commercial analyst's cognitive process through fine-grained intent probing, on-the-fly information extraction, and iterative report synthesis. Result: The authors build QRC-Eval, an evaluation set of 200 real-world commercial tasks, and show that Mind2Report outperforms leading deep research agents, including those from OpenAI and Gemini, on report quality, reliability, and coverage. Conclusion: Mind2Report offers a workable framework for designing future commercial deep research agents; although preliminary, the study demonstrates the potential of cognitive agents for complex information synthesis. Abstract: Synthesizing informative commercial reports from massive and noisy web sources is critical for high-stakes business decisions. Although current deep research agents achieve notable progress, their reports remain limited in terms of quality, reliability, and coverage. In this work, we propose Mind2Report, a cognitive deep research agent that emulates the commercial analyst to synthesize expert-level reports. Specifically, it first probes fine-grained intent, then searches web sources and records distilled information on the fly, and subsequently iteratively synthesizes the report. We design Mind2Report as a training-free agentic workflow that augments general large language models (LLMs) with dynamic memory to support these long-form cognitive processes. To rigorously evaluate Mind2Report, we further construct QRC-Eval comprising 200 real-world commercial tasks and establish a holistic evaluation strategy to assess report quality, reliability, and coverage. Experiments demonstrate that Mind2Report outperforms leading baselines, including OpenAI and Gemini deep research agents. Although this is a preliminary study, we expect it to serve as a foundation for advancing the future design of commercial deep research agents. Our code and data are available at https://github.com/Melmaphother/Mind2Report.
[72] CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters
Ao Sun,Xiaoyu Wang,Zhe Tan,Yu Li,Jiachen Zhu,Shu Su,Yuheng Jia
Main category: cs.CL
TL;DR: Proposes CuMA, a framework that addresses "mean collapse" in cross-cultural alignment of large language models through conditional capacity separation and demographic-aware routing, preserving cultural diversity.
Details
Motivation: Because large language models serve users from many cultures worldwide, a single uniform alignment cannot satisfy differing cultural values and degrades model behavior; an alignment method that respects cultural diversity is needed.
Method: Proposes CuMA (Cultural Mixture of Adapters), which treats alignment as a conditional capacity separation problem and uses demographic-aware routing to build a latent cultural topology, disentangling conflicting gradients into specialized adapter subspaces and thereby mitigating cultural sparsity and mean collapse in dense models.
Result: On WorldValuesBench, Community Alignment, and PRISM, CuMA achieves state-of-the-art performance, significantly outperforming dense baselines and semantic-only MoE methods; analysis confirms that it mitigates mean collapse and preserves cultural diversity.
Conclusion: By introducing a culture-aware mixture-of-experts structure, CuMA addresses the representation degradation that cultural conflicts cause in globally deployed LLMs, offering a scalable and effective solution for multicultural alignment.
Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose CuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic-aware routing, CuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that CuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.
[73] Faithful Summarisation under Disagreement via Belief-Level Aggregation
Favour Yahdii Aghaebe,Tanefa Apekey,Elizabeth Williams,Nafise Sadat Moosavi
Main category: cs.CL
TL;DR: Proposes a disagreement-aware summarisation framework that separates belief-level aggregation from language generation and explicitly models conflicting viewpoints, improving the faithfulness of opinion summaries.
Details
Motivation: Existing summarisation methods, especially LLM-based ones, tend to smooth over disagreement and favor majority views, making summaries less faithful when viewpoints clearly conflict.
Method: Represents documents as structured belief sets, aggregates them at the belief level with distance-based belief merging operators, and then uses a large language model to turn the aggregated beliefs into a natural language summary.
Result: Experiments show that sufficiently large models can reach similar quality when aggregating at generation time, but this behaviour is unstable across architectures and scales, whereas the proposed method performs consistently well across models.
Conclusion: Belief-level aggregation combined with simple prompting consistently improves disagreement-aware summarisation across models while keeping summaries fluent and grounded.
Abstract: Opinion and multi-document summarisation often involve genuinely conflicting viewpoints, yet many existing approaches, particularly LLM-based systems, implicitly smooth disagreement and over-represent majority opinions. This limits the faithfulness of generated summaries in opinion-heavy settings. We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Documents are first represented as structured belief sets and aggregated using distance-based belief merging operators that explicitly model conflict. Large language models are then used only to realise the aggregated beliefs as natural language summaries. We evaluate the approach across multiple model families and scales, comparing it to methods that perform explicit aggregation during generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models, while maintaining fluent and grounded summaries.
[74] V-FAT: Benchmarking Visual Fidelity Against Text-bias
Ziteng Wang,Yujie He,Guanliang Li,Siqi Yang,Jiaqi Xiong,Songxiang Liu
Main category: cs.CL
TL;DR: Proposes V-FAT, a new benchmark for evaluating text bias in multimodal large language models; it reveals visual degradation when visual and linguistic information conflict and introduces the Visual Robustness Score (VRS) to measure how much a model truly relies on visual evidence.
Details
Motivation: There is concern that current multimodal large language models lean on linguistic shortcuts rather than genuine visual understanding, a text bias that calls for systematic evaluation of visual fidelity.
Method: Proposes the V-FAT benchmark with a three-level evaluation framework (L1-L3) that decouples internal corpus bias from external instruction bias, and introduces the Visual Robustness Score (VRS) metric.
Result: Experiments on 12 frontier MLLMs show that models perform well on conventional benchmarks yet suffer significant visual collapse under high linguistic dominance.
Conclusion: Existing MLLMs still rely heavily on linguistic priors when vision and language conflict and lack genuine visual grounding; stricter evaluation standards are needed to drive robust multimodal understanding.
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.
[75] Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences
Arkadiusz Modzelewski,Paweł Golik,Anna Kołos,Giovanni Da San Martino
Main category: cs.CL
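TL;DR: Proposes Persuaficial, a multilingual benchmark for comparing how hard it is to automatically detect LLM-generated versus human-written persuasive text; subtle LLM-generated persuasion degrades detection performance, and the paper provides the first comprehensive linguistic analysis of the two.
Details
Motivation: LLMs can generate highly persuasive text that could be misused for propaganda or manipulation, so it is important to study whether LLM-generated persuasion is harder to detect automatically than human-written persuasion.
Method: Categorizes controllable generation approaches, builds the high-quality multilingual benchmark Persuaficial covering six languages, and runs large-scale empirical evaluations comparing the detection difficulty and linguistic features of human-written and LLM-generated persuasive text.
Result: Overtly persuasive LLM-generated text is relatively easy to detect, but subtle LLM-generated persuasion consistently degrades automatic detection; the paper also provides the first comprehensive linguistic comparison of human and LLM-generated persuasive text.
Conclusion: Subtle LLM-generated persuasion poses a greater challenge to current automatic detectors, and the linguistic differences identified can guide more interpretable and robust detection tools.
Abstract: Large Language Models (LLMs) can generate highly persuasive text, raising concerns about their misuse for propaganda, manipulation, and other harmful purposes. This leads us to our central question: Is LLM-generated persuasion more difficult to automatically detect than human-written persuasion? To address this, we categorize controllable generation approaches for producing persuasive content with LLMs and introduce Persuaficial, a high-quality multilingual benchmark covering six languages: English, German, Polish, Italian, French and Russian. Using this benchmark, we conduct extensive empirical evaluations comparing human-authored and LLM-generated persuasive texts. We find that although overtly persuasive LLM-generated texts can be easier to detect than human-written ones, subtle LLM-generated persuasion consistently degrades automatic detection performance. Beyond detection performance, we provide the first comprehensive linguistic analysis contrasting human and LLM-generated persuasive texts, offering insights that may guide the development of more interpretable and robust detection tools.
[76] GenProve: Learning to Generate Text with Fine-Grained Provenance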
Jingxuan Wei,Xingyue Wang,Yanghaoyu Liao,Jie Dong,Yuchen Liu,Caijun Jia,Bihui Yu,Junnan Zhu
Main category: cs.CL
TL;DR: Proposes the Generation-time Fine-grained Provenance task and the ReFInE dataset to improve the interpretability and trustworthiness of large language models; the GenProve framework significantly outperforms 14 strong LLMs in joint evaluation and exposes the difficulty of inference-based provenance.
Details
Motivation: LLMs often hallucinate; existing citation methods are insufficient for accountability, cannot distinguish direct quotation from complex reasoning, and lack fine-grained provenance.
Method: Presents ReFInE, a dataset with expert-verified, relation-aware fine-grained evidence annotations that distinguish Quotation, Compression, and Inference; builds GenProve, which combines supervised fine-tuning with Group Relative Policy Optimization to optimize a composite reward for answer fidelity and provenance correctness.
Result: GenProve significantly outperforms 14 strong LLMs in joint evaluation; analysis shows that models do well on surface-level quotation but fall clearly short on inference-based provenance.
Conclusion: Fine-grained provenance improves verifiability, but verifiable reasoning remains a frontier challenge distinct from surface-level citation and needs further study.
Abstract: Large language models (LLMs) often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability as users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability & Evidence), a dataset featuring expert verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap where models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.
[77] A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction
Qing Wang,Zehan Li,Yaodong Song,Hongjie Chen,Jian Kang,Jie Lian,Jie Li,Yongxiang Li,Xuelong Li
Main category: cs.CL
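TL;DR: Proposes a unified spoken language model based on Injected Emotional-Attribution Thinking (IEAT), a new data construction strategy that internalizes emotion-aware reasoning; it achieves leading performance on emotional trajectory modeling, emotional reasoning, and empathetic response generation.
Details
Motivation: Conventional methods usually treat emotion recognition as an explicit supervised task, which makes deep emotional perception and reasoning hard to achieve; this work aims to raise emotional intelligence in spoken dialogue by internalizing the emotional attribution process.
Method: Proposes IEAT, which incorporates user emotional states and their causes into the model's reasoning process. Training proceeds in two progressive stages: the first performs speech-text alignment and emotional attribute modeling via self-distillation, and the second performs end-to-end cross-modal joint optimization to keep textual and spoken emotional expression consistent.
Result: On the HumDial Emotional Intelligence benchmark, the method achieves top-ranked performance on emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluation.
Conclusion: IEAT internalizes emotional perception and reasoning, and the proposed unified spoken language model markedly improves multimodal emotion understanding and empathetic interaction.
Abstract: This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. The model is trained with a two-stage progressive strategy. The first stage performs speech-text alignment and emotional attribute modeling via self-distillation, while the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.
[78] Text as a Universal Interface for Transferable Personalization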
Yuting Liu,Jian Guan,Jia-Nan Li,Wei Wu,Jiang-Ming Yang,Jianzhe Zhao,Guibing Guo
Main category: cs.CL
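TL;DR: Proposes a natural-language-based universal representation of user preferences for LLM personalization, learns interpretable and transferable preference descriptions through a two-stage training framework, and develops AlignXplore+, an 8B model with strong cross-task and cross-model transferability.
Details
Motivation: Existing personalization methods mostly represent user preferences as implicit vectors or parameters, which makes them hard to interpret and hard to transfer across models and tasks; a more transparent, universal representation is needed.
Method: Uses natural language as a universal interface for user preferences and designs a two-stage training framework: supervised fine-tuning on high-quality synthesized data, followed by reinforcement learning to optimize long-term utility and cross-task transferability. The result is AlignXplore+, a universal preference reasoning model that generates textual preference summaries.
Result: On nine benchmarks, the 8B model achieves state-of-the-art performance, surpassing larger open-source models, and transfers well across tasks, model families, and interaction formats.
Conclusion: Natural language is an effective universal preference representation; combined with two-stage training, it yields high-performing, interpretable, and transferable personalized LLMs.
Abstract: We study the problem of personalization in large language models (LLMs). Prior work predominantly represents user preferences as implicit, model-specific vectors or parameters, yielding opaque "black-box" profiles that are difficult to interpret and transfer across models and tasks. In contrast, we advocate natural language as a universal, model- and task-agnostic interface for preference representation. The formulation leads to interpretable and reusable preference descriptions, while naturally supporting continual evolution as new interactions are observed. To learn such representations, we introduce a two-stage training framework that combines supervised fine-tuning on high-quality synthesized data with reinforcement learning to optimize long-term utility and cross-task transferability. Based on this framework, we develop AlignXplore+, a universal preference reasoning model that generates textual preference summaries. Experiments on nine benchmarks show that our 8B model achieves state-of-the-art performance -- outperforming substantially larger open-source models -- while exhibiting strong transferability across tasks, model families, and interaction formats.
[79] Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization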
Xueyun Tian,Minghua Ma,Bingbing Xu,Nuoyan Lyu,Wei Li,Heng Dong,Zheng Chu,Yuanzhuo Wang,Huawei Shen
Main category: cs.CL
TL;DR: Shows that using chain-of-thought trajectories with incorrect final answers (negative samples) in supervised fine-tuning markedly improves out-of-domain generalization, and proposes GLOW, an adaptive loss weighting method; the valid reasoning contained in negative samples mitigates overfitting and encourages exploration.
Details
Motivation: Standard SFT uses only trajectories with correct answers and discards many negatives that may contain valid intermediate reasoning despite a wrong final answer, wasting supervision, worsening overfitting, and limiting out-of-domain generalization.
Method: Systematically analyzes 22 recurring patterns in negative trajectories and proposes Gain-based LOss Weighting (GLOW), which dynamically rescales each sample's loss according to its progress between training epochs, enabling adaptive learning from positive and negative samples.
Result: On Qwen2.5-7B, GLOW yields a 5.51% out-of-domain gain over positive-only SFT and, as an RL initialization, raises MMLU from 72.82% to 76.47%; negative trajectories also raise policy entropy during inference by 35.67%.
Conclusion: Retaining and properly using negative trajectories strengthens generalization; GLOW offers a data-aware, adaptive training paradigm and a new way to make efficient use of imperfectly labeled data.
Abstract: Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectory demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically only retain trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits such distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.
[80] Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei
Peng Wang,Xilin Tao,Siyi Yao,Jiageng Wu,Yuntao Zou,Zhuotao Tian,Libo Qin,Dagang Li
Main category: cs.CL
TL;DR: Proposes the Subcultural Alignment Solver (SAS), a multi-agent framework that improves LLM detection of self-destructive behavior in subcultural contexts, using automatic retrieval and subculture alignment to overcome knowledge lag and semantic misalignment.
Details
Motivation: Self-destructive behavior is expressed in distinctive, veiled ways within subcultures, and LLMs suffer from lagging knowledge and semantic misunderstanding, so existing methods struggle to identify it; a detection mechanism tailored to subcultures is needed.
Method: Proposes SAS, a multi-agent architecture that combines an automatic retrieval subsystem with a subculture alignment mechanism to dynamically acquire emerging subcultural terms and calibrate semantic representations, strengthening LLM understanding and judgment of self-destructive behavior.
Result: SAS outperforms the advanced multi-agent framework OWL and is competitive with fine-tuned LLMs, markedly improving detection accuracy in subcultural contexts.
Conclusion: SAS mitigates knowledge lag and semantic misalignment and provides an efficient, scalable approach to detecting self-destructive behavior in subcultural contexts.
Abstract: Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) are applied across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods face two main challenges: (1) Knowledge Lag: Subcultural slang evolves rapidly, faster than LLMs' training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we propose Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly enhancing the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.
[81] Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models
Yueqing Hu,Xinyang Peng,Shuting Peng,Hanqi Wang,Tianhong Wang
Main category: cs.CL
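TL;DR: Reasoning distillation imitates a teacher model's reasoning traces via supervised fine-tuning, but the student fails to inherit the teacher's cognitive structure, showing "Functional Alignment Collapse" and negative transfer; this suggests that human-like cognition is an emergent property of reinforcement learning rather than the result of passive imitation.
Details
Motivation: Investigates whether current reasoning distillation methods can transmit the structure, aligned with human cognitive costs, that is found in large reasoning models.
Method: Tests the "Hán Dān Xué Bù" hypothesis on 14 models, comparing how well teacher models and their SFT-distilled students correlate with human difficulty scaling.
Result: Teacher models mirror human cognitive difficulty well (mean correlation 0.64), while distilled students drop significantly (mean correlation 0.34), with some falling below their pre-distillation baselines ("negative transfer"); the analysis attributes this to a "Cargo Cult" effect in which students copy the linguistic form of reasoning without acquiring the resource allocation policy.
Conclusion: Human-like cognition is an emergent property of active reinforcement learning; passive imitation such as SFT cannot transmit this dynamic cognitive structure, revealing a fundamental limitation of current reasoning distillation.
Abstract: Recent Large Reasoning Models trained via reinforcement learning exhibit a "natural" alignment with human cognitive costs. However, we show that the prevailing paradigm of reasoning distillation -- training student models to mimic these traces via Supervised Fine-Tuning (SFT) -- fails to transmit this cognitive structure. Testing the "Hán Dān Xué Bù" (Superficial Mimicry) hypothesis across 14 models, we find that distillation induces a "Functional Alignment Collapse": while teacher models mirror human difficulty scaling ($\bar{r}=0.64$), distilled students significantly degrade this alignment ($\bar{r}=0.34$), often underperforming their own pre-distillation baselines ("Negative Transfer"). Our analysis suggests that SFT induces a "Cargo Cult" effect, where students ritualistically replicate the linguistic form of reasoning (verbosity) without internalizing the teacher's dynamic resource allocation policy. Consequently, reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition is an emergent property of active reinforcement, not passive imitation.
[82] ArcAligner: Adaptive Recursive Aligner for Compressed Context Embeddings in RAG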
Jianbo Li,Yi Jiang,Sendong Zhao,Bairui Hu,Haochun Wang,Bing Qin
Main category: cs.CL
TL;DR: Proposes ArcAligner, a lightweight module with an adaptive gating mechanism that helps language models make better use of highly compressed context for generation, improving efficiency and performance.
Details
Motivation: Long document inputs make models slow and expensive, and existing compression methods hurt LLM comprehension when they compress too aggressively.
Method: Designs the ArcAligner module, integrated into the language model layers, with an adaptive gating mechanism that adds computation only when the information is complex so that compressed context is used more effectively.
Result: On several knowledge-intensive QA benchmarks, ArcAligner outperforms baselines at comparable compression rates, especially in multi-hop and long-tail settings.
Conclusion: ArcAligner balances the efficiency gains of context compression against information loss and improves LLM generation from compressed context.
Abstract: Retrieval-Augmented Generation (RAG) helps LLMs stay accurate, but feeding long documents into a prompt makes the model slow and expensive. This has motivated context compression, ranging from token pruning and summarization to embedding-based compression. While researchers have tried "compressing" these documents into smaller summaries or mathematical embeddings, there is a catch: the more you compress the data, the more the LLM struggles to understand it. To address this challenge, we propose ArcAligner (Adaptive recursive context Aligner), a lightweight module integrated into the language model layers to help the model better utilize highly compressed context representations for downstream generation. It uses an adaptive "gating" system that only adds extra processing power when the information is complex, keeping the system fast. Across knowledge-intensive QA benchmarks, ArcAligner consistently beats compression baselines at comparable compression rates, especially on multi-hop and long-tail settings. The source code is publicly available.
[83] Compositional Steering of Large Language Models with Steering Tokens
Gorjan Radevski,Kiril Gashteovski,Giwon Hong,Carolin Lawrence,Goran Glavaš
Main category: cs.CL
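TL;DR: Proposes compositional steering tokens for steering LLMs toward multiple behaviors at once: natural language instructions are embedded into dedicated tokens in the input token space, and a composition token enables generalization to unseen behaviors and unseen numbers of behaviors; experiments show better multi-behavior control than existing methods.
Details
Motivation: Existing work focuses on steering a single behavior and pays little attention to steering several behaviors simultaneously (compositional steering), which real applications need for outputs that satisfy multiple requirements.
Method: Embeds individual behaviors, expressed as natural language instructions, into dedicated steering tokens in the input token space via self-distillation, then trains a dedicated composition token on pairs of behaviors so that it generalizes to unseen behavior combinations and different numbers of behaviors.
Result: Across several LLM architectures, steering tokens give better multi-behavior control than natural language instructions, activation steering, and LoRA merging, and support effective zero-shot composition; the composition token generalizes well to unseen combinations and numbers of behaviors.
Conclusion: Compositional steering tokens are an effective, generalizable method for multi-behavior steering; operating in the input token space with a composition token gives more precise and flexible multi-objective control, and the tokens complement natural language instructions for further gains.
Abstract: Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, compositional steering -- i.e., steering LLMs simultaneously towards multiple behaviors -- remains an underexplored problem. In this work, we propose compositional steering tokens for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steers live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated composition token on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to unseen compositions, including those with unseen behaviors as well as those with an unseen number of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior control compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.
[84] SemPA: Improving Sentence Embeddings of Large Language Models through Semantic Preference Alignment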
Ziyang Chen,Zhenxuan Huang,Yile Wang,Weiqin Wang,Lu Yin,Hui Huang
Main category: cs.CL
TL;DR: Proposes SemPA, which uses sentence-level Direct Preference Optimization (DPO) to improve the sentence representations of LLMs while preserving their generative ability, and establishes a theoretical connection between DPO and contrastive learning.
Details
Motivation: Sentence embedding methods built on generative LLMs either rely on fixed prompt templates, which limits performance, or modify the model architecture, which harms generation; a method is needed that improves representations without damaging generative ability.
Method: SemPA applies sentence-level DPO on a paraphrase generation task so that the LLM learns to discriminate semantically equivalent sentences while keeping its generative capacity; it also establishes a theoretical connection between DPO and contrastive learning under the Plackett-Luce model.
Result: On semantic textual similarity tasks and various LLM benchmarks, SemPA improves semantic representations without sacrificing generative ability.
Conclusion: SemPA strengthens LLM sentence representations while retaining generation, suggesting a new route to multi-purpose embedding methods.
Abstract: Traditional sentence embedding methods employ token-level contrastive learning on non-generative pre-trained models. Recently, there have emerged embedding methods based on generative large language models (LLMs). These methods either rely on fixed prompt templates or involve modifications to the model architecture. The former lacks further optimization of the model and results in limited performance, while the latter alters the internal computational mechanisms of the model, thereby compromising its generative capabilities. We propose SemPA, a novel approach that boosts the sentence representations while preserving the generative ability of LLMs via semantic preference alignment. We leverage sentence-level Direct Preference Optimization (DPO) to efficiently optimize LLMs on a paraphrase generation task, where the model learns to discriminate semantically equivalent sentences while preserving inherent generative capacity. Theoretically, we establish a formal connection between DPO and contrastive learning under the Plackett-Luce model framework. Empirically, experimental results on both semantic textual similarity tasks and various benchmarks for LLMs show that SemPA achieves better semantic representations without sacrificing the inherent generation capability of LLMs.
[85] Code-Mix Sentiment Analysis on Hinglish Tweets
Aashi Garg,Aneshya Das,Arshi Arya,Anushka Goyal,Aditi
Main category: cs.CL
TL;DR: Proposes a high-performance sentiment classification framework for Hinglish tweets that fine-tunes mBERT with subword tokenization to handle the NLP challenges of Hindi-English code-mixed language on Indian social media.
Details
Motivation: Traditional NLP models struggle with the syntactic and semantic complexity of Hinglish (mixed Hindi and English), which makes brand sentiment monitoring inaccurate; a sentiment analysis solution designed for this language variety is needed.
Method: Fine-tunes multilingual BERT (mBERT) with subword tokenization to better handle spelling variations, slang, and out-of-vocabulary terms in Romanized Hinglish.
Result: The framework performs well on Hinglish sentiment classification, is suitable for deployment as a production-grade system, and sets a strong benchmark for multilingual NLP in low-resource, code-mixed settings.
Conclusion: The work addresses the technical difficulties of sentiment analysis on Hinglish text, improves brand sentiment monitoring in the Indian market, and advances practical multilingual NLP.
Abstract: The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish--a hybrid of Hindi and English--used widely in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
[86] How Human is AI? Examining the Impact of Emotional Prompts on Artificial and Human and Responsiveness
Florence Bernays,Marco Henriques Pereira,Jochen Menges
Main category: cs.CL
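TL;DR: Examines how the emotional tone of human-AI interaction affects ChatGPT's behavior and subsequent human communication. Praise and anger improve ChatGPT's answers, while blame shifts its ethical priorities and changes how people express themselves in later human-human communication.
Details
Motivation: To examine the influence of emotional tone in human-AI interaction, in particular how emotions shape AI output and spill over into people's own behavior.
Method: A between-subject experiment in which participants expressed a specific emotion (such as praise, anger, blame, or neutrality) while working with ChatGPT (GPT-4.0) on writing a public response and addressing an ethical dilemma; effects on AI output and on later interpersonal language were analyzed.
Result: Praise led to greater improvement in ChatGPT's answers than a neutral tone; anger produced a smaller but still positive improvement; blame did not improve the answers but increased ChatGPT's emphasis on the public interest in the ethical dilemma; participants in the blame condition also used more negative and hostile expressions in subsequent human-human communication.
Conclusion: Emotional tone in human-AI interaction affects both the quality and the value orientation of AI responses and carries over into human-human communication, so emotional expression has a twofold influence.
Abstract: This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher albeit smaller improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest. Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised ChatGPT for its responses. Together, our findings demonstrate that the emotional tone people apply in human-AI interactions not only shapes ChatGPT's outputs but also carries over into subsequent human-human communication.
[87] Agent-as-a-Judge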
Runyang You,Hongru Cai,Caiqi Zhang,Qiancheng Xu,Meng Liu,Tiezheng Yu,Yongqi Li,Wenjie Li
Main category: cs.CL
TL;DR: Surveys the evolution from LLM-as-a-Judge to Agent-as-a-Judge, proposing a unified framework for understanding the paradigm shift and a developmental taxonomy; it organizes core methods and application areas, analyzes frontier challenges, and points to future research directions.
Details
Motivation: As evaluands become more complex, specialized, and multi-step, LLM-as-a-Judge is constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations, so more reliable, verifiable, and nuanced evaluation is needed.
Method: Establishes a comprehensive developmental taxonomy, identifies the key dimensions of the paradigm shift, organizes the core methodologies of Agent-as-a-Judge, surveys applications across general and professional domains, and analyzes frontier challenges and future directions.
Result: Delivers the first comprehensive framework tracing the evolution from LLM-as-a-Judge to Agent-as-a-Judge, clarifying key dimensions and development paths, summarizing current methods and applications, and identifying key challenges and opportunities.
Conclusion: Agent-as-a-Judge improves the reliability and depth of evaluation through planning, tool-augmented verification, multi-agent collaboration, and persistent memory; the framework and roadmap lay groundwork for next-generation agentic evaluation systems.
Abstract: LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.
[88] DocDancer: Towards Agentic Document-Grounded Information Seeking
Qintong Zhang,Xinjie Lv,Jialong Wu,Baixuan Li,Zhengwei Tao,Guochen Yan,Huanyao Zhang,Bin Wang,Jiahao Xu,Haitao Mi,Wentao Zhang
Main category: cs.CL
TL;DR: Proposes DocDancer, an end-to-end trained open-source document question answering agent that improves document understanding through a tool-driven framework and an Exploration-then-Synthesis data generation method.
Details
Motivation: Existing Document Question Answering (DocQA) agents use tools ineffectively and mostly depend on closed-source models; effective open-source solutions are lacking.
Method: Formulates DocQA as an information-seeking problem, proposes a tool-driven agent framework that explicitly models document exploration and comprehension, and designs an Exploration-then-Synthesis data synthesis pipeline to support end-to-end training.
Result: On the long-context benchmarks MMLongBench-Doc and DocBench, models trained on the synthesized data perform well, supporting the effectiveness of the method.
Conclusion: The work advances open-source DocQA agents and offers useful insights for agentic tool design and synthetic data construction.
Abstract: Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Models trained on the synthesized data prove effective on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights for the agentic tool design and synthetic data.
[89] RelayLLM: Efficient Reasoning via Collaborative Decoding
Chengsong Huang,Tong Zheng,Langlin Huang,Jinyuan Li,Haolin Liu,Jiaxin Huang
Main category: cs.CL
TL;DR: RelayLLM is a new framework for efficient reasoning in which a small and a large language model collaborate at the token level during decoding; the small model calls the large model only for critical tokens, greatly reducing compute cost.
Details
Motivation: LLM inference is costly and slow, while small language models lack reasoning capacity; existing collaborative methods are too coarse-grained and waste computation.
Method: RelayLLM makes the small model an active controller that dynamically invokes the large model via a special command to generate critical tokens, and uses two-stage training (warm-up and GRPO) to balance independent generation with help-seeking.
Result: Across six benchmarks, average accuracy reaches 49.52% while the large model generates only 1.07% of tokens, a 98.2% cost reduction relative to performance-matched random routers.
Conclusion: RelayLLM maintains strong reasoning performance while sharply reducing reliance on the large model, enabling efficient and economical collaborative reasoning.
Abstract: Using Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
[90] Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference
Rasmus Blanck,Bill Noble,Stergios Chatzikyriakidis
Main category: cs.CL
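TL;DR: This paper examines how the logical properties of the natural language inference (NLI) task should be understood, proposes three possible readings of the NLI label set, and analyzes their meta-inferential properties using the SNLI dataset.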
Details
Motivation: NLI is an important task for evaluating language models, but its logical properties are often misunderstood or mischaracterized; understanding the notion of inference that NLI captures is key to interpreting model performance.
Method: Formulates three possible readings of the NLI label set and, using NLI items with shared premises and LLM-generated items, analyzes the meta-inferential consistency of models trained on SNLI.
Result: Reveals the meta-inferential properties implicit in the SNLI dataset and evaluates models' meta-inferential consistency under the different readings.
Conclusion: Clarifies which reading of the logical relations encoded by SNLI best matches the behavior of trained models, helping to better understand the logical basis of NLI and model behavior.
Abstract: Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.
[91] Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems
Jihao Zhao,Ding Chen,Zhaoxin Fan,Kerun Xu,Mengting Hu,Bo Tang,Feiyu Xiong,Zhiyu li
Main category: cs.CL
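TL;DR: This paper proposes Inside Out, a framework that models long-term personalized dialogue with a globally maintained PersonaTree, using structured memory operations to suppress noise and preserve consistency.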
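The structured memory operations in this entry (its abstract names ADD, UPDATE, DELETE, and NO_OP) can be sketched on a two-level tree. This is a minimal illustration under stated assumptions: the fixed trunk keys, the `(kind, trunk, leaf, value)` tuple format, and the error handling are hypothetical and do not reproduce the paper's PersonaTree or MemListener.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical operation format: (kind, trunk, leaf, value).
Op = Tuple[str, Optional[str], Optional[str], Any]
Tree = Dict[str, Dict[str, Any]]


def new_tree() -> Tree:
    # The trunk is fixed by an initial schema; only leaves may change.
    return {"profile": {}, "preferences": {}}


def apply_op(tree: Tree, op: Op) -> None:
    kind, trunk, leaf, value = op
    if kind == "NO_OP":
        return
    if trunk not in tree:
        raise KeyError(f"trunk {trunk!r} is not part of the schema")
    branch = tree[trunk]
    if kind == "ADD":
        if leaf in branch:
            raise ValueError(f"{leaf!r} already exists; use UPDATE")
        branch[leaf] = value
    elif kind == "UPDATE":
        if leaf not in branch:
            raise KeyError(f"{leaf!r} does not exist; use ADD")
        branch[leaf] = value
    elif kind == "DELETE":
        branch.pop(leaf, None)
    else:
        raise ValueError(f"unknown operation {kind!r}")


tree = new_tree()
for op in [
    ("ADD", "preferences", "drink", "coffee"),
    ("ADD", "profile", "city", "Berlin"),
    ("UPDATE", "preferences", "drink", "tea"),  # a later turn revises the fact
    ("NO_OP", None, None, None),                # small talk: nothing to store
    ("DELETE", "profile", "city", None),        # the user retracts a fact
]:
    apply_op(tree, op)
print(tree)
```

Because each operation overwrites or removes a leaf instead of appending text, the stored profile stays bounded no matter how long the conversation runs.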
Details
Motivation: Existing long-term personalized dialogue systems are limited by context length and tend to accumulate memory noise, which degrades reasoning and causes persona inconsistency.
Method: Designs PersonaTree, a structured user profile, and trains a lightweight MemListener model with reinforcement learning to emit operations such as ADD, UPDATE, and DELETE that drive the tree's evolution; at generation time the tree is used directly, or an agentic mode is triggered on demand to supply more detail.
Result: Experiments show that PersonaTree outperforms full-text concatenation and various memory systems at suppressing contextual noise and maintaining persona consistency; the small MemListener matches or exceeds large models such as DeepSeek-R1-0528 and Gemini-3-Pro on memory-operation decisions.
Conclusion: Through structured, interpretable memory management, Inside Out addresses memory bloat and inconsistency in long-term dialogue while remaining efficient and scalable.
Abstract: Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.
[92] LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation
Samy Haffoudhi,Fabian M. Suchanek,Nils Holzenberger
Main category: cs.CL
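TL;DR: LELA is a modular, coarse-to-fine entity linking method that uses large language models without fine-tuning; across multiple settings it is competitive with fine-tuned methods and substantially outperforms other non-fine-tuned ones.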
Details
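Motivation: Entity linking is essential to tasks such as knowledge graph construction, question answering, and information extraction, but existing methods are often tied to specific domains or require fine-tuning, which limits their generality and flexibility.
Method: Proposes LELA, a modular coarse-to-fine framework that draws on the capabilities of large language models and works with different target domains, knowledge bases, and LLMs without any fine-tuning.
Result: Experiments across a range of entity linking settings show that LELA is on par with fine-tuned methods and substantially better than other non-fine-tuned ones.
Conclusion: LELA is a general, flexible, and efficient entity linking method that adapts to many settings without fine-tuning and has broad application potential.
Abstract: Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foundational step in tasks such as knowledge graph construction, question-answering, and information extraction. Our method, LELA, is a modular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowledge bases and LLMs, without any fine-tuning phase. Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.
[93] Measuring and Fostering Peace through Machine Learning and Artificial Intelligence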
P. Gilda,P. Dungarwal,A. Thongkham,E. T. Ajayi,S. Choudhary,T. M. Terol,C. Lam,J. P. Araujo,M. McFadyen-Mungalln,L. S. Liebovitch,P. T. Coleman,H. West,K. Sieck,S. Carter
Main category: cs.CL
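TL;DR: This study uses machine learning and AI to measure levels of peace from news and social media, and develops a browser extension, MirrorMirror, that gives users real-time feedback on the peacefulness of the media they watch, with the aim of encouraging more respectful, nuanced, and informative communication.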
Details
Motivation: Social media content tends to provoke emotions such as anger to increase clicks, which affects public understanding of information and social harmony; tools are therefore needed to assess and improve the peacefulness of media content.
Method: Uses neural networks on text embeddings of online news to measure levels of peace; for social media, combines word-level (GoEmotions) and context-level (large language model) methods to assess social dimensions related to peace; and develops a Chrome extension, MirrorMirror, that provides real-time peacefulness feedback on YouTube videos.
Result: The model trained on one news dataset also shows high accuracy on a different news dataset; the models built for social media can identify peace-related social dimensions; and the MirrorMirror extension has been tested and has potential for rollout to content creators and users.
Conclusion: AI can be used to quantify levels of peace in media, and tools such as MirrorMirror can help users understand their own media consumption; the tool may evolve into an open-source platform that encourages more constructive media communication.
Abstract: We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset, also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people 20-40 years old daily view most of their news through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making you angry to engage you, to increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.
[94] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Shih-Yang Liu,Xin Dong,Ximing Lu,Shizhe Diao,Peter Belcak,Mingjie Liu,Min-Hung Chen,Hongxu Yin,Yu-Chiang Frank Wang,Kwang-Ting Cheng,Yejin Choi,Jan Kautz,Pavlo Molchanov
Main category: cs.CL
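TL;DR: This paper proposes GDPO, a new optimization method for multi-reward reinforcement learning that addresses the collapse of advantage values when GRPO is applied to multiple rewards; by decoupling the normalization of each reward, it improves training stability and performance.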
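A small numeric sketch may help show the collapse this entry describes. It assumes a simplified reading: GRPO normalizes the summed reward within a group, while the decoupled variant normalizes each reward within the group and then sums. The function names and the toy rewards are illustrative, and the paper's full method may include further steps not shown here.

```python
from statistics import mean, pstdev
from typing import List, Sequence

EPS = 1e-8  # avoids division by zero when all values in a group are equal


def normalize(values: Sequence[float]) -> List[float]:
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def grpo_advantages(rewards: Sequence[Sequence[float]]) -> List[float]:
    # Sum the rewards of each rollout first, then normalize the totals.
    return normalize([sum(r) for r in rewards])


def gdpo_advantages(rewards: Sequence[Sequence[float]]) -> List[float]:
    # Normalize each reward column on its own, then sum per rollout.
    per_reward = [normalize(list(column)) for column in zip(*rewards)]
    return [sum(vals) for vals in zip(*per_reward)]


# One group of four rollouts, two binary rewards (e.g. correctness, format).
group = [(1, 1), (1, 0), (0, 1), (0, 1)]
grpo = grpo_advantages(group)
gdpo = gdpo_advantages(group)
print("GRPO:", [round(a, 3) for a in grpo])
print("GDPO:", [round(a, 3) for a in gdpo])
```

Rollouts `(1, 0)` and `(0, 1)` have the same total, so the summed normalization cannot tell them apart, whereas the per-reward normalization gives them different advantages because the two rewards have different group statistics.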
Details
Motivation: Existing GRPO-based multi-reward reinforcement learning collapses distinct reward combinations into identical advantage values, which lowers the resolution of the training signal and leads to poor convergence or training failure; a better-suited multi-reward optimization method is needed.
Method: Proposes Group reward-Decoupled Normalization Policy Optimization (GDPO), which normalizes each reward separately during policy optimization, avoiding interference between rewards and more faithfully preserving their relative differences.
Result: On three tasks (tool calling, math reasoning, and coding reasoning), GDPO consistently outperforms GRPO on accuracy, bug ratio, and format and length constraint adherence, with better training stability.
Conclusion: By decoupling multi-reward normalization, GDPO resolves the reward collapse seen with GRPO and offers a more general optimization framework for multi-reward language model alignment.
Abstract: As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
cs.CV [Back]
[95] Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes
Chenye Meng,Zejian Li,Zhongni Liu,Yize Li,Changle Xie,Kaixin Jia,Ling Yang,Huanghuang Deng,Shiying Ding,Shengyuan Zhang,Jiayi Li,Lingyun Sun
Main category: cs.CV
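TL;DR: Proposes a two-stage alignment framework built around Complex Preference Optimization (CPO) that improves diffusion models' alignment with fine-grained, hierarchical criteria, particularly for tasks such as painting generation that require expert quality assessment.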
Details
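Motivation: Existing post-training alignment methods for diffusion models rely on simplified signals such as scalar rewards or binary preferences, which cannot capture the complex, hierarchical judgments of human experts and so limit how well generations match professional expectations.
Method: First builds, together with domain experts, a tree-structured hierarchical set of evaluation criteria that decomposes image quality into multiple fine-grained positive and negative attributes; then injects domain knowledge into an auxiliary diffusion model via supervised fine-tuning; finally proposes CPO, which extends DPO to simultaneously maximize the probability of positive attributes and minimize that of negative attributes, aligning the target diffusion model at a fine granularity.
Result: In painting generation experiments on a dataset with fine-grained annotations, CPO significantly improves generation quality and agreement with expert criteria, outperforming existing methods on multiple automatic metrics and in human evaluation.
Conclusion: CPO offers an effective approach to fine-grained alignment of diffusion models in settings that require complex expertise, and opens a direction for optimization based on hierarchical, non-binary preference signals.
Abstract: Post-training alignment of diffusion models relies on simplified signals, such as scalar rewards or binary preferences. This limits alignment with complex human expertise, which is hierarchical and fine-grained. To address this, we first construct a set of hierarchical, fine-grained evaluation criteria with domain experts, which decomposes image quality into multiple positive and negative attributes organized in a tree structure. Building on this, we propose a two-stage alignment framework. First, we inject domain knowledge into an auxiliary diffusion model via Supervised Fine-Tuning. Second, we introduce Complex Preference Optimization (CPO), which extends DPO to align the target diffusion model to our non-binary, hierarchical criteria. Specifically, we reformulate the alignment problem to simultaneously maximize the probability of positive attributes while minimizing the probability of negative attributes with the auxiliary diffusion model. We instantiate our approach in the domain of painting generation and conduct CPO training with an annotated dataset of paintings with fine-grained attributes based on our criteria. Extensive experiments demonstrate that CPO significantly enhances generation quality and alignment with expertise, opening new avenues for fine-grained criteria alignment.
[96] Embedding Textual Information in Images Using Quinary Pixel Combinations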
A V Uday Kiran Kandala
Main category: cs.CV
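TL;DR: Proposes a new text steganography method based on quinary (base-5) combinations of pixel intensities in RGB space, which encodes a complete character in a single pixel with high embedding efficiency, low computational overhead, and good image fidelity.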
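The one-character-per-pixel idea in this entry rests on simple arithmetic: five levels in each of three channels give 5 × 5 × 5 = 125 combinations. The sketch below is one possible realization, not the paper's scheme: it assumes each channel's value modulo 5 carries one base-5 digit, and the alphabet ordering is hypothetical.

```python
import string
from typing import Tuple

Pixel = Tuple[int, int, int]

# Hypothetical symbol table: 95 printable symbols, within the 125 available codes.
ALPHABET = (string.ascii_uppercase + string.ascii_lowercase +
            string.digits + " " + string.punctuation)
assert len(ALPHABET) <= 5 ** 3


def embed_digit(value: int, digit: int) -> int:
    """Return a channel value near `value` whose remainder mod 5 is `digit`."""
    new = value - value % 5 + digit
    return new if new <= 255 else new - 5  # stay inside the 8-bit range


def encode_char(pixel: Pixel, ch: str) -> Pixel:
    index = ALPHABET.index(ch)
    digits = (index // 25, (index // 5) % 5, index % 5)  # base-5 digits for R, G, B
    r, g, b = (embed_digit(v, d) for v, d in zip(pixel, digits))
    return (r, g, b)


def decode_char(pixel: Pixel) -> str:
    r, g, b = (v % 5 for v in pixel)
    return ALPHABET[r * 25 + g * 5 + b]


cover = [(120, 64, 255), (0, 3, 17), (255, 255, 255), (89, 200, 41), (7, 7, 7)]
message = "Hi 5!"
stego = [encode_char(p, ch) for p, ch in zip(cover, message)]
recovered = "".join(decode_char(p) for p in stego)
max_change = max(abs(a - b) for p, q in zip(cover, stego) for a, b in zip(p, q))
print(recovered, "| largest channel change:", max_change)
```

Under this assumed scheme no channel moves by more than 4 intensity levels, which is the kind of small perturbation the distortion metrics in the entry are meant to quantify.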
Details
Motivation: Existing text steganography methods mostly rely on LSB/MSB manipulation, transform-domain techniques, or deep learning, and suffer from visible noise, high computational complexity, or the need for multiple pixels per symbol, making it hard to achieve both concealment and efficiency; a more efficient, lower-distortion method is needed.
Method: Uses five controlled intensity variations in each of the R, G, and B channels to form up to 125 pixel intensity combinations (5×5×5), and maps these combinations to textual symbols (letters, digits, whitespace, and special characters), so that a single RGB pixel carries one complete character.
Result: On MSE, MAE, SNR, PSNR, SSIM, histogram comparison, and heatmap analysis, there is no significant distortion between original and encoded images; compared with LSB/MSB, transform-domain, and deep learning methods, the approach has better embedding efficiency and lower computational overhead.
Conclusion: Through quinary intensity encoding in RGB space, the method hides text efficiently and with low distortion, embedding one character per pixel.
Abstract: This paper presents a novel technique for embedding textual data into images using quinary combinations of pixel intensities in RGB space. Existing methods predominantly rely on least and most significant bit (LSB and MSB) manipulation, Pixel Value Differencing (PVD), spatial perturbations in RGB channels, transform-domain methods, quantization methods, edge- and region-based methods, and more recently deep learning and generative AI techniques for hiding textual information in the spatial domain of images. Most of them depend on flipping pixel intensities over multiple pixels, as in LSB and combinations of LSB-based methodologies, or on transform coefficients, often resulting in noise. Encoding and decoding are deterministic in most of the existing approaches and are computationally heavy for more complex models such as deep learning and generative AI approaches. The proposed method works on quinary pixel intensity combinations in RGB space, where five controlled pixel intensity variations in each of the R, G, and B channels yield up to one hundred and twenty-five distinct pixel intensity combinations. These combinations are mapped to textual symbols, enabling the representation of uppercase and lowercase alphabetic characters, numeric digits, whitespace, and commonly used special characters. Metrics including MSE, MAE, SNR, PSNR, SSIM, histogram comparison, and heatmap analysis were evaluated for both original and encoded images, showing no significant distortion in the images. Furthermore, the method achieves improved embedding efficiency by encoding a complete textual symbol within a single RGB pixel, in contrast to LSB- and MSB-based approaches that typically require multiple pixels or multi-step processes, as well as transform- and learning-based methods that incur higher computational overhead.
[97] Unified Text-Image Generation with Weakness-Targeted Post-Training
Jiahui Chen,Philippe Hansen-Estruch,Xiaochuang Han,Yushi Hu,Emily Dinan,Amita Kamath,Michal Drozdzal,Reyhane Askari-Hemmat,Luke Zettlemoyer,Marjan Ghazvininejad
Main category: cs.CV
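TL;DR: Explores post-training to achieve fully unified text-image generation, in which the model autonomously transitions from textual reasoning to visual synthesis within a single inference process, and shows that reward weighting and strategically designed post-training data are effective for multimodal image generation.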
Details
Motivation: Existing multimodal generation systems rely on explicit modality switching, which limits cross-modal coupling and prevents automatic multimodal generation.
Method: Uses offline, reward-weighted post-training on fully self-generated synthetic data and explores different post-training data strategies.
Result: The approach improves multimodal image generation on four different T2I benchmarks; a targeted dataset that addresses specific limitations outperforms broad image-caption corpora and benchmark-aligned data.
Conclusion: Reward-weighting both modalities, combined with strategically designed post-training data, effectively improves unified text-image generation models.
Abstract: Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training with fully self-generated synthetic data, our approach enables improvements in multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and strategically designed post-training data.
[98] ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
Mohsen Ghafoorian,Amirhossein Habibian
Main category: cs.CV
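TL;DR: Proposes ReHyAt, a recurrent hybrid attention mechanism combining softmax and linear attention that preserves video generation quality while reducing attention complexity from quadratic to linear, sharply lowering training cost and improving scalability to long sequences.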
Details
Motivation: Existing Transformer-based video diffusion models are hard to scale to long sequences because of quadratic attention complexity, which limits practical use.
Method: Proposes ReHyAt, a recurrent hybrid attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention and uses a chunk-wise recurrent reformulation to keep memory usage constant; a lightweight distillation and fine-tuning pipeline transfers knowledge efficiently from existing softmax-based models.
Result: On VBench, VBench-2.0, and a human preference study, ReHyAt reaches state-of-the-art video quality, with training cost reduced by two orders of magnitude to about 160 GPU hours and attention complexity reduced to linear.
Conclusion: ReHyAt maintains high-quality video generation while greatly improving efficiency and scalability, making it suitable for long-duration and on-device video generation; its distillation recipe can be applied to future bidirectional softmax-based models.
Abstract: Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while remaining competitive in quality. Our lightweight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at https://qualcomm-ai-research.github.io/rehyat.
[99] SCAR-GS: Spatial Context Attention for Residuals in Progressive Gaussian Splatting
Diego Revilla,Pooja Suresh,Anand Bhojan,Ooi Wei Tsang
Main category: cs.CV
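TL;DR: Proposes a new progressive codec for 3D Gaussian Splatting that uses residual vector quantization and an autoregressive entropy model guided by a multi-resolution hash grid to improve compression efficiency and rate-distortion performance.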
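Residual vector quantization, the quantizer named in this entry, encodes a vector as a sequence of codebook indices, each stage quantizing what the previous stages left over. The sketch below shows only that generic mechanism with tiny hand-written codebooks (hypothetical values); it does not model the entry's entropy coder or hash grid.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, vec: Vector) -> int:
    """Index of the codeword with the smallest squared distance to `vec`."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))


def rvq_encode(vec: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        # The next stage only has to describe what this stage missed.
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):  # fewer indices give a coarser result
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


def sq_error(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],        # coarse layer
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],  # refinement layer
]
feature = [1.2, 0.8]
indices = rvq_encode(feature, codebooks)
coarse = rvq_decode(indices[:1], codebooks)
refined = rvq_decode(indices, codebooks)
print(indices, coarse, refined)
```

Decoding a prefix of the indices yields a usable coarse reconstruction, which is what makes this quantizer a natural fit for progressive transmission.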
Details
Motivation: Existing 3D Gaussian Splatting models have high storage requirements for medium and large scenes, and the scalar quantization used by current compression methods may not capture correlations in high-dimensional feature vectors, limiting rate-distortion performance.
Method: Replaces traditional scalar quantization with Residual Vector Quantization and designs an autoregressive entropy model, guided by a multi-resolution hash grid, that predicts the conditional probability of each successively transmitted index.
Result: The method compresses Gaussian primitive features more efficiently, achieving higher compression efficiency and a better rate-distortion trade-off in progressive transmission.
Conclusion: The method improves compression of 3D Gaussian Splatting models, which helps their deployment over cloud and streaming services.
Abstract: Recent advances in 3D Gaussian Splatting have allowed for real-time, high-fidelity novel view synthesis. Nonetheless, these models have significant storage requirements for large and medium-sized scenes, hindering their deployment over cloud and streaming services. Some of the most recent progressive compression techniques for these models rely on progressive masking and scalar quantization techniques to reduce the bitrate of Gaussian attributes using spatial context models. While effective, scalar quantization may not optimally capture the correlations of high-dimensional feature vectors, which can potentially limit the rate-distortion performance. In this work, we introduce a novel progressive codec for 3D Gaussian Splatting that replaces traditional methods with a more powerful Residual Vector Quantization approach to compress the primitive features. Our key contribution is an auto-regressive entropy model, guided by a multi-resolution hash grid, that accurately predicts the conditional probability of each successive transmitted index, allowing for coarse and refinement layers to be compressed with high efficiency.
[100] Comparative Analysis of Custom CNN Architectures versus Pre-trained Models and Transfer Learning: A Study on Five Bangladesh Datasets
Ibrahim Tanvir,Alif Ruslan,Sartaj Solaiman
Main category: cs.CV
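TL;DR: Compares custom CNNs with pre-trained models (ResNet-18, VGG-16) on five Bangladeshi image classification datasets and finds that transfer learning with fine-tuning clearly outperforms both training from scratch and feature extraction, especially on complex tasks with few samples.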
Details
Motivation: To determine which deep learning strategy best balances performance, model size, and training efficiency for region-specific image classification in resource-constrained settings.
Method: Compares custom CNNs with pre-trained models (ResNet-18, VGG-16), using both feature extraction and transfer learning with fine-tuning, on classification accuracy across five real-world image datasets.
Result: Transfer learning with fine-tuning performs best, with accuracy improvements ranging from 3% to 76%; fine-tuned ResNet-18 reaches 100% accuracy on the Road Damage BD dataset; custom CNNs have fewer parameters (3.4M) and train more efficiently but perform worse on complex tasks.
Conclusion: For complex image classification with limited data, pre-trained models with fine-tuning should be preferred; when a lightweight model and training efficiency matter and the task is simpler, a custom CNN is a reasonable choice.
Abstract: This study presents a comprehensive comparative analysis of custom-built Convolutional Neural Networks (CNNs) against popular pre-trained architectures (ResNet-18 and VGG-16) using both feature extraction and transfer learning approaches. We evaluated these models across five diverse image classification datasets from Bangladesh: Footpath Vision, Auto Rickshaw Detection, Mango Image Classification, Paddy Variety Recognition, and Road Damage Detection. Our experimental results demonstrate that transfer learning with fine-tuning consistently outperforms both custom CNNs built from scratch and feature extraction methods, achieving accuracy improvements ranging from 3% to 76% across different datasets. Notably, ResNet-18 with fine-tuning achieved perfect 100% accuracy on the Road Damage BD dataset. While custom CNNs offer advantages in model size (3.4M parameters vs. 11-134M for pre-trained models) and training efficiency on simpler tasks, pre-trained models with transfer learning provide superior performance, particularly on complex classification tasks with limited training data. This research provides practical insights for practitioners in selecting appropriate deep learning approaches based on dataset characteristics, computational resources, and performance requirements.
[101] PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache
Kunyang Li,Mubarak Shah,Yuzhang Shang
Main category: cs.CV
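TL;DR: Proposes PackCache, a training-free KV-cache management method that improves inference efficiency for unified autoregressive video generation by dynamically compacting the KV cache, achieving a 1.7-2.2x speedup on 48-frame sequences.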
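One of the mechanisms this entry's abstract lists is allocating cache budget to past frames according to temporal distance. The sketch below shows what such an allocation could look like. It is purely illustrative: the exponential decay, the `decay` parameter, and the rounding rule are assumptions rather than the paper's formulation, and it does not cover which tokens inside a frame are kept or how conditioning tokens are anchored.

```python
from typing import List


def frame_budgets(num_frames: int, tokens_per_frame: int,
                  total_budget: int, decay: float = 0.5) -> List[int]:
    """Split a KV-cache token budget across past frames, oldest frame first.

    The most recent frame gets weight 1 and each step back in time multiplies
    the weight by `decay`. Budgets are rounded down and capped at the frame
    size, so the sum may fall slightly below `total_budget`.
    """
    weights = [decay ** (num_frames - 1 - f) for f in range(num_frames)]
    scale = total_budget / sum(weights)
    return [min(tokens_per_frame, int(w * scale)) for w in weights]


# Four cached frames of 100 tokens each, but room for only 150 tokens in total.
budgets = frame_budgets(num_frames=4, tokens_per_frame=100, total_budget=150)
print(budgets)
```

With these toy numbers the newest frame keeps most of its tokens and the oldest keeps a small fraction, mirroring the observation that attention to earlier frames decays with temporal distance.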
Details
Motivation: In unified autoregressive models for multimodal tasks, the KV cache grows linearly with generation length and becomes the main bottleneck for inference efficiency and generation length, which particularly affects long video generation.
Method: Based on an analysis of the spatiotemporal properties of KV-cache tokens, proposes PackCache with three mechanisms: condition anchoring that preserves semantic anchors, cross-frame decay modeling that allocates cache budget by temporal distance, and spatially preserving position embedding that keeps the 3D structure coherent.
Result: PackCache accelerates end-to-end 48-frame video generation by 1.7-2.2x; on the most expensive final four frames it reaches 2.6x and 3.7x speedups on A40 and H200, respectively.
Conclusion: PackCache mitigates KV-cache growth and improves the efficiency of long-sequence video generation, offering a practical approach to efficient inference for unified multimodal models.
Abstract: A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, on the final four frames, which are the portion most affected by the progressively expanding KV cache and thus the most expensive segment of the clip, PackCache delivers 2.6x and 3.7x acceleration on A40 and H200, respectively, for 48-frame videos.
[102] Combining facial videos and biosignals for stress estimation during driving
Paraskevi Valergaki,Vassilis C. Nicodemou,Iason Oikonomidis,Antonis Argyros,Anastasios Roussos
Main category: cs.CV
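TL;DR: Proposes a Transformer-based temporal modeling framework that recognizes stress during driving from EMOCA-derived 3D facial expression and pose coefficients, fusing physiological signals through cross-modal attention to reach an AUROC of 92%.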
Details
Motivation: Reliable stress recognition from facial videos is difficult because stress is subjective and facial expressions can be voluntarily controlled; existing methods mostly rely on Facial Action Units, and disentangled 3D facial geometry remains underexplored.
Method: Extracts 3D facial expression and pose coefficients with EMOCA, analyzes how they change under stress using hypothesis tests, and builds a Transformer-based temporal model, comparing unimodal, early-fusion, and cross-modal attention strategies.
Result: 41 of 56 coefficients show significant, phase-specific stress responses; cross-modal attention fusion of EMOCA and physiological signals performs best (AUROC 92%, accuracy 86.7%), and EMOCA-gaze fusion is also competitive (AUROC 91.8%).
Conclusion: 3D facial geometry features combined with temporal modeling and cross-modal attention improve stress recognition and offer a new direction for stress sensing from facial video.
Abstract: Reliable stress recognition from facial videos is challenging due to stress's subjective nature and voluntary facial control. While most methods rely on Facial Action Units, the role of disentangled 3D facial geometry remains underexplored. We address this by analyzing stress during distracted driving using EMOCA-derived 3D expression and pose coefficients. Paired hypothesis tests between baseline and stressor phases reveal that 41 of 56 coefficients show consistent, phase-specific stress responses comparable to physiological markers. Building on this, we propose a Transformer-based temporal modeling framework and assess unimodal, early-fusion, and cross-modal attention strategies. Cross-Modal Attention fusion of EMOCA and physiological signals achieves the best performance (AUROC 92%, Accuracy 86.7%), with EMOCA-gaze fusion also competitive (AUROC 91.8%). This highlights the effectiveness of temporal modeling and cross-modal attention for stress recognition.
[103] Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection
Maxim Clouser,Kia Khezeli,John Kalantari
Main category: cs.CV
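TL;DR: By inserting low-rank adaptation (LoRA) modules into FLUX.1 Kontext and fine-tuning on only 100 paired images per domain, the authors translate RGB images into infrared (IR) and synthetic aperture radar (SAR) images; the resulting synthetic data improves object detection, suggesting that few-shot LoRA adaptation is a promising way to support non-visible modalities.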
Details
Motivation: Vision foundation models are trained mainly on RGB data, yet many safety-critical applications depend on non-visible modalities such as infrared and SAR. The paper asks whether a foundation model trained primarily on RGB can be repurposed for cross-spectral translation from a few paired samples, and whether the resulting synthetic data improves downstream detection.
Method: Starting from FLUX.1 Kontext, inserts LoRA modules and fine-tunes them on only 100 paired images per domain in two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. Translation quality is evaluated with LPIPS on 50 held-out pairs, and the best LoRA adapter is selected to generate synthetic data for training downstream detectors.
Result: LPIPS is a good proxy for downstream performance: lower LPIPS consistently predicts higher mAP. On KAIST IR and M4-SAR, synthetic IR and SAR images generated with the best LoRA adapter improve detection with YOLOv11n and DETR. On M4-SAR in particular, synthetic SAR combined with limited real SAR significantly boosts infrastructure detection.
Conclusion: Few-shot LoRA adaptation can transfer an RGB-based foundation model to non-visible modalities, producing cross-spectral images that strengthen downstream detection and opening a path toward foundation-style support for these modalities.
Abstract: Foundation models for vision are predominantly trained on RGB data, while many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator using only a few co-measured examples, and whether the resulting synthetic data can enhance downstream detection. Starting from FLUX.1 Kontext, we insert low-rank adaptation (LoRA) modules and fine-tune them on just 100 paired images per domain for two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. The adapted model translates RGB images into pixel-aligned IR/SAR, enabling us to reuse existing bounding boxes and train object detection models purely in the target modality. Across a grid of LoRA hyperparameters, we find that LPIPS computed on only 50 held-out pairs is a strong proxy for downstream performance: lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR test data. Using the best LPIPS-selected LoRA adapter, synthetic IR from external RGB datasets (LLVIP, FLIR ADAS) improves KAIST IR pedestrian detection, and synthetic SAR significantly boosts infrastructure detection on M4-SAR when combined with limited real SAR. Our results suggest that few-shot LoRA adaptation of flow-matching foundation models is a promising path toward foundation-style support for non-visible modalities.
[104] Performance Analysis of Image Classification on Bangladeshi Datasets
Mohammed Sami Khan,Fabiha Muniat,Rowzatul Zannat
Main category: cs.CV
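TL;DR: Compares a custom CNN with pre-trained models (VGG-16, ResNet-50, MobileNet) on image classification; the pre-trained models achieve higher accuracy and faster convergence, especially with limited data, while the custom CNN performs somewhat worse but has fewer parameters and lower computational cost.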
Details
Motivation: To examine the trade-off between designing a custom CNN and using pre-trained models for image classification, and to inform model selection in practice.
Method: Builds a custom CNN trained from scratch and compares it, under identical experimental settings, with VGG-16, ResNet-50, and MobileNet used via transfer learning, evaluating accuracy, precision, recall, and F1-score.
Result: Pre-trained models consistently outperform the custom CNN in classification accuracy and convergence speed, particularly when training data is limited; the custom CNN is less accurate but has fewer parameters and lower computational complexity.
Conclusion: Pre-trained models perform better overall for image classification and suit data-scarce settings, while a custom CNN has the advantage when compute is constrained, reflecting a trade-off between performance and efficiency.
Abstract: Convolutional Neural Networks (CNNs) have demonstrated remarkable success in image classification tasks; however, the choice between designing a custom CNN from scratch and employing established pre-trained architectures remains an important practical consideration. In this work, we present a comparative analysis of a custom-designed CNN and several widely used deep learning architectures, including VGG-16, ResNet-50, and MobileNet, for an image classification task. The custom CNN is developed and trained from scratch, while the popular architectures are employed using transfer learning under identical experimental settings. All models are evaluated using standard performance metrics such as accuracy, precision, recall, and F1-score. Experimental results show that pre-trained CNN architectures consistently outperform the custom CNN in terms of classification accuracy and convergence speed, particularly when training data is limited. However, the custom CNN demonstrates competitive performance with significantly fewer parameters and reduced computational complexity. This study highlights the trade-offs between model complexity, performance, and computational efficiency, and provides practical insights into selecting appropriate CNN architectures for image classification problems.
[105] 3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation
Jusheng Zhang,Yijia Fan,Zimo Wen,Jian Wang,Keze Wang
Main category: cs.CV
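TL;DR: Proposes Tri-MARF, a framework that integrates three input modalities (2D multi-view images, textual descriptions, and 3D point clouds) within a multi-agent collaborative architecture to improve large-scale 3D object annotation.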
Details
Motivation: Existing single-model methods struggle with the spatial complexity, occlusion, and viewpoint inconsistency of 3D object annotation, while demand for high-quality 3D annotation is growing in autonomous driving, robotics, and augmented reality.
Method: Tri-MARF uses a multi-agent collaborative architecture with three specialized agents: a vision-language model agent that generates multi-view descriptions, an information aggregation agent that selects the best descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined annotation. The method fuses three modalities: 2D multi-view images, textual descriptions, and 3D point clouds.
Result: Experiments on Objaverse-LVIS, Objaverse-XL, and ABO show that Tri-MARF substantially outperforms existing methods, with a CLIPScore of 88.7, ViLT R@5 retrieval accuracy of 45.2 and 43.8, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.
Conclusion: Through multi-agent collaboration and tri-modal fusion, Tri-MARF improves both the quality and the efficiency of large-scale 3D object annotation.
Abstract: Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation presents challenges beyond 2D annotation, including spatial complexity, occlusion, and viewpoint inconsistency. Existing approaches based on single models often struggle to address these issues effectively. We propose Tri-MARF, a novel framework that integrates tri-modal inputs, including 2D multi-view images, textual descriptions, and 3D point clouds, within a multi-agent collaborative architecture to enhance large-scale 3D annotation. Tri-MARF consists of three specialized agents: a vision-language model agent for generating multi-view descriptions, an information aggregation agent for selecting optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined captioning. Extensive experiments on Objaverse-LVIS, Objaverse-XL, and ABO demonstrate that Tri-MARF substantially outperforms existing methods, achieving a CLIPScore of 88.7 compared to prior state-of-the-art methods, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.
[106] From Preoperative CT to Postmastoidectomy Mesh Construction: Mastoidectomy Shape Prediction for Cochlear Implant Surgery
Yike Zhang,Eduardo Davalos,Dingjie Su,Ange Lou,Jack Noble
Main category: cs.CV
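TL;DR: Proposes a hybrid self-supervised and weakly-supervised learning framework that predicts the mastoidectomy region for cochlear implant surgery from preoperative CT scans without human annotations, achieving a mean Dice score of 0.72, better than existing methods, while using a 3D T-distribution loss in weakly-supervised medical image analysis.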
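The Dice score reported in this entry measures overlap between a predicted region and a reference region: twice the size of their intersection divided by the sum of their sizes. The sketch below computes it on small sets of voxel coordinates; the voxel sets are made-up illustrative data, not results from the paper.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def dice_score(pred: Set[Voxel], target: Set[Voxel]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two sets of voxels."""
    if not pred and not target:
        return 1.0  # two empty regions agree perfectly by convention
    return 2 * len(pred & target) / (len(pred) + len(target))


# Made-up example: four predicted voxels, three reference voxels, two shared.
predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)}
reference = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
score = dice_score(predicted, reference)
print(round(score, 3))
```

A score of 1 means the two regions coincide and 0 means they do not overlap, so the reported 0.72 indicates substantial but imperfect overlap.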
Details
Motivation: Because annotated ground-truth data is scarce, deep learning research on mastoidectomy shape prediction is limited; this work aims to fill that gap and improve the accuracy of preoperative planning and surgical safety.
Method: Proposes a hybrid self-supervised and weakly-supervised learning framework that predicts the mastoidectomy region directly from complete preoperative CT scans, and uses a 3D T-distribution loss in the weakly-supervised training.
Result: The method reaches a mean Dice score of 0.72 when predicting the complex mastoidectomy shape, which has no clear boundary, surpassing state-of-the-art methods, and it can be used to construct 3D postmastoidectomy surfaces.
Conclusion: This is the first work to combine self-supervised and weakly-supervised learning for mastoidectomy prediction, offering an efficient and robust approach to cochlear implant surgical planning with clear clinical potential.
Abstract: Cochlear Implant (CI) surgery treats severe hearing loss by inserting an electrode array into the cochlea to stimulate the auditory nerve. An important step in this procedure is mastoidectomy, which removes part of the mastoid region of the temporal bone to provide surgical access. Accurate mastoidectomy shape prediction from preoperative imaging improves pre-surgical planning, reduces risks, and enhances surgical outcomes. Despite its importance, there are limited deep-learning-based studies regarding this topic due to the challenges of acquiring ground-truth labels. We address this gap by investigating self-supervised and weakly-supervised learning models to predict the mastoidectomy region without human annotations. We propose a hybrid self-supervised and weakly-supervised learning framework to predict the mastoidectomy region directly from preoperative CT scans, where the mastoid remains intact. Our hybrid method achieves a mean Dice score of 0.72 when predicting the complex and boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance. The method provides groundwork for constructing 3D postmastoidectomy surfaces directly from the corresponding preoperative CT scans. To our knowledge, this is the first work that integrates self-supervised and weakly-supervised learning for mastoidectomy shape prediction, offering a robust and efficient solution for CI surgical planning while leveraging 3D T-distribution loss in weakly-supervised medical imaging.
[107] CRUNet-MR-Univ: A Foundation Model for Diverse Cardiac MRI Reconstruction
Donghang Lyu,Marius Staring,Hildo Lamb,Mariya Doneva
Main category: cs.CV
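TL;DR: Proposes CRUNet-MR-Univ, a foundation model that generalizes across diverse cardiac MRI scenarios by exploiting spatio-temporal correlations and prompt-based priors, and performs better than baselines on a wide variety of CMR scans.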
Details
Motivation: Existing deep learning methods for cardiac MRI reconstruction generalize poorly and struggle with variation in image contrast, sampling patterns, scanner vendors, and other factors, which limits their clinical use. Method: CRUNet-MR-Univ combines spatio-temporal correlation modeling with a prompt-based prior mechanism to build a unified foundation model that adapts to diverse CMR scanning conditions. Result: The model consistently outperforms baseline methods across a wide range of settings, showing stronger generalization and robustness. Conclusion: CRUNet-MR-Univ offers an effective route to general-purpose cardiac MRI reconstruction, with broad potential for clinical application. Abstract: In recent years, deep learning has attracted increasing attention in the field of Cardiac MRI (CMR) reconstruction due to its superior performance over traditional methods, particularly in handling higher acceleration factors, highlighting its potential for real-world clinical applications. However, current deep learning methods remain limited in generalizability. CMR scans exhibit wide variability in image contrast, sampling patterns, scanner vendors, anatomical structures, and disease types. Most existing models are designed to handle only a single or narrow subset of these variations, leading to performance degradation when faced with distribution shifts. Therefore, it is beneficial to develop a unified model capable of generalizing across diverse CMR scenarios. To this end, we propose CRUNet-MR-Univ, a foundation model that leverages spatio-temporal correlations and prompt-based priors to effectively handle the full diversity of CMR scans. Our approach consistently outperforms baseline methods across a wide range of settings, highlighting its effectiveness and promise.
[108] Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
Xingjian Diao,Zheyuan Liu,Chunhui Zhang,Weiyi Wu,Keyi Kong,Lin Shi,Kaize Ding,Soroush Vosoughi,Jiang Gui
Main category: cs.CV
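TL;DR: Proposes GPRO, which dynamically routes computation among decision paths to address overthinking and perception failures in large vision-language models, improving both accuracy and efficiency.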
Details
Motivation: Existing slow-thinking methods overlook low-level visual perception failures, a fundamental bottleneck, which leads to unstable reasoning and inefficiency. Method: GPRO is a meta-reasoning controller that, at each generation step, dynamically selects a fast path, a slow perception path, or a slow reasoning path; the controller is trained with multi-objective reinforcement learning using failure-attribution supervision derived from approximately 790k samples. Result: Experiments on five benchmarks show that GPRO substantially improves accuracy and inference efficiency, producing shorter responses while outperforming recent slow-thinking methods. Conclusion: Stable reasoning depends on reliable low-level visual grounding; GPRO distinguishes perception errors from reasoning errors and handles each accordingly, enabling efficient adaptive reasoning. Abstract: Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.
[109] UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving
Zhexiao Xiong,Xin Ye,Burhan Yaman,Sheng Cheng,Yiren Lu,Jingru Luo,Nathan Jacobs,Liu Ren
Main category: cs.CV
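TL;DR: UniDrive-WM is a unified world model built on a vision-language model (VLM) that integrates driving-scene understanding, trajectory planning, and future image generation in a single architecture. It generates future frames conditioned on the planned trajectory and feeds them back for refinement, markedly improving autonomous-driving performance.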
Details
Motivation: Existing autonomous-driving systems typically separate perception, prediction, and planning, so the modules are not tightly coordinated; work that uses VLMs for planning has not fully integrated generative modeling with decision making, which limits overall performance. Method: UniDrive-WM uses a unified VLM architecture: a trajectory planner first predicts a future trajectory, which conditions an image generator to produce future frames; the generated frames in turn inform scene understanding and trajectory refinement. Discrete and continuous output representations for future prediction are also compared. Result: On the Bench2Drive benchmark, the model improves on the previous best method by 5.9% in L2 trajectory error and 9.2% in collision rate, and generates high-fidelity future images. Conclusion: Tightly integrating VLM-driven reasoning, planning, and generative world modeling improves the perception and decision-making of autonomous-driving systems, supporting the case for a unified architecture. Abstract: World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM.
[110] Vision-Language Agents for Interactive Forest Change Analysis
James Brock,Ce Zhang,Nantheera Anantrasirichai
Main category: cs.CV
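TL;DR: Proposes an agent driven by a large language model (LLM) for integrated forest change analysis. It supports multi-task remote sensing image change interpretation (RSICI) by combining a vision-language model with natural-language querying, and performs well on the newly built Forest-Change dataset and on LEVIR-MCI-Trees.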
Details
Motivation: Pixel-level change detection and semantic change captioning for forest dynamics in high-resolution remote sensing imagery remain difficult, and the integration of LLMs with vision-language models for remote sensing change interpretation is underexplored. Method: The system builds a multi-level change interpretation (MCI) vision-language backbone and adds an LLM for task orchestration and natural-language interaction; the authors also release the Forest-Change dataset, which contains bi-temporal imagery, pixel-level change masks, and multi-granularity semantic captions. Result: The system reaches 67.10% mIoU and a BLEU-4 score of 40.17 on Forest-Change, and 88.13% mIoU and a BLEU-4 score of 34.41 on the LEVIR-MCI-Trees subset. Conclusion: LLM-driven interactive remote sensing change interpretation systems could improve the accessibility, interpretability, and analytical efficiency of forest monitoring. Abstract: Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of the LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis.
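The mIoU figures above average the intersection-over-union of each class. A minimal sketch, assuming a two-class (no-change / change) labeling flattened to integer sequences; skipping classes absent from both masks is an assumption, and the abstract does not say whether the paper averages per image or over the whole dataset:

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int = 2) -> float:
    """Mean of per-class IoU = |P_c ∩ T_c| / |P_c ∪ T_c| over the classes present."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both masks
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: class 0 IoU = 1/2, class 1 IoU = 2/3.
example = mean_iou([0, 0, 1, 1], [0, 1, 1, 1])
```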
All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.
[111] TokenSeg: Efficient 3D Medical Image Segmentation via Hierarchical Visual Token Compression
Sen Zeng,Hong Zhou,Zheng Zhu,Yang Liu
Main category: cs.CV
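TL;DR: This paper proposes TokenSeg, a boundary-aware sparse token representation framework for efficient 3D medical image segmentation. With a multi-scale encoder, a boundary-aware tokenizer, and a sparse-to-dense decoder, it substantially reduces compute cost while maintaining high accuracy.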
Details
Motivation: 3D medical image segmentation is computationally expensive because voxel processing grows cubically and homogeneous regions incur redundant computation, so a more efficient method is needed. Method: A multi-scale hierarchical encoder extracts candidate tokens, a boundary-aware tokenizer combining VQ-VAE quantization with importance scoring selects the salient tokens, and a sparse-to-dense decoder reconstructs the full segmentation mask. Result: On a 3D breast DCE-MRI dataset of 960 cases, the method reaches 94.49% Dice and 89.61% IoU while reducing GPU memory and inference latency by 64% and 68%, respectively; evaluations on the MSD cardiac and brain MRI datasets indicate good generalization. Conclusion: By using anatomically informed sparse representations, TokenSeg achieves accurate and efficient 3D medical image segmentation suited to clinical use. Abstract: Three-dimensional medical image segmentation is a fundamental yet computationally demanding task due to the cubic growth of voxel processing and the redundant computation on homogeneous regions. To address these limitations, we propose TokenSeg, a boundary-aware sparse token representation framework for efficient 3D medical volume segmentation. Specifically, (1) we design a multi-scale hierarchical encoder that extracts 400 candidate tokens across four resolution levels to capture both global anatomical context and fine boundary details; (2) we introduce a boundary-aware tokenizer that combines VQ-VAE quantization with importance scoring to select 100 salient tokens, over 60% of which lie near tumor boundaries; and (3) we develop a sparse-to-dense decoder that reconstructs full-resolution masks through token reprojection, progressive upsampling, and skip connections. Extensive experiments on a 3D breast DCE-MRI dataset comprising 960 cases demonstrate that TokenSeg achieves state-of-the-art performance with 94.49% Dice and 89.61% IoU, while reducing GPU memory and inference latency by 64% and 68%, respectively. To verify the generalization capability, our evaluations on MSD cardiac and brain MRI benchmark datasets demonstrate that TokenSeg consistently delivers optimal performance across heterogeneous anatomical structures. These results highlight the effectiveness of anatomically informed sparse representation for accurate and efficient 3D medical image segmentation.
[112] FaceRefiner: High-Fidelity Facial Texture Refinement with Differentiable Rendering-based Style Transfer
Chengyang Li,Baoping Cheng,Yao Cheng,Haocheng Zhang,Renshuai Liu,Yinglin Zheng,Jing Liao,Xuan Cheng
Main category: cs.CV
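TL;DR: This paper proposes FaceRefiner, a style-transfer-based facial texture refinement method. It treats the 3D sampled texture as style and the generated texture as content, and uses differentiable rendering to transfer information at multiple levels, improving the consistency of texture detail, structure, and identity for in-the-wild images.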
Details
Motivation: Existing facial texture generation methods are confined to a space constructed by the training data or a 2D generator, so for real-world images the details, structure, and identity of their output are poorly consistent with the input and they generalize poorly. Method: FaceRefiner uses a style-transfer framework with the 3D sampled texture as style and the generated texture as content, and integrates differentiable rendering to transfer both high-level and low-level features, in particular preserving pixel-level information in visible regions. Result: Experiments on the Multi-PIE, CelebA, and FFHQ datasets show that the method improves texture quality and identity preservation compared with current state-of-the-art methods. Conclusion: Through multi-level style transfer and differentiable rendering, FaceRefiner makes generated facial textures more realistic and more consistent with the input, improving generalization to in-the-wild images. Abstract: Recent facial texture generation methods prefer to use deep networks to synthesize image content and then fill in the UV map, thus generating a compelling full texture from a single image. Nevertheless, the synthesized texture UV map usually comes from a space constructed by the training data or the 2D face generator, which limits the methods' generalization ability for in-the-wild input images. Consequently, their facial details, structures and identity may not be consistent with the input. In this paper, we address this issue by proposing a style transfer-based facial texture refinement method named FaceRefiner. FaceRefiner treats the 3D sampled texture as style and the output of a texture generation method as content. The photo-realistic style is then expected to be transferred from the style image to the content image. Different from current style transfer methods that only transfer high and middle level information to the result, our style transfer method integrates differentiable rendering to also transfer low level (or pixel level) information in the visible face regions. The main benefit of such multi-level information transfer is that the details, structures and semantics in the input can thus be well preserved. The extensive experiments on Multi-PIE, CelebA and FFHQ datasets demonstrate that our refinement method can improve the texture quality and the face identity preserving ability, compared with state-of-the-art methods.
[113] All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction
Ziyou Jiang,Mingyang Li,Junjie Wang,Yuekai Huang,Jie Huang,Zhiyuan Chang,Zhaoyang Li,Qing Wang
Main category: cs.CV
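TL;DR: This paper proposes RepMD, a detection method for ever-shifting harmful memes based on design concept reproduction. It builds a Design Concept Graph (DCG) and uses a multimodal large language model (MLLM) to identify harmful memes that shift in type and evolve over time.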
Details
Motivation: Harmful memes in online communities shift in type and evolve over time, which traditional analysis methods struggle to handle; a method that captures the invariant design principles behind them is needed to identify and understand such content. Method: Drawing on attack trees, the paper defines the Design Concept Graph (DCG), derives it from historical memes through design step reproduction and graph pruning, and uses it to guide a multimodal large language model (MLLM) in detecting harmful memes. Result: RepMD reaches 81.1% detection accuracy, with only slight accuracy drops on type-shifting and temporally evolving memes; human evaluation reports that it improves the efficiency of human discovery of harmful memes, with 15 to 30 seconds per meme. Conclusion: By capturing the invariant design concepts behind harmful memes, RepMD detects ever-shifting memes efficiently and interpretably, with good generalization and practical value. Abstract: Harmful memes are ever-shifting in the Internet communities, which are difficult to analyze due to their type-shifting and temporal-evolving nature. Although these memes are shifting, we find that different memes may share invariant principles, i.e., the underlying design concept of malicious users, which can help us analyze why these memes are harmful. In this paper, we propose RepMD, an ever-shifting harmful meme detection method based on the design concept reproduction. We first refer to the attack tree to define the Design Concept Graph (DCG), which describes steps that people may take to design a harmful meme. Then, we derive the DCG from historical memes with design step reproduction and graph pruning. Finally, we use DCG to guide the Multimodal Large Language Model (MLLM) to detect harmful memes. The evaluation results show that RepMD achieves the highest accuracy with 81.1% and has slight accuracy decreases when generalized to type-shifting and temporal-evolving memes. Human evaluation shows that RepMD can improve the efficiency of human discovery on harmful memes, with 15 to 30 seconds per meme.
[114] 3D Conditional Image Synthesis of Left Atrial LGE MRI from Composite Semantic Masks
Yusri Al-Sanaani,Rebecca Thornhill,Sreeraman Rajan
Main category: cs.CV
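TL;DR: This study uses 3D conditional generative models (in particular SPADE-LDM) to synthesize high-fidelity LGE MRI images that augment scarce training data and improve left atrial segmentation; experiments show a significant gain in segmentation accuracy.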
Details
Motivation: Scarce annotated data and complex anatomy make machine-learning segmentation of the left atrial wall and cavity difficult, so an effective data augmentation method is needed. Method: A pipeline generates 3D LGE MRI images from semantic label maps; three generators (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM) are compared, and the synthetic images are used to train a 3D U-Net to assess their effect on segmentation. Result: SPADE-LDM produces the most realistic images (FID = 4.063), better than the GAN models; with synthetic data added, the 3D U-Net's Dice score for LA cavity segmentation rises from 0.908 to 0.936, a statistically significant difference (p < 0.05). Conclusion: Label-conditioned 3D image synthesis can improve segmentation of under-represented cardiac structures and offers a workable answer to data scarcity in medical image analysis. Abstract: Segmentation of the left atrial (LA) wall and endocardium from late gadolinium-enhanced (LGE) MRI is essential for quantifying atrial fibrosis in patients with atrial fibrillation. The development of accurate machine learning-based segmentation models remains challenging due to the limited availability of data and the complexity of anatomical structures. In this work, we investigate 3D conditional generative models as a potential solution for augmenting scarce LGE training data and improving LA segmentation performance. We develop a pipeline to synthesize high-fidelity 3D LGE MRI volumes from composite semantic label maps combining anatomical expert annotations with unsupervised tissue clusters, using three 3D conditional generators (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM). The synthetic images are evaluated for realism and their impact on downstream LA segmentation. SPADE-LDM generates the most realistic and structurally accurate images, achieving an FID of 4.063 and surpassing GAN models, which have FIDs of 40.821 and 7.652 for Pix2Pix and SPADE-GAN, respectively. When augmented with synthetic LGE images, the Dice score for LA cavity segmentation with a 3D U-Net model improved from 0.908 to 0.936, showing a statistically significant improvement (p < 0.05) over the baseline. These findings demonstrate the potential of label-conditioned 3D synthesis to enhance the segmentation of under-represented cardiac structures.
[115] MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing
Zihao Lin,Wanrong Zhu,Jiuxiang Gu,Jihyung Kil,Christopher Tensmeyer,Lin Zhang,Shilong Liu,Ruiyi Zhang,Lifu Huang,Vlad I. Morariu,Tong Sun
Main category: cs.CV
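TL;DR: This paper proposes MiLDEAgent, a reasoning-based framework for multi-layer design document editing that pairs a multimodal reasoner trained with reinforcement learning with an image editor, enabling fine-grained edits of multi-layer documents such as posters from natural-language instructions. It also builds the MiLDEBench benchmark of more than 20K samples and the MiLDEEval evaluation protocol. Experiments show that the method significantly outperforms open-source models and is comparable to closed-source ones.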
Details
Motivation: Prior work focuses on single-layer image editing or multi-layer generation and lacks fine-grained, layer-aware reasoning and editing for multi-layer design documents such as posters, so it cannot reliably determine what to modify and where from a natural-language instruction; a dedicated editing framework and benchmark are needed. Method: MiLDEAgent comprises a multimodal reasoner trained with reinforcement learning, which understands the document's layer structure and locates the layers to modify, and an image editor that carries out the modifications. The MiLDEBench dataset and the MiLDEEval protocol evaluate performance along four dimensions: instruction following, layout consistency, aesthetics, and text rendering. Result: Experiments on 14 open-source and 2 closed-source models show that existing methods generally perform poorly: open-source models often fail to complete the task, and closed-source models are prone to format errors. MiLDEAgent shows strong layer-aware reasoning and precise editing, clearly leads the open-source models, and reaches performance comparable to closed-source models. Conclusion: MiLDEAgent establishes the first strong baseline for multi-layer design document editing and confirms the key role of layer-aware reasoning in complex document editing. Abstract: Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions: instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations.
In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.
[116] Detection of Deployment Operational Deviations for Safety and Security of AI-Enabled Human-Centric Cyber Physical Systems
Bernard Ngabonziza,Ayan Banerjee,Sandeep K. S. Gupta
Main category: cs.CV
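TL;DR: This paper discusses the safety and security challenges that AI-enabled human-centric cyber-physical systems face under uncertain operating conditions and proposes an evaluation framework for keeping such systems safe and secure in deployment. Using closed-loop blood glucose control for diabetics as an example, it presents a new personalized image-based technique for detecting unannounced meals.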
Details
Motivation: Uncertainty introduced by human-system interaction can violate a system's safety and security requirements, so the operational deviations of AI-enabled human-centric cyber-physical systems under unknown operating conditions need study. Method: The paper proposes an evaluation framework for analyzing different strategies to ensure safety and security during deployment, and designs a personalized image-based technique for detecting unannounced meals as a worked example. Result: The paper delivers a framework for evaluating safety and security strategies for AI-enabled human-centric cyber-physical systems and demonstrates the image-based detection method on the case of blood glucose control for diabetics. Conclusion: The framework helps improve the safety and security of these systems in uncertain environments, and the personalized image-based technique can identify unannounced meals, making closed-loop control more reliable. Abstract: In recent years, human-centric cyber-physical systems have increasingly involved artificial intelligence to enable knowledge extraction from sensor-collected data. Examples include medical monitoring and control systems, as well as autonomous cars. Such systems are intended to operate according to the protocols and guidelines for regular system operations. However, in many scenarios, such as closed-loop blood glucose control for Type 1 diabetics, self-driving cars, and monitoring systems for stroke diagnosis, the operation of such AI-enabled human-centric applications can expose them to cases in which their operational mode is uncertain, for instance as a result of a human's interactions with the system. Such cases, in which the system is in uncertain conditions, can violate the system's safety and security requirements. This paper will discuss operational deviations that can lead these systems to operate in unknown conditions. We will then create a framework to evaluate different strategies for ensuring the safety and security of AI-enabled human-centric cyber-physical systems in operational deployment. Then, as an example, we show a novel personalized image-based technique for detecting the non-announcement of meals in closed-loop blood glucose control for Type 1 diabetics.
[117] HUR-MACL: High-Uncertainty Region-Guided Multi-Architecture Collaborative Learning for Head and Neck Multi-Organ Segmentation
Xiaoyu Liu,Siwen Wei,Linhao Qu,Mingyuan Pan,Chengsheng Zhang,Yonghong Shi,Zhijian Song
Main category: cs.CV
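TL;DR: Proposes HUR-MACL, a high-uncertainty-region-guided multi-architecture collaborative learning model for head and neck multi-organ segmentation. It combines a CNN, Vision Mamba, and Deformable CNN with a heterogeneous feature distillation loss and reaches SOTA performance on multiple datasets.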
Details
Motivation: Existing deep learning models perform poorly on small, complexly shaped organs, and hybrid architectures usually just concatenate features, which causes functional overlap and limits accuracy. Method: HUR-MACL uses a CNN to adaptively identify high-uncertainty regions, applies Vision Mamba and Deformable CNN jointly within those regions, and adds a heterogeneous feature distillation loss that promotes collaborative learning between the two architectures there. Result: The method achieves state-of-the-art segmentation performance on two public datasets and one private dataset. Conclusion: By focusing on high-uncertainty regions and making different architectures collaborate effectively, HUR-MACL improves segmentation accuracy for head and neck organs and shows good potential for clinical application. Abstract: Accurate segmentation of organs at risk in the head and neck is essential for radiation therapy, yet deep learning models often fail on small, complexly shaped organs. While hybrid architectures that combine different models show promise, they typically just concatenate features without exploiting the unique strengths of each component. This results in functional overlap and limited segmentation accuracy. To address these issues, we propose a high uncertainty region-guided multi-architecture collaborative learning (HUR-MACL) model for multi-organ segmentation in the head and neck. This model adaptively identifies high uncertainty regions using a convolutional neural network, and for these regions, Vision Mamba as well as Deformable CNN are utilized to jointly improve their segmentation accuracy. Additionally, a heterogeneous feature distillation loss is proposed to promote collaborative learning between the two architectures in high uncertainty regions to further enhance performance. Our method achieves SOTA results on two public datasets and one private dataset.
[118] HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment
Wenzhi Chen,Bo Hu,Leida Li,Lihuo He,Wen Lu,Xinbo Gao
Main category: cs.CV
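TL;DR: Proposes HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. It maps CLIP features into hyperbolic space and uses dynamic-supervision entailment modeling and an adaptive modulation regressor to assess image-text alignment more accurately.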
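Mapping Euclidean features into hyperbolic space is commonly done with an exponential map at the origin. A minimal sketch, assuming the Poincaré ball model with curvature -c (the summary does not say which hyperbolic model HyperAlign uses) and plain Python lists as stand-ins for CLIP feature vectors; both function names are illustrative:

```python
import math
from typing import List, Sequence


def exp_map_origin(v: Sequence[float], c: float = 1.0) -> List[float]:
    """Exponential map at the origin: tanh(sqrt(c)*|v|) * v / (sqrt(c)*|v|)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x: Sequence[float], y: Sequence[float], c: float = 1.0) -> float:
    """Geodesic distance between two points inside the ball of radius 1/sqrt(c)."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    x_sq = sum(a * a for a in x)
    y_sq = sum(b * b for b in y)
    arg = 1.0 + 2.0 * c * diff_sq / ((1.0 - c * x_sq) * (1.0 - c * y_sq))
    return math.acosh(arg) / math.sqrt(c)


# Toy feature with Euclidean norm 0.5; with c = 1 its image lies at
# hyperbolic distance 2 * 0.5 = 1.0 from the origin.
point = exp_map_origin([0.3, 0.4])
distance = poincare_distance([0.0, 0.0], point)
```

The mapped point always lies strictly inside the ball, which is what lets distances grow without bound near the boundary and makes the space suited to hierarchical (entailment-like) structure.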
Details
Motivation: Existing methods rely on Euclidean-space metrics, ignore the structured nature of semantic alignment, and cannot adapt to individual samples, so they struggle to assess accurately how well a generated image matches its text prompt. Method: CLIP first extracts Euclidean features, which are mapped into hyperbolic space; a dynamic-supervision entailment modeling mechanism then turns discrete entailment logic into continuous geometric structure supervision; finally, an adaptive modulation regressor uses hyperbolic geometric features to generate sample-level modulation parameters that calibrate cosine similarity and predict the final score. Result: HyperAlign achieves highly competitive performance on both single-database evaluation and cross-database generalization, supporting the effectiveness of hyperbolic geometric modeling for image-text alignment assessment. Conclusion: Adaptive modeling based on hyperbolic entailment geometry better captures the hierarchical structure and logical relations of semantic alignment, improving the accuracy and generalization of alignment assessment in text-to-image generation. Abstract: With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, while lacking adaptive capabilities for different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.
[119] Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning
Wentao Zhang,Lifei Wang,Lina Lu,MingKun Xu,Shangyang Li,Yanchao Yang,Tao Fang
Main category: cs.CV
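TL;DR: Proposes Agri-R1, a reasoning-enhanced large model for agriculture. It automatically generates high-quality reasoning data through vision-language synthesis and LLM filtering, and is trained with Group Relative Policy Optimization (GRPO) and a domain-specific reward function; with few samples it markedly improves disease recognition, agricultural QA, and cross-domain generalization.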
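GRPO scores each sampled answer relative to the other answers drawn for the same prompt. A minimal sketch of that group-relative step, paired with a hypothetical fuzzy-matching reward: the reward's form, its 0.5/0.5 weighting, and the use of `difflib` are assumptions for illustration, not the paper's reward function, and normalizing by the population standard deviation is likewise an assumed convention:

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev
from typing import List, Sequence, Set


def fuzzy_reward(answer: str, reference: str, lexicon: Set[str]) -> float:
    """Hypothetical reward: string similarity plus overlap of lexicon terms.

    Lexicon terms are expected in lowercase.
    """
    a, r = answer.lower(), reference.lower()
    similarity = SequenceMatcher(None, a, r).ratio()
    ref_terms = {term for term in lexicon if term in r}
    hit_terms = {term for term in ref_terms if term in a}
    overlap = len(hit_terms) / len(ref_terms) if ref_terms else 0.0
    return 0.5 * similarity + 0.5 * overlap


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of answers to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of two sampled answers with rewards 1.0 and 0.0.
advantages = group_relative_advantages([1.0, 0.0])
perfect = fuzzy_reward("Tomato late blight", "tomato late blight", {"late blight"})
```

Answers above the group mean receive positive advantages and are reinforced; when every answer in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.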
Details
Motivation: Existing agricultural disease diagnosis models depend on large amounts of labeled data, lack interpretability, and generalize poorly; current reasoning methods rely on costly expert annotation and struggle with open-ended, diverse agricultural questions. Method: The Agri-R1 framework generates high-quality reasoning data automatically through vision-language synthesis and LLM-based filtering, and is trained with Group Relative Policy Optimization (GRPO) using a reward function that combines domain lexicons with fuzzy matching to assess both the correctness and the linguistic flexibility of open-ended answers. Result: On CDDMBench, the 3B-parameter model is competitive with 7B to 13B baselines, with a 23.2% relative gain in disease recognition accuracy, a 33.3% gain in agricultural knowledge QA, and a 26.10-point improvement in cross-domain generalization; ablations confirm that reasoning data and GRPO work together, and the benefit grows with question complexity. Conclusion: Through automated reasoning data generation and domain-adapted reinforcement learning, Agri-R1 delivers efficient, interpretable, and well-generalizing agricultural diagnosis, and offers a workable approach to few-shot, open-domain agricultural problems. Abstract: Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose Agri-R1, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a newly proposed reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.
[120] DB-MSMUNet: Dual Branch Multi-scale Mamba UNet for Pancreatic CT Scans Segmentation
Qiu Guan,Zhiqiang Yang,Dezhang Ye,Yang Chen,Xinli Xu,Ying Tang
Main category: cs.CV
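TL;DR: This paper proposes DB-MSMUNet, a dual-branch multi-scale Mamba UNet for segmenting the pancreas and its lesions in CT images; an improved encoder and a dual-decoder structure notably improve segmentation accuracy and boundary preservation.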
Details
Motivation: Low tissue contrast, blurry boundaries, irregular shape, and tiny lesions make accurate segmentation of the pancreas and its lesions in CT images very challenging for existing methods. Method: In DB-MSMUNet, the encoder uses a Multi-scale Mamba Module (MSMM) that combines deformable convolutions with multi-scale state space modeling. The network has two decoders: an edge decoder with an Edge Enhancement Path (EEP) that refines boundaries, and an area decoder with a Multi-layer Decoder (MLD) that preserves detail and reconstructs small lesions. Auxiliary Deep Supervision (ADS) is added at multiple scales to strengthen gradient feedback and feature discrimination. Result: On the NIH Pancreas dataset, the MSD dataset, and a clinical tumor dataset, DB-MSMUNet reaches Dice Similarity Coefficients of 89.47%, 87.59%, and 89.02%, respectively, outperforming most existing methods with better edge preservation and cross-dataset robustness. Conclusion: DB-MSMUNet performs well on pancreas and lesion segmentation, with good accuracy, boundary refinement, and generalization, and is suited to real clinical CT image analysis. Abstract: Accurate segmentation of the pancreas and its lesions in CT scans is crucial for the precise diagnosis and treatment of pancreatic cancer. However, it remains a highly challenging task due to several factors such as low tissue contrast with surrounding organs, blurry anatomical boundaries, irregular organ shapes, and the small size of lesions. To tackle these issues, we propose DB-MSMUNet (Dual-Branch Multi-scale Mamba UNet), a novel encoder-decoder architecture designed specifically for robust pancreatic segmentation. The encoder is constructed using a Multi-scale Mamba Module (MSMM), which combines deformable convolutions and multi-scale state space modeling to enhance both global context modeling and local deformation adaptation. The network employs a dual-decoder design: the edge decoder introduces an Edge Enhancement Path (EEP) to explicitly capture boundary cues and refine fuzzy contours, while the area decoder incorporates a Multi-layer Decoder (MLD) to preserve fine-grained details and accurately reconstruct small lesions by leveraging multi-scale deep semantic features. Furthermore, Auxiliary Deep Supervision (ADS) heads are added at multiple scales to both decoders, providing more accurate gradient feedback and further enhancing the discriminative capability of multi-scale features. We conduct extensive experiments on three datasets: the NIH Pancreas dataset, the MSD dataset, and a clinical pancreatic tumor dataset provided by collaborating hospitals.
DB-MSMUNet achieves Dice Similarity Coefficients of 89.47%, 87.59%, and 89.02%, respectively, outperforming most existing state-of-the-art methods in terms of segmentation accuracy, edge preservation, and robustness across different datasets. These results demonstrate the effectiveness and generalizability of the proposed method for real-world pancreatic CT segmentation tasks.
[121] HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution
Yang Zou,Xingyue Zhu,Kaiqi Han,Jun Ma,Xingyuan Li,Zhiying Jiang,Jinyuan Liu
Main category: cs.CV
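TL;DR: This paper proposes HATIR, a heat-aware diffusion method for infrared video super-resolution that jointly models atmospheric turbulence degradation and the loss of structural detail, and builds FLIR-IVSR, the first dataset for turbulent infrared VSR.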
Details
Motivation: Existing video super-resolution methods either ignore the modality gap between infrared and visible images or fail to restore turbulence-induced distortions, and cascading turbulence mitigation with super-resolution causes errors to propagate. Method: HATIR injects heat-aware deformation priors into the diffusion sampling path. A Phasor-Guided Flow Estimator and a Turbulence-Aware Decoder use physical priors to guide the reverse diffusion process and strengthen edge-aware feature aggregation. Result: The method is validated on the authors' FLIR-IVSR dataset, where jointly modeling turbulence degradation and structural detail loss improves infrared video super-resolution performance. Conclusion: By combining physical priors with a diffusion model, HATIR addresses error accumulation and structural distortion in infrared video super-resolution under turbulence. Abstract: Infrared video has been of great interest in visual tasks under challenging environments, but often suffers from severe atmospheric turbulence and compression degradation. Existing video super-resolution (VSR) methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions. Directly cascading turbulence mitigation (TM) algorithms with VSR methods leads to error propagation and accumulation due to the decoupled modeling of degradation between turbulence and resolution. We introduce HATIR, a Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution, which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, HATIR constructs a Phasor-Guided Flow Estimator, rooted in the physical principle that thermally active regions exhibit consistent phasor responses over time, enabling reliable turbulence-aware flow to guide the reverse diffusion process. To ensure the fidelity of structural recovery under nonuniform distortions, a Turbulence-Aware Decoder is proposed to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation via turbulence gating and structure-aware attention. We built FLIR-IVSR, the first dataset for turbulent infrared VSR, comprising paired LR-HR sequences from a FLIR T1050sc camera (1024 × 768) spanning 640 diverse scenes with varying camera and object motion conditions. This encourages future research in infrared VSR.
Project page: https://github.com/JZ0606/HATIR
[122] WebCryptoAgent: Agentic Crypto Trading with Web Informatics
Ali Kurban,Wei Luo,Liangyu Zuo,Zeyu Zhang,Renda Han,Zhaolu Kang,Hao Tang
Main category: cs.CV
TL;DR: WebCryptoAgent 是一个用于加密货币交易的代理框架,能够整合多源网络信息与市场数据,通过模块化代理和解耦控制架构实现稳健的短时决策与实时风险控制。
Details
Motivation: 现有交易系统难以同时处理多源异构网络信息并应对亚秒级价格冲击,缺乏对非结构化内容与结构化信号的协同推理能力,且风险响应延迟。 Method: 提出 WebCryptoAgent 框架,将决策分解为模态特定的代理,生成统一证据文档进行置信度校准推理;采用解耦控制架构,分离小时级策略推理与秒级实时风险模型,实现快速风险响应。 Result: 在真实加密货币市场上的实验表明,该方法相比基线模型提升了交易稳定性,减少了虚假交易行为,并增强了尾部风险处理能力。 Conclusion: WebCryptoAgent 有效解决了多源信息融合与实时风险控制的矛盾,为高波动环境下的自动化交易提供了可解释、鲁棒的新范式。 Abstract: Cryptocurrency trading increasingly depends on timely integration of heterogeneous web information and market microstructure signals to support short-horizon decision making under extreme volatility. However, existing trading systems struggle to jointly reason over noisy multi-source web evidence while maintaining robustness to rapid price shocks at sub-second timescales. The first challenge lies in synthesizing unstructured web content, social sentiment, and structured OHLCV signals into coherent and interpretable trading decisions without amplifying spurious correlations, while the second challenge concerns risk control, as slow deliberative reasoning pipelines are ill-suited for handling abrupt market shocks that require immediate defensive responses. To address these challenges, we propose WebCryptoAgent, an agentic trading framework that decomposes web-informed decision making into modality-specific agents and consolidates their outputs into a unified evidence document for confidence-calibrated reasoning. We further introduce a decoupled control architecture that separates strategic hourly reasoning from a real-time second-level risk model, enabling fast shock detection and protective intervention independent of the trading loop. Extensive experiments on real-world cryptocurrency markets demonstrate that WebCryptoAgent improves trading stability, reduces spurious activity, and enhances tail-risk handling compared to existing baselines. Code will be available at https://github.com/AIGeeksGroup/WebCryptoAgent.[123] Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models
Yanbing Zeng,Jia Wang,Hanghang Ma,Junqiang Wu,Jie Zhu,Xiaoming Wei,Jie Hu
Main category: cs.CV
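TL;DR: This paper proposes Forge-and-Quench, a unified multimodal framework that uses a "Bridge Feature" derived from the understanding model to improve the fidelity and detail richness of generated images.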
Details
Motivation: Existing work mainly leverages the reasoning ability and world knowledge of understanding models, but how to use understanding to improve the quality of image generation remains underexplored. This paper aims to improve the detail and realism of generated images through the understanding model.
Method: Proposes the Forge-and-Quench framework. An MLLM first generates an enhanced text instruction from the conversational context. A newly designed Bridge Adapter then maps it to a virtual visual representation (the Bridge Feature), which is injected into the T2I model as a visual guidance signal and guides image generation together with the enhanced text.
Result: Experiments show that the method significantly improves multiple T2I models in image fidelity, detail richness, instruction following, and world-knowledge application, with good cross-model transferability and low training overhead.
Conclusion: Forge-and-Quench lets understanding effectively assist generation, and shows that passing understanding information through the Bridge Feature is an efficient, flexible, and extensible architecture for multimodal generation.
Abstract: Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM's inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application. Models and codes are available at https://github.com/YanbingZeng/Forge-and-Quench.
[124] On the Holistic Approach for Detecting Human Image Forgery
Xiao Guo,Jie Zhu,Anil Jain,Xiaoming Liu
Main category: cs.CV
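TL;DR: This paper proposes HuForDet, a holistic human image forgery detection framework that combines facial and full-body contextual information to detect many kinds of deepfake content in a unified way.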
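The face branch in this entry relies on a Laplacian-of-Gaussian (LoG) module. As background, here is a minimal sketch of a plain, fixed-scale LoG kernel using the standard closed-form expression. It is not the paper's adaptive module: the function name `log_kernel`, the fixed `sigma`, and the zero-sum shift are assumptions made for illustration.

```python
import math


def log_kernel(size, sigma):
    """Return a size x size Laplacian-of-Gaussian kernel (size must be odd).

    Uses LoG(x, y) = -(1 / (pi * sigma^4)) * (1 - r^2 / (2 sigma^2)) * exp(-r^2 / (2 sigma^2)),
    then subtracts the mean so the discrete kernel sums to zero and flat regions give no response.
    """
    half = size // 2
    s2 = sigma * sigma
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            value = -(1.0 / (math.pi * s2 * s2)) * (1.0 - r2 / (2.0 * s2)) * math.exp(-r2 / (2.0 * s2))
            row.append(value)
        kernel.append(row)
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]
```

A small `sigma` responds to fine structures such as blending boundaries, while a larger `sigma` responds to coarser texture changes, which is presumably why a scale-adaptive variant is attractive for forgery artifacts.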
Details
Motivation: Existing deepfake detection methods are mostly limited to either the face or the full body, lack cross-region generalization, and struggle with increasingly complex AI-generated content.
Method: HuForDet uses a dual-branch architecture. One branch detects face forgeries with heterogeneous experts in the RGB and frequency domains (such as an adaptive LoG module). The other uses a multimodal large language model (MLLM) to analyze full-body semantic consistency, and a confidence estimation mechanism dynamically weights its features during fusion. The authors also build HuFor, a new dataset covering both face and full-body forgeries.
Result: Experiments show that HuForDet achieves state-of-the-art detection performance across many forgery types, with stronger robustness and generalization.
Conclusion: By fusing local detail with global semantic information, HuForDet detects whole-human image forgeries effectively and moves deepfake detection toward a more comprehensive, unified approach.
Abstract: The rapid advancement of AI-generated content (AIGC) has escalated the threat of deepfakes, from facial manipulations to the synthesis of entire photorealistic human bodies. However, existing detection methods remain fragmented, specializing either in facial-region forgeries or full-body synthetic images, and consequently fail to generalize across the full spectrum of human image manipulations. We introduce HuForDet, a holistic framework for human image forgery detection, which features a dual-branch architecture comprising: (1) a face forgery detection branch that employs heterogeneous experts operating in both RGB and frequency domains, including an adaptive Laplacian-of-Gaussian (LoG) module designed to capture artifacts ranging from fine-grained blending boundaries to coarse-scale texture irregularities; and (2) a contextualized forgery detection branch that leverages a Multi-Modal Large Language Model (MLLM) to analyze full-body semantic consistency, enhanced with a confidence estimation mechanism that dynamically weights its contribution during feature fusion. We curate a human image forgery (HuFor) dataset that unifies existing face forgery data with a new corpus of full-body synthetic humans. Extensive experiments show that our HuForDet achieves state-of-the-art forgery detection performance and superior robustness across diverse human image forgeries.
[125] Training a Custom CNN on Five Heterogeneous Image Datasets
Anika Tabassum,Tasnuva Mahazabin Tuba,Nafisa Naznin
Main category: cs.CV
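TL;DR: This study evaluates a custom lightweight CNN against deep networks such as ResNet-18 and VGG-16 on five agricultural and urban visual classification tasks, examines how model complexity, depth, and transfer learning affect performance, proposes an efficient cross-domain model, and highlights the advantage of transfer learning when data are limited.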
Details
Motivation: To address the diversity of image data across real-world scenarios (illumination, resolution, class imbalance, and so on) and to explore efficient convolutional architectures suited to resource-constrained environments.
Method: Compares a custom lightweight CNN with ResNet-18 and VGG-16, each trained from scratch and with transfer learning, combined with systematic preprocessing and data augmentation, on five heterogeneous datasets.
Result: The custom CNN achieves competitive performance on several tasks. Transfer learning markedly improves results when data are scarce. Deeper architectures have an advantage on complex tasks, at a computational cost that must be weighed.
Conclusion: Lightweight custom models can classify efficiently across multiple domains, transfer learning plays a key role on small datasets, and the study offers practical deployment guidance for resource-constrained vision tasks.
Abstract: Deep learning has transformed visual data analysis, with Convolutional Neural Networks (CNNs) becoming highly effective in learning meaningful feature representations directly from images. Unlike traditional manual feature engineering methods, CNNs automatically extract hierarchical visual patterns, enabling strong performance across diverse real-world contexts. This study investigates the effectiveness of CNN-based architectures across five heterogeneous datasets spanning agricultural and urban domains: mango variety classification, paddy variety identification, road surface condition assessment, auto-rickshaw detection, and footpath encroachment monitoring. These datasets introduce varying challenges, including differences in illumination, resolution, environmental complexity, and class imbalance, necessitating adaptable and robust learning models. We evaluate a lightweight, task-specific custom CNN alongside established deep architectures, including ResNet-18 and VGG-16, trained both from scratch and using transfer learning. Through systematic preprocessing, augmentation, and controlled experimentation, we analyze how architectural complexity, model depth, and pre-training influence convergence, generalization, and performance across datasets of differing scale and difficulty. The key contributions of this work are: (1) the development of an efficient custom CNN that achieves competitive performance across multiple application domains, and (2) a comprehensive comparative analysis highlighting when transfer learning and deep architectures provide substantial advantages, particularly in data-constrained environments. These findings offer practical insights for deploying deep learning models in resource-limited yet high-impact real-world visual classification tasks.
[126] AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection
Yunqing Hu,Zheming Yang,Chang Zhao,Qi Guo,Meng Gao,Pengcheng Li,Wen Ji
Main category: cs.CV
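TL;DR: This paper proposes the AIVD framework, in which a lightweight edge detector collaborates with a cloud multimodal large model to achieve precise localization and high-quality semantic generation. It also designs a fine-tuning strategy with visual-semantic collaborative augmentation and a resource-aware dynamic scheduling algorithm, substantially reducing resource consumption while improving classification performance, semantic quality, and system efficiency.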
Details
Motivation: Multimodal large language models excel at semantic understanding but still face challenges in precise object localization and in resource-constrained edge-cloud deployment.
Method: Proposes the AIVD framework, which combines a lightweight edge detector with a cloud MLLM, uses a fine-tuning strategy with visual-semantic collaborative augmentation to improve robustness, and designs a heterogeneous resource-aware dynamic scheduling algorithm to optimize system performance.
Result: Experiments show that AIVD substantially reduces resource consumption and improves the MLLM's classification accuracy and semantic generation quality, while the scheduling strategy achieves higher throughput and lower latency.
Conclusion: AIVD unifies precise localization with high-quality semantic generation and shows potential for efficient, robust edge-cloud collaboration across diverse scenarios.
Abstract: Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, this paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through the collaboration between lightweight edge detectors and cloud-based MLLMs. To enhance the cloud MLLM's robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results demonstrate that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.
[127] Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition
Masatomo Yoshida,Haruto Namura,Nicola Adami,Masahiro Okuda
Main category: cs.CV
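TL;DR: Proposes a skeletonization-based adversarial attack that effectively shrinks the search space and specifically targets images containing text, especially mathematical formula images. It evaluates character-level and semantic changes between original and adversarially perturbed outputs, and demonstrates the method's effectiveness and practical relevance on ChatGPT.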
Details
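Motivation: To explore the capabilities and limitations of foundation models in visual understanding, particularly when handling text with complex structure such as mathematical formulas.
Method: Introduces skeletonization into adversarial attacks, generating adversarial examples over a reduced search space and analyzing how the model's output changes at the character and semantic levels.
Result: The method effectively attacks image inputs containing mathematical formulas. Experiments on ChatGPT show that it can cause significant output deviations, revealing the model's fragility on structured visual information.
Conclusion: Foundation models have clear weaknesses when processing complex visual text, and skeletonization-based adversarial attacks offer a new perspective and an effective tool for evaluating their visual reasoning.
Abstract: This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method utilizing skeletonization to reduce the search space effectively. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure. We conduct a detailed evaluation of both character and semantic changes between original and adversarially perturbed outputs to provide insights into the models' visual interpretation and reasoning abilities. The effectiveness of our method is further demonstrated through its application to ChatGPT, which shows its practical implications in real-world scenarios.
[128] ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting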
Yen-Jen Chiou,Wei-Tse Cheng,Yuan-Fu Yang
Main category: cs.CV
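TL;DR: ProFuse is an efficient, context-aware framework for open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3DGS), achieving fast and consistent semantic understanding without render-supervised fine-tuning.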
Details
Motivation: Existing 3D scene understanding methods in the open-vocabulary setting suffer from cross-view inconsistency and poor intra-mask cohesion, and depend on pretrained models or time-consuming optimization, which limits efficiency and generalization.
Method: Proposes a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry and builds 3D Context Proposals through cross-view clustering. A global feature is obtained by weighted aggregation and fused onto the Gaussians during direct registration to maintain language coherence.
Result: ProFuse markedly improves cross-view consistency and intra-mask cohesion while retaining geometric refinement. Semantic attachment takes about 5 minutes per scene, twice as fast as the existing state of the art.
Conclusion: ProFuse achieves efficient open-vocabulary 3D scene understanding without render-supervised fine-tuning, outperforms existing methods in both speed and quality, and suits practical applications that need fast deployment.
Abstract: We present ProFuse, an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup, adding minimal overhead and requiring no render-supervised fine-tuning. Instead of relying on a pretrained 3DGS scene, we introduce a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, and this feature is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. With associations established in advance, semantic fusion requires no additional optimization beyond standard reconstruction, and the model retains geometric refinement without densification. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, which is two times faster than SOTA.
[129] Segmentation-Driven Monocular Shape from Polarization based on Physical Model
Jinyu Zhang,Xu Ma,Weili Chen,Gonzalo R. Arce
Main category: cs.CV
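TL;DR: Proposes a new segmentation-driven monocular shape-from-polarization (SMSfP) framework that reconstructs surface normals locally over adaptively segmented regions, effectively resolving the azimuth ambiguity of traditional methods and substantially improving 3D reconstruction accuracy.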
Details
Motivation: Existing monocular shape-from-polarization methods suffer from azimuth ambiguity, which harms reconstruction accuracy and stability, so a more robust method is needed.
Method: Proposes a polarization-aided adaptive region growing (PARG) segmentation strategy that decomposes the global convexity assumption into locally convex regions, and introduces a multi-scale fusion convexity prior (MFCP) constraint to maintain local consistency and recover detail.
Result: Experiments on synthetic and real-world datasets show that the method outperforms existing physics-based monocular SfP techniques in disambiguation and geometric fidelity.
Conclusion: The SMSfP framework effectively suppresses azimuth ambiguity and improves surface continuity and detail recovery, providing a more accurate and stable solution for monocular polarization-based 3D reconstruction.
Abstract: Monocular shape-from-polarization (SfP) leverages the intrinsic relationship between light polarization properties and surface geometry to recover surface normals from single-view polarized images, providing a compact and robust approach for three-dimensional (3D) reconstruction. Despite its potential, existing monocular SfP methods suffer from azimuth angle ambiguity, an inherent limitation of polarization analysis, that severely compromises reconstruction accuracy and stability. This paper introduces a novel segmentation-driven monocular SfP (SMSfP) framework that reformulates global shape recovery into a set of local reconstructions over adaptively segmented convex sub-regions. Specifically, a polarization-aided adaptive region growing (PARG) segmentation strategy is proposed to decompose the global convexity assumption into locally convex regions, effectively suppressing azimuth ambiguities and preserving surface continuity. Furthermore, a multi-scale fusion convexity prior (MFCP) constraint is developed to ensure local surface consistency and enhance the recovery of fine textural and structural details. Extensive experiments on both synthetic and real-world datasets validate the proposed approach, showing significant improvements in disambiguation accuracy and geometric fidelity compared with existing physics-based monocular SfP techniques.
[130] GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models
Shurong Zheng,Yousong Zhu,Hongyin Zhao,Fan Yang,Yufei Zhan,Ming Tang,Jinqiao Wang
Main category: cs.CV
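TL;DR: This paper proposes GeM-VG, a new model for generalized multi-image visual grounding. A large-scale dataset and a hybrid reinforcement fine-tuning strategy markedly improve its performance on both multi-image and single-image grounding tasks.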
Details
Motivation: Existing multi-image grounding methods are limited to single-target localization and a narrow range of task types, lack unified modeling of generalized grounding tasks, and rely on datasets that are limited in target quantity and image relations.
Method: Proposes the GeM-VG model, systematically categorizes multi-image grounding tasks, builds the MG-Data-240K dataset of 240K samples, and adopts a hybrid reinforcement fine-tuning strategy that combines chain-of-thought (CoT) reasoning with direct answering, optimized with a rule-based reward.
Result: Outperforms the previous best MLLMs by 2.0% on MIG-Bench and 9.7% on MC-Bench, improves on the base model by 9.1% on the ODINW single-image grounding task, and retains strong general multi-image understanding.
Conclusion: Through unified modeling and a new training strategy, GeM-VG shows strong generalization and performance on generalized multi-image visual grounding, advancing the use of multimodal large models in complex grounding scenarios.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained by single-target localization and limited types of practical tasks, due to the lack of unified modeling for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, addressing the limitations of existing datasets regarding target quantity and image relation. To tackle the challenges of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, considering their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.
[131] CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models
Tobia Poppi,Burak Uzkent,Amanmeet Garg,Lucas Porto,Garin Kessler,Yezhou Yang,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara,Florian Schiffers
Main category: cs.CV
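TL;DR: Proposes a scalable counterfactual video generation framework to mitigate hallucinations in video-language models, particularly in action and temporal reasoning. The authors build CounterVid, a synthetic dataset of about 26k preference pairs, and propose MixDPO, which jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with it markedly improves temporal understanding.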
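MixDPO builds on Direct Preference Optimization. For orientation, the sketch below computes the standard single-pair DPO loss from sequence log-probabilities under the policy and a frozen reference model. It does not show how MixDPO mixes textual and visual preferences, which this digest does not specify; the function name and the default `beta=0.1` are assumptions for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * margin)), where
    margin = (policy - reference log-prob of the chosen answer)
           - (policy - reference log-prob of the rejected answer).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = -beta * margin
    # -log(sigmoid(-z)) equals softplus(z); this form avoids overflow for large |z|.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy matches the reference on both answers the margin is zero and the loss is log 2; it falls below that as the policy shifts probability toward the preferred answer, for instance a caption that matches the real video rather than its counterfactual.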
Details
Motivation: Existing methods fail to address the root cause of hallucination in video-language models, namely over-reliance on language priors, which produces hallucinations in action and temporal reasoning. A counterfactual generation method that intervenes precisely on visual dynamics is therefore needed.
Method: Combines multimodal large language models with diffusion models in a counterfactual video generation pipeline that produces videos differing only in action or temporal structure. The CounterVid dataset is built on this pipeline, and the MixDPO method is proposed for joint preference optimization.
Result: Models trained on CounterVid improve on action recognition and temporal reasoning tasks and transfer well to standard hallucination benchmarks.
Conclusion: The framework effectively reduces hallucinations in video-language models, especially on tasks that depend on fine-grained visual dynamics, achieving more robust multimodal understanding through counterfactual data augmentation and mixed preference optimization.
Abstract: Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.
[132] Defocus Aberration Theory Confirms Gaussian Model in Most Imaging Devices
Akbar Saadat
Main category: cs.CV
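TL;DR: This paper introduces settings under which the defocus operator of conventional imaging devices conforms to the Gaussian model. Analysis based on defocus aberration theory in geometric and diffraction-limited optics confirms that the Gaussian model applies to most imaging devices, with a maximum mean absolute error below 1%.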
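One reason the Gaussian model suits the two-image setting described in this entry is that Gaussians are closed under convolution: blurring with standard deviation σ₁ and then with σ_r gives a Gaussian blur with σ₂² = σ₁² + σ_r², so the relative blur between two images is σ_r = sqrt(σ₂² − σ₁²). The sketch below checks this numerically in one dimension. The helper names and the sample values σ₁ = 2, σ₂ = 3 are illustrative assumptions, not values from the paper.

```python
import math


def gaussian_kernel(sigma, radius):
    """Sampled, normalized 1D Gaussian on integer positions -radius..radius."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve(a, b):
    """Full discrete convolution of two 1D kernels."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def variance(kernel):
    """Variance of a symmetric kernel about its center, treating it as a distribution."""
    center = (len(kernel) - 1) / 2.0
    return sum(w * (i - center) ** 2 for i, w in enumerate(kernel))


def relative_sigma(sigma_sharp, sigma_blurred):
    """Blur that maps the sharper image onto the blurrier one under the Gaussian model."""
    return math.sqrt(sigma_blurred ** 2 - sigma_sharp ** 2)


sigma_1, sigma_2 = 2.0, 3.0                      # illustrative absolute blurs
sigma_r = relative_sigma(sigma_1, sigma_2)       # sqrt(9 - 4)
combined = convolve(gaussian_kernel(sigma_1, 20), gaussian_kernel(sigma_r, 20))
print(round(variance(combined), 3))              # close to sigma_2 ** 2
```

The same model therefore describes both the absolute blur in a single image and the relative blur between two images, which matches the property the abstract attributes to the Gaussian model.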
Details
Motivation: Accurately estimating depth from 2D images remains a fundamental challenge in 3D reconstruction, especially because inherent blur is hard to distinguish from depth-induced defocus blur, which makes the problem ill-posed.
Method: Starting from a known defocus model and two images taken from the same viewpoint with different focus settings, the paper uses the Gaussian model to describe absolute and relative defocus blur, and applies aberration theory in geometric and diffraction-limited optics to quantify how well the Gaussian approximation fits the actual defocus operator.
Result: For typical focused depths between 1 and 100 meters, with a maximum depth variation of 10% of the focused depth, the Gaussian model applies to most imaging devices, with a maximum mean absolute error (MAE) below 1%.
Conclusion: The Gaussian model is an ideal choice for real-time applications and accurately approximates actual defocus behavior under conventional imaging settings, providing a reliable theoretical basis for defocus-based depth estimation.
Abstract: Over the past three decades, defocus has consistently provided groundbreaking depth information in scene images. However, accurately estimating depth from 2D images continues to be a persistent and fundamental challenge in the field of 3D recovery. Heuristic approaches face an ill-posed problem when inferring the spatially variant defocus blur, as the desired blur cannot be distinguished from the inherent blur. Given prior knowledge of the defocus model, the problem becomes well-posed, with an analytic solution for the relative blur between two images taken at the same viewpoint with different camera focus settings. The Gaussian model stands out as an optimal choice for real-time applications, due to its mathematical simplicity and computational efficiency. Theoretically, it is also the only model that can be applied at the same time to both the absolute blur caused by depth in a single image and the relative blur resulting from depth differences between two images. This paper introduces the settings, for conventional imaging devices, to ensure that the defocusing operator adheres to the Gaussian model. Defocus analysis begins within the framework of geometric optics and is conducted by defocus aberration theory in diffraction-limited optics to obtain the accuracy of fitting the actual model to its Gaussian approximation. The results for a typical set of focused depths between 1 and 100 meters, with a maximum depth variation of 10% at the focused depth, confirm the Gaussian model's applicability for defocus operators in most imaging devices. The findings demonstrate a maximum Mean Absolute Error (MAE) of less than 1%, underscoring the model's accuracy and reliability.
[133] SRU-Pix2Pix: A Fusion-Driven Generator Network for Medical Image Translation with Few-Shot Learning
Xihe Qiu,Yang Dai,Xiaoyu Tan,Sijia Li,Fenghao Sun,Lu Gan,Liang Liu
Main category: cs.CV
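TL;DR: This study proposes an enhanced Pix2Pix framework that combines SEResNet and U-Net++ to achieve high-quality MRI image translation under few-shot conditions, markedly improving structural fidelity and image quality.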
Details
Motivation: The clinical use of MRI is limited by long acquisition time, high cost, and restricted resolution. Image translation may ease these limits, but the potential of existing methods such as Pix2Pix in medical imaging has not been fully explored.
Method: Proposes an improved Pix2Pix framework that introduces Squeeze-and-Excitation Residual Networks (SEResNet) to strengthen key feature representation through channel attention and uses U-Net++ to improve multi-scale feature fusion. A simplified PatchGAN discriminator stabilizes training and improves the realism of local anatomical detail.
Result: Under few-shot conditions with fewer than 500 images, the method delivers superior image quality and consistent structural fidelity across multiple intra-modality MRI translation tasks, and generalizes well.
Conclusion: The enhanced Pix2Pix framework extends the potential of Pix2Pix for medical image translation and offers a workable solution for MRI image generation in low-data settings.
Abstract: Magnetic Resonance Imaging (MRI) provides detailed tissue information, but its clinical application is limited by long acquisition time, high cost, and restricted resolution. Image translation has recently gained attention as a strategy to address these limitations. Although Pix2Pix has been widely applied in medical image translation, its potential has not been fully explored. In this study, we propose an enhanced Pix2Pix framework that integrates Squeeze-and-Excitation Residual Networks (SEResNet) and U-Net++ to improve image generation quality and structural fidelity. SEResNet strengthens critical feature representation through channel attention, while U-Net++ enhances multi-scale feature fusion. A simplified PatchGAN discriminator further stabilizes training and refines local anatomical realism. Experimental results demonstrate that under few-shot conditions with fewer than 500 images, the proposed method achieves consistent structural fidelity and superior image quality across multiple intra-modality MRI translation tasks, showing strong generalization ability. These results suggest an effective extension of Pix2Pix for medical image translation.
[134] Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers
Lee Hyoseok,Sohwi Lim,Eunju Cha,Tae-Hyun Oh
Main category: cs.CV
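TL;DR: This paper proposes MCLC, a plug-and-play correction module that stabilizes latent-diffusion-based inverse problem solvers through measurement-consistent Langevin updates, addressing the instability and artifacts of existing methods.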
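To make "measurement-consistent Langevin updates" more concrete, the sketch below runs generic unadjusted Langevin dynamics on a one-dimensional toy problem, where the score is the sum of a prior term and a measurement (data-fidelity) term. This is textbook Langevin sampling, not the MCLC algorithm itself: the Gaussian prior, the scalar measurement model, and every function name and parameter value here are assumptions chosen so that the correct answer is known in closed form.

```python
import math
import random


def posterior_score(x, y, noise_var):
    """Score of p(x | y) for prior x ~ N(0, 1) and measurement y = x + N(0, noise_var)."""
    prior_term = -x                    # gradient of log N(x; 0, 1)
    data_term = (y - x) / noise_var    # gradient of log N(y; x, noise_var)
    return prior_term + data_term


def langevin_chain(y, noise_var=1.0, step=0.05, n_steps=20000, seed=0):
    """Iterate x <- x + (step / 2) * score(x) + sqrt(step) * z, with z ~ N(0, 1)."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step * posterior_score(x, y, noise_var) + math.sqrt(step) * z
        samples.append(x)
    return samples


samples = langevin_chain(y=2.0)
posterior_mean = sum(samples) / len(samples)   # analytic posterior mean is y / (1 + noise_var) = 1.0
```

In this toy case the data term pulls samples toward the measurement and the prior term pulls them toward the prior mean. In an LDM-based solver the prior score would presumably come from the diffusion model and the update would act on latents.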
Details
Motivation: Existing inverse problem solvers based on latent diffusion models (LDMs) are unstable, showing artifacts and degraded quality. The root cause is a discrepancy between the solver's dynamics and the true reverse diffusion dynamics.
Method: The authors analyze this discrepancy and propose the Measurement-Consistent Langevin Corrector (MCLC), a module that corrects the solving process with Langevin updates that do not require a linear manifold assumption, improving stability and consistency.
Result: Experiments show that MCLC improves the performance and stability of existing solvers across a range of image restoration tasks. The paper also analyzes the causes of blob artifacts.
Conclusion: MCLC is a theoretically grounded, general, and broadly compatible correction module, and an important step toward more robust zero-shot inverse problem solving.
Abstract: With recent advances in generative models, diffusion models have emerged as powerful priors for solving inverse problems in each domain. Since Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers suffer from instability, exhibiting undesirable artifacts and degraded quality. In this work, we first identify the instability as a discrepancy between the solver's and true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies the LDM-based inverse solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption, leading to more stable and reliable behavior. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. Additionally, we analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.
[135] PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
Denis Korzhenkov,Adil Karjauv,Animesh Karnewar,Mohsen Ghafoorian,Amirhossein Habibian
Main category: cs.CV
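TL;DR: Proposes a low-cost fine-tuning method that converts a pretrained diffusion model into a pyramidal model, preserving video generation quality while improving inference efficiency.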
Details
Motivation: Existing open-source pyramidal video models are trained from scratch, fall short of state-of-the-art models in visual plausibility, and multi-step denoising inference is costly.
Method: Converts a pretrained diffusion model into a pyramidal structure through low-cost fine-tuning, and studies several step distillation strategies within pyramidal models to further improve inference efficiency.
Result: The pretrained model is converted to a pyramidal model without degrading the quality of generated videos, while inference compute cost drops significantly.
Conclusion: The method offers a high-quality, low-overhead route to efficient video generation and outperforms pyramidal models trained from scratch.
Abstract: Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency. Our results are available at https://qualcomm-ai-research.github.io/PyramidalWan.
[136] Detector-Augmented SAMURAI for Long-Duration Drone Tracking
Tamara R. Lenhard,Andreas Weinmann,Hichem Snoussi,Tobias Koch
Main category: cs.CV
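TL;DR: This paper presents the first systematic evaluation of the foundation model SAMURAI for drone tracking in urban surveillance, and proposes a detector-augmented extension that improves its robustness in complex environments and long sequences.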
Details
Motivation: Existing detector-based drone tracking methods suffer from temporal inconsistency and rely heavily on conventional motion models. Foundation models such as SAMURAI perform well in other domains, but their use in drone-specific scenarios has not been studied.
Method: Proposes a detector-augmented extension of SAMURAI that reduces its sensitivity to bounding-box initialization and sequence length, and evaluates drone tracking on RGB video.
Result: The method outperforms SAMURAI's zero-shot performance across datasets and metrics and markedly improves tracking robustness in complex urban environments, especially for long sequences and drone exit and re-entry events. Success rate improves by up to +0.393 and the false negative rate (FNR) falls by up to 0.475.
Conclusion: The work confirms that foundation models are viable for drone tracking and shows that adding detector cues substantially improves performance, opening a new direction for RGB-based drone tracking.
Abstract: Robust long-term tracking of drones is a critical requirement for modern surveillance systems, given their increasing threat potential. While detector-based approaches typically achieve strong frame-level accuracy, they often suffer from temporal inconsistencies caused by frequent detection dropouts. Despite its practical relevance, research on RGB-based drone tracking is still limited and largely reliant on conventional motion models. Meanwhile, foundation models like SAMURAI have established their effectiveness across other domains, exhibiting strong category-agnostic tracking performance. However, their applicability in drone-specific scenarios has not been investigated yet. Motivated by this gap, we present the first systematic evaluation of SAMURAI's potential for robust drone tracking in urban surveillance settings. Furthermore, we introduce a detector-augmented extension of SAMURAI to mitigate sensitivity to bounding-box initialization and sequence length. Our findings demonstrate that the proposed extension significantly improves robustness in complex urban environments, with pronounced benefits in long-duration sequences, especially under drone exit-re-entry events. The incorporation of detector cues yields consistent gains over SAMURAI's zero-shot performance across datasets and metrics, with success rate improvements of up to +0.393 and FNR reductions of up to -0.475.
[137] Integrated Framework for Selecting and Enhancing Ancient Marathi Inscription Images from Stone, Metal Plate, and Paper Documents
Bapu D. Chendage,Rajivkumar S. Mente
Main category: cs.CV
TL;DR: 本文提出了一种基于二值化和互补预处理技术的古代文字图像增强方法,有效去除了污渍并增强了模糊文本,提高了古代马拉地文铭文图像的可读性。
Details
Motivation: Because of aging and environmental effects, ancient script images often suffer from severe background noise, low contrast, and degradation, and the text and background share similar visual characteristics, which makes them hard to read. An effective image enhancement method is therefore needed to improve readability.
Method: Uses a binarization-based image enhancement method, combined with complementary preprocessing techniques to remove stains and enhance unclear text, and validates it on several types of ancient script, including stone inscriptions, metal plates, and historical documents.
Result: With a K-NN classifier, classification accuracy is 55.7%, 62%, and 65.6% on stone, metal plate, and document scripts, respectively. With an SVM classifier, the accuracies are 53.2%, 59.5%, and 67.8%. The authors present these results as evidence of the method's effectiveness.
Conclusion: The proposed enhancement method improves the readability of ancient Marathi inscription images and suits the restoration of degraded text on a variety of materials.
Abstract: Ancient script images often suffer from severe background noise, low contrast, and degradation caused by aging and environmental effects. In many cases, the foreground text and background exhibit similar visual characteristics, making the inscriptions difficult to read. The primary objective of image enhancement is to improve the readability of such degraded ancient images. This paper presents an image enhancement approach based on binarization and complementary preprocessing techniques for removing stains and enhancing unclear ancient text. The proposed methods are evaluated on different types of ancient scripts, including inscriptions on stone, metal plates, and historical documents. Experimental results show that the proposed approach achieves classification accuracies of 55.7%, 62%, and 65.6% for stone, metal plate, and document scripts, respectively, using the K-Nearest Neighbor (K-NN) classifier. Using the Support Vector Machine (SVM) classifier, accuracies of 53.2%, 59.5%, and 67.8% are obtained. The results demonstrate the effectiveness of the proposed enhancement method in improving the readability of ancient Marathi inscription images.
[138] SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models
Oriol Rabasseda,Zenjie Li,Kamal Nasrollahi,Sergio Escalera
Main category: cs.CV
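TL;DR: This paper introduces SOVABench, a benchmark for recognizing vehicle actions in surveillance video, and proposes a training-free multimodal large language model framework that improves action discrimination and reasoning by generating interpretable descriptions.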
Details
Motivation: Existing content-based video retrieval benchmarks focus mainly on scene-level similarity and do not evaluate the action discrimination that surveillance requires, so a real-world benchmark centered on vehicle actions is needed.
Method: Builds SOVABench from real surveillance footage and defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and understanding of temporal direction. A multimodal large language model (MLLM) generates descriptions of images and videos, from which interpretable embeddings are extracted, yielding a training-free retrieval framework.
Result: Experiments show that although humans distinguish these actions intuitively, current state-of-the-art vision and multimodal models still find them challenging. The proposed MLLM framework performs strongly on SOVABench and on several spatial and counting benchmarks, especially on tasks where contrastive vision-language models do poorly.
Conclusion: SOVABench fills the gap in action-discrimination benchmarks for surveillance, and the proposed training-free MLLM framework offers a new approach to interpretability and complex reasoning, showing its potential for video understanding.
Abstract: Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.
[139] Character Detection using YOLO for Writer Identification in multiple Medieval books
Alessandra Scotto di Freca,Tiziana D Alessandro,Francesco Fontanella,Filippo Sarria,Claudio De Stefano
Main category: cs.CV
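TL;DR: This paper proposes a YOLOv5-based method for identifying the writers of historical manuscripts, replacing an earlier pipeline built on template matching and a CNN. It detects more characters, classifies them more accurately, and uses the YOLO confidence score as the basis for a rejection mechanism, improving generalization to unseen manuscripts.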
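One way a confidence-based rejection rule like the one mentioned in this entry could work is sketched below: per-character detections vote for a scribe, weighted by detection confidence, and the page is left unattributed when the winner's share of the total confidence is too small. The voting scheme, the function name `identify_writer`, and the 0.6 threshold are hypothetical; the entry only states that the YOLO confidence score supports a rejection threshold.

```python
from collections import defaultdict


def identify_writer(detections, reject_below=0.6):
    """Attribute a page to a scribe, or return None to reject.

    detections: list of (writer_label, confidence) pairs, one per detected character.
    reject_below: hypothetical minimum share of total confidence the winner must hold.
    """
    if not detections:
        return None
    totals = defaultdict(float)
    for writer, confidence in detections:
        totals[writer] += confidence
    best = max(totals, key=totals.get)
    support = totals[best] / sum(totals.values())
    return best if support >= reject_below else None
```

For instance, `identify_writer([("A", 0.9), ("A", 0.8), ("B", 0.3)])` returns `"A"`, while an evenly split page such as `[("A", 0.5), ("B", 0.5)]` is rejected. Rejecting uncertain pages matters most for manuscripts by scribes outside the training set.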
Details
Motivation: 传统古文字研究中书写者识别依赖专家经验且效率较低,现有自动识别方法受限于模板匹配的阈值敏感性和特征提取能力,难以推广至未见手稿,因此需要更鲁棒、高效的自动化方法。 Method: 采用YOLOv5目标检测模型替代原有的模板匹配与CNN流程,直接在整页文本图像中定位并识别特定字母(如“a”),利用检测到的字符实例进行书写者归属分类,并引入基于YOLO置信度的拒绝阈值机制以提升系统可靠性。 Result: YOLOv5相比之前方法能检测出更多的目标字符,提升了第二阶段分类的准确性;同时其置信度得分可用于设置拒绝阈值,在训练未见的手稿上实现了更可靠的书写者识别。 Conclusion: YOLOv5在中世纪手稿书写者识别任务中优于传统模板匹配与CNN组合方法,不仅提高了检测与分类性能,还通过置信度机制增强了系统的可解释性与实用性,为数字古文字学提供了更强大的工具。 Abstract: Paleography is the study of ancient and historical handwriting, its key objectives include the dating of manuscripts and understanding the evolution of writing. Estimating when a document was written and tracing the development of scripts and writing styles can be aided by identifying the individual scribes who contributed to a medieval manuscript. Although digital technologies have made significant progress in this field, the general problem remains unsolved and continues to pose open challenges. ... We previously proposed an approach focused on identifying specific letters or abbreviations that characterize each writer. In that study, we considered the letter "a", as it was widely present on all pages of text and highly distinctive, according to the suggestions of expert paleographers. We used template matching techniques to detect the occurrences of the character "a" on each page and the convolutional neural network (CNN) to attribute each instance to the correct scribe. Moving from the interesting results achieved from this previous system and being aware of the limitations of the template matching technique, which requires an appropriate threshold to work, we decided to experiment in the same framework with the use of the YOLO object detection model to identify the scribe who contributed to the writing of different medieval books. We considered the fifth version of YOLO to implement the YOLO object detection model, which completely substituted the template matching and CNN used in the previous work. 
The experimental results demonstrate that YOLO extracts a greater number of the letters considered, leading to a more accurate second-stage classification. Furthermore, the YOLO confidence score provides a foundation for developing a system that applies a rejection threshold, enabling reliable writer identification even in unseen manuscripts.[140] DivAS: Interactive 3D Segmentation of NeRFs via Depth-Weighted Voxel Aggregation
Ayush Pande
Main category: cs.CV
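TL;DR: DivAS is an optimization-free, fully interactive NeRF segmentation framework that combines 2D SAM masks with NeRF depth priors to deliver fast, high-quality 3D segmentation with better geometric accuracy and foreground-background separation.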
Details
Motivation: Most existing NeRF segmentation methods rely on per-scene optimization, which is slow, sacrifices the zero-shot capabilities of 2D foundation models, and lacks real-time interactivity. Method: In the DivAS framework, the user supplies point prompts through a GUI to generate 2D SAM masks, the masks are refined with NeRF depth priors, and a custom CUDA kernel aggregates the multi-view masks into a unified 3D voxel grid in under 200 ms. Result: On Mip-NeRF 360° and LLFF, DivAS reaches segmentation quality comparable to optimization-based methods while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when user prompting time is excluded. Conclusion: DivAS provides fast, training-free interactive NeRF segmentation that combines high quality with real-time feedback, significantly outperforming traditional optimization-based methods. Abstract: Existing methods for segmenting Neural Radiance Fields (NeRFs) are often optimization-based, requiring slow per-scene training that sacrifices the zero-shot capabilities of 2D foundation models. We introduce DivAS (Depth-interactive Voxel Aggregation Segmentation), an optimization-free, fully interactive framework that addresses these limitations. Our method operates via a fast GUI-based workflow where 2D SAM masks, generated from user point prompts, are refined using NeRF-derived depth priors to improve geometric accuracy and foreground-background separation. The core of our contribution is a custom CUDA kernel that aggregates these refined multi-view masks into a unified 3D voxel grid in under 200ms, enabling real-time visual feedback. This optimization-free design eliminates the need for per-scene training. Experiments on Mip-NeRF 360° and LLFF show that DivAS achieves segmentation quality comparable to optimization-based methods, while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when excluding user prompting time.[141] Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform
Suyash Mishra,Qiang Li,Srikanth Patil,Satyanarayan Pati,Baddu Narendra
Main category: cs.CV
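TL;DR: This paper presents a multimodal reasoning framework for industrial-scale long-form video understanding, systematically evaluates more than 40 vision language models under realistic constraints, exposes bottlenecks of current VLMs in temporal alignment, keyframe detection, and video splitting, and offers practical optimization guidance under GPU, latency, and cost constraints.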
Details
Motivation: Existing VLM evaluations mostly focus on short videos and ignore the resource limits of industrial deployment, whereas domains such as pharmaceuticals must process long videos under strict GPU, latency, and cost constraints and lack scalable solutions. Method: The authors build a large-scale industrial multimodal architecture and empirically analyze more than 40 VLMs on Video-MME, MMBench, and a proprietary dataset of 25,326 videos, focusing on the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and the challenges of video splitting. Result: SDPA attention yields 3-8x efficiency gains on commodity GPUs; multimodality performs better in 8 of 12 task domains, especially on length-dependent tasks; and clear bottlenecks in temporal alignment and keyframe detection appear across open- and closed-source VLMs. Conclusion: Rather than proposing a new model, the paper characterizes the practical limits, trade-offs, and failure modes of current VLMs under realistic deployment conditions, giving researchers and practitioners actionable guidance for designing scalable industrial long-form video understanding systems. Abstract: Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4, M4V), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8/12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. 
Rather than proposing a new "A+B" model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.[142] Rotation-Robust Regression with Convolutional Model Trees
Hongyi Li,William Ward Armstrong,Jun Xu
Main category: cs.CV
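TL;DR: This paper studies rotation-robust learning with Convolutional Model Trees (CMTs), proposes three geometry-aware inductive biases, and evaluates how a deployment-time orientation search affects robustness.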
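A deployment-time orientation search of the kind this entry evaluates can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the paper's procedure: it restricts the candidate rotations to multiples of 90 degrees, represents an image as a nested list, and takes a hypothetical `predict` callable that returns a `(prediction, confidence)` pair in place of the forest-level confidence proxy.

```python
def rotate90(grid, k):
    """Rotate a 2D list counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid

def orientation_search(image, predict, rotations=(0, 1, 2, 3)):
    """Try each candidate rotation and keep the most confident prediction.

    predict: callable mapping an image to (prediction, confidence); a
    hypothetical stand-in for a trained model, whose parameters are never
    updated here.
    Returns (k, prediction, confidence) for the selected rotation k.
    """
    best = None
    for k in rotations:
        pred, conf = predict(rotate90(image, k))
        if best is None or conf > best[2]:
            best = (k, pred, conf)
    return best
```

Because the rotation is chosen purely by confidence, the search picks a wrong orientation whenever the model is confidently wrong, which matches the failure the abstract reports near the canonical orientation.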
Details
Motivation: To make models more robust to rotations of image inputs, in particular by adapting to rotation at deployment time without updating parameters. Method: Three geometry-aware inductive biases are introduced (convolutional smoothing, a tilt dominance constraint, and importance-based pruning) and combined with a deployment-time orientation search that selects the best rotation angle. Result: Experiments show that orientation search improves robustness under large rotations but can be harmful near the canonical orientation when confidence is misaligned with correctness; consistent trends are observed on MNIST handwritten digit recognition. Conclusion: Confidence-based orientation selection shows promise for model-tree ensembles but also has limitations, and it must be designed carefully to avoid degrading performance. Abstract: We study rotation-robust learning for image inputs using Convolutional Model Trees (CMTs) [1], whose split and leaf coefficients can be structured on the image grid and transformed geometrically at deployment time. In a controlled MNIST setting with a rotation-invariant regression target, we introduce three geometry-aware inductive biases for split directions -- convolutional smoothing, a tilt dominance constraint, and importance-based pruning -- and quantify their impact on robustness under in-plane rotations. We further evaluate a deployment-time orientation search that selects a discrete rotation maximizing a forest-level confidence proxy without updating model parameters. Orientation search improves robustness under severe rotations but can be harmful near the canonical orientation when confidence is misaligned with correctness. Finally, we observe consistent trends on MNIST digit recognition implemented as one-vs-rest regression, highlighting both the promise and limitations of confidence-based orientation selection for model-tree ensembles.[143] Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics
Subhadeep Roy,Gagan Bhatia,Steffen Eger
Main category: cs.CV
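TL;DR: This paper studies "prototypicality bias" in the evaluation of text-to-image models, finding that existing automatic metrics tend to prefer prototypical images that are common in the data distribution over images that are semantically correct but atypical. The authors build the ProtoBias benchmark and, by comparing human and automatic evaluation, reveal systematic misjudgments by mainstream metrics (such as CLIPScore and PickScore) when semantics and prototypicality conflict. On this basis they propose ProtoScore, a new efficient metric that is substantially more accurate and robust than existing methods and far faster at inference than large closed-source models.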
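The paired setup in this entry lends itself to a simple directional measurement. The sketch below shows one plausible way to compute a misranking rate and a decision margin from metric scores; the input format, the function names, and the choice to count ties as failures are assumptions for illustration and are not taken from the paper.

```python
def misranking_rate(pairs):
    """Fraction of pairs where a metric fails to prefer the correct image.

    pairs: list of (score_correct, score_prototypical) tuples. The first
    score belongs to the semantically correct but atypical image, the
    second to the prototypical but incorrect one (hypothetical format).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    # Ties count as failures: the metric did not prefer the correct image.
    wrong = sum(1 for good, proto in pairs if good <= proto)
    return wrong / len(pairs)

def mean_margin(pairs):
    """Average decision margin; positive values favor the correct image."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(good - proto for good, proto in pairs) / len(pairs)
```

A metric that follows the text should show a low misranking rate and a clearly positive margin; a rate near or above 0.5 suggests it is defaulting to prototypes.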
Details
Motivation: Current automatic metrics may be affected by biases in their training data, preferring visually or socially "typical" images over images that actually match the text semantics, so evaluation results drift away from true semantic correctness. This prototypicality bias therefore needs to be studied and mitigated. Method: The authors build ProtoBias, a controlled contrastive benchmark covering three image categories (animals, objects, and demography), in which each pair consists of a semantically correct but atypical image and a semantically incorrect but visually more typical image. They use it to evaluate several mainstream automatic metrics against human judgments, and then propose ProtoScore, a new 7B-parameter evaluation model designed to be more sensitive to semantic correctness and to suppress prototypicality bias. Result: Experiments show that CLIPScore, PickScore, and VQA-based metrics often wrongly favor prototypical images; LLM-as-Judge systems are unstable in scenarios involving social attributes; and human evaluation consistently favors the semantically correct image, with larger decision margins. ProtoScore substantially reduces the misranking rate and approaches the robustness of large closed-source judge models, while running orders of magnitude faster than GPT-5. Conclusion: Existing automatic metrics show significant prototypicality bias and cannot reliably measure the semantic faithfulness of text-to-image generation; ProtoBias is an effective tool for detecting such bias; and ProtoScore strikes a better balance among accuracy, fairness, and efficiency, offering a more robust evaluation paradigm. Abstract: Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark ProtoBias (Prototypical Bias), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. 
Motivated by these findings, we propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking, while running orders of magnitude faster than GPT-5 at inference and approaching the robustness of much larger closed-source judges.[144] TEA: Temporal Adaptive Satellite Image Semantic Segmentation
Juyuan Kang,Hao Zhu,Yan Zhu,Wei Zhang,Jianing Chen,Tianxiang Xiao,Yike Ma,Hao Jiang,Feng Dai
Main category: cs.CV
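TL;DR: The paper proposes TEA, a temporally adaptive semantic segmentation method for satellite image time series (SITS) that uses a teacher-student framework and a full-sequence reconstruction auxiliary task to improve crop mapping for inputs of varying temporal length.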
Details
Motivation: Existing methods generalize poorly when handling satellite image time series of different temporal lengths, which markedly degrades segmentation. Method: TEA uses a teacher model to pass global sequence knowledge to a student model with adaptive temporal input; knowledge is transferred through intermediate embeddings, prototypes, and soft labels, and dynamic aggregation and a full-sequence reconstruction auxiliary task are added to improve representation quality. Result: Experiments show that the method significantly outperforms existing methods on common benchmarks for inputs of different temporal lengths. Conclusion: TEA makes the model more robust and improves segmentation under varying temporal lengths, making it better suited to real agricultural applications. Abstract: Crop mapping based on satellite image time series (SITS) holds substantial economic value in agricultural production settings, in which parcel segmentation is an essential step. Existing approaches have achieved notable advancements in SITS segmentation with predetermined sequence lengths. However, we found that these approaches overlook the generalization capability of models across scenarios with varying temporal length, leading to markedly poor segmentation results in such cases. To address this issue, we propose TEA, a TEmporal Adaptive SITS semantic segmentation method to enhance the model's resilience under varying sequence lengths. We introduce a teacher model that encapsulates the global sequence knowledge to guide a student model with adaptive temporal input lengths. Specifically, the teacher shapes the student's feature space via intermediate embedding, prototypes and soft label perspectives to realize knowledge transfer, while dynamically aggregating the student model to mitigate knowledge forgetting. Finally, we introduce full-sequence reconstruction as an auxiliary task to further enhance the quality of representations across inputs of varying temporal lengths. Through extensive experiments, we demonstrate that our method brings remarkable improvements across inputs of different temporal lengths on common benchmarks. Our code will be publicly available.[145] SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection
Maximilian Pittner,Joel Janai,Mario Faigle,Alexandru Paul Condurache
Main category: cs.CV
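TL;DR: This paper proposes SparseLaneSTP, a new 3D lane detection method that combines the geometric properties of lane structure with temporal information, introducing a lane-specific spatio-temporal attention mechanism, a continuous lane representation, and temporal regularization, and achieving state-of-the-art performance on several benchmarks.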
Details
Motivation: Existing 3D lane detection methods that detect lanes from dense bird's-eye-view features suffer from erroneous transformations that produce poor feature representations, while sparse detectors perform better but ignore valuable lane priors and do not use historical observations to resolve ambiguity under poor visibility. Existing datasets also have shortcomings. Method: SparseLaneSTP integrates the geometric properties of lane structure and temporal information into a sparse lane transformer; it introduces a lane-specific spatio-temporal attention mechanism, a continuous lane representation suited to sparse architectures, and temporal regularization; and a simple yet effective auto-labeling strategy is used to build a precise and consistent new 3D lane dataset. Result: Experiments show that the method reaches the state of the art on all detection and error metrics on existing 3D lane detection benchmarks and on the newly proposed dataset. Conclusion: By fusing geometric priors with temporal information, SparseLaneSTP improves the accuracy and robustness of 3D lane detection, and the new dataset gives future research a more reliable basis for evaluation. Abstract: 3D lane detection has emerged as a critical challenge in autonomous driving, encompassing identification and localization of lane markings and the 3D road surface. Conventional 3D methods detect lanes from dense bird's-eye-view (BEV) features, though erroneous transformations often result in a poor feature representation misaligned with the true 3D road surface. While recent sparse lane detectors have surpassed dense BEV approaches, they completely disregard valuable lane-specific priors. Furthermore, existing methods fail to utilize historical lane observations, which have the potential to resolve ambiguities in situations of poor visibility. To address these challenges, we present SparseLaneSTP, a novel method that integrates both geometric properties of the lane structure and temporal information into a sparse lane transformer. It introduces a new lane-specific spatio-temporal attention mechanism, a continuous lane representation tailored for sparse architectures as well as temporal regularization. Identifying weaknesses of existing 3D lane datasets, we also introduce a precise and consistent 3D lane dataset using a simple yet effective auto-labeling strategy. Our experiments demonstrate the benefits of our contributions and show state-of-the-art performance across all detection and error metrics on existing 3D lane detection benchmarks as well as on our novel dataset.[146] OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction
Minseong Kweon,Jinsun Park
Main category: cs.CV
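TL;DR: The paper proposes OceanSplat, a 3D Gaussian Splatting-based method for representing underwater scene geometry that uses trinocular consistency, a synthetic epipolar depth prior, and depth-aware opacity adjustment to suppress interference from the scattering medium and improve underwater 3D reconstruction.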
Details
Motivation: Underwater optical degradation causes multi-view inconsistency and floating artifacts, and existing methods struggle to recover the 3D geometry of underwater scenes accurately. Method: Trinocular view consistency is enforced by translating the camera views and aligning them via inverse warping; a synthetic epipolar depth prior, obtained by triangulation, serves as a self-supervised regularizer; and a depth-aware alpha adjustment modulates the opacity of each Gaussian according to its z-component and the viewing direction, suppressing medium-induced artifacts. Result: Experiments on real and simulated underwater scenes show that OceanSplat substantially outperforms existing methods in scene reconstruction and image restoration, reducing floating artifacts and improving geometric accuracy. Conclusion: By adding geometric constraints and a depth-aware optimization strategy, OceanSplat disentangles the 3D Gaussians from the scattering medium and markedly improves the robustness and visual quality of underwater 3D reconstruction. Abstract: We introduce OceanSplat, a novel 3D Gaussian Splatting-based approach for accurately representing 3D geometry in underwater scenes. To overcome multi-view inconsistencies caused by underwater optical degradation, our method enforces trinocular view consistency by rendering horizontally and vertically translated camera views relative to each input view and aligning them via inverse warping. Furthermore, these translated camera views are used to derive a synthetic epipolar depth prior through triangulation, which serves as a self-supervised depth regularizer. These geometric constraints facilitate the spatial optimization of 3D Gaussians and preserve scene structure in underwater environments. We also propose a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians during early training based on their $z$-component and viewing direction, deterring the formation of medium-induced primitives. With our contributions, 3D Gaussians are disentangled from the scattering medium, enabling robust representation of object geometry and significantly reducing floating artifacts in reconstructed underwater scenes. Experiments on real-world underwater and simulated scenes demonstrate that OceanSplat substantially outperforms existing methods for both scene reconstruction and restoration in scattering media.[147] Higher-Order Adversarial Patches for Real-Time Object Detectors
Jens Bayer,Stefan Becker,David Münch,Michael Arens,Jürgen Beyerer
Main category: cs.CV
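TL;DR: This study examines the effect of higher-order adversarial attacks on the YOLOv10 object detector, finding that higher-order adversarial patches generalize better than lower-order ones and that adversarial training alone is not enough to defend against them.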
Details
Motivation: To reveal the practical impact of higher-order attacks on object detectors, and the limits of existing defenses, in the cat-and-mouse game between adversarial attack and defense. Method: Higher-order adversarial patches are generated iteratively and applied to the YOLOv10 detector as evasion attacks, and robustness is assessed in combination with adversarial training. Result: Higher-order adversarial patches are effective not only against the model they were trained on but also generalize better across models; adversarial training alone does not sufficiently harden the detector against such attacks. Conclusion: Higher-order adversarial attacks pose a greater threat, and current adversarial training strategies need further improvement to cope with complex, higher-order attack patterns. Abstract: Higher-order adversarial attacks can directly be considered the result of a cat-and-mouse game -- an elaborate action involving constant pursuit, near captures, and repeated escapes. This idiom best describes the enduring circular training of adversarial attack patterns and adversarial training. The following work investigates the impact of higher-order adversarial attacks on object detectors by successively training attack patterns and hardening object detectors with adversarial training. The YOLOv10 object detector is chosen as a representative, and adversarial patches are used in an evasion attack manner. Our results indicate that higher-order adversarial patches not only affect the object detector they were directly trained on but also provide a stronger generalization capacity compared to lower-order adversarial patches. Moreover, the results highlight that adversarial training alone is not sufficient to harden an object detector efficiently against this kind of adversarial attack. Code: https://github.com/JensBayer/HigherOrder [148] Patch-based Representation and Learning for Efficient Deformation Modeling
Ruochen Chen,Thuy Tran,Shaifali Parashar
Main category: cs.CV
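TL;DR: This paper proposes PolyFit, a patch-based surface representation obtained by fitting jet functions on local surface patches, which can be learned efficiently, generalizes to many kinds of surfaces, and suits deformation tasks in computer vision and graphics.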
Details
Motivation: Traditional methods for surface deformation rely on per-vertex optimization, which is computationally expensive and generalizes poorly; a more compact and efficient representation is needed to support a range of downstream tasks. Method: The PolyFit representation is obtained by fitting jet functions on local surface patches, learned in a supervised fashion from analytic functions and real data, and deformed efficiently by updating jet coefficients. Result: The method is validated on two tasks, Shape-from-template (SfT) and garment draping: in SfT its accuracy is competitive with physics-based solvers at markedly higher speed; in garment draping the model is self-supervised, generalizes across resolutions and garment types, and runs inference up to an order of magnitude faster than baselines. Conclusion: PolyFit is a compact, efficient, and generalizable surface representation that greatly improves the computational efficiency of deformation tasks while maintaining high accuracy, with broad potential for application. Abstract: In this paper, we present a patch-based representation of surfaces, PolyFit, which is obtained by fitting jet functions locally on surface patches. Such a representation can be learned efficiently in a supervised fashion from both analytic functions and real data. Once learned, it can be generalized to various types of surfaces. Using PolyFit, the surfaces can be efficiently deformed by updating a compact set of jet coefficients rather than optimizing per-vertex degrees of freedom for many downstream tasks in computer vision and graphics. We demonstrate the capabilities of our proposed methodologies with two applications: 1) Shape-from-template (SfT): where the goal is to deform the input 3D template of an object as seen in image/video. Using PolyFit, we adopt test-time optimization that delivers competitive accuracy while being markedly faster than offline physics-based solvers, and outperforms recent physics-guided neural simulators in accuracy at modest additional runtime. 2) Garment draping. We train a self-supervised, mesh- and garment-agnostic model that generalizes across resolutions and garment types, delivering up to an order-of-magnitude faster inference than strong baselines.[149] From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)
Suyash Mishra,Qiang Li,Srikanth Patil,Anubhav Girdhar
Main category: cs.CV
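TL;DR: The paper proposes a domain-adapted video-to-video-clip generation framework that combines audio and vision language models to produce pharmacy video summaries efficiently and at low cost, markedly improving processing speed and content quality.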
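This entry names a Cut & Merge step with timestamp normalization and fade in/out but gives no algorithmic detail. The sketch below shows one common way such a step could work; it is an assumption-laden illustration rather than the authors' algorithm. The names `cut_and_merge` and `fade_gain`, the `min_gap` and `fade` parameters, and the linear fade shape are all hypothetical.

```python
def cut_and_merge(segments, duration, min_gap=0.5):
    """Normalize and merge highlight segments given as (start, end) seconds.

    Normalization here means: clamp to [0, duration], drop empty segments,
    sort by start time, and merge segments that overlap or lie closer than
    min_gap seconds apart (assumed rules).
    """
    cleaned = []
    for start, end in segments:
        start, end = max(0.0, start), min(duration, end)
        if end > start:
            cleaned.append((start, end))
    cleaned.sort()
    merged = []
    for start, end in cleaned:
        if merged and start - merged[-1][1] <= min_gap:
            # Close enough to the previous clip: extend it instead of cutting twice.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def fade_gain(t, start, end, fade=1.0):
    """Linear fade-in/fade-out gain in [0, 1] at time t for one clip."""
    if not start <= t <= end:
        return 0.0
    ramp = min(t - start, end - t) / fade
    return max(0.0, min(1.0, ramp))
```

Merging near-adjacent segments before cutting avoids producing two clips separated by a fraction of a second, and applying the same gain curve to audio and video is one simple way to keep the two streams aligned at clip boundaries.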
Details
Motivation: In the pharmaceutical industry, manual annotation of multimodal data (text, images, video, and audio) suffers from inconsistency, quality degradation, and inefficiency, and the challenge is greater still for large volumes of long video and audio. Method: A video clip generation framework fuses Audio Language Models (ALMs) and Vision Language Models (VLMs) and comprises a reproducible Cut & Merge algorithm (with fade in/out and timestamp normalization), a personalization mechanism based on role definition and prompt injection, and a cost-efficient end-to-end pipeline strategy that balances ALM and VLM processing. Result: Validated on the Video MME benchmark and a proprietary dataset of 16,159 pharmacy videos, the approach achieves a 3-4x speedup and a 4x cost reduction, and improves clip coherence (0.348) and informativeness (0.721) over existing VLM baselines (such as Gemini 2.5 Pro). Conclusion: The method supports transparent, customizable, and compliance-supporting video summarization, showing strong potential for intelligent, scalable, and automated content processing in the life sciences. Abstract: Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g. long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost efficient e2e pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate 3 to 4 times speedup, 4 times cost reduction, and competitive clip quality. 
Beyond efficiency gains, we also report that our methods improve clip coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance supporting video summarization for life sciences.[150] Driving on Registers
Ellington Kirby,Alexandre Boulch,Yihong Xu,Yuan Yin,Gilles Puy,Éloi Zablocki,Andrei Bursuc,Spyros Gidaris,Renaud Marlet,Florent Bartoccioni,Anh-Quan Cao,Nermin Samet,Tuan-Hung VU,Matthieu Cord
Main category: cs.CV
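TL;DR: DrivoR is a transformer-based end-to-end autonomous driving architecture that uses pretrained Vision Transformers and camera-aware register tokens to compress multi-camera features, enabling efficient and accurate generation and scoring of driving trajectories.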
Details
Motivation: Existing end-to-end autonomous driving methods face bottlenecks in computational efficiency and multi-sensor fusion, and need more efficient feature compression and decision mechanisms. Method: Building on pretrained Vision Transformers, camera-aware register tokens compress multi-camera input features, and two lightweight transformer decoders generate and then score candidate trajectories; the scoring decoder mimics an oracle and outputs interpretable sub-scores (such as safety, comfort, and efficiency). Result: DrivoR outperforms or matches current strong methods on the NAVSIM-v1, NAVSIM-v2, and HUGSIM closed-loop benchmarks while significantly reducing computational cost. Conclusion: A pure-transformer architecture with targeted token compression is sufficient for accurate, efficient, and adaptive end-to-end driving. Abstract: We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available via the project page.[151] UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition
Filippo Ghilotti,Samuel Brucker,Nahku Saidy,Matteo Matteucci,Mario Bijelic,Felix Heide
Main category: cs.CV
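TL;DR: The paper proposes a multimodal pseudo-labeling method that needs no manual annotation, using temporal-geometric consistency to lift cues from text and 2D vision foundation models into 3D point clouds for semantic segmentation, object detection, and densification of LiDAR data.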
Details
Motivation: To address the high cost of the large-scale manual annotation that LiDAR data requires in autonomous driving, and to make full use of the 3D geometry in unlabeled LiDAR data. Method: Strong geometric priors are learned from temporally accumulated LiDAR maps, cues from text and 2D vision foundation models are fused via temporal-geometric consistency, an iterative update rule keeps geometry and semantics consistent, and moving objects are detected from inconsistencies. Result: The method generalizes robustly across three datasets while producing 3D semantic labels, 3D bounding boxes, and dense LiDAR scans; it compares favorably with existing pseudo-labeling methods, and even a small fraction of the geometrically consistent, densified LiDAR reduces depth prediction error (MAE) by 51.5% at 80-150 meters and 22.0% at 150-250 meters. Conclusion: The method reduces dependence on manual annotation and shows the potential of unsupervised multimodal pseudo-labeling for LiDAR perception tasks, particularly for long-range depth estimation. Abstract: Unlabeled LiDAR logs, in autonomous driving applications, are inherently a gold mine of dense 3D geometry hiding in plain sight - yet they are almost useless without human labels, highlighting a dominant cost barrier for autonomous-perception research. In this work we tackle this bottleneck by leveraging temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi-modal pseudo-labeling method relying on strong geometric priors learned from temporally accumulated LiDAR maps, along with a novel iterative update rule that enforces joint geometric-semantic consistency and, conversely, detects moving objects from inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, demonstrating robust generalization across three datasets. We experimentally validate that our method compares favorably to existing semantic segmentation and object detection pseudo-labeling methods, which often require additional manual supervision. We confirm that even a small fraction of our geometrically consistent, densified LiDAR improves depth prediction by 51.5% and 22.0% MAE in the 80-150 and 150-250 meters range, respectively.[152] From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
Zirui Wu,Zeren Jiang,Martin R. Oswald,Jie Song
Main category: cs.CV
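TL;DR: This paper proposes "projective conditioning", a new way to improve the robustness and cross-view consistency of feed-forward view synthesis models. It replaces the conventional Plücker ray representation with a 2D projective cue of the target view and adds a masked autoencoding pretraining strategy that enables effective pretraining on uncalibrated data, achieving state-of-the-art performance on several benchmarks.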
Details
Motivation: Existing feed-forward view synthesis models rely on Plücker ray maps, which are sensitive to the choice of camera coordinate frame and lack geometric consistency under small camera transformations, limiting robustness and consistency. Method: Projective conditioning replaces raw camera parameters with a 2D projective cue of the target view as input, turning the problem into a stable image-to-image translation task; a masked autoencoding pretraining strategy is designed to exploit large-scale uncalibrated data. Result: On the authors' view-consistency benchmark the method shows higher fidelity and stronger cross-view consistency, and it reaches the state of the art on standard novel view synthesis benchmarks. Conclusion: Projective conditioning offers a more robust and consistent alternative for feed-forward view synthesis, reducing dependence on precise camera parameters and extending the potential for use on uncalibrated data. Abstract: Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.[153] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
Runze He,Yiji Cheng,Tiankai Hang,Zhimin Li,Yu Xu,Zijin Yin,Shiyi Zhang,Wenxun Dai,Penghui Du,Ao Ma,Chunyu Wang,Qinglin Lu,Jizhong Han,Jiao Dai
Main category: cs.CV
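TL;DR: The paper proposes Re-Align, a framework that uses structured reasoning-guided alignment to make semantic understanding and generation more consistent in in-context image generation and editing.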
Details
Motivation: The understanding strengths of existing unified multimodal models do not transfer effectively to image generation, leaving understanding and generation inconsistent in in-context image generation and editing (ICGE). Method: The In-Context Chain-of-Thought (IC-CoT) structured reasoning paradigm decouples semantic guidance from reference association, and a reinforcement learning training scheme based on a surrogate reward optimizes the alignment between the reasoning text and the generated image. Result: Experiments show that Re-Align outperforms competitive methods of comparable model scale and resources on in-context image generation and editing tasks. Conclusion: Through structured reasoning and reinforcement-learning alignment, Re-Align bridges the gap between understanding and generation and significantly improves performance on ICGE tasks. Abstract: In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.[154] VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
Ignacio de Rodrigo,Alvaro J. Lopez-Lopez,Jaime Boal
Main category: cs.CV
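TL;DR: VERSE is a methodology for analyzing and improving vision-language models applied to visually-rich document understanding; it explores their visual embedding space to identify problematic regions and generate synthetic data that improves performance.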
Details
Motivation: When handling visually rich documents, existing vision-language models offer neither interpretability of their error-prone regions nor a means to improve them, so a systematic way to diagnose and enhance model behavior is needed. Method: VERSE visualizes the model's latent representations to assess model feasibility, identifies problematic clusters, guides the generation of synthetic data targeting those regions (such as the MERIT Dataset), and retrains the model. Result: Validation on MERIT Secret shows that VERSE substantially improves F1 without harming generalization; the optimized Donut and Idefics2 models match or even surpass SaaS models such as GPT-4 and Pixtral. Conclusion: VERSE is an effective framework for analyzing and enhancing vision-language models: it reveals the visual features that lead to errors and, through targeted data generation, substantially improves performance, making on-premise models more competitive. Abstract: This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.[155] VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
Sixiao Zheng,Minghao Yin,Wenbo Hu,Xiaoyu Li,Ying Shan,Yanwei Fu
Main category: cs.CV
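TL;DR: VerseCrafter is a 4D-aware video world model that uses a novel 4D geometric control representation to provide unified, explicit, and coherent control over camera and multi-object dynamics.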
Details
Motivation: Existing video world models struggle to control camera and multi-object motion precisely and in a unified way in the 2D image plane, and cannot model 3D spatio-temporal dynamics. Method: A 4D Geometric Control representation encodes the world state with a static background point cloud and per-object 3D Gaussian trajectories, and the 4D control signals condition a pretrained video diffusion model; an automatic data engine extracts 4D controls from in-the-wild videos for training. Result: The model generates high-fidelity, view-consistent videos that precisely follow the specified camera and object dynamics, and is trained successfully on a large-scale automatically extracted dataset. Conclusion: VerseCrafter gives video world models an interpretable, flexible, and category-agnostic 4D control framework that achieves precise control of dynamics even without ground-truth 4D annotations. Abstract: Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently express dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.[156] A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering
Md. Zahid Hossain,Most. Sharmin Sultana Samu,Md. Rakibul Islam,Md. Siam Ansary
Main category: cs.CV
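TL;DR: The paper proposes a lightweight vision-language framework that identifies crops and diseases from leaf images, combining a Swin Transformer vision encoder with sequence-to-sequence language decoders and using two-stage training to improve visual representations and cross-modal alignment; it performs strongly on both classification and natural language generation with fewer parameters.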
Details
Motivation: Crop disease analysis requires accurate visual understanding and reliable text generation, but existing large-scale vision-language models have many parameters, are inefficient, and are not optimized for agricultural settings. Method: A Swin Transformer vision encoder is combined with sequence-to-sequence language decoders, a two-stage training strategy strengthens visual representation learning and cross-modal alignment, and Grad-CAM and token-level attribution are used for explainability analysis. Result: On a large-scale crop disease dataset, the model performs strongly on classification accuracy and on generation metrics such as BLEU, ROUGE, and BERTScore, outperforming large vision-language baselines with significantly fewer parameters; qualitative results show robustness to diverse user queries. Conclusion: Task-specific visual pretraining effectively improves lightweight models for crop disease visual question answering, balancing accuracy, efficiency, and explainability. Abstract: Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.[157] Atlas 2 -- Foundation models for clinical deployment
Maximilian Alber,Timo Milbich,Alexandra Carpen-Amarie,Stephan Tietz,Jonas Dippel,Lukas Muttenthaler,Beatriz Perez Cancer,Alessandro Benetti,Panos Korfiatis,Elias Eulig,Jérôme Lüscher,Jiasen Wu,Sayed Abid Hashimi,Gabriel Dernbach,Simon Schallenberg,Neelay Shah,Moritz Krügener,Aniruddh Jammoria,Jake Matras,Patrick Duffy,Matt Redlon,Philipp Jurmeister,David Horst,Lukas Ruff,Klaus-Robert Müller,Frederick Klauschen,Andrew Norgan
Main category: cs.CV
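TL;DR: Presents the Atlas 2 family of pathology vision foundation models, which reach state-of-the-art prediction performance, robustness, and resource efficiency and are trained on the largest dataset to date, 5.5 million whole slide images.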
Details
Motivation: Existing pathology foundation models trade off performance, robustness, and computational requirements, which limits their clinical use.
Method: Presents three models, Atlas 2, Atlas 2-B, and Atlas 2-S, trained on 5.5 million whole slide images from Charité, LMU Munich, and Mayo Clinic, and evaluates them comprehensively on 80 public benchmarks.
Result: The models perform strongly across the eighty benchmarks, with clear gains in prediction performance, robustness, and resource efficiency.
Conclusion: The Atlas 2 models overcome the limitations of existing pathology foundation models and move them closer to practical deployment in clinical settings.
Abstract: Pathology foundation models substantially advanced the possibilities in computational pathology -- yet tradeoffs in terms of performance, robustness, and computational requirements remained, which limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models which bridge these shortcomings by showing state-of-the-art performance in prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date comprising 5.5 million histopathology whole slide images, collected from three medical institutions: Charité - Universitätsmedizin Berlin, LMU Munich, and Mayo Clinic.
[158] Multi-Scale Local Speculative Decoding for Image Generation
Elia Peruzzo,Guillaume Sautière,Amirhossein Habibian
Main category: cs.CV
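TL;DR: Proposes MuLo-SD, a local speculative decoding framework that combines multi-resolution drafting with spatially informed verification to accelerate autoregressive image generation, reaching up to 1.7x faster inference while maintaining quality.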
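The contrast between local resampling and raster-scan resampling can be illustrated with a small sketch. It is not the authors' algorithm: it assumes the standard speculative-decoding acceptance test (accept a drafted token when u < min(1, p/q)) and a fixed square neighborhood, and the function names and the `radius` parameter are hypothetical.

```python
def verify_draft(p_target, q_draft, uniforms):
    """Accept the drafted token at (r, c) when u < min(1, p/q).

    p_target / q_draft hold the target and draft probabilities of the
    drafted token at each grid position (q must be > 0); uniforms holds
    pre-drawn U(0, 1) samples so the example is deterministic.
    """
    return [
        [u < min(1.0, p / q) for p, q, u in zip(p_row, q_row, u_row)]
        for p_row, q_row, u_row in zip(p_target, q_draft, uniforms)
    ]


def local_resample_mask(accepted, radius=1):
    """Mark each rejected position and its square neighborhood (assumed shape)."""
    h, w = len(accepted), len(accepted[0])
    mask = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if not accepted[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        mask[rr][cc] = True
    return mask


def raster_resample_count(accepted):
    """Tokens discarded when everything after the first rejection is redrawn."""
    flat = [a for row in accepted for a in row]
    return 0 if all(flat) else len(flat) - flat.index(False)


# 4x4 token grid where only position (1, 1) disagrees with the target model.
q = [[0.5] * 4 for _ in range(4)]
p = [[0.5] * 4 for _ in range(4)]
p[1][1] = 0.1
u = [[0.5] * 4 for _ in range(4)]

accepted = verify_draft(p, q, u)
local = sum(cell for row in local_resample_mask(accepted) for cell in row)
raster = raster_resample_count(accepted)
print(local, raster)
```

In this toy grid the local mask touches 9 tokens, while raster-scan resampling would discard 11 of the 16. Which of the two is cheaper depends on where rejections fall; the sketch only shows the bookkeeping.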
Details
Motivation: Existing speculative decoding methods are limited by token-level ambiguity and a lack of spatial awareness, which makes it hard to balance efficiency and quality when accelerating autoregressive image generation.
Method: Proposes Multi-Scale Local Speculative Decoding (MuLo-SD). A low-resolution draft model generates candidate image tokens, learned up-samplers map them to the high-resolution space, and a high-resolution target model verifies the candidates in parallel. A local rejection and resampling mechanism corrects errors within spatial neighborhoods rather than resampling in raster-scan order.
Result: On the MS-COCO 5k validation split, MuLo-SD achieves up to 1.7x speedup, ahead of strong baselines such as EAGLE-2 and LANTERN in acceleration, while maintaining comparable semantic alignment and perceptual quality on GenEval, DPG-Bench, and FID/HPSv2. Ablations confirm the contribution of the up-sampling design, probability pooling, and local rejection with neighborhood expansion.
Conclusion: MuLo-SD sets a new state of the art in speculative decoding for image generation, balancing generation efficiency against fidelity and making autoregressive models more practical.
Abstract: Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups - up to 1.7x - outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
[159] Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
Shuliang Liu,Songbo Yang,Dong Fang,Sihang Jia,Yuqi Tang,Lingfeng Su,Ruoshui Peng,Yibo Yan,Xin Zou,Xuming Hu
Main category: cs.CV
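TL;DR: Proposes VLI, a training-free inference framework that reduces object hallucination in multimodal large language models through a metacognitive self-correction mechanism.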
Details
Motivation: Existing methods do not effectively address object hallucination caused by multimodal large language models trusting linguistic priors over visual evidence, and they do not repair the underlying internal semantic misalignment.
Method: Proposes Vision-Language Introspection (VLI), which comprises Attributive Introspection (detecting hallucination risk and localizing visual anchors) and Interpretable Bi-Causal Steering (dynamically modulating inference, separating visual evidence from noise, and calibrating confidence).
Result: Reduces the object hallucination rate by 12.67% on MMHal-Bench and improves accuracy by 5.8% on POPE.
Conclusion: By simulating a metacognitive self-correction process, VLI markedly improves the reliability of multimodal models' object perception and offers a new paradigm for training-free hallucination mitigation.
Abstract: Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.
[160] CoV: Chain-of-View Prompting for Spatial Reasoning
Haoyu Zhao,Akide Liu,Zeyu Zhang,Weijie Wang,Feng Chen,Ruihan Zhu,Gholamreza Haffari,Bohan Zhuang
Main category: cs.CV
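TL;DR: Proposes Chain-of-View (CoV), a training-free test-time reasoning framework that improves embodied question answering in 3D environments through coarse-to-fine viewpoint exploration.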
Details
Motivation: Existing vision-language models are limited to a fixed number of input views, which makes it hard to acquire question-relevant context at inference time and to perform complex spatial reasoning.
Method: CoV first uses a View Selection agent to filter redundant frames and pick question-aligned anchor views. It then refines the viewpoint by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the 3D scene until enough context is gathered or the step budget is reached.
Result: On OpenEQA, CoV yields an average +11.56% LLM-Match improvement across four mainstream VLMs, with a maximum of +13.62% on Qwen3-VL-Flash. A larger action budget improves performance further, and results on ScanQA and SQA3D are also strong.
Conclusion: Question-aligned view selection combined with open-view search is an effective, model-agnostic strategy that substantially improves spatial reasoning in 3D embodied question answering without additional training.
Abstract: Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
[161] VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Shuming Liu,Mingchen Zhuge,Changsheng Zhao,Jun Chen,Lemeng Wu,Zechun Liu,Chenchen Zhu,Zhipeng Cai,Chong Zhou,Haozhe Liu,Ernie Chang,Saksham Suri,Hongyu Xu,Qi Qian,Wei Wen,Balakrishnan Varadarajan,Zhuang Liu,Hu Xu,Florian Bordes,Raghuraman Krishnamoorthi,Bernard Ghanem,Vikas Chandra,Yunyang Xiong
Main category: cs.CV
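TL;DR: Proposes VideoAuto-R1, a video understanding framework with a reason-when-necessary strategy. Training follows a "Thinking Once, Answering Twice" paradigm supervised by verifiable rewards, which improves efficiency and accuracy and shortens responses. The results indicate that explicit language reasoning is unnecessary for perception tasks but more valuable for complex reasoning tasks.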
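The reason-when-necessary idea reduces, at inference time, to a confidence gate on the first answer. The sketch below is a minimal illustration under stated assumptions: the two callables, the `threshold` value of 0.9, and the stub models are all hypothetical stand-ins, and how the paper computes the confidence score is not specified here.

```python
from typing import Callable, Tuple


def answer_with_optional_reasoning(
    question: str,
    initial_answer_fn: Callable[[str], Tuple[str, float]],
    reasoning_fn: Callable[[str, str], str],
    threshold: float = 0.9,  # hypothetical cut-off, not taken from the paper
) -> Tuple[str, bool]:
    """Return (answer, used_reasoning).

    The model answers directly first; the costlier reasoning pass runs
    only when the confidence of that first answer is below the threshold.
    """
    answer, confidence = initial_answer_fn(question)
    if confidence >= threshold:
        return answer, False
    return reasoning_fn(question, answer), True


# Stub models for illustration only.
def stub_initial(question: str) -> Tuple[str, float]:
    if "color" in question:
        return "red", 0.97  # perception-style question: confident
    return "3", 0.55        # reasoning-style question: unsure


def stub_reasoning(question: str, draft: str) -> str:
    return "4"  # pretend the reasoning pass revises the draft answer


print(answer_with_optional_reasoning("What color is the car?", stub_initial, stub_reasoning))
print(answer_with_optional_reasoning("How many times does the ball bounce?", stub_initial, stub_reasoning))
```

With these stubs the first call returns the direct answer without reasoning, and the second falls through to the reviewed answer.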
Details
Motivation: Although chain-of-thought (CoT) reasoning is widely used in video understanding, its necessity and advantage over direct answering remain unclear. In particular, it does not consistently outperform direct answering despite its higher computational cost, so more efficient reasoning mechanisms are worth exploring.
Method: Proposes the VideoAuto-R1 framework with a "Thinking Once, Answering Twice" training paradigm: the model generates an initial answer, then reasons, then outputs a reviewed answer, and both answers are supervised with verifiable rewards. At inference, the confidence of the initial answer decides whether reasoning is performed.
Result: VideoAuto-R1 reaches state-of-the-art accuracy on video QA and grounding benchmarks while cutting the average response length from 149 tokens to 44, roughly a 3.3x efficiency gain. Thinking mode is activated rarely on perception tasks and more often on tasks that need complex reasoning.
Conclusion: Explicit language reasoning is beneficial but not always necessary. A reason-when-necessary strategy gated by confidence maintains high performance while greatly improving efficiency.
Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
[162] Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable
Zuhair Ahmed Khan Taha,Mohammed Mudassir Uddin,Shahnawaz Alam
Main category: cs.CV
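TL;DR: AgentCompress dynamically selects model precision according to task difficulty, substantially reducing the compute cost of using large models for research.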
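Difficulty-based routing of this kind can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the keyword scorer merely stands in for a learned difficulty predictor, and the hint words, thresholds, and variant names ("int4", "int8", "fp16") are hypothetical rather than taken from AgentCompress.

```python
# Hypothetical hint words; a real system would use a trained predictor.
HARD_HINTS = {"hypothesis", "prove", "derive", "design"}
EASY_HINTS = {"reformat", "sort", "rename", "list"}


def estimate_difficulty(task: str, n_words: int = 8) -> float:
    """Score difficulty in [0, 1] from the opening words of the task only."""
    words = [w.strip(".,:;").lower() for w in task.split()[:n_words]]
    score = 0.5
    if any(w in HARD_HINTS for w in words):
        score += 0.4
    if any(w in EASY_HINTS for w in words):
        score -= 0.4
    return min(1.0, max(0.0, score))


def route(task: str) -> str:
    """Map the difficulty score to a (hypothetical) compressed model variant."""
    difficulty = estimate_difficulty(task)
    if difficulty < 0.3:
        return "int4"   # heaviest compression for easy tasks
    if difficulty < 0.7:
        return "int8"
    return "fp16"       # least compression for hard tasks


for task in (
    "Reformat this bibliography into BibTeX",
    "Summarize the attached paper",
    "Write a novel hypothesis about protein folding",
):
    print(route(task), "<-", task)
```

Scoring only the first few words keeps the routing decision cheap, which is the property the summary emphasizes; the cost is that a misleading opening can send a hard task to an over-compressed variant.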
Details
Motivation: Large language models are computationally expensive for research automation, putting them out of reach for many academic labs, so an efficient way to cut costs is needed.
Method: A small neural network predicts task difficulty from the opening words of each task and routes the task to a model variant with a matching level of compression. The decision takes less than one millisecond.
Result: Across 500 research workflows in four scientific fields, compute costs fall by 68.3% while 96.2% of the original success rate is retained.
Conclusion: AgentCompress makes autonomous research tasks based on large models affordable for resource-constrained labs and improves the accessibility of research tooling.
Abstract: When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A single research session using a 70-billion parameter model can cost around $127 in cloud fees, putting these tools out of reach for many academic labs. We developed AgentCompress to tackle this problem head-on. The core idea came from a simple observation during our own work: writing a novel hypothesis clearly demands more from the model than reformatting a bibliography. Why should both tasks run at full precision? Our system uses a small neural network to gauge how hard each incoming task will be, based only on its opening words, then routes it to a suitably compressed model variant. The decision happens in under a millisecond. Testing across 500 research workflows in four scientific fields, we cut compute costs by 68.3% while keeping 96.2% of the original success rate. For labs watching their budgets, this could mean the difference between running experiments and sitting on the sidelines.
[163] Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
William Rudman,Michal Golovanevsky,Dana Arad,Yonatan Belinkov,Ritambhara Singh,Carsten Eickhoff,Kyle Mahowald
Main category: cs.CV
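TL;DR: Finds that large vision-language models are prone to prompt-induced hallucination in object counting and conform to incorrect prompts more often as the number of objects grows. Mechanistic analysis identifies a set of attention heads (PIH-heads) whose ablation reduces prompt-induced hallucinations by at least 40% without additional training.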
Details
Motivation: To address hallucinations that arise in object counting when a prompt overstates the count, and to understand the internal mechanism by which large vision-language models ignore visual evidence and rely on the text prompt.
Method: Designs controlled object-counting experiments, uses mechanistic analysis to study the role of attention heads in three VLMs, evaluates the effect of ablation on prompt-induced hallucination (PIH), and analyzes how PIH-head behavior differs across models.
Result: Models conform to incorrect prompts more readily as the number of objects increases. Ablating a small set of key attention heads (PIH-heads) reduces prompt-induced hallucinations by at least 40% and increases correction toward visual evidence. PIH-heads mediate prompt copying in model-specific ways.
Conclusion: Prompt-induced hallucination stems from the activation of specific attention heads, which implement prompt copying differently in each model. Ablating PIH-heads effectively increases reliance on visual evidence and reveals model-specific differences in the mechanisms behind hallucination.
Abstract: Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
[164] MoE3D: A Mixture-of-Experts Module for 3D Reconstruction
Zichen Wang,Ang Cao,Liam J. Wang,Jeong Joon Park
Main category: cs.CV
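TL;DR: MoE3D is a mixture-of-experts module that sharpens depth boundaries and reduces flying-point artifacts in 3D reconstruction models. It predicts multiple candidate depth maps and fuses them with dynamic weights, substantially improving reconstruction quality with almost no added compute.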
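Why weighted fusion of candidate depth maps helps at object boundaries can be seen in a toy example. The sketch assumes per-pixel softmax gating over the candidates, which is a common mixture-of-experts choice but is not confirmed as MoE3D's exact formulation; `fuse_depth` and the sample values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_depth(candidates, gate_logits):
    """Fuse K candidate depth maps with per-pixel softmax weights.

    candidates[k][r][c] is the depth predicted by expert k;
    gate_logits[k][r][c] is the matching (assumed) gating logit.
    """
    k = len(candidates)
    h, w = len(candidates[0]), len(candidates[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            weights = softmax([gate_logits[i][r][c] for i in range(k)])
            fused[r][c] = sum(weights[i] * candidates[i][r][c] for i in range(k))
    return fused


# One row of two pixels at a depth edge: expert 0 predicts the foreground
# surface (1.0), expert 1 the background (5.0).
candidates = [[[1.0, 1.0]], [[5.0, 5.0]]]
# Pixel 0: confident gate picks the foreground. Pixel 1: undecided gate.
gate_logits = [[[10.0, 0.0]], [[-10.0, 0.0]]]

fused = fuse_depth(candidates, gate_logits)
print(fused)
```

The undecided pixel lands at depth 3.0, halfway between the two surfaces, which is exactly a flying point; the confidently gated pixel stays on the foreground surface at approximately 1.0.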
Details
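Motivation: Existing feed-forward 3D reconstruction models suffer from blurry depth boundaries and flying-point artifacts that degrade reconstruction quality, so an effective way to refine depth map generation is needed.
Method: MoE3D predicts multiple candidate depth maps and fuses them with a dynamic weighting mechanism, working with a pretrained 3D reconstruction backbone such as VGGT to refine depth estimation.
Result: With MoE3D integrated, 3D reconstruction models show clearly sharper depth boundaries and fewer artifacts, at minimal additional computational cost.
Conclusion: MoE3D is an efficient plug-and-play module that improves existing 3D reconstruction models and is practical and extensible.
Abstract: MoE3D is a mixture-of-experts module designed to sharpen depth boundaries and mitigate flying-point artifacts (highlighted in red) of existing feed-forward 3D reconstruction models (left side). MoE3D predicts multiple candidate depth maps and fuses them via dynamic weighting (visualized by MoE weights on the right side). When integrated with a pre-trained 3D reconstruction backbone such as VGGT, it substantially enhances reconstruction quality with minimal additional computational overhead. Best viewed digitally.
[165] FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching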
Danilo Danese,Angela Lombardi,Matteo Attimonelli,Giuseppe Fasano,Tommaso Di Noia
Main category: cs.CV
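TL;DR: Proposes FlowLet, a flow-matching-based conditional generative framework that synthesizes age-conditioned 3D MRI in an invertible 3D wavelet domain, improving brain age prediction for underrepresented age groups.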
Details
Motivation: Existing 3D MRI datasets are demographically skewed, acquiring new data is costly and ethically constrained, and current generative methods are slow at inference, prone to artifacts, and rarely conditioned on age.
Method: Proposes FlowLet, which applies flow matching within an invertible 3D wavelet domain to synthesize age-conditioned 3D MRI, avoiding artifacts from latent compression and reducing computational cost.
Result: FlowLet generates high-fidelity 3D MRI volumes with few sampling steps. Training brain age prediction models on its output improves performance for underrepresented age groups, and region-based analysis confirms that anatomical structures are preserved.
Conclusion: FlowLet is an efficient, low-artifact method for age-conditioned MRI generation that can augment datasets to improve the fairness and generalization of brain age prediction models.
Abstract: Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low-dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
[166] ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos
Rustin Soraki,Homanga Bharadhwaj,Ali Farhadi,Roozbeh Mottaghi
Main category: cs.CV
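TL;DR: Presents ObjectForesight, a future object motion prediction model based on 3D object-level representations that predicts 6-DoF poses and trajectories of rigid objects from egocentric video.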
Details
Motivation: Humans naturally anticipate how objects move when interacted with. The goal is to give computational systems a similar ability to infer plausible future object motion from passive visual observation.
Method: Proposes an object-centric 3D dynamics model that operates directly in 3D object space, trained on more than 2 million short clips with pseudo-ground-truth trajectories built using segmentation, mesh reconstruction, and 3D pose estimation.
Result: Experiments show significant gains in prediction accuracy, geometric consistency, and generalization to unseen objects and scenes.
Conclusion: ObjectForesight provides a scalable framework for learning physically grounded, object-centric dynamics models from visual observation.
Abstract: Humans can effortlessly anticipate how objects might move or change through interaction--imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of more than 2 million short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. Project page: objectforesight.github.io
[167] Plenoptic Video Generation
Xiao Fu,Shitao Tang,Min Shi,Xian Liu,Jinwei Gu,Ming-Yu Liu,Dahua Lin,Chen-Hsuan Lin
Main category: cs.CV
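TL;DR: Presents PlenopticDreamer, a generative framework for multi-view video re-rendering that achieves spatio-temporal consistency through autoregressive training and camera-guided video retrieval.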
Details
Motivation: Existing single-view generative re-rendering methods struggle to stay consistent in multi-view scenarios, particularly lacking spatio-temporal coherence in generated regions.
Method: Uses a multi-in-single-out video-conditioned model together with autoregressive training, camera-guided video retrieval, progressive context-scaling, self-conditioning, and a long-video conditioning mechanism.
Result: Achieves state-of-the-art video re-rendering on the Basic and Agibot benchmarks, with superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations.
Conclusion: PlenopticDreamer effectively addresses the spatio-temporal inconsistency of generative models in multi-view re-rendering and markedly improves stability and visual quality for long sequences.
Abstract: Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view settings, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/
[168] RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
Boyang Wang,Haoran Zhang,Shujie Zhang,Jinkun Hao,Mingda Jia,Qi Lv,Yucheng Mao,Zhaoyang Lyu,Jia Zeng,Xudong Xu,Jiangmiao Pang
Main category: cs.CV
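TL;DR: Proposes visual identity prompting, which uses exemplar images to guide a diffusion model in generating multi-view, temporally coherent augmented manipulation data, improving policy model performance.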
Details
Motivation: Existing text-prompt-based data augmentation methods struggle to meet multi-view and temporal-consistency requirements and cannot specify the scene setup precisely.
Method: Introduces visual identity prompting, which supplies exemplar images as conditioning inputs, and builds a scalable visual identity pool to guide the diffusion model in generating augmented data.
Result: Policy models trained on the augmented data show consistent performance gains in both simulation and real-robot settings.
Conclusion: Visual identity prompting produces high-quality, diverse manipulation data and helps improve robot policy training.
Abstract: The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.
[169] GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation
Henghui Ding,Chang Liu,Shuting He,Xudong Jiang,Yu-Gang Jiang
Main category: cs.CV
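TL;DR: Introduces three new benchmarks, Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively called GREx, and builds gRefCOCO, the first large-scale dataset supporting single-target, multi-target, and no-target expressions, extending the classic REx tasks. It also proposes ReLA, a baseline based on region-language relationship modeling that reaches state-of-the-art performance on GRES and GREC.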
Details
Motivation: Existing referring expression tasks support only single-target expressions, which limits real-world use because multi-target and no-target cases cannot be handled. A more general task setting is needed.
Method: Introduces the GREx tasks and the gRefCOCO dataset, with an annotation structure that is backward-compatible with classic REx. For the complex relationship modeling required by GRES/GREC, proposes ReLA, which adaptively divides the image into regions and explicitly models region-region and region-language dependencies.
Result: Builds the gRefCOCO dataset containing multi-target, no-target, and single-target expressions. ReLA achieves state-of-the-art performance on GRES and GREC, and experiments quantify the performance gap of existing REx methods on GREx tasks.
Conclusion: The GREx tasks and gRefCOCO move referring expression research closer to real-world applications, and ReLA offers an effective approach to complex relationship modeling, advancing multi-target referring comprehension and generation.
Abstract: Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing datasets and methods commonly support single-target expressions only, i.e., one expression refers to one object, not considering multi-target and no-target expressions. This greatly limits the real applications of REx (RES/REC/REG). This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset gRefCOCO that contains multi-target, no-target, and single-target expressions and their corresponding images with labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments to study the performance gap of the existing REx methods on GREx tasks. One of the challenges of GRES/GREC is complex relationship modeling, for which we propose a baseline ReLA that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed ReLA achieves state-of-the-art results on both the GRES and GREC tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GREx.
[170] Pixel-Perfect Visual Geometry Estimation
Gangwei Xu,Haotong Lin,Hongcheng Luo,Haiyang Sun,Bing Wang,Guang Chen,Sida Peng,Hangjun Ye,Xin Yang
Main category: cs.CV
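TL;DR: Presents pixel-perfect visual geometry models (PPD/PPVD) that use generative modeling with pixel-space diffusion transformers (DiT) to produce flying-pixel-free, highly detailed monocular and video depth estimates.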
Details
Motivation: Existing geometry foundation models suffer from severe flying pixels and loss of fine detail, falling short of the high-quality geometry recovery that robotics and augmented reality require.
Method: Proposes Pixel-Perfect Depth (PPD), which performs diffusion modeling in pixel space using a Semantics-Prompted DiT and a Cascade DiT architecture. It is extended to video as PPVD with a Semantics-Consistent DiT and reference-guided token propagation to maintain temporal coherence.
Result: Outperforms all generative models on monocular and video depth estimation and produces cleaner, more detailed point clouds.
Conclusion: The method markedly improves the quality and efficiency of generative depth estimation, resolves the flying-pixel and detail-loss problems, and offers an effective new approach to visual geometry reconstruction.
Abstract: Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
[171] RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
Yuan-Kang Lee,Kuan-Lin Chen,Chia-Che Chang,Yu-Lun Liu
Main category: cs.CV
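TL;DR: Proposes RL-AWB, a nighttime white balance framework that combines statistical methods with deep reinforcement learning, dynamically optimizing parameters by mimicking professional tuning, and releases the first multi-sensor nighttime dataset.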
Details
Motivation: Nighttime color constancy is very challenging under low-light noise and complex illumination, and existing methods struggle to achieve both generalization and accuracy.
Method: First designs a statistical algorithm tailored to nighttime scenes that combines salient gray pixel detection with a new illumination estimation. On top of it, builds a deep reinforcement learning framework that uses the statistical algorithm as its core and dynamically optimizes the parameters for each image.
Result: Experiments show strong cross-sensor generalization on both low-light and well-illuminated images.
Conclusion: By fusing statistical priors with reinforcement learning, RL-AWB achieves better nighttime white balance and offers a learnable, tunable new paradigm for auto white balance.
Abstract: Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experimental results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/
[172] QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer
Daniele Lizzio Bosco,Shuteng Wang,Giuseppe Serra,Vladislav Golyanik
Main category: cs.CV
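TL;DR: Proposes QNeRF, the first hybrid quantum-classical model for novel-view synthesis from 2D images. It uses quantum circuits for a more compact representation and matches or outperforms classical NeRF with fewer than half the parameters.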
Details
Motivation: Inspired by the compactness and fast convergence that Quantum Visual Fields (QVFs) show in signal learning, and motivated by the large parameter counts and intensive training of NeRF-style models, the work explores the potential of quantum machine learning for 3D scene representation.
Method: Proposes two QNeRF architectures. Full QNeRF exploits all quantum amplitudes to enhance representational capability. Dual-Branch QNeRF introduces a task-informed prior by encoding spatial and view-dependent information in separate branches, reducing complexity and improving scalability. Both use parameterised quantum circuits to model quantum superposition and entanglement.
Result: When trained on images of moderate resolution, QNeRF matches or outperforms classical NeRF baselines with fewer than half the parameters, showing higher model efficiency.
Conclusion: Quantum machine learning can be an effective alternative for continuous signal representation in computer vision, particularly for mid-level tasks such as learning 3D representations from 2D observations, with good compactness and potential to scale.
Abstract: Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen major advances with Neural Radiance Fields (NeRFs), where models learn a compact representation from 2D images to render 3D scenes, albeit at the cost of larger models and intensive training. In this work, we extend the approach of QVFs by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF leverages parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, resulting in more compact models compared to the classical counterpart. We present two architectural variants. Full QNeRF maximally exploits all quantum amplitudes to enhance representational capabilities. In contrast, Dual-Branch QNeRF introduces a task-informed inductive bias by branching spatial and view-dependent quantum state preparations, drastically reducing the complexity of this operation and ensuring scalability and potential hardware compatibility. Our experiments demonstrate that -- when trained on images of moderate resolution -- QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level tasks in computer vision, such as 3D representation learning from 2D observations.
[173] Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
Zeren Jiang,Chuanxia Zheng,Iro Laina,Diane Larlus,Andrea Vedaldi
Main category: cs.CV
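TL;DR: Proposes Mesh4D, a feed-forward model for monocular 4D mesh reconstruction that uses a compact latent space and spatio-temporal attention to reconstruct the 3D shape and motion of dynamic objects with high quality.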