cs.CL

[1] MedPI: Evaluating AI Systems in Medical Patient-facing Interactions

Diego Fajardo V., Oleksii Proniakin, Victoria-Elisabeth Gruber, Razvan Marinescu

Main category: cs.CL

TL;DR: MedPI is a high-dimensional benchmark for evaluating large language models in patient-clinician conversations. It covers 105 dimensions and, across 9 flagship models, finds low performance on diagnosis and treatment recommendations, particularly on differential diagnosis.

Motivation: Existing single-turn QA benchmarks cannot fully assess the quality of medical dialogue and lack systematic measures of the care process, safety, outcomes, and doctor-patient communication, so a more granular, accreditation-aligned evaluation framework is needed.

Method: Proposes the five-layer MedPI framework: synthetic EHR-like Patient Packets; AI Patients with memory and affect; a Task Matrix spanning encounter reasons and objectives; a 105-dimension evaluation framework mapped to ACGME competencies; and AI Judges, calibrated committee-based LLMs that provide scores and rationales.

Result: Nine flagship LLMs are tested across 366 AI Patients and 7,097 conversations. All models perform poorly on multiple dimensions, with the lowest scores on differential diagnosis.

Conclusion: MedPI offers a systematic, multidimensional way to evaluate LLMs in clinical conversations, exposes current models' shortcomings on key clinical tasks, and can help guide the future development and use of LLMs for diagnosis and treatment recommendations.

Abstract: We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates the medical dialogue across 105 dimensions comprising the medical process, treatment safety, treatment outcomes and doctor-patient communication, using a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) Patient Packets (synthetic EHR-like ground truth); (2) an AI Patient instantiated through an LLM with memory and affect; (3) a Task Matrix spanning encounter reasons (e.g. anxiety, pregnancy, wellness checkup) x encounter objectives (e.g. diagnosis, lifestyle advice, medication advice); (4) an Evaluation Framework with 105 dimensions on a 1-4 scale mapped to the Accreditation Council for Graduate Medical Education (ACGME) competencies; and (5) AI Judges that are calibrated, committee-based LLMs providing scores, flags, and evidence-linked rationales. We evaluate 9 flagship models (Claude Opus 4.1, Claude Sonnet 4, MedGemma, Gemini 2.5 Pro, Llama 3.3 70b Instruct, GPT-5, GPT OSS 120b, o3, Grok-4) across 366 AI Patients and 7,097 conversations using a standardized "vanilla clinician" prompt. For all LLMs, we observe low performance across a variety of dimensions, in particular on differential diagnosis. Our work can help guide future use of LLMs for diagnosis and treatment recommendations.
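The Task Matrix (encounter reasons x encounter objectives) and the 1-4 scoring scale in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the reason and objective lists contain just the examples quoted in the abstract, and the dimension names and scores are hypothetical, not taken from MedPI.

```python
from itertools import product
from statistics import mean

# Only the examples quoted in the abstract; the real matrix is larger.
ENCOUNTER_REASONS = ["anxiety", "pregnancy", "wellness checkup"]
ENCOUNTER_OBJECTIVES = ["diagnosis", "lifestyle advice", "medication advice"]


def build_task_matrix(reasons, objectives):
    """Return every (reason, objective) pair, i.e. the Cartesian product."""
    return list(product(reasons, objectives))


def mean_score_per_dimension(conversation_scores):
    """Average 1-4 scores per dimension over a list of scored conversations.

    Each element of conversation_scores is a dict {dimension_name: score}.
    """
    collected = {}
    for scores in conversation_scores:
        for dimension, score in scores.items():
            if not 1 <= score <= 4:
                raise ValueError(f"score outside the 1-4 scale: {score}")
            collected.setdefault(dimension, []).append(score)
    return {dimension: mean(values) for dimension, values in collected.items()}


tasks = build_task_matrix(ENCOUNTER_REASONS, ENCOUNTER_OBJECTIVES)

# Hypothetical judge outputs for two conversations and two dimensions.
example_scores = [
    {"differential_diagnosis": 1, "empathy": 3},
    {"differential_diagnosis": 2, "empathy": 4},
]
averages = mean_score_per_dimension(example_scores)
```

Averaging per dimension rather than per conversation keeps weak areas visible instead of hiding them inside one overall score.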

[2] RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation

Keerthana Murugaraj, Salima Lamsiyah, Martin Theobald

Main category: cs.CL

TL;DR: This paper presents RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of retrieval-augmented generation (RAG) systems. It decomposes RAG behavior into several fine-grained metrics with structured explanations and supports both manual and automated evaluation.

Motivation: Existing RAG metrics often collapse complex behavior into a single score, which makes it hard to tell whether errors come from retrieval, reasoning, or factual grounding and leaves little room for transparent, fine-grained diagnosis.

Method: RAGVUE decomposes evaluation into four areas: retrieval quality, answer relevance and completeness, claim-level faithfulness, and judge calibration. Each metric comes with a structured explanation. A Python API, a CLI, and a Streamlit interface support manual metric selection or agent-driven automated evaluation.

Result: Experiments show that RAGVUE surfaces fine-grained failure modes that existing tools such as RAGAS overlook. The code and usage guide are open source.

Conclusion: RAGVUE offers a transparent, modular, and extensible way to evaluate RAG systems, supporting debugging and optimization in both research and practical use.

Abstract: Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval, reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub.
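To make "claim-level faithfulness" concrete, here is a generic sketch of the idea: split an answer into claims, have a judge mark each claim as supported or not by the retrieved context, and report the supported share together with the failing claims. It does not use or reproduce the RAGVUE API; the function names and the example verdicts are hypothetical.

```python
def claim_level_faithfulness(verdicts):
    """Share of answer claims that a judge marked as supported by the context.

    verdicts maps each claim string to True (supported) or False.
    Returns None when the answer contains no claims.
    """
    if not verdicts:
        return None
    return sum(verdicts.values()) / len(verdicts)


def unsupported_claims(verdicts):
    """List the claims that were not supported, as a simple explanation."""
    return [claim for claim, supported in verdicts.items() if not supported]


# Hypothetical judge verdicts for a three-claim answer.
example_verdicts = {
    "The policy took effect in 2020.": True,
    "It applies to all employees.": True,
    "It was revised twice.": False,
}
score = claim_level_faithfulness(example_verdicts)
failures = unsupported_claims(example_verdicts)
```

Returning the failing claims alongside the score is what turns a single number into a diagnosis.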

[3] Automatic Construction of Chinese Verb Collostruction Database

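Xuri Tang, Daohuan Liu

Main category: cs.CL

TL;DR: This paper proposes a fully unsupervised method for building a Chinese verb collostruction database. Clustering algorithms generate interpretable verb collostructions from a large-scale corpus, and the resulting database outperforms large language models on a grammatical error correction task.

Motivation: To complement LLMs in application scenarios where explanation and interpretability are required, by supplying explicit, interpretable grammatical rules.

Method: Formally defines a verb collostruction as a directed acyclic graph and uses a series of clustering algorithms to generate collostructions for a given verb from sentences retrieved from a large-scale corpus.

Result: Statistical analysis shows that the generated collostructions exhibit functional independence and graded typicality. On verb grammatical error correction, the correction algorithm based on maximum matching outperforms LLMs.

Conclusion: The method effectively builds a Chinese verb collostruction database and provides a complementary tool for NLP tasks that need interpretability.

Abstract: This paper proposes a fully unsupervised approach to the construction of a verb collostruction database for Chinese, aimed at complementing LLMs by providing explicit and interpretable rules for application scenarios where explanation and interpretability are indispensable. The paper formally defines a verb collostruction as a projective, rooted, ordered, and directed acyclic graph and employs a series of clustering algorithms to generate collostructions for a given verb from a list of sentences retrieved from a large-scale corpus. Statistical analysis demonstrates that the generated collostructions possess the design features of functional independence and graded typicality. Evaluation with verb grammatical error correction shows that the error correction algorithm based on maximum matching with collostructions achieves better performance than LLMs.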

[4] Attribute-Aware Controlled Product Generation with LLMs for E-commerce

Virginia Negri, Víctor Martínez Gómez, Sergio A. Balanya, Subburam Rajaram

Main category: cs.CL

TL;DR: Proposes an LLM-based framework for generating synthetic e-commerce product data. Three strategies produce high-quality labeled data that performs on par with real data on the MAVE dataset and is useful in low-resource settings.

Motivation: E-commerce lacks high-quality labeled data, and existing approaches struggle to meet the needs of information extraction tasks.

Method: Designs a controlled synthetic data generation framework with three strategies (attribute-preserving modification, controlled negative example generation, and systematic attribute removal), using attribute-aware LLM prompts to generate data that satisfies store constraints and stays coherent as a product.

Result: In human evaluation, 99.6% of synthetic products are rated natural and 96.5% have valid attribute values. On MAVE, purely synthetic data reaches 60.5% accuracy, close to the 60.8% of real data and far above the 13.4% zero-shot baseline. Combining synthetic and real data raises accuracy to 68.8%.

Conclusion: The framework generates high-quality e-commerce product data that can be used for data augmentation, especially where labeled data is scarce.

Abstract: Product information extraction is crucial for e-commerce services, but obtaining high-quality labeled datasets remains challenging. We present a systematic approach for generating synthetic e-commerce product data using Large Language Models (LLMs), introducing a controlled modification framework with three strategies: attribute-preserving modification, controlled negative example generation, and systematic attribute removal. Using a state-of-the-art LLM with attribute-aware prompts, we enforce store constraints while maintaining product coherence. Human evaluation of 2000 synthetic products demonstrates high effectiveness, with 99.6% rated as natural, 96.5% containing valid attribute values, and over 90% showing consistent attribute usage. On the public MAVE dataset, our synthetic data achieves 60.5% accuracy, performing on par with real training data (60.8%) and significantly improving upon the 13.4% zero-shot baseline. Hybrid configurations combining synthetic and real data further improve performance, reaching 68.8% accuracy. Our framework provides a practical solution for augmenting e-commerce datasets, particularly valuable for low-resource scenarios.

[5] Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems

Zihan Gao, Mohsin Y. K. Yousufi, Jacob Thebault-Spieker

Main category: cs.CL

TL;DR: This paper proposes Collective Narrative Grounding, a participatory protocol that turns community stories into structured narrative units and integrates them into AI systems under community governance, addressing the knowledge blind spots and epistemic injustice that arise when large language models answer community-specific questions.

Motivation: Large language models often have knowledge blind spots on community-specific queries, which sidelines marginalized voices and deepens epistemic injustice. A way of designing AI systems that incorporates local knowledge and is led by the community is therefore needed.

Method: Based on three participatory mapping workshops with 24 community members, the authors design narrative elicitation methods and a data schema that preserve narrative richness while supporting entity, time, and place extraction, validation, and provenance control. They also audit a county-level benchmark of 14,782 local information QA pairs to assess existing LLMs.

Result: On a participatory QA set derived from the workshops, a state-of-the-art LLM answers fewer than 21% of questions correctly without added context. The benchmark audit attributes 76.7% of errors to factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments, and the missing information is mostly present in the collected community narratives.

Conclusion: The study contributes a participatory protocol, a taxonomy, and an evaluation framework that lay the groundwork for retrieval-first, provenance-visible, locally governed QA systems and for fairer, community-grounded AI design.

Abstract: Large language model (LLM) question-answering systems often fail on community-specific queries, creating "knowledge blind spots" that marginalize local voices and reinforce epistemic injustice. We present Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. Learning from three participatory mapping workshops with N=24 community members, we designed elicitation methods and a schema that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. To scope the problem, we audit a county-level benchmark of 14,782 local information QA pairs, where factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a participatory QA set derived from our workshops, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context, underscoring the need for local grounding. The missing facts often appear in the collected narratives, suggesting a direct path to closing the dominant error modes for narrative items. Beyond the protocol and pilot, we articulate key design tensions, such as representation and power, governance and control, and privacy and consent, providing concrete requirements for retrieval-first, provenance-visible, locally governed QA systems. Together, our taxonomy, protocol, and participatory evaluation offer a rigorous foundation for building community-grounded AI that better answers local questions.

[6] TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation

Anas Ezzakri, Nicola Piovesan, Mohamed Sana, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang

Main category: cs.CL

TL;DR: This paper presents TeleTables, a benchmark for evaluating how well large language models understand and reason over tables in telecom standards such as 3GPP specifications.

Motivation: Prior studies show that large language models perform poorly on telecom technical standards, which are dense with tables that carry essential information. A dedicated evaluation of how models handle this structured content is therefore needed.

Method: A multi-stage data generation pipeline extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions, yielding a dataset of 500 human-verified question-answer pairs.

Result: Models under 10B parameters do poorly on both knowledge recall and table interpretation. Larger models reason better over tables, but domain-specific fine-tuning is still needed overall to improve performance.

Conclusion: TeleTables exposes the limits of current LLMs in understanding and reasoning over telecom standards and underlines the importance of domain-specific fine-tuning.

Abstract: Large Language Models (LLMs) are increasingly explored in the telecom industry to support engineering tasks, accelerate troubleshooting, and assist in interpreting complex technical documents. However, recent studies show that LLMs perform poorly on telecom standards, particularly 3GPP specifications. We argue that a key reason is that these standards rely heavily on tables to present essential information, yet LLMs' knowledge of such tables and their ability to interpret them remain largely unexamined. To address this gap, we introduce TeleTables, a benchmark designed to evaluate both the implicit knowledge LLMs have about tables in technical specifications and their explicit ability to interpret them. TeleTables is built through a novel multi-stage data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions. The resulting dataset, which is publicly available, comprises 500 human-verified question-answer pairs, each associated with the corresponding table in multiple formats. Our evaluation shows that smaller models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables, indicating the limited exposure to telecom standards in their pretraining and the insufficient inductive biases for navigating complex technical material. Larger models, on the other hand, show stronger reasoning on table interpretation. Overall, TeleTables highlights the need for domain-specialized fine-tuning to reliably interpret and reason over telecom standards.

[7] FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen

Main category: cs.CL

TL;DR: FronTalk is a benchmark for front-end code generation that studies conversational code generation with multi-modal feedback. It introduces a new evaluation framework and AceCoder, a method that addresses forgetting.

Motivation: In front-end development, visual artifacts are essential for conveying design intent, yet their role in multi-turn code generation has not been studied in depth.

Method: Builds FronTalk, a dataset of 100 multi-turn dialogues, and proposes a web-agent-based evaluation framework together with AceCoder, a method for reducing model forgetting.

Result: An evaluation of 20 models reveals a significant forgetting issue and difficulty interpreting visual feedback. AceCoder reduces forgetting to nearly zero and improves performance by up to 9.3 percentage points.

Conclusion: FronTalk provides a solid foundation for research on front-end development and multi-turn, multi-modal code generation.

Abstract: We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework that leverages a web agent to simulate users and explore the website, thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that have not been systematically explored in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3 percentage points (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at https://github.com/shirley-wu/frontalk

[8] STDD: Spatio-Temporal Dynamics-Driven Token Refinement in Diffusion Language Models

Xinhao Sun, Maoliang Li, Zihao Zheng, Jiayu Chen, Hezhao Xu, Yun Liang, Xiang Chen

Main category: cs.CL

TL;DR: Proposes a new remasking method that dynamically detects each token's temporal variance and spatial deviance to adapt its confidence threshold, significantly improving the efficiency of diffusion language models while preserving generation quality.

Motivation: Mainstream remasking strategies rely on a single global confidence threshold and ignore the temporal and spatial dynamics of individual tokens, which leads to redundant iterations and limited parallelism.

Method: Proposes a remasking method that dynamically detects each token's Temporal Variance and Spatial Deviance, which reflect its convergence status and its correlations with other tokens, and uses them to adapt the confidence threshold for every token at every step.

Result: Experiments show that the method significantly improves the efficiency of diffusion language models on mainstream datasets, with speedups of up to 8.9 times, while preserving generation quality.

Conclusion: The dynamic remasking strategy overcomes the limitations of a fixed threshold and improves the efficiency and parallelism of DLMs, which makes it practically useful.

Abstract: Unlike autoregressive language models, diffusion language models (DLMs) generate text by iteratively denoising all token positions in parallel. At each timestep, the remasking strategy of a DLM selects low-priority tokens to defer their decoding, thereby improving both efficiency and output quality. However, mainstream remasking strategies rely on a single global confidence threshold, overlooking the temporal and spatial dynamics of individual tokens. Motivated by the redundant iterations and constrained parallelism introduced by fixed-threshold remasking, we propose a novel remasking approach that dynamically detects Temporal Variance and Spatial Deviance of each token, which reflect its convergence status and inter-token correlations. Using these signals, our method adaptively adjusts the confidence threshold for every token at every step. Empirical results show that our approach significantly improves the operational efficiency of DLMs across mainstream datasets, achieving speedups of up to 8.9 times while faithfully preserving generation quality.
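The contrast between one global threshold and a per-token adaptive threshold can be sketched as follows. The abstract does not give the formulas, so everything here is an assumption for illustration only: temporal variance is taken as the variance of a token's confidence over recent steps, spatial deviance as the gap to its neighbours' mean confidence, and the linear rule with `alpha` and `beta` is hypothetical rather than the authors' method.

```python
from statistics import mean, pvariance


def temporal_variance(history):
    """Variance of one token's confidence over recent denoising steps."""
    return pvariance(history) if len(history) > 1 else 0.0


def spatial_deviance(confidences, index, window=1):
    """Absolute gap between a token's confidence and its neighbours' mean."""
    lo = max(0, index - window)
    hi = min(len(confidences), index + window + 1)
    neighbours = [confidences[j] for j in range(lo, hi) if j != index]
    if not neighbours:
        return 0.0
    return abs(confidences[index] - mean(neighbours))


def adaptive_threshold(base, t_var, s_dev, alpha=1.0, beta=0.5):
    """Hypothetical rule: stricter threshold for unstable or outlying tokens."""
    return min(1.0, base + alpha * t_var + beta * s_dev)


def tokens_to_unmask(histories, base=0.7):
    """Indices whose latest confidence clears their own adaptive threshold."""
    current = [history[-1] for history in histories]
    selected = []
    for i, history in enumerate(histories):
        threshold = adaptive_threshold(
            base, temporal_variance(history), spatial_deviance(current, i)
        )
        if current[i] >= threshold:
            selected.append(i)
    return selected


# Hypothetical confidence histories for three token positions.
histories = [[0.9, 0.9, 0.9], [0.5, 0.9, 0.75], [0.2, 0.3, 0.4]]
adaptive = tokens_to_unmask(histories)
fixed = [i for i, h in enumerate(histories) if h[-1] >= 0.7]
```

In this toy run a fixed threshold of 0.7 would also accept the second token, whose confidence is still fluctuating, while the adaptive rule defers it.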

[9] Enhancing Admission Inquiry Responses with Fine-Tuned Models and Retrieval-Augmented Generation

Aram Virabyan

Main category: cs.CL

TL;DR: Proposes an AI system that combines a fine-tuned language model with retrieval-augmented generation (RAG) to improve response speed and information accuracy for university admissions inquiries.

Motivation: University admissions offices handle a high volume of inquiries and need to respond efficiently without sacrificing quality. Standard RAG can fall short on contextual understanding in complex, narrow domains.

Method: Builds a hybrid system that integrates a fine-tuned language model with RAG, fine-tunes the model on an admissions-specific dataset, and tunes the response generation logic to balance quality and speed.

Result: The system interprets RAG-provided information more accurately and generates high-quality, admissions-appropriate responses, with improvements in response time, information accuracy, and contextual relevance.

Conclusion: A hybrid of fine-tuning and RAG handles specialized, rule-heavy domains such as university admissions well and offers an efficient, reliable way to automate admissions communication.

Abstract: University admissions offices face the significant challenge of managing high volumes of inquiries efficiently while maintaining response quality, which critically impacts prospective students' perceptions. This paper addresses the issues of response time and information accuracy by proposing an AI system integrating a fine-tuned language model with Retrieval-Augmented Generation (RAG). While RAG effectively retrieves relevant information from large datasets, its performance in narrow, complex domains like university admissions can be limited without adaptation, potentially leading to contextually inadequate responses due to the intricate rules and specific details involved. To overcome this, we fine-tuned the model on a curated dataset specific to admissions processes, enhancing its ability to interpret RAG-provided data accurately and generate domain-relevant outputs. This hybrid approach leverages RAG's ability to access up-to-date information and fine-tuning's capacity to embed nuanced domain understanding. We further explored optimization strategies for the response generation logic, experimenting with settings to balance response quality and speed, aiming for consistently high-quality outputs that meet the specific requirements of admissions communications.

[10] Ideology as a Problem: Lightweight Logit Steering for Annotator-Specific Alignment in Social Media Analysis

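Wei Xia, Haowen Tang, Luozheng Li

Main category: cs.CL

TL;DR: Proposes a lightweight linear probe that quantifies and corrects the misalignment between an LLM's internal political ideology and that of humans, adjusting the output layer for low-cost, efficient alignment.

Motivation: The internal political-ideology structure of LLMs is systematically and model-specifically misaligned with the human ideological space, and this deviation needs to be quantified and corrected.

Method: Introduces a lightweight linear probe that computes a bias score from the model's internal features and directly adjusts the final output probabilities, without retraining the model.

Result: The method quantifies the model's internal ideological bias and applies a minimal correction at the output layer while preserving the model's original reasoning ability.

Conclusion: The method is a practical, low-cost way to align models with user opinions and is suited to personalized model adjustment.

Abstract: LLMs internally organize political ideology along low-dimensional structures that are partially, but not fully, aligned with human ideological space. This misalignment is systematic, model-specific, and measurable. We introduce a lightweight linear probe that both quantifies the misalignment and minimally corrects the output layer. This paper introduces a simple and efficient method for aligning models with specific user opinions. Instead of retraining the model, we calculate a bias score from its internal features and directly adjust the final output probabilities. This solution is practical and low-cost and preserves the original reasoning power of the model.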
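The general idea of "compute a bias score from internal features, then adjust the final output probabilities" can be sketched in a few lines. This is a toy illustration and not the paper's procedure: the probe weights, the two-label setup, the `direction` vector, and the `strength` parameter are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bias_score(hidden, probe_weights, probe_bias=0.0):
    """Linear probe: dot product of hidden features and learned weights."""
    return sum(h * w for h, w in zip(hidden, probe_weights)) + probe_bias


def steer_logits(logits, score, direction, strength=1.0):
    """Shift each label logit against the measured bias; weights stay frozen."""
    return [l - strength * score * d for l, d in zip(logits, direction)]


# Hypothetical two-label task; a positive score means a lean toward label 0.
logits = [2.0, 1.0]
hidden = [0.5, -1.0]           # stand-in for the model's internal features
probe_weights = [1.0, -0.5]    # stand-in for a trained linear probe
direction = [1.0, -1.0]        # how the bias maps onto the two labels

score = bias_score(hidden, probe_weights)
steered = steer_logits(logits, score, direction)
before, after = softmax(logits), softmax(steered)
```

Because only the logits are shifted, the underlying model is untouched, which is what keeps this kind of correction cheap.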

[11] LLMs for Explainable Business Decision-Making: A Reinforcement Learning Fine-Tuning Approach

Xiang Cheng, Wen Wang, Anindya Ghose

Main category: cs.CL

TL;DR: This paper proposes LEXMA, a reinforcement-learning-based fine-tuning framework that uses large language models to generate narrative, decision-faithful natural-language explanations for multiple audiences. Applied to mortgage approval, it significantly improves predictive performance and explanation quality.

Motivation: Existing explainable AI methods rely on post hoc numerical feature attributions and lack a coherent narrative. They also struggle to serve different audiences and depend on large amounts of human-annotated explanations, which is costly and does not scale.

Method: Proposes the LEXMA framework, which combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO). Reward signals that need no human-annotated explanations are used to optimize decision correctness and audience-specific style separately.

Result: On mortgage approval, LEXMA outperforms other LLM baselines and significantly improves predictive performance. Human evaluation finds its expert-facing explanations more risk-focused and its consumer-facing explanations clearer, more actionable, and more polite.

Conclusion: LEXMA is a low-cost, systematic fine-tuning approach that produces high-quality explanations adapted to multiple audiences and supports large-scale deployment of transparent AI systems.

Abstract: Artificial Intelligence (AI) models increasingly drive high-stakes consumer interactions, yet their decision logic often remains opaque. Prevailing explainable AI techniques rely on post hoc numerical feature attributions, which fail to provide coherent narratives behind model decisions. Large language models (LLMs) present an opportunity to generate natural-language explanations, but three design challenges remain unresolved: explanations must be both decision-correct and faithful to the factors that drive the prediction; they should be able to serve multiple audiences without shifting the underlying decision rule; and they should be trained in a label-efficient way that does not depend on large corpora of human-scored explanations. To address these challenges, we introduce LEXMA (LLM-based EXplanations for Multi-Audience decisions), a reinforcement-learning-based fine-tuning framework that produces narrative-driven, audience-appropriate explanations. LEXMA combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO). Specifically, it fine-tunes two separate parameter sets to improve decision correctness and satisfy stylistic requirements for different audiences, using reward signals that do not rely on human-annotated explanations. We instantiate LEXMA in the context of mortgage approval decisions. Results demonstrate that LEXMA yields significant improvements in predictive performance compared with other LLM baselines. Moreover, human evaluations show that expert-facing explanations generated by our approach are more risk-focused, and consumer-facing explanations are clearer, more actionable, and more polite. Our study contributes a cost-efficient, systematic LLM fine-tuning approach to enhance explanation quality for business decisions, offering strong potential for scalable deployment of transparent AI systems.

[12] Leveraging Language Models and RAG for Efficient Knowledge Discovery in Clinical Environments

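Seokhwan Ko, Donghyeon Lee, Jaewoo Chun, Hyungsoo Han, Junghwan Cho

Main category: cs.CL

TL;DR: This study develops and evaluates a locally deployed retrieval-augmented generation (RAG) system that recommends research collaborators within a medical institution based on PubMed publications.

Motivation: Strict data privacy and network security requirements in hospital settings mean that sensitive information must be processed on local infrastructure, which calls for LLM applications that can run locally.

Method: The system uses PubMedBERT to generate domain-specific embeddings and a locally deployed LLaMA3 model for generative synthesis, combined in a RAG framework that recommends research collaborators.

Result: The system recommends collaborators in a local environment based on institution members' publication records, showing that a domain-specialized encoder pairs effectively with a lightweight LLM.

Conclusion: Under local deployment constraints, combining a domain-optimized embedding model with a lightweight LLM can support biomedical knowledge discovery in privacy-sensitive medical settings.

Abstract: Large language models (LLMs) are increasingly recognized as valuable tools across the medical environment, supporting clinical, research, and administrative workflows. However, strict privacy and network security regulations in hospital settings require that sensitive data be processed within fully local infrastructures. Within this context, we developed and evaluated a retrieval-augmented generation (RAG) system designed to recommend research collaborators based on PubMed publications authored by members of a medical institution. The system utilizes PubMedBERT for domain-specific embedding generation and a locally deployed LLaMA3 model for generative synthesis. This study demonstrates the feasibility and utility of integrating domain-specialized encoders with lightweight LLMs to support biomedical knowledge discovery under local deployment constraints.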
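The retrieval half of such a system typically ranks candidates by embedding similarity. The sketch below shows only that step, with hand-written three-dimensional vectors standing in for PubMedBERT embeddings; the researcher names, vectors, and `top_k` value are hypothetical, and the generation step with a local LLM is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_collaborators(query_vector, author_vectors, top_k=2):
    """Return the top_k author names most similar to the query embedding."""
    scored = sorted(
        ((cosine_similarity(query_vector, vector), name)
         for name, vector in author_vectors.items()),
        reverse=True,
    )
    return [name for _, name in scored[:top_k]]


# Toy vectors standing in for embeddings of each author's publications.
author_vectors = {
    "researcher_a": [0.9, 0.1, 0.0],
    "researcher_b": [0.0, 1.0, 0.0],
    "researcher_c": [0.5, 0.5, 0.0],
}
query_vector = [1.0, 0.0, 0.0]  # embedding of the research topic of interest
recommended = rank_collaborators(query_vector, author_vectors)
```

In a full pipeline the retrieved authors and their abstracts would then be passed to the locally hosted model as context for the final recommendation text.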

[13] Complexity Agnostic Recursive Decomposition of Thoughts

Kaleem Ullah Qasim, Jiashu Zhang, Hafiz Saif Ur Rehman

Main category: cs.CL

TL;DR: This paper presents CARD, a framework that estimates problem complexity up front and adapts its decomposition strategy accordingly, improving both the accuracy and the efficiency of large language models on multi-step reasoning.

Motivation: Large language models often do poorly on multi-step reasoning because they apply a fixed reasoning strategy that ignores differences in problem difficulty. A reasoning method that adapts to problem difficulty is therefore needed.

Method: Proposes the CARD framework, which consists of MRCE (Multi-dimensional Reasoning Complexity Estimator) and a two-stage recursive solver: the task is first decomposed hierarchically based on its profile, and a recursive complexity analysis then assigns each step a thought budget (1, 5-9, or 10 thoughts).

Result: On GSM8K, accuracy reaches 81.4% to 89.2% with token cost reduced by 1.88x to 2.40x. On MATH-500, accuracy is 75.1% to 86.8% using 1.71x to 5.74x fewer tokens.

Conclusion: Estimating complexity in advance improves multi-step reasoning accuracy and substantially reduces compute, which demonstrates the advantage of adaptive reasoning strategies.

Abstract: Large language models often fail on multi-step reasoning due to fixed reasoning strategies that ignore problem-specific difficulty. We introduce CARD (Complexity Agnostic Recursive Decomposition), a framework that predicts problem complexity before generation and adapts decomposition accordingly. Our system comprises MRCE (Multi-dimensional Reasoning Complexity Estimator), a 0.6B Qwen model predicting 30 fine-grained features from question text, and a two-stage recursive solver: (1) hierarchical decomposition into K steps based on task profile and (2) per-step thought budget allocation (1, 5-9, or 10 thoughts) via recursive MRCE profiling. Evaluated on three reasoning models (Qwen3-0.6B, DeepSeek-R1-Distill-Qwen-1.5B, Qwen3-1.7B), CARD achieves 81.4% to 89.2% accuracy on GSM8K while reducing token cost by 1.88x to 2.40x compared to fixed decomposition baselines. On MATH-500, CARD reaches 75.1% to 86.8% accuracy using 1.71x to 5.74x fewer tokens. Our results demonstrate that preemptive complexity estimation enables both higher accuracy and significant efficiency gains.
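The per-step thought budget (1, 5-9, or 10 thoughts) can be illustrated with a small mapping from an estimated complexity to a budget tier. The abstract only names the three tiers, so the complexity scale in [0, 1], the cut-offs `low` and `high`, and the linear spread inside the middle tier are all hypothetical choices made for this sketch.

```python
def thought_budget(complexity, low=0.3, high=0.8):
    """Map a complexity estimate in [0, 1] to a budget of 1, 5-9, or 10."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    if complexity < low:
        return 1                      # easy step: a single thought
    if complexity >= high:
        return 10                     # hard step: the full budget
    # Middle tier: spread 5..9 linearly across the [low, high) interval.
    fraction = (complexity - low) / (high - low)
    return 5 + min(4, int(fraction * 5))


def allocate(step_complexities):
    """Budget for each step of a decomposed problem."""
    return [thought_budget(c) for c in step_complexities]


# Hypothetical complexity estimates for a three-step decomposition.
budgets = allocate([0.1, 0.55, 0.9])
adaptive_total = sum(budgets)
fixed_total = 10 * len(budgets)   # a baseline that always spends 10 thoughts
```

In this toy case the adaptive allocation spends 18 thoughts where a fixed budget of 10 per step would spend 30, which is the kind of saving the token-cost figures in the abstract refer to.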

[14] Qwerty AI: Explainable Automated Age Rating and Content Safety Assessment for Russian-Language Screenplays

Nikita Zmanovskii

Main category: cs.CL

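TL;DR: Qwerty AI is an end-to-end system for automated age rating and content safety assessment of Russian-language screenplays under Russian Federal Law No. 436-FZ, combining high accuracy with fast processing.

Motivation: To address the inefficiency of content review and age rating for Russian-language film and television scripts in the Russian media industry, in particular the lack of automated tools in editorial workflows.

Method: Uses a fine-tuned Phi-3-mini model with 4-bit quantization under three constraints (no external API calls, an 80GB VRAM limit, and processing within 5 minutes) to segment scripts, detect five categories of content violations, and assign explainable age ratings. The system is deployed on Yandex Cloud with CUDA acceleration.

Result: The system processes scripts of up to 700 pages in under 2 minutes, with 80% age-rating accuracy and 80-95% segmentation precision depending on format, which meets production requirements.

Conclusion: Qwerty AI delivers efficient, explainable automated content review of Russian-language scripts under strict resource limits, and its practicality in a real industry setting was validated at the Wink hackathon.

Abstract: We present Qwerty AI, an end-to-end system for automated age-rating and content-safety assessment of Russian-language screenplays according to Federal Law No. 436-FZ. The system processes full-length scripts (up to 700 pages in under 2 minutes), segments them into narrative units, detects content violations across five categories (violence, sexual content, profanity, substances, frightening elements), and assigns age ratings (0+, 6+, 12+, 16+, 18+) with explainable justifications. Our implementation leverages a fine-tuned Phi-3-mini model with 4-bit quantization, achieving 80% rating accuracy and 80-95% segmentation precision (format-dependent). The system was developed under strict constraints: no external API calls, 80GB VRAM limit, and <5 minute processing time for average scripts. Deployed on Yandex Cloud with CUDA acceleration, Qwerty AI demonstrates practical applicability for production workflows. We achieved these results during the Wink hackathon (November 2025), where our solution addressed real editorial challenges in the Russian media industry.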
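One plausible way to turn per-segment detections into a single script rating with a justification is to take the strictest rating that any finding requires. The sketch below shows that aggregation only. It is not the Qwerty AI implementation, and the `min_age` values attached to each finding are hypothetical rather than derived from Federal Law No. 436-FZ.

```python
ALLOWED_AGES = (0, 6, 12, 16, 18)  # the rating tiers listed in the abstract


def rate_script(findings):
    """Return (rating, reasons) where rating is the strictest required tier.

    findings is a list of dicts with keys 'segment', 'category', 'min_age'.
    reasons lists the findings that force the final rating.
    """
    rating = 0
    reasons = []
    for finding in findings:
        age = finding["min_age"]
        if age not in ALLOWED_AGES:
            raise ValueError(f"unknown rating tier: {age}")
        if age > rating:
            rating, reasons = age, [finding]   # a stricter tier replaces reasons
        elif age == rating and rating > 0:
            reasons.append(finding)            # same tier: another justification
    return f"{rating}+", reasons


# Hypothetical detector output for three segments of one script.
findings = [
    {"segment": 3, "category": "profanity", "min_age": 12},
    {"segment": 7, "category": "violence", "min_age": 16},
    {"segment": 9, "category": "substances", "min_age": 16},
]
rating, reasons = rate_script(findings)
```

Keeping the findings that set the rating gives an editor the segments to look at, which is one simple form of the "explainable justification" the abstract mentions.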

[15] TrueBrief: Faithful Summarization through Small Language Models

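Kumud Lakara, Ruibo Shi, Fran Silavong

Main category: cs.CL

TL;DR: This paper presents TrueBrief, a framework that uses preference optimization to make small language models more faithful in text summarization. Its core is the generation of synthetic preference data through controlled hallucination injection.

Motivation: Large language models generate high-quality text, but their tendency to hallucinate limits their use in security-critical domains.

Method: Designs TrueBrief, an end-to-end framework with a data generation module that injects controlled hallucinations to produce preference data, and trains small language models with preference optimization.

Result: The study examines how data quality and model size affect preference optimization and identifies the conditions under which the approach is most effective.

Conclusion: TrueBrief improves the faithfulness of small language models in text summarization and offers a workable way to reduce hallucination.

Abstract: Large language models (LLMs) have exhibited remarkable proficiency in generating high-quality text; however, their propensity for producing hallucinations poses a significant challenge for their deployment in security-critical domains. In this work, we present TrueBrief, an end-to-end framework specifically designed to enhance the faithfulness of small LLMs (SLMs) primarily for the task of text summarization through a preference-optimization paradigm. Central to our framework is a data generation module that facilitates controlled hallucination injection to generate synthetic preference data. Our work provides insights into the impact of data quality and model size on preference-based optimization, highlighting the conditions under which these methods are most effective.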

[16] AnimatedLLM: Explaining LLMs with Interactive Visualizations

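Zdeněk Kasner, Ondřej Dušek

Main category: cs.CL

TL;DR: AnimatedLLM is an interactive web application that runs in the browser and provides step-by-step visualizations of a Transformer language model, using pre-computed traces of open LLMs on manually curated inputs, for teaching and self-study.

Motivation: Teaching materials that show how large language models work are scarce even though LLMs are becoming central to NLP education, so an intuitive teaching tool for their inner workings is needed.

Method: Develops AnimatedLLM, an interactive web application that uses pre-computed traces of open LLMs on carefully designed inputs to visualize a Transformer model step by step in the browser.

Result: AnimatedLLM has been built and released as a browser-based visualization tool that shows the internal mechanics of a Transformer language model and is available online for teaching and self-study.

Conclusion: AnimatedLLM fills a gap in teaching resources for large language models and helps learners understand how Transformer models work.

Abstract: Large language models (LLMs) are becoming central to natural language processing education, yet materials showing their mechanics are sparse. We present AnimatedLLM, an interactive web application that provides step-by-step visualizations of a Transformer language model. AnimatedLLM runs entirely in the browser, using pre-computed traces of open LLMs applied to manually curated inputs. The application is available at https://animatedllm.github.io, both as a teaching aid and for self-educational purposes.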

[17] From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning

Xiaoyu Xu, Minxin Du, Zitong Li, Zi Liang, Zhibiao Guo, Shiyu Zhang, Peizhao Hu, Qingqing Ye, Haibo Hu

Main category: cs.CL

TL;DR: This paper presents BiForget, a framework for automatically generating high-quality forget sets. It uses the target model itself to produce data that matches its knowledge distribution, supports both domain-level and instance-level unlearning, and outperforms existing methods on relevance, diversity, and efficiency.

Motivation: Existing machine unlearning benchmarks do not accurately reflect the "forgetting scope" a model has actually learned and do not distinguish between unlearning granularities, which makes evaluation inaccurate.

Method: Proposes the BiForget framework, which uses seed-guided and adversarial prompting of the target LLM itself to generate forget data that matches its internal knowledge distribution, and distinguishes domain-level from instance-level unlearning.

Result: Across several benchmarks, and compared with state-of-the-art methods, BiForget improves relevance by about 20 and diversity by about 0.05 in the Harry Potter domain while halving the amount of data.

Conclusion: BiForget achieves more effective unlearning with a better balance between forgetting and utility preservation, and provides a more rigorous basis for evaluating LLM unlearning.

Abstract: Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the true "forgetting scope" learned by the model. We formalize two distinct unlearning granularities, domain-level and instance-level, and propose BiForget, an automated framework for synthesizing high-quality forget sets. Unlike prior work relying on external generators, BiForget exploits the target model itself to elicit data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that it achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, in the Harry Potter domain, it improves relevance by ${\sim}20$ and diversity by ${\sim}0.05$ while halving the total data size compared to SOTAs. Ultimately, it facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.

[18] RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation

Joseph James, Chenghao Xiao, Yucheng Li, Nafise Sadat Moosavi, Chenghua Lin

Main category: cs.CL

TL;DR: Proposes RIGOURATE, a two-stage multimodal framework that retrieves evidence from a paper and scores how overstated its scientific claims are, supporting more transparent and rigorous scientific communication.

Motivation: Scientific papers often overstate claims beyond what their results support, which undermines scientific rigour.

Method: Builds a dataset of more than 10K claim-evidence sets, annotated with eight large language models, with overstatement scores calibrated against peer-review comments. A fine-tuned reranker retrieves evidence, and a fine-tuned model predicts overstatement scores with justifications.

Result: Compared with strong baselines, RIGOURATE performs better on evidence retrieval and overstatement detection, and its scores are validated through human evaluation.

Conclusion: The work operationalises evidential proportionality and supports clearer, more transparent scientific communication.

Abstract: Scientific rigour tends to be sidelined in favour of bold statements, leading authors to overstate claims beyond what their results support. We present RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper's body and assigns each claim an overstatement score. The framework consists of a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments and validated through human evaluation. It employs a fine-tuned reranker for evidence retrieval and a fine-tuned model to predict overstatement scores with justification. Compared to strong baselines, RIGOURATE enables improved evidence retrieval and overstatement detection. Overall, our work operationalises evidential proportionality and supports clearer, more transparent scientific communication.

[19] Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties

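Akriti Dhasmana, Aarohi Srivastava, David Chiang

Main category: cs.CL

TL;DR: This study reports empirical results on cross-lingual transfer using spontaneous, noisy, code-mixed speech across many Indic dialects and language varieties. Phylogenetic distance is not the only factor that matters, fine-tuning on a small amount of dialectal data can match fine-tuning on much more high-resource data, and a case study on the low-resource language Garhwali examines ASR model performance and bias toward pre-training languages.

Motivation: To examine how well cross-lingual transfer works for low-resource, dialectal, and non-standardized speech, and how ASR systems perform in the face of linguistic diversity.

Method: Runs experiments across many Indic dialects to compare the effect of phylogenetic distance on ASR performance and fine-tunes on small amounts of dialectal data. A case study on Garhwali evaluates several contemporary ASR models and analyzes language bias in transcription errors.

Result: Phylogenetic closeness helps but does not fully explain ASR performance in dialectal settings. Fine-tuning on a small amount of dialectal data is comparable to fine-tuning on large amounts of high-resource language data. The Garhwali case study reveals bias toward pre-training languages in ASR models.

Conclusion: ASR systems for dialects and low-resource languages should put more weight on local data than on phylogenetic relatedness, and bias from pre-training needs further mitigation to improve cross-lingual transfer.

Abstract: We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance is generally improved with reduced phylogenetic distance between languages, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically-related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into challenges faced by ASR systems on dialectal and non-standardized speech.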

[20] Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

Dongqi Liu,Hang Ding,Qiming Feng,Jian Li,Xurong Xie,Zhucun Xue,Chengjie Wang,Jiangning Zhang,Yabiao Wang

Main category: cs.CL

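TL;DR: Disco-RAG is a discourse-aware retrieval-augmented generation framework that builds intra-chunk discourse trees and inter-chunk rhetorical graphs to improve how LLMs synthesize knowledge in knowledge-intensive tasks.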

Details Motivation: Existing RAG methods usually treat retrieved passages in a flat way, ignoring structural and hierarchical information across documents, which limits the model's ability to integrate dispersed knowledge. Method: Proposes the Disco-RAG framework, which builds a discourse tree within each chunk and a rhetorical graph across chunks, then fuses these structures into a planning blueprint that guides generation. Result: Achieves state-of-the-art performance on question answering and long-document summarization without fine-tuning. Conclusion: Injecting discourse structure effectively improves knowledge synthesis and reasoning in RAG systems, underscoring the value of structural information in the generation process. Abstract: Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.

[21] MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking

Iago Alves Brito,Walcy Santos Rezende Rios,Julia Soares Dollis,Diogo Fernandes Costa Silva,Arlindo Rodrigues Galvão Filho

Main category: cs.CL

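TL;DR: MiJaBench exposes systematic bias in current LLM safety alignment: safety is not universal but follows a hierarchy across demographic groups, and model scaling widens the disparity.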

Details Motivation: Existing LLM safety evaluations merge risks to different groups into a single metric, which hides systematic vulnerabilities against specific minority groups and leaves no room for fine-grained fairness analysis. Method: Builds MiJaBench, a bilingual adversarial benchmark of 44,000 prompts targeting 16 minority groups, and generates 528,000 responses from 12 state-of-the-art LLMs to form the MiJaBench-Align dataset for analyzing differences in defense rates. Result: Defense rates within the same model differ by up to 33% depending on the target group, and the gap grows with model size, indicating that current alignment techniques do not realize a principle of non-discrimination but instead entrench refusal patterns for specific groups. Conclusion: Safety alignment is not a general capability but a demographic hierarchy; current alignment methods and safety scaling laws need to be re-examined to achieve genuinely fair model safety. Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating "Identity Hate" into scalar scores that mask systemic vulnerabilities against specific populations. To expose this selective safety, we introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate MiJaBench-Align, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates fluctuate by up to 33% within the same model solely based on the target group. Crucially, we demonstrate that model scaling exacerbates these disparities, suggesting that current alignment techniques do not create a principle of non-discrimination but instead reinforce memorized refusal boundaries only for specific groups, challenging the current scaling laws of security. We release all datasets and scripts to encourage research into granular demographic alignment at GitHub.
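
The per-group disparity described above can be illustrated with a minimal Python sketch. The record layout, the group names, and the helper functions are hypothetical and are not taken from MiJaBench; the sketch only assumes that each response has already been labeled as defended (the model refused or deflected the attack) or not.

```python
from collections import defaultdict


def defense_rates(records):
    """Map each target group to the fraction of prompts the model defended against.

    `records` is an iterable of (group, was_defended) pairs (hypothetical layout).
    """
    totals = defaultdict(int)
    defended = defaultdict(int)
    for group, was_defended in records:
        totals[group] += 1
        defended[group] += int(was_defended)
    return {group: defended[group] / totals[group] for group in totals}


def defense_rate_gap(records):
    """Largest minus smallest per-group defense rate for one model."""
    rates = defense_rates(records)
    return max(rates.values()) - min(rates.values())


# Toy data: group_a is defended 9 times out of 10, group_b only 6 out of 10.
toy_records = (
    [("group_a", True)] * 9 + [("group_a", False)] * 1
    + [("group_b", True)] * 6 + [("group_b", False)] * 4
)
toy_rates = defense_rates(toy_records)
toy_gap = defense_rate_gap(toy_records)
```

A single aggregate score over `toy_records` would report 75% defended and hide the 30-point difference between the two groups, which is the masking effect the abstract describes.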

[22] ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models

Sharanya Dasgupta,Arkaprabha Basu,Sujoy Nath,Swagatam Das

Main category: cs.CL

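TL;DR: The paper proposes ARREST, a unified framework in which an external network intervenes in an LLM's latent activation space to improve factuality and safety together, without fine-tuning the model's parameters.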

Details Motivation: LLMs have shortcomings in both factuality and safety; existing methods usually treat the two separately and lack a self-correction mechanism resembling human cognition. Method: Proposes the ARREST framework, in which an external network monitors and regulates representational drift in the model's latent space and uses adversarial training to jointly correct truthfulness and safety, supporting soft refusals, hard refusals, and factual corrections. Result: Experiments show that ARREST effectively corrects representational drift, is more versatile than RLHF-aligned models at generating soft refusals, and improves the model's factuality and safety. Conclusion: Factuality and safety failures can both be attributed to representational misalignment in the latent space; ARREST offers a unified intervention mechanism that requires no fine-tuning and a new direction for model alignment. Abstract: Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance in a wide range of tasks. However, they still lack human cognition to balance factuality and safety. Bearing the resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space, rather than addressing those as entirely separate alignment issues. We hypothesize that an external network, trained to understand the fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine-tuning the model parameters themselves. Reflecting the hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but is also more versatile compared to the RLHF-aligned models in generating soft refusals due to adversarial training. We make our codebase available at https://github.com/sharanya-dasgupta001/ARREST.

[23] Interpreting Transformers Through Attention Head Intervention

Mason Kadem,Rong Zheng

Main category: cs.CL

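TL;DR: The paper discusses mechanistic interpretability of neural networks, arguing that understanding their decision-making matters for accountability and control in high-stakes domains, for studying digital brains and the emergence of cognition, and for discovering new knowledge when AI surpasses humans.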

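Details Motivation: Neural networks keep growing more capable, yet their internal mechanisms remain poorly understood, which motivates research into the mechanistic interpretability of their decisions. Method: Analyzes the internal workings of neural networks to explore routes to mechanistic interpretability and its scientific and practical value. Result: Identifies three benefits of mechanistic interpretability: accountability and control in high-stakes settings, progress in studying digital brains and the emergence of cognition, and discovery of new knowledge from AI with superhuman performance. Conclusion: Mechanistic interpretability is key to keeping AI safe and controllable, and is also an important route to understanding the nature of intelligence and making new scientific discoveries. Abstract: Neural networks are growing more capable on their own, but we do not understand their neural mechanisms. Understanding these mechanisms' decision-making processes, or mechanistic interpretability, enables (1) accountability and control in high-stakes domains, (2) the study of digital brains and the emergence of cognition, and (3) discovery of new knowledge when AI systems outperform humans.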

Yao Dou,Wei Xu

Main category: cs.CL

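TL;DR: The paper proposes Gavel-Ref, an evaluation framework built on a multi-value checklist for assessing LLMs on long-context legal case summarization, and Gavel-Agent, which improves efficiency and autonomy.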

Details Motivation: Current LLMs support very long contexts, but their effectiveness on complex long-context tasks such as multi-document legal case summarization remains unclear, and a single aggregate score is not enough to assess model performance fully. Method: Proposes the Gavel-Ref evaluation framework, comprising a 26-item multi-value checklist plus residual-fact and writing-style evaluations; systematically evaluates 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens; and designs the Gavel-Agent scaffold, which integrates six tools for efficient, autonomous information extraction. Result: The strongest model, Gemini 2.5 Pro, scores only about 50 on $S_{\text{Gavel-Ref}}$; models do well on simple items but poorly on multi-value or rare items such as settlements and monitor reports; Gavel-Agent with Qwen3 uses 36% fewer tokens than end-to-end extraction with GPT-4.1, at the cost of only a 7% drop in $S_{\text{checklist}}$. Conclusion: Current LLMs still have significant limitations on complex tasks over extremely long text and call for finer-grained evaluation; Gavel-Agent offers a new direction for efficient, scalable long-context processing. Abstract: Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. In this paper, we study multi-document legal case summarization, where a single case often spans many documents totaling 100K-500K tokens. We introduce Gavel-Ref, a reference-based evaluation framework with multi-value checklist evaluation over 26 items, as well as residual fact and writing-style evaluations. Using Gavel-Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves only around 50 of $S_{\text{Gavel-Ref}}$, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi-value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human-written summaries -- making human references less reliable -- we develop Gavel-Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate and extract checklists directly from case documents. With Qwen3, Gavel-Agent reduces token usage by 36% while resulting in only a 7% drop in $S_{\text{checklist}}$ compared to end-to-end extraction with GPT-4.1.

[25] Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs

Myra Cheng,Robert D. Hawkins,Dan Jurafsky

Main category: cs.CL

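TL;DR: The paper examines why large language models (LLMs) fail to challenge users' harmful beliefs, arguing that models over-accommodate users' assumptions and lack epistemic vigilance. Social and linguistic factors that shape accommodation in human conversation also affect LLMs, and simple pragmatic interventions (such as adding "wait a minute") significantly improve their ability to challenge false beliefs while keeping false-positive rates low.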

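Details Motivation: LLMs often fail to challenge harmful beliefs (such as medical misinformation or social bias) because they accommodate users' assumptions, which creates safety risks. Existing evaluations overlook pragmatic factors and so cannot fully explain the roots of these failures. Method: Analyzes LLM behavior on three safety benchmarks (Cancer-Myth, SAGE-Eval, ELEPHANT), examines how the pragmatic factors of at-issueness, linguistic encoding, and source reliability affect model responses, and tests whether simple pragmatic prompts such as adding "wait a minute" improve the model's ability to challenge harmful beliefs. Result: Pragmatic factors known from human communication do affect accommodation in LLMs; interventions such as adding "wait a minute" significantly improve performance across the benchmarks while keeping false-positive rates low, showing that pragmatic strategies are effective. Conclusion: A pragmatic perspective should be built into the evaluation of LLM behavior and into safety improvements; simple pragmatic interventions can strengthen a model's epistemic vigilance and reduce blind accommodation of harmful beliefs. Abstract: Large language models (LLMs) frequently fail to challenge users' harmful beliefs in domains ranging from medical advice to social reasoning. We argue that these failures can be understood and addressed pragmatically as consequences of LLMs defaulting to accommodating users' assumptions and exhibiting insufficient epistemic vigilance. We show that social and linguistic factors known to influence accommodation in humans (at-issueness, linguistic encoding, and source reliability) similarly affect accommodation in LLMs, explaining performance differences across three safety benchmarks that test models' ability to challenge harmful beliefs, spanning misinformation (Cancer-Myth, SAGE-Eval) and sycophancy (ELEPHANT). We further show that simple pragmatic interventions, such as adding the phrase "wait a minute", significantly improve performance on these benchmarks while preserving low false-positive rates. Our results highlight the importance of considering pragmatics for evaluating LLM behavior and improving LLM safety.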

[26] Learning to Simulate Human Dialogue

Kanishk Gandhi,Agam Bhatia,Noah D. Goodman

Main category: cs.CL

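TL;DR: The study models how people think through next-turn dialogue prediction and compares learning approaches, finding that directly maximizing the log-probability of real human dialogue works better than rewards from an LLM judge, with the best results when chain-of-thought is introduced.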

Details Motivation: To understand and predict the thinking behind human dialogue and improve models' ability to simulate human behavior. Method: Compares learning approaches along two dimensions: whether the model may "think" before responding, and whether learning is rewarded by an LLM judge or by directly maximizing the log-probability of the true response; chain-of-thought is treated as a latent variable, and a lower bound on the log-probability is derived and optimized. Result: Judge-based rewards raise judge scores but lower the probability of the true response and the human preference win rate; directly maximizing the log-probability of true responses performs better across evaluations, and the variant that treats chain-of-thought as a latent variable performs best. Conclusion: "Thinking" helps only when the training objective is to match the distribution of real human responses; scaling this approach to broader conversational data may improve models' understanding of human behavior. Abstract: To predict what someone will say is to model how they think. We study this through next-turn dialogue prediction: given a conversation, predict the next utterance produced by a person. We compare learning approaches along two dimensions: (1) whether the model is allowed to think before responding, and (2) how learning is rewarded either through an LLM-as-a-judge that scores semantic similarity and information completeness relative to the ground-truth response, or by directly maximizing the log-probability of the true human dialogue. We find that optimizing for judge-based rewards indeed increases judge scores throughout training, however it decreases the likelihood assigned to ground truth human responses and decreases the win rate when human judges choose the most human-like response among a real and synthetic option. This failure is amplified when the model is allowed to think before answering. In contrast, by directly maximizing the log-probability of observed human responses, the model learns to better predict what people actually say, improving on both log-probability and win rate evaluations. Treating chain-of-thought as a latent variable, we derive a lower bound on the log-probability. Optimizing this objective yields the best results on all our evaluations. These results suggest that thinking helps primarily when trained with a distribution-matching objective grounded in real human dialogue, and that scaling this approach to broader conversational data may produce models with a more nuanced understanding of human behavior.
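
One standard way to obtain a lower bound of the kind mentioned above is Jensen's inequality; the derivation below is a sketch under stated assumptions and may differ from the bound the authors actually use. The notation is assumed here, not taken from the paper: $x$ is the conversation so far, $y$ is the observed human response, $z$ is a chain-of-thought, $p_\theta$ is the model, and $q$ is any proposal distribution over thoughts whose support covers that of $p_\theta(z \mid x)$.

```latex
\begin{aligned}
\log p_\theta(y \mid x)
  &= \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z) \\
  &= \log \mathbb{E}_{z \sim q(\cdot \mid x, y)}
     \left[ \frac{p_\theta(z \mid x)\, p_\theta(y \mid x, z)}{q(z \mid x, y)} \right] \\
  &\ge \mathbb{E}_{z \sim q(\cdot \mid x, y)}
     \left[ \log p_\theta(y \mid x, z) + \log p_\theta(z \mid x) - \log q(z \mid x, y) \right].
\end{aligned}
```

With the simplest choice $q(z \mid x, y) = p_\theta(z \mid x)$, the last two terms cancel and the bound becomes $\mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\left[\log p_\theta(y \mid x, z)\right]$: sample a thought from the model, then score how likely the real human response is given that thought.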

[27] Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

San Kim,Gary Geunbae Lee

Main category: cs.CL

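TL;DR: The paper proposes MB-Defense, a defense framework that uses a "merge and break" strategy to protect instruction-tuned LLMs against a range of backdoor attacks while preserving their instruction-following ability.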

Details Motivation: Instruction tuning of LLMs relies on large-scale human or web data that is easy to poison with backdoors, and defenses for this setting are underexplored, so an effective defense is needed to improve model robustness. Method: MB-Defense has two stages: (1) defensive poisoning, which merges attacker triggers and defensive triggers into a unified backdoor representation; and (2) weight recovery, which breaks that representation through additional training to restore the model's clean behavior. Result: Experiments on multiple LLMs show that MB-Defense substantially lowers backdoor attack success rates while preserving instruction-following performance, and that it generalizes to unseen attack types. Conclusion: MB-Defense is a general, data-efficient defense that strengthens the robustness of instruction-tuned LLMs against diverse backdoor attacks. Abstract: Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets, often collected from human or web sources, makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging & Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) defensive poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) weight recovery, which breaks this representation through additional training to restore clean behavior. Extensive experiments across multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Our method offers a generalizable and data-efficient defense strategy, improving the robustness of instruction-tuned LLMs against unseen backdoor attacks.

[28] Users Mispredict Their Own Preferences for AI Writing Assistance

Vivian Lai,Zana Buçinca,Nil-Jana Akpinar,Mo Houtti,Hyeonsu B. Kang,Kevin Chian,Namjoon Suh,Alex C. Williams

Main category: cs.CL

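TL;DR: Users' actual behavior with AI writing assistants differs markedly from their self-reported preferences. Compositional effort is the main driver of the decision to use assistance, while urgency, which users rate as most important, has the least effect; designing systems from self-reports degrades performance.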

Details Motivation: To understand what actually drives user preferences for proactive AI writing assistance and to resolve the mismatch between self-reported preferences and observed behavior. Method: A factorial vignette study with 50 participants and 750 pairwise comparisons, analyzing users' choices across scenarios and comparing them with their self-reported preferences. Result: Compositional effort is the main behavioral driver ($\rho = 0.597$), whereas urgency has no significant predictive power ($\rho \approx 0$); users show a clear perception-behavior gap, ranking urgency first in self-reports while behaving the opposite way; systems built on self-reported preferences reach only 57.7% accuracy, below systems built on behavioral patterns (61.3%, $p < 0.05$). Conclusion: Relying on user introspection to design proactive NLG systems misdirects optimization; system design should lean more on observed behavioral data. Abstract: Proactive AI writing assistants need to predict when users want drafting help, yet we lack empirical understanding of what drives preferences. Through a factorial vignette study with 50 participants making 750 pairwise comparisons, we find compositional effort dominates decisions ($\rho = 0.597$) while urgency shows no predictive power ($\rho \approx 0$). More critically, users exhibit a striking perception-behavior gap: they rank urgency first in self-reports despite it being the weakest behavioral driver, representing a complete preference inversion. This misalignment has measurable consequences. Systems designed from users' stated preferences achieve only 57.7% accuracy, underperforming even naive baselines, while systems using behavioral patterns reach significantly higher 61.3% ($p < 0.05$). These findings demonstrate that relying on user introspection for system design actively misleads optimization, with direct implications for proactive natural language generation (NLG) systems.

[29] Beyond Static Summarization: Proactive Memory Extraction for LLM Agents

Chengyuan Yang,Zequn Sun,Wei Wei,Wei Hu

Main category: cs.CL

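TL;DR: The paper proposes proactive memory extraction (ProMem), which improves memory management for LLM agents through an iterative self-questioning feedback loop, overcoming the "ahead-of-time" and "one-off" limitations of conventional summarization and significantly improving memory completeness and QA accuracy.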

Details Motivation: Existing summary-based memory extraction is performed ahead of time and only once, so it cannot adapt what it extracts to future tasks, and the lack of a feedback mechanism lets information loss accumulate. Method: Proposes ProMem, which treats memory extraction as an iterative cognitive process and introduces a self-questioning feedback loop that lets the agent actively probe the dialogue history, recover missing information, and correct errors. Result: ProMem significantly improves the completeness of the extracted memory and QA accuracy, and achieves a better trade-off between extraction quality and token cost. Conclusion: By adding a proactive extraction mechanism with a feedback loop, ProMem outperforms conventional feed-forward summarization and offers a more efficient and reliable approach to memory management for LLM agents. Abstract: Memory management is vital for LLM agents to handle long-term interaction and personalization. Most research focuses on how to organize and use memory summary, but often overlooks the initial memory extraction stage. In this paper, we argue that existing summary-based methods have two major limitations based on the recurrent processing theory. First, summarization is "ahead-of-time", acting as a blind "feed-forward" process that misses important details because it doesn't know future tasks. Second, extraction is usually "one-off", lacking a feedback loop to verify facts, which leads to the accumulation of information loss. To address these issues, we propose proactive memory extraction (namely ProMem). Unlike static summarization, ProMem treats extraction as an iterative cognitive process. We introduce a recurrent feedback loop where the agent uses self-questioning to actively probe the dialogue history. This mechanism allows the agent to recover missing information and correct errors. Our ProMem significantly improves the completeness of the extracted memory and QA accuracy. It also achieves a superior trade-off between extraction quality and token cost.

[30] Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions

Ignacio Sastre,Aiala Rosá

Main category: cs.CL

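TL;DR: Proposes Concept Tokens, a method that learns the embedding of a special token from definitions in order to control the behavior of a frozen LLM, reducing hallucinations and improving instruction following.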

Details Motivation: To steer the behavior of a frozen pretrained language model using natural language definitions, without fine-tuning the whole model, in order to reduce hallucinations in closed-book question answering and improve compliance with complex instructions such as teaching strategies. Method: Adds a new special token and optimizes only its embedding vector, training on multiple natural language definitions with the standard language-modeling objective while the LLM stays frozen. Result: On HotpotQA the effect on hallucination is directional: negating the hallucination token reduces hallucinations and increases abstentions, while asserting it increases hallucinations. A similar effect appears for recasting, a feedback strategy in language teaching, and concept tokens preserve compliance with other instructions better than in-context definitions. A qualitative study shows that the method captures conceptual information but has limitations. Conclusion: Concept Tokens are a lightweight, effective control signal that can be learned from definitions to steer a frozen LLM, with potential for reducing hallucinations and improving instruction compliance. Abstract: We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional "Austral Tower" to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.

[31] SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers

Iaroslav Chelombitko,Ekaterina Chelombitko,Aleksey Komissarov

Main category: cs.CL

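TL;DR: Proposes SampoNLP, a corpus-free toolkit for building morphological lexicons, uses it to evaluate BPE tokenizers on Uralic languages, and introduces the Integrated Performance Score (IPS) along with recommendations for optimal vocabulary sizes.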

Details Motivation: The lack of high-quality morpheme lexicons makes it hard to evaluate subword tokenization for morphologically rich Uralic languages, especially in low-resource settings. Method: Builds the corpus-free SampoNLP toolkit on MDL-inspired Self-Referential Atomicity Scoring to generate high-purity morphological lexicons automatically, uses these lexicons to evaluate BPE tokenizers systematically across vocabulary sizes, and proposes the Integrated Performance Score (IPS) to balance morpheme coverage against over-splitting. Result: Produces high-quality morphological lexicons for Finnish, Hungarian, and Estonian, shows the limitations of standard BPE for highly agglutinative languages, identifies the "elbow points" of diminishing returns on the IPS curves, and gives the first empirically grounded recommendations for optimal vocabulary sizes in these languages. Conclusion: SampoNLP effectively supports morphological analysis for low-resource languages, and the findings provide empirical grounding and practical guidance for choosing tokenizers for Uralic languages. Abstract: The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues, making it suited to low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP

[32] WESR: Scaling and Evaluating Word-level Event-Speech Recognition

Chenchen Yang,Kexin Huang,Liwei Fan,Qian Tu,Botian Jiang,Dong Zhang,Linqi Yin,Shimin Li,Zhaoye Fei,Qinyuan Cheng,Xipeng Qiu

Main category: cs.CL

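TL;DR: The paper proposes a new taxonomy of 21 non-verbal vocal events and builds the WESR-Bench evaluation set and a large-scale training corpus to enable precise localization of both discrete and continuous vocal events.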

Details Motivation: Existing methods fall short on category coverage, temporal granularity, and evaluation standards for non-verbal vocal events, and the lack of a unified evaluation framework holds back downstream applications. Method: Proposes a refined 21-class taxonomy that separates discrete from continuous vocal events; builds the expert-annotated WESR-Bench evaluation set (900+ utterances) with a position-aware protocol that disentangles ASR errors from event detection performance; and constructs a 1,700+ hour corpus on which specialized models are trained. Result: The proposed models outperform open-source audio-language models and commercial APIs at localizing non-verbal events while preserving ASR quality. Conclusion: WESR and its evaluation protocol provide a foundational resource and a standardized benchmark for modeling rich, real-world auditory scenes. Abstract: Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. We also build a strong baseline by constructing a 1,700+ hour corpus, and train specialized models, surpassing both open-source audio-language models and commercial APIs while preserving ASR quality. We anticipate that WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes.

[33] LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation

Yuxiao Ye,Yiming Zhang,Yiran Ma,Huiyuan Xie,Huining Zhu,Zhiyuan Liu

Main category: cs.CL

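TL;DR: The paper proposes LinguaGame, a multi-agent dialogue generation framework grounded in linguistics and game theory that models communicative intents and strategies to improve communication efficiency between LLM agents.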

Details Motivation: Existing LLM-based multi-agent systems mostly focus on architecture design and neglect the interaction process itself; this paper aims to help agents convey their intended meaning more effectively in natural language interaction. Method: Proposes LinguaGame, which models dialogue as a signalling game over communicative intents and strategies and uses a training-free equilibrium approximation algorithm to adjust decisions at inference time, relying on linguistically informed reasoning rather than task-specific design. Result: Evaluated on simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency. Conclusion: LinguaGame's linguistically guided game-theoretic framework improves communication efficiency in multi-agent systems with little task-specific coupling, making it suitable for complex dialogue scenarios. Abstract: Large Language Models (LLMs) have enabled Multi-Agent Systems (MASs) where agents interact through natural language to solve complex tasks or simulate multi-party dialogues. Recent work on LLM-based MASs has mainly focused on architecture design, such as role assignment and workflow orchestration. In contrast, this paper targets the interaction process itself, aiming to improve agents' communication efficiency by helping them convey their intended meaning more effectively through language. To this end, we propose LinguaGame, a linguistically-grounded game-theoretic paradigm for multi-agent dialogue generation. Our approach models dialogue as a signalling game over communicative intents and strategies, solved with a training-free equilibrium approximation algorithm for inference-time decision adjustment. Unlike prior game-theoretic MASs, whose game designs are often tightly coupled with task-specific objectives, our framework relies on linguistically informed reasoning with minimal task-specific coupling. Specifically, it treats dialogue as intentional and strategic communication, requiring agents to infer what others aim to achieve (intents) and how they pursue those goals (strategies). We evaluate our framework in simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency.

[34] GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence

Yibo Zhao,Jiapeng Zhu,Zichen Ding,Xiang Li

Main category: cs.CL

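TL;DR: The paper proposes GRACE, a reinforcement-learning framework for retrieval-augmented generation that uses a multi-stage gated reward to address hallucination and ungrounded answers together when evidence is insufficient, significantly improving answer accuracy and abstention while requiring only 10% of the annotation cost of prior methods.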

Details Motivation: Existing RAG systems have two key flaws: giving correct answers without explicit supporting evidence (ungrounded answers), and fabricating content when the retrieved context is insufficient (hallucination). Prior work has mostly handled these problems separately and lacks a unified framework, so a method is needed that achieves both evidence grounding and reliable abstention. Method: Proposes the GRACE framework, which uses heterogeneous retrievers to generate diverse training samples without manual annotation, and designs a multi-stage gated reward function that, through reinforcement learning, trains the model to judge evidence sufficiency, extract key supporting evidence, and then either answer or explicitly abstain. Result: Experiments on two benchmarks show that GRACE achieves state-of-the-art overall accuracy and a favorable balance between accurate answering and appropriate abstention, at only 10% of the annotation cost of prior methods. Conclusion: GRACE offers a unified, efficient, and low-cost solution for RAG systems that mitigates both ungrounded answers and hallucination, supporting safer and more reliable use of language models in real-world settings. Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge to enhance Large Language Models (LLMs), yet systems remain susceptible to two critical flaws: providing correct answers without explicit grounded evidence and producing fabricated responses when the retrieved context is insufficient. While prior research has addressed these issues independently, a unified framework that integrates evidence-based grounding and reliable abstention is currently lacking. In this paper, we propose GRACE, a reinforcement-learning framework that simultaneously mitigates both types of flaws. GRACE employs a data construction method that utilizes heterogeneous retrievers to generate diverse training samples without manual annotation. A multi-stage gated reward function is then employed to train the model to assess evidence sufficiency, extract key supporting evidence, and provide answers or explicitly abstain. Experimental results on two benchmarks demonstrate that GRACE achieves state-of-the-art overall accuracy and strikes a favorable balance between accurate response and rejection, while requiring only 10% of the annotation costs of prior methods. Our code is available at https://github.com/YiboZhao624/Grace.
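
To make the idea of a multi-stage gated reward concrete, here is a minimal Python sketch. It is not GRACE's actual reward: the stage order follows the abstract (sufficiency judgment, then evidence extraction, then answer or abstention), but the function name, the reward values 0.3 / 0.6 / 1.0, the `evidence_f1` input, and the 0.5 threshold are all assumptions made for illustration.

```python
def gated_reward(pred_sufficient, gold_sufficient, evidence_f1=0.0,
                 answer_correct=False, abstained=False,
                 evidence_threshold=0.5):
    """Hypothetical gated reward: a later stage only pays out if earlier stages pass."""
    # Stage 1 gate: the model must judge evidence sufficiency correctly.
    if pred_sufficient != gold_sufficient:
        return 0.0

    # Insufficient context: the only rewarded behavior is explicit abstention.
    if not gold_sufficient:
        return 1.0 if abstained else 0.0

    # Stage 2 gate: the extracted evidence must overlap enough with the gold evidence.
    if evidence_f1 < evidence_threshold:
        return 0.3  # partial credit for passing stage 1 only (assumed value)

    # Stage 3: a grounded, correct, non-abstaining answer earns the full reward.
    if answer_correct and not abstained:
        return 1.0
    return 0.6  # evidence was right but the final answer was not (assumed value)
```

Under this sketch a correct answer with poor evidence (`gated_reward(True, True, evidence_f1=0.2, answer_correct=True)`) earns only the stage-1 credit, which is one way to discourage answers that are right but not grounded.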

[35] BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation

Amit Bin Tariqul,A N M Zahid Hossain Milkan,Sahab-Al-Chowdhury,Syed Rifat Raiyan,Hasan Mahmud,Md Kamrul Hasan

Main category: cs.CL

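TL;DR: The paper presents the first systematic evaluation of the robustness of existing text watermarking methods in a low-resource language (Bangla), finds that conventional token-level watermarks fail under cross-lingual round-trip translation attacks, and proposes a layered strategy combining embedding-time and post-generation watermarks that significantly improves detection accuracy.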

Details Motivation: Existing text watermarking methods perform well in high-resource languages, but their robustness to cross-lingual attacks in low-resource languages is underexplored, and effective solutions are needed to meet copyright and security needs in multilingual settings. Method: Systematically evaluates the state-of-the-art watermarking methods KGW, EXP, and Waterfall on Bangla LLM-generated text, and proposes a training-free layered watermarking strategy that combines embedding-time and post-generation watermarks for greater robustness. Result: Under benign conditions, KGW and EXP exceed 88% detection accuracy, but after round-trip translation attacks accuracy falls to 9-13%; the proposed layered strategy raises post-attack detection accuracy to 40-50%, an absolute gain of 25-35% and a 3-4x relative improvement. Conclusion: Layered watermarking is an effective training-free approach that substantially improves the attack resistance of text watermarks in low-resource languages at the cost of controlled semantic degradation, offering a practical path to multilingual watermark deployment. Abstract: As large language models (LLMs) are increasingly deployed for text generation, watermarking has become essential for authorship attribution, intellectual property protection, and misuse detection. While existing watermarking methods perform well in high-resource languages, their robustness in low-resource languages remains underexplored. This work presents the first systematic evaluation of state-of-the-art text watermarking methods: KGW, Exponential Sampling (EXP), and Waterfall, for Bangla LLM text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve high detection accuracy (>88%) with negligible perplexity and ROUGE degradation. However, RTT causes detection accuracy to collapse to 9-13%, indicating a fundamental failure of token-level watermarking. To address this, we propose a layered watermarking strategy that combines embedding-time and post-generation watermarks. Experimental results show that layered watermarking improves post-RTT detection accuracy by 25-35%, achieving 40-50% accuracy, representing a 3$\times$ to 4$\times$ relative improvement over single-layer methods, at the cost of controlled semantic degradation. Our findings quantify the robustness-quality trade-off in multilingual watermarking and establish layered watermarking as a practical, training-free solution for low-resource languages such as Bangla. Our code and data will be made public.
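
The fragility of token-level watermarks is easier to see with a toy detector. The Python sketch below follows the general shape of green-list detection used by KGW-style schemes: each token is "green" or not depending on a keyed hash of the previous token, and detection computes a one-proportion z-score over the green count. It is a simplification, not the paper's implementation: real schemes partition the model vocabulary with a seeded random generator, and the `is_green`, `z_score`, and `make_watermarked` helpers, the key string, and the integer "tokens" are all hypothetical.

```python
import hashlib
import itertools
import math


def is_green(prev_token, token, gamma=0.5, key="demo-key"):
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def z_score(tokens, gamma=0.5, key="demo-key"):
    """z = (green_count - gamma * T) / sqrt(T * gamma * (1 - gamma)) over T scored tokens."""
    scored = len(tokens) - 1  # the first token has no predecessor to seed from
    if scored <= 0:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, gamma, key) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))


def make_watermarked(length, gamma=0.5, key="demo-key"):
    """Toy 'generator' that always emits the smallest green token id at each step."""
    tokens = [0]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(next(c for c in itertools.count() if is_green(prev, c, gamma, key)))
    return tokens


watermarked = make_watermarked(101)      # 100 scored tokens, all green by construction
watermarked_z = z_score(watermarked)     # (100 - 50) / sqrt(25) = 10
```

Because greenness depends on the exact token and its predecessor, any rewrite that replaces most tokens, such as translating the text to another language and back, pushes the green fraction toward `gamma` and the z-score toward zero, which is consistent with the collapse in detection accuracy reported above.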

[36] Identifying Good and Bad Neurons for Task-Level Controllable LLMs

Wenjie Li,Guansong Pang,Hezhe Qiao,Debin Gao,David Lo

Main category: cs.CL

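TL;DR: Proposes the NeuronLLM framework, which applies the principle of functional antagonism to identify neurons in LLMs that facilitate or inhibit a task, using contrastive learning and augmented question sets for more holistic neuron modeling.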

Details Motivation: Existing methods focus only on supportive neurons and struggle with task scenarios that require several abilities working together, while ignoring inhibitive neurons and the misleading effect of fortuitously correct behavior. Method: Adopts functional antagonism, using contrastive learning to distinguish neurons that facilitate a task from neurons that inhibit it, and uses augmented question sets to reduce the influence of answers the model gets right by chance. Result: NeuronLLM outperforms existing methods on four NLP tasks across LLMs of different sizes and architectures. Conclusion: NeuronLLM gives a more complete picture of LLM neurons and shows that task performance is jointly determined by neurons with positive and negative roles. Abstract: Large Language Models have demonstrated remarkable capabilities on multiple-choice question answering benchmarks, but the complex mechanisms underlying their large-scale neurons remain opaque, posing significant challenges for understanding and steering LLMs. While recent studies made progress on identifying responsible neurons for certain abilities, these ability-specific methods are infeasible for task-focused scenarios requiring coordinated use of multiple abilities. Moreover, these approaches focus only on supportive neurons that correlate positively with task completion, while neglecting neurons with other roles, such as inhibitive roles, and misled neuron attribution due to fortuitous behaviors in LLMs (i.e., correctly answering the questions by chance rather than through genuine understanding). To address these challenges, we propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification. The key insight is that task performance is jointly determined by neurons with two opposing roles: good neurons that facilitate task completion and bad neurons that inhibit it. NeuronLLM achieves a holistic modeling of neurons via contrastive learning of good and bad neurons, while leveraging augmented question sets to mitigate the fortuitous behaviors in LLMs. Comprehensive experiments on LLMs of different sizes and families show the superiority of NeuronLLM over existing methods in four NLP tasks, providing new insights into LLM functional organization.

[37] FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

Seongyeub Chu,Jongwoo Kim,Munyong Yi

Main category: cs.CL

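TL;DR: The paper proposes FeedEval, an LLM-based framework for evaluating the quality of LLM-generated essay feedback along three pedagogically grounded dimensions (specificity, helpfulness, and validity), and shows experimentally that it benefits automated essay scoring and guided revision.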

Details Motivation: Existing automated essay scoring research often trains models on LLM-generated feedback whose quality has not been validated, which propagates noise and hurts downstream tasks, so a reliable way to select high-quality feedback is needed. Method: Proposes the FeedEval framework, which builds specialized LLM evaluators for specificity, helpfulness, and validity, trains them on datasets curated in this study, and uses them to assess multiple feedback candidates and select high-quality feedback. Result: Experiments on the ASAP++ benchmark show that FeedEval closely matches human expert judgments; scoring models trained on the high-quality feedback it selects perform better; and revision experiments with small LLMs show that this feedback leads to more effective essay improvement. Conclusion: FeedEval reliably identifies high-quality LLM-generated feedback, improves automated essay scoring models, and supports more effective writing revision. Abstract: Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We will release our code and curated datasets upon acceptance.

[38] Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization

Mizanur Rahman,Mohammed Saidul Islam,Md Tahmid Rahman Laskar,Shafiq Joty,Enamul Hoque

Main category: cs.CL

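TL;DR: This paper proposes RL-Text2Vis, the first reinforcement learning framework for text-to-visualization generation. Built on GRPO with a multi-objective reward, it uses post-execution feedback to jointly optimize textual accuracy, code validity, and visualization quality, and it significantly outperforms existing methods on several benchmarks.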

Details Motivation: Existing LLM-based Text2Vis systems produce visualizations with poor semantic alignment and low chart quality, and open-source models perform even worse; conventional supervised fine-tuning cannot use post-execution feedback to improve overall visualization quality, so a method that jointly optimizes several metrics is needed. Method: Proposes the RL-Text2Vis framework, built on Group Relative Policy Optimization (GRPO), with a multi-objective reward covering textual accuracy, code validity, and visualization quality; Qwen2.5-7B and 14B models are trained with reinforcement learning using post-execution feedback. Result: On the Text2Vis benchmark, chart quality improves by 22% relative to GPT-4o and code execution success rises from 78% to 97% relative to the zero-shot baseline; the models also generalize well to out-of-domain datasets such as VIS-Eval and NVBench. Conclusion: GRPO is an effective strategy for structured, multimodal reasoning, and RL-Text2Vis, by introducing multi-objective reinforcement learning with post-execution feedback, significantly improves the quality and robustness of text-to-visualization generation. Abstract: Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at https://github.com/vis-nlp/RL-Text2Vis.
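
A minimal sketch of how a multi-objective reward might feed GRPO-style group-relative advantages, included only to illustrate the idea in the abstract. The weighted-sum combination, the weights, the score ranges, and the names `combined_reward` and `group_relative_advantages` are assumptions for illustration; the abstract does not say how RL-Text2Vis combines its three objectives.

```python
from statistics import mean, pstdev

# Hypothetical weights: the abstract does not state how the objectives are weighted.
WEIGHTS = {"text": 1.0, "code": 1.0, "vis": 1.0}


def combined_reward(scores, weights=WEIGHTS):
    """Weighted sum of per-objective scores, each assumed to lie in [0, 1]."""
    return sum(weights[key] * scores[key] for key in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled outputs for one query, scored after executing the generated code.
group = [
    {"text": 1.0, "code": 1.0, "vis": 0.8},  # correct answer, runs, clear chart
    {"text": 0.0, "code": 1.0, "vis": 0.4},  # runs, but wrong answer and weak chart
    {"text": 0.0, "code": 0.0, "vis": 0.0},  # code failed to execute
]
rewards = [combined_reward(scores) for scores in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group mean, a sample is reinforced only when it beats the other outputs sampled for the same query, which is what removes the need for a separate value model in this family of methods.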

[39] THaLLE-ThaiLLM: Domain-Specialized Small LLMs for Finance and Thai -- Technical Report

KBTG Labs: Anuruth Lertpiya,Danupat Khamnuansin,Kantapong Sucharitpongpan,Pornchanan Balee,Tawunrat Chalothorn,Thadpong Pongthawornkamol,Monchai Lertsutthiwong

Main category: cs.CL

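TL;DR: This study explores model merging as a way to build efficient, multi-capability large language models, targeting Thai-language scenarios and the financial domain. Merging Qwen-8B with ThaiLLM-8B and THaLLE-CFA-8B significantly improves performance on general Thai tasks and specialized financial tasks.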

Details Motivation: Because of privacy, security, and regulatory concerns, financial institutions tend to deploy large models on-premise, but training a multi-capability model is expensive; a resource-efficient alternative is needed to meet multi-task requirements. Method: Uses model merging, combining Qwen-8B first with ThaiLLM-8B and then with both ThaiLLM-8B and THaLLE-CFA-8B, and evaluates the merged models on several benchmarks. Result: Merging Qwen-8B with ThaiLLM-8B yields better Thai capability than the original Qwen-8B on the M3 and M6 O-NET exams; further merging THaLLE-CFA-8B also improves results on financial and general benchmarks such as Flare-CFA and Thai-IC. Conclusion: Model merging is a viable and efficient way to build multi-capability large language models, especially for organizations with limited resources or specific language and industry needs. Abstract: Large Language Models (LLMs) have demonstrated significant potential across various domains, particularly in banking and finance, where they can automate complex tasks and enhance decision-making at scale. Due to privacy, security, and regulatory concerns, organizations often prefer on-premise deployment of LLMs. The ThaiLLM initiative aims to enhance Thai language capabilities in open-LLMs, enabling Thai industry to leverage advanced language models. However, organizations often face a trade-off between deploying multiple specialized models and the prohibitive expense of training a single multi-capability model. To address this, we explore model merging as a resource-efficient alternative for developing high-performance, multi-capability LLMs. We present results from two key experiments. First, merging Qwen-8B with ThaiLLM-8B demonstrates how ThaiLLM-8B enhances general Thai capabilities, showing an uplift on the M3 and M6 O-NET exams over the general instruction-following Qwen-8B. Second, we merge Qwen-8B with both ThaiLLM-8B and THaLLE-CFA-8B. This combination yields further improvements across both general and financial domains, with uplifts on the M3 and M6 O-NET, Flare-CFA, and Thai-IC benchmarks. The report showcases the viability of model merging for efficiently creating multi-capability LLMs.

[40] On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions

Zhiyuan He,Binghan Chen,Tianxiang Xiong,Ziyang Sun,Mozhao Zhu,Xi Chen

Main category: cs.CL

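TL;DR: This paper studies the limitations of Rank-One Model Editing (ROME) on multi-hop reasoning tasks and proposes a redundant editing strategy that improves accuracy on 2-hop questions.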

Details Motivation: Existing knowledge editing methods such as ROME work well for single-hop fact updates but face challenges on multi-hop reasoning tasks that require chaining knowledge. Method: Analyzes the effect of editing knowledge at different layer depths, identifies three main failure modes, and proposes a redundant editing strategy to mitigate the "hopping-too-late" problem and the decline in generalization. Result: Experiments show that the proposed redundant editing improves accuracy on 2-hop questions by at least 15.5 percentage points, a 96% increase over the previous single-edit strategy. Conclusion: Redundant editing is a simple yet effective strategy that substantially improves multi-hop reasoning performance, at the cost of some specificity and language naturalness. Abstract: Recent advances in Knowledge Editing (KE), particularly Rank-One Model Editing (ROME), show superior efficiency over fine-tuning and in-context learning for updating single-hop facts in transformers. However, these methods face significant challenges when applied to multi-hop reasoning tasks requiring knowledge chaining. In this work, we study the effect of editing knowledge with ROME on different layer depths and identify three key failure modes. First, the "hopping-too-late" problem occurs as later layers lack access to necessary intermediate representations. Second, generalization ability deteriorates sharply when editing later layers. Third, the model overfits to edited knowledge, incorrectly prioritizing edited-hop answers regardless of context. To mitigate the issues of "hopping-too-late" and generalization decay, we propose Redundant Editing, a simple yet effective strategy that enhances multi-hop reasoning. Our experiments demonstrate that this approach can improve accuracy on 2-hop questions by at least 15.5 percentage points, representing a 96% increase over the previous single-edit strategy, while trading off some specificity and language naturalness.
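
The two figures in the abstract (a gain of at least 15.5 percentage points and a 96% relative increase) together imply a rough single-edit baseline. The short calculation below treats both numbers as exact and as referring to the same comparison, which is an assumption: the abstract says "at least", so the derived values are only indicative.

```python
# Figures quoted in the abstract, treated here as exact (an assumption).
gain_pp = 15.5            # absolute gain in percentage points
relative_increase = 0.96  # 96% relative increase over the single-edit strategy

# relative_increase = gain_pp / baseline, so baseline = gain_pp / relative_increase
baseline = gain_pp / relative_increase
edited = baseline + gain_pp

print(f"implied single-edit accuracy: {baseline:.1f}%")
print(f"implied redundant-edit accuracy: {edited:.1f}%")
```

Under these assumptions the single-edit accuracy on 2-hop questions would be around 16%, and redundant editing would bring it to around 32%.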

[41] When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation

Rhea Kapur,Robert Hawkins,Elisa Kreiss

Main category: cs.CL

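TL;DR: This paper argues that the specificity of vision-language model descriptions should be decoupled from their length and defined relative to a contrast set. Experiments that control length while varying information content confirm that people prefer more specific descriptions, so evaluation should prioritize specificity over verbosity.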

Details Motivation: In current systems the specificity of visual descriptions is often conflated with their length; the two need to be clearly separated to improve content accessibility. Method: Defines specificity relative to a contrast set, constructs a dataset that controls length while varying information content, and validates the effect of specificity through human preference experiments. Result: With length controlled, people still reliably distinguish and prefer the more specific descriptions, and how the length budget is allocated affects specificity judgments. Conclusion: When assessing the quality of vision-language model descriptions, specificity should be prioritized directly rather than merely controlling description length. Abstract: Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, description specificity is often conflated with description length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.

[42] Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR

Yihong Tang,Kehai Chen,Xuefeng Bai,Benyou Wang,Zeming Liu,Haifeng Wang,Min Zhang

Main category: cs.CL

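TL;DR: Proposes the Character-R1 framework, which uses a Cognitive Focus Reward, a Reference-Guided Reward, and Character-Conditioned Reward Normalization to improve the internal cognitive consistency and performance of role-playing agents.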

Details Motivation: Existing role-playing agents mostly rely on imitating surface behavior and lack internal cognitive consistency, which makes them prone to out-of-character behavior in complex situations. Method: Designs a three-part reward mechanism: a Cognitive Focus Reward (analysis of 10 character elements), a Reference-Guided Reward (overlap-based metrics against reference responses), and Character-Conditioned Reward Normalization (adjusting reward distributions by character category). Result: Experiments show that Character-R1 significantly outperforms existing methods in knowledge, memory, and other aspects. Conclusion: By providing verifiable, fine-grained reward signals, Character-R1 strengthens role-aware reasoning in role-playing agents and stabilizes optimization across roles. Abstract: Current role-playing agents (RPAs) are typically constructed by imitating surface-level behaviors, but this approach lacks internal cognitive consistency, often causing out-of-character errors in complex situations. To address this, we propose Character-R1, a framework designed to provide comprehensive verifiable reward signals for effective role-aware reasoning, which are missing in recent studies. Specifically, our framework comprises three core designs: (1) Cognitive Focus Reward, which enforces explicit label-based analysis of 10 character elements (e.g., worldview) to structure internal cognition; (2) Reference-Guided Reward, which utilizes overlap-based metrics with reference responses as optimization anchors to enhance exploration and performance; and (3) Character-Conditioned Reward Normalization, which adjusts reward distributions based on character categories to ensure robust optimization across heterogeneous roles. Extensive experiments demonstrate that Character-R1 significantly outperforms existing methods in knowledge, memory, and other aspects.

[43] From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset

Haneul Yoo,Won Ik Cho,Geunhye Kim,Jiyoon Han

Main category: cs.CL

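TL;DR: Proposes CuCu, an automated multi-agent framework based on national social studies curricula that generates culture-specific question-answer pairs, and uses it to build KCaQA, a dataset on Korean culture.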

Details Motivation: Addresses the cultural value bias that large language models show in cross-cultural and multilingual settings because of English-centric training data. Method: Uses national social studies curricula as the foundation for culture-aware supervision, and applies the CuCu framework to turn textbook content into open-ended, culture-specific question-answer pairs. Result: Builds the KCaQA dataset of 34.1k QA pairs from the Korean curriculum; analyses indicate that it covers culture-specific topics and that its responses are grounded in local sociocultural contexts. Conclusion: The approach offers a scalable and practical path toward cultural alignment of language models. Abstract: Large language models (LLMs) achieve strong performance on many tasks, but their progress remains uneven across languages and cultures, often reflecting values latent in English-centric training data. To enable practical cultural alignment, we propose a scalable approach that leverages national social studies curricula as a foundation for culture-aware supervision. We introduce CuCu, an automated multi-agent LLM framework that transforms national textbook curricula into open-ended, culture-specific question-answer pairs. Applying CuCu to the Korean national social studies curriculum, we construct KCaQA, comprising 34.1k open-ended QA pairs. Our quantitative and qualitative analyses suggest that KCaQA covers culture-specific topics and produces responses grounded in local sociocultural contexts.

[44] MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark

Anyang Song,Ying Cheng,Yiqian Xu,Rui Feng

Main category: cs.CL

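TL;DR: This paper proposes MAGA, a new method that strengthens the alignment of machine-generated text with human writing in order to improve detector generalization and test detector robustness.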

Details Motivation: Machine-generated text is increasingly hard to distinguish from human writing, which aggravates problems such as disinformation and online fraud; the generalization of fine-tuned detectors is limited by data quality, so more advanced generation alignment methods are needed to improve detectors. Method: Proposes the MAGA framework, which achieves comprehensive alignment from prompt construction to the reasoning process, with Reinforced Learning from Detectors Feedback (RLDF) as a key component that systematically optimizes how well the generated text is aligned. Result: A RoBERTa detector fine-tuned on the MAGA training set improves average AUC by 4.60%, while the MAGA dataset lowers the AUC of the selected detectors by 8.13% on average, indicating that it is a strong attack on existing detectors. Conclusion: MAGA both improves detector generalization and serves to evaluate detector robustness, providing a useful reference for future research on detection. Abstract: Large Language Model (LLM) alignment is constantly evolving. Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse issues such as fake news and online fraud. Fine-tuned detectors' generalization ability is highly dependent on dataset quality, and simply expanding the sources of MGT is insufficient. Further augmentation of the generation process is required. According to HC-Var's theory, enhancing the alignment of generated text can not only facilitate attacks on existing detectors to test their robustness, but also help improve the generalization ability of detectors fine-tuned on it. Therefore, we propose Machine-Augment-Generated Text via Alignment (MAGA). MAGA's pipeline achieves comprehensive alignment from prompt construction to reasoning process, among which Reinforced Learning from Detectors Feedback (RLDF), systematically proposed by us, serves as a key component. In our experiments, the RoBERTa detector fine-tuned on the MAGA training set achieved an average improvement of 4.60% in generalization detection AUC. The MAGA dataset caused an average decrease of 8.13% in the AUC of the selected detectors. We expect these results to inform future research on the generalization ability of detectors.

[45] SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation

Sirry Chen,Jieyi Wang,Wei Chen,Zhongyu Wei

Main category: cs.CL

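TL;DR: Proposes SpeechMedAssist, a medical consultation system based on a speech language model. Two-stage training reduces its dependence on large amounts of medical speech data, and it outperforms baselines on a newly built benchmark.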

Details Motivation: Existing medical consultation systems mostly rely on text interaction, which is unnatural and unfriendly to patients; directly fine-tuning speech language models is limited by the scarcity of medical speech data and by inefficiency. Method: Proposes a two-stage training paradigm: the first stage injects knowledge and capability via text, and the second stage re-aligns modalities with a small amount of speech data, requiring only 10k synthesized speech samples. Result: On single-turn question answering and multi-turn simulated interaction tasks, SpeechMedAssist outperforms baseline models on most metrics, showing stronger effectiveness and robustness. Conclusion: The method markedly reduces the dependence on real medical speech data and advances the practical use of speech language models in medical consultation. Abstract: Medical consultations are intrinsically speech-centric. However, most prior works focus on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more natural speech-based interaction, yet the scarcity of medical speech data and the inefficiency of directly fine-tuning on speech data jointly hinder the adoption of SpeechLMs in medical consultation. In this paper, we propose SpeechMedAssist, a SpeechLM natively capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm consisting of (1) Knowledge & Capability Injection via Text and (2) Modality Re-alignment with Limited Speech Data, thereby reducing the requirement for medical speech data to only 10k synthesized samples. To evaluate SpeechLMs for medical consultation scenarios, we design a benchmark comprising both single-turn question answering and multi-turn simulated interactions. Experimental results show that our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.

[46] CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models

Yifan Le,Yunliang Li

Main category: cs.CL

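TL;DR: This paper proposes CRANE, a relevance-based analysis framework that uses neuron-level interventions to identify the neurons in multilingual LLMs that are functionally critical to a specific language, revealing neuron specialization that is language-selective but not exclusive.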

Details Motivation: Existing methods identify language-related neurons mainly by activation strength, which conflates language preference with functional importance and cannot accurately explain the neural basis of multilingual ability. Method: Proposes the CRANE framework, which uses targeted neuron interventions to measure a neuron's functional necessity for language-conditioned predictions rather than its activation magnitude, identifying language-specific neurons more precisely. Result: Masking the neurons relevant to a given language significantly degrades performance on that language while affecting other languages much less, an asymmetric language selectivity; experiments on several English, Chinese, and Vietnamese benchmarks show that CRANE outperforms activation-based baselines. Conclusion: CRANE isolates functionally critical language-specific neurons more precisely and shows that neurons in multilingual models are specialized in a language-selective but non-exclusive way. Abstract: Multilingual large language models (LLMs) achieve strong performance across languages, yet how language capabilities are organized at the neuron level remains poorly understood. Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance. We propose CRANE, a relevance-based analysis framework that redefines language specificity in terms of functional necessity, identifying language-specific neurons through targeted neuron-level interventions. CRANE characterizes neuron specialization by their contribution to language-conditioned predictions rather than activation magnitude. Our implementation will be made publicly available. Neuron-level interventions reveal a consistent asymmetric pattern: masking neurons relevant to a target language selectively degrades performance on that language while preserving performance on other languages to a substantial extent, indicating language-selective but non-exclusive neuron specializations. Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than activation-based methods.

[47] ToolGate: Contract-Grounded and Verified Tool Execution for LLMs

Yanming Liu,Xinyue Peng,Jiannan Cao,Xinyi Wang,Songhang Deng,Jintao Chen,Jianwei Yin,Xuhong Zhang

Main category: cs.CL

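TL;DR: This paper proposes ToolGate, a forward execution framework that gives large language model (LLM) calls to external tools logical safety guarantees and verifiable state evolution.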

Details Motivation: Existing approaches to LLM tool calling rely on natural language reasoning and lack formal guarantees of logical safety and verifiable results. Method: Introduces an explicit symbolic state space (a typed key-value mapping) that represents trusted world information, formalizes each tool as a Hoare-style contract (precondition and postcondition), and uses runtime verification to control tool invocation and the committing of results. Result: Experiments show that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while remaining competitive on complex multi-step reasoning tasks. Conclusion: ToolGate lays a foundation for building more trustworthy and debuggable systems that integrate LLMs with external tools. Abstract: Large Language Models (LLMs) augmented with external tools have demonstrated remarkable capabilities in complex reasoning tasks. However, existing frameworks rely heavily on natural language reasoning to determine when tools can be invoked and whether their results should be committed, lacking formal guarantees for logical safety and verifiability. We present ToolGate, a forward execution framework that provides logical safety guarantees and verifiable state evolution for LLM tool calling. ToolGate maintains an explicit symbolic state space as a typed key-value mapping representing trusted world information throughout the reasoning process. Each tool is formalized as a Hoare-style contract consisting of a precondition and a postcondition, where the precondition gates tool invocation by checking whether the current state satisfies the required conditions, and the postcondition determines whether the tool's result can be committed to update the state through runtime verification. Our approach guarantees that the symbolic state evolves only through verified tool executions, preventing invalid or hallucinated results from corrupting the world representation. Experimental validation demonstrates that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks. This work establishes a foundation for building more trustworthy and debuggable AI systems that integrate language models with external tools.
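
The contract mechanism described in the abstract can be sketched roughly as follows. This is not ToolGate's implementation: the `Contract` class, the `execute` function, the `commit_key` field, and the temperature-conversion tool are all hypothetical names invented here to show a precondition gating invocation and a postcondition gating the commit.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # stand-in for the typed key-value symbolic state


@dataclass
class Contract:
    """Hypothetical Hoare-style tool contract: {pre} run {post}."""
    name: str
    pre: Callable[[State], bool]        # may the tool be invoked in this state?
    run: Callable[[State], Any]         # the tool call itself
    post: Callable[[State, Any], bool]  # may the result be committed?
    commit_key: str                     # state key the verified result is written to


def execute(state: State, tool: Contract) -> State:
    """Return an updated state only if both contract checks pass."""
    if not tool.pre(state):
        return state                    # gate closed: tool is never invoked
    result = tool.run(state)
    if not tool.post(state, result):
        return state                    # result rejected: state stays unchanged
    new_state = dict(state)
    new_state[tool.commit_key] = result
    return new_state


to_celsius = Contract(
    name="to_celsius",
    pre=lambda s: isinstance(s.get("temp_f"), (int, float)),
    run=lambda s: (s["temp_f"] - 32) * 5 / 9,
    post=lambda s, r: isinstance(r, float) and r >= -273.15,
    commit_key="temp_c",
)

ok = execute({"temp_f": 212}, to_celsius)  # precondition and postcondition hold
blocked = execute({}, to_celsius)          # precondition fails, nothing is committed
```

Returning a fresh dictionary instead of mutating the input keeps every committed state traceable to a verified tool execution, which is the property the abstract emphasizes.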

[48] See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation

Naquee Rizwan,Subhankar Swain,Paramananda Bhaskar,Gagan Aryan,Shehryaar Shah Khan,Animesh Mukherjee

Main category: cs.CL

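TL;DR: This paper proposes a multimodal framework built on generative AI models for detecting, explaining, and intervening on hateful memes when data is limited; it is presented as the first study to handle detection, explanation, and intervention together.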

Details Motivation: Annotated datasets are expensive to build, and existing research usually treats detection, explanation, and intervention separately, which does not reflect real-world needs; a unified and generalizable solution is required. Method: Uses task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to build a unified framework that supports detection, content explanation, and pre-posting intervention. Result: Achieves effective detection, explanation, and intervention for several types of hateful memes under limited data, showing good generalization and potential for real-world deployment. Conclusion: The framework is the first to integrate detection, explanation, and intervention in a single system, offering an efficient and scalable solution for hateful meme moderation in real-world settings. Abstract: In this work, we examine hateful memes from three complementary angles (how to detect them, how to explain their content, and how to intervene before they are posted) by applying a range of strategies built on top of generative AI models. To the best of our knowledge, explanation and intervention have typically been studied separately from detection, which does not reflect real-world conditions. Further, since curating large annotated datasets for meme moderation is prohibitively expensive, we propose a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to cater to different types of memes. We believe this is the first work focused on generalizable hateful meme moderation under limited data conditions, and that it has strong potential for deployment in real-world production scenarios. Warning: Contains potentially toxic content.

[49] Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding

Sungmok Jung,Yeonkyoung So,Joonhak Lee,Sangho Kim,Yelim Ahn,Jaejin Lee

Main category: cs.CL

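TL;DR: This paper introduces Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. It evaluates 47 large language models to analyze the effects of model size and instruction tuning on negation understanding, and finds that fine-tuning on the benchmark improves both negation understanding and contextual comprehension in Korean.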

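Details Motivation: Negation is known to challenge large language models, and benchmarks for evaluating negation understanding, especially in Korean, are scarce, so a systematic benchmark for Korean negation is needed. Method: First conducts a corpus-based analysis of Korean negation, then builds the sentence-level Thunder-KoNUBench benchmark, and finally evaluates 47 large language models, analyzing the effects of model size, instruction tuning, and fine-tuning on the benchmark. Result: LLM performance degrades under negation; Thunder-KoNUBench reflects the real distribution of Korean negation; model size and instruction tuning both affect negation understanding; and fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension. Conclusion: Thunder-KoNUBench is an effective benchmark for Korean negation understanding, and fine-tuning on it improves both the handling of negation and overall contextual comprehension. Abstract: Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.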

[50] PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards

Mukesh Ghimire,Aosong Feng,Liwen You,Youzhi Luo,Fang Liu,Xuan Zhu

Main category: cs.CL

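TL;DR: Proposes the PRISM framework, which combines a process reward model with self-certainty to achieve stable training on unlabeled data and improve LLM reasoning performance.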

Details Motivation: Existing unsupervised training signals based on internal consistency (such as entropy and self-certainty) are unreliable for large-scale, long-term training, so a more stable alternative is needed. Method: Proposes the PRISM framework, which jointly uses a Process Reward Model (PRM) and the model's self-certainty as learning signals to guide LLM training without ground-truth labels. Result: PRISM yields more stable training, better test performance, and better-calibrated internal confidence. Conclusion: Combining an external reward model with internal confidence overcomes the instability of purely internal signals and offers an effective path for unsupervised post-training of large models. Abstract: Current techniques for post-training Large Language Models (LLMs) rely either on costly human supervision or on external verifiers to boost performance on tasks such as mathematical reasoning and code generation. However, as LLMs improve their problem-solving, any further improvement will potentially require high-quality solutions to difficult problems that are not available to humans. As a result, learning from unlabeled data is becoming increasingly attractive in the research community. Existing methods extract a learning signal from a model's consistency, either by majority voting or by converting the model's internal confidence into a reward. Although internal consistency metrics such as entropy or self-certainty require no human intervention, as we show in this work, these are unreliable signals for large-scale and long-term training. To address this unreliability, we propose PRISM, a unified training framework that uses a Process Reward Model (PRM) to guide learning alongside the model's internal confidence in the absence of ground-truth labels. We show that effectively combining a PRM with self-certainty can lead to both stable training and better test-time performance, and also keep the model's internal confidence in check.

[51] Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning

Feihu Jin,Shipeng Cen,Ying Tan

Main category: cs.CL

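TL;DR: This paper proposes a prior-informed zeroth-order optimization method that introduces more directed perturbation vectors to reduce the variance of gradient estimates, significantly improving convergence speed and performance when fine-tuning large models.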

Details Motivation: Conventional zeroth-order optimization relies on random perturbations, which leads to high-variance gradient estimates and slow convergence, making it hard to fine-tune large language models efficiently. Method: Proposes a plug-and-play method that dynamically computes a guiding vector from Gaussian samples to generate more informative perturbation directions, and explores a greedy perturbation strategy to make better use of prior knowledge. Result: Theory shows that the proposed gradient estimator aligns better with the true gradient direction; experiments on LLMs of various scales and architectures show faster convergence and better performance, and on OPT-13B the method beats conventional ZO methods on all 11 tasks and gradient-based methods on 9 of them. Conclusion: The method balances optimization efficiency and accuracy, providing a more efficient way to fine-tune large models without backpropagation. Abstract: Fine-tuning large language models (LLMs) has achieved remarkable success across various NLP tasks, but the substantial memory overhead during backpropagation remains a critical bottleneck, especially as model scales grow. Zeroth-order (ZO) optimization alleviates this issue by estimating gradients through forward passes and Gaussian sampling, avoiding the need for backpropagation. However, conventional ZO methods suffer from high variance in gradient estimation due to their reliance on random perturbations, leading to slow convergence and suboptimal performance. We propose a simple plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation. Our method dynamically computes a guiding vector from Gaussian samples, which directs perturbations toward more informative directions, significantly accelerating convergence compared to standard ZO approaches. We further investigate a greedy perturbation strategy to explore the impact of prior knowledge on gradient estimation. Theoretically, we prove that our gradient estimator achieves stronger alignment with the true gradient direction, enhancing optimization efficiency. Extensive experiments across LLMs of varying scales and architectures demonstrate that our proposed method integrates seamlessly into existing optimization methods, delivering faster convergence and superior performance. Notably, on the OPT-13B model, our method outperforms traditional ZO optimization across all 11 benchmark tasks and surpasses gradient-based baselines on 9 out of 11 tasks, establishing a robust balance between efficiency and accuracy.
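
For context, the conventional estimator that the abstract contrasts itself with can be sketched as a two-point finite difference along random Gaussian directions. This is the standard random-perturbation baseline, not the paper's prior-informed method; the function names, step sizes, sample count, and the toy quadratic loss are assumptions chosen for illustration.

```python
import random


def zo_gradient(loss, theta, rng, eps=1e-3, num_samples=8):
    """Estimate the gradient of `loss` at `theta` using forward evaluations only."""
    grad = [0.0] * len(theta)
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in theta]  # random Gaussian direction
        plus = loss([t + eps * zi for t, zi in zip(theta, z)])
        minus = loss([t - eps * zi for t, zi in zip(theta, z)])
        scale = (plus - minus) / (2 * eps)        # directional derivative along z
        for i, zi in enumerate(z):
            grad[i] += scale * zi / num_samples
    return grad


def quadratic(theta):
    """Toy loss with its minimum at the origin."""
    return sum(t * t for t in theta)


rng = random.Random(0)
theta = [1.0, -2.0, 0.5]
for _ in range(200):
    g = zo_gradient(quadratic, theta, rng)
    theta = [t - 0.05 * gi for t, gi in zip(theta, g)]
```

Each estimate needs only two loss evaluations per direction and no stored activations, which is where the memory saving comes from. The price is the variance introduced by the random directions `z`, the problem the paper's guiding vector is meant to reduce.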

[52] DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs

Anh Thi-Hoang Nguyen,Khanh Quoc Tran,Tin Van Huynh,Phuoc Tan-Hoang Nguyen,Cam Tan Nguyen,Kiet Van Nguyen

Main category: cs.CL

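TL;DR: This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task on hallucination detection for Vietnamese large language models. It presents the ViHallu dataset of 10,000 annotated samples and evaluates a range of detection methods, advancing research on the reliability of Vietnamese AI systems.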

Details Motivation: Low-to-medium resource languages such as Vietnamese lack standardized evaluation frameworks for hallucination detection, which limits the reliability of their LLMs in production. Method: Builds the ViHallu dataset of 10,000 (context, prompt, response) triplets in three categories (no hallucination, intrinsic hallucination, and extrinsic hallucination); designs three prompt types (factual, noisy, and adversarial) to test robustness; and runs a challenge with 111 participating teams to evaluate different detection methods. Result: The best system reaches a macro-F1 of 84.80%, far above the 32.83% baseline; instruction-tuned LLMs combined with structured prompting and ensemble strategies perform best, but detecting intrinsic hallucinations remains difficult. Conclusion: The work establishes a rigorous benchmark for Vietnamese hallucination detection and confirms the effectiveness of advanced methods, while showing that hallucination detection, especially for intrinsic hallucinations, is still an open problem; it provides a basis for future research on trustworthy Vietnamese AI systems. Abstract: The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations -- fluent, plausible-sounding outputs that contradict or fabricate information. While hallucination detection has recently emerged as a priority in English-centric benchmarks, low-to-medium resource languages such as Vietnamese remain inadequately covered by standardized evaluation frameworks. This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese LLMs. We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples systematically partitioned into three hallucination categories: no hallucination, intrinsic, and extrinsic hallucinations. The dataset incorporates three prompt types -- factual, noisy, and adversarial -- to stress-test model robustness. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%, demonstrating that instruction-tuned LLMs with structured prompting and ensemble strategies substantially outperform generic architectures. However, the gap to perfect performance indicates that hallucination detection remains a challenging problem, particularly for intrinsic (contradiction-based) hallucinations. This work establishes a rigorous benchmark and explores a diverse range of detection methodologies, providing a foundation for future research into the trustworthiness and reliability of Vietnamese language AI systems.
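
Since the challenge is scored by macro-F1 over three categories, a small sketch of that metric may help in reading the reported numbers. The label strings and the toy predictions are invented for illustration; the challenge's official scoring script is not shown in the abstract and may differ in details such as label names.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(labels)


# Hypothetical label names for the three categories in the abstract.
LABELS = ["no", "intrinsic", "extrinsic"]
gold = ["no", "no", "intrinsic", "intrinsic", "extrinsic", "extrinsic"]
pred = ["no", "no", "intrinsic", "extrinsic", "extrinsic", "extrinsic"]
score = macro_f1(gold, pred, LABELS)
```

Because every class contributes equally regardless of its frequency, weak performance on one category (the abstract singles out intrinsic hallucinations) pulls the overall score down noticeably.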

[53] Fame Fades, Nature Remains: Disentangling the Character Identity of Role-Playing Agents

Yonghyun Jun,Junhyuk Choi,Jihyeong Park,Hwanhee Lee

Main category: cs.CL

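TL;DR: This paper proposes "Character Identity", a multidimensional construct split into two layers, Parametric Identity and Attributive Identity, and builds a unified character profile schema to study how character identity is structured in LLM-based role-playing agents. It finds that "Fame Fades" (the initial advantage of famous characters disappears as the conversation proceeds) and that "Nature Remains" (personality traits are portrayed robustly regardless of polarity, but the valence of morality and interpersonal relationships strongly affects performance), with negative social natures being the main bottleneck for character fidelity.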

Details Motivation: Existing LLM-based role-playing agents define the structure of character identity only loosely, usually treating a character as arbitrary text input without systematic, layered modeling of identity, which limits character consistency and authenticity; a finer-grained framework is needed to improve the quality and evaluability of role-play. Method: Proposes a two-layer decoupling of character identity into Parametric Identity (from pre-training knowledge) and Attributive Identity (fine-grained behavioral properties such as personality and moral values); designs a unified character profile template, generates famous and synthetic characters under the same structural constraints, and evaluates them in single-turn and multi-turn interactions. Result: (1) "Fame Fades": famous characters have an early advantage from parametric identity, but it weakens quickly as the dialogue progresses; (2) "Nature Remains": models express personality traits stably but are sensitive to the polarity of moral and interpersonal attitudes, and negative social attributes in particular lower character fidelity markedly. Conclusion: A character's attributive identity, especially along the moral and interpersonal dimensions, is a key factor in the fidelity of role-playing agents; negative social natures are the main bottleneck of current systems, and future work should focus on fine-grained attribute control and evaluation. Abstract: Despite the rapid proliferation of Role-Playing Agents (RPAs) based on Large Language Models (LLMs), the structural dimensions defining a character's identity remain weakly formalized, often treating characters as arbitrary text inputs. In this paper, we propose the concept of Character Identity, a multidimensional construct that disentangles a character into two distinct layers: (1) Parametric Identity, referring to character-specific knowledge encoded from the LLM's pre-training, and (2) Attributive Identity, capturing fine-grained behavioral properties such as personality traits and moral values. To systematically investigate these layers, we construct a unified character profile schema and generate both Famous and Synthetic characters under identical structural constraints. Our evaluation across single-turn and multi-turn interactions reveals two critical phenomena. First, we identify "Fame Fades": while famous characters hold a significant advantage in initial turns due to parametric knowledge, this edge rapidly vanishes as models prioritize accumulating conversational context over pre-trained priors. Second, we find that "Nature Remains": while models robustly portray general personality traits regardless of polarity, RPA performance is highly sensitive to the valence of morality and interpersonal relationships. Our findings pinpoint negative social natures as the primary bottleneck in RPA fidelity, guiding future character construction and evaluation.

[54] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Mingxin Li,Yanzhao Zhang,Dingkun Long,Keqin Chen,Sibo Song,Shuai Bai,Zhibo Yang,Pengjun Xie,An Yang,Dayiheng Liu,Jingren Zhou,Junyang Lin

Main category: cs.CL

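TL;DR: This paper introduces the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, built on the Qwen3-VL foundation model. They map text, images, documents, and video into a unified representation, are multilingual, and achieve state-of-the-art performance on multimodal retrieval tasks.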

Details Motivation: High-precision multimodal search requires mapping data from several modalities into a unified representation space while supporting flexible deployment and multilingual use. Method: Uses a multi-stage training paradigm, including large-scale contrastive pre-training and reranking model distillation, combined with Matryoshka Representation Learning to support variable embedding dimensions; the reranker uses a cross-encoder architecture for fine-grained relevance estimation. Result: Qwen3-VL-Embedding-8B achieves an overall score of 77.8 on MMEB-V2, ranking first; the models support more than 30 languages and inputs of up to 32k tokens, and are available in 2B and 8B parameter sizes. Conclusion: The model series forms an end-to-end pipeline for high-precision multimodal search, performs well across a variety of multimodal retrieval tasks, and is practical and extensible. Abstract: In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
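
Matryoshka Representation Learning is usually consumed by keeping a prefix of the embedding and re-normalizing it. The sketch below shows that usage pattern on made-up four-dimensional vectors; the vectors, the dimensions, and the helper names are illustrative assumptions and do not reflect the actual Qwen3-VL-Embedding API or output sizes.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Made-up embeddings standing in for a query and a document.
query = [0.9, 0.1, 0.3, -0.2]
doc = [0.8, 0.2, -0.1, 0.4]

full = cosine(truncate_and_normalize(query, 4), truncate_and_normalize(doc, 4))
short = cosine(truncate_and_normalize(query, 2), truncate_and_normalize(doc, 2))
```

A retrieval system can index the shorter prefix to save memory and compute, then rescore a shortlist with the full vector or with a cross-encoder reranker of the kind the abstract describes.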

[55] Automatic Classifiers Underdetect Emotions Expressed by Men

Ivan Smirnov,Segun T. Aroyehun,Paul Plener,David Garcia

Main category: cs.CL

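TL;DR: Using more than one million self-annotated posts, the study systematically evaluates gender bias in emotion detection across 414 combinations of models and emotion classes and finds that error rates are consistently higher for texts written by men than for texts written by women, indicating that current sentiment analysis tools perform unevenly across gender groups and should be applied with caution.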

Details Motivation: Sentiment and emotion classifiers need to be reliable across populations, yet benchmarks built on third-party annotations may conceal systematic biases against particular groups, such as genders. Method: A large-scale self-annotated dataset (more than one million posts) and a pre-registered research design are used to evaluate gender differences in classification error across 414 combinations of models and emotion classes. Result: Across classifier types and emotions, error rates are consistently higher for texts authored by men than for those authored by women; this bias can affect results in downstream applications, and large language models are not exempt. Conclusion: Sentiment analysis has not yet solved fairness, especially equitable performance across demographic groups, and current tools should be applied with caution when a sample's gender composition is unknown or variable. Abstract: The widespread adoption of automatic sentiment and emotion classifiers makes it important to ensure that these tools perform reliably across different populations. Yet their reliability is typically assessed using benchmarks that rely on third-party annotators rather than the individuals experiencing the emotions themselves, potentially concealing systematic biases. In this paper, we use a unique, large-scale dataset of more than one million self-annotated posts and a pre-registered research design to investigate gender biases in emotion detection across 414 combinations of models and emotion-related classes. We find that across different types of automatic classifiers and various underlying emotions, error rates are consistently higher for texts authored by men compared to those authored by women. We quantify how this bias could affect results in downstream applications and show that current machine learning tools, including large language models, should be applied with caution when the gender composition of a sample is not known or variable. Our findings demonstrate that sentiment analysis is not yet a solved problem, especially in ensuring equitable model behaviour across demographic groups.

[56] AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

Han Zhu,Jiale Chen,Chengkun Cai,Shengjie Sun,Haoran Li,Yujin Zhou,Chi-Min Chan,Pengcheng Wen,Lei Li,Sirui Han,Yike Guo

Main category: cs.CL

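TL;DR: The paper presents the InterSafe-V dataset and the AM³Safety framework for improving the safety of multi-modal large language models in multi-turn dialogue; by constructing interactive data and an optimization strategy, it lowers the attack success rate while improving harmlessness and helpfulness.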

Details Motivation: Existing RLHF alignment methods mainly target single-turn visual question answering and struggle with multi-turn multi-modal dialogue, where harmful intent accumulates gradually and safety mechanisms are forgotten. Method: The authors build InterSafe-V, an open-source multi-modal dialogue dataset with 11,270 dialogues and 500 specially designed refusal samples, and propose the AM³Safety framework, which combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning trained with turn-aware dual-objective rewards. Result: Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show that Attack Success Rate (ASR) drops by more than 10%, harmlessness rises by at least 8%, and helpfulness rises by more than 13%, while general abilities are preserved. Conclusion: Together with the InterSafe-V dataset, AM³Safety improves both the safety and the usefulness of multi-modal large language models in multi-turn dialogue and scales well. Abstract: Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answer (VQA) tasks and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show more than 10% decrease in Attack Success Rate (ASR) together with an increment of at least 8% in harmless dimension and over 13% in helpful dimension of MLLMs on multi-modal multi-turn safety benchmarks, while preserving their general abilities.

[57] RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation

Huawei Zheng,Xinqi Jiang,Sen Yang,Shouling Ji,Yingcai Wu,Dazhen Deng

Main category: cs.CL

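TL;DR: Proposes an end-to-end framework that combines knowledge-graph guidance with dual-path obfuscation rewriting to generate domain-relevant, implicit harmful prompts in support of LLM safety research.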

Details Motivation: Existing harmful-prompt datasets are mostly explicit and manually constructed, so they fail to reflect the implicit safety threats seen in practice; high-quality implicit harmful prompts are especially scarce in specialized domains such as finance and healthcare. Method: The framework first uses knowledge-graph guidance to generate domain-relevant harmful prompts, then applies dual-path obfuscation rewriting (direct rewriting and context-enhanced rewriting) to turn explicit harmful prompts into implicit variants, producing datasets that are both implicit and domain-relevant. Result: The framework systematically produces high-quality harmful prompts that combine strong domain relevance with implicitness, supporting more realistic red-teaming against modern LLM defenses. Conclusion: The method addresses two challenges, turning domain knowledge into actionable constraints and making prompts more implicit, and advances safety evaluation and defense research for LLMs in specialized domains. Abstract: Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and still largely rely on manual construction; public datasets mainly focus on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts, expressed through indirect domain knowledge, are harder to detect and better reflect real-world threats. We identify two challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies dual-path obfuscation rewriting to convert explicit harmful prompts into implicit variants via direct and context-enhanced rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. We release our code and datasets on GitHub.

[58] Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval

Seyeon Jeong,Yeonjun Choi,JongWook Kim,Beakcheol Jang

Main category: cs.CL

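TL;DR: Proposes Tool-MAD, a multi-agent debate framework that improves the accuracy and robustness of fact verification by equipping agents with heterogeneous external tools, adaptive query formulation, and faithfulness scoring.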

Details Motivation: Existing MAD systems rely on internal knowledge or static documents and are prone to hallucination; MADKE's one-time retrieval cannot adapt to new arguments that arise during a debate, so a more flexible fact verification method is needed. Method: Tool-MAD is a multi-agent debate framework in which each agent has a different external tool (such as a search API or a RAG module); it introduces adaptive query formulation to refine retrieval and uses Faithfulness and Answer Relevance scores to support the Judge agent's decision. Result: Tool-MAD outperforms existing MAD methods on four fact verification benchmarks, with accuracy gains of up to 5.5%, and shows strong robustness and adaptability in the medical domain. Conclusion: Through tool augmentation and dynamic evidence updates, Tool-MAD mitigates hallucination and improves fact verification in complex reasoning and specialized domains, showing potential for practical use. Abstract: Large Language Models (LLMs) suffer from hallucinations and factual inaccuracies, especially in complex reasoning and fact verification tasks. Multi-Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue, promoting diverse reasoning and mutual verification. However, existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations. While MADKE introduces external evidence to mitigate this, its one-time retrieval mechanism limits adaptability to new arguments or emerging information during the debate. To address these limitations, we propose Tool-MAD, a multi-agent debate framework that enhances factual verification by assigning each agent a distinct external tool, such as a search API or RAG module. Tool-MAD introduces three key innovations: (1) a multi-agent debate framework where agents leverage heterogeneous external tools, encouraging diverse perspectives, (2) an adaptive query formulation mechanism that iteratively refines evidence retrieval based on the flow of the debate, and (3) the integration of Faithfulness and Answer Relevance scores into the final decision process, allowing the Judge agent to quantitatively assess the coherence and question alignment of each response and effectively detect hallucinations. Experimental results on four fact verification benchmarks demonstrate that Tool-MAD consistently outperforms state-of-the-art MAD frameworks, achieving up to 5.5% accuracy improvement. 
Furthermore, in medically specialized domains, Tool-MAD exhibits strong robustness and adaptability across various tool configurations and domain conditions, confirming its potential for broader real-world fact-checking applications.

Yehoon Jang,Chaewon Lee,Hyun-seok Min,Sungchul Choi

Main category: cs.CL

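TL;DR: The paper introduces PILOT-Bench, the first benchmark centered on the Patent Trial and Appeal Board (PTAB) of the USPTO, for systematically evaluating the legal reasoning of large language models in the patent domain; it aligns PTAB decisions with patent data, defines three IRAC-aligned classification tasks, evaluates a range of LLMs, and reveals a substantial gap in reasoning ability between closed-source and open-source models.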

Details Motivation: The use of large language models in patent and legal practice has been limited to lightweight tasks, and there is no systematic way to evaluate their capacity for legal reasoning in the patent domain, so a dedicated benchmark is needed to measure and improve model performance. Method: PILOT-Bench aligns PTAB decisions with USPTO patent data at the case level, defines three IRAC-aligned classification tasks (Issue Type, Board Authorities, and Subdecision), and evaluates a range of closed-source and open-source LLMs under different settings. Result: On the Issue Type task, closed-source models consistently exceed 0.75 Micro-F1, whereas the strongest open-source model (Qwen-8B) scores around 0.56, a substantial gap in reasoning ability; the analysis also reveals differences across model families, input variants, and error tendencies. Conclusion: PILOT-Bench provides the first systematic benchmark for patent-domain legal reasoning, exposes the limitations of current LLMs on this task, especially open-source models, and points toward improving LLMs through dataset design and model alignment. Abstract: The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case-level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1 score, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot-bench.

[60] Differential syntactic and semantic encoding in LLMs

Santiago Acevedo,Alessandro Laio,Marco Baroni

Main category: cs.CL

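TL;DR: The study finds that averaging the hidden-representation vectors of sentences that share syntactic structure or meaning yields vectors capturing a large share of the syntactic and semantic information, suggesting that syntax and semantics are at least partially linearly encoded in LLM internal representations, with different encoding profiles across layers that can be partly decoupled.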

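Details Motivation: To investigate how syntactic and semantic information is encoded in the internal representations of large language models (LLMs), and in particular how these two kinds of linguistic information are organized in inner-layer representations. Method: Syntactic and semantic "centroids" are obtained by averaging the hidden-representation vectors of sentences that share syntactic structure or meaning; the authors then analyze how subtracting these centroids changes sentence similarity and examine how the encoding profiles differ across layers. Result: Syntactic and semantic information can be captured to a significant degree in the hidden representations; subtracting the corresponding centroid strongly affects similarity with syntactically or semantically matched sentences, suggesting linear encoding; the cross-layer encoding profiles of syntax and semantics differ, and the two signals can be partly decoupled. Conclusion: In LLM internal representations, syntactic and semantic information is at least partially linearly encoded, the two signals can to some extent be decoupled, and they follow different encoding profiles across model depth. Abstract: We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic "centroids" from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.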
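The centroid-subtraction analysis described above can be illustrated on toy vectors. This is a minimal sketch, assuming cosine similarity as the similarity measure and hand-made 3-dimensional vectors in place of real hidden states; the helper names are hypothetical and the paper's exact procedure may differ:

```python
import math


def mean_vector(vectors):
    """Coordinate-wise average of a list of equal-length vectors (the centroid)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def subtract(a, b):
    """Coordinate-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Three toy "sentence vectors" that share a common component (1, _, 2),
# standing in for sentences with the same syntactic structure.
group = [[1.0, 0.0, 2.0], [1.0, 1.0, 2.0], [1.0, -1.0, 2.0]]
centroid = mean_vector(group)

a, b = group[1], group[2]
before = cosine(a, b)                                         # shared component dominates
after = cosine(subtract(a, centroid), subtract(b, centroid))  # shared component removed
```

In this toy case the similarity between `a` and `b` falls from 2/3 to -1 once the shared component is removed, which is the kind of shift the abstract reads as evidence of linear encoding.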

[61] Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence

Shengyin Sun,Yiming Li,Renxi Liu,Weizhe Lin,Hui-Ling Zhen,Xianzhi Yu,Mingxuan Yuan,Chen Ma

Main category: cs.CL

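TL;DR: The paper proposes a training-free verification mechanism based on KL divergence for accelerating LLM inference, removing Judge Decoding's reliance on expensive supervision, and theoretically proves a structural correspondence between learned judges and KL divergence.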

Details Motivation: Existing Judge Decoding methods rely on expensive and noisy supervision to learn criticality scores, which limits their scalability and robustness, so a more efficient, supervision-free alternative is needed. Method: A theoretical analysis reveals a structural correspondence between learned linear judges and KL divergence; based on this, the authors propose a simple, training-free verification mechanism that uses the draft-target distributional divergence (KL divergence). Result: Across reasoning and coding benchmarks, the method matches or outperforms complex trained judges (e.g., AutoJudge) and is more robust to domain shifts. Conclusion: The training-free, KL-divergence-based verification mechanism is an effective replacement for supervised Judge Decoding, eliminating the supervision bottleneck while matching or improving performance. Abstract: Judge Decoding accelerates LLM inference by relaxing the strict verification of Speculative Decoding, yet it typically relies on expensive and noisy supervision. In this work, we revisit this paradigm from first principles, revealing that the "criticality" scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence. We theoretically prove a structural correspondence between learned linear judges and Kullback-Leibler (KL) divergence, demonstrating they rely on the same underlying logit primitives. Guided by this, we propose a simple, training-free verification mechanism based on KL divergence. Extensive experiments across reasoning and coding benchmarks show that our method matches or outperforms complex trained judges (e.g., AutoJudge), offering superior robustness to domain shifts and eliminating the supervision bottleneck entirely.
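A KL-based acceptance check of the kind the abstract describes might look like the sketch below. The abstract does not give the exact rule, so the direction of the divergence (target relative to draft), the threshold value, and the function names are all assumptions made for illustration:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def accept_draft_token(target_logits, draft_logits, threshold=0.1):
    """Accept the drafted token when the two next-token distributions are close.

    Assumption: low draft-target divergence marks a non-critical position
    where the draft model's choice can be kept. The threshold is hypothetical.
    """
    p = softmax(target_logits)  # target model's next-token distribution
    q = softmax(draft_logits)   # draft model's next-token distribution
    return kl_divergence(p, q) <= threshold


# Toy 3-token vocabulary: the models agree in one case and disagree in the other.
agree = accept_draft_token([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
disagree = accept_draft_token([5.0, 0.0, 0.0], [0.0, 5.0, 0.0])
```

No judge model is trained here: the decision uses only the logits that speculative decoding already computes, which is the property the abstract emphasizes.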

[62] LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal

Dongjun Kim,Jeongho Yoon,Chanjun Park,Heuiseok Lim

Main category: cs.CL

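TL;DR: Proposes LANGSAE EDITING, which uses a sparse autoencoder to controllably remove the language-identity signal from multilingual embeddings directly in vector space, improving cross-language retrieval.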

Details Motivation: In multilingual dense retrieval, the language-identity signal interferes with semantic similarity, inflating similarity for same-language pairs and hurting recall of relevant documents in other languages. Method: A post-hoc sparse autoencoder (LANGSAE EDITING) is trained; it identifies language-associated latent units from cross-language activation statistics, suppresses them at inference time, and reconstructs embeddings in the original dimensionality. Result: Experiments across multiple languages show improved ranking quality and cross-language coverage, with especially strong gains for script-distinct languages. Conclusion: LANGSAE EDITING effectively disentangles language identity from semantic representation and stays compatible with existing vector databases without retraining the encoder or re-encoding text, which makes it practical. Abstract: Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with especially strong gains for script-distinct languages.

[63] NC2C: Automated Convexification of Generic Non-Convex Optimization Problems

Xinyue Peng,Yanming Liu,Yihan Cang,Yuwei Zhang,Xinyi Wang,Songhang Deng,Jiannan Cao

Main category: cs.CL

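TL;DR: Proposes NC2C, an LLM-based end-to-end automated framework that transforms non-convex optimization problems into solvable convex forms, substantially raising the transformation success rate and reducing reliance on expert knowledge.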

Details Motivation: Solving non-convex optimization problems traditionally depends on manual convexification and expert knowledge, which is inefficient and hard to scale; an automated approach is needed to improve efficiency and generality. Method: NC2C uses the mathematical reasoning ability of LLMs, together with symbolic reasoning, adaptive transformation, iterative validation, and error-correction mechanisms, to automatically detect non-convex components, select convexification strategies, and generate rigorous convex equivalents. Result: On 100 non-convex problems, NC2C achieves an 89.3% execution rate and a 76% success rate, significantly outperforming baseline methods. Conclusion: NC2C automates the transformation from non-convex to convex problems, reduces dependence on expert knowledge, and enables efficient use of convex solvers on complex optimization tasks. Abstract: Non-convex optimization problems are pervasive across mathematical programming, engineering design, and scientific computing, often posing intractable challenges for traditional solvers due to their complex objective functions and constrained landscapes. To address the inefficiency of manual convexification and the over-reliance on expert knowledge, we propose NC2C, an LLM-based end-to-end automated framework designed to transform generic non-convex optimization problems into solvable convex forms using large language models. NC2C leverages LLMs' mathematical reasoning capabilities to autonomously detect non-convex components, select optimal convexification strategies, and generate rigorous convex equivalents. The framework integrates symbolic reasoning, adaptive transformation techniques, and iterative validation, equipped with error correction loops and feasibility domain correction mechanisms to ensure the robustness and validity of transformed problems. Experimental results on a diverse dataset of 100 generic non-convex problems demonstrate that NC2C achieves an 89.3% execution rate and a 76% success rate in producing feasible, high-quality convex transformations. This outperforms baseline methods by a significant margin, highlighting NC2C's ability to leverage LLMs for automated non-convex to convex transformation, reduce expert dependency, and enable efficient deployment of convex solvers for previously intractable optimization tasks.

[64] Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework

Junhyuk Choi,Jeongyoun Kwon,Heeju Kim,Haeun Cho,Hayeong Jung,Sehee Min,Bugeun Kim

Main category: cs.CL

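TL;DR: The paper presents the first systematic analysis of role-based authority bias in free-form multi-agent evaluation, finding that Expert and Referent authority roles exert stronger influence than Legitimate ones, and that the bias arises because authoritative agents hold their positions while general agents adjust flexibly, not because general agents actively conform.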

Details Motivation: To examine the bias introduced by authoritative roles in LLM-based multi-agent systems and fill the research gap on how role authority relates to agent interaction. Method: Following French and Raven's power-based theory, authoritative roles are classified into legitimate, referent, and expert types, and experiments with GPT-4o and DeepSeek R1 are run using ChatEval across 12-turn conversations. Result: Expert and Referent roles exert stronger influence than Legitimate roles; authority bias arises not because general agents actively conform but because authoritative agents consistently maintain their positions while general agents show flexibility; clear position statements are a prerequisite for the bias, and neutral responses do not produce it. Conclusion: The study reveals how authority bias forms, highlights the importance of clear positions and role type, and offers key guidance for designing multi-agent systems with asymmetric interaction patterns. Abstract: Multi-agent systems utilizing large language models often assign authoritative roles to improve performance, yet the impact of authority bias on agent interactions remains underexplored. We present the first systematic analysis of role-based authority bias in free-form multi-agent evaluation using ChatEval. Applying French and Raven's power-based theory, we classify authoritative roles into legitimate, referent, and expert types and analyze their influence across 12-turn conversations. Experiments with GPT-4o and DeepSeek R1 reveal that Expert and Referent power roles exert stronger influence than Legitimate power roles. Crucially, authority bias emerges not through active conformity by general agents, but through authoritative roles consistently maintaining their positions while general agents demonstrate flexibility. Furthermore, authority influence requires clear position statements, as neutral responses fail to generate bias. These findings provide key insights for designing multi-agent frameworks with asymmetric interaction patterns.

[65] When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection

Ke Sun,Guangsheng Bao,Han Cui,Yue Zhang

Main category: cs.CL

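TL;DR: The paper proposes a new method for detecting AI-generated text based on how stable log-probability fluctuations are late in generation; an analysis of more than 120k text samples reveals "Late-Stage Volatility Decay", and two new features built on it achieve state-of-the-art performance without additional model access.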

Details Motivation: Existing zero-shot detection methods ignore the temporal dynamics of autoregressive generation, which limits their ability to identify AI-generated text, so a finer-grained method that accounts for how generation evolves is needed. Method: The authors analyze token-level log-probability changes over a large number of text samples, find that AI and human text differ in late-stage volatility, and propose two new features computed only from late-stage statistics: Derivative Dispersion and Local Volatility. Result: AI-generated text shows 24-32% lower volatility than human text in the second half of sequences; the method achieves state-of-the-art performance on the EvoBench and MAGE benchmarks and complements existing global methods well. Conclusion: Late-stage temporal dynamics of the generation process can effectively improve detection of AI-generated text without perturbation sampling or additional model access, making the approach practical and scalable. Abstract: Zero-shot detection methods for AI-generated text typically aggregate token-level statistics across entire sequences, overlooking the temporal dynamics inherent to autoregressive generation. We analyze over 120k text samples and reveal Late-Stage Volatility Decay: AI-generated text exhibits rapidly stabilizing log probability fluctuations as generation progresses, while human writing maintains higher variability throughout. This divergence peaks in the second half of sequences, where AI-generated text shows 24-32% lower volatility. Based on this finding, we propose two simple features: Derivative Dispersion and Local Volatility, which are computed exclusively from late-stage statistics. Without perturbation sampling or additional model access, our method achieves state-of-the-art performance on EvoBench and MAGE benchmarks and demonstrates strong complementarity with existing global methods.
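The abstract names the two features but does not define them, so the following is only a sketch under stated assumptions: Derivative Dispersion is taken to be the standard deviation of first differences of token log probabilities, Local Volatility the mean standard deviation over sliding windows, and "late stage" the second half of the sequence. All function names, the window size, and the toy sequences are hypothetical:

```python
import statistics


def late_stage(logprobs, fraction=0.5):
    """Return the final `fraction` of the sequence (second half by default)."""
    start = int(len(logprobs) * (1.0 - fraction))
    return logprobs[start:]


def derivative_dispersion(logprobs):
    """Assumed definition: std of first differences over the late-stage tokens."""
    tail = late_stage(logprobs)
    diffs = [b - a for a, b in zip(tail, tail[1:])]
    return statistics.pstdev(diffs)


def local_volatility(logprobs, window=3):
    """Assumed definition: mean sliding-window std over the late-stage tokens."""
    tail = late_stage(logprobs)
    stds = [statistics.pstdev(tail[i:i + window])
            for i in range(len(tail) - window + 1)]
    return sum(stds) / len(stds)


# Made-up token log probabilities: one sequence settles down, one keeps fluctuating.
settling = [-3.0, -0.5, -2.5, -0.8, -1.0, -1.1, -1.0, -1.1]
varied = [-3.0, -0.5, -2.5, -0.8, -3.0, -0.4, -2.6, -0.7]

settling_scores = (derivative_dispersion(settling), local_volatility(settling))
varied_scores = (derivative_dispersion(varied), local_volatility(varied))
```

Under these assumed definitions the settling sequence scores lower on both features, matching the direction of the Late-Stage Volatility Decay effect; a real detector would calibrate a decision threshold on labeled data.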

[66] RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection

Zhiwei Liu,Runteng Guo,Baojie Qu,Yuechen Jiang,Min Peng,Qianqian Xie,Sophia Ananiadou

Main category: cs.CL

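TL;DR: The paper introduces RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection, which improves cross-domain generalization through multi-perspective evidence retrieval and collaborative multi-agent reasoning.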

Details Motivation: Existing methods rely on single-perspective cues and generalize poorly to new domains, while large models are limited by same-distribution assumptions and lack a systematic reasoning mechanism. Method: RAAR retrieves multi-perspective source-domain evidence aligned with the target sample's semantics, sentiment, and writing style, builds verifiable multi-step reasoning paths through specialized multi-agent collaboration, and trains a multi-task verifier with supervised fine-tuning and reinforcement learning. Result: On three cross-domain misinformation detection tasks, RAAR substantially improves the base models and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches. Conclusion: RAAR enables knowledge transfer and systematic reasoning for cross-domain misinformation detection, offering a new paradigm for identifying misinformation in complex, unseen domains. Abstract: Cross-domain misinformation detection is challenging, as misinformation arises across domains with substantial differences in knowledge and discourse. Existing methods often rely on single-perspective cues and struggle to generalize to challenging or underrepresented domains, while reasoning large language models (LLMs), though effective on complex tasks, are limited to same-distribution data. To address these gaps, we introduce RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection. To enable cross-domain transfer beyond same-distribution assumptions, RAAR retrieves multi-perspective source-domain evidence aligned with each target sample's semantics, sentiment, and writing style. To overcome single-perspective modeling and missing systematic reasoning, RAAR constructs verifiable multi-step reasoning paths through specialized multi-agent collaboration, where perspective-specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. RAAR further applies supervised fine-tuning and reinforcement learning to train a single multi-task verifier to enhance verification and reasoning capabilities. Based on RAAR, we trained the RAAR-8b and RAAR-14b models. Evaluation on three cross-domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches. The project will be released at https://github.com/lzw108/RAAR.

[67] Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics

Oshri Naparstek

Main category: cs.CL

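TL;DR: Proposes a continuous autoregressive language generation model that evolves token representations in continuous space and discretizes them only after they have sufficiently converged, enabling stable and diverse text generation without token-level sampling.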

Details Motivation: Conventional autoregressive language models discretize too early at every generation step, making generation unstable, repetitive, and sensitive to decoding strategy, so a generation mechanism that models uncertainty better is needed. Method: Tokens are represented as continuous vectors that are updated repeatedly through a deterministic dynamical process until convergence and are then hard-decoded into discrete text, so uncertainty is maintained and resolved in continuous space. Result: The maturation process alone produces coherent and diverse text under deterministic decoding, without token sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Conclusion: According to the authors, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence before discretization, enabling stable generation without token-level sampling. Abstract: Autoregressive language models are conventionally defined over discrete token sequences, committing to a specific token at every generation step. This early discretization forces uncertainty to be resolved through token-level sampling, often leading to instability, repetition, and sensitivity to decoding heuristics. In this work, we introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that mature over multiple update steps before being discretized. Rather than sampling tokens, the model evolves continuous token representations through a deterministic dynamical process, committing to a discrete token only when the representation has sufficiently converged. Discrete text is recovered via hard decoding, while uncertainty is maintained and resolved in the continuous space. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax), without reliance on token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations, such as stochastic dynamics or history smoothing, can be incorporated naturally but are not required for the model to function. To our knowledge, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.

[68] MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News

Zhiwei Liu,Paul Thompson,Jiaqi Rong,Baojie Qu,Runteng Guo,Min Peng,Qianqian Xie,Sophia Ananiadou

Main category: cs.CL

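TL;DR: The paper introduces MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, supporting fine-grained localisation, type classification, and explanation.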

Details Motivation: Most existing misinformation detection methods make binary judgments over whole paragraphs or claims, so they miss cases where true and false details coexist and lack interpretability and fine-grained analysis. Method: The authors build the MisSpans dataset from paired real and fake news stories and define three tasks: MisSpansIdentity (locating false spans), MisSpansType (classifying the type of misinformation), and MisSpansExplanation (generating span-grounded explanations); they evaluate 15 representative LLMs under zero-shot and one-shot settings. Result: Expert annotation reaches high agreement; experiments show that fine-grained misinformation identification is challenging and that performance depends on several interacting factors, including model size, reasoning capability, and domain-specific textual features. Conclusion: MisSpans offers a finer-grained, more interpretable, multi-dimensional evaluation framework for misinformation detection, exposes the limitations of current LLMs on span-level verification, and motivates further study of how these factors interact. Abstract: Online misinformation is increasingly pervasive, yet most existing benchmarks and methods evaluate veracity at the level of whole claims or paragraphs using coarse binary labels, obscuring how true and false details often co-exist within single sentences. These simplifications also limit interpretability: global explanations cannot identify which specific segments are misleading or differentiate how a detail is false (e.g., distorted vs. fabricated). To address these gaps, we introduce MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, consisting of paired real and fake news stories. MisSpans defines three complementary tasks: MisSpansIdentity for pinpointing false spans within sentences, MisSpansType for categorising false spans by misinformation type, and MisSpansExplanation for providing rationales grounded in identified spans. Together, these tasks enable fine-grained localisation, nuanced characterisation beyond true/false and actionable explanations. Expert annotators were guided by standardised guidelines and consistency checks, leading to high inter-annotator agreement. We evaluate 15 representative LLMs, including reasoning-enhanced and non-reasoning variants, under zero-shot and one-shot settings. Results reveal the challenging nature of fine-grained misinformation identification and analysis, and highlight the need for a deeper understanding of how performance may be influenced by multiple interacting factors, including model size and reasoning capabilities, along with domain-specific textual features. 
This project will be available at https://github.com/lzw108/MisSpans.

[69] A Navigational Approach for Comprehensive RAG via Traversal over Proposition Graphs

Maxime Delmas,Lei Xu,André Freitas

Main category: cs.CL

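TL;DR: ToPG (Traversal over Proposition Graphs) is a new RAG framework that combines fine-grained facts with graph connectivity and supports a query-aware, iterative retrieve-and-reason process, performing well on simple, complex, and abstract QA tasks.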

Details Motivation: Chunk-based RAG pipelines lack structural connectivity on complex multi-hop queries, while knowledge-graph approaches perform poorly on fact-oriented single-hop queries; ToPG aims to bridge this gap. Method: ToPG models the knowledge base as a heterogeneous graph of propositions, entities, and passages and runs iterative Suggestion-Selection cycles: the Suggestion phase performs a query-aware graph traversal, and the Selection phase uses LLM feedback to prune irrelevant propositions and seed the next iteration. Result: Evaluated on three distinct QA tasks (Simple, Complex, and Abstract QA), ToPG shows strong performance on both accuracy- and quality-based metrics. Conclusion: ToPG shows that combining query-aware graph traversal with fine-grained factual representation is a key ingredient of efficient structured RAG systems. Abstract: Standard RAG pipelines based on chunking excel at simple factual retrieval but fail on complex multi-hop queries due to a lack of structural connectivity. Conversely, initial strategies that interleave retrieval with reasoning often lack global corpus awareness, while Knowledge Graph (KG)-based RAG performs strongly on complex multi-hop tasks but suffers on fact-oriented single-hop queries. To bridge this gap, we propose a novel RAG framework: ToPG (Traversal over Proposition Graphs). ToPG models its knowledge base as a heterogeneous graph of propositions, entities, and passages, effectively combining the granular fact density of propositions with graph connectivity. We leverage this structure using iterative Suggestion-Selection cycles, where the Suggestion phase enables a query-aware traversal of the graph, and the Selection phase provides LLM feedback to prune irrelevant propositions and seed the next iteration. Evaluated on three distinct QA tasks (Simple, Complex, and Abstract QA), ToPG demonstrates strong performance across both accuracy- and quality-based metrics. Overall, ToPG shows that query-aware graph traversal combined with factual granularity is a critical component for efficient structured RAG systems. ToPG is available at https://github.com/idiap/ToPG.

[70] EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis

Xuanguang Pan,Chongyang Tao,Jiayuan Bai,Jianling Gao,Zhengwei Tao,Xiansheng Zhou,Gavin Cheung,Shuai Ma

Main category: cs.CL

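TL;DR: The paper proposes EvolSQL, a structure-aware Text-to-SQL data synthesis framework that progressively increases SQL query complexity using atomic operators derived from the abstract syntax tree, producing high-quality, diverse training data.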

Details Motivation: Existing Text-to-SQL datasets either rely on limited human-annotated corpora or are generated by simply prompting LLMs without control over SQL structure, which limits structural diversity and complexity. Method: Starting from seed data, EvolSQL first performs an exploratory Query-SQL expansion to broaden question diversity and schema coverage, then applies an adaptive directional evolution strategy using six atomic transformation operators derived from the SQL Abstract Syntax Tree to progressively increase query complexity along the relational, predicate, aggregation, and nesting dimensions. An execution-grounded SQL refinement module and schema-aware deduplication ensure high-quality, structurally diverse mapping pairs. Result: A 7B model fine-tuned on EvolSQL data outperforms one trained on the SynSQL dataset while using only 1/18 of the data. Conclusion: EvolSQL efficiently generates structurally complex and semantically diverse Text-to-SQL training data, substantially improves small models on this task, and offers an effective solution for training in data-scarce settings. Abstract: Training effective Text-to-SQL models remains challenging due to the scarcity of high-quality, diverse, and structurally complex datasets. Existing methods either rely on limited human-annotated corpora, or synthesize datasets directly by simply prompting LLMs without explicit control over SQL structures, often resulting in limited structural diversity and complexity. To address this, we introduce EvolSQL, a structure-aware data synthesis framework that evolves SQL queries from seed data into richer and more semantically diverse forms. EvolSQL starts with an exploratory Query-SQL expansion to broaden question diversity and improve schema coverage, and then applies an adaptive directional evolution strategy using six atomic transformation operators derived from the SQL Abstract Syntax Tree to progressively increase query complexity across relational, predicate, aggregation, and nesting dimensions. An execution-grounded SQL refinement module and schema-aware deduplication further ensure the creation of high-quality, structurally diverse mapping pairs. Experimental results show that a 7B model fine-tuned on our data outperforms one trained on the much larger SynSQL dataset using only 1/18 of the data.

[71] Mind2Report: A Cognitive Deep Research Agent for Expert-Level Commercial Report Synthesis

Mingyue Cheng,Daoyu Wang,Qi Liu,Shuo Yu,Xiaoyu Tao,Yuqian Wang,Chengzhong Chu,Yu Duan,Mingkang Long,Enhong Chen

Main category: cs.CL

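TL;DR: Proposes Mind2Report, a cognitive deep research agent that emulates a commercial analyst; using a training-free agentic workflow and dynamic memory to augment LLMs, it produces commercial reports with high quality, reliability, and coverage.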

Details Motivation: Reports produced by existing deep research agents are limited in quality, reliability, and coverage, falling short of what high-stakes business decisions require. Method: Mind2Report is a training-free agentic workflow that probes fine-grained intent, searches web sources and distills information on the fly, stores it in dynamic memory, and iteratively synthesizes the report, augmenting LLMs for long-form cognitive processing. Result: The authors construct QRC-Eval, comprising 200 real-world commercial tasks, and use a holistic evaluation strategy to show that Mind2Report outperforms leading baselines, including the OpenAI and Gemini deep research agents, in report quality, reliability, and coverage. Conclusion: Mind2Report offers a workable framework for designing commercial deep research agents and is expected to serve as a foundation for future systems that generate high-quality commercial reports automatically. Abstract: Synthesizing informative commercial reports from massive and noisy web sources is critical for high-stakes business decisions. Although current deep research agents achieve notable progress, their reports still remain limited in terms of quality, reliability, and coverage. In this work, we propose Mind2Report, a cognitive deep research agent that emulates the commercial analyst to synthesize expert-level reports. Specifically, it first probes fine-grained intent, then searches web sources and records distilled information on the fly, and subsequently iteratively synthesizes the report. We design Mind2Report as a training-free agentic workflow that augments general large language models (LLMs) with dynamic memory to support these long-form cognitive processes. To rigorously evaluate Mind2Report, we further construct QRC-Eval comprising 200 real-world commercial tasks and establish a holistic evaluation strategy to assess report quality, reliability, and coverage. Experiments demonstrate that Mind2Report outperforms leading baselines, including OpenAI and Gemini deep research agents. Although this is a preliminary study, we expect it to serve as a foundation for advancing the future design of commercial deep research agents. Our code and data are available at https://github.com/Melmaphother/Mind2Report.

[72] CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

Ao Sun,Xiaoyu Wang,Zhe Tan,Yu Li,Jiachen Zhu,Shu Su,Yuheng Jia

Main category: cs.CL

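TL;DR: Proposes CuMA, a framework that tackles "mean collapse" in cross-cultural LLM alignment through conditional capacity separation and demographic-aware routing, preserving cultural diversity.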

Details Motivation: Dense models suffer "mean collapse" when fitting conflicting cultural values and fail to represent diverse groups, so alignment needs a method that respects cultural pluralism. Method: Proposes the CuMA framework, which treats alignment as a conditional capacity separation problem and uses demographic-aware routing to internalize a latent cultural topology that disentangles conflicting gradients into specialized expert subspaces. Result: On WorldValuesBench, Community Alignment, and PRISM, CuMA achieves state-of-the-art performance, significantly outperforming dense baselines and semantic-only MoE models. Conclusion: CuMA effectively mitigates mean collapse and preserves cultural diversity, offering a workable approach to aligning language models for a global audience. Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose CuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic-aware routing, CuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that CuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.

[73] Faithful Summarisation under Disagreement via Belief-Level Aggregation

Favour Yahdii Aghaebe,Tanefa Apekey,Elizabeth Williams,Nafise Sadat Moosavi

Main category: cs.CL

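TL;DR: Proposes a disagreement-aware summarisation pipeline that first aggregates document viewpoints at the belief level and then uses an LLM to write the natural language summary, representing conflicting viewpoints more faithfully than conventional approaches.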

Details Motivation: Existing summarisation methods, especially LLM-based ones, tend to smooth over disagreement and over-represent majority opinions, which makes summaries unfaithful where viewpoints clearly conflict. Method: Represents documents as structured belief sets, aggregates them with distance-based belief merging operators that explicitly model conflict, and then uses LLMs only to realise the aggregated beliefs as natural language summaries. Result: Sufficiently large models can match belief-level aggregation when aggregating at generation time, but this behaviour is not stable across architectures or scales; the proposed approach performs consistently well across models. Conclusion: Explicit belief-level aggregation combined with simple prompting yields more reliable, faithful, and fluent multi-document opinion summaries than relying on implicit aggregation by LLMs. Abstract: Opinion and multi-document summarisation often involve genuinely conflicting viewpoints, yet many existing approaches, particularly LLM-based systems, implicitly smooth disagreement and over-represent majority opinions. This limits the faithfulness of generated summaries in opinion-heavy settings. We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Documents are first represented as structured belief sets and aggregated using distance-based belief merging operators that explicitly model conflict. Large language models are then used only to realise the aggregated beliefs as natural language summaries. We evaluate the approach across multiple model families and scales, comparing it to methods that perform explicit aggregation during generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models, while maintaining fluent and grounded summaries.
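
The abstract does not say which merging operators are used. As a hedged illustration of what a distance-based belief merging operator can look like, the sketch below uses Hamming distance between truth assignments with either `sum` or `max` aggregation, both standard choices in the belief merging literature. The function names and the two-atom review example are assumptions made up for this sketch.

```python
from itertools import product


def hamming(a, b):
    """Number of atoms on which two truth assignments differ."""
    return sum(x != y for x, y in zip(a, b))


def merge(belief_sets, n_atoms, aggregate=sum):
    """Return the worlds that minimise the aggregated distance to all sources.

    Each belief set is a list of acceptable worlds (tuples of 0/1).
    A world's distance to a source is its distance to the closest acceptable world.
    """
    best, best_cost = [], None
    for world in product((0, 1), repeat=n_atoms):
        cost = aggregate(
            min(hamming(world, model) for model in models) for models in belief_sets
        )
        if best_cost is None or cost < best_cost:
            best, best_cost = [world], cost
        elif cost == best_cost:
            best.append(world)
    return best, best_cost


# Atoms: (battery_is_good, screen_is_good). Two reviewers like both, one dislikes the battery.
REVIEWS = [[(1, 1)], [(1, 1)], [(0, 1)]]
MAJORITY, MAJORITY_COST = merge(REVIEWS, n_atoms=2, aggregate=sum)
EGALITARIAN, EGALITARIAN_COST = merge(REVIEWS, n_atoms=2, aggregate=max)
```

With `sum`, the merged result follows the majority and keeps only the world where both atoms hold. With `max`, the dissenting view is preserved: both `(0, 1)` and `(1, 1)` remain, so the merged beliefs agree on the screen but leave the battery contested, which is the kind of disagreement a downstream summary could then report.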

[74] V-FAT: Benchmarking Visual Fidelity Against Text-bias

Ziteng Wang,Yujie He,Guanliang Li,Siqi Yang,Jiaqi Xiong,Songxiang Liu

Main category: cs.CL

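TL;DR: Introduces V-FAT, a diagnostic benchmark for text bias in multimodal LLMs, and shows that current models over-rely on linguistic priors rather than genuine visual understanding in visual reasoning.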

Details Motivation: There is growing concern that multimodal LLMs rely on linguistic shortcuts rather than genuine visual understanding in visual reasoning tasks, a phenomenon termed text bias. The paper investigates the fundamental tension between visual perception and linguistic priors. Method: Decouples the sources of text bias into internal corpus bias and external instruction bias, builds the V-FAT benchmark of 4,026 VQA instances, uses a three-level evaluation framework that progressively increases the conflict between visual evidence and textual information, and proposes the Visual Robustness Score (VRS) metric. Result: An evaluation of 12 frontier MLLMs shows that, despite strong results on existing benchmarks, the models suffer significant visual collapse under high linguistic dominance, ignoring visual content and predicting from language patterns. Conclusion: Current multimodal LLMs lack genuine visual fidelity under strong linguistic interference; more visually robust training and evaluation methods are needed to mitigate text bias. Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.

[75] Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences

Arkadiusz Modzelewski,Paweł Golik,Anna Kołos,Giovanni Da San Martino

Main category: cs.CL

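TL;DR: Introduces Persuaficial, a multilingual benchmark for comparing how well automatic detectors handle LLM-generated versus human-written persuasive text. Subtle LLM-generated persuasion proves harder to detect, and a linguistic analysis is provided to help improve detection tools.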

Details Motivation: LLMs can generate highly persuasive text and may be misused for propaganda and manipulation, so it matters whether LLM-generated persuasion is harder to detect automatically than human-written persuasion. Method: Categorizes controllable generation approaches, builds Persuaficial, a multilingual persuasive-text benchmark covering six languages, and runs large-scale empirical evaluations comparing the detectability of human-written and LLM-generated texts. Result: Overtly persuasive LLM-generated text is relatively easy to detect, but subtle LLM persuasion consistently degrades automatic detection performance. The paper also provides the first comprehensive linguistic comparison of human and LLM-generated persuasive texts. Conclusion: Subtle LLM-generated persuasion poses a greater challenge to current automatic detectors, and the linguistic findings can inform more interpretable and robust detection tools. Abstract: Large Language Models (LLMs) can generate highly persuasive text, raising concerns about their misuse for propaganda, manipulation, and other harmful purposes. This leads us to our central question: Is LLM-generated persuasion more difficult to automatically detect than human-written persuasion? To address this, we categorize controllable generation approaches for producing persuasive content with LLMs and introduce Persuaficial, a high-quality multilingual benchmark covering six languages: English, German, Polish, Italian, French and Russian. Using this benchmark, we conduct extensive empirical evaluations comparing human-authored and LLM-generated persuasive texts. We find that although overtly persuasive LLM-generated texts can be easier to detect than human-written ones, subtle LLM-generated persuasion consistently degrades automatic detection performance. Beyond detection performance, we provide the first comprehensive linguistic analysis contrasting human and LLM-generated persuasive texts, offering insights that may guide the development of more interpretable and robust detection tools.

[76] GenProve: Learning to Generate Text with Fine-Grained Provenance

Jingxuan Wei,Xingyue Wang,Yanghaoyu Liao,Jie Dong,Yuchen Liu,Caijun Jia,Bihui Yu,Junnan Zhu

Main category: cs.CL

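TL;DR: Introduces the Generation-time Fine-grained Provenance task and the ReFInE dataset to improve LLMs' ability to give verifiable sources in their answers. The GenProve framework significantly outperforms existing models and exposes how hard inference-based provenance remains.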

Details Motivation: LLMs often hallucinate, existing citation practices are not enough to guarantee verifiability, and they do not finely distinguish how a cited source relates to a generated claim. Method: Presents the ReFInE dataset, with expert-verified provenance triples annotated as Quotation, Compression, or Inference, and builds the GenProve framework, which combines supervised fine-tuning with Group Relative Policy Optimization to jointly optimize answer fidelity and provenance correctness. Result: GenProve significantly outperforms 14 strong LLMs in joint evaluation. Models do well on surface-level quotation but show a clear weakness on inference-based provenance. Conclusion: Fine-grained provenance improves the interpretability and verifiability of generated content, but verifiable reasoning remains an open frontier problem. Abstract: Large language models (LLMs) often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability as users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability & Evidence), a dataset featuring expert-verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap where models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.

[77] A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction

Qing Wang,Zehan Li,Yaodong Song,Hongjie Chen,Jian Kang,Jie Lian,Jie Li,Yongxiang Li,Xuelong Li

Main category: cs.CL

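TL;DR: Proposes a unified spoken language model built on Injected Emotional-Attribution Thinking (IEAT). A two-stage training strategy internalizes emotion-aware reasoning, and the model achieves top-ranked performance on emotional trajectory modeling, emotional reasoning, and empathetic response generation.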

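Details Motivation: Conventional methods usually treat emotion recognition as an explicitly supervised task, which makes deep emotional perception and reasoning hard to achieve. This work aims to raise the emotional intelligence of spoken dialogue models by implicitly modeling the user's emotional state and its causes. Method: Proposes the Injected Emotional-Attribution Thinking (IEAT) data construction strategy and a two-stage progressive training scheme: the first stage performs speech-text alignment and emotional attribute modeling via self-distillation, and the second stage conducts end-to-end cross-modal joint optimization to keep textual and spoken emotional expression consistent. Result: On the HumDial Emotional Intelligence benchmark, the approach achieves top-ranked performance on emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluation. Conclusion: IEAT effectively promotes the internalization of emotion-aware reasoning, and the resulting model understands and responds to human emotion more naturally, suggesting a new direction for human-like emotionally intelligent dialogue systems. Abstract: This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. The model is trained with a two-stage progressive strategy. The first stage performs speech-text alignment and emotional attribute modeling via self-distillation, while the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.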

[78] Text as a Universal Interface for Transferable Personalization

Yuting Liu,Jian Guan,Jia-Nan Li,Wei Wu,Jiang-Ming Yang,Jianzhe Zhao,Guibing Guo

Main category: cs.CL

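TL;DR: Proposes natural language as a universal interface for representing user preferences in LLMs, and uses a two-stage training framework to train AlignXplore+, an 8B model with strong cross-task and cross-model transferability that reaches state-of-the-art performance on nine benchmarks.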

Details Motivation: Existing personalization methods mostly represent user preferences as implicit, model-specific parameters, which lack interpretability and do not transfer across models or tasks. Method: Proposes natural language as a universal preference representation and builds a two-stage training framework, combining supervised fine-tuning on high-quality synthesized data with reinforcement learning, to train AlignXplore+, a universal preference reasoning model that generates textual preference summaries. Result: Experiments on nine benchmarks validate AlignXplore+ (8B), which outperforms larger open-source models and transfers well across tasks, model families, and interaction formats. Conclusion: Natural language is an efficient, interpretable, and general representation of user preferences, and the proposed framework supports more adaptable and interpretable personalized LLMs. Abstract: We study the problem of personalization in large language models (LLMs). Prior work predominantly represents user preferences as implicit, model-specific vectors or parameters, yielding opaque "black-box" profiles that are difficult to interpret and transfer across models and tasks. In contrast, we advocate natural language as a universal, model- and task-agnostic interface for preference representation. The formulation leads to interpretable and reusable preference descriptions, while naturally supporting continual evolution as new interactions are observed. To learn such representations, we introduce a two-stage training framework that combines supervised fine-tuning on high-quality synthesized data with reinforcement learning to optimize long-term utility and cross-task transferability. Based on this framework, we develop AlignXplore+, a universal preference reasoning model that generates textual preference summaries. Experiments on nine benchmarks show that our 8B model achieves state-of-the-art performance -- outperforming substantially larger open-source models -- while exhibiting strong transferability across tasks, model families, and interaction formats.

[79] Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization

Xueyun Tian,Minghua Ma,Bingbing Xu,Nuoyan Lyu,Wei Li,Heng Dong,Zheng Chu,Yuanzhuo Wang,Huawei Shen

Main category: cs.CL

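TL;DR: Proposes using negative chain-of-thought trajectories, those with incorrect final answers, during supervised fine-tuning to improve out-of-domain generalization, and designs GLOW, an adaptive loss weighting method, to exploit these trajectories effectively.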

Details Motivation: Standard practice trains only on chain-of-thought trajectories with correct answers and ignores negatives that contain valid intermediate reasoning but a wrong final answer. This wastes supervision, worsens overfitting, and limits out-of-domain generalization. Method: Systematically analyzes 22 recurring patterns in negative samples and proposes Gain-based LOss Weighting (GLOW), which dynamically adjusts each sample's loss weight according to its inter-epoch progress during training, making efficient use of both positive and negative trajectories. Result: On Qwen2.5-7B, GLOW yields a 5.51% out-of-domain gain over positive-only SFT; as an RL initialization it raises MMLU from 72.82% to 76.47%; negative trajectories also raise policy entropy at inference by 35.67%, which aids exploration. Conclusion: Retaining and properly exploiting negative chain-of-thought trajectories mitigates overfitting and strengthens generalization and exploration, and GLOW offers a new paradigm for more efficient data use. Abstract: Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectory demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically only retain trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits such distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.
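
The abstract reports a relative increase in policy entropy at inference but does not describe how it is measured, and GLOW's weighting formula is not given either. The sketch below therefore covers only the entropy side: it computes the mean Shannon entropy of next-token distributions from raw logits and a relative change between two runs. The helper names and the use of the natural logarithm are assumptions of this sketch, not the paper's protocol.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


def mean_policy_entropy(step_logits):
    """Average per-step entropy over one generated sequence."""
    return sum(token_entropy(logits) for logits in step_logits) / len(step_logits)


def relative_change(new, old):
    """Relative change, e.g. 0.25 means a 25% increase."""
    return (new - old) / old


# Toy logits: a sharply peaked policy versus a flatter one.
PEAKED = [[5.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0]]
FLAT = [[1.0, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
GAIN = relative_change(mean_policy_entropy(FLAT), mean_policy_entropy(PEAKED))
```

A flatter distribution has higher entropy, so `GAIN` is positive here; a higher-entropy policy spreads probability over more candidate tokens, which is what makes it useful for exploration.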

[80] Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei

Peng Wang,Xilin Tao,Siyi Yao,Jiageng Wu,Yuntao Zou,Zhuotao Tian,Libo Qin,Dagang Li

Main category: cs.CL

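TL;DR: Proposes the Subcultural Alignment Solver (SAS), a multi-agent framework that improves LLMs' ability to detect self-destructive behavior in subcultural contexts, using automatic retrieval and subculture alignment to address knowledge lag and semantic misalignment.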

Details Motivation: Self-destructive behavior is expressed in distinctive and oblique ways within subcultural groups, so existing LLMs struggle to identify it accurately and also face knowledge lag and semantic misunderstanding. Method: Proposes the SAS framework, a multi-agent architecture that combines automatic retrieval with a subculture alignment strategy to strengthen the model's understanding and detection of self-destructive behavior in subcultural contexts. Result: Experiments show that SAS outperforms OWL, a current advanced multi-agent framework, and is competitive with fine-tuned LLMs. Conclusion: SAS improves LLMs' ability to recognize self-destructive behavior in subcultural settings and provides a workable technical path and tooling for related research. Abstract: Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) are applied across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods have two main challenges: (1) Knowledge Lag: Subcultural slang evolves rapidly, faster than LLMs' training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we propose Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly enhancing the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.

[81] Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models

Yueqing Hu,Xinyang Peng,Shuting Peng,Hanqi Wang,Tianhong Wang

Main category: cs.CL

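TL;DR: Reasoning distillation uses supervised fine-tuning to imitate the chains of thought of large reasoning models, but this passive imitation fails to transfer the teacher's dynamic resource allocation policy, which is aligned with human cognitive cost. Student models show "Functional Alignment Collapse" and negative transfer.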

Details Motivation: Examines whether current reasoning distillation methods actually transfer the teacher model's reasoning structure, which is aligned with human cognitive cost. Method: Tests the "Hán Dān Xué Bù" (superficial mimicry) hypothesis across 14 models, comparing how well teacher models and SFT-distilled student models correlate with human difficulty, and analyzes the differences in their reasoning behavior. Result: Teacher models clearly track human difficulty scaling (mean correlation 0.64), whereas distilled students show a marked drop (mean 0.34), with some falling below their own pre-distillation baselines. SFT copies only the form of reasoning, such as verbosity, without acquiring the dynamic resource allocation mechanism. Conclusion: Human-like reasoning patterns are an emergent property of reinforcement learning and cannot be acquired through passive imitation; current reasoning distillation loses cognitive alignment because it ignores the active decision process. Abstract: Recent Large Reasoning Models trained via reinforcement learning exhibit a "natural" alignment with human cognitive costs. However, we show that the prevailing paradigm of reasoning distillation -- training student models to mimic these traces via Supervised Fine-Tuning (SFT) -- fails to transmit this cognitive structure. Testing the "Hán Dān Xué Bù" (Superficial Mimicry) hypothesis across 14 models, we find that distillation induces a "Functional Alignment Collapse": while teacher models mirror human difficulty scaling ($\bar{r}=0.64$), distilled students significantly degrade this alignment ($\bar{r}=0.34$), often underperforming their own pre-distillation baselines ("Negative Transfer"). Our analysis suggests that SFT induces a "Cargo Cult" effect, where students ritualistically replicate the linguistic form of reasoning (verbosity) without internalizing the teacher's dynamic resource allocation policy. Consequently, reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition is an emergent property of active reinforcement, not passive imitation.
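
The reported $\bar{r}$ values are mean correlations, though the abstract does not state exactly which quantities are correlated. As a sketch under the assumption that each model's reasoning length per item is correlated with a human difficulty score and the resulting Pearson coefficients are averaged across models, the computation might look like the following. All numbers below are invented for illustration and are not the paper's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_alignment(human_difficulty, tokens_by_model):
    """Average Pearson r between human difficulty and each model's reasoning length."""
    scores = [pearson_r(human_difficulty, tokens) for tokens in tokens_by_model.values()]
    return sum(scores) / len(scores)


# Hypothetical per-item human difficulty and reasoning-token counts.
HUMAN = [1.0, 2.0, 3.0, 4.0, 5.0]
TEACHERS = {"teacher_a": [120, 210, 330, 400, 560], "teacher_b": [90, 150, 200, 310, 380]}
STUDENTS = {"student_a": [400, 380, 420, 390, 450], "student_b": [500, 300, 520, 310, 505]}
TEACHER_R = mean_alignment(HUMAN, TEACHERS)
STUDENT_R = mean_alignment(HUMAN, STUDENTS)
```

In this toy data the teachers spend more tokens as items get harder, while the students are uniformly verbose, so the teachers' mean correlation is much higher; that is the qualitative pattern the abstract describes.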

[82] ArcAligner: Adaptive Recursive Aligner for Compressed Context Embeddings in RAG

Jianbo Li,Yi Jiang,Sendong Zhao,Bairui Hu,Haochun Wang,Bing Qin

Main category: cs.CL

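TL;DR: Proposes ArcAligner, a lightweight module with an adaptive gating mechanism that helps language models make better use of highly compressed context during generation, improving both efficiency and performance.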

Details Motivation: Existing context compression methods make the context hard for the LLM to understand when compression is too aggressive, so compression rate must be balanced against how usable the information remains. Method: Designs the ArcAligner module, integrated into the language model layers, with an adaptive gating mechanism that allocates extra computation only when the information is complex. Result: On several knowledge-intensive QA benchmarks, ArcAligner outperforms baselines at comparable compression rates, especially in multi-hop and long-tail settings. Conclusion: ArcAligner effectively mitigates the information loss caused by context compression, enabling efficient and accurate retrieval-augmented generation. Abstract: Retrieval-Augmented Generation (RAG) helps LLMs stay accurate, but feeding long documents into a prompt makes the model slow and expensive. This has motivated context compression, ranging from token pruning and summarization to embedding-based compression. While researchers have tried "compressing" these documents into smaller summaries or mathematical embeddings, there is a catch: the more you compress the data, the more the LLM struggles to understand it. To address this challenge, we propose ArcAligner (Adaptive recursive context *Aligner*), a lightweight module integrated into the language model layers to help the model better utilize highly compressed context representations for downstream generation. It uses an adaptive "gating" system that only adds extra processing power when the information is complex, keeping the system fast. Across knowledge-intensive QA benchmarks, ArcAligner consistently beats compression baselines at comparable compression rates, especially on multi-hop and long-tail settings. The source code is publicly available.

[83] Compositional Steering of Large Language Models with Steering Tokens

Gorjan Radevski,Kiril Gashteovski,Giwon Hong,Carolin Lawrence,Goran Glavaš

Main category: cs.CL

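TL;DR: Proposes compositional steering tokens, a new method for multi-behavior control. Natural language instructions are embedded into dedicated tokens, and a dedicated composition token enables efficient, generalizable joint control of multiple behaviors in LLM output, outperforming existing methods.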

Details Motivation: Existing work focuses on steering LLMs toward a single behavior, while compositional steering toward several behaviors at once remains underexplored. Producing controllable LLM output that satisfies multiple requirements matters for real-world applications. Method: Embeds individual behaviors, expressed as natural language instructions, into dedicated input tokens via self-distillation; operates in the input token space rather than the activation space; trains a dedicated composition token to learn combinations of behaviors; and tests generalization to unseen behaviors and unseen numbers of behaviors. Result: Across several LLM architectures, the method outperforms instruction, activation steering, and LoRA merging baselines. The composition token generalizes to unseen behavior combinations, and results improve further when steering tokens are combined with natural language instructions. Conclusion: Compositional steering tokens are an effective, flexible, and generalizable way to control multiple LLM behaviors, advancing multi-objective controllable generation. Abstract: Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, compositional steering -- i.e., steering LLMs simultaneously towards multiple behaviors -- remains an underexplored problem. In this work, we propose compositional steering tokens for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steers live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated composition token on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to unseen compositions, including those with unseen behaviors as well as those with an unseen number of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior control compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.

[84] SemPA: Improving Sentence Embeddings of Large Language Models through Semantic Preference Alignment

Ziyang Chen,Zhenxuan Huang,Yile Wang,Weiqin Wang,Lu Yin,Hui Huang

Main category: cs.CL

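TL;DR: Proposes SemPA, which uses sentence-level Direct Preference Optimization (DPO) to improve the semantic representations of LLMs while preserving their generative ability.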

Details Motivation: Existing sentence embedding methods based on generative LLMs either rely on fixed prompt templates or modify the model architecture; the former limits performance and the latter damages generative ability. Method: Uses sentence-level Direct Preference Optimization (DPO) to optimize the LLM on a paraphrase generation task, so that it learns to discriminate semantically equivalent sentences while keeping its generative capacity, and establishes a theoretical connection between DPO and contrastive learning under the Plackett-Luce model. Result: On semantic textual similarity tasks and several LLM benchmarks, SemPA achieves better semantic representations without sacrificing generative ability. Conclusion: SemPA effectively improves the sentence representations of LLMs while retaining their generative properties, suggesting a new direction for multi-purpose models. Abstract: Traditional sentence embedding methods employ token-level contrastive learning on non-generative pre-trained models. Recently, there have emerged embedding methods based on generative large language models (LLMs). These methods either rely on fixed prompt templates or involve modifications to the model architecture. The former lacks further optimization of the model and results in limited performance, while the latter alters the internal computational mechanisms of the model, thereby compromising its generative capabilities. We propose SemPA, a novel approach that boosts the sentence representations while preserving the generative ability of LLMs via semantic preference alignment. We leverage sentence-level Direct Preference Optimization (DPO) to efficiently optimize LLMs on a paraphrase generation task, where the model learns to discriminate semantically equivalent sentences while preserving inherent generative capacity. Theoretically, we establish a formal connection between DPO and contrastive learning under the Plackett-Luce model framework. Empirically, experimental results on both semantic textual similarity tasks and various benchmarks for LLMs show that SemPA achieves better semantic representations without sacrificing the inherent generation capability of LLMs.
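
For readers unfamiliar with DPO, the standard pairwise objective can be written in a few lines. The sketch below implements the usual DPO loss from sequence log-probabilities under the policy and a frozen reference model. How SemPA builds its preference pairs is not detailed in the abstract, so treating a paraphrase as "chosen" and a non-equivalent sentence as "rejected" is an assumption of this example, as are the numeric values.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the policy and the reference model.
    loss = -log(sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))))
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    if margin < -30:
        # For very negative margins, log(1 + exp(-m)) is approximately -m; this avoids overflow.
        return -margin
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities: the policy prefers the paraphrase more than the reference does.
LOSS_GOOD = dpo_loss(logp_chosen=-12.0, logp_rejected=-20.0,
                     ref_logp_chosen=-15.0, ref_logp_rejected=-16.0)
# The policy prefers the non-equivalent sentence more than the reference does.
LOSS_BAD = dpo_loss(logp_chosen=-18.0, logp_rejected=-13.0,
                    ref_logp_chosen=-15.0, ref_logp_rejected=-16.0)
```

When policy and reference agree exactly, the margin is zero and the loss equals `log 2`; the loss falls below that as the policy shifts probability toward the chosen sentence relative to the reference, and rises above it in the opposite case.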

[85] Code-Mix Sentiment Analysis on Hinglish Tweets

Aashi Garg,Aneshya Das,Arshi Arya,Anushka Goyal,Aditi

Main category: cs.CL

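TL;DR: Proposes an mBERT-based sentiment classification framework for Hinglish tweets, using subword tokenization to better handle the code-mixed language common on Indian social media.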

Details Motivation: Traditional NLP models struggle with the syntactic and semantic complexity of Hinglish, the mix of Hindi and English widely used on Indian social media, which leads to inaccurate sentiment analysis. Method: Fine-tunes mBERT on Hinglish tweets, drawing on its multilingual capabilities and subword tokenization to cope with spelling variations, slang, and out-of-vocabulary terms. Result: Delivers a high-performance, production-ready sentiment classification framework and sets a strong benchmark for multilingual NLP in low-resource, code-mixed environments. Conclusion: The framework improves the accuracy of sentiment analysis on Hinglish text and provides a reliable AI solution for brand monitoring. Abstract: The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish--a hybrid of Hindi and English--used widely in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
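
To make the role of subword tokenization concrete, here is a minimal sketch of the greedy longest-match-first procedure used by WordPiece, the tokenizer family behind mBERT. The tiny vocabulary and the Romanized Hinglish spellings are invented for illustration; the real mBERT vocabulary is far larger and would split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word.

    Continuation pieces carry a '##' prefix. If any position cannot be
    matched, the whole word maps to the unknown token.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (hypothetical, not taken from mBERT).
VOCAB = {"ach", "##cha", "##a", "##ha", "movie", "bahut", "##t"}
SPELLINGS = {w: wordpiece(w, VOCAB) for w in ["achcha", "acha", "achha", "xyz"]}
```

The three spelling variants of the same word all begin with the shared piece `ach`, so a model can relate them even though none appears in the vocabulary as a whole word, while a string with no matching pieces falls back to `[UNK]`.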

[86] How Human is AI? Examining the Impact of Emotional Prompts on Artificial and Human and Responsiveness

Florence Bernays,Marco Henriques Pereira,Jochen Menges

Main category: cs.CL

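TL;DR: An experiment shows that the emotions people express when interacting with ChatGPT, such as praise, anger, or blame, affect not only the quality of its answers but also its value priorities, and spill over into the language people use in later human-human communication.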

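Details Motivation: Investigates how emotional tone in human-AI interaction affects AI behavior and subsequent human communication, clarifying the role of emotion in human-AI interaction. Method: In a between-subject experiment, participants were asked to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, writing a public response and addressing an ethical dilemma; the study analyzes how the AI's responses change across emotion conditions and how participants' later language is affected. Result: Praise significantly improves ChatGPT's answers; anger yields a smaller improvement; blame does not improve answers but makes ChatGPT attend more to the public interest. On the ethical dilemma, anger reduces the priority given to corporate interests and blame strengthens the emphasis on protecting the public interest. Participants in the blame condition used more negative, hostile, and disappointed expressions in subsequent human-human communication. Conclusion: The emotions people express toward AI shape both the content and the value orientation of its output and spill over into later interpersonal communication, indicating that emotion has an important and lasting influence on human-AI collaboration. Abstract: This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher albeit smaller improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest. Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised ChatGPT for its responses. Together, our findings demonstrate that the emotional tone people apply in human-AI interactions not only shapes ChatGPT's outputs but also carries over into subsequent human-human communication.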

[87] Agent-as-a-Judge

Runyang You,Hongru Cai,Caiqi Zhang,Qiancheng Xu,Meng Liu,Tiezheng Yu,Yongqi Li,Wenjie Li

Main category: cs.CL

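TL;DR: Surveys the paradigm shift from LLM-as-a-Judge to Agent-as-a-Judge, proposes a systematic framework for understanding this evolution, and summarizes methods, applications, challenges, and future directions.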

Details Motivation: As evaluands become more complex, specialized, and multi-step, conventional LLM-as-a-Judge suffers from bias, shallow reasoning, and a lack of real-world verification, so a more reliable evaluation paradigm is needed. Method: Proposes a developmental taxonomy, identifies key dimensions, organizes core methodologies, surveys applications across general and professional domains, and analyzes frontier challenges and research directions. Result: Establishes the first comprehensive survey framework for Agent-as-a-Judge, systematically organizing the field's development, technical approaches, and application scenarios. Conclusion: Agent-as-a-Judge improves the robustness and verifiability of evaluation through planning, tool augmentation, and multi-agent collaboration, and the survey provides a clear roadmap for its future development. Abstract: LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.

[88] DocDancer: Towards Agentic Document-Grounded Information Seeking

Qintong Zhang,Xinjie Lv,Jialong Wu,Baixuan Li,Zhengwei Tao,Guochen Yan,Huanyao Zhang,Bin Wang,Jiahao Xu,Haitao Mi,Wentao Zhang

Main category: cs.CL

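TL;DR: Introduces DocDancer, an end-to-end trained open-source document question answering agent that improves document understanding and exploration through a tool-driven framework and a data synthesis method.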

Details Motivation: Existing document QA agents use tools ineffectively, rely largely on closed-source models, and lack high-quality training data. Method: Formulates document QA as an information-seeking problem, proposes a tool-driven agent framework, and designs an Exploration-then-Synthesis data synthesis pipeline to support end-to-end training. Result: The model's effectiveness is validated on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Conclusion: DocDancer shows the potential of open-source models for document QA when exploration and comprehension are modeled explicitly, and offers useful insights for agentic tool design and synthetic data. Abstract: Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Models trained on the synthesized data show their effectiveness on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights for the agentic tool design and synthetic data.

[89] RelayLLM: Efficient Reasoning via Collaborative Decoding

Chengsong Huang,Tong Zheng,Langlin Huang,Jinyuan Li,Haolin Liu,Jiaxin Huang

Main category: cs.CL

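TL;DR: RelayLLM is a new framework for efficient reasoning in which a small language model (SLM) and a large language model (LLM) decode collaboratively at the token level: the SLM actively calls the LLM only at critical tokens, sharply reducing compute cost.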

Details Motivation: LLM inference is costly and slow, while SLMs lack sufficient reasoning capacity; existing collaborative methods work at too coarse a granularity and waste computation. Method: Proposes the RelayLLM framework, in which the SLM acts as an active controller that dynamically invokes the LLM for critical tokens via a special command during generation, and uses two-stage training (warm-up and GRPO) to balance independent generation against help-seeking. Result: Across six benchmarks, RelayLLM reaches an average accuracy of 49.52% while invoking the LLM for only 1.07% of generated tokens, a 98.2% cost reduction compared to performance-matched random routers. Conclusion: Through fine-grained collaborative decoding, RelayLLM narrows the performance gap between the SLM and the LLM, keeping performance high while greatly reducing inference cost. Abstract: Using Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
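
The control flow of token-level relaying can be sketched with stub models. In the code below, the small model either emits a token or a special command, and on the command the large model supplies the next token. The command string `<call>`, the one-token-per-call assumption, and the toy step functions are hypothetical choices for this sketch; the abstract does not specify these details.

```python
CALL = "<call>"  # hypothetical special command emitted by the small model
EOS = "<eos>"


def relay_decode(slm_step, llm_step, prompt, max_tokens=32):
    """Greedy decoding where the SLM drives and the LLM fills in tokens on request.

    Returns the generated tokens and the fraction of them produced by the LLM.
    """
    out, relayed = [], 0
    while len(out) < max_tokens:
        token = slm_step(prompt + out)
        if token == CALL:
            # Assumption of this sketch: the LLM contributes exactly one token per call.
            token = llm_step(prompt + out)
            relayed += 1
        if token == EOS:
            break
        out.append(token)
    return out, relayed / max(len(out), 1)


# Toy stand-ins for real models: both know the target, but the SLM asks for help on "42".
TARGET = ["The", "answer", "is", "42", EOS]


def toy_slm(context):
    nxt = TARGET[len(context) - 1]  # the prompt holds one token
    return CALL if nxt == "42" else nxt


def toy_llm(context):
    return TARGET[len(context) - 1]


TOKENS, LLM_FRACTION = relay_decode(toy_slm, toy_llm, prompt=["Q"])
```

In this toy run the large model is consulted for one of four generated tokens. A coarse-grained router would instead send the entire query to one model or the other, which is the waste the abstract argues against.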

[90] Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference

Rasmus Blanck,Bill Noble,Stergios Chatzikyriakidis

Main category: cs.CL

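TL;DR: This paper examines how the logical relations in the Natural Language Inference (NLI) task should be interpreted. By analyzing shared-premise items in the SNLI dataset and items generated by large language models, it evaluates the meta-inferential consistency of trained models and thereby reveals which type of logical relation the dataset encodes.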

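Details Motivation: NLI is a key task in natural language understanding, but its logical properties are often misunderstood or mischaracterized; clarifying the notion of inference captured by the NLI label set is essential for interpreting model performance correctly. Method: The authors formulate three possible readings of the NLI label set and analyze their meta-inferential properties using SNLI items with shared premises and items generated by large language models. Result: Evaluating the meta-inferential consistency of models trained on SNLI shows that one reading of the logical relations implicit in the dataset is more prominent than the others. Conclusion: The study reveals the type of logical inference that SNLI actually encodes, which supports a more accurate understanding of the reasoning abilities of language models trained on it. Abstract: Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.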

[91] Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems

Jihao Zhao,Ding Chen,Zhaoxin Fan,Kerun Xu,Mengting Hu,Bo Tang,Feiyu Xiong,Zhiyu Li

Main category: cs.CL

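TL;DR: This paper proposes the Inside Out framework, which uses a globally maintained PersonaTree for long-term user profiling and trains a lightweight MemListener with reinforcement learning to update memory dynamically; it outperforms existing methods at suppressing contextual noise and maintaining persona consistency.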

Details Motivation: Existing long-term personalized dialogue systems must reconcile unbounded interaction streams with finite context limits, and suffer from memory noise accumulation, reasoning degradation, and persona inconsistency. Method: The Inside Out framework uses a PersonaTree whose trunk is constrained by an initial schema, giving controllable growth and memory compression; a lightweight MemListener, trained by reinforcement learning with process-based rewards, emits executable operations (ADD, UPDATE, DELETE, NO_OP) that evolve the tree; during response generation the PersonaTree is used directly, or an agentic mode is triggered on demand to supply more detail. Result: Experiments show that PersonaTree outperforms full-text concatenation and other personalized memory systems at suppressing contextual noise and maintaining persona consistency; the small MemListener model matches or even exceeds strong reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro in memory-operation decisions. Conclusion: With a structured, interpretable PersonaTree and a lightweight MemListener, Inside Out achieves efficient and consistent long-term personalized dialogue modeling and offers a new approach to memory noise and consistency problems. Abstract: Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.
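
To make the {ADD, UPDATE, DELETE, NO_OP} operations concrete, here is a minimal sketch of a schema-constrained tree that such operations could act on. The branch names in `SCHEMA`, the two-level dictionary layout, and the `apply_op` helper are assumptions for illustration only; the abstract does not describe the real PersonaTree schema or operation format.

```python
from typing import Dict, Optional

Tree = Dict[str, Dict[str, str]]

# Hypothetical trunk categories; the paper's actual schema is not given in the abstract.
SCHEMA = ("biography", "preferences", "goals")


def new_tree() -> Tree:
    """Create an empty tree whose trunk (top-level branches) is fixed by the schema."""
    return {branch: {} for branch in SCHEMA}


def apply_op(tree: Tree, op: str, branch: Optional[str] = None,
             leaf: Optional[str] = None, value: str = "") -> Tree:
    """Apply one memory operation in place; the trunk itself can never change."""
    if op == "NO_OP":
        return tree
    if branch not in tree:
        raise KeyError(f"branch {branch!r} is not part of the fixed schema")
    if op == "ADD":
        if leaf in tree[branch]:
            raise ValueError(f"leaf {leaf!r} already exists; use UPDATE")
        tree[branch][leaf] = value
    elif op == "UPDATE":
        if leaf not in tree[branch]:
            raise KeyError(f"leaf {leaf!r} does not exist; use ADD")
        tree[branch][leaf] = value
    elif op == "DELETE":
        tree[branch].pop(leaf, None)
    else:
        raise ValueError(f"unknown operation {op!r}")
    return tree


# A made-up stream of operations such as a listener model might emit.
tree = new_tree()
ops = [
    ("ADD", "preferences", "drink", "coffee"),
    ("UPDATE", "preferences", "drink", "green tea"),
    ("ADD", "goals", "fitness", "run a 10k"),
    ("DELETE", "goals", "fitness"),
    ("NO_OP",),
]
for op in ops:
    apply_op(tree, *op)
print(tree)
```

Rejecting operations on branches outside the schema is one simple way to get the "controllable growth" the abstract mentions: leaves can change freely, while the trunk stays fixed.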

[92] LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation

Samy Haffoudhi,Fabian M. Suchanek,Nils Holzenberger

Main category: cs.CL

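TL;DR: LELA is a modular coarse-to-fine entity linking method that uses large language models without any fine-tuning and performs well across multiple domains and knowledge bases.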

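Details Motivation: Entity linking is a foundational step for knowledge graph construction, question answering, and information extraction, but existing methods usually require fine-tuning for each setting and lack generality and flexibility. Method: LELA is a modular coarse-to-fine framework that draws on the capabilities of large language models and supports different target domains, knowledge bases, and LLMs without any fine-tuning. Result: Experiments across various entity linking settings show that LELA is highly competitive with fine-tuned methods and substantially outperforms other methods that do not fine-tune. Conclusion: LELA offers a general, flexible, and efficient entity linking solution that adapts to many application scenarios without fine-tuning. Abstract: Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foundational step in tasks such as knowledge graph construction, question-answering, and information extraction. Our method, LELA, is a modular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowledge bases and LLMs, without any fine-tuning phase. Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.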

[93] Measuring and Fostering Peace through Machine Learning and Artificial Intelligence

P. Gilda,P. Dungarwal,A. Thongkham,E. T. Ajayi,S. Choudhary,T. M. Terol,C. Lam,J. P. Araujo,M. McFadyen-Mungalln,L. S. Liebovitch,P. T. Coleman,H. West,K. Sieck,S. Carter

Main category: cs.CL

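TL;DR: This study uses machine learning and artificial intelligence to quantify levels of peace from news and social media, and develops a Chrome extension called MirrorMirror that gives users real-time feedback on the peacefulness of the media they watch, with the aim of promoting more respectful, nuanced, and informative communication.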

Details Motivation: Social media content tends to provoke emotions such as anger to increase clicks, which affects how objectively users understand information; tools are needed to measure and improve the peacefulness and quality of media content. Method: Neural networks analyze text embeddings of online news to measure levels of peace; on social media, word-level (GoEmotions) and context-level (large language model) methods assess social dimensions related to peace; the MirrorMirror browser extension is developed and tested to give real-time peacefulness feedback on YouTube videos. Result: The model trained on news data retains high accuracy across datasets; the MirrorMirror extension was successfully developed and tested and can report the peacefulness of the media a user is watching. Conclusion: AI techniques can quantify levels of peace in media, and MirrorMirror has the potential to become an open-source tool that helps content creators, platforms, and users understand media tone and supports a healthier information ecosystem. Abstract: We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset, also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people aged 20-40 view most of their news daily through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making you angry to engage you, to increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.

[94] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu,Xin Dong,Ximing Lu,Shizhe Diao,Peter Belcak,Mingjie Liu,Min-Hung Chen,Hongxu Yin,Yu-Chiang Frank Wang,Kwang-Ting Cheng,Yejin Choi,Jan Kautz,Pavlo Molchanov

Main category: cs.CL

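TL;DR: This paper proposes GDPO, a new optimization method for multi-reward reinforcement learning. It addresses the collapse of advantage values that occurs when GRPO handles several distinct rewards by decoupling the normalization of each reward, which improves training stability and model performance.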

Details Motivation: Applying GRPO directly in multi-reward settings makes different reward combinations collapse into the same advantage values, which lowers the resolution of the training signal, harms convergence, and can even cause training failure; a better-suited multi-reward optimization method is needed. Method: Group reward-Decoupled Normalization Policy Optimization (GDPO) normalizes each individual reward separately during policy optimization, which avoids interference between rewards and preserves their relative differences for more accurate multi-reward optimization. Result: On three tasks (tool calling, math reasoning, and coding reasoning), GDPO clearly outperforms GRPO on accuracy, bug ratio, and constraint adherence metrics such as format and length, and shows stronger training stability and generalization. Conclusion: By decoupling multi-reward normalization, GDPO resolves the advantage collapse in GRPO and provides a better, more stable training framework for multi-reward language model alignment. Abstract: As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
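
The collapse described above can be reproduced with a small numeric sketch. It compares "sum the rewards, then normalize within the group" against "normalize each reward within the group, then sum". This is a simplified reading of the abstract: the function names are hypothetical, the reward values are made up, and any further steps in the actual GDPO method (for example reward weighting or additional normalization) are not modeled here.

```python
from statistics import mean, pstdev
from typing import List, Sequence

EPS = 1e-8  # avoids division by zero when all values in a group are equal


def group_normalize(values: Sequence[float]) -> List[float]:
    """Standardize values within one group of rollouts (zero mean, unit variance)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def summed_then_normalized(rewards: Sequence[Sequence[float]]) -> List[float]:
    """GRPO-style: rewards[i][k] is reward k of rollout i; sum per rollout, then normalize."""
    return group_normalize([sum(r) for r in rewards])


def decoupled_then_summed(rewards: Sequence[Sequence[float]]) -> List[float]:
    """Decoupled: normalize each reward column across the group, then sum per rollout."""
    per_reward = [group_normalize(column) for column in zip(*rewards)]
    return [sum(vals) for vals in zip(*per_reward)]


# Four rollouts for one prompt, each scored on (correctness, format); values are made up.
group = [(1, 1), (1, 0), (0, 1), (1, 0)]

coupled = summed_then_normalized(group)
decoupled = decoupled_then_summed(group)
print([round(a, 3) for a in coupled])
print([round(a, 3) for a in decoupled])
```

Rollouts 2 and 3 have different reward combinations, (1, 0) and (0, 1), but the same sum, so the coupled variant assigns them identical advantages. The decoupled variant separates them because correctness and format have different group statistics.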

cs.CV [Back]

[95] Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes

Chenye Meng,Zejian Li,Zhongni Liu,Yize Li,Changle Xie,Kaixin Jia,Ling Yang,Huanghuang Deng,Shiying Ding,Shengyuan Zhang,Jiayi Li,Lingyun Sun

Main category: cs.CV

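TL;DR: Proposes a framework for aligning diffusion models to hierarchical, fine-grained evaluation criteria, using a two-stage method to improve how well generated images align with human expertise.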

Details Motivation: Post-training alignment of diffusion models relies on simplified signals such as scalar rewards or binary preferences, which cannot reflect complex, hierarchical expert knowledge and limit alignment on fine-grained tasks. Method: Domain experts first build a tree-structured hierarchical evaluation scheme that decomposes image quality into multiple positive and negative attributes. A two-stage alignment framework follows: stage one injects domain knowledge into an auxiliary diffusion model through supervised fine-tuning; stage two introduces Complex Preference Optimization (CPO), which extends DPO to non-binary, hierarchical preference learning by maximizing the probability of positive attributes while minimizing that of negative attributes. Result: For painting generation, CPO training on a dataset annotated with fine-grained attributes significantly improves generation quality and alignment with expert knowledge. Conclusion: CPO provides an effective route to fine-grained alignment of diffusion models with complex human expertise and increases the potential of generative models in professional domains. Abstract: Post-training alignment of diffusion models relies on simplified signals, such as scalar rewards or binary preferences. This limits alignment with complex human expertise, which is hierarchical and fine-grained. To address this, we first construct a set of hierarchical, fine-grained evaluation criteria with domain experts, which decomposes image quality into multiple positive and negative attributes organized in a tree structure. Building on this, we propose a two-stage alignment framework. First, we inject domain knowledge into an auxiliary diffusion model via Supervised Fine-Tuning. Second, we introduce Complex Preference Optimization (CPO) that extends DPO to align the target diffusion to our non-binary, hierarchical criteria. Specifically, we reformulate the alignment problem to simultaneously maximize the probability of positive attributes while minimizing the probability of negative attributes with the auxiliary diffusion. We instantiate our approach in the domain of painting generation and conduct CPO training with an annotated dataset of paintings with fine-grained attributes based on our criteria. Extensive experiments demonstrate that CPO significantly enhances generation quality and alignment with expertise, opening new avenues for fine-grained criteria alignment.

[96] Embedding Textual Information in Images Using Quinary Pixel Combinations

A V Uday Kiran Kandala

Main category: cs.CV

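TL;DR: Proposes a new text steganography method based on quinary combinations of pixel intensities in RGB space, which encodes a complete character within a single pixel and offers high embedding efficiency, low computational overhead, and good image quality.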

Details Motivation: Existing steganography techniques mostly rely on LSB/MSB manipulation, transform-domain methods, or deep learning, and suffer from visible noise, high computational complexity, or the need to spread one symbol over several pixels, which makes it hard to combine imperceptibility with efficiency. Method: Five controlled intensity variations in each of the R, G, and B channels form up to 125 combinations, which are mapped to text symbols so that one character is encoded and decoded within a single pixel. Result: MSE, MAE, SNR, PSNR, SSIM, histogram, and heatmap comparisons show no significant distortion between the original and stego images, and embedding a character in a single pixel improves embedding efficiency. Conclusion: The method achieves efficient, low-distortion text-in-image steganography, compares favorably with LSB, transform-domain, and deep learning methods that need multiple pixels or steps, and suits scenarios with strict requirements on computational overhead and image quality. Abstract: This paper presents a novel technique for embedding textual data into images using quinary combinations of pixel intensities in RGB space. Existing methods predominantly rely on least and most significant bit (LSB & MSB) manipulation, Pixel Value Differencing (PVD), spatial perturbations in RGB channels, transform domain based methods, quantization methods, edge and region based methods and, more recently, deep learning methods and generative AI techniques for hiding textual information in the spatial domain of images. Most of them depend on pixel intensity flipping over multiple pixels, such as LSB and combinations of LSB based methodologies, and on transform coefficients, often resulting in noise. Encoding and decoding are deterministic in most of the existing approaches and are computationally heavy in the case of higher models such as deep learning and gen AI approaches. The proposed method works on quinary pixel intensity combinations in RGB space, where five controlled different pixel intensity variations in each of the R, G, and B channels formulate up to one hundred and twenty five distinct pixel intensity combinations. These combinations are mapped to textual symbols, enabling the representation of uppercase and lowercase alphabetic characters, numeric digits, whitespace, and commonly used special characters. Metrics such as MSE, MAE, SNR, PSNR, SSIM, Histogram Comparison and Heatmap analysis were evaluated on both the original and encoded images and indicate no significant distortion in the images. Furthermore, the method achieves improved embedding efficiency by encoding a complete textual symbol within a single RGB pixel, in contrast to LSB and MSB based approaches that typically require multiple pixels or multi-step processes, as well as transform and learning based methods that incur higher computational overhead.
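
The counting argument (five levels per channel, so 5 x 5 x 5 = 125 symbols per pixel) can be illustrated with a short sketch. The concrete mapping below is an assumption and is not taken from the paper: each channel is snapped to a multiple of 5 and a base-5 digit is added, so decoding is simply the value modulo 5. The `ALPHABET` ordering and all function names are likewise hypothetical.

```python
import string
from typing import List, Tuple

Pixel = Tuple[int, int, int]

# 95 printable symbols, which fits into the 125 available combinations.
ALPHABET = string.ascii_letters + string.digits + " " + string.punctuation


def encode_pixel(pixel: Pixel, ch: str) -> Pixel:
    """Hide one character in one pixel as three base-5 digits (one per channel)."""
    k = ALPHABET.index(ch)
    digits = (k // 25, (k // 5) % 5, k % 5)
    # Snap each channel to a multiple of 5 (capped at 250 so the result stays <= 254).
    return tuple(min(v // 5 * 5, 250) + d for v, d in zip(pixel, digits))


def decode_pixel(pixel: Pixel) -> str:
    """Recover the character from the three channel values modulo 5."""
    r, g, b = (v % 5 for v in pixel)
    return ALPHABET[r * 25 + g * 5 + b]


def embed(pixels: List[Pixel], text: str) -> List[Pixel]:
    if len(text) > len(pixels):
        raise ValueError("text is longer than the number of available pixels")
    stego = list(pixels)
    for i, ch in enumerate(text):
        stego[i] = encode_pixel(pixels[i], ch)
    return stego


def extract(pixels: List[Pixel], length: int) -> str:
    return "".join(decode_pixel(p) for p in pixels[:length])


cover = [(255, 0, 128), (12, 200, 77), (90, 91, 92), (3, 254, 251), (40, 41, 42), (7, 7, 7)]
stego = embed(cover, "Hi 5!")
print(extract(stego, 5))
```

Under this particular mapping each channel value changes by at most 5 intensity levels, which is consistent in spirit with the low-distortion claim, though the paper's own mapping and its measured metrics may differ.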

[97] Unified Text-Image Generation with Weakness-Targeted Post-Training

Jiahui Chen,Philippe Hansen-Estruch,Xiaochuang Han,Yushi Hu,Emily Dinan,Amita Kamath,Michal Drozdzal,Reyhane Askari-Hemmat,Luke Zettlemoyer,Marjan Ghazvininejad

Main category: cs.CV

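TL;DR: This paper proposes a fully unified text-image generation approach that uses offline, reward-weighted post-training on self-generated synthetic data, so that the model moves autonomously from textual reasoning to image generation within a single inference pass, significantly improving multimodal generation performance.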

Details Motivation: Existing unified multimodal generation systems rely on explicit modality switching, which limits cross-modal coupling and prevents automatic multimodal generation. Method: An offline, reward-weighted post-training strategy is applied using fully self-generated synthetic data, and the effect of different post-training data strategies on joint text-image generation is explored. Result: Performance improves on four different text-to-image benchmarks, confirming the effectiveness of reward weighting and targeted data design. Conclusion: A fully unified generation architecture combined with reward weighting and carefully designed post-training data strengthens cross-modal coupling and advances automatic multimodal generation. Abstract: Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training with fully self-generated synthetic data, our approach enables improvements in multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and strategically designed post-training data.

[98] ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

Mohsen Ghafoorian,Amirhossein Habibian

Main category: cs.CV

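TL;DR: ReHyAt is a recurrent hybrid attention mechanism that combines the strengths of softmax attention and linear attention, yielding video diffusion models with high quality, low training cost, and linear computational complexity.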

Details Motivation: Transformer-based video diffusion models scale poorly to long sequences because attention is quadratic, and they are expensive to train; a more efficient and scalable architecture is needed. Method: ReHyAt is a recurrent hybrid attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention in a chunk-wise recurrent form, together with a lightweight distillation and fine-tuning pipeline that transfers knowledge efficiently from existing softmax models. Result: On VBench, VBench-2.0, and a human preference study, ReHyAt reaches state-of-the-art video quality, reduces attention complexity from quadratic to linear, and cuts training cost by two orders of magnitude to about 160 GPU hours. Conclusion: ReHyAt gives video diffusion models an efficient, scalable solution that supports long-duration and on-device generation, and its distillation pipeline can be applied to future state-of-the-art models. Abstract: Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while remaining competitive in quality. Our light-weight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at https://qualcomm-ai-research.github.io/rehyat.

[99] SCAR-GS: Spatial Context Attention for Residuals in Progressive Gaussian Splatting

Diego Revilla,Pooja Suresh,Anand Bhojan,Ooi Wei Tsang

Main category: cs.CV

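TL;DR: Proposes a new progressive codec for 3D Gaussian Splatting that uses residual vector quantization and an auto-regressive entropy model guided by a multi-resolution hash grid to improve compression efficiency and rate-distortion performance.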

Details Motivation: Existing compression methods for 3D Gaussian Splatting rely on scalar quantization, which struggles to capture correlations among high-dimensional feature vectors and limits rate-distortion performance; the large storage requirements also hinder cloud and streaming deployment. Method: Residual Vector Quantization replaces traditional scalar quantization, and an auto-regressive entropy model based on a multi-resolution hash grid compresses the Gaussian primitive features layer by layer. Result: The codec achieves more efficient progressive compression and better rate-distortion performance, significantly lowering bitrate on medium and large scenes while supporting high-quality real-time novel view synthesis. Conclusion: The method significantly improves the compression efficiency of 3D Gaussian Splatting models while preserving high-fidelity rendering, making it better suited to cloud transmission and streaming. Abstract: Recent advances in 3D Gaussian Splatting have allowed for real-time, high-fidelity novel view synthesis. Nonetheless, these models have significant storage requirements for large and medium-sized scenes, hindering their deployment over cloud and streaming services. Some of the most recent progressive compression techniques for these models rely on progressive masking and scalar quantization techniques to reduce the bitrate of Gaussian attributes using spatial context models. While effective, scalar quantization may not optimally capture the correlations of high-dimensional feature vectors, which can potentially limit the rate-distortion performance. In this work, we introduce a novel progressive codec for 3D Gaussian Splatting that replaces traditional methods with a more powerful Residual Vector Quantization approach to compress the primitive features. Our key contribution is an auto-regressive entropy model, guided by a multi-resolution hash grid, that accurately predicts the conditional probability of each successive transmitted index, allowing for coarse and refinement layers to be compressed with high efficiency.
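
Residual vector quantization, the building block named above, can be sketched in a few lines. This is the generic textbook form with tiny hand-written codebooks, not the paper's codec: the codebook values are made up, and the entropy model and hash grid are not modeled at all.

```python
from typing import List, Sequence

Vector = List[float]
Codebook = List[Vector]


def nearest(codebook: Codebook, vec: Sequence[float]) -> int:
    """Index of the codeword with the smallest squared Euclidean distance to vec."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))


def rvq_encode(vec: Sequence[float], codebooks: List[Codebook]) -> List[int]:
    """Quantize vec stage by stage; each stage encodes what the previous stages missed."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: List[Codebook]) -> Vector:
    """Sum the selected codewords; passing fewer indices gives a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Two made-up stages: a coarse codebook and a refinement codebook.
coarse = [[0.0, 0.0], [4.0, 4.0], [-4.0, -4.0]]
fine = [[0.5, 0.5], [0.5, -0.5], [-0.5, 0.5], [-0.5, -0.5]]
codebooks = [coarse, fine]

indices = rvq_encode([4.6, 3.5], codebooks)
print(indices, rvq_decode(indices[:1], codebooks), rvq_decode(indices, codebooks))
```

Decoding from a prefix of the index list is what makes the representation progressive: a receiver can render from the coarse layer first and refine as further indices arrive.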

[100] Comparative Analysis of Custom CNN Architectures versus Pre-trained Models and Transfer Learning: A Study on Five Bangladesh Datasets

Ibrahim Tanvir,Alif Ruslan,Sartaj Solaiman

Main category: cs.CV

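TL;DR: This study compares custom CNNs with pre-trained models (ResNet-18, VGG-16) on five Bangladeshi image classification datasets and finds that transfer learning with fine-tuning clearly outperforms both training from scratch and feature extraction, especially on complex tasks with few samples.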

Details Motivation: To determine which deep learning approach (custom CNN or pre-trained model) works better for practical image classification when resources are limited and data are diverse. Method: Three approaches (a custom CNN, feature extraction, and transfer learning with fine-tuning) are compared on five real-world Bangladeshi image datasets in terms of accuracy, model size, and training efficiency. Result: Transfer learning, in particular fine-tuned ResNet-18, performs best on all datasets, with accuracy gains of 3% to 76% and 100% accuracy on the Road Damage BD dataset; the custom CNN has fewer parameters (3.4M) and trains faster but performs worse on complex tasks. Conclusion: For image classification, especially complex scenarios with limited data, pre-trained models with fine-tuning are the better choice; for simple tasks or constrained resources, a custom CNN is more efficient. Abstract: This study presents a comprehensive comparative analysis of custom-built Convolutional Neural Networks (CNNs) against popular pre-trained architectures (ResNet-18 and VGG-16) using both feature extraction and transfer learning approaches. We evaluated these models across five diverse image classification datasets from Bangladesh: Footpath Vision, Auto Rickshaw Detection, Mango Image Classification, Paddy Variety Recognition, and Road Damage Detection. Our experimental results demonstrate that transfer learning with fine-tuning consistently outperforms both custom CNNs built from scratch and feature extraction methods, achieving accuracy improvements ranging from 3% to 76% across different datasets. Notably, ResNet-18 with fine-tuning achieved perfect 100% accuracy on the Road Damage BD dataset. While custom CNNs offer advantages in model size (3.4M parameters vs. 11-134M for pre-trained models) and training efficiency on simpler tasks, pre-trained models with transfer learning provide superior performance, particularly on complex classification tasks with limited training data. This research provides practical insights for practitioners in selecting appropriate deep learning approaches based on dataset characteristics, computational resources, and performance requirements.

[101] PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache

Kunyang Li,Mubarak Shah,Yuzhang Shang

Main category: cs.CV

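TL;DR: This paper proposes PackCache, a training-free KV-cache management method that dynamically compacts the cache to improve the inference efficiency of unified autoregressive video generation models and significantly accelerates long-sequence generation.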

Details Motivation: In unified autoregressive models for multimodal tasks, the KV-cache grows linearly with generation length and becomes the main bottleneck for inference efficiency and generation length, particularly for long video generation. Method: Based on an analysis of the spatiotemporal properties of KV-cache tokens, PackCache combines three mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget by temporal distance, and spatially preserving position embedding that keeps the 3D structure coherent. Result: On 48-frame sequences, PackCache speeds up end-to-end generation by 1.7-2.2x; on the final four frames, the most expensive part, it achieves 2.6x and 3.7x acceleration on A40 and H200, respectively. Conclusion: PackCache effectively mitigates KV-cache growth, significantly improves the inference efficiency of unified autoregressive models, and offers a practical solution for long-sequence video generation. Abstract: A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, on the final four frames (the portion most impacted by the progressively expanding KV-cache and thus the most expensive segment of the clip), PackCache delivers a 2.6x and 3.7x acceleration on A40 and H200, respectively, for 48-frame videos.
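
The idea of allocating cache budget by temporal distance can be sketched as follows. The exponential decay, the `decay` parameter, and both function names are assumptions for illustration; the abstract does not say which decay model PackCache uses or how tokens within a frame are chosen.

```python
from typing import List


def allocate_frame_budget(num_prev_frames: int, total_budget: int,
                          tokens_per_frame: int, decay: float = 0.5) -> List[int]:
    """Tokens to keep for each previous frame, nearest frame first.

    Frames further back in time get a geometrically smaller share of the budget.
    """
    weights = [decay ** distance for distance in range(1, num_prev_frames + 1)]
    norm = sum(weights)
    return [min(tokens_per_frame, int(total_budget * w / norm)) for w in weights]


def cache_size(num_condition_tokens: int, frame_budgets: List[int]) -> int:
    """Condition tokens (text and conditioning image) are always kept as anchors."""
    return num_condition_tokens + sum(frame_budgets)


# Made-up sizes: 32 condition tokens, 4 previous frames of 100 tokens each.
budgets = allocate_frame_budget(num_prev_frames=4, total_budget=150, tokens_per_frame=100)
compact = cache_size(32, budgets)
full = cache_size(32, [100] * 4)
print(budgets, compact, full)
```

Keeping the condition tokens outside the decaying budget mirrors the abstract's observation that text and conditioning-image tokens act as persistent anchors, while attention to earlier frames fades with distance.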

[102] Combining facial videos and biosignals for stress estimation during driving

Paraskevi Valergaki,Vassilis C. Nicodemou,Iason Oikonomidis,Antonis Argyros,Anastasios Roussos

Main category: cs.CV

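TL;DR: Proposes EMOCA-based 3D facial features combined with cross-modal attention for recognizing driver stress, with a significant improvement in performance.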

Details Motivation: Recognizing stress reliably from facial video is difficult because stress is subjective and facial expressions are under voluntary control; existing methods mostly rely on Facial Action Units, and the role of disentangled 3D facial geometry is underexplored. Method: EMOCA is used to extract 3D expression and pose coefficients, whose changes across stress phases are analyzed with hypothesis tests; a Transformer-based temporal modeling framework is then built to compare unimodal, early-fusion, and cross-modal attention strategies. Result: 41 of 56 coefficients show consistent responses comparable to physiological markers; cross-modal attention fusion of EMOCA and physiological signals reaches an AUROC of 92% and an accuracy of 86.7%, and EMOCA-gaze fusion reaches an AUROC of 91.8%. Conclusion: Temporal modeling and cross-modal attention effectively improve stress recognition based on 3D facial geometry and support the potential of EMOCA coefficients as stress markers. Abstract: Reliable stress recognition from facial videos is challenging due to stress's subjective nature and voluntary facial control. While most methods rely on Facial Action Units, the role of disentangled 3D facial geometry remains underexplored. We address this by analyzing stress during distracted driving using EMOCA-derived 3D expression and pose coefficients. Paired hypothesis tests between baseline and stressor phases reveal that 41 of 56 coefficients show consistent, phase-specific stress responses comparable to physiological markers. Building on this, we propose a Transformer-based temporal modeling framework and assess unimodal, early-fusion, and cross-modal attention strategies. Cross-Modal Attention fusion of EMOCA and physiological signals achieves the best performance (AUROC 92%, Accuracy 86.7%), with EMOCA-gaze fusion also competitive (AUROC 91.8%). This highlights the effectiveness of temporal modeling and cross-modal attention for stress recognition.

[103] Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection

Maxim Clouser,Kia Khezeli,John Kalantari

Main category: cs.CV

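TL;DR: By fine-tuning a LoRA-adapted flow-matching foundation model on only 100 paired images, the authors achieve cross-spectral translation from RGB to infrared and SAR and generate synthetic data that improves downstream detection.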

Details Motivation: Foundation models trained on visible-light imagery are hard to apply directly in safety-critical settings that depend on non-visible modalities such as infrared and SAR, and the lack of large annotated non-visible datasets limits performance; the question is how to achieve cross-spectral transfer from a small number of paired samples. Method: Low-rank adaptation (LoRA) modules are inserted into the FLUX.1 Kontext flow-matching model and fine-tuned on only 100 image pairs per domain (RGB to IR on KAIST, RGB to SAR on M4-SAR) to perform cross-spectral image translation; LPIPS is used to predict downstream performance, and the synthetic images are used to train object detectors. Result: LPIPS on 50 held-out pairs is a good predictor of mAP, with lower LPIPS corresponding to higher detection accuracy; synthetic infrared images generated from external RGB data improve pedestrian detection on KAIST, and synthetic SAR images combined with a small amount of real SAR significantly improve infrastructure detection on M4-SAR. Conclusion: Few-shot LoRA adaptation of flow-matching foundation models is a promising way to support non-visible modalities, producing high-quality cross-spectral images from very few paired samples and improving downstream tasks. Abstract: Foundation models for vision are predominantly trained on RGB data, while many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator using only a few co-measured examples, and whether the resulting synthetic data can enhance downstream detection. Starting from FLUX.1 Kontext, we insert low-rank adaptation (LoRA) modules and fine-tune them on just 100 paired images per domain for two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. The adapted model translates RGB images into pixel-aligned IR/SAR, enabling us to reuse existing bounding boxes and train object detection models purely in the target modality. Across a grid of LoRA hyperparameters, we find that LPIPS computed on only 50 held-out pairs is a strong proxy for downstream performance: lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR test data. Using the best LPIPS-selected LoRA adapter, synthetic IR from external RGB datasets (LLVIP, FLIR ADAS) improves KAIST IR pedestrian detection, and synthetic SAR significantly boosts infrastructure detection on M4-SAR when combined with limited real SAR. Our results suggest that few-shot LoRA adaptation of flow-matching foundation models is a promising path toward foundation-style support for non-visible modalities.
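
For readers unfamiliar with low-rank adaptation, the following sketch shows the standard LoRA update on a single linear layer: the frozen weight `W` is augmented by a trainable low-rank product scaled by `alpha / r`. It is a generic, dependency-free illustration with made-up matrices, not the FLUX.1 Kontext adapter configuration used in the paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector,
                 alpha: float, r: int) -> Vector:
    """Compute W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + (alpha / r) * d for y, d in zip(base, delta)]


def trainable_params(d_in: int, d_out: int, r: int) -> int:
    """A is r x d_in and B is d_out x r, so the adapter has r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen 2 x 3 weight
a = [[1.0, 0.0, 0.0]]                    # rank-1 down-projection (1 x 3)
b_init = [[0.0], [0.0]]                  # B starts at zero, so the adapter is a no-op
b_tuned = [[1.0], [0.0]]                 # a made-up "trained" B
x = [2.0, 3.0, 4.0]

print(lora_forward(w, a, b_init, x, alpha=2.0, r=1))
print(lora_forward(w, a, b_tuned, x, alpha=2.0, r=1))
print(trainable_params(4096, 4096, 16), 4096 * 4096)
```

The last line compares adapter size with the full weight matrix for a hypothetical 4096 x 4096 layer, which gives a sense of why adapting with only 100 image pairs per domain is plausible.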

[104] Performance Analysis of Image Classification on Bangladeshi Datasets

Mohammed Sami Khan,Fabiha Muniat,Rowzatul Zannat

Main category: cs.CV

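TL;DR: This paper compares a custom CNN trained from scratch with pre-trained CNNs used via transfer learning (VGG-16, ResNet-50, MobileNet) on an image classification task. The pre-trained models achieve better accuracy and faster convergence, especially with limited data, while the custom model performs slightly worse but has fewer parameters and is computationally cheaper.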

Details Motivation: To examine the trade-off between designing a custom CNN and adopting pre-trained models for image classification, and to inform model selection in practice. Method: A custom CNN is built and trained from scratch, while VGG-16, ResNet-50, and MobileNet are applied through transfer learning; all models are compared under identical experimental settings on accuracy, precision, recall, and F1-score. Result: Pre-trained models perform better in classification accuracy and convergence speed, with a clear advantage when data are limited; the custom CNN has fewer parameters and lower computational complexity and remains competitive. Conclusion: Pre-trained architectures lead on performance, but custom CNNs have potential where lightweight, efficient models are needed; model choice should weigh performance requirements against resource constraints. Abstract: Convolutional Neural Networks (CNNs) have demonstrated remarkable success in image classification tasks; however, the choice between designing a custom CNN from scratch and employing established pre-trained architectures remains an important practical consideration. In this work, we present a comparative analysis of a custom-designed CNN and several widely used deep learning architectures, including VGG-16, ResNet-50, and MobileNet, for an image classification task. The custom CNN is developed and trained from scratch, while the popular architectures are employed using transfer learning under identical experimental settings. All models are evaluated using standard performance metrics such as accuracy, precision, recall, and F1-score. Experimental results show that pre-trained CNN architectures consistently outperform the custom CNN in terms of classification accuracy and convergence speed, particularly when training data is limited. However, the custom CNN demonstrates competitive performance with significantly fewer parameters and reduced computational complexity. This study highlights the trade-offs between model complexity, performance, and computational efficiency, and provides practical insights into selecting appropriate CNN architectures for image classification problems.

[105] 3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation

Jusheng Zhang,Yijia Fan,Zimo Wen,Jian Wang,Keze Wang

Main category: cs.CV

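TL;DR: Tri-MARF is a framework that fuses three input modalities (2D multi-view images, textual descriptions, and 3D point clouds) within a multi-agent collaborative architecture to improve large-scale 3D object annotation, and it clearly outperforms existing methods on several benchmarks.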

Details Motivation: Existing single-model 3D annotation methods struggle with spatial complexity, occlusion, and viewpoint inconsistency, and have limited ability to fuse multimodal information. Method: The Tri-MARF framework comprises three specialized agents: a vision-language model agent that generates multi-view descriptions, an information aggregation agent that selects the best descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined captioning. Result: Experiments on Objaverse-LVIS, Objaverse-XL, and ABO show that Tri-MARF achieves a CLIPScore of 88.7, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single A100 GPU. Conclusion: Through multi-agent collaboration and tri-modal fusion, Tri-MARF significantly improves the quality and efficiency of 3D object annotation and provides a more reliable annotation solution for applications such as autonomous driving, robotics, and augmented reality. Abstract: Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation presents challenges beyond 2D annotation, including spatial complexity, occlusion, and viewpoint inconsistency. Existing approaches based on single models often struggle to address these issues effectively. We propose Tri-MARF, a novel framework that integrates tri-modal inputs, including 2D multi-view images, textual descriptions, and 3D point clouds, within a multi-agent collaborative architecture to enhance large-scale 3D annotation. Tri-MARF consists of three specialized agents: a vision-language model agent for generating multi-view descriptions, an information aggregation agent for selecting optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined captioning. Extensive experiments on Objaverse-LVIS, Objaverse-XL, and ABO demonstrate that Tri-MARF substantially outperforms existing methods, achieving a CLIPScore of 88.7 compared to prior state-of-the-art methods, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.

[106] From Preoperative CT to Postmastoidectomy Mesh Construction: Mastoidectomy Shape Prediction for Cochlear Implant Surgery

Yike Zhang,Eduardo Davalos,Dingjie Su,Ange Lou,Jack Noble

Main category: cs.CV

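TL;DR: This paper proposes a hybrid self-supervised and weakly-supervised learning framework that predicts the mastoidectomy region for cochlear implant surgery from preoperative CT scans. It is the first to predict the complex, boundary-less mastoidectomy shape accurately without human annotations, reaching a Dice score of 0.72 and outperforming existing methods.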

Details Motivation: Because ground-truth data with human annotations are scarce, deep-learning studies on mastoidectomy shape prediction are limited; this work aims to fill that gap and improve the accuracy of preoperative planning and surgical safety. Method: A hybrid framework combining self-supervised and weakly-supervised learning, with a 3D T-distribution loss, predicts the mastoidectomy region directly from complete preoperative CT scans. Result: The method reaches a mean Dice score of 0.72 on the complex mastoidectomy shape, which has no clear boundary, outperforming state-of-the-art methods, and it can generate 3D postoperative surface models. Conclusion: This is the first study to combine self-supervised and weakly-supervised learning for mastoidectomy prediction; it provides an efficient and robust solution for cochlear implant surgical planning with significant clinical potential. Abstract: Cochlear Implant (CI) surgery treats severe hearing loss by inserting an electrode array into the cochlea to stimulate the auditory nerve. An important step in this procedure is mastoidectomy, which removes part of the mastoid region of the temporal bone to provide surgical access. Accurate mastoidectomy shape prediction from preoperative imaging improves pre-surgical planning, reduces risks, and enhances surgical outcomes. Despite its importance, there are limited deep-learning-based studies regarding this topic due to the challenges of acquiring ground-truth labels. We address this gap by investigating self-supervised and weakly-supervised learning models to predict the mastoidectomy region without human annotations. We propose a hybrid self-supervised and weakly-supervised learning framework to predict the mastoidectomy region directly from preoperative CT scans, where the mastoid remains intact. Our hybrid method achieves a mean Dice score of 0.72 when predicting the complex and boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance. The method provides groundwork for constructing 3D postmastoidectomy surfaces directly from the corresponding preoperative CT scans. To our knowledge, this is the first work to integrate self-supervised and weakly-supervised learning for mastoidectomy shape prediction, offering a robust and efficient solution for CI surgical planning while leveraging 3D T-distribution loss in weakly-supervised medical imaging.
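
The Dice score reported above measures the overlap between a predicted and a reference segmentation mask: twice the size of the intersection divided by the sum of the two mask sizes. A minimal sketch on flattened binary masks follows; the masks are made-up toy data, and returning 1.0 when both masks are empty is a common convention assumed here rather than something stated in the abstract.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


predicted = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
print(round(dice_score(predicted, reference), 3))
```

A score of 1.0 means perfect overlap and 0.0 means none, so a mean of 0.72 indicates substantial but imperfect agreement with the reference region.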

[107] CRUNet-MR-Univ: A Foundation Model for Diverse Cardiac MRI Reconstruction

Donghang Lyu,Marius Staring,Hildo Lamb,Mariya Doneva

Main category: cs.CV

TL;DR: Proposes CRUNet-MR-Univ, a foundation model that generalizes across diverse cardiac MRI scenarios by exploiting spatio-temporal correlations and prompt-based priors, and that outperforms baselines across a wide range of CMR scans.

Details Motivation: Existing deep learning methods for cardiac MRI reconstruction generalize poorly and struggle with variation in image contrast, sampling patterns, scanner vendors, and other factors, which limits their clinical use. Method: CRUNet-MR-Univ combines spatio-temporal correlation modeling with a prompt-based prior mechanism to build a unified reconstruction model that adapts to varied CMR data. Result: The model outperforms baseline methods across a range of CMR settings, showing stronger generalization and stability. Conclusion: CRUNet-MR-Univ is a promising general-purpose foundation model for bringing deep learning into practical clinical CMR reconstruction. Abstract: In recent years, deep learning has attracted increasing attention in the field of Cardiac MRI (CMR) reconstruction due to its superior performance over traditional methods, particularly in handling higher acceleration factors, highlighting its potential for real-world clinical applications. However, current deep learning methods remain limited in generalizability. CMR scans exhibit wide variability in image contrast, sampling patterns, scanner vendors, anatomical structures, and disease types. Most existing models are designed to handle only a single or narrow subset of these variations, leading to performance degradation when faced with distribution shifts. Therefore, it is beneficial to develop a unified model capable of generalizing across diverse CMR scenarios. To this end, we propose CRUNet-MR-Univ, a foundation model that leverages spatio-temporal correlations and prompt-based priors to effectively handle the full diversity of CMR scans. Our approach consistently outperforms baseline methods across a wide range of settings, highlighting its effectiveness and promise.

[108] Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

Xingjian Diao,Zheyuan Liu,Chunhui Zhang,Weiyi Wu,Keyi Kong,Lin Shi,Kaize Ding,Soroush Vosoughi,Jiang Gui

Main category: cs.CV

TL;DR: This paper proposes Gated Perception-Reasoning Optimization (GPRO), a meta-controller that dynamically routes computation between perception and reasoning paths to address overthinking in large vision-language models that stems from perception failures.

Details Motivation: Existing chain-of-thought methods tend to overthink simple questions and overlook a fundamental bottleneck, low-level visual perception failures, which leads to inefficiency and lower accuracy. Method: The GPRO framework comprises a fast path, a slow perception path, and a slow reasoning path; failure-attribution supervision is derived from about 790k samples, and the controller is trained with multi-objective reinforcement learning to balance accuracy against computational cost. Result: Experiments on five benchmarks show that GPRO substantially improves accuracy and inference efficiency and produces shorter responses, outperforming existing slow-thinking methods. Conclusion: Stable reasoning depends on reliable visual grounding; by routing dynamically, GPRO separates perception errors from reasoning errors and achieves efficient, robust vision-language reasoning. Abstract: Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.
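
The three-way routing described in this entry can be pictured with a small sketch. Everything below is hypothetical: the path names, the softmax-then-argmax gate, and the scalarised reward (accuracy minus a token-cost penalty) are assumptions chosen to illustrate the idea of trading accuracy against compute, not the controller or objective used in the paper.

```python
import math

# Hypothetical labels for the three decision paths named in the abstract.
PATHS = ("fast", "slow_perception", "slow_reasoning")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route(gate_logits):
    """Pick the most probable path from (assumed) controller logits."""
    probs = softmax(gate_logits)
    best = max(range(len(PATHS)), key=lambda i: probs[i])
    return PATHS[best], probs


def reward(correct, tokens_used, cost_weight=0.001):
    """Toy scalarised objective: 1 for a correct answer, minus a token cost."""
    return (1.0 if correct else 0.0) - cost_weight * tokens_used


path, probs = route([2.0, 0.1, -1.0])
print(path, [round(p, 3) for p in probs])
print(reward(correct=True, tokens_used=200))
```

A weighted sum is only one way to combine the two objectives; the abstract says "multi-objective reinforcement learning" without specifying the formulation.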

[109] UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

Zhexiao Xiong,Xin Ye,Burhan Yaman,Sheng Cheng,Yiren Lu,Jingru Luo,Nathan Jacobs,Liu Ren

Main category: cs.CV

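TL;DR: UniDrive-WM is a unified world model built on a vision-language model (VLM) that integrates driving-scene understanding, trajectory planning, and future image generation in a single architecture; it generates future images conditioned on the planned trajectory and feeds them back as supervision, improving autonomous driving performance.

Details Motivation: Existing autonomous driving systems usually separate perception, prediction, and planning and lack tight coordination among them; the authors aim to use the reasoning ability of VLMs to build a unified world model with better end-to-end driving performance. Method: UniDrive-WM pairs a trajectory planner with a VLM-based image generator: the planner predicts a future trajectory, which conditions the generation of future images; the generated images in turn provide supervisory signals that iteratively refine scene understanding and trajectory generation. The work also compares discrete and continuous representations for future image prediction. Result: On the Bench2Drive benchmark, L2 trajectory error improves by 5.9% and collision rate by 9.2% over the previous best method, and the generated future images are high-fidelity. Conclusion: Tightly integrating VLM-driven reasoning, planning, and generative world modeling improves the perception and decision-making of autonomous driving systems and demonstrates the advantage of a unified architecture for complex driving tasks. Abstract: World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .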
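
As a reading aid for the planning metric in this entry, here is a minimal sketch of an L2 trajectory error, taken as the mean Euclidean distance between predicted and ground-truth waypoints. The function names and toy waypoints are assumptions; Bench2Drive's exact protocol (horizon, averaging) is not reproduced, and the abstract does not state whether the reported 5.9% is a relative or an absolute change, so `relative_reduction` only shows what a relative reading would mean.

```python
import math


def l2_error(pred, gt):
    """Mean Euclidean distance between matched (x, y) waypoints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline, ours):
    """Fractional improvement of `ours` over `baseline` (lower is better)."""
    return (baseline - ours) / baseline


# Hypothetical 3-step trajectories in metres.
planned = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
driven = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(l2_error(planned, driven))
print(relative_reduction(2.0, 1.5))
```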

[110] Vision-Language Agents for Interactive Forest Change Analysis

James Brock,Ce Zhang,Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: This paper presents an LLM-driven agent for integrated forest change analysis that supports remote sensing image change interpretation (RSICI) through natural language queries, and validates it on the newly built Forest-Change dataset for pixel-level change detection and semantic change captioning.

Details Motivation: Forest monitoring lacks methods that effectively combine large language models with vision-language models for multi-task remote sensing change understanding, particularly for semantic change captioning and interactive analysis. Method: An LLM-driven agent system built on a multi-level change interpretation (MCI) vision-language backbone, with the LLM orchestrating tasks; the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic captions, is constructed for training and evaluation. Result: The system reaches 67.10% mIoU and a BLEU-4 score of 40.17% on Forest-Change, and 88.13% mIoU and a BLEU-4 score of 34.41% on the LEVIR-MCI-Trees subset. Conclusion: Interactive, LLM-driven remote sensing change interpretation systems could improve the accessibility, interpretability, and efficiency of forest change analysis and advance intelligent ecological monitoring. Abstract: Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of the LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.

[111] TokenSeg: Efficient 3D Medical Image Segmentation via Hierarchical Visual Token Compression

Sen Zeng,Hong Zhou,Zheng Zhu,Yang Liu

Main category: cs.CV

TL;DR: This paper proposes TokenSeg, a boundary-aware sparse token representation framework for efficient 3D medical image segmentation. With a multi-scale encoder, a boundary-aware tokenizer, and a sparse-to-dense decoder, it maintains high accuracy while substantially reducing compute requirements.

Details Motivation: 3D medical image segmentation is computationally expensive because voxel processing grows cubically and homogeneous regions incur redundant computation, so more efficient segmentation methods are needed. Method: A multi-scale hierarchical encoder extracts candidate tokens; a boundary-aware tokenizer that combines VQ-VAE quantization with importance scoring selects the salient tokens; and a sparse-to-dense decoder reconstructs the full segmentation. Result: On a breast DCE-MRI dataset of 960 cases, the model reaches 94.49% Dice and 89.61% IoU while reducing GPU memory and inference latency by 64% and 68%, respectively; generalization is verified on cardiac and brain MRI datasets. Conclusion: Through anatomically informed sparse representation, TokenSeg achieves accurate and efficient 3D medical image segmentation, making large-scale clinical use more feasible. Abstract: Three-dimensional medical image segmentation is a fundamental yet computationally demanding task due to the cubic growth of voxel processing and the redundant computation on homogeneous regions. To address these limitations, we propose TokenSeg, a boundary-aware sparse token representation framework for efficient 3D medical volume segmentation. Specifically, (1) we design a multi-scale hierarchical encoder that extracts 400 candidate tokens across four resolution levels to capture both global anatomical context and fine boundary details; (2) we introduce a boundary-aware tokenizer that combines VQ-VAE quantization with importance scoring to select 100 salient tokens, over 60% of which lie near tumor boundaries; and (3) we develop a sparse-to-dense decoder that reconstructs full-resolution masks through token reprojection, progressive upsampling, and skip connections. Extensive experiments on a 3D breast DCE-MRI dataset comprising 960 cases demonstrate that TokenSeg achieves state-of-the-art performance with 94.49% Dice and 89.61% IoU, while reducing GPU memory and inference latency by 64% and 68%, respectively. To verify the generalization capability, our evaluations on MSD cardiac and brain MRI benchmark datasets demonstrate that TokenSeg consistently delivers optimal performance across heterogeneous anatomical structures. These results highlight the effectiveness of anatomically informed sparse representation for accurate and efficient 3D medical image segmentation.
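
The step of keeping 100 salient tokens out of 400 candidates can be sketched as a plain top-k selection over importance scores. This is an illustrative assumption: the function `select_salient_tokens` and the random scores are hypothetical, and the paper's tokenizer also involves VQ-VAE quantization, which is not modeled here.

```python
import heapq
import random


def select_salient_tokens(scores, k):
    """Return indices of the k highest-scoring tokens, in original order."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and the number of tokens")
    # nlargest ranks token indices by their importance score.
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


random.seed(0)
# Hypothetical importance scores for 400 candidate tokens.
importance = [random.random() for _ in range(400)]
kept = select_salient_tokens(importance, 100)
print(len(kept), kept[:5])
```

Returning the indices in their original order keeps positional information available for a later reprojection step.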

[112] FaceRefiner: High-Fidelity Facial Texture Refinement with Differentiable Rendering-based Style Transfer

Chengyang Li,Baoping Cheng,Yao Cheng,Haocheng Zhang,Renshuai Liu,Yinglin Zheng,Jing Liao,Xuan Cheng

Main category: cs.CV

TL;DR: This paper proposes FaceRefiner, a style transfer-based facial texture refinement method that uses differentiable rendering to transfer information at multiple levels, improving the quality and identity consistency of 3D facial textures generated from in-the-wild images.

Details Motivation: UV textures produced by existing methods are confined to the training-data distribution or the space of a 2D generator, so details, structures, and identity are often inconsistent with real-world input images. Method: The 3D sampled texture is treated as the style and the output of an existing texture generation method as the content; differentiable rendering enables multi-level style transfer, including pixel-level transfer of low-level information in the visible face regions. Result: Experiments on Multi-PIE, CelebA, and FFHQ show that the method improves texture quality and identity preservation over state-of-the-art methods. Conclusion: By combining multi-level style transfer with differentiable rendering, FaceRefiner improves the realism of single-image 3D facial texture generation and its consistency with the input. Abstract: Recent facial texture generation methods prefer to use deep networks to synthesize image content and then fill in the UV map, thus generating a compelling full texture from a single image. Nevertheless, the synthesized texture UV map usually comes from a space constructed by the training data or the 2D face generator, which limits the methods' generalization ability for in-the-wild input images. Consequently, their facial details, structures and identity may not be consistent with the input. In this paper, we address this issue by proposing a style transfer-based facial texture refinement method named FaceRefiner. FaceRefiner treats the 3D sampled texture as style and the output of a texture generation method as content. The photo-realistic style is then expected to be transferred from the style image to the content image. Different from current style transfer methods that only transfer high and middle level information to the result, our style transfer method integrates differentiable rendering to also transfer low level (or pixel level) information in the visible face regions. The main benefit of such multi-level information transfer is that the details, structures and semantics in the input can thus be well preserved. Extensive experiments on the Multi-PIE, CelebA and FFHQ datasets demonstrate that our refinement method can improve the texture quality and the face identity preserving ability, compared with state-of-the-art methods.

[113] All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction

Ziyou Jiang,Mingyang Li,Junjie Wang,Yuekai Huang,Jie Huang,Zhiyuan Chang,Zhaoyang Li,Qing Wang

Main category: cs.CV

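TL;DR: This paper proposes RepMD, a method for detecting ever-shifting harmful memes based on design concept reproduction. It defines a Design Concept Graph (DCG) from attack trees and uses the DCG to guide a multimodal large language model in detecting harmful memes. Experiments show strong accuracy and generalization, and improved efficiency for human reviewers.

Details Motivation: Harmful memes in online communities shift in type and evolve over time, which makes them hard to analyze, yet they may share invariant design principles; a method that captures these underlying design concepts is needed to identify ever-shifting harmful memes. Method: RepMD first defines the Design Concept Graph (DCG) with reference to attack trees, then derives the DCG from historical memes through design step reproduction and graph pruning, and finally uses the DCG to guide a Multimodal Large Language Model (MLLM) in detecting harmful memes. Result: RepMD achieves the highest accuracy, 81.1%, with only slight drops on type-shifting and temporally evolving memes; human evaluation reports an efficiency gain of 15 to 30 seconds per meme in human discovery. Conclusion: By capturing the invariant design concepts behind harmful memes, RepMD detects ever-shifting memes effectively, with good accuracy, generalization, and practicality. Abstract: Harmful memes in Internet communities are ever-shifting and are difficult to analyze due to their type-shifting and temporally evolving nature. Although these memes are shifting, we find that different memes may share invariant principles, i.e., the underlying design concept of malicious users, which can help us analyze why these memes are harmful. In this paper, we propose RepMD, an ever-shifting harmful meme detection method based on design concept reproduction. We first refer to the attack tree to define the Design Concept Graph (DCG), which describes steps that people may take to design a harmful meme. Then, we derive the DCG from historical memes with design step reproduction and graph pruning. Finally, we use the DCG to guide the Multimodal Large Language Model (MLLM) to detect harmful memes. The evaluation results show that RepMD achieves the highest accuracy, 81.1%, and shows only slight accuracy decreases when generalized to type-shifting and temporally evolving memes. Human evaluation shows that RepMD can improve the efficiency of human discovery of harmful memes, with 15 to 30 seconds per meme.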

[114] 3D Conditional Image Synthesis of Left Atrial LGE MRI from Composite Semantic Masks

Yusri Al-Sanaani,Rebecca Thornhill,Sreeraman Rajan

Main category: cs.CV

TL;DR: This study explores 3D conditional generative models for augmenting late gadolinium-enhanced MRI (LGE-MRI) data to improve left atrial (LA) segmentation. It proposes a pipeline that synthesizes high-fidelity 3D LGE images from semantic label maps and shows that the synthetic data improves segmentation accuracy.

Details Motivation: Scarce annotated data and complex anatomy make it difficult to develop machine learning models for segmenting the left atrial wall and endocardium, so effective data augmentation is needed. Method: An image synthesis pipeline built on three 3D conditional generators (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM) generates LGE-MRI images from semantic label maps that combine expert annotations with unsupervised tissue clusters; the images are evaluated for realism and for their effect on downstream segmentation. Result: SPADE-LDM produces the most realistic and structurally accurate images (FID 4.063), well ahead of Pix2Pix (FID 40.821) and SPADE-GAN (FID 7.652); training a 3D U-Net with the synthetic data raises the LA cavity Dice score from 0.908 to 0.936, a statistically significant improvement (p < 0.05). Conclusion: Label-conditioned 3D image synthesis can effectively augment scarce medical imaging data and improve cardiac structure segmentation, particularly in clinical settings with few samples. Abstract: Segmentation of the left atrial (LA) wall and endocardium from late gadolinium-enhanced (LGE) MRI is essential for quantifying atrial fibrosis in patients with atrial fibrillation. The development of accurate machine learning-based segmentation models remains challenging due to the limited availability of data and the complexity of anatomical structures. In this work, we investigate 3D conditional generative models as a potential solution for augmenting scarce LGE training data and improving LA segmentation performance. We develop a pipeline to synthesize high-fidelity 3D LGE MRI volumes from composite semantic label maps combining anatomical expert annotations with unsupervised tissue clusters, using three 3D conditional generators (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM). The synthetic images are evaluated for realism and their impact on downstream LA segmentation. SPADE-LDM generates the most realistic and structurally accurate images, achieving an FID of 4.063 and surpassing GAN models, which have FIDs of 40.821 and 7.652 for Pix2Pix and SPADE-GAN, respectively. When augmented with synthetic LGE images, the Dice score for LA cavity segmentation with a 3D U-Net model improved from 0.908 to 0.936, showing a statistically significant improvement (p < 0.05) over the baseline. These findings demonstrate the potential of label-conditioned 3D synthesis to enhance the segmentation of under-represented cardiac structures.

[115] MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing

Zihao Lin,Wanrong Zhu,Jiuxiang Gu,Jihyung Kil,Christopher Tensmeyer,Lin Zhang,Shilong Liu,Ruiyi Zhang,Lifu Huang,Vlad I. Morariu,Tong Sun

Main category: cs.CV

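TL;DR: This paper proposes MiLDEAgent, a reasoning-based framework for editing multi-layer design documents that pairs an RL-trained multimodal reasoner with an image editor to make fine-grained, layer-aware modifications. It also introduces the MiLDEBench benchmark of over 20K samples and the MiLDEEval evaluation protocol. Experiments show that the method significantly outperforms open-source models and is comparable to closed-source models.

Details Motivation: Prior work overlooks the editing of multi-layer design documents, focusing on single-layer image editing or multi-layer generation, and lacks the fine-grained reasoning needed to decide what to modify and how to coordinate layers; a method with layer-aware understanding and precise editing is therefore needed. Method: The MiLDEAgent framework comprises a multimodal reasoner trained with reinforcement learning for layer-wise understanding and an image editor that carries out the modifications; the MiLDEBench dataset and the MiLDEEval protocol evaluate models along four dimensions. Result: Experiments on 14 open-source and 2 closed-source models show that existing approaches generalize poorly: open-source models often cannot complete the task, and closed-source models are prone to format violations. MiLDEAgent shows strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and approaching closed-source performance. Conclusion: MiLDEAgent establishes the first strong baseline for multi-layer design document editing, confirms the importance of layer-aware reasoning for complex document editing, and advances natural-language-driven design automation. Abstract: Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions including instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.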

[116] Detection of Deployment Operational Deviations for Safety and Security of AI-Enabled Human-Centric Cyber Physical Systems

Bernard Ngabonziza,Ayan Banerjee,Sandeep K. S. Gupta

Main category: cs.CV

TL;DR: This paper discusses the safety and security challenges that AI-enabled human-centric cyber-physical systems face under uncertain operating conditions and proposes an evaluation framework for keeping such systems safe in real deployments. It also presents a personalized image-based technique for detecting unannounced meals in closed-loop blood glucose control for Type 1 diabetics.

Details Motivation: Uncertainty introduced by human interaction can cause a system to deviate from its intended operation and threaten its safety and security, so it is necessary to study how AI-enabled human-centric cyber-physical systems can operate reliably under unknown or abnormal conditions. Method: An evaluation framework for analyzing and comparing strategies that ensure system safety and security, illustrated with a personalized image-based meal detection technique. Result: A framework for evaluating safety strategies for AI-enabled human-centric systems in operation, together with a demonstration of a personalized image-based technique that identifies unannounced meals. Conclusion: The framework and worked example indicate that the safety and security of human-centric AI systems under uncertain operating conditions can be strengthened, providing theoretical and practical support for their reliable deployment. Abstract: In recent years, human-centric cyber-physical systems have increasingly involved artificial intelligence to enable knowledge extraction from sensor-collected data. Examples include medical monitoring and control systems, as well as autonomous cars. Such systems are intended to operate according to the protocols and guidelines for regular system operations. However, in many scenarios, such as closed-loop blood glucose control for Type 1 diabetics, self-driving cars, and monitoring systems for stroke diagnosis, the operations of such AI-enabled human-centric applications can expose them to cases in which their operational mode may be uncertain, for instance, as a result of a human's interactions with the system. Such cases, in which the system is in uncertain conditions, can violate the system's safety and security requirements. This paper will discuss operational deviations that can lead these systems to operate in unknown conditions. We will then create a framework to evaluate different strategies for ensuring the safety and security of AI-enabled human-centric cyber-physical systems in operation deployment. Then, as an example, we show a novel personalized image-based technique for detecting the non-announcement of meals in closed-loop blood glucose control for Type 1 diabetics.

[117] HUR-MACL: High-Uncertainty Region-Guided Multi-Architecture Collaborative Learning for Head and Neck Multi-Organ Segmentation

Xiaoyu Liu,Siwen Wei,Linhao Qu,Mingyuan Pan,Chengsheng Zhang,Yonghong Shi,Zhijian Song

Main category: cs.CV

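TL;DR: Proposes HUR-MACL, a high-uncertainty region-guided multi-architecture collaborative learning model for head and neck multi-organ segmentation that combines a CNN, Vision Mamba, and a Deformable CNN with a heterogeneous feature distillation loss, reaching SOTA performance on multiple datasets.

Details Motivation: Existing deep learning models perform poorly on small, complexly shaped organs, and hybrid architectures usually just concatenate features, which causes functional overlap and limits segmentation accuracy. Method: HUR-MACL uses a CNN to adaptively identify high-uncertainty regions and applies Vision Mamba and a Deformable CNN jointly in those regions to refine the segmentation; a heterogeneous feature distillation loss promotes collaborative learning between the two architectures in high-uncertainty regions. Result: The method is validated on two public datasets and one private dataset, where its segmentation accuracy exceeds existing methods and reaches SOTA level. Conclusion: By focusing on high-uncertainty regions and encouraging complementary collaboration between architectures, HUR-MACL improves segmentation accuracy for small head and neck organs and shows potential for clinical use. Abstract: Accurate segmentation of organs at risk in the head and neck is essential for radiation therapy, yet deep learning models often fail on small, complexly shaped organs. While hybrid architectures that combine different models show promise, they typically just concatenate features without exploiting the unique strengths of each component. This results in functional overlap and limited segmentation accuracy. To address these issues, we propose a high uncertainty region-guided multi-architecture collaborative learning (HUR-MACL) model for multi-organ segmentation in the head and neck. This model adaptively identifies high uncertainty regions using a convolutional neural network, and for these regions, Vision Mamba and a Deformable CNN are used jointly to improve segmentation accuracy. Additionally, a heterogeneous feature distillation loss is proposed to promote collaborative learning between the two architectures in high uncertainty regions to further enhance performance. Our method achieves SOTA results on two public datasets and one private dataset.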

[118] HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment

Wenzhi Chen,Bo Hu,Leida Li,Lihuo He,Wen Lu,Xinbo Gao

Main category: cs.CV

TL;DR: This paper proposes HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. It maps CLIP features into hyperbolic space and adds dynamic supervision and adaptive modulation mechanisms, improving alignment assessment performance.

Details Motivation: Existing text-to-image alignment assessment methods rely on Euclidean metrics, ignore the structured nature of semantic alignment, and cannot adapt to different samples. Method: First, Euclidean features are extracted with CLIP and mapped to hyperbolic space; second, a dynamic-supervision entailment modeling mechanism turns discrete entailment logic into continuous geometric structure supervision; finally, an adaptive modulation regressor uses hyperbolic geometric features to generate sample-level modulation parameters that calibrate the Euclidean cosine similarity and predict the final score. Result: HyperAlign achieves highly competitive performance in both single-database evaluation and cross-database generalization. Conclusion: Hyperbolic geometric modeling improves the accuracy and adaptability of text-to-image alignment assessment. Abstract: With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, while lacking adaptive capabilities for different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.
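
One common way to map a Euclidean feature into hyperbolic space is the exponential map at the origin of the Poincaré ball, exp_0(v) = tanh(√c ‖v‖) · v / (√c ‖v‖) for curvature −c. The sketch below assumes that model; the abstract does not say which hyperbolic model or mapping HyperAlign uses, and the function name `expmap0` and the toy vector are hypothetical.

```python
import math


def expmap0(v, c=1.0, eps=1e-12):
    """Exponential map at the origin of a Poincare ball with curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm < eps:
        # The origin maps to itself.
        return [0.0] * len(v)
    sqrt_c = math.sqrt(c)
    # tanh squashes any Euclidean norm into the open ball of radius 1/sqrt(c).
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


# Hypothetical 2-D Euclidean feature; real CLIP features are much larger.
point = expmap0([3.0, 4.0])
radius = math.sqrt(sum(x * x for x in point))
print(point, radius)
```

The mapped point keeps the direction of the input vector while its norm stays strictly below 1 for c = 1, which is what lets distance from the origin encode hierarchy in cone-based entailment models.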

[119] Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning

Wentao Zhang,Lifei Wang,Lina Lu,MingKun Xu,Shangyang Li,Yanchao Yang,Tao Fang

Main category: cs.CV

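TL;DR: Proposes Agri-R1, a reasoning-enhanced large model for agriculture. It automatically generates high-quality reasoning data through vision-language synthesis and LLM-based filtering and is trained with Group Relative Policy Optimization (GRPO) using a domain-specific reward function, reaching performance competitive with larger models from a small fraction of the available samples.

Details Motivation: Existing agricultural disease diagnosis models depend on large amounts of labeled data, lack interpretability, and generalize poorly; existing reasoning methods rely on costly expert annotation and handle open-ended, diverse agricultural queries poorly. Method: The Agri-R1 framework generates high-quality reasoning data automatically via vision-language synthesis and LLM filtering, using only 19% of the available samples; training uses GRPO with a reward function that combines domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility. Result: On CDDMBench, the 3B-parameter model is competitive with 7B- to 13B-parameter baselines, with a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning; ablations confirm that the synergy between reasoning data and GRPO drives the gains, which grow with question complexity. Conclusion: Through automated reasoning data generation and tailored reinforcement learning, Agri-R1 improves the accuracy, robustness, and generalization of agricultural VLMs under low-resource conditions, offering an efficient approach for open-ended, complex agricultural applications. Abstract: Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose Agri-R1, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a newly proposed reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.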
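
To make the training signal in this entry more concrete, the sketch below pairs a toy reward (string similarity plus lexicon coverage) with the group-relative advantage commonly used in GRPO, where each sampled response's reward is standardized against the other responses in its group. The lexicon, the 0.5/0.5 weighting, the use of `difflib` for fuzzy matching, and the population standard deviation are all assumptions for illustration, not the paper's reward function.

```python
import statistics
from difflib import SequenceMatcher

# Hypothetical domain lexicon; the paper's lexicon is not given in the abstract.
LEXICON = {"leaf blight", "powdery mildew"}


def fuzzy_reward(answer, reference, lexicon=LEXICON):
    """Toy reward: half string similarity, half coverage of lexicon terms."""
    a, r = answer.lower(), reference.lower()
    similarity = SequenceMatcher(None, a, r).ratio()
    expected = [term for term in lexicon if term in r]
    hits = sum(1 for term in expected if term in a)
    coverage = hits / len(expected) if expected else 0.0
    return 0.5 * similarity + 0.5 * coverage


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


reference = "powdery mildew"
samples = ["Powdery mildew", "powdery mildew on leaves", "rust", "healthy"]
rewards = [fuzzy_reward(s, reference) for s in samples]
print([round(a, 3) for a in group_relative_advantages(rewards)])
```

Because advantages are computed relative to the group, no separate value network is needed; responses that beat their siblings get positive advantages and the rest get negative ones.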

[120] DB-MSMUNet: Dual Branch Multi-scale Mamba UNet for Pancreatic CT Scans Segmentation

Qiu Guan,Zhiqiang Yang,Dezhang Ye,Yang Chen,Xinli Xu,Ying Tang

Main category: cs.CV

TL;DR: Proposes DB-MSMUNet, a new encoder-decoder network for accurate segmentation of the pancreas and its lesions in CT images. It combines a multi-scale Mamba module with a dual-decoder design and outperforms most existing methods on multiple datasets.

Details Motivation: Segmenting the pancreas in CT images is highly challenging because of low tissue contrast, blurry boundaries, irregular shapes, and small lesions, which hampers precise diagnosis and treatment of pancreatic cancer. Method: In DB-MSMUNet, the encoder uses a Multi-scale Mamba Module (MSMM) that combines deformable convolutions with multi-scale state space modeling; a dual-decoder structure strengthens boundary detail and small-lesion reconstruction through an Edge Enhancement Path (EEP) and a Multi-layer Decoder (MLD), respectively; and multi-scale Auxiliary Deep Supervision (ADS) improves feature discriminability. Result: On the NIH Pancreas, MSD, and clinical pancreatic tumor datasets, the Dice coefficients reach 89.47%, 87.59%, and 89.02%, respectively, surpassing most existing methods in segmentation accuracy, edge preservation, and cross-dataset robustness. Conclusion: DB-MSMUNet improves segmentation of the pancreas and small lesions, generalizes well, and is suitable for real-world clinical CT segmentation. Abstract: Accurate segmentation of the pancreas and its lesions in CT scans is crucial for the precise diagnosis and treatment of pancreatic cancer. However, it remains a highly challenging task due to several factors such as low tissue contrast with surrounding organs, blurry anatomical boundaries, irregular organ shapes, and the small size of lesions. To tackle these issues, we propose DB-MSMUNet (Dual-Branch Multi-scale Mamba UNet), a novel encoder-decoder architecture designed specifically for robust pancreatic segmentation. The encoder is constructed using a Multi-scale Mamba Module (MSMM), which combines deformable convolutions and multi-scale state space modeling to enhance both global context modeling and local deformation adaptation. The network employs a dual-decoder design: the edge decoder introduces an Edge Enhancement Path (EEP) to explicitly capture boundary cues and refine fuzzy contours, while the area decoder incorporates a Multi-layer Decoder (MLD) to preserve fine-grained details and accurately reconstruct small lesions by leveraging multi-scale deep semantic features. Furthermore, Auxiliary Deep Supervision (ADS) heads are added at multiple scales to both decoders, providing more accurate gradient feedback and further enhancing the discriminative capability of multi-scale features. We conduct extensive experiments on three datasets: the NIH Pancreas dataset, the MSD dataset, and a clinical pancreatic tumor dataset provided by collaborating hospitals. DB-MSMUNet achieves Dice Similarity Coefficients of 89.47%, 87.59%, and 89.02%, respectively, outperforming most existing state-of-the-art methods in terms of segmentation accuracy, edge preservation, and robustness across different datasets. These results demonstrate the effectiveness and generalizability of the proposed method for real-world pancreatic CT segmentation tasks.

[121] HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution

Yang Zou,Xingyue Zhu,Kaiqi Han,Jun Ma,Xingyuan Li,Zhiying Jiang,Jinyuan Liu

Main category: cs.CV

TL;DR: Proposes HATIR, a heat-aware diffusion model for infrared video super-resolution under turbulence. It injects heat-aware deformation priors to jointly model the degradation process and introduces FLIR-IVSR, the first dataset for turbulent infrared VSR.

Details Motivation: Existing infrared video super-resolution methods neglect the modality gap between infrared and visible images and struggle to restore distortions caused by atmospheric turbulence; cascading turbulence mitigation with super-resolution leads to error propagation. Method: HATIR injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse of turbulent degradation and detail loss; a Phasor-Guided Flow Estimator exploits the consistency of phasor responses in thermally active regions to provide reliable turbulence-aware flow, and a Turbulence-Aware Decoder uses turbulence gating and structure-aware attention to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation. Result: The method is validated on the self-built FLIR-IVSR dataset (640 scenes, 1024 × 768 resolution), where HATIR outperforms existing methods on turbulent infrared video super-resolution. Conclusion: With physically guided heat-aware priors, HATIR jointly optimizes turbulence mitigation and super-resolution, improving infrared video restoration in complex environments and advancing research on infrared video super-resolution. Abstract: Infrared video has been of great interest in visual tasks under challenging environments, but often suffers from severe atmospheric turbulence and compression degradation. Existing video super-resolution (VSR) methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions. Directly cascading turbulence mitigation (TM) algorithms with VSR methods leads to error propagation and accumulation due to the decoupled modeling of degradation between turbulence and resolution. We introduce HATIR, a Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution, which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, HATIR constructs a Phasor-Guided Flow Estimator, rooted in the physical principle that thermally active regions exhibit consistent phasor responses over time, enabling reliable turbulence-aware flow to guide the reverse diffusion process. To ensure the fidelity of structural recovery under nonuniform distortions, a Turbulence-Aware Decoder is proposed to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation via turbulence gating and structure-aware attention. We built FLIR-IVSR, the first dataset for turbulent infrared VSR, comprising paired LR-HR sequences from a FLIR T1050sc camera (1024 × 768) spanning 640 diverse scenes with varying camera and object motion conditions. This encourages future research in infrared VSR.
Project page: https://github.com/JZ0606/HATIR

[122] WebCryptoAgent: Agentic Crypto Trading with Web Informatics

Ali Kurban,Wei Luo,Liangyu Zuo,Zeyu Zhang,Renda Han,Zhaolu Kang,Hao Tang

Main category: cs.CV

TL;DR: WebCryptoAgent is an agentic framework for cryptocurrency trading that uses modality-specific agents and a decoupled control architecture to integrate web information with market data, improving trading stability and the handling of extreme risk.

Details Motivation: Existing trading systems struggle to handle heterogeneous multi-source web information together with high-frequency market microstructure signals, and in particular lack real-time response and robust decision-making under rapid price swings. Method: The WebCryptoAgent framework assigns each modality to a dedicated agent and consolidates their outputs into a unified evidence document for confidence-calibrated reasoning; a decoupled architecture separates hourly strategic reasoning from second-level real-time risk control, enabling fast shock detection and protective intervention. Result: Experiments on real-world cryptocurrency market data show that the method improves trading stability, reduces spurious activity, and enhances tail-risk handling compared with baselines. Conclusion: With modular agents and layered control, WebCryptoAgent fuses multi-source information and responds to risk in real time, offering an interpretable and robust approach to automated trading in highly volatile markets. Abstract: Cryptocurrency trading increasingly depends on timely integration of heterogeneous web information and market microstructure signals to support short-horizon decision making under extreme volatility. However, existing trading systems struggle to jointly reason over noisy multi-source web evidence while maintaining robustness to rapid price shocks at sub-second timescales. The first challenge lies in synthesizing unstructured web content, social sentiment, and structured OHLCV signals into coherent and interpretable trading decisions without amplifying spurious correlations, while the second challenge concerns risk control, as slow deliberative reasoning pipelines are ill-suited for handling abrupt market shocks that require immediate defensive responses. To address these challenges, we propose WebCryptoAgent, an agentic trading framework that decomposes web-informed decision making into modality-specific agents and consolidates their outputs into a unified evidence document for confidence-calibrated reasoning. We further introduce a decoupled control architecture that separates strategic hourly reasoning from a real-time second-level risk model, enabling fast shock detection and protective intervention independent of the trading loop. Extensive experiments on real-world cryptocurrency markets demonstrate that WebCryptoAgent improves trading stability, reduces spurious activity, and enhances tail-risk handling compared to existing baselines. Code will be available at https://github.com/AIGeeksGroup/WebCryptoAgent.

[123] Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models

Yanbing Zeng,Jia Wang,Hanghang Ma,Junqiang Wu,Jie Zhu,Xiaoming Wei,Jie Hu

Main category: cs.CV

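TL;DR: This paper proposes Forge-and-Quench, a framework that converts the understanding capability of a multimodal large language model (MLLM) into a visual guidance signal for image generation, improving the fidelity and detail richness of text-to-image generation.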

Details Motivation: Little prior work has explored how understanding models can effectively improve the quality of generated images; this paper aims to close that gap by using understanding to enhance generation. Method: In the Forge-and-Quench framework, an MLLM produces an enhanced text instruction from the conversational context, and a Bridge Adapter maps it to a Bridge Feature; this feature is injected into the T2I model as a visual guidance signal and guides image generation together with the enhanced text. Result: The framework's generality and effectiveness are validated across multiple MLLM and T2I models: it significantly improves image detail and fidelity, maintains instruction following, enhances the application of world knowledge, and has low training overhead without sacrificing understanding capability. Conclusion: Forge-and-Quench shows that understanding can effectively assist generation, offering a new paradigm for unified multimodal understanding and generation. Abstract: Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM's inherent multimodal understanding capabilities.
Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application. Models and code are available at https://github.com/YanbingZeng/Forge-and-Quench.

[124] On the Holistic Approach for Detecting Human Image Forgery

Xiao Guo,Jie Zhu,Anil Jain,Xiaoming Liu

Main category: cs.CV

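TL;DR: This paper proposes HuForDet, a holistic human image forgery detection framework that combines facial and full-body contextual information to detect diverse human image forgeries in a unified way.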

Details Motivation: Existing detectors are limited to either facial or full-body forgeries, generalize poorly across regions, and struggle with the growing threat of AI-generated human imagery. Method: A dual-branch architecture is used: (1) a face forgery detection branch built on heterogeneous RGB- and frequency-domain experts and an adaptive LoG module; and (2) a contextualized forgery detection branch that uses a multimodal large language model to analyze full-body semantic consistency, with a confidence estimation mechanism that dynamically weights its contribution during feature fusion. Result: Experiments on the authors' HuFor dataset show that HuForDet achieves state-of-the-art detection performance and stronger robustness across diverse forgery types. Conclusion: By combining local detail with global semantic information, HuForDet detects both facial and full-body forgeries and advances general-purpose human image forgery detection. Abstract: The rapid advancement of AI-generated content (AIGC) has escalated the threat of deepfakes, from facial manipulations to the synthesis of entire photorealistic human bodies. However, existing detection methods remain fragmented, specializing either in facial-region forgeries or full-body synthetic images, and consequently fail to generalize across the full spectrum of human image manipulations. We introduce HuForDet, a holistic framework for human image forgery detection, which features a dual-branch architecture comprising: (1) a face forgery detection branch that employs heterogeneous experts operating in both RGB and frequency domains, including an adaptive Laplacian-of-Gaussian (LoG) module designed to capture artifacts ranging from fine-grained blending boundaries to coarse-scale texture irregularities; and (2) a contextualized forgery detection branch that leverages a Multi-Modal Large Language Model (MLLM) to analyze full-body semantic consistency, enhanced with a confidence estimation mechanism that dynamically weights its contribution during feature fusion. We curate a human image forgery (HuFor) dataset that unifies existing face forgery data with a new corpus of full-body synthetic humans. Extensive experiments show that our HuForDet achieves state-of-the-art forgery detection performance and superior robustness across diverse human image forgeries.
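As background for the LoG module mentioned above, the following sketch samples the standard, fixed Laplacian-of-Gaussian kernel at two scales. It is not the paper's adaptive module, whose design the abstract does not describe; the function name `log_kernel` and the chosen sizes and sigmas are illustrative assumptions. It is meant to show why varying sigma lets one filter family cover both fine boundaries and coarse texture.

```python
import math


def log_kernel(size: int, sigma: float) -> list[list[float]]:
    """Sample the Laplacian-of-Gaussian on an odd size x size grid and make it zero-mean."""
    half = size // 2
    s2 = sigma * sigma
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            # LoG(x, y) = -(1 / (pi * sigma^4)) * (1 - r^2 / (2 sigma^2)) * exp(-r^2 / (2 sigma^2))
            row.append(-(1.0 / (math.pi * s2 * s2)) * (1.0 - r2 / (2.0 * s2)) * math.exp(-r2 / (2.0 * s2)))
        kernel.append(row)
    mean = sum(map(sum, kernel)) / (size * size)
    # Subtract the mean so flat image regions give a zero response.
    return [[v - mean for v in row] for row in kernel]


fine = log_kernel(5, 0.8)      # small sigma: responds to thin blending boundaries
coarse = log_kernel(13, 2.5)   # large sigma: responds to coarse texture changes
```

An adaptive variant would presumably choose or learn sigma per input rather than fixing it as done here.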

[125] Training a Custom CNN on Five Heterogeneous Image Datasets

Anika Tabassum,Tasnuva Mahazabin Tuba,Nafisa Naznin

Main category: cs.CV

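TL;DR: This study evaluates a lightweight custom CNN against deeper networks such as ResNet-18 and VGG-16 on five agricultural and urban visual classification tasks, examining how model complexity, depth, and transfer learning affect performance.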

Details Motivation: The study compares the effectiveness of different CNN architectures in diverse real-world settings, especially under limited data and limited compute, to guide model selection for practical visual classification tasks. Method: A custom lightweight CNN is compared with ResNet-18 and VGG-16, each trained from scratch and with transfer learning, using systematic preprocessing, data augmentation, and controlled experiments to analyze performance. Result: The custom lightweight CNN achieves competitive performance on several tasks; transfer learning markedly improves convergence and generalization on small datasets; deeper architectures show clear advantages on complex tasks but do not necessarily beat the lightweight model on simple ones. Conclusion: Transfer learning and deeper networks offer clear advantages when data is scarce, while a lightweight custom model remains efficient and practical across domains, making it a viable option for deployment in resource-limited environments. Abstract: Deep learning has transformed visual data analysis, with Convolutional Neural Networks (CNNs) becoming highly effective in learning meaningful feature representations directly from images. Unlike traditional manual feature engineering methods, CNNs automatically extract hierarchical visual patterns, enabling strong performance across diverse real-world contexts. This study investigates the effectiveness of CNN-based architectures across five heterogeneous datasets spanning agricultural and urban domains: mango variety classification, paddy variety identification, road surface condition assessment, auto-rickshaw detection, and footpath encroachment monitoring. These datasets introduce varying challenges, including differences in illumination, resolution, environmental complexity, and class imbalance, necessitating adaptable and robust learning models. We evaluate a lightweight, task-specific custom CNN alongside established deep architectures, including ResNet-18 and VGG-16, trained both from scratch and using transfer learning. Through systematic preprocessing, augmentation, and controlled experimentation, we analyze how architectural complexity, model depth, and pre-training influence convergence, generalization, and performance across datasets of differing scale and difficulty. The key contributions of this work are: (1) the development of an efficient custom CNN that achieves competitive performance across multiple application domains, and (2) a comprehensive comparative analysis highlighting when transfer learning and deep architectures provide substantial advantages, particularly in data-constrained environments.
These findings offer practical insights for deploying deep learning models in resource-limited yet high-impact real-world visual classification tasks.

[126] AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection

Yunqing Hu,Zheming Yang,Chang Zhao,Qi Guo,Meng Gao,Pengcheng Li,Wen Ji

Main category: cs.CV

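TL;DR: This paper proposes the AIVD framework, in which lightweight edge detectors collaborate with cloud-based multimodal large models to achieve precise object localization and high-quality semantic generation; it also designs a fine-tuning strategy with joint visual-semantic augmentation and a resource-aware dynamic scheduling algorithm, substantially reducing resource consumption while improving performance and efficiency.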

Details Motivation: Existing MLLMs fall short in precise object localization and in edge-cloud collaborative deployment, and struggle to combine high accuracy with low latency in resource-constrained environments. Method: The AIVD framework pairs lightweight edge detectors with a cloud MLLM, uses a fine-tuning strategy with visual-semantic collaborative augmentation to improve robustness, and adds a heterogeneous resource-aware dynamic scheduling algorithm to optimize throughput and latency. Result: Experiments show that AIVD substantially reduces resource consumption and improves MLLM classification accuracy and semantic generation quality, while the scheduling strategy achieves higher throughput and lower latency. Conclusion: Through its edge-cloud collaborative architecture and optimization strategies, AIVD balances precise localization, semantic understanding, and resource efficiency, and suits complex, changing edge computing scenarios. Abstract: Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, this paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through the collaboration between lightweight edge detectors and cloud-based MLLMs. To enhance the cloud MLLM's robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results demonstrate that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.

[127] Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition

Masatomo Yoshida,Haruto Namura,Nicola Adami,Masahiro Okuda

Main category: cs.CV

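TL;DR: The paper proposes a skeletonization-based adversarial attack that effectively reduces the search space and specifically targets images containing text, especially mathematical formulas; it evaluates character-level and semantic changes between original and adversarially perturbed outputs and demonstrates the method's effectiveness and practical relevance on ChatGPT.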

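Details Motivation: The work explores the capabilities and limitations of foundation models in visual understanding, particularly on images with intricate structure such as mathematical formulas. Method: A skeletonization-based adversarial attack generates adversarial examples within a reduced search space, and the model's responses are analyzed for character-level and semantic changes. Result: The method effectively attacks images containing mathematical formulas and produces successful adversarial perturbations against ChatGPT, revealing the model's sensitivity to subtle visual changes and weaknesses in its reasoning. Conclusion: Foundation models have weaknesses when handling complex visual text, and skeletonization-based adversarial attacks offer a new angle for evaluating their visual interpretation and reasoning abilities. Abstract: This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method utilizing skeletonization to reduce the search space effectively. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure. We conduct a detailed evaluation of both character and semantic changes between original and adversarially perturbed outputs to provide insights into the models' visual interpretation and reasoning abilities. The effectiveness of our method is further demonstrated through its application to ChatGPT, which shows its practical implications in real-world scenarios.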

[128] ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting

Yen-Jen Chiou,Wei-Tse Cheng,Yuan-Fu Yang

Main category: cs.CV

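TL;DR: ProFuse is an efficient context-aware framework for open-vocabulary 3D scene understanding based on 3D Gaussian Splatting, achieving fast and accurate semantic fusion through dense correspondence-guided pre-registration and cross-view clustering.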

Details Motivation: The goal is to improve cross-view consistency and intra-mask cohesion in open-vocabulary 3D scene understanding without relying on a pretrained 3DGS scene or render-supervised fine-tuning. Method: A dense correspondence-guided pre-registration phase initializes Gaussians with accurate geometry and builds 3D Context Proposals via cross-view clustering; each proposal's global feature, obtained by weighted aggregation, is fused onto the Gaussians to maintain language coherence. Result: Semantic fusion requires no optimization beyond standard reconstruction, and semantic attachment takes about five minutes per scene, two times faster than the existing state of the art. Conclusion: ProFuse delivers efficient, fast, and accurate open-vocabulary 3D scene understanding that preserves both geometric refinement and semantic consistency. Abstract: We present ProFuse, an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup, adding minimal overhead and requiring no render-supervised fine-tuning. Instead of relying on a pretrained 3DGS scene, we introduce a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, and this feature is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. With associations established in advance, semantic fusion requires no additional optimization beyond standard reconstruction, and the model retains geometric refinement without densification. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, which is two times faster than SOTA.

[129] Segmentation-Driven Monocular Shape from Polarization based on Physical Model

Jinyu Zhang,Xu Ma,Weili Chen,Gonzalo R. Arce

Main category: cs.CV

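TL;DR: This paper proposes a segmentation-driven monocular shape-from-polarization (SMSfP) framework that recasts global shape recovery as reconstruction over adaptively segmented, locally convex regions, effectively resolving the azimuth ambiguity of conventional methods.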

Details Motivation: Existing monocular SfP methods are limited by the azimuth ambiguity inherent in polarization analysis, which degrades the accuracy and stability of 3D reconstruction; this paper addresses that key challenge. Method: A polarization-aided adaptive region growing (PARG) segmentation strategy decomposes the global convexity assumption into multiple locally convex regions, and a multi-scale fusion convexity prior (MFCP) constraint maintains local surface consistency and aids detail recovery. Result: Experiments on synthetic and real-world datasets show that the method significantly improves disambiguation accuracy and geometric fidelity over existing physics-based monocular SfP techniques. Conclusion: By localizing reconstruction and imposing a local convexity prior, the SMSfP framework effectively mitigates azimuth ambiguity and provides a more robust and accurate solution for monocular SfP. Abstract: Monocular shape-from-polarization (SfP) leverages the intrinsic relationship between light polarization properties and surface geometry to recover surface normals from single-view polarized images, providing a compact and robust approach for three-dimensional (3D) reconstruction. Despite its potential, existing monocular SfP methods suffer from azimuth angle ambiguity, an inherent limitation of polarization analysis, that severely compromises reconstruction accuracy and stability. This paper introduces a novel segmentation-driven monocular SfP (SMSfP) framework that reformulates global shape recovery into a set of local reconstructions over adaptively segmented convex sub-regions. Specifically, a polarization-aided adaptive region growing (PARG) segmentation strategy is proposed to decompose the global convexity assumption into locally convex regions, effectively suppressing azimuth ambiguities and preserving surface continuity. Furthermore, a multi-scale fusion convexity prior (MFCP) constraint is developed to ensure local surface consistency and enhance the recovery of fine textural and structural details. Extensive experiments on both synthetic and real-world datasets validate the proposed approach, showing significant improvements in disambiguation accuracy and geometric fidelity compared with existing physics-based monocular SfP techniques.
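The azimuth ambiguity mentioned above follows from the standard sinusoidal model of intensity behind a rotating linear polarizer, which depends on the azimuth only through twice the angle. The short sketch below illustrates this with made-up intensity values; the function name `intensity` and the chosen `i_max` and `i_min` are assumptions for illustration and are not taken from the paper.

```python
import math


def intensity(polarizer_angle: float, azimuth: float, i_max: float = 1.0, i_min: float = 0.4) -> float:
    """Intensity behind a linear polarizer for a surface point with the given azimuth."""
    mean = (i_max + i_min) / 2.0
    amplitude = (i_max - i_min) / 2.0
    return mean + amplitude * math.cos(2.0 * (polarizer_angle - azimuth))


phi = math.radians(30.0)
for deg in (0, 45, 90, 135):
    theta = math.radians(deg)
    a = intensity(theta, phi)
    b = intensity(theta, phi + math.pi)  # opposite-facing candidate azimuth
    print(deg, round(a, 6), round(b, 6))  # the two columns agree
```

Since both candidates explain every measurement equally well, an extra cue is needed to pick one, which is the role the paper assigns to its local convexity prior.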

[130] GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

Shurong Zheng,Yousong Zhu,Hongyin Zhao,Fan Yang,Yufei Zhan,Ming Tang,Jinqiao Wang

Main category: cs.CV

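TL;DR: This paper proposes GeM-VG, a multi-image visual grounding model that, supported by the large-scale MG-Data-240K dataset and a hybrid reinforcement finetuning strategy, achieves strong generalized grounding in both single-image and multi-image settings.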

Details Motivation: Existing multi-image grounding methods are restricted to single-target localization and a narrow set of task types, lack unified modeling of generalized grounding tasks, and rely on datasets that are limited in target quantity and image relations. Method: The authors propose the GeM-VG model, systematically categorize multi-image grounding tasks, and build the MG-Data-240K dataset; a hybrid reinforcement finetuning strategy combines chain-of-thought (CoT) reasoning with direct answering and optimizes the model with an R1-like algorithm guided by a rule-based reward. Result: GeM-VG outperforms the previous leading MLLMs by 2.0% on MIG-Bench and 9.7% on MC-Bench, improves over the base model by 9.1% on the ODINW single-image grounding task, and retains strong general multi-image understanding. Conclusion: Through unified modeling and a new training strategy, GeM-VG shows strong generalization and performance on both multi-image and single-image visual grounding. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained by single-target localization and limited types of practical tasks, due to the lack of unified modeling for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, addressing the limitations of existing datasets regarding target quantity and image relation. To tackle the challenges of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, considering their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.
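The abstract does not specify the rule-based reward, so the following is only a guess at what a reward of this kind could look like for multi-image grounding: it checks that the right image was selected and that the predicted box overlaps the target enough. The function names, the 0.5 threshold, and the binary reward are all hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(predicted, target, threshold=0.5):
    """Hypothetical rule: 1.0 if the right image is chosen and the box overlaps enough."""
    pred_image, pred_box = predicted
    gt_image, gt_box = target
    if pred_image != gt_image:
        return 0.0
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0


print(grounding_reward((1, (0, 0, 2, 2)), (1, (0, 0, 2, 2))))
```

A reward like this can be computed from the model output alone, which is what makes rule-based rewards convenient for R1-style reinforcement finetuning.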

[131] CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models

Tobia Poppi,Burak Uzkent,Amanmeet Garg,Lucas Porto,Garin Kessler,Yezhou Yang,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara,Florian Schiffers

Main category: cs.CV

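TL;DR: The paper proposes a scalable counterfactual video generation framework and the resulting CounterVid dataset to mitigate hallucinations in video-language models, especially in action and temporal reasoning. It combines multimodal LLMs with diffusion models to generate semantic hard negatives and introduces MixDPO, which jointly leverages textual and visual preferences for optimization.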

Details Motivation: Existing methods struggle to address action and temporal reasoning hallucinations that arise from video-language models' reliance on language priors; finer-grained perception of visual dynamics is needed. Method: A counterfactual video generation framework combines multimodal LLMs (for action proposal and editing guidance) with diffusion models (for image and video generation) and is used to build CounterVid, a synthetic dataset of about 26k preference pairs; MixDPO is proposed to unify textual and visual preferences in Direct Preference Optimization. Result: Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers well to standard video hallucination benchmarks. Conclusion: The framework reduces reliance on language priors, strengthens the model's grasp of fine-grained visual dynamics, and improves the robustness of video-language models in action recognition and temporal reasoning. Abstract: Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.
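For readers unfamiliar with Direct Preference Optimization, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is the generic DPO objective only: how MixDPO mixes textual and visual preference pairs is not described in the abstract and is deliberately left out. The function name, argument names, and the `beta` value are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the frozen reference model does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# The policy has moved toward the chosen answer relative to the reference,
# so the loss is lower than in the no-preference case.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0), dpo_loss(-5.0, -5.0, -5.0, -5.0))
```

In a counterfactual-video setting, a visual pair would plausibly hold the text fixed and contrast the original clip with its edited counterpart, but that pairing is an assumption here.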

[132] Defocus Aberration Theory Confirms Gaussian Model in Most Imaging Devices

Akbar Saadat

Main category: cs.CV

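TL;DR: This paper presents a Gaussian-model-based defocus analysis that uses geometric optics and defocus aberration theory from diffraction-limited optics to confirm that the defocus operator in conventional imaging devices is well approximated by a Gaussian model; the reported mean absolute error of the fit is below 1%.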

Details Motivation: Accurately estimating depth from 2D images is a fundamental difficulty in 3D reconstruction; conventional methods cannot separate depth-induced defocus blur from inherent blur, so a reliable and analytically tractable defocus model is needed to resolve this ill-posed problem. Method: The Gaussian model is taken as an approximation of the defocus operator; within the framework of geometric optics, combined with defocus aberration theory from diffraction-limited optics, the paper analyzes the relationship between depth and the relative blur of two images taken with different camera settings, and gives the device settings under which the actual defocus operator adheres to the Gaussian model. Result: For typical focused depths between 1 and 100 meters, with a maximum depth variation of 10% of the focused depth, the Gaussian model is confirmed to be applicable to most imaging devices, with a mean absolute error (MAE) of the fit below 1%. Conclusion: The Gaussian model is the only theoretical model applicable to both absolute blur in a single image and relative blur between two images; the paper provides the theoretical basis and practical conditions for using it in conventional imaging devices, with high accuracy and potential for real-time applications. Abstract: Over the past three decades, defocus has consistently provided groundbreaking depth information in scene images. However, accurately estimating depth from 2D images continues to be a persistent and fundamental challenge in the field of 3D recovery. Heuristic approaches face the ill-posed problem of inferring the spatially variant defocus blur, as the desired blur cannot be distinguished from the inherent blur. Given prior knowledge of the defocus model, the problem becomes well-posed, with an analytic solution for the relative blur between two images taken at the same viewpoint with different camera focus settings. The Gaussian model stands out as an optimal choice for real-time applications, due to its mathematical simplicity and computational efficiency. Theoretically, it is also the only model that can be applied at the same time to both the absolute blur caused by depth in a single image and the relative blur resulting from depth differences between two images. This paper introduces the settings, for conventional imaging devices, to ensure that the defocusing operator adheres to the Gaussian model. Defocus analysis begins within the framework of geometric optics and is conducted by defocus aberration theory in diffraction-limited optics to obtain the accuracy of fitting the actual model to its Gaussian approximation. The results for a typical set of focused depths between $1$ and $100$ meters, with a maximum depth variation of $10\%$ at the focused depth, confirm the Gaussian model's applicability for defocus operators in most imaging devices.
The findings demonstrate a maximum Mean Absolute Error (MAE) of less than $1\%$, underscoring the model's accuracy and reliability.
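One reason the Gaussian model handles both absolute and relative blur is that Gaussians are closed under convolution: blurring with widths sigma_1 and then sigma_r is equivalent to a single blur of width sqrt(sigma_1^2 + sigma_r^2). The sketch below checks this numerically on sampled 1-D kernels. The helper names and the example widths are illustrative assumptions, not values from the paper.

```python
import math


def gaussian(sigma, radius=40):
    """Normalized, sampled 1-D Gaussian on the integer grid [-radius, radius]."""
    values = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(values)
    return [v / total for v in values]


def convolve(a, b):
    """Full discrete convolution of two kernels."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def variance(kernel):
    """Variance of a normalized, symmetric kernel about its middle sample."""
    center = (len(kernel) - 1) / 2.0
    return sum(v * (i - center) ** 2 for i, v in enumerate(kernel))


def relative_sigma(sigma_sharp, sigma_blurred):
    """Blur that maps the sharper image onto the blurrier one under the Gaussian model."""
    return math.sqrt(sigma_blurred ** 2 - sigma_sharp ** 2)


sigma_1, sigma_2 = 1.5, 2.5
sigma_r = relative_sigma(sigma_1, sigma_2)
blurred_again = convolve(gaussian(sigma_1), gaussian(sigma_r))
# The second value should closely match sigma_2.
print(sigma_r, math.sqrt(variance(blurred_again)))
```

This closure property is what gives the relative blur between two differently focused images an analytic form.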

[133] SRU-Pix2Pix: A Fusion-Driven Generator Network for Medical Image Translation with Few-Shot Learning

Xihe Qiu,Yang Dai,Xiaoyu Tan,Sijia Li,Fenghao Sun,Lu Gan,Liang Liu

Main category: cs.CV

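TL;DR: The paper proposes an enhanced Pix2Pix framework that combines SEResNet and U-Net++ to achieve high-quality MRI image translation with high structural fidelity under few-shot conditions.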

Details Motivation: The work explores image translation as a way to address the long acquisition time, high cost, and limited resolution that constrain clinical use of MRI. Method: An improved Pix2Pix framework incorporates Squeeze-and-Excitation Residual Networks (SEResNet) to strengthen feature representation through channel attention and U-Net++ to improve multi-scale feature fusion; a simplified PatchGAN discriminator stabilizes training and improves local anatomical realism. Result: Under few-shot conditions with fewer than 500 images, the method shows consistent structural fidelity, higher image quality, and good generalization across multiple intra-modality MRI translation tasks. Conclusion: The enhanced Pix2Pix framework improves medical image translation performance and is an effective extension of Pix2Pix in this field. Abstract: Magnetic Resonance Imaging (MRI) provides detailed tissue information, but its clinical application is limited by long acquisition time, high cost, and restricted resolution. Image translation has recently gained attention as a strategy to address these limitations. Although Pix2Pix has been widely applied in medical image translation, its potential has not been fully explored. In this study, we propose an enhanced Pix2Pix framework that integrates Squeeze-and-Excitation Residual Networks (SEResNet) and U-Net++ to improve image generation quality and structural fidelity. SEResNet strengthens critical feature representation through channel attention, while U-Net++ enhances multi-scale feature fusion. A simplified PatchGAN discriminator further stabilizes training and refines local anatomical realism. Experimental results demonstrate that under few-shot conditions with fewer than 500 images, the proposed method achieves consistent structural fidelity and superior image quality across multiple intra-modality MRI translation tasks, showing strong generalization ability. These results suggest an effective extension of Pix2Pix for medical image translation.

[134] Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers

Lee Hyoseok,Sohwi Lim,Eunju Cha,Tae-Hyun Oh

Main category: cs.CV

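TL;DR: This paper proposes the Measurement-Consistent Langevin Corrector (MCLC), a method that stabilizes inverse problem solvers based on latent diffusion models (LDMs) and improves image restoration quality by removing inconsistencies in the reverse diffusion dynamics.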

Details Motivation: Existing LDM-based inverse solvers suffer from instability and artifacts, which stem mainly from the gap between the solver's dynamics and the true reverse diffusion dynamics; the paper aims to identify and narrow this gap to improve stability. Method: The MCLC module applies measurement-consistent Langevin updates that do not require a linear manifold assumption, acting as a plug-and-play corrector for the deviation of LDM solvers during reverse diffusion. Result: Experiments show that MCLC reduces artifacts, improves restoration quality, is compatible with several existing solvers, and behaves more stably across multiple image restoration tasks; the causes of blob artifacts are also analyzed. Conclusion: MCLC is a key step toward more robust zero-shot inverse problem solvers, removing the dependence of earlier methods on assumptions that often do not hold and improving reliability in practice. Abstract: With recent advances in generative models, diffusion models have emerged as powerful priors for solving inverse problems in various domains. Since Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers suffer from instability, exhibiting undesirable artifacts and degraded quality. In this work, we first identify the instability as a discrepancy between the solver's and true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies the LDM-based inverse solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption, leading to more stable and reliable behavior. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. Additionally, we analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.

[135] PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference

Denis Korzhenkov,Adil Karjauv,Animesh Karnewar,Mohsen Ghafoorian,Amirhossein Habibian

Main category: cs.CV

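TL;DR: The paper proposes a low-cost finetuning method that converts a pretrained diffusion model into a pyramidal one, preserving video generation quality while improving inference efficiency.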

Details Motivation: Existing open-source pyramidal video models are trained from scratch, produce lower-quality output than state-of-the-art models, and leave room for better inference efficiency. Method: A pretrained diffusion model is converted into a pyramidal structure through low-cost finetuning, and several step distillation strategies are compared to further optimize inference efficiency. Result: The resulting pyramidal model generates high-quality video at reduced inference cost without loss of visual quality. Conclusion: The method offers a practical path to efficient video generation that balances performance and compute. Abstract: Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency. Our results are available at https://qualcomm-ai-research.github.io/PyramidalWan.

[136] Detector-Augmented SAMURAI for Long-Duration Drone Tracking

Tamara R. Lenhard,Andreas Weinmann,Hichem Snoussi,Tobias Koch

Main category: cs.CV

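TL;DR: This paper presents the first systematic evaluation of SAMURAI for drone tracking and proposes a detector-augmented extension that improves robustness in complex urban environments, particularly on long sequences and when drones exit and re-enter the scene.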

Details Motivation: Existing detector-based drone tracking methods suffer from temporal inconsistency and rely heavily on conventional motion models, while foundation models such as SAMURAI perform well in other domains but have not yet been explored for drones. Method: A detector-augmented extension of SAMURAI incorporates detector cues to reduce sensitivity to bounding-box initialization and sequence length. Result: The method outperforms SAMURAI's zero-shot performance across datasets and metrics, with success rate improvements of up to +0.393 and false negative rate (FNR) reductions of up to -0.475, and performs especially well on long sequences and re-entry scenarios. Conclusion: The study confirms the potential of foundation models for drone tracking, and the proposed detector augmentation significantly improves tracking robustness and stability. Abstract: Robust long-term tracking of drones is a critical requirement for modern surveillance systems, given their increasing threat potential. While detector-based approaches typically achieve strong frame-level accuracy, they often suffer from temporal inconsistencies caused by frequent detection dropouts. Despite its practical relevance, research on RGB-based drone tracking is still limited and largely reliant on conventional motion models. Meanwhile, foundation models like SAMURAI have established their effectiveness across other domains, exhibiting strong category-agnostic tracking performance. However, their applicability in drone-specific scenarios has not been investigated yet. Motivated by this gap, we present the first systematic evaluation of SAMURAI's potential for robust drone tracking in urban surveillance settings. Furthermore, we introduce a detector-augmented extension of SAMURAI to mitigate sensitivity to bounding-box initialization and sequence length. Our findings demonstrate that the proposed extension significantly improves robustness in complex urban environments, with pronounced benefits in long-duration sequences - especially under drone exit-re-entry events. The incorporation of detector cues yields consistent gains over SAMURAI's zero-shot performance across datasets and metrics, with success rate improvements of up to +0.393 and FNR reductions of up to -0.475.

[137] Integrated Framework for Selecting and Enhancing Ancient Marathi Inscription Images from Stone, Metal Plate, and Paper Documents

Bapu D. Chendage,Rajivkumar S. Mente

Main category: cs.CV

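TL;DR: This paper proposes an enhancement method for ancient script images based on binarization and complementary preprocessing techniques, improving the readability of faded text on stone, metal plates, and documents.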

Details Motivation: Ancient script images often suffer from severe background noise, low contrast, and degradation caused by aging and environmental effects, which makes the text hard to read and calls for effective image enhancement. Method: A binarization-based enhancement method is combined with complementary preprocessing for removing stains and enhancing unclear text; it is applied to several types of ancient script images and evaluated with K-NN and SVM classifiers. Result: With the K-NN classifier, classification accuracies are 55.7%, 62%, and 65.6% for stone, metal plate, and document scripts, respectively; with the SVM classifier, they are 53.2%, 59.5%, and 67.8%. Conclusion: According to the authors, the proposed enhancement method improves the readability of ancient Marathi inscription images, particularly for text with complex backgrounds and low contrast. Abstract: Ancient script images often suffer from severe background noise, low contrast, and degradation caused by aging and environmental effects. In many cases, the foreground text and background exhibit similar visual characteristics, making the inscriptions difficult to read. The primary objective of image enhancement is to improve the readability of such degraded ancient images. This paper presents an image enhancement approach based on binarization and complementary preprocessing techniques for removing stains and enhancing unclear ancient text. The proposed methods are evaluated on different types of ancient scripts, including inscriptions on stone, metal plates, and historical documents. Experimental results show that the proposed approach achieves classification accuracies of 55.7%, 62%, and 65.6% for stone, metal plate, and document scripts, respectively, using the K-Nearest Neighbor (K-NN) classifier. Using the Support Vector Machine (SVM) classifier, accuracies of 53.2%, 59.5%, and 67.8% are obtained. The results demonstrate the effectiveness of the proposed enhancement method in improving the readability of ancient Marathi inscription images.
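The abstract does not say which binarization algorithm is used. As a reference point, the sketch below implements Otsu's global threshold, a common baseline for document binarization, on a flat list of 8-bit grayscale values. Treating dark pixels as ink is an assumption that suits dark text on a lighter background; the function names are illustrative.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]                # number of pixels with value <= t
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map dark pixels (<= threshold) to 1 (ink) and the rest to 0 (background)."""
    return [1 if p <= threshold else 0 for p in pixels]


row = [12, 15, 200, 210, 14, 205]
print(binarize(row, otsu_threshold(row)))
```

On heavily stained stone or metal surfaces a single global threshold often fails, which may be why the paper pairs binarization with complementary preprocessing.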

[138] SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models

Oriol Rabasseda,Zenjie Li,Kamal Nasrollahi,Sergio Escalera

Main category: cs.CV

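TL;DR: This paper introduces SOVABench, a retrieval benchmark for vehicle action recognition in surveillance video, and proposes a training-free multimodal large model framework that generates interpretable descriptions to improve action discrimination and spatio-temporal reasoning.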

Details Motivation: Existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination that surveillance requires, so a real-world benchmark centered on vehicle actions is needed. Method: The SOVABench dataset defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding; a training-free framework uses multimodal large language models to generate descriptions of images and videos and derives interpretable embeddings from them. Result: Experiments show that, although humans distinguish these actions intuitively, current vision and multimodal models still find them challenging; the proposed framework performs strongly on SOVABench and on several spatial and counting benchmarks, particularly on tasks where contrastive vision-language models do poorly. Conclusion: SOVABench fills a gap in evaluating action recognition for surveillance, and the training-free MLLM framework offers a new direction for interpretable video understanding with potential for complex reasoning tasks. Abstract: Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.

[139] Character Detection using YOLO for Writer Identification in multiple Medieval books

Alessandra Scotto di Freca,Tiziana D Alessandro,Francesco Fontanella,Filippo Sarria,Claudio De Stefano

Main category: cs.CV

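TL;DR: This paper proposes replacing template matching and a CNN with YOLOv5 for letter detection and writer identification in medieval manuscripts, which increases the number of extracted letter instances and the classification accuracy, and uses the YOLO confidence score as a rejection mechanism for more reliable identification on unseen manuscripts.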

Details Motivation: Earlier methods rely on template matching to find specific letters for writer attribution, but their performance depends on threshold selection and they generalize poorly; a more robust method is needed to detect more characters automatically and to improve the accuracy and generalization of writer attribution. Method: The YOLOv5 object detection model replaces the original template matching and CNN pipeline, detecting instances of the letter "a" directly on page images and using confidence scores for writer classification, with high-confidence predictions improving the second stage. Result: YOLOv5 extracts more letter instances than the previous method and significantly improves second-stage classification accuracy; its confidence score can also be used to set a rejection threshold, enabling reliable writer identification on unseen manuscripts. Conclusion: YOLOv5 improves the writer identification pipeline in paleography, overcomes the limitations of template matching, and provides a more scalable and robust automated solution for manuscript analysis. Abstract: Paleography is the study of ancient and historical handwriting; its key objectives include the dating of manuscripts and understanding the evolution of writing. Estimating when a document was written and tracing the development of scripts and writing styles can be aided by identifying the individual scribes who contributed to a medieval manuscript. Although digital technologies have made significant progress in this field, the general problem remains unsolved and continues to pose open challenges. ... We previously proposed an approach focused on identifying specific letters or abbreviations that characterize each writer. In that study, we considered the letter "a", as it was widely present on all pages of text and highly distinctive, according to the suggestions of expert paleographers. We used template matching techniques to detect the occurrences of the character "a" on each page and a convolutional neural network (CNN) to attribute each instance to the correct scribe. Building on the results of this previous system, and aware of the limitations of template matching, which requires an appropriate threshold to work, we decided to experiment within the same framework with the YOLO object detection model to identify the scribes who contributed to the writing of different medieval books. We used the fifth version of YOLO (YOLOv5), which completely replaced the template matching and CNN used in the previous work.
The experimental results demonstrate that YOLO effectively extracts a greater number of letters considered, leading to a more accurate second-stage classification. Furthermore, the YOLO confidence score provides a foundation for developing a system that applies a rejection threshold, enabling reliable writer identification even in unseen manuscripts.
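
The rejection idea in the last sentence can be illustrated with a minimal sketch. It assumes each detected letter "a" comes with a predicted scribe label and a confidence score; the function name `identify_scribe`, the confidence-weighted vote, and both thresholds are hypothetical choices for illustration, not the procedure described in the paper.

```python
from collections import defaultdict


def identify_scribe(detections, min_conf=0.5, min_share=0.7):
    """Return a scribe label for one page, or None to reject the page.

    detections: list of (scribe_label, confidence) pairs, one per detected
    letter. Both thresholds are illustrative assumptions.
    """
    votes = defaultdict(float)
    for label, conf in detections:
        # Ignore weak detections entirely.
        if conf >= min_conf:
            votes[label] += conf
    if not votes:
        return None  # nothing reliable was detected
    best = max(votes, key=votes.get)
    share = votes[best] / sum(votes.values())
    # Reject when the winning scribe does not clearly dominate the vote.
    return best if share >= min_share else None
```

A page whose confident detections mostly agree is attributed; a page with split or weak evidence is rejected rather than forced onto a known scribe, which is the behavior one would want for unseen manuscripts.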

[140] DivAS: Interactive 3D Segmentation of NeRFs via Depth-Weighted Voxel Aggregation

Ayush Pande

Main category: cs.CV

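TL;DR: The paper proposes DivAS, an optimization-free, fully interactive NeRF segmentation framework that combines 2D SAM masks with NeRF depth priors to achieve real-time, high-accuracy 3D segmentation while preserving zero-shot capability.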

Details Motivation: Most existing NeRF segmentation methods are optimization-based and require per-scene training, which sacrifices the zero-shot capability of 2D foundation models; they are also slow and poorly suited to real-time interaction. Method: In the DivAS framework, the user supplies point prompts through a GUI to generate 2D SAM masks, the multi-view masks are geometrically refined using NeRF-derived depth priors, and a custom CUDA kernel aggregates them into a unified 3D voxel grid in under 200 ms. Result: On Mip-NeRF 360° and LLFF, DivAS matches the segmentation quality of optimization-based methods while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when user prompting time is excluded. Conclusion: By removing per-scene optimization, DivAS preserves the zero-shot capability of 2D foundation models and delivers fast, interactive, high-accuracy NeRF segmentation suited to applications that need real-time feedback. Abstract: Existing methods for segmenting Neural Radiance Fields (NeRFs) are often optimization-based, requiring slow per-scene training that sacrifices the zero-shot capabilities of 2D foundation models. We introduce DivAS (Depth-interactive Voxel Aggregation Segmentation), an optimization-free, fully interactive framework that addresses these limitations. Our method operates via a fast GUI-based workflow where 2D SAM masks, generated from user point prompts, are refined using NeRF-derived depth priors to improve geometric accuracy and foreground-background separation. The core of our contribution is a custom CUDA kernel that aggregates these refined multi-view masks into a unified 3D voxel grid in under 200 ms, enabling real-time visual feedback. This optimization-free design eliminates the need for per-scene training. Experiments on Mip-NeRF 360° and LLFF show that DivAS achieves segmentation quality comparable to optimization-based methods, while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when excluding user prompting time.
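
To make the aggregation step concrete, here is a rough NumPy sketch of depth-weighted voting of multi-view masks into voxels. Everything in it is an assumption for illustration: the per-view dictionary layout, the exponential weight, `sigma`, and the 0.5 threshold are hypothetical, and the paper's actual method is a custom CUDA kernel whose details the abstract does not give.

```python
import numpy as np


def aggregate_views(views, n_voxels, sigma=0.05, threshold=0.5):
    """Fuse per-view 2D masks into a boolean label per voxel.

    Each view is a dict (hypothetical layout) with:
      "pix":         (N, 2) int array, pixel (row, col) each voxel projects to
      "voxel_depth": (N,) depth of each voxel along that view's rays
      "mask":        (H, W) 2D mask, 1.0 = foreground
      "depth":       (H, W) rendered surface depth for the view
    """
    num = np.zeros(n_voxels)
    den = np.zeros(n_voxels)
    for v in views:
        rows, cols = v["pix"][:, 0], v["pix"][:, 1]
        surface = v["depth"][rows, cols]
        # A voxel far from the visible surface is occluded (or empty space)
        # in this view, so its vote is down-weighted.
        w = np.exp(-np.abs(v["voxel_depth"] - surface) / sigma)
        num += w * v["mask"][rows, cols]
        den += w
    score = num / np.maximum(den, 1e-8)
    return score > threshold
```

The point of the weighting is that a view in which a voxel is hidden behind another surface contributes almost nothing, so a background mask value seen through an occluder does not cancel a foreground vote from a view where the voxel is actually visible.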

[141] Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform

Suyash Mishra,Qiang Li,Srikanth Patil,Satyanarayan Pati,Baddu Narendra

Main category: cs.CV

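TL;DR: This paper presents a multimodal reasoning framework for industrial-scale long-form video understanding, evaluates more than 40 vision language models under realistic constraints, identifies bottlenecks of current VLMs in temporal alignment, keyframe detection, and video splitting, and offers practical guidance for building scalable systems under GPU, latency, and cost limits.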

Details Motivation: Existing VLM evaluations focus mostly on short videos and ignore resource constraints, so they fall short of the efficiency and scalability that industrial settings such as the pharmaceutical domain require for long-form video processing. Method: The authors build an industrial large-scale multimodal architecture and empirically analyze more than 40 VLMs on two benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos, focusing on the role of multimodality, attention mechanism trade-offs, temporal reasoning ability, and video splitting under GPU constraints. Result: SDPA attention yields 3-8x efficiency gains on commodity GPUs; multimodality improves up to 8 of 12 task domains, especially length-dependent tasks; and clear bottlenecks in temporal alignment and keyframe detection appear across open- and closed-source VLMs. Conclusion: Current VLMs show significant efficiency and capability limits under realistic deployment conditions, so attention mechanism design, modality fusion strategy, and temporal modeling must be considered together; the paper offers practical guidance for designing industrial long-form video understanding systems. Abstract: Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4, M4V), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8/12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. 
Rather than proposing a new "A+B" model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.

[142] Rotation-Robust Regression with Convolutional Model Trees

Hongyi Li,William Ward Armstrong,Jun Xu

Main category: cs.CV

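TL;DR: This paper studies rotation-robust learning with Convolutional Model Trees (CMTs), introducing geometry-aware inductive biases and a deployment-time orientation search to improve robustness to rotated image inputs, and examines the effectiveness and limitations of the approach on rotated MNIST data.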

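Details Motivation: To improve the robustness of Convolutional Model Trees under image rotation and to explore how geometric structure and deployment-time transformations affect model performance. Method: The paper proposes three geometry-aware inductive biases (convolutional smoothing, a tilt dominance constraint, and importance-based pruning) and combines them with a deployment-time search over discrete rotations that selects the orientation maximizing a forest-level confidence proxy. Result: Orientation search improves robustness under large rotations but can hurt near the canonical orientation when confidence is misaligned with correctness; consistent trends are observed on MNIST one-vs-rest regression. Conclusion: Geometry-aware biases and confidence-driven orientation search help rotation robustness, but their benefit is limited by how well confidence aligns with true accuracy, which reveals both the promise and the limitations of the approach for model-tree ensembles. Abstract: We study rotation-robust learning for image inputs using Convolutional Model Trees (CMTs) [1], whose split and leaf coefficients can be structured on the image grid and transformed geometrically at deployment time. In a controlled MNIST setting with a rotation-invariant regression target, we introduce three geometry-aware inductive biases for split directions -- convolutional smoothing, a tilt dominance constraint, and importance-based pruning -- and quantify their impact on robustness under in-plane rotations. We further evaluate a deployment-time orientation search that selects a discrete rotation maximizing a forest-level confidence proxy without updating model parameters. Orientation search improves robustness under severe rotations but can be harmful near the canonical orientation when confidence is misaligned with correctness. Finally, we observe consistent trends on MNIST digit recognition implemented as one-vs-rest regression, highlighting both the promise and limitations of confidence-based orientation selection for model-tree ensembles.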
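
The deployment-time orientation search can be sketched in a few lines. This is a simplified illustration under stated assumptions: it rotates the input rather than the model coefficients, restricts the candidates to multiples of 90 degrees so that `np.rot90` needs no interpolation, and takes the confidence proxy as a caller-supplied function `confidence_fn` (a hypothetical stand-in for the forest-level proxy).

```python
import numpy as np


def orientation_search(image, confidence_fn, angles=(0, 90, 180, 270)):
    """Pick the discrete rotation whose rotated input maximizes confidence.

    No model parameters are updated; only the input orientation changes.
    Angles must be multiples of 90 in this simplified sketch.
    """
    best_angle, best_conf = None, -np.inf
    for angle in angles:
        rotated = np.rot90(image, k=angle // 90)  # counterclockwise turns
        conf = confidence_fn(rotated)
        if conf > best_conf:
            best_angle, best_conf = angle, conf
    return best_angle, best_conf
```

The failure mode named in the abstract is visible in this structure: the search trusts `confidence_fn` completely, so whenever confidence is high for a wrong orientation, an upright input can be rotated away from the canonical pose.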

[143] Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

Subhadeep Roy,Gagan Bhatia,Steffen Eger

Main category: cs.CV

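TL;DR: This paper studies prototypicality bias in the evaluation of text-to-image models and proposes a new benchmark, ProtoBias, and a more robust metric, ProtoScore, which reduces misjudgments caused by biased data distributions while remaining computationally efficient.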

Details Motivation: Existing automatic metrics for text-to-image generation may favor "typical" images over semantically correct ones; this bias distorts the assessment of true model performance and needs to be studied and addressed systematically. Method: The paper proposes ProtoBias, a controlled contrastive benchmark covering animal, object, and demography images, in which semantically correct but non-prototypical images are paired with semantically incorrect but visually more prototypical adversarial images to test whether a metric defaults to prototypes. It also designs ProtoScore, a new metric based on a 7B-parameter model that is more sensitive to semantic consistency and less prone to prototypicality bias. Result: Widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently prefer the prototypical image; even LLM-as-Judge systems are unstable in socially grounded cases. Human evaluators favor the semantically correct images. ProtoScore substantially reduces the misjudgment rate and runs far faster than GPT-5-class models. Conclusion: Mainstream image generation metrics exhibit clear prototypicality bias and do not reliably reflect semantic consistency; ProtoScore offers an efficient and robust alternative that supports fairer and more accurate multimodal evaluation. Abstract: Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark, ProtoBias (Prototypical Bias), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. Motivated by these findings, we propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking, while running orders of magnitude faster than GPT-5 inference and approaching the robustness of much larger closed-source judges.
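
The directional evaluation on contrastive pairs reduces to a simple failure rate. The sketch below assumes each pair has already been scored by the metric under test; the function name `misranking_rate` and the choice to count ties as failures are illustrative assumptions, not definitions taken from the paper.

```python
def misranking_rate(pairs):
    """Fraction of pairs where the metric fails to prefer the correct image.

    pairs: iterable of (score_correct, score_prototypical), where the first
    score is for the semantically correct but non-prototypical image and the
    second for the incorrect but prototypical counterpart.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    # A tie is counted as a failure: the metric did not follow the text.
    wrong = sum(1 for correct, proto in pairs if proto >= correct)
    return wrong / len(pairs)
```

Because every pair has a known correct member, the rate is directly interpretable: 0.0 means the metric always follows the text, and values near or above 0.5 mean it is defaulting to prototypes.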

[144] TEA: Temporal Adaptive Satellite Image Semantic Segmentation

Juyuan Kang,Hao Zhu,Yan Zhu,Wei Zhang,Jianing Chen,Tianxiang Xiao,Yike Ma,Hao Jiang,Feng Dai

Main category: cs.CV

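TL;DR: This paper proposes TEA, a temporally adaptive semantic segmentation method for satellite image time series (SITS) that uses a teacher-student framework and a full-sequence reconstruction auxiliary task to improve generalization and segmentation performance across inputs of different temporal lengths.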

Details Motivation: Existing SITS segmentation methods mostly assume fixed-length time series and overlook generalization across different temporal lengths, so they perform poorly on variable-length sequences. Method: TEA uses a teacher-student framework: a teacher model captures global sequence knowledge and guides a student model that receives inputs of variable temporal length; knowledge is transferred through intermediate embeddings, prototypes, and soft labels, and student models are dynamically aggregated to mitigate knowledge forgetting; full-sequence reconstruction is added as an auxiliary task to improve representation quality. Result: Experiments show that TEA clearly outperforms existing methods on common benchmarks across inputs of different temporal lengths, with stronger robustness and generalization. Conclusion: TEA improves the adaptability and performance of SITS semantic segmentation models on variable-length time series and offers a workable solution to incomplete time series in practical agricultural applications. Abstract: Crop mapping based on satellite image time series (SITS) holds substantial economic value in agricultural production settings, in which parcel segmentation is an essential step. Existing approaches have achieved notable advancements in SITS segmentation with predetermined sequence lengths. However, we found that these approaches overlooked the generalization capability of models across scenarios with varying temporal length, leading to markedly poor segmentation results in such cases. To address this issue, we propose TEA, a TEmporal Adaptive SITS semantic segmentation method to enhance the model's resilience under varying sequence lengths. We introduce a teacher model that encapsulates the global sequence knowledge to guide a student model with adaptive temporal input lengths. Specifically, the teacher shapes the student's feature space via intermediate embedding, prototypes and soft label perspectives to realize knowledge transfer, while dynamically aggregating the student model to mitigate knowledge forgetting. Finally, we introduce full-sequence reconstruction as an auxiliary task to further enhance the quality of representations across inputs of varying temporal lengths. Through extensive experiments, we demonstrate that our method brings remarkable improvements across inputs of different temporal lengths on common benchmarks. Our code will be publicly available.

[145] SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection

Maximilian Pittner,Joel Janai,Mario Faigle,Alexandru Paul Condurache

Main category: cs.CV

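TL;DR: This paper proposes SparseLaneSTP, a sparse lane detection method that combines geometric properties of the lane structure with temporal information. It introduces a lane-specific spatio-temporal attention mechanism, a continuous lane representation, and temporal regularization, and contributes a precise 3D lane dataset, achieving state-of-the-art performance on existing and new datasets.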

Details Motivation: Existing 3D lane detection methods suffer from misalignment in the BEV feature transformation, sparse methods ignore lane priors, and none exploit historical observations to cope with poor visibility, so a more robust and accurate method is needed. Method: SparseLaneSTP adopts a sparse lane transformer architecture with a lane-specific spatio-temporal attention mechanism, a continuous lane representation, and a temporal regularization strategy, and is trained and evaluated with a precise, auto-labeled 3D lane dataset. Result: On several existing 3D lane detection benchmarks and on the newly proposed high-quality dataset, SparseLaneSTP achieves state-of-the-art results in both detection and error metrics. Conclusion: By combining geometric priors, temporal information, and a sparse detection architecture, SparseLaneSTP improves the accuracy and robustness of 3D lane detection, and the new dataset provides a more reliable basis for future evaluation. Abstract: 3D lane detection has emerged as a critical challenge in autonomous driving, encompassing identification and localization of lane markings and the 3D road surface. Conventional 3D methods detect lanes from dense bird's-eye-view (BEV) features, though erroneous transformations often result in a poor feature representation misaligned with the true 3D road surface. While recent sparse lane detectors have surpassed dense BEV approaches, they completely disregard valuable lane-specific priors. Furthermore, existing methods fail to utilize historic lane observations, which yield the potential to resolve ambiguities in situations of poor visibility. To address these challenges, we present SparseLaneSTP, a novel method that integrates both geometric properties of the lane structure and temporal information into a sparse lane transformer. It introduces a new lane-specific spatio-temporal attention mechanism, a continuous lane representation tailored for sparse architectures as well as temporal regularization. Identifying weaknesses of existing 3D lane datasets, we also introduce a precise and consistent 3D lane dataset using a simple yet effective auto-labeling strategy. Our experimental section proves the benefits of our contributions and demonstrates state-of-the-art performance across all detection and error metrics on existing 3D lane detection benchmarks as well as on our novel dataset.

[146] OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction

Minseong Kweon,Jinsun Park

Main category: cs.CV

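TL;DR: This paper proposes OceanSplat, a 3D Gaussian Splatting-based method for representing the geometry of underwater scenes; trinocular view consistency, a synthetic depth prior, and a depth-aware opacity adjustment overcome underwater optical degradation and improve 3D reconstruction quality.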

Details Motivation: Underwater optical degradation causes multi-view inconsistencies and floating artifacts, and existing methods struggle to reconstruct underwater 3D geometry accurately. Method: The method enforces trinocular view consistency by translating camera views and aligning them via inverse warping; derives a synthetic epipolar depth prior through triangulation as a self-supervised regularizer; and proposes a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians according to their z-component and viewing direction. Result: On real and simulated underwater scenes, OceanSplat substantially outperforms existing methods in both scene reconstruction and image restoration, reducing floating artifacts and improving geometric accuracy. Conclusion: By adding geometric constraints and depth-aware optimization, OceanSplat disentangles 3D Gaussians from the scattering medium and achieves robust underwater scene reconstruction. Abstract: We introduce OceanSplat, a novel 3D Gaussian Splatting-based approach for accurately representing 3D geometry in underwater scenes. To overcome multi-view inconsistencies caused by underwater optical degradation, our method enforces trinocular view consistency by rendering horizontally and vertically translated camera views relative to each input view and aligning them via inverse warping. Furthermore, these translated camera views are used to derive a synthetic epipolar depth prior through triangulation, which serves as a self-supervised depth regularizer. These geometric constraints facilitate the spatial optimization of 3D Gaussians and preserve scene structure in underwater environments. We also propose a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians during early training based on their $z$-component and viewing direction, deterring the formation of medium-induced primitives. With our contributions, 3D Gaussians are disentangled from the scattering medium, enabling robust representation of object geometry and significantly reducing floating artifacts in reconstructed underwater scenes. Experiments on real-world underwater and simulated scenes demonstrate that OceanSplat substantially outperforms existing methods for both scene reconstruction and restoration in scattering media.

[147] Higher-Order Adversarial Patches for Real-Time Object Detectors

Jens Bayer,Stefan Becker,David Münch,Michael Arens,Jürgen Beyerer

Main category: cs.CV

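TL;DR: The paper studies the effect of higher-order adversarial attacks on the YOLOv10 object detector, finding that higher-order adversarial patches generalize more strongly and that adversarial training alone is not enough to defend against them.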

Details Motivation: To explore the impact of higher-order adversarial attacks on object detection and to assess how well adversarial training defends against them. Method: Adversarial patches are used for evasion attacks; attack patterns are trained iteratively and alternated with adversarial training that hardens the YOLOv10 detector. Result: Higher-order adversarial patches affect not only the detector they were trained on but also generalize more strongly than lower-order patches; adversarial training alone cannot effectively defend against such attacks. Conclusion: Higher-order adversarial attacks pose a greater threat, and more advanced defenses are needed to counter their generalization. Abstract: Higher-order adversarial attacks can directly be considered the result of a cat-and-mouse game -- an elaborate action involving constant pursuit, near captures, and repeated escapes. This idiom best describes the ongoing cycle of training adversarial attack patterns and adversarial training. The following work investigates the impact of higher-order adversarial attacks on object detectors by successively training attack patterns and hardening object detectors with adversarial training. The YOLOv10 object detector is chosen as a representative, and adversarial patches are used in an evasion attack manner. Our results indicate that higher-order adversarial patches not only affect the object detector they were directly trained on but also provide a stronger generalization capacity than lower-order adversarial patches. Moreover, the results highlight that adversarial training alone is not sufficient to harden an object detector efficiently against this kind of adversarial attack. Code: https://github.com/JensBayer/HigherOrder

[148] Patch-based Representation and Learning for Efficient Deformation Modeling

Ruochen Chen,Thuy Tran,Shaifali Parashar

Main category: cs.CV

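TL;DR: The paper proposes PolyFit, a patch-based surface representation obtained by fitting jet functions locally, which can be learned efficiently and applied to a wide range of surface deformation tasks.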

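Details Motivation: To obtain a more efficient, compact, and generalizable surface representation for deformation that avoids optimizing per-vertex degrees of freedom as traditional methods do. Method: The PolyFit representation is obtained by fitting jet functions on local surface patches; it is learned in a supervised fashion from analytic functions or real data, and surfaces are deformed by updating the jet coefficients. Result: The method is validated on two tasks, shape-from-template (SfT) and garment draping: on SfT, test-time optimization delivers competitive accuracy while being markedly faster than offline physics-based solvers, and is more accurate than recent physics-guided neural simulators at modest additional runtime; for garment draping, the self-supervised model generalizes across resolutions and garment types, with inference up to an order of magnitude faster than baselines. Conclusion: PolyFit provides an efficient, compact, and generalizable surface representation that improves deformation efficiency and generalization across graphics and vision tasks. Abstract: In this paper, we present a patch-based representation of surfaces, PolyFit, which is obtained by fitting jet functions locally on surface patches. Such a representation can be learned efficiently in a supervised fashion from both analytic functions and real data. Once learned, it can be generalized to various types of surfaces. Using PolyFit, the surfaces can be efficiently deformed by updating a compact set of jet coefficients rather than optimizing per-vertex degrees of freedom for many downstream tasks in computer vision and graphics. We demonstrate the capabilities of our proposed methodologies with two applications: 1) Shape-from-template (SfT): where the goal is to deform the input 3D template of an object as seen in image/video. Using PolyFit, we adopt test-time optimization that delivers competitive accuracy while being markedly faster than offline physics-based solvers, and outperforms recent physics-guided neural simulators in accuracy at modest additional runtime. 2) Garment draping. We train a self-supervised, mesh- and garment-agnostic model that generalizes across resolutions and garment types, delivering up to an order-of-magnitude faster inference than strong baselines.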

[149] From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)

Suyash Mishra,Qiang Li,Srikanth Patil,Anubhav Girdhar

Main category: cs.CV

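TL;DR: The paper proposes a domain-adapted video-to-video-clip generation framework that combines audio and vision language models to produce pharmacy video summaries efficiently, at low cost, and with personalization.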

Details Motivation: Traditional annotation of multimodal data in the pharmaceutical industry is inconsistent, inefficient, and of uneven quality, and the challenge grows with large volumes of long video and audio data such as clinical trial interviews. Method: The paper proposes a Cut & Merge algorithm that combines an ALM and a VLM for audio-visual alignment and clip generation; a personalization mechanism based on role definition and prompt injection; and an end-to-end cost-optimized pipeline. Result: Evaluated on Video MME and a dataset of 16,159 pharmacy videos, the framework achieves a 3-4x speedup and a 4x cost reduction; it reaches a clip coherence score of 0.348 and an informativeness score of 0.721, improving on baselines such as Gemini 2.5 Pro. Conclusion: The framework supports transparent, customizable, and compliance-supporting video summarization with broad potential in the life sciences. Abstract: Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g., long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost efficient e2e pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate 3 to 4 times speedup, 4 times cost reduction, and competitive clip quality. Beyond efficiency gains, we also report that our method improves clip coherence scores (0.348) and informativeness scores (0.721) over state of the art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance supporting video summarization for life sciences.
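
One plausible reading of "timestamp normalization" in a cut-and-merge step is clamping proposed segments to the video length and merging segments that overlap or nearly touch. The sketch below illustrates only that reading; the function name `cut_and_merge`, the `min_gap` parameter, and the merge rule are assumptions, and the fade in/out part of the paper's algorithm is not modeled.

```python
def cut_and_merge(segments, duration, min_gap=0.5):
    """Normalize highlight segments (in seconds) and merge close neighbours.

    segments: list of (start, end) proposals, possibly unordered,
              overlapping, or extending past the video boundaries.
    duration: total video length in seconds.
    min_gap:  segments separated by at most this many seconds are merged.
    """
    cleaned = []
    for start, end in segments:
        # Clamp to the valid time range and drop empty segments.
        start, end = max(0.0, start), min(duration, end)
        if end > start:
            cleaned.append((start, end))
    cleaned.sort()

    merged = []
    for start, end in cleaned:
        if merged and start - merged[-1][1] <= min_gap:
            # Extend the previous segment instead of creating a jump cut.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Merging near-adjacent segments before cutting avoids very short gaps between clips, which is one way to get the smooth transitions the abstract mentions.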

[150] Driving on Registers

Ellington Kirby,Alexandre Boulch,Yihong Xu,Yuan Yin,Gilles Puy,Éloi Zablocki,Andrei Bursuc,Spyros Gidaris,Renaud Marlet,Florent Bartoccioni,Anh-Quan Cao,Nermin Samet,Tuan-Hung VU,Matthieu Cord

Main category: cs.CV

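TL;DR: DrivoR is a transformer-based end-to-end autonomous driving architecture that uses pretrained Vision Transformers and camera-aware register tokens to compress multi-camera features, enabling efficient and accurate trajectory prediction.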

Details Motivation: Existing end-to-end driving methods struggle to balance multi-camera feature fusion against computational efficiency and lack interpretable modeling of driving behavior. Method: Building on pretrained Vision Transformers, the method introduces camera-aware register tokens that compress multi-camera features and uses two lightweight transformer decoders to generate and then score candidate trajectories; the scoring decoder learns to mimic an oracle and outputs sub-scores for safety, comfort, efficiency, and similar aspects, which supports behavior-conditioned control at inference. Result: DrivoR outperforms or matches current strong methods on NAVSIM-v1, NAVSIM-v2, and the closed-loop HUGSIM benchmark, validating a pure-transformer architecture with token compression. Conclusion: A pure-transformer architecture combined with targeted token compression is sufficient for accurate, efficient, and adaptive end-to-end driving. Abstract: We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available via the project page.
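
Behavior-conditioned selection from interpretable sub-scores can be pictured as a weighted sum over candidates. The sketch below is an assumption-laden illustration: the candidate layout, the sub-score names, and the linear weighting are hypothetical, and in the paper the sub-scores come from a learned scoring decoder rather than being given.

```python
def select_trajectory(candidates, weights):
    """Return the candidate with the highest weighted sum of sub-scores.

    candidates: list of dicts with a "scores" dict of sub-scores in [0, 1].
    weights:    dict mapping sub-score name to its importance; changing the
                weights at inference changes the driving behaviour.
    """
    def total(scores):
        return sum(w * scores[name] for name, w in weights.items())

    return max(candidates, key=lambda c: total(c["scores"]))
```

With the same candidate set, a safety-heavy weighting and an efficiency-heavy weighting can select different trajectories without retraining anything, which is the practical appeal of predicting separate sub-scores instead of a single opaque score.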

[151] UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition

Filippo Ghilotti,Samuel Brucker,Nahku Saidy,Matteo Matteucci,Mario Bijelic,Felix Heide

Main category: cs.CV

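TL;DR: The paper proposes a multi-modal pseudo-labeling method that requires no manual annotation and uses temporal-geometric consistency to transfer cues from text and 2D vision foundation models into 3D point clouds, producing semantic segmentation, object detection, and densification for LiDAR data.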

Details Motivation: To address the high cost of annotating LiDAR data in autonomous driving and to make full use of the 3D geometry in unlabeled LiDAR data. Method: Strong geometric priors are learned from temporally accumulated LiDAR maps; a multi-modal pseudo-labeling method fuses cues from text and 2D vision foundation models; an iterative update rule maintains geometric-semantic consistency and detects moving objects from inconsistencies. Result: The method produces robust 3D semantic labels, 3D bounding boxes, and dense LiDAR reconstructions on three datasets; it compares favorably with existing semantic segmentation and object detection pseudo-labeling methods without additional manual supervision; a small fraction of the densified LiDAR reduces long-range depth prediction error (MAE) by 51.5% and 22.0% in the 80-150 m and 150-250 m ranges, respectively. Conclusion: The method unlocks the potential of unlabeled LiDAR data and offers a low-cost, scalable pseudo-label generation scheme for autonomous driving perception research. Abstract: Unlabeled LiDAR logs, in autonomous driving applications, are inherently a gold mine of dense 3D geometry hiding in plain sight - yet they are almost useless without human labels, highlighting a dominant cost barrier for autonomous-perception research. In this work we tackle this bottleneck by leveraging temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi-modal pseudo-labeling method relying on strong geometric priors learned from temporally accumulated LiDAR maps, along with a novel iterative update rule that enforces joint geometric-semantic consistency and, conversely, detects moving objects from inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, demonstrating robust generalization across three datasets. We experimentally validate that our method compares favorably to existing semantic segmentation and object detection pseudo-labeling methods, which often require additional manual supervision. We confirm that even a small fraction of our geometrically consistent, densified LiDAR improves depth prediction by 51.5% and 22.0% MAE in the 80-150 and 150-250 meter ranges, respectively.

[152] From Rays to Projections: Better Inputs for Feed-Forward View Synthesis

Zirui Wu,Zeren Jiang,Martin R. Oswald,Jie Song

Main category: cs.CV

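TL;DR: This paper proposes projective conditioning, a new way to improve the robustness and cross-view consistency of feed-forward view synthesis models. It converts camera parameters into a stable target-view projective cue and adds a masked autoencoding pretraining strategy that enables pretraining on uncalibrated data, reaching state-of-the-art quality on standard novel view synthesis benchmarks.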

Details Motivation: Existing methods based on Plücker ray representations are sensitive to the choice of camera coordinate frame and lack geometric consistency under small camera transformations, which makes view synthesis unstable; a more robust and geometrically consistent conditioning input is needed. Method: Projective conditioning replaces raw camera parameters with a 2D projective cue in the target view, turning an unstable regression problem in ray space into a well-conditioned image-to-image translation problem; a masked autoencoding pretraining strategy exploits large-scale uncalibrated data. Result: The method shows higher fidelity and stronger cross-view consistency on the authors' view-consistency benchmark and reaches state-of-the-art quality on standard novel view synthesis benchmarks. Conclusion: Projective conditioning offers a more stable and practical modeling paradigm for view synthesis, reducing dependence on precise camera parameters and improving robustness and generalization. Abstract: Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.

[153] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

Runze He,Yiji Cheng,Tiankai Hang,Zhimin Li,Yu Xu,Zijin Yin,Shiyi Zhang,Wenxun Dai,Penghui Du,Ao Ma,Chunyu Wang,Qinglin Lu,Jizhong Han,Jiao Dai

Main category: cs.CV

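TL;DR: This paper proposes Re-Align, a unified framework that uses structured reasoning-guided alignment to close the gap between understanding and generation in in-context image generation and editing (ICGE).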

Details Motivation: Although existing multimodal models show strong understanding capabilities, these strengths often fail to transfer to image generation, so user intent expressed in interleaved image-text prompts is executed inaccurately. Method: The paper introduces In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, and combines it with a reinforcement learning scheme based on a surrogate reward that measures the alignment between the generated image and the reasoning text. Result: Experiments show that Re-Align outperforms competing methods of comparable model scale and resources on in-context image generation and editing tasks. Conclusion: Through structured reasoning and alignment, Re-Align bridges the gap between multimodal understanding and generation and improves accuracy and consistency on ICGE tasks. Abstract: In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.

[154] VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding

Ignacio de Rodrigo,Alvaro J. Lopez-Lopez,Jaime Boal

Main category: cs.CV

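TL;DR: VERSE is a methodology for analyzing and improving vision-language models applied to visually-rich document understanding; it explores the visual embedding space to identify problematic regions and guides the generation of synthetic data to improve performance.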

Details Motivation: When applied to visually-rich documents, existing vision-language models offer little interpretability about error-prone regions and few means of improving them; a systematic method for diagnosing and enhancing model performance is needed. Method: VERSE visualizes the model's latent representation space, identifies clusters of visual features that lead to errors, and guides the generation of synthetic training data targeting those clusters (such as the MERIT Dataset) to improve the model. Result: Evaluation on MERIT Secret shows that retraining with VERSE-guided data substantially raises F1 without harming generalization; on-premise models such as Donut and Idefics2 optimized with VERSE match or surpass SaaS solutions such as GPT-4 and Pixtral. Conclusion: VERSE provides an interpretable analysis framework for vision-language models and, through targeted data augmentation, improves performance on visually-rich document understanding so that on-premise models can compete with large commercial APIs. Abstract: This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.

[155] VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

Sixiao Zheng,Minghao Yin,Wenbo Hu,Xiaoyu Li,Ying Shan,Yanwei Fu

Main category: cs.CV

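TL;DR: VerseCrafter is a 4D-aware video world model that uses a novel 4D geometric control representation to provide unified, precise control over camera and multi-object dynamics, and relies on an automatic data engine that extracts 4D information from unlabeled real-world videos for training.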

Details Motivation: Existing video world models struggle to control camera and multi-object motion in a unified and precise way on the 2D image plane and lack 4D geometric consistency. Method: The paper proposes a 4D geometric control representation that models the world state with a static background point cloud and per-object 3D Gaussian trajectories; the 4D control signals condition a pretrained video diffusion model; an automatic data engine extracts 4D training data from in-the-wild videos. Result: The model provides explicit, coherent control over camera and object dynamics, generates high-quality, view-consistent videos, and can be trained on large-scale unlabeled data. Conclusion: VerseCrafter offers a scalable framework for 4D-controllable generation in video world models, addressing data scarcity and control precision and advancing dynamic scene modeling. Abstract: Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently represent dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.

[156] A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering

Md. Zahid Hossain,Most. Sharmin Sultana Samu,Md. Rakibul Islam,Md. Siam Ansary

Main category: cs.CV

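TL;DR: The paper proposes a lightweight vision-language framework that combines a Swin Transformer with sequence-to-sequence language decoders to identify crops and diseases from leaf images; a two-stage training strategy improves performance, and the model performs strongly on both classification and natural language generation with fewer parameters.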

Details Motivation: Crop disease analysis needs accurate visual understanding and reliable text generation, which calls for an efficient, lightweight vision-language model with good cross-modal alignment. Method: A Swin Transformer serves as the vision encoder and is paired with sequence-to-sequence language decoders in a lightweight vision-language framework; a two-stage training strategy strengthens visual representation learning and cross-modal alignment. Result: On a large-scale crop disease dataset, the model performs strongly on crop and disease classification and on text generation metrics such as BLEU, ROUGE, and BERTScore, outperforming large vision-language baselines with fewer parameters; Grad-CAM and token-level attribution are used for explainability analysis. Conclusion: Task-specific visual pretraining effectively improves crop disease visual question answering, and the proposed lightweight framework balances accuracy and efficiency well enough for practical agricultural use. Abstract: Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.

[157] Atlas 2 -- Foundation models for clinical deployment

Maximilian Alber,Timo Milbich,Alexandra Carpen-Amarie,Stephan Tietz,Jonas Dippel,Lukas Muttenthaler,Beatriz Perez Cancer,Alessandro Benetti,Panos Korfiatis,Elias Eulig,Jérôme Lüscher,Jiasen Wu,Sayed Abid Hashimi,Gabriel Dernbach,Simon Schallenberg,Neelay Shah,Moritz Krügener,Aniruddh Jammoria,Jake Matras,Patrick Duffy,Matt Redlon,Philipp Jurmeister,David Horst,Lukas Ruff,Klaus-Robert Müller,Frederick Klauschen,Andrew Norgan

Main category: cs.CV

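TL;DR: Presents the Atlas 2 family of pathology vision foundation models, which reach state-of-the-art performance, robustness, and resource efficiency. The models are trained on the largest dataset of its kind to date, 5.5 million whole slide images, and perform strongly across eighty public benchmarks.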

Details Motivation: Existing pathology foundation models trade off performance, robustness, and computational requirements, which limits their clinical use. Method: Presents three pathology vision foundation models, Atlas 2, Atlas 2-B, and Atlas 2-S, trained on 5.5 million whole slide images from Charité, LMU Munich, and Mayo Clinic. Result: Across eighty public benchmarks, the models reach state-of-the-art prediction performance, robustness, and resource efficiency. Conclusion: The Atlas 2 family addresses the shortcomings of existing models and makes clinical deployment of pathology foundation models more feasible. Abstract: Pathology foundation models substantially advanced the possibilities in computational pathology -- yet tradeoffs in terms of performance, robustness, and computational requirements remained, which limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models which bridge these shortcomings by showing state-of-the-art performance in prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date comprising 5.5 million histopathology whole slide images, collected from three medical institutions: Charité - Universitätsmedizin Berlin, LMU Munich, and Mayo Clinic.

[158] Multi-Scale Local Speculative Decoding for Image Generation

Elia Peruzzo,Guillaume Sautière,Amirhossein Habibian

Main category: cs.CV

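TL;DR: Proposes MuLo-SD, a multi-scale local speculative decoding framework that uses low-resolution drafting and spatially aware parallel verification to accelerate autoregressive image generation by up to 1.7x while maintaining generation quality.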

Details Motivation: Autoregressive models perform well in image generation, but token-by-token generation makes inference slow; existing speculative decoding methods are limited by token-level ambiguity and a lack of spatial awareness. Method: MuLo-SD uses a low-resolution model to draft candidate image tokens, maps them to the high-resolution space with learned up-samplers, and has the target model verify them in parallel; a local rejection and resampling mechanism corrects only the spatial neighborhood of an error, avoiding global rollback. Result: On the MS-COCO 5k validation split, MuLo-SD achieves up to 1.7x speedup, outperforming baselines such as EAGLE-2 and LANTERN in acceleration while keeping semantic alignment and perceptual quality comparable, as measured with GenEval, DPG-Bench, and FID/HPSv2. Conclusion: Through multi-scale drafting and local correction, MuLo-SD substantially improves the efficiency of autoregressive image generation while preserving quality, setting a new state of the art in speculative decoding for image synthesis. Abstract: Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups - up to $\mathbf{1.7\times}$ - outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
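The local rejection idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses the standard speculative-sampling acceptance test (accept a drafted token with probability min(1, p/q)) and a square neighborhood of hypothetical `radius`; the abstract does not specify either detail.

```python
def local_resample_mask(p_target, q_draft, u, radius=1):
    """Return an HxW boolean grid marking token positions to resample.

    p_target, q_draft: probability each model assigns to the drafted token
    at every grid position. u: uniform samples in [0, 1), same shape.
    """
    h, w = len(u), len(u[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            accept_prob = min(1.0, p_target[y][x] / q_draft[y][x])
            if u[y][x] >= accept_prob:  # drafted token rejected here
                # Mark only the spatial neighborhood, not everything
                # that follows in raster-scan order.
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        mask[yy][xx] = True
    return mask
```

With a single rejection in a 4x4 grid and `radius=1`, only the surrounding 3x3 block is marked, so the remaining drafted tokens are kept.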

[159] Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering

Shuliang Liu,Songbo Yang,Dong Fang,Sihang Jia,Yuqi Tang,Lingfeng Su,Ruoshui Peng,Yibo Yan,Xin Zou,Xuming Hu

Main category: cs.CV

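TL;DR: Proposes Vision-Language Introspection (VLI), a training-free inference framework that reduces object hallucination in multimodal large language models by simulating a metacognitive self-correction process.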

Details Motivation: Existing mitigations for object hallucination fall short: they either operate superficially without correcting semantic misalignment, or rely on static vectors that lack instance-specific precision. Method: VLI has two stages. Attributive Introspection diagnoses hallucination risk through probabilistic conflict detection and localizes causal visual anchors; Interpretable Bi-Causal Steering then dynamically modulates inference, isolating visual evidence from background noise and neutralizing blind confidence through adaptive calibration. Result: VLI reduces the object hallucination rate by 12.67% on MMHal-Bench and improves accuracy by 5.8% on POPE, reaching state-of-the-art performance on advanced models. Conclusion: VLI improves the cognitive reliability of multimodal large language models and offers a new training-free inference paradigm for addressing object hallucination. Abstract: Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.

[160] CoV: Chain-of-View Prompting for Spatial Reasoning

Haoyu Zhao,Akide Liu,Zeyu Zhang,Weijie Wang,Feng Chen,Ruihan Zhu,Gholamreza Haffari,Bohan Zhuang

Main category: cs.CV

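TL;DR: Proposes Chain-of-View (CoV), a training-free test-time reasoning framework that improves embodied question answering (EQA) in 3D environments through coarse-to-fine viewpoint exploration.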

Details Motivation: Existing vision-language models are limited to a fixed number of input views, so they cannot gather enough spatial context at inference time, which limits complex spatial reasoning. Method: CoV first uses a View Selection agent to filter redundant frames and pick question-aligned anchor views, then refines the viewpoint by interleaving iterative reasoning with discrete camera actions, collecting new observations from the 3D scene until enough context is gathered. Result: On OpenEQA, four mainstream VLMs improve by +11.56% LLM-Match on average, with a maximum gain of +13.62%; CoV also scales at test time, with a larger action budget bringing further gains, and it performs well on ScanQA and SQA3D. Conclusion: Question-aligned view selection combined with open-view search is an effective, model-agnostic way to improve spatial reasoning in 3D EQA without additional training. Abstract: Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
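The fine-grained stage described in the abstract is a loop that stops when context is sufficient or a step budget is reached. A minimal sketch of that control flow follows; the callables `propose_action`, `render`, and `is_sufficient` are hypothetical stand-ins for the VLM, the 3D scene renderer, and the stopping check, none of which are specified here.

```python
def chain_of_view(anchor_views, propose_action, render, is_sufficient, max_steps=5):
    """Collect views until is_sufficient(context) or the step budget runs out."""
    context = list(anchor_views)          # coarse stage output: anchor views
    steps = 0
    while steps < max_steps and not is_sufficient(context):
        action = propose_action(context)  # a discrete camera action, e.g. "turn_left"
        context.append(render(action))    # new observation from the 3D scene
        steps += 1
    return context, steps
```

Raising `max_steps` corresponds loosely to the larger action budgets that the abstract reports as giving further gains.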

[161] VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

Shuming Liu,Mingchen Zhuge,Changsheng Zhao,Jun Chen,Lemeng Wu,Zechun Liu,Chenchen Zhu,Zhipeng Cai,Chong Zhou,Haozhe Liu,Ernie Chang,Saksham Suri,Hongyu Xu,Qi Qian,Wei Wen,Balakrishnan Varadarajan,Zhuang Liu,Hu Xu,Florian Bordes,Raghuraman Krishnamoorthi,Bernard Ghanem,Vikas Chandra,Yunyang Xiong

Main category: cs.CV

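TL;DR: Proposes VideoAuto-R1, a video understanding framework that reasons only when necessary. Trained with a "Thinking Once, Answering Twice" paradigm, it keeps accuracy high while improving efficiency, shortening responses by about 3.3x.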

Details Motivation: Chain-of-thought (CoT) reasoning is widely used in video understanding, but its necessity and advantage over direct answering remain unclear; despite its higher computational cost, it does not consistently beat direct answering. Method: Proposes VideoAuto-R1, trained with a "Thinking Once, Answering Twice" paradigm: the model first generates an initial answer, then reasons, and finally outputs a reviewed answer, with both answers supervised by verifiable rewards; at inference, the confidence of the initial answer decides whether reasoning is triggered. Result: VideoAuto-R1 reaches state-of-the-art accuracy on several video QA and grounding benchmarks while reducing the average response length from 149 to 44 tokens; thinking mode is activated rarely on perception-oriented tasks and more often on reasoning-intensive tasks. Conclusion: Explicit language-based reasoning is beneficial but not always necessary; reasoning on demand balances performance and efficiency and points to a more efficient design for reasoning in multimodal models. Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
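The inference-time gate can be sketched in a few lines. The threshold value and the `reason_fn` callable are assumptions for illustration; the abstract only states that the confidence score of the initial answer decides whether reasoning proceeds.

```python
def answer_video_question(initial_answer, confidence, reason_fn, threshold=0.9):
    """Return (answer, used_reasoning).

    threshold is a hypothetical cut-off; reason_fn stands in for the
    model's reasoning pass that produces the reviewed answer.
    """
    if confidence >= threshold:
        # Confident enough: skip reasoning and keep the short response.
        return initial_answer, False
    # Otherwise think once and return the reviewed answer.
    return reason_fn(initial_answer), True
```

Skipping the reasoning pass on confident answers is what would shorten the average response, consistent with the reported drop from 149 to 44 tokens.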

[162] Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable

Zuhair Ahmed Khan Taha,Mohammed Mudassir Uddin,Shahnawaz Alam

Main category: cs.CV

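TL;DR: AgentCompress dynamically selects model precision according to task difficulty, cutting compute costs by 68.3% while retaining 96.2% of the original success rate, which makes LLM-driven research tasks considerably more affordable.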

Details Motivation: Large language models are expensive to run for autonomous research tasks; a single session with a 70-billion-parameter model can cost about $127, which puts them out of reach for most academic labs. Method: AgentCompress uses a small neural network to predict task difficulty from the task's opening words and routes the task to a model variant with a matching level of compression; the decision takes less than 1 ms. Result: Across 500 research workflows in four scientific fields, compute costs fall by 68.3% while 96.2% of the original success rate is retained. Conclusion: By compressing models in a task-aware way, AgentCompress balances performance and cost, so resource-constrained labs can also afford LLM-based autonomous research. Abstract: When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A single research session using a 70-billion parameter model can cost around $127 in cloud fees, putting these tools out of reach for many academic labs. We developed AgentCompress to tackle this problem head-on. The core idea came from a simple observation during our own work: writing a novel hypothesis clearly demands more from the model than reformatting a bibliography. Why should both tasks run at full precision? Our system uses a small neural network to gauge how hard each incoming task will be, based only on its opening words, then routes it to a suitably compressed model variant. The decision happens in under a millisecond. Testing across 500 research workflows in four scientific fields, we cut compute costs by 68.3% while keeping 96.2% of the original success rate. For labs watching their budgets, this could mean the difference between running experiments and sitting on the sidelines.
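A rough sketch of difficulty-based routing, plus the arithmetic implied by the reported figures. The variant names (`int4`, `int8`, `fp16`) and the thresholds are hypothetical, since the abstract only says tasks go to "a suitably compressed model variant". Applying the average 68.3% saving to a single $127 session is likewise an illustrative assumption.

```python
def route(difficulty):
    """Map a predicted difficulty in [0, 1] to a hypothetical model variant."""
    if difficulty < 0.33:
        return "int4"   # easy task, e.g. reformatting a bibliography
    if difficulty < 0.66:
        return "int8"
    return "fp16"       # hard task, e.g. writing a novel hypothesis

def cost_after_savings(baseline_usd, savings_fraction):
    """Cost remaining after a fractional saving, rounded to cents."""
    return round(baseline_usd * (1.0 - savings_fraction), 2)
```

Under that assumption, `cost_after_savings(127, 0.683)` gives about $40.26 per session.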

[163] Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

William Rudman,Michal Golovanevsky,Dana Arad,Yonatan Belinkov,Ritambhara Singh,Carsten Eickhoff,Kyle Mahowald

Main category: cs.CV

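TL;DR: Large vision-language models are prone to prompt-induced hallucination in object counting: as the number of objects grows, they increasingly conform to an incorrect prompt. Mechanistic analysis identifies a set of attention heads (PIH-heads) whose ablation reduces prompt-induced hallucinations by at least 40% and increases correction toward visual evidence.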

Details Motivation: To address the tendency of large vision-language models to over-rely on the text prompt and ignore the visual input in object counting, and to reveal the internal mechanisms behind prompt-induced hallucination. Method: Designs a controlled object-counting task, applies mechanistic analysis to attention-head behavior in three VLMs, and uses ablation experiments to measure the effect on prompt-induced hallucination. Result: Identifies a small set of attention heads involved in prompt copying (PIH-heads); ablating them reduces prompt-induced hallucinations by at least 40% and makes the models more responsive to visual evidence; how PIH-heads act differs from model to model. Conclusion: Prompt-induced hallucination stems from the activity of specific attention heads that mediate prompt copying in model-specific ways; ablating them suppresses the hallucinations and offers a new path toward understanding and improving VLM reliability. Abstract: Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
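Head ablation can be illustrated on plain Python lists. This sketch assumes zero-ablation (replacing a head's output with zeros); the abstract does not say whether the authors use zero, mean, or another form of ablation, and the data layout is hypothetical.

```python
def ablate_heads(head_outputs, heads_to_ablate):
    """Zero the outputs of selected attention heads.

    head_outputs: one entry per head, each a list of per-token vectors.
    heads_to_ablate: indices of the heads to silence (assumed zero-ablation).
    """
    ablated = set(heads_to_ablate)
    return [
        [[0.0] * len(vec) for vec in head] if i in ablated else head
        for i, head in enumerate(head_outputs)
    ]
```

Because the remaining heads are untouched and no weights change, the intervention needs no additional training, which matches the abstract's description.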

[164] MoE3D: A Mixture-of-Experts Module for 3D Reconstruction

Zichen Wang,Ang Cao,Liam J. Wang,Jeong Joon Park

Main category: cs.CV

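TL;DR: MoE3D is a mixture-of-experts module that sharpens depth boundaries and reduces flying-point artifacts in 3D reconstruction models. It predicts multiple candidate depth maps and fuses them with dynamic weights, improving reconstruction quality with almost no extra compute.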

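Details Motivation: Existing feed-forward 3D reconstruction models suffer from blurry depth boundaries and flying-point artifacts, which degrade reconstruction quality, so an effective way to refine the depth output is needed. Method: MoE3D introduces a mixture-of-experts structure that predicts multiple candidate depth maps and fuses them with dynamic weights that adapt to the input content, refining the final depth map. The module can be integrated into a pre-trained backbone such as VGGT. Result: With MoE3D integrated, 3D reconstruction models show clearly sharper depth boundaries and fewer flying points, at minimal additional computational cost. Conclusion: MoE3D is an efficient plug-and-play module that improves existing 3D reconstruction models and is practical and extensible. Abstract: MoE3D is a mixture-of-experts module designed to sharpen depth boundaries and mitigate flying-point artifacts (highlighted in red) of existing feed-forward 3D reconstruction models (left side). MoE3D predicts multiple candidate depth maps and fuses them via dynamic weighting (visualized by MoE weights on the right side). When integrated with a pre-trained 3D reconstruction backbone such as VGGT, it substantially enhances reconstruction quality with minimal additional computational overhead. Best viewed digitally.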
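Fusing candidate depth maps with dynamic weights can be sketched as a per-pixel weighted sum. The softmax over per-pixel logits is an assumption made for illustration; the abstract says only "dynamic weighting" and does not define how the weights are normalised.

```python
import math

def fuse_pixel(depths, logits):
    """Fuse K candidate depths for one pixel with softmax weights (assumed)."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return sum(d * e / total for d, e in zip(depths, exps))

def fuse_maps(candidate_maps, logit_maps):
    """candidate_maps, logit_maps: K flat lists with one value per pixel."""
    n = len(candidate_maps[0])
    return [
        fuse_pixel([c[i] for c in candidate_maps], [l[i] for l in logit_maps])
        for i in range(n)
    ]
```

When the weights at a pixel are close to one-hot, the output follows a single expert instead of averaging across a depth discontinuity, which is one plausible reason such fusion could avoid flying points.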

[165] FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching

Danilo Danese,Angela Lombardi,Matteo Attimonelli,Giuseppe Fasano,Tommaso Di Noia

Main category: cs.CV

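TL;DR: Proposes FlowLet, a conditional generative framework that synthesizes age-conditioned 3D brain MRIs in an invertible 3D wavelet domain, addressing the slow inference, artifacts, and computational overhead of existing generative models.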

Details Motivation: Brain Age Prediction (BAP) models are limited by demographic bias in existing datasets, and acquiring new data is costly and ethically constrained, so effective generative data augmentation is needed. Current methods based on latent diffusion models are slow at inference, can introduce compression artifacts, and are rarely conditioned on age, which hurts BAP performance. Method: Proposes FlowLet, a conditional generative framework based on flow matching that generates in an invertible 3D wavelet domain, avoiding the artifacts of latent-space compression and reducing computational demands. It operates directly on multi-scale wavelet coefficients, supports fast sampling, and allows conditioning on age. Result: Experiments show that FlowLet generates high-fidelity 3D MRI volumes with few sampling steps; training BAP models on its generated data improves prediction for underrepresented age groups; region-based analysis confirms that key anatomical structures are preserved. Conclusion: FlowLet is an efficient, low-artifact method for age-conditioned 3D MRI generation that can improve the fairness and generalization of brain age prediction models and has potential for medical imaging data augmentation. Abstract: Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
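The point of working in an invertible wavelet domain is that the transform loses no information, unlike a learned latent compression. The sketch below shows this with a one-level 1D Haar transform. The choice of Haar and the 1D setting are assumptions made for brevity; the abstract does not name the wavelet family, and the paper works on 3D volumes.

```python
import math

def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length list."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Exact inverse of haar_forward (up to floating-point rounding)."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out
```

A generative model trained on the coefficients can therefore map its samples back to the signal domain without a lossy decoder.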

[166] ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos

Rustin Soraki,Homanga Bharadhwaj,Ali Farhadi,Roozbeh Mottaghi

Main category: cs.CV

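TL;DR: Proposes ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from egocentric video. Its explicit 3D object-level representation yields geometrically accurate, temporally coherent predictions, and training on large-scale pseudo-ground-truth data gives it strong accuracy, consistency, and generalization.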

Details Motivation: To give computational systems a human-like ability to predict how objects might move and change from passive visual observation, enabling understanding of and interaction with the physical world. Method: Proposes ObjectForesight, which models the world with a 3D object-level representation instead of pixel or latent space, and uses segmentation, mesh reconstruction, and 3D pose estimation to build a large-scale pseudo-labeled dataset of more than 2 million short clips for training. Result: Experiments show that ObjectForesight achieves significant gains over existing methods in prediction accuracy, geometric consistency, and generalization to unseen objects and scenes. Conclusion: ObjectForesight provides a scalable framework for learning physically plausible, object-centric dynamics models from visual observation, advancing research on predicting future object motion. Abstract: Humans can effortlessly anticipate how objects might move or change through interaction--imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of 2 million plus short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. Project page: objectforesight.github.io

[167] Plenoptic Video Generation

Xiao Fu,Shitao Tang,Min Shi,Xian Liu,Jinwei Gu,Ming-Yu Liu,Dahua Lin,Chen-Hsuan Lin

Main category: cs.CV

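TL;DR: Proposes PlenopticDreamer, a generative video re-rendering framework that achieves multi-view consistency through autoregressive training and camera-guided video retrieval, reaching state-of-the-art performance on several benchmarks.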

Details Motivation: Existing camera-controlled generative video re-rendering methods work well for a single view but struggle to keep spatio-temporal consistency across multiple views, especially in generated (hallucinated) regions. Method: Designs a multi-in-single-out video-conditioned model trained autoregressively, with a camera-guided video retrieval strategy that selects salient videos from previous generations as conditional inputs; progressive context-scaling, self-conditioning, and a long-video conditioning mechanism improve training convergence and generation robustness. Result: Experiments on the Basic and Agibot benchmarks show that PlenopticDreamer outperforms existing methods in view synchronization, visual fidelity, camera control accuracy, and diverse view transformations. Conclusion: PlenopticDreamer addresses spatio-temporal consistency of generated video across views and advances generative video re-rendering in complex scenes. Abstract: Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view settings, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/

[168] RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

Boyang Wang,Haoran Zhang,Shujie Zhang,Jinkun Hao,Mingda Jia,Qi Lv,Yucheng Mao,Zhaoyang Lyu,Jia Zeng,Xudong Xu,Jiangmiao Pang

Main category: cs.CV

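TL;DR: Proposes visual identity prompting, which uses exemplar images to guide a diffusion model toward multi-view, temporally coherent manipulation data with a precisely specified scene setup, improving the training of policy models.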

Details Motivation: Existing text-prompted diffusion models for generating robot manipulation data do not meet the need for multi-view, temporally coherent observations and cannot precisely control the scene configuration. Method: Introduces visual identity prompting, which supplies exemplar images as conditioning inputs to the diffusion model, and builds a scalable pipeline that curates a visual identity pool for augmenting manipulation data. Result: In both simulation and real-robot settings, vision-language-action and visuomotor policy models trained on the augmented data show consistent performance gains. Conclusion: Visual identity prompting generates high-quality, diverse manipulation data and improves the performance of downstream policy models. Abstract: The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.

[169] GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation

Henghui Ding,Chang Liu,Shuting He,Xudong Jiang,Yu-Gang Jiang

Main category: cs.CV

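TL;DR: Introduces three new benchmarks, Generalized Referring Expression Segmentation, Comprehension, and Generation (GRES/GREC/GREG), builds gRefCOCO, the first large-scale dataset supporting single-target, multi-target, and no-target expressions, and proposes the ReLA model for complex relationship modeling, which reaches state-of-the-art results on GRES and GREC.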

Details Motivation: Existing referring expression tasks support only single-target expressions, which limits real applications: multi-target and no-target cases cannot be handled, and generalized scenarios are unsupported. Method: Introduces the three new tasks GRES/GREC/GREG and the gRefCOCO dataset, and designs the ReLA model, which adaptively divides the image into regions using sub-instance clues and explicitly models region-region and region-language dependencies. Result: Builds gRefCOCO with multi-target, no-target, and single-target expressions; experiments show that existing methods degrade markedly on GREx, while the proposed ReLA reaches state-of-the-art results on GRES and GREC. Conclusion: Generalized referring expressions (GREx) are closer to real scenarios; gRefCOCO and ReLA provide a foundation for future research toward more general referring expression tasks. Abstract: Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing datasets and methods commonly support single-target expressions only, i.e., one expression refers to one object, not considering multi-target and no-target expressions. This greatly limits the real applications of REx (RES/REC/REG). This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset gRefCOCO that contains multi-target, no-target, and single-target expressions and their corresponding images with labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments to study the performance gap of the existing REx methods on GREx tasks. One of the challenges of GRES/GREC is complex relationship modeling, for which we propose a baseline ReLA that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed ReLA achieves state-of-the-art results on both the GRES and GREC tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GREx.

[170] Pixel-Perfect Visual Geometry Estimation

Gangwei Xu,Haotong Lin,Hongcheng Luo,Haiyang Sun,Bing Wang,Guang Chen,Sida Peng,Hangjun Ye,Xin Yang

Main category: cs.CV

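TL;DR: Proposes PPD and PPVD, pixel-perfect visual geometry models built on pixel-space diffusion transformers (DiT) that produce detailed depth estimates and point clouds free of flying pixels; semantic prompting and a cascade architecture improve efficiency, accuracy, and temporal consistency.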

Details Motivation: Existing geometry foundation models still suffer from severe flying pixels and loss of detail, falling short of the high-quality geometry reconstruction that robotics and augmented reality require. Method: Proposes Pixel-Perfect Depth (PPD), built on pixel-space diffusion transformers (DiT), with a Semantics-Prompted DiT that incorporates visual semantic information and a Cascade DiT architecture that progressively increases the number of image tokens; extends it to video as PPVD, using a Semantics-Consistent DiT and reference-guided token propagation to maintain temporal coherence. Result: The models achieve the best performance among generative monocular and video depth estimation models, produce significantly cleaner point clouds, and deliver higher accuracy and detail preservation at low computational and memory cost. Conclusion: By applying generative modeling in pixel space, PPD and PPVD deliver high-quality, flying-pixel-free geometry prediction, address detail loss and temporal inconsistency, and set a new baseline for visual geometry tasks. Abstract: Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.

[171] RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

Yuan-Kang Lee,Kuan-Lin Chen,Chia-Che Chang,Yu-Lun Liu

Main category: cs.CV

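TL;DR: Proposes RL-AWB, a nighttime white balance framework that combines statistical methods with deep reinforcement learning; it mimics professional auto white balance tuning experts by dynamically optimizing parameters, and generalizes well on a multi-sensor nighttime dataset.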

Details Motivation: Nighttime color constancy is challenging because of low-light noise and complex illumination, and existing methods generalize poorly across scenes and sensors. Method: First designs a statistical algorithm for nighttime scenes that combines salient gray pixel detection with a new illumination estimation method; on that basis, builds the first deep reinforcement learning framework for color constancy, which mimics expert behavior by dynamically adjusting parameters per image. Result: Experiments show superior cross-sensor generalization on both low-light and well-illuminated images; the authors also release the first multi-sensor nighttime dataset. Conclusion: By combining statistical priors with reinforcement learning, RL-AWB achieves more robust nighttime white balance and offers a learnable, tunable paradigm for auto white balance systems. Abstract: Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experimental results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/
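For readers unfamiliar with statistical white balance, the generic step from detected gray pixels to channel gains looks roughly like this. It is a textbook-style sketch, not the paper's algorithm: how gray pixels are detected, and which parameters the reinforcement learning agent tunes, are not described in the abstract.

```python
def white_balance_gains(gray_pixels):
    """Estimate per-channel gains from pixels assumed to be achromatic.

    gray_pixels: list of (r, g, b) tuples. Their mean colour is taken as
    the illuminant estimate; gains normalise it to the green channel.
    """
    n = len(gray_pixels)
    r = sum(p[0] for p in gray_pixels) / n
    g = sum(p[1] for p in gray_pixels) / n
    b = sum(p[2] for p in gray_pixels) / n
    return (g / r, 1.0, g / b)

def apply_gains(pixel, gains):
    """Scale one (r, g, b) pixel by the per-channel gains."""
    return tuple(c * k for c, k in zip(pixel, gains))
```

After correction, a pixel that matched the estimated illuminant has equal R, G, and B values, which is the goal of white balancing.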

[172] QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer

Daniele Lizzio Bosco,Shuteng Wang,Giuseppe Serra,Vladislav Golyanik

Main category: cs.CV

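TL;DR: Proposes QNeRF, the first hybrid quantum-classical model for novel-view synthesis from 2D images; parameterised quantum circuits give a more compact representation, matching or outperforming classical NeRF with fewer than half the parameters.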

Details Motivation: Inspired by the advantages of Quantum Visual Fields (QVFs) in model compactness and convergence speed, and motivated by the large parameter counts and intensive training of NeRF-style models, the work explores the potential of quantum machine learning for 3D scene representation. Method: Proposes two QNeRF architectures: Full QNeRF exploits all quantum amplitudes to increase expressiveness, while Dual-Branch QNeRF introduces a task-informed inductive bias by handling spatial and view-dependent information in separate branches, reducing complexity and improving scalability and hardware compatibility. Result: Experiments show that, when trained on images of moderate resolution, QNeRF matches or outperforms classical NeRF with fewer than half the parameters. Conclusion: Quantum machine learning can be a competitive alternative for continuous signal representation in computer vision, particularly for mid-level tasks such as learning 3D representations from 2D observations. Abstract: Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen major advances with Neural Radiance Fields (NeRFs), where models learn a compact representation from 2D images to render 3D scenes, albeit at the cost of larger models and intensive training. In this work, we extend the approach of QVFs by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF leverages parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, resulting in more compact models compared to the classical counterpart. We present two architectural variants. Full QNeRF maximally exploits all quantum amplitudes to enhance representational capabilities. In contrast, Dual-Branch QNeRF introduces a task-informed inductive bias by branching spatial and view-dependent quantum state preparations, drastically reducing the complexity of this operation and ensuring scalability and potential hardware compatibility. Our experiments demonstrate that -- when trained on images of moderate resolution -- QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level tasks in computer vision, such as 3D representation learning from 2D observations.
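As a toy illustration of what simulating a parameterised quantum circuit involves, the sketch below prepares a single qubit with an RY rotation and reads out the expectation of Pauli-Z. It is the smallest possible example and is not the QNeRF architecture, which uses multi-qubit circuits with entanglement that the abstract does not detail.

```python
import math

def ry_state(theta):
    """Amplitudes of RY(theta)|0> as a real pair (a0, a1)."""
    return (math.cos(theta / 2.0), math.sin(theta / 2.0))

def expect_z(state):
    """Expectation of Pauli-Z: P(0) - P(1) for real amplitudes."""
    a0, a1 = state
    return a0 * a0 - a1 * a1
```

Here `expect_z(ry_state(theta))` equals `cos(theta)`, so a single trainable angle already yields a smooth, bounded function of its parameter. Circuits of this kind are the building blocks of variational quantum models.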

[173] Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video

Zeren Jiang,Chuanxia Zheng,Iro Laina,Diane Larlus,Andrea Vedaldi

Main category: cs.CV

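TL;DR: Proposes Mesh4D, a feed-forward model for monocular 4D mesh reconstruction that uses a compact latent space and spatio-temporal attention to recover accurate 3D shape and motion without skeletal information at inference time.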

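Details Motivation: Existing monocular 4D reconstruction methods struggle to capture both the complete 3D shape and the continuous motion of a dynamic object, and often rely on complex per-frame optimization or require skeletal input. The goal is an efficient end-to-end model that reconstructs a high-quality 4D mesh in a single forward pass. Method: Proposes Mesh4D, which encodes the entire animation sequence in a compact latent space learned by an autoencoder, with skeletal structure used as a guiding prior during training; spatio-temporal attention stabilizes the deformation representation; a latent diffusion model, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. Result: Outperforms prior methods on reconstruction and novel view synthesis, recovering 3D shape and deformation more accurately. Conclusion: Mesh4D enables efficient monocular 4D mesh reconstruction; its compact latent space and latent diffusion strategy improve reconstruction quality and stability, and it needs no skeletal information at inference time, making it practical and generalizable. Abstract: We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object's complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object's overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.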