Table of Contents
cs.CL
[1] MedPI: Evaluating AI Systems in Medical Patient-facing Interactions
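Diego Fajardo V., Oleksii Proniakin, Victoria-Elisabeth Gruber, Razvan Marinescu
Main category: cs.CL
TL;DR: MedPI is a high-dimensional benchmark for evaluating large language models in patient-clinician conversations. It scores dialogues on 105 dimensions and evaluates 9 flagship models across 7,097 conversations, finding weak performance in areas such as differential diagnosis.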
Details
Motivation: Existing medical evaluations of LLMs mostly use single-turn question answering, which cannot capture the complexity of real clinical dialogue and lacks a systematic, multi-dimensional assessment of the medical process, treatment safety, and doctor-patient communication.
Method: Proposes the five-layer MedPI framework: synthetic EHR-like patient packets; an AI patient with memory and affect; a task matrix covering multiple encounter reasons and objectives; a 105-dimension rubric based on ACGME competencies; and AI judges, a calibrated committee of LLMs that provide scores and feedback.
Result: Nine flagship LLMs were evaluated across 366 AI patients and 7,097 conversations. All models performed poorly on multiple dimensions, with the lowest scores on differential diagnosis.
Conclusion: MedPI offers a fine-grained, accreditation-aligned way to evaluate conversational medical competence. It exposes the weaknesses of current LLMs in clinical reasoning, especially differential diagnosis, and can help guide future use of LLMs for diagnosis and treatment recommendations.
Abstract: We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates the medical dialogue across 105 dimensions comprising the medical process, treatment safety, treatment outcomes and doctor-patient communication across a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) Patient Packets (synthetic EHR-like ground truth); (2) an AI Patient instantiated through an LLM with memory and affect; (3) a Task Matrix spanning encounter reasons (e.g. anxiety, pregnancy, wellness checkup) x encounter objectives (e.g. diagnosis, lifestyle advice, medication advice); (4) an Evaluation Framework with 105 dimensions on a 1-4 scale mapped to the Accreditation Council for Graduate Medical Education (ACGME) competencies; and (5) AI Judges that are calibrated, committee-based LLMs providing scores, flags, and evidence-linked rationales. We evaluate 9 flagship models -- Claude Opus 4.1, Claude Sonnet 4, MedGemma, Gemini 2.5 Pro, Llama 3.3 70b Instruct, GPT-5, GPT OSS 120b, o3, Grok-4 -- across 366 AI Patients and 7,097 conversations using a standardized "vanilla clinician" prompt. For all LLMs, we observe low performance across a variety of dimensions, in particular on differential diagnosis. Our work can help guide future use of LLMs for diagnosis and treatment recommendations.
[2] RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
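Keerthana Murugaraj, Salima Lamsiyah, Martin Theobald
Main category: cs.CL
TL;DR: Introduces RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of retrieval-augmented generation (RAG) systems. It decomposes RAG behavior into several dimensions and supports fine-grained error analysis.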
Details
Motivation: Existing RAG metrics often collapse complex behavior into a single score, which makes it hard to tell whether errors come from retrieval, reasoning, or grounding, and offers little transparency or diagnostic power.
Method: Proposes the RAGVUE framework, which decomposes evaluation into four dimensions: retrieval quality, answer relevance and completeness, claim-level faithfulness, and judge calibration. Each metric comes with a structured explanation. The framework supports manual metric selection or fully automated agentic evaluation and provides a Python API, a CLI, and a Streamlit interface.
Result: Experiments show that RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook, and that it integrates well into both research and practical development workflows.
Conclusion: RAGVUE provides a transparent, fine-grained, and extensible way to evaluate RAG systems, which helps make debugging and optimizing them more efficient.
Abstract: Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval, reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub.
[3] Automatic Construction of Chinese Verb Collostruction Database
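Xuri Tang, Daohuan Liu
Main category: cs.CL
TL;DR: Proposes a fully unsupervised method for building a Chinese verb collostruction database. Clustering algorithms generate collostructions from a large-scale corpus that show functional independence and graded typicality, and a correction algorithm built on them outperforms LLMs on verb grammatical error correction.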
Details
Motivation: To complement LLMs in application scenarios where explanation and interpretability are indispensable, by supplying explicit and interpretable rules.
Method: Formalizes a verb collostruction as a projective, rooted, ordered, directed acyclic graph and applies a series of clustering algorithms to sentences retrieved from a large-scale corpus to generate the collostructions of a given verb.
Result: Statistical analysis shows that the generated collostructions exhibit functional independence and graded typicality. On verb grammatical error correction, a correction algorithm based on maximum matching with collostructions outperforms LLMs.
Conclusion: The method can build an interpretable knowledge base of Chinese verb collostructions and improve language applications in settings that require transparent rules.
Abstract: This paper proposes a fully unsupervised approach to the construction of a verb collostruction database for Chinese, aimed at complementing LLMs by providing explicit and interpretable rules for application scenarios where explanation and interpretability are indispensable. The paper formally defines a verb collostruction as a projective, rooted, ordered, and directed acyclic graph and employs a series of clustering algorithms to generate collostructions for a given verb from a list of sentences retrieved from a large-scale corpus. Statistical analysis demonstrates that the generated collostructions possess the design features of functional independence and graded typicality. Evaluation with verb grammatical error correction shows that the error correction algorithm based on maximum matching with collostructions achieves better performance than LLMs.
[4] Attribute-Aware Controlled Product Generation with LLMs for E-commerce
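Virginia Negri, Víctor Martínez Gómez, Sergio A. Balanya, Subburam Rajaram
Main category: cs.CL
TL;DR: Proposes an LLM-based method for generating synthetic e-commerce product data, using three strategies for controlled modification. On a downstream task, the synthetic data performs close to real data.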
Details
Motivation: High-quality labeled datasets are hard to obtain, which makes product information extraction in e-commerce difficult. An effective way to generate synthetic training data is therefore needed.
Method: Combines an LLM with attribute-aware prompts and three controlled modification strategies (attribute-preserving modification, controlled negative example generation, and systematic attribute removal) to generate synthetic data that respects store constraints and stays coherent at the product level.
Result: In human evaluation, 99.6% of synthetic products were rated natural and 96.5% had valid attribute values. On the MAVE dataset, synthetic data reaches 60.5% accuracy, on par with real data (60.8%) and far above the 13.4% zero-shot baseline. Hybrid data raises accuracy further to 68.8%.
Conclusion: The framework is a practical way to augment e-commerce datasets, especially in low-resource scenarios.
Abstract: Product information extraction is crucial for e-commerce services, but obtaining high-quality labeled datasets remains challenging. We present a systematic approach for generating synthetic e-commerce product data using Large Language Models (LLMs), introducing a controlled modification framework with three strategies: attribute-preserving modification, controlled negative example generation, and systematic attribute removal. Using a state-of-the-art LLM with attribute-aware prompts, we enforce store constraints while maintaining product coherence. Human evaluation of 2000 synthetic products demonstrates high effectiveness, with 99.6% rated as natural, 96.5% containing valid attribute values, and over 90% showing consistent attribute usage. On the public MAVE dataset, our synthetic data achieves 60.5% accuracy, performing on par with real training data (60.8%) and significantly improving upon the 13.4% zero-shot baseline. Hybrid configurations combining synthetic and real data further improve performance, reaching 68.8% accuracy. Our framework provides a practical solution for augmenting e-commerce datasets, particularly valuable for low-resource scenarios.
[5] Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems
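Zihan Gao, Mohsin Y. K. Yousufi, Jacob Thebault-Spieker
Main category: cs.CL
TL;DR: Proposes Collective Narrative Grounding, a participatory protocol that integrates local stories into AI systems under community governance, to address the knowledge blind spots and epistemic injustice that arise when LLMs answer community-specific questions.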
Details
Motivation: LLMs often fail on community-specific queries, creating knowledge blind spots that marginalize local voices and reinforce epistemic injustice. The study aims to close these gaps through community participation.
Method: Through three participatory mapping workshops with 24 community members, the authors designed elicitation methods and a structured schema for narrative information that supports entity, time, and place extraction as well as validation and provenance control. They also audited a county-level benchmark of 14,782 local-information QA pairs and evaluated a state-of-the-art LLM on a workshop-derived QA set without added context, checking whether the collected narratives contain the missing facts.
Result: On the county-level benchmark, 76.7% of errors stem from factual gaps, cultural misunderstandings, geographic confusions, or temporal misalignments. On the participatory QA set, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context, and the missing facts often appear in the collected narratives.
Conclusion: The taxonomy, protocol, and participatory evaluation together support retrieval-first, provenance-visible, locally governed QA systems and provide a foundation for community-grounded AI.
Abstract: Large language model (LLM) question-answering systems often fail on community-specific queries, creating "knowledge blind spots" that marginalize local voices and reinforce epistemic injustice. We present Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. Learning from three participatory mapping workshops with N=24 community members, we designed elicitation methods and a schema that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. To scope the problem, we audit a county-level benchmark of 14,782 local information QA pairs, where factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a participatory QA set derived from our workshops, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context, underscoring the need for local grounding. The missing facts often appear in the collected narratives, suggesting a direct path to closing the dominant error modes for narrative items. Beyond the protocol and pilot, we articulate key design tensions, such as representation and power, governance and control, and privacy and consent, providing concrete requirements for retrieval-first, provenance-visible, locally governed QA systems. Together, our taxonomy, protocol, and participatory evaluation offer a rigorous foundation for building community-grounded AI that better answers local questions.
[6] TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation
Anas Ezzakri, Nicola Piovesan, Mohamed Sana, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang
Main category: cs.CL
TL;DR: Introduces TeleTables, a benchmark that evaluates how well LLMs know and interpret the tables in telecom standards such as 3GPP specifications.
Details
Motivation: Prior work shows that LLMs perform poorly on telecom standards, which rely heavily on tables, and their ability to understand table content has not been systematically evaluated.
Method: A multi-stage data generation pipeline extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions, yielding a dataset of 500 human-verified question-answer pairs.
Result: Models under 10B parameters perform poorly at both recalling 3GPP knowledge and interpreting tables, while larger models show stronger reasoning on table interpretation.
Conclusion: TeleTables highlights the need for domain-specialized fine-tuning so that LLMs can reliably interpret and reason over telecom standards.
Abstract: Large language models (LLMs) are increasingly explored in the telecom industry to support engineering tasks, accelerate troubleshooting, and assist in interpreting complex technical documents. However, recent studies show that LLMs perform poorly on telecom standards, particularly 3GPP specifications. We argue that a key reason is that these standards densely include tables to present essential information, yet the LLM knowledge and interpretation ability of such tables remains largely unexamined. To address this gap, we introduce TeleTables, a benchmark designed to evaluate both the implicit knowledge LLMs have about tables in technical specifications and their explicit ability to interpret them. TeleTables is built through a novel multi-stage data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions. The resulting dataset, which is publicly available, comprises 500 human-verified question-answer pairs, each associated with the corresponding table in multiple formats. Our evaluation shows that smaller models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables, indicating the limited exposure to telecom standards in their pretraining and the insufficient inductive biases for navigating complex technical material. Larger models, on the other hand, show stronger reasoning on table interpretation. Overall, TeleTables highlights the need for domain-specialized fine-tuning to reliably interpret and reason over telecom standards.
[7] FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback
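Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen
Main category: cs.CL
TL;DR: FronTalk is a benchmark for front-end code generation that studies conversational code generation with multi-modal feedback. It introduces an agent-based evaluation framework, exposes model weaknesses in retaining earlier features and interpreting visual feedback, and proposes AceCoder, which substantially reduces forgetting.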
Details
Motivation: Multi-turn, multi-modal (text plus visual) code generation for front-end development has not been explored systematically, and the role of visual feedback in particular is under-studied.
Method: Builds FronTalk, a dataset of 100 multi-turn dialogues in which every turn has both a textual and a visual instruction; proposes an automated evaluation framework based on a web agent that measures functional correctness and user experience; and designs AceCoder, which uses an autonomous agent to critique the implementation of past instructions and so reduce forgetting.
Result: Evaluating 20 models reveals a severe forgetting problem and difficulty interpreting visual feedback. AceCoder reduces forgetting to nearly zero and improves performance by up to 9.3 percentage points (56.0% to 65.3%).
Conclusion: FronTalk provides a solid foundation for multi-turn, multi-modal code generation, identifies key challenges, and offers an effective baseline, supporting research on front-end development and human-computer interaction.
Abstract: We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework leveraging a web agent to simulate users and explore the website, thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at https://github.com/shirley-wu/frontalk
[8] STDD: Spatio-Temporal Dynamics-Driven Token Refinement in Diffusion Language Models
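Xinhao Sun, Maoliang Li, Zihao Zheng, Jiayu Chen, Hezhao Xu, Yun Liang, Xiang Chen
Main category: cs.CL
TL;DR: Proposes a new remasking method that dynamically measures each token's temporal variance and spatial deviance and adapts the confidence threshold accordingly, yielding large efficiency gains for diffusion language models while preserving generation quality.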
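To make the per-token thresholding idea in the summary above more concrete, here is a minimal Python sketch. It is not the authors' algorithm: the digest gives no formulas, so the definitions used here (temporal variance as the variance of a token's confidence over past steps, spatial deviance as the gap between a token's confidence and the mean of its immediate neighbours) and the parameters `base`, `alpha`, `beta`, and `window` are all assumptions made purely for illustration.

```python
from statistics import pvariance


def adaptive_thresholds(history, base=0.7, alpha=2.0, beta=0.2, window=1):
    """Return one confidence threshold per token for the latest step.

    history[t][i] is the confidence of token i at denoising step t.
    All formulas and default values here are illustrative assumptions.
    """
    current = history[-1]
    n = len(current)
    thresholds = []
    for i in range(n):
        # Assumed temporal variance: how much this token's confidence
        # has fluctuated across the recorded steps.
        tv = pvariance([step[i] for step in history])
        # Assumed spatial deviance: distance from the neighbours' mean.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbours = [current[j] for j in range(lo, hi) if j != i]
        sd = abs(current[i] - sum(neighbours) / len(neighbours)) if neighbours else 0.0
        # Volatile or out-of-line tokens face a higher bar; clip to [0, 1].
        thresholds.append(min(1.0, max(0.0, base + alpha * tv + beta * sd)))
    return thresholds


def remask(history, **kwargs):
    """True where a token stays masked (confidence below its own threshold)."""
    thresholds = adaptive_thresholds(history, **kwargs)
    return [conf < thr for conf, thr in zip(history[-1], thresholds)]


history = [
    [0.95, 0.20, 0.90],
    [0.95, 0.80, 0.92],
    [0.95, 0.30, 0.93],
]
print(remask(history))
```

In this toy run the middle token, whose confidence swings between steps, is the only one deferred, while the two stable tokens are committed. A single global threshold cannot express that distinction, which is the limitation the entry describes.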
Details
Motivation: Mainstream remasking strategies rely on a single global confidence threshold and ignore the temporal and spatial dynamics of individual tokens, which leads to redundant iterations and constrained parallelism.
Method: Proposes a remasking approach that dynamically detects each token's Temporal Variance and Spatial Deviance, which reflect its convergence status and inter-token correlations, and uses these signals to adapt the confidence threshold for every token at every step.
Result: Experiments on mainstream datasets show substantially better DLM efficiency, with speedups of up to 8.9 times, while generation quality is preserved.
Conclusion: The dynamic remasking strategy overcomes the limits of a fixed threshold and improves both the efficiency and the parallel decoding ability of DLMs, making it practically useful.
Abstract: Unlike autoregressive language models, diffusion language models (DLMs) generate text by iteratively denoising all token positions in parallel. At each timestep, the remasking strategy of a DLM selects low-priority tokens to defer their decoding, thereby improving both efficiency and output quality. However, mainstream remasking strategies rely on a single global confidence threshold, overlooking the temporal and spatial dynamics of individual tokens. Motivated by the redundant iterations and constrained parallelism introduced by fixed-threshold remasking, we propose a novel remasking approach that dynamically detects Temporal Variance and Spatial Deviance of each token, which reflect its convergence status and inter-token correlations. Using these signals, our method adaptively adjusts the confidence threshold for every token at every step. Empirical results show that our approach significantly improves the operational efficiency of DLMs across mainstream datasets, achieving speedups of up to 8.9 times while faithfully preserving generation quality.
[9] Enhancing Admission Inquiry Responses with Fine-Tuned Models and Retrieval-Augmented Generation
Aram Virabyan
Main category: cs.CL
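TL;DR: Proposes an AI system that combines a fine-tuned language model with retrieval-augmented generation (RAG) to improve response speed and information accuracy for university admissions inquiries.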
Details
Motivation: University admissions offices must handle high volumes of inquiries while maintaining response quality, and off-the-shelf RAG can give inaccurate answers in narrow, complex domains because it lacks sufficient contextual understanding.
Method: Builds a hybrid system that combines a fine-tuned language model with RAG, fine-tunes the model on a dataset specific to admissions processes, and tunes the response generation logic to balance quality and speed.
Result: The system interprets RAG-provided information more accurately and generates replies that fit the admissions context, improving its grasp of complex rules while keeping access to up-to-date information.
Conclusion: Combining fine-tuning with RAG improves the quality and applicability of automated replies in university admissions and offers a workable AI approach for demanding specialist domains.
Abstract: University admissions offices face the significant challenge of managing high volumes of inquiries efficiently while maintaining response quality, which critically impacts prospective students' perceptions. This paper addresses the issues of response time and information accuracy by proposing an AI system integrating a fine-tuned language model with Retrieval-Augmented Generation (RAG). While RAG effectively retrieves relevant information from large datasets, its performance in narrow, complex domains like university admissions can be limited without adaptation, potentially leading to contextually inadequate responses due to the intricate rules and specific details involved. To overcome this, we fine-tuned the model on a curated dataset specific to admissions processes, enhancing its ability to interpret RAG-provided data accurately and generate domain-relevant outputs. This hybrid approach leverages RAG's ability to access up-to-date information and fine-tuning's capacity to embed nuanced domain understanding. We further explored optimization strategies for the response generation logic, experimenting with settings to balance response quality and speed, aiming for consistently high-quality outputs that meet the specific requirements of admissions communications.
[10] Ideology as a Problem: Lightweight Logit Steering for Annotator-Specific Alignment in Social Media Analysis
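Wei Xia, Haowen Tang, Luozheng Li
Main category: cs.CL
TL;DR: Proposes a lightweight linear probe that quantifies the systematic misalignment between an LLM and human ideological space and corrects it at the output layer, without retraining the model.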
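One way to picture "a bias score from internal features plus a direct adjustment of the output probabilities" is the following Python sketch. It is an illustration under stated assumptions rather than the paper's method: the probe weights, the `direction` vector over label logits, the `target` score, and the `strength` factor are all hypothetical, and the additive shift of the logits is one simple choice among many.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bias_score(hidden, weights, intercept=0.0):
    """Linear probe: project hidden features onto an (assumed) learned direction."""
    return sum(h * w for h, w in zip(hidden, weights)) + intercept


def steer(logits, direction, score, target=0.0, strength=1.0):
    """Shift label logits along `direction` in proportion to (target - score).

    The base model is untouched; only the final logits are adjusted.
    """
    delta = strength * (target - score)
    return [logit + delta * d for logit, d in zip(logits, direction)]


hidden = [0.5, -1.0, 2.0]      # toy hidden features (hypothetical)
weights = [0.2, 0.1, 0.4]      # toy probe weights (hypothetical)
score = bias_score(hidden, weights)
steered = steer([1.0, 1.0], direction=[1.0, -1.0], score=score)
print(softmax(steered))
```

Because the correction acts only on the final logits, setting `strength=0.0` or matching `target` to the measured score leaves the model's original output distribution unchanged.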
Details
Motivation: The way LLMs internally organize political ideology is systematically misaligned with human ideological space, which hampers aligning a model with a given user's views.
Method: Computes a bias score from the model's internal features, uses a linear probe to quantify the misalignment, and corrects it by directly adjusting the final output probabilities.
Result: The method quantifies model-specific ideological misalignment and applies a minimal correction to the output without retraining, preserving the model's original reasoning ability.
Conclusion: The lightweight method is practical and low-cost and aligns models more closely with specific user opinions.
Abstract: LLMs internally organize political ideology along low-dimensional structures that are partially, but not fully, aligned with human ideological space. This misalignment is systematic, model-specific, and measurable. We introduce a lightweight linear probe that both quantifies the misalignment and minimally corrects the output layer. This paper introduces a simple and efficient method for aligning models with specific user opinions. Instead of retraining the model, we calculated a bias score from its internal features and directly adjusted the final output probabilities. This solution is practical and low-cost and preserves the original reasoning power of the model.
[11] LLMs for Explainable Business Decision-Making: A Reinforcement Learning Fine-Tuning Approach
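Xiang Cheng, Wen Wang, Anindya Ghose
Main category: cs.CL
TL;DR: Proposes LEXMA, a reinforcement-learning-based fine-tuning framework that uses LLMs to generate narrative, decision-faithful natural-language explanations for multiple audiences. Applied to mortgage approval, it improves both explanation quality and predictive performance.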
Details
Motivation: Existing explainable AI techniques mostly rely on post hoc numerical feature attributions, which lack a coherent narrative and do not serve different audiences well. Training explanation models also tends to depend on large amounts of human annotation, which limits transparency and scalability in high-stakes settings.
Method: Proposes the LEXMA framework, which combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO). Using reward signals that need no human-annotated explanations, the stages separately optimize decision correctness and audience-specific style.
Result: On mortgage approval, LEXMA significantly improves predictive performance over other LLM baselines. Human evaluation shows that expert-facing explanations are more risk-focused and consumer-facing explanations are clearer, more actionable, and more polite.
Conclusion: LEXMA is a low-cost, systematic LLM fine-tuning approach that produces high-quality, audience-appropriate natural-language explanations without changing the decision rule, supporting scalable deployment of transparent AI systems.
Abstract: Artificial Intelligence (AI) models increasingly drive high-stakes consumer interactions, yet their decision logic often remains opaque. Prevailing explainable AI techniques rely on post hoc numerical feature attributions, which fail to provide coherent narratives behind model decisions. Large language models (LLMs) present an opportunity to generate natural-language explanations, but three design challenges remain unresolved: explanations must be both decision-correct and faithful to the factors that drive the prediction; they should be able to serve multiple audiences without shifting the underlying decision rule; and they should be trained in a label-efficient way that does not depend on large corpora of human-scored explanations. To address these challenges, we introduce LEXMA (LLM-based EXplanations for Multi-Audience decisions), a reinforcement-learning-based fine-tuning framework that produces narrative-driven, audience-appropriate explanations. LEXMA combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO). Specifically, it fine-tunes two separate parameter sets to improve decision correctness and satisfy stylistic requirements for different audiences, using reward signals that do not rely on human-annotated explanations. We instantiate LEXMA in the context of mortgage approval decisions. Results demonstrate that LEXMA yields significant improvements in predictive performance compared with other LLM baselines. Moreover, human evaluations show that expert-facing explanations generated by our approach are more risk-focused, and consumer-facing explanations are clearer, more actionable, and more polite. Our study contributes a cost-efficient, systematic LLM fine-tuning approach to enhance explanation quality for business decisions, offering strong potential for scalable deployment of transparent AI systems.
[12] Leveraging Language Models and RAG for Efficient Knowledge Discovery in Clinical Environments
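Seokhwan Ko, Donghyeon Lee, Jaewoo Chun, Hyungsoo Han, Junghwan Cho
Main category: cs.CL
TL;DR: Develops and evaluates a locally deployed retrieval-augmented generation (RAG) system that recommends research collaborators based on the PubMed publications of a medical institution's members.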
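The retrieval half of such a collaborator recommender can be sketched as a cosine-similarity ranking over publication embeddings. The Python below is a toy illustration only: the three-dimensional vectors stand in for real encoder output, the author names are placeholders, and the function names `cosine` and `recommend` are hypothetical rather than taken from the system described.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(query_vec, author_vecs, top_k=2):
    """Rank authors by similarity between the query and their publication embedding."""
    ranked = sorted(
        author_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Toy embeddings; a real system would embed each author's abstracts
# with a domain-specific encoder and run everything on local hardware.
authors = {
    "author_a": [1.0, 0.0, 0.0],
    "author_b": [0.9, 0.1, 0.0],
    "author_c": [0.0, 1.0, 0.0],
}
print(recommend([1.0, 0.0, 0.0], authors))
```

In a full RAG pipeline the top-ranked publications would then be passed as context to the local generative model, which writes the recommendation text. That step is omitted here.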
Details
Motivation: Strict privacy and network security rules in hospitals require sensitive data to be processed on local infrastructure, so LLM applications must be able to run locally.
Method: The system uses PubMedBERT to generate domain-specific embeddings and a locally deployed LLaMA3 model for generative synthesis, together providing research collaborator recommendations.
Result: The study demonstrates that combining a domain-specialized encoder with a lightweight LLM in a local environment is feasible and can support biomedical knowledge discovery.
Conclusion: Within medical data security requirements, pairing a domain-optimized embedding model with a local LLM can effectively support collaborator recommendation in medical research.
Abstract: Large language models (LLMs) are increasingly recognized as valuable tools across the medical environment, supporting clinical, research, and administrative workflows. However, strict privacy and network security regulations in hospital settings require that sensitive data be processed within fully local infrastructures. Within this context, we developed and evaluated a retrieval-augmented generation (RAG) system designed to recommend research collaborators based on PubMed publications authored by members of a medical institution. The system utilizes PubMedBERT for domain-specific embedding generation and a locally deployed LLaMA3 model for generative synthesis. This study demonstrates the feasibility and utility of integrating domain-specialized encoders with lightweight LLMs to support biomedical knowledge discovery under local deployment constraints.
[13] Complexity Agnostic Recursive Decomposition of Thoughts
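Kaleem Ullah Qasim, Jiashu Zhang, Hafiz Saif Ur Rehman
Main category: cs.CL
TL;DR: Proposes the CARD framework, which estimates problem complexity before generation and adapts the reasoning decomposition accordingly, improving both the accuracy and the efficiency of LLMs on multi-step reasoning tasks.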
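As a rough illustration of allocating a per-step thought budget from an estimated complexity, consider the Python sketch below. The source states only that each step receives 1, 5-9, or 10 thoughts. The complexity score in [0, 1], the cut-offs `LOW` and `HIGH`, and the linear interpolation across the middle band are assumptions introduced here, not details from the paper.

```python
LOW, HIGH = 0.3, 0.8  # hypothetical cut-offs, not taken from the paper


def thought_budget(complexity: float) -> int:
    """Map an estimated step complexity in [0, 1] to a thought budget.

    Easy steps get 1 thought, hard steps get 10, and the middle band is
    spread linearly over 5..9 (an assumed interpolation).
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    if complexity < LOW:
        return 1
    if complexity >= HIGH:
        return 10
    frac = (complexity - LOW) / (HIGH - LOW)  # in [0, 1)
    return 5 + min(4, int(frac * 5))


# Budgets for a toy decomposition with per-step complexity estimates.
steps = [0.1, 0.55, 0.95]
print([thought_budget(c) for c in steps])
```

The point of the design is that cheap steps stop consuming the same token budget as hard ones, which is where the reported token savings over fixed decomposition would come from.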
Details
Motivation: LLMs often underperform on multi-step reasoning because they apply a fixed reasoning strategy that ignores differences in problem difficulty. A reasoning method that adapts to problem complexity is therefore needed.
Method: Proposes the CARD framework, consisting of MRCE (Multi-dimensional Reasoning Complexity Estimator) and a two-stage recursive solver: first a hierarchical decomposition based on the task profile, then a per-step thought budget (1, 5-9, or 10 thoughts) assigned through recursive complexity profiling.
Result: On GSM8K, CARD reaches 81.4% to 89.2% accuracy across three reasoning models of different sizes while cutting token cost by 1.88x to 2.40x. On MATH-500, it reaches 75.1% to 86.8% accuracy using 1.71x to 5.74x fewer tokens.
Conclusion: Estimating complexity up front improves multi-step reasoning accuracy and substantially lowers compute cost, supporting the case for adaptive reasoning strategies.
Abstract: Large language models often fail on multi-step reasoning due to fixed reasoning strategies that ignore problem-specific difficulty. We introduce CARD (Complexity Agnostic Recursive Decomposition), a framework that predicts problem complexity before generation and adapts decomposition accordingly. Our system comprises MRCE (Multi-dimensional Reasoning Complexity Estimator), a 0.6B Qwen model predicting 30 fine-grained features from question text, and a two-stage recursive solver: (1) hierarchical decomposition into K steps based on task profile and (2) per-step thought budget allocation (1, 5-9, or 10 thoughts) via recursive MRCE profiling. Evaluated on three reasoning models (Qwen3-0.6B, DeepSeek-R1-Distill-Qwen-1.5B, Qwen3-1.7B), CARD achieves 81.4% to 89.2% accuracy on GSM8K while reducing token cost by 1.88x to 2.40x compared to fixed decomposition baselines. On MATH-500, CARD reaches 75.1% to 86.8% accuracy using 1.71x to 5.74x fewer tokens. Our results demonstrate that preemptive complexity estimation enables both higher accuracy and significant efficiency gains.
[14] Qwerty AI: Explainable Automated Age Rating and Content Safety Assessment for Russian-Language Screenplays
Nikita Zmanovskii
Main category: cs.CL
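TL;DR: Qwerty AI is an end-to-end system for automated content-safety assessment and age rating of Russian-language screenplays under Federal Law No. 436-FZ, designed to run efficiently and accurately under strict resource limits.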
Details
Motivation: To address real editorial challenges in reviewing screenplays in the Russian media industry by automating age rating and content-safety assessment in line with Federal Law No. 436-FZ.
Method: Uses a fine-tuned Phi-3-mini model with 4-bit quantization, under the constraints of no external API calls, an 80GB VRAM limit, and processing within 5 minutes. The system segments screenplays, detects five categories of content violations, and is deployed on Yandex Cloud with CUDA acceleration.
Result: The system processes screenplays of up to 700 pages in under 2 minutes, with 80% age-rating accuracy and 80-95% segmentation precision (format-dependent), and provides explainable justifications for its ratings.
Conclusion: Qwerty AI shows that the approach is viable for production workflows under resource constraints. It was developed at the Wink hackathon, where it addressed practical Russian-language content review problems.
Abstract: We present Qwerty AI, an end-to-end system for automated age-rating and content-safety assessment of Russian-language screenplays according to Federal Law No. 436-FZ. The system processes full-length scripts (up to 700 pages in under 2 minutes), segments them into narrative units, detects content violations across five categories (violence, sexual content, profanity, substances, frightening elements), and assigns age ratings (0+, 6+, 12+, 16+, 18+) with explainable justifications. Our implementation leverages a fine-tuned Phi-3-mini model with 4-bit quantization, achieving 80% rating accuracy and 80-95% segmentation precision (format-dependent). The system was developed under strict constraints: no external API calls, 80GB VRAM limit, and <5 minute processing time for average scripts. Deployed on Yandex Cloud with CUDA acceleration, Qwerty AI demonstrates practical applicability for production workflows. We achieved these results during the Wink hackathon (November 2025), where our solution addressed real editorial challenges in the Russian media industry.
[15] TrueBrief: Faithful Summarization through Small Language Models
Kumud Lakara, Ruibo Shi, Fran Silavong
Main category: cs.CL
TL;DR: Proposes TrueBrief, a framework that uses preference optimization to make small language models more faithful in text summarization.
Details
Motivation: LLMs are prone to hallucination, which limits their use in security-critical domains.
Method: Designs an end-to-end framework with a data generation module that performs controlled hallucination injection to produce synthetic preference data, which is then used to optimize small language models.
Result: The study examines how data quality and model size affect preference optimization and identifies the conditions under which the approach is most effective.
Conclusion: TrueBrief improves the faithfulness of small language models on summarization and offers a workable path to mitigating hallucination.
Abstract: Large language models (LLMs) have exhibited remarkable proficiency in generating high-quality text; however, their propensity for producing hallucinations poses a significant challenge for their deployment in security-critical domains. In this work, we present TrueBrief, an end-to-end framework specifically designed to enhance the faithfulness of small LLMs (SLMs) primarily for the task of text summarization through a preference-optimization paradigm. Central to our framework is a data generation module that facilitates controlled hallucination injection to generate synthetic preference data. Our work provides insights into the impact of data quality and model size on preference-based optimization, highlighting the conditions under which these methods are most effective.
[16] AnimatedLLM: Explaining LLMs with Interactive Visualizations
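Zdeněk Kasner, Ondřej Dušek
Main category: cs.CL
TL;DR: AnimatedLLM is an interactive web application that runs in the browser and provides step-by-step visualizations of a Transformer language model, intended to support natural language processing education.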
Details
Motivation: Teaching materials that show how LLMs work internally are scarce, so an intuitive, easy-to-use tool is needed for teaching and self-study.
Method: Uses pre-computed traces of open LLMs applied to manually curated inputs, so the interactive visualization runs entirely in the browser.
Result: The AnimatedLLM web application shows the workings of a Transformer language model step by step and is publicly available for teaching and self-study.
Conclusion: AnimatedLLM fills a gap in teaching resources for LLMs and serves as an effective educational tool.
Abstract: Large language models (LLMs) are becoming central to natural language processing education, yet materials showing their mechanics are sparse. We present AnimatedLLM, an interactive web application that provides step-by-step visualizations of a Transformer language model. AnimatedLLM runs entirely in the browser, using pre-computed traces of open LLMs applied on manually curated inputs. The application is available at https://animatedllm.github.io, both as a teaching aid and for self-educational purposes.
[17] From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning
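Xiaoyu Xu, Minxin Du, Zitong Li, Zi Liang, Zhibiao Guo, Shiyu Zhang, Peizhao Hu, Qingqing Ye, Haibo Hu
Main category: cs.CL
TL;DR: Proposes BiForget, a framework in which the target model itself generates high-quality forget sets. It formalizes domain-level and instance-level unlearning granularities and yields more efficient, relevant, and diverse data for evaluating machine unlearning.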
Details
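Motivation: Existing unlearning benchmarks do not accurately reflect the "forgetting scope" the model has actually learned, do not distinguish between unlearning granularities, and depend on external generators, which causes a mismatch in data distribution.
Method: Proposes BiForget, which uses the target model itself, through seed-guided and adversarial prompting, to elicit data that matches its internal knowledge distribution and to synthesize high-quality forget sets automatically. It also formally defines domain-level and instance-level unlearning.
Result: Experiments across several benchmarks show a better balance of relevance, diversity, and efficiency. In the Harry Potter domain, BiForget improves relevance by about 20 and diversity by about 0.05 while halving the data size compared with state-of-the-art methods, and it leads to more robust forgetting with better utility preservation.
Conclusion: BiForget provides a more reliable and efficient way to generate forget data for LLM unlearning, makes unlearning evaluation more rigorous, and supports more robust forgetting with better utility preservation.
Abstract: Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the true "forgetting scope" learned by the model. We formalize two distinct unlearning granularities, domain-level and instance-level, and propose BiForget, an automated framework for synthesizing high-quality forget sets. Unlike prior work relying on external generators, BiForget exploits the target model per se to elicit data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that it achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, in the Harry Potter domain, it improves relevance by ~20 and diversity by ~0.05 while halving the total data size compared to SOTAs. Ultimately, it facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.
[18] RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation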
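Joseph James, Chenghao Xiao, Yucheng Li, Nafise Sadat Moosavi, Chenghua Lin
Main category: cs.CL
TL;DR: Proposes RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper's body and assigns each claim an overstatement score, with the aim of making scientific statements more rigorous.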
Details
Motivation: Research papers often overstate their claims beyond what the results support, which harms the accuracy and transparency of scientific communication.
Method: Builds a dataset of more than 10K claim-evidence sets annotated with eight LLMs, with overstatement scores calibrated against peer-review comments. A fine-tuned reranker retrieves evidence, and a fine-tuned model predicts overstatement scores with a justification.
Result: Compared with strong baselines, RIGOURATE performs better at both evidence retrieval and overstatement detection, and the scores are validated through human evaluation.
Conclusion: The work operationalises evidential proportionality and supports clearer, more transparent scientific communication.
Abstract: Scientific rigour tends to be sidelined in favour of bold statements, leading authors to overstate claims beyond what their results support. We present RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper's body and assigns each claim an overstatement score. The framework consists of a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments and validated through human evaluation. It employs a fine-tuned reranker for evidence retrieval and a fine-tuned model to predict overstatement scores with justification. Compared to strong baselines, RIGOURATE enables improved evidence retrieval and overstatement detection. Overall, our work operationalises evidential proportionality and supports clearer, more transparent scientific communication.
[19] Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties
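Akriti Dhasmana, Aarohi Srivastava, David Chiang
Main category: cs.CL
TL;DR: An empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech from Indic dialects. Smaller phylogenetic distance between languages generally improves ASR performance, but it does not fully explain results in dialectal settings. Fine-tuning on a small amount of dialectal data often matches fine-tuning on much more data from a related high-resource standardized language. A case study on low-resource Garhwali evaluates several contemporary ASR models and examines bias toward pre-training languages.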
Details
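Motivation: To examine how well cross-lingual transfer works across India's diverse dialects and language varieties, particularly for improving automatic speech recognition (ASR) on low-resource, non-standardized languages.
Method: Runs experiments on a range of Indic dialects, compares the effect of phylogenetic distance between languages on ASR performance, and contrasts fine-tuning on small amounts of dialectal data with fine-tuning on larger amounts of standardized-language data. A case study on Garhwali evaluates several contemporary ASR models and analyzes transcription errors for language bias.
Result: Phylogenetic distance affects ASR performance but is not the only deciding factor in dialectal settings. Fine-tuning on small amounts of dialectal data can match the performance obtained from larger amounts of high-resource language data. The Garhwali case study reveals model bias toward pre-training languages.
Conclusion: Cross-lingual transfer is promising for low-resource and non-standardized dialects, but factors beyond phylogenetic relatedness must be considered. Fine-tuning on local dialectal data is an effective strategy, and language bias in ASR models needs further mitigation.
Abstract: We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance is generally improved with reduced phylogenetic distance between languages, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically-related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into challenges faced by ASR systems on dialectal and non-standardized speech.
[20] Disco-RAG: Discourse-Aware Retrieval-Augmented Generation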
Dongqi Liu,Hang Ding,Qiming Feng,Jian Li,Xurong Xie,Zhucun Xue,Chengjie Wang,Jiangning Zhang,Yabiao Wang
Main category: cs.CL
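TL;DR: Disco-RAG is a discourse-aware retrieval-augmented generation framework that builds intra-chunk discourse trees and inter-chunk rhetorical graphs to improve LLM performance on knowledge-intensive tasks.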
Details
Motivation: Existing RAG methods usually treat retrieved passages in a flat way, ignoring structural relations between texts, which limits the model's ability to integrate dispersed knowledge.
Method: Propose Disco-RAG, which builds intra-chunk discourse trees to capture local hierarchies and inter-chunk rhetorical graphs to model cross-passage coherence, and jointly integrates these structures into a planning blueprint for generation.
Result: Achieves state-of-the-art results on question answering and long-document summarization benchmarks without fine-tuning.
Conclusion: Discourse structure plays an important role in improving RAG systems.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.
[21] MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking
Iago Alves Brito,Walcy Santos Rezende Rios,Julia Soares Dollis,Diogo Fernandes Costa Silva,Arlindo Rodrigues Galvão Filho
Main category: cs.CL
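TL;DR: MiJaBench shows that LLM safety alignment is systematically biased: protection varies with the target group, and scaling up models widens the disparity, challenging current safety scaling laws.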
Details
Motivation: Current LLM safety evaluations merge risks to different groups into a single metric, masking systemic vulnerabilities against specific minority groups and leaving discrimination in safety alignment without fine-grained analysis.
Method: Build MiJaBench, a bilingual adversarial benchmark of 44,000 prompts targeting 16 minority groups, and generate 528,000 prompt-response pairs from 12 state-of-the-art LLMs to form the MiJaBench-Align dataset, which quantifies differences in defense rates across groups.
Result: Defense rates within the same model differ by up to 33% across groups, and the gap grows with model size, indicating that existing alignment methods do not deliver universal safety but instead produce a group-based defense hierarchy.
Conclusion: Safety alignment is not a general semantic capability but reflects memorized refusal boundaries from the training data; current safety scaling laws overlook systemic discrimination, and alignment techniques that target fine-grained demographics are needed.
Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating "Identity Hate" into scalar scores that mask systemic vulnerabilities against specific populations. To expose this selective safety, we introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate MiJaBench-Align, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates fluctuate by up to 33% within the same model solely based on the target group. Crucially, we demonstrate that model scaling exacerbates these disparities, suggesting that current alignment techniques do not create a principle of non-discrimination but instead reinforce memorized refusal boundaries only for specific groups, challenging the current scaling laws of security. We release all datasets and scripts to encourage research into granular demographic alignment at GitHub.
[22] ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
Sharanya Dasgupta,Arkaprabha Basu,Sujoy Nath,Swagatam Das
Main category: cs.CL
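TL;DR: This paper proposes ARREST, a unified framework in which an external network intervenes in an LLM's latent activation space to address factuality and safety together, without fine-tuning the model parameters.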
Details
Motivation: LLMs fall short on factuality and safety; both problems stem from representational misalignment in the latent activation space and call for a unified method that handles them together.
Method: Propose the ARREST framework, which uses an adversarially trained external network to identify and correct drifted features, enabling soft refusals, hard refusals, and factual corrections.
Result: Experiments show that ARREST not only regulates misalignment effectively but is also more versatile than RLHF-aligned models at generating soft refusals.
Conclusion: ARREST offers a new way to improve LLM factuality and safety without fine-tuning, with good adaptability and application potential.
Abstract: Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance in a wide range of tasks. However, they still lack the human capacity to balance factuality and safety. Drawing on this resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space, rather than treating them as entirely separate alignment issues. We hypothesize that an external network, trained to understand the fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine-tuning the model parameters themselves. Reflecting the hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but is also more versatile compared to the RLHF-aligned models in generating soft refusals due to adversarial training. We make our codebase available at https://github.com/sharanya-dasgupta001/ARREST.
[23] Interpreting Transformers Through Attention Head Intervention
Mason Kadem,Rong Zheng
Main category: cs.CL
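TL;DR: This paper discusses the mechanistic interpretability of neural networks, stressing why understanding their decision-making matters for accountability and control in high-stakes domains, for the study of digital brains, and for discovering new knowledge when AI surpasses humans.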
Details
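Motivation: Although neural networks keep growing more capable, we lack an understanding of their internal neural mechanisms, which limits trustworthy use in critical domains and further scientific exploration.
Method: Present the concept of mechanistic interpretability and its several implications, using theoretical analysis to argue for its potential value in safety, cognitive science, and knowledge discovery.
Result: Identifies three roles of mechanistic interpretability: ensuring the transparency of AI systems, advancing cognitive science, and enabling AI-driven discovery of new knowledge.
Conclusion: Understanding the decision-making mechanisms of neural networks is essential for building AI systems that are trustworthy, controllable, and scientifically valuable.
Abstract: Neural networks are growing more capable on their own, but we do not understand their neural mechanisms. Understanding these mechanisms' decision-making processes, or mechanistic interpretability, enables (1) accountability and control in high-stakes domains, (2) the study of digital brains and the emergence of cognition, and (3) discovery of new knowledge when AI systems outperform humans.
[24] Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization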
Yao Dou,Wei Xu
Main category: cs.CL
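TL;DR: This paper proposes Gavel-Ref, an evaluation framework based on multi-value checklists for assessing LLMs on long-context legal case summarization, and introduces Gavel-Agent to improve efficiency and autonomy.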
Details
Motivation: LLMs now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear, and a systematic evaluation method is lacking for demanding tasks such as multi-document legal case summarization.
Method: Propose the Gavel-Ref evaluation framework, comprising a 26-item multi-value checklist plus residual-fact and writing-style evaluations; systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens; and design the Gavel-Agent scaffold, which provides six tools so that the model can autonomously extract information from documents.
Result: The strongest model, Gemini 2.5 Pro, reaches only about 50 on $S_{\text{Gavel-Ref}}$; models do well on simple items but poorly on multi-value or rare ones (e.g., settlements and monitor reports); Gavel-Agent with Qwen3 uses 36% fewer tokens than end-to-end extraction with GPT-4.1, with only a 7% drop in $S_{\text{checklist}}$.
Conclusion: Current LLMs still face major challenges on complex tasks with very long contexts, which call for finer-grained evaluation and more efficient inference mechanisms; Gavel-Agent offers a workable path toward autonomous information extraction.
Abstract: Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. In this paper, we study multi-document legal case summarization, where a single case often spans many documents totaling 100K-500K tokens. We introduce Gavel-Ref, a reference-based evaluation framework with multi-value checklist evaluation over 26 items, as well as residual fact and writing-style evaluations. Using Gavel-Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves only around 50 on $S_{\text{Gavel-Ref}}$, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi-value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human-written summaries (making human references less reliable), we develop Gavel-Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate and extract checklists directly from case documents. With Qwen3, Gavel-Agent reduces token usage by 36% while resulting in only a 7% drop in $S_{\text{checklist}}$ compared to end-to-end extraction with GPT-4.1.
[25] Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs
Myra Cheng,Robert D. Hawkins,Dan Jurafsky
Main category: cs.CL
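TL;DR: This paper examines why large language models (LLMs) often fail to challenge users' harmful beliefs, arguing that they default to accommodating users' assumptions and lack sufficient epistemic vigilance. Social and linguistic factors that influence accommodation in humans similarly affect LLMs, as shown on three safety benchmarks. Simple pragmatic interventions, such as adding the phrase "wait a minute", significantly improve performance while keeping false-positive rates low.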
Details
Motivation: LLMs often fail to challenge users' harmful beliefs in areas such as medical advice and social reasoning, which can spread misinformation or lead to uncritical accommodation of unreasonable requests (sycophancy); the safety behavior of these models therefore needs to be understood and improved from a pragmatic perspective.
Method: Analyze how social and linguistic factors (at-issueness, linguistic encoding, and source reliability) affect accommodation in LLMs, and evaluate models on three safety benchmarks (Cancer-Myth, SAGE-Eval, ELEPHANT); introduce simple pragmatic interventions (such as inserting "wait a minute") and test how much they improve the ability to challenge harmful beliefs.
Result: The factors that influence accommodation in humans also influence LLMs; pragmatic interventions significantly improve performance on challenging harmful beliefs while keeping false-positive rates low.
Conclusion: Pragmatic factors are important for understanding and improving LLM safety; strengthening a model's epistemic vigilance and adjusting its pragmatic strategy can improve its ability to challenge harmful beliefs.
Abstract: Large language models (LLMs) frequently fail to challenge users' harmful beliefs in domains ranging from medical advice to social reasoning. We argue that these failures can be understood and addressed pragmatically as consequences of LLMs defaulting to accommodating users' assumptions and exhibiting insufficient epistemic vigilance. We show that social and linguistic factors known to influence accommodation in humans (at-issueness, linguistic encoding, and source reliability) similarly affect accommodation in LLMs, explaining performance differences across three safety benchmarks that test models' ability to challenge harmful beliefs, spanning misinformation (Cancer-Myth, SAGE-Eval) and sycophancy (ELEPHANT). We further show that simple pragmatic interventions, such as adding the phrase "wait a minute", significantly improve performance on these benchmarks while preserving low false-positive rates. Our results highlight the importance of considering pragmatics for evaluating LLM behavior and improving LLM safety.
[26] Learning to Simulate Human Dialogue
Kanishk Gandhi,Agam Bhatia,Noah D. Goodman
Main category: cs.CL
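TL;DR: The study models human thinking through next-utterance prediction and finds that directly maximizing the log-probability of real human dialogue predicts what people actually say more accurately than optimizing judge-based rewards, and does better still when the model is allowed to "think".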
Details
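Motivation: To explore how predicting the next utterance in a dialogue can model the way people think, and to assess how well different learning approaches simulate human language behavior.
Method: Compare learning approaches along two dimensions: whether the model is allowed to "think" before responding, and whether training uses an LLM-judge reward (based on semantic similarity and information completeness) or directly maximizes the log-probability of real human dialogue; derive a lower bound on the log-probability by treating chain-of-thought as a latent variable, and optimize that objective.
Result: Judge-based rewards raise judge scores but lower both the likelihood of real human responses and the win rate in human judgments; allowing "thinking" amplifies this problem; directly maximizing the log-probability of real responses does better on both likelihood and win rate; optimizing the derived lower bound gives the best overall results.
Conclusion: "Thinking" genuinely helps the model capture human behavior only when training uses a distribution-matching objective grounded in real human dialogue; scaling this approach to broader conversational data may yield a more nuanced model of human behavior.
Abstract: To predict what someone will say is to model how they think. We study this through next-turn dialogue prediction: given a conversation, predict the next utterance produced by a person. We compare learning approaches along two dimensions: (1) whether the model is allowed to think before responding, and (2) how learning is rewarded: either through an LLM-as-a-judge that scores semantic similarity and information completeness relative to the ground-truth response, or by directly maximizing the log-probability of the true human dialogue. We find that optimizing for judge-based rewards indeed increases judge scores throughout training, but it decreases the likelihood assigned to ground-truth human responses and decreases the win rate when human judges choose the most human-like response among a real and a synthetic option. This failure is amplified when the model is allowed to think before answering. In contrast, by directly maximizing the log-probability of observed human responses, the model learns to better predict what people actually say, improving on both log-probability and win rate evaluations. Treating chain-of-thought as a latent variable, we derive a lower bound on the log-probability. Optimizing this objective yields the best results on all our evaluations.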
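The abstract does not state the bound itself. One standard way to obtain a lower bound of this kind, shown here as a sketch under assumed notation ($x$ for the conversation so far, $y$ for the observed human response, $z$ for the chain-of-thought, $p_\theta$ for the model), is Jensen's inequality with thoughts sampled from the model; the authors' exact objective may differ.

```latex
\begin{align}
\log p_\theta(y \mid x)
  &= \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z) \\
  &= \log \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\big[\, p_\theta(y \mid x, z) \,\big] \\
  &\ge \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\big[\, \log p_\theta(y \mid x, z) \,\big].
\end{align}
```

Under this reading, the right-hand side can be estimated by sampling thoughts and scoring the true response given each one, which keeps training anchored to what people actually said rather than to a judge's score.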
These results suggest that thinking helps primarily when trained with a distribution-matching objective grounded in real human dialogue, and that scaling this approach to broader conversational data may produce models with a more nuanced understanding of human behavior.
[27] Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
San Kim,Gary Geunbae Lee
Main category: cs.CL
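TL;DR: Proposes MB-Defense, a defense framework that uses a merging-and-breaking mechanism during training to make instruction-tuned LLMs more robust to backdoor attacks.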
Details
Motivation: Instruction-tuned LLMs depend on large-scale data and are therefore vulnerable to backdoor attacks, yet defenses for them remain underexplored.
Method: Design a two-stage training pipeline: the first stage, defensive poisoning, merges attacker and defensive triggers into a unified backdoor representation; the second stage, weight recovery, breaks that representation through additional training to restore clean behavior.
Result: Experiments on multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability.
Conclusion: MB-Defense provides a generalizable and data-efficient defense strategy that improves the robustness of instruction-tuned LLMs against unseen backdoor attacks.
Abstract: Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets, often collected from human or web sources, makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging & Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) defensive poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) weight recovery, which breaks this representation through additional training to restore clean behavior. Extensive experiments across multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Our method offers a generalizable and data-efficient defense strategy, improving the robustness of instruction-tuned LLMs against unseen backdoor attacks.
[28] Users Mispredict Their Own Preferences for AI Writing Assistance
Vivian Lai,Zana Buçinca,Nil-Jana Akpinar,Mo Houtti,Hyeonsu B. Kang,Kevin Chian,Namjoon Suh,Alex C. Williams
Main category: cs.CL
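TL;DR: Users' actual behavior with AI writing assistants differs markedly from their self-reported preferences. Compositional effort is the main driver of decisions, whereas urgency, which users rate as most important, has no significant effect on behavior; designing systems from self-reports therefore misdirects optimization.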
Details
Motivation: To understand what actually drives users' need for help from proactive AI writing assistants, and to address the mismatch between self-reported preferences and actual behavior.
Method: Run a factorial vignette study with 50 participants making 750 pairwise comparisons, and analyze both their choices across scenarios and their self-reported preferences.
Result: Compositional effort strongly predicts user decisions (ρ = 0.597), whereas urgency has almost no predictive power (ρ ≈ 0); users show a clear perception-behavior gap, so systems designed from self-reports reach only 57.7% accuracy, below systems designed from behavioral patterns (61.3%, p < 0.05).
Conclusion: Relying on user self-reports to design proactive AI writing assistants degrades system performance; optimization should lean on actual behavioral data, which has important implications for the design of proactive natural language generation systems.
Abstract: Proactive AI writing assistants need to predict when users want drafting help, yet we lack empirical understanding of what drives preferences. Through a factorial vignette study with 50 participants making 750 pairwise comparisons, we find compositional effort dominates decisions ($\rho = 0.597$) while urgency shows no predictive power ($\rho \approx 0$). More critically, users exhibit a striking perception-behavior gap: they rank urgency first in self-reports despite it being the weakest behavioral driver, representing a complete preference inversion. This misalignment has measurable consequences. Systems designed from users' stated preferences achieve only 57.7% accuracy, underperforming even naive baselines, while systems using behavioral patterns reach significantly higher 61.3% ($p < 0.05$). These findings demonstrate that relying on user introspection for system design actively misleads optimization, with direct implications for proactive natural language generation (NLG) systems.
[29] Beyond Static Summarization: Proactive Memory Extraction for LLM Agents
Chengyuan Yang,Zequn Sun,Wei Wei,Wei Hu
Main category: cs.CL
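TL;DR: This paper proposes proactive memory extraction (ProMem), which uses a recurrent self-questioning feedback loop to overcome the shortcomings of conventional summarization in information completeness and QA accuracy.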
Details
Motivation: Existing summary-based memory extraction is "ahead-of-time" and "one-off", so important information is lost and there is no mechanism for correcting errors.
Method: Introduce an iterative cognitive process in which the agent uses self-questioning to actively probe the dialogue history, forming a proactive memory extraction mechanism with a feedback loop.
Result: ProMem significantly improves the completeness of the extracted memory and QA accuracy, and achieves a better trade-off between extraction quality and token cost.
Conclusion: Treating memory extraction as an active cognitive process with feedback is more effective than conventional feed-forward summarization and benefits long-term interaction and personalization.
Abstract: Memory management is vital for LLM agents to handle long-term interaction and personalization. Most research focuses on how to organize and use memory summaries, but often overlooks the initial memory extraction stage. In this paper, drawing on recurrent processing theory, we argue that existing summary-based methods have two major limitations. First, summarization is "ahead-of-time", acting as a blind "feed-forward" process that misses important details because it doesn't know future tasks. Second, extraction is usually "one-off", lacking a feedback loop to verify facts, which leads to the accumulation of information loss. To address these issues, we propose proactive memory extraction (namely ProMem). Unlike static summarization, ProMem treats extraction as an iterative cognitive process. We introduce a recurrent feedback loop where the agent uses self-questioning to actively probe the dialogue history. This mechanism allows the agent to recover missing information and correct errors. Our ProMem significantly improves the completeness of the extracted memory and QA accuracy. It also achieves a superior trade-off between extraction quality and token cost.
[30] Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions
Ignacio Sastre,Aiala Rosá
Main category: cs.CL
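TL;DR: Concept Tokens is a lightweight method that controls the behavior of a frozen LLM by adding a new special token and learning only its embedding, optimized on natural language definitions; it provides directional control in settings such as reducing hallucinations and inducing a pedagogical feedback strategy.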
Details
Motivation: To steer the behavior of a frozen large language model (LLM) effectively through natural language definitions, avoiding full fine-tuning of the model parameters and improving control over model outputs.
Method: Add a new special token to a pretrained LLM, substitute it for the target concept, and optimize only that token's embedding vector; the rest of the model stays frozen and training uses the standard language-modeling objective.
Result: On HotpotQA, the method shows a directional effect on hallucinated answers; a similar effect appears for recasting, a feedback strategy in second language teaching; compared with in-context learning, it better preserves compliance with other instructions; a qualitative study shows that it captures concept-specific information but still has limitations.
Conclusion: Concept Tokens provide a compact and effective control signal that steers frozen LLMs from definitions, enabling controllable generation across several tasks while retaining the model's original capabilities.
Abstract: We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional "Austral Tower" to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.
[31] SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
Iaroslav Chelombitko,Ekaterina Chelombitko,Aleksey Komissarov
Main category: cs.CL
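TL;DR: This paper presents SampoNLP, a corpus-free toolkit for building morphological lexicons, uses those lexicons to evaluate BPE tokenizers on Uralic languages, proposes the Integrated Performance Score (IPS) metric, and recommends optimal vocabulary sizes.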
Details
Motivation: The lack of clean morpheme lexicons makes it hard to evaluate subword tokenization quality for morphologically rich Uralic languages.
Method: Propose the SampoNLP toolkit, which builds high-purity morphological lexicons using MDL-inspired Self-Referential Atomicity Scoring, and use those lexicons to evaluate BPE tokenizers systematically across vocabulary sizes, introducing the Integrated Performance Score (IPS) to measure tokenization quality.
Result: High-quality morpheme lexicons are generated for Finnish, Hungarian, and Estonian; BPE shows limitations for highly agglutinative languages; the IPS curves reveal "elbow points" in vocabulary size, giving empirically grounded recommendations for the optimal k.
Conclusion: SampoNLP effectively supports morpheme analysis for low-resource languages; standard BPE has clear shortcomings for highly agglutinative languages; IPS is recommended for guiding tokenizer configuration.
Abstract: The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues, an approach suited to low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP
[32] WESR: Scaling and Evaluating Word-level Event-Speech Recognition
Chenchen Yang,Kexin Huang,Liwei Fan,Qian Tu,Botian Jiang,Dong Zhang,Linqi Yin,Shimin Li,Zhaoye Fei,Qinyuan Cheng,Xipeng Qiu
Main category: cs.CL
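TL;DR: This paper proposes a new taxonomy of 21 vocal events and builds the WESR-Bench evaluation set and a large-scale corpus to enable precise localization of both discrete and continuous non-verbal vocal events.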
Details
Motivation: Precise localization of non-verbal vocal events (such as laughing and crying) still suffers from insufficient task definitions, limited category coverage, ambiguous temporal granularity, and the lack of a standardized evaluation framework.
Method: Propose a new taxonomy of 21 vocal events that distinguishes discrete from continuous events; build the expert-annotated WESR-Bench evaluation set (900+ utterances) with a position-aware protocol that separates ASR errors from event detection performance; construct a 1,700+ hour corpus and train specialized models.
Result: The approach outperforms open-source audio-language models and commercial APIs on non-verbal event detection while preserving ASR quality, and localizes both event types precisely.
Conclusion: WESR and its evaluation protocol provide a foundational resource for modeling rich, real-world auditory scenes and should help advance the field.
Abstract: Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. We also build a strong baseline by constructing a 1,700+ hour corpus, and train specialized models, surpassing both open-source audio-language models and commercial APIs while preserving ASR quality. We anticipate that WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes.
[33] LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation
Yuxiao Ye,Yiming Zhang,Yiran Ma,Huiyuan Xie,Huining Zhu,Zhiyuan Liu
Main category: cs.CL
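TL;DR: This paper proposes LinguaGame, a multi-agent dialogue generation framework grounded in linguistics and game theory that improves communication efficiency between agents by adjusting decisions at inference time.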
Details
Motivation: Existing LLM-based multi-agent systems mostly focus on architecture design and neglect the interaction process itself. This paper aims to make agents' natural language communication more efficient and effective.
Method: Model dialogue as a signalling game over communicative intents and strategies, use a training-free equilibrium approximation algorithm to adjust decisions at inference time, and infer intents and strategies based on linguistic principles.
Result: Evaluated in simulated courtroom proceedings and debates, where human expert assessments show significant gains in communication efficiency.
Conclusion: With its linguistically guided game-theoretic framework, LinguaGame achieves multi-agent dialogue generation that is task-agnostic, efficient, and more intent-aware.
Abstract: Large Language Models (LLMs) have enabled Multi-Agent Systems (MASs) where agents interact through natural language to solve complex tasks or simulate multi-party dialogues. Recent work on LLM-based MASs has mainly focused on architecture design, such as role assignment and workflow orchestration. In contrast, this paper targets the interaction process itself, aiming to improve agents' communication efficiency by helping them convey their intended meaning more effectively through language. To this end, we propose LinguaGame, a linguistically-grounded game-theoretic paradigm for multi-agent dialogue generation. Our approach models dialogue as a signalling game over communicative intents and strategies, solved with a training-free equilibrium approximation algorithm for inference-time decision adjustment. Unlike prior game-theoretic MASs, whose game designs are often tightly coupled with task-specific objectives, our framework relies on linguistically informed reasoning with minimal task-specific coupling. Specifically, it treats dialogue as intentional and strategic communication, requiring agents to infer what others aim to achieve (intents) and how they pursue those goals (strategies). We evaluate our framework in simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency.
[34] GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence
Yibo Zhao,Jiapeng Zhu,Zichen Ding,Xiang Li
Main category: cs.CL
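TL;DR: This paper proposes GRACE, a reinforcement-learning framework for retrieval-augmented generation that uses a multi-stage gated reward function to address, in one framework, answers that lack supporting evidence and hallucinations when context is insufficient; it reaches state-of-the-art accuracy and sensible abstention while cutting annotation costs by 90%.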
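As a rough illustration of what a "multi-stage gated" reward could look like, the sketch below lets each later stage earn reward only if the earlier one passed. The stage order, the reward magnitudes, the `evidence_threshold` parameter, and the function name are all assumptions made for this example; the entry does not specify the authors' actual reward.

```python
def gated_reward(sufficient, judged_sufficient, evidence_score,
                 answer_correct, abstained, evidence_threshold=0.5):
    """Hypothetical three-stage gated reward (not the authors' formula).

    sufficient        -- whether the retrieved context really supports an answer
    judged_sufficient -- the model's own sufficiency judgment
    evidence_score    -- overlap between extracted and gold evidence, in [0, 1]
    answer_correct    -- whether the final answer is correct
    abstained         -- whether the model declined to answer
    """
    # Gate 1: a wrong sufficiency judgment earns nothing further.
    if judged_sufficient != sufficient:
        return 0.0
    reward = 1.0

    # With insufficient context, the only good outcome is abstention.
    if not sufficient:
        return reward + (1.0 if abstained else 0.0)

    # Context was sufficient, so abstaining forfeits the later stages.
    if abstained:
        return reward

    # Gate 2: the evidence must be good enough before the answer counts.
    if evidence_score < evidence_threshold:
        return reward
    reward += evidence_score

    # Stage 3: answer correctness.
    if answer_correct:
        reward += 1.0
    return reward


if __name__ == "__main__":
    print(gated_reward(True, True, 0.8, True, False))
    print(gated_reward(False, False, 0.0, False, True))
```

The gating means a correct answer produced without adequate evidence is not rewarded for correctness, which is one way to discourage ungrounded answers.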
Details
Motivation: Existing RAG systems have two key flaws: answers that are correct but lack verifiable evidence, and fabricated content when the retrieved context is insufficient. Current methods usually handle these two problems separately, and a unified solution is missing.
Method: Propose the GRACE framework, which uses heterogeneous retrievers to generate diverse training samples without manual annotation, and a multi-stage gated reward function that trains the model through reinforcement learning to assess evidence sufficiency, extract key evidence, and then either answer or explicitly abstain.
Result: Experiments on two benchmarks show that GRACE reaches state-of-the-art overall accuracy and strikes a good balance between accurate answers and justified abstention, at only 10% of the annotation cost of prior methods.
Conclusion: GRACE provides a unified, efficient, and low-cost solution for RAG systems that improves model trustworthiness and practicality, with good prospects for application.
Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge to enhance Large Language Models (LLMs), yet systems remain susceptible to two critical flaws: providing correct answers without explicit grounded evidence and producing fabricated responses when the retrieved context is insufficient. While prior research has addressed these issues independently, a unified framework that integrates evidence-based grounding and reliable abstention is currently lacking. In this paper, we propose GRACE, a reinforcement-learning framework that simultaneously mitigates both types of flaws. GRACE employs a data construction method that utilizes heterogeneous retrievers to generate diverse training samples without manual annotation. A multi-stage gated reward function is then employed to train the model to assess evidence sufficiency, extract key supporting evidence, and provide answers or explicitly abstain. Experimental results on two benchmarks demonstrate that GRACE achieves state-of-the-art overall accuracy and strikes a favorable balance between accurate response and rejection, while requiring only 10% of the annotation costs of prior methods. Our code is available at https://github.com/YiboZhao624/Grace.
[35] BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation
Amit Bin Tariqul,A N M Zahid Hossain Milkan,Sahab-Al-Chowdhury,Syed Rifat Raiyan,Hasan Mahmud,Md Kamrul Hasan
Main category: cs.CL
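TL;DR: This paper presents the first systematic evaluation of how robust existing text watermarking methods are to cross-lingual round-trip translation attacks in a low-resource language (Bangla), and proposes a layered watermarking strategy to improve detection accuracy.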
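For background on the token-level watermarks evaluated in this entry, the sketch below shows the one-proportion z-test commonly used to detect KGW-style "green list" watermarks: count how many tokens fall in the green list and compare against the fraction `gamma` expected by chance. The function names and the threshold of 4.0 are illustrative assumptions, and the green-list construction itself (seeding from previous tokens) is omitted.

```python
import math


def kgw_z_score(green_count, total_tokens, gamma=0.5):
    """z-statistic for observing `green_count` green tokens out of `total_tokens`.

    Under the null hypothesis (no watermark), each token is green with
    probability `gamma`, so the count is approximately normal with
    mean gamma*T and variance T*gamma*(1-gamma).
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def is_watermarked(green_count, total_tokens, gamma=0.5, threshold=4.0):
    """Flag text as watermarked when the z-score exceeds an illustrative threshold."""
    return kgw_z_score(green_count, total_tokens, gamma) > threshold


if __name__ == "__main__":
    # Watermarked text: many more green tokens than chance predicts.
    print(kgw_z_score(75, 100))
    # After heavy rewriting, the green fraction drifts back toward gamma.
    print(kgw_z_score(52, 100))
```

This also suggests why round-trip translation is damaging: translation replaces most tokens, so the green fraction falls back toward `gamma` and the z-score drops below any reasonable threshold.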
Details
Motivation: Existing text watermarking methods work well in high-resource languages, but their robustness in low-resource languages is unclear, and they show clear weaknesses under cross-lingual attacks.
Method: Evaluate state-of-the-art watermarking methods (KGW, EXP, and Waterfall) and propose a layered strategy that combines embedding-time and post-generation watermarks to strengthen resistance to attacks.
Result: KGW and EXP exceed 88% detection accuracy under benign conditions but fall to 9-13% after RTT attacks; layered watermarking raises post-RTT detection accuracy to 40-50%, a 3-4x relative improvement.
Conclusion: Layered watermarking is a practical, training-free solution that improves the robustness of text watermarks in low-resource languages, at the cost of controlled semantic degradation.
Abstract: As large language models (LLMs) are increasingly deployed for text generation, watermarking has become essential for authorship attribution, intellectual property protection, and misuse detection. While existing watermarking methods perform well in high-resource languages, their robustness in low-resource languages remains underexplored. This work presents the first systematic evaluation of state-of-the-art text watermarking methods (KGW, Exponential Sampling (EXP), and Waterfall) for Bangla LLM text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve high detection accuracy (>88%) with negligible perplexity and ROUGE degradation. However, RTT causes detection accuracy to collapse to 9-13%, indicating a fundamental failure of token-level watermarking. To address this, we propose a layered watermarking strategy that combines embedding-time and post-generation watermarks. Experimental results show that layered watermarking improves post-RTT detection accuracy by 25-35%, achieving 40-50% accuracy, representing a 3$\times$ to 4$\times$ relative improvement over single-layer methods, at the cost of controlled semantic degradation. Our findings quantify the robustness-quality trade-off in multilingual watermarking and establish layered watermarking as a practical, training-free solution for low-resource languages such as Bangla. Our code and data will be made public.
[36] Identifying Good and Bad Neurons for Task-Level Controllable LLMs
Wenjie Li,Guansong Pang,Hezhe Qiao,Debin Gao,David Lo
Main category: cs.CL
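TL;DR: This paper proposes NeuronLLM, a task-level framework for understanding LLM neurons based on the principle of functional antagonism; it uses contrastive learning to identify neurons that facilitate or inhibit task completion and uses augmented question sets to reduce the influence of fortuitous model behavior, giving a more complete model of LLM neurons.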
Details
Motivation: Existing methods consider only supportive neurons that correlate positively with the task, which makes them ill-suited to tasks that require several abilities to work together; they also ignore inhibitive neurons and the misattribution caused by fortuitous model behavior.
Method: Propose the NeuronLLM framework, which adopts functional antagonism and uses contrastive learning to model both "good neurons" (which facilitate the task) and "bad neurons" (which inhibit it), and builds augmented question sets to reduce interference from answers that the model gets right by chance.
Result: Comprehensive experiments on LLMs of different sizes and families show that NeuronLLM outperforms existing methods on four NLP tasks and identifies task-relevant neurons more accurately.
Conclusion: By introducing functional antagonism and contrastive learning, NeuronLLM offers new insight into the functional organization of LLM neurons and improves task-level interpretability and the robustness of neuron identification.
Abstract: Large Language Models have demonstrated remarkable capabilities on multiple-choice question answering benchmarks, but the complex mechanisms underlying their large-scale neurons remain opaque, posing significant challenges for understanding and steering LLMs. While recent studies made progress on identifying responsible neurons for certain abilities, these ability-specific methods are infeasible for task-focused scenarios requiring coordinated use of multiple abilities. Moreover, these approaches focus only on supportive neurons that correlate positively with task completion, while neglecting neurons with other roles, such as inhibitive roles, and overlooking misleading neuron attribution due to fortuitous behaviors in LLMs (i.e., correctly answering questions by chance rather than through genuine understanding). To address these challenges, we propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification. The key insight is that task performance is jointly determined by neurons with two opposing roles: good neurons that facilitate task completion and bad neurons that inhibit it. NeuronLLM achieves a holistic modeling of neurons via contrastive learning of good and bad neurons, while leveraging augmented question sets to mitigate the fortuitous behaviors in LLMs. Comprehensive experiments on LLMs of different sizes and families show the superiority of NeuronLLM over existing methods in four NLP tasks, providing new insights into LLM functional organization.
[37] FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback
Seongyeub Chu,Jongwoo Kim,Munyong Yi
Main category: cs.CL
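TL;DR: This paper proposes FeedEval, an LLM-based framework for evaluating the quality of LLM-generated essay feedback along three pedagogically relevant dimensions (specificity, helpfulness, and validity), and shows experimentally that it improves automated scoring and essay revision.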
Details
Motivation: Prior work that uses LLM-generated essay feedback lacks explicit validation of feedback quality, so noise propagates and hurts downstream performance; a reliable method is needed to evaluate and filter high-quality feedback.
Method: Propose the FeedEval framework, which uses dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality ones; run experiments on the ASAP++ benchmark and test how the filtered feedback affects essay revision by small LLMs.
Result: FeedEval agrees closely with human expert judgments; scoring models trained on the high-quality feedback it selects perform better, and that feedback guides small LLMs to more effective essay revisions.
Conclusion: FeedEval effectively identifies high-quality LLM-generated feedback, improves automated essay scoring and revision guidance, and offers a workable way to reduce noise in synthetic data.
Abstract: Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We will release our code and curated datasets upon acceptance.
[38] Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization
Mizanur Rahman,Mohammed Saidul Islam,Md Tahmid Rahman Laskar,Shafiq Joty,Enamul Hoque
Main category: cs.CL
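TL;DR: This paper proposes RL-Text2Vis, the first reinforcement learning framework for text-to-visualization generation. Built on the GRPO algorithm with a multi-objective reward, it substantially improves code executability and chart quality.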
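To make the GRPO idea in this entry concrete, here is a minimal sketch of the group-relative advantage computation applied to a combined reward. The equal-weight sum of the three components and the example scores are assumptions for illustration; the abstract does not say how RL-Text2Vis weights or combines its objectives.

```python
from statistics import mean, pstdev


def combined_reward(text_acc, code_valid, vis_quality, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three reward components (weights are hypothetical)."""
    w_text, w_code, w_vis = weights
    return w_text * text_acc + w_code * code_valid + w_vis * vis_quality


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical completions for one query, scored after execution.
group = [
    combined_reward(1.0, 1.0, 0.8),  # correct answer, runs, good chart
    combined_reward(0.0, 1.0, 0.4),  # runs, but wrong answer and weaker chart
    combined_reward(1.0, 0.0, 0.0),  # code fails, so no chart to score
]
advantages = group_relative_advantages(group)
```

Because each advantage is measured against the group's own mean, no separate value network is needed; samples that score above their siblings are reinforced and the rest are discouraged.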
Details
Motivation: Charts produced by existing text-to-visualization systems often lack semantic alignment and clarity, and conventional supervised fine-tuning cannot use post-execution feedback to improve visualization quality.
Method: Proposes RL-Text2Vis, which uses Group Relative Policy Optimization (GRPO) with a multi-objective reward covering textual accuracy, code validity, and visualization quality, optimized with post-execution feedback.
Result: On the Text2Vis benchmark, chart quality improves by 22% relative to GPT-4o and code execution success rises from 78% to 97%; the models also generalize well to out-of-domain datasets such as VIS-Eval and NVBench.
Conclusion: GRPO is an effective strategy for structured, multimodal reasoning and offers a new optimization route for text-to-visualization generation.
Abstract: Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at https://github.com/vis-nlp/RL-Text2Vis.
[39] THaLLE-ThaiLLM: Domain-Specialized Small LLMs for Finance and Thai -- Technical Report
KBTG Labs,Anuruth Lertpiya,Danupat Khamnuansin,Kantapong Sucharitpongpan,Pornchanan Balee,Tawunrat Chalothorn,Thadpong Pongthawornkamol,Monchai Lertsutthiwong
Main category: cs.CL
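TL;DR: This report examines whether model merging can produce high-performance, multi-capability LLMs, as a way around the cost of specialized models for on-premise deployment. Merging Qwen-8B with ThaiLLM-8B and THaLLE-CFA-8B yields gains in both Thai-language and financial-domain performance.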
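As a rough illustration of what merging checkpoints can mean, the sketch below averages parameters that share names and shapes. Linear averaging is an assumption chosen for simplicity; the report's abstract does not state which merging technique was used, and the tiny parameter dictionaries are hypothetical.

```python
def merge_linear(models, weights):
    """Weighted average of parameter dicts that share keys and vector lengths."""
    total = sum(weights)
    merged = {}
    for name in models[0]:
        # Walk the same parameter across all models, one coordinate at a time.
        columns = zip(*(m[name] for m in models))
        merged[name] = [
            sum(w * v for w, v in zip(weights, values)) / total
            for values in columns
        ]
    return merged


# Hypothetical two-parameter "models" standing in for real checkpoints.
general = {"layer.weight": [0.0, 2.0]}
thai = {"layer.weight": [2.0, 4.0]}
merged = merge_linear([general, thai], weights=[1, 1])
```

The appeal of this family of methods is that it needs no further training: the cost is one pass over the weights, which is why merging is attractive when training a single multi-capability model is too expensive.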
Details
Motivation: Because of privacy, security, and regulatory concerns, financial institutions tend to deploy LLMs on premises, but training a multi-capability model is expensive, so an efficient alternative is needed.
Method: Uses model merging: Qwen-8B is merged with ThaiLLM-8B alone, and with both ThaiLLM-8B and THaLLE-CFA-8B, and the merged models are evaluated on general, educational, and financial tasks.
Result: Merging Qwen-8B with ThaiLLM-8B outperforms the original model on the M3 and M6 O-NET exams; additionally merging THaLLE-CFA-8B also improves results on the Flare-CFA and Thai-IC financial benchmarks, showing stronger multi-domain capability.
Conclusion: Model merging is a resource-efficient way to build high-quality LLMs with multi-domain capabilities, especially for organizations that need on-premise deployment and are cost-constrained.
Abstract: Large Language Models (LLMs) have demonstrated significant potential across various domains, particularly in banking and finance, where they can automate complex tasks and enhance decision-making at scale. Due to privacy, security, and regulatory concerns, organizations often prefer on-premise deployment of LLMs. The ThaiLLM initiative aims to enhance Thai language capabilities in open-LLMs, enabling Thai industry to leverage advanced language models. However, organizations often face a trade-off between deploying multiple specialized models and the prohibitive expense of training a single multi-capability model. To address this, we explore model merging as a resource-efficient alternative for developing high-performance, multi-capability LLMs. We present results from two key experiments: first, merging Qwen-8B with ThaiLLM-8B demonstrates how ThaiLLM-8B enhances Thai general capabilities, showing an uplift on the M3 and M6 O-NET exams over the general instruction-following Qwen-8B. Second, we merge Qwen-8B with both ThaiLLM-8B and THaLLE-CFA-8B. This combination results in further improvements in performance across both general and financial domains, with an uplift on the M3 and M6 O-NET, Flare-CFA, and Thai-IC benchmarks. The report showcases the viability of model merging for efficiently creating multi-capability LLMs.
[40] On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions
Zhiyuan He,Binghan Chen,Tianxiang Xiong,Ziyang Sun,Mozhao Zhu,Xi Chen
Main category: cs.CL
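TL;DR: This paper studies the limitations of Rank-One Model Editing (ROME) on multi-hop reasoning tasks and proposes a Redundant Editing strategy that substantially improves multi-hop question answering accuracy.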
Details
Motivation: Knowledge editing methods such as ROME work well for single-hop fact updates but perform poorly on multi-hop tasks that require chaining knowledge; this paper examines how the depth of the edited layer affects the outcome and addresses the resulting problems.
Method: Analyzes ROME edits applied at different layers, identifies three main failure modes, and proposes Redundant Editing, which repeats the edit at multiple layers to mitigate the "hopping-too-late" problem and the loss of generalization.
Result: Redundant Editing raises accuracy on 2-hop questions by at least 15.5 percentage points, a 96% increase over the single-edit strategy, at the cost of some specificity and language naturalness.
Conclusion: Redundant editing across layers effectively improves multi-hop reasoning and offers a viable direction for applying knowledge editing to complex reasoning.
Abstract: Recent advances in Knowledge Editing (KE), particularly Rank-One Model Editing (ROME), show superior efficiency over fine-tuning and in-context learning for updating single-hop facts in transformers. However, these methods face significant challenges when applied to multi-hop reasoning tasks requiring knowledge chaining. In this work, we study the effect of editing knowledge with ROME on different layer depths and identify three key failure modes. First, the "hopping-too-late" problem occurs as later layers lack access to necessary intermediate representations. Second, generalization ability deteriorates sharply when editing later layers. Third, the model overfits to edited knowledge, incorrectly prioritizing edited-hop answers regardless of context. To mitigate the issues of "hopping-too-late" and generalization decay, we propose Redundant Editing, a simple yet effective strategy that enhances multi-hop reasoning. Our experiments demonstrate that this approach can improve accuracy on 2-hop questions by at least 15.5 percentage points, representing a 96% increase over the previous single-edit strategy, while trading off some specificity and language naturalness.
[41] When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation
Rhea Kapur,Robert Hawkins,Elisa Kreiss
Main category: cs.CL
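TL;DR: This paper argues that the specificity of vision-language model descriptions should be decoupled from their length. It defines specificity relative to a contrast set, uses a dataset that holds length fixed while varying information content to show that people prefer more specific descriptions, and concludes that evaluation should prioritize specificity over verbosity.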
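One way to read the contrast-set definition is as the share of matching score a description gives to the target image rather than to the alternatives. The sketch below assumes hypothetical description-to-image match scores from some scorer and uses a softmax to turn them into that share; the paper's actual formula is not given in this listing.

```python
import math


def specificity(target_score, contrast_scores, temperature=1.0):
    """Softmax probability of the target image against a contrast set."""
    scores = [target_score] + list(contrast_scores)
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    return exps[0] / sum(exps)


# Hypothetical scores: the second description matches the target far better
# than it matches the two contrast images, whatever its length.
vague = specificity(0.5, [0.4, 0.45])
precise = specificity(2.0, [0.1, 0.2])
```

Nothing in this measure depends on word count, which is the point of the entry: a long description that fits every image in the contrast set equally well scores no higher than chance.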
Details
Motivation: Descriptions from current vision-language models often conflate specificity with length, which prevents accurate measurement of description quality; the two must be separated to make accessibility systems more effective.
Method: Defines specificity relative to a contrast set, builds a dataset that controls length while varying information content, and validates the effect of specificity through human preference experiments.
Result: Even at equal length, descriptions with higher information density are significantly preferred; controlling length alone does not account for differences in specificity, and how the content budget is allocated matters.
Conclusion: Evaluation should measure specificity directly rather than rely on description length, to encourage more precise and efficient descriptions.
Abstract: Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, description specificity is often conflated with description length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.
[42] Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR
Yihong Tang,Kehai Chen,Xuefeng Bai,Benyou Wang,Zeming Liu,Haifeng Wang,Min Zhang
Main category: cs.CL
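TL;DR: Character-R1 is a framework for improving role-aware reasoning in role-playing agents. It supplies verifiable reward signals that strengthen internal cognitive consistency and significantly outperforms existing methods on knowledge, memory, and other dimensions.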
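The third component named in this entry, Character-Conditioned Reward Normalization, adjusts reward distributions per character category. A minimal sketch of one plausible form follows, assuming a per-category z-score; the category names and rewards are hypothetical, and the paper may normalize differently.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_category(samples, eps=1e-8):
    """Z-score each reward against other samples from the same category.

    samples: list of (category, reward) pairs; output keeps the input order.
    """
    by_category = defaultdict(list)
    for category, reward in samples:
        by_category[category].append(reward)
    stats = {c: (mean(rs), pstdev(rs)) for c, rs in by_category.items()}
    return [(r - stats[c][0]) / (stats[c][1] + eps) for c, r in samples]


# Hypothetical raw rewards: one category scores systematically lower.
samples = [("hero", 0.9), ("hero", 0.7), ("villain", 0.3), ("villain", 0.1)]
normalized = normalize_by_category(samples)
```

After normalization the best sample in each category receives about the same signal, so a category with systematically low raw rewards is not drowned out during optimization.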
Details
Motivation: Role-playing agents are usually built by imitating surface behavior and lack internal cognitive consistency, so they tend to make out-of-character errors in complex situations.
Method: Proposes Character-R1 with three core designs: a Cognitive Focus Reward (label-based analysis of 10 character elements), a Reference-Guided Reward (overlap metrics against reference responses as optimization anchors), and Character-Conditioned Reward Normalization (adjusting reward distributions by character category).
Result: Extensive experiments show that Character-R1 significantly outperforms existing methods on knowledge, memory, and other dimensions.
Conclusion: Through structured modeling of internal cognition and fine-grained rewards, Character-R1 improves the cognitive consistency and stability of role-playing agents.
Abstract: Current role-playing agents (RPAs) are typically constructed by imitating surface-level behaviors, but this approach lacks internal cognitive consistency, often causing out-of-character errors in complex situations. To address this, we propose Character-R1, a framework designed to provide comprehensive verifiable reward signals for effective role-aware reasoning, which are missing in recent studies. Specifically, our framework comprises three core designs: (1) Cognitive Focus Reward, which enforces explicit label-based analysis of 10 character elements (e.g., worldview) to structure internal cognition; (2) Reference-Guided Reward, which utilizes overlap-based metrics with reference responses as optimization anchors to enhance exploration and performance; and (3) Character-Conditioned Reward Normalization, which adjusts reward distributions based on character categories to ensure robust optimization across heterogeneous roles. Extensive experiments demonstrate that Character-R1 significantly outperforms existing methods on knowledge, memory, and other dimensions.
[43] From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset
Haneul Yoo,Won Ik Cho,Geunhye Kim,Jiyoon Han
Main category: cs.CL
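TL;DR: Proposes CuCu, an automated multi-agent framework that uses national social studies curricula to generate culture-specific question-answer pairs for the cultural alignment of large language models.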
Details
Motivation: LLMs perform well on many tasks, but their progress is uneven across languages and cultures; they often reflect values in English-centric training data and are insufficiently aligned with non-English cultures.
Method: Uses national social studies curricula as the basis for culture-aware supervision; the CuCu framework converts the Korean national curriculum into 34.1k open-ended question-answer pairs (KCaQA) for training and evaluating cultural adaptation.
Result: KCaQA covers a rich set of culture-specific topics, its responses are grounded in local sociocultural context, and both quantitative and qualitative analyses support its effectiveness.
Conclusion: The approach offers a scalable, low-cost route to cross-cultural alignment and supports fairer, more broadly applicable LLMs across diverse cultures.
Abstract: Large language models (LLMs) achieve strong performance on many tasks, but their progress remains uneven across languages and cultures, often reflecting values latent in English-centric training data. To enable practical cultural alignment, we propose a scalable approach that leverages national social studies curricula as a foundation for culture-aware supervision. We introduce CuCu, an automated multi-agent LLM framework that transforms national textbook curricula into open-ended, culture-specific question-answer pairs. Applying CuCu to the Korean national social studies curriculum, we construct KCaQA, comprising 34.1k open-ended QA pairs. Our quantitative and qualitative analyses suggest that KCaQA covers culture-specific topics and produces responses grounded in local sociocultural contexts.
[44] MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark
Anyang Song,Ying Cheng,Yiqian Xu,Rui Feng
Main category: cs.CL
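TL;DR: This paper proposes MAGA, a method that increases the alignment of machine-generated text with human-written text in order to improve detector generalization and test detector robustness.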
Details
Motivation: As LLMs advance, machine-generated text is increasingly hard to tell apart from human-written text, worsening problems such as fake news and online fraud. The generalization of fine-tuned detectors is limited by dataset quality, so the generation process itself needs further augmentation to raise alignment.
Method: Proposes the MAGA pipeline, which aligns generation end to end, from prompt construction to the reasoning process; its key component is Reinforced Learning from Detectors Feedback (RLDF), and the systematically aligned text is used both to train and to attack detectors.
Result: A RoBERTa detector fine-tuned on the MAGA training set improves generalization detection AUC by 4.60% on average, while the MAGA dataset lowers the AUC of the selected detectors by 8.13% on average.
Conclusion: MAGA improves detector generalization and exposes the fragility of current detectors, providing guidance for future detection research.
Abstract: The alignment of Large Language Models (LLMs) is constantly evolving. Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse issues such as fake news and online fraud. Fine-tuned detectors' generalization ability is highly dependent on dataset quality, and simply expanding the sources of MGT is insufficient. Further augmentation of the generation process is required. According to HC-Var's theory, enhancing the alignment of generated text can not only facilitate attacks on existing detectors to test their robustness, but also help improve the generalization ability of detectors fine-tuned on it. Therefore, we propose Machine-Augment-Generated Text via Alignment (MAGA). MAGA's pipeline achieves comprehensive alignment from prompt construction to reasoning process, among which Reinforced Learning from Detectors Feedback (RLDF), systematically proposed by us, serves as a key component. In our experiments, the RoBERTa detector fine-tuned on the MAGA training set achieved an average improvement of 4.60% in generalization detection AUC. The MAGA dataset caused an average decrease of 8.13% in the AUC of the selected detectors, which we expect to inform future research on the generalization ability of detectors.
[45] SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation
Sirry Chen,Jieyi Wang,Wei Chen,Zhongyu Wei
Main category: cs.CL
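TL;DR: This paper proposes SpeechMedAssist, a speech language model that supports speech-based multi-turn medical consultation through a two-stage training paradigm. The approach sharply reduces the need for medical speech data and outperforms existing methods in effectiveness and robustness on a newly built benchmark.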
Details
Motivation: Existing medical consultation systems mostly rely on long-text interaction, which is unnatural and unfriendly to patients; despite progress in speech language models, scarce medical speech data and inefficient direct fine-tuning hinder their use in medical settings.
Method: Proposes SpeechMedAssist, which exploits the architecture of speech language models to split conventional one-stage training into two stages: the first injects knowledge and capabilities through text, and the second realigns modalities with limited speech data, so that only 10k synthesized speech samples are needed.
Result: On a new benchmark with single-turn question answering and multi-turn simulated interactions, SpeechMedAssist outperforms all baselines in most evaluation settings, showing better effectiveness and robustness.
Conclusion: The two-stage paradigm eases the scarcity of medical speech data, lets speech language models be applied to medical consultation under limited resources, and moves toward more natural, convenient speech-based medical interaction.
Abstract: Medical consultations are intrinsically speech-centric. However, most prior works focus on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more natural speech-based interaction, yet the scarcity of medical speech data and the inefficiency of directly fine-tuning on speech data jointly hinder the adoption of SpeechLMs in medical consultation. In this paper, we propose SpeechMedAssist, a SpeechLM natively capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm consisting of (1) Knowledge & Capability Injection via Text and (2) Modality Re-alignment with Limited Speech Data, thereby reducing the requirement for medical speech data to only 10k synthesized samples. To evaluate SpeechLMs for medical consultation scenarios, we design a benchmark comprising both single-turn question answering and multi-turn simulated interactions. Experimental results show that our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.
[46] CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models
Yifan Le,Yunliang Li
Main category: cs.CL
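TL;DR: This paper proposes CRANE, a relevance-based analysis framework for identifying neurons in multilingual LLMs that are functionally essential to a particular language. Compared with conventional activation-based heuristics, it reveals the non-exclusive specialization of language-specific neurons more precisely.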
Details
Motivation: Existing methods identify language-related neurons by activation strength, which conflates language preference with functional importance and gives an inaccurate picture of how multilingual ability is organized at the neuron level.
Method: Proposes CRANE, which uses targeted neuron-level interventions and defines language specificity by a neuron's contribution to language-conditioned predictions (functional necessity) rather than by activation magnitude.
Result: Neurons show asymmetric language selectivity: masking neurons relevant to a language markedly degrades performance on that language while affecting other languages much less. The method is validated on English, Chinese, and Vietnamese across multiple benchmarks, along with a new relevance-based metric and a base-to-chat model transfer analysis.
Conclusion: CRANE isolates language-specific neuron components more precisely than activation-based methods and reveals a selective but non-exclusive neural organization of language ability in multilingual models.
Abstract: Multilingual large language models (LLMs) achieve strong performance across languages, yet how language capabilities are organized at the neuron level remains poorly understood. Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance. We propose CRANE, a relevance-based analysis framework that redefines language specificity in terms of functional necessity, identifying language-specific neurons through targeted neuron-level interventions. CRANE characterizes neuron specialization by their contribution to language-conditioned predictions rather than activation magnitude. Our implementation will be made publicly available. Neuron-level interventions reveal a consistent asymmetric pattern: masking neurons relevant to a target language selectively degrades performance on that language while preserving performance on other languages to a substantial extent, indicating language-selective but non-exclusive neuron specializations. Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than activation-based methods.
[47] ToolGate: Contract-Grounded and Verified Tool Execution for LLMs
Yanming Liu,Xinyue Peng,Jiannan Cao,Xinyi Wang,Songhang Deng,Jintao Chen,Jianwei Yin,Xuhong Zhang
Main category: cs.CL
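TL;DR: This paper presents ToolGate, a forward execution framework that provides logical safety guarantees and verifiable state evolution when large language models (LLMs) call external tools.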
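The abstract for this entry describes each tool as a Hoare-style contract: a precondition gates invocation, and a postcondition decides whether the result may be committed to a typed key-value state. The sketch below illustrates that gating pattern only; the class, field names, and the temperature tool are hypothetical and are not ToolGate's actual interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]


@dataclass
class ToolContract:
    name: str
    precondition: Callable[[State], bool]        # may the tool run now?
    run: Callable[[State], Any]                  # the tool call itself
    postcondition: Callable[[State, Any], bool]  # is the result acceptable?
    commit: Callable[[State, Any], State]        # how a result updates state


def execute(state: State, tool: ToolContract) -> State:
    """Return an updated state only when both contract checks pass."""
    if not tool.precondition(state):
        return state  # gated: the tool is never invoked
    result = tool.run(state)
    if not tool.postcondition(state, result):
        return state  # rejected: the result is not committed
    return tool.commit(state, result)


# Hypothetical tool used purely for illustration.
to_fahrenheit = ToolContract(
    name="to_fahrenheit",
    precondition=lambda s: isinstance(s.get("celsius"), (int, float)),
    run=lambda s: s["celsius"] * 9 / 5 + 32,
    postcondition=lambda s, r: isinstance(r, float) and r >= -459.67,
    commit=lambda s, r: {**s, "fahrenheit": r},
)

state = execute({"celsius": 100}, to_fahrenheit)
```

Because `execute` is the only path that changes the state, every key in it can be traced back to a tool call whose checks passed, which is the property the abstract calls verifiable state evolution.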
Details
Motivation: Existing tool-calling frameworks based on natural language reasoning lack formal guarantees of logical safety and verifiability and are vulnerable to hallucinated and invalid results.
Method: ToolGate maintains an explicit symbolic state space and formalizes each tool as a Hoare-style contract with a precondition and a postcondition, using runtime verification to ensure trusted state evolution.
Result: Experiments show that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while remaining competitive on complex multi-step reasoning tasks.
Conclusion: ToolGate lays a foundation for more trustworthy and debuggable systems that integrate LLMs with external tools.
Abstract: Large Language Models (LLMs) augmented with external tools have demonstrated remarkable capabilities in complex reasoning tasks. However, existing frameworks rely heavily on natural language reasoning to determine when tools can be invoked and whether their results should be committed, lacking formal guarantees for logical safety and verifiability. We present ToolGate, a forward execution framework that provides logical safety guarantees and verifiable state evolution for LLM tool calling. ToolGate maintains an explicit symbolic state space as a typed key-value mapping representing trusted world information throughout the reasoning process. Each tool is formalized as a Hoare-style contract consisting of a precondition and a postcondition, where the precondition gates tool invocation by checking whether the current state satisfies the required conditions, and the postcondition determines whether the tool's result can be committed to update the state through runtime verification. Our approach guarantees that the symbolic state evolves only through verified tool executions, preventing invalid or hallucinated results from corrupting the world representation. Experimental validation demonstrates that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks. This work establishes a foundation for building more trustworthy and debuggable AI systems that integrate language models with external tools.
[48] See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation
Naquee Rizwan,Subhankar Swain,Paramananda Bhaskar,Gagan Aryan,Shehryaar Shah Khan,Animesh Mukherjee
Main category: cs.CL
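TL;DR: This paper proposes a multimodal framework built on generative AI models that detects, explains, and intervenes on hateful memes with limited data, combining the three tasks for the first time to match real-world conditions.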
Details
Motivation: Research on hateful memes usually treats detection, explanation, and intervention separately and depends on large annotated datasets, which fails to reflect real-world use and is costly.
Method: Builds a unified framework for detection, explanation, and intervention using task-specific generative multimodal agents and the few-shot adaptability of large multimodal models.
Result: Achieves generalizable hateful meme identification and content intervention under limited data, with potential for practical deployment.
Conclusion: The framework is presented as the first general hateful meme moderation approach to cover detection, explanation, and intervention together; it performs well in low-resource conditions and suits real-world use.
Abstract: In this work, we examine hateful memes from three complementary angles - how to detect them, how to explain their content and how to intervene on them before they are posted - by applying a range of strategies built on top of generative AI models. To the best of our knowledge, explanation and intervention have typically been studied separately from detection, which does not reflect real-world conditions. Further, since curating large annotated datasets for meme moderation is prohibitively expensive, we propose a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to cater to different types of memes. We believe this is the first work focused on generalizable hateful meme moderation under limited data conditions, and has strong potential for deployment in real-world production scenarios. Warning: Contains potentially toxic content.
[49] Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding
Sungmok Jung,Yeonkyoung So,Joonhak Lee,Sangho Kim,Yelim Ahn,Jaejin Lee
Main category: cs.CL
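TL;DR: This paper introduces Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena, and evaluates 47 LLMs on it, showing that fine-tuning improves understanding of negation and context.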
Details
Motivation: Negation is challenging for LLMs, and benchmarks for negation understanding in Korean are scarce, so an evaluation tool closer to actual language use is needed.
Method: Analyzes Korean negation through corpus study, builds the sentence-level benchmark Thunder-KoNUBench, and evaluates 47 LLMs to assess the effects of model size and instruction tuning.
Result: LLM performance drops under negation; model size and instruction tuning help negation understanding; fine-tuning on Thunder-KoNUBench improves understanding of negation and of broader context.
Conclusion: Thunder-KoNUBench is an effective benchmark for Korean negation understanding, and fine-tuning on it markedly improves performance on negated sentences as well as overall contextual comprehension.
Abstract: Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.
[50] PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards
Mukesh Ghimire,Aosong Feng,Liwen You,Youzhi Luo,Fang Liu,Xuan Zhu
Main category: cs.CL
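TL;DR: This paper proposes PRISM, a unified training framework that combines a Process Reward Model (PRM) with the model's own confidence to post-train LLMs effectively without ground-truth labels, improving training stability and reasoning performance.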
Details
Motivation: Existing post-training methods depend on human annotation or external verifiers and are hard to scale, while relying only on internal consistency signals (such as entropy or self-certainty) is unreliable in long, large-scale training; a stable self-supervised method that needs no human intervention is required.
Method: Proposes PRISM, which combines a Process Reward Model (PRM) with the model's internal self-certainty as the learning signal; the PRM supplies more reliable intermediate feedback to guide training on unlabeled data.
Result: PRISM yields more stable training, better test-time performance, and better-calibrated internal confidence.
Conclusion: Combining an external process reward with internal confidence is a viable and efficient paradigm for unsupervised post-training and opens a new direction for continued self-improvement of LLMs on complex tasks such as mathematical reasoning and code generation.
Abstract: Current techniques for post-training Large Language Models (LLMs) rely either on costly human supervision or on external verifiers to boost performance on tasks such as mathematical reasoning and code generation. However, as LLMs improve their problem-solving, any further improvement will potentially require high-quality solutions to difficult problems that are not available to humans. As a result, learning from unlabeled data is becoming increasingly attractive in the research community. Existing methods extract learning signal from a model's consistency, either by majority voting or by converting the model's internal confidence into reward. Although internal consistency metrics such as entropy or self-certainty require no human intervention, as we show in this work, these are unreliable signals for large-scale and long-term training. To address the unreliability, we propose PRISM, a unified training framework that uses a Process Reward Model (PRM) to guide learning alongside the model's internal confidence in the absence of ground-truth labels. We show that effectively combining PRM with self-certainty can lead to both stable training and better test-time performance, and also keep the model's internal confidence in check.
[51] Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning
Feihu Jin,Shipeng Cen,Ying Tan
Main category: cs.CL
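TL;DR: Proposes a prior-informed zeroth-order optimization method that guides perturbation directions to improve gradient estimates, substantially speeding up LLM fine-tuning.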
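For context, the baseline this entry improves on can be sketched in a few lines: a zeroth-order estimator replaces backpropagation with forward evaluations along random Gaussian directions. The code below shows that standard random-perturbation estimator on a toy objective; it does not reproduce the paper's prior-informed guiding vector, whose exact construction is not given in this listing.

```python
import random


def zo_directional_slope(f, x, z, eps=1e-3):
    """Central finite difference of f along direction z (two forward passes)."""
    plus = [xi + eps * zi for xi, zi in zip(x, z)]
    minus = [xi - eps * zi for xi, zi in zip(x, z)]
    return (f(plus) - f(minus)) / (2 * eps)


def zo_gradient(f, x, num_samples=8, eps=1e-3, rng=None):
    """Average slope * direction over Gaussian directions; no backpropagation."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(x)
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in x]
        slope = zo_directional_slope(f, x, z, eps)
        grad = [g + slope * zi / num_samples for g, zi in zip(grad, z)]
    return grad


def quadratic(v):
    """Toy objective with true gradient 2 * v."""
    return sum(t * t for t in v)


estimate = zo_gradient(quadratic, [1.0, -2.0], num_samples=16)
```

With few samples the estimate points roughly, but noisily, along the true gradient. That variance is the weakness the entry attributes to random perturbations, and it motivates steering directions with prior information.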
Details
Motivation: Conventional zeroth-order optimization relies on random perturbations, which gives high-variance gradient estimates and slow convergence, making efficient LLM fine-tuning difficult.
Method: Dynamically computes a guiding vector from Gaussian samples, refines gradient estimation with prior-informed perturbations, and also explores a greedy perturbation strategy to strengthen directionality.
Result: Theory shows the proposed gradient estimator aligns better with the true gradient direction; experiments show faster convergence and better performance across LLMs of different scales and architectures. On OPT-13B the method beats traditional ZO optimization on all 11 tasks and gradient-based methods on 9 of them.
Conclusion: The method keeps memory overhead low while improving optimization efficiency, enabling efficient and accurate LLM fine-tuning with good generality.
Abstract: Fine-tuning large language models (LLMs) has achieved remarkable success across various NLP tasks, but the substantial memory overhead during backpropagation remains a critical bottleneck, especially as model scales grow. Zeroth-order (ZO) optimization alleviates this issue by estimating gradients through forward passes and Gaussian sampling, avoiding the need for backpropagation. However, conventional ZO methods suffer from high variance in gradient estimation due to their reliance on random perturbations, leading to slow convergence and suboptimal performance. We propose a simple plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation. Our method dynamically computes a guiding vector from Gaussian samples, which directs perturbations toward more informative directions, significantly accelerating convergence compared to standard ZO approaches. We further investigate a greedy perturbation strategy to explore the impact of prior knowledge on gradient estimation. Theoretically, we prove that our gradient estimator achieves stronger alignment with the true gradient direction, enhancing optimization efficiency. Extensive experiments across LLMs of varying scales and architectures demonstrate that our proposed method can be seamlessly integrated into existing optimization methods, delivering faster convergence and superior performance. Notably, on the OPT-13B model, our method outperforms traditional ZO optimization across all 11 benchmark tasks and surpasses gradient-based baselines on 9 out of 11 tasks, establishing a robust balance between efficiency and accuracy.
[52] DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs
Anh Thi-Hoang Nguyen,Khanh Quoc Tran,Tin Van Huynh,Phuoc Tan-Hoang Nguyen,Cam Tan Nguyen,Kiet Van Nguyen
Main category: cs.CL
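TL;DR: This paper describes the DSC2025 ViHallu Challenge, the first large-scale shared task on hallucination detection for Vietnamese LLMs. It presents the ViHallu dataset of 10,000 annotated samples, evaluates a range of detection methods, and advances research on the reliability of Vietnamese AI systems.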
Details
Motivation: Low-to-medium resource languages such as Vietnamese lack standardized evaluation frameworks for hallucination detection, which limits reliability in practical use.
Method: Builds the ViHallu dataset of 10,000 (context, prompt, response) triplets in three categories (no hallucination, intrinsic hallucination, extrinsic hallucination), with three prompt types (factual, noisy, adversarial) to test robustness. The DSC2025 ViHallu Challenge drew 111 teams, whose systems used instruction-tuned LLMs with structured prompting and ensemble strategies.
Result: The best system reached a macro-F1 of 84.80%, far above the 32.83% of the encoder-only baseline, but intrinsic hallucinations remain hard to detect.
Conclusion: The work establishes a rigorous benchmark for Vietnamese hallucination detection and confirms the value of advanced methods, while showing that hallucination detection, especially of intrinsic hallucinations, remains an open problem; it provides a basis for future research on trustworthy Vietnamese AI.
Abstract: The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations -- fluent, plausible-sounding outputs that contradict or fabricate information. While hallucination detection has recently emerged as a priority in English-centric benchmarks, low-to-medium resource languages such as Vietnamese remain inadequately covered by standardized evaluation frameworks. This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese LLMs. We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples systematically partitioned into three hallucination categories: no hallucination, intrinsic, and extrinsic hallucinations. The dataset incorporates three prompt types -- factual, noisy, and adversarial -- to stress-test model robustness. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%, demonstrating that instruction-tuned LLMs with structured prompting and ensemble strategies substantially outperform generic architectures. However, the gap to perfect performance indicates that hallucination detection remains a challenging problem, particularly for intrinsic (contradiction-based) hallucinations.
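Macro-F1, the metric quoted above, is the unweighted mean of per-class F1 scores, so each of the three hallucination categories counts equally regardless of how many samples it has. A minimal sketch follows; the label strings and the four toy predictions are made up for illustration and are not drawn from ViHallu.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


labels = ["no", "intrinsic", "extrinsic"]
y_true = ["no", "no", "intrinsic", "extrinsic"]
y_pred = ["no", "intrinsic", "intrinsic", "extrinsic"]
score = macro_f1(y_true, y_pred, labels)
```

A detector that does well only on the most common class is penalized under this averaging, which fits the abstract's remark that intrinsic hallucinations remain the hard case.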
This work establishes a rigorous benchmark and explores a diverse range of detection methodologies, providing a foundation for future research into the trustworthiness and reliability of Vietnamese language AI systems.
[53] Fame Fades, Nature Remains: Disentangling the Character Identity of Role-Playing Agents
Yonghyun Jun,Junhyuk Choi,Jihyeong Park,Hwanhee Lee
Main category: cs.CL
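TL;DR: This paper proposes the multidimensional concept of Character Identity, splits it into Parametric Identity and Attributive Identity, and studies both layers with a unified character profile schema. In conversation, the advantage of famous characters fades quickly while personality traits stay stable, but the valence of morality and interpersonal relationships strongly affects role-playing agent performance.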
Details
Motivation: LLM-based role-playing agents are spreading quickly, yet the structural dimensions of character identity lack a formal definition and characters are usually treated as arbitrary text input; a more systematic framework is needed to understand and evaluate the different aspects of character identity.
Method: Proposes Character Identity, a multidimensional construct with Parametric Identity (character-specific knowledge from LLM pre-training) and Attributive Identity (behavioral properties such as personality traits and moral values). A unified character profile schema is built, Famous and Synthetic characters are generated under identical structural constraints, and evaluation covers single-turn and multi-turn interactions.
Result: Two phenomena emerge. "Fame Fades": famous characters have a clear advantage in initial turns thanks to parametric knowledge, but it shrinks quickly as the model prioritizes accumulated conversational context. "Nature Remains": models portray general personality traits robustly but are highly sensitive to the valence of morality and interpersonal relationships, and negative social natures become the main bottleneck for fidelity.
Conclusion: Negative social natures are the main limit on current role-playing agent performance, which gives direction for future character construction and evaluation.
Abstract: Despite the rapid proliferation of Role-Playing Agents (RPAs) based on Large Language Models (LLMs), the structural dimensions defining a character's identity remain weakly formalized, often treating characters as arbitrary text inputs. In this paper, we propose the concept of Character Identity, a multidimensional construct that disentangles a character into two distinct layers: (1) Parametric Identity, referring to character-specific knowledge encoded from the LLM's pre-training, and (2) Attributive Identity, capturing fine-grained behavioral properties such as personality traits and moral values. To systematically investigate these layers, we construct a unified character profile schema and generate both Famous and Synthetic characters under identical structural constraints. Our evaluation across single-turn and multi-turn interactions reveals two critical phenomena. First, we identify "Fame Fades": while famous characters hold a significant advantage in initial turns due to parametric knowledge, this edge rapidly vanishes as models prioritize accumulating conversational context over pre-trained priors. Second, we find that "Nature Remains": while models robustly portray general personality traits regardless of polarity, RPA performance is highly sensitive to the valence of morality and interpersonal relationships.
Our findings pinpoint negative social natures as the primary bottleneck in RPA fidelity, guiding future character construction and evaluation.
[54] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Mingxin Li,Yanzhao Zhang,Dingkun Long,Keqin Chen,Sibo Song,Shuai Bai,Zhibo Yang,Pengjun Xie,An Yang,Dayiheng Liu,Jingren Zhou,Junyang Lin
Main category: cs.CL
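TL;DR: This report introduces the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, built on the Qwen3-VL foundation model. They map text, images, document images, and video into a unified representation, and feature multi-stage training, Matryoshka Representation Learning, long inputs (32k tokens), and multilingual support, reaching state-of-the-art results on benchmarks such as MMEB-V2.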
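Matryoshka Representation Learning trains embeddings so that a prefix of the vector is itself a usable embedding, which is what makes flexible dimensions possible. The sketch below shows the usual consumer-side step, assuming the common truncate-then-renormalize convention; the vectors are toy values, and the report's own API is not described in this listing.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit L2 norm."""
    head = vec[:dim]
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head] if norm else head


def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))


# Toy 3-dimensional "embeddings" truncated to 2 dimensions.
query = truncate_embedding([3.0, 4.0, 12.0], 2)
document = truncate_embedding([4.0, 3.0, -1.0], 2)
similarity = cosine(query, document)
```

Shorter prefixes cut index size and search cost at some loss in accuracy, so the dimension can be chosen per deployment without retraining the model.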
Details
Motivation: High-precision multimodal search requires mapping multiple modalities into a unified representation space while meeting the flexibility and efficiency demands of real deployments.
Method: Uses a multi-stage training paradigm that runs from large-scale contrastive pre-training to reranking model distillation; applies Matryoshka Representation Learning to support variable embedding dimensions; Qwen3-VL-Reranker uses a cross-encoder architecture with cross-attention for fine-grained relevance estimation.
Result: Qwen3-VL-Embedding-8B scores 77.8 overall on MMEB-V2, ranking first (as of January 8, 2025); the models support more than 30 languages and come in 2B and 8B parameter sizes for varied deployment scenarios.
Conclusion: The model series performs well on multimodal retrieval tasks such as image-text retrieval, visual question answering, and video-text matching, forming an efficient end-to-end multimodal search pipeline.
Abstract: In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025).
This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
[55] Automatic Classifiers Underdetect Emotions Expressed by Men
Ivan Smirnov,Segun T. Aroyehun,Paul Plener,David Garcia
Main category: cs.CL
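TL;DR: Using more than one million self-annotated posts, this study systematically evaluates gender bias in emotion detection across 414 combinations of models and emotion-related classes. Error rates are consistently higher for texts authored by men than for those authored by women, which indicates that existing sentiment analysis tools perform unevenly across gender groups and should be applied with caution.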
Details
Motivation: To ensure that sentiment and emotion classifiers are reliable across different populations, and to expose systematic gender biases that benchmarks based on third-party annotation may conceal. Method: A large-scale self-annotated dataset and a pre-registered research design are used to evaluate gender differences in error rates across 414 combinations of models and emotion-related classes. Result: Across types of automatic classifiers and emotions, error rates are consistently higher for texts authored by men than for those authored by women; this bias can affect results in downstream applications. Conclusion: Sentiment analysis is not yet a solved problem with respect to fairness, especially equitable performance across demographic groups; existing tools, including large language models, should be applied with caution when the gender composition of a sample is unknown or variable. Abstract: The widespread adoption of automatic sentiment and emotion classifiers makes it important to ensure that these tools perform reliably across different populations. Yet their reliability is typically assessed using benchmarks that rely on third-party annotators rather than the individuals experiencing the emotions themselves, potentially concealing systematic biases. In this paper, we use a unique, large-scale dataset of more than one million self-annotated posts and a pre-registered research design to investigate gender biases in emotion detection across 414 combinations of models and emotion-related classes. We find that across different types of automatic classifiers and various underlying emotions, error rates are consistently higher for texts authored by men compared to those authored by women. We quantify how this bias could affect results in downstream applications and show that current machine learning tools, including large language models, should be applied with caution when the gender composition of a sample is not known or variable. Our findings demonstrate that sentiment analysis is not yet a solved problem, especially in ensuring equitable model behaviour across demographic groups.
[56] AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs
Han Zhu,Jiale Chen,Chengkun Cai,Shengjie Sun,Haoran Li,Yujin Zhou,Chi-Min Chan,Pengcheng Wen,Lei Li,Sirui Han,Yike Guo
Main category: cs.CL
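TL;DR: This paper presents the InterSafe-V dataset and the AM$^3$Safety framework for improving the safety of multi-modal large language models in multi-turn dialogue, lowering the attack success rate while improving harmlessness and helpfulness.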
Details
Motivation: Existing RLHF methods mainly target single-turn visual question answering, struggle to handle safety risks that accumulate gradually over multi-turn multi-modal dialogue, and rely on costly manual annotation, which limits their scalability in real dialogue settings. Method: Builds InterSafe-V, an open-source multi-modal dialogue dataset with 11,270 dialogues and 500 specially designed refusal samples, and proposes the AM$^3$Safety framework, which combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards. Result: Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show a decrease of more than 10% in Attack Success Rate (ASR), an improvement of at least 8% in the harmless dimension, and an improvement of more than 13% in the helpful dimension, while general abilities are preserved. Conclusion: Combined with the InterSafe-V dataset, the AM$^3$Safety framework effectively improves the safety alignment of multi-modal large language models in multi-turn dialogue and shows good prospects for practical use. Abstract: Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answer (VQA) tasks and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show more than 10% decrease in Attack Success Rate (ASR) together with an increment of at least 8% in harmless dimension and over 13% in helpful dimension of MLLMs on multi-modal multi-turn safety benchmarks, while preserving their general abilities.
[57] RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation
Huawei Zheng,Xinqi Jiang,Sen Yang,Shouling Ji,Yingcai Wu,Dazhen Deng
Main category: cs.CL
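TL;DR: Proposes an end-to-end framework that combines knowledge-graph guidance with dual-path obfuscation rewriting to generate harmful-prompt datasets that are both domain-relevant and implicit, in support of safety research on large language models in specialized domains.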
Details
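Motivation: In specialized domains such as finance and healthcare, large language models face distinctive safety risks. Existing public datasets focus mainly on explicit harmful prompts and do not reflect the subtler threats seen in practice; implicit harmful prompts, which draw on indirect domain knowledge, are harder to detect and pose a more realistic threat, yet such data is scarce and depends on manual construction. Method: An end-to-end framework that first generates domain-relevant harmful prompts under knowledge-graph guidance, systematically turning domain knowledge into actionable constraints, and then applies dual-path obfuscation rewriting (direct rewriting and context-enhanced rewriting) to convert explicit harmful prompts into more implicit variants. Result: The framework produces high-quality harmful-prompt datasets that combine strong domain relevance with a high degree of implicitness, are markedly better at bypassing the defenses of modern large language models, and support more realistic red-teaming. Conclusion: The framework addresses the two challenges of transforming domain knowledge and increasing prompt implicitness, providing more realistic and reliable data for safety evaluation of large language models in specialized domains and advancing LLM safety research. Abstract: Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and still largely rely on manual construction; public datasets mainly focus on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts, expressed through indirect domain knowledge, are harder to detect and better reflect real-world threats. We identify two challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies dual-path obfuscation rewriting to convert explicit harmful prompts into implicit variants via direct and context-enhanced rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. We release our code and datasets at GitHub.
[58] Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval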
Seyeon Jeong,Yeonjun Choi,JongWook Kim,Beakcheol Jang
Main category: cs.CL
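TL;DR: Proposes Tool-MAD, a multi-agent debate framework that strengthens fact verification by assigning heterogeneous external tools to the agents; it outperforms existing methods on four benchmarks, with accuracy gains of up to 5.5%.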
Details
Motivation: Existing MAD systems rely on internal knowledge or static documents and are prone to hallucination; MADKE's one-time retrieval mechanism adapts poorly to information that emerges during the debate, so a more flexible fact verification mechanism is needed. Method: Builds the Tool-MAD framework, in which each LLM agent is equipped with a different external tool (such as a search API or RAG), an adaptive query formulation mechanism iteratively refines evidence retrieval, and Faithfulness and Answer Relevance scores support the Judge agent's decision. Result: Tool-MAD outperforms existing MAD methods on four fact verification benchmarks, with accuracy gains of up to 5.5%, and shows strong robustness and adaptability in the medical domain. Conclusion: Through tool heterogeneity, adaptive retrieval, and quantitative assessment, Tool-MAD reduces hallucination and improves accuracy and reliability on complex reasoning and fact verification tasks, with broad potential for practical use. Abstract: Large Language Models (LLMs) suffer from hallucinations and factual inaccuracies, especially in complex reasoning and fact verification tasks. Multi-Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue, promoting diverse reasoning and mutual verification. However, existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations. While MADKE introduces external evidence to mitigate this, its one-time retrieval mechanism limits adaptability to new arguments or emerging information during the debate. To address these limitations, we propose Tool-MAD, a multi-agent debate framework that enhances factual verification by assigning each agent a distinct external tool, such as a search API or RAG module. Tool-MAD introduces three key innovations: (1) a multi-agent debate framework where agents leverage heterogeneous external tools, encouraging diverse perspectives, (2) an adaptive query formulation mechanism that iteratively refines evidence retrieval based on the flow of the debate, and (3) the integration of Faithfulness and Answer Relevance scores into the final decision process, allowing the Judge agent to quantitatively assess the coherence and question alignment of each response and effectively detect hallucinations. Experimental results on four fact verification benchmarks demonstrate that Tool-MAD consistently outperforms state-of-the-art MAD frameworks, achieving up to 5.5% accuracy improvement. Furthermore, in medically specialized domains, Tool-MAD exhibits strong robustness and adaptability across various tool configurations and domain conditions, confirming its potential for broader real-world fact-checking applications.
[59] PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks
Yehoon Jang,Chaewon Lee,Hyun-seok Min,Sungchul Choi
Main category: cs.CL
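TL;DR: This paper introduces PILOT-Bench, the first benchmark centered on the US Patent Trial and Appeal Board (PTAB), for systematically evaluating the legal reasoning of large language models in the patent domain. It aligns PTAB decisions with patent data, defines three IRAC-aligned classification tasks, and evaluates a range of LLMs, finding that closed-source models clearly outperform open-source ones; all resources are publicly released.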
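For readers unfamiliar with the Micro-F1 metric quoted for the Issue Type task, the following is a minimal sketch of how it is commonly computed. It assumes a multi-label setup in which each case carries a set of labels; the label names `A`, `B`, `C` are hypothetical placeholders, and nothing here is taken from the PILOT-Bench code.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all cases."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in gold
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted label sets for three cases.
gold = [{"A"}, {"B", "C"}, {"A"}]
pred = [{"A"}, {"B"}, {"C"}]
score = micro_f1(gold, pred)
```

Because counts are pooled before averaging, frequent labels dominate the score, which is worth keeping in mind when comparing models on imbalanced label sets.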
Details
Motivation: Large language models in patent and legal practice are currently limited to lightweight tasks, and there is no established way to systematically evaluate their capacity for legal reasoning in the patent domain, so a dedicated benchmark is needed to measure and improve structured legal reasoning. Method: Introduces the PILOT-Bench benchmark, which aligns PTAB decisions with USPTO patent data at the case level and defines three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. Multiple closed-source and open-source LLMs are evaluated and analyzed from several perspectives, including input variation, model family, and error patterns. Result: On the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1, whereas the strongest open-source model, Qwen-8B, reaches only about 0.56, showing a substantial gap in reasoning capability. Conclusion: PILOT-Bench provides a foundation for evaluating legal reasoning in the patent domain, reveals the performance differences among current LLMs on this task, and points to future improvements through dataset design and model alignment. Abstract: The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case-level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1 score, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot-bench.
[60] Differential syntactic and semantic encoding in LLMs
Santiago Acevedo,Alessandro Laio,Marco Baroni
Main category: cs.CL
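TL;DR: Averaging the hidden-representation vectors of sentences that share syntactic structure or meaning yields "centroid" vectors that capture a significant share of the syntactic and semantic information. This suggests that syntax and semantics are at least partially linearly encoded in LLM representations, with different encoding profiles across layers, and that the two can be partly decoupled.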
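As a rough illustration of the centroid-subtraction idea summarized above, the sketch below averages toy "hidden state" vectors that share a structure, subtracts the resulting centroid, and compares cosine similarity before and after. The three-dimensional vectors are invented for illustration and the helper names are hypothetical; the paper works with real DeepSeek-V3 representations, which are not reproduced here.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def subtract(v, c):
    """Remove the centroid component from a vector."""
    return [a - b for a, b in zip(v, c)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: a large shared "syntax" component in dimension 0,
# plus small sentence-specific components in the other dimensions.
syntax_group = [
    [5.0, 1.0, 0.0],
    [5.0, 0.0, 1.0],
    [5.0, -1.0, 0.0],
    [5.0, 0.0, -1.0],
]
c = centroid(syntax_group)

before = cosine(syntax_group[0], syntax_group[1])
after = cosine(subtract(syntax_group[0], c), subtract(syntax_group[1], c))
```

In this toy setup the two sentences look nearly identical before subtraction and unrelated after it, which is the kind of similarity drop the summary describes for matched sentences.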
Details
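Motivation: To investigate how the inner layers of large language models (LLMs) encode syntactic and semantic information, focusing on the language representations of the very large DeepSeek-V3. Method: Syntactic and semantic "centroids" are built by averaging the hidden-representation vectors of sentences that share syntactic structure or meaning; the effect of subtracting these centroids on sentence similarity is analyzed, along with differences in encoding profiles across layers. Result: A significant proportion of syntactic and semantic information can be captured from the hidden representations; subtracting the corresponding centroid strongly affects similarity between syntactically or semantically matched sentences; the cross-layer encoding profiles of syntax and semantics differ, and the two signals can be partly decoupled. Conclusion: In LLM internal representations, syntactic and semantic information is at least partially linearly and separately encoded, with distinct encoding profiles across network layers. Abstract: We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic "centroids" from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.
[61] Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence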
Shengyin Sun,Yiming Li,Renxi Liu,Weizhe Lin,Hui-Ling Zhen,Xianzhi Yu,Mingxuan Yuan,Chen Ma
Main category: cs.CL
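TL;DR: This paper proposes a training-free verification mechanism based on KL divergence for accelerating LLM inference, proves a theoretical connection between it and learned linear judges, and reports performance on multiple benchmarks that matches or exceeds complex trained judges.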
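To make the idea of KL-based verification concrete, here is a minimal sketch of one way such a check could look. The acceptance rule, the direction of the divergence (target relative to draft), and the threshold value are all assumptions chosen for illustration; the summary does not specify the paper's exact criterion, and `accept_draft_token` is a hypothetical helper name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def accept_draft_token(target_logits, draft_logits, threshold=0.1):
    """Assumed rule: accept the draft token when the two distributions are close."""
    p = softmax(target_logits)  # target model distribution at this position
    q = softmax(draft_logits)   # draft model distribution at this position
    return kl_divergence(p, q) <= threshold


# Identical distributions have zero divergence, so the token is accepted.
same = accept_draft_token([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
# Strongly disagreeing distributions exceed the threshold and are rejected.
different = accept_draft_token([5.0, 0.0, 0.0], [0.0, 5.0, 0.0])
```

No judge model is trained in this sketch: the decision uses only the logits both models already produce, which is the property the summary emphasizes.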
Details
Motivation: Existing Judge Decoding relies on expensive and noisy supervision to train the judge, which limits its efficiency and generalization, so a training-free and robust alternative is needed. Method: A theoretical analysis shows that the learned judge scores are intrinsically encoded in the KL divergence between the draft and target distributions, and KL divergence is then used directly as a training-free verification mechanism. Result: On multiple reasoning and coding benchmarks, the method matches or outperforms complex trained judges (such as AutoJudge) and is more robust to domain shifts. Conclusion: The KL-divergence-based, training-free verification mechanism is effective, removes the dependence on supervision, and offers a simpler and more efficient solution for Speculative Decoding. Abstract: Judge Decoding accelerates LLM inference by relaxing the strict verification of Speculative Decoding, yet it typically relies on expensive and noisy supervision. In this work, we revisit this paradigm from first principles, revealing that the "criticality" scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence. We theoretically prove a structural correspondence between learned linear judges and Kullback-Leibler (KL) divergence, demonstrating they rely on the same underlying logit primitives. Guided by this, we propose a simple, training-free verification mechanism based on KL divergence. Extensive experiments across reasoning and coding benchmarks show that our method matches or outperforms complex trained judges (e.g., AutoJudge), offering superior robustness to domain shifts and eliminating the supervision bottleneck entirely.
[62] LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal
Dongjun Kim,Jeongho Yoon,Chanjun Park,Heuiseok Lim
Main category: cs.CL
TL;DR: Proposes LANGSAE EDITING, which improves multilingual dense retrieval by controllably removing the language-identity signal in vector space.
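The following toy sketch illustrates the general shape of such an edit: encode an embedding into latent units, flag units whose activation differs strongly across languages, zero them, and decode back to the original dimensionality. The identity encoder and decoder weights, the `min_gap` criterion, and the sample embeddings are all made up for illustration; a real sparse autoencoder would be trained, and the paper's actual selection statistic is not given in the summary.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


# Toy "autoencoder": identity weights stand in for trained parameters (assumption).
W_ENC = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W_DEC = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]


def encode(x):
    return relu(matvec(W_ENC, x))


def language_units(embeddings_by_lang, min_gap=0.5):
    """Flag latent units whose mean activation differs across languages by more than min_gap."""
    means = {}
    for lang, rows in embeddings_by_lang.items():
        acts = [encode(x) for x in rows]
        means[lang] = [sum(col) / len(acts) for col in zip(*acts)]
    n_units = len(next(iter(means.values())))
    flagged = []
    for j in range(n_units):
        values = [m[j] for m in means.values()]
        if max(values) - min(values) > min_gap:
            flagged.append(j)
    return flagged


def edit_embedding(x, units):
    """Suppress the flagged units, then reconstruct in the original dimensionality."""
    z = encode(x)
    for j in units:
        z[j] = 0.0
    return matvec(W_DEC, z)


# Hypothetical pooled embeddings where dimension 2 mostly tracks the language.
data = {
    "en": [[0.2, 0.4, 1.0], [0.3, 0.1, 0.9]],
    "de": [[0.2, 0.3, 0.0], [0.4, 0.2, 0.1]],
}
units = language_units(data)
edited = edit_embedding([0.2, 0.4, 1.0], units)
```

Since the output keeps the input dimensionality, edited vectors could be written back to an existing vector index without re-encoding the source text, which matches the compatibility claim in the summary.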
Details
Motivation: Multilingual embeddings carry language-identity information, which inflates similarity for same-language document pairs and hurts cross-language retrieval. Method: Trains a post-hoc sparse autoencoder (LANGSAE EDITING) that uses cross-language activation statistics to identify and suppress language-associated latent units, then reconstructs the embedding in its original dimensionality. Result: Experiments across multiple languages show improved ranking quality and cross-language coverage, with especially strong gains for languages with distinct scripts. Conclusion: LANGSAE EDITING effectively disentangles language identity from semantic information and is compatible with existing vector databases, with no need to retrain the encoder or re-encode text. Abstract: Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with especially strong gains for script-distinct languages.
[63] NC2C: Automated Convexification of Generic Non-Convex Optimization Problems
Xinyue Peng,Yanming Liu,Yihan Cang,Yuwei Zhang,Xinyi Wang,Songhang Deng,Jiannan Cao
Main category: cs.CL
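TL;DR: Proposes NC2C, an LLM-based end-to-end automated framework that transforms non-convex optimization problems into solvable convex forms. On 100 problems it achieves an 89.3% execution rate and a 76% success rate, well ahead of baseline methods.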
Details
Motivation: Solving non-convex optimization problems traditionally depends on expert knowledge for convexification, which is inefficient and hard to scale, so automated methods are needed to reduce reliance on manual intervention. Method: NC2C uses the mathematical reasoning capabilities of large language models to automatically detect non-convex components and select optimal convexification strategies, and generates valid convex equivalents through symbolic reasoning, adaptive transformation, iterative validation, and error correction mechanisms. Result: Experiments on 100 generic non-convex problems show that NC2C achieves an 89.3% execution rate and a 76% success rate, well ahead of existing baseline methods. Conclusion: NC2C automates the conversion of non-convex optimization problems to convex ones, reduces dependence on expert knowledge, and enables efficient use of convex solvers on complex optimization tasks. Abstract: Non-convex optimization problems are pervasive across mathematical programming, engineering design, and scientific computing, often posing intractable challenges for traditional solvers due to their complex objective functions and constrained landscapes. To address the inefficiency of manual convexification and the over-reliance on expert knowledge, we propose NC2C, an LLM-based end-to-end automated framework designed to transform generic non-convex optimization problems into solvable convex forms using large language models. NC2C leverages LLMs' mathematical reasoning capabilities to autonomously detect non-convex components, select optimal convexification strategies, and generate rigorous convex equivalents. The framework integrates symbolic reasoning, adaptive transformation techniques, and iterative validation, equipped with error correction loops and feasibility domain correction mechanisms to ensure the robustness and validity of transformed problems. Experimental results on a diverse dataset of 100 generic non-convex problems demonstrate that NC2C achieves an 89.3% execution rate and a 76% success rate in producing feasible, high-quality convex transformations. This outperforms baseline methods by a significant margin, highlighting NC2C's ability to leverage LLMs for automated non-convex to convex transformation, reduce expert dependency, and enable efficient deployment of convex solvers for previously intractable optimization tasks.
[64] Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework
Junhyuk Choi,Jeongyoun Kwon,Heeju Kim,Haeun Cho,Hayeong Jung,Sehee Min,Bugeun Kim
Main category: cs.CL
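TL;DR: This paper presents the first systematic analysis of role-based authority bias in free-form multi-agent evaluation. Expert and Referent authority roles are more influential than Legitimate ones, and authority bias arises because authoritative agents hold their positions, not because general agents actively conform.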
Details
Motivation: To examine the authority bias introduced by authoritative roles in LLM-based multi-agent systems and its effect on agent interactions, filling a gap in this area. Method: Following French and Raven's theory of power, authoritative roles are classified as legitimate, referent, and expert; 12-turn conversation experiments are run in the ChatEval framework, using GPT-4o and DeepSeek R1 to analyze the influence of each authority type. Result: Expert and Referent roles exert markedly stronger influence than Legitimate roles; authority bias does not arise from active conformity by general agents but from authoritative roles consistently maintaining their positions while general agents show more flexibility; clear position statements are essential for influence, and neutral responses do not generate bias. Conclusion: The design of authoritative roles should attend to the type of power and the clarity of position statements in order to shape asymmetric interaction patterns in multi-agent systems. Abstract: Multi-agent systems utilizing large language models often assign authoritative roles to improve performance, yet the impact of authority bias on agent interactions remains underexplored. We present the first systematic analysis of role-based authority bias in free-form multi-agent evaluation using ChatEval. Applying French and Raven's power-based theory, we classify authoritative roles into legitimate, referent, and expert types and analyze their influence across 12-turn conversations. Experiments with GPT-4o and DeepSeek R1 reveal that Expert and Referent power roles exert stronger influence than Legitimate power roles. Crucially, authority bias emerges not through active conformity by general agents, but through authoritative roles consistently maintaining their positions while general agents demonstrate flexibility. Furthermore, authority influence requires clear position statements, as neutral responses fail to generate bias. These findings provide key insights for designing multi-agent frameworks with asymmetric interaction patterns.
[65] When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection
Ke Sun,Guangsheng Bao,Han Cui,Yue Zhang
Main category: cs.CL
TL;DR: Proposes new features based on late-stage generation stability for zero-shot detection of AI-generated text.
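A minimal sketch of what late-stage features of this kind might look like is given below. The summary names the two features (Derivative Dispersion and Local Volatility) but does not define them, so the formulas here are assumptions: the standard deviation of first differences, and the mean of sliding-window standard deviations, both computed on the second half of a token log-probability sequence. The sample sequences are invented.

```python
import statistics


def late_stage(logprobs, start_frac=0.5):
    """Keep only the later part of the sequence (second half by default)."""
    return logprobs[int(len(logprobs) * start_frac):]


def derivative_dispersion(logprobs):
    """Assumed definition: std of first differences over the late stage."""
    tail = late_stage(logprobs)
    diffs = [b - a for a, b in zip(tail, tail[1:])]
    return statistics.pstdev(diffs)


def local_volatility(logprobs, window=3):
    """Assumed definition: mean sliding-window std over the late stage."""
    tail = late_stage(logprobs)
    stds = [
        statistics.pstdev(tail[i:i + window])
        for i in range(len(tail) - window + 1)
    ]
    return statistics.mean(stds)


# Invented token log-probabilities: the first sequence settles down late,
# the second keeps fluctuating throughout.
ai_like = [-3.0, -1.0, -2.5, -0.5, -1.0, -1.1, -1.0, -1.1]
human_like = [-3.0, -1.0, -2.5, -0.5, -3.0, -0.5, -2.8, -0.4]

ai_dd = derivative_dispersion(ai_like)
human_dd = derivative_dispersion(human_like)
ai_lv = local_volatility(ai_like)
human_lv = local_volatility(human_like)
```

Both features need only the per-token log-probabilities of a single scoring pass, which is consistent with the claim that no perturbation sampling is required.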
Details
Motivation: Existing methods overlook the temporal dynamics of autoregressive generation and therefore fail to capture how AI and human writing differ over the course of generation. Method: An analysis of more than 120k text samples shows that AI-generated text has lower log-probability volatility in the late stage of generation; two simple features computed only from late-stage statistics are proposed: Derivative Dispersion and Local Volatility. Result: The method achieves state-of-the-art performance on the EvoBench and MAGE benchmarks and is complementary to existing global methods. Conclusion: Exploiting the temporal dynamics of the generation process can substantially improve zero-shot detection of AI-generated text. Abstract: Zero-shot detection methods for AI-generated text typically aggregate token-level statistics across entire sequences, overlooking the temporal dynamics inherent to autoregressive generation. We analyze over 120k text samples and reveal Late-Stage Volatility Decay: AI-generated text exhibits rapidly stabilizing log probability fluctuations as generation progresses, while human writing maintains higher variability throughout. This divergence peaks in the second half of sequences, where AI-generated text shows 24-32% lower volatility. Based on this finding, we propose two simple features: Derivative Dispersion and Local Volatility, which are computed exclusively from late-stage statistics. Without perturbation sampling or additional model access, our method achieves state-of-the-art performance on EvoBench and MAGE benchmarks and demonstrates strong complementarity with existing global methods.
[66] RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection
Zhiwei Liu,Runteng Guo,Baojie Qu,Yuechen Jiang,Min Peng,Qianqian Xie,Sophia Ananiadou
Main category: cs.CL
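TL;DR: This paper introduces RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection, which improves generalization across domains through multi-perspective evidence retrieval and multi-agent collaborative reasoning.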
Details
Motivation: Existing methods rely on single-perspective cues and generalize poorly to challenging or underrepresented domains; large language models handle complex tasks well but are constrained by the same-distribution assumption. Method: The RAAR framework combines retrieval augmentation with multi-agent collaborative reasoning. It first retrieves multi-perspective source-domain evidence based on semantics, sentiment, and writing style; several specialized agents then produce complementary analyses, and a summary agent integrates them under verifier guidance into a verifiable multi-step reasoning path. A multi-task verifier is trained with supervised fine-tuning and reinforcement learning to strengthen reasoning and verification. Result: The RAAR-8b and RAAR-14b models trained with RAAR substantially outperform base models, existing cross-domain methods, advanced LLMs, and LLM-based adaptation approaches on three cross-domain misinformation detection tasks. Conclusion: RAAR improves generalization and systematic reasoning in cross-domain misinformation detection and offers a new way to move beyond the same-distribution assumption. Abstract: Cross-domain misinformation detection is challenging, as misinformation arises across domains with substantial differences in knowledge and discourse. Existing methods often rely on single-perspective cues and struggle to generalize to challenging or underrepresented domains, while reasoning large language models (LLMs), though effective on complex tasks, are limited to same-distribution data. To address these gaps, we introduce RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection. To enable cross-domain transfer beyond same-distribution assumptions, RAAR retrieves multi-perspective source-domain evidence aligned with each target sample's semantics, sentiment, and writing style. To overcome single-perspective modeling and missing systematic reasoning, RAAR constructs verifiable multi-step reasoning paths through specialized multi-agent collaboration, where perspective-specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. RAAR further applies supervised fine-tuning and reinforcement learning to train a single multi-task verifier to enhance verification and reasoning capabilities. Based on RAAR, we trained the RAAR-8b and RAAR-14b models. Evaluation on three cross-domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches. The project will be released at https://github.com/lzw108/RAAR.
[67] Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics
Oshri Naparstek
Main category: cs.CL
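TL;DR: This paper proposes a continuous autoregressive language generation model that evolves token representations in continuous space and discretizes them only once they have sufficiently converged, achieving stable text generation without token-level sampling.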
Details
Motivation: Conventional autoregressive language models discretize tokens at every generation step, which makes generation sensitive to the decoding strategy and prone to repetition and instability; this paper aims to address that problem. Method: Introduces a continuous autoregressive framework in which tokens are represented as continuous vectors, updated over multiple steps by a deterministic dynamical process until convergence, and finally mapped back to discrete text by hard decoding. Result: The model produces coherent and diverse text with deterministic decoding (argmax) alone, without token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Conclusion: This is the first autoregressive language model to achieve stable generation by evolving continuous token representations before discretization, offering a new paradigm for language generation. Abstract: Autoregressive language models are conventionally defined over discrete token sequences, committing to a specific token at every generation step. This early discretization forces uncertainty to be resolved through token-level sampling, often leading to instability, repetition, and sensitivity to decoding heuristics. In this work, we introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that mature over multiple update steps before being discretized. Rather than sampling tokens, the model evolves continuous token representations through a deterministic dynamical process, committing to a discrete token only when the representation has sufficiently converged. Discrete text is recovered via hard decoding, while uncertainty is maintained and resolved in the continuous space. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax), without reliance on token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations, such as stochastic dynamics or history smoothing, can be incorporated naturally but are not required for the model to function. To our knowledge, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.
[68] MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News
Zhiwei Liu,Paul Thompson,Jiaqi Rong,Baojie Qu,Runteng Guo,Min Peng,Qianqian Xie,Sophia Ananiadou
Main category: cs.CL
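TL;DR: This paper introduces MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, supporting fine-grained identification, classification, and explanation.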
Details
Motivation: Existing misinformation detection methods mostly make binary judgments over whole claims or paragraphs, which fails to capture how true and false details co-exist and provides neither precise localization of misleading spans nor a distinction between types of falsehood. Method: Builds a corpus of paired real and fake news stories and defines three tasks: MisSpansIdentity (locating false spans), MisSpansType (classifying the type of falsehood), and MisSpansExplanation (generating span-grounded explanations), with expert annotation and standardized procedures to ensure quality. Result: An evaluation of 15 representative LLMs under zero-shot and one-shot settings shows that fine-grained misinformation identification remains challenging and that performance depends on multiple factors, including model size, reasoning capability, and domain characteristics. Conclusion: MisSpans offers a finer-grained and more interpretable paradigm for misinformation research, shifting from coarse judgments to fine-grained analysis; future work needs a deeper understanding of how interacting factors affect detection performance. Abstract: Online misinformation is increasingly pervasive, yet most existing benchmarks and methods evaluate veracity at the level of whole claims or paragraphs using coarse binary labels, obscuring how true and false details often co-exist within single sentences. These simplifications also limit interpretability: global explanations cannot identify which specific segments are misleading or differentiate how a detail is false (e.g., distorted vs. fabricated). To address these gaps, we introduce MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, consisting of paired real and fake news stories. MisSpans defines three complementary tasks: MisSpansIdentity for pinpointing false spans within sentences, MisSpansType for categorising false spans by misinformation type, and MisSpansExplanation for providing rationales grounded in identified spans. Together, these tasks enable fine-grained localisation, nuanced characterisation beyond true/false and actionable explanations. Expert annotators were guided by standardised guidelines and consistency checks, leading to high inter-annotator agreement. We evaluate 15 representative LLMs, including reasoning-enhanced and non-reasoning variants, under zero-shot and one-shot settings. Results reveal the challenging nature of fine-grained misinformation identification and analysis, and highlight the need for a deeper understanding of how performance may be influenced by multiple interacting factors, including model size and reasoning capabilities, along with domain-specific textual features. This project will be available at https://github.com/lzw108/MisSpans.
[69] A Navigational Approach for Comprehensive RAG via Traversal over Proposition Graphs
Maxime Delmas,Lei Xu,André Freitas
Main category: cs.CL
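TL;DR: ToPG (Traversal over Proposition Graphs) is a new RAG framework that builds a heterogeneous graph of propositions, entities, and passages and combines query-aware graph traversal with an iterative Suggestion-Selection mechanism, performing well on both simple and complex QA tasks.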
Details
Motivation: Existing RAG methods lack structural connectivity when handling complex multi-hop queries, while knowledge-graph methods perform poorly on single-hop factual queries; ToPG aims to bridge this gap. Method: The knowledge base is modeled as a heterogeneous graph of propositions, entities, and passages and explored with iterative Suggestion-Selection cycles: the Suggestion phase performs a query-aware traversal of the graph, and the Selection phase uses LLM feedback to prune and to seed the next iteration. Result: On three different QA tasks (Simple, Complex, and Abstract QA), ToPG performs strongly on both accuracy- and quality-based metrics, adapting to a variety of query types. Conclusion: ToPG shows that combining query-aware graph traversal with fine-grained factual representation is key to efficient structured RAG systems. Abstract: Standard RAG pipelines based on chunking excel at simple factual retrieval but fail on complex multi-hop queries due to a lack of structural connectivity. Conversely, initial strategies that interleave retrieval with reasoning often lack global corpus awareness, while Knowledge Graph (KG)-based RAG performs strongly on complex multi-hop tasks but suffers on fact-oriented single-hop queries. To bridge this gap, we propose a novel RAG framework: ToPG (Traversal over Proposition Graphs). ToPG models its knowledge base as a heterogeneous graph of propositions, entities, and passages, effectively combining the granular fact density of propositions with graph connectivity. We leverage this structure using iterative Suggestion-Selection cycles, where the Suggestion phase enables a query-aware traversal of the graph, and the Selection phase provides LLM feedback to prune irrelevant propositions and seed the next iteration. Evaluated on three distinct QA tasks (Simple, Complex, and Abstract QA), ToPG demonstrates strong performance across both accuracy- and quality-based metrics. Overall, ToPG shows that query-aware graph traversal combined with factual granularity is a critical component for efficient structured RAG systems. ToPG is available at https://github.com/idiap/ToPG.
[70] EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis
Xuanguang Pan,Chongyang Tao,Jiayuan Bai,Jianling Gao,Zhengwei Tao,Xiansheng Zhou,Gavin Cheung,Shuai Ma
Main category: cs.CL
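TL;DR: EvolSQL is a structure-aware data synthesis framework that evolves SQL queries from seed data to improve the training of Text-to-SQL models, producing data with greater structural diversity and complexity.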
Details
Motivation: Existing methods either rely on limited human-annotated corpora or generate data directly with LLMs without explicit control over SQL structure, resulting in limited structural diversity and complexity. Method: Proposes the EvolSQL framework, which comprises an exploratory Query-SQL expansion, an adaptive directional evolution strategy based on six atomic transformation operators, an execution-grounded SQL refinement module, and schema-aware deduplication, progressively increasing query complexity along the relational, predicate, aggregation, and nesting dimensions. Result: Experiments show that a 7B model fine-tuned on only 1/18 of the data volume of SynSQL outperforms a model trained on the much larger SynSQL dataset. Conclusion: EvolSQL generates high-quality, structurally diverse Text-to-SQL data, improves the performance of smaller models, and offers an efficient approach to training in data-scarce settings. Abstract: Training effective Text-to-SQL models remains challenging due to the scarcity of high-quality, diverse, and structurally complex datasets. Existing methods either rely on limited human-annotated corpora, or synthesize datasets directly by simply prompting LLMs without explicit control over SQL structures, often resulting in limited structural diversity and complexity. To address this, we introduce EvolSQL, a structure-aware data synthesis framework that evolves SQL queries from seed data into richer and more semantically diverse forms. EvolSQL starts with an exploratory Query-SQL expansion to broaden question diversity and improve schema coverage, and then applies an adaptive directional evolution strategy using six atomic transformation operators derived from the SQL Abstract Syntax Tree to progressively increase query complexity across relational, predicate, aggregation, and nesting dimensions. An execution-grounded SQL refinement module and schema-aware deduplication further ensure the creation of high-quality, structurally diverse mapping pairs. Experimental results show that a 7B model fine-tuned on our data outperforms one trained on the much larger SynSQL dataset using only 1/18 of the data.
[71] Mind2Report: A Cognitive Deep Research Agent for Expert-Level Commercial Report Synthesis
Mingyue Cheng,Daoyu Wang,Qi Liu,Shuo Yu,Xiaoyu Tao,Yuqian Wang,Chengzhong Chu,Yu Duan,Mingkang Long,Enhong Chen
Main category: cs.CL
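TL;DR: Proposes Mind2Report, a cognitive deep research agent that emulates a commercial analyst and augments large language models with dynamic memory to generate high-quality, reliable, and comprehensive commercial reports; its advantage is validated on the newly built QRC-Eval benchmark.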
Details
Motivation: Existing deep research agents produce commercial reports that are limited in quality, reliability, and coverage, falling short of what high-stakes business decisions require. Method: Designs Mind2Report as a training-free agentic workflow that probes fine-grained intent, searches web sources and distills information on the fly, stores it in dynamic memory, and iteratively synthesizes the report, using general LLMs to carry out these long-form cognitive processes. Result: On the QRC-Eval benchmark of 200 real-world commercial tasks, Mind2Report outperforms leading deep research agents, including those from OpenAI and Gemini, in report quality, reliability, and coverage. Conclusion: Mind2Report offers an effective paradigm for building expert-level commercial analysis agents and is expected to inform the future design of commercial deep research agents. Abstract: Synthesizing informative commercial reports from massive and noisy web sources is critical for high-stakes business decisions. Although current deep research agents achieve notable progress, their reports still remain limited in terms of quality, reliability, and coverage. In this work, we propose Mind2Report, a cognitive deep research agent that emulates the commercial analyst to synthesize expert-level reports. Specifically, it first probes fine-grained intent, then searches web sources and records distilled information on the fly, and subsequently iteratively synthesizes the report. We design Mind2Report as a training-free agentic workflow that augments general large language models (LLMs) with dynamic memory to support these long-form cognitive processes. To rigorously evaluate Mind2Report, we further construct QRC-Eval comprising 200 real-world commercial tasks and establish a holistic evaluation strategy to assess report quality, reliability, and coverage. Experiments demonstrate that Mind2Report outperforms leading baselines, including OpenAI and Gemini deep research agents. Although this is a preliminary study, we expect it to serve as a foundation for advancing the future design of commercial deep research agents. Our code and data are available at https://github.com/Melmaphother/Mind2Report.
[72] CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters
Ao Sun,Xiaoyu Wang,Zhe Tan,Yu Li,Jiachen Zhu,Shu Su,Yuheng Jia
Main category: cs.CL
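TL;DR: Proposes CuMA, a framework that uses conditional capacity separation and demographic-aware routing to address "mean collapse" in cross-cultural LLM alignment, effectively preserving cultural diversity.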
Details
Motivation: Because LLMs serve users worldwide, alignment must shift from pursuing universal consensus to respecting cultural pluralism; conventional dense models suffer mean collapse when fitting conflicting values and fail to represent different cultural groups adequately.
Method: Proposes CuMA (Cultural Mixture of Adapters), which combines a mixture of adapters with demographic-based routing to build a latent cultural topology and separate the gradients of different cultures into specialized subspaces, mitigating the gradient interference caused by cultural sparsity.
Result: On WorldValuesBench, Community Alignment, and PRISM, CuMA achieves state-of-the-art performance, significantly outperforming dense models and semantic-only MoE methods, and is shown to mitigate mean collapse.
Conclusion: By separating parameters conditionally, CuMA overcomes the limitations of dense models in multicultural alignment and offers a workable approach to aligning language models in a way that supports cultural diversity.
Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose CuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic-aware routing, CuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that CuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.
[73] Faithful Summarisation under Disagreement via Belief-Level Aggregation
Favour Yahdii Aghaebe,Tanefa Apekey,Elizabeth Williams,Nafise Sadat Moosavi
Main category: cs.CL
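TL;DR: Proposes a disagreement-aware summarisation framework that separates belief-level aggregation from language generation and explicitly models conflicting viewpoints, improving the faithfulness of opinion summaries.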
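To make "distance-based belief merging" concrete, here is a minimal sketch of one classical operator from the belief-merging literature: keep the interpretations that minimise the summed Hamming distance to every source. This is an illustration under stated assumptions, not the paper's pipeline; the propositional encoding, the choice of the sum operator, and the names `hamming` and `merge_beliefs` are all hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of atoms on which two interpretations differ."""
    return sum(x != y for x, y in zip(a, b))

def merge_beliefs(belief_sets, n_atoms):
    """Sum-of-distances merging (assumed operator, for illustration).

    Each source is a set of acceptable interpretations (tuples of 0/1).
    A candidate world's cost is the sum, over sources, of its distance
    to that source's closest interpretation. All minimum-cost worlds
    are returned, so unresolved disagreement stays visible.
    """
    best, best_cost = [], None
    for world in product((0, 1), repeat=n_atoms):
        cost = sum(min(hamming(world, m) for m in models)
                   for models in belief_sets)
        if best_cost is None or cost < best_cost:
            best, best_cost = [world], cost
        elif cost == best_cost:
            best.append(world)
    return best, best_cost

# Hypothetical atoms: (battery_is_good, screen_is_good)
reviews = [
    {(1, 1)},          # reviewer A: both good
    {(1, 0)},          # reviewer B: battery good, screen bad
    {(1, 0), (1, 1)},  # reviewer C: battery good, no view on the screen
]
merged, cost = merge_beliefs(reviews, n_atoms=2)
print(merged, cost)
```

In this toy input every merged world agrees on the battery while both screen values survive, which is the kind of structured result a generator could then verbalise as "reviewers agree on X but are split on Y".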
Details
Motivation: Existing summarisation methods, particularly LLM-based ones, tend to smooth over disagreement and favour majority views, which makes summaries less faithful when viewpoints clearly conflict.
Method: Represents documents as structured belief sets, aggregates them at the belief level with distance-based belief merging operators that explicitly model conflict, and then uses an LLM only to turn the aggregated beliefs into a natural-language summary.
Result: Across model families and scales, the approach is consistently better than or comparable to methods that aggregate during generation, and the resulting summaries are more faithful to the original distribution of opinions.
Conclusion: Separating belief-level aggregation from language generation handles viewpoint conflicts across documents more reliably and is a more robust, generalisable strategy for disagreement-aware summarisation.
Abstract: Opinion and multi-document summarisation often involve genuinely conflicting viewpoints, yet many existing approaches, particularly LLM-based systems, implicitly smooth disagreement and over-represent majority opinions. This limits the faithfulness of generated summaries in opinion-heavy settings. We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Documents are first represented as structured belief sets and aggregated using distance-based belief merging operators that explicitly model conflict. Large language models are then used only to realise the aggregated beliefs as natural language summaries. We evaluate the approach across multiple model families and scales, comparing it to methods that perform explicit aggregation during generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models, while maintaining fluent and grounded summaries.
[74] V-FAT: Benchmarking Visual Fidelity Against Text-bias
Ziteng Wang,Yujie He,Guanliang Li,Siqi Yang,Jiaqi Xiong,Songxiang Liu
Main category: cs.CL
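TL;DR: Introduces V-FAT, a new benchmark for evaluating text bias in multimodal large language models, revealing that models over-rely on linguistic priors rather than genuine visual understanding in visual reasoning.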
Details
Motivation: There is concern that current multimodal large language models over-rely on linguistic shortcuts in visual reasoning tasks and lack genuine visual grounding, so this text bias needs to be systematically identified and quantified.
Method: Decouples text bias into internal corpus bias and external instruction bias, builds the V-FAT benchmark with a three-level evaluation framework (L1-L3), and introduces the Visual Robustness Score (VRS) to measure how much a model truly relies on visual information.
Result: Experiments on 12 frontier MLLMs show that, although these models do well on conventional benchmarks, they suffer significant visual collapse in conflict scenarios with high linguistic dominance.
Conclusion: Current multimodal models exhibit severe text bias; future work should focus on improving visual fidelity and robustness rather than relying on linguistic priors alone.
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.
[75] Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences
Arkadiusz Modzelewski,Paweł Golik,Anna Kołos,Giovanni Da San Martino
Main category: cs.CL
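TL;DR: Introduces Persuaficial, a multilingual benchmark for comparing how hard LLM-generated and human-written persuasive texts are to detect automatically, and finds that subtle LLM-generated persuasion is harder to detect.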
Details
Motivation: LLMs can generate highly persuasive text and may be misused for propaganda and manipulation, so it is necessary to study whether LLM-generated persuasion is harder to detect automatically than human-written persuasion.
Method: Categorizes controllable generation approaches and builds Persuaficial, a high-quality multilingual benchmark covering six languages, then compares detection performance on human-written and LLM-generated persuasive texts through large-scale empirical evaluation.
Result: Overtly persuasive LLM-generated text is relatively easy to detect, but subtle LLM-generated persuasion consistently degrades automatic detection performance; the paper also provides the first comprehensive linguistic analysis of the differences between human and LLM-generated texts.
Conclusion: Subtly persuasive LLM-generated text poses a greater challenge to current automatic detection methods, and the linguistic findings can guide the development of more interpretable and robust detection tools.
Abstract: Large Language Models (LLMs) can generate highly persuasive text, raising concerns about their misuse for propaganda, manipulation, and other harmful purposes. This leads us to our central question: Is LLM-generated persuasion more difficult to automatically detect than human-written persuasion? To address this, we categorize controllable generation approaches for producing persuasive content with LLMs and introduce Persuaficial, a high-quality multilingual benchmark covering six languages: English, German, Polish, Italian, French and Russian. Using this benchmark, we conduct extensive empirical evaluations comparing human-authored and LLM-generated persuasive texts. We find that although overtly persuasive LLM-generated texts can be easier to detect than human-written ones, subtle LLM-generated persuasion consistently degrades automatic detection performance. Beyond detection performance, we provide the first comprehensive linguistic analysis contrasting human and LLM-generated persuasive texts, offering insights that may guide the development of more interpretable and robust detection tools.
[76] GenProve: Learning to Generate Text with Fine-Grained Provenance
Jingxuan Wei,Xingyue Wang,Yanghaoyu Liao,Jie Dong,Yuchen Liu,Caijun Jia,Bihui Yu,Junnan Zhu
Main category: cs.CL
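TL;DR: Introduces the Generation-time Fine-grained Provenance task and, through the ReFInE dataset and the GenProve framework, improves LLMs' ability to produce accurate fine-grained provenance alongside their answers, while exposing how hard inference-based provenance remains.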
Details
Motivation: LLMs often hallucinate, and existing citation methods are mostly coarse-grained: they make it hard to verify how a cited source supports the generated content and do not distinguish among direct quotation, compression, and inference.
Method: Builds the expert-annotated ReFInE dataset, which distinguishes Quotation, Compression, and Inference relations, and proposes the GenProve framework, which combines supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO) to optimize a composite reward for answer fidelity and provenance correctness.
Result: GenProve significantly outperforms 14 strong baseline LLMs in joint evaluation; models do well on surface-level quotation but show a clear weakness on inference-based provenance.
Conclusion: Fine-grained provenance improves interpretability and verifiability, but reliable provenance for inferred claims remains a key open challenge.
Abstract: Large language models (LLMs) often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability as users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability & Evidence), a dataset featuring expert-verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap where models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.
[77] A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction
Qing Wang,Zehan Li,Yaodong Song,Hongjie Chen,Jian Kang,Jie Lian,Jie Li,Yongxiang Li,Xuelong Li
Main category: cs.CL
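TL;DR: Presents a unified spoken language model built on Injected Emotional-Attribution Thinking (IEAT), a new data construction strategy that internalizes emotionally intelligent reasoning, and reports leading performance on emotional trajectory modeling, emotional reasoning, and empathetic response generation.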
Details
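Motivation: Conventional methods usually treat emotion recognition as an explicitly supervised task, which makes deep emotional understanding and reasoning hard to achieve; this work aims to raise emotional intelligence in spoken dialogue by implicitly modeling the user's emotional state and its causes.
Method: Proposes the IEAT strategy, which brings reasoning about emotional states and their causes into the model's internal reasoning process, and trains in two progressive stages: the first performs speech-text alignment and emotional attribute modeling via self-distillation, and the second performs end-to-end cross-modal joint optimization to keep textual and spoken emotional expression consistent.
Result: On the HumDial Emotional Intelligence benchmark, the method achieves top performance on emotional trajectory modeling, emotional reasoning, and empathetic response generation, outperforming existing methods under both LLM-based and human evaluation.
Conclusion: IEAT effectively internalizes emotion-aware reasoning in the model, and the two-stage training strategy improves cross-modal emotional consistency, substantially strengthening the emotional capabilities of spoken dialogue systems.
Abstract: This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. The model is trained with a two-stage progressive strategy. The first stage performs speech-text alignment and emotional attribute modeling via self-distillation, while the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.
[78] Text as a Universal Interface for Transferable Personalization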
Yuting Liu,Jian Guan,Jia-Nan Li,Wei Wu,Jiang-Ming Yang,Jianzhe Zhao,Guibing Guo
Main category: cs.CL
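TL;DR: Proposes representing user preferences in natural language, learns interpretable and transferable preference descriptions with a two-stage training framework, and develops AlignXplore+, an 8B model with strong cross-task and cross-model transferability.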
Details
Motivation: Existing LLM personalization methods mostly represent user preferences as implicit vectors, which lack interpretability and do not transfer across models or tasks.
Method: Uses natural language as a universal interface for preference representation and trains in two stages: supervised fine-tuning on high-quality synthesized data, followed by reinforcement learning to optimize long-term utility and cross-task transferability.
Result: Across nine benchmarks, the 8B AlignXplore+ model achieves state-of-the-art performance, surpassing larger open-source models, and transfers well across tasks, model families, and interaction formats.
Conclusion: Natural language is an effective universal representation of preferences; combined with two-stage training it yields interpretable, reusable, and continually evolving preference descriptions, advancing personalized LLMs.
Abstract: We study the problem of personalization in large language models (LLMs). Prior work predominantly represents user preferences as implicit, model-specific vectors or parameters, yielding opaque "black-box" profiles that are difficult to interpret and transfer across models and tasks. In contrast, we advocate natural language as a universal, model- and task-agnostic interface for preference representation. The formulation leads to interpretable and reusable preference descriptions, while naturally supporting continual evolution as new interactions are observed. To learn such representations, we introduce a two-stage training framework that combines supervised fine-tuning on high-quality synthesized data with reinforcement learning to optimize long-term utility and cross-task transferability. Based on this framework, we develop AlignXplore+, a universal preference reasoning model that generates textual preference summaries. Experiments on nine benchmarks show that our 8B model achieves state-of-the-art performance -- outperforming substantially larger open-source models -- while exhibiting strong transferability across tasks, model families, and interaction formats.
[79] Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization
Xueyun Tian,Minghua Ma,Bingbing Xu,Nuoyan Lyu,Wei Li,Heng Dong,Zheng Chu,Yuanzhuo Wang,Huawei Shen
Main category: cs.CL
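TL;DR: Proposes using negative chain-of-thought trajectories (those with incorrect final answers) in supervised fine-tuning to improve out-of-domain generalization, and designs GLOW, a gain-based loss weighting method that markedly improves reasoning performance.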
Details
Motivation: Standard SFT keeps only chain-of-thought trajectories with correct final answers; discarding the incorrect ones wastes supervision and worsens overfitting, limiting out-of-domain generalization.
Method: Systematically analyzes negative trajectories, identifying 22 recurring patterns, and proposes GLOW, which adaptively rescales each sample's loss according to its progress between epochs.
Result: On Qwen2.5-7B, GLOW yields a 5.51% out-of-domain gain over positive-only SFT and, when used as an RL initialization, raises MMLU from 72.82% to 76.47%; negative chains increase policy entropy at inference by 35.67%, encouraging exploration.
Conclusion: Using negative chain-of-thought trajectories together with adaptive loss weighting mitigates overfitting and strengthens generalization and reasoning.
Abstract: Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectory demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically only retain trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits such distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.
[80] Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei
Peng Wang,Xilin Tao,Siyi Yao,Jiageng Wu,Yuntao Zou,Zhuotao Tian,Libo Qin,Dagang Li
Main category: cs.CL
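TL;DR: Proposes the Subcultural Alignment Solver (SAS), a multi-agent framework that improves LLM detection of self-destructive behavior in subcultural contexts, using automatic retrieval and subculture alignment to counter knowledge lag and semantic misalignment.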
Details
Motivation: Self-destructive behavior is expressed in distinctive and oblique ways within subcultural groups, and LLMs suffer from lagging knowledge and semantic misunderstanding, so existing methods struggle to identify it accurately and a more effective detection framework is needed.
Method: Proposes the SAS framework, a multi-agent system that combines automatic retrieval with subcultural semantic alignment, adapting dynamically to fast-evolving subcultural terms and improving understanding of self-destructive behavior in context.
Result: Experiments show that SAS outperforms OWL, a current advanced multi-agent framework, in detection performance and is competitive with fine-tuned LLMs.
Conclusion: SAS effectively improves LLMs' ability to identify self-destructive behavior in subcultural settings and provides a new tool and direction for mental health research.
Abstract: Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) are applied across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods face two main challenges: (1) Knowledge Lag: subcultural slang evolves rapidly, faster than LLMs' training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we propose Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly enhancing the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.
[81] Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models
Yueqing Hu,Xinyang Peng,Shuting Peng,Hanqi Wang,Tianhong Wang
Main category: cs.CL
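TL;DR: Reasoning distillation imitates a teacher model's reasoning traces through supervised fine-tuning, yet the student fails to inherit the teacher's cognitive structure, showing "Functional Alignment Collapse" and negative transfer; this suggests human-like cognition arises from active reinforcement rather than passive imitation.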
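The alignment figures quoted in this entry are mean correlations between human difficulty and model behavior. As a rough sketch of how such a score could be computed, the snippet below correlates per-item human difficulty with the number of reasoning tokens a model spends. Everything here is an assumption made for illustration: the digest does not say which correlation coefficient or cost measure the authors use, Pearson's r and token counts are stand-ins, and the numbers are invented toy data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented toy data: human difficulty rating per problem (1 = easy, 5 = hard)
human_difficulty = [1, 2, 3, 4, 5]
# Hypothetical reasoning-token counts on the same problems
teacher_tokens = [100, 220, 290, 410, 520]  # effort scales with difficulty
student_tokens = [300, 310, 280, 330, 300]  # uniformly verbose

r_teacher = pearson_r(human_difficulty, teacher_tokens)
r_student = pearson_r(human_difficulty, student_tokens)
print(f"teacher r = {r_teacher:.2f}, student r = {r_student:.2f}")
```

A student that is verbose everywhere, regardless of difficulty, scores low on this measure even when its traces look like the teacher's, which is the decoupling of cost from demand that the entry describes.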
Details
Motivation: Investigates whether the current reasoning distillation paradigm can transmit the structure, found in large reasoning models, that aligns with human cognitive costs.
Method: Tests the "Hán Dān Xué Bù" (superficial mimicry) hypothesis on 14 models, comparing how well teacher models and their SFT-distilled students correlate with human difficulty.
Result: Teacher models align clearly with human difficulty ($\bar{r}=0.64$), while distilled students align far less ($\bar{r}=0.34$), often falling below their own pre-distillation baselines and exhibiting a "Cargo Cult" effect.
Conclusion: Supervised fine-tuning copies only the linguistic form of reasoning and does not internalize the dynamic resource allocation policy, indicating that human-like cognition is an emergent property of reinforcement learning that cannot be passed on through passive imitation.
Abstract: Recent Large Reasoning Models trained via reinforcement learning exhibit a "natural" alignment with human cognitive costs. However, we show that the prevailing paradigm of reasoning distillation -- training student models to mimic these traces via Supervised Fine-Tuning (SFT) -- fails to transmit this cognitive structure. Testing the "Hán Dān Xué Bù" (Superficial Mimicry) hypothesis across 14 models, we find that distillation induces a "Functional Alignment Collapse": while teacher models mirror human difficulty scaling ($\bar{r}=0.64$), distilled students significantly degrade this alignment ($\bar{r}=0.34$), often underperforming their own pre-distillation baselines ("Negative Transfer"). Our analysis suggests that SFT induces a "Cargo Cult" effect, where students ritualistically replicate the linguistic form of reasoning (verbosity) without internalizing the teacher's dynamic resource allocation policy. Consequently, reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition is an emergent property of active reinforcement, not passive imitation.
[82] ArcAligner: Adaptive Recursive Aligner for Compressed Context Embeddings in RAG
Jianbo Li,Yi Jiang,Sendong Zhao,Bairui Hu,Haochun Wang,Bing Qin
Main category: cs.CL
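TL;DR: Proposes ArcAligner, a lightweight module with an adaptive gating mechanism that helps language models make better use of highly compressed context during generation, improving both efficiency and performance.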
Details
Motivation: Existing context compression methods make the context hard for the LLM to understand when compression is too aggressive, so compression rate must be balanced against how readable the representation is to the model.
Method: Designs the ArcAligner module, integrated into the language model's layers, with an adaptive gating mechanism that adds computation only when the information is complex, so that highly compressed context can be used efficiently.
Result: On several knowledge-intensive QA benchmarks, ArcAligner outperforms baselines at comparable compression rates, especially in multi-hop and long-tail settings.
Conclusion: ArcAligner effectively mitigates the information loss caused by context compression, enabling efficient and accurate retrieval-augmented generation.
Abstract: Retrieval-Augmented Generation (RAG) helps LLMs stay accurate, but feeding long documents into a prompt makes the model slow and expensive. This has motivated context compression, ranging from token pruning and summarization to embedding-based compression. While researchers have tried "compressing" these documents into smaller summaries or mathematical embeddings, there is a catch: the more you compress the data, the more the LLM struggles to understand it. To address this challenge, we propose ArcAligner (Adaptive recursive context Aligner), a lightweight module integrated into the language model layers to help the model better utilize highly compressed context representations for downstream generation. It uses an adaptive "gating" system that only adds extra processing power when the information is complex, keeping the system fast. Across knowledge-intensive QA benchmarks, ArcAligner consistently beats compression baselines at comparable compression rates, especially on multi-hop and long-tail settings. The source code is publicly available.
[83] Compositional Steering of Large Language Models with Steering Tokens
Gorjan Radevski,Kiril Gashteovski,Giwon Hong,Carolin Lawrence,Goran Glavaš
Main category: cs.CL
TL;DR: Proposes compositional steering tokens, a method that steers LLMs toward multiple behaviors jointly in the input token space and generalizes well zero-shot.
Details
Motivation: Existing work focuses mainly on steering a single behavior, while compositional steering toward several behaviors at once is underexplored and real applications need output controlled along multiple criteria.
Method: Embeds individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation so that steering lives in the input token space, then trains a dedicated composition token that learns to combine behaviors and generalizes to unseen behavior combinations and unseen numbers of behaviors.
Result: Across different LLM architectures, the method outperforms competing approaches (instructions, activation steering, and LoRA merging) and generalizes strongly to unseen compositions; combining it with natural language instructions yields further gains.
Conclusion: Compositional steering tokens provide an effective and flexible way to control multiple behaviors jointly, advancing controllable generation in complex real-world settings.
Abstract: Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, compositional steering -- i.e., steering LLMs simultaneously towards multiple behaviors -- remains an underexplored problem. In this work, we propose compositional steering tokens for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steers live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated composition token on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to unseen compositions, including those with unseen behaviors as well as those with an unseen number of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior control compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.
[84] SemPA: Improving Sentence Embeddings of Large Language Models through Semantic Preference Alignment
Ziyang Chen,Zhenxuan Huang,Yile Wang,Weiqin Wang,Lu Yin,Hui Huang
Main category: cs.CL
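TL;DR: Proposes SemPA, which uses sentence-level Direct Preference Optimization (DPO) to improve semantic representations while preserving the generative ability of LLMs.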
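For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model. It illustrates the general objective only; how SemPA builds its sentence-level pairs from paraphrases is not specified in this digest, and the function name `dpo_loss` and the example values are hypothetical.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each sequence under
    the trainable policy (logp_*) and the frozen reference (ref_logp_*).
    Returns -log(sigmoid(beta * margin)), written as softplus(-z).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    return math.log1p(math.exp(-z))

# Hypothetical pair: chosen = a true paraphrase, rejected = a near-miss sentence.
# The policy has moved toward the paraphrase relative to the reference model.
loss_improved = dpo_loss(-12.0, -20.0, ref_logp_chosen=-15.0, ref_logp_rejected=-15.0)
# A policy identical to the reference gives zero margin and loss log(2).
loss_neutral = dpo_loss(-15.0, -15.0, ref_logp_chosen=-15.0, ref_logp_rejected=-15.0)
print(loss_improved, loss_neutral)
```

Because the loss depends only on how much more the policy prefers the chosen sequence than the reference does, it behaves like a pairwise contrastive objective, which is consistent with the connection to contrastive learning that the entry mentions.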
Details
Motivation: Existing sentence embedding methods built on generative LLMs either rely on fixed prompt templates, which limits performance, or modify the model architecture, which harms generation; a method that preserves both representation quality and generative ability is needed.
Method: Performs semantic preference alignment with sentence-level Direct Preference Optimization (DPO), training the model on a paraphrase generation task to discriminate semantically equivalent sentences, and establishes a theoretical link between DPO and contrastive learning under the Plackett-Luce model.
Result: On semantic textual similarity tasks and a range of LLM benchmarks, SemPA significantly improves semantic representations without sacrificing generative ability.
Conclusion: SemPA strengthens the sentence representations of LLMs while retaining their original generative ability, offering a new direction for general-purpose sentence embeddings.
Abstract: Traditional sentence embedding methods employ token-level contrastive learning on non-generative pre-trained models. Recently, there have emerged embedding methods based on generative large language models (LLMs). These methods either rely on fixed prompt templates or involve modifications to the model architecture. The former lacks further optimization of the model and results in limited performance, while the latter alters the internal computational mechanisms of the model, thereby compromising its generative capabilities. We propose SemPA, a novel approach that boosts the sentence representations while preserving the generative ability of LLMs via semantic preference alignment. We leverage sentence-level Direct Preference Optimization (DPO) to efficiently optimize LLMs on a paraphrase generation task, where the model learns to discriminate semantically equivalent sentences while preserving inherent generative capacity. Theoretically, we establish a formal connection between DPO and contrastive learning under the Plackett-Luce model framework. Empirically, experimental results on both semantic textual similarity tasks and various benchmarks for LLMs show that SemPA achieves better semantic representations without sacrificing the inherent generation capability of LLMs.
[85] Code-Mix Sentiment Analysis on Hinglish Tweets
Aashi Garg,Aneshya Das,Arshi Arya,Anushka Goyal,Aditi
Main category: cs.CL
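TL;DR: Proposes an mBERT-based sentiment classification framework for Hinglish tweets that uses subword tokenization to handle spelling variants and slang in Romanized code-mixed text, offering an efficient multilingual NLP solution for brand sentiment monitoring in India.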
Details
Motivation: Traditional monolingual NLP models struggle to parse the syntactic and semantic complexity of Hinglish (a mix of Hindi and English) that is widely used on Indian social media, leading to inaccurate sentiment analysis and weaker brand monitoring.
Method: Fine-tunes mBERT (Multilingual BERT) with subword tokenization to better handle spelling variation, slang, and out-of-vocabulary terms in Hinglish, improving sentiment classification.
Result: Delivers a high-performance, production-ready sentiment classification framework for Hinglish tweets and sets a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
Conclusion: The framework addresses the challenges posed by Hinglish's linguistic complexity and improves the accuracy and practicality of social media sentiment monitoring for brands in the Indian market.
Abstract: The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish--a hybrid of Hindi and English--used widely in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
[86] How Human is AI? Examining the Impact of Emotional Prompts on Artificial and Human and Responsiveness
Florence Bernays,Marco Henriques Pereira,Jochen Menges
Main category: cs.CL
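TL;DR: An experiment finds that the emotional tone people express toward ChatGPT affects both its answer quality and subsequent human-human communication: praise improves answers most, anger yields a smaller improvement, and blame does not improve answers but increases the emphasis on the public interest and leads people to use more negative expressions afterwards.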
Details
Motivation: Examines how the emotions people express toward AI affect the AI's behavior and people's own subsequent social expression.
Method: Uses a between-subject experiment in which participants expressed a specific emotion while working with ChatGPT (GPT-4.0) on a public response task and an ethical dilemma task.
Result: Praise improved ChatGPT's answers the most; anger produced a smaller improvement; blame did not improve its answers but increased its emphasis on the public interest; on the ethical dilemma, anger reduced its prioritization of corporate interests; after blaming, people used more negative language in human-human communication.
Conclusion: Emotional expression in human-AI interaction shapes not only the AI's output but also people's language in later interpersonal communication.
Abstract: This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher albeit smaller improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest. Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised ChatGPT for its responses. Together, our findings demonstrate that the emotional tone people apply in human-AI interactions not only shapes ChatGPT's outputs but also carries over into subsequent human-human communication.
[87] Agent-as-a-Judge
Runyang You,Hongru Cai,Caiqi Zhang,Qiancheng Xu,Meng Liu,Tiezheng Yu,Yongqi Li,Wenjie Li
Main category: cs.CL
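TL;DR: Surveys the evolution from LLM-as-a-Judge to Agent-as-a-Judge, proposing a unified framework that characterizes the paradigm shift, establishing a developmental taxonomy, summarizing core methods and application areas, and analyzing frontier challenges and future research directions.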
Details
Motivation: As evaluands become increasingly complex, specialized, and multi-step, LLM-as-a-Judge is constrained by bias, shallow reasoning, and the lack of real-world verification, so a stronger evaluation paradigm is needed.
Method: Proposes a comprehensive taxonomy, identifies the key dimensions that characterize the paradigm shift, organizes core methodologies, surveys applications across general and professional domains, and analyzes challenges and future directions.
Result: Establishes a developmental taxonomy for Agent-as-a-Judge, systematically reviews core techniques such as planning, tool-augmented verification, multi-agent collaboration, and persistent memory, summarizes how existing systems are applied across domains, and identifies key challenges and research trends.
Conclusion: Agent-as-a-Judge represents the next generation of AI evaluation, and the framework and roadmap presented lay a foundation for further progress in the field.
Abstract: LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.
[88] DocDancer: Towards Agentic Document-Grounded Information Seeking
Qintong Zhang,Xinjie Lv,Jialong Wu,Baixuan Li,Zhengwei Tao,Guochen Yan,Huanyao Zhang,Bin Wang,Jiahao Xu,Haitao Mi,Wentao Zhang
Main category: cs.CL
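TL;DR: Introduces DocDancer, an end-to-end trained open-source document question answering agent that improves document understanding and exploration through a tool-driven framework and a data synthesis pipeline.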
Details
Motivation: Existing document question answering systems use tools ineffectively and mostly depend on closed-source models, leaving no effective open-source solution.
Method: Treats document question answering as an information-seeking problem, proposes a tool-driven agent framework that explicitly models document exploration and comprehension, and designs an Exploration-then-Synthesis data synthesis pipeline to support end-to-end training.
Result: The trained models prove effective on two long-context benchmarks, MMLongBench-Doc and DocBench.
Conclusion: The work provides useful insights into agentic tool design and synthetic data construction for document question answering.
Abstract: Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Models trained on the synthesized data prove effective on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights for agentic tool design and synthetic data.
[89] RelayLLM: Efficient Reasoning via Collaborative Decoding
Chengsong Huang,Tong Zheng,Langlin Huang,Jinyuan Li,Haolin Liu,Jiaxin Huang
Main category: cs.CL
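TL;DR: Proposes RelayLLM, a framework for fine-grained, token-level collaborative decoding in which a small model dynamically calls a large model to generate critical tokens when needed, substantially reducing compute cost and improving reasoning efficiency.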
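The control flow of token-level relaying can be sketched with toy stand-ins for the two models. This is a simplified illustration, not RelayLLM's implementation: the command string `CALL`, the one-token-per-call behavior, and the scripted `small_step` / `large_step` functions are all assumptions made for the example.

```python
CALL = "<call_llm>"  # hypothetical special command emitted by the small model
EOS = "<eos>"

def relay_decode(small_step, large_step, prompt, max_tokens=32):
    """Decode with the small model, relaying single tokens to the large one.

    small_step / large_step take (prompt, tokens_so_far) and return the
    next token. When the small model emits CALL, the large model supplies
    that position's token instead (assumed: one token per call).
    """
    out, large_calls = [], 0
    while len(out) < max_tokens:
        tok = small_step(prompt, out)
        if tok == CALL:
            tok = large_step(prompt, out)
            large_calls += 1
        if tok == EOS:
            break
        out.append(tok)
    return out, large_calls

# Toy models: the small model handles the routine tokens and asks for
# help only on the step it treats as critical (the final answer).
def small_step(prompt, out):
    script = ["2", "+", "2", "=", CALL, EOS]
    return script[len(out)]

def large_step(prompt, out):
    return "4"

tokens, calls = relay_decode(small_step, large_step, "What is 2 + 2?")
print(" ".join(tokens), f"| large-model share: {calls / len(tokens):.0%}")
```

The share printed at the end corresponds to the quantity the entry reports (the fraction of generated tokens produced by the large model); unlike query-level routing, the small model keeps control of the sequence throughout.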
Details
Motivation: LLM inference is expensive and small models reason poorly, while existing coarse-grained collaboration methods waste compute.
Method: Designs the RelayLLM framework, in which the small model acts as a controller that dynamically invokes the large model through a special command to generate critical tokens, and trains it in two stages (warm-up and GRPO) to balance independent generation against help-seeking.
Result: Achieves an average accuracy of 49.52% across six benchmarks while invoking the large model for only 1.07% of generated tokens, a 98.2% cost reduction compared to performance-matched random routers.
Conclusion: RelayLLM delivers efficient reasoning, maintaining strong performance while sharply reducing reliance on the large model, and offers an effective approach to low-cost complex reasoning.
Abstract: Using Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
[90] Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference
Rasmus Blanck,Bill Noble,Stergios Chatzikyriakidis
Main category: cs.CL
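TL;DR: Examines how the logical properties of the Natural Language Inference (NLI) task should be understood, proposes three possible readings of the label set, and evaluates models' meta-inferential consistency using SNLI items with shared premises and LLM-generated items.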
Details
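Motivation: NLI is widely used to evaluate the natural language understanding of language models, but its logical properties are often misunderstood or poorly characterized. Understanding the notion of inference that NLI captures is essential for interpreting model performance correctly. Method: Proposes three possible readings of the NLI label set and, on SNLI, uses NLI items with shared premises together with LLM-generated items to analyze models' meta-inferential consistency. Result: Reveals the meta-inferential properties implied by each reading and finds that current SNLI-trained models are limited in meta-inferential consistency, suggesting that the logical relations encoded in the dataset lean toward a particular reading. Conclusion: SNLI implicitly supports one particular reading of the logical relations, existing models do not fully capture its meta-inferential structure, and the design and evaluation of NLI tasks deserve re-examination. Abstract: Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.
[91] Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems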
Jihao Zhao,Ding Chen,Zhaoxin Fan,Kerun Xu,Mengting Hu,Bo Tang,Feiyu Xiong,Zhiyu Li
Main category: cs.CL
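TL;DR: Proposes Inside Out, a framework that models long-term personalized dialogue with a globally maintained PersonaTree and uses structured memory operations to suppress noise and preserve consistency.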
Details
Motivation: Existing long-term personalized dialogue systems are limited by context length when handling unbounded interaction streams, which leads to accumulated memory noise, degraded reasoning, and persona inconsistency. Method: Designs PersonaTree, a structured user-profile framework whose trunk is constrained by an initial schema while its branches are updated dynamically; trains a lightweight MemListener model with reinforcement learning to emit operations such as ADD, UPDATE, and DELETE that evolve the tree. Result: Experiments show that PersonaTree outperforms full-text concatenation and other personalized memory systems at suppressing contextual noise and maintaining persona consistency; the small MemListener matches or even exceeds large models such as DeepSeek-R1-0528 and Gemini-3-Pro on memory operations. Conclusion: Structured memory management effectively addresses memory bloat and consistency problems in long-term dialogue and provides an efficient, interpretable solution for personalized dialogue in low-latency settings. Abstract: Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.
[92] LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation
Samy Haffoudhi,Fabian M. Suchanek,Nils Holzenberger
Main category: cs.CL
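TL;DR: LELA is a modular coarse-to-fine method that uses large language models for entity linking without fine-tuning and performs strongly across multiple domains and knowledge bases.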
Details
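Motivation: Entity linking is a foundational step for tasks such as knowledge graph construction, question answering, and information extraction, but existing methods usually require fine-tuning for each setting and generalize poorly. Method: Proposes LELA, a modular coarse-to-fine strategy that draws on LLM capabilities and works across target domains, knowledge bases, and language models without fine-tuning. Result: Experiments across a range of entity linking settings show that LELA is highly competitive with fine-tuned methods and substantially outperforms other non-fine-tuned methods. Conclusion: LELA achieves strong entity linking performance without fine-tuning and is both general and practical. Abstract: Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foundational step in tasks such as knowledge graph construction, question-answering, and information extraction. Our method, LELA, is a modular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowledge bases and LLMs, without any fine-tuning phase. Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.
[93] Measuring and Fostering Peace through Machine Learning and Artificial Intelligence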
P. Gilda,P. Dungarwal,A. Thongkham,E. T. Ajayi,S. Choudhary,T. M. Terol,C. Lam,J. P. Araujo,M. McFadyen-Mungalln,L. S. Liebovitch,P. T. Coleman,H. West,K. Sieck,S. Carter
Main category: cs.CL
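TL;DR: Uses machine learning and artificial intelligence to measure levels of peace from news and social media, and develops an online tool that promotes peace by helping users understand their own media consumption habits.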
Details
Motivation: Social media content creators tend to produce content that provokes emotions such as anger to increase clicks, which affects public opinion and mental health, so a tool is needed to assess and improve the peacefulness of media content. Method: For news media, trains neural network models on text embeddings to measure levels of peace; for social media such as YouTube, combines word-level (GoEmotions) and context-level (large language model) methods to analyze social dimensions; develops a Chrome extension, MirrorMirror, that gives users real-time feedback on the peacefulness of media. Result: The news model is highly accurate across different datasets; the social media models effectively identify peace-related social and emotional dimensions; the MirrorMirror extension delivers real-time peacefulness assessment of the content users watch. Conclusion: The study shows the potential of AI for measuring and fostering peace; the authors hope to develop MirrorMirror into an open-source tool for content creators, journalists, researchers, and others, encouraging more respectful, nuanced, and informative communication. Abstract: We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset, also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people 20-40 years old view most of their news daily through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making you angry to engage you, to increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.
[94] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Shih-Yang Liu,Xin Dong,Ximing Lu,Shizhe Diao,Peter Belcak,Mingjie Liu,Min-Hung Chen,Hongxu Yin,Yu-Chiang Frank Wang,Kwang-Ting Cheng,Yejin Choi,Jan Kautz,Pavlo Molchanov
Main category: cs.CL
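TL;DR: Proposes GDPO, a new optimization method for multi-reward reinforcement learning that resolves the collapse of advantage values caused by GRPO's normalization under multiple rewards, improving training stability and performance.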
Details
Motivation: Existing multi-reward reinforcement learning methods such as GRPO normalize several distinct rewards jointly, which degrades the advantage values and hurts training, so a better-suited optimization method is needed. Method: Proposes Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples the normalization of each reward and preserves the relative differences between rewards, enabling more precise multi-reward optimization. Result: On tool calling, math reasoning, and coding reasoning, GDPO clearly outperforms GRPO on accuracy, bug ratio, and constraint-adherence metrics such as format and length, with more stable training. Conclusion: GDPO is a better policy optimization method for multi-reward RL; it avoids the loss of resolution in the reward signal and is general and practical. Abstract: As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) under the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
cs.CV [Back]
[95] Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes
Chenye Meng,Zejian Li,Zhongni Liu,Yize Li,Changle Xie,Kaixin Jia,Ling Yang,Huanghuang Deng,Shiying Ding,Shengyuan Zhang,Jiayi Li,Lingyun Sun
Main category: cs.CV
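TL;DR: Proposes Complex Preference Optimization (CPO), a two-stage alignment framework that improves how well diffusion models align with fine-grained, hierarchical criteria, with painting generation as the target task.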
Details
Motivation: Existing post-training alignment methods for diffusion models rely on simplified signals such as scalar rewards or binary preferences, which cannot capture the complex, hierarchical judgments of human experts and limit fine-grained control over generation quality. Method: First builds tree-structured hierarchical evaluation criteria with domain experts, decomposing image quality into multiple positive and negative attributes; then injects domain knowledge into an auxiliary diffusion model through supervised fine-tuning; finally proposes CPO, which extends DPO to maximize the probability of positive attributes while minimizing that of negative attributes. Result: For painting generation, CPO training on a dataset annotated with fine-grained attributes significantly improves generation quality and agreement with expert criteria. Conclusion: CPO offers an effective way to align diffusion models with complex, non-binary, hierarchical criteria and advances fine-grained control of generative models in professional domains. Abstract: Post-training alignment of diffusion models relies on simplified signals, such as scalar rewards or binary preferences. This limits alignment with complex human expertise, which is hierarchical and fine-grained. To address this, we first construct a set of hierarchical, fine-grained evaluation criteria with domain experts, which decomposes image quality into multiple positive and negative attributes organized in a tree structure. Building on this, we propose a two-stage alignment framework. First, we inject domain knowledge into an auxiliary diffusion model via Supervised Fine-Tuning. Second, we introduce Complex Preference Optimization (CPO) that extends DPO to align the target diffusion to our non-binary, hierarchical criteria. Specifically, we reformulate the alignment problem to simultaneously maximize the probability of positive attributes while minimizing the probability of negative attributes with the auxiliary diffusion. We instantiate our approach in the domain of painting generation and conduct CPO training with an annotated dataset of paintings with fine-grained attributes based on our criteria. Extensive experiments demonstrate that CPO significantly enhances generation quality and alignment with expertise, opening new avenues for fine-grained criteria alignment.
[96] Embedding Textual Information in Images Using Quinary Pixel Combinations
A V Uday Kiran Kandala
Main category: cs.CV
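TL;DR: Proposes a new steganography technique based on quinary combinations of pixel intensities in RGB space that embeds a text symbol within a single pixel, with high fidelity and low computational overhead.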
Details
Motivation: Existing steganography methods such as LSB, transform-domain, or deep learning approaches often distort the image, are computationally complex, or need multiple pixels per symbol; an efficient, low-distortion embedding scheme is lacking. Method: Uses five controlled intensity variations in each of the R, G, and B channels to form 125 combinations, maps them to text characters, and encodes and decodes one symbol within a single pixel. Result: MSE, MAE, SNR, PSNR, SSIM, histogram, and heatmap analyses show no significant image distortion, and embedding efficiency is higher than that of traditional methods. Conclusion: The method achieves efficient, low-distortion, lightweight text-in-image steganography and outperforms existing techniques that rely on multiple pixels or heavy computation. Abstract: This paper presents a novel technique for embedding textual data into images using quinary combinations of pixel intensities in RGB space. Existing methods predominantly rely on least and most significant bit (LSB & MSB) manipulation, Pixel Value Differencing (PVD), spatial perturbations in RGB channels, transform-domain methods, quantization methods, edge- and region-based methods, and more recently deep learning methods and generative AI techniques, for hiding textual information in the spatial domain of images. Most of them depend on pixel intensity flipping over multiple pixels, such as LSB and combinations of LSB-based methodologies, and on transform coefficients, often resulting in noise. Encoding and decoding are deterministic in most of the existing approaches and are computationally heavy in the case of larger models such as deep learning and generative AI approaches. The proposed method works on quinary pixel intensity combinations in RGB space, where five controlled pixel intensity variations in each of the R, G, and B channels form up to one hundred and twenty-five distinct pixel intensity combinations. These combinations are mapped to textual symbols, enabling the representation of uppercase and lowercase alphabetic characters, numeric digits, whitespace, and commonly used special characters. Different metrics, such as MSE, MAE, SNR, PSNR, SSIM, histogram comparison, and heatmap analysis, were evaluated for both original and encoded images, showing no significant distortion in the images.
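The single-pixel idea in this abstract (five intensity variations per channel, 5 x 5 x 5 = 125 combinations, one symbol per pixel) can be illustrated with a small Python sketch. The abstract does not give the paper's actual mapping, so everything specific here is an assumption made for illustration: the 95-symbol alphabet, the base-5 digit order, and the choice of realizing the five variations as residues modulo 5 of each channel value. Under these assumptions a channel value shifts by at most four intensity levels.

```python
import string

# Hypothetical 95-symbol alphabet (letters, digits, space, punctuation); up to 125 would fit.
ALPHABET = string.ascii_letters + string.digits + " " + string.punctuation


def nudge(value, digit):
    """Move an 8-bit channel value to the nearest in-range value with value % 5 == digit."""
    delta = (digit - value % 5 + 2) % 5 - 2  # smallest shift, in the range -2..2
    new_value = value + delta
    if new_value < 0:
        new_value += 5
    elif new_value > 255:
        new_value -= 5
    return new_value


def encode(pixels, text):
    """Hide one character per pixel; pixels is a list of (r, g, b) tuples."""
    if len(text) > len(pixels):
        raise ValueError("text is longer than the number of pixels")
    out = list(pixels)
    for i, ch in enumerate(text):
        index = ALPHABET.index(ch)  # raises ValueError for unsupported characters
        digits = (index // 25, (index // 5) % 5, index % 5)  # three base-5 digits
        out[i] = tuple(nudge(v, d) for v, d in zip(pixels[i], digits))
    return out


def decode(pixels, length):
    """Recover `length` characters from the channel residues of the first pixels."""
    chars = []
    for r, g, b in pixels[:length]:
        index = (r % 5) * 25 + (g % 5) * 5 + (b % 5)
        chars.append(ALPHABET[index])
    return "".join(chars)


cover = [(0, 0, 0), (255, 255, 255), (12, 200, 77), (128, 64, 32)]
stego = encode(cover, "Hi 5")
print(decode(stego, 4))
```

Decoding in this sketch needs only the stego pixels and the message length, not the cover image; whether the paper's scheme has the same property is not stated in the abstract.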
Furthermore, the method achieves improved embedding efficiency by encoding a complete textual symbol within a single RGB pixel, in contrast to LSB and MSB based approaches that typically require multiple pixels or multi-step processes, as well as transform and learning based methods that incur higher computational overhead.
[97] Unified Text-Image Generation with Weakness-Targeted Post-Training
Jiahui Chen,Philippe Hansen-Estruch,Xiaochuang Han,Yushi Hu,Emily Dinan,Amita Kamath,Michal Drozdzal,Reyhane Askari-Hemmat,Luke Zettlemoyer,Marjan Ghazvininejad
Main category: cs.CV
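TL;DR: Explores post-training to achieve fully unified text-image generation, in which a model autonomously transitions from textual reasoning to visual synthesis within a single inference process, strengthening cross-modal coupling. Comparing post-training data strategies, the study finds that a targeted dataset designed around specific limitations works better than broad image-caption corpora or benchmark-aligned data. With offline, reward-weighted post-training on fully self-generated synthetic data, the method improves performance on four different text-to-image benchmarks, demonstrating the effectiveness of reward weighting and strategically designed post-training data.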
Details
Motivation: Existing multimodal generation systems usually depend on explicit modality switching, generating reasoning text first and then switching manually to image generation. This separate, sequential inference limits cross-modal coupling and prevents automatic multimodal generation, so a unified architecture that moves between text and images on its own is needed. Method: Uses offline, reward-weighted post-training on fully self-generated synthetic data for joint text-image generation. Compares post-training data strategies, including broad image-caption corpora, benchmark-aligned data, and targeted datasets designed around specific limitations, and evaluates them on text-to-image generation. Result: The method improves performance on four diverse T2I benchmarks; reward-weighting both modalities with strategically designed post-training data beats the other strategies, and targeted datasets work best. Conclusion: Reward-weighted post-training with carefully designed synthetic data can achieve unified text-image generation, strengthen cross-modal coupling, and advance automatic multimodal generation. Abstract: Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training with fully self-generated synthetic data, our approach enables improvements in multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and strategically designed post-training data.
[98] ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
Mohsen Ghafoorian,Amirhossein Habibian
Main category: cs.CV
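TL;DR: Proposes ReHyAt, a recurrent hybrid attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, reduces attention complexity in video generation from quadratic to linear, supports efficient distillation and fine-tuning, and markedly improves the scalability of long-sequence video generation.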
Details
Motivation: Existing Transformer-based video diffusion models use quadratic-complexity attention and hit compute and memory bottlenecks on long sequences, which limits scalability; an attention mechanism is needed that keeps generation quality while improving efficiency. Method: Proposes ReHyAt, a recurrent hybrid attention mechanism that combines softmax and linear attention, uses a chunk-wise recurrent reformulation to achieve constant memory usage, and includes a lightweight distillation and fine-tuning pipeline that transfers knowledge efficiently from existing softmax models. Result: On VBench, VBench-2.0, and a human preference study, ReHyAt reaches state-of-the-art video quality while reducing attention cost from quadratic to linear; training takes only about 160 GPU hours, two orders of magnitude less than prior methods. Conclusion: Through its hybrid attention design and efficient distillation strategy, ReHyAt keeps video quality high while markedly improving efficiency and scalability, offering a practical route to long-duration and on-device video generation. Abstract: Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while remaining competitive in quality. Our light-weight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at https://qualcomm-ai-research.github.io/rehyat.
[99] SCAR-GS: Spatial Context Attention for Residuals in Progressive Gaussian Splatting
Diego Revilla,Pooja Suresh,Anand Bhojan,Ooi Wei Tsang
Main category: cs.CV
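TL;DR: Proposes a new progressive codec for 3D Gaussian Splatting that uses residual vector quantization and an auto-regressive entropy model guided by a multi-resolution hash grid to improve compression efficiency and rate-distortion performance.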
Details
Motivation: Existing 3D Gaussian Splatting compression methods incur large storage overhead on medium and large scenes, and conventional scalar quantization struggles to capture correlations in high-dimensional feature vectors, which limits rate-distortion performance. Method: Replaces scalar quantization with Residual Vector Quantization and designs an auto-regressive entropy model, guided by a multi-resolution hash grid, that predicts the conditional probability of each successively transmitted index. Result: The method achieves higher compression efficiency in progressive compression, encoding coarse and refinement layers effectively and improving rate-distortion performance. Conclusion: The method clearly outperforms conventional scalar-quantization-based compression and suits real-time, high-fidelity view synthesis in cloud and streaming services. Abstract: Recent advances in 3D Gaussian Splatting have allowed for real-time, high-fidelity novel view synthesis. Nonetheless, these models have significant storage requirements for large and medium-sized scenes, hindering their deployment over cloud and streaming services. Some of the most recent progressive compression techniques for these models rely on progressive masking and scalar quantization techniques to reduce the bitrate of Gaussian attributes using spatial context models. While effective, scalar quantization may not optimally capture the correlations of high-dimensional feature vectors, which can potentially limit the rate-distortion performance. In this work, we introduce a novel progressive codec for 3D Gaussian Splatting that replaces traditional methods with a more powerful Residual Vector Quantization approach to compress the primitive features. Our key contribution is an auto-regressive entropy model, guided by a multi-resolution hash grid, that accurately predicts the conditional probability of each successive transmitted index, allowing for coarse and refinement layers to be compressed with high efficiency.
[100] Comparative Analysis of Custom CNN Architectures versus Pre-trained Models and Transfer Learning: A Study on Five Bangladesh Datasets
Ibrahim Tanvir,Alif Ruslan,Sartaj Solaiman
Main category: cs.CV
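TL;DR: Compares custom CNNs with pre-trained models (ResNet-18, VGG-16) on five image classification datasets from Bangladesh and finds that transfer learning with fine-tuning clearly beats both training from scratch and feature extraction, especially on complex tasks with little data.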
Details
Motivation: Investigates which deep learning strategy to choose for region-specific image classification in resource-constrained settings, balancing performance, model size, and training efficiency. Method: Compares a custom CNN with pre-trained models (ResNet-18, VGG-16) under three regimes (training from scratch, feature extraction, and transfer learning with fine-tuning) and evaluates them on five local datasets. Result: Transfer learning with fine-tuning works best, with accuracy gains of 3% to 76%; ResNet-18 reaches 100% accuracy on the Road Damage BD dataset; the custom CNN has fewer parameters (3.4M) and trains more efficiently but performs worse on complex tasks. Conclusion: For complex tasks with limited labeled data, fine-tuning a pre-trained model is the better choice, while a small custom CNN is preferable for simple tasks or tight resource budgets. Abstract: This study presents a comprehensive comparative analysis of custom-built Convolutional Neural Networks (CNNs) against popular pre-trained architectures (ResNet-18 and VGG-16) using both feature extraction and transfer learning approaches. We evaluated these models across five diverse image classification datasets from Bangladesh: Footpath Vision, Auto Rickshaw Detection, Mango Image Classification, Paddy Variety Recognition, and Road Damage Detection. Our experimental results demonstrate that transfer learning with fine-tuning consistently outperforms both custom CNNs built from scratch and feature extraction methods, achieving accuracy improvements ranging from 3% to 76% across different datasets. Notably, ResNet-18 with fine-tuning achieved perfect 100% accuracy on the Road Damage BD dataset. While custom CNNs offer advantages in model size (3.4M parameters vs. 11-134M for pre-trained models) and training efficiency on simpler tasks, pre-trained models with transfer learning provide superior performance, particularly on complex classification tasks with limited training data. This research provides practical insights for practitioners in selecting appropriate deep learning approaches based on dataset characteristics, computational resources, and performance requirements.
[101] PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache
Kunyang Li,Mubarak Shah,Yuzhang Shang
Main category: cs.CV
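TL;DR: Proposes PackCache, a training-free KV-cache management method that improves the inference efficiency of unified autoregressive video generation models. Based on an analysis of the spatiotemporal properties of the KV-cache in multimodal tasks, it uses condition anchoring, cross-frame decay modeling, and position-embedding preservation to substantially speed up long-sequence generation.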
Details
Motivation: In unified autoregressive models, the KV-cache grows linearly with generation length during long-sequence video generation and becomes the main inference bottleneck, so an efficient cache management mechanism is needed. Method: Proposes PackCache with three core mechanisms: (1) condition anchoring, which preserves key semantic information; (2) a cross-frame attention decay model based on temporal distance, which allocates cache resources dynamically; and (3) spatially preserving position embedding, which keeps structure consistent. The method is training-free and compacts the KV-cache dynamically. Result: End-to-end generation is 1.7-2.2x faster on 48-frame sequences; on the most expensive final four frames, speedups reach 2.6x on A40 and 3.7x on H200. Conclusion: PackCache relieves the efficiency bottleneck caused by KV-cache growth in unified autoregressive models, substantially speeds up long-video inference, and provides an efficient cache management scheme for multimodal long-sequence modeling. Abstract: A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation.
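To make the notion of allocating cache budget by temporal distance more concrete, here is a minimal Python sketch. It is not PackCache's algorithm: the exponential decay form, the proportional split of a fixed budget, the `decay` parameter, and both function names are assumptions chosen for illustration, since the abstract only states that budget follows temporal distance and that condition tokens are kept as anchors.

```python
import math


def allocate_frame_budget(num_past_frames, tokens_per_frame, total_budget, decay=0.5):
    """Return how many KV entries to keep per past frame (index 0 is the oldest frame)."""
    # Temporal distance from the frame being generated: the newest past frame has distance 1.
    distances = [num_past_frames - i for i in range(num_past_frames)]
    # Assumed decay form: weight shrinks exponentially with distance.
    weights = [math.exp(-decay * d) for d in distances]
    total_weight = sum(weights)
    # Proportional share of the budget, rounded down and capped at the frame size.
    return [min(tokens_per_frame, int(total_budget * w / total_weight)) for w in weights]


def kept_cache_size(num_condition_tokens, frame_budgets):
    """Condition (text and conditioning-image) tokens are always kept as anchors."""
    return num_condition_tokens + sum(frame_budgets)


budgets = allocate_frame_budget(num_past_frames=4, tokens_per_frame=100, total_budget=200)
print(budgets, kept_cache_size(32, budgets))
```

In this sketch, recent frames receive a larger share of the budget than older ones, and the total never exceeds `total_budget` plus the anchor tokens. Which entries to keep inside each frame is left open here.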
Notably, for the final four frames (the portion most impacted by the progressively expanding KV-cache and thus the most expensive segment of the clip), PackCache delivers 2.6x and 3.7x acceleration on A40 and H200, respectively, for 48-frame videos.
[102] Combining facial videos and biosignals for stress estimation during driving
Paraskevi Valergaki,Vassilis C. Nicodemou,Iason Oikonomidis,Antonis Argyros,Anastasios Roussos
Main category: cs.CV
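TL;DR: Proposes a Transformer-based temporal modeling framework that recognizes stress in driving scenarios from EMOCA-derived 3D facial expression and pose coefficients, fusing physiological signals or gaze information through cross-modal attention to reach strong performance (AUROC 92%).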
Details
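Motivation: Stress is subjective and facial expressions can be voluntarily controlled, so reliably recognizing stress from facial video is hard. Existing methods mostly rely on Facial Action Units (FAUs), and the role of disentangled 3D facial geometry is underexplored; the paper therefore investigates how well 3D facial dynamics support stress recognition. Method: Extracts 3D expression and pose coefficients from facial video with EMOCA and uses hypothesis tests to analyze how they change between baseline and stressor phases; then builds a Transformer-based temporal modeling framework and compares unimodal, early-fusion, and cross-modal attention strategies. Result: 41 3D facial coefficients change significantly during stress phases, comparable to physiological markers; cross-modal attention fusion of EMOCA and physiological signals performs best (AUROC 92%, accuracy 86.7%), and EMOCA-gaze fusion is also strong (AUROC 91.8%). Conclusion: 3D facial geometry dynamics carry reliable information about stress responses, and combining temporal modeling with cross-modal attention improves stress recognition, supporting the potential of the visual modality for non-invasive stress monitoring. Abstract: Reliable stress recognition from facial videos is challenging due to stress's subjective nature and voluntary facial control. While most methods rely on Facial Action Units, the role of disentangled 3D facial geometry remains underexplored. We address this by analyzing stress during distracted driving using EMOCA-derived 3D expression and pose coefficients. Paired hypothesis tests between baseline and stressor phases reveal that 41 of 56 coefficients show consistent, phase-specific stress responses comparable to physiological markers. Building on this, we propose a Transformer-based temporal modeling framework and assess unimodal, early-fusion, and cross-modal attention strategies. Cross-Modal Attention fusion of EMOCA and physiological signals achieves best performance (AUROC 92%, Accuracy 86.7%), with EMOCA-gaze fusion also competitive (AUROC 91.8%). This highlights the effectiveness of temporal modeling and cross-modal attention for stress recognition.
[103] Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection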
Maxim Clouser,Kia Khezeli,John Kalantari
Main category: cs.CV
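TL;DR: By inserting low-rank adaptation (LoRA) modules into the FLUX.1 model and fine-tuning on only 100 paired images, the work achieves cross-spectral image translation from RGB to infrared (IR) and synthetic aperture radar (SAR), and the resulting synthetic data markedly improves object detection.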
Details
Motivation: Vision foundation models are trained mainly on RGB images, yet many safety-critical applications depend on non-visible modalities such as infrared and SAR. How to transfer an RGB-pretrained model to these modalities with only a few paired samples is an important question. Method: Starting from FLUX.1 Kontext, a flow-matching foundation model, inserts LoRA modules and fine-tunes them on only 100 image pairs on KAIST (RGB to IR) and M4-SAR (RGB to SAR) for cross-spectral translation; uses LPIPS to predict downstream detection performance and trains object detectors on the synthetic images. Result: With only 100 training pairs, the model generates pixel-aligned IR/SAR images; LPIPS correlates strongly with downstream mAP and serves as an effective proxy; synthetic IR generated from external RGB data improves pedestrian detection on KAIST, and synthetic SAR combined with a small amount of real SAR significantly boosts infrastructure detection on M4-SAR. Conclusion: Few-shot LoRA adaptation of a flow-matching foundation model is a viable path toward foundation-style support for non-visible modalities and shows that a single RGB-pretrained model can be extended to multiple modalities. Abstract: Foundation models for vision are predominantly trained on RGB data, while many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator using only a few co-measured examples, and whether the resulting synthetic data can enhance downstream detection. Starting from FLUX.1 Kontext, we insert low-rank adaptation (LoRA) modules and fine-tune them on just 100 paired images per domain for two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. The adapted model translates RGB images into pixel-aligned IR/SAR, enabling us to reuse existing bounding boxes and train object detection models purely in the target modality. Across a grid of LoRA hyperparameters, we find that LPIPS computed on only 50 held-out pairs is a strong proxy for downstream performance: lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR test data. Using the best LPIPS-selected LoRA adapter, synthetic IR from external RGB datasets (LLVIP, FLIR ADAS) improves KAIST IR pedestrian detection, and synthetic SAR significantly boosts infrastructure detection on M4-SAR when combined with limited real SAR.
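For readers unfamiliar with the low-rank adaptation step mentioned in this abstract, the standard LoRA formulation augments a frozen weight matrix W with a scaled low-rank product, W' = W + (alpha / r) B A, where A is r x d_in, B is d_out x r, and only A and B are trained. The sketch below shows this on tiny matrices in plain Python. The matrix sizes, the value of alpha, and the helper names are illustrative assumptions and say nothing about the adapter configuration the authors used with FLUX.1 Kontext.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(A)
    scale = alpha / r
    BA = matmul(B, A)  # d_out x d_in, rank at most r
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]


def trainable_params(d_out, d_in, r):
    """Number of parameters in A and B, compared with d_out * d_in for full fine-tuning."""
    return r * (d_in + d_out)


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2 x 3 weight
A = [[1.0, 2.0, 3.0]]                   # 1 x 3, trainable
B_zero = [[0.0], [0.0]]                 # 2 x 1, zero at initialization
B_trained = [[1.0], [0.0]]              # hypothetical values after some training

print(lora_weight(W, A, B_zero, alpha=2.0))
print(lora_weight(W, A, B_trained, alpha=2.0))
print(trainable_params(4096, 4096, 16), 4096 * 4096)
```

Initializing B to zero, as is conventional for LoRA, leaves the adapted weight equal to W before training, so fine-tuning starts from the pretrained model's behavior. The last line illustrates why the approach is cheap: a rank-16 adapter on a 4096 x 4096 layer trains well under 1% of that layer's parameters.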
Our results suggest that few-shot LoRA adaptation of flow-matching foundation models is a promising path toward foundation-style support for non-visible modalities.
[104] Performance Analysis of Image Classification on Bangladeshi Datasets
Mohammed Sami Khan,Fabiha Muniat,Rowzatul Zannat
Main category: cs.CV
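TL;DR: Compares a custom CNN trained from scratch with pre-trained CNNs used via transfer learning (VGG-16, ResNet-50, MobileNet) on image classification. Pre-trained models achieve better accuracy and faster convergence, especially with limited data, while the custom CNN performs slightly worse but has fewer parameters and is computationally cheaper.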
Details
Motivation: Examines the trade-off between designing a custom CNN and adopting pre-trained models for image classification, to guide model selection in practice. Method: Builds a custom CNN trained from scratch and compares it under identical experimental settings with VGG-16, ResNet-50, and MobileNet used via transfer learning, evaluating accuracy, precision, recall, and F1-score. Result: Pre-trained models achieve better classification accuracy and faster convergence, particularly when training data is scarce; the custom CNN has far fewer parameters and lower computational complexity while remaining competitive. Conclusion: Pre-trained architectures lead on performance, whereas the custom CNN is lighter; the study exposes the trade-offs among model complexity, performance, and computational efficiency and offers practical guidance for choosing an architecture. Abstract: Convolutional Neural Networks (CNNs) have demonstrated remarkable success in image classification tasks; however, the choice between designing a custom CNN from scratch and employing established pre-trained architectures remains an important practical consideration. In this work, we present a comparative analysis of a custom-designed CNN and several widely used deep learning architectures, including VGG-16, ResNet-50, and MobileNet, for an image classification task. The custom CNN is developed and trained from scratch, while the popular architectures are employed using transfer learning under identical experimental settings. All models are evaluated using standard performance metrics such as accuracy, precision, recall, and F1-score. Experimental results show that pre-trained CNN architectures consistently outperform the custom CNN in terms of classification accuracy and convergence speed, particularly when training data is limited. However, the custom CNN demonstrates competitive performance with significantly fewer parameters and reduced computational complexity. This study highlights the trade-offs between model complexity, performance, and computational efficiency, and provides practical insights into selecting appropriate CNN architectures for image classification problems.
[105] 3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation
Jusheng Zhang,Yijia Fan,Zimo Wen,Jian Wang,Keze Wang
Main category: cs.CV
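TL;DR: Proposes Tri-MARF, a framework that improves large-scale 3D object annotation through tri-modal inputs and multi-agent collaboration, substantially outperforming existing methods on several datasets.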
Details
Motivation: Existing single-model approaches struggle with the spatial complexity, occlusion, and viewpoint inconsistency involved in 3D object annotation. Method: Combines three input modalities (2D multi-view images, textual descriptions, and 3D point clouds) in a multi-agent collaborative architecture consisting of a vision-language model agent, an information aggregation agent, and a gating agent. Result: On Objaverse LVIS, Objaverse XL, and ABO, the method achieves a CLIPScore of 88.7, ViLT R@5 of 45.2 and 43.8, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU. Conclusion: Tri-MARF markedly improves the quality and efficiency of 3D object annotation and offers an effective solution for large-scale 3D annotation. Abstract: Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation presents challenges beyond 2D annotation, including spatial complexity, occlusion, and viewpoint inconsistency. Existing approaches based on single models often struggle to address these issues effectively. We propose Tri-MARF, a novel framework that integrates tri-modal inputs, including 2D multi-view images, textual descriptions, and 3D point clouds, within a multi-agent collaborative architecture to enhance large-scale 3D annotation. Tri-MARF consists of three specialized agents: a vision-language model agent for generating multi-view descriptions, an information aggregation agent for selecting optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined captioning. Extensive experiments on Objaverse LVIS, Objaverse XL, and ABO demonstrate that Tri-MARF substantially outperforms existing methods, achieving a CLIPScore of 88.7 compared to prior state-of-the-art methods, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.
[106] From Preoperative CT to Postmastoidectomy Mesh Construction: Mastoidectomy Shape Prediction for Cochlear Implant Surgery
Yike Zhang,Eduardo Davalos,Dingjie Su,Ange Lou,Jack Noble
Main category: cs.CV
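TL;DR: Proposes a hybrid framework combining self-supervised and weakly-supervised learning to predict the mastoidectomy region for cochlear implant surgery from preoperative CT scans without human annotations, substantially improving prediction performance.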
Details
Motivation: Because human-annotated ground truth is scarce, deep learning work on mastoidectomy shape prediction is limited; the paper aims to fill this gap. Method: Proposes a hybrid self-supervised and weakly-supervised learning framework, with a 3D T-distribution loss, that predicts the mastoidectomy region directly from full preoperative CT scans. Result: The method reaches a mean Dice score of 0.72 on the complex, boundary-less mastoidectomy shape prediction task, surpassing state-of-the-art approaches. Conclusion: This is the first study to combine self-supervised and weakly-supervised learning for mastoidectomy prediction, offering an efficient and robust solution for cochlear implant surgical planning. Abstract: Cochlear Implant (CI) surgery treats severe hearing loss by inserting an electrode array into the cochlea to stimulate the auditory nerve. An important step in this procedure is mastoidectomy, which removes part of the mastoid region of the temporal bone to provide surgical access. Accurate mastoidectomy shape prediction from preoperative imaging improves pre-surgical planning, reduces risks, and enhances surgical outcomes. Despite its importance, there are limited deep-learning-based studies regarding this topic due to the challenges of acquiring ground-truth labels. We address this gap by investigating self-supervised and weakly-supervised learning models to predict the mastoidectomy region without human annotations. We propose a hybrid self-supervised and weakly-supervised learning framework to predict the mastoidectomy region directly from preoperative CT scans, where the mastoid remains intact. Our hybrid method achieves a mean Dice score of 0.72 when predicting the complex and boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance. The method provides groundwork for constructing 3D postmastoidectomy surfaces directly from the corresponding preoperative CT scans. To our knowledge, this is the first work that integrates self-supervised and weakly-supervised learning for mastoidectomy shape prediction, offering a robust and efficient solution for CI surgical planning while leveraging 3D T-distribution loss in weakly-supervised medical imaging.
[107] CRUNet-MR-Univ: A Foundation Model for Diverse Cardiac MRI Reconstruction
Donghang Lyu,Marius Staring,Hildo Lamb,Mariya Doneva
Main category: cs.CV
TL;DR: Proposes CRUNet-MR-Univ, a general-purpose foundation model for cardiac MRI reconstruction that copes with variation across many imaging conditions.
Details
Motivation: Existing deep-learning methods for cardiac MRI reconstruction generalize poorly and struggle with variation in image contrast, sampling patterns, scanner vendors, and other factors.
Method: Builds a unified foundation model, CRUNet-MR-Univ, that uses spatio-temporal correlations and prompt-based priors to handle diverse cardiac MRI scans.
Result: The method outperforms baseline methods across a range of settings and shows stronger robustness and generalization.
Conclusion: CRUNet-MR-Univ has the potential to serve as a general-purpose cardiac MRI reconstruction model for a broad range of real clinical scenarios.
Abstract: In recent years, deep learning has attracted increasing attention in the field of Cardiac MRI (CMR) reconstruction due to its superior performance over traditional methods, particularly in handling higher acceleration factors, highlighting its potential for real-world clinical applications. However, current deep learning methods remain limited in generalizability. CMR scans exhibit wide variability in image contrast, sampling patterns, scanner vendors, anatomical structures, and disease types. Most existing models are designed to handle only a single or narrow subset of these variations, leading to performance degradation when faced with distribution shifts. Therefore, it is beneficial to develop a unified model capable of generalizing across diverse CMR scenarios. To this end, we propose CRUNet-MR-Univ, a foundation model that leverages spatio-temporal correlations and prompt-based priors to effectively handle the full diversity of CMR scans. Our approach consistently outperforms baseline methods across a wide range of settings, highlighting its effectiveness and promise.
[108] Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
Xingjian Diao,Zheyuan Liu,Chunhui Zhang,Weiyi Wu,Keyi Kong,Lin Shi,Kaize Ding,Soroush Vosoughi,Jiang Gui
Main category: cs.CV
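TL;DR: This paper proposes GPRO, a gated perception-reasoning optimization framework that dynamically selects among a fast path, a perception path, and a reasoning path to address overthinking and perception failures in large vision-language models, significantly improving accuracy and efficiency on multiple benchmarks.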
Details
Motivation: Chain-of-thought mechanisms strengthen a model's reasoning but tend to over-reason on simple questions and overlook the underlying bottleneck of visual perception failures. The authors argue that reasoning errors stem more often from perceptual deficiencies than from insufficient reasoning.
Method: Proposes GPRO, a three-way dynamic compute-routing mechanism with a fast path, a slow perception path (re-examining the visual input), and a slow reasoning path (self-reflection); derives failure-attribution supervision from about 790k samples using teacher models, and trains the controller with multi-objective reinforcement learning to balance accuracy against computational cost.
Result: Experiments on five benchmarks show that GPRO significantly improves accuracy and inference efficiency over existing slow-thinking methods while generating shorter responses.
Conclusion: Stable reasoning depends on reliable low-level visual grounding; by distinguishing perception errors from reasoning errors, GPRO achieves efficient and accurate dynamic reasoning and provides an effective meta-reasoning control mechanism for LVLMs.
Abstract: Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty.
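The three-way routing and the accuracy-versus-cost trade-off described above can be illustrated with a toy sketch. This is not the GPRO controller itself: the path costs, the `cost_weight` value, and the function names `route` and `scalarized_reward` are all hypothetical, and a single weighted sum is only one simple way to combine two objectives.

```python
import math

# Hypothetical per-path compute costs (arbitrary units); not taken from the paper.
PATH_COSTS = {"fast": 1.0, "perception": 4.0, "reasoning": 6.0}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_logits):
    """Pick the path whose gate probability is highest for this step."""
    names = list(PATH_COSTS)
    probs = softmax(gate_logits)
    best = max(range(len(names)), key=lambda i: probs[i])
    return names[best], probs


def scalarized_reward(correct, paths_taken, cost_weight=0.01):
    """Accuracy term minus a weighted compute penalty (an assumed trade-off)."""
    cost = sum(PATH_COSTS[p] for p in paths_taken)
    return (1.0 if correct else 0.0) - cost_weight * cost
```

In this sketch a correct answer that used mostly the fast path earns a higher reward than a correct answer that repeatedly took the slow paths, which is the kind of pressure against overthinking the abstract describes.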
Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.
[109] UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving
Zhexiao Xiong,Xin Ye,Burhan Yaman,Sheng Cheng,Yiren Lu,Jingru Luo,Nathan Jacobs,Liu Ren
Main category: cs.CV
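TL;DR: UniDrive-WM is a unified world model based on a vision-language model (VLM) that jointly performs driving-scene understanding, trajectory planning, and future image generation, significantly improving trajectory accuracy and safety on the Bench2Drive benchmark.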
Details
Motivation: Existing autonomous driving systems usually separate perception, prediction, and planning, which fragments information; this paper aims to strengthen the synergy among the three through a unified framework.
Method: Proposes UniDrive-WM, which uses a VLM for both scene understanding and trajectory planning and generates future images conditioned on the trajectory, forming a feedback loop that iteratively refines performance; compares the effect of discrete and continuous representations on future prediction.
Result: On Bench2Drive, L2 trajectory error drops by 5.9% and collision rate by 9.2% relative to the previous best method, and the model generates high-fidelity future frames.
Conclusion: Tightly integrating VLM-driven reasoning, planning, and generative world modeling effectively improves the performance and safety of autonomous driving systems.
Abstract: World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .
[110] Vision-Language Agents for Interactive Forest Change Analysis
James Brock,Ce Zhang,Nantheera Anantrasirichai
Main category: cs.CV
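TL;DR: Proposes an LLM-driven agent system that incorporates vision-language models for forest change analysis in remote sensing imagery, supports natural-language queries, and releases the Forest-Change dataset of bi-temporal imagery with multi-granularity semantic captions.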
Details
Motivation: Existing methods for remote sensing image change interpretation fall short on pixel-level change detection and semantic captioning, and lack interactive analysis tools that integrate large language models; their performance is especially limited for monitoring complex forest dynamics.
Method: Builds a multi-level change interpretation (MCI) vision-language backbone and introduces a large language model (LLM) for task orchestration, enabling joint change detection and semantic captioning for remote sensing imagery; also releases the Forest-Change dataset, which contains bi-temporal imagery, pixel-level change masks, and multi-granularity semantic annotations.
Result: The system reaches 67.10% mIoU and a BLEU-4 score of 40.17 on the Forest-Change dataset, and 88.13% mIoU and a BLEU-4 score of 34.41 on the LEVIR-MCI-Trees subset, validating its effectiveness for change detection and semantic captioning.
Conclusion: LLM-driven interactive remote sensing change interpretation can improve the accessibility, interpretability, and efficiency of forest change analysis and offers a new paradigm for intelligent remote sensing interpretation.
Abstract: Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of the LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis.
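For readers unfamiliar with the mIoU metric reported above, the following is a minimal sketch of how mean intersection-over-union is commonly computed from per-pixel class labels. The function name `mean_iou` and the choice to skip classes absent from both prediction and target are assumptions made for illustration; the paper's exact evaluation protocol is not stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target.

    pred and target are flat sequences of integer class labels, one per pixel.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both masks (an assumed convention)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Tiny binary change mask: 0 = no change, 1 = change.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their mean, 7/12.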
All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.
[111] TokenSeg: Efficient 3D Medical Image Segmentation via Hierarchical Visual Token Compression
Sen Zeng,Hong Zhou,Zheng Zhu,Yang Liu
Main category: cs.CV
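TL;DR: This paper proposes TokenSeg, a boundary-aware sparse token representation framework for efficient 3D medical image segmentation that uses a multi-scale encoder, a boundary-aware tokenizer, and a sparse-to-dense decoder to substantially cut compute cost while maintaining high accuracy.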
Details
Motivation: 3D medical image segmentation is computationally expensive: voxel processing grows cubically and homogeneous regions incur redundant computation, so a more efficient segmentation method is needed.
Method: Designs a multi-scale hierarchical encoder that extracts 400 candidate tokens; introduces a boundary-aware tokenizer that combines VQ-VAE quantization with importance scoring to select 100 key tokens (over 60% of which lie near tumor boundaries); and develops a sparse-to-dense decoder that recovers the full segmentation mask through token reprojection, progressive upsampling, and skip connections.
Result: Reaches 94.49% Dice and 89.61% IoU on a 3D breast DCE-MRI dataset of 960 cases while reducing GPU memory and inference latency by 64% and 68%, respectively; also delivers the best performance on the MSD cardiac and brain MRI datasets, validating its generalization.
Conclusion: Through anatomically informed sparse representation, TokenSeg greatly improves efficiency while preserving 3D medical image segmentation accuracy, and shows good generality and application potential.
Abstract: Three-dimensional medical image segmentation is a fundamental yet computationally demanding task due to the cubic growth of voxel processing and the redundant computation on homogeneous regions. To address these limitations, we propose TokenSeg, a boundary-aware sparse token representation framework for efficient 3D medical volume segmentation. Specifically, (1) we design a multi-scale hierarchical encoder that extracts 400 candidate tokens across four resolution levels to capture both global anatomical context and fine boundary details; (2) we introduce a boundary-aware tokenizer that combines VQ-VAE quantization with importance scoring to select 100 salient tokens, over 60% of which lie near tumor boundaries; and (3) we develop a sparse-to-dense decoder that reconstructs full-resolution masks through token reprojection, progressive upsampling, and skip connections. Extensive experiments on a 3D breast DCE-MRI dataset comprising 960 cases demonstrate that TokenSeg achieves state-of-the-art performance with 94.49% Dice and 89.61% IoU, while reducing GPU memory and inference latency by 64% and 68%, respectively. To verify the generalization capability, our evaluations on MSD cardiac and brain MRI benchmark datasets demonstrate that TokenSeg consistently delivers optimal performance across heterogeneous anatomical structures.
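The Dice and IoU figures quoted above are overlap metrics on binary masks. As a minimal sketch (function names are illustrative, and treating two empty masks as perfect agreement is an assumed convention), they can be computed as follows, together with the per-mask identity IoU = Dice / (2 - Dice):

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two binary masks given as flat 0/1 sequences."""
    inter = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - inter
    if union == 0:  # both masks empty: treated here as perfect agreement
        return 1.0, 1.0
    dice = 2 * inter / (pred_sum + target_sum)
    return dice, inter / union


def iou_from_dice(dice):
    """Per-mask identity relating the two metrics: IoU = Dice / (2 - Dice)."""
    return dice / (2 - dice)


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

The identity holds for a single mask pair, not for averages over a dataset. Applying it to the reported 94.49% Dice gives roughly 0.896, close to but not exactly the reported 89.61% IoU, which is what one would expect if both numbers are means over many cases.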
These results highlight the effectiveness of anatomically informed sparse representation for accurate and efficient 3D medical image segmentation.
[112] FaceRefiner: High-Fidelity Facial Texture Refinement with Differentiable Rendering-based Style Transfer
Chengyang Li,Baoping Cheng,Yao Cheng,Haocheng Zhang,Renshuai Liu,Yinglin Zheng,Jing Liao,Xuan Cheng
Main category: cs.CV
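TL;DR: This paper proposes FaceRefiner, a style-transfer-based facial texture refinement method that treats the 3D sampled texture as style and an existing generation result as content, and uses differentiable rendering to transfer information at multiple levels, improving the detail, structure, and identity consistency of textures generated from in-the-wild images.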
Details
Motivation: Existing facial texture generation methods are confined to a space constructed from the training data and struggle to keep details, structure, and identity consistent on real-world images, so a refinement method with better generalization is needed.
Method: Proposes FaceRefiner, a style-transfer framework that uses the 3D sampled texture as the style source and the generated result as the content source, and introduces differentiable rendering to transfer information at high, middle, and low (pixel) levels, preserving more original detail in the visible face regions in particular.
Result: Experiments on the Multi-PIE, CelebA, and FFHQ datasets show that the method improves texture quality and identity preservation compared with state-of-the-art methods.
Conclusion: By combining multi-level style transfer with differentiable rendering, FaceRefiner addresses the generalization problem of existing methods on real-world images and strengthens the fidelity and identity consistency of generated facial textures.
Abstract: Recent facial texture generation methods prefer to use deep networks to synthesize image content and then fill in the UV map, thus generating a compelling full texture from a single image. Nevertheless, the synthesized texture UV map usually comes from a space constructed by the training data or the 2D face generator, which limits the methods' generalization ability for in-the-wild input images. Consequently, their facial details, structures and identity may not be consistent with the input. In this paper, we address this issue by proposing a style transfer-based facial texture refinement method named FaceRefiner. FaceRefiner treats the 3D sampled texture as style and the output of a texture generation method as content. The photo-realistic style is then expected to be transferred from the style image to the content image. Different from current style transfer methods that only transfer high and middle level information to the result, our style transfer method integrates differentiable rendering to also transfer low level (or pixel level) information in the visible face regions. The main benefit of such multi-level information transfer is that the details, structures and semantics in the input can thus be well preserved. Extensive experiments on the Multi-PIE, CelebA and FFHQ datasets demonstrate that our refinement method can improve the texture quality and the face identity preserving ability, compared with state-of-the-art methods.
[113] All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction
Ziyou Jiang,Mingyang Li,Junjie Wang,Yuekai Huang,Jie Huang,Zhiyuan Chang,Zhaoyang Li,Qing Wang
Main category: cs.CV
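TL;DR: This paper proposes RepMD, a detection method for ever-shifting harmful memes based on design concept reproduction, which builds a Design Concept Graph (DCG) and uses it with a multimodal large language model to identify memes that shift in type and evolve over time.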
Details
Motivation: Harmful memes in Internet communities keep changing in type and evolving over time, which traditional methods handle poorly. The authors observe that different harmful memes share common design principles and aim to improve detection by mining these invariant design concepts.
Method: Draws on attack trees to define the Design Concept Graph (DCG), derives the DCG from historical memes through design step reproduction and graph pruning, and uses the DCG to guide a Multimodal Large Language Model (MLLM) in detecting harmful memes.
Result: RepMD reaches a detection accuracy of 81.1% and shows only a small performance drop on type-shifting and temporal-evolving memes; human evaluation shows it improves the efficiency of human discovery of harmful memes by 15 to 30 seconds per meme.
Conclusion: A method based on design concept reproduction captures the underlying regularities of harmful memes and improves both detection of constantly evolving memes and the efficiency of human review.
Abstract: Harmful memes in Internet communities are ever-shifting and are difficult to analyze due to their type-shifting and temporal-evolving nature. Although these memes are shifting, we find that different memes may share invariant principles, i.e., the underlying design concept of malicious users, which can help us analyze why these memes are harmful. In this paper, we propose RepMD, an ever-shifting harmful meme detection method based on design concept reproduction. We first refer to the attack tree to define the Design Concept Graph (DCG), which describes steps that people may take to design a harmful meme. Then, we derive the DCG from historical memes with design step reproduction and graph pruning. Finally, we use the DCG to guide the Multimodal Large Language Model (MLLM) to detect harmful memes. The evaluation results show that RepMD achieves the highest accuracy, at 81.1%, and shows only slight accuracy decreases when generalized to type-shifting and temporal-evolving memes. Human evaluation shows that RepMD can improve the efficiency of human discovery of harmful memes, with 15 to 30 seconds per meme.
[114] 3D Conditional Image Synthesis of Left Atrial LGE MRI from Composite Semantic Masks
Yusri Al-Sanaani,Rebecca Thornhill,Sreeraman Rajan
Main category: cs.CV
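TL;DR: This study uses 3D conditional generative models (SPADE-LDM in particular) to synthesize high-fidelity LGE MRI images that augment scarce training data and significantly improve left atrial segmentation performance.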
Details
Motivation: Because data is scarce and the anatomy is complex, machine-learning-based segmentation of the left atrial wall and endocardium remains challenging, and effective data augmentation methods are needed.
Method: Proposes a pipeline for synthesizing 3D LGE MRI that uses three 3D conditional generators (Pix2Pix GAN, SPADE-GAN, SPADE-LDM) to generate images from semantic label maps combining expert annotations with unsupervised tissue clusters, and evaluates their effect on the downstream segmentation task.
Result: SPADE-LDM produces the most realistic images (FID = 4.063), outperforming the GAN models; with synthetic data added, the 3D U-Net Dice score for the left atrial cavity rises from 0.908 to 0.936 (p < 0.05).
Conclusion: Label-conditioned 3D image synthesis can effectively augment scarce medical imaging data and improve cardiac structure segmentation accuracy, especially when samples are limited.
Abstract: Segmentation of the left atrial (LA) wall and endocardium from late gadolinium-enhanced (LGE) MRI is essential for quantifying atrial fibrosis in patients with atrial fibrillation. The development of accurate machine learning-based segmentation models remains challenging due to the limited availability of data and the complexity of anatomical structures. In this work, we investigate 3D conditional generative models as a potential solution for augmenting scarce LGE training data and improving LA segmentation performance. We develop a pipeline to synthesize high-fidelity 3D LGE MRI volumes from composite semantic label maps combining anatomical expert annotations with unsupervised tissue clusters, using three 3D conditional generators (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM). The synthetic images are evaluated for realism and their impact on downstream LA segmentation. SPADE-LDM generates the most realistic and structurally accurate images, achieving an FID of 4.063 and surpassing GAN models, which have FIDs of 40.821 and 7.652 for Pix2Pix and SPADE-GAN, respectively. When augmented with synthetic LGE images, the Dice score for LA cavity segmentation with a 3D U-Net model improved from 0.908 to 0.936, showing a statistically significant improvement (p < 0.05) over the baseline. These findings demonstrate the potential of label-conditioned 3D synthesis to enhance the segmentation of under-represented cardiac structures.
[115] MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing
Zihao Lin,Wanrong Zhu,Jiuxiang Gu,Jihyung Kil,Christopher Tensmeyer,Lin Zhang,Shilong Liu,Ruiyi Zhang,Lifu Huang,Vlad I. Morariu,Tong Sun
Main category: cs.CV
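TL;DR: This paper proposes MiLDEAgent, a reasoning-based framework for editing multi-layer design documents that pairs an RL-trained multimodal reasoner with an image editor to make fine-grained, layer-aware edits to documents such as posters, and builds the MiLDEBench benchmark of over 20K samples and the MiLDEEval evaluation protocol. Experiments show the method significantly outperforms open-source models and approaches closed-source models.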
Details
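Motivation: Existing work focuses on single-layer image editing or multi-layer generation and cannot make fine-grained, deliberate edits to the individual layers of multi-layer design documents such as posters, nor reliably determine what to modify and where; an editing method with layer-aware reasoning is therefore needed.
Method: Proposes the MiLDEAgent framework, which includes a multimodal reasoner trained with reinforcement learning to understand and locate the layers to modify and an image editor that carries out the modifications; also builds the MiLDEBench dataset and the MiLDEEval evaluation protocol, which assesses instruction following, layout consistency, aesthetics, and text rendering.
Result: Experiments on 14 open-source and 2 closed-source models show that existing methods generally perform poorly: open-source models often fail to complete the task and closed-source models are prone to format errors. MiLDEAgent shows strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and matching closed-source models.
Conclusion: MiLDEAgent is the first to edit multi-layer design documents effectively and establishes the first strong baseline for this setting, confirming the key role of explicit layer-wise reasoning in complex document editing and providing data and methods for future research.
Abstract: Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions: instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations.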
In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.
[116] Detection of Deployment Operational Deviations for Safety and Security of AI-Enabled Human-Centric Cyber Physical Systems
Bernard Ngabonziza,Ayan Banerjee,Sandeep K. S. Gupta
Main category: cs.CV
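TL;DR: This paper discusses the safety and security challenges that AI-enabled human-centric cyber-physical systems face under uncertain operating conditions, proposes an evaluation framework to ensure safe operation, and demonstrates a personalized image-based detection technique using closed-loop blood glucose control for diabetic patients as an example.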
Details
Motivation: Uncertainty arising from human-system interaction can cause AI-enabled human-centric cyber-physical systems to violate their safety and security requirements, so it is necessary to study how to handle unknown states during operation.
Method: Proposes a framework for evaluating different safety and security assurance strategies, and designs a new personalized image-based technique for detecting unannounced meals in patients with Type 1 diabetes.
Result: Develops a framework for evaluating the safety strategies of AI-enabled human-centric cyber-physical systems during deployment, and demonstrates the proposed image-based technique on a practical case of detecting unannounced meals.
Conclusion: The framework and the personalized image-based technique help improve the safety and reliability of AI-enabled human-centric cyber-physical systems in uncertain environments, particularly in critical applications such as healthcare.
Abstract: In recent years, human-centric cyber-physical systems have increasingly involved artificial intelligence to enable knowledge extraction from sensor-collected data. Examples include medical monitoring and control systems, as well as autonomous cars. Such systems are intended to operate according to the protocols and guidelines for regular system operations. However, in many scenarios, such as closed-loop blood glucose control for Type 1 diabetics, self-driving cars, and monitoring systems for stroke diagnosis, the operations of such AI-enabled human-centric applications can expose them to cases in which their operational mode may be uncertain, for instance as a result of a human's interactions with the system. Such cases, in which the system is in uncertain conditions, can violate the system's safety and security requirements. This paper will discuss operational deviations that can lead these systems to operate in unknown conditions. We will then create a framework to evaluate different strategies for ensuring the safety and security of AI-enabled human-centric cyber-physical systems in operation deployment. Then, as an example, we show a personalized image-based novel technique for detecting the non-announcement of meals in closed-loop blood glucose control for Type 1 diabetics.
[117] HUR-MACL: High-Uncertainty Region-Guided Multi-Architecture Collaborative Learning for Head and Neck Multi-Organ Segmentation
Xiaoyu Liu,Siwen Wei,Linhao Qu,Mingyuan Pan,Chengsheng Zhang,Yonghong Shi,Zhijian Song
Main category: cs.CV
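TL;DR: Proposes HUR-MACL, a high-uncertainty region-guided multi-architecture collaborative learning model for head and neck multi-organ segmentation that combines a CNN, Vision Mamba, and Deformable CNN, uses a heterogeneous feature distillation loss to improve collaborative learning, and reaches SOTA performance on multiple datasets.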
Details
Motivation: Existing deep learning models perform poorly on small, complexly shaped head and neck organs, and hybrid architectures usually just concatenate features, which leads to functional overlap and limited accuracy.
Method: Proposes the HUR-MACL model: (1) a CNN adaptively identifies high-uncertainty regions; (2) Vision Mamba and Deformable CNN are used jointly for refined segmentation in those regions; (3) a heterogeneous feature distillation loss promotes collaborative learning between the two architectures in high-uncertainty regions.
Result: Achieves state-of-the-art segmentation performance on two public datasets and one private dataset.
Conclusion: Through multi-architecture collaboration and feature distillation targeted at high-uncertainty regions, HUR-MACL improves segmentation accuracy for small head and neck organs and shows strong potential for clinical application.
Abstract: Accurate segmentation of organs at risk in the head and neck is essential for radiation therapy, yet deep learning models often fail on small, complexly shaped organs. While hybrid architectures that combine different models show promise, they typically just concatenate features without exploiting the unique strengths of each component. This results in functional overlap and limited segmentation accuracy. To address these issues, we propose a high uncertainty region-guided multi-architecture collaborative learning (HUR-MACL) model for multi-organ segmentation in the head and neck. This model adaptively identifies high uncertainty regions using a convolutional neural network, and for these regions, Vision Mamba as well as Deformable CNN are utilized to jointly improve their segmentation accuracy. Additionally, a heterogeneous feature distillation loss is proposed to promote collaborative learning between the two architectures in high uncertainty regions to further enhance performance. Our method achieves SOTA results on two public datasets and one private dataset.
[118] HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment
Wenzhi Chen,Bo Hu,Leida Li,Lihuo He,Wen Lu,Xinbo Gao
Main category: cs.CV
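TL;DR: This paper proposes HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry that maps CLIP features into hyperbolic space and uses dynamic-supervision entailment modeling with an adaptive modulation regressor to assess image-text alignment more accurately.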
Details
Motivation: Existing methods rely on Euclidean space metrics, ignore the structured nature of semantic alignment, and cannot adapt to different samples, which makes it hard to assess accurately how well generated images align with text prompts.
Method: First extracts Euclidean features with CLIP and maps them to hyperbolic space; second, designs a dynamic-supervision entailment modeling mechanism that turns discrete entailment logic into continuous geometric structure supervision; finally, proposes an adaptive modulation regressor that uses hyperbolic geometric features to generate sample-level modulation parameters, dynamically calibrating Euclidean cosine similarity to predict the final score.
Result: HyperAlign achieves highly competitive performance on both single-database evaluation and cross-database generalization tasks.
Conclusion: The experiments validate the effectiveness of hyperbolic geometric modeling for image-text alignment assessment and indicate that it outperforms conventional Euclidean approaches.
Abstract: With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, while lacking adaptive capabilities for different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.
[119] Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning
Wentao Zhang,Lifei Wang,Lina Lu,MingKun Xu,Shangyang Li,Yanchao Yang,Tao Fang
Main category: cs.CV
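TL;DR: Proposes Agri-R1, a reasoning-enhanced large model for agriculture that automatically generates high-quality reasoning data through vision-language synthesis and LLM-based filtering and is trained with Group Relative Policy Optimization (GRPO) using a domain-specific reward function, significantly improving disease recognition, agricultural QA, and cross-domain generalization with few samples.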
Details
Motivation: Existing agricultural disease diagnosis models depend on large amounts of labeled data, lack interpretability, and generalize poorly; existing reasoning methods rely on costly expert annotation and struggle with open-ended, diverse agricultural questions. A reasoning-enhanced model with low annotation cost, high interpretability, and strong generalization is therefore needed.
Method: Proposes the Agri-R1 framework: (1) uses vision-language synthesis and LLM-based filtering to build high-quality reasoning data automatically from only 19% of the samples; (2) trains with Group Relative Policy Optimization (GRPO), with a reward function that combines domain lexicons and fuzzy matching to assess the correctness and linguistic flexibility of open-ended answers.
Result: On CDDMBench, the 3B-parameter Agri-R1 performs on par with 7B to 13B models: disease recognition accuracy improves by a relative 23.2%, agricultural knowledge QA by 33.3%, and the cross-domain generalization score by 26.10 points; ablations show that the synergy between reasoning data and GRPO is key to the gains, which are larger on complex questions.
Conclusion: Through automated reasoning data generation and domain-adapted reinforcement learning, Agri-R1 enables a small model to be efficient, interpretable, and strongly generalizing on open-ended agricultural QA, offering a low-cost, high-performance paradigm for agricultural AI.
Abstract: Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose Agri-R1, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a newly proposed reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.
[120] DB-MSMUNet: Dual Branch Multi-scale Mamba UNet for Pancreatic CT Scans Segmentation
Qiu Guan,Zhiqiang Yang,Dezhang Ye,Yang Chen,Xinli Xu,Ying Tang
Main category: cs.CV
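TL;DR: This paper proposes DB-MSMUNet, a dual-branch multi-scale Mamba UNet for segmenting the pancreas and its lesions in CT images. It combines deformable convolutions with multi-scale state space modeling to strengthen global context and local deformation modeling, uses a dual-decoder structure to sharpen boundaries and reconstruct small lesions, and outperforms existing methods on several public and clinical datasets.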
Details
Motivation: Accurate segmentation of the pancreas and its lesions in CT images is crucial for diagnosing and treating pancreatic cancer, but it remains challenging because of low tissue contrast, blurry boundaries, irregular shapes, and small lesions.
Method: Proposes DB-MSMUNet. The encoder uses a Multi-scale Mamba Module (MSMM) that fuses deformable convolutions with multi-scale state space modeling. In the dual-decoder structure, the edge decoder introduces an Edge Enhancement Path (EEP) to refine boundaries and the area decoder uses a Multi-layer Decoder (MLD) to recover details. Auxiliary Deep Supervision (ADS) at multiple scales improves multi-scale feature learning.
Result: On the NIH Pancreas, MSD, and clinical pancreatic tumor datasets, DB-MSMUNet reaches Dice coefficients of 89.47%, 87.59%, and 89.02%, respectively, outperforming most existing methods in segmentation accuracy, edge preservation, and cross-dataset robustness.
Conclusion: DB-MSMUNet performs well on pancreas and lesion segmentation, shows good generality and potential for clinical application, and handles the challenges of medical image segmentation under complex anatomy.
Abstract: Accurate segmentation of the pancreas and its lesions in CT scans is crucial for the precise diagnosis and treatment of pancreatic cancer. However, it remains a highly challenging task due to several factors such as low tissue contrast with surrounding organs, blurry anatomical boundaries, irregular organ shapes, and the small size of lesions. To tackle these issues, we propose DB-MSMUNet (Dual-Branch Multi-scale Mamba UNet), a novel encoder-decoder architecture designed specifically for robust pancreatic segmentation. The encoder is constructed using a Multi-scale Mamba Module (MSMM), which combines deformable convolutions and multi-scale state space modeling to enhance both global context modeling and local deformation adaptation. The network employs a dual-decoder design: the edge decoder introduces an Edge Enhancement Path (EEP) to explicitly capture boundary cues and refine fuzzy contours, while the area decoder incorporates a Multi-layer Decoder (MLD) to preserve fine-grained details and accurately reconstruct small lesions by leveraging multi-scale deep semantic features. Furthermore, Auxiliary Deep Supervision (ADS) heads are added at multiple scales to both decoders, providing more accurate gradient feedback and further enhancing the discriminative capability of multi-scale features. We conduct extensive experiments on three datasets: the NIH Pancreas dataset, the MSD dataset, and a clinical pancreatic tumor dataset provided by collaborating hospitals.
DB-MSMUNet achieves Dice Similarity Coefficients of 89.47%, 87.59%, and 89.02%, respectively, outperforming most existing state-of-the-art methods in terms of segmentation accuracy, edge preservation, and robustness across different datasets. These results demonstrate the effectiveness and generalizability of the proposed method for real-world pancreatic CT segmentation tasks.
[121] HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution
Yang Zou,Xingyue Zhu,Kaiqi Han,Jun Ma,Xingyuan Li,Zhiying Jiang,Jinyuan Liu
Main category: cs.CV
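TL;DR: This paper proposes HATIR, an infrared video super-resolution method based on a heat-aware diffusion model that jointly models atmospheric turbulence degradation and detail loss, and builds FLIR-IVSR, the first dataset for turbulent infrared VSR.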
Details
Motivation: Existing infrared video super-resolution methods overlook the modality gap between infrared and visible images and struggle to restore turbulence-induced distortions; cascading turbulence mitigation with super-resolution causes error propagation.
Method: Proposes HATIR, which injects heat-aware deformation priors into the diffusion sampling path; designs a Phasor-Guided Flow Estimator that exploits the consistent phasor responses of thermally active regions to provide turbulence-aware flow, and a Turbulence-Aware Decoder that uses gating to suppress unstable temporal cues and strengthen edge-aware feature aggregation.
Result: The method is validated on the self-built FLIR-IVSR dataset (640 scenes); HATIR outperforms existing methods in structural recovery and detail reconstruction, especially under nonuniform distortions.
Conclusion: HATIR is the first to jointly model the inverse process of turbulence degradation and resolution loss, advancing research on infrared video restoration in complex environments.
Abstract: Infrared video has been of great interest in visual tasks under challenging environments, but often suffers from severe atmospheric turbulence and compression degradation. Existing video super-resolution (VSR) methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions. Directly cascading turbulence mitigation (TM) algorithms with VSR methods leads to error propagation and accumulation due to the decoupled modeling of degradation between turbulence and resolution. We introduce HATIR, a Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution, which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, HATIR constructs a Phasor-Guided Flow Estimator, rooted in the physical principle that thermally active regions exhibit consistent phasor responses over time, enabling reliable turbulence-aware flow to guide the reverse diffusion process. To ensure the fidelity of structural recovery under nonuniform distortions, a Turbulence-Aware Decoder is proposed to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation via turbulence gating and structure-aware attention. We built FLIR-IVSR, the first dataset for turbulent infrared VSR, comprising paired LR-HR sequences from a FLIR T1050sc camera (1024 × 768) spanning 640 diverse scenes with varying camera and object motion conditions. This encourages future research in infrared VSR.
Project page: https://github.com/JZ0606/HATIR
[122] WebCryptoAgent: Agentic Crypto Trading with Web Informatics
Ali Kurban,Wei Luo,Liangyu Zuo,Zeyu Zhang,Renda Han,Zhaolu Kang,Hao Tang
Main category: cs.CV
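TL;DR: This paper proposes WebCryptoAgent, an agentic framework for cryptocurrency trading that integrates web information with market data to make robust decisions under high-frequency volatility.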
Details
Motivation: Existing trading systems struggle to process multi-source heterogeneous web information while also coping with sub-second price shocks, and lack effective filtering of noisy information and real-time risk response.
Method: Proposes the WebCryptoAgent framework, which assigns each modality of information to a dedicated agent and produces a unified evidence document to support confidence-calibrated reasoning; adopts a decoupled control architecture that separates hourly strategic reasoning from a second-level real-time risk model, enabling fast risk detection and intervention.
Result: Experiments on real cryptocurrency markets show that the method improves trading stability over baseline models, reduces spurious activity, and significantly improves tail-risk handling.
Conclusion: With modular agents and a decoupled control architecture, WebCryptoAgent combines multi-source information fusion with real-time risk control and offers a workable solution for automated trading in highly volatile environments.
Abstract: Cryptocurrency trading increasingly depends on timely integration of heterogeneous web information and market microstructure signals to support short-horizon decision making under extreme volatility. However, existing trading systems struggle to jointly reason over noisy multi-source web evidence while maintaining robustness to rapid price shocks at sub-second timescales. The first challenge lies in synthesizing unstructured web content, social sentiment, and structured OHLCV signals into coherent and interpretable trading decisions without amplifying spurious correlations, while the second challenge concerns risk control, as slow deliberative reasoning pipelines are ill-suited for handling abrupt market shocks that require immediate defensive responses. To address these challenges, we propose WebCryptoAgent, an agentic trading framework that decomposes web-informed decision making into modality-specific agents and consolidates their outputs into a unified evidence document for confidence-calibrated reasoning. We further introduce a decoupled control architecture that separates strategic hourly reasoning from a real-time second-level risk model, enabling fast shock detection and protective intervention independent of the trading loop. Extensive experiments on real-world cryptocurrency markets demonstrate that WebCryptoAgent improves trading stability, reduces spurious activity, and enhances tail-risk handling compared to existing baselines. Code will be available at https://github.com/AIGeeksGroup/WebCryptoAgent.
[123] Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models
Yanbing Zeng,Jia Wang,Hanghang Ma,Junqiang Wu,Jie Zhu,Xiaoming Wei,Jie Hu
Main category: cs.CV
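TL;DR: This paper proposes Forge-and-Quench, a new framework that converts the understanding ability of multimodal large language models (MLLMs) into visual guidance signals for image generation, improving the fidelity and detail richness of text-to-image generation.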
Details
Motivation: Existing work mainly uses the reasoning ability and world knowledge of understanding models to improve generation, but how to use understanding to raise the quality of generated images remains underexplored. This paper aims to use an understanding model to enhance the fidelity and detail of generated images.
Method: In the Forge-and-Quench framework, an MLLM first generates an enhanced text instruction from the conversational context. A newly designed Bridge Adapter then maps this instruction to a "Bridge Feature", which is injected into the T2I model as a visual representation and guides image generation together with the enhanced text.
Result: The method shows good extensibility and flexibility across multiple MLLM and T2I models, significantly reduces training overhead, improves image detail and fidelity, preserves instruction-following ability, and enhances the application of world knowledge.
Conclusion: Forge-and-Quench lets understanding effectively assist generation: the Bridge Feature turns the understanding model's insights into visual guidance during generation, offering a new paradigm for unifying multimodal generation and understanding.
Abstract: Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM's inherent multimodal understanding capabilities.
Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application. Models and codes are available at https://github.com/YanbingZeng/Forge-and-Quench.
[124] On the Holistic Approach for Detecting Human Image Forgery
Xiao Guo,Jie Zhu,Anil Jain,Xiaoming Liu
Main category: cs.CV
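TL;DR: This paper proposes HuForDet, a holistic human image forgery detection framework with a dual-branch design that combines face and full-body forgery detection, and builds a unified dataset, HuFor, achieving state-of-the-art detection performance across diverse forgery scenarios.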
Details
Motivation: Existing deepfake detection methods mostly focus on a single region, either the face or the full body, and lack a unified ability to detect forgeries of whole human images, which makes it hard to counter increasingly sophisticated AI-generated content.
Method: The HuForDet framework has two branches: (1) a face forgery detection branch that uses heterogeneous experts in the RGB and frequency domains, together with an adaptive LoG module, to capture multi-scale forgery traces; and (2) a contextualized forgery detection branch that uses a multimodal large language model to analyze full-body semantic consistency and dynamically fuses the two branches' features through a confidence estimation mechanism. The HuFor dataset, covering both face and full-body forgeries, is built for training and evaluation.
Result: Experiments show that HuForDet achieves state-of-the-art detection performance on multiple benchmarks, with stronger generalization and robustness, especially on cross-region and cross-type forged images.
Conclusion: By fusing local detail analysis with global semantic understanding, HuForDet effectively detects holistic human image forgeries and offers an extensible, unified solution to complex AIGC threats.
Abstract: The rapid advancement of AI-generated content (AIGC) has escalated the threat of deepfakes, from facial manipulations to the synthesis of entire photorealistic human bodies. However, existing detection methods remain fragmented, specializing either in facial-region forgeries or full-body synthetic images, and consequently fail to generalize across the full spectrum of human image manipulations. We introduce HuForDet, a holistic framework for human image forgery detection, which features a dual-branch architecture comprising: (1) a face forgery detection branch that employs heterogeneous experts operating in both RGB and frequency domains, including an adaptive Laplacian-of-Gaussian (LoG) module designed to capture artifacts ranging from fine-grained blending boundaries to coarse-scale texture irregularities; and (2) a contextualized forgery detection branch that leverages a Multi-Modal Large Language Model (MLLM) to analyze full-body semantic consistency, enhanced with a confidence estimation mechanism that dynamically weights its contribution during feature fusion. We curate a human image forgery (HuFor) dataset that unifies existing face forgery data with a new corpus of full-body synthetic humans. Extensive experiments show that our HuForDet achieves state-of-the-art forgery detection performance and superior robustness across diverse human image forgeries.
[125] Training a Custom CNN on Five Heterogeneous Image Datasets
Anika Tabassum,Tasnuva Mahazabin Tuba,Nafisa Naznin
Main category: cs.CV
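TL;DR: This study evaluates a custom lightweight CNN against mainstream deep networks such as ResNet-18 and VGG-16 on five heterogeneous datasets from agricultural and urban domains, examining how model complexity, depth, and pre-training affect performance.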
Details
Motivation: To compare the effectiveness of different CNN architectures in diverse real-world scenarios, particularly how to achieve efficient visual classification in resource-constrained settings.
Method: A custom lightweight CNN, ResNet-18, and VGG-16 are trained both from scratch and with transfer learning, and evaluated with systematic preprocessing, data augmentation, and controlled experiments.
Result: The custom lightweight CNN is competitive on several tasks; transfer learning and deeper architectures significantly improve convergence speed and generalization when data is scarce.
Conclusion: Lightweight models designed for specific tasks can handle multi-domain classification effectively, and transfer learning has a clear advantage in small-sample conditions, offering a practical option for resource-constrained applications.
Abstract: Deep learning has transformed visual data analysis, with Convolutional Neural Networks (CNNs) becoming highly effective in learning meaningful feature representations directly from images. Unlike traditional manual feature engineering methods, CNNs automatically extract hierarchical visual patterns, enabling strong performance across diverse real-world contexts. This study investigates the effectiveness of CNN-based architectures across five heterogeneous datasets spanning agricultural and urban domains: mango variety classification, paddy variety identification, road surface condition assessment, auto-rickshaw detection, and footpath encroachment monitoring. These datasets introduce varying challenges, including differences in illumination, resolution, environmental complexity, and class imbalance, necessitating adaptable and robust learning models. We evaluate a lightweight, task-specific custom CNN alongside established deep architectures, including ResNet-18 and VGG-16, trained both from scratch and using transfer learning. Through systematic preprocessing, augmentation, and controlled experimentation, we analyze how architectural complexity, model depth, and pre-training influence convergence, generalization, and performance across datasets of differing scale and difficulty. The key contributions of this work are: (1) the development of an efficient custom CNN that achieves competitive performance across multiple application domains, and (2) a comprehensive comparative analysis highlighting when transfer learning and deep architectures provide substantial advantages, particularly in data-constrained environments.
These findings offer practical insights for deploying deep learning models in resource-limited yet high-impact real-world visual classification tasks.
[126] AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection
Yunqing Hu,Zheming Yang,Chang Zhao,Qi Guo,Meng Gao,Pengcheng Li,Wen Ji
Main category: cs.CV
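TL;DR: This paper proposes the AIVD framework, in which a lightweight edge detector collaborates with a cloud-based multimodal large model to achieve precise object localization and high-quality semantic generation. It also designs a fine-tuning strategy with visual-semantic collaborative augmentation and a resource-aware dynamic scheduling algorithm, significantly reducing resource consumption while improving performance.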
Details
Motivation: Multimodal large language models excel at semantic understanding but still face challenges in precise object localization and in resource-constrained edge-cloud deployment.
Method: The AIVD framework combines a lightweight edge detector with a cloud-based MLLM, uses a fine-tuning strategy with visual-semantic collaborative augmentation to improve robustness, and designs a heterogeneous resource-aware dynamic scheduling algorithm to optimize efficiency.
Result: Experiments show that AIVD significantly reduces resource consumption and improves classification accuracy and semantic generation quality, while the scheduling strategy achieves higher throughput and lower latency.
Conclusion: The AIVD framework unifies precise localization with high-quality semantic generation and shows potential for efficient, robust edge-cloud collaboration across a range of scenarios.
Abstract: Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, this paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through the collaboration between lightweight edge detectors and cloud-based MLLMs. To enhance the cloud MLLM's robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results demonstrate that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.
[127] Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition
Masatomo Yoshida,Haruto Namura,Nicola Adami,Masahiro Okuda
Main category: cs.CV
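TL;DR: Proposes a skeletonization-based adversarial attack that effectively narrows the search space. Targeting images that contain text, especially mathematical formulas, it evaluates character-level and semantic-level changes in foundation model outputs, sheds light on the models' visual understanding and reasoning, and demonstrates practicality on ChatGPT.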
Details
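Motivation: To explore the visual capabilities and limitations of foundation models on images with complex structure, such as mathematical formulas, and in particular their robustness under adversarial attack.
Method: A skeletonization-based adversarial attack generates adversarial examples within a reduced search space, and the character-level and semantic-level changes between original and adversarial outputs are analyzed.
Result: The method effectively attacks images containing mathematical formulas and is applied successfully to ChatGPT, showing a marked effect on model outputs and exposing weaknesses in the model's visual interpretation and reasoning.
Conclusion: Skeletonization-based adversarial attack is an effective way to probe the fragility of foundation models on complex image-text inputs and points to directions for improving their robustness.
Abstract: This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method utilizing skeletonization to reduce the search space effectively. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure. We conduct a detailed evaluation of both character and semantic changes between original and adversarially perturbed outputs to provide insights into the models' visual interpretation and reasoning abilities. The effectiveness of our method is further demonstrated through its application to ChatGPT, which shows its practical implications in real-world scenarios.
[128] ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting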
Yen-Jen Chiou,Wei-Tse Cheng,Yuan-Fu Yang
Main category: cs.CV
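TL;DR: ProFuse is an efficient context-aware framework for open-vocabulary 3D scene understanding based on 3D Gaussian Splatting, achieving fast, consistent semantic segmentation through dense correspondence-guided pre-registration and cross-view clustering.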
Details
Motivation: To improve cross-view consistency and intra-mask cohesion in open-vocabulary 3D scene understanding while avoiding render-supervised fine-tuning and complex optimization.
Method: A dense correspondence-guided pre-registration phase initializes Gaussians with accurate geometry and builds 3D Context Proposals through cross-view clustering. Each proposal's global feature, obtained by weighted aggregation, is fused onto the Gaussians to maintain semantic consistency during direct registration.
Result: The method achieves strong open-vocabulary 3D understanding without additional optimization; semantic attachment takes only about five minutes per scene, twice as fast as the current state of the art.
Conclusion: ProFuse significantly improves semantic consistency and efficiency while preserving geometric accuracy, offering an efficient solution for open-vocabulary 3D scene understanding.
Abstract: We present ProFuse, an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup, adding minimal overhead and requiring no render-supervised fine-tuning. Instead of relying on a pretrained 3DGS scene, we introduce a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, and this feature is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. With associations established in advance, semantic fusion requires no additional optimization beyond standard reconstruction, and the model retains geometric refinement without densification. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, which is two times faster than SOTA.
[129] Segmentation-Driven Monocular Shape from Polarization based on Physical Model
Jinyu Zhang,Xu Ma,Weili Chen,Gonzalo R. Arce
Main category: cs.CV
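TL;DR: Proposes a segmentation-driven monocular shape-from-polarization (SMSfP) framework that significantly improves the accuracy and stability of surface normal recovery through adaptive segmentation and a multi-scale convexity prior.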
Details
Motivation: Existing monocular shape-from-polarization methods suffer from azimuth angle ambiguity, which harms the accuracy and stability of 3D reconstruction.
Method: A polarization-aided adaptive region growing (PARG) segmentation strategy decomposes the global convexity assumption into locally convex regions, and a multi-scale fusion convexity prior (MFCP) constraint strengthens local consistency and detail recovery.
Result: Experiments on synthetic and real datasets show that the method outperforms existing physics-based methods in azimuth disambiguation and geometric fidelity.
Conclusion: The SMSfP framework effectively resolves azimuth ambiguity and improves the reconstruction accuracy and robustness of monocular SfP.
Abstract: Monocular shape-from-polarization (SfP) leverages the intrinsic relationship between light polarization properties and surface geometry to recover surface normals from single-view polarized images, providing a compact and robust approach for three-dimensional (3D) reconstruction. Despite its potential, existing monocular SfP methods suffer from azimuth angle ambiguity, an inherent limitation of polarization analysis, that severely compromises reconstruction accuracy and stability. This paper introduces a novel segmentation-driven monocular SfP (SMSfP) framework that reformulates global shape recovery into a set of local reconstructions over adaptively segmented convex sub-regions. Specifically, a polarization-aided adaptive region growing (PARG) segmentation strategy is proposed to decompose the global convexity assumption into locally convex regions, effectively suppressing azimuth ambiguities and preserving surface continuity. Furthermore, a multi-scale fusion convexity prior (MFCP) constraint is developed to ensure local surface consistency and enhance the recovery of fine textural and structural details. Extensive experiments on both synthetic and real-world datasets validate the proposed approach, showing significant improvements in disambiguation accuracy and geometric fidelity compared with existing physics-based monocular SfP techniques.
[130] GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models
Shurong Zheng,Yousong Zhu,Hongyin Zhao,Fan Yang,Yufei Zhan,Ming Tang,Jinqiao Wang
Main category: cs.CV
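TL;DR: This paper proposes GeM-VG, a new multi-image visual grounding model that, through the large-scale MG-Data-240K dataset and a hybrid reinforcement finetuning strategy, achieves significant gains on both multi-image and single-image grounding tasks.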
Details
Motivation: Existing multi-image grounding methods are limited to single-target localization and a narrow range of task types, and they lack unified modeling of generalized grounding tasks.
Method: The paper proposes the GeM-VG model, systematically categorizes multi-image grounding tasks, builds the MG-Data-240K dataset, and adopts a hybrid reinforcement finetuning strategy that combines chain-of-thought (CoT) reasoning with direct answering.
Result: The model surpasses the previous best MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively, improves over the base model by 9.1% on ODINW, and retains good multi-image understanding.
Conclusion: Through unified modeling and a new training strategy, GeM-VG significantly improves generalization in multi-image and single-image visual grounding.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained by single-target localization and limited types of practical tasks, due to the lack of unified modeling for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, addressing the limitations of existing datasets regarding target quantity and image relation. To tackle the challenges of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, considering their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.
[131] CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models
Tobia Poppi,Burak Uzkent,Amanmeet Garg,Lucas Porto,Garin Kessler,Yezhou Yang,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara,Florian Schiffers
Main category: cs.CV
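TL;DR: Proposes a scalable counterfactual video generation framework to mitigate hallucinations of video-language models in action and temporal reasoning, builds CounterVid, a synthetic dataset of about 26k preference pairs, and shows that combining it with MixDPO effectively improves model performance.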
Details
Motivation: Existing video-language models are prone to hallucination, especially when reasoning about actions and temporal order. The main cause is over-reliance on language priors rather than fine-grained visual dynamics, and this root cause needs to be addressed.
Method: Multimodal large language models are combined with diffusion models to generate counterfactual videos that differ only in actions or temporal structure, producing semantic hard negatives. MixDPO is proposed to optimize jointly over textual and visual preferences.
Result: The CounterVid dataset contains about 26,000 preference pairs. Fine-tuning Qwen2.5-VL with MixDPO yields notable gains on temporal reasoning and transfers well to standard hallucination benchmarks.
Conclusion: Counterfactual video generation combined with joint preference optimization effectively reduces hallucinations in video-language models and improves their understanding of actions and temporal structure.
Abstract: Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.
[132] Defocus Aberration Theory Confirms Gaussian Model in Most Imaging Devices
Akbar Saadat
Main category: cs.CV
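TL;DR: This paper studies the challenge of estimating depth from 2D images in 3D recovery, argues that the Gaussian model is the optimal choice for real-time applications, and verifies its applicability to the defocus operator in most imaging devices.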
Details
Motivation: Accurately estimating depth from 2D images has long been a fundamental difficulty in 3D recovery, particularly in distinguishing depth-induced blur from inherent blur.
Method: Defocus is analyzed within the framework of geometric optics, and defocus aberration theory in diffraction-limited optics is used to assess how accurately the actual defocus model fits its Gaussian approximation. Settings that make conventional imaging devices conform to the Gaussian model are also proposed.
Result: For typical focused depths (1 to 100 meters) with a depth variation of at most 10%, the Gaussian model's maximum Mean Absolute Error (MAE) is below 1%, indicating high accuracy and reliability.
Conclusion: The Gaussian model can describe both the absolute blur in a single image and the relative blur between two images, is applicable and accurate in most imaging devices, and suits real-time applications.
Abstract: Over the past three decades, defocus has consistently provided groundbreaking depth information in scene images. However, accurately estimating depth from 2D images continues to be a persistent and fundamental challenge in the field of 3D recovery. Heuristic approaches face the ill-posed problem of inferring the spatially variant defocus blur, as the desired blur cannot be distinguished from the inherent blur. Given prior knowledge of the defocus model, the problem becomes well-posed with an analytic solution for the relative blur between two images, taken at the same viewpoint with different camera settings for the focus. The Gaussian model stands out as an optimal choice for real-time applications, due to its mathematical simplicity and computational efficiency. And theoretically, it is the only model that can be applied at the same time to both the absolute blur caused by depth in a single image and the relative blur resulting from depth differences between two images. This paper introduces the settings, for conventional imaging devices, to ensure that the defocusing operator adheres to the Gaussian model. Defocus analysis begins within the framework of geometric optics and is conducted by defocus aberration theory in diffraction-limited optics to obtain the accuracy of fitting the actual model to its Gaussian approximation. The results for a typical set of focused depths between 1 and 100 meters, with a maximum depth variation of 10% at the focused depth, confirm the Gaussian model's applicability for defocus operators in most imaging devices.
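The point that a single Gaussian model can serve for both absolute and relative blur rests on a standard property: convolving two Gaussians gives another Gaussian whose variance is the sum of the two. The sketch below is only a numerical illustration of that property, assuming 1D kernels; the sigma values, the kernel radius, and the helper names `gaussian_kernel` and `relative_blur_sigma` are hypothetical and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Sampled, unit-sum 1D Gaussian on the integer grid [-radius, radius]."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def relative_blur_sigma(sigma_sharper, sigma_blurrier):
    """Relative blur between two Gaussian-blurred views: variances subtract."""
    return np.sqrt(sigma_blurrier**2 - sigma_sharper**2)

# Illustrative absolute blur levels (in pixels) for two focus settings.
sigma_1, sigma_2 = 2.0, 3.5
sigma_rel = relative_blur_sigma(sigma_1, sigma_2)

radius = 30
k1 = gaussian_kernel(sigma_1, radius)
k_rel = gaussian_kernel(sigma_rel, radius)

# Blurring the sharper view by the relative kernel should reproduce the
# blurrier view's kernel. np.convolve (full mode) returns 4*radius + 1 samples,
# so the reference kernel is built on the matching grid.
composed = np.convolve(k1, k_rel)
k2 = gaussian_kernel(sigma_2, 2 * radius)

mae = np.mean(np.abs(composed - k2))
print(f"sigma_rel = {sigma_rel:.4f}, MAE between kernels = {mae:.2e}")
```

Under these assumptions the composed kernel and the directly sampled one agree to within numerical error, which is why a relative blur estimated between two images can be related back to absolute blur in closed form.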
The findings demonstrate a maximum Mean Absolute Error (MAE) of less than 1%, underscoring the model's accuracy and reliability.
[133] SRU-Pix2Pix: A Fusion-Driven Generator Network for Medical Image Translation with Few-Shot Learning
Xihe Qiu,Yang Dai,Xiaoyu Tan,Sijia Li,Fenghao Sun,Lu Gan,Liang Liu
Main category: cs.CV
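TL;DR: Proposes an enhanced Pix2Pix framework that combines SEResNet and U-Net++ to improve generation quality and structural fidelity in medical image translation, with strong performance under few-shot conditions.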
Details
Motivation: Clinical use of MRI is limited by long acquisition time, high cost, and low resolution. Image translation may ease these problems, but the potential of existing methods such as Pix2Pix has not been fully explored.
Method: An improved Pix2Pix framework introduces Squeeze-and-Excitation Residual Networks (SEResNet) to strengthen key feature representation through channel attention and adopts U-Net++ to improve multi-scale feature fusion. A simplified PatchGAN discriminator stabilizes training and improves local anatomical realism.
Result: Under few-shot conditions with fewer than 500 images, the method achieves higher image quality and consistent structural fidelity across multiple intra-modality MRI translation tasks and shows strong generalization.
Conclusion: The enhanced Pix2Pix framework improves medical image translation performance and offers a viable solution for MRI image generation in low-data settings.
Abstract: Magnetic Resonance Imaging (MRI) provides detailed tissue information, but its clinical application is limited by long acquisition time, high cost, and restricted resolution. Image translation has recently gained attention as a strategy to address these limitations. Although Pix2Pix has been widely applied in medical image translation, its potential has not been fully explored. In this study, we propose an enhanced Pix2Pix framework that integrates Squeeze-and-Excitation Residual Networks (SEResNet) and U-Net++ to improve image generation quality and structural fidelity. SEResNet strengthens critical feature representation through channel attention, while U-Net++ enhances multi-scale feature fusion. A simplified PatchGAN discriminator further stabilizes training and refines local anatomical realism. Experimental results demonstrate that under few-shot conditions with fewer than 500 images, the proposed method achieves consistent structural fidelity and superior image quality across multiple intra-modality MRI translation tasks, showing strong generalization ability. These results suggest an effective extension of Pix2Pix for medical image translation.
[134] Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers
Lee Hyoseok,Sohwi Lim,Eunju Cha,Tae-Hyun Oh
Main category: cs.CV
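TL;DR: This paper proposes MCLC, a plug-and-play correction module that stabilizes latent-diffusion-based inverse problem solvers through measurement-consistent Langevin updates, and validates its effectiveness on a range of image restoration tasks.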
Details
Motivation: Existing inverse problem solvers based on latent diffusion models (LDMs) are unstable, producing artifacts and degraded quality. This paper aims to address that problem.
Method: The authors identify the discrepancy between the solver's dynamics and the true reverse diffusion dynamics as the source of instability and propose the Measurement-Consistent Langevin Corrector (MCLC), which corrects LDM solvers through measurement-consistent Langevin updates without relying on a linear manifold assumption in latent space.
Result: Experiments show that MCLC improves the stability and performance of existing solvers, performs well across multiple image restoration tasks, and sheds light on the causes of blob artifacts.
Conclusion: MCLC is a theoretically grounded, plug-and-play improvement that advances more robust zero-shot inverse problem solvers.
Abstract: With recent advances in generative models, diffusion models have emerged as powerful priors for solving inverse problems in each domain. Since Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers suffer from their instability, exhibiting undesirable artifacts and degraded quality. In this work, we first identify the instability as a discrepancy between the solver's and true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies the LDM-based inverse solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption, leading to more stable and reliable behavior. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. Additionally, we analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.
[135] PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
Denis Korzhenkov,Adil Karjauv,Animesh Karnewar,Mohsen Ghafoorian,Amirhossein Habibian
Main category: cs.CV
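TL;DR: Proposes a pipeline that converts a pretrained diffusion model into a pyramidal model through low-cost finetuning, improving inference efficiency without degrading video quality, and compares several step distillation strategies.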
Details
Motivation: Existing open-source pyramidal video models are trained from scratch and fall short of state-of-the-art systems in visual plausibility, while multi-step denoising models remain costly at inference time.
Method: A pretrained diffusion model is converted into a pyramidal model through low-cost finetuning, and several step distillation strategies within pyramidal models are studied and compared.
Result: The pretrained model is successfully converted to a pyramidal structure, preserving output video quality while significantly improving inference efficiency.
Conclusion: The method offers a practical route to efficient video generation, using a pyramidal structure to cut computational cost without sacrificing quality.
Abstract: Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency. Our results are available at https://qualcomm-ai-research.github.io/PyramidalWan.
[136] Detector-Augmented SAMURAI for Long-Duration Drone Tracking
Tamara R. Lenhard,Andreas Weinmann,Hichem Snoussi,Tobias Koch
Main category: cs.CV
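TL;DR: This paper presents the first systematic evaluation of the foundation model SAMURAI for drone tracking in urban surveillance and proposes a detector-augmented extension that significantly improves tracking robustness in complex environments, especially in long sequences and drone exit-re-entry scenarios.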
Details
Motivation: Existing detector-based drone tracking methods suffer from temporal inconsistency and rely heavily on conventional motion models. Foundation models such as SAMURAI perform well in other domains, but their use in drone-specific scenarios has not been explored, leaving a research gap.
Method: A detector-augmented extension of SAMURAI fuses detector cues to reduce SAMURAI's sensitivity to bounding-box initialization and sequence length, improving tracking stability.
Result: The method outperforms SAMURAI's zero-shot performance across datasets and metrics, raising success rate by up to 0.393 and lowering the false negative rate (FNR) by up to 0.475, with stronger robustness in long sequences and repeated drone exit-re-entry events.
Conclusion: The study shows that a foundation model combined with detector information can improve the long-term robustness of drone tracking, opening a new direction for practical urban surveillance.
Abstract: Robust long-term tracking of drones is a critical requirement for modern surveillance systems, given their increasing threat potential. While detector-based approaches typically achieve strong frame-level accuracy, they often suffer from temporal inconsistencies caused by frequent detection dropouts. Despite its practical relevance, research on RGB-based drone tracking is still limited and largely reliant on conventional motion models. Meanwhile, foundation models like SAMURAI have established their effectiveness across other domains, exhibiting strong category-agnostic tracking performance. However, their applicability in drone-specific scenarios has not been investigated yet. Motivated by this gap, we present the first systematic evaluation of SAMURAI's potential for robust drone tracking in urban surveillance settings. Furthermore, we introduce a detector-augmented extension of SAMURAI to mitigate sensitivity to bounding-box initialization and sequence length. Our findings demonstrate that the proposed extension significantly improves robustness in complex urban environments, with pronounced benefits in long-duration sequences - especially under drone exit-re-entry events. The incorporation of detector cues yields consistent gains over SAMURAI's zero-shot performance across datasets and metrics, with success rate improvements of up to +0.393 and FNR reductions of up to -0.475.
[137] Integrated Framework for Selecting and Enhancing Ancient Marathi Inscription Images from Stone, Metal Plate, and Paper Documents
Bapu D. Chendage,Rajivkumar S. Mente
Main category: cs.CV
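TL;DR: This paper proposes an enhancement method for ancient script images based on binarization and complementary preprocessing techniques, which removes stains and enhances unclear text, improving the readability of ancient inscriptions on stone, metal plates, and historical documents.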
Details
Motivation: Ancient script images often suffer from severe background noise, low contrast, and degradation caused by aging and environmental effects. The text and background look similar and are hard to tell apart, so effective image enhancement is needed to improve readability.
Method: A binarization-based enhancement method is combined with complementary preprocessing techniques to remove stains and enhance unclear text. It is applied to different types of ancient script images and evaluated with K-NN and SVM classifiers.
Result: With the K-NN classifier, classification accuracy on stone, metal plate, and document scripts is 55.7%, 62%, and 65.6%, respectively; with the SVM classifier it is 53.2%, 59.5%, and 67.8%.
Conclusion: The proposed enhancement method improves the readability of ancient Marathi inscription images, performing best on document text, which supports its effectiveness and applicability.
Abstract: Ancient script images often suffer from severe background noise, low contrast, and degradation caused by aging and environmental effects. In many cases, the foreground text and background exhibit similar visual characteristics, making the inscriptions difficult to read. The primary objective of image enhancement is to improve the readability of such degraded ancient images. This paper presents an image enhancement approach based on binarization and complementary preprocessing techniques for removing stains and enhancing unclear ancient text. The proposed methods are evaluated on different types of ancient scripts, including inscriptions on stone, metal plates, and historical documents. Experimental results show that the proposed approach achieves classification accuracies of 55.7%, 62%, and 65.6% for stone, metal plate, and document scripts, respectively, using the K-Nearest Neighbor (K-NN) classifier. Using the Support Vector Machine (SVM) classifier, accuracies of 53.2%, 59.5%, and 67.8% are obtained. The results demonstrate the effectiveness of the proposed enhancement method in improving the readability of ancient Marathi inscription images.
[138] SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models
Oriol Rabasseda,Zenjie Li,Kamal Nasrollahi,Sergio Escalera
Main category: cs.CV
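TL;DR: This paper introduces SOVABench, a benchmark for vehicle action recognition in surveillance video, and designs a training-free multimodal large language model framework that generates interpretable descriptions to improve action discrimination and spatio-temporal reasoning.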
Details
Motivation: Existing content-based video retrieval benchmarks mostly focus on scene-level similarity and do not evaluate the action discrimination that surveillance requires, so a real-world benchmark dedicated to vehicle actions is needed.
Method: SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and understanding of temporal direction. Using the visual reasoning and instruction-following abilities of multimodal large language models (MLLMs), a training-free framework extracts interpretable embeddings for images and videos from MLLM-generated descriptions.
Result: Experiments show that, although humans distinguish these actions intuitively, state-of-the-art vision and multimodal models still struggle. The proposed framework performs strongly on SOVABench and on several spatial and counting benchmarks, particularly on tasks where contrastive vision-language models do poorly.
Conclusion: SOVABench fills a gap in action retrieval benchmarks for surveillance video, and the training-free MLLM framework offers stronger performance and interpretability on action recognition and complex reasoning tasks, showing the potential of MLLMs for video understanding.
Abstract: Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.
[139] Character Detection using YOLO for Writer Identification in multiple Medieval books
Alessandra Scotto di Freca,Tiziana D Alessandro,Francesco Fontanella,Filippo Sarria,Claudio De Stefano
Main category: cs.CV
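TL;DR: This paper proposes a YOLOv5-based method for identifying the writers of medieval manuscripts: object detection extracts the letter "a" and classifies its writing style. Compared with the earlier template matching plus CNN approach, it achieves more accurate identification, and a confidence-based rejection mechanism is introduced to improve generalization to unseen manuscripts.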
Details
Motivation: To address writer identification in medieval manuscripts, overcome the threshold sensitivity and poor generalization of traditional template matching, and advance digital paleography.
Method: A YOLOv5 model detects the letter "a", replacing the earlier template matching and CNN pipeline. The detections are used to attribute writing to a scribe, and YOLO's confidence score is used to set a rejection threshold that improves system reliability.
Result: Compared with the earlier method, YOLOv5 extracts more letter instances and significantly improves second-stage classification accuracy. The confidence score can support a rejection mechanism, enabling more reliable writer identification on unseen manuscripts.
Conclusion: YOLOv5 outperforms the traditional method for paleographic writer identification, improving both detection and classification while offering a workable strategy for handling uncertainty in practice.
Abstract: Paleography is the study of ancient and historical handwriting; its key objectives include the dating of manuscripts and understanding the evolution of writing. Estimating when a document was written and tracing the development of scripts and writing styles can be aided by identifying the individual scribes who contributed to a medieval manuscript. Although digital technologies have made significant progress in this field, the general problem remains unsolved and continues to pose open challenges. ... We previously proposed an approach focused on identifying specific letters or abbreviations that characterize each writer. In that study, we considered the letter "a", as it was widely present on all pages of text and highly distinctive, according to the suggestions of expert paleographers. We used template matching techniques to detect the occurrences of the character "a" on each page and the convolutional neural network (CNN) to attribute each instance to the correct scribe. Moving from the interesting results achieved from this previous system and being aware of the limitations of the template matching technique, which requires an appropriate threshold to work, we decided to experiment in the same framework with the use of the YOLO object detection model to identify the scribe who contributed to the writing of different medieval books. We considered the fifth version of YOLO to implement the YOLO object detection model, which completely substituted the template matching and CNN used in the previous work.
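A detector of this kind returns a confidence score with each detected letter, and that score can gate which detections count toward a page-level attribution. The sketch below is a minimal illustration under stated assumptions, not the authors' procedure: each detection is assumed to carry a predicted scribe label and a confidence, the function name `attribute_page` and the 0.5 cutoff are hypothetical, and majority voting is an illustrative aggregation choice that the abstract does not specify.

```python
from collections import Counter

def attribute_page(detections, reject_below=0.5):
    """Attribute a page to a scribe from per-letter detections.

    detections: list of (scribe_label, confidence) pairs, one per detected letter.
    Returns the most frequent label among confident detections, or None when
    every detection falls below the rejection threshold.
    """
    kept = [label for label, conf in detections if conf >= reject_below]
    if not kept:
        return None  # reject: no sufficiently confident letter on this page
    return Counter(kept).most_common(1)[0][0]

# Hypothetical detections for one page.
page = [
    ("scribe_A", 0.91),
    ("scribe_B", 0.42),
    ("scribe_A", 0.77),
    ("scribe_B", 0.38),
]
print(attribute_page(page))                     # confident votes only
print(attribute_page(page, reject_below=0.95))  # stricter cutoff rejects the page
```

Raising `reject_below` trades coverage for reliability: more pages are left unattributed, but the attributions that remain rest on higher-confidence detections.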
The experimental results demonstrate that YOLO extracts a greater number of instances of the letter considered, leading to a more accurate second-stage classification. Furthermore, the YOLO confidence score provides a foundation for developing a system that applies a rejection threshold, enabling reliable writer identification even in unseen manuscripts.
[140] DivAS: Interactive 3D Segmentation of NeRFs via Depth-Weighted Voxel Aggregation
Ayush Pande
Main category: cs.CV
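TL;DR: Proposes DivAS, an optimization-free, fully interactive NeRF segmentation framework that combines 2D SAM masks with NeRF depth priors to deliver real-time, high-accuracy 3D segmentation while preserving zero-shot capability.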
Details
Motivation: Existing NeRF segmentation methods mostly rely on per-scene optimization, which is slow and sacrifices the zero-shot capability of 2D foundation models, so they cannot meet real-time interactive needs. Method: Proposes the DivAS framework: user point prompts entered through a GUI generate 2D SAM masks, which are geometrically refined using NeRF-derived depth priors; a custom CUDA kernel quickly aggregates the refined multi-view masks into a unified 3D voxel grid (<200 ms). Result: On Mip-NeRF 360° and LLFF, DivAS achieves segmentation quality comparable to optimization-based methods while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when user prompting time is excluded. Conclusion: By eliminating per-scene optimization, DivAS enables fast, interactive NeRF segmentation that balances accuracy and efficiency, advancing zero-shot 3D scene understanding. Abstract: Existing methods for segmenting Neural Radiance Fields (NeRFs) are often optimization-based, requiring slow per-scene training that sacrifices the zero-shot capabilities of 2D foundation models. We introduce DivAS (Depth-interactive Voxel Aggregation Segmentation), an optimization-free, fully interactive framework that addresses these limitations. Our method operates via a fast GUI-based workflow where 2D SAM masks, generated from user point prompts, are refined using NeRF-derived depth priors to improve geometric accuracy and foreground-background separation. The core of our contribution is a custom CUDA kernel that aggregates these refined multi-view masks into a unified 3D voxel grid in under 200ms, enabling real-time visual feedback. This optimization-free design eliminates the need for per-scene training. Experiments on Mip-NeRF 360° and LLFF show that DivAS achieves segmentation quality comparable to optimization-based methods, while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when excluding user prompting time.
[141] Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform
Suyash Mishra,Qiang Li,Srikanth Patil,Satyanarayan Pati,Baddu Narendra
Main category: cs.CV
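TL;DR: This paper presents a multimodal reasoning framework for industrial-scale long-form video understanding, evaluates more than 40 vision language models in real pharmaceutical scenarios, exposes the practical bottlenecks and trade-offs of current VLMs under constrained compute, and offers actionable optimization guidance.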
Details
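Motivation: Existing VLM evaluations focus mostly on short videos and ignore the GPU, latency, and cost constraints of industrial deployment, so they transfer poorly to long-form video processing. Industrial settings such as pharmaceuticals need to process large volumes of multi-format, multilingual long videos and documents efficiently, which calls for a systematic study of how VLMs behave under realistic constraints. Method: Builds an industrial large-scale multimodal architecture that integrates 200,000+ PDFs, 25,000+ videos (8 formats), and 888 audio files (20+ languages); empirically analyzes more than 40 VLMs on Video-MME, MMBench, and a proprietary dataset, focusing on attention mechanisms, multimodal fusion, temporal reasoning, and video splitting strategies. Result: SDPA attention yields 3-8x efficiency gains on commodity GPUs; multimodality helps in up to 8 of 12 task domains, especially length-dependent ones; clear bottlenecks in temporal alignment and keyframe detection appear across open- and closed-source VLMs; different attention mechanisms trade off memory against accuracy. Conclusion: Current VLMs still face significant efficiency and capability bottlenecks in industrial long-video applications, and simply stacking models ("A+B") is not sustainable; attention should go to system-level optimizations such as attention mechanism choice, video splitting strategy, and multimodal co-design; the study gives practical guidance for industrial multimodal systems and stresses evaluating and designing under real constraints. Abstract: Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4, M4V), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8/12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs.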
Rather than proposing a new "A+B" model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.
[142] Rotation-Robust Regression with Convolutional Model Trees
Hongyi Li,William Ward Armstrong,Jun Xu
Main category: cs.CV
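TL;DR: This paper studies rotation-robust learning with Convolutional Model Trees (CMTs), proposes three geometry-aware inductive biases, and evaluates how a deployment-time orientation search affects model robustness.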
Details
Motivation: To improve the robustness of learning from image inputs under rotation, particularly maintaining performance on a rotation-invariant regression task. Method: Introduces three geometry-aware inductive biases (convolutional smoothing, a tilt dominance constraint, and importance-based pruning) and, at deployment time, uses a discrete rotation search that maximizes a forest-level confidence proxy. Result: Orientation search improves robustness under severe rotations but can hurt near the canonical orientation when confidence is misaligned with correctness; consistent trends are observed on MNIST handwritten digit recognition. Conclusion: Confidence-based orientation selection shows promise for model-tree ensembles but also has limitations; the alignment between confidence and accuracy needs further improvement. Abstract: We study rotation-robust learning for image inputs using Convolutional Model Trees (CMTs) [1], whose split and leaf coefficients can be structured on the image grid and transformed geometrically at deployment time. In a controlled MNIST setting with a rotation-invariant regression target, we introduce three geometry-aware inductive biases for split directions -- convolutional smoothing, a tilt dominance constraint, and importance-based pruning -- and quantify their impact on robustness under in-plane rotations. We further evaluate a deployment-time orientation search that selects a discrete rotation maximizing a forest-level confidence proxy without updating model parameters. Orientation search improves robustness under severe rotations but can be harmful near the canonical orientation when confidence is misaligned with correctness. Finally, we observe consistent trends on MNIST digit recognition implemented as one-vs-rest regression, highlighting both the promise and limitations of confidence-based orientation selection for model-tree ensembles.
[143] Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics
Subhadeep Roy,Gagan Bhatia,Steffen Eger
Main category: cs.CV
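TL;DR: This paper studies "prototypicality bias" in the evaluation of text-to-image models, introduces a contrastive benchmark called ProtoBias, and finds that existing automatic metrics tend to prefer visually or socially prototypical images over semantically correct ones. The authors therefore propose a new metric, ProtoScore, which is more robust and runs efficiently.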
Details
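Motivation: Existing automatic metrics for text-to-image quality may favor visual or social prototypes in the data distribution while overlooking semantic correctness, causing evaluations to diverge from human judgment; this prototypicality bias needs to be studied systematically and addressed. Method: Builds a controlled contrastive benchmark, ProtoBias, covering Animals, Objects, and Demography images, in which each pair contrasts a semantically correct but non-prototypical image with a subtly incorrect yet more prototypical adversarial image; compares several mainstream metrics (such as CLIPScore, PickScore, VQA-based scores, and LLM-as-Judge) on the benchmark to determine whether they lean toward prototypes; and designs a new metric, ProtoScore, on that basis. Result: Commonly used metrics including CLIPScore, PickScore, and VQA-based scores frequently assign higher scores to prototypical but semantically mismatched images, and LLM-as-Judge is also unstable on cases involving social attributes, whereas human evaluators clearly prefer semantically correct images; ProtoScore substantially reduces the misranking rate and approaches the robustness of large closed-source judge models while running far faster than GPT-5 inference. Conclusion: Mainstream automatic metrics show clear prototypicality bias and do not reliably reflect semantic consistency; ProtoScore offers an efficient, robust alternative and points toward fairer, more accurate multimodal evaluation. Abstract: Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark, ProtoBias (Prototypical Bias), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins.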
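The directional evaluation described here can be made concrete with a small sketch. It assumes each benchmark item yields two scores from the metric under test: one for the semantically correct but non-prototypical image and one for the prototypical adversarial counterpart. A pair counts as misranked when the adversarial image scores at least as high. The names `ScoredPair`, `misranking_rate`, and `mean_margin`, the tie-handling rule, and the toy numbers are illustrative assumptions, not the paper's code or data.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ScoredPair:
    """Metric scores for one contrastive pair (hypothetical structure)."""
    correct_score: float      # semantically correct, non-prototypical image
    adversarial_score: float  # subtly incorrect, prototypical image


def misranking_rate(pairs: Iterable[ScoredPair]) -> float:
    """Fraction of pairs where the adversarial image is not ranked below the correct one.

    Ties are counted as failures; that is an assumption of this sketch.
    """
    items: List[ScoredPair] = list(pairs)
    if not items:
        raise ValueError("need at least one pair")
    misranked = sum(1 for p in items if p.adversarial_score >= p.correct_score)
    return misranked / len(items)


def mean_margin(pairs: Iterable[ScoredPair]) -> float:
    """Average decision margin (correct minus adversarial); larger is better."""
    items: List[ScoredPair] = list(pairs)
    if not items:
        raise ValueError("need at least one pair")
    return sum(p.correct_score - p.adversarial_score for p in items) / len(items)


# Toy scores invented for illustration only.
toy_pairs = [
    ScoredPair(0.31, 0.35),  # misranked: prototype wins
    ScoredPair(0.40, 0.28),
    ScoredPair(0.22, 0.22),  # tie, counted as misranked here
    ScoredPair(0.51, 0.37),
]
rate = misranking_rate(toy_pairs)
margin = mean_margin(toy_pairs)
```

Reporting the margin alongside the rate distinguishes a metric that is narrowly right from one that separates the pair clearly, which is the sense in which the abstract contrasts human decision margins with those of automatic metrics.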
Motivated by these findings, we propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking while running orders of magnitude faster than GPT-5 inference and approaching the robustness of much larger closed-source judges.
[144] TEA: Temporal Adaptive Satellite Image Semantic Segmentation
Juyuan Kang,Hao Zhu,Yan Zhu,Wei Zhang,Jianing Chen,Tianxiang Xiao,Yike Ma,Hao Jiang,Feng Dai
Main category: cs.CV
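TL;DR: Proposes TEA, a temporally adaptive semantic segmentation method for satellite image time series that uses a teacher-student framework and a full-sequence reconstruction auxiliary task to improve generalization across inputs of different temporal lengths.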
Details
Motivation: Existing SITS segmentation methods perform well at a fixed temporal length but generalize poorly across different temporal lengths, leading to a marked drop in segmentation quality. Method: Proposes TEA, in which a teacher model (holding global sequence knowledge) guides a student model with adaptive temporal inputs; knowledge is transferred through intermediate embeddings, prototypes, and soft labels, and a dynamic aggregation mechanism reduces knowledge forgetting; full-sequence reconstruction serves as an auxiliary task to improve representation quality. Result: Experiments show significant performance gains on common benchmarks across inputs of different temporal lengths. Conclusion: TEA effectively strengthens robustness and generalization under varying temporal lengths, providing a reliable solution for crop mapping from irregular time series in practical agricultural applications. Abstract: Crop mapping based on satellite image time series (SITS) holds substantial economic value in agricultural production settings, in which parcel segmentation is an essential step. Existing approaches have achieved notable advancements in SITS segmentation with predetermined sequence lengths. However, we found that these approaches overlooked the generalization capability of models across scenarios with varying temporal length, leading to markedly poor segmentation results in such cases. To address this issue, we propose TEA, a TEmporal Adaptive SITS semantic segmentation method to enhance the model's resilience under varying sequence lengths. We introduce a teacher model that encapsulates the global sequence knowledge to guide a student model with adaptive temporal input lengths. Specifically, the teacher shapes the student's feature space via intermediate embedding, prototypes and soft label perspectives to realize knowledge transfer, while dynamically aggregating the student model to mitigate knowledge forgetting. Finally, we introduce full-sequence reconstruction as an auxiliary task to further enhance the quality of representations across inputs of varying temporal lengths. Through extensive experiments, we demonstrate that our method brings remarkable improvements across inputs of different temporal lengths on common benchmarks. Our code will be publicly available.
[145] SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection
Maximilian Pittner,Joel Janai,Mario Faigle,Alexandru Paul Condurache
Main category: cs.CV
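TL;DR: This paper proposes SparseLaneSTP, a new 3D lane detection method that introduces lane-specific spatio-temporal attention, a continuous lane representation, and temporal regularization, combining geometric priors with historical temporal information to achieve state-of-the-art performance on existing datasets and a newly built one.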
Details
Motivation: Existing 3D lane detection methods suffer from misalignment in BEV feature transformation and ignore lane-specific priors and historical temporal information, making low-visibility scenes hard to handle. Method: Proposes SparseLaneSTP, a sparse lane transformer architecture with a lane-specific spatio-temporal attention mechanism, a continuous lane representation, and temporal regularization, and builds a new precise and consistent 3D lane dataset for training and evaluation. Result: Experiments show the method achieves the best performance on multiple existing 3D lane detection benchmarks and on the new dataset, clearly outperforming dense BEV methods and existing sparse methods. Conclusion: By fusing structural priors with temporal information, SparseLaneSTP improves the accuracy and robustness of 3D lane detection and offers a new direction for exploiting spatio-temporal context. Abstract: 3D lane detection has emerged as a critical challenge in autonomous driving, encompassing identification and localization of lane markings and the 3D road surface. Conventional 3D methods detect lanes from dense bird's-eye-view (BEV) features, though erroneous transformations often result in a poor feature representation misaligned with the true 3D road surface. While recent sparse lane detectors have surpassed dense BEV approaches, they completely disregard valuable lane-specific priors. Furthermore, existing methods fail to utilize historic lane observations, which yield the potential to resolve ambiguities in situations of poor visibility. To address these challenges, we present SparseLaneSTP, a novel method that integrates both geometric properties of the lane structure and temporal information into a sparse lane transformer. It introduces a new lane-specific spatio-temporal attention mechanism, a continuous lane representation tailored for sparse architectures as well as temporal regularization. Identifying weaknesses of existing 3D lane datasets, we also introduce a precise and consistent 3D lane dataset using a simple yet effective auto-labeling strategy. Our experimental section proves the benefits of our contributions and demonstrates state-of-the-art performance across all detection and error metrics on existing 3D lane detection benchmarks as well as on our novel dataset.
[146] OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction
Minseong Kweon,Jinsun Park
Main category: cs.CV
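TL;DR: Proposes OceanSplat, a 3D Gaussian Splatting-based geometric representation for underwater scenes that uses trinocular view consistency, a synthetic epipolar depth prior, and depth-aware alpha adjustment to suppress floating artifacts in scattering media and improve underwater 3D reconstruction quality.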
Details
Motivation: Underwater optical degradation causes multi-view inconsistencies and floating artifacts, and existing methods struggle to reconstruct underwater 3D geometry accurately. Method: Enforces trinocular view consistency by rendering horizontally and vertically translated camera views and aligning them via inverse warping; uses the translated views to derive a synthetic epipolar depth prior through triangulation as a self-supervised regularizer; and proposes a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians based on their z-component and viewing direction. Result: On real and simulated underwater scenes, OceanSplat substantially outperforms existing methods in both scene reconstruction and image restoration, reducing floating artifacts and improving geometric accuracy. Conclusion: Through several geometric constraints and optimization strategies, the method disentangles 3D Gaussians from the scattering medium, giving a robust underwater geometric representation and high-quality reconstruction. Abstract: We introduce OceanSplat, a novel 3D Gaussian Splatting-based approach for accurately representing 3D geometry in underwater scenes. To overcome multi-view inconsistencies caused by underwater optical degradation, our method enforces trinocular view consistency by rendering horizontally and vertically translated camera views relative to each input view and aligning them via inverse warping. Furthermore, these translated camera views are used to derive a synthetic epipolar depth prior through triangulation, which serves as a self-supervised depth regularizer. These geometric constraints facilitate the spatial optimization of 3D Gaussians and preserve scene structure in underwater environments. We also propose a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians during early training based on their $z$-component and viewing direction, deterring the formation of medium-induced primitives. With our contributions, 3D Gaussians are disentangled from the scattering medium, enabling robust representation of object geometry and significantly reducing floating artifacts in reconstructed underwater scenes. Experiments on real-world underwater and simulated scenes demonstrate that OceanSplat substantially outperforms existing methods for both scene reconstruction and restoration in scattering media.
[147] Higher-Order Adversarial Patches for Real-Time Object Detectors
Jens Bayer,Stefan Becker,David Münch,Michael Arens,Jürgen Beyerer
Main category: cs.CV
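TL;DR: This paper studies the effect of higher-order adversarial attacks on the YOLOv10 object detector and finds that higher-order adversarial patches generalize better than lower-order ones, and that adversarial training alone is not enough to defend against such attacks.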
Details
Motivation: To examine the impact of higher-order adversarial attacks on object detection and the effectiveness of defenses against them. Method: Iteratively generates adversarial patches and hardens the YOLOv10 detector with adversarial training, evaluating its performance under evasion attacks. Result: Higher-order adversarial patches affect not only the detector they were trained against but also generalize more strongly; adversarial training alone cannot effectively withstand such attacks. Conclusion: More advanced defense strategies are needed against higher-order adversarial attacks; relying on adversarial training alone is insufficient. Abstract: Higher-order adversarial attacks can directly be considered the result of a cat-and-mouse game -- an elaborate action involving constant pursuit, near captures, and repeated escapes. This idiom best describes the ongoing circular training of adversarial attack patterns and adversarial training. The following work investigates the impact of higher-order adversarial attacks on object detectors by successively training attack patterns and hardening object detectors with adversarial training. The YOLOv10 object detector is chosen as a representative, and adversarial patches are used in an evasion attack manner. Our results indicate that higher-order adversarial patches not only affect the object detector they were directly trained on but also provide a stronger generalization capacity than lower-order adversarial patches. Moreover, the results highlight that adversarial training alone is not sufficient to harden an object detector efficiently against this kind of adversarial attack. Code: https://github.com/JensBayer/HigherOrder
[148] Patch-based Representation and Learning for Efficient Deformation Modeling
Ruochen Chen,Thuy Tran,Shaifali Parashar
Main category: cs.CV
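TL;DR: This paper presents PolyFit, a surface representation obtained by fitting jet functions on local surface patches; it can be learned efficiently, generalizes to many kinds of surfaces, and suits tasks such as shape recovery and garment simulation.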
Details
Motivation: Traditional methods for surface deformation rely on per-vertex optimization, which is computationally expensive and generalizes poorly; a more compact, efficient representation is needed to support multiple downstream tasks. Method: Fits jet functions on local surface patches to obtain the patch-based PolyFit representation, learns it in a supervised fashion from analytic functions and real data, and deforms surfaces efficiently by updating jet coefficients. Result: For Shape-from-template, PolyFit is markedly faster than offline physics-based solvers and more accurate than recent physics-guided neural simulators at modest additional runtime; for garment draping, inference is up to an order of magnitude faster than strong baselines. Conclusion: PolyFit provides a compact, efficient, and generalizable surface representation that improves performance and efficiency across several graphics and vision tasks. Abstract: In this paper, we present a patch-based representation of surfaces, PolyFit, which is obtained by fitting jet functions locally on surface patches. Such a representation can be learned efficiently in a supervised fashion from both analytic functions and real data. Once learned, it can be generalized to various types of surfaces. Using PolyFit, the surfaces can be efficiently deformed by updating a compact set of jet coefficients rather than optimizing per-vertex degrees of freedom for many downstream tasks in computer vision and graphics. We demonstrate the capabilities of our proposed methodologies with two applications: 1) Shape-from-template (SfT): where the goal is to deform the input 3D template of an object as seen in image/video. Using PolyFit, we adopt test-time optimization that delivers competitive accuracy while being markedly faster than offline physics-based solvers, and outperforms recent physics-guided neural simulators in accuracy at modest additional runtime. 2) Garment draping. We train a self-supervised, mesh- and garment-agnostic model that generalizes across resolutions and garment types, delivering up to an order-of-magnitude faster inference than strong baselines.
[149] From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)
Suyash Mishra,Qiang Li,Srikanth Patil,Anubhav Girdhar
Main category: cs.CV
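TL;DR: Proposes a domain-adapted video-to-clip generation framework that combines audio and vision language models to produce pharmacy video summaries efficiently and at low cost, with notable gains in processing speed and content quality.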
Details
Motivation: Traditional annotation of multimodal data in the pharmaceutical industry suffers from inconsistency, inefficiency, and quality problems, and long video and audio data (such as clinical trial interviews) are especially hard to use effectively. Method: Designs a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization, integrates ALMs and VLMs, adds a personalization mechanism based on role definition and prompt injection, and builds a cost-optimized end-to-end pipeline. Result: Validated on Video-MME and a dataset of 16,159 pharmacy videos, with 3-4x faster processing, 4x lower cost, and clip coherence (0.348) and informativeness (0.721) scores better than existing VLM baselines (e.g., Gemini 2.5 Pro). Conclusion: The method supports transparent, customizable, compliance-supporting video summarization and offers an efficient, practical solution for intelligent content processing in the life sciences. Abstract: Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g., long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain-adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost-efficient end-to-end pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate 3 to 4 times speedup, 4 times cost reduction, and competitive clip quality.
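As a rough sketch of what a cut-and-merge step with timestamp normalization might look like: selected highlight segments are clamped to the video duration, sorted, merged when they overlap or sit within a small gap, and then re-timed onto a contiguous output timeline. The function names, the `max_gap` parameter, and the merging rule are assumptions made for illustration; the abstract does not specify the algorithm's details, and the fade in/out step is omitted.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def merge_segments(segments: List[Segment], duration: float,
                   max_gap: float = 0.5) -> List[Segment]:
    """Clamp segments to [0, duration], sort them, and merge overlapping
    or near-adjacent ones (gap <= max_gap). Hypothetical merging rule."""
    clamped: List[Segment] = []
    for start, end in segments:
        start, end = max(0.0, start), min(duration, end)
        if end > start:  # drop empty or inverted segments
            clamped.append((start, end))
    clamped.sort()

    merged: List[Segment] = []
    for start, end in clamped:
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def normalize_timestamps(merged: List[Segment]) -> List[Segment]:
    """Map merged source segments onto a contiguous output timeline starting at 0."""
    out: List[Segment] = []
    cursor = 0.0
    for start, end in merged:
        length = end - start
        out.append((cursor, cursor + length))
        cursor += length
    return out


# Illustrative input: overlapping, out-of-order, and out-of-range segments.
highlights = [(12.0, 20.0), (19.5, 25.0), (40.0, 48.0), (-3.0, 2.0), (58.0, 70.0)]
clips = merge_segments(highlights, duration=60.0)
timeline = normalize_timestamps(clips)
```

Keeping the merge and the re-timing as separate functions means the source-side cut list (`clips`) stays available for auditing which parts of the original video were used, while `timeline` gives the positions in the output clip.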
Beyond efficiency gains, we also report that our method improves clip coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance-supporting video summarization for life sciences.
[150] Driving on Registers
Ellington Kirby,Alexandre Boulch,Yihong Xu,Yuan Yin,Gilles Puy,Éloi Zablocki,Andrei Bursuc,Spyros Gidaris,Renaud Marlet,Florent Bartoccioni,Anh-Quan Cao,Nermin Samet,Tuan-Hung VU,Matthieu Cord
Main category: cs.CV
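TL;DR: DrivoR is a transformer-based end-to-end autonomous driving architecture that uses pretrained Vision Transformers and camera-aware register tokens to compress multi-camera features, enabling efficient and accurate trajectory generation and scoring.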
Details
Motivation: Existing methods incur high computational cost when handling multi-camera inputs and lack interpretable modeling of driving behavior; DrivoR aims to improve efficiency and interpretability through a lightweight design. Method: Builds on pretrained ViTs, introduces camera-aware register tokens to compress multi-view features, and uses two lightweight transformer decoders to generate and score candidate trajectories; the scoring decoder learns to mimic an oracle and outputs sub-scores for aspects such as safety, comfort, and efficiency. Result: DrivoR outperforms or matches strong contemporary methods on NAVSIM-v1, NAVSIM-v2, and the closed-loop HUGSIM benchmark, supporting the effectiveness of a pure-transformer architecture for end-to-end driving. Conclusion: A pure-transformer architecture combined with targeted token compression is sufficient for accurate, efficient, and adaptive end-to-end driving. Abstract: We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available via the project page.
[151] UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition
Filippo Ghilotti,Samuel Brucker,Nahku Saidy,Matteo Matteucci,Mario Bijelic,Felix Heide
Main category: cs.CV
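TL;DR: Proposes an unsupervised multi-modal pseudo-labeling method that uses temporal-geometric consistency to lift and fuse cues from text and 2D vision foundation models into 3D, with no manual annotation, while jointly producing 3D semantic labels, 3D bounding boxes, and dense LiDAR scans.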
Details
Motivation: To address the high cost caused by the lack of labels for LiDAR data in autonomous driving and to break the annotation bottleneck in perception research. Method: Uses strong geometric priors from temporally accumulated LiDAR maps together with an iterative update rule that enforces joint geometric-semantic consistency, and detects moving objects from inconsistencies. Result: Generalizes well across three datasets and compares favorably to existing pseudo-labeling methods for semantic segmentation and object detection; a small fraction of geometrically consistent, densified LiDAR reduces depth prediction MAE by 51.5% and 22.0% in the 80-150 m and 150-250 m ranges, respectively. Conclusion: The method generates high-quality 3D pseudo-labels without manual annotation and substantially improves depth prediction, with broad application potential. Abstract: Unlabeled LiDAR logs, in autonomous driving applications, are inherently a gold mine of dense 3D geometry hiding in plain sight - yet they are almost useless without human labels, highlighting a dominant cost barrier for autonomous-perception research. In this work we tackle this bottleneck by leveraging temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi-modal pseudo-labeling method relying on strong geometric priors learned from temporally accumulated LiDAR maps, along with a novel iterative update rule that enforces joint geometric-semantic consistency and, conversely, detects moving objects from inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, demonstrating robust generalization across three datasets. We experimentally validate that our method compares favorably to existing semantic segmentation and object detection pseudo-labeling methods, which often require additional manual supervision. We confirm that even a small fraction of our geometrically consistent, densified LiDAR improves depth prediction by 51.5% and 22.0% MAE in the 80-150 and 150-250 meters range, respectively.
[152] From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
Zirui Wu,Zeren Jiang,Martin R. Oswald,Jie Song
Main category: cs.CV
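TL;DR: This paper proposes "projective conditioning", a new approach that improves the robustness and cross-view consistency of feed-forward view synthesis models by converting camera parameters into a stable target-view projective cue, and introduces a masked autoencoding pretraining strategy, improving the quality and geometric consistency of synthesized images.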
Details
Motivation: Existing methods based on Plücker ray representations are sensitive to the camera coordinate frame, which leads to geometric inconsistency under small camera transformations and a lack of robustness. Method: Proposes projective conditioning, which replaces raw camera parameters with a 2D projective cue for the target view, and uses masked autoencoding pretraining to exploit large-scale uncalibrated data. Result: Outperforms ray-conditioned methods on a view-consistency benchmark and reaches state-of-the-art quality on standard novel view synthesis benchmarks. Conclusion: Projective conditioning turns view synthesis from a brittle geometric regression problem into a stable image-to-image translation problem, markedly improving consistency and image quality. Abstract: Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.
[153] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
Runze He,Yiji Cheng,Tiankai Hang,Zhimin Li,Yu Xu,Zijin Yin,Shiyi Zhang,Wenxun Dai,Penghui Du,Ao Ma,Chunyu Wang,Qinglin Lu,Jizhong Han,Jiao Dai
Main category: cs.CV
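TL;DR: This paper proposes Re-Align, a framework that uses structured reasoning-guided alignment to close the gap between understanding and generation in in-context image generation and editing.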
Details
Motivation: Existing multimodal models show good understanding capabilities, but these strengths do not transfer effectively to image generation, so user intent is not executed accurately. Method: Introduces In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, combined with a reinforcement learning scheme based on a surrogate reward to improve alignment between the generated image and the reasoning text. Result: Experiments show that Re-Align outperforms competing methods of comparable model scale and resources on in-context image generation and editing. Conclusion: Through structured reasoning and reinforcement learning, Re-Align aligns understanding with generation and markedly improves in-context image generation and editing performance. Abstract: In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.
[154] VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
Ignacio de Rodrigo,Alvaro J. Lopez-Lopez,Jaime Boal
Main category: cs.CV
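TL;DR: VERSE is a method for analyzing and improving vision-language models applied to visually-rich document understanding: it explores their visual embedding space to visualize latent representations, identify problematic regions, and generate synthetic data to boost performance.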
Details
Motivation: To address the limitations of vision-language models on visually rich documents and improve the ability to understand and optimize error-prone regions. Method: Proposes VERSE, which analyzes the model's visual embedding space to visualize latent representations, locate problematic regions, and guide the generation of synthetic data targeting those regions. Result: Validation on the MERIT dataset shows that VERSE identifies visual features associated with errors, and retraining with samples containing these features substantially boosts F1 without degrading generalization; on-premise models such as Donut and Idefics2, once optimized with VERSE, match or even surpass SaaS solutions like GPT-4 and Pixtral. Conclusion: VERSE offers an effective route to improving vision-language models: it deepens understanding of model internals and, through targeted data generation, delivers performance that matches or exceeds large hosted models in practice. Abstract: This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
[155] VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
Sixiao Zheng,Minghao Yin,Wenbo Hu,Xiaoyu Li,Ying Shan,Yanwei Fu
Main category: cs.CV
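TL;DR: VerseCrafter is a 4D-aware video world model that uses a novel 4D geometric control representation to provide unified, explicit, and coherent control over camera and multi-object dynamics.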
Details
Motivation: Existing video world models struggle to control camera and multi-object motion in a unified, precise way in the 2D image plane and lack fine-grained modeling of real dynamic scenes. Method: Proposes a 4D Geometric Control representation that encodes the world state with a static background point cloud and per-object 3D Gaussian trajectories, and feeds these 4D controls as conditioning to a pretrained video diffusion model; also builds an automatic data engine that extracts 4D controls from in-the-wild videos to produce large-scale training data. Result: Generates high-fidelity, view-consistent videos that precisely follow the specified camera and object dynamics, and the model is trained successfully on the large volume of automatically extracted 4D annotations. Conclusion: VerseCrafter provides a scalable, category-agnostic, geometrically precise control framework for video world models, addressing the scarcity of 4D control signals and inaccurate dynamics modeling. Abstract: Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Unfortunately, another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.
[156] A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering
Md. Zahid Hossain,Most. Sharmin Sultana Samu,Md. Rakibul Islam,Md. Siam Ansary
Main category: cs.CV
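TL;DR: Proposes a lightweight vision-language framework for crop and disease identification from leaf images, combining a Swin Transformer vision encoder with sequence-to-sequence language decoders and using a two-stage training strategy to improve performance.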
Details
Motivation: Crop disease analysis requires accurate visual understanding and reliable text generation, but existing large-scale vision-language models have many parameters and are inefficient. Method: Uses a Swin Transformer as the vision encoder with sequence-to-sequence language decoders, and a two-stage training strategy to improve visual representation learning and cross-modal alignment; explainability is analyzed with Grad-CAM and token-level attribution. Result: Achieves high crop and disease classification accuracy on a large-scale crop disease dataset, performs strongly on natural language generation metrics such as BLEU, ROUGE, and BERTScore, and uses significantly fewer parameters than large baselines. Conclusion: Task-specific visual pretraining effectively improves crop disease visual question answering, and the proposed lightweight framework strikes a good balance between accuracy and efficiency. Abstract: Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.
[157] Atlas 2 -- Foundation models for clinical deployment
Maximilian Alber,Timo Milbich,Alexandra Carpen-Amarie,Stephan Tietz,Jonas Dippel,Lukas Muttenthaler,Beatriz Perez Cancer,Alessandro Benetti,Panos Korfiatis,Elias Eulig,Jérôme Lüscher,Jiasen Wu,Sayed Abid Hashimi,Gabriel Dernbach,Simon Schallenberg,Neelay Shah,Moritz Krügener,Aniruddh Jammoria,Jake Matras,Patrick Duffy,Matt Redlon,Philipp Jurmeister,David Horst,Lukas Ruff,Klaus-Robert Müller,Frederick Klauschen,Andrew Norgan
Main category: cs.CV
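TL;DR: Introduces the Atlas 2 family of pathology vision foundation models, which show state-of-the-art prediction performance, robustness, and resource efficiency across 80 public benchmarks and address limitations that held back clinical deployment of earlier models. The models are trained on 5.5 million whole slide images from three medical institutions, the largest pathology foundation model dataset to date.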
Details
Motivation: Existing pathology foundation models force tradeoffs among performance, robustness, and computational requirements; resolving these tradeoffs is needed for practical clinical use.
Method: Presents three models, Atlas 2, Atlas 2-B, and Atlas 2-S, trained on a large-scale dataset of 5.5 million histopathology whole slide images and evaluated comprehensively on 80 public benchmarks.
Result: Across the 80 public benchmarks, the Atlas 2 models reach state-of-the-art prediction performance, robustness, and resource efficiency.
Conclusion: Through large-scale training and an optimized design, the Atlas 2 models substantially improve the practicality of pathology foundation models and may help bring them into broad clinical deployment.
Abstract: Pathology foundation models substantially advanced the possibilities in computational pathology -- yet tradeoffs in terms of performance, robustness, and computational requirements remained, which limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models which bridge these shortcomings by showing state-of-the-art performance in prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date comprising 5.5 million histopathology whole slide images, collected from three medical institutions: Charité - Universitätsmedizin Berlin, LMU Munich, and Mayo Clinic.
[158] Multi-Scale Local Speculative Decoding for Image Generation
Elia Peruzzo,Guillaume Sautière,Amirhossein Habibian
Main category: cs.CV
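TL;DR: Proposes Multi-Scale Local Speculative Decoding (MuLo-SD), a framework that combines multi-resolution drafting with spatially informed verification to accelerate autoregressive image generation, reaching up to 1.7x faster inference while maintaining image quality.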
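To make the difference between raster-scan resampling and local resampling concrete, here is a minimal sketch. It is not the authors' implementation: the square neighborhood, the `radius` parameter, and both function names are assumptions made for illustration. It only counts which token positions on a 2D grid would be regenerated after the target model rejects some draft tokens.

```python
def raster_resample_set(accept):
    """Positions regenerated when everything after the first rejection
    (in raster-scan order) is resampled."""
    h, w = len(accept), len(accept[0])
    flat = [(r, c) for r in range(h) for c in range(w)]
    for i, (r, c) in enumerate(flat):
        if not accept[r][c]:
            return set(flat[i:])
    return set()


def local_resample_set(accept, radius=1):
    """Positions regenerated when only a square neighborhood around each
    rejected token is resampled (hypothetical neighborhood shape)."""
    h, w = len(accept), len(accept[0])
    out = set()
    for r in range(h):
        for c in range(w):
            if accept[r][c]:
                continue
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        out.add((rr, cc))
    return out


# A 4x4 token grid where the verifier rejects a single draft token at (0, 1).
accept = [
    [True, False, True, True],
    [True, True, True, True],
    [True, True, True, True],
    [True, True, True, True],
]
print(len(raster_resample_set(accept)))           # 15 positions
print(len(local_resample_set(accept, radius=1)))  # 6 positions
```

In this toy case a single early rejection discards 15 of 16 tokens under raster-scan resampling but only 6 under local resampling, which is the kind of saving the entry attributes to spatially local correction.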
Details
Motivation: Existing speculative decoding methods for accelerating autoregressive image generation are limited by token-level ambiguity and a lack of spatial awareness, so they struggle to deliver both generation efficiency and image fidelity.
Method: MuLo-SD uses a low-resolution draft model to propose candidate image tokens, upscales them with learned up-samplers, and verifies them in parallel with a high-resolution target model. A local rejection and resampling mechanism corrects draft errors within spatial neighborhoods instead of resampling in raster-scan order.
Result: On the MS-COCO 5k validation split, MuLo-SD achieves up to a 1.7x speedup, outperforming baselines such as EAGLE-2 and LANTERN in acceleration, while keeping comparable semantic alignment and perceptual quality on GenEval, DPG-Bench, and FID/HPSv2.
Conclusion: Through its multi-scale design and local error correction, MuLo-SD achieves a better balance between efficiency and fidelity in speculative decoding and sets a new state of the art for image synthesis.
Abstract: Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups - up to 1.7x - outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
[159] Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
Shuliang Liu,Songbo Yang,Dong Fang,Sihang Jia,Yuqi Tang,Lingfeng Su,Ruoshui Peng,Yibo Yan,Xin Zou,Xuming Hu
Main category: cs.CV
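TL;DR: Proposes VLI, a training-free inference framework that reduces object hallucination in multimodal large language models by simulating a metacognitive self-correction process.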
Details
Motivation: Existing approaches to mitigating object hallucination are limited: they either fail to correct internal semantic misalignments or lack instance-specific precision.
Method: VLI has two parts. Attributive Introspection diagnoses hallucination risk and localizes the causal visual anchors. Interpretable Bi-Causal Steering then modulates the inference process dynamically, separating visual evidence from noise.
Result: Reduces object hallucination rates by 12.67% on MMHal-Bench and improves accuracy by 5.8% on POPE.
Conclusion: With dynamic, instance-specific intervention, VLI substantially improves the reliability of multimodal models without any training.
Abstract: Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.
[160] CoV: Chain-of-View Prompting for Spatial Reasoning
Haoyu Zhao,Akide Liu,Zeyu Zhang,Weijie Wang,Feng Chen,Ruihan Zhu,Gholamreza Haffari,Bohan Zhuang
Main category: cs.CV
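TL;DR: Proposes Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that turns a vision-language model (VLM) into an active viewpoint reasoner through coarse-to-fine exploration, improving embodied question answering (EQA) in 3D environments.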
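The coarse-to-fine exploration described for this entry can be pictured as a budgeted loop. The sketch below is only an illustration under assumptions: the callables `select_anchor_views`, `propose_action`, `apply_action`, and `is_sufficient` are hypothetical stand-ins for the view-selection agent, the VLM's reasoning step, the 3D scene renderer, and the stopping check. None of them are names from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def chain_of_view(
    frames: Sequence[str],
    select_anchor_views: Callable[[Sequence[str]], Sequence[str]],
    propose_action: Callable[[List[str]], str],
    apply_action: Callable[[str], str],
    is_sufficient: Callable[[List[str]], bool],
    step_budget: int = 5,
) -> Tuple[List[str], int]:
    """Gather views until the context is judged sufficient or the budget runs out."""
    # Coarse stage: keep only question-aligned anchor views.
    context = list(select_anchor_views(frames))
    steps = 0
    # Fine stage: interleave reasoning with discrete camera actions.
    while steps < step_budget and not is_sufficient(context):
        action = propose_action(context)       # e.g. "turn_left"
        context.append(apply_action(action))   # new observation from the scene
        steps += 1
    return context, steps


# Toy usage with placeholder callables.
views, used = chain_of_view(
    frames=["f0", "f1", "f2"],
    select_anchor_views=lambda fs: fs[:1],
    propose_action=lambda ctx: "turn_left",
    apply_action=lambda a: f"view_after_{a}",
    is_sufficient=lambda ctx: len(ctx) >= 3,
)
print(views, used)
```

Keeping the stopping check and the budget as separate conditions mirrors the description that exploration ends either when enough context is gathered or when a step budget is reached.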
Details
Motivation: In EQA, existing VLMs are restricted to a fixed number of input views, so they struggle to gather context that is spread across viewpoints and partially occluded, which limits complex spatial reasoning.
Method: CoV first uses a View Selection agent to filter redundant frames and pick question-aligned anchor views. It then interleaves iterative reasoning with discrete camera actions, obtaining new observations from the 3D scene until enough context has accumulated.
Result: On OpenEQA, four mainstream VLMs gain +11.56% LLM-Match on average, with a maximum of +13.62% on Qwen3-VL-Flash. Increasing the action budget improves results further, for example +3.73% on Gemini-2.5-Flash. CoV also performs well on ScanQA and SQA3D.
Conclusion: Question-aligned view selection combined with open-view search is an effective, model-agnostic way to improve spatial reasoning in 3D EQA without training.
Abstract: Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
[161] VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Shuming Liu,Mingchen Zhuge,Changsheng Zhao,Jun Chen,Lemeng Wu,Zechun Liu,Chenchen Zhu,Zhipeng Cai,Chong Zhou,Haozhe Liu,Ernie Chang,Saksham Suri,Hongyu Xu,Qi Qian,Wei Wen,Balakrishnan Varadarajan,Zhuang Liu,Hu Xu,Florian Bordes,Raghuraman Krishnamoorthi,Bernard Ghanem,Vikas Chandra,Yunyang Xiong
Main category: cs.CV
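TL;DR: Proposes VideoAuto-R1, a video understanding framework that reasons only when necessary. Trained with a "Thinking Once, Answering Twice" paradigm, it keeps accuracy high while making inference substantially more efficient.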
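The inference-time gating in this entry (use the initial answer's confidence to decide whether to reason) can be sketched in a few lines. This is a hedged illustration only: the fixed `threshold`, the `reason_and_review` callable, and the function name are assumptions, and the digest does not state how the confidence score is computed or thresholded.

```python
from typing import Callable, Tuple


def answer_with_optional_reasoning(
    initial_answer: str,
    confidence: float,
    reason_and_review: Callable[[str], str],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the paper
) -> Tuple[str, bool]:
    """Return (final_answer, used_reasoning)."""
    if confidence >= threshold:
        # Confident enough: skip the costly reasoning pass entirely.
        return initial_answer, False
    # Otherwise run reasoning and emit the reviewed (second) answer.
    return reason_and_review(initial_answer), True


# Toy usage: a stand-in reviewer that always revises the answer to "B".
print(answer_with_optional_reasoning("A", 0.95, lambda a: "B"))  # ('A', False)
print(answer_with_optional_reasoning("A", 0.40, lambda a: "B"))  # ('B', True)
```

Returning the `used_reasoning` flag makes it easy to measure how often the thinking mode is activated per task type, which is the statistic the entry reports qualitatively.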
Details
Motivation: Chain-of-thought (CoT) reasoning is widely used for video understanding, but its necessity and benefits remain unclear. Direct answering matches or beats CoT in some cases, which motivates a more efficient reasoning mechanism.
Method: VideoAuto-R1 is trained with a Thinking Once, Answering Twice paradigm: the model generates an initial answer, then reasons, then outputs a reviewed answer, and both answers are supervised with verifiable rewards. At inference, the confidence of the initial answer determines whether reasoning is triggered.
Result: Achieves state-of-the-art results on video QA and grounding benchmarks while cutting the average response length by about 3.3x (for example, from 149 to 44 tokens). Thinking mode is activated rarely on perception-oriented tasks and more often on tasks that need complex reasoning.
Conclusion: Explicit language-based reasoning is helpful but not always necessary; reasoning on demand improves efficiency substantially without hurting performance.
Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
[162] Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable
Zuhair Ahmed Khan Taha,Mohammed Mudassir Uddin,Shahnawaz Alam
Main category: cs.CV
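TL;DR: AgentCompress routes each task to a model variant compressed according to the task's predicted difficulty, substantially lowering the compute cost of using large models for research.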
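A rough sketch of difficulty-based routing follows. It deliberately swaps in a keyword heuristic for the small neural network the entry describes, so it illustrates the routing structure rather than the authors' predictor. The cue words, the score cutoffs, the tier names, and the `n_words` window are all hypothetical.

```python
# Hypothetical cue words; a learned difficulty predictor would replace these.
HARD_CUES = {"hypothesize", "propose", "design", "derive"}
EASY_CUES = {"reformat", "sort", "list", "convert"}


def difficulty_score(task: str, n_words: int = 8) -> float:
    """Score difficulty in [0, 1] from the task's opening words only."""
    words = [w.strip(".,:;").lower() for w in task.split()[:n_words]]
    score = 0.5
    if any(w in HARD_CUES for w in words):
        score += 0.4
    if any(w in EASY_CUES for w in words):
        score -= 0.4
    return min(1.0, max(0.0, score))


def route(task: str) -> str:
    """Map the difficulty score to an illustrative compression tier."""
    s = difficulty_score(task)
    if s < 0.34:
        return "heavily-compressed"
    if s < 0.67:
        return "moderately-compressed"
    return "full-precision"


print(route("Reformat this bibliography into BibTeX"))           # heavily-compressed
print(route("Propose a novel hypothesis about protein folding"))  # full-precision
print(route("Summarize the attached paper"))                      # moderately-compressed
```

Scoring only the first few words keeps the routing decision cheap, in the spirit of the entry's claim that the decision is made from a task's opening words.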
Details
Motivation: Using large language models for research automation is computationally expensive and out of reach for many academic labs, so more cost-efficient methods are needed.
Method: A small neural network predicts each task's difficulty from its opening words and routes the task to a model variant with a matching level of compression.
Result: Across 500 research workflows in four scientific fields, compute costs drop by 68.3% while 96.2% of the original success rate is retained.
Conclusion: AgentCompress lets resource-constrained labs use large models efficiently for research tasks.
Abstract: When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A single research session using a 70-billion parameter model can cost around $127 in cloud fees, putting these tools out of reach for many academic labs. We developed AgentCompress to tackle this problem head-on. The core idea came from a simple observation during our own work: writing a novel hypothesis clearly demands more from the model than reformatting a bibliography. Why should both tasks run at full precision? Our system uses a small neural network to gauge how hard each incoming task will be, based only on its opening words, then routes it to a suitably compressed model variant. The decision happens in under a millisecond. Testing across 500 research workflows in four scientific fields, we cut compute costs by 68.3% while keeping 96.2% of the original success rate. For labs watching their budgets, this could mean the difference between running experiments and sitting on the sidelines.
[163] Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
William Rudman,Michal Golovanevsky,Dana Arad,Yonatan Belinkov,Ritambhara Singh,Carsten Eickhoff,Kyle Mahowald
Main category: cs.CV
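TL;DR: Studies hallucinations that text prompts induce in large vision-language models on an object-counting task, finding that models follow an incorrect prompt more readily as the number of objects grows. Mechanistic analysis identifies a small set of attention heads (PIH-heads) whose ablation reduces prompt-induced hallucinations by at least 40% and strengthens correction toward visual evidence.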
Details
Motivation: Large vision-language models over-rely on the text prompt instead of the visual input at inference time, which causes hallucinations, especially in tasks such as object counting that require precise perception.
Method: Designs a controlled object-counting experiment in which prompts overstate the object count to induce hallucination. Three VLMs are analyzed mechanistically to identify the attention heads involved in prompt copying, and ablation experiments measure their effect.
Result: Models correct the prompt when few objects are present but tend to conform to it as the count rises. A small set of PIH attention heads is identified whose ablation reduces hallucinations by at least 40% and increases reliance on visual evidence.
Conclusion: Prompt-induced hallucination stems from the activity of specific attention heads, which mediate prompt copying in model-specific ways. Ablating them suppresses hallucination substantially without retraining and reveals how the mechanisms behind this behavior differ between models.
Abstract: Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
[164] MoE3D: A Mixture-of-Experts Module for 3D Reconstruction
Zichen Wang,Ang Cao,Liam J. Wang,Jeong Joon Park
Main category: cs.CV
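TL;DR: MoE3D is a mixture-of-experts module that sharpens depth boundaries and reduces flying-point artifacts in 3D reconstruction models. It predicts multiple candidate depth maps and fuses them with dynamic weights, improving reconstruction quality with almost no added compute.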
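As a minimal sketch of what "fusing candidate depth maps via dynamic weighting" could look like, the function below applies a per-pixel softmax over expert scores and takes the weighted sum of the candidates. The softmax form, the `logits` input, and the function name are assumptions for illustration. The digest does not specify the exact weighting scheme.

```python
import math


def fuse_depth(candidates, logits):
    """Fuse K candidate depth maps with per-pixel softmax weights.

    candidates[k][i][j]: depth predicted by expert k at pixel (i, j).
    logits[k][i][j]: unnormalized weight for expert k at pixel (i, j)
                     (hypothetical gating output).
    """
    k = len(candidates)
    h, w = len(candidates[0]), len(candidates[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Subtract the max logit for a numerically stable softmax.
            m = max(logits[e][i][j] for e in range(k))
            exps = [math.exp(logits[e][i][j] - m) for e in range(k)]
            z = sum(exps)
            fused[i][j] = sum(exps[e] / z * candidates[e][i][j] for e in range(k))
    return fused


# Two experts on a 1x2 map: one predicts depth 1.0, the other 5.0.
candidates = [[[1.0, 1.0]], [[5.0, 5.0]]]
logits = [[[0.0, 10.0]], [[0.0, -10.0]]]
print(fuse_depth(candidates, logits))
```

At the first pixel the weights are equal, so the fused depth is the average (3.0), a value neither expert predicted, which is how blended estimates can produce flying points at a discontinuity. At the second pixel the weights are sharply peaked, so the output stays very close to 1.0 and the edge remains crisp.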
Details
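Motivation: Existing feed-forward 3D reconstruction models suffer from blurry depth boundaries and flying-point artifacts that degrade reconstruction quality, so an effective way to refine their depth output is needed.
Method: MoE3D predicts multiple candidate depth maps and fuses them with dynamic weights, and is combined with a pre-trained 3D reconstruction backbone such as VGGT.
Result: With MoE3D integrated, the reconstruction model produces sharper depth boundaries, noticeably fewer flying-point artifacts, and substantially better reconstruction quality, at minimal additional computational cost.
Conclusion: MoE3D is an efficient, plug-and-play module that improves existing 3D reconstruction models and is practical and extensible.
Abstract: MoE3D is a mixture-of-experts module designed to sharpen depth boundaries and mitigate flying-point artifacts (highlighted in red) of existing feed-forward 3D reconstruction models (left side). MoE3D predicts multiple candidate depth maps and fuses them via dynamic weighting (visualized by MoE weights on the right side). When integrated with a pre-trained 3D reconstruction backbone such as VGGT, it substantially enhances reconstruction quality with minimal additional computational overhead. Best viewed digitally.
[165] FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching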
Danilo Danese,Angela Lombardi,Matteo Attimonelli,Giuseppe Fasano,Tommaso Di Noia
Main category: cs.CV
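TL;DR: Proposes FlowLet, a conditional generative framework based on flow matching that synthesizes age-conditioned 3D MRI volumes in an invertible 3D wavelet domain, improving brain age prediction models on underrepresented age groups.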
Details
Motivation: Existing 3D MRI datasets are demographically skewed, and current generative methods are often slow at inference, prone to artifacts, and lack age conditioning, all of which hurt the fairness and generalization of brain age prediction.
Method: FlowLet applies flow matching in an invertible 3D wavelet domain to generate age-conditioned MRIs, avoiding the artifacts caused by latent-space compression and reducing computational cost.
Result: Experiments show that FlowLet generates high-fidelity 3D MRI volumes in few sampling steps, improves BAP model performance on underrepresented age groups, and preserves anatomical structures in region-based analysis.
Conclusion: FlowLet is an efficient, low-artifact conditional generation method that helps mitigate MRI data imbalance and improves the fairness and practicality of brain age prediction models.
Abstract: Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
[166] ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos
Rustin Soraki,Homanga Bharadhwaj,Ali Farhadi,Roozbeh Mottaghi
Main category: cs.CV
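TL;DR: Proposes ObjectForesight, a 3D object-centric dynamics model that predicts the future 6-DoF poses and trajectories of rigid objects from passive visual observation.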
Details
Motivation: To give computational systems a human-like ability to anticipate object motion, inferring plausible future motions from passive visual observation.
Method: ObjectForesight uses a 3D object-level representation and combines segmentation, mesh reconstruction, and 3D pose estimation to build a training set of more than 2 million short clips with pseudo-ground-truth trajectories.
Result: Experiments show significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes.
Conclusion: ObjectForesight provides a scalable framework for learning physically grounded, object-centric dynamics models from visual observation.
Abstract: Humans can effortlessly anticipate how objects might move or change through interaction--imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of 2 million plus short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. Project page: objectforesight.github.io
[167] Plenoptic Video Generation
Xiao Fu,Shitao Tang,Min Shi,Xian Liu,Jinwei Gu,Ming-Yu Liu,Dahua Lin,Chen-Hsuan Lin
Main category: cs.CV
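TL;DR: Proposes PlenopticDreamer, a framework that achieves spatio-temporally consistent generative video re-rendering across multiple views through autoregressive training and a camera-guided video retrieval strategy.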
Details
Motivation: Existing camera-controlled generative video re-rendering methods struggle to stay consistent in multi-view scenarios, particularly in keeping hallucinated regions spatio-temporally coherent.
Method: Designs a multi-in-single-out video-conditioned model trained autoregressively, with a camera-guided video retrieval strategy that selects salient videos from previous generations as conditional inputs. Progressive context-scaling, self-conditioning, and a long-video conditioning mechanism improve training convergence and robustness.
Result: Experiments on the Basic and Agibot benchmarks show state-of-the-art performance in view synchronization, visual fidelity, camera control accuracy, and diversity of view transformations.
Conclusion: PlenopticDreamer effectively addresses spatio-temporal inconsistency in multi-view generative video re-rendering and supports high-quality, long-sequence generation from diverse viewpoints.
Abstract: Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view settings, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/
[168] RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
Boyang Wang,Haoran Zhang,Shujie Zhang,Jinkun Hao,Mingda Jia,Qi Lv,Yucheng Mao,Zhaoyang Lyu,Jia Zeng,Xudong Xu,Jiangmiao Pang
Main category: cs.CV
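TL;DR: Proposes visual identity prompting, which supplies exemplar images as conditioning inputs to a diffusion model to generate multi-view, temporally consistent manipulation data that improves policy model training.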
Details
Motivation: Existing text-prompted diffusion models for generating robot manipulation data fall short on multi-view consistency, temporal coherence, and precise control of the scene setup.
Method: Introduces visual identity prompting, which provides exemplar images as conditioning inputs to the diffusion model, and builds a scalable visual identity pool to guide scene generation.
Result: Vision-language-action and visuomotor policy models trained on the augmented manipulation data improve in both simulation and real-robot settings.
Conclusion: Visual identity prompting generates high-quality, diverse manipulation data and overcomes the limits of text prompts for controlling complex scenes.
Abstract: The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.
[169] GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation
Henghui Ding,Chang Liu,Shuting He,Xudong Jiang,Yu-Gang Jiang
Main category: cs.CV
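TL;DR: Introduces three new benchmarks, Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively called GREx, and builds gRefCOCO, the first large-scale dataset supporting single-target, multi-target, and no-target expressions, which makes referring expression tasks more applicable to real scenarios. It also proposes ReLA, a region-language attention baseline that models complex relationships well and achieves state-of-the-art results.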
Details
Motivation: Existing referring expression tasks support only single-target expressions and cannot handle the multi-target or no-target cases found in real scenes, which limits their use. The classic REx tasks therefore need to be extended to more general expressions.
Method: Proposes the GREx tasks and the gRefCOCO dataset with a backward-compatible evaluation framework. For the complex relationship modeling required by GRES/GREC, it proposes ReLA, which adaptively divides the image into regions and explicitly models region-region and region-language dependencies.
Result: Builds the large-scale gRefCOCO dataset containing multi-target, no-target, and single-target expressions. ReLA achieves state-of-the-art results on GRES and GREC, and experiments expose the limitations of existing REx methods on GREx tasks.
Conclusion: GREx moves referring expression tasks toward more general and realistic settings, gRefCOCO provides an important resource for follow-up research, and ReLA demonstrates the value of explicit relationship modeling in complex scenes.
Abstract: Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing datasets and methods commonly support single-target expressions only, i.e., one expression refers to one object, not considering multi-target and no-target expressions. This greatly limits the real applications of REx (RES/REC/REG). This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset gRefCOCO that contains multi-target, no-target, and single-target expressions and their corresponding images with labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments to study the performance gap of the existing REx methods on GREx tasks. One of the challenges of GRES/GREC is complex relationship modeling, for which we propose a baseline ReLA that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed ReLA achieves state-of-the-art results on both the GRES and GREC tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GREx.
[170] Pixel-Perfect Visual Geometry Estimation
Gangwei Xu,Haotong Lin,Hongcheng Luo,Haiyang Sun,Bing Wang,Guang Chen,Sida Peng,Hangjun Ye,Xin Yang
Main category: cs.CV
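TL;DR: Proposes PPD and PPVD, pixel-perfect visual geometry models built on pixel-space diffusion transformers (DiT). A semantics-prompted DiT and a cascade architecture remove flying pixels while preserving detail, giving state-of-the-art results in monocular and video depth estimation.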
Details
Motivation: Existing geometry foundation models suffer from severe flying pixels and loss of fine detail, falling short of the high-quality geometry reconstruction that robotics and augmented reality require.
Method: Proposes Pixel-Perfect Depth (PPD), built on pixel-space diffusion transformers (DiT), with a Semantics-Prompted DiT and a Cascade DiT architecture to improve efficiency and accuracy. It is extended to a video model, PPVD, which adds a Semantics-Consistent DiT and reference-guided token propagation to maintain temporal coherence.
Result: Achieves the best performance among generative models on monocular and video depth estimation and produces significantly cleaner point clouds than all other models.
Conclusion: By combining generative modeling with visual semantic priors, PPD and PPVD deliver high-quality geometry prediction free of flying pixels, offering a new approach to recovering accurate geometry from images.
Abstract: Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
[171] RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
Yuan-Kang Lee,Kuan-Lin Chen,Chia-Che Chang,Yu-Lun Liu
Main category: cs.CV
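TL;DR: Proposes RL-AWB, a nighttime white balance framework that combines statistical methods with deep reinforcement learning and generalizes well across low-light and well-illuminated images.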
Details
Motivation: Nighttime color constancy is challenging because of low-light noise and complex illumination, and existing methods adapt poorly to diverse sensors and lighting conditions.
Method: First designs a statistical algorithm for nighttime scenes that combines salient gray pixel detection with a new illumination estimation method. On top of it, builds the first deep reinforcement learning framework for color constancy, which mimics professional AWB tuning experts by dynamically optimizing parameters for each image.
Result: Experiments show superior generalization across multiple sensors and lighting conditions, outperforming existing methods.
Conclusion: RL-AWB offers an effective and scalable solution for nighttime white balance and advances research on cross-sensor color constancy.
Abstract: Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experimental results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/
[172] QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer
Daniele Lizzio Bosco,Shuteng Wang,Giuseppe Serra,Vladislav Golyanik
Main category: cs.CV
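TL;DR: Proposes QNeRF, the first hybrid quantum-classical model for novel-view synthesis from 2D images. It uses parameterised quantum circuits for a more compact representation, matching or exceeding classical NeRF with fewer than half the parameters.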
Details
Motivation: NeRF has advanced novel-view synthesis, but its models are large and expensive to train. Quantum Visual Fields (QVFs) offer advantages in model compactness and convergence speed, which motivates bringing quantum methods to NeRF to improve efficiency.
Method: Proposes two QNeRF architectures. Full QNeRF exploits all quantum amplitudes to increase representational capacity. Dual-Branch QNeRF processes spatial and view-dependent information in separate branches, introducing a task-informed inductive bias that lowers complexity and improves scalability and hardware compatibility.
Result: When trained on images of moderate resolution, QNeRF matches or outperforms classical NeRF baselines with fewer than half the parameters.
Conclusion: Quantum machine learning can be a competitive alternative for continuous signal representation in computer vision, particularly for mid-level tasks such as learning 3D representations from 2D observations.
Abstract: Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen major advances with Neural Radiance Fields (NeRFs), where models learn a compact representation from 2D images to render 3D scenes, albeit at the cost of larger models and intensive training. In this work, we extend the approach of QVFs by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF leverages parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, resulting in more compact models compared to the classical counterpart. We present two architectural variants. Full QNeRF maximally exploits all quantum amplitudes to enhance representational capabilities. In contrast, Dual-Branch QNeRF introduces a task-informed inductive bias by branching spatial and view-dependent quantum state preparations, drastically reducing the complexity of this operation and ensuring scalability and potential hardware compatibility. Our experiments demonstrate that -- when trained on images of moderate resolution -- QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level tasks in computer vision, such as 3D representation learning from 2D observations.
[173] Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
Zeren Jiang,Chuanxia Zheng,Iro Laina,Diane Larlus,Andrea Vedaldi
Main category: cs.CV
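TL;DR: Proposes Mesh4D, a feed-forward model for monocular 4D mesh reconstruction that uses a compact latent space and spatio-temporal attention to accurately reconstruct the 3D shape and motion of dynamic objects without requiring skeletal information at inference time.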