cs.CL

[1] WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables

Zhaojiang Lin, Yong Xu, Kai Sun, Jing Zheng, Yin Huang, Surya Teja Appini, Krish Narang, Renjie Tao, Ishan Kapil Jain, Siddhant Arora, Ruizhi Li, Yiteng Huang, Kaushik Patnaik, Wenfang Xu, Suwon Shon, Yue Liu, Ahmed A Aly, Anuj Kumar, Florian Metze, Xin Luna Dong

Main category: cs.CL

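TL;DR: WearVox is the first benchmark for evaluating voice assistants in wearable scenarios such as AI glasses. It contains 3,842 multi-channel, egocentric audio recordings spanning several tasks and real-world environments, exposes the limits of current speech LLMs in noisy conditions, and shows that multi-channel audio markedly improves robustness and the ability to tell device-directed speech from background speech.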

Details

Motivation: Existing voice assistant benchmarks do not adequately cover the challenges specific to wearables, such as motion-induced audio artifacts, background noise, and the detection of device-directed speech, so a more realistic evaluation benchmark is needed.

Method: Build the WearVox benchmark from 3,842 real multi-channel egocentric recordings captured with AI glasses, covering five tasks (Search-Grounded QA, Closed-Book QA, Side-Talk Rejection, Tool Calling, and Speech Translation) and varied indoor and outdoor environments; evaluate leading speech LLMs on it; and run a case study comparing single-channel and multi-channel input.

Result: Most real-time speech LLMs reach only 29% to 59% accuracy on WearVox, with substantial degradation on noisy outdoor audio. The case study shows that multi-channel input improves robustness to environmental noise and the recognition of device-directed speech.

Conclusion: Spatial audio cues are critical for context-aware voice assistants, and WearVox offers a more challenging and realistic testbed for wearable voice AI.

Abstract: Wearable devices such as AI glasses are transforming voice assistants into always-available, hands-free collaborators that integrate seamlessly with daily life, but they also introduce challenges like egocentric audio affected by motion and noise, rapid micro-interactions, and the need to distinguish device-directed speech from background conversations. Existing benchmarks largely overlook these complexities, focusing instead on clean or generic conversational audio. To bridge this gap, we present WearVox, the first benchmark designed to rigorously evaluate voice assistants in realistic wearable scenarios. WearVox comprises 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks including Search-Grounded QA, Closed-Book QA, Side-Talk Rejection, Tool Calling, and Speech Translation, spanning a wide range of indoor and outdoor environments and acoustic conditions. Each recording is accompanied by rich metadata, enabling nuanced analysis of model performance under real-world constraints. We benchmark leading proprietary and open-source speech Large Language Models (SLLMs) and find that most real-time SLLMs achieve accuracies on WearVox ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, underscoring the difficulty and realism of the benchmark. Additionally, we conduct a case study with two new SLLMs that perform inference with single-channel and multi-channel audio, demonstrating that multi-channel audio inputs significantly enhance model robustness to environmental noise and improve discrimination between device-directed and background speech. Our results highlight the critical importance of spatial audio cues for context-aware voice assistants and establish WearVox as a comprehensive testbed for advancing wearable voice AI research.

[2] PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models

Inpyo Song, Eunji Jeon, Jangwon Lee

Main category: cs.CL

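TL;DR: This paper introduces PCEval, the first automatically evaluated benchmark for physical computing, which assesses the logical and physical design abilities of LLMs in combined hardware/software projects. Experiments show that LLMs do well at code and logical circuit generation but struggle significantly with physical breadboard wiring and with avoiding circuit errors.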

Details

Motivation: Prior work mostly studies LLMs in pure software development; their actual abilities in physical computing, where hardware constraints apply, have not been evaluated systematically. An automated framework is therefore needed to measure model performance under realistic hardware constraints.

Method: Propose the PCEval evaluation framework and test 13 leading LLMs in a simulation environment, assessing their ability to generate circuits and matching code across projects of varying complexity, with particular attention to the correctness of logical design and physical layout (such as pin connections), all validated automatically without human assessment.

Result: LLMs perform well on code generation and logical circuit design but poorly on physical breadboard layout, with serious problems in managing pin connections and avoiding circuit errors such as short circuits.

Conclusion: PCEval provides the first reproducible, automatically validated benchmark of LLM abilities in physical computing, exposes the limits of current models in handling hardware implementation constraints, and lays groundwork for AI-assisted tools for physical computing education.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including software development, education, and technical assistance. Among these, software development is one of the key areas where LLMs are increasingly adopted. However, when hardware constraints are considered (for instance, in physical computing, where software must interact with and control physical hardware), their effectiveness has not been fully explored. To address this gap, we introduce PCEval (Physical Computing Evaluation), the first benchmark in physical computing that enables a fully automatic evaluation of the capabilities of LLMs in both the logical and physical aspects of the projects, without requiring human assessment. Our evaluation framework assesses LLMs in generating circuits and producing compatible code across varying levels of project complexity. Through comprehensive testing of 13 leading models, PCEval provides the first reproducible and automatically validated empirical assessment of LLMs' ability to reason about fundamental hardware implementation constraints within a simulation environment. Our findings reveal that while LLMs perform well in code generation and logical circuit design, they struggle significantly with physical breadboard layout creation, particularly in managing proper pin connections and avoiding circuit errors. PCEval advances our understanding of AI assistance in hardware-dependent computing environments and establishes a foundation for developing more effective tools to support physical computing education.
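To make the "avoiding circuit errors" criterion concrete, here is a minimal sketch of one check an automatic validator could run: detecting a direct short between power and ground in a wiring list. This is an illustration only, not PCEval's actual validator; the pin names, the `has_short` helper, and the wire-list format are all assumptions.

```python
# Hypothetical netlist check; every name here is illustrative, not PCEval's API.

def find(parent, node):
    """Return the representative of node's net, compressing paths as we go."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def has_short(wires, power="5V", ground="GND"):
    """Return True if some chain of wires joins the power and ground pins."""
    parent = {}
    for a, b in wires:
        # Each wire merges the nets of its two endpoints (union-find).
        parent[find(parent, a)] = find(parent, b)
    return find(parent, power) == find(parent, ground)


# Wires only; the resistor and LED bodies are components, not wires.
good = [("5V", "R1.a"), ("R1.b", "LED.anode"), ("LED.cathode", "GND")]
bad = good + [("R1.a", "LED.cathode")]  # stray jumper bridges 5V to GND

print(has_short(good), has_short(bad))  # prints: False True
```

A union-find structure fits here because breadboard rows and jumper wires form electrical nets, and a short is simply two forbidden pins landing in the same net.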

[3] Losses that Cook: Topological Optimal Transport for Structured Recipe Generation

Mattia Ottoborgo, Daniele Rege Cambrin, Paolo Garza

Main category: cs.CL

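TL;DR: Proposes a new topological loss that represents ingredient lists as point clouds in embedding space, improving the accuracy of ingredients and steps in generated recipes; it outperforms conventional training in both automatic and human evaluation.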

Details

Motivation: Standard training focuses only on text fluency and cannot guarantee accurate ingredient composition, time and temperature control, or procedural flow in a recipe, so more effective composite objectives are needed.

Method: Building on the RECIPE-NLG dataset, introduce a topological loss that represents ingredient lists as point clouds in embedding space, use a Dice loss to optimize time and temperature prediction, and train jointly with a mixed loss.

Result: The topological loss significantly improves ingredient- and action-level metrics; the Dice loss improves time/temperature precision; the mixed loss yields synergistic gains in quantity and time; and in a human preference test the model is preferred in 62% of cases.

Conclusion: A point-cloud-based topological loss combined with composite objectives improves recipe generation quality over plain cross-entropy training, particularly in the accuracy of key cooking parameters.

Abstract: Cooking recipes are complex procedures that require not only a fluent and factual text, but also accurate timing, temperature, and procedural coherence, as well as the correct composition of ingredients. Standard training procedures are primarily based on cross-entropy and focus solely on fluency. Building on RECIPE-NLG, we investigate the use of several composite objectives and present a new topological loss that represents ingredient lists as point clouds in embedding space, minimizing the divergence between predicted and gold ingredients. Using both standard NLG metrics and recipe-specific metrics, we find that our loss significantly improves ingredient- and action-level metrics. Meanwhile, the Dice loss excels in time/temperature precision, and the mixed loss yields competitive trade-offs with synergistic gains in quantity and time. A human preference analysis supports our finding, showing our model is preferred in 62% of the cases.

[4] ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation

Hyeong Kyu Choi, Sharon Li

Main category: cs.CL

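TL;DR: Proposes ModeX, an evaluator-free Best-of-N selection framework that uses spectral clustering to identify the modal output among generated texts, enabling efficient and robust generation on open-ended tasks.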

Details

Motivation: On open-ended tasks with no canonical answer, choosing a high-quality output from several stochastic generations is a challenge for LLMs. Existing methods depend on external evaluators or exact string matching, which limits their applicability and efficiency.

Method: Propose Mode Extraction (ModeX), which builds a similarity graph over the generated texts and recursively applies spectral clustering to select a centroid that represents the semantic consensus; also propose ModeX-Lite, which adds early pruning for efficiency.

Result: On open-ended tasks including text summarization, code generation, and mathematical reasoning, ModeX and its lightweight variant ModeX-Lite consistently outperform single-path and multi-path baselines.

Conclusion: ModeX offers an efficient, general selection mechanism that needs no additional inference or auxiliary models and improves the performance and practicality of open-ended text generation.

Abstract: Selecting a single high-quality output from multiple stochastic generations remains a fundamental challenge for large language models (LLMs), particularly in open-ended tasks where no canonical answer exists. While Best-of-N and self-consistency methods show that aggregating multiple generations can improve performance, existing approaches typically rely on external evaluators, reward models, or exact string-match voting, limiting their applicability and efficiency. We propose Mode Extraction (ModeX), an evaluator-free Best-of-N selection framework that generalizes majority voting to open-ended text generation by identifying the modal output representing the dominant semantic consensus among generated texts. ModeX constructs a similarity graph over candidate generations and recursively applies spectral clustering to select a representative centroid, without requiring additional inference or auxiliary models. We further instantiate this selection principle as ModeX-Lite, an improved version of ModeX with early pruning for efficiency. Across open-ended tasks (including text summarization, code generation, and mathematical reasoning), our approaches consistently outperform standard single- and multi-path baselines, providing a computationally efficient solution for robust open-ended text generation. Code is released at https://github.com/deeplearning-wisc/ModeX.
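The idea of picking a consensus output from a similarity graph can be sketched in a few lines. The example below is a deliberately simplified stand-in: it scores candidates by total Jaccard token overlap and returns the most central one (a medoid). It does not perform the recursive spectral clustering that ModeX describes, and the similarity measure and function names are assumptions chosen to keep the sketch dependency-free.

```python
def jaccard(a, b):
    """Token-set overlap between two texts (an assumed similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def select_modal(candidates):
    """Return the candidate with the highest total similarity to the others."""
    def total_similarity(i):
        return sum(
            jaccard(candidates[i], other)
            for j, other in enumerate(candidates)
            if j != i
        )
    best = max(range(len(candidates)), key=total_similarity)
    return candidates[best]


candidates = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is on the mat",
    "dogs bark loudly at night",  # outlier with no overlap
]
print(select_modal(candidates))  # prints: the cat sat on the mat
```

Like majority voting, this rewards the answer most other samples agree with, but it tolerates paraphrases because agreement is graded rather than exact-match.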

[5] LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference

Hossein Rajabzadeh, Maryam Dialameh, Chul B. Park, Il-Min Kim, Hyock Ju Kwon

Main category: cs.CL

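TL;DR: LoRA-Drop is a plug-and-play inference framework that needs no routing mechanism. On most decoding steps it reuses the previous token's hidden state in selected layers and applies a low-rank LoRA correction, with periodic full-model refresh steps. This speeds up autoregressive LLM decoding and substantially reduces compute and KV-cache cost while keeping accuracy close to the baseline.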

Details

Motivation: Inference speed in autoregressive LLMs is limited by sequential token-by-token decoding. Existing methods that cut compute often rely on auxiliary routing mechanisms or lose accuracy, so a simple, efficient acceleration method that preserves quality is needed.

Method: Propose LoRA-Drop, which applies a temporal compute schedule: on most decoding steps, selected intermediate layers reuse the previous token's hidden state and apply a low-rank LoRA correction, while periodic full forward passes prevent error accumulation. The method is compatible with standard KV caching and can skip KV updates in some layers during LoRA steps to shrink the cache.

Result: On LLaMA2-7B, LLaMA3-8B, Qwen2.5-7B, and Qwen2.5-14B, LoRA-Drop achieves up to 2.6× faster decoding and a 45-55% KV-cache reduction while staying within 0.5 percentage points of baseline accuracy. Its effectiveness and stability are confirmed on mathematical reasoning, code generation, long-context, and multilingual tasks.

Conclusion: LoRA-Drop offers a simple and effective path toward adaptive-compute inference in LLMs, improving efficiency substantially while preserving model quality.

Abstract: Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic-depth and layer-skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or incur accuracy degradation when bypassed layers are left uncompensated. We present LoRA-Drop, a plug-and-play inference framework that accelerates decoding by applying a temporal compute schedule to a fixed subset of intermediate layers: on most decoding steps, selected layers reuse the previous-token hidden state and apply a low-rank LoRA correction, while periodic refresh steps execute the full model to prevent drift. LoRA-Drop requires no routing network, is compatible with standard KV caching, and can reduce KV-cache footprint by skipping KV updates in droppable layers during LoRA steps and refreshing periodically. Across LLaMA2-7B, LLaMA3-8B, Qwen2.5-7B, and Qwen2.5-14B, LoRA-Drop achieves up to 2.6× faster decoding and 45-55% KV-cache reduction while staying within 0.5 percentage points (pp) of baseline accuracy. Evaluations on reasoning (GSM8K, MATH, BBH), code generation (HumanEval, MBPP), and long-context/multilingual benchmarks (LongBench, XNLI, XCOPA) identify a consistent safe zone of scheduling configurations that preserves quality while delivering substantial efficiency gains, providing a simple path toward adaptive-capacity inference in LLMs. Code is available at https://github.com/hosseinbv/LoRA-Drop.git.
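The temporal compute schedule can be pictured as a simple per-step label. The sketch below assumes a fixed refresh interval (`refresh_every`, a hypothetical parameter; the abstract does not say how refresh steps are spaced) and only models which steps run the droppable layers in full versus reuse-plus-LoRA mode. No model is executed.

```python
def temporal_schedule(num_steps, refresh_every):
    """Label each decoding step 'full' (refresh) or 'lora' (reuse + low-rank fix).

    Assumes, for illustration, that a refresh happens every `refresh_every` steps.
    """
    if refresh_every < 1:
        raise ValueError("refresh_every must be at least 1")
    return [
        "full" if step % refresh_every == 0 else "lora"
        for step in range(num_steps)
    ]


def full_step_fraction(num_steps, refresh_every):
    """Fraction of steps on which the droppable layers are fully executed."""
    schedule = temporal_schedule(num_steps, refresh_every)
    return schedule.count("full") / num_steps


print(temporal_schedule(8, 4))
print(full_step_fraction(8, 4))  # prints: 0.25
```

Under this assumed schedule, a longer refresh interval lowers the share of full steps (more savings) but leaves more steps between corrections, which is the quality-versus-efficiency trade-off the abstract's "safe zone" refers to.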

[6] Fact-Checking with Large Language Models via Probabilistic Certainty and Consistency

Haoran Wang, Maryam Khalid, Qiong Wu, Jian Gao, Cheng Cao

Main category: cs.CL

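TL;DR: This paper proposes Probabilistic Certainty and Consistency (PCC), a framework that estimates an LLM's factual confidence by jointly modeling its probabilistic certainty and reasoning consistency, and uses that confidence to drive an adaptive verification strategy that reduces hallucination while improving accuracy and efficiency.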

Details

Motivation: LLMs often hallucinate. Existing fact-checking methods typically retrieve external evidence indiscriminately, ignore the model's internal knowledge, and lack mechanisms for resolving specific uncertainties. Inspired by how humans fact-check, the authors argue for a method that decides adaptively, based on the model's own confidence, whether to retrieve external evidence.

Method: Propose the PCC framework, which combines an LLM's probabilistic certainty and reasoning consistency to quantify factual confidence, and use that confidence for adaptive verification: answer directly when confident, trigger targeted retrieval when uncertain or inconsistent, and escalate to deep search when ambiguity is high.

Result: On three challenging benchmarks, PCC achieves better uncertainty quantification than verbalized confidence and consistently outperforms strong baselines; it also generalizes well across various LLMs.

Conclusion: Confidence-guided routing makes fact-checking more efficient and reliable, reducing hallucination while limiting external retrieval to the cases that need it.

Abstract: Large language models (LLMs) are increasingly used in applications requiring factual accuracy, yet their outputs often contain hallucinated responses. While fact-checking can mitigate these errors, existing methods typically retrieve external evidence indiscriminately, overlooking the model's internal knowledge and potentially introducing irrelevant noise. Moreover, current systems lack targeted mechanisms to resolve specific uncertainties in the model's reasoning. Inspired by how humans fact-check, we argue that LLMs should adaptively decide whether to rely on internal knowledge or initiate retrieval based on their confidence in a given claim. We introduce Probabilistic Certainty and Consistency (PCC), a framework that estimates factual confidence by jointly modeling an LLM's probabilistic certainty and reasoning consistency. These confidence signals enable an adaptive verification strategy: the model answers directly when confident, triggers targeted retrieval when uncertain or inconsistent, and escalates to deep search when ambiguity is high. Our confidence-guided routing mechanism ensures that retrieval is invoked only when necessary, improving both efficiency and reliability. Extensive experiments across three challenging benchmarks show that PCC achieves better uncertainty quantification than verbalized confidence and consistently outperforms strong LLM-based fact-checking baselines. Furthermore, we demonstrate that PCC generalizes well across various LLMs.
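A confidence-guided router of the kind described might look roughly like the sketch below. The two signals, the thresholds (`high`, `low`), and the mapping of "high ambiguity" to "both signals low" are assumptions made for illustration; the abstract does not specify how PCC computes or combines them.

```python
import math
from collections import Counter


def certainty(token_logprobs):
    """Geometric-mean token probability of an answer (one assumed certainty proxy)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def consistency(sampled_verdicts):
    """Share of sampled verdicts that agree with the most common verdict."""
    top_count = Counter(sampled_verdicts).most_common(1)[0][1]
    return top_count / len(sampled_verdicts)


def route(cert, cons, high=0.8, low=0.4):
    """Pick a verification action; thresholds are hypothetical, not from the paper."""
    if cert >= high and cons >= high:
        return "answer_directly"
    if cert < low and cons < low:
        return "deep_search"  # assumed reading of "ambiguity is high"
    return "targeted_retrieval"


verdicts = ["true", "true", "false", "true"]
print(consistency(verdicts))        # prints: 0.75
print(route(0.9, consistency(verdicts)))  # prints: targeted_retrieval
```

The point of the design is that retrieval cost is paid only on the middle and bottom branches, so a well-calibrated confidence signal translates directly into fewer retrieval calls.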

[7] DataParasite Enables Scalable and Repurposable Online Data Curation

Mengyi Sun

Main category: cs.CL

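TL;DR: DataParasite is an open-source, modular pipeline for online data collection that uses LLMs to make data curation for social science research efficient, transparent, and reusable, cutting manual cost substantially while maintaining high accuracy.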

Details

Motivation: Existing web data collection systems are often opaque and inflexible, and fall short of the reproducible, scalable data curation that computational social science requires.

Method: Decompose tabular curation tasks into independent entity-level searches, handled modularly through lightweight configuration files and a shared, task-agnostic Python script, with support for adapting to new tasks using only natural-language instructions.

Result: On several canonical computational social science tasks (faculty hiring histories, elite death events, and political career trajectories), DataParasite achieves high accuracy while reducing data-collection cost by an order of magnitude relative to manual curation.

Conclusion: By lowering technical and labor barriers, DataParasite provides a practical foundation for scalable, transparent, and reusable data curation in computational social science and beyond.

Abstract: Many questions in computational social science rely on datasets assembled from heterogeneous online sources, a process that is often labor-intensive, costly, and difficult to reproduce. Recent advances in large language models enable agentic search and structured extraction from the web, but existing systems are frequently opaque, inflexible, or poorly suited to scientific data curation. Here we introduce DataParasite, an open-source, modular pipeline for scalable online data collection. DataParasite decomposes tabular curation tasks into independent, entity-level searches defined through lightweight configuration files and executed through a shared, task-agnostic Python script. Crucially, the same pipeline can be repurposed to new tasks, including those without predefined entity lists, using only natural-language instructions. We evaluate the pipeline on multiple canonical tasks in computational social science, including faculty hiring histories, elite death events, and political career trajectories. Across tasks, DataParasite achieves high accuracy while reducing data-collection costs by an order of magnitude relative to manual curation. By lowering the technical and labor barriers to online data assembly, DataParasite provides a practical foundation for scalable, transparent, and reusable data curation in computational social science and beyond.

[8] Reconstructing Item Characteristic Curves using Fine-Tuned Large Language Models

Christopher Ormerod

Main category: cs.CL

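TL;DR: This study proposes fine-tuning LLMs (such as Qwen-3) with LoRA to simulate how students of different ability levels respond to items, so that item parameters such as difficulty and discrimination can be estimated without field testing.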

Details

Motivation: Traditional item parameter estimation depends on expensive field testing to collect student response data, which is costly and slow, so a more efficient alternative is needed.

Method: Fine-tune Qwen-3 dense models with Low-Rank Adaptation (LoRA) to generate answers to multiple-choice questions conditioned on discrete ability descriptors, build Item Characteristic Curves (ICCs) from the simulated responses, and estimate IRT parameters from those curves.

Result: On Grade 6 English Language Arts items and the BEA 2024 Shared Task dataset, the method matches or outperforms baseline approaches at estimating item parameters, and appears particularly effective at modeling item discrimination.

Conclusion: LLM-based simulation of student responses is a feasible and efficient way to model item parameters implicitly in educational assessment, reducing reliance on real test data.

Abstract: Traditional methods for determining assessment item parameters, such as difficulty and discrimination, rely heavily on expensive field testing to collect student performance data for Item Response Theory (IRT) calibration. This study introduces a novel approach that implicitly models these psychometric properties by fine-tuning Large Language Models (LLMs) to simulate student responses across a spectrum of latent abilities. Leveraging the Qwen-3 dense model series and Low-Rank Adaptation (LoRA), we train models to generate responses to multiple choice questions conditioned on discrete ability descriptors. We reconstruct the probability of a correct response as a function of student ability, effectively generating synthetic Item Characteristic Curves (ICCs) to estimate IRT parameters. Evaluation on a dataset of Grade 6 English Language Arts (ELA) items and the BEA 2024 Shared Task dataset demonstrates that this method competes with or outperforms baseline approaches. This simulation-based technique seems particularly effective at modeling item discrimination.
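For readers unfamiliar with ICCs, the sketch below shows how difficulty and discrimination can be read off a curve of correct-response probabilities. It assumes the standard two-parameter logistic (2PL) IRT model, P(θ) = 1 / (1 + exp(−a(θ − b))), and a simple least-squares fit on the logits; the abstract does not say which IRT model or fitting procedure the paper uses, and the ability grid is hypothetical.

```python
import math


def icc_2pl(theta, a, b):
    """2PL item characteristic curve: a = discrimination, b = difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fit_2pl(thetas, proportions, eps=1e-6):
    """Fit a line to the logits: logit(p) = a*theta - a*b, so slope = a."""
    logits = []
    for p in proportions:
        p = min(max(p, eps), 1 - eps)  # keep the logit finite at 0 or 1
        logits.append(math.log(p / (1 - p)))
    n = len(thetas)
    mean_t = sum(thetas) / n
    mean_l = sum(logits) / n
    sxx = sum((t - mean_t) ** 2 for t in thetas)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(thetas, logits))
    a = sxy / sxx
    b = mean_t - mean_l / a  # theta at which the fitted logit crosses zero
    return a, b


# Stand-in for simulated proportions correct at five discrete ability levels.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
proportions = [icc_2pl(t, a=1.5, b=0.5) for t in thetas]
print(fit_2pl(thetas, proportions))
```

In the setting the abstract describes, `proportions` would come from the fine-tuned model's simulated answers at each ability descriptor rather than from a known curve.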

[9] FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions

Kris W Pan, Yongmin Yoo

Main category: cs.CL

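TL;DR: Proposes FlowPlan-G2P, a framework that turns scientific papers into patent descriptions through three stages (concept graph construction, paragraph planning, and graph-conditioned generation) that mirror an expert drafter's workflow, significantly improving the logical coherence and legal compliance of the output.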

Details

Motivation: Scientific papers and patents differ widely in rhetorical style and legal requirements, and conventional end-to-end models struggle with structured reasoning and legal constraints, so an approach closer to how experts draft is needed.

Method: Decompose paper-to-patent conversion into three stages: (1) Concept Graph Induction, which extracts technical entities and relationships into a directed graph; (2) Paragraph and Section Planning, which reorganizes the graph into clusters aligned with patent structure; and (3) Graph-Conditioned Generation, which produces compliant paragraphs from subgraphs and tailored prompts.

Result: Experiments show that FlowPlan-G2P significantly outperforms end-to-end LLM baselines in logical coherence and legal compliance.

Conclusion: Through a structured generation process, FlowPlan-G2P offers a new paradigm for paper-to-patent conversion and advances structured text generation for specialized domains.

Abstract: Over 3.5 million patents are filed annually, with drafting patent descriptions requiring deep technical and legal expertise. Transforming scientific papers into patent descriptions is particularly challenging due to their differing rhetorical styles and stringent legal requirements. Unlike black-box text-to-text approaches that struggle to model structural reasoning and legal constraints, we propose FlowPlan-G2P, a novel framework that mirrors the cognitive workflow of expert drafters by reformulating this task into three stages: (1) Concept Graph Induction, extracting technical entities and relationships into a directed graph via expert-like reasoning; (2) Paragraph and Section Planning, reorganizing the graph into coherent clusters aligned with canonical patent sections; and (3) Graph-Conditioned Generation, producing legally compliant paragraphs using section-specific subgraphs and tailored prompts. Experiments demonstrate that FlowPlan-G2P significantly improves logical coherence and legal compliance over end-to-end LLM baselines. Our framework establishes a new paradigm for paper-to-patent generation and advances structured text generation for specialized domains.

[10] Scalable Construction of a Lung Cancer Knowledge Base: Profiling Semantic Reasoning in LLMs

Cesar Felipe Martínez Cisneros, Jesús Ulises Quiroz Bautista, Claudia Anahí Guzmán Solano, Bogdan Kaleb García Rivera, Iván García Pacheco, Yalbi Itzel Balderas Martínez, Kolawole John Adebayoc, Ignacio Arroyo Fernández

Main category: cs.CL

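TL;DR: This study presents an Open Information Extraction (OpenIE) pipeline for building a lung cancer knowledge base, producing domain-specific knowledge triplets at scale and low cost. Using the triplets for semantic fine-tuning of language models significantly improves performance and semantic coherence in biomedical NLP.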

Details

Motivation: In fields such as oncology, where precision and interpretability matter, LLM performance depends on high-quality semantic data. Existing methods cannot build structured medical knowledge bases efficiently, so scalable, low-cost approaches are needed to support in-domain reasoning and knowledge representation.

Method: Build a lung cancer knowledge base with OpenIE: identify medical concepts using the MeSH thesaurus, filter open-access PubMed literature with permissive (CC0) licenses, extract (subject, relation, object) triplets, enrich the triplets with Named Entity Recognition (NER) to ensure biomedical relevance, and use the result for supervised semantic fine-tuning of T5 models.

Result: The triplet dataset significantly improves T5 performance on ROUGE and BERTScore, indicating better semantic coherence and generation quality and supporting the knowledge base's usefulness for biomedical NLP tasks.

Conclusion: An OpenIE-based extraction pipeline can produce high-quality domain-specific knowledge bases efficiently and cheaply, giving LLM applications in precision medicine and biomedical research a scalable data foundation.

Abstract: The integration of Large Language Models (LLMs) into biomedical research offers new opportunities for domain-specific reasoning and knowledge representation. However, their performance depends heavily on the semantic quality of training data. In oncology, where precision and interpretability are vital, scalable methods for constructing structured knowledge bases are essential for effective fine-tuning. This study presents a pipeline for developing a lung cancer knowledge base using Open Information Extraction (OpenIE). The process includes: (1) identifying medical concepts with the MeSH thesaurus; (2) filtering open-access PubMed literature with permissive licenses (CC0); (3) extracting (subject, relation, object) triplets using an OpenIE method; and (4) enriching triplet sets with Named Entity Recognition (NER) to ensure biomedical relevance. The resulting triplet sets provide a domain-specific, large-scale, and noise-aware resource for fine-tuning LLMs. We evaluated T5 models fine-tuned on this dataset through Supervised Semantic Fine-Tuning. Comparative assessments with ROUGE and BERTScore show significantly improved performance and semantic coherence, demonstrating the potential of OpenIE-derived resources as scalable, low-cost solutions for enhancing biomedical NLP.

[11] Improved Evidence Extraction for Document Inconsistency Detection with LLMs

Nelvin Tan, Yaowen Zhang, James Asikin Cheung, Fusheng Liu, Yu-Ching Shih, Dong Yang

Main category: cs.CL

TL;DR: This paper proposes a new framework and new metrics for improving evidence extraction in LLM-based document inconsistency detection.

Details

Motivation: Although LLMs perform well in many domains, research on document inconsistency detection remains limited, and effective methods for providing evidence of the inconsistency are especially lacking.

Method: Propose a redact-and-retry framework with constrained filtering, and introduce new evaluation metrics for evidence extraction.

Result: Experiments show that the proposed method substantially outperforms direct prompting on evidence extraction.

Conclusion: The framework improves the ability of LLMs to find and extract evidence of inconsistency in documents.

Abstract: Large language models (LLMs) are becoming useful in many domains due to their impressive abilities that arise from large training datasets and large model sizes. However, research on LLM-based approaches to document inconsistency detection is relatively limited. There are two key aspects of document inconsistency detection: (i) classification of whether there exists any inconsistency, and (ii) providing evidence of the inconsistent sentences. We focus on the latter, and introduce new comprehensive evidence-extraction metrics and a redact-and-retry framework with constrained filtering that substantially improves LLM-based document inconsistency detection over direct prompting. We back our claims with promising experimental results.

[12] Empirical Comparison of Encoder-Based Language Models and Feature-Based Supervised Machine Learning Approaches to Automated Scoring of Long Essays

Kuo Wang, Haowei Hua, Pengfei Yan, Hong Jiao, Dan Song

Main category: cs.CL

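TL;DR: This study examines encoder-based pre-trained language models for automated scoring of long essays and finds that an ensemble combining embeddings from several language models significantly improves scoring performance on long texts.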

Details

Motivation: Long texts are challenging for encoder-only language models, particularly in automated essay scoring. Existing models are limited by context length (for example, 512 tokens) and handle long essays poorly, so better modeling strategies are worth exploring.

Method: Train several widely used encoder language models (BERT, RoBERTa, DistilBERT, DeBERTa) to score long essays and compare them with ensembles built on base models with a 512-token limit. The ensembles include models that fuse embeddings from multiple encoders and ensembles of feature-based supervised learners (Gradient-Boosted Decision Trees, XGBoost, LightGBM). Models are trained, validated, and tested on an 80%/10%/10% split and evaluated with Quadratic Weighted Kappa.

Result: An embedding ensemble that combines representations from multiple pre-trained language models, with a gradient-boosting classifier as the combiner, significantly outperforms individual language models at scoring long essays.

Conclusion: For automated scoring of long essays, an ensemble that fuses embeddings from multiple language models overcomes the context limits of single models and clearly improves scoring performance.

Abstract: Long context may impose challenges for encoder-only language models in text processing, specifically for automated scoring of essays. This study trained several commonly used encoder-based language models for automated scoring of long essays. The performance of these trained models was evaluated and compared with the ensemble models built upon the base language models with a token limit of 512. The experimented models include BERT-based models (BERT, RoBERTa, DistilBERT, and DeBERTa), ensemble models integrating embeddings from multiple encoder models, and ensemble models of feature-based supervised machine learning models, including Gradient-Boosted Decision Trees, eXtreme Gradient Boosting, and Light Gradient Boosting Machine. We trained, validated, and tested each model on a dataset of 17,307 essays, with an 80%/10%/10% split, and evaluated model performance using Quadratic Weighted Kappa. This study revealed that an ensemble-of-embeddings model that combines multiple pre-trained language model representations with a gradient-boosting classifier as the ensemble model significantly outperforms individual language models at scoring long essays.
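Since the comparison rests on Quadratic Weighted Kappa (QWK), a short reference implementation may help. The sketch below follows the standard definition (squared-distance weights, expected counts from the two raters' marginal histograms) and assumes integer scores on a known range; it is not taken from the paper's code.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """QWK between two lists of integer scores in [min_score, max_score]."""
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("need at least two score categories")
    total = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            num += weight * observed[i][j]
            den += weight * expected
    if den == 0:
        return 1.0  # both raters used a single identical score
    return 1.0 - num / den


human = [1, 2, 3, 4]
print(quadratic_weighted_kappa(human, [1, 2, 3, 4], 1, 4))  # prints: 1.0
print(quadratic_weighted_kappa(human, [1, 2, 4, 4], 1, 4))
```

The quadratic weights penalize a two-point miss four times as much as a one-point miss, which is why QWK is preferred over plain accuracy for ordinal essay scores.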

[13] When Do Tools and Planning Help LLMs Think? A Cost- and Latency-Aware Benchmark

Subha Ghoshal, Ali Al-Bustami

Main category: cs.CL

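TL;DR: This study evaluates inference-time planning and external tools on two tasks, event-centric question answering (Event-QA) and persuasive reply generation on Reddit ChangeMyView (CMV). Tool augmentation raises GPT-4o's accuracy on Event-QA at the cost of much higher latency; on CMV it brings no consistent gain and makes the smaller model degrade further.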

Details

Motivation: Examine how effective inference-time planning and external tools are for modern LLMs in realistic settings, what they cost, and when they apply.

Method: Using LangChain and LangGraph, compare a one-shot baseline against a plan-execute-replan agent equipped with task-specific tools (DBpedia queries, Wikipedia retrieval, web search), and evaluate the accuracy, latency, and token cost of GPT-4o and GPT-4o-mini on Event-QA and CMV.

Result: On Event-QA, tool augmentation raises GPT-4o's accuracy from 47.5% to 67.5% while mean latency grows from about 8 s to about 317 s. On CMV, one-shot prompting is strongest (GPT-4o-mini reaches 75% at about 6 s); planning plus search adds substantial latency without consistent gains, and complex multi-tool orchestration degrades the smaller model.

Conclusion: The value of tool augmentation depends heavily on task type and model size, so model and tooling choices should be task-specific and cost-aware.

Abstract: Modern large language models (LLMs) increasingly rely on inference-time planning and external tools to improve reasoning. We benchmark this behavior on two real-world settings: event-centric question answering over graph-structured knowledge (Event-QA) and persuasive response generation in Reddit ChangeMyView (CMV). Using LangChain and LangGraph, we compare a one-shot baseline against a plan-execute-replan agent equipped with task-specific tools (DBpedia SPARQL/lookup/schema exploration, Wikipedia-focused retrieval, and topical web search). We evaluate on 60 examples each from Event-QA and CMV (3 splits of 20), and report both mean end-to-end latency and per-example token cost estimates. We evaluate GPT-4o and GPT-4o-mini under identical workflows and report accuracy and end-to-end latency. On Event-QA, the best tool-augmented configuration improves accuracy (e.g., 47.5% → 67.5% for GPT-4o) while increasing latency by orders of magnitude (~8s → ~317s per example). On CMV, one-shot prompting is strongest (e.g., GPT-4o-mini achieves 75% at ~6s), and planning+search increases latency substantially without consistent gains. However, complex multi-tool orchestration exposes failure modes where the smaller model degrades. Overall, the findings highlight the need for task-specific, cost-aware choices of both model size and agent/tooling complexity.
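The Event-QA trade-off is easier to judge with the arithmetic written out. The helper below is a hypothetical convenience function (not from the paper) that takes the abstract's reported GPT-4o figures, with the latencies being the approximate values quoted there.

```python
def tradeoff(acc_base_pct, acc_tool_pct, latency_base_s, latency_tool_s):
    """Return (accuracy gain in points, slowdown factor, extra seconds per point)."""
    gain_pp = acc_tool_pct - acc_base_pct
    slowdown = latency_tool_s / latency_base_s
    extra_seconds_per_pp = (latency_tool_s - latency_base_s) / gain_pp
    return gain_pp, slowdown, extra_seconds_per_pp


# GPT-4o on Event-QA, figures from the abstract (latencies are approximate).
gain, slowdown, cost = tradeoff(47.5, 67.5, 8, 317)
print(gain, slowdown, cost)  # prints: 20.0 39.625 15.45
```

Read this way, the tool-augmented agent buys 20 accuracy points for roughly a 40-fold slowdown, or about 15 extra seconds of latency per point gained per example.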

[14] Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking

Hongzhan Lin, Zixin Chen, Zhiqi Shen, Ziyang Luo, Zhen Ye, Jing Ma, Tat-Seng Chua, Guandong Xu

Main category: cs.CL

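TL;DR: This paper presents FactArena, a fully automated arena-style framework for stage-wise evaluation of LLMs across the complete fact-checking pipeline, covering claim extraction, evidence retrieval, and verdict reasoning, and shows where current models fall short in end-to-end fact-checking.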

Details

Motivation: Existing LLM evaluations focus mainly on claim verification and overlook other key stages of the fact-checking workflow, so they cannot fully expose reasoning failures and robustness problems. A more comprehensive evaluation framework is needed.

Method: Propose FactArena, which has three parts: (1) an LLM-driven, standardized fact-checking process (claim decomposition, tool-augmented evidence retrieval, and justification-based verdicts); (2) an arena-style judgment mechanism guided by consolidated guidelines, giving unbiased comparisons across different judge agents; and (3) an arena-driven claim-evolution module that automatically generates more challenging claims to test factual robustness.

Result: Tested on 16 state-of-the-art LLMs, FactArena produces stable and interpretable rankings and reveals significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.

Conclusion: FactArena provides a scalable and trustworthy paradigm for diagnosing LLMs' factual reasoning, guiding future model development, and supporting reliable deployment of LLMs in safety-critical fact-checking applications.

Abstract: Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems, yet existing evaluations focus predominantly on claim verification and overlook the broader fact-checking workflow, including claim extraction and evidence retrieval. This narrow focus prevents current benchmarks from revealing systematic reasoning failures, factual blind spots, and robustness limitations of modern LLMs. To bridge this gap, we present FactArena, a fully automated arena-style evaluation framework that conducts comprehensive, stage-wise benchmarking of LLMs across the complete fact-checking pipeline. FactArena integrates three key components: (i) an LLM-driven fact-checking process that standardizes claim decomposition, evidence retrieval via tool-augmented interactions, and justification-based verdict prediction; (ii) an arena-styled judgment mechanism guided by consolidated reference guidelines to ensure unbiased and consistent pairwise comparisons across heterogeneous judge agents; and (iii) an arena-driven claim-evolution module that adaptively generates more challenging and semantically controlled claims to probe LLMs' factual robustness beyond fixed seed data. Across 16 state-of-the-art LLMs spanning seven model families, FactArena produces stable and interpretable rankings. Our analyses further reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence, highlighting the necessity of holistic evaluation. The proposed framework offers a scalable and trustworthy paradigm for diagnosing LLMs' factual reasoning, guiding future model development, and advancing the reliable deployment of LLMs in safety-critical fact-checking applications.

[15] Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

Devang Kulshreshtha, Hang Su, Chinmay Hegde, Haohan Wang

Main category: cs.CL

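TL;DR: This paper proposes Lexical Anchor Tree Search (LATS), a jailbreak method that does not depend on an attacker LLM. Using lexical anchor injection and breadth-first tree search, it reaches 97-100% attack success with only about 6.4 queries on average, well below the query budgets of existing methods.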

Details

Motivation: Existing jailbreak methods usually rely on an attacker LLM to generate adversarial queries and need large query budgets, which makes them expensive and yields prefixes that are hard to interpret. A more efficient, lower-cost method that does not depend on an attacker LLM is therefore of interest.

Method: LATS recasts jailbreaking as a breadth-first tree search over multi-turn dialogues, in which each node incrementally injects content words missing from the attack goal into benign prompts; the attack operates purely through lexical anchor injection, with no attacker LLM.

Result: On AdvBench and HarmBench, LATS achieves 97-100% attack success on the latest GPT, Claude, and Llama models with an average of only about 6.4 queries, compared with the 20 or more required by other methods.

Conclusion: LATS shows that conversational structure is a potent and under-protected attack surface, and that query efficiency is the differentiator in an era where high attack success rates are readily achievable.

Abstract: Most jailbreak methods achieve high attack success rates (ASR) but require attacker LLMs to craft adversarial queries and/or demand high query budgets. These resource limitations make jailbreaking expensive, and the queries generated by attacker LLMs often consist of non-interpretable random prefixes. This paper introduces Lexical Anchor Tree Search (LATS), addressing these limitations through an attacker-LLM-free method that operates purely via lexical anchor injection. LATS reformulates jailbreaking as a breadth-first tree search over multi-turn dialogues, where each node incrementally injects missing content words from the attack goal into benign prompts. Evaluations on AdvBench and HarmBench demonstrate that LATS achieves 97-100% ASR on latest GPT, Claude, and Llama models with an average of only ~6.4 queries, compared to 20+ queries required by other methods. These results highlight conversational structure as a potent and under-protected attack surface, while demonstrating superior query efficiency in an era where high ASR is readily achievable. Our code will be released to support reproducibility.

[16] Extracting books from production language models

Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang

Main category: cs.CL

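TL;DR: This study asks whether copyrighted training data can still be extracted from production LLMs that have safety measures in place. Using a two-phase procedure, it finds varying degrees of extraction risk across several major models, including Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3; in some cases a jailbroken model reproduces an entire in-copyright book nearly verbatim.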

Details

Motivation: Legal disputes over LLMs and copyright center on memorization: whether training data are encoded in model weights, and whether those data can be extracted from outputs. Although it is widely believed that LLMs do not memorize much of their training data, prior work shows that substantial copyrighted text can be extracted from open-weight models. This paper tests whether such extraction remains feasible for production models with safety measures.

Method: A two-phase procedure: phase one probes extraction feasibility, sometimes using a Best-of-N (BoN) jailbreak; phase two uses iterative continuation prompts to try to extract the text. Experiments cover four production models (Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, Grok 3), and extraction is measured with nv-recall, a score based on a block-based approximation of longest common substring.

Result: Gemini 2.5 Pro and Grok 3 yield large amounts of text without a jailbreak (nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone); jailbroken Claude 3.7 Sonnet reproduces entire books nearly verbatim in some cases (nv-recall = 95.8%); GPT-4.1 needs many more attempts and eventually refuses to continue (nv-recall = 4.0%). The models differ markedly.

Conclusion: Even with model- and system-level safeguards, extraction of training data, including in-copyright content, remains a risk for production LLMs, with serious implications for open copyright questions.

Abstract: Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs (Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3) and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.

[17] Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration

Guangxin Wu,Hao Zhang,Zhang Zhibin,Jiafeng Guo,Xueqi Cheng

Main category: cs.CL

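TL;DR: Proposes a new structured pruning framework, built on a hybrid multi-domain calibration set and an iterative calibration strategy, that compresses large language models effectively while preserving performance.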

Details

Motivation: The growing scale of LLMs brings high computational overhead, memory footprint, and inference latency, which hinders deployment; existing unstructured pruning methods often produce irregular sparsity patterns that require specialized hardware support. Method: Uses structured pruning, which removes entire architectural components and so stays compatible with standard hardware accelerators, and introduces a hybrid multi-domain calibration set with an iterative calibration strategy to identify redundant channels. Result: Experiments across various models and downstream tasks show significant compression with minimal performance degradation. Conclusion: The proposed structured pruning framework achieves efficient model compression while remaining hardware-compatible, making it suitable for practical deployment. Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide spectrum of natural language processing tasks. However, their ever-growing scale introduces significant barriers to real-world deployment, including substantial computational overhead, memory footprint, and inference latency. While model pruning presents a viable solution to these challenges, existing unstructured pruning techniques often yield irregular sparsity patterns that necessitate specialized hardware or software support. In this work, we explore structured pruning, which eliminates entire architectural components and maintains compatibility with standard hardware accelerators. We introduce a novel structured pruning framework that leverages a hybrid multi-domain calibration set and an iterative calibration strategy to effectively identify and remove redundant channels. Extensive experiments on various models across diverse downstream tasks show that our approach achieves significant compression with minimal performance degradation.
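To make the idea of calibration-driven channel pruning concrete, here is a minimal sketch. The importance score (weight L1 norm times mean absolute activation, averaged per domain), the function names, and the toy data are all illustrative assumptions; the abstract does not state the paper's actual criterion or schedule.

```python
from statistics import mean

def channel_importance(weights, calib_by_domain):
    """Score each input channel of a linear layer (assumed scoring rule).

    weights: list of rows, weights[o][c] = weight from channel c to output o.
    calib_by_domain: {domain: list of activation vectors, one value per channel}.
    """
    scores = []
    for c in range(len(weights[0])):
        weight_norm = sum(abs(row[c]) for row in weights)
        # Average within each domain first so a large domain cannot dominate.
        per_domain = [mean(abs(x[c]) for x in samples)
                      for samples in calib_by_domain.values()]
        scores.append(weight_norm * mean(per_domain))
    return scores

def iterative_prune(weights, calib_by_domain, n_remove, per_step=1):
    """Remove n_remove channels, re-scoring on the calibration set each step.

    In a real model the calibration activations would be recomputed with a
    forward pass through the pruned network; this toy simply slices them.
    """
    kept = list(range(len(weights[0])))  # original indices still alive
    while n_remove > 0:
        scores = channel_importance(weights, calib_by_domain)
        k = min(per_step, n_remove)
        drop = set(sorted(range(len(scores)), key=scores.__getitem__)[:k])
        keep_pos = [i for i in range(len(scores)) if i not in drop]
        weights = [[row[i] for i in keep_pos] for row in weights]
        calib_by_domain = {d: [[x[i] for i in keep_pos] for x in xs]
                           for d, xs in calib_by_domain.items()}
        kept = [kept[i] for i in keep_pos]
        n_remove -= k
    return weights, kept
```

Because whole channels (columns) are removed, the pruned layer is simply a smaller dense matrix, which is why structured pruning needs no sparse kernels.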

[18] EvoRoute: Experience-Driven Self-Routing LLM Agent Systems

Guibin Zhang,Haiyang Yu,Kaiming Yang,Bingli Wu,Fei Huang,Yongbin Li,Shuicheng Yan

Main category: cs.CL

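TL;DR: This paper proposes EvoRoute, a self-evolving model routing paradigm that addresses the three-way trade-off among performance, cost, and latency in complex AI agent systems (the Agent System Trilemma). By dynamically selecting the best-suited LLM at each step and continually refining its policy from experience, EvoRoute cuts cost by up to 80% and latency by over 70% while maintaining or improving performance.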

Details

Motivation: Complex AI agent systems are capable but suffer from high cost and severe latency; existing approaches struggle to balance performance, cost, and response speed, so a smarter dynamic model-scheduling mechanism is needed to resolve this trilemma. Method: Proposes EvoRoute, which draws on an ever-expanding knowledge base of experience to dynamically select Pareto-optimal LLMs at each step and continually refines its own routing policy through environment feedback, balancing accuracy, efficiency, and resource use. Result: On challenging agentic benchmarks such as GAIA and BrowseComp+, EvoRoute reduces execution cost by up to 80% and latency by over 70% while sustaining or improving system performance. Conclusion: EvoRoute effectively breaks the Agent System Trilemma and offers a viable path toward efficient, economical, and high-performing complex AI agent systems. Abstract: Complex agentic AI systems, powered by a coordinated ensemble of Large Language Models (LLMs), tool and memory modules, have demonstrated remarkable capabilities on intricate, multi-turn tasks. However, this success is shadowed by prohibitive economic costs and severe latency, exposing a critical, yet underexplored, trade-off. We formalize this challenge as the Agent System Trilemma: the inherent tension among achieving state-of-the-art performance, minimizing monetary cost, and ensuring rapid task completion. To dismantle this trilemma, we introduce EvoRoute, a self-evolving model routing paradigm that transcends static, pre-defined model assignments. Leveraging an ever-expanding knowledge base of prior experience, EvoRoute dynamically selects Pareto-optimal LLM backbones at each step, balancing accuracy, efficiency, and resource use, while continually refining its own selection policy through environment feedback. Experiments on challenging agentic benchmarks such as GAIA and BrowseComp+ demonstrate that EvoRoute, when integrated into off-the-shelf agentic systems, not only sustains or enhances system performance but also reduces execution cost by up to 80% and latency by over 70%.
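The notion of a Pareto-optimal backbone can be illustrated with a small sketch. It assumes each candidate model is summarized by average accuracy, cost, and latency estimated from logged past steps; the record format, the `Candidate` type, and the model names are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better
    latency: float   # lower is better

def summarize(experience):
    """experience: list of (model, correct, cost, latency) from past steps."""
    by_model = {}
    for model, correct, cost, latency in experience:
        by_model.setdefault(model, []).append((float(correct), cost, latency))
    return [Candidate(m,
                      sum(r[0] for r in rows) / len(rows),
                      sum(r[1] for r in rows) / len(rows),
                      sum(r[2] for r in rows) / len(rows))
            for m, rows in by_model.items()]

def dominates(a, b):
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy and a.cost <= b.cost
                and a.latency <= b.latency)
    better = (a.accuracy > b.accuracy or a.cost < b.cost
              or a.latency < b.latency)
    return no_worse and better

def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]
```

A router would then pick one point on the front according to the current budget, and append the observed outcome to `experience` so that later estimates improve.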

[19] Boosting Accuracy and Interpretability in Multilingual Hate Speech Detection Through Layer Freezing and Explainable AI

Meysam Shirdel Bilehsavar,Negin Mahmoudi,Mohammad Jalili Torkamani,Kiana Kiashemshaki

Main category: cs.CL

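TL;DR: This study evaluates three Transformer-based models (BERT, RoBERTa, XLM-RoBERTa) on multilingual sentiment analysis and hate speech detection across five languages, and uses the LIME framework to improve model interpretability.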

Details

Motivation: Sentiment analysis and hate speech detection are critical for online content moderation, but model performance and decision transparency in multilingual settings still need improvement. Method: Evaluates three models, BERT-base-multilingual-cased, RoBERTa-base, and XLM-RoBERTa-base (with the first eight layers frozen), on English, Korean, Japanese, Chinese, and French data, comparing them by accuracy, precision, recall, and F1-score; the LIME framework is integrated to explain model predictions. Result: Performance varies across languages and tasks, with XLM-RoBERTa performing best in most cases, and LIME effectively reveals how individual keywords influence predictions. Conclusion: Combining advanced Transformer architectures with explainability methods improves both the performance and the transparency of multilingual sentiment analysis and hate speech detection systems, supporting safer digital environments. Abstract: Sentiment analysis focuses on identifying the emotional polarity expressed in textual data, typically categorized as positive, negative, or neutral. Hate speech detection, on the other hand, aims to recognize content that incites violence, discrimination, or hostility toward individuals or groups based on attributes such as race, gender, sexual orientation, or religion. Both tasks play a critical role in online content moderation by enabling the detection and mitigation of harmful or offensive material, thereby contributing to safer digital environments. In this study, we examine the performance of three transformer-based models: BERT-base-multilingual-cased, RoBERTa-base, and XLM-RoBERTa-base with the first eight layers frozen, for multilingual sentiment analysis and hate speech detection. The evaluation is conducted across five languages: English, Korean, Japanese, Chinese, and French. The models are compared using standard performance metrics, including accuracy, precision, recall, and F1-score. To enhance model interpretability and provide deeper insight into prediction behavior, we integrate the Local Interpretable Model-agnostic Explanations (LIME) framework, which highlights the contribution of individual words to the models' decisions. By combining state-of-the-art transformer architectures with explainability techniques, this work aims to improve both the effectiveness and transparency of multilingual sentiment analysis and hate speech detection systems.

[20] Adversarial Question Answering Robustness: A Multi-Level Error Analysis and Mitigation Study

Agniv Roy Choudhury,Vignesh Ponselvan Rajasingh

Main category: cs.CL

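TL;DR: This study examines the adversarial robustness of question answering systems through systematic experiments with Transformer models on the AddSent adversarial dataset, combining multi-level error analysis with targeted mitigation strategies. It finds that scaling up the model and applying entity-aware contrastive learning substantially narrow the adversarial performance gap.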

Details

Motivation: QA systems perform well on standard benchmarks but remain fragile against adversarial examples, so their adversarial robustness needs closer study, including identification of the main failure modes, to improve practical reliability. Method: Performs multi-level error analysis with five complementary categorization schemes, evaluates different proportions of adversarial fine-tuning data, tests three targeted mitigation strategies (such as NER-guided contrastive learning), and runs scaling experiments from ELECTRA-small to ELECTRA-base. Result: Negation confusion and entity substitution are the primary failure modes; 80% clean + 20% adversarial data is the best training ratio; small models hit a capacity bottleneck; scaling eliminates the robustness-accuracy trade-off; Entity-Aware contrastive learning reaches 89.89% AddSent EM and 90.73% SQuAD EM, closing 94.9% of the adversarial gap. Conclusion: Combining linguistic error analysis with NER-guided contrastive learning effectively improves the adversarial robustness of QA models and brings clean and adversarial performance to near-parity; the authors describe it as the first work to integrate the two. Abstract: Question answering (QA) systems achieve impressive performance on standard benchmarks like SQuAD, but remain vulnerable to adversarial examples. This project investigates the adversarial robustness of transformer models on the AddSent adversarial dataset through systematic experimentation across model scales and targeted mitigation strategies. We perform comprehensive multi-level error analysis using five complementary categorization schemes, identifying negation confusion and entity substitution as the primary failure modes. Through systematic evaluation of adversarial fine-tuning ratios, we identify 80% clean + 20% adversarial data as optimal. Data augmentation experiments reveal a capacity bottleneck in small models. Scaling from ELECTRA-small (14M parameters) to ELECTRA-base (110M parameters) eliminates the robustness-accuracy trade-off, achieving substantial improvements on both clean and adversarial data. We implement three targeted mitigation strategies, with Entity-Aware contrastive learning achieving best performance: 89.89% AddSent Exact Match (EM) and 90.73% SQuAD EM, representing 94.9% closure of the adversarial gap. To our knowledge, this is the first work integrating comprehensive linguistic error analysis with Named Entity Recognition (NER)-guided contrastive learning for adversarial QA, demonstrating that targeted mitigation can achieve near-parity between clean and adversarial performance.

[21] Mitigating Prompt-Induced Hallucinations in Large Language Models via Structured Reasoning

Jinbo Hao,Kai Yang,Qingzhen Su,Yang Chen,Yifan Li,Chao Jiang

Main category: cs.CL

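TL;DR: This paper proposes a method that mitigates prompt-induced hallucinations in large language models by introducing a code module that guides knowledge-graph exploration and is incorporated into the chain-of-thought prompt.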

Details

Motivation: LLMs are prone to hallucination during reasoning, especially prompt-induced generation of incorrect information, which undermines their accuracy and trustworthiness. Method: Building on a knowledge distillation chain-style model, introduces a code module as an external knowledge input that guides knowledge-graph exploration, and incorporates the code into the chain-of-thought prompt to constrain the model's reasoning process. Result: Experiments with GPT-4 and LLaMA-3.3 on multiple public datasets show a marked improvement in capturing contextual information: HIT@1, HIT@3, and HIT@5 improve by 15.64%, 13.38%, and 13.28%, respectively, and exceed 95% in several settings. Conclusion: The proposed method effectively reduces hallucination in LLMs and significantly improves the accuracy and verifiability of their reasoning. Abstract: To address hallucination issues in large language models (LLMs), this paper proposes a method for mitigating prompt-induced hallucinations. Building on a knowledge distillation chain-style model, we introduce a code module to guide knowledge-graph exploration and incorporate code as part of the chain-of-thought prompt, forming an external knowledge input that provides more accurate and structured information to the model. Based on this design, we develop an improved knowledge distillation chain-style model and leverage it to analyze and constrain the reasoning process of LLMs, thereby improving inference accuracy. We empirically evaluate the proposed approach using GPT-4 and LLaMA-3.3 on multiple public datasets. Experimental results demonstrate that incorporating code modules significantly enhances the model's ability to capture contextual information and effectively mitigates prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 improve by 15.64%, 13.38%, and 13.28%, respectively. Moreover, the proposed method achieves HIT@1, HIT@3, and HIT@5 scores exceeding 95% across several evaluation settings. These results indicate that the proposed approach substantially reduces hallucination behavior while improving the accuracy and verifiability of large language models.
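For readers unfamiliar with the metric, the sketch below assumes HIT@k follows the usual Hits@k definition from knowledge-graph question answering: the fraction of queries whose gold answer appears among the top k ranked predictions. The abstract does not define the metric, so this is an assumption, and the function name is illustrative.

```python
def hits_at_k(ranked_predictions, gold_answers, k):
    """Fraction of queries whose gold answer is in the top-k predictions."""
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("one ranked list is needed per gold answer")
    if not gold_answers:
        raise ValueError("at least one query is required")
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_predictions, gold_answers))
    return hits / len(gold_answers)
```

Under this definition Hits@k can only grow with k, which is consistent with reporting HIT@1, HIT@3, and HIT@5 side by side.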

[22] Language Hierarchization Provides the Optimal Solution to Human Working Memory Limits

Luyao Chen,Weibo Gao,Junjie Wu,Jinshan Wu,Angela D. Friederici

Main category: cs.CL

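TL;DR: This study finds that the hierarchical structure of language optimizes processing efficiency under humans' limited working memory capacity. Computational simulations and validation on natural language show that hierarchical processing keeps the number of information units within memory limits more effectively than linear processing, which explains why human languages are universally hierarchical.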

Details

Motivation: To explore why human language is hierarchical, and in particular whether this structure is related to humans' limited working memory capacity. Method: Constructs a likelihood function that quantifies how well the number of units under a given language processing mechanism matches human working memory capacity (WMC), and studies it through computational simulations of symbol sequences and validation analyses of natural language sentences. Result: The maximum likelihood estimate (theta_MLE) equals the mean number of units; compared with linear processing, hierarchical processing keeps theta_MLE within the human WMC limit far more effectively, with a growing advantage as sequence/sentence length increases, and shows a converging pattern related to children's WMC development. Conclusion: Hierarchical structure is an adaptive optimization that allows sequential information to be processed efficiently under limited working memory, which fundamentally explains why human languages are universally hierarchical. Abstract: Language is a uniquely human trait, conveying information efficiently by organizing word sequences in sentences into hierarchical structures. A central question persists: Why is human language hierarchical? In this study, we show that hierarchization optimally solves the challenge of our limited working memory capacity. We established a likelihood function that quantifies how well the average number of units according to the language processing mechanisms aligns with human working memory capacity (WMC) in a direct fashion. The maximum likelihood estimate (MLE) of this function, theta_MLE, turns out to be the mean of units. Through computational simulations of symbol sequences and validation analyses of natural language sentences, we find that hierarchical processing far surpasses linear processing in keeping the theta_MLE values under the human WMC limit as sequence/sentence length increases. It also shows a converging pattern related to children's WMC development. These results suggest that constructing hierarchical structures optimizes the processing efficiency of sequential language input while staying within memory constraints, genuinely explaining the universal hierarchical nature of human language.

[23] SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation

Hanqi Jiang,Junhao Chen,Yi Pan,Ling Chen,Weihang You,Yifan Zhou,Ruidong Zhang,Yohannes Abate,Tianming Liu

Main category: cs.CL

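TL;DR: This paper proposes Synapse, a unified memory architecture inspired by cognitive science that improves retrieval from long-term agentic memory for large language models through a dynamic graph model and a spreading activation mechanism, significantly improving performance on complex temporal and multi-hop reasoning tasks.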

Details

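Motivation: Existing retrieval-augmented methods cannot handle the disconnected nature of long-term agentic memory, and static vector similarity struggles to capture dynamic associations, so a dynamic memory architecture closer to human memory is needed. Method: Synapse models memory as a dynamic graph in which relevance is determined by spreading activation, and introduces lateral inhibition and temporal decay to dynamically highlight relevant sub-graphs while suppressing interference; it also proposes a Triple Hybrid Retrieval strategy that fuses geometric embeddings with activation-based graph traversal. Result: Experiments on the LoCoMo benchmark show that Synapse significantly outperforms state-of-the-art methods on complex temporal and multi-hop reasoning tasks and effectively mitigates the "Contextual Tunneling" problem. Conclusion: By emulating the dynamic properties of human memory, Synapse provides a more efficient and robust approach to long-term memory management and a new direction for continual learning and reasoning in LLMs. Abstract: While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory. To bridge this gap, we introduce Synapse (Synergistic Associative Processing Semantic Encoding), a unified memory architecture that transcends static vector similarity. Drawing from cognitive science, Synapse models memory as a dynamic graph where relevance emerges from spreading activation rather than pre-computed links. By integrating lateral inhibition and temporal decay, the system dynamically highlights relevant sub-graphs while filtering interference. We implement a Triple Hybrid Retrieval strategy that fuses geometric embeddings with activation-based graph traversal. Comprehensive evaluations on the LoCoMo benchmark show that Synapse significantly outperforms state-of-the-art methods in complex temporal and multi-hop reasoning tasks, offering a robust solution to the "Contextual Tunneling" problem. Our code and data will be made publicly available upon acceptance.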
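A toy sketch of spreading activation with temporal decay and lateral inhibition follows. The update rules, parameter names, and default values are illustrative assumptions chosen for simplicity, not the formulation used in Synapse, and the embedding-fusion part of the retrieval strategy is not modeled.

```python
def spread_activation(graph, ages, seeds, steps=2, hop_decay=0.5,
                      half_life=10.0, inhibition=0.2):
    """Rank memory nodes by activation spread outward from query seeds.

    graph: {node: {neighbor: edge_weight}}
    ages:  {node: age of the memory}, older memories receive less activation
    seeds: {node: initial activation}, e.g. nodes matched by the query
    """
    activation = dict(seeds)
    for _ in range(steps):
        incoming = {}
        for node, act in activation.items():
            for nbr, weight in graph.get(node, {}).items():
                # Temporal decay: halve the signal every `half_life` of age.
                recency = 0.5 ** (ages.get(nbr, 0.0) / half_life)
                incoming[nbr] = (incoming.get(nbr, 0.0)
                                 + hop_decay * weight * act * recency)
        for node, extra in incoming.items():
            activation[node] = activation.get(node, 0.0) + extra
        # Lateral inhibition: push each node down in proportion to how far
        # it trails the strongest node, clipping at zero.
        peak = max(activation.values())
        activation = {n: max(0.0, a - inhibition * (peak - a))
                      for n, a in activation.items()}
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy, a weakly linked node can be inhibited to zero while a node two hops away along strong edges stays active, which is the multi-hop behavior that static similarity search lacks.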

[24] Window-based Membership Inference Attacks Against Fine-tuned Large Language Models

Yuetian Chen,Yuntao Du,Kaiyuan Zhang,Ashish Kundu,Charles Fleming,Bruno Ribeiro,Ninghui Li

Main category: cs.CL

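TL;DR: This paper proposes WBC, a sliding-window method based on localized comparison that strengthens membership inference attacks against large language models and outperforms conventional global-averaging methods across multiple datasets.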

Details

Motivation: Most existing membership inference attacks rely on global signals such as average loss, which dilute localized memorization signals and weaken the attack. This paper aims to improve on this by focusing on the more pronounced signals in localized contexts. Method: Proposes WBC (Window-Based Comparison), which slides windows across text sequences to compare losses locally and uses sign-based aggregation to ensemble the votes from windows of different sizes, capturing memorization patterns from the token level to the phrase level. Result: Experiments on eleven datasets show that WBC substantially outperforms existing baselines, with higher AUC scores and 2-3 times higher detection rates at low false positive thresholds. Conclusion: Aggregating localized evidence is more effective than global averaging, and it exposes critical privacy vulnerabilities in fine-tuned LLMs. Abstract: Most membership inference attacks (MIAs) against Large Language Models (LLMs) rely on global signals, like average loss, to identify training data. This approach, however, dilutes the subtle, localized signals of memorization, reducing attack effectiveness. We challenge this global-averaging paradigm, positing that membership signals are more pronounced within localized contexts. We introduce WBC (Window-Based Comparison), which exploits this insight through a sliding window approach with sign-based aggregation. Our method slides windows of varying sizes across text sequences, with each window casting a binary vote on membership based on loss comparisons between target and reference models. By ensembling votes across geometrically spaced window sizes, we capture memorization patterns from token-level artifacts to phrase-level structures. Extensive experiments across eleven datasets demonstrate that WBC substantially outperforms established baselines, achieving higher AUC scores and 2-3 times improvements in detection rates at low false positive thresholds. Our findings reveal that aggregating localized evidence is fundamentally more effective than global averaging, exposing critical privacy vulnerabilities in fine-tuned LLMs.
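The windowed voting described in the abstract can be sketched as follows, assuming per-token losses from the target and reference models are already available as lists. The exact vote rule, the window sizes (powers of two here), and the way votes are averaged are assumptions for illustration rather than the paper's specification.

```python
def window_votes(target_losses, reference_losses, window):
    """One +1/-1 vote per window: +1 if the target model's loss is lower."""
    votes = []
    for start in range(len(target_losses) - window + 1):
        t = sum(target_losses[start:start + window])
        r = sum(reference_losses[start:start + window])
        votes.append(1 if t < r else -1)
    return votes

def wbc_score(target_losses, reference_losses, min_window=1, max_window=8):
    """Mean sign vote over geometrically spaced window sizes (1, 2, 4, ...).

    Scores near +1 suggest membership; scores near -1 suggest non-membership.
    """
    if len(target_losses) != len(reference_losses) or not target_losses:
        raise ValueError("need two non-empty loss lists of equal length")
    per_size = []
    window = min_window
    while window <= min(max_window, len(target_losses)):
        votes = window_votes(target_losses, reference_losses, window)
        per_size.append(sum(votes) / len(votes))
        window *= 2
    return sum(per_size) / len(per_size)
```

Because each window contributes only its sign, a single very high-loss token cannot swamp many windows of lower loss the way it does in a global average.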

[25] EComStage: Stage-wise and Orientation-specific Benchmarking for Large Language Models in E-commerce

Kaiyan Zhao,Zijie Meng,Zheyong Xie,Jin Duan,Yao Hu,Zuozhu Liu,Shaosheng Cao

Main category: cs.CL

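TL;DR: EComStage is a unified e-commerce benchmark that evaluates the stage-wise reasoning of LLM-based agents across three stages (perception, planning, and action) from both customer and merchant perspectives, providing fine-grained performance insights.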

Details

Motivation: Existing benchmarks focus mainly on final task completion and overlook intermediate reasoning stages; they are also largely limited to customer interactions and do not evaluate the merchant scenarios found in real e-commerce applications. Method: Proposes the EComStage benchmark, comprising seven representative e-commerce tasks that cover the perception, planning, and action stages, with all samples human-annotated and quality-checked, and evaluates both customer-oriented and merchant-oriented scenarios. Result: Evaluates more than 30 mainstream LLMs of different sizes, revealing each model's strengths and weaknesses across reasoning stages and application scenarios and providing fine-grained analysis. Conclusion: EComStage effectively evaluates the stage-wise reasoning of LLM agents in e-commerce and offers actionable insights for model design and optimization in real applications. Abstract: Large Language Model (LLM)-based agents are increasingly deployed in e-commerce applications to assist customer services in tasks such as product inquiries, recommendations, and order management. Existing benchmarks primarily evaluate whether these agents successfully complete the final task, overlooking the intermediate reasoning stages that are crucial for effective decision-making. To address this gap, we propose EComStage, a unified benchmark for evaluating agent-capable LLMs across the comprehensive stage-wise reasoning process: Perception (understanding user intent), Planning (formulating an action plan), and Action (executing the decision). EComStage evaluates LLMs through seven separate representative tasks spanning diverse e-commerce scenarios, with all samples human-annotated and quality-checked. Unlike prior benchmarks that focus only on customer-oriented interactions, EComStage also evaluates merchant-oriented scenarios, including promotion management, content review, and operational support relevant to real-world applications. We evaluate a wide range of over 30 LLMs, spanning from 1B to over 200B parameters, including open-source models and closed-source APIs, revealing stage/orientation-specific strengths and weaknesses. Our results provide fine-grained, actionable insights for designing and optimizing LLM-based agents in real-world e-commerce settings.

[26] MiMo-V2-Flash Technical Report

Bangjun Xiao,Bingquan Xia,Bo Yang,Bofei Gao,Bowen Shen,Chen Zhang,Chenhong He,Chiheng Lou,Fuli Luo,Gang Wang,Gang Xie,Hailin Zhang,Hanglong Lv,Hanyu Li,Heyu Chen,Hongshen Xu,Houbin Zhang,Huaqiu Liu,Jiangshan Duo,Jianyu Wei,Jiebao Xiao,Jinhao Dong,Jun Shi,Junhao Hu,Kainan Bao,Kang Zhou,Lei Li,Liang Zhao,Linghao Zhang,Peidian Li,Qianli Chen,Shaohui Liu,Shihua Yu,Shijie Cao,Shimao Chen,Shouqiu Yu,Shuo Liu,Tianling Zhou,Weijiang Su,Weikun Wang,Wenhan Ma,Xiangwei Deng,Bohan Mao,Bowen Ye,Can Cai,Chenghua Wang,Chengxuan Zhu,Chong Ma,Chun Chen,Chunan Li,Dawei Zhu,Deshan Xiao,Dong Zhang,Duo Zhang,Fangyue Liu,Feiyu Yang,Fengyuan Shi,Guoan Wang,Hao Tian,Hao Wu,Heng Qu,Hongfei Yi,Hongxu An,Hongyi Guan,Xing Zhang,Yifan Song,Yihan Yan,Yihao Zhao,Yingchun Lai,Yizhao Gao,Yu Cheng,Yuanyuan Tian,Yudong Wang,Zhen Tang,Zhengju Tang,Zhengtao Wen,Zhichao Song,Zhixian Zheng,Zihan Jiang,Jian Wen,Jiarui Sun,Jiawei Li,Jinlong Xue,Jun Xia,Kai Fang,Menghang Zhu,Nuo Chen,Qian Tu,Qihao Zhang,Qiying Wang,Rang Li,Rui Ma,Shaolei Zhang,Shengfan Wang,Shicheng Li,Shuhao Gu,Shuhuai Ren,Sirui Deng,Tao Guo,Tianyang Lu,Weiji Zhuang,Weikang Zhang,Weimin Xiong,Wenshan Huang,Wenyu Yang,Xin Zhang,Xing Yong,Xu Wang,Xueyang Xie,Yilin Jiang,Yixin Yang,Yongzhe He,Yu Tu,Yuanliang Dong,Yuchen Liu,Yue Ma,Yue Yu,Yuxing Xiang,Zhaojun Huang,Zhenru Lin,Zhipeng Xu,Zhiyang Chen,Zhonghua Deng,Zihan Zhang,Zihao Yue

Main category: cs.CL

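TL;DR: MiMo-V2-Flash is a Mixture-of-Experts model with 309B total parameters and 15B active parameters. It combines sliding window attention with global attention, supports contexts of up to 256k tokens, scales post-training efficiently through Multi-Teacher On-Policy Distillation, and uses Multi-Token Prediction at inference time to achieve up to a 2.6x decoding speedup.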

Details

Motivation: To build a large-scale language model that combines strong reasoning, agentic capabilities, and fast inference while reducing compute requirements. Method: Adopts a hybrid attention mechanism (sliding window plus global attention), pre-trains with Multi-Token Prediction (MTP) and extends the context length to 256k, proposes the Multi-Teacher On-Policy Distillation (MOPD) framework to improve post-training efficiency, and repurposes MTP as the draft model for speculative decoding at inference time. Result: The model matches or exceeds top-tier open-weight models while using only one half or one third of their total parameters, and it achieves an acceptance length of up to 3.6 and a decoding speedup of up to 2.6x. Conclusion: MiMo-V2-Flash significantly improves inference efficiency while maintaining high performance, and its open-source release supports community research and collaboration. Abstract: We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
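The hybrid attention layout can be sketched in a few lines. This assumes the 5:1 ratio means five sliding-window layers followed by one global layer in each group, and that the 128-token window includes the current token; neither detail is stated in the abstract, and the function names are illustrative.

```python
def layer_pattern(n_layers, swa_per_global=5):
    """Repeat `swa_per_global` sliding-window layers, then one global layer."""
    period = swa_per_global + 1
    return ["global" if (i + 1) % period == 0 else "swa"
            for i in range(n_layers)]

def can_attend(query_pos, key_pos, kind, window=128):
    """Whether a query token may attend to a key token in one layer type."""
    if key_pos > query_pos:
        return False  # causal masking: never attend to future tokens
    if kind == "global":
        return True
    return query_pos - key_pos < window  # only the most recent `window` tokens

def kv_cache_entries(seq_len, pattern, window=128):
    """Token positions that must stay cached, summed over all layers."""
    return sum(seq_len if kind == "global" else min(seq_len, window)
               for kind in pattern)
```

The cache count shows where the saving comes from: sliding-window layers keep a constant number of positions regardless of sequence length, so only the global layers grow with context.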

[27] Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models

Junxiang Qiu,Shuo Wang,Zhengsu Chen,Hengheng Zhang,Jinda Lu,Changcheng Li,Qi Tian

Main category: cs.CL

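TL;DR: This paper proposes Punctuation-aware Hybrid Sparse Attention (PHSA), which uses punctuation tokens as semantic boundary anchors to improve how much information sparse attention retains in long-sequence modeling.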

Details

Motivation: Existing sparse attention methods rely on coarse-grained semantic representations during block selection, which blur intra-block semantic boundaries and cause critical information to be lost; a finer-grained mechanism for modeling semantic boundaries is needed. Method: Designs a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, and introduces an extreme-sparsity-adaptive training and inference strategy to keep the model stable at low activation ratios. Result: On general benchmarks and long-context tasks, PHSA consistently outperforms dense attention and state-of-the-art sparse attention methods such as InfLLM v2; with 32k-token inputs at 97.3% sparsity, information loss for the 0.6B model is reduced by 10.8%. Conclusion: Through punctuation-guided modeling of semantic boundaries, PHSA delivers efficient sparse attention with low information loss; it is natively trainable and scalable, making it well suited to long-text modeling. Abstract: Attention serves as the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes structurally prohibitive for long sequences due to its quadratic complexity. Consequently, sparse attention has received increasing attention as a scalable alternative. However, existing sparse attention methods rely on coarse-grained semantic representations during block selection, which blur intra-block semantic boundaries and lead to the loss of critical information. To address this issue, we propose Punctuation-aware Hybrid Sparse Attention (PHSA), a natively trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, (1) we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead; (2) we introduce an extreme-sparsity-adaptive training and inference strategy that stabilizes model behavior under very low token activation ratios. Extensive experiments on general benchmarks and long-context evaluations demonstrate that PHSA consistently outperforms dense attention and state-of-the-art sparse attention baselines, including InfLLM v2. Specifically, for the 0.6B-parameter model with 32k-token input sequences, PHSA can reduce the information loss by 10.8% at a sparsity ratio of 97.3%.

[28] The performances of the Chinese and U.S. Large Language Models on the Topic of Chinese Culture

Feiyan Liu,Chenxun Zhuo,Siyan Zhao,Bao Ge,Tianming Liu

Main category: cs.CL

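TL;DR: This study compares large language models developed in China and the United States on tasks that test understanding of Chinese culture. Chinese-developed models outperform U.S.-developed models overall, and the gap may stem from differences in training data distribution, localization strategies, and emphasis on Chinese cultural content.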

Details

Motivation: To examine whether LLMs developed in China and the United States exhibit cultural differences in Chinese-language settings, particularly in their understanding of areas such as Chinese history and culture. Method: Uses a direct-questioning paradigm to evaluate how accurately models such as GPT-5.1, DeepSeek-V3.2, Qwen3-Max, and Gemini 2.5 Pro answer questions about traditional Chinese culture (for example history, literature, and poetry), followed by comparative analysis. Result: Chinese-developed models generally outperform U.S.-developed models on Chinese cultural understanding tasks; among U.S. models, Gemini 2.5 Pro and GPT-5.1 perform relatively well. Conclusion: The performance differences may relate to the regional distribution of training data, localization strategies, and how much developers emphasize Chinese cultural content, reflecting the influence of cultural background on LLM performance. Abstract: Cultural backgrounds shape individuals' perspectives and approaches to problem-solving. Since the emergence of GPT-1 in 2018, large language models (LLMs) have undergone rapid development. To date, the world's ten leading LLM developers are primarily based in China and the United States. To examine whether LLMs released by Chinese and U.S. developers exhibit cultural differences in Chinese-language settings, we evaluate their performance on questions about Chinese culture. This study adopts a direct-questioning paradigm to evaluate models such as GPT-5.1, DeepSeek-V3.2, Qwen3-Max, and Gemini 2.5 Pro. We assess their understanding of traditional Chinese culture, including history, literature, poetry, and related domains. Comparative analyses between LLMs developed in China and the U.S. indicate that Chinese models generally outperform their U.S. counterparts on these tasks. Among U.S.-developed models, Gemini 2.5 Pro and GPT-5.1 achieve relatively higher accuracy. The observed performance differences may arise from variations in training data distribution, localization strategies, and the degree of emphasis on Chinese cultural content during model development.

[29] TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents

Kai Li,Xuanqing Yu,Ziyi Ni,Yi Zeng,Yao Xu,Zheqing Zhang,Xin Li,Jitao Sang,Xiaogang Duan,Xuelei Wang,Chengbao Liu,Jie Tan

Main category: cs.CL

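TL;DR: TiMem is a temporal-hierarchical memory framework for long-horizon conversational agents. It organizes conversations through a Temporal Memory Tree (TMT), consolidating memory systematically from raw dialogue into abstract persona representations, and achieves state-of-the-art results on multiple benchmarks while substantially reducing the length of recalled memory.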

Details

Motivation: Existing memory frameworks offer limited support for temporally structured information across hierarchical levels, which leads to fragmented memories and unstable long-term personalization and makes it hard to serve long-horizon conversations within LLM context-window limits. Method: Proposes the TiMem framework, which builds a Temporal Memory Tree (TMT), uses semantic-guided consolidation to integrate memory across levels without fine-tuning, and designs a complexity-aware recall strategy that balances precision and efficiency across different queries. Result: Reaches 75.30% accuracy on LoCoMo and 76.88% on LongMemEval-S, outperforming all baselines, while reducing recalled memory length by 52.20% on LoCoMo; manifold analysis shows clearer persona separation and reduced dispersion. Conclusion: TiMem treats temporal continuity as the core organizing principle of long-horizon memory for conversational agents, improving the stability and efficiency of long-horizon personalized dialogue. Abstract: Long-horizon conversational agents have to manage ever-growing interaction histories that quickly exceed the finite context windows of large language models (LLMs). Existing memory frameworks provide limited support for temporally structured information across hierarchical levels, often leading to fragmented memories and unstable long-horizon personalization. We present TiMem, a temporal-hierarchical memory framework that organizes conversations through a Temporal Memory Tree (TMT), enabling systematic memory consolidation from raw conversational observations to progressively abstracted persona representations. TiMem is characterized by three core properties: (1) temporal-hierarchical organization through TMT; (2) semantic-guided consolidation that enables memory integration across hierarchical levels without fine-tuning; and (3) complexity-aware memory recall that balances precision and efficiency across queries of varying complexity. Under a consistent evaluation setup, TiMem achieves state-of-the-art accuracy on both benchmarks, reaching 75.30% on LoCoMo and 76.88% on LongMemEval-S. It outperforms all evaluated baselines while reducing the recalled memory length by 52.20% on LoCoMo. Manifold analysis indicates clear persona separation on LoCoMo and reduced dispersion on LongMemEval-S. Overall, TiMem treats temporal continuity as a first-class organizing principle for long-horizon memory in conversational agents.

[30] To Generate or Discriminate? Methodological Considerations for Measuring Cultural Alignment in LLMs

Saurabh Kumar Pandey,Sougata Saha,Monojit Choudhury

Main category: cs.CL

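TL;DR: This paper proposes inverse socio-demographic prompting (ISDP), which assesses an LLM's cultural understanding by asking it to infer demographic attributes from actual and simulated user behavior. Testing four models on the Goodreads-CSI dataset shows that models do better with actual behavior than with simulated behavior, but the two converge at the individual level, indicating limits to personalization.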

Details

Motivation: Existing socio-demographic prompting (SDP) is sensitive to factors such as prompt wording and decoding parameters, which makes it hard to tell whether poor model performance reflects bias or task design. A more reliable way to assess the cultural alignment of LLMs is therefore needed. Method: Proposes inverse socio-demographic prompting (ISDP), in which the LLM predicts a user's demographic attributes (such as nationality) from user behavior (such as book reviews). Aya-23, Gemma-2, GPT-4o, and LLaMA-3.1 are tested on the Goodreads-CSI dataset, comparing how well the models discriminate from actual versus simulated behavior. Result: Models perform better on actual user behavior than on simulated behavior, contrary to what SDP suggests; at the individual level, however, performance on both behavior types drops and becomes nearly equal, showing the limits of personalized prediction. Conclusion: ISDP offers a new perspective for evaluating cultural understanding in LLMs, shows that generation-based prompting such as SDP may lead to misleading conclusions, and highlights the difficulty of culturally personalized adaptation at the individual level. Abstract: Socio-demographic prompting (SDP) - prompting Large Language Models (LLMs) using demographic proxies to generate culturally aligned outputs - often shows LLM responses as stereotypical and biased. While effective in assessing LLMs' cultural competency, SDP is prone to confounding factors such as prompt sensitivity, decoding parameters, and the inherent difficulty of generation over discrimination tasks due to larger output spaces. These factors complicate interpretation, making it difficult to determine if the poor performance is due to bias or the task design. To address this, we use inverse socio-demographic prompting (ISDP), where we prompt LLMs to discriminate and predict the demographic proxy from actual and simulated user behavior from different users. We use the Goodreads-CSI dataset (Saha et al., 2025), which captures difficulty in understanding English book reviews for users from India, Mexico, and the USA, and test four LLMs: Aya-23, Gemma-2, GPT-4o, and LLaMA-3.1 with ISDP. Results show that models perform better with actual behaviors than simulated ones, contrary to what SDP suggests. However, performance with both behavior types diminishes and becomes nearly equal at the individual level, indicating limits to personalization.

[31] Training Language Models with homotokens Leads to Delayed Overfitting

Adrian Cosma,Stefan Ruseti,Emilian Radoi,Mihai Dascalu

Main category: cs.CL

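TL;DR: This paper proposes homotokens (alternative valid subword segmentations of the same word) as a form of data augmentation: exposing the model to different valid segmentations during training improves the generalization of language models while leaving meaning unchanged.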

Details

Motivation: Subword tokenization is not unique, so different token sequences can carry the same meaning, yet existing models typically use only a single longest-prefix tokenization and ignore other valid, meaning-preserving variants, which leads to inconsistent internal computation and a risk of overfitting. Method: Proposes a lightweight training architecture that uses an auxiliary causal encoder and block-causal cross-attention to condition next-token prediction on sampled homotoken variants, without changing the training objective or the token interface. Result: In data-constrained pretraining, homotokens consistently delay overfitting and improve generalization across a range of evaluation sets; in multilingual fine-tuning, the effect depends on tokenizer quality, with clear gains when the canonical tokenization is highly compressed and weaker gains when it is already over-fragmented. Conclusion: Homotokens provide a simple, modular way to induce tokenization invariance in language models and make them more robust to different valid segmentations. Abstract: Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness, language models are typically trained using a single canonical longest-prefix tokenization. We formalize homotokens (alternative valid subword segmentations of the same lexical item) as a strictly meaning-preserving form of data augmentation. We introduce a lightweight training architecture that conditions canonical next-token prediction on sampled homotoken variants via an auxiliary causal encoder and block-causal cross-attention, without modifying the training objective or token interface. In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure and improves generalization across diverse evaluation datasets. In multilingual fine-tuning, we find that the effectiveness of homotokens depends on tokenizer quality: gains are strongest when canonical tokens are highly compressed and diminish when the tokenizer already over-fragments the input. Overall, homotokens provide a simple and modular mechanism for inducing tokenization invariance in language models.
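What counts as a homotoken can be shown with a small enumeration. The sketch assumes a plain set of vocabulary pieces and a greedy longest-prefix rule for the canonical segmentation; the vocabulary and function names are hypothetical, and real tokenizers such as BPE derive their canonical output differently.

```python
def segmentations(word, vocab):
    """All ways to split `word` into consecutive pieces found in `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            results.extend([piece] + rest
                           for rest in segmentations(word[end:], vocab))
    return results

def canonical(word, vocab):
    """Greedy longest-prefix segmentation; None if greedy matching gets stuck."""
    pieces = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in vocab:
                pieces.append(word[:end])
                word = word[end:]
                break
        else:
            return None
    return pieces

def homotokens(word, vocab):
    """Valid segmentations other than the canonical one."""
    canon = canonical(word, vocab)
    return [s for s in segmentations(word, vocab) if s != canon]
```

Every returned segmentation concatenates back to the same string, which is what makes the augmentation strictly meaning-preserving; during training one variant would be sampled rather than enumerating all of them.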

[32] LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark

Ziyang Chen,Xing Wu,Junlong Jia,Chaochen Gao,Qi Fu,Debing Zhang,Songlin Hu

Main category: cs.CL

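TL;DR: LongBench Pro is a new, more realistic and comprehensive bilingual long-context benchmark of 1,500 naturally occurring long-text samples covering 11 primary tasks and 25 secondary tasks, with input lengths from 8k to 256k tokens and support for fine-grained analysis.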

Details

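Motivation: Existing long-context benchmarks trade off scalability against realism: synthetic tasks do not adequately represent real-world complexity, while fully manual annotation is hard to scale to extreme lengths and diverse scenarios, so a more realistic and scalable benchmark is needed. Method: Proposes a Human-Model Collaborative Construction pipeline in which frontier LLMs draft questions, reference answers, and design rationales, and experts validate and correct them; builds the bilingual benchmark LongBench Pro with a multi-dimensional taxonomy (dependency type, length, difficulty) and task-specific metrics, and evaluates 46 mainstream long-context models. Result: The evaluation of 46 long-context models finds that (1) long-context optimization improves comprehension more than parameter scaling; (2) effective context length is typically shorter than the claimed length, with pronounced cross-lingual misalignment; and (3) the "thinking" paradigm mainly helps models trained with native reasoning, while mixed-thinking designs offer a better Pareto trade-off. Conclusion: LongBench Pro provides a high-quality, scalable evaluation platform for long-context understanding, exposes key limitations of current models on real long texts, and supports progress toward more effective long-context modeling. Abstract: The rapid expansion of context length in large language models (LLMs) has outpaced existing evaluation benchmarks. Current long-context benchmarks often trade off scalability and realism: synthetic tasks underrepresent real-world complexity, while fully manual annotation is costly to scale to extreme lengths and diverse scenarios. We present LongBench Pro, a more realistic and comprehensive bilingual benchmark of 1,500 naturally occurring long-context samples in English and Chinese spanning 11 primary tasks and 25 secondary tasks, with input lengths from 8k to 256k tokens. LongBench Pro supports fine-grained analysis with task-specific metrics and a multi-dimensional taxonomy of context requirement (full vs. partial dependency), length (six levels), and difficulty (four levels calibrated by model performance). To balance quality with scalability, we propose a Human-Model Collaborative Construction pipeline: frontier LLMs draft challenging questions and reference answers, along with design rationales and solution processes, to reduce the cost of expert verification. Experts then rigorously validate correctness and refine problematic cases. 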
Evaluating 46 widely used long-context LLMs on LongBench Pro yields three findings: (1) long-context optimization contributes more to long-context comprehension than parameter scaling; (2) effective context length is typically shorter than the claimed context length, with pronounced cross-lingual misalignment; and (3) the "thinking" paradigm helps primarily models trained with native reasoning, while mixed-thinking designs offer a promising Pareto trade-off. In summary, LongBench Pro provides a robust testbed for advancing long-context understanding.

[33] Revisiting Data Compression with Language Modeling

Chen-Han Tsai

Main category: cs.CL

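TL;DR: This report examines the use of large language models (LLMs) for data compression, reaching a state-of-the-art adjusted compression rate of about 18% on the enwik9 dataset without additional training. It also evaluates how well LLMs compress non-English text, code, and byte streams.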

Details

Motivation: Although prior work shows that LLMs compress multi-modal data well, practical obstacles remain; this report explores how to use LLMs more effectively as data compressors, with a view to replacing traditional compression algorithms. Method: Optimizes the performance of LLMs as data compressors through several methods, without additional model training, and tests compression on enwik9 and on other data types (non-English text, code, byte streams). Result: Reaches a SOTA adjusted compression rate of about 18% on enwik9; LLMs perform very well on text-dominant data and, when configured appropriately, also compress non-natural-language sequences competitively. Conclusion: LLMs show strong potential for data compression, especially on text, and with suitable configuration can also handle non-natural-language data effectively, which makes them candidates for general-purpose compression. Abstract: In this report, we investigate the potential use of large language models (LLMs) in the task of data compression. Previous works have demonstrated promising results in applying LLMs towards compressing not only text, but also a wide range of multi-modal data. Despite the favorable performance achieved, there still remain several practical questions that pose a challenge towards replacing existing data compression algorithms with LLMs. In this work, we explore different methods to achieve a lower adjusted compression rate using LLMs as data compressors. In comparison to previous works, we were able to achieve a new state-of-the-art (SOTA) adjusted compression rate of around 18% on the enwik9 dataset without additional model training. Furthermore, we explore the use of LLMs in compressing non-English data, code data, and byte stream sequences. We show that while LLMs excel in compressing data in text-dominant domains, their ability in compressing non-natural text sequences still remains competitive if configured in the right way.
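
As a rough illustration of how a compression rate can be derived from a language model, the sketch below converts per-token probabilities into an ideal code length and then into a rate. The abstract does not define "adjusted compression rate"; the definition used here (compressed size plus model size, divided by raw size) is an assumption following a common convention in earlier work, and all numbers are invented.

```python
import math


def ideal_compressed_bytes(token_probs):
    """Ideal code length (as an entropy coder would approach) for a sequence
    whose tokens were assigned these probabilities by the model."""
    bits = -sum(math.log2(p) for p in token_probs)
    return math.ceil(bits / 8)


def compression_rate(compressed_bytes, raw_bytes, model_bytes=0):
    """Compressed size over raw size. Passing `model_bytes` gives the
    ASSUMED 'adjusted' variant, which charges for the model parameters."""
    return (compressed_bytes + model_bytes) / raw_bytes


# Made-up example: 16 tokens, each predicted with probability 0.5.
compressed = ideal_compressed_bytes([0.5] * 16)   # 16 bits in total
raw_rate = compression_rate(compressed, raw_bytes=10)
adjusted_rate = compression_rate(compressed, raw_bytes=10, model_bytes=3)
```

Under this assumed definition, the adjusted rate only becomes favorable when the data is large relative to the model, which is one reason large corpora such as enwik9 are the usual yardstick.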

[34] Transparent Semantic Change Detection with Dependency-Based Profiles

Bach Phan-Tat,Kris Heylen,Dirk Geeraerts,Stefano De Pascale,Dirk Speelman

Main category: cs.CL

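TL;DR: Proposes a lexical semantic change detection method based on dependency co-occurrence patterns that is highly interpretable and outperforms a number of distributional semantic models.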

Details

Motivation: Existing neural word-embedding methods perform well on semantic change detection but lack interpretability. Method: Detects semantic change using the dependency co-occurrence patterns of words, without relying on complex distributional representations. Result: The method proves effective in both quantitative and qualitative analysis, and its predictions are plausible and interpretable. Conclusion: Dependency co-occurrence is a viable approach to semantic change detection, with a particular advantage in interpretability. Abstract: Most modern computational approaches to lexical semantic change detection (LSC) rely on embedding-based distributional word representations with neural networks. Despite the strong performance on LSC benchmarks, they are often opaque. We investigate an alternative method which relies purely on dependency co-occurrence patterns of words. We demonstrate that it is effective for semantic change detection and even outperforms a number of distributional semantic models. We provide an in-depth quantitative and qualitative analysis of the predictions, showing that they are plausible and interpretable.

[35] Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration

Ryan Soh-Eun Shim,Kwanghee Choi,Kalvin Chang,Ming-Hao Hsu,Florian Eichin,Zhizheng Wu,Alane Suhr,Michael A. Hedderich,David Harwath,David R. Mortensen,Barbara Plank

Main category: cs.CL

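TL;DR: This paper shows that linearly manipulating script vectors in activation space gives direct control over the output script of multilingual speech models, even for unconventional language-script pairings.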

Details

Motivation: Regional varieties of the same language may use different writing systems, which makes the output script of multilingual speech models non-deterministic; this paper aims to address that problem. Method: Shows that script is linearly encoded in activation space and controls the output script by modifying activations with script vectors at inference time. Result: The method changes the output script across all Whisper model sizes, including unconventional language-script combinations (e.g. Italian in Cyrillic and Japanese in Latin script), with competitive performance. Conclusion: Script vectors in activation space enable flexible, post-hoc control over the output script of multilingual speech recognition. Abstract: Multilingual speech foundation models such as Whisper are trained on web-scale data, where data for each language consists of a myriad of regional varieties. However, different regional varieties often employ different scripts to write the same language, rendering speech recognition output also subject to non-determinism in the output script. To mitigate this problem, we show that script is linearly encoded in the activation space of multilingual speech models, and that modifying activations at inference time enables direct control over output script. We find the addition of such script vectors to activations at test time can induce script change even in unconventional language-script pairings (e.g. Italian in Cyrillic and Japanese in Latin script). We apply this approach to inducing post-hoc control over the script of speech recognition output, where we observe competitive performance across all model sizes of Whisper.
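
The idea of adding a script vector to activations can be sketched in a few lines. The abstract only states that script vectors are added at test time; computing the vector as a difference of mean activations, the scaling factor `alpha`, and the tiny two-dimensional activations below are all illustrative assumptions rather than the paper's procedure.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def script_vector(target_acts, source_acts):
    """ASSUMED construction: mean activation on target-script examples
    minus mean activation on source-script examples."""
    target_mean = mean_vector(target_acts)
    source_mean = mean_vector(source_acts)
    return [t - s for t, s in zip(target_mean, source_mean)]


def steer(activation, vector, alpha=1.0):
    """Add the scaled script vector to one hidden activation."""
    return [h + alpha * v for h, v in zip(activation, vector)]


# Invented 2-dimensional activations, for illustration only.
vec = script_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered = steer([0.5, 0.5], vec, alpha=2.0)
```

In a real model the same addition would be applied to a chosen layer's hidden states during decoding; which layer and what scale work best are empirical questions the sketch does not address.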

[36] Beyond the Black Box: Theory and Mechanism of Large Language Models

Zeyu Gan,Ruifeng Ren,Wei Yao,Xiaolin Hu,Gengze Xu,Chen Qian,Huayi Tang,Zixuan Gong,Xinhao Yao,Pengwei Tang,Zhenxing Dou,Yong Liu

Main category: cs.CL

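TL;DR: This survey proposes a lifecycle-based taxonomy that systematically organizes the theoretical foundations and mechanisms of each stage of large language models (LLMs), aiming to move LLM research from engineering practice toward scientific principles.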

Details

Motivation: Despite their practical success, the theoretical understanding of LLMs lags behind; they are treated as "black boxes", and no unified theoretical framework explains their internal mechanisms and performance. Method: Builds a lifecycle taxonomy with six stages (data preparation, model preparation, training, alignment, inference, and evaluation) and, within it, systematically reviews and analyzes the core theoretical issues of LLMs. Result: Summarizes the foundational theory of each stage, such as the mathematical justification for data mixtures, the representational limits of architectures, and the optimization dynamics of alignment algorithms, and identifies frontier challenges including the theoretical limits of synthetic-data self-improvement, the mathematical bounds of safety guarantees, and the mechanistic origins of emergent intelligence. Conclusion: The work offers a structured path for connecting empirical observation with rigorous theory, helping LLM development shift from an experience-driven practice to a principled scientific paradigm. Abstract: The rapid emergence of Large Language Models (LLMs) has precipitated a profound paradigm shift in Artificial Intelligence, delivering monumental engineering successes that increasingly impact modern society. However, a critical paradox persists within the current field: despite the empirical efficacy, our theoretical understanding of LLMs remains disproportionately nascent, forcing these systems to be treated largely as "black boxes". To address this theoretical fragmentation, this survey proposes a unified lifecycle-based taxonomy that organizes the research landscape into six distinct stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and Evaluation. Within this framework, we provide a systematic review of the foundational theories and internal mechanisms driving LLM performance. Specifically, we analyze core theoretical issues such as the mathematical justification for data mixtures, the representational limits of various architectures, and the optimization dynamics of alignment algorithms. Moving beyond current best practices, we identify critical frontier challenges, including the theoretical limits of synthetic data self-improvement, the mathematical bounds of safety guarantees, and the mechanistic origins of emergent intelligence. By connecting empirical observations with rigorous scientific inquiry, this work provides a structured roadmap for transitioning LLM development from engineering heuristics toward a principled scientific discipline.

[37] Image, Word and Thought: A More Challenging Language Task for the Iterated Learning Model

Hyoyeon Lee,Seth Bullock,Conor Houghton

Main category: cs.CL

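TL;DR: This paper studies language transmission with the semi-supervised iterated learning model in the complex meaning space of seven-segment display images and finds that the model produces a language that is expressive, compositional, and stable.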

Details

Motivation: To explore how the constraints of language transmission give rise to language structure, particularly in larger and more complex meaning-signal spaces. Method: Uses a semi-supervised iterated learning model, built on an autoencoder architecture that combines supervised and unsupervised learning, to simulate language transmission across many generations. Result: Agents learn and transmit a language that uses a distinct code for each of the 128 glyphs (expressive), maps signal components consistently to meaning components (compositional), and does not change between generations (stable). Conclusion: The semi-supervised iterated learning model is suitable for studying language evolution in complex meaning spaces and can produce language systems that are expressive, compositional, and stable. Abstract: The iterated learning model simulates the transmission of language from generation to generation in order to explore how the constraints imposed by language transmission facilitate the emergence of language structure. Despite each modelled language learner starting from a blank slate, the presence of a bottleneck limiting the number of utterances to which the learner is exposed can lead to the emergence of language that lacks ambiguity, is governed by grammatical rules, and is consistent over successive generations, that is, one that is expressive, compositional and stable. The recent introduction of a more computationally tractable and ecologically valid semi-supervised iterated learning model, combining supervised and unsupervised learning within an autoencoder architecture, has enabled exploration of language transmission dynamics for much larger meaning-signal spaces. Here, for the first time, the model has been successfully applied to a language learning task involving the communication of much more complex meanings: seven-segment display images. Agents in this model are able to learn and transmit a language that is expressive: distinct codes are employed for all 128 glyphs; compositional: signal components consistently map to meaning components; and stable: the language does not change from generation to generation.

[38] RAL2M: Retrieval Augmented Learning-To-Match Against Hallucination in Compliance-Guaranteed Service Systems

Mengze Hong,Di Jiang,Jiangtao Wen,Zhiyang Su,Yawen Li,Yanjie Sun,Guan Wang,Chen Jason Zhang

Main category: cs.CL

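TL;DR: This paper proposes RAL2M, a retrieval-augmented learning-to-match framework that repositions large language models (LLMs) as query-response matching judges and adds a query-adaptive latent ensemble strategy, suppressing hallucination in both generation and judgment and significantly outperforming baselines on large-scale benchmarks.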

Details

Motivation: LLM-driven service systems suffer from serious hallucination, which undermines the compliance and reliability of responses and calls for explicit knowledge grounding. Method: Proposes the Retrieval-Augmented Learning-to-Match (RAL2M) framework, which uses LLMs as matching judges inside a retrieval system rather than as generators, and designs a query-adaptive latent ensemble strategy that models differences in competence and interdependencies among LLMs to produce a calibrated consensus decision. Result: On large-scale benchmarks, RAL2M significantly outperforms strong baselines, effectively exploits the "wisdom of the crowd" among LLMs, and markedly reduces hallucination in generation and judgment. Conclusion: RAL2M offers a new paradigm for building reliable, compliant LLM service systems and shows the advantage of retrieval with discriminative use of LLMs over purely generative approaches for controlling hallucination. Abstract: Hallucination is a major concern in LLM-driven service systems, necessitating explicit knowledge grounding for compliance-guaranteed responses. In this paper, we introduce Retrieval-Augmented Learning-to-Match (RAL2M), a novel framework that eliminates generation hallucination by repositioning LLMs as query-response matching judges within a retrieval-based system, providing a robust alternative to purely generative approaches. To further mitigate judgment hallucination, we propose a query-adaptive latent ensemble strategy that explicitly models heterogeneous model competence and interdependencies among LLMs, deriving a calibrated consensus decision. Extensive experiments on large-scale benchmarks demonstrate that the proposed method effectively leverages the "wisdom of the crowd" and significantly outperforms strong baselines. Finally, we discuss best practices and promising directions for further exploiting latent representations in future work.

[39] Memorization, Emergence, and Explaining Reversal Failures: A Controlled Study of Relational Semantics in LLMs

Yihua Zhu,Qianying Liu,Jiaxin Wang,Fei Cheng,Chaoran Liu,Akiko Aizawa,Sadao Kurohashi,Hidetoshi Shimodaira

Main category: cs.CL

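TL;DR: Using a synthetic framework built on knowledge graphs, this study examines whether autoregressive language models learn the logical semantics of symmetric and inverse relations, finding that relational semantics emerge with sufficient supervision and that reversal errors stem mainly from order bias rather than a semantic deficit.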

Details

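Motivation: It is unclear whether autoregressive language models truly learn the logical semantics of relations (such as symmetry and inversion), and whether reversal failures result from missing semantic understanding or from left-to-right generation order bias. Method: Proposes a controlled, knowledge-graph-based synthetic framework that generates text from symmetric/inverse triples, trains GPT-style autoregressive models from scratch, and evaluates memorization, logical inference, and in-context generalization. Result: With sufficient logic-bearing supervision, a sharp phase transition in relational semantics appears even in shallow (2-3 layer) models; successful generalization correlates with stable intermediate-layer signals; order-matched forward/reverse tests and a diffusion baseline indicate that reversal failures are mainly caused by autoregressive order bias. Conclusion: Autoregressive models can learn the logical semantics of relations, and failures on reversal tasks owe more to the inherent bias of generation order than to a lack of understanding of inverse relations. Abstract: Autoregressive LLMs perform well on relational tasks that require linking entities via relational words (e.g., father/son, friend), but it is unclear whether they learn the logical semantics of such relations (e.g., symmetry and inversion logic) and, if so, whether reversal-type failures arise from missing relational semantics or left-to-right order bias. We propose a controlled Knowledge Graph-based synthetic framework that generates text from symmetric/inverse triples, train GPT-style autoregressive models from scratch, and evaluate memorization, logical inference, and in-context generalization to unseen entities to address these questions. We find a sharp phase transition in which relational semantics emerge with sufficient logic-bearing supervision, even in shallow (2-3 layer) models, and that successful generalization aligns with stable intermediate-layer signals. Finally, order-matched forward/reverse tests and a diffusion baseline indicate that reversal failures are primarily driven by autoregressive order bias rather than deficient inversion semantics.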

[40] Pearmut: Human Evaluation of Translation Made Trivial

Vilém Zouhar,Tom Kocmi

Main category: cs.CL

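TL;DR: Pearmut is a lightweight yet feature-rich platform that simplifies human evaluation in multilingual NLP, particularly machine translation, making human evaluation as easy to run as automatic evaluation.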

Details

Motivation: Setting up human evaluation with existing tools is complex and time-consuming, so in practice it is often replaced by automatic metrics, which hurts evaluation reliability. Method: Develops the Pearmut platform, which integrates standard evaluation protocols (such as DA, ESA, and MQM) and supports document-level context, absolute and contrastive evaluation, attention checks, pre-annotations, and both static and active learning-based assignment strategies. Result: Pearmut lowers the barrier to human evaluation, supports multiple evaluation modes and protocol extensions, and improves evaluation efficiency and reliability. Conclusion: Pearmut allows reliable human evaluation to become a routine part of model development and diagnosis rather than an occasional task. Abstract: Human evaluation is the gold standard for multilingual NLP, but is often skipped in practice and substituted with automatic metrics, because it is notoriously complex and slow to set up with existing tools, which carry substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and provides support for evaluating multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, but is also extensible to allow prototyping new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations and both static and active learning-based assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.

[41] Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion

Jeonghyun Park,Byeongjeong Kim,Seojin Hwang,Hwanhee Lee

Main category: cs.CL

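TL;DR: This paper proposes DeLP, a debiased metric for language preference, and builds on it the multilingual retrieval-augmented generation framework DELTA, finding that the apparent preference for English stems mainly from data distribution bias rather than the model itself, while retrievers favor monolingual alignment between query and document language.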

Details

Motivation: Prior studies hold that large models prefer English in multilingual retrieval-augmented generation, but this conclusion may be distorted by structural priors in evaluation benchmarks (such as uneven resource distribution and cultural locality), so a fairer evaluation is needed to reveal genuine language preference. Method: Proposes the DeLP (Debiased Language Preference) metric, which calibrates for structural confounds such as exposure bias, the gold availability prior, and cultural topic priors in order to measure language preference accurately, and designs the lightweight mRAG framework DELTA around the advantage of monolingual alignment. Result: Analysis with DeLP shows that the previously reported English preference is mainly a result of evidence distribution rather than inherent model bias; retrievers in fact prefer monolingual alignment between query and document language; DELTA outperforms English pivoting and other mRAG baselines across multiple languages. Conclusion: Language preference in multilingual RAG is overestimated by existing benchmarks, and the English advantage in particular is largely an artifact of data distribution; debiased evaluation and monolingual alignment make more efficient and fairer cross-lingual RAG systems possible. Abstract: Multilingual Retrieval-Augmented Generation (mRAG) systems often exhibit a perceived preference for high-resource languages, particularly English, resulting in the widespread adoption of English pivoting. While prior studies attribute this advantage to the superior English-centric capabilities of Large Language Models (LLMs), we find that such measurements are significantly distorted by structural priors inherent in evaluation benchmarks. Specifically, we identify exposure bias and a gold availability prior (both driven by the disproportionate concentration of resources in English), as well as cultural priors rooted in topic locality, as factors that hinder accurate assessment of genuine language preference. To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds. Our analysis using DeLP reveals that the previously reported English preference is largely a byproduct of evidence distribution rather than an inherent model bias. Instead, we find that retrievers fundamentally favor monolingual alignment between the query and the document language. Building on this insight, we introduce DELTA (DEbiased Language preference-guided Text Augmentation), a lightweight and efficient mRAG framework that strategically leverages monolingual alignment to optimize cross-lingual retrieval and generation. Experimental results demonstrate that DELTA consistently outperforms English pivoting and mRAG baselines across diverse languages.

[42] LLM-Augmented Changepoint Detection: A Framework for Ensemble Detection and Automated Explanation

Fabian Lukassen,Christoph Weisser,Michael Schlee,Manish Kumar,Anton Thielmann,Benjamin Saefken,Thomas Kneib

Main category: cs.CL

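TL;DR: Proposes a changepoint detection framework that combines ensemble statistical methods with large language models (LLMs), improving both the detection accuracy and the interpretability of regime changes in time series.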

Details

Motivation: Existing changepoint detection suffers from suboptimal algorithm selection and lacks automated, contextual explanations of detected changes. Method: Aggregates ten different changepoint detection algorithms and uses an LLM to generate contextual narratives for the detected changes; for private data, a Retrieval-Augmented Generation (RAG) approach grounds the explanations in user-provided documents. Result: Across several domains (such as finance, political science, and environmental science), the framework performs better and more robustly than individual methods and automatically generates explanations linked to real-world events. Conclusion: By combining multiple detection algorithms with LLM-based explanation, the framework makes changepoint detection considerably more practical and gives analysts and decision-makers more insightful results. Abstract: This paper introduces a novel changepoint detection framework that combines ensemble statistical methods with Large Language Models (LLMs) to enhance both detection accuracy and the interpretability of regime changes in time series data. Two critical limitations in the field are addressed. First, individual detection methods exhibit complementary strengths and weaknesses depending on data characteristics, making method selection non-trivial and prone to suboptimal results. Second, automated, contextual explanations for detected changes are largely absent. The proposed ensemble method aggregates results from ten distinct changepoint detection algorithms, achieving superior performance and robustness compared to individual methods. Additionally, an LLM-powered explanation pipeline automatically generates contextual narratives, linking detected changepoints to potential real-world historical events. For private or domain-specific data, a Retrieval-Augmented Generation (RAG) solution enables explanations grounded in user-provided documents. The open source Python framework demonstrates practical utility in diverse domains, including finance, political science, and environmental science, transforming raw statistical output into actionable insights for analysts and decision-makers.
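
One simple way to aggregate the output of several detectors is tolerance-based voting, sketched below. The abstract does not say how the ten algorithms are combined, so the clustering rule, the `tolerance` and `min_votes` parameters, and the example indices are hypothetical choices made for illustration.

```python
def ensemble_changepoints(detections, tolerance=3, min_votes=2):
    """Merge changepoint indices reported by several detectors.

    `detections` holds one list of indices per detector. Indices within
    `tolerance` of the previous index join the same cluster (so clusters can
    chain); a cluster is kept if at least `min_votes` distinct detectors
    contributed, and is reported as its median index.
    """
    points = sorted(
        (idx, det) for det, idxs in enumerate(detections) for idx in idxs
    )
    clusters = []
    for idx, det in points:
        if clusters and idx - clusters[-1][-1][0] <= tolerance:
            clusters[-1].append((idx, det))
        else:
            clusters.append([(idx, det)])

    merged = []
    for cluster in clusters:
        voters = {det for _, det in cluster}
        if len(voters) >= min_votes:
            idxs = [i for i, _ in cluster]  # already sorted
            merged.append(idxs[len(idxs) // 2])
    return merged


# Three hypothetical detectors; only the first two regions gain enough votes.
merged = ensemble_changepoints([[50, 120], [52, 200], [49, 121]])
```

The merged indices could then be passed, together with their timestamps, to an explanation step; that part depends on an LLM and is outside the scope of this sketch.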

[43] Low-Resource Heuristics for Bahnaric Optical Character Recognition Improvement

Phat Tran,Phuoc Pham,Hung Trinh,Tho Quan

Main category: cs.CL

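TL;DR: This study combines table and non-table detection with probability-based post-processing to improve OCR accuracy on documents in the minority language Bahnar, raising accuracy from 72.86% to 79.26%.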

Details

Motivation: Bahnar faces a preservation crisis because research and data are scarce, and digitizing paper documents is hampered by poor image quality, which causes frequent OCR errors and degrades information retrieval. Method: Applies table and non-table region detection to improve input data quality, then corrects the OCR output with probability-based post-processing heuristics. Result: OCR accuracy rises from 72.86% to 79.26%, a gain of 6.40 percentage points, improving the digitization of Bahnar documents. Conclusion: The method provides an effective tool for preserving Bahnar and a framework that can be reused for digitizing other minority languages. Abstract: Bahnar, a minority language spoken across Vietnam, Cambodia, and Laos, faces significant preservation challenges due to limited research and data availability. This study addresses the critical need for accurate digitization of Bahnar language documents through optical character recognition (OCR) technology. Digitizing scanned paper documents poses significant challenges, as degraded image quality from broken or blurred areas introduces considerable OCR errors that compromise information retrieval systems. We propose a comprehensive approach combining advanced table and non-table detection techniques with probability-based post-processing heuristics to enhance recognition accuracy. Our method first applies detection algorithms to improve input data quality, then employs probabilistic error correction on OCR output. Experimental results indicate a substantial improvement, with recognition accuracy increasing from 72.86% to 79.26%. This work contributes valuable resources for Bahnar language preservation and provides a framework applicable to other minority language digitization efforts.

[44] Reliability-Aware Adaptive Self-Consistency for Efficient Sampling in LLM Reasoning

Junseok Kim,Nakyeong Yang,Kyungmin Min,Kyomin Jung

Main category: cs.CL

TL;DR: Proposes ReASC, which uses confidence-aware adaptive sampling to cut inference cost substantially while preserving accuracy.

Details

Motivation: Self-consistency improves reasoning reliability but is expensive at inference time, and existing adaptive methods sample redundantly because they treat all responses equally. Method: Reframes adaptive sampling from response counting to evidence sufficiency by introducing response-level confidence; it runs in two stages: a single-sample decision stage that resolves high-confidence questions, and a reliability-aware accumulation stage that aggregates responses using both frequency and confidence. Result: Across five models and four datasets, ReASC achieves a better accuracy-cost trade-off than existing methods and improves inference efficiency for models from 3B to 27B parameters; on GSM8K with Gemma-3-4B-it it reduces inference cost by up to 70% relative to self-consistency while preserving accuracy. Conclusion: By incorporating response confidence, ReASC makes adaptive sampling more efficient, balances inference cost against performance, and applies to models of different scales. Abstract: Self-Consistency improves reasoning reliability through multi-sample aggregation, but incurs substantial inference cost. Adaptive self-consistency methods mitigate this issue by adjusting the sampling budget; however, they rely on count-based stopping rules that treat all responses equally, often leading to unnecessary sampling. We propose Reliability-Aware Adaptive Self-Consistency (ReASC), which addresses this limitation by reframing adaptive sampling from response counting to evidence sufficiency, leveraging response-level confidence for principled information aggregation. ReASC operates in two stages: a single-sample decision stage that resolves instances confidently answerable from a single response, and a reliability-aware accumulation stage that aggregates responses by jointly leveraging their frequency and confidence. Across five models and four datasets, ReASC consistently achieves the best accuracy-cost trade-off compared to existing baselines, yielding improved inference efficiency across model scales from 3B to 27B parameters. As a concrete example, ReASC reduces inference cost by up to 70% relative to self-consistency while preserving accuracy on GSM8K using Gemma-3-4B-it.
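
The two-stage structure described above can be sketched as a small sampling loop. This is not the paper's stopping rule: the thresholds, the confidence-weighted vote, and the lead-margin criterion are assumptions introduced only to show how confidence can replace plain counting.

```python
from collections import defaultdict


def adaptive_vote(samples, single_threshold=0.9, stop_margin=1.0, max_samples=8):
    """Consume (answer, confidence) pairs until the evidence looks sufficient.

    Stage 1: accept the first response outright if it is confident enough.
    Stage 2: otherwise accumulate confidence-weighted votes and stop once the
    leading answer is ahead by `stop_margin`, or the budget is exhausted.
    Returns (chosen_answer, number_of_samples_used).
    """
    scores = defaultdict(float)
    used = 0
    for answer, confidence in samples:
        used += 1
        if used == 1 and confidence >= single_threshold:
            return answer, used
        scores[answer] += confidence
        ranked = sorted(scores.values(), reverse=True)
        lead = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
        if (used >= 2 and lead >= stop_margin) or used >= max_samples:
            break
    if not scores:
        raise ValueError("no samples were provided")
    return max(scores, key=scores.get), used


# Invented responses: a confident single answer, then a contested question.
easy = adaptive_vote([("42", 0.95), ("41", 0.5)])
hard = adaptive_vote([("42", 0.6), ("41", 0.3), ("42", 0.8), ("42", 0.7)])
```

Passing a generator as `samples` means later responses are only requested from the model when the loop actually needs them, which is where the cost saving would come from.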

[45] Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning

Nathanaël Carraz Rakotonirina,Ren Pang,Neha Anna John,Michael Bohlke-Schneider,Momchil Hardalov

Main category: cs.CL

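TL;DR: Proposes a multi-stage efficient reasoning method that combines supervised fine-tuning with reinforcement learning under an adaptive length penalty to reduce "overthinking" in large language models, cutting response length by 28%-40% with only a minor performance drop and outperforming existing methods on the area under the Overthinking-Adjusted Accuracy curve (AUC_OAA).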

Details

Motivation: To address "overthinking" in large language models, where an overly long chain of thought (CoT) wastes compute and can degrade performance. Method: Combines supervised fine-tuning (via rejection sampling or reasoning trace reformatting) with reinforcement learning using an adaptive length penalty, and designs a lightweight reward function that penalizes superfluous tokens after the first correct answer while encouraging self-verification only when beneficial. Result: Evaluated on seven reasoning tasks, response length falls by an average of 28% for 8B models and 40% for 32B models, with accuracy dropping by only 1.6 and 2.5 points respectively; the AUC_OAA score is 76.6, 5 points above the base model and 2.5 points above the second-best method. Conclusion: The method achieves a better accuracy-length trade-off in a conceptually simple way and substantially mitigates overthinking in LLM reasoning. Abstract: The reasoning capabilities of large language models (LLMs) have improved substantially through increased test-time computation, typically in the form of intermediate tokens known as chain-of-thought (CoT). However, CoT often becomes unnecessarily long, increasing computation cost without actual accuracy gains or sometimes even degrading performance, a phenomenon known as "overthinking". We propose a multi-stage efficient reasoning method that combines supervised fine-tuning -- via rejection sampling or reasoning trace reformatting -- with reinforcement learning using an adaptive length penalty. We introduce a lightweight reward function that penalizes tokens generated after the first correct answer while encouraging self-verification only when beneficial. We conduct a holistic evaluation across seven diverse reasoning tasks, analyzing the accuracy-response length trade-off. Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models, while incurring only minor performance drops of 1.6 and 2.5 points, respectively. Despite its conceptual simplicity, it achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods, scoring 76.6 in terms of the area under the Overthinking-Adjusted Accuracy curve ($\text{AUC}_{\text{OAA}}$) -- 5 points above the base model and 2.5 points above the second-best approach.
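
A reward that penalizes tokens generated after the first correct answer might look like the sketch below. The abstract gives no formula, so the linear penalty, the `penalty` weight, and the way the first correct answer is located are all assumptions; the self-verification component mentioned in the abstract is not modeled here.

```python
def length_penalized_reward(num_tokens, first_correct_at, final_correct, penalty=0.5):
    """Toy reward: 1.0 for a correct final answer, reduced in proportion to
    the share of tokens produced after the first correct answer appeared.

    num_tokens       -- total tokens in the response (must be positive)
    first_correct_at -- token position where the correct answer first
                        appears, or None if it cannot be located
    final_correct    -- whether the final answer is correct
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if not final_correct:
        return 0.0
    if first_correct_at is None:
        return 1.0
    extra_tokens = max(0, num_tokens - first_correct_at)
    return 1.0 - penalty * extra_tokens / num_tokens


concise = length_penalized_reward(100, 100, True)     # no tokens wasted
overthought = length_penalized_reward(100, 50, True)  # half the tokens wasted
wrong = length_penalized_reward(100, 50, False)
```

An adaptive variant could make `penalty` depend on task difficulty or on current accuracy, which is presumably closer to what the "adaptive length penalty" refers to, but the abstract does not give enough detail to sketch that.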

[46] Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders

Ruikang Zhang,Shuo Wang,Qi Su

Main category: cs.CL

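TL;DR: This paper proposes a sparse-autoencoder-based framework for retrieving and steering semantically interpretable internal features of large language models that are tied to high-level linguistic behavior, and uses the Big Five personality traits as a case study to show that the method achieves precise, stable, bidirectional behavioral steering.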

Details

Motivation: Existing methods struggle to link the internal features of large language models reliably to behavioral control of complex semantic attributes; this paper aims to build an interpretable and stable mechanism for precise control of high-level semantic behavior. Method: Uses a contrastive feature retrieval pipeline based on controlled semantic oppositions, combining statistical activation analysis with generation-based validation to extract monosemantic functional features from sparse activation spaces, and builds a sparse-autoencoder-based feature steering framework. Result: On the Big Five personality traits, the method achieves more stable and more precise bidirectional steering than existing methods such as CAA, and reveals a "Functional Faithfulness" effect: intervening on a specific internal feature induces coherent, predictable shifts across multiple linguistic dimensions. Conclusion: LLMs contain deeply integrated representations of high-order concepts, and the proposed method offers a new path toward mechanistic regulation of complex AI behavior. Abstract: Recent work in Mechanistic Interpretability (MI) has enabled the identification and intervention of internal features in Large Language Models (LLMs). However, a persistent challenge lies in linking such internal features to the reliable control of complex, behavior-level semantic attributes in language generation. In this paper, we propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features associated with high-level linguistic behaviors. Our method employs a contrastive feature retrieval pipeline based on controlled semantic oppositions, combining statistical activation analysis and generation-based validation to distill monosemantic functional features from sparse activation spaces. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior while maintaining superior stability and performance compared to existing activation steering methods like Contrastive Activation Addition (CAA). We further identify an empirical effect, which we term Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute. Our findings suggest that LLMs internalize deeply integrated representations of high-order concepts, and provide a novel, robust mechanistic path for the regulation of complex AI behaviors.

[47] P-Check: Advancing Personalized Reward Model via Learning to Generate Dynamic Checklist

Kwangwook Seo,Dongha Lee

Main category: cs.CL

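TL;DR: This paper proposes P-Check, a personalized reward modeling framework that improves reward prediction by generating dynamic evaluation checklists, and introduces a preference-contrastive criterion weighting method to strengthen personalized alignment.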

Details

Motivation: Existing methods treat user context as a static or implicit conditioning signal and fail to capture the dynamic, multi-faceted nature of human judgment. Method: Proposes the P-Check framework, which trains a plug-and-play checklist generator and uses a Preference-Contrastive Criterion Weighting strategy to assign saliency scores to discriminative criteria. Result: Experiments show that P-Check performs better on reward accuracy, on downstream personalized generation, and in OOD scenarios. Conclusion: P-Check models dynamic personalized preferences effectively and improves the performance and generalization of reward models. Abstract: Recent approaches in personalized reward modeling have primarily focused on leveraging user interaction history to align model judgments with individual preferences. However, existing approaches largely treat user context as a static or implicit conditioning signal, failing to capture the dynamic and multi-faceted nature of human judgment. In this paper, we propose P-Check, a novel personalized reward modeling framework, designed to train a plug-and-play checklist generator that synthesizes dynamic evaluation criteria for guiding the reward prediction. To better align these checklists with personalized nuances, we introduce Preference-Contrastive Criterion Weighting, a training strategy that assigns saliency scores to criteria based on their discriminative power for personalized judgment. We conduct extensive experiments and demonstrate that P-Check not only improves reward accuracy but also enhances downstream personalized generation, and remains robust in OOD scenarios.

[48] Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy

Hosein Hasani,Mohammadali Banayeeanzade,Ali Nafisi,Sadegh Mohammadian,Fatemeh Askari,Mobin Bagherian,Amirmohammad Izadi,Mahdieh Soleymani Baghshah

Main category: cs.CL

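TL;DR: Proposes a test-time strategy inspired by System-2 cognition that decomposes large counting tasks into sub-problems, letting large language models overcome the counting limits of the Transformer architecture, reach high accuracy, and expose their reasoning mechanism.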

Details

Motivation: Large language models perform well on complex mathematical tasks but show systematic deficits in counting, which stem from the depth limits of the Transformer architecture and reduce precision on large counts. Method: Applies a System-2-inspired divide-and-conquer strategy that splits a large counting task into several small, independent sub-tasks, and uses observational and causal mediation analyses to probe the model's internal mechanism, identifying key components such as the attention heads that carry implicit count information into intermediate steps. Result: Experiments show that the strategy substantially improves LLM accuracy on large-scale counting; the mechanistic analysis finds that the model computes and stores latent counts in the final item representation of each part, transfers them via dedicated attention heads, and aggregates them in the final stage. Conclusion: The method lets LLMs overcome their architectural limits on large counting tasks, makes the reasoning process interpretable, and offers a general framework for improving and understanding systematic reasoning in LLMs. Abstract: Large language models (LLMs), despite strong performance on complex mathematical problems, exhibit systematic limitations in counting tasks. This issue arises from architectural limits of transformers, where counting is performed across layers, leading to degraded precision for larger counting problems due to depth constraints. To address this limitation, we propose a simple test-time strategy inspired by System-2 cognitive processes that decomposes large counting tasks into smaller, independent sub-problems that the model can reliably solve. We evaluate this approach using observational and causal mediation analyses to understand the underlying mechanism of this System-2-like strategy. Our mechanistic analysis identifies key components: latent counts are computed and stored in the final item representations of each part, transferred to intermediate steps via dedicated attention heads, and aggregated in the final stage to produce the total count. Experimental results demonstrate that this strategy enables LLMs to surpass architectural limitations and achieve high accuracy on large-scale counting tasks. This work provides mechanistic insight into System-2 counting in LLMs and presents a generalizable approach for improving and understanding their reasoning behavior.
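
The decomposition strategy amounts to counting within small parts and summing the partial counts. In the sketch below, `count_fn` stands in for a model call on one part; its exact-match default, the chunk size, and the example data are assumptions made so that the code runs on its own.

```python
def chunked_count(items, target, chunk_size=10, count_fn=None):
    """Count occurrences of `target` by splitting `items` into fixed-size
    parts, counting within each part, and summing the partial counts.

    `count_fn(chunk, target)` is a hypothetical stand-in for querying a
    model on one small part; the default simply counts exact matches.
    Returns (total, partial_counts).
    """
    if count_fn is None:
        count_fn = lambda chunk, t: sum(1 for x in chunk if x == t)
    partial_counts = [
        count_fn(items[start:start + chunk_size], target)
        for start in range(0, len(items), chunk_size)
    ]
    return sum(partial_counts), partial_counts


# 25 invented items alternating "apple"/"pear", starting and ending on "apple".
items = ["apple", "pear"] * 12 + ["apple"]
total, partials = chunked_count(items, "apple", chunk_size=10)
```

Each part stays small enough for a reliable count, and the final aggregation is a plain sum, mirroring the compute, transfer, and aggregate stages the abstract describes.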

[49] Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation

Qianchi Zhang,Hainan Zhang,Liang Pang,Hongwei Zheng,Zhiming Zheng

Main category: cs.CL

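TL;DR: This paper proposes Stable-RAG, which estimates sensitivity to retrieval order to reduce hallucination in retrieval-augmented generation and improve answer accuracy and reasoning consistency.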

Details

Motivation: Existing RAG methods pay too little attention to how the order of retrieved documents affects model output, so that even when the correct document is included, different permutations yield inconsistent predictions. Method: Stable-RAG runs the generator under multiple retrieval orders, clusters the hidden states, decodes from a cluster-center representation, and uses the dominant reasoning pattern to correct hallucinated outputs. Result: Experiments on three QA datasets show that Stable-RAG significantly improves answer accuracy, reasoning consistency, and robust generalization across datasets, retrievers, and input lengths. Conclusion: Stable-RAG effectively mitigates permutation sensitivity in RAG and makes model output more stable and more accurate. Abstract: Retrieval-Augmented Generation (RAG) has become a key paradigm for reducing factual hallucinations in large language models (LLMs), yet little is known about how the order of retrieved documents affects model behavior. We empirically show that under Top-5 retrieval with the gold document included, LLM answers vary substantially across permutations of the retrieved set, even when the gold document is fixed in the first position. This reveals a previously underexplored sensitivity to retrieval permutations. Although robust RAG methods primarily focus on enhancing LLM robustness to low-quality retrieval and mitigating positional bias to distribute attention fairly over long contexts, neither approach directly addresses permutation sensitivity. In this paper, we propose Stable-RAG, which exploits permutation sensitivity estimation to mitigate permutation-induced hallucinations. Stable-RAG runs the generator under multiple retrieval orders, clusters hidden states, and decodes from a cluster-center representation that captures the dominant reasoning pattern. It then uses these reasoning results to align hallucinated outputs toward the correct answer, encouraging the model to produce consistent and accurate predictions across document permutations. Experiments on three QA datasets show that Stable-RAG significantly improves answer accuracy, reasoning consistency and robust generalization across datasets, retrievers, and input lengths compared with baselines.

[50] Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners

Yihong Liu,Raoyuan Zhao,Hinrich Schütze,Michael A. Hedderich

Main category: cs.CL

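TL;DR: This paper studies latent reasoning in large reasoning models across languages, finding that it is stronger in resource-rich languages and weaker in low-resource ones, and that it broadly follows the same reasoning pathway as English.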

Details

Motivation: To determine whether large reasoning models exhibit latent reasoning in multilingual settings and to understand how the internal mechanism differs across languages. Method: Uses a truncation-based strategy that supplies partial reasoning traces to observe how the correct answer gradually forms, and performs representational analyses to compare the internal evolution of predictions across languages. Result: Finds clear evidence of multilingual latent reasoning: stronger in resource-rich languages, weaker in low-resource ones, and less observable on harder benchmarks; the internal evolution of predictions is nevertheless highly consistent across languages and aligned with English. Conclusion: Large reasoning models appear to follow an English-centered latent reasoning pathway, with an internal reasoning mechanism that is highly consistent across languages. Abstract: Large reasoning models (LRMs) achieve strong performance on mathematical reasoning tasks, often attributed to their capability to generate explicit chain-of-thought (CoT) explanations. However, recent work shows that LRMs often arrive at the correct answer before completing these textual reasoning steps, indicating the presence of latent reasoning -- internal, non-verbal computation encoded in hidden states. While this phenomenon has been explored in English, its multilingual behavior remains largely unknown. In this paper, we conduct a systematic investigation of multilingual latent reasoning in LRMs across 11 languages. Using a truncation-based strategy, we examine how the correct answer emerges as the model is given only partial reasoning traces, allowing us to measure stepwise latent prediction formation. Our results reveal clear evidence of multilingual latent reasoning, though unevenly: strong in resource-rich languages, weaker in low-resource ones, and broadly less observable on harder benchmarks. To understand whether these differences reflect distinct internal mechanisms, we further perform representational analyses. Despite surface-level disparities, we find that the internal evolution of predictions is highly consistent across languages and broadly aligns with English -- a pattern suggesting an English-centered latent reasoning pathway.

[51] SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering

Junli Liang,Pengfei Zhou,Wangqiu Zhou,Wenjie Qing,Qi Zhao,Ziwen Wang,Qi Song,Xiangyang Li

Main category: cs.CL

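TL;DR: Proposes SentGraph, a retrieval-augmented generation framework built on a sentence-level graph structure, to improve reasoning in multi-hop question answering.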

Details

Motivation: Traditional chunk-based retrieval often produces incomplete evidence chains and reasoning errors in multi-hop QA because it cannot capture fine-grained logical relations between sentences. Method: A hierarchical sentence graph is built on Rhetorical Structure Theory: nucleus and satellite sentences are identified and organized into topic-level subgraphs connected by cross-document entity bridges, and retrieval performs graph-guided evidence selection and path expansion. Result: Experiments on four multi-hop QA benchmarks show that SentGraph significantly outperforms existing methods, validating the benefit of modeling sentence-level logical dependencies. Conclusion: Explicitly modeling sentence-level logical relations improves evidence integration and reasoning accuracy in multi-hop QA. Abstract: Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with large language models but faces significant limitations in multi-hop question answering tasks, which require combining evidence from multiple documents. Existing chunk-based retrieval often provides irrelevant and logically incoherent context, leading to incomplete evidence chains and incorrect reasoning during answer generation. To address these challenges, we propose SentGraph, a sentence-level graph-based RAG framework that explicitly models fine-grained logical relationships between sentences for multi-hop question answering. Specifically, we construct a hierarchical sentence graph offline by first adapting Rhetorical Structure Theory to distinguish nucleus and satellite sentences, and then organizing them into topic-level subgraphs with cross-document entity bridges. During online retrieval, SentGraph performs graph-guided evidence selection and path expansion to retrieve fine-grained sentence-level evidence. Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.

[52] MMFormalizer: Multimodal Autoformalization in the Wild

Jing Xiong,Qi Han,Yunta Hsieh,Hui Shen,Huajian Xin,Chaofan Tao,Chenyang Zhao,Hengyuan Zhang,Taiqiang Wu,Zhen Zhang,Haochen Wang,Zhongwei Wan,Lingpeng Kong,Ngai Wong

Main category: cs.CL

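TL;DR: This paper proposes MMFormalizer, the first multimodal autoformalization method able to handle classical mechanics, relativity, quantum mechanics, and thermodynamics. It combines visual evidence with axiomatic modeling to enable cross-modal reasoning from natural language to formal mathematics.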

Details

Motivation: Real-world physics problems often contain hidden constraints (e.g., mass or energy) that must be inferred from images, which text-only autoformalization methods cannot handle; a method that fuses visual information with formal reasoning is needed. Method: MMFormalizer uses adaptive grounding to link entities from mathematical and physical domains to visual elements. Through recursive grounding and axiom composition, it recursively builds formal propositions from perceptually grounded primitives, and adaptive recursive termination ensures that every level of abstraction is supported by visual evidence and satisfies dimensional or axiomatic constraints. Result: On the newly built PhyX-AF benchmark (115 samples), frontier models such as GPT-5 and Gemini-3-Pro perform best, with GPT-5 standing out in physical reasoning, while geometry remains the hardest domain; MMFormalizer achieves high compile and semantic accuracy. Conclusion: MMFormalizer offers a scalable framework for unified multimodal autoformalization that connects perception with formal reasoning; it is the first formalization system able to handle multiple physical theories, advancing automated reasoning in complex scientific settings. Abstract: Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on our project page: MMFormalizer.github.io

[53] Dementia-R1: Reinforced Pretraining and Reasoning from Unstructured Clinical Notes for Real-World Dementia Prognosis

Choonghan Kim,Hyunmin Hwang,Hangeol Chang,Jaemin Kim,Jinse Park,Jae-Sung Lim,Jong Chul Ye

Main category: cs.CL

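TL;DR: Proposes Dementia-R1, a reinforcement-learning-based framework for longitudinal dementia prognosis from unstructured clinical text; a Cold-Start RL strategy improves reasoning over complex symptom trajectories.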

Details

Motivation: Existing large language models perform poorly on longitudinal prediction tasks such as dementia prognosis because explicit annotations of symptom evolution are lacking and direct reinforcement learning suffers from sparse rewards. Method: A Cold-Start RL framework first pre-trains the model to predict verifiable clinical indices and then applies it to the final clinical-status decision, strengthening reasoning about disease progression. Result: The model reaches an F1 of 77.03% on real-world clinical datasets; on the ADNI benchmark, the 7B model rivals GPT-4o and effectively captures fluctuating cognitive trajectories. Conclusion: By combining pre-training on verifiable clinical indices with reinforcement learning, Dementia-R1 markedly improves long-term prediction under complex, non-monotonic symptom trajectories. Abstract: While Large Language Models (LLMs) have shown strong performance on clinical text understanding, they struggle with longitudinal prediction tasks such as dementia prognosis, which require reasoning over complex, non-monotonic symptom trajectories across multiple visits. Standard supervised training lacks explicit annotations for symptom evolution, while direct Reinforcement Learning (RL) is hindered by sparse binary rewards. To address this challenge, we introduce Dementia-R1, an RL-based framework for longitudinal dementia prognosis from unstructured clinical notes. Our approach adopts a Cold-Start RL strategy that pre-trains the model to predict verifiable clinical indices extracted from patient histories, enhancing the capability to reason about disease progression before determining the final clinical status. Extensive experiments demonstrate that Dementia-R1 achieves an F1 score of 77.03% on real-world unstructured clinical datasets. Notably, on the ADNI benchmark, our 7B model rivals GPT-4o, effectively capturing fluctuating cognitive trajectories. Code is available at https://anonymous.4open.science/r/dementiar1-CDB5

[54] MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models

Lecheng Gong,Weimin Fang,Ting Yang,Dongjie Tao,Chunxiao Guo,Peng Wei,Bo Xie,Jinqun Guan,Zixiao Chen,Fang Shi,Jinjie Gu,Junwei Liu

Main category: cs.CL

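TL;DR: This paper presents MedDialogRubrics, a new benchmark for evaluating the multi-turn diagnostic abilities of medical large language models. It contains 5,200 synthetic cases and more than 60,000 fine-grained rubrics, generated by a multi-agent system and refined by clinical experts, and it exposes weaknesses of current models in dialogue management architecture.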

Details

Motivation: Existing evaluation frameworks for medical conversational AI do not rigorously assess information gathering and diagnostic reasoning, and they face privacy and data-governance issues; a safe, systematic, fine-grained evaluation method is needed. Method: MedDialogRubrics uses a multi-agent system to generate synthetic patient cases from disease knowledge; a Patient Agent restricted to atomic medical facts and equipped with a dynamic correction mechanism reduces hallucinations; an LLM-based, expert-annotated pipeline grounded in evidence-based medicine guidelines uses rejection sampling to produce prioritized rubric items ("must-ask" items). Result: The benchmark comprises 5,200 synthetic cases and over 60,000 fine-grained rubrics; an evaluation of mainstream LLMs across multiple dimensions shows poor performance in completeness of information gathering, diagnostic logic, and dialogue management. Conclusion: Current medical LLMs still have significant deficiencies in diagnostic dialogue that fine-tuning the base model alone cannot resolve; innovation at the level of dialogue management architecture is required to improve the safety and effectiveness of medical dialogue systems. Abstract: Medical conversational AI plays a pivotal role in the development of safer and more effective medical dialogue systems. However, existing benchmarks and evaluation frameworks for assessing the information-gathering and diagnostic reasoning abilities of medical large language models (LLMs) have not been rigorously evaluated. To address these gaps, we present MedDialogRubrics, a novel benchmark comprising 5,200 synthetically constructed patient cases and over 60,000 fine-grained evaluation rubrics generated by LLMs and subsequently refined by clinical experts, specifically designed to assess the multi-turn diagnostic capabilities of LLMs. Our framework employs a multi-agent system to synthesize realistic patient records and chief complaints from underlying disease knowledge without accessing real-world electronic health records, thereby mitigating privacy and data-governance concerns. We design a robust Patient Agent that is limited to a set of atomic medical facts and augmented with a dynamic guidance mechanism that continuously detects and corrects hallucinations throughout the dialogue, ensuring internal coherence and clinical plausibility of the simulated cases. Furthermore, we propose a structured LLM-based and expert-annotated rubric-generation pipeline that retrieves Evidence-Based Medicine (EBM) guidelines and uses rejection sampling to derive a prioritized set of rubric items ("must-ask" items) for each case. We perform a comprehensive evaluation of state-of-the-art models and demonstrate that, across multiple assessment dimensions, current models face substantial challenges. Our results indicate that improving medical dialogue will require advances in dialogue management architectures, not just incremental tuning of the base model.

[55] LittiChoQA: Literary Texts in Indic Languages Chosen for Question Answering

Aarya Khandelwal,Ritwik Mishra,Rajiv Ratn Shah

Main category: cs.CL

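TL;DR: This paper introduces LittiChoQA, currently the largest literary question-answering dataset covering many languages of India's Gangetic plains, to address the scarcity of long-context QA resources for low-resource languages.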

Details

Motivation: Modern large language models struggle with long-context QA over literary texts in low-resource languages, and high-quality datasets for this setting are lacking. Method: The authors build a dataset of over 270K automatically generated question-answer pairs covering factoid and non-factoid questions and evaluate several multilingual LLMs under full-context and context-shortened settings. Result: Krutrim-2 obtains a semantic score of 76.1 with full context and 74.9 and 71.4 in the two shortened-context settings; the results show a trade-off between performance and efficiency. Conclusion: LittiChoQA is an important resource for long-context QA in Indic languages; full-context fine-tuning performs best, while context shortening substantially improves inference efficiency. Abstract: Long-context question answering (QA) over literary texts poses significant challenges for modern large language models, particularly in low-resource languages. We address the scarcity of long-context QA resources for Indic languages by introducing LittiChoQA, the largest literary QA dataset to date covering many languages spoken in the Gangetic plains of India. The dataset comprises over 270K automatically generated question-answer pairs with a balanced distribution of factoid and non-factoid questions, generated from naturally authored literary texts collected from the open web. We evaluate multiple multilingual LLMs on non-factoid, abstractive QA, under both full-context and context-shortened settings. Results demonstrate a clear trade-off between performance and efficiency: full-context fine-tuning yields the highest token-level and semantic-level scores, while context shortening substantially improves throughput. Among the evaluated models, Krutrim-2 achieves the strongest performance, obtaining a semantic score of 76.1 with full context. In shortened-context settings, it scores 74.9 with answer paragraph selection and 71.4 with vector-based retrieval. Qualitative evaluations further corroborate these findings.

[56] Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning

Sindhuja Chaduvula,Ahmed Y. Radwan,Azib Farooq,Yani Ioannou,Shaina Raza

Main category: cs.CL

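TL;DR: F-DPO is an improved preference optimization method that adds binary factuality labels, significantly improving the factual accuracy of large language models and reducing hallucinations without a more complex training pipeline.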

Details

Motivation: Existing preference alignment methods such as RLHF and DPO can reinforce hallucinations because preference judgments tend to reward fluency and confidence, so an alignment method that explicitly promotes factual accuracy is needed. Method: F-DPO has two key mechanisms: a label-flipping transformation that ensures the chosen response is at least as factual as the rejected one, and a factuality-aware margin that emphasizes pairs with a clear difference in factual correctness and reduces to standard DPO when both responses have the same factuality. Training uses preference data augmented with binary factuality labels. Result: Across seven open-weight LLMs (1B-14B), F-DPO significantly lowers hallucination rates and raises factuality scores. For example, on Qwen3-8B the hallucination rate drops from 0.424 to 0.084 and the factuality score improves by 50%; gains also appear on the out-of-distribution benchmark TruthfulQA. Conclusion: F-DPO improves factuality without an auxiliary reward model, token-level annotations, or multi-stage training, making it a simple and scalable alignment method. Abstract: Preference alignment methods such as RLHF and Direct Preference Optimization (DPO) improve instruction following, but they can also reinforce hallucinations when preference judgments reward fluency and confidence over factual correctness. We introduce F-DPO (Factuality-aware Direct Preference Optimization), a simple extension of DPO that uses only binary factuality labels. F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs so the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin that emphasizes pairs with clear correctness differences, while reducing to standard DPO when both responses share the same factuality. We construct factuality-aware preference data by augmenting DPO pairs with binary factuality indicators and synthetic hallucinated variants. Across seven open-weight LLMs (1B-14B), F-DPO consistently improves factuality and reduces hallucination rates relative to both base models and standard DPO. On Qwen3-8B, F-DPO reduces hallucination rates by a factor of five (from 0.424 to 0.084) while improving factuality scores by 50% (from 5.26 to 7.90). F-DPO also generalizes to out-of-distribution benchmarks: on TruthfulQA, Qwen2.5-14B gains 17% in MC1 accuracy (0.500 to 0.585) and 49% in MC2 accuracy (0.357 to 0.531). F-DPO requires no auxiliary reward model, token-level annotations, or multi-stage training.
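
The two F-DPO mechanisms described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not give the exact margin formula, so the additive margin `margin_weight * (f_chosen - f_rejected)`, the dictionary field names, and the default hyperparameters are all hypothetical choices, not the authors' implementation.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def flip_if_misordered(chosen, rejected):
    """Label flipping: swap the pair if 'chosen' is less factual than 'rejected'."""
    if chosen["factual"] < rejected["factual"]:
        return rejected, chosen
    return chosen, rejected


def fdpo_loss(chosen, rejected, beta=0.1, margin_weight=1.0):
    """DPO-style loss with an assumed additive factuality margin.

    Each response is a dict with sequence log-probabilities under the policy
    ('logp_policy') and the reference model ('logp_ref'), plus a binary
    'factual' label (1 = factual, 0 = hallucinated). All names are hypothetical.
    """
    chosen, rejected = flip_if_misordered(chosen, rejected)
    # Implicit reward gap, as in standard DPO.
    reward_gap = beta * (
        (chosen["logp_policy"] - chosen["logp_ref"])
        - (rejected["logp_policy"] - rejected["logp_ref"])
    )
    # Assumed margin: zero when both labels agree, so the loss is plain DPO.
    margin = margin_weight * (chosen["factual"] - rejected["factual"])
    # -log(sigmoid(z)) == softplus(-z)
    return softplus(-(reward_gap - margin))
```

With equal factuality labels the margin vanishes and the function returns the ordinary DPO loss, which matches the reduction property stated in the abstract; with differing labels the pair must be separated by a larger reward gap to reach the same loss.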

[57] NorwAI's Large Language Models: Technical Report

Jon Atle Gulla,Peng Liu,Lemei Zhang

Main category: cs.CL

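TL;DR: The NorLLM team has developed a family of large language models for Norwegian and other Scandinavian languages. The models are pretrained or continually pretrained on several Transformer architectures and use a Norwegian-extended tokenizer and advanced post-training strategies to improve performance and adaptability on real-world tasks.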

Details

Motivation: Norwegian is underrepresented in major advances in natural language processing, and the lack of efficient language models built specifically for it limits practical applications. Method: Building on Transformer architectures such as GPT, Mistral, Llama2, Mixtral, and Magistral, the models are pretrained from scratch or continually pretrained on 25B to 88.45B tokens, using a Norwegian-extended tokenizer together with instruction tuning and advanced post-training strategies. Result: The work yields several high-performing Norwegian-adapted models; instruction-tuned variants (e.g., Mistral-7B-Instruct) show strong assistant-style interaction and perform well on a range of real-world tasks. Conclusion: The model family helps fill the gap for Norwegian in NLP; it is available to Nordic organizations, companies, and students for research and experimental use and has broad application potential. Abstract: Norwegian, spoken by approximately five million people, remains underrepresented in many of the most significant breakthroughs in Natural Language Processing (NLP). To address this gap, the NorLLM team at NorwAI has developed a family of models specifically tailored to Norwegian and other Scandinavian languages, building on diverse Transformer-based architectures such as GPT, Mistral, Llama2, Mixtral and Magistral. These models are either pretrained from scratch or continually pretrained on 25B to 88.45B tokens, using a Norwegian-extended tokenizer and advanced post-training strategies to optimize performance, enhance robustness, and improve adaptability across various real-world tasks. Notably, instruction-tuned variants (e.g., Mistral-7B-Instruct and Mixtral-8x7B-Instruct) showcase strong assistant-style capabilities, underscoring their potential for practical deployment in interactive and domain-specific applications. The NorwAI large language models are openly available to Nordic organizations, companies and students for both research and experimental use. This report provides detailed documentation of the model architectures, training data, tokenizer design, fine-tuning strategies, deployment, and evaluations.

[58] BaseCal: Unsupervised Confidence Calibration via Base Model Signals

Hexiang Tan,Wanli Yang,Junwei Zhang,Xin Chen,Rui Tang,Du Su,Jingang Wang,Yuanzhuo Wang,Fei Sun,Xueqi Cheng

Main category: cs.CL

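TL;DR: Proposes BaseCal, which uses the base LLM to calibrate the confidence of a post-trained LLM; its two variants, ReEval and Proj, effectively reduce calibration error.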

Details

Motivation: Post-trained large language models (PoLLMs) are typically severely overconfident, whereas their corresponding base LLMs often remain well calibrated, so the base LLM can serve as a reference for calibrating PoLLM confidence. Method: Two methods are proposed: BaseCal-ReEval feeds the PoLLM's responses into the base LLM and uses the average probability as confidence; BaseCal-Proj trains a lightweight projection that maps the PoLLM's final-layer hidden states back to those of the base LLM and computes calibrated confidence through the base LLM's output layer. Result: Experiments on five datasets and three LLM families show that BaseCal reduces Expected Calibration Error (ECE) by an average of 42.90% compared with the best unsupervised baselines. Conclusion: BaseCal is an unsupervised, plug-and-play solution that needs no human labels or LLM modifications and effectively calibrates PoLLM confidence. Abstract: Reliable confidence is essential for trusting the outputs of LLMs, yet widely deployed post-trained LLMs (PoLLMs) typically compromise this trust with severe overconfidence. In contrast, we observe that their corresponding base LLMs often remain well-calibrated. This naturally motivates us to calibrate PoLLM confidence using the base LLM as a reference. This work proposes two ways to achieve this. A straightforward solution, BaseCal-ReEval, evaluates PoLLM's responses by feeding them into the base LLM to get average probabilities as confidence. While effective, this approach introduces additional inference overhead. To address this, we propose BaseCal-Proj, which trains a lightweight projection to map the final-layer hidden states of PoLLMs back to those of their base LLMs. These projected states are then processed by the base LLM's output layer to derive base-calibrated confidence for PoLLM's responses. Notably, BaseCal is an unsupervised, plug-and-play solution that operates without human labels or LLM modifications. Experiments across five datasets and three LLM families demonstrate the effectiveness of BaseCal, reducing Expected Calibration Error (ECE) by an average of 42.90% compared to the best unsupervised baselines.
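
For readers unfamiliar with the metric, the sketch below shows one common way to compute Expected Calibration Error with equal-width bins, together with a helper that averages per-token probabilities into a response-level confidence in the spirit of BaseCal-ReEval. The bin count, function names, and the assumption that per-token probabilities from the base LLM are already available are illustrative choices, not details taken from the paper.

```python
def mean_token_probability(token_probs):
    """Response-level confidence as the mean of per-token probabilities.

    `token_probs` is assumed to hold the base LLM's probability for each
    token of the response; how they are obtained is outside this sketch.
    """
    return sum(token_probs) / len(token_probs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: floats in [0, 1]; correct: 0/1 labels of the same length.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, label))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(label for _, label in members) / len(members)
        # Weight each bin's |confidence - accuracy| gap by its share of samples.
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece
```

An overconfident model shows up as bins whose average confidence exceeds their accuracy; a lower ECE means the reported confidences track empirical accuracy more closely.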

[59] Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage

Junhao Hu,Fangze Li,Mingtao Xu,Feifan Meng,Shiju Zhao,Tiancheng Hu,Ting Peng,Anmin Liu,Wenrui Huang,Chenxu Liu,Ziyue Hua,Tao Xie

Main category: cs.CL

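TL;DR: This paper studies the efficiency of sparse attention during the decode stage of LLM inference, finds that it can cause information loss that lengthens generated sequences (the "Less is Less" phenomenon), and proposes an early-stopping algorithm that cuts token consumption and significantly improves efficiency.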

Details

Motivation: Sparse attention aims to reduce time and memory complexity in the decode stage, but information loss can lead to longer generated sequences and a higher overall cost; this effect needs to be understood and mitigated. Method: The "Less is Less" phenomenon is established through theoretical analysis and experiments, and an early-stopping algorithm is designed that dynamically detects the balance point between information gain and loss during sparse decoding and terminates redundant generation. Result: The early-stopping algorithm reduces token usage by up to 90% on reasoning-intensive benchmarks with an accuracy drop of less than 2%. Conclusion: Sparse attention can backfire because of information loss; an early-stopping mechanism mitigates the problem and substantially improves inference efficiency while preserving performance. Abstract: Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce time and memory complexity in the decode stage, a line of work introduces sparse-attention algorithms. In this paper, we show, both empirically and theoretically, that sparse attention can paradoxically increase end-to-end complexity: information loss often induces significantly longer sequences, a phenomenon we term "Less is Less" (Lil). To mitigate the Lil problem, we propose an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding. Our early-stopping algorithm reduces token consumption by up to 90% with a marginal accuracy degradation of less than 2% across reasoning-intensive benchmarks.

[60] Temporal Graph Network: Hallucination Detection in Multi-Turn Conversation

Vidhi Rathore,Sambu Aneesh,Himanshu Singh

Main category: cs.CL

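TL;DR: Proposes a graph-based method that detects hallucinations in multi-turn conversations by building a temporal graph of the dialogue, using message passing and attention to improve detection performance and interpretability.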

Details

Motivation: Context shifts and contradictions in multi-turn conversations can lead AI systems to hallucinate, so effective detection methods are needed. Method: Each dialogue turn is a node encoded with a sentence transformer, and edges are created through shared entities or temporal order; message passing updates the node embeddings, and attention pooling followed by a classifier detects the presence and type of hallucination. Result: The method slightly outperforms existing methods at detecting dialogue-level hallucinations, and the attention mechanism helps explain its decisions. Conclusion: Graph-based modeling captures the structural information of a conversation, improves hallucination detection, and offers a degree of interpretability. Abstract: Hallucinations can be produced by conversational AI systems, particularly in multi-turn conversations where context changes and contradictions may eventually surface. By representing the entire conversation as a temporal graph, we present a novel graph-based method for detecting dialogue-level hallucinations. Our framework models each dialogue as a node, encoding it using a sentence transformer. We explore two different ways of connectivity: i) shared-entity edges, which connect turns that refer to the same entities; ii) temporal edges, which connect contiguous turns in the conversation. Message-passing is used to update the node embeddings, allowing flow of information between related nodes. The context-aware node embeddings are then combined using attention pooling into a single vector, which is then passed on to a classifier to determine the presence and type of hallucinations. We demonstrate that our method offers slightly improved performance over existing methods. Further, we show the attention mechanism can be used to justify the decision making process. The code and model weights are made available at: https://github.com/sambuaneesh/anlp-project.
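
The two edge types described above can be illustrated with a small graph-construction sketch. It assumes that the entities mentioned in each turn have already been extracted by some upstream step (the abstract does not say how), and the function name `build_edges` is hypothetical; embedding, message passing, and attention pooling are omitted.

```python
from itertools import combinations


def build_edges(turn_entities):
    """Build temporal and shared-entity edges over dialogue turns.

    turn_entities: list of sets, one per turn, holding the entities that
    turn mentions (entity extraction is assumed to happen elsewhere).
    Returns two lists of (i, j) index pairs with i < j.
    """
    n = len(turn_entities)
    # Temporal edges connect contiguous turns.
    temporal = [(i, i + 1) for i in range(n - 1)]
    # Shared-entity edges connect any two turns that mention a common entity.
    shared = [
        (i, j)
        for i, j in combinations(range(n), 2)
        if turn_entities[i] & turn_entities[j]
    ]
    return temporal, shared
```

For a three-turn conversation whose first and last turns both mention the same city, the temporal edges link turns 0-1 and 1-2, while a shared-entity edge links turns 0 and 2 directly, which lets information flow between distant but related turns.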

[61] Detecting Hallucinations in Retrieval-Augmented Generation via Semantic-level Internal Reasoning Graph

Jianpeng Hu,Yanzeng Li,Jialun Zhong,Wenfa Qi,Lei Zou

Main category: cs.CL

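TL;DR: This paper proposes a method based on a semantic-level internal reasoning graph for detecting faithfulness hallucinations in LLM-based retrieval-augmented generation systems; it extends layer-wise relevance propagation and builds a graph of semantic-level attribution vectors to improve detection performance.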

Details

Motivation: Existing methods either fail to capture the model's internal reasoning process or handle its features coarsely, which makes faithfulness hallucinations hard to detect. Method: The layer-wise relevance propagation algorithm is extended from the token level to the semantic level to build an internal reasoning graph based on attribution vectors, and a general framework based on a small pre-trained language model performs hallucination detection, with a threshold that dynamically adjusts the pass rate of correct samples. Result: On the RAGTruth and Dolly-15k datasets, the method achieves better overall performance than state-of-the-art baselines. Conclusion: The method represents the model's reasoning dependencies more faithfully and improves the detection of faithfulness hallucinations. Abstract: Retrieval-augmented generation (RAG) systems based on large language models (LLMs) have made significant progress. They can effectively reduce factuality hallucinations, but faithfulness hallucinations still exist. Previous methods for detecting faithfulness hallucinations either neglect to capture the models' internal reasoning processes or handle those features coarsely, making it difficult for discriminators to learn. This paper proposes a semantic-level internal reasoning graph-based method for detecting faithfulness hallucination. Specifically, we first extend the layer-wise relevance propagation algorithm from the token level to the semantic level, constructing an internal reasoning graph based on attribution vectors. This provides a more faithful semantic-level representation of dependency. Furthermore, we design a general framework based on a small pre-trained language model to utilize the dependencies in LLM's reasoning for training and hallucination detection, which can dynamically adjust the pass rate of correct samples through a threshold. Experimental results demonstrate that our method achieves better overall performance compared to state-of-the-art baselines on RAGTruth and Dolly-15k.

[62] Do LLMs Encode Functional Importance of Reasoning Tokens?

Janvijay Singh,Dilek Hakkani-Tür

Main category: cs.CL

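TL;DR: This paper proposes greedy pruning, which shortens the reasoning chains of large language models by iteratively deleting reasoning tokens while preserving model likelihood, reducing computational cost while maintaining accuracy and revealing that models internally encode a functional-importance structure over reasoning tokens.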

Details

Motivation: Existing compact-reasoning methods shorten reasoning chains but do not examine whether models internally encode token-level functional importance for answer generation; this paper addresses that gap. Method: Greedy pruning is a likelihood-preserving deletion procedure that iteratively removes the reasoning tokens whose removal least affects model likelihood, producing length-controlled reasoning chains; the predictability of the pruning order is then analyzed using attention scores. Result: At matched reasoning lengths, student models trained on pruned chains outperform a compression baseline supervised by a frontier model; attention scores can predict the greedy pruning ranks, indicating systematic pruning patterns inside the model. Conclusion: Large language models do encode token-level functional importance during reasoning, and greedy pruning both compresses reasoning chains and retains key reasoning information, offering a new view of internal model mechanisms. Abstract: Large language models solve complex tasks by generating long reasoning chains, achieving higher accuracy at the cost of increased computational cost and reduced ability to isolate functionally relevant reasoning. Prior work on compact reasoning shortens such chains through probabilistic sampling, heuristics, or supervision from frontier models, but offers limited insight into whether models internally encode token-level functional importance for answer generation. We address this gap diagnostically and propose greedy pruning, a likelihood-preserving deletion procedure that iteratively removes reasoning tokens whose removal minimally degrades model likelihood under a specified objective, yielding length-controlled reasoning chains. We evaluate pruned reasoning in a distillation framework and show that students trained on pruned chains outperform a frontier-model-supervised compression baseline at matched reasoning lengths. Finally, our analysis reveals systematic pruning patterns and shows that attention scores can predict greedy pruning ranks, further suggesting that models encode a nontrivial functional importance structure over reasoning tokens.
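
The iterative deletion loop described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the `score` callable stands in for the model likelihood under the paper's objective, which the abstract does not specify, and the names `greedy_prune` and `removal_order` are hypothetical.

```python
def greedy_prune(tokens, score, target_len):
    """Iteratively delete the token whose removal keeps `score` highest.

    tokens: list of reasoning tokens.
    score: callable mapping a token list to a number; a stand-in for the
           model likelihood of the answer given the (pruned) reasoning.
    target_len: stop once this many tokens remain.
    Returns (kept_tokens, removal_order); tokens removed earlier are the
    ones the scorer considers least important.
    """
    kept = list(tokens)
    removal_order = []
    while len(kept) > target_len:
        # Try deleting each position and keep the deletion that hurts least.
        best_index = max(
            range(len(kept)),
            key=lambda i: score(kept[:i] + kept[i + 1:]),
        )
        removal_order.append(kept.pop(best_index))
    return kept, removal_order
```

Each step evaluates every single-token deletion, so a full pruning run costs a number of scorer calls that grows quadratically with chain length; with a real model as the scorer, batching those evaluations would matter in practice. The removal order doubles as an importance ranking, which is the kind of signal the paper compares against attention scores.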

[63] Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models

Bocheng Chen,Han Zi,Xi Chen,Xitong Zhang,Kristen Johnson,Guangliang Liu

Main category: cs.CL

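TL;DR: This paper proposes an approach to enhancing the moral sensitivity of large language models (LLMs): two pragmatic inference methods grounded in inferential load enable LLMs to distinguish morally benign from hazardous inputs and to correct moral errors.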

Details

Motivation: Although many approaches try to align large language models with human moral values, making them morally sensitive remains very challenging. Method: Two pragmatic inference methods are proposed from a unified inferential-load perspective, enabling LLMs to diagnose morally relevant inputs and correct moral errors without modeling complex semantic surface forms. Result: Experiments show that the methods significantly improve the moral sensitivity of LLMs on several representative morality-relevant benchmarks. Conclusion: The work offers an effective and principled new path for improving moral sensitivity in large language models. Abstract: Moral sensitivity is fundamental to human moral competence, as it guides individuals in regulating everyday behavior. Although many approaches seek to align large language models (LLMs) with human moral values, making them morally sensitive has been extremely challenging. In this paper, we take a step toward answering the question: how can we enhance moral sensitivity in LLMs? Specifically, we propose two pragmatic inference methods that help LLMs diagnose morally benign and hazardous input and correct moral errors, thereby enhancing LLMs' moral sensitivity. A central strength of our pragmatic inference methods is their unified perspective: instead of modeling moral discourses across semantically diverse and complex surface forms, they offer a principled perspective for designing pragmatic inference procedures grounded in their inferential loads. Empirical evidence demonstrates that our pragmatic methods can enhance moral sensitivity in LLMs and achieve strong performance on representative morality-relevant benchmarks.

[64] Grad-ELLM: Gradient-based Explanations for Decoder-only LLMs

Xin Huang,Antoni B. Chan

Main category: cs.CL

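TL;DR: Proposes Grad-ELLM, a gradient-based input attribution method for decoder-only Transformer LLMs that combines attention-layer gradients with attention maps to produce more faithful attribution heatmaps, and introduces the improved evaluation metrics π-Soft-NC/NS; experiments on several tasks support its advantage.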

Details

Motivation: Existing input attribution methods are mostly model-agnostic and do not exploit the Transformer architecture, which limits the transparency and faithfulness of attributions for large language models; an attribution method designed for decoder-only LLMs is needed. Method: Grad-ELLM aggregates channel importance from the gradients of the output logit with respect to the attention layers and spatial importance from the attention maps, producing an attribution heatmap at each generation step without modifying the model architecture; π-Soft-NC and π-Soft-NS are proposed as improved faithfulness metrics. Result: On sentiment classification, question answering, and open-generation tasks with several models, Grad-ELLM consistently outperforms other attribution methods in faithfulness. Conclusion: Grad-ELLM is an efficient attribution method that requires no architectural changes and gives more accurate and faithful explanations of input contributions for decoder-only LLMs; the new metrics further improve the comparability and reliability of attribution results. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their black-box nature raises concerns about transparency and faithfulness. Input attribution methods aim to highlight each input token's contributions to the model's output, but existing approaches are typically model-agnostic, and do not focus on transformer-specific architectures, leading to limited faithfulness. To address this, we propose Grad-ELLM, a gradient-based attribution method for decoder-only transformer-based LLMs. By aggregating channel importance from gradients of the output logit with respect to attention layers and spatial importance from attention maps, Grad-ELLM generates heatmaps at each generation step without requiring architectural modifications. Additionally, we introduce two faithfulness metrics, $\pi$-Soft-NC and $\pi$-Soft-NS, which are modifications of Soft-NC/NS that provide fairer comparisons by controlling the amount of information kept when perturbing the text. We evaluate Grad-ELLM on sentiment classification, question answering, and open-generation tasks using different models. Experiment results show that Grad-ELLM consistently achieves higher faithfulness than other attribution methods.

[65] Who Laughs with Whom? Disentangling Influential Factors in Humor Preferences across User Clusters and LLMs

Soichiro Murakami,Hidetaka Kamigaito,Hiroya Takamura,Manabu Okumura

Main category: cs.CL

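TL;DR: This study clusters user voting logs and uses Bradley-Terry-Luce models to estimate the humor preferences of different groups in Oogiri, a Japanese creative response game, and finds that LLMs can be prompted to imitate the preferences of specific user clusters.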

Details

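Motivation: Humor preferences differ greatly across individuals and cultures, which makes evaluating humor with large language models difficult; the heterogeneity of user preferences therefore needs to be modeled. Method: Users are clustered from their voting logs, cluster-specific weights over preference factors are estimated with Bradley-Terry-Luce models, LLM preference judgments are elicited by prompting the model to pick the funnier response, and persona prompting is used to steer LLM preferences. Result: Distinct user preference clusters are identified; LLM preference judgments resemble those of certain real user clusters and can be steered toward a target cluster with persona prompting. Conclusion: With suitable prompting, LLMs can simulate the humor preferences of specific groups, suggesting that LLMs that account for user diversity could evaluate humor across cultures and individuals more accurately. Abstract: Humor preferences vary widely across individuals and cultures, complicating the evaluation of humor using large language models (LLMs). In this study, we model heterogeneity in humor preferences in Oogiri, a Japanese creative response game, by clustering users with voting logs and estimating cluster-specific weights over interpretable preference factors using Bradley-Terry-Luce models. We elicit preference judgments from LLMs by prompting them to select the funnier response and find that user clusters exhibit distinct preference patterns and that the LLM results can resemble those of particular clusters. Finally, we demonstrate that, by persona prompting, LLM preferences can be directed toward a specific cluster. The scripts for data collection and analysis will be released to support reproducibility.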
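
To make the Bradley-Terry-Luce step concrete, the sketch below fits one weight vector over preference factors from pairwise outcomes, using the standard parameterization in which the probability that response i beats response j is the logistic function of the weighted feature difference. The feature encoding, the plain gradient-ascent optimizer, and all names are illustrative assumptions; the paper's actual factors and fitting procedure are not given in the abstract. Fitting it separately on each user cluster's votes would give cluster-specific weights.

```python
import math


def win_probability(weights, features_i, features_j):
    """P(i preferred over j) = sigmoid(w . (x_i - x_j))."""
    z = sum(w * (a - b) for w, a, b in zip(weights, features_i, features_j))
    return 1.0 / (1.0 + math.exp(-z))


def fit_weights(pairs, n_features, learning_rate=0.5, epochs=200):
    """Maximum-likelihood fit by batch gradient ascent.

    pairs: list of (winner_features, loser_features) tuples, where each
    feature list scores a response on the (hypothetical) preference factors.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        gradient = [0.0] * n_features
        for winner, loser in pairs:
            p = win_probability(weights, winner, loser)
            for k in range(n_features):
                # Derivative of log P(winner beats loser) with respect to w_k.
                gradient[k] += (1.0 - p) * (winner[k] - loser[k])
        weights = [
            w + learning_rate * g / len(pairs) for w, g in zip(weights, gradient)
        ]
    return weights
```

A large positive weight means the cluster tends to vote for responses that score high on that factor. On perfectly separable toy data the weights keep growing with more epochs, so a real analysis would typically add regularization.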

[66] Discovering and Causally Validating Emotion-Sensitive Neurons in Large Audio-Language Models

Xiutian Zhao,Björn Schuller,Berrak Sisman

Main category: cs.CL

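TL;DR: This study is the first to reveal emotion-sensitive neurons (ESNs) in large audio-language models and uses intervention experiments to demonstrate their causal role in recognizing specific emotions.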

Details

Motivation: A mechanistic understanding of how large audio-language models encode emotion internally is lacking. Method: On Qwen2.5-Omni, Kimi-Audio, and Audio Flamingo 3, the study compares frequency-, entropy-, magnitude-, and contrast-based neuron selection methods and performs causal analysis with inference-time interventions such as ablation and gain amplification. Result: Ablating emotion-specific neurons disproportionately impairs recognition of the corresponding emotion, whereas gain amplification steers predictions toward the target emotion; the effect is consistent across datasets and models and scales with intervention strength, and ESNs show non-uniform layer-wise clustering with partial cross-dataset transfer. Conclusion: The work gives a causal, neuron-level account of emotion decisions in large audio-language models and shows that targeted neuron interventions can be used to control affective behavior. Abstract: Emotion is a central dimension of spoken communication, yet we still lack a mechanistic account of how modern large audio-language models (LALMs) encode it internally. We present the first neuron-level interpretability study of emotion-sensitive neurons (ESNs) in LALMs and provide causal evidence that such units exist in Qwen2.5-Omni, Kimi-Audio, and Audio Flamingo 3. Across these three widely used open-source models, we compare frequency-, entropy-, magnitude-, and contrast-based neuron selectors on multiple emotion recognition benchmarks. Using inference-time interventions, we reveal a consistent emotion-specific signature: ablating neurons selected for a given emotion disproportionately degrades recognition of that emotion while largely preserving other classes, whereas gain-based amplification steers predictions toward the target emotion. These effects arise with modest identification data and scale systematically with intervention strength. We further observe that ESNs exhibit non-uniform layer-wise clustering with partial cross-dataset transfer. Taken together, our results offer a causal, neuron-level account of emotion decisions in LALMs and highlight targeted neuron interventions as an actionable handle for controllable affective behaviors.

[67] ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation

Peiran Li,Jan Fillies,Adrian Paschke

Main category: cs.CL

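TL;DR: ToxiGAN is a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models to improve the robustness of toxicity classification.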

Details

Motivation: Existing augmentation methods for toxic language data are limited by scarce supervision and distributional skew and struggle to generate class-specific toxic text. Method: ToxiGAN adopts a two-step directional training strategy and uses an LLM to dynamically select neutral texts as semantic anchors; adversarial generation pushes toxic texts away from these neutral exemplars, strengthening class-contrastive signals. Result: Experiments on four hate speech benchmarks show that ToxiGAN outperforms traditional and LLM-based augmentation methods in both macro-F1 and hate-F1. Conclusion: By introducing semantic anchors and directional training, ToxiGAN mitigates mode collapse and semantic drift and significantly improves classifier robustness. Abstract: Augmenting toxic language data in a controllable and class-specific manner is crucial for improving robustness in toxicity classification, yet remains challenging due to limited supervision and distributional skew. We propose ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs). To address common issues in GAN-based augmentation such as mode collapse and semantic drift, ToxiGAN introduces a two-step directional training strategy and leverages LLM-generated neutral texts as semantic ballast. Unlike prior work that treats LLMs as static generators, our approach dynamically selects neutral exemplars to provide balanced guidance. Toxic samples are explicitly optimized to diverge from these exemplars, reinforcing class-specific contrastive signals. Experiments on four hate speech benchmarks show that ToxiGAN achieves the strongest average performance in both macro-F1 and hate-F1, consistently outperforming traditional and LLM-based augmentation methods. Ablation and sensitivity analyses further confirm the benefits of semantic ballast and directional training in enhancing classifier robustness.

[68] The Anatomy of Conversational Scams: A Topic-Based Red Teaming Analysis of Multi-Turn Interactions in LLMs

Xiangzhe Yuan,Zhenhao Zhang,Haoming Tang,Siying Hu

Main category: cs.CL

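TL;DR: Using an LLM-to-LLM multi-turn dialogue simulation framework, this study systematically analyzes multi-turn scam risks of large language models in English and Chinese scenarios, finds that safety guardrail activation and role instability are the main causes of failure, and argues that multi-turn interactional safety is an important new dimension of LLM behavior.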

Details

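Motivation: Existing single-turn safety evaluations cannot capture the new scam risks that large language models may create in multi-turn dialogue, so the safety hazards of multi-turn interaction require systematic study. Method: A controlled LLM-to-LLM simulation framework is used to evaluate eight state-of-the-art models in English and Chinese across multi-turn scam scenarios, with qualitative annotation of attacker strategies, defensive responses, and failure modes. Result: Scam dialogues follow recurrent escalation patterns, and defenses mainly rely on verification and delay mechanisms; most interactional failures stem from safety guardrail activation and role instability. Conclusion: Multi-turn interactional safety is a critical and distinct dimension of LLM safety that deserves specific attention in evaluation and deployment. Abstract: As LLMs gain persuasive agentic capabilities through extended dialogues, they introduce novel risks in multi-turn conversational scams that single-turn safety evaluations fail to capture. We systematically study these risks using a controlled LLM-to-LLM simulation framework across multi-turn scam scenarios. Evaluating eight state-of-the-art models in English and Chinese, we analyze dialogue outcomes and qualitatively annotate attacker strategies, defensive responses, and failure modes. Results reveal that scam interactions follow recurrent escalation patterns, while defenses employ verification and delay mechanisms. Furthermore, interactional failures frequently stem from safety guardrail activation and role instability. Our findings highlight multi-turn interactional safety as a critical, distinct dimension of LLM behavior.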

[69] Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

Aashish Dhawan,Christopher Driggers-Ellis,Christan Grant,Daisy Zhe Wang

Main category: cs.CL

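TL;DR: This paper studies synthetic data augmentation for improving neural machine translation of indigenous languages of the Americas (such as Guarani and Quechua). Synthetic sentence pairs are generated with a high-capacity multilingual translation model, a multilingual mBART model is fine-tuned on the augmented data, and language-specific preprocessing is applied, improving translation quality in low-resource settings.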

Details

Motivation: Many indigenous languages lack the parallel corpora needed to train effective neural machine translation systems, so ways of mitigating data scarcity are needed. Method: Curated parallel data are augmented with synthetic sentence pairs generated by a high-capacity multilingual translation model; a multilingual mBART model is fine-tuned on the result, and language-specific preprocessing (orthographic normalization and noise-aware filtering) is applied. Result: On Guarani-Spanish and Quechua-Spanish translation, synthetic data augmentation yields consistent chrF++ improvements; diagnostic experiments on Aymara show the limits of generic preprocessing for highly agglutinative languages. Conclusion: Synthetic data augmentation can improve translation for low-resource indigenous languages, but preprocessing should be designed around each language's characteristics to avoid introducing bias or noise. Abstract: Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from synthetic data augmentation, while diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.

[70] Limited Linguistic Diversity in Embodied AI Datasets

Selma Wanna,Agnes Luhtaru,Jonathan Salfity,Ryan Barron,Juston Moore,Cynthia Matuszek,Mitch Pryor

Main category: cs.CL

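TL;DR: This paper presents a systematic audit of the instruction language in datasets used by current Vision-Language-Action (VLA) models, showing that these datasets have limited linguistic diversity, heavy repetition, and little structural variation.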

Details

Motivation: Although language plays a critical role in VLA models, the linguistic characteristics of existing training and evaluation datasets are poorly documented, so models may be confined to a narrow language distribution. Method: Several widely used VLA datasets are audited, quantifying instruction language along multiple dimensions: lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Result: Many datasets rely on highly repetitive, template-like instructions with limited structural variation, yielding a narrow distribution of instruction forms. Conclusion: The study calls for more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies to broaden the language coverage of VLA models. Abstract: Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic dataset audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions, including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies that broaden language coverage.

[71] Self-Verification is All You Need To Pass The Japanese Bar Examination

Andrew Shin

Main category: cs.CL

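TL;DR: This paper presents a self-verification model that, to the authors' knowledge, is the first LLM to pass the Japanese bar examination under its authentic format and scoring scheme, underscoring the importance of format-faithful supervision.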

Details

Motivation: Despite rapid progress, LLMs still struggle to perform reliably on highly professional, structured examinations, particularly the Japanese bar examination, which demands advanced legal reasoning and strict answer formats. Existing approaches have not been systematically evaluated under the original exam format, so their effectiveness is uncertain. Method: A new dataset is constructed that faithfully replicates the format and scoring scale of the Japanese bar examination, a self-verification model is trained on it, and the model is compared extensively with alternatives such as multi-agent inference and decomposition-based supervision. Result: The model exceeds the official passing score on the actual exam scale, making it, to the authors' knowledge, the first LLM to pass the Japanese bar examination without altering the question structure or scoring rules; the alternative methods fail to reach comparable performance. Conclusion: Format-faithful supervision and consistency verification are essential for high-stakes professional reasoning tasks, and carefully designed single-model approaches can outperform more complex systems. Abstract: Despite rapid advances in large language models (LLMs), achieving reliable performance on highly professional and structured examinations remains a significant challenge. The Japanese bar examination is a particularly demanding benchmark, requiring not only advanced legal reasoning but also strict adherence to complex answer formats that involve joint evaluation of multiple propositions. While recent studies have reported improvements by decomposing such questions into simpler true-false judgments, these approaches have not been systematically evaluated under the original exam format and scoring scheme, leaving open the question of whether they truly capture exam-level competence. In this paper, we present a self-verification model trained on a newly constructed dataset that faithfully replicates the authentic format and evaluation scale of the exam. Our model is able to exceed the official passing score when evaluated on the actual exam scale, marking the first demonstration, to our knowledge, of an LLM passing the Japanese bar examination without altering its original question structure or scoring rules. We further conduct extensive comparisons with alternative strategies, including multi-agent inference and decomposition-based supervision, and find that these methods fail to achieve comparable performance. Our results highlight the importance of format-faithful supervision and consistency verification, and suggest that carefully designed single-model approaches can outperform more complex systems in high-stakes professional reasoning tasks. Our dataset and codes are publicly available.

[72] Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective

Beiduo Chen,Tiancheng Hu,Caiqi Zhang,Robert Litschko,Anna Korhonen,Barbara Plank

Main category: cs.CL

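TL;DR: Long Chain-of-Thought (CoT) reasoning improves LLMs' distributional alignment, but final accuracy is dictated by CoT content while distributional ranking depends mainly on model priors, suggesting that CoT acts as a decision-maker rather than a distribution calibrator.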

Details

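Motivation: To examine whether the effects of long CoT and of model priors are decoupled when reasoning-tuned LLMs handle Human Label Variation tasks, which require capturing probabilistic ambiguity. Method: Systematic disentanglement experiments on distribution-based tasks, using Cross-CoT experiments to separate the effect of the reasoning text from the model's intrinsic priors. Result: A "decoupled mechanism" is observed: CoT content dominates final accuracy (99% variance contribution), whereas distributional ranking is governed by model priors (over 80%). Step-wise analysis shows that CoT's influence on accuracy grows monotonically during reasoning, while distributional structure is largely determined by model priors. Conclusion: Long CoT is an effective decision mechanism for single-answer tasks, but it cannot finely calibrate distributions in tasks that require modeling probabilistic ambiguity, where its effect is bounded by the model's intrinsic priors. Abstract: Reasoning-tuned LLMs utilizing long Chain-of-Thought (CoT) excel at single-answer tasks, yet their ability to model Human Label Variation, which requires capturing probabilistic ambiguity rather than resolving it, remains underexplored. We investigate this through systematic disentanglement experiments on distribution-based tasks, employing Cross-CoT experiments to isolate the effect of reasoning text from intrinsic model priors. We observe a distinct "decoupled mechanism": while CoT improves distributional alignment, final accuracy is dictated by CoT content (99% variance contribution), whereas distributional ranking is governed by model priors (over 80%). Step-wise analysis further shows that while CoT's influence on accuracy grows monotonically during the reasoning process, distributional structure is largely determined by LLM's intrinsic priors. These findings suggest that long CoT serves as a decisive LLM decision-maker for the top option but fails to function as a granular distribution calibrator for ambiguous tasks.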

[73] WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning

Yu Xinmiao,Zhang Liwen,Feng Xiaocheng,Jiang Yong,Qin Bing,Xie Pengjun,Zhou Jingren

Main category: cs.CL

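TL;DR: This paper proposes Anchor-GRPO, a two-stage reinforcement learning framework that improves the planning ability of LLM-based agents on long-horizon web reasoning tasks. By decoupling planning from execution and anchoring on the first step, it outperforms baseline methods on multiple benchmarks.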

Details

Motivation: Existing RL methods plan inefficiently on long-horizon web reasoning tasks because they ignore the "plan anchor" phenomenon, in which the first reasoning step has a disproportionate effect on downstream behavior. Method: Anchor-GRPO has two stages. Stage 1 optimizes first-step planning using fine-grained rubrics derived from self-play experiences and human calibration; Stage 2 aligns execution with the initial plan through sparse rewards. Result: On four benchmarks (BrowseComp, BrowseComp-Zh, GAIA, and XBench-DeepSearch), models from 3B to 30B outperform the GRPO and First-step GRPO baselines; WebAnchor-30B reaches 46.0% pass@1 on BrowseComp and 76.4% on GAIA. Conclusion: Anchor-GRPO addresses the planning bottleneck in long-horizon tasks, improves task success and tool efficiency, and scales well with model size and context length. Abstract: Large Language Model (LLM)-based agents have shown strong capabilities in web information seeking, with reinforcement learning (RL) becoming a key optimization paradigm. However, planning remains a bottleneck, as existing methods struggle with long-horizon strategies. Our analysis reveals a critical phenomenon, plan anchor, where the first reasoning step disproportionately impacts downstream behavior in long-horizon web reasoning tasks. Current RL algorithms fail to account for this by uniformly distributing rewards across the trajectory. To address this, we propose Anchor-GRPO, a two-stage RL framework that decouples planning and execution. In Stage 1, the agent optimizes its first-step planning using fine-grained rubrics derived from self-play experiences and human calibration. In Stage 2, execution is aligned with the initial plan through sparse rewards, ensuring stable and efficient tool usage. We evaluate Anchor-GRPO on four benchmarks: BrowseComp, BrowseComp-Zh, GAIA, and XBench-DeepSearch. Across models from 3B to 30B, Anchor-GRPO outperforms baseline GRPO and First-step GRPO, improving task success and tool efficiency. Notably, WebAnchor-30B achieves 46.0% pass@1 on BrowseComp and 76.4% on GAIA. Anchor-GRPO also demonstrates strong scalability, getting higher accuracy as model size and context length increase.

[74] Can Embedding Similarity Predict Cross-Lingual Transfer? A Systematic Study on African Languages

Tewodros Kederalah Idris,Prasenjit Mitra,Roald Eiselen

Main category: cs.CL

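TL;DR: This study systematically evaluates how well five embedding similarity metrics predict cross-lingual transfer. Cosine gap and retrieval-based metrics (P@1, CSLS) reliably predict transfer success, whereas CKA has negligible predictive power. Pooling across models produces Simpson's Paradox, so the analysis must be model-specific.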

Details

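Motivation: Practitioners lack reliable methods for selecting source languages, especially for cross-lingual transfer to low-resource African languages. Method: Across 816 transfer experiments covering three NLP tasks, three multilingual models, and 12 African languages, the study measures the correlation between five embedding similarity metrics and transfer performance. Result: Cosine gap and retrieval-based metrics such as P@1 and CSLS predict transfer success ($\rho = 0.4$ to $0.6$), while CKA has negligible predictive power ($\rho \approx 0.1$); correlation signs reverse when pooling across models (Simpson's Paradox); embedding metrics achieve predictive power comparable to the URIEL linguistic typology database. Conclusion: Embedding similarity metrics can guide source language selection, but they must be validated per model and cannot be generalized across models; the paper offers concrete guidance for practitioners. Abstract: Cross-lingual transfer is essential for building NLP systems for low-resource African languages, but practitioners lack reliable methods for selecting source languages. We systematically evaluate five embedding similarity metrics across 816 transfer experiments spanning three NLP tasks, three African-centric multilingual models, and 12 languages from four language families. We find that cosine gap and retrieval-based metrics (P@1, CSLS) reliably predict transfer success ($\rho = 0.4$ to $0.6$), while CKA shows negligible predictive power ($\rho \approx 0.1$). Critically, correlation signs reverse when pooling across models (Simpson's Paradox), so practitioners must validate per-model. Embedding metrics achieve comparable predictive power to URIEL linguistic typology. Our results provide concrete guidance for source language selection and highlight the importance of model-specific analysis.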
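
The sign reversal under pooling can be illustrated with a small sketch. The numbers below are made up for illustration and are not taken from the paper; `average_ranks` and `spearman` are hypothetical helper names. Within each model, higher similarity goes with better transfer, yet the pooled rank correlation is negative because the two models sit at different baselines.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical (embedding similarity, transfer score) pairs for two models.
runs = {
    "model_a": ([0.1, 0.2, 0.3], [0.7, 0.8, 0.9]),
    "model_b": ([0.6, 0.7, 0.8], [0.1, 0.2, 0.3]),
}

per_model = {name: spearman(sim, score) for name, (sim, score) in runs.items()}
pooled_sim = [s for sim, _ in runs.values() for s in sim]
pooled_score = [t for _, score in runs.values() for t in score]
pooled = spearman(pooled_sim, pooled_score)

print(per_model)  # positive within each model
print(pooled)     # negative once the models are pooled
```

This is why the abstract recommends validating a metric per model before using it for source language selection.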

[75] Maximizing Local Entropy Where It Matters: Prefix-Aware Localized LLM Unlearning

Naixin Zhai,Pengyang Shao,Binbin Zheng,Fei Shen,Long Bai,Xun Yang

Main category: cs.CL

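TL;DR: PALU is a machine unlearning framework for LLMs that performs prefix-aware localized unlearning, maximizing entropy within the critical subspace to forget targeted knowledge while preserving model utility.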

Details

Motivation: Existing unlearning methods treat all tokens globally, causing unnecessary utility degradation and redundant optimization, and they do not intervene precisely in the generation path of sensitive content. Method: PALU uses a local entropy maximization objective that suppresses only the sensitive prefix and flattens only the top-k logits, localizing unlearning along both the temporal and vocabulary dimensions. Result: Experiments show that PALU outperforms state-of-the-art baselines in both forgetting efficacy and utility preservation. Conclusion: Through localized optimization, PALU severs the generation path of sensitive knowledge while minimizing the impact on overall model performance. Abstract: Machine unlearning aims to forget sensitive knowledge from Large Language Models (LLMs) while maintaining general utility. However, existing approaches typically treat all tokens in a response indiscriminately and enforce uncertainty over the entire vocabulary. This global treatment results in unnecessary utility degradation and extends optimization to content-agnostic regions. To address these limitations, we propose PALU (Prefix-Aware Localized Unlearning), a framework driven by a local entropy maximization objective across both temporal and vocabulary dimensions. PALU reveals that (i) suppressing the sensitive prefix alone is sufficient to sever the causal generation link, and (ii) flattening only the top-$k$ logits is adequate to maximize uncertainty in the critical subspace. These findings allow PALU to avoid redundant optimization across the full vocabulary and parameter space while minimizing collateral damage to general model performance. Extensive experiments validate that PALU achieves superior forgetting efficacy and utility preservation compared to state-of-the-art baselines.
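
To make "flattening only the top-$k$ logits" concrete, here is a minimal sketch of the quantity involved: the entropy of the softmax renormalized over the $k$ largest logits. The function name `topk_entropy` and the example logits are assumptions for illustration; the paper's actual loss and its prefix selection are not described in the abstract and are not reproduced here.

```python
import math

def topk_entropy(logits, k):
    """Shannon entropy (in nats) of the softmax restricted to the k largest logits."""
    top = sorted(logits, reverse=True)[:k]
    m = max(top)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits at one sensitive token position.
peaked = [9.0, 1.0, 0.5, 0.2, -3.0, -4.0]      # model is confident in one token
flattened = [2.0, 2.0, 2.0, 0.2, -3.0, -4.0]   # top-3 logits made equal

print(topk_entropy(peaked, 3))     # close to 0
print(topk_entropy(flattened, 3))  # equals log(3), the maximum for k = 3
```

An objective that raises this value only has to touch the top-$k$ entries, which is the locality argument the abstract makes; the tail logits are identical in both examples.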

[76] MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

Shengtao Zhang,Jiaqian Wang,Ruiwen Zhou,Junwei Liao,Yuchen Feng,Weinan Zhang,Ying Wen,Zhiyu Li,Feiyu Xiong,Yutao Qi,Bo Tang,Muning Wen

Main category: cs.CL

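TL;DR: MemRL lets agents self-evolve through non-parametric reinforcement learning on episodic memory. It separates the stable reasoning of a frozen LLM from an evolving memory, uses two-phase retrieval that combines semantic relevance with Q-value-based utility selection, and keeps refining its strategies from environmental feedback, significantly outperforming existing methods on several benchmarks.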

Details

Motivation: LLMs reason well but struggle to synthesize new skills from past experience as humans do; fine-tuning is computationally expensive and prone to catastrophic forgetting, and memory-based methods mostly rely on passive semantic matching that often retrieves noise. Method: MemRL separates the stable reasoning of a frozen LLM from a plastic memory system. A two-phase retrieval mechanism first filters candidates by semantic relevance and then selects among them by learned Q-values (utility); memory utilities are updated continuously by trial and error from environmental feedback. Result: On HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench, MemRL significantly outperforms state-of-the-art baselines; analysis experiments indicate that it balances stability and plasticity and improves continuously at runtime without weight updates. Conclusion: MemRL enables agents to self-evolve on episodic memory through non-parametric reinforcement learning, reconciling stability and plasticity and offering an efficient, scalable route to continual learning for large models. Abstract: The hallmark of human intelligence is the ability to master new skills through Constructive Episodic Simulation: retrieving past experiences to synthesize solutions for novel tasks. While Large Language Models possess strong reasoning capabilities, they struggle to emulate this self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise. To address these challenges, we propose MemRL, a framework that enables agents to self-evolve via non-parametric reinforcement learning on episodic memory. MemRL explicitly separates the stable reasoning of a frozen LLM from the plastic, evolving memory. Unlike traditional methods, MemRL employs a Two-Phase Retrieval mechanism that filters candidates by semantic relevance and then selects them based on learned Q-values (utility). These utilities are continuously refined via environmental feedback in a trial-and-error manner, allowing the agent to distinguish high-value strategies from similar noise. Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MemRL significantly outperforms state-of-the-art baselines. Our analysis experiments confirm that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates.
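
The two-phase retrieval idea can be sketched as follows. Everything here is illustrative: the `Memory` class, the toy 2-D embeddings, and in particular the running-average utility update `Q <- Q + lr * (reward - Q)` are assumptions, since the abstract does not specify MemRL's actual update rule.

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    embedding: list
    q_value: float = 0.0  # learned utility of reusing this memory

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_phase_retrieve(query, bank, shortlist=2, top=1):
    # Phase 1: keep only the most semantically relevant candidates.
    candidates = sorted(bank, key=lambda m: cosine(query, m.embedding),
                        reverse=True)[:shortlist]
    # Phase 2: among those, prefer the memories with the highest utility.
    return sorted(candidates, key=lambda m: m.q_value, reverse=True)[:top]

def update_utility(memory, reward, lr=0.5):
    # Assumed update rule: move the utility toward the observed reward.
    memory.q_value += lr * (reward - memory.q_value)

bank = [
    Memory("a", [1.0, 0.1], q_value=0.2),
    Memory("b", [0.9, 0.2], q_value=0.8),
    Memory("c", [0.0, 1.0], q_value=5.0),  # high utility but irrelevant to the query
]
query = [1.0, 0.0]

first = two_phase_retrieve(query, bank)[0].text
update_utility(bank[0], reward=2.0)        # memory "a" led to a good outcome
second = two_phase_retrieve(query, bank)[0].text
print(first, second)
```

Memory "c" is never selected despite its high utility because it fails the relevance filter, while feedback on "a" changes the choice without touching any model weights.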

[77] X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework

Mohammad Zia Ur Rehman,Sai Kartheek Reddy Kasu,Shashivardhan Reddy Koppula,Sai Rithwik Reddy Chirra,Shwetank Shekhar Singh,Nagendra Kumar

Main category: cs.CL

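TL;DR: X-MuTeST is an explainable multilingual hate speech detection framework that combines large language models with human-annotated rationales to improve detection performance and explainability for Indic languages (Hindi and Telugu) and English.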

Details

Motivation: Hate speech detection still falls short on both accuracy and explainability, and under-resourced Indic languages in particular are understudied. Method: X-MuTeST combines high-level semantic reasoning from LLMs with attention-enhancing techniques. It trains with human-annotated word-level rationales, scores n-gram importance from differences in prediction probabilities, and takes the union of LLM explanations and X-MuTeST explanations as the final explanation. Result: The framework improves classification performance and explainability on Hindi, Telugu, and English data; explainability is evaluated with Plausibility metrics (Token-F1, IOU-F1) and Faithfulness metrics (Comprehensiveness, Sufficiency); the released dataset has token-level rationale annotations for 6,004 Hindi, 4,492 Telugu, and 6,334 English samples. Conclusion: Combining human rationales with explainability methods improves both the effectiveness and the transparency of multilingual hate speech detection, especially for low-resource languages. Abstract: Hate speech detection on social media faces challenges in both accuracy and explainability, especially for underexplored Indic languages. We propose a novel explainability-guided training framework, X-MuTeST (eXplainable Multilingual haTe Speech deTection), for hate speech detection that combines high-level semantic reasoning from large language models (LLMs) with traditional attention-enhancing techniques. We extend this research to Hindi and Telugu alongside English by providing benchmark human-annotated rationales for each word to justify the assigned class label. The X-MuTeST explainability method computes the difference between the prediction probabilities of the original text and those of unigrams, bigrams, and trigrams. Final explanations are computed as the union between LLM explanations and X-MuTeST explanations. We show that leveraging human rationales during training enhances both classification performance and explainability. Moreover, combining human rationales with our explainability method to refine the model attention yields further improvements. We evaluate explainability using Plausibility metrics such as Token-F1 and IOU-F1 and Faithfulness metrics such as Comprehensiveness and Sufficiency. By focusing on under-resourced languages, our work advances hate speech detection across diverse linguistic contexts. Our dataset includes token-level rationale annotations for 6,004 Hindi, 4,492 Telugu, and 6,334 English samples. Data and code are available on https://github.com/ziarehman30/X-MuTeST
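
As a small illustration of the explanation-union step and the Token-F1 plausibility metric, the sketch below scores a predicted rationale against a human one. The token indices are hypothetical, and Token-F1 is computed in its common per-example form (harmonic mean of token precision and recall); how the paper averages it across examples is not stated in the abstract.

```python
def token_f1(predicted, gold):
    """F1 between predicted and human rationale token positions for one example."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical token positions flagged as rationale for one sentence.
llm_rationale = {1, 2}        # tokens highlighted by the LLM
ngram_rationale = {2, 3}      # tokens highlighted by the n-gram probability method
human_rationale = {2, 3, 4}   # tokens annotated by humans

final_explanation = llm_rationale | ngram_rationale  # union, as in the abstract
score = token_f1(final_explanation, human_rationale)
print(sorted(final_explanation), round(score, 3))
```

In this toy case the union recovers two of the three human-marked tokens and adds one extra, giving precision and recall of 2/3 each.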

[78] DIP: Dynamic In-Context Planner For Diffusion Language Models

Yang Li,Han Meng,Chenan Wang,Haipeng Chen

Main category: cs.CL

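TL;DR: DIP is a dynamic context-optimization method that substantially speeds up inference in diffusion language models while maintaining generation quality.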

Details

Motivation: Because of bidirectional attention, diffusion language models become computationally expensive as context grows, so efficient context management is needed. Method: Exploiting the fact that the diffusion generation paradigm allows the context to be adjusted dynamically, the Dynamic In-Context Planner (DIP) selects and inserts in-context examples during generation. Result: DIP achieves up to 12.9x inference speedup over standard inference and 1.17x over KV cache-enhanced inference while maintaining generation quality. Conclusion: Dynamic context planning reduces the inference cost of DLMs and offers a new direction for efficient natural language processing. Abstract: Diffusion language models (DLMs) have shown strong potential for general natural language tasks with in-context examples. However, due to the bidirectional attention mechanism, DLMs incur substantial computational cost as context length increases. This work addresses this issue with a key discovery: unlike the sequential generation in autoregressive language models (ARLMs), the diffusion generation paradigm in DLMs allows efficient dynamic adjustment of the context during generation. Building on this insight, we propose Dynamic In-Context Planner (DIP), a context-optimization method that dynamically selects and inserts in-context examples during generation, rather than providing all examples in the prompt upfront. Results show DIP maintains generation quality while achieving up to 12.9$\times$ inference speedup over standard inference and 1.17$\times$ over KV cache-enhanced inference.

[79] UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward

Yile Liu,Yixian Liu,Zongwei Li,Yufei Huang,Xinhua Feng,Zhichao Hu,Jinglu Hu,Jianfeng Yan,Fengzong Lian,Yuhong Liu

Main category: cs.CL

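TL;DR: UltraLogic is a framework that uses a code-based solving methodology to decouple a problem's logical core from its natural language expression, automating the production of high-quality, multi-type, difficulty-graded reasoning data. It also proposes the Bipolar Float Reward (BFR) mechanism to improve training efficiency and logical accuracy on complex reasoning tasks.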

Details

Motivation: LLMs remain bottlenecked on complex reasoning that requires multi-step logic, planning, and verification, and existing reinforcement learning methods lack large-scale, high-quality, difficulty-calibrated data for general reasoning. Method: UltraLogic separates the logical core from natural language through code-based solving, covering hundreds of task types with automated calibration across ten difficulty levels; the Bipolar Float Reward (BFR) mechanism uses graded penalties to distinguish perfect responses from those with logical flaws. Result: Experiments indicate that task diversity is the primary driver of reasoning gains, and that BFR combined with a difficulty matching strategy significantly improves training efficiency and guides models toward global logical optima. Conclusion: With highly diverse tasks and a fine-grained reward mechanism, UltraLogic advances LLMs on complex general-purpose reasoning and offers a scalable data and training solution for building logically consistent models. Abstract: While Large Language Models (LLMs) have demonstrated significant potential in natural language processing, complex general-purpose reasoning requiring multi-step logic, planning, and verification remains a critical bottleneck. Although Reinforcement Learning with Verifiable Rewards (RLVR) has succeeded in specific domains, the field lacks large-scale, high-quality, and difficulty-calibrated data for general reasoning. To address this, we propose UltraLogic, a framework that decouples the logical core of a problem from its natural language expression through a Code-based Solving methodology to automate high-quality data production. The framework comprises hundreds of unique task types and an automated calibration pipeline across ten difficulty levels. Furthermore, to mitigate binary reward sparsity and the Non-negative Reward Trap, we introduce the Bipolar Float Reward (BFR) mechanism, utilizing graded penalties to effectively distinguish perfect responses from those with logical flaws. Our experiments demonstrate that task diversity is the primary driver for reasoning enhancement, and that BFR, combined with a difficulty matching strategy, significantly improves training efficiency, guiding models toward global logical optima.
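
One possible reading of a "bipolar" reward with graded penalties is sketched below, contrasted with a binary reward. The mapping `score - 1.0` for imperfect answers is an assumption chosen for illustration; the abstract does not give BFR's actual formula, and `bipolar_float_reward` is a hypothetical name.

```python
def binary_reward(score):
    """Sparse baseline: only a perfect answer earns any reward."""
    return 1.0 if score == 1.0 else 0.0

def bipolar_float_reward(score):
    """Assumed BFR-style mapping from a correctness score in [0, 1].

    A perfect answer gets +1. Any flawed answer gets a negative value that
    grows in magnitude with the size of the flaw, so a partially correct
    answer is never rewarded as if it were acceptable.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return 1.0 if score == 1.0 else score - 1.0

for s in (1.0, 0.75, 0.5, 0.0):
    print(s, binary_reward(s), bipolar_float_reward(s))
```

Under the binary reward, scores of 0.75 and 0.0 are indistinguishable; under this sketch they map to -0.25 and -1.0, which gives the optimizer a gradient among imperfect answers while keeping every flawed response below zero.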

[80] MalruleLib: Large-Scale Executable Misconception Reasoning with Step Traces for Modeling Student Thinking in Mathematics

Xinghe Chen,Naiming Liu,Shashank Sonkar

Main category: cs.CL

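TL;DR: This paper introduces MalruleLib, a learning-science-grounded framework that turns students' systematic mathematical errors (misconceptions) into executable procedures and generates student solution traces consistent with those erroneous rules.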

Details

Motivation: Student mistakes in mathematics are often systematic: a learner repeatedly applies a coherent but wrong procedure. Diagnosing and correcting such misconceptions requires a tool that can model and predict erroneous student behavior. Method: Drawing on 67 learning-science and mathematics education sources, MalruleLib formalizes 101 malrules over 498 parameterized problem templates and generates paired dual-path step traces for correct and malrule-consistent reasoning. The core student-modeling problem is formalized as Malrule Reasoning Accuracy (MRA): infer a misconception from one worked mistake and predict the student's answer under cross-template rephrasing. Result: Across nine language models (4B-120B parameters), accuracy drops from 66% on direct problem solving to 40% on cross-template misconception prediction; providing student step traces improves prediction by 3-15%, but cross-template degradation of 10-21% remains. MalruleLib can generate over one million instances, enabling scalable supervision and controlled evaluation. Conclusion: MalruleLib provides infrastructure for educational AI that models students' erroneous reasoning across contexts, enabling diagnosis and feedback that target the underlying misconception. Abstract: Student mistakes in mathematics are often systematic: a learner applies a coherent but wrong procedure and repeats it across contexts. We introduce MalruleLib, a learning-science-grounded framework that translates documented misconceptions into executable procedures, drawing on 67 learning-science and mathematics education sources, and generates step-by-step traces of malrule-consistent student work. We formalize a core student-modeling problem as Malrule Reasoning Accuracy (MRA): infer a misconception from one worked mistake and predict the student's next answer under cross-template rephrasing. Across nine language models (4B-120B), accuracy drops from 66% on direct problem solving to 40% on cross-template misconception prediction. MalruleLib encodes 101 malrules over 498 parameterized problem templates and produces paired dual-path traces for both correct reasoning and malrule-consistent student reasoning. Because malrules are executable and templates are parameterizable, MalruleLib can generate over one million instances, enabling scalable supervision and controlled evaluation. Using MalruleLib, we observe cross-template degradations of 10-21%, while providing student step traces improves prediction by 3-15%. We release MalruleLib as infrastructure for educational AI that models student procedures across contexts, enabling diagnosis and feedback that targets the underlying misconception.
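
To show what an "executable malrule" with dual-path traces might look like, here is a minimal sketch for one well-known misconception: adding fractions by adding numerators and denominators separately. The function names and trace format are hypothetical and are not MalruleLib's API.

```python
from fractions import Fraction

def add_correct(a, b):
    """Correct path: rewrite over a common denominator, then add numerators."""
    common = a.denominator * b.denominator
    numerator = a.numerator * b.denominator + b.numerator * a.denominator
    steps = [
        f"common denominator: {common}",
        f"sum of scaled numerators: {numerator}",
    ]
    return Fraction(numerator, common), steps

def add_malrule(a, b):
    """Malrule path: add numerators and denominators independently."""
    numerator = a.numerator + b.numerator
    denominator = a.denominator + b.denominator
    steps = [
        f"add numerators: {numerator}",
        f"add denominators: {denominator}",
    ]
    return Fraction(numerator, denominator), steps

# One worked mistake, then a prediction on a differently parameterized item.
seen = (Fraction(1, 2), Fraction(1, 3))
unseen = (Fraction(1, 4), Fraction(2, 3))

print(add_correct(*seen)[0], add_malrule(*seen)[0])
print(add_correct(*unseen)[0], add_malrule(*unseen)[0])
```

Because the malrule is a function of the problem parameters, the same procedure predicts the student's wrong answer on any new instance, which is the property the MRA task tests.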

[81] Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models

Kartik Bose,Abhinandan Kumar,Raghuraman Soundararajan,Priya Mudgil,Samonee Ralmilay,Niharika Dutta,Manphool Singhal,Arun Kumar,Saugata Sen,Anurima Patra,Priya Ghosh,Abanti Das,Amit Gupta,Ashish Verma,Dipin Sudhakaran,Ekta Dhamija,Himangi Unde,Ishan Kumar,Krithika Rangarajan,Prerna Garg,Rachel Sequeira,Sudhin Shylendran,Taruna Yadav,Tej Pal,Pankaj Gupta

Main category: cs.CL

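TL;DR: This study builds RXL-RADSet, a radiologist-verified synthetic multi-RADS dataset, and evaluates a range of small language models (SLMs) against GPT-5.2 on RADS assignment. Large SLMs can approach proprietary-model performance under guided prompting, but gaps remain for higher-complexity RADS schemes.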

Details

Motivation: RADS frameworks standardize radiology risk communication, but automated RADS assignment from narrative reports is hampered by guideline complexity, output-format constraints, and limited benchmarking across frameworks, so a reliable multi-RADS benchmark is needed to evaluate language models of different sizes. Method: The authors build RXL-RADSet, 1,600 synthetic radiology reports spanning 10 RADS and multiple imaging modalities; reports are generated by LLMs from scenario plans and simulated radiologist styles and pass two-stage radiologist verification. They then evaluate 41 quantized SLMs (0.135-32B parameters) and GPT-5.2 under a fixed guided prompt, with validity and accuracy as primary endpoints and a secondary comparison of guided versus zero-shot prompting. Result: Under guided prompting, GPT-5.2 reaches 99.8% validity and 81.1% accuracy; pooled SLMs reach 96.8% validity and 61.1% accuracy; the top 20-32B SLMs reach about 99% validity and mid-to-high 70% accuracy. Performance rises with model size (with an inflection between <1B and >=10B) and falls as RADS complexity increases, mainly because of classification difficulty rather than invalid outputs. Guided prompting improves on zero-shot prompting in validity (99.2% vs 96.7%) and accuracy (78.5% vs 69.6%). Conclusion: RXL-RADSet is a radiologist-verified multi-RADS benchmark; large SLMs (20-32B) can approach proprietary-model performance under guided prompting, but gaps remain for higher-complexity RADS schemes. Abstract: Background: Reporting and Data Systems (RADS) standardize radiology risk communication but automated RADS assignment from narrative reports is challenging because of guideline complexity, output-format constraints, and limited benchmarking across RADS frameworks and model sizes. Purpose: To create RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark, and compare validity and accuracy of open-weight small language models (SLMs) with a proprietary model for RADS assignment. Materials and Methods: RXL-RADSet contains 1,600 synthetic radiology reports across 10 RADS (BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS) and multiple modalities. Reports were generated by LLMs using scenario plans and simulated radiologist styles and underwent two-stage radiologist verification. We evaluated 41 quantized SLMs (12 families, 0.135-32B parameters) and GPT-5.2 under a fixed guided prompt. Primary endpoints were validity and accuracy; a secondary analysis compared guided versus zero-shot prompting. Results: Under guided prompting GPT-5.2 achieved 99.8% validity and 81.1% accuracy (1,600 predictions). Pooled SLMs (65,600 predictions) achieved 96.8% validity and 61.1% accuracy; top SLMs in the 20-32B range reached ~99% validity and mid-to-high 70% accuracy. Performance scaled with model size (inflection between <1B and >=10B) and declined with RADS complexity primarily due to classification difficulty rather than invalid outputs. Guided prompting improved validity (99.2% vs 96.7%) and accuracy (78.5% vs 69.6%) compared with zero-shot. Conclusion: RXL-RADSet provides a radiologist-verified multi-RADS benchmark; large SLMs (20-32B) can approach proprietary-model performance under guided prompting, but gaps remain for higher-complexity schemes.
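
The two primary endpoints can be sketched as below. This assumes that validity means "the output is one of the allowed category labels" and that accuracy is computed over all predictions, so an invalid output counts as wrong; the abstract does not spell out either definition, and the labels shown are hypothetical.

```python
def validity_and_accuracy(predictions, references, allowed):
    """Fraction of outputs that are allowed labels, and fraction that are correct."""
    n = len(predictions)
    valid = sum(p in allowed for p in predictions)
    correct = sum(p == r for p, r in zip(predictions, references))
    return valid / n, correct / n

# Hypothetical outputs for four reports under a five-category scheme.
allowed = {"1", "2", "3", "4", "5"}
predictions = ["2", "4", "n/a", "5"]   # "n/a" is an invalid output
references = ["2", "3", "4", "5"]

validity, accuracy = validity_and_accuracy(predictions, references, allowed)
print(validity, accuracy)

# The pooled SLM count in the abstract is consistent with 41 models x 1,600 reports.
pooled_predictions = 41 * 1600
print(pooled_predictions)
```

Separating the two numbers matters for the abstract's claim that accuracy falls with RADS complexity mainly through misclassification: validity can stay high while accuracy drops.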

[82] STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning

Juntong Ni,Shiyu Wang,Ming Jin,Qi He,Wei Jin

Main category: cs.CL

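TL;DR: This paper introduces the ST-Bench benchmark and the STReasoner model to strengthen spatio-temporal reasoning in time series. The S-GRPO algorithm reinforces spatially grounded logic, yielding substantial accuracy gains across tasks at very low cost.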

Details

Motivation: Most existing work prioritizes predictive accuracy over reasoning, leaving spatio-temporal reasoning underdeveloped for high-stakes decision systems. Method: ST-Bench is generated with a network SDE-based multi-agent data synthesis pipeline; STReasoner integrates time series, graph structure, and text for explicit reasoning; S-GRPO is a reinforcement learning algorithm that specifically rewards performance gains attributable to spatial information. Result: STReasoner achieves average accuracy gains between 17% and 135% on ST-Bench at only 0.004X the cost of proprietary models, and generalizes robustly to real-world data. Conclusion: The new benchmark and model, together with S-GRPO, advance spatio-temporal reasoning in time series while balancing performance, cost, and interpretability. Abstract: Spatio-temporal reasoning in time series involves the explicit synthesis of temporal dynamics, spatial dependencies, and textual context. This capability is vital for high-stakes decision-making in systems such as traffic networks, power grids, and disease propagation. However, the field remains underdeveloped because most existing works prioritize predictive accuracy over reasoning. To address the gap, we introduce ST-Bench, a benchmark consisting of four core tasks, including etiological reasoning, entity identification, correlation reasoning, and in-context forecasting, developed via a network SDE-based multi-agent data synthesis pipeline. We then propose STReasoner, which empowers LLM to integrate time series, graph structure, and text for explicit reasoning. To promote spatially grounded logic, we introduce S-GRPO, a reinforcement learning algorithm that rewards performance gains specifically attributable to spatial information. Experiments show that STReasoner achieves average accuracy gains between 17% and 135% at only 0.004X the cost of proprietary models and generalizes robustly to real-world data.

[83] Automated Semantic Rules Detection (ASRD) for Emergent Communication Interpretation

Bastien Vanderplaetse,Xavier Siebert,Stéphane Dupont

Main category: cs.CL

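TL;DR: This paper proposes an Automated Semantic Rules Detection (ASRD) algorithm for analyzing the messages produced in emergent communication between agents playing the Lewis Game, improving the interpretability of emergent languages.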

Details

Motivation: Few studies address the interpretability of emergent languages, which makes the semantic content of inter-agent communication hard to understand. Method: The ASRD algorithm extracts patterns from messages exchanged by agents trained with two different datasets on the Lewis Game and relates those patterns to specific attributes of the input data. Result: ASRD extracts communication patterns tied to input attributes and considerably simplifies the analysis of emergent languages. Conclusion: ASRD is a useful tool for understanding and interpreting the communication protocols that emerge in multi-agent systems. Abstract: The field of emergent communication within multi-agent systems examines how autonomous agents can independently develop communication strategies, without explicit programming, and adapt them to varied environments. However, few studies have focused on the interpretability of emergent languages. The research presented in this paper proposes an Automated Semantic Rules Detection (ASRD) algorithm, which extracts relevant patterns in messages exchanged by agents trained with two different datasets on the Lewis Game, which is often studied in the context of emergent communication. ASRD helps with the interpretation of the emergent communication by relating the extracted patterns to specific attributes of the input data, thereby considerably simplifying subsequent analysis.

cs.CV [Back]

[84] Self-Supervised Masked Autoencoders with Dense-Unet for Coronary Calcium Removal in limited CT Data

Mo Chen

Main category: cs.CV

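TL;DR: Dense-MAE is a self-supervised learning framework that pre-trains a Dense-Unet with 3D patch masking to improve coronary calcium artifact removal and stenosis assessment.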

Details

Motivation: Coronary calcification produces blooming artifacts in CTA that hamper the diagnosis of lumen stenosis, and deep convolutional networks usually need large labeled datasets, which are scarce in medicine. Method: Inspired by Masked Autoencoders for 3D point clouds, Dense-MAE randomly masks 3D patches of the vessel lumen and pre-trains a Dense-Unet to reconstruct the missing geometry, learning features without human annotation. Result: On clinical CTA datasets, initializing the calcium removal network with Dense-MAE pre-trained weights significantly improves inpainting accuracy and stenosis estimation compared with training from scratch, particularly in few-shot scenarios. Conclusion: Self-supervised pre-training with Dense-MAE captures high-level features of arterial topology and eases the shortage of annotated medical images, suggesting a new approach for vascular imaging with low dose or severe artifacts. Abstract: Coronary calcification creates blooming artifacts in Computed Tomography Angiography (CTA), severely hampering the diagnosis of lumen stenosis. While Deep Convolutional Neural Networks (DCNNs) like Dense-Unet have shown promise in removing these artifacts via inpainting, they often require large labeled datasets which are scarce in the medical domain. Inspired by recent advancements in Masked Autoencoders (MAE) for 3D point clouds, we propose Dense-MAE, a novel self-supervised learning framework for volumetric medical data. We introduce a pre-training strategy that randomly masks 3D patches of the vessel lumen and trains the Dense-Unet to reconstruct the missing geometry. This forces the encoder to learn high-level latent features of arterial topology without human annotation. Experimental results on clinical CTA datasets demonstrate that initializing the Calcium Removal network with our MAE-based weights significantly improves inpainting accuracy and stenosis estimation compared to training from scratch, specifically in few-shot scenarios.
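
The masking half of such a pre-training step can be sketched in plain Python. This sketch masks patches uniformly over a toy volume, whereas the paper masks patches of the vessel lumen; the patch size, mask ratio, fill value, and function names are all assumptions, and the reconstruction network is omitted.

```python
import random

def sample_masked_patches(volume_shape, patch_shape, mask_ratio, seed=0):
    """Choose which cells of the patch grid to hide."""
    grid = [v // p for v, p in zip(volume_shape, patch_shape)]
    cells = [(z, y, x)
             for z in range(grid[0])
             for y in range(grid[1])
             for x in range(grid[2])]
    count = round(mask_ratio * len(cells))
    return set(random.Random(seed).sample(cells, count))

def apply_mask(volume, patch_shape, masked, fill=0.0):
    """Return a copy of the volume with the chosen patches overwritten by `fill`."""
    pz, py, px = patch_shape
    out = [[row[:] for row in plane] for plane in volume]
    for z, y, x in masked:
        for i in range(z * pz, (z + 1) * pz):
            for j in range(y * py, (y + 1) * py):
                for k in range(x * px, (x + 1) * px):
                    out[i][j][k] = fill
    return out

# Toy 4x4x4 volume of ones, split into eight 2x2x2 patches, half of them masked.
volume = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
masked = sample_masked_patches((4, 4, 4), (2, 2, 2), mask_ratio=0.5)
corrupted = apply_mask(volume, (2, 2, 2), masked)

hidden = sum(v == 0.0 for plane in corrupted for row in plane for v in row)
print(len(masked), hidden)
```

In a real pipeline the network would receive `corrupted` as input and be trained to reproduce `volume` on the masked voxels, which is the reconstruction objective the abstract describes.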

[85] MIAR: Modality Interaction and Alignment Representation Fusion for Multimodal Emotion

Jichao Zhu,Jun Yu

Main category: cs.CV

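TL;DR: MIAR is a new multimodal emotion recognition method that uses feature interaction and contrastive learning to align modalities and extract cross-modal information, outperforming existing methods on the CMU-MOSI and CMU-MOSEI datasets.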

Details

Motivation: Existing methods do not adequately address distributional differences among modalities or their unequal contributions, and they generalize poorly, which limits multimodal emotion recognition performance. Method: The MIAR network uses feature interaction to generate feature tokens that serve as global representations able to extract information from other modalities, and aligns the modalities with contrastive learning and normalization strategies. Result: Experiments on the CMU-MOSI and CMU-MOSEI benchmarks show that MIAR outperforms state-of-the-art multimodal emotion recognition methods. Conclusion: MIAR improves multimodal emotion recognition, with modality interaction and alignment strengthening the model's representations and generalization. Abstract: Multimodal Emotion Recognition (MER) aims to perceive human emotions through three modes: language, vision, and audio. Previous methods primarily focused on modal fusion without adequately addressing significant distributional differences among modalities or considering their varying contributions to the task. They also lacked robust generalization capabilities across diverse textual model features, thus limiting performance in multimodal scenarios. Therefore, we propose a novel approach called Modality Interaction and Alignment Representation (MIAR). This network integrates contextual features across different modalities, using feature interaction to generate feature tokens that act as global representations of each modality as it extracts information from other modalities. These four tokens represent global representations of how each modality extracts information from others. MIAR aligns different modalities using contrastive learning and normalization strategies. We conduct experiments on two benchmarks, the CMU-MOSI and CMU-MOSEI datasets; experimental results demonstrate that MIAR outperforms state-of-the-art MER methods.

[86] Multimodal Sentiment Analysis based on Multi-channel and Symmetric Mutual Promotion Feature Fusion

Wangyuan Zhu,Jun Yu

Main category: cs.CV

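TL;DR: This paper proposes a symmetric mutual promotion (SMP) feature fusion method for multimodal sentiment analysis. Dual-channel feature extraction and attention mechanisms strengthen intra-modal and inter-modal information exchange and improve sentiment recognition.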

Details

Motivation: Features extracted from single-modality data are limited, and most studies attend only to the consistency of inter-modal features while neglecting their differences, so feature fusion is inadequate. Method: Dual-channel features are extracted in the visual and auditory modalities to enrich intra-modal representation; the SMP fusion method combines symmetric cross-modal attention with self-attention to promote information exchange between modalities, and intra-modal features are integrated with the inter-modal fused features. Result: Experiments on two benchmark datasets show that the proposed method achieves better performance on multimodal sentiment analysis. Conclusion: The method improves feature representation and fusion for multimodal sentiment analysis, exploiting inter-modal complementarity while accounting for feature differences. Abstract: Multimodal sentiment analysis is a key technology in the fields of human-computer interaction and affective computing. Accurately recognizing human emotional states is crucial for facilitating smooth communication between humans and machines. Despite some progress in multimodal sentiment analysis research, numerous challenges remain. The first challenge is the limited and insufficiently rich features extracted from single modality data. Secondly, most studies focus only on the consistency of inter-modal feature information, neglecting the differences between features, resulting in inadequate feature information fusion. In this paper, we first extract multi-channel features to obtain more comprehensive feature information. We employ dual-channel features in both the visual and auditory modalities to enhance intra-modal feature representation. Secondly, we propose a symmetric mutual promotion (SMP) inter-modal feature fusion method. This method combines symmetric cross-modal attention mechanisms and self-attention mechanisms, where the cross-modal attention mechanism captures useful information from other modalities, and the self-attention mechanism models contextual information. This approach promotes the exchange of useful information between modalities, thereby strengthening inter-modal interactions. Furthermore, we integrate intra-modal features and inter-modal fused features, fully leveraging the complementarity of inter-modal feature information while considering feature information differences. Experiments conducted on two benchmark datasets demonstrate the effectiveness and superiority of our proposed method.

Wenting Lu,Didi Zhu,Tao Shen,Donglin Zhu,Ayong Ye,Chao Wu

Main category: cs.CV

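TL;DR: Proposes the CoCoT framework, which improves multimodal reasoning through dynamic multi-region grounding and relation-aware reasoning; the authors build a dataset of about 74K samples and report significant gains on multiple benchmarks.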

Details

Motivation: In cross-modal scenarios, existing Chain-of-Thought methods rely on coarse-grained image regions and suffer from semantic fragmentation between reasoning steps. Method: Proposes the CoCoT framework, built on two core mechanisms, dynamic multi-region grounding and relation-aware reasoning, and constructs the CoCoT-70K dataset. Result: Average accuracy improves by 15.4% on LLaVA-1.5 and by 4.0% on Qwen2-VL, with strong results on six challenging benchmarks. Conclusion: CoCoT addresses key problems in multimodal reasoning and significantly improves complex visual reasoning performance. Abstract: Multi-modal reasoning requires the seamless integration of visual and linguistic cues, yet existing Chain-of-Thought methods suffer from two critical limitations in cross-modal scenarios: (1) over-reliance on single coarse-grained image regions, and (2) semantic fragmentation between successive reasoning steps. To address these issues, we propose the CoCoT (Collaborative Cross-modal Thought) framework, built upon two key innovations: a) Dynamic Multi-Region Grounding to adaptively detect the most relevant image regions based on the question, and b) Relation-Aware Reasoning to enable multi-region collaboration by iteratively aligning visual cues to form a coherent and logical chain of thought. Through this approach, we construct the CoCoT-70K dataset, comprising 74,691 high-quality samples with multi-region annotations and structured reasoning chains. Extensive experiments demonstrate that CoCoT significantly enhances complex visual reasoning, achieving an average accuracy improvement of 15.4% on LLaVA-1.5 and 4.0% on Qwen2-VL across six challenging benchmarks. The data and code are available at: https://github.com/deer-echo/CoCoT.

[88] NitroGen: An Open Foundation Model for Generalist Gaming Agents

Loïc Magne,Anas Awadalla,Guanzhi Wang,Yinzhen Xu,Joshua Belofsky,Fengyuan Hu,Joohwan Kim,Ludwig Schmidt,Georgia Gkioxari,Jan Kautz,Yisong Yue,Yejin Choi,Yuke Zhu,Linxi "Jim" Fan

Main category: cs.CV

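TL;DR: NitroGen is a vision-action foundation model trained on 40,000 hours of gameplay video spanning more than 1,000 games. It achieves cross-game generalization through large-scale behavior cloning, shows strong competence across diverse game tasks, and is released together with its dataset, evaluation suite, and model weights.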

Details

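Motivation: Building generalist gaming agents that generalize across many games and carry out complex tasks requires overcoming the limits of existing approaches in data scale, evaluation environments, and model unification. Method: 1) Build an internet-scale video-action dataset by automatically extracting player actions from publicly available gameplay videos; 2) design a multi-game benchmark environment that measures cross-game generalization; 3) train a unified vision-action foundation model with large-scale behavior cloning. Result: NitroGen performs well on diverse tasks, including combat in 3D action games, precise control in 2D platformers, and exploration of procedurally generated worlds; on unseen games it achieves up to a 52% relative improvement in task success rate over models trained from scratch. Conclusion: NitroGen demonstrates the potential of large-scale joint vision-action modeling for building generalist gaming agents, and its released resources should help advance research on embodied agents. Abstract: We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.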

[89] TAP-ViTs: Task-Adaptive Pruning for On-Device Deployment of Vision Transformers

Zhibo Wang,Zuoyuan Zhang,Xiaoyi Pang,Qile Zhang,Xuanyi Hao,Shuguo Zhuo,Peng Sun

Main category: cs.CV

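TL;DR: This paper proposes TAP-ViTs, a task-adaptive pruning framework that generates device-specific pruned Vision Transformer (ViT) models for resource-constrained mobile devices without accessing raw local data. It builds a representative metric dataset from Gaussian Mixture Model (GMM) parameters and applies a dual-granularity importance evaluation strategy to achieve fine-grained, task-aware pruning.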

Details

Motivation: Existing ViT pruning methods either ignore device heterogeneity or rely on fine-tuning with local data, which is hard to apply in privacy-sensitive, resource-constrained mobile computing; an effective way to perform task-customized pruning while preserving privacy has been missing. Method: The TAP-ViTs framework works in three steps: 1) each device fits a lightweight GMM to its private data distribution and uploads only the GMM parameters; 2) the cloud uses these parameters to select distribution-consistent samples from public data, building a task-representative metric dataset for each device; 3) on this proxy dataset, a dual-granularity importance evaluation pruning strategy jointly measures composite neuron importance and adaptive layer importance, giving fine-grained pruning tailored to each device's computational budget. Result: Experiments across multiple ViT backbones and datasets show that TAP-ViTs consistently outperforms state-of-the-art pruning methods at comparable compression ratios, improving the task-specific performance of the pruned models. Conclusion: TAP-ViTs achieves privacy-preserving, task-adaptive ViT pruning without access to local data, balancing device heterogeneity and computational efficiency, and offers a practical route to efficient vision model deployment on edge devices. Abstract: Vision Transformers (ViTs) have demonstrated strong performance across a wide range of vision tasks, yet their substantial computational and memory demands hinder efficient deployment on resource-constrained mobile and edge devices. Pruning has emerged as a promising direction for reducing ViT complexity. However, existing approaches either (i) produce a single pruned model shared across all devices, ignoring device heterogeneity, or (ii) rely on fine-tuning with device-local data, which is often infeasible due to limited on-device resources and strict privacy constraints. As a result, current methods fall short of enabling task-customized ViT pruning in privacy-preserving mobile computing settings. This paper introduces TAP-ViTs, a novel task-adaptive pruning framework that generates device-specific pruned ViT models without requiring access to any raw local data. Specifically, to infer device-level task characteristics under privacy constraints, we propose a Gaussian Mixture Model (GMM)-based metric dataset construction mechanism. Each device fits a lightweight GMM to approximate its private data distribution and uploads only the GMM parameters. Using these parameters, the cloud selects distribution-consistent samples from public data to construct a task-representative metric dataset for each device. Based on this proxy dataset, we further develop a dual-granularity importance evaluation-based pruning strategy that jointly measures composite neuron importance and adaptive layer importance, enabling fine-grained, task-aware pruning tailored to each device's computational budget. Extensive experiments across multiple ViT backbones and datasets demonstrate that TAP-ViTs consistently outperforms state-of-the-art pruning methods under comparable compression ratios.
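
The cloud-side selection step can be illustrated with a short sketch. It assumes a diagonal-covariance GMM and assumes that "distribution-consistent" means "highest likelihood under the uploaded GMM"; the abstract does not state the selection rule, so both choices, along with the names `gmm_log_likelihood` and `select_metric_dataset`, are hypothetical.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    # Log-density of feature vector x under a diagonal-covariance GMM.
    comps = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, m, v in zip(x, mu, var):
            log_p += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        comps.append(log_p)
    top = max(comps)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in comps))

def select_metric_dataset(public_features, gmm, k):
    # Return indices of the k public samples that best match the device GMM.
    # gmm is the uploaded parameter tuple (weights, means, variances).
    ranked = sorted(
        range(len(public_features)),
        key=lambda i: gmm_log_likelihood(public_features[i], *gmm),
        reverse=True,
    )
    return ranked[:k]
```

Only the GMM parameters cross the network in this sketch; the device's raw samples never leave the device, which is the privacy property the abstract emphasizes.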

[90] Understanding Pure Textual Reasoning for Blind Image Quality Assessment

Yuan Li,Shin'ya Nishida

Main category: cs.CV

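TL;DR: This paper studies the role of textual information in Blind Image Quality Assessment (BIQA) from an information-flow perspective, comparing three paradigms for learning the image-text-score relationship: Chain-of-Thought, Self-Consistency, and Autoencoder. Experiments show that existing models degrade significantly when only text is used; Self-Consistency markedly narrows the gap between image- and text-based predictions, Chain-of-Thought brings limited improvement, and Autoencoder is less effective but points to a direction for optimization.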

Details

Motivation: It is unclear how textual information affects quality prediction in BIQA and to what extent text can represent score-related image content. Method: From an information-flow perspective, the authors design and compare three paradigms for learning the image-text-score relationship: Chain-of-Thought, Self-Consistency, and Autoencoder. Result: Existing BIQA models degrade significantly when predicting from text alone; Chain-of-Thought brings limited improvement; Self-Consistency narrows the PLCC/SRCC difference between image- and text-conditioned predictions to 0.02/0.03; the Autoencoder paradigm is less effective but still informative. Conclusion: Text cannot yet fully replace images for quality prediction; Self-Consistency is an effective way to strengthen textual reasoning and suggests how to improve textual reasoning in BIQA and high-level vision tasks. Abstract: Textual reasoning has recently been widely adopted in Blind Image Quality Assessment (BIQA). However, it remains unclear how textual information contributes to quality prediction and to what extent text can represent the score-related image contents. This work addresses these questions from an information-flow perspective by comparing existing BIQA models with three paradigms designed to learn the image-text-score relationship: Chain-of-Thought, Self-Consistency, and Autoencoder. Our experiments show that the score prediction performance of the existing model significantly drops when only textual information is used for prediction. Whereas the Chain-of-Thought paradigm introduces little improvement in BIQA performance, the Self-Consistency paradigm significantly reduces the gap between image- and text-conditioned predictions, narrowing the PLCC/SRCC difference to 0.02/0.03. The Autoencoder-like paradigm is less effective in closing the image-text gap, yet it reveals a direction for further optimization. These findings provide insights into how to improve the textual reasoning for BIQA and high-level vision tasks.
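
For readers unfamiliar with the two reported metrics, the sketch below computes PLCC (Pearson linear correlation) and SRCC (Spearman rank correlation) and the image-versus-text gap in each. The helper `modality_gap` and its argument names are hypothetical, and the abstract does not say whether BIQA-style score remapping is applied before PLCC, so none is applied here.

```python
import math

def pearson(x, y):
    # Pearson linear correlation coefficient (PLCC).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    # 1-based ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman rank correlation (SRCC) is Pearson applied to the ranks.
    return pearson(ranks(x), ranks(y))

def modality_gap(mos, image_pred, text_pred):
    # Difference in PLCC and SRCC between image- and text-conditioned scores.
    plcc_gap = pearson(mos, image_pred) - pearson(mos, text_pred)
    srcc_gap = spearman(mos, image_pred) - spearman(mos, text_pred)
    return plcc_gap, srcc_gap
```

A gap close to zero in both metrics would mean the text-conditioned predictions track human opinion scores about as well as the image-conditioned ones.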

[91] Evaluating the Diagnostic Classification Ability of Multimodal Large Language Models: Insights from the Osteoarthritis Initiative

Li Wang,Xi Chen,XiangWen Deng,HuaHui Yi,ZeKun Jiang,Kang Li,Jian Li

Main category: cs.CV

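TL;DR: This study evaluates multimodal large language models (MLLMs) for classifying knee osteoarthritis (OA) radiographs and finds that the vision encoder alone outperforms the full MLLM, and that data quality and balance matter more than data scale.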

Details

Motivation: Although MLLMs perform well on medical visual question answering and report generation, their performance on disease-specific classification is unclear, and knee OA in particular, despite its high prevalence, lacks relevant benchmarks. Method: Systematic ablation experiments assess how different vision encoders, connectors, and large language model (LLM) components affect diagnostic accuracy under several training strategies, and compare datasets of different sizes and class balance. Result: A trained vision encoder used alone achieves higher classification accuracy than the full MLLM pipeline; fine-tuning the LLM brings no significant gain; LoRA fine-tuning on a small, class-balanced dataset (500 images) beats training on a larger but class-imbalanced dataset (5,778 images). Conclusion: For domain-specific medical image classification, LLMs are better suited as interpreters and report generators than as primary classifiers; the MLLM architecture has limited applicability to diagnostic classification that demands high certainty, and the authors recommend prioritizing vision encoder optimization and careful dataset curation. Abstract: Multimodal large language models (MLLMs) show promising performance on medical visual question answering (VQA) and report generation, but these generation and explanation abilities do not reliably transfer to disease-specific classification. We evaluated MLLM architectures on knee osteoarthritis (OA) radiograph classification, which remains underrepresented in existing medical MLLM benchmarks, even though knee OA affects an estimated 300 to 400 million people worldwide. Through systematic ablation studies manipulating the vision encoder, the connector, and the large language model (LLM) across diverse training strategies, we measured each component's contribution to diagnostic accuracy. In our classification task, a trained vision encoder alone could outperform full MLLM pipelines in classification accuracy, and fine-tuning the LLM provided no meaningful improvement over prompt-based guidance. Moreover, LoRA fine-tuning on a small, class-balanced dataset (500 images) gave better results than training on a much larger but class-imbalanced set (5,778 images), indicating that data balance and quality can matter more than raw scale for this task. These findings suggest that for domain-specific medical classification, LLMs are more effective as interpreters and report generators than as primary classifiers. Therefore, the MLLM architecture appears less suitable for medical image diagnostic classification tasks that demand high certainty. We recommend prioritizing vision encoder optimization and careful dataset curation when developing clinically applicable systems.

[92] A Spatio-Temporal Deep Learning Approach For High-Resolution Gridded Monsoon Prediction

Parashjyoti Borah,Sanghamitra Sarkar,Ranjan Phukan

Main category: cs.CV

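TL;DR: Proposes a deep learning framework that recasts monsoon prediction as a spatio-temporal computer vision task, using a CNN to learn, from decades of reanalysis data, the complex mapping between pre-monsoon atmospheric and oceanic fields and high-resolution rainfall patterns.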

Details

Motivation: Traditional monsoon forecasting lacks spatial detail and cannot meet the needs of regional resource management, so a new method is needed that delivers high-resolution monthly and seasonal rainfall forecasts. Method: Multi-variable pre-monsoon atmospheric and oceanic fields are treated as a sequence of multi-channel images, forming a video-like input tensor; a CNN-based architecture trained on 85 years of ERA5 reanalysis data and IMD rainfall data learns the mapping from the pre-monsoon period (January-May) to the high-resolution rainfall pattern of the following monsoon season. Result: The framework produces separate forecasts for each monsoon month (June-September) and for the total seasonal average, with high spatial resolution and forecast skill. Conclusion: The deep learning approach offers clear advantages for fine-grained monsoon rainfall prediction, supports both intra-seasonal and seasonal outlooks, and can help decision-making in agriculture, water resources, and emergency management. Abstract: The Indian Summer Monsoon (ISM) is a critical climate phenomenon, fundamentally impacting the agriculture, economy, and water security of over a billion people. Traditional long-range forecasting, whether statistical or dynamical, has predominantly focused on predicting a single, spatially-averaged seasonal value, lacking the spatial detail essential for regional-level resource management. To address this gap, we introduce a novel deep learning framework that reframes gridded monsoon prediction as a spatio-temporal computer vision task. We treat multi-variable, pre-monsoon atmospheric and oceanic fields as a sequence of multi-channel images, effectively creating a video-like input tensor. Using 85 years of ERA5 reanalysis data for predictors and IMD rainfall data for targets, we employ a Convolutional Neural Network (CNN)-based architecture to learn the complex mapping from the five-month pre-monsoon period (January-May) to a high-resolution gridded rainfall pattern for the subsequent monsoon season. Our framework successfully produces distinct forecasts for each of the four monsoon months (June-September) as well as the total seasonal average, demonstrating its utility for both intra-seasonal and seasonal outlooks.

[93] Don't Mind the Gaps: Implicit Neural Representations for Resolution-Agnostic Retinal OCT Analysis

Bennet Kahrs,Julia Andresen,Fenja Falta,Monty Santarossa,Heinz Handels,Timo Kepp

Main category: cs.CV

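TL;DR: This paper proposes two frameworks based on implicit neural representations (INRs) for dense 3D retinal analysis of optical coherence tomography (OCT) images, addressing the inconsistencies that conventional 2D methods produce on anisotropic data and enabling general, resolution-agnostic modeling.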

Details

Motivation: Routine OCT imaging uses large slice spacing, producing highly anisotropic images and sparse scans; existing 2D methods tend to give inconsistent results across neighboring B-scans, and conventional CNNs are tied to the resolution of their training data, so they generalize poorly to data from other imaging protocols. Method: The coordinate-based input of INRs is used to build resolution-agnostic continuous representations: 1) inter-B-scan interpolation that incorporates information from en-face modalities; 2) a general 3D retinal atlas built with population-based training. Result: The proposed methods improve the 3D shape representation of retinal structures, support dense reconstruction and analysis of OCT data with large B-scan spacing, and generalize well to unseen data. Conclusion: The INR-based frameworks provide a resolution-agnostic, generalizable 3D analysis approach for sparsely scanned retinal OCT and advance volumetric evaluation in clinical pathology assessment. Abstract: Routine clinical imaging of the retina using optical coherence tomography (OCT) is performed with large slice spacing, resulting in highly anisotropic images and a sparsely scanned retina. Most learning-based methods circumvent the problems arising from the anisotropy by using 2D approaches rather than performing volumetric analyses. These approaches inherently bear the risk of generating inconsistent results for neighboring B-scans. For example, 2D retinal layer segmentations can have irregular surfaces in 3D. Furthermore, the typically used convolutional neural networks are bound to the resolution of the training data, which prevents their usage for images acquired with a different imaging protocol. Implicit neural representations (INRs) have recently emerged as a tool to store voxelized data as a continuous representation. Using coordinates as input, INRs are resolution-agnostic, which allows them to be applied to anisotropic data. In this paper, we propose two frameworks that make use of this characteristic of INRs for dense 3D analyses of retinal OCT volumes. 1) We perform inter-B-scan interpolation by incorporating additional information from en-face modalities, which helps retain relevant structures between B-scans. 2) We create a resolution-agnostic retinal atlas that enables general analysis without strict requirements for the data. Both methods leverage generalizable INRs, improving retinal shape representation through population-based training and allowing predictions for unseen cases. Our resolution-independent frameworks facilitate the analysis of OCT images with large B-scan distances, opening up possibilities for the volumetric evaluation of retinal structures and pathologies.

[94] PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding

Souhail Hadgi,Bingchen Gong,Ramana Sundararaman,Emery Pierson,Lei Li,Peter Wonka,Maks Ovsjanikov

Main category: cs.CV

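TL;DR: Proposes a point-cloud-based, encoder-only 3D model that learns language-aligned local features through two-stage pre-training and outperforms existing rendering-based methods on zero-shot 3D part segmentation.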

Details

Motivation: Existing 3D foundation models perform poorly on local part-level reasoning, depend on multi-view rendering and LLM prompt engineering, and do not fully exploit 3D geometric structure. Method: A point cloud transformer encoder is trained in two stages: dense 2D features are first distilled from visual encoders such as DINOv2 into 3D patches, and the patch embeddings are then aligned with text descriptions through multi-positive contrastive learning. Result: The model significantly outperforms previous rendering-based and feed-forward methods on several 3D part segmentation benchmarks and supports fast single-pass inference with no test-time multi-view rendering. Conclusion: The method learns language-aligned local 3D features effectively, improving zero-shot 3D part segmentation while lowering inference cost. Abstract: Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language-model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes. We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D features from visual encoders such as DINOv2 into 3D patches, and (2) alignment of these patch embeddings with part-level text embeddings through a multi-positive contrastive objective. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference without any test-time multi-view rendering, while significantly outperforming previous rendering-based and feed-forward approaches across several 3D part segmentation benchmarks. Project website: https://souhail-hadgi.github.io/patchalign3dsite/
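
One common form of a multi-positive contrastive objective is sketched below for a single patch embedding: the loss is the negative log of the softmax mass assigned to all positive text embeddings. This is an illustrative formulation only; the abstract does not specify the exact loss, and the function name, the temperature value, and the use of cosine similarity are assumptions.

```python
import math

def multi_positive_contrastive_loss(patch_emb, text_embs, positive_ids,
                                    temperature=0.07):
    # Loss = -log( sum over positives exp(s_p / t) / sum over all exp(s_j / t) ),
    # where s_j is the cosine similarity between the patch and text j.
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    p = normalize(patch_emb)
    logits = [sum(a * b for a, b in zip(p, normalize(t))) / temperature
              for t in text_embs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    pos = sum(exps[i] for i in positive_ids)
    return -math.log(pos / sum(exps))
```

Allowing several positives per patch matters when one part has multiple valid captions, since a single-positive loss would push the other correct captions away.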

[95] CT Scans As Video: Efficient Intracranial Hemorrhage Detection Using Multi-Object Tracking

Amirreza Parvahan,Mohammad Hoseyni,Javad Khoramdel,Amirhossein Nikoofard

Main category: cs.CV

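TL;DR: This paper proposes a lightweight framework that recasts CT volumes as video streams, combining the efficiency of 2D detection with the need for 3D context, for intracranial hemorrhage detection with efficient, accurate real-time analysis on edge devices.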

Details

Motivation: 3D CNNs impose high memory and compute costs for medical image analysis on edge devices, making real-time requirements hard to meet. Method: CT volumes are reformulated as video streams; a Nano-configuration YOLO model serves as the slice-level detector, and the ByteTrack algorithm plus a spatiotemporal consistency filter maintain anatomical continuity along the z-axis and address tracker initialization lag. Result: On independent test data, the framework raises detection precision from 0.703 to 0.779 while maintaining high sensitivity, clearly outperforming the baseline 2D detector. Conclusion: The method approximates 3D contextual reasoning at very low computational cost and offers a scalable solution for real-time patient prioritization in resource-constrained settings. Abstract: Automated analysis of volumetric medical imaging on edge devices is severely constrained by the high memory and computational demands of 3D Convolutional Neural Networks (CNNs). This paper develops a lightweight computer vision framework that reconciles the efficiency of 2D detection with the necessity of 3D context by reformulating volumetric Computed Tomography (CT) data as sequential video streams. This video-viewpoint paradigm is applied to the time-sensitive task of Intracranial Hemorrhage (ICH) detection using the Hemorica dataset. To ensure operational efficiency, we benchmarked multiple generations of the YOLO architecture (v8, v10, v11 and v12) in their Nano configurations, selecting the version with the highest mAP@50 to serve as the slice-level backbone. A ByteTrack algorithm is then introduced to enforce anatomical consistency across the $z$-axis. To address the initialization lag inherent in video trackers, a hybrid inference strategy and a spatiotemporal consistency filter are proposed to distinguish true pathology from transient prediction noise. Experimental results on independent test data demonstrate that the proposed framework serves as a rigorous temporal validator, increasing detection Precision from 0.703 to 0.779 compared to the baseline 2D detector, while maintaining high sensitivity. By approximating 3D contextual reasoning at a fraction of the computational cost, this method provides a scalable solution for real-time patient prioritization in resource-constrained environments, such as mobile stroke units and IoT-enabled remote clinics.
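
The idea of treating adjacent slices as video frames and discarding detections that do not persist can be sketched as follows. This is not ByteTrack and not the paper's filter: it is a simplified greedy IoU linker, and the function names and thresholds (`min_run`, `iou_thr`) are hypothetical values chosen for illustration.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def filter_transient(detections_per_slice, min_run=3, iou_thr=0.3):
    # Keep only boxes that belong to a chain spanning >= min_run consecutive slices.
    tracks = []   # every chain ever started; each is a list of (z, box)
    active = []   # chains that were extended on the previous slice
    for z, boxes in enumerate(detections_per_slice):
        next_active, used = [], set()
        for box in boxes:
            best, best_iou = None, iou_thr
            for t in active:
                if id(t) in used:
                    continue  # each chain takes at most one box per slice
                overlap = iou(t[-1][1], box)
                if overlap >= best_iou:
                    best, best_iou = t, overlap
            if best is None:
                best = []            # start a new chain
                tracks.append(best)
            else:
                used.add(id(best))
            best.append((z, box))
            next_active.append(best)
        active = next_active         # unmatched chains end here
    kept = [[] for _ in detections_per_slice]
    for t in tracks:
        if len(t) >= min_run:
            for z, box in t:
                kept[z].append(box)
    return kept
```

A detection that appears on a single slice with no overlapping neighbor above or below is dropped, which is the kind of transient prediction noise the abstract refers to.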

[96] MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark

Shaden Shaar,Bradon Thymes,Sirawut Chaixanien,Claire Cardie,Bharath Hariharan

Main category: cs.CV

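TL;DR: This paper introduces MovieRecapsQA, a new open-ended multimodal video question-answering benchmark that uses movie recap videos and text summaries to generate about 8.2K question-answer pairs and supplies the facts needed to verify answers without a reference. It is the first open-ended VideoQA benchmark to provide explicit textual context.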

Details

Motivation: Existing VideoQA benchmarks struggle to capture multimodal reasoning and are mostly not open-ended because free-form answers are hard to evaluate. A new benchmark is needed to better assess understanding of the visual and dialogue cues in real-world videos. Method: The MovieRecapsQA benchmark is built from movie recap videos (YouTube content) and their text summaries; about 8.2K subtitle-aligned QA pairs are generated from the recap summaries, together with the answer facts used for reference-free verification; multiple video lengths and question types support fine-grained analysis. Result: Evaluation of seven state-of-the-art MLLMs shows that 1) visual-only questions remain the most challenging; 2) models tend to rely on text whenever it is available; 3) extracting factually accurate information from video is still difficult for all models; 4) proprietary and open-source models perform comparably on video-dependent questions. Conclusion: MovieRecapsQA is the first open-ended multimodal VideoQA benchmark to provide explicit textual context; it supports more effective evaluation of multimodal reasoning and exposes the limits of current models in visual understanding and cross-modal integration. Abstract: Understanding real-world videos such as movies requires integrating visual and dialogue cues to answer complex questions. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and are largely not open-ended, given the difficulty of evaluating free-form answers. In this paper, we introduce a novel open-ended multi-modal VideoQA benchmark, MovieRecapsQA, created using movie recap videos, a distinctive type of YouTube content that summarizes a film by presenting its key events through synchronized visual (recap video) and textual (recap summary) modalities. Using the recap summary, we generate $\approx 8.2$K question-answer (QA) pairs (aligned with movie subtitles) and provide the necessary "facts" needed to verify an answer in a reference-free manner. To our knowledge, this is the first open-ended VideoQA benchmark that supplies explicit textual context of the input (video and/or text), which we use for evaluation. Our benchmark provides videos of multiple lengths (i.e., recap-segments, movie-segments) and categorizations of questions (by modality and type) to enable fine-grained analysis. We evaluate the performance of seven state-of-the-art MLLMs using our benchmark and observe that: 1) visual-only questions remain the most challenging; 2) models default to textual inputs whenever available; 3) extracting factually accurate information from video content is still difficult for all models; and 4) proprietary and open-source models perform comparably on video-dependent questions.

[97] Shallow- and Deep-fake Image Manipulation Localization Using Vision Mamba and Guided Graph Neural Network

Junbin Zhang,Hamid Reza Tohidypour,Yixiao Wang,Panos Nasiopoulos

Main category: cs.CV

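TL;DR: This paper proposes a method based on Vision Mamba and a novel Guided Graph Neural Network (G-GNN) that localizes manipulated regions in both shallowfake and deepfake images within one model; experiments show higher detection accuracy than existing state-of-the-art methods.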

Details

Motivation: Because forged images can have a major societal impact, a manipulation localization method is needed that handles both traditional image editing tools (shallowfakes) and AI generation techniques (deepfakes). Method: A Vision Mamba network extracts feature maps that clearly describe the boundaries between tampered and untouched regions, and a new Guided Graph Neural Network (G-GNN) module amplifies the distinction between authentic and manipulated pixels. Result: The proposed method achieves higher inference accuracy on image manipulation localization than other state-of-the-art methods, confirming its effectiveness on both shallowfake and deepfake images. Conclusion: The study shows that a single deep learning model can localize manipulations in both shallowfake and deepfake images; the proposed G-GNN module strengthens feature discrimination and improves overall detection performance. Abstract: Image manipulation localization is a critical research task, given that forged images may have a significant societal impact in various respects. Such image manipulations can be produced using traditional image editing tools (known as "shallowfakes") or advanced artificial intelligence techniques ("deepfakes"). While numerous studies have focused on image manipulation localization on either shallowfake images or deepfake videos, few approaches address both cases. In this paper, we explore the feasibility of using a deep learning network to localize manipulations in both shallow- and deep-fake images, and propose a solution for this purpose. To precisely differentiate between authentic and manipulated pixels, we leverage the Vision Mamba network to extract feature maps that clearly describe the boundaries between tampered and untouched regions. To further enhance this separation, we propose a novel Guided Graph Neural Network (G-GNN) module that amplifies the distinction between manipulated and authentic pixels. Our evaluation results show that our proposed method achieved higher inference accuracy compared to other state-of-the-art methods.

[98] DreamLoop: Controllable Cinemagraph Generation from a Single Photograph

Aniruddha Mahapatra,Long Mai,Cusuh Ham,Feng Liu

Main category: cs.CV

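TL;DR: DreamLoop is a video synthesis framework that generates controllable cinemagraphs from a single photo without dedicated training data. It adapts a large-scale video diffusion model with temporal bridging and motion conditioning to produce high-quality, loopable, complex motion.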

Details

Motivation: Existing image-animation techniques are limited to simple, low-frequency motion and cannot produce the seamless loops that cinemagraphs require in scenes without repetitive textures, while existing video diffusion models are not optimized for cinemagraphs and lack dedicated training data. A method is therefore needed for controllable, high-quality cinemagraph generation in general scenes. Method: The DreamLoop framework adapts a general video diffusion model with two training objectives: temporal bridging and motion conditioning. At inference, the input image is used as both the first- and last-frame condition to force a closed loop; static tracks keep the background still; a user-specified motion path controls the trajectory and timing of the target object's animation. Result: DreamLoop generates high-quality, complex cinemagraphs that match user intent, supports flexible control across varied scenes, and outperforms existing methods in visual quality and controllability. Conclusion: DreamLoop is the first method to enable controllable cinemagraph generation in general scenes without cinemagraph training data; by adapting a video diffusion model with carefully designed conditioning, it delivers seamless loops, a static background, and controllable foreground motion. Abstract: Cinemagraphs, which combine static photographs with selective, looping motion, offer unique artistic appeal. Generating them from a single photograph in a controllable manner is particularly challenging. Existing image-animation techniques are restricted to simple, low-frequency motions and operate only in narrow domains with repetitive textures like water and smoke. In contrast, large-scale video diffusion models are not tailored for cinemagraph constraints and lack the specialized data required to generate seamless, controlled loops. We present DreamLoop, a controllable video synthesis framework dedicated to generating cinemagraphs from a single photo without requiring any cinemagraph training data. Our key idea is to adapt a general video diffusion model by training it on two objectives: temporal bridging and motion conditioning. This strategy enables flexible cinemagraph generation. During inference, by using the input image as both the first- and last-frame condition, we enforce a seamless loop. By conditioning on static tracks, we maintain a static background. Finally, by providing a user-specified motion path for a target object, our method provides intuitive control over the animation's trajectory and timing. To our knowledge, DreamLoop is the first method to enable cinemagraph generation for general scenes with flexible and intuitive controls. We demonstrate that our method produces high-quality, complex cinemagraphs that align with user intent, outperforming existing approaches.

[99] GRRE: Leveraging G-Channel Removed Reconstruction Error for Robust Detection of AI-Generated Images

Shuman He,Xiehua Li,Xioaju Yang,Yang Xiong,Keqin Li

Main category: cs.CV

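TL;DR: Proposes a new method for detecting AI-generated images based on G-channel Removed Reconstruction Error (GRRE), with strong robustness and cross-model generalization.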

Details

Motivation: Existing detectors of generated images generalize poorly to novel or unseen generative models and struggle to separate real from synthetic images. Method: The green (G) channel of an image is removed and then reconstructed; the proposed G-channel Removed Reconstruction Error (GRRE) method exploits the significant difference in reconstruction error between real and generated images. Result: GRRE achieves high detection accuracy across multiple generative models and shows strong robustness to unseen models, perturbations, and post-processing operations, along with superior cross-model generalization. Conclusion: Channel-removal-based reconstruction is an effective image forensics strategy with considerable potential for meeting the authenticity challenges posed by generative AI. Abstract: The rapid progress of generative models, particularly diffusion models and GANs, has greatly increased the difficulty of distinguishing synthetic images from real ones. Although numerous detection methods have been proposed, their accuracy often degrades when applied to images generated by novel or unseen generative models, highlighting the challenge of achieving strong generalization. To address this challenge, we introduce a novel detection paradigm based on channel removal reconstruction. Specifically, we observe that when the green (G) channel is removed from real images and reconstructed, the resulting reconstruction errors differ significantly from those of AI-generated images. Building upon this insight, we propose G-channel Removed Reconstruction Error (GRRE), a simple yet effective method that exploits this discrepancy for robust AI-generated image detection. Extensive experiments demonstrate that GRRE consistently achieves high detection accuracy across multiple generative models, including those unseen during training. Compared with existing approaches, GRRE not only maintains strong robustness against various perturbations and post-processing operations but also exhibits superior cross-model generalization. These results highlight the potential of channel-removal-based reconstruction as a powerful forensic tool for safeguarding image authenticity in the era of generative AI.

[100] CAMO: Category-Agnostic 3D Motion Transfer from Monocular 2D Videos

Taeyeon Kim,Youngju Na,Jumin Lee,Minhyuk Sung,Sung-Eui Yoon

Main category: cs.CV

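TL;DR: Proposes CAMO, a category-agnostic 3D motion transfer framework that needs no category-specific templates or explicit 3D supervision; it jointly optimizes shape and pose using a morphology-parameterized articulated 3D Gaussian splatting model and dense semantic correspondences.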

Details

Motivation: Existing methods depend on category-specific templates or explicit 3D supervision, generalize poorly to diverse object shapes and real 2D video, and suffer from pose and shape ambiguity. Method: CAMO combines a morphology-parameterized articulated 3D Gaussian splatting model with dense semantic correspondences and optimizes the target mesh's shape and pose directly from monocular 2D video, achieving template-free motion transfer. Result: Across many object categories and casual video scenarios, it delivers more accurate, efficient, and visually coherent motion transfer than existing methods. Conclusion: CAMO effectively alleviates shape-pose ambiguity, supports high-quality motion transfer across diverse categories, and advances 3D motion transfer from monocular video. Abstract: Motion transfer from 2D videos to 3D assets is a challenging problem, due to inherent pose ambiguities and diverse object shapes, often requiring category-specific parametric templates. We propose CAMO, a category-agnostic framework that transfers motion to diverse target meshes directly from monocular 2D videos without relying on predefined templates or explicit 3D supervision. The core of CAMO is a morphology-parameterized articulated 3D Gaussian splatting model combined with dense semantic correspondences to jointly adapt shape and pose through optimization. This approach effectively alleviates shape-pose ambiguities, enabling visually faithful motion transfer for diverse categories. Experimental results demonstrate superior motion accuracy, efficiency, and visual coherence compared to existing methods, significantly advancing motion transfer in varied object categories and casual video scenarios.

[101] Robust Mesh Saliency GT Acquisition in VR via View Cone Sampling and Geometric Smoothing

Guoquan Zheng,Jie Hao,Huiyu Duan,Yongming Han,Liang Yuan,Dong Zhang,Guangtao Zhai

Main category: cs.CV

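TL;DR: Proposes a new framework for acquiring 3D mesh saliency ground truth that uses view cone sampling and a hybrid Manifold-Euclidean diffusion algorithm to make sampling more robust on complex topologies and saliency propagation topologically consistent.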

Details

Motivation: Existing 3D mesh saliency acquisition methods carry over 2D image methods and ignore the differences between 3D geometric topology and 2D image arrays, causing signal leakage and texture attention bias. Method: A view cone sampling (VCS) strategy simulates the human foveal receptive field, and a hybrid Manifold-Euclidean constrained diffusion (HCD) algorithm combines geodesic constraints with Euclidean scales for saliency propagation. Result: The framework mitigates topological short-circuits and aliasing, improving the fidelity and robustness of 3D saliency detection. Conclusion: The method provides a high-fidelity 3D attention acquisition paradigm that better matches human perception and a more accurate baseline for 3D mesh saliency research. Abstract: Reliable 3D mesh saliency ground truth (GT) is essential for human-centric visual modeling in virtual reality (VR). However, current 3D mesh saliency GT acquisition methods generally follow 2D image methods, ignoring the differences between 3D geometric topology and 2D image arrays. Current VR eye-tracking pipelines rely on single ray sampling and Euclidean smoothing, triggering texture attention and signal leakage across gaps. This paper proposes a robust framework to address these limitations. We first introduce a view cone sampling (VCS) strategy, which simulates the human foveal receptive field via Gaussian-distributed ray bundles to improve sampling robustness for complex topologies. Furthermore, a hybrid Manifold-Euclidean constrained diffusion (HCD) algorithm is developed, fusing manifold geodesic constraints with Euclidean scales to ensure topologically-consistent saliency propagation. By mitigating "topological short-circuits" and aliasing, our framework provides a high-fidelity 3D attention acquisition paradigm that aligns with natural human perception, offering a more accurate and robust baseline for 3D mesh saliency research.
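
A Gaussian-distributed ray bundle around a gaze direction can be sketched as below. The angular standard deviation (`sigma_deg`), the bundle size, and the function name `sample_view_cone` are hypothetical; the abstract does not give the cone parameters or how rays are weighted, so this only illustrates replacing a single gaze ray with a small bundle.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def sample_view_cone(gaze, n_rays=32, sigma_deg=1.0, seed=0):
    # Return n_rays unit directions scattered around the gaze direction.
    rng = random.Random(seed)
    g = normalize(gaze)
    # Build two unit axes perpendicular to the gaze direction.
    helper = (0.0, 1.0, 0.0) if abs(g[1]) < 0.9 else (1.0, 0.0, 0.0)
    u = normalize(cross(g, helper))
    v = cross(g, u)
    sigma = math.radians(sigma_deg)
    rays = []
    for _ in range(n_rays):
        # Independent Gaussian angular offsets along the two axes.
        a, b = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
        d = tuple(g[i] + math.tan(a) * u[i] + math.tan(b) * v[i]
                  for i in range(3))
        rays.append(normalize(d))
    return rays
```

Each ray would then be intersected with the mesh, so that a fixation contributes to a small neighborhood of faces rather than to whichever single face one ray happens to hit.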

[102] Foreground-Aware Dataset Distillation via Dynamic Patch Selection

Longzhen Li,Guang Li,Ren Togo,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama

Main category: cs.CV

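TL;DR: Proposes a foreground-aware dataset distillation method with a content-adaptive dynamic patch selection strategy that uses Grounded SAM2 to identify foreground objects and guide patch selection, improving distilled dataset quality and model generalization.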

Details

Motivation: Traditional dataset distillation methods suffer from high computational overhead, memory limits, and unrealistic generated images, while existing non-optimization methods use fixed patch selection strategies that can lose critical foreground information; a smarter content-aware mechanism is needed to preserve the main objects. Method: Grounded SAM2 detects foreground objects, per-category foreground occupancy is computed to set adaptive thresholds, and a dual-path dynamic patch selection strategy either resizes the whole image when the foreground dominates or picks the most informative patch from several candidates. Result: Experiments on multiple benchmarks show that the method significantly outperforms existing distillation methods, yielding more representative and informative distilled datasets and improving robustness and performance across architectures and image compositions. Conclusion: The foreground-aware distillation method uses content-adaptive dynamic patch selection to preserve key object information and reduce background redundancy, offering a new approach to efficient dataset distillation. Abstract: In this paper, we propose a foreground-aware dataset distillation method that enhances patch selection in a content-adaptive manner. With the rising computational cost of training large-scale deep models, dataset distillation has emerged as a promising approach for constructing compact synthetic datasets that retain the knowledge of their large original counterparts. However, traditional optimization-based methods often suffer from high computational overhead, memory constraints, and the generation of unrealistic, noise-like images with limited architectural generalization. Recent non-optimization methods alleviate some of these issues by constructing distilled data from real image patches, but the rigid patch selection strategies they use can still discard critical information about the main objects. To solve this problem, we first leverage Grounded SAM2 to identify foreground objects and compute per-image foreground occupancy, from which we derive a category-wise patch decision threshold. Guided by these thresholds, we design a dynamic patch selection strategy that, for each image, either selects the most informative patch from multiple candidates or directly resizes the full image when the foreground dominates. This dual-path mechanism preserves more key information about the main objects while reducing redundant background content. Extensive experiments on multiple benchmarks show that the proposed method consistently improves distillation performance over existing approaches, producing more informative and representative distilled datasets and enhancing robustness across different architectures and image compositions.
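
The dual-path decision can be illustrated with a small sketch that works on binary foreground masks. Two assumptions are made that the abstract does not confirm: the category-wise threshold is taken to be the per-class mean occupancy, and "most informative" is approximated by the count of foreground pixels inside the candidate patch. All function names are hypothetical.

```python
from statistics import mean

def category_thresholds(occupancy_by_class):
    # Assumed rule: threshold = mean foreground occupancy of the class.
    return {c: mean(v) for c, v in occupancy_by_class.items()}

def choose_path(occupancy, label, thresholds):
    # Resize the full image when foreground dominates, else pick a patch.
    if occupancy >= thresholds[label]:
        return "resize_full_image"
    return "select_patch"

def best_patch(mask, patch, stride):
    # Top-left (row, col) of the patch x patch window with the most foreground.
    # mask is a list of rows containing 0 (background) or 1 (foreground).
    h, w = len(mask), len(mask[0])
    best, best_score = (0, 0), -1
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            score = sum(mask[y + dy][x + dx]
                        for dy in range(patch) for dx in range(patch))
            if score > best_score:
                best, best_score = (y, x), score
    return best, best_score
```

In practice the mask would come from a segmentation model such as the Grounded SAM2 step named in the abstract; here it is just a nested list so the selection logic can be read in isolation.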

[103] HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps

Xuchang Zhong,Xu Cao,Jinke Feng,Hao Fang

Main category: cs.CV

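TL;DR: This paper proposes a new homography-guided pose estimation network for fine-grained visual localization between multi-view images and standard-definition (SD) maps; introducing geometric priors significantly improves training efficiency and localization accuracy.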

Details

Motivation: Existing regression-based visual localization methods overlook inherent geometric priors, which leads to low training efficiency and limited localization accuracy. Method: Input pairs that satisfy a homography constraint are constructed by projecting ground-view features into the BEV domain and semantically aligning them with map features; homography relationships then guide feature fusion and restrict the pose outputs to a valid range. Result: The method significantly outperforms existing state-of-the-art methods on the nuScenes dataset and supports cross-resolution inputs, which improves model flexibility. Conclusion: According to the authors, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization, improving both the performance and the generality of visual localization. Abstract: Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and SD maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusion and direct 3-DoF pose regression. To the best of our knowledge, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. Furthermore, by explicitly modeling homography transformations, the proposed framework naturally supports cross-resolution inputs, enhancing model flexibility. Extensive experiments on the nuScenes dataset demonstrate that our approach significantly outperforms existing state-of-the-art visual localization methods. Code and pretrained models will be publicly released to foster future research.

[104] Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench

Zanting Ye,Xiaolong Niu,Xuanbin Wu,Xu Han,Shengyuan Liu,Jing Hao,Zhihao Peng,Hao Sun,Jieqin Lv,Fanghu Wang,Yanchao Huang,Hubing Wu,Yixuan Yuan,Habib Zaidi,Arman Rahmim,Yefeng Zheng,Lijun Lu

Main category: cs.CV

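TL;DR: This paper introduces PET-Bench, the first large-scale functional imaging benchmark, which reveals a "functional perception gap" and a "chain-of-thought hallucination trap" in how multimodal large language models understand PET images. It also proposes Atomic Visual Alignment (AVA), which significantly improves diagnostic accuracy by strengthening low-level functional perception.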

Details

Motivation: Current multimodal large language models (MLLMs) perform well on anatomical imaging, but their capability on functional imaging such as PET remains unclear; their vision encoders cannot decode functional distributions independently, so a dedicated benchmark and method are needed to identify and address this deficiency. Method: The authors build PET-Bench, a benchmark with 52,308 QA pairs, evaluate 19 mainstream MLLMs, and propose AVA, a fine-tuning strategy that forces the model to master low-level functional perception before high-level diagnostic reasoning. Result: Standard chain-of-thought prompting is found to produce clinically fluent but factually wrong diagnoses (the CoT hallucination trap); AVA mitigates this problem and improves diagnostic accuracy by up to 14.83%. Conclusion: Functional imaging understanding requires perception that is independent of morphological priors; AVA's staged training strategy bridges the perception gap, turns CoT into a reliable reasoning tool, and offers a new path toward safe clinical AI. Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities, their capability in functional imaging remains largely unexplored. In this work, we identify and quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. Identifying Positron Emission Tomography (PET) as the quintessential modality to investigate this disconnect, we introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Extensive evaluation of 19 state-of-the-art MLLMs reveals a critical safety hazard termed the Chain-of-Thought (CoT) hallucination trap. We observe that standard CoT prompting, widely considered to enhance reasoning, paradoxically decouples linguistic generation from visual evidence in PET, producing clinically fluent but factually ungrounded diagnoses. To resolve this, we propose Atomic Visual Alignment (AVA), a simple fine-tuning strategy that enforces the mastery of low-level functional perception prior to high-level diagnostic reasoning. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic accuracy by up to 14.83%. Code and data are available at https://github.com/yezanting/PET-Bench.

[105] D$^3$R-DETR: DETR with Dual-Domain Density Refinement for Tiny Object Detection in Aerial Images

Zixiao Wen,Zhen Yang,Xianjie Bao,Lei Zhang,Xiantai Xiang,Wenshuai Li,Yuhan Liu

Main category: cs.CV

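TL;DR: This paper proposes D³R-DETR, a new DETR-based detector that uses dual-domain density refinement (fusing spatial-domain and frequency-domain information) to improve the detection of tiny objects in remote sensing images.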

Details

Motivation: Because tiny objects carry very little pixel information and object density varies widely, existing Transformer-based detectors suffer from slow convergence and inaccurate query-object matching. Method: D³R-DETR fuses spatial-domain and frequency-domain information to refine low-level feature maps, and uses their detail to predict a more accurate object density map that guides the precise localization of tiny objects. Result: Experiments on the AI-TOD-v2 dataset show that D³R-DETR outperforms existing state-of-the-art detectors on tiny object detection. Conclusion: Through its dual-domain density refinement mechanism, D³R-DETR improves both detection accuracy for tiny objects in remote sensing images and model convergence speed, and shows good application prospects. Abstract: Detecting tiny objects plays a vital role in remote sensing intelligent interpretation, as these objects often carry critical information for downstream applications. However, due to the extremely limited pixel information and significant variations in object density, mainstream Transformer-based detectors often suffer from slow convergence and inaccurate query-object matching. To address these challenges, we propose D$^3$R-DETR, a novel DETR-based detector with Dual-Domain Density Refinement. By fusing spatial and frequency domain information, our method refines low-level feature maps and utilizes their rich details to predict a more accurate object density map, thereby guiding the model to precisely localize tiny objects. Extensive experiments on the AI-TOD-v2 dataset demonstrate that D$^3$R-DETR outperforms existing state-of-the-art detectors for tiny object detection.

[106] Towards Zero-Shot Point Cloud Registration Across Diverse Scales, Scenes, and Sensor Setups

Hyungtae Lim,Minkyun Seo,Luca Carlone,Jaesik Park

Main category: cs.CV

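TL;DR: This paper proposes BUFFER-X, a training-free point cloud registration framework that achieves zero-shot generalization through geometric bootstrapping, distribution-aware sampling, and local coordinate normalization, addressing the poor cross-domain generalization of existing deep learning methods.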

Details

Motivation: Existing deep learning-based point cloud registration methods struggle with zero-shot generalization: they typically rely on fixed parameters or need re-tuning or retraining, and adapt poorly to differences in scale and domain. Method: The BUFFER-X framework (1) estimates hyperparameters automatically through geometric bootstrapping, (2) replaces learned keypoint detectors with distribution-aware farthest point sampling, and (3) applies patch-level coordinate normalization to mitigate scale inconsistency, combined with a hierarchical multi-scale matching strategy. A lightweight variant, BUFFER-X-Lite, is also proposed to improve efficiency. Result: On a comprehensive benchmark of 12 datasets (object-scale, indoor, and outdoor scenes, plus heterogeneous LiDAR setups), the method generalizes well without any tuning; BUFFER-X-Lite reduces computation time by 43% relative to BUFFER-X while preserving accuracy. Conclusion: BUFFER-X achieves zero-shot generalization, adapting to diverse environments without training or prior knowledge, and shows strong robustness and practicality in cross-sensor and cross-domain registration. Abstract: Some deep learning-based point cloud registration methods struggle with zero-shot generalization, often requiring dataset-specific hyperparameter tuning or retraining for new environments. We identify three critical limitations: (a) fixed user-defined parameters (e.g., voxel size, search radius) that fail to generalize across varying scales, (b) learned keypoint detectors that exhibit poor cross-domain transferability, and (c) absolute coordinates that amplify scale mismatches between datasets. To address these three issues, we present BUFFER-X, a training-free registration framework that achieves zero-shot generalization through: (a) geometric bootstrapping for automatic hyperparameter estimation, (b) distribution-aware farthest point sampling to replace learned detectors, and (c) patch-level coordinate normalization to ensure scale consistency. Our approach employs hierarchical multi-scale matching to extract correspondences across local, middle, and global receptive fields, enabling robust registration in diverse environments. For efficiency-critical applications, we introduce BUFFER-X-Lite, which reduces total computation time by 43% (relative to BUFFER-X) through early exit strategies and fast pose solvers while preserving accuracy. We evaluate on a comprehensive benchmark comprising 12 datasets spanning object-scale, indoor, and outdoor scenes, including cross-sensor registration between heterogeneous LiDAR configurations. Results demonstrate that our approach generalizes effectively without manual tuning or prior knowledge of test domains. Code: https://github.com/MIT-SPARK/BUFFER-X.
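
Two of the ingredients named above, farthest point sampling and patch-level coordinate normalization, can be illustrated with a small sketch. It assumes the textbook greedy form of farthest point sampling (the abstract's distribution-aware variant is not specified, so it is not reproduced here) and a simple center-and-scale normalization. The function names and the `radius` parameter are illustrative, not taken from the BUFFER-X code base.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def farthest_point_sampling(points: List[Point], k: int, start: int = 0) -> List[int]:
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def normalize_patch(patch: List[Point], center: Point, radius: float) -> List[Tuple[float, ...]]:
    """Express a local patch relative to its center, scaled by the patch radius.

    After this step the coordinates no longer depend on the absolute scale
    of the scene, only on the patch's local geometry.
    """
    return [tuple((c - o) / radius for c, o in zip(p, center)) for p in patch]
```

Because the sampling depends only on pairwise distances, it needs no learned detector, which is the property the abstract relies on for cross-domain transfer.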

[107] AnyDepth: Depth Estimation Made Easy

Zeyu Ren,Zeyu Zhang,Wukai Li,Qingxiang Liu,Hao Tang

Main category: cs.CV

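TL;DR: This paper proposes a lightweight, data-centric framework for zero-shot monocular depth estimation. It uses DINOv3 as the visual encoder and a structurally simpler Simple Depth Transformer (SDT) decoder that greatly reduces the parameter count while improving accuracy, and it raises training data quality through a quality-based filtering strategy, surpassing DPT on five benchmarks.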

Details

Motivation: Existing methods rely on large-scale datasets and complex decoders, which lowers efficiency and generalization and limits practical use. Method: DINOv3 is adopted as the encoder to obtain high-quality dense features; the Simple Depth Transformer (SDT) decoder uses single-path feature fusion and upsampling; a quality-based filtering strategy selects high-quality training samples. Result: Compared with DPT, SDT has roughly 85%-89% fewer parameters and achieves higher accuracy on five benchmarks, while also improving training efficiency and generalization. Conclusion: Balancing model design and data quality is essential for efficient and generalizable zero-shot depth estimation, and the proposed framework strikes a better trade-off between accuracy and efficiency. Abstract: Monocular depth estimation aims to recover the depth information of 3D scenes from 2D images. Recent work has made significant progress, but its reliance on large-scale datasets and complex decoders has limited its efficiency and generalization ability. In this paper, we propose a lightweight and data-centric framework for zero-shot monocular depth estimation. First, we adopt DINOv3 as the visual encoder to obtain high-quality dense features. Second, to address the inherent drawbacks of the complex structure of the DPT, we design the Simple Depth Transformer (SDT), a compact transformer-based decoder. Compared to the DPT, it uses a single-path feature fusion and upsampling process to reduce the computational overhead of cross-scale feature fusion, achieving higher accuracy while reducing the number of parameters by approximately 85%-89%. Furthermore, we propose a quality-based filtering strategy to filter out harmful samples, thereby reducing dataset size while improving overall training quality. Extensive experiments on five benchmarks demonstrate that our framework surpasses the DPT in accuracy. This work highlights the importance of balancing model design and data quality for achieving efficient and generalizable zero-shot depth estimation. Code: https://github.com/AIGeeksGroup/AnyDepth. Website: https://aigeeksgroup.github.io/AnyDepth.

[108] ClearAIR: A Human-Visual-Perception-Inspired All-in-One Image Restoration

Xu Zhang,Huan Zhang,Guoli Wang,Qian Zhang,Lefei Zhang

Main category: cs.CV

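TL;DR: This paper proposes ClearAIR, an all-in-one image restoration framework inspired by human visual perception that follows a hierarchical, coarse-to-fine strategy and achieves superior performance on complex real-world degradations.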

Details

Motivation: Existing all-in-one image restoration methods depend on degradation-specific representations, which tends to cause oversmoothing and artifacts, and they have difficulty handling complex composite degradations accurately. Method: (1) An image quality assessment module based on a multimodal large language model provides a global evaluation; (2) semantic cross-attention and a degradation-aware module provide region awareness and task recognition; (3) a self-supervised internal clue reuse mechanism recovers fine details. Result: ClearAIR achieves strong restoration results on multiple synthetic and real-world datasets, outperforming existing methods especially in detail preservation and the handling of composite degradations. Conclusion: By imitating the human visual perception process and combining cross-modal understanding with the mining of internal image information, ClearAIR improves the quality and robustness of all-in-one image restoration. Abstract: All-in-One Image Restoration (AiOIR) has advanced significantly, offering promising solutions for complex real-world degradations. However, most existing approaches rely heavily on degradation-specific representations, often resulting in oversmoothing and artifacts. To address this, we propose ClearAIR, a novel AiOIR framework inspired by Human Visual Perception (HVP) and designed with a hierarchical, coarse-to-fine restoration strategy. First, leveraging the global priority of early HVP, we employ a Multimodal Large Language Model (MLLM)-based Image Quality Assessment (IQA) model for overall evaluation. Unlike conventional IQA, our method integrates cross-modal understanding to more accurately characterize complex, composite degradations. Building upon this overall assessment, we then introduce a region awareness and task recognition pipeline. A semantic cross-attention module, leveraging a semantic guidance unit, first produces coarse semantic prompts. Guided by this regional context, a degradation-aware module implicitly captures region-specific degradation characteristics, enabling more precise local restoration. Finally, to recover fine details, we propose an internal clue reuse mechanism. It operates in a self-supervised manner to mine and leverage the intrinsic information of the image itself, substantially enhancing detail restoration. Experimental results show that ClearAIR achieves superior performance across diverse synthetic and real-world datasets.

[109] AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs

Boyu Chang,Qi Wang,Xi Guo,Zhixiong Nan,Yazhou Yao,Tianfei Zhou

Main category: cs.CV

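TL;DR: This paper proposes AbductiveMLLM, which improves multimodal large language models on visual abductive reasoning (VAR) by combining verbal and pictorial reasoning. It consists of two cooperating components, REASONER and IMAGINER, and reaches state-of-the-art results on standard benchmarks.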

Details

Motivation: Existing MLLMs still fall short of humans in visual abductive reasoning and lack effective cross-modal causal reasoning and concrete visual imagination, so their abductive ability should be strengthened by imitating the human interplay of verbal and pictorial reasoning. Method: AbductiveMLLM comprises (1) the REASONER, which generates and filters candidate explanations in the language space, using a blind LLM to produce hypotheses and cross-modal causal alignment to prune them; and (2) the IMAGINER, which conditions a text-to-image diffusion model on the input video and the REASONER's output to generate matching visual scenes that enrich contextual grounding. The two components are trained jointly end to end. Result: Experiments on standard VAR benchmarks show that AbductiveMLLM clearly outperforms traditional methods and advanced MLLMs, achieving state-of-the-art performance. Conclusion: By simulating the human dual-channel (verbal and visual) abduction mechanism, AbductiveMLLM improves MLLM performance on visual abductive reasoning and confirms the value of dual-mode cooperative reasoning for causal understanding in AI. Abstract: Visual abductive reasoning (VAR) is a challenging task that requires AI systems to infer the most likely explanation for incomplete visual observations. While recent MLLMs develop strong general-purpose multimodal reasoning capabilities, they fall short in abductive inference, as compared to human beings. To bridge this gap, we draw inspiration from the interplay between verbal and pictorial abduction in human cognition, and propose to strengthen abduction of MLLMs by mimicking such dual-mode behavior. Concretely, we introduce AbductiveMLLM comprising two synergistic components: REASONER and IMAGINER. The REASONER operates in the verbal domain. It first explores a broad space of possible explanations using a blind LLM and then prunes visually incongruent hypotheses based on cross-modal causal alignment. The remaining hypotheses are introduced into the MLLM as targeted priors, steering its reasoning toward causally coherent explanations. The IMAGINER, on the other hand, further guides MLLMs by emulating human-like pictorial thinking. It conditions a text-to-image diffusion model on both the input video and the REASONER's output embeddings to "imagine" plausible visual scenes that correspond to the verbal explanation, thereby enriching MLLMs' contextual grounding. The two components are trained jointly in an end-to-end manner. Experiments on standard VAR benchmarks show that AbductiveMLLM achieves state-of-the-art performance, consistently outperforming traditional solutions and advanced MLLMs.

[110] EarthVL: A Progressive Earth Vision-Language Understanding and Generation Framework

Junjue Wang,Yanfei Zhong,Zihang Chen,Zhuo Zheng,Ailong Ma,Liangpei Zhang

Main category: cs.CV

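TL;DR: The paper proposes a progressive Earth vision-language understanding and generation framework, consisting of the multi-task dataset EarthVLSet and the semantic-guided network EarthVLNet, for remote sensing image understanding in city planning.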

Details

Motivation: Earth vision has made progress in geospatial object recognition but lacks the ability to reason about relationships between objects, which limits comprehensive scene understanding; the need is especially pressing in applications such as city planning. Method: The authors build the EarthVLSet dataset, with 10.9k high-resolution images and 761.5k textual pairs covering several VQA tasks, and propose EarthVLNet, which proceeds in stages through semantic segmentation, relational reasoning, and comprehensive understanding. Pixel-wise semantics guide the LLM in generating answers, and a numerical difference loss is introduced to optimize the model. Result: The method performs strongly on three benchmark tasks (semantic segmentation, multiple-choice VQA, and open-ended VQA). Segmentation features are found to improve VQA performance across datasets, multiple-choice tasks depend more on the vision encoder, and open-ended tasks require stronger vision and language models working together. Conclusion: The dataset and method establish an "image-mask-text" connection and provide a useful benchmark and new directions for Earth vision in geographical applications. Abstract: Earth vision has achieved milestones in geospatial object recognition but lacks exploration in object-relational reasoning, limiting comprehensive scene understanding. To address this, a progressive Earth vision-language understanding and generation framework is proposed, including a multi-task dataset (EarthVLSet) and a semantic-guided network (EarthVLNet). Focusing on city planning applications, EarthVLSet includes 10.9k sub-meter resolution remote sensing images, land-cover masks, and 761.5k textual pairs involving both multiple-choice and open-ended visual question answering (VQA) tasks. In an object-centric way, EarthVLNet is proposed to progressively achieve semantic segmentation, relational reasoning, and comprehensive understanding. The first stage involves land-cover segmentation to generate object semantics for VQA guidance. Guided by pixel-wise semantics, the object awareness based large language model (LLM) performs relational reasoning and knowledge summarization to generate the required answers. As for optimization, the numerical difference loss is proposed to dynamically add difference penalties, addressing the various objects' statistics.
Three benchmarks, covering semantic segmentation, multiple-choice VQA, and open-ended VQA, demonstrate the superiority of EarthVLNet and yield three findings that point to future directions: 1) segmentation features consistently enhance VQA performance even in cross-dataset scenarios; 2) multiple-choice tasks show greater sensitivity to the vision encoder than to the language decoder; and 3) open-ended tasks necessitate advanced vision encoders and language decoders for optimal performance. We believe this dataset and method will provide a beneficial benchmark that connects "image-mask-text", advancing geographical applications for Earth vision.

[111] DreamStyle: A Unified Framework for Video Stylization

Mengtian Li,Jinshu Chen,Songtao Zhao,Wanquan Feng,Pengqi Tu,Qian He

Main category: cs.CV

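TL;DR: DreamStyle is a unified video stylization framework that supports three kinds of conditions (text, style image, and first frame) and uses a high-quality dataset to improve style consistency and video quality.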

Details

Motivation: Existing video stylization methods are limited to a single type of style condition and lack high-quality datasets, which leads to style inconsistency and temporal flicker. Method: Built on a vanilla Image-to-Video model, DreamStyle is fine-tuned with LoRA using token-specific up matrices that reduce confusion among tokens from different conditions; a data curation pipeline is also built to collect high-quality paired video data. Result: DreamStyle performs well on all three video stylization tasks, and qualitative and quantitative evaluations show that it outperforms existing methods in style consistency and video quality. Conclusion: DreamStyle provides a unified video stylization framework compatible with multiple conditions and improves generation quality and stability. Abstract: Video stylization, an important downstream task of video generation models, has not yet been thoroughly explored. Its input style conditions typically include text, style image, and stylized first frame. Each condition has a characteristic advantage: text is more flexible, style image provides a more accurate visual anchor, and stylized first frame makes long-video stylization feasible. However, existing methods are largely confined to a single type of style condition, which limits their scope of application. Additionally, their lack of high-quality datasets leads to style inconsistency and temporal flicker. To address these limitations, we introduce DreamStyle, a unified framework for video stylization, supporting (1) text-guided, (2) style-image-guided, and (3) first-frame-guided video stylization, accompanied by a well-designed data curation pipeline to acquire high-quality paired video data. DreamStyle is built on a vanilla Image-to-Video (I2V) model and trained using Low-Rank Adaptation (LoRA) with token-specific up matrices that reduce the confusion among different condition tokens. Both qualitative and quantitative evaluations demonstrate that DreamStyle is competent in all three video stylization tasks, and outperforms the competitors in style consistency and video quality.
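
One plausible reading of "LoRA with token-specific up matrices" is sketched below: every token shares the frozen weight and the low-rank down projection, while the up projection is chosen by the token's condition type. The abstract does not give the exact formulation, so the structure, the function name `lora_forward`, and the condition labels `"text"` and `"style"` are assumptions made purely for illustration.

```python
from typing import Dict, List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(
    x: Vector,
    token_type: str,
    frozen_w: Matrix,
    down: Matrix,
    ups: Dict[str, Matrix],
) -> Vector:
    """Compute W x + B_t (A x), where B_t depends on the token's condition type.

    Assumption: A (down) is shared across token types and only B (up) differs.
    """
    base = matvec(frozen_w, x)              # frozen pretrained path
    low_rank = matvec(down, x)              # shared projection to rank r
    delta = matvec(ups[token_type], low_rank)  # type-specific projection back
    return [b + d for b, d in zip(base, delta)]
```

With a rank-1 down matrix `[[1.0, 1.0]]` and two different up matrices, the same input vector receives a different update depending on whether it is tagged as a text token or a style token, which is the separation the abstract credits with reducing confusion between conditions.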

[112] Textile IR: A Bidirectional Intermediate Representation for Physics-Aware Fashion CAD

Petteri Teikari,Neliana Fuenmayor

Main category: cs.CV

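TL;DR: This paper introduces Textile IR, a bidirectional intermediate representation that connects manufacturing-valid CAD, physics simulation, and lifecycle assessment. A seven-layer verification ladder integrates checks ranging from syntax to physics validation and supports trade-offs among sustainability, manufacturability, and aesthetics during design.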

Details

Motivation: Existing tools are siloed and cannot provide effective feedback or integration across pattern design, physics simulation, and sustainability assessment, so design conflicts are discovered only after physical prototyping. Method: Textile IR uses a seven-layer Verification Ladder and a scene-graph representation to enable bidirectional feedback, supporting pattern modification suggestions, real-time impact updates for material substitutions, and uncertainty propagation. Result: The framework describes an integrated workflow across pattern design, physics simulation, and lifecycle assessment that tracks uncertainty explicitly and lets AI systems manipulate garments as structured programs. Conclusion: Textile IR provides a formal representation that makes engineering constraints visible, manipulable, and immediately consequential, helping designers address sustainability, manufacturability, and aesthetics together and reducing reliance on costly physical prototypes. Abstract: We introduce Textile IR, a bidirectional intermediate representation that connects manufacturing-valid CAD, physics-based simulation, and lifecycle assessment for fashion design. Unlike existing siloed tools where pattern software guarantees sewable outputs but understands nothing about drape, and physics simulation predicts behaviour but cannot automatically fix patterns, Textile IR provides the semantic glue for integration through a seven-layer Verification Ladder -- from cheap syntactic checks (pattern closure, seam compatibility) to expensive physics validation (drape simulation, stress analysis). The architecture enables bidirectional feedback: simulation failures suggest pattern modifications; material substitutions update sustainability estimates in real time; uncertainty propagates across the pipeline with explicit confidence bounds. We formalise fashion engineering as constraint satisfaction over three domains and demonstrate how Textile IR's scene-graph representation enables AI systems to manipulate garments as structured programs rather than pixel arrays. The framework addresses the compound uncertainty problem: when measurement errors in material testing, simulation approximations, and LCA database gaps combine, sustainability claims become unreliable without explicit uncertainty tracking. We propose six research priorities and discuss deployment considerations for fashion SMEs where integrated workflows reduce specialised engineering requirements.
Key contribution: a formal representation that makes engineering constraints perceptible, manipulable, and immediately consequential -- enabling designers to navigate sustainability, manufacturability, and aesthetic tradeoffs simultaneously rather than discovering conflicts after costly physical prototyping.

[113] StableDPT: Temporal Stable Monocular Video Depth Estimation

Ivan Sobko,Hayko Riemenschneider,Markus Gross,Christopher Schroers

Main category: cs.CV

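TL;DR: The paper proposes StableDPT, which adds a cross-attention-based temporal module to the DPT head to improve the temporal stability and accuracy of monocular depth estimation on video, and supports efficient inference on videos of arbitrary length.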

Details

Motivation: Image-based monocular depth estimation models applied directly to video suffer from temporal instability and flickering artifacts, and they handle long video sequences poorly. Method: Starting from a pretrained ViT encoder, the DPT structure is enhanced with trainable temporal cross-attention layers in its prediction head, which use global context from keyframes for depth estimation; a new non-overlapping window inference strategy avoids scale misalignment and redundant computation. Result: On multiple benchmark datasets the method improves temporal consistency, achieves competitive state-of-the-art performance, and runs 2x faster in real-world scenarios. Conclusion: StableDPT improves the temporal stability and efficiency of image-level depth estimation models in video applications and is general and practical. Abstract: Applying single-image Monocular Depth Estimation (MDE) models to video sequences introduces significant temporal instability and flickering artifacts. We propose a novel approach that adapts any state-of-the-art image-based depth estimation model for video processing by integrating a new temporal module - trainable on a single GPU in a few days. Our architecture StableDPT builds upon an off-the-shelf Vision Transformer (ViT) encoder and enhances the Dense Prediction Transformer (DPT) head. The core of our contribution lies in the temporal layers within the head, which use an efficient cross-attention mechanism to integrate information from keyframes sampled across the entire video sequence. This allows the model to capture global context and inter-frame relationships, leading to more accurate and temporally stable depth predictions. Furthermore, we propose a novel inference strategy for processing videos of arbitrary length, avoiding the scale misalignment and redundant computations associated with overlapping windows used in other methods. Evaluations on multiple benchmark datasets demonstrate improved temporal consistency, competitive state-of-the-art performance, and 2x faster processing in real-world scenarios.
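
The inference scheme described above (keyframes drawn from the whole video, frames processed in windows that do not overlap) might be planned along these lines. The abstract does not say how keyframes are chosen or how large windows are, so uniform keyframe spacing and a fixed window length are assumptions, and `plan_inference` is a hypothetical helper name.

```python
from typing import List, Tuple


def plan_inference(
    num_frames: int, window: int, num_keyframes: int
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Return (keyframe indices, non-overlapping [start, end) windows).

    Assumption: keyframes are spaced uniformly over the full sequence so
    every window can attend to the same global context.
    """
    if num_frames <= 0 or window <= 0 or num_keyframes <= 0:
        raise ValueError("all arguments must be positive")

    if num_keyframes == 1:
        keyframes = [0]
    else:
        steps = num_keyframes - 1
        # Integer arithmetic avoids rounding surprises; set() drops
        # duplicates when there are more keyframes than frames.
        keyframes = sorted({i * (num_frames - 1) // steps for i in range(num_keyframes)})

    # Each frame falls into exactly one window, so nothing is computed twice.
    windows = [
        (start, min(start + window, num_frames))
        for start in range(0, num_frames, window)
    ]
    return keyframes, windows
```

Since every frame is predicted exactly once, no blending of overlapping predictions is needed, which is where the abstract locates the savings over overlapping-window approaches.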

[114] Topology-aware Pathological Consistency Matching for Weakly-Paired IHC Virtual Staining

Mingzhou Jiang,Jiaying Zhou,Nan Zeng,Mickael Li,Qijie Tang,Chao He,Huazhu Fu,Honghui He

Main category: cs.CV

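TL;DR: The paper proposes a topology-aware framework for H&E-to-IHC virtual staining that uses graph contrastive learning and topological perturbations to overcome spatial misalignment, improving pathological consistency and generation quality.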

Details

Motivation: Immunohistochemical (IHC) staining is complex, time-consuming, and expensive, which limits its clinical use, and existing virtual staining methods are hampered in supervised learning because the paired slides they use are spatially misaligned; a more robust approach to high-quality virtual staining is needed. Method: Two mechanisms are proposed. Topology-aware Consistency Matching (TACM) uses graph contrastive learning and topological perturbations to learn matching patterns that remain stable under spatial misalignment; Topology-constrained Pathological Matching (TCPM) aligns pathologically positive regions according to node importance to strengthen pathological consistency. Result: Across four staining tasks on two benchmark datasets, the method outperforms existing state-of-the-art methods in both generation quality and clinical relevance. Conclusion: The proposed topology-aware framework handles the spatial misalignment and local deformations of weakly paired data and improves the performance and clinical applicability of virtual staining. Abstract: Immunohistochemical (IHC) staining provides crucial molecular characterization of tissue samples and plays an indispensable role in the clinical examination and diagnosis of cancers. However, compared with the commonly used Hematoxylin and Eosin (H&E) staining, IHC staining involves complex procedures and is both time-consuming and expensive, which limits its widespread clinical use. Virtual staining converts H&E images to IHC images, offering a cost-effective alternative to clinical IHC staining. Nevertheless, using adjacent slides as ground truth often results in weakly-paired data with spatial misalignment and local deformations, hindering effective supervised learning. To address these challenges, we propose a novel topology-aware framework for H&E-to-IHC virtual staining. Specifically, we introduce a Topology-aware Consistency Matching (TACM) mechanism that employs graph contrastive learning and topological perturbations to learn robust matching patterns despite spatial misalignments, ensuring structural consistency. Furthermore, we propose a Topology-constrained Pathological Matching (TCPM) mechanism that aligns pathological positive regions based on node importance to enhance pathological consistency. Extensive experiments on two benchmarks across four staining tasks demonstrate that our method outperforms state-of-the-art approaches, achieving superior generation quality with higher clinical relevance.

[115] SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models

Ruiyang Zhang,Dongzhan Zhou,Zhedong Zheng

Main category: cs.CV

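TL;DR: The paper proposes SketchThinker-R1, which imitates human sketch-style reasoning to reduce the reasoning tokens consumed by large models and improve efficiency.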

Details

Motivation: Long-chain reasoning is computationally expensive and hurts inference efficiency, whereas humans often use concise, goal-directed sketch-style reasoning. Method: The approach has three stages: a sketch-mode cold start (converting long reasoning into sketch-style reasoning and fine-tuning the model), training SketchJudge, a reward model that evaluates the reasoning process, and reinforcement learning guided by that reward model to generalize sketch-style reasoning. Result: On four benchmarks the method cuts reasoning token cost by over 64% without loss of answer accuracy; qualitative analysis shows a stronger focus on key cues. Conclusion: SketchThinker-R1 achieves efficient, compact reasoning that balances performance and cost. Abstract: Despite the empirical success of extensive, step-by-step reasoning in large multimodal models, long reasoning processes inevitably incur substantial computational overhead, i.e., in terms of higher token costs and increased response time, which undermines inference efficiency. In contrast, humans often employ sketch-style reasoning: a concise, goal-directed cognitive process that prioritizes salient information and enables efficient problem-solving. Inspired by this cognitive efficiency, we propose SketchThinker-R1, which incentivizes sketch-style reasoning ability in large multimodal models. Our method consists of three primary stages. In the Sketch-Mode Cold Start stage, we convert standard long reasoning processes into sketch-style reasoning and fine-tune the base multimodal model, instilling initial sketch-style reasoning capability. Next, we train the SketchJudge Reward Model, which explicitly evaluates the thinking process of the model and assigns higher scores to sketch-style reasoning. Finally, we conduct Sketch-Thinking Reinforcement Learning under the supervision of SketchJudge to further generalize sketch-style reasoning ability. Experimental evaluation on four benchmarks reveals that our SketchThinker-R1 achieves over 64% reduction in reasoning token cost without compromising final answer accuracy. Qualitative analysis further shows that sketch-style reasoning focuses more on key cues during problem solving.

[116] DGA-Net: Enhancing SAM with Depth Prompting and Graph-Anchor Guidance for Camouflaged Object Detection

Yuetong Li,Qing Zhang,Yilin Zhao,Gongyang Li,Zeming Liu

Main category: cs.CV

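TL;DR: This paper proposes DGA-Net, a new framework for camouflaged object detection (COD) that adapts the Segment Anything Model (SAM) through a "depth prompting" paradigm. Unlike methods that rely on sparse prompts, it uses dense depth prompts for more comprehensive guidance.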

Details

Motivation: The goal is to fully exploit depth cues in camouflaged object detection and overcome the limitation of existing methods that use only sparse prompts such as points or boxes. Method: A Cross-modal Graph Enhancement (CGE) module fuses RGB semantics and depth geometry in a heterogeneous graph to produce a unified guidance signal; an Anchor-Guided Refinement (AGR) module builds a global anchor and establishes deep-to-shallow non-local pathways to propagate the guidance, preventing information decay across the feature hierarchy. Result: Experiments show that DGA-Net outperforms current state-of-the-art camouflaged object detection methods in both quantitative and qualitative evaluation. Conclusion: Through its depth prompting mechanism, DGA-Net integrates depth information effectively and achieves more precise and consistent segmentation in camouflaged object detection. Abstract: To fully exploit depth cues in Camouflaged Object Detection (COD), we present DGA-Net, a specialized framework that adapts the Segment Anything Model (SAM) via a novel "depth prompting" paradigm. Distinguished from existing approaches that primarily rely on sparse prompts (e.g., points or boxes), our method introduces a holistic mechanism for constructing and propagating dense depth prompts. Specifically, we propose a Cross-modal Graph Enhancement (CGE) module that synthesizes RGB semantics and depth geometry within a heterogeneous graph to form a unified guidance signal. Furthermore, we design an Anchor-Guided Refinement (AGR) module. To counteract the inherent information decay in feature hierarchies, AGR forges a global anchor and establishes direct non-local pathways to broadcast this guidance from deep to shallow layers, ensuring precise and consistent segmentation. Quantitative and qualitative experimental results demonstrate that our proposed DGA-Net outperforms the state-of-the-art COD methods.

[117] Breaking Self-Attention Failure: Rethinking Query Initialization for Infrared Small Target Detection

Yuteng Liu,Duanni Meng,Maoxun Yuan,Xingxing Wei

Main category: cs.CV

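TL;DR: This paper proposes SEF-DETR, a new framework for infrared small target detection (IRSTD) that improves query initialization to cope with low signal-to-noise ratios and complex backgrounds.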

Details

Motivation: Because background features dominate in the self-attention mechanism, existing DETR-based methods perform poorly on IRSTD, so a more reliable query initialization is needed. Method: The SEF-DETR framework contains three modules, Frequency-guided Patch Screening (FPS), Dynamic Embedding Enhancement (DEE), and Reliability-Consistency-aware Fusion (RCF), which suppress background interference and strengthen target-relevant features. Result: Experiments on three public IRSTD datasets show that SEF-DETR outperforms current state-of-the-art methods in detection performance. Conclusion: By introducing frequency-domain information and strengthening target-aware multi-scale representations, SEF-DETR improves the accuracy and robustness of infrared small target detection. Abstract: Infrared small target detection (IRSTD) faces significant challenges due to the low signal-to-noise ratio (SNR), small target size, and complex cluttered backgrounds. Although recent DETR-based detectors benefit from global context modeling, they exhibit notable performance degradation on IRSTD. We revisit this phenomenon and reveal that the target-relevant embeddings of IRST are inevitably overwhelmed by dominant background features due to the self-attention mechanism, leading to unreliable query initialization and inaccurate target localization. To address this issue, we propose SEF-DETR, a novel framework that refines query initialization for IRSTD. Specifically, SEF-DETR consists of three components: Frequency-guided Patch Screening (FPS), Dynamic Embedding Enhancement (DEE), and Reliability-Consistency-aware Fusion (RCF). The FPS module leverages the Fourier spectrum of local patches to construct a target-relevant density map, suppressing background-dominated features. DEE strengthens multi-scale representations in a target-aware manner, while RCF further refines object queries by enforcing spatial-frequency consistency and reliability. Extensive experiments on three public IRSTD datasets demonstrate that SEF-DETR achieves superior detection performance compared to state-of-the-art methods, delivering a robust and efficient solution for the infrared small target detection task.

[118] Towards Agnostic and Holistic Universal Image Segmentation with Bit Diffusion

Jakob Lønborg Christensen,Morten Rieger Hannemose,Anders Bjorholm Dahl,Vedrana Andersen Dahl

Main category: cs.CV

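TL;DR: This paper proposes a diffusion-based framework for universal image segmentation. With several key adaptations it performs end-to-end segmentation without relying on mask-based frameworks, handles discrete data effectively, and can model uncertainty.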

Details

Motivation: The aim is to move beyond conventional mask-generation frameworks toward a more general and flexible segmentation method that can model uncertainty. Method: Diffusion models are applied to image segmentation with several key adaptations: a location-aware palette, 2D Gray code ordering, a final tanh activation, and sigmoid loss weighting with x-prediction. Result: Although the current model does not yet surpass leading mask-based architectures, it narrows the performance gap and offers unique capabilities such as principled ambiguity modeling. All models were trained from scratch. Conclusion: The proposed diffusion framework opens a new direction for universal image segmentation; combined with large-scale pretraining or promptable conditioning, it could reach more competitive performance. Abstract: This paper introduces a diffusion-based framework for universal image segmentation, making agnostic segmentation possible without depending on mask-based frameworks and instead predicting the full segmentation in a holistic manner. We present several key adaptations to diffusion models, which are important in this discrete setting. Notably, we show that a location-aware palette with our 2D Gray code ordering improves performance. Adding a final tanh activation function is crucial for discrete data. On optimizing diffusion parameters, the sigmoid loss weighting consistently outperforms alternatives, regardless of the prediction type used, and we settle on x-prediction. While our current model does not yet surpass leading mask-based architectures, it narrows the performance gap and introduces unique capabilities, such as principled ambiguity modeling, that these models lack. All models were trained from scratch, and we believe that combining our proposed improvements with large-scale pretraining or promptable conditioning could lead to competitive models.
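
To see why a Gray code helps when discrete labels are represented as bits, the sketch below encodes segment ids with the standard reflected binary Gray code, maps the bits to values in {-1, +1} (the range a final tanh would produce), and decodes by thresholding at zero. This is the one-dimensional textbook code, not the paper's 2D Gray code ordering or location-aware palette, and the function names are illustrative.

```python
from typing import List


def to_gray(n: int) -> int:
    """Reflected binary Gray code: consecutive integers differ in one bit."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Invert the Gray code by folding in successively shifted copies."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def encode_label(label: int, bits: int) -> List[float]:
    """Map a label to Gray-code bits in {-1.0, +1.0}, most significant first."""
    g = to_gray(label)
    return [1.0 if (g >> b) & 1 else -1.0 for b in reversed(range(bits))]


def decode_label(values: List[float]) -> int:
    """Threshold real-valued outputs at zero and undo the Gray code."""
    g = 0
    for v in values:
        g = (g << 1) | (1 if v > 0 else 0)
    return from_gray(g)
```

Under this encoding a single flipped bit moves the decoded label to a neighbouring code word rather than an arbitrary one, so small errors in the continuous output tend to cause small label changes.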

[119] TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors

Wei-Yuan Cheng,Kai-Po Chang,Chi-Pin Huang,Fu-En Yang,Yu-Chiang Frank Wang

Main category: cs.CV

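TL;DR: This paper proposes TA-Prompting, which enhances video large language models (VideoLLMs) with Temporal Anchors so that they localize event boundaries more precisely and generate coherent dense video captions, outperforming existing methods on several benchmark tasks.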

Details

Motivation: Existing VideoLLMs struggle to identify the temporal boundaries of events in untrimmed videos, so the generated captions lack good temporal alignment and semantic coherence. Method: Proposes TA-Prompting, which uses Temporal Anchors to learn precise event localization and a temporal-aware prompting mechanism to improve the temporal understanding of VideoLLMs; an event coherent sampling strategy selects, at inference time, caption sequences that are coherent across events and match the video modality. Result: Experiments on multiple benchmark datasets show that the method outperforms state-of-the-art VideoLLMs on dense video captioning, moment retrieval, and temporal QA. Conclusion: By introducing Temporal Anchors and event coherent sampling, TA-Prompting improves VideoLLM performance on dense video captioning and temporal understanding, and strengthens the temporal accuracy and cross-modal consistency of the generated captions. Abstract: Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing VideoLLMs still struggle to identify precise event boundaries in untrimmed videos, so the generated captions are not properly grounded. In this paper, we propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding. During inference, in order to properly determine the output caption sequence from an arbitrary number of events presented within a video, we introduce an event coherent sampling strategy to select event captions with sufficient coherence across temporal events and cross-modal similarity with the given video. Through extensive experiments on benchmark datasets, we show that our TA-Prompting compares favorably against state-of-the-art VideoLLMs, yielding superior performance on dense video captioning and temporal understanding tasks including moment retrieval and temporal QA.

[120] Zoom-IQA: Image Quality Assessment with Reliable Region-Aware Reasoning

Guoqiang Liang,Jianyi Wang,Zhonghua Wu,Shangchen Zhou

Main category: cs.CV

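TL;DR: This paper proposes Zoom-IQA, a vision-language-model-based image quality assessment method that explicitly emulates cognitive behaviors such as uncertainty awareness, region reasoning, and iterative refinement to improve the interpretability and robustness of the assessment.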

Details

Motivation: Existing VLM-based image quality assessment methods have limited ability to integrate visual and textual information, which makes their reasoning unreliable and leaves them without accurate region grounding or a stable reasoning process. Method: Proposes a two-stage training pipeline: the first stage performs supervised fine-tuning on the constructed GR-IQA dataset so that the model generates rationales grounded in key regions; the second stage applies reinforcement learning with KL-Coverage regularization and a Progressive Re-sampling Strategy to stabilize exploration and reduce annotation bias. Result: Experiments show that Zoom-IQA achieves improved robustness, explainability, and generalization, and that it can be applied effectively to downstream tasks such as image restoration. Conclusion: By emulating human cognitive behaviors and refining the training strategy, Zoom-IQA makes notable progress in reasoning reliability and consistency of quality assessment, advancing explainable IQA. Abstract: Image Quality Assessment (IQA) is a long-standing problem in computer vision. Previous methods typically focus on predicting numerical scores without explanation or provide low-level descriptions lacking precise scores. Recent reasoning-based vision language models (VLMs) have shown strong potential for IQA, enabling joint generation of quality descriptions and scores. However, we notice that existing VLM-based IQA methods tend to exhibit unreliable reasoning due to their limited capability of integrating visual and textual cues. In this work, we introduce Zoom-IQA, a VLM-based IQA model to explicitly emulate key cognitive behaviors: uncertainty awareness, region reasoning, and iterative refinement. Specifically, we present a two-stage training pipeline: 1) supervised fine-tuning (SFT) on our Grounded-Rationale-IQA (GR-IQA) dataset to teach the model to ground its assessments in key regions; and 2) reinforcement learning (RL) for dynamic policy exploration, primarily stabilized by our KL-Coverage regularizer to prevent reasoning and scoring diversity collapse, and supported by a Progressive Re-sampling Strategy to mitigate annotation bias. Extensive experiments show that Zoom-IQA achieves improved robustness, explainability, and generalization. The application to downstream tasks, such as image restoration, further demonstrates the effectiveness of Zoom-IQA.

Aihua Zheng,Ya Gao,Shihao Li,Chenglong Li,Jin Tang

Main category: cs.CV

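TL;DR: This paper proposes DCG-ReID, a disentangled collaboration and guidance fusion representation method for multi-modal vehicle re-identification (ReID). A dynamic confidence weighting mechanism separates modal data with different quality distributions, and two fusion strategies target balanced and unbalanced quality distributions respectively, markedly improving cross-modal and intra-class consistency.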

Details

Motivation: Existing methods for multi-modal vehicle ReID ignore the conflicting fusion requirements caused by unbalanced modality quality distributions, and they struggle to reconcile intra-class consistency with inter-modal heterogeneity, which limits performance. Method: Proposes the DCG-ReID framework, comprising the Dynamic Confidence-based Disentangling Weighting (DCDW) mechanism, a Collaboration Fusion Module (CFM) for balanced distributions, and a Guidance Fusion Module (GFM) for unbalanced distributions, to achieve adaptive feature fusion. Result: Extensive experiments on three multi-modal ReID benchmarks (WMVeID863, MSVR310, RGBNT100) validate the effectiveness of the method, which outperforms existing fusion strategies. Conclusion: By disentangling the fusion requirements of different quality distributions, DCG-ReID eases the conflict between consistency and heterogeneity in multi-modal vehicle ReID and improves overall retrieval performance. Abstract: Multi-modal vehicle Re-Identification (ReID) aims to leverage complementary information from RGB, Near Infrared (NIR), and Thermal Infrared (TIR) modalities to retrieve the same vehicle. The challenges of multi-modal vehicle ReID arise from the uncertainty of modality quality distribution induced by inherent discrepancies across modalities, resulting in distinct conflicting fusion requirements for data with balanced and unbalanced quality distributions. Existing methods handle all multi-modal data within a single fusion model, overlooking the different needs of the two data types and making it difficult to decouple the conflict between intra-class consistency and inter-modal heterogeneity. To this end, we propose Disentangle Collaboration and Guidance Fusion Representations for Multi-modal Vehicle ReID (DCG-ReID). Specifically, to disentangle heterogeneous quality-distributed modal data without mutual interference, we first design the Dynamic Confidence-based Disentangling Weighting (DCDW) mechanism: dynamically reweighting three-modal contributions via interaction-derived modal confidence to build a disentangled fusion framework. 
Building on DCDW, we develop two scenario-specific fusion strategies: (1) for balanced quality distributions, Collaboration Fusion Module (CFM) mines pairwise consensus features to capture shared discriminative information and boost intra-class consistency; (2) for unbalanced distributions, Guidance Fusion Module (GFM) implements differential amplification of modal discriminative disparities to reinforce dominant modality advantages, guide auxiliary modalities to mine complementary discriminative info, and mitigate inter-modal divergence to boost multi-modal joint decision performance. Extensive experiments on three multi-modal ReID benchmarks (WMVeID863, MSVR310, RGBNT100) validate the effectiveness of our method. Code will be released upon acceptance.

[122] PrismVAU: Prompt-Refined Inference System for Multimodal Video Anomaly Understanding

Iñaki Erregue,Kamal Nasrollahi,Sergio Escalera

Main category: cs.CV

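TL;DR: PrismVAU is a lightweight real-time video anomaly understanding (VAU) system that uses an off-the-shelf multimodal large language model for anomaly scoring, explanation, and prompt optimization, with no fine-tuning or external modules.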

Details

Motivation: Existing VAU methods depend on fine-tuned multimodal large models or external components, which leads to high annotation cost, complex training, and heavy inference overhead, limiting practical use. Method: Proposes PrismVAU, which has two stages: 1) coarse frame-level anomaly scoring based on similarity to textual anchors; 2) an MLLM-based refinement module that explains anomalies in context through system and user prompts. A weakly supervised Automatic Prompt Engineering (APE) framework optimizes the textual anchors and prompts. Result: Experiments on standard VAD benchmarks show that PrismVAU delivers competitive detection performance and interpretable anomaly descriptions without instruction tuning, frame-level annotations, external modules, or dense processing. Conclusion: PrismVAU is an efficient, practical VAU solution suited to real-world scenarios, balancing performance and efficiency. Abstract: Video Anomaly Understanding (VAU) extends traditional Video Anomaly Detection (VAD) by not only localizing anomalies but also describing and reasoning about their context. Existing VAU approaches often rely on fine-tuned multimodal large language models (MLLMs) or external modules such as video captioners, which introduce costly annotations, complex training pipelines, and high inference overhead. In this work, we introduce PrismVAU, a lightweight yet effective system for real-time VAU that leverages a single off-the-shelf MLLM for anomaly scoring, explanation, and prompt optimization. PrismVAU operates in two complementary stages: (1) a coarse anomaly scoring module that computes frame-level anomaly scores via similarity to textual anchors, and (2) an MLLM-based refinement module that contextualizes anomalies through system and user prompts. Both textual anchors and prompts are optimized with a weakly supervised Automatic Prompt Engineering (APE) framework. Extensive experiments on standard VAD benchmarks demonstrate that PrismVAU delivers competitive detection performance and interpretable anomaly explanations -- without relying on instruction tuning, frame-level annotations, external modules, or dense processing -- making it an efficient and practical solution for real-world applications.
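The coarse scoring stage can be pictured with a small sketch. It assumes that frames and textual anchors have already been embedded into a shared vector space and that the score is a two-way softmax over the best-matching "normal" and "anomalous" anchors; the abstract does not give the actual scoring rule, so the function names, the split into two anchor sets, and the temperature are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def anomaly_score(frame_emb, normal_anchors, anomalous_anchors, temperature=0.1):
    """Score in (0, 1): higher means the frame is closer to an anomalous anchor.

    Assumption: the best-matching anchor of each kind is compared with a
    temperature-scaled two-way softmax; this is an illustrative choice only.
    """
    s_normal = max(cosine(frame_emb, a) for a in normal_anchors)
    s_anomalous = max(cosine(frame_emb, a) for a in anomalous_anchors)
    return 1.0 / (1.0 + math.exp((s_normal - s_anomalous) / temperature))
```

In a two-stage design like the one described, a cheap score of this kind would plausibly decide which frames are worth passing to the more expensive MLLM refinement step.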

[123] HybridSolarNet: A Lightweight and Explainable EfficientNet-CBAM Architecture for Real-Time Solar Panel Fault Detection

Md. Asif Hossain,G M Mota-Tahrin Tayef,Nabil Subhan

Main category: cs.CV

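TL;DR: Proposes HybridSolarNet, a lightweight solar panel fault detection model that combines EfficientNet-B0 with a CBAM module and uses focal loss and cosine annealing, achieving accurate and efficient real-time UAV detection while avoiding data leakage.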

Details

Motivation: Existing deep learning models are either too large to run on edge devices or come with biased accuracy estimates, which makes them unsuitable for real-time UAV monitoring of solar panel faults. Method: Proposes HybridSolarNet, which integrates EfficientNet-B0 with the Convolutional Block Attention Module (CBAM), uses focal loss to handle class imbalance, and adopts a cosine annealing learning-rate schedule; on the Kaggle dataset it follows a strict split-before-augmentation protocol to prevent data leakage. Result: Five-fold stratified cross-validation gives an average accuracy of 92.37% ± 0.41% and an F1-score of 0.9226 ± 0.39; the model occupies only 16.3 MB and runs at 54.9 FPS on a GPU, clearly outperforming baselines such as VGG19; ablations show that CBAM raises accuracy by 1.53% and that focal loss helps recognize minority classes; Grad-CAM visualizations show the model attending to actual fault regions. Conclusion: HybridSolarNet is an efficient, lightweight, and accurate solar panel fault detection model suited to deployment on edge computing devices and UAV platforms, with good real-time behavior and practicality. Abstract: Manual inspection of solar panel systems is tedious, costly, and error-prone, which makes Unmanned Aerial Vehicle (UAV) based monitoring desirable. Although deep learning models have excellent fault detection capabilities, almost all methods are either too large and heavy for edge computing devices or yield biased accuracy estimates due to ineffective learning techniques. We propose a new solar panel fault detection model called HybridSolarNet. It integrates EfficientNet-B0 with the Convolutional Block Attention Module (CBAM). We implemented it on the Kaggle Solar Panel Images competition dataset with a strict split-before-augmentation protocol, which avoids leakage in accuracy estimation. We introduced focal loss and cosine annealing. Ablation analysis confirms an accuracy gain from CBAM (+1.53%) and shows that focal loss benefits the recognition of classes with imbalanced samples. The average accuracy under 5-fold stratified cross-validation on the competition dataset reached 92.37% +/- 0.41, with an F1-score of 0.9226 +/- 0.39, surpassing baselines like VGG19 while requiring merely 16.3 MB of storage, i.e., 32 times less. Its inference speed of 54.9 FPS with GPU support makes it a strong candidate for real-time UAV deployment. Moreover, Grad-CAM visualizations illustrate that HybridSolarNet focuses on actual fault locations instead of irrelevant ones.
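For readers unfamiliar with the two training ingredients named above, the following sketch shows the textbook forms of focal loss and cosine annealing in plain Python. The hyperparameters (gamma = 2.0, the learning-rate bounds) are assumptions for illustration; the abstract does not report the values the authors used.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    probs is a list of class probabilities and target the true class index.
    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that decays from lr_max to lr_min along half a cosine wave."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * step / total_steps))
```

The factor (1 - p_t)^gamma shrinks the loss of well-classified samples, so training gradients are dominated by hard examples, which are often the minority classes in an imbalanced dataset.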

[124] VTONQA: A Multi-Dimensional Quality Assessment Dataset for Virtual Try-on

Xinyi Wei,Sijing Wu,Zitong Xu,Yunhao Li,Huiyu Duan,Xiongkuo Min,Guangtao Zhai

Main category: cs.CV

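TL;DR: This paper presents VTONQA, the first multi-dimensional quality assessment dataset dedicated to image-based virtual try-on (VTON), containing 8,132 images and 24,396 subjective scores across three dimensions: clothing fit, body compatibility, and overall quality. Using the dataset, the authors benchmark existing VTON models and image quality assessment metrics, expose the limitations of current methods, and provide a foundation for future research.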

Details

Motivation: Existing VTON models often produce garment distortion and body inconsistency, yet reliable quality evaluation is lacking, so a multi-dimensional quality assessment dataset specific to VTON is needed to advance both models and evaluation methods. Method: Builds the VTONQA dataset, which contains 8,132 images generated by 11 representative VTON models together with 24,396 mean opinion scores (MOSs) across three dimensions, and uses it to benchmark existing VTON models and a range of image quality assessment metrics. Result: VTONQA is the first multi-dimensional quality assessment dataset for VTON; the benchmarks show that existing VTON models and IQA metrics fall clearly short in perceptual alignment. Conclusion: The VTONQA dataset and its benchmarks provide a reliable basis for quality evaluation of VTON images and should help advance evaluation methods that better match human perception as well as higher-quality VTON models. Abstract: With the rapid development of e-commerce and digital fashion, image-based virtual try-on (VTON) has attracted increasing attention. However, existing VTON models often suffer from artifacts such as garment distortion and body inconsistency, highlighting the need for reliable quality evaluation of VTON-generated images. To this end, we construct VTONQA, the first multi-dimensional quality assessment dataset specifically designed for VTON, which contains 8,132 images generated by 11 representative VTON models, along with 24,396 mean opinion scores (MOSs) across three evaluation dimensions (i.e., clothing fit, body compatibility, and overall quality). Based on VTONQA, we benchmark both VTON models and a diverse set of image quality assessment (IQA) metrics, revealing the limitations of existing methods and highlighting the value of the proposed dataset. We believe that the VTONQA dataset and corresponding benchmarks will provide a solid foundation for perceptually aligned evaluation, benefiting both the development of quality assessment methods and the advancement of VTON models.

[125] LAMS-Edit: Latent and Attention Mixing with Schedulers for Improved Content Preservation in Diffusion-Based Image and Style Editing

Wingwa Fu,Takayuki Okatani

Main category: cs.CV

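TL;DR: Proposes the LAMS-Edit framework, which uses intermediate states from the inversion process of diffusion models and mixes latent representations and attention maps to balance content preservation against edit application in text-to-image editing.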

Details

Motivation: To address the trade-off between content preservation and edit strength in text-to-image editing, and to improve real-image editing. Method: Proposes Latent and Attention Mixing with Schedulers (LAMS), which at each step fuses the latent representations and attention maps of the inversion and generation processes through scheduler-controlled weighted interpolation, and combines this with Prompt-to-Prompt and LoRA for precise editing and style transfer. Result: Experiments show that LAMS-Edit performs well in both content preservation and edit application, and that it supports region-mask editing and LoRA-based style transfer. Conclusion: LAMS-Edit balances editing precision against preservation of the original content and is an extensible, flexible text-to-image editing framework. Abstract: Text-to-Image editing using diffusion models faces challenges in balancing content preservation with edit application and handling real-image editing. To address these, we propose LAMS-Edit, leveraging intermediate states from the inversion process--an essential step in real-image editing--during edited image generation. Specifically, latent representations and attention maps from both processes are combined at each step using weighted interpolation, controlled by a scheduler. This technique, Latent and Attention Mixing with Schedulers (LAMS), integrates with Prompt-to-Prompt (P2P) to form LAMS-Edit--an extensible framework that supports precise editing with region masks and enables style transfer via LoRA. Extensive experiments demonstrate that LAMS-Edit effectively balances content preservation and edit application.
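The scheduler-controlled weighted interpolation can be sketched in a few lines. The linear schedule, its direction (favoring the inversion states early and the generation states late), and the function names are assumptions for illustration; the abstract only states that a scheduler controls the mixing weight at each step.

```python
def linear_schedule(step, num_steps, w_start=1.0, w_end=0.0):
    """Mixing weight for a given denoising step, moving linearly from w_start to w_end."""
    if num_steps <= 1:
        return w_end
    frac = step / (num_steps - 1)
    return w_start + (w_end - w_start) * frac


def mix(inversion_values, generation_values, weight):
    """Element-wise weighted interpolation: weight * inversion + (1 - weight) * generation."""
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(inversion_values, generation_values)]
```

The same `mix` call would apply to flattened latents and to attention maps alike. Under this assumed schedule, early steps stay close to the source image's inversion trajectory (preserving content) while later steps let the edited generation dominate.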

[126] ULS+: Data-driven Model Adaptation Enhances Lesion Segmentation

Rianne Weber,Niels Rocholl,Max de Grauw,Mathias Prokop,Ewoud Smit,Alessa Hering

Main category: cs.CV

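TL;DR: ULS+ is an improved version of the ULS model. By adding new datasets and using a smaller input size, it clearly outperforms the original model in both lesion segmentation accuracy and inference speed, and it ranks first in the ULS23 Challenge.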

Details

Motivation: To use newly released public datasets and optimized model settings to improve the performance and clinical applicability of a universal lesion segmentation model. Method: Starting from the original ULS model, trains on several additional public datasets and uses a smaller input image size to improve efficiency and accuracy. Result: On the ULS23 Challenge test data and a subset of the Longitudinal-CT dataset, ULS+ significantly outperforms ULS in Dice score and in robustness to click-point location. Conclusion: Through a cycle of data-driven updates and clinical validation, ULS+ lays a foundation for reliable, clinically valuable whole-body lesion segmentation models. Abstract: In this study, we present ULS+, an enhanced version of the Universal Lesion Segmentation (ULS) model. The original ULS model segments lesions across the whole body in CT scans given volumes of interest (VOIs) centered around a click-point. Since its release, several new public datasets have become available that can further improve model performance. ULS+ incorporates these additional datasets and uses smaller input image sizes, resulting in higher accuracy and faster inference. We compared ULS and ULS+ using the Dice score and robustness to click-point location on the ULS23 Challenge test data and a subset of the Longitudinal-CT dataset. In all comparisons, ULS+ significantly outperformed ULS. Additionally, ULS+ ranks first on the ULS23 Challenge test-phase leaderboard. By maintaining a cycle of data-driven updates and clinical validation, ULS+ establishes a foundation for robust and clinically relevant lesion segmentation models.
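Since the comparison rests on the Dice score, a minimal reference implementation of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), may be useful. It operates on flattened binary masks; returning 1.0 when both masks are empty is a common convention and an assumption here, not something stated in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention (assumed): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

For example, a prediction that covers one of two lesion voxels and adds one false-positive voxel scores 2 * 1 / (2 + 2) = 0.5.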

[127] Towards Faithful Reasoning in Comics for Small MLLMs

Chengcheng Feng,Haojie Yin,Yucheng Jin,Kaizhu Huang

Main category: cs.CV

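TL;DR: Proposes a new comic reasoning framework that combines modular chain-of-thought generation, GRPO-based reinforcement fine-tuning, and a structured reward, substantially improving small multimodal large language models on comic visual question answering and other humor-related tasks.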

Details

Motivation: Standard Chain-of-Thought (CoT) prompting performs poorly on comic-based visual question answering (CVQA); small models in particular suffer from state entanglement, spurious transitions, and exploration inefficiency, so a more effective reasoning framework is needed. Method: Proposes modular CoT generation, a GRPO-based reinforcement fine-tuning strategy, and a new structured reward mechanism to make reasoning chains more accurate and transferable. Result: On five challenging benchmarks, the 3B-parameter model surpasses existing state-of-the-art methods, and plug-in experiments yield an average improvement of 12.1%. Conclusion: The framework addresses the reasoning deficiencies of small models in CVQA and generalizes well to abstract, humor-centric visual tasks such as comics, memes, and editorial cartoons. Abstract: Comic-based visual question answering (CVQA) poses distinct challenges to multimodal large language models (MLLMs) due to its reliance on symbolic abstraction, narrative logic, and humor, which differ from conventional VQA tasks. Although Chain-of-Thought (CoT) prompting is widely used to enhance MLLM reasoning, surprisingly, its direct application to CVQA often degrades performance, especially in small-scale models. Our theoretical and empirical analyses reveal that standard CoT in CVQA suffers from state entanglement, spurious transitions, and exploration inefficiency, with small models particularly vulnerable in resource-constrained settings. To address these issues, we propose a novel comic reasoning framework, designed to produce more faithful and transferable reasoning chains in small MLLMs. Specifically, our framework combines modular CoT generation with GRPO-based reinforcement fine-tuning and a novel structured reward. Beyond comic VQA, we further evaluate our approach on a broader class of humor-centric and abstract visual reasoning tasks, including meme understanding and editorial cartoon interpretation. Across five challenging benchmarks, our 3B model outperforms state-of-the-art methods, and plug-in experiments yield an additional average improvement of $\mathbf{12.1\%}$ across different MLLMs.

[128] Towards Efficient 3D Object Detection for Vehicle-Infrastructure Collaboration via Risk-Intent Selection

Li Wang,Boqi Li,Hang Chen,Xingjian Wu,Yichen Wang,Jiewen Tan,Xinyu Zhang,Huaping Liu

Main category: cs.CV

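TL;DR: Proposes the RiSe framework, which uses joint risk-intent awareness to perform semantic-selective feature fusion in vehicle-infrastructure collaborative perception, sharply reducing communication overhead while maintaining high detection accuracy.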

Details

Motivation: Existing collaborative perception methods face a trade-off between communication bandwidth and feature redundancy when handling occlusion; static compression methods in particular cannot remove redundant features from non-critical regions. Method: Proposes the RiSe framework, which includes a Potential Field-Trajectory Correlation Model (PTCM) grounded in potential field theory to quantify kinematic risk, and an Intention-Driven Area Prediction Module (IDAPM) that uses ego-motion priors to predict key BEV areas, so that only high-fidelity features from high-interaction regions are transmitted. Result: Experiments on the DeepAccident dataset show that communication volume drops to 0.71% of full feature sharing while state-of-the-art detection accuracy is maintained, clearly outperforming existing methods. Conclusion: Through interaction-aware selective fusion, RiSe achieves efficient and accurate collaborative perception at very low bandwidth, offering a new optimization direction for vehicle-infrastructure cooperative systems. Abstract: Vehicle-Infrastructure Collaborative Perception (VICP) is pivotal for resolving occlusion in autonomous driving, yet the trade-off between communication bandwidth and feature redundancy remains a critical bottleneck. While intermediate fusion mitigates data volume compared to raw sharing, existing frameworks typically rely on spatial compression or static confidence maps, which inefficiently transmit spatially redundant features from non-critical background regions. To address this, we propose Risk-intent Selective detection (RiSe), an interaction-aware framework that shifts the paradigm from identifying visible regions to prioritizing risk-critical ones. Specifically, we introduce a Potential Field-Trajectory Correlation Model (PTCM) grounded in potential field theory to quantitatively assess kinematic risks. Complementing this, an Intention-Driven Area Prediction Module (IDAPM) leverages ego-motion priors to proactively predict and filter key Bird's-Eye-View (BEV) areas essential for decision-making. By integrating these components, RiSe implements a semantic-selective fusion scheme that transmits high-fidelity features only from high-interaction regions, effectively acting as a feature denoiser. Extensive experiments on the DeepAccident dataset demonstrate that our method reduces communication volume to 0.71\% of full feature sharing while maintaining state-of-the-art detection accuracy, establishing a competitive Pareto frontier between bandwidth efficiency and perception performance.

[129] ReCCur: A Recursive Corner-Case Curation Framework for Robust Vision-Language Understanding in Open and Edge Scenarios

Yihan Wei,Shenghai Yuan,Tianchen Deng,Boyang Lou,Enwen Hu

Main category: cs.CV

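TL;DR: ReCCur is a low-compute recursive framework that converts noisy web images into auditable fine-grained labels through a multi-agent pipeline, for efficiently mining rare or extreme real-world scenarios (corner cases).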

Details

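Motivation: Corner cases (rare or extreme scenarios) are uncommon but often cause real-world systems to fail. Conventional methods struggle to collect high-quality corner-case data at scale because web data are noisy, labels are brittle, and edge deployment rules out frequent retraining. An efficient mining method that needs little human effort and suits resource-constrained environments is therefore required. Method: Proposes the ReCCur framework with three stages: 1) large-scale data acquisition and filtering: a vision-language model expands the domain vocabulary, web data are crawled, and candidates are obtained through tri-modal (image, description, keyword) consistency filtering plus light human spot checks; 2) mixture-of-experts knowledge distillation: several encoders (e.g., CLIP, DINOv2, BEiT) perform kNN voting, with dual-confidence activation and uncertainty sampling to raise precision; 3) region-evidence VLM adversarial labeling: a proposer generates multi-granularity regions and semantic cues, and a validator checks global and local chained consistency, producing explainable labels and closing the loop. Result: Validated on realistic corner-case scenarios (e.g., flooded-car inspection), ReCCur runs on consumer-grade GPUs, steadily improves data purity and class separability, and greatly reduces the need for human supervision, making it suitable for downstream training and evaluation under resource constraints. Conclusion: ReCCur offers a workable way to build high-quality corner-case datasets with little compute and little human effort, supporting the development of robust vision systems for practical applications. Abstract: Corner cases are rare or extreme scenarios that drive real-world failures, but they are difficult to curate at scale: web data are noisy, labels are brittle, and edge deployments preclude large retraining. We present ReCCur (Recursive Corner-Case Curation), a low-compute framework that converts noisy web imagery into auditable fine-grained labels via a multi-agent recursive pipeline. First, large-scale data acquisition and filtering expands a domain vocabulary with a vision-language model (VLM), crawls the web, and enforces tri-modal (image, description, keyword) consistency with light human spot checks to yield refined candidates. Next, mixture-of-experts knowledge distillation uses complementary encoders (e.g., CLIP, DINOv2, BEiT) for kNN voting with dual-confidence activation and uncertainty sampling, converging to a high-precision set. Finally, region-evidence VLM adversarial labeling pairs a proposer (multi-granularity regions and semantic cues) with a validator (global and local chained consistency) to produce explainable labels and close the loop. On realistic corner-case scenarios (e.g., flooded-car inspection), ReCCur runs on consumer-grade GPUs, steadily improves purity and separability, and requires minimal human supervision, providing a practical substrate for downstream training and evaluation under resource constraints. Code and dataset will be released.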

[130] SA-ResGS: Self-Augmented Residual 3D Gaussian Splatting for Next Best View Selection

Kim Jun-Seong,Tae-Hyun Oh,Eduardo Pérez-Pellitero,Youngkyoon Jang

Main category: cs.CV

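TL;DR: Proposes the SA-ResGS framework, which uses self-augmented point clouds and residual learning to improve uncertainty estimation and supervision for 3D Gaussian Splatting in active scene reconstruction.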

Details

Motivation: With sparse, wide-baseline views, existing methods under-supervise Gaussians and produce unreliable uncertainty estimates, which harms the stability and coverage efficiency of view selection in active reconstruction. Method: 1) Generates Self-Augmented point clouds (SA-Points) by triangulating between training views and extrapolated rendered views, enabling physically guided view selection; 2) introduces a residual learning strategy tailored to 3D Gaussian Splatting that combines uncertainty-driven filtering with sampling inspired by dropout and hard-negative mining, strengthening gradient flow in high-uncertainty Gaussians. Result: On active view selection, SA-ResGS outperforms state-of-the-art methods in reconstruction quality and view selection robustness, achieving more uniform scene coverage and more stable uncertainty estimates. Conclusion: Through self-augmentation and residual supervision, SA-ResGS mitigates supervision bias and distorted uncertainty under sparse views, providing a stable and reliable framework for 3DGS-based active reconstruction. Abstract: We propose Self-Augmented Residual 3D Gaussian Splatting (SA-ResGS), a novel framework to stabilize uncertainty quantification and enhance uncertainty-aware supervision in next-best-view (NBV) selection for active scene reconstruction. SA-ResGS improves both the reliability of uncertainty estimates and their effectiveness for supervision by generating Self-Augmented point clouds (SA-Points) via triangulation between a training view and a rasterized extrapolated view, enabling efficient scene coverage estimation. While improving scene coverage through physically guided view selection, SA-ResGS also addresses the challenge of under-supervised Gaussians, exacerbated by sparse and wide-baseline views, by introducing the first residual learning strategy tailored for 3D Gaussian Splatting. This targeted supervision enhances gradient flow in high-uncertainty Gaussians by combining uncertainty-driven filtering with dropout- and hard-negative-mining-inspired sampling. Our contributions are threefold: (1) a physically grounded view selection strategy that promotes efficient and uniform scene coverage; (2) an uncertainty-aware residual supervision scheme that amplifies learning signals for weakly contributing Gaussians, improving training stability and uncertainty estimation across scenes with diverse camera distributions; (3) an implicit unbiasing of uncertainty quantification as a consequence of constrained view selection and residual supervision, which together mitigate conflicting effects of wide-baseline exploration and sparse-view ambiguity in NBV planning. 
Experiments on active view selection demonstrate that SA-ResGS outperforms state-of-the-art baselines in both reconstruction quality and view selection robustness.

[131] Flow Matching and Diffusion Models via PointNet for Generating Fluid Fields on Irregular Geometries

Ali Kashefi

Main category: cs.CV

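TL;DR: Proposes two new generative geometric deep learning frameworks, Flow Matching PointNet and Diffusion PointNet, which combine PointNet with flow matching and diffusion models to predict fluid flow variables directly on irregular geometries represented as point clouds, avoiding the pixelation limits and high-frequency noise of earlier methods with a simpler, unified architecture.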

Details

Motivation: Existing diffusion models based on graph neural networks show high-frequency noise when predicting flow fields and rely on auxiliary networks for geometric conditioning; projecting geometries onto regular grids also introduces discretization error. A structurally simple, robust generative model that handles irregular geometries directly is therefore needed. Method: Embeds PointNet into flow matching and diffusion models to build Flow Matching PointNet and Diffusion PointNet, which reconstruct physical fields from Gaussian noise through a reverse generative process and operate directly on point-cloud representations of the computational domain (e.g., grid vertices of finite-volume meshes), with no pixelation or extra intermediate networks. Result: Validated on a flow-past-a-cylinder dataset, the proposed methods predict velocity and pressure fields as well as lift and drag forces more accurately than a vanilla PointNet with the same number of parameters, are more robust to incomplete geometries, and show no high-frequency noise. Conclusion: Flow Matching PointNet and Diffusion PointNet offer an efficient, accurate, and robust approach to flow-field prediction on complex irregular geometries, illustrating the potential of combining generative models with geometric deep learning in scientific computing. Abstract: We present two novel generative geometric deep learning frameworks, termed Flow Matching PointNet and Diffusion PointNet, for predicting fluid flow variables on irregular geometries by incorporating PointNet into flow matching and diffusion models, respectively. In these frameworks, a reverse generative process reconstructs physical fields from standard Gaussian noise conditioned on unseen geometries. The proposed approaches operate directly on point-cloud representations of computational domains (e.g., grid vertices of finite-volume meshes) and therefore avoid the limitations of pixelation used to project geometries onto uniform lattices. In contrast to graph neural network-based diffusion models, Flow Matching PointNet and Diffusion PointNet do not exhibit high-frequency noise artifacts in the predicted fields. Moreover, unlike such approaches, which require auxiliary intermediate networks to condition geometry, the proposed frameworks rely solely on PointNet, resulting in a simple and unified architecture. The performance of the proposed frameworks is evaluated on steady incompressible flow past a cylinder, using a geometric dataset constructed by varying the cylinder's cross-sectional shape and orientation across samples. The results demonstrate that Flow Matching PointNet and Diffusion PointNet achieve more accurate predictions of velocity and pressure fields, as well as lift and drag forces, and exhibit greater robustness to incomplete geometries compared to a vanilla PointNet with the same number of trainable parameters.
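As background on the flow matching side, the sketch below shows the widely used linear-path formulation: a training pair is built by interpolating between a noise sample and a data sample, the regression target is their difference, and sampling integrates a velocity field from noise with Euler steps. It is an assumption that the paper uses this particular path; the abstract does not say. In the paper's setting, `velocity_fn` would presumably be a PointNet conditioned on the geometry, whereas here it is a hypothetical placeholder callable.

```python
def interpolate(noise, data, t):
    """Point on the straight path from noise (t = 0) to data (t = 1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Regression target for the velocity network along the linear path."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(noise, velocity_fn, num_steps=10):
    """Generate a sample by integrating dx/dt = velocity_fn(x, t) from t = 0 to 1."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

As a sanity check, if `velocity_fn` returned the exact target velocity for a fixed noise/data pair, Euler integration would land on the data sample, since the path is a straight line.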

[132] Motion Blur Robust Wheat Pest Damage Detection with Dynamic Fuzzy Feature Fusion

Han Zhang,Yanwei Wang,Fang Li,Hongjun Wang

Main category: cs.CV

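TL;DR: This paper proposes DFRCP, a Dynamic Fuzzy Robust Convolutional Pyramid that serves as a plug-in upgrade to YOLOv11 for better object detection under motion blur. It fuses multi-scale features and adaptively injects fuzzy features, strengthening global perception while preserving the original representations, and it includes an efficient CUDA parallel kernel to support deployment on edge devices.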

Details

Motivation: Motion blur produces ghosting artifacts in object detection. Existing methods either treat blur as noise and lose structural information, or perform full image restoration, whose latency is too high for resource-constrained devices. A blur-robust detection scheme that preserves key structure and runs efficiently is therefore needed. Method: Proposes the DFRCP module to enhance the YOLOv11 feature pyramid: 1) it fuses large-scale and medium-scale features while preserving native representations; 2) it introduces Dynamic Robust Switch units that generate fuzzy features by rotation and nonlinear interpolation and inject them adaptively; 3) it uses a transparency convolution to learn a content-adaptive trade-off between original and fuzzy cues; 4) it provides a CUDA parallel rotation and interpolation kernel that avoids boundary overflow and greatly improves computational efficiency. Result: The model is trained on a private wheat pest damage dataset of about 3,500 images, augmented threefold with two blur regimes. On blurred test sets, YOLOv11 with DFRCP is about 10.4% more accurate than the baseline, with only a modest training overhead. Conclusion: DFRCP improves YOLOv11 detection accuracy under motion blur while remaining computationally efficient and feasible for edge deployment, reducing the need for manual filtering after data collection. Abstract: Motion blur caused by camera shake produces ghosting artifacts that substantially degrade edge-side object detection. Existing approaches either suppress blur as noise and lose discriminative structure, or apply full image restoration that increases latency and limits deployment on resource-constrained devices. We propose DFRCP, a Dynamic Fuzzy Robust Convolutional Pyramid, as a plug-in upgrade to YOLOv11 for blur-robust detection. DFRCP enhances the YOLOv11 feature pyramid by combining large-scale and medium-scale features while preserving native representations, and by introducing Dynamic Robust Switch units that adaptively inject fuzzy features to strengthen global perception under jitter. Fuzzy features are synthesized by rotating and nonlinearly interpolating multiscale features, then merged through a transparency convolution that learns a content-adaptive trade-off between original and fuzzy cues. We further develop a CUDA parallel rotation and interpolation kernel that avoids boundary overflow and delivers more than 400 times speedup, making the design practical for edge deployment. We train with paired supervision on a private wheat pest damage dataset of about 3,500 images, augmented threefold using two blur regimes: uniform image-wide motion blur and bounding-box-confined rotational blur. On blurred test sets, YOLOv11 with DFRCP achieves about 10.4 percent higher accuracy than the YOLOv11 baseline with only a modest training time overhead, reducing the need for manual filtering after data collection.

[133] On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning

Siyi Lyu,Quan Liu,Feng Yan

Main category: cs.CV

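TL;DR: This paper argues that Vision Transformers (ViTs) fail at spatial reasoning because of the circuit-complexity limits of the architecture rather than data scale. Formalizing spatial understanding as a group homomorphism, the authors show that for non-solvable groups (such as the 3D rotation group SO(3)) maintaining a structure-preserving embedding is lower-bounded by an NC¹-complete problem, whereas constant-depth ViTs are confined to the complexity class TC⁰. Under the conjecture TC⁰ ⊊ NC¹, such models cannot efficiently capture non-solvable spatial structure, and latent-space probing supports this complexity gap.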

Details

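Motivation: Vision Transformers excel at semantic recognition but fail systematically on spatial reasoning tasks such as mental rotation. This is usually attributed to data scale, but the authors argue that the root cause is the limited circuit complexity of the architecture itself, and they seek to explain the limitation through computational complexity theory. Method: Formalizes spatial understanding as learning a group homomorphism that maps image sequences to a latent space preserving the algebraic structure of the underlying transformation group; uses complexity theory to derive a lower bound for this task and compares it with the computational power of ViTs (TC⁰); and runs latent-space probing experiments to test whether ViT representations collapse structurally on non-solvable group tasks. Result: Shows that for non-solvable groups (such as SO(3)), maintaining a structure-preserving mapping is NC¹-complete (equivalent to the Word Problem), while constant-depth ViTs with polynomial precision are strictly bounded by TC⁰; under the conjecture TC⁰ ⊊ NC¹, ViTs fundamentally lack the required logical depth; experiments show that ViT representations collapse structurally on non-solvable tasks as compositional depth increases. Conclusion: Because their architecture is limited to TC⁰-level circuit complexity, Vision Transformers cannot handle non-solvable spatial reasoning tasks that require NC¹-level computation. This provides a theoretical explanation for their spatial reasoning failures and motivates a redesign of model inductive biases. Abstract: Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation. While often attributed to data scale, we propose that this limitation arises from the intrinsic circuit complexity of the architecture. We formalize spatial understanding as learning a Group Homomorphism: mapping image sequences to a latent space that preserves the algebraic structure of the underlying transformation group. We demonstrate that for non-solvable groups (e.g., the 3D rotation group $\mathrm{SO}(3)$), maintaining such a structure-preserving embedding is computationally lower-bounded by the Word Problem, which is $\mathsf{NC^1}$-complete. In contrast, we prove that constant-depth ViTs with polynomial precision are strictly bounded by $\mathsf{TC^0}$. Under the conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, we establish a complexity boundary: constant-depth ViTs fundamentally lack the logical depth to efficiently capture non-solvable spatial structures. We validate this complexity gap via latent-space probing, demonstrating that ViT representations suffer a structural collapse on non-solvable tasks as compositional depth increases.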
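The notion of a non-solvable group can be checked by hand on small permutation groups. The sketch below is an illustration that is not taken from the paper: it computes the derived series (repeated commutator subgroups) of the symmetric group S_n by brute force. A group is solvable exactly when this series reaches the trivial group. S_4 does; S_5 stalls at the alternating group A_5, which also appears inside SO(3) as the rotation group of the icosahedron.

```python
from itertools import permutations


def compose(p, q):
    """Permutation 'apply q first, then p', with permutations stored as tuples."""
    return tuple(p[i] for i in q)


def inverse(p):
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def closure(generators, n):
    """Subgroup of S_n generated by the given permutations (search from the identity)."""
    identity = tuple(range(n))
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(g, s)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def derived_subgroup(group, n):
    """Subgroup generated by all commutators a^-1 b^-1 a b with a, b in the group."""
    commutators = {compose(compose(inverse(a), inverse(b)), compose(a, b))
                   for a in group for b in group}
    return closure(commutators, n)


def derived_series_orders(n):
    """Orders of the groups in the derived series of S_n, stopping once it stabilizes."""
    group = set(permutations(range(n)))
    orders = [len(group)]
    while True:
        nxt = derived_subgroup(group, n)
        if len(nxt) == len(group):
            return orders
        group = nxt
        orders.append(len(group))


print(derived_series_orders(4))  # [24, 12, 4, 1]: reaches the trivial group, so S_4 is solvable
print(derived_series_orders(5))  # [120, 60]: stalls at A_5, so S_5 is not solvable
```

The connection to the complexity claim is Barrington's theorem, which shows that the word problem over a fixed finite non-solvable group such as S_5 is NC¹-complete.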

[134] IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

Yankai Jiang,Qiaoru Li,Binlu Xu,Haoran Sun,Chao Ding,Junting Dong,Yuxiang Cai,Xuhong Zhang,Jianwei Yin

Main category: cs.CV

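TL;DR: This paper proposes IBISAgent, a new medical multimodal large language model that recasts segmentation as a vision-centric, multi-step decision-making process. Through iterative visual reasoning and text-based click actions it produces high-quality pixel-level segmentations and supports mask refinement, with no changes to the model architecture.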

Details

Motivation: Existing medical MLLMs approach pixel-level understanding by introducing implicit segmentation tokens and jointly fine-tuning the model with external decoders, which causes catastrophic forgetting and poor generalization, and they lack an iterative refinement mechanism. Method: Proposes IBISAgent, which treats segmentation as a vision-centric, multi-step decision-making process and uses interleaved reasoning and text-based click actions to invoke segmentation tools and generate high-quality masks; a two-stage training framework combines cold-start supervised fine-tuning with agentic reinforcement learning using fine-grained rewards. Result: Experiments show that IBISAgent outperforms existing closed-source and open-source SOTA methods across a range of medical image segmentation tasks, with stronger robustness and reasoning in complex scenarios. Conclusion: IBISAgent improves the performance and generalization of pixel-level understanding in medical images and advances multimodal large models on fine-grained visual reasoning tasks, with good openness and application prospects. Abstract: Recent research on medical MLLMs has gradually shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization to out-of-domain scenarios. Second, most methods rely on single-pass reasoning and lack the capability to iteratively refine segmentation results, leading to suboptimal performance. To overcome these limitations, we propose a novel agentic MLLM, named IBISAgent, that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. By iteratively performing multi-step visual reasoning on masked image features, IBISAgent naturally supports mask refinement and promotes the development of pixel-level visual reasoning capabilities. We further design a two-stage training framework consisting of cold-start supervised fine-tuning and agentic reinforcement learning with tailored, fine-grained rewards, enhancing the model's robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source SOTA methods. 
All datasets, code, and trained models will be released publicly.

[135] Fine-Grained Generalization via Structuralizing Concept and Feature Space into Commonality, Specificity and Confounding

Zhen Wang,Jiaojiao Zhao,Qilong Wang,Yongfeng Dong,Wenlong Yu

Main category: cs.CV

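TL;DR: This paper proposes Concept-Feature Structuralized Generalization (CFSG) for fine-grained domain generalization. It disentangles common, specific, and confounding components and introduces an adaptive mechanism that dynamically adjusts the proportion of each component, significantly improving model performance.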

Details

Motivation: In fine-grained domain generalization, small inter-class differences and large intra-class variations make models overly sensitive to subtle features, and existing models lack a mechanism for classifying with common and specific attributes as humans do. Method: The authors propose CFSG, which disentangles the concept and feature spaces into common, specific, and confounding parts, introduces an adaptive mechanism that dynamically adjusts the proportion of each component, and assigns explicit weights to each pair of components at prediction time. Result: On three single-source benchmark datasets, CFSG improves on baseline models by 9.87% on average and on existing state-of-the-art methods by 3.08% on average; explainability analysis confirms that feature structuralization facilitates the emergence of concept structuralization. Conclusion: Through structured modeling, CFSG integrates multi-granularity knowledge and mitigates the performance drop under domain shift, offering a new approach to fine-grained domain generalization. Abstract: Fine-Grained Domain Generalization (FGDG) presents greater challenges than conventional domain generalization due to the subtle inter-class differences and relatively pronounced intra-class variations inherent in fine-grained recognition tasks. Under domain shifts, the model becomes overly sensitive to fine-grained cues, leading to the suppression of critical features and a significant drop in performance. Cognitive studies suggest that humans classify objects by leveraging both common and specific attributes, enabling accurate differentiation between fine-grained categories. However, current deep learning models have yet to incorporate this mechanism effectively. Inspired by this mechanism, we propose Concept-Feature Structuralized Generalization (CFSG). This model explicitly disentangles both the concept and feature spaces into three structured components: common, specific, and confounding segments. To mitigate the adverse effects of varying degrees of distribution shift, we introduce an adaptive mechanism that dynamically adjusts the proportions of common, specific, and confounding components. In the final prediction, explicit weights are assigned to each pair of components. Extensive experiments on three single-source benchmark datasets demonstrate that CFSG achieves an average performance improvement of 9.87% over baseline models and outperforms existing state-of-the-art methods by an average of 3.08%. Additionally, explainability analysis validates that CFSG effectively integrates multi-granularity structured knowledge and confirms that feature structuralization facilitates the emergence of concept structuralization.

[136] Understanding Multi-Agent Reasoning with Large Language Models for Cartoon VQA

Tong Wu,Thanet Markchom

Main category: cs.CV

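TL;DR: This paper proposes a multi-agent LLM framework for visual question answering (VQA) on stylised cartoon imagery. Three specialised agents work collaboratively to combine visual cues with narrative context, and the framework is systematically evaluated on the Pororo and Simpsons datasets.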

Details

Motivation: Standard large language models perform poorly on VQA for stylised cartoon imagery because they struggle to interpret exaggerated visual abstraction and narrative-driven context, so a purpose-built framework is needed. Method: The authors design a multi-agent LLM framework comprising a visual agent, a language agent, and a critic agent, which collaborate on structured reasoning and integrate visual information with narrative context to answer questions about cartoon images. Result: The framework was systematically evaluated on two cartoon-based VQA datasets, Pororo and Simpsons, and the results analyse in detail how each agent contributes to the final prediction. Conclusion: The multi-agent framework improves understanding of cartoon imagery and strengthens multimodal reasoning, offering deeper insight into LLM-based multi-agent systems for cartoon VQA. Abstract: Visual Question Answering (VQA) for stylised cartoon imagery presents challenges, such as interpreting exaggerated visual abstraction and narrative-driven context, which are not adequately addressed by standard large language models (LLMs) trained on natural images. To investigate this issue, a multi-agent LLM framework is introduced, specifically designed for VQA tasks in cartoon imagery. The proposed architecture consists of three specialised agents: visual agent, language agent and critic agent, which work collaboratively to support structured reasoning by integrating visual cues and narrative context. The framework was systematically evaluated on two cartoon-based VQA datasets: Pororo and Simpsons. Experimental results provide a detailed analysis of how each agent contributes to the final prediction, offering a deeper understanding of LLM-based multi-agent behaviour in cartoon VQA and multimodal inference.

[137] LesionTABE: Equitable AI for Skin Lesion Detection

Rocio Mexia Diaz,Yasmin Greenway,Petru Manescu

Main category: cs.CV

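TL;DR: LesionTABE is a fairness-centric framework that combines adversarial debiasing with dermatology-specific foundation model embeddings, significantly improving diagnostic fairness on darker skin tones as well as overall accuracy.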

Details

Motivation: Clinical adoption of AI in dermatology is held back by bias, in particular the poor performance of diagnostic models on darker skin tones. Method: The authors propose the LesionTABE framework, which couples adversarial debiasing with embeddings from a dermatology foundation model, and evaluate it on multiple datasets. Result: Compared with a ResNet-152 baseline, LesionTABE improves fairness metrics by more than 25%, outperforms existing debiasing methods, and also raises overall diagnostic accuracy. Conclusion: Foundation model debiasing helps advance equitable clinical AI and is a key step towards fairness in medical AI. Abstract: Bias remains a major barrier to the clinical adoption of AI in dermatology, as diagnostic models underperform on darker skin tones. We present LesionTABE, a fairness-centric framework that couples adversarial debiasing with dermatology-specific foundation model embeddings. Evaluated across multiple datasets covering both malignant and inflammatory conditions, LesionTABE achieves over a 25% improvement in fairness metrics compared to a ResNet-152 baseline, outperforming existing debiasing methods while simultaneously enhancing overall diagnostic accuracy. These results highlight the potential of foundation model debiasing as a step towards equitable clinical AI adoption.

[138] Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs

Chenchen Lin,Sanbao Su,Rachel Luo,Yuxiao Chen,Yan Wang,Marco Pavone,Fei Miao

Main category: cs.CV

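TL;DR: This paper proposes TGIF (Text-Guided Inter-layer Fusion), a module that strengthens visual grounding in multimodal large language models and reduces hallucination through query-conditioned, hierarchy-aware fusion.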

Details

Motivation: Existing multimodal LLMs typically rely on a single layer's features from a frozen vision encoder and fail to exploit its rich hierarchy of visual information, which leads to visually ungrounded hallucinations. Existing methods mostly mitigate the problem on the text side and neglect the visual representation. Method: The authors propose the TGIF module, which treats the layers of the vision encoder as depth-wise "experts" and dynamically predicts how to fuse visual features based on the input prompt. It follows a direct external fusion strategy, requires no updates to the vision encoder, and adds minimal computational overhead. Result: With TGIF integrated into LLaVA-1.5-7B, performance improves consistently on hallucination, OCR, and VQA benchmarks, and is preserved or improved on ScienceQA, GQA, and MMBench. Conclusion: Query-conditioned, hierarchy-aware feature fusion is an effective way to strengthen visual grounding and reduce hallucination in modern multimodal LLMs. Abstract: Multimodal large language models (MLLMs) typically rely on a single late-layer feature from a frozen vision encoder, leaving the encoder's rich hierarchy of visual cues under-utilized. MLLMs still suffer from visually ungrounded hallucinations, often relying on language priors rather than image evidence. While many prior mitigation strategies operate on the text side, they leave the visual representation unchanged and do not exploit the rich hierarchy of features encoded across vision layers. Existing multi-layer fusion methods partially address this limitation but remain static, applying the same layer mixture regardless of the query. In this work, we introduce TGIF (Text-Guided Inter-layer Fusion), a lightweight module that treats encoder layers as depth-wise "experts" and predicts a prompt-dependent fusion of visual features. TGIF follows the principle of direct external fusion, requires no vision-encoder updates, and adds minimal overhead. Integrated into LLaVA-1.5-7B, TGIF provides consistent improvements across hallucination, OCR, and VQA benchmarks, while preserving or improving performance on ScienceQA, GQA, and MMBench. These results suggest that query-conditioned, hierarchy-aware fusion is an effective way to strengthen visual grounding and reduce hallucination in modern MLLMs.
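
The idea of a prompt-dependent mixture over encoder layers can be illustrated with a minimal NumPy sketch. This is not the TGIF implementation: the linear router `router_w`, the tensor shapes, and the use of a softmax over layers are all assumptions made here for illustration, since the abstract does not specify how the fusion weights are predicted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def fuse_layers(layer_feats, text_emb, router_w):
    """Mix per-layer visual features with weights predicted from the prompt.

    layer_feats: (L, N, D) features from L encoder layers for N visual tokens.
    text_emb:    (T,) pooled prompt embedding (hypothetical).
    router_w:    (T, L) weights of a hypothetical linear router.
    """
    weights = softmax(text_emb @ router_w)               # (L,), sums to 1
    fused = np.tensordot(weights, layer_feats, axes=1)   # (N, D)
    return fused, weights

rng = np.random.default_rng(0)
layer_feats = rng.normal(size=(4, 6, 8))   # 4 layers, 6 tokens, 8 dims
text_emb = rng.normal(size=16)
router_w = rng.normal(size=(16, 4))
fused, weights = fuse_layers(layer_feats, text_emb, router_w)
```

Because the weights depend on `text_emb`, two different prompts yield two different layer mixtures for the same image, which is the contrast with static multi-layer fusion that the abstract draws.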

[139] LeafLife: An Explainable Deep Learning Framework with Robustness for Grape Leaf Disease Recognition

B. M. Shahria Alam,Md. Nasim Ahmed

Main category: cs.CV

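TL;DR: This paper proposes a deep-learning method for grape leaf disease classification. An Xception model reaches 96.23% accuracy, adversarial training and Grad-CAM improve robustness and explainability, and the final model is deployed as a Streamlit web application.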

Details

Motivation: Plant diseases reduce crop yield and quality, so accurate and explainable detection of grape leaf diseases is important for agricultural management. Method: Two pre-trained models, InceptionV3 and Xception, are trained on a dataset of 9,032 images (70% training, 20% validation, 10% testing); adversarial training is added for robustness and Grad-CAM visualization for transparency. Result: Xception reaches 96.23% accuracy, outperforming InceptionV3; Grad-CAM confirms that the model attends to plausible diseased regions, and the model is deployed as a web application with heatmap and confidence outputs. Conclusion: Xception combined with explainability techniques performs well on grape leaf disease detection, and the deployed web application helps farmers and agricultural practitioners obtain real-time, reliable diagnoses. Abstract: Plant disease diagnosis is essential to farmers' management choices because plant diseases frequently lower crop yield and product quality. Grape leaf disease detection is important for harvests to flourish and for agricultural productivity to improve. The plant disease dataset contains a total of 9,032 grape leaf images in four classes: three are leaf diseases and the fourth is healthy leaves. After rigorous pre-processing, the dataset was split (70% training, 20% validation, 10% testing), and two pre-trained models were deployed: InceptionV3 and Xception. Xception shows a promising result of 96.23% accuracy, notably better than InceptionV3. Adversarial training is used for robustness, along with more transparency. Grad-CAM is integrated to confirm the leaf disease. Finally, a web application was deployed using Streamlit, with heatmap visualization and prediction with confidence level, for robust grape leaf disease classification.
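
To make the 70/20/10 split concrete, the short sketch below works out approximate subset sizes for 9,032 images. The rounding rule (floor for training and validation, remainder to testing) is an assumption; the abstract does not say how fractional counts were handled, so the actual subset sizes may differ by an image or two.

```python
def split_sizes(n_total, fractions=(0.7, 0.2, 0.1)):
    """Return (train, val, test) counts; the test set absorbs rounding remainders."""
    n_train = int(n_total * fractions[0])
    n_val = int(n_total * fractions[1])
    n_test = n_total - n_train - n_val
    return n_train, n_val, n_test

sizes = split_sizes(9032)
print(sizes)  # (6322, 1806, 904)
```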

[140] Unified Thinker: A General Reasoning Modular Core for Image Generation

Sashuai Zhou,Qiang Zhou,Jijin Hu,Hanqing Yang,Yue Cao,Junpeng Ma,Yinchao Ma,Jun Song,Tiezheng Ge,Cheng Yu,Bo Zheng,Zhou Zhao

Main category: cs.CV

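TL;DR: This paper proposes Unified Thinker, a task-agnostic reasoning architecture that improves logical reasoning in image generation by decoupling reasoning from generation and introducing a two-stage training paradigm.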

Details

Motivation: Existing generative models perform poorly on logic-intensive instruction following, leaving a gap between reasoning and execution, while closed-source systems show stronger reasoning-driven generation, which highlights the shortfall of open-source models. Method: The Unified Thinker architecture separates reasoning (the Thinker) from generation (the Generator); the authors build a structured planning interface and then use reinforcement learning with pixel-level feedback to optimize the reasoning policy. Result: On text-to-image generation and image editing, Unified Thinker substantially improves image reasoning and generation quality. Conclusion: Through executable reasoning and a modular design, Unified Thinker narrows the logical-reasoning gap of open-source generative models and offers a new direction for general image generation. Abstract: Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning--execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.

[141] LSP-DETR: Efficient and Scalable Nuclei Segmentation in Whole Slide Images

Matěj Pekár,Vít Musil,Rudolf Nenutil,Petr Holub,Tomáš Brázdil

Main category: cs.CV

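TL;DR: LSP-DETR is an end-to-end framework for nuclei instance segmentation. It uses a lightweight linear-complexity transformer to process large images and, through a star-convex polygon representation and a novel radial distance loss, lets the segmentation of overlapping nuclei emerge naturally without extra annotations or post-processing, significantly improving efficiency and generalization.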

Details

Motivation: Existing nuclei instance segmentation methods rely on patch-based processing and costly post-processing, sacrificing context and efficiency, and struggle with the computational demands of gigapixel whole-slide images. Method: The LSP-DETR framework processes images end to end with a lightweight transformer of linear complexity; nuclei are represented as star-convex polygons, and a new radial distance loss lets the segmentation of overlapping nuclei emerge naturally, without explicit overlap annotations or handcrafted post-processing. Result: Evaluations on PanNuke and MoNuSeg show strong generalization across tissue types and state-of-the-art efficiency, with LSP-DETR running more than five times faster than the next-fastest leading method. Conclusion: With a lightweight architecture and a novel geometric representation and loss, LSP-DETR achieves efficient, scalable, and precise nuclei instance segmentation suited to large-scale pathology image analysis. Abstract: Precise and scalable instance segmentation of cell nuclei is essential for computational pathology, yet gigapixel Whole-Slide Images pose major computational challenges. Existing approaches rely on patch-based processing and costly post-processing for instance separation, sacrificing context and efficiency. We introduce LSP-DETR (Local Star Polygon DEtection TRansformer), a fully end-to-end framework that uses a lightweight transformer with linear complexity to process substantially larger images without additional computational cost. Nuclei are represented as star-convex polygons, and a novel radial distance loss function allows the segmentation of overlapping nuclei to emerge naturally, without requiring explicit overlap annotations or handcrafted post-processing. Evaluations on PanNuke and MoNuSeg show strong generalization across tissues and state-of-the-art efficiency, with LSP-DETR being over five times faster than the next-fastest leading method. Code and models are available at https://github.com/RationAI/lsp-detr.
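
A star-convex polygon can be described by a centre point and a list of radial distances along fixed directions. The sketch below shows that generic representation only; the equal angular spacing, the number of rays, and the helper names are assumptions, and the paper's radial distance loss is not reproduced here.

```python
import math

def star_polygon_vertices(cx, cy, radii):
    """Convert a centre and per-ray distances into polygon vertices.

    Rays are assumed to be equally spaced in angle, starting at 0 radians.
    """
    n = len(radii)
    verts = []
    for k, r in enumerate(radii):
        theta = 2.0 * math.pi * k / n
        verts.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return verts

def polygon_area(verts):
    """Shoelace formula for the area of a simple polygon."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(verts, verts[1:] + verts[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

# Four unit rays give a diamond with vertices on the axes.
diamond = star_polygon_vertices(0.0, 0.0, [1.0, 1.0, 1.0, 1.0])
# Many equal rays approximate a circle of radius 5.
near_circle = star_polygon_vertices(10.0, 10.0, [5.0] * 64)
```

Since each nucleus is a compact list of distances around its own centre, two overlapping nuclei are simply two such lists, which is one way to see why no explicit overlap annotation is needed in this representation.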

[142] DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation

Jiajun jiao,Haowei Zhu,Puyuan Yang,Jianghui Wang,Ji Liu,Ziqiong Liu,Dong Li,Yuejian Fang,Junhai Yong,Bin Wang,Emad Barsoum

Main category: cs.CV

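TL;DR: This paper proposes DiffAgent, an LLM-based automated framework for accelerating diffusion model inference, and evaluates it with DiffBench, a comprehensive benchmark built for the purpose. Experiments show that it significantly outperforms existing methods at generating efficient acceleration strategies.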

Details

Motivation: Diffusion models excel at image and video generation, but their multi-step inference is computationally expensive and limits real-world deployment, so effective ways to accelerate them, and to combine multiple acceleration techniques, are needed. Method: The authors present the DiffBench benchmark and the DiffAgent agent. DiffAgent uses a closed-loop workflow that combines planning, code generation, and debugging components, with a genetic algorithm that uses performance feedback from the execution environment to guide code refinement. Result: Experiments show that DiffBench thoroughly evaluates generated acceleration code and that DiffAgent significantly outperforms existing LLMs at producing effective diffusion acceleration strategies. Conclusion: The study demonstrates the potential of LLMs for automated model acceleration, and the proposed framework offers a practical route to efficient deployment of diffusion models. Abstract: Diffusion models have achieved remarkable success in image and video generation. However, their inherently multiple step inference process imposes substantial computational overhead, hindering real-world deployment. Accelerating diffusion models is therefore essential, yet determining how to combine multiple model acceleration techniques remains a significant challenge. To address this issue, we introduce a framework driven by large language models (LLMs) for automated acceleration code generation and evaluation. First, we present DiffBench, a comprehensive benchmark that implements a three stage automated evaluation pipeline across diverse diffusion architectures, optimization combinations and deployment scenarios. Second, we propose DiffAgent, an agent that generates optimal acceleration strategies and codes for arbitrary diffusion models. DiffAgent employs a closed-loop workflow in which a planning component and a debugging component iteratively refine the output of a code generation component, while a genetic algorithm extracts performance feedback from the execution environment to guide subsequent code refinements. We provide a detailed explanation of the DiffBench construction and the design principles underlying DiffAgent. Extensive experiments show that DiffBench offers a thorough evaluation of generated codes and that DiffAgent significantly outperforms existing LLMs in producing effective diffusion acceleration strategies.

[143] AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

Anees Ur Rehman Hashmi,Numan Saeed,Christoph Lippert

Main category: cs.CV

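TL;DR: AnatomiX is a multitask multimodal large language model designed for anatomically grounded chest X-ray interpretation; it uses a two-stage approach to improve spatial reasoning and anatomical understanding.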

Details

Motivation: Existing models fall short in establishing anatomical correspondence in chest X-rays, which leads to incorrect anatomical understanding and a lack of true anatomical consistency. Method: Inspired by the radiological workflow, AnatomiX first identifies anatomical structures and extracts their features, then uses a large language model to perform downstream tasks such as phrase grounding, report generation, and visual question answering. Result: Across multiple benchmarks, AnatomiX improves performance by more than 25% on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning. Conclusion: AnatomiX substantially improves anatomical reasoning on chest X-ray images and enables more accurate medical image understanding. Abstract: Multimodal medical large language models have shown impressive progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model explicitly designed for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two stage approach: first, it identifies anatomical structures and extracts their features, and then leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at https://github.com/aneesurhashmi/anatomix

[144] UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

Ruiyan Han,Zhen Fang,XinYu Sun,Yuchen Ma,Ziheng Wang,Yu Zeng,Zehui Chen,Lin Chen,Wenxuan Huang,Wei-Jie Xu,Yi Cao,Feng Zhao

Main category: cs.CV

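TL;DR: This paper proposes UniCorn, a framework that uses a self-improvement mechanism to address "Conduction Aphasia" in unified multimodal models, the gap between cross-modal understanding and generation, and achieves high-quality text-to-image generation without external data or supervision.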

Details

Motivation: Unified multimodal models (UMMs) excel at cross-modal understanding but fall clearly short in turning that understanding into high-quality, controllable generation, a problem the authors call Conduction Aphasia. Method: The UniCorn framework partitions a single UMM into three collaborative roles, Proposer, Solver, and Judge, generates high-quality interactions through self-play, and uses cognitive pattern reconstruction to turn latent understanding into explicit generative signals; the authors also design the UniCycle benchmark to validate multimodal coherence. Result: UniCorn delivers substantial improvements on six general image generation benchmarks, reaching SOTA on TIIF, DPG, CompBench, and UniCycle, with gains of +5.0 on WISE and +6.5 on OneIG. Conclusion: UniCorn bridges the gap between understanding and generation and demonstrates the scalability of fully self-supervised refinement for unified multimodal intelligence. Abstract: While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles: Proposer, Solver, and Judge, UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text to Image to Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF(73.8), DPG(86.8), CompBench(88.5), and UniCycle while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.

[145] LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen,Benny Brazowski,Nisan Chiprut,Yaki Bitterman,Andrew Kvochko,Avishai Berkowitz,Daniel Shalem,Daphna Lifschitz,Dudu Moshe,Eitan Porat,Eitan Richardson,Guy Shiran,Itay Chachy,Jonathan Chetboun,Michael Finkelson,Michael Kupchick,Nir Zabari,Nitzan Guetta,Noa Kotler,Ofir Bibi,Ori Gordon,Poriya Panet,Roi Benita,Shahar Armon,Victor Kulikov,Yaron Inger,Yonatan Shiftan,Zeev Melumian,Zeev Farbman

Main category: cs.CV

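TL;DR: LTX-2 is an open-source foundation model that generates high-quality, temporally synchronized audiovisual content in a unified manner. It uses an asymmetric dual-stream transformer that balances capacity between audio and video generation efficiently, and improves audiovisual alignment and controllability with a multilingual text encoder and modality-aware classifier-free guidance.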

Details

Motivation: Existing text-to-video diffusion models lack audio and therefore cannot convey semantic, emotional, and atmospheric information, so a unified model that generates high-quality, synchronized audio and video is needed. Method: LTX-2 uses an asymmetric dual-stream transformer (a 14B-parameter video stream and a 5B-parameter audio stream) with bidirectional audio-video cross-attention, temporal positional embeddings, and cross-modality AdaLN for shared timestep conditioning, combined with a multilingual text encoder and a modality-aware CFG mechanism to improve alignment and control. Result: LTX-2 achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, with results comparable to proprietary models at markedly lower computational cost and inference time; it generates rich, coherent audio that includes natural elements such as background sound and foley. Conclusion: LTX-2 delivers efficient, controllable unified audiovisual generation that performs well in quality, synchronization, and practicality, advancing open multimodal content generation. Abstract: Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.

[146] A Versatile Multimodal Agent for Multimedia Content Generation

Daoan Zhang,Wenlin Yao,Xiaoyang Wang,Yebowen Hu,Jiebo Luo,Dong Yu

Main category: cs.CV

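TL;DR: This paper proposes MultiMedia-Agent, a multimodal agent system for automating complex multimedia content generation. It optimizes planning using skill acquisition theory and a two-stage correlation strategy, and trains the agent with a three-stage approach to improve generation quality.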

Details

Motivation: Current AIGC models struggle to generate multimodal content end to end in real-world applications and cannot integrate complex tasks, so an intelligent agent system is needed that can coordinate multiple tools and input/output modalities. Method: The MultiMedia-Agent system includes a data generation pipeline, a tool library for content creation, and metrics for evaluating preference alignment. Skill acquisition theory guides training data curation and agent training; a two-stage correlation strategy, self-correlation and model preference correlation, optimizes plans; and the agent is trained in three stages covering base plan fine-tuning, success plan fine-tuning, and preference optimization. Result: Experiments show that the proposed approach improves multimedia content quality and that MultiMedia-Agent generates better content than existing recent models. Conclusion: MultiMedia-Agent takes a systematic approach to automating complex multimedia content generation more efficiently and shows the potential of agent-based frameworks for AIGC. Abstract: With the advancement of AIGC (AI-generated content) technologies, an increasing number of generative models are revolutionizing fields such as video editing, music generation, and even film production. However, due to the limitations of current AIGC models, most models can only serve as individual components within specific application scenarios and are not capable of completing tasks end-to-end in real-world applications. In real-world applications, editing experts often work with a wide variety of images and video inputs, producing multimodal outputs -- a video typically includes audio, text, and other elements. This level of integration across multiple modalities is something current models are unable to achieve effectively. However, the rise of agent-based systems has made it possible to use AI tools to tackle complex content generation tasks. To deal with the complex scenarios, in this paper, we propose a MultiMedia-Agent designed to automate complex content creation. Our agent system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment. Notably, we introduce the skill acquisition theory to model the training data curation and agent training. We designed a two-stage correlation strategy for plan optimization, including self-correlation and model preference correlation. Additionally, we utilized the generated plans to train the MultiMedia-Agent via a three stage approach including base/success plan finetune and preference optimization. The comparison results demonstrate that our approaches are effective and that the MultiMedia-Agent can generate better multimedia content compared to novel models.

[147] InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields

Hao Yu,Haotong Lin,Jiawei Wang,Jiaxin Li,Yida Wang,Xueyang Zhang,Yue Wang,Xiaowei Zhou,Ruizhen Hu,Sida Peng

Main category: cs.CV

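TL;DR: This paper proposes InfiniDepth, a continuous depth estimation method based on neural implicit fields that predicts depth at arbitrary resolution and fine granularity, achieving state-of-the-art performance on both synthetic and real-world scenes.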

Details

Motivation: Existing depth estimation methods are limited by discrete image-grid representations, which makes it hard to scale to arbitrary resolutions and to recover fine geometric detail. Method: InfiniDepth represents depth as a neural implicit field and uses a local implicit decoder so that depth can be queried at continuous 2D coordinates. Result: On a self-built 4K synthetic benchmark and on real-world datasets, InfiniDepth achieves state-of-the-art performance on both relative and metric depth estimation, excels in fine-detail regions, and improves novel view synthesis under large viewpoint shifts. Conclusion: With a continuous implicit representation, InfiniDepth overcomes the limits of discrete grids and delivers high-resolution, fine-grained depth estimation with good application potential. Abstract: Existing depth estimation methods are fundamentally limited to predicting depth on discrete image grids. Such representations restrict their scalability to arbitrary output resolutions and hinder the geometric detail recovery. This paper introduces InfiniDepth, which represents depth as neural implicit fields. Through a simple yet effective local implicit decoder, we can query depth at continuous 2D coordinates, enabling arbitrary-resolution and fine-grained depth estimation. To better assess our method's capabilities, we curate a high-quality 4K synthetic benchmark from five different games, spanning diverse scenes with rich geometric and appearance details. Extensive experiments demonstrate that InfiniDepth achieves state-of-the-art performance on both synthetic and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions. It also benefits the task of novel view synthesis under large viewpoint shifts, producing high-quality results with fewer holes and artifacts.
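
The interface of "query depth at any continuous 2D coordinate" can be sketched as follows. Note the substitution: the paper uses a learned local implicit decoder, whereas this sketch swaps in plain bilinear interpolation over a coarse grid purely to show what a continuous query and an arbitrary-resolution render look like. The function names and the normalized `[0, 1]` coordinate convention are assumptions.

```python
import numpy as np

def query_depth(depth_grid, x, y):
    """Return depth at normalized coordinates x, y in [0, 1].

    Bilinear interpolation is a stand-in for a learned implicit decoder.
    """
    h, w = depth_grid.shape
    px, py = x * (w - 1), y * (h - 1)
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = px - x0, py - y0
    top = (1 - fx) * depth_grid[y0, x0] + fx * depth_grid[y0, x1]
    bottom = (1 - fx) * depth_grid[y1, x0] + fx * depth_grid[y1, x1]
    return (1 - fy) * top + fy * bottom

def render(depth_grid, out_h, out_w):
    """Sample the continuous field on an output grid of any size."""
    ys = np.linspace(0.0, 1.0, out_h)
    xs = np.linspace(0.0, 1.0, out_w)
    return np.array([[query_depth(depth_grid, x, y) for x in xs] for y in ys])

coarse = np.array([[1.0, 2.0], [3.0, 4.0]])
centre = query_depth(coarse, 0.5, 0.5)
upsampled = render(coarse, 5, 7)
```

The output resolution is chosen at query time rather than fixed by the representation; a learned decoder would additionally recover detail that interpolation cannot.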

[148] Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training

Hexiao Lu,Xiaokun Sun,Zeyu Cai,Hao Guo,Ying Tai,Jian Yang,Zhenyu Zhang

Main category: cs.CV

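TL;DR: Muses is a training-free method for generating 3D creatures. Built on 3D skeletons, it uses a structure-aware pipeline of design, composition, and generation to model high-quality, text-aligned fantasy 3D creatures.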

Details

Motivation: Existing methods rely on part-aware optimization, manual assembly, or 2D image generation; they struggle with intricate part-level manipulation, often produce unrealistic or incoherent results, and lack a flexible, controllable way to create 3D content. Method: Muses first constructs a creative 3D skeleton with coherent layout and scale through graph-constrained reasoning; the skeleton then guides voxel-based assembly of regions from different objects within a structured latent space; finally, image-guided appearance modeling under skeletal conditions generates style-consistent textures. Result: Experiments show that Muses reaches SOTA visual fidelity and text alignment and supports flexible 3D object editing. Conclusion: With a structured generation pipeline centred on the 3D skeleton, Muses achieves high-quality, controllable, training-free generation of fantasy 3D creatures and advances creative content creation. Abstract: We present Muses, the first training-free method for fantastic 3D creature generation in a feed-forward paradigm. Previous methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent 3D assets due to the challenges of intricate part-level manipulation and limited out-of-domain generation. In contrast, Muses leverages the 3D skeleton, a fundamental representation of biological forms, to explicitly and rationally compose diverse elements. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation. Muses begins by constructing a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. This skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions is applied to generate a style-consistent and harmonious texture for the assembled shape. Extensive experiments establish Muses' state-of-the-art performance in terms of visual fidelity and alignment with textual descriptions, and potential on flexible 3D object editing. Project page: https://luhexiao.github.io/Muses.github.io/.

eess.IV

[149] Expert-Guided Explainable Few-Shot Learning with Active Sample Selection for Medical Image Analysis

Longwei Wang,Ifrat Ikhtear Uddin,KC Santosh

Main category: eess.IV

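TL;DR: This paper proposes a dual-framework approach that combines Expert-Guided Explainable Few-Shot Learning (EGxFSL) with Explainability-Guided Active Learning (xGAL) to address the scarcity of labeled data and the lack of model interpretability in medical image analysis.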

Details

Motivation: Medical image analysis faces two major challenges, scarce labeled data and a lack of model interpretability, which limit the clinical use of AI. Existing few-shot learning and active learning methods do not adequately address interpretability. Method: EGxFSL uses radiologist-defined regions of interest as spatial supervision, jointly optimizing a Grad-CAM-based Dice loss with prototypical classification; xGAL selects samples during active learning by considering both predictive uncertainty and attention misalignment, forming an explainability-driven closed-loop learning framework. Result: On the BraTS, VinDr-CXR, and SIIM-COVID-19 datasets, EGxFSL reaches accuracies of 92%, 76%, and 62% respectively, outperforming baseline models; xGAL reaches 76% accuracy with only 680 samples, compared with 57% for random sampling. Grad-CAM visualizations show that the model focuses on diagnostically relevant regions, and cross-modality generalization is validated on breast ultrasound data. Conclusion: The proposed EGxFSL and xGAL frameworks improve model performance and interpretability under few-shot conditions and combine expert knowledge with explainability-driven learning, with broad potential for clinical application. Abstract: Medical image analysis faces two critical challenges: scarcity of labeled data and lack of model interpretability, both hindering clinical AI deployment. Few-shot learning (FSL) addresses data limitations but lacks transparency in predictions. Active learning (AL) methods optimize data acquisition but overlook interpretability of acquired samples. We propose a dual-framework solution: Expert-Guided Explainable Few-Shot Learning (EGxFSL) and Explainability-Guided AL (xGAL). EGxFSL integrates radiologist-defined regions-of-interest as spatial supervision via Grad-CAM-based Dice loss, jointly optimized with prototypical classification for interpretable few-shot learning. xGAL introduces iterative sample acquisition prioritizing both predictive uncertainty and attention misalignment, creating a closed-loop framework where explainability guides training and sample selection synergistically. On the BraTS (MRI), VinDr-CXR (chest X-ray), and SIIM-COVID-19 (chest X-ray) datasets, we achieve accuracies of 92%, 76%, and 62%, respectively, consistently outperforming non-guided baselines across all datasets. Under severe data constraints, xGAL achieves 76% accuracy with only 680 samples versus 57% for random sampling. Grad-CAM visualizations demonstrate guided models focus on diagnostically relevant regions, with generalization validated on breast ultrasound confirming cross-modality applicability.
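
A Dice loss between an attention map and an expert region of interest can be sketched as below. This is the standard soft Dice formulation, not necessarily the exact loss used in EGxFSL: the clipping of the map to `[0, 1]`, the smoothing term `eps`, and the function name are assumptions, and the Grad-CAM computation itself is omitted.

```python
import numpy as np

def soft_dice_loss(cam, roi_mask, eps=1e-6):
    """Soft Dice loss: 0 when the map matches the ROI, near 1 when disjoint.

    cam:      attention map, clipped here to [0, 1] (assumed normalization).
    roi_mask: binary expert-drawn region of interest, same shape as cam.
    """
    cam = np.clip(cam, 0.0, 1.0)
    intersection = np.sum(cam * roi_mask)
    denom = np.sum(cam) + np.sum(roi_mask)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

roi = np.array([[1.0, 0.0], [0.0, 0.0]])
aligned = soft_dice_loss(roi.copy(), roi)                         # map on the ROI
misaligned = soft_dice_loss(np.array([[0.0, 1.0], [0.0, 0.0]]), roi)  # map elsewhere
```

In a joint objective of the kind the abstract describes, a term like this would be added to the classification loss with some weighting, so that attention drifting away from the expert region is penalized during training.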

Table of Contents

cs.CL

[1] WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables

[2] PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models

[3] Losses that Cook: Topological Optimal Transport for Structured Recipe Generation

[4] ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation

[5] LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference

[6] Fact-Checking with Large Language Models via Probabilistic Certainty and Consistency

[7] DataParasite Enables Scalable and Repurposable Online Data Curation

[8] Reconstructing Item Characteristic Curves using Fine-Tuned Large Language Models

[9] FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions

[10] Scalable Construction of a Lung Cancer Knowledge Base: Profiling Semantic Reasoning in LLMs

[11] Improved Evidence Extraction for Document Inconsistency Detection with LLMs

[12] Empirical Comparison of Encoder-Based Language Models and Feature-Based Supervised Machine Learning Approaches to Automated Scoring of Long Essays

[13] When Do Tools and Planning Help LLMs Think? A Cost- and Latency-Aware Benchmark

[14] Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking

[15] Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

[16] Extracting books from production language models

[17] Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration

[18] EvoRoute: Experience-Driven Self-Routing LLM Agent Systems

[19] Boosting Accuracy and Interpretability in Multilingual Hate Speech Detection Through Layer Freezing and Explainable AI

[20] Adversarial Question Answering Robustness: A Multi-Level Error Analysis and Mitigation Study

[21] Mitigating Prompt-Induced Hallucinations in Large Language Models via Structured Reasoning

[22] Language Hierarchization Provides the Optimal Solution to Human Working Memory Limits

[23] SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation

[24] Window-based Membership Inference Attacks Against Fine-tuned Large Language Models

[25] EComStage: Stage-wise and Orientation-specific Benchmarking for Large Language Models in E-commerce

[26] MiMo-V2-Flash Technical Report

[27] Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models

[28] The performances of the Chinese and U.S. Large Language Models on the Topic of Chinese Culture

[29] TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents

[30] To Generate or Discriminate? Methodological Considerations for Measuring Cultural Alignment in LLMs

[31] Training Language Models with homotokens Leads to Delayed Overfitting

[32] LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark

[33] Revisiting Data Compression with Language Modeling

[34] Transparent Semantic Change Detection with Dependency-Based Profiles

[35] Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration

[36] Beyond the Black Box: Theory and Mechanism of Large Language Models

[37] Image, Word and Thought: A More Challenging Language Task for the Iterated Learning Model

[38] RAL2M: Retrieval Augmented Learning-To-Match Against Hallucination in Compliance-Guaranteed Service Systems

[39] Memorization, Emergence, and Explaining Reversal Failures: A Controlled Study of Relational Semantics in LLMs

[40] Pearmut: Human Evaluation of Translation Made Trivial

[41] Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion

[42] LLM-Augmented Changepoint Detection: A Framework for Ensemble Detection and Automated Explanation

[43] Low-Resource Heuristics for Bahnaric Optical Character Recognition Improvement

[44] Reliability-Aware Adaptive Self-Consistency for Efficient Sampling in LLM Reasoning

[45] Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning

[46] Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders

[47] P-Check: Advancing Personalized Reward Model via Learning to Generate Dynamic Checklist

[48] Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy

[49] Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation

[50] Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners

[51] SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering

[52] MMFormalizer: Multimodal Autoformalization in the Wild

[53] Dementia-R1: Reinforced Pretraining and Reasoning from Unstructured Clinical Notes for Real-World Dementia Prognosis

[54] MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models

[55] LittiChoQA: Literary Texts in Indic Languages Chosen for Question Answering

[56] Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning

[57] NorwAI's Large Language Models: Technical Report

[58] BaseCal: Unsupervised Confidence Calibration via Base Model Signals

[59] Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage

[60] Temporal Graph Network: Hallucination Detection in Multi-Turn Conversation

[61] Detecting Hallucinations in Retrieval-Augmented Generation via Semantic-level Internal Reasoning Graph

[62] Do LLMs Encode Functional Importance of Reasoning Tokens?

[63] Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models

[64] Grad-ELLM: Gradient-based Explanations for Decoder-only LLMs

[65] Who Laughs with Whom? Disentangling Influential Factors in Humor Preferences across User Clusters and LLMs

[66] Discovering and Causally Validating Emotion-Sensitive Neurons in Large Audio-Language Models

[67] ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation

[68] The Anatomy of Conversational Scams: A Topic-Based Red Teaming Analysis of Multi-Turn Interactions in LLMs

[69] Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

[70] Limited Linguistic Diversity in Embodied AI Datasets

[71] Self-Verification is All You Need To Pass The Japanese Bar Examination

[72] Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective

[73] WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning

[74] Can Embedding Similarity Predict Cross-Lingual Transfer? A Systematic Study on African Languages

[75] Maximizing Local Entropy Where It Matters: Prefix-Aware Localized LLM Unlearning

[76] MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

[77] X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework

[78] DIP: Dynamic In-Context Planner For Diffusion Language Models