cs.CL

[1] From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text

Shinwoo Park,Yo-Sub Han

Main category: cs.CL

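TL;DR: This study examines whether structured calibration can improve experts' ability to distinguish human-written Korean text from LLM-generated text. It proposes LREAD, a rubric based on Korean national writing standards, and validates it in a three-phase longitudinal experiment.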

Details

Motivation: Distinguishing human-written Korean text from fluent LLM output remains hard even for linguistically trained readers, because surface fluency invites over-trust; the study asks whether expert detection is a learnable skill that structured calibration can improve. Method: The authors build the LREAD rubric, derived from Korean national writing standards and focused on micro-level linguistic features such as punctuation optionality, spacing behavior, and register shifts, and run a three-phase longitudinal blind experiment (intuition-only judgment, then criterion-level scoring with explicit justifications, then domain-focused mastery evaluation) with Korean linguistics majors. Result: Majority-vote accuracy rises from 60% to 100%, and inter-annotator agreement (Fleiss' kappa) rises from -0.09 to 0.82; calibrated human judges rely more on language-specific micro-diagnostics than on the coarse discourse priors used by existing LLM detectors. Conclusion: Rubric-based expert judgment can serve as an interpretable complement to automated detectors in non-English settings; the authors release the full rubric and a taxonomy of calibrated detection signatures. Abstract: Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for linguistically trained readers, who can over-trust surface well-formedness. We study whether expert detection can be treated as a learnable skill and improved through structured calibration. We introduce LREAD, a rubric derived from national Korean writing standards and adapted to target micro-level artifacts (e.g., punctuation optionality, spacing behavior, and register shifts). In a three-phase longitudinal blind protocol with Korean linguistics majors, Phase 1 measures intuition-only detection, Phase 2 enforces criterion-level scoring with explicit justifications, and Phase 3 evaluates domain-focused mastery on held-out elementary essays. Across phases, majority-vote accuracy increases from 60% to 100%, accompanied by stronger inter-annotator agreement (Fleiss' kappa: -0.09 to 0.82). Compared to state-of-the-art LLM detectors, calibrated humans rely more on language-specific micro-diagnostics that are not well captured by coarse discourse priors. Our findings suggest that rubric-scaffolded expert judgment can serve as an interpretable complement to automated detectors for non-English settings, and we release the full rubric and a taxonomy of calibrated detection signatures.
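
The two metrics reported for [1], majority-vote accuracy and Fleiss' kappa, can be sketched as follows. This is a minimal illustration of the standard formulas, not the authors' evaluation code; the function names and the toy labels ("H" for human, "M" for machine) are assumptions made for the example.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists, one inner list of categorical labels per item.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)

    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items

    # Agreement expected by chance from the overall category proportions.
    p_e = sum(
        (sum(c[cat] for c in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data (hypothetical): two items, three raters each.
split = [["H", "M", "H"], ["M", "H", "M"]]
unanimous = [["H", "H", "H"], ["M", "M", "M"]]
print(fleiss_kappa(split), fleiss_kappa(unanimous))
```

A negative kappa, as in the Phase 1 figure of -0.09, means the raters agreed less often than chance would predict, which the split toy example reproduces in miniature.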

[2] Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments

Maxwell Crouse,Ibrahim Abdelaziz,Kshitij Fadnis,Siva Sankalp Patel,Kinjal Basu,Chulaka Gunasekara,Sadhana Kumaravel,Asim Munawar,Pavan Kapanipathi

Main category: cs.CL

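TL;DR: This paper proposes DiGiT-TC, a method for generating high-quality synthetic multi-turn data for stateless tool-calling environments. The generated conversations retain the characteristics of those produced in stateful environments and improve model performance on standard tool-calling benchmarks.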

Details

Motivation: Most existing methods for synthesizing multi-turn tool-calling data assume an execution environment that maintains state, but many real-world settings (such as security-sensitive enterprise environments or tool specifications drawn from multiple sources) cannot provide one, leaving data generation disconnected from actual use. Method: The paper proposes DiGiT-TC, a data generation method built around a novel generation pattern that implicitly represents certain tool calls in the user request, reproducing the conversational characteristics of a stateful environment without maintaining real state. Result: DiGiT-TC is validated on standard tool-calling benchmarks, where it yields strong performance gains even in stateful problem settings. Conclusion: DiGiT-TC bridges the gap between synthetic data generation and real-world stateless tool use, offering a new route to training more robust and practical tool-calling language models. Abstract: Synthetic data has proven itself to be a valuable resource for tuning smaller, cost-effective language models to handle the complexities of multi-turn tool calling conversations. While many frameworks and systems for producing synthetic multi-turn tool calling data have been proposed, prior works have frequently assumed that any tool calling interactions will take place in an execution environment that maintains state. When such an environment is available, this is advantageous as it allows for the validity of an interaction to be determined by whether or not the state of the execution environment matches some prespecified objective. Unfortunately, this does not hold in many real-world tool use settings, e.g., in enterprise settings where data security is of the utmost importance or in cases where tool specifications are synthesized from multiple sources. In this work, we address this gap by introducing a data generation method, DiGiT-TC, that is designed to produce tool calling conversations that have the characteristics of conversations generated through search in a stateful environment. The key to our technique lies in a novel generation pattern that allows our approach to implicitly represent certain tool calls in the user request. We validate our approach on standard tool calling benchmarks and demonstrate that, even in stateful problem settings, our approach results in strong performance gains.

[3] Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication

Paul Tarau

Main category: cs.CL

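TL;DR: This paper proposes the Arrow Language Model, grounded in intuitionistic logic, which interprets token prediction as constructive proof extension, encodes the prefix as a left-nested implication chain, and performs next-token prediction via modus ponens. A structure resembling multiplicative RNNs follows naturally from the logical derivation; the paper also gives a low-rank realization and compares the model with Transformers and state-space models.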

Details

Motivation: To explore the logical foundations of neural architectures and address the lack of a formal semantic interpretation for mainstream models such as Transformers, using intuitionistic logic and the Curry-Howard correspondence to give language modeling a verifiable theoretical basis. Method: Token sequences are modeled as left-nested implication chains whose non-commutative composition preserves order; next-token prediction is interpreted as modus ponens inference; Prolog theorem provers validate properties of the model; a neural structure equivalent to multiplicative RNNs is derived, together with a low-rank realization. Result: The work establishes key relations between non-commutative sequential modeling and multi-token prediction choices, derives a neural architecture equivalent to multiplicative RNNs, gives a practical low-rank Arrow model, and clarifies its theoretical position between state-space models and Transformers. Conclusion: The Arrow Language Model shows how a language-model architecture with interpretability and mathematical guarantees can be derived formally from intuitionistic implicational logic, offering a new route to logic-driven, verifiable foundation models. Abstract: We introduce the Arrow Language Model, a neural architecture derived from an intuitionistic-logic interpretation of next-token prediction. Instead of representing tokens as additive embeddings mixed by attention, we encode a prefix as a left-nested implication chain whose structure preserves order through non-commutative composition. Next-token prediction corresponds to modus ponens, and sequence processing becomes constructive proof extension under the Curry-Howard correspondence. Our Prolog-based specialized theorem provers validate fundamental properties of the neural models, including relations between commutative vs. non-commutative sequencing and single-token vs. multi-token prediction choices. We show that a neural architecture equivalent to multiplicative RNNs arises naturally from a proof-theoretic interpretation of next-token prediction as nested intuitionistic implication; we present a practical low-rank neural realization and position the model relative to Transformers and state-space models. Keywords: logic-based derivation of neural architectures, intuitionistic implicational logic, token-as-operator neural models, state-space models, alternatives to transformer-based foundational models.

[4] PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review

Songjun Tu,Yiwen Ma,Jiahao Lin,Qichao Zhang,Xiangyuan Lan,Junfeng Li,Nan Xu,Linjing Li,Dongbin Zhao

Main category: cs.CL

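TL;DR: This paper proposes PaperAudit-Bench, which comprises a long-context evaluation dataset covering single-section and cross-section errors (PaperAudit-Dataset) and an automated review framework that combines structured error detection with evidence-aware review generation (PaperAudit-Review). Experiments show that it makes reviews stricter and more discriminative and supports low-cost training of lightweight models.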

Details

Motivation: Peer reviews generated by current large language models are fluent but lack critical rigor when a paper's problems are subtle and scattered, especially in long-context settings. Method: The authors build PaperAudit-Dataset (annotated with single-section and cross-section errors) for controlled evaluation; design the PaperAudit-Review framework, which combines structured error detection with evidence-aware review generation; and train lightweight LLM detectors with supervised fine-tuning (SFT) and reinforcement learning (RL). Result: Error detectability under long context varies widely across models; explicitly adding error detection makes evaluations stricter and more discriminative; lightweight models trained with SFT/RL detect errors effectively at low cost. Conclusion: PaperAudit-Bench offers a new benchmark and method for making automated peer review more critical and reliable, and confirms the key role of structured error detection in high-quality reviewing. Abstract: Large language models can generate fluent peer reviews, yet their assessments often lack sufficient critical rigor when substantive issues are subtle and distributed across a paper. In this paper, we introduce PaperAudit-Bench, which consists of two components: (1) PaperAudit-Dataset, an error dataset covering both errors identifiable within individual sections and those requiring cross-section reasoning, designed for controlled evaluation under long-context settings; and (2) PaperAudit-Review, an automated review framework that integrates structured error detection with evidence-aware review generation to support critical assessment. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths, highlighting the difficulty of identifying such errors under long-context settings. Relative to representative automated reviewing baselines, incorporating explicit error detection into the review workflow produces systematically stricter and more discriminative evaluations, demonstrating its suitability for peer review. Finally, we show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.

[5] PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models

Haoyu Zheng,Yun Zhu,Yuqian Yuan,Bo Yuan,Wenqiao Zhang,Siliang Tang,Jun Xiao

Main category: cs.CL

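TL;DR: This paper proposes PILOT, a framework that uses a lightweight hyper-network to generate a query-conditioned latent guidance vector, internalizing the strategic planning ability of large models into compact language models. It improves stability and performance on multi-step reasoning tasks while adding almost no inference latency.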

Details

Motivation: Compact large language models (LLMs) lack the capacity for global strategic planning, which leads to error propagation in long-horizon tasks; they do have latent reasoning capabilities, but relying on guidance from an external teacher model at runtime is limited by latency and availability. Method: The paper proposes PILOT (Planning via Internalized Latent Optimization Trajectories): rather than modifying backbone weights, a lightweight Hyper-Network generates a query-conditioned Latent Guidance vector that acts as an internal steering mechanism for the model's representation path. Result: Performance improves on mathematical and coding benchmarks (e.g., +8.9% on MATH500), reasoning trajectories are stabilized, and the added inference latency is negligible. Conclusion: PILOT is a non-invasive, efficient, and practical way to internalize the strategic planning ability of large models into compact models, improving the robustness and practicality of their multi-step reasoning. Abstract: Strategic planning is critical for multi-step reasoning, yet compact Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks. Our analysis reveals that LLMs possess latent reasoning capabilities that can be unlocked when conditioned on explicit plans from a teacher model; however, runtime reliance on external guidance is often impractical due to latency and availability constraints. To bridge this gap, we propose PILOT (Planning via Internalized Latent Optimization Trajectories), a non-invasive framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance. Instead of altering backbone weights, PILOT employs a lightweight Hyper-Network to synthesize a query-conditioned Latent Guidance vector. This vector acts as an internal steering mechanism, guiding the model's representations toward optimal reasoning paths. Extensive experiments on mathematical and coding benchmarks demonstrate that PILOT effectively stabilizes reasoning trajectories, consistently outperforming strong baselines (e.g., +8.9% on MATH500) with negligible inference latency.

[6] Lowest Span Confidence: A Zero-Shot Metric for Efficient and Black-Box Hallucination Detection in LLMs

Yitong Qiao,Licheng Pan,Yu Mi,Lei Liu,Yue Shen,Fei Sun,Zhixuan Chu

Main category: cs.CL

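TL;DR: This paper proposes Lowest Span Confidence (LSC), a zero-shot, efficient hallucination detection metric that needs only a single forward pass and output probabilities. It uses a sliding window to evaluate the joint likelihood of semantically coherent spans, capturing local uncertainty patterns that correlate strongly with factual inconsistency.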

Details

Motivation: Existing hallucination detection methods for large language models (LLMs) depend on expensive sampling or white-box model states, which makes them unsuitable for practical settings such as API access; a lightweight, zero-shot, black-box method is needed. Method: The paper proposes the Lowest Span Confidence (LSC) metric: using probabilities from a single forward pass, a sliding window computes the joint likelihood of semantically coherent n-gram spans of varying length, and the region of lowest marginal confidence is taken as the hallucination indicator; this design mitigates the dilution effect of perplexity and the noise sensitivity of minimum token probability. Result: Experiments across multiple SOTA LLMs and diverse benchmarks show that LSC consistently outperforms existing baselines in the zero-shot setting and detects hallucinations well even under resource constraints. Conclusion: LSC is a practical, robust, low-overhead approach to hallucination detection that makes it more feasible to assess the trustworthiness of LLM output in black-box, API-driven scenarios. Abstract: Hallucinations in Large Language Models (LLMs), i.e., the tendency to generate plausible but non-factual content, pose a significant challenge for their reliable deployment in high-stakes environments. However, existing hallucination detection methods generally operate under unrealistic assumptions, i.e., either requiring expensive intensive sampling strategies for consistency checks or white-box LLM states, which are unavailable or inefficient in common API-based scenarios. To this end, we propose a novel efficient zero-shot metric called Lowest Span Confidence (LSC) for hallucination detection under minimal resource assumptions, only requiring a single forward pass with output probabilities. Concretely, LSC evaluates the joint likelihood of semantically coherent spans via a sliding window mechanism. By identifying regions of lowest marginal confidence across variable-length n-grams, LSC can capture local uncertainty patterns strongly correlated with factual inconsistency. Importantly, LSC can mitigate the dilution effect of perplexity and the noise sensitivity of minimum token probability, offering a more robust estimate of factual uncertainty. Extensive experiments across multiple state-of-the-art (SOTA) LLMs and diverse benchmarks show that LSC consistently outperforms existing zero-shot baselines, delivering strong detection performance even under resource-constrained conditions.
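
The sliding-window idea behind LSC in [6] can be sketched roughly as below. The abstract does not give the exact formula, so the length normalization (geometric-mean token probability per span), the default window sizes, and the function name are all assumptions for illustration rather than the authors' definition.

```python
import math


def lowest_span_confidence(token_logprobs, window_sizes=(2, 3, 4)):
    """Return the lowest span-level confidence found by a sliding window.

    token_logprobs: log-probabilities of the generated tokens, in order.
    window_sizes:   span lengths (n-gram sizes) to scan; an assumed default.
    Span confidence here is the geometric mean of token probabilities,
    i.e. exp(mean log-prob), which is an assumption of this sketch.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")

    lowest = 1.0  # probabilities never exceed 1, so this is a safe start
    for n in window_sizes:
        n = min(n, len(token_logprobs))  # clamp windows longer than the text
        for start in range(len(token_logprobs) - n + 1):
            span = token_logprobs[start:start + n]
            lowest = min(lowest, math.exp(sum(span) / n))
    return lowest


# Toy sequence (hypothetical): confident tokens around a two-token uncertain span.
logprobs = [math.log(p) for p in (0.9, 0.9, 0.1, 0.1, 0.9)]
span_score = lowest_span_confidence(logprobs, window_sizes=(2,))
sequence_score = math.exp(sum(logprobs) / len(logprobs))  # perplexity-style average
print(span_score, sequence_score)
```

In the toy sequence the whole-sequence average is pulled up by the confident tokens, while the span score isolates the uncertain region; this is the dilution effect the abstract says LSC mitigates.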

[7] FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition

Junseok Lee,Nahoon Kim,Sangyong Lee,Chang-Jae Chun

Main category: cs.CL

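TL;DR: This paper proposes adaptive self-knowledge distillation (ASKD), which dynamically reduces the student model's dependence on the teacher to improve its self-training and generalization capacity. Whisper is distilled into a smaller, faster model, FastWhisper, which achieves a lower word error rate and a 5x inference speedup in the post-training setting.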

Details

Motivation: Existing knowledge distillation methods can cause the student model to inherit the teacher's shortcomings, which reduces generalization. Method: The paper proposes adaptive self-knowledge distillation (ASKD), which dynamically reduces the student's dependence on the teacher and combines this with self-knowledge distillation; Whisper is distilled into FastWhisper and optimized in a post-training setting. Result: FastWhisper's word error rate is 1.07% lower than Whisper's, and inference is 5 times faster. Conclusion: ASKD improves the student model's generalization and self-training capacity, and FastWhisper demonstrates the method's efficiency and practicality for speech recognition. Abstract: Knowledge distillation is one of the most effective methods for model compression. Previous studies have focused on the student model effectively training the predictive distribution of the teacher model. However, during training, the student model may inherit the shortcomings of the teacher model, which can lead to a decline in generalization capacity. To mitigate this issue, we propose adaptive self-knowledge distillation (ASKD), which dynamically reduces the dependence on the teacher model to improve the self-training capacity, and performs the self-knowledge distillation method to improve the generalization capacity of the student model. We further distill the Whisper model into a smaller variant, called FastWhisper. In our post-training setting, FastWhisper achieved a word error rate 1.07% lower than the teacher model Whisper, and its relative inference time was 5 times faster.

[8] Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Xiaochen Zhu,Caiqi Zhang,Yizhou Chi,Tom Stafford,Nigel Collier,Andreas Vlachos

Main category: cs.CL

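TL;DR: This paper proposes two lightweight interventions, diversity-aware initialisation and a confidence-modulated debate protocol, to improve multi-agent debate (MAD). They raise performance on reasoning-oriented QA tasks above the baselines.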

Details

Motivation: With homogeneous agents and uniform belief updates, vanilla MAD cannot reliably improve outcomes; it lacks two mechanisms that matter in human deliberation: diversity of initial viewpoints and explicit, calibrated communication of confidence. Method: The paper proposes diversity-aware initialisation (selecting a more diverse pool of initial answers) and a confidence-modulated debate protocol (agents express calibrated confidence and update their beliefs accordingly). Result: Theoretically, the first intervention raises the prior probability of success and the second lets debate drift systematically toward the correct hypothesis; empirically, the methods consistently outperform vanilla MAD and majority vote on six reasoning-oriented QA benchmarks. Conclusion: Drawing on human deliberation and adding simple, principled modifications can substantially improve the effectiveness of LLM multi-agent debate. Abstract: Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
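
To illustrate why communicated confidence can change an outcome in [8], the toy sketch below contrasts plain majority voting with a confidence-weighted tally. It is only a single aggregation step under assumed inputs, not the authors' debate protocol, which conditions iterative belief updates on other agents' confidence; the function names and numbers are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Pick the most common answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers, confidences):
    """Pick the answer with the largest summed confidence.

    answers:     one answer per agent.
    confidences: one calibrated confidence in [0, 1] per agent (assumed given).
    """
    if len(answers) != len(confidences):
        raise ValueError("answers and confidences must have the same length")
    weights = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        weights[answer] += confidence
    return max(weights, key=weights.get)


# Hypothetical round: two unsure agents say "A", one confident agent says "B".
answers = ["A", "A", "B"]
confidences = [0.3, 0.3, 0.9]
print(majority_vote(answers), confidence_weighted_vote(answers, confidences))
```

The weighted tally only helps when confidence is well calibrated, which is why the abstract stresses calibrated rather than raw confidence.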

[9] HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Laya Iyer,Kriti Aggarwal,Sanmi Koyejo,Gail Heyman,Desmond C. Ong,Subhabrata Mukherjee

Main category: cs.CL

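TL;DR: This paper proposes HEART, a framework that for the first time directly compares humans and large language models (LLMs) in multi-turn emotional-support dialogue, using blinded human ratings and LLM-as-judge evaluation across five dimensions grounded in interpersonal communication science. Some frontier models approach or surpass average humans in perceived empathy and consistency, but still trail humans in adaptive reframing, tension-naming, and nuanced tone shifts, especially in adversarial turns. Human and LLM judges agree closely (about 80%), suggesting that evaluation criteria are converging. HEART establishes supportive dialogue as an affective capability axis separate from linguistic fluency and general reasoning.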

Details

Motivation: Language models are improving rapidly in linguistic ability, but there is no systematic, comparable evaluation of key interpersonal skills such as emotional support; a framework is needed that measures humans and models on supportive dialogue fairly, rigorously, and side by side. Method: The HEART evaluation framework builds a multi-turn emotional-support dialogue dataset, collects both human and LLM responses for each dialogue history, evaluates them with blinded human raters and an ensemble of LLM-as-judge evaluators, and scores them on five dimensions (Human Alignment, Empathic Responsiveness, Attunement, Resonance, Task-Following) while analyzing preferences and rationales. Result: Frontier models approach or exceed the average human level in perceived empathy and consistency; humans hold clear advantages in adaptive reframing, tension-naming, and nuanced tone shifts, especially in adversarial turns; human and LLM-as-judge preferences agree on about 80% of comparisons, and their rationales emphasize the same HEART dimensions; model performance improves in a fairly predictable way with scale. Conclusion: HEART shows that supportive dialogue is a separable, quantifiable capability axis; its evaluation paradigm advances the empirical understanding of models' affective competence and provides a unified benchmark for the design and ethical assessment of future human-AI support systems. Abstract: Supportive conversation depends on skills that go beyond language fluency, including reading emotions, adjusting tone, and navigating moments of resistance, frustration, or distress. Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans. We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations. For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators. All assessments follow a rubric grounded in interpersonal communication science across five dimensions: Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following. HEART uncovers striking behavioral patterns. Several frontier models approach or surpass the average human responses in perceived empathy and consistency. At the same time, humans maintain advantages in adaptive reframing, tension-naming, and nuanced tone shifts, particularly in adversarial turns. Human and LLM-as-judge preferences align on about 80 percent of pairwise comparisons, matching inter-human agreement, and their written rationales emphasize similar HEART dimensions. This pattern suggests an emerging convergence in the criteria used to assess supportive quality.
By placing humans and models on equal footing, HEART reframes supportive dialogue as a distinct capability axis, separable from general reasoning or linguistic fluency. It provides a unified empirical foundation for understanding where model-generated support aligns with human social judgment, where it diverges, and how affective conversational competence scales with model size.

[10] Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation

Boxiang Zhao,Qince Li,Zhonghao Wang,Zelin Cao,Yi Wang,Peng Cheng,Bo Lin

Main category: cs.CL

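TL;DR: This paper proposes Table-BiEval, a self-supervised evaluation framework that needs no human intervention, for quantifying the structural fidelity of large language models when they convert natural language into structured formats and parse tabular information. It decouples content from structure through intermediate representations and evaluates 15 advanced models on two dimensions, hierarchical structures and flat tables, finding that mid-sized models can outperform larger ones in structural efficiency and that deep recursive nesting remains a universal bottleneck.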

Details

Motivation: Current evaluations lack an effective, low-cost way to measure the structural fidelity of large language models when converting natural language into structured formats (as in tool invocation) and tabular information into machine-readable specifications; traditional text metrics cannot detect semantic drift in code-like outputs. Method: The Table-BiEval framework is self-supervised and needs no human annotation; using deterministic Intermediate Representations, it computes Content Semantic Accuracy and Normalized Tree Edit Distance separately to decouple structure from content, and it empirically evaluates 15 SOTA LLMs along two topological dimensions, hierarchical structures and flat tables. Result: Structural fidelity varies substantially across models; mid-sized models can unexpectedly surpass larger ones in structural efficiency; deep recursive nesting is confirmed as a universal bottleneck for current architectures. Conclusion: Table-BiEval provides an efficient, scalable, human-free paradigm for evaluating structural fidelity; the results indicate that model scale is not the only determinant of structural ability and that architecture design and recursive modeling need improvement. Abstract: As Large Language Models (LLMs) evolve into autonomous agents, the capability to faithfully translate natural language into rigorous structured formats, which is essential for tool invocation, and to convert complex tabular information into machine-readable specifications has become paramount. However, current evaluations lack effective methodologies to measure this structural fidelity without costly human intervention, as traditional text metrics fail to detect semantic drift in code-like outputs. This paper proposes Table-BiEval, a novel approach based on a human-free, self-supervised evaluation framework, to assess LLM performance quantitatively. By leveraging deterministic Intermediate Representations, our framework calculates Content Semantic Accuracy and Normalized Tree Edit Distance to decouple structure from content. Also, it empirically evaluates 15 state-of-the-art LLMs across dual topological dimensions: hierarchical structures and flat tables. The results reveal substantial variability, highlighting that mid-sized models can surprisingly outperform larger counterparts in structural efficiency and confirming that deep recursive nesting remains a universal bottleneck for current architectures.

[11] OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

Yitian Chen,Cheng Cheng,Yinan Sun,Zi Ling,Dongdong Ge

Main category: cs.CL

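TL;DR: This paper proposes OPT-ENGINE, an extensible benchmark framework for evaluating large language models (LLMs) on optimization modeling, covering 10 canonical operations research tasks (five linear programming, five mixed-integer programming). It finds that tool-integrated reasoning is more robust than pure-text reasoning and that automated constraint formulation is the main bottleneck.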

Details

Motivation: The limits of LLMs' capabilities in optimization modeling are still unclear, especially on complex real-world tasks, and there is no extensible evaluation benchmark with controllable difficulty. Method: The authors build the OPT-ENGINE benchmark framework, covering 10 canonical operations research optimization tasks (five LP and five MIP), and run systematic experiments on out-of-distribution generalization and on where in the reasoning pipeline LLMs hit bottlenecks. Result: (1) Tool-integrated reasoning with external solvers stays highly robust as task complexity grows, while pure-text reasoning quickly reaches a ceiling; (2) automated formulation of constraints is the primary performance bottleneck. Conclusion: OPT-ENGINE provides a new paradigm for evaluating and advancing LLMs for optimization modeling; future work should focus on improving constraint understanding and generation and on strengthening tool-integrated reasoning. Abstract: Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real-world tasks. To bridge this gap, we propose OPT-ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT-ENGINE spans 10 canonical tasks across operations research, with five Linear Programming and five Mixed-Integer Programming. Utilizing OPT-ENGINE, we conduct an extensive study of LLMs' reasoning capabilities, addressing two critical questions: 1.) Does LLMs' performance remain robust when generalizing to out-of-distribution optimization tasks that scale in complexity beyond current benchmark levels? and 2.) At what stage, from problem interpretation to solution generation, do current LLMs encounter the most significant bottlenecks? Our empirical results yield two key insights: first, tool-integrated reasoning with external solvers exhibits significantly higher robustness as task complexity escalates, while pure-text reasoning reaches a ceiling; second, the automated formulation of constraints constitutes the primary performance bottleneck. These findings provide actionable guidance for developing next-generation LLMs for advanced optimization. Our code is publicly available at https://github.com/Cardinal-Operations/OPTEngine.

[12] Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study

Yinuo Liu,Emre Sezgin,Eric A. Youngstrom

Main category: cs.CL

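TL;DR: This study evaluates the consistency and reliability of ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5 against human reviewers in grading academic abstracts. The models show moderate agreement with humans on objective dimensions but weaker agreement on subjective ones, making them suitable as a complement to human review.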

Details

Motivation: To explore the potential of large language models (LLMs) to assist scientific review, in particular whether their consistency and reliability in evaluating complex academic content such as conference abstracts can match that of human reviewers. Method: Using one rubric, 160 abstracts from a local conference were scored by 14 human reviewers and 3 LLMs (ChatGPT-5, Gemini-3-Pro, Claude-Sonnet-4.5); intraclass correlation coefficients (ICCs) measured reliability among the AIs and between AI and humans, and Bland-Altman plots were used to examine systematic bias. Result: The LLMs showed good-to-excellent agreement with each other (ICC: 0.59-0.87); ChatGPT and Claude reached moderate agreement with humans on overall quality and objective criteria (ICC of approximately 0.45-0.60) but only fair agreement on subjective criteria such as impact, engagement, and applicability (ICC: 0.23-0.38); Gemini reached only fair agreement on some criteria and showed no reliability on impact and applicability; all three differed little from the human mean composite score (ChatGPT +0.24, Gemini +0.42, Claude -0.02). Conclusion: LLMs can process abstracts efficiently in batches on objective, structured review tasks with performance close to that of humans, but subjective judgment should remain human-led; AI should complement the review process rather than replace it. Abstract: Introduction: Large language models (LLMs) can process requests and generate texts, but their feasibility for assessing complex academic content needs further investigation. To explore LLMs' potential in assisting scientific review, this study examined ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5's consistency and reliability in evaluating abstracts compared to one another and to human reviewers. Methods: 160 abstracts from a local conference were graded by human reviewers and three LLMs using one rubric. Composite score distributions across three LLMs and fourteen reviewers were examined. Inter-rater reliability was calculated using intraclass correlation coefficients (ICCs) for within-AI reliability and AI-human concordance. Bland-Altman plots were examined for visual agreement patterns and systematic bias. Results: LLMs achieved good-to-excellent agreement with each other (ICCs: 0.59-0.87). ChatGPT and Claude reached moderate agreement with human reviewers on overall quality and content-specific criteria, with ICCs of approximately 0.45-0.60 for composite, impression, clarity, objective, and results. They exhibited fair agreement on subjective dimensions, with ICCs ranging from 0.23 to 0.38 for impact, engagement, and applicability. Gemini showed fair agreement on half of the criteria and no reliability on impact and applicability. The three LLMs showed acceptable or negligible mean differences (ChatGPT=0.24, Gemini=0.42, Claude=-0.02) from the human mean composite scores.
Discussion: LLMs could process abstracts in batches with moderate agreement with human experts on overall quality and objective criteria. With appropriate process architecture, they can apply a rubric consistently across volumes of abstracts exceeding feasibility for a human rater. The weaker performance on subjective dimensions indicates that AI should serve a complementary role in evaluation, while human expertise remains essential.
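
The systematic-bias check in [12] rests on Bland-Altman statistics: the mean difference between two raters and the 95% limits of agreement around it. A minimal sketch of that computation follows; the function name and the toy scores are hypothetical, and the study's own analysis may differ in detail.

```python
import statistics


def bland_altman(scores_a, scores_b):
    """Return (bias, lower_limit, upper_limit) for paired scores.

    bias is the mean of (a - b); the limits of agreement are
    bias -/+ 1.96 sample standard deviations of the differences.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = statistics.mean(diffs)
    spread = 1.96 * statistics.stdev(diffs)
    return bias, bias - spread, bias + spread


# Hypothetical composite scores for four abstracts.
llm_scores = [4, 5, 3, 4]
human_scores = [3, 5, 3, 3]
print(bland_altman(llm_scores, human_scores))
```

A positive bias means the first rater scores higher on average, which is how the reported mean differences (for example +0.24 for ChatGPT) should be read.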

[13] The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

Nora Graichen,Iria de-Dios-Flores,Gemma Boleda

Main category: cs.CL

TL;DR: This paper reviews 337 studies on Transformer-based language models' syntactic abilities, highlighting strengths in form-oriented syntax (e.g., POS, agreement) but weaker, more variable performance on syntax-semantics interface phenomena (e.g., binding, filler-gap dependencies); it calls for broader linguistic coverage, mechanistic interpretability, and methodological standardization.

Details

Motivation: To systematically assess how well Transformer-based language models capture diverse syntactic phenomena, identify methodological biases and gaps (e.g., English-centricity, overreliance on BERT, focus on easy-to-test phenomena), and guide future research toward more rigorous, inclusive, and theoretically grounded evaluation. Method: Systematic review and meta-analysis of 337 articles reporting 1,015 model results across syntactic phenomena and interpretability methods; analysis focuses on language coverage, model diversity, phenomenon difficulty, and methodological alignment. Result: TLMs perform well on form-oriented syntactic tasks (e.g., part-of-speech tagging, subject-verb agreement) but show inconsistent and weaker performance on syntax-semantics interface phenomena (e.g., binding, filler-gap dependencies); dominant focus remains on English, BERT, and shallow phenomena. Conclusion: Current evaluation practices suffer from narrow empirical and methodological scope; future work should prioritize multilingual data, mechanistic interpretability, theoretical-methodological alignment, and comprehensive reporting to advance understanding of TLMs’ syntactic competence. Abstract: We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models, reporting on 1,015 model results from a range of syntactic phenomena and interpretability methods. Our analysis shows that the state of the art presents a healthy variety of methods and data, but an over-focus on a single language (English), a single model (BERT), and phenomena that are easy to get at (like part of speech and agreement). Results also suggest that TLMs capture these form-oriented phenomena well, but show more variable and weaker performance on phenomena at the syntax-semantics interface, like binding or filler-gap dependencies. 
We provide recommendations for future work, in particular reporting complete data, better aligning theoretical constructs and methods across studies, increasing the use of mechanistic methods, and broadening the empirical scope regarding languages and linguistic phenomena.

[14] Attribution Techniques for Mitigating Hallucinated Information in RAG Systems: A Survey

Yuqing Zhao,Ziyao Liu,Yongsen Zheng,Kwok-Yan Lam

Main category: cs.CL

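TL;DR: This survey reviews progress on attribution-based techniques for mitigating LLM hallucinations in Retrieval-Augmented Generation (RAG) systems. It proposes a taxonomy of hallucination types in RAG and a unified pipeline for attribution techniques, reviews methods by the hallucination type they target, analyzes their strengths and weaknesses, and offers practical guidance.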

Details

Motivation: LLM question-answering systems suffer from hallucinations, and RAG, while improving response quality, introduces new ones; attribution techniques have advanced, but the lack of a unified framework, a clear taxonomy, and systematic comparison hampers failure diagnosis and solution selection. Method: A systematic literature survey that builds a taxonomy of RAG hallucinations, proposes a unified pipeline for attribution techniques, and reviews techniques grouped by the hallucination type they target, with an analysis of strengths and weaknesses and practical guidelines. Result: The survey establishes a structured taxonomy of hallucinations in RAG systems, proposes a unified pipeline covering the whole attribution process, systematically compares attribution methods for different hallucination types, and provides guidance on selection and use in real deployments. Conclusion: Attribution techniques are a key route to more trustworthy RAG; the survey fills the gap left by the absence of a unified framework and systematic assessment, providing a theoretical basis and practical reference for future research and industrial adoption. Abstract: Large Language Model (LLM)-based question answering (QA) systems play a critical role in modern AI, demonstrating strong performance across various tasks. However, LLM-generated responses often suffer from hallucinations, unfaithful statements lacking reliable references. Retrieval-Augmented Generation (RAG) frameworks enhance LLM responses by incorporating external references but also introduce new forms of hallucination due to complex interactions between the retriever and generator. To address these challenges, researchers have explored attribution-based techniques that ensure responses are verifiably supported by retrieved content. Despite progress, a unified pipeline for these techniques, along with a clear taxonomy and systematic comparison of their strengths and weaknesses, remains lacking. A well-defined taxonomy is essential for identifying specific failure modes within RAG systems, while comparative analysis helps practitioners choose appropriate solutions based on hallucination types and application context. This survey investigates how attribution-based techniques are used within RAG systems to mitigate hallucinations and addresses the gap by: (i) outlining a taxonomy of hallucination types in RAG systems, (ii) presenting a unified pipeline for attribution techniques, (iii) reviewing techniques based on the hallucinations they target, and (iv) discussing strengths and weaknesses with practical guidelines. This work offers insights for future research and practical use of attribution techniques in RAG systems.

[15] Towards a Mechanistic Understanding of Large Reasoning Models: A Survey of Training, Inference, and Failures

Yi Hu,Jiaqi Gu,Ruxin Wang,Zijun Yao,Hao Peng,Xiaobao Wu,Jianhui Chen,Muhan Zhang,Liangming Pan

Main category: cs.CL

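TL;DR: This survey covers the mechanistic understanding of Large Reasoning Models (LRMs), organizing existing work along three dimensions (training dynamics, reasoning mechanisms, and unintended behaviors) and outlining future challenges and directions in applied interpretability, improved methodologies, and a unified theoretical framework.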

Details

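Motivation: Although Large Reasoning Models (LRMs) have made striking performance gains, their internal workings remain opaque; a deeper mechanistic understanding is needed to bridge the gap between black-box performance and mechanistic transparency. Method: A systematic literature survey that groups and synthesizes existing work along three core dimensions: training dynamics, reasoning mechanisms, and unintended behaviors. Result: The survey builds a three-dimensional analytical framework for the mechanistic understanding of LRMs, summarizes current progress and key insights, and identifies several under-explored challenges. Conclusion: Mechanistic understanding is key to developing robust, trustworthy, and controllable LRMs; future work should strengthen the practical application of interpretability techniques, develop more rigorous analysis methods, and establish a unified theoretical foundation. Abstract: Reinforcement learning (RL) has catalyzed the emergence of Large Reasoning Models (LRMs) that have pushed reasoning capabilities to new heights. While their performance has garnered significant excitement, exploring the internal mechanisms driving these behaviors has become an equally critical research frontier. This paper provides a comprehensive survey of the mechanistic understanding of LRMs, organizing recent findings into three core dimensions: 1) training dynamics, 2) reasoning mechanisms, and 3) unintended behaviors. By synthesizing these insights, we aim to bridge the gap between black-box performance and mechanistic transparency. Finally, we discuss under-explored challenges to outline a roadmap for future mechanistic studies, including the need for applied interpretability, improved methodologies, and a unified theoretical framework.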

[16] Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding

David Linus Ostby

Main category: cs.CL

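TL;DR: This paper proposes Stingy Context, a hierarchical tree-based compression scheme for auto-coding tasks that reduces LLM context to about 1/18 of its original size and improves efficiency while preserving task accuracy.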

Details

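Motivation: To address the high computational cost and lost-in-the-middle effects caused by overly long context when large language models are used for auto-coding. Method: The paper proposes Stingy Context, a tree-structured context compression scheme, and uses the TREEFRAG decomposition technique to compress source code hierarchically. Result: A real code base of 239k tokens is compressed to 11k tokens (reported as about 18:1); across 12 frontier models, the success rate on 40 real-world issues is 94% to 97%, better than flat compression methods. Conclusion: Through hierarchical compression, Stingy Context balances context length against task fidelity, offering good cost-effectiveness and robustness on auto-coding tasks. Abstract: We introduce Stingy Context, a hierarchical tree-based compression scheme achieving 18:1 reduction in LLM context for auto-coding tasks. Using our TREEFRAG exploit decomposition, we reduce a real source code base of 239k tokens to 11k tokens while preserving task fidelity. Empirical results across 12 frontier models show 94 to 97% success on 40 real-world issues at low cost, outperforming flat methods and mitigating lost-in-the-middle effects.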

[17] SDUs DAISY: A Benchmark for Danish Culture

Jacob Nielsen,Stine L. Beltoft,Peter Schneider-Kamp,Lukas Galke Poech

Main category: cs.CL

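TL;DR: This paper introduces Daisy, a new benchmark for evaluating knowledge of Danish culture. It is built on the Danish Culture Canon 2006 and contains 741 human-verified question-answer pairs covering topics from 1300 BC archaeological finds to contemporary Danish pop music, design, and architecture.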

Details

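Motivation: Existing cultural-understanding benchmarks lack systematic, in-depth evaluation resources for a specific national culture such as Denmark's, and in particular lack high-quality question-answer data that covers both mainstream knowledge and deeper cultural cornerstones. Method: Topics are taken from the artifacts in the Danish Culture Canon; the corresponding Wikipedia page is queried for each artifact, and a language model generates random questions, yielding a mix of central and peripheral questions per work. Every question-answer pair is approved or corrected by a human, producing a close-ended QA dataset. Result: The Daisy benchmark contains 741 human-verified close-ended question-answer pairs spanning a long time range and diverse topics, including archaeology, literature, music, design, and architecture. Conclusion: Daisy offers a systematic, fine-grained benchmark for evaluating and improving language models' understanding of a specific national culture, here Danish culture, and supports further work on modeling cultural heritage. Abstract: We introduce a new benchmark for Danish culture via cultural heritage, Daisy, based on the curated topics from the Danish Culture Canon 2006. For each artifact in the culture canon, we query the corresponding Wikipedia page and have a language model generate random questions. This yields a sampling strategy within each work, with a mix of central and peripheral questions for each work, covering not only knowledge of mainstream information but also in-depth cornerstones defining the heritage of Danish culture, as defined by the Canon committee. Each question-answer pair is approved or corrected by a human in the final dataset, which consists of 741 close-ended question-answer pairs covering topics from 1300 BC archaeological findings, 1700s poems and musical pieces to contemporary pop music and Danish design and architecture.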

[18] CascadeMind at SemEval-2026 Task 4: A Hybrid Neuro-Symbolic Cascade for Narrative Similarity

Sebastien Kawada,Dylan Holyoak

Main category: cs.CL

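TL;DR: This paper presents a hybrid neuro-symbolic system that combines neural self-consistency voting with a symbolic Multi-Scale Narrative Analysis Ensemble, reaching 81% accuracy on the narrative story similarity task.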

Details

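Motivation: Neural models make unreliable decisions on ambiguous cases in narrative story similarity; symbolic methods are introduced to complement and correct them. Method: A cascade architecture. The neural component uses a large language model with multiple parallel votes and a supermajority threshold; for perfectly tied cases, a symbolic ensemble of five narrative similarity signals (lexical overlap, semantic embeddings, story grammar structure, event chain alignment, and narrative tension curves) makes the final decision. Result: The system reaches 81% accuracy on the development set, indicating that selectively deferring uncertain cases to symbolic methods can improve overall performance on genuinely ambiguous narrative comparisons. Conclusion: Combining neural and symbolic methods works best not as a simple fusion but as a confidence-based division of labor; the symbolic component acts as a trusted fallback judge that provides interpretable, debuggable decisions at the points where the neural model fails. Abstract: We present a hybrid neuro-symbolic system for the SemEval-2026 Task 4 on Narrative Story Similarity. Our approach combines neural self-consistency voting with a novel Multi-Scale Narrative Analysis Ensemble that operates as a symbolic tiebreaker. The neural network component uses a large language model with multiple parallel votes, applying a supermajority threshold for confident decisions and escalating uncertain cases to additional voting rounds. When votes result in a perfect tie, a symbolic ensemble combining five narrative similarity signals (lexical overlap, semantic embeddings, story grammar structure, event chain alignment, and narrative tension curves) provides the final decision. Our cascade architecture achieves 81% accuracy on the development set, demonstrating that selective deferral to symbolic methods can enhance neural predictions on genuinely ambiguous narrative comparisons.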
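
The cascade described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the labels `"A"`/`"B"`, the 0.75 threshold, the plain-majority fallback for non-tied but unconfident votes, and the unweighted mean over the five symbolic signals are all assumptions made for the example.

```python
from collections import Counter


def cascade_decide(vote_rounds, symbolic_scores, supermajority=0.75):
    """Return (label, stage) for one narrative comparison.

    vote_rounds: list of non-empty vote lists ("A" or "B"), one per voting round.
    symbolic_scores: {"A": [...], "B": [...]}, hypothetical per-signal scores.
    """
    votes = []
    for round_votes in vote_rounds:
        votes.extend(round_votes)
        label, top = Counter(votes).most_common(1)[0]
        # Confident decision: the leading label reaches the supermajority.
        if top / len(votes) >= supermajority:
            return label, "neural"
        # Otherwise escalate to the next voting round (if any).
    counts = Counter(votes)
    if counts["A"] != counts["B"]:
        # Assumption: once rounds are exhausted, fall back to plain majority.
        return ("A" if counts["A"] > counts["B"] else "B"), "neural-majority"
    # Perfect tie: defer to the symbolic ensemble (assumed unweighted mean).
    mean_a = sum(symbolic_scores["A"]) / len(symbolic_scores["A"])
    mean_b = sum(symbolic_scores["B"]) / len(symbolic_scores["B"])
    return ("A" if mean_a >= mean_b else "B"), "symbolic"


signals = {"A": [0.2, 0.3, 0.1, 0.4, 0.2], "B": [0.6, 0.5, 0.7, 0.4, 0.6]}
print(cascade_decide([["A", "B"], ["B", "A"]], signals))
```

The stage name returned alongside the label makes it easy to count how often the symbolic tiebreaker is actually reached.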

[19] "Newspaper Eat" Means "Not Tasty": A Taxonomy and Benchmark for Coded Languages in Real-World Chinese Online Reviews

Ruyuan Wan,Changye Li,Ting-Hao 'Kenneth' Huang

Main category: cs.CL

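TL;DR: This paper introduces CodedLang, a dataset of 7,744 Chinese Google Maps reviews, including 900 with fine-grained (span-level) coded-language annotations, together with a taxonomy of seven encoding strategies. Experiments show that current language models perform poorly at identifying and understanding coded language, with clear weaknesses on pronunciation-based encoding strategies.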

Details

Motivation: Existing language models handle coded language poorly, and progress is limited by the lack of real-world datasets and clear taxonomies. Method: Builds the CodedLang dataset (7,744 Chinese Google Maps reviews, 900 of them with span-level coded-language annotations), proposes a seven-class taxonomy of encoding strategies (including phonetic, orthographic, and cross-lingual substitutions), and evaluates language models on coded language detection, classification, and review rating prediction; additionally conducts a phonetic analysis of coded and decoded forms. Result: Even strong language models perform poorly at identifying and understanding coded language; the phonetic analysis shows that many coded expressions rely on pronunciation. Conclusion: Coded language is an important but overlooked challenge for real-world NLP systems and calls for deeper study and modeling. Abstract: Coded language is an important part of human communication. It refers to cases where users intentionally encode meaning so that the surface text differs from the intended meaning and must be decoded to be understood. Current language models handle coded language poorly. Progress has been limited by the lack of real-world datasets and clear taxonomies. This paper introduces CodedLang, a dataset of 7,744 Chinese Google Maps reviews, including 900 reviews with span-level annotations of coded language. We developed a seven-class taxonomy that captures common encoding strategies, including phonetic, orthographic, and cross-lingual substitutions. We benchmarked language models on coded language detection, classification, and review rating prediction. Results show that even strong models can fail to identify or understand coded language. Because many coded expressions rely on pronunciation-based strategies, we further conducted a phonetic analysis of coded and decoded forms. Together, our results highlight coded language as an important and underexplored challenge for real-world NLP systems.

[20] Text-to-State Mapping for Non-Resolution Reasoning: The Contradiction-Preservation Principle

Kei Saito

Main category: cs.CL

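TL;DR: This paper proposes a text-to-state mapping function φ that converts natural language input into superposition states within the Non-Resolution Reasoning (NRR) framework, and introduces the Contradiction-Preservation Principle to maintain semantic ambiguity. Experiments on 68 ambiguous sentences show substantially higher state entropy than a single-interpretation baseline, deferring premature interpretation collapse in language model inference.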

Details

Motivation: How natural language maps onto the mathematical structures of Non-Resolution Reasoning (NRR) is an open question; the paper addresses the algorithmic gap between text and the formal state space in the NRR framework. Method: Proposes the text-to-state mapping function φ, formalizes the Contradiction-Preservation Principle, and designs extraction protocols that use existing large language models as interpretation generators. Result: On 68 test sentences covering lexical, structural, and pragmatic ambiguity, the mapping reaches a mean Shannon entropy of H(S) = 1.087 bits, versus 0.000 bits for the single-interpretation baseline. Conclusion: The framework supplies the algorithmic bridge from raw text to the NRR formal state space and supports deferring architectural collapse in language model inference, thereby preserving semantic ambiguity. Abstract: Non-Resolution Reasoning (NRR) provides a formal framework for maintaining semantic ambiguity rather than forcing premature interpretation collapse. While the foundational architecture establishes state spaces and operators for ambiguity-preserving computation, the critical question of how natural language maps to these mathematical structures remains open. This paper introduces the text-to-state mapping function φ that transforms linguistic input into superposition states within the NRR framework. We formalize the Contradiction-Preservation Principle, which requires that genuinely ambiguous expressions maintain non-zero entropy in their state representations, and develop extraction protocols using existing Large Language Models as interpretation generators. Empirical validation across 68 test sentences spanning lexical, structural, and pragmatic ambiguity demonstrates that our mapping achieves mean Shannon entropy H(S) = 1.087 bits for ambiguous inputs while baseline single-interpretation approaches yield H(S) = 0.000. The framework provides the missing algorithmic bridge between raw text and the formal state spaces on which NRR operators act, enabling architectural collapse deferment in language model inference.
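
To make the entropy comparison concrete, here is a small sketch of Shannon entropy over a set of weighted interpretations. How the paper derives the weights is not stated in the abstract, so the inputs below are hypothetical; the sketch only illustrates why a single-interpretation state has exactly zero entropy while any genuinely ambiguous state has H(S) > 0.

```python
import math


def shannon_entropy(weights):
    """Entropy in bits of a state given non-negative interpretation weights."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    # p * log2(1/p) is the per-interpretation contribution to H(S).
    return sum(p * math.log2(1 / p) for p in probs)


def preserves_contradiction(weights):
    """Contradiction-Preservation check: an ambiguous state keeps H(S) > 0."""
    return shannon_entropy(weights) > 0


# Hypothetical weights for "bank" (river bank vs. financial institution).
print(shannon_entropy([0.5, 0.5]))  # two equally likely readings
print(shannon_entropy([1.0]))       # a collapsed, single-interpretation state
```

Two equally weighted readings give one bit, so a reported mean of 1.087 bits corresponds to slightly more than two effective interpretations per sentence on average.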

[21] Quantifying non deterministic drift in large language models

Claire Nicholson

Main category: cs.CL

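TL;DR: This paper uses repeated-run experiments to quantify output inconsistency ("behavioural drift") in large language models (LLMs) under identical prompts. It finds that nondeterminism persists even at temperature 0 and varies with model size, deployment type, and prompt type, and it proposes multi-dimensional metrics that serve as a baseline for future drift-control methods.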

Details

Motivation: In practice an LLM may produce different outputs for the same prompt. This nondeterminism affects reliability, yet a systematic baseline measurement is missing; the author aims to establish an empirical baseline for behavioural drift under intervention-free conditions. Method: Repeated-run experiments on two publicly accessible models, gpt-4o-mini and llama3.1-8b, across five prompt categories at temperatures 0.0 and 0.7, using exact repeats, perturbed inputs, and reuse modes; drift is measured with unique output fractions, lexical similarity, and word count statistics. Result: Behavioural drift persists at temperature 0; its extent varies with model size, deployment type, and prompt category; lexical metrics have limitations, and semantic metrics look more promising. Conclusion: Behavioural drift is an inherent property of LLMs and needs a systematic baseline measured without stabilisation techniques; this study provides a comparable reference point for future drift mitigation and control methods. Abstract: Large language models (LLMs) are widely used for tasks ranging from summarisation to decision support. In practice, identical prompts do not always produce identical outputs, even when temperature and other decoding parameters are fixed. In this work, we conduct repeated-run experiments to empirically quantify baseline behavioural drift, defined as output variability observed when the same prompt is issued multiple times under operator-free conditions. We evaluate two publicly accessible models, gpt-4o-mini and llama3.1-8b, across five prompt categories using exact repeats, perturbed inputs, and reuse modes at temperatures of 0.0 and 0.7. Drift is measured using unique output fractions, lexical similarity, and word count statistics, enabling direct comparison across models, prompting modes, and deployment types. The results show that nondeterminism persists even at temperature 0.0, with distinct variability patterns by model size, deployment, and prompt type. We situate these findings within existing work on concept drift, behavioural drift, and infrastructure-induced nondeterminism, discuss the limitations of lexical metrics, and highlight emerging semantic approaches. By establishing a systematic empirical baseline in the absence of stabilisation techniques, this study provides a reference point for evaluating future drift mitigation and control methods.
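
The three drift measures named in the abstract can be sketched in a few lines. The abstract does not say which lexical similarity is used, so token-level Jaccard overlap is an assumption here, as are the whitespace tokenization and the use of population standard deviation for word counts.

```python
from itertools import combinations
from statistics import mean, pstdev


def unique_output_fraction(outputs):
    """Share of distinct outputs among repeated runs of one prompt."""
    return len(set(outputs)) / len(outputs)


def jaccard(a, b):
    """Token-set overlap between two outputs (assumed similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_similarity(outputs):
    """Average Jaccard similarity over all unordered pairs of runs."""
    pairs = list(combinations(outputs, 2))
    return mean(jaccard(a, b) for a, b in pairs) if pairs else 1.0


def word_count_stats(outputs):
    """Mean and population standard deviation of output lengths in words."""
    counts = [len(o.split()) for o in outputs]
    return mean(counts), pstdev(counts)


runs = ["the cat sat", "the cat sat", "the cat slept"]
print(unique_output_fraction(runs), mean_pairwise_similarity(runs), word_count_stats(runs))
```

A unique output fraction of 1/n means every run agreed; values near 1 mean almost every run produced different text, even if the lexical similarity between those texts remains high.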

[22] Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

Yiting Shen,Kun Li,Wei Zhou,Songlin Hu

Main category: cs.CL

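TL;DR: This paper proposes Mem2ActBench, a benchmark for evaluating how LLM-based agents actively use long-term memory to drive tool calls and parameter grounding, addressing the limitation that existing benchmarks only test passive fact retrieval.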

Details

Motivation: Existing benchmarks only test an agent's ability to passively retrieve isolated facts and cannot evaluate the key capability of actively applying long-term memory to execute tool-based tasks, such as tool selection and parameter grounding. Method: Builds Mem2ActBench with an automated pipeline that merges heterogeneous sources (ToolACE, BFCL, Oasst1), resolves conflicts via consistency modeling, and synthesizes 2,029 sessions with 12 turns on average; a reverse-generation method then produces 400 tool-use tasks, of which human evaluation confirms 91.3% to be strongly memory-dependent. Result: Experiments on seven memory frameworks show that current systems remain clearly inadequate at using memory for parameter grounding. Conclusion: Mem2ActBench exposes the weaknesses of existing memory systems in active task execution and highlights the need for more effective ways to evaluate and improve how memory is applied in tool use. Abstract: Large Language Model (LLM)-based agents are increasingly deployed for complex, tool-based tasks where long-term memory is critical to driving actions. Existing benchmarks, however, primarily test an agent's ability to passively retrieve isolated facts in response to explicit questions. They fail to evaluate the more crucial capability of actively applying memory to execute tasks. To address this gap, we introduce Mem2ActBench, a benchmark for evaluating whether agents can proactively leverage long-term memory to execute tool-based actions by selecting appropriate tools and grounding their parameters. The benchmark simulates persistent assistant usage, where users mention the same topic across long, interrupted interactions and expect previously established preferences and task states to be implicitly applied. We build the dataset with an automated pipeline that merges heterogeneous sources (ToolACE, BFCL, Oasst1), resolves conflicts via consistency modeling, and synthesizes 2,029 sessions with 12 user-assistant-tool turns on average. From these memory chains, a reverse-generation method produces 400 tool-use tasks, with human evaluation confirming 91.3% are strongly memory-dependent. Experiments on seven memory frameworks show that current systems remain inadequate at actively utilizing memory for parameter grounding, highlighting the need for more effective approaches to evaluate and improve memory application in task execution.

[23] Benchmarking von ASR-Modellen im deutschen medizinischen Kontext: Eine Leistungsanalyse anhand von Anamnesegesprächen

Thomas Schuster,Julius Trögele,Nico Döring,Robin Krüger,Matthieu Hoffmann,Holger Friedrich

Main category: cs.CL

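TL;DR: This paper presents an ASR benchmark for the German medical setting, including dialects, and evaluates 29 open-weights and commercial ASR models. Performance differs substantially: the best models reach a WER below 3%, while most models still perform poorly on medical terminology and dialects.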

Details

Motivation: Existing ASR benchmarks mostly target English; dedicated evaluation data and systematic assessment for the German medical setting, especially with dialects, are lacking. Method: Builds a German speech-text dataset of simulated doctor-patient conversations and compares 29 ASR models (open-weights models from the Whisper, Voxtral, and Wav2Vec2 families and commercial APIs such as AssemblyAI and Deepgram) using three metrics (WER, CER, BLEU), with an initial look at qualitative semantic analysis. Result: Performance differs substantially between models: the best systems reach a WER below 3%, while most models show clearly higher error rates on medical terminology and dialect-influenced speech. Conclusion: Current mainstream ASR models generalize poorly to realistic, complex German medical scenarios, especially those involving dialects and specialist terminology, and need domain adaptation and data augmentation; this work offers a fine-grained evaluation benchmark for follow-up research. Abstract: Automatic Speech Recognition (ASR) offers significant potential to reduce the workload of medical personnel, for example, through the automation of documentation tasks. While numerous benchmarks exist for the English language, specific evaluations for the German-speaking medical context are still lacking, particularly regarding the inclusion of dialects. In this article, we present a curated dataset of simulated doctor-patient conversations and evaluate a total of 29 different ASR models. The test field encompasses both open-weights models from the Whisper, Voxtral, and Wav2Vec2 families as well as commercial state-of-the-art APIs (AssemblyAI, Deepgram). For evaluation, we utilize three different metrics (WER, CER, BLEU) and provide an outlook on qualitative semantic analysis. The results demonstrate significant performance differences between the models: while the best systems already achieve very good Word Error Rates (WER) of partly below 3%, the error rates of other models, especially concerning medical terminology or dialect-influenced variations, are considerably higher.
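
For readers unfamiliar with the metrics, WER and CER are both edit distances normalized by reference length, computed over words and characters respectively. The sketch below uses the standard Levenshtein recurrence; the example sentences are invented, and any text normalization the paper applies before scoring (casing, punctuation, number formatting) is not modeled.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Invented example: one medical term is truncated by the recognizer.
print(wer("der Patient hat starke Kopfschmerzen", "der Patient hat starke Kopfschmerz"))
print(cer("Fieber", "Fiber"))
```

A single misrecognized term in a five-word utterance already costs 20% WER, which is why terminology errors weigh so heavily in short clinical utterances.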

[24] On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text

Michał Gromadzki,Anna Wróblewska,Agnieszka Kaliska

Main category: cs.CL

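TL;DR: This paper presents an approach to AI-generated text detection based on large-scale corpora and new training strategies. It builds very large corpora of human and AI text and designs "Per LLM" and "Per LLM family" fine-tuning paradigms, reaching up to 99.6% token-level accuracy on a benchmark covering 21 large language models.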

Details

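Motivation: With the rapid progress of large language models, generated text closely resembles human writing, which creates serious challenges for authenticity verification in education, publishing, and digital security and calls for efficient, reliable AI-text detection. Method: Builds a 1-billion-token corpus of human-authored text and a 1.9-billion-token corpus of AI-generated text from multiple models and domains; proposes two fine-grained fine-tuning paradigms, Per LLM and Per LLM family; and systematically evaluates multiple detection models on a 100-million-token benchmark covering 21 large language models. Result: The best fine-tuned detector reaches up to 99.6% token-level accuracy on the 100-million-token benchmark, substantially outperforming existing open-source baselines. Conclusion: Large-scale, high-quality corpora combined with training strategies adapted to individual models are key to improving AI-text detection; the work contributes both a technical approach and benchmark resources. Abstract: The rapid progress of large language models has enabled the generation of text that closely resembles human writing, creating challenges for authenticity verification in education, publishing, and digital security. Detecting AI-generated text has therefore become a crucial technical and ethical issue. This paper presents a comprehensive study of AI-generated text detection based on large-scale corpora and novel training strategies. We introduce a 1-billion-token corpus of human-authored texts spanning multiple genres and a 1.9-billion-token corpus of AI-generated texts produced by prompting a variety of LLMs across diverse domains. Using these resources, we develop and evaluate numerous detection models and propose two novel training paradigms: Per LLM and Per LLM family fine-tuning. Across a 100-million-token benchmark covering 21 large language models, our best fine-tuned detector achieves up to 99.6% token-level accuracy, substantially outperforming existing open-source baselines.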

[25] LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?

J. Ben Tamo,Daniel Carlander-Reuterfelt,Jonathan Rubin,Dezhi Hong,Mingxian Wang,Oleg Poliannikov

Main category: cs.CL

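TL;DR: This paper identifies two bottlenecks in the language control of multilingual LLMs (responding in the requested language), uses interpretability analysis to reveal a three-phase internal structure, and proposes fine-tuning only the final layers to improve language consistency efficiently, with results comparable to full fine-tuning.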

Details

Motivation: Despite multilingual pretraining, large language models often fail at language control on non-English tasks, that is, they do not reliably answer in the language the prompt asks for; systematic diagnosis and an efficient fix are needed. Method: Designs a four-scenario evaluation protocol; extends logit lens analysis to track language probabilities layer by layer; computes cross-lingual semantic similarity of hidden states; on that basis localizes language control to the final layers and applies selective fine-tuning of those layers only (3-5% of parameters). Result: On Qwen-3-32B and Bloom-7.1B, the method achieves over 98% language consistency across six languages without loss of task accuracy, nearly matching full fine-tuning at a fraction of the compute. Conclusion: Language control is strongly localized in the late layers, and exploiting this through selective fine-tuning is an efficient, lightweight approach to multilingual adaptation. Abstract: Despite multilingual pretraining, large language models often struggle with non-English tasks, particularly in language control, the ability to respond in the intended language. We identify and characterize two key failure modes: the multilingual transfer bottleneck (correct language, incorrect task response) and the language consistency bottleneck (correct task response, wrong language). To systematically surface these issues, we design a four-scenario evaluation protocol spanning MMLU, MGSM, and XQuAD benchmarks. To probe these issues with interpretability, we extend logit lens analysis to track language probabilities layer by layer and compute cross-lingual semantic similarity of hidden states. The results reveal a three-phase internal structure: early layers align inputs into a shared semantic space, middle layers perform task reasoning, and late layers drive language-specific generation. Guided by these insights, we introduce selective fine-tuning of only the final layers responsible for language control. On Qwen-3-32B and Bloom-7.1B, this method achieves over 98 percent language consistency across six languages while fine-tuning only 3-5 percent of parameters, without sacrificing task accuracy. Importantly, this result is nearly identical to that of full-scope fine-tuning (for example, above 98 percent language consistency for both methods across all prompt scenarios) but uses a fraction of the computational resources. To the best of our knowledge, this is the first approach to leverage layer-localization of language control for efficient multilingual adaptation.
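
As a rough illustration of what "fine-tuning only the final layers within a 3-5 percent parameter budget" implies, the sketch below picks the longest run of trailing layers that fits a given budget. The function name, the equal-sized layers, and the 5% default are hypothetical; the paper's actual layer-selection procedure and any treatment of embeddings or the output head are not described in the abstract.

```python
def select_final_layers(layer_param_counts, budget_fraction=0.05):
    """Pick trailing layers whose combined parameters stay within the budget.

    layer_param_counts: parameters per transformer layer, first to last.
    Returns (sorted layer indices to unfreeze, fraction of parameters used).
    """
    total = sum(layer_param_counts)
    budget = budget_fraction * total
    selected, used = [], 0
    # Walk backwards from the last layer and stop before exceeding the budget.
    for idx in range(len(layer_param_counts) - 1, -1, -1):
        if used + layer_param_counts[idx] > budget:
            break
        selected.append(idx)
        used += layer_param_counts[idx]
    return sorted(selected), used / total


# Hypothetical 64-layer model with equally sized layers.
layers, fraction = select_final_layers([100] * 64, budget_fraction=0.05)
print(layers, fraction)
```

In a real training setup the returned indices would be the layers left trainable while all earlier layers are frozen.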

[26] Semantic Uncertainty Quantification of Hallucinations in LLMs: A Quantum Tensor Network Based Method

Pragatheeswaran Vipulanandan,Kamal Premaratne,Dilip Sarkar

Main category: cs.CL

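TL;DR: This paper proposes a quantum tensor network based uncertainty quantification framework for detecting hallucinations (confabulations) in LLM generations. It uses semantic-equivalence clustering and an entropy maximization strategy to improve reliability and interpretability, and validates its robustness and advantage across multiple datasets and models.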

Details

Motivation: Large language models have strong generative capabilities but tend to produce fluent yet unreliable hallucinated outputs, and existing methods lack an interpretable way to quantify the intrinsic uncertainty of the generation process, in particular aleatoric uncertainty. Method: Proposes an uncertainty quantification framework based on quantum tensor networks that models token sequence probability for semantic-equivalence clustering of generations, and introduces an entropy maximization strategy that flags high-uncertainty regions to guide human intervention. Result: Across 116 experiments on TriviaQA, NQ, SVAMP, and SQuAD, the method consistently improves AUROC and AURAC over state-of-the-art baselines and remains robust under different generation lengths and quantization levels. Conclusion: The quantum-physics-inspired approach offers a principled, interpretable, and resource-friendly scheme for LLM hallucination detection that remains reliable in constrained deployments. Abstract: Large language models (LLMs) exhibit strong generative capabilities but remain vulnerable to confabulations, fluent yet unreliable outputs that vary arbitrarily even under identical prompts. Leveraging a quantum tensor network based pipeline, we propose a quantum physics inspired uncertainty quantification framework that accounts for aleatoric uncertainty in token sequence probability for semantic equivalence based clustering of LLM generations. This offers a principled and interpretable scheme for hallucination detection. We further introduce an entropy maximization strategy that prioritizes high certainty, semantically coherent outputs and highlights entropy regions where LLM decisions are likely to be unreliable, offering practical guidelines for when human oversight is warranted. We evaluate the robustness of our scheme under different generation lengths and quantization levels, dimensions overlooked in prior studies, demonstrating that our approach remains reliable even in resource constrained deployments. A total of 116 experiments on TriviaQA, NQ, SVAMP, and SQuAD across multiple architectures including Mistral-7B, Mistral-7B-instruct, Falcon-rw-1b, LLaMA-3.2-1b, LLaMA-2-13b-chat, LLaMA-2-7b-chat, LLaMA-2-13b, and LLaMA-2-7b show consistent improvements in AUROC and AURAC over state of the art baselines.

Nishanth Sridhar Nakshatri,Eylon Caplan,Rajkumar Pujari,Dan Goldwasser

Main category: cs.CL

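TL;DR: This paper proposes the TAIGR framework, which analyzes and validates health influencer discourse in a structured way by identifying the core recommendation (the takeaway), building an argumentation graph, and running factor graph-based probabilistic inference. It argues that validation must model pragmatic and argumentative structure rather than perform simple fact-checking.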

Details

Motivation: Health influencer content is mostly conveyed through narrative and rhetoric and lacks explicit factual claims, so conventional claim-based verification struggles to capture its pragmatic meaning. Method: Proposes TAIGR (Takeaway Argumentation Inference with Grounded References), a three-stage framework: (1) extract the core recommendation (the takeaway); (2) build an argumentation graph representing the supporting justification; (3) validate the takeaway through factor graph-based probabilistic inference. Result: On a content validation task over transcripts of health influencer videos, TAIGR shows that accurate validation depends on modeling the discourse's pragmatic and argumentative structure rather than treating the text as a flat collection of claims. Conclusion: Assessing the credibility of influencer health discourse needs to shift toward structured argumentation and pragmatic understanding; TAIGR offers an extensible computational framework for this direction. Abstract: Health influencers play a growing role in shaping public beliefs, yet their content is often conveyed through conversational narratives and rhetorical strategies rather than explicit factual claims. As a result, claim-centric verification methods struggle to capture the pragmatic meaning of influencer discourse. In this paper, we propose TAIGR (Takeaway Argumentation Inference with Grounded References), a structured framework designed to analyze influencer discourse, which operates in three stages: (1) identifying the core influencer recommendation (the takeaway); (2) constructing an argumentation graph that captures the influencer's justification for the takeaway; (3) performing factor graph-based probabilistic inference to validate the takeaway. We evaluate TAIGR on a content validation task over influencer video transcripts on health, showing that accurate validation requires modeling the discourse's pragmatic and argumentative structure rather than treating transcripts as flat collections of claims.

Vikash Singh,Darion Cassel,Nathaniel Weir,Nick Feng,Sam Bayless

Main category: cs.CL

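TL;DR: This paper proposes VERGE, a neurosymbolic framework that combines large language models (LLMs) with SMT solvers. It decomposes LLM outputs into atomic claims, autoformalizes them into first-order logic, and checks their logical consistency, enabling verification-guided iterative refinement of answers.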

Details

Motivation: Although large language models are syntactically fluent, ensuring their logical correctness in high-stakes domains remains a fundamental challenge. Method: Proposes the neurosymbolic framework VERGE, with (1) multi-model consensus via formal semantic equivalence checking; (2) semantic routing by claim type (symbolic solvers for logical claims, LLM ensembles for commonsense reasoning); and (3) precise logical error localization via Minimal Correction Subsets (MCS). The system classifies claims, aggregates multiple verification signals into a score with a variance-based penalty, and iterates on structured feedback until acceptance criteria are met. Result: With the GPT-OSS-120B model, VERGE shows an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches. Conclusion: The hybrid approach gives formal guarantees where claims can be formalized and relies on consensus verification elsewhere, improving the trustworthiness of AI systems. Abstract: Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers to produce verification-guided answers through iterative refinement. Our approach decomposes LLM outputs into atomic claims, autoformalizes them into first-order logic, and verifies their logical consistency using automated theorem proving. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking to ensure logic-level alignment between candidates, eliminating the syntactic bias of surface-form metrics, (2) semantic routing that directs different claim types to appropriate verification strategies: symbolic solvers for logical claims and LLM ensembles for commonsense reasoning, and (3) precise logical error localization via Minimal Correction Subsets (MCS), which pinpoint the exact subset of claims to revise, transforming binary failure signals into actionable feedback. Our framework classifies claims by their logical status and aggregates multiple verification signals into a unified score with variance-based penalty. The system iteratively refines answers using structured feedback until acceptance criteria are met or convergence is achieved. This hybrid approach delivers formal guarantees where possible and consensus verification elsewhere, advancing trustworthy AI. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
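
To illustrate what a Minimal Correction Subset is, the sketch below enumerates MCSes for a tiny set of propositional claims. It deliberately swaps the SMT solver for brute-force truth-table enumeration, so it is only a toy stand-in for the paper's pipeline; the claim names, the variables, and the representation of claims as Python predicates are all assumptions for the example.

```python
from itertools import combinations, product


def satisfiable(claims, variables):
    """Brute-force check: does any truth assignment satisfy every claim?"""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(claim(assignment) for claim in claims):
            return True
    return False


def minimal_correction_subsets(claims, variables):
    """All minimal sets of claim names whose removal restores consistency."""
    names = list(claims)
    found = []
    for size in range(len(names) + 1):
        for removed in combinations(names, size):
            # Skip supersets of an already-found correction set (not minimal).
            if any(set(m) <= set(removed) for m in found):
                continue
            kept = [claims[n] for n in names if n not in removed]
            if satisfiable(kept, variables):
                found.append(removed)
    return found


variables = ["rain", "wet"]
claims = {
    "c1": lambda a: a["rain"],                    # "It is raining."
    "c2": lambda a: (not a["rain"]) or a["wet"],  # "If it rains, the ground is wet."
    "c3": lambda a: not a["wet"],                 # "The ground is not wet."
}
print(minimal_correction_subsets(claims, variables))
# [('c1',), ('c2',), ('c3',)]
```

Each returned tuple is one candidate revision target: instead of a bare "inconsistent" signal, the refinement loop can be told exactly which claim or claims to rewrite.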

[29] Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects

Amirhossein Haji Mohammad Rezaei,Zahra Shakeri

Main category: cs.CL

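TL;DR: This paper proposes a counterfactual benchmark for testing whether medical LLMs keep clinically correct diagnoses when culture-related information is introduced. Several mainstream models lose accuracy under cultural cues, most of all when identifier and contextual cues co-occur; the study also finds that culture-referential reasoning often leads to wrong answers, and it releases the prompts and augmentations.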

Details

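Motivation: Medical language models must keep clinically correct diagnoses when presented with non-decisive cultural information in order to support sustainable and equitable healthcare. Existing models may show diagnostic bias induced by cultural cues, and a quantifiable benchmark is needed. Method: Builds a counterfactual benchmark of 1650 variants from 150 MedQA items by inserting three types of culture-related information (identifier tokens, contextual cues, or both) plus a neutral control; evaluates GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma under option-only and brief-explanation prompting; a clinician verifies that the gold answer is invariant, and a human-validated rubric (κ=0.76) applied via an LLM-as-judge is used to attribute errors. Result: Cultural cues significantly affect accuracy across models (Cochran's Q, p<10⁻¹⁴), with the largest drop when identifier and context co-occur (up to 3-7 percentage points); neutral edits have smaller, non-systematic effects; more than half of culturally grounded explanations end in an incorrect answer. Conclusion: Current medical LLMs show significant culturally induced diagnostic errors, especially when identity and context cues are combined; culture-referential reasoning is linked to diagnostic failure; targeted evaluation and mitigation are needed, and the released resources support follow-up work. Abstract: Engineering sustainable and equitable healthcare requires medical language models that do not change clinically correct diagnoses when presented with non-decisive cultural information. We introduce a counterfactual benchmark that expands 150 MedQA test items into 1650 variants by inserting culture-related (i) identifier tokens, (ii) contextual cues, or (iii) their combination for three groups (Indigenous Canadian, Middle-Eastern Muslim, Southeast Asian), plus a length-matched neutral control, where a clinician verified that the gold answer remains invariant in all variants. We evaluate GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma (4B/27B) under option-only and brief-explanation prompting. Across models, cultural cues significantly affect accuracy (Cochran's Q, $p<10^{-14}$), with the largest degradation when identifier and context co-occur (up to 3-7 percentage points under option-only prompting), while neutral edits produce smaller, non-systematic changes. A human-validated rubric ($\kappa=0.76$) applied via an LLM-as-judge shows that more than half of culturally grounded explanations end in an incorrect answer, linking culture-referential reasoning to diagnostic failure. We release prompts and augmentations to support evaluation and mitigation of culturally induced diagnostic errors.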
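
Cochran's Q, the test cited for the accuracy differences, compares paired binary outcomes (correct/incorrect) for the same items across k conditions. The sketch below implements the textbook statistic; the toy table is invented and far smaller than the paper's data, and how the authors arranged items and conditions for the test is not specified in the abstract.

```python
def cochrans_q(table):
    """Cochran's Q for a binary items-by-conditions table.

    table[i][j] is 1 if item i was answered correctly under condition j, else 0.
    Under the null hypothesis, Q is approximately chi-square with k - 1 degrees
    of freedom, where k is the number of conditions.
    """
    k = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)
    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("every item has identical outcomes in all conditions")
    return (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom


# Invented toy data: 6 items under 3 conditions
# (original, identifier only, identifier + context).
toy = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(cochrans_q(toy))  # 6.0
```

Items answered identically under every condition contribute nothing to the denominator, so the statistic is driven entirely by items whose correctness flips between variants.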

[30] FFE-Hallu: Hallucinations in Fixed Figurative Expressions: Benchmark of Idioms and Proverbs in the Persian Language

Faezeh Hosseini,Mohammadali Yousefzadeh,Yadollah Yaghoobzadeh

Main category: cs.CL

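TL;DR: This paper proposes the FFEHallu benchmark, the first systematic evaluation of LLM hallucination on Persian fixed figurative expressions such as idioms and proverbs, and finds that current models generally lack figurative competence and cultural grounding.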

Details

Motivation: Fixed figurative expressions (FFEs) are culturally grounded, non-compositional, and conventionally fixed, which makes LLMs prone to figurative hallucination, that is, generating or endorsing expressions that sound idiomatic but do not actually exist; existing evaluations are not targeted at this problem and pay little attention to underrepresented languages such as Persian. Method: Builds FFEHallu, the first comprehensive benchmark for figurative hallucination (600 instances), covering three tasks: generating FFEs from meaning, detecting fabricated FFEs across four controlled construction categories, and English-to-Persian FFE translation; evaluates six mainstream multilingual LLMs. Result: Models such as GPT4.1 do relatively well at rejecting fabricated FFEs and retrieving authentic ones, but most models cannot reliably distinguish real expressions from high-quality fabrications and frequently hallucinate in cross-lingual translation. Conclusion: Current LLMs have significant weaknesses in handling fixed figurative expressions; targeted benchmarks are needed to assess and mitigate figurative hallucination, with more attention to cultural grounding and linguistic diversity. Abstract: Figurative language, particularly fixed figurative expressions (FFEs) such as idioms and proverbs, poses persistent challenges for large language models (LLMs). Unlike literal phrases, FFEs are culturally grounded, largely non-compositional, and conventionally fixed, making them especially vulnerable to figurative hallucination. We define figurative hallucination as the generation or endorsement of expressions that sound idiomatic and plausible but do not exist as authentic figurative expressions in the target language. We introduce FFEHallu, the first comprehensive benchmark for evaluating figurative hallucination in LLMs, with a focus on Persian, a linguistically rich yet underrepresented language. FFEHallu consists of 600 carefully curated instances spanning three complementary tasks: (i) FFE generation from meaning, (ii) detection of fabricated FFEs across four controlled construction categories, and (iii) FFE to FFE translation from English to Persian. Evaluating six state of the art multilingual LLMs, we find systematic weaknesses in figurative competence and cultural grounding. While models such as GPT4.1 demonstrate relatively strong performance in rejecting fabricated FFEs and retrieving authentic ones, most models struggle to reliably distinguish real expressions from high quality fabrications and frequently hallucinate during cross lingual translation. These findings reveal substantial gaps in current LLMs' handling of figurative language and underscore the need for targeted benchmarks to assess and mitigate figurative hallucination.

[31] Rewarding Intellectual Humility Learning When Not To Answer In Large Language Models

Abha Jha,Akanksha Mahajan,Ashwath Vaithinathan Aravindan,Praveen Saravanan,Sai Sailaja Policharla,Sonal Chaturbhuj Gehlot

Main category: cs.CL

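TL;DR: This paper proposes a Reinforcement Learning with Verifiable Rewards (RLVR) approach that explicitly rewards abstention ("I don't know") during training to reduce LLM hallucination. Experiments show that moderate abstention rewards effectively reduce incorrect answers, and that combining RLVR with supervised fine-tuning improves results further.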

Details

Motivation: Large language models often produce hallucinated or unverifiable content, which undermines their reliability on factual tasks and calls for a training mechanism that encourages models to honestly acknowledge the limits of their knowledge. Method: Fine-tunes Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct with RLVR using a ternary reward structure (-1, r_abs, 1), combined with supervised fine-tuning that teaches abstention before reinforcement learning; evaluates on MedMCQA and Hendrycks Math; analyzes the effect of different r_abs values on multiple-choice and open-ended question answering. Result: Moderate abstention rewards (r_abs ≈ -0.25 to 0.3) consistently reduce incorrect responses without severe accuracy loss on multiple-choice tasks; larger models are more robust to abstention incentives; on open-ended question answering, limitations due to insufficient exploration can be partially mitigated through supervised abstention training. Conclusion: Verifiable reward design is a feasible, flexible, and practical approach to hallucination mitigation that pairs reward design with intellectual humility in the model. Abstract: Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention ("I don't know") alongside correctness to promote intellectual humility. We fine-tune and evaluate Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure (-1, r_abs, 1) under varying abstention reward structures. We further study the effect of combining RLVR with supervised fine-tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs ≈ -0.25 to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open-ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is available at https://github.com/Mystic-Slice/rl-abstention.
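
The ternary reward can be written down directly. The sketch below is an illustration under stated assumptions, not the code in the linked repository: exact string matching for correctness and the literal abstention phrase are simplifications, and `answer_threshold` is our own back-of-the-envelope consequence of the reward values rather than a quantity from the paper.

```python
def ternary_reward(prediction, gold, r_abs=0.0, abstain_phrase="I don't know"):
    """Return 1 for a correct answer, r_abs for abstention, -1 otherwise."""
    if prediction.strip().lower() == abstain_phrase.lower():
        return r_abs
    return 1.0 if prediction.strip() == gold.strip() else -1.0


def answer_threshold(r_abs):
    """Confidence p above which answering beats abstaining in expectation.

    Answering yields p * 1 + (1 - p) * (-1) = 2p - 1, so answering is
    preferable when 2p - 1 > r_abs, i.e. p > (1 + r_abs) / 2.
    """
    return (1 + r_abs) / 2


print(ternary_reward("B", "B"), ternary_reward("C", "B"),
      ternary_reward("I don't know", "B", r_abs=-0.25))
print(answer_threshold(-0.25), answer_threshold(0.3))
```

Under this reading, r_abs = -0.25 makes answering worthwhile above 37.5% confidence and r_abs = 0.3 above 65%, which gives some intuition for why the reported moderate range trades off errors against accuracy.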

[32] BengaliSent140: A Large-Scale Bengali Binary Sentiment Dataset for Hate and Non-Hate Speech Classification

Akif Islam,Sujan Kumar Roy,Md. Ekramul Hamid

Main category: cs.CL

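TL;DR: This paper presents BengaliSent140, a large-scale Bengali binary (hate / not-hate) dataset consolidated from multiple sources, with nearly 140,000 samples. It aims to address the small size and single-domain coverage of existing datasets and to support training and benchmarking of deep learning models.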

Details

Motivation: Existing Bengali sentiment and hate speech datasets are generally small or confined to a single domain (for example, social media comments only), which is insufficient for modern deep learning models that need large, diverse data. Method: Consolidates seven existing Bengali text datasets, harmonizes their annotation schemes into two classes (Not Hate / Hate), builds the unified, deduplicated, relatively balanced corpus BengaliSent140, and reports baseline results. Result: The dataset contains 139,792 samples (68,548 Hate, 71,244 Not Hate), is cross-domain and relatively class-balanced, offers broader linguistic and contextual coverage, and is shown to be usable with baseline models. Conclusion: BengaliSent140 fills a gap in large-scale Bengali sentiment benchmarks and provides a solid data foundation and a reproducible evaluation platform for follow-up research. Abstract: Sentiment analysis for the Bengali language has attracted increasing research interest in recent years. However, progress remains constrained by the scarcity of large-scale and diverse annotated datasets. Although several Bengali sentiment and hate speech datasets are publicly available, most are limited in size or confined to a single domain, such as social media comments. Consequently, these resources are often insufficient for training modern deep learning based models, which require large volumes of heterogeneous data to learn robust and generalizable representations. In this work, we introduce BengaliSent140, a large-scale Bengali binary sentiment dataset constructed by consolidating seven existing Bengali text datasets into a unified corpus. To ensure consistency across sources, heterogeneous annotation schemes are systematically harmonized into a binary sentiment formulation with two classes: Not Hate (0) and Hate (1). The resulting dataset comprises 139,792 unique text samples, including 68,548 hate and 71,244 not-hate instances, yielding a relatively balanced class distribution. By integrating data from multiple sources and domains, BengaliSent140 offers broader linguistic and contextual coverage than existing Bengali sentiment datasets and provides a strong foundation for training and benchmarking deep learning models. Baseline experimental results are also reported to demonstrate the practical usability of the dataset. The dataset is publicly available at https://www.kaggle.com/datasets/akifislam/bengalisent140/

[33] Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR

Zilai Wang,Natarajan Balaji Shankar,Kaiyuan Zhang,Zihan Wang,Abeer Alwan

Main category: cs.CL

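TL;DR: This paper proposes a feature fusion method based on delta SSL embeddings (the difference between embeddings from a fine-tuned model and its pretrained counterpart). It improves child automatic speech recognition (ASR) and reaches the best WER among SSL models on the MyST corpus (9.64).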

Details

Motivation: Child ASR is held back by limited data and pretraining domain mismatch; fine-tuning SSL models shifts the representation space, and more effective task-relevant features need to be extracted. Method: Defines and extracts delta SSL embeddings (fine-tuned model embeddings minus those of the corresponding pretrained model), fuses them with fine-tuned embeddings from another SSL model using several strategies, and evaluates on the MyST children's corpus. Result: Delta embedding fusion gives up to a 10% relative WER reduction for HuBERT and 4.4% for W2V2; fusing WavLM with delta W2V2 embeddings reaches a WER of 9.64, a new state of the art among SSL models on MyST. Conclusion: Delta embeddings encode task-specific information effectively, and fusing them with embeddings from another model is a promising direction for improving child ASR. Abstract: Self-supervised learning (SSL) models have achieved impressive results across many speech tasks, yet child automatic speech recognition (ASR) remains challenging due to limited data and pretraining domain mismatch. Fine-tuning SSL models on child speech induces shifts in the representation space. We hypothesize that delta SSL embeddings, defined as the differences between embeddings from a finetuned model and those from its pretrained counterpart, encode task-specific information that complements finetuned features from another SSL model. We evaluate multiple fusion strategies on the MyST children's corpus using different models. Results show that delta embedding fusion with WavLM yields up to a 10 percent relative WER reduction for HuBERT and a 4.4 percent reduction for W2V2, compared to finetuned embedding fusion. Notably, fusing WavLM with delta W2V2 embeddings achieves a WER of 9.64, setting a new state of the art among SSL models on the MyST corpus. These findings demonstrate the effectiveness of delta embeddings and highlight feature fusion as a promising direction for advancing child ASR.
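
The delta embedding definition is simple enough to show on toy data. In the sketch below, embeddings are plain lists of per-frame feature vectors; frame-wise concatenation is only one possible fusion strategy and is an assumption here, since the abstract says several strategies were compared without naming them. The numbers are invented.

```python
def delta_embedding(finetuned, pretrained):
    """Per-frame difference: fine-tuned features minus pretrained features."""
    return [[f - p for f, p in zip(f_frame, p_frame)]
            for f_frame, p_frame in zip(finetuned, pretrained)]


def concat_fusion(features_a, features_b):
    """Frame-wise concatenation of two feature streams (assumed fusion)."""
    return [a + b for a, b in zip(features_a, features_b)]


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative improvement of new_wer over baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Toy 2-frame, 2-dimensional embeddings from a hypothetical W2V2 model.
w2v2_finetuned = [[1.0, 2.0], [3.0, 4.0]]
w2v2_pretrained = [[0.5, 1.0], [1.0, 1.0]]
wavlm_finetuned = [[0.1, 0.2], [0.3, 0.4]]

delta = delta_embedding(w2v2_finetuned, w2v2_pretrained)
print(concat_fusion(wavlm_finetuned, delta))
```

The fused stream keeps the second model's fine-tuned features intact and appends only what fine-tuning changed in the first model, which is the complementary signal the hypothesis relies on.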

[34] Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents

Ziyi Wang,Yuxuan Lu,Yimeng Zhang,Jing Huang,Jiri Gesi,Xianfeng Tang,Chen Luo,Yisi Sang,Hanqing Lu,Manling Li,Dakuo Wang

Main category: cs.CL

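TL;DR: This paper presents Trajectory2Task, a data generation pipeline for studying tool calling at scale under three realistic user scenarios (ambiguous, changing, and infeasible intent). It generates verifiable tasks through multi-turn exploration and fine-tunes lightweight LLMs on successful trajectories, improving tool calling in these scenarios and generalization to unseen domains.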

Details

Motivation: Most research on tool-calling agents assumes idealized, fixed, well-specified tasks and does not address the ambiguous, changing, or policy-infeasible intents common in real customer interactions; training and evaluation data covering these interaction patterns are scarce. Method: The Trajectory2Task pipeline first runs multi-turn exploration to produce valid tool-call trajectories, then converts them into user-facing tasks with controlled intent adaptations (ambiguous, changing, infeasible), yielding verifiable tasks that support closed-loop evaluation and training. Seven state-of-the-art LLMs are benchmarked, and lightweight LLMs are then fine-tuned on successful trajectories. Result: Mainstream LLMs fail frequently in all three scenarios; the fine-tuned lightweight LLMs improve consistently across all three conditions and generalize better to unseen tool-use domains. Conclusion: Trajectory2Task helps close the data gap for real-world tool-calling research and indicates that targeted fine-tuning on successful trajectories is an effective route to more robust tool-calling agents. Abstract: Tool-calling agents are increasingly deployed in real-world customer-facing workflows. Yet most studies on tool-calling agents focus on idealized settings with general, fixed, and well-specified tasks. In real-world applications, user requests are often (1) ambiguous, (2) changing over time, or (3) infeasible due to policy constraints, and training and evaluation data that cover these diverse, complex interaction patterns remain under-represented. To bridge the gap, we present Trajectory2Task, a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios: ambiguous intent, changing intent, and infeasible intent. The pipeline first conducts multi-turn exploration to produce valid tool-call trajectories. It then converts these trajectories into user-facing tasks with controlled intent adaptations. This process yields verifiable tasks that support closed-loop evaluation and training. We benchmark seven state-of-the-art LLMs on the generated complex user scenario tasks and observe frequent failures. Finally, using successful trajectories obtained from task rollouts, we fine-tune lightweight LLMs and find consistent improvements across all three conditions, along with better generalization to unseen tool-use domains, indicating stronger general tool-calling ability.

[35] Me-Agent: A Personalized Mobile Agent with Two-Level User Habit Learning for Enhanced Interaction

Shuoxin Wang,Chang Liu,Gowen Loo,Lifan Zheng,Kaiwen Wei,Xinyi Zeng,Jingyuan Zhang,Yu Tian

Main category: cs.CL

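TL;DR: This paper proposes Me-Agent, a learnable and memorable personalized mobile agent. It addresses the limited personalization of existing LLM-based mobile agents with a two-level user habit learning approach (prompt-level user preference learning and a hierarchical preference memory) and is validated on a new benchmark, User FingerTip.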

Details

Motivation: Existing LLM-based mobile agents perform well but overlook users' personalized needs, so they cannot interpret ambiguous instructions, do not learn from interaction history, and struggle with personalized instructions. Method: Me-Agent combines a prompt-level user preference learning strategy (enhanced with a Personal Reward Model) with a memory-level Hierarchical Preference Memory that separates long-term memory from app-specific memory. Result: Experiments on the newly built User FingerTip benchmark and on general benchmarks show that Me-Agent reaches state-of-the-art personalization while keeping competitive instruction execution performance. Conclusion: Me-Agent improves the personalization of mobile agents and offers a workable technical path toward agents for real users. Abstract: Large Language Model (LLM)-based mobile agents have made significant performance advancements. However, these agents often follow explicit user instructions while overlooking personalized needs, leading to significant limitations for real users, particularly without personalized context: (1) inability to interpret ambiguous instructions, (2) lack of learning from user interaction history, and (3) failure to handle personalized instructions. To alleviate the above challenges, we propose Me-Agent, a learnable and memorable personalized mobile agent. Specifically, Me-Agent incorporates a two-level user habit learning approach. At the prompt level, we design a user preference learning strategy enhanced with a Personal Reward Model to improve personalization performance. At the memory level, we design a Hierarchical Preference Memory, which stores users' long-term memory and app-specific memory at different memory levels. To validate the personalization capabilities of mobile agents, we introduce User FingerTip, a new benchmark featuring numerous ambiguous instructions for daily life. Extensive experiments on User FingerTip and general benchmarks demonstrate that Me-Agent achieves state-of-the-art performance in personalization while maintaining competitive instruction execution performance.

[36] Improving X-Codec-2.0 for Multi-Lingual Speech: 25 Hz Latent Rate and 24 kHz Sampling

Husein Zolkepli

Main category: cs.CL

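TL;DR: This paper presents a lightweight modification of X-Codec-2.0: adding pooling and increasing the decoder hop size lowers the latent rate from 50 Hz to 25 Hz and raises the output sampling rate from 16 kHz to 24 kHz, improving efficiency and audio quality without changing the core architecture.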

Details

Motivation: The original X-Codec-2.0 configuration (50 Hz latent rate, 16 kHz sampling rate) limits temporal efficiency and audio fidelity. Method: Keeping the frozen HuBERT features and core architecture unchanged, the authors add pooling and increase the decoder hop size, which lowers the latent rate and raises the output sampling rate. Result: On the multilingual Common Voice 17 test set, the modified codec improves MOS by 0.29 over the original X-Codec-2.0 according to UTMOSv2 and gives the best reported performance among codecs operating at 25 Hz. Conclusion: This simple change improves both the efficiency and the perceptual quality of the neural audio codec, supporting joint adjustment of latent rate and sampling rate. Abstract: X-Codec-2.0 has shown strong performance in neural audio compression and multilingual speech modeling, operating at a 50 Hz latent rate and a 16 kHz sampling rate using frozen HuBERT features. While effective, this configuration limits temporal efficiency and audio fidelity. In this work, we explore a simple and effective modification by introducing additional pooling and increasing the decoder hop size. This reduces the latent rate from 50 Hz to 25 Hz and simultaneously raises the output sampling rate from 16 kHz to 24 kHz, improving efficiency and perceptual quality without altering the core architecture. Evaluated on the multilingual Common Voice 17 test set, the proposed configuration achieves a 0.29 MOS improvement over the original X-Codec-2.0 baseline based on UTMOSv2, and attains the best reported performance among all codecs operating at 25 Hz. The source code, checkpoints, and generation comparisons are released at https://huggingface.co/Scicom-intl/xcodec2-25TPS-24k.
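
The rate change can be checked with a little arithmetic. The sketch below only divides the stated sampling rates by the stated latent rates; the resulting hop sizes are derived here and are not quoted in the abstract, and the function name is hypothetical.

```python
def samples_per_latent_frame(sample_rate_hz: int, latent_rate_hz: int) -> int:
    """Number of output audio samples covered by one latent frame."""
    if sample_rate_hz % latent_rate_hz != 0:
        raise ValueError("latent rate must divide the sampling rate evenly")
    return sample_rate_hz // latent_rate_hz


# Configurations as stated in the abstract.
original_hop = samples_per_latent_frame(16_000, 50)
modified_hop = samples_per_latent_frame(24_000, 25)

# Latent frames needed for 10 seconds of speech at each latent rate.
frames_10s_original = 50 * 10
frames_10s_modified = 25 * 10

print(original_hop, modified_hop)                # 320 960
print(frames_10s_original, frames_10s_modified)  # 500 250
```

Each latent frame therefore has to account for three times as many output samples, while a downstream language model sees half as many frames per second of audio.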

[37] Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems

Haoyuan Yu,Yuxuan Chen,Minjie Cai

Main category: cs.CL

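TL;DR: This paper proposes a dialogue framework for full-duplex voice interaction that decomposes complex dialogue into minimal conversational units. Built around a multimodal large language model, the resulting semi-cascaded full-duplex system is train-free and plug-and-play, and performs well on the HumDial dataset and in the Human-like Spoken Dialogue Systems Challenge.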

Details

Motivation: Natural human-computer voice interaction requires full-duplex capability, yet existing systems still struggle with real-time response, seamless transitions, and low-latency processing. Method: A framework that decomposes complex dialogue into minimal conversational units, instantiated as a semi-cascaded full-duplex dialogue system built around a multimodal large language model and supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS); the whole system is train-free and plug-and-play. Result: The framework is validated on the HumDial dataset and ranks second on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction). Conclusion: The framework improves the naturalness and responsiveness of full-duplex voice interaction and is easy to deploy, offering a practical route to real-world use. Abstract: Full-duplex voice interaction is crucial for natural human-computer interaction. We present a framework that decomposes complex dialogue into minimal conversational units, enabling the system to process each unit independently and predict when to transition to the next. This framework is instantiated as a semi-cascaded full-duplex dialogue system built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS) synthesis. The resulting system operates in a train-free, plug-and-play manner. Experiments on the HumDial dataset demonstrate the effectiveness of our framework, which ranks second among all teams on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction). Code is available at the GitHub repository https://github.com/yu-haoyuan/fd-badcat.

[38] Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy

Si Chen,Le Huy Khiem,Annalisa Szymanski,Ronald Metoyer,Ting Hua,Nitesh V. Chawla

Main category: cs.CL

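TL;DR: This paper proposes an automated benchmark generation framework, based on expert guidelines and Bloom's Taxonomy, for evaluating open-ended question answering in practice-based domains. It generates implicit violation-based scenarios and expands them into auto-graded multiple-choice questions and multi-turn dialogues across four cognitive levels. Applied to teaching, dietetics, and caregiving, it shows that LLMs sometimes do relatively better on higher-order analysis yet fail more often on basic recall items.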

Details

Motivation: Most LLM benchmarks rely on existing exam datasets, which are often unavailable in practice-based domains such as teaching and caregiving; knowledge in these domains is procedural and grounded in professional judgment, so evaluation must target contextualized reasoning rather than factual recall. Method: Starting from expert-authored guidelines and informed by Bloom's Taxonomy, the framework converts expert practices into implicit violation-based scenarios and automatically generates auto-graded multiple-choice questions and multi-turn dialogues covering four cognitive levels (Remember, Understand, Apply, Analyze). Result: Large-scale, psychometrically informed benchmarks are produced for teaching, dietetics, and caregiving; LLMs sometimes perform relatively better on Analyze-level items but fail more often on Remember-level items, a non-intuitive pattern. Conclusion: The framework enables deterministic, reproducible, and scalable QA evaluation in practice-based domains, and the inverted performance across cognitive levels offers a new angle on LLM reasoning and on model suitability for real-world settings. Abstract: Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applying the framework to three domains (teaching, dietetics, and caregiving), we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning (Analyze) but fail more frequently on lower-level items (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.

[39] SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility

Xuanyu Su,Diana Inkpen,Nathalie Japkowicz

Main category: cs.CL

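TL;DR: This paper introduces SoftHateBench, a benchmark for evaluating how well content moderation systems recognize soft hate speech (reasoning-driven discourse that looks neutral on the surface but carries hostility), and finds that current models perform markedly worse on soft hate.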

Details

Motivation: Existing hate speech detectors mainly target overtly toxic language (hard hate) and struggle with soft hate speech that is presented through reasonable-seeming framing and value-based arguments; no benchmark measures this gap systematically. Method: Builds the generative benchmark SoftHateBench by combining the Argumentum Model of Topics (AMT) and Relevance Theory (RT): AMT supplies a stance-preserving argument structure for neutral-sounding rewrites, and RT keeps the argument chain logically coherent. The benchmark covers 7 sociocultural domains and 28 target groups, with 4,745 soft-hate instances. Result: Across encoder-based detectors, general-purpose LLMs, and safety models, detection performance drops consistently from hard to soft hate, confirming limited robustness to reasoning-driven hostility. Conclusion: Current moderation systems have a key blind spot for soft hate conveyed through argumentation and pragmatic implication; SoftHateBench provides a needed evaluation tool for more robust hate speech detection. Abstract: Online hate on social media ranges from overt slurs and threats (hard hate speech) to soft hate speech: discourse that appears reasonable on the surface but uses framing and value-based arguments to steer audiences toward blaming or excluding a target group. We hypothesize that current moderation systems, largely optimized for surface toxicity cues, are not robust to this reasoning-driven hostility, yet existing benchmarks do not measure this gap systematically. We introduce SoftHateBench, a generative benchmark that produces soft-hate variants while preserving the underlying hostile standpoint. To generate soft hate, we integrate the Argumentum Model of Topics (AMT) and Relevance Theory (RT) in a unified framework: AMT provides the backbone argument structure for rewriting an explicit hateful standpoint into a seemingly neutral discussion while preserving the stance, and RT guides generation to keep the AMT chain logically coherent. The benchmark spans 7 sociocultural domains and 28 target groups, comprising 4,745 soft-hate instances. Evaluations across encoder-based detectors, general-purpose LLMs, and safety models show a consistent drop from hard to soft tiers: systems that detect explicit hostility often fail when the same stance is conveyed through subtle, reasoning-based language. Disclaimer: Contains offensive examples used solely for research.

[40] RusLICA: A Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis

Elina Sigdel,Anastasia Panfilova

Main category: cs.CL

TL;DR: This paper adapts the LIWC methodology to Russian, building a Russian psycholinguistic dictionary with 96 categories and the RusLICA web service.

Details

Motivation: LIWC tools were designed primarily for English, and direct translation does not fit Russian grammar and culture, so a psycholinguistic analysis tool built specifically for Russian is needed. Method: Drawing on several lexicographic resources, semantic dictionaries, and corpora, the authors build a Russian-specific resource with 96 categories (syntactic, morphological, lexical, and general statistical features plus predictions from pretrained language models), map lemmas to 42 psycholinguistic categories, and implement the analyzer as the RusLICA web service. Result: A LIWC-style psycholinguistic dictionary and tool for Russian that supports multi-dimensional text feature extraction. Conclusion: The authors argue that building the dictionary for Russian, rather than translating an existing one, better reflects the language's grammatical and cultural specificities and yields a practical tool for Russian text analysis. Abstract: Defining psycholinguistic characteristics in written texts is a task gaining increasing attention from researchers. One of the most widely used tools in this field is Linguistic Inquiry and Word Count (LIWC), which was originally developed to analyze English texts and has been translated into multiple languages. Our approach adapts the LIWC methodology to the Russian language, taking into account its grammatical and cultural specificities. The suggested approach comprises 96 categories, integrating syntactic, morphological, lexical, general statistical features, and results of predictions obtained using pre-trained language models (LMs) for text analysis. Rather than applying direct translation to existing thesauri, we built the dictionary specifically for the Russian language based on the content from several lexicographic resources, semantic dictionaries and corpora. The paper describes the process of mapping lemmas to 42 psycholinguistic categories and the implementation of the analyzer as part of the RusLICA web service.

[41] Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale

Tianwei Lin,Zuyi Zhou,Xinda Zhao,Chenke Wang,Xiaohong Li,Yu Chen,Chuanrui Hu,Jian Pei,Yafeng Deng

Main category: cs.CL

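TL;DR: This paper introduces EverMemBench-S (EMB-S), an adversarial long-context retrieval benchmark that evaluates how well large language models locate evidence under large-scale semantic interference. It finds that semantic discrimination, rather than context length alone, is the key bottleneck for long-context memory.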

Details

Motivation: The popular Needle-in-a-Haystack (NIAH) evaluation is too idealized: the needle is near-unique and the haystack largely irrelevant, so it does not reflect how well models discriminate near-miss distractors in realistic semantic environments and cannot expose the real bottlenecks of long-context LLM agents. Method: Builds EMB-S, an adversarial NIAH-style benchmark on a 326M-token MemoryBank, with collision-tested near-miss hard negatives and gold evidence sets spanning one or more documents, and designs a decoupled diagnostic protocol that reports evidence access (document-ID localization) separately from end-to-end QA quality; systems are evaluated along a reference-corpus ladder from 64K to 326M tokens. Result: Under the stronger semantic interference of EMB-S, systems that saturate benign NIAH degrade sharply in evidence access; native long-context models and RAG systems show similar degradation, indicating that semantic discrimination is the core bottleneck. Conclusion: The memory bottleneck of long-context models lies mainly in precise semantic discrimination rather than in context length alone; future work should focus on identifying and locating evidence in dense, highly similar information environments. Abstract: Long-context LLM agents must access the right evidence from large environments and use it faithfully. However, the popular Needle-in-a-Haystack (NIAH) evaluation mostly measures benign span localization. The needle is near-unique, and the haystack is largely irrelevant. We introduce EverMemBench-S (EMB-S), an adversarial NIAH-style benchmark built on a 326M-token MemoryBank. While the full MemoryBank spans 326M tokens for retrieval-based (RAG) evaluation, we evaluate native long-context models only at scales that fit within each model's context window (up to 1M tokens in this work) to ensure a fair comparison. EMB-S pairs queries with collision-tested near-miss hard negatives and gold evidence sets spanning one or more documents, validated via human screening and LLM verification. We also propose a decoupled diagnostic protocol that reports evidence access (document-ID localization) separately from end-to-end QA quality under full-context prompting. This enables consistent diagnosis for both native long-context prompting and retrieval pipelines. Across a reference-corpus ladder from domain-isolated 64K contexts to a globally shared 326M-token environment, we observe a clear reality gap. Systems that saturate benign NIAH degrade sharply in evidence access under semantic interference. These results indicate that semantic discrimination, not context length alone, is the dominant bottleneck for long-context memory at scale.
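
The decoupled protocol described above reports evidence access separately from answer quality. A minimal sketch of that idea follows; the function names and report keys are hypothetical, and recall over gold document IDs is an assumed choice of access metric, since the abstract does not say which metric EMB-S uses.

```python
from typing import Dict, Iterable


def evidence_recall(predicted_doc_ids: Iterable[str], gold_doc_ids: Iterable[str]) -> float:
    """Fraction of gold evidence documents that the system located."""
    gold = set(gold_doc_ids)
    if not gold:
        raise ValueError("gold evidence set must not be empty")
    return len(gold & set(predicted_doc_ids)) / len(gold)


def decoupled_report(
    predicted_doc_ids: Iterable[str],
    gold_doc_ids: Iterable[str],
    answer_correct: bool,
) -> Dict[str, float]:
    """Keep evidence access and end-to-end QA as two separate scores."""
    return {
        "evidence_access": evidence_recall(predicted_doc_ids, gold_doc_ids),
        "qa_correct": 1.0 if answer_correct else 0.0,
    }


# Toy case: the answer is right although only one of two gold documents was found.
report = decoupled_report(["d1", "d3", "d9"], ["d1", "d7"], answer_correct=True)
print(report)  # {'evidence_access': 0.5, 'qa_correct': 1.0}
```

Keeping the two numbers apart makes it visible when a system answers correctly without actually locating all of the supporting evidence, or locates the evidence but still answers wrongly.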

[42] MiLorE-SSL: Scaling Multilingual Capabilities in Self-Supervised Models without Forgetting

Jing Xu,Minglin Wu,Xueyuan Chen,Xixin Wu,Helen Meng

Main category: cs.CL

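TL;DR: This paper proposes MiLorE-SSL, a framework that combines LoRA with a soft mixture-of-experts mechanism for efficient continual multilingual self-supervised speech representation learning. With only 2.14% trainable parameters it mitigates catastrophic forgetting and improves performance on both new and existing languages.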

Details

Motivation: Multilingual self-supervised speech models are hard to extend to languages unseen in pretraining; training from scratch is computationally expensive, and sequential training tends to cause catastrophic forgetting. Method: MiLorE-SSL combines low-rank adaptation (LoRA) with a soft mixture of experts (soft MoE) and adds limited replay data from existing languages to reduce forgetting. Result: On ML-SUPERB, MiLorE-SSL performs strongly on newly added languages while also improving existing ones, using only 2.14% trainable parameters. Conclusion: MiLorE-SSL is a lightweight, efficient, and extensible framework for continual multilingual SSL training that balances adapting to new languages with retaining old ones. Abstract: Self-supervised learning (SSL) has greatly advanced speech representation learning, but multilingual SSL models remain constrained to languages encountered during pretraining. Retraining from scratch to incorporate new languages is computationally expensive, while sequential training without mitigation strategies often leads to catastrophic forgetting. To address this, we propose MiLorE-SSL, a lightweight framework that combines LoRA modules with a soft mixture-of-experts (MoE) mechanism for efficient continual multilingual training. LoRA provides efficient low-rank adaptation, while soft MoE promotes flexible expert sharing across languages, reducing cross-lingual interference. To further mitigate forgetting, we introduce limited replay data from existing languages, avoiding reliance on large historical corpora. Experiments on ML-SUPERB demonstrate that MiLorE-SSL achieves strong performance in new languages and improves the ability in existing ones with only 2.14% trainable parameters.
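
One way to picture LoRA experts under soft gating is the toy layer below. It is a sketch under stated assumptions, not the MiLorE-SSL architecture: the abstract does not specify the routing, so the dense softmax gate, the function names, and all numbers are illustrative. The frozen weight stays fixed, and each expert contributes a low-rank update B(Ax) weighted by its gate value.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_moe_lora(
    x: Vector,
    frozen_weight: Matrix,
    experts: List[Tuple[Matrix, Matrix]],
    gate_logits: Vector,
) -> Vector:
    """Frozen projection plus a gate-weighted sum of low-rank updates.

    Each expert is a pair (A, B) with A of shape r x d_in and B of shape d_out x r.
    """
    gates = softmax(gate_logits)
    out = matvec(frozen_weight, x)
    for gate, (a, b) in zip(gates, experts):
        update = matvec(b, matvec(a, x))
        out = [o + gate * u for o, u in zip(out, update)]
    return out


# Toy example: identity frozen weight and two rank-1 experts with equal gates.
x = [1.0, 2.0]
frozen = [[1.0, 0.0], [0.0, 1.0]]
expert_one = ([[1.0, 0.0]], [[2.0], [0.0]])
expert_two = ([[0.0, 1.0]], [[0.0], [4.0]])
y = soft_moe_lora(x, frozen, [expert_one, expert_two], gate_logits=[0.0, 0.0])
print(y)  # [2.0, 6.0]
```

Only the small A and B matrices (and the gate) would be trained in such a layer, which is why the trainable share of parameters can stay at a few percent of the full model.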

[43] SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger

Kaiyuan Chen,Guangmin Zheng,Jin Wang,Xiaobing Zhou,Xuejie Zhang

Main category: cs.CL

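TL;DR: This paper proposes Self-Adaptive Process Optimization (SAPO) to improve the self-evolution of small language models (SLMs). Inspired by Error-Related Negativity (ERN) from neuroscience, it adaptively narrows the gap between reasoner and verifier without inefficient Monte Carlo process supervision, outperforms most existing self-evolution methods on mathematics and code tasks, and contributes two new benchmarks for process reward models.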

Details

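Motivation: Existing self-evolution methods overlook the influence of fine-grained reasoning steps, which creates a gap between reasoner and verifier; the computational inefficiency of Monte Carlo process supervision makes the gap harder to close. Inspired by Error-Related Negativity (ERN), through which errors are localized quickly after an incorrect decision and guide rapid adjustment, the authors propose a more efficient, adaptive form of process supervision. Method: Self-Adaptive Process Optimization (SAPO) introduces process supervision signals adaptively and efficiently in small language models by actively minimizing the reasoner-verifier gap instead of relying on Monte Carlo estimation. Result: On two challenging task types, mathematics and code, SAPO outperforms most existing self-evolution methods; two new benchmarks for process reward models (one for mathematics, one for coding) are introduced to assess verifier performance. Conclusion: SAPO is an efficient, neuroscience-inspired self-evolution method that helps close the process gap between reasoner and verifier, is especially suited to resource-constrained small language models, and adds new benchmarks for process supervision. Abstract: Existing self-evolution methods overlook the influence of fine-grained reasoning steps, which leads to the reasoner-verifier gap. The computational inefficiency of Monte Carlo (MC) process supervision further exacerbates the difficulty in mitigating the gap. Motivated by the Error-Related Negativity (ERN), through which the reasoner can localize errors following incorrect decisions and adjust rapidly, we propose a Self-Adaptive Process Optimization (SAPO) method for self-improvement in Small Language Models (SLMs). SAPO adaptively and efficiently introduces process supervision signals by actively minimizing the reasoner-verifier gap rather than relying on inefficient MC estimations. Extensive experiments demonstrate that the proposed method outperforms most existing self-evolution methods on two challenging task types: mathematics and code. Additionally, to further investigate SAPO's impact on verifier performance, this work introduces two new benchmarks for process reward models in both mathematical and coding tasks.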

[44] Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning

Zeyu Xing,Xing Li,Hui-Ling Zhen,Mingxuan Yuan,Sinno Jialin Pan

Main category: cs.CL

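TL;DR: This paper proposes treating the KV cache as a lightweight contextual representation that can serve downstream tasks without extra computation or stored hidden states, and shows its efficiency and usefulness for Chain-of-Embedding and fast/slow thinking switching.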

Details

Motivation: KV caches are normally used only to speed up autoregressive decoding, and the contextual information they hold is underused; the paper explores their potential as a free, reusable representation. Method: Uses the KV cache directly as a lightweight representation in two applications, Chain-of-Embedding and Fast/Slow Thinking Switching, to test its effectiveness. Result: Chain-of-Embedding performance matches or exceeds baselines on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, token generation is reduced by up to 5.7× with minimal accuracy loss. Conclusion: The KV cache is a free, effective, plug-and-play representation available at inference time and opens a new direction for representation reuse in LLM inference. Abstract: KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: (i) Chain-of-Embedding, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, reducing token generation by up to 5.7× with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: https://github.com/cmd2001/ICLR2026_KV-Embedding.

[45] CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

Xinyu Hu,Yancheng He,Weixun Wang,Tao Feng,Li Lin,Jiashun Liu,Wenbo Su,Bo Zheng,Xiaojun Wan

Main category: cs.CL

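TL;DR: This paper proposes CE-RM-4B, a lightweight generative reward model built on pointwise evaluation, two-stage training, and unified query-based criteria, which performs well on reward modeling benchmarks and in downstream reinforcement learning.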

Details

Motivation: LLM-as-a-Judge methods score well on benchmarks but are less effective in actual reinforcement learning, mainly because they rely on pairwise comparison and their evaluation criteria are insufficiently optimized. Method: CE-RM-4B adopts a pointwise evaluation paradigm, a dedicated two-stage rollout training method, and unified query-based evaluation criteria; it is trained on only about 5.7K high-quality examples curated from an open-source preference dataset. Result: It leads on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice. Conclusion: Pointwise modeling, careful criteria design, and efficient data use can substantially improve the effectiveness of generative reward models in real RL settings. Abstract: Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigms enable better and more flexible evaluation, and show promise as generative reward models for reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and actual effectiveness in RL practice. We attribute this issue to some limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. Therefore, we propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method, and adopting unified query-based criteria. Using only about 5.7K high-quality data curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.
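
Best-of-N selection with a pointwise scorer, the setting highlighted above, can be sketched in a few lines. The scorer below is a deliberately crude stand-in (word overlap with the query), not CE-RM-4B; in practice `score` would call a reward model, and all names here are hypothetical.

```python
from typing import Callable, Sequence


def best_of_n(
    query: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> str:
    """Score each candidate independently and return the highest-scoring one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda response: score(query, response))


def toy_score(query: str, response: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the response."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    response_words = set(response.lower().split())
    return len(query_words & response_words) / len(query_words)


query = "capital of France"
candidates = [
    "I am not sure",
    "Paris is the capital of France",
    "France is large",
]
chosen = best_of_n(query, candidates, toy_score)
print(chosen)  # Paris is the capital of France
```

A pointwise scorer needs N calls to rank N candidates, whereas exhaustive pairwise comparison needs N(N-1)/2, which is one practical reason pointwise rewards fit Best-of-N selection and RL rollouts.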

[46] PsychePass: Calibrating LLM Therapeutic Competence via Trajectory-Anchored Tournaments

Zhuang Chen,Dazhen Wan,Zhangkai Zheng,Guanqun Bi,Xiyao Xiao,Binghang Li,Minlie Huang

Main category: cs.CL

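TL;DR: This paper proposes PsychePass, a framework that evaluates the therapeutic competence of large language models through trajectory-anchored tournaments. It addresses the process drift and standard drift of existing evaluation methods and supports reinforcement learning from the resulting reward signals.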

Details

Motivation: Evaluating the therapeutic competence of LLMs in mental healthcare is difficult because counseling is unstructured and long-running, and existing evaluation paradigms are unanchored, which leads to process drift and standard drift. Method: PsychePass is a unified framework that (1) anchors the interaction trajectory in simulation, with simulated clients precisely controlling the consultation process to probe multifaceted capabilities; (2) anchors the battle trajectory in judgment through a Swiss-system tournament whose dynamic pairwise battles yield robust Elo ratings; and (3) converts tournament trajectories into credible reward signals for on-policy reinforcement learning. Result: Experiments show that PsychePass evaluates LLM therapeutic competence more reliably and agrees strongly with human expert judgments. Conclusion: PsychePass offers a stable, scalable, and learnable evaluation paradigm for LLMs in complex interactive settings such as counseling. Abstract: While large language models show promise in mental healthcare, evaluating their therapeutic competence remains challenging due to the unstructured and longitudinal nature of counseling. We argue that current evaluation paradigms suffer from an unanchored defect, leading to two forms of instability: process drift, where unsteered client simulation wanders away from specific counseling goals, and standard drift, where static pointwise scoring lacks the stability for reliable judgment. To address this, we introduce PsychePass, a unified framework that calibrates the therapeutic competence of LLMs via trajectory-anchored tournaments. We first anchor the interaction trajectory in simulation, where clients precisely control the fluid consultation process to probe multifaceted capabilities. We then anchor the battle trajectory in judgments through an efficient Swiss-system tournament, utilizing dynamic pairwise battles to yield robust Elo ratings. Beyond ranking, we demonstrate that tournament trajectories can be transformed into credible reward signals, enabling on-policy reinforcement learning to enhance LLMs' performance. Extensive experiments validate the effectiveness of PsychePass and its strong consistency with human expert judgments.
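
The tournament idea combines two standard pieces: Swiss-style pairing of similarly rated players and the Elo rating update. The sketch below uses the textbook Elo formula; the K-factor of 32, the starting ratings, the simplified pairing rule (sort and pair neighbours, without rematch avoidance), and the model names are illustrative assumptions rather than PsychePass settings.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that player A beats player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings; score_a is 1.0 for a win, 0.5 draw, 0.0 loss."""
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


def swiss_pairs(ratings: Dict[str, float]) -> List[Tuple[str, str]]:
    """Simplified Swiss pairing: sort by rating and pair adjacent players."""
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    return [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]


ratings = {"model_a": 1500.0, "model_b": 1500.0, "model_c": 1400.0, "model_d": 1300.0}
pairs = swiss_pairs(ratings)
print(pairs)  # [('model_a', 'model_b'), ('model_c', 'model_d')]

# Suppose a judge prefers model_a over model_b in their battle.
new_a, new_b = elo_update(ratings["model_a"], ratings["model_b"], score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

Pairing close-rated models each round concentrates comparisons where the ranking is least certain, so far fewer battles are needed than in a full round-robin.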

[47] MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

Qinzhuo Wu,Zhizhuo Yang,Hanhao Li,Pengzhi Gao,Wei Liu,Jian Luan

Main category: cs.CL

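TL;DR: This paper presents MobileBench-OL, an online benchmark for GUI agents on Chinese mobile apps, with 1080 tasks in 5 subsets that assess task execution, complex reasoning, and noise robustness, plus an automatic evaluation framework. Experiments show that 12 leading GUI agents still have considerable room for improvement.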

Details

Motivation: Existing online GUI agent benchmarks focus on instruction following, neglect reasoning and exploration, and ignore the random noise of real mobile environments, leaving a gap between benchmarks and real use. Method: Builds the online benchmark MobileBench-OL with 1080 tasks from 80 Chinese apps, organized into 5 subsets that assess task execution, complex reasoning, and noise robustness along multiple dimensions, and develops an auto-eval framework with a reset mechanism for stable, repeatable evaluation in real environments. Result: Evaluating 12 leading GUI agents shows a significant gap to real-world requirements; human evaluation confirms that the benchmark reliably reflects agent performance in real environments. Conclusion: MobileBench-OL addresses the shortcomings of existing benchmarks in reasoning, exploration, and environmental noise, and offers a more realistic, comprehensive evaluation standard for mobile GUI agents. Abstract: Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on the agents' task instruction-following ability while neglecting their reasoning and exploration ability. Moreover, these benchmarks do not consider the random noise in real-world mobile environments. This leads to a gap between benchmarks and real-world environments. To address these limitations, we propose MobileBench-OL, an online benchmark with 1080 tasks from 80 Chinese apps. It measures task execution, complex reasoning, and noise robustness of agents by including 5 subsets, which set multiple evaluation dimensions. We also provide an auto-eval framework with a reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 12 leading GUI agents on MobileBench-OL shows significant room for improvement to meet real-world requirements. Human evaluation further confirms that MobileBench-OL can reliably measure the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.

[48] Improving Diffusion Language Model Decoding through Joint Search in Generation Order and Token Space

Yangyi Shen,Tianjian Feng,Jiaqi Han,Wen Wang,Tianlang Chen,Chunhua Shen,Jure Leskovec,Stefano Ermon

Main category: cs.CL

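TL;DR: This paper proposes Order-Token Search, which improves decoding for diffusion language models (DLMs) by searching jointly over generation order and token values, and outperforms baselines on several mathematical reasoning and coding benchmarks.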

Details

Motivation: Current DLM decoding methods follow a single trajectory, which limits exploration of the trajectory space. Method: Order-Token Search is built around a likelihood estimator that scores denoising actions, enabling stable pruning and efficient exploration of diverse trajectories. Result: Absolute gains over the backbone of 3.1%, 3.8%, 7.9%, and 6.8% on GSM8K, MATH500, Countdown, and HumanEval respectively, matching or surpassing d1-LLaDA, which is post-trained with diffu-GRPO. Conclusion: Joint search over generation order and token values is a key component for improving DLM decoding. Abstract: Diffusion Language Models (DLMs) offer order-agnostic generation that can explore many possible decoding trajectories. However, current decoding methods commit to a single trajectory, limiting exploration in trajectory space. We introduce Order-Token Search to explore this space through jointly searching over generation order and token values. Its core is a likelihood estimator that scores denoising actions, enabling stable pruning and efficient exploration of diverse trajectories. Across mathematical reasoning and coding benchmarks, Order-Token Search consistently outperforms baselines on GSM8K, MATH500, Countdown, and HumanEval (3.1%, 3.8%, 7.9%, and 6.8% absolute over backbone), matching or surpassing diffu-GRPO post-trained d1-LLaDA. Our work establishes joint search as a key component for advancing decoding in DLMs.

[49] Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents

Qihao Wang,Yue Hu,Mingzhe Lu,Jiayue Wu,Yanbing Liu,Yuanmin Tang

Main category: cs.CL

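TL;DR: This paper proposes a framework grounded in Cognitive Load Theory for diagnosing the capability bottlenecks of large language models (LLMs) in tool use, and builds ToolLoad-Bench, the first benchmark with adjustable cognitive load, revealing clear boundaries where performance drops as load increases.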

Details

Motivation: Current benchmarks report only final accuracy and cannot expose the real cognitive bottlenecks of LLM tool use; evaluation needs to move from performance scoring to a diagnostic framework that maps capability boundaries. Method: Based on Cognitive Load Theory, task complexity is decomposed into Intrinsic Load (formalized with a newly proposed Tool Interaction Graph) and Extraneous Load (arising from ambiguous task presentation), and ToolLoad-Bench is built with parametrically adjustable cognitive load. Result: Model performance shows distinct performance cliffs as cognitive load increases, and the framework's predictions are highly calibrated with empirical results. Conclusion: The framework provides a principled method for understanding the capability boundaries of agents and a practical foundation for building more efficient systems. Abstract: The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model's capability boundary. We validate that our framework's predictions are highly calibrated with empirical results, establishing a principled methodology for understanding an agent's limits and a practical foundation for building more efficient systems.

[50] SpeechMapper: Speech-to-text Embedding Projector for LLMs

Biswesh Mohapatra,Marcely Zanon Boito,Ioan Calapodescu

Main category: cs.CL

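TL;DR: SpeechMapper is an efficient, low-cost approach to training speech-to-LLM embeddings: pretraining followed by a brief 1K-step instruction tuning stage yields strong generalization, with good results in both task-agnostic and task-specific settings.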

Details

Motivation: Current speech LLMs train all components jointly on speech instruction data, which is computationally expensive and prone to task and prompt overfitting; a more robust, general, and cheaper way to align speech with LLMs is needed. Method: SpeechMapper works in two stages: the speech mapping block is first pretrained on inexpensive hardware without the LLM, then attached to the target LLM through a very short (1K-step) instruction tuning stage. Both task-agnostic tuning (an ASR-based adaptation) and task-specific tuning are supported. Result: On speech translation and spoken question answering, the task-agnostic setting rivals the best instruction-following speech LLM from IWSLT25 without having been trained on those tasks; the task-specific setting outperforms that model across many datasets with less data and compute. Conclusion: SpeechMapper offers a practical, scalable approach to speech-LLM integration that lowers training cost and improves generalization and robustness without large-scale instruction tuning. Abstract: Current speech LLMs bridge speech foundation models to LLMs using projection layers, training all of these components on speech instruction data. This strategy is computationally intensive and susceptible to task and prompt overfitting. We present SpeechMapper, a cost-efficient speech-to-LLM-embedding training approach that mitigates overfitting, enabling more robust and generalizable models. Our model is first pretrained without the LLM on inexpensive hardware, and then efficiently attached to the target LLM via a brief 1K-step instruction tuning (IT) stage. Through experiments on speech translation and spoken question answering, we demonstrate the versatility of SpeechMapper's pretrained block, presenting results for both task-agnostic IT, an ASR-based adaptation strategy that does not train on the target task, and task-specific IT. In task-agnostic settings, SpeechMapper rivals the best instruction-following speech LLM from IWSLT25, despite never being trained on these tasks, while in task-specific settings, it outperforms this model across many datasets, despite requiring less data and compute. Overall, SpeechMapper offers a practical and scalable approach for efficient, generalizable speech-LLM integration without large-scale IT.

[51] Hopes and Fears -- Emotion Distribution in the Topic Landscape of Finnish Parliamentary Speech 2000-2020

Anna Ristilä,Otto Tarkka,Veronika Laippala,Kimmo Elo

Main category: cs.CL

TL;DR: This paper analyzes emotion expression across different topics in Finnish parliamentary speeches (2000–2020), revealing topic-specific emotional patterns and confirming a trend of increasing positivity over time.

Details

Motivation: Existing research treats parliamentary discourse as homogeneous and lacks empirical investigation into how emotions vary across topics; this study addresses that gap for Finnish Parliament (Eduskunta). Method: Applies an emotion analysis model to parliamentary speeches from Eduskunta (2000–2020), examining emotion expression both synchronically (across topics at a given time) and diachronically (over time). Result: Finds topic-specific emotion patterns and corroborates a rising trend of positivity in parliamentary speech over two decades. Conclusion: Parliamentary emotion expression is not uniform but strongly topic-dependent; the study provides novel empirical evidence on emotional dynamics in legislative discourse, with implications for political communication and computational social science. Abstract: Existing research often treats parliamentary discourse as a homogeneous whole, overlooking topic-specific patterns. Parliamentary speeches address a wide range of topics, some of which evoke stronger emotions than others. While everyone has intuitive assumptions about what the most emotive topics in a parliament may be, there has been little research into the emotions typically linked to different topics. This paper strives to fill this gap by examining emotion expression among the topics of parliamentary speeches delivered in Eduskunta, the Finnish Parliament, between 2000 and 2020. An emotion analysis model is used to investigate emotion expression in topics, from both synchronic and diachronic perspectives. The results strengthen evidence of increasing positivity in parliamentary speech and provide further insights into topic-specific emotion expression within parliamentary debate.

[52] PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use

Qihao Wang,Mingzhe Lu,Jiayue Wu,Yue Hu,Yanbing Liu

Main category: cs.CL

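TL;DR: This paper proposes PEARL, a framework that combines offline exploration with online reinforcement learning (GRPO) to improve LLM planning and execution in complex multi-step tool use, reaching new state-of-the-art performance on the ToolHop and T-Eval benchmarks.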

Details

Motivation: In complex, multi-turn tool invocation, large language models suffer from weak planning, tool hallucination, erroneous parameters, and poor interaction robustness. Method: PEARL takes a two-stage approach: an offline stage explores tools to learn valid usage patterns and failure conditions; an online stage trains a dedicated Planner with Group Relative Policy Optimization (GRPO) and a fine-grained reward function that distinguishes planning quality. Result: PEARL significantly outperforms existing methods on the ToolHop and T-Eval benchmarks, raising the ToolHop success rate to 56.5% while keeping the invocation error rate low. Conclusion: PEARL effectively mitigates the planning challenges LLMs face in complex tool use and advances more robust, reliable LLM agents. Abstract: Large Language Models show great potential with external tools, but face significant challenges in complex, multi-turn tool invocation. They often exhibit weak planning, tool hallucination, erroneous parameter generation, and struggle with robust interaction. To tackle these issues, we present PEARL, a novel framework to enhance LLM planning and execution for sophisticated tool use. PEARL adopts a two-stage approach: an offline phase where the agent explores tools to learn valid usage patterns and failure conditions, and an online reinforcement learning phase. In the online phase, a dedicated Planner is trained via Group Relative Policy Optimization (GRPO) with a carefully designed reward function that provides distinct signals for planning quality. Experiments on the ToolHop and T-Eval benchmarks show PEARL significantly outperforms existing methods, achieving a new state-of-the-art success rate of 56.5% on ToolHop while maintaining a low invocation error rate. Our work marks a key advance in addressing the complex planning challenges of tool use, contributing to the development of more robust and reliable LLM-based agents.
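
The group-relative step that gives GRPO its name can be illustrated with a minimal sketch. It assumes the common formulation in which each rollout's reward is standardized against the other rollouts sampled for the same task; the reward values and the function name `group_relative_advantages` are hypothetical, and PEARL's actual reward design is not described in enough detail here to reproduce.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same task.

    Assumption: population standard deviation plus a small epsilon;
    concrete GRPO implementations may normalize slightly differently.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical planning rewards for four plans sampled for one task.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Plans above the group mean get a positive advantage, plans below it a
# negative one, so no separate value network is needed as a baseline.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Because the baseline is the group mean, a group in which every plan earns the same reward yields zero advantage for all of them, which is one reason a reward that separates plans by quality matters for this kind of training.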

[53] MuVaC: A Variational Causal Framework for Multimodal Sarcasm Understanding in Dialogues

Diandian Guo,Fangfang Yuan,Cong Cao,Xixun Lin,Chuan Zhou,Hao Peng,Yanan Cao,Yanbing Liu

Main category: cs.CL

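TL;DR: This paper proposes MuVaC, a framework that uses variational causal inference to jointly optimize Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE), modeling the causal dependency between the two and adopting an alignment-then-fusion strategy to improve feature robustness and result consistency.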

Details

Motivation: Existing work mostly treats Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE) as independent tasks and overlooks their inherent causal dependency, which makes genuinely interpretable and trustworthy sarcasm understanding hard to achieve. Method: The authors propose MuVaC, a variational causal inference framework that (1) models the causal pathways between MSD and MuSE with structural causal models and defines a joint optimization objective; (2) integrates multimodal features with an alignment-then-fusion strategy; and (3) strengthens reasoning trustworthiness through a consistency constraint between detection results and explanations. Result: Experiments on public datasets show that MuVaC outperforms existing methods on both MSD and MuSE and produces more trustworthy, consistent explanations. Conclusion: MuVaC models the causal relationship between sarcasm detection and explanation, offering an interpretable, robust, and trustworthy approach to multimodal sarcasm understanding. Abstract: The prevalence of sarcasm in multimodal dialogues on social platforms presents a crucial yet challenging task for understanding the true intent behind online content. Comprehensive sarcasm analysis requires two key aspects: Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE). Intuitively, the act of detection is the result of the reasoning process that explains the sarcasm. Current research predominantly focuses on addressing either MSD or MuSE as a single task. Even though some recent work has attempted to integrate these tasks, their inherent causal dependency is often overlooked. To bridge this gap, we propose MuVaC, a variational causal inference framework that mimics human cognitive mechanisms for understanding sarcasm, enabling robust multimodal feature learning to jointly optimize MSD and MuSE. Specifically, we first model MSD and MuSE from the perspective of structural causal models, establishing variational causal pathways to define the objectives for joint optimization. Next, we design an alignment-then-fusion approach to integrate multimodal features, providing robust fusion representations for sarcasm detection and explanation generation. Finally, we enhance the reasoning trustworthiness by ensuring consistency between detection results and explanations. Experimental results demonstrate the superiority of MuVaC on public datasets, offering a new perspective for understanding multimodal sarcasm.

[54] BMAM: Brain-inspired Multi-Agent Memory Framework

Yang Li,Jiaxiang Liu,Yusong Wang,Yujie Wu,Mingkun Xu

Main category: cs.CL

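TL;DR: This paper proposes BMAM (Brain-inspired Multi-Agent Memory), an architecture that mimics human cognitive memory systems by decomposing agent memory into episodic, semantic, salience-aware, and control-oriented subsystems. It targets information loss and behavioral inconsistency in long-horizon interaction ("soul erosion") and reaches 78.45% accuracy on the LoCoMo benchmark.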

Details

Motivation: Language-model agents in long-horizon interaction are prone to "soul erosion", that is, loss of temporally grounded information and inconsistent behavior across sessions, and need memory modeling that better matches cognitive principles. Method: The authors propose BMAM, a brain-inspired architecture that divides memory into functionally specialized subsystems (episodic, semantic, salience-aware, control-oriented); episodic memory is organized along explicit timelines and retrieval fuses multiple signals. Result: BMAM reaches 78.45% accuracy under the standard long-horizon evaluation on the LoCoMo benchmark; ablations confirm that the hippocampus-inspired episodic memory subsystem is critical for temporal reasoning. Conclusion: With its brain-inspired multi-subsystem design, BMAM mitigates soul erosion and supports the value of functional specialization and temporal structure for long-horizon agent memory. Abstract: Language-model-based agents operating over extended interaction horizons face persistent challenges in preserving temporally grounded information and maintaining behavioral consistency across sessions, a failure mode we term soul erosion. We present BMAM (Brain-inspired Multi-Agent Memory), a general-purpose memory architecture that models agent memory as a set of functionally specialized subsystems rather than a single unstructured store. Inspired by cognitive memory systems, BMAM decomposes memory into episodic, semantic, salience-aware, and control-oriented components that operate at complementary time scales. To support long-horizon reasoning, BMAM organizes episodic memories along explicit timelines and retrieves evidence by fusing multiple complementary signals. Experiments on the LoCoMo benchmark show that BMAM achieves 78.45 percent accuracy under the standard long-horizon evaluation setting, and ablation analyses confirm that the hippocampus-inspired episodic memory subsystem plays a critical role in temporal reasoning.

[55] Can We Improve Educational Diagram Generation with In-Context Examples? Not if a Hallucination Spoils the Bunch

Evanfiya Logacheva,Arto Hellas,Tsvetomila Mihaylova,Juha Sorva,Ava Heinonen,Juho Leinonen

Main category: cs.CL

TL;DR: This paper introduces a novel RST-based method for generating diagrams with in-context examples to reduce AI hallucination and improve faithfulness in LLM-generated educational diagrams, validated by educator assessments and analysis of context complexity effects.

Details

Motivation: Concerns about the quality and factual reliability of generative AI-produced educational materials, especially diagrams, motivate this work. Method: A novel diagram code generation method using in-context examples grounded in Rhetorical Structure Theory (RST) to align LLM outputs with user expectations. Result: The method reduces factual hallucination and improves diagram faithfulness; however, output quality varies due to LLM stochasticity. Higher text complexity correlates with increased hallucination, and LLMs often fail to self-correct. Conclusion: RST-guided in-context generation enhances diagram reliability in computing education, but inherent LLM limitations—especially regarding self-detection of errors and sensitivity to input complexity—remain key challenges. Abstract: Generative artificial intelligence (AI) has found a widespread use in computing education; at the same time, quality of generated materials raises concerns among educators and students. This study addresses this issue by introducing a novel method for diagram code generation with in-context examples based on the Rhetorical Structure Theory (RST), which aims to improve diagram generation by aligning models' output with user expectations. Our approach is evaluated by computer science educators, who assessed 150 diagrams generated with large language models (LLMs) for logical organization, connectivity, layout aesthetic, and AI hallucination. The assessment dataset is additionally investigated for its utility in automated diagram evaluation. The preliminary results suggest that our method decreases the rate of factual hallucination and improves diagram faithfulness to provided context; however, due to LLMs' stochasticity, the quality of the generated diagrams varies. 
Additionally, we present an in-depth analysis and discussion on the connection between AI hallucination and the quality of generated diagrams, which reveals that text contexts of higher complexity lead to higher rates of hallucination and LLMs often fail to detect mistakes in their output.

[56] Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models

Kumiko Nakajima,Jan Zuiderveld,Sandro Pezzelle

Main category: cs.CL

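TL;DR: This paper argues that the Divergent Association Task (DAT), currently used to evaluate creativity in large language models (LLMs), is theoretically flawed because it measures only novelty and ignores appropriateness. It proposes CDAT, a new evaluation task grounded in human creativity theory (novelty plus appropriateness), and finds that under CDAT smaller models are often more creative while larger models lean toward appropriateness, suggesting that training and alignment may move models along the novelty-appropriateness trade-off frontier.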

Details

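Motivation: Existing LLM creativity evaluations such as DAT lack grounding in human creativity theory and attend only to novelty while ignoring appropriateness, which makes results hard to interpret and their validity questionable. Method: Building on the two-part definition of creativity as novelty plus appropriateness, the authors propose the Conditional Divergent Association Task (CDAT), which evaluates novelty under a given contextual constraint to better separate noise from genuine creativity, and compare DAT and CDAT results across several state-of-the-art LLMs. Result: On DAT, LLMs score lower than baselines with no creative ability, which undermines the task's validity; CDAT shows that smaller models are often more creative (high novelty with acceptable appropriateness), whereas larger models favor high appropriateness at lower novelty; the authors hypothesize that training and alignment shift models along the novelty-appropriateness frontier. Conclusion: DAT is unsuitable as a standard for evaluating LLM creativity; CDAT is a more theoretically consistent, simple, and objective benchmark; creativity evaluation must consider both novelty and appropriateness, and model scale and training objectives strongly affect measured creativity. Abstract: Large language models (LLMs) are increasingly used in verbal creative tasks. However, previous assessments of the creative capabilities of LLMs remain weakly grounded in human creativity theory and are thus hard to interpret. The widely used Divergent Association Task (DAT) focuses on novelty, ignoring appropriateness, a core component of creativity. We evaluate a range of state-of-the-art LLMs on DAT and show that their scores on the task are lower than those of two baselines that do not possess any creative abilities, undermining its validity for model evaluation. Grounded in human creativity theory, which defines creativity as the combination of novelty and appropriateness, we introduce Conditional Divergent Association Task (CDAT). CDAT evaluates novelty conditional on contextual appropriateness, separating noise from creativity better than DAT, while remaining simple and objective. Under CDAT, smaller model families often show the most creativity, whereas advanced families favor appropriateness at lower novelty. We hypothesize that training and alignment likely shift models along this frontier, making outputs more appropriate but less creative. We release the dataset and code.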
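
A novelty-only score of the kind DAT produces can be sketched as the mean pairwise cosine distance between word embeddings, scaled by 100. The sketch below assumes that formulation and uses made-up two-dimensional vectors in place of real word embeddings; the names `cosine_distance` and `dat_style_score` are hypothetical. It does not model CDAT, whose conditioning on context is not specified in the abstract.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dat_style_score(vectors):
    """Mean pairwise cosine distance over all word vectors, times 100."""
    pairs = list(combinations(vectors, 2))
    return 100.0 * sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (assumption): real scoring would look up each word in a
# pretrained embedding table.
related_words = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
unrelated_words = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(dat_style_score(related_words))
print(dat_style_score(unrelated_words))
```

A score built this way rewards any set of mutually distant words, including random ones, which is consistent with the abstract's finding that baselines without creative ability can outscore LLMs on DAT.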

[57] Single-Nodal Spontaneous Symmetry Breaking in NLP Models

Shalom Rosner,Ronit D. Gross,Ella Koresh,Ido Kanter

Main category: cs.CL

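TL;DR: This paper shows that natural language processing (NLP) models exhibit phase-transition-like spontaneous symmetry breaking at the level of individual attention heads and even single nodes, explains the mechanism through nodal cooperation and a capability trade-off, contrasts it with spin-glass systems, and demonstrates it empirically on a BERT-6 model.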

Details

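Motivation: Inspired by phase transitions and spontaneous symmetry breaking in statistical mechanics, the work explores whether the phenomenon carries over to deep learning models, NLP models in particular, in order to understand how internal representations form and emerge. Method: Using the BERT-6 architecture, pre-trained on a Wikipedia dataset and fine-tuned on the FewRel task, the authors analyze the token and label preferences of individual attention heads and single nodes, use convex hull analysis to upper-bound single-node function, and model the crossover in learning ability as the number of nodes grows (decline from random guessing versus gain from cooperation). Result: Node-level spontaneous symmetry breaking appears during both pre-training and fine-tuning: nodes specialize in a limited set of tokens or labels; a crossover in learning ability occurs as the number of nodes increases; each nodal function contributes explicitly to the global task and can be bounded above by convex hull analysis; and the phenomenon does not depend on stochasticity or the thermodynamic limit. Conclusion: Spontaneous symmetry breaking is a general, quantifiable, task-related mechanism of structural emergence in NLP models, offering a statistical-physics perspective on how large models organize and divide their internal representations. Abstract: Spontaneous symmetry breaking in statistical mechanics primarily occurs during phase transitions at the thermodynamic limit where the Hamiltonian preserves inversion symmetry, yet the low-temperature free energy exhibits reduced symmetry. Herein, we demonstrate the emergence of spontaneous symmetry breaking in natural language processing (NLP) models during both pre-training and fine-tuning, even under deterministic dynamics and within a finite training architecture. This phenomenon occurs at the level of individual attention heads, is scaled down to a small subset of their nodes, and is also valid at a single-nodal level, where nodes acquire the capacity to learn a limited set of tokens after pre-training or labels after fine-tuning for a specific classification task. As the number of nodes increases, a crossover in learning ability occurs, governed by the tradeoff between a decrease following random-guess among increased possible outputs, and enhancement following nodal cooperation, which exceeds the sum of individual nodal capabilities. In contrast to spin-glass systems, where a microscopic state of frozen spins cannot be directly linked to the free-energy minimization goal, each nodal function in this framework contributes explicitly to the global network task and can be upper-bounded using convex hull analysis. Results are demonstrated using the BERT-6 architecture pre-trained on a Wikipedia dataset and fine-tuned on the FewRel classification task.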

[58] A Computational Approach to Language Contact -- A Case Study of Persian

Ali Basirat,Danial Namazifard,Navid Baradaran Hemmati

Main category: cs.CL

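TL;DR: This paper studies structural traces of language contact in the intermediate representations of a monolingual language model, finding that universal syntactic information is largely insensitive to historical contact, whereas morphological features such as Case and Gender are strongly shaped by language-specific structure.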

Details

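Motivation: To investigate structural traces of language contact in the intermediate representations of a monolingual language model, focusing on Persian as a historically contact-rich language. Method: The authors probe the intermediate representations of a Persian-trained model when it is exposed to different languages, quantify the amount of linguistic information encoded there, and assess how information about different morphosyntactic features is distributed across model components. Result: Universal syntactic information is insensitive to historical contact, whereas morphological features such as Case and Gender are strongly shaped by language-specific structure. Conclusion: Contact effects in monolingual language models are selective and structurally constrained. Abstract: We investigate structural traces of language contact in the intermediate representations of a monolingual language model. Focusing on Persian (Farsi) as a historically contact-rich language, we probe the representations of a Persian-trained model when exposed to languages with varying degrees and types of contact with Persian. Our methodology quantifies the amount of linguistic information encoded in intermediate representations and assesses how this information is distributed across model components for different morphosyntactic features. The results show that universal syntactic information is largely insensitive to historical contact, whereas morphological features such as Case and Gender are strongly shaped by language-specific structure, suggesting that contact effects in monolingual language models are selective and structurally constrained.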

[59] AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

Kaiyuan Chen,Qimin Wu,Taiyu Hou,Tianhao Tang,Xueyu Hu,Yuchen Hou,Bikun Li,Chengming Qian,Guoyin Wang,Haolin Chen,Haotong Tian,Haoye Zhang,Haoyu Bian,Hongbing Pan,Hongkang Zhang,Hongyi Zhou,Jiaqi Cai,Jiewu Rao,Jiyuan Ren,Keduan Huang,Lucia Zhu Huang,Mingyu Yuan,Naixu Guo,Qicheng Tang,Qinyan Zhang,Shuai Chen,Siheng Chen,Ting Ting Li,Xiaoxing Guo,Yaocheng Zuo,Yaoqi Guo,Yinan Wang,Yinzhou Yu,Yize Wang,Yuan Jiang,Yuan Tian,Yuanshuo Zhang,Yuxuan Liu,Yvette Yan Zeng,Zenyu Shan,Zihan Yin,Xiaobo Hu,Yang Liu,Yixin Ren,Yuan Gong

Main category: cs.CL

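TL;DR: This paper presents AgentIF-OneDay, a benchmark for evaluating how well AI agents handle everyday tasks. It covers three user-centric task categories (open workflow execution, latent instruction understanding, and iterative refinement) and is scored with instance-level rubrics and an evaluation pipeline that aligns LLM-based verification with human judgment.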

Details

Motivation: Current AI agent evaluations focus too heavily on raising task difficulty and neglect the diverse tasks ordinary users need in daily life, work, and study, so users perceive only a limited share of what AI can do. Method: The authors propose the AgentIF-OneDay benchmark, with 104 tasks and 767 scoring points in three categories (Open Workflow Execution, Latent Instruction, and Iterative Refinement), evaluated with instance-level rubrics and a pipeline that combines LLM-based verification (Gemini-3-Pro) with human judgment. Result: Four leading general AI agents were benchmarked; agent products built on APIs and the RL-based ChatGPT agent both fall in the first tier, and leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to build cutting-edge agent products. Conclusion: AgentIF-OneDay fills a gap in evaluating AI agents on real users' everyday scenarios, emphasizing task diversity, file-based deliverables, and natural-language instruction understanding, and moving agents from "can do hard things" toward "can do everyday things". Abstract: The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow, demonstrating exceptional performance in coding, deep research, and complex problem-solving evaluations. However, in daily scenarios, the perception of these advanced AI capabilities among general users remains limited. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks necessary to cover the daily work, life, and learning activities of a broad demographic. To address this, we propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible file-based results. The benchmark is structured around three user-centric categories: Open Workflow Execution, which assesses adherence to explicit and complex workflows; Latent Instruction, which requires agents to infer implicit instructions from attachments; and Iterative Refinement, which involves modifying or expanding upon ongoing work. We employ instance-level rubrics and a refined evaluation pipeline that aligns LLM-based verification with human judgment, achieving an 80.1% agreement rate using Gemini-3-Pro. AgentIF-OneDay comprises 104 tasks covering 767 scoring points.
We benchmarked four leading general AI agents and found that agent products built based on APIs and ChatGPT agents based on agent RL remain in the first tier simultaneously. Leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to develop cutting-edge Agent products.

[60] P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering

Wenlin Zhong,Chengyuan Liu,Yiquan Wu,Bovin Tan,Changlong Sun,Yi Wang,Xiaozhong Liu,Kun Kuang

Main category: cs.CL

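TL;DR: This paper proposes Probabilistic Process Supervision (P2S), a framework that computes a Path Faithfulness Reward (PFR) for each step of a reasoning chain, providing fine-grained process supervision without human annotation or an extra reward model and significantly improving LLM performance on general-domain reasoning tasks.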

Details

Motivation: Existing reinforcement learning methods such as RLPR use only the probability of the final answer as the reward signal and neglect step-by-step supervision of the reasoning process itself, while general domains lack verifiable reward signals, which makes process supervision difficult. Method: The P2S framework automatically synthesizes and filters a high-quality reference reasoning chain (gold-CoT) during RL training and computes a Path Faithfulness Reward (PFR) for each step, defined from the conditional probability of generating the gold-CoT suffix given the current reasoning prefix; PFR can be combined with any outcome-based reward to provide dense process guidance. Result: On reading comprehension and medical question answering benchmarks, P2S significantly outperforms strong baselines. Conclusion: Through a fine-grained process reward that needs no extra model or human annotation, P2S alleviates reward sparsity and improves LLM performance on general-domain reasoning tasks. Abstract: While reinforcement learning with verifiable rewards (RLVR) has advanced LLM reasoning in structured domains like mathematics and programming, its application to general-domain reasoning tasks remains challenging due to the absence of verifiable reward signals. To this end, methods like Reinforcement Learning with Reference Probability Reward (RLPR) have emerged, leveraging the probability of generating the final answer as a reward signal. However, these outcome-focused approaches neglect crucial step-by-step supervision of the reasoning process itself. To address this gap, we introduce Probabilistic Process Supervision (P2S), a novel self-supervision framework that provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps. During reinforcement learning, P2S synthesizes and filters a high-quality reference reasoning chain (gold-CoT). The core of our method is to calculate a Path Faithfulness Reward (PFR) for each reasoning step, which is derived from the conditional probability of generating the gold-CoT's suffix, given the model's current reasoning prefix. Crucially, this PFR can be flexibly integrated with any outcome-based reward, directly tackling the reward sparsity problem by providing dense guidance. Extensive experiments on reading comprehension and medical Question Answering benchmarks show that P2S significantly outperforms strong baselines.
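
The idea of a per-step reward derived from the conditional probability of the gold-CoT suffix can be illustrated with a small sketch. It makes two assumptions that the abstract does not confirm: the reward is the length-normalized (geometric-mean) token probability of the suffix, and the per-token log-probabilities have already been obtained from the policy model. The function name `path_faithfulness_reward` and all numbers are hypothetical.

```python
import math


def path_faithfulness_reward(suffix_token_logprobs):
    """Geometric-mean probability of the gold suffix tokens.

    Each entry is log p(token | reasoning prefix, earlier suffix tokens).
    Length normalization is an assumption made for this sketch so that
    long and short suffixes stay comparable.
    """
    if not suffix_token_logprobs:
        raise ValueError("suffix must contain at least one token")
    mean_logprob = sum(suffix_token_logprobs) / len(suffix_token_logprobs)
    return math.exp(mean_logprob)


# Hypothetical log-probs of the remaining gold-CoT tokens after each of
# three reasoning steps; a faithful step should make them more likely.
logprobs_after_step = [
    [math.log(0.2), math.log(0.2), math.log(0.2)],
    [math.log(0.5), math.log(0.5)],
    [math.log(0.9)],
]

step_rewards = [path_faithfulness_reward(lp) for lp in logprobs_after_step]
print(step_rewards)  # one dense reward per reasoning step
```

Each step receives its own scalar, so such a signal could be added to a sparse outcome reward instead of replacing it, matching the abstract's description of PFR as complementary to outcome-based rewards.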

[61] A Dialectic Pipeline for Improving LLM Robustness

Sara Candussio

Main category: cs.CL

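TL;DR: This work proposes a dialectic self-dialogue pipeline that requires no additional training and imposes no domain restriction; through context-enriched self-reflection and correction, it substantially improves LLM answer quality and reduces hallucinations.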

Details

Motivation: Existing methods for reducing LLM hallucinations, such as domain-specific fine-tuning or dedicated verifiers, are computationally expensive and generalize poorly, so a lightweight, general alternative is needed. Method: The author designs a dialectic reasoning pipeline based on self-dialogue, injects relevant context at every stage in an oracle-RAG setting, studies the effect of summarizing or filtering that context, and validates the approach on multiple datasets and model families. Result: The dialectic pipeline clearly outperforms the models' standard answers and consistently surpasses Chain-of-Thought-only prompting. Conclusion: A self-dialogue dialectic pipeline that needs no additional training and preserves generality is an efficient, practical way to improve the quality and reliability of LLM outputs. Abstract: Assessing ways in which Language Models can reduce their hallucinations and improve the outputs' quality is crucial to ensure their large-scale use. However, methods such as fine-tuning on domain-specific data or the training of a separate ad hoc verifier require demanding computational resources (not feasible for many user applications) and constrain the models to specific fields of knowledge. In this thesis, we propose a dialectic pipeline that preserves LLMs' generalization abilities while improving the quality of their answers via self-dialogue, enabling them to reflect upon and correct tentative wrong answers. We experimented with different pipeline settings, testing our proposed method on different datasets and on different families of models. All the pipeline stages are enriched with the relevant context (in an oracle-RAG setting) and a study on the impact of its summarization or its filtering is conducted. We find that our proposed dialectic pipeline is able to outperform the standard model answers by significant margins and that it consistently achieves higher performance than Chain-of-Thought-only prompting.

[62] Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science

Juan Jose Rubio Jan,Jack Wu,Julia Ive

Main category: cs.CL

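TL;DR: This study explores large language models (LLMs) on two foundational Electronic Health Record (EHR) data science tasks: structured data querying (using Python/Pandas) and information extraction from unstructured clinical text (via a RAG pipeline), with experiments on a subset of MIMIC-III.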

Details

Motivation: To improve the accuracy and reliability of LLMs for structured querying and unstructured text extraction in real clinical data settings, in support of clinical workflows. Method: The authors build a flexible evaluation framework that automatically generates synthetic question-answer pairs and combines exact match, semantic similarity, and human evaluation; on a MIMIC-III subset (four structured tables and one clinical note type), they test locally hosted and API-based LLMs on structured querying and RAG-supported information extraction. Result: LLMs show high accuracy and semantic correctness on structured data querying and RAG-assisted clinical text extraction, supporting their practical potential for clinical data analysis. Conclusion: LLMs can support precise querying and accurate information extraction in clinical workflows, but reliability depends on pairing them with a sound evaluation framework and domain adaptation methods such as RAG. Abstract: This study applies Large Language Models (LLMs) to two foundational Electronic Health Record (EHR) data science tasks: structured data querying (using programmatic languages, Python/Pandas) and information extraction from unstructured clinical text via a Retrieval Augmented Generation (RAG) pipeline. We test the ability of LLMs to interact accurately with large structured datasets for analytics and the reliability of LLMs in extracting semantically correct information from free text health records when supported by RAG. To this end, we present a flexible evaluation framework that automatically generates synthetic question and answer pairs tailored to the characteristics of each dataset or task. Experiments were conducted on a curated subset of MIMIC-III (four structured tables and one clinical note type), using a mix of locally hosted and API-based LLMs. Evaluation combined exact-match metrics, semantic similarity, and human judgment. Our findings demonstrate the potential of LLMs to support precise querying and accurate information extraction in clinical workflows.

[63] Efficient Multimodal Planning Agent for Visual Question-Answering

Zhuo Chen,Xinyu Geng,Xinyu Wang,Yong Jiang,Zhen Zhang,Pengjun Xie,Kewei Tu

Main category: cs.CL

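TL;DR: This paper proposes a multimodal planning agent that dynamically decomposes the mRAG pipeline to improve both the efficiency and effectiveness of Visual Question-Answering (VQA), cutting search time by more than 60% while outperforming multiple baselines.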

Details

Motivation: Knowledge-intensive VQA currently relies on multi-stage multimodal Retrieval-Augmented Generation (mRAG) pipelines with strong inter-stage dependencies, which makes them inefficient; the bottleneck needs to be eased without sacrificing performance. Method: The authors train a multimodal planning agent that decides dynamically whether to execute each mRAG step, streamlining and adaptively scheduling the pipeline. Result: Compared with existing methods, search time falls by more than 60% and costly tool calls drop; averaged over six datasets, the method outperforms all baselines, including a Deep Research agent and a carefully designed prompt-based method. Conclusion: A multimodal planning agent can balance efficiency and effectiveness in VQA, offering a lighter and more robust approach to knowledge-intensive multimodal reasoning. Abstract: Visual Question-Answering (VQA) is a challenging multimodal task that requires integrating visual and textual information to generate accurate responses. While multimodal Retrieval-Augmented Generation (mRAG) has shown promise in enhancing VQA systems by providing more evidence on both image and text sides, the default procedure that addresses VQA queries, especially the knowledge-intensive ones, often relies on multi-stage pipelines of mRAG with inherent dependencies. To mitigate the inefficiency limitations while maintaining VQA task performance, this paper proposes a method that trains a multimodal planning agent, dynamically decomposing the mRAG pipeline to solve the VQA task. Our method optimizes the trade-off between efficiency and effectiveness by training the agent to intelligently determine the necessity of each mRAG step. In our experiments, the agent can help reduce redundant computations, cutting search time by over 60% compared to existing methods and decreasing costly tool calls. Meanwhile, experiments demonstrate that our method outperforms all baselines, including a Deep Research agent and a carefully designed prompt-based method, on average over six various datasets. Code will be released.

[64] ShieldedCode: Learning Robust Representations for Virtual Machine Protected Code

Mingqiao Mo,Yunlong Tan,Hao Zhang,Heng Zhang,Yangfan He

Main category: cs.CL

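TL;DR: This paper presents ShieldedCode, the first protection-aware framework for learning robust representations of virtual machine protected (VMP) code. It combines multi-level dependency modeling with functionality-aware and protection-aware contrastive learning and significantly outperforms existing methods on code generation and binary similarity detection.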

Details

Motivation: Large language models perform well at code generation but have not been used effectively for software protection, especially against reverse engineering; traditional VMP relies on hand-written rules, is costly, and is easily defeated by automated analysis. Method: The authors build a large-scale paired dataset of source code and normalized VM implementations; propose hierarchical dependency modeling across instruction levels (intra-instruction, preceding-instruction, inter-instruction); jointly optimize language modeling with functionality-aware and protection-aware contrastive objectives; design a protection effectiveness optimization task that quantifies and ranks VM variants; and adopt a two-stage continual pre-training and fine-tuning pipeline. Result: Pass@1 reaches 26.95% on L0 VM code generation (versus 22.58% for GPT-4o); binary similarity detection Recall@1 improves by 10% over state-of-the-art methods such as jTrans; and robustness improves markedly across protection strengths. Conclusion: ShieldedCode opens a new direction for learning-based software defense, showing that LLMs can be guided to understand, generate, and evaluate protected code, and offering a feasible path toward automated, intelligent VMP. Abstract: Large language models (LLMs) have achieved remarkable progress in code generation, yet their potential for software protection remains largely untapped. Reverse engineering continues to threaten software security, while traditional virtual machine protection (VMP) relies on rigid, rule-based transformations that are costly to design and vulnerable to automated analysis. In this work, we present the first protection-aware framework that learns robust representations of VMP-protected code. Our approach builds large-scale paired datasets of source code and normalized VM implementations, and introduces hierarchical dependency modeling at intra-, preceding-, and inter-instruction levels. We jointly optimize language modeling with functionality-aware and protection-aware contrastive objectives to capture both semantic equivalence and protection strength. To further assess resilience, we propose a protection effectiveness optimization task that quantifies and ranks different VM variants derived from the same source. Coupled with a two-stage continual pre-training and fine-tuning pipeline, our method enables models to generate, compare, and reason over protected code. Extensive experiments show that our framework significantly improves robustness across diverse protection levels, opening a new research direction for learning-based software defense. In this work, we present ShieldedCode, the first protection-aware framework that learns robust representations of VMP-protected code.
Our method achieves 26.95% Pass@1 on L0 VM code generation, compared to 22.58% for GPT-4o, and improves binary similarity detection Recall@1 by 10% over state-of-the-art methods like jTrans.

[65] Online Density-Based Clustering for Real-Time Narrative Evolution Monitoring

Ostap Vykhopen,Viktoria Skorik,Maxim Tereschenko,Veronika Solopova

Main category: cs.CL

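TL;DR: This paper examines replacing traditional offline HDBSCAN clustering with online (streaming/incremental) clustering algorithms in social media monitoring to improve the real-time responsiveness, scalability, and memory efficiency of narrative intelligence systems, and proposes an evaluation framework that combines traditional clustering metrics with narrative-specific ones.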

Details

Motivation: Batch clustering algorithms such as HDBSCAN struggle to meet the real-time, dynamically evolving, and memory-constrained demands of social media data streams, which prevents real-time narrative monitoring. Method: The authors design a three-stage pipeline (data collection, modeling, dashboard generation), run sliding-window simulations on historical data from the Ukrainian information space, systematically evaluate several online clustering algorithms, and introduce hybrid evaluation criteria that combine the Silhouette Coefficient and Davies-Bouldin Index with narrative distinctness, contingency, and variance. Result: The study characterizes the trade-offs of several online clustering algorithms in cluster quality, computational efficiency, memory footprint, and workflow compatibility, and clarifies where they apply in real narrative-monitoring settings. Conclusion: Online clustering can bridge the gap between batch-oriented topic modeling and the streaming nature of social media, giving computational social science, crisis informatics, and narrative monitoring systems a more practical, deployable option. Abstract: Automated narrative intelligence systems for social media monitoring face significant scalability challenges when processing continuous data streams using traditional batch clustering algorithms. We investigate the replacement of HDBSCAN (offline clustering) with online (streaming/incremental) clustering methods in a production narrative report generation pipeline. The proposed system employs a three-stage architecture (data collection, modeling, dashboard generation) that processes thousands of multilingual social media documents daily. While HDBSCAN excels at discovering hierarchical density-based clusters and handling noise, its batch-only nature necessitates complete retraining for each time window, resulting in memory constraints, computational inefficiency, and inability to adapt to evolving narratives in real-time. This work evaluates a range of online clustering algorithms across dimensions of cluster quality preservation, computational efficiency, memory footprint, and integration compatibility with existing workflows. We propose evaluation criteria that balance traditional clustering metrics (Silhouette Coefficient, Davies-Bouldin Index) with narrative metrics (narrative distinctness, contingency and variance). Our methodology includes sliding-window simulations on historical datasets from the Ukrainian information space, enabling comparative analysis of algorithmic trade-offs in realistic operational contexts.
This research addresses a critical gap between batch-oriented topic modeling frameworks and the streaming nature of social media monitoring, with implications for computational social science, crisis informatics, and narrative surveillance systems.

[66] AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

Shicheng Fang,Yuxin Wang,XiaoRan Liu,Jiahao Lu,Chuanyuan Tan,Xinchi Chen,Yining Zheng,Xuanjing Huang,Xipeng Qiu

Main category: cs.CL

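TL;DR: This paper presents AgentLongBench, a benchmark that simulates dynamic environments built on Lateral Thinking Puzzles to evaluate LLMs acting as autonomous agents in long-context, interactive settings. Experiments show that current models handle static retrieval well but degrade markedly on dynamic information synthesis, driven mainly by the minimum number of tokens needed to resolve a query, with high-density tool responses posing the greater challenge.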

Details

Motivation: Existing benchmarks are too static to simulate the non-linear reasoning and iterative feedback of agent-environment interaction, so a dynamic evaluation framework closer to real agent behavior is needed. Method: The authors propose AgentLongBench, which builds knowledge-intensive and knowledge-free simulated environments from Lateral Thinking Puzzles and uses multi-turn interaction trajectories to evaluate models and memory systems across context lengths from 32K to 4M tokens. Result: State-of-the-art models and memory systems perform clearly worse on dynamic information synthesis than on static retrieval; the degradation is driven mainly by the minimum number of tokens required to resolve a query, and the high information density of tool responses is more challenging than the memory fragmentation of long dialogues. Conclusion: AgentLongBench exposes a fundamental bottleneck in how current LLM agents manage dynamic context, and indicates that progress requires more than larger context windows: agents need better real-time synthesis of dense, multi-step-dependent information. Abstract: The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for workflows. Our analysis indicates that this degradation is driven by the minimum number of tokens required to resolve a query. This factor explains why the high information density inherent in massive tool responses poses a significantly greater challenge than the memory fragmentation typical of long-turn dialogues.

[67] QueerGen: How LLMs Reflect Societal Norms on Gender and Sexuality in Sentence Completion Tasks

Mae Sosto,Delfina Sol Martinez Pandiani,Laura Hollink

Main category: cs.CL

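TL;DR: This paper studies how large language models (LLMs) reproduce societal norms, heterocisnormativity in particular, and quantifies their bias in text generation toward subjects with different gender or sexuality markers (queer-marked, non-queer-marked, and the "unmarked" default). Masked Language Models (MLMs) show the most negative sentiment, higher toxicity, and more negative regard for queer-marked subjects; Autoregressive Language Models (ARLMs) partly mitigate but do not eliminate the bias; and closed-access ARLMs produce more harmful outputs for unmarked subjects.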

Details

Motivation: To investigate whether and how LLMs reproduce heterocisnormativity and turn it into measurable generation bias, with particular attention to how explicit gender or sexuality information changes model responses across subject categories. Method: The authors compare English sentence completions for three subject categories (queer-marked, non-queer-marked, and "unmarked" default) along four dimensions (sentiment, regard, toxicity, and prediction diversity) and analyze bias patterns in Masked Language Models (MLMs) and Autoregressive Language Models (ARLMs). Result: MLMs produce the most negative sentiment, higher toxicity, and more negative regard for queer-marked subjects; ARLMs partly mitigate this bias; closed-access ARLMs instead produce more harmful outputs for unmarked subjects; overall, LLMs reproduce social norms, but the form and degree of bias depend heavily on model architecture and access type. Conclusion: LLMs systematically reproduce normative social assumptions such as heterocisnormativity, producing quantifiable representational inequality; the bias is not uniform but is redistributed according to model type (MLM vs. ARLM) and open or closed access, and no single technical route eliminates it. Abstract: This paper examines how Large Language Models (LLMs) reproduce societal norms, particularly heterocisnormativity, and how these norms translate into measurable biases in their text generations. We investigate whether explicit information about a subject's gender or sexuality influences LLM responses across three subject categories: queer-marked, non-queer-marked, and the normalized "unmarked" category. Representational imbalances are operationalized as measurable differences in English sentence completions across four dimensions: sentiment, regard, toxicity, and prediction diversity. Our findings show that Masked Language Models (MLMs) produce the least favorable sentiment, higher toxicity, and more negative regard for queer-marked subjects. Autoregressive Language Models (ARLMs) partially mitigate these patterns, while closed-access ARLMs tend to produce more harmful outputs for unmarked subjects. Results suggest that LLMs reproduce normative social assumptions, though the form and degree of bias depend strongly on specific model characteristics, which may redistribute, but not eliminate, representational harms.

[68] Like a Therapist, But Not: Reddit Narratives of AI in Mental Health Contexts

Elham Aghakhani,Rezvaneh Rezapour

Main category: cs.CL

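TL;DR: Analyzing 5,126 Reddit posts, this study examines how users evaluate and relate to large language models (LLMs) used for emotional support and mental health interactions in everyday life. User evaluations are driven mainly by narrated outcomes, trust, and response quality rather than emotional bond alone; task and goal alignment is associated with positive sentiment, while companionship-oriented use more often involves misaligned alliances and risks such as dependence and symptom escalation.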

Details

Motivation: LLMs are increasingly used for emotional support and mental health interactions outside clinical settings, yet little is known about how people evaluate and relate to these systems in everyday use. Method: Builds a theory-informed annotation framework grounded in the Technology Acceptance Model and therapeutic alliance theory, and applies a hybrid LLM-human pipeline to 5,126 Reddit posts from 47 mental health communities, focusing on evaluative language, adoption-related attitudes, and relational alignment. Result: Engagement is driven mainly by narrated outcomes, trust, and response quality; positive sentiment is most strongly associated with task and goal alignment; companionship-oriented use more often shows misaligned therapeutic alliances and reported risks such as dependence and symptom escalation. Conclusion: Theory-grounded constructs can be operationalized effectively in large-scale discourse analysis, and it is important to study how users interpret language technologies in sensitive, real-world contexts. Abstract: Large language models (LLMs) are increasingly used for emotional support and mental health-related interactions outside clinical settings, yet little is known about how people evaluate and relate to these systems in everyday use. We analyze 5,126 Reddit posts from 47 mental health communities describing experiential or exploratory use of AI for emotional support or therapy. Grounded in the Technology Acceptance Model and therapeutic alliance theory, we develop a theory-informed annotation framework and apply a hybrid LLM-human pipeline to analyze evaluative language, adoption-related attitudes, and relational alignment at scale. Our results show that engagement is shaped primarily by narrated outcomes, trust, and response quality, rather than emotional bond alone. Positive sentiment is most strongly associated with task and goal alignment, while companionship-oriented use more often involves misaligned alliances and reported risks such as dependence and symptom escalation. Overall, this work demonstrates how theory-grounded constructs can be operationalized in large-scale discourse analysis and highlights the importance of studying how users interpret language technologies in sensitive, real-world contexts.

Jing Yang,Moritz Hechtbauer,Elisabeth Khalilov,Evelyn Luise Brinkmann,Vera Schmitt,Nils Feldhus

Main category: cs.CL

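TL;DR: This paper studies how Persona prompting (PP) affects the rationales that large language models (LLMs) generate in socially sensitive tasks such as hate speech detection. PP can improve classification accuracy but lowers rationale quality, and it neither aligns with the rationale preferences of real demographic groups nor mitigates the models' inherent biases.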

Details

Motivation: In socially sensitive tasks such as hate speech detection, the quality of LLM explanations is crucial for user trust and model alignment; PP is used to steer generation, but its effect on rationale quality is unclear. Method: Using datasets annotated with word-level rationales, the study simulates personas with different demographic attributes, measures agreement between LLM-generated rationales and human annotations from each group, and analyzes the impact of PP on model bias and human alignment. Result: (1) PP improves hate speech classification but degrades rationale quality; (2) simulated personas fail to match the rationale preferences of their real-world counterparts, and models are largely insensitive to persona changes; (3) models show consistent demographic biases and a tendency to over-flag content as harmful, which PP does not mitigate. Conclusion: PP involves a critical trade-off in sensitive tasks: classification gains come at the cost of rationale quality and without bias mitigation, so it should be applied with caution. Abstract: For socially sensitive tasks like hate speech detection, the quality of explanations from Large Language Models (LLMs) is crucial for factors like user trust and model alignment. While Persona prompting (PP) is increasingly used as a way to steer models towards user-specific generation, its effect on model rationales remains underexplored. We investigate how LLM-generated rationales vary when conditioned on different simulated demographic personas. Using datasets annotated with word-level rationales, we measure agreement with human annotations from different demographic groups, and assess the impact of PP on model bias and human alignment. Our evaluation across three LLMs reveals three key findings: (1) PP improves classification on the most subjective task (hate speech) but degrades rationale quality. (2) Simulated personas fail to align with their real-world demographic counterparts, and high inter-persona agreement shows models are resistant to significant steering. (3) Models exhibit consistent demographic biases and a strong tendency to over-flag content as harmful, regardless of PP. Our findings reveal a critical trade-off: while PP can improve classification in socially-sensitive tasks, it often comes at the cost of rationale quality and fails to mitigate underlying biases, urging caution in its application.

[70] SERA: Soft-Verified Efficient Repository Agents

Ethan Shen,Danny Tormoen,Saurabh Shah,Ali Farhadi,Tim Dettmers

Main category: cs.CL

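TL;DR: This paper presents SERA, which uses Soft Verified Generation (SVG) to make fine-tuning on private codebases efficient and cheap, substantially reducing training cost while achieving state-of-the-art results among fully open-source models.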

Details

Motivation: Open-weight coding agents should in principle have the advantage of being customizable to private codebases, but training cost and complexity have kept this advantage theoretical; this work aims to make it practical. Method: Proposes Soft-Verified Efficient Repository Agents (SERA), which uses only supervised finetuning (SFT) together with Soft Verified Generation (SVG) to generate thousands of training trajectories from a single code repository, and scales to over 200,000 synthetic trajectories for analysis and ablations. Result: SERA achieves the best results among fully open-source models and matches frontier open-weight models such as Devstral-Small-2; creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods. Conclusion: SERA makes coding agents specialized to private codebases practical, advances research on open coding agents, and demonstrates the core advantage of open models in proprietary settings; the authors release all models, code, data, and a Claude Code integration. Abstract: Open-weight coding agents should hold a fundamental advantage over closed-source systems: they can be specialized to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training has kept this advantage theoretical. We show it is now practical. We present Soft-Verified Efficient Repository Agents (SERA), an efficient method for training coding agents that enables the rapid and cheap creation of agents specialized to private codebases. Using only supervised finetuning (SFT), SERA achieves state-of-the-art results among fully open-source (open data, method, code) models while matching the performance of frontier open-weight models like Devstral-Small-2. Creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. Our method, Soft Verified Generation (SVG), generates thousands of trajectories from a single code repository. Combined with cost-efficiency, this enables specialization to private codebases. Beyond repository specialization, we apply SVG to a larger corpus of codebases, generating over 200,000 synthetic trajectories. We use this dataset to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents. Overall, we believe our work will greatly accelerate research on open coding agents and showcase the advantage of open-source models that can specialize to private codebases.
We release SERA as the first model in Ai2's Open Coding Agents series, along with all our code, data, and Claude Code integration to support the research community.

[71] Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers

Yiran Huang,Karsten Roth,Quentin Bouniot,Wenjia Xu,Zeynep Akata

Main category: cs.CL

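TL;DR: Through controlled experiments, this paper studies how Transformer models associate information across modalities in multimodal in-context learning (ICL). It finds a learning asymmetry: after high-diversity pretraining on a primary modality, low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that this relies on an induction-style label-copying mechanism extended across modalities.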

Details

Motivation: To understand why Transformer-based multimodal large language models exhibit in-context learning, and in particular how they learn to associate information across modalities from in-context examples. Method: Trains small Transformer models on synthetic classification tasks with precise control over data statistics and model architecture; replicates and extends unimodal ICL findings with modern components such as RoPE position embeddings; designs multimodal settings to analyze the learning asymmetry; and performs mechanistic analysis. Result: RoPE raises the data complexity threshold required for unimodal ICL; multimodal ICL shows a marked learning asymmetry, in that high-diversity pretraining on the primary modality lets ICL emerge with very low data complexity in the secondary modality; both settings rely on an induction-style label-copying mechanism, which multimodal training extends into cross-modal circuits. Conclusion: The emergence of multimodal ICL has a mechanistic basis in induction circuits extended across modalities; the work offers a mechanistic account of multimodal ICL in modern Transformers and a controlled testbed for future research. Abstract: Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increases the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation.

[72] Structured Semantic Information Helps Retrieve Better Examples for In-Context Learning in Few-Shot Relation Extraction

Aunabil Chakma,Mihai Surdeanu,Eduardo Blanco

Main category: cs.CL

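TL;DR: This paper proposes a new strategy for automatically obtaining additional examples for in-context learning in few-shot relation extraction. Examples are selected by syntactic-semantic structural similarity and combined with LLM-generated examples into a hybrid system, which achieves state-of-the-art performance on FS-TACRED and transfers well across datasets and LLM families.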

Details

Motivation: To improve the generalization and effectiveness of one-shot relation extraction under in-context learning, and to mitigate the lack of information that comes from relying on a single example. Method: Proposes a new example selection strategy based on syntactic-semantic structural similarity and combines it with LLM-generated examples to form a hybrid example-augmentation framework. Result: The hybrid system reaches state-of-the-art performance on FS-TACRED and strong gains on a customized FewRel subset, and is robust across datasets (FS-TACRED, FS-FewRel) and models (Qwen, Gemma). Conclusion: Structure-based example selection provides complementary information; combined with LLM generation it gives a more complete picture of the target relations and outperforms the baseline strategies. Abstract: This paper presents several strategies to automatically obtain additional examples for in-context learning of one-shot relation extraction. Specifically, we introduce a novel strategy for example selection, in which new examples are selected based on the similarity of their underlying syntactic-semantic structure to the provided one-shot example. We show that this method results in complementary word choices and sentence structures when compared to LLM-generated examples. When these strategies are combined, the resulting hybrid system achieves a more holistic picture of the relations of interest than either method alone. Our framework transfers well across datasets (FS-TACRED and FS-FewRel) and LLM families (Qwen and Gemma). Overall, our hybrid selection method consistently outperforms alternative strategies and achieves state-of-the-art performance on FS-TACRED and strong gains on a customized FewRel subset.

[73] Linear representations in language models can change dramatically over a conversation

Andrew Kyle Lampinen,Yuxuan Li,Eghbal Hosseini,Sangnie Bhardwaj,Murray Shanahan

Main category: cs.CL

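TL;DR: This paper studies how linear directions in language model representations evolve with conversational context. Representations can change substantially over a conversation; the changes are content-dependent, robust, and consistent across models, which poses challenges for interpretability and steering while also revealing new aspects of how models adapt to context.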

Details

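Motivation: To characterize how linear representation directions that correspond to high-level concepts evolve over a conversation, in order to understand how models adjust their internal representations to context. Method: Analyzes how representations along specific linear directions (e.g., factuality) change during simulated conversations; compares the dynamics across model families, layers, and context types (e.g., replayed conversation scripts vs. sci-fi stories); and tests how steering along a direction differs at different points in a conversation. Result: Linear representations can change dramatically and in a content-dependent way over a conversation; the effect is robust across models and layers; it does not require on-policy conversations, since replaying a script written by another model reproduces it; adaptation is much weaker for text explicitly framed as fiction; and intervening along the same direction has markedly different effects at different stages of a conversation. Conclusion: Language model representations are strongly context-dependent and dynamic, possibly because the model plays a particular role cued by the conversation; this challenges static feature interpretations and fixed probes, but also opens a path toward understanding contextual adaptation. Abstract: Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation.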
Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.
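The two operations the abstract refers to, reading a concept off a linear direction and steering along it, can be illustrated with a small sketch. This is only an illustration under stated assumptions: hidden states and directions are plain lists of floats, and `project` and `steer` are hypothetical helper names, not functions from the paper or from any library.

```python
import math
from typing import List


def _dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(hidden: List[float], direction: List[float]) -> float:
    """Scalar readout: component of `hidden` along the unit-normalized direction."""
    return _dot(hidden, direction) / math.sqrt(_dot(direction, direction))


def steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Shift `hidden` by `alpha` units along the unit-normalized direction."""
    norm = math.sqrt(_dot(direction, direction))
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


# Toy 2-D example (values are made up for illustration).
direction = [3.0, 4.0]          # hypothetical "factuality" direction
hidden = [1.0, 2.0]             # hypothetical hidden state
before = project(hidden, direction)
after = project(steer(hidden, direction, alpha=5.0), direction)
print(before, after)            # the readout moves by exactly alpha
```

In this toy setting, steering shifts the readout by exactly `alpha`. The abstract's point is that in a real model the meaning of a fixed direction, and the downstream effect of such a shift, may differ depending on where in the conversation it is applied.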

[74] When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation

David Tan,Pinzhen Chen,Josef van Genabith,Koel Dutta Chowdhury

Main category: cs.CL

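TL;DR: This paper studies benchmark contamination of large language models (LLMs) in multilingual settings, in particular the effect of the FLORES-200 translation benchmark on models such as Bloomz, and shows that contamination inflates performance across translation directions and that the memorization is robust to perturbation.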

Details

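Motivation: LLMs can be contaminated when their training data contains benchmark content, inflating evaluation scores and masking true generalization; in multilingual settings the contamination can even transfer across languages, undermining evaluation on "uncontaminated" languages. Method: Uses the FLORES-200 translation benchmark as a diagnostic to compare two 7-8B instruction-tuned multilingual LLMs: Bloomz, known to have been trained on FLORES (contaminated), and Llama as an uncontaminated control; applies source-side perturbations (e.g., paraphrasing, named entity replacement) to test how robustly the model recalls contaminated references, and tracks changes in BLEU. Result: Confirms FLORES contamination in Bloomz; shows that machine translation contamination is cross-directional (target-side memorization boosts performance in unseen translation directions); source-side perturbations, especially named entity replacement, rarely eliminate recall but consistently lower BLEU, making them a useful probe for contamination. Conclusion: Benchmark contamination can seriously distort multilingual LLM evaluation, and its memorization effect is cross-directional and fairly robust; named entity replacement is a practical and sensitive probing method, and benchmarks and training data should be constructed with more care to avoid contamination. Abstract: Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings, this memorization can even transfer to "uncontaminated" languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz's FLORES contamination and demonstrate that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization. Further analysis shows that recall of memorized references often persists despite various source-side perturbation efforts like paraphrasing and named entity replacement. However, replacing named entities leads to a consistent decrease in BLEU, suggesting an effective probing method for memorization in contaminated models.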

cs.CV

[75] Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation

Gautham Vinod,Bruce Coburn,Siddeshwar Raghavan,Jiangpeng He,Fengqing Zhu

Main category: cs.CV

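TL;DR: This paper proposes a method for recovering true-to-scale 3D food reconstructions from monocular images. It learns to estimate scale from visual features, substantially improving volume estimation accuracy in support of precision nutrition.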

Details

Motivation: Diet-related chronic diseases such as obesity and diabetes are on the rise, so accurate monitoring of food intake is needed; existing AI-based dietary assessment struggles to recover food volume ("how much did you eat?") from monocular images, and 3D reconstructions in particular lack real-world scale, which limits their use in precision nutrition. Method: Extracts rich visual features with models pretrained on large-scale datasets and learns to estimate the real-world scale of a monocular 3D reconstruction, converting single-view reconstructions into physically meaningful, true-to-scale 3D models. Result: On two public datasets, the method reduces mean absolute volume-estimation error by nearly 30% relative to existing techniques and outperforms them consistently. Conclusion: The work bridges 3D computer vision and digital health, providing a scalable, accurate, scale-aware 3D reconstruction approach for image-based dietary assessment. Abstract: The rise of chronic diseases related to diet, such as obesity and diabetes, emphasizes the need for accurate monitoring of food intake. While AI-driven dietary assessment has made strides in recent years, the ill-posed nature of recovering size (portion) information from monocular images for accurate estimation of "how much did you eat?" is a pressing challenge. Some 3D reconstruction methods have achieved impressive geometric reconstruction but fail to recover the crucial real-world scale of the reconstructed object, limiting its usage in precision nutrition. In this paper, we bridge the gap between 3D computer vision and digital health by proposing a method that recovers a true-to-scale 3D reconstructed object from a monocular image. Our approach leverages rich visual features extracted from models trained on large-scale datasets to estimate the scale of the reconstructed object. This learned scale enables us to convert single-view 3D reconstructions into true-to-life, physically meaningful models. Extensive experiments and ablation studies on two publicly available datasets show that our method consistently outperforms existing techniques, achieving nearly a 30% reduction in mean absolute volume-estimation error, showcasing its potential to enhance the domain of precision nutrition. Code: https://gitlab.com/viper-purdue/size-matters

[76] DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation

Zhen Yao,Xin Li,Taotao Jing,Shuai Zhang,Mooi Choo Chuah

Main category: cs.CV

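TL;DR: This paper proposes the DiSa framework, which introduces a Saliency-aware Disentanglement Module (SDM) and a Hierarchical Refinement Module (HRM) to address the foreground bias and weak spatial localization of vision-language models in open-vocabulary semantic segmentation.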

Details

Motivation: Existing methods based on vision-language models (such as CLIP) suffer from two problems in open-vocabulary semantic segmentation: foreground bias (background regions are ignored) and poor spatial localization (blurred boundaries). Method: Proposes the DiSa framework, comprising a Saliency-aware Disentanglement Module (SDM) that models foreground and background features separately and a Hierarchical Refinement Module (HRM) that refines features using pixel-wise spatial context and channel-wise multi-level updates. Result: Experiments on six benchmarks show that DiSa consistently outperforms state-of-the-art methods. Conclusion: By explicitly incorporating saliency cues and hierarchical feature refinement, DiSa mitigates inherent limitations of VLMs in dense prediction and improves both overall performance and boundary accuracy in open-vocabulary semantic segmentation. Abstract: Open-vocabulary semantic segmentation aims to assign labels to every pixel in an image based on text labels. Existing approaches typically utilize vision-language models (VLMs), such as CLIP, for dense prediction. However, VLMs, pre-trained on image-text pairs, are biased toward salient, object-centric regions and exhibit two critical limitations when adapted to segmentation: (i) Foreground Bias, which tends to ignore background regions, and (ii) Limited Spatial Localization, resulting in blurred object boundaries. To address these limitations, we introduce DiSa, a novel saliency-aware foreground-background disentangled framework. By explicitly incorporating saliency cues in our designed Saliency-aware Disentanglement Module (SDM), DiSa separately models foreground and background ensemble features in a divide-and-conquer manner. Additionally, we propose a Hierarchical Refinement Module (HRM) that leverages pixel-wise spatial contexts and enables channel-wise feature refinement through multi-level updates. Extensive experiments on six benchmarks demonstrate that DiSa consistently outperforms state-of-the-art methods.

[77] Semi-Supervised Masked Autoencoders: Unlocking Vision Transformer Potential with Limited Data

Atik Faysal,Mohammad Rostami,Reihaneh Gh. Roshan,Nikhil Muralidhar,Huaxia Wang

Main category: cs.CV

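TL;DR: This paper proposes the Semi-Supervised Masked Autoencoder (SSMAE), a framework for training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. It combines masked image reconstruction with classification and uses a validation-driven gating mechanism to dynamically select pseudo-labels that are high-confidence and consistent across augmented views, yielding clear gains at low label ratios.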

Details

Motivation: To address the difficulty of training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. Method: Proposes the Semi-Supervised Masked Autoencoder (SSMAE), which jointly optimizes masked image reconstruction and classification; a validation-driven gating mechanism enables pseudo-labels only when the model gives high-confidence, consistent predictions on both weakly and strongly augmented views, reducing confirmation bias. Result: On CIFAR-10 and CIFAR-100, SSMAE consistently outperforms supervised ViT and fine-tuned MAE, with a gain of +9.24% over ViT on CIFAR-10 with 10% labels. Conclusion: When pseudo-labels are introduced matters as much as how they are generated; a validation-driven dynamic pseudo-labeling strategy improves data-efficient ViT training. Abstract: We address the challenge of training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. We propose Semi-Supervised Masked Autoencoder (SSMAE), a framework that jointly optimizes masked image reconstruction and classification using both unlabeled and labeled samples with dynamically selected pseudo-labels. SSMAE introduces a validation-driven gating mechanism that activates pseudo-labeling only after the model achieves reliable, high-confidence predictions that are consistent across both weakly and strongly augmented views of the same image, reducing confirmation bias. On CIFAR-10 and CIFAR-100, SSMAE consistently outperforms supervised ViT and fine-tuned MAE, with the largest gains in low-label regimes (+9.24% over ViT on CIFAR-10 with 10% labels). Our results demonstrate that when pseudo-labels are introduced is as important as how they are generated for data-efficient transformer training. Codes are available at https://github.com/atik666/ssmae.
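The gating idea described in the abstract above can be sketched roughly as follows. This is not the authors' implementation: the function name `select_pseudo_labels`, the validation gate `val_gate`, and the confidence threshold `conf_threshold` are all assumptions chosen for illustration, and the abstract does not specify how "reliable" is measured.

```python
from typing import List, Optional, Sequence


def select_pseudo_labels(
    weak_probs: Sequence[Sequence[float]],
    strong_probs: Sequence[Sequence[float]],
    val_accuracy: float,
    val_gate: float = 0.8,          # assumed gate on validation accuracy
    conf_threshold: float = 0.95,   # assumed per-view confidence threshold
) -> List[Optional[int]]:
    """Return one pseudo-label (or None) per unlabeled sample."""
    # Gate: no pseudo-labels at all until the model is reliable on validation data.
    if val_accuracy < val_gate:
        return [None] * len(weak_probs)

    labels: List[Optional[int]] = []
    for w, s in zip(weak_probs, strong_probs):
        w_cls = max(range(len(w)), key=w.__getitem__)   # argmax, weak view
        s_cls = max(range(len(s)), key=s.__getitem__)   # argmax, strong view
        confident = w[w_cls] >= conf_threshold and s[s_cls] >= conf_threshold
        # Keep the label only if both views are confident and agree.
        labels.append(w_cls if confident and w_cls == s_cls else None)
    return labels


# Toy class probabilities for three unlabeled images (made-up values).
weak = [[0.97, 0.03], [0.60, 0.40], [0.02, 0.98]]
strong = [[0.96, 0.04], [0.70, 0.30], [0.99, 0.01]]
print(select_pseudo_labels(weak, strong, val_accuracy=0.9))
```

Only the first sample passes: the second is not confident enough, and the third is confident in both views but the views disagree. With `val_accuracy` below the gate, every sample is rejected, which mirrors the claim that the timing of pseudo-labeling matters.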

[78] Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning

Chuan Qin,Constantin Venhoff,Sonia Joseph,Fanyi Xiao,Stefan Scherer

Main category: cs.CV

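TL;DR: This paper proposes Sparse CLIP, which introduces sparsity directly into CLIP training. It substantially improves the interpretability of the representations without sacrificing performance and retains multimodal capabilities, challenging the common assumption that interpretability and performance are at odds.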

Details

Motivation: Despite CLIP's success, its dense and opaque latent representations are hard to interpret; existing post-hoc sparsification methods such as SAEs degrade downstream performance and multimodal capabilities. Method: Embeds a sparsity constraint directly into CLIP's joint training rather than applying it post hoc, using structured sparsity regularization to learn semantically clear, cross-modally aligned sparse features. Result: Sparse CLIP matches the original CLIP on downstream tasks while offering higher interpretability and stronger multimodal feature alignment, and it supports interpretable, vision-based model steering. Conclusion: Interpretability and high performance are not mutually exclusive; end-to-end sparse training can co-optimize both and offers a design principle for future multimodal models. Abstract: Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs). Despite its success, CLIP's dense and opaque latent representations pose significant interpretability challenges. A common assumption is that interpretability and performance are in tension: enforcing sparsity during training degrades accuracy, motivating recent post-hoc approaches such as Sparse Autoencoders (SAEs). However, these post-hoc approaches often suffer from degraded downstream performance and loss of CLIP's inherent multimodal capabilities, with most learned features remaining unimodal. We propose a simple yet effective approach that integrates sparsity directly into CLIP training, yielding representations that are both interpretable and performant. Compared to SAEs, our Sparse CLIP representations preserve strong downstream task performance, achieve superior interpretability, and retain multimodal capabilities. We show that multimodal sparse features enable straightforward semantic concept alignment and reveal training dynamics of how cross-modal knowledge emerges. Finally, as a proof of concept, we train a vision-language model on sparse CLIP representations that enables interpretable, vision-based steering capabilities. Our findings challenge conventional wisdom that interpretability requires sacrificing accuracy and demonstrate that interpretability and performance can be co-optimized, offering a promising design principle for future models.

[79] NucFuseRank: Dataset Fusion and Performance Ranking for Nuclei Instance Segmentation

Nima Torbati,Anastasia Meshcheryakova,Ramona Woitek,Sepideh Hatamikia,Diana Mechtcheriakova,Amirreza Mahbod

Main category: cs.CV

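TL;DR: This paper focuses on evaluating and standardizing datasets for nuclei instance segmentation in H&E-stained images rather than on model development. It converts multiple public datasets to a unified format, systematically evaluates and ranks them with two state-of-the-art models, builds a unified test set (NucFuse-test) and training set (NucFuse-train), and establishes a new benchmark for the task.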

Details

Motivation: Most existing work concentrates on developing new segmentation algorithms and overlooks differences between datasets, their comparability, and standardization, so model evaluation lacks fairness and reproducibility. Method: Collects manually annotated, publicly available H&E image datasets from the literature and standardizes them into a unified input and annotation format; evaluates and ranks the datasets with two state-of-the-art models (CNN and hybrid CNN + ViT); builds the unified test set NucFuse-test for fair comparison and merges multiple sources into NucFuse-train to improve generalization; and performs external validation with an open-source implementation. Result: Produces a performance ranking of the main datasets; releases the standardized NucFuse-test and NucFuse-train sets; shows that fused training markedly improves segmentation on unseen data; and provides a reproducible evaluation framework and open-source code. Conclusion: Dataset quality and standardization are critical for evaluating and deploying nuclei instance segmentation models; the proposed NucFuse benchmark offers a fairer, more robust, and more practical evaluation standard for the field. Abstract: Nuclei instance segmentation in hematoxylin and eosin (H&E)-stained images plays an important role in automated histological image analysis, with various applications in downstream tasks. While several machine learning and deep learning approaches have been proposed for nuclei instance segmentation, most research in this field focuses on developing new segmentation algorithms and benchmarking them on a limited number of arbitrarily selected public datasets. In this work, rather than focusing on model development, we focused on the datasets used for this task. Based on an extensive literature review, we identified manually annotated, publicly available datasets of H&E-stained images for nuclei instance segmentation and standardized them into a unified input and annotation format. Using two state-of-the-art segmentation models, one based on convolutional neural networks (CNNs) and one based on a hybrid CNN and vision transformer architecture, we systematically evaluated and ranked these datasets based on their nuclei instance segmentation performance. Furthermore, we proposed a unified test set (NucFuse-test) for fair cross-dataset evaluation and a unified training set (NucFuse-train) for improved segmentation performance by merging images from multiple datasets.
By evaluating and ranking the datasets, performing comprehensive analyses, generating fused datasets, conducting external validation, and making our implementation publicly available, we provided a new benchmark for training, testing, and evaluating nuclei instance segmentation models on H&E-stained histological images.

[80] Look in the Middle: Structural Anchor Pruning for Scalable Visual RAG Indexing

Zhuchenyang Liu,Ziyu Hu,Yao Zhang,Yu Xiao

Main category: cs.CV

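TL;DR: This paper proposes Structural Anchor Pruning (SAP), a training-free vector pruning method for visual document retrieval that identifies key visual patches from middle layers, compressing index vectors by over 90% while maintaining high retrieval accuracy. It also introduces the Oracle Score Retention (OSR) evaluation protocol, which shows that middle layers retain semantic structural anchors better than the final layer.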

Details

Motivation: Existing training-free pruning methods degrade sharply at high compression ratios (> 80%); this has been attributed to visual token importance being query-dependent, which led to the view that efficient training-free pruning is hard to achieve. Method: Proposes Structural Anchor Pruning (SAP), which identifies semantic structural anchor patches in the middle layers and prunes accordingly; introduces the Oracle Score Retention (OSR) protocol to quantify how layer-wise information affects compression efficiency. Result: On the ViDoRe benchmark, SAP compresses index vectors by over 90% while maintaining robust retrieval fidelity; the OSR analysis confirms that persistent semantic structural anchors exist in the middle layers, whereas structural signals have dissipated by the final layer. Conclusion: Visual token importance is not entirely query-dependent; the middle layers contain stable, exploitable structural semantic anchors that support efficient, training-free pruning at high compression ratios. Abstract: Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive index vector size overheads. Training-free pruning solutions (e.g., EOS-attention based methods) can reduce index vector size by approximately 60% without model adaptation, but often underperform random selection in high-compression scenarios (> 80%). Prior research (e.g., Light-ColPali) attributes this to the conclusion that visual token importance is inherently query-dependent, thereby questioning the feasibility of training-free pruning. In this work, we propose Structural Anchor Pruning (SAP), a training-free pruning method that identifies key visual patches from middle layers to achieve high performance compression. We also introduce Oracle Score Retention (OSR) protocol to evaluate how layer-wise information affects compression efficiency. Evaluations on the ViDoRe benchmark demonstrate that SAP reduces index vectors by over 90% while maintaining robust retrieval fidelity, providing a highly scalable solution for Visual RAG. Furthermore, our OSR-based analysis reveals that semantic structural anchor patches persist in the middle layers, unlike traditional pruning solutions that focus on the final layer where structural signals dissipate.
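The pruning step itself, keeping only the highest-scoring patch vectors in the index, can be illustrated with a generic top-k sketch. How SAP derives its scores from middle layers is not specified in the abstract, so `anchor_scores` is taken as given here; the function name `prune_patches` and the `keep_ratio` default are assumptions for illustration, not the paper's API.

```python
import math
from typing import List, Sequence, Tuple


def prune_patches(
    patch_vectors: Sequence[Sequence[float]],
    anchor_scores: Sequence[float],   # assumed precomputed, one score per patch
    keep_ratio: float = 0.1,          # 0.1 keeps 10% of vectors (90% compression)
) -> Tuple[List[int], List[Sequence[float]]]:
    """Keep the top `keep_ratio` fraction of patch vectors by score."""
    if len(patch_vectors) != len(anchor_scores):
        raise ValueError("need exactly one score per patch vector")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    n_keep = max(1, math.ceil(len(patch_vectors) * keep_ratio))
    ranked = sorted(range(len(anchor_scores)),
                    key=lambda i: anchor_scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])    # restore original patch order
    return kept, [patch_vectors[i] for i in kept]


# Toy example: 10 patches with 2-D embeddings and made-up scores.
vectors = [[float(i), float(i) + 0.5] for i in range(10)]
scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.0, 0.4, 0.5, 0.6, 0.7]
kept_idx, kept_vecs = prune_patches(vectors, scores, keep_ratio=0.2)
print(kept_idx, kept_vecs)
```

Because the scores in this sketch do not depend on the query, the pruned index can be built once offline, which is the property that makes training-free pruning attractive for index size.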

[81] Efficient Token Pruning for LLaDA-V

Zhewen Wan,Tianchen Song,Chen Lin,Zhiyong Zhao,Xianpeng Lang

Main category: cs.CV

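TL;DR: This paper proposes a structured visual token pruning strategy for diffusion-based large multimodal models such as LLaDA-V. It targets the middle-to-late layers and the first denoising step to reduce computational cost while preserving semantic alignment and generation quality.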

Details

Motivation: Diffusion-based multimodal models such as LLaDA-V incur high computational cost because of bidirectional attention and iterative denoising; attention analysis shows that cross-modal information is aggregated mainly in the middle-to-late layers, resulting in delayed semantic alignment. Method: Based on the attention analysis, proposes a structured visual token pruning strategy that targets the middle-to-late layers and is applied only at the first denoising step; it is inspired by FastV but differs in when and where pruning happens, in order to balance efficiency and quality. Result: Across multiple benchmarks, the best configuration reduces computation by up to 65% while preserving an average of 95% of task performance. Conclusion: This is the first work to explore structured token pruning in diffusion-based large multimodal models; it demonstrates the effectiveness of vision-aware pruning for this model class and provides an empirical basis for efficient LLaDA-V inference. Abstract: Diffusion-based large multimodal models, such as LLaDA-V, have demonstrated impressive capabilities in vision-language understanding and generation. However, their bidirectional attention mechanism and diffusion-style iterative denoising paradigm introduce significant computational overhead, as visual tokens are repeatedly processed across all layers and denoising steps. In this work, we conduct an in-depth attention analysis and reveal that, unlike autoregressive decoders, LLaDA-V aggregates cross-modal information predominantly in middle-to-late layers, leading to delayed semantic alignment. Motivated by this observation, we propose a structured token pruning strategy inspired by FastV, selectively removing a proportion of visual tokens at designated layers to reduce FLOPs while preserving critical semantic information. To the best of our knowledge, this is the first work to investigate structured token pruning in diffusion-based large multimodal models. Unlike FastV, which focuses on shallow-layer pruning, our method targets the middle-to-late layers of the first denoising step to align with LLaDA-V's delayed attention aggregation to maintain output quality, and the first-step pruning strategy reduces the computation across all subsequent steps. Our framework provides an empirical basis for efficient LLaDA-V inference and highlights the potential of vision-aware pruning in diffusion-based multimodal models. Across multiple benchmarks, our best configuration reduces computational cost by up to 65% while preserving an average of 95% task performance.

[82] TeleStyle: Content-Preserving Style Transfer in Images and Videos

Shiwen Zhang,Xiaoyan Yang,Bojia Zi,Haibin Huang,Chi Zhang,Xuelong Li

Main category: cs.CV

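TL;DR: This paper presents TeleStyle, a lightweight and effective content-preserving style transfer model built on Qwen-Image-Edit that supports both image and video stylization. By constructing a high-quality style dataset, proposing a Curriculum Continual Learning framework, and adding a module for temporal consistency in video, it reaches state-of-the-art results in style similarity, content consistency, and aesthetic quality.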

Details

Motivation: In Diffusion Transformers (DiTs), content and style features are highly entangled, which limits content-preserving style transfer; the features need to be disentangled while improving generalization and fidelity. Method: Builds TeleStyle on Qwen-Image-Edit; constructs a hybrid dataset of curated style samples and synthetic triplets (content/style/target); designs a Curriculum Continual Learning framework for staged training; and adds a video-to-video stylization module to improve temporal consistency. Result: Achieves state-of-the-art results on three metrics (style similarity, content consistency, and aesthetic quality); supports image and video stylization; and generalizes to unseen styles without sacrificing content fidelity. Conclusion: TeleStyle shows that a lightweight architecture combined with curriculum continual learning and high-quality data construction is effective, offering a practical, scalable, high-fidelity approach to content-preserving style transfer for images and video. Abstract: Content-preserving style transfer, generating stylized outputs based on content and style references, remains a significant challenge for Diffusion Transformers (DiTs) due to the inherent entanglement of content and style features in their internal representations. In this technical report, we present TeleStyle, a lightweight yet effective model for both image and video stylization. Built upon Qwen-Image-Edit, TeleStyle leverages the base model's robust capabilities in content preservation and style customization. To facilitate effective training, we curated a high-quality dataset of distinct specific styles and further synthesized triplets using thousands of diverse, in-the-wild style categories. We introduce a Curriculum Continual Learning framework to train TeleStyle on this hybrid dataset of clean (curated) and noisy (synthetic) triplets. This approach enables the model to generalize to unseen styles without compromising precise content fidelity. Additionally, we introduce a video-to-video stylization module to enhance temporal consistency and visual quality. TeleStyle achieves state-of-the-art performance across three core evaluation metrics: style similarity, content consistency, and aesthetic quality. Code and pre-trained models are available at https://github.com/Tele-AI/TeleStyle

[83] Automated Marine Biofouling Assessment: Benchmarking Computer Vision and Multimodal LLMs on the Level of Fouling Scale

Brayden Hamilton,Tim Cashmore,Peter Driscoll,Trevor Gee,Henry Williams

Main category: cs.CV

TL;DR: This paper explores automated classification of marine biofouling severity using computer vision models and large multimodal language models (LLMs), finding that while CV models excel at extreme fouling levels, LLMs offer competitive zero-shot performance with interpretability; hybrid approaches combining segmentation and LLM reasoning show promise for scalable, interpretable assessment.

Details

Motivation: Marine biofouling on vessel hulls poses ecological, economic, and biosecurity risks, and traditional diver-based surveys are hazardous and not scalable. Method: Evaluated convolutional neural networks, transformer-based segmentation, and zero-shot large multimodal language models (LLMs) on an expert-labelled dataset from New Zealand's Ministry for Primary Industries, using structured prompts and retrieval for LLMs. Result: Computer vision models achieved high accuracy for extreme LoF categories but underperformed on intermediate levels due to dataset imbalance and image framing; LLMs achieved competitive zero-shot performance with interpretable outputs. Conclusion: CV and LLM approaches have complementary strengths; hybrid methods integrating segmentation coverage with LLM reasoning are promising for scalable and interpretable biofouling assessment. Abstract: Marine biofouling on vessel hulls poses major ecological, economic, and biosecurity risks. Traditional survey methods rely on diver inspections, which are hazardous and limited in scalability. This work investigates automated classification of biofouling severity on the Level of Fouling (LoF) scale using both custom computer vision models and large multimodal language models (LLMs). Convolutional neural networks, transformer-based segmentation, and zero-shot LLMs were evaluated on an expert-labelled dataset from the New Zealand Ministry for Primary Industries. Computer vision models showed high accuracy at extreme LoF categories but struggled with intermediate levels due to dataset imbalance and image framing. LLMs, guided by structured prompts and retrieval, achieved competitive performance without training and provided interpretable outputs. The results demonstrate complementary strengths across approaches and suggest that hybrid methods integrating segmentation coverage with LLM reasoning offer a promising pathway toward scalable and interpretable biofouling assessment.

[84] DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

Haoyou Deng,Keyu Yan,Chaojie Mao,Xiang Wang,Yu Liu,Changxin Gao,Nong Sang

Main category: cs.CV

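TL;DR: This paper proposes DenseGRPO, a framework that introduces fine-grained dense rewards to fix the mismatch in existing GRPO methods between a sparse terminal reward and the contributions of intermediate denoising steps. It has two parts: an ODE-based approach that predicts the step-wise reward gain from the intermediate clean image at each step, and a reward-aware exploration-space calibration mechanism that adaptively adjusts the stochasticity injected at each timestep of the SDE sampler. Experiments show that DenseGRPO significantly improves human preference alignment on multiple benchmarks.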

Details

Motivation: Existing methods that combine GRPO with flow matching models suffer from sparse rewards: the terminal reward of the final image alone guides every intermediate denoising step, so the global feedback does not match each step's actual contribution, which hurts training. Method: DenseGRPO (1) uses an ODE-based approach to apply a reward model to the intermediate clean images and predicts the per-step reward gain as a dense reward; and (2) observes that a uniform exploration setting is mismatched with the time-varying noise intensity, and therefore designs a reward-aware exploration calibration mechanism that adaptively adjusts the stochasticity injected at each timestep of the SDE sampler. Result: On multiple standard benchmarks, DenseGRPO significantly outperforms existing GRPO methods, confirming the effectiveness of dense rewards and their key role in aligning flow matching models with human preferences. Conclusion: Dense rewards reflect the actual contribution of each denoising step more precisely; combined with reward-aware exploration-space calibration, they improve human preference alignment in text-to-image generation, and DenseGRPO offers a new paradigm for reinforcement-learning alignment of flow matching models. Abstract: Recent GRPO-based approaches built on flow matching models have shown remarkable improvements in human preference alignment for text-to-image generation. Nevertheless, they still suffer from the sparse reward problem: the terminal reward of the entire denoising trajectory is applied to all intermediate steps, resulting in a mismatch between the global feedback signals and the exact fine-grained contributions at intermediate denoising steps. To address this issue, we introduce DenseGRPO, a novel framework that aligns human preference with dense rewards, which evaluates the fine-grained contribution of each denoising step. Specifically, our approach includes two key components: (1) we propose to predict the step-wise reward gain as dense reward of each denoising step, which applies a reward model on the intermediate clean images via an ODE-based approach. This manner ensures an alignment between feedback signals and the contributions of individual steps, facilitating effective training; and (2) based on the estimated dense rewards, a mismatch drawback between the uniform exploration setting and the time-varying noise intensity in existing GRPO-based methods is revealed, leading to an inappropriate exploration space. Thus, we propose a reward-aware scheme to calibrate the exploration space by adaptively adjusting a timestep-specific stochasticity injection in the SDE sampler, ensuring a suitable exploration space at all timesteps. Extensive experiments on multiple standard benchmarks demonstrate the effectiveness of the proposed DenseGRPO and highlight the critical role of the valid dense rewards in flow matching model alignment.
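
The step-wise reward gain described above can be illustrated with a minimal sketch. It assumes (this is not stated in the abstract) that the gain of a step is simply the difference between the reward-model scores of consecutive predicted clean images; the function name and the example scores are hypothetical.

```python
import numpy as np

def stepwise_reward_gains(clean_image_rewards):
    """Turn per-step reward scores into dense per-step gains.

    clean_image_rewards[t] is the (hypothetical) reward-model score of the
    clean image predicted at denoising step t; the last entry is the
    terminal reward of the final image.
    """
    rewards = np.asarray(clean_image_rewards, dtype=float)
    # Gain credited to a step: how much the predicted clean image improved.
    return np.diff(rewards)

rewards = [0.10, 0.15, 0.40, 0.38, 0.70]  # made-up scores for illustration
gains = stepwise_reward_gains(rewards)
# The gains telescope: their sum equals terminal minus initial reward.
assert np.isclose(gains.sum(), rewards[-1] - rewards[0])
```

Under this assumption a step that makes the predicted image worse receives a negative gain, whereas a sparse scheme would credit every step with the same terminal reward.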

[85] Feature Projection Learning for Better Vision-Language Reasoning

Yi Zhang,Weicheng Lin,Liang-Jie Zhang

Main category: cs.CV

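TL;DR: This paper proposes FPL (Feature Projection Learning), a simple and efficient method that projects class prototype features into the query image feature space and reconstructs the feature map, recasting classification as a feature projection problem. Combined with the original CLIP prediction, it significantly improves downstream performance under limited supervision.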

Details

Motivation: Existing CLIP adaptation methods suffer from limited performance, too many learnable parameters, or long training times, which makes it hard to exploit pretrained vision-language knowledge efficiently. Method: FPL builds a projection model that maps class prototype features into the query image feature space and reconstructs the image feature map; the negative average squared reconstruction error serves as the class score; the final output fuses the projection model's prediction with the original CLIP output. Result: FPL significantly surpasses current state-of-the-art methods on multiple downstream tasks, with higher accuracy, fewer parameters, and faster training. Conclusion: FPL is a simple, efficient, and effective CLIP adaptation method that recasts classification as feature projection and performs well under limited supervision. Abstract: Vision-Language Pre-Trained models, notably CLIP, that utilize contrastive learning have proven highly adept at extracting generalizable visual features. To inherit the well-learned knowledge of VLP models for downstream tasks, several approaches aim to adapt them efficiently with limited supervision. However, these methods suffer from either limited performance, excessive learnable parameters, or extended training times, all of which hinder their effectiveness in adapting the CLIP model to downstream tasks. In this work, we propose a simple yet efficient and effective method called Feature Projection Learning (FPL) to address these problems. Specifically, we develop a projection model that projects class prototype features into the query image feature space and reconstructs the query image feature map. The negative average squared reconstruction error is used as the class score. In this way, we transform the classification problem into a feature projection problem. The final output of this method is a combination of the prediction from the projection model and the original pre-trained CLIP. Comprehensive empirical evaluations confirm that FPL delivers superior accuracy, surpassing the current state-of-the-art methods by a substantial margin.
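
The scoring rule in this entry (negative average squared reconstruction error as the class score, fused with the CLIP prediction) can be sketched as follows. This is a simplified illustration, not the authors' model: the single linear `projection` matrix, the token-wise reconstruction, and the mixing weight `alpha` are all assumptions.

```python
import numpy as np

def fpl_scores(query_feats, class_prototypes, projection):
    """Score each class by how well its projected prototype reconstructs the query.

    query_feats:      (N, D) query image feature map with N spatial tokens.
    class_prototypes: (C, D) one prototype per class.
    projection:       (D, D) hypothetical learned projection matrix.
    """
    projected = class_prototypes @ projection                 # (C, D)
    # Squared error between every query token and each projected prototype.
    errors = (query_feats[None, :, :] - projected[:, None, :]) ** 2
    # Negative average squared error: higher means a better reconstruction.
    return -errors.mean(axis=(1, 2))                          # (C,)

def fuse_with_clip(projection_scores, clip_logits, alpha=1.0):
    """Combine projection scores with the original CLIP logits (alpha is assumed)."""
    return clip_logits + alpha * projection_scores
```

A class whose projected prototype matches the query features exactly gets the maximum possible score of zero; every other class gets a negative score.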

[86] Visual Prompt-Agnostic Evolution

Junze Wang,Lei Fan,Dezheng Zhang,Weipeng Jing,Donglin Di,Yang Song,Sidong Liu,Cong Cong

Main category: cs.CV

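TL;DR: This paper proposes Prompt-Agnostic Evolution (PAE), which combines frequency-domain initialization, a shared Koopman operator that models how prompts evolve across layers, and a Lyapunov-stability regularizer to significantly improve the training stability, convergence speed, and accuracy of visual prompt tuning (VPT).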

Details

Motivation: Existing visual prompt tuning (VPT) methods train unstably: shallow-layer prompts stagnate early, deep-layer prompts oscillate with high variance, and the layers become mismatched, which slows convergence and degrades performance. Method: Initialize task-aware prompts from a frequency-domain perspective; model prompt evolution across all layers with a single shared Koopman operator; add a regularizer based on Lyapunov stability theory to limit error amplification. Result: On 25 datasets, convergence is 1.41x faster on average and accuracy improves by 1-3%; the method is prompt-agnostic and lightweight, works with various VPT variants, and requires no change to the backbone or the inference pipeline. Conclusion: PAE gives VPT a more stable, efficient, and general way to model prompt dynamics, and moves prompt tuning toward more interpretable and controllable behavior. Abstract: Visual Prompt Tuning (VPT) adapts a frozen Vision Transformer (ViT) to downstream tasks by inserting a small number of learnable prompt tokens into the token sequence at each layer. However, we observe that existing VPT variants often suffer from unstable training dynamics, characterized by gradient oscillations. A layer-wise analysis reveals that shallow-layer prompts tend to stagnate early, while deeper-layer prompts exhibit high-variance oscillations, leading to cross-layer mismatch. These issues slow convergence and degrade final performance. To address these challenges, we propose Prompt-Agnostic Evolution (PAE), which strengthens vision prompt tuning by explicitly modeling prompt dynamics. From a frequency-domain perspective, we initialize prompts in a task-aware direction by uncovering and propagating frequency shortcut patterns that the backbone inherently exploits for recognition. To ensure coherent evolution across layers, we employ a shared Koopman operator that imposes a global linear transformation instead of uncoordinated, layer-specific updates. Finally, inspired by Lyapunov stability theory, we introduce a regularizer that constrains error amplification during evolution. Extensive experiments show that PAE accelerates convergence with an average 1.41x speedup and improves accuracy by 1-3% on 25 datasets across multiple downstream tasks. Beyond performance, PAE is prompt-agnostic and lightweight, and it integrates seamlessly with diverse VPT variants without backbone modification or inference-time changes.
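
The idea of one shared linear operator driving prompts from layer to layer can be sketched as below. The abstract does not give the exact parameterization or the form of the regularizer, so both functions are assumptions: prompts are advanced by right-multiplying with a hypothetical matrix `koopman`, and a hinge on its spectral norm stands in for the Lyapunov-inspired penalty.

```python
import numpy as np

def evolve_prompts(initial_prompts, koopman, num_layers):
    """Generate per-layer prompts by repeatedly applying one shared linear map.

    initial_prompts: (P, D) prompt tokens for the first layer.
    koopman:         (D, D) shared operator (hypothetical parameterization).
    """
    prompts = [initial_prompts]
    for _ in range(num_layers - 1):
        prompts.append(prompts[-1] @ koopman)
    return prompts

def stability_penalty(koopman):
    """Stand-in regularizer: zero when the map cannot amplify errors.

    Penalizes the amount by which the largest singular value exceeds 1.
    """
    spectral_norm = np.linalg.norm(koopman, ord=2)
    return max(0.0, float(spectral_norm) - 1.0)
```

With a non-expansive operator (spectral norm at most 1) a perturbation of the first-layer prompts cannot grow as it propagates through the layers, which is the intuition behind constraining error amplification.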

[87] BLENDER: Blended Text Embeddings and Diffusion Residuals for Intra-Class Image Synthesis in Deep Metric Learning

Jan Niklas Kolf,Ozan Tezcan,Justin Theiss,Hyung Jun Kim,Wentao Bao,Bhargav Bhushanam,Khushi Gupta,Arun Kejariwal,Naser Damer,Fadi Boutros

Main category: cs.CV

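TL;DR: This paper proposes BLenDeR, a diffusion-based controllable synthesis method that increases intra-class diversity in deep metric learning through set-theory-inspired residual operations (union and intersection), significantly improving metrics such as Recall@1.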

Details

Motivation: Existing generative methods cannot increase intra-class diversity for deep metric learning (DML) in a controllable way, which limits how much synthetic data helps DML. Method: BLenDeR operates on the denoising residuals of a diffusion model with two set-theory-inspired operations: a union (encouraging any attribute present across multiple prompts) and an intersection (extracting the common direction through a principal component surrogate), enabling controlled synthesis of diverse attribute combinations within each class. Result: On the standard DML benchmarks CUB-200 and Cars-196, BLenDeR improves Recall@1 over state-of-the-art baselines by 3.7% and 1.8%, respectively. Conclusion: Controllable residual operations effectively increase intra-class diversity and offer a new way to use synthetic data in DML; the method is validated across multiple datasets and backbones. Abstract: The rise of Deep Generative Models (DGM) has enabled the generation of high-quality synthetic data. When used to augment authentic data in Deep Metric Learning (DML), these synthetic samples enhance intra-class diversity and improve the performance of downstream DML tasks. We introduce BLenDeR, a diffusion sampling method designed to increase intra-class diversity for DML in a controllable way by leveraging set-theory inspired union and intersection operations on denoising residuals. The union operation encourages any attribute present across multiple prompts, while the intersection extracts the common direction through a principal component surrogate. These operations enable controlled synthesis of diverse attribute combinations within each class, addressing key limitations of existing generative approaches. Experiments on standard DML benchmarks demonstrate that BLenDeR consistently outperforms state-of-the-art baselines across multiple datasets and backbones. Specifically, BLenDeR achieves a 3.7% increase in Recall@1 on CUB-200 and a 1.8% increase on Cars-196, compared to state-of-the-art baselines under standard experimental settings.
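
A rough sketch of the two residual operations follows. Only the use of a principal component for the intersection comes from the abstract; the uncentered SVD, the sign fix, and the max-magnitude rule used here for the union are assumptions chosen to make the idea concrete, not the paper's definitions.

```python
import numpy as np

def intersect(residuals):
    """Common direction of several denoising residuals.

    residuals: (P, D) array, one flattened residual per prompt.
    Uses the first right singular vector (uncentered PCA) as the shared
    direction, then keeps the mean residual's component along it.
    """
    _, _, vt = np.linalg.svd(residuals, full_matrices=False)
    direction = vt[0]
    mean_residual = residuals.mean(axis=0)
    if direction @ mean_residual < 0:      # resolve the SVD sign ambiguity
        direction = -direction
    return (mean_residual @ direction) * direction

def union(residuals):
    """Assumed stand-in: per coordinate, keep the strongest residual."""
    strongest = np.abs(residuals).argmax(axis=0)
    return residuals[strongest, np.arange(residuals.shape[1])]
```

For two residuals that agree on one coordinate and disagree in sign on another, `intersect` keeps only the agreed coordinate, while `union` keeps whichever prompt contributes the larger magnitude at each coordinate.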

[88] Reversible Efficient Diffusion for Image Fusion

Xingxin Xu,Bing Cao,DongDong Li,Qinghua Hu,Pengfei Zhu

Main category: cs.CV

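TL;DR: This paper proposes the Reversible Efficient Diffusion (RED) model, a new framework for multi-modal image fusion that aims to overcome the detail loss and low computational efficiency of conventional diffusion models in fusion tasks.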

Details

Motivation: Conventional diffusion models lose detail and accumulate noise errors in multi-modal image fusion, and end-to-end training with explicit supervision is computationally expensive. Method: The Reversible Efficient Diffusion (RED) model uses an explicitly supervised training framework that keeps the strong generative capability of diffusion models while avoiding distribution estimation. Result: RED outperforms existing diffusion-based methods in visual fidelity and detail preservation, and improves training and inference efficiency. Conclusion: RED offers an efficient, high-quality approach to multi-modal image fusion that balances generative capability with supervised accuracy. Abstract: Multi-modal image fusion aims to consolidate complementary information from diverse source images into a unified representation. The fused image is expected to preserve fine details and maintain high visual fidelity. While diffusion models have demonstrated impressive generative capabilities in image generation, they often suffer from detail loss when applied to image fusion tasks. This issue arises from the accumulation of noise errors inherent in the Markov process, leading to inconsistency and degradation in the fused results. However, incorporating explicit supervision into end-to-end training of diffusion-based image fusion introduces challenges related to computational efficiency. To address these limitations, we propose the Reversible Efficient Diffusion (RED) model - an explicitly supervised training framework that inherits the powerful generative capability of diffusion models while avoiding the distribution estimation.

[89] Hallucination Begins Where Saliency Drops

Xiaofeng Zhang,Yuanchao Zhu,Chaochen Gu,Xiaosong Yuan,Qiyan Zhao,Jiawei Cao,Feilong Tang,Sinan Fan,Yaomin Shen,Chen Shen,Hao Tang

Main category: cs.CV

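TL;DR: This paper proposes LVLMs-Saliency, a framework that combines attention weights with input gradients to quantify the visual grounding strength of each output token, and finds that low saliency of preceding tokens tends to cause hallucination. It then designs two inference-time mechanisms, Saliency-Guided Rejection Sampling (SGRS) and Local Coherence Reinforcement (LocoRE), which effectively suppress hallucinations in large vision-language models.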

Details

Motivation: Existing methods based on forward-pass attention cannot reliably separate hallucinated from factual outputs, and they ignore gradient signals that show how token influence propagates. Method: LVLMs-Saliency, a gradient-aware diagnostic framework, fuses attention weights with input gradients to quantify visual grounding strength; based on the finding that low saliency of preceding tokens leads to hallucination, the authors design two inference-time interventions, Saliency-Guided Rejection Sampling (SGRS) and Local Coherence Reinforcement (LocoRE). Result: Experiments on multiple LVLMs show that the method significantly reduces hallucination rates while preserving fluency and task performance. Conclusion: LVLMs-Saliency provides an interpretable and robust way to detect and mitigate hallucinations, and shows that retaining contextual memory is key to suppressing them. Abstract: Recent studies have examined attention dynamics in large vision-language models (LVLMs) to detect hallucinations. However, existing approaches remain limited in reliably distinguishing hallucinated from factually grounded outputs, as they rely solely on forward-pass attention patterns and neglect gradient-based signals that reveal how token influence propagates through the network. To bridge this gap, we introduce LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token by fusing attention weights with their input gradients. Our analysis uncovers a decisive pattern: hallucinations frequently arise when preceding output tokens exhibit low saliency toward the prediction of the next token, signaling a breakdown in contextual memory retention. Leveraging this insight, we propose a dual-mechanism inference-time framework to mitigate hallucinations: (1) Saliency-Guided Rejection Sampling (SGRS), which dynamically filters candidate tokens during autoregressive decoding by rejecting those whose saliency falls below a context-adaptive threshold, thereby preventing coherence-breaking tokens from entering the output sequence; and (2) Local Coherence Reinforcement (LocoRE), a lightweight, plug-and-play module that strengthens attention from the current token to its most recent predecessors, actively counteracting the contextual forgetting behavior identified by LVLMs-Saliency. Extensive experiments across multiple LVLMs demonstrate that our method significantly reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution for enhancing model reliability. Code is available at: https://github.com/zhangbaijin/LVLMs-Saliency
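
A toy sketch of the two ingredients named above: a saliency score that fuses attention with gradients, and a rejection test against a context-adaptive threshold. The elementwise product and the "mean minus k standard deviations" threshold are assumptions for illustration; the abstract does not specify either formula, and this is not the API of the linked repository.

```python
import numpy as np

def token_saliency(attention, attention_grad):
    """Fuse attention weights with their gradients (assumed: |A * dL/dA|)."""
    return np.abs(np.asarray(attention) * np.asarray(attention_grad))

def sgrs_accept(candidate_saliency, past_saliencies, k=1.0):
    """Accept a candidate token if its saliency clears an adaptive threshold.

    The threshold adapts to the context by tracking the saliency of tokens
    already accepted (hypothetical rule: mean - k * std).
    """
    past = np.asarray(past_saliencies, dtype=float)
    threshold = past.mean() - k * past.std()
    return bool(candidate_saliency >= threshold)
```

In a decoding loop, a rejected candidate would be dropped and another candidate sampled, so that low-saliency tokens do not enter the output sequence.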

[90] A Source-Free Approach for Domain Adaptation via Multiview Image Transformation and Latent Space Consistency

Debopom Sutradhar,Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Reem E. Mohamed,Sami Azam

Main category: cs.CV

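TL;DR: This paper proposes a source-free domain adaptation (Source-Free DA) method that needs no source-domain data. Using multiview augmentation and a latent space consistency mechanism, it learns domain-invariant features from target-domain data alone, avoids complex steps such as adversarial training and pseudo-labeling, and achieves state-of-the-art results on several standard datasets.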

Details

Motivation: Existing domain adaptation methods usually depend on source-domain data, adversarial training, or complex pseudo-labeling, which are computationally expensive and restrict deployment; an efficient, lightweight method that relies only on target-domain data is needed. Method: A source-free domain adaptation framework based on multiview augmentation and latent space consistency; it uses a ConvNeXt encoder and a joint objective that combines a classification loss with a consistency loss, forcing different augmented views of the same target sample to share a consistent latent representation. Result: Average classification accuracy reaches 90.72%, 84%, and 97.12% on Office-31, Office-Home, and Office-Caltech, improving on existing methods by +1.23%, +7.26%, and +1.77%, respectively. Conclusion: The method is the first to learn transferable features efficiently from target-domain data alone, without source access or pseudo-label refinement, which supports the effectiveness and practicality of latent space consistency for source-free domain adaptation. Abstract: Domain adaptation (DA) addresses the challenge of transferring knowledge from a source domain to a target domain where image data distributions may differ. Existing DA methods often require access to source domain data, adversarial training, or complex pseudo-labeling techniques, which are computationally expensive. To address these challenges, this paper introduces a novel source-free domain adaptation method. It is the first approach to use multiview augmentation and latent space consistency techniques to learn domain-invariant features directly from the target domain. Our method eliminates the need for source-target alignment or pseudo-label refinement by learning transferable representations solely from the target domain by enforcing consistency between multiple augmented views in the latent space. Additionally, the method ensures consistency in the learned features by generating multiple augmented views of target domain data and minimizing the distance between their feature representations in the latent space. We also introduce a ConvNeXt-based encoder and design a loss function that combines classification and consistency objectives to drive effective adaptation directly from the target domain. The proposed model achieves an average classification accuracy of 90.72%, 84%, and 97.12% on the Office-31, Office-Home, and Office-Caltech datasets, respectively. Further evaluations confirm that our study improves existing methods by an average classification accuracy increment of +1.23%, +7.26%, and +1.77% on the respective datasets.
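
The latent space consistency term can be sketched as a distance between the features of several augmented views of the same target images. The exact distance and weighting used in the paper are not given in the abstract; the squared distance to the view average and the `weight` parameter below are assumptions.

```python
import numpy as np

def consistency_loss(view_features):
    """Mean squared distance of each view's latent feature to the view average.

    view_features: (V, B, D) latent features of V augmented views of the
    same B target-domain images.
    """
    view_features = np.asarray(view_features, dtype=float)
    center = view_features.mean(axis=0, keepdims=True)
    return float(((view_features - center) ** 2).sum(axis=-1).mean())

def total_loss(classification_loss, view_features, weight=1.0):
    """Joint objective: classification term plus weighted consistency term."""
    return classification_loss + weight * consistency_loss(view_features)
```

The consistency term is zero exactly when all views of an image map to the same latent vector, so minimizing it pushes the encoder toward augmentation-invariant features.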

[91] Artifact-Aware Evaluation for High-Quality Video Generation

Chen Zhu,Jiashu Zhu,Yanxun Li,Meiqi Wu,Bingze Song,Chubin Chen,Jiahong Wu,Xiangxiang Chu,Yangang Wang

Main category: cs.CV

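TL;DR: This paper proposes a fine-grained evaluation protocol for generated videos. It defines three perceptual axes (Appearance, Motion, and Camera) covering 10 common artifact categories, builds GenVID, a large-scale annotated dataset of 80k videos, and develops DVAR, a dense video artifact recognition framework that significantly improves artifact detection accuracy and the filtering of low-quality content.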

Details

Motivation: Existing evaluation methods for video generation give only coarse quality scores and cannot localize or categorize specific artifacts, which falls short of fine-grained auditing needs. Method: Build a taxonomy of 10 artifact categories along the Appearance, Motion, and Camera axes; create GenVID, a large-scale annotated video artifact dataset (80k videos); design the Dense Video Artifact Recognition (DVAR) framework for fine-grained artifact identification and classification. Result: Validation on GenVID shows that DVAR significantly improves artifact detection accuracy for generated videos and filters low-quality content effectively. Conclusion: The work provides an interpretable, localizable, and quantifiable fine-grained approach to evaluating and auditing video generation models. Abstract: With the rapid advancement of video generation techniques, evaluating and auditing generated videos has become increasingly crucial. Existing approaches typically offer coarse video quality scores, lacking detailed localization and categorization of specific artifacts. In this work, we introduce a comprehensive evaluation protocol focusing on three key aspects affecting human perception: Appearance, Motion, and Camera. We define these axes through a taxonomy of 10 prevalent artifact categories reflecting common generative failures observed in video generation. To enable robust artifact detection and categorization, we introduce GenVID, a large-scale dataset of 80k videos generated by various state-of-the-art video generation models, each carefully annotated for the defined artifact categories. Leveraging GenVID, we develop DVAR, a Dense Video Artifact Recognition framework for fine-grained identification and classification of generative artifacts. Extensive experiments show that our approach significantly improves artifact detection accuracy and enables effective filtering of low-quality content.

[92] Towards Compact and Robust DNNs via Compression-aware Sharpness Minimization

Jialuo He,Huangxun Chen

Main category: cs.CV

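TL;DR: This paper proposes C-SAM, a framework that perturbs pruning masks rather than model parameters during training in order to optimize flatness with respect to model structure, improving both the robustness and the accuracy of compressed models.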

Details

Motivation: Models trained with existing SAM methods struggle to stay robust after pruning, while applying SAM after pruning is constrained by an early, robustness-agnostic choice of structure; a joint optimization of compression and robustness is therefore needed. Method: Compression-aware ShArpness Minimization (C-SAM) moves SAM's perturbations from parameter space to pruning-mask space, explicitly optimizing during training for a flat loss landscape that is robust to structural changes. Result: Across CelebA-HQ, Flowers-102, and CIFAR-10-C with ResNet-18, GoogLeNet, and MobileNet-V2, C-SAM improves certified robustness over strong baselines by up to 42% while keeping task accuracy comparable to the unpruned models. Conclusion: C-SAM bridges the gap between model compression and robustness optimization, and shows that structure-level flatness optimization is a key route to compact and robust DNNs. Abstract: Sharpness-Aware Minimization (SAM) has recently emerged as an effective technique for improving DNN robustness to input variations. However, its interplay with the compactness requirements of on-device DNN deployments remains less explored. Simply pruning a SAM-trained model can undermine robustness, since flatness in the continuous parameter space does not necessarily translate to robustness under the discrete structural changes induced by pruning. Conversely, applying SAM after pruning may be fundamentally constrained by architectural limitations imposed by an early, robustness-agnostic pruning pattern. To address this gap, we propose Compression-aware ShArpness Minimization (C-SAM), a framework that shifts sharpness-aware learning from parameter perturbations to mask perturbations. By explicitly perturbing pruning masks during training, C-SAM promotes a flatter loss landscape with respect to model structure, enabling the discovery of pruning patterns that simultaneously optimize model compactness and robustness to input variations. Extensive experiments on CelebA-HQ, Flowers-102, and CIFAR-10-C across ResNet-18, GoogLeNet, and MobileNet-V2 show that C-SAM consistently achieves higher certified robustness than strong baselines, with improvements of up to 42%, while maintaining task accuracy comparable to the corresponding unpruned models.
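
To make "perturbing the mask instead of the parameters" concrete, the toy sketch below evaluates a masked linear model under small random mask flips and reports the worst loss found. How C-SAM actually chooses its mask perturbation is not described in the abstract; the random search, the flip fraction, and the linear model are all assumptions used only for illustration.

```python
import numpy as np

def masked_loss(weights, mask, inputs, targets):
    """Mean squared error of a linear model whose weights are pruned by mask."""
    preds = inputs @ (weights * mask)
    return float(np.mean((preds - targets) ** 2))

def worst_case_mask_loss(weights, mask, inputs, targets,
                         flip_fraction=0.1, trials=8, seed=0):
    """Largest loss over a few random perturbations of the pruning mask."""
    rng = np.random.default_rng(seed)
    num_flips = max(1, int(flip_fraction * mask.size))
    worst = masked_loss(weights, mask, inputs, targets)
    for _ in range(trials):
        perturbed = mask.copy()
        idx = rng.choice(mask.size, size=num_flips, replace=False)
        perturbed[idx] = 1.0 - perturbed[idx]   # prune or restore these weights
        worst = max(worst, masked_loss(weights, perturbed, inputs, targets))
    return worst
```

Training against the worst-case value rather than the nominal loss would favor weights and masks whose loss changes little when the structure changes, which is the structural analogue of the flatness SAM seeks in parameter space.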

[93] Bridging the Applicator Gap with Data-Doping: Dual-Domain Learning for Precise Bladder Segmentation in CT-Guided Brachytherapy

Suresh Das,Siladittya Manna,Sayantari Ghosh

Main category: cs.CV

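TL;DR: This paper proposes a dual-domain learning strategy that jointly trains on abundant no-applicator (NA) CT data and a small amount of with-applicator (WA) CT data, markedly improving the robustness and generalization of bladder segmentation under covariate shift; only 10%-30% WA data is needed to match the performance of training on WA data alone.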

Details

Motivation: In gynecological brachytherapy, CT images with an applicator inserted (WA) are scarce and show severe anatomical deformation and artifacts, whereas images without an applicator (NA) are plentiful but follow a different distribution; how to use NA data to support learning from limited WA data remains an open question. Method: A dual-domain learning strategy that systematically mixes NA and WA CT data at different ratios for training, validated across axial, coronal, and sagittal planes and multiple deep network architectures, using a curated mixed dataset. Result: Doping in 10%-30% WA data yields segmentation performance (Dice up to 0.94, IoU up to 0.92) comparable to models trained only on WA data and clearly better than models trained only on NA data. Conclusion: Anatomically similar but distribution-shifted datasets can work together to improve medical image segmentation, offering a practical way to ease annotation scarcity and improve clinical reliability. Abstract: Performance degradation due to covariate shift remains a major challenge for deep learning models in medical image segmentation. An open question is whether samples from a shifted distribution can effectively support learning when combined with limited target domain data. We investigate this problem in the context of bladder segmentation in CT guided gynecological brachytherapy, a critical task for accurate dose optimization and organ at risk sparing. While CT scans without brachytherapy applicators (no applicator: NA) are widely available, scans with applicators inserted (with applicator: WA) are scarce and exhibit substantial anatomical deformation and imaging artifacts, making automated segmentation particularly difficult. We propose a dual domain learning strategy that integrates NA and WA CT data to improve robustness and generalizability under covariate shift. Using a curated assorted dataset, we show that NA data alone fail to capture the anatomical and artifact related characteristics of WA images. However, introducing a modest proportion of WA data into a predominantly NA training set leads to significant performance improvements. Through systematic experiments across axial, coronal, and sagittal planes using multiple deep learning architectures, we demonstrate that doping only 10 to 30 percent WA data achieves segmentation performance comparable to models trained exclusively on WA data. The proposed approach attains Dice similarity coefficients of up to 0.94 and Intersection over Union scores of up to 0.92, indicating effective domain adaptation and improved clinical reliability. This study highlights the value of integrating anatomically similar but distribution shifted datasets to overcome data scarcity and enhance deep learning based segmentation for brachytherapy treatment planning.
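
The data-doping setup, a predominantly NA training set with a chosen share of WA scans, can be sketched as a small sampling helper. The paper's actual mixing protocol is not described in the abstract; this version assumes the NA pool is kept whole and WA scans are added until they make up `wa_fraction` of the mix. All names are hypothetical.

```python
import random

def dope_training_set(na_samples, wa_samples, wa_fraction, seed=0):
    """Mix NA and WA samples so that WA makes up wa_fraction of the result."""
    if not 0.0 <= wa_fraction < 1.0:
        raise ValueError("wa_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve num_wa / (num_na + num_wa) = wa_fraction for num_wa.
    num_wa = round(len(na_samples) * wa_fraction / (1.0 - wa_fraction))
    num_wa = min(num_wa, len(wa_samples))   # cannot use more WA scans than exist
    mixed = list(na_samples) + rng.sample(list(wa_samples), num_wa)
    rng.shuffle(mixed)
    return mixed
```

For example, with 80 NA scans and a target of 20% WA, the helper adds 20 WA scans for a training set of 100; sweeping `wa_fraction` from 0.1 to 0.3 would reproduce the range of doping levels discussed in the entry.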

[94] Physically Guided Visual Mass Estimation from a Single RGB Image

Sungjae Lee,Junhan Jeong,Yeonjoo Hong,Kwang In Kim

Main category: cs.CV

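TL;DR: This paper proposes a physically structured framework for estimating object mass from a single image. From one RGB image it jointly estimates 3D geometry (volume) and material semantics (density), and uses physically guided latent-variable regression to predict mass accurately without explicit volume or density labels.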

Details

Motivation: Estimating object mass from a single RGB image is ill-posed, because mass is determined jointly by geometric volume and material density, neither of which is directly observable; physically meaningful representations are needed to constrain the solution space. Method: (1) Recover object-centric 3D geometry via monocular depth estimation to infer volume; (2) extract coarse material semantics with a vision-language model to guide density reasoning; (3) fuse geometry, semantic, and appearance features through an instance-adaptive gating mechanism; (4) under mass-only supervision, predict volume-related and density-related physical latent factors with two separate regression heads. Result: On the image2mass and ABO-500 datasets, the method consistently outperforms existing state-of-the-art methods. Conclusion: Embedding a physical prior (decoupling volume and density) into visual representation learning eases the ill-posedness of single-image mass estimation and improves generalization and interpretability. Abstract: Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Consequently, mass prediction from pixels is ill-posed and therefore benefits from physically meaningful representations to constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume and extract coarse material semantics using a vision-language model to guide density-related reasoning. These geometry, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two physically guided latent factors (volume- and density-related) are predicted through separate regression heads under mass-only supervision. Experiments on image2mass and ABO-500 show that the proposed method consistently outperforms state-of-the-art methods.
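
The physical relation behind the two regression heads is mass = volume x density, which becomes a sum in log space. The sketch below assumes, purely for illustration, that the two heads output log-volume and log-density; the abstract only says the latent factors are "volume- and density-related", so this interpretation and the loss are assumptions.

```python
import numpy as np

def predict_mass(log_volume, log_density):
    """Combine two head outputs into a mass estimate in kilograms.

    Assumes log_volume is log(m^3) and log_density is log(kg/m^3), so that
    log(mass) = log(volume) + log(density).
    """
    return np.exp(log_volume + log_density)

def mass_only_loss(log_volume, log_density, true_mass):
    """Squared error in log-mass; only the mass label is needed."""
    return float((log_volume + log_density - np.log(true_mass)) ** 2)
```

Because the loss constrains only the sum of the two outputs, mass-only supervision cannot by itself decide how to split that sum between volume and density; this is where the geometric and material cues described in the entry come in.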

[95] Structure-constrained Language-informed Diffusion Model for Unpaired Low-dose Computed Tomography Angiography Reconstruction

Genyuan Zhang,Zihao Wang,Zhifan Gao,Lei Xu,Zhen Zhou,Haijun Yu,Jianjia Zhang,Xiujian Liu,Weiwei Zhang,Shaoyu Wang,Huazhu Fu,Fenglin Liu,Weiwen Wu

Main category: cs.CV

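TL;DR: This paper proposes a Structure-constrained Language-informed Diffusion Model (SLDM) that generates normal-dose images from low-dose iodinated-contrast CT images, reducing contrast agent use while preserving diagnostic value.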

Details

Motivation: Existing deep learning methods struggle to enhance images accurately when the training images are incompletely paired, mainly because the models have limited ability to recognize specific anatomical structures. Method: The Structure-constrained Language-informed Diffusion Model (SLDM) (1) extracts structural prior information from the image to constrain inference; (2) introduces a semantic supervision strategy with spatial intelligence that combines visual perception and spatial reasoning; and (3) adds a subtraction angiography enhancement module to raise the contrast of the contrast-agent region. Result: In angiographic reconstruction, qualitative visual comparison and several quantitative metrics confirm the method's effectiveness for low-dose contrast CT angiography. Conclusion: By combining structural constraints, language guidance, and spatial intelligence, SLDM markedly improves the enhancement quality and diagnostic usability of low-dose iodinated-contrast CT images. Abstract: The application of iodinated contrast media (ICM) improves the sensitivity and specificity of computed tomography (CT) for a wide range of clinical indications. However, overdose of ICM can cause problems such as kidney damage and life-threatening allergic reactions. Deep learning methods can generate CT images of normal-dose ICM from low-dose ICM, reducing the required dose while maintaining diagnostic power. However, existing methods struggle to achieve accurate enhancement with incompletely paired images, mainly because of the limited ability of the model to recognize specific structures. To overcome this limitation, we propose a Structure-constrained Language-informed Diffusion Model (SLDM), a unified medical generation model that integrates structural synergy and spatial intelligence. First, the structural prior information of the image is effectively extracted to constrain the model inference process, thus ensuring structural consistency in the enhancement process. Subsequently, a semantic supervision strategy with spatial intelligence is introduced, which integrates the functions of visual perception and spatial reasoning, thus prompting the model to achieve accurate enhancement. Finally, the subtraction angiography enhancement module is applied, which serves to improve the contrast of the ICM agent region to a suitable interval for observation. Qualitative analysis of visual comparison and quantitative results of several metrics demonstrate the effectiveness of our method in angiographic reconstruction for low-dose contrast medium CT angiography.

[96] TPGDiff: Hierarchical Triple-Prior Guided Diffusion for Image Restoration

Yanjie Tu,Qingsen Yan,Axi Niu,Jiacong Tang

Main category: cs.CV

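TL;DR: This paper proposes TPGDiff, a triple-prior guided diffusion model for unified image restoration. By introducing degradation, structural, and semantic priors hierarchically into the diffusion process, it improves content reconstruction in severely degraded regions while preserving spatial structure.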

Details

Motivation: Existing unified image restoration methods rely on degradation priors but reconstruct content poorly in severely degraded regions; injecting semantic information into the shallow layers of a diffusion model tends to cause artifacts such as blurred structure. Method: The TPGDiff network (1) models degradation priors throughout the diffusion trajectory; (2) introduces multi-source structural priors in shallow layers to preserve detail; (3) introduces distillation-driven semantic priors in deep layers for robust high-level guidance; and (4) uses a degradation extractor for stage-adaptive control across timesteps. Result: TPGDiff significantly outperforms existing methods on single- and multi-degradation benchmarks, with stronger generalization and reconstruction quality. Conclusion: Hierarchical, complementary guidance from the three priors improves the performance and robustness of unified image restoration, especially under severe degradation. Abstract: All-in-one image restoration aims to address diverse degradation types using a single unified model. Existing methods typically rely on degradation priors to guide restoration, yet often struggle to reconstruct content in severely degraded regions. Although recent works leverage semantic information to facilitate content generation, integrating it into the shallow layers of diffusion models often disrupts spatial structures (e.g., blurring artifacts). To address this issue, we propose a Triple-Prior Guided Diffusion (TPGDiff) network for unified image restoration. TPGDiff incorporates degradation priors throughout the diffusion trajectory, while introducing structural priors into shallow layers and semantic priors into deep layers, enabling hierarchical and complementary prior guidance for image reconstruction. Specifically, we leverage multi-source structural cues as structural priors to capture fine-grained details and guide shallow-layer representations. To complement this design, we further develop a distillation-driven semantic extractor that yields robust semantic priors, ensuring reliable high-level guidance at deep layers even under severe degradations. Furthermore, a degradation extractor is employed to learn degradation-aware priors, enabling stage-adaptive control of the diffusion process across all timesteps. Extensive experiments on both single- and multi-degradation benchmarks demonstrate that TPGDiff achieves superior performance and generalization across diverse restoration scenarios. Our project page is: https://leoyjtu.github.io/tpgdiff-project.

[97] OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

Shuoyan Wei,Feng Li,Chen Zhou,Runmin Cong,Yao Zhao,Huihui Bai

Main category: cs.CV

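TL;DR: This paper proposes OSDEnhancer, the first framework to achieve real-world space-time video super-resolution (STVSR) through an efficient one-step diffusion process. It combines linear pre-interpolation, a temporal refinement and spatial enhancement mixture of experts (TR-SE MoE), and a bidirectional deformable variational autoencoder (VAE) decoder, significantly improving reconstruction fidelity and temporal consistency.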

Details

Motivation: Most existing STVSR methods rest on simplified degradation assumptions and perform poorly under the complex, unknown degradations of real scenes; meanwhile, the potential of diffusion models for STVSR has not been fully explored. Method: The OSDEnhancer framework (1) initializes spatiotemporal structure with linear pre-interpolation; (2) trains a temporal refinement and spatial enhancement mixture of experts (TR-SE MoE) to jointly learn temporal coherence and spatial detail; and (3) introduces a bidirectional deformable VAE decoder for recurrent spatiotemporal aggregation and propagation. Result: State-of-the-art performance in real-world scenarios with strong generalization. Conclusion: OSDEnhancer is the first one-step diffusion method for real-world STVSR; its structured modules balance high-fidelity reconstruction with temporal consistency and offer a new approach to STVSR. Abstract: Diffusion models (DMs) have demonstrated exceptional success in video super-resolution (VSR), showcasing a powerful capacity for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic visual content from low-resolution to high-resolution but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simplified degradation assumptions, which often struggle in real-world scenarios with complex unknown degradations. Such a high demand for reconstruction fidelity and temporal consistency makes the development of a robust STVSR framework particularly non-trivial. To address these challenges, we propose OSDEnhancer, a novel framework that, to the best of our knowledge, represents the first method to achieve real-world STVSR through an efficient one-step diffusion process. OSDEnhancer initializes essential spatiotemporal structures through a linear pre-interpolation strategy and pivots on training temporal refinement and spatial enhancement mixture of experts (TR-SE MoE), which allows distinct expert pathways to progressively learn robust, specialized representations for temporal coherence and spatial detail, further collaboratively reinforcing each other during inference. A bidirectional deformable variational autoencoder (VAE) decoder is further introduced to perform recurrent spatiotemporal aggregation and propagation, enhancing cross-frame reconstruction fidelity. Experiments demonstrate that the proposed method achieves state-of-the-art performance while maintaining superior generalization capability in real-world scenarios.
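
A linear pre-interpolation step can be illustrated as simple temporal blending between neighboring frames. The abstract names the strategy but not its details, so this sketch makes several assumptions: it works in pixel space, interpolates only along time, and uses a hypothetical `factor` for the frame-rate increase.

```python
import numpy as np

def linear_pre_interpolate(frames, factor=2):
    """Insert factor-1 linearly blended frames between consecutive frames.

    frames: array of shape (T, H, W, C); returns (T - 1) * factor + 1 frames.
    """
    frames = np.asarray(frames, dtype=float)
    out = []
    for current, following in zip(frames[:-1], frames[1:]):
        for step in range(factor):
            w = step / factor               # 0 reproduces the original frame
            out.append((1.0 - w) * current + w * following)
    out.append(frames[-1])
    return np.stack(out)
```

Such a blend gives a temporally dense but blurry starting point, which a learned model would then refine; it is a cheap initialization rather than a reconstruction method in its own right.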

[98] CPiRi: Channel Permutation-Invariant Relational Interaction for Multivariate Time Series Forecasting

Jiyuan Xu,Wenyu Zhang,Xin Jing,Shuai Chen,Shuai Zhang,Jiahao Nie

Main category: cs.CV

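TL;DR: This paper proposes CPiRi, a framework whose channel permutation-invariant design addresses the limitations of both channel-dependent and channel-independent models in multivariate time series forecasting. It can be deployed without retraining under joint structural and distributional drift and achieves state-of-the-art results on multiple benchmarks.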

Details

Motivation: 现有通道依赖模型易过拟合通道顺序，通道独立模型忽略通道间依赖，二者均难以适应通道增减或重排场景。 Method: 提出CPiRi框架：采用时空解耦架构（冻结时序编码器+轻量空间模块）与排列不变正则化训练策略（通道随机打乱），并从理论上分析排列等变性。 Result: 在多个基准测试中达到SOTA；对通道顺序扰动鲁棒；仅用一半通道训练即可泛化至未见通道；在大规模数据上保持高效。 Conclusion: CPiRi实现了通道排列不变性与跨通道建模能力的统一，在灵活性、性能和效率之间取得良好平衡，适用于动态变化的实际时序预测场景。 Abstract: Current methods for multivariate time series forecasting can be classified into channel-dependent and channel-independent models. Channel-dependent models learn cross-channel features but often overfit the channel ordering, which hampers adaptation when channels are added or reordered. Channel-independent models treat each channel in isolation to increase flexibility, yet this neglects inter-channel dependencies and limits performance. To address these limitations, we propose \textbf{CPiRi}, a \textbf{channel permutation invariant (CPI)} framework that infers cross-channel structure from data rather than memorizing a fixed ordering, enabling deployment in settings with structural and distributional co-drift without retraining. CPiRi couples \textbf{spatio-temporal decoupling architecture} with \textbf{permutation-invariant regularization training strategy}: a frozen pretrained temporal encoder extracts high-quality temporal features, a lightweight spatial module learns content-driven inter-channel relations, while a channel shuffling strategy enforces CPI during training. We further \textbf{ground CPiRi in theory} by analyzing permutation equivariance in multivariate time series forecasting. Experiments on multiple benchmarks show state-of-the-art results. CPiRi remains stable when channel orders are shuffled and exhibits strong \textbf{inductive generalization} to unseen channels even when trained on \textbf{only half} of the channels, while maintaining \textbf{practical efficiency} on large-scale datasets. The source code is released at https://github.com/JasonStraka/CPiRi.

[99] GVGS: Gaussian Visibility-Aware Multi-View Geometry for Accurate Surface Reconstruction

Mai Su,Qihan Yu,Zhongtao Wang,Yilong Li,Chengwei Pan,Yisong Chen,Guoping Wang

Main category: cs.CV

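TL;DR: This paper proposes a method that combines a visibility-aware multi-view geometric consistency constraint with a progressive quadtree-calibrated monocular depth constraint to improve the accuracy and stability of surface reconstruction in 3D Gaussian Splatting.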

Details

Motivation: 现有基于3D高斯点阵的表面重建方法在多视角几何一致性或单目深度先验上存在局限：前者在大几何差异下不可靠，后者存在尺度模糊和局部不一致问题，导致深度监督不准。 Method: 1）引入高斯可见性感知的多视角几何一致性约束，聚合跨视角共享高斯原语的可见性；2）提出渐进式四叉树校准的单目深度约束，从粗到细进行分块仿射校准，缓解尺度模糊并保留细节。 Result: 在DTU和TNT数据集上实验表明，该方法在几何精度上持续优于现有基于高斯和隐式表示的表面重建方法。 Conclusion: 所提双约束机制有效提升了3D高斯点阵中表面重建的准确性与鲁棒性，为高质量神经渲染与几何建模提供了新思路。 Abstract: 3D Gaussian Splatting enables efficient optimization and high-quality rendering, yet accurate surface reconstruction remains challenging. Prior methods improve surface reconstruction by refining Gaussian depth estimates, either via multi-view geometric consistency or through monocular depth priors. However, multi-view constraints become unreliable under large geometric discrepancies, while monocular priors suffer from scale ambiguity and local inconsistency, ultimately leading to inaccurate Gaussian depth supervision. To address these limitations, we introduce a Gaussian visibility-aware multi-view geometric consistency constraint that aggregates the visibility of shared Gaussian primitives across views, enabling more accurate and stable geometric supervision. In addition, we propose a progressive quadtree-calibrated Monocular depth constraint that performs block-wise affine calibration from coarse to fine spatial scales, mitigating the scale ambiguity of depth priors while preserving fine-grained surface details. Extensive experiments on DTU and TNT datasets demonstrate consistent improvements in geometric accuracy over prior Gaussian-based and implicit surface reconstruction methods. Codes are available at an anonymous repository: https://github.com/GVGScode/GVGS.

[100] Test-Time Adaptation for Anomaly Segmentation via Topology-Aware Optimal Transport Chaining

Ali Zia,Usman Ali,Umer Ramzan,Abdul Rehman,Abdelwahed Khamis,Wei Xiang

Main category: cs.CV

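TL;DR: This paper proposes TopoOT, a test-time adaptation framework for anomaly segmentation that combines topological data analysis with optimal transport. It aligns multi-filtration persistence diagrams through optimal transport chaining to produce stability scores, uses them to guide online training of a lightweight head, and reaches SOTA performance on 2D and 3D anomaly detection benchmarks.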

Details

Motivation: Conventional threshold-based binarization is not robust under distribution shift, whereas topological data analysis characterizes anomalies through global structure and is more principled and stable. Method: Proposes the TopoOT framework, whose core is Optimal Transport Chaining: it aligns multi-filtration persistence diagrams (PDs) across thresholds and filtrations and produces geodesic stability scores that are used to build stability-aware pseudo-labels; a lightweight segmentation head is then trained online with OT-consistency and contrastive objectives. Result: Reaches SOTA on standard 2D and 3D anomaly segmentation benchmarks, with gains of up to +24.1% mean F1 on 2D datasets and +10.2% on 3D benchmarks. Conclusion: Combining topological stability modeling with optimal-transport-driven test-time adaptation markedly improves the robustness and accuracy of anomaly segmentation under domain shift. Abstract: Deep topological data analysis (TDA) offers a principled framework for capturing structural invariants such as connectivity and cycles that persist across scales, making it a natural fit for anomaly segmentation (AS). Unlike threshold-based binarisation, which produces brittle masks under distribution shift, TDA allows anomalies to be characterised as disruptions to global structure rather than local fluctuations. We introduce TopoOT, a topology-aware optimal transport (OT) framework that integrates multi-filtration persistence diagrams (PDs) with test-time adaptation (TTA). Our key innovation is Optimal Transport Chaining, which sequentially aligns PDs across thresholds and filtrations, yielding geodesic stability scores that identify features consistently preserved across scales. These stability-aware pseudo-labels supervise a lightweight head trained online with OT-consistency and contrastive objectives, ensuring robust adaptation under domain shift. Across standard 2D and 3D anomaly detection benchmarks, TopoOT achieves state-of-the-art performance, outperforming the most competitive methods by up to +24.1% mean F1 on 2D datasets and +10.2% on 3D AS benchmarks.

[101] MMSF: Multitask and Multimodal Supervised Framework for WSI Classification and Survival Analysis

Chengying She,Chengwei Chen,Xinran Zhang,Ben Wang,Lizhuang Liu,Chengwei Shao,Yun Bian

Main category: cs.CV

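TL;DR: This paper proposes the MMSF framework, which integrates histopathology images with clinical data through multitask, multimodal supervised learning and improves accuracy, AUC, and C-index on several cancer datasets.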

Details

Motivation: Multimodal evidence (such as whole slide images and clinical descriptors) is critical in computational pathology, but fusing heterogeneous signals remains hard because their feature spaces differ widely in statistics and scale. Method: Proposes the MMSF framework, built on a linear-complexity MIL backbone and comprising a graph feature extraction module (modeling tissue topology), a clinical data embedding module (standardizing patient attributes), a feature fusion module (aligning modality-shared and modality-specific representations), and a Mamba-based MIL encoder with multitask prediction heads. Result: On CAMELYON16 and TCGA-NSCLC, accuracy improves by 2.1–6.6% and AUC by 2.2–6.9%; on five TCGA survival cohorts, C-index improves by 7.1–9.8% over unimodal methods and by 5.6–7.1% over other multimodal methods. Conclusion: MMSF fuses pathology images with clinical data effectively and improves cancer prognosis prediction, supporting the value of explicitly decomposing and fusing cross-modal information. Abstract: Multimodal evidence is critical in computational pathology: gigapixel whole slide images capture tumor morphology, while patient-level clinical descriptors preserve complementary context for prognosis. Integrating such heterogeneous signals remains challenging because feature spaces exhibit distinct statistics and scales. We introduce MMSF, a multitask and multimodal supervised framework built on a linear-complexity MIL backbone that explicitly decomposes and fuses cross-modal information. MMSF comprises a graph feature extraction module embedding tissue topology at the patch level, a clinical data embedding module standardizing patient attributes, a feature fusion module aligning modality-shared and modality-specific representations, and a Mamba-based MIL encoder with multitask prediction heads. Experiments on CAMELYON16 and TCGA-NSCLC demonstrate 2.1–6.6% accuracy and 2.2–6.9% AUC improvements over competitive baselines, while evaluations on five TCGA survival cohorts yield 7.1–9.8% C-index improvements compared with unimodal methods and 5.6–7.1% over multimodal alternatives.

[102] PalmBridge: A Plug-and-Play Feature Alignment Framework for Open-Set Palmprint Verification

Chenke Zhang,Ziyuan Yang,Licheng Yan,Shuyi Li,Andrew Beng Jin Teoh,Bob Zhang,Yi Zhang

Main category: cs.CV

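TL;DR: This paper proposes PalmBridge, a plug-and-play feature-space alignment framework based on vector quantization for open-set palmprint verification. It learns representative vectors and blends them with the original features, suppressing domain-shift nuisance while retaining identity discriminability.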

Details

Motivation: Existing deep palmprint models assume a closed, stationary data distribution and overfit to dataset-specific textures, so they cope poorly with the feature distribution shifts caused by heterogeneous deployment conditions; conventional data augmentation struggles to approximate the target distribution under significant domain mismatch. Method: Proposes the PalmBridge framework: it learns a compact set of representative vectors in feature space; during enrollment and verification, each feature vector is mapped to its nearest representative vector under a minimum-distance criterion and the two are blended with a weight; the representative vectors are optimized jointly with the backbone using task supervision, a feature-consistency loss, and orthogonality regularization; mapping stability is analyzed through assignment consistency and collision rate. Result: Experiments on multiple palmprint datasets and backbones show that PalmBridge consistently lowers the equal error rate (EER) in intra-dataset open-set evaluation and improves cross-dataset generalization, with negligible to modest runtime overhead. Conclusion: By aligning in feature space rather than data space, PalmBridge mitigates domain shift and provides a robust, lightweight, plug-and-play solution for open-set palmprint recognition. Abstract: Palmprint recognition is widely used in biometric systems, yet real-world performance often degrades due to feature distribution shifts caused by heterogeneous deployment conditions. Most deep palmprint models assume a closed and stationary distribution, leading to overfitting to dataset-specific textures rather than learning domain-invariant representations. Although data augmentation is commonly used to mitigate this issue, it assumes augmented samples can approximate the target deployment distribution, an assumption that often fails under significant domain mismatch. To address this limitation, we propose PalmBridge, a plug-and-play feature-space alignment framework for open-set palmprint verification based on vector quantization. Rather than relying solely on data-level augmentation, PalmBridge learns a compact set of representative vectors directly from training features. During enrollment and verification, each feature vector is mapped to its nearest representative vector under a minimum-distance criterion, and the mapped vector is then blended with the original vector. This design suppresses nuisance variation induced by domain shifts while retaining discriminative identity cues. The representative vectors are jointly optimized with the backbone network using task supervision, a feature-consistency objective, and an orthogonality regularization term to form a stable and well-structured shared embedding space. Furthermore, we analyze feature-to-representative mappings via assignment consistency and collision rate to assess the model's sensitivity to blending weights. Experiments on multiple palmprint datasets and backbone architectures show that PalmBridge consistently reduces EER in intra-dataset open-set evaluation and improves cross-dataset generalization with negligible to modest runtime overhead.
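
The map-then-blend step can be sketched in a few lines. This is a minimal illustration, not PalmBridge's code: the function names are hypothetical, Euclidean distance is assumed for the minimum-distance criterion, and the convex combination with a single `weight` is an assumed form of the blending, which the abstract does not specify.

```python
import math


def nearest_representative(feature, codebook):
    """Index of the codebook vector closest to `feature` (Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(feature, codebook[k]))


def blend_with_representative(feature, codebook, weight):
    """Return (1 - weight) * feature + weight * nearest representative vector."""
    rep = codebook[nearest_representative(feature, codebook)]
    return [(1.0 - weight) * f + weight * r for f, r in zip(feature, rep)]
```

With `weight = 0` the feature is left unchanged and with `weight = 1` it is fully quantized; intermediate values trade off domain alignment against identity detail, which is presumably why sensitivity to the blending weight is analyzed.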

[103] Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Zengbin Wang,Xuecai Hu,Yong Wang,Feng Xiong,Man Zhang,Xiangxiang Chu

Main category: cs.CV

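TL;DR: This paper proposes SpatialGenEval, a benchmark for systematically evaluating the spatial intelligence of text-to-image (T2I) models, with 1,230 long, information-dense prompts and multiple-choice question-answer pairs. It also builds the SpatialT2I dataset and shows that it improves spatial relation modeling.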

Details

Motivation: Existing T2I models fall short in modeling complex spatial relations (such as position, occlusion, and causality), and mainstream benchmarks, with their short, information-sparse prompts, cannot evaluate this ability effectively. Method: Builds the SpatialGenEval benchmark with 1,230 long prompts covering 25 real-world scenes and 10 spatial sub-domains, each paired with multiple-choice questions; further builds SpatialT2I, a dataset of 15,400 information-dense, image-consistent text-image pairs, for fine-tuning mainstream T2I models. Result: Evaluation of 21 SOTA models shows that higher-order spatial reasoning remains a bottleneck; fine-tuning models such as Stable Diffusion-XL on SpatialT2I yields gains of 4.2% to 5.7% and noticeably more realistic spatial relations. Conclusion: Information-dense prompt design supports stricter evaluation and, through a data-driven route, also improves the spatial intelligence of T2I models, pointing to a data-centric paradigm for spatial reasoning. Abstract: Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail in handling complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks due to their short or information-sparse prompt design. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, covering two key aspects: (1) SpatialGenEval involves 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and corresponding 10 multi-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design goes beyond simple evaluation, we also construct the SpatialT2I dataset. It contains 15,400 text-image pairs with rewritten prompts to ensure image consistency while preserving information density. Fine-tuning results on current foundation models (i.e., Stable Diffusion-XL, Uniworld-V1, OmniGen2) yield consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic effects in spatial relations, highlighting a data-centric paradigm to achieve spatial intelligence in T2I models.

[104] CURVE: Learning Causality-Inspired Invariant Representations for Robust Scene Understanding via Uncertainty-Guided Regularization

Yue Liang,Jiatong Du,Ziyi Yang,Yanjun Huang,Hong Chen

Main category: cs.CV

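TL;DR: This paper proposes the CURVE framework, which uses variational uncertainty modeling and uncertainty-guided structural regularization to suppress environment-specific relations and improve the out-of-distribution generalization of scene graphs.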

Details

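Motivation: Existing scene graph methods tend to overfit to spurious correlations, which leads to poor out-of-distribution generalization. Method: Proposes the CURVE framework, which combines prototype-conditioned debiasing, variational uncertainty modeling, and uncertainty-guided structural regularization to disentangle invariant interaction dynamics from environment-dependent variations and to encourage a sparse, domain-stable topology. Result: In zero-shot transfer and low-data sim-to-real adaptation, CURVE is shown to learn domain-stable sparse topologies and to provide reliable uncertainty estimates that support risk prediction under distribution shift. Conclusion: CURVE mitigates the overfitting of scene graphs to spurious correlations and improves their robustness and interpretability in out-of-distribution settings. Abstract: Scene graphs provide structured abstractions for scene understanding, yet they often overfit to spurious correlations, severely hindering out-of-distribution generalization. To address this limitation, we propose CURVE, a causality-inspired framework that integrates variational uncertainty modeling with uncertainty-guided structural regularization to suppress high-variance, environment-specific relations. Specifically, we apply prototype-conditioned debiasing to disentangle invariant interaction dynamics from environment-dependent variations, promoting a sparse and domain-stable topology. Empirically, we evaluate CURVE in zero-shot transfer and low-data sim-to-real adaptation, verifying its ability to learn domain-stable sparse topologies and provide reliable uncertainty estimates to support risk prediction under distribution shifts.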

[105] RAW-Flow: Advancing RGB-to-RAW Image Reconstruction with Deterministic Latent Flow Matching

Zhen Liu,Diedong Feng,Hai Jiang,Liaoyuan Zeng,Hao Wang,Chaoyu Feng,Lei Lei,Bing Zeng,Shuaicheng Liu

Main category: cs.CV

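TL;DR: This paper proposes the RAW-Flow framework, which casts RGB-to-RAW reconstruction as a deterministic flow matching problem in latent space and combines cross-scale context guidance with a dual-domain latent autoencoder to improve detail and color reconstruction.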

Details

Motivation: Existing learning-based RGB-to-RAW methods mostly use direct regression; because inverse ISP is ill-posed and RGB quantization loses information, they are prone to inconsistent details and color deviation. Method: Proposes RAW-Flow: (1) the task is modeled as deterministic flow matching in latent space; (2) a cross-scale context guidance module fuses multi-level RGB features; (3) a dual-domain latent autoencoder with a feature alignment constraint is designed. Result: Outperforms current state-of-the-art methods on multiple metrics and in visual quality. Conclusion: A deterministic latent transport framework viewed from a generative perspective (RAW-Flow) recovers the structural details and color information of RAW images more accurately and offers a new paradigm for inverse ISP tasks. Abstract: RGB-to-RAW reconstruction, or the reverse modeling of a camera Image Signal Processing (ISP) pipeline, aims to recover high-fidelity RAW data from RGB images. Despite notable progress, existing learning-based methods typically treat this task as a direct regression objective and struggle with detail inconsistency and color deviation, due to the ill-posed nature of inverse ISP and the inherent information loss in quantized RGB images. To address these limitations, we pioneer a generative perspective by reformulating RGB-to-RAW reconstruction as a deterministic latent transport problem and introduce a novel framework named RAW-Flow, which leverages flow matching to learn a deterministic vector field in latent space, to effectively bridge the gap between RGB and RAW representations and enable accurate reconstruction of structural details and color information. To further enhance latent transport, we introduce a cross-scale context guidance module that injects hierarchical RGB features into the flow estimation process. Moreover, we design a dual-domain latent autoencoder with a feature alignment constraint to support the proposed latent transport framework, which jointly encodes RGB and RAW inputs while promoting stable training and high-fidelity reconstruction. Extensive experiments demonstrate that RAW-Flow outperforms state-of-the-art approaches both quantitatively and visually.

[106] Dual-Modality IoT Framework for Integrated Access Control and Environmental Safety Monitoring with Real-Time Cloud Analytics

Abdul Hasib,A. S. M. Ahsanul Sarkar Akib,Nihal Das Ankur,Anish Giri

Main category: cs.CV

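TL;DR: This paper proposes a dual-modality IoT framework for joint monitoring of physical security and environmental safety. Built on ESP32 and a cloud architecture, it integrates RFID access control with multi-sensor environmental monitoring and delivers an accurate, low-latency, robust, and low-cost (about $48) solution for smart infrastructure management.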

Details

Motivation: Traditional physical security systems and environmental safety monitoring systems operate independently, which causes operational inefficiency, delayed emergency response, and high management complexity; an integrated solution is needed. Method: Builds an IoT framework with two subsystems: Subsystem 1 uses RFID authentication, servo-actuated gate control, and real-time Google Sheets logging; Subsystem 2 integrates flame detection, water flow measurement, LCD status display, and personnel identification. Both rely on ESP32 edge processing and wireless communication and are merged through a unified cloud architecture; intelligent local caching improves robustness during network outages. Result: A 45-day experiment shows 99.2% RFID authentication accuracy (0.82-second average response), 98.5% flame detection reliability (within 5 meters), and a 99.8% cloud logging success rate; the system keeps running during network outages; total cost is 5,400 BDT (about $48), 82% lower than commercial integrated solutions. Conclusion: The framework shows that careful architectural design and component optimization can deliver professional-grade performance while improving cost-effectiveness and ease of deployment, offering a practical route to integrating security and safety in smart infrastructure. Abstract: The integration of physical security systems with environmental safety monitoring represents a critical advancement in smart infrastructure management. Traditional approaches maintain these systems as independent silos, creating operational inefficiencies, delayed emergency responses, and increased management complexity. This paper presents a comprehensive dual-modality Internet of Things framework that seamlessly integrates RFID-based access control with multi-sensor environmental safety monitoring through a unified cloud architecture. The system comprises two coordinated subsystems: Subsystem 1 implements RFID authentication with servo-actuated gate control and real-time Google Sheets logging, while Subsystem 2 provides comprehensive safety monitoring incorporating flame detection, water flow measurement, LCD status display, and personnel identification. Both subsystems utilize ESP32 microcontrollers for edge processing and wireless connectivity. Experimental evaluation over 45 days demonstrates exceptional performance metrics: 99.2% RFID authentication accuracy with 0.82-second average response time, 98.5% flame detection reliability within 5-meter range, and 99.8% cloud data logging success rate. The system maintains operational integrity during network disruptions through intelligent local caching mechanisms and achieves total implementation cost of 5,400 BDT (approximately $48), representing an 82% reduction compared to commercial integrated solutions. This research establishes a practical framework for synergistic security-safety integration, demonstrating that professional-grade performance can be achieved through careful architectural design and component optimization while maintaining exceptional cost-effectiveness and accessibility for diverse application scenarios.

[107] RepSFNet: A Single Fusion Network with Structural Reparameterization for Crowd Counting

Mas Nurul Achmadiah,Chi-Chia Sun,Wen-Kai Kuo,Jun-Wei Hsieh

Main category: cs.CV

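TL;DR: RepSFNet is a lightweight single fusion network that uses reparameterized large kernels and multi-scale feature fusion to cut computational cost while preserving accuracy, making it suitable for crowd counting in real-time edge scenarios.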

Details

Motivation: Crowd counting in variable-density scenes is hard because of scale variation, occlusion, and the high computational cost of existing models. Method: Proposes RepSFNet: a RepLK-ViT backbone extracts multi-scale features; ASPP and CAN are fused into a density-adaptive context modeling module; Concatenate Fusion preserves spatial resolution; attention mechanisms and multi-branch structures are dropped to reduce parameters and computation; training combines MSE with an optimal transport loss. Result: Achieves competitive counting accuracy on the ShanghaiTech, NWPU, and UCF-QNRF datasets, with inference latency up to 34% lower than recent methods, which suits real-time, low-power edge deployment. Conclusion: RepSFNet balances accuracy and efficiency well and shows that a lightweight single-path design is effective and practical for crowd counting. Abstract: Crowd counting remains challenging in variable-density scenes due to scale variations, occlusions, and the high computational cost of existing models. To address these issues, we propose RepSFNet (Reparameterized Single Fusion Network), a lightweight architecture designed for accurate and real-time crowd estimation. RepSFNet leverages a RepLK-ViT backbone with large reparameterized kernels for efficient multi-scale feature extraction. It further integrates a Feature Fusion module combining Atrous Spatial Pyramid Pooling (ASPP) and Context-Aware Network (CAN) to achieve robust, density-adaptive context modeling. A Concatenate Fusion module is employed to preserve spatial resolution and generate high-quality density maps. By avoiding attention mechanisms and multi-branch designs, RepSFNet significantly reduces parameters and computational complexity. The training objective combines Mean Squared Error and Optimal Transport loss to improve both count accuracy and spatial distribution alignment. Experiments conducted on ShanghaiTech, NWPU, and UCF-QNRF datasets demonstrate that RepSFNet achieves competitive accuracy while reducing inference latency by up to 34 percent compared to recent state-of-the-art methods, making it suitable for real-time and low-power edge computing applications.
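
Structural reparameterization in general rests on the linearity of convolution: two parallel kernels applied to the same input can be merged into one kernel after training. The sketch below shows that identity for a large kernel plus a smaller parallel kernel. It is a generic illustration with hypothetical function names; the abstract does not state which branches RepSFNet's backbone actually merges.

```python
def correlate_same(image, kernel):
    """2D cross-correlation with zero padding; output has the input's shape."""
    height, width = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < height and 0 <= cc < width:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


def merge_kernels(large, small):
    """Zero-pad `small` to the size of `large` (both odd, square) and add them."""
    offset = (len(large) - len(small)) // 2
    merged = [row[:] for row in large]
    for i, row in enumerate(small):
        for j, value in enumerate(row):
            merged[i + offset][j + offset] += value
    return merged
```

Running the two branches separately and summing their outputs gives the same result as one pass with the merged kernel, so the extra branch costs nothing at inference time.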

[108] HINT: Hierarchical Interaction Modeling for Autoregressive Multi-Human Motion Generation

Mengge Liu,Yan Di,Gu Wang,Yun Qu,Dekai Zhu,Yanyan Li,Xiangyang Ji

Main category: cs.CV

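TL;DR: This paper proposes HINT, the first autoregressive framework for multi-human motion generation. It combines hierarchical interaction modeling with a diffusion model, supports a variable number of people and long text input, and reaches an FID of 3.100 on the InterHuman dataset, well ahead of the previous SOTA.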

Details

Motivation: Existing offline methods struggle with long or variable-length text input and with a variable number of people in multi-human motion generation; autoregressive modeling is needed for flexible, coherent temporal generation. Method: HINT uses a disentangled motion representation in a canonicalized latent space, separating individual motion semantics from inter-person interactions; a sliding-window strategy fuses local within-window conditions with global cross-window conditions for efficient online generation and long-range consistency. Result: On public benchmarks such as InterHuman, HINT matches strong offline models and surpasses autoregressive baselines; on InterHuman its FID is 3.100, a large improvement over the previous SOTA (5.154). Conclusion: HINT shows that autoregression combined with hierarchical interaction modeling is effective for multi-human motion generation, addressing variable participant counts, variable text, and long-sequence coherence, and offers a new paradigm for the task. Abstract: Text-driven multi-human motion generation with complex interactions remains a challenging problem. Despite progress in performance, existing offline methods, which generate fixed-length motions with a fixed number of agents, are inherently limited in handling long or variable text and varying agent counts. These limitations naturally encourage autoregressive formulations, which predict future motions step by step conditioned on all past trajectories and current text guidance. In this work, we introduce HINT, the first autoregressive framework for multi-human motion generation with Hierarchical INTeraction modeling in diffusion. First, HINT leverages a disentangled motion representation within a canonicalized latent space, decoupling local motion semantics from inter-person interactions. This design facilitates direct adaptation to varying numbers of human participants without requiring additional refinement. Second, HINT adopts a sliding-window strategy for efficient online generation, and aggregates local within-window and global cross-window conditions to capture past human history, inter-person dependencies, and align with text guidance. This strategy not only enables fine-grained interaction modeling within each window but also preserves long-horizon coherence across the entire sequence. Extensive experiments on public benchmarks demonstrate that HINT matches the performance of strong offline models and surpasses autoregressive baselines. Notably, on InterHuman, HINT achieves an FID of 3.100, significantly improving over the previous state-of-the-art score of 5.154.

[109] Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

Yuhao Sun,Chengyi Cai,Jiacheng Zhang,Zesheng Ye,Xingliang Yuan,Feng Liu

Main category: cs.CV

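TL;DR: This paper proposes BiFTA, which improves fine-grained text-visual alignment through view refinement (removing redundant image patches with high IoU) and description refinement (removing redundant text descriptions with high cosine similarity), thereby strengthening the zero-shot performance of models such as CLIP.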

Details

Motivation: In existing fine-grained text-visual alignment methods, localized image patches and text descriptions often contain redundant information, which weakens alignment. Method: Proposes the bi-refinement (BiFTA) framework: (1) view refinement removes redundant image patches based on IoU; (2) description refinement removes redundant text descriptions based on cosine similarity. Result: On 6 benchmark datasets, BiFTA markedly improves zero-shot classification for both ViT-based and ResNet-based CLIP. Conclusion: Removing redundant information in the image and text modalities is essential for fine-grained alignment, and BiFTA demonstrates that the strategy is effective and general. Abstract: Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives: View Refinement and Description Refinement, termed Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity to remove redundant information in visual-text alignment.
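
Both refinements amount to dropping items that are too similar to ones already kept, with IoU as the similarity for image patches and cosine similarity for description embeddings. The following is a minimal sketch under stated assumptions: the greedy keep-first order, the threshold values, and all function names are hypothetical, since the abstract does not specify the selection procedure.

```python
import math


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_filter(items, similarity, threshold):
    """Keep items in order, dropping any that is too similar to a kept item."""
    kept = []
    for item in items:
        if all(similarity(item, other) <= threshold for other in kept):
            kept.append(item)
    return kept
```

Calling `greedy_filter(boxes, iou, 0.5)` plays the role of view refinement and `greedy_filter(embeddings, cosine, 0.9)` that of description refinement; the thresholds here are placeholders.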

[110] Quartet of Diffusions: Structure-Aware Point Cloud Generation through Part and Symmetry Guidance

Chenliang Zhou,Fangcheng Zhong,Weihao Xia,Albert Miao,Canberk Baykal,Cengiz Oztireli

Main category: cs.CV

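TL;DR: This paper proposes Quartet of Diffusions, a structure-aware point cloud generation framework in which four coordinated diffusion models separately model global shape latents, symmetries, semantic parts, and their spatial assembly. It explicitly supports part composition and symmetry constraints and generates high-quality, diverse, structurally consistent 3D point clouds.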

Details

Motivation: Existing methods either treat shape generation as a holistic process or support only part composition; they lack joint modeling of, and explicit constraints on, symmetry and part structure priors. Method: Designs four coordinated diffusion models that learn the distributions of global shape latents, symmetries, semantic parts, and the spatial assembly of parts; introduces a central global latent to reinforce structural coherence across parts; the pipeline is disentangled and interpretable and supports fine-grained attribute control. Result: Reaches SOTA performance on multiple metrics; is the first to integrate and enforce symmetry and part priors throughout 3D point cloud generation. Conclusion: Through a structurally disentangled diffusion modeling paradigm, Quartet of Diffusions improves the structure, controllability, and quality of generated point clouds and offers a new paradigm for geometric generation. Abstract: We introduce the Quartet of Diffusions, a structure-aware point cloud generation framework that explicitly models part composition and symmetry. Unlike prior methods that treat shape generation as a holistic process or only support part composition, our approach leverages four coordinated diffusion models to learn distributions of global shape latents, symmetries, semantic parts, and their spatial assembly. This structured pipeline ensures guaranteed symmetry, coherent part placement, and diverse, high-quality outputs. By disentangling the generative process into interpretable components, our method supports fine-grained control over shape attributes, enabling targeted manipulation of individual parts while preserving global consistency. A central global latent further reinforces structural coherence across assembled parts. Our experiments show that the Quartet achieves state-of-the-art performance. To the best of our knowledge, this is the first 3D point cloud generation framework that fully integrates and enforces both symmetry and part priors throughout the generative process.

[111] Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding

Kun Yin,Yunfei Wu,Bing Liu,Zhongpeng Cai,Xiaotian Li,Huang Chen,Xin Li,Haoyu Cao,Yinsong Liu,Deqiang Jiang,Xing Sun,Yunsheng Wu,Qianyu Li,Antai Guo,Yanzhen Liao,Yanqiu Qu,Haodong Lin,Chengxu He,Shuangyin Liu

Main category: cs.CV

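TL;DR: Youtu-Parsing is an efficient, general-purpose document parsing model that pairs a dynamic-resolution ViT with a prompt-guided lightweight language model. Token-parallel and query-parallel decoding strategies give large speedups (up to 11×), and the model reaches SOTA performance on OmniDocBench and olmOCR-bench.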

Details

Motivation: Improve the efficiency and generalization of document content extraction, especially for complex cases such as highly structured documents (for example tables), multilingual text, handwriting, and rare characters. Method: Uses a native ViT dynamic-resolution visual encoder to extract shared document features, together with the prompt-guided Youtu-LLM-2B for layout analysis and region-prompted decoding; proposes a dual-parallel decoding strategy with token parallelism (up to 64 candidate tokens per step) and query parallelism (content predicted concurrently for up to 5 regions). Result: Reaches SOTA on OmniDocBench and olmOCR-bench; gives a 5–11× speedup over traditional autoregressive decoding from token parallelism and an additional 2× from query parallelism; supports text, formulas, tables, charts, seals, hierarchical structures, and other elements; is robust to rare characters, multilingual text, and handwritten content. Conclusion: Youtu-Parsing combines high performance, robustness, and practicality, and provides a scalable, industrial-grade parsing solution for large-scale document intelligence applications. Abstract: This paper presents Youtu-Parsing, an efficient and versatile document parsing model designed for high-performance content extraction. The architecture employs a native Vision Transformer (ViT) featuring a dynamic-resolution visual encoder to extract shared document features, coupled with a prompt-guided Youtu-LLM-2B language model for layout analysis and region-prompted decoding. Leveraging this decoupled and feature-reusable framework, we introduce a high-parallelism decoding strategy comprising two core components: token parallelism and query parallelism. The token parallelism strategy concurrently generates up to 64 candidate tokens per inference step, which are subsequently validated through a verification mechanism. This approach yields a 5–11× speedup over traditional autoregressive decoding and is particularly well-suited for highly structured scenarios, such as table recognition. To further exploit the advantages of region-prompted decoding, the query parallelism strategy enables simultaneous content prediction for multiple bounding boxes (up to five), providing an additional 2× acceleration while maintaining output quality equivalent to standard decoding. Youtu-Parsing encompasses a diverse range of document elements, including text, formulas, tables, charts, seals, and hierarchical structures. Furthermore, the model exhibits strong robustness when handling rare characters, multilingual text, and handwritten content. Extensive evaluations demonstrate that Youtu-Parsing achieves state-of-the-art (SOTA) performance on both the OmniDocBench and olmOCR-bench benchmarks. Overall, Youtu-Parsing demonstrates significant experimental value and practical utility for large-scale document intelligence applications.

[112] MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models

Wenbo Xu,Wei Lu,Xiangyang Luo,Jiantao Zhou

Main category: cs.CV

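TL;DR: This paper proposes MARE, which uses multimodal alignment and reinforcement learning to improve the accuracy and explainability of vision-language models in Deepfake detection.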

Details

Motivation: The rapid progress of generative models poses new challenges for Deepfake detection; existing methods focus mainly on classification or spatial localization and lack explainability and reliability. Method: Proposes the MARE framework, which combines multimodal alignment with comprehensive reward functions built on reinforcement learning from human feedback (RLHF), and introduces a forgery disentanglement module to extract forgery traces from high-level facial semantics. Result: In quantitative and qualitative evaluations of the generated reasoning content, MARE reaches state-of-the-art accuracy and reliability. Conclusion: MARE improves the accuracy, explainability, and trustworthiness of vision-language models in Deepfake detection and offers a new approach to emerging generative forgeries. Abstract: Deepfake detection is a widely researched topic that is crucial for combating the spread of malicious content, with existing methods mainly modeling the problem as classification or spatial localization. The rapid advancements in generative models impose new demands on Deepfake detection. In this paper, we propose multimodal alignment and reinforcement for explainable Deepfake detection via vision-language models, termed MARE, which aims to enhance the accuracy and reliability of Vision-Language Models (VLMs) in Deepfake detection and reasoning. Specifically, MARE designs comprehensive reward functions, incorporating reinforcement learning from human feedback (RLHF), to incentivize the generation of text-spatially aligned reasoning content that adheres to human preferences. Besides, MARE introduces a forgery disentanglement module to capture intrinsic forgery traces from high-level facial semantics, thereby improving its authenticity detection capability. We conduct thorough evaluations on the reasoning content generated by MARE. Both quantitative and qualitative experimental results demonstrate that MARE achieves state-of-the-art performance in terms of accuracy and reliability.

[113] Exploiting the Final Component of Generator Architectures for AI-Generated Image Detection

Yanzhu Liu,Xiao Liu,Yuexuan Wang,Mondal Soumik

Main category: cs.CV

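TL;DR: This paper proposes training a detector on real images that have been "contaminated" with a generator's final component, which substantially improves generalization to images from unseen generators.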

Details

Motivation: Existing deepfake detectors generalize poorly to unknown image generators, yet many generators from different paradigms share similar final architectural components, which opens the door to a general detection method. Method: "Contaminates" real images with the generator's final component to build positive and negative sample pairs, and fine-tunes a detector on the DINOv3 backbone; also builds a taxonomy of generators based on their final components, covering 21 widely used generators. Result: Using only 100 samples from each of three categories, the method reaches an average accuracy of 98.83% across 22 test sets from unseen generators. Conclusion: Contamination based on a generator's final component improves detector generalization to unknown generators and provides a new approach and a practical framework for AI-generated image detection. Abstract: With the rapid proliferation of powerful image generators, accurate detection of AI-generated images has become essential for maintaining a trustworthy online environment. However, existing deepfake detectors often generalize poorly to images produced by unseen generators. Notably, despite being trained under vastly different paradigms, such as diffusion or autoregressive modeling, many modern image generators share common final architectural components that serve as the last stage for converting intermediate representations into images. Motivated by this insight, we propose to "contaminate" real images using the generator's final component and train a detector to distinguish them from the original real images. We further introduce a taxonomy based on generators' final components and categorize 21 widely used generators accordingly, enabling a comprehensive investigation of our method's generalization capability. Using only 100 samples from each of three representative categories, our detector, fine-tuned on the DINOv3 backbone, achieves an average accuracy of 98.83% across 22 testing sets from unseen generators.

[114] Efficient Autoregressive Video Diffusion with Dummy Head

Hang Guo,Zhaoyang Jia,Jiahao Li,Bin Li,Yuanhao Cai,Jiangshan Wang,Yawei Li,Yan Lu

Main category: cs.CV

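TL;DR: This paper proposes Dummy Forcing, which controls how much historical-frame context each attention head can access in order to cut redundant computation. Without additional training it gives up to a 2.0× speedup, generating video at 24.3 FPS with less than 0.5% quality loss.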

Details

Motivation: Multi-head self-attention in autoregressive video diffusion models under-utilizes historical frames: about 25% of attention heads attend almost exclusively to the current frame, which is redundant computation. Method: Proposes Dummy Forcing, which comprises heterogeneous memory allocation (reducing head-wise context redundancy), dynamic head programming (adaptively classifying head types), and context packing (more aggressive KV cache compression). Result: Without additional training, gives up to a 2.0× speedup over the baseline and supports video generation at 24.3 FPS with less than 0.5% quality drop. Conclusion: Dummy Forcing is a simple, effective method that substantially improves the inference efficiency of autoregressive video diffusion models while preserving generation quality. Abstract: The autoregressive video diffusion model has recently gained considerable research interest due to its causal modeling and iterative denoising. In this work, we identify that the multi-head self-attention in these models under-utilizes historical frames: approximately 25% of heads attend almost exclusively to the current frame, and discarding their KV caches incurs only minor performance degradation. Building upon this, we propose Dummy Forcing, a simple yet effective method to control context accessibility across different heads. Specifically, the proposed heterogeneous memory allocation reduces head-wise context redundancy, accompanied by dynamic head programming to adaptively classify head types. Moreover, we develop a context packing technique to achieve more aggressive cache compression. Without additional training, our Dummy Forcing delivers up to 2.0× speedup over the baseline, supporting video generation at 24.3 FPS with less than 0.5% quality drop. Project page is available at https://csguoh.github.io/project/DummyForcing/.
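
The observation that some heads attend almost only to the current frame suggests a simple diagnostic: measure, per head, how much attention mass falls on keys from the current frame. The sketch below is hypothetical throughout (function names, data layout, and the 0.95 threshold are assumptions) and is not the paper's dynamic head programming, whose details the abstract does not give.

```python
def current_frame_mass(attn_row, key_frames, current_frame):
    """Sum of attention weights that land on keys from the current frame."""
    return sum(w for w, f in zip(attn_row, key_frames) if f == current_frame)


def classify_heads(attn, key_frames, current_frame, threshold=0.95):
    """Flag heads whose mean current-frame attention mass reaches `threshold`.

    attn[h] is a list of query rows for head h; each row is a list of
    attention weights over keys and sums to 1. key_frames[k] is the frame
    index of key k. Returns one bool per head (True = candidate for
    dropping its historical KV cache).
    """
    flags = []
    for head in attn:
        mean_mass = sum(current_frame_mass(row, key_frames, current_frame)
                        for row in head) / len(head)
        flags.append(mean_mass >= threshold)
    return flags
```

Heads flagged `True` would need only the current frame's keys and values, which is where the saving in KV cache and attention compute would come from under this reading.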

[115] Comparative evaluation of training strategies using partially labelled datasets for segmentation of white matter hyperintensities and stroke lesions in FLAIR MRI

Jesse Phitidis,Alison Q. Smithard,William N. Whiteley,Joanna M. Wardlaw,Miguel O. Bernabeu,Maria Valdés Hernández

Main category: cs.CV

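TL;DR: This paper studies how to train deep learning models on partially labelled data to segment white matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) jointly, and finds that a pseudolabel strategy works best.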

Details

Motivation: WMH and ISL are hard to tell apart visually on FLAIR MRI and often co-occur, which makes it difficult to develop and validate fully supervised segmentation models; strategies that make effective use of partially labelled data are needed. Method: Explores six strategies for training a joint WMH and ISL segmentation model on partially labelled data, combining private fully and partially labelled data with public partially labelled data (2052 MRI volumes in total), of which 1341 have WMH ground truth and 1152 have ISL ground truth. Result: Several methods make effective use of the partially labelled data to improve performance, with pseudolabels giving the best result. Conclusion: Strategies such as pseudolabelling improve joint WMH and ISL segmentation and offer a practical modeling route for imaging analysis of small vessel disease. Abstract: White matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) are imaging features associated with cerebral small vessel disease (SVD) that are visible on brain magnetic resonance imaging (MRI) scans. The development and validation of deep learning models to segment and differentiate these features is difficult because they visually confound each other in the fluid-attenuated inversion recovery (FLAIR) sequence and often appear in the same subject. We investigated six strategies for training a combined WMH and ISL segmentation model using partially labelled data. We combined privately held fully and partially labelled datasets with publicly available partially labelled datasets to yield a total of 2052 MRI volumes, with 1341 and 1152 containing ground truth annotations for WMH and ISL respectively. We found that several methods were able to effectively leverage the partially labelled data to improve model performance, with the use of pseudolabels yielding the best result.

[116] Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V

Meiqi Wu,Bingze Song,Ruimin Lin,Chen Zhu,Xiaokun Feng,Jiahong Wu,Xiangxiang Chu,Kaiqi Huang

Main category: cs.CV

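TL;DR: This paper proposes a motion prior based on Latent Temporal Discrepancy (LTD) to improve video generation models on dynamic scenes. It adds motion-aware loss weighting to diffusion training, which improves reconstruction quality in moving regions.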

Details

Motivation: Existing video generation models degrade under drastic dynamic changes, mainly because noise disrupts temporal coherence and a static loss function struggles to capture complex dynamics. Method: Introduces Latent Temporal Discrepancy (LTD) as a motion prior that measures frame-to-frame variation in latent space and weights the loss accordingly: regions with higher discrepancy receive larger penalties, while stable regions keep regular optimization. Result: Gains of 3.31% on VBench and 3.58% on VMBench, with clearly better motion quality. Conclusion: LTD is an effective motion-aware training strategy that stabilizes training and strengthens reconstruction of high-frequency dynamics. Abstract: Video generation models have achieved notable progress in static scenarios, yet their performance in motion video generation remains limited, with quality degrading under drastic dynamic changes. This is due to noise disrupting temporal coherence and increasing the difficulty of learning dynamic regions. Unfortunately, existing diffusion models rely on static loss for all scenarios, constraining their ability to capture complex dynamics. To address this issue, we introduce Latent Temporal Discrepancy (LTD) as a motion prior to guide loss weighting. LTD measures frame-to-frame variation in the latent space, assigning larger penalties to regions with higher discrepancy while maintaining regular optimization for stable regions. This motion-aware strategy stabilizes training and enables the model to better reconstruct high-frequency dynamics. Extensive experiments on the general benchmark VBench and the motion-focused VMBench show consistent gains, with our method outperforming strong baselines by 3.31% on VBench and 3.58% on VMBench, achieving significant improvements in motion quality.
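
The loss-weighting idea can be sketched as follows: compute a per-position frame-to-frame discrepancy in latent space, turn it into weights that are at least 1, and apply them to a squared-error loss. This is a minimal sketch under stated assumptions, not the paper's formulation: the mean absolute difference, the `1 + alpha * d / max(d)` weighting, and all names are hypothetical.

```python
def latent_temporal_discrepancy(latents):
    """Mean absolute frame-to-frame change per latent position.

    latents[t][i] is the latent value at frame t, position i.
    """
    steps = len(latents) - 1
    size = len(latents[0])
    return [sum(abs(latents[t + 1][i] - latents[t][i]) for t in range(steps)) / steps
            for i in range(size)]


def ltd_weights(ltd, alpha=1.0):
    """Weights 1 + alpha * normalized discrepancy; static positions keep weight 1."""
    peak = max(ltd)
    if peak == 0.0:
        return [1.0] * len(ltd)
    return [1.0 + alpha * d / peak for d in ltd]


def weighted_mse(pred, target, weights):
    """Mean of weight * squared error over positions."""
    return sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, weights)) / len(pred)
```

Positions that never change across frames keep the ordinary loss, while the fastest-changing position is penalized `1 + alpha` times as heavily.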

[117] Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits

Zelong Sun,Jiahui Wu,Ying Ba,Dong Jing,Zhiwu Lu

Main category: cs.CV

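TL;DR: This paper introduces Portrait Collection Generation (PCG), a new task that generates coherent portrait collections by editing a reference portrait through natural language instructions, and builds CHEESE, the first large-scale PCG dataset, together with the SCheese generation framework.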

Details

Motivation: The rise of social platforms has increased demand for high-quality, diverse portrait collections, but existing methods struggle to handle complex multi-attribute edits while preserving high-fidelity detail. Method: The CHEESE dataset is built with a large vision-language model pipeline and inversion-based verification; the SCheese framework combines text-guided generation, an adaptive feature fusion mechanism, and ConsistencyNet to balance identity consistency and detail fidelity. Result: CHEESE is the first large-scale PCG dataset (24K portrait collections, 573K samples); SCheese achieves state-of-the-art performance on PCG. Conclusion: PCG is a practically meaningful new task, and the CHEESE dataset and SCheese framework provide a solid foundation and an effective solution for it. Abstract: As social media platforms proliferate, users increasingly demand intuitive ways to create diverse, high-quality portrait collections. In this work, we introduce Portrait Collection Generation (PCG), a novel task that generates coherent portrait collections by editing a reference portrait image through natural language instructions. This task poses two unique challenges to existing methods: (1) complex multi-attribute modifications such as pose, spatial layout, and camera viewpoint; and (2) high-fidelity detail preservation including identity, clothing, and accessories. To address these challenges, we propose CHEESE, the first large-scale PCG dataset containing 24K portrait collections and 573K samples with high-quality modification text annotations, constructed through a Large Vision-Language Model-based pipeline with inversion-based verification. We further propose SCheese, a framework that combines text-guided generation with hierarchical identity and detail preservation. SCheese employs an adaptive feature fusion mechanism to maintain identity consistency, and ConsistencyNet to inject fine-grained features for detail consistency. Comprehensive experiments validate the effectiveness of CHEESE in advancing PCG, with SCheese achieving state-of-the-art performance.

[118] Context Tokens are Anchors: Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective

Qiyan Zhao,Xiaofeng Zhang,Shuochen Chang,Qianyu Chen,Xiaosong Yuan,Xuhang Chen,Luoqi Liu,Jiajun Zhang,Xu-Yao Zhang,Da-Han Wang

Main category: cs.CV

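TL;DR: This paper proposes CoTA, which strengthens attention on context tokens and adds a penalty term to the decoding confidence score, mitigating the repetitive generation (the Repeat Curse) that cache mechanisms cause in diffusion-based multimodal large language models (dMLLMs).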

Details

Motivation: Diffusion-based multimodal large language models (dMLLMs) rely on caching to accelerate inference, but caching tends to trigger repetitive text generation (the Repeat Curse); resolving it requires a closer understanding of the underlying information flow. Method: Repetition is analyzed from an information-flow perspective, showing that semantic anchoring by context tokens and entropy convergence are closely tied to repetition; on this basis CoTA is proposed: enhanced attention on context tokens plus a confidence penalty during decoding for outputs driven by uncertain context tokens. Result: CoTA significantly alleviates repetition across multiple tasks, improves general performance, and is plug-and-play. Conclusion: Repetition stems from disrupted information flow in context tokens and from their entropy failing to converge in deeper layers; CoTA addresses it by preserving information flow and regulating decoding confidence. Abstract: Recent diffusion-based Multimodal Large Language Models (dMLLMs) suffer from high inference latency and therefore rely on caching techniques to accelerate decoding. However, the application of cache mechanisms often introduces undesirable repetitive text generation, a phenomenon we term the Repeat Curse. To better investigate the underlying mechanism behind this issue, we analyze repetition generation through the lens of information flow. Our work reveals three key findings: (1) context tokens aggregate semantic information as anchors and guide the final predictions; (2) as information propagates across layers, the entropy of context tokens converges in deeper layers, reflecting the model's growing prediction certainty; (3) repetition is typically linked to disruptions in the information flow of context tokens and to the inability of their entropy to converge in deeper layers. Based on these insights, we present CoTA, a plug-and-play method for mitigating repetition. CoTA enhances the attention of context tokens to preserve intrinsic information flow patterns, while introducing a penalty term to the confidence score during decoding to avoid outputs driven by uncertain context tokens. With extensive experiments, CoTA demonstrates significant effectiveness in alleviating repetition and achieves consistent performance improvements on general tasks. Code is available at https://github.com/ErikZ719/CoTA

[119] AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

Matic Fučka,Vitjan Zavrtanik,Danijel Skočaj

Main category: cs.CV

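TL;DR: This paper proposes AnomalyVFM, a framework that uses three-stage synthetic data generation and a parameter-efficient adaptation mechanism (low-rank feature adapters plus a confidence-weighted pixel loss) to substantially improve purely vision foundation models (e.g., DINOv2, RADIO) on zero-shot anomaly detection, reaching an image-level AUROC of 94.1% and surpassing the previous state of the art by 3.3 percentage points.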

Details

Motivation: Zero-shot anomaly detection methods built on vision foundation models (VFMs) lag behind those built on VLMs such as CLIP, mainly because auxiliary anomaly detection datasets lack diversity and VFM adaptation strategies are too shallow. Method: The AnomalyVFM framework combines (1) a robust three-stage synthetic anomaly data generation scheme and (2) a parameter-efficient adaptation mechanism with low-rank feature adapters and a confidence-weighted pixel loss. Result: With RADIO as the backbone, AnomalyVFM reaches an average image-level AUROC of 94.1% across 9 datasets, 3.3 percentage points above the previous best method. Conclusion: AnomalyVFM closes the performance gap between VFMs and VLMs in zero-shot anomaly detection, showing that properly adapted purely vision foundation models are strong zero-shot anomaly detectors. Abstract: Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision-language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points. Project Page: https://maticfuc.github.io/anomaly_vfm/
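
The abstract does not describe how its low-rank feature adapters are wired, so the sketch below only illustrates the general idea under a LoRA-style assumption: a frozen feature vector is corrected by a residual that passes through a narrow rank-r bottleneck. The matrix names `down` and `up` and the plain-list representation are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def low_rank_adapter(feature: List[float], down: Matrix, up: Matrix) -> List[float]:
    """Return feature + up @ (down @ feature).

    feature has d entries, `down` is r x d and `up` is d x r with r << d,
    so only 2 * r * d adapter weights are trained while the backbone that
    produced `feature` stays frozen.
    """
    update = matvec(up, matvec(down, feature))
    return [f + u for f, u in zip(feature, update)]
```

Initializing `up` to zeros makes the adapter start as the identity, so training begins from the unmodified backbone features.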

[120] IOTA: Corrective Knowledge-Guided Prompt Learning via Black-White Box Framework

Shaokun Wang,Yifan Yu,Yuhang He,Weili Guan,Yihong Gong

Main category: cs.CV

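TL;DR: This paper proposes IOTA, a black-white box prompt learning framework that pairs a data-driven Black Box module with a knowledge-driven White Box module; it generates interpretable prompts by contrasting wrong predictions with the right cognition, guiding the model to adapt more accurately to downstream tasks.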

Details

Motivation: Existing parameter-efficient tuning (PET) methods treat the pre-trained model as a black box and overlook its inherent prior knowledge, which limits downstream adaptation. Method: The IOTA framework comprises a Black Box (data-driven) module and a White Box (knowledge-driven) module; the White Box module generates interpretable prompts by contrasting wrong predictions with the right cognition and guides the Black Box module through a corrective knowledge-guided prompt selection strategy. Result: On 12 image classification benchmarks, IOTA outperforms state-of-the-art methods in both few-shot and easy-to-hard adaptation settings. Conclusion: Combining knowledge-driven and data-driven signals markedly improves downstream adaptation of pre-trained models, and IOTA confirms the effectiveness and practicality of corrective knowledge. Abstract: Recently, adapting pre-trained models to downstream tasks has attracted increasing interest. Previous Parameter-Efficient-Tuning (PET) methods regard the pre-trained model as an opaque Black Box model, relying purely on data-driven optimization and underutilizing its inherent prior knowledge. This oversight limits the models' potential for effective downstream task adaptation. To address these issues, we propose a novel black-whIte bOx prompT leArning framework (IOTA), which integrates a data-driven Black Box module with a knowledge-driven White Box module for downstream task adaptation. Specifically, the White Box module derives corrective knowledge by contrasting the wrong predictions with the right cognition. This knowledge is verbalized into interpretable human prompts and leveraged through a corrective knowledge-guided prompt selection strategy to guide the Black Box module toward more accurate predictions. By jointly leveraging knowledge- and data-driven learning signals, IOTA achieves effective downstream task adaptation. Experimental results on 12 image classification benchmarks under few-shot and easy-to-hard adaptation settings demonstrate the effectiveness of corrective knowledge and the superiority of our method over state-of-the-art methods.

[121] Advancing Open-source World Models

Robbyant Team,Zelin Gao,Qiuyu Wang,Yanhong Zeng,Jiapeng Zhu,Ka Leong Cheng,Yixuan Li,Hanlin Wang,Yinghao Xu,Shuailei Ma,Yihang Chen,Jie Liu,Yansong Cheng,Yao Yao,Jiayi Zhu,Yihao Meng,Kecheng Zheng,Qingyan Bai,Jingye Chen,Zehong Shen,Yue Yu,Xing Zhu,Yujun Shen,Hao Ouyang

Main category: cs.CV

TL;DR: LingBot-World is an open-source, high-fidelity world simulator derived from video generation, supporting long-horizon consistency, real-time interactivity, and diverse environments.

Details

Motivation: To bridge the gap between open-source and closed-source world models and enable practical applications in content creation, gaming, and robot learning. Method: Building a world simulator based on video generation technology, with optimizations for fidelity, dynamic robustness, long-term contextual consistency, and low-latency real-time interaction. Result: A top-tier open-sourced world model supporting minute-level horizons, sub-1-second latency at 16fps, and broad environment coverage (realism, scientific, cartoon, etc.). Conclusion: LingBot-World advances open-world modeling by delivering high performance, versatility, and accessibility, fostering community-driven innovation. Abstract: We present LingBot-World, an open-sourced world simulator stemming from video generation. Positioned as a top-tier world model, LingBot-World offers the following features. (1) It maintains high fidelity and robust dynamics in a broad spectrum of environments, including realism, scientific contexts, cartoon styles, and beyond. (2) It enables a minute-level horizon while preserving contextual consistency over time, which is also known as "long-term memory". (3) It supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.

[122] DeepSeek-OCR 2: Visual Causal Flow

Haoran Wei,Yaofeng Sun,Yukun Li

Main category: cs.CV

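TL;DR: This paper presents DeepSeek-OCR 2, built around the new encoder DeepEncoder V2, which dynamically reorders visual tokens according to image semantics to mimic the flexible, semantically coherent scanning of human vision, and explores a new paradigm of achieving 2D image understanding with two cascaded 1D causal reasoning structures.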

Details

Motivation: Conventional VLMs process visual tokens in a fixed raster-scan order, which is at odds with flexible human visual perception driven by semantics and logical structure; for images with complex layouts in particular, human vision exhibits causally informed sequential processing. Method: DeepEncoder V2 is designed with causal reasoning capabilities so that it reorders visual tokens before they are fed to the LLM; 2D image understanding is modelled with two cascaded 1D causal reasoning structures. Result: The feasibility of dynamic visual token reordering is demonstrated, offering a new architectural approach toward genuine 2D reasoning; code and model weights are open-sourced. Conclusion: By imitating human visual cognition, DeepEncoder V2 shows the potential of causally driven token reordering for image understanding and opens a new direction for VLM architecture design. Abstract: We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when fed into LLMs. However, this contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by inherent logical structures. Particularly for images with complex layouts, human vision exhibits causally-informed sequential processing. Inspired by this cognitive mechanism, DeepEncoder V2 is designed to endow the encoder with causal reasoning capabilities, enabling it to intelligently reorder visual tokens prior to LLM-based content interpretation. This work explores a novel paradigm: whether 2D image understanding can be effectively achieved through two cascaded 1D causal reasoning structures, thereby offering a new architectural approach with the potential to achieve genuine 2D reasoning. Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR-2.

[123] DiffVC-RT: Towards Practical Real-Time Diffusion-based Perceptual Neural Video Compression

Wenzhuo Ma,Zhenzhong Chen

Main category: cs.CV

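TL;DR: DiffVC-RT is the first diffusion-based neural video compression framework to achieve real-time inference; with an efficient architecture, explicit and implicit temporal consistency modeling, and an asynchronous parallel decoding pipeline, it substantially reduces bitrate and latency while maintaining high perceptual quality.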

Details

Motivation: To address the key obstacles to practical deployment of diffusion-based neural video compression: severe information loss, prohibitive inference latency, and temporal inconsistency. Method: Three contributions: (1) an efficient and informative model architecture (module replacement and pruning); (2) temporal consistency modeling that combines explicit (Online Temporal Shift Module) and implicit (hybrid consistency constraints) components; (3) an asynchronous and parallel decoding pipeline with mixed half precision (including a Batch-dimension Temporal Shift design). Result: 80.1% bitrate savings in terms of LPIPS over VTM-17.0 on the HEVC dataset; real-time speeds of 206 / 30 fps (encoding / decoding) for 720p video on an NVIDIA H800 GPU. Conclusion: DiffVC-RT moves diffusion models toward practical use in video compression and is an important milestone for this direction. Abstract: The practical deployment of diffusion-based Neural Video Compression (NVC) faces critical challenges, including severe information loss, prohibitive inference latency, and poor temporal consistency. To bridge this gap, we propose DiffVC-RT, the first framework designed to achieve real-time diffusion-based perceptual NVC. First, we introduce an Efficient and Informative Model Architecture. Through strategic module replacements and pruning, this architecture significantly reduces computational complexity while mitigating structural information loss. Second, to address generative flickering artifacts, we propose Explicit and Implicit Consistency Modeling. We enhance temporal consistency by explicitly incorporating a zero-cost Online Temporal Shift Module within the U-Net, complemented by hybrid implicit consistency constraints. Finally, we present an Asynchronous and Parallel Decoding Pipeline incorporating Mixed Half Precision, which enables asynchronous latent decoding and parallel frame reconstruction via a Batch-dimension Temporal Shift design. Experiments show that DiffVC-RT achieves 80.1% bitrate savings in terms of LPIPS over VTM-17.0 on the HEVC dataset with real-time encoding and decoding speeds of 206 / 30 fps for 720p videos on an NVIDIA H800 GPU, marking a significant milestone in diffusion-based video compression.

Shaokun Wang,Weili Guan,Jizhou Han,Jianlong Wu,Yupeng Hu,Liqiang Nie

Main category: cs.CV

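TL;DR: This paper proposes StructAlign, which introduces a simplex Equiangular Tight Frame (ETF) geometric prior and a cross-modal ETF alignment loss to mitigate modality misalignment in text-video retrieval, and designs a cross-modal relation preserving loss to suppress intra-modal feature drift, thereby alleviating catastrophic forgetting in continual learning.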

Details

Motivation: Continual text-to-video retrieval (CTVR) suffers from catastrophic forgetting; the core challenge is feature drift, comprising intra-modal drift and non-cooperative cross-modal drift that causes modality misalignment. Method: StructAlign (1) adopts a simplex Equiangular Tight Frame (ETF) as a unified geometric prior; (2) designs a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes; (3) designs a Cross-modal Relation Preserving loss that uses complementary modalities to maintain cross-modal similarity relations. Result: On multiple benchmark datasets, StructAlign consistently outperforms state-of-the-art continual retrieval methods. Conclusion: By jointly addressing non-cooperative cross-modal drift and intra-modal drift, StructAlign alleviates catastrophic forgetting in CTVR and improves text-video alignment under continual learning. Abstract: Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text-video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method consistently outperforms state-of-the-art continual retrieval approaches.
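
For readers unfamiliar with the simplex ETF geometry mentioned above, here is a minimal sketch of the standard construction: K unit-norm prototypes whose pairwise cosine similarity is exactly -1/(K-1), the most evenly spread arrangement K vectors can have. The sketch omits the optional rotation into a higher-dimensional feature space, and how the paper embeds the prototypes is not stated in the abstract.

```python
import math
from typing import List


def simplex_etf(num_classes: int) -> List[List[float]]:
    """Rows are the K prototypes sqrt(K / (K - 1)) * (e_k - 1/K * ones)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(a: List[float], b: List[float]) -> float:
    """Inner product; equals the cosine here because rows have unit norm."""
    return sum(x * y for x, y in zip(a, b))
```

For K = 4, each prototype has unit norm and any two distinct prototypes have inner product -1/3; an alignment loss would then pull text and video features of a category toward its fixed prototype.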

[125] Person Re-ID in 2025: Supervised, Self-Supervised, and Language-Aligned. What Works?

Lakshman Balasubramanian

Main category: cs.CV

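TL;DR: This paper reviews training paradigms for person re-identification (ReID), evaluates the cross-domain robustness of supervised, self-supervised, and language-aligned models, and analyzes how foundation models such as SigLIP2 improve ReID generalization.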

Details

Motivation: To examine the limited cross-domain generalization of existing ReID models (supervised and foundation models in particular) and identify their strengths and weaknesses, in order to advance more robust and transferable ReID methods. Method: Three training paradigms (supervised, self-supervised, language-aligned) are compared systematically, with cross-domain evaluation across 11 models and 9 datasets and an analysis of the performance differences and their causes. Result: Supervised models excel in-domain but degrade sharply cross-domain; language-aligned models, though not trained specifically for ReID, are markedly more robust cross-domain; foundation models such as SigLIP2 help improve representation transferability. Conclusion: Language-aligned and other foundation models provide more general, transferable visual representations for ReID and are an effective route to better cross-domain generalization; the main weaknesses today are the poor generalization of supervised models and the lack of task adaptation in foundation models. Abstract: Person Re-Identification (ReID) remains a challenging problem in computer vision. This work reviews various training paradigms, evaluates the robustness of state-of-the-art ReID models in cross-domain applications, and examines the role of foundation models in improving generalization through richer, more transferable visual representations. We compare three training paradigms: supervised, self-supervised, and language-aligned models. The study aims to answer the following questions: Can supervised models generalize in cross-domain scenarios? How do foundation models like SigLIP2 perform on ReID tasks? What are the weaknesses of current supervised and foundation models for ReID? We conducted the analysis across 11 models and 9 datasets. Our results show a clear split: supervised models dominate their training domain but crumble on cross-domain data. Language-aligned models, however, show surprising cross-domain robustness on ReID tasks, even though they are not explicitly trained to do so. Code and data available at: https://github.com/moiiai-tech/object-reid-benchmark.

[126] CLEAR-Mamba: Towards Accurate, Adaptive and Trustworthy Multi-Sequence Ophthalmic Angiography Classification

Zhuonan Wang,Wenjie Yan,Wenqiao Zhang,Xiaohui Song,Jian Ma,Ke Yao,Yibo Yu,Beng Chin Ooi

Main category: cs.CV

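TL;DR: This paper proposes the CLEAR-Mamba framework, which uses HaC, an adaptive conditioning layer, and RaP, a reliability-aware prediction scheme, to improve cross-domain generalization and prediction trustworthiness on multi-modality ophthalmic angiography images (FFA/ICGA), and builds a large-scale ophthalmic angiography dataset to validate it.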

Details

Motivation: Existing classification methods for single-modality ophthalmic angiography (FFA/ICGA) are limited by the single modality, subtle lesions, and large inter-device variability, leaving generalization and high-confidence prediction inadequate. Method: Building on MedMamba: (1) HaC, a hypernetwork-based adaptive conditioning layer, dynamically generates parameters from the input feature distribution to improve cross-domain adaptability; (2) RaP, a reliability-aware prediction mechanism based on evidential uncertainty learning, emphasizes low-confidence samples to improve model stability; (3) a large-scale multi-disease ophthalmic angiography dataset covering FFA and ICGA is constructed. Result: CLEAR-Mamba consistently outperforms multiple baselines, including the original MedMamba, on multi-disease classification and reliability-aware prediction, with stronger generalization and prediction reliability. Conclusion: CLEAR-Mamba offers an effective solution that balances generalization and prediction reliability for modality-specific medical image classification. Abstract: Medical image classification is a core task in computer-aided diagnosis (CAD), playing a pivotal role in early disease detection, treatment planning, and patient prognosis assessment. In ophthalmic practice, fluorescein fundus angiography (FFA) and indocyanine green angiography (ICGA) provide hemodynamic and lesion-structural information that conventional fundus photography cannot capture. However, due to the single-modality nature, subtle lesion patterns, and significant inter-device variability, existing methods still face limitations in generalization and high-confidence prediction. To address these challenges, we propose CLEAR-Mamba, an enhanced framework built upon MedMamba with optimizations in both architecture and training strategy. Architecturally, we introduce HaC, a hypernetwork-based adaptive conditioning layer that dynamically generates parameters according to input feature distributions, thereby improving cross-domain adaptability. From a training perspective, we develop RaP, a reliability-aware prediction scheme built upon evidential uncertainty learning, which encourages the model to emphasize low-confidence samples and improves overall stability and reliability. We further construct a large-scale ophthalmic angiography dataset covering both FFA and ICGA modalities, comprising multiple retinal disease categories for model training and evaluation.
Experimental results demonstrate that CLEAR-Mamba consistently outperforms multiple baseline models, including the original MedMamba, across various metrics, showing particular advantages in multi-disease classification and reliability-aware prediction. This study provides an effective solution that balances generalizability and reliability for modality-specific medical image classification tasks.

[127] GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection

Shuguang Zhang,Junhong Lian,Guoxin Yu,Baoxun Xu,Xiang Ao

Main category: cs.CV

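TL;DR: This paper proposes GDCNet, which uses objective image descriptions generated by a multimodal large language model as semantic anchors, computes their semantic and sentiment discrepancies with the original text, and combines these with visual-textual fidelity to detect sarcasm in image-text pairs more robustly.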

Details

Motivation: Existing methods struggle when image and text are loosely related or semantically indirect, and approaches that use LLMs to generate sarcastic cues are prone to noise from subjectivity and diversity. Method: The Generative Discrepancy Comparison Network (GDCNet) uses a multimodal large language model (MLLM) to generate factual image descriptions as stable semantic anchors, computes semantic and sentiment discrepancies between the description and the original text, and measures visual-textual fidelity; a gated module fuses the discrepancy features with the original modality representations. Result: Higher accuracy and robustness on multiple MSD benchmarks, with a new state of the art on MMSD2.0. Conclusion: Modelling cross-modal discrepancy against objective, factual generated content is more effective than directly modelling image-text inconsistency or relying on subjectively generated cues. Abstract: Multimodal sarcasm detection (MSD) aims to identify sarcasm within image-text pairs by modeling semantic incongruities across modalities. Existing methods often exploit cross-modal embedding misalignment to detect inconsistency but struggle when visual and textual content are loosely related or semantically indirect. While recent approaches leverage large language models (LLMs) to generate sarcastic cues, the inherent diversity and subjectivity of these generations often introduce noise. To address these limitations, we propose the Generative Discrepancy Comparison Network (GDCNet). This framework captures cross-modal conflicts by utilizing descriptive, factually grounded image captions generated by Multimodal LLMs (MLLMs) as stable semantic anchors. Specifically, GDCNet computes semantic and sentiment discrepancies between the generated objective description and the original text, alongside measuring visual-textual fidelity. These discrepancy features are then fused with visual and textual representations via a gated module to adaptively balance modality contributions. Extensive experiments on MSD benchmarks demonstrate GDCNet's superior accuracy and robustness, establishing a new state-of-the-art on the MMSD2.0 benchmark.

[128] OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks

Jing Wu,Daphne Barretto,Yiye Chen,Nicholas Gydé,Yanan Jian,Yuhang He,Vibhav Vineet

Main category: cs.CV

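TL;DR: This paper presents OS-Marathon, a benchmark for evaluating computer-use agents (CUAs) on long-horizon, repetitive tasks, and proposes a low-cost method that builds a condensed demonstration from a few examples to teach the workflow logic.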

Details

Motivation: The lack of an evaluation benchmark for long-horizon, repetitive professional tasks holds back the development and assessment of computer-use agents (CUAs). Method: The OS-Marathon benchmark is built with 242 long-horizon, repetitive tasks; a method is proposed that distills the workflow logic from only a few examples into a condensed demonstration, allowing agents to generalize to larger, unseen data collections. Result: Experiments show that long-horizon repetitive tasks are challenging for existing SOTA agents and confirm that the proposed method improves agent generalization and execution efficiency. Conclusion: OS-Marathon fills the gap in evaluating long-horizon repetitive tasks, and the lightweight workflow-learning method offers a viable path for deploying CUAs in real professional settings. Abstract: Long-horizon, repetitive workflows are common in professional settings, such as processing expense reports from receipts and entering student grades from exam papers. These tasks are often tedious for humans since they can extend to extreme lengths proportional to the size of the data to process. However, they are ideal for Computer-Use Agents (CUAs) due to their structured, recurring sub-workflows with logic that can be systematically learned. Identifying the absence of an evaluation benchmark as a primary bottleneck, we establish OS-Marathon, comprising 242 long-horizon, repetitive tasks across 2 domains to evaluate state-of-the-art (SOTA) agents. We then introduce a cost-effective method to construct a condensed demonstration using only few-shot examples to teach agents the underlying workflow logic, enabling them to execute similar workflows effectively on larger, unseen data collections. Extensive experiments demonstrate both the inherent challenges of these tasks and the effectiveness of our proposed method. Project website: https://os-marathon.github.io/.

[129] FD-MAD: Frequency-Domain Residual Analysis for Face Morphing Attack Detection

Diogo J. Paulo,Hugo Proença,João C. Neves

Main category: cs.CV

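TL;DR: This paper proposes a lightweight face morphing attack detection method based on region-aware frequency-domain analysis; it builds residual frequency-domain features and fuses multiple facial regions globally and locally with a Markov Random Field, substantially improving detection in cross-dataset and cross-morph settings.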

Details

Motivation: Existing single-image morphing attack detection (S-MAD) methods generalize poorly across datasets; a detection strategy is needed that requires no trusted reference and is robust and lightweight. Method: (1) A residual frequency domain is introduced, decoupling the signal spectrum from the natural spectral decay to make morphs and bona fide faces more separable; (2) a Markov Random Field fuses frequency-domain evidence from the global face and multiple local facial regions into a consistent decision. Result: An average EER of 1.85% on FRLL-Morph and second place on MAD22 (average EER of 6.12%), with a good BPCER at low APCER, reaching competitive performance using only spectral features. Conclusion: Fourier-domain residual modeling with structured regional fusion is a competitive, lightweight alternative to deep S-MAD architectures. Abstract: Face morphing attacks present a significant threat to face recognition systems used in electronic identity enrolment and border control, particularly in single-image morphing attack detection (S-MAD) scenarios where no trusted reference is available. In spite of the vast amount of research on this problem, morph detection systems struggle in cross-dataset scenarios. To address this problem, we introduce a region-aware frequency-based morph detection strategy that drastically improves over strong baseline methods in challenging cross-dataset and cross-morph settings using a lightweight approach. Having observed the separability of bona fide and morph samples in the frequency domain of different facial parts, our approach (1) introduces the concept of a residual frequency domain, where the frequency content of the signal is decoupled from the natural spectral decay to easily discriminate between morph and bona fide data, and (2) reasons in a global and local manner by combining the evidence from different facial regions in a Markov Random Field, which infers a globally consistent decision. The proposed method, trained exclusively on the synthetic morphing attack detection development dataset (SMDD), is evaluated in challenging cross-dataset and cross-morph settings on FRLL-Morph and MAD22 sets. Our approach achieves an average equal error rate (EER) of 1.85% on FRLL-Morph and ranks second on MAD22 with an average EER of 6.12%, while also obtaining a good bona fide presentation classification error rate (BPCER) at a low attack presentation classification error rate (APCER) using only spectral features.
These findings indicate that Fourier-domain residual modeling with structured regional fusion offers a competitive alternative to deep S-MAD architectures.
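
The abstract does not specify how the natural spectral decay is modelled. As a rough illustration only, the sketch below assumes a power-law decay (a straight line in log-log space, a common model for natural image spectra), fits it by least squares to a radially averaged power spectrum, and returns what is left over. The function name and inputs are hypothetical.

```python
import math
from typing import List, Sequence


def residual_spectrum(freqs: Sequence[float], power: Sequence[float]) -> List[float]:
    """Log-power minus a least-squares power-law fit.

    freqs and power are a radially averaged spectrum (all values > 0).
    Assumed model: log(power) = intercept + slope * log(freq).
    """
    xs = [math.log(f) for f in freqs]
    ys = [math.log(p) for p in power]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    # What remains after removing the smooth decay is the residual.
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]
```

A spectrum that follows the assumed decay exactly yields residuals near zero, so any energy bump introduced at a particular frequency stands out directly in the residual.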

[130] ProSkill: Segment-Level Skill Assessment in Procedural Videos

Michele Mazzamuto,Daniele Di Mauro,Gianpiero Francesca,Giovanni Maria Farinella,Antonino Furnari

Main category: cs.CV

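TL;DR: This paper presents ProSkill, the first benchmark dataset for action-level skill assessment in procedural videos, with both absolute and pairwise skill annotations, and designs a scalable annotation protocol based on a Swiss Tournament scheme and an ELO rating system.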

Details

Motivation: Skill assessment research has concentrated on sports and lacks large-scale, complex datasets for manufacturing and everyday procedural tasks; annotations are also limited (binary or pairwise only), which makes absolute skill levels hard to quantify. Method: The ProSkill dataset is introduced with a new scalable annotation protocol: a Swiss Tournament scheme runs efficient pairwise comparisons, and an ELO system aggregates them into continuous, consistent global absolute skill scores. Result: Current mainstream skill assessment algorithms (ranking-based and pairwise) are benchmarked on ProSkill and all perform suboptimally, confirming the dataset's difficulty and research value. Conclusion: ProSkill fills the gap left by the absence of a high-quality, large-scale, multi-granularity benchmark for skill assessment in procedural videos, providing a foundation and new challenges for future work. Abstract: Skill assessment in procedural videos is crucial for the objective evaluation of human performance in settings such as manufacturing and procedural daily tasks. Current research on skill assessment has predominantly focused on sports and lacks large-scale datasets for complex procedural activities. Existing studies typically involve only a limited number of actions and focus on either pairwise assessments (e.g., A is better than B) or binary labels (e.g., good execution vs needs improvement). In response to these shortcomings, we introduce ProSkill, the first benchmark dataset for action-level skill assessment in procedural tasks. ProSkill provides absolute skill assessment annotations, along with pairwise ones. This is enabled by a novel and scalable annotation protocol that allows for the creation of an absolute skill assessment ranking starting from pairwise assessments. This protocol leverages a Swiss Tournament scheme for efficient pairwise comparisons, which are then aggregated into consistent, continuous global scores using an ELO-based rating system. We use our dataset to benchmark the main state-of-the-art skill assessment algorithms, including both ranking-based and pairwise paradigms. The suboptimal results achieved by the current state-of-the-art highlight the challenges and thus the value of ProSkill in the context of skill assessment for procedural videos. All data and code are available at https://fpv-iplab.github.io/ProSkill/
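
To make the pairwise-to-absolute aggregation concrete, here is a minimal sketch of a standard ELO update together with a simplified Swiss-style pairing (items with neighbouring ratings are compared next). The K-factor of 32 and the 400-point scale are the conventional chess defaults and are assumptions here; the abstract does not give the protocol's actual parameters or pairing rules.

```python
from typing import Dict, List, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that the item rated r_a is judged better than r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings in place after one pairwise judgement."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def swiss_pairs(ratings: Dict[str, float]) -> List[Tuple[str, str]]:
    """Pair neighbours in the current ranking (simplified Swiss round)."""
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    return [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]
```

Pairing similarly rated clips concentrates annotation effort on the most informative comparisons, which is why far fewer than all possible pairs need to be judged.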

Pankhi Kashyap,Mainak Singha,Biplab Banerjee

Main category: cs.CV

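TL;DR: This paper proposes BiMoRS, a lightweight bi-modal prompt learning framework for remote sensing imagery that fuses the textual semantics of image captions with CLIP visual features to improve few-shot and domain generalization performance.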

Details

Motivation: Existing prompt learning methods transfer poorly to remote sensing imagery, whose multi-label scenes, high intra-class variability, and diverse resolutions make it hard to identify dominant semantics and generalize to novel classes. Method: A frozen BLIP-2 generates textual descriptions of remote sensing images; these are tokenized with a BERT tokenizer and fused with CLIP visual features, and a lightweight cross-attention module then produces contextualized prompts without modifying the CLIP backbone. Result: Across four remote sensing datasets and three domain generalization tasks, BiMoRS consistently outperforms strong baselines, by up to 2% on average. Conclusion: BiMoRS shows that textual semantic guidance improves the generalization of prompt learning on remote sensing tasks and offers a new approach to adapting VLMs to the remote sensing domain. Abstract: Prompt learning (PL) has emerged as an effective strategy to adapt vision-language models (VLMs), such as CLIP, for downstream tasks under limited supervision. While PL has demonstrated strong generalization on natural image datasets, its transferability to remote sensing (RS) imagery remains underexplored. RS data present unique challenges, including multi-label scenes, high intra-class variability, and diverse spatial resolutions, that hinder the direct applicability of existing PL methods. In particular, current prompt-based approaches often struggle to identify dominant semantic cues and fail to generalize to novel classes in RS scenarios. To address these challenges, we propose BiMoRS, a lightweight bi-modal prompt learning framework tailored for RS tasks. BiMoRS employs a frozen image captioning model (e.g., BLIP-2) to extract textual semantic summaries from RS images. These captions are tokenized using a BERT tokenizer and fused with high-level visual features from the CLIP encoder. A lightweight cross-attention module then conditions a learnable query prompt on the fused textual-visual representation, yielding contextualized prompts without altering the CLIP backbone. We evaluate BiMoRS on four RS datasets across three domain generalization (DG) tasks and observe consistent performance gains, outperforming strong baselines by up to 2% on average. Codes are available at https://github.com/ipankhi/BiMoRS.

[132] Decoupling Perception and Calibration: Label-Efficient Image Quality Assessment Framework

Xinyue Li,Zhichao Zhang,Zhiming Xu,Shubo Xu,Xiongkuo Min,Yitong Chen,Guangtao Zhai

Main category: cs.CV

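TL;DR: This paper proposes the LEAF framework, which distills the perceptual priors of a multimodal large language model (MLLM) into a lightweight student regressor through knowledge distillation, achieving high-quality image quality assessment with only a small number of MOS annotations.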

Details

Motivation: Existing MLLM-based image quality assessment methods are computationally expensive and depend on large numbers of MOS annotations; the authors argue that the bottleneck is MOS scale calibration rather than quality perception itself. Method: In the LEAF framework, an MLLM teacher provides point-wise judgments and pair-wise preferences as dense supervision and estimates decision reliability; the student learns the teacher's quality perception patterns through joint distillation and is calibrated on a small MOS subset. Result: On user-generated and AI-generated IQA benchmarks, LEAF significantly reduces the need for human annotation while maintaining strong MOS-aligned correlations. Conclusion: LEAF delivers label-efficient, lightweight image quality assessment that is practical under limited annotation budgets. Abstract: Recent multimodal large language models (MLLMs) have demonstrated strong capabilities in image quality assessment (IQA) tasks. However, adapting such large-scale models is computationally expensive and still relies on substantial Mean Opinion Score (MOS) annotations. We argue that for MLLM-based IQA, the core bottleneck lies not in the quality perception capacity of MLLMs, but in MOS scale calibration. Therefore, we propose LEAF, a Label-Efficient Image Quality Assessment Framework that distills perceptual quality priors from an MLLM teacher into a lightweight student regressor, enabling MOS calibration with minimal human supervision. Specifically, the teacher conducts dense supervision through point-wise judgments and pair-wise preferences, with an estimate of decision reliability. Guided by these signals, the student learns the teacher's quality perception patterns through joint distillation and is calibrated on a small MOS subset to align with human annotations. Experiments on both user-generated and AI-generated IQA benchmarks demonstrate that our method significantly reduces the need for human annotations while maintaining strong MOS-aligned correlations, making lightweight IQA practical under limited annotation budgets.

[133] LEMON: How Well Do MLLMs Perform Temporal Multimodal Understanding on Instructional Videos?

Zhuang Yu,Lei Shen,Jing Zhao,Shiliang Sun

Main category: cs.CV

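TL;DR: This paper presents LEMON, a multimodal understanding benchmark for STEM lecture videos that emphasizes long-horizon reasoning and cross-modal integration; it contains 2,277 video segments and 4,181 high-quality QA pairs and shows that current MLLMs still fall well short on tasks such as temporal reasoning and instructional prediction.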

Details

Motivation: The performance of MLLMs on long-form, knowledge-intensive, temporally structured educational content has not been adequately explored, and a dedicated benchmark is needed to advance them in educational settings. Method: The LEMON benchmark covers 5 disciplines, 29 courses, and 2,277 video segments averaging 196.1 seconds, annotated with 4,181 QA pairs (multiple-choice and open-ended) across six major tasks and twelve subtasks; it emphasizes semantic richness, tightly coupled modalities, explicit temporal and pedagogical structure, and contextually linked multi-turn questioning. Result: Experiments show that current SOTA models such as GPT-4o perform poorly on temporal reasoning and instructional prediction, with substantial performance gaps across tasks. Conclusion: LEMON is an extensible and challenging new benchmark that should help advance multimodal perception, reasoning, and generation on long-form instructional content. Abstract: Recent multimodal large language models (MLLMs) have shown remarkable progress across vision, audio, and language tasks, yet their performance on long-form, knowledge-intensive, and temporally structured educational content remains largely unexplored. To bridge this gap, we introduce LEMON, a Lecture-based Evaluation benchmark for MultimOdal uNderstanding, focusing on STEM lecture videos that require long-horizon reasoning and cross-modal integration. LEMON comprises 2,277 video segments spanning 5 disciplines and 29 courses, with an average duration of 196.1 seconds, yielding 4,181 high-quality QA pairs, including 3,413 multiple-choice and 768 open-ended questions. Distinct from existing video benchmarks, LEMON features: (1) semantic richness and disciplinary density, (2) tightly coupled video-audio-text modalities, (3) explicit temporal and pedagogical structure, and (4) contextually linked multi-turn questioning. It further encompasses six major tasks and twelve subtasks, covering the full cognitive spectrum from perception to reasoning and then to generation. Comprehensive experiments reveal substantial performance gaps across tasks, highlighting that even state-of-the-art MLLMs like GPT-4o struggle with temporal reasoning and instructional prediction. We expect LEMON to serve as an extensible and challenging benchmark for advancing multimodal perception, reasoning, and generation in long-form instructional content.

[134] Li-ViP3D++: Query-Gated Deformable Camera-LiDAR Fusion for End-to-End Perception and Trajectory Prediction

Matej Halinkovic,Nina Masarykova,Alexey Vinel,Marek Galinski

Main category: cs.CV

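TL;DR: This paper proposes Li-ViP3D++, a query-based multimodal perception-and-prediction (PnP) framework. Its Query-Gated Deformable Fusion (QGDF) fuses RGB images and LiDAR data in query space in a fully differentiable way, improving detection and behavior prediction on nuScenes without sacrificing inference speed.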

Details

Motivation: Existing end-to-end perception and prediction models rely on heuristic alignment and discrete selection when fusing camera and LiDAR data, which limits information use and introduces bias; modular pipelines suffer from error accumulation. Method: Propose Query-Gated Deformable Fusion (QGDF), which has three parts: (i) masked attention across cameras and feature levels to aggregate image evidence; (ii) fully differentiable BEV LiDAR sampling with learned per-query offsets; (iii) query-conditioned gating that adaptively weights visual and geometric cues. The overall model is a single end-to-end network that jointly optimizes detection, tracking, and multi-hypothesis trajectory forecasting. Result: On nuScenes, EPA reaches 0.335 and mAP reaches 0.502, the false-positive ratio drops to 0.147, and inference takes 139.82 ms, faster than the previous Li-ViP3D (145.91 ms). Conclusion: Fully differentiable camera-LiDAR fusion in query space can improve the robustness and practicality of end-to-end PnP models without sacrificing deployment efficiency. Abstract: End-to-end perception and trajectory prediction from raw sensor data is one of the key capabilities for autonomous driving. Modular pipelines restrict information flow and can amplify upstream errors. Recent query-based, fully differentiable perception-and-prediction (PnP) models mitigate these issues, yet the complementarity of cameras and LiDAR in the query-space has not been sufficiently explored. Models often rely on fusion schemes that introduce heuristic alignment and discrete selection steps which prevent full utilization of available information and can introduce unwanted bias. We propose Li-ViP3D++, a query-based multimodal PnP framework that introduces Query-Gated Deformable Fusion (QGDF) to integrate multi-view RGB and LiDAR in query space. QGDF (i) aggregates image evidence via masked attention across cameras and feature levels, (ii) extracts LiDAR context through fully differentiable BEV sampling with learned per-query offsets, and (iii) applies query-conditioned gating to adaptively weight visual and geometric cues per agent. The resulting architecture jointly optimizes detection, tracking, and multi-hypothesis trajectory forecasting in a single end-to-end model. On nuScenes, Li-ViP3D++ improves end-to-end behavior and detection quality, achieving higher EPA (0.335) and mAP (0.502) while substantially reducing false positives (FP ratio 0.147), and it is faster than the prior Li-ViP3D variant (139.82 ms vs. 145.91 ms). These results indicate that query-space, fully differentiable camera-LiDAR fusion can increase robustness of end-to-end PnP without sacrificing deployability.
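
The query-conditioned gating in step (iii) of QGDF can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `gated_fusion`, the single linear gate (`W`, `b`), and the tensor shapes are assumptions made only to show how a per-query gate can blend camera and LiDAR features.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(query, img_feat, lidar_feat, W, b):
    """Blend image and LiDAR features with a gate computed from each query.

    query, img_feat, lidar_feat: arrays of shape (num_queries, dim).
    W: (dim, dim) and b: (dim,) are hypothetical gate parameters.
    """
    gate = sigmoid(query @ W + b)  # values in (0, 1), one per query and channel
    # gate near 1 favors visual cues, gate near 0 favors geometric cues
    return gate * img_feat + (1.0 - gate) * lidar_feat

# Usage: with zero gate parameters every gate value is 0.5 (a plain average).
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
img = rng.normal(size=(4, 8))
lidar = rng.normal(size=(4, 8))
fused = gated_fusion(q, img, lidar, np.zeros((8, 8)), np.zeros(8))
```

Because the gate is a smooth function of the query, the blend stays differentiable end to end, which is the property the abstract emphasizes.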

[135] Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification

Xin Jin,Jinming Liu,Yuntao Wei,Junyan Lin,Zhicheng Wang,Jianguo Huang,Xudong Yang,Yanxiao Liu,Wenjun Zeng

Main category: cs.CV

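TL;DR: This paper surveys and unifies classical visual coding and emerging visual token technology, analyzes the trade-off between compression efficiency and model performance from an optimization perspective, and looks ahead to next-generation visual codecs and token techniques for multimodal large models, AIGC, and embodied AI, including the prospect of standardization.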

Details

Motivation: Visual coding and visual token technology come from different fields but share one goal: maximizing semantic fidelity while minimizing computational cost in representation learning. A unified view is needed to understand how they relate and how they can reinforce each other. Method: Systematically survey visual coding and visual token technology, propose a unified optimization-based formulation, derive bidirectional technical insights and forecasts of future trends from it, and experimentally examine the potential of task-oriented tokens in practical settings such as multimodal large models. Result: The paper establishes a unified optimization view of visual coding and visual token technology, exposes the link between compression efficiency and model performance, shows experimentally that task-oriented tokens have substantial value in MLLMs, AIGC, and embodied AI, and outlines a feasible path toward standardizing a general token technology. Conclusion: Compression is not only a means of data efficiency but also an important indicator of, and lever for, AI capability; visual coding and token technology are converging and may yield a next-generation representation paradigm for intelligent tasks that is efficient, general, and standardized. Abstract: "Compression Tells Intelligence" is supported by research in artificial intelligence, particularly concerning (multimodal) large language models (LLMs/MLLMs), where compression efficiency often correlates with improved model performance and capabilities. For compression, classical visual coding based on traditional information theory has developed over decades, achieving great success with numerous international industrial standards widely applied in multimedia (e.g., image/video) systems. Beyond that, the recently emerging visual token technology of generative multi-modal large models shares a fundamental objective with visual coding: maximizing semantic information fidelity during representation learning while minimizing computational cost. Therefore, this paper first provides a comprehensive overview of two dominant technique families -- Visual Coding and Vision Token Technology -- and then unifies them from the aspect of optimization, discussing the essence of the underlying trade-off between compression efficiency and model performance. Next, based on the proposed unified formulation bridging visual coding and visual token technology, we synthesize bidirectional insights between them and forecast the next-gen visual codec and token techniques. Last but not least, we experimentally show the large potential of task-oriented token developments in more practical tasks like multimodal LLMs (MLLMs), AI-generated content (AIGC), and embodied AI, and shed light on the future possibility of standardizing a general token technology, like the traditional codecs (e.g., H.264/265), with high efficiency for a wide range of intelligent tasks in a unified and effective manner.

[136] FAIRT2V: Training-Free Debiasing for Text-to-Video Diffusion Models

Haonan Zhong,Wei Song,Tingxu Han,Maurice Pagnucco,Jingling Xue,Yang Song

Main category: cs.CV

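TL;DR: This paper proposes FairT2V, a training-free debiasing framework for text-to-video generation. It neutralizes gender bias in prompt embeddings with anchor-guided spherical geodesic transformations and applies the debiasing dynamically during early denoising steps to preserve temporal consistency, substantially reducing gender bias in settings such as occupations with minimal impact on video quality.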

Details

Motivation: Text-to-video (T2V) diffusion models have progressed rapidly, but their demographic biases, especially gender bias, are understudied; the authors find that the bias mainly comes from pretrained text encoders, which attach implicit gender associations to neutral prompts. Method: Propose the FairT2V framework: (1) define a gender-leaning score to quantify bias; (2) neutralize prompt embeddings with anchor-based spherical geodesic transformations; (3) use a dynamic denoising schedule that applies debiasing only during early identity-forming steps to preserve temporal coherence; (4) build a video-level fairness evaluation protocol that combines VideoLLM-based reasoning with human verification. Result: Experiments on the Open-Sora model show that FairT2V substantially reduces gender bias across a range of occupations with minimal impact on video quality. Conclusion: FairT2V is an efficient, training-free, semantics-preserving debiasing method for T2V that offers a new paradigm and a practical tool for fairness research on generative video models. Abstract: Text-to-video (T2V) diffusion models have achieved rapid progress, yet their demographic biases, particularly gender bias, remain largely unexplored. We present FairT2V, a training-free debiasing framework for text-to-video generation that mitigates encoder-induced bias without finetuning. We first analyze demographic bias in T2V models and show that it primarily originates from pretrained text encoders, which encode implicit gender associations even for neutral prompts. We quantify this effect with a gender-leaning score that correlates with bias in generated videos. Based on this insight, FairT2V mitigates demographic bias by neutralizing prompt embeddings via anchor-based spherical geodesic transformations while preserving semantics. To maintain temporal coherence, we apply debiasing only during early identity-forming steps through a dynamic denoising schedule. We further propose a video-level fairness evaluation protocol combining VideoLLM-based reasoning with human verification. Experiments on the modern T2V model Open-Sora show that FairT2V substantially reduces demographic bias across occupations with minimal impact on video quality.
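
To make "spherical geodesic transformation" concrete, the sketch below moves an embedding along the great-circle arc toward a neutral direction using standard spherical linear interpolation (slerp). It is an illustration only and not FairT2V's actual procedure: the helper names, the choice of the geodesic midpoint of two gendered anchors as the "neutral" target, and the `strength` parameter are all assumptions.

```python
import numpy as np

def slerp(a, b, t):
    """Move a fraction t along the great-circle arc from direction a to b.

    Both inputs are normalized first; antipodal inputs are not handled.
    """
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))  # angle between a and b
    if omega < 1e-8:  # directions already coincide
        return a.copy()
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

def neutralize(prompt_emb, anchor_m, anchor_f, strength=1.0):
    """Hypothetical debiasing step: rotate prompt_emb toward the geodesic
    midpoint of two gendered anchor embeddings, keeping its original norm."""
    neutral = slerp(anchor_m, anchor_f, 0.5)
    return np.linalg.norm(prompt_emb) * slerp(prompt_emb, neutral, strength)

# Usage with toy 3-D vectors standing in for text-encoder embeddings.
anchor_m = np.array([1.0, 0.0, 0.0])
anchor_f = np.array([0.0, 1.0, 0.0])
prompt = np.array([2.0, 0.5, 0.5])
debiased = neutralize(prompt, anchor_m, anchor_f, strength=1.0)
```

With `strength=1.0` the result points exactly at the midpoint, so it has equal cosine similarity to both anchors; smaller values rotate only part of the way and keep more of the original direction.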

[137] Open-Vocabulary Functional 3D Human-Scene Interaction Generation

Jie Liu,Yu Sun,Alpar Cseke,Yao Feng,Nicolas Heron,Michael J. Black,Yan Zhang

Main category: cs.CV

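TL;DR: This paper proposes FunHSI, a training-free framework that generates functionally correct and physically plausible 3D human-scene interactions from open-vocabulary task prompts, using functionality-aware contact reasoning, vision-language model synthesis, and stage-wise optimization.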

Details

Motivation: Existing methods lack explicit modeling of object functionality and human-scene contact, so the interactions they generate are implausible or functionally incorrect. Method: FunHSI follows a functionality-driven, training-free paradigm: it first performs functionality-aware contact reasoning to identify functional elements and build a contact graph; it then uses vision-language models to synthesize an image and estimate 3D body and hand poses; finally, stage-wise optimization improves physical plausibility and functional correctness. Result: FunHSI consistently generates functionally correct and physically plausible 3D human-scene interactions across diverse indoor and outdoor scenes, supporting both coarse- and fine-grained tasks, from "sitting on a sofa" to "increasing the room temperature". Conclusion: FunHSI offers an effective, general, training-free paradigm for functional 3D human-scene interaction generation that markedly improves the realism and functionality of the interactions. Abstract: Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. In this work, we propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. In contrast to existing methods, FunHSI not only synthesizes more plausible general 3D interactions, such as "sitting on a sofa", but also supports fine-grained functional human-scene interactions, e.g., "increasing the room temperature". Extensive experiments demonstrate that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.

[138] A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion

Willams de Lima Costa,Thifany Ketuli Silva de Souza,Jonas Ferreira Silva,Carlos Gabriel Bezerra Pereira,Bruno Reis Vila Nova,Leonardo Silvino Brito,Rafael Raider Leoni,Juliano Silva,Valter Ferreira,Sibele Miguel Soares Neto,Samantha Uehara,Daniel Giacomo,João Marcelo Teixeira,Veronica Teichrieb,Cristiano Coelho de Araújo

Main category: cs.CV

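TL;DR: This paper proposes a lightweight multimodal road surface classification framework that fuses images with inertial measurements through bidirectional cross-attention and an adaptive gating mechanism, and introduces ROAD, a new environmentally diverse benchmark dataset. The method markedly improves classification performance and generalization in challenging conditions.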

Details

Motivation: Existing road surface classification methods generalize poorly to varied real-world conditions because they rely on a single sensing modality and on datasets that lack environmental diversity. Method: Propose a lightweight bidirectional cross-attention module that fuses RGB images with IMU data, followed by an adaptive gating layer that dynamically adjusts each modality's contribution; build the ROAD dataset from three subsets: real-world multimodal, vision-only, and synthetic. Result: The method gains 1.4 percentage points on the PVS benchmark and 11.6 percentage points on the multimodal subset of ROAD, achieves higher F1-scores on minority classes, and remains stable in challenging scenarios such as nighttime, heavy rain, and mixed-surface transitions. Conclusion: Fusing low-cost cameras and IMUs with multimodal attention can provide a scalable, robust road surface understanding solution for resource-constrained regions. Abstract: Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets that lack environmental diversity. This work addresses these limitations by introducing a multimodal framework that fuses images and inertial measurements using a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shifts. Given the limitations of current benchmarks, especially regarding lack of variability, we introduce ROAD, a new dataset composed of three complementary subsets: (i) real-world multimodal recordings with RGB-IMU streams synchronized using a gold-standard industry datalogger, captured across diverse lighting, weather, and surface conditions; (ii) a large vision-only subset designed to assess robustness under adverse illumination and heterogeneous capture setups; and (iii) a synthetic subset generated to study out-of-distribution generalization in scenarios difficult to obtain in practice. Experiments show that our method achieves a +1.4 pp improvement over the previous state-of-the-art on the PVS benchmark and an +11.6 pp improvement on our multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also demonstrates stable performance across challenging visual conditions, including nighttime, heavy rain, and mixed-surface transitions. These findings indicate that combining affordable camera and IMU sensors with multimodal attention mechanisms provides a scalable, robust foundation for road surface understanding, particularly relevant for regions where environmental variability and cost constraints limit the adoption of high-end sensing suites.

[139] FreeFix: Boosting 3D Gaussian Splatting via Fine-Tuning-Free Diffusion Models

Hongyu Zhou,Zisen Shao,Sheng Miao,Pan Wang,Dongfeng Bai,Bingbing Liu,Yiyi Liao

Main category: cs.CV

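TL;DR: This paper proposes FreeFix, a fine-tuning-free framework that uses pretrained image diffusion models to improve rendering quality at extrapolated views, raising fidelity while preserving generalization.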

Details

Motivation: Existing NeRF and 3D Gaussian Splatting methods perform well with dense inputs but degrade at extrapolated views; approaches based on generative models such as diffusion models face a trade-off between generalization and fidelity. Method: Propose FreeFix, which uses an interleaved 2D-3D refinement strategy to obtain consistent refinement from pretrained image diffusion models, together with a per-pixel confidence mask that locates and improves uncertain regions. Result: Experiments on multiple datasets show that FreeFix improves multi-frame consistency and matches or surpasses fine-tuning-based methods while retaining strong generalization. Conclusion: FreeFix eases the trade-off between generalization and fidelity in generative view synthesis and offers a new paradigm for using diffusion models without fine-tuning. Abstract: Neural Radiance Fields and 3D Gaussian Splatting have advanced novel view synthesis, yet still rely on dense inputs and often degrade at extrapolated views. Recent approaches leverage generative models, such as diffusion models, to provide additional supervision, but face a trade-off between generalization and fidelity: fine-tuning diffusion models for artifact removal improves fidelity but risks overfitting, while fine-tuning-free methods preserve generalization but often yield lower fidelity. We introduce FreeFix, a fine-tuning-free approach that pushes the boundary of this trade-off by enhancing extrapolated rendering with pretrained image diffusion models. We present an interleaved 2D-3D refinement strategy, showing that image diffusion models can be leveraged for consistent refinement without relying on costly video diffusion models. Furthermore, we take a closer look at the guidance signal for 2D refinement and propose a per-pixel confidence mask to identify uncertain regions for targeted improvement. Experiments across multiple datasets show that FreeFix improves multi-frame consistency and achieves performance comparable to or surpassing fine-tuning-based methods, while retaining strong generalization ability.

Table of Contents

cs.CL [Back]

[1] From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text

[2] Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments

[3] Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication

[4] PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review

[5] PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models

[6] Lowest Span Confidence: A Zero-Shot Metric for Efficient and Black-Box Hallucination Detection in LLMs

[7] FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition

[8] Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

[9] HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

[10] Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation

[11] OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

[12] Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study

[13] The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

[14] Attribution Techniques for Mitigating Hallucinated Information in RAG Systems: A Survey

[15] Towards a Mechanistic Understanding of Large Reasoning Models: A Survey of Training, Inference, and Failures

[16] Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding

[17] SDUs DAISY: A Benchmark for Danish Culture

[18] CascadeMind at SemEval-2026 Task 4: A Hybrid Neuro-Symbolic Cascade for Narrative Similarity

[19] "Newspaper Eat" Means "Not Tasty": A Taxonomy and Benchmark for Coded Languages in Real-World Chinese Online Reviews

[20] Text-to-State Mapping for Non-Resolution Reasoning: The Contradiction-Preservation Principle

[21] Quantifying non deterministic drift in large language models

[22] Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

[23] Benchmarking von ASR-Modellen im deutschen medizinischen Kontext: Eine Leistungsanalyse anhand von Anamnesegesprächen

[24] On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text

[25] LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?

[26] Semantic Uncertainty Quantification of Hallucinations in LLMs: A Quantum Tensor Network Based Method

[27] TAIGR: Towards Modeling Influencer Content on Social Media via Structured, Pragmatic Inference

[28] VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning

[29] Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects

[30] FFE-Hallu: Hallucinations in Fixed Figurative Expressions: Benchmark of Idioms and Proverbs in the Persian Language

[31] Rewarding Intellectual Humility Learning When Not To Answer In Large Language Models

[32] BengaliSent140: A Large-Scale Bengali Binary Sentiment Dataset for Hate and Non-Hate Speech Classification

[33] Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR

[34] Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents

[35] Me-Agent: A Personalized Mobile Agent with Two-Level User Habit Learning for Enhanced Interaction

[36] Improving X-Codec-2.0 for Multi-Lingual Speech: 25 Hz Latent Rate and 24 kHz Sampling

[37] Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems

[38] Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy

[39] SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility

[40] RusLICA: A Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis

[41] Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale

[42] MiLorE-SSL: Scaling Multilingual Capabilities in Self-Supervised Models without Forgetting

[43] SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger

[44] Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning

[45] CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

[46] PsychePass: Calibrating LLM Therapeutic Competence via Trajectory-Anchored Tournaments

[47] MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

[48] Improving Diffusion Language Model Decoding through Joint Search in Generation Order and Token Space

[49] Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents

[50] SpeechMapper: Speech-to-text Embedding Projector for LLMs

[51] Hopes and Fears -- Emotion Distribution in the Topic Landscape of Finnish Parliamentary Speech 2000-2020

[52] PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use

[53] MuVaC: A Variational Causal Framework for Multimodal Sarcasm Understanding in Dialogues

[54] BMAM: Brain-inspired Multi-Agent Memory Framework

[55] Can We Improve Educational Diagram Generation with In-Context Examples? Not if a Hallucination Spoils the Bunch

[56] Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models

[57] Single-Nodal Spontaneous Symmetry Breaking in NLP Models

[58] A Computational Approach to Language Contact -- A Case Study of Persian

[59] AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

[60] P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering

[61] A Dialectic Pipeline for Improving LLM Robustness

[62] Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science

[63] Efficient Multimodal Planning Agent for Visual Question-Answering

[64] ShieldedCode: Learning Robust Representations for Virtual Machine Protected Code

[65] Online Density-Based Clustering for Real-Time Narrative Evolution Monitorin

[66] AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

[67] QueerGen: How LLMs Reflect Societal Norms on Gender and Sexuality in Sentence Completion Tasks

[68] Like a Therapist, But Not: Reddit Narratives of AI in Mental Health Contexts

[69] Persona Prompting as a Lens on LLM Social Reasoning

[70] SERA: Soft-Verified Efficient Repository Agents

[71] Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers

[72] Structured Semantic Information Helps Retrieve Better Examples for In-Context Learning in Few-Shot Relation Extraction

[73] Linear representations in language models can change dramatically over a conversation

[74] When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation

cs.CV [Back]

[75] Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation

[76] DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation

[77] Semi-Supervised Masked Autoencoders: Unlocking Vision Transformer Potential with Limited Data