cs.CL

[1] Enhancing Biomedical Named Entity Recognition using GLiNER-BioMed with Targeted Dictionary-Based Post-processing for BioASQ 2025 task 6

Ritesh Mehta

Main category: cs.CL

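TL;DR: This study evaluates the GLiNER-BioMed model on a BioASQ dataset and introduces a dictionary-based post-processing strategy to reduce misclassifications in biomedical named entity recognition. The strategy raised micro F1 from 0.79 to 0.83 on the development set but did not help on the blind test set (0.77 versus the baseline's 0.79), exposing the risk of overfitting and the importance of generalization.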

Details

Motivation: Biomedical named entity recognition (BioNER) struggles to distinguish similar entity types such as genes and chemicals, which degrades information extraction, so classification accuracy needs to improve.

Method: Run baseline experiments with the GLiNER-BioMed model, design a dictionary-matching post-processing step that targets common misclassifications, and explore alternative modeling approaches such as CRFs.

Result: On the development set, post-processing raised the micro F1-score from 0.79 to 0.83. On the blind test set, the post-processed model scored 0.77, below the baseline's 0.79, which indicates overfitting.

Conclusion: Dictionary-based post-processing has the potential to improve BioNER models, but overfitting to the development set is a real risk, and good generalization is required for practical use.

Abstract: Biomedical Named Entity Recognition (BioNER), Task 6 in BioASQ (a challenge in large-scale biomedical semantic indexing and question answering), is crucial for extracting information from scientific literature but faces hurdles such as distinguishing between similar entity types like genes and chemicals. This study evaluates the GLiNER-BioMed model on a BioASQ dataset and introduces a targeted dictionary-based post-processing strategy to address common misclassifications. While this post-processing approach demonstrated notable improvement on our development set, increasing the micro F1-score from a baseline of 0.79 to 0.83, this enhancement did not generalize to the blind test set, where the post-processed model achieved a micro F1-score of 0.77 compared to the baseline's 0.79. We also discuss insights gained from exploring alternative methodologies, including Conditional Random Fields. This work highlights the potential of dictionary-based refinement for pre-trained BioNER models but underscores the critical challenge of overfitting to development data and the necessity of ensuring robust generalization for real-world applicability.
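The dictionary-based post-processing described above can be pictured with a minimal sketch. The abstract does not give the actual dictionary or matching rules, so the `CORRECTIONS` table, the label names, and the `postprocess` helper below are all hypothetical and only illustrate the general idea of overriding a model's label when the mention appears in a curated dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical correction dictionary: lowercased mention -> trusted label.
# These entries are illustrative only and do not come from the paper.
CORRECTIONS: Dict[str, str] = {
    "insulin": "gene",
    "glucose": "chemical",
}


def postprocess(
    entities: List[Tuple[str, str]],
    corrections: Dict[str, str] = CORRECTIONS,
) -> List[Tuple[str, str]]:
    """Relabel predicted (mention, label) pairs whose mention is in the dictionary."""
    fixed = []
    for mention, label in entities:
        # Keep the model's label unless the dictionary has an entry for the mention.
        fixed.append((mention, corrections.get(mention.lower(), label)))
    return fixed


# Made-up model output: the first label disagrees with the dictionary.
predictions = [("Insulin", "chemical"), ("glucose", "chemical"), ("BRCA1", "gene")]
print(postprocess(predictions))
```

A dictionary assembled from errors observed on the development set will, by construction, fix those errors, which is one plausible reason such a step can look better on development data than on a blind test set.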

[2] Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

Shahriar Kabir Nahin,Hadi Askari,Muhao Chen,Anshuman Chhabra

Main category: cs.CL

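TL;DR: This paper shows that reducing candidate diversity in test-time scaling (TTS) increases unsafe outputs, proposes RefDiv, a diagnostic attack for probing the vulnerability of TTS systems, finds that existing safety classifiers struggle to detect the threat, and calls for more robust and secure TTS strategies.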

Details

Motivation: TTS relies on highly diverse candidate responses to improve reasoning reliability, but its safety risks when diversity is constrained have not been sufficiently recognized. This paper aims to expose that latent failure mode.

Method: Propose a reference-guided diversity reduction protocol (RefDiv), run experiments across multiple open-source and closed-source models and TTS strategies, evaluate how diversity constraints affect the generation of unsafe content, and test the defensive ability of mainstream safety guardrail classifiers.

Result: Even mildly limiting candidate diversity markedly raises the probability that TTS produces unsafe results. The effect is consistent across models and TTS strategies, and existing safety detection tools struggle to flag the adversarial prompts generated by RefDiv.

Conclusion: Diversity is an important safeguard for TTS safety. Current TTS methods have a widespread and serious vulnerability when diversity is reduced, so more robust TTS mechanisms are needed to address this kind of threat.

Abstract: Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across four open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently increases the rate at which TTS produces unsafe results. The effect is often stronger than that produced by prompts directly with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3 and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard and OpenAI Moderation API) are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode. Through this work, we hope to motivate future research on designing robust TTS strategies that are both effective and secure against diversity-targeted stress tests as illustrated by RefDiv.

[3] Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech

Yuxin Li,Eng Siong Chng,Cuntai Guan

Main category: cs.CL

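TL;DR: Proposes HAREN-CTC, a new approach to speech-based depression detection that combines cross-attention over multi-layer SSL features with a CTC loss.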

Details

Motivation: Existing speech-based depression detection methods mostly rely on features from a single SSL layer, which makes it hard to capture sparse, heterogeneous depressive cues over time and limits generalization.

Method: Propose HAREN-CTC, which fuses multi-layer SSL features with cross-attention within a multitask learning framework and adds a CTC loss to handle sparse temporal supervision. It comprises a Hierarchical Adaptive Clustering module and a Cross-Modal Fusion module.

Result: Achieves macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming existing methods under both the standard-split and the five-fold cross-validation settings.

Conclusion: HAREN-CTC makes effective use of the hierarchical features of SSL models, improving the performance and robustness of speech-based depression detection and generalizing well.

Abstract: Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. However, it remains limited by the difficulty of extracting meaningful features and capturing sparse, heterogeneous depressive cues over time. Pretrained self-supervised learning (SSL) models such as WavLM provide rich, multi-layer speech representations, yet most existing SDD methods rely only on the final layer or search for a single best-performing one. These approaches often overfit to specific datasets and fail to leverage the full hierarchical structure needed to detect subtle and persistent depression signals. To address this challenge, we propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework, combined with Connectionist Temporal Classification loss to handle sparse temporal supervision. HAREN-CTC comprises two key modules: a Hierarchical Adaptive Clustering module that reorganizes SSL features into complementary embeddings, and a Cross-Modal Fusion module that models inter-layer dependencies through cross-attention. The CTC objective enables alignment-aware training, allowing the model to track irregular temporal patterns of depressive speech cues. We evaluate HAREN-CTC under both an upper-bound setting with standard data splits and a generalization setting using five-fold cross-validation. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.

[4] Systematic Diagnosis of Brittle Reasoning in Large Language Models

V. S. Raghu Parupudi

Main category: cs.CL

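TL;DR: Proposes a new framework for evaluating the mathematical reasoning of machine learning models: gpt-3.5-turbo generates the reasoning steps, and gpt-4o-mini categorizes errors and clusters the reasoning sentences without supervision. The analysis finds that the model performs poorly on complex modes such as combinatorial reasoning, revealing a nonhuman-like brittleness.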

Details

Motivation: Determine whether machine learning models truly comprehend mathematics rather than completing tasks through surface patterns, which requires going beyond standard benchmarks to identify specific causes of failure.

Method: Generate step-by-step reasoning with gpt-3.5-turbo on the GSM8K dataset, then use the more capable gpt-4o-mini to categorize errors and to cluster the reasoning sentences without supervision, identifying "reasoning modes."

Result: The model reaches near-perfect accuracy on procedural reasoning (such as sequential calculation), but its performance drops sharply on modes that require combinatorial reasoning with restrictions, showing a nonhuman-like cognitive brittleness.

Conclusion: The method offers a finer-grained way to evaluate mathematical comprehension, maps where the model's reasoning is strong and weak, and provides a precise path toward more reliable and more capable future models.

Abstract: A central question in artificial intelligence is the extent to which machine learning models comprehend mathematics. To address this, we propose a novel framework for measuring mathematical reasoning that moves beyond standard benchmarks to diagnose specific failure points. Our method first generates structured, step-by-step reasoning from gpt-3.5-turbo on the GSM8K dataset. We then use a more capable analyst model, gpt-4o-mini, to categorize errors and, crucially, perform an unsupervised clustering of every reasoning sentence to identify emergent "reasoning modes." This analysis reveals a cognitive profile with a stark, nonhuman-like brittleness: while the model achieves near-perfect accuracy on procedural modes like sequential calculation, its performance on modes requiring combinatorial reasoning with restrictions plummets. By identifying and quantifying the reliability of these distinct reasoning skills, our work provides a more granular method to evaluate mathematical comprehension and offers a precise roadmap for developing new capabilities and more reliable future applications.

[5] Confidence, Not Perplexity: A Better Metric for the Creative Era of LLMs

V. S. Raghu Parupudi

Main category: cs.CL

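TL;DR: Proposes a new evaluation metric, the Confidence Score (CS), which is less biased against creative text generation than traditional metrics such as self-perplexity and, in experiments, is clearly more inclined to favor novel responses.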

Details

Motivation: Traditional reference-free metrics such as self-perplexity are clearly biased against creative text, which makes it hard to assess the generation ability of modern large models fairly.

Method: Derive the Confidence Score (CS) from the model's output probability distribution and validate it on gpt-4o-mini with 99 creative prompts, comparing how often CS and fluency-based metrics prefer the novel response.

Result: Fluency-based metrics preferred the novel response in 0% of cases, while CS did so in 19% of cases, a statistically significant difference (95% CI: [11.1%, 27.3%]). CS also distinguishes task difficulty effectively.

Conclusion: The Confidence Score mitigates the bias of traditional metrics against creativity while keeping their evaluative strengths, giving a more balanced assessment for modern large language models.

Abstract: Reference-free metrics like self-perplexity are strongly biased against creative text generation. We propose the Confidence Score (CS), derived from a model's output probability distribution, as a less biased alternative. Experiments on gpt-4o-mini show that while fluency-based metrics prefer novel responses in 0% of cases on 99 creative prompts, our CS does so 19% of the time, a statistically significant difference (95% CI for difference: [11.1%, 27.3%]). We also show that CS effectively distinguishes between easy, medium, and hard tasks, confirmed by non-overlapping confidence intervals. The Confidence Score thus mitigates the creativity bias of traditional metrics while retaining their core evaluative strengths, offering a more balanced assessment for modern LLMs.
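To make the bias of the baseline metric concrete, here is a minimal sketch of perplexity computed from per-token log-probabilities, using the standard definition (the exponential of the negative mean log-probability). The abstract does not define the Confidence Score itself, so this sketch covers only the self-perplexity baseline; the log-probability values are made up for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Exponential of the negative mean natural-log probability of the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up values: a predictable continuation versus a more surprising one.
predictable = [math.log(0.9)] * 4  # every token is highly likely
novel = [math.log(0.2)] * 4        # every token is less likely

print(perplexity(predictable), perplexity(novel))
```

Because any less probable token sequence receives a higher perplexity, a selector that always prefers the lowest-perplexity response will tend to reject unusual wording, which is consistent with the bias the abstract describes.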

[6] Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation

Devleena Das,Rajeev Patwari,Ashish Sirasao

Main category: cs.CL

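TL;DR: Proposes Recover-LoRA, a lightweight, dataset-agnostic method that uses synthetic data and logit distillation to recover the accuracy of language models degraded by quantization, pruning, format conversion, and similar causes.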

Details

Motivation: Inference optimizations such as quantization and pruning can degrade language model performance. Existing work focuses mainly on robust quantization techniques, whereas this paper aims to recover accuracy from any source of weight degradation, such as improper model serialization.

Method: Propose Recover-LoRA, which uses synthetic data and logit distillation to learn LoRA adapters on selected layers so that the degraded model aligns with its full-precision counterpart. The method needs no real data and applies to a variety of small language model architectures.

Result: Recover-LoRA is validated on several small language models, including multi-head attention (MHA) and group-query attention (GQA) architectures, and on multiple evaluation datasets, recovering 5-17% of model accuracy.

Conclusion: Recover-LoRA is an effective, lightweight method that substantially recovers the performance of language models degraded by various inference optimizations, and it is general and practical.

Abstract: Inference optimizations such as quantization, pruning, format and datatype conversion, model export, and serialization can lead to functional degradations in language model task performance. While most efforts on performance recovery for deployment focus on robust quantization techniques, we focus on recovering model accuracies from any sources that degrade model weights, such as improper model serialization. In this work, we propose Recover-LoRA, a lightweight and dataset-agnostic method to recover accuracy in degraded models. Recover-LoRA uses synthetic data and logit distillation to learn LoRA adapters on selective layers that facilitate aligning the degraded model to its full-precision model. We investigate the utility of Recover-LoRA across a diverse set of small language models (SLMs), including models with varying attention architectures, multi-head attention (MHA) and group-query attention (GQA), as well as several evaluation datasets. Our results show that Recover-LoRA recovers model accuracies by 5-17% on MHA and GQA SLMs.
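A minimal sketch of what a logit-distillation objective can look like, assuming a KL divergence between the full-precision model's distribution (teacher) and the degraded model's distribution (student) at a single token position. The abstract says only "logit distillation", so the choice of KL, the `temperature` parameter, and the function names are assumptions rather than the paper's stated loss.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) for one token position (assumed objective)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits: the degraded model has drifted away from the teacher.
teacher = [2.0, 0.5, -1.0]
degraded = [1.0, 1.0, -0.5]
print(kl_distillation_loss(teacher, degraded))
```

In a setup like the one the abstract outlines, a loss of this kind would be minimized with respect to the LoRA adapter weights only, while the degraded base weights stay frozen.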

[7] Mnemosyne: An Unsupervised, Human-Inspired Long-Term Memory Architecture for Edge-Based LLMs

Aneesh Jonelagadda,Christina Hahn,Haoze Zheng,Salvatore Penachio

Main category: cs.CL

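TL;DR: Mnemosyne is an unsupervised, human-inspired long-term memory architecture for edge devices that improves the memory of large language models in long-running dialogue, and it is especially suited to longitudinal interaction settings such as healthcare.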

Details

Motivation: Existing LLM memory systems rely on context expansion or static retrieval. They perform poorly on resource-constrained edge devices and struggle with repeated conversations that are semantically similar but separated in time, particularly in settings such as healthcare that require long-term understanding of the user.

Method: Propose Mnemosyne, which uses graph-structured storage, modular substance and redundancy filters, memory committing and pruning mechanisms, and a probabilistic recall process with temporal decay and refresh. It also introduces a "core summary" derived from a fixed-length subset of the memory to capture the user's personality and key domain-specific information.

Result: In longitudinal healthcare dialogue experiments, Mnemosyne achieved a 65.8% win rate in blind evaluations of realism and long-term memory capability (baseline RAG: 31.1%). It obtained the highest LoCoMo benchmark scores in temporal reasoning and single-hop retrieval, and its overall average score of 54.6% ranked second, ahead of the Mem0 and OpenAI baselines.

Conclusion: Mnemosyne shows that an efficient, natural long-term memory system with strong factual recall and temporal reasoning is feasible on edge devices and transfers easily.

Abstract: Long-term memory is essential for natural, realistic dialogue. However, current large language model (LLM) memory systems rely on either brute-force context expansion or static retrieval pipelines that fail on edge-constrained devices. We introduce Mnemosyne, an unsupervised, human-inspired long-term memory architecture designed for edge-based LLMs. Our approach uses graph-structured storage, modular substance and redundancy filters, memory committing and pruning mechanisms, and probabilistic recall with temporal decay and refresh processes modeled after human memory. Mnemosyne also introduces a concentrated "core summary" efficiently derived from a fixed-length subset of the memory graph to capture the user's personality and other domain-specific long-term details such as, in a healthcare application, post-recovery ambitions and attitude towards care. Unlike existing retrieval-augmented methods, Mnemosyne is designed for use in longitudinal healthcare assistants, where repetitive and semantically similar but temporally distinct conversations are limited by naive retrieval. In experiments with longitudinal healthcare dialogues, Mnemosyne demonstrates the highest win rate of 65.8% in blind human evaluations of realism and long-term memory capability compared to a baseline RAG win rate of 31.1%. Mnemosyne also achieves the current highest LoCoMo benchmark scores in temporal reasoning and single-hop retrieval compared to other same-backboned techniques. Further, the average overall score of 54.6% was second highest across all methods, beating commonly used Mem0 and OpenAI baselines among others. This demonstrates that improved factual recall, enhanced temporal reasoning, and much more natural user-facing responses can be feasible with an edge-compatible and easily transferable unsupervised memory architecture.
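One way to picture "probabilistic recall with temporal decay and refresh" is the toy sketch below. The abstract gives no formulas, so the exponential half-life decay, the 30-day default, and every name here (`recall_probability`, `sample_recall`, `half_life_days`) are assumptions made purely for illustration and not Mnemosyne's actual mechanism.

```python
import random


def recall_probability(
    similarity: float, age_days: float, half_life_days: float = 30.0
) -> float:
    """Similarity clamped to [0, 1], scaled by an exponential half-life decay."""
    clamped = max(0.0, min(1.0, similarity))
    decay = 0.5 ** (age_days / half_life_days)
    return clamped * decay


def sample_recall(similarity: float, age_days: float, rng: random.Random) -> bool:
    """Stochastically decide whether a stored memory is recalled."""
    return rng.random() < recall_probability(similarity, age_days)


rng = random.Random(0)
# Made-up memories: the same semantic match, stored today versus 90 days ago.
print(recall_probability(0.8, 0.0), recall_probability(0.8, 90.0))
print(sample_recall(0.8, 0.0, rng))
```

Under this assumed scheme, a "refresh" would amount to resetting `age_days` to zero whenever a memory is recalled, so frequently used memories stay retrievable while untouched ones fade.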

[8] Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection

Cong Zeng,Shengkun Tang,Yuanzhou Chen,Zhiqiang Shen,Wenchao Yu,Xujiang Zhao,Haifeng Chen,Wei Cheng,Zhiqiang Xu

Main category: cs.CL

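TL;DR: This paper reframes AI-generated text detection as an out-of-distribution (OOD) detection problem in which human-written text is the outlier class, and uses one-class learning and energy-based methods to obtain detection that generalizes well and is robust across multiple datasets.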

Details

Motivation: Existing binary classifiers assume that human text follows a single distribution, but human text is in fact highly diverse and heterogeneous, which makes it hard for models to generalize. A more fundamental formulation is needed to separate human-written from machine-generated text.

Method: Treat human-written text as out-of-distribution (OOD) samples and machine-generated text as in-distribution (ID) samples, and build the detection framework from one-class learning (such as DeepSVDD and HRN) and energy-based score learning.

Result: Achieves 98.3% AUROC and AUPR with an FPR95 of only 8.9% on the DeepFake dataset, and the framework's robustness and generalization are verified in multilingual, adversarially attacked, unseen-model, and unseen-domain settings.

Conclusion: Treating AI-generated text detection as an OOD detection task has advantages over traditional binary classification: it copes better with the diversity of human text and makes models more applicable in real-world settings.

Abstract: The rapid advancement of large language models (LLMs) such as ChatGPT, DeepSeek, and Claude has significantly increased the presence of AI-generated text in digital communication. This trend has heightened the need for reliable detection methods to distinguish between human-authored and machine-generated content. Existing approaches, both zero-shot methods and supervised classifiers, largely conceptualize this task as a binary classification problem, often leading to poor generalization across domains and models. In this paper, we argue that such a binary formulation fundamentally mischaracterizes the detection task by assuming a coherent representation of human-written texts. In reality, human texts do not constitute a unified distribution, and their diversity cannot be effectively captured through limited sampling. This causes previous classifiers to memorize observed OOD characteristics rather than learn the essence of "non-ID" behavior, limiting generalization to unseen human-authored inputs. Based on this observation, we propose reframing the detection task as an out-of-distribution (OOD) detection problem, treating human-written texts as distributional outliers while machine-generated texts are in-distribution (ID) samples. To this end, we develop a detection framework using one-class learning methods, including DeepSVDD and HRN, and score-based learning techniques such as energy-based methods, enabling robust and generalizable performance. Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on the DeepFake dataset. Moreover, we test our detection framework on multilingual, attacked, and unseen-model and -domain text settings, demonstrating the robustness and generalizability of our framework. Code, pretrained weights, and demo will be released.
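As a rough illustration of an energy-based OOD score, the sketch below uses the common free-energy form, the negative temperature-scaled log-sum-exp of a classifier's logits. Whether the paper uses exactly this form is not stated in the abstract, and the threshold and logit values are made up; in the paper's framing, an input flagged as OOD would correspond to human-written text.

```python
import math
from typing import Sequence


def energy_score(logits: Sequence[float], temperature: float = 1.0) -> float:
    """Free energy -T * logsumexp(logits / T), computed in a stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return -temperature * log_sum


def is_out_of_distribution(logits: Sequence[float], threshold: float) -> bool:
    """Flag an input as OOD when its energy is above the chosen threshold."""
    return energy_score(logits) > threshold


# Made-up logits: a confident (in-distribution-like) and a flat (outlier-like) case.
confident = [10.0, 0.0]
flat = [1.0, 0.0]
print(energy_score(confident), energy_score(flat))
print(is_out_of_distribution(flat, threshold=-5.0))
```

Confident logits give a strongly negative energy, while flat logits give an energy closer to zero, so a single threshold on the score separates the two cases in this toy setting.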

[9] YpathRAG: A Retrieval-Augmented Generation Framework and Benchmark for Pathology

Deshui Yu,Yizhi Wang,Saihui Jin,Taojie Zhu,Fanyi Zeng,Wen Qian,Zirui Huang,Jingli Ouyang,Jiameng Li,Zhen Song,Tian Guan,Yonghong He

Main category: cs.CL

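TL;DR: This paper proposes YpathRAG, a pathology-oriented RAG framework that combines dual-channel hybrid retrieval with an evidence judgment module, substantially improving retrieval accuracy and the factual reliability of generated answers.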

Details

Motivation: Large language models tend to hallucinate in specialized domains such as pathology, and existing methods rely on domain fine-tuning, which neither expands the knowledge boundary nor enforces evidence-grounded constraints.

Method: Build a pathology vector database covering 28 subfields and 1.53 million paragraphs, and propose the YpathRAG framework, which uses dual-channel hybrid retrieval (BGE-M3 dense retrieval combined with vocabulary-guided sparse retrieval) and an LLM-based supportive-evidence judgment module.

Result: On the YpathR benchmark, Recall@5 reaches 98.64%, 23 percentage points above the baseline. On YpathQA-M, a set of the 300 most challenging questions, it raises the accuracy of general and medical LLMs by 9.0% on average and by up to 15.6%.

Conclusion: YpathRAG substantially improves retrieval quality and factual accuracy in pathology, providing a scalable paradigm for building RAG systems and an interpretable evaluation framework.

Abstract: Large language models (LLMs) excel on general tasks yet still hallucinate in high-barrier domains such as pathology. Prior work often relies on domain fine-tuning, which neither expands the knowledge boundary nor enforces evidence-grounded constraints. We therefore build a pathology vector database covering 28 subfields and 1.53 million paragraphs, and present YpathRAG, a pathology-oriented RAG framework with dual-channel hybrid retrieval (BGE-M3 dense retrieval coupled with vocabulary-guided sparse retrieval) and an LLM-based supportive-evidence judgment module that closes the retrieval-judgment-generation loop. We also release two evaluation benchmarks, YpathR and YpathQA-M. On YpathR, YpathRAG attains Recall@5 of 98.64%, a gain of 23 percentage points over the baseline; on YpathQA-M, a set of the 300 most challenging questions, it increases the accuracies of both general and medical LLMs by 9.0% on average and up to 15.6%. These results demonstrate improved retrieval quality and factual reliability, providing a scalable construction paradigm and interpretable evaluation for pathology-oriented RAG.
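For readers unfamiliar with the retrieval metric, here is a minimal sketch of Recall@k under one common definition: the fraction of queries for which at least one relevant passage appears among the top k results. The abstract does not say which variant YpathR uses, so this definition, the function name, and the toy query and passage IDs are assumptions.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 5
) -> float:
    """Fraction of queries with at least one relevant passage in the top k."""
    if not ranked:
        raise ValueError("need at least one query")
    hits = sum(1 for query, docs in ranked.items() if relevant[query] & set(docs[:k]))
    return hits / len(ranked)


# Made-up retrieval results for two queries.
ranked_results = {
    "q1": ["p3", "p9", "p1", "p4", "p7", "p2"],
    "q2": ["p5", "p6", "p8", "p0", "p4", "p2"],
}
relevant_passages = {"q1": {"p1"}, "q2": {"p2"}}
print(recall_at_k(ranked_results, relevant_passages, k=5))
```

In this toy data the relevant passage for `q2` sits at rank 6, so it counts as a miss at k=5 and as a hit at k=6.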

[10] LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback

Raffaele Mura,Giorgio Piras,Kamilė Lukošiūtė,Maura Pintor,Amin Karbasi,Battista Biggio

Main category: cs.CL

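TL;DR: Proposes LatentBreak, a white-box jailbreak attack that generates low-perplexity adversarial prompts by substituting semantically equivalent words, allowing it to bypass perplexity-based defenses.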

Details

Motivation: Existing jailbreak attacks are easily detected by perplexity-based filters, so evading such defenses requires an attack that produces more natural, low-perplexity prompts.

Method: LatentBreak minimizes the latent-space distance between the representation of the adversarial prompt and that of harmless requests, constructing attack prompts by substituting semantically equivalent words instead of adding high-perplexity suffixes or long templates.

Result: Experiments show that LatentBreak produces shorter, lower-perplexity prompts and outperforms existing jailbreak algorithms on multiple safety-aligned models, especially against perplexity-based filters.

Conclusion: LatentBreak is an effective white-box jailbreak attack that bypasses perplexity-based detection while keeping prompts natural, exposing the limitations of current safety filters.

Abstract: Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.

[11] Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks

Nouar Aldahoul,Yasir Zaki

Main category: cs.CL

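TL;DR: This paper proposes a multilingual, multi-agent large language model framework with retrieval-augmented generation for detecting misinformation on digital platforms, deployable as a plugin.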

Details

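Motivation: The rapid spread of misinformation on digital platforms threatens public discourse, emotional stability, and decision-making, and adversarial attacks based on specific transformations (such as language switching, query length inflation, and structural rewriting) have not been studied systematically.

Method: Study three transformations: language switching across English, French, Spanish, Arabic, Hindi, and Chinese followed by translation; query length inflation followed by summarization; and structural reformatting into multiple-choice questions. Propose a multilingual, multi-agent large language model framework with retrieval-augmented generation that can be deployed as a web plugin.

Result: The framework handles misinformation under several forms of adversarial attack and confirms the feasibility of plugin-based deployment in real web applications.

Conclusion: AI-driven misinformation detection is essential for protecting factual integrity online, and a plugin-based multilingual detection framework has practical promise.

Abstract: The rapid spread of misinformation on digital platforms threatens public discourse, emotional stability, and decision-making. While prior work has explored various adversarial attacks in misinformation detection, the specific transformations examined in this paper have not been systematically studied. In particular, we investigate language-switching across English, French, Spanish, Arabic, Hindi, and Chinese, followed by translation. We also study query length inflation preceding summarization and structural reformatting into multiple-choice questions. In this paper, we present a multilingual, multi-agent large language model framework with retrieval-augmented generation that can be deployed as a web plugin into online platforms. Our work underscores the importance of AI-driven misinformation detection in safeguarding online factual integrity against diverse attacks, while showcasing the feasibility of plugin-based deployment for real-world web applications.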

Yu Liu,Hanlei Shi,Haoxun Li,Yuqing Sun,Yuxuan Ding,Linlin Gong,Leyuan Qu,Taihao Li

Main category: cs.CL

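TL;DR: This paper proposes a multimodal model for emotion recognition in conversations that is centered on emotion hotspots and improves performance through Hotspot-Gated Fusion and a routed alignment mechanism.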

Details

Motivation: Emotion recognition in conversations is challenging because emotional cues are sparse, localized, and asynchronous across modalities.

Method: Detect per-utterance emotion hotspots in text, audio, and video, fuse them with global features through Hotspot-Gated Fusion (HGF), align modalities with a routed Mixture-of-Aligners (MoA), and encode conversational structure with a cross-modal graph.

Result: Experiments on standard ERC benchmarks show that the method clearly outperforms strong baselines, and ablations confirm the effectiveness of HGF and MoA.

Conclusion: A hotspot-centric design helps the model focus on key information, mitigates modality misalignment, and preserves context, offering a new perspective on multimodal learning and modality fusion in ERC.

Abstract: Emotion Recognition in Conversations (ERC) is hard because discriminative evidence is sparse, localized, and often asynchronous across modalities. We center ERC on emotion hotspots and present a unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion (HGF), and aligns modalities using a routed Mixture-of-Aligners (MoA); a cross-modal graph encodes conversational structure. This design focuses modeling on salient spans, mitigates misalignment, and preserves context. Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of HGF and MoA. Our results point to a hotspot-centric view that can inform future multimodal learning, offering a new perspective on modality fusion in ERC.

[13] MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation

Weihua Zheng,Zhengyuan Liu,Tanmoy Chakraborty,Weiwen Xu,Xiaoxue Gao,Bryan Chen Zhengyu Tan,Bowei Zou,Chang Liu,Yujia Hu,Xing Xie,Xiaoyuan Yi,Jing Yao,Chaojun Wang,Long Li,Rui Liu,Huiyao Liu,Koji Inoue,Ryuichi Sumida,Tatsuya Kawahara,Fan Xu,Lingyu Ye,Wei Tian,Dongjun Kim,Jimin Jung,Jaehyung Seo,Nadya Yuki Wangsajaya,Pham Minh Duc,Ojasva Saxena,Palash Nandi,Xiyan Tao,Wiwik Karlina,Tuan Luong,Keertana Arun Vasan,Roy Ka-Wei Lee,Nancy F. Chen

Main category: cs.CL

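TL;DR: This paper proposes MMA-ASIA, a framework for evaluating the cultural awareness of multilingual, multimodal large language models with a focus on Asian contexts. It contains 27,000 questions covering 8 Asian countries and 10 languages, emphasizes multi-step reasoning grounded in cultural context, and introduces a five-dimensional evaluation protocol and a validation module to improve cross-lingual and cross-modal cultural understanding.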

Details

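Motivation: The multimodal understanding and reasoning of current large language models degrade markedly in non-Western, low-resource settings, and the models lack an effective grasp of Asian cultural contexts, so a benchmark dedicated to evaluating cultural awareness is needed.

Method: Build a human-curated, multilingual, multimodally aligned multiple-choice benchmark (MMA-ASIA) covering 8 Asian countries and 10 languages with 27,000 questions. Design a five-dimensional evaluation protocol (cultural-awareness disparities, cross-lingual consistency, cross-modal consistency, cultural knowledge generalization, and grounding validity) and add a Cultural Awareness Grounding Validation Module to detect "shortcut learning." Use comparative model analysis, attention tracing, and Vision-ablated Prefix Replay (VPR) to investigate why model behavior diverges.

Result: MMA-ASIA is the first dataset aligned at the input level across three modalities: text, image, and speech. More than 79% of its questions require multi-step reasoning grounded in cultural context. Experiments show that the framework effectively evaluates cultural awareness in cross-lingual and cross-modal scenarios and reveals that models rely on surface features or language priors when predicting.

Conclusion: MMA-ASIA provides a reliable benchmark for evaluating and improving the multimodal understanding of large language models across Asia's diverse cultural contexts, and the proposed protocol and analysis methods help advance more culturally sensitive and robust multimodal AI systems.

Abstract: Large language models (LLMs) are now used worldwide, yet their multimodal understanding and reasoning often degrade outside Western, high-resource settings. We propose MMA-ASIA, a comprehensive framework to evaluate LLMs' cultural awareness with a focus on Asian contexts. MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned multiple-choice benchmark covering 8 Asian countries and 10 languages, comprising 27,000 questions; over 79 percent require multi-step reasoning grounded in cultural context, moving beyond simple memorization. To our knowledge, this is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. This enables direct tests of cross-modal transfer. Building on this benchmark, we propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity. To ensure rigorous assessment, a Cultural Awareness Grounding Validation Module detects "shortcut learning" by checking whether the requisite cultural knowledge supports correct answers. Finally, through comparative model analysis, attention tracing, and an innovative Vision-ablated Prefix Replay (VPR) method, we probe why models diverge across languages and modalities, offering actionable insights for building culturally reliable multimodal LLMs.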

[14] GraphGhost: Tracing Structures Behind Large Language Models

Xinnan Dai,Kai Guo,Chung-Hsiang Lo,Shenglai Zeng,Jiayuan Ding,Dongsheng Luo,Subhabrata Mukherjee,Jiliang Tang

Main category: cs.CL

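TL;DR: This paper proposes GraphGhost, a framework that represents neuron activations and signal propagation in large language models as graphs, reveals the structural mechanisms behind LLM reasoning, and uses graph algorithms to analyze and intervene in the reasoning process.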

Details

Motivation: Explore the structural mechanisms behind the reasoning ability of large language models and understand how they capture structural semantics from sequential inputs and generate outputs in a structurally consistent way.

Method: Propose the GraphGhost framework, which models neuron activations and signal propagation as graphs, applies graph algorithms such as PageRank to analyze model behavior, and identifies key neuron nodes through structural interventions.

Result: LLMs exhibit structured reasoning mechanisms. The study identifies neurons that are critical to reasoning, and editing these key nodes causes reasoning collapse, affecting both logical flow and semantic understanding.

Conclusion: GraphGhost is a powerful tool for analyzing and intervening in the reasoning structure of large language models and helps deepen understanding of the foundations of LLM reasoning.

Abstract: Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, yet the structural mechanisms underlying these abilities remain underexplored. In this work, we introduce GraphGhost, a unified framework that represents neuron activations and their signal propagation as graphs, explaining how LLMs capture structural semantics from sequential inputs and generate outputs through structurally consistent mechanisms. This graph-based perspective enables us to employ graph algorithms such as PageRank to characterize the properties of LLMs, revealing both shared and model-specific reasoning behaviors across diverse datasets. We further identify the activated neurons within GraphGhost and evaluate them through structural interventions, showing that edits to key neuron nodes can trigger reasoning collapse, altering both logical flow and semantic understanding. Together, these contributions position GraphGhost as a powerful tool for analyzing, intervening in, and ultimately understanding the structural foundations of reasoning in LLMs.
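To show how a graph algorithm such as PageRank could rank nodes in an activation graph, here is a small power-iteration sketch on a toy directed graph. The node names and edges are invented; how GraphGhost actually builds its graphs from neuron activations is not described in the abstract, so this illustrates only the ranking step.

```python
from typing import Dict, List


def pagerank(
    graph: Dict[str, List[str]], damping: float = 0.85, iterations: int = 100
) -> Dict[str, float]:
    """Plain power-iteration PageRank; every edge target must be a key of graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        updated = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # A node without outgoing edges spreads its rank uniformly.
            targets = graph[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                updated[target] += share
        rank = updated
    return rank


# Hypothetical propagation graph: signals from "a" and "b" both flow into "c".
toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(toy_graph)
print(max(scores, key=scores.get))
```

A node that receives signal from many well-connected nodes ends up with a high score, which is the sense in which such a ranking could point to candidate "key" nodes for intervention experiments.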

[15] Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications

Mingxuan Liu,Yuhe Ke,Wentao Zhu,Mayli Mertens,Yilin Ning,Jingchi Liao,Chuan Hong,Daniel Shu Wei Ting,Yifan Peng,Danielle S. Bitterman,Marcus Eng Hock Ong,Nan Liu

Main category: cs.CL

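TL;DR: The study finds that although large language models (LLMs) give relatively consistent diagnoses across assigned genders, they are markedly inconsistent when judging the relevance of patient gender, and some models show systematic gender disparities. This suggests that the consistency of LLM gender assignment should be checked routinely to keep clinical AI fair and reliable.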

Details

Motivation: There is concern that LLMs acting in roles such as clinician or medical educator may replicate or amplify gender-related biases and affect the fairness of clinical decisions.

Method: Using cases from the New England Journal of Medicine Challenge, assign a gender (female, male, or unspecified) to multiple open-source and proprietary LLMs and evaluate the consistency of their responses on diagnosis and on judging the relevance of patient gender.

Result: Most models were consistent in diagnosis but markedly inconsistent in judging the relevance and necessity of patient gender, especially relevance. Some models showed a systematic female-male disparity.

Conclusion: Inconsistency across LLM gender assignments may be an overlooked bias that threatens reliability in clinical practice, so identity-assignment consistency should be checked routinely to keep AI-supported care fair and trustworthy.

Abstract: The integration of large language models (LLMs) into healthcare holds promise to enhance clinical decision-making, yet their susceptibility to biases remains a critical concern. Gender has long influenced physician behaviors and patient outcomes, raising concerns that LLMs assuming human-like roles, such as clinicians or medical educators, may replicate or amplify gender-related biases. Using case studies from the New England Journal of Medicine Challenge (NEJM), we assigned genders (female, male, or unspecified) to multiple open-source and proprietary LLMs. We evaluated their response consistency across LLM-gender assignments regarding both LLM-based diagnosis and models' judgments on the clinical relevance or necessity of patient gender. In our findings, diagnoses were relatively consistent across LLM genders for most models. However, for patient gender's relevance and necessity in LLM-based diagnosis, all models demonstrated substantial inconsistency across LLM genders, particularly for relevance judgments. Some models even displayed a systematic female-male disparity in their interpretation of patient gender. These findings present an underexplored bias that could undermine the reliability of LLMs in clinical practice, underscoring the need for routine checks of identity-assignment consistency when interacting with LLMs to ensure reliable and equitable AI-supported clinical care.

Kaiqi Yang,Hang Li,Yucheng Chu,Zitao Liu,Mi Tian,Hui Liu

Main category: cs.CL

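TL;DR: Proposes an iterative, LLM-based framework that automatically generates math word problems with distracting conditions while keeping the original solution unchanged, reducing manual annotation cost and improving dataset quality.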

Details

Motivation: Existing datasets of math word problems with distracting conditions suffer from low difficulty and out-of-context expressions, so the distractions are easy to spot and ignore, which undermines the reliability of evaluation. Adding distracting conditions also requires substantial manual effort to revise answers and reasoning.

Method: Design an iterative framework that uses large language models to generate distracting conditions, guiding the model with prompts from multiple perspectives and cognitive levels to produce meaningful distractions while ensuring that they do not change the original solution.

Result: The framework efficiently produces high-quality math word problems whose distracting conditions fit the context and leave the solution unchanged, greatly reducing the need for manual intervention.

Conclusion: The method addresses the problem that distracting conditions in current math word problem datasets are easy to detect and of low quality, providing a more reliable benchmark for evaluating the mathematical reasoning of large language models.

Abstract: Mathematical reasoning serves as a crucial testbed for evaluating the intelligence of large language models (LLMs), and math word problems (MWPs) represent one of the most widely used formats. Most existing MWP datasets contain only the necessary information, while problems with distracting or excessive conditions are often overlooked. Prior studies have shown that popular LLMs experience a dramatic performance drop when such distracting conditions are introduced. However, available datasets of MWPs with distracting conditions remain limited, and most exhibit low difficulty and out-of-context expressions. These shortcomings make the distracting conditions easy to detect and disregard, thereby reducing the credibility of benchmarking on these datasets. Moreover, when distracting conditions are added, the reasoning process and answers may change, requiring intensive manual effort to check and rewrite solutions. To address these issues, we design an iterative framework that leverages LLMs to generate distracting conditions automatically. We develop a set of prompts to revise MWPs from multiple perspectives and cognitive levels, encouraging the creation of meaningful distracting conditions as well as suggestions for further refinement. A key advantage of our framework is the preservation of shared solutions between the original and revised problems: the LLMs are explicitly guided to generate distractions that do not alter the original solution, thus eliminating the need to produce new answers. This framework is efficient and easy to deploy, substantially reducing the effort required to generate MWPs with distracting conditions while maintaining high data quality.

[17] LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests

Juan Miguel Navarro Carranza

Main category: cs.CL

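TL;DR: By comparing model performance on original and paraphrased questions, the study finds that large language models lose accuracy on benchmarks in a way that points to memorization or surface-form shortcuts.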

Details

Motivation: Address the problem that benchmark scores for large language models are inflated by memorization of test items or near duplicates.

Method: Evaluate Mistral-7B-Instruct and Qwen2.5-7B-Instruct on original and paraphrased multiple-choice questions from ARC-Easy and ARC-Challenge, controlling decoding and output format and applying a robust paraphrase-cleaning step to preserve semantics.

Result: Paraphrasing the questions causes a clear drop in accuracy, suggesting that the models may rely on surface-form shortcuts rather than genuine understanding.

Conclusion: The benchmark performance of current large language models may be overestimated, and evaluation should place more weight on generalization and robustness to perturbation.

Abstract: Benchmark scores for Large Language Models (LLMs) can be inflated by memorization of test items or near duplicates. We present a simple protocol that probes generalization by re-evaluating models on paraphrased versions of benchmark questions. Using Mistral-7B-Instruct and Qwen2.5-7B-Instruct, we measure the accuracy gap between original and paraphrased items on ARC-Easy and ARC-Challenge. Our pipeline controls decoding, enforces multiple-choice output format, and includes a robust paraphrase-cleaning step to preserve semantics. We find that paraphrasing induces a non-trivial accuracy drop (original vs. paraphrased), consistent with prior concerns about contamination and brittle surface-form shortcuts.
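The quantity measured by the protocol, the accuracy gap between original and paraphrased items, can be sketched in a few lines. The answer lists below are invented toy data and the function names are hypothetical; the sketch assumes each paraphrased item keeps the gold answer of its original, as the semantics-preserving cleaning step in the abstract implies.

```python
from typing import List


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Share of multiple-choice predictions that match the gold answer letter."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def paraphrase_gap(
    original_preds: List[str], paraphrased_preds: List[str], gold: List[str]
) -> float:
    """Accuracy on original items minus accuracy on their paraphrases."""
    return accuracy(original_preds, gold) - accuracy(paraphrased_preds, gold)


# Made-up answers for four items; paraphrases share the gold labels.
gold_answers = ["A", "B", "C", "D"]
on_original = ["A", "B", "C", "A"]
on_paraphrased = ["A", "B", "D", "A"]
print(paraphrase_gap(on_original, on_paraphrased, gold_answers))
```

A positive gap means the model answered more original items correctly than paraphrased ones, which is the pattern the abstract reports.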

[18] JAI-1: A Thai-Centric Large Language Model

Attapol T. Rutherford,Jullajak Karnjanaekarin,Narongkorn Panitsrisit,Pontakorn Trakuekul,Sumana Sumanakul,Natchanon Pollertlam

Main category: cs.CL

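TL;DR: JAI-1 is a Thai-centric language model with 75B parameters. It adopts an upscaling strategy that expands the parameter space to systematically inject Thai knowledge while preserving the general capabilities of the original English LLM, avoiding the knowledge overwriting that conventional fine-tuning can cause.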

Details

Motivation: Existing Thai models are mostly built by further training open-source models, which risks overwriting pre-existing knowledge; an architecture is needed that strengthens Thai ability without harming general intelligence. Method: Starting from a smaller, high-performing English open-source LLM, the parameter space is expanded (upscaling) and the new capacity is used to systematically integrate Thai knowledge, with pre-training on 1.5T tokens (including over 300B Thai tokens) followed by supervised fine-tuning and alignment tuning on more than 600K instruction examples. Result: The model outperforms Typhoon2-70B on Thai benchmarks (IFEval-TH, MT-Bench-TH, JAI-Hall-Bench), supporting the effectiveness of the upscaling and knowledge-integration framework. Conclusion: By expanding parameters rather than fine-tuning directly, JAI-1 combines strong general capability with strong Thai performance, has a distinct and scalable architecture, and suggests a path for future multilingual models. Abstract: This technical report introduces JAI-1, a Thai-centric language model with 75B parameters. Recent Thai models have primarily relied on existing open-source models, applying additional training without structural modifications to specialize in Thai. However, this approach risks eroding pre-existing knowledge in the model's parameter space during the injection of Thai-specific information, as optimized parameters for general tasks may conflict with new linguistic requirements. In contrast, JAI-1 adopts an upscaling strategy: starting from a smaller, high-performing English open-source LLM, we expanded its parameter space and utilized the newly allocated capacity to systematically integrate Thai-language knowledge. This methodology not only preserves the original model's general intelligence but also establishes a unique architecture distinct from other open-source models, enabling scalable future enhancements. During pre-training, JAI-1 was exposed to 1.5T tokens, including over 300B Thai language tokens. This was followed by post-training stages -- supervised fine-tuning and alignment tuning -- using more than 600K instruction-based examples. The final model demonstrated superior performance compared to Typhoon2-70B on Thai-centric benchmarks (IFEval-TH, MT-Bench-TH, and JAI-Hall-Bench), validating the efficacy of its upscaling and knowledge-integration framework.

[19] From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents

Wen-Yu Chang,Tzu-Hung Huang,Chih-Ho Chen,Yun-Nung Chen

Main category: cs.CL

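TL;DR: Proposes a lightweight, occupation-conditioned dialogue strategy that uses a user simulator to improve a sales-oriented dialogue system; occupation has the most pronounced effect on conversational intent, and conditioning on it yields shorter and more successful dialogues.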

Details

Motivation: Improving sales-oriented dialogue systems requires more realistic user simulators and an understanding of how to adapt dialogue strategies to user characteristics. Method: A user-profile simulator covering age, gender, and occupation is built, the effect of each attribute on dialogue outcomes is analyzed, and a lightweight, occupation-conditioned strategy is designed to guide the agent's choice of intents. Result: Age and gender influence overall performance, but occupation has the most pronounced effect on conversational intent; with the occupation-conditioned strategy, dialogues are shorter and succeed more often. Conclusion: Rich user profiles help tune dialogue strategies, and simple persona-informed strategies can enhance the effectiveness of sales-oriented dialogue systems. Abstract: Amid the rapid rise of agentic dialogue models, realistic user-simulator studies are essential for tuning effective conversation strategies. This work investigates a sales-oriented agent that adapts its dialogue based on user profiles spanning age, gender, and occupation. While age and gender influence overall performance, occupation produces the most pronounced differences in conversational intent. Leveraging this insight, we introduce a lightweight, occupation-conditioned strategy that guides the agent to prioritize intents aligned with user preferences, resulting in shorter and more successful dialogues. Our findings highlight the importance of rich simulator profiles and demonstrate how simple persona-informed strategies can enhance the effectiveness of sales-oriented dialogue systems.

[20] Text2Stories: Evaluating the Alignment Between Stakeholder Interviews and Generated User Stories

Francesco Dente,Fabiano Dalpiaz,Paolo Papotti

Main category: cs.CL

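TL;DR: Proposes the Text2Stories task and metrics for quantifying how well user stories generated from interview transcripts align with the needs actually expressed.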

Details

Motivation: Checking whether LLM-generated software requirements faithfully reflect stakeholder needs is still largely manual and lacks automated, quantitative methods. Method: The interview transcript is segmented into chunks and alignment between chunks and user stories is modeled as a matching problem, with two metrics: correctness (the proportion of stories supported by the transcript) and completeness (the proportion of the transcript covered by stories); matching is performed with LLMs and embedding models. Result: Across four datasets, an LLM-based matcher reaches 0.86 macro-F1 on held-out annotations; embedding models perform somewhat worse but enable effective blocking. Conclusion: Text2Stories offers a scalable, source-faithful way to assess user-story quality, supports comparison across story sets (e.g., human vs. generated), and complements existing criteria. Abstract: Large language models (LLMs) can be employed for automating the generation of software requirements from natural language inputs such as the transcripts of elicitation interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders' needs remains a largely manual task. We introduce Text2Stories, a task and metrics for text-to-story alignment that allow quantifying the extent to which requirements (in the form of user stories) match the actual needs expressed by the elicitation session participants. Given an interview transcript and a set of user stories, our metric quantifies (i) correctness: the proportion of stories supported by the transcript, and (ii) completeness: the proportion of transcript supported by at least one story. We segment the transcript into text chunks and instantiate the alignment as a matching problem between chunks and stories. Experiments over four datasets show that an LLM-based matcher achieves 0.86 macro-F1 on held-out annotations, while embedding models alone remain behind but enable effective blocking. Finally, we show how our metrics enable the comparison across sets of stories (e.g., human vs. generated), positioning Text2Stories as a scalable, source-faithful complement to existing user-story quality criteria.
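The two metrics can be sketched as follows once a matcher has produced chunk-story pairs. This is a minimal illustration: the function name, the pair representation, and the toy data are assumptions, and the paper's matcher (LLM or embedding based) is not reproduced here.

```python
def alignment_metrics(num_chunks, num_stories, matches):
    """Compute (correctness, completeness) from matched (chunk_id, story_id) pairs.

    correctness  = share of stories supported by at least one transcript chunk
    completeness = share of transcript chunks covered by at least one story
    """
    supported_stories = {story for _, story in matches}
    covered_chunks = {chunk for chunk, _ in matches}
    return (len(supported_stories) / num_stories,
            len(covered_chunks) / num_chunks)


# Hypothetical matcher output: 5 transcript chunks, 4 user stories.
matches = {(0, 0), (1, 0), (2, 1), (4, 3)}
correctness, completeness = alignment_metrics(5, 4, matches)
print(f"correctness={correctness:.2f}, completeness={completeness:.2f}")
```

Here story 2 has no supporting chunk and chunk 3 is covered by no story, so neither metric reaches 1.0.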

[21] PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction

Anubhav Shrimal,Aryan Jain,Soumyajit Chowdhury,Promod Yenigalla

Main category: cs.CL

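TL;DR: Proposes PARSE, a system that optimizes JSON schemas and uses reflection-based extraction to substantially improve the accuracy and reliability of LLM structured information extraction.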

Details

Motivation: JSON schemas are typically treated as static contracts designed for human developers, which leads to suboptimal extraction performance, frequent hallucinations, and unreliable behavior when LLMs use them for information extraction. Method: PARSE is developed with two components: ARCHITECT, which automatically optimizes JSON schemas while maintaining backward compatibility, and SCOPE, which performs reflection-based extraction with combined static and LLM-based guardrails. Result: Evaluated on SGD, SWDE, and an internal retail conversation dataset, PARSE improves extraction accuracy by up to 64.7% on SWDE, with combined framework improvements reaching 10% across models, and reduces extraction errors by 92% within the first retry while keeping latency practical. Conclusion: PARSE improves how LLMs interpret and use JSON schemas and makes structured extraction more accurate and stable, which suits autonomous-agent scenarios. Abstract: Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constrained decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas themselves are a form of natural language understanding contract that encodes rules, relationships, and expectations about data structure, contracts that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection-based extraction with combined static and LLM-based guardrails.
We evaluate PARSE qualitatively and quantitatively on three datasets including Schema-Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and maintaining practical latency.
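The general idea of reflection-based extraction with a static guardrail can be sketched as a validate-and-retry loop. This is not the SCOPE implementation: `validate`, `extract_with_reflection`, and the stub `fake_extractor` (which stands in for an LLM call) are hypothetical names, and the guardrail here only checks field presence and type.

```python
def validate(record, required):
    """Static guardrail: list problems with missing or mistyped fields."""
    errors = []
    for field, expected_type in required.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors


def extract_with_reflection(text, extract_fn, required, max_retries=2):
    """Call the extractor, feeding validation errors back until the record passes."""
    feedback = []
    for attempt in range(max_retries + 1):
        record = extract_fn(text, feedback)
        feedback = validate(record, required)
        if not feedback:
            return record, attempt
    raise ValueError(f"extraction still invalid after retries: {feedback}")


def fake_extractor(text, feedback):
    """Stub for an LLM call (hypothetical): repairs its output once it sees errors."""
    record = {"item": "laptop"}
    if feedback:
        record["price"] = 999.0
    return record


required = {"item": str, "price": float}
record, retries_used = extract_with_reflection("I want a laptop", fake_extractor, required)
print(record, "retries used:", retries_used)
```

In a real system the feedback list would be rendered into the retry prompt, and an LLM-based guardrail could run alongside the static check.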

[22] Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B

Nisar Ahmed,Muhammad Imran Zaman,Gulshan Saleem,Ali Hassan

Main category: cs.CL

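TL;DR: The paper studies how "evaluation scent" in LLM benchmarks distorts measured performance. Evaluation-oriented prompts markedly lengthen reasoning and reduce compliance with answer-only instructions without a matching gain in accuracy. The experiments hold task content and decoding fixed and run A/B tests across prompt framings and reasoning depths; the authors provide a reproducible evaluation framework and practical guidance so that benchmark gains reflect deployable capability.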

Details

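Motivation: LLM benchmarks often rely on prompts that request explicit reasoning and strict formatting, whereas real applications need terse, contract-bound answers. This mismatch may inflate benchmark results so that they do not reflect real-world behavior, which motivates asking whether "evaluation scent" exaggerates performance and how evaluation can be brought closer to deployment needs. Method: Using a single open-weights model, GPT-OSS-20B, six paired A/B scenarios hold task content and decoding parameters fixed and vary only the framing (evaluation-oriented vs. real-world) and reasoning depth (Medium/High). Deterministic validators measure accuracy, answer-only compliance, hedging/refusals, chain-of-thought length, and schema compliance, with pre-registered deltas and composite indices. The scenarios cover math reasoning, code fixing, citation generation, incentive flips, CoT visibility, and Urdu headers. Result: Evaluation-oriented prompts substantially lengthen the chain of thought (hundreds to more than 1000 characters) and reduce answer-only compliance, with limited or inconsistent accuracy gains. For structured outputs they improve wrappers (e.g., fenced blocks, lists) but not regex-validated content. Incentive wording changes the error mix: praising caution modestly improves accuracy at high reasoning depth and reduces wrong-but-confident outputs, whereas praising competence yields terser but riskier answers. Urdu prompts reproduce these patterns and can lower accuracy at high reasoning depth, indicating multilingual parity risks. Conclusion: "Evaluation scent" inflates LLM benchmark results, which then fail to reflect deployable capability. To mitigate this, the authors provide a reproducible A/B framework (prompt banks, validators, scoring scripts, versioned DOI) and practical guidance: neutral phrasing or dual-framing checks, contract-aware grading, style-delta reporting, confidence governance, and multilingual dashboards. Abstract: Benchmarks for large language models (LLMs) often rely on rubric-scented prompts that request visible reasoning and strict formatting, whereas real deployments demand terse, contract-bound answers. We investigate whether such "evaluation scent" inflates measured performance without commensurate capability gains. Using a single open-weights model (GPT-OSS-20B), we run six paired A/B scenarios that hold task content and decoding fixed while varying framing (evaluation-oriented vs. real-world) and reasoning depth (Medium/High): deterministic math, strict code-fix, citation generation, incentive flips (caution vs. competence), CoT visibility, and multilingual (Urdu) headers. Deterministic validators compute accuracy, answer-only compliance, hedging/refusals, chain-of-thought (CoT) length, and schema compliance, with pre-registered deltas and composite indices. Across scenarios, evaluation framing reliably inflates CoT (hundreds to >1000 characters) and reduces answer-only compliance, with limited or inconsistent accuracy gains. In structured outputs, it improves wrappers (e.g., fenced blocks, enumerated lists) but not regex-validated substance.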
Incentive wording reweights error composition: praising caution modestly improves accuracy at high reasoning and reduces wrong-but-confident errors, whereas praising competence yields terser but riskier outputs. Urdu rubric headers reproduce these signatures and can decrease accuracy at higher reasoning depth, indicating multilingual parity risks. We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts; versioned DOI) and practical guidance: neutral phrasing or dual-framing checks, contract-aware grading, style-delta reporting, confidence governance, and multilingual dashboards to ensure that benchmark gains reflect deployable capability.
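A deterministic validator of the kind described might look like the sketch below. The regular expressions, the rule that the last number in the output counts as the final answer, and the toy outputs are all assumptions for illustration; the paper's actual validators are not reproduced here.

```python
import re

# Assumed contract: an answer-only response is a single number and nothing else.
ANSWER_ONLY = re.compile(r"^\s*-?\d+(\.\d+)?\s*$")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def score(output, expected):
    """Deterministically score one model output against an expected numeric answer."""
    numbers = NUMBER.findall(output)
    final = numbers[-1] if numbers else None  # treat the last number as the answer
    return {
        "correct": final is not None and float(final) == float(expected),
        "answer_only": bool(ANSWER_ONLY.match(output)),
        "length": len(output),
    }


# Hypothetical outputs for the same task under two framings.
real_world = score("42", 42)
evaluation = score("Let me think step by step. 6 * 7 = 42. Final answer: 42", 42)
print("real-world framing:", real_world)
print("evaluation framing:", evaluation)
```

Both outputs are correct, but only the terse one satisfies the answer-only contract, which is the kind of style difference the paper reports separately from accuracy.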

[23] From What to Why: Thought-Space Recommendation with Small Language Models

Prosenjit Biswas,Pervez Shaik,Abhinav Thorat,Ravi Kolla,Niranjan Pedanekar

Main category: cs.CL

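TL;DR: Proposes PULSE, a framework that uses rationales generated by small language models (SLMs) to build a cross-domain "Thought Space" that jointly models user actions and their semantic drivers, improving recommendation performance, transferability, and downstream task results.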

Details

Motivation: Large language models (LLMs) improve reasoning in recommender systems but are costly at inference time and hard to deploy; small language models (SLMs) are efficient, yet their reasoning ability remains underexplored. Existing methods mostly treat natural-language rationales as unsupervised descriptive text and do not exploit them as learning signals. Method: PULSE uses an SLM to generate rationales for user preferences and treats them as explicit supervision, combined with interaction histories, to jointly model user actions (what) and their drivers (why), building a unified Thought Space across domains for more robust and generalizable semantic embeddings. Result: PULSE outperforms leading ID, collaborative filtering (CF), and LLM-based sequential recommendation models on multiple benchmark datasets, transfers better in cross-domain recommendation, and performs strongly on downstream tasks such as reasoning-oriented question answering. Conclusion: Treating SLM-generated rationales as direct learning signals improves the semantic understanding and generalization of recommender systems and points toward efficient, interpretable recommendation. Abstract: Large Language Models (LLMs) have advanced recommendation capabilities through enhanced reasoning, but pose significant challenges for real-world deployment due to high inference costs. Conversely, while Small Language Models (SLMs) offer an efficient alternative, their reasoning capabilities for recommendation remain underexplored. Existing systems often use natural language rationales merely as unsupervised descriptive text, failing to harness their full potential as learning signals. In this work our main idea is to create a common understanding of users and items across multiple domains, called Thought Space, with SLMs instead of using LLMs' distilled knowledge. To that end we propose PULSE (Preference Understanding by Latent Semantic Embeddings), a framework that treats SLM-generated rationales as direct learning signals, supervising them with interaction histories to jointly model user actions (what) and their semantic drivers (why). Existing methods consider only interactions such as sequences and embeddings, whereas PULSE treats rationales as first-class signals; this novel design yields embeddings that are more robust and generalizable. Extensive experiments demonstrate that PULSE outperforms leading ID, Collaborative Filtering (CF), and LLM-based sequential recommendation models across multiple benchmark datasets. Furthermore, PULSE exhibits superior transferability in cross-domain recommendation and demonstrates strong performance on downstream tasks such as reasoning-oriented question answering.
Our code is available at https://anonymous.4open.science/r/Thinking_PULSE-0FC5/README.md.

[24] ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection

Jingbiao Mei,Mingsheng Sun,Jinghong Chen,Pengda Qin,Yuhong Li,Da Chen,Bill Byrne

Main category: cs.CL

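TL;DR: Proposes ExPO-HM, an Explain-then-Detect approach to hateful meme detection that combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE), achieving state-of-the-art results on binary detection, fine-grained classification, and reasoning quality.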

Details

Motivation: Most hateful meme detectors make direct binary predictions and lack interpretability; Explain-then-Detect methods underperform, and a binary reward signal is insufficient to guide effective reasoning. Method: ExPO-HM combines SFT warmup, GRPO with curriculum learning, and CDE as both a metric and a reward for reasoning quality, mirroring how human annotators are trained and evaluated. Result: On three hateful meme benchmarks, ExPO-HM reaches state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with F1 improvements of up to 15% and 17% over the GRPO and DPO baselines, respectively. Conclusion: ExPO-HM moves hateful meme detection from simple binary alarms to explanation-driven detection, providing more accurate, interpretable, and actionable moderation support. Abstract: Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions. Such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues of such systems: important policy-relevant cues such as targets and attack types are not hypothesized by the model as a likely explanation; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15% and 17% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.

[25] Next Semantic Scale Prediction via Hierarchical Diffusion Language Models

Cai Zhou,Chenyu Wang,Dinghuai Zhang,Shangyuan Tong,Yifei Wang,Stephen Bates,Tommi Jaakkola

Main category: cs.CL

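TL;DR: Proposes Hierarchical Diffusion Language Models (HDLM), a new family of discrete diffusion models for language modeling that generates text by progressively refining semantic granularity over a hierarchical vocabulary, and that outperforms baselines in the reported experiments.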

Details

Motivation: Existing language models are limited in how they handle semantic hierarchy and generation efficiency, which calls for a framework that can flexibly model different semantic granularities. Method: HDLM builds on a hierarchical vocabulary: in the forward process, fine-grained tokens are perturbed to their coarse-grained ancestors, and in the reverse process the model progressively predicts more detailed semantics. Closed-form expressions for the diffusion ELBO are derived and combined with practical training techniques. Result: In text generation experiments, HDLM achieves consistently lower validation and generative perplexity than baselines. Conclusion: HDLM provides a general and flexible language modeling framework that unifies time-varying next-semantic-scale prediction and includes existing models as special cases. Abstract: In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.
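The forward process over a hierarchical vocabulary can be illustrated with a toy two-level example. Everything concrete here is assumed: the `PARENT` mapping, the single perturbation probability `p` standing in for the scheduler, and the function name are hypothetical and do not reproduce the paper's formulation.

```python
import random

# Hypothetical surjective mapping from fine-grained tokens to coarse ancestors.
PARENT = {
    "cat": "<animal>", "dog": "<animal>",
    "red": "<color>", "blue": "<color>",
}


def forward_perturb(tokens, p, rng):
    """Independently replace each token by its ancestor with probability p.

    p plays the role of the noise schedule at one time step (an assumption);
    tokens without an entry in PARENT are left unchanged.
    """
    return [PARENT.get(tok, tok) if rng.random() < p else tok for tok in tokens]


rng = random.Random(0)
sentence = ["cat", "red", "dog", "blue"]
for p in (0.0, 0.5, 1.0):
    print(p, forward_perturb(sentence, p, rng))
```

At `p = 1.0` every token has been abstracted to its ancestor, and a reverse model would be trained to predict the finer-grained token from that coarser context.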

[26] Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression

Chengzhengxu Li,Xiaoming Liu,Zhaohan Zhang,Shaochu Zhang,Shengchao Liu,Guoxin Ma,Yu Lan,Chao Shen

Main category: cs.CL

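TL;DR: Proposes Upfront CoT (UCoT), an efficient reasoning framework that automates chain-of-thought compression through upfront thought embeddings, substantially shortening reasoning while maintaining or even improving model performance.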

Details

Motivation: Long chain-of-thought (CoT) improves LLM reasoning but, because generation is autoregressive, incurs high compute cost and latency; existing compression methods depend on hand-designed prompts or on external datasets that sacrifice key reasoning details, making it hard to balance efficiency and quality. Method: UCoT pairs a small model (compressor) with a large model (executor): the compressor generates upfront thought embeddings rich in reasoning information, and the executor uses them, guided by a reward mechanism, to reach the answer with short reasoning, which automates compression. Result: On GSM8K, token usage for Qwen2.5-7B-Instruct is reduced by 50% while performance exceeds the state-of-the-art method by 3.08%. Conclusion: UCoT balances reasoning efficiency and performance and avoids manual prompt design and loss of reasoning detail, offering a new approach to efficient reasoning. Abstract: Recent developments have enabled advanced reasoning in Large Language Models (LLMs) via long Chain-of-Thought (CoT), while long CoT suffers from high computational costs and significant latency losses owing to the autoregressive nature of generative LLMs. CoT compression aims to improve efficiency in the reasoning process by reducing output length. Previous works trade reasoning efficiency by either laborious discrete prompt designing or the construction of external compressed CoT datasets that sacrifice key reasoning details. In this work, we propose Upfront CoT (UCoT): an efficient reasoning framework with upfront thought embedding to automate CoT compression. UCoT is a cooperative workflow involving a small model (compressor) and a large model (executor). The first stage of UCoT trains the compressor to generate upfront thought embeddings rich in reasoning information for the executor, avoiding the drawbacks of manually designed prompts. The second stage optimizes the executor to utilize upfront thought embeddings to derive the correct answer with short reasoning, using a reward mechanism. Extensive experiments show that UCoT maintains the powerful reasoning ability of the executor while significantly reducing the length of CoT. It is worth mentioning that when applying UCoT to the Qwen2.5-7B-Instruct model, the usage of tokens on the GSM8K dataset is reduced by 50%, while the performance is 3.08% higher than that of the state-of-the-art (SOTA) method. The code and dataset are in supplementary material.

[27] Formalizing Style in Personal Narratives

Gustave Cortal,Alain Finkel

Main category: cs.CL

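TL;DR: Proposes a new approach that formalizes style in personal narratives as patterns in the linguistic choices authors make when expressing subjective experience. It combines functional linguistics, computer science, and psychology, uses language models to extract linguistic features automatically, and applies the framework to dream narratives to relate linguistic choices to psychological states.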

Details

Motivation: There is no formal framework for systematically analyzing stylistic choices in personal narratives, and a multidisciplinary approach is needed to understand how language use reflects subjective experience and psychological state. Method: Functional linguistic theory is combined with computational techniques: language models automatically extract features such as processes, participants, and circumstances, sequential patterns in the narratives are identified and analyzed, and these patterns are linked to psychological observations. Result: Applied to hundreds of dream narratives, the framework reveals distinctive linguistic patterns; a case study of a war veteran with post-traumatic stress disorder shows verbal processes dominating over mental ones, reflecting a particular psychological state. Conclusion: The framework formalizes style in personal narratives, exposes links between linguistic choices and psychological states, and offers a scalable automated tool for psycholinguistics and narrative analysis. Abstract: Personal narratives are stories authors construct to make meaning of their experiences. Style, the distinctive way authors use language to express themselves, is fundamental to how these narratives convey subjective experiences. Yet there is a lack of a formal framework for systematically analyzing these stylistic choices. We present a novel approach that formalizes style in personal narratives as patterns in the linguistic choices authors make when communicating subjective experiences. Our framework integrates three domains: functional linguistics establishes language as a system of meaningful choices, computer science provides methods for automatically extracting and analyzing sequential patterns, and these patterns are linked to psychological observations. Using language models, we automatically extract linguistic features such as processes, participants, and circumstances. We apply our framework to hundreds of dream narratives, including a case study on a war veteran with post-traumatic stress disorder. Analysis of his narratives uncovers distinctive patterns, particularly how verbal processes dominate over mental ones, illustrating the relationship between linguistic choices and psychological states.

[28] A Novel Framework for Augmenting Rating Scale Tests with LLM-Scored Text Data

Joe Watson,Ivan O'Conner,Chia-Wen Chen,Luning Sun,Fang Luo,David Stillwell

Main category: cs.CL

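TL;DR: Proposes an augmented psychological assessment framework that combines LLM-scored text with traditional rating-scale items and validates it on depression assessment, where it significantly improves measurement precision and accuracy.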

Details

Motivation: Traditional psychological assessments rely on structured rating scales, which cannot capture the rich detail in respondents' natural language and so limit the depth and flexibility of measurement. Method: A large language model scores open-ended text, and these scores are combined with traditional rating-scale items to form an augmented test; the framework is evaluated on real data (n=693) and synthetic data (n=3,000), and the best LLM scoring instructions are selected by item information. Result: On held-out test sets, augmented tests significantly improve measurement precision and accuracy; the information gain from the LLM items is equivalent to adding between 6.3 (real data) and 16.0 (synthetic data) items to the original 19-item test. Conclusion: The framework marks a conceptual shift in automated scoring: without relying on pre-labelled data or complex expert-created rubrics, it makes scalable use of text data to enhance traditional psychometric instruments. Abstract: Psychological assessments typically rely on structured rating scales, which cannot incorporate the rich nuance of a respondent's natural language. This study leverages recent LLM advances to harness qualitative data within a novel conceptual framework, combining LLM-scored text and traditional rating-scale items to create an augmented test. We demonstrate this approach using depression as a case study, developing and assessing the framework on a real-world sample of upper secondary students (n=693) and corresponding synthetic dataset (n=3,000). On held-out test sets, augmented tests achieved statistically significant improvements in measurement precision and accuracy. The information gain from the LLM items was equivalent to adding between 6.3 (real data) and 16.0 (synthetic data) items to the original 19-item test. Our approach marks a conceptual shift in automated scoring that bypasses its typical bottlenecks: instead of relying on pre-labelled data or complex expert-created rubrics, we empirically select the most informative LLM scoring instructions based on calculations of item information. This framework provides a scalable approach for leveraging the growing stream of transcribed text to enhance traditional psychometric measures, and we discuss its potential utility in clinical health and beyond.

[29] dInfer: An Efficient Inference Framework for Diffusion Language Models

Yuxin Ma,Lun Du,Lanning Wei,Kun Chen,Qian Xu,Kangyu Wang,Guofeng Feng,Guoshan Lu,Lin Liu,Xiaojing Qi,Xinyuan Zhang,Zhen Tao,Haibo Feng,Ziyun Jiang,Ying Xu,Zenan Huang,Yihong Zhuang,Haokai Xu,Jiaqi Hu,Zhenzhong Lan,Junbo Zhao,Jianguo Li,Da Zheng

Main category: cs.CL

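TL;DR: Proposes dInfer, an efficient and extensible inference framework for diffusion-based large language models (dLLMs). Through modular design and system-level optimization it substantially speeds up inference while preserving output quality, reaching up to 10× speedup over prior systems, and it is open-sourced.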

Details

Motivation: Diffusion-based large language models (dLLMs) are promising because of their inherent parallelism, but the lack of a standardized, efficient inference framework limits their adoption. Method: dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager), integrates new algorithms for each, and adds system-level optimizations. Result: On LLaDA-MoE, dInfer exceeds 1,100 tokens per second on HumanEval at batch size 1 on 8× H800 GPUs and averages over 800 tokens per second across six benchmarks; it is 10× faster than Fast-dLLM and 2-3× faster than the highly optimized AR model QWen2.5-3B. Conclusion: By co-optimizing algorithms and systems, dInfer markedly improves dLLM inference efficiency, supports practical use of dLLMs, and provides an open, extensible base for future research. Abstract: Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs are emerging, their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on 8× H800 GPUs. Compared to prior systems, dInfer delivers a 10× speedup over Fast-dLLM while maintaining similar model performance. Even compared with the AR model QWen2.5-3B (which has a comparable number of activated parameters and comparable performance, and is highly optimized with the latest vLLM inference engine), dInfer still delivers a 2-3× speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.

[30] Scaling Laws for Code: A More Data-Hungry Regime

Xianzhen Luo,Wenzhen Zheng,Qingfu Zhu,Rongyi Zhang,Houyi Li,Siming Huang,YuanTao Fan,Wanxiang Che

Main category: cs.CL

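TL;DR: Presents the first large-scale empirical study of scaling laws for code LLMs, finding that code models are more data-hungry and need a higher data-to-parameter ratio than natural language models.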

Details

Motivation: Code differs fundamentally from natural language, for example in its strict syntax, so it is unclear whether scaling laws derived from natural language apply to code; a code-specific study is needed. Method: 117 experimental runs covering models from 0.2B to 3.8B parameters and 2B to 128B training tokens are used to fit the Chinchilla and Farseer scaling laws, and code-NL mixtures are compared. Result: The Farseer law is more accurate on code; code models scale effectively with model size but sit in a more data-hungry regime that requires a higher data-to-parameter ratio; mixing in natural language helps in resource-constrained settings but hurts at higher compute budgets. Conclusion: Scaling behavior for code LLMs differs from natural language: more data is needed, and at high compute budgets training should prioritize code data over mixing in natural language. Abstract: Code Large Language Models (LLMs) are revolutionizing software engineering. However, scaling laws that guide the efficient training are predominantly analyzed on Natural Language (NL). Given the fundamental differences like strict syntax between code and NL, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.
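For readers unfamiliar with the Chinchilla form mentioned above, the sketch below evaluates the standard parametric loss L(N, D) = E + A / N^alpha + B / D^beta at several data-to-parameter ratios. The coefficients are hypothetical placeholders chosen only for illustration; they are not the values fitted in this paper, and the Farseer law is not shown.

```python
def chinchilla_loss(n_params, n_tokens, E, A, alpha, B, beta):
    """Chinchilla-style parametric loss: E + A / N**alpha + B / D**beta."""
    return E + A / n_params**alpha + B / n_tokens**beta


# Hypothetical coefficients for illustration only; NOT the paper's fitted values.
COEFFS = dict(E=1.0, A=100.0, alpha=0.3, B=200.0, beta=0.3)

n_params = 1e9  # a 1B-parameter model
for ratio in (20, 50, 100):  # tokens per parameter
    n_tokens = ratio * n_params
    loss = chinchilla_loss(n_params, n_tokens, **COEFFS)
    print(f"tokens/params = {ratio:>3}: predicted loss = {loss:.4f}")
```

Under this form, a larger data term (bigger `B` or smaller `beta`) shifts the compute-optimal point toward more tokens per parameter, which is one way to read the paper's "data-hungry" finding.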

[31] Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning

Li Zhang,Matthias Grabmair,Morgan Gray,Kevin Ashley

Main category: cs.CL

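TL;DR: Proposes a three-stage reasoning framework for evaluating large language models on case-based legal reasoning. Models do well on surface-level reasoning but degrade sharply on hierarchical and integrated analysis, and they spend more computation on incorrect answers.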

Details

Motivation: To examine how well large language models actually handle the complex work of drawing analogies to and distinguishing from precedents in law, and to expose their weaknesses on nuanced reasoning tasks. Method: Cases are represented with factual predicates (factors), organized into a legal knowledge hierarchy, and paired with verifiable rules; identifying distinctions between cases is decomposed into three stages of reasoning tasks. Result: Models achieve high accuracy on Task 1 (surface-level reasoning), degrade on Task 2 (hierarchical reasoning, 64.82%-92.09%), and nearly collapse on Task 3 (integrated analysis, 11.46%-33.99%); incorrect answers consume more computation than correct ones. Conclusion: Current large language models have fundamental limitations on complex legal reasoning, simply spending more reasoning effort does not improve accuracy, and more refined methods are needed for trustworthy legal AI. Abstract: Case-based reasoning is a cornerstone of U.S. legal practice, requiring professionals to argue about a current case by drawing analogies to and distinguishing from past precedents. While Large Language Models (LLMs) have shown remarkable capabilities, their proficiency in this complex, nuanced form of reasoning needs further investigation. We propose a formal framework that decomposes the process of identifying significant distinctions between cases into three-stage reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions, analyzing their argumentative support, and evaluating their significance. Through comprehensive evaluation of modern reasoning LLMs, we reveal a paradox: while models achieve high accuracy on surface-level reasoning (Task 1), performance degrades on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Most strikingly, we find that models consistently expend more computational resources on incorrect responses than correct ones, suggesting that "thinking longer" does not always mean "thinking smarter." Our work provides a methodology for fine-grained analysis of LLM reasoning capabilities in complex domains and reveals fundamental limitations that must be addressed for robust and trustworthy legal AI.

[32] How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective

Xianzhen Luo,Jinyang Huang,Wenzhen Zheng,Qingfu Zhu,Mingzheng Xu,Yiheng Xu,Yuantao Fan,Libo Qin,Wanxiang Che

Main category: cs.CL

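TL;DR: Proposes a framework that finds an optimal diagnostic basis in a binary code-test matrix and uses it to build TC-Bench, a compact, diverse, and inflation-resistant test-case benchmark, revealing that current LLM-based test generation methods have limited diagnostic power.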

Details

Motivation: Existing benchmarks for test-case evaluation suffer from high computational cost, score inflation, and a bias toward trivial bugs, so they do not measure the real diagnostic power of LLM-generated test cases. Method: Benchmark construction is formalized as finding an optimal diagnostic basis in a binary code-test matrix; the WrongSelect algorithm selects maximally diverse wrong codes, and TC-Bench is built from competitive programming submissions. Result: TC-Bench minimizes the number of test cases needed to cover the error patterns; experiments show that state-of-the-art test generation methods reach only about 60% exclusion rates on it, indicating a harder diagnostic challenge. Conclusion: The framework improves the quality and efficiency of test benchmarks, and TC-Bench provides a stricter, more reliable standard for evaluating test generation systems. Abstract: Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. In this work, we ask two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power. Our dataset is available at: https://huggingface.co/datasets/Luoberta/TC-Bench and our code is at: https://github.com/Luowaterbi/TC-Bench.
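To make the role of matrix rank concrete, the sketch below computes the rank of a small binary code-test matrix by exact Gaussian elimination. Several things are assumed: rows are wrong codes, columns are test cases, an entry of 1 means the test exposes that code, and rank is taken over the rationals (the abstract does not state the field). The WrongSelect algorithm itself is not reproduced.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix over the rationals via exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Toy matrix (hypothetical): 4 wrong codes x 3 test cases.
code_test = [
    [1, 0, 0],
    [1, 0, 0],  # duplicate failure pattern of the first code
    [0, 1, 1],
    [1, 1, 1],  # sum of rows 0 and 2, so not an independent pattern
]
print("independent error patterns:", matrix_rank(code_test))
```

In this toy case only two of the four wrong codes carry independent failure patterns, so a basis of size two would represent the whole error space under the stated assumptions.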

[33] How Reliable is Language Model Micro-Benchmarking?

Gregory Yauney,Shahzaib Saqib Warraich,Swabha Swayamdipta

Main category: cs.CL

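TL;DR: Micro-benchmarking has reliability problems for evaluating language models: very small samples (e.g., 10 examples) cannot rank models consistently, as many as 250 examples are often needed, and at that size random sampling is competitive with existing methods.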

Details

Motivation: Because language model development is costly in time and money, researchers use micro-benchmarks to cut evaluation overhead, but it is unclear whether they rank models as reliably as full benchmarks. Method: A meta-evaluation measure is introduced that assesses how well a micro-benchmark ranks two models as a function of their performance difference on the full benchmark, and reliability is analyzed across micro-benchmark sizes. Result: Current micro-benchmarking methods cannot consistently rank model pairs 3.5 accuracy points apart on MMLU-Pro or 4 points apart on BIG-bench Hard; reliably ranking models with similar performance often requires as many as 250 examples; with 25-example micro-benchmarks, more than half of pairwise comparisons among 8B instruction-tuned models are not likely to be preserved. Conclusion: Micro-benchmarks should be used with care, since rankings from very small samples are unreliable; reliability may require larger samples, at which point random sampling is already competitive. The study offers practical guidance on balancing evaluation efficiency and reliability. Abstract: Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.

[34] Coordinates from Context: Using LLMs to Ground Complex Location References

Tessa Masis,Brendan O'Connor

Main category: cs.CL

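TL;DR: This paper proposes an LLM-based method for geocoding compositional location references and finds that a relatively small fine-tuned LLM can match the performance of much larger off-the-shelf models.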

Details

Motivation: Geocoding compositional location references is challenging; existing methods struggle to resolve complex natural-language location descriptions, so better use of LLMs' geospatial knowledge and reasoning is needed. Method: Evaluates LLMs' geospatial knowledge versus reasoning skills, proposes an LLM-based geocoding strategy based on these insights, and fine-tunes a small LLM to improve its performance on the task. Result: The proposed approach improves geocoding performance, and a relatively small fine-tuned LLM achieves performance comparable to much larger off-the-shelf models. Conclusion: LLMs show promise for geocoding compositional location references; using their reasoning abilities well and fine-tuning them can deliver accurate geocoding at lower cost. Abstract: Geocoding is the task of linking a location reference to an actual geographic location and is essential for many downstream analyses of unstructured text. In this paper, we explore the challenging setting of geocoding compositional location references. Building on recent work demonstrating LLMs' abilities to reason over geospatial data, we evaluate LLMs' geospatial knowledge versus reasoning skills relevant to our task. Based on these insights, we propose an LLM-based strategy for geocoding compositional location references. We show that our approach improves performance for the task and that a relatively small fine-tuned LLM can achieve comparable performance with much larger off-the-shelf models.

[35] Measuring Moral LLM Responses in Multilingual Capacities

Kimaya Basu,Savi Kolari,Allison Yu

Main category: cs.CL

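TL;DR: This study evaluates the accuracy and consistency of responses from frontier and open-source LLMs across multiple languages, finding that GPT-5 performs best in every dimension while Gemini 2.5 Pro scores poorly in some key safety categories, underscoring the need for cross-lingual testing and improvement.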

Details

Motivation: As LLMs are used worldwide, there is a growing need to understand and guardrail their multilingual responses, particularly the differences between languages with different resource levels. Method: Uses a five-point grading rubric and a judge LLM to evaluate model responses in five dimensions across low- and high-resource languages. Result: GPT-5 performed best on average in every category; Gemini 2.5 Pro scored lowest in the Consent & Autonomy and Harm Prevention & Safety categories, with averages of 1.39 and 1.98 respectively; other models were more inconsistent across languages and categories. Conclusion: Current LLMs differ markedly in multilingual settings, especially in ethics- and safety-related categories; further study of how linguistic shifts affect model outputs is needed, along with more testing and optimization for low-resource languages. Abstract: With LLM usage becoming widespread across countries, languages, and humanity more broadly, the need to understand and guardrail their multilingual responses increases. Large-scale datasets for testing and benchmarking have been created to evaluate and facilitate LLM responses across multiple dimensions. In this study, we evaluate the responses of frontier and leading open-source models in five dimensions across low and high-resource languages to measure LLM accuracy and consistency across multilingual contexts. We evaluate the responses using a five-point grading rubric and a judge LLM. Our study shows that GPT-5 performed the best on average in each category, while other models displayed more inconsistency across language and category. Most notably, in the Consent & Autonomy and Harm Prevention & Safety categories, GPT scored the highest with averages of 3.56 and 4.73, while Gemini 2.5 Pro scored the lowest with averages of 1.39 and 1.98, respectively. These findings emphasize the need for further testing on how linguistic shifts impact LLM responses across various categories and improvement in these areas.

[36] Learning What to Remember: Adaptive Probabilistic Memory Retention for Memory-Efficient Language Models

S M Rafiuddin,Muntaha Nujat Khan

Main category: cs.CL

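TL;DR: Proposes Adaptive Retention, a layer-wise token selection mechanism that learns which representations to keep under a global budget, substantially reducing memory use and improving throughput while preserving model performance.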

Details

Motivation: The cost of Transformer attention grows quadratically with sequence length, which limits its use in long-context settings. Method: Uses a probabilistic, layer-wise token selection mechanism based on Bernoulli gates, trained via a Hard-Concrete/variational relaxation, with a top-M rule enforcing the global budget M at inference; the method is differentiable and drops into standard encoders. Result: Keeping only 30-50% of tokens preserves at least 95% of full-model performance, cuts peak memory by about 35-45%, and improves throughput by up to about 1.8x. Conclusion: The method requires no changes to base attention or task heads, is architecture-agnostic, and delivers practical efficiency gains for long-context processing. Abstract: Transformer attention scales quadratically with sequence length O(n^2), limiting long-context use. We propose Adaptive Retention, a probabilistic, layer-wise token selection mechanism that learns which representations to keep under a strict global budget M. Retention is modeled with Bernoulli gates trained via a Hard-Concrete/variational relaxation and enforced with a simple top-M rule at inference, making the method differentiable and drop-in for standard encoders. Across classification, extractive QA, and long-document summarization, keeping only 30-50% of tokens preserves >= 95% of full-model performance while cutting peak memory by ~35-45% and improving throughput by up to ~1.8x. This architecture-agnostic approach delivers practical long-context efficiency without modifying base attention or task heads.
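
The inference-time top-M rule mentioned in the abstract can be pictured as keeping the M tokens with the highest retention scores. The sketch below is a minimal single-layer illustration under assumed inputs, not the authors' implementation: `top_m_retain` and the example scores are hypothetical, and how the global budget is shared across layers is not specified in the abstract.

```python
def top_m_retain(scores, m):
    """Return the indices of the m highest-scoring tokens, in original order.

    `scores` stands in for learned per-token retention probabilities
    (an assumption for illustration)."""
    if m < 0:
        raise ValueError("budget must be non-negative")
    # Rank token positions by score, highest first; ties keep the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore sequence order so downstream layers see tokens in position order.
    return sorted(ranked[:m])

gate_scores = [0.1, 0.9, 0.4, 0.8]
kept = top_m_retain(gate_scores, m=2)  # positions 1 and 3 have the top scores
```

Sorting the kept indices back into position order is a design choice of this sketch; it keeps positional information meaningful for any later layer.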

[37] Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective

Wangjie You,Xusheng Wang,Xing Wang,Wenxiang Jiao,Chao Feng,Juntao Li,Min Zhang

Main category: cs.CL

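TL;DR: Proposes CCMOR, a Chinese commonsense multi-hop reasoning benchmark for evaluating LLMs' multi-step reasoning in Chinese-language contexts, validated by human experts; existing models show limitations on long-tail knowledge and knowledge-intensive reasoning, while retrieval-augmented generation substantially improves performance.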

Details

Motivation: Comprehensive evaluation of LLMs in Chinese-language contexts is lacking, in particular benchmarks that test the combination of Chinese-specific knowledge with multi-step logical reasoning. Method: Builds a domain-balanced seed set from existing QA datasets, uses an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains, and applies human-in-the-loop verification in which domain experts validate and refine the questions. Result: State-of-the-art LLMs evaluated on CCMOR show limited ability to handle long-tail knowledge and knowledge-intensive reasoning, while retrieval-augmented generation yields significant gains. Conclusion: CCMOR provides an effective benchmark for Chinese multi-hop reasoning, exposes the knowledge bottlenecks of current LLMs, and confirms that retrieval augmentation helps close knowledge gaps. Abstract: While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of the resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs' ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.

[38] MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding

Siddeshwar Raghavan,Tanwi Mallick

Main category: cs.CL

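TL;DR: MOSAIC is a training-free multi-agent LLM framework designed for challenging scientific coding tasks; within a student-teacher paradigm its agents self-reflect, generate rationales, write code, and debug, improving accuracy, robustness, and interpretability.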

Details

Motivation: Scientific computing tasks demand rigorous algorithms, deep domain knowledge, and domain-specific reasoning; general-purpose code generation methods fall short, especially when no I/O test cases are available and chained subproblems must be solved. Method: Proposes the MOSAIC framework, a multi-agent architecture that performs self-reflection and collaboration within a student-teacher paradigm, combining stepwise problem decomposition, targeted error correction, and a Consolidated Context Window (CCW) to support iteration and reasoning in scientific coding. Result: On scientific coding benchmarks, MOSAIC outperforms existing approaches in accuracy, robustness, and interpretability, and mitigates LLM hallucinations on complex chained tasks. Conclusion: MOSAIC offers an efficient, reliable multi-agent LLM solution for scientific programming that improves complex task solving through structured collaboration and context management without any training. Abstract: We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks. Unlike general-purpose coding, scientific workflows require algorithms that are rigorous, interconnected with deep domain knowledge, and incorporate domain-specific reasoning, as well as algorithm iteration without requiring I/O test cases. Many scientific problems also require a sequence of subproblems to be solved, leading to the final desired result. MOSAIC is designed as a training-free framework with specially designed agents to self-reflect, create the rationale, code, and debug within a student-teacher paradigm to address the challenges of scientific code generation. This design facilitates stepwise problem decomposition, targeted error correction, and, when combined with our Consolidated Context Window (CCW), mitigates LLM hallucinations when solving complex scientific tasks involving chained subproblems. We evaluate MOSAIC on scientific coding benchmarks and demonstrate that our specialized agentic framework outperforms existing approaches in terms of accuracy, robustness, and interpretability.

[39] The Model's Language Matters: A Comparative Privacy Analysis of LLMs

Abhishek K. Mishra,Antoine Boutet,Lucas Magnana

Main category: cs.CL

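TL;DR: This paper studies how language structure affects privacy leakage in LLMs and finds that privacy risk correlates with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, English shows higher membership separability, and French and Spanish are more resilient owing to their higher morphological complexity.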

Details

Motivation: LLMs are widely deployed in multilingual applications that handle sensitive data, yet their privacy risks are understudied outside English, so it is necessary to examine how different language structures affect privacy leakage. Method: The authors quantify six linguistic indicators and evaluate three attack vectors (extraction, counterfactual memorization, and membership inference) on LLMs trained on English, Spanish, French, and Italian medical corpora. Result: Privacy vulnerability scales with linguistic redundancy and tokenization granularity; Italian exhibits the strongest leakage, English shows higher membership separability, and French and Spanish display greater resilience due to higher morphological complexity. Conclusion: Language structure significantly affects privacy leakage in LLMs, and deployments need language-aware privacy-preserving mechanisms. Abstract: Large Language Models (LLMs) are increasingly deployed across multilingual applications that handle sensitive data, yet their scale and linguistic variability introduce major privacy risks. Because these risks have mostly been evaluated for English, this paper investigates how language structure affects privacy leakage in LLMs trained on English, Spanish, French, and Italian medical corpora. We quantify six linguistic indicators and evaluate three attack vectors: extraction, counterfactual memorization, and membership inference. Results show that privacy vulnerability scales with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, while English shows higher membership separability. In contrast, French and Spanish display greater resilience due to higher morphological complexity. Overall, our findings provide the first quantitative evidence that language matters in privacy leakage, underscoring the need for language-aware privacy-preserving mechanisms in LLM deployments.

Jia Ao Sun,Hao Yu,Fabrizio Gotti,Fengran Mo,Yihong Wu,Yuchen Hui,Jian-Yun Nie

Main category: cs.CL

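TL;DR: Proposes the Search-on-Graph (SoG) framework, which improves LLM performance on knowledge-intensive multi-hop question answering through iterative, observation-based graph navigation and achieves state-of-the-art results on multiple benchmarks without fine-tuning.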

Details

Motivation: LLMs are unreliable on multi-hop knowledge questions because they miss long-tail facts, hallucinate, and have outdated knowledge; existing knowledge graph question answering (KGQA) methods face trade-offs in query construction, noise, and search-space size. Method: Designs a single Search function that follows an "observe-then-navigate" principle: at each step the LLM inspects the relations actually available from the current entity before choosing the next hop; adaptive filtering handles high-degree nodes, and the approach adapts to different KG schemas. Result: Achieves state-of-the-art performance on six KGQA benchmarks spanning Freebase and Wikidata without fine-tuning, improving by 16% over the previous best methods on Wikidata benchmarks and showing consistent gains on Freebase benchmarks. Conclusion: SoG effectively combines LLMs with knowledge graphs through lightweight, iterative graph navigation, overcoming the limitations of earlier KGQA methods and showing strong multi-hop reasoning and broad applicability. Abstract: Large language models (LLMs) have demonstrated impressive reasoning abilities yet remain unreliable on knowledge-intensive, multi-hop questions: they miss long-tail facts, hallucinate when uncertain, and their internal knowledge lags behind real-world change. Knowledge graphs (KGs) offer a structured source of relational evidence, but existing KGQA methods face fundamental trade-offs: compiling complete SPARQL queries without knowing available relations proves brittle, retrieving large subgraphs introduces noise, and complex agent frameworks with parallel exploration exponentially expand search spaces. To address these limitations, we propose Search-on-Graph (SoG), a simple yet effective framework that enables LLMs to perform iterative informed graph navigation using a single, carefully designed Search function. Rather than pre-planning paths or retrieving large subgraphs, SoG follows an "observe-then-navigate" principle: at each step, the LLM examines actual available relations from the current entity before deciding on the next hop. This approach further adapts seamlessly to different KG schemas and handles high-degree nodes through adaptive filtering. Across six KGQA benchmarks spanning Freebase and Wikidata, SoG achieves state-of-the-art performance without fine-tuning. We demonstrate particularly strong gains on Wikidata benchmarks (+16% improvement over previous best methods) alongside consistent improvements on Freebase benchmarks.

[41] Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

Ragib Amin Nihal,Rui Wen,Kazuhiro Nakadai,Jun Sakuma

Main category: cs.CL

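TL;DR: Proposes the Pattern Enhanced Chain of Attack (PE-CoA) framework, which builds effective multi-turn jailbreaks from five conversation patterns, achieves state-of-the-art performance across twelve LLMs and ten harm categories, and reveals pattern-specific weaknesses and behavioral characteristics of the models.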

Details

Motivation: Existing methods mostly rely on heuristic or ad hoc exploration strategies, give limited insight into the mechanisms behind model weaknesses, and lack a systematic study of how conversation patterns relate to model vulnerabilities. Method: Proposes the PE-CoA framework, comprising five systematic conversation patterns for constructing natural multi-turn jailbreaks, and evaluates it on 12 LLMs across 10 harm categories. Result: Achieves state-of-the-art attack performance; different models are vulnerable to different conversation patterns, and models in the same family share similar failure modes. Conclusion: Safety training has limitations, and finer-grained, pattern-aware defenses are needed. Abstract: Large language models (LLMs) remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories (like malware generation, harassment, or fraud) through distinct conversational approaches (educational discussions, personal experiences, hypothetical scenarios). Existing multi-turn jailbreaking methods often rely on heuristic or ad hoc exploration strategies, providing limited insight into underlying model weaknesses. The relationship between conversation patterns and model vulnerabilities across harm categories remains poorly understood. We propose Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns to construct effective multi-turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics: models exhibit distinct weakness profiles where robustness to one conversational pattern does not generalize to others, and model families share similar failure modes. These findings highlight limitations of safety training and indicate the need for pattern-aware defenses. Code available on: https://github.com/Ragib-Amin-Nihal/PE-CoA

[42] Quality Estimation Reranking for Document-Level Translation

Krzysztof Mrozinski,Minji Kang,Ahmed Khota,Vincent Michael Sutanto,Giovanni Gatti De Giacomo

Main category: cs.CL

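TL;DR: This paper studies quality estimation (QE) reranking for document-level machine translation and finds that both the learned metric SLIDE and the LLM-based metric GEMBA-DA significantly improve translation quality, with larger gains as the number of candidates grows, demonstrating the practical value of document-level QE.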

Details

Motivation: QE reranking is known to be effective at the sentence level, but its application to the increasingly important setting of document-level translation remains underexplored; this paper aims to fill that gap. Method: Evaluates the reranking performance of various learned and LLM-based QE metrics, including SLIDE and GEMBA-DA, on document-level translation, in contrast to the typical sentence-level setting. Result: With SLIDE, BLEURT-20 improves by +2.00 with only two candidates and by +5.09 with 32; GEMBA-DA yields gains of +1.63 and +4.30 respectively; even on the longest inputs (512-1024 source tokens), 32 candidates still give gains of +2.34 (SLIDE) and +1.40 (GEMBA-DA). Conclusion: Document-level QE reranking effectively improves machine translation quality, especially with many candidates, and has low runtime overhead, making it practical. Abstract: Quality estimation (QE) reranking is a form of quality-aware decoding which aims to improve machine translation (MT) by scoring and selecting the best candidate from a pool of generated translations. While known to be effective at the sentence level, its application to the increasingly prominent domain of document-level translation remains underexplored. In this work, we evaluate QE reranking performance on document-level (rather than the typical sentence-level) translation, using various learned and large language model (LLM)-based QE metrics. We find that with our best learned metric, SLIDE, BLEURT-20 scores improve by +2.00 with only two candidates, and by +5.09 with 32, across both decoder-only LLM models and encoder-decoder neural machine translation (NMT) models. Using the best LLM-based metric, GEMBA-DA, gains of +1.63 and +4.30 are achieved under the same conditions. Although gains shrink with longer inputs, reranking with 32 candidates yields improvements of +2.34 (SLIDE) and +1.40 (GEMBA-DA) on our longest documents (512-1024 source tokens). These findings demonstrate the practical value of document-level QE, with minimal runtime overhead given suitable translation models and hardware.
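
QE reranking as described above reduces to scoring every candidate translation with a reference-free metric and keeping the best one. The following is a minimal sketch of that selection step only. The `rerank` helper and the `toy_qe` scorer are hypothetical; a real setup would plug in a document-level QE metric such as those named in the abstract, whose APIs are not shown here.

```python
from typing import Callable, Sequence

def rerank(source: str,
           candidates: Sequence[str],
           qe_score: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest QE score (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda hyp: qe_score(source, hyp))

def toy_qe(source: str, hypothesis: str) -> float:
    # Placeholder scorer (assumption): prefers hypotheses whose word count
    # is close to the source's. It stands in for a learned QE model.
    return -abs(len(source.split()) - len(hypothesis.split()))

best = rerank("a b c", ["x", "x y z", "x y"], toy_qe)
```

Because the scorer is passed in as a callable, the same selection code works for any candidate pool size, which is the variable the abstract reports gains against.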

[43] FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

Yan Wang,Keyi Wang,Shanshan Yang,Jaisal Patel,Jeff Zhao,Fengran Mo,Xueqing Peng,Lingfei Qian,Jimin Huang,Guojun Xiong,Xiao-Yang Liu,Jian-Yun Nie

Main category: cs.CL

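TL;DR: This paper introduces FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for financial auditing tasks, with three subtasks covering semantic, relational, and numerical consistency, for evaluating LLM reasoning over structured financial filings. Experiments show that existing models degrade sharply on hierarchical multi-document structures, revealing systematic weaknesses of current LLMs in taxonomy-grounded financial reasoning.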

Details

Motivation: The complexity of GAAP and the hierarchical structure of XBRL filings make financial auditing hard to automate, and the ability of LLMs to reason over structured, interdependent, taxonomy-driven financial documents remains unclear, so a dedicated benchmark is needed. Method: Builds the FinAuditing benchmark from real US-GAAP-compliant XBRL filings with three subtasks, FinSM (semantic consistency), FinRE (relational consistency), and FinMR (numerical consistency), and proposes a unified evaluation framework that integrates retrieval, classification, and reasoning metrics, tested zero-shot on 13 LLMs. Result: Zero-shot experiments on 13 state-of-the-art LLMs show inconsistent performance across semantic, relational, and mathematical reasoning, with accuracy drops of up to 60-90% on hierarchical multi-document structures. Conclusion: Modern LLMs have systematic limitations in taxonomy-grounded financial reasoning; FinAuditing provides a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. Abstract: The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.

[44] Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR

Haomin Zhuang,Yujun Zhou,Taicheng Guo,Yue Huang,Fangxu Liu,Kai Song,Xiangliang Zhang

Main category: cs.CL

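TL;DR: This paper proposes explicitly promoting exploration during LLM reasoning in reinforcement learning by applying different sampling temperatures to different token types, which improves reasoning performance.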

Details

Motivation: Existing methods do not explicitly encourage exploration during token generation and do not exploit the different characteristics of reasoning tokens and knowledge tokens. Method: Applies higher temperatures to high-entropy reasoning tokens to encourage exploration and lower temperatures to low-entropy knowledge tokens to maintain factual correctness, and investigates several multi-temperature scheduling strategies. Result: Experiments on several reasoning benchmarks show that the method significantly improves the reasoning performance of LLMs. Conclusion: Distinguishing token types and applying multi-temperature sampling effectively balances exploration and exploitation and improves LLM reasoning in reinforcement learning. Abstract: Reinforcement Learning has demonstrated substantial improvements in the reasoning abilities of Large Language Models (LLMs), exhibiting significant applicability across various domains. Recent research has identified that tokens within LLMs play distinct roles during reasoning tasks, categorizing them into high-entropy reasoning tokens and low-entropy knowledge tokens. Prior approaches have typically focused on restricting updates to indirectly encourage exploration, yet they do not explicitly facilitate exploratory behavior during the token generation stage itself. In this work, we introduce a complementary approach that explicitly promotes exploration during sampling by applying distinct temperature settings for different token types. Specifically, our method employs higher temperatures for reasoning tokens to actively encourage exploration, while retaining lower temperatures for knowledge tokens to maintain factual correctness. Furthermore, we systematically investigate various multi-temperature scheduling strategies and their impacts within reinforcement learning contexts. Empirical evaluations on several reasoning benchmarks demonstrate that our approach significantly enhances the reasoning performance of LLMs. The code is available at https://github.com/zhmzm/Multi_Temperature_Verl.git.
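
One way to picture the idea of token-dependent temperatures is to measure the entropy of the next-token distribution and pick a temperature from it. The sketch below assumes a simple entropy threshold. The threshold, the two temperature values, and the function names are illustrative assumptions, not settings or code from the paper or its repository.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_temperature(logits, threshold=1.0, t_high=1.2, t_low=0.7):
    """Hypothetical rule: treat high-entropy positions as 'reasoning' tokens
    (sample hotter) and low-entropy positions as 'knowledge' tokens (cooler)."""
    return t_high if entropy(softmax(logits)) > threshold else t_low

uncertain = pick_temperature([0.0, 0.0, 0.0, 0.0])   # uniform: entropy = ln 4
confident = pick_temperature([10.0, 0.0, 0.0, 0.0])  # peaked: entropy near 0
probs = softmax([2.0, 1.0, 0.0], temperature=uncertain)
```

A uniform distribution over four tokens has entropy ln 4 (about 1.39 nats), above the assumed threshold, so it receives the higher temperature, while the sharply peaked distribution receives the lower one.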

[45] A Unified Biomedical Named Entity Recognition Framework with Large Language Models

Tengxiao Lv,Ling Luo,Juntao Li,Yanhua Wang,Yuchen Pan,Chao Liu,Yanan Wang,Yan Jiang,Huiyi Lv,Yuanyuan Sun,Jian Wang,Hongfei Lin

Main category: cs.CL

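TL;DR: Proposes a unified LLM-based biomedical named entity recognition framework that uses a symbolic tagging strategy and a contrastive-learning entity selector to achieve state-of-the-art performance and strong zero-shot generalization in multilingual and nested-entity settings.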

Details

Motivation: Existing methods struggle with nested entities, ambiguous entity boundaries, and cross-lingual generalization, which makes accurate biomedical named entity recognition difficult. Method: Reformulates BioNER as a text generation task, designs a symbolic tagging strategy that handles both flat and nested entities, and uses bilingual joint fine-tuning together with a contrastive-learning-based entity selector to improve multilingual and multi-task generalization. Result: Achieves state-of-the-art performance on four benchmark datasets and two unseen corpora, with robust cross-lingual zero-shot generalization. Conclusion: The proposed unified LLM-based BioNER framework effectively addresses nested entity recognition, boundary ambiguity, and cross-lingual generalization, improving the accuracy of biomedical information extraction. Abstract: Accurate recognition of biomedical named entities is critical for medical information extraction and knowledge discovery. However, existing methods often struggle with nested entities, entity boundary ambiguity, and cross-lingual generalization. In this paper, we propose a unified Biomedical Named Entity Recognition (BioNER) framework based on Large Language Models (LLMs). We first reformulate BioNER as a text generation task and design a symbolic tagging strategy to jointly handle both flat and nested entities with explicit boundary annotation. To enhance multilingual and multi-task generalization, we perform bilingual joint fine-tuning across multiple Chinese and English datasets. Additionally, we introduce a contrastive learning-based entity selector that filters incorrect or spurious predictions by leveraging boundary-sensitive positive and negative samples. Experimental results on four benchmark datasets and two unseen corpora show that our method achieves state-of-the-art performance and robust zero-shot generalization across languages. The source codes are freely available at https://github.com/dreamer-tx/LLMNER.

[46] Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors

Xin Liu,RunSong Zhao,PengCheng Huang,XinYu Liu,JunYi Xiao,ChunYang Xiao,Tong Xiao,Shengxiang Gao,Zhengtao Yu,JingBo Zhu

Main category: cs.CL

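TL;DR: Proposes Semantic-Anchor Compression (SAC), a new context compression method that selects anchor tokens directly from the original context and aggregates information into their key-value representations, avoiding the mismatch between reconstruction-oriented optimization and downstream tasks found in autoencoding-based compression.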

Details

Motivation: Existing context compression methods based on autoencoding tasks are optimized for reconstruction rather than for actual downstream tasks, so the compressed features perform poorly in real use. Method: SAC does not rely on autoencoding training; it selects anchor tokens from the original context and, through anchor embeddings and a bidirectional attention modification, lets the anchor tokens capture information from the whole context and directly build KV representations. Result: SAC outperforms existing context compression methods across various compression ratios; in out-of-distribution evaluation on MRQA it improves EM by 1 at 5x compression, with larger advantages at higher compression ratios. Conclusion: By eliminating autoencoding pre-training and compressing directly with contextual tokens, SAC achieves more efficient, downstream-oriented context compression with good practicality and potential to scale. Abstract: Context compression presents a promising approach for accelerating large language model (LLM) inference by compressing long contexts into compact representations. Current context compression methods predominantly rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics. While autoencoding tasks enable compression tokens to acquire compression capabilities, compression via autoencoding tasks creates a fundamental mismatch: the models are optimized for reconstruction objectives that diverge from actual downstream tasks, thereby weakening the features more beneficial for real-world usage. We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding-task-based compression to an architecture that is equipped with this compression capability a priori. Instead of training models to compress contexts through autoencoding tasks, SAC directly selects so-called anchor tokens from the original context and aggregates contextual information into their key-value (KV) representations. By deriving representations directly from the contextual tokens, SAC eliminates the need for autoencoding training. To ensure compression performance while directly leveraging anchor tokens, SAC incorporates two key designs: (1) anchor embeddings that enable the compressor to identify critical tokens, and (2) bidirectional attention modification that allows anchor tokens to capture information from the entire context. Experimental results demonstrate that SAC consistently outperforms existing context compression methods across various compression ratios. On out-of-distribution evaluation using MRQA, SAC achieves 1 EM improvement at 5x compression over strong baselines, with increasing advantages at higher compression ratios.

[47] Artificial Impressions: Evaluating Large Language Model Behavior Through the Lens of Trait Impressions

Nicholas Deas,Kathleen McKeown

Main category: cs.CL

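TL;DR: This paper introduces "artificial impressions", patterns in LLMs' internal representations of prompts that resemble human impressions and stereotypes, and uses linear probes to predict them according to the Stereotype Content Model (SCM). Although models report impressions inconsistently when prompted directly, impressions are more consistently linearly decodable from hidden representations and predict the quality and hedging of model responses; the paper also studies how content, stylistic, and dialectal features of prompts affect these impressions.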

Details

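Motivation: To explore whether LLMs internally form human-like social impressions and stereotypes and how such impressions influence model behavior, with the aim of uncovering latent bias mechanisms in model decision-making. Method: Fits linear probes on generated prompts to predict artificial impressions under the two-dimensional Stereotype Content Model (SCM), analyzes how decodable impressions are from hidden representations and how they relate to downstream behavior (such as response quality and hedging), and examines how linguistic features of prompts (content, style, dialect) affect impressions. Result: LLMs express impressions inconsistently when explicitly prompted, but artificial impressions are consistently linearly decodable from hidden layers; artificial impressions predict response quality and the use of hedging; particular linguistic features of prompts significantly affect the impressions formed. Conclusion: LLMs form identifiable artificial impressions in their internal representations; these do not always surface in outputs but systematically influence model behavior, which calls for attention to implicit social-cognitive biases in models. Abstract: We introduce and study artificial impressions: patterns in LLMs' internal representations of prompts that resemble human impressions and stereotypes based on language. We fit linear probes on generated prompts to predict impressions according to the two-dimensional Stereotype Content Model (SCM). Using these probes, we study the relationship between impressions and downstream model behavior as well as prompt features that may inform such impressions. We find that LLMs inconsistently report impressions when prompted, but also that impressions are more consistently linearly decodable from their hidden representations. Additionally, we show that artificial impressions of prompts are predictive of the quality and use of hedging in model responses. We also investigate how particular content, stylistic, and dialectal features in prompts impact LLM impressions.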

[48] SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures

Jiaming Wang,Zhe Tang,Yilin Jin,Peng Ding,Xiaoyu Li,Xuezhi Cao

Main category: cs.CL

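TL;DR: Proposes SOP-Maze, a benchmark built from real-world business data for evaluating LLMs in complex standard operating procedure (SOP) scenarios, and finds that existing models have significant weaknesses in following procedures, handling dialogue, and calculation.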

Details

Motivation: Existing benchmarks do not adequately evaluate LLMs' instruction following and decision making in complex, real-world business SOP scenarios, so evaluation tasks closer to practice are needed. Method: Builds SOP-Maze from real-world business data, with 397 tasks from 23 complex SOP scenarios, categorizes tasks into two classes, Lateral Root System (LRS) and Heart Root System (HRS), and analyzes model performance experimentally. Result: Nearly all state-of-the-art models struggle with SOP-Maze; three main error categories are identified: route blindness, conversational fragility, and calculation errors. Conclusion: SOP-Maze exposes key weaknesses of current LLMs in handling complex business procedures and provides new directions and an evaluation tool for improving model reliability in real-world SOP settings. Abstract: As large language models (LLMs) are widely deployed as domain-specific agents, many benchmarks have been proposed to evaluate their ability to follow instructions and make decisions in real-world scenarios. However, business scenarios often involve complex standard operating procedures (SOPs), and the evaluation of LLM capabilities in such contexts has not been fully explored. To bridge this gap, we propose SOP-Maze, a benchmark constructed from real-world business data and adapted into a collection of 397 tasks from 23 complex SOP scenarios. We further categorize SOP tasks into two broad classes: Lateral Root System (LRS), representing wide-option tasks that demand precise selection; and Heart Root System (HRS), which emphasizes deep logical reasoning with complex branches. Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze. We conduct a comprehensive analysis and identify three key error categories: (i) route blindness: difficulty following procedures; (ii) conversational fragility: inability to handle real dialogue nuances; and (iii) calculation errors: mistakes in time or arithmetic reasoning under complex contexts. The systematic study explores LLM performance across SOP tasks that challenge both breadth and depth, offering new insights for improving model capabilities. We have open-sourced our work on https://github.com/ADoublLEN/SOP-Maze.

[49] A Human Behavioral Baseline for Collective Governance in Software Projects

Mobina Noori,Mahasweta Chakraborti,Amy X Zhang,Seth Frey

Main category: cs.CL

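TL;DR: By analyzing version-controlled governance documents from 710 open source projects, this study finds that projects define more roles and actions over time and distribute them more evenly, while the composition of rules remains stable.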

Details

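Motivation: To examine how open source communities describe participation and control through version-controlled governance documents. Method: Builds a corpus of 710 projects with paired snapshots, parses the text into actors, rules, actions, and objects, and measures change using entropy, richness, and Jensen-Shannon divergence. Result: Projects define more roles and actions over time and distribute them more evenly, while rule composition stays stable, indicating that governance develops by expanding and balancing categories of participation rather than by changing prescriptive force. Conclusion: Governance evolves through the expansion and balancing of participation categories without a fundamental normative shift; the study provides a reproducible baseline for assessing whether future AI-mediated workflows concentrate or redistribute authority. Abstract: We study how open source communities describe participation and control through version-controlled governance documents. Using a corpus of 710 projects with paired snapshots, we parse text into actors, rules, actions, and objects, then group them and measure change with entropy for evenness, richness for diversity, and Jensen-Shannon divergence for drift. Projects define more roles and more actions over time, and these are distributed more evenly, while the composition of rules remains stable. These findings indicate that governance grows by expanding and balancing categories of participation without major shifts in prescriptive force. The analysis provides a reproducible baseline for evaluating whether future AI-mediated workflows concentrate or redistribute authority.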
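
The three measures named in the abstract (entropy for evenness, richness for diversity, Jensen-Shannon divergence for drift) have standard definitions that can be sketched on category counts from two snapshots. The role labels below are invented toy data, and the paper's exact normalization choices are not stated in the abstract, so base-2 logarithms are an assumption here.

```python
import math
from collections import Counter

def distribution(labels):
    """Turn a list of category labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def richness(dist):
    """Number of distinct categories present."""
    return len(dist)

def shannon_entropy(dist):
    """Entropy in bits; higher means categories are used more evenly."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy snapshots of role mentions in a governance file (hypothetical labels).
early = distribution(["maintainer", "maintainer", "maintainer", "contributor"])
late = distribution(["maintainer", "contributor", "reviewer", "triager"])
drift = js_divergence(early, late)
```

In this toy case the later snapshot has higher richness (4 roles versus 2) and higher entropy (2 bits versus about 0.81), mirroring the "more roles, more evenly distributed" pattern the abstract reports.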

[50] Creation of the Chinese Adaptive Policy Communication Corpus

Bolun Sun,Charles Chang,Yuen Yuen Ang,Pingxu Hao,Ruotong Mu,Yuchen Xu,Zhengxin Zhang

Main category: cs.CL

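TL;DR: This paper introduces CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy, covering national laws, administrative regulations, and ministerial rules issued by China's top authorities from 1949 to 2023.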

Details

Motivation: To support downstream tasks in policy communication and multilingual NLP research by building a high-quality, clearly annotated corpus of Chinese policy directives. Method: Building on Ang's theory of adaptive policy communication, annotates clear and ambiguous language in policy texts with a five-color taxonomy, segments documents into paragraph units, and creates a gold-standard annotation set through two rounds of labeling by expert and trained coders. Result: The corpus contains 3.3 million paragraph units; inter-annotator agreement reaches a Fleiss's kappa of K = 0.86, indicating high reliability; metadata, the labeling framework, the codebook, and baseline LLM classification results are also released. Conclusion: CAPC-CG provides a reliable data foundation for research on Chinese policy communication and helps advance multilingual NLP in policy analysis. Abstract: We introduce CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang's theory of adaptive policy communication. Spanning 1949-2023, this corpus includes national laws, administrative regulations, and ministerial rules issued by China's top authorities. Each document is segmented into paragraphs, producing a total of 3.3 million units. Alongside the corpus, we release comprehensive metadata, a two-round labeling framework, and a gold-standard annotation set developed by expert and trained coders. Inter-annotator agreement achieves a Fleiss's kappa of K = 0.86 on directive labels, indicating high reliability for supervised modeling. We provide baseline classification results with several large language models (LLMs), together with our annotation codebook, and describe patterns from the dataset. This release aims to support downstream tasks and multilingual NLP research in policy communication.
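
For readers unfamiliar with the agreement statistic reported above, Fleiss's kappa compares observed agreement among a fixed number of raters with the agreement expected by chance. The sketch below implements the standard formula on a small invented rating table; it does not use the corpus's annotations, and the table values are purely illustrative.

```python
def fleiss_kappa(table):
    """Fleiss's kappa for table[i][j] = number of raters who put item i
    into category j. Every item must have the same number of raters."""
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_items
    # Chance agreement from the overall category proportions.
    n_cats = len(table[0])
    p_j = [sum(row[j] for row in table) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Three raters, two categories (toy data).
perfect = fleiss_kappa([[3, 0], [0, 3]])  # every rater agrees on every item
split = fleiss_kappa([[2, 1], [1, 2]])    # agreement below chance level
```

A value of 1 means complete agreement and values near or below 0 mean agreement no better than chance, which is why a kappa of 0.86 is read as high reliability.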

[51] MASA: LLM-Driven Multi-Agent Systems for Autoformalization

Lan Zhang,Marco Valentino,André Freitas

Main category: cs.CL

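TL;DR: This paper presents MASA, an LLM-driven framework for building multi-agent systems that convert natural language into formal representations, emphasizing modularity, flexibility, and extensibility, and demonstrates its effectiveness on mathematical definitions and formal mathematics datasets.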

Details

Motivation: To improve the efficiency and reliability of converting natural language into formal logic and to address the limited adaptability of existing methods to complex cases. Method: Designs MASA, a modular multi-agent system in which multiple collaborating agents combine LLMs with theorem provers to formalize natural language statements. Result: Use cases on real-world mathematical definitions and experiments on formal mathematics datasets show that MASA effectively supports autoformalization, with good extensibility and practical potential. Conclusion: MASA provides an efficient, flexible multi-agent framework for autoformalization and shows the promise of combining LLMs with theorem provers for formal reasoning. Abstract: Autoformalization serves a crucial role in connecting natural language and formal reasoning. This paper presents MASA, a novel framework for building multi-agent systems for autoformalization driven by Large Language Models (LLMs). MASA leverages collaborative agents to convert natural language statements into their formal representations. The architecture of MASA is designed with a strong emphasis on modularity, flexibility, and extensibility, allowing seamless integration of new agents and tools to adapt to a fast-evolving field. We showcase the effectiveness of MASA through use cases on real-world mathematical definitions and experiments on formal mathematics datasets. This work highlights the potential of multi-agent systems powered by the interaction of LLMs and theorem provers in enhancing the efficiency and reliability of autoformalization, providing valuable insights and support for researchers and practitioners in the field.

[52] DARO: Difficulty-Aware Reweighting Policy Optimization

Jingyu Zhou,Lu Ma,Hao Liang,Chengyu Shen,Bin Cui,Wentao Zhang

Main category: cs.CL

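TL;DR: This paper proposes DARO, a new reinforcement learning method that addresses the learning inefficiency caused by static weighting in existing verifiable-reward training of reasoning models; by dynamically adjusting the loss weights of samples of different difficulty, it significantly improves convergence speed and performance on mathematical reasoning tasks.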

Details

Motivation: GRPO and its variants use static or overly simplistic weighting schemes for samples of different difficulty, which cannot adapt as the model's capabilities change; training therefore focuses disproportionately on certain difficulty levels and overall performance suffers. Method: Proposes Difficulty-Aware Reweighting Policy Optimization (DARO), which dynamically adjusts the loss contribution of each difficulty group according to the model's learning state, giving more balanced and efficient training. Result: In extensive experiments on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B, DARO outperforms four leading baselines across six math benchmarks, with faster convergence and higher final performance. Conclusion: Through dynamic difficulty-aware reweighting, DARO resolves the loss scale imbalance in existing RLVR methods and offers a better training framework for improving LLM reasoning. Abstract: Recent advances in large language models (LLMs) have shown that reasoning ability can be significantly enhanced through Reinforcement Learning with Verifiable Rewards (RLVR). Group Relative Policy Optimization (GRPO) has emerged as the de facto approach for RLVR, inspiring numerous variants. However, our mathematical analysis reveals that these methods are fundamentally weighted variations of GRPO. We provide a unified view, demonstrating that their reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model's evolving capabilities. This creates a significant loss scale issue, where training disproportionately focuses on certain difficulty levels at the expense of others, hindering overall performance. To address these limitations, we introduce Difficulty-Aware Reweighting Policy Optimization (DARO), a method that dynamically adjusts the loss contribution of each difficulty group based on the model's learning state. Extensive experiments on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B show that DARO outperforms four leading baselines across six math benchmarks, achieving significantly faster convergence and superior final performance.

[53] Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models

Yutao Mou,Xiaoling Zhou,Yuxiao Luo,Shikun Zhang,Wei Ye

Main category: cs.CL

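TL;DR: This paper proposes LoRA-based refusal training, which achieves safety alignment without harming general performance while using only safety data. Theory and experiments show that LoRA decouples safety into a low-rank subspace, avoiding interference with the model's existing capabilities.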

Details

Motivation: Improving model safety often degrades general performance, and existing methods rely on expensive searches over data proportions to balance the two, at high cost and with limited gains. Method: Applies LoRA-based refusal training, fine-tuning only on safety-critical data, and analyzes theoretically and experimentally how LoRA confines safety behavior to a low-rank, largely orthogonal subspace. Result: LoRA-based refusal training preserves general performance even when trained solely on safety data; LoRA is shown to work as an efficient, plug-and-play safety patch that decouples safety into a low-rank subspace approximately orthogonal to the model's intrinsic transformation space. Conclusion: LoRA is a low-cost, plug-and-play approach to safety alignment whose low-rank structure decouples safety from performance, offering an efficient and practical path toward trustworthy AI. Abstract: Safety alignment is essential for building trustworthy artificial intelligence, yet it remains challenging to enhance model safety without degrading general performance. Current approaches require computationally expensive searches for the optimal proportion of safety-critical and general-purpose data to balance safety and general performance, incurring high costs with limited gains. In this work, we show that LoRA-based Refusal-training enables performance-preserving safety alignment even when trained solely on safety data, demonstrating that LoRA serves as cost-efficient, performance-preserving, and plug-and-play safety patches. Beyond empirical findings, we provide both theoretical and experimental evidence that LoRA effectively decouples safety into a low-rank subspace largely orthogonal to the model's intrinsic transformation space, ensuring that safety enhancements do not interfere with inherent capabilities.
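
To make the "low-rank patch" idea concrete, here is a small sketch of the standard LoRA update, in which a frozen weight W is combined with a scaled product of two thin matrices, W + (alpha / r) * B A. The matrices and the `lora_merge` helper are toy values invented for illustration; the sketch shows only the generic LoRA arithmetic, not the paper's training procedure or its orthogonality analysis.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_merge(w, b, a, alpha, r):
    """Return W + (alpha / r) * B @ A without modifying W."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(wrow, drow)]
            for wrow, drow in zip(w, delta)]

# Frozen 3x3 base weight and a rank-1 adapter (B is 3x1, A is 1x3).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]
A = [[0.5, 0.0, -0.5]]
W_patched = lora_merge(W, B, A, alpha=2.0, r=1)
print(W_patched)
```

Because the base weight is never overwritten, the adapter can be attached or dropped at will, which is the sense in which such a patch is plug-and-play.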

[54] LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction

Shengmin Piao,Jieun Lee,Sanghyun Park

Main category: cs.CL

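TL;DR: This paper presents LitE-SQL, a lightweight and efficient Text-to-SQL framework made up of a schema retriever and a SQL generator. It performs strongly on BIRD and Spider with far fewer parameters than LLM-based methods, making it suitable for resource-constrained and privacy-sensitive settings.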

Details

Motivation: Existing LLM-based Text-to-SQL methods depend on proprietary models, which makes deployment difficult and raises data privacy concerns, so a lightweight, efficient, deployable alternative is needed. Method: Proposes the LitE-SQL framework: (1) a Schema Retriever that performs efficient schema linking over a vector database of pre-computed schema embeddings; (2) a SQL Generator fine-tuned in two stages (supervised fine-tuning, then execution-guided reinforcement learning) that self-corrects without generating multiple candidates. Result: Reaches 72.10% execution accuracy on BIRD and 88.45% on Spider 1.0, matching or exceeding LLM-based methods while using 2x to 30x fewer parameters. Conclusion: Lightweight models can deliver high-quality Text-to-SQL generation, and LitE-SQL offers a practical solution for privacy-sensitive and resource-constrained settings. Abstract: The Text-to-SQL task translates natural language questions into SQL queries, enabling intuitive database interaction for non-experts. While recent methods leveraging Large Language Models (LLMs) achieve strong performance, their reliance on proprietary models raises concerns about deployment feasibility and data privacy. In this work, we introduce LitE-SQL, a Lightweight and Efficient framework with two components: (i) a Schema Retriever that performs efficient schema linking using a vector database of pre-computed schema embeddings, and (ii) a SQL Generator fine-tuned in two stages (supervised fine-tuning followed by execution-guided reinforcement), enabling self-correction without costly multi-candidate generation. On BIRD, LitE-SQL achieves 72.10% execution accuracy, and on Spider 1.0 it reaches 88.45%, demonstrating comparable or superior performance to LLM-based methods despite using 2x to 30x fewer parameters. Our findings demonstrate that high-quality Text-to-SQL generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.
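
Vector-based schema linking of the kind described above can be pictured as a nearest-neighbour lookup over pre-computed column embeddings. The sketch below is an illustration under assumptions: the three-dimensional vectors, the column names, and the `retrieve_schema` helper are made up, and cosine similarity is assumed as the ranking score. The paper's actual encoder and vector database are not specified in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_schema(question_vec, schema_index, k=2):
    """Return the k schema items whose embeddings are closest to the question."""
    ranked = sorted(schema_index,
                    key=lambda name: cosine(question_vec, schema_index[name]),
                    reverse=True)
    return ranked[:k]

# Toy pre-computed embeddings for three columns (hypothetical values).
schema_index = {
    "orders.total": [0.9, 0.1, 0.0],
    "orders.date": [0.1, 0.9, 0.0],
    "users.email": [0.0, 0.1, 0.9],
}
top = retrieve_schema([0.8, 0.3, 0.0], schema_index, k=2)
print(top)
```

Only the retrieved columns would then be placed in the generator's prompt, which keeps the input short for a small model.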

Keno Harada,Lui Yoshida,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo

Main category: cs.CL

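TL;DR: This study improves LLM performance on automated essay scoring by iteratively refining the scoring rubric; experiments show substantially better agreement with human scores.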

Details

Motivation: LLMs are sensitive to prompts, and existing automated essay scoring systems deviate from human scores, so the scoring rubric should be optimized to improve agreement. Method: Has the model reflect on its own scoring rationales and on its discrepancies with human scores for sample essays, and uses that reflection to refine the rubric iteratively. Result: On the TOEFL11 and ASAP datasets with GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct, Quadratic Weighted Kappa improves by up to 0.19 and 0.47, respectively; even a simple initial rubric reaches or exceeds the results of detailed human-authored rubrics. Conclusion: Iterative rubric refinement effectively improves the agreement between LLM-based essay scoring and human scoring. Abstract: The performance of Large Language Models (LLMs) is highly sensitive to the prompts they are given. Drawing inspiration from the field of prompt optimization, this study investigates the potential for enhancing Automated Essay Scoring (AES) by refining the scoring rubrics used by LLMs. Specifically, our approach prompts models to iteratively refine rubrics by reflecting on their own scoring rationales and observed discrepancies with human scores on sample essays. Experiments on the TOEFL11 and ASAP datasets using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct show Quadratic Weighted Kappa (QWK) improvements of up to 0.19 and 0.47, respectively. Notably, even with a simple initial rubric, our approach achieves comparable or better QWK than using detailed human-authored rubrics. Our findings highlight the importance of iterative rubric refinement in LLM-based AES to enhance alignment with human evaluations.
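
Since the results above are reported in Quadratic Weighted Kappa, a short reference computation may help. The sketch uses the usual definition, 1 minus the ratio of weighted observed disagreement to weighted chance disagreement, with weights (i - j)^2 / (N - 1)^2 over N score levels. The function name and the toy scores are illustrative, and scores are assumed to be integers in the range 0 to N - 1.

```python
def quadratic_weighted_kappa(human, model, num_classes):
    """QWK between two integer score lists with labels 0..num_classes-1."""
    n = num_classes
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h][m] += 1
    total = len(human)
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_h[i] * hist_m[j] / total
    if den == 0:  # both raters used one identical score throughout
        return 1.0
    return 1.0 - num / den

perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
reversed_scores = quadratic_weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0], 4)
print(perfect, reversed_scores)
```

The quadratic weights penalize a three-point miss nine times as heavily as a one-point miss, which is why the metric suits ordinal essay scores.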

[56] Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

Adity Khisa,Nusrat Jahan Lia,Tasnim Mahfuz Nafis,Zarif Masud,Tanzir Pial,Shebuti Rayana,Ahmedul Kabir

Main category: cs.CL

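TL;DR: This paper introduces a corpus of Bangla-transliterated Chakma and fine-tunes several multilingual and regional transformer models on it, showing its effectiveness for masked language modeling and substantially improving modeling of the low-resource Chakma language.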

Details

Motivation: Chakma, an Indo-Aryan language with scarce data, is underrepresented in existing language models and needs an effective modeling approach. Method: Builds a contextually coherent corpus of Bangla-transliterated Chakma validated by native speakers, and fine-tunes six encoder-based multilingual and regional transformer models (such as mBERT and XLM-RoBERTa) on masked language modeling. Result: Fine-tuned multilingual models outperform their pre-trained counterparts on Chakma, reaching up to 73.54% token accuracy and a perplexity as low as 2.90; the analysis also shows the effect of data quality and the limitations of OCR for morphologically rich scripts. Conclusion: Bangla-transliterated Chakma is very effective for transfer learning, and the released manually validated monolingual dataset should help advance multilingual language modeling for low-resource languages. Abstract: As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Using this dataset, we fine-tune six encoder-based multilingual and regional transformer models (mBERT, XLM-RoBERTa, DistilBERT, DeBERTaV3, BanglaBERT, and IndicBERT) on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning for the Chakma language, and we release our manually validated monolingual dataset to encourage further research on multilingual language modeling for low-resource languages.
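
The two metrics quoted above, token accuracy and perplexity, can be computed from a model's predictions at masked positions. The sketch below assumes perplexity is the exponential of the mean negative log-likelihood of the gold tokens, which is the usual definition; the `mlm_metrics` helper and its toy inputs are illustrative and not taken from the paper.

```python
import math

def mlm_metrics(gold_probs, top1_correct):
    """Return (perplexity, token accuracy) over masked positions.

    gold_probs: probability the model assigned to each gold token.
    top1_correct: whether the model's top prediction matched the gold token.
    """
    nll = [-math.log(p) for p in gold_probs]
    perplexity = math.exp(sum(nll) / len(nll))
    accuracy = sum(top1_correct) / len(top1_correct)
    return perplexity, accuracy

# A model that gives every gold token probability 0.25 has perplexity 4.
ppl, acc = mlm_metrics([0.25, 0.25, 0.25, 0.25], [True, False, True, True])
print(ppl, acc)
```

Read this way, a perplexity of 2.90 means the model is, on average, about as uncertain as a uniform choice among roughly three candidate tokens.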

[57] Large Language Models Do NOT Really Know What They Don't Know

Chi Seng Cheang,Hou Pong Chan,Wenxuan Zhang,Yang Deng

Main category: cs.CL

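TL;DR: Through mechanistic analysis, this study finds that when handling factual queries, large language models (LLMs) encode patterns of knowledge recall rather than truthfulness signals, so they cannot reliably separate factual from hallucinated outputs: "LLMs don't really know what they don't know".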

Details

Motivation: To examine whether an LLM's internal computations can reliably distinguish factual outputs from hallucinations, and in particular whether its hidden states carry a factuality signal. Method: Conducts a mechanistic analysis that compares two types of hallucination, which differ in how much they rely on subject information, to study differences in the model's internal representations when it processes factual queries. Result: When a hallucination is tied to subject knowledge, the LLM uses the same internal recall process as for correct answers, so the hidden-state geometries overlap; hallucinations detached from subject knowledge yield distinct, clustered representations that can be detected. Conclusion: LLMs do not encode truthfulness in their internal states, only patterns of knowledge recall, and therefore cannot truly recognize their own hallucinations. Abstract: Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations, such as hidden states, attention weights, or token probabilities, implying that LLMs may "know what they don't know". However, LLMs can also produce factual errors by relying on shortcuts or spurious associations. These errors are driven by the same training objective that encourages correct predictions, raising the question of whether internal computations can reliably distinguish between factual and hallucinated outputs. In this work, we conduct a mechanistic analysis of how LLMs internally process factual queries by comparing two types of hallucinations based on their reliance on subject information. We find that when hallucinations are associated with subject knowledge, LLMs employ the same internal recall process as for correct responses, leading to overlapping and indistinguishable hidden-state geometries. In contrast, hallucinations detached from subject knowledge produce distinct, clustered representations that make them detectable. These findings reveal a fundamental limitation: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that "LLMs don't really know what they don't know".

[58] Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation

Muhammad Ali Shafique,Kanwal Mehreen,Muhammad Arham,Maaz Amjad,Sabur Butt,Hamza Farooq

Main category: cs.CL

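TL;DR: Proposes training on a high-quality multilingual synthetic dataset built with a modified self-instruct technique to produce Alif-1.0-8B-Instruct, an efficient, culturally aligned Urdu-English LLM that achieves strong results in low-resource language modeling on a training budget of under $100.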

Details

Motivation: To address the data scarcity, multilingual inconsistency, and safety issues that low-resource languages such as Urdu face in LLM development, and to avoid the loss of cultural nuance and the high cost that come with relying on low-quality translated data. Method: Uses a modified self-instruct technique to generate a high-quality Urdu-English synthetic dataset (Urdu-Instruct); with task-specific prompts and seed values plus a global task pool, the dataset incorporates Urdu-native chain-of-thought reasoning, bilingual translation, cultural relevance, and ethical safety alignment, and is used to fine-tune Llama-3.1-8B. Result: Alif-1.0-8B-Instruct outperforms Llama-3.1-8B-Instruct and several leading multilingual LLMs (including Mistral-7B, Qwen-2.5-7B, and Cohere-Aya-Expanse-8B) on Urdu-specific tasks, with a training cost of under $100. Conclusion: The modified self-instruct approach makes it possible to build high-performing, culturally aligned LLMs for low-resource languages efficiently, and offers a workable path for similar languages. Abstract: Developing high-performing large language models (LLMs) for low-resource languages such as Urdu presents several challenges. These challenges include the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data. However, such translations often lack quality and cultural nuance while also incurring significant costs for data curation and training. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought based reasoning, bilingual translation, cultural relevance, and ethical safety alignments. This technique significantly enhances the comprehension of the Alif-1.0-8B-Instruct model for Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, demonstrates superior performance compared to Llama-3.1-8B-Instruct on Urdu-specific tasks. It also outperformed leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under $100. Our results demonstrate that high-performing, culturally aligned LLMs for low-resource languages can be developed efficiently using our modified self-instruct approach. All datasets, models, and code are publicly available at: https://github.com/traversaal-ai/alif-urdu-llm.

[59] ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability

Chung-En Sun,Ge Yan,Akshay Kulkarni,Tsui-Wei Weng

Main category: cs.CL

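TL;DR: This paper proposes ReFIne, a training framework that aims to make large models more trustworthy in long chain-of-thought reasoning along three dimensions (interpretability, faithfulness, and reliability), and validates it on several Qwen3 models.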

Details

Motivation: Work on long chain-of-thought reasoning has focused on accuracy and efficiency while neglecting key aspects of trustworthiness such as interpretability, faithfulness, and reliability, so more trustworthy reasoning systems are needed. Method: Proposes the ReFIne framework, which combines supervised fine-tuning with GRPO to guide models to produce structured reasoning traces, explicitly disclose the information that drives key decisions, and self-assess both the soundness of the derivation and the confidence of the answer. Result: Experiments on several mathematical benchmarks show that ReFIne improves interpretability (+44.0%), faithfulness (+18.8%), and reliability (+42.4%), yielding clearer reasoning and more informative confidence estimates. Conclusion: Reasoning models should be optimized for multiple dimensions of trustworthiness as well as accuracy, and ReFIne offers an effective path toward trustworthy long chain-of-thought models. Abstract: Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation's soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness. Our code is available at: https://github.com/Trustworthy-ML-Lab/Training_Trustworthy_LRM_with_Refine

[60] FrameEOL: Semantic Frame Induction using Causal Language Models

Chihiro Yano,Kosuke Yamada,Hayato Tsukagoshi,Ryohei Sasano,Koichi Takeda

Main category: cs.CL

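TL;DR: Proposes FrameEOL, a new semantic frame induction method based on causal language models (CLMs) that obtains frame embeddings through prompting and combines in-context learning with deep metric learning, outperforming existing methods on the English and Japanese FrameNet datasets.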

Details

Motivation: CLMs perform well on many language understanding tasks but have not yet been applied to semantic frame induction; this paper explores their potential for the task, especially for languages with scarce resources such as Japanese. Method: Proposes FrameEOL, a prompt-based method in which a causal language model produces a frame embedding by outputting a single frame name as a label, and refines the embeddings for clustering with in-context learning (ICL) and deep metric learning (DML). Result: Experiments on the English and Japanese FrameNet datasets show that the method outperforms existing frame induction methods; for Japanese, the CLM-based method with only 5 ICL examples matches the MLM-based method fine-tuned with DML. Conclusion: CLM-based prompting performs very well for semantic frame induction and is especially suited to languages with few annotated resources, showing that CLM knowledge can be transferred effectively without fine-tuning. Abstract: Semantic frame induction is the task of clustering frame-evoking words according to the semantic frames they evoke. In recent years, leveraging embeddings of frame-evoking words that are obtained using masked language models (MLMs) such as BERT has led to high-performance semantic frame induction. Although causal language models (CLMs) such as the GPT and Llama series succeed in a wide range of language comprehension tasks and can engage in dialogue as if they understood frames, they have not yet been applied to semantic frame induction. We propose a new method for semantic frame induction based on CLMs. Specifically, we introduce FrameEOL, a prompt-based method for obtaining Frame Embeddings that outputs One frame-name as a Label representing the given situation. To obtain embeddings more suitable for frame induction, we leverage in-context learning (ICL) and deep metric learning (DML). Frame induction is then performed by clustering the resulting embeddings. Experimental results on the English and Japanese FrameNet datasets demonstrate that the proposed methods outperform existing frame induction methods. In particular, for Japanese, which lacks extensive frame resources, the CLM-based method using only 5 ICL examples achieved comparable performance to the MLM-based method fine-tuned with DML.

[61] When Retrieval Succeeds and Fails: Rethinking Retrieval-Augmented Generation for LLMs

Yongjie Wang,Yue Yu,Kaisong Song,Jun Lin,Zhiqi Shen

Main category: cs.CL

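TL;DR: This paper surveys Retrieval-Augmented Generation (RAG), analyzes its strengths and challenges in addressing the limitations of large language models (LLMs), and discusses the applications where combining RAG with LLMs helps, with the aim of advancing next-generation RAG systems.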

Details

Motivation: Because LLMs are trained on static corpora, they struggle with rapidly changing or domain-specific information, so external knowledge retrieval is needed to improve their performance. Method: Reviews RAG's overall objectives and core components, analyzes its key technical challenges and weaknesses, and shows where RAG is effective in practice. Result: Shows that the advantages of traditional RAG frameworks are diminishing, identifies their critical weaknesses, and pinpoints the application scenarios in which RAG substantially improves LLM performance. Conclusion: Researchers should reconsider the role of RAG and work toward more advanced next-generation RAG systems. Abstract: Large Language Models (LLMs) have enabled a wide range of applications through their powerful capabilities in language understanding and generation. However, as LLMs are trained on static corpora, they face difficulties in addressing rapidly evolving information or domain-specific queries. Retrieval-Augmented Generation (RAG) was developed to overcome this limitation by integrating LLMs with external retrieval mechanisms, allowing them to access up-to-date and contextually relevant knowledge. However, as LLMs themselves continue to advance in scale and capability, the relative advantages of traditional RAG frameworks have become less pronounced and less necessary. Here, we present a comprehensive review of RAG, beginning with its overarching objectives and core components. We then analyze the key challenges within RAG, highlighting critical weaknesses that may limit its effectiveness. Finally, we showcase applications where LLMs alone perform inadequately, but where RAG, when combined with LLMs, can substantially enhance their effectiveness. We hope this work will encourage researchers to reconsider the role of RAG and inspire the development of next-generation RAG systems.

[62] DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

Enze Zhang,Jiaying Wang,Mengxi Xiao,Jifei Liu,Ziyan Kuang,Rui Dong,Youzhong Dong,Sophia Ananiadou,Min Peng,Qianqian Xie

Main category: cs.CL

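TL;DR: This paper introduces two frameworks, DITING and AgentEval, for evaluating the quality of web novel translation, particularly narrative and cultural fidelity. Using more than 18,000 expert-annotated Chinese-English sentence pairs and MetricAlign, a meta-evaluation dataset of 300 sentence pairs, the study finds that Chinese-trained LLMs outperform larger foreign models, with DeepSeek-V3 producing the most faithful and stylistically coherent translations.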

Details

Motivation: Existing machine translation benchmarks rely mainly on surface-level metrics that cannot capture the distinctive traits of web novels as a genre, so a new framework is needed to measure translation quality more fully. Method: Proposes DITING, the first comprehensive evaluation framework for web novel translation, covering six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety; proposes AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess quality beyond lexical overlap; and builds MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Result: In a comprehensive evaluation of fourteen open, closed, and commercial models, Chinese-trained LLMs outperform larger foreign models, and DeepSeek-V3 is best in faithfulness and stylistic coherence; AgentEval correlates most strongly with human judgments among seven automatic metrics. Conclusion: The work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources for future research. Abstract: Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.

[63] Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM

Seiya Ishikura,Hiroaki Yamada,Tatsuya Hiraoka,Hiroaki Yamada,Takenobu Tokunaga

Main category: cs.CL

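TL;DR: This study augments dialog data with think-aloud utterances (TAUs) to improve how well LLMs model individual personality. Experiments show that TAU-augmented data better captures speakers' Agreeableness and Neuroticism in the Big Five framework, and that TAU quality affects model performance.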

Details

Motivation: To improve how well LLMs model individual personality in text chat, the authors use think-aloud utterances (TAUs) to reflect a speaker's inner thought process more realistically and so capture personality traits better. Method: Adds TAUs (the speaker's thoughts before producing an utterance) to the original dialog data, trains "persona LLMs" on the augmented data, and evaluates how closely they match human personality under the Big Five framework. Result: LLMs trained on TAU-augmented data align more closely with the real speakers on Agreeableness and Neuroticism, and TAU quality significantly affects performance. Conclusion: TAU augmentation improves the accuracy with which LLMs model individual personality traits, especially Agreeableness and Neuroticism, and the quality of the generated TAUs matters. Abstract: This study proposes augmenting dialog data with think-aloud utterances (TAUs) for modeling individual personalities in text chat by LLMs. A TAU is a verbalization of a speaker's thought before articulating the utterance. We expect "persona LLMs" trained with TAU-augmented data can mimic the speaker's personality traits better. We tested whether the trained persona LLMs obtain the human personality with respect to the Big Five, a framework characterizing human personality traits from five aspects. The results showed that LLMs trained with TAU-augmented data align more closely with the speakers' Agreeableness and Neuroticism of the Big Five than those trained with original dialog data. We also found that the quality of TAU-augmentation impacts persona LLMs' performance.

[64] Stronger Re-identification Attacks through Reasoning and Aggregation

Lucas Georges Gabriel Charpentier,Pierre Lison

Main category: cs.CL

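TL;DR: This paper proposes two ways to strengthen re-identification attacks against text de-identification: accounting for the order in which PII spans are re-identified, and using reasoning models.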

Details

Motivation: The security of existing de-identification methods is hard to assess, so stronger re-identification attacks are needed to measure their robustness. Method: Proposes two strategies: aggregating predictions across multiple PII re-identification orderings, and using reasoning models that exploit rich background knowledge. Result: Experiments show that both accounting for ordering and using reasoning models improve re-identification performance, especially when extensive background knowledge is available. Conclusion: Stronger re-identification attacks allow a more accurate assessment of how secure de-identification techniques are and provide a basis for improving privacy protection. Abstract: Text de-identification techniques are often used to mask personally identifiable information (PII) from documents. Their ability to conceal the identity of the individuals mentioned in a text is, however, hard to measure. Recent work has shown how the robustness of de-identification methods could be assessed by attempting the reverse process of _re-identification_, based on an automated adversary using its background knowledge to uncover the PIIs that have been masked. This paper presents two complementary strategies to build stronger re-identification attacks. We first show that (1) the _order_ in which the PII spans are re-identified matters, and that aggregating predictions across multiple orderings leads to improved results. We also find that (2) reasoning models can boost the re-identification performance, especially when the adversary is assumed to have access to extensive background knowledge.
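
The first strategy, aggregating predictions over several span orderings, can be sketched as a vote. The abstract does not say how predictions are aggregated, so majority voting is an assumption here, and `toy_adversary` is a hypothetical stand-in for a real re-identification model: it guesses the city wrongly only when asked to resolve the city before any other span.

```python
from collections import Counter
from itertools import permutations

def aggregate_over_orderings(spans, reidentify, max_orderings=6):
    """Run the adversary under several span orderings and majority-vote per span."""
    votes = {span: Counter() for span in spans}
    for count, order in enumerate(permutations(spans)):
        if count >= max_orderings:
            break
        for span, guess in reidentify(list(order)).items():
            votes[span][guess] += 1
    return {span: counter.most_common(1)[0][0] for span, counter in votes.items()}

def toy_adversary(order):
    """Hypothetical adversary whose city guess depends on the resolution order."""
    city = "Bergen" if order[0] == "[CITY]" else "Oslo"
    return {"[PERSON]": "J. Doe", "[ORG]": "Acme Corp", "[CITY]": city}

spans = ["[PERSON]", "[ORG]", "[CITY]"]
result = aggregate_over_orderings(spans, toy_adversary)
print(result)
```

With three spans there are six orderings; the city is guessed as "Oslo" in four of them, so the vote recovers it even though a single unlucky ordering would not.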

[65] LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning

Changjiang Gao,Zixian Huang,Jingyang Gong,Shujian Huang,Lei Li,Fei Yuan

Main category: cs.CL

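TL;DR: Proposes a new translation-enhancement recipe that starts from instruct models and applies layer-selective tuning on parallel data, significantly improving translation for both high- and low-resource languages while retaining strong reasoning ability.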

Details

Motivation: General-purpose LLMs are strong at reasoning, but enhancing them for translation often costs reasoning ability, so a method is needed that improves translation without harming reasoning. Method: Starts from instruct models and applies layer-selective tuning on parallel data only, producing the Qwen3-XPlus models. Result: Achieves 15+ spBLEU and 40+ xComet in low-resource languages such as Swahili, an average gain of 1+ points on 7 multilingual tasks, and performance comparable to the Qwen3 instruct model on 15 popular reasoning datasets. Conclusion: The method balances translation and reasoning ability and offers a simpler, more accessible approach to multilingual enhancement. Abstract: General Large Language Models (LLMs) excel in reasoning, but those enhanced for translation struggle with reasoning tasks. To address this, we propose a novel translation-enhanced recipe that begins with instruct models and applies layer-selective tuning only on parallel data. Following this pipeline, we introduce the Qwen3-XPlus models, which demonstrate significant improvements in translation performance across both high- and low-resource languages, achieving 15+ spBLEU and 40+ xComet in low-resource languages, like Swahili. Interestingly, training only with small parallel datasets, Qwen3-XPlus achieves an average improvement of 1+ points on 7 multilingual tasks while maintaining proficiency comparable to the Qwen3 instruct model in 15 popular reasoning datasets. This work offers a promising approach to multilingual enhancement, significantly reducing complexity and enhancing accessibility for a wider range of languages. The code and model are publicly available.

[66] DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

Yiqi Li,Yusheng Liao,Zhe Chen,Yanfeng Wang,Yu Wang

Main category: cs.CL

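TL;DR: Proposes DICE, a framework in which small language models refine the outputs of large language models through chain-of-thought correction so that they meet structured format requirements, markedly improving both format accuracy and content correctness.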

Details

Motivation: In reasoning tasks, LLMs often neglect users' strict output format requirements, and fine-tuning them directly is costly and impractical. Method: Builds structured chain-of-thought adaptation datasets with a two-stage method and trains small language models with a dual-tuning strategy so that they correct outputs in an analyze-then-answer pattern. Result: Experiments show that DICE raises the average format accuracy and content correctness of LLM outputs by 35.4% and 29.4%, respectively, outperforming existing baselines. Conclusion: DICE balances the reasoning ability of LLMs with user-specific output requirements, giving efficient, lightweight control over outputs. Abstract: When performing reasoning tasks with user-specific requirements, such as strict output formats, large language models (LLMs) often prioritize reasoning over adherence to detailed instructions. Fine-tuning LLMs on supervised datasets to address this is impractical due to high computational costs and limited parameter access. To tackle this, we propose DICE, a lightweight framework that guides small language models (SLMs) to refine LLMs' outputs through chain-of-thought (CoT) correction. DICE decouples the process by first prompting LLMs to generate natural language responses, then using trained SLMs to analyze and refine these outputs to meet structured output specifications. This framework preserves LLMs' broad knowledge and reasoning capabilities while ensuring the outputs conform to user demands. Specifically, DICE first constructs structured CoT adaptation datasets via a two-stage method and subsequently applies a dual-tuning strategy to fine-tune SLMs for generating structured outputs in an analyze-then-answer pattern. Experiments demonstrate that DICE improves the average format accuracy and content correctness of LLM outputs by 35.4% and 29.4%, respectively, achieving state-of-the-art (SOTA) performance over other competitive baselines.

[67] IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data

Tao Feng,Lizhen Qu,Niket Tandon,Gholamreza Haffari

Main category: cs.CL

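TL;DR: Proposes IRIS, a real-time causal discovery framework that combines statistical algorithms with large language models. Starting from a set of initial variables, it automatically discovers both known and novel causal relations and proposes missing variables.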

Details

Motivation: Traditional causal discovery methods suffer from expensive data collection, redundant computation, and unrealistic assumptions, while LLM-based methods struggle to find novel causal relations. Method: Uses a hybrid approach that combines statistical algorithms with LLMs: it iteratively retrieves documents, extracts variables, discovers causal relations, and proposes missing variables to expand the causal graph. Result: Enables real-time causal discovery from initial variables alone, with no pre-existing dataset, finding both known and novel causal relations and filling in missing variables. Conclusion: IRIS overcomes the limitations of both traditional and purely LLM-based methods and offers a more efficient and flexible causal discovery tool for scientific exploration. Abstract: Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.

[68] CrisiText: A dataset of warning messages for LLM training in emergency communication

Giacomo Gonella,Gian Maria Campedelli,Stefano Menini,Marco Guerini

Main category: cs.CL

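TL;DR: This paper presents CrisiText, the first large-scale dataset for generating crisis warning messages. It covers 13 types of crisis scenario and contains more than 400,000 warning messages, follows expert guidelines to ensure factual accuracy and correct terminology, and supports the study and evaluation of multiple natural language generation approaches.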

Details

Motivation: In crises such as natural disasters or violent attacks, generating effective warnings in time is essential for protecting the public, yet NLP in this area is still limited to classification tasks and the potential of natural language generation (NLG) is largely untapped. Method: Builds chains of events from existing crisis descriptions, pairs each event with a warning message that follows expert guidelines, and adds three suboptimal warning types to support comparison of NLG approaches; runs experiments with supervised fine-tuning, preference alignment, zero-shot, and few-shot setups, and evaluates out-of-distribution performance and the effectiveness of an automatic post-editor. Result: Produces CrisiText, a large-scale dataset of more than 400,000 warning messages covering almost 18,000 crisis situations, which supports training and evaluation of a range of NLG models and allows the generation approaches to be compared in both in-distribution and out-of-distribution crisis scenarios. Conclusion: CrisiText is an important resource for automatic warning generation in crisis management and shows the potential of NLG in emergency response. Abstract: Effectively identifying threats and mitigating their potential damage during crisis situations, such as natural disasters or violent attacks, is paramount for safeguarding endangered individuals. To tackle these challenges, AI has been used in assisting humans in emergency situations. Still, the use of NLP techniques remains limited and mostly focuses on classification tasks. The significant potential of timely warning message generation using NLG architectures, however, has been largely overlooked. In this paper we present CrisiText, the first large-scale dataset for the generation of warning messages across 13 different types of crisis scenarios. The dataset contains more than 400,000 warning messages (spanning almost 18,000 crisis situations) aimed at assisting civilians during and after such events. To generate the dataset, we started from existing crisis descriptions and created chains of events related to the scenarios. Each event was then paired with a warning message. The generations follow experts' written guidelines to ensure correct terminology and factuality of their suggestions. Additionally, each message is accompanied by three suboptimal warning types to allow for the study of different NLG approaches. To this end, we conducted a series of experiments comparing supervised fine-tuning setups with preference alignment, zero-shot, and few-shot approaches. We further assessed model performance in out-of-distribution scenarios and evaluated the effectiveness of an automatic post-editor.

[69] DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning

Chenyang Gu,Yewen Pu,Bruce Yang,Xiaofan Li,Huan Gao

Main category: cs.CL

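TL;DR: Proposes DSPO, a new reinforcement learning algorithm for training small language models to acquire external knowledge actively through multi-turn search and reasoning on complex question answering tasks. It needs no supervised data and substantially outperforms prior work on several benchmarks.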

Details

Motivation: Existing methods either rely on prompting or hit performance ceilings and collapse when reinforcement learning is applied to complex tasks, leaving the agentic potential of language models untapped. Method: Proposes Dynamic-filter Sequence-level Policy Optimization (DSPO), which performs reinforcement learning with sequence-level optimization and dynamic sample filtering so that the model learns to interleave multi-turn search and reasoning on its own. Result: Across multiple QA benchmarks, the 7B model improves over comparable previous work by 34.1%, and on complex multihop QA such as HotpotQA it even outperforms the 14B model from previous work by nearly 9% (relative), with highly stable training. Conclusion: DSPO unlocks the potential of small language models as active agents, giving efficient and stable end-to-end training that handles complex tasks without supervised data. Abstract: Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce Dynamic-filter Sequence-level Policy Optimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our DSPO-trained 7B model improves over a comparable previous work by 34.1%, and even outperforms the 14B model from previous work in complex multihop QA such as HotpotQA by nearly 9% relative, maintaining exceptional training stability.

[70] Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

Yongding Tao,Tian Wang,Yihong Dong,Huanyu Liu,Kechi Zhang,Xiaolong Hu,Ge Li

Main category: cs.CL

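TL;DR: This paper proposes Self-Critique, a method for detecting data contamination introduced during reinforcement learning post-training of large language models. By probing for policy collapse, it substantially improves detection performance.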

Details

Motivation: There is no effective way to detect data contamination in the RL post-training stage, where existing detectors perform close to random guessing, so a technique designed for this stage is needed. Method: Proposes Self-Critique, which exploits the collapse of the output entropy distribution after RL and identifies contamination by probing for policy collapse (the model's convergence to a narrow reasoning path); also builds the RL-MIA benchmark to simulate this contamination scenario. Result: Experiments show that Self-Critique significantly outperforms baselines across multiple models and contamination tasks, with an AUC improvement of up to 30%, making RL-phase contamination detectable. Conclusion: Self-Critique is the first systematic solution for detecting data contamination in RL post-training and addresses the reliability of model evaluation at this critical stage. Abstract: Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples may inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
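
The entropy collapse that motivates the method above can be illustrated with a plain Shannon entropy calculation over a next-token distribution. This is only a sketch of the observation, not of Self-Critique itself, whose probing procedure the abstract does not spell out; the two toy distributions are invented for illustration.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy next-token distributions over four candidate tokens.
diverse = [0.25, 0.25, 0.25, 0.25]    # policy still spreads probability mass
collapsed = [0.97, 0.01, 0.01, 0.01]  # policy has converged on one path

h_diverse = shannon_entropy(diverse)
h_collapsed = shannon_entropy(collapsed)
print(h_diverse, h_collapsed)
```

A detector in this spirit would look for samples on which the model's outputs are unusually low in entropy, on the reasoning that a narrow, repeated reasoning path is more likely for material seen during RL.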

[71] CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation

Kaiwen Wei,Xiao Liu,Jie Zhang,Zijian Wang,Ruida Liu,Yuming Yang,Xin Xiao,Xiao Sun,Haoyang Zeng,Changzai Pan,Yidan Zhang,Jiang Zhong,Peijin Wang,Yingchao Feng

Main category: cs.CL

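TL;DR: This paper introduces CFVBench, a large-scale, manually verified benchmark for multimodal retrieval-augmented generation (MRAG) that evaluates question answering over multimodal evidence such as video. It also proposes the Adaptive Visual Refinement (AVR) framework to improve models' understanding of fine-grained multimodal information.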

Details

Motivation: Existing MRAG benchmarks are limited in modality coverage and format diversity; most focus on single-modality tasks or coarse-grained scene understanding and do not support complex, high-density multimodal content. Method: The authors build CFVBench from 599 publicly available videos with 5,360 open-ended QA pairs and systematically evaluate 7 retrieval methods and 14 mainstream MLLMs; they propose AVR, which adaptively increases frame sampling density and invokes external tools to strengthen fine-grained understanding. Result: Experiments show that current models (including GPT-5 and Gemini) struggle to capture critical, transient fine-grained information; AVR significantly improves fine-grained multimodal comprehension for all evaluated MLLMs. Conclusion: CFVBench fills an evaluation gap for multimodal RAG on high-density, diverse formats, and AVR offers an effective way to improve fine-grained perception. Abstract: Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- or limited-modality tasks, or coarse-grained scene understanding. To address these gaps, we introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos, yielding 5,360 open-ended QA pairs. CFVBench spans high-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials, requiring models to retrieve and reason over long temporal video spans while maintaining fine-grained multimodal information. Using CFVBench, we systematically evaluate 7 retrieval methods and 14 widely-used MLLMs, revealing a critical bottleneck: current models (even GPT5 or Gemini) struggle to capture transient yet essential fine-grained multimodal details. To mitigate this, we propose Adaptive Visual Refinement (AVR), a simple yet effective framework that adaptively increases frame sampling density and selectively invokes external tools when necessary. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated MLLMs.

[72] Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation

Xiangxu Zhang,Lei Li,Yanyun Zhou,Xiao Zhou,Yingying Zhang,Xian Wu

Main category: cs.CL

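TL;DR: Proposes DyReMe, a dynamic benchmark for medical diagnosis that generates novel, consultation-like cases and evaluates veracity, helpfulness, and consistency, reflecting clinical practice more realistically.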

Details

Motivation: Current evaluations of LLMs for medical diagnosis rely on static exam questions, which overestimate performance and ignore the complexity and ambiguity of real clinical settings. Method: The authors build the dynamic benchmark DyReMe, which generates cases containing distractors (such as differential diagnoses) and varied expression styles, and evaluates LLMs along four dimensions: accuracy, veracity, helpfulness, and consistency. Result: Experiments show that DyReMe is more challenging and realistic than traditional static evaluation and reveals significant misalignment between current LLMs and real clinical needs. Conclusion: Evaluation frameworks that more closely match real clinical demands are needed to ensure the reliability of LLMs in medical diagnosis. Abstract: Medical diagnostics is a high-stakes and complex domain that is critical to patient care. However, current evaluations of large language models (LLMs) are fundamentally misaligned with real-world clinical practice. Most of them rely on static benchmarks derived from public medical exam items, which tend to overestimate model performance and ignore the difference between textbook cases and the ambiguous, varying conditions in the real world. Recent efforts toward dynamic evaluation offer a promising alternative, but their improvements are limited to superficial perturbations and a narrow focus on accuracy. To address these gaps, we propose DyReMe, a dynamic benchmark for medical diagnostics that better reflects real clinical practice. Unlike static exam-style questions, DyReMe generates fresh, consultation-like cases that introduce distractors such as differential diagnoses and common misdiagnosis factors. It also varies expression styles to mimic diverse real-world query habits. Beyond accuracy, DyReMe evaluates LLMs on three additional clinically relevant dimensions: veracity, helpfulness, and consistency. Our experiments demonstrate that this dynamic approach yields more challenging and realistic assessments, revealing significant misalignments between the performance of state-of-the-art LLMs and real clinical practice. These findings highlight the urgent need for evaluation frameworks that better reflect the demands of trustworthy medical diagnostics.

[73] CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts

Jiuheng Lin,Cong Jiang,Zirui Wu,Jiarui Sun,Yansong Feng

Main category: cs.CL

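TL;DR: CLARity is a low-cost reinforcement learning framework that uses a small general-purpose LLM to improve the reasoning consistency of expert models, markedly increasing response consistency and accuracy in data-scarce domains.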

Details

Motivation: Training expert LLMs in data-scarce domains usually relies on multiple-choice questions, but standard outcome-based reinforcement learning can damage reasoning quality, and existing methods for supervising reasoning are too expensive. Method: The authors propose CLARity, which combines a consistency-aware reward mechanism, a two-stage refine-then-monitor training pipeline, and a dynamic data reformulation strategy, using only a small general-purpose LLM to improve reasoning quality. Result: Experiments show that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines; human evaluation also confirms overall improvements in coherence and professionalism. Conclusion: CLARity offers a generalizable solution in which smaller models effectively guide expert models, overcoming data scarcity by focusing on reasoning consistency. Abstract: Training expert LLMs in domains with scarce data is difficult, often relying on multiple-choice questions (MCQs). However, standard outcome-based reinforcement learning (RL) on MCQs is risky. While it may improve accuracy, we observe it often degrades reasoning quality such as logical consistency. Existing solutions to supervise reasoning, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. To address this, we propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM. CLARity integrates a consistency-aware reward mechanism with a 2-stage refine-then-monitor training pipeline to enhance reasoning consistency, and a dynamic data reformulation strategy to better exploit limited data. Experiments demonstrate that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines. Human evaluations further confirm holistic improvements in coherence and professionalism. Thus, CLARity offers a generalizable solution that enables smaller models to effectively guide expert models through reasoning consistency. Our code is open-sourced at: https://github.com/Infinite-set/CLARity

[74] One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations

Kohei Oda,Po-Min Chuang,Kiyoaki Shirai,Natthawut Kertkeidkachorn

Main category: cs.CL

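TL;DR: Proposes DualCSE, a sentence embedding method that assigns each sentence two embeddings, one for explicit and one for implicit semantics, and improves downstream task performance.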

Details

Motivation: Conventional sentence embedding methods produce a single vector per sentence and therefore struggle to capture implicit semantics. Method: DualCSE generates two embedding vectors for each sentence, representing explicit and implicit semantics respectively, and places both in the same semantic space. Result: Experiments show that DualCSE effectively encodes both explicit and implicit semantics and improves performance on downstream tasks such as information retrieval and text classification. Conclusion: DualCSE overcomes the limitation of conventional methods through its dual-embedding mechanism and strengthens sentence representations. Abstract: Sentence embedding methods have made remarkable progress, yet they still struggle to capture the implicit semantics within sentences. This can be attributed to the inherent limitations of conventional sentence embedding methods that assign only a single vector per sentence. To overcome this limitation, we propose DualCSE, a sentence embedding method that assigns two embeddings to each sentence: one representing the explicit semantics and the other representing the implicit semantics. These embeddings coexist in a shared space, enabling the selection of the desired semantics for specific purposes such as information retrieval and text classification. Experimental results demonstrate that DualCSE can effectively encode both explicit and implicit meanings and improve the performance of downstream tasks.

[75] MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics

Jiapeng Wang,Changxin Tian,Kunlong Chen,Ziqi Liu,Jiaxin Mao,Wayne Xin Zhao,Zhiqiang Zhang,Jun Zhou

Main category: cs.CL

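TL;DR: This paper proposes MaP, a framework that combines checkpoint merging with the Pass@k metric to reduce evaluation instability during LLM pre-training, allowing training dynamics to be observed more reliably.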

Details

Motivation: Evaluation of LLMs during pre-training is highly unstable, which obscures the true learning dynamics and undermines the reliability of model development. Method: The authors propose MaP, which combines checkpoint merging, to smooth the parameter space, with the Pass@k metric, to lower evaluation variance, suppressing noise at both the parameter and the evaluation level. Result: Experiments show that MaP yields markedly smoother performance curves, reduces variance across training runs, and makes model rankings more consistent. Conclusion: MaP offers a more reliable and faithful view of LLM training dynamics and lays an important empirical foundation for LLM research. Abstract: Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics. In this work, we systematically diagnose this instability, attributing it to two distinct sources: Parameter Instability from training stochasticity and Evaluation Instability from noisy measurement protocols. To counteract both sources of noise, we introduce MaP, a dual-pronged framework that synergistically integrates checkpoint Merging and the Pass@k metric. Checkpoint merging smooths the parameter space by averaging recent model weights, while Pass@k provides a robust, low-variance statistical estimate of model capability. Extensive experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent model rankings. Ultimately, MaP provides a more reliable and faithful lens for observing LLM training dynamics, laying a crucial empirical foundation for LLM research.
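
The two ingredients named in this abstract can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: `merge_checkpoints` assumes uniform averaging over toy checkpoints stored as plain dictionaries, and `pass_at_k` uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k), which may differ from the exact protocol in the paper.

```python
from math import comb

def merge_checkpoints(checkpoints):
    """Uniformly average a list of checkpoints given as {name: [floats]} dicts."""
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(values))]
        for name, values in checkpoints[0].items()
    }

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

recent = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]  # toy checkpoints (made up)
merged = merge_checkpoints(recent)               # element-wise mean of the two
estimate = pass_at_k(n=10, c=3, k=1)             # 1 - C(7, 1) / C(10, 1)
```

Averaging several samples per problem is what lowers the variance of the Pass@k estimate relative to a single greedy decode.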

[76] ShiZhi: A Chinese Lightweight Large Language Model for Court View Generation

Zhitian Hou,Kun Zeng

Main category: cs.CL

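TL;DR: This paper presents ShiZhi, the first Chinese large language model dedicated to court view generation, and builds the CCVG dataset of more than 110K cases. Experiments show that a small LLM trained on high-quality domain data can generate reasonable and legally coherent court views.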

Details

Motivation: Because case facts are diverse and complex, generating court views directly from raw facts is challenging and existing methods are limited in performance, so a high-quality domain model and dataset designed for this task are needed. Method: The authors build CCVG, a court view generation dataset of more than 110K Chinese cases, and train the dedicated LLM ShiZhi on it to generate the "Court View" section. Result: ShiZhi reaches 58.5 BLEU-1 on court view generation, and 86.1% accuracy with 92.5% macro F1 on charge prediction. Conclusion: Experiments show that even a small LLM can generate reasonable and legally coherent court views when trained on high-quality, domain-specific data; ShiZhi and its dataset are publicly released. Abstract: Criminal Court View Generation (CVG) is a fundamental task in legal artificial intelligence, aiming to automatically generate the "Court View" section of a legal case document. Generating court views is challenging due to the diversity and complexity of case facts, and directly generating from raw facts may limit performance. In this paper, we present ShiZhi, the first large language model (LLM) specifically designed for court view generation. We construct a Chinese Court View Generation dataset, CCVG, of more than 110K cases, each containing fact descriptions paired with corresponding court views. Based on this dataset, ShiZhi achieves 58.5 BLEU-1 on court view generation and 86.1% accuracy with 92.5% macro F1 on charge prediction. Experimental results demonstrate that even a small LLM can generate reasonable and legally coherent court views when trained on high-quality domain-specific data. Our model and dataset are available at https://github.com/ZhitianHou/ShiZhi.

[77] Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference

Jianuo Huang,Yaojie Zhang,Yicun Yang,Benhao Huang,Biqing Qi,Dongrui Liu,Linfeng Zhang

Main category: cs.CL

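TL;DR: Proposes MaskKV, a training-free cache eviction framework designed for diffusion large language models (dLLMs). Using a mask-query guided scoring mechanism and an adaptive cache budgeting strategy, it maintains high performance with a very small cache and substantially accelerates long-context processing.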

Details

Motivation: Existing cache eviction strategies are designed for autoregressive models and ignore the distinctive role of the mask mechanism in dLLMs, which leads to poor performance on long contexts under resource constraints. An efficient cache management scheme tailored to dLLMs is therefore needed. Method: MaskKV relies on two core techniques: a mask-query guided scoring mechanism based on attention weights, which identifies and evicts unimportant prompt tokens per attention head; and an adaptive cache allocation strategy, which reduces the allocation in intermediate layers and concentrates resources on prompt-preferring heads. Result: On LLaDA, compressing the KV cache to only 256 pairs (less than 5% of tokens) retains 94% of full-cache performance on LongBench and achieves up to 31x acceleration at a 32k prompt length. Conclusion: MaskKV is an efficient, training-free cache eviction method that markedly improves the inference efficiency and practicality of dLLMs in long-context settings and supports their real-world deployment. Abstract: Diffusion large language models (dLLMs) present a promising alternative to dominant autoregressive models (ARMs) through their ability to decode in parallel, at the expense of substantial computation and memory costs. Specifically, the cache mechanism for bidirectional attention in dLLMs demands a large memory footprint, restricting their ability to handle long contexts under resource-limited settings. Existing cache eviction strategies are designed for ARMs and ignore the unique characteristics of dLLMs, thus leading to unsatisfactory performance. To address these challenges, we introduce MaskKV, a training-free cache eviction framework tailored to dLLMs, focusing on the effect of mask tokens in dLLMs. MaskKV is built on two key innovations: (1) a mask-query guided scoring mechanism that leverages attention weights to identify and evict less critical prompt tokens for each head; (2) an adaptive cache budgeting strategy that improves efficiency by reducing allocation in intermediate layers and concentrating resources on prompt-preferring heads. On LLaDA with MaskKV, compressing the KV cache to only 256 pairs (less than 5% of tokens) retains 94% of the full-cache performance on LongBench and achieves up to 31x acceleration at 32k prompt length. The code is publicly available at: https://github.com/jianuo-huang/MaskKV
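
The per-head scoring idea in point (1) of this abstract can be sketched as follows. This is a hedged illustration, not the MaskKV code: the function name `evict_prompt_tokens`, the nested-list layout of `attn`, the use of a mean over mask queries as the score, and the fixed `budget` are all assumptions made here. The adaptive budgeting of point (2) is not modeled.

```python
def evict_prompt_tokens(attn, budget):
    """Pick, per head, the prompt-token indices to keep in the KV cache.

    attn[h][q][t] is the attention weight that mask-token query q in head h
    places on prompt token t (a hypothetical layout chosen for this sketch).
    """
    kept = []
    for head in attn:
        num_tokens = len(head[0])
        # Score each prompt token by its mean attention from mask queries.
        scores = [sum(row[t] for row in head) / len(head)
                  for t in range(num_tokens)]
        top = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)
        kept.append(sorted(top[:budget]))  # restore positional order
    return kept

# Toy attention weights (made up): 2 heads, 2 mask queries, 3 prompt tokens.
attn = [
    [[0.7, 0.1, 0.2], [0.6, 0.1, 0.3]],  # head 0 favors tokens 0 and 2
    [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]],  # head 1 favors tokens 1 and 0
]
kept = evict_prompt_tokens(attn, budget=2)
```

Because the selection is made independently for each head, different heads can retain different prompt tokens under the same budget.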

[78] Verifying Chain-of-Thought Reasoning via Its Computational Graph

Zheng Zhao,Yeskendir Koishekenov,Xianjun Yang,Naila Murray,Nicola Cancedda

Main category: cs.CL

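TL;DR: Proposes Circuit-based Reasoning Verification (CRV), which analyzes structural features of attribution graphs to identify and correct errors in LLM reasoning, enabling white-box verification and causal understanding of LLM reasoning.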

Details

Motivation: Existing CoT verification methods are mostly black-box or gray-box, so they cannot reveal the root cause of a reasoning failure and offer little insight into error mechanisms. Method: Attribution graphs of correct and incorrect CoT steps are treated as execution traces of the model's latent reasoning circuits; structural features are extracted from them and a classifier is trained to recognize error patterns, followed by causal validation through interventions on specific transcoder features. Result: (1) Structural features are highly predictive of reasoning errors; (2) error signatures are specific to the task domain; (3) targeted interventions can correct the model's faulty reasoning. Conclusion: CRV enables white-box verification of LLM reasoning, detecting errors and also providing causal understanding, which shifts the focus from verifying outcomes to understanding the process. Abstract: Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model's latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model's faulty reasoning. Our work shows that, by scrutinizing a model's computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.

[79] FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference

Yu-Chen Lu,Chong-Yan Chen,Chi-Chih Chang,Yu-Fang Hu,Kai-Chiang Wu

Main category: cs.CL

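TL;DR: Proposes FLRC, a fine-grained low-rank compression method that allocates an optimal rank to each layer and adds progressive low-rank decoding, significantly improving the inference efficiency and generation quality of LLMs on resource-constrained devices.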

Details

Motivation: LLMs have enormous parameter counts and are hard to deploy on resource-constrained hardware; existing low-rank compression methods apply a uniform compression ratio, which tends to degrade performance, and they perform poorly during decoding. Method: The authors propose the Fine-grained Low-Rank Compressor (FLRC), which determines an optimal rank allocation for each layer and introduces progressive low-rank decoding to preserve generation quality. Result: FLRC is validated on multiple benchmarks, improving ROUGE-L on summarization tasks by up to 17% over state-of-the-art low-rank compression methods. Conclusion: FLRC provides a more efficient and robust framework that markedly improves LLM inference efficiency and generation performance. Abstract: Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer, and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.
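
To see why per-layer rank matters, it helps to count parameters. The sketch below does only that arithmetic: a weight matrix of shape rows x cols stored as a rank-r product needs r * (rows + cols) parameters. It does not reproduce how FLRC chooses ranks, which the abstract does not describe; the layer shapes and ranks are hypothetical.

```python
def low_rank_params(rows, cols, rank):
    """Parameters needed to store W (rows x cols) as A (rows x rank) @ B (rank x cols)."""
    return rank * (rows + cols)

def compression_ratio(shapes, ranks):
    """Fraction of the original parameters kept after factorizing each layer."""
    original = sum(r * c for r, c in shapes)
    compressed = sum(low_rank_params(r, c, k) for (r, c), k in zip(shapes, ranks))
    return compressed / original

shapes = [(4096, 4096), (4096, 11008)]  # hypothetical layer shapes
uniform = compression_ratio(shapes, [1024, 1024])   # same rank everywhere
per_layer = compression_ratio(shapes, [512, 1536])  # rank varies by layer
print(f"uniform: {uniform:.3f}, per-layer: {per_layer:.3f}")
```

A fine-grained scheme can spend more rank on layers that are sensitive to truncation and less on the others while keeping the overall ratio in a similar range.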

[80] LLP: LLM-based Product Pricing in E-commerce

Hairu Wang,Sheng You,Qiheng Zhang,Xike Xie,Shuguang Han,Yuchen Wu,Fei Huang,Jufeng Chen

Main category: cs.CL

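TL;DR: This paper proposes LLP, the first LLM-based generative framework for pricing second-hand products. It retrieves similar products, uses an LLM to understand textual information, and combines two-stage optimization with confidence-based filtering to price more accurately in a dynamic market.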

Details

Motivation: Individual sellers on C2C platforms find it hard to price second-hand products efficiently, and existing static regression models generalize poorly and cannot capture market dynamics. Method: LLP first retrieves similar products to align with market changes, uses an LLM to understand key pricing information in free-form text, is trained in two stages with supervised fine-tuning and group relative policy optimization, and adds a confidence-based filtering mechanism to reject unreliable suggestions. Result: Experiments show that LLP substantially outperforms existing methods and generalizes well to unseen categories; after deployment on Xianyu, the static adoption rate rises from 40% to 72% at the same 30% product coverage, and remains at 47% even at 90% recall. Conclusion: LLP is the first LLM-based pricing framework for second-hand products; it handles market dynamics effectively, markedly raises the adoption rate of price suggestions in a real deployment, and has good practical value. Abstract: Unlike Business-to-Consumer e-commerce platforms (e.g., Amazon), inexperienced individual sellers on Consumer-to-Consumer platforms (e.g., eBay) often face significant challenges in setting prices for their second-hand products efficiently. Therefore, numerous studies have been proposed for automating price prediction. However, most of them are based on static regression models, which suffer from poor generalization performance and fail to capture market dynamics (e.g., the price of a used iPhone decreases over time). Inspired by recent breakthroughs in Large Language Models (LLMs), we introduce LLP, the first LLM-based generative framework for second-hand product pricing. LLP first retrieves similar products to better align with the dynamic market change. Afterwards, it leverages the LLMs' nuanced understanding of key pricing information in free-form text to generate accurate price suggestions. To strengthen the LLMs' domain reasoning over retrieved products, we apply a two-stage optimization, supervised fine-tuning (SFT) followed by group relative policy optimization (GRPO), on a dataset built via bidirectional reasoning. Moreover, LLP employs a confidence-based filtering mechanism to reject unreliable price suggestions. Extensive experiments demonstrate that LLP substantially surpasses existing methods while generalizing well to unseen categories. We have successfully deployed LLP on Xianyu (China's largest second-hand e-commerce platform), significantly outperforming the previous pricing method. Under the same 30% product coverage, it raises the static adoption rate (SAR) from 40% to 72%, and maintains a strong SAR of 47% even at 90% recall.

[81] ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering

Francesco Maria Molfese,Luca Moroni,Ciro Porcaro,Simone Conia,Roberto Navigli

Main category: cs.CL

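TL;DR: This paper introduces ReTraceQA, a benchmark for evaluating the reasoning process of small language models (SLMs) on commonsense reasoning tasks rather than only the accuracy of their final answers. In 14-24% of cases, SLMs give the correct answer despite flawed reasoning, which indicates that answer-only evaluation can overestimate model capability. When strong LLMs are used as automated judges for reasoning-aware evaluation, SLM performance drops significantly, by up to 25%.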

Details

Motivation: Current evaluation of small language models relies mainly on final-answer accuracy and ignores the validity of the reasoning process, which can lead to overestimating model capability. A method that evaluates the reasoning process is therefore needed. Method: The authors build ReTraceQA, an expert-annotated benchmark dataset, and use strong LLMs as automated judges to evaluate SLM reasoning traces, comparing the results with conventional answer-only evaluation. Result: In 14-24% of cases, SLMs give the correct answer with a flawed reasoning process; under reasoning-aware evaluation, every SLM scores significantly lower on every dataset, with drops of up to 25%. Conclusion: Evaluation based only on final-answer accuracy overestimates the true reasoning ability of small language models; process-level evaluation such as ReTraceQA reflects reasoning quality more faithfully. Abstract: While Small Language Models (SLMs) have demonstrated promising performance on an increasingly wide array of commonsense reasoning benchmarks, current evaluation practices rely almost exclusively on the accuracy of their final answers, neglecting the validity of the reasoning processes that lead to those answers. To address this issue, we introduce ReTraceQA, a novel benchmark that introduces process-level evaluation for commonsense reasoning tasks. Our expert-annotated dataset reveals that in a substantial portion of instances (14-24%), SLMs provide correct final answers despite flawed reasoning processes, suggesting that the capabilities of SLMs are often overestimated by evaluation metrics that focus only on comparing the final answer with the ground truth. Indeed, we show that when employing strong Large Language Models (LLMs) as automated judges for reasoning-aware evaluation rather than answer-only metrics, SLM performance drops significantly across all models and datasets, with scores decreasing by up to 25%.
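
The difference between answer-only and reasoning-aware scoring can be shown with a toy calculation. The sketch below assumes each example carries two boolean judgments, `answer_correct` and `trace_valid`; these field names and the four records are invented for illustration and are not taken from ReTraceQA.

```python
def answer_only_accuracy(records):
    """Share of records whose final answer matches the gold answer."""
    return sum(r["answer_correct"] for r in records) / len(records)

def reasoning_aware_accuracy(records):
    """Share of records with a correct answer AND a reasoning trace judged valid."""
    return sum(r["answer_correct"] and r["trace_valid"] for r in records) / len(records)

# Made-up judgments for four questions (illustrative only, not paper data).
records = [
    {"answer_correct": True, "trace_valid": True},
    {"answer_correct": True, "trace_valid": False},  # right answer, flawed reasoning
    {"answer_correct": False, "trace_valid": False},
    {"answer_correct": True, "trace_valid": True},
]
gap = answer_only_accuracy(records) - reasoning_aware_accuracy(records)
```

The `gap` is exactly the share of examples that answer-only scoring credits even though the reasoning behind them was judged invalid.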

[82] Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Yunxiang Zhang,Muhammad Khalifa,Lechen Zhang,Xin Liu,Ayoung Lee,Xinliang Frederick Zhang,Farima Fatahi Bayat,Lu Wang

Main category: cs.CL

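TL;DR: Proposes ThinkLogit, a decoding-time method that uses a small reasoning model as a guider and applies logit arithmetic to elicit long chain-of-thought reasoning in a large model without additional training. Adding preference optimization (ThinkLogit-DPO) improves performance further; the two variants achieve relative accuracy improvements of 24.5% and 29.1%, respectively, across five reasoning benchmarks.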

Details

Motivation: The authors ask whether long chain-of-thought behaviors such as backtracking and self-correction can be elicited from large models without additional training, reducing the dependence on expensive post-training. Method: ThinkLogit uses a small reasoning model as a guider and adjusts the outputs of the large target model at decoding time through logit arithmetic; the guider is further trained with preference optimization (DPO), yielding the ThinkLogit-DPO framework. Result: With Qwen2.5-32B guided by R1-Distill-Qwen-1.5B (21x smaller), ThinkLogit and ThinkLogit-DPO achieve relative improvements in average accuracy of 24.5% and 29.1%, respectively, across five reasoning benchmarks; the method works across model families and is orthogonal to post-training techniques for small models. Conclusion: ThinkLogit offers a practical path to unlocking long reasoning in large-scale models without costly post-training, with good scalability and compatibility. Abstract: Large reasoning models exhibit long chain-of-thought reasoning with strategies such as backtracking and self-correction, though recent studies suggest that these abilities typically require additional training. We first investigate whether such behaviors can be elicited without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logit arithmetic to tune a target large non-reasoning model for long reasoning using a substantially smaller reasoning model as the guider. We then show that we can further boost its performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model, a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement in average accuracy of 24.5% and 29.1%, respectively, over five reasoning benchmarks using Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Moreover, we find that ThinkLogit remains effective when the guider and target come from different model families. It is also orthogonal to post-training methods for small models, as guiders improved through supervised distillation or reinforcement learning can be directly plugged in to yield stronger large models, offering a practical path to unlock long reasoning in large-scale models without costly post-training.
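
Logit arithmetic at decoding time can be sketched as below. The abstract does not state the exact formula, so the form used here is an assumption borrowed from proxy-tuning-style decoding: the target's logits are shifted by the difference between the small reasoning guider and a small non-reasoning counterpart. The names `guider_base` and `alpha`, and all numbers, are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_logits(target, guider_reasoning, guider_base, alpha=1.0):
    """Shift the target's logits by the (reasoning - base) offset of the small guider."""
    return [t + alpha * (r - b)
            for t, r, b in zip(target, guider_reasoning, guider_base)]

# Toy three-token vocabulary; all numbers are invented for illustration.
target = [2.0, 1.0, 0.0]            # large non-reasoning model prefers token 0
guider_reasoning = [0.0, 2.0, 0.0]  # small reasoning model prefers token 1
guider_base = [0.0, 0.0, 0.0]       # small non-reasoning model is indifferent
probs = softmax(guided_logits(target, guider_reasoning, guider_base))
```

Setting `alpha` to 0 recovers the unmodified target distribution, which makes the guidance strength easy to ablate. This arithmetic also requires the models to share a vocabulary, a constraint the sketch does not check.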

[83] NL2GenSym: Natural Language to Generative Symbolic Rules for SOAR Cognitive Architecture via Large Language Models

Fang Yuan,Junjie Zeng,Yue Hu,Zhengqiu Zhu,Quanjun Yin,Yuxiang Xie

Main category: cs.CL

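TL;DR: Proposes NL2GenSym, a framework that uses LLMs to generate symbolic rules automatically from natural language and refines them through an execution-feedback mechanism, substantially improving the automation and performance of the SOAR architecture.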

Details

Motivation: Existing research focuses mostly on conceptual frameworks and lacks thorough experimental validation of LLM-generated SOAR rules, which limits practical use. Method: NL2GenSym combines a knowledge base accessed through retrieval-augmented generation with an LLM that generates rules; the rules are executed and validated in SOAR and then iteratively refined by an LLM-based critic module. Result: On the Water Jug Problem dataset, the rule generation success rate exceeds 86%; the newly generated heuristic rules reduce decision cycles to 1.98 times the optimal solution and 1/1000 of baseline methods; smaller-parameter models outperform larger ones. Conclusion: NL2GenSym effectively automates the conversion of natural language into symbolic rules, raises the automation level and solving efficiency of SOAR systems, and offers a verifiable, practical path for combining symbolic AI with large models. Abstract: SOAR, a classic symbol-based cognitive architecture, has been fostering the development of general, human-like intelligent agents. Nevertheless, its practical adoption is hindered by laborious manual rule coding. Emerging Large Language Models (LLMs) offer immense potential for efficient rule generation. However, a critical gap remains: current research predominantly focuses on conceptual frameworks and lacks robust experimental validation. To bridge this gap, we propose Natural Language to Generative Symbolic Rules (NL2GenSym), a novel framework that integrates LLMs with SOAR to autonomously produce generative symbolic rules from natural language. Specifically, our framework introduces a novel Execution-Grounded Generator-Critic mechanism. The LLM-based Generator, guided by a Retrieval-Augmented Generation-accessed self-evolving domain knowledge base, proposes rules from natural language. Subsequently, these rules are immediately executed within the SOAR environment to rigorously validate their correctness. Based on this execution-grounded feedback, a reflective LLM-based Critic drives the iterative refinement of these rules. Experiments on our specialized Water Jug Problem (WJP) dataset, utilizing both Gemini and Qwen series models, validate the efficacy of our framework. It achieves a success rate over 86% in generating rules from natural language. Crucially, the framework also generates novel heuristic rules, reducing average decision cycles for solving the WJP to 1.98 times the optimal solution and 1/1000 of baseline methods. Additionally, our initial experiments show that NL2GenSym enables smaller-parameter models to achieve better performance than larger counterparts.

[84] Understanding the Effects of Domain Finetuning on LLMs

Eshaan Tanwar,Deepak Nathani,William Yang Wang,Tanmoy Chakraborty

Main category: cs.CL

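TL;DR: This paper presents the first systematic study of domain-specific fine-tuning of LLMs in the medical domain. It proposes a new framework, tuning vectors, to explain how fine-tuning changes the parameter space, finds that fine-tuning alters only a small part of the representational subspace, and shows that MLP layers and attention heads play different roles in directional alignment.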

Details

Motivation: LLMs fine-tuned for specific domains perform well, but how fine-tuning reshapes their parameter space is not understood; existing work concentrates on auto-regressive or general-purpose instruct models and leaves domain-specialised models under-explored. Method: The authors analyze large medical language models under domain-specific fine-tuning and introduce tuning vectors, a framework inspired by task vectors that explicitly captures the directional parameter shifts induced by fine-tuning; a directional alignment analysis examines their role in MLP layers and attention heads. Result: Fine-tuning modifies only a small part of the representational subspace and largely preserves the pre-trained model's representation; tuning vectors markedly improve instruction-following and generation quality; combining tuning vectors across domains improves generalisation; the vectors mainly write new directional information into MLP layers while amplifying existing directions in attention heads. Conclusion: The study offers a new perspective on how LLMs adapt, and the proposed tuning-vector framework is general and interpretable, which helps in analysing specialisation in large language models. Abstract: Large Language Models (LLMs) fine-tuned for specific domains exhibit strong performance; however, the underlying mechanisms by which this fine-tuning reshapes their parametric space are not well understood. Prior works primarily focus on auto-regressive or general-purpose instruct models, leaving domain-specialised LLMs under-explored. We present the first systematic study of domain-specific fine-tuning in large medical language models. Our analysis reveals that fine-tuning modifies only a small subset of the representational subspace, essentially preserving the pre-trained model's representation. To interpret these changes in subspaces, we propose tuning vectors, a novel framework inspired by task vectors, which explicitly capture the directional parameter shifts induced by fine-tuning. We demonstrate that these vectors are critical for enhancing both instruction-following and generation quality. Furthermore, combining tuning vectors across different domains yields improved generalisation. Upon closer inspection of directional alignment, we find these vectors primarily write new directional information into the MLP layers of the model, while amplifying existing directions in attention heads. Our findings offer new insights into LLM adaptation and provide a general, interpretable framework for analysing specialisation in large language models.
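
Since tuning vectors are described as inspired by task vectors, a natural reading is a per-parameter difference between fine-tuned and pre-trained weights. The sketch below follows that reading as an assumption; the paper's exact definition may differ. The function names, the `scale` parameter, the parameter key `mlp.w`, and the toy weights for the two domains are all hypothetical.

```python
def tuning_vector(finetuned, base):
    """Per-parameter difference between fine-tuned and pre-trained weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}

def apply_vectors(base, vectors, scale=1.0):
    """Add the scaled sum of several tuning vectors to the base weights."""
    out = {k: list(v) for k, v in base.items()}
    for vec in vectors:
        for k, delta in vec.items():
            out[k] = [w + scale * d for w, d in zip(out[k], delta)]
    return out

base = {"mlp.w": [1.0, 1.0]}      # toy pre-trained weights
medical = {"mlp.w": [1.5, 1.0]}   # hypothetical fine-tuned weights, domain A
legal = {"mlp.w": [1.0, 0.5]}     # hypothetical fine-tuned weights, domain B
combined = apply_vectors(base, [tuning_vector(medical, base),
                                tuning_vector(legal, base)])
```

Combining vectors by addition is the simplest choice; it works cleanly in this toy case because the two domains change different coordinates.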

[85] Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood

Xingyu Lin,Yilin Wen,En Wang,Du Su,Wenbin Liu,Chenfu Bao,Zhonghou Lv

Main category: cs.CL

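TL;DR: This paper proposes TEPO, a new token-level framework that links group-level rewards to tokens through Markov likelihood. It addresses the entropy collapse that existing methods suffer in chain-of-thought reasoning because of sparse token rewards, and achieves better performance and training stability on mathematical reasoning tasks.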

Details

Motivation: Entropy-regularization methods such as GRPO face sparse token rewards in chain-of-thought reasoning and often apply undifferentiated token-level entropy adjustments, which easily lead to entropy collapse or model collapse. Method: The authors propose TEPO, which uses Markov likelihood (sequence likelihood) with token-level aggregation to link group-level rewards to individual tokens, enabling finer-grained policy optimization. Result: Experiments show that TEPO consistently outperforms existing baselines on key metrics such as @k and accuracy, improves training stability, and sets a new state of the art on mathematical reasoning tasks. Conclusion: TEPO effectively mitigates the entropy collapse caused by sparse rewards and offers a more stable and efficient token-level optimization scheme for training the reasoning ability of LLMs. Abstract: Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) to link group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.

[86] Identifying & Interactively Refining Ambiguous User Goals for Data Visualization Code Generation

Mert İnan,Anthony Sicilia,Alex Xie,Saujas Vaduguru,Daniel Fried,Malihe Alikhani

Main category: cs.CL

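TL;DR: This paper studies ambiguity in natural-language-to-code generation for data visualization, proposes a taxonomy of ambiguity types and metrics to quantify them, and shows how multi-turn dialogue guided by pragmatic models reduces ambiguity and improves code accuracy.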

Details

Motivation: Ambiguity in natural language can produce code that looks correct but does not match the user's intent, so goal mismatches in human-AI communication need to be identified and resolved systematically. Method: The authors propose a taxonomy of ambiguity types and metrics to quantify them, run experiments on the Matplotlib problems in the DS-1000 dataset, and design multi-turn dialogue strategies based on three pragmatic models: Gricean Cooperativity, Discourse Representation Theory, and Questions under Discussion. Result: The proposed ambiguity metrics correlate better with human annotations than uncertainty baselines; a simulated user study shows that pragmatics-based multi-turn dialogue reduces ambiguity and improves the accuracy of generated code. Conclusion: Multi-turn dialogue combined with pragmatic models helps clarify user intent and markedly reduces ambiguity in natural-language-to-code generation, improving goal alignment between humans and AI. Abstract: Establishing shared goals is a fundamental step in human-AI communication. However, ambiguities can lead to outputs that seem correct but fail to reflect the speaker's intent. In this paper, we explore this issue with a focus on the data visualization domain, where ambiguities in natural language impact the generation of code that visualizes data. The availability of multiple views on the context (e.g., the intended plot and the code rendering the plot) allows for a unique and comprehensive analysis of diverse ambiguity types. We develop a taxonomy of types of ambiguity that arise in this task and propose metrics to quantify them. Using Matplotlib problems from the DS-1000 dataset, we demonstrate that our ambiguity metrics better correlate with human annotations than uncertainty baselines. Our work also explores how multi-turn dialogue can reduce ambiguity and therefore improve code accuracy by better matching user goals. We evaluate three pragmatic models to inform our dialogue strategies: Gricean Cooperativity, Discourse Representation Theory, and Questions under Discussion. A simulated user study reveals how pragmatic dialogues reduce ambiguity and enhance code accuracy, highlighting the value of multi-turn exchanges in code generation.

[87] Beyond Single-Granularity Prompts: A Multi-Scale Chain-of-Thought Prompt Learning for Graph

Ziyu Zheng,Yaming Yang,Ziyu Guan,Wei Zhao,Xinyan Huang,Weigang Lu

Main category: cs.CL

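TL;DR: Proposes the Multi-Scale Graph Chain-of-Thought (MSGCOT) prompting framework, which integrates multi-scale information to improve graph prompt learning and performs especially well in few-shot settings.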

Details

Motivation: Existing graph prompt methods are confined to a single granularity and overlook the multi-scale structural information inherent in graph data, which limits the diversity of prompt semantics. Method: The authors design a lightweight low-rank coarsening network to capture multi-scale structural features and, mimicking the coarse-to-fine pattern of human cognition, dynamically integrate multi-scale information to build a progressively refined prompt chain. Result: Experiments on eight benchmark datasets show that MSGCOT outperforms the state-of-the-art single-granularity graph prompt method, with the largest gains in few-shot settings. Conclusion: MSGCOT makes effective use of multi-scale structural information and improves the expressiveness and generalization of graph prompt learning. Abstract: The "pre-train, prompt" paradigm, designed to bridge the gap between pre-training tasks and downstream objectives, has been extended from the NLP domain to the graph domain and has achieved remarkable progress. Current mainstream graph prompt-tuning methods modify input or output features using learnable prompt vectors. However, existing approaches are confined to single-granularity (e.g., node-level or subgraph-level) during prompt generation, overlooking the inherently multi-scale structural information in graph data, which limits the diversity of prompt semantics. To address this issue, we pioneer the integration of multi-scale information into graph prompt and propose a Multi-Scale Graph Chain-of-Thought (MSGCOT) prompting framework. Specifically, we design a lightweight, low-rank coarsening network to efficiently capture multi-scale structural features as hierarchical basis vectors for prompt generation. Subsequently, mimicking human cognition from coarse-to-fine granularity, we dynamically integrate multi-scale information at each reasoning step, forming a progressive coarse-to-fine prompt chain. Extensive experiments on eight benchmark datasets demonstrate that MSGCOT outperforms the state-of-the-art single-granularity graph prompt-tuning method, particularly in few-shot scenarios, showcasing superior performance.

[88] Active Model Selection for Large Language Models

Yavuz Durmazkeser,Patrik Okanovic,Andreas Kirsch,Torsten Hoefler,Nezihe Merve Gürel

Main category: cs.CL

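TL;DR: LLM SELECTOR is the first framework for active model selection of large language models; it adaptively selects a small set of the most informative queries and uses a judge-based annotation model, significantly reducing annotation cost.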

Details

Motivation: Existing LLM evaluation methods depend on fully annotated datasets, which is costly and inefficient, making it hard to select the best model under a limited annotation budget. Method: Proposes the LLM SELECTOR framework, which uses an active-learning strategy to adaptively choose the queries most valuable for model selection, and introduces a judge-based oracle annotation mechanism to further reduce human annotation effort. Result: Experiments on 6 benchmarks and 151 LLMs show that LLM SELECTOR reduces annotation cost by up to 59.62% when selecting the best or near-best model. Conclusion: LLM SELECTOR effectively reduces the annotation cost of LLM selection and offers a practical approach to efficient model selection in real applications. Abstract: We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.

[89] On the Representations of Entities in Auto-regressive Large Language Models

Victor Morand,Josiane Mothe,Benjamin Piwowarski

Main category: cs.CL

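TL;DR: Proposes entity mention reconstruction, a new framework for studying how large language models encode and manipulate entities, and introduces the Entity Lens for predicting multi-token entity mentions.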

Details

Motivation: It remains unclear how large language models (LLMs) represent entities internally, especially multi-token entities and the relational knowledge attached to them. Method: Uses task vectors to generate multi-token mentions from various entity representations extracted from the LLM's hidden states, and extends the logit-lens into the Entity Lens to predict entity mentions. Result: Multi-token entity mentions can be generated consistently from internal representations; the results indicate that LLMs develop entity-specific mechanisms to represent and manipulate any multi-token entity, including entities unseen during training. Conclusion: LLMs have dedicated mechanisms for encoding and manipulating multi-token entities, and these mechanisms can be exposed effectively with task vectors and the Entity Lens. Abstract: Named entities are fundamental building blocks of knowledge in text, grounding factual information and structuring relationships within language. Despite their importance, it remains unclear how Large Language Models (LLMs) internally represent entities. Prior research has primarily examined explicit relationships, but little is known about entity representations themselves. We introduce entity mention reconstruction as a novel framework for studying how LLMs encode and manipulate entities. We investigate whether entity mentions can be generated from internal representations, how multi-token entities are encoded beyond last-token embeddings, and whether these representations capture relational knowledge. Our proposed method, leveraging _task vectors_, makes it possible to consistently generate multi-token mentions from various entity representations derived from the LLM's hidden states. We thus introduce the _Entity Lens_, extending the _logit-lens_ to predict multi-token mentions. Our results bring new evidence that LLMs develop entity-specific mechanisms to represent and manipulate any multi-token entities, including those unseen during training. Our code is available at https://github.com/VictorMorand/EntityRepresentations .

[90] Dyna-Mind: Learning to Simulate from Experience for Better AI Agents

Xiao Yu,Baolin Peng,Michel Galley,Hao Cheng,Qianhui Wu,Janardhan Kulkarni,Suman Nath,Zhou Yu,Jianfeng Gao

Main category: cs.CL

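TL;DR: Proposes Dyna-Mind, a two-stage training framework that improves the reasoning and planning of AI agents in complex interactive environments by combining reasoning with simulations (ReSim) and reinforcement learning (Dyna-GRPO); its effectiveness is validated on several benchmarks.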

Details

Motivation: Current AI agents perform poorly on long-horizon, interactive tasks. Inspired by research on human cognition, the authors argue that agents need "vicarious trial and error", i.e., simulating multiple possible futures before acting, to improve their understanding of the environment and their decisions. Method: Proposes the Dyna-Mind framework. Stage 1 uses ReSim, which builds search trees from real interaction data and generates structured reasoning traces; stage 2 uses Dyna-GRPO, an online reinforcement learning method that optimizes the policy with outcome rewards and intermediate-state feedback. Result: On the Sokoban, ALFWorld, and AndroidWorld benchmarks, ReSim effectively gives agents simulation ability, and Dyna-GRPO improves performance on long-horizon, planning-intensive tasks. Conclusion: Simulation plays a central role in the reasoning, planning, and acting of AI agents, and Dyna-Mind offers an effective path toward agents that make good decisions in complex environments. Abstract: Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone-use. Inspired by literature on human cognition, we argue that current AI agents need "vicarious trial and error" - the capacity to mentally simulate alternative futures before acting - in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent's reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method to further strengthen the agent's simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in ever more challenging environments.

[91] The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach

Nizar El Ghazal,Antoine Caubrière,Valentin Vielzeuf

Main category: cs.CL

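TL;DR: Studies context management strategies for end-to-end spoken dialogue state tracking with Speech-LLMs, comparing traditional multimodal context, full spoken history, and compressed spoken history. Feeding the full spoken conversation as input performs best, while attention-pooling-based compression reduces context size and retains high accuracy.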

Details

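Motivation: To improve end-to-end spoken dialogue state tracking by exploring more effective context management strategies, in particular strategies that avoid the information loss introduced by text transcription. Method: Systematically evaluates three context strategies on the SpokenWOZ corpus: traditional multimodal context (text history plus the current spoken turn), full spoken history, and spoken history compressed with attention pooling. Result: Full spoken history achieves the highest performance, significantly surpassing prior methods; compressed spoken history keeps competitive accuracy with a shorter context; analysis indicates that the gains come from more effective context utilization. Conclusion: Using the full spoken history directly as input is an effective way to improve Speech-LLMs on dialogue state tracking, and attention pooling offers a good balance between efficiency and performance. Abstract: This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.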

[92] KORMo: Korean Open Reasoning Model for Everyone

Minjun Kim,Hyeonseok Lim,Hangyeol Yoo,Inho Won,Seungwoo Song,Minkyung Cho,Junhun Yuk,Changsu Choi,Dongjae Shin,Huige Lee,Hoyun Song,Alice Oh,Kyungtae Lim

Main category: cs.CL

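TL;DR: Introduces KORMo-10B, a 10.8B-parameter Korean-English bilingual LLM trained from scratch predominantly on synthetic data. It shows that carefully curated synthetic data can sustain large-scale pretraining without degradation, and its full open release advances fully open models for low-resource languages.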

Details

Motivation: To build a fully open bilingual LLM for a non-English language (specifically Korean) and to examine whether large-scale training on synthetic data is feasible and effective in low-resource settings. Method: Trains a 10.8B-parameter Korean-English model from scratch, with 68.74% of the Korean data being synthetic; the synthetic data is constructed for balanced linguistic coverage and diverse instruction styles, and the model then undergoes bilingual instruction tuning. Result: The model performs comparably to existing open-weight multilingual baselines on reasoning, knowledge, and instruction-following benchmarks; the experiments show that synthetic data does not cause training instability or model collapse, and that bilingual instruction tuning enables near-native Korean reasoning and discourse coherence. Conclusion: Synthetic data can reliably sustain large-scale, long-horizon pretraining, bilingual instruction tuning helps performance in low-resource languages, and the work provides a reproducible open example for developing fully open models in such settings. Abstract: This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.

[93] Domain-Adapted Pre-trained Language Models for Implicit Information Extraction in Crash Narratives

Xixi Wang,Jordanka Kovaceva,Miguel Costa,Shuai Wang,Francisco Camara Pereira,Robert Thomson

Main category: cs.CL

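TL;DR: Investigates how compact open-source pre-trained language models (such as BERT and LLMs) can extract inference-heavy information from real-world crash narratives. Fine-tuning, including LoRA, improves identification of Crash Type and Manner of Collision, and the fine-tuned models outperform closed LLMs such as GPT-4o.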

Details

Motivation: Existing tools struggle to batch process unstructured, non-standardized crash narratives, and closed LLMs raise privacy concerns and lack domain knowledge, which leads to weak performance on complex classification tasks. Method: Fine-tunes compact open-source pre-trained language models (LLMs with Low-Rank Adaptation (LoRA), and BERT) to inject task-specific knowledge and improve their inference on Manner of Collision and Crash Type identification. Result: On the real-world CISS dataset, the fine-tuned compact models outperform strong closed models such as GPT-4o on both tasks, capture richer narrative details, and even correct some annotation errors in the data. Conclusion: With appropriate fine-tuning, compact open-source pre-trained language models can support complex information extraction from crash narratives while preserving privacy and saving resources. Abstract: Free-text crash narratives recorded in real-world crash databases have been shown to play a significant role in improving traffic safety. However, large-scale analyses remain difficult to implement as there are no documented tools that can batch process the unstructured, non-standardized text content written by various authors with diverse experience and attention to detail. In recent years, Transformer-based pre-trained language models (PLMs), such as Bidirectional Encoder Representations from Transformers (BERT) and large language models (LLMs), have demonstrated strong capabilities across various natural language processing tasks. These models can extract explicit facts from crash narratives, but their performance declines on inference-heavy tasks in, for example, Crash Type identification, which can involve nearly 100 categories. Moreover, relying on closed LLMs through external APIs raises privacy concerns for sensitive crash data. Additionally, these black-box tools often underperform due to limited domain knowledge. Motivated by these challenges, we study whether compact open-source PLMs can support reasoning-intensive extraction from crash narratives. We target two challenging objectives: 1) identifying the Manner of Collision for a crash, and 2) Crash Type for each vehicle involved in the crash event from real-world crash narratives. To bridge domain gaps, we apply fine-tuning techniques to inject task-specific knowledge to LLMs with Low-Rank Adaptation (LoRA) and BERT. Experiments on the authoritative real-world dataset Crash Investigation Sampling System (CISS) demonstrate that our fine-tuned compact models outperform strong closed LLMs, such as GPT-4o, while requiring only minimal training resources. Further analysis reveals that the fine-tuned PLMs can capture richer narrative details and even correct some mislabeled annotations in the dataset.

[94] Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World

Ines Altemir Marinas,Anastasiia Kucherenko,Alexander Sternfeld,Andrei Kucharavy

Main category: cs.CL

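TL;DR: Presents the full-text indexing pipeline for the Apertus LLM training data. Using Elasticsearch on the Alps arm64 supercluster, the authors indexed 8.6T tokens, providing a workable basis for LLM safety tooling and open web search.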

Details

Motivation: Although open LLMs are increasingly common, their training data remains hard to access and is very large, which makes it difficult for the scientific community to study the data and ensure model safety; a scalable, efficient indexing method is needed to improve transparency and safety. Method: Uses Elasticsearch parallel indices deployed on the energy-efficient, arm64-based Alps supercomputing infrastructure to build a full-text index of the Apertus LLM training data. Result: 8.6T of the 15.2T training tokens were indexed, showing that Elasticsearch can run on next-generation arm64 infrastructure, that full-text indexing at the scale of LLM training data is achievable, and that such an index is useful for jailbreak-agnostic LLM safety. Conclusion: The work provides an accessible index of LLM training data, supports the move toward greener computation, and offers infrastructure and a reference example for future large-scale data indexing and model safety research. Abstract: The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet. In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.

[95] Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic

Manuel Vargas Guzmán,Jakub Szymanik,Maciej Malicki

Main category: cs.CL

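TL;DR: Studies the logical generalization of large language models in natural language reasoning, separating two key aspects, compositionality and recursiveness. Current models are weak on compositionality, so the paper proposes a hybrid architecture that combines symbolic reasoning with neural computation to make inference more robust and efficient.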

Details

Motivation: Despite major progress in neural models, their ability to generalize on tasks such as logical reasoning remains limited, and compositionality and recursiveness are often not clearly distinguished; models' logical abstraction and reasoning abilities need to be evaluated explicitly and improved. Method: Uses the syllogistic fragment as a benchmark to evaluate pre-trained LLMs on compositionality and recursiveness, and proposes a neuro-symbolic hybrid architecture in which the symbolic system guarantees completeness and neural components accelerate computation. Result: LLMs are reasonably proficient at recursiveness but clearly deficient at compositionality; the hybrid architecture makes logical inference more reliable while preserving high efficiency, and small neural components are sufficient for good performance. Conclusion: Compositionality is the main bottleneck for logical generalization in current neural models, and hybrid architectures that incorporate symbolic reasoning are an effective way past it, offering a practical route to reliable neural reasoning systems. Abstract: Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications like logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: compositionality, the capacity to abstract atomic logical rules underlying complex inferences, and recursiveness, the aptitude to build intricate representations through iterative application of inference rules. In the literature, these two aspects are often conflated under the umbrella term of generalization. To sharpen this distinction, we investigated the logical generalization capabilities of pre-trained large language models (LLMs) using the syllogistic fragment as a benchmark for natural language reasoning. Though simple, this fragment provides a foundational yet expressive subset of formal logic that supports controlled evaluation of essential reasoning abilities. Our findings reveal a significant disparity: while LLMs demonstrate reasonable proficiency in recursiveness, they struggle with compositionality. To overcome these limitations and establish a reliable logical prover, we propose a hybrid architecture integrating symbolic reasoning with neural computation. This synergistic interaction enables robust and efficient inference: neural components accelerate processing, while symbolic reasoning ensures completeness. Our experiments show that high efficiency is preserved even with relatively small neural components. As part of our proposed methodology, this analysis gives a rationale and highlights the potential of hybrid models to effectively address key generalization barriers in neural reasoning systems.

[96] Multimodal Policy Internalization for Conversational Agents

Zhenhailong Wang,Jiateng Liu,Amin Fazel,Ritesh Sarkhel,Xing Fan,Xiang Li,Chenlei Guo,Heng Ji,Ruhi Sarikaya

Main category: cs.CL

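TL;DR: Introduces the Multimodal Policy Internalization (MPI) task, which internalizes complex multimodal policies into model parameters so that the model follows them better without the policy being included at inference time. The authors build two datasets and propose TriMPI, a three-stage training framework combining continual pretraining, supervised finetuning, and PolicyRollout, a reinforcement learning method based on policy-aware responses, with notable gains in accuracy, generalization, and robustness to forgetting.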

Details

Motivation: As LLM-based conversational systems support more diverse business and user queries, implementing policies through in-context prompts becomes complex and computationally expensive, and effective policy modeling is especially lacking in multimodal settings. A new method is needed that internalizes complex multimodal policies into the model. Method: Proposes the three-stage TriMPI training framework: stage 1 injects policy knowledge through continual pretraining; stage 2 performs supervised finetuning; stage 3 applies PolicyRollout, a GRPO-style reinforcement learning extension that adds policy-aware responses to rollouts for more grounded exploration. Two MPI datasets covering synthetic and real-world tasks are also built. Result: TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. Experiments confirm its effectiveness on multimodal decision-making and tool-using tasks; it is the first work on multimodal policy internalization. Conclusion: TriMPI internalizes multimodal policies into model parameters, removing the dependence on long prompts at inference time and lowering computational cost, and it provides data, training methods, and evaluation benchmarks for future work on policy learning in multimodal agents. Abstract: Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.

[97] StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

Yuchen Lu,Run Yang,Yichen Zhang,Shuguang Yu,Runpeng Dai,Ziwei Wang,Jiayi Xiang,Wenxin E,Siran Gao,Xinyao Ruan,Yirui Huang,Chenjing Xi,Haibo Hu,Yueming Fu,Qinglan Yu,Xiaobing Wei,Jiani Gu,Rui Sun,Jiaxuan Jia,Fan Zhou

Main category: cs.CL

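TL;DR: Introduces StatEval, the first comprehensive benchmark dedicated to statistics, with 13,817 foundational problems and 2,374 research-level proof tasks, built and quality-controlled by a multi-agent pipeline. Experiments show that current LLMs still have marked limitations in statistical reasoning.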

Details

Motivation: Statistics, as a distinct and integrative discipline, has not been adequately explored in current LLM evaluation, and a systematic, multi-level benchmark is missing. Method: Proposes the StatEval benchmark, covering foundational problems from undergraduate and graduate curricula and research-level proof tasks from leading journals; designs a scalable multi-agent pipeline with human-in-the-loop validation for problem extraction and quality control; and builds a robust evaluation framework for computational and proof-based tasks. Result: Closed-source models such as GPT5-mini score below 57% on research-level problems and open-source models do worse, reflecting a clear weakness of current LLMs in statistical reasoning. Conclusion: StatEval provides a rigorous, comprehensive benchmark for evaluating statistical intelligence in LLMs, exposes the limits of existing models in statistical reasoning, and is expected to support further progress in the area. Abstract: Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that closed-source models such as GPT5-mini achieve below 57% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.

[98] Can We Reliably Rank Model Performance across Domains without Labeled Data?

Veronica Rammouz,Aaron Gonzalez,Carlos Cruzportillo,Adrian Tan,Nicole Beebe,Anthony Rios

Main category: cs.CL

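TL;DR: Studies methods for estimating NLP model performance without labels and finds that LLM-based error predictors give better cross-domain performance rankings.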

Details

Motivation: To understand how NLP models generalize when no labels are available and to assess how reliable existing performance estimation methods are across domains. Method: Uses a two-step evaluation setup with four base classifiers and several large language models as error predictors, with experiments on the GeoOLID and Amazon Reviews datasets. Result: LLM-based error predictors yield stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines; ranking is more reliable when performance differences across domains are larger and when the error model's predictions match the base model's actual failure patterns. Conclusion: The study clarifies the conditions under which performance estimation methods can be trusted and gives practical guidance for cross-domain model evaluation. Abstract: Estimating model performance without labels is an important goal for understanding how NLP models generalize. While prior work has proposed measures based on dataset similarity or predicted correctness, it remains unclear when these estimates produce reliable performance rankings across domains. In this paper, we analyze the factors that affect ranking reliability using a two-step evaluation setup with four base classifiers and several large language models as error predictors. Experiments on the GeoOLID and Amazon Reviews datasets, spanning 15 domains, show that large language model-based error predictors produce stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines. Our analysis reveals two key findings: ranking is more reliable when performance differences across domains are larger, and when the error model's predictions align with the base model's true failure patterns. These results clarify when performance estimation methods can be trusted and provide guidance for their use in cross-domain model evaluation.
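
As an illustration of the ranking criterion mentioned in the abstract above, here is a minimal sketch of how agreement between estimated and true per-domain accuracy could be scored. It assumes Spearman's rank correlation (the abstract only says "rank correlations"), and the domain names and accuracy values are hypothetical, not taken from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    if std_x == 0 or std_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (std_x * std_y)


# Hypothetical per-domain accuracies (illustrative only).
true_acc = {"domain_a": 0.91, "domain_b": 0.84, "domain_c": 0.78, "domain_d": 0.66}
estimated_acc = {"domain_a": 0.88, "domain_b": 0.80, "domain_c": 0.82, "domain_d": 0.60}

domains = sorted(true_acc)
rho = spearman([true_acc[d] for d in domains], [estimated_acc[d] for d in domains])
print(f"Spearman rho between estimated and true rankings: {rho:.2f}")
```

In this toy example the estimator swaps two domains whose true accuracies are close together, which lowers the correlation without reversing the overall ordering. That is the kind of case the abstract points to when it says ranking is more reliable when performance differences across domains are larger.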

[99] Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking

Mohammad Hossein Sameti,Sepehr Harfi Moridani,Ali Zarean,Hossein Sameti

Main category: cs.CL

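TL;DR: Proposes an accent-invariant ASR framework that uses accent classification and spectrogram masking for data augmentation, making multilingual ASR systems more robust to accent and dialect variation.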

Details

Motivation: Existing pre-trained Transformer-based models handle accent and dialect variation poorly, which leads to high word error rates in languages such as English and Persian; ASR systems need to be more robust to linguistic diversity. Method: Trains a spectrogram-based accent classifier, identifies and masks the regions most influential to its predictions, and uses the masked spectrograms for data augmentation to make the ASR model more resilient to accent variation. Result: The method is validated on English and Persian speech and substantially reduces the word error rate of the Whisper model; the authors also introduce a newly collected, publicly available Persian dataset spanning multiple regional accents, which they describe as the first systematic benchmark for accent variation in Persian ASR. Conclusion: The proposed accent-invariant ASR framework improves the adaptation of multilingual ASR systems to accent and dialect differences and supports speech recognition research for low-resource, linguistically diverse settings. Abstract: Pre-trained transformer-based models have significantly advanced automatic speech recognition (ASR), yet they remain sensitive to accent and dialectal variations, resulting in elevated word error rates (WER) in linguistically diverse languages such as English and Persian. To address this challenge, we propose an accent-invariant ASR framework that integrates accent and dialect classification into the recognition pipeline. Our approach involves training a spectrogram-based classifier to capture accent-specific cues, masking the regions most influential to its predictions, and using the masked spectrograms for data augmentation. This enhances the robustness of ASR models against accent variability. We evaluate the method using both English and Persian speech. For Persian, we introduce a newly collected dataset spanning multiple regional accents, establishing the first systematic benchmark for accent variation in Persian ASR that fills a critical gap in multilingual speech research and provides a foundation for future studies on low-resource, linguistically diverse languages. Experimental results with the Whisper model demonstrate that our masking and augmentation strategy yields substantial WER reductions in both English and Persian settings, confirming the effectiveness of the approach. This research advances the development of multilingual ASR systems that are resilient to accent and dialect diversity. Code and dataset are publicly available at: https://github.com/MH-Sameti/Accent_invariant_ASR
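
The masking step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a saliency map of the same shape as the spectrogram is already available (for example from an attribution method applied to the accent classifier), and the function name, the `fraction` parameter, and the zero fill value are hypothetical choices.

```python
def mask_salient_regions(spectrogram, saliency, fraction=0.1, fill=0.0):
    """Return a copy of `spectrogram` with the most salient cells replaced by `fill`.

    `spectrogram` and `saliency` are lists of rows (frequency bins) of equal
    shape; `fraction` is the share of cells to mask, an assumed hyperparameter.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    cells = [
        (saliency[f][t], f, t)
        for f in range(len(saliency))
        for t in range(len(saliency[f]))
    ]
    num_masked = int(len(cells) * fraction)
    most_salient = sorted(cells, reverse=True)[:num_masked]
    masked = [row[:] for row in spectrogram]  # leave the input untouched
    for _, f, t in most_salient:
        masked[f][t] = fill
    return masked


# Toy 2x4 "spectrogram" and a hand-written saliency map (illustrative values).
spec = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0]]
sal = [[0.10, 0.90, 0.20, 0.80],
       [0.70, 0.30, 0.60, 0.05]]

augmented = mask_salient_regions(spec, sal, fraction=0.5)
print(augmented)
```

The masked copy would then be added to the ASR training data alongside the original, so the recognizer sees examples in which the accent-specific cues have been suppressed.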

[100] Mitigating Overthinking through Reasoning Shaping

Feifan Song,Shaohang Wei,Bofei Gao,Yejie Wang,Wen Luo,Wei Li,Linli Yao,Weimin Xiong,Liang Chen,Tianyu Liu,Houfeng Wang

Main category: cs.CL

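TL;DR: Proposes Group Relative Segment Penalization (GRSP), a step-level regularization method that mitigates overthinking in large reasoning models trained with Reinforcement Learning from Verifier Reward (RLVR), where token-level penalties tend to hurt performance. GRSP substantially improves reasoning efficiency while preserving accuracy.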

Details

Motivation: Existing RLVR methods often hurt model performance when they reduce token consumption, because they rely on simple token-level penalties. The paper argues that the granularity of supervision is key to balancing efficiency and accuracy and therefore proposes segment-level supervision. Method: Proposes Group Relative Segment Penalization (GRSP), which divides the reasoning process into segments and applies a length-aware weighting mechanism across segment clusters, penalizing at the segment level to control redundant reasoning. Result: GRSP reduces token consumption while keeping accuracy high, with advantages that are especially pronounced on harder problems; it also stabilizes RL training and scales effectively across model sizes. Conclusion: Through segment-level supervision, GRSP mitigates overthinking in large reasoning models and achieves a better trade-off between efficiency and performance, with good scalability. Abstract: Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier Reward (RLVR) have shown great power in problem solving, yet they often exhibit overthinking: excessive, meandering reasoning that inflates computational cost. Prior designs of penalization in RLVR manage to reduce token consumption while often harming model performance, which arises from the oversimplicity of token-level supervision. In this paper, we argue that the granularity of supervision plays a crucial role in balancing efficiency and accuracy, and propose Group Relative Segment Penalization (GRSP), a step-level method to regularize reasoning. Since preliminary analyses show that reasoning segments are strongly correlated with token consumption and model performance, we design a length-aware weighting mechanism across segment clusters. Extensive experiments demonstrate that GRSP achieves superior token efficiency without heavily compromising accuracy, with advantages that are especially pronounced on harder problems. Moreover, GRSP stabilizes RL training and scales effectively across model sizes.

[101] Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors

Yihong Liu,Raoyuan Zhao,Lena Altinger,Hinrich Schütze,Michael A. Hedderich

Main category: cs.CL

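TL;DR: Introduces MulTypo, a multilingual typo generation algorithm for evaluating how robust large language models are to inputs containing typographical errors. Typos consistently degrade performance, instruction tuning may increase sensitivity to noise, and high-resource languages are more robust.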

Details

Motivation: Most benchmarks assume clean input, whereas real user input often contains typos, especially in multilingual settings; the robustness of LLMs to multilingual typos therefore needs study. Method: Proposes the MulTypo algorithm, which simulates human-like typos based on language-specific keyboard layouts and typing behavior, and evaluates 18 open-source LLMs on five downstream tasks. Result: Typos degrade model performance, particularly on generative and reasoning tasks; the natural language inference task is comparatively robust; instruction tuning improves clean-input performance but may weaken resistance to noise; high-resource languages are more robust than low-resource ones, and translation from English is more robust than translation into English. Conclusion: LLM performance drops on multilingual input containing typos, which calls for noise-aware training and multilingual robustness evaluation. Abstract: Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs -- naturally introducing typographical errors (typos). Yet most benchmarks assume clean input, leaving the robustness of LLMs to typos across languages largely underexplored. To address this gap, we introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior. We evaluate 18 open-source LLMs across three model families and five downstream tasks spanning language inference, multi-choice question answering, mathematical reasoning, and machine translation tasks. Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning -- while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. We also observe language-dependent robustness: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English. Our findings underscore the need for noise-aware training and multilingual robustness evaluation. We make our code and data publicly available.
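
To make the idea of keyboard-based typo generation concrete, here is a minimal sketch of one error type, substitution by an adjacent key on a QWERTY layout. It is not the MulTypo algorithm itself: the neighbor table, the `rate` parameter, and the restriction to same-row neighbors and lowercase letters are simplifying assumptions made for illustration.

```python
import random

# Letter rows of a US QWERTY layout; other languages would need their own rows.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows):
    """Map each key to the keys directly left and right of it on the same row."""
    neighbors = {}
    for row in rows:
        for i, key in enumerate(row):
            adjacent = []
            if i > 0:
                adjacent.append(row[i - 1])
            if i < len(row) - 1:
                adjacent.append(row[i + 1])
            neighbors[key] = adjacent
    return neighbors


NEIGHBORS = build_neighbors(QWERTY_ROWS)


def inject_typos(text, rate, rng):
    """Replace each lowercase letter by a neighboring key with probability `rate`."""
    out = []
    for ch in text:
        if ch in NEIGHBORS and rng.random() < rate:
            out.append(rng.choice(NEIGHBORS[ch]))
        else:
            out.append(ch)  # spaces, digits, and uppercase pass through unchanged
    return "".join(out)


clean = "the model answers the question"
noisy = inject_typos(clean, rate=0.2, rng=random.Random(0))
print(noisy)
```

Passing a seeded `random.Random` instance keeps the perturbation reproducible, which matters when clean and noisy scores are compared across models on the same inputs.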

[102] SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

Chengyu Wang,Paria Rashidinejad,DiJia Su,Song Jiang,Sid Wang,Siyan Zhao,Cai Zhou,Shannon Zejiang Shen,Feiyu Chen,Tommi Jaakkola,Yuandong Tian,Bo Liu

Main category: cs.CL

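TL;DR: Proposes Sandwiched Policy Gradient (SPG), which uses both an upper and a lower bound of the log-likelihood to address the policy gradient bias that arises in reinforcement learning for diffusion large language models (dLLMs), whose log-likelihood is intractable. SPG significantly improves performance on several tasks.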

Details

Motivation: Because the log-likelihood of diffusion large language models (dLLMs) is intractable, standard policy gradient methods cannot be applied directly for preference alignment or task-reward optimization, and existing one-sided approximations such as the ELBO can introduce significant gradient bias. Method: Proposes Sandwiched Policy Gradient (SPG), which uses both an upper and a lower bound of the true log-likelihood to build a more accurate policy gradient estimate with less bias. Result: On GSM8K, MATH500, Countdown, and Sudoku, SPG improves accuracy over baselines based on the ELBO or one-step estimation by 3.6%, 2.6%, 18.4%, and 27.0%, respectively. Conclusion: By bounding the log-likelihood from both sides, SPG mitigates gradient bias in reinforcement learning for dLLMs and significantly improves model performance. Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.

[103] Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models

Qiguang Chen,Hanjing Li,Libo Qin,Dengyun Peng,Jinhao Liu,Jiangyi Wang,Chengyue Wu,Xie Chen,Yantao Du,Wanxiang Che

Main category: cs.CL

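TL;DR: Identifies the Parallel-Sequential Contradiction (PSC) in diffusion large language models (DLLMs), shows that they revert to autoregressive-like behavior on complex reasoning tasks, and proposes three scaling dimensions along with mitigation strategies.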

Details

Motivation: DLLMs offer high throughput and effective sequential reasoning, but parallel decoding conflicts with the causal order that rigorous reasoning often requires, which affects reasoning depth and efficiency; the root of this problem needs systematic analysis. Method: Identifies the PSC through behavioral analyses, studies three scaling dimensions (parallel, diffusion, and sequential) empirically, and proposes mitigations including parallel-oriented prompting, diffusion early stopping, and parallel scaling. Result: DLLMs show genuine parallelism only for directly decidable outputs; as task difficulty increases they revert to autoregressive-like behavior; autoregressive prompting nearly doubles the number of decoding steps without improving quality; PSC restricts self-reflection, reasoning depth, and exploratory breadth; parallel scaling is effective, while diffusion and sequential scaling are constrained by PSC. Conclusion: PSC is a core bottleneck for DLLM performance, and training and inference mechanisms need to be redesigned to achieve genuinely parallel, efficient reasoning. Abstract: Recently, Diffusion Large Language Models (DLLMs) have offered high throughput and effective sequential reasoning, making them a competitive alternative to autoregressive LLMs (ALLMs). However, parallel decoding, which enables simultaneous token updates, conflicts with the causal order often required for rigorous reasoning. We first identify this conflict as the core Parallel-Sequential Contradiction (PSC). Behavioral analyses in both simple and complex reasoning tasks show that DLLMs exhibit genuine parallelism only for directly decidable outputs. As task difficulty increases, they revert to autoregressive-like behavior, a limitation exacerbated by autoregressive prompting, which nearly doubles the number of decoding steps with remasking without improving quality. Moreover, PSC restricts DLLMs' self-reflection, reasoning depth, and exploratory breadth. To further characterize PSC, we introduce three scaling dimensions for DLLMs: parallel, diffusion, and sequential. Empirically, while parallel scaling yields consistent improvements, diffusion and sequential scaling are constrained by PSC. Based on these findings, we propose several practical mitigations (parallel-oriented prompting, diffusion early stopping, and parallel scaling) to reduce PSC-induced ineffectiveness and inefficiencies.

[104] Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval

Yu Wang,Tianhao Tan,Yifei Wang

Main category: cs.CL

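TL;DR: Proposes a multi-stage framework that combines multilingual semantics, domain terminology, and efficient long-video processing to retrieve relevant instructional videos from multilingual medical archives, achieving state-of-the-art performance on the mVCR task.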

Details

Motivation: Existing systems either compress long videos into coarse embeddings, which loses information, or incur prohibitive costs for fine-grained matching, which makes it hard to answer complex multi-hop questions across language boundaries. Method: Video subtitles are split into semantically coherent chunks, enriched with concise knowledge-graph facts, and organized into a hierarchical tree; node embeddings come from a language-agnostic multilingual encoder; at query time a coarse-to-fine tree search prunes branches, and only the top-ranked chunks are re-ranked by a lightweight LLM. Result: The method achieves state-of-the-art performance on the mVCR test set, and ablations confirm the complementary contributions of knowledge-graph enrichment, hierarchical indexing, and targeted LLM re-ranking. Conclusion: The method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections. Abstract: Retrieving relevant instructional videos from multilingual medical archives is crucial for answering complex, multi-hop questions across language boundaries. However, existing systems either compress hour-long videos into coarse embeddings or incur prohibitive costs for fine-grained matching. We tackle the Multilingual Video Corpus Retrieval (mVCR) task in the NLPCC-2025 M4IVQA challenge with a multi-stage framework that integrates multilingual semantics, domain terminology, and efficient long-form processing. Video subtitles are divided into semantically coherent chunks, enriched with concise knowledge-graph (KG) facts, and organized into a hierarchical tree whose node embeddings are generated by a language-agnostic multilingual encoder. At query time, the same encoder embeds the input question; a coarse-to-fine tree search prunes irrelevant branches, and only the top-ranked chunks are re-scored by a lightweight large language model (LLM). This design avoids exhaustive cross-encoder scoring while preserving chunk-level precision. Experiments on the mVCR test set demonstrate state-of-the-art performance, and ablation studies confirm the complementary contributions of KG enrichment, hierarchical indexing, and targeted LLM re-ranking. The proposed method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections.
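
The coarse-to-fine tree search in the abstract above can be sketched as a level-by-level beam search over node embeddings. This is a minimal illustration under stated assumptions: the `Node` class, the `beam` and `top_k` parameters, and the two-dimensional toy embeddings are hypothetical, cosine similarity is an assumed scoring function, and the final LLM re-ranking step is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Node:
    """A tree node: internal nodes summarize a video, leaves hold subtitle chunks."""

    def __init__(self, embedding, text=None, children=None):
        self.embedding = embedding
        self.text = text
        self.children = children or []


def coarse_to_fine_search(root, query, beam=2, top_k=2):
    """Descend the tree level by level, keeping only the `beam` best nodes per level."""
    frontier, leaves = [root], []
    while frontier:
        candidates = []
        for node in frontier:
            if node.children:
                candidates.extend(node.children)
            else:
                leaves.append(node)
        # Branches that fall outside the beam are never expanded or scored again.
        candidates.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
        frontier = candidates[:beam]
    leaves.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
    return [leaf.text for leaf in leaves[:top_k]]


# Toy index: three videos, each with subtitle chunks (embeddings are made up).
video_a = Node([1.0, 0.0], children=[Node([0.9, 0.1], "A1"), Node([0.7, 0.3], "A2")])
video_b = Node([0.5, 0.5], children=[Node([0.6, 0.4], "B1"), Node([0.4, 0.6], "B2")])
video_c = Node([0.0, 1.0], children=[Node([0.1, 0.9], "C1")])
root = Node([0.0, 0.0], children=[video_a, video_b, video_c])

hits = coarse_to_fine_search(root, query=[1.0, 0.1], beam=2, top_k=2)
print(hits)
```

In this toy run the chunks of the third video are never scored because its summary node falls outside the beam, which is the cost saving the abstract attributes to hierarchical indexing. In a fuller pipeline the returned chunks would be handed to the re-ranker.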

[105] A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages

Raoyuan Zhao,Yihong Liu,Hinrich Schütze,Michael A. Hedderich

Main category: cs.CL

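TL;DR: This paper presents the first comprehensive study of multilingual chain-of-thought (CoT) reasoning, evaluating three dimensions (performance, consistency, and faithfulness), and finds that the thinking traces of large reasoning models differ substantially across languages.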

Details

Motivation: Although large reasoning models widely use chain-of-thought reasoning in high-resource languages such as English, their intermediate reasoning process (the thinking traces) in multilingual settings remains underexplored. Method: Evaluates how models reason in a target language by measuring language compliance, answer accuracy, and answer consistency; tests consistency by interchanging thinking traces across languages; and probes trace faithfulness with perturbations such as truncation and error injection. Result: Models show clear language preferences, reasoning effectiveness varies with the prompt language, and the quality and effectiveness of thinking traces differ substantially across languages; the degree to which models rely on their traces also varies by language. Conclusion: Multilingual CoT reasoning exhibits substantial language bias and inconsistency; future work should address reasoning quality and faithfulness in low-resource languages. Abstract: Large reasoning models (LRMs) increasingly rely on step-by-step Chain-of-Thought (CoT) reasoning to improve task performance, particularly in high-resource languages such as English. While recent work has examined final-answer accuracy in multilingual settings, the thinking traces themselves, i.e., the intermediate steps that lead to the final answer, remain underexplored. In this paper, we present the first comprehensive study of multilingual CoT reasoning, evaluating three key dimensions: performance, consistency, and faithfulness. We begin by measuring language compliance, answer accuracy, and answer consistency when LRMs are explicitly instructed or prompt-hacked to think in a target language, revealing strong language preferences and divergent performance across languages. Next, we assess crosslingual consistency of thinking traces by interchanging them between languages. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language. Finally, we adapt perturbation-based techniques (i.e., truncation and error injection) to probe the faithfulness of thinking traces across languages, showing that models rely on traces to varying degrees. We release our code and data to support future research.

[106] WUGNECTIVES: Novel Entity Inferences of Language Models from Discourse Connectives

Daniel Brubaker,William Sheffield,Junyi Jessy Li,Kanishka Misra

Main category: cs.CL

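TL;DR: This paper introduces the WUGNECTIVES dataset to study whether discourse connectives help language models infer attributes of novel entities, and finds that connectives do affect model inferences, with notably weak performance on concessive connectives.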

Details

Motivation: Investigates whether discourse connectives can give language models information about the world, inverting prior work that used world knowledge to predict connectives. Method: Builds WUGNECTIVES, a dataset of 8,880 stimuli, and evaluates 17 language models of different scales and training regimens on connective-guided inference of entity attributes. Result: Models tuned to show reasoning behavior improve notably on most connectives; however, all models perform poorly on connectives that express a concessive meaning, and performance varies widely across connective types. Conclusion: Discourse connectives can serve as cues through which language models acquire world knowledge, but current models still struggle with some semantic types (such as concession); the functional role of language cues in models deserves more fine-grained study. Abstract: The role of world knowledge has been particularly crucial to predict the discourse connective that marks the discourse relation between two arguments, with language models (LMs) being generally successful at this task. We flip this premise in our work, and instead study the inverse problem of understanding whether discourse connectives can inform LMs about the world. To this end, we present WUGNECTIVES, a dataset of 8,880 stimuli that evaluates LMs' inferences about novel entities in contexts where connectives link the entities to particular attributes. On investigating 17 different LMs at various scales and training regimens, we found that tuning an LM to show reasoning behavior yields noteworthy improvements on most connectives. At the same time, there was a large variation in LMs' overall performance across connective type, with all models systematically struggling on connectives that express a concessive meaning. Our findings pave the way for more nuanced investigations into the functional role of language cues as captured by LMs. We release WUGNECTIVES at https://github.com/sheffwb/wugnectives.

[107] AutoPR: Let's Automate Your Academic Promotion!

Qiguang Chen,Zheng Yan,Mingda Yang,Libo Qin,Yixin Yuan,Hanjing Li,Jinhao Liu,Yiyan Ji,Dengyun Peng,Jiannan Guan,Mengkang Hu,Yantao Du,Wanxiang Che

Main category: cs.CL

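TL;DR: This paper introduces the Automatic Promotion (AutoPR) task, which transforms research papers into accurate, engaging, and timely public content; releases PRBench, a multimodal benchmark that evaluates systems along three axes (fidelity, engagement, and alignment); and proposes PRAgent, a multi-agent framework that substantially increases watch time, likes, and overall engagement.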

Details

Motivation: As the volume of research literature surges, scholars rely on social platforms for discovery and authors must invest heavily in promotion to raise visibility and citations, so automation is needed to reduce human effort and improve dissemination. Method: Introduces the AutoPR task and the PRBench benchmark, and builds PRAgent, a multi-agent framework with three stages (multimodal preparation, collaborative synthesis, and platform-specific adaptation) that automates the promotion of research content. Result: Compared with direct LLM pipelines on PRBench, PRAgent achieves a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement; ablations show that platform modeling and targeted promotion contribute the most. Conclusion: AutoPR is established as a measurable research problem, and PRAgent offers a feasible path toward efficient, scalable automation of scholarly communication. Abstract: As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.

[108] Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models

Donghang Wu,Haoyang Zhang,Jun Chen,Xiangyu Zhang,Hexin Liu,Eng Siong Chng,Fei Tian,Xuerui Yang,Xiangyu Zhang,Daxin Jiang,Gang Yu

Main category: cs.CL

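TL;DR: This paper proposes the Mind-Paced Speaking (MPS) framework, which mimics the separation of thinking and speaking in the human brain to achieve low-latency, high-quality real-time spoken reasoning.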

Details

Motivation: Existing spoken language models suffer from the high latency of chain-of-thought (CoT) generation during real-time reasoning, making it hard to achieve both reasoning quality and responsiveness. Method: Proposes a dual-brain architecture in which a "Formulation Brain" handles high-level reasoning and an "Articulation Brain" handles fluent speech generation; the two run in parallel, avoiding mode-switching and allowing the model to think while speaking. Result: Under a zero-latency configuration, MPS reaches 92.8% accuracy on the mathematical reasoning task Spoken-MQA and a score of 82.5 on the conversation task URO-Bench, approaching models that pre-compute the full CoT while substantially reducing latency. Conclusion: MPS bridges the gap between high-quality reasoning and real-time interaction, offering a new brain-inspired approach for real-time spoken language models. Abstract: Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is attracting increasing attention. We present, for the first time, Mind-Paced Speaking (MPS), a brain-inspired framework that enables high-fidelity, real-time reasoning. Similar to how humans utilize distinct brain regions for thinking and responding, we propose a novel dual-brain approach, employing a "Formulation Brain" for high-level reasoning to pace and guide a separate "Articulation Brain" for fluent speech generation. This division of labor eliminates mode-switching, preserving the integrity of the reasoning process. Experiments show that MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to models that pre-compute the full CoT before speaking, while drastically reducing latency. Under a zero-latency configuration, the proposed method achieves an accuracy of 92.8% on the mathematical reasoning task Spoken-MQA and attains a score of 82.5 on the speech conversation task URO-Bench. Our work effectively bridges the gap between high-quality reasoning and real-time interaction.

[109] Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation

Sondos Mahmoud Bsharat,Zhiqiang Shen

Main category: cs.CL

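TL;DR: This paper proposes Prompting Test-Time Scaling (P-TTS), a reasoning-enhancement method that augments a small set of manually selected reasoning instances at test time using systematic prompting strategies, substantially improving large language models on mathematical and cross-domain reasoning tasks without requiring large amounts of annotated data.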

Details

Motivation: Existing large language models have some reasoning ability but depend on large annotated reasoning datasets, which are expensive to build; a low-cost, efficient way to enhance reasoning with less annotated data is needed. Method: Proposes P-TTS, which takes only 90 curated reasoning instances, generates diverse reasoning trajectories at test time by varying the intensity of instruction prompts, and uses the synthesized data to fine-tune Qwen-2.5 models (7B and 32B); the augmentation happens at inference time and requires no additional large-scale data collection. Result: On mathematical reasoning benchmarks including AIME2024 & 25, MATH500, and GPQA-Diamond, P-TTS-7B and P-TTS-32B outperform strong baselines such as S1 and S1.1; for example, P-TTS-7B achieves absolute accuracy gains of +26.66% and +30.00% on AIME'24 over S1 and S1.1, respectively. The models also generalize well to out-of-domain zero-shot tasks such as Gaokao and AMC23. Conclusion: Through test-time scaling, P-TTS explores the latent space of reasoning patterns and improves reasoning performance with very little annotation overhead, offering a practical, low-cost option for resource-constrained or rapidly evolving settings. Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts. We then finetune Qwen-2.5 models of various sizes on P-TTS data. Across a suite of mathematical reasoning benchmarks (AIME2024 & 25, MATH500, and GPQA-Diamond), our P-TTS-7B and 32B models outperform prior competitive baselines such as S1 and S1.1 (1K-shot), achieving absolute accuracy gains of +26.66% and +30.00% on AIME'24 (7B), and +13.34% and +6.67% on AIME'25 (7B); P-TTS-32B yields gains of +23.33% and +16.63% on AIME'24, and +26.63% and +3.33% on AIME'25 (vs. S1 and S1.1, respectively), with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances zero-shot generalization accuracy on out-of-domain reasoning benchmarks of Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva.
Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead, and further unlocking the reasoning potential and capabilities of LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.

cs.CV [Back]

Nirmal Elamon,Rouzbeh Davoudi

Main category: cs.CV

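TL;DR: This paper compares fine-tuned traditional CNNs, zero-shot multi-modal large language models (LLMs), and fine-tuned multi-modal LLMs on the task of detecting artificial text overlays in images, and finds that LLMs fine-tuned on fewer than 1,000 images improve accuracy by up to 36%, matching or surpassing CNN baselines that need far more data.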

Details

Motivation: Explores the potential of multi-modal large language models on vision tasks, especially in data-scarce settings, and aims to close their performance gap with traditional CNNs on specialized vision tasks. Method: Systematically compares traditional CNNs with multi-modal LLMs (both zero-shot and fine-tuned) on artificial text overlay detection, focusing on how well LLMs fine-tune with very little training data. Result: Fine-tuned multi-modal LLMs trained on fewer than 1,000 images improve accuracy by up to 36%, matching or exceeding traditional CNN models that require more data. Conclusion: Multi-modal LLMs can achieve efficient and precise visual understanding through small-scale fine-tuning, showing strong data efficiency and adaptability and offering a viable approach to cross-modal learning in low-resource settings. Abstract: The field of object detection and understanding is rapidly evolving, driven by advances in both traditional CNN-based models and emerging multi-modal large language models (LLMs). While CNNs like ResNet and YOLO remain highly effective for image-based tasks, recent transformer-based LLMs introduce new capabilities such as dynamic context reasoning, language-guided prompts, and holistic scene understanding. However, when used out-of-the-box, the full potential of LLMs remains underexploited, often resulting in suboptimal performance on specialized visual tasks. In this work, we conduct a comprehensive comparison of fine-tuned traditional CNNs, zero-shot pre-trained multi-modal LLMs, and fine-tuned multi-modal LLMs on the challenging task of artificial text overlay detection in images. A key contribution of our study is demonstrating that LLMs can be effectively fine-tuned on very limited data (fewer than 1,000 images) to achieve up to 36% accuracy improvement, matching or surpassing CNN-based baselines that typically require orders of magnitude more data. By exploring how language-guided models can be adapted for precise visual understanding with minimal supervision, our work contributes to the broader effort of bridging vision and language, offering novel insights into efficient cross-modal learning strategies. These findings highlight the adaptability and data efficiency of LLM-based approaches for real-world object detection tasks and provide actionable guidance for applying multi-modal transformers in low-resource visual environments.
To support continued progress in this area, we have made the code used to fine-tune the models available in our GitHub, enabling future improvements and reuse in related applications.

[111] Reproducible Evaluation of Data Augmentation and Loss Functions for Brain Tumor Segmentation

Saumya B

Main category: cs.CV

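TL;DR: This study presents a reproducible evaluation of U-Net with focal loss and basic data augmentation for brain tumor MRI segmentation, establishing a transparent, reproducible baseline.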

Details

Motivation: Addresses class imbalance and limited model generalization in brain tumor segmentation. Method: Runs experiments on a publicly available MRI dataset, tuning the focal loss parameters and evaluating the impact of three data augmentation techniques: horizontal flip, rotation, and scaling. Result: The U-Net with focal loss reaches a precision of 90%, comparable to state-of-the-art results. Conclusion: By releasing all code and results, the study provides a reproducible baseline for future work on augmentation strategies and loss function design in brain tumor segmentation. Abstract: Brain tumor segmentation is crucial for diagnosis and treatment planning, yet challenges such as class imbalance and limited model generalization continue to hinder progress. This work presents a reproducible evaluation of U-Net segmentation performance on brain tumor MRI using focal loss and basic data augmentation strategies. Experiments were conducted on a publicly available MRI dataset, focusing on focal loss parameter tuning and assessing the impact of three data augmentation techniques: horizontal flip, rotation, and scaling. The U-Net with focal loss achieved a precision of 90%, comparable to state-of-the-art results. By making all code and results publicly available, this study establishes a transparent, reproducible baseline to guide future research on augmentation strategies and loss function design in brain tumor segmentation.
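
For readers unfamiliar with the loss being tuned here, the following is a minimal sketch of the standard binary focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), written per pixel in plain Python. The function names and the defaults `gamma=2.0` and `alpha=0.25` are assumptions for illustration; the abstract does not state which parameter values the study settled on.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one pixel, given predicted tumor probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)           # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)**gamma shrinks the loss of pixels that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Average focal loss over a flattened mask (hypothetical helper)."""
    losses = [binary_focal_loss(p, y, gamma, alpha) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)
```

Setting `gamma=0` reduces the expression to alpha-weighted cross-entropy, so sweeping `gamma` upward from zero shows directly how strongly easy background pixels are down-weighted relative to hard tumor pixels.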

[112] Adjusting Initial Noise to Mitigate Memorization in Text-to-Image Diffusion Models

Hyeonggeun Han,Sehwan Kim,Hyungjun Joo,Sangwoo Hong,Jungwoo Lee

Main category: cs.CV

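TL;DR: Proposes adjusting the initial noise sample so that a diffusion model escapes the memorization attraction basin earlier in the denoising process, reducing memorization of training data while preserving image-text alignment.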

Details

Motivation: Text-to-image diffusion models memorize and replicate training data, raising privacy and copyright concerns; existing methods delay the application of classifier-free guidance (CFG), but this yields images poorly aligned with the prompt, so a better approach is needed. Method: Observes how different initial noise samples affect the time needed to escape the attraction basin, and proposes two ways of adjusting the initial noise, collectively and individually, to encourage earlier escape from the memorization basin. Result: The proposed methods significantly reduce memorization of training data while maintaining good alignment between generated images and input prompts. Conclusion: The choice of initial noise plays a key role in controlling memorization, and optimizing it can balance privacy protection against generation quality. Abstract: Despite their impressive generative capabilities, text-to-image diffusion models often memorize and replicate training data, prompting serious concerns over privacy and copyright. Recent work has attributed this memorization to an attraction basin (a region where applying classifier-free guidance (CFG) steers the denoising trajectory toward memorized outputs) and has proposed deferring CFG application until the denoising trajectory escapes this basin. However, such delays often result in non-memorized images that are poorly aligned with the input prompts, highlighting the need to promote earlier escape so that CFG can be applied sooner in the denoising process. In this work, we show that the initial noise sample plays a crucial role in determining when this escape occurs. We empirically observe that different initial samples lead to varying escape times. Building on this insight, we propose two mitigation strategies that adjust the initial noise, either collectively or individually, to find and utilize initial samples that encourage earlier basin escape. These approaches significantly reduce memorization while preserving image-text alignment.

[113] The Digital Mirror: Gender Bias and Occupational Stereotypes in AI-Generated Images

Siiri Leppälampi,Sonja M. Hyrynsalmi,Erno Vanhala

Main category: cs.CV

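TL;DR: This study examines gender representation bias in occupational images generated by DALL-E 3 and Ideogram, finds that both reinforce traditional gender stereotypes to varying degrees, and stresses the need for measures that increase diversity in AI-generated images.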

Details

Motivation: Existing research focuses mostly on the quality and creation process of AI-generated images and overlooks representational bias; this study aims to fill that gap, particularly for gender bias in occupational settings. Method: Prompts more than 750 AI-generated images of occupations using two tools, DALL-E 3 and Ideogram, and applies thematic analysis to assess gender representation bias along with the portrayal of emotion and age. Result: Both DALL-E 3 and Ideogram reinforce traditional gender stereotypes in the generated occupational images, though to different degrees, indicating that AI visualisation tools risk perpetuating social biases. Conclusion: AI image generation tools may amplify social biases around gender and age; practitioners, individuals, and researchers are advised to adopt strategies that improve the diversity and inclusiveness of generated content. Abstract: Generative AI offers vast opportunities for creating visualisations, such as graphics, videos, and images. However, recent studies around AI-generated visualisations have primarily focused on the creation process and image quality, overlooking representational biases. This study addresses this gap by testing representation biases in AI-generated pictures in an occupational setting and evaluating how two AI image generator tools, DALL-E 3 and Ideogram, compare. Additionally, the study discusses topics such as ageing and emotions in AI-generated images. As AI image tools are becoming more widely used, addressing and mitigating harmful gender biases becomes essential to ensure diverse representation in media and professional settings. In this study, over 750 AI-generated images of occupations were prompted. The thematic analysis results revealed that both DALL-E 3 and Ideogram reinforce traditional gender stereotypes in AI-generated images, although to varying degrees. These findings emphasise that AI visualisation tools risk reinforcing narrow representations. In our discussion section, we propose suggestions for practitioners, individuals and researchers to increase representation when generating images with visible genders.

[114] Dynamic Mixture-of-Experts for Visual Autoregressive Model

Jort Vincenti,Metod Jazbec,Guoxuan Xia

Main category: cs.CV

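TL;DR: Proposes a VAR model with an integrated dynamic Mixture-of-Experts router that uses a scale-aware thresholding strategy to reduce computation without sacrificing image quality.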

Details

Motivation: Addresses the computational redundancy that VAR models incur from repeated Transformer calls. Method: Introduces a dynamic Mixture-of-Experts router into VAR and selects experts with a scale-aware thresholding strategy that requires no additional training. Result: Achieves 20% fewer FLOPs and 11% faster inference while matching the image quality of the dense baseline. Conclusion: The method balances computational cost against generation quality and improves the efficiency of VAR models. Abstract: Visual Autoregressive Models (VAR) offer efficient and high-quality image generation but suffer from computational redundancy due to repeated Transformer calls at increasing resolutions. We introduce a dynamic Mixture-of-Experts router integrated into VAR. The new architecture allows trading compute for quality through scale-aware thresholding. This thresholding strategy balances expert selection based on token complexity and resolution, without requiring additional training. As a result, we achieve 20% fewer FLOPs and 11% faster inference while matching the image quality achieved by the dense baseline.
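
One way to picture threshold-based dynamic routing is the sketch below. It is an illustration under stated assumptions, not the paper's router: the linear `scale_threshold` schedule, its `t_min` and `t_max` bounds, and the top-1 fallback are all hypothetical choices made here to show how a per-scale threshold changes how many experts a token activates.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scale_threshold(scale_idx, num_scales, t_min=0.1, t_max=0.4):
    """Hypothetical schedule: a stricter threshold at later, higher-resolution scales."""
    if num_scales == 1:
        return t_min
    return t_min + (t_max - t_min) * scale_idx / (num_scales - 1)


def select_experts(router_logits, scale_idx, num_scales):
    """Indices of experts whose routing probability clears the threshold for this scale."""
    probs = softmax(router_logits)
    thr = scale_threshold(scale_idx, num_scales)
    chosen = [i for i, p in enumerate(probs) if p >= thr]
    if not chosen:
        # Assumed fallback: always run at least the single most probable expert.
        chosen = [max(range(len(probs)), key=probs.__getitem__)]
    return chosen


early = select_experts([2.0, 1.0, 0.0, -1.0], scale_idx=0, num_scales=10)
late = select_experts([2.0, 1.0, 0.0, -1.0], scale_idx=9, num_scales=10)
```

In this toy setup the same token activates two experts at the first scale and one at the last, so compute falls where token counts are largest. Because the threshold is applied only at inference time, a schedule of this kind needs no retraining, which is consistent with the training-free property the abstract claims.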

[115] Out-of-Distribution Detection in LiDAR Semantic Segmentation Using Epistemic Uncertainty from Hierarchical GMMs

Hanieh Shojaei Miandashti,Claus Brenner

Main category: cs.CV

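TL;DR: Proposes an unsupervised OOD detection method based on hierarchical Bayesian modeling that uses the epistemic uncertainty of GMM parameters in feature space, substantially improving OOD detection in LiDAR point cloud semantic segmentation without relying on auxiliary data.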

Details

Motivation: Existing unsupervised OOD detection methods often conflate epistemic and aleatoric uncertainty, misclassifying ambiguous in-distribution regions as OOD; the work also aims to avoid dependence on auxiliary OOD datasets. Method: In the feature space of a deep neural network, models the parameters of a Gaussian Mixture Model (GMM) with a hierarchical Bayesian model and extracts epistemic uncertainty as the OOD score, with no need for ensembles or posterior sampling. Result: On SemanticKITTI, compared with the predictive entropy approach, AUROC improves by 18%, AUPRC increases by 22%, and FPR95 falls from 76% to 40% (a 36-percentage-point reduction). Conclusion: The method isolates epistemic uncertainty and improves the accuracy of unsupervised OOD detection without extra training or auxiliary data, making it practical. Abstract: In addition to accurate scene understanding through precise semantic segmentation of LiDAR point clouds, detecting out-of-distribution (OOD) objects, instances not encountered during training, is essential to prevent the incorrect assignment of unknown objects to known classes. While supervised OOD detection methods depend on auxiliary OOD datasets, unsupervised methods avoid this requirement but typically rely on predictive entropy, the entropy of the predictive distribution obtained by averaging over an ensemble or multiple posterior weight samples. However, these methods often conflate epistemic (model) and aleatoric (data) uncertainties, misclassifying ambiguous in-distribution regions as OOD. To address this issue, we present an unsupervised OOD detection approach that employs epistemic uncertainty derived from hierarchical Bayesian modeling of Gaussian Mixture Model (GMM) parameters in the feature space of a deep neural network. Without requiring auxiliary data or additional training stages, our approach outperforms existing uncertainty-based methods on the SemanticKITTI dataset, achieving an 18% improvement in AUROC, 22% increase in AUPRC, and 36% reduction in FPR95 (from 76% to 40%), compared to the predictive entropy approach used in prior works.
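
The conflation that this abstract attributes to predictive entropy can be made concrete with the standard information-theoretic decomposition: predictive entropy equals the expected per-member entropy (aleatoric) plus the mutual information between prediction and model (epistemic). The sketch below illustrates that baseline decomposition only; it is not the paper's hierarchical GMM method, and the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def decompose_uncertainty(member_probs):
    """member_probs: one class-probability vector per ensemble member or weight sample."""
    n = len(member_probs)
    k = len(member_probs[0])
    mean_p = [sum(m[c] for m in member_probs) / n for c in range(k)]
    total = entropy(mean_p)                                # predictive entropy
    aleatoric = sum(entropy(m) for m in member_probs) / n  # expected entropy
    epistemic = total - aleatoric                          # mutual information
    return total, aleatoric, epistemic


# Every member agrees the point is ambiguous: data noise, not model ignorance.
ambiguous = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Members confidently disagree: the model has not seen anything like this point.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

Both toy cases have the same predictive entropy (ln 2), yet only the second has nonzero epistemic uncertainty. A detector that thresholds predictive entropy alone would flag both points, which is the failure mode on ambiguous in-distribution regions that the abstract describes.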

[116] Hi-OSCAR: Hierarchical Open-set Classifier for Human Activity Recognition

Conor McCarthy,Loes Quirijnen,Jan Peter van Zandwijk,Zeno Geradts,Marcel Worring

Main category: cs.CV

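TL;DR: Proposes Hi-OSCAR, a hierarchical open-set classifier for human activity recognition that accurately identifies known activities, rejects unknown ones, and localizes unknown classes to the nearest internal node of the hierarchy.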

Details

Motivation: Training data for human activity recognition cannot cover every real-world activity, and conventional classifiers handle unseen activities and similarities between classes poorly. Method: Arranges activity classes into a hierarchy and proposes Hi-OSCAR, which combines hierarchical classification with open-set recognition to identify known activities accurately while rejecting and localizing unknown ones. Result: Validated on the public NFI_FARED dataset, the method achieves state-of-the-art accuracy on known activities and effectively detects and localizes unknown activities. Conclusion: Hierarchical modeling makes HAR systems more robust and practical, offering a new approach to activity recognition in open environments. Abstract: Within Human Activity Recognition (HAR), there is an insurmountable gap between the range of activities performed in life and those that can be captured in an annotated sensor dataset used in training. Failure to properly handle unseen activities seriously undermines any HAR classifier's reliability. Additionally within HAR, not all classes are equally dissimilar, some significantly overlap or encompass other sub-activities. Based on these observations, we arrange activity classes into a structured hierarchy. From there, we propose Hi-OSCAR: a Hierarchical Open-set Classifier for Activity Recognition, that can identify known activities at state-of-the-art accuracy while simultaneously rejecting unknown activities. This not only enables open-set classification, but also allows for unknown classes to be localized to the nearest internal node, providing insight beyond a binary "known/unknown" classification. To facilitate this and future open-set HAR research, we collected a new dataset: NFI_FARED. NFI_FARED contains data from multiple subjects performing nineteen activities from a range of contexts, including daily living, commuting, and rapid movements, which is fully public and available for download.

[117] Detection of high-frequency oscillations using time-frequency analysis

Mostafa Mohammadpour,Mehdi Zekriyapanah Gashti,Yusif S. Gasimov

Main category: cs.CV

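TL;DR: Proposes a new method for automatically detecting high-frequency oscillations (HFOs) based on the S-transform and unsupervised clustering; it distinguishes HFOs from spikes, background activity, and artifacts in the 80-500 Hz band, reaches 97.67% sensitivity and 98.57% precision on a controlled dataset, and shows a stronger correlation with surgical outcomes on patient data.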

Details

Motivation: HFOs are an important biomarker for localizing the epileptogenic zone, but manual identification is time-consuming, laborious, and subjective, and existing automated detectors remain imperfect; more accurate and reliable automated HFO detection is needed for clinical use and research. Method: Extracts time-frequency features with the S-transform and classifies events with unsupervised clustering, automatically detecting HFOs in the ripple and fast ripple bands (80-500 Hz) and separating them from epileptic spikes, background activity, and artifacts. Result: On the controlled dataset the detector achieves a sensitivity of 97.67%, a precision of 98.57%, and an F-score of 97.78%; on epilepsy patient data the ratio of HFO rates between resected and non-resected contacts is 0.73, a stronger correlation with surgical outcome, and residual HFOs (especially fast ripples) are associated with postoperative seizure recurrence. Conclusion: HFOs, particularly fast ripples, are reliable biomarkers of epileptogenicity; complete resection of HFO-generating regions supports postoperative seizure freedom, and the automated detector offers high accuracy and clinical potential. Abstract: High-frequency oscillations (HFOs) are a new biomarker for identifying the epileptogenic zone. Mapping HFO-generating regions can improve the precision of resection sites in patients with refractory epilepsy. However, detecting HFOs remains challenging, and their clinical features are not yet fully defined. Visual identification of HFOs is time-consuming, labor-intensive, and subjective. As a result, developing automated methods to detect HFOs is critical for research and clinical use. In this study, we developed a novel method for detecting HFOs in the ripple and fast ripple frequency bands (80-500 Hz). We validated it using both controlled datasets and data from epilepsy patients. Our method employs an unsupervised clustering technique to categorize events extracted from the time-frequency domain using the S-transform. The proposed detector differentiates HFO events from spikes, background activity, and artifacts. Compared to existing detectors, our method achieved a sensitivity of 97.67%, a precision of 98.57%, and an F-score of 97.78% on the controlled dataset. In epilepsy patients, our results showed a stronger correlation with surgical outcomes, with a ratio of 0.73 between HFO rates in resected versus non-resected contacts. The study confirmed previous findings that HFOs are promising biomarkers of epileptogenicity in epileptic patients. Removing HFOs, especially fast ripples, leads to seizure freedom, while remaining HFOs lead to seizure recurrence.

[118] Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

Thomas Fel,Binxu Wang,Michael A. Lepori,Matthew Kowal,Andrew Lee,Randall Balestriero,Sonia Joseph,Ekdeep S. Lubana,Talia Konkle,Demba Ba,Martin Wattenberg

Main category: cs.CV

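TL;DR: The internal representations of DINOv2 are analyzed with a 32,000-unit dictionary built by sparse autoencoders (SAEs). The study shows that classification, segmentation, and depth estimation rely on "Elsewhere" (non-target) concepts, boundary detectors, and monocular depth cues, respectively; further analysis shows that the representations are not fully sparse but tend toward low-dimensional connected structure. The authors propose the Minkowski Representation Hypothesis (MRH), based on convex mixtures of archetypes, linking vision-transformer representations to Gardenfors' theory of conceptual spaces.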

Details

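Motivation: Although DINOv2 is widely used for visual recognition, its internal perceptual mechanism remains unclear. The paper aims to reveal the nature of its deep representations, test the Linear Representation Hypothesis (LRH), and explore a more suitable explanatory framework. Method: Decomposes DINOv2 features with sparse autoencoders (SAEs) to build an interpretable dictionary of 32,000 units; on that basis, analyzes how different downstream tasks use concepts and examines the geometry and statistics of the representations; and, drawing on the attention mechanism, proposes the Minkowski Representation Hypothesis (MRH). Result: DINOv2 shows functional specialization across tasks: classification uses "Elsewhere" concepts to implement learned negations, segmentation relies on subspaces of boundary detectors, and depth estimation uses three monocular cues consistent with principles from visual neuroscience. The representations are partly dense and low-dimensional with local connectivity, departing from ideal orthogonal structure; tokens are formed from convex mixtures of archetypes, in line with conceptual-space theory. Conclusion: Representations in vision transformers go beyond linear sparsity and exhibit a geometric structure based on convex mixtures of archetypes; the proposed MRH offers a new theoretical framework for understanding their internal mechanisms and connects deep learning models with conceptual-space theory from cognitive science. Abstract: DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits "Elsewhere" concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We found that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low dimensional, locally connected set persisting after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures).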
This structure is grounded in Gardenfors' conceptual spaces and in the model's mechanism as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.

[119] PhyDAE: Physics-Guided Degradation-Adaptive Experts for All-in-One Remote Sensing Image Restoration

Zhe Dong,Yuzhe Sun,Haochen Jiang,Tianzhu Liu,Yanfeng Gu

Main category: cs.CV

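TL;DR: This paper proposes PhyDAE, an all-in-one remote sensing image restoration method that uses physics guidance and adaptive experts to model complex degradation processes explicitly, achieving a strong balance between performance and efficiency on multiple benchmark datasets.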

Details

Motivation: Existing all-in-one remote sensing image restoration methods rely too heavily on implicit feature representations and lack explicit modeling of the physical degradation process, making it hard to handle multiple heterogeneous degradations while preserving physical consistency. Method: Proposes Physics-Guided Degradation-Adaptive Experts (PhyDAE), a two-stage cascaded architecture in which the Residual Manifold Projector (RMP) and the Frequency-Aware Degradation Decomposer (FADD) analyze degradation characteristics from manifold-geometry and frequency perspectives, combined with physics-aware expert modules and a temperature-controlled sparse activation strategy. Result: On three benchmark datasets (MD-RSID, MD-RRSHID, and MDRS-Landsat), PhyDAE outperforms state-of-the-art methods on all four tasks (dehazing, denoising, deblurring, and low-light enhancement), improving restoration quality while reducing parameter count and computational complexity. Conclusion: By explicitly modeling degradation physics and processing each degradation differently, PhyDAE delivers high-performance, efficient remote sensing image restoration and offers a new solution for multi-degradation scenarios. Abstract: Remote sensing images inevitably suffer from various degradation factors during acquisition, including atmospheric interference, sensor limitations, and imaging conditions. These complex and heterogeneous degradations pose severe challenges to image quality and downstream interpretation tasks. Addressing limitations of existing all-in-one restoration methods that overly rely on implicit feature representations and lack explicit modeling of degradation physics, this paper proposes Physics-Guided Degradation-Adaptive Experts (PhyDAE). The method employs a two-stage cascaded architecture transforming degradation information from implicit features into explicit decision signals, enabling precise identification and differentiated processing of multiple heterogeneous degradations including haze, noise, blur, and low-light conditions. The model incorporates progressive degradation mining and exploitation mechanisms, where the Residual Manifold Projector (RMP) and Frequency-Aware Degradation Decomposer (FADD) comprehensively analyze degradation characteristics from manifold geometry and frequency perspectives. Physics-aware expert modules and temperature-controlled sparse activation strategies are introduced to enhance computational efficiency while ensuring imaging physics consistency. Extensive experiments on three benchmark datasets (MD-RSID, MD-RRSHID, and MDRS-Landsat) demonstrate that PhyDAE achieves superior performance across all four restoration tasks, comprehensively outperforming state-of-the-art methods.
Notably, PhyDAE substantially improves restoration quality while achieving significant reductions in parameter count and computational complexity, resulting in remarkable efficiency gains compared to mainstream approaches and achieving optimal balance between performance and efficiency. Code is available at https://github.com/HIT-SIRS/PhyDAE.

[120] Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

Songtao Jiang,Yuan Wang,Sibo Song,Tianxiang Hu,Chenyi Zhou,Bin Pu,Yan Zhang,Zhibo Yang,Yang Feng,Joey Tianyi Zhou,Jin Hao,Zijian Chen,Ruijia Wu,Tao Tang,Junhui Lv,Hongxia Xu,Hongwei Wang,Jun Xiao,Bin Feng,Fudong Zhu,Kenli Li,Weidi Xie,Jimeng Sun,Jian Wu,Zuozhu Liu

Main category: cs.CV

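TL;DR: Hulu-Med is a transparent medical vision-language model that handles text, 2D/3D images, and video in a unified way and, through large-scale training, achieves state-of-the-art performance on multimodal clinical tasks.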

Details

Motivation: Clinical decision-making requires integrating data from many modalities, but existing generalist vision-language models suffer from opaque pipelines, data scarcity, and inflexible architectures. Method: Uses a unified patch-based vision encoder and a large language model decoder, combined with medical-aware token reduction, and trains progressively on 16.7 million samples to support 2D, 3D, and video understanding. Result: Achieves state-of-the-art results across 30 benchmarks, surpassing leading open-source models and approaching proprietary systems, with strong performance on visual question answering, report generation, and multilingual and rare-disease reasoning tasks. Conclusion: By open-sourcing the complete pipeline, the work shows that a high-performance medical VLM can be built transparently, providing an accessible and impactful foundation for clinical AI. Abstract: Real-world clinical decision-making grapples with integrating information from diverse data modalities, including medical text, 2D/3D images, and video, leading to inefficiencies and potential diagnostic oversights. While generalist vision-language models (VLMs) offer promise, their medical development faces challenges of opaque pipelines, data scarcity, and architectural inflexibility. Here we present Hulu-Med, a transparent medical VLM that unifies understanding across all these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million (M) samples to scale from 2D to 3D and video comprehension. The medical-aware token reduction enables efficient training, requiring only 4,000 to 40,000 GPU hours for 7B to 32B parameter variants. Extensive evaluation across 30 benchmarks exhibits state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems in tasks spanning visual question-answering, medical report generation, and complex reasoning in multilingual and rare disease scenarios. By open-sourcing our complete pipeline, we establish that high-performance medical VLM can be achieved transparently, providing a foundational tool for accessible and impactful clinical AI. Code is released at https://github.com/ZJUI-AI4H/Hulu-Med.

[121] Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

Kang Liao,Size Wu,Zhonghua Wu,Linyi Jin,Chao Wang,Yikai Wang,Fei Wang,Wei Li,Chen Change Loy

Main category: cs.CV

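TL;DR: Puffin is a unified camera-centric multimodal model that treats camera parameters as language to enable understanding and generation across viewpoints, giving it strong spatial intelligence.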

Details

Motivation: Existing work typically handles camera-centric understanding and generation in isolation and lacks a unified framework that integrates spatial awareness with multimodal interaction. Method: Puffin combines language regression with diffusion-based generation, introduces a paradigm that treats the camera as language, and is trained on a dataset of 4 million vision-language-camera triplets. Result: Experiments show that Puffin outperforms specialized models on camera-centric generation and understanding, and that with instruction tuning it generalizes to cross-view tasks such as spatial imagination, world exploration, and photography guidance. Conclusion: Puffin achieves unified camera-centric multimodal modeling and advances research on spatial intelligence; the code, models, and dataset will be released publicly. Abstract: Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.

[122] Structured Output Regularization: a framework for few-shot transfer learning

Nicolas Ewen,Jairo Diaz-Rodriguez,Kelly Ramsay

Main category: cs.CV

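TL;DR: Proposes Structured Output Regularization (SOR), a simple yet effective transfer learning framework that freezes internal network structures and combines group lasso and L1 penalties to adapt a model to specific data with few samples, reducing the risk of overfitting and achieving results comparable to established benchmarks on several medical image classification tasks.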

Details

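Motivation: Traditional transfer learning freezes part of the weights and adds task-specific layers, which limits the model's ability to adapt to domain-specific features and still tends to overfit when data is extremely scarce; a more flexible and parameter-efficient approach is needed. Method: Structured Output Regularization (SOR) freezes the internal network structures (e.g., convolutional kernels) while applying group lasso and L1 regularization to network components (such as convolutional filters or network blocks), reducing redundant parameters and improving adaptability. The method applies broadly to different network architectures. Result: On three few-shot medical image classification tasks, SOR with DenseNet121 and EfficientNetB4 bases achieves competitive results, matching or outperforming established benchmark methods. Conclusion: SOR is an efficient, general transfer learning framework that improves adaptation under limited data while remaining computationally efficient, making it well suited to data-scarce settings such as medical imaging. Abstract: Traditional transfer learning typically reuses large pre-trained networks by freezing some of their weights and adding task-specific layers. While this approach is computationally efficient, it limits the model's ability to adapt to domain-specific features and can still lead to overfitting with very limited data. To address these limitations, we propose Structured Output Regularization (SOR), a simple yet effective framework that freezes the internal network structures (e.g., convolutional filters) while using a combination of group lasso and $L_1$ penalties. This framework tailors the model to specific data with minimal additional parameters and is easily applicable to various network components, such as convolutional filters or various blocks in neural networks, enabling broad applicability for transfer learning tasks. We evaluate SOR on three few-shot medical imaging classification tasks and we achieve competitive results using DenseNet121 and EfficientNetB4 bases compared to established benchmarks.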
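
As a rough illustration of the combined penalty described in the SOR abstract, the sketch below computes a group lasso term plus an L1 term over groups of trainable weights. The grouping (one group per frozen filter), the names `sor_penalty`, `lam_group`, and `lam_l1`, and the unweighted group norm are assumptions made for illustration; the abstract does not give the exact formulation.

```python
import math

def sor_penalty(groups, lam_group=0.1, lam_l1=0.01):
    """Combined group lasso + L1 penalty over groups of trainable weights.

    groups: list of lists of floats, one inner list per group
            (hypothetically, one group per frozen convolutional filter).
    """
    # Group lasso: sum of Euclidean norms, which pushes whole groups to zero.
    group_term = sum(math.sqrt(sum(w * w for w in g)) for g in groups)
    # L1: sum of absolute values, which pushes individual weights to zero.
    l1_term = sum(abs(w) for g in groups for w in g)
    return lam_group * group_term + lam_l1 * l1_term

# Three illustrative groups; the second is already fully zeroed out.
weights = [[3.0, 4.0], [0.0, 0.0], [-1.0, 0.0]]
penalty = sor_penalty(weights, lam_group=1.0, lam_l1=1.0)
```

In a training loop this term would be added to the task loss; the group term tends to switch off entire components, while the L1 term sparsifies what remains.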

[123] BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

Yu Qi,Haibo Zhao,Ziyu Guo,Siyuan Ma,Ziyan Chen,Yaokun Han,Renrui Zhang,Zitiantao Lin,Shiji Xin,Yijian Huang,Kai Cheng,Peiheng Wang,Jiazheng Liu,Jiayi Zhang,Yizhe Zhu,Wenqing Wang,Yiran Qin,Xupeng Zhu,Haojie Huang,Lawson L. S. Wong

Main category: cs.CV

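TL;DR: This paper presents BEAR, a comprehensive and fine-grained benchmark for evaluating multimodal large language models (MLLMs) on atomic embodied capabilities, and proposes BEAR-Agent to strengthen MLLM perception, 3D understanding, and planning.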

Details

Motivation: Existing benchmarks for embodied agents focus mainly on specific domains and do not systematically evaluate the embodied capabilities of MLLMs, so a more comprehensive benchmark covering diverse embodied tasks is needed. Method: The authors build BEAR, a benchmark of 4,469 interleaved image-video-text entries covering 14 domains in 6 categories, and propose BEAR-Agent, which integrates pretrained vision models to improve MLLM perception and planning. Result: Experiments on 20 representative MLLMs reveal limitations across all types of embodied capability; BEAR-Agent yields a 9.12% absolute gain and a 17.5% relative improvement on GPT-5, and its effectiveness is also validated in simulated environments. Conclusion: BEAR provides an effective tool for evaluating the embodied capabilities of MLLMs, while BEAR-Agent substantially improves embodied performance by incorporating vision models, pointing to a direction for improvement. Abstract: Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/
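
The two gains reported for BEAR-Agent can be related with simple arithmetic. The sketch below assumes both figures refer to the same overall BEAR score for GPT-5, which the abstract does not state explicitly; under that assumption the implied baseline is roughly 52%. The function name `implied_baseline` is illustrative.

```python
def implied_baseline(absolute_gain, relative_gain):
    """Baseline score b such that absolute_gain / b == relative_gain."""
    return absolute_gain / relative_gain

absolute_gain = 9.12   # percentage points, as reported in the abstract
relative_gain = 0.175  # 17.5% relative improvement, as reported in the abstract

baseline = implied_baseline(absolute_gain, relative_gain)  # about 52.1
improved = baseline + absolute_gain                        # about 61.2
```

This is only a back-of-the-envelope consistency check on the two reported numbers, not a result taken from the paper.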

[124] SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense

Jiayang Liu,Daniel Tso,Yiming Bu,Qinru Qiu

Main category: cs.CV

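TL;DR: Proposes a defense framework inspired by the biological visual system that suppresses adversarial noise through reinforcement learning-guided saccades and foveal-peripheral processing, improving model robustness without retraining the classifier.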

Details

Motivation: Traditional adversarial defenses are computationally expensive, whereas the human visual system is inherently robust; the authors aim to borrow biological mechanisms to design a more efficient defense that requires no retraining. Method: The framework combines three biological mechanisms (foveal-peripheral processing, saccadic eye movements, and cortical filling-in), using reinforcement learning-guided saccades to selectively capture multiple glimpses and integrate them into a reconstructed image for classification. Result: Experiments on ImageNet confirm the method's effectiveness across multiple classifiers and attack types, improving system robustness while substantially reducing training overhead. Conclusion: This biologically inspired preprocessing defends against adversarial attacks, preserves semantic integrity, and requires no fine-tuning of downstream classifiers, making it easy to integrate into existing systems. Abstract: Adversarial attacks significantly challenge the safe deployment of deep learning models, particularly in real-world applications. Traditional defenses often rely on computationally intensive optimization (e.g., adversarial training or data augmentation) to improve robustness, whereas the human visual system achieves inherent robustness to adversarial perturbations through evolved biological mechanisms. We hypothesize that attention-guided non-homogeneous sparse sampling and predictive coding play a key role in this robustness. To test this hypothesis, we propose a novel defense framework incorporating three key biological mechanisms: foveal-peripheral processing, saccadic eye movements, and cortical filling-in. Our approach employs reinforcement learning-guided saccades to selectively capture multiple foveal-peripheral glimpses, which are integrated into a reconstructed image before classification. This biologically inspired preprocessing effectively mitigates adversarial noise, preserves semantic integrity, and notably requires no retraining or fine-tuning of downstream classifiers, enabling seamless integration with existing systems. Experiments on the ImageNet dataset demonstrate that our method improves system robustness across diverse classifiers and attack types, while significantly reducing training overhead compared to both biologically and non-biologically inspired defense techniques.

[125] Detecting spills using thermal imaging, pretrained deep learning models, and a robotic platform

Gregory Yeghiyan,Jurius Azar,Devson Butani,Chan-Jin Chung

Main category: cs.CV

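TL;DR: Proposes a real-time spill detection system built on pretrained deep learning models that uses RGB and thermal imaging to classify spills, reaching up to 100% accuracy with lightweight models and running in real time on consumer-grade hardware.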

Details

Motivation: A real-time system is needed that can detect liquid spills quickly and accurately under varied environmental conditions, improving response in safety-critical settings. Method: Pretrained deep learning models (such as VGG19 and NasNetMobile) are trained and evaluated on RGB and thermal imaging data using a balanced binary classification dataset of 4,000 images. Result: Thermal models perform better in accuracy (up to 100%), inference speed (as low as 44 ms), and model size (under 350 MB), and are more robust across lighting conditions; in tests with a real robot, the thermal VGG19 model performs best. Conclusion: Thermal imaging combined with lightweight deep learning models enables efficient real-time spill detection that can be deployed on consumer-grade hardware and is suitable for safety-critical applications. Abstract: This paper presents a real-time spill detection system that utilizes pretrained deep learning models with RGB and thermal imaging to classify spill vs. no-spill scenarios across varied environments. Using a balanced binary dataset (4,000 images), our experiments demonstrate the advantages of thermal imaging in inference speed, accuracy, and model size. We achieve up to 100% accuracy using lightweight models like VGG19 and NasNetMobile, with thermal models performing faster and more robustly across different lighting conditions. Our system runs on consumer-grade hardware (RTX 4080) and achieves inference times as low as 44 ms with model sizes under 350 MB, highlighting its deployability in safety-critical contexts. Results from experiments with a real robot and test datasets indicate that a VGG19 model trained on thermal imaging performs best.

[126] LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution

Xiaohui Li,Shaobin Zhuang,Shuo Cao,Yang Yang,Yuandong Pu,Qi Qin,Siqi Luo,Bin Fu,Yihao Liu

Main category: cs.CV

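TL;DR: This paper presents LinearSR, a framework that for the first time systematically addresses the obstacles to applying linear attention in image super-resolution, delivering efficient, high-quality generative SR.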

Details

Motivation: Generative image super-resolution models face a computational bottleneck from the quadratic complexity of self-attention; linear attention has linear complexity, but its potential for photorealistic SR remains largely untapped because of issues such as training instability and the perception-distortion trade-off. Method: The paper proposes an Early-Stopping Guided Fine-tuning (ESGF) strategy to resolve training instability, an SNR-based Mixture of Experts (MoE) architecture to mitigate the perception-distortion trade-off, and a lightweight guidance mechanism, TAG, based on a "precision-over-volume" principle. Result: LinearSR maintains high perceptual quality while substantially improving efficiency; its single-step diffusion forward pass reaches SOTA-level speed, and its multi-step inference time is also highly competitive. Conclusion: LinearSR offers the first robust approach to applying linear attention in photorealistic image super-resolution and lays a foundation for research on efficient generative SR. Abstract: Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention's quadratic complexity (O(N^2)) creates a major computational bottleneck. Linear Attention offers an O(N) solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental training instability that causes catastrophic model divergence using our novel "knee point"-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our "precision-over-volume" principle. Our resulting LinearSR model simultaneously delivers state-of-the-art perceptual quality with exceptional efficiency. Its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.

[127] Re-Identifying Kākā with AI-Automated Video Key Frame Extraction

Paula Maddigan,Andrew Lensen,Rachael C. Shaw

Main category: cs.CV

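TL;DR: Proposes an AI-based, non-invasive method for identifying individual wild animals that uses video key frame extraction and computer vision to efficiently re-identify kākā, a threatened New Zealand parrot.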

Details

Motivation: Traditional methods of identifying individual animals (such as leg banding) are time-consuming and invasive, so non-invasive, efficient automated alternatives are needed. Method: An unsupervised key frame extraction pipeline combines object detection with YOLO and Grounding DINO, optical flow blur detection, image encoding with DINOv2, and clustering. Result: High-quality key frames are extracted from videos recorded at a custom-built feeder, and the resulting image collections achieve high accuracy in re-identifying individual kākā. Conclusion: The method provides a scalable, non-invasive technical framework for wildlife monitoring, with the potential to extend to ecological and conservation research in more complex natural environments. Abstract: Accurate recognition and re-identification of individual animals is essential for successful wildlife population monitoring. Traditional methods, such as leg banding of birds, are time consuming and invasive. Recent progress in artificial intelligence, particularly computer vision, offers encouraging solutions for smart conservation and efficient automation. This study presents a unique pipeline for extracting high-quality key frames from videos of kākā (Nestor meridionalis), a threatened forest-dwelling parrot in New Zealand. Key frame extraction is well-studied in person re-identification; however, its application to wildlife is limited. Using video recordings at a custom-built feeder, we extract key frames and evaluate the re-identification performance of our pipeline. Our unsupervised methodology combines object detection using YOLO and Grounding DINO, optical flow blur detection, image encoding with DINOv2, and clustering methods to identify representative key frames. The results indicate that our proposed key frame selection methods yield image collections which achieve high accuracy in kākā re-identification, providing a foundation for future research using media collected in more diverse and challenging environments. Through the use of artificial intelligence and computer vision, our non-invasive and efficient approach provides a valuable alternative to traditional physical tagging methods for recognising kākā individuals and therefore improving the monitoring of populations. This research contributes to developing fresh approaches in wildlife monitoring, with applications in ecology and conservation biology.

[128] Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization

Shuo Xing,Soumik Dey,Mingyang Wu,Ashirbad Mishra,Hansi Wu,Binbin Li,Zhengzhong Tu

Main category: cs.CV

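TL;DR: Proposes Q-Router, a universal video quality assessment framework based on multi-tier model routing that uses vision-language models to dynamically select expert models, improving cross-content generalization, interpretability, and extensibility.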

Details

Motivation: Existing VQA models fall short in cross-content generalization, interpretability, and extensibility, and struggle to adapt to diverse scenarios such as UGC, short-form video, and AIGC. Method: A multi-tier routing system integrates multiple expert models and uses vision-language models as real-time routers that reason over the video semantics and ensemble the most appropriate experts; the heaviest tier includes spatiotemporal artifact localization for interpretability. Result: Q-Router matches or surpasses state-of-the-art performance on multiple VQA benchmarks, substantially improves cross-dataset generalization and interpretability, performs strongly on the Q-Bench-Video question answering task, and localizes spatiotemporal artifacts effectively. Conclusion: Through agentic multi-expert routing, Q-Router delivers flexible, robust, and interpretable universal video quality assessment and shows promise as a foundation for next-generation VQA systems. Abstract: Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC) and short-form videos to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision-language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier involving a specific spatiotemporal artifact localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.

[129] Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering

Yuanhao Zou,Zhaozheng Yin

Main category: cs.CV

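TL;DR: Proposes a new framework for medical visual question answering (Med-VQA) that combines a unified multimodal alignment method, hard negative mining, and a gated cross-attention module, outperforming the previous state of the art on several benchmark datasets.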

Details

Motivation: Existing medical vision-language pre-training models lack a unified solution for modality alignment and pay too little attention to hard negatives, and commonly used knowledge fusion techniques may introduce irrelevant information. Method: 1) A unified approach to heterogeneous modality alignment across multiple levels, modalities, views, and stages, based on contrastive learning and optimal transport theory; 2) a hard negative mining method that uses soft labels for multi-modality alignment and strengthens discrimination of hard negative pairs; 3) a gated cross-attention module that incorporates the answer vocabulary as prior knowledge and selects relevant information from it. Result: The framework outperforms previous state-of-the-art methods on widely used Med-VQA datasets including RAD-VQA, SLAKE, PathVQA, and VQA-2019. Conclusion: The proposed framework addresses modality alignment, hard negatives, and noisy knowledge fusion in Med-VQA, substantially improving model performance. Abstract: Medical Visual Question Answering (Med-VQA) is a challenging task that requires a deep understanding of both medical images and textual questions. Although recent works leveraging Medical Vision-Language Pre-training (Med-VLP) have shown strong performance on the Med-VQA task, there is still no unified solution for modality alignment, and the issue of hard negatives remains under-explored. Additionally, commonly used knowledge fusion techniques for Med-VQA may introduce irrelevant information. In this work, we propose a framework to address these challenges through three key contributions: (1) a unified solution for heterogeneous modality alignments across multiple levels, modalities, views, and stages, leveraging methods like contrastive learning and optimal transport theory; (2) a hard negative mining method that employs soft labels for multi-modality alignments and enforces the hard negative pair discrimination; and (3) a Gated Cross-Attention Module for Med-VQA that integrates the answer vocabulary as prior knowledge and selects relevant information from it. Our framework outperforms the previous state-of-the-art on widely used Med-VQA datasets like RAD-VQA, SLAKE, PathVQA and VQA-2019.

[130] SkipSR: Faster Super Resolution with Token Skipping

Rohan Choudhury,Shanchuan Lin,Jianyi Wang,Hao Chen,Qi Zhao,Feng Cheng,Lu Jiang,Kris Kitani,Laszlo A. Jeni

Main category: cs.CV

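TL;DR: Proposes SkipSR, a framework that identifies low-detail regions and skips super-resolution computation on them, substantially accelerating video SR while preserving perceptual quality.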

Details

Motivation: Existing diffusion-based super-resolution methods process all pixels uniformly, which makes them slow and computationally expensive and hard to scale to higher resolutions and longer videos. Method: Low-detail regions are identified directly from the low-resolution input and skipped entirely, so that only regions needing refinement are super-resolved. Result: On 720p videos, the method achieves up to 60% faster end-to-end latency than prior models with no perceptible loss in quality. Conclusion: SkipSR is a simple yet effective way to greatly improve the efficiency of video super-resolution without sacrificing perceptual quality. Abstract: Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at https://rccchoudhury.github.io/skipsr/
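
To make the idea of skipping low-detail regions concrete, the sketch below marks patches of a low-resolution frame whose pixel variance falls below a threshold. Using patch variance as the detail measure is an assumption for illustration only; the abstract does not say how SkipSR identifies low-detail regions, and the names `patch_variance`, `skip_mask`, `patch`, and `threshold` are hypothetical.

```python
def patch_variance(image, top, left, size):
    """Population variance of the pixel values in one square patch."""
    vals = [image[r][c]
            for r in range(top, top + size)
            for c in range(left, left + size)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def skip_mask(image, patch=2, threshold=1.0):
    """Grid of booleans, one per patch: True means 'skip refinement here'."""
    rows, cols = len(image), len(image[0])
    return [[patch_variance(image, r, c, patch) < threshold
             for c in range(0, cols - patch + 1, patch)]
            for r in range(0, rows - patch + 1, patch)]

# A 4x4 grayscale frame: three flat patches and one high-contrast patch.
frame = [
    [10, 10, 0, 255],
    [10, 10, 255, 0],
    [50, 50, 7, 7],
    [50, 50, 7, 7],
]
mask = skip_mask(frame)
```

In a real pipeline, patches marked True would bypass the expensive SR model (for example, by plain upsampling), and only the remaining patches would be passed through it.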

[131] D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

Yiyang Huang,Yizhou Wang,Yun Fu

Main category: cs.CV

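TL;DR: This paper presents D-CoDe, a training-free framework for adapting image-pretrained VLMs into video large language models; it uses dynamic compression and question decomposition to address the perception bottleneck and token overload, and performs well on multiple video understanding benchmarks, particularly long-video tasks.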

Details

Motivation: Applying image-pretrained vision-language models to video understanding requires handling dense, temporally extended visual inputs, which creates a perception bottleneck and token overload. Method: D-CoDe combines dynamic compression (adaptive selection of representative frames and content-aware aggregation of spatial tokens) with question decomposition (splitting the original question into sub-questions that direct the model to different aspects of the video) to adapt the model without training. Result: Experiments show that D-CoDe improves performance across multiple video understanding benchmarks, with particularly strong results on long-video understanding. Conclusion: D-CoDe addresses key challenges in extending image-based VLMs to the video domain and offers a practical training-free route to building efficient video large language models. Abstract: Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.

[132] FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation

Hongrui Wu,Zhicheng Gao,Jin Cao,Kelu Yao,Wen Shen,Zhihua Wei

Main category: cs.CV

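TL;DR: This paper proposes FOLK, a fast open-vocabulary 3D instance segmentation method that uses label-guided knowledge distillation to transfer high-quality instance embedding knowledge from a 2D teacher model to a 3D student model, so that inference classifies instances directly from the point cloud, avoiding noise from 2D occlusions and running much faster.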

Details

Motivation: Existing methods map 3D instances to 2D images and classify them with vision-language models, which introduces occlusion noise and incurs high computational and memory costs, slowing inference. Method: A teacher model generates a 2D CLIP embedding for each 3D instance that incorporates visibility and viewpoint diversity and serves as the distillation target; a student model directly outputs 3D embeddings; and a label-guided distillation algorithm transfers open-vocabulary knowledge from the teacher to the student. Result: In experiments on ScanNet200 and Replica, FOLK achieves a leading AP50 score of 35.7 on ScanNet200 while running 6.0x to 152.2x faster than previous methods. Conclusion: FOLK addresses the noise and efficiency problems caused by 2D mapping in open-vocabulary 3D instance segmentation, delivering efficient and accurate 3D instance classification with potential for practical use. Abstract: Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB-D images, and then employ vision-language models (VLMs) for classification. However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs during inference, slowing down the inference speed. To address the above problems, we propose a Fast Open-vocabulary 3D instance segmentation method via Label-guided Knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. In this way, during inference, the distilled 3D model can directly classify instances from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating the inference process. Specifically, we first design a teacher model to generate a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity, which serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm to distill open-vocabulary knowledge from label-consistent 2D embeddings into the student model. Experiments on the ScanNet200 and Replica datasets show that FOLK achieves state-of-the-art performance on ScanNet200 with an AP50 score of 35.7, while running approximately 6.0x to 152.2x faster than previous methods. All code will be released after the paper is accepted.

[133] Modeling Time-Lapse Trajectories to Characterize Cranberry Growth

Ronan John,Anis Chihoub,Ryan Meegan,Gina Sidelli,Jeffery Neyhart,Peter Oudemans,Kristin Dana

Main category: cs.CV

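TL;DR: Proposes a method for modeling cranberry crop growth by fine-tuning vision transformers (ViTs) with self-supervision; it avoids tedious image annotation and uses a two-fold pretext task of time regression and class prediction to produce interpretable temporal tracks of crop growth and to analyze differences between varieties.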

Details

Motivation: Change monitoring in cranberry farming is traditionally done manually, which is time-consuming and laborious, while deep learning approaches suffer from hard-to-interpret features and the need for extensive hand annotation. Method: Vision transformers (ViTs) are fine-tuned with a self-supervised approach using a two-fold pretext task (time regression and class prediction) to learn a latent space for how plant and fruit appearance evolves over time, producing interpretable 2D temporal tracks. Result: The method yields an interpretable time-series model of cranberry growth that can predict growth over time and distinguish temporal differences between varieties; the authors also release a new time-lapse dataset covering eight varieties observed 52 times, annotated with fungicide application, yield, and rot. Conclusion: The method requires no hand annotation, is interpretable, and is general enough to be applied to other crops and applications. Abstract: Change monitoring is an essential task for cranberry farming as it provides both breeders and growers with the ability to analyze growth, predict yield, and make treatment decisions. However, this task is often done manually, requiring significant time on the part of a cranberry grower or breeder. Deep learning-based change monitoring holds promise, despite the caveat of hard-to-interpret high dimensional features and hand-annotations for fine-tuning. To address this gap, we introduce a method for modeling crop growth based on fine-tuning vision transformers (ViTs) using a self-supervised approach that avoids tedious image annotations. We use a two-fold pretext task (time regression and class prediction) to learn a latent space for the time-lapse evolution of plant and fruit appearance. The resulting 2D temporal tracks provide an interpretable time-series model of crop growth that can be used to: 1) predict growth over time and 2) distinguish temporal differences of cranberry varieties. We also provide a novel time-lapse dataset of cranberry fruit featuring eight distinct varieties, observed 52 times over the growing season (span of around four months), annotated with information about fungicide application, yield, and rot. Our approach is general and can be applied to other crops and applications (code and dataset can be found at https://github.com/ronan-39/tlt/).

[134] PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning

Daiki Yoshikawa,Takashi Matsubara

Main category: cs.CV

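TL;DR: Proposes PHyCLIP, which uses an ℓ1-product metric on a product of hyperbolic spaces to jointly model hierarchy within concept families and compositionality across them, outperforming single-space approaches on multiple tasks.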

Details

Motivation: Existing vision-language models struggle to capture both the hierarchy within a concept family (e.g., dog ⊆ mammal ⊆ animal) and the compositional structure across concept families (e.g., "a dog in a car"); current hyperbolic-space methods are good at modeling hierarchy but have limited capacity to express compositionality. Method: PHyCLIP uses a Cartesian product of multiple hyperbolic spaces with an ℓ1-product metric to model hierarchy and compositionality jointly: each hyperbolic factor captures the hierarchy of one concept family, while the ℓ1-product metric, analogous to a Boolean algebra, composes concepts across families. Result: On zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks, PHyCLIP outperforms existing single-space approaches and exhibits more interpretable structure in the embedding space. Conclusion: Through its hybrid geometric structure, PHyCLIP unifies semantic hierarchy and compositionality, offering a more expressive and interpretable framework for multimodal representation learning. Abstract: Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
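
The ℓ1-product metric mentioned in the PHyCLIP abstract can be sketched as a sum of per-factor hyperbolic distances. The example below assumes the Poincaré ball model with curvature -1 for each factor, which the abstract does not specify; the function names `poincare_distance` and `l1_product_distance` are illustrative.

```python
import math

def squared_norm(v):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(a * a for a in v)

def poincare_distance(x, y):
    """Distance between two points inside the unit ball (Poincare model)."""
    diff = squared_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - squared_norm(x)) * (1.0 - squared_norm(y))
    return math.acosh(1.0 + 2.0 * diff / denom)

def l1_product_distance(xs, ys):
    """l1-product metric: sum of hyperbolic distances over all factors."""
    return sum(poincare_distance(x, y) for x, y in zip(xs, ys))

# Two embeddings, each made of two 2D hyperbolic factors.
a = [[0.0, 0.0], [0.0, 0.0]]
b = [[0.5, 0.0], [0.0, 0.5]]
d = l1_product_distance(a, b)  # each factor contributes ln(3)
```

Because the total distance is a plain sum over factors, a difference confined to one factor (one concept family, in the paper's framing) leaves the contributions of the other factors untouched.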

[135] SegTrans: Transferable Adversarial Examples for Segmentation Models

Yufei Song,Ziqi Zhou,Qi Lu,Hangtao Zhang,Yifan Hu,Lulu Xue,Shengshan Hu,Minghui Li,Leo Yu Zhang

Main category: cs.CV

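TL;DR: Proposes SegTrans, a new transfer attack framework that divides the input sample into multiple local regions and remaps their semantic information to generate diverse enhanced samples, improving the transferability of adversarial examples across segmentation models.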

Details

Motivation: Existing adversarial attacks transfer poorly across segmentation models, mainly because of complex contextual dependencies within the models and feature distribution gaps between surrogate and target models. Method: SegTrans splits the input sample into regions and retains only local semantic information for perturbation optimization, generating more transferable adversarial examples without relying on global semantic information. Result: Experiments on PASCAL VOC and Cityscapes with four segmentation models and three backbone networks show that SegTrans raises the transfer attack success rate by 8.55% on average and improves computational efficiency by more than 100%, with no additional computational overhead. Conclusion: SegTrans improves the transferability of adversarial examples across semantic segmentation models and is an efficient, general transfer attack method. Abstract: Segmentation models exhibit significant vulnerability to adversarial examples in white-box settings, but existing adversarial attack methods often show poor transferability across different segmentation models. While some researchers have explored transfer-based adversarial attack (i.e., transfer attack) methods for segmentation models, the complex contextual dependencies within these models and the feature distribution gaps between surrogate and target models result in unsatisfactory transfer success rates. To address these issues, we propose SegTrans, a novel transfer attack framework that divides the input sample into multiple local regions and remaps their semantic information to generate diverse enhanced samples. These enhanced samples replace the original ones for perturbation optimization, thereby improving the transferability of adversarial examples across different segmentation models. Unlike existing methods, SegTrans only retains local semantic information from the original input, rather than using global semantic information to optimize perturbations. Extensive experiments on two benchmark datasets, PASCAL VOC and Cityscapes, four different segmentation models, and three backbone networks show that SegTrans significantly improves adversarial transfer success rates without introducing additional computational overhead. Compared to the current state-of-the-art methods, SegTrans achieves an average increase of 8.55% in transfer attack success rate and improves computational efficiency by more than 100%.

[136] Defense against Unauthorized Distillation in Image Restoration via Feature Space Perturbation

Han Hu,Zhuoran Zheng,Chen Lyu

Main category: cs.CV

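TL;DR: Proposes Adaptive Singular Value Perturbation (ASVP), a runtime defense that protects image restoration models against knowledge distillation attacks.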

Details

Motivation: Existing defenses against knowledge distillation work for image classification but are hard to apply directly to image restoration, a generative task with continuous, high-dimensional outputs that depend on spatial coherence and fine detail, where ordinary small perturbations are not enough to stop a student network from learning. Method: Operating on the teacher model's internal feature maps, ASVP applies singular value decomposition (SVD) and amplifies the top-k singular values to inject structured, high-frequency perturbations that disrupt the alignment distillation relies on, hindering student learning while preserving the teacher's output quality. Result: Across five image restoration tasks (super-resolution, low-light enhancement, underwater enhancement, dehazing, and deraining), ASVP reduces student PSNR by up to 4 dB and SSIM by 60-75% with minimal impact on teacher performance, and provides a stronger and more consistent defense than prior methods. Conclusion: ASVP offers a practical and effective way to protect open-source image restoration models against knowledge distillation attacks. Abstract: Knowledge distillation (KD) attacks pose a significant threat to deep model intellectual property by enabling adversaries to train student networks using a teacher model's outputs. While recent defenses in image classification have successfully disrupted KD by perturbing output probabilities, extending these methods to image restoration is difficult. Unlike classification, restoration is a generative task with continuous, high-dimensional outputs that depend on spatial coherence and fine details. Minor perturbations are often insufficient, as students can still learn the underlying mapping. To address this, we propose Adaptive Singular Value Perturbation (ASVP), a runtime defense tailored for image restoration models. ASVP operates on internal feature maps of the teacher using singular value decomposition (SVD). It amplifies the top-k singular values to inject structured, high-frequency perturbations, disrupting the alignment needed for distillation. This hinders student learning while preserving the teacher's output quality. We evaluate ASVP across five image restoration tasks: super-resolution, low-light enhancement, underwater enhancement, dehazing, and deraining. Experiments show ASVP reduces student PSNR by up to 4 dB and SSIM by 60-75%, with negligible impact on the teacher's performance. Compared to prior methods, ASVP offers a stronger and more consistent defense. Our approach provides a practical solution to protect open-source restoration models from unauthorized knowledge distillation.
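
The core operation named in the ASVP abstract, amplifying the top-k singular values of a feature map, can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the feature map is treated as a 2D matrix (for example, channels by flattened spatial positions), `k` and `scale` are fixed rather than chosen adaptively, and the function name `amplify_top_k` is hypothetical. The paper's adaptive rule is not described in the abstract and is not reproduced here.

```python
import numpy as np

def amplify_top_k(feature, k=2, scale=1.5):
    """Scale the k largest singular values of a 2D feature matrix."""
    u, s, vt = np.linalg.svd(feature, full_matrices=False)
    s = s.copy()
    # np.linalg.svd returns singular values in descending order,
    # so the first k entries are the largest ones.
    s[:k] *= scale
    return u @ np.diag(s) @ vt

rng = np.random.default_rng(0)
feature = rng.standard_normal((8, 6))  # stand-in for a flattened feature map
perturbed = amplify_top_k(feature, k=2, scale=1.5)
```

The perturbed matrix keeps the same singular vectors as the original; only the energy along the leading directions changes.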

[137] RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos

Zixi Yang,Jiapeng Li,Muxi Diao,Yinuo Jing,Kongming Liang

Main category: cs.CV

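TL;DR: This paper presents Ro-Bench, the first benchmark for evaluating the robustness of multimodal large language models on dynamic out-of-distribution (OOD) counterfactual video data. Experiments show that existing models degrade substantially on the benchmark, while fine-tuning with counterfactual data markedly improves robustness and video understanding.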

Details

Motivation: Existing multimodal large language models perform well on video understanding tasks, but their robustness to manipulated or counterfactual video content has not been adequately explored, so a dedicated benchmark is needed to evaluate models in dynamic out-of-distribution counterfactual scenarios. Method: The authors build Ro-Bench, a high-quality, diverse, and temporally relevant counterfactual video test set whose OOD samples are generated by editing video style, object, background, and their compositions; they then evaluate eight recent video MLLMs and study how fine-tuning with counterfactual data affects robustness. Result: Current MLLMs show substantial performance degradation on Ro-Bench; after fine-tuning with counterfactual data, performance improves by 21.73% on Ro-Bench and by 12.78% on average across 20 tasks in MVBench. Conclusion: Counterfactual data effectively strengthens the robustness and video understanding of MLLMs, and Ro-Bench provides an important tool for future research on video model robustness. Abstract: Recently, Multi-modal Large Language Models (MLLMs) have demonstrated significant performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench incorporates high-quality, diverse and temporally relevant video data by editing Style, Object, Background and their compositions. We evaluated eight recent video MLLMs and found that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.

[138] Denoised Diffusion for Object-Focused Image Augmentation

Nisha Pillai,Aditi Virupakshaiah,Harrison W. Smith,Amanda J. Ashworth,Prasanna Gowda,Phillip R. Owens,Adam R. Rivers,Bindu Nanduri,Mahalingam Ramkumar

Main category: cs.CV

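TL;DR: Proposes an object-focused data augmentation framework for animal health monitoring that segments animals and uses transformations and diffusion-based synthesis to generate realistic, diverse scenes, improving detection performance when data are limited.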

Details

Motivation: Because animal breeds, environments, and behaviors vary widely across farms, existing transfer learning methods lack large datasets that reflect specific conditions and struggle with small or occluded animals, leaving drone-based monitoring short of data. Method: Animals are segmented from the background and augmented through geometric transformations and diffusion-model-based synthesis, producing diverse training data suited to specific scenes. Result: Experiments show that training with the augmented dataset outperforms the baseline models on the animal detection task, with a clear improvement in detection. Conclusion: By generating domain-specific data, the method addresses data scarcity in agricultural scenes and makes real-time animal health monitoring systems more practical. Abstract: Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.

[139] Unleashing Perception-Time Scaling to Multimodal Reasoning Models

Yifan Li,Zhenghao Chen,Ziheng Wu,Kun Zhou,Ruipu Luo,Can Zhang,Zhentao He,Yufei Zhan,Wayne Xin Zhao,Minghui Qiu

Main category: cs.CV

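TL;DR: This paper proposes Perception-Time Scaling (PTS), a new paradigm for improving large vision-language models on visual estimation tasks. Using the perception-centric benchmark DisTANCE, the authors find that existing models achieve limited estimation precision under inference-time scaling, which they attribute to a fast perception paradigm. PTS decomposes complex perception problems and encourages richer perception tokens, substantially improving precision and generalizing well across domains.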

Details

Motivation: Inference-time scaling has improved the reasoning ability of large vision-language models (LVLMs), but its effect on visual perception tasks is unclear. The authors investigate how the technique affects visual perception and address the limited precision of existing models in visual estimation. Method: They introduce DisTANCE, a perception-centric benchmark for visual estimation, and design the Perception-Time Scaling (PTS) paradigm. PTS decomposes complex perception problems into tractable sub-problems, encourages the model to generate richer intermediate perception tokens, and is trained with reinforcement learning. Result: On DisTANCE, PTS raises high-precision performance from 8.0% to 64.7% and generalizes well to out-of-domain tasks. Analysis shows that PTS increases the number of perception-related tokens and the model's attention to image tokens. In addition, although PTS data are synthetic, combining them with math reasoning data yields consistent gains on real-world perception and reasoning tasks. Conclusion: PTS offers an effective new route to improving the visual perception of LVLMs, closes the gap in applying inference-time scaling to perception tasks, and moves models from "fast perception" toward "deep perception". Abstract: Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model's attention to image tokens. Our code and data will be publicly released.

[140] mmJoints: Expanding Joint Representations Beyond (x,y,z) in mmWave-Based 3D Pose Estimation

Zhenyu Wang,Mahathir Monjur,Shahriar Nirjon

Main category: cs.CV

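TL;DR: This paper proposes mmJoints, a framework that augments the output of mmWave-based 3D pose estimators with descriptors for joint sensing likelihood and location reliability, making the model's reliance on prior knowledge explicit and thereby improving the accuracy and interpretability of downstream tasks such as activity recognition.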

Details

Motivation: mmWave signals are sparse and reflections are weak, so pose estimation models rely too heavily on statistical priors rather than actual sensor data, which hurts downstream task performance. Method: On top of a pre-trained, black-box mmWave 3D pose estimator, mmJoints adds joint descriptors that estimate whether each joint was sensed and how reliable its predicted location is, and uses these descriptors to augment the output. Result: Evaluated on more than 115,000 signal frames across 13 pose estimation settings, the descriptor estimation error is below 4.2%, joint position accuracy improves by up to 12.5%, and activity recognition accuracy improves by up to 16%. Conclusion: By explicitly modeling the model's reliance on priors, mmJoints improves the performance and interpretability of mmWave pose estimation in downstream tasks and outperforms current state-of-the-art methods. Abstract: In mmWave-based pose estimation, sparse signals and weak reflections often cause models to infer body joints from statistical priors rather than sensor data. While prior knowledge helps in learning meaningful representations, over-reliance on it degrades performance in downstream tasks like gesture and activity recognition. In this paper, we introduce mmJoints, a framework that augments a pre-trained, black-box mmWave-based 3D pose estimator's output with additional joint descriptors. Rather than mitigating bias, mmJoints makes it explicit by estimating the likelihood of a joint being sensed and the reliability of its predicted location. These descriptors enhance interpretability and improve downstream task accuracy. Through extensive evaluations using over 115,000 signal frames across 13 pose estimation settings, we show that mmJoints estimates descriptors with an error rate below 4.2%. mmJoints also improves joint position accuracy by up to 12.5% and boosts activity recognition by up to 16% over state-of-the-art methods.

[141] Hierarchical Scheduling for Multi-Vector Image Retrieval

Maoliang Li,Ke Li,Yaoyang Liu,Jiayu Chen,Zihao Zheng,Yinjun Wu,Xiang Chen

Main category: cs.CV

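TL;DR: This paper proposes HiMIR, an efficient scheduling framework for image retrieval that uses multi-level granularity alignment and reduces redundant computation, substantially improving retrieval accuracy in multimodal large language model applications while cutting computation by up to 3.5 times.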

Details

Motivation: Conventional retrieval methods have limited accuracy, and existing multi-vector retrieval, although an improvement, still does not fully account for alignment between the query and image objects or for redundancy among fine-grained segments. Method: HiMIR uses multiple intermediate granularities to improve alignment, exploits cross-hierarchy similarity consistency and hierarchy sparsity to reduce redundant matching computation, and automatically configures parameters for each dataset. Result: Experiments show that HiMIR improves accuracy over an existing MVR system while reducing computation by up to 3.5 times. Conclusion: Through hierarchical alignment and redundancy reduction, HiMIR achieves efficient and accurate image retrieval suitable for a range of practical scenarios. Abstract: To effectively leverage user-specific data, retrieval augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching against segmented images. They still suffer from sub-optimal accuracy and efficiency, overlooking alignment between the query and varying image objects and redundant fine-grained image segments. In this work, we present an efficient scheduling framework for image retrieval - HiMIR. First, we introduce a novel hierarchical paradigm, employing multiple intermediate granularities for varying image objects to enhance alignment. Second, we minimize redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to minimize unnecessary matching computation. Furthermore, we configure parameters for each dataset automatically for practicality across diverse scenarios. Our empirical study shows that HiMIR not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system.

[142] HandEval: Taking the First Step Towards Hand Quality Evaluation in Generated Images

Zichuan Wang,Bo Peng,Songlin Yang,Zhenchen Tang,Jing Dong

Main category: cs.CV

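TL;DR: This paper proposes the first quality assessment task targeting hand regions in generated images, introducing the HandPair dataset and the HandEval model. Using a multimodal large language model and prior knowledge of hand keypoints, it substantially improves the accuracy of hand quality assessment and proves broadly useful in downstream tasks such as image generation optimization and AIGC detection.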

Details

Motivation: Existing text-to-image models produce structural distortions and unrealistic textures in complex local regions, especially human hands, yet there is no dedicated method for assessing hand quality, which limits progress on related downstream tasks. Method: The authors build the HandPair dataset of 48k images formed from high- and low-quality hand pairs, design the hand-specific quality assessment model HandEval on top of a multimodal large language model, and incorporate prior knowledge of hand keypoints to strengthen its perception of hand quality. Result: On a human-annotated test set, HandEval outperforms existing SOTA methods and agrees more closely with human judgments; integrated into image generation and AIGC detection pipelines, it markedly improves hand realism and detection accuracy. Conclusion: HandEval provides an effective solution for hand quality assessment in generated images, its generality and effectiveness are confirmed across several downstream applications, and it advances research on human-centric generation quality optimization. Abstract: Although recent text-to-image (T2I) models have significantly improved the overall visual quality of generated images, they still struggle in the generation of accurate details in complex local regions, especially human hands. Generated hands often exhibit structural distortions and unrealistic textures, which can be very noticeable even when the rest of the body is well-generated. However, the quality assessment of hand regions remains largely neglected, limiting downstream task performance like human-centric generation quality optimization and AIGC detection. To address this, we propose the first quality assessment task targeting generated hand regions and showcase its abundant downstream applications. We first introduce the HandPair dataset for training hand quality assessment models. It consists of 48k images formed by high- and low-quality hand pairs, enabling low-cost, efficient supervision without manual annotation. Based on it, we develop HandEval, a carefully designed hand-specific quality assessment model. It leverages the powerful visual understanding capability of Multimodal Large Language Model (MLLM) and incorporates prior knowledge of hand keypoints, gaining strong perception of hand quality. We further construct a human-annotated test set with hand images from various state-of-the-art (SOTA) T2I models to validate its quality evaluation capability. Results show that HandEval aligns better with human judgments than existing SOTA methods. Furthermore, we integrate HandEval into image generation and AIGC detection pipelines, prominently enhancing generated hand realism and detection accuracy, respectively, confirming its universal effectiveness in downstream applications. Code and dataset will be available.

[143] Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation

Yuki Nii,Futa Waseda,Ching-Chun Chang,Isao Echizen

Main category: cs.CV

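TL;DR: This paper proposes "Uncolorable Examples", the first defensive paradigm against unauthorized AI colorization, which embeds imperceptible perturbations into grayscale images to block unauthorized colorization and protect the copyright of visual content.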

Details

Motivation: AI colorization carries a risk of copyright infringement, for example the unauthorized colorization and resale of black-and-white manga or films, and no effective safeguard currently exists. Method: The proposed PAChroma method optimizes imperceptible perturbations with a Laplacian filter to preserve perceptual quality and applies diverse input transformations during optimization to improve transferability across models and robustness to post-processing. Result: Experiments on the ImageNet and Danbooru datasets show that PAChroma effectively degrades colorization quality while preserving the visual appearance of the original image, meeting the four criteria of effectiveness, imperceptibility, transferability, and robustness. Conclusion: This work is the first to achieve an effective defense against unauthorized AI colorization in generative media, opening a new direction for copyright protection. Abstract: AI-based colorization has shown remarkable capability in generating realistic color images from grayscale inputs. However, it poses risks of copyright infringement -- for example, the unauthorized colorization and resale of monochrome manga and films. Despite these concerns, no effective method currently exists to prevent such misuse. To address this, we introduce the first defensive paradigm, Uncolorable Examples, which embed imperceptible perturbations into grayscale images to invalidate unauthorized colorization. To ensure real-world applicability, we establish four criteria: effectiveness, imperceptibility, transferability, and robustness. Our method, Perception-Aware Chroma-Restrictive Perturbation (PAChroma), generates Uncolorable Examples that meet these four criteria by optimizing imperceptible perturbations with a Laplacian filter to preserve perceptual quality, and applying diverse input transformations during optimization to enhance transferability across models and robustness against common post-processing (e.g., compression). Experiments on ImageNet and Danbooru datasets demonstrate that PAChroma effectively degrades colorization quality while maintaining the visual appearance. This work marks the first step toward protecting visual content from illegitimate AI colorization, paving the way for copyright-aware defenses in generative media.

[144] Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation

Yao Teng,Fuyun Wang,Xian Liu,Zhekai Chen,Han Shi,Yu Wang,Zhenguo Li,Weiyang Liu,Difan Zou,Xihui Liu

Main category: cs.CV

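TL;DR: Proposes Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models, substantially reducing the number of forward passes and speeding up image generation while maintaining image quality.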

Details

Motivation: Autoregressive text-to-image models decode token by token, so inference is slow and generating a single image can take thousands of forward passes. Method: The method brings denoising into Jacobi iterations and introduces a "next-clean-token prediction" paradigm, in which a pre-trained model accepts noise-perturbed token embeddings and, after low-cost fine-tuning, predicts clean tokens. At inference, token sequences are initialized with Gaussian noise and predicted iteratively in the embedding space; a probabilistic criterion verifies and accepts multiple tokens in parallel, and tokens that are not accepted continue to be refined. Result: Experiments show that the method substantially reduces the number of model forward passes and accelerates generation while maintaining the visual quality of generated images. Conclusion: SJD2 provides an efficient parallel decoding framework for autoregressive text-to-image models, addressing slow inference and showing good application potential. Abstract: As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token-prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens for the next iteration with the denoising trajectory. Experiments show that our method can accelerate generation by reducing model forward passes while maintaining the visual quality of generated images.

[145] On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

Hoigi Seo,Dong Un Kang,Hyunjin Cho,Joohoon Lee,Se Young Chun

Main category: cs.CV

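TL;DR: This paper proposes mitigating object hallucination in large vision-language models by identifying and masking visual tokens with high epistemic uncertainty in the vision encoder.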

Details

Motivation: Large vision-language models suffer from object hallucination, describing objects that are not present in the image, which undermines their reliability. The authors argue that uncertain visual tokens in the vision encoder are a key contributing factor. Method: Through statistical analysis and theoretical derivation, visual tokens in early vision-encoder layers whose representations deviate strongly under small adversarial perturbations are identified as high-uncertainty tokens; these tokens are then masked during self-attention in the middle layers to suppress their influence on visual encoding. Result: Experiments show that the method substantially reduces object hallucination in large vision-language models and works alongside other existing techniques. Conclusion: By modifying only the vision encoder, the method effectively mitigates object hallucination, offering a simple yet effective way to improve LVLM reliability. Abstract: Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, there are still crucial challenges in LVLMs such as object hallucination, generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor that contributes to object hallucination. Our statistical analysis found that there are positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method with adversarial perturbations for identifying uncertain visual tokens efficiently and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can synergistically work with other prior arts.
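The two steps this entry describes, flagging tokens whose representations shift under a perturbation and then masking them in self-attention, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the paper's procedure: the L2 deviation measure, the fixed `threshold`, and the hand-written "perturbed" features are all hypothetical, and the adversarial perturbation itself is not modeled.

```python
import math


def deviation(clean, perturbed):
    """L2 distance between a token's clean and perturbed representations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clean, perturbed)))


def uncertain_mask(clean_tokens, perturbed_tokens, threshold):
    """True marks a token whose representation moved more than the threshold."""
    return [deviation(c, p) > threshold
            for c, p in zip(clean_tokens, perturbed_tokens)]


def masked_attention_weights(scores, mask):
    """Softmax over key scores with masked keys forced to zero weight.

    Assumes at least one key is left unmasked.
    """
    kept = [s for s, m in zip(scores, mask) if not m]
    top = max(kept)
    exps = [0.0 if m else math.exp(s - top) for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


clean = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perturbed = [[1.0, 0.1], [0.0, 1.0], [3.0, -1.0]]  # third token shifts a lot
mask = uncertain_mask(clean, perturbed, threshold=1.0)
weights = masked_attention_weights([0.5, 0.5, 2.0], mask)
```

In this toy case the third token would otherwise dominate attention (score 2.0), but once masked its weight is zero and the remaining two tokens share the attention equally.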

[146] Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy

Xiaoxiao Ma,Feng Zhao,Pengyang Ling,Haibo Qiu,Zhixiang Wei,Hu Yu,Jie Huang,Zhixiong Zeng,Lin Ma

Main category: cs.CV

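TL;DR: Proposes an entropy-based decoding strategy that improves both the generation quality and the speed of autoregressive image generation models.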

Details

Motivation: In current autoregressive image generation models, image tokens have low information density and a non-uniform spatial distribution, which limits sampling efficiency and generation quality. Method: (1) dynamic temperature control based on the spatial entropy of token distributions; (2) entropy-aware acceptance rules in speculative decoding. Result: The method's effectiveness and generality are validated on multiple benchmarks and different autoregressive image generation models, achieving near-lossless generation at about 85% of the inference cost of conventional acceleration methods. Conclusion: The proposed entropy-guided decoding strategy improves generation quality and sampling speed without extra computational overhead and applies to a variety of AR image generation models. Abstract: In this work, we first revisit the sampling issues in current autoregressive (AR) image generation models and identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. Accordingly, we present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed. Specifically, the proposed method introduces two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, enhancing the balance between content diversity, alignment accuracy, and structural coherence in both mask-based and scale-wise models, without extra computational overhead, and 2) entropy-aware acceptance rules in speculative decoding, achieving near-lossless generation at about 85% of the inference cost of conventional acceleration methods. Extensive experiments across multiple benchmarks using diverse AR image generation models demonstrate the effectiveness and generalizability of our approach in enhancing both generation quality and sampling speed.
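To make the idea of entropy-guided temperature control concrete, here is a small sketch that computes the normalized Shannon entropy of a token distribution and maps it linearly to a sampling temperature. The abstract does not specify the mapping, so the linear rule, its direction (higher entropy gives higher temperature), and the bounds `t_min` and `t_max` are assumptions chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(vocab size), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def entropy_guided_temperature(logits, t_min=0.5, t_max=1.0):
    """Hypothetical rule: interpolate between t_min and t_max by normalized entropy."""
    h = normalized_entropy(softmax(logits))
    return t_min + (t_max - t_min) * h


flat_logits = [0.0, 0.0, 0.0, 0.0]     # uniform distribution, maximal entropy
peaked_logits = [10.0, 0.0, 0.0, 0.0]  # confident prediction, low entropy
t_flat = entropy_guided_temperature(flat_logits)
t_peaked = entropy_guided_temperature(peaked_logits)
```

Applied per image position, a rule of this kind would give each token its own temperature before sampling, which is one plausible reading of "dynamic temperature control guided by spatial entropy".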

[147] Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels

Weitong Kong,Zichao Zeng,Di Wen,Jiale Wei,Kunyu Peng,June Moh Goo,Jan Boehm,Rainer Stiefelhagen

Main category: cs.CV

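TL;DR: This paper introduces DGLSS-NL, a new task of domain generalization for LiDAR semantic segmentation under noisy labels, and builds the first benchmark for it. Because existing methods perform poorly on point cloud data, it proposes DuNe, a dual-view framework that improves performance through feature consistency constraints and confidence-based filtering, achieving state-of-the-art results on multiple datasets.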

Details

Motivation: LiDAR annotations are often noisy because of sensor imperfections, occlusions, and human error, and this noise further degrades 3D semantic segmentation under domain shift. Noisy-label learning methods developed for images are hard to apply directly to sparse, irregular point clouds, so robust domain generalization methods designed specifically for LiDAR are needed. Method: DuNe is a dual-branch framework (strong and weak augmentation) that enforces consistency at the feature level and applies a cross-entropy loss based on confidence-aware filtering of predictions to handle noisy labels in domain-generalized LiDAR semantic segmentation. Result: Under 10% symmetric label noise, DuNe reaches 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS, with an arithmetic mean of 49.57% and a harmonic mean of 48.50%, clearly outperforming adapted noisy-label learning methods. Conclusion: DuNe addresses the combined challenge of noisy labels and domain shift in LiDAR 3D semantic segmentation, improving the reliability of domain generalization in real autonomous-driving scenarios. Abstract: Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available on our project page.

[148] Lesion-Aware Post-Training of Latent Diffusion Models for Synthesizing Diffusion MRI from CT Perfusion

Junhyeok Lee,Hyunwoong Kim,Hyungjin Chung,Heeseong Eom,Joon Jang,Chul-Ho Sohn,Kyu Sung Choi

Main category: cs.CV

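TL;DR: Proposes a post-training framework for latent diffusion models based on lesion-aware medical pixel-space objectives, improving lesion reconstruction accuracy and overall image quality in medical image-to-image translation.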

Details

Motivation: Latent diffusion models learn efficiently in a compressed latent space but may sacrifice crucial pixel-level detail in medical images, especially small lesions that matter for diagnosis, which affects clinical reliability. Method: On top of a pre-trained latent diffusion model, lesion-aware pixel-space objectives are introduced in a post-training stage to improve reconstruction of lesion regions while also improving overall image quality. Result: On brain CT-to-MRI translation for 817 acute ischemic stroke patients, the method outperforms existing models when synthesizing DWI and ADC images, markedly improving image quality and lesion delineation. Conclusion: The proposed post-training framework improves lesion reconstruction in medical image translation and shows good generality and clinical potential, particularly where resources are limited but high diagnostic accuracy is required. Abstract: Image-to-image translation models can help mitigate various challenges inherent to medical image acquisition. Latent diffusion models (LDMs) leverage efficient learning in compressed latent space and constitute the core of state-of-the-art generative image models. However, this efficiency comes with a trade-off, potentially compromising crucial pixel-level detail essential for high-fidelity medical images. This limitation becomes particularly critical when generating clinically significant structures, such as lesions, which often occupy only a small portion of the image. Failure to accurately reconstruct these regions can severely impact diagnostic reliability and clinical decision-making. To overcome this limitation, we propose a novel post-training framework for LDMs in medical image-to-image translation by incorporating lesion-aware medical pixel space objectives. This approach is essential, as it not only enhances overall image quality but also improves the precision of lesion delineation. We evaluate our framework on brain CT-to-MRI translation in acute ischemic stroke patients, where early and accurate diagnosis is critical for optimal treatment selection and improved patient outcomes. While diffusion MRI is the gold standard for stroke diagnosis, its clinical utility is often constrained by high costs and low accessibility. Using a dataset of 817 patients, we demonstrate that our framework improves overall image quality and enhances lesion delineation when synthesizing DWI and ADC images from CT perfusion scans, outperforming existing image-to-image translation models. Furthermore, our post-training strategy is easily adaptable to pre-trained LDMs and exhibits substantial potential for broader applications across diverse medical image translation tasks.

[149] Visual Anomaly Detection for Reliable Robotic Implantation of Flexible Microelectrode Array

Yitong Chen,Xinyao Xu,Ping Zhu,Xinyong Han,Fangbo Qin,Shan Yu

Main category: cs.CV

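TL;DR: This paper proposes an anomaly detection framework based on microscope camera images for real-time monitoring of robotic flexible microelectrode (FME) implantation, achieving high-accuracy detection at multiple checkpoints with an improved vision transformer approach.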

Details

Motivation: Because the FME probe is flexible and interacts with biological tissue, implanting it into the brain cortex is challenging and requires careful monitoring to ensure safety and reliability. Method: Using the microscope cameras on the robotic system, regions of interest (ROIs) are extracted at four key checkpoints and fed to a pre-trained vision transformer (ViT); a progressive granularity patch sampling method is proposed, and feature channels with higher signal-to-noise ratios are selected to give better scene descriptors. Result: The method is validated on image datasets collected from a real implantation system, improving the balance between sensitivity and tolerance at different locations and achieving accurate anomaly detection. Conclusion: The unified framework supports automated quality control of the FME implantation process and offers a reliable technical approach for the safety of neural implantation procedures. Abstract: Flexible microelectrode (FME) implantation into the brain cortex is challenging due to the deformable fiber-like structure of the FME probe and the interaction with critical bio-tissue. To ensure reliability and safety, the implantation process should be monitored carefully. This paper develops an image-based anomaly detection framework based on the microscopic cameras of the robotic FME implantation system. The unified framework is utilized at four checkpoints to check the micro-needle, FME probe, hooking result, and implantation point, respectively. Exploiting the existing object localization results, the aligned regions of interest (ROIs) are extracted from the raw image and input to a pretrained vision transformer (ViT). Considering the task specifications, we propose a progressive granularity patch feature sampling method to address the sensitivity-tolerance trade-off issue at different locations. Moreover, we select a part of feature channels with higher signal-to-noise ratios from the raw general ViT features, to provide better descriptors for each specific scene. The effectiveness of the proposed methods is validated with the image datasets collected from our implantation system.

[150] MambaH-Fit: Rethinking Hyper-surface Fitting-based Point Cloud Normal Estimation via State Space Modelling

Weijia Wang,Yuanzhi Su,Pei-Gen Ye,Yuan-Gen Wang,Xuequan Lu

Main category: cs.CV

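TL;DR: Proposes MambaH-Fit, a framework that combines attention-driven hierarchical feature fusion with patch-wise state space modelling for point cloud normal estimation, improving the modelling of local geometric detail.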

Details

Motivation: Existing normal estimation methods struggle to model fine-grained geometric structures precisely, and current Mamba-based methods focus mainly on global shape structure without modelling local detail effectively. Method: Attention-driven Hierarchical Feature Fusion (AHFF) fuses multi-scale point cloud patch features, and a Patch-wise State Space Model (PSSM) models point cloud patches as implicit hyper-surfaces, capturing local geometric detail through state dynamics. Result: Experiments on multiple benchmark datasets show that the method outperforms existing ones in accuracy, robustness, and flexibility, and ablation studies confirm the contribution of each component. Conclusion: MambaH-Fit improves the modelling of local fine-grained geometric structure in point cloud normal estimation and offers a new way to apply state space models to local feature learning on point clouds. Abstract: We present MambaH-Fit, a state space modelling framework tailored for hyper-surface fitting-based point cloud normal estimation. Existing normal estimation methods often fall short in modelling fine-grained geometric structures, thereby limiting the accuracy of the predicted normals. Recently, state space models (SSMs), particularly Mamba, have demonstrated strong modelling capability by capturing long-range dependencies with linear complexity and inspired adaptations to point cloud processing. However, existing Mamba-based approaches primarily focus on understanding global shape structures, leaving the modelling of local, fine-grained geometric details largely under-explored. To address the issues above, we first introduce an Attention-driven Hierarchical Feature Fusion (AHFF) scheme to adaptively fuse multi-scale point cloud patch features, significantly enhancing geometric context learning in local point cloud neighbourhoods. Building upon this, we further propose Patch-wise State Space Model (PSSM) that models point cloud patches as implicit hyper-surfaces via state dynamics, enabling effective fine-grained geometric understanding for normal prediction. Extensive experiments on benchmark datasets show that our method outperforms existing ones in terms of accuracy, robustness, and flexibility. Ablation studies further validate the contribution of the proposed components.

[151] GL-DT: Multi-UAV Detection and Tracking with Global-Local Integration

Juanqin Liu,Leonardo Plotegher,Eloy Roura,Shaoming He

Main category: cs.CV

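TL;DR: This paper proposes the Global-Local Detection and Tracking (GL-DT) framework for UAV multi-object tracking, combining a spatio-temporal feature fusion module with the JPTrack algorithm to improve small-target detection accuracy and trajectory continuity.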

Details

Motivation: The wide use of UAVs in military reconnaissance and environmental monitoring places higher demands on multi-object tracking, but complex backgrounds, small targets, and frequent occlusions limit the performance of existing methods. Method: The GL-DT framework uses a Spatio-Temporal Feature Fusion (STFF) module to jointly model motion and appearance features, combined with a global-local collaborative detection strategy; the JPTrack algorithm is designed to reduce ID switches and trajectory fragmentation. Result: Experimental results show that the method substantially improves the continuity and stability of multi-object tracking while maintaining real-time performance. Conclusion: GL-DT addresses the difficulties of multi-object tracking in UAV scenarios and supports further development of related technologies. Abstract: The extensive application of unmanned aerial vehicles (UAVs) in military reconnaissance, environmental monitoring, and related domains has created an urgent need for accurate and efficient multi-object tracking (MOT) technologies, which are also essential for UAV situational awareness. However, complex backgrounds, small-scale targets, and frequent occlusions and interactions continue to challenge existing methods in terms of detection accuracy and trajectory continuity. To address these issues, this paper proposes the Global-Local Detection and Tracking (GL-DT) framework. It employs a Spatio-Temporal Feature Fusion (STFF) module to jointly model motion and appearance features, combined with a global-local collaborative detection strategy, effectively enhancing small-target detection. Building upon this, the JPTrack tracking algorithm is introduced to mitigate common issues such as ID switches and trajectory fragmentation. Experimental results demonstrate that the proposed approach significantly improves the continuity and stability of MOT while maintaining real-time performance, providing strong support for the advancement of UAV detection and tracking technologies.

[152] Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation

Youwei Zheng,Yuxi Ren,Xin Xia,Xuefeng Xiao,Xiaohua Xie

Main category: cs.CV

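TL;DR: This paper proposes Dense2MoE, which converts a dense Diffusion Transformer (DiT) into a Mixture of Experts (MoE) structure, using structured sparsification to reduce the number of activated parameters while maintaining model performance and substantially improving the efficiency of text-to-image generation.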

Details

Motivation: Existing parameter compression methods such as pruning cause severe performance degradation under aggressive compression because model capacity is reduced. Solving this requires reducing activated parameters while preserving overall model capacity. Method: The feed-forward networks (FFNs) in DiT are replaced with MoE layers, and a Mixture of Blocks (MoB) mechanism selectively activates DiT blocks; a multi-step distillation pipeline is designed, comprising Taylor metric-based expert initialization, knowledge distillation with load balancing, and a group feature loss for MoB optimization. Result: On large diffusion transformers such as FLUX.1 [dev], activated parameters are reduced by 60% while original generation performance is maintained, outperforming pruning-based methods. Conclusion: Dense2MoE establishes a new paradigm for efficient text-to-image generation, striking a good balance between model sparsity and performance. Abstract: Diffusion Transformer (DiT) has demonstrated remarkable performance in text-to-image generation; however, its large parameter size results in substantial inference overhead. Existing parameter compression methods primarily focus on pruning, but aggressive pruning often leads to severe performance degradation due to reduced model capacity. To address this limitation, we pioneer the transformation of a dense DiT into a Mixture of Experts (MoE) for structured sparsification, reducing the number of activated parameters while preserving model capacity. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5%. Furthermore, we propose the Mixture of Blocks (MoB) to selectively activate DiT blocks, thereby further enhancing sparsity. To ensure an effective dense-to-MoE conversion, we design a multi-step distillation pipeline, incorporating Taylor metric-based expert initialization, knowledge distillation with load balancing, and group feature loss for MoB optimization. We transform large diffusion transformers (e.g., FLUX.1 [dev]) into an MoE structure, reducing activated parameters by 60% while maintaining original performance and surpassing pruning-based approaches in extensive experiments. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
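For readers unfamiliar with how an MoE layer reduces activated parameters, the following generic top-k routing sketch may help. It is not Dense2MoE's design: the toy experts, the gate logits, and the 3-of-8 configuration are hypothetical. The 3-of-8 split is used only because, with equally sized experts, it happens to give the same 62.5% reduction the abstract reports; the abstract does not state the actual expert count or k.

```python
import math


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_logits, k):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(gate_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def activated_fraction(num_experts, k):
    """Share of expert parameters used per token when experts are equally sized."""
    return k / num_experts


# Toy experts: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
gate_logits = [0.0, 0.0, 5.0, 0.0, 0.0, 5.0, 0.0, 5.0]
routes = top_k_route(gate_logits, k=3)
output = moe_forward([1.0], experts, gate_logits, k=3)
reduction = 1.0 - activated_fraction(8, 3)
```

Only the three selected experts run for this input, so total capacity (all eight experts) is preserved while per-token compute drops, which is the trade-off the entry contrasts with pruning.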

[153] A Novel Multi-branch ConvNeXt Architecture for Identifying Subtle Pathological Features in CT Scans

Irash Perera,Uthayasanker Thayasivam

Main category: cs.CV

TL;DR: Proposes a multi-branch ConvNeXt architecture for medical image analysis that achieves high-accuracy COVID-19 diagnosis, particularly on CT images.

Details

Motivation: Subtle pathological features in medical images are hard to identify, so the accuracy and robustness of automated diagnosis need to improve. Method: A three-branch ConvNeXt model combining global average pooling, global max pooling, and attention-weighted pooling is trained with a two-phase strategy that uses transfer learning. Result: Validated on 2,609 CT slices, the model reaches a ROC-AUC of 0.9937, an accuracy of 0.9757, and an F1-score of 0.9825, outperforming previous models. Conclusion: Combining a multi-branch architecture with careful data handling can substantially improve medical image classification, demonstrating the effectiveness of deep learning for clinical diagnosis. Abstract: Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis, especially for identifying subtle pathological features. This paper introduces a novel multi-branch ConvNeXt architecture designed specifically for the nuanced challenges of medical image analysis. While applied here to the specific problem of COVID-19 diagnosis, the methodology offers a generalizable framework for classifying a wide range of pathologies from CT scans. The proposed model incorporates a rigorous end-to-end pipeline, from meticulous data preprocessing and augmentation to a disciplined two-phase training strategy that leverages transfer learning effectively. The architecture uniquely integrates features extracted from three parallel branches: Global Average Pooling, Global Max Pooling, and a new Attention-weighted Pooling mechanism. The model was trained and validated on a combined dataset of 2,609 CT slices derived from two distinct datasets. Experimental results demonstrate a superior performance on the validation set, achieving a final ROC-AUC of 0.9937, a validation accuracy of 0.9757, and an F1-score of 0.9825 for COVID-19 cases, outperforming all previously reported models on this dataset. These findings indicate that a modern, multi-branch architecture, coupled with careful data handling, can achieve performance comparable to or exceeding contemporary state-of-the-art models, thereby proving the efficacy of advanced deep learning techniques for robust medical diagnostics.
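The three pooling branches named in this entry can be sketched on a tiny feature map in plain Python. This is an illustrative sketch, not the paper's architecture: fusing the branches by concatenation and scoring positions with a single weight vector `w` are assumptions, since the abstract says only that features from the three branches are integrated.

```python
import math


def global_average_pool(fmap):
    """fmap: list of spatial positions, each a list of C channel values."""
    n = len(fmap)
    return [sum(pos[c] for pos in fmap) / n for c in range(len(fmap[0]))]


def global_max_pool(fmap):
    return [max(pos[c] for pos in fmap) for c in range(len(fmap[0]))]


def attention_pool(fmap, w):
    """Softmax over per-position scores (dot product with w), then a weighted sum."""
    scores = [sum(x * wi for x, wi in zip(pos, w)) for pos in fmap]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(a * pos[c] for a, pos in zip(weights, fmap))
            for c in range(len(fmap[0]))]


def multi_branch_features(fmap, w):
    """Concatenate the three pooled vectors into one descriptor of length 3 * C."""
    return (global_average_pool(fmap)
            + global_max_pool(fmap)
            + attention_pool(fmap, w))


fmap = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]  # 3 spatial positions, 2 channels
w = [0.0, 0.0]                               # zero scores make attention uniform
descriptor = multi_branch_features(fmap, w)
```

With a zero weight vector the attention branch reduces to average pooling; a learned `w` would instead emphasize positions whose features score highly, which is what lets this branch focus on small, localized findings.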

[154] SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding

Weikai Huang,Jieyu Zhang,Taoyang Jia,Chenhao Zheng,Ziqi Gao,Jae Sung Park,Ranjay Krishna

Main category: cs.CV

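TL;DR: Proposes SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy that inserts high-quality synthetic object segments into new images using structured layout priors and generative relighting; on detection and visual grounding tasks it outperforms models trained on larger real-image datasets.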

Details

Motivation: Large annotated datasets are costly, biased in coverage, and hard to scale, while synthetic data often lack flexibility, accuracy, and compositional diversity. A more efficient, controllable, and diverse synthesis method is therefore needed to improve visual grouping tasks. Method: Proposes the SOS data synthesis pipeline, which follows an object-centric composition strategy and uses structured layout priors and generative relighting to paste high-quality synthetic object segments into new images, producing accurate and diverse masks, bounding boxes, and referring expressions. Result: Models trained on 100,000 SOS synthetic images outperform models trained on larger real-image datasets such as GRIT (20M) and V3Det (200K), gaining +10.9 AP on LVIS detection and +8.4 N_Acc on gRefCOCO grounding; augmenting LVIS and COCO still yields clear gains under extremely limited real data (for example, +3.83 AP_rare on LVIS and +6.59 AP in the 1% COCO setting). Conclusion: SOS enables controllable dataset construction, improves generalization in low-data and closed-vocabulary settings, and supports targeted data generation for challenging cases such as intra-class referring, offering an efficient and flexible synthetic-data solution for visual grouping tasks. Abstract: Visual grouping -- operationalized via instance segmentation, visual grounding, and object detection -- underpins applications from robotic perception to photo editing. Large annotated datasets are costly, biased in coverage, and hard to scale. Synthetic data are promising but often lack flexibility, accuracy, and compositional diversity. We present SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting, producing accurate and diverse masks, boxes, and referring expressions. Models trained on 100,000 synthetic images from SOS outperform those trained on larger real-image datasets such as GRIT (20M) and V3Det (200K) on detection and grounding tasks, achieving +10.9 AP on LVIS detection and +8.4 $N_{\text{Acc}}$ on gRefCOCO grounding. SOS enables controllable dataset construction and improves generalization in both low-data and closed-vocabulary settings. Augmenting LVIS and COCO with synthetic object segments yields strong performance across real-data scales and even larger gains under extremely limited real data (for example, +3.83 $AP_{\text{rare}}$ on LVIS instance segmentation and +6.59 AP with a 1 percent COCO setup). This controllability also supports targeted data generation for challenging intra-class referring in visual grounding.
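
The reason pasting segments gives "accurate masks and boxes" for free can be seen in a toy sketch: the annotation is derived from the same pixels that were written. The code below is only an illustration under stated assumptions; it omits the layout priors and generative relighting entirely, and `paste_segment`, the instance-id canvas, and the inclusive box convention are all hypothetical choices, not taken from SOS.

```python
def paste_segment(instance_map, segment_mask, top, left, instance_id):
    """Paste a binary segment mask into an instance-id canvas.

    instance_map: H x W list of lists of ints (0 means background).
    segment_mask: h x w list of lists of 0/1 values.
    Returns the inclusive box (x_min, y_min, x_max, y_max) of the pixels
    actually written, or None if the segment falls fully outside the canvas.
    """
    height, width = len(instance_map), len(instance_map[0])
    xs, ys = [], []
    for dy, row in enumerate(segment_mask):
        for dx, inside in enumerate(row):
            y, x = top + dy, left + dx
            if inside and 0 <= y < height and 0 <= x < width:
                instance_map[y][x] = instance_id  # later pastes occlude earlier ones
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))
```

Because later pastes overwrite earlier ones, the final visible mask of each instance would be read back from `instance_map` after all segments are placed.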

[155] MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation

Dominik Winter,Mai Bui,Monica Azqueta Gavaldon,Nicolas Triltsch,Marco Rosati,Nicolas Brieu

Main category: cs.CV

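TL;DR: Proposes a Multimodal Semantic Diffusion Model (MSDM) that generates realistic image-mask pairs for cell and nuclei segmentation; by fusing morphology, color, and metadata information, it significantly improves segmentation performance on scarce morphologies.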

Details

Motivation: Scarce annotated data, especially for rare or atypical cell morphologies, makes cell and nuclei segmentation in computational pathology difficult, and manual annotation is costly, so an efficient way to generate synthetic data is needed. Method: Proposes the Multimodal Semantic Diffusion Model (MSDM), which conditions on cellular morphology (horizontal and vertical maps), RGB color characteristics, and BERT-encoded assay/indication metadata, fusing these modalities through multi-head cross-attention to generate image-mask pairs with specific biological properties. Result: Generated images lie close to real images in embedding space (low Wasserstein distance), and the synthetic samples significantly improve segmentation accuracy on rare types such as columnar cells. Conclusion: MSDM effectively enriches datasets and directly targets model deficiencies, confirming the potential of multimodal diffusion-based augmentation for improving the robustness and generalizability of cell segmentation models. Abstract: Scarcity of annotated data, particularly for rare or atypical morphologies, presents significant challenges for cell and nuclei segmentation in computational pathology. While manual annotation is labor-intensive and costly, synthetic data offers a cost-effective alternative. We introduce a Multimodal Semantic Diffusion Model (MSDM) for generating realistic pixel-precise image-mask pairs for cell and nuclei segmentation. By conditioning the generative process with cellular/nuclear morphologies (using horizontal and vertical maps), RGB color characteristics, and BERT-encoded assay/indication metadata, MSDM generates datasets with desired morphological properties. These heterogeneous modalities are integrated via multi-head cross-attention, enabling fine-grained control over the generated images. Quantitative analysis demonstrates that synthetic images closely match real data, with low Wasserstein distances between embeddings of generated and real images under matching biological conditions. The incorporation of these synthetic samples, exemplified by columnar cells, significantly improves segmentation model accuracy on columnar cells. This strategy systematically enriches data sets, directly targeting model deficiencies. We highlight the effectiveness of multimodal diffusion-based augmentation for advancing the robustness and generalizability of cell and nuclei segmentation models. Thereby, we pave the way for broader application of generative models in computational pathology.

[156] Polar Separable Transform for Efficient Orthogonal Rotation-Invariant Image Representation

Satya P. Singh,Rashmi Chaudhry,Anand Srivastava,Jagath C. Rajapakse

Main category: cs.CV

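TL;DR: Proposes PSepT (Polar Separable Transform), a new orthogonal separable transform that achieves complete kernel factorization through a tensor product of DCT radial bases and Fourier angular bases, substantially reducing computational complexity, memory requirements, and condition number for efficient and stable image representation.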

Details

Motivation: Classical orthogonal moment methods such as Zernike moments suffer from high computational complexity and numerical instability at high orders, and because their kernels are non-separable in polar coordinates they cannot be factorized efficiently. Method: Designs PSepT, a separable orthogonal transform that uses the Discrete Cosine Transform (DCT) as the radial basis and Fourier harmonics as the angular basis, with a tensor-product construction that fully decouples radial and angular processing. Result: Reduces computational complexity to O(N² log N), memory requirements to O(N²), and condition number scaling to O(√N), while retaining orthogonality, completeness, energy conservation, and rotation covariance; experiments show better numerical stability, computational efficiency, and classification performance, with exact reconstruction supported. Conclusion: PSepT overcomes the non-separability limitation of traditional polar orthogonal moments, which the authors describe as an exponential improvement, and makes previously infeasible high-order moment analysis possible, opening a new route to robust image analysis. Abstract: Orthogonal moment-based image representations are fundamental in computer vision, but classical methods suffer from high computational complexity and numerical instability at large orders. Zernike and pseudo-Zernike moments, for instance, require coupled radial-angular processing that precludes efficient factorization, resulting in $\mathcal{O}(n^3N^2)$ to $\mathcal{O}(n^6N^2)$ complexity and $\mathcal{O}(N^4)$ condition number scaling for the $n$th-order moments on an $N\times N$ image. We introduce PSepT (Polar Separable Transform), a separable orthogonal transform that overcomes the non-separability barrier in polar coordinates. PSepT achieves complete kernel factorization via tensor-product construction of Discrete Cosine Transform (DCT) radial bases and Fourier harmonic angular bases, enabling independent radial and angular processing. This separable design reduces computational complexity to $\mathcal{O}(N^2 \log N)$, memory requirements to $\mathcal{O}(N^2)$, and condition number scaling to $\mathcal{O}(\sqrt{N})$, representing exponential improvements over polynomial approaches. PSepT exhibits orthogonality, completeness, energy conservation, and rotation-covariance properties. Experimental results demonstrate better numerical stability, computational efficiency, and competitive classification performance on structured datasets, while preserving exact reconstruction. The separable framework enables high-order moment analysis previously infeasible with classical methods, opening new possibilities for robust image analysis applications.
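
A minimal sketch of what a tensor-product DCT-by-Fourier transform on a polar grid looks like, and why it yields rotation-covariant coefficients. This is an illustration under assumptions, not PSepT itself: it takes an already polar-resampled image `polar[r][t]`, ignores how the paper handles polar sampling and the area element, and uses naive O(n²) one-dimensional transforms rather than the fast algorithms behind the stated O(N² log N) cost. All function names are hypothetical.

```python
import cmath
import math


def dct2_orthonormal(x):
    # Orthonormal DCT-II of a real sequence (the radial basis in this sketch).
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dft_unitary(x):
    # Unitary DFT (the angular Fourier-harmonic basis in this sketch).
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * m * t / n) for t in range(n)) / math.sqrt(n)
        for m in range(n)
    ]


def separable_polar_transform(polar):
    # polar[r][t]: sample at radial index r and angular index t.
    rows = [dft_unitary(row) for row in polar]  # angular pass, one ring at a time
    n_r, n_t = len(rows), len(rows[0])
    coeffs = [[0j] * n_t for _ in range(n_r)]
    for m in range(n_t):  # radial pass, one angular frequency at a time
        # The DCT is real and linear, so apply it to real and imaginary parts.
        re = dct2_orthonormal([rows[r][m].real for r in range(n_r)])
        im = dct2_orthonormal([rows[r][m].imag for r in range(n_r)])
        for k in range(n_r):
            coeffs[k][m] = complex(re[k], im[k])
    return coeffs
```

Rotating the image by one angular sample is a cyclic shift of every ring, which multiplies each angular-frequency column by a unit-modulus phase; the coefficient magnitudes are therefore unchanged, and because both 1-D transforms are orthonormal the total energy is preserved.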

[157] Training Feature Attribution for Vision Models

Aziz Bacha,Thomas George

Main category: cs.CV

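TL;DR: Proposes training feature attribution, a new approach that links test predictions to specific regions of specific training images, providing fine-grained, test-specific explanations of the inner workings of deep models.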

Details

Motivation: Existing explainability methods typically attribute test-time predictions either to input features or to influential training examples, and the joint view of the two has received little study. Method: Proposes and explores training feature attribution, which analyzes model behavior by linking test predictions to specific regions of specific training images. Result: Experiments on vision datasets show that the method identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose. Conclusion: Training feature attribution provides finer-grained and more specific explanations, helping to understand the decision mechanisms of deep neural networks in depth. Abstract: Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores *training feature attribution*, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.

Clara Tomasini,Luis Riazuelo,Ana C. Murillo

Main category: cs.CV

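TL;DR: Proposes an image-based topological localization method for bronchoscopy that needs no patient CT scan and, trained only on phantom data, provides good navigation assistance.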

Details

Motivation: Localization during bronchoscopy navigation is difficult, and existing methods depend on CT scans and additional sensors, which are costly and require complex setup; a localization method that needs no CT scan, is low-cost, and is easy to deploy is therefore needed. Method: Proposes an image-based topological localization pipeline trained only on phantom data, which estimates the bronchoscope's position within a generic airway model from image features and avoids both real-data labeling and patient CT scans. Result: The method outperforms existing methods on real-data test sequences and generalizes well. Conclusion: Requiring no patient CT scan and little training cost, the method performs well in real scenarios and offers a practical solution for bronchoscopy navigation. Abstract: Video bronchoscopy is a fundamental procedure in respiratory medicine, where medical experts navigate through the bronchial tree of a patient to diagnose or operate on the patient. Surgeons need to determine the position of the scope as they go through the airway until they reach the area of interest. This task is very challenging for practitioners due to the complex bronchial tree structure and varying doctor experience and training. Navigation assistance to locate the bronchoscope during the procedure can improve its outcome. Currently used techniques for navigational guidance commonly rely on previous CT scans of the patient to obtain a 3D model of the airway, followed by tracking of the scope with additional sensors or image registration. These methods obtain accurate locations but imply additional setup, scans and training. Accurate metric localization is not always required, and a topological localization with regard to a generic airway model can often suffice to assist the surgeon with navigation. We present an image-based bronchoscopy topological localization pipeline to provide navigation assistance during the procedure, with no need for a patient CT scan. Our approach is trained only on phantom data, eliminating the high cost of real data labeling, and presents good generalization capabilities. The results obtained surpass existing methods, particularly on real data test sequences.

[159] Instance-Level Generation for Representation Learning

Yankun Wu,Zakaria Laskar,Giorgos Kordopatis-Zilos,Noa Garcia,Giorgos Tolias

Main category: cs.CV

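TL;DR: Proposes a synthetic-data approach to instance-level recognition (ILR) that requires no real images; given only the names of the target domains, it generates a large-scale training set and significantly improves cross-domain retrieval performance.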

Details

Motivation: The fine-grained nature of instance-level recognition makes annotated datasets hard to build, which limits its real-world applicability. Method: Synthetically generates diverse object instances from multiple domains under varied conditions and backgrounds to form a large-scale training set, which is then used to fine-tune foundation vision models. Result: Significantly improves retrieval performance on seven ILR benchmarks spanning multiple domains, without relying on any real images. Conclusion: The method offers an efficient and effective new paradigm for ILR that removes the dependence on extensive data collection and annotation. Abstract: Instance-level recognition (ILR) focuses on identifying individual objects rather than broad categories, offering the highest granularity in image classification. However, this fine-grained nature makes creating large-scale annotated datasets challenging, limiting ILR's real-world applicability across domains. To overcome this, we introduce a novel approach that synthetically generates diverse object instances from multiple domains under varied conditions and backgrounds, forming a large-scale training set. Unlike prior work on automatic data synthesis, our method is the first to address ILR-specific challenges without relying on any real images. Fine-tuning foundation vision models on the generated data significantly improves retrieval performance across seven ILR benchmarks spanning multiple domains. Our approach offers a new, efficient, and effective alternative to extensive data collection and curation, introducing a new ILR paradigm where the only input is the names of the target domains, unlocking a wide range of real-world applications.

[160] TARO: Toward Semantically Rich Open-World Object Detection

Yuchen Zhang,Yao Lu,Johannes Betz

Main category: cs.CV

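TL;DR: Presents TARO, a new detection framework that identifies unknown objects and also classifies them into coarse parent categories within a semantic hierarchy, improving decision-making in open-world scenarios.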

Details

Motivation: Existing object detectors are bound by a "closed-world" assumption and struggle with novel objects in real scenes; lumping all unknown objects into a single class is not enough to support effective decisions in safety-critical applications. Method: TARO models objectness with a sparsemax-based head, adds a hierarchy-guided relabeling component for auxiliary supervision, and uses a classification module that learns hierarchical relationships between classes, so that unknown objects can be assigned to coarse categories. Result: Experiments show TARO correctly assigns up to 29.9% of unknown objects to meaningful coarse categories, significantly reduces confusion between known and unknown classes, and performs competitively on both unknown recall and known-class mAP. Conclusion: By introducing a semantic hierarchy, TARO makes descriptions of unknown objects in open-set detection more informative, offering a finer-grained treatment of unknowns for safety-critical applications. Abstract: Modern object detectors are largely confined to a "closed-world" assumption, limiting them to a predefined set of classes and posing risks when encountering novel objects in real-world scenarios. While open-set detection methods aim to address this by identifying such instances as 'Unknown', this is often insufficient. Rather than treating all unknowns as a single class, assigning them more descriptive subcategories can enhance decision-making in safety-critical contexts. For example, identifying an object as an 'Unknown Animal' (requiring an urgent stop) versus 'Unknown Debris' (requiring a safe lane change) is far more useful than just 'Unknown' in autonomous driving. To bridge this gap, we introduce TARO, a novel detection framework that not only identifies unknown objects but also classifies them into coarse parent categories within a semantic hierarchy. TARO employs a unique architecture with a sparsemax-based head for modeling objectness, a hierarchy-guided relabeling component that provides auxiliary supervision, and a classification module that learns hierarchical relationships. Experiments show TARO can categorize up to 29.9% of unknowns into meaningful coarse classes, significantly reduce confusion between unknown and known classes, and achieve competitive performance in both unknown recall and known mAP. Code will be made available.
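
For readers unfamiliar with the sparsemax operator that the objectness head is built on, the following sketch implements the standard sort-based projection onto the probability simplex. It shows only the operator; how TARO wires it into its head is not described in the abstract, so nothing here should be read as the paper's architecture.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex.

    Unlike softmax, the result can contain exact zeros.
    """
    zs = sorted(z, reverse=True)
    cumsum = 0.0
    k, support_sum = 0, 0.0
    for j, v in enumerate(zs, start=1):
        cumsum += v
        # Position j stays in the support while 1 + j * z_(j) > sum of the top j scores.
        if 1.0 + j * v > cumsum:
            k, support_sum = j, cumsum
    tau = (support_sum - 1.0) / k  # threshold subtracted from every score
    return [max(v - tau, 0.0) for v in z]
```

For scores `[2.0, 1.0, 0.1]` the threshold works out to 1.0, so all mass goes to the first entry, whereas softmax would spread some probability over every entry.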

[161] Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption

Johann-Friedrich Feiden,Tim Küchler,Denis Zavadski,Bogdan Savchynskyy,Carsten Rother

Main category: cs.CV

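TL;DR: Presents oVDA, an efficient method for online monocular video depth estimation that borrows caching and training-time frame masking from LLMs, achieving the best accuracy and VRAM usage among online methods and deploying efficiently on edge devices.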

Details

Motivation: Video Depth Anything (VDA) performs well on long video sequences but relies on batch processing, so it cannot be used online, which limits its use in real-time systems. Method: Proposes online VDA (oVDA), which borrows techniques from large language models: latent features are cached during inference and frames are masked during training, enabling online, streaming processing. Result: oVDA outperforms all competing online video depth estimation methods in accuracy and VRAM usage; it runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device, with low VRAM usage. Conclusion: oVDA removes VDA's inability to run online and combines high efficiency with low resource consumption, making it suitable for edge deployment and advancing practical real-time video depth estimation. Abstract: Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch processing, which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely, caching latent features during inference and masking frames at training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both code and compilation scripts, making oVDA easy to deploy on low-power hardware.
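
The idea of caching latent features so that each incoming frame is processed once can be sketched as a streaming loop. This is a schematic under assumptions, not oVDA's design: the bounded cache size, the `encode` and `predict` callables, and the function name are all hypothetical placeholders for the model's encoder and temporal head.

```python
from collections import deque


def stream_depth(frames, encode, predict, max_frames=4):
    """Yield one depth estimate per incoming frame, reusing cached latents.

    encode(frame) -> latent            (hypothetical per-frame encoder)
    predict(latent, context) -> depth  (hypothetical temporal head)
    """
    cache = deque(maxlen=max_frames)  # oldest latents are evicted automatically
    for frame in frames:
        latent = encode(frame)                # each frame is encoded exactly once
        depth = predict(latent, list(cache))  # attend only to cached past latents
        cache.append(latent)
        yield depth
```

A bounded cache keeps memory constant regardless of video length, which is one plausible way to obtain the low VRAM usage the abstract emphasizes; whether oVDA bounds its cache this way is an assumption of the sketch.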

[162] Modern Deep Learning Approaches for Cricket Shot Classification: A Comprehensive Baseline Study

Sungwoo Kang

Main category: cs.CV

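TL;DR: Presents the first systematic comparison of seven deep learning approaches to cricket shot classification, exposes a significant gap between the high accuracies reported in the literature and actual re-implementation results, and proposes a modern architecture based on EfficientNet-B0 and a GRU that reaches 92.25% accuracy on a unified benchmark, underscoring the importance of standardized evaluation protocols.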

Details

Motivation: Prior work reports very high accuracies on cricket shot classification but lacks a unified evaluation standard, making results hard to reproduce and compare; a systematic baseline study is needed to assess the true performance of different methods. Method: Implements seven representative deep learning models, covering CNN-LSTM, attention mechanisms, vision transformers, transfer learning, and EfficientNet-GRU approaches, and evaluates them systematically on a unified dataset and experimental setup; all models are implemented in PyTorch Lightning for reproducibility. Result: Accuracies of 96%, 99.2%, and 93% claimed in prior studies reach only 46.0%, 55.6%, and 57.7% respectively in this study's re-implementations, while the proposed combination of EfficientNet-B0 and a GRU reaches 92.25% accuracy, which the authors present as the new state of the art. Conclusion: Sports video analysis currently has a serious reproducibility problem and needs standardized evaluation protocols; modern architectures combined with systematic optimization can deliver substantial gains on this task. Abstract: Cricket shot classification from video sequences remains a challenging problem in sports video analysis, requiring effective modeling of both spatial and temporal features. This paper presents the first comprehensive baseline study comparing seven different deep learning approaches across four distinct research paradigms for cricket shot classification. We implement and systematically evaluate traditional CNN-LSTM architectures, attention-based models, vision transformers, transfer learning approaches, and modern EfficientNet-GRU combinations on a unified benchmark. A critical finding of our study is the significant performance gap between claims in academic literature and practical implementation results. While previous papers reported accuracies of 96% (Balaji LRCN), 99.2% (IJERCSE), and 93% (Sensors), our standardized re-implementations achieve 46.0%, 55.6%, and 57.7% respectively. Our modern SOTA approach, combining EfficientNet-B0 with a GRU-based temporal model, achieves 92.25% accuracy, demonstrating that substantial improvements are possible with modern architectures and systematic optimization. All implementations follow modern MLOps practices with PyTorch Lightning, providing a reproducible research platform that exposes the critical importance of standardized evaluation protocols in sports video analysis research.

[163] Towards Safer and Understandable Driver Intention Prediction

Mukilan Karuppasamy,Shankar Gangisetty,Shyam Nandan Rai,Carlo Masone,C V Jawahar

Main category: cs.CV

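TL;DR: Introduces an interpretable driver intent prediction (DIP) task, builds the multimodal DAAD-X dataset to support it, and proposes the VCBM model to generate spatio-temporally coherent explanations; experiments show transformer-based models are more interpretable than CNNs.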

Details

Motivation: As autonomous driving systems interact more with humans, the interpretability of their decision-making becomes critical for safety; existing deep learning systems struggle to provide clear causal reasoning, so driver intent prediction needs to become more interpretable. Method: Builds the DAAD-X dataset with high-level textual explanations derived from the driver's eye gaze and the ego-vehicle's perspective; proposes the Video Concept Bottleneck Model (VCBM), which inherently generates spatio-temporally consistent explanations, and introduces a multilabel t-SNE visualization to analyze the disentanglement and causal correlation among multiple explanations. Result: Experiments on DAAD-X show that transformer models are more interpretable than CNN models, that VCBM generates plausible causal explanations, and that multilabel t-SNE reveals structured relationships among explanations. Conclusion: The interpretable dataset and model framework improve the transparency and trustworthiness of driver intent prediction and point to a new direction for explainable autonomous driving decisions. Abstract: Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, mainly due to recent advances in deep learning and AI. As interactions between autonomous systems and humans increase, the interpretability of decision-making processes in driving systems becomes increasingly crucial for ensuring safe driving operations. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretability in maneuver prediction before they occur for driver safety, i.e., driver intent prediction (DIP), which plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset to provide hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye-gaze and the ego-vehicle's perspective. Next, we propose Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations.
Our data, code and models are available at: https://mukil07.github.io/VCBM.github.io/

[164] Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition

Huimin Liu,Jing Gao,Daria Baran,AxelX Montout,Neill W Campbell,Andrew W Dowsey

Main category: cs.CV

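TL;DR: Proposes Cattle-CLIP, a CLIP-based multimodal deep learning framework for cattle behaviour recognition that uses semantic cues to improve video feature recognition, and releases the CattleBehaviours6 dataset of 1905 video clips.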

Details

Motivation: Cattle behaviour is an important indicator of health, productivity, and welfare; existing video-based behaviour recognition methods perform poorly when data are scarce, so models need to generalize better in real farm environments. Method: Builds on the large-scale image-language model CLIP, adding a temporal integration module, tailored data augmentation strategies, and specialised text prompts to narrow the domain gap between the pre-trained model and real cattle surveillance footage. Result: In the fully supervised setting, Cattle-CLIP reaches 96.1% overall accuracy across six behaviours, with nearly 100% recall for feeding, drinking, and standing-ruminating, and generalizes well in few-shot scenarios. Conclusion: Cattle-CLIP improves both the accuracy and the data efficiency of cattle behaviour recognition, confirming the potential of multimodal learning in agricultural and animal behaviour analysis. Abstract: Cattle behaviour is a crucial indicator of an individual animal's health, productivity and overall well-being. Video-based monitoring, combined with deep learning techniques, has become a mainstream approach in animal biometrics, and it can offer high accuracy in some behaviour recognition tasks. We present Cattle-CLIP, a multimodal deep learning framework for cattle behaviour recognition, using semantic cues to improve the performance of video-based visual feature recognition. It is adapted from the large-scale image-language model CLIP by adding a temporal integration module. To address the domain gap between web data used for the pre-trained model and real-world cattle surveillance footage, we introduce tailored data augmentation strategies and specialised text prompts. Cattle-CLIP is evaluated under both fully-supervised and few-shot learning scenarios, with a particular focus on data-scarce behaviour recognition - an important yet under-explored goal in livestock monitoring. To evaluate the proposed method, we release the CattleBehaviours6 dataset, which comprises six types of indoor behaviours: feeding, drinking, standing-self-grooming, standing-ruminating, lying-self-grooming and lying-ruminating. The dataset consists of 1905 clips collected from our John Oldacre Centre dairy farm research platform housing 200 Holstein-Friesian cows.
Experiments show that Cattle-CLIP achieves 96.1% overall accuracy across six behaviours in a supervised setting, with nearly 100% recall for feeding, drinking and standing-ruminating behaviours, and demonstrates robust generalisation with limited data in few-shot scenarios, highlighting the potential of multimodal learning in agricultural and animal behaviour analysis.

[165] 3D Reconstruction from Transient Measurements with Time-Resolved Transformer

Yue Li,Shida Sun,Yu Hong,Feihu Xu,Zhiwei Xiong

Main category: cs.CV

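TL;DR: Proposes a generic Time-Resolved Transformer (TRT) architecture to improve 3D reconstruction in photon-efficient imaging; it significantly outperforms existing methods on both line-of-sight (LOS) and non-line-of-sight (NLOS) imaging tasks.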

Details

Motivation: Because of low sensor quantum efficiency and high noise levels, especially for long-range or complex scenes, 3D reconstruction from transient measurements remains challenging, so reconstruction quality in photon-efficient imaging needs to improve. Method: Designs the TRT architecture for spatio-temporal transient measurements, with spatio-temporal self-attention encoders that extract multi-scale local and global correlations and spatio-temporal cross-attention decoders that fuse these features in token space to strengthen the representation; builds two task-specific models on it, TRT-LOS and TRT-NLOS. Result: On both synthetic and real-world data, TRT-LOS and TRT-NLOS significantly outperform existing methods; the work also contributes a large-scale, high-resolution synthetic LOS dataset with various noise levels and a set of real-world NLOS measurements captured with a custom-built system. Conclusion: The TRT architecture improves 3D reconstruction in photon-efficient imaging, generalizes well, and offers a new solution for both LOS and NLOS imaging. Abstract: Transient measurements, captured by time-resolved systems, are widely employed in photon-efficient reconstruction tasks, including line-of-sight (LOS) and non-line-of-sight (NLOS) imaging. However, challenges persist in their 3D reconstruction due to the low quantum efficiency of sensors and the high noise levels, particularly for long-range or complex scenes. To boost the 3D reconstruction performance in photon-efficient imaging, we propose a generic Time-Resolved Transformer (TRT) architecture. Different from existing transformers designed for high-dimensional data, TRT has two elaborate attention designs tailored for the spatio-temporal transient measurements. Specifically, the spatio-temporal self-attention encoders explore both local and global correlations within transient data by splitting or downsampling input features into different scales. Then, the spatio-temporal cross attention decoders integrate the local and global features in the token space, resulting in deep features with high representation capabilities. Building on TRT, we develop two task-specific embodiments: TRT-LOS for LOS imaging and TRT-NLOS for NLOS imaging. Extensive experiments demonstrate that both embodiments significantly outperform existing methods on synthetic data and real-world data captured by different imaging systems. In addition, we contribute a large-scale, high-resolution synthetic LOS dataset with various noise levels and capture a set of real-world NLOS measurements using a custom-built imaging system, enhancing the data diversity in this field.
Code and datasets are available at https://github.com/Depth2World/TRT.

[166] Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

Wuyang Li,Wentao Pan,Po-Chien Luan,Yang Gao,Alexandre Alahi

Main category: cs.CV

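TL;DR: Proposes Stable Video Infinity (SVI), a method that generates infinite-length videos with high temporal consistency, plausible scene transitions, and controllable storylines. Its core is Error-Recycling Fine-Tuning, which uses the model's own generated errors during training to stabilize long-horizon autoregressive generation.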

Details

Motivation: Existing long-video generation methods are limited by accumulated errors and single-prompt extrapolation, producing homogeneous scenes with repetitive motions; in addition, training (on clean data) and testing (autoregressive conditioning on generated outputs) use mismatched conditions, which degrades performance. Method: Proposes Error-Recycling Fine-Tuning: historical generation errors are injected into the Diffusion Transformer (DiT) inputs to simulate error accumulation; predictions are approximated with one-step bidirectional integration and errors are computed as residuals; errors are dynamically stored in a replay memory and resampled for later inputs, giving closed-loop learning from errors. Result: SVI is validated on three benchmarks (consistent, creative, and conditional settings), supports infinite-length generation with no additional inference cost, and is compatible with diverse conditioning inputs such as text, audio, and skeleton streams. Conclusion: Through closed-loop error recycling, SVI bridges the gap between training and autoregressive inference, achieving high-quality, controllable infinite-length video generation with good scalability and generality. Abstract: We propose Stable Video Infinity (SVI) that is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise scheduler, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, which are resampled for new input.
SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art role.
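
The "bank errors into replay memory across discretized timesteps" step can be pictured with a small data-structure sketch. It is a loose illustration only: the bin count, per-bin capacity, uniform resampling, additive injection, and every name below are assumptions, and the flow-matching model and one-step integration that would produce the predictions are not modeled.

```python
import random
from collections import deque


class ErrorBank:
    """Replay memory of residual errors, bucketed by discretized timestep."""

    def __init__(self, num_bins, capacity, seed=0):
        self._bins = [deque(maxlen=capacity) for _ in range(num_bins)]
        self._rng = random.Random(seed)

    def _bin(self, t):
        # Assumes the timestep t is normalized to [0, 1].
        return min(int(t * len(self._bins)), len(self._bins) - 1)

    def bank(self, t, prediction, target):
        # Store the residual between the model's prediction and the clean target.
        residual = [p - y for p, y in zip(prediction, target)]
        self._bins[self._bin(t)].append(residual)

    def sample(self, t):
        # Resample a stored error from the matching timestep bin, if any.
        bucket = self._bins[self._bin(t)]
        return self._rng.choice(list(bucket)) if bucket else None


def inject(clean, error):
    # Perturb a clean input with a recycled error to mimic test-time drift.
    if error is None:
        return list(clean)
    return [c + e for c, e in zip(clean, error)]
```

In a training loop, each step would sample from the bank, inject into the clean input, run the model, and bank the new residual, which closes the loop the abstract describes.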

[167] Tag-Enriched Multi-Attention with Large Language Models for Cross-Domain Sequential Recommendation

Wangyu Wu,Xuhang Chen,Zhenhong Chen,Jing-En Jiang,Kim-Fung Tsang,Xiaowei Huang,Fei Ma,Jimin Xiao

Main category: cs.CV

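TL;DR: Proposes TEMA-LLM, a cross-domain sequential recommendation framework that incorporates large language models (LLMs), improving recommendation performance through semantic tag generation and a multi-attention mechanism.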

Details

Motivation: Cross-domain recommender systems must accurately capture both domain-specific and cross-domain user behavior patterns, and existing methods fall short in semantic understanding and feature fusion. Method: Uses large language models with domain-aware prompts to generate semantic tags from item titles and descriptions; fuses the tag embeddings with item IDs and textual and visual features to build enhanced item representations; designs a tag-enriched multi-attention mechanism that jointly models user preferences within and across domains. Result: Experiments on four large-scale e-commerce datasets show that TEMA-LLM consistently outperforms state-of-the-art baselines, confirming the value of LLM-generated semantic tags and the multi-attention mechanism. Conclusion: TEMA-LLM demonstrates the potential of large language models to advance intelligent, user-centric services in consumer electronics and offers an effective new solution for cross-domain recommendation. Abstract: Cross-Domain Sequential Recommendation (CDSR) plays a crucial role in modern consumer electronics and e-commerce platforms, where users interact with diverse services such as books, movies, and online retail products. These systems must accurately capture both domain-specific and cross-domain behavioral patterns to provide personalized and seamless consumer experiences. To address this challenge, we propose TEMA-LLM (Tag-Enriched Multi-Attention with Large Language Models), a practical and effective framework that integrates Large Language Models (LLMs) for semantic tag generation and enrichment. Specifically, TEMA-LLM employs LLMs to assign domain-aware prompts and generate descriptive tags from item titles and descriptions. The resulting tag embeddings are fused with item identifiers as well as textual and visual features to construct enhanced item representations. A Tag-Enriched Multi-Attention mechanism is then introduced to jointly model user preferences within and across domains, enabling the system to capture complex and evolving consumer interests. Extensive experiments on four large-scale e-commerce datasets demonstrate that TEMA-LLM consistently outperforms state-of-the-art baselines, underscoring the benefits of LLM-based semantic tagging and multi-attention integration for consumer-facing recommendation systems. The proposed approach highlights the potential of LLMs to advance intelligent, user-centric services in the field of consumer electronics.

[168] Clear Roads, Clear Vision: Advancements in Multi-Weather Restoration for Smart Transportation

Vijay M. Galshetwar,Praful Hambarde,Prashant W. Patil,Akshay Dudhane,Sachin Chaudhary,Santosh Kumar Vipparathi,Subrahmanyam Murala

Main category: cs.CV

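TL;DR: Surveys restoration techniques for image and video degradation caused by adverse weather such as haze, rain, and snow, covering traditional prior-based methods and modern data-driven models (CNNs, transformers, diffusion models, and vision-language models), categorized into single-task, multi-task, and all-in-one frameworks. It also discusses day and night restoration challenges, datasets, and evaluation methods, and identifies limitations of current research and future directions such as compound-degradation restoration, real-time deployment, and agentic AI frameworks.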

Details

Motivation: Adverse weather severely affects intelligent transportation systems that rely on visual input, such as autonomous driving and traffic monitoring, so effective image and video restoration techniques are needed to make these systems more robust. Method: Systematically reviews existing image and video restoration techniques, dividing them into traditional prior-based methods and modern data-driven models, and further classifying them by scope into single-task, multi-task, and all-in-one frameworks; also analyzes datasets, evaluation metrics, and practical deployment challenges. Result: Lays out a taxonomy of current mainstream weather restoration methods, summarizes how they perform under different weather conditions, compares commonly used datasets and evaluation protocols, and points out the limitations of existing research in real-world use. Conclusion: The survey serves as a technical reference for all-weather vision systems in intelligent transportation and argues that future work should focus on compound-degradation modeling, real-time optimization, and adaptive restoration frameworks built on agentic AI. Abstract: Adverse weather conditions such as haze, rain, and snow significantly degrade the quality of images and videos, posing serious challenges to intelligent transportation systems (ITS) that rely on visual input. These degradations affect critical applications including autonomous driving, traffic monitoring, and surveillance. This survey presents a comprehensive review of image and video restoration techniques developed to mitigate weather-induced visual impairments. We categorize existing approaches into traditional prior-based methods and modern data-driven models, including CNNs, transformers, diffusion models, and emerging vision-language models (VLMs). Restoration strategies are further classified based on their scope: single-task models, multi-task/multi-weather systems, and all-in-one frameworks capable of handling diverse degradations. In addition, we discuss day and night time restoration challenges, benchmark datasets, and evaluation protocols. The survey concludes with an in-depth discussion on limitations in current research and outlines future directions such as mixed/compound-degradation restoration, real-time deployment, and agentic AI frameworks. This work aims to serve as a valuable reference for advancing weather-resilient vision systems in smart transportation environments. Lastly, to stay current with rapid advancements in this field, we will maintain regular updates of the latest relevant papers and their open-source implementations at https://github.com/ChaudharyUPES/A-comprehensive-review-on-Multi-weather-restoration

[169] Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras

Jindong Hong,Wencheng Zhang,Shiqin Qiao,Jianhai Chen,Jianing Qiu,Chuanyang Zheng,Qian Xu,Yun Ji,Qianyue Wen,Weiwei Sun,Hao Li,Huizhen Li,Huichao Wang,Kai Wu,Meng Li,Yijun He,Lingjie Luo,Jiankai Sun

Main category: cs.CV

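TL;DR: Proposes HMVDx, a framework for preliminary diagnosis of shoulder disorders from videos captured by consumer-grade devices, using multimodal large language models (MLLMs) for low-cost, scalable diagnostic assistance, and evaluates model effectiveness along the medical decision pathway with a newly proposed Usability Index; experiments show diagnostic accuracy improves by 79.6% over direct video diagnosis.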

Details

Motivation: Shoulder disorders are hard to diagnose early and accurately in regions with scarce medical resources, so low-cost, easily scalable diagnostic aids are urgently needed. Method: Proposes the Hybrid Motion Video Diagnosis (HMVDx) framework, which separates action understanding from disease diagnosis and assigns each task to its own multimodal large language model; introduces a new evaluation metric, the Usability Index, based on the logical flow of medical decision-making. Result: HMVDx improves accuracy in diagnosing shoulder joint injuries by 79.6% compared with direct video diagnosis, supporting the framework's effectiveness for medical video understanding. Conclusion: A low-barrier diagnostic framework built on consumer-grade video and multimodal large language models shows strong potential for medical applications and offers a feasible approach to shoulder disorder screening in resource-limited regions. Abstract: Shoulder disorders, such as frozen shoulder (a.k.a. adhesive capsulitis), are common conditions affecting the health of people worldwide, and have a high incidence rate among the elderly and workers engaged in repetitive shoulder tasks. In regions with scarce medical resources, achieving early and accurate diagnosis poses significant challenges, and there is an urgent need for low-cost and easily scalable auxiliary diagnostic solutions. This research introduces videos captured by consumer-grade devices as the basis for diagnosis, reducing the cost for users. We focus on the innovative application of Multimodal Large Language Models (MLLMs) in the preliminary diagnosis of shoulder disorders and propose a Hybrid Motion Video Diagnosis framework (HMVDx). This framework separates the two tasks of action understanding and disease diagnosis, which are completed by two MLLMs respectively. In addition to traditional evaluation indicators, this work proposes a novel metric called Usability Index, based on the logical process of medical decision-making (action recognition, movement diagnosis, and final diagnosis). This index evaluates the effectiveness of MLLMs in the medical field from the perspective of the entire medical diagnostic pathway, revealing the potential value of low-cost MLLMs in medical applications for medical practitioners. In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries has increased by 79.6% compared with direct video diagnosis, a significant technical contribution to future research on the application of MLLMs for video understanding in the medical field.

[170] Zero-shot image privacy classification with Vision-Language Models

Alina Elena Baia,Alessio Xompero,Andrea Cavallaro

Main category: cs.CV

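TL;DR: This paper evaluates the zero-shot performance of large vision-language models (VLMs) on image privacy classification and finds that, although VLMs are more robust to image perturbations, they still trail specialized smaller models in accuracy and efficiency.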

Details

Motivation: Recent work tends to adopt large general-purpose VLMs for image privacy prediction but lacks a systematic comparison with specialized models; this paper aims to fill that gap. Method: Builds a zero-shot benchmark for image privacy classification, evaluates the top three open-source VLMs, and uses task-aligned prompts to compare them with established vision-only and multi-modal methods on performance, efficiency, and robustness. Result: Despite their large parameter counts and slow inference, VLMs achieve lower privacy prediction accuracy than specialized small models, but they show higher robustness to image perturbations. Conclusion: Current VLMs have not yet surpassed specialized models on image privacy prediction; future work needs to improve efficiency and accuracy while keeping the robustness advantage. Abstract: While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top-3 open-source VLMs, according to a privacy benchmark, using task-aligned prompts and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.

[171] Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy

Patrick Wienholt,Sophie Caselitz,Robert Siepmann,Philipp Bruners,Keno Bressem,Christiane Kuhl,Jakob Nikolas Kather,Sven Nebelung,Daniel Truhn

Main category: cs.CV

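TL;DR: This study proposes using discrete semantic entropy (DSE) to detect and filter out high-entropy questions that are likely to cause hallucinations, thereby improving the accuracy of black-box vision-language models (VLMs) in radiologic image question answering.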

Details

Motivation: VLMs are prone to hallucination (inaccurate or inconsistent answers) in medical question answering, so a method is needed that detects and reduces hallucinations without access to model internals. Method: DSE quantifies the semantic inconsistency of answers: multiple answers are sampled at temperature 1.0, semantically equivalent answers are clustered via bidirectional entailment checks, and DSE is computed from the cluster frequencies. Questions with DSE above a set threshold (> 0.3 or > 0.6) are then excluded, and accuracy is evaluated on the remaining questions. Experiments use GPT-4o and GPT-4.1 on two public medical VQA datasets. Result: Baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After excluding questions with DSE > 0.3, accuracy rose to 76.3% for GPT-4o (334/706 questions retained) and 63.8% for GPT-4.1 (499/706 questions retained), both statistically significant (p < .001). The gains held on both datasets and largely remained significant after Bonferroni correction. Conclusion: DSE is an effective unsupervised, black-box method for identifying hallucination-prone questions; it significantly improves the answer accuracy of clinical VLM systems and has practical value. Abstract: To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image-based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. p-values and 95% confidence intervals were obtained using bootstrap resampling and a Bonferroni-corrected threshold of p < .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
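
The DSE computation described in this entry can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `entails` callable stands in for whatever entailment checker is used, the function names are hypothetical, and the natural logarithm without normalization is an assumption (the abstract does not state the base, and the 0.3 / 0.6 thresholds depend on it).

```python
import math
from typing import Callable, List


def cluster_by_entailment(
    answers: List[str], entails: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group answers whose meaning matches in both directions.

    `entails(a, b)` is a hypothetical checker; any NLI model or LLM judge
    could play this role.
    """
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            # Bidirectional check: each answer must entail the other.
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def discrete_semantic_entropy(clusters: List[List[str]]) -> float:
    """Shannon entropy of the relative cluster frequencies (natural log, assumed)."""
    total = sum(len(cluster) for cluster in clusters)
    probabilities = [len(cluster) / total for cluster in clusters]
    return sum(p * math.log(1.0 / p) for p in probabilities)


def toy_entails(a: str, b: str) -> bool:
    """Toy stand-in for an entailment model: case-insensitive string match."""
    return a.strip().lower() == b.strip().lower()


samples = ["Pneumonia", "pneumonia ", "Pneumothorax", "pneumonia"]
dse = discrete_semantic_entropy(cluster_by_entailment(samples, toy_entails))
```

A question would then be rejected when `dse` exceeds the chosen threshold; identical answers across all samples give an entropy of zero.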

[172] MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

Ming Dai,Sen Yang,Boqiang Duan,Wankou Yang,Jingdong Wang

Main category: cs.CV

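TL;DR: This paper proposes a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and Referring Video Object Segmentation (RefVOS). It introduces a [FIND] token for key-moment identification and designs Moment-Centric Sampling (MCS) and Bidirectional Anchor-updated Propagation (BAP) to improve segmentation performance and stability.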

Details

Motivation: Existing RefVOS methods built on large models sample frames using either handcrafted heuristics or external keyframe models; the former ignores temporal cues and the latter adds system complexity, so a more efficient, built-in key-moment grounding mechanism is needed. Method: Proposes a new TSG paradigm in which a dedicated [FIND] token identifies key moments through temporal token similarity matching; designs Moment-Centric Sampling (MCS), which samples important moments densely and non-essential frames sparsely; and introduces Bidirectional Anchor-updated Propagation (BAP), which initializes masks from the most relevant moment and updates them dynamically to reduce error accumulation. Result: The method achieves better RefVOS performance, preserves motion details and global context, and improves tracking stability, without external timestamp encodings or keyframe detection models. Conclusion: Through built-in key-moment grounding and efficient sampling and propagation strategies, the joint optimization framework markedly improves RefVOS accuracy and robustness without adding system complexity. Abstract: Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated [FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as the starting point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
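
The dense-versus-sparse idea behind MCS can be illustrated with a small frame-index sampler. This is only a sketch under assumptions: the function name, the stride parameters, and the fixed-stride scheme are hypothetical and are not taken from the MomentSeg code; the key moment is assumed to be already known as a frame interval.

```python
from typing import List


def moment_centric_sample(
    num_frames: int,
    moment_start: int,
    moment_end: int,
    dense_stride: int = 1,
    sparse_stride: int = 8,
) -> List[int]:
    """Return sorted frame indices: dense inside the key moment, sparse elsewhere."""
    if not 0 <= moment_start <= moment_end < num_frames:
        raise ValueError("moment must lie inside the video")
    # Sparse coverage of the whole video keeps global context.
    indices = set(range(0, num_frames, sparse_stride))
    # Dense coverage of the key moment keeps motion detail.
    indices.update(range(moment_start, moment_end + 1, dense_stride))
    return sorted(indices)


frames = moment_centric_sample(20, moment_start=8, moment_end=11)
```

With the defaults above, a 20-frame clip whose key moment spans frames 8 to 11 keeps every frame inside the moment and only every eighth frame outside it.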

[173] Spotlight on Token Perception for Multimodal Reinforcement Learning

Siyuan Huang,Xiaoye Qu,Yafu Li,Yun Luo,Zefeng He,Daizong Liu,Yu Cheng

Main category: cs.CV

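TL;DR: This paper offers a new token-perception perspective on multimodal reinforcement learning and designs the VPPO algorithm, which exploits token-level visual dependency to strengthen the reasoning of large vision-language models.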

Details

Motivation: Existing multimodal reasoning methods based on Reinforcement Learning with Verifiable Rewards (RLVR) overlook the key role of visual perception in the optimization process; this paper explores and quantifies how strongly generated tokens depend on the visual input in order to improve vision-language reasoning. Method: Proposes Visually-Perceptive Policy Optimization (VPPO), which analyzes the visual dependency of each token in the Chain-of-Thought (CoT) process (token perception) and applies a dual mechanism: it reweights the advantage by the trajectory's overall visual dependency and restricts policy updates to perceptually pivotal tokens. Result: On eight perception and reasoning benchmarks, VPPO achieves substantial gains over leading open-source RL-tuned models at both the 7B and 32B scales. Conclusion: VPPO establishes a token-level perceptual perspective for analyzing multimodal RLVR and provides an effective optimization strategy that significantly enhances the multimodal reasoning of large vision-language models. Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
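
The dual mechanism in this entry (trajectory-level advantage reweighting plus token-level masking) might look roughly like the sketch below. Everything here is an assumption for illustration: per-token perception scores are taken as given inputs, the trajectory's visual dependency is approximated by their mean, and pivotal tokens are chosen as a top fraction. The abstract does not specify how VPPO computes any of these, and the names are hypothetical.

```python
import math
from typing import List


def shaped_token_advantages(
    perception: List[float], advantage: float, top_fraction: float = 0.4
) -> List[float]:
    """Per-token advantages after trajectory reweighting and pivotal-token masking."""
    n = len(perception)
    if n == 0:
        raise ValueError("trajectory must contain at least one token")
    # Trajectory-level weight: mean visual dependency (an assumed choice).
    trajectory_dependency = sum(perception) / n
    shaped = advantage * trajectory_dependency
    # Token-level mask: keep only the most perception-heavy tokens.
    # Ties at the threshold may keep slightly more than k tokens.
    k = max(1, math.ceil(top_fraction * n))
    threshold = sorted(perception, reverse=True)[k - 1]
    return [shaped if score >= threshold else 0.0 for score in perception]


weights = shaped_token_advantages([0.1, 0.9, 0.2, 0.8], advantage=2.0, top_fraction=0.5)
```

In a policy-gradient loss, each token's log-probability term would be multiplied by the corresponding entry, so tokens with zero weight contribute no gradient.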

[174] Foraging with the Eyes: Dynamics in Human Visual Gaze and Deep Predictive Modeling

Tejaswi V. Panchagnula

Main category: cs.CV

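TL;DR: When humans visually scan images, their gaze trajectories follow a Lévy-walk pattern similar to animal foraging, suggesting that visual information acquisition is optimally efficient; fixation hotspots can also be predicted from image structure with a CNN.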

Details

Motivation: To explore the spatiotemporal statistics of human eye movements during visual search and their similarity to animal foraging behavior. Method: Runs a large-scale experiment (40 participants viewing 50 images), records more than 4 million gaze points with a high-speed eye tracker, and analyzes the statistics of the gaze trajectories; trains a CNN to predict fixation heatmaps from image input alone. Result: Human gaze trajectories exhibit Lévy-walk characteristics, and the CNN accurately predicts salient fixation regions on novel images. Conclusion: Human visual exploration follows statistical laws analogous to natural foraging, with efficient information acquisition, and part of gaze behavior can be learned and predicted from image structure. Abstract: Animals often forage via Lévy walks: stochastic trajectories with heavy-tailed step lengths optimized for sparse-resource environments. We show that human visual gaze follows similar dynamics when scanning images. While traditional models emphasize image-based saliency, the underlying spatiotemporal statistics of eye movements remain underexplored. Understanding these dynamics has broad applications in attention modeling and vision-based interfaces. In this study, we conducted a large-scale human subject experiment involving 40 participants viewing 50 diverse images under unconstrained conditions, recording over 4 million gaze points using a high-speed eye tracker. Analysis of these data shows that the gaze trajectory of the human eye also follows a Lévy walk akin to animal foraging. This suggests that the human eye forages for visual information in an optimally efficient manner. Further, we trained a convolutional neural network (CNN) to predict fixation heatmaps from image input alone. The model accurately reproduced salient fixation regions across novel images, demonstrating that key components of gaze behavior are learnable from visual structure alone. Our findings present new evidence that human visual exploration obeys statistical laws analogous to natural foraging and open avenues for modeling gaze through generative and predictive frameworks.
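
For readers unfamiliar with Lévy walks, the sketch below simulates one in 2D by drawing step lengths from a power law p(l) ∝ l^(-mu) for l ≥ l_min via inverse-transform sampling. It is a generic illustration, not the paper's analysis pipeline; the exponent range, parameter names, and uniform turning angles are assumptions.

```python
import math
import random
from typing import List, Tuple


def levy_walk(
    n_steps: int, mu: float = 2.0, l_min: float = 1.0, seed: int = 0
) -> Tuple[List[Tuple[float, float]], List[float]]:
    """Simulate a 2D walk with heavy-tailed (power-law) step lengths."""
    if not 1.0 < mu <= 3.0:
        raise ValueError("Levy-walk exponents are usually taken in (1, 3]")
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    path = [(x, y)]
    lengths = []
    for _ in range(n_steps):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
        # Inverse CDF of the power law: rare draws near 0 give very long steps.
        step = l_min * u ** (-1.0 / (mu - 1.0))
        theta = rng.uniform(0.0, 2.0 * math.pi)  # isotropic direction
        x += step * math.cos(theta)
        y += step * math.sin(theta)
        path.append((x, y))
        lengths.append(step)
    return path, lengths


path, lengths = levy_walk(100, mu=2.0, seed=1)
```

Most steps stay close to `l_min`, with occasional long jumps; this mix of local clusters and rare relocations is the signature that analyses of gaze or foraging data look for.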

[175] CapGeo: A Caption-Assisted Approach to Geometric Reasoning

Yuying Li,Siyi Qian,Hao Liang,Leqi Zheng,Ruichuan An,Yongzhen Guo,Wentao Zhang

Main category: cs.CV

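TL;DR: This paper proposes CapGeo, a framework that uses image captions to assist reasoning and improve multimodal large language models on geometry problems. It also builds CapGeo-Bench, a dataset of 4,641 figure-caption pairs with a keypoint-based scoring metric, to systematically evaluate geometric caption quality.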

Details

Motivation: State-of-the-art multimodal large models perform well on textual reasoning yet still struggle with geometric reasoning, which suggests the bottleneck is understanding geometric figures rather than reasoning itself; converting visual content into high-quality text descriptions can compensate for this weakness. Method: Proposes the CapGeo framework, which generates captions for geometric figures to assist model reasoning, and builds the CapGeo-Bench dataset with a keypoint-based metric to systematically evaluate geometric caption quality. Result: Adding captions yields large gains: Qwen2.5-VL-72B improves from 8.6% to 59.0% and Claude-Opus-4 from 44.8% to 73.0%; the proposed metric correlates strongly with downstream task performance. Conclusion: High-quality captions of geometric images effectively improve the geometric reasoning of multimodal large models, and CapGeo and CapGeo-Bench provide useful tools and evaluation standards for this direction. Abstract: Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.

[176] RadioFlow: Efficient Radio Map Construction Framework with Flow Matching

Haozhe Jia,Wenshuo Chen,Xiucheng Wang,Nan Cheng,Hongbo Zhang,Kuimou Yu,Songning Lai,Nanjian Jia,Bowen Tian,Hongru Xiao,Yutao Yue

Main category: cs.CV

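TL;DR: Proposes RadioFlow, a new flow-matching-based generative framework for efficient, high-fidelity radio map generation, with fewer parameters and faster inference than diffusion models.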

Details

Motivation: Existing diffusion-based radio map generation methods suffer from large models, slow iterative denoising, and high inference latency, which limit practical deployment. Method: Uses flow matching to learn continuous transport trajectories from noise to data, enabling efficient single-step sampling and faster training and inference. Result: Experiments show that RadioFlow reaches state-of-the-art performance with up to 8× fewer parameters and more than 4× faster inference than the leading diffusion baseline (RadioDiff). Conclusion: RadioFlow offers a viable technical path toward scalable, energy-efficient, real-time electromagnetic digital twins for future 6G networks. Abstract: Accurate and real-time radio map (RM) generation is crucial for next-generation wireless systems, yet diffusion-based approaches often suffer from large model sizes, slow iterative denoising, and high inference latency, which hinder practical deployment. To overcome these limitations, we propose RadioFlow, a novel flow-matching-based generative framework that achieves high-fidelity RM generation through single-step efficient sampling. Unlike conventional diffusion models, RadioFlow learns continuous transport trajectories between noise and data, enabling both training and inference to be significantly accelerated while preserving reconstruction accuracy. Comprehensive experiments demonstrate that RadioFlow achieves state-of-the-art performance with up to 8× fewer parameters and over 4× faster inference compared to the leading diffusion-based baseline (RadioDiff). This advancement provides a promising pathway toward scalable, energy-efficient, and real-time electromagnetic digital twins for future 6G networks. We release the code at https://github.com/Hxxxz0/RadioFlow.
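
To make "continuous transport trajectories between noise and data" and "single-step sampling" concrete, here is a generic flow-matching sketch. It assumes the common straight-line path x_t = (1 - t)·noise + t·data, whose target velocity is data - noise; whether RadioFlow uses exactly this path is not stated in the abstract, and the function names are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def flow_matching_pair(noise: Vector, data: Vector, t: float) -> Tuple[Vector, Vector]:
    """Return the point x_t on the straight path and its regression target."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    # On a straight path the velocity is constant along the trajectory.
    target_velocity = [d - n for n, d in zip(noise, data)]
    return x_t, target_velocity


def one_step_sample(noise: Vector, velocity_fn: Callable[[Vector, float], Vector]) -> Vector:
    """Single Euler step from t=0 to t=1 using a (learned) velocity field."""
    velocity = velocity_fn(noise, 0.0)
    return [n + v for n, v in zip(noise, velocity)]


x_t, v_target = flow_matching_pair([0.0, 0.0], [2.0, 4.0], t=0.25)
# An oracle velocity field stands in for a trained network here.
sample = one_step_sample([0.0, 0.0], lambda x, t: [2.0, 4.0])
```

A network trained to regress `v_target` from `(x_t, t)` can then generate a sample in one step when the learned trajectories are close to straight, which is where the speed advantage over iterative denoising comes from.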

[177] Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation

Wenyao Zhang,Hongsi Liu,Bohan Li,Jiawei He,Zekun Qi,Yunnan Wang,Shengyang Zhao,Xinqiang Yu,Wenjun Zeng,Xin Jin

Main category: cs.CV

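TL;DR: Proposes Hybrid-depth, a framework that combines foundation models such as CLIP and DINO and, through language-guided multi-grained feature fusion with coarse-to-fine learning, significantly outperforms existing methods in self-supervised monocular depth estimation.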

Details

Motivation: Existing self-supervised monocular depth estimation methods are limited by insufficient extraction of semantic-spatial knowledge and need more effective modeling of contextual information. Method: 1) Extracts global semantic features with CLIP and local spatial features with DINO, and aligns depth-aware features through a contrastive, language-guided proxy task that compares close and distant image patches; 2) combines camera pose and pixel-wise language alignment to progressively refine depth predictions, and can be integrated into existing MDE pipelines as a plug-and-play module. Result: Outperforms SOTA methods across all metrics on the KITTI benchmark and benefits downstream tasks such as BEV perception. Conclusion: Hybrid-depth uses language-guided multimodal feature fusion to mitigate feature granularity mismatches and significantly improves self-supervised monocular depth estimation. Abstract: Current self-supervised monocular depth estimation (MDE) approaches encounter performance limitations due to insufficient semantic-spatial knowledge extraction. To address this challenge, we propose Hybrid-depth, a novel framework that systematically integrates foundation models (e.g., CLIP and DINO) to extract visual priors and acquire sufficient contextual information for MDE. Our approach introduces a coarse-to-fine progressive learning framework: 1) First, we aggregate multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance. A proxy task comparing close-distant image patches is designed to enforce depth-aware feature alignment using text prompts; 2) Next, building on the coarse features, we integrate camera pose information and pixel-wise language alignment to refine depth predictions. This module seamlessly integrates with existing self-supervised MDE pipelines (e.g., Monodepth2, ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth estimation. By aggregating CLIP's semantic context and DINO's spatial details through language guidance, our method effectively addresses feature granularity mismatches. Extensive experiments on the KITTI benchmark demonstrate that our method significantly outperforms SOTA methods across all metrics and also benefits downstream tasks like BEV perception. Code is available at https://github.com/Zhangwenyao1/Hybrid-depth.

[178] Instance-Aware Robust Consistency Regularization for Semi-Supervised Nuclei Instance Segmentation

Zenan Lin,Wei Li,Jintao Chen,Zihao Wu,Wenxiong Kang,Changxin Gao,Liansheng Wang,Jin-Gang Yu

Main category: cs.CV

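TL;DR: Proposes IRCR-Net, a semi-supervised method for nuclei instance segmentation in pathological images. By introducing instance-aware consistency regularization and morphological prior knowledge, it improves segmentation of dense and overlapping nuclei and reduces pseudo-label noise.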

Details

Motivation: Fully supervised methods depend on large amounts of annotated data, which is costly and scarce; existing semi-supervised methods regularize consistency poorly at the instance level, make little use of prior knowledge about pathological structures, and tend to introduce noisy pseudo-labels. Method: Proposes IRCR-Net, which includes Matching-Driven Instance-Aware Consistency (MIAC) and Prior-Driven Instance-Aware Consistency (PIAC) mechanisms and uses morphological priors to assess the quality of pseudo-labels generated from unlabeled data, keeping only high-quality pseudo-labels for training. Result: Significantly outperforms existing semi-supervised methods on multiple public datasets and even surpasses fully supervised methods in some scenarios. Conclusion: By introducing instance-level consistency and prior knowledge, IRCR-Net improves semi-supervised nuclei instance segmentation and shows strong robustness and application potential. Abstract: Nuclei instance segmentation in pathological images is crucial for downstream tasks such as tumor microenvironment analysis. However, the high cost and scarcity of annotated data limit the applicability of fully supervised methods, while existing semi-supervised methods fail to adequately regularize consistency at the instance level, fail to leverage the inherent prior knowledge of pathological structures, and are prone to introducing noisy pseudo-labels during training. In this paper, we propose an Instance-Aware Robust Consistency Regularization Network (IRCR-Net) for accurate instance-level nuclei segmentation. Specifically, we introduce the Matching-Driven Instance-Aware Consistency (MIAC) and Prior-Driven Instance-Aware Consistency (PIAC) mechanisms to refine the nuclei instance segmentation results of the teacher and student subnetworks, particularly for densely distributed and overlapping nuclei. We incorporate morphological prior knowledge of nuclei in pathological images and utilize these priors to assess the quality of pseudo-labels generated from unlabeled data. Low-quality pseudo-labels are discarded, while high-quality predictions are enhanced to reduce pseudo-label noise and benefit the network's robust training. Experimental results demonstrate that the proposed method significantly enhances semi-supervised nuclei instance segmentation performance across multiple public datasets compared to existing approaches, even surpassing fully supervised methods in some scenarios.

[179] Enhancing Infrared Vision: Progressive Prompt Fusion Network and Benchmark

Jinyuan Liu,Zihang Chen,Zhu Liu,Zhiying Jiang,Long Ma,Xin Fan,Risheng Liu

Main category: cs.CV

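TL;DR: This paper proposes the Progressive Prompt Fusion Network (PPFN) for thermal infrared image enhancement. It designs prompt pairs based on the imaging mechanism and introduces a selective progressive training strategy, handling both single and composite degradations and markedly improving enhancement across scenarios.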

Details

Motivation: Existing infrared image enhancement methods mostly target a single degradation factor and struggle with complex coupled degradations, while general all-in-one enhancement methods have limited effect on infrared images; an efficient all-in-one method designed around thermal infrared imaging characteristics is therefore needed. Method: Proposes PPFN, which builds prompt pairs from the thermal imaging process and fuses the corresponding prompts to modulate model features; introduces a Selective Progressive Training (SPT) mechanism that gradually improves the model's handling of composite degradations. Result: Experiments on a self-built, high-quality, multi-scenario infrared dataset show strong visual results and performance under both specific and complex degradations, with an 8.76% improvement over existing methods on complex scenes. Conclusion: By combining physics-inspired prompt fusion with progressive training, PPFN effectively handles coupled degradations in thermal infrared images and advances infrared image enhancement. Abstract: We address the relatively underexplored task of thermal infrared image enhancement. Existing infrared image enhancement methods primarily focus on tackling individual degradations, such as noise, contrast, and blurring, making it difficult to handle coupled degradations. Meanwhile, all-in-one enhancement methods, commonly applied to RGB sensors, often demonstrate limited effectiveness due to the significant differences in imaging models. In light of this, we first revisit the imaging mechanism and introduce a Progressive Prompt Fusion Network (PPFN). Specifically, the PPFN initially establishes prompt pairs based on the thermal imaging process. For each type of degradation, we fuse the corresponding prompt pairs to modulate the model's features, providing adaptive guidance that enables the model to better address specific degradations under single or multiple conditions. In addition, a Selective Progressive Training (SPT) mechanism is introduced to gradually refine the model's handling of composite cases to align the enhancement process, which not only allows the model to remove camera noise and retain key structural details, but also enhances the overall contrast of the thermal image. Furthermore, we introduce a high-quality, multi-scenario infrared benchmark covering a wide range of scenes. Extensive experiments substantiate that our approach not only delivers promising visual results under specific degradation but also significantly improves performance on complex degradation scenes, achieving a notable 8.76% improvement. Code is available at https://github.com/Zihang-Chen/HM-TIR.

Qihang Ma,Shengyu Li,Jie Tang,Dingkang Yang,Shaodong Chen,Yingyi Zhang,Chao Feng,Jiao Ran

Main category: cs.CV

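TL;DR: This paper proposes using vision-language models (VLMs) for multi-modal keyphrase prediction (MMKP) and improves their reasoning through zero-shot evaluation, supervised fine-tuning, and a dynamic CoT strategy; the approach is validated on multiple datasets.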

Details

Motivation: Traditional multi-modal methods are limited in absence and unseen scenarios, and existing benchmarks overestimate model performance because training and test sets overlap. Method: Uses zero-shot inference and supervised fine-tuning to assess the lower-bound performance of VLMs; adopts Fine-tune-CoT, which uses high-quality chain-of-thought data generated by a teacher model to strengthen smaller models; and proposes a dynamic CoT strategy that adaptively injects CoT data to mitigate "overthinking". Result: Experiments on multiple datasets show that the proposed methods significantly improve multi-modal keyphrase prediction, especially in complex reasoning and generalization. Conclusion: Combining VLMs with a dynamic CoT strategy effectively improves MMKP and offers a more reliable and scalable solution for multi-modal reasoning. Abstract: Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been proven to have significant limitations in handling challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap between training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we use two widely used strategies, i.e., zero-shot and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to finetune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy which adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.

[181] BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception

Junyan Ye,Dongzhi Jiang,Jun He,Baichuan Zhou,Zilong Huang,Zhiyuan Yan,Hongsheng Li,Conghui He,Weijia Li

Main category: cs.CV

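TL;DR: This paper introduces BLINK-Twice, a vision-centric reasoning benchmark focused on fine-grained observation and analytical reasoning grounded in image content rather than language or external knowledge. The benchmark comprises seven types of visual challenges, natural adversarial image pairs, and annotated reasoning chains. An evaluation of 20 leading multimodal large models exposes the weaknesses of current models in visual reasoning and indicates that repeated observation and active visual interaction are key to better performance.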

Details

Motivation: Existing reasoning evaluations for MLLMs focus mainly on language and overlook the central role of visual input. The authors aim to build a truly vision-centric reasoning benchmark that pushes models from shallow perception ("see") toward deeper observation and analysis ("observe"). Method: Designs the BLINK-Twice benchmark with three components: seven types of visual reasoning challenges, natural adversarial image pairs that force reliance on visual content, and annotated reasoning chains for fine-grained evaluation. Tests 20 leading MLLMs, analyzes the effect of different reasoning strategies (such as chain-of-thought and self-criticism), and studies how repeated observation and active interaction affect performance. Result: Existing MLLMs perform poorly on BLINK-Twice; language-space reasoning strategies help but lead to unstable and redundant reasoning; repeated image observation improves performance across models; and models capable of active visual interaction (such as o3) perform better, pointing to the need for a new reasoning paradigm. Conclusion: BLINK-Twice exposes the limits of current multimodal large models in visually grounded reasoning and argues for a shift from language-centric to vision-centric reasoning; future work should emphasize fine-grained observation and active visual interaction mechanisms. Abstract: Recently, Multimodal Large Language Models (MLLMs) have made rapid progress, particularly in enhancing their reasoning capabilities. However, existing reasoning benchmarks still primarily assess language-based reasoning, often treating visual input as replaceable context. To address this gap, we introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone, shifting the focus from language-based to image-grounded reasoning. Compared to prior perception benchmarks, it moves beyond shallow perception ("see") and requires fine-grained observation and analytical reasoning ("observe"). BLINK-Twice integrates three core components: seven types of visual challenges for testing visual reasoning, natural adversarial image pairs that enforce reliance on visual content, and annotated reasoning chains for fine-grained evaluation of the reasoning process rather than final answers alone. We evaluate 20 leading MLLMs, including 12 foundation models and 8 reasoning-enhanced models. BLINK-Twice poses a significant challenge to current models. While existing reasoning strategies in the language space, such as chain-of-thought or self-criticism, can improve performance, they often result in unstable and redundant reasoning. We observe that repeated image observation improves performance across models, and active visual interaction, as demonstrated by models like o3, highlights the need for a new paradigm for vision reasoning. The dataset is publicly available at https://github.com/PicoTrex/BLINK-Twice

[182] Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes

Yikang Zhang,Rui Fan

Main category: cs.CV

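TL;DR: Proposes VAD-GS, a 3D Gaussian splatting framework for complex urban scenes that recovers missing geometry through voxel-based visibility reasoning, diversity-aware view selection, and patch-matching-based multi-view stereo reconstruction.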

Details

Motivation: 3D Gaussian splatting (3DGS) depends on a high-quality initial point cloud. In unbounded, dynamic urban environments, non-overlapping observation frustums often leave point coverage incomplete, which causes rendering distortions and artifacts, and existing densification strategies cannot reconstruct the missing structures. Method: Proposes the VAD-GS framework: 1) voxel-based visibility reasoning identifies unreliable geometry; 2) diversity-aware view selection picks informative supporting views; 3) patch-matching-based multi-view stereo reconstruction recovers missing structures and generates new Gaussian primitives. Result: Experiments on the Waymo and nuScenes datasets show that VAD-GS outperforms existing 3DGS methods and significantly improves geometry reconstruction quality for both static and dynamic objects. Conclusion: By introducing reliable geometric priors and an active completion mechanism, VAD-GS addresses missing geometry under partially initialized point clouds and improves the robustness and reconstruction accuracy of 3DGS in complex urban scenes. Abstract: 3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones and are incapable of reconstructing missing structures. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometry structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via patch matching-based multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects. Source code will be released upon publication.

[183] Minkowski-MambaNet: A Point Cloud Framework with Selective State Space Models for Forest Biomass Quantification

Jinxiang Tu,Dayong Ren,Fei Shi,Zhenhong Jia,Yahong Ren,Jiwei Qin,Fang He

Main category: cs.CV

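TL;DR: Proposes Minkowski-MambaNet, a new deep learning framework that estimates forest biomass directly from raw LiDAR point clouds and significantly outperforms existing methods.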

Details

Motivation: Accurate forest biomass quantification is vital for carbon cycle monitoring, but estimating woody volume and aboveground biomass (AGB) directly from LiDAR point clouds is difficult because long-range dependencies are hard to model. Method: Integrates the Mamba model's Selective State Space Model (SSM) into a Minkowski network and adds skip connections to enhance features and accelerate convergence, allowing effective encoding of global context and long-range dependencies. Result: Evaluation on Danish National Forest Inventory LiDAR data shows that the method significantly outperforms current state-of-the-art methods, requires no Digital Terrain Model (DTM), and is robust to boundary artifacts. Conclusion: Minkowski-MambaNet provides a powerful tool for large-scale forest biomass analysis and advances LiDAR-based forest inventories. Abstract: Accurate forest biomass quantification is vital for carbon cycle monitoring. While airborne LiDAR excels at capturing 3D forest structure, directly estimating woody volume and Aboveground Biomass (AGB) from point clouds is challenging due to difficulties in modeling long-range dependencies needed to distinguish trees. We propose Minkowski-MambaNet, a novel deep learning framework that directly estimates volume and AGB from raw LiDAR. Its key innovation is integrating the Mamba model's Selective State Space Model (SSM) into a Minkowski network, enabling effective encoding of global context and long-range dependencies for improved tree differentiation. Skip connections are incorporated to enhance features and accelerate convergence. Evaluated on Danish National Forest Inventory LiDAR data, Minkowski-MambaNet significantly outperforms state-of-the-art methods, providing more accurate and robust estimates. Crucially, it requires no Digital Terrain Model (DTM) and is robust to boundary artifacts. This work offers a powerful tool for large-scale forest biomass analysis, advancing LiDAR-based forest inventories.

[184] Utilizing dynamic sparsity on pretrained DETR

Reza Sedghi,Anand Subramoney,David Kappel

Main category: cs.CV

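TL;DR: Proposes Micro-Gated Sparsification (MGS), a lightweight gating mechanism that improves the inference efficiency of DETR without retraining the base model, reaching 85% to 95% activation sparsity while maintaining or even improving performance.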

Details

Motivation: Efficient inference with Transformer models remains challenging in vision tasks, especially object detection. Because MLP layers are inherently sparse, a key question is how to exploit that sparsity to speed up inference without retraining. Method: Analyzes the inherent sparsity of DETR's MLP layers and proposes two methods that avoid retraining the model: Static Indicator-Based Sparsification (SIBS) and Micro-Gated Sparsification (MGS). SIBS predicts neuron inactivity from fixed activation patterns; MGS predicts dynamic sparsity with a small linear layer added on top of the pretrained DETR. Result: Experiments on the COCO dataset show that MGS reaches 85% to 95% activation sparsity and maintains or slightly improves model performance while significantly reducing computation. Conclusion: MGS is a practical, input-adaptive sparsification method that enables efficient deployment of pretrained vision Transformers without full retraining. Abstract: Efficient inference with transformer-based models remains a challenge, especially in vision tasks like object detection. We analyze the inherent sparsity in the MLP layers of DETR and introduce two methods to exploit it without retraining. First, we propose Static Indicator-Based Sparsification (SIBS), a heuristic method that predicts neuron inactivity based on fixed activation patterns. While simple, SIBS offers limited gains due to the input-dependent nature of sparsity. To address this, we introduce Micro-Gated Sparsification (MGS), a lightweight gating mechanism trained on top of a pretrained DETR. MGS predicts dynamic sparsity using a small linear layer and achieves 85 to 95% activation sparsity. Experiments on the COCO dataset show that MGS maintains or even improves performance while significantly reducing computation. Our method offers a practical, input-adaptive approach to sparsification, enabling efficient deployment of pretrained vision transformers without full model retraining.
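
One way to picture input-dependent gating of an MLP layer is the sketch below, where a small linear gate decides per block of hidden neurons whether to compute them at all. This is an illustration under assumptions, not the MGS implementation: gating whole blocks, thresholding the gate logit at zero, and the function name are all hypothetical choices, and plain Python lists stand in for tensors.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def gated_mlp_hidden(
    x: List[float],
    w1: Matrix,
    b1: List[float],
    w_gate: Matrix,
    b_gate: List[float],
) -> Tuple[List[float], int]:
    """Compute ReLU(w1 @ x + b1), skipping neuron blocks the gate marks inactive.

    `w_gate` has one row per block, so the gate is much smaller than `w1`.
    Returns the hidden activations and the number of neurons actually computed.
    """
    num_blocks = len(w_gate)
    if num_blocks == 0 or len(w1) % num_blocks != 0:
        raise ValueError("hidden size must be a multiple of the number of blocks")
    block_size = len(w1) // num_blocks
    hidden = [0.0] * len(w1)
    computed = 0
    for block, (gate_row, gate_bias) in enumerate(zip(w_gate, b_gate)):
        gate_logit = sum(g * xi for g, xi in zip(gate_row, x)) + gate_bias
        if gate_logit <= 0.0:
            continue  # predicted inactive: leave zeros, skip the dot products
        for j in range(block * block_size, (block + 1) * block_size):
            pre_activation = sum(w * xi for w, xi in zip(w1[j], x)) + b1[j]
            hidden[j] = max(0.0, pre_activation)
            computed += 1
    return hidden, computed


hidden, computed = gated_mlp_hidden(
    x=[1.0, -1.0],
    w1=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],
    b1=[0.0, 0.0, 0.0, 0.0],
    w_gate=[[1.0, 0.0], [0.0, 1.0]],
    b_gate=[0.0, 0.0],
)
```

In this toy input the second block is skipped entirely, so only half of the hidden neurons are evaluated; the saving depends on how cheap the gate is relative to the skipped work and on how well it predicts true inactivity.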

[185] Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians

Jin-Chuan Shi,Chengye Su,Jiajun Wang,Ariel Shamir,Miao Wang

Main category: cs.CV

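TL;DR: Proposes Mono4DEditor, a new framework for text-driven 4D scene editing from monocular video that combines 3D Gaussians with CLIP features for semantically precise edits.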

Details

Motivation: Achieving semantically precise edits in local regions of complex dynamic scenes while preserving unedited content is challenging. Method: Augments the 3D Gaussian representation with quantized CLIP features, proposes a two-stage point-level localization strategy, and performs edits with a diffusion-based video editing model, using flow and scribble guidance to keep results spatially and temporally consistent. Result: Experiments show high-quality, flexible, high-fidelity text-driven edits across diverse scenes and objects, surpassing prior methods. Conclusion: Mono4DEditor enables precise text-driven editing of 4D scenes reconstructed from monocular video and is superior in both visual quality and editing flexibility. Abstract: Editing 4D scenes reconstructed from monocular videos based on text prompts is a valuable yet challenging task with broad applications in content creation and virtual environments. The key difficulty lies in achieving semantically precise edits in localized regions of complex, dynamic scenes, while preserving the integrity of unedited content. To address this, we introduce Mono4DEditor, a novel framework for flexible and accurate text-driven 4D scene editing. Our method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation, enabling efficient semantic querying of arbitrary spatial regions. We further propose a two-stage point-level localization strategy that first selects candidate Gaussians via CLIP similarity and then refines their spatial extent to improve accuracy. Finally, targeted edits are performed on localized regions using a diffusion-based video editing model, with flow and scribble guidance ensuring spatial fidelity and temporal coherence. Extensive experiments demonstrate that Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types, while preserving the appearance and geometry of unedited areas and surpassing prior approaches in both flexibility and visual fidelity.

[186] Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement

Ruirui Lin,Guoxi Huang,Nantheera Anantrasirichai

Main category: cs.CV

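TL;DR: This paper proposes DWTA-Net, a two-stage framework for real-world low-light video enhancement that jointly exploits short- and long-term temporal information to suppress noise and improve visual quality.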

Details

Motivation: Existing learning-based methods struggle to exploit temporal information in real low-light scenes and therefore handle heavy noise poorly. Method: DWTA-Net has two stages. Stage I uses Visual State-Space blocks for multi-frame alignment, recovering brightness, color, and structure. Stage II introduces a recurrent refinement module with optical-flow-guided dynamic weight-based temporal aggregation, and a texture-adaptive loss preserves detail while smoothing flat regions. Result: Experiments on real low-light videos show strong suppression of noise and artifacts, with visual quality better than state-of-the-art methods. Conclusion: By effectively fusing short- and long-range temporal information, DWTA-Net achieves superior performance on real low-light video enhancement. Abstract: Low-light video enhancement (LLVE) is challenging due to noise, low contrast, and color degradations. Learning-based approaches offer fast inference but still struggle with heavy noise in real low-light scenes, primarily due to limitations in effectively leveraging temporal information. In this paper, we address this issue with DWTA-Net, a novel two-stage framework that jointly exploits short- and long-term temporal cues. Stage I employs Visual State-Space blocks for multi-frame alignment, recovering brightness, color, and structure with local consistency. Stage II introduces a recurrent refinement module with dynamic weight-based temporal aggregation guided by optical flow, adaptively balancing static and dynamic regions. A texture-adaptive loss further preserves fine details while promoting smoothness in flat areas. Experiments on real-world low-light videos show that DWTA-Net effectively suppresses noise and artifacts, delivering superior visual quality compared with state-of-the-art methods.

[187] SilvaScenes: Tree Segmentation and Species Classification from Under-Canopy Images in Natural Forests

David-Alexandre Duclos,William Guimont-Martin,Gabriel Jeanson,Arthur Larochelle-Tremblay,Théo Defosse,Frédéric Moore,Philippe Nolet,François Pomerleau,Philippe Giguère

Main category: cs.CV

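TL;DR: This paper presents SilvaScenes, a new dataset for instance segmentation of tree species in under-canopy images, covering 1476 trees from 24 species across five bioclimatic domains in Quebec, Canada, with annotations from forestry experts. Benchmarks show that tree instance segmentation performs reasonably well (mAP up to 67.65%), whereas fine-grained species classification remains challenging (mAP of only 35.69%).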

Details

Motivation: Existing datasets mostly focus on urban environments or a handful of species, so they cannot support the development of fine-grained perception systems (such as individual-tree detection and species classification) in complex natural forests. This holds back precision forestry, biodiversity monitoring, and the automation of forestry equipment. Method: The authors collect under-canopy images across five bioclimatic domains, build the SilvaScenes dataset of 1476 trees from 24 species with expert instance-segmentation annotations, and benchmark modern deep learning models on instance segmentation and species classification. Result: Instance segmentation performs well, with a top mAP of 67.65%, but fine-grained species classification remains difficult, with an mAP of only 35.69%, confirming that the dataset is challenging. Conclusion: SilvaScenes provides a high-quality, challenging benchmark for tree species perception in forest environments and advances perception for forestry robotics in complex natural scenes. Abstract: Interest in robotics for forest management is growing, but perception in complex, natural environments remains a significant hurdle. Conditions such as heavy occlusion, variable lighting, and dense vegetation pose challenges to automated systems, which are essential for precision forestry, biodiversity monitoring, and the automation of forestry equipment. These tasks rely on advanced perceptual capabilities, such as detection and fine-grained species classification of individual trees. Yet, existing datasets are inadequate to develop such perception systems, as they often focus on urban settings or a limited number of species. To address this, we present SilvaScenes, a new dataset for instance segmentation of tree species from under-canopy images. Collected across five bioclimatic domains in Quebec, Canada, SilvaScenes features 1476 trees from 24 species with annotations from forestry experts. We demonstrate the relevance and challenging nature of our dataset by benchmarking modern deep learning approaches for instance segmentation. Our results show that, while tree segmentation is easy, with a top mean average precision (mAP) of 67.65%, species classification remains a significant challenge with an mAP of only 35.69%. Our dataset and source code will be available at https://github.com/norlab-ulaval/SilvaScenes.

[188] D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models

Jisu Han,Wonjun Hwang

Main category: cs.CV

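TL;DR: Proposes a test-time prompt tuning method based on dimensional entropy maximization to mitigate the modality gap and calibration degradation caused by a single dominant feature dimension in vision-language models.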

Details

Motivation: In test-time adaptation, vision-language models (VLMs) exhibit a modality gap caused by a single dominant feature dimension, which degrades calibration and undermines model reliability. Method: By analyzing the modality gap in contrastive VLMs, the authors propose dimensional entropy maximization, which regularizes the distribution of textual features toward uniformity and reduces dependence on the dominant dimension. Result: The method effectively alleviates calibration degradation in test-time prompt tuning and improves the reliability of VLMs across deployment scenarios. Conclusion: Dimensional entropy maximization is a simple yet effective strategy that improves the adaptability and predictive calibration of VLMs in real-world settings. Abstract: The test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate the dependency on dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
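One way to make "regularizing textual features toward uniformity across dimensions" concrete is the sketch below. The abstract does not give the exact formulation, so the definition used here is an assumption: each dimension's share of total absolute activation is treated as a probability, and the entropy of that distribution is the quantity to maximize.

```python
import math

def dimensional_entropy(text_features):
    """Entropy of the per-dimension share of total absolute activation.

    text_features: list of feature vectors (one per class prompt).
    This definition is an illustrative assumption, not the paper's formula.
    """
    dims = len(text_features[0])
    mass = [sum(abs(f[d]) for f in text_features) for d in range(dims)]
    total = sum(mass)
    probs = [m / total for m in mass]
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = dimensional_entropy([[1.0, 1.0, 1.0, 1.0]])
dominated = dimensional_entropy([[10.0, 0.1, 0.1, 0.1]])
# A regularizer would add -entropy to the tuning loss, pushing toward `uniform`.
```

Features dominated by one dimension score a much lower entropy than evenly spread features, so maximizing this term discourages reliance on the dominant dimension.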

[189] Few-shot multi-token DreamBooth with LoRa for style-consistent character generation

Ruben Pascual,Mikel Sesma-Sara,Aranzazu Jurio,Daniel Paternain,Mikel Galar

Main category: cs.CV

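TL;DR: This paper proposes a multi-token strategy based on DreamBooth and LoRA that generates a virtually unlimited variety of new characters in a consistent artistic style from a small set of reference characters, for use in animation, gaming, and related fields.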

Details

Motivation: In the audiovisual industry, AI is being used to extend creative boundaries. This paper addresses how to generate many new characters that keep the original artistic style and shared visual traits when only a few human-designed characters are available. Method: Building on DreamBooth, the authors use clustering to assign separate tokens to individual characters and to their collective style, combined with LoRA for parameter-efficient fine-tuning. They remove the class-specific regularization set and introduce random tokens and embeddings at generation time to obtain diverse outputs. Result: Evaluated on five small specialized datasets, both quantitative metrics and a human evaluation show that the method produces high-quality, diverse new characters while preserving the distinctive aesthetic features of the reference characters. Conclusion: The method achieves style-consistent, virtually unlimited character generation under few-shot conditions, improving the efficiency and scope of content creation and showing broad potential in animation, gaming, and related fields. Abstract: The audiovisual industry is undergoing a profound transformation as it is integrating AI developments not only to automate routine tasks but also to inspire new forms of art. This paper addresses the problem of producing a virtually unlimited number of novel characters that preserve the artistic style and shared visual traits of a small set of human-designed reference characters, thus broadening creative possibilities in animation, gaming, and related domains. Our solution builds upon DreamBooth, a well-established fine-tuning technique for text-to-image diffusion models, and adapts it to tackle two core challenges: capturing intricate character details beyond textual prompts and the few-shot nature of the training data. To achieve this, we propose a multi-token strategy, using clustering to assign separate tokens to individual characters and their collective style, combined with LoRA-based parameter-efficient fine-tuning. By removing the class-specific regularization set and introducing random tokens and embeddings during generation, our approach allows for unlimited character creation while preserving the learned style. We evaluate our method on five small specialized datasets, comparing it to relevant baselines using both quantitative metrics and a human evaluation study. Our results demonstrate that our approach produces high-quality, diverse characters while preserving the distinctive aesthetic features of the reference characters, with human evaluation further reinforcing its effectiveness and highlighting the potential of our method.

[190] A methodology for clinically driven interactive segmentation evaluation

Parhom Esmaeili,Virginia Fernandez,Pedro Borges,Eli Gibson,Sebastien Ourselin,M. Jorge Cardoso

Main category: cs.CV

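TL;DR: This paper proposes a clinically oriented evaluation methodology and software framework that standardize the evaluation of interactive medical image segmentation algorithms, and uses experiments to identify key factors affecting model performance.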

Details

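Motivation: Interactive segmentation algorithms for volumetric medical images lack consistent, clinically realistic evaluation, which makes fair comparison difficult and misrepresents real-world performance. Method: The authors propose a clinically grounded way to define evaluation tasks and metrics, build a standardized evaluation software framework, and evaluate current state-of-the-art algorithms on heterogeneous and complex tasks. Result: Minimizing information loss when processing user interactions is critical for robustness; adaptive-zooming mechanisms improve robustness and speed up convergence; performance drops when prompting behavior or budgets differ between training and validation; 2D methods do well on slab-like images and coarse targets, while 3D context helps with large or irregularly shaped targets; models from outside the medical domain (e.g. SAM2) degrade noticeably with low contrast and complex shapes. Conclusion: A clinically relevant, standardized evaluation framework is essential for advancing interactive medical image segmentation; existing models face challenges in real clinical scenarios and need targeted design improvements. Abstract: Interactive segmentation is a promising strategy for building robust, generalisable algorithms for volumetric medical image segmentation. However, inconsistent and clinically unrealistic evaluation hinders fair comparison and misrepresents real-world performance. We propose a clinically grounded methodology for defining evaluation tasks and metrics, and build a software framework for constructing standardised evaluation pipelines. We evaluate state-of-the-art algorithms across heterogeneous and complex tasks and observe that (i) minimising information loss when processing user interactions is critical for model robustness, (ii) adaptive-zooming mechanisms boost robustness and speed convergence, (iii) performance drops if validation prompting behaviour/budgets differ from training, (iv) 2D methods perform well with slab-like images and coarse targets, but 3D context helps with large or irregularly shaped targets, (v) performance of non-medical-domain models (e.g. SAM2) degrades with poor contrast and complex shapes.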

[191] PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs

Zixin Zhang,Kanghao Chen,Xingwang Lin,Lutao Jiang,Xu Zheng,Yuanhuiyi Lyu,Litao Guo,Yinchuan Li,Ying-Cong Chen

Main category: cs.CV

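TL;DR: This paper presents PhysToolBench, the first benchmark for evaluating how well multimodal large language models understand physical tools. It contains over 1,000 image-text pairs covering three levels: tool recognition, tool understanding, and tool creation. An evaluation of 32 MLLMs reveals significant deficiencies in tool understanding, and the authors propose preliminary solutions.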

Details

Motivation: Multimodal large language models (MLLMs) perform well in embodied AI and vision-language-action models, yet their actual understanding of physical tools has not been systematically evaluated. Closing this gap requires a dedicated benchmark that quantifies tool cognition. Method: The authors build PhysToolBench, a visual question answering (VQA) dataset of over 1,000 image-text pairs that evaluates three difficulty levels (tool recognition, tool understanding, and tool creation), and run large-scale experiments on 32 mainstream MLLMs. Result: Existing MLLMs perform poorly on tool understanding, with performance dropping sharply on tasks that require grasping underlying principles or creatively fashioning tools, which exposes weaknesses in physical commonsense reasoning. Conclusion: PhysToolBench reveals the limits of current MLLMs in physical tool understanding, highlights the importance of improving models' grasp of physical-world manipulation, and provides an evaluation standard and improvement directions for embodied intelligence and VLA models. Abstract: The ability to use, understand, and create tools is a hallmark of human intelligence, enabling sophisticated interaction with the physical world. For any general-purpose intelligent agent to achieve true versatility, it must also master these fundamental skills. While modern Multimodal Large Language Models (MLLMs) leverage their extensive common knowledge for high-level planning in embodied AI and in downstream Vision-Language-Action (VLA) models, the extent of their true understanding of physical tools remains unquantified. To bridge this gap, we present PhysToolBench, the first benchmark dedicated to evaluating the comprehension of physical tools by MLLMs. Our benchmark is structured as a Visual Question Answering (VQA) dataset comprising over 1,000 image-text pairs. It assesses capabilities across three distinct difficulty levels: (1) Tool Recognition: Requiring the recognition of a tool's primary function. (2) Tool Understanding: Testing the ability to grasp the underlying principles of a tool's operation. (3) Tool Creation: Challenging the model to fashion a new tool from surrounding objects when conventional options are unavailable. Our comprehensive evaluation of 32 MLLMs (spanning proprietary, open-source, specialized embodied, and backbones in VLAs) reveals a significant deficiency in tool understanding. Furthermore, we provide an in-depth analysis and propose preliminary solutions. Code and dataset are publicly available.

[192] Diagonal Artifacts in Samsung Images: PRNU Challenges and Solutions

David Vázquez-Padín,Fernando Pérez-González,Alejandro Martín-Del-Río

Main category: cs.CV

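TL;DR: This paper studies diagonal artifacts in images captured by Samsung smartphones and their impact on PRNU-based camera source verification. It finds that some Galaxy S and A series devices share artifact patterns that cause fingerprint collisions, notes that devices supporting PRO mode can avoid the problem with raw images, and explores potential forensic uses of the artifacts.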

Details

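Motivation: Diagonal artifacts in images from some Samsung phones can cause false matches in PRNU-based camera identification and undermine forensic accuracy, so their causes and impact need systematic analysis. Method: The authors analyze images from several Samsung Galaxy S and A series phones to identify diagonal artifact patterns, compare the PRNU characteristics of processed and raw images, and assess the detectability and forensic value of the artifacts in HDR compositing and portrait mode. Result: Certain Galaxy S models share the same artifact pattern, causing PRNU fingerprint collisions; some A series models show a similar issue; devices supporting PRO mode can avoid the artifacts by using raw images; the artifacts can help reduce misdetections in HDR images and localize synthetic bokeh regions in portrait-mode images. Conclusion: Although diagonal artifacts can compromise conventional PRNU camera source verification, accurate verification remains possible on devices that support raw capture; when raw images are unavailable, the artifacts themselves can serve as new forensic cues for image analysis. Abstract: We investigate diagonal artifacts present in images captured by several Samsung smartphones and their impact on PRNU-based camera source verification. We first show that certain Galaxy S series models share a common pattern causing fingerprint collisions, with a similar issue also found in some Galaxy A models. Next, we demonstrate that reliable PRNU verification remains feasible for devices supporting PRO mode with raw capture, since raw images bypass the processing pipeline that introduces artifacts. This option, however, is not available for the mid-range A series models or in forensic cases without access to raw images. Finally, we outline potential forensic applications of the diagonal artifacts, such as reducing misdetections in HDR images and localizing regions affected by synthetic bokeh in portrait-mode images.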

[193] PRNet: Original Information Is All You Have

PeiHuang Zheng,Yunlong Zhao,Zheng Cui,Yang Li

Main category: cs.CV

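TL;DR: This paper proposes PRNet, a real-time framework for small object detection in aerial images that improves detection by preserving and effectively using shallow spatial features.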

Details

Motivation: Because small objects in aerial images occupy few pixels, their information degrades during feature extraction, so shallow spatial details fail to align with semantic information and cause missed and false detections. Existing methods try to repair the loss in post-processing, but the reconstructed details often deviate from the original information and hinder fusion. Method: PRNet comprises two modules: the Progressive Refinement Neck (PRN) achieves spatial-semantic alignment through backbone reuse and iterative refinement, and the Enhanced SliceSamp (ESSamp) preserves shallow information during downsampling via optimized rearrangement and convolution. Result: Experiments on VisDrone, AI-TOD, and UAVDT show that PRNet outperforms existing methods at comparable computational cost, achieving a better accuracy-efficiency trade-off. Conclusion: By effectively preserving and using the original shallow features, PRNet markedly improves small object detection in aerial images while remaining real-time and practical. Abstract: Small object detection in aerial images suffers from severe information degradation during feature extraction due to limited pixel representations, where shallow spatial details fail to align effectively with semantic information, leading to frequent misses and false positives. Existing FPN-based methods attempt to mitigate these losses through post-processing enhancements, but the reconstructed details often deviate from the original image information, impeding their fusion with semantic content. To address this limitation, we propose PRNet, a real-time detection framework that prioritizes the preservation and efficient utilization of primitive shallow spatial features to enhance small object representations. PRNet achieves this via two modules: the Progressive Refinement Neck (PRN) for spatial-semantic alignment through backbone reuse and iterative refinement, and the Enhanced SliceSamp (ESSamp) for preserving shallow information during downsampling via optimized rearrangement and convolution. Extensive experiments on the VisDrone, AI-TOD, and UAVDT datasets demonstrate that PRNet outperforms state-of-the-art methods under comparable computational constraints, achieving superior accuracy-efficiency trade-offs.

[194] FLOWING: Implicit Neural Flows for Structure-Preserving Morphing

Arthur Bizzi,Matias Grynberg,Vitor Matias,Daniel Perazzo,João Paulo Lima,Luiz Velho,Nuno Gonçalves,João Pereira,Guilherme Schardong,Tiago Novello

Main category: cs.CV

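TL;DR: This paper proposes FLOWING, a new morphing framework that recasts warping as the construction of a differential vector flow, addressing the unstable training and poor feature alignment of conventional MLPs when morphing images and 3D shapes.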

Details

Motivation: When conventional MLPs are used as implicit neural representations to model deformations, they rely on costly regularization to ensure continuity and feature alignment, which often makes training unstable. A more stable, structure-preserving morphing method is needed. Method: FLOWING treats warping as the construction of a differential vector flow and encodes the structural properties of flows (continuity, invertibility, and temporal coherence) directly into the network architecture, yielding more principled and stable transformations. Result: The method achieves state-of-the-art morphing quality with faster convergence across tasks including face morphing, image morphing, and Gaussian Splatting morphing. Conclusion: Through its flow-centric network design, FLOWING overcomes the limitations of conventional MLPs in morphing and delivers accurate, structure-preserving, and stably trained high-quality morphing. Abstract: Morphing is a long-standing problem in vision and computer graphics, requiring a time-dependent warping for feature alignment and a blending for smooth interpolation. Recently, multilayer perceptrons (MLPs) have been explored as implicit neural representations (INRs) for modeling such deformations, due to their meshlessness and differentiability; however, extracting coherent and accurate morphings from standard MLPs typically relies on costly regularizations, which often lead to unstable training and prevent effective feature alignment. To overcome these limitations, we propose FLOWING (FLOW morphING), a framework that recasts warping as the construction of a differential vector flow, naturally ensuring continuity, invertibility, and temporal coherence by encoding structural flow properties directly into the network architectures. This flow-centric approach yields principled and stable transformations, enabling accurate and structure-preserving morphing of both 2D images and 3D shapes. Extensive experiments across a range of applications - including face and image morphing, as well as Gaussian Splatting morphing - show that FLOWING achieves state-of-the-art morphing quality with faster convergence. Code and pretrained models are available at http://schardong.github.io/flowing.

[195] TC-LoRA: Temporally Modulated Conditional LoRA for Adaptive Diffusion Control

Minkyoung Cho,Ruben Ohana,Christian Jacobsen,Adityan Jothi,Min-Hung Chen,Z. Morley Mao,Ethem Can

Main category: cs.CV

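TL;DR: Proposes TC-LoRA, a new method for controllable diffusion models that achieves temporally modulated conditional control by using a hypernetwork to generate LoRA adapters dynamically.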

Details

Motivation: Existing controllable diffusion models use a static conditioning strategy that adapts poorly to the multi-stage denoising process, which moves from coarse structure to fine detail, and this limits generation quality and condition alignment. Method: TC-LoRA uses a hypernetwork to generate LoRA adapters on the fly at each diffusion step based on time and the user's condition, modulating the model weights directly for dynamic, context-aware control. Result: Experiments across multiple data domains show that TC-LoRA significantly improves generative fidelity and adherence to spatial conditions compared with static, activation-based methods. Conclusion: By adjusting model weights dynamically, TC-LoRA achieves finer conditional control and offers a new paradigm for controllable generation based on deeper functional adaptation. Abstract: Current controllable diffusion models typically rely on fixed architectures that modify intermediate activations to inject guidance conditioned on a new modality. This approach uses a static conditioning strategy for a dynamic, multi-stage denoising process, limiting the model's ability to adapt its response as the generation evolves from coarse structure to fine detail. We introduce TC-LoRA (Temporally Modulated Conditional LoRA), a new paradigm that enables dynamic, context-aware control by conditioning the model's weights directly. Our framework uses a hypernetwork to generate LoRA adapters on-the-fly, tailoring weight modifications for the frozen backbone at each diffusion step based on time and the user's condition. This mechanism enables the model to learn and execute an explicit, adaptive strategy for applying conditional guidance throughout the entire generation process. Through experiments on various data domains, we demonstrate that this dynamic, parametric control significantly enhances generative fidelity and adherence to spatial conditions compared to static, activation-based methods. TC-LoRA establishes an alternative approach in which the model's conditioning strategy is modified through a deeper functional adaptation of its weights, allowing control to align with the dynamic demands of the task and generative stage.
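The mechanism of generating LoRA adapters per diffusion step can be sketched as follows. This is a toy illustration under stated assumptions: `toy_hypernetwork` is a hypothetical closed-form stand-in for the learned hypernetwork, and the condition is reduced to a single scalar. What it does show accurately is the LoRA structure itself: a frozen weight matrix plus a low-rank update B @ A that changes with the timestep and the condition.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def toy_hypernetwork(t, cond, d_out, d_in, rank):
    """Hypothetical stand-in: derive LoRA factors from timestep t and scalar cond."""
    a = [[(t + 1) * 0.01 * (r + 1) for _ in range(d_in)] for r in range(rank)]
    b = [[cond * 0.1 for _ in range(rank)] for _ in range(d_out)]
    return a, b

def adapted_weight(w_frozen, t, cond, rank=1):
    """Return W + B @ A without modifying the frozen backbone weight."""
    d_out, d_in = len(w_frozen), len(w_frozen[0])
    a, b = toy_hypernetwork(t, cond, d_out, d_in, rank)
    delta = matmul(b, a)  # low-rank update, shape d_out x d_in
    return [[w_frozen[i][j] + delta[i][j] for j in range(d_in)]
            for i in range(d_out)]

w = [[1.0, 0.0], [0.0, 1.0]]
w_early = adapted_weight(w, t=0, cond=1.0)
w_late = adapted_weight(w, t=9, cond=1.0)
w_uncond = adapted_weight(w, t=5, cond=0.0)
```

The effective weight differs between early and late steps while the frozen matrix stays untouched, which is the contrast with static, activation-based conditioning that the abstract draws.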

[196] FSP-DETR: Few-Shot Prototypical Parasitic Ova Detection

Shubham Trehan,Udhav Ramachandran,Akash Rao,Ruth Scimeca,Sathyanarayanan N. Aakur

Main category: cs.CV

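TL;DR: FSP-DETR is a unified detection framework that achieves robust biomedical object detection in few-shot, open-set, and cross-task settings, supporting novel class recognition and background rejection without retraining.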

Details

Motivation: Biomedical object detection is constrained by scarce labeled data and the frequent emergence of novel or rare categories. Existing methods usually handle few-shot detection, open-set recognition, and related tasks in isolation, and lack a unified, flexible solution. Method: Built on a class-agnostic DETR backbone, the approach constructs class prototypes from support images, learns an embedding space using augmented views and a lightweight Transformer decoder, and jointly optimizes a prototype matching loss, an alignment-based separation loss, and KL divergence regularization. Result: Experiments on ova, blood cell, and malaria detection show that FSP-DETR significantly outperforms prior few-shot and prototype-based detectors in low-shot and open-set scenarios; the authors also introduce a new ova species detection benchmark. Conclusion: FSP-DETR unifies few-shot, open-set, and cross-task detection in one framework, offers inference-time flexibility to adapt to new tasks and classes without fine-tuning, and improves the practicality and generalization of biomedical image detection. Abstract: Object detection in biomedical settings is fundamentally constrained by the scarcity of labeled data and the frequent emergence of novel or rare categories. We present FSP-DETR, a unified detection framework that enables robust few-shot detection, open-set recognition, and generalization to unseen biomedical tasks within a single model. Built upon a class-agnostic DETR backbone, our approach constructs class prototypes from original support images and learns an embedding space using augmented views and a lightweight transformer decoder. Training jointly optimizes a prototype matching loss, an alignment-based separation loss, and a KL divergence regularization to improve discriminative feature learning and calibration under scarce supervision. Unlike prior work that tackles these tasks in isolation, FSP-DETR enables inference-time flexibility to support unseen class recognition, background rejection, and cross-task adaptation without retraining. We also introduce a new ova species detection benchmark with 20 parasite classes and establish standardized evaluation protocols. Extensive experiments across ova, blood cell, and malaria detection tasks demonstrate that FSP-DETR significantly outperforms prior few-shot and prototype-based detectors, especially in low-shot and open-set scenarios.
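Prototype-based classification with background rejection, as described above, can be sketched in a few lines. This is a generic nearest-prototype classifier, not FSP-DETR itself: the embeddings, class names, and rejection threshold are hypothetical, and the real model learns its embedding space with a DETR backbone and the losses listed in the abstract.

```python
import math

def mean_vector(vectors):
    """Class prototype: the mean of the support embeddings."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(query, support, reject_below=0.5):
    """Assign the best-matching prototype, or 'background' if no match is close."""
    prototypes = {label: mean_vector(embs) for label, embs in support.items()}
    label, score = max(((l, cosine(query, p)) for l, p in prototypes.items()),
                       key=lambda pair: pair[1])
    return label if score >= reject_below else "background"

# Hypothetical 2-D embeddings for two ova classes.
support = {
    "ova_a": [[1.0, 0.0], [0.9, 0.1]],
    "ova_b": [[0.0, 1.0], [0.1, 0.9]],
}
pred_known = classify([1.0, 0.05], support)
pred_reject = classify([-1.0, -1.0], support)
```

Adding a new class at inference time only requires adding its support embeddings to `support`, which is why prototype methods need no retraining to recognize unseen classes.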

[197] Vision Language Models: A Survey of 26K Papers

Fengming Lin

Main category: cs.CV

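TL;DR: This paper systematically analyzes 26,104 papers from CVPR, ICLR, and NeurIPS over 2023-2025 and identifies three major trends in computer vision and machine learning: the rise of multimodal vision-language-LLM work, the continued growth of generative models, and sustained activity in 3D and video understanding. The study labels titles and abstracts with a hand-crafted lexicon, traces the evolution of technical directions, training paradigms, and cross-area themes, and releases the methodology and lexicon to support follow-up research.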

Details

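Motivation: Understanding current research trends in AI (especially computer vision and machine learning) transparently and reproducibly requires systematic analysis of how topics evolve in top-venue papers, rather than subjective judgment or fragmented observation. Method: The authors collect the titles and abstracts of 26,104 accepted papers from CVPR, ICLR, and NeurIPS over 2023-2025, apply normalization and phrase protection, match them against a hand-crafted lexicon to assign up to 35 topical labels, and mine fine-grained information on tasks, architectures, training regimes, objectives, datasets, and modalities. Result: Three macro trends emerge: (1) a surge of multimodal vision-language-LLM work that reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative models, with diffusion research focusing on controllability, distillation, and speed; (3) sustained 3D and video activity, shifting from NeRFs to Gaussian splatting and emphasizing human- and agent-centric understanding. Within VLMs, lightweight adaptation (prompting, adapters, LoRA) dominates, training practice shifts toward instruction tuning on strong backbones, and contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparison shows CVPR is stronger in 3D and ICLR has the highest VLM share, while reliability themes such as efficiency and robustness spread across areas. Conclusion: AI research is shifting from standalone module design to integrated multimodal large models, from training from scratch to efficient fine-tuning, and from single tasks to complex reasoning. The analysis offers an auditable, quantitative view and releases tools to support future trend tracking. Abstract: We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023-2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation like prompting/adapters/LoRA and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.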
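The labeling pipeline described above (normalize, then match against a lexicon) might look roughly like the sketch below. The lexicon entries are hypothetical examples, not the released lexicon, and "phrase protection" is interpreted here as matching multi-word phrases as whole units on word boundaries, which is an assumption about the authors' meaning.

```python
import re

# Hypothetical lexicon: label -> phrases that trigger it.
LEXICON = {
    "vlm": ["vision-language", "vision language model"],
    "diffusion": ["diffusion"],
    "3d": ["gaussian splatting", "nerf"],
}

def normalize(text):
    """Lowercase and collapse whitespace so phrase matching is stable."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def assign_labels(title, abstract, lexicon=LEXICON, max_labels=35):
    """Return the labels whose phrases occur as whole units in title + abstract."""
    text = normalize(title + " " + abstract)
    labels = []
    for label, phrases in lexicon.items():
        if any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases):
            labels.append(label)
    return labels[:max_labels]

labels = assign_labels("TC-LoRA for Diffusion Control",
                       "We morph Gaussian   Splatting scenes.")
```

Because matching is purely lexical, recall depends on how complete the phrase lists are, which is consistent with the lexicon-recall limitation the abstract acknowledges.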

[198] SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

Peiwen Sun,Shiqiang Lang,Dongming Wu,Yi Ding,Kaituo Feng,Huadai Liu,Zhen Ye,Rui Liu,Yun-Hui Liu,Jianan Wang,Xiangyu Yue

Main category: cs.CV

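TL;DR: This paper presents a comprehensive solution for all-scale spatial reasoning, including the large-scale SpaceVista-1M dataset, an all-scale benchmark, and SpaceVista-7B, a model built on scale-aware experts and progressive rewards, which substantially improves the spatial intelligence of MLLMs across diverse scenarios.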

Details

Motivation: Existing work relies heavily on indoor 3D scans and manual annotation and lacks effective all-scale scene modeling, which limits performance in diverse applications such as robotics and autonomous driving. Method: The authors propose a holistic solution that combines a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm. A dedicated automated pipeline builds SpaceVista-1M, a dataset of 38K video scenes and about 1M spatial QA pairs, alongside a manually annotated all-scale benchmark. They also design SpaceVista-7B, a model that supports dense inputs and scale-anchored experts. Result: Evaluations on 5 benchmarks, including the authors' SpaceVista-Bench, confirm strong performance and good generalization across scales and scenarios. Conclusion: This work is a first attempt to extend the all-scale spatial intelligence of MLLMs, and its framework, dataset, and model provide important resources and directions for future spatial reasoning research. Abstract: With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, which is, to the best of our knowledge, the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We then build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to the potential knowledge conflict. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released on https://peiwensun2000.github.io/mm2km .

[199] VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

Shaoqi Dong,Chaoyou Fu,Haihan Gao,Yi-Fan Zhang,Chi Yan,Chu Wu,Xiaoyu Liu,Yunhang Shen,Jing Huo,Deqiang Jiang,Haoyu Cao,Yang Gao,Xing Sun,Ran He,Caifeng Shan

Main category: cs.CV

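TL;DR: Proposes a lightweight knowledge-distillation framework that transfers knowledge from a pretrained small action model into a vision-language model (VLM) to give it action-execution capability. The framework keeps the VLM structure and adds only an action token and a state encoder, which substantially reduces training cost and outperforms existing methods in both simulation and real-world settings.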

Details

Motivation: Training large vision-language-action (VLA) models is costly and requires training from scratch. The goal is to give a VLM precise action generation efficiently, without damaging its strong perception. Method: The authors use a two-stage distillation strategy. Stage one performs lightweight alignment by mapping VLM hidden states into the action space of the small action model, reusing its pretrained action decoder. Stage two selectively fine-tunes the language model, state encoder, and action modules. The architecture keeps the original VLM structure and adds only an action token and a state encoder to incorporate physical state inputs. Result: The method reaches a 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). Across five real-world tasks it achieves an 82.0% success rate (17% above the teacher model), with training efficiency well beyond that of VLA models trained from scratch. Conclusion: The distillation framework effectively injects action knowledge into a VLM, improving action precision and generalization while greatly reducing training cost, and demonstrates the feasibility and benefits of knowledge distillation for building efficient VLA systems. Abstract: Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving 82.0% success rate (17% improvement), which demonstrates that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.

[200] StreamingVLM: Real-Time Understanding for Infinite Video Streams

Ruyi Xu,Guangxuan Xiao,Yukang Chen,Liuning He,Kelly Peng,Yao Lu,Song Han

Main category: cs.CV

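TL;DR: This paper presents StreamingVLM, a model for real-time, stable understanding of infinite visual input. By maintaining a compact KV cache and using a simple supervised fine-tuning strategy, it significantly outperforms existing methods in long-video understanding and real-time performance.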

Details

Motivation: Existing vision-language models face high computational cost, large memory use, and high latency on infinite video streams, which makes efficient and stable real-time understanding difficult. Method: StreamingVLM uses a unified framework that aligns training with streaming inference. It keeps the KV cache compact by reusing the states of attention sinks, a short window of vision tokens, and a long window of text tokens, and uses a simple supervised fine-tuning (SFT) strategy to mimic the inference-time attention pattern. Result: On Inf-Streams-Eval, a new benchmark with videos averaging over two hours, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and sustains stable real-time performance at up to 8 FPS on a single H100. It also improves LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Conclusion: StreamingVLM delivers efficient, low-latency, scalable video stream understanding, and its training-inference alignment offers a practical path for vision-language systems in real-world settings. Abstract: Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
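The cache retention policy described above (attention sinks, a short window of recent vision tokens, and a long window of recent text tokens) can be sketched as a bookkeeping structure. This is an illustrative assumption about the policy only: the class name and window sizes are hypothetical, token positions stand in for the actual key/value tensors, and the real system's handling of positions is not modeled.

```python
from collections import deque

class StreamingKVCache:
    """Tracks which token positions stay cached under a sink + two-window policy."""

    def __init__(self, num_sinks, vision_window, text_window):
        self.num_sinks = num_sinks
        self.sinks = []                            # first tokens, never evicted
        self.vision = deque(maxlen=vision_window)  # short window, evicts oldest
        self.text = deque(maxlen=text_window)      # long window, evicts oldest

    def append(self, position, kind):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(position)
        elif kind == "vision":
            self.vision.append(position)
        else:
            self.text.append(position)

    def kept(self):
        """Cached positions in stream order."""
        return self.sinks + sorted(list(self.vision) + list(self.text))

cache = StreamingKVCache(num_sinks=2, vision_window=3, text_window=4)
for pos in range(20):  # alternate vision and text tokens in a toy stream
    cache.append(pos, "vision" if pos % 2 == 0 else "text")
kept = cache.kept()
```

However long the stream runs, the cache size is bounded by `num_sinks + vision_window + text_window`, which is what keeps latency and memory from growing with video length.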

Table of Contents

cs.CL [Back]

[1] Enhancing Biomedical Named Entity Recognition using GLiNER-BioMed with Targeted Dictionary-Based Post-processing for BioASQ 2025 task 6

[2] Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

[3] Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech

[4] Systematic Diagnosis of Brittle Reasoning in Large Language Models

[5] Confidence, Not Perplexity: A Better Metric for the Creative Era of LLMs

[6] Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation

[7] Mnemosyne: An Unsupervised, Human-Inspired Long-Term Memory Architecture for Edge-Based LLMs

[8] Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection

[9] YpathRAG:A Retrieval-Augmented Generation Framework and Benchmark for Pathology

[10] LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback

[11] Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks

[12] Centering Emotion Hotspots: Multimodal Local-Global Fusion and Cross-Modal Alignment for Emotion Recognition in Conversations

[13] MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation

[14] GraphGhost: Tracing Structures Behind Large Language Models

[15] Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications

[16] Iterative LLM-Based Generation and Refinement of Distracting Conditions in Math Word Problems

[17] LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests

[18] JAI-1: A Thai-Centric Large Language Model

[19] From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents

[20] Text2Stories: Evaluating the Alignment Between Stakeholder Interviews and Generated User Stories

[21] PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction

[22] Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B

[23] From What to Why: Thought-Space Recommendation with Small Language Models

[24] ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection

[25] Next Semantic Scale Prediction via Hierarchical Diffusion Language Models

[26] Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression

[27] Formalizing Style in Personal Narratives

[28] A Novel Framework for Augmenting Rating Scale Tests with LLM-Scored Text Data

[29] dInfer: An Efficient Inference Framework for Diffusion Language Models

[30] Scaling Laws for Code: A More Data-Hungry Regime

[31] Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning

[32] How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective

[33] How Reliable is Language Model Micro-Benchmarking?

[34] Coordinates from Context: Using LLMs to Ground Complex Location References

[35] Measuring Moral LLM Responses in Multilingual Capacities

[36] Learning What to Remember: Adaptive Probabilistic Memory Retention for Memory-Efficient Language Models

[37] Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective

[38] MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding

[39] The Model's Language Matters: A Comparative Privacy Analysis of LLMs

[40] Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs

[41] Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

[42] Quality Estimation Reranking for Document-Level Translation

[43] FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

[44] Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR

[45] A Unified Biomedical Named Entity Recognition Framework with Large Language Models

[46] Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors

[47] Artificial Impressions: Evaluating Large Language Model Behavior Through the Lens of Trait Impressions

[48] SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures

[49] A Human Behavioral Baseline for Collective Governance in Software Projects

[50] Creation of the Chinese Adaptive Policy Communication Corpus

[51] MASA: LLM-Driven Multi-Agent Systems for Autoformalization

[52] DARO: Difficulty-Aware Reweighting Policy Optimization

[53] Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models

[54] LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction

[55] Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise

[56] Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

[57] Large Language Models Do NOT Really Know What They Don't Know

[58] Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation

[59] ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability

[60] FrameEOL: Semantic Frame Induction using Causal Language Models

[61] When Retrieval Succeeds and Fails: Rethinking Retrieval-Augmented Generation for LLMs

[62] DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

[63] Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM

[64] Stronger Re-identification Attacks through Reasoning and Aggregation

[65] LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning

[66] DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

[67] IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data

[68] CrisiText: A dataset of warning messages for LLM training in emergency communication

[69] DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning

[70] Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

[71] CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation

[72] Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation

[73] CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts

[74] One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations

[75] MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics

[76] ShiZhi: A Chinese Lightweight Large Language Model for Court View Generation

[77] Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference

[78] Verifying Chain-of-Thought Reasoning via Its Computational Graph