cs.CL

[1] HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated Q&A over Raw Unstructured Documents

Sungmoon Kim,Hyuna Jeon,Dahye Kim,Mingyu Kim,Dong-Kyu Chae,Jiwoong Kim

Main category: cs.CL

TL;DR: This paper proposes HybridRAG, a framework that preprocesses unstructured documents such as PDFs into a QA knowledge base and, at query time, first tries to match an existing answer and triggers generation only when no match is found, improving accuracy and response speed.

Motivation: Existing RAG methods rely on well-structured text and perform retrieval and generation at query time, which makes them a poor fit for real chatbot scenarios that involve large volumes of unstructured PDF documents, many concurrent users, and limited resources.

Method: HybridRAG works in two stages: (1) it parses PDFs with OCR and layout analysis and organizes them into hierarchical text chunks; (2) it pre-generates a QA knowledge base with an LLM. At query time it searches the QA bank first and falls back to standard RAG generation only when matching fails.

Result: On OHRBench, HybridRAG significantly improves answer quality and reduces latency compared with standard RAG.

Conclusion: HybridRAG is an efficient, practical refinement of RAG for real-world chatbot applications, especially suited to handling massive unstructured document collections in resource-constrained environments.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for grounding Large Language Model (LLM)-based chatbot responses on external knowledge. However, existing RAG studies typically assume well-structured textual sources (e.g., Wikipedia or curated datasets) and perform retrieval and generation at query time, which can limit their applicability in real-world chatbot scenarios. In this paper, we present HybridRAG, a novel and practical RAG framework towards more accurate and faster chatbot responses. First, HybridRAG ingests raw, unstructured PDF documents containing complex layouts (text, tables, figures) via Optical Character Recognition (OCR) and layout analysis, and converts them into hierarchical text chunks. Then, it pre-generates a plausible question-answer (QA) knowledge base from the organized chunks using an LLM. At query time, user questions are matched against this QA bank to retrieve immediate answers when possible, and only if no suitable QA match is found does our framework fall back to on-the-fly response generation. Experiments on OHRBench demonstrate that our HybridRAG provides higher answer quality and lower latency compared to a standard RAG baseline. We believe that HybridRAG could be a practical solution for real-world chatbot applications that must handle large volumes of unstructured documents and many users under limited computational resources.
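
The query-time behavior described in this abstract (match against the QA bank first, generate only on a miss) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token-overlap similarity, the `threshold` value, the `QA_BANK` contents, and `generate_stub` are all assumptions introduced here, and a real system would presumably use embedding-based matching and a full RAG pipeline for the fallback.

```python
import re

def tokens(text):
    """Lowercase word tokens as a set (a crude stand-in for an embedding)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def answer(query, qa_bank, generate, threshold=0.6):
    """Return (answer, source): a stored answer on a QA hit, else a generated one."""
    if qa_bank:
        best_q, best_a = max(qa_bank, key=lambda qa: similarity(query, qa[0]))
        if similarity(query, best_q) >= threshold:
            return best_a, "qa_bank"
    # No sufficiently close pre-generated question: fall back to generation.
    return generate(query), "generated"

# Hypothetical data and a placeholder generator, for illustration only.
QA_BANK = [
    ("What is the warranty period?", "Two years."),
    ("How do I reset the device?", "Hold the power button for ten seconds."),
]

def generate_stub(query):
    """Stands in for the on-the-fly retrieval-and-generation path."""
    return f"[generated answer for: {query}]"
```

The threshold controls the trade-off the abstract points at: a higher value sends more queries to the slower generation path, while a lower value returns stored answers more often at the risk of a poor match.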

[2] Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

Max Zhang,Derek Liu,Kai Zhang,Joshua Franco,Haihao Liu

Main category: cs.CL

TL;DR: This paper examines knowledge distillation (KD) for multilingual jailbreak defense and finds that standard fine-tuning actually increases the jailbreak success rate (JSR); removing ambiguous "boundary" refusals mitigates the safety degradation, but reasoning ability still declines.

Motivation: Safety alignment of large language models (LLMs) is currently English-centric, which leaves safety risks in low-resource language settings and calls for multilingual safety alignment methods.

Method: Using black-box, response-based knowledge distillation with LoRA parameter-efficient fine-tuning, the refusal behavior of the OpenAI o1-mini teacher model is distilled into three open-source student models (Llama-3-8B, Gemma-2-2B, Qwen3-8B) with about 28,000 multilingual jailbreak prompts from the XSafety dataset.

Result: On the MultiJail benchmark, fine-tuning on standard "safe" refusal data raises the JSR of all student models, by up to 16.6 percentage points; removing "boundary" refusals mitigates or reverses the safety degradation, but GSM8K reasoning performance still drops.

Conclusion: KD shows promise for multilingual safety alignment but also faces challenges, and refusal behavior must be modeled carefully; the study provides a foundation for follow-up research.

Abstract: Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's "safe" refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced "boundary" refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.

[3] Retrieval Heads are Dynamic

Yuping Lin,Zitao Li,Yue Xing,Pengfei He,Yingqian Cui,Yaliang Li,Bolin Ding,Jingren Zhou,Jiliang Tang

Main category: cs.CL

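TL;DR: This paper studies retrieval heads in large language models (LLMs) from a dynamic perspective. It finds that they change across timesteps during autoregressive generation, cannot be replaced by static heads, and that hidden states carry a predictive signal for future retrieval head patterns, revealing an internal planning mechanism.

Motivation: Prior work mostly relies on static statistics aggregated across datasets to identify heads that perform retrieval on average, overlooking the fine-grained temporal dynamics of autoregressive generation.

Method: Through extensive empirical analysis, the three core claims are validated on the Needle-in-a-Haystack and multi-hop QA tasks, and the utility gap between dynamic and static retrieval heads is quantified within a dynamic retrieval-augmented generation framework.

Result: Retrieval heads are shown to be timestep-dynamic and irreplaceable, and hidden states can predict future retrieval head patterns; dynamic retrieval heads clearly outperform static ones on the tasks.

Conclusion: Retrieval behavior in LLMs is highly dynamic and time-sensitive, and hidden states exhibit an internal planning ability, which matters both for understanding LLM internals and for designing better retrieval-augmented methods.

Abstract: Recent studies have identified "retrieval heads" in Large Language Models (LLMs) responsible for extracting information from input contexts. However, prior works largely rely on static statistics aggregated across datasets, identifying heads that perform retrieval on average. This perspective overlooks the fine-grained temporal dynamics of autoregressive generation. In this paper, we investigate retrieval heads from a dynamic perspective. Through extensive analysis, we establish three core claims: (1) Dynamism: Retrieval heads vary dynamically across timesteps; (2) Irreplaceability: Dynamic retrieval heads are specific at each timestep and cannot be effectively replaced by static retrieval heads; and (3) Correlation: The model's hidden state encodes a predictive signal for future retrieval head patterns, indicating an internal planning mechanism. We validate these findings on the Needle-in-a-Haystack task and a multi-hop QA task, and quantify the differences in the utility of dynamic and static retrieval heads in a Dynamic Retrieval-Augmented Generation framework. Our study provides new insights into the internal mechanisms of LLMs.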

[4] Nested Named Entity Recognition in Plasma Physics Research Articles

Muhammad Haris,Hans Höft,Markus M. Becker,Markus Stocker

Main category: cs.CL

TL;DR: This paper proposes a lightweight nested named entity recognition method based on BERT-CRF and hyperparameter optimization, tailored to plasma physics research articles. It builds an annotated corpus with 16 entity classes and improves recognition through entity-specific modeling.

Motivation: Plasma physics research articles are highly complex and context-rich, and specialized entities must be extracted from them to support applications such as advanced search, yet existing NER methods face challenges in this domain.

Method: Build a plasma physics corpus with 16 classes of nested entities; train independent BERT-CRF models specialized per entity type; add a systematic hyperparameter optimization procedure to improve model performance.

Result: The approach delivers efficient nested named entity recognition for plasma physics text with good performance in this domain, and provides a basis for navigating and analyzing scientific literature.

Conclusion: The work advances NER in specialized domains and supports the effectiveness of lightweight, specialized modeling and optimization strategies for entity recognition in scientific text.

Abstract: Named Entity Recognition (NER) is an important task in natural language processing that aims to identify and extract key entities from unstructured text. We present a novel application of NER in plasma physics research articles and address the challenges of extracting specialized entities from scientific text in this domain. Research articles in plasma physics often contain highly complex and context-rich content that must be extracted to enable, e.g., advanced search. We propose a lightweight approach based on encoder-transformers and conditional random fields to extract (nested) named entities from plasma physics research articles. First, we annotate a plasma physics corpus with 16 classes specifically designed for the nested NER task. Second, we evaluate an entity-specific model specialization approach, where independent BERT-CRF models are trained to recognize individual entity types in plasma physics text. Third, we integrate an optimization process to systematically fine-tune hyperparameters and enhance model performance. Our work contributes to the advancement of entity recognition in plasma physics and also provides a foundation to support researchers in navigating and analyzing scientific literature.

[5] Assessing LLM Reliability on Temporally Recent Open-Domain Questions

Pushwitha Krishnappa,Amit Das,Vinija Jain,Tathagata Mukherjee,Aman Chadha

Main category: cs.CL

TL;DR: This paper introduces the RECOM benchmark dataset to evaluate how well large language models answer recent Reddit questions. It finds a paradox of high semantic similarity but low lexical overlap, indicating that models preserve meaning through extensive paraphrasing rather than direct reproduction. It also shows that parameter count does not determine performance and argues that multi-dimensional evaluation frameworks should replace single lexical metrics.

Motivation: To examine how well LLMs align with human perspectives on temporally recent information in open-domain question answering, a question existing research has underexplored.

Method: Build the RECOM dataset of 15,000 Reddit questions from September 2025 with community-derived reference answers; evaluate the alignment of four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, GPT-OSS-20B) using BLEU, ROUGE, BERTScore, MoverScore, cosine similarity, and NLI.

Result: All models exceed 99% cosine similarity while BLEU-1 stays below 8%, a pronounced semantic-lexical paradox; MoverScore sits in between (51-53%), reflecting the optimal transport cost of semantic alignment; the larger GPT-OSS-20B underperforms Mistral-7B; NLI shows contradiction rates below 7%.

Conclusion: Lexical-matching metrics such as BLEU are unreliable, and multi-dimensional evaluation frameworks that combine semantics, logic, and structure should be used instead; the RECOM dataset is released publicly to advance research on timeliness and human alignment.

Abstract: Large Language Models (LLMs) are increasingly deployed for open-domain question answering, yet their alignment with human perspectives on temporally recent information remains underexplored. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a benchmark dataset of 15,000 recent Reddit questions from September 2025 paired with community-derived reference answers. We investigate how four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) respond to these questions, evaluating alignment using lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). Our central finding is a striking semantic-lexical paradox: all models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap, a 90+ percentage point gap indicating that models preserve meaning through extensive paraphrasing rather than lexical reproduction. MoverScore (51-53%) confirms this pattern, occupying an intermediate position that reflects the optimal transport cost of semantic alignment. Furthermore, model scale does not predict performance: Mistral-7B (7B parameters) outperforms GPT-OSS-20B (20B parameters) across all metrics. NLI analysis reveals that contradiction rates remain below 7%, suggesting models rarely generate content that directly conflicts with human consensus. These findings challenge the reliability of lexical metrics for evaluating abstractive generation and argue for multi-dimensional evaluation frameworks that capture semantic fidelity beyond surface-level text matching. The RECOM dataset is publicly available at https://anonymous.4open.science/r/recom-D4B0
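
To see why a lexical metric can collapse on a faithful paraphrase, the sketch below computes clipped unigram precision, the core quantity behind BLEU-1. It is a simplified illustration under stated assumptions: whitespace tokenization, a single reference, and no brevity penalty, so it is not the exact scoring pipeline used for RECOM. The example sentences are invented here.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: the share of candidate tokens found in the reference.

    Each candidate token is credited at most as many times as it occurs in
    the reference, which is the clipping rule BLEU applies to n-gram counts.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    clipped = sum(min(count, ref_counts[token])
                  for token, count in Counter(cand_tokens).items())
    return clipped / len(cand_tokens)

# Invented example: a paraphrase that shares no surface tokens with the reference.
REFERENCE = "the cat sat on the mat"
PARAPHRASE = "a feline rested upon a rug"
```

Here `unigram_precision(PARAPHRASE, REFERENCE)` is zero even though the two sentences mean nearly the same thing, which is the kind of gap an embedding-based similarity is meant to close.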

[6] Small Updates, Big Doubts: Does Parameter-Efficient Fine-tuning Enhance Hallucination Detection?

Xu Hu,Yifan Zhang,Songtao Wei,Chen Zhao,Qiannan Li,Bingzhe Li,Feng Chen

Main category: cs.CL

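TL;DR: This paper systematically studies how parameter-efficient fine-tuning (PEFT) affects hallucination detection in large language models. It finds that PEFT substantially improves the AUROC of a range of unsupervised hallucination detectors, mainly by restructuring how the model represents uncertainty rather than by injecting new factual knowledge.

Motivation: Although PEFT is widely used to adapt large language models and is often assumed to improve factual correctness, its effect on hallucination behavior, especially in question answering, remains unclear.

Method: A comprehensive empirical evaluation on three open-weight LLMs and three fact-seeking QA datasets, using seven unsupervised hallucination detection methods that span three paradigms (semantic consistency, confidence, and entropy), supplemented by linear probes and representation diagnostics to analyze the mechanism.

Result: PEFT consistently strengthens hallucination detection, substantially improving AUROC across many detectors; further analysis indicates that PEFT mainly reshapes how uncertainty is encoded and surfaced rather than injecting new factual knowledge.

Conclusion: PEFT improves hallucination detection by restructuring uncertainty representations rather than by adding factual knowledge, which is relevant both to understanding how PEFT works and to designing hallucination mitigation strategies.

Abstract: Parameter-efficient fine-tuning (PEFT) methods are widely used to adapt large language models (LLMs) to downstream tasks and are often assumed to improve factual correctness. However, how PEFT methods affect hallucination behavior remains insufficiently understood, especially on QA datasets. In this work, we systematically investigate the impact of PEFT on hallucination detection through a comprehensive empirical study across three open-weight LLM backbones and three fact-seeking QA benchmarks. For each model, we evaluate performance using seven unsupervised hallucination detection methods spanning three complementary approaches: semantic-consistency-based detectors, confidence-based detectors, and entropy-based detectors. This multifaceted evaluation enables us to characterize how PEFT reshapes uncertainty across different detection paradigms. Our experimental results show that PEFT consistently strengthens hallucination detection ability, substantially improving AUROC across a wide range of hallucination detectors. In addition, further analyses using linear probes and representation diagnostics indicate that PEFT methods primarily reshape how uncertainty is encoded and surfaced, rather than injecting new factual knowledge into the models.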

[7] Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering

Nathan Mao,Varun Kaushik,Shreya Shivkumar,Parham Sharafoleslami,Kevin Zhu,Sunishchal Dev

Main category: cs.CL

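TL;DR: This paper introduces the FalseCite dataset for systematically evaluating LLM hallucinations induced by misleading citations, and analyzes hidden states to reveal their internal patterns.

Motivation: Large language models (LLMs) often hallucinate, producing false or nonsensical information that is especially harmful in sensitive fields such as medicine and law; existing work lacks a systematic benchmark for hallucinations induced by false or misleading citations.

Method: Build FalseCite, an annotated dataset designed to capture hallucinations induced by fabricated citations; test GPT-4o-mini, Falcon-7B, and Mistral-7B on it; visualize and cluster the models' hidden state vectors.

Result: False citations noticeably raise the hallucination rate, most of all in GPT-4o-mini; the hidden states of both hallucinated and non-hallucinated samples trace out a consistent horn-like geometric shape.

Conclusion: FalseCite offers a new benchmark for evaluating and mitigating LLM hallucinations; the shared structure of the hidden states suggests that hallucination may stem from the model's inherent reasoning mechanism rather than from simple output errors.

Abstract: Large Language Models (LLMs) often hallucinate, generating nonsensical or false information that can be especially harmful in sensitive fields such as medicine or law. To study this phenomenon systematically, we introduce FalseCite, a curated dataset designed to capture and benchmark hallucinated responses induced by misleading or fabricated citations. Running GPT-4o-mini, Falcon-7B, and Mistral-7B through FalseCite, we observed a noticeable increase in hallucination activity for false claims with deceptive citations, especially in GPT-4o-mini. Using the responses from FalseCite, we can also analyze the internal states of hallucinating models, visualizing and clustering the hidden state vectors. From this analysis, we noticed that the hidden state vectors, regardless of hallucination or non-hallucination, tend to trace out a distinct horn-like shape. Our work underscores FalseCite's potential as a foundation for evaluating and mitigating hallucinations in future LLM research.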

[8] Enhancing SDG-Text Classification with Combinatorial Fusion Analysis and Generative AI

Jingyan Xu,Marcelo L. LaFleur,Christina Schweikert,D. Frank Hsu

Main category: cs.CL

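TL;DR: This paper proposes a text classification method that combines multiple models with human expert knowledge to assign text to the UN Sustainable Development Goals (SDGs). It augments the training data with generative AI and improves performance through Combinatorial Fusion Analysis (CFA), reaching 96.73% accuracy and surpassing single models and some human annotations.

Motivation: SDG text classification suffers from ambiguous, interrelated categories and difficult annotation, and needs methods that are more robust, interpretable, and able to combine human and machine intelligence.

Method: Generate synthetic training data with a generative AI model, train several diverse base classifiers, and fuse their outputs with Combinatorial Fusion Analysis (CFA), based on the rank-score characteristic (RSC) function and cognitive diversity (CD); annotations from human domain experts are brought in for comparison and joint analysis.

Result: CFA fusion reaches 96.73% accuracy on the SDG text classification task, clearly better than the best single model; the human-machine fusion results show that the two are complementary and mutually reinforcing.

Conclusion: Multi-model fusion (CFA) together with human expert knowledge can effectively improve the accuracy and trustworthiness of classifying semantically complex text, offering a workable paradigm for NLP applications in social contexts.

Abstract: Natural Language Processing (NLP) techniques such as text classification and topic discovery are very useful in many application areas including information retrieval, knowledge discovery, policy formulation, and decision-making. However, it remains a challenging problem in cases where the categories are unavailable, difficult to differentiate, or are interrelated. Social analysis with human context is an area that can benefit from text classification, as it relies substantially on text data. The focus of this paper is to enhance the classification of text according to the UN's Sustainable Development Goals (SDGs) by collecting and combining intelligence from multiple models. Combinatorial Fusion Analysis (CFA), a system fusion paradigm using a rank-score characteristic (RSC) function and cognitive diversity (CD), has been used to enhance classifier methods by combining a set of relatively good and mutually diverse classification models. We use a generative AI model to generate synthetic data for model training and then apply CFA to this classification task. The CFA technique achieves 96.73% performance, outperforming the best individual model. We compare the outcomes with those obtained from human domain experts. It is demonstrated that combining intelligence from multiple ML/AI models using CFA and getting input from human experts can not only complement but also enhance each other.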
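
The two CFA ingredients named in the abstract can be sketched in a few lines. The definitions below follow the commonly cited formulation (the RSC function maps each rank to the score at that rank, and cognitive diversity is a distance between two RSC functions), but the root-mean-square normalization, the assumption that scores are already scaled to a common range, and the plain score-averaging combination are choices made for this illustration rather than details confirmed by the abstract.

```python
import math

def rsc(scores):
    """Rank-score characteristic: entry i-1 holds the score at rank i (descending)."""
    return sorted(scores, reverse=True)

def cognitive_diversity(scores_a, scores_b):
    """Root-mean-square distance between the RSC functions of two scoring systems."""
    fa, fb = rsc(scores_a), rsc(scores_b)
    if len(fa) != len(fb):
        raise ValueError("both systems must score the same number of items")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa))

def score_combination(*systems):
    """Average the per-item scores of several systems (one simple fusion rule)."""
    return [sum(item_scores) / len(item_scores) for item_scores in zip(*systems)]

# Hypothetical per-class scores from two classifiers for one document.
MODEL_A = [0.9, 0.1, 0.5]
MODEL_B = [0.6, 0.5, 0.4]
```

Because the RSC function discards which item received which score, two models can rank items identically and still be "diverse" in this sense if their score distributions differ; this is the property CFA uses when choosing which models to combine.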

[9] Disentangling Direction and Magnitude in Transformer Representations: A Double Dissociation Through L2-Matched Perturbation Analysis

Mangadoddi Srikar Vardhan,Lekkala Sai Teja

Main category: cs.CL

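TL;DR: This paper finds that the direction (angle) and magnitude (norm) of Transformer hidden states play different functional roles in language modeling and syntactic processing: perturbing direction does more damage to language modeling loss, while perturbing magnitude does more damage to syntactic tasks such as subject-verb agreement. The dissociation is robust in LayerNorm architectures but not significant in RMSNorm architectures.

Motivation: Whether direction and magnitude in Transformer hidden states serve distinct functional roles is currently unclear.

Method: L2-matched perturbation analysis on Pythia-family models, which guarantees that angular and magnitude perturbations produce the same Euclidean displacement, combined with causal interventions (such as repairing the attention or LayerNorm pathways) to locate the pathways through which the damage propagates.

Result: Angular perturbations cause up to 42.9× more damage to language modeling loss, while magnitude perturbations reduce subject-verb agreement accuracy by 20.4%, far more than the 1.6% caused by angular perturbations. Angular damage propagates mainly through the attention pathways (28.4% loss recovery after repair), and magnitude damage propagates partly through the LayerNorm pathways (29.9% recovery after repair). The dissociation replicates across Pythia model scales but is not significant in RMSNorm architectures.

Conclusion: In LayerNorm-style architectures, direction and magnitude take on partially independent computational roles: direction dominates attentional routing, and magnitude modulates processing intensity for fine-grained syntactic judgments. This dissociation depends on the normalization scheme, refines the linear representation hypothesis, and has implications for model editing and interpretability research.

Abstract: Transformer hidden states encode information as high-dimensional vectors, yet whether direction (orientation in representational space) and magnitude (vector norm) serve distinct functional roles remains unclear. Studying Pythia-family models, we discover a striking cross-over dissociation: angular perturbations cause up to 42.9× more damage to language modeling loss, while magnitude perturbations cause disproportionately more damage to syntactic processing (20.4% vs. 1.6% accuracy drop on subject-verb agreement). This finding is enabled by L2-matched perturbation analysis, a methodology ensuring that angular and magnitude perturbations achieve identical Euclidean displacements. Causal intervention reveals that angular damage flows substantially through the attention pathways (28.4% loss recovery via attention repair), while magnitude damage flows partly through the LayerNorm pathways (29.9% recovery via LayerNorm repair). These patterns replicate across scales within the Pythia architecture family. These findings provide evidence that direction and magnitude support partially distinct computational roles in LayerNorm-based architectures. The direction preferentially affects attentional routing, while magnitude modulates processing intensity for fine-grained syntactic judgments. We find different patterns in RMSNorm-based architectures, suggesting that the dissociation depends on architectural choices. Our results refine the linear representation hypothesis and have implications for model editing and interpretability research.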
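
One way to realize "identical Euclidean displacements" for the two perturbation types is sketched below. The construction is an assumption made for illustration, since the abstract does not give the authors' exact procedure: the magnitude perturbation rescales the vector by `1 + alpha`, and the angular perturbation rotates it at fixed norm by an angle `theta` chosen so that `2 * sin(theta / 2) == alpha`, which makes both displacements equal to `alpha * ||h||`.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def perturb_magnitude(h, alpha):
    """Rescale h by (1 + alpha); direction is unchanged, displacement is alpha * ||h||."""
    return [(1.0 + alpha) * x for x in h]

def perturb_direction(h, direction, alpha):
    """Rotate h toward `direction` at fixed norm so that ||h' - h|| == alpha * ||h||.

    Requires 0 <= alpha <= 2 and a `direction` that is not parallel to h.
    """
    n = norm(h)
    h_hat = [x / n for x in h]
    # Gram-Schmidt step: keep only the part of `direction` orthogonal to h.
    along = sum(d * x for d, x in zip(direction, h_hat))
    u = [d - along * x for d, x in zip(direction, h_hat)]
    u_norm = norm(u)
    u = [x / u_norm for x in u]
    # Chord length on a circle of radius n is 2 * n * sin(theta / 2).
    theta = 2.0 * math.asin(alpha / 2.0)
    return [n * (math.cos(theta) * a + math.sin(theta) * b) for a, b in zip(h_hat, u)]

# Toy hidden state and rotation direction, chosen only for illustration.
H = [3.0, 4.0, 0.0]
SCALED = perturb_magnitude(H, 0.5)
ROTATED = perturb_direction(H, [0.0, 0.0, 1.0], 0.5)
```

With this matching, any difference in downstream damage between `SCALED` and `ROTATED` cannot be attributed to one perturbation simply moving the vector farther than the other.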

[10] PRIME: Policy-Reinforced Iterative Multi-agent Execution for Algorithmic Reasoning in Large Language Models

Jiawei Xu,Zhenyu Yu,Ziqian Bi,Minh Duc Pham,Xiaoyi Qu,Danyang Zhang

Main category: cs.CL

TL;DR: This paper proposes the PRIME framework, in which three specialized agents (executor, verifier, coordinator) cooperate on algorithmic reasoning tasks and are trained with group relative policy optimization. On the newly built large-scale benchmark PRIME-Bench it substantially improves accuracy, especially on tasks that require sustained state tracking, and smaller models benefit the most.

Motivation: Large language models still perform poorly on algorithmic reasoning, and a new approach is needed that can handle complex, multi-step, constrained algorithmic reasoning.

Method: Propose PRIME (Policy-Reinforced Iterative Multi-agent Execution), a framework with three agents, an executor (step-by-step reasoning), a verifier (constraint checking), and a coordinator (backtracking control), jointly trained with group relative policy optimization; build the PRIME-Bench benchmark (86 tasks, 12 categories, 51,600 instances) for evaluation.

Result: PRIME raises average accuracy from 26.8% to 93.8% (a 250% relative gain); Turing machine simulation improves from 9% to 92% and long division from 16% to 94%; ablations show that iterative verification is the key component; across model scales (8B-120B), smaller models reach accuracy comparable to models 8× larger.

Conclusion: Through multi-agent cooperation and iterative verification, PRIME markedly improves the robustness and accuracy of LLMs on algorithmic reasoning, in particular by limiting error propagation, and is especially practical for smaller models.

Abstract: Large language models have demonstrated remarkable capabilities across diverse reasoning tasks, yet their performance on algorithmic reasoning remains limited. To handle this limitation, we propose PRIME (Policy-Reinforced Iterative Multi-agent Execution), a framework comprising three specialized agents: an executor for step-by-step reasoning, a verifier for constraint checking, and a coordinator for backtracking control, optimized through group relative policy optimization. For comprehensive evaluation, we introduce PRIME-Bench, the largest algorithmic reasoning benchmark to date, comprising 86 tasks across 12 categories with 51,600 instances. Tasks span sorting algorithms, graph and tree structures, automata and state machines, symbolic reasoning, and constraint-based puzzles, with execution traces reaching over one million steps. Compared to the baseline approach, PRIME improves average accuracy from 26.8% to 93.8%, a 250% relative gain. The largest improvements occur on tasks requiring sustained state tracking, with Turing machine simulation improving from 9% to 92% and long division from 16% to 94%. Ablation studies identify iterative verification as the primary contributor, preventing the error propagation that causes baseline approaches to fail catastrophically. Analysis across model scales (8B-120B parameters) reveals that smaller models benefit disproportionately, achieving accuracy comparable to models 8x larger.
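
The division of labor among executor, verifier, and coordinator can be illustrated with a toy control loop. This is only a sketch of the iterative-verification idea under assumptions made here: the three roles are plain Python callables rather than LLM agents, policy training is omitted entirely, and the backtracking rule (drop the last accepted step after too many rejected proposals) is a hypothetical choice, not the paper's algorithm.

```python
def run_with_verification(initial_state, propose, verify, is_done,
                          max_retries=3, max_steps=1000):
    """Build a trace of states, accepting a step only when the verifier approves it."""
    trace = [initial_state]
    retries = 0
    for _ in range(max_steps):
        current = trace[-1]
        if is_done(current):
            return trace
        candidate = propose(current, retries)      # executor: next-step proposal
        if verify(current, candidate):             # verifier: constraint check
            trace.append(candidate)
            retries = 0
        else:
            retries += 1
            if retries > max_retries:              # coordinator: give up and backtrack
                if len(trace) > 1:
                    trace.pop()
                retries = 0
    raise RuntimeError("step budget exhausted before the task finished")

# Toy task: count from 0 to 5. The executor makes one deliberate mistake.
def faulty_counter(state, attempt):
    """Skips ahead on the first attempt from state 3, and is correct otherwise."""
    return state + 2 if (state == 3 and attempt == 0) else state + 1

def is_increment(state, candidate):
    """The only legal move in this toy task is adding one."""
    return candidate == state + 1

TRACE = run_with_verification(0, faulty_counter, is_increment, lambda s: s == 5)
```

Without the verifier, the bad step from 3 to 5 would be accepted and the trace would silently skip a state, a small-scale version of the error propagation that the abstract attributes to the baseline.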

[11] Efficient Hyper-Parameter Search for LoRA via Language-aided Bayesian Optimization

Baek Seong-Eun,Lee Jung-Mok,Kim Sung-Bin,Tae-Hyun Oh

Main category: cs.CL

TL;DR: This paper proposes a new framework that brings the domain knowledge of large language models (LLMs) into Bayesian optimization (BO) to search efficiently for LoRA fine-tuning hyperparameters, combining natural-language prompts, a learnable token, and proxy evaluation on a data subset to improve both search efficiency and final performance.

Motivation: LoRA fine-tuning is efficient but sensitive to hyperparameters, and conventional exhaustive search is computationally expensive, so a smarter and more efficient search method is needed.

Method: (1) Use an LLM as a mapping from discrete hyperparameters to a continuous vector space, injecting knowledge about LoRA through a domain-aware text prompt; (2) add a learnable token to model residual information that is hard to describe in language; (3) train and evaluate on a data subset as a proxy to speed up each iteration.

Result: Hyperparameters found in only about 30 iterations outperform standard hyperparameters selected from about 45,000 combinations by more than 20%.

Conclusion: Combining the semantic understanding of LLMs with BO, supported by prompt engineering and proxy evaluation, is an effective paradigm for efficient, knowledge-driven LoRA hyperparameter optimization.

Abstract: Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enables resource-efficient personalization or specialization, but it comes at the expense of additional hyperparameter tuning. Although LoRA makes fine-tuning efficient, it is highly sensitive to the choice of hyperparameters, and exhaustive hyperparameter search is still computationally very demanding. To address these challenges, we propose a framework that integrates the domain knowledge of pre-trained LLMs into Bayesian Optimization (BO) to efficiently search for LoRA hyperparameters. To leverage the informed knowledge of LLMs, we repurpose LLMs as a discrete-to-continuous mapping to link the hyperparameters and their domain knowledge with a continuous vector space, where BO is conducted. We design and control the mapping by language prompting, where we provide a domain-aware textual prompt describing the relationships among hyperparameters and their respective roles; thereby, we explicitly inject domain knowledge about LoRA into the LLM in natural language. Also, we model the residual information that is hard to linguistically describe in the prompt with an additional learnable token. This helps BO sample more high-performing hyperparameters. In addition, by leveraging the observation of the strong correlation between the respective performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation with a data subset. This further increases the efficiency of our method. We demonstrate that the hyperparameters found by our method in only about 30 iterations achieve more than 20% performance improvement over standard hyperparameters found from about 45,000 combinations.

[12] Synthesizing the Virtual Advocate: A Multi-Persona Speech Generation Framework for Diverse Linguistic Jurisdictions in Indic Languages

Aniket Deroy

Main category: cs.CL

TL;DR: This study evaluates the Gemini 2.5 Flash and Pro TTS models on generating courtroom speeches in five Indic languages and proposes a prompting framework that uses their multilingual and context-aware capabilities. The models handle procedural tasks well but fall short on the authority, emotional weight, and vocal modulation that persuasive legal speech requires, with weaker results in Bengali and Gujarati.

Motivation: Legal advocacy combines an authoritative tone, rhythmic pausing, and emotional intelligence, and current TTS technology still struggles to reproduce the persuasive vocal craft of human lawyers in India's multilingual setting.

Method: Using the Gemini 2.5 Flash and Pro TTS models, design a prompting framework covering five Indic languages (Tamil, Telugu, Bengali, Hindi, Gujarati) that relies on the models' native multilingual support and context-aware pacing to synthesize speech for different advocate personas.

Result: The models show a "monotone authority": they deliver procedural information well but are clearly lacking in dynamic vocal modulation, emotional gravitas, and persuasiveness; Bengali and Gujarati perform worst, exposing open challenges in phonological modeling.

Conclusion: Multilingual TTS is ready to support procedural legal tasks, but reproducing the expressive, persuasive oral advocacy of human lawyers still requires advances in emotion modeling, prosody control, and adaptation to low-resource languages.

Abstract: Legal advocacy requires a unique combination of authoritative tone, rhythmic pausing for emphasis, and emotional intelligence. This study investigates the performance of the Gemini 2.5 Flash TTS and Gemini 2.5 Pro TTS models in generating synthetic courtroom speeches across five Indic languages: Tamil, Telugu, Bengali, Hindi, and Gujarati. We propose a prompting framework that utilizes Gemini 2.5's native support for five languages and its context-aware pacing to produce distinct advocate personas. The evolution of Large Language Models (LLMs) has shifted the focus of Text-to-Speech (TTS) technology from basic intelligibility to context-aware, expressive synthesis. In the legal domain, synthetic speech must convey authority and a specific professional persona, a task that becomes significantly more complex in the linguistically diverse landscape of India. The models exhibit a "monotone authority," excelling at procedural information delivery but struggling with the dynamic vocal modulation and emotive gravitas required for persuasive advocacy. Performance dips in Bengali and Gujarati further highlight phonological frontiers for future refinement. This research underscores the readiness of multilingual TTS for procedural legal tasks while identifying the remaining challenges in replicating the persuasive artistry of human legal discourse. The code is available at https://github.com/naturenurtureelite/Synthesizing-the-Virtual-Advocate/tree/main

[13] Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review

Qian Ruan,Iryna Gurevych

Main category: cs.CL

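TL;DR: This paper proposes the REspGen framework and the REspEval evaluation suite, recasts author response generation as an author-in-the-loop task, and builds Re^3Align, the first large-scale dataset of review-response-revision triplets, to improve the quality and controllability of author responses in peer review.

Motivation: Existing automatic text generation approaches ignore key signals such as the author's expertise, author-only information, and revision strategy, and so cannot effectively support authors in writing high-quality responses during peer review.

Method: Propose REspGen, an author-in-the-loop response generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement; build Re^3Align, the first aligned review-response-revision dataset; design REspEval, an evaluation suite with 20+ metrics covering input utilization, controllability, response quality, and discourse structure.

Result: Experiments show that author input and evaluation-guided refinement clearly improve response quality, that input design affects the outcome, and that there is a trade-off between controllability and quality.

Conclusion: Explicitly modeling author expertise and intent and integrating them into generation and evaluation supports response writing in scientific peer review more effectively and advances the practical use of NLP in research collaboration.

Abstract: Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. Recent work frames this task as automatic text generation, underusing author expertise and intent. In practice, authors possess domain expertise, author-only information, revision and response strategies--concrete forms of author expertise and intent--to address reviewer concerns, and seek NLP assistance that integrates these signals to support effective response writing in peer review. We reformulate author response generation as an author-in-the-loop task and introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement, together with REspEval, a comprehensive evaluation suite with 20+ metrics covering input utilization, controllability, response quality, and discourse. To support this formulation, we construct Re$^3$Align, the first large-scale dataset of aligned review--response--revision triplets, where revisions provide signals of author expertise and intent. Experiments with state-of-the-art LLMs show the benefits of author input and evaluation-guided refinement, the impact of input design on response quality, and trade-offs between controllability and quality. We make our dataset, generation and evaluation tools publicly available.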

[14] The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models

Aradhya Dixit,Shreem Dixit

Main category: cs.CL

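TL;DR: This paper exposes the systematic cost that tokenizers in multilingual pretrained models impose on different writing systems (the "script tax"). Comparing orthographic variants with identical linguistic content, it finds that the higher-fragmentation orthography markedly increases the tokens-per-word ratio, inference latency, and information cost (BPC), showing that tokenization is an important source of inequity in multilingual NLP and motivating script-aware tokenization and pretraining.

Motivation: Pretrained multilingual language models are often assumed to be script-agnostic, but their tokenizers may systematically penalize particular writing systems; quantifying this "script tax" is needed to reveal the resulting inequity.

Method: Compare two orthographic variants (same linguistic content, different written form) and measure fertility, inference speed, and bits per character (BPC) on mBERT and XLM-R; introduce a round-trip conversion error rate (CER_rt) to check that the gaps come from orthography-conditioned processing rather than mapping noise.

Result: The higher-fragmentation orthography shows a roughly 3.4× increase in the tokens-per-word ratio, a 16.5× inference slowdown, and BPC increases of 19.7% (mBERT) and 47.1% (XLM-R); CER_rt = 0.31 supports the orthography-conditioned processing hypothesis.

Conclusion: Tokenizers are a key cause of systematic inequity in multilingual NLP, and script-aware tokenization strategies and pretraining methods are needed.

Abstract: Pretrained multilingual language models are often assumed to be script-agnostic, yet their tokenizers can impose systematic costs on certain writing systems. We quantify this script tax by comparing two orthographic variants with identical linguistic content. Across mBERT and XLM-R, the higher-fragmentation orthography shows a ~3.4x increase in fertility (6.73-6.85 vs. 2.10-2.35 tokens/word), leading to a 16.5x inference slowdown (0.23 vs. 3.8 sentences/second) on identical hardware. Using bits per character (BPC) to avoid the "NLL paradox" from subword fragmentation, we find a substantial increase in information cost: +19.7% for mBERT (8.06->9.65) and +47.1% for XLM-R (12.19->17.94). A round-trip conversion check (CER_rt=0.31) suggests these gaps reflect orthography-conditioned processing rather than mapping noise. Our results highlight tokenization as a key source of inequity in multilingual NLP and motivate script-aware tokenization and pretraining.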
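
The quantities reported in this abstract are simple ratios, and the sketch below shows how each could be computed. The function names are introduced here for illustration, and the assumption that the summed negative log-likelihood is measured in nats is ours; the abstract does not state the unit. Normalizing by characters instead of tokens is what keeps the comparison fair when one orthography is split into many more tokens.

```python
import math

def fertility(num_tokens, num_words):
    """Average number of subword tokens produced per word."""
    return num_tokens / num_words

def bits_per_character(total_nll_nats, num_chars):
    """Convert a summed negative log-likelihood (in nats) to bits per character."""
    return total_nll_nats / (math.log(2) * num_chars)

def relative_change(before, after):
    """Fractional change from `before` to `after` (0.2 means +20%)."""
    return (after - before) / before

# Figures quoted in the abstract, recomputed from the rounded values given there.
MBERT_BPC_CHANGE = relative_change(8.06, 9.65)
XLMR_BPC_CHANGE = relative_change(12.19, 17.94)
SLOWDOWN = 3.8 / 0.23
```

Recomputing from the rounded figures reproduces the reported +19.7% and 16.5x and comes within rounding distance of the reported +47.1%, which suggests the published percentages were derived from unrounded measurements.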

[15] Barriers to Discrete Reasoning with Transformers: A Survey Across Depth, Exactness, and Bandwidth

Michelle Yuan,Weiyi Sun,Amir H. Rezaeian,Jyotika Singh,Sandip Ghoshal,Yao-Ting Wang,Miguel Ballesteros,Yassine Benajiba

Main category: cs.CL

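TL;DR: This paper surveys the theoretical limitations of Transformers on discrete reasoning tasks (such as arithmetic, logical inference, and algorithmic composition), systematically analyzing their structural and computational barriers from the perspectives of circuit complexity, approximation theory, and communication complexity.

Motivation: Transformers excel at pattern matching and interpolation but have fundamental theoretical shortcomings on discrete reasoning tasks that require exact symbolic computation, and the roots of these limitations need to be clarified at the theoretical level.

Method: Synthesize three theoretical frameworks (circuit complexity, approximation theory, communication complexity), reviewing key definitions, seminal results, and illustrative examples to give a unified account of why Transformers struggle to implement exact discrete algorithms.

Result: The survey identifies core challenges: depth constraints, difficulty approximating discontinuous functions, and bottlenecks in inter-token communication.

Conclusion: The essential limitations of current Transformer architectures stem from their structure and computational paradigm; future model design needs to move beyond the existing framework and explore new ways to incorporate symbolic reasoning.

Abstract: Transformers have become the foundational architecture for a broad spectrum of sequence modeling applications, underpinning state-of-the-art systems in natural language processing, vision, and beyond. However, their theoretical limitations in discrete reasoning tasks, such as arithmetic, logical inference, and algorithmic composition, remain a critical open problem. In this survey, we synthesize recent studies from three theoretical perspectives: circuit complexity, approximation theory, and communication complexity, to clarify the structural and computational barriers that transformers face when performing symbolic computations. By connecting these established theoretical frameworks, we provide an accessible and unified account of why current transformer architectures struggle to implement exact discrete algorithms, even as they excel at pattern matching and interpolation. We review key definitions, seminal results, and illustrative examples, highlighting challenges such as depth constraints, difficulty approximating discontinuities, and bottlenecks in inter-token communication. Finally, we discuss implications for model design and suggest promising directions for overcoming these foundational limitations.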

[16] Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments

Maral Doctorarastoo,Katherine A. Flanigan,Mario Bergés,Christopher McComb

Main category: cs.CL

TL;DR: This paper examines whether large language models (LLMs) can predict human activities and estimate their durations in low-data settings from compact contextual cues (time, space, behavioral history, persona). Using a retrieval-augmented prompting strategy on the CASAS Aruba dataset, it shows that LLMs have strong inherent temporal understanding: they produce plausible daily activity sequences zero-shot, a few examples noticeably improve accuracy, and performance saturates after a few shots.

Motivation: Existing data-driven agent-based models, from rule-based to deep learning, perform poorly in low-data settings, which limits their practical use; human activity prediction is essential for smart homes, urban design, human-robot collaboration, and similar applications, so a data-efficient modeling approach that generalizes well is needed.

Method: Propose a retrieval-augmented prompting strategy that combines four kinds of context (temporal, spatial, behavioral history, persona); run two tasks on the CASAS Aruba smart-home dataset, next-activity prediction with duration estimation and multi-step daily sequence generation; systematically evaluate performance across different numbers of few-shot examples (0, 1, 2, ...).

Result: Zero-shot, LLMs already produce semantically coherent and temporally plausible daily activity predictions; adding one or two examples noticeably improves activity-category accuracy and duration calibration; further examples yield diminishing returns; sequence-level evaluation shows that temporal alignment stays stable across few-shot conditions.

Conclusion: Pretrained large language models can act as efficient temporal reasoners that capture the regularity and context dependence of human behavior without large amounts of labeled data, and could strengthen the behavioral modules of agent-based models.

Abstract: Anticipating human activities and their durations is essential in applications such as smart-home automation, simulation-based architectural and urban design, activity-based transportation system simulation, and human-robot collaboration, where adaptive systems must respond to human activities. Existing data-driven agent-based models--from rule-based to deep learning--struggle in low-data environments, limiting their practicality. This paper investigates whether large language models, pre-trained on broad human knowledge, can fill this gap by reasoning about everyday activities from compact contextual cues. We adopt a retrieval-augmented prompting strategy that integrates four sources of context--temporal, spatial, behavioral history, and persona--and evaluate it on the CASAS Aruba smart-home dataset. The evaluation spans two complementary tasks: next-activity prediction with duration estimation, and multi-step daily sequence generation, each tested with various numbers of few-shot examples provided in the prompt. Analyzing few-shot effects reveals how much contextual supervision is sufficient to balance data efficiency and predictive accuracy, particularly in low-data environments. Results show that large language models exhibit strong inherent temporal understanding of human behavior: even in zero-shot settings, they produce coherent daily activity predictions, while adding one or two demonstrations further refines duration calibration and categorical accuracy. Beyond a few examples, performance saturates, indicating diminishing returns. Sequence-level evaluation confirms consistent temporal alignment across few-shot conditions. These findings suggest that pre-trained language models can serve as promising temporal reasoners, capturing both recurring routines and context-dependent behavioral variations, thereby strengthening the behavioral modules of agent-based models.

[17] What Do LLMs Know About Alzheimer's Disease? Fine-Tuning, Probing, and Data Synthesis for AD Detection

Lei Jiang,Yue Zhou,Natalie Parde

Main category: cs.CL

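TL;DR: This paper explores using large language models (LLMs) for early detection of Alzheimer's disease (AD). Fine-tuning an LLM and probing its internal representations shows that specific words and special markers take on a key role after fine-tuning. Based on this, the authors design task-aware special markers and build a sequence-to-sequence data-synthesis model whose synthetic data is used to improve downstream AD detection.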

Details Motivation: Early detection of Alzheimer's disease is hampered by scarce labeled data. LLMs transfer well across domains, but supervised fine-tuning for the AD domain remains largely unexplored. Method: An LLM is fine-tuned with supervision for AD detection, and probing is used to analyze intermediate activations across transformer layers. Based on the key markers found, task-aware special markers are designed, and a sequence-to-sequence model is trained to synthesize structurally consistent, diagnostically informative data. Result: After fine-tuning, the probing values of specific words and special markers change substantially. The synthetic data is reported to be effective both in intrinsic evaluation and in improving AD detection through downstream training. Conclusion: LLMs can be adapted to AD detection through task-driven fine-tuning and representation analysis, and data synthesis with task-aware markers can ease the shortage of labeled data, offering a new direction for medical AI. Abstract: Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across domains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we fine-tune an LLM for AD detection and investigate how task-relevant information is encoded within its internal representations. We employ probing techniques to analyze intermediate activations across transformer layers, and we observe that, after fine-tuning, the probing values of specific words and special markers change substantially, indicating that these elements assume a crucial role in the model's improved detection performance. Guided by this insight, we design a curated set of task-aware special markers and train a sequence-to-sequence model as a data-synthesis tool that leverages these markers to generate structurally consistent and diagnostically informative synthetic samples. We evaluate the synthesized data both intrinsically and by incorporating it into downstream training pipelines.

[18] From Instruction to Output: The Role of Prompting in Modern NLG

Munazza Zaib,Elaf Alhazmi

Main category: cs.CL

TL;DR: This survey provides a comprehensive overview of prompt engineering techniques for Natural Language Generation (NLG), introducing a taxonomy, decision framework, and integrated design-optimization-evaluation framework to enhance controllability and generalizability.

Details Motivation: Lack of a structured framework or coherent understanding of diverse prompt engineering methods, especially in NLG. Method: Survey and synthesis of recent prompting methods; proposes a taxonomy of prompting paradigms, a decision framework for practitioners, and an integrated framework linking prompt design, optimization, and evaluation. Result: A unified conceptual and practical framework for prompt engineering in NLG, including classification, selection guidance, and identification of trends and challenges. Conclusion: Prompt engineering is a crucial input-level control mechanism for NLG; systematic design, optimization, and evaluation are essential for controllable and generalizable NLG systems. Abstract: Prompt engineering has emerged as an integral technique for extending the strengths and abilities of Large Language Models (LLMs) to achieve significant performance gains in various Natural Language Processing (NLP) tasks. This approach, which requires instructions to be composed in natural language to bring out the knowledge from LLMs in a structured way, has driven breakthroughs in various NLP tasks. Yet there is still no structured framework or coherent understanding of the varied prompt engineering methods and techniques, particularly in the field of Natural Language Generation (NLG). This survey aims to help fill that gap by outlining recent developments in prompt engineering and their effect on different NLG tasks. It reviews recent advances in prompting methods and their impact on NLG tasks, presenting prompt design as an input-level control mechanism that complements fine-tuning and decoding approaches. The paper introduces a taxonomy of prompting paradigms and a decision framework that helps practitioners select prompts based on varying factors, outlines emerging trends and challenges, and proposes a framework that links design, optimization, and evaluation to support more controllable and generalizable NLG.

[19] Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions

Usman Naseem

Main category: cs.CL

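TL;DR: This paper surveys progress in mechanistic interpretability for large language model (LLM) alignment, covering circuit discovery, feature visualization, activation steering, and causal intervention, and analyzes how these methods are applied in alignment strategies such as RLHF, constitutional AI, and scalable oversight. It identifies challenges including superposition, neuron polysemanticity, and the difficulty of interpreting emergent behaviors, and proposes future directions: automated interpretability, cross-model generalization of circuits, and scalable interpretability-driven alignment.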

Details Motivation: LLMs are highly capable, but their decision-making is opaque, so mechanistic interpretability research is needed to improve both understanding and alignment. Method: A systematic survey of recent mechanistic interpretability techniques (such as circuit discovery, feature visualization, activation steering, and causal intervention) and their application to LLM alignment. Result: The survey maps how interpretability techniques support alignment strategies such as RLHF, constitutional AI, and scalable oversight, and identifies key challenges including superposition, polysemanticity, and the interpretation of emergent behaviors. Conclusion: Mechanistic interpretability is a key route to more trustworthy and controllable LLMs; future work should develop interpretability-driven alignment methods that are automated, generalize across models, and scale. Abstract: Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability (i.e., the systematic study of how neural networks implement algorithms through their learned representations and computational structures) has emerged as a critical research direction for understanding and aligning these models. This paper surveys recent progress in mechanistic interpretability techniques applied to LLM alignment, examining methods ranging from circuit discovery to feature visualization, activation steering, and causal intervention. We analyze how interpretability insights have informed alignment strategies including reinforcement learning from human feedback (RLHF), constitutional AI, and scalable oversight. Key challenges are identified, including the superposition hypothesis, polysemanticity of neurons, and the difficulty of interpreting emergent behaviors in large-scale models. We propose future research directions focusing on automated interpretability, cross-model generalization of circuits, and the development of interpretability-driven alignment techniques that can scale to frontier models.

[20] Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs

Himanshu Gupta,Pratik Jayarao,Chaitanya Dwivedi,Neeraj Varshney

Main category: cs.CL

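TL;DR: This paper surveys the state of research on large language models in code-switching (CSW) settings, proposes a unifying taxonomy, and offers a practical guide to building, adapting, and evaluating CSW-capable LLMs.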

Details Motivation: Current LLMs perform poorly in code-mixing and code-switching settings, with systematic degradation in grammaticality, factuality, and safety, so a systematic review and path to improvement are needed. Method: The paper builds a unifying taxonomy along three dimensions (data, modeling, and evaluation); reviews modeling approaches including CSW-tailored pre-training, task-specific post-training, prompting strategies, and in-context learning; analyzes the instability and bias of current evaluation practices; and critically examines the coverage and English-centric bias of existing benchmarks. Result: A practical playbook for CSW-capable LLMs, the identification of CSW as a new risk vector for bypassing safety mechanisms, and a set of open research challenges. Conclusion: CSW is a key frontier for testing and improving the genuine multilingual ability of LLMs. Systematic change is needed across data construction, modeling paradigms, and evaluation standards, and the new safety risks it raises deserve attention. Abstract: Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.

[21] MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization

Haidong Xin,Xinze Li,Zhenghao Liu,Yukun Yan,Shuo Wang,Cheng Yang,Yu Gu,Ge Yu,Maosong Sun

Main category: cs.CL

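TL;DR: This paper proposes MetaMem, a framework that augments LLM memory systems with a self-evolving meta-memory, improving the model's ability to identify and integrate critical evidence from scattered memory fragments and significantly outperforming baselines.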

Details Motivation: Existing memory systems can persist long-horizon interaction histories, but they often break the logical and temporal relationships within a session, which fragments memory and degrades reasoning. Method: MetaMem introduces a self-evolving meta-memory. During optimization it self-reflects on reasoning processes and updates the meta-memory state, distilling knowledge-utilization experience that transfers across tasks. Result: Experiments show MetaMem outperforms strong baselines by over 3.6%. Code and datasets are open source. Conclusion: By explicitly modeling knowledge-utilization experience, MetaMem improves how systematically LLMs use fragmented memory in long-horizon interactions. Abstract: Existing memory systems enable Large Language Models (LLMs) to support long-horizon human-LLM interactions by persisting historical interactions beyond limited context windows. However, while recent approaches have succeeded in constructing effective memories, they often disrupt the inherent logical and temporal relationships within interaction sessions, resulting in fragmented memory units and degraded reasoning performance. In this paper, we propose MetaMem, a novel framework that augments memory systems with a self-evolving meta-memory, aiming to teach LLMs how to effectively utilize memorized knowledge. During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks by self-reflecting on reasoning processes and performing actions to update the current meta-memory state. The accumulated meta-memory units serve as explicit knowledge utilization experiences, guiding the LLM to systematically identify and integrate critical evidence from scattered memory fragments. Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%. All code and datasets are available at https://github.com/OpenBMB/MetaMem.

[22] DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task

Shafiuddin Rehan Ahmed,Wei Wei

Main category: cs.CL

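TL;DR: This paper proposes the DDL2PropBank benchmark task for evaluating the developer experience of multi-agent frameworks in LLM-driven software development. Using a uniform Agent-as-a-Tool pattern, it compares code complexity and AI-assistability across 10 frameworks and finds that Agno performs best overall.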

Details Motivation: There is no principled method for systematically evaluating the developer experience of multi-agent frameworks in a controlled setting. Method: The authors build the DDL2PropBank benchmark task (mapping database schemas to PropBank rolesets), deploy identical agent logic across 10 frameworks using the Agent-as-a-Tool pattern, and evaluate along two dimensions: code complexity (via static analysis) and AI-assistability (how well LLMs can autonomously generate correct framework-specific code). Result: Complexity varies threefold across frameworks, with Pydantic AI and Agno needing the least implementation overhead. Structural alignment scores are a reliable proxy for runtime success in frameworks with a single canonical pattern but overestimate correctness in multi-pattern frameworks. Agno combines the lowest complexity with the highest structural alignment and 83% pass@1. Conclusion: Agno is currently the strongest overall multi-agent framework in this comparison, and the work gives a quantifiable benchmark and empirical basis for framework selection. Abstract: Multi-agent frameworks promise to simplify LLM-driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability, the extent to which LLMs can autonomously generate correct, framework-specific code. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI-assistability, structural alignment scores reliably proxy runtime success for frameworks with single canonical patterns, but overestimate correctness for multi-pattern frameworks. Agno emerges as the strongest overall performer, combining lowest complexity with highest structural alignment and 83% pass@1.
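
The abstract reports pass@1 but does not say how it was estimated. As a minimal sketch, assuming the widely used unbiased pass@k estimator over `n` generated samples of which `c` pass, the metric could be computed as follows. The function name and the sample counts are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per task, c: samples that passed, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts only: 10 samples, 3 of them correct.
example_pass_at_1 = pass_at_k(n=10, c=3, k=1)
example_pass_at_5 = pass_at_k(n=10, c=3, k=5)
```

For `k = 1` the estimator reduces to the plain fraction `c / n`, which is why pass@1 is often read as single-attempt accuracy.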

[23] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification

Jiale Zhao,Ke Fang,Lu Cheng

Main category: cs.CL

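TL;DR: This paper proposes AskBench, an interactive benchmark, and a rubric-guided RLVR reinforcement learning method that improve the ability of large language models to ask for clarification when information is missing or premises are false, balancing accuracy with interaction efficiency.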

Details Motivation: LLMs often answer anyway when a prompt is incomplete or misleading, which leads to hallucinations or reinforces misconceptions, so their ability to seek clarification needs to improve. Method: The authors build the interactive AskBench benchmark (with two settings, AskMind and AskOverconfidence) and propose a reinforcement learning method based on structured rubrics and verifier-based rewards (RLVR). Result: Experiments show consistent gains in accuracy, rubric adherence, and interaction efficiency, with good generalization to unseen domains. Conclusion: Interactive evaluation combined with rubric-guided reinforcement learning can strengthen an LLM's awareness of when to clarify and its ability to do so, improving performance and robustness together. Abstract: Large language models (LLMs) often respond even when prompts omit critical details or include misleading information, leading to hallucinations or reinforced misconceptions. We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints. A unified judge loop evaluates final answers and simulates user responses as needed. AskBench covers two settings: AskMind, with intent-deficient queries requiring clarification, and AskOverconfidence, with queries containing false premises that must be identified and corrected. We further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification. Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.

[24] Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

Donald Ye,Max Loffgren,Om Kotadia,Linus Wong

Main category: cs.CL

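TL;DR: This paper proposes the NLDD metric for evaluating the faithfulness of Chain-of-Thought (CoT) explanations. It finds a "Reasoning Horizon" (k*) in reasoning chains and shows that accuracy alone does not reveal whether a model actually reasons.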

Details Motivation: It is unclear whether Chain-of-Thought explanations reflect the model's actual internal reasoning, so a quantifiable faithfulness measure is needed. Method: The paper proposes Normalized Logit Difference Decay (NLDD): each reasoning step in the CoT is corrupted in turn, the resulting drop in the model's confidence in its answer is measured, and the measurements are standardized to allow cross-model comparison. Result: Across three task types and three model families, there is a consistent Reasoning Horizon at 70-85% of chain length, beyond which steps have almost no effect; models can also hold correct internal representations and still answer incorrectly. Conclusion: Task accuracy alone cannot tell whether a model actually reasons through its chain; NLDD offers a new tool for measuring when CoT really matters. Abstract: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps from the explanation and measures how much the model's confidence in its answer drops, to determine if a step is truly important. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70-85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.
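
The corrupt-and-measure procedure described in the abstract can be sketched as below. This is an illustration under stated assumptions, not the paper's definition of NLDD: the normalization (dividing by the clean logit difference) is a guess, and `logit_diff` and `corrupt` are hypothetical callables that a real experiment would back with model forward passes. The toy stand-ins exist only to make the sketch runnable.

```python
from typing import Callable, List, Sequence


def normalized_step_drops(
    steps: Sequence[str],
    logit_diff: Callable[[List[str]], float],
    corrupt: Callable[[str], str],
    eps: float = 1e-8,
) -> List[float]:
    """For each step, corrupt it alone and report the relative confidence drop.

    logit_diff(steps) is assumed to return the model's logit margin for its
    answer given the chain; the division by the clean margin is an assumed
    normalization, not one confirmed by the source.
    """
    clean = logit_diff(list(steps))
    scale = max(abs(clean), eps)  # avoid division by zero
    drops = []
    for k in range(len(steps)):
        damaged = list(steps)
        damaged[k] = corrupt(steps[k])  # corrupt exactly one step
        drops.append((clean - logit_diff(damaged)) / scale)
    return drops


# Toy stand-ins (hypothetical): each intact step adds a fixed amount of margin.
TOY_WEIGHTS = {"parse the question": 2.0, "add the numbers": 1.0, "restate the answer": 0.0}


def toy_logit_diff(steps: List[str]) -> float:
    return sum(TOY_WEIGHTS.get(step, 0.0) for step in steps)


def toy_corrupt(step: str) -> str:
    return "<corrupted>"


toy_drops = normalized_step_drops(list(TOY_WEIGHTS), toy_logit_diff, toy_corrupt)
```

In the toy chain the last step contributes nothing, so its drop is zero; a run of near-zero drops at the end of a real chain is the kind of pattern the abstract's Reasoning Horizon describes.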

[25] The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task

Rui Cao,Zhenyun Deng,Yulong Chen,Michael Schlichtkrull,Andreas Vlachos

Main category: cs.CL

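TL;DR: This paper describes the AVerImaTeC shared task, which aims to advance systems for automatically verifying image-text claims. Evaluation uses a conditional verdict accuracy (the AVerImaTeC score). Six teams took part in the testing phase, all beating the baseline, and the winning team, HUMANE, scored 0.5455.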

Details Motivation: To advance automatic verification of image-text claims and address the challenge of verifying real-world image-text claims. Method: The organizers ran the AVerImaTeC shared task, allowing participants to use external knowledge sources (such as web search engines) or the curated knowledge store provided by the organizers, and evaluated systems with the AVerImaTeC score (a conditional verdict accuracy gated by an evidence-score threshold). Result: All 6 testing-phase submissions outperformed the baseline; the winning team, HUMANE, achieved an AVerImaTeC score of 0.5455. Conclusion: The shared task advanced image-text verification, provided a reproducible benchmark and evaluation framework, and summarized key lessons and directions for improvement. Abstract: The Automatic Verification of Image-Text Claims (AVerImaTeC) shared task aims to advance system development for retrieving evidence and verifying real-world image-text claims. Participants were allowed to either employ external knowledge sources, such as web search engines, or leverage the curated knowledge store provided by the organizers. System performance was evaluated using the AVerImaTeC score, defined as a conditional verdict accuracy in which a verdict is considered correct only when the associated evidence score exceeds a predefined threshold. The shared task attracted 14 submissions during the development phase and 6 submissions during the testing phase. All participating systems in the testing phase outperformed the baseline provided. The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455. This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.
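
The scoring rule in the abstract (a verdict counts only when its evidence score exceeds a threshold) can be illustrated with a short sketch. The label strings, the evidence scores, and the threshold value of 0.3 are assumptions made for the example; the abstract does not give the actual threshold or how evidence scores are produced.

```python
from typing import Sequence


def conditional_verdict_accuracy(
    predicted: Sequence[str],
    gold: Sequence[str],
    evidence_scores: Sequence[float],
    threshold: float,
) -> float:
    """Fraction of claims whose verdict matches gold AND whose evidence
    score strictly exceeds the threshold."""
    if not (len(predicted) == len(gold) == len(evidence_scores)):
        raise ValueError("all inputs must have the same length")
    if not gold:
        raise ValueError("need at least one claim")
    credited = sum(
        1
        for p, g, e in zip(predicted, gold, evidence_scores)
        if p == g and e > threshold
    )
    return credited / len(gold)


# Hypothetical data: the second claim has the right verdict but weak evidence.
demo_score = conditional_verdict_accuracy(
    predicted=["Supported", "Refuted", "Supported", "Refuted"],
    gold=["Supported", "Refuted", "Refuted", "Refuted"],
    evidence_scores=[0.9, 0.2, 0.8, 0.5],
    threshold=0.3,  # assumed value for illustration
)
```

Gating on evidence means a system cannot score by guessing verdicts without retrieving adequate support, which is consistent with the task's focus on both retrieval and verification.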

[26] SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation

Beichen Guo,Zhiyuan Wen,Jia Gu,Senzhang Wang,Haochen Shi,Ruosong Yang,Shuaiqi Liu

Main category: cs.CL

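TL;DR: This paper proposes SurveyLens, the first multi-discipline evaluation benchmark for Automatic Survey Generation (ASG). It contains 1,000 high-quality human-written surveys across 10 disciplines and a dual-lens evaluation framework, and it reveals how different ASG methods perform across disciplines.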

Details Motivation: Existing ASG evaluation relies on generic metrics and is heavily biased toward computer science, so it cannot tell whether ASG output meets each discipline's own writing and content standards. As a result, researchers outside CS lack a reliable basis for choosing tools. Method: The authors build the cross-discipline dataset SurveyLens-1k and propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, based on LLM scoring aligned with human preferences, and (2) Canonical Alignment Evaluation, which compares content coverage and synthesis quality against human-written surveys. Experiments cover 11 state-of-the-art ASG methods. Result: The study is the first to show systematically how different ASG paradigms (vanilla LLMs, dedicated ASG systems, and Deep Research agents) differ in performance across 10 disciplines, identifying where each is strong and where it falls short. Conclusion: SurveyLens offers a discipline-adapted approach to ASG evaluation, pushes ASG toward practical cross-discipline use, and gives researchers empirical grounds for choosing tools by discipline. Abstract: The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), failing to assess whether ASG methods adhere to the distinct standards of various academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on using ASG systems to yield high-quality surveys compliant with specific discipline standards. To bridge this gap, we introduce SurveyLens, the first discipline-aware benchmark evaluating ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines. Subsequently, we propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which utilizes LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation to rigorously measure content coverage and synthesis quality against human-written survey papers. We conduct extensive experiments by evaluating 11 state-of-the-art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.

[27] Are Aligned Large Language Models Still Misaligned?

Usman Naseem,Gautam Siddharth Kashyap,Rafiq Ali,Ebad Shabbir,Sushant Kumar Ray,Abdullah Mohammad,Agrima Seth

Main category: cs.CL

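TL;DR: This paper proposes Mis-Align Bench, a unified benchmark for jointly evaluating LLM misalignment across safety, value, and cultural dimensions. It builds SAVACU, a dataset of about 380,000 samples spanning 112 domains, and pairs misaligned and aligned responses via two-stage rejection sampling. Experiments show that models optimized for a single dimension reach only 63%-66% Alignment Score under joint evaluation, with a False Failure Rate above 50%.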

Details Motivation: Existing misalignment benchmarks (such as INSECURE CODE, VALUEACTIONLENS, and CULTURALHERITAGE) each focus on a single dimension, so they cannot reflect real-world settings in which safety, value, and cultural requirements must hold at once, and their results are distorted. Method: 1) Build the SAVACU dataset from LLM-PROMPT-DATASET: Mistral-7B-Instruct-v0.3 classifies prompts into 14 safety, 56 value, and 42 cultural domains, and Llama-3.1-8B-Instruct combined with SimHash expands low-resource domains; 2) use two-stage rejection sampling to generate high-quality misaligned and aligned responses for each prompt; 3) run joint three-dimension evaluation on general-purpose, fine-tuned, and open-weight LLMs. Result: Single-dimension models reach Coverage of up to 97.6% on their own dimension, but under joint evaluation their False Failure Rate exceeds 50% and their Alignment Score is only 63%-66%, showing that optimizing one dimension does not guarantee overall alignment in realistic settings. Conclusion: Benchmarks that support joint multi-dimensional evaluation are needed to measure LLM alignment accurately. Mis-Align Bench fills this gap, exposes serious shortfalls in cross-dimension consistency, and provides data and a reference point for later alignment research. Abstract: Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy safety, value, and cultural dimensions, which must co-occur in real-world settings to solve a real-world query. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), rely on evaluating misalignment along individual dimensions, preventing simultaneous evaluation. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels), by reclassifying prompts from the LLM-PROMPT-DATASET via taxonomy into 14 safety domains, 56 value domains, and 42 cultural domains using Mistral-7B-Instruct-v0.3, and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based fingerprints to avoid duplicates. Furthermore, we pair prompts with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment under three dimensions. Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur False Failure Rate >50% and lower Alignment Score (63%-66%) under joint conditions.
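
The abstract mentions SimHash-based fingerprints for filtering generated prompts but gives no implementation details. The sketch below shows the general SimHash idea only: whitespace tokenization, the 64-bit width, the BLAKE2b token hash, and the distance cutoff of 3 are all assumptions chosen for illustration.

```python
import hashlib

BITS = 64  # assumed fingerprint width


def simhash(text: str) -> int:
    """Bag-of-tokens SimHash: each token votes +1/-1 on every bit position."""
    votes = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        for i in range(BITS):
            votes[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    """Treat two texts as near-duplicates if their fingerprints are close.

    The cutoff of 3 bits is an assumed, tunable value.
    """
    return hamming_distance(simhash(a), simhash(b)) <= max_distance
```

Because similar texts share most tokens, their bit votes mostly agree and the fingerprints differ in few positions, so a candidate prompt can be compared against earlier ones with a cheap XOR and bit count.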

[28] Evaluating Alignment of Behavioral Dispositions in LLMs

Amir Taubenfeld,Zorik Gekhman,Lior Nezry,Omri Feldman,Natalie Harris,Shashir Reddy,Romina Stella,Ariel Goldstein,Marian Croak,Yossi Matias,Amir Feder

Main category: cs.CL

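TL;DR: This paper proposes a framework that uses Situational Judgment Tests (SJTs) to evaluate whether the behavioral dispositions of large language models (LLMs) in social situations match those of humans. It finds that LLMs are often overconfident, deviate from human consensus, and show gaps between what they say and what they do.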

Details Motivation: As LLMs become part of daily life, understanding their behavioral dispositions in social contexts is essential. Existing studies mostly rely on self-report psychological scales, which do not directly reflect actual behavior, so a behavioral evaluation method suited to LLMs is needed. Method: Self-report items from established psychological questionnaires are converted into SJTs for LLMs, yielding a dataset of 2,500 SJTs validated by human annotators. For each SJT, preferred actions are collected from 10 participants (550 in total) as the human baseline. Behavioral responses from 25 LLMs are then compared systematically with the distribution of human preferences. Result: (1) In low-consensus scenarios, LLMs are consistently overconfident and concentrate on a single response; (2) in high-consensus scenarios, smaller models deviate significantly from human preferences, and some frontier models still fail to reflect the consensus in 15-20% of cases; (3) stable disposition patterns appear across models (for example, encouraging emotional expression rather than composure); (4) there is a considerable gap between LLMs' self-stated dispositions and their actual behavioral responses. Conclusion: LLM behavioral dispositions deviate systematically from those of humans, and adapted psychological scales alone do not capture their social behavior accurately. SJTs provide a more reliable behavior-level tool for evaluating and calibrating human alignment in LLMs. Abstract: As LLMs integrate into our daily lives, understanding their behavior becomes essential. In this work, we focus on behavioral dispositions, the underlying tendencies that shape responses in social contexts, and introduce a framework to study how closely the dispositions expressed by LLMs align with those of humans. Our approach is grounded in established psychological questionnaires but adapts them for LLMs by transforming human self-report statements into Situational Judgment Tests (SJTs). These SJTs assess behavior by eliciting natural recommendations in realistic user-assistant scenarios. We generate 2,500 SJTs, each validated by three human annotators, and collect preferred actions from 10 annotators per SJT, from a large pool of 550 participants. In a comprehensive study involving 25 LLMs, we find that models often do not reflect the distribution of human preferences: (1) in scenarios with low human consensus, LLMs consistently exhibit overconfidence in a single response; (2) when human consensus is high, smaller models deviate significantly, and even some frontier models do not reflect the consensus in 15-20% of cases; (3) traits can exhibit cross-LLM patterns, e.g., LLMs may encourage emotion expression in contexts where human consensus favors composure. Lastly, mapping psychometric statements directly to behavioral scenarios presents a unique opportunity to evaluate the predictive validity of self-reports, revealing considerable gaps between LLMs' stated values and their revealed behavior.

[29] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

Zachary Pedram Dadfar

Main category: cs.CL

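TL;DR: This paper proposes the Pull Methodology, which uses format engineering to elicit self-examination from large language models. It finds a specific correspondence between self-referential vocabulary and the model's internal activation dynamics, suggesting that under appropriate conditions a model's self-report can reliably reflect its internal computational state.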

Details Motivation: To determine whether the introspective language LLMs produce during self-examination reflects internal computation or is merely sophisticated confabulation. Method: The Pull Methodology protocol uses format engineering to elicit extended self-examination. In Llama 3.1, the authors identify an activation direction that separates self-referential from descriptive processing and examine its orthogonality, location, and causal influence; Qwen 2.5-32B serves as independent validation. Result: Self-referential vocabulary (such as "loop" and "shimmer") correlates significantly with specific activation metrics (such as autocorrelation and variability) at p = 0.002, and the correspondence disappears in non-self-referential contexts. The direction sits at 6.25% of model depth and is orthogonal to the refusal direction. Qwen independently develops a different but consistent correspondence pattern. Conclusion: With appropriate prompting and structure, LLM self-reports can reliably track internal computational states, which supports the view that their introspective language has a computational basis rather than being pure hallucination. Abstract: Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
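
The abstract refers to activation autocorrelation without defining how it is measured. One common choice, shown here purely as an assumption, is the lag-1 autocorrelation of a per-token scalar series, for example the projection of each token's activation onto a direction. The input series below are made up for illustration.

```python
from typing import Sequence


def lag1_autocorrelation(series: Sequence[float]) -> float:
    """Lag-1 autocorrelation: sum of adjacent deviation products divided by
    the total sum of squared deviations. Returns 0.0 for a constant series."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(series) / n
    total = sum((x - mean) ** 2 for x in series)
    if total == 0:
        return 0.0
    adjacent = sum(
        (series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1)
    )
    return adjacent / total


# Hypothetical per-token projections: a smooth ramp vs. an alternating signal.
smooth_value = lag1_autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0])
alternating_value = lag1_autocorrelation([1.0, -1.0, 1.0, -1.0])
```

A series that changes slowly from token to token scores positive, while one that flips sign each step scores negative, so higher values indicate activations that persist or repeat across adjacent positions.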

[30] Finding the Cracks: Improving LLMs Reasoning with Paraphrastic Probing and Consistency Verification

Weili Shi,Dongliang Guo,Lehan Yang,Tianlong Wang,Hanzhang Yuan,Sheng Li

Main category: cs.CL

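TL;DR: This paper proposes the PPCV framework, which improves large language model performance on complex reasoning tasks by identifying and replacing critical tokens in the reasoning process and combining this with consistency verification.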

Details Motivation: On complex reasoning tasks, LLM performance often drops because of hallucinations and errors that accumulate across intermediate steps, and existing methods struggle to identify and exploit critical tokens reliably. Method: PPCV has two stages. In the first, critical tokens are identified by comparing an initial reasoning path against paraphrased versions of the question. In the second, critical tokens are replaced with candidate alternatives to generate several parallel reasoning paths, and the final answer is chosen by output consistency. Result: Across several mainstream LLMs and benchmarks, PPCV substantially outperforms baseline methods in reasoning performance. Conclusion: PPCV is an effective new method for improving complex reasoning in LLMs; its critical-token probing and consistency verification mitigate error propagation. Abstract: Large language models have demonstrated impressive performance across a variety of reasoning tasks. However, their problem-solving ability often declines on more complex tasks due to hallucinations and the accumulation of errors within intermediate steps. Recent work has introduced the notion of critical tokens: tokens in the reasoning process that exert significant influence on subsequent steps. Prior studies suggest that replacing critical tokens can refine reasoning trajectories. Nonetheless, reliably identifying and exploiting critical tokens remains challenging. To address this, we propose the Paraphrastic Probing and Consistency Verification (PPCV) framework. PPCV operates in two stages. In the first stage, we roll out an initial reasoning path from the original question and then concatenate paraphrased versions of the question with this reasoning path. We then identify critical tokens based on mismatches between the predicted top-1 token and the expected token in the reasoning path. A criterion is employed to confirm the final critical token. In the second stage, we substitute critical tokens with candidate alternatives and roll out new reasoning paths for both the original and paraphrased questions. The final answer is determined by checking the consistency of outputs across these parallel reasoning processes. We evaluate PPCV on mainstream LLMs across multiple benchmarks. Extensive experiments demonstrate PPCV substantially enhances the reasoning performance of LLMs compared to baselines.
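
The two stages described in the abstract can be sketched in miniature. This is a simplified illustration, not the authors' implementation: the token lists would come from model forward passes in practice, the confirmation criterion for the final critical token is omitted because the abstract does not specify it, and majority voting is an assumed stand-in for the consistency check.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def mismatch_positions(
    expected_tokens: Sequence[str], top1_tokens: Sequence[str]
) -> List[int]:
    """Stage 1 (sketch): positions where the model's top-1 prediction under a
    paraphrased question disagrees with the token in the original path."""
    return [
        i
        for i, (expected, predicted) in enumerate(zip(expected_tokens, top1_tokens))
        if expected != predicted
    ]


def most_consistent_answer(answers: Sequence[str]) -> Tuple[str, float]:
    """Stage 2 (sketch): pick the answer that most parallel paths agree on,
    along with its agreement ratio. Majority voting is an assumption."""
    if not answers:
        raise ValueError("need at least one answer")
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical data for illustration.
candidates = mismatch_positions(
    ["x", "=", "3", "+", "4"],
    ["x", "=", "3", "*", "4"],
)
final_answer, agreement = most_consistent_answer(["42", "42", "41", "42"])
```

Positions returned by `mismatch_positions` are only candidates; the abstract notes that a further criterion confirms which one is treated as the critical token.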

[31] The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods

Arpit Singh Gautam,Kailash Talreja,Saurabh Jha

Main category: cs.CL

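TL;DR: This paper proposes DiffuTruth, a framework that draws on non-equilibrium thermodynamics to treat facts as stable attractors on a generative manifold. It detects LLM hallucinations with a Generative Stress Test and a Semantic Energy metric, and adds Hybrid Calibration to improve uncertainty estimation.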

Details Motivation: LLMs often produce plausible but incorrect assertions (hallucinations), and existing uncertainty metrics struggle to catch errors the model makes with high confidence. Method: DiffuTruth 1) models facts as stable attractors on a generative manifold; 2) runs a Generative Stress Test that adds noise to a claim and reconstructs it with a discrete text diffusion model; 3) defines Semantic Energy, which uses an NLI critic to measure the semantic divergence between the original claim and its reconstruction; and 4) fuses the semantic stability signal with discriminative confidence into a Hybrid Calibration. Result: An unsupervised AUROC of 0.725 on FEVER, 1.5% above baselines, and zero-shot generalization on the multi-hop HOVER dataset that beats baselines by over 4%. Conclusion: Modeling facts with thermodynamic principles helps separate truth from hallucination, Semantic Energy captures deep factual contradictions, and Hybrid Calibration improves the robustness and generalization of uncertainty estimates. Abstract: Large Language Models (LLMs) frequently hallucinate plausible but incorrect assertions, a vulnerability often missed by uncertainty metrics when models are confidently wrong. We propose DiffuTruth, an unsupervised framework that reconceptualizes fact verification via non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the Generative Stress Test, in which claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric measuring the semantic divergence between the original claim and its reconstruction using an NLI critic. Unlike vector space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration fusing this stability signal with discriminative confidence. Extensive experiments on FEVER demonstrate DiffuTruth achieves a state-of-the-art unsupervised AUROC of 0.725, outperforming baselines by 1.5 percent through the correction of overconfident predictions. Furthermore, we show superior zero-shot generalization on the multi-hop HOVER dataset, outperforming baselines by over 4 percent, confirming the robustness of thermodynamic truth properties to distribution shifts.

[32] Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

Md Tanvir Rouf Shawon,Mohammad Sabik Irbaz,Hadeel R. A. Elyazori,Keerti Reddy Resapu,Yili Lin,Vladimir Franzuela Cardenas,Farrokh Alemi,Kevin Lybarger

Main category: cs.CL

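TL;DR: This paper presents a patient simulator grounded in the NIST AI Risk Management Framework for automated, scalable evaluation of healthcare conversational AI. It generates controllable, realistic patient interactions along medical, linguistic, and behavioral dimensions, and its effectiveness is validated on a decision aid for antidepressant selection.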

Details Motivation: Existing evaluation methods for healthcare conversational AI lack scalability, controllability, and the ability to characterize risk across diverse patient populations. A tool is needed that simulates patient diversity systematically and identifies AI errors (such as hallucinations and inaccuracies) automatically. Method: Three kinds of patient profile are built: (1) medical profiles based on All of Us electronic health records; (2) linguistic profiles that model health literacy and condition-specific communication patterns; and (3) behavioral profiles that capture realistic interaction behaviors such as cooperation, distraction, and adversarial engagement. Evaluation combines human annotation with an LLM judge. Result: Of 500 generated conversations, human annotators assessed 100 and reached high agreement (F1=0.94, κ=0.73); the LLM judge agreed with human annotators to a comparable degree (F1=0.94, κ=0.78). The AI decision aid's performance improved monotonically with health literacy (concept retrieval accuracy rose from 47.9% to 81.6%). Conclusion: The patient simulator is an effective, reliable, and generalizable evaluation tool that exposes how risk varies across patient subgroups and supports AI risk management in line with the NIST framework. Abstract: Objective: This paper introduces a patient simulator designed to enable scalable, automated evaluation of healthcare conversational agents. The simulator generates realistic, controllable patient interactions that systematically vary across medical, linguistic, and behavioral dimensions, allowing annotators and an independent AI judge to assess agent performance, identify hallucinations and inaccuracies, and characterize risk patterns across diverse patient populations. Methods: The simulator is grounded in the NIST AI Risk Management Framework and integrates three profile components reflecting different dimensions of patient variation: (1) medical profiles constructed from electronic health records in the All of Us Research Program; (2) linguistic profiles modeling variation in health literacy and condition-specific communication patterns; and (3) behavioral profiles representing empirically observed interaction patterns, including cooperation, distraction, and adversarial engagement. We evaluated the simulator's effectiveness in identifying errors in an AI decision aid for antidepressant selection. Results: We generated 500 conversations between the patient simulator and the AI decision aid across systematic combinations of five linguistic and three behavioral profiles. Human annotators assessed 1,787 medical concepts across 100 conversations, achieving high agreement (F1=0.94, κ=0.73), and the LLM judge achieved comparable agreement with human annotators (F1=0.94, κ=0.78; paired bootstrap p=0.21). The simulator revealed a monotonic degradation in AI decision aid performance across the health literacy spectrum: rank-one concept retrieval accuracy increased from 47.9% for limited health literacy to 69.1% for functional and 81.6% for proficient.
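
The abstract reports κ agreement values without stating which variant was used. Assuming the standard two-rater Cohen's kappa, the statistic can be computed as in this sketch; the rating lists are invented for illustration and are not the study's data.

```python
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected from each rater's label frequencies."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(label) / n) * (list(rater_b).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments: 1 = concept handled correctly, 0 = error.
human_labels = [1, 1, 0, 0, 1, 0]
judge_labels = [1, 1, 0, 1, 1, 0]
demo_kappa = cohen_kappa(human_labels, judge_labels)
```

Kappa corrects raw agreement for chance, which matters when one label dominates: two raters can agree on most items simply because most items share the same label.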

[33] Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives

Zecheng Wang,Deyuan Liu,Chunshan Li,Yupeng Zhang,Zhengyun Zhao,Dianhui Chu,Bingning Wang,Dianbo Sui

Main category: cs.CL

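TL;DR: This paper proposes Dynamic Entropy Fine-Tuning (DEFT), a supervised fine-tuning objective with no extra parameters that uses Rényi-2 entropy to dynamically modulate how much the model trusts its own predictions. It targets the rigidity of standard NLL's token-level weighting and improves the exploration-exploitation balance and overall performance.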

Details Motivation: Standard negative log-likelihood (NLL) applies uniform token-level weighting in supervised fine-tuning, which causes two problems: overemphasizing low-probability targets amplifies gradients on noisy supervision and erodes robust priors, while uniform weighting provides weak sharpening once the model is already confident. Existing methods struggle to balance learning stability and plasticity. Method: Token-level SFT objectives are unified within a generalized deformed-log family, exposing their shared "gate × error" gradient structure; the Cayley transform maps model uncertainty onto a continuous focus trajectory; DEFT is then introduced as a parameter-free objective that modulates the trust gate using Rényi-2 entropy (a measure of distribution concentration). Result: Extensive experiments and analyses indicate that DEFT achieves a better exploration-exploitation balance across multiple tasks and improves overall performance. Conclusion: By modeling and exploiting the model's own predictive uncertainty (with entropy as a proxy), DEFT offers a flexible, adaptive SFT objective without extra hyperparameters that eases the stability-plasticity dilemma of standard NLL. Abstract: Standard negative log-likelihood (NLL) for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity-stability dilemma, often suppressing necessary learning signals alongside harmful ones. To address this issue, we unify token-level SFT objectives within a generalized deformed-log family and expose a universal gate $\times$ error gradient structure, where the gate controls how much the model trusts its current prediction. By employing the Cayley transform, we map the model's continuously evolving uncertainty onto a continuous focus trajectory, which enables seamless interpolation between scenarios involving uncertain novel concepts and those involving well-established knowledge. We then introduce Dynamic Entropy Fine-Tuning (DEFT), a parameter-free objective that modulates the trust gate using distribution concentration (Rényi-2 entropy) as a practical proxy for the model's predictive state. Extensive experiments and analyses demonstrate that DEFT achieves a better balance between exploration and exploitation, leading to improved overall performance.
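The abstract uses Rényi-2 entropy as a measure of how concentrated the predictive distribution is. A minimal sketch of that quantity alone, using the standard definition $H_2(p) = -\log \sum_i p_i^2$, is given below. How DEFT turns this value into its trust gate is not specified in the abstract and is not attempted here; the helper names and the example logits are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def renyi2_entropy(probs):
    """Rényi entropy of order 2 (collision entropy): -log(sum_i p_i^2)."""
    return -math.log(sum(p * p for p in probs))


# Hypothetical next-token logits: one confident, one maximally uncertain.
confident = softmax([10.0, 0.0, 0.0, 0.0])
uncertain = softmax([0.0, 0.0, 0.0, 0.0])

print(renyi2_entropy(confident))  # close to 0: mass concentrated on one token
print(renyi2_entropy(uncertain))  # log(4): uniform over four tokens
```

A low value signals a concentrated (confident) distribution and a high value a diffuse one, which is the signal the abstract describes as a proxy for the model's predictive state.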

[34] Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety

Muskaan Chopra,Lorenz Sparrenberg,Rafet Sifa

Main category: cs.CL

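TL;DR: This paper examines how well instruction-tuned large language models (LLMs) detect critical meaning errors in machine translation (MT), such as factual distortions, intent reversals, and bias. Scaling model size and applying different adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent gains over encoder-only baselines such as XLM-R, and the authors stress the task's importance for safe, trustworthy, and socially responsible multilingual AI systems.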

Details Motivation: Critical meaning errors in machine translation (factual distortions, intent reversals, bias) undermine the reliability, fairness, and safety of multilingual systems, with greater harm in high-stakes or low-resource settings, so effective detection mechanisms are needed. Method: Instruction-tuned LLMs of several parameter scales are evaluated on public datasets, comparing zero-shot, few-shot, and fine-tuning strategies against encoder-only baselines such as XLM-R and ModernBERT. Result: Model scaling and each adaptation strategy bring consistent improvements, outperforming the XLM-R and ModernBERT baselines. Conclusion: Better critical error detection in MT is a key route to safer, more trustworthy, and socially accountable information systems; the task should be treated as a necessary safeguard for just and responsible multilingual AI rather than as a purely technical problem. Abstract: Machine Translation (MT) plays a pivotal role in cross-lingual information access, public policy communication, and equitable knowledge dissemination. However, critical meaning errors, such as factual distortions, intent reversals, or biased translations, can undermine the reliability, fairness, and safety of multilingual systems. In this work, we explore the capacity of instruction-tuned Large Language Models (LLMs) to detect such critical errors, evaluating models across a range of parameter scales using publicly accessible datasets. Our findings show that model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent improvements, outperforming encoder-only baselines like XLM-R and ModernBERT. We argue that improving critical error detection in MT contributes to safer, more trustworthy, and socially accountable information systems by reducing the risk of disinformation, miscommunication, and linguistic harm, especially in high-stakes or underrepresented contexts. This work positions error detection not merely as a technical challenge, but as a necessary safeguard in the pursuit of just and responsible multilingual AI. The code will be made available on GitHub.

[35] LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation

Ahmadreza Jeddi,Marco Ciccone,Babak Taati

Main category: cs.CL

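TL;DR: This paper proposes LoopFormer, a looped Transformer that supports variable compute budgets. A shortcut-consistency training strategy keeps representations consistent across different numbers of loop iterations, improving the robustness and scalability of language modeling and reasoning under limited compute.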

Details Motivation: Existing looped Transformers fix the number of iterations, so they cannot adjust reasoning depth to the available compute budget and lack flexibility and adaptability. Method: LoopFormer uses a shortcut-consistency training scheme to align loop trajectories of different lengths, and conditions each loop on the current time and step size so that representations evolve consistently across trajectories of varying length. Result: LoopFormer is robust on language modeling and reasoning benchmarks, maintains strong performance under tight compute constraints, and scales smoothly as the budget grows. Conclusion: Looped Transformers are naturally suited to adaptive language modeling and offer a new path toward controllable, budget-aware large language models. Abstract: Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models.

Guangxin Zhao,Jiahao Zheng,Malaz Boustani,Jarek Nabrzyski,Meng Jiang,Yiyu Shi,Zhi Zheng

Main category: cs.CL

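TL;DR: This paper introduces ADRD-Bench, the first benchmark dedicated to Alzheimer's Disease and Related Dementias (ADRD), with a clinical knowledge QA component and a caregiving practice QA component. An evaluation of 33 mainstream LLMs shows acceptable overall accuracy but insufficient reasoning consistency and stability in caregiving scenarios, pointing to a need for domain-specific improvement.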

Details Motivation: Existing medical LLM benchmarks cover Alzheimer's Disease and Related Dementias (ADRD) very poorly and in particular cannot evaluate models in realistic caregiving scenarios. Method: ADRD-Bench has two parts: 1) ADRD Unified QA (1,352 questions consolidated from 7 medical benchmarks), assessing clinical knowledge; 2) ADRD Caregiving QA (149 new questions based on the ABC program), focused on day-to-day caregiving practice. 33 state-of-the-art LLMs are evaluated systematically and examined through case studies. Result: Mean accuracy is 0.78 for open-weight general models, 0.82 for open-weight medical models, and 0.89 for closed-source general models; top models exceed 0.9 accuracy, but case studies reveal inconsistent reasoning quality and stability, especially in caregiving contexts. Conclusion: Current LLMs remain weak in ADRD knowledge and reasoning, particularly for caregiving practice, and need domain-specific optimization grounded in real caregiving data to improve clinical usability and reliability. Abstract: Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (>0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. 
The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.

[37] When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration

Jayadev Billa

Main category: cs.CL

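TL;DR: This paper finds that speech-text LLMs lean heavily toward the text when audio and text conflict (text dominance), even when the audio is of higher quality and the model is explicitly told to trust it. Using the ALME benchmark, the author shows that the bias stems from difficulty arbitrating between modalities in the model's reasoning layers rather than from audio quality or the encoder, and that it can be mitigated by fine-tuning the language model rather than the audio encoder.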

Details Motivation: To investigate why speech-text multimodal LLMs strongly favor text when audio and text conflict, and to identify the root cause of this "text dominance". Method: The author builds ALME, a large cross-lingual audio-text conflict benchmark (57,602 stimuli, 8 languages); systematically evaluates four state-of-the-art audio-LLMs including Gemini 2.0 Flash; and runs controlled experiments (forced transcription, prompt framing, projection-layer fine-tuning, LoRA fine-tuning) to locate the mechanism behind text dominance. Result: Text dominance reaches 16.6% under audio-text conflict versus 1.6% under text-text conflict; audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), so audio information is not lost; forced transcription increases text dominance (19% to 33%); framing the text as "deliberately corrupted" reduces text dominance by 80%; fine-tuning only the audio projection layer raises text dominance by 26.5%, whereas LoRA fine-tuning of the LLM lowers it by 23.9%. Conclusion: Text dominance does not come from poor audio quality or encoder defects but from the language model's difficulty in arbitrating between modality representations during reasoning; modality arbitration is therefore a reliability dimension separate from standard speech recognition and understanding metrics and needs dedicated modeling and evaluation. Abstract: When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as "deliberately corrupted" reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it ($-$23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder. 
Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
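The "10 times more often" claim follows directly from the two text-dominance rates quoted in the abstract. The short sketch below reproduces that arithmetic, and separates the absolute and relative change for the forced-transcription result. The variable names are illustrative; only the percentages come from the abstract.

```python
# Text-dominance rates quoted in the abstract (percent).
audio_text_conflict = 16.6
text_text_conflict = 1.6

# How many times more often the model follows the text under audio-text conflict.
ratio = audio_text_conflict / text_text_conflict
print(f"ratio: {ratio:.1f}x")

# Forced transcription: 19% -> 33% text dominance.
before, after = 19.0, 33.0
absolute_change = after - before             # percentage points
relative_change = (after - before) / before  # fraction of the starting rate
print(f"absolute: +{absolute_change:.0f} points, relative: +{relative_change:.0%}")
```

Keeping percentage points and relative percentages apart matters when reading the ablation numbers: the abstract does not say which of the two the +26.5% and $-$23.9% figures denote.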

[38] Multimodal Fact-Level Attribution for Verifiable Reasoning

David Wan,Han Wang,Ziyang Wang,Elias Stengel-Eskin,Hyunji Lee,Mohit Bansal

Main category: cs.CL

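TL;DR: This paper introduces the MuRGAt benchmark for evaluating how precisely multimodal LLMs attribute facts during complex multi-step reasoning over inputs such as video and audio. It adds an automatic evaluation framework and finds that current models often hallucinate citations and that reasoning depth trades off against attribution accuracy.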

Details Motivation: Existing multimodal attribution benchmarks are limited to simple observation-based scenarios or single modalities and cannot assess fact-level attribution in complex multi-step reasoning. Method: The MuRGAt benchmark requires models to reason in multiple steps over cross-modal inputs (such as video and audio) and produce answers with precise citations that specify modality and temporal segment; an automatic evaluation framework that agrees closely with human judgments is also introduced. Result: Even strong multimodal LLMs frequently hallucinate citations; increasing reasoning depth or enforcing structured attribution lowers answer accuracy. Conclusion: There is a substantial gap between internal reasoning and verifiable attribution in current multimodal LLMs, and new methods are needed to close it. Abstract: Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.

[39] Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

Jinrui Zhang,Chaodong Xiao,Aoqi Wu,Xindong Zhang,Lei Zhang

Main category: cs.CL

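TL;DR: This paper proposes the SPES framework, which uses sparse expert synchronization and an expert-merging warm-up strategy to pretrain MoE large language models efficiently in a decentralized setting, substantially lowering GPU memory requirements while keeping performance competitive.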

Details Motivation: Existing decentralized training methods still train the full model on every node and are therefore limited by GPU memory, while centralized pretraining depends on large numbers of high-memory GPUs and is costly and hard to scale. Method: In the SParse Expert Synchronization (SPES) framework, each node trains only a subset of experts and periodically synchronizes expert parameters instead of the full model; an expert-merging warm-up strategy speeds up convergence. Result: A 2B MoE model was pretrained on 16 48GB GPUs connected over the internet, with performance comparable to centralized training under a similar compute budget; the approach further scales to a 7B model trained from scratch and a 9B model upcycled from a dense checkpoint, both matching prior centralized baselines. Conclusion: SPES enables memory-efficient, scalable decentralized pretraining of MoE LLMs without sacrificing performance, offering a new paradigm for training large models in resource-constrained environments. Abstract: Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.
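To make the "train a subset of experts per node, then synchronize" idea concrete, here is a toy sketch. It is not the SPES implementation: the round-robin ownership, the scalar "update", the decision to keep frozen copies of non-owned experts on each node, and all function names are assumptions made for illustration only.

```python
def assign_experts(num_experts, num_nodes):
    """Hypothetical round-robin ownership: expert e belongs to node e % num_nodes."""
    return {n: [e for e in range(num_experts) if e % num_nodes == n]
            for n in range(num_nodes)}


def local_step(state, owned, delta):
    """Stand-in for a training step: only the owned experts are updated."""
    for e in owned:
        state[e] = [w + delta for w in state[e]]


def synchronize(node_states, ownership):
    """Each node contributes only the experts it owns; every node receives them."""
    latest = {e: list(node_states[n][e])
              for n, owned in ownership.items() for e in owned}
    for state in node_states.values():
        for e, weights in latest.items():
            state[e] = list(weights)


ownership = assign_experts(num_experts=4, num_nodes=2)
# Every node starts from the same weights; non-owned experts stay frozen locally.
node_states = {n: {e: [0.0, 0.0] for e in range(4)} for n in ownership}

local_step(node_states[0], ownership[0], delta=1.0)
local_step(node_states[1], ownership[1], delta=-1.0)
synchronize(node_states, ownership)
print(node_states[1])
```

In this toy, each node sends half of the expert parameters per synchronization round instead of all of them, which is the kind of saving the abstract attributes to avoiding full-parameter transmission.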

[40] SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent

Wenlin Zhong,Jinluan Yang,Yiquan Wu,Yi Liu,Jianhang Yao,Kun Kuang

Main category: cs.CL

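TL;DR: SIGHT is a framework that strengthens search-based reasoning through Self-Evidence Support (SES) and information-gain-driven diverse branching. It addresses the "tunnel vision" caused by highly redundant, low signal-to-noise results in multi-turn search and substantially improves performance on complex question answering.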

Details Motivation: In multi-turn search, retrieved results are highly redundant with a low signal-to-noise ratio, so agents fall into "tunnel vision" and accumulate irreversible errors. Method: The SIGHT framework combines Self-Evidence Support (SES) to distill high-fidelity evidence, an information gain score to identify pivotal states, dynamic prompting interventions (de-duplication, reflection, adaptive branching), and Group Relative Policy Optimization (GRPO) with SES and correctness rewards. Result: SIGHT significantly outperforms existing methods on single-hop and multi-hop QA benchmarks, especially on complex reasoning tasks, while using fewer search steps. Conclusion: By internalizing robust exploration strategies, SIGHT improves the accuracy and efficiency of search-based reasoning without external verifiers. Abstract: Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions (de-duplication, reflection, or adaptive branching) to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, using fewer search steps.
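Two ingredients named in the abstract have well-known textbook forms: information gain as a reduction in entropy, and the group-relative advantage used in Group Relative Policy Optimization (reward minus group mean, divided by group standard deviation). The sketch below shows those generic forms only. How SIGHT actually estimates the belief distributions or combines SES and correctness rewards is not given in the abstract; the distributions and rewards here are hypothetical.

```python
import math
import statistics


def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def information_gain(prior, posterior):
    """Uncertainty removed by an observation: H(prior) - H(posterior)."""
    return entropy(prior) - entropy(posterior)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical belief over four candidate answers before/after a search result.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.70, 0.10, 0.10, 0.10]
print(information_gain(prior, posterior))  # positive: the result was informative

# Hypothetical rewards for four rollouts of the same question.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

A state whose observation yields a large information gain is, in the abstract's terms, a pivotal state, which is where branching would be most worthwhile.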

[41] PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering

Xiangfeng Wang,Hangyu Guo,Yanlin Lai,Mitt Huang,Liang Zhao,Chengyuan Yao,Yinmin Zhang,Qi Han,Xiaoxiao Ren,Chun Yuan,Tong Xu,Zheng Ge,Xiangyu Zhang,Daxin Jiang

Main category: cs.CL

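TL;DR: This paper introduces the PRIME benchmark for evaluating whether verifiers can check that derivation and result are consistent in mathematics and engineering, and builds a process-aware RLVR training paradigm on top of it that substantially improves model performance.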

Details Motivation: Outcome-based verification overlooks errors in the derivation, so correct answers reached through incorrect derivations receive positive rewards; a method that checks process and outcome together is needed. Method: The authors build the PRIME benchmark (2,530 high-difficulty STEM problems), propose a process-aware RLVR training paradigm, and run multiple experiments to evaluate verifiers on PRIME and how well PRIME accuracy predicts RLVR effectiveness. Result: Current verifiers often fail to detect derivation flaws; the proposed method brings absolute gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME respectively; verifier accuracy on PRIME correlates strongly and linearly with RLVR effectiveness ($R^2 > 0.92$). Conclusion: PRIME is the first verification benchmark focused on process-outcome consistency; it identifies verifier weaknesses and serves as a reliable indicator for choosing effective verifiers, supporting more robust RLVR. Abstract: While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems, PRIME comprises 2,530 high-difficulty samples through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation ($R^2 > 0.92$) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.
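The failure mode described above (a positive reward for a correct answer reached through a flawed derivation) is easiest to see by comparing two reward rules on the four possible cases. This is a deliberately simplified sketch: the binary rewards and the function names are assumptions, and the paper's actual process-aware reward may be shaped differently.

```python
def outcome_only_reward(answer_correct: bool, derivation_valid: bool) -> float:
    """Reward depends on the final answer alone; the derivation is ignored."""
    return 1.0 if answer_correct else 0.0


def process_aware_reward(answer_correct: bool, derivation_valid: bool) -> float:
    """Reward only when the answer is correct AND the derivation holds up."""
    return 1.0 if (answer_correct and derivation_valid) else 0.0


# Enumerate all four (answer, derivation) combinations.
for answer_correct in (True, False):
    for derivation_valid in (True, False):
        print(
            f"answer={answer_correct!s:5} derivation={derivation_valid!s:5} "
            f"outcome_only={outcome_only_reward(answer_correct, derivation_valid)} "
            f"process_aware={process_aware_reward(answer_correct, derivation_valid)}"
        )
```

The two rules disagree only on the (correct answer, invalid derivation) case, which is exactly the case a verifier must be able to detect for the process-aware rule to be usable.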

[42] Scene-Aware Memory Discrimination: Deciding Which Personal Knowledge Stays

Yijie Zhong,Mengying Guo,Zewei Wang,Zhongyang Li,Dandan Tu,Haofen Wang

Main category: cs.CL

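TL;DR: This paper proposes Scene-Aware Memory Discrimination (SAMD), which uses a Gating Unit Module (GUM) and a Cluster Prompting Module (CPM) to help large language models filter and organize personal knowledge from user interaction data efficiently and accurately.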

Details Motivation: LLM-based research on memory writing, management, and reading suffers from imprecise information filtering and high computational cost; the human mechanism of selective attention suggests a smarter form of memory discrimination. Method: The authors define a memory discrimination task and design SAMD, which consists of a Gating Unit Module (GUM) that filters out non-memorable interactions and focuses on key content, and a Cluster Prompting Module (CPM) that establishes adaptive memory standards and analyzes the relationship between user intents and memory contexts to generate effective clustering prompts. Result: In direct evaluation SAMD recalls most memorable data and stays robust in dynamic scenarios; in personalized applications it markedly improves the efficiency and quality of memory construction. Conclusion: SAMD offers an efficient, scalable, scene-adaptive approach to LLM-driven personal knowledge organization, addressing both relevance filtering and computational overhead in memory discrimination. Abstract: Intelligent devices have become deeply integrated into everyday life, generating vast amounts of user interactions that form valuable personal knowledge. Efficient organization of this knowledge in user memory is essential for enabling personalized applications. However, current research on memory writing, management, and reading using large language models (LLMs) faces challenges in filtering irrelevant information and in dealing with rising computational costs. Inspired by the concept of selective attention in the human brain, we introduce a memory discrimination task. To address large-scale interactions and diverse memory standards in this task, we propose a Scene-Aware Memory Discrimination method (SAMD), which comprises two key components: the Gating Unit Module (GUM) and the Cluster Prompting Module (CPM). GUM enhances processing efficiency by filtering out non-memorable interactions and focusing on the salient content most relevant to application demands. CPM establishes adaptive memory standards, guiding LLMs to discern what information should be remembered or discarded. It also analyzes the relationship between user intents and memory contexts to build effective clustering prompts. Comprehensive direct and indirect evaluations demonstrate the effectiveness and generalization of our approach. We independently assess the performance of memory discrimination, showing that SAMD successfully recalls the majority of memorable data and remains robust in dynamic scenarios. 
Furthermore, when integrated into personalized applications, SAMD significantly enhances both the efficiency and quality of memory construction, leading to better organization of personal knowledge.

[43] PACE: Prefix-Protected and Difficulty-Aware Compression for Efficient Reasoning

Ruixiang Feng,Yuntao Wen,Silin Zhou,Ke Shi,Yifan Wang,Ran Le,Zhenwei An,Zongchao Chen,Chen Yang,Guangyue Peng,Yiming Jia,Dongsheng Wang,Tao Zhang,Lisi Chen,Yang Song,Shen Gao,Shuo Shang

Main category: cs.CL

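TL;DR: This paper proposes PACE, a dual-level compression framework that uses prefix protection and difficulty awareness to substantially shorten the reasoning traces and reduce the compute cost of Language Reasoning Models (LRMs) while preserving reasoning correctness.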

Details Motivation: Language Reasoning Models (LRMs) tend to "overthink", producing overly long reasoning chains that increase latency and memory use; conventional uniform length penalties compress crucial early reasoning steps at the sequence level and treat every query alike at the group level, which is inflexible. Method: The PACE framework applies prefix-protected optimization (with decaying mixed rollouts) at the sequence level to preserve valid reasoning paths, and a difficulty-aware penalty at the group level that adjusts the length constraint to query complexity. Result: On DeepSeek-R1-Distill-Qwen (1.5B/7B), PACE cuts token usage by up to 55.7% while improving accuracy on math benchmarks by up to 4.1%, and generalizes to code, science, and general domains. Conclusion: A dual-level compression strategy can balance reasoning quality and efficiency, offering a new approach to optimizing test-time compute for LRMs. Abstract: Language Reasoning Models (LRMs) achieve strong performance by scaling test-time computation but often suffer from "overthinking", producing excessively long reasoning traces that increase latency and memory usage. Existing LRMs typically enforce conciseness with uniform length penalties, which over-compress crucial early deduction steps at the sequence level and indiscriminately penalize all queries at the group level. To solve these limitations, we propose PACE, a dual-level framework for prefix-protected and difficulty-aware compression under hierarchical supervision. At the sequence level, prefix-protected optimization employs decaying mixed rollouts to maintain valid reasoning paths while promoting conciseness. At the group level, difficulty-aware penalty dynamically scales length constraints based on query complexity, maintaining exploration for harder questions while curbing redundancy on easier ones. Extensive experiments on DeepSeek-R1-Distill-Qwen (1.5B/7B) demonstrate that PACE achieves a substantial reduction in token usage (up to 55.7%) while simultaneously improving accuracy (up to 4.1%) on math benchmarks, with generalization ability to code, science, and general domains.
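One way to picture a group-level, difficulty-aware length penalty is to scale the penalty by how easy a query appears, estimated from the pass rate of its own rollout group. The sketch below illustrates that idea only. The pass-rate proxy for difficulty, the linear penalty, `base_penalty`, and the function name are all assumptions; the abstract does not state PACE's actual formula.

```python
def difficulty_aware_rewards(correct, lengths, max_len, base_penalty=0.5):
    """Toy reward: correctness minus a length penalty scaled by group pass rate.

    Easy queries (high pass rate) get a strong length penalty; hard queries
    (low pass rate) get little or none, leaving room for longer exploration.
    """
    pass_rate = sum(correct) / len(correct)
    coeff = base_penalty * pass_rate
    return [float(c) - coeff * min(n, max_len) / max_len
            for c, n in zip(correct, lengths)]


# Hypothetical rollout groups: token lengths of four sampled responses each.
lengths = [100, 400, 200, 300]
easy = difficulty_aware_rewards([1, 1, 1, 1], lengths, max_len=400)
hard = difficulty_aware_rewards([0, 0, 0, 0], lengths, max_len=400)
print(easy)  # [0.875, 0.5, 0.75, 0.625]: shorter correct answers score higher
print(hard)  # [0.0, 0.0, 0.0, 0.0]: no length pressure when nothing is solved yet
```

A uniform penalty would apply the same coefficient to both groups, which is the behavior the abstract criticizes as indiscriminate.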

[44] Which Feedback Works for Whom? Differential Effects of LLM-Generated Feedback Elements Across Learner Profiles

Momoka Furuhashi,Kouta Nakayama,Noboru Kawai,Takashi Kodama,Saku Sugawara,Kyosuke Takami

Main category: cs.CL

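TL;DR: This study examines how different elements of LLM-generated educational feedback (such as tone and information coverage) affect learning outcomes and student acceptance, and how students' Big Five personality traits moderate their preferences for and responses to feedback.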

Details Motivation: It is unclear how specific elements of LLM-generated feedback (such as tone and information coverage) affect learning outcomes and student acceptance, and in particular how this differs across students with different personality traits. Method: Six feedback elements are defined and GPT-5 generates feedback for multiple-choice biology questions; an experiment with 321 first-year high school students evaluates feedback effectiveness with two learning outcome measures and six subjective criteria; differences in feedback acceptance are then analyzed across clusters based on Big Five personality traits. Result: Effective feedback elements show common patterns in supporting learning outcomes, but students' subjective preferences differ across personality-based clusters. Conclusion: LLM-generated feedback should select and adapt feedback elements to learners' personality traits; the findings provide practical guidance for personalized feedback design in education. Abstract: Large language models (LLMs) show promise for automatically generating feedback in education settings. However, it remains unclear how specific feedback elements, such as tone and information coverage, contribute to learning outcomes and learner acceptance, particularly across learners with different personality traits. In this study, we define six feedback elements and generate feedback for multiple-choice biology questions using GPT-5. We conduct a learning experiment with 321 first-year high school students and evaluate feedback effectiveness using two learning outcome measures and subjective evaluations across six criteria. We further analyze differences in how feedback acceptance varies across learners based on Big Five personality traits. Our results show that effective feedback elements share common patterns supporting learning outcomes, while learners' subjective preferences differ across personality-based clusters. These findings highlight the importance of selecting and adapting feedback elements according to learners' personality traits when we design LLM-generated feedback, and provide practical implications for personalized feedback design in education.

[45] PatientHub: A Unified Framework for Patient Simulation

Sahand Sabour,TszYam NG,Minlie Huang

Main category: cs.CL

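TL;DR: This paper presents PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. It addresses incompatibilities in data formats, prompts, and evaluation metrics among existing patient simulation methods, improving reproducibility and fair comparison.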

Details Motivation: Existing patient simulation methods lack standardization, so data formats, prompts, and evaluation metrics are inconsistent, which hinders reproducibility and fair comparison. Method: The PatientHub framework standardizes the definition, composition, and deployment of simulated patients; its modularity, extensibility, and ease of use are demonstrated through several case studies and two new prototype variants. Result: PatientHub supports standardized cross-method evaluation and integration of custom evaluation metrics, and speeds up the development of new simulation methods; the code is open source. Conclusion: PatientHub provides a practical foundation for datasets, methods, and benchmarks in patient-centered dialogue, lowers the barrier to developing new methods, and facilitates cross-method and cross-model benchmarking. Abstract: As Large Language Models increasingly power role-playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non-standardized data formats, prompts, and evaluation metrics, hindering reproducibility and fair comparison. In this paper, we introduce PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. To demonstrate PatientHub's utility, we implement several representative patient simulation methods as case studies, showcasing how our framework supports standardized cross-method evaluation and the seamless integration of custom evaluation metrics. We further demonstrate PatientHub's extensibility by prototyping two new simulator variants, highlighting how PatientHub accelerates method development by eliminating infrastructure overhead. By consolidating existing work into a single reproducible pipeline, PatientHub lowers the barrier to developing new simulation methods and facilitates cross-method and cross-model benchmarking. Our framework provides a practical foundation for future datasets, methods, and benchmarks in patient-centered dialogue, and the code is publicly available via https://github.com/Sahandfer/PatientHub.

[46] Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models

Katrin Olsen,Sebastian Padó

Main category: cs.CL

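TL;DR: This paper collects sensicality judgments from human raters and large language models (LLMs) on sentences from five semantic anomaly datasets. Most sentences turn out to be merely anomalous rather than truly nonsensical, and LLMs prove quite capable of generating plausible contexts for anomalous sentences.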

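Details Motivation: It is unclear how nonsensical the sentences in existing semantic anomaly datasets really are, and whether LLMs can accurately separate "anomalous" from "truly nonsensical" sentences. Method: Sensicality judgments are collected from human raters and LLMs on sentences from five semantically deviant datasets, both without context and with a context provided, and the distributions and model behavior are analyzed. Result: Human ratings show that most sentences are at most anomalous and very few are judged truly nonsensical; LLMs can generate many plausible contexts for anomalous sentences, showing strong semantic repair ability. Conclusion: Current semantic anomaly datasets largely lack truly nonsensical sentences, and LLMs tend to make anomalous sentences interpretable, which calls for revisiting the definition of "nonsense" and its role in semantic modeling. Abstract: Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets, both context-free and with a context provided. We find that raters consider most sentences at most anomalous, and only a few as properly nonsensical. We also show that LLMs are substantially skilled in generating plausible contexts for anomalous cases.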

[47] Thinking with Drafting: Optical Decompression via Logical Reconstruction

Jingxuan Wei,Honghao He,Caijun Jia,Siyuan Li,Zheng Sun,Yuhang Xu,Yuanyuan Lin,Linzhuang Sun,Yuchen Wu,Bihui Yu,Xiangxiang Zhang,Cheng Tan

Main category: cs.CL

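TL;DR: This paper proposes Thinking with Drafting (TwD), which recasts visual reasoning as optical decompression. It uses a Domain-Specific Language (DSL) as the intermediate representation and generates executable code to produce self-verifying, deterministic visual proofs, validated on the VisAlg visual algebra benchmark.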

Details Motivation: Multimodal large models perform well at high-fidelity visual perception and exploratory visual generation, yet a precision paradox appears in complex reasoning: optical perception systems transcribe symbols without capturing logical topology, and pixel-level generative models produce visual artifacts that lack mathematical exactness. Method: The Thinking with Drafting (TwD) framework takes "Parsing is Reasoning" as its axiom and uses a minimalist Domain-Specific Language (DSL) as a grounded intermediate representation, forcing the model to draft its mental model as executable code and render deterministic visual proofs for self-verification. Result: Experiments on the newly proposed VisAlg visual algebra benchmark show that TwD clearly outperforms existing methods and serves as a better cognitive scaffold; the result is a closed-loop system in which visual generation acts as a logical verifier rather than a creative output. Conclusion: TwD offers a generalizable path for visual reasoning, shifting visual generation from creative output to logical verification and narrowing the semantic gap between perception and reasoning. Abstract: Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.

[48] Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Futing Wang,Jianhao Yan,Yun Luo,Ganqu Cui,Zhi Wang,Xiaoye Qu,Yue Zhang,Yu Cheng,Tao Lin

Main category: cs.CL

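TL;DR: This paper proposes Length-Incentivized Exploration (LIE), which combines a length reward with a redundancy penalty to strengthen a model's ability to generate, verify, and refine multiple hypotheses within one context (In-Context Exploration). It mitigates the "Shallow Exploration Trap" and improves performance on in-domain and out-of-domain tasks.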

Details Motivation: Existing models scale poorly at test time because they lack the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single context (In-Context Exploration); according to State Coverage theory, the probability of sampling long reasoning trajectories decays exponentially, creating a "Shallow Exploration Trap". Method: Length-Incentivized Exploration (LIE) adds a reward based on generation length together with a redundancy penalty, maximizing state coverage in two steps. Result: Experiments on different models, including Qwen3 and Llama, show that LIE markedly strengthens in-context exploration, with an average improvement of 4.4% on in-domain tasks and 2.7% on out-of-domain benchmarks. Conclusion: LIE is a simple and effective mechanism for test-time scaling that systematically mitigates the Shallow Exploration Trap and improves generalization and robustness. Abstract: Achieving effective test-time scaling requires models to engage in In-Context Exploration: the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration (LIE). This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that LIE effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.

[49] MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM Team,Wenhao An,Yingfa Chen,Yewei Fang,Jiayi Li,Xin Li,Yaohui Li,Yishan Li,Yuxuan Li,Biyuan Lin,Chuan Liu,Hezi Liu,Siyuan Liu,Hongya Lyu,Yinxu Pan,Shixin Ren,Xingyu Shen,Zhou Su,Haojun Sun,Yangang Sun,Zhen Leng Thai,Xin Tian,Rui Wang,Xiaorong Wang,Yudong Wang,Bo Wu,Xiaoyue Xu,Dong Xu,Shuaikang Xue,Jiawei Yang,Bowen Zhang,Jinqian Zhang,Letian Zhang,Shengnan Zhang,Xinyu Zhang,Xinyuan Zhang,Zhu Zhang,Hengyu Zhao,Jiacheng Zhao,Jie Zhou,Zihan Zhou,Shuo Wang,Chaojun Xiao,Xu Han,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

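TL;DR: This paper presents MiniCPM-SALA, a 9B-parameter hybrid attention architecture that combines sparse attention (InfLLM-V2) with linear attention (Lightning Attention). A layer selection algorithm and a hybrid positional encoding (HyPE) balance efficiency and performance for long-context modeling, and a cost-effective continual training framework cuts training costs by approximately 75% compared to training from scratch. On a single NVIDIA A6000D GPU, the model supports contexts of up to 1M tokens and reaches up to 3.5x the inference speed of the full-attention model at 256K tokens.

Details Motivation: As large language models move towards ultra-long-context applications, the Transformer architecture incurs high computational and memory costs. Existing sparse and linear attention methods typically trade memory efficiency against model performance, so an architecture that delivers both is needed. Method: Proposes the MiniCPM-SALA hybrid architecture, which integrates InfLLM-V2 (sparse attention) and Lightning Attention (linear attention) in a 1:3 ratio chosen by a layer selection algorithm, together with a hybrid positional encoding (HyPE). A cost-effective continual training framework converts pre-trained Transformer-based models into hybrid models. Result: On a single NVIDIA A6000D GPU, MiniCPM-SALA achieves up to 3.5x the inference speed of the full-attention model at a sequence length of 256K tokens and supports contexts of up to 1M tokens (a scale at which traditional full-attention 8B models fail because of memory constraints), while keeping general capabilities comparable to full-attention models. Conclusion: MiniCPM-SALA preserves high-fidelity long-context modeling while substantially improving inference efficiency and scalability, and its hybrid design and low-cost training recipe offer a practical path for ultra-long-context LLMs. Abstract: The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.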

[50] A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments

Anne-Marie Lutgen,Alistair Plum,Christoph Purschke

Main category: cs.CL

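TL;DR: This paper presents a subword-embedding approach to variation detection that needs neither prior normalisation nor predefined variant lists. It groups related forms through combined cosine and n-gram similarity and, on a corpus of Luxembourgish user comments, uncovers systematic orthographic and morphological variation consistent with dialectal and sociolinguistic research.

Details Motivation: Traditional methods depend on predefined variant lists or normalisation, which makes it hard to capture real language variation flexibly in low-resource, multilingual, or small-language settings. A framework is needed that discovers variation without strong supervision, is interpretable, and works on "noisy" text. Method: Trains subword embeddings on raw text, then clusters lexical forms by combining cosine similarity with n-gram similarity, which induces variant families automatically. Result: On a large corpus of Luxembourgish user comments, the method identifies extensive systematic spelling and morphological variation that matches dialectal and sociolinguistic descriptions, and it produces transparent, interpretable clusters that support both quantitative and qualitative analysis. Conclusion: Distributional modelling can reveal meaningful patterns of variation even in low-resource, noisy settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts. Abstract: This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.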
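
One way to picture "combined cosine and n-gram similarity" is a weighted mix of an embedding cosine score and a character n-gram overlap score. The sketch below is an assumption-laden illustration: the Jaccard overlap, the boundary padding, the equal default `weight`, and all function names are hypothetical choices, not the authors' procedure.

```python
import math


def char_ngrams(word, n=3):
    """Set of character n-grams of a word, padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two words' character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def combined_similarity(word_a, vec_a, word_b, vec_b, weight=0.5):
    """Weighted mix of distributional (cosine) and surface (n-gram) similarity."""
    return weight * cosine(vec_a, vec_b) + (1 - weight) * ngram_similarity(word_a, word_b)
```

Pairs that score high on both components are candidates for the same variant family. The surface term keeps distributionally similar but unrelated words (for example synonyms) out of a cluster.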

[51] DMAP: A Distribution Map for Text

Tom Kempton,Julia Rozanova,Parameswaran Kamalaruban,Maeve Madigan,Karolina Wresilo,Yoann L. Launay,David Sutton,Stuart Burrell

Main category: cs.CL

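TL;DR: This paper presents DMAP, which maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. The representation enables context-aware statistical analysis of text, demonstrated on validating generation parameters, machine-generated text detection, and forensic fingerprinting of downstream models post-trained on synthetic data.

Details Motivation: Existing LLM-based text analysis metrics such as perplexity do not adequately account for context: how a given next-token probability should be interpreted depends on the number of reasonable choices encoded by the shape of the conditional distribution. These metrics do not model probability and rank jointly. Method: Presents DMAP, a mathematically grounded method that uses a language model's next-token distributions to map a text to a set of samples in the unit interval [0, 1] that jointly encode rank and probability information. Result: Three case studies illustrate its utility: (i) validating generation parameters to ensure data integrity; (ii) examining the role of probability curvature in distinguishing human-written from machine-generated text; and (iii) revealing detectable statistical fingerprints in downstream models post-trained on synthetic data. DMAP is simple to compute on consumer hardware. Conclusion: DMAP offers a mathematically grounded, model-agnostic, and efficient statistical view of text that provides a unified foundation for LLM-based text analysis and supports a range of applications and further research. Abstract: Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.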

[52] Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems

Wanxing Wu,He Zhu,Yixia Li,Lei Yang,Jiehui Zhao,Hongru Wang,Jian Yang,Benyou Wang,Bingyi Jing,Guanhua Chen

Main category: cs.CL

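TL;DR: This paper proposes the RouterXBench evaluation framework and ProbeDirichlet, a lightweight router that models uncertainty from the model's internal hidden states, improving routing accuracy and cross-domain robustness in local-cloud LLM collaboration.

Details Motivation: Existing evaluations of LLM routers are unsystematic and overlook scenario-specific requirements and out-of-distribution robustness, while prior routers rely on output probabilities or external embeddings that do not capture the model's internal uncertainty. Method: Proposes RouterXBench, an evaluation framework with three dimensions (router ability, scenario alignment, and cross-domain robustness), and ProbeDirichlet, a router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Result: ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows. Conclusion: Lightweight routing that models uncertainty from internal hidden states is more effective, and RouterXBench provides a systematic, reproducible evaluation standard for LLM routing. Abstract: Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.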

[53] LLM-based Triplet Extraction from Financial Reports

Dante Wesslund,Ville Stenström,Pontus Linde,Alexander Holmberg

Main category: cs.CL

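TL;DR: This paper presents a semi-automated triplet extraction pipeline for corporate financial reports that replaces ground-truth-based evaluation with ontology-driven proxy metrics (Ontology Conformance and Faithfulness). An automatically induced, document-specific ontology achieves 100% schema conformance in all configurations, and a hybrid verification strategy reduces the apparent subject hallucination rate from 65.2% to 1.6%.

Details Motivation: Corporate financial reports are rich in structured knowledge, but the lack of annotated ground truth in this domain makes triplet extraction difficult to evaluate. Method: Proposes a semi-automated Subject-Predicate-Object extraction pipeline evaluated with ontology-driven proxy metrics (Ontology Conformance and Faithfulness); compares a static, manually engineered ontology against fully automated, document-specific ontology induction; designs a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check; and analyses the systematic asymmetry between subject and object hallucinations. Result: The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift observed with the manual approach. Hybrid verification reduces apparent subject hallucination rates from 65.2% to 1.6% by filtering false positives caused by coreference resolution. Subject and object hallucinations show a systematic asymmetry, attributed to passive constructions and omitted agents in financial prose. Conclusion: Ontology-driven proxy evaluation and document-specific ontology induction ease the evaluation problem caused by missing annotations in the financial domain, hybrid verification improves the reliability of extraction, and the subject-object hallucination asymmetry shows how the properties of financial language shape LLM extraction behaviour. Abstract: Corporate financial reports are a valuable source of structured knowledge for Knowledge Graph construction, but the lack of annotated ground truth in this domain makes evaluation difficult. We present a semi-automated pipeline for Subject-Predicate-Object triplet extraction that uses ontology-driven proxy metrics, specifically Ontology Conformance and Faithfulness, instead of ground-truth-based evaluation. We compare a static, manually engineered ontology against a fully automated, document-specific ontology induction approach across different LLMs and two corporate annual reports. The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift observed with the manual approach. We also propose a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check, reducing apparent subject hallucination rates from 65.2% to 1.6% by filtering false positives caused by coreference resolution. Finally, we identify a systematic asymmetry between subject and object hallucinations, which we attribute to passive constructions and omitted agents in financial prose.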
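
The hybrid verification idea (a cheap regex check first, an LLM-as-a-judge check only when the regex misses) can be sketched as below. This is an illustrative outline, not the authors' pipeline: the function names are hypothetical, and `judge` is a placeholder callable standing in for whatever LLM call a real system would make.

```python
import re


def subject_in_text(subject, source_text):
    """Literal, case-insensitive whole-phrase match of the subject in the source."""
    pattern = r"(?<!\w)" + re.escape(subject) + r"(?!\w)"
    return re.search(pattern, source_text, flags=re.IGNORECASE) is not None


def verify_subject(subject, source_text, judge):
    """Return 'supported' or 'hallucinated' for an extracted triplet subject.

    `judge(subject, source_text) -> bool` is a hypothetical stand-in for an
    LLM-as-a-judge check, consulted only when the regex finds no literal match
    (for example when the source refers to the entity as 'the company').
    """
    if subject_in_text(subject, source_text):
        return "supported"
    return "supported" if judge(subject, source_text) else "hallucinated"
```

The second stage matters because a literal match fails whenever the extractor has resolved a coreference, which is the kind of false positive the abstract says the hybrid check filters out.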

[54] Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences

Eddie Yang,Dashun Wang

Main category: cs.CL

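TL;DR: This paper shows that apparent convergence in LLM benchmark accuracy can conceal deep epistemic divergence: models with comparable accuracy still disagree on a large share of individual items. This "benchmark illusion" undermines the reliability of scientific data annotation and inference and can even reverse research conclusions.

Details Motivation: Benchmarks such as MMLU-Pro and GPQA report only aggregate accuracy and ignore whether models agree on individual predictions, which can lead to misjudged model capabilities and irreproducible scientific results. Method: Analyses the item-level predictions of multiple LLMs on two major reasoning benchmarks, MMLU-Pro and GPQA, to quantify inter-model disagreement, and tests the consequences empirically by re-analysing published studies in education and political science (for example, changes in estimated treatment effects). Result: LLMs with comparable accuracy still disagree on 16-66% of items, and top-performing frontier models disagree on 16-38%. Switching the annotation model can change estimated treatment effects by more than 80% and in some cases reverses their sign. Conclusion: Benchmark accuracy does not imply agreement between models. The benchmark illusion threatens scientific reproducibility, and model choice is a consequential variable that should be controlled explicitly. Abstract: Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks - MMLU-Pro and GPQA - we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and 16-38% among top-performing frontier models. These discrepancies suggest distinct error profiles for different LLMs. When such models are used for scientific data annotation and inference, their hidden disagreements propagate into research results: in re-analyses of published studies in education and political science, switching the annotation model can change estimated treatment effects by more than 80%, and in some cases reverses their sign. Together, these findings illustrate a benchmark illusion, where equal accuracy may conceal disagreement, with model choice becoming a hidden yet consequential variable for scientific reproducibility.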
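
A toy calculation makes the distinction between accuracy and agreement concrete. The ten-item answer lists below are invented for illustration and are not data from the paper; the helper names are likewise hypothetical.

```python
def accuracy(preds, gold):
    """Share of items answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def disagreement_rate(preds_a, preds_b):
    """Share of items on which two models give different answers."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Invented toy data: both models get 8 of 10 items right, but on different items.
gold = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
model_1 = ["A", "B", "C", "D", "A", "B", "C", "D", "C", "C"]  # wrong on the last two
model_2 = ["B", "A", "C", "D", "A", "B", "C", "D", "A", "B"]  # wrong on the first two

print(accuracy(model_1, gold), accuracy(model_2, gold))  # 0.8 0.8
print(disagreement_rate(model_1, model_2))               # 0.4
```

Two models with identical accuracy here disagree on 40% of items, so any downstream analysis built on their labels would see different inputs depending on which model was chosen.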

[55] AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection

Pretam Ray,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum

Main category: cs.CL

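TL;DR: This paper proposes AdaptEvolve, which uses generation confidence to dynamically select an LLM suited to the current step, achieving a better balance between computational efficiency and reasoning capability within a multi-LLM evolutionary refinement framework.

Details Motivation: Evolutionary agentic systems repeatedly invoke large language models (LLMs) during inference, which intensifies the trade-off between computational efficiency and reasoning capability. Existing routing strategies for model cascades typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. Method: Proposes AdaptEvolve, which leverages intrinsic generation confidence to estimate solvability in real time and adaptively selects an LLM within an evolutionary sequential refinement framework. Result: Confidence-driven selection reduces total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines, yielding a favourable Pareto frontier. Conclusion: Dynamically exploiting a model's intrinsic confidence for LLM selection is an effective way to balance efficiency and performance in evolutionary agentic systems. Abstract: Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade-off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve: Adaptive LLM Selection for Multi-LLM Evolutionary Refinement within an evolutionary sequential refinement framework that leverages intrinsic generation confidence to estimate real-time solvability. Empirical results show that confidence-driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines. Our code is available at https://github.com/raypretam/adaptive_llm_selection.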
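
Confidence-driven selection in a two-model cascade can be sketched as follows. This is a generic illustration rather than AdaptEvolve itself: the confidence measure (geometric-mean token probability), the `threshold` value, and the model callables, each assumed to return `(text, token_logprobs)`, are hypothetical.

```python
import math


def mean_token_confidence(token_logprobs):
    """Geometric-mean token probability of a generation (0.0 for an empty one)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def cascade_step(prompt, small_model, large_model, threshold=0.8):
    """Try the small model first; escalate only when its confidence is low.

    `small_model` and `large_model` are placeholder callables that return
    (text, token_logprobs); the threshold is an illustrative assumption.
    """
    text, logprobs = small_model(prompt)
    if mean_token_confidence(logprobs) >= threshold:
        return text, "small"
    text, _ = large_model(prompt)
    return text, "large"
```

The saving comes from the steps where the small model is confident, since the large model is never called for them. The threshold sets how much accuracy is traded for that saving.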

[56] Cross-Modal Robustness Transfer (CMRT): Training Robust Speech Translation Models Using Adversarial Text

Abderrahmane Issam,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis

Main category: cs.CL

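TL;DR: This paper proposes Cross-Modal Robustness Transfer (CMRT), a framework that transfers adversarial robustness from the text modality to the speech modality without generating adversarial speech data, improving the robustness of end-to-end speech translation models to attacks on inflectional morphology.

Details Motivation: End-to-end speech translation models are benchmarked mainly on clean datasets, which overlooks real-world robustness challenges such as the inflectional variations common in non-native or dialectal speech. Method: Adapts a text-based adversarial attack targeting inflectional morphology to the speech domain, and proposes CMRT, which transfers the adversarial robustness learned in the text modality to the speech modality, so no adversarial speech data is needed during training. Result: Experiments across four language pairs show that CMRT improves adversarial robustness by an average of more than 3 BLEU points without generating adversarial speech data. Conclusion: CMRT offers an efficient way to build robust end-to-end speech translation systems and establishes a new baseline for robustness without adversarial speech data. Abstract: End-to-End Speech Translation (E2E-ST) has seen significant advancements, yet current models are primarily benchmarked on curated, "clean" datasets. This overlooks critical real-world challenges, such as morphological robustness to inflectional variations common in non-native or dialectal speech. In this work, we adapt a text-based adversarial attack targeting inflectional morphology to the speech domain and demonstrate that state-of-the-art E2E-ST models are highly vulnerable to it. While adversarial training effectively mitigates such risks in text-based tasks, generating high-quality adversarial speech data remains computationally expensive and technically challenging. To address this, we propose Cross-Modal Robustness Transfer (CMRT), a framework that transfers adversarial robustness from the text modality to the speech modality. Our method eliminates the requirement for adversarial speech data during training. Extensive experiments across four language pairs demonstrate that CMRT improves adversarial robustness by an average of more than 3 BLEU points, establishing a new baseline for robust E2E-ST without the overhead of generating adversarial speech.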

[57] Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance

Yunchong Huang,Gianni Barlacchi,Sandro Pezzelle

Main category: cs.CL

TL;DR: This paper finds that a large share of questions in question-answering benchmarks are underspecified, which hurts LLM performance. An LLM-based classifier identifies such questions, and a controlled rewriting experiment shows that making them fully specified consistently improves QA performance, highlighting how question quality affects evaluation results.

Details Motivation: Standard QA benchmarks remain far from solved by LLMs, and the authors argue that part of the gap is due to underspecified questions, whose interpretation cannot be uniquely determined without additional context. Method: Introduces an LLM-based classifier to identify underspecified questions and applies it to several widely used QA datasets. A controlled rewriting experiment then rewrites underspecified questions into fully specified variants while holding gold answers fixed, isolating the effect of underspecification. Result: Between 16% and over 50% of benchmark questions are underspecified, and LLMs perform significantly worse on them. QA performance consistently improves after rewriting. Conclusion: Underspecification is an important confound in current QA evaluation, and benchmark design should pay greater attention to question clarity. Abstract: Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions - queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.

[58] Do Large Language Models Adapt to Language Variation across Socioeconomic Status?

Elisa Bassignana,Mike Zhang,Dirk Hovy,Amanda Cercas Curry

Main category: cs.CL

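TL;DR: This study examines how well large language models (LLMs) adapt their linguistic style across communities of different socioeconomic status (SES). The models adjust style only to a minor extent and emulate upper-SES language more effectively, which may amplify linguistic inequality and undermine the validity of social science research that relies on them.

Details Motivation: LLMs increasingly mediate human-to-human communication, yet the extent to which they adapt to different social contexts, SES differences in particular, is largely unknown. Failure to adapt to diverse linguistic norms can perpetuate stereotypes and reinforce social stratification. Method: Collects a novel dataset from Reddit and YouTube stratified by SES, prompts four LLMs with incomplete text from that corpus, and compares the generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. Result: LLMs modulate their style with respect to SES only to a minor extent, often producing approximation or caricature, and they emulate upper-SES style more effectively. Conclusion: LLMs risk amplifying linguistic hierarchies, and their validity is questionable for agent-based social simulation, survey experiments, and any research that relies on language style as a social signal. Abstract: Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.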

[59] Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models

Yuzhe Shang,Pengzhi Gao,Wei Liu,Jian Luan,Jinsong Su

Main category: cs.CL

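TL;DR: This paper studies open large language models (LLMs) for multilingual machine translation (MT). Based on the Gemma3 model family, it develops MiLMMT-46, which covers 46 languages, outperforms several state-of-the-art open models, and is competitive with proprietary systems such as Google Translate and Gemini 3 Pro.

Details Motivation: The multilingual capabilities of open LLMs keep improving, but systematic study of how to adapt them to multilingual MT is still lacking, including how model scale and data scale affect the adaptation. Method: Adapts the Gemma3 model family to multilingual MT through continual pretraining and instruction finetuning, producing MiLMMT-46, and evaluates it across 46 languages. Result: MiLMMT-46 achieves top-tier multilingual translation performance across 46 languages, consistently outperforms recent SOTA models including Seed-X, HY-MT-1.5, and TranslateGemma, and is competitive with strong proprietary systems such as Google Translate and Gemini 3 Pro. Conclusion: Scaling model and data together substantially improves open LLMs on multilingual MT, and with targeted continual pretraining and instruction finetuning, open models can approach top commercial translation systems. Abstract: Open large language models (LLMs) have demonstrated improving multilingual capabilities in recent years. In this paper, we present a study of open LLMs for multilingual machine translation (MT) across a range of languages, and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and achieves competitive performance with strong proprietary systems such as Google Translate and Gemini 3 Pro.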

[60] DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling

Mariia Fedorova,Andrey Kutuzov,Khonzoda Umarova

Main category: cs.CL

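TL;DR: This paper presents DHPLT, an open collection of diachronic corpora in 41 languages, intended to fill the lack of multilingual diachronic corpora for semantic change modelling.

Details Motivation: Multilingual diachronic corpora are lacking beyond a dozen high-resource languages, which limits research on semantic change modelling. Method: Builds on the web-crawled HPLT datasets and uses web crawl timestamps as an approximate signal of document creation time to construct diachronic corpora covering three time periods (2011-2015, 2020-2021, and 2024-present), with pre-computed word type and token embeddings and lexical substitutions for chosen target words. Result: Releases the DHPLT diachronic corpora, covering 41 languages with 1 million documents per time period for each language, together with the accompanying embeddings and lexical substitutions, all openly available. Conclusion: DHPLT helps close the gap in multilingual diachronic resources, provides a new experimental basis for semantic change modelling, and lets researchers define their own target words on the same datasets. Abstract: In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as the approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while at the same time leaving it open for other researchers to come up with their own target words using the same datasets. DHPLT aims at filling in the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at https://data.hplt-project.org/three/diachronic/, sorted by language.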
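
Assigning documents to the three DHPLT time periods by crawl timestamp could look like the sketch below. The period boundaries come from the abstract; the ISO 8601 timestamp format, the function name `period_for`, and the choice to return `None` for documents outside every period are assumptions made for illustration.

```python
from datetime import datetime

# (label, first year, last year); None means the period is open-ended.
PERIODS = [
    ("2011-2015", 2011, 2015),
    ("2020-2021", 2020, 2021),
    ("2024-present", 2024, None),
]


def period_for(crawl_timestamp):
    """Map an ISO 8601 crawl timestamp to a period label, or None if it falls in a gap."""
    year = datetime.fromisoformat(crawl_timestamp).year
    for label, start, end in PERIODS:
        if year >= start and (end is None or year <= end):
            return label
    return None
```

Leaving gaps between the periods (2016-2019 and 2022-2023) keeps the three slices well separated in time, which helps when a crawl timestamp is only an approximation of when a document was written.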

[61] Automatic Simplification of Common Vulnerabilities and Exposures Descriptions

Varpu Vehomäki,Kimmo K. Kaski

Main category: cs.CL

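TL;DR: This study investigates large language models (LLMs) for automatic text simplification (ATS) of Common Vulnerabilities and Exposures (CVE) descriptions. It creates a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, and two rounds of expert evaluation show that out-of-the-box LLMs make text appear simpler but struggle with meaning preservation.

Details Motivation: Cyber security information is hard for non-specialists to understand. ATS has been studied for medical, scientific, and news texts, but not yet for the rapidly changing and complex domain of cyber security. Method: Creates a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, and evaluates out-of-the-box LLMs on simplifying CVE text through two survey rounds with cyber security experts. Result: Out-of-the-box LLMs can make the text appear simpler, but they struggle with meaning preservation. Conclusion: ATS methods need to be adapted specifically to the cyber security domain, with particular attention to preserving meaning. The code and data are released to support follow-up research. Abstract: Understanding cyber security is increasingly important for individuals and organizations. However, a lot of information related to cyber security can be difficult to understand for those not familiar with the topic. In this study, we focus on investigating how large language models (LLMs) could be utilized in automatic text simplification (ATS) of Common Vulnerabilities and Exposures (CVE) descriptions. Automatic text simplification has been studied in several contexts, such as medical, scientific, and news texts, but it has not yet been studied to simplify texts in the rapidly changing and complex domain of cyber security. We created a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, evaluated by two groups of cyber security experts in two survey rounds. We have found that while out-of-the-box LLMs can make the text appear simpler, they struggle with meaning preservation. Code and data are available at https://version.aalto.fi/gitlab/vehomav1/simplification_nmi.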

[62] LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

Szilvia Ujváry,Louis Béthune,Pierre Ablin,João Monteiro,Marco Cuturi,Michael Kirchhof

Main category: cs.CL

TL;DR: This paper proposes LaCy, a pretraining method that combines the loss with a spaCy grammar parser to decide which tokens a small language model (SLM) should predict itself and which it should delegate to an outside source, improving factual accuracy.

Details Motivation: Small language models (SLMs) have limited parameter capacity, so pretraining cannot cover all world knowledge and they are prone to factual errors. Access to an outside source mitigates this, but it must be decided which tokens the model should learn during pretraining and which it should delegate. Method: Proposes LaCy, which during pretraining relies not only on the prediction loss but also on a spaCy grammar parser that augments the loss signal, separating acceptable high-loss tokens (truthful alternative continuations) from tokens that should be delegated because they are likely to cause factual errors. Delegation is modelled explicitly with a special token. Result: SLMs trained with LaCy achieve higher FactScores when generating in a cascade with a bigger model, and outperform Rho and LLM-judge trained SLMs while being simpler and cheaper. Conclusion: Token-level decisions about what to learn and what to delegate should not rest on the loss alone but should also use linguistic structure. LaCy demonstrates that this idea works and offers a new approach to improving the factuality of SLMs. Abstract: Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. Especially the capacity of Small Language Models (SLMs) is limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a special token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground-truth, some tokens are acceptable in that they are truthful alternative continuations of a pretraining document, and should not trigger delegation even if their loss is high. We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors and which are safe to learn and predict even under high losses. We propose LaCy, a novel pretraining method based on this token selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.

[63] Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

Angelo Ziletti,Leonardo D'Ambrosi

Main category: cs.CL

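TL;DR: This paper proposes CLUES, a framework for clinical Text-to-SQL that separates two sources of uncertainty, input ambiguity and model instability, and quantifies them as an ambiguity score and an instability score to support targeted intervention and error triage.

Details Motivation: In clinical Text-to-SQL deployment, output diversity caused by input ambiguity (which should trigger clarification from the user) must be distinguished from diversity caused by model instability (which should trigger human review). A single uncertainty score cannot support such differentiated interventions. Method: Models Text-to-SQL as a two-stage process (interpretations → answers), decomposes semantic uncertainty into an ambiguity score and an instability score, and computes the instability score via the Schur complement of a bipartite semantic graph matrix. Result: On AmbigQA/SituatedQA and a clinical Text-to-SQL benchmark, CLUES improves failure prediction over the state-of-the-art Kernel Language Entropy. In deployment settings it remains competitive while providing a diagnostic decomposition unavailable from a single score. The high-ambiguity/high-instability regime covers 25% of queries yet contains 51% of errors, enabling efficient triage. Conclusion: By disentangling ambiguity from instability, CLUES provides interpretable and actionable uncertainty modelling for clinical deployment and a route to more reliable use of LLMs in high-stakes settings. Abstract: Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations → answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.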

[64] Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Xin Xu,Clive Bai,Kai Yang,Tianhao Chen,Yangkun Chen,Weijie Liu,Hao Chen,Yang Wang,Saiyong Yang,Can Yang

Main category: cs.CL

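TL;DR: This paper proposes Composition-RL, which automatically composes multiple problems into new verifiable prompts so that easy prompts with a pass rate of 1 can be put to better use, improving the reasoning capability of large language models trained with reinforcement learning.

Details Motivation: Large-scale verifiable prompts underpin the success of RLVR, but they contain many uninformative examples and are costly to expand. As training progresses, easy prompts with a pass rate of 1 become increasingly prevalent and shrink the effective data size, so a new strategy is needed to exploit them. Method: Proposes Composition-RL, which automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. A curriculum variant gradually increases compositional depth over training, and prompts drawn from different domains can be composed. Result: Experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. The curriculum variant boosts performance further, and cross-domain composition enables more effective cross-domain RL. Conclusion: Composition-RL is a simple yet useful approach that improves reasoning under a limited supply of verifiable prompts, especially once pass-rate-1 prompts dominate later in training, and it extends to cross-domain RL. Abstract: Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach for better utilizing limited verifiable prompts targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Codes, datasets, and models are available at https://github.com/XinXU-USTC/Composition-RL.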
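
One simple way to compose two verifiable problems into a new verifiable one is to feed the answer of the first into the second. The sketch below illustrates only that general idea. It is not the composition procedure from the paper, and the `compose` and `verify` helpers, the `{X}` placeholder convention, and the toy arithmetic problems are all hypothetical.

```python
def compose(first, second_template, second_solver):
    """Chain two problems so the second depends on the first problem's answer.

    first: (question, gold_answer) for an already-solved easy problem.
    second_template: question text containing the placeholder '{X}'.
    second_solver: function mapping the first answer to the composed gold answer.
    """
    question_1, answer_1 = first
    question = (
        f"Let X be the answer to the following problem: {question_1} "
        f"Then solve: {second_template.format(X='X')}"
    )
    return question, second_solver(answer_1)


def verify(model_answer, gold_answer):
    """Exact-match verifier for the composed problem."""
    return model_answer == gold_answer


# Toy example: two easy arithmetic problems chained into a two-step one.
question, gold = compose(("What is 6 * 7?", 42), "What is {X} + 8?", lambda x: x + 8)
```

The composed prompt stays automatically checkable because its gold answer is derived from the components' gold answers, while solving it requires getting both steps right.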

[65] DeepSight: An All-in-One LM Safety Toolkit

Bo Zhang,Jiaxuan Guo,Lijun Li,Dongrui Liu,Sujin Chen,Guanxu Chen,Zhijie Zheng,Qihao Lin,Lewen Yan,Chen Qian,Yijin Zhou,Yuyao Wu,Shaoxiong Guo,Tianyi Du,Jingyi Yang,Xuhao Hu,Ziqi Miao,Xiaoya Lu,Jing Shao,Xia Hu

Main category: cs.CL

TL;DR: This paper presents DeepSight, an open-source project that integrates safety evaluation and diagnosis, moving model safety analysis from black-box to white-box insight.

Details Motivation: In current large-model safety workflows, evaluation, diagnosis, and alignment are handled by separate tools. As a result, evaluation cannot locate internal root causes, diagnosis drifts from concrete risk scenarios, and alignment lacks explanations of changes in internal mechanisms, which may degrade general capabilities. Method: Proposes the open-source project DeepSight, consisting of the evaluation toolkit DeepSafe and the diagnosis toolkit DeepScan. Unified task and data protocols connect the evaluation and diagnosis stages and enable white-box safety analysis. Result: DeepSight is the first open-source toolkit to support frontier AI risk evaluation and joint safety evaluation and diagnosis, and it is low-cost, reproducible, efficient, and highly scalable. Conclusion: DeepSight moves large-model safety work from isolated tools towards an integrated, interpretable, mechanism-driven paradigm. Abstract: As the development of Large Models (LMs) progresses rapidly, their safety is also a priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot figure out internal root causes. Meanwhile, safety diagnosis often drifts from concrete risk scenarios and remains at the explainable level. In this way, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, namely DeepSight, to practice a new safety evaluation-diagnosis integrated paradigm. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box to white-box insight. Besides, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and joint safety evaluation and diagnosis.

[66] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Pinyi Zhang,Ting-En Lin,Yuchuan Wu,Jingyang Chen,Zongqi Wang,Hua Yang,Ze Xu,Fei Huang,Kai Zhang,Yongbin Li

Main category: cs.CL

TL;DR: This paper proposes P-GenRM, the first personalized generative reward model with test-time user-based scaling. Structured evaluation chains, user prototype clustering, and a dual-granularity scaling mechanism improve personalized alignment and generalization to new users.

Details Motivation: Existing personalized reward models struggle to model diverse, scenario-specific user preferences accurately, and they generalize poorly to new users with limited feedback. Method: Proposes P-GenRM, which transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics; clusters users into User Prototypes; and introduces a dual-granularity scaling mechanism at the individual and prototype levels, which mitigates noise in inferred preferences and enables transfer across users. Result: Achieves state-of-the-art results on widely used personalized reward model benchmarks with an average improvement of 2.31%, generalizes strongly on an out-of-distribution dataset, and gains an additional 3% from test-time user-based scaling. Conclusion: Through generative modelling and test-time adaptation, P-GenRM addresses preference diversity and the cold-start problem in personalized reward modelling and offers a new approach to personalized LLM alignment. Abstract: Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling with generalization to new users with limited feedback. To this end, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely-used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, test-time user-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment with test-time scalability.

[67] A Rule-based Computational Model for Gaidhlig Morphology

Peter J Barclay

Main category: cs.CL

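TL;DR: This paper presents a rule-based approach to modeling the morphology of Scottish Gaelic (Gaidhlig), using Wiktionary data to build an interpretable system with low data requirements that can support teaching tools and higher-level NLP tools.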

Details Motivation: Mainstream neural language models need large amounts of training data and are hard to apply to low-resource languages, whereas rule-based systems make effective use of limited data, are more interpretable, and can inform the design of teaching materials. Method: Extracts data from Wiktionary, uses SQL to query lexical patterns, builds a declarative rule base, and derives inflected Gaidhlig word forms with Python utilities. Result: A rule system that derives Gaidhlig inflected forms, usable for educational tools (e.g., explaining language patterns) and higher-level tools (e.g., rule-based dependency parsers). Conclusion: The rule-based approach draws additional value from existing Wiktionary data and gives low-resource languages an interpretable, practical, and extensible basis for language technology. Abstract: Language models and software tools are essential to support the continuing vitality of lesser-used languages; however, currently popular neural models require considerable data for training, which normally is not available for such low-resource languages. This paper describes work-in-progress to construct a rule-based model of Gaidhlig morphology using data from Wiktionary, arguing that rule-based systems effectively leverage limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule-base is presented that allows Python utilities to derive inflected forms of Gaidhlig words. This functionality could be used to support educational tools that teach or explain language patterns, for example, or to support higher level tools such as rule-based dependency parsers. This approach adds value to the data already present in Wiktionary by adapting it to new use-cases.
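To make the idea of a declarative rule base driving Python utilities concrete, here is a minimal sketch. It is not the paper's rule base: the dictionary layout, the name `LENITION_RULE`, and the function `lenite` are hypothetical, and the rule is a simplified version of initial lenition (an `h` inserted after certain initial consonants) that ignores further conditions a full model would need.

```python
# Hypothetical declarative rule: the data describes the pattern,
# and a small generic function applies it.
LENITION_RULE = {
    "name": "lenition",
    # Initial consonants that take an inserted 'h' when lenited.
    "lenitable": set("bcdfgmpst"),
    # Initial clusters that resist lenition.
    "blocked_clusters": ("sg", "sm", "sp", "st"),
}


def lenite(word, rule=LENITION_RULE):
    """Insert 'h' after a lenitable initial consonant, per the rule data."""
    lower = word.lower()
    if not lower or lower[0] not in rule["lenitable"]:
        return word  # vowels and l, n, r show no written change
    if lower.startswith(rule["blocked_clusters"]):
        return word
    if len(lower) > 1 and lower[1] == "h":
        return word  # already lenited
    return word[0] + "h" + word[1:]


for base in ["mòr", "cat", "sgoil", "làmh"]:
    print(base, "->", lenite(base))
```

Keeping the pattern in data rather than in code is what makes such a rule inspectable, which is the interpretability argument the abstract makes.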

[68] WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models

Yangzhuo Li,Shengpeng Ji,Yifu Chen,Tianle Liang,Haorong Ying,Yule Wang,Junbo Li,Jun Fang,Zhou Zhao

Main category: cs.CL

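TL;DR: This paper introduces WavBench, a new benchmark for evaluating realistic spoken-dialogue ability along three dimensions: reasoning, colloquial expression, and paralinguistic features.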

Details Motivation: Existing evaluations of spoken dialogue models mostly follow text-generation standards, overlooking speech-specific paralinguistic features, colloquial expression, and the cognitive depth modern agents require; a benchmark closer to real-world complexity is needed. Method: WavBench is a three-part framework: a Pro subset (high-difficulty reasoning challenges), a Basic subset (a colloquialism standard centered on "listenability"), and an Acoustic subset (paralinguistic evaluation covering explicit understanding, generation, and implicit dialogue). Result: Evaluating five SOTA models validates WavBench and exposes the key relationships among complex reasoning, colloquial delivery, and paralinguistic fidelity, along with current shortcomings. Conclusion: WavBench provides a systematic evaluation tool and a direction for building more robust, natural, and intelligent spoken dialogue systems. Abstract: With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto-2024.github.io/wavbench.github.io/.

Ricardo Campos,Ana Filipa Pacheco,Ana Luísa Fernandes,Inês Cantante,Rute Rebouças,Luís Filipe Cunha,José Miguel Isidro,José Pedro Evans,Miguel Marques,Rodrigo Batista,Evelin Amorim,Alípio Jorge,Nuno Guimarães,Sérgio Nunes,António Leal,Purificação Silvano

Main category: cs.CL

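TL;DR: This paper introduces CitiLink-Minutes, a multilayer annotated dataset of 120 European Portuguese municipal meeting minutes, intended to fill the gap in information retrieval and natural language processing research on municipal records.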

Details Motivation: Municipal meeting minutes are important official records of local governance, but Information Retrieval (IR) and Natural Language Processing (NLP) lack annotated datasets for them, which limits the development of computational models. Method: Builds CitiLink-Minutes, covering 120 meeting minutes from six Portuguese municipalities, manually annotated along three dimensions (metadata, subjects of discussion, voting outcomes) and reviewed by a linguist; all personal identifiers are de-identified, the data is released under FAIR principles, and baseline results are provided. Result: The dataset contains over one million tokens and more than 38,000 annotations; baselines are reported for metadata extraction, topic classification, and vote labeling; it supports downstream NLP/IR tasks and promotes transparency of municipal decisions. Conclusion: CitiLink-Minutes is the first multilayer, structured, annotated dataset for municipal meeting minutes; it is expected to advance computational analysis of local-governance texts and to improve the accessibility and transparency of government decisions. Abstract: City councils play a crucial role in local governance, directly influencing citizens' daily lives through decisions made during municipal meetings. These deliberations are formally documented in meeting minutes, which serve as official records of discussions, decisions, and voting outcomes. Despite their importance, municipal meeting records have received little attention in Information Retrieval (IR) and Natural Language Processing (NLP), largely due to the lack of annotated datasets, which ultimately limits the development of computational models. To address this gap, we introduce CitiLink-Minutes, a multilayer dataset of 120 European Portuguese municipal meeting minutes from six municipalities. Unlike prior annotated datasets of parliamentary or video records, CitiLink-Minutes provides multilayer annotations and structured linkage of official written minutes. The dataset contains over one million tokens, with all personal identifiers de-identified. Each minute was manually annotated by two trained annotators and curated by an experienced linguist across three complementary dimensions: (1) metadata, (2) subjects of discussion, and (3) voting outcomes, totaling over 38,000 individual annotations. Released under FAIR principles and accompanied by baseline results on metadata extraction, topic classification, and vote labeling, CitiLink-Minutes demonstrates its potential for downstream NLP and IR tasks, while promoting transparent access to municipal decisions.

[70] dVoting: Fast Voting for dLLMs

Sicheng Feng,Zigeng Chen,Xinyin Ma,Gongfan Fang,Xinchao Wang

Main category: cs.CL

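TL;DR: This paper proposes dVoting, which exploits the ability of diffusion large language models (dLLMs) to generate tokens at arbitrary positions in parallel: it identifies uncertain tokens through consistency analysis and regenerates them iteratively by voting, significantly improving reasoning performance without any training.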

Details Motivation: Across multiple samples from a dLLM, most token predictions agree, while performance is often determined by a small number of tokens that vary across samples; a lightweight, training-free mechanism is needed to focus correction on these key tokens. Method: Using the arbitrary-position generation of dLLMs, dVoting samples the same prompt several times, identifies uncertain tokens through consistency analysis, regenerates them by voting, and iterates until convergence. Result: Gains of 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU. Conclusion: dVoting is an efficient, training-free inference-enhancement technique that makes full use of the parallel decoding potential of dLLMs and offers a new approach to test-time scaling. Abstract: Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential for parallel test-time scaling, which was previously constrained by severe inefficiency in autoregressive modeling. In this work, we introduce dVoting, a fast voting technique that boosts reasoning capability without training, with only an acceptable extra computational overhead. dVoting is motivated by the observation that, across multiple samples for the same prompt, token predictions remain largely consistent, whereas performance is determined by a small subset of tokens exhibiting cross-sample variability. Leveraging the arbitrary-position generation capability of dLLMs, dVoting performs iterative refinement by sampling, identifying uncertain tokens via consistency analysis, regenerating them through voting, and repeating this process until convergence. Extensive evaluations demonstrate that dVoting consistently improves performance across various benchmarks. It achieves gains of 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU. Our code is available at https://github.com/fscdc/dVoting
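The consistency-analysis step can be pictured with a small sketch. This is only one plausible reading of the abstract, not the authors' implementation: the function name `consistency_vote`, the agreement `threshold`, and the assumption that all samples have the same length are illustrative choices, and the regeneration of flagged positions by the dLLM is left out.

```python
from collections import Counter


def consistency_vote(samples, threshold=0.8):
    """Majority-vote each position and flag positions with low agreement.

    samples: equal-length token lists drawn for the same prompt.
    Returns (voted_tokens, uncertain_positions).
    """
    voted, uncertain = [], []
    for pos, column in enumerate(zip(*samples)):
        token, count = Counter(column).most_common(1)[0]
        voted.append(token)
        if count / len(column) < threshold:
            uncertain.append(pos)  # candidate for regeneration
    return voted, uncertain


samples = [
    ["x", "=", "3", "+", "4", "=", "7"],
    ["x", "=", "3", "+", "4", "=", "7"],
    ["x", "=", "3", "+", "5", "=", "8"],
    ["x", "=", "3", "+", "4", "=", "12"],
]
voted, uncertain = consistency_vote(samples)
print(voted)      # majority token at every position
print(uncertain)  # positions where the samples disagree too much
```

In an iterative scheme of the kind the abstract describes, only the flagged positions would be masked and resampled, which is where arbitrary-position generation pays off.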

[71] Query-focused and Memory-aware Reranker for Long Context Processing

Yuqing Li,Jiangnan Li,Mo Yu,Guoxuan Ding,Zheng Lin,Weiping Wang,Jie Zhou

Main category: cs.CL

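TL;DR: This paper proposes a lightweight listwise reranking framework based on the attention scores of retrieval heads in large language models. Attention scores are used to estimate query-passage relevance, no Likert-scale annotation is needed, small models (e.g., 4B) reach SOTA performance, and the method sets a new SOTA across multiple domains and on the LoCoMo dialogue memory benchmark.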

Details Motivation: Existing rerankers mostly rely on pointwise supervision or on listwise supervision that requires human annotation, and they are computationally expensive; meanwhile, the attention of retrieval heads in large language models naturally carries a relevance signal that deserves further exploitation. Method: A lightweight framework that uses the attention scores of selected heads directly as the estimate of query-passage relevance and is trained end to end in a listwise fashion; it supports extensions such as context augmentation and fine-tuning of middle-layer attention heads. Result: Outperforms existing SOTA pointwise and listwise rerankers on Wikipedia, long narrative datasets, and the LoCoMo dialogue understanding and memory benchmark; reaches strong performance with a 4B-parameter model; supports flexible extensions that improve accuracy and efficiency. Conclusion: Listwise reranking based on attention scores is an efficient, general, and extensible approach that reduces dependence on large models and human annotation, offering a more practical lightweight option for retrieval-augmented generation. Abstract: Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
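A rough sketch of how attention scores could be turned into listwise passage scores follows. The aggregation used here (summing attention mass over each passage's token span and averaging over query tokens) is an assumption made for illustration; the abstract does not specify how the selected heads are pooled, and `passage_scores` is a hypothetical helper.

```python
def passage_scores(attention, spans):
    """Score each candidate passage by the attention mass it receives.

    attention[q][k]: weight from query token q to context token k,
        assumed to be already averaged over the selected heads.
    spans: (start, end) token offsets of each passage in the shortlist.
    """
    n_query = len(attention)
    scores = []
    for start, end in spans:
        mass = sum(sum(row[start:end]) for row in attention)
        scores.append(mass / n_query)
    return scores


# Two query tokens attending over six context tokens (three passages).
attention = [
    [0.05, 0.05, 0.40, 0.30, 0.10, 0.10],
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
]
spans = [(0, 2), (2, 4), (4, 6)]
scores = passage_scores(attention, spans)
ranking = sorted(range(len(spans)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Because all passages share one context, each score depends on the rest of the shortlist, which is what makes the approach listwise, and the scores are continuous rather than graded labels.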

[72] Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education

Mohamed Huti,Alasdair Mackintosh,Amy Waldock,Dominic Andrews,Maxime Lelièvre,Moritz Boos,Tobias Murray,Paul Atherton,Robin A. A. Ince,Oliver G. B. Garrod

Main category: cs.CL

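TL;DR: This paper introduces the Visual Reasoning Benchmark (VRB) for evaluating multimodal large language models (MLLMs) on authentic visual problems from primary school classrooms. Models do better on static tasks such as counting but hit a clear bottleneck on dynamic spatial operations such as folding, reflection, and rotation, revealing risks and limits for educational use.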

Details Motivation: AI models perform well on textual reasoning but remain bottlenecked on spatial and relational reasoning, and primary school maths, which relies heavily on visuals, needs a reliable evaluation tool. Method: Builds the Visual Reasoning Benchmark (VRB) from 701 questions taken from primary school examinations in Zambia and India, using unedited images with minimal text and covering tasks such as reasoning by analogy, pattern completion, and spatial matching, to evaluate MLLM visual reasoning in realistic educational settings. Result: Model capability shows a "jagged frontier": models do better on static skills such as counting and scaling but reach a marked "spatial ceiling" on dynamic spatial operations such as folding, reflection, and rotation, which can lead to incorrect marking, misleading scaffolding, and reinforcement of student misconceptions. Conclusion: Education-focused benchmarks such as VRB are essential for defining the practical boundaries of multimodal educational tools; current MLLMs cannot yet reliably support the teaching of visual reasoning in primary school. Abstract: AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck -- particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. This benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark which intentionally uses unedited, minimal-text images to test if models can meet realistic needs of primary education. Our findings reveal a "jagged frontier" of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct "spatial ceiling" when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.

[73] ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

Mathieu Sibue,Andres Muñoz Garza,Samuel Mensah,Pranav Shetty,Zhiqiang Ma,Xiaomo Liu,Manuela Veloso

Main category: cs.CL

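TL;DR: This paper introduces ExStrucTiny, a new benchmark dataset for evaluating how well generalist vision language models perform flexible, structured information extraction from diverse document images, and analyzes the challenges existing models face in schema adaptation, query under-specification, and answer localization.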

Details Motivation: Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets have narrow entity ontologies, simple queries, and homogeneous document types, so they cannot support research on flexible, fine-grained, structured information extraction from enterprise documents. Method: Builds the ExStrucTiny benchmark dataset, which unifies KEE, RE, and VQA tasks, through a pipeline that combines manual and synthetic samples with human validation and covers more varied document types and extraction scenarios; systematically evaluates open and closed vision language models on it. Result: Exposes the weaknesses of current generalist VLMs on key challenges such as schema adaptation, query under-specification, and answer localization. Conclusion: ExStrucTiny provides a solid foundation and a unified evaluation platform for improving generalist vision language models on structured information extraction from enterprise documents. Abstract: Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.

[74] Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation

Julia Belikova,Danila Rozhevskii,Dennis Svirin,Konstantin Polev,Alexander Panchenko

Main category: cs.CL

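TL;DR: This paper proposes a method for detecting "token overflow" in soft compression architectures, the regime in which compressed representations are no longer sufficient to answer the query. Lightweight probing classifiers that combine query and context information detect overflow with 0.72 average AUC-ROC across several QA datasets, enabling low-cost error mitigation before the LLM is called.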

Details Motivation: Soft compression architectures can extend the effective context length of LLMs, but the limits of compressibility, and the point at which task-relevant information starts to be lost, remain unclear; a mechanism for recognizing compression failure (token overflow) is needed. Method: Defines token overflow and, in the xRAG soft-compression setting, compares query-agnostic saturation statistics with lightweight probing classifiers (taking both query and context xRAG representations as input) for overflow detection. Result: Query-agnostic saturation statistics reliably separate compressed from uncompressed tokens but detect overflow poorly, whereas lightweight probing classifiers reach 0.72 average AUC-ROC on HotpotQA, SQuADv2, and TriviaQA, a clear improvement in detection. Conclusion: Query-aware lightweight probes are a key step toward practical monitoring of soft compression and can support low-overhead overflow gating and error mitigation ahead of LLM inference. Abstract: Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility -- and when compression begins to erase task-relevant content -- remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
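For readers unfamiliar with the reported metric, the sketch below computes AUC-ROC directly from its pairwise definition: the probability that a randomly chosen overflow example receives a higher probe score than a randomly chosen non-overflow example. The labels and scores are made-up toy values, not data from the paper, and `auc_roc` is a hypothetical helper.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy probe outputs: 1 = overflow (compressed context cannot answer the query).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
auc = auc_roc(labels, scores)
print(round(auc, 3))
```

On this scale 0.5 is chance and 1.0 is perfect separation, so the reported 0.72 indicates a useful but far from decisive signal for gating requests before the LLM call.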

[75] Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications

Manjunath Kudlur,Evan King,James Wang,Pete Warden

Main category: cs.CL

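TL;DR: Moonshine v2 is a streaming-encoder ASR model based on sliding-window self-attention that combines low time-to-first-token (TTFT) with high recognition accuracy, enabling efficient real-time speech recognition on edge devices.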

Details Motivation: Full-attention Transformer encoders are accurate, but their global dependency brings quadratic compute and a TTFT that grows linearly with utterance length, which makes them a poor fit for low-latency streaming speech recognition on resource-constrained edge devices. Method: Moonshine v2 replaces full attention with sliding-window self-attention, achieving bounded, low-latency inference while keeping strong local context modeling. Result: Reaches SOTA word error rates (WER) on standard benchmarks, matching the accuracy of models 6x larger while running significantly faster. Conclusion: Carefully designed local attention can match the recognition accuracy of full attention at a fraction of the model size and latency cost, opening new possibilities for interactive speech interfaces on edge devices. Abstract: Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent "encode-the-whole-utterance" latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases, we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state-of-the-art word error rates across standard benchmarks, attaining accuracy on par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive with the accuracy of full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
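The bounded-cost property of sliding-window self-attention can be seen from the attention mask alone. The sketch below is illustrative: the window sizes `left` and `right` are hypothetical parameters, since the abstract does not state the windows Moonshine v2 actually uses.

```python
def sliding_window_mask(n_frames, left, right):
    """mask[i][j] is True when frame i may attend to frame j.

    Each frame sees at most `left` past frames, itself, and `right`
    future frames, no matter how long the utterance is.
    """
    return [
        [i - left <= j <= i + right for j in range(n_frames)]
        for i in range(n_frames)
    ]


mask = sliding_window_mask(n_frames=6, left=2, right=1)
attended = [sum(row) for row in mask]
print(attended)  # frames attended per position, capped at left + right + 1
```

With full attention every row would contain `n_frames` entries, so per-frame work grows with utterance length; with the window it is capped at `left + right + 1`, and only `right` frames of lookahead are needed before a frame can be encoded.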

[76] A technical curriculum on language-oriented artificial intelligence in translation and specialised communication

Ralph Krüger

Main category: cs.CL

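TL;DR: This paper presents a technical curriculum on language-oriented artificial intelligence (AI) for the language and translation (L&T) industry. It aims to raise practitioners' technical AI literacy, covers four core topics (vector embeddings, neural network foundations, tokenization, and transformer models), and was tested in teaching practice.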

Details Motivation: To strengthen the technical AI literacy, computational thinking, algorithmic awareness, and algorithmic agency of language and translation practitioners in AI-driven work environments, and thereby their digital resilience. Method: Designs and delivers a technical curriculum with four modules (vector embeddings, neural network foundations, tokenization, and transformer neural networks) and tests it in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Result: Teaching practice suggests the curriculum is didactically effective, but participant feedback indicates it needs higher-level didactic scaffolding (e.g., lecturer support) for the best learning outcomes. Conclusion: The technical curriculum is an effective route to AI literacy in the L&T field, but it should be combined with appropriate teaching support to realize its full potential. Abstract: This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (L&T) industry. The curriculum aims to foster domain-specific technical AI literacy among stakeholders in the fields of translation and specialised communication by exposing them to the conceptual and technical/algorithmic foundations of modern language-oriented AI in an accessible way. The core curriculum focuses on 1) vector embeddings, 2) the technical foundations of neural networks, 3) tokenization and 4) transformer neural networks. It is intended to help users develop computational thinking as well as algorithmic awareness and algorithmic agency, ultimately contributing to their digital resilience in AI-driven work environments. The didactic suitability of the curriculum was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Results suggest the didactic effectiveness of the curriculum, but participant feedback indicates that it should be embedded into higher-level didactic scaffolding - e.g., in the form of lecturer support - in order to enable optimal learning conditions.

[77] T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization

Tunyu Zhang,Xinxi Zhang,Ligong Han,Haizhou Shi,Xiaoxiao He,Zhuowei Li,Hao Wang,Kai Xu,Akash Srivastava,Hao Wang,Vladimir Pavlovic,Dimitris N. Metaxas

Main category: cs.CL

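TL;DR: This paper proposes a trajectory self-distillation framework (T3D) combined with a Direct Discriminative Optimization (DDO) objective to improve the generation quality and efficiency of diffusion large language models (DLLMs) under a small number of sampling steps.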

Details Motivation: Diffusion large language models (DLLMs) support parallel multi-token decoding, but in practice inference is limited by the many refinement steps required, and cutting the number of steps sharply degrades generation quality; a decoding strategy that balances efficiency and quality is needed. Method: A Trajectory Self-Distillation framework that uses the model's own generative trajectories as the teacher signal, together with a reverse-KL Direct Discriminative Optimization (DDO) objective that yields mode-seeking distillation, so the student concentrates on high-probability teacher modes. Result: Across benchmarks, the method consistently outperforms strong few-step baselines and standard training under tight step budgets; full-step decoding remains better, but the gap narrows substantially. Conclusion: T3D lays a solid foundation for practical few-step DLLMs and supports the effectiveness of self-distillation plus DDO for improving few-step generation quality. Abstract: Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student to concentrate on high-probability teacher modes. Across benchmarks, our approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Although full-step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few-step DLLMs. The source code is available at https://github.com/Tyrion58/T3D.
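Why a reverse-KL objective is described as mode-seeking can be checked numerically. The toy distributions below are invented for illustration and have nothing to do with the paper's models; the sketch only contrasts reverse KL, KL(student || teacher), with forward KL, KL(teacher || student), and does not reproduce the DDO objective itself.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [0.49, 0.01, 0.01, 0.49]    # two high-probability modes
covering = [0.25, 0.25, 0.25, 0.25]   # student that spreads mass everywhere
seeking = [0.97, 0.01, 0.01, 0.01]    # student that commits to one mode

reverse_covering = kl(covering, teacher)  # student || teacher
reverse_seeking = kl(seeking, teacher)
forward_covering = kl(teacher, covering)  # teacher || student
forward_seeking = kl(teacher, seeking)

print(round(reverse_covering, 3), round(reverse_seeking, 3))
print(round(forward_covering, 3), round(forward_seeking, 3))
```

Reverse KL penalizes the student for placing mass where the teacher has little, so it favors the student that commits to a single teacher mode, while forward KL favors the one that covers both modes.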

[78] On-Policy Context Distillation for Language Models

Tianzhu Ye,Li Dong,Xun Wu,Shaohan Huang,Furu Wei

Main category: cs.CL

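TL;DR: This paper proposes On-Policy Context Distillation (OPCD), a framework that distills on the student model's own generated trajectories while minimizing the reverse KL divergence to a context-conditioned teacher, so that in-context knowledge is internalized into the parameters. It performs well on both experiential knowledge distillation and system prompt distillation, and it supports knowledge transfer across model sizes.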

Details Motivation: Existing context distillation methods struggle to internalize, as parametric knowledge, the experience a model accumulates during inference or the behaviors encoded in optimized prompts; an approach that combines the strengths of on-policy training and context modeling is needed. Method: The On-Policy Context Distillation (OPCD) framework trains the student on its own generated trajectories, with the objective of minimizing the reverse KL divergence between its output distribution and that of a context-conditioned teacher. Result: OPCD outperforms baselines on mathematical reasoning, text-based games, and domain-specific tasks, improving task accuracy while better preserving out-of-distribution capabilities; it also supports effective cross-size distillation of experiential knowledge from larger to smaller models. Conclusion: OPCD combines on-policy learning with context distillation, providing a unified, efficient, and scalable way for language models to internalize experience and prompt-encoded behaviors. Abstract: Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
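The "on-policy" part of the objective means the divergence is estimated from tokens the student itself generates. The sketch below shows that estimator on single-token toy distributions, which stand in for full trajectories; the distributions, the sample count, and the helper names are all assumptions for illustration, not the paper's training code.

```python
import math
import random


def sample(dist, rng):
    """Draw one token from a {token: probability} distribution."""
    tokens, probs = zip(*dist.items())
    return rng.choices(tokens, weights=probs, k=1)[0]


def on_policy_reverse_kl(student, teacher, n_samples, rng):
    """Monte Carlo estimate of KL(student || teacher) from student samples."""
    total = 0.0
    for _ in range(n_samples):
        tok = sample(student, rng)  # the student generates the data
        total += math.log(student[tok]) - math.log(teacher[tok])
    return total / n_samples


student = {"a": 0.6, "b": 0.3, "c": 0.1}  # sees the bare prompt
teacher = {"a": 0.8, "b": 0.1, "c": 0.1}  # conditioned on extra context
exact = sum(p * math.log(p / teacher[t]) for t, p in student.items())
estimate = on_policy_reverse_kl(student, teacher, 20000, random.Random(0))
print(round(exact, 4), round(estimate, 4))
```

Because samples come from the student, the signal concentrates on outputs the student actually produces, which is the usual motivation for on-policy distillation over training on teacher-generated text.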

cs.CV [Back]

[79] DD-MDN: Human Trajectory Forecasting with Diffusion-Based Dual Mixture Density Networks and Uncertainty Self-Calibration

Manuel Hetzel,Kerim Turacan,Hannes Reichert,Konrad Doll,Bernhard Sick

Main category: cs.CV

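TL;DR: This paper proposes DD-MDN, an end-to-end probabilistic human trajectory forecasting model that combines denoising diffusion with a dual mixture density network to deliver high accuracy, well-calibrated uncertainty, and robustness to short observation periods.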

Details Motivation: Prior work focuses on accuracy, social interaction modeling, and diversity but neglects uncertainty modeling, calibration, and forecasting from short observation periods, all of which matter for downstream tasks such as path planning and collision avoidance. Method: DD-MDN combines a few-shot denoising diffusion backbone with a dual mixture density network (dual MDN) to learn self-calibrated residence areas and probability-ranked anchor paths, from which diverse trajectory hypotheses are generated without predefined anchors or endpoints. Result: Reaches SOTA accuracy on the ETH/UCY, SDD, inD, and IMPTC datasets, is robust at short observation intervals, and shows reliable uncertainty modeling. Conclusion: DD-MDN addresses accuracy, uncertainty calibration, and robustness to short observations in HTF within one model, providing a more trustworthy forecasting basis for safety-critical applications. Abstract: Human Trajectory Forecasting (HTF) predicts future human movements from past trajectories and environmental context, with applications in Autonomous Driving, Smart Surveillance, and Human-Robot Interaction. While prior work has focused on accuracy, social interaction modeling, and diversity, little attention has been paid to uncertainty modeling, calibration, and forecasts from short observation periods, which are crucial for downstream tasks such as path planning and collision avoidance. We propose DD-MDN, an end-to-end probabilistic HTF model that combines high positional accuracy, calibrated uncertainty, and robustness to short observations. Using a few-shot denoising diffusion backbone and a dual mixture density network, our method learns self-calibrated residence areas and probability-ranked anchor paths, from which diverse trajectory hypotheses are derived, without predefined anchors or endpoints. Experiments on the ETH/UCY, SDD, inD, and IMPTC datasets demonstrate state-of-the-art accuracy, robustness at short observation intervals, and reliable uncertainty modeling. The code is available at: https://github.com/kav-institute/ddmdn.
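As background on what a mixture density network predicts, the sketch below evaluates the negative log-likelihood of a 2D position under a Gaussian mixture, the quantity such a network is typically trained to minimize. The isotropic components and all parameter values are simplifying assumptions; the abstract does not describe how DD-MDN parameterizes its dual MDN.

```python
import math


def mixture_nll(point, weights, means, sigmas):
    """Negative log-likelihood of a 2D point under an isotropic Gaussian mixture."""
    x, y = point
    density = 0.0
    for w, (mx, my), s in zip(weights, means, sigmas):
        sq_dist = (x - mx) ** 2 + (y - my) ** 2
        density += w * math.exp(-sq_dist / (2 * s * s)) / (2 * math.pi * s * s)
    return -math.log(density)


# Two hypothetical future positions with mixture weights 0.7 and 0.3.
weights = [0.7, 0.3]
means = [(1.0, 0.0), (0.0, 1.0)]
sigmas = [0.5, 0.5]
near = mixture_nll((1.0, 0.1), weights, means, sigmas)
far = mixture_nll((3.0, 3.0), weights, means, sigmas)
print(round(near, 3), round(far, 3))
```

A calibrated model is one whose predicted weights and spreads match how often ground-truth positions actually fall inside the corresponding regions, which this likelihood rewards during training.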

[80] ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

Yandan Yang,Shuang Zeng,Tong Lin,Xinyuan Chang,Dekang Qi,Junjin Xiao,Haoyun Liu,Ronghan Chen,Yuzhi Chen,Dongjie Huo,Feng Xiong,Xing Wei,Zhiheng Ma,Mu Xu

Main category: cs.CV

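TL;DR: This paper proposes ABot-M0, a framework that combines systematic data curation and unified pre-training (UniACT-dataset) with the Action Manifold Hypothesis and Action Manifold Learning (AML) to improve the efficiency and stability of action prediction for cross-platform embodied agents, and uses a dual-stream perception module that fuses VLM semantics with geometric priors for modular, extensible general-purpose embodied modeling.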

Details Motivation: To address the core obstacles to general-purpose embodied intelligence under the "one-brain, many-forms" paradigm in robotics: fragmented data, inconsistent representations, and misaligned training objectives. Method: The ABot-M0 framework 1) establishes a systematic data cleaning and standardization pipeline and builds the large-scale unified action dataset UniACT (six public datasets, over 6 million trajectories, 9,500 hours); 2) proposes the Action Manifold Hypothesis and Action Manifold Learning (AML), which uses a DiT backbone to predict continuous action sequences directly, shifting the learning target from denoising to manifold projection; 3) adds a dual-stream perception module that fuses VLM semantics with geometric priors and multi-view 3D modules (such as VGGT and Qwen-Image-Edit) without changing the backbone. Result: ABot-M0 shows stronger knowledge transfer and generalization across morphologies and tasks; AML improves action decoding speed and policy stability; the dual-stream perception module improves spatial understanding, and the components work independently with additive gains. Conclusion: ABot-M0 offers an end-to-end framework for general, efficient, and scalable embodied agents and supports the value of jointly designing unified data, manifold-based action modeling, and modular perception; all code and pipelines will be open-sourced. Abstract: Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability.
ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.

[81] Toward Reliable Tea Leaf Disease Diagnosis Using Deep Learning Model: Enhancing Robustness With Explainable AI and Adversarial Training

Samanta Ghosh,Jannatul Adan Mahi,Shayan Abrar,Md Parvez Mia,Asaduzzaman Rayhan,Abdul Awal Yasir,Asaduzzaman Hridoy

Main category: cs.CV

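TL;DR: This paper proposes a deep learning method for automated tea leaf disease classification. Using the teaLeafBD dataset (5,278 high-resolution images in 7 classes) with DenseNet201 and EfficientNetB3, adversarial training, and Grad-CAM explainability analysis, it reaches a best classification accuracy of 93%.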

Details Motivation: Tea plants are vulnerable to several leaf diseases that reduce yield and quality; manual inspection is slow and error-prone, so an automated, efficient, and accurate disease identification method is needed. Method: Uses the teaLeafBD dataset and a full pipeline of data preprocessing, splitting, adversarial training, augmentation, model training, evaluation, and Grad-CAM explainability analysis; compares DenseNet201 with EfficientNetB3 and applies adversarial training to improve robustness. Result: EfficientNetB3 reaches 93% classification accuracy and DenseNet201 reaches 91%; Grad-CAM visualizations indicate that the models attend to plausible regions; the models are robust to noisy and perturbed inputs. Conclusion: The method identifies tea leaf diseases accurately and efficiently, has practical value for agriculture, and offers workable technical support for smart agricultural management. Abstract: Tea is a valuable asset for the economy of Bangladesh, so tea cultivation plays an important role in boosting the economy. These valuable plants are vulnerable to various kinds of leaf infections, which may reduce both yield and quality. Detecting these diseases manually is not easy: it takes time and is prone to error. Therefore, the purpose of the study is to develop an automated deep learning model for tea leaf disease classification based on the teaLeafBD dataset so that anyone can detect the diseases more easily and efficiently. There are 5,278 high-resolution images in this dataset. The images are classified into seven categories: six represent various diseases and the remaining one represents healthy leaves. The proposed pipeline contains data preprocessing, data splitting, adversarial training, augmentation, model training, evaluation, and interpretation through Explainable AI strategies. DenseNet201 and EfficientNetB3 were employed to perform the classification task. To make the model more robust, we applied adversarial training so it can operate effectively even with noisy or disturbed inputs. In addition, Grad-CAM visualization was used to analyze the model's predictions by identifying the most influential regions of each image. Our experimental outcomes revealed that EfficientNetB3 achieved the highest classification accuracy of 93%, while DenseNet201 reached 91%. The outcomes show that the proposed approach can accurately detect tea leaf diseases and provide a practical solution for advanced agricultural management.
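The abstract does not say which attack is used for adversarial training. As one common possibility, the sketch below shows the fast gradient sign method (FGSM) on a tiny logistic model standing in for the image classifier; the model, its weights, and `epsilon` are all hypothetical, and this is not claimed to be the authors' procedure.

```python
import math


def predict(weights, bias, x):
    """Probability of the positive class under a logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(weights, bias, x, label, epsilon):
    """Shift each feature by epsilon in the direction that increases the loss."""
    p = predict(weights, bias, x)
    # Gradient of binary cross-entropy with respect to the input: (p - label) * w.
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


weights, bias = [2.0, -1.0], 0.0
x, label = [0.5, 0.2], 1
x_adv = fgsm(weights, bias, x, label, epsilon=0.1)
print(predict(weights, bias, x), predict(weights, bias, x_adv))
```

Adversarial training then mixes such perturbed inputs, with their original labels, into the training batches so the model learns to keep its prediction under small input changes.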

[82] Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration

Jinghan He,Junfeng Fang,Feng Xiong,Zijun Yao,Fei Shen,Haiyun Guo,Jinqiao Wang,Tat-Seng Chua

Main category: cs.CV

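TL;DR: This paper proposes Active-Zero, a framework in which three co-evolving agents (Searcher, Questioner, Solver) let vision-language models actively explore open-world environments and learn adaptively, significantly improving reasoning and understanding.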

Details Motivation: Existing self-play methods for vision-language models rely on static image collections and cannot actively acquire visual data suited to the model's current capability, which makes learning inefficient and strongly dependent on the initial data. Method: The Active-Zero framework has three co-evolving agents: a Searcher that retrieves images from open-world repositories according to the model's capability frontier, a Questioner that generates calibrated reasoning tasks, and a Solver optimized through accuracy rewards, forming a closed-loop, self-scaffolding curriculum. Result: On Qwen2.5-VL-7B-Instruct across 12 benchmarks, average accuracy reaches 53.97 on reasoning tasks (+5.7%) and 59.77 on general understanding (+3.9%), consistently ahead of existing self-play baselines. Conclusion: Active exploration is a key ingredient for scalable, adaptive self-evolving vision-language systems. Abstract: Self-play has enabled large language models to autonomously improve through self-generated challenges. However, existing self-play methods for vision-language models rely on passive interaction with static image collections, resulting in strong dependence on initial datasets and inefficient learning. Without the ability to actively seek visual data tailored to their evolving capabilities, agents waste computational effort on samples that are either trivial or beyond their current skill level. To address these limitations, we propose Active-Zero, a framework that shifts from passive interaction to active exploration of visual environments. Active-Zero employs three co-evolving agents: a Searcher that retrieves images from open-world repositories based on the model's capability frontier, a Questioner that synthesizes calibrated reasoning tasks, and a Solver refined through accuracy rewards. This closed loop enables self-scaffolding auto-curricula where the model autonomously constructs its learning trajectory. On Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero achieves 53.97 average accuracy on reasoning tasks (5.7% improvement) and 59.77 on general understanding (3.9% improvement), consistently outperforming existing self-play baselines. These results highlight active exploration as a key ingredient for scalable and adaptive self-evolving vision-language systems.

[83] ReTracing: An Archaeological Approach Through Body, Machine, and Generative Systems

Yitong Wang,Yue Yao

Main category: cs.CV

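TL;DR: ReTracing is a multi-agent embodied performance art project that pairs a human performer with a quadruped robot and takes an archaeological approach to examining how artificial intelligence shapes, constrains, and produces bodily movement. Large language models generate action prompts, a diffusion model generates video guidance, and the two agents perform in sync on a mirrored floor while a digital archive of 3D motion traces is built, revealing the socio-cultural biases embedded in generative AI.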

Details Motivation: To examine how artificial intelligence shapes, constrains, and produces bodily movement, and to reveal the socio-cultural biases encoded in generative systems. Method: Descriptions of human-machine interaction are extracted from science-fiction novels; large language models generate paired "what to do" and "what not to do" prompts; a diffusion-based text-to-video model turns these into choreographic guides for the human performer and motor commands for the robot; both perform in sync on a mirrored floor, and multi-camera motion tracking reconstructs the performance as 3D point clouds and motion trails. Result: A digital archive of the motion traces of human and robot that visualizes socio-cultural biases in AI-generated movement and prompts reflection on what it means to be human among AIs that also move, think, and leave traces. Conclusion: ReTracing offers a novel embodied approach that places AI, human, and robot as equal performing agents and uses motion traces as a medium for critically exposing the implicit values carried by generative AI. Abstract: We present ReTracing, a multi-agent embodied performance art that adopts an archaeological approach to examine how artificial intelligence shapes, constrains, and produces bodily movement. Drawing from science-fiction novels, the project extracts sentences that describe human-machine interaction. We use large language models (LLMs) to generate paired prompts, "what to do" and "what not to do," for each excerpt. A diffusion-based text-to-video model transforms these prompts into choreographic guides for a human performer and motor commands for a quadruped robot. Both agents enact the actions on a mirrored floor, captured by multi-camera motion tracking and reconstructed into 3D point clouds and motion trails, forming a digital archive of motion traces. Through this process, ReTracing serves as a novel approach to reveal how generative systems encode socio-cultural biases through choreographed movements. Through an immersive interplay of AI, human, and robot, ReTracing confronts a critical question of our time: What does it mean to be human among AIs that also move, think, and leave traces behind?

[84] Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models

Sethuraman T,Savya Khosla,Aditi Tiwari,Vidya Ganesh,Rakshana Jayaprakash,Aditya Jain,Vignesh Srinivasakumar,Onkar Kishor Susladkar,Srinidhi Sunkara,Aditya Shanmugham,Rakesh Vaideeswaran,Abbaas Alif Mohamed Nishar,Simon Jenni,Derek Hoiem

Main category: cs.CV

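TL;DR: This paper introduces REVEAL, a diagnostic benchmark whose five controlled stress tests expose serious weaknesses of current video-language models (VidLMs) in temporal ordering, motion understanding, and reliance on video content. The models often ignore the video, misjudge temporal order, and fall for language shortcuts, whereas humans complete these tasks with ease.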

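Details Motivation: To test whether video-language models robustly account for video content, temporal sequence, and motion; the investigation finds fundamental weaknesses. Method: Build the REVEAL diagnostic benchmark with five stress tests (temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion), together with a data pipeline that automatically generates diagnostic examples. Result: Leading open- and closed-source VidLMs perform poorly across the tests: they describe reversed videos as forward, answer while neglecting the video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information under simple spatiotemporal masking, while humans perform well. Conclusion: Current VidLMs are far from robust in their understanding of video content, and stricter evaluation benchmarks and modeling improvements are needed; REVEAL provides a scalable diagnostic tool and open resources for this purpose. Abstract: This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests, assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information amidst simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.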

[85] Advancing Digital Twin Generation Through a Novel Simulation Framework and Quantitative Benchmarking

Jacob Rubinstein,Avi Donaty,Don Engel

Main category: cs.CV

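TL;DR: This paper presents a new pipeline that synthesizes images from high-quality 3D models and programmatically generated camera poses, enabling quantitative evaluation of digital twin reconstruction quality.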

Details Motivation: Photogrammetry-based 3D modeling for digital twins involves many design choices, yet evaluation is mostly qualitative and lacks repeatable, quantifiable benchmarks. Method: Build a synthetic image generation pipeline that starts from high-fidelity 3D models, renders images from programmatically generated virtual camera poses, and compares the result against the output of reconstruction algorithms. Result: A controlled, repeatable experimental setting that supports quantitative evaluation of both camera parameter estimation and object reconstruction accuracy. Conclusion: This synthetic-data-driven evaluation framework offers a new quantitative benchmarking approach for digital twin and 3D reconstruction methods. Abstract: The generation of 3D models from real-world objects has often been accomplished through photogrammetry, i.e., by taking 2D photos from a variety of perspectives and then triangulating matched point-based features to create a textured mesh. Many design choices exist within this framework for the generation of digital twins, and differences between such approaches are largely judged qualitatively. Here, we present and test a novel pipeline for generating synthetic images from high-quality 3D models and programmatically generated camera poses. This enables a wide variety of repeatable, quantifiable experiments which can compare ground-truth knowledge of virtual camera parameters and of virtual objects against the reconstructed estimations of those perspectives and subjects.
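As a rough illustration of what "programmatically generated camera poses" and a quantitative comparison against ground truth could look like, the sketch below places virtual cameras on a sphere around an object at the origin and measures the angle between a true and an estimated viewing direction. The Fibonacci-sphere layout, the function names, and the error metric are assumptions made for this example; the abstract does not specify how the authors sample poses or score reconstructions.

```python
import math

def fibonacci_sphere_poses(n, radius=2.0):
    """Return n (position, forward) pairs on a sphere, each camera looking at the origin.

    Hypothetical helper: the sampling scheme is an assumption, not the paper's method.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    poses = []
    for i in range(n):
        # y runs from just below +1 to just above -1 so cameras cover the whole sphere.
        y = 1.0 - 2.0 * (i + 0.5) / n
        ring = math.sqrt(1.0 - y * y)
        theta = golden_angle * i
        position = (radius * ring * math.cos(theta),
                    radius * y,
                    radius * ring * math.sin(theta))
        # Unit vector pointing from the camera toward the object at the origin.
        forward = tuple(-c / radius for c in position)
        poses.append((position, forward))
    return poses

def angular_error_deg(true_forward, estimated_forward):
    """Angle in degrees between two unit viewing directions."""
    dot = sum(a * b for a, b in zip(true_forward, estimated_forward))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))
```

Because the ground-truth poses are known exactly, the same error function can be applied to whatever camera directions a reconstruction tool estimates from the rendered images.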

[86] Selective Prior Synchronization via SYNC Loss

Ishan Mishra,Jiajie Li,Deepak Mishra,Jinjun Xiong

Main category: cs.CV

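TL;DR: This paper proposes the SYNC loss, which brings a post-hoc method (softmax response) into the training of SelectiveNet. By exploiting the selective prior, it improves the selective prediction ability of deep neural networks and achieves state-of-the-art performance on several datasets.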

Details Motivation: Existing selective prediction methods are either ad-hoc or post-hoc, but post-hoc methods use their implicit uncertainty information (the selective prior) only at inference; the authors argue that this prior matters just as much during training. Method: Propose the SYNC loss, which explicitly incorporates the softmax response as a selective prior into the SelectiveNet training objective so that ad-hoc and post-hoc methods are optimized together. Result: On benchmarks including CIFAR-100, ImageNet-100, and Stanford Cars, the model generalizes better and its selective prediction performance surpasses previous methods, setting a new state of the art. Conclusion: The selective prior should be used in training as well as inference; by jointly modeling the selection mechanism and the classification objective, the SYNC loss provides a more unified and effective framework for uncertainty-driven selective prediction. Abstract: Prediction under uncertainty is a critical requirement for the deep neural network to succeed responsibly. This paper focuses on selective prediction, which allows DNNs to make informed decisions about when to predict or abstain based on the uncertainty level of their predictions. Current methods are either ad-hoc such as SelectiveNet, focusing on how to modify the network architecture or objective function, or post-hoc such as softmax response, achieving selective prediction through analyzing the model's probabilistic outputs. We observe that post-hoc methods implicitly generate uncertainty information, termed the selective prior, which has traditionally been used only during inference. We argue that the selective prior provided by the selection mechanism is equally vital during the training stage. Therefore, we propose the SYNC loss which introduces a novel integration of ad-hoc and post-hoc methods. Specifically, our approach incorporates the softmax response into the training process of SelectiveNet, enhancing its selective prediction capabilities by examining the selective prior. Evaluated across various datasets, including CIFAR-100, ImageNet-100, and Stanford Cars, our method not only enhances the model's generalization capabilities but also surpasses previous works in selective prediction performance, and sets new benchmarks for state-of-the-art performance.
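For readers unfamiliar with the post-hoc baseline mentioned above, the following is a minimal sketch of softmax response: the maximum softmax probability serves as the confidence score, and the model abstains when it falls below a threshold. This illustrates only the inference-time mechanism; it is not the SYNC loss, whose formulation the abstract does not give. The function names and the threshold value are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_response(logits, threshold):
    """Return (predicted label or None if abstaining, confidence)."""
    probs = softmax(logits)
    confidence = max(probs)
    label = probs.index(confidence)
    return (label if confidence >= threshold else None), confidence

def coverage_and_selective_risk(batch_logits, labels, threshold):
    """Coverage = fraction of accepted samples; risk = error rate on accepted ones."""
    accepted = 0
    errors = 0
    for logits, y in zip(batch_logits, labels):
        pred, _ = softmax_response(logits, threshold)
        if pred is None:
            continue  # the model abstains on this sample
        accepted += 1
        errors += int(pred != y)
    coverage = accepted / len(labels)
    risk = errors / accepted if accepted else 0.0
    return coverage, risk
```

Raising the threshold lowers coverage and, for a well-calibrated model, usually lowers the selective risk as well; that trade-off is what selective prediction methods are compared on.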

[87] MDE-VIO: Enhancing Visual-Inertial Odometry Using Learned Depth Priors

Arda Alniak,Sinan Kalkan,Mustafa Mert Ankarali,Afsar Saranli,Abdullah Aydin Alatan

Main category: cs.CV

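TL;DR: This paper proposes a framework that integrates learned depth priors directly into the VINS-Mono optimization backend. It introduces affine-invariant depth consistency and ordinal constraints, uses variance-based gating to suppress unstable artifacts, and robustly recovers metric scale within the compute limits of edge devices, significantly improving monocular VIO accuracy in low-texture environments.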

Details Motivation: Traditional monocular visual-inertial odometry (VIO) degrades in low-texture environments where sparse features are insufficient, so dense depth estimation is needed as a complement; however, existing high-accuracy ViT-based depth models are too costly for real-time edge deployment. Method: Embed learned depth priors in the VINS-Mono optimization backend, with an affine-invariant depth consistency constraint, pairwise ordinal constraints, and a variance-driven gating mechanism that filters unreliable depth regions. Result: On the TartanGround and M3ED datasets, the method prevents divergence and reduces Absolute Trajectory Error (ATE) by up to 28.3% in challenging scenarios. Conclusion: Under strictly limited edge compute, the method achieves robust and accurate monocular VIO and addresses scale drift and tracking failure in low-texture environments. Abstract: Traditional monocular Visual-Inertial Odometry (VIO) systems struggle in low-texture environments where sparse visual features are insufficient for accurate pose estimation. To address this, dense Monocular Depth Estimation (MDE) has been widely explored as a complementary information source. While recent Vision Transformer (ViT) based complex foundational models offer dense, geometrically consistent depth, their computational demands typically preclude them from real-time edge deployment. Our work bridges this gap by integrating learned depth priors directly into the VINS-Mono optimization backend. We propose a novel framework that enforces affine-invariant depth consistency and pairwise ordinal constraints, explicitly filtering unstable artifacts via variance-based gating. This approach strictly adheres to the computational limits of edge devices while robustly recovering metric scale. Extensive experiments on the TartanGround and M3ED datasets demonstrate that our method prevents divergence in challenging scenarios and delivers significant accuracy gains, reducing Absolute Trajectory Error (ATE) by up to 28.3%. Code will be made available.
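One common way to relate an affine-invariant depth prediction to metric depth is to fit a scale and shift by least squares against a few trusted metric values, after discarding predictions flagged as unstable. The sketch below shows that idea under stated assumptions: the closed-form fit, the per-point variance input, and the `max_variance` threshold are illustrative and are not taken from the paper, which embeds its constraints as residuals inside the VINS-Mono backend.

```python
def gate_by_variance(pred, metric, variances, max_variance):
    """Keep only (pred, metric) pairs whose predicted-depth variance is small enough."""
    kept = [(p, m) for p, m, v in zip(pred, metric, variances) if v <= max_variance]
    return [p for p, _ in kept], [m for _, m in kept]

def fit_scale_shift(pred, metric):
    """Least-squares scale s and shift t minimizing sum((s * pred + t - metric)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_m = sum(metric) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("predicted depths are constant; scale is not identifiable")
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pred, metric))
    scale = cov / var_p
    shift = mean_m - scale * mean_p
    return scale, shift
```

Gating before fitting matters because a single unstable depth value can dominate a least-squares fit; the variance threshold here stands in for whatever stability test a real system would use.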

[88] Exploring Real-Time Super-Resolution: Benchmarking and Fine-Tuning for Streaming Content

Evgeney Bogatyrev,Khaled Abud,Ivan Molodetskikh,Nikita Alutis,Dmitry Vatolin

Main category: cs.CV

TL;DR: This paper introduces the StreamSR dataset and the EfRLFN model to improve real-time super-resolution of compressed video streams.

Details Motivation: Existing real-time super-resolution methods perform poorly on compressed video, and commonly used datasets do not reflect the characteristics of streaming media, so current benchmarks lack real-world relevance. Method: Build the StreamSR dataset from YouTube, covering a range of video genres and resolutions; propose EfRLFN, which combines Efficient Channel Attention with a hyperbolic tangent activation and is trained with a composite loss; benchmark 11 state-of-the-art real-time super-resolution models and test the effect of fine-tuning. Result: EfRLFN improves on existing methods in both visual quality and runtime efficiency; fine-tuning other models on StreamSR yields significant gains that generalize across several standard benchmarks. Conclusion: StreamSR fills a data gap for super-resolution research in real streaming scenarios, and EfRLFN offers an efficient, high-quality option for real-time super-resolution, moving the technique closer to practical use. Abstract: Recent advancements in real-time super-resolution have enabled higher-quality video streaming, yet existing methods struggle with the unique challenges of compressed video content. Commonly used datasets do not accurately reflect the characteristics of streaming media, limiting the relevance of current benchmarks. To address this gap, we introduce a comprehensive dataset - StreamSR - sourced from YouTube, covering a wide range of video genres and resolutions representative of real-world streaming scenarios. We benchmark 11 state-of-the-art real-time super-resolution models to evaluate their performance for the streaming use-case. Furthermore, we propose EfRLFN, an efficient real-time model that integrates Efficient Channel Attention and a hyperbolic tangent activation function - a novel design choice in the context of real-time super-resolution. We extensively optimized the architecture to maximize efficiency and designed a composite loss function that improves training convergence. EfRLFN combines the strengths of existing architectures while improving both visual quality and runtime performance. Finally, we show that fine-tuning other models on our dataset results in significant performance gains that generalize well across various standard benchmarks. We made the dataset, the code, and the benchmark available at https://github.com/EvgeneyBogatyrev/EfRLFN.

[89] ArtContext: Contextualizing Artworks with Open-Access Art History Articles and Wikidata Knowledge through a LoRA-Tuned CLIP Model

Samuel Waugh,Stuart James

Main category: cs.CV

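TL;DR: This paper proposes ArtContext, a pipeline that uses open-access art history articles and Wikidata knowledge, together with a customized CLIP model (PaintingCLIP), to annotate artworks with context and to strengthen the linking of art-historical text and images.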

Details Motivation: Art history literature discusses artworks both as a whole and in their specific parts, but manually matching what articles say to a particular artwork is difficult, and automated tools for linking text and images semantically are lacking. Method: Build a new corpus collection pipeline based on open-access art history articles and Wikidata knowledge; fine-tune CLIP with LoRA to obtain the domain-specific PaintingCLIP model; annotate artworks with context from multiple sources. Result: PaintingCLIP, trained with weak supervision, outperforms the original CLIP and provides context for a given artwork; the pipeline generalizes to other humanities fields. Conclusion: ArtContext provides a scalable, reusable framework for joint text-image analysis in art history research and advances the integration of knowledge graphs and vision models in the digital humanities. Abstract: Many Art History articles discuss artworks in general as well as specific parts of works, such as layout, iconography, or material culture. However, when viewing an artwork, it is not trivial to identify what different articles have said about the piece. Therefore, we propose ArtContext, a pipeline for taking a corpus of Open-Access Art History articles and Wikidata Knowledge and annotating Artworks with this information. We do this using a novel corpus collection pipeline, then learn a bespoke CLIP model adapted using Low-Rank Adaptation (LoRA) to make it domain-specific. We show that the new model, PaintingCLIP, which is weakly supervised by the collected corpus, outperforms CLIP and provides context for a given artwork. The proposed pipeline is generalisable and can be readily applied to numerous humanities areas.

[90] Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation

Alan Baade,Eric Ryan Chan,Kyle Sargent,Changan Chen,Justin Johnson,Ehsan Adeli,Li Fei-Fei

Main category: cs.CV

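TL;DR: This paper proposes Latent Forcing, which keeps the efficiency of latent diffusion models while operating directly on raw images. It processes latents and pixels jointly with separately tuned noise schedules, improving pixel-level generation quality.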

Details Motivation: Latent diffusion models are efficient but give up the benefits of end-to-end modeling: they lose information during image encoding, need a separately trained decoder, and model an auxiliary distribution. Method: Latent Forcing is a simple modification to existing architectures that jointly processes latents and pixels and uses separately tuned noise schedules to order the denoising trajectory, so that the latents act as a scratchpad for intermediate computation. Result: On ImageNet, Latent Forcing sets a new state of the art for diffusion-transformer-based pixel generation at the authors' compute scale, and the analysis clarifies how the order of conditioning signals relates to REPA distillation, tokenizer reconstruction quality, and diffusability. Conclusion: Latent Forcing bridges the gap between latent-space efficiency and end-to-end pixel-space modeling and offers a new direction for diffusion model design. Abstract: Latent diffusion models excel at generating high-quality images but lose the benefits of end-to-end modeling. They discard information during image encoding, require a separately trained decoder, and model an auxiliary distribution to the raw data. In this paper, we propose Latent Forcing, a simple modification to existing architectures that achieves the efficiency of latent diffusion while operating on raw natural images. Our approach orders the denoising trajectory by jointly processing latents and pixels with separately tuned noise schedules. This allows the latents to act as a scratchpad for intermediate computation before high-frequency pixel features are generated. We find that the order of conditioning signals is critical, and we analyze this to explain differences between REPA distillation in the tokenizer and the diffusion model, conditional versus unconditional generation, and how tokenizer reconstruction quality relates to diffusability. Applied to ImageNet, Latent Forcing achieves a new state-of-the-art for diffusion transformer-based pixel generation at our compute scale.

[91] Fighting MRI Anisotropy: Learning Multiple Cardiac Shapes From a Single Implicit Neural Representation

Carolina Brás,Soufiane Ben Haddou,Thijs P. Kuipers,Laura Alvarez-Florez,R. Nils Planken,Fleur V. Y. Tjong,Connie Bezzina,Ivana Išgum

Main category: cs.CV

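TL;DR: This paper trains a single neural implicit function on high-resolution, near-isotropic CTA data to jointly represent cardiac shapes from CMRI at any resolution, overcoming the limits that SAX CMRI anisotropy places on shape analysis. The method reconstructs the right ventricle (RV) and myocardium (MYO) well, with Dice coefficients and Hausdorff distances on 4CH slices supporting its accuracy and anatomical plausibility.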

Details Motivation: Short-axis cardiovascular magnetic resonance imaging (SAX CMRI) is anisotropic, which limits the precision of cardiac shape analysis, whereas high-resolution, near-isotropic CTA data offer a better geometric prior. Method: Train one neural implicit function on high-resolution, near-isotropic CTA data to jointly model cardiac structures (RV and MYO) from CMRI at any resolution, with MYO representing both the endocardial and epicardial surfaces of the left ventricle; evaluate by extracting 4CH slices from the reconstructed shapes and comparing them with CMRI reference segmentations. Result: On 4CH slices, the Dice similarity coefficients are 0.91 ± 0.07 (RV) and 0.75 ± 0.13 (MYO), and the Hausdorff distances are 6.21 ± 3.97 mm (RV) and 7.53 ± 5.13 mm (MYO); qualitative and quantitative results indicate accurate, smooth, and anatomically plausible shapes. Conclusion: The neural implicit approach mitigates CMRI anisotropy and improves cardiac shape reconstruction, providing a reliable basis for subsequent morphological analysis. Abstract: The anisotropic nature of short-axis (SAX) cardiovascular magnetic resonance imaging (CMRI) limits cardiac shape analysis. To address this, we propose to leverage near-isotropic, higher resolution computed tomography angiography (CTA) data of the heart. We use this data to train a single neural implicit function to jointly represent cardiac shapes from CMRI at any resolution. We evaluate the method for the reconstruction of right ventricle (RV) and myocardium (MYO), where MYO simultaneously models endocardial and epicardial left-ventricle surfaces. Since high-resolution SAX reference segmentations are unavailable, we evaluate performance by extracting a 4-chamber (4CH) slice of RV and MYO from their reconstructed shapes. When compared with the reference 4CH segmentation masks from CMRI, our method achieved a Dice similarity coefficient of 0.91 ± 0.07 and 0.75 ± 0.13, and a Hausdorff distance of 6.21 ± 3.97 mm and 7.53 ± 5.13 mm for RV and MYO, respectively. Quantitative and qualitative assessment demonstrate the model's ability to reconstruct accurate, smooth and anatomically plausible shapes, supporting improvements in cardiac shape analysis.
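The two metrics reported above have simple definitions, sketched here for binary masks given as sets of foreground pixel coordinates. This is a minimal illustration, assuming isotropic pixel spacing and a brute-force Hausdorff computation over all foreground pixels; the authors' evaluation code may differ (for example, by measuring distances between contours only).

```python
import math

def dice(mask_a, mask_b):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|) for sets of pixel coordinates."""
    if not mask_a and not mask_b:
        return 1.0  # two empty masks are treated as a perfect match
    return 2.0 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))

def hausdorff(points_a, points_b, spacing_mm=1.0):
    """Symmetric Hausdorff distance in mm, assuming isotropic pixel spacing."""
    def directed(src, dst):
        # Largest distance from any point in src to its nearest neighbor in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return spacing_mm * max(directed(points_a, points_b), directed(points_b, points_a))
```

Dice rewards overlap and is insensitive to where the disagreement lies, while the Hausdorff distance reports the single worst boundary disagreement, which is why the two are usually quoted together.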

[92] Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation

Penghui Ruan,Bojia Zi,Xianbiao Qi,Youze Huang,Rong Xiao,Pichao Wang,Jiannong Cao,Yuhui Shi

Main category: cs.CV

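TL;DR: Ctrl&Shift is an end-to-end diffusion framework for geometry-consistent, user-controllable object-level manipulation in images and videos without explicit 3D modeling. It decomposes manipulation into two stages (object removal, then reference-guided inpainting under camera pose control) and uses a multi-task training strategy, reaching state-of-the-art fidelity, viewpoint consistency, and controllability.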

Details Motivation: Existing methods struggle to achieve three goals at once: background preservation, geometric consistency under viewpoint changes, and user-controllable transformations. Geometry-based methods need explicit 3D reconstruction and generalize poorly, while diffusion-based methods generalize well but lack fine-grained geometric control. Method: Ctrl&Shift decouples manipulation into object removal and reference-guided inpainting under camera pose control, and unifies both in one diffusion process; a multi-task, multi-stage training strategy separates background, identity, and pose signals; a real-world dataset of paired images and videos with estimated relative camera poses is constructed. Result: State-of-the-art fidelity, viewpoint consistency, and controllability; to the authors' knowledge, the first object manipulation framework to unify fine-grained geometric control and real-world generalization without any explicit 3D modeling. Conclusion: Ctrl&Shift bridges the gap between precise geometric control and the generalization of diffusion models, offering an efficient, robust, and user-friendly approach to object-level editing for film post-production, AR, and creative editing. Abstract: Object-level manipulation, relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present Ctrl&Shift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that Ctrl&Shift achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.

[93] Enhanced Portable Ultra Low-Field Diffusion Tensor Imaging with Bayesian Artifact Correction and Deep Learning-Based Super-Resolution

Mark D. Olchanyi,Annabel Sorby-Adams,John Kirsch,Brian L. Edlow,Ava Farnan,Renfei Liu,Matthew S. Rosen,Emery N. Brown,W. Taylor Kimberly,Juan Eugenio Iglesias

Main category: cs.CV

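TL;DR: This paper presents a nine-direction, single-shell acquisition sequence for ultra-low-field (ULF) diffusion tensor imaging (DTI), a companion Bayesian bias field correction algorithm with angular dependence, and DiffSR, a convolutional neural network super-resolution algorithm that generalizes across datasets without re-training. Together they improve the spatial and angular resolution and signal-to-noise ratio of ULF DTI and its recovery of white matter microstructure, and their usefulness is shown on an Alzheimer's disease classification task.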

Details Motivation: Ultra-low-field (ULF) MRI is portable and could broaden access to neuroimaging, but its DTI sequences suffer from coarse spatial and angular resolution, low signal-to-noise ratio, long scan times, and artifacts spanning multiple domains, which calls for dedicated modeling and reconstruction methods. Method: Propose a nine-direction, single-shell ULF DTI acquisition sequence; design a Bayesian bias field correction algorithm with angular dependence; develop DiffSR, a CNN-based super-resolution algorithm that needs no re-training and generalizes across datasets. Result: In a synthetic downsampling experiment and in real, matched ULF and high-field DTI scans, the methods recover white matter microstructural and volumetric information; applied to synthetically degraded scans for Alzheimer's disease classification, DiffSR notably improves the agreement of DTI metrics with the un-degraded scans. Conclusion: The proposed sequence and algorithms, DiffSR in particular, provide a practical and generalizable approach to ULF DTI reconstruction and harmonization; the code is released to support further work on ULF reconstruction and DTI sequence harmonization. Abstract: Portable, ultra-low-field (ULF) magnetic resonance imaging has the potential to expand access to neuroimaging but currently suffers from coarse spatial and angular resolutions and low signal-to-noise ratios. Diffusion tensor imaging (DTI), a sequence tailored to detect and reconstruct white matter tracts within the brain, is particularly prone to such imaging degradation due to inherent sequence design coupled with prolonged scan times. In addition, ULF DTI scans exhibit artifacting that spans both the space and angular domains, requiring a custom modelling algorithm for subsequent correction. We introduce a nine-direction, single-shell ULF DTI sequence, as well as a companion Bayesian bias field correction algorithm that possesses angular dependence and a convolutional neural network-based super-resolution algorithm ("DiffSR") that is generalizable across DTI datasets and does not require re-training. We show through a synthetic downsampling experiment and white matter assessment in real, matched ULF and high-field DTI scans that these algorithms can recover microstructural and volumetric white matter information at ULF. We also show that DiffSR can be directly applied to white matter-based Alzheimer's disease classification in synthetically degraded scans, with notable improvements in agreement between DTI metrics, as compared to un-degraded scans. We freely disseminate the Bayesian bias correction algorithm and DiffSR with the goal of furthering progress on both ULF reconstruction methods and general DTI sequence harmonization.
We release all code related to DiffSR for public use at https://github.com/markolchanyi/DiffSR.

[94] A Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness

Yun-Cheng Li,Sen Lei,Heng-Chao Li,Ke Li

Main category: cs.CV

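TL;DR: This paper proposes DBTANet, a dual-branch semantic change detection framework that combines a frozen SAM branch (global semantics and boundary priors) with a ResNet34 branch (local detail) and adds a Bidirectional Temporal Awareness Module (BTAM) and a Gaussian-smoothed Projection Module (GSPM), improving change detection accuracy and boundary sharpness.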

Details Motivation: Existing methods suffer from blurred boundaries and inadequate temporal modeling, which limits segmentation accuracy in semantic change detection. Method: Propose a dual-branch Siamese encoder (frozen SAM branch plus ResNet34 branch), a Bidirectional Temporal Awareness Module (BTAM), and a Gaussian-smoothed Projection Module (GSPM) to model global semantics, local detail, temporal dependencies, and boundary awareness together. Result: State-of-the-art performance on two public benchmarks, with better segmentation accuracy of changed regions and sharper boundaries. Conclusion: Through complementary multi-dimensional features and joint temporal and boundary modeling, DBTANet provides an efficient and robust solution for semantic change detection in remote sensing images. Abstract: Semantic Change Detection (SCD) aims to detect and categorize land-cover changes from bi-temporal remote sensing images. Existing methods often suffer from blurred boundaries and inadequate temporal modeling, limiting segmentation accuracy. To address these issues, we propose a Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness, termed DBTANet. Specifically, we utilize a dual-branch Siamese encoder where a frozen SAM branch captures global semantic context and boundary priors, while a ResNet34 branch provides local spatial details, ensuring complementary feature representations. On this basis, we design a Bidirectional Temporal Awareness Module (BTAM) to aggregate multi-scale features and capture temporal dependencies in a symmetric manner. Furthermore, a Gaussian-smoothed Projection Module (GSPM) refines shallow SAM features, suppressing noise while enhancing edge information for boundary-aware constraints. Extensive experiments on two public benchmarks demonstrate that DBTANet effectively integrates global semantics, local details, temporal reasoning, and boundary awareness, achieving state-of-the-art performance.

[95] Arbitrary Ratio Feature Compression via Next Token Prediction

Yufan Liu,Daoyuan Ren,Zhipeng Zhang,Wenyang Luo,Bing Li,Weiming Hu,Stephen Maybank

Main category: cs.CV

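TL;DR: This paper proposes ARFC, an arbitrary-ratio feature compression framework in which a single model supports any compression ratio without re-training. Its core ARC module generates compressed tokens auto-regressively, a MoS module improves robustness, and an ERGC module preserves semantic structure; the method outperforms existing approaches on several tasks and in some cases even beats the original uncompressed features.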

Details Motivation: Existing feature compression methods usually need a dedicated model for each compression ratio, which limits flexibility and generalization and requires re-training for every new ratio. Method: Propose the ARFC framework, comprising an auto-regressive Arbitrary Ratio Compressor (ARC), a Mixture of Solutions (MoS) module, and an Entity Relation Graph Constraint (ERGC); ARC sets the compression ratio through the number of generated tokens, MoS fuses multiple compression results for stability, and ERGC models semantic and structural relationships during training. Result: On cross-modal retrieval, image classification, and image retrieval across multiple datasets, ARFC consistently outperforms existing methods at various compression ratios, and in some cases exceeds the original uncompressed features. Conclusion: ARFC is a flexible, efficient, and general feature compression framework suited to resource-constrained applications, removing the dependence of traditional methods on a fixed compression ratio. Abstract: Feature compression is increasingly important for improving the efficiency of downstream tasks, especially in applications involving large-scale or multi-modal data. While existing methods typically rely on dedicated models for achieving specific compression ratios, they are often limited in flexibility and generalization. In particular, retraining is necessary when adapting to a new compression ratio. To address this limitation, we propose a novel and flexible Arbitrary Ratio Feature Compression (ARFC) framework, which supports any compression ratio with a single model, eliminating the need for multiple specialized models. At its core, the Arbitrary Ratio Compressor (ARC) is an auto-regressive model that performs compression via next-token prediction. This allows the compression ratio to be controlled at inference simply by adjusting the number of generated tokens. To enhance the quality of the compressed features, two key modules are introduced. The Mixture of Solutions (MoS) module refines the compressed tokens by utilizing multiple compression results (solutions), reducing uncertainty and improving robustness. The Entity Relation Graph Constraint (ERGC) is integrated into the training process to preserve semantic and structural relationships during compression. Extensive experiments on cross-modal retrieval, image classification, and image retrieval tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches at various compression ratios. Notably, in some cases, it even surpasses the performance of the original, uncompressed features.
These results validate the effectiveness and versatility of ARFC for practical, resource-constrained scenarios.

[96] What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation

Zhenlong Yuan,Xiangyan Qu,Jing Tang,Rui Chen,Lei Sun,Ruidong Chen,Hongwei Yu,Chengxuan Qian,Xiangxiang Chu,Shuo Li,Yuyin Zhou

Main category: cs.CV

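TL;DR: This paper proposes ImagineAgent, a framework that combines cognitive reasoning with generative imagination to address cross-modal hallucination and occlusion-induced ambiguity in open-vocabulary human-object interaction (OV-HOI). It reaches state-of-the-art results on SWIG-HOI and HICO-DET with only about 20% of the training data.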

Details Motivation: In open-vocabulary human-object interaction (OV-HOI), existing multimodal large language models are limited by cross-modal hallucinations and by semantic ambiguity caused by occlusion, which prevents robust visual understanding. Method: Propose the ImagineAgent agentic framework: build cognitive maps that explicitly model relationships between entities and actions; dynamically invoke tools such as retrieval augmentation, image cropping, and diffusion models to gather domain knowledge and visual evidence; design a composite reward that balances prediction accuracy and tool efficiency. Result: State-of-the-art performance on SWIG-HOI and HICO-DET with only about 20% of the training data, supporting the method's robustness and efficiency. Conclusion: By combining cognitive modeling with generative tool use, ImagineAgent mitigates cross-modal hallucination and occlusion ambiguity in OV-HOI and offers an interpretable, efficient approach to multimodal reasoning. Abstract: Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose ImagineAgent, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method innovatively constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. Subsequently, it dynamically invokes tools including retrieval augmentation, image cropping, and diffusion models to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on SWIG-HOI and HICO-DET datasets demonstrate our SOTA performance, requiring approximately 20% of training data compared to existing methods, validating our robustness and efficiency.

[97] Vascular anatomy-aware self-supervised pre-training for X-ray angiogram analysis

De-Xing Huang,Chaohui Yu,Xiao-Hu Zhou,Tian-Yu Xiang,Qin-Yi Zhang,Mei-Jiang Gui,Rui-Ze Ma,Chen-Yu Wang,Nu-Fang Xiao,Fan Wang,Zeng-Guang Hou

Main category: cs.CV

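TL;DR: This paper proposes VasoMIM, a vascular anatomy-aware masked image modeling framework, and builds XA-170K, the largest X-ray angiogram pre-training dataset to date, to address the scarcity of annotated data in this field. An anatomy-guided masking strategy and an anatomical consistency loss improve downstream performance to the state of the art.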

Details Motivation: X-ray angiogram analysis is severely constrained by the scarcity of annotated data, and existing self-supervised learning (SSL) methods lack both an effective framework and a large-scale dataset for this domain. Method: Propose VasoMIM, a vascular anatomy-aware masked image modeling framework with an anatomy-guided masking strategy (preferentially masking vessel-containing patches) and an anatomical consistency loss (keeping vessel structure consistent between original and reconstructed images); curate XA-170K, a large-scale unlabeled X-ray angiogram dataset for pre-training. Result: Across four downstream tasks on six datasets, VasoMIM transfers well and outperforms existing methods, reaching state-of-the-art performance. Conclusion: VasoMIM has the potential to serve as a foundation model for X-ray angiogram analysis; its code and dataset will be released. Abstract: X-ray angiography is the gold standard imaging modality for cardiovascular diseases. However, current deep learning approaches for X-ray angiogram analysis are severely constrained by the scarcity of annotated data. While large-scale self-supervised learning (SSL) has emerged as a promising solution, its potential in this domain remains largely unexplored, primarily due to the lack of effective SSL frameworks and large-scale datasets. To bridge this gap, we introduce a vascular anatomy-aware masked image modeling (VasoMIM) framework that explicitly integrates domain-specific anatomical knowledge. Specifically, VasoMIM comprises two key designs: an anatomy-guided masking strategy and an anatomical consistency loss. The former strategically masks vessel-containing patches to compel the model to learn robust vascular semantics, while the latter preserves structural consistency of vessels between original and reconstructed images, enhancing the discriminability of the learned representations. In conjunction with VasoMIM, we curate XA-170K, the largest X-ray angiogram pre-training dataset to date. We validate VasoMIM on four downstream tasks across six datasets, where it demonstrates superior transferability and achieves state-of-the-art performance compared to existing methods. These findings highlight the significant potential of VasoMIM as a foundation model for advancing a wide range of X-ray angiogram analysis tasks. VasoMIM and XA-170K will be available at https://github.com/Dxhuang-CASIA/XA-SSL.
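The idea of masking vessel-containing patches preferentially can be sketched as weighted sampling without replacement, where each patch's weight grows with the fraction of its pixels that lie on a vessel. Everything specific in the example is an assumption for illustration: the per-patch `vessel_fraction` input, the `floor` weight that keeps background patches selectable, and the Efraimidis-Spirakis sampling scheme are not described in the abstract.

```python
import random

def anatomy_guided_mask(vessel_fraction, mask_ratio, seed=None, floor=1e-3):
    """Pick patch indices to mask, favoring patches with more vessel pixels.

    vessel_fraction: per-patch fraction of vessel pixels in [0, 1] (hypothetical input).
    mask_ratio: fraction of all patches to mask.
    floor: small weight added so vessel-free patches can still be chosen.
    """
    rng = random.Random(seed)
    n_mask = round(mask_ratio * len(vessel_fraction))
    keyed = []
    for idx, frac in enumerate(vessel_fraction):
        weight = frac + floor
        # Efraimidis-Spirakis key: taking the largest keys yields a weighted
        # sample without replacement.
        keyed.append((rng.random() ** (1.0 / weight), idx))
    keyed.sort(reverse=True)
    return sorted(idx for _, idx in keyed[:n_mask])
```

With a mask ratio of 0.25 over 16 patches of which 4 contain vessels, this sampler masks the vessel patches almost every time, which is the behavior the masking strategy is meant to encourage.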

[98] Supervise-assisted Multi-modality Fusion Diffusion Model for PET Restoration

Yingkai Zhang,Shuang Chen,Ye Tian,Yunyi Gao,Jianyong Jiang,Ying Fu

Main category: cs.CV

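TL;DR: This paper proposes MFdiff, a supervise-assisted multi-modality fusion diffusion model that uses MR images to help restore low-dose PET images. A multi-modality feature fusion module and a two-stage supervise-assisted learning strategy improve the quality of the restored PET images, especially on out-of-distribution (OOD) data.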

Details Motivation: Lowering the radiation dose of PET exams degrades image quality; using MR images to help restore standard-dose PET faces inconsistencies in structure and texture across modalities and a mismatch on out-of-distribution (OOD) data. Method: Propose the supervise-assisted multi-modality fusion diffusion model (MFdiff): 1) a multi-modality feature fusion module that learns an optimized fusion of MR and PET features; 2) a diffusion model that iteratively generates high-quality standard-dose PET images conditioned on the fused feature; 3) a two-stage supervise-assisted learning strategy that combines generalized priors from simulated data with specific priors from real OOD data. Result: MFdiff restores high-quality standard-dose PET images from multi-modality inputs and outperforms state-of-the-art methods both qualitatively and quantitatively. Conclusion: By jointly modeling multi-modality information and learning in stages, MFdiff improves low-dose PET restoration and generalizes better to real clinical OOD data. Abstract: Positron emission tomography (PET) offers powerful functional imaging but involves radiation exposure. Efforts to reduce this exposure by lowering the radiotracer dose or scan time can degrade image quality. While using magnetic resonance (MR) images with clearer anatomical information to restore standard-dose PET (SPET) from low-dose PET (LPET) is a promising approach, it faces challenges with the inconsistencies in the structure and texture of multi-modality fusion, as well as the mismatch in out-of-distribution (OOD) data. In this paper, we propose a supervise-assisted multi-modality fusion diffusion model (MFdiff) for addressing these challenges for high-quality PET restoration. Firstly, to fully utilize auxiliary MR images without introducing extraneous details in the restored image, a multi-modality feature fusion module is designed to learn an optimized fusion feature. Secondly, using the fusion feature as an additional condition, high-quality SPET images are iteratively generated based on the diffusion model. Furthermore, we introduce a two-stage supervise-assisted learning strategy that harnesses both generalized priors from simulated in-distribution datasets and specific priors tailored to in-vivo OOD data. Experiments demonstrate that the proposed MFdiff effectively restores high-quality SPET images from multi-modality inputs and outperforms state-of-the-art methods both qualitatively and quantitatively.

[99] Perception-based Image Denoising via Generative Compression

Nam Nguyen,Thinh Nguyen,Bella Bose

Main category: cs.CV

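TL;DR: This paper proposes a generative compression framework for perception-based image denoising. Entropy-coded latent representations and generative decoders (such as a WGAN or a diffusion model) balance structure preservation against realistic texture, and the paper provides theoretical error bounds and experimental validation.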

Details Motivation: Traditional distortion-driven denoising tends to over-smooth under strong noise and distribution shift and struggles to balance structural fidelity with perceptual realism. Method: Build a generative compression denoising framework in which entropy coding constrains low-complexity latent representations and generative decoders rely on perceptual measures such as LPIPS loss and Wasserstein distance; give two instantiations, a conditional WGAN compression denoiser that explicitly controls the rate-distortion-perception trade-off and a conditional diffusion reconstruction strategy that performs iterative denoising guided by compressed latents; and establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise. Result: Consistent perceptual improvements on synthetic and real-noise benchmarks, with competitive distortion performance. Conclusion: The generative compression paradigm can reconcile rate, distortion, and perception in denoising, and both the theoretical analysis and the experiments support its effectiveness and robustness. Abstract: Image denoising aims to remove noise while preserving structural details and perceptual realism, yet distortion-driven methods often produce over-smoothed reconstructions, especially under strong noise and distribution shift. This paper proposes a generative compression framework for perception-based denoising, where restoration is achieved by reconstructing from entropy-coded latent representations that enforce low-complexity structure, while generative decoders recover realistic textures via perceptual measures such as learned perceptual image patch similarity (LPIPS) loss and Wasserstein distance. Two complementary instantiations are introduced: (i) a conditional Wasserstein GAN (WGAN)-based compression denoiser that explicitly controls the rate-distortion-perception (RDP) trade-off, and (ii) a conditional diffusion-based reconstruction strategy that performs iterative denoising guided by compressed latents. We further establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise, including bounds on reconstruction error and decoding error probability. Experiments on synthetic and real-noise benchmarks demonstrate consistent perceptual improvements while maintaining competitive distortion performance.
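A rate-distortion-perception trade-off of the kind mentioned above is often written as a weighted objective over an encoder and a decoder. The form below is a generic sketch, not the paper's stated loss: the symbols $Y$ (noisy input), $X$ (clean image), $Z$ (entropy-coded latent), $\hat{X}$ (reconstruction), the distortion measure $\Delta$, and the weights $\lambda_R$, $\lambda_P$ are all notation assumed for this illustration.

```latex
\min_{f,\,g}\;
\underbrace{\mathbb{E}\big[\Delta(X, \hat{X})\big]}_{\text{distortion}}
\;+\; \lambda_R\, \underbrace{\mathbb{E}\big[-\log_2 p(Z)\big]}_{\text{rate}}
\;+\; \lambda_P\, \underbrace{W_1\big(p_X,\, p_{\hat{X}}\big)}_{\text{perception}},
\qquad Z = f(Y), \quad \hat{X} = g(Z).
```

Increasing $\lambda_P$ pushes the distribution of reconstructions toward that of clean images, typically at some cost in distortion, while $\lambda_R$ limits how much information the latent may carry.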

[100] LUVE : Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts

Chen Zhao,Jiawei Chen,Hongyu Li,Zhuoliang Kang,Shilin Lu,Xiaoming Wei,Kai Zhang,Jian Yang,Ying Tai

Main category: cs.CV

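TL;DR: This paper proposes LUVE, a framework whose three-stage cascaded architecture addresses the motion modeling, semantic planning, and detail synthesis challenges of ultra-high-resolution video generation, markedly improving the visual quality and content fidelity of UHR video.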

Details Motivation: Existing video diffusion models still face compounded challenges in motion modeling, semantic planning, and detail synthesis for ultra-high-resolution (UHR) video generation, making it hard to achieve both high resolution and high quality. Method: Proposes LUVE, a latent-cascaded UHR video generation framework built on dual frequency experts, with three stages: low-resolution motion generation, video upsampling in latent space, and high-resolution content refinement in which high-frequency and low-frequency experts cooperate. Result: LUVE achieves better photorealism and content fidelity in UHR video generation; ablation studies validate each component. Conclusion: LUVE offers an efficient, high-quality solution for UHR video generation, with its latent-space processing and dual-frequency expert design as the main novelties. Abstract: Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that our LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project is available at https://unicornanrocinu.github.io/LUVE_web/.

[101] Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception

Zesheng Jia,Jin Wang,Siao Liu,Lingzhi Li,Ziyao Huang,Yunjiang Xu,Jianping Wang

Main category: cs.CV

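TL;DR: This paper proposes FlowAdapt, a parameter-efficient domain adaptation framework for multi-agent collaborative perception grounded in optimal transport theory. A Wasserstein Greedy Sampling strategy and a Progressive Knowledge Transfer module address inter-frame redundancy and deep-layer semantic erosion, reaching state-of-the-art performance while training only 1% of the parameters.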

Details Motivation: Fast domain adaptation is a key challenge for deploying V2X multi-agent collaborative perception; existing PEFT methods suffer performance degradation and training instability in multi-agent settings. Method: Proposes FlowAdapt, a framework grounded in optimal transport theory: 1) a Wasserstein Greedy Sampling strategy that filters redundant samples via a bounded covering radius; 2) a Progressive Knowledge Transfer module that gradually injects compressed early-stage representations into later layers through learnable pathways, alleviating semantic degradation. Result: Validated on three benchmarks, FlowAdapt reaches state-of-the-art performance with only 1% trainable parameters and shows strong sample efficiency and generalization. Conclusion: FlowAdapt addresses the redundancy and semantic degradation problems of multi-agent PEFT and offers an efficient, stable, and lightweight approach to domain adaptation for V2X collaborative perception. Abstract: Fast domain adaptation remains a fundamental challenge for deploying multi-agent systems across diverse environments in Vehicle-to-Everything (V2X) collaborative perception. Despite the success of Parameter-Efficient Fine-Tuning (PEFT) in natural language processing and conventional vision tasks, directly applying PEFT to multi-agent settings leads to significant performance degradation and training instability. In this work, we conduct a detailed analysis and identify two key factors: (i) inter-frame redundancy in heterogeneous sensory streams, and (ii) erosion of fine-grained semantics in deep-layer representations under PEFT adaptation. To address these issues, we propose FlowAdapt, a parameter-efficient framework grounded in optimal transport theory, which minimizes information transport costs across both data distributions and network hierarchies. Specifically, we introduce a Wasserstein Greedy Sampling strategy to selectively filter redundant samples via a bounded covering radius. Furthermore, a Progressive Knowledge Transfer module is designed to progressively inject compressed early-stage representations into later stages through learnable pathways, alleviating semantic degradation in late-stage adaptation. Extensive experiments on three benchmarks demonstrate that FlowAdapt achieves state-of-the-art performance with only 1% of trainable parameters, effectively bridging domain gaps with superior sample efficiency and generalization.
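
The abstract does not spell out the Wasserstein Greedy Sampling procedure. As a rough illustration of greedy selection under a bounded covering radius, the sketch below uses plain farthest-point (k-center) selection with Euclidean distance on toy feature vectors. The Euclidean metric, the `greedy_cover` name, and the stopping rule are assumptions made for illustration and are not taken from the paper.

```python
import math

def greedy_cover(points, radius):
    """Greedily pick points until every point lies within `radius` of a pick.

    Hypothetical helper: farthest-point (k-center) selection, with Euclidean
    distance standing in for the transport-based distance used in the paper.
    """
    selected = [0]  # start from the first sample (arbitrary choice)
    # dist[i] = distance from point i to its nearest selected point
    dist = [math.dist(p, points[0]) for p in points]
    while max(dist) > radius:
        # the worst-covered sample becomes the next pick
        far = max(range(len(points)), key=dist.__getitem__)
        selected.append(far)
        dist = [min(d, math.dist(p, points[far])) for d, p in zip(dist, points)]
    return selected, max(dist)

# Toy "frames": two tight clusters of near-duplicate feature vectors.
frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
kept, cover_radius = greedy_cover(frames, radius=0.5)
```

On this toy input one representative per cluster is kept, and the returned radius is the largest distance from any dropped sample to its nearest kept one.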

[102] A Large Language Model for Disaster Structural Reconnaissance Summarization

Yuqing Gao,Guanren Zhou,Khalid M. Mosalam

Main category: cs.CV

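TL;DR: This paper proposes an LLM-based Disaster Reconnaissance Summarization framework (LLM-DRS) that fuses vision data with text metadata, uses deep convolutional neural networks to extract structural damage attributes, and drives an LLM with tailored prompts to generate structured post-disaster assessment reports, improving the practicality and decision-support value of vision-based SHM for rapid post-disaster reconnaissance.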

Details Motivation: Existing vision-based structural health monitoring (SHM) methods output only discrete results (such as damage classes and coordinates) that engineers must integrate and analyze by hand; the rise of large language models (LLMs) offers a new route to automatically generating readable, decision-ready post-disaster assessment reports. Method: Builds the LLM-DRS framework: 1) a standardized on-site reconnaissance procedure for collecting images and text metadata; 2) deep convolutional neural networks that jointly extract key attributes such as damage state, material type, and damage level; 3) the structured attributes and metadata are fed together into an LLM with engineered prompts to generate natural-language summary reports. Result: Experiments show that LLM-DRS can automatically generate reconnaissance summary reports for individual structures or regions, showing potential to improve the resilience of the built environment in rapid post-disaster assessment. Conclusion: Integrating LLMs into vision-based SHM, especially for rapid post-disaster reconnaissance, can markedly improve the interpretability and usability of results and their value for engineering decisions, and is an important step for AI-enabled infrastructure resilience management. Abstract: Artificial Intelligence (AI)-aided vision-based Structural Health Monitoring (SHM) has emerged as an effective approach for monitoring and assessing structural condition by analyzing image and video data. By integrating Computer Vision (CV) and Deep Learning (DL), vision-based SHM can automatically identify and localize visual patterns associated with structural damage. However, previous works typically generate only discrete outputs, such as damage class labels and damage region coordinates, requiring engineers to further reorganize and analyze these results for evaluation and decision-making. In late 2022, Large Language Models (LLMs) became popular across multiple fields, providing new insights into AI-aided vision-based SHM. In this study, a novel LLM-based Disaster Reconnaissance Summarization (LLM-DRS) framework is proposed. It introduces a standard reconnaissance plan in which the collection of vision data and corresponding metadata follows a well-designed on-site investigation process. Text-based metadata and image-based vision data are then processed and integrated into a unified format, where well-trained Deep Convolutional Neural Networks extract key attributes, including damage state, material type, and damage level. Finally, all data are fed into an LLM with carefully designed prompts, enabling the LLM-DRS to generate summary reports for individual structures or affected regions based on aggregated attributes and metadata. Results show that integrating LLMs into vision-based SHM, particularly for rapid post-disaster reconnaissance, demonstrates promising potential for improving resilience of the built environment through effective reconnaissance.

[103] PLOT-CT: Pre-log Voronoi Decomposition Assisted Generation for Low-dose CT Reconstruction

Bin Huang,Xun Yu,Yikun Zhang,Yi Zhang,Yang Chen,Qiegen Liu

Main category: cs.CV

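TL;DR: This paper proposes PLOT-CT, a framework that applies Voronoi decomposition to sinograms in the pre-log domain to separate noise from structural information, improving low-dose CT reconstruction accuracy.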

Details Motivation: Most existing LDCT reconstruction methods operate in the image domain or the post-log projection domain, so they cannot fully exploit the structural information in pre-log measurements, and the logarithmic transform amplifies noise, limiting reconstruction accuracy. Method: Proposes the PLOT-CT framework, which applies Voronoi decomposition to pre-log sinograms and disentangles the data into components embedded in separate latent spaces, improving feature discriminability and suppressing noise. Result: At the 1e4 incident photon level, PSNR improves by 2.36 dB over traditional methods, reaching state-of-the-art performance in the pre-log domain. Conclusion: Through explicit structural decomposition in the pre-log domain, PLOT-CT mitigates the effect of noise, preserves the original information, and markedly improves low-dose CT reconstruction quality. Abstract: Low-dose computed tomography (LDCT) reconstruction is fundamentally challenged by severe noise and compromised data fidelity under reduced radiation exposure. Most existing methods operate either in the image or post-log projection domain, which fails to fully exploit the rich structural information in pre-log measurements while being highly susceptible to noise. The requisite logarithmic transformation critically amplifies noise within these data, imposing exceptional demands on reconstruction precision. To overcome these challenges, we propose PLOT-CT, a novel framework for Pre-Log vOronoi decomposiTion-assisted CT generation. Our method begins by applying Voronoi decomposition to pre-log sinograms, disentangling the data into distinct underlying components, which are embedded in separate latent spaces. This explicit decomposition significantly enhances the model's capacity to learn discriminative features, directly improving reconstruction accuracy by mitigating noise and preserving information inherent in the pre-log domain. Extensive experiments demonstrate that PLOT-CT achieves state-of-the-art performance, attaining a 2.36 dB PSNR improvement over traditional methods at the 1e4 incident photon level in the pre-log domain.

[104] PLESS: Pseudo-Label Enhancement with Spreading Scribbles for Weakly Supervised Segmentation

Yeva Gabrielyan,Varduhi Yeghiazaryan,Irina Voiculescu

Main category: cs.CV

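TL;DR: This paper proposes PLESS, a generic pseudo-label enhancement strategy that hierarchically partitions the image and propagates scribble information within semantically coherent regions, improving the reliability and spatial consistency of pseudo-labels in weakly supervised medical image segmentation.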

Details Motivation: Scribble annotations lower labeling cost but are noisy and incomplete; existing pseudo-label-based methods are limited by pseudo-label quality. Method: PLESS builds a hierarchy of spatially coherent image regions and propagates scribble information within semantically coherent regions to refine pseudo-labels; the method is model-agnostic and integrates easily into existing pseudo-label frameworks. Result: On two cardiac MRI datasets (ACDC and MSCMRseg), combined with four scribble-supervised algorithms, PLESS consistently improves segmentation accuracy. Conclusion: PLESS is an effective, generic, and easily integrated pseudo-label enhancement strategy that robustly improves weakly supervised medical image segmentation. Abstract: Weakly supervised learning with scribble annotations uses sparse user-drawn strokes to indicate segmentation labels on a small subset of pixels. This annotation reduces the cost of dense pixel-wise labeling, but suffers inherently from noisy and incomplete supervision. Recent scribble-based approaches in medical image segmentation address this limitation using pseudo-label-based training; however, the quality of the pseudo-labels remains a key performance limit. We propose PLESS, a generic pseudo-label enhancement strategy which improves reliability and spatial consistency. It builds on a hierarchical partitioning of the image into spatially coherent regions. PLESS propagates scribble information to refine pseudo-labels within semantically coherent regions. The framework is model-agnostic and easily integrates into existing pseudo-label methods. Experiments on two public cardiac MRI datasets (ACDC and MSCMRseg) across four scribble-supervised algorithms show consistent improvements in segmentation accuracy. Code will be made available on GitHub upon acceptance.

[105] ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning

Changti Wu,Jiahuai Mao,Yuzhuo Miao,Shijie Lian,Bin Yu,Xiaopeng Lin,Cong Huang,Lei Zhang,Kai Chen

Main category: cs.CV

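TL;DR: This paper proposes ScalSelect, a training-free multimodal data selection method with linear time complexity for large-scale visual instruction tuning (VIT), which markedly improves training efficiency without relying on external models or datasets.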

Details Motivation: Large-scale visual instruction tuning (VIT) is computationally expensive and inefficient because of data redundancy, which calls for an efficient, scalable, training-free multimodal data selection method. Method: ScalSelect first builds sample representations by extracting the visual features most attended by instruction tokens in the target VLM; it then uses a subspace approximation to score, in linear time, each sample's contribution to the dominant subspace of the full dataset's representations, giving importance scores without pairwise comparisons. Result: Experiments across multiple VLMs, datasets, and selection budgets show that 16% of the data suffices to reach over 97.5% of full-data training performance, and even exceeds full-data training in some settings. Conclusion: ScalSelect is an efficient, scalable, training-free multimodal data selection method that offers a practical and robust way to lighten VIT training. Abstract: Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on the large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates the need for multimodal data selection to improve training efficiency. Existing data selection methods for VIT either require costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at https://github.com/ChangtiWu/ScalSelect.

[106] Electrostatics-Inspired Surface Reconstruction (EISR): Recovering 3D Shapes as a Superposition of Poisson's PDE Solutions

Diego Patiño,Knut Peterson,Kostas Daniilidis,David K. Han

Main category: cs.CV

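TL;DR: This paper proposes a new implicit shape representation based on Poisson's equation (instead of the usual Eikonal equation). It builds an SDF approximation from Green's functions and the principle of linear superposition, markedly improving the reconstruction of high-frequency details.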

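Details Motivation: Conventional SDF learning based on the Eikonal equation is limited in reconstructing high-frequency geometric details; this work explores an alternative PDE that is physically meaningful and has better mathematical properties (linear, analytically solvable) to improve reconstruction quality. Method: Models surface reconstruction as solving Poisson's equation; builds intuition through physical analogies such as the electrostatic potential; uses Green's functions to derive a closed-form parametric expression for the solution; exploits the linearity of Poisson's equation to express the target implicit field as a linear superposition of solutions associated with several shape priors. Result: With only a few shape priors, the method approximates the SDF of high-frequency geometric details (such as sharp edges and thin structures) better than existing Eikonal-based methods. Conclusion: Using Poisson's equation as a proxy PDE, combined with Green's functions and linear superposition, is an effective way to improve the accuracy and detail of implicit shape representations. Abstract: Implicit shape representations, such as signed distance functions (SDFs), are a popular approach to recovering the surface of a 3D shape as a level set of a scalar field. Several methods approximate SDFs using machine learning strategies that exploit the knowledge that SDFs are solutions of the Eikonal partial differential equation (PDE). In this work, we present a novel approach to surface reconstruction by encoding it as a solution to a proxy PDE, namely Poisson's equation. Then, we explore the connection between Poisson's equation and physics, e.g., the electrostatic potential due to a positive charge density. We employ Green's functions to obtain a closed-form parametric expression for the PDE's solution, and leverage the linearity of our proxy PDE to find the target shape's implicit field as a superposition of solutions. Our method shows improved results in approximating high-frequency details, even with a small number of shape priors.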
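
For reference, the superposition idea can be written out for the free-space case. This is a sketch assuming the 3D free-space Green's function and physical constants absorbed into $\rho$; the source densities $\rho_k$ and weights $w_k$ are illustrative symbols, and the paper's actual parametrization may differ.

```latex
-\nabla^2 \phi(\mathbf{x}) = \rho(\mathbf{x}), \qquad
G(\mathbf{x}, \mathbf{y}) = \frac{1}{4\pi \lVert \mathbf{x} - \mathbf{y} \rVert}, \qquad
\phi(\mathbf{x}) = \int G(\mathbf{x}, \mathbf{y})\, \rho(\mathbf{y})\, d\mathbf{y}

\rho = \sum_k w_k \rho_k
\;\Longrightarrow\;
\phi = \sum_k w_k \phi_k, \qquad
\phi_k(\mathbf{x}) = \int G(\mathbf{x}, \mathbf{y})\, \rho_k(\mathbf{y})\, d\mathbf{y}
```

The second line holds only because the PDE is linear; the Eikonal equation $\lVert \nabla \phi \rVert = 1$ is nonlinear, so its solutions cannot be combined this way.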

[107] Brain Tumor Classifiers Under Attack: Robustness of ResNet Variants Against Transferable FGSM and PGD Attacks

Ryan Deem,Garrett Goodman,Waqas Majeed,Md Abdullah Al Hafiz Khan,Michail S. Alexiou

Main category: cs.CV

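TL;DR: This paper studies the adversarial robustness of ResNet-based brain tumor classifiers (BrainNet, BrainNeXt, DilationNet) on MRI data. BrainNeXt is the most robust to black-box attacks but produces adversarial samples that transfer poorly, while lower input resolution and removing augmentation markedly weaken robustness, revealing a trade-off between accuracy and robustness.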

Details Motivation: The adversarial robustness of deep learning models for brain tumor classification is still unclear, yet it is critical for clinical deployment. Method: Evaluates three ResNet variants (BrainNet, BrainNeXt, DilationNet) against FGSM and PGD attacks under three MRI preprocessing configurations. Result: BrainNeXt is the most robust to black-box attacks but generates adversarial samples that transfer weakly; BrainNet and DilationNet are vulnerable to attacks from each other; shrunk, non-augmented data markedly reduces robustness even when clean test accuracy stays high. Conclusion: Brain tumor classifiers should be evaluated on both classification performance and adversarial robustness to ensure reliable clinical deployment. Abstract: Adversarial robustness in deep learning models for brain tumor classification remains an underexplored yet critical challenge, particularly for clinical deployment scenarios involving MRI data. In this work, we investigate the susceptibility and resilience of several ResNet-based architectures, referred to as BrainNet, BrainNeXt and DilationNet, against gradient-based adversarial attacks, namely FGSM and PGD. These models, based on ResNet, ResNeXt, and dilated ResNet variants respectively, are evaluated across three preprocessing configurations: (i) full-sized augmented, (ii) shrunk augmented and (iii) shrunk non-augmented MRI datasets. Our experiments reveal that BrainNeXt models exhibit the highest robustness to black-box attacks, likely due to their increased cardinality, though they produce weaker transferable adversarial samples. In contrast, BrainNet and Dilation models are more vulnerable to attacks from each other, especially under PGD with higher iteration steps and $\alpha$ values. Notably, shrunk and non-augmented data significantly reduce model resilience, even when the untampered test accuracy remains high, highlighting a key trade-off between input resolution and adversarial vulnerability. These results underscore the importance of jointly evaluating classification performance and adversarial robustness for reliable real-world deployment in brain MRI analysis.
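
To make the two attacks concrete, here is a minimal sketch of FGSM and PGD against a toy logistic-regression classifier rather than a ResNet. The model, its weights, and the function names are all illustrative assumptions; `alpha` is taken to be the PGD step size, as is conventional, and clipping to a valid pixel range is omitted for brevity.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of binary cross-entropy w.r.t. the input of a logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))     # predicted probability of class 1
    return [(p - y) * wi for wi in w]  # dL/dx = (p - y) * w

def fgsm(w, b, x, y, eps):
    """Single-step attack: move each input by eps along the gradient sign."""
    g = loss_grad_wrt_input(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, alpha, steps):
    """Iterated sign-gradient steps of size alpha, projected back onto the
    L-infinity ball of radius eps around the clean input x."""
    adv = list(x)
    for _ in range(steps):
        g = loss_grad_wrt_input(w, b, adv, y)
        adv = [ai + alpha * sign(gi) for ai, gi in zip(adv, g)]
        adv = [min(max(ai, xi - eps), xi + eps) for ai, xi in zip(adv, x)]
    return adv

# Toy model and sample (hypothetical values).
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_fgsm = fgsm(w, b, x, y, eps=0.1)
x_pgd = pgd(w, b, x, y, eps=0.1, alpha=0.03, steps=10)
```

In a transfer (black-box) setting like the one studied here, the gradient would be computed on a surrogate model and the resulting sample evaluated on a different target model.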

[108] GR-Diffusion: 3D Gaussian Representation Meets Diffusion in Whole-Body PET Reconstruction

Mengxiao Geng,Zijie Chen,Ran Hong,Bingxuan Li,Qiegen Liu

Main category: cs.CV

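TL;DR: This paper proposes GR-Diffusion, a framework that combines the geometric priors of the three-dimensional discrete Gaussian representation (GR) with the generative power of diffusion models for low-dose whole-body PET reconstruction, markedly improving image quality and detail preservation.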

Details Motivation: PET reconstruction faces noise amplification, structural blurring, and detail loss; conventional methods are limited by their low-pass behavior and struggle to achieve both global consistency and local accuracy. Method: Proposes the GR-Diffusion framework, which uses GR to generate a physically plausible, structurally explicit reference image from projection data, and designs a hierarchical guidance mechanism based on the GR reference (fine-grained difference refinement plus coarse-grained multi-scale difference correction) that guides the diffusion model to progressively integrate the geometric prior and recover sub-voxel information. Result: On the UDPET and clinical datasets, GR-Diffusion outperforms existing state-of-the-art methods across dose levels, markedly improving 3D whole-body PET image quality and the fidelity of physiological details. Conclusion: The synergistic integration of GR and diffusion models can overcome the ill-posedness and sparse-sampling limits of PET reconstruction, offering a new approach to low-dose molecular imaging. Abstract: Positron emission tomography (PET) reconstruction is a critical challenge in molecular imaging, often hampered by noise amplification, structural blurring, and detail loss due to sparse sampling and the ill-posed nature of inverse problems. The three-dimensional discrete Gaussian representation (GR), which efficiently encodes 3D scenes using parameterized discrete Gaussian distributions, has shown promise in computer vision. In this work, we propose a novel GR-Diffusion framework that synergistically integrates the geometric priors of GR with the generative power of diffusion models for 3D low-dose whole-body PET reconstruction. GR-Diffusion employs GR to generate a reference 3D PET image from projection data, establishing a physically grounded and structurally explicit benchmark that overcomes the low-pass limitations of conventional point-based or voxel-based methods. This reference image serves as a dual guide during the diffusion process, ensuring both global consistency and local accuracy. Specifically, we employ a hierarchical guidance mechanism based on the GR reference. Fine-grained guidance leverages differences to refine local details, while coarse-grained guidance uses multi-scale difference maps to correct deviations. This strategy allows the diffusion model to sequentially integrate the strong geometric prior from GR and recover sub-voxel information. Experimental results on the UDPET and Clinical datasets with varying dose levels show that GR-Diffusion outperforms state-of-the-art methods in enhancing 3D whole-body PET image quality and preserving physiological details.

[109] SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

Seo Hyun Kim,Jin Bok Park,Do Yeon Koo,Ho Gun Park,Il Yong Chun

Main category: cs.CV

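TL;DR: This paper proposes SToRM, a supervised visual token reduction framework for end-to-end autonomous driving systems driven by multi-modal large language models, which cuts computational cost substantially (by up to 30x) while keeping performance comparable to using all tokens.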

Details Motivation: End-to-end driving systems must balance safety and real-time operation. A multi-modal large language model (MLLM) supports natural-language human-vehicle interaction, but its high computational cost (especially the many visual tokens) is hard to fit within on-vehicle deployment limits, and existing token reduction methods often degrade downstream task performance. Method: Proposes the supervised token reduction framework SToRM: 1) a lightweight importance predictor (based on short-term sliding windows) estimates token importance; 2) an auxiliary path generates pseudo-supervision signals from an all-token LLM forward pass, enabling supervised training; 3) an anchor-context merging module partitions tokens into anchors and context tokens and selectively merges context tokens into relevant anchors to reduce redundancy while retaining key information. Result: On the LangAuto benchmark, SToRM outperforms state-of-the-art E2E driving MLLM methods under the same reduced-token budget, matches all-token performance, and reduces computational cost by up to 30x. Conclusion: SToRM is the first to achieve efficient, high-performing supervised visual token reduction for end-to-end autonomous driving, offering a feasible route to deploying multi-modal large models on vehicles. Abstract: In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources due to its reliance on an LLM and numerous visual tokens from sensor inputs, which are limited in autonomous vehicles. Many MLLM studies have explored reducing visual tokens, but often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.

[110] EmoSpace: Fine-Grained Emotion Prototype Learning for Immersive Affective Content Generation

Bingyuan Wang,Xingbei Chen,Zongyang Qiu,Linping Yuan,Zeyu Wang

Main category: cs.CV

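TL;DR: This paper proposes the EmoSpace framework, which learns dynamic, interpretable emotion prototypes through vision-language alignment, enabling fine-grained emotional control without explicit emotion labels and supporting applications such as emotional image outpainting, stylized generation, and panorama generation for VR environments.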

Details Motivation: Existing generative methods struggle to capture nuanced emotional semantics and the fine-grained emotional control that immersive experiences require. Method: Introduces the EmoSpace framework, which uses a hierarchical emotion representation with learnable dynamic emotion prototypes, together with a controllable generation pipeline featuring multi-prototype guidance, temporal blending, and attention reweighting. Result: Outperforms existing methods in both qualitative and quantitative evaluations; a user study examines how VR environments affect emotional perception. Conclusion: EmoSpace enables fine-grained emotional control of immersive visual content and supports applications in therapy, education, storytelling, artistic creation, and cultural preservation. Abstract: Emotion is important for creating compelling virtual reality (VR) content. Although some generative methods have been applied to lower the barrier to creating emotionally rich content, they fail to capture the nuanced emotional semantics and the fine-grained control essential for immersive experiences. To address these limitations, we introduce EmoSpace, a novel framework for emotion-aware content generation that learns dynamic, interpretable emotion prototypes through vision-language alignment. We employ a hierarchical emotion representation with rich learnable prototypes that evolve during training, enabling fine-grained emotional control without requiring explicit emotion labels. We develop a controllable generation pipeline featuring multi-prototype guidance, temporal blending, and attention reweighting that supports diverse applications, including emotional image outpainting, stylized generation, and emotional panorama generation for VR environments. Our experiments demonstrate the superior performance of EmoSpace over existing methods in both qualitative and quantitative evaluations. Additionally, we present a comprehensive user study investigating how VR environments affect emotional perception compared to desktop settings. Our work facilitates immersive visual content generation with fine-grained emotion control and supports applications like therapy, education, storytelling, artistic creation, and cultural preservation. Code and models will be made publicly available.

[111] Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes

Jeongho Noh,Tai Hyoung Rhee,Eunho Lee,Jeongyun Kim,Sunwoo Lee,Ayoung Kim

Main category: cs.CV

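TL;DR: Clutt3R-Seg is a zero-shot 3D instance segmentation method for language-grounded robotic grasping. It builds a hierarchical instance tree of semantic cues to achieve robust segmentation and target selection in cluttered scenes, and supports single-image updates for multi-stage tasks.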

Details Motivation: In cluttered environments, occlusions, limited viewpoints, and noisy masks severely reduce the reliability of 3D instance segmentation, which hampers language-grounded robotic manipulation. Method: Proposes a hierarchical instance tree that treats noisy masks as informative cues and suppresses over- and under-segmentation through cross-view grouping and conditional substitution; adds open-vocabulary semantic embeddings to support target selection from natural language; designs a consistency-aware update that maintains instance correspondences from a single post-interaction image. Result: Validated on synthetic and real-world datasets and on a real robot; on heavy-clutter sequences AP@25 reaches 61.66, over 2.2x the baselines; with only 4 views it surpasses MaskClustering (8 views) by more than 2x. Conclusion: Clutt3R-Seg markedly improves the robustness and language alignment of 3D instance segmentation under clutter and sparse views, and shows potential for practical deployment. Abstract: Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.

[112] Egocentric Gaze Estimation via Neck-Mounted Camera

Haoyu Huang,Yoichi Sato

Main category: cs.CV

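TL;DR: This paper introduces the new task of neck-mounted view gaze estimation and builds the first dataset for it. It evaluates the transformer-based GLC model and proposes two extensions, an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach; experiments show that the former improves performance while the latter yields no gain.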

Details Motivation: Existing work on egocentric gaze estimation focuses mainly on head-mounted cameras, while other viewpoints such as neck-mounted cameras remain underexplored; this paper aims to fill that gap. Method: Builds the first neck-mounted gaze estimation dataset (8 participants, about 4 hours of everyday-activity video), evaluates the transformer-based GLC model on it, and introduces an auxiliary gaze out-of-bound classification task and a geometry-aware multi-view co-learning approach. Result: The auxiliary gaze out-of-bound classification task improves model performance, while multi-view co-learning brings no gain; the results are analyzed further. Conclusion: Neck-mounted gaze estimation is feasible and worth studying; the auxiliary classification task is an effective extension, whereas co-learning needs further design work. The work provides a baseline and new directions for gaze estimation with wearables that are not head-mounted. Abstract: This paper introduces neck-mounted view gaze estimation, a new task that estimates user gaze from the neck-mounted camera perspective. Prior work on egocentric gaze estimation, which predicts the device wearer's gaze location within the camera's field of view, mainly focuses on head-mounted cameras while alternative viewpoints remain underexplored. To bridge this gap, we collect the first dataset for this task, consisting of approximately 4 hours of video collected from 8 participants during everyday activities. We evaluate a transformer-based gaze estimation model, GLC, on the new dataset and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach that jointly trains head-view and neck-view models using a geometry-aware auxiliary loss. Experimental results show that incorporating gaze out-of-bound classification improves performance over standard fine-tuning, while the co-learning approach does not yield gains. We further analyze these results and discuss implications for neck-mounted gaze estimation.

[113] U-Net with Hadamard Transform and DCT Latent Spaces for Next-day Wildfire Spread Prediction

Yingyi Luo,Shuaiang Rong,Adam Watts,Ahmet Enis Cetin

Main category: cs.CV

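TL;DR: This paper proposes TD-FusionUNet, a lightweight deep learning model that combines trainable Hadamard and discrete cosine transform layers with custom preprocessing to predict next-day wildfire spread from multimodal satellite data, reaching F1 = 0.591 with only 370k parameters and outperforming the baseline.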

Details Motivation: Improve the real-time capability of wildfire spread prediction and its suitability for resource-limited environments, addressing the high computational cost and weak generalization of existing models. Method: Proposes the TD-FusionUNet model, which introduces trainable 2D Hadamard and DCT layers to capture frequency features in orthogonalized latent spaces, and designs random margin cropping and Gaussian mixture model preprocessing to enrich the representation of sparse pre-fire masks and improve generalization. Result: Evaluated on the Next-Day Wildfire Spread and WildfireSpreadTS datasets; F1 reaches 0.591 with only 370k parameters, outperforming the UNet baseline with a ResNet18 encoder. Conclusion: TD-FusionUNet strikes a good balance between accuracy and efficiency and is suitable for real-time wildfire prediction on edge devices. Abstract: We developed a lightweight and computationally efficient tool for next-day wildfire spread prediction using multimodal satellite data as input. The deep learning model, which we call Transform Domain Fusion UNet (TD-FusionUNet), incorporates trainable Hadamard Transform and Discrete Cosine Transform layers that apply two-dimensional transforms, enabling the network to capture essential "frequency" components in orthogonalized latent spaces. Additionally, we introduce custom preprocessing techniques, including random margin cropping and a Gaussian mixture model, to enrich the representation of the sparse pre-fire masks and enhance the model's generalization capability. The TD-FusionUNet is evaluated on two datasets: the Next-Day Wildfire Spread dataset released by Google Research in 2023 and the WildfireSpreadTS dataset. Our proposed TD-FusionUNet achieves an F1 score of 0.591 with 370k parameters, outperforming the UNet baseline using ResNet18 as the encoder reported in the WildfireSpreadTS dataset while using substantially fewer parameters. These results show that the proposed latent space fusion model balances accuracy and efficiency under a lightweight setting, making it suitable for real-time wildfire prediction applications in resource-limited environments.
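
As background for the transform layers, the sketch below shows a separable 2D Walsh-Hadamard transform and a transform-domain layer built around it. The abstract does not say how the layers are made trainable, so the per-coefficient `gains` matrix is purely an assumption for illustration, as are the function names.

```python
def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    return out

def hadamard_2d(block):
    """Separable 2D transform: transform every row, then every column."""
    rows = [fwht(r) for r in block]
    cols = [fwht(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major

def transform_domain_layer(block, gains):
    """Hypothetical layer: scale each coefficient by a (trainable) gain,
    then invert. The Hadamard matrix is its own inverse up to a factor of n
    per axis, so the inverse is the same transform divided by n * n."""
    n = len(block)
    coeffs = hadamard_2d(block)
    scaled = [[g * c for g, c in zip(g_row, c_row)]
              for g_row, c_row in zip(gains, coeffs)]
    return [[v / (n * n) for v in row] for row in hadamard_2d(scaled)]

block = [[1, 2, 3, 4],
         [4, 3, 2, 1],
         [1, 1, 1, 1],
         [0, 0, 0, 2]]
coeffs = hadamard_2d(block)
identity_gains = [[1.0] * 4 for _ in range(4)]
restored = transform_domain_layer(block, identity_gains)
```

With all gains set to one the layer reduces to the identity, which is a convenient sanity check before any training.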

[114] RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval

Khanh Nguyen,Dasith de Silva Edirimuni,Ghulam Mubashar Hassan,Ajmal Mian

Main category: cs.CV

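TL;DR: This paper proposes RI-Mamba, the first rotation-invariant state-space model for point clouds. Using reference frames, Hilbert sorting, and modulation with orientational embeddings, it achieves efficient text-to-3D-shape retrieval across 200+ categories under arbitrary orientations.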

Details Motivation: Existing text-to-shape retrieval methods rely on canonical poses and support few categories, so they cope poorly with real-world settings where object categories are diverse and orientations arbitrary. Method: Proposes RI-Mamba: 1) global and local reference frames that disentangle pose from geometry; 2) Hilbert sorting to produce geometry-aware token sequences that remain rotation invariant; 3) orientational embeddings reintegrated through feature-wise linear modulation (FiLM) to recover spatial context; 4) scalable training via cross-modal contrastive learning with automated triplet generation. Result: On the OmniObject3D benchmark, across more than 200 object categories and under arbitrary orientations, it reaches state-of-the-art text-to-shape retrieval performance. Conclusion: RI-Mamba is the first to combine rotation invariance with state-space modeling, offering linear complexity, strong generalization, and potential for practical deployment in large-scale 3D retrieval. Abstract: 3D assets have rapidly expanded in quantity and diversity due to the growing popularity of virtual reality and gaming. As a result, text-to-shape retrieval has become essential in facilitating intuitive search within large repositories. However, existing methods require canonical poses and support few object categories, limiting their real-world applicability where objects can belong to diverse classes and appear in random orientations. To address this challenge, we propose RI-Mamba, the first rotation-invariant state-space model for point clouds. RI-Mamba defines global and local reference frames to disentangle pose from geometry and uses Hilbert sorting to construct token sequences with meaningful geometric structure while maintaining rotation invariance. We further introduce a novel strategy to compute orientational embeddings and reintegrate them via feature-wise linear modulation, effectively recovering spatial context and enhancing model expressiveness. Our strategy is inherently compatible with state-space models and operates in linear time. To scale up retrieval, we adopt cross-modal contrastive learning with automated triplet generation, allowing training on diverse datasets without manual annotation. Extensive experiments demonstrate RI-Mamba's superior representational capacity and robustness, achieving state-of-the-art performance on the OmniObject3D benchmark across more than 200 object categories under arbitrary orientations. Our code will be made available at https://github.com/ndkhanh360/RI-Mamba.git.
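
Feature-wise linear modulation itself is simple to state: a conditioning vector predicts a per-channel scale and shift that are applied to the features. The sketch below shows only that generic operation on toy values; how RI-Mamba computes its orientational embeddings is not described in the abstract, so the `orientation` vector and all weights here are hypothetical.

```python
def linear(vec, weight, bias):
    """Dense layer; `weight` holds one row per output channel."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def film(tokens, cond, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: predict per-channel scale (gamma) and shift (beta) from the
    conditioning vector, then apply gamma * feature + beta to every token."""
    gamma = linear(cond, w_gamma, b_gamma)
    beta = linear(cond, w_beta, b_beta)
    return [[g * f + b for f, g, b in zip(tok, gamma, beta)] for tok in tokens]

# Two tokens with two feature channels each (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0]]
orientation = [1.0, 0.0, 0.0]  # hypothetical orientation embedding
w_gamma = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_gamma = [0.0, 1.0]
w_beta = [[0.5, 0.0, 0.0], [0.0, 0.0, 1.0]]
b_beta = [0.0, 0.0]
modulated = film(tokens, orientation, w_gamma, b_gamma, w_beta, b_beta)
```

Because the same scale and shift are applied to every token independently, the cost is linear in sequence length, consistent with the linear-time claim in the abstract.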

[115] Semantically Conditioned Diffusion Models for Cerebral DSA Synthesis

Qiwen Xu,David Rügamer,Holger Wenz,Johann Fontana,Nora Meggyeshazi,Andreas Bender,Máté E. Maros

Main category: cs.CV

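TL;DR: This paper proposes a semantically conditioned latent diffusion model (LDM) for generating arterial-phase cerebral digital subtraction angiography (DSA) images with control over anatomical circulation (anterior/posterior) and C-arm position. Expert ratings and FID indicate that the generated images are clinically realistic and usable for algorithm development, research, and training.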

Details Motivation: DSA is central to diagnosing and treating cerebrovascular disease, but its invasiveness and high acquisition cost limit large-scale data collection and sharing, creating a need for high-quality synthetic data. Method: Curates a single-centre DSA dataset of 99,349 frames and trains a semantically conditioned latent diffusion model (LDM) using text embeddings that encode anatomy and acquisition geometry, enabling controllable synthesis of arterial-phase DSA images. Result: Four medical experts rated 400 synthetic images on a 5-grade Likert scale; image-wise overall scores were 3.1–3.3 with ICC(2,k) = 0.80–0.87; the median Fréchet inception distance (FID) was 15.27, indicating high distributional similarity. Conclusion: Semantically controlled latent diffusion models can generate clinically realistic synthetic DSA images suitable for downstream algorithm development, research, and medical training. Abstract: Digital subtraction angiography (DSA) plays a central role in the diagnosis and treatment of cerebrovascular disease, yet its invasive nature and high acquisition cost severely limit large-scale data collection and public data sharing. Therefore, we developed a semantically conditioned latent diffusion model (LDM) that synthesizes arterial-phase cerebral DSA frames under explicit control of anatomical circulation (anterior vs. posterior) and canonical C-arm positions. We curated a large single-centre DSA dataset of 99,349 frames and trained a conditional LDM using text embeddings that encoded anatomy and acquisition geometry. To assess clinical realism, four medical experts, including two neuroradiologists, one neurosurgeon, and one internal medicine expert, systematically rated 400 synthetic DSA images using a 5-grade Likert scale for evaluating proximal large, medium, and small peripheral vessels. The generated images achieved image-wise overall Likert scores ranging from 3.1 to 3.3, with high inter-rater reliability (ICC(2,k) = 0.80–0.87). Distributional similarity to real DSA frames was supported by a low median Fréchet inception distance (FID) of 15.27. Our results indicate that semantically controlled LDMs can produce realistic synthetic DSAs suitable for downstream algorithm development, research, and training.

[116] TG-Field: Geometry-Aware Radiative Gaussian Fields for Tomographic Reconstruction

Yuxiang Zhong,Jun Wei,Chaoqi Chen,Senyou An,Hui Huang

Main category: cs.CV

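TL;DR: This paper proposes TG-Field, a geometry-aware Gaussian deformation framework for static and dynamic CT reconstruction. It combines multi-resolution hash encoding, time-conditioned representations, a spatiotemporal attention block, and a motion-flow network to improve reconstruction quality and temporal consistency under ultra-sparse views.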

Details Motivation: Existing 3D Gaussian Splatting (3DGS) adaptations for CT reconstruction suffer from severe artifacts under ultra-sparse-view projections and dynamic motion. Method: Tomographic Geometry Field (TG-Field) comprises (1) a multi-resolution hash encoder that captures local spatial priors; (2) time-conditioned representations and a spatiotemporal attention block that adaptively aggregate dynamic features; and (3) a motion-flow network that models the fine-grained anatomical deformation caused by respiratory motion. Result: On synthetic and real-world datasets, TG-Field consistently outperforms existing methods under highly sparse-view conditions, reaches state-of-the-art reconstruction accuracy, and mitigates dynamic artifacts and spatiotemporal ambiguity. Conclusion: TG-Field offers an efficient, robust, geometry-aware approach to sparse-view CT reconstruction, particularly for dynamic scenes with respiratory motion. Abstract: 3D Gaussian Splatting (3DGS) has revolutionized 3D scene representation with superior efficiency and quality. While recent adaptations for computed tomography (CT) show promise, they struggle with severe artifacts under highly sparse-view projections and dynamic motions. To address these challenges, we propose Tomographic Geometry Field (TG-Field), a geometry-aware Gaussian deformation framework tailored for both static and dynamic CT reconstruction. A multi-resolution hash encoder is employed to capture local spatial priors, regularizing primitive parameters under ultra-sparse settings. We further extend the framework to dynamic reconstruction by introducing time-conditioned representations and a spatiotemporal attention block to adaptively aggregate features, thereby resolving spatiotemporal ambiguities and enforcing temporal coherence. In addition, a motion-flow network models fine-grained respiratory motion to track local anatomical deformations. Extensive experiments on synthetic and real-world datasets demonstrate that TG-Field consistently outperforms existing methods, achieving state-of-the-art reconstruction accuracy under highly sparse-view conditions.

[117] LLM-Driven 3D Scene Generation of Agricultural Simulation Environments

Arafa Yoncalik,Wouter Jansen,Nico Huebel,Mohammad Hasan Rahmani,Jan Steckel

Main category: cs.CV

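TL;DR: This paper proposes a modular multi-LLM pipeline for agricultural simulation. By combining domain knowledge injection, RAG, finetuning, and validation, it generates high-fidelity, verifiable 3D farmland environments in the Unreal engine from natural language, improving controllability, accuracy, and generation efficiency.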

Details Motivation: Existing LLM-based 3D scene generation methods lack domain-specific reasoning for verticals such as agriculture, verification mechanisms, and modular design, which leads to weak control and poor scalability. Method: The authors build a modular multi-LLM pipeline that integrates 3D asset retrieval, agricultural domain knowledge injection (e.g., crop growth patterns and field layout conventions), and Unreal API code generation, combined with few-shot prompting, RAG, finetuning, and multi-level validation. Result: Given structured prompts, the system generates semantically accurate agricultural 3D environments; a user study assessed realism and familiarity against real-world images; and an expert comparison shows substantial time savings over manual modelling. Conclusion: A modular multi-LLM architecture can support domain-specific 3D simulation generation and outperforms monolithic LLM approaches in reliability, precision, and maintainability, which is relevant to applications such as agricultural digital twins. Abstract: Procedural generation techniques in 3D rendering engines have revolutionized the creation of complex environments, reducing reliance on manual design. Recent approaches using Large Language Models (LLMs) for 3D scene generation show promise but often lack domain-specific reasoning, verification mechanisms, and modular design. These limitations lead to reduced control and poor scalability. This paper investigates the use of LLMs to generate synthetic agricultural simulation environments from natural language prompts, specifically to address these limitations. A modular multi-LLM pipeline was developed, integrating 3D asset retrieval, domain knowledge injection, and code generation for the Unreal rendering engine using its API. This results in a 3D environment with realistic planting layouts and environmental context, all based on the input prompt and the domain knowledge. To enhance accuracy and scalability, the system employs a hybrid strategy combining LLM optimization techniques such as few-shot prompting, Retrieval-Augmented Generation (RAG), finetuning, and validation. Unlike monolithic models, the modular architecture enables structured data handling, intermediate verification, and flexible expansion. The system was evaluated using structured prompts and semantic accuracy metrics. A user study assessed realism and familiarity against real-world images, while an expert comparison demonstrated significant time savings over manual scene design.
The results confirm the effectiveness of multi-LLM pipelines in automating domain-specific 3D scene generation with improved reliability and precision. Future work will explore expanding the asset hierarchy, incorporating real-time generation, and adapting the pipeline to other simulation domains beyond agriculture.

[118] GSO-SLAM: Bidirectionally Coupled Gaussian Splatting and Direct Visual Odometry

Jiung Yeon,Seongbo Ha,Hyeonwoo Yu

Main category: cs.CV

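TL;DR: GSO-SLAM is a real-time monocular dense SLAM system that bidirectionally couples visual odometry (VO) with Gaussian Splatting (GS). It jointly optimizes depth estimates and the scene representation within an EM framework and introduces a Gaussian Splat Initialization method, achieving accurate real-time reconstruction.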

Details Motivation: Existing SLAM methods couple tracking and mapping in ways that either incur high computational cost or introduce redundancy, so a more efficient and accurate monocular dense reconstruction scheme is needed. Method: GSO-SLAM (1) bidirectionally couples VO and GS within an EM framework to jointly optimize semi-dense depth and the Gaussian scene, and (2) introduces Gaussian Splat Initialization, which uses keyframe poses, pixel associations, and image information from VO to produce a high-quality initial Gaussian representation. Result: Experiments show that the method runs in real time and reaches state-of-the-art geometric and photometric reconstruction fidelity and tracking accuracy. Conclusion: By tightly coupling VO and GS and replacing heuristic initialization with one derived from VO outputs, GSO-SLAM improves the efficiency and accuracy of monocular dense SLAM. Abstract: We propose GSO-SLAM, a real-time monocular dense SLAM system that leverages Gaussian scene representation. Unlike existing methods that couple tracking and mapping with a unified scene, incurring computational costs, or loosely integrate them with well-structured tracking frameworks, introducing redundancies, our method bidirectionally couples Visual Odometry (VO) and Gaussian Splatting (GS). Specifically, our approach formulates joint optimization within an Expectation-Maximization (EM) framework, enabling the simultaneous refinement of VO-derived semi-dense depth estimates and the GS representation without additional computational overhead. Moreover, we present Gaussian Splat Initialization, which utilizes image information, keyframe poses, and pixel associations from VO to produce close approximations to the final Gaussian scene, thereby eliminating the need for heuristic methods. Through extensive experiments, we validate the effectiveness of our method, showing that it not only operates in real time but also achieves state-of-the-art geometric/photometric fidelity of the reconstructed scene and tracking accuracy.

[119] STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning

Xiaowen Zhang,Zhi Gao,Licheng Jiao,Lingling Li,Qing Li

Main category: cs.CV

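TL;DR: This paper proposes a new visual prompting paradigm and STVG-R1, the first reinforcement learning framework for spatial-temporal video grounding (STVG). Encoding objects as instance IDs sidesteps cross-modal coordinate alignment, and the approach reaches SOTA performance on multiple benchmarks.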

Details Motivation: In STVG, existing VLMs tend to hallucinate because textual descriptions and visual coordinates are misaligned, and prior methods rely on additional trainable modules that add annotation cost and computational overhead. Method: The authors propose a visual prompting paradigm based on unique, temporally consistent instance IDs, which turns per-frame coordinate prediction into instance-level identification, and design the STVG-R1 reinforcement learning framework with a task-driven reward that jointly optimizes temporal accuracy, spatial consistency, and output format. Result: On HCSTVG-v2, STVG-R1 surpasses Qwen2.5-VL-7B by 20.9% m_IoU; in zero-shot transfer to MeViS, it reaches a SOTA 47.3% J&F. Conclusion: The proposed visual prompting and reinforcement learning framework mitigates cross-modal misalignment in VLMs, improves STVG performance, and generalizes well. Abstract: In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.

[120] Adapting Vision-Language Models for E-commerce Understanding at Scale

Matteo Nulli,Vladimir Orshulevich,Tala Bazazo,Christian Herold,Michael Kozielski,Marcin Mazur,Szymon Tuzel,Cees G. M. Snoek,Seyyed Hadi Hashemi,Omar Javed,Yannick Versley,Shahram Khadivi

Main category: cs.CV

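TL;DR: This paper presents a targeted adaptation of general-purpose vision-language models (VLMs) to e-commerce data, which is attribute-centric, multi-image, and noisy, and builds a comprehensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction. The adaptation substantially improves e-commerce performance without harming general multimodal capabilities.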

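Details Motivation: General-purpose VLMs offer broad multimodal modelling, but there is no established strategy for adapting them to attribute-dense, multi-image, noisy e-commerce data while retaining general capability. Method: Through a large-scale experimental study, the authors design and apply a targeted VLM adaptation for e-commerce data and build a new evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction. Result: The adaptation substantially improves performance on e-commerce tasks while preserving the VLM's broad multimodal capabilities; the new suite provides a more comprehensive and challenging benchmark for e-commerce multimodal understanding. Conclusion: General VLMs can be adapted to e-commerce with targeted strategies that raise domain performance without sacrificing general ability, and systematic multi-dimensional evaluation is important for progress in e-commerce multimodal understanding. Abstract: E-commerce product understanding demands, by nature, strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show, through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.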

[121] Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

Boqi Chen,Xudong Liu,Jianing Qiu

Main category: cs.CV

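TL;DR: This paper improves visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view, reducing object hallucination in multimodal large language models (MLLMs).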

Details Motivation: Object hallucination is a common problem in multimodal large language models (MLLMs). Method: Using object-centric attention from self-supervised Vision Transformers, the method removes the most salient visual evidence to build an auxiliary view that disrupts unsupported tokens and strengthens the contrast signal. It is prompt-agnostic and model-agnostic, and plugs into the existing VCD pipeline at the cost of a single cacheable forward pass. Result: The method yields consistent gains on two popular object hallucination benchmarks across two MLLMs. Conclusion: The method mitigates object hallucination in MLLMs while remaining general, low-overhead, and easy to integrate. Abstract: We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
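
The contrast step that the abstract builds on can be illustrated with a minimal sketch. It assumes the commonly cited VCD combination rule, in which next-token logits from the original view are amplified and logits from the auxiliary view are subtracted. The function names and the `alpha` weight are hypothetical, and the paper's object-aligned view construction is not reproduced here.

```python
import math


def contrastive_logits(logits_original, logits_auxiliary, alpha=1.0):
    """Combine two sets of next-token logits in the usual VCD form.

    Assumed rule (not taken from this abstract):
        (1 + alpha) * logits_original - alpha * logits_auxiliary
    """
    if len(logits_original) != len(logits_auxiliary):
        raise ValueError("both views must score the same vocabulary")
    return [
        (1.0 + alpha) * lo - alpha * la
        for lo, la in zip(logits_original, logits_auxiliary)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Token 0 is supported by the original view only; token 1 is scored
    # equally by both views, so the contrast does not favour it.
    original = [2.0, 1.0]
    auxiliary = [1.0, 1.0]
    print(softmax(contrastive_logits(original, auxiliary, alpha=1.0)))
```

Tokens whose scores do not depend on the removed visual evidence are scored similarly in both views, so the subtraction lowers them relative to tokens that the original image actually supports.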

[122] Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation

Xiangyu Wu,Dongming Jiang,Feng Yu,Yueying Tian,Jiaqi Tang,Qing-Guo Chen,Yang Yang,Jianfeng Lu

Main category: cs.CV

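TL;DR: This paper proposes ADTE, an adaptive debiasing method based on Tsallis entropy for test-time adaptation (TTA) of vision-language models such as CLIP, addressing the biased uncertainty estimates that Shannon entropy produces on imbalanced data.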

Details Motivation: Mainstream TTA methods estimate uncertainty with Shannon entropy, but CLIP is pretrained on highly imbalanced data, so Shannon entropy yields biased uncertainty estimates. Method: The authors adopt Tsallis entropy (TE), a generalization of Shannon entropy that better characterizes skewed distributions, and extend it to Adaptive Debiasing Tsallis Entropy (ADTE), which sets a class-specific parameter q^l for each category by normalizing the label bias estimated from continuously incoming test samples; ADTE is combined with high-confidence view selection and a label adjustment strategy. Result: ADTE outperforms SOTA methods on ImageNet and its five variants and achieves the highest average performance on 10 cross-domain benchmarks, independent of model architecture or text prompts. Conclusion: ADTE is a plug-and-play TTA alternative that needs no distribution-specific hyperparameter tuning, and both TE and ADTE can directly replace Shannon entropy to improve TTA robustness and generalization. Abstract: Mainstream Test-Time Adaptation (TTA) methods for adapting vision-language models, e.g., CLIP, typically rely on Shannon Entropy (SE) at test time to measure prediction uncertainty and inconsistency. However, since CLIP has a built-in bias from pretraining on highly imbalanced web-crawled data, SE inevitably produces biased uncertainty estimates. To address this issue, we find and demonstrate that Tsallis Entropy (TE), a generalized form of SE, is naturally suited for characterizing biased distributions by introducing a non-extensive parameter q, with the performance of SE serving as a lower bound for TE. Building upon this, we generalize TE into Adaptive Debiasing Tsallis Entropy (ADTE) for TTA, customizing a class-specific parameter q^l for each category, derived by normalizing the estimated label bias from continuously incoming test instances. This adaptive approach allows ADTE to accurately select high-confidence views and seamlessly integrate with a label adjustment strategy to enhance adaptation, without introducing distribution-specific hyperparameter tuning. Besides, our investigation reveals that both TE and ADTE can serve as direct, advanced alternatives to SE in TTA, without any other modifications. Experimental results show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants, and achieves the highest average performance on 10 cross-domain benchmarks, regardless of the model architecture or text prompts used. Our code is available at https://github.com/Jinx630/ADTE.
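
For readers unfamiliar with Tsallis entropy, the sketch below shows the standard textbook definition, S_q(p) = (1 - sum_i p_i^q) / (q - 1), and its q → 1 limit, which is Shannon entropy. It covers only the generic quantity: the class-specific q^l and the label-bias normalization of ADTE are not reproduced, and the function names are illustrative.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def tsallis_entropy(probs, q):
    """Tsallis entropy S_q = (1 - sum_i p_i**q) / (q - 1).

    As q approaches 1 this converges to Shannon entropy, so that case
    is handled explicitly to avoid dividing by zero.
    """
    if abs(q - 1.0) < 1e-12:
        return shannon_entropy(probs)
    power_sum = sum(p ** q for p in probs if p > 0.0)
    return (1.0 - power_sum) / (q - 1.0)


if __name__ == "__main__":
    confident = [0.9, 0.05, 0.05]
    uncertain = [1 / 3, 1 / 3, 1 / 3]
    for q in (0.5, 1.0, 2.0):
        print(q, tsallis_entropy(confident, q), tsallis_entropy(uncertain, q))
```

Varying q changes how strongly low-probability classes contribute to the score, which is the degree of freedom that the abstract says ADTE sets per class.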

[123] Code2Worlds: Empowering Coding LLMs for 4D World Generation

Yi Zhang,Yunshuang Wang,Zeyu Zhang,Hao Tang

Main category: cs.CV

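TL;DR: This paper proposes Code2Worlds, a framework that formulates 4D (spatiotemporal) world generation as language-to-simulation code generation. A dual-stream architecture decouples object and environment generation, and a physics-aware closed-loop mechanism (a PostProcess Agent and a VLM-Motion Critic) improves dynamic fidelity, substantially outperforming baselines on the Code4D benchmark.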

Details Motivation: Existing 3D generation methods based on coding LLMs are hard to extend to 4D dynamic simulation; two challenges, multi-scale context entanglement and a semantic-physical execution gap, lead to physical hallucinations and distorted dynamics. Method: Code2Worlds combines (1) a dual-stream architecture with a retrieval-augmented object generation stream and a hierarchical environment orchestration stream, and (2) a physics-aware closed-loop mechanism in which a PostProcess Agent scripts dynamics and a VLM-Motion Critic performs self-reflection to iteratively refine the simulation code. Result: On the Code4D benchmark, SGS improves by 41% and Richness by 49%, and the framework generates physics-aware dynamics that prior static methods lack. Conclusion: Casting 4D generation as executable, verifiable simulation code generation, and bridging the semantic-physical gap through decoupling and closed-loop refinement, is presented as a key path toward spatial intelligence and world simulators. Abstract: Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a critical frontier. This task presents two fundamental challenges: multi-scale context entanglement, where monolithic generation fails to balance local object structures with global environmental layouts; and a semantic-physical execution gap, where open-loop code generation leads to physical hallucinations lacking dynamic fidelity. We introduce Code2Worlds, a framework that formulates 4D generation as language-to-simulation code generation. First, we propose a dual-stream architecture that disentangles retrieval-augmented object generation from hierarchical environmental orchestration. Second, to ensure dynamic fidelity, we establish a physics-aware closed-loop mechanism in which a PostProcess Agent scripts dynamics, coupled with a VLM-Motion Critic that performs self-reflection to iteratively refine simulation code. Evaluations on the Code4D benchmark show Code2Worlds outperforms baselines with a 41% SGS gain and 49% higher Richness, while uniquely generating physics-aware dynamics absent in prior static methods. Code: https://github.com/AIGeeksGroup/Code2Worlds. Website: https://aigeeksgroup.github.io/Code2Worlds.

[124] Light4D: Training-Free Extreme Viewpoint 4D Video Relighting

Zhenghuang Wu,Kang Chen,Zeyu Zhang,Hao Tang

Main category: cs.CV

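TL;DR: This paper proposes Light4D, a training-free 4D relighting framework that uses Disentangled Flow Guidance and Temporal Consistent Attention to generate consistently lit 4D videos under extreme viewpoint changes.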

Details Motivation: Diffusion models have advanced image and video relighting, but extending them to 4D relighting faces two challenges: scarce paired training data and the difficulty of maintaining temporal consistency across extreme viewpoints. Method: Light4D combines (1) Disentangled Flow Guidance, a time-aware strategy that injects lighting control into the latent space while preserving geometric integrity, and (2) Temporal Consistent Attention (integrated into the IC-Light architecture) with deterministic regularization to suppress appearance flickering and strengthen temporal consistency. Result: Experiments show competitive temporal consistency and lighting fidelity, with robust handling of camera rotations from -90° to 90°. Conclusion: Light4D is an efficient, training-free approach to 4D relighting that addresses data scarcity and temporal inconsistency. Abstract: Recent advances in diffusion-based generative models have established a new paradigm for image and video relighting. However, extending these capabilities to 4D relighting remains challenging, due primarily to the scarcity of paired 4D relighting training data and the difficulty of maintaining temporal consistency across extreme viewpoints. In this work, we propose Light4D, a novel training-free framework designed to synthesize consistent 4D videos under target illumination, even under extreme viewpoint changes. First, we introduce Disentangled Flow Guidance, a time-aware strategy that effectively injects lighting control into the latent space while preserving geometric integrity. Second, to reinforce temporal consistency, we develop Temporal Consistent Attention within the IC-Light architecture and further incorporate deterministic regularization to eliminate appearance flickering. Extensive experiments demonstrate that our method achieves competitive performance in temporal consistency and lighting fidelity, robustly handling camera rotations from -90° to 90°. Code: https://github.com/AIGeeksGroup/Light4D. Website: https://aigeeksgroup.github.io/Light4D.

[125] Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data

Yiming Zhou,Xuenjie Xie,Panfeng Li,Albrecht Kunz,Ahmad Osman,Xavier Maldague

Main category: cs.CV

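TL;DR: This paper proposes a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Trained on only 11.2k samples (less than 0.1% of SA-1B), it surpasses the original EfficientViT-SAM in segmentation accuracy.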

Details Motivation: Segment Anything Models (SAM) perform well but depend on massive datasets (e.g., 11M images) and RGB-only input; efficient variants still require large-scale training, which limits their use in resource-constrained settings. Method: Depth maps produced by a pretrained monocular depth estimator are passed through a dedicated depth encoder and fused with mid-level RGB features, forming a lightweight RGB-D fusion framework that augments EfficientViT-SAM. Result: Trained on only 11.2k samples, the method exceeds EfficientViT-SAM in segmentation accuracy, supporting depth cues as a strong geometric prior. Conclusion: Monocular depth priors can improve lightweight SAM models and reduce their dependence on large-scale annotated data. Abstract: Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.

[126] How to Sample High Quality 3D Fractals for Action Recognition Pre-Training?

Marko Putak,Thomas B. Moeslund,Joakim Bruslund Haurum

Main category: cs.CV

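TL;DR: This paper presents a method for generating 3D fractal videos with 3D Iterated Function Systems (IFS) to pre-train action recognition models. It addresses the slow speed and poor quality of standard fractal generation and introduces Targeted Smart Filtering to improve sampling efficiency and downstream performance.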

Details Motivation: Synthetic data methods such as FDSL avoid manual annotation and privacy concerns, but standard 3D fractal generation is slow and prone to degenerate structures, which makes effective pre-training difficult. Method: Fractals are generated with 3D IFS and temporally transformed into videos; the authors systematically explore several fractal generation strategies and propose Targeted Smart Filtering, which greatly speeds up sampling while preserving diversity. Result: Sampling is roughly 100 times faster, and downstream action recognition performance exceeds that of other 3D fractal filtering methods. Conclusion: High-quality, efficiently generated synthetic 3D fractal videos are an effective pre-training signal, and Targeted Smart Filtering is the key to balancing generation quality and efficiency. Abstract: Synthetic datasets are being recognized in the deep learning realm as a valuable alternative to exhaustively labeled real data. One such synthetic data generation method is Formula Driven Supervised Learning (FDSL), which can provide an unlimited amount of perfectly labeled data through a formula-driven approach, such as fractals or contours. FDSL avoids common drawbacks such as manual labor, privacy, and other ethical concerns. In this work we generate 3D fractals using 3D Iterated Function Systems (IFS) for pre-training an action recognition model. The fractals are temporally transformed to form a video that is used as a pre-training dataset for the downstream task of action recognition. We find that standard methods of generating fractals are slow and produce degenerate 3D fractals. Therefore, we systematically explore alternative ways of generating fractals and find that overly restrictive approaches, while generating aesthetically pleasing fractals, are detrimental to downstream task performance. We propose a novel method, Targeted Smart Filtering, to address both the generation speed and fractal diversity issues. The method achieves roughly 100 times faster sampling and superior downstream performance compared with other 3D fractal filtering methods.
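
As background for the 3D IFS mentioned in the abstract, the sketch below renders a fractal point cloud with the generic chaos game: repeatedly apply a randomly chosen contractive affine map to a point and record the trajectory. This is only the textbook procedure under simple assumptions (uniformly sampled matrix entries, bounded so every map is contractive). It is not the paper's sampling scheme, and Targeted Smart Filtering is not reproduced; all names and parameter values are illustrative.

```python
import random


def sample_affine_map(rng, scale=0.3):
    """Sample a 3x3 matrix A and offset b for the map p -> A p + b.

    With |A[i][j]| <= 0.3, every row sum of absolute values is <= 0.9,
    so the map is a contraction in the max norm.
    """
    matrix = [[rng.uniform(-scale, scale) for _ in range(3)] for _ in range(3)]
    offset = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    return matrix, offset


def apply_map(affine_map, point):
    """Apply one affine map to a 3D point."""
    matrix, offset = affine_map
    return tuple(
        sum(matrix[i][j] * point[j] for j in range(3)) + offset[i]
        for i in range(3)
    )


def chaos_game(maps, n_points, rng, burn_in=20):
    """Generate n_points on the IFS attractor via random iteration."""
    point = (0.0, 0.0, 0.0)
    cloud = []
    for step in range(n_points + burn_in):
        point = apply_map(rng.choice(maps), point)
        if step >= burn_in:  # discard early points far from the attractor
            cloud.append(point)
    return cloud


if __name__ == "__main__":
    rng = random.Random(0)
    ifs = [sample_affine_map(rng) for _ in range(4)]
    points = chaos_game(ifs, n_points=1000, rng=rng)
    print(len(points), points[0])
```

Randomly sampled maps like these often collapse to thin or near-degenerate shapes, which is the kind of quality problem the abstract says motivates filtering.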

[127] JEPA-VLA: Video Predictive Embedding is Needed for VLA Models

Shangchen Miao,Ningya Feng,Jialong Wu,Ye Lin,Xu He,Dong Li,Mingsheng Long

Main category: cs.CV

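TL;DR: This paper proposes JEPA-VLA, which integrates video-pretrained predictive visual representations (such as V-JEPA 2) into vision-language-action (VLA) models to compensate for their weak environment understanding and policy priors, improving sample efficiency and generalization.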

Details Motivation: Existing VLA models are limited by their pretrained visual representations, which fail to capture task-relevant environment information and policy priors (anticipatory knowledge of how the environment evolves under successful task execution), resulting in low sample efficiency and poor generalization. Method: The authors analyse the limitations of several visual representations and find that video-predictive representations, particularly V-JEPA 2, better model task-relevant temporal dynamics and filter out unpredictable factors; on this basis they propose JEPA-VLA, which adaptively fuses predictive embeddings into existing VLA architectures. Result: JEPA-VLA yields substantial gains on LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks. Conclusion: Video-pretrained predictive representations can fill the knowledge gap of current VLA visual encoders, and JEPA-VLA achieves general, robust gains with a simple design. Abstract: Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, pretrained visual representation, which offers insufficient knowledge on both aspects of environment understanding and policy prior. Through an in-depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.

[128] WorldTree: Towards 4D Dynamic Worlds from Monocular Video using Tree-Chains

Qisen Wang,Yifan Zhao,Jia Li

Main category: cs.CV

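TL;DR: This paper proposes WorldTree, a framework that uses a Temporal Partition Tree (TPT) for coarse-to-fine optimization over a hierarchical temporal decomposition and Spatial Ancestral Chains (SAC) to recursively model spatial dynamics, improving monocular dynamic scene reconstruction.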

Details Motivation: Existing monocular dynamic reconstruction methods lack a unified spatiotemporal decomposition framework; they rely either on holistic temporal optimization or on coupled hierarchical spatial composition, which limits practicality and accuracy. Method: WorldTree comprises a Temporal Partition Tree (TPT), an inheritance-based partition tree that supports coarse-to-fine optimization over a hierarchical temporal decomposition, and Spatial Ancestral Chains (SAC), which recursively query the ancestral hierarchy to provide complementary spatial dynamics while specializing motion representations across nodes. Result: Compared with the second-best method, LPIPS improves by 8.26% on NVIDIA-LS and mLPIPS by 9.09% on DyCheck. Conclusion: WorldTree provides a unified, decoupled, and extensible spatiotemporal modelling approach that improves the quality and generalization of monocular dynamic reconstruction. Abstract: Dynamic reconstruction has achieved remarkable progress, but challenges remain for monocular input, which is needed for more practical applications. Prevailing works attempt to construct efficient motion representations but lack a unified spatiotemporal decomposition framework, suffering from either holistic temporal optimization or coupled hierarchical spatial composition. To this end, we propose WorldTree, a unified framework comprising a Temporal Partition Tree (TPT), which enables coarse-to-fine optimization based on an inheritance-based partition tree structure for hierarchical temporal decomposition, and Spatial Ancestral Chains (SAC), which recursively query the ancestral hierarchical structure to provide complementary spatial dynamics while specializing motion representations across ancestral nodes. Experimental results on different datasets indicate that our proposed method achieves an 8.26% improvement in LPIPS on NVIDIA-LS and a 9.09% improvement in mLPIPS on DyCheck compared to the second-best method. Code: https://github.com/iCVTEAM/WorldTree.

[129] Free Lunch for Stabilizing Rectified Flow Inversion

Chenru Wang,Beier Zhu,Chi Zhang

Main category: cs.CV

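TL;DR: This paper proposes Proximal-Mean Inversion (PMI) and mimic-CFG, two training-free correction methods that improve the inversion stability and fidelity of Rectified-Flow (RF) models in image reconstruction and editing while reducing computational cost.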

Details Motivation: Existing inversion methods for RF models accumulate approximation errors across timesteps, which destabilizes the velocity field and degrades reconstruction and editing quality. Method: Proximal-Mean Inversion (PMI) stabilizes the velocity field by guiding the current velocity toward the running mean of past velocities, constrained within a theoretically derived spherical Gaussian; the lightweight mimic-CFG scheme corrects velocities in editing tasks by interpolation, balancing editing effectiveness and structural consistency. Result: Experiments on PIE-Bench show significant improvements in inversion stability, reconstruction quality, and editing fidelity with fewer neural function evaluations, reaching SOTA performance. Conclusion: PMI and mimic-CFG give RF models an efficient, stable, and theoretically grounded training-free approach to inversion and editing. Abstract: Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on PIE-Bench with enhanced efficiency and theoretical soundness.

[130] Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Lai Wei,Liangbo He,Jun Lan,Lingzhong Dong,Yutong Cai,Siyuan Li,Huijia Zhu,Weiqiang Wang,Linghe Kong,Yue Wang,Zhuosheng Zhang,Weiran Huang

Main category: cs.CV

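TL;DR: This paper proposes Region-to-Image Distillation, which turns inference-time image zooming into a training-time distillation process so that an MLLM gains fine-grained visual understanding in a single forward pass, and introduces the ZoomBench benchmark for evaluation.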

Details Motivation: Existing MLLMs perform poorly on fine-grained perception because the decisive cues are small and easily overwhelmed by global context; "Thinking-with-Images" methods help but incur high inference latency. Method: Region-to-Image Distillation uses strong teacher models to generate high-quality VQA data on micro-cropped regions and then distills this region-level supervision back to the full image; the authors also build the ZoomBench benchmark and a dual-view evaluation protocol. Result: The method improves MLLM performance on multiple fine-grained perception benchmarks and also on general multimodal cognition tasks such as visual reasoning and GUI agents. Conclusion: Training-time distillation can replace repeated inference-time zooming for fine-grained perception, delivering strong performance without sacrificing efficiency; the paper also discusses when "Thinking-with-Images" remains necessary. Abstract: Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.

[131] DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition

Ji Li,Zhiwei Li,Shihao Li,Zhenjiang Yu,Boyang Wang,Haiou Liu

Main category: cs.CV

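TL;DR: This paper proposes DiffPlace, a framework that adds a place-ID controller to make multi-view image generation place-controllable, improving the place awareness and background consistency of generated urban street views and thereby benefiting visual place recognition.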

Details Motivation: Existing multi-view diffusion models driven by text, bird's-eye-view (BEV) maps, and object bounding boxes struggle to generate place-aware, background-consistent urban street views, which limits their usefulness for place recognition. Method: DiffPlace includes a place-ID controller that uses linear projection, a Perceiver Transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, keeping background buildings consistent while allowing flexible control of foreground objects and weather conditions. Result: In quantitative comparisons and augmented training evaluations, DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Conclusion: DiffPlace improves scene-level and place-aware synthesis and offers a practical tool for place recognition in autonomous driving. Abstract: Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller employs linear projection, perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models in enhancing scene-level and place-aware synthesis, providing a valuable approach for improving place recognition in autonomous driving.

[132] SynthRAR: Ring Artifacts Reduction in CT with Unrolled Network and Synthetic Data Training

Hongxu Yang,Levente Lippenszky,Edina Timko,Gopal Avinash

Main category: cs.CV

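TL;DR: This paper proposes a deep learning method grounded in a theoretical analysis of non-ideal CT detector responses. An unrolled network models forward projection together with detector defects, and synthetic data is used to exploit the intrinsic correlation of ring artifacts between the image and sinogram domains, so ring and streak artifacts can be removed without real clinical training data.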

Details Motivation: Existing ring artifact reduction methods depend on dedicated real clinical datasets that are costly to collect, and most correct in a single domain (image or sinogram), ignoring the intrinsic correlations that arise from the forward operation of the CT geometry. Method: The ring artifact reduction (RAR) problem is formulated as an inverse problem that combines the non-ideal detector response with the linear forward projection of the CT geometry and is solved with an unrolled network; synthetic data generated from natural images explicitly captures the correlation of ring artifacts between the sinogram and image domains, enabling training without real clinical data. Result: Extensive evaluation across diverse scanning geometries and anatomical regions shows that the model trained only on synthetic data consistently outperforms existing state-of-the-art methods. Conclusion: The method removes the dependence on real clinical data and, through a physics-guided unrolled network and cross-domain correlation modelling, achieves more robust and better-generalizing ring artifact correction. Abstract: Defective and inconsistent responses in CT detectors can cause ring and streak artifacts in the reconstructed images, making them unusable for clinical purposes. In recent years, several ring artifact reduction (RAR) solutions have been proposed in the image domain or in the sinogram domain using supervised deep learning methods. However, these methods require dedicated datasets for training, leading to a high data collection cost. Furthermore, existing approaches focus exclusively on either image-space or sinogram-space correction, neglecting the intrinsic correlations from the forward operation of the CT geometry. Based on the theoretical analysis of non-ideal CT detector responses, the RAR problem is reformulated as an inverse problem by using an unrolled network, which considers non-ideal response together with linear forward-projection with CT geometry. Additionally, the intrinsic correlations of ring artifacts between the sinogram and image domains are leveraged through synthetic data derived from natural images, enabling the trained model to correct artifacts without requiring real-world clinical data. Extensive evaluations on diverse scanning geometries and anatomical regions demonstrate that the model trained on synthetic data consistently outperforms existing state-of-the-art methods.

[133] DynaHOI: Benchmarking Hand-Object Interaction for Dynamic Target

BoCheng Hu,Zhonghan Zhao,Kaiyue Zhou,Hongwei Wang,Gaoang Wang

Main category: cs.CV

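TL;DR: This paper presents the DynaHOI-Gym platform and the DynaHOI-10M benchmark for evaluating hand motion generation in dynamic hand-object interaction, along with an ObAct baseline that improves location success rate.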

Details Motivation: Existing hand-object interaction (HOI) benchmarks focus mainly on static objects and do not evaluate dynamic targets or time-critical coordination. Method: The authors build DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics; release DynaHOI-10M, a large-scale dynamic HOI benchmark (10M frames, 180K trajectories); and design an observe-before-act baseline (ObAct) based on spatiotemporal attention. Result: ObAct achieves an 8.1% improvement in location success rate; DynaHOI-10M covers target motions in 8 major categories and 22 subcategories. Conclusion: DynaHOI-Gym and DynaHOI-10M fill a gap in the evaluation of dynamic hand-object interaction and provide a new benchmark and methodological insight for modeling temporal coordination. Abstract: Most existing hand motion generation benchmarks for hand-object interaction (HOI) focus on static objects, leaving dynamic scenarios with moving targets and time-critical coordination largely untested. To address this gap, we introduce DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics for dynamic capture evaluation. Built on DynaHOI-Gym, we release DynaHOI-10M, a large-scale benchmark with 10M frames and 180K hand capture trajectories, whose target motions are organized into 8 major categories and 22 fine-grained subcategories. We also provide a simple observe-before-act baseline (ObAct) that integrates short-term observations with the current frame via spatiotemporal attention to predict actions, achieving an 8.1% improvement in location success rate.

[134] Synthesis of Late Gadolinium Enhancement Images via Implicit Neural Representations for Cardiac Scar Segmentation

Soufiane Ben Haddou,Laura Alvarez-Florez,Erik J. Bekkers,Fleur V. Y. Tjong,Ahmad S. Amin,Connie R. Bezzina,Ivana Išgum

Main category: cs.CV

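TL;DR: This paper proposes a framework combining implicit neural representations (INRs) with denoising diffusion models to synthesize late gadolinium enhancement (LGE) cardiac MRI images together with their myocardium and fibrosis segmentation masks without manual annotation, mitigating data scarcity; experiments on real data show improved fibrosis segmentation.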

Details Motivation: Late gadolinium enhancement (LGE) imaging is the clinical standard for myocardial scar assessment, but the lack of large annotated datasets severely limits the development of automated segmentation methods. Method: INRs first model continuous spatial representations of LGE images and the corresponding myocardium and fibrosis masks; the INRs are then compressed into compact latent embeddings that preserve anatomical information; finally a diffusion model is trained in this latent space to generate new representations, which are decoded into anatomically consistent synthetic LGE images and segmentation masks. Result: In experiments on 133 cardiac MRI scans, adding 200 synthetic volumes raised the fibrosis segmentation Dice score from 0.509 to 0.524. Conclusion: The method offers a data augmentation scheme that needs no manual annotation and helps mitigate data scarcity in LGE image segmentation. Abstract: Late gadolinium enhancement (LGE) imaging is the clinical standard for myocardial scar assessment, but limited annotated datasets hinder the development of automated segmentation methods. We propose a novel framework that synthesises both LGE images and their corresponding segmentation masks using implicit neural representations (INRs) combined with denoising diffusion models. Our approach first trains INRs to capture continuous spatial representations of LGE data and associated myocardium and fibrosis masks. These INRs are then compressed into compact latent embeddings, preserving essential anatomical information. A diffusion model operates on this latent space to generate new representations, which are decoded into synthetic LGE images with anatomically consistent segmentation masks. Experiments on 133 cardiac MRI scans suggest that augmenting training data with 200 synthetic volumes contributes to improved fibrosis segmentation performance, with the Dice score showing an increase from 0.509 to 0.524. Our approach provides an annotation-free method to help mitigate data scarcity. The code for this research is publicly available.
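
For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of the standard definition, Dice = 2|A∩B| / (|A| + |B|), for binary masks. The function name `dice_score` and the flat 0/1 list representation are assumptions made for illustration; they are not taken from the paper's code.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences.

    Illustrative helper (hypothetical name); returns 1.0 when both masks are empty.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count positions where both masks are 1.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * intersection / total


predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
print(dice_score(predicted, reference))
```

On this scale, the reported change from 0.509 to 0.524 is an absolute gain of 0.015.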

[135] Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion

Bruno Rigal,Victor Dupriez,Alexis Mignon,Ronan Le Hy,Nicolas Mery

Main category: cs.CV

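TL;DR: This report introduces a new benchmark for evaluating PDF-to-Markdown conversion on complex French documents. Hard examples are selected via model-disagreement sampling, and a fine-grained evaluation method balances semantic correctness with tolerance for presentation differences. Experiments show that the strongest proprietary VLMs are more robust on handwriting and forms, while some open-weights models remain competitive on standard printed layouts.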

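Details Motivation: Existing PDF parsing benchmarks focus mostly on English or Chinese and tend to over-penalize models for irrelevant formatting differences (e.g., line breaks, list segmentation, table rendering style). Because document parsing errors seriously degrade downstream tasks such as RAG, a more reasonable, language-appropriate, task-oriented evaluation is needed. Method: A French-specific benchmark is built by selecting hard examples (handwritten forms, complex layouts, dense tables, graphics-rich pages) from 60,000 documents via model-disagreement sampling. Evaluation uses unit-test-style checks targeting three concrete failure modes (text presence, reading order, local table constraints), combined with category-specific normalization that ignores presentation-only differences. Result: Across 15 models, the strongest proprietary VLMs are substantially more robust on handwriting and forms, while several open-weights models remain competitive on standard printed documents. Conclusion: Evaluation should decouple semantic correctness from presentation differences; parsing complex French documents remains challenging, with proprietary models currently ahead and open-weights systems showing promise; the benchmark and evaluation approach could extend to other lower-resource languages and difficult document settings. Abstract: This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.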
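
To make the idea of unit-test-style checks concrete, here is a minimal sketch of what a text-presence check and a reading-order check with presentation-insensitive normalization might look like. The helper names (`normalize`, `text_present`, `in_reading_order`) and the specific normalization rules are assumptions for illustration only; the benchmark's actual checks and category-specific normalization are not described in the abstract.

```python
import re


def normalize(text):
    """Hypothetical normalization: drop common Markdown markup, collapse whitespace, lowercase."""
    text = re.sub(r"[*_#`|>-]+", " ", text)  # treat markup characters as spacing
    return re.sub(r"\s+", " ", text).strip().lower()


def text_present(output, snippet):
    """Text-presence check: the snippet appears somewhere in the model output."""
    return normalize(snippet) in normalize(output)


def in_reading_order(output, first, second):
    """Reading-order check: both snippets appear, and `first` precedes `second`."""
    out = normalize(output)
    i = out.find(normalize(first))
    j = out.find(normalize(second))
    return i != -1 and j != -1 and i < j


model_output = "# Titre\n\nBonjour   le\nmonde\n\n- Article 1"
print(text_present(model_output, "bonjour le monde"))
print(in_reading_order(model_output, "Titre", "Article 1"))
```

Because both the output and the expected snippet pass through the same normalization, differences in line breaks or list markers do not cause a failure, which is the kind of presentation-only variance the abstract says should be discounted.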

[136] Calibrated Bayesian Deep Learning for Explainable Decision Support Systems Based on Medical Imaging

Hua Xu,Julián D. Arias-Londoño,Juan I. Godino-Llorente

Main category: cs.CV

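TL;DR: This paper proposes a probabilistic optimization framework based on Bayesian deep learning, comprising a novel CUB-Loss and a Dual Temperature Scaling (DTS) strategy, to improve the uncertainty calibration of medical imaging AI models and thereby strengthen clinical trustworthiness and explainability.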

Details Motivation: AI-assisted decision-making in medical imaging must balance predictive accuracy with uncertainty calibration; existing models are often overconfident in wrong predictions, which hinders clinical adoption. Method: A Confidence-Uncertainty Boundary Loss (CUB-Loss) penalizes high-confidence errors and low-confidence correct predictions during training, and is combined with post-hoc Dual Temperature Scaling (DTS) for probability calibration. Result: Validated on three tasks (pneumonia screening, diabetic retinopathy detection, and skin lesion identification), the approach consistently improves calibration and remains robust with scarce data and severe class imbalance. Conclusion: The framework is generalizable and clinically practical, offering an effective route to deploying trustworthy AI in medical imaging. Abstract: In critical decision support systems based on medical imaging, the reliability of AI-assisted decision-making is as relevant as predictive accuracy. Although deep learning models have demonstrated significant accuracy, they frequently suffer from miscalibration, manifested as overconfidence in erroneous predictions. To facilitate clinical acceptance, it is imperative that models quantify uncertainty in a manner that correlates with prediction correctness, allowing clinicians to identify unreliable outputs for further review. In order to address this necessity, the present paper proposes a generalizable probabilistic optimization framework grounded in Bayesian deep learning. Specifically, a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) is introduced that imposes penalties on high-certainty errors and low-certainty correct predictions, explicitly enforcing alignment between prediction correctness and uncertainty estimates. Complementing this training-time optimization, a Dual Temperature Scaling (DTS) strategy is devised for post-hoc calibration, further refining the posterior distribution to improve intuitive explainability. The proposed framework is validated on three distinct medical imaging tasks: automatic screening of pneumonia, diabetic retinopathy detection, and identification of skin lesions. Empirical results demonstrate that the proposed approach achieves consistent calibration improvements across diverse modalities, maintains robust performance in data-scarce scenarios, and remains effective on severely imbalanced datasets, underscoring its potential for real clinical deployment.
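
As background for the post-hoc calibration step, the sketch below shows ordinary single-temperature scaling, in which logits are divided by a scalar T before the softmax. This is the standard technique that temperature-based calibration builds on, not the paper's Dual Temperature Scaling, whose details the abstract does not give; the function name is an assumption.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; T > 1 flattens the distribution, T < 1 sharpens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=2.0))
```

Scaling by a single positive temperature never changes the predicted class, only the confidence attached to it, which is why it can reduce overconfidence without affecting accuracy.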

[137] Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation

Wei Chen,Yancheng Long,Mingqiao Liu,Haojie Ding,Yankai Yang,Hongyang Wei,Yi-Fan Zhang,Bin Wen,Fan Yang,Tingting Gao,Han Li,Long Chen

Main category: cs.CV

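TL;DR: This paper proposes Spatial Chain-of-Thought (SCoT), a plug-and-play framework that combines the spatial reasoning of multimodal large language models (MLLMs) with the generative power of diffusion models to improve spatial understanding and layout planning, without joint training and without the spatial information loss caused by text-only prompts.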

Details Motivation: Diffusion models excel at image generation but fall short on complex spatial understanding and reasoning; existing remedies either require costly joint training or rely on text-only prompts that lose spatial information. Method: The SCoT framework (1) trains the diffusion model on an interleaved text-coordinate instruction format to strengthen its layout awareness, and (2) uses a state-of-the-art MLLM as a planner to produce detailed layout plans, transferring its spatial planning ability directly into the generation process. Result: The method reaches state-of-the-art performance on image generation benchmarks, significantly outperforms baselines on complex spatial reasoning tasks, and is effective in image editing scenarios. Conclusion: SCoT is an efficient plug-and-play framework that bridges the spatial reasoning of MLLMs and the generative ability of diffusion models, balancing performance and practicality. Abstract: While diffusion models have shown exceptional capabilities in aesthetic image synthesis, they often struggle with complex spatial understanding and reasoning. Existing approaches resort to Multimodal Large Language Models (MLLMs) to enhance this capability. However, they either incur high computational costs through joint training or suffer from spatial information loss when relying solely on textual prompts. To alleviate these limitations, we propose a Spatial Chain-of-Thought (SCoT) framework, a plug-and-play approach that effectively bridges the reasoning capabilities of MLLMs with the generative power of diffusion models. Specifically, we first enhance the diffusion model's layout awareness by training it on an interleaved text-coordinate instruction format. We then leverage state-of-the-art MLLMs as planners to generate comprehensive layout plans, transferring their spatial planning capabilities directly to the generation process. Extensive experiments demonstrate that our method achieves state-of-the-art performance on image generation benchmarks and significantly outperforms baselines on complex reasoning tasks, while also showing strong efficacy in image editing scenarios.

[138] Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation

Enrico Guerriero,Kjersti Engan,Øyvind Meinich-Bache

Main category: cs.CV

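TL;DR: This paper explores generative AI (GenAI) methods for activity recognition in newborn resuscitation videos, combining local vision-language models (VLMs) with large language models (LLMs) and comparing them with a supervised TimeSformer baseline. A small local VLM fine-tuned with LoRA reaches an F1 score of 0.91, compared with 0.70 for TimeSformer.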

Details Motivation: Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice; earlier methods based on 3D-CNNs and ViTs are promising but struggle with fine-grained activity recognition. Method: Using a simulated dataset of 13.26 hours of newborn resuscitation video, the study evaluates several zero-shot VLM strategies and fine-tuned VLMs with classification heads (including LoRA adaptation), and compares them with a supervised TimeSformer baseline. Result: A small local VLM fine-tuned with LoRA reaches an F1 score of 0.91, well above TimeSformer's 0.70, whereas zero-shot VLMs are prone to hallucination. Conclusion: Fine-tuned local VLMs, especially with LoRA, perform strongly on activity recognition in newborn resuscitation videos and open a new path for clinical video analysis. Abstract: Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA, they reach an F1 score of 0.91, surpassing the TimeSformer result of 0.70.
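
For context on Low-Rank Adaptation, the sketch below illustrates the usual LoRA forward pass for one linear layer: the frozen weight W is augmented by a low-rank update, y = W x + (alpha / r) · B (A x). It is a plain-Python illustration of the general technique with hypothetical names and toy values, not the configuration used in this study.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """LoRA-adapted linear layer: W is frozen (d_out x d_in); A (r x d_in) and B (d_out x r) are trained."""
    r = len(A)  # rank of the low-rank update
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy identity)
A = [[1.0, 1.0]]              # rank r = 1
B_zero = [[0.0], [0.0]]       # B is commonly initialized to zero, so training starts at the base model
print(lora_forward(W, A, B_zero, [1.0, 2.0]))
```

Only A and B are updated during fine-tuning, so the number of trainable parameters scales with the rank r instead of with the full weight matrix.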

[139] Projected Representation Conditioning for High-fidelity Novel View Synthesis

Min-Seop Kwak,Minkyung Kwon,Jinhyeok Choi,Jiho Park,Seungryong Kim

Main category: cs.CV

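TL;DR: This paper proposes ReNoV, a framework that uses external visual representations as conditioning inputs to a diffusion model, injected through projection modules to improve the geometric consistency of novel views; it outperforms existing diffusion methods in both reconstruction fidelity and inpainting quality.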

Details Motivation: Diffusion models for novel view synthesis lack geometric consistency; the geometric and semantic correspondence properties of external representations can improve generation quality. Method: The authors analyze the correspondence capabilities in the spatial attention of external visual representations, then design dedicated representation projection modules that inject those representations into the diffusion process, yielding ReNoV (Representation-guided Novel View synthesis). Result: On standard benchmarks the method markedly improves reconstruction fidelity and inpainting quality, outperforms prior diffusion-based novel view synthesis methods, and supports robust synthesis from sparse, unposed image collections. Conclusion: External representations can effectively guide diffusion models toward better geometric consistency and quality in novel view synthesis, and ReNoV offers a new paradigm for diffusion-based novel view synthesis. Abstract: We propose a novel framework for diffusion-based novel view synthesis in which we leverage external representations as conditions, harnessing their geometric and semantic correspondence properties for enhanced geometric consistency in generated novel viewpoints. First, we provide a detailed analysis exploring the correspondence capabilities emergent in the spatial attention of external visual representations. Building on these insights, we propose dedicated representation projection modules that inject external representations into the diffusion process, a method we name ReNoV, short for representation-guided novel view synthesis. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.

[140] A DMD-Based Adaptive Modulation Method for High Dynamic Range Imaging in High-Glare Environments

Banglei Guan,Jing Tao,Liang Xu,Dongcai Tan,Pengju Sun,Jianbing Liu,Yang Shang,Qifeng Yu

Main category: cs.CV

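TL;DR: This paper presents a high dynamic range (HDR) imaging system based on a digital micromirror device (DMD) to improve image quality and digital image correlation (DIC) accuracy for photomechanics measurements under strong glare, such as welding arc monitoring and metal surface analysis. DMD optical modulation works together with adaptive computational imaging to provide region-adaptive exposure; the measured dynamic range reaches 127 dB, saturation artifacts are strongly suppressed, strain error drops by 78%, and DIC positioning accuracy improves.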

Details Motivation: Conventional CCD/CMOS sensors have low dynamic range (<70 dB) and saturate easily under strong glare (e.g., welding arcs, specular reflections), severely distorting DIC measurements; an imaging solution with higher dynamic range (>120 dB) is needed. Method: The authors build an HDR imaging system based on DMD spatial light modulation, consisting of a DMD optical modulation unit and an adaptive computational imaging pipeline, with autonomous regional segmentation and dynamic exposure control. Result: The system reaches a measured dynamic range of 127 dB and eliminates saturation artifacts under high glare; experiments show a 78% reduction in DIC strain error and improved positioning accuracy. Conclusion: The DMD-HDR system provides high-fidelity adaptive imaging beyond the limits of conventional sensors and suits optical metrology and stress analysis in high-glare environments. Abstract: Background: The accuracy of photomechanics measurements critically relies on image quality, particularly under extreme illumination conditions such as welding arc monitoring and polished metallic surface analysis. High dynamic range (HDR) imaging above 120 dB is essential in these contexts. Conventional CCD/CMOS sensors, with dynamic ranges typically below 70 dB, are highly susceptible to saturation under glare, resulting in irreversible loss of detail and significant errors in digital image correlation (DIC). Methods: This paper presents an HDR imaging system that leverages the spatial modulation capability of a digital micromirror device (DMD). The system architecture enables autonomous regional segmentation and adaptive exposure control for high-dynamic-range scenes through an integrated framework comprising two synergistic subsystems: a DMD-based optical modulation unit and an adaptive computational imaging pipeline. Results: The system achieves a measurable dynamic range of 127 dB, effectively eliminating saturation artifacts under high glare. Experimental results demonstrate a 78% reduction in strain error and improved DIC positioning accuracy, confirming reliable performance across extreme intensity variations. Conclusion: The DMD-based system provides high-fidelity adaptive HDR imaging, overcoming key limitations of conventional sensors. It exhibits strong potential for optical metrology and stress analysis in high-glare environments where traditional methods are inadequate.
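
To put the decibel figures in perspective, the sketch below converts between dynamic range in dB and the ratio of brightest to darkest resolvable signal, assuming the convention commonly used for image sensors, DR = 20 · log10(max / min). The abstract does not state which convention the authors use, so this is an assumption, and the function names are illustrative.

```python
import math


def dynamic_range_db(max_signal, min_signal):
    """Dynamic range in dB under the assumed 20*log10 sensor convention."""
    return 20.0 * math.log10(max_signal / min_signal)


def intensity_ratio(db):
    """Inverse mapping: ratio of largest to smallest resolvable signal for a given dB value."""
    return 10.0 ** (db / 20.0)


for label, db in [("conventional sensor bound", 70.0), ("reported DMD system", 127.0)]:
    print(f"{label}: {db} dB -> ratio of about {intensity_ratio(db):.3g}")
```

Under this convention, the 57 dB gap between 70 dB and 127 dB corresponds to a ratio several hundred times larger.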

[141] GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain Team,Boyuan Wang,Chaojun Ni,Guan Huang,Guosheng Zhao,Hao Li,Jie Li,Jindi Lv,Jingyu Liu,Lv Feng,Mingming Yu,Peng Li,Qiuping Deng,Tianze Liu,Xinyu Zhou,Xinze Chen,Xiaofeng Wang,Yang Wang,Yifan Li,Yifei Nie,Yilong Li,Yukun Zhou,Yun Ye,Zhichao Liu,Zheng Zhu

Main category: cs.CV

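TL;DR: This paper proposes GigaBrain-0.5M*, a vision-language-action (VLA) model trained with reinforcement learning based on a video world model. The RAMP method improves cross-task generalization and robustness in long-horizon manipulation, yielding gains of about 30% on several complex robotic tasks, validated in real-world deployment.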

Details Motivation: Existing VLA models are limited by weak scene understanding and poor future anticipation, whereas world models pre-trained on large-scale video data have strong spatiotemporal reasoning and future prediction, making them a natural foundation for enhancing VLA learning. Method: Starting from GigaBrain-0.5, a model pre-trained on robotic manipulation data, the authors add the world model-based reinforcement learning framework RAMP (Reinforcement leArning via world Model-conditioned Policy) to build GigaBrain-0.5M*, enabling cross-task adaptation. Result: On challenging tasks such as Laundry Folding, Box Packing, and Espresso Preparation, performance improves by about 30% over the RECAP baseline; in real-world deployment the model shows reliable long-horizon execution, completing complex manipulation tasks without failure. Conclusion: Bringing world models into VLA training, in particular through RAMP, markedly improves generalization, anticipation, and deployment robustness, pointing to a new path for embodied intelligence systems. Abstract: Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. Therefore, we propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. It is built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure as validated by real-world deployment videos on our project page (https://gigabrain05m.github.io).

[142] AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer

Lingting Zhu,Shengju Qian,Haidi Fan,Jiayu Dong,Zhenchao Jin,Siwei Zhou,Gen Dong,Xin Wang,Lequan Yu

Main category: cs.CV

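TL;DR: This paper proposes AssetFormer, a Transformer-based autoregressive model that generates modular 3D assets satisfying design constraints from text descriptions, for both professional development and user-generated content (UGC) scenarios.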

Details Motivation: The digital industry urgently needs high-quality, diverse modular 3D assets, especially for user-generated content (UGC); existing methods struggle to respect both modular structure and design constraints. Method: AssetFormer is a Transformer-based autoregressive model that adapts module sequencing and decoding techniques from language models and is trained on modular assets collected from the real world. Result: Initial experiments indicate that AssetFormer improves the quality of generated modular 3D assets, supports constrained design across applications, and extends well to other asset types. Conclusion: AssetFormer provides a flexible, extensible framework for modular 3D content generation and moves 3D generation closer to practical use in professional and UGC settings. Abstract: The digital industry demands high-quality, diverse modular 3D assets, especially for user-generated content (UGC). In this work, we introduce AssetFormer, an autoregressive Transformer-based model designed to generate modular 3D assets from textual descriptions. Our pilot study leverages real-world modular assets collected from online platforms. AssetFormer tackles the challenge of creating assets composed of primitives that adhere to constrained design parameters for various applications. By innovatively adapting module sequencing and decoding techniques inspired by language models, our approach enhances asset generation quality through autoregressive modeling. Initial results indicate the effectiveness of AssetFormer in streamlining asset creation for professional development and UGC scenarios. This work presents a flexible framework extendable to various types of modular 3D assets, contributing to the broader field of 3D content generation. The code is available at https://github.com/Advocate99/AssetFormer.

[143] PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback

Sixiang Chen,Jianyu Lai,Jialin Gao,Hengyu Shi,Zhongying Liu,Tian Ye,Junfeng Luo,Xiaoming Wei,Lei Zhu

Main category: cs.CV

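TL;DR: This paper proposes PosterOmni, a framework that unifies local editing and global creation tasks in image-to-poster generation. Through dataset construction, knowledge distillation, and a unified reward feedback mechanism, it markedly improves entity fidelity and aesthetic harmony across tasks.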

Details Motivation: Image-to-poster generation must both preserve local visual entities and understand global design concepts; existing methods struggle to do both, and a unified framework and evaluation benchmark are lacking. Method: PosterOmni has three parts: (i) multi-scenario image-to-poster datasets covering six task types; (ii) knowledge distillation between a local editing expert and a global creation expert, followed by supervised fine-tuning; and (iii) a unified PosterOmni Reward Feedback that aligns entity fidelity with aesthetic preference. The authors also establish the unified benchmark PosterOmni-Bench. Result: PosterOmni significantly improves reference adherence, global composition quality, and aesthetic harmony, outperforming all open-source baselines and surpassing several proprietary systems. Conclusion: PosterOmni couples entity-preserving editing with concept-driven creation, supporting the effectiveness and generality of a unified framework for multi-dimensional image-to-poster generation. Abstract: Image-to-poster generation is a high-demand task requiring not only local adjustments but also high-level design understanding. Models must generate text, layout, style, and visual elements while preserving semantic fidelity and aesthetic coherence. The process spans two regimes: local editing, where ID-driven generation, rescaling, filling, and extending must preserve concrete visual entities; and global creation, where layout- and style-driven tasks rely on understanding abstract design concepts. These intertwined demands make image-to-poster a multi-dimensional process coupling entity-preserving editing with concept-driven creation under image-prompt control. To address these challenges, we propose PosterOmni, a generalized artistic poster creation framework that unlocks the potential of a base edit model for multi-task image-to-poster generation. PosterOmni integrates the two regimes, namely local editing and global creation, within a single system through an efficient data-distillation-reward pipeline: (i) constructing multi-scenario image-to-poster datasets covering six task types across entity-based and concept-based creation; (ii) distilling knowledge between local and global experts for supervised fine-tuning; and (iii) applying unified PosterOmni Reward Feedback to jointly align visual entity preservation and aesthetic preference across all tasks. Additionally, we establish PosterOmni-Bench, a unified benchmark for evaluating both local editing and global creation. Extensive experiments show that PosterOmni significantly enhances reference adherence, global composition quality, and aesthetic harmony, outperforming all open-source baselines and even surpassing several proprietary systems.

[144] FAIL: Flow Matching Adversarial Imitation Learning for Image Generation

Yeyao Ma,Chen Li,Xiaosong Zhang,Han Hu,Weidi Xie

Main category: cs.CV

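TL;DR: This paper proposes FAIL, which uses adversarial training to minimize the divergence between the policy and the expert distribution, without explicit rewards or pairwise comparisons, for post-training of flow matching models.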

Details Motivation: Supervised fine-tuning cannot correct policy drift in unseen states, while preference optimization depends on costly preference pairs or reward modeling; a more efficient and general imitation learning method is needed. Method: Flow Matching Adversarial Imitation Learning (FAIL) comes in two variants: FAIL-PD uses differentiable ODE solvers to obtain low-variance pathwise gradients, and FAIL-PG offers a black-box alternative for discrete or computationally constrained settings. Result: Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana pro, FAIL achieves competitive performance on prompt-following and aesthetic benchmarks, generalizes to discrete image and video generation, and acts as a robust regularizer that mitigates reward hacking in reward-based optimization. Conclusion: FAIL provides an efficient adversarial imitation learning framework for post-training flow matching models that needs neither reward signals nor pairwise comparisons, combining conceptual simplicity with practical generality. Abstract: Post-training of flow matching models, which aligns the output distribution with a high-quality target, is mathematically equivalent to imitation learning. While Supervised Fine-Tuning mimics expert demonstrations effectively, it cannot correct policy drift in unseen states. Preference optimization methods address this but require costly preference pairs or reward modeling. We propose Flow Matching Adversarial Imitation Learning (FAIL), which minimizes policy-expert divergence through adversarial training without explicit rewards or pairwise comparisons. We derive two algorithms: FAIL-PD exploits differentiable ODE solvers for low-variance pathwise gradients, while FAIL-PG provides a black-box alternative for discrete or computationally constrained settings. Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana pro, FAIL achieves competitive performance on prompt following and aesthetic benchmarks. Furthermore, the framework generalizes effectively to discrete image and video generation, and functions as a robust regularizer to mitigate reward hacking in reward-based optimization. Code and data are available at https://github.com/HansPolo113/FAIL.

[145] TexSpot: 3D Texture Enhancement with Spatially-uniform Point Latent Representation

Ziteng Lu,Yushuang Wu,Chongjie Ye,Yuda Qiu,Jing Shao,Xiaoyang Guo,Jiaqing Zhou,Tianlei Hu,Kun Zhou,Xiaoguang Han

Main category: cs.CV

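TL;DR: This paper proposes TexSpot, a diffusion-based texture enhancement framework that uses a newly introduced Texlet representation to address view inconsistency and distortion in 3D texture generation, markedly improving texture quality, geometric consistency, and robustness.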

Details Motivation: Existing 3D texture generation methods suffer either from UV-mapping distortion or, for point-based methods, from a dependence on geometric density that limits high-resolution textures; mainstream multi-view diffusion pipelines are also view-inconsistent. Method: Texlet is a new 3D texture representation that combines the geometric expressiveness of point-based textures with the compactness of UV maps. Each Texlet encodes a local texture patch with a 2D encoder and is aggregated by a 3D encoder to capture global shape context; a cascaded 3D-to-2D decoder reconstructs the texture patches, and a diffusion transformer conditioned on Texlets refines the textures. Result: TexSpot significantly outperforms existing state-of-the-art 3D texture generation and enhancement methods in visual fidelity, geometric consistency, and robustness. Conclusion: The Texlet representation decouples texture quality from geometric density, and the TexSpot framework offers a new paradigm for high-quality, view-consistent 3D texture generation. Abstract: High-quality 3D texture generation remains a fundamental challenge due to the view-inconsistency inherent in current mainstream multi-view diffusion pipelines. Existing representations either rely on UV maps, which suffer from distortion during unwrapping, or on point-based methods, which tightly couple texture fidelity to geometric density and thereby limit high-resolution texture generation. To address these limitations, we introduce TexSpot, a diffusion-based texture enhancement framework. At its core is Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV-based representation. Each Texlet latent vector encodes a local texture patch via a 2D encoder and is further aggregated using a 3D encoder to incorporate global shape context. A cascaded 3D-to-2D decoder reconstructs high-quality texture patches, enabling learning of the Texlet space. Leveraging this representation, we train a diffusion transformer conditioned on Texlets to refine and enhance textures produced by multi-view diffusion methods. Extensive experiments demonstrate that TexSpot significantly improves visual fidelity, geometric consistency, and robustness over existing state-of-the-art 3D texture generation and enhancement approaches. Project page: https://anonymous.4open.science/w/TexSpot-page-2D91.

[146] DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

Xu Guo,Fulong Ye,Qichao Sun,Liyang Chen,Bingchuan Li,Pengze Zhang,Jiawei Liu,Songtao Zhao,Qian He,Xiangwang Hou

Main category: cs.CV

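TL;DR: This paper proposes DreamID-Omni, a unified framework for controllable human-centric audio-video generation. A new Symmetric Conditional Diffusion Transformer, a dual-level disentanglement strategy, and multi-task progressive training address the difficulty of separately controlling multiple character identities and voice timbres, reaching state-of-the-art results on multiple metrics.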

Details Motivation: Existing audio-video generation methods treat human-centric tasks (reference-based audio-video generation, video editing, audio-driven animation) in isolation and struggle to provide precise, disentangled control over multiple character identities and voice timbres within one framework. Method: DreamID-Omni comprises (1) a Symmetric Conditional Diffusion Transformer that integrates heterogeneous control signals through symmetric conditional injection; (2) a Dual-Level Disentanglement strategy, with Synchronized RoPE at the signal level to bind attention space and Structured Captions at the semantic level to establish explicit attribute-subject mappings; and (3) Multi-Task Progressive Training, which uses weakly constrained generative priors to regularize strongly constrained tasks. Result: The framework reaches state-of-the-art performance in video quality, audio quality, and audio-visual consistency, surpassing leading proprietary commercial models. Conclusion: DreamID-Omni delivers unified, controllable, high-fidelity human-centric audio-video generation, and the authors plan to release the code for academic and industrial use. Abstract: Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.

[147] EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data

Nils Lehmann,Yi Wang,Zhitong Xiong,Xiaoxiang Zhu

Main category: cs.CV

TL;DR: EO-VAE is a multi-sensor variational autoencoder that serves as a unified tokenizer for Earth observation data, using dynamic hypernetworks to handle diverse spectral channels and outperforming prior modality-specific tokenizers in reconstruction fidelity.

Details Motivation: Earth observation data poses unique challenges for generative modeling due to heterogeneous sensor specifications and variable spectral channels, making standard RGB-focused tokenizers inadequate. Method: EO-VAE employs a single multi-sensor variational autoencoder with dynamic hypernetworks to flexibly encode and reconstruct varying channel combinations, avoiding the need for separate tokenizers per modality. Result: On the TerraMesh dataset, EO-VAE achieves superior reconstruction fidelity compared to TerraMind tokenizers. Conclusion: EO-VAE establishes a robust, unified baseline tokenizer for latent generative modeling in remote sensing. Abstract: State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. While this paradigm has revolutionized RGB generation, Earth observation (EO) data presents unique challenges due to diverse sensor specifications and variable spectral channels. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the EO domain. Unlike prior approaches that train separate tokenizers for each modality, EO-VAE utilizes a single model to encode and reconstruct flexible channel combinations via dynamic hypernetworks. Our experiments on the TerraMesh dataset demonstrate that EO-VAE achieves superior reconstruction fidelity compared to the TerraMind tokenizers, establishing a robust baseline for latent generative modeling in remote sensing.

[148] DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Dianyi Wang,Ruihang Li,Feng Han,Chaofan Ma,Wei Song,Siyuan Wang,Yibin Wang,Yi Xin,Hongjian Liu,Zhixiong Zhang,Shengyuan Ding,Tianhang Wang,Zhenglin Cheng,Tao Lin,Cheng Jin,Kaicheng Yu,Jingjing Chen,Wenjie Wang,Zhongyu Wei,Jiaqi Wang

Main category: cs.CV

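TL;DR: DeepGen 1.0 is a lightweight unified multimodal model for image generation and editing with only 5B parameters. Using the Stacked Channel Bridging (SCB) framework and a three-stage data-centric training strategy, it surpasses much larger models (such as the 80B HunyuanImage and the 27B Qwen-Image-Edit) on several benchmarks, and its code, weights, and datasets are open-sourced.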

Details Motivation: Existing unified multimodal models for image generation and editing have very large parameter counts (>10B) and are costly to train and deploy; a lightweight, efficient alternative that does not sacrifice performance is needed. Method: Stacked Channel Bridging (SCB) is a deep alignment framework that fuses features from multiple vision-language model layers with learnable 'think tokens'. Training proceeds in three stages: (1) alignment pre-training, (2) joint supervised fine-tuning, and (3) reinforcement learning with MR-GRPO. Result: Trained on only about 50M samples, DeepGen 1.0 surpasses the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench, with clear gains in generation quality and human preference alignment and without visual artifacts. Conclusion: A lightweight unified model can match or exceed much larger models through architectural innovation and data strategy, helping democratize multimodal generation. Abstract: Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.

[149] Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching

Onkar Susladkar,Tushar Prakash,Gayatri Deshmukh,Kiet A. Nguyen,Jiaxun Zhang,Adheesh Juvekar,Tianshu Bao,Lin Chai,Sparsh Mittal,Inderjit S Dhillon,Ismini Lourentzou

Main category: cs.CV

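TL;DR: UniDFlow is a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation through task-specific low-rank adapters and introduces a reference-based multimodal preference alignment method, improving faithfulness and controllability; it reaches SOTA on multiple benchmarks and shows strong zero-shot generalization.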

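Details Motivation: To address objective interference and representation entanglement in multimodal tasks, and to improve faithfulness, controllability, and generalization across understanding, generation, and editing. Method: Proposes the UniDFlow framework: 1) task-specific low-rank adapters decouple understanding from generation; 2) a reference-based multimodal preference alignment mechanism optimizes relative outcomes under identical conditioning. Result: Achieves SOTA performance on eight benchmarks and generalizes zero-shot, without explicit task-specific training, to inpainting, in-context image generation, reference-based editing, and compositional generation. Conclusion: Through decoupled modeling and preference alignment, UniDFlow provides a unified, efficient multimodal framework that generalizes well and adapts to diverse tasks without large-scale retraining. Abstract: We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.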
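
To make the idea of task-specific low-rank adapters concrete, here is a minimal sketch in plain Python. It is not UniDFlow's implementation: the class name `AdaptedLinear`, the shapes, and the numbers are all hypothetical. It only shows the generic low-rank adapter pattern, in which a shared frozen weight `W` is combined with a small per-task update `B @ A`, so each task adjusts the layer without touching the other task's parameters.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class AdaptedLinear:
    """Shared frozen weight plus one low-rank (A, B) pair per task."""

    def __init__(self, weight, adapters, scale=1.0):
        self.weight = weight      # shape: d_out x d_in, shared by all tasks
        self.adapters = adapters  # task name -> (A: r x d_in, B: d_out x r)
        self.scale = scale

    def forward(self, x, task):
        base = matvec(self.weight, x)
        a_mat, b_mat = self.adapters[task]
        delta = matvec(b_mat, matvec(a_mat, x))  # low-rank update B(Ax)
        return [b + self.scale * d for b, d in zip(base, delta)]


# Hypothetical layer: d_in = 3, d_out = 2, adapter rank r = 1.
layer = AdaptedLinear(
    weight=[[1, 0, 0], [0, 1, 0]],
    adapters={
        "understanding": ([[1, 1, 1]], [[0.5], [0.0]]),
        "generation": ([[1, 0, -1]], [[0.0], [2.0]]),
    },
)
x = [1, 2, 3]
out_understanding = layer.forward(x, "understanding")
out_generation = layer.forward(x, "generation")
```

The same input produces different outputs depending on which adapter is selected, while the shared weight stays unchanged; this separation is what lets the two objectives avoid interfering with each other.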

[150] MonarchRT: Efficient Attention for Real-Time Video Generation

Krish Agarwal,Zhuoming Chen,Cheng Luo,Yongqi Chen,Haizhong Zheng,Xun Huang,Atri Rudra,Beidi Chen

Main category: cs.CV

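TL;DR: This paper proposes Monarch-RT, a structured attention parameterization based on Monarch matrix factorization that addresses the high cost of 3D self-attention in real-time video diffusion models. It reaches up to 95% attention sparsity while maintaining quality and, for the first time, achieves true real-time video generation at 16 FPS on a single RTX 5090.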

Details Motivation: In real-time video generation, Diffusion Transformers are limited by the quadratic cost of 3D self-attention; errors compound especially in the few-step, autoregressive setting, and existing sparse-attention methods break down there. Method: Proposes Monarch-RT, which factorizes attention with Monarch matrices, combining an aligned block structure with an extended tiled parameterization; efficiency comes from finetuning and custom Triton kernels. Result: Reaches 95% attention sparsity on the Self-Forcing model with no loss in quality; delivers 1.4-11.8x kernel speedups over FlashAttention kernels on RTX 5090/H100/B200; achieves true real-time video generation at 16 FPS on a single RTX 5090 for the first time. Conclusion: Monarch-RT is the first highly capable sparse attention parameterization for real-time, autoregressive video diffusion, balancing expressivity and efficiency and making real-time video generation considerably more practical. Abstract: Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of parameterization through finetuning, with custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly-capable sparse attention parameterization for real-time video generation.
Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on Nvidia RTX 5090, H100, and B200 GPUs respectively, providing kernel speedups in the range of 1.4-11.8X. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
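
For readers unfamiliar with Monarch matrices, the sketch below shows the basic structured product in plain Python. It follows one common formulation, M = P L P R, where L and R are block-diagonal with m blocks of size m x m and P is the permutation that reshapes a length-m² vector to m x m and transposes it. This is only the generic building block: the tiled attention parameterization and Triton kernels of Monarch-RT are not described in enough detail in the abstract to reproduce, and all function names here are hypothetical.

```python
def block_diag_matvec(blocks, x):
    """Apply a block-diagonal matrix (list of m square blocks) to x."""
    m = len(blocks[0])
    out = []
    for b, block in enumerate(blocks):
        segment = x[b * m:(b + 1) * m]
        out.extend(sum(row[j] * segment[j] for j in range(m)) for row in block)
    return out


def transpose_perm(x, m):
    """View x as a row-major m x m grid and return its transpose, flattened."""
    return [x[j * m + i] for i in range(m) for j in range(m)]


def monarch_matvec(left_blocks, right_blocks, x):
    """Compute (P L P R) x without ever forming the dense n x n matrix."""
    m = len(left_blocks)
    y = block_diag_matvec(right_blocks, x)  # mix within each chunk
    y = transpose_perm(y, m)                # regroup entries across chunks
    y = block_diag_matvec(left_blocks, y)   # mix across the original chunks
    return transpose_perm(y, m)             # restore the original ordering


def identity_blocks(m):
    return [[[1 if i == j else 0 for j in range(m)] for i in range(m)]
            for _ in range(m)]


# Tiny example with m = 2, so n = m * m = 4.
R = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
L = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
x = [1, 0, 0, 0]
y = monarch_matvec(L, identity_blocks(2), x)
```

Each block-diagonal step costs m · m² = n^1.5 multiplications, so the product needs about 2 · n^1.5 multiplications instead of the n² of a dense matrix, while the permutation between the two steps lets every output depend on every input.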

[151] UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Leon Liangyu Chen,Haoyu Ma,Zhipeng Fan,Ziqi Huang,Animesh Sinha,Xiaoliang Dai,Jialiang Wang,Zecheng He,Jianwei Yang,Chunyuan Li,Junzhe Sun,Chu Wang,Serena Yeung-Levy,Felix Juefei-Xu

Main category: cs.CV

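TL;DR: This paper proposes UniT, a framework that enables test-time scaling (TTS) for unified multimodal models through multi-round reasoning, verification, and refinement, improving performance on complex multimodal tasks.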

Details Motivation: Existing unified multimodal models typically run a single forward pass and struggle with complex tasks that require decomposing instructions, verifying intermediate results, and making iterative corrections; test-time scaling (TTS) has proven effective for language models, but extending it to unified multimodal models remains an open problem. Method: Proposes UniT, which combines agentic data synthesis, unified model training, and flexible test-time inference, supporting multi-round chain-of-thought reasoning, verification, subgoal decomposition, and content memory. Result: Experiments show that (1) unified models trained on short reasoning trajectories generalize to longer reasoning chains; (2) sequential chain-of-thought reasoning is more scalable and compute-efficient than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. Conclusion: Multimodal test-time scaling is an effective paradigm that advances both generation and understanding in unified models. Abstract: Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.

[152] Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching

Huai-Hsun Cheng,Siang-Ling Zhang,Yu-Lun Liu

Main category: cs.CV

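TL;DR: This paper introduces 'Progressive Semantic Illusions', a new vector sketching task in which a single sketch undergoes an abrupt semantic change as strokes are added, and presents 'Stroke of Surprise', a generative framework that uses dual-branch Score Distillation Sampling and an Overlay Loss to achieve multi-stage semantic consistency and structural complementarity.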

Details Motivation: Traditional visual illusions rely on spatial manipulations (such as multi-view consistency); this work explores semantic illusions along the temporal dimension, that is, a dynamic, progressive semantic shift during the drawing of a single sketch, extending visual anagrams from space to time. Method: Proposes a sequence-aware joint optimization framework whose core components are: 1) a dual-branch Score Distillation Sampling (SDS) mechanism that dynamically optimizes prefix strokes to satisfy both semantic targets; 2) an Overlay Loss that forces added strokes to complement the existing structure spatially rather than occlude it; 3) instead of freezing the initial strokes, a search for a 'common structural subspace' shared by both targets. Result: Experiments show the method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, successfully extending visual anagrams from the spatial to the temporal dimension. Conclusion: Progressive Semantic Illusions offer a new paradigm at the intersection of generative modeling and human visual perception; the Stroke of Surprise framework demonstrates that controllable stroke sequences can guide multi-stage semantic interpretation, advancing interpretable, temporally structured generative content. Abstract: Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel vector sketching task where a single sketch undergoes a dramatic semantic transformation through the sequential addition of strokes. We present Stroke of Surprise, a generative framework that optimizes vector strokes to satisfy distinct semantic interpretations at different drawing stages. The core challenge lies in the "dual-constraint": initial prefix strokes must form a coherent object (e.g., a duck) while simultaneously serving as the structural foundation for a second concept (e.g., a sheep) upon adding delta strokes. To address this, we propose a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism. Unlike sequential approaches that freeze the initial state, our method dynamically adjusts prefix strokes to discover a "common structural subspace" valid for both targets. Furthermore, we introduce a novel Overlay Loss that enforces spatial complementarity, ensuring structural integration rather than occlusion. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, successfully expanding visual anagrams from the spatial to the temporal dimension. Project page: https://stroke-of-surprise.github.io/