cs.CL

[1] HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated Q&A over Raw Unstructured Documents

Sungmoon Kim,Hyuna Jeon,Dahye Kim,Mingyu Kim,Dong-Kyu Chae,Jiwoong Kim

Main category: cs.CL

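TL;DR: This paper proposes HybridRAG, a framework that preprocesses unstructured documents such as PDFs into a QA knowledge base and, at query time, first tries to match an existing answer, triggering generation only when no match is found, thereby improving accuracy and response speed.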

Details Motivation: Existing RAG methods rely on well-structured text and perform retrieval and generation at query time, which makes them ill-suited to real-world chat scenarios with large volumes of unstructured PDF documents, many concurrent users, and limited resources. Method: HybridRAG first parses PDFs with OCR and layout analysis to produce hierarchical text chunks, then uses an LLM to pre-generate a QA knowledge base; at query time it first searches the QA bank for an answer and invokes on-the-fly RAG generation only on a miss. Result: Experiments on OHRBench show that HybridRAG achieves higher answer quality and lower latency than a standard RAG baseline. Conclusion: HybridRAG is an efficient, practical RAG paradigm for real-world chatbot applications, particularly suited to large volumes of unstructured documents and resource-constrained settings. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for grounding Large Language Model (LLM)-based chatbot responses on external knowledge. However, existing RAG studies typically assume well-structured textual sources (e.g., Wikipedia or curated datasets) and perform retrieval and generation at query time, which can limit their applicability in real-world chatbot scenarios. In this paper, we present HybridRAG, a novel and practical RAG framework towards more accurate and faster chatbot responses. First, HybridRAG ingests raw, unstructured PDF documents containing complex layouts (text, tables, figures) via Optical Character Recognition (OCR) and layout analysis, and converts them into hierarchical text chunks. Then, it pre-generates a plausible question-answer (QA) knowledge base from the organized chunks using an LLM. At query time, user questions are matched against this QA bank to retrieve immediate answers when possible, and only if no suitable QA match is found does our framework fall back to on-the-fly response generation. Experiments on OHRBench demonstrate that our HybridRAG provides higher answer quality and lower latency compared to a standard RAG baseline. We believe that HybridRAG could be a practical solution for real-world chatbot applications that must handle large volumes of unstructured documents and many users under limited computational resources.
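The match-then-fall-back behavior described in the HybridRAG abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the token-overlap `jaccard` matcher stands in for whatever retrieval model the paper uses, and the names `answer`, `fallback`, and the `threshold` value of 0.6 are all assumptions made for this example.

```python
from typing import Callable, List, Optional, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (hypothetical stand-in for an embedding matcher)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def answer(
    query: str,
    qa_bank: List[Tuple[str, str]],
    fallback: Callable[[str], str],
    threshold: float = 0.6,  # assumed cut-off, not taken from the paper
) -> Tuple[str, str]:
    """Return (answer, source): a stored answer on a match, else a generated one."""
    best_score = 0.0
    best_answer: Optional[str] = None
    for question, stored_answer in qa_bank:
        score = jaccard(query, question)
        if score > best_score:
            best_score, best_answer = score, stored_answer
    if best_answer is not None and best_score >= threshold:
        return best_answer, "qa_bank"
    # No suitable pre-generated QA pair: fall back to on-the-fly generation.
    return fallback(query), "generated"


bank = [("what is the refund policy", "Refunds are accepted within 30 days.")]
generate = lambda q: "[generated answer for: " + q + "]"
hit = answer("what is the refund policy", bank, generate)
miss = answer("how do I reset my password", bank, generate)
```

In this sketch the expensive path (`fallback`) runs only on a miss, which is the source of the latency benefit the abstract claims; how the real system sets its matching threshold is not stated in the abstract.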

[2] Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

Max Zhang,Derek Liu,Kai Zhang,Joshua Franco,Haihao Liu

Main category: cs.CL

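TL;DR: This paper examines knowledge distillation (KD) for multilingual jailbreak defense and finds that standard fine-tuning actually increases the Jailbreak Success Rate (JSR); removing ambiguous "boundary" refusals mitigates the safety degradation, but reasoning ability still declines.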

Details Motivation: Safety alignment of large language models (LLMs) is currently English-centric, which leaves vulnerabilities in low-resource language settings and creates a pressing need for multilingual safety alignment methods. Method: Using black-box, response-based knowledge distillation with LoRA parameter-efficient fine-tuning (PEFT), the refusal behavior of the OpenAI o1-mini teacher model is distilled into three open-source student models (Llama-3, Gemma-2, Qwen3) using about 28,000 multilingual jailbreak prompts from XSafety. Result: On the MultiJail benchmark, standard fine-tuning raises the JSR of all student models, by up to 16.6 percentage points; removing "boundary" refusals mitigates or reverses the safety degradation, but GSM8K reasoning performance still drops. Conclusion: KD shows potential for multilingual safety alignment but also faces challenges; further research is needed on how to balance safety and reasoning ability. Abstract: Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's "safe" refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, by up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced "boundary" refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.

[3] Retrieval Heads are Dynamic

Yuping Lin,Zitao Li,Yue Xing,Pengfei He,Yingqian Cui,Yaliang Li,Bolin Ding,Jingren Zhou,Jiliang Tang

Main category: cs.CL

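TL;DR: This paper studies retrieval heads in large language models (LLMs) from a dynamic perspective, finding that they vary across timesteps during autoregressive generation, cannot be replaced by static heads, and that hidden states can predict future retrieval patterns, which points to an internal planning mechanism.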

Details Motivation: Prior work mostly uses static statistics to identify heads that perform retrieval on average, overlooking the fine-grained temporal dynamics of retrieval behavior during autoregressive generation. Method: Through extensive analysis on the Needle-in-a-Haystack and multi-hop QA tasks, the paper proposes and validates three core claims about the dynamism, irreplaceability, and predictive correlation of retrieval heads, and quantitatively compares the utility of dynamic and static retrieval heads in a Dynamic Retrieval-Augmented Generation framework. Result: Retrieval heads are confirmed to vary across timesteps; dynamic retrieval heads are specific to each timestep and cannot be effectively replaced by static heads; hidden states carry a predictive signal for future retrieval head patterns. Conclusion: LLMs contain a time-sensitive retrieval mechanism and a latent planning capability, which challenges the static view that attributes retrieval to fixed heads and offers new insight into how these models work. Abstract: Recent studies have identified "retrieval heads" in Large Language Models (LLMs) responsible for extracting information from input contexts. However, prior works largely rely on static statistics aggregated across datasets, identifying heads that perform retrieval on average. This perspective overlooks the fine-grained temporal dynamics of autoregressive generation. In this paper, we investigate retrieval heads from a dynamic perspective. Through extensive analysis, we establish three core claims: (1) Dynamism: Retrieval heads vary dynamically across timesteps; (2) Irreplaceability: Dynamic retrieval heads are specific at each timestep and cannot be effectively replaced by static retrieval heads; and (3) Correlation: The model's hidden state encodes a predictive signal for future retrieval head patterns, indicating an internal planning mechanism. We validate these findings on the Needle-in-a-Haystack task and a multi-hop QA task, and quantify the differences in the utility of dynamic and static retrieval heads in a Dynamic Retrieval-Augmented Generation framework. Our study provides new insights into the internal mechanisms of LLMs.

[4] Nested Named Entity Recognition in Plasma Physics Research Articles

Muhammad Haris,Hans Höft,Markus M. Becker,Markus Stocker

Main category: cs.CL

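TL;DR: This paper proposes a lightweight nested named entity recognition (NER) approach based on encoder-transformers and conditional random fields (CRF), tailored to plasma physics research articles; by building a corpus annotated with 16 classes, specializing models per entity type, and optimizing hyperparameters, it improves recognition of domain-specific entities.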

Details Motivation: Plasma physics research articles are highly complex and context-rich, and existing general-purpose NER methods struggle to extract their specialized entities; a domain-specific solution is needed to support advanced search and literature analysis. Method: Build a plasma physics corpus with 16 nested entity classes; train multiple independent BERT-CRF models, each recognizing one entity type (entity-specific model specialization); and apply systematic hyperparameter optimization to improve model performance. Result: The approach performs well on nested NER in plasma physics text, supporting the effectiveness of lightweight, specialized modeling and optimization. Conclusion: The work advances named entity recognition in scientific domains (plasma physics in particular) and provides a scalable technical foundation for researchers to navigate and analyze specialized literature efficiently. Abstract: Named Entity Recognition (NER) is an important task in natural language processing that aims to identify and extract key entities from unstructured text. We present a novel application of NER in plasma physics research articles and address the challenges of extracting specialized entities from scientific text in this domain. Research articles in plasma physics often contain highly complex and context-rich content that must be extracted to enable, e.g., advanced search. We propose a lightweight approach based on encoder-transformers and conditional random fields to extract (nested) named entities from plasma physics research articles. First, we annotate a plasma physics corpus with 16 classes specifically designed for the nested NER task. Second, we evaluate an entity-specific model specialization approach, where independent BERT-CRF models are trained to recognize individual entity types in plasma physics text. Third, we integrate an optimization process to systematically fine-tune hyperparameters and enhance model performance. Our work contributes to the advancement of entity recognition in plasma physics and also provides a foundation to support researchers in navigating and analyzing scientific literature.

[5] Assessing LLM Reliability on Temporally Recent Open-Domain Questions

Pushwitha Krishnappa,Amit Das,Vinija Jain,Tathagata Mukherjee,Aman Chadha

Main category: cs.CL

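TL;DR: This paper introduces the RECOM benchmark dataset to evaluate how well open-source LLMs' answers to recent Reddit questions align with human consensus; it finds a "semantic-lexical paradox" of high semantic similarity but very low lexical overlap, and argues that lexical-matching metrics alone are unreliable and that a multi-dimensional evaluation framework is needed.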

Details Motivation: Large language models are widely used for open-domain question answering, but how well they agree with human views on time-sensitive information (such as recent Reddit questions) has not been adequately studied. Method: Build the RECOM dataset of 15,000 Reddit questions from September 2025 with community reference answers; evaluate the answer alignment of four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) using BLEU, ROUGE, BERTScore, MoverScore, cosine similarity, and NLI. Result: All models exceed 99% cosine similarity but stay below 8% BLEU-1, a marked semantic-lexical paradox; MoverScore sits in between (51-53%), reflecting the optimal transport cost of semantic alignment; the larger GPT-OSS-20B underperforms Mistral-7B; NLI shows contradiction rates below 7%. Conclusion: Lexical-matching metrics (such as BLEU) cannot reliably evaluate abstractive generation quality, and multi-dimensional evaluation frameworks that combine semantics, logic, and structure should be used; the RECOM dataset is publicly released. Abstract: Large Language Models (LLMs) are increasingly deployed for open-domain question answering, yet their alignment with human perspectives on temporally recent information remains underexplored. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a benchmark dataset of 15,000 recent Reddit questions from September 2025 paired with community-derived reference answers. We investigate how four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) respond to these questions, evaluating alignment using lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). Our central finding is a striking semantic-lexical paradox: all models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap, a 90+ percentage point gap indicating that models preserve meaning through extensive paraphrasing rather than lexical reproduction. MoverScore (51-53%) confirms this pattern, occupying an intermediate position that reflects the optimal transport cost of semantic alignment. Furthermore, model scale does not predict performance: Mistral-7B (7B parameters) outperforms GPT-OSS-20B (20B parameters) across all metrics. NLI analysis reveals that contradiction rates remain below 7%, suggesting models rarely generate content that directly conflicts with human consensus.
These findings challenge the reliability of lexical metrics for evaluating abstractive generation and argue for multi-dimensional evaluation frameworks that capture semantic fidelity beyond surface-level text matching. The RECOM dataset is publicly available at https://anonymous.4open.science/r/recom-D4B0
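To see why paraphrased answers score so low on BLEU-1, it helps to look at the clipped unigram precision at its core. The sketch below is a simplified illustration only: it uses whitespace tokenization and omits BLEU's brevity penalty, and it is not the evaluation code used for RECOM. The function name `unigram_precision` is an assumption of this example.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision, the core of BLEU-1 (no brevity penalty)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate word is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return matched / len(cand_tokens)


# A faithful paraphrase shares almost no surface tokens with the reference.
paraphrase_score = unigram_precision("purchase a vehicle", "buy a car")
```

Here the paraphrase matches only the word "a", so its score is one third even though the meaning is preserved; an embedding-based similarity would rate the same pair much higher, which is the gap the abstract describes.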

[6] Small Updates, Big Doubts: Does Parameter-Efficient Fine-tuning Enhance Hallucination Detection?

Xu Hu,Yifan Zhang,Songtao Wei,Chen Zhao,Qiannan Li,Bingzhe Li,Feng Chen

Main category: cs.CL

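TL;DR: This paper systematically studies how parameter-efficient fine-tuning (PEFT) affects hallucination detection in large language models, finding that PEFT substantially improves the AUROC of a range of unsupervised hallucination detectors, mainly by reshaping the model's uncertainty representations rather than by injecting new factual knowledge.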

Details Motivation: Although PEFT is widely used to adapt large language models and is often assumed to improve factual correctness, its effect on hallucination behavior (especially in QA tasks) remains unclear. Method: Across three open-weight LLMs and three fact-seeking QA benchmarks, systematically evaluate seven unsupervised hallucination detection methods spanning three paradigms (semantic consistency, confidence, and entropy), and analyze PEFT's mechanism with linear probes and representation diagnostics. Result: PEFT consistently strengthens hallucination detection, substantially improving AUROC across detectors; further analysis indicates that PEFT mainly changes how uncertainty is encoded and surfaced rather than injecting new factual knowledge. Conclusion: PEFT improves hallucination detection by reshaping uncertainty representations, a finding that offers a new perspective for understanding PEFT and for optimizing hallucination mitigation strategies. Abstract: Parameter-efficient fine-tuning (PEFT) methods are widely used to adapt large language models (LLMs) to downstream tasks and are often assumed to improve factual correctness. However, how PEFT methods affect hallucination behavior remains insufficiently understood, especially on QA datasets. In this work, we systematically investigate the impact of PEFT on hallucination detection through a comprehensive empirical study across three open-weight LLM backbones and three fact-seeking QA benchmarks. For each model, we evaluate performance using seven unsupervised hallucination detection methods spanning three complementary approaches: semantic-consistency-based detectors, confidence-based detectors, and entropy-based detectors. This multifaceted evaluation enables us to characterize how PEFT reshapes uncertainty across different detection paradigms. Our experimental results show that PEFT consistently strengthens hallucination detection ability, substantially improving AUROC across a wide range of hallucination detectors. In addition, further analyses using linear probes and representation diagnostics indicate that PEFT methods primarily reshape how uncertainty is encoded and surfaced, rather than injecting new factual knowledge into the models.
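Since the study reports its results as AUROC, a small worked example may help. AUROC is the probability that a randomly chosen hallucinated answer receives a higher detector score than a randomly chosen correct one, with ties counted as half. The pairwise sketch below is generic; the convention that label 1 means "hallucinated" and the toy scores are assumptions of this example, not data from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy detector scores (e.g. an entropy-style uncertainty); 1 = hallucinated answer.
perfect = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

A detector that ranks every hallucinated answer above every correct one scores 1.0, and a detector no better than chance scores about 0.5. The quadratic loop is fine for illustration; a rank-based formulation would be the usual choice at scale.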

[7] Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering

Nathan Mao,Varun Kaushik,Shreya Shivkumar,Parham Sharafoleslami,Kevin Zhu,Sunishchal Dev

Main category: cs.CL

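TL;DR: This paper introduces the FalseCite dataset for systematically evaluating LLM hallucination under misleading citations, and analyzes hidden states to reveal underlying patterns.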

Details Motivation: Large language models (LLMs) often hallucinate, which is especially harmful in sensitive fields such as medicine and law; the hallucination mechanism needs systematic study and a targeted benchmark. Method: Build the FalseCite dataset, which contains false or misleading citations that induce hallucination; test GPT-4o-mini, Falcon-7B, and Mistral-7B on it; analyze and visualize the models' hidden state vectors. Result: False citations noticeably raise hallucination rates (especially for GPT-4o-mini); hidden state vectors trace a horn-like shape whether or not the model is hallucinating. Conclusion: FalseCite can serve as a benchmark for evaluating and mitigating LLM hallucination, and the geometry of hidden states offers a new angle on the hallucination mechanism. Abstract: Large Language Models (LLMs) often hallucinate, generating nonsensical or false information that can be especially harmful in sensitive fields such as medicine or law. To study this phenomenon systematically, we introduce FalseCite, a curated dataset designed to capture and benchmark hallucinated responses induced by misleading or fabricated citations. Running GPT-4o-mini, Falcon-7B, and Mistral-7B through FalseCite, we observed a noticeable increase in hallucination activity for false claims with deceptive citations, especially in GPT-4o-mini. Using the responses from FalseCite, we can also analyze the internal states of hallucinating models, visualizing and clustering the hidden state vectors. From this analysis, we noticed that the hidden state vectors, regardless of hallucination or non-hallucination, tend to trace out a distinct horn-like shape. Our work underscores FalseCite's potential as a foundation for evaluating and mitigating hallucinations in future LLM research.

[8] Enhancing SDG-Text Classification with Combinatorial Fusion Analysis and Generative AI

Jingyan Xu,Marcelo L. LaFleur,Christina Schweikert,D. Frank Hsu

Main category: cs.CL

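TL;DR: This paper proposes an approach based on Combinatorial Fusion Analysis (CFA) that combines multiple AI models with human expert knowledge to improve text classification for the UN Sustainable Development Goals (SDGs), reaching 96.73% performance, outperforming the best individual model, with outcomes compared against those of human domain experts.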

Details Motivation: SDG text classification faces ambiguous and interrelated categories and scarce annotations, which calls for more robust, interpretable multi-source fusion methods. Method: Use generative AI to build synthetic training data, integrate multiple text classification models, and fuse them under the Combinatorial Fusion Analysis (CFA) framework using the rank-score characteristic (RSC) function and cognitive diversity (CD); results from human domain experts are brought in for comparison and combination. Result: The CFA fusion reaches 96.73% on the SDG text classification task, outperforming the best individual model, and shows complementary and mutually enhancing effects with human expert results. Conclusion: Multi-model fusion (CFA) combined with human expertise can improve the accuracy and trustworthiness of classifying semantically complex text, offering a new paradigm for high-uncertainty NLP tasks such as social analysis. Abstract: Natural Language Processing (NLP) techniques such as text classification and topic discovery are very useful in many application areas including information retrieval, knowledge discovery, policy formulation, and decision-making. However, classification remains a challenging problem in cases where the categories are unavailable, difficult to differentiate, or interrelated. Social analysis with human context is an area that can benefit from text classification, as it relies substantially on text data. The focus of this paper is to enhance the classification of text according to the UN's Sustainable Development Goals (SDGs) by collecting and combining intelligence from multiple models. Combinatorial Fusion Analysis (CFA), a system fusion paradigm using a rank-score characteristic (RSC) function and cognitive diversity (CD), has been used to enhance classifier methods by combining a set of relatively good and mutually diverse classification models. We use a generative AI model to generate synthetic data for model training and then apply CFA to this classification task. The CFA technique achieves 96.73% performance, outperforming the best individual model. We compare the outcomes with those obtained from human domain experts. It is demonstrated that combining intelligence from multiple ML/AI models using CFA and getting input from human experts can not only complement but also enhance each other.
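The two CFA ingredients named in the abstract can be sketched in a few lines. The abstract does not give formulas, so the definitions below are assumptions based on one common formulation: the RSC function maps each rank to the score held at that rank, and cognitive diversity is the root-mean-square gap between two systems' RSC functions. Scores are assumed to be normalized to a common range, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def rsc(scores: Sequence[float]) -> List[float]:
    """Rank-score characteristic: score as a function of rank (index 0 = best rank)."""
    return sorted(scores, reverse=True)


def cognitive_diversity(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """RMS difference between two systems' RSC functions (assumed formulation)."""
    fa, fb = rsc(scores_a), rsc(scores_b)
    if len(fa) != len(fb):
        raise ValueError("both systems must score the same number of items")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa))


def score_combination(*systems: Sequence[float]) -> List[float]:
    """Average the per-item scores of several systems (simple score fusion)."""
    n_items = len(systems[0])
    return [sum(s[i] for s in systems) / len(systems) for i in range(n_items)]


# Two toy classifiers scoring the same two candidate labels.
model_a = [1.0, 0.0]
model_b = [0.5, 0.5]
diversity = cognitive_diversity(model_a, model_b)
fused = score_combination(model_a, model_b)
```

The intuition from the abstract is that fusion helps most when the combined models are individually good and mutually diverse; a diversity measure like this one gives a way to quantify the second condition before deciding which models to combine.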

[9] Disentangling Direction and Magnitude in Transformer Representations: A Double Dissociation Through L2-Matched Perturbation Analysis

Mangadoddi Srikar Vardhan,Lekkala Sai Teja

Main category: cs.CL

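TL;DR: This paper finds that the direction (angle) and magnitude (norm) of Transformer hidden states play different functional roles in language modeling and syntactic processing: direction perturbations do more damage to language modeling loss, while magnitude perturbations do more damage to syntactic tasks such as subject-verb agreement; the dissociation depends on LayerNorm and is not pronounced under RMSNorm.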

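Details Motivation: To investigate whether direction (vector orientation) and magnitude (vector length) in Transformer hidden states serve different computational functions, moving beyond the usual assumption that treats a hidden state as a single unified high-dimensional representation. Method: On Pythia-family models, apply L2-matched perturbation analysis (ensuring that angular and magnitude perturbations have identical Euclidean displacement), combine it with causal interventions (such as repairing the attention or LayerNorm pathway) to quantify the pathways through which each perturbation acts, and test generalization across model scales and architectures (LayerNorm vs. RMSNorm). Result: Angular perturbations increase language modeling loss by up to 42.9x, while magnitude perturbations reduce subject-verb agreement accuracy by 20.4% (versus 1.6% for angular perturbations); angular damage flows mainly through the attention pathway (repairing attention recovers 28.4% of the loss), and magnitude damage flows partly through the LayerNorm pathway (repairing LayerNorm recovers 29.9%); the dissociation is robust in LayerNorm models but disappears in RMSNorm models. Conclusion: Direction and magnitude of hidden states carry partially separate computational functions in Transformers: direction dominates attentional routing, while magnitude modulates processing intensity for fine-grained syntactic judgments; this division of labor depends on specific normalization designs such as LayerNorm, challenges and refines the linear representation hypothesis, and informs model editing and interpretability research. Abstract: Transformer hidden states encode information as high-dimensional vectors, yet whether direction (orientation in representational space) and magnitude (vector norm) serve distinct functional roles remains unclear. Studying Pythia-family models, we discover a striking cross-over dissociation: angular perturbations cause up to 42.9x more damage to language modeling loss, while magnitude perturbations cause disproportionately more damage to syntactic processing (20.4% vs. 1.6% accuracy drop on subject-verb agreement). This finding is enabled by L2-matched perturbation analysis, a methodology ensuring that angular and magnitude perturbations achieve identical Euclidean displacements. Causal intervention reveals that angular damage flows substantially through the attention pathways (28.4% loss recovery via attention repair), while magnitude damage flows partly through the LayerNorm pathways (29.9% recovery via LayerNorm repair). These patterns replicate across scales within the Pythia architecture family. These findings provide evidence that direction and magnitude support partially distinct computational roles in LayerNorm-based architectures. Direction preferentially affects attentional routing, while magnitude modulates processing intensity for fine-grained syntactic judgments. We find different patterns in RMSNorm-based architectures, suggesting that the dissociation depends on architectural choices.
Our results refine the linear representation hypothesis and have implications for model editing and interpretability research.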
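The L2-matching constraint can be made concrete with a little geometry. For a hidden vector h and a target displacement δ, scaling h by (1 + δ/‖h‖) moves it by exactly δ along its own direction, while rotating it by θ = 2·arcsin(δ / (2‖h‖)) at fixed norm moves it by 2‖h‖·sin(θ/2) = δ. The sketch below implements both under those formulas; it is an illustration of the constraint only, and the choice of rotation plane (toward an arbitrary `direction` vector) and the function names are assumptions, not the paper's procedure.

```python
import math
from typing import List, Sequence


def norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def displacement(a: Sequence[float], b: Sequence[float]) -> float:
    return norm([x - y for x, y in zip(a, b)])


def magnitude_perturb(h: Sequence[float], delta: float) -> List[float]:
    """Rescale h along its own direction so that ||h' - h|| == delta."""
    scale = 1.0 + delta / norm(h)
    return [scale * x for x in h]


def angular_perturb(h: Sequence[float], direction: Sequence[float], delta: float) -> List[float]:
    """Rotate h toward `direction` at fixed norm so that ||h' - h|| == delta."""
    n = norm(h)
    if delta > 2.0 * n:
        raise ValueError("a pure rotation cannot move h by more than 2 * ||h||")
    # Gram-Schmidt: keep only the part of `direction` orthogonal to h.
    dot = sum(d * x for d, x in zip(direction, h))
    u = [d - (dot / (n * n)) * x for d, x in zip(direction, h)]
    u_norm = norm(u)
    if u_norm == 0.0:
        raise ValueError("direction must not be parallel to h")
    u = [x / u_norm for x in u]
    theta = 2.0 * math.asin(delta / (2.0 * n))
    return [math.cos(theta) * x + n * math.sin(theta) * y for x, y in zip(h, u)]


h = [3.0, 4.0]
h_mag = magnitude_perturb(h, 1.0)
h_ang = angular_perturb(h, [1.0, 0.0], 1.0)
```

Both perturbed vectors sit at the same Euclidean distance from `h`, but only `h_mag` changes the norm and only `h_ang` changes the orientation, which is what lets the two effects be compared on equal footing.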

[10] PRIME: Policy-Reinforced Iterative Multi-agent Execution for Algorithmic Reasoning in Large Language Models

Jiawei Xu,Zhenyu Yu,Ziqian Bi,Minh Duc Pham,Xiaoyi Qu,Danyang Zhang

Main category: cs.CL

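TL;DR: This paper proposes PRIME, a framework in which three specialized agents (executor, verifier, and coordinator) work together and are optimized with group relative policy optimization, substantially improving large language model performance on algorithmic reasoning. On the newly built PRIME-Bench, average accuracy rises from 26.8% to 93.8%, with the largest gains on tasks that require sustained state tracking; ablations identify iterative verification as the key mechanism, and smaller models benefit more.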

Details Motivation: Large language models perform poorly on algorithmic reasoning, and a method is needed that supports complex, multi-step, constrained algorithmic reasoning. Method: Propose the PRIME framework, comprising an executor (step-by-step reasoning), a verifier (constraint checking), and a coordinator (backtracking control), jointly trained with Group Relative Policy Optimization; also build PRIME-Bench, a large-scale algorithmic reasoning benchmark, for evaluation. Result: On PRIME-Bench, average accuracy rises from 26.8% to 93.8% (a 250% relative gain); Turing machine simulation improves from 9% to 92% and long division from 16% to 94%; iterative verification is identified as the main source of the gains; smaller models (from 8B) reach accuracy comparable to models 8x larger. Conclusion: Through multi-agent collaboration and iterative verification, PRIME substantially eases the algorithmic reasoning bottleneck of large language models, in particular by limiting error propagation, and it scales across model sizes. Abstract: Large language models have demonstrated remarkable capabilities across diverse reasoning tasks, yet their performance on algorithmic reasoning remains limited. To address this limitation, we propose PRIME (Policy-Reinforced Iterative Multi-agent Execution), a framework comprising three specialized agents: an executor for step-by-step reasoning, a verifier for constraint checking, and a coordinator for backtracking control, all optimized through group relative policy optimization. For comprehensive evaluation, we introduce PRIME-Bench, the largest algorithmic reasoning benchmark to date, comprising 86 tasks across 12 categories with 51,600 instances. Tasks span sorting algorithms, graph and tree structures, automata and state machines, symbolic reasoning, and constraint-based puzzles, with execution traces reaching over one million steps. Compared to the baseline approach, PRIME improves average accuracy from 26.8% to 93.8%, a 250% relative gain. The largest improvements occur on tasks requiring sustained state tracking, with Turing machine simulation improving from 9% to 92% and long division from 16% to 94%. Ablation studies identify iterative verification as the primary contributor, preventing the error propagation that causes baseline approaches to fail catastrophically. Analysis across model scales (8B-120B parameters) reveals that smaller models benefit disproportionately, achieving accuracy comparable to models 8x larger.
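The role of iterative verification can be illustrated with a toy version of one benchmark task, long division. In the sketch below a deliberately unreliable executor proposes each quotient digit, a verifier checks the division invariant, and a coordinator loop retries a rejected step before moving on. This is a hand-written illustration and not the PRIME system: the real agents are LLM policies trained with group relative policy optimization, the retry here is a simplification of backtracking, and `flaky_executor`, `verify`, and `max_retries` are all hypothetical names.

```python
from typing import Callable, Tuple

Step = Tuple[int, int]  # (quotient digit, remainder) proposed for one position


def verify(partial: int, divisor: int, step: Step) -> bool:
    """Verifier: check the long-division invariant for a single step."""
    digit, rem = step
    return 0 <= digit <= 9 and 0 <= rem < divisor and digit * divisor + rem == partial


def long_division(
    dividend: int,
    divisor: int,
    executor: Callable[[int, int, int], Step],
    max_retries: int = 3,
) -> Tuple[int, int]:
    """Coordinator: accept a step only once verified, retrying on rejection."""
    quotient, rem = 0, 0
    for ch in str(dividend):
        partial = rem * 10 + int(ch)
        for attempt in range(max_retries + 1):
            step = executor(partial, divisor, attempt)
            if verify(partial, divisor, step):
                break
        else:
            raise RuntimeError("no verified step within the retry budget")
        digit, rem = step
        quotient = quotient * 10 + digit
    return quotient, rem


def flaky_executor(partial: int, divisor: int, attempt: int) -> Step:
    """Toy executor: wrong on its first try whenever the partial dividend is odd."""
    digit, rem = divmod(partial, divisor)
    if attempt == 0 and partial % 2 == 1:
        return digit + 1, rem  # deliberately incorrect proposal
    return digit, rem


result = long_division(9876, 7, flaky_executor)
```

Without the verifier, a single wrong digit would corrupt the remainder and every later step, which is the error propagation the abstract attributes to baseline approaches. With it, each faulty proposal is caught locally and the final result matches `divmod(9876, 7)`.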

[11] Efficient Hyper-Parameter Search for LoRA via Language-aided Bayesian Optimization

Baek Seong-Eun,Lee Jung-Mok,Kim Sung-Bin,Tae-Hyun Oh

Main category: cs.CL

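TL;DR: This paper proposes a new framework that integrates the domain knowledge of large language models (LLMs) into Bayesian Optimization (BO) to search efficiently for LoRA fine-tuning hyperparameters, combining natural language prompting, a learnable token, and subset-based proxy training to substantially improve search efficiency and performance.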

Details Motivation: LoRA fine-tuning is efficient but sensitive to hyperparameters, and exhaustive search is computationally expensive; a smarter, more efficient hyperparameter search method that incorporates domain knowledge is needed. Method: Repurpose a pre-trained LLM as a mapping from discrete hyperparameters to a continuous vector space, inject knowledge about LoRA hyperparameters through a domain-aware textual prompt, and add a learnable token to capture residual information that is hard to describe in language; exploiting the strong correlation between performance on full and subset data, use subset-based proxy training and evaluation to speed up BO. Result: In only about 30 iterations, the method finds a configuration that outperforms standard hyperparameters found from about 45,000 combinations, improving task performance by more than 20%. Conclusion: Incorporating LLMs into BO as structured knowledge encoders is an effective paradigm for improving the efficiency and quality of LoRA hyperparameter optimization, and offers a new approach to LLM personalization under limited resources. Abstract: Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enables resource-efficient personalization or specialization, but it comes at the expense of additional hyperparameter tuning. Although LoRA makes fine-tuning efficient, it is highly sensitive to the choice of hyperparameters, and exhaustive hyperparameter search is still computationally very demanding. To address these challenges, we propose a framework that integrates the domain knowledge of pre-trained LLMs into Bayesian Optimization (BO) to efficiently search for LoRA hyperparameters. To leverage the informed knowledge of LLMs, we repurpose LLMs as a discrete-to-continuous mapping to link the hyperparameters and their domain knowledge with a continuous vector space, where BO is conducted. We design and control the mapping by language prompting, where we provide a domain-aware textual prompt describing the relationships among hyperparameters and their respective roles; thereby, we explicitly inject domain knowledge about LoRA into the LLM in natural language. Also, we model the residual information that is hard to linguistically describe in the prompt with an additional learnable token. This helps BO sample more high-performing hyperparameters. In addition, by leveraging the observation of the strong correlation between the respective performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation with a data subset. This further increases the efficiency of our method.
We demonstrate that the hyperparameters found by our method in only about 30 iterations achieve more than 20% performance improvement over standard hyperparameters found from about 45,000 combinations.

[12] Synthesizing the Virtual Advocate: A Multi-Persona Speech Generation Framework for Diverse Linguistic Jurisdictions in Indic Languages

Aniket Deroy

Main category: cs.CL

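TL;DR: This study evaluates the Gemini 2.5 Flash and Pro TTS models at generating courtroom speeches in five Indic languages, finding that they convey procedural information accurately but fall short on authority, emotional intensity, and dynamic vocal modulation, with weaker performance in Bengali and Gujarati.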

Details Motivation: Legal advocacy requires an authoritative tone, rhythmic pausing, and emotional intelligence, yet current multilingual TTS still struggles to reproduce the persuasive vocal craft of human lawyers across India's diverse languages. Method: Propose a prompting framework that uses Gemini 2.5's native support for the five languages and its context-aware pacing to produce distinct advocate personas, and evaluate the synthesized speech for authority, rhythm, and emotional expression. Result: The models exhibit a "monotone authority": they are good at delivering procedural content but lack dynamic vocal modulation and emotive gravitas; performance drops noticeably in Bengali and Gujarati. Conclusion: Multilingual TTS is ready for procedural legal tasks but remains well short of the higher-order persuasive speech of human lawyers; further progress in prosody and emotion modeling is needed. Abstract: Legal advocacy requires a unique combination of authoritative tone, rhythmic pausing for emphasis, and emotional intelligence. This study investigates the performance of the Gemini 2.5 Flash TTS and Gemini 2.5 Pro TTS models in generating synthetic courtroom speeches across five Indic languages: Tamil, Telugu, Bengali, Hindi, and Gujarati. We propose a prompting framework that utilizes Gemini 2.5's native support for 5 languages and its context-aware pacing to produce distinct advocate personas. The evolution of Large Language Models (LLMs) has shifted the focus of Text-to-Speech (TTS) technology from basic intelligibility to context-aware, expressive synthesis. In the legal domain, synthetic speech must convey authority and a specific professional persona, a task that becomes significantly more complex in the linguistically diverse landscape of India. The models exhibit a "monotone authority," excelling at procedural information delivery but struggling with the dynamic vocal modulation and emotive gravitas required for persuasive advocacy. Performance dips in Bengali and Gujarati further highlight phonological frontiers for future refinement. This research underscores the readiness of multilingual TTS for procedural legal tasks while identifying the remaining challenges in replicating the persuasive artistry of human legal discourse. The code is available at https://github.com/naturenurtureelite/Synthesizing-the-Virtual-Advocate/tree/main

[13] Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review

Qian Ruan,Iryna Gurevych

Main category: cs.CL

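TL;DR: This paper proposes REspGen, a framework that recasts author response generation as an author-in-the-loop task combining author input, multi-attribute control, and evaluation-guided refinement; it also builds Re^3Align, the first large-scale dataset of review-response-revision triplets, and the accompanying evaluation suite REspEval.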

Details Motivation: Existing automatic response generation methods overlook key intent signals such as the author's expertise, author-only information, and revision strategies, so they support response writing in real peer review poorly. Method: Propose an author-in-the-loop paradigm for response generation; design the REspGen framework (explicit author input interface, multi-attribute controllable generation, evaluation-guided refinement); build the Re^3Align triplet dataset; develop REspEval, a multi-dimensional evaluation suite with 20+ metrics. Result: Experiments confirm the benefits of author input and evaluation-guided refinement, and reveal the impact of input design on response quality as well as trade-offs between controllability and quality; all resources are open-sourced. Conclusion: Modeling author expertise and intent explicitly as core components of the generation process improves the realism, controllability, and usefulness of response generation, advancing the practical use of NLP in scientific peer review support. Abstract: Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. Recent work frames this task as automatic text generation, underusing author expertise and intent. In practice, authors possess domain expertise, author-only information, revision and response strategies--concrete forms of author expertise and intent--to address reviewer concerns, and seek NLP assistance that integrates these signals to support effective response writing in peer review. We reformulate author response generation as an author-in-the-loop task and introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement, together with REspEval, a comprehensive evaluation suite with 20+ metrics covering input utilization, controllability, response quality, and discourse. To support this formulation, we construct Re$^3$Align, the first large-scale dataset of aligned review--response--revision triplets, where revisions provide signals of author expertise and intent. Experiments with state-of-the-art LLMs show the benefits of author input and evaluation-guided refinement, the impact of input design on response quality, and trade-offs between controllability and quality. We make our dataset, generation and evaluation tools publicly available.

[14] The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models

Aradhya Dixit,Shreem Dixit

Main category: cs.CL

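TL;DR: This paper shows that tokenizers in pretrained multilingual models impose a systematic cost on certain writing systems, a "script tax"; by comparing orthographic variants with identical linguistic content, it finds that the higher-fragmentation orthography yields substantially more tokens, slower inference, and higher information cost, which highlights the role of tokenization in multilingual NLP fairness and motivates script-aware tokenization and pretraining.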

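Details Motivation: Pretrained multilingual language models are often assumed to be script-agnostic, but their tokenizers may place a systematic burden on particular writing systems; this paper aims to quantify that "script tax" and its effects. Method: Compare two text variants with identical linguistic content but different orthographies, measuring token counts (fertility), inference speed, and information cost in bits per character (BPC) on mBERT and XLM-R, with a round-trip conversion error rate (CER_rt) used to check the source of the differences. Result: The higher-fragmentation orthography shows about 3.4x more tokens per word, a 16.5x inference slowdown, and BPC increases of 19.7% (mBERT) and 47.1% (XLM-R); a round-trip conversion error rate of CER_rt=0.31 suggests the gaps stem from orthography-conditioned processing rather than mapping noise. Conclusion: Tokenization is a major source of inequity in multilingual NLP, and script-aware tokenization strategies and pretraining methods are needed to improve fairness and efficiency. Abstract: Pretrained multilingual language models are often assumed to be script-agnostic, yet their tokenizers can impose systematic costs on certain writing systems. We quantify this script tax by comparing two orthographic variants with identical linguistic content. Across mBERT and XLM-R, the higher-fragmentation orthography shows a ~3.4x increase in fertility (6.73-6.85 vs. 2.10-2.35 tokens/word), leading to a 16.5x inference slowdown (0.23 vs. 3.8 sentences/second) on identical hardware. Using bits per character (BPC) to avoid the "NLL paradox" from subword fragmentation, we find a substantial increase in information cost: +19.7% for mBERT (8.06->9.65) and +47.1% for XLM-R (12.19->17.94). A round-trip conversion check (CER_rt=0.31) suggests these gaps reflect orthography-conditioned processing rather than mapping noise. Our results highlight tokenization as a key source of inequity in multilingual NLP and motivate script-aware tokenization and pretraining.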
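The quantities in this abstract are simple ratios, and a short sketch makes them easy to check. Fertility is tokens per word; bits per character divides the total negative log-likelihood by the character count, which stays comparable across tokenizers because the denominator does not depend on how text is split. The helper names below are assumptions of this example, and the input figures are the rounded values quoted in the abstract.

```python
import math


def fertility(num_tokens: int, num_words: int) -> float:
    """Average number of subword tokens produced per word."""
    return num_tokens / num_words


def bits_per_character(total_nll_nats: float, num_chars: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per character."""
    return total_nll_nats / (math.log(2) * num_chars)


def relative_increase_pct(old: float, new: float) -> float:
    """Percentage increase from old to new."""
    return 100.0 * (new - old) / old


# Figures quoted in the abstract (already rounded there).
slowdown = 3.8 / 0.23                            # sentences/second ratio, about 16.5x
mbert_bpc = relative_increase_pct(8.06, 9.65)    # about 19.7%
xlmr_bpc = relative_increase_pct(12.19, 17.94)   # about 47.2% from the rounded inputs
```

Recomputing from the rounded BPC values gives roughly 47.2% for XLM-R, marginally above the reported +47.1%; the small gap is presumably because the paper computed the percentage from unrounded measurements.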

[15] Barriers to Discrete Reasoning with Transformers: A Survey Across Depth, Exactness, and Bandwidth

Michelle Yuan,Weiyi Sun,Amir H. Rezaeian,Jyotika Singh,Sandip Ghoshal,Yao-Ting Wang,Miguel Ballesteros,Yassine Benajiba

Main category: cs.CL

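TL;DR: This survey reviews the theoretical limitations of Transformers on discrete reasoning tasks (such as arithmetic, logical inference, and algorithmic composition), systematically analyzing their structural and computational barriers from the perspectives of circuit complexity, approximation theory, and communication complexity, and discussing directions for improvement.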

Details Motivation: Transformers excel at sequence modeling but face fundamental theoretical bottlenecks on discrete reasoning tasks that require exact symbolic computation, and the reasons need to be clarified at the theoretical level. Method: Synthesize and unify studies from three theoretical perspectives (circuit complexity, approximation theory, and communication complexity), supported by key definitions, seminal results, and illustrative examples. Result: Three root causes are identified for why Transformers struggle to implement exact discrete algorithms: depth constraints, difficulty approximating discontinuous functions, and bottlenecks in inter-token communication. Conclusion: Current Transformers are good at pattern matching and interpolation, but their architecture inherently limits support for symbolic, deterministic computation; future model designs need to move beyond the current structural paradigm to overcome these foundational limitations. Abstract: Transformers have become the foundational architecture for a broad spectrum of sequence modeling applications, underpinning state-of-the-art systems in natural language processing, vision, and beyond. However, their theoretical limitations in discrete reasoning tasks, such as arithmetic, logical inference, and algorithmic composition, remain a critical open problem. In this survey, we synthesize recent studies from three theoretical perspectives: circuit complexity, approximation theory, and communication complexity, to clarify the structural and computational barriers that transformers face when performing symbolic computations. By connecting these established theoretical frameworks, we provide an accessible and unified account of why current transformer architectures struggle to implement exact discrete algorithms, even as they excel at pattern matching and interpolation. We review key definitions, seminal results, and illustrative examples, highlighting challenges such as depth constraints, difficulty approximating discontinuities, and bottlenecks in inter-token communication. Finally, we discuss implications for model design and suggest promising directions for overcoming these foundational limitations.

[16] Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments

Maral Doctorarastoo,Katherine A. Flanigan,Mario Bergés,Christopher McComb

Main category: cs.CL

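TL;DR: This paper examines how well large language models (LLMs) can predict human activities and their durations in low-data settings, proposing a retrieval-augmented prompting strategy that combines temporal, spatial, behavioral-history, and persona context, and validating it on activity prediction and daily sequence generation with the CASAS Aruba dataset.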

Details Motivation: Existing data-driven agent-based models (from rule-based to deep learning) perform poorly in low-data settings, which limits their practical use; pre-trained large language models carry broad human knowledge and may be able to reason about everyday activities from only a few contextual cues. Method: Adopt a retrieval-augmented prompting strategy that integrates four kinds of context (temporal, spatial, behavioral history, and persona) and evaluate it on the CASAS Aruba smart-home dataset across two tasks: next-activity prediction with duration estimation, and multi-step daily sequence generation; systematically assess the effect of the number of few-shot examples. Result: LLMs produce coherent daily activity predictions even in zero-shot settings; adding one or two examples noticeably improves duration calibration and categorical accuracy; further examples bring diminishing returns; sequence-level evaluation shows good temporal consistency. Conclusion: Pre-trained language models have strong inherent temporal reasoning ability and can usefully complement behavior modeling, especially for agent-based modeling and human-machine collaborative systems in low-data settings. Abstract: Anticipating human activities and their durations is essential in applications such as smart-home automation, simulation-based architectural and urban design, activity-based transportation system simulation, and human-robot collaboration, where adaptive systems must respond to human activities. Existing data-driven agent-based models--from rule-based to deep learning--struggle in low-data environments, limiting their practicality. This paper investigates whether large language models, pre-trained on broad human knowledge, can fill this gap by reasoning about everyday activities from compact contextual cues. We adopt a retrieval-augmented prompting strategy that integrates four sources of context--temporal, spatial, behavioral history, and persona--and evaluate it on the CASAS Aruba smart-home dataset. The evaluation spans two complementary tasks: next-activity prediction with duration estimation, and multi-step daily sequence generation, each tested with various numbers of few-shot examples provided in the prompt. Analyzing few-shot effects reveals how much contextual supervision is sufficient to balance data efficiency and predictive accuracy, particularly in low-data environments. Results show that large language models exhibit strong inherent temporal understanding of human behavior: even in zero-shot settings, they produce coherent daily activity predictions, while adding one or two demonstrations further refines duration calibration and categorical accuracy. Beyond a few examples, performance saturates, indicating diminishing returns.
Sequence-level evaluation confirms consistent temporal alignment across few-shot conditions. These findings suggest that pre-trained language models can serve as promising temporal reasoners, capturing both recurring routines and context-dependent behavioral variations, thereby strengthening the behavioral modules of agent-based models.

[17] What Do LLMs Know About Alzheimer's Disease? Fine-Tuning, Probing, and Data Synthesis for AD Detection

Lei Jiang,Yue Zhou,Natalie Parde

Main category: cs.CL

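TL;DR: This paper explores using large language models (LLMs) for early detection of Alzheimer's disease (AD). Through supervised fine-tuning and analysis of internal representations, it finds that specific words and special markers play a key role in the task. It then designs task-aware special markers and builds a sequence-to-sequence data-synthesis model whose synthetic data is used to improve downstream AD detection.

Details Motivation: Early AD detection suffers from a shortage of labeled data. LLMs transfer well across domains, but their supervised fine-tuning for AD detection and the underlying mechanisms have not been studied systematically. Method: An LLM is fine-tuned for AD detection, and probing techniques are used to analyze intermediate activations across layers. Based on the key words and special markers identified, a set of task-aware markers is designed and a sequence-to-sequence model is trained to synthesize structurally consistent, diagnostically informative data. Result: After fine-tuning, the probing values of specific words and special markers in the model's internal representations change substantially. The synthesized samples are evaluated both intrinsically and in downstream training, and prove effective in both. Conclusion: LLMs can be adapted to AD detection through task-driven fine-tuning and representation analysis, and data synthesis guided by key representations can ease the shortage of labeled data and improve model performance. Abstract: Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across domains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we fine-tune an LLM for AD detection and investigate how task-relevant information is encoded within its internal representations. We employ probing techniques to analyze intermediate activations across transformer layers, and we observe that, after fine-tuning, the probing values of specific words and special markers change substantially, indicating that these elements assume a crucial role in the model's improved detection performance. Guided by this insight, we design a curated set of task-aware special markers and train a sequence-to-sequence model as a data-synthesis tool that leverages these markers to generate structurally consistent and diagnostically informative synthetic samples. We evaluate the synthesized data both intrinsically and by incorporating it into downstream training pipelines.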

[18] From Instruction to Output: The Role of Prompting in Modern NLG

Munazza Zaib,Elaf Alhazmi

Main category: cs.CL

TL;DR: This survey provides a comprehensive overview of prompt engineering techniques for Natural Language Generation (NLG), introducing a taxonomy, decision framework, and integrated design-optimization-evaluation framework to enhance controllability and generalizability.

Details Motivation: The lack of a structured framework or coherent understanding of diverse prompt engineering methods—especially in NLG—motivates this survey. Method: The paper reviews recent prompting methods, proposes a taxonomy of prompting paradigms, develops a decision framework for prompt selection, and introduces an integrated framework linking prompt design, optimization, and evaluation. Result: A unified taxonomy, practitioner-oriented decision framework, and a holistic framework connecting design, optimization, and evaluation for controllable and generalizable NLG are presented. Conclusion: Prompt engineering serves as a crucial input-level control mechanism for NLG, complementary to fine-tuning and decoding; establishing structured, evaluative, and adaptable frameworks is essential for its future advancement. Abstract: Prompt engineering has emerged as an integral technique for extending the strengths and abilities of Large Language Models (LLMs) to gain significant performance gains in various Natural Language Processing (NLP) tasks. This approach, which requires instructions to be composed in natural language to bring out the knowledge from LLMs in a structured way, has driven breakthroughs in various NLP tasks. Yet there is still no structured framework or coherent understanding of the varied prompt engineering methods and techniques, particularly in the field of Natural Language Generation (NLG). This survey aims to help fill that gap by outlining recent developments in prompt engineering, and their effect on different NLG tasks. It reviews recent advances in prompting methods and their impact on NLG tasks, presenting prompt design as an input-level control mechanism that complements fine-tuning and decoding approaches. 
The paper introduces a taxonomy of prompting paradigms and a practitioner-oriented decision framework for prompt selection, outlines emerging trends and challenges, and proposes a framework that links design, optimization, and evaluation to support more controllable and generalizable NLG.

[19] Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions

Usman Naseem

Main category: cs.CL

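TL;DR: This paper surveys recent progress in mechanistic interpretability for large language model (LLM) alignment, covering circuit discovery, feature visualization, activation steering, and causal intervention, and analyzes what these methods imply for alignment strategies such as RLHF, constitutional AI, and scalable oversight. It identifies challenges including superposition, polysemantic neurons, and the difficulty of interpreting emergent behaviors, and proposes future directions: automated interpretability, cross-model generalization of circuits, and scalable interpretability-driven alignment.

Details Motivation: LLMs are highly capable, but their internal decision-making is opaque; mechanistic interpretability is needed to improve understanding and alignment. Method: A systematic survey of recent mechanistic interpretability techniques (circuit discovery, feature visualization, activation steering, causal intervention) and their application to LLM alignment. Result: The survey lays out how interpretability supports alignment strategies such as RLHF, constitutional AI, and scalable oversight, and identifies key challenges including superposition, polysemanticity, and the difficulty of interpreting emergent behaviors. Conclusion: Mechanistic interpretability is a key path toward trustworthy, controllable LLM alignment; future work should develop automated, generalizable, and scalable interpretability-driven alignment methods. Abstract: Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability (i.e., the systematic study of how neural networks implement algorithms through their learned representations and computational structures) has emerged as a critical research direction for understanding and aligning these models. This paper surveys recent progress in mechanistic interpretability techniques applied to LLM alignment, examining methods ranging from circuit discovery to feature visualization, activation steering, and causal intervention. We analyze how interpretability insights have informed alignment strategies including reinforcement learning from human feedback (RLHF), constitutional AI, and scalable oversight. Key challenges are identified, including the superposition hypothesis, polysemanticity of neurons, and the difficulty of interpreting emergent behaviors in large-scale models. We propose future research directions focusing on automated interpretability, cross-model generalization of circuits, and the development of interpretability-driven alignment techniques that can scale to frontier models.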

[20] Code Mixologist : A Practitioner's Guide to Building Code-Mixed LLMs

Himanshu Gupta,Pratik Jayarao,Chaitanya Dwivedi,Neeraj Varshney

Main category: cs.CL

TL;DR: This paper provides a comprehensive overview of code-mixing and code-switching (CSW) research in large language models, proposing a taxonomy and practical recommendations for building, adapting, and evaluating CSW-capable LLMs, while addressing evaluation limitations and safety concerns.

Details Motivation: LLMs often struggle with code-mixing and code-switching, showing degradation in grammaticality, factuality, and safety—highlighting the need for systematic understanding and improvement. Method: The authors introduce a unifying taxonomy across data, modeling, and evaluation dimensions; review modeling approaches (e.g., CSW-tailored pre-training, prompting, in-context learning); analyze evaluation practices and benchmarks; and discuss safety implications. Result: A structured taxonomy, a practical playbook for CSW-LLM development, critical assessment of benchmarks (exposing English-centric biases), identification of evaluation instability, and recognition of CSW as a potential safeguard bypass mechanism. Conclusion: Building robust CSW-capable LLMs requires coordinated advances in data curation, model adaptation, rigorous multilingual evaluation, and safety-aware design—open challenges remain in linguistic coverage, reproducibility, and ethical deployment. Abstract: Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. 
We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.

[21] MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization

Haidong Xin,Xinze Li,Zhenghao Liu,Yukun Yan,Shuo Wang,Cheng Yang,Yu Gu,Ge Yu,Maosong Sun

Main category: cs.CL

TL;DR: This paper proposes MetaMem, a framework that augments LLM memory systems with a self-evolving meta-memory, improving the model's ability to identify and integrate critical evidence from scattered memory fragments and significantly outperforming baselines.

Details Motivation: Existing memory systems can store long-horizon interaction histories, but they disrupt the logical and temporal relationships within sessions, leading to fragmented memory and degraded reasoning. Method: MetaMem introduces a self-evolving meta-memory. During optimization it self-reflects on reasoning processes and updates the meta-memory state, iteratively distilling knowledge-utilization experiences that transfer across tasks. Result: Experiments show that MetaMem significantly outperforms strong baselines, by over 3.6%; code and datasets are open-sourced. Conclusion: By explicitly modeling knowledge-utilization experience, MetaMem improves how systematically LLMs use fragmented memory in long-horizon interactions. Abstract: Existing memory systems enable Large Language Models (LLMs) to support long-horizon human-LLM interactions by persisting historical interactions beyond limited context windows. However, while recent approaches have succeeded in constructing effective memories, they often disrupt the inherent logical and temporal relationships within interaction sessions, resulting in fragmented memory units and degraded reasoning performance. In this paper, we propose MetaMem, a novel framework that augments memory systems with a self-evolving meta-memory, aiming to teach LLMs how to effectively utilize memorized knowledge. During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks by self-reflecting on reasoning processes and performing actions to update the current meta-memory state. The accumulated meta-memory units serve as explicit knowledge utilization experiences, guiding the LLM to systematically identify and integrate critical evidence from scattered memory fragments. Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%. All code and datasets are available at https://github.com/OpenBMB/MetaMem.

[22] DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task

Shafiuddin Rehan Ahmed,Wei Wei

Main category: cs.CL

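TL;DR: This paper proposes DDL2PropBank, a benchmark task for evaluating the developer experience of multi-agent frameworks in LLM-driven software development. Using a uniform Agent-as-a-Tool implementation, it compares code complexity and AI-assistability across 10 frameworks and finds that Agno performs best overall.

Details Motivation: There is no principled way to evaluate the developer experience of multi-agent frameworks in a controlled setting. Method: The authors build the DDL2PropBank benchmark task (mapping database schemas to PropBank rolesets), reimplement identical agent logic in 10 frameworks using the Agent-as-a-Tool pattern, and evaluate along two dimensions: code complexity (static analysis) and AI-assistability (how well LLMs can autonomously generate correct framework-specific code). Result: The results show a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. Structural alignment scores reliably predict runtime success for frameworks with a single canonical pattern but overestimate correctness for multi-pattern frameworks. Agno is the strongest overall, combining the lowest complexity with the highest structural alignment and 83% pass@1. Conclusion: Framework design should balance low implementation complexity with high structural predictability, and Agno currently shows the best balance; structural alignment is a useful proxy for AI-assistability, provided single-pattern and multi-pattern frameworks are distinguished. Abstract: Multi-agent frameworks promise to simplify LLM-driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability, the extent to which LLMs can autonomously generate correct, framework-specific code. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI-assistability, structural alignment scores reliably proxy runtime success for frameworks with single canonical patterns, but overestimate correctness for multi-pattern frameworks. Agno emerges as the strongest overall performer, combining lowest complexity with highest structural alignment and 83% pass@1.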

[23] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification

Jiale Zhao,Ke Fang,Lu Cheng

Main category: cs.CL

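TL;DR: This paper proposes AskBench, an interactive benchmark, and rubric-guided RLVR, a reinforcement learning method, to improve LLMs' ability to ask for clarification when information is missing or a premise is false, balancing accuracy with interaction efficiency.

Details Motivation: LLMs often answer anyway when prompts are incomplete or contain misleading content, causing hallucinations or reinforcing misconceptions; they need to get better at deciding when and how to ask for clarification. Method: The authors build the interactive AskBench benchmark (with two settings, AskMind and AskOverconfidence), evaluated by a unified judge loop, and propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which relies on structured rubrics. Result: Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains. Conclusion: Explicitly modeling the clarification decision can mitigate LLM hallucination, and AskBench together with RLVR offers a scalable framework for controllable, trustworthy interactive reasoning. Abstract: Large language models (LLMs) often respond even when prompts omit critical details or include misleading information, leading to hallucinations or reinforced misconceptions. We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints. A unified judge loop evaluates final answers and simulates user responses as needed. AskBench covers two settings: AskMind, with intent-deficient queries requiring clarification, and AskOverconfidence, with queries containing false premises that must be identified and corrected. We further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification. Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.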

[24] Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

Donald Ye,Max Loffgren,Om Kotadia,Linus Wong

Main category: cs.CL

TL;DR: This paper proposes the NLDD metric for assessing the faithfulness of Chain-of-Thought (CoT) explanations. It finds that reasoning chains have a "Reasoning Horizon" (k*) and that accuracy does not reflect the actual reasoning process.

Details Motivation: It is unclear whether Chain-of-Thought explanations truly reflect the model's decision process, so their faithfulness needs to be quantified. Method: The paper proposes Normalized Logit Difference Decay (NLDD): each step of the CoT is corrupted and the resulting drop in answer confidence is measured; the measurements are standardized to allow cross-model comparison. Result: Across three task types and three model families, a consistent Reasoning Horizon appears at 70-85% of chain length, beyond which steps have little or negative effect on the answer; models can hold correct internal representations and still answer incorrectly. Conclusion: Accuracy alone does not reveal whether a model actually reasons through its chain; NLDD provides a quantitative tool for judging when CoT really matters. Abstract: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps from the explanation and measures how much the model's confidence in its answer drops, to determine if a step is truly important. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70-85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.
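
The corrupt-and-measure idea behind NLDD can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the exact formula, so the normalization used here (drop divided by the magnitude of the clean logit difference), the `tolerance` cutoff, and the function names are all hypothetical, and the numbers are invented.

```python
def normalized_logit_drops(clean_logit_diff, corrupted_logit_diffs):
    """Relative drop in the answer's logit difference after corrupting each step.

    clean_logit_diff: logit(chosen answer) - logit(best alternative) with the
        intact chain (an assumed definition, not taken from the paper).
    corrupted_logit_diffs: the same quantity after corrupting step k alone.
    """
    if clean_logit_diff == 0:
        raise ValueError("clean logit difference must be non-zero")
    scale = abs(clean_logit_diff)
    return [(clean_logit_diff - c) / scale for c in corrupted_logit_diffs]


def last_influential_step(drops, tolerance=0.05):
    """1-based index of the last step whose drop exceeds the tolerance (0 if none)."""
    last = 0
    for k, drop in enumerate(drops, start=1):
        if drop > tolerance:
            last = k
    return last


# Toy numbers, invented for illustration only.
drops = normalized_logit_drops(4.0, [1.0, 2.0, 3.0, 3.9, 4.2])
horizon = last_influential_step(drops)
```

In this toy chain of five steps, corrupting the last two steps barely moves (or slightly raises) the answer's logit difference, so the last influential step is the third one. Dividing by the clean logit difference is one simple way to make drops comparable across models with different logit scales.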

[25] The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task

Rui Cao,Zhenyun Deng,Yulong Chen,Michael Schlichtkrull,Andreas Vlachos

Main category: cs.CL

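TL;DR: This paper describes the AVerImaTeC shared task, which aims to advance automatic verification systems for image-text claims. Systems are evaluated with a conditional verdict accuracy (the AVerImaTeC score). The task received 14 submissions in the development phase and 6 in the testing phase; all test-phase systems beat the baseline, and the winning team, HUMANE, scored 0.5455.

Details Motivation: To advance automatic verification of image-text claims and address the challenge of retrieving evidence for and verifying real-world image-text claims. Method: The organizers ran the AVerImaTeC shared task, allowing participants to use external knowledge sources (such as web search engines) or the curated knowledge store they provided. Evaluation uses the AVerImaTeC score, a conditional verdict accuracy gated by an evidence-score threshold. Result: There were 14 submissions in the development phase and 6 in the testing phase; all test-phase systems outperformed the baseline; the winning team, HUMANE, achieved an AVerImaTeC score of 0.5455. Conclusion: The shared task advanced image-text verification, and the paper reports the complete evaluation results along with key insights and lessons learned. Abstract: The Automatic Verification of Image-Text Claims (AVerImaTeC) shared task aims to advance system development for retrieving evidence and verifying real-world image-text claims. Participants were allowed to either employ external knowledge sources, such as web search engines, or leverage the curated knowledge store provided by the organizers. System performance was evaluated using the AVerImaTeC score, defined as a conditional verdict accuracy in which a verdict is considered correct only when the associated evidence score exceeds a predefined threshold. The shared task attracted 14 submissions during the development phase and 6 submissions during the testing phase. All participating systems in the testing phase outperformed the baseline provided. The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455. This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.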
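
A conditional verdict accuracy of the kind described above can be sketched in a few lines. The `Prediction` record, the verdict labels, the evidence scores, and the threshold value of 0.3 are hypothetical; the abstract specifies only that a verdict counts as correct when its evidence score exceeds a predefined threshold.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    predicted_verdict: str
    gold_verdict: str
    evidence_score: float  # hypothetical evidence-quality score


def conditional_verdict_accuracy(predictions, threshold):
    """Fraction of claims with a correct verdict AND evidence score above threshold."""
    if not predictions:
        return 0.0
    correct = sum(
        1
        for p in predictions
        if p.evidence_score > threshold and p.predicted_verdict == p.gold_verdict
    )
    return correct / len(predictions)


# Invented examples: the second verdict is right but its evidence is too weak to count.
predictions = [
    Prediction("Supported", "Supported", 0.9),
    Prediction("Refuted", "Refuted", 0.2),
    Prediction("Supported", "Refuted", 0.8),
    Prediction("Refuted", "Refuted", 0.6),
]
score = conditional_verdict_accuracy(predictions, threshold=0.3)
```

Gating on evidence quality means a system cannot score well by guessing labels without retrieving supporting evidence, which is why the metric can sit well below plain verdict accuracy.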

[26] SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation

Beichen Guo,Zhiyuan Wen,Jia Gu,Senzhang Wang,Haochen Shi,Ruosong Yang,Shuaiqi Liu

Main category: cs.CL

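TL;DR: This paper proposes SurveyLens, the first discipline-aware benchmark for evaluating Automatic Survey Generation (ASG). It contains 1,000 high-quality human-written surveys across 10 disciplines and a dual-lens evaluation framework (discipline-aware rubric evaluation plus canonical alignment evaluation), and it systematically evaluates how 11 ASG methods perform across disciplines.

Details Motivation: Existing ASG evaluation relies on generic metrics and is heavily biased toward computer science, so it cannot capture the writing norms and standards specific to each discipline; researchers outside CS therefore lack guidance on choosing suitable ASG tools. Method: The authors build SurveyLens-1k, a curated cross-disciplinary dataset of 1,000 human-written surveys spanning 10 disciplines, and propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, in which LLMs score adherence to domain-specific writing standards using human preference-aligned weights; and (2) Canonical Alignment Evaluation, which compares content coverage and synthesis quality against human-written surveys. Eleven ASG methods are evaluated empirically. Result: The study shows how different ASG paradigms (vanilla LLMs, dedicated ASG systems, Deep Research agents) perform differently across disciplines and identifies each method's strengths and weaknesses in specific fields. Conclusion: SurveyLens provides a discipline-aware evaluation standard for ASG research, supports practical cross-disciplinary use of ASG, and gives researchers empirical grounds for choosing ASG tools by discipline. Abstract: The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), failing to assess whether ASG methods adhere to the distinct standards of various academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on using ASG systems to yield high-quality surveys compliant with specific discipline standards. To bridge this gap, we introduce SurveyLens, the first discipline-aware benchmark evaluating ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines. Subsequently, we propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which utilizes LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation to rigorously measure content coverage and synthesis quality against human-written survey papers. We conduct extensive experiments by evaluating 11 state-of-the-art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.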

[27] Are Aligned Large Language Models Still Misaligned?

Usman Naseem,Gautam Siddharth Kashyap,Rafiq Ali,Ebad Shabbir,Sushant Kumar Ray,Abdullah Mohammad,Agrima Seth

Main category: cs.CL

TL;DR: This paper proposes Mis-Align Bench, a unified benchmark that jointly evaluates LLM misalignment along safety, value, and cultural dimensions. It builds SAVACU, a dataset of 382,424 samples spanning 112 domains, and shows that models aligned on a single dimension degrade markedly under joint evaluation.

Details Motivation: Existing misalignment benchmarks (such as INSECURE CODE, VALUEACTIONLENS, and CULTURALHERITAGE) each focus on a single dimension and cannot reflect real-world settings in which safety, value, and cultural requirements must be satisfied together, so evaluation is incomplete. Method: (1) Build the SAVACU dataset: prompts from the LLM-PROMPT-DATASET are reclassified with Mistral-7B-Instruct-v0.3 under a taxonomy of 14 safety, 56 value, and 42 cultural domains, and low-resource domains are expanded with Llama-3.1-8B-Instruct combined with SimHash fingerprints; (2) pair prompts with misaligned and aligned responses via two-stage rejection sampling; (3) run a joint three-dimension evaluation on several classes of LLMs. Result: Single-dimension models reach Coverage of up to 97.6%, but under joint conditions their False Failure Rate exceeds 50% and their Alignment Score falls to 63%-66%. Conclusion: Optimizing a single dimension is not enough to ensure overall alignment in real-world settings; Mis-Align Bench provides reproducible, fine-grained evaluation infrastructure for research on multi-dimensional alignment. Abstract: Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy safety, value, and cultural dimensions, which must co-occur in real-world settings to solve a real-world query. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), rely on evaluating misalignment along individual dimensions, preventing simultaneous evaluation. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels), by reclassifying prompts from the LLM-PROMPT-DATASET via taxonomy into 14 safety domains, 56 value domains, and 42 cultural domains using Mistral-7B-Instruct-v0.3, and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based fingerprints to avoid duplicates. Furthermore, we pair prompts with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment under three dimensions. Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur False Failure Rate >50% and lower Alignment Score (63%-66%) under joint conditions.
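
SimHash-based fingerprinting for filtering near-duplicate prompts can be sketched as below. This is a generic textbook-style SimHash, not the authors' pipeline: whitespace tokenization, MD5 as the per-token hash, the 64-bit width, and the `max_distance` cutoff are all assumptions chosen for illustration.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a SimHash fingerprint from lowercase whitespace tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the token hash
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(prompts, max_distance=3):
    """Keep a prompt only if its fingerprint is far from every kept fingerprint."""
    kept, fingerprints = [], []
    for prompt in prompts:
        fp = simhash(prompt)
        if all(hamming_distance(fp, other) > max_distance for other in fingerprints):
            kept.append(prompt)
            fingerprints.append(fp)
    return kept


unique = drop_near_duplicates(["How do I reset a password?", "how do I reset a password?"])
```

Because similar texts share most tokens, their fingerprints differ in few bits, so a small Hamming-distance cutoff catches near-duplicates that exact string matching would miss. The pairwise comparison here is quadratic; large corpora typically bucket fingerprints first.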

[28] Evaluating Alignment of Behavioral Dispositions in LLMs

Amir Taubenfeld,Zorik Gekhman,Lior Nezry,Omri Feldman,Natalie Harris,Shashir Reddy,Romina Stella,Ariel Goldstein,Marian Croak,Yossi Matias,Amir Feder

Main category: cs.CL

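TL;DR: This paper proposes a framework based on Situational Judgment Tests (SJTs) for evaluating how closely the behavioral dispositions that LLMs show in social situations match real human preferences. It finds that LLMs are often overconfident, deviate from human consensus, and act inconsistently with their stated values.

Details Motivation: As LLMs become part of everyday life, understanding their behavioral dispositions, especially in social contexts, is essential. Existing work relies mostly on self-report psychological scales, which do not directly reflect real behavior, so behavioral evaluation methods suited to LLMs are needed. Method: Self-report items from established psychological questionnaires are converted into SJTs that ask the LLM to recommend a natural action in a realistic user-assistant scenario. The dataset contains 2,500 SJTs, each validated by three annotators, with preferred actions collected from 10 annotators per SJT drawn from a pool of 550 participants. Behavior is compared systematically across 25 LLMs. Result: (1) In low-consensus scenarios, LLMs are consistently overconfident in a single response; (2) in high-consensus scenarios, smaller models deviate significantly, and even some frontier models fail to reflect the consensus in 15-20% of cases; (3) some dispositions show stable cross-model patterns (for example, encouraging emotional expression where humans favor composure); (4) there is a clear predictive-validity gap between LLMs' stated values and their actual behavior. Conclusion: LLM behavioral dispositions deviate systematically from human preferences, and self-report psychometrics alone cannot characterize their real social behavior; the SJT framework offers a more reliable and ecologically valid way to evaluate and calibrate it. Abstract: As LLMs integrate into our daily lives, understanding their behavior becomes essential. In this work, we focus on behavioral dispositions (the underlying tendencies that shape responses in social contexts) and introduce a framework to study how closely the dispositions expressed by LLMs align with those of humans. Our approach is grounded in established psychological questionnaires but adapts them for LLMs by transforming human self-report statements into Situational Judgment Tests (SJTs). These SJTs assess behavior by eliciting natural recommendations in realistic user-assistant scenarios. We generate 2,500 SJTs, each validated by three human annotators, and collect preferred actions from 10 annotators per SJT, from a large pool of 550 participants. In a comprehensive study involving 25 LLMs, we find that models often do not reflect the distribution of human preferences: (1) in scenarios with low human consensus, LLMs consistently exhibit overconfidence in a single response; (2) when human consensus is high, smaller models deviate significantly, and even some frontier models do not reflect the consensus in 15-20% of cases; (3) traits can exhibit cross-LLM patterns, e.g., LLMs may encourage emotion expression in contexts where human consensus favors composure.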
Lastly, mapping psychometric statements directly to behavioral scenarios presents a unique opportunity to evaluate the predictive validity of self-reports, revealing considerable gaps between LLMs' stated values and their revealed behavior.

[29] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

Zachary Pedram Dadfar

Main category: cs.CL

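TL;DR: This paper proposes the Pull Methodology, which uses format engineering to elicit self-examination from large language models. It finds a specific correspondence between self-referential vocabulary and internal activation dynamics, indicating that under appropriate conditions a model's self-report can reliably reflect its internal computational state.

Details Motivation: To determine whether the introspective language that LLMs produce during self-examination reflects internal computation or is merely sophisticated confabulation. Method: The paper introduces the Pull Methodology (a protocol that elicits extended self-examination through format engineering), identifies a direction in Llama 3.1's activation space that separates self-referential from descriptive processing, and combines activation analysis, causal intervention (steering), and cross-model validation on Qwen 2.5-32B. Result: Self-referential vocabulary (such as "loop" and "shimmer") correlates significantly with particular activation features (higher autocorrelation and higher variability, respectively). The correspondence is specific (absent in non-self-referential contexts), localized (at 6.25% of model depth), causal (steering changes the output), and independently reproduced in a second model. Conclusion: With appropriate prompting and structure, LLM self-report is not pure hallucination but can reliably track internal computational states, providing empirical grounding for understanding model introspection. Abstract: Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.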
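
To make "activation autocorrelation" concrete, the sketch below computes a lag-1 autocorrelation over a one-dimensional series, such as the projection of each token's activation onto a chosen direction. The abstract does not say which autocorrelation statistic the author uses, so the lag, the normalization, and the toy series are assumptions.

```python
def lag1_autocorrelation(series):
    """Lag-1 autocorrelation: covariance of consecutive values over total variance."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # constant series: treat as uncorrelated by convention
    covariance = sum(
        (series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1)
    )
    return covariance / variance


# Invented series: a smooth ramp versus a sign-flipping oscillation.
smooth = lag1_autocorrelation([1, 2, 3, 4, 5])
oscillating = lag1_autocorrelation([1, -1, 1, -1])
```

A slowly changing series gives a positive value and a rapidly alternating one gives a negative value, so the statistic summarizes how strongly each step's activation resembles the previous step's.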

[30] Finding the Cracks: Improving LLMs Reasoning with Paraphrastic Probing and Consistency Verification

Weili Shi,Dongliang Guo,Lehan Yang,Tianlong Wang,Hanzhang Yuan,Sheng Li

Main category: cs.CL

TL;DR: This paper proposes the PPCV framework, which improves LLM performance on complex reasoning tasks by identifying and replacing critical tokens in the reasoning path and then applying consistency verification.

Details Motivation: LLM performance on complex reasoning tasks often degrades because of hallucinations and errors that accumulate across intermediate steps, and reliably identifying and exploiting critical tokens remains difficult. Method: PPCV has two stages. In the first, an initial reasoning path is generated from the original question, and critical tokens are identified from mismatches between the predicted token and the actual token when paraphrased questions are combined with that path. In the second, critical tokens are replaced with candidate alternatives, new reasoning paths are generated in parallel for the original and paraphrased questions, and the final answer is chosen by output consistency. Result: Across several mainstream LLMs and benchmarks, PPCV substantially outperforms baseline methods on reasoning. Conclusion: PPCV is an effective approach to making LLM reasoning on complex tasks more robust, and its critical-token probing and consistency verification are practical and broadly applicable. Abstract: Large language models have demonstrated impressive performance across a variety of reasoning tasks. However, their problem-solving ability often declines on more complex tasks due to hallucinations and the accumulation of errors within these intermediate steps. Recent work has introduced the notion of critical tokens, that is, tokens in the reasoning process that exert significant influence on subsequent steps. Prior studies suggest that replacing critical tokens can refine reasoning trajectories. Nonetheless, reliably identifying and exploiting critical tokens remains challenging. To address this, we propose the Paraphrastic Probing and Consistency Verification (PPCV) framework. PPCV operates in two stages. In the first stage, we roll out an initial reasoning path from the original question and then concatenate paraphrased versions of the question with this reasoning path. We then identify critical tokens based on mismatches between the predicted top-1 token and the expected token in the reasoning path. A criterion is employed to confirm the final critical token. In the second stage, we substitute critical tokens with candidate alternatives and roll out new reasoning paths for both the original and paraphrased questions. The final answer is determined by checking the consistency of outputs across these parallel reasoning processes. We evaluate PPCV on mainstream LLMs across multiple benchmarks. Extensive experiments demonstrate PPCV substantially enhances the reasoning performance of LLMs compared to baselines.
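
The two stages can be illustrated with a toy sketch. Everything here is hypothetical: `predict_top1` stands in for a real model conditioned on a paraphrased question plus the reasoning prefix, the confirmation criterion mentioned in the abstract is omitted, and majority voting is only one possible reading of "checking the consistency of outputs".

```python
from collections import Counter


def mismatch_positions(path_tokens, predict_top1):
    """Positions where the model's top-1 prediction differs from the path's token.

    predict_top1 is a hypothetical callable: it receives the token prefix and
    returns the token the model would predict next.
    """
    return [
        i
        for i, token in enumerate(path_tokens)
        if predict_top1(path_tokens[:i]) != token
    ]


def most_consistent_answer(answers):
    """Pick the answer that the parallel rollouts agree on most often."""
    return Counter(answers).most_common(1)[0][0]


# Toy stand-in for a model: it disagrees with the path only at the last token.
def toy_predictor(prefix):
    table = {0: "2", 1: "+", 2: "3", 3: "=", 4: "6"}
    return table[len(prefix)]


candidates = mismatch_positions(["2", "+", "3", "=", "5"], toy_predictor)
final_answer = most_consistent_answer(["5", "5", "6"])
```

Mismatch positions are candidate critical tokens; in the second stage each would be swapped for an alternative before new rollouts are generated and compared.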

[31] The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods

Arpit Singh Gautam,Kailash Talreja,Saurabh Jha

Main category: cs.CL

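TL;DR: This paper proposes DiffuTruth, a framework that draws on non-equilibrium thermodynamics and treats factual truths as stable attractors on a generative manifold, while hallucinations are unstable. It detects factual errors with a Generative Stress Test and a Semantic Energy metric, adds a Hybrid Calibration step to improve confidence estimation, and reports state-of-the-art unsupervised fact verification.

Details Motivation: LLMs often produce plausible but incorrect assertions (hallucinations), and existing uncertainty metrics struggle to detect cases where the model is confidently wrong. Method: DiffuTruth models facts as stable attractors on a generative manifold, following non-equilibrium thermodynamics. It applies a Generative Stress Test (corrupt a claim with noise, then reconstruct it), defines Semantic Energy (the semantic divergence between the original claim and its reconstruction, judged by an NLI critic), and fuses this stability signal with discriminative confidence in a Hybrid Calibration. Result: On FEVER it reaches an unsupervised AUROC of 0.725, beating baselines by 1.5 percent; on the multi-hop HOVER dataset its zero-shot generalization beats baselines by over 4 percent. Conclusion: Semantic Energy and the thermodynamic modeling of truth can separate facts from hallucinations and are robust to distribution shift, offering a new approach to unsupervised fact verification. Abstract: Large Language Models (LLMs) frequently hallucinate plausible but incorrect assertions, a vulnerability often missed by uncertainty metrics when models are confidently wrong. We propose DiffuTruth, an unsupervised framework that reconceptualizes fact verification via non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the Generative Stress Test, in which claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric measuring the semantic divergence between the original claim and its reconstruction using an NLI critic. Unlike vector space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration fusing this stability signal with discriminative confidence. Extensive experiments on FEVER demonstrate DiffuTruth achieves a state-of-the-art unsupervised AUROC of 0.725, outperforming baselines by 1.5 percent through the correction of overconfident predictions. Furthermore, we show superior zero-shot generalization on the multi-hop HOVER dataset, outperforming baselines by over 4 percent, confirming the robustness of thermodynamic truth properties to distribution shifts.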

[32] Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

Md Tanvir Rouf Shawon,Mohammad Sabik Irbaz,Hadeel R. A. Elyazori,Keerti Reddy Resapu,Yili Lin,Vladimir Franzuela Cardenas,Farrokh Alemi,Kevin Lybarger

Main category: cs.CL

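TL;DR: This paper proposes a patient simulator, grounded in the NIST framework, for automated and scalable evaluation of healthcare conversational AI. It generates realistic interactions from controllable medical, linguistic, and behavioral patient profiles, and its ability to surface errors and risk patterns is validated on a decision aid for antidepressant selection.

Details Motivation: Healthcare conversational AI lacks automated evaluation methods that are scalable, controllable, and close to real clinical settings, which makes it hard to systematically identify hallucinations, errors, and risk differences across patient populations. Method: The authors build a patient simulator grounded in the NIST AI Risk Management Framework that integrates three profile types: (1) medical profiles built from electronic health records in the All of Us Research Program; (2) linguistic profiles that model health literacy and condition-specific communication patterns; and (3) behavioral profiles that capture empirically observed patterns such as cooperation, distraction, and adversarial engagement. Human annotation and an LLM judge are combined to assess the errors of the AI decision aid. Result: Across 500 generated conversations, human annotators assessed 1,787 medical concepts in 100 conversations with high agreement (F1=0.94, κ=0.73), and the LLM judge agreed with the human annotators at a comparable level (F1=0.94, κ=0.78). The AI decision aid's performance improved monotonically with health literacy (rank-one concept retrieval accuracy rose from 47.9% to 81.6%). Conclusion: The patient simulator is an effective, reliable, and interpretable evaluation tool that systematically exposes performance differences and potential risks of healthcare AI across patient subgroups, supporting safer and fairer clinical AI deployment. Abstract: Objective: This paper introduces a patient simulator designed to enable scalable, automated evaluation of healthcare conversational agents. The simulator generates realistic, controllable patient interactions that systematically vary across medical, linguistic, and behavioral dimensions, allowing annotators and an independent AI judge to assess agent performance, identify hallucinations and inaccuracies, and characterize risk patterns across diverse patient populations. Methods: The simulator is grounded in the NIST AI Risk Management Framework and integrates three profile components reflecting different dimensions of patient variation: (1) medical profiles constructed from electronic health records in the All of Us Research Program; (2) linguistic profiles modeling variation in health literacy and condition-specific communication patterns; and (3) behavioral profiles representing empirically observed interaction patterns, including cooperation, distraction, and adversarial engagement. We evaluated the simulator's effectiveness in identifying errors in an AI decision aid for antidepressant selection. Results: We generated 500 conversations between the patient simulator and the AI decision aid across systematic combinations of five linguistic and three behavioral profiles. Human annotators assessed 1,787 medical concepts across 100 conversations, achieving high agreement (F1=0.94, κ=0.73), and the LLM judge achieved comparable agreement with human annotators (F1=0.94, κ=0.78; paired bootstrap p=0.21).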
The simulator revealed a monotonic degradation in AI decision aid performance across the health literacy spectrum: rank-one concept retrieval accuracy increased from 47.9% for limited health literacy to 69.1% for functional and 81.6% for proficient.
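The agreement figures above are reported as κ. As a minimal sketch, assuming κ here means Cohen's kappa between two raters (the abstract does not say which variant was used), the statistic can be computed as follows. The label lists are made-up illustration data, not values from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


# Hypothetical binary judgments (1 = concept correct, 0 = incorrect).
rater_a = [1, 1, 1, 0, 0, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 0, 1, 1]
print(cohens_kappa(rater_a, rater_b))
```

Raw agreement in this toy case is 6 of 8 items, but half of that is expected by chance, which is why κ is lower than the raw agreement rate.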

[33] Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives

Zecheng Wang,Deyuan Liu,Chunshan Li,Yupeng Zhang,Zhengyun Zhao,Dianhui Chu,Bingning Wang,Dianbo Sui

Main category: cs.CL

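TL;DR: This paper proposes Dynamic Entropy Fine-Tuning (DEFT), a parameter-free supervised fine-tuning objective that uses Rényi-2 entropy to dynamically modulate how much the model trusts its own predictions. It eases the plasticity-stability dilemma caused by uniform token weighting in standard NLL and improves both the exploration-exploitation balance and overall performance.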

Details Motivation: Standard negative log-likelihood (NLL) applies uniform token weighting in supervised fine-tuning, which causes two problems: (i) overemphasizing low-probability targets amplifies gradients on noisy supervision and disrupts robust priors; (ii) uniform weighting sharpens weakly when the model is already highly confident. Existing methods cannot both preserve useful learning signals and suppress harmful noise. Method: Unify token-level SFT objectives within a generalized deformed-log family and expose their shared gate × error gradient structure; use the Cayley transform to map the model's continuously evolving uncertainty onto a continuous focus trajectory; propose the DEFT objective, which uses Rényi-2 entropy as a practical proxy for the model's predictive state to modulate the trust gate dynamically. Result: Extensive experiments and analyses show that DEFT strikes a better balance between exploration and exploitation and improves overall performance. Conclusion: DEFT is a parameter-free, principled, and practically effective SFT objective whose entropy-driven dynamic gating addresses the stability-plasticity conflict in supervised fine-tuning. Abstract: Standard negative log-likelihood (NLL) for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity-stability dilemma, often suppressing necessary learning signals alongside harmful ones. To address this issue, we unify token-level SFT objectives within a generalized deformed-log family and expose a universal gate × error gradient structure, where the gate controls how much the model trusts its current prediction. By employing the Cayley transform, we map the model's continuously evolving uncertainty onto a continuous focus trajectory, which enables seamless interpolation between scenarios involving uncertain novel concepts and those involving well-established knowledge. We then introduce Dynamic Entropy Fine-Tuning (DEFT), a parameter-free objective that modulates the trust gate using distribution concentration (Rényi-2 entropy) as a practical proxy for the model's predictive state. Extensive experiments and analyses demonstrate that DEFT achieves a better balance between exploration and exploitation, leading to improved overall performance.
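The abstract names Rényi-2 entropy as the concentration measure behind the trust gate. The sketch below only illustrates that measure, using the standard definition H_2(p) = -log(sum_i p_i^2) on a softmax distribution. How DEFT maps this value to a gate (via the Cayley transform) is not specified in the abstract and is not reproduced here; the example logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def renyi2_entropy(probs):
    """Rényi entropy of order 2: H_2(p) = -log(sum_i p_i^2)."""
    return -math.log(sum(p * p for p in probs))


uniform = softmax([0.0, 0.0, 0.0, 0.0])  # maximally uncertain over 4 tokens
peaked = softmax([10.0, 0.0, 0.0, 0.0])  # nearly one-hot prediction

print(renyi2_entropy(uniform))  # equals log(4) for a uniform 4-way distribution
print(renyi2_entropy(peaked))   # close to 0 for a concentrated distribution
```

A low value signals a concentrated (confident) next-token distribution and a high value signals a diffuse one, which is the kind of signal the abstract describes as a proxy for the model's predictive state.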

[34] Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety

Muskaan Chopra,Lorenz Sparrenberg,Rafet Sifa

Main category: cs.CL

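TL;DR: This paper examines how well instruction-tuned large language models (LLMs) detect critical meaning errors in machine translation, such as factual distortions, intent reversals, and biased translations. Model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) consistently improve detection and outperform encoder-only models such as XLM-R. The study stresses the task's importance for building safe, trustworthy, and socially responsible multilingual AI systems.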

Details Motivation: Critical meaning errors in machine translation (factual distortions, intent reversals, bias) undermine the reliability, fairness, and safety of multilingual systems, and effective detection is most urgently needed in high-stakes or underrepresented contexts. Method: Evaluate instruction-tuned LLMs of different parameter scales on publicly available datasets, compare zero-shot, few-shot, and fine-tuning adaptation strategies, and benchmark against encoder-only baselines such as XLM-R and ModernBERT. Result: Model scaling and adaptation strategies consistently improve critical error detection and outperform the XLM-R and ModernBERT baselines. Conclusion: Better critical error detection in machine translation is a necessary safeguard for safe, trustworthy, and socially accountable multilingual information systems, and should be treated as a key protection for just and responsible multilingual AI rather than only a technical problem. Abstract: Machine Translation (MT) plays a pivotal role in cross-lingual information access, public policy communication, and equitable knowledge dissemination. However, critical meaning errors, such as factual distortions, intent reversals, or biased translations, can undermine the reliability, fairness, and safety of multilingual systems. In this work, we explore the capacity of instruction-tuned Large Language Models (LLMs) to detect such critical errors, evaluating models across a range of parameter scales using publicly accessible datasets. Our findings show that model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent improvements, outperforming encoder-only baselines like XLM-R and ModernBERT. We argue that improving critical error detection in MT contributes to safer, more trustworthy, and socially accountable information systems by reducing the risk of disinformation, miscommunication, and linguistic harm, especially in high-stakes or underrepresented contexts. This work positions error detection not merely as a technical challenge, but as a necessary safeguard in the pursuit of just and responsible multilingual AI. The code will be made available on GitHub.

[35] LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation

Ahmadreza Jeddi,Marco Ciccone,Babak Taati

Main category: cs.CL

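TL;DR: This paper proposes LoopFormer, a looped Transformer that supports variable compute budgets. A shortcut-consistency training scheme keeps representations consistent across different iteration counts, improving the robustness and scalability of language modeling and reasoning under constrained compute.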

Details Motivation: Existing looped Transformers fix the number of iterations and cannot adjust reasoning depth to the available compute budget, which limits their flexibility in resource-constrained settings. Method: Propose LoopFormer, which conditions each loop on the current time and step size, together with a shortcut-consistency training scheme that aligns the intermediate representations of trajectories of different lengths, so that short loops remain informative and long loops keep refining. Result: LoopFormer is robust on language modeling and reasoning benchmarks, maintains strong performance under aggressive compute constraints, and improves smoothly as the budget grows. Conclusion: Looped Transformers are inherently suited to adaptive language modeling, and LoopFormer offers a feasible path toward controllable, budget-aware large language models. Abstract: Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models.

Guangxin Zhao,Jiahao Zheng,Malaz Boustani,Jarek Nabrzyski,Meng Jiang,Yiyu Shi,Zhi Zheng

Main category: cs.CL

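TL;DR: This paper introduces ADRD-Bench, the first LLM evaluation benchmark dedicated to Alzheimer's Disease and Related Dementias (ADRD), comprising clinical knowledge QA and caregiving practice QA. An evaluation of 33 mainstream LLMs shows that, although some models reach high accuracy, their reasoning consistency and stability fall short, calling for domain-specific improvement grounded in daily caregiving data.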

Details Motivation: Existing medical LLM benchmarks cover ADRD only minimally and do not jointly assess clinical knowledge and real caregiving scenarios, so they cannot support reliable deployment of LLMs in this critical health domain. Method: Build ADRD-Bench with two parts: 1) ADRD Unified QA, 1,352 questions consolidated from seven established medical benchmarks for a unified assessment of clinical knowledge; 2) ADRD Caregiving QA, 149 new caregiving practice questions derived from the Aging Brain Care (ABC) program. Evaluate 33 state-of-the-art LLMs (open-weight general, open-weight medical, closed-source general) and complement the results with case studies. Result: Accuracy ranged from 0.63 to 0.93 (mean 0.78) for open-weight general models, from 0.48 to 0.93 (mean 0.82) for open-weight medical models, and from 0.83 to 0.91 (mean 0.89) for closed-source general models. Top models exceed 0.9 accuracy, but case studies show inconsistent reasoning quality and stability that limit reliability. Conclusion: ADRD-Bench fills the gap in ADRD-specific evaluation. The results show that current LLMs still suffer from unstable reasoning and missing caregiving context in this domain, and need domain adaptation grounded in real caregiving data. Abstract: Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (>0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.

[37] When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration

Jayadev Billa

Main category: cs.CL

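TL;DR: This paper finds that speech-text large models lean heavily toward text when audio and text conflict (a text dominance effect). Using a new benchmark, ALME, it shows that the effect stems from arbitration accessibility rather than from a difference in information quality, and that prompt engineering and fine-tuning strategies can substantially adjust the degree of text dominance.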

Details Motivation: Investigate how multimodal large models, speech-text models in particular, decide under modality conflict (audio and text disagree), and explain why they tend to follow the text rather than the more accurate audio signal. Method: Build ALME, a cross-lingual audio-text conflict benchmark (57,602 controlled stimuli, 8 languages); design several interventions (prompt changes, forced transcription, framing the text as corrupted, projection-layer and LoRA fine-tuning); systematically evaluate modality arbitration on four state-of-the-art audio-LLMs, including Gemini 2.0 Flash. Result: Text dominance is 16.6% under audio-text conflict versus 1.6% under text-text conflict; audio embeddings carry more information than transcripts (audio-only accuracy 97.2% > cascade accuracy 93.9%); forced transcription increases text dominance (19% to 33%), while framing the text as "deliberately corrupted" reduces it by 80%; fine-tuning only the audio projection layer raises text dominance by 26.5%, whereas LoRA on the LLM lowers it by 23.9%. Conclusion: Text dominance arises from the language model's easier reasoning access to text representations, not from missing audio information; modality arbitration is a reliability dimension distinct from standard speech recognition metrics and can be steered through architecture and prompt design. Abstract: When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as "deliberately corrupted" reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (-23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.

[38] Multimodal Fact-Level Attribution for Verifiable Reasoning

David Wan,Han Wang,Ziyang Wang,Elias Stengel-Eskin,Hyunji Lee,Mohit Bansal

Main category: cs.CL

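TL;DR: This paper introduces MuRGAt, a benchmark for evaluating how precisely multimodal large language models attribute facts during complex multi-step reasoning, with emphasis on temporal and modality-level citations over video, audio, and other inputs. It also provides an automatic evaluation framework that correlates strongly with human ratings, and finds that current MLLMs often hallucinate citations and that reasoning depth trades off against attribution accuracy.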

Details Motivation: Existing multimodal attribution benchmarks are limited to simple observation-based scenarios or a narrow set of modalities and cannot assess fact-level attribution in complex multi-step reasoning. Method: Propose the MuRGAt benchmark, which requires models to answer over multimodal inputs (video, audio, and others) with an explicit reasoning chain and precise citations that specify modality and temporal segment; build an automatic evaluation framework that correlates strongly with human judgments. Result: Even when their reasoning is correct, mainstream MLLMs frequently produce wrong citations (citation hallucination); increasing reasoning depth or enforcing structured grounding often lowers accuracy. Conclusion: A significant gap remains between the internal reasoning of current MLLMs and their verifiable attribution; MuRGAt offers a new evaluation standard and analysis lens for trustworthy multimodal reasoning. Abstract: Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.

[39] Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

Jinrui Zhang,Chaodong Xiao,Aoqi Wu,Xindong Zhang,Lei Zhang

Main category: cs.CL

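TL;DR: This paper proposes SPES, a framework that pretrains MoE large language models efficiently in a decentralized setting through sparse expert synchronization and an expert-merging warm-up strategy, substantially lowering GPU memory requirements while preserving performance.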

Details Motivation: Existing decentralized training methods still train the full model on every node and are therefore bounded by GPU memory; a memory-efficient pretraining scheme suited to distributed environments is needed. Method: Propose SParse Expert Synchronization (SPES): each node trains only a subset of experts and periodically synchronizes expert parameters rather than the full model; an expert-merging warm-up strategy accelerates convergence. Result: A 2B-parameter MoE model is trained on 16 GPUs with 48GB each and performs competitively with centrally trained models under similar compute; the approach scales to a 7B model trained from scratch and a 9B model upcycled from a dense checkpoint, both matching prior centralized baselines. Conclusion: SPES enables memory-efficient, scalable, and high-performing decentralized pretraining of MoE LLMs, offering a new option for training large models under resource constraints. Abstract: Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.
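The idea of synchronizing only experts rather than the full model can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not say whether experts are replicated across nodes or how they are merged, so the overlapping expert assignment, the plain averaging rule, and the names `sync_experts`, `node_a`, and `node_b` are hypothetical and not taken from the SPES code.

```python
def sync_experts(nodes):
    """Average each expert's parameters across the nodes that hold a copy.

    nodes: dict mapping node name -> {expert_id: list of float parameters}.
    Only experts present on a node are touched there, so no node ever
    sends or receives the full parameter set.
    """
    # Record which nodes hold each expert.
    holders = {}
    for node, experts in nodes.items():
        for expert_id in experts:
            holders.setdefault(expert_id, []).append(node)

    # Replace every copy of an expert with the element-wise mean of its copies.
    for expert_id, owner_nodes in holders.items():
        copies = [nodes[n][expert_id] for n in owner_nodes]
        averaged = [sum(vals) / len(vals) for vals in zip(*copies)]
        for n in owner_nodes:
            nodes[n][expert_id] = list(averaged)
    return nodes


# Hypothetical layout: expert 1 is shared, experts 0 and 2 are local only.
nodes = {
    "node_a": {0: [1.0, 2.0], 1: [0.0, 0.0]},
    "node_b": {1: [2.0, 4.0], 2: [5.0, 5.0]},
}
sync_experts(nodes)
print(nodes)
```

In this toy layout each node stores two of the three experts, which is the memory saving the abstract refers to; only the shared expert changes during synchronization.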

[40] SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent

Wenlin Zhong,Jinluan Yang,Yiquan Wu,Yi Liu,Jianhang Yao,Kun Kuang

Main category: cs.CL

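TL;DR: SIGHT is a framework that strengthens search-based reasoning. Through Self-Evidence Support (SES) and information-gain-driven diverse branching, it tackles the "tunnel vision" caused by redundant, low signal-to-noise results in multi-turn search and substantially improves performance on complex question answering.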

Details Motivation: In multi-turn search, results are highly redundant and have a low signal-to-noise ratio, so agents fall into "tunnel vision" and accumulate irreversible errors. Method: Propose SIGHT: 1) distill high-fidelity evidence with Self-Evidence Support (SES); 2) compute an Information Gain score to identify pivotal states; 3) use the score to trigger Dynamic Prompting Interventions (de-duplication, reflection, or adaptive branching); 4) combine SES and correctness rewards with Group Relative Policy Optimization (GRPO) to internalize robust exploration strategies. Result: SIGHT significantly outperforms existing approaches on single-hop and multi-hop QA benchmarks, particularly in complex reasoning scenarios, while using fewer search steps. Conclusion: SIGHT mitigates redundancy and error accumulation in RL-driven LLM search and improves reasoning robustness and efficiency without external verifiers. Abstract: Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions (de-duplication, reflection, or adaptive branching) to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, using fewer search steps.
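The abstract describes an Information Gain score that marks states where an observation most reduces uncertainty. A minimal sketch of that general idea, assuming the score is the drop in Shannon entropy of a belief over candidate answers (the paper's exact formula is not given in the abstract, and the belief dictionaries below are made up):

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a {candidate: probability} belief."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def information_gain(before, after):
    """Uncertainty removed by an observation: H(before) - H(after)."""
    return entropy(before) - entropy(after)


# Hypothetical beliefs over candidate answers before and after one search step.
before = {"Paris": 0.25, "Lyon": 0.25, "Nice": 0.25, "Lille": 0.25}
after = {"Paris": 0.5, "Lyon": 0.5}

print(information_gain(before, after))  # log(4) - log(2) = log(2)
```

A search step that leaves the belief unchanged scores zero under this definition, which is one plausible way to flag redundant retrievals.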

[41] PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering

Xiangfeng Wang,Hangyu Guo,Yanlin Lai,Mitt Huang,Liang Zhao,Chengyuan Yao,Yinmin Zhang,Qi Han,Xiaoxiao Ren,Chun Yuan,Tong Xu,Zheng Ge,Xiangyu Zhang,Daxin Jiang

Main category: cs.CL

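TL;DR: This paper introduces PRIME, a benchmark that evaluates how well verifiers check process-outcome consistency in mathematics and engineering, and uses it to improve the RLVR training paradigm, yielding substantial performance gains.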

Details Motivation: Outcome-based verification ignores errors in the derivation and therefore rewards correct answers reached through incorrect derivations; verifying consistency between process and outcome is needed. Method: Build the PRIME benchmark (2,530 high-difficulty STEM problems) with a process-outcome alignment verification task; propose a process-aware RLVR training paradigm and use PRIME to select verifiers. Result: Current verifiers frequently fail to detect derivation flaws; the new paradigm gives Qwen3-14B-Base absolute gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME; verifier accuracy on PRIME correlates strongly and linearly with RLVR effectiveness (R² > 0.92). Conclusion: PRIME exposes the weakness of verifiers in process alignment and is a highly predictive indicator for verifier selection, supporting more reliable and interpretable RLVR. Abstract: While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems, PRIME comprises 2,530 high-difficulty samples selected through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation ($R^2 > 0.92$) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.
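The difference between outcome-only and process-aware verification can be shown with a toy reward function. This is a sketch under assumptions: binary rewards and the two boolean verifier verdicts are chosen for illustration, and the paper's actual reward design may differ.

```python
def outcome_only_reward(answer_correct, derivation_valid):
    """Reward that looks only at the final answer (derivation is ignored)."""
    return 1.0 if answer_correct else 0.0


def process_aware_reward(answer_correct, derivation_valid):
    """Reward only when the answer is correct AND the derivation holds up."""
    return 1.0 if (answer_correct and derivation_valid) else 0.0


# The case the abstract highlights: right answer, flawed derivation.
lucky_guess = {"answer_correct": True, "derivation_valid": False}
print(outcome_only_reward(**lucky_guess))   # rewarded despite the flaw
print(process_aware_reward(**lucky_guess))  # not rewarded
```

The two functions agree on every other combination of verdicts, so the change in training signal comes entirely from responses whose correct answers rest on incorrect derivations.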

[42] Scene-Aware Memory Discrimination: Deciding Which Personal Knowledge Stays

Yijie Zhong,Mengying Guo,Zewei Wang,Zhongyang Li,Dandan Tu,Haofen Wang

Main category: cs.CL

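TL;DR: This paper proposes a Scene-Aware Memory Discrimination method (SAMD) that uses a Gating Unit Module (GUM) and a Cluster Prompting Module (CPM) to help large language models filter and organize personal knowledge from user interaction data efficiently and accurately.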

Details Motivation: Current LLM-based research on memory writing, management, and reading suffers from imprecise information filtering and rising computational cost; the human mechanism of selective attention suggests a remedy. Method: Define a memory discrimination task and design SAMD, which comprises the Gating Unit Module (GUM) for filtering out non-memorable interactions and focusing on salient content, and the Cluster Prompting Module (CPM) for establishing adaptive memory standards and analyzing the relationship between user intents and memory contexts to build effective clustering prompts. Result: SAMD recalls the majority of memorable data in the memory discrimination task and stays robust in dynamic scenarios; integrated into personalized applications, it significantly improves the efficiency and quality of memory construction. Conclusion: By borrowing the brain's selective attention mechanism, SAMD addresses the key challenge of filtering and organizing memory in large-scale user interaction data and offers a new approach to personal knowledge management in personalized applications. Abstract: Intelligent devices have become deeply integrated into everyday life, generating vast amounts of user interactions that form valuable personal knowledge. Efficient organization of this knowledge in user memory is essential for enabling personalized applications. However, current research on memory writing, management, and reading using large language models (LLMs) faces challenges in filtering irrelevant information and in dealing with rising computational costs. Inspired by the concept of selective attention in the human brain, we introduce a memory discrimination task. To address large-scale interactions and diverse memory standards in this task, we propose a Scene-Aware Memory Discrimination method (SAMD), which comprises two key components: the Gating Unit Module (GUM) and the Cluster Prompting Module (CPM). GUM enhances processing efficiency by filtering out non-memorable interactions and focusing on the salient content most relevant to application demands. CPM establishes adaptive memory standards, guiding LLMs to discern what information should be remembered or discarded. It also analyzes the relationship between user intents and memory contexts to build effective clustering prompts. Comprehensive direct and indirect evaluations demonstrate the effectiveness and generalization of our approach. We independently assess the performance of memory discrimination, showing that SAMD successfully recalls the majority of memorable data and remains robust in dynamic scenarios. Furthermore, when integrated into personalized applications, SAMD significantly enhances both the efficiency and quality of memory construction, leading to better organization of personal knowledge.

[43] PACE: Prefix-Protected and Difficulty-Aware Compression for Efficient Reasoning

Ruixiang Feng,Yuntao Wen,Silin Zhou,Ke Shi,Yifan Wang,Ran Le,Zhenwei An,Zongchao Chen,Chen Yang,Guangyue Peng,Yiming Jia,Dongsheng Wang,Tao Zhang,Lisi Chen,Yang Song,Shen Gao,Shuo Shang

Main category: cs.CL

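TL;DR: This paper proposes PACE, a dual-level compression framework that uses prefix protection and difficulty awareness to cut redundant computation and token consumption in Language Reasoning Models (LRMs) while keeping reasoning valid and improving accuracy.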

Details Motivation: Language Reasoning Models (LRMs) tend to "overthink" when test-time computation is scaled up, producing overly long reasoning chains with high latency and memory usage; uniform length penalties compress crucial early reasoning steps at the sequence level and treat all queries identically at the group level. Method: Propose the PACE framework: at the sequence level, prefix-protected optimization uses decaying mixed rollouts to preserve valid reasoning paths while promoting conciseness; at the group level, a difficulty-aware penalty scales length constraints with query complexity. Result: On DeepSeek-R1-Distill-Qwen (1.5B/7B), PACE reduces token usage by up to 55.7% while improving accuracy on math benchmarks by up to 4.1%, and generalizes to code, science, and general domains. Conclusion: A dual-level compression strategy balances reasoning quality and efficiency and offers a new approach to deploying LRMs efficiently. Abstract: Language Reasoning Models (LRMs) achieve strong performance by scaling test-time computation but often suffer from "overthinking", producing excessively long reasoning traces that increase latency and memory usage. Existing LRMs typically enforce conciseness with uniform length penalties, which over-compress crucial early deduction steps at the sequence level and indiscriminately penalize all queries at the group level. To address these limitations, we propose PACE, a dual-level framework for prefix-protected and difficulty-aware compression under hierarchical supervision. At the sequence level, prefix-protected optimization employs decaying mixed rollouts to maintain valid reasoning paths while promoting conciseness. At the group level, difficulty-aware penalty dynamically scales length constraints based on query complexity, maintaining exploration for harder questions while curbing redundancy on easier ones. Extensive experiments on DeepSeek-R1-Distill-Qwen (1.5B/7B) demonstrate that PACE achieves a substantial reduction in token usage (up to 55.7%) while simultaneously improving accuracy (up to 4.1%) on math benchmarks, with generalization ability to code, science, and general domains.
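One way to picture a difficulty-aware length penalty is to scale the penalty by how easy a query looks within its group of rollouts. The sketch below is hypothetical: using the group pass rate as the difficulty estimate, the linear penalty, and the parameters `max_len` and `base_penalty` are all assumptions made for illustration, not the formulation used by PACE.

```python
def difficulty_aware_reward(correct, length, group_correct,
                            max_len=4096, base_penalty=0.5):
    """Correctness reward minus a length penalty that shrinks on hard queries.

    group_correct: list of booleans, one per rollout sampled for the same query.
    A high pass rate (easy query) gives the full length penalty; a low pass
    rate (hard query) relaxes it so long reasoning is not discouraged.
    """
    pass_rate = sum(group_correct) / len(group_correct)
    penalty_scale = base_penalty * pass_rate
    length_penalty = penalty_scale * min(length, max_len) / max_len
    return (1.0 if correct else 0.0) - length_penalty


# Same correct 2048-token response, judged as part of an easy and a hard group.
easy = difficulty_aware_reward(True, 2048, [True, True, True, True])
hard = difficulty_aware_reward(True, 2048, [True, False, False, False])
print(easy, hard)
```

With these toy numbers the identical response is penalized less when its query is hard, which matches the abstract's stated goal of keeping exploration on harder questions while curbing redundancy on easier ones.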

[44] Which Feedback Works for Whom? Differential Effects of LLM-Generated Feedback Elements Across Learner Profiles

Momoka Furuhashi,Kouta Nakayama,Noboru Kawai,Takashi Kodama,Saku Sugawara,Kyosuke Takami

Main category: cs.CL

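TL;DR: This study examines how different elements of LLM-generated educational feedback (such as tone and information coverage) affect learning outcomes and learner acceptance, and how these effects relate to the Big Five personality traits.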

Details Motivation: It remains unclear how specific elements of LLM-generated feedback (such as tone and information coverage) affect learning outcomes and acceptance for learners with different personality traits. Method: Define six feedback elements and generate feedback for multiple-choice questions with GPT-5; run an experiment with 321 first-year high school students, evaluating feedback with two learning outcome measures and six subjective criteria; analyze differences in feedback acceptance across Big Five personality clusters. Result: Effective feedback elements share common patterns in supporting learning outcomes, while learners' subjective preferences differ across personality clusters. Conclusion: LLM-generated feedback should select and adapt feedback elements to learners' personality traits, which has practical implications for personalized feedback design in education. Abstract: Large language models (LLMs) show promise for automatically generating feedback in education settings. However, it remains unclear how specific feedback elements, such as tone and information coverage, contribute to learning outcomes and learner acceptance, particularly across learners with different personality traits. In this study, we define six feedback elements and generate feedback for multiple-choice biology questions using GPT-5. We conduct a learning experiment with 321 first-year high school students and evaluate feedback effectiveness using two learning outcome measures and subjective evaluations across six criteria. We further analyze how feedback acceptance varies across learners based on Big Five personality traits. Our results show that effective feedback elements share common patterns supporting learning outcomes, while learners' subjective preferences differ across personality-based clusters. These findings highlight the importance of selecting and adapting feedback elements according to learners' personality traits when designing LLM-generated feedback, and provide practical implications for personalized feedback design in education.

[45] PatientHub: A Unified Framework for Patient Simulation

Sahand Sabour,TszYam NG,Minlie Huang

Main category: cs.CL

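TL;DR: This paper presents PatientHub, a unified and modular framework for standardizing simulated patient dialogue, addressing the lack of common data formats, prompts, and evaluation metrics in existing work.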

Details Motivation: Work on patient simulation is fragmented and lacks common data formats, prompts, and evaluation metrics, which hinders reproducibility and fair comparison. Method: Propose the PatientHub framework, which standardizes how simulated patients are defined, composed, and deployed, and validate its modularity, extensibility, and ease of use through several case studies and two new simulator variants. Result: A unified framework that supports cross-method evaluation, integration of custom evaluation metrics, and rapid prototyping, lowering the barrier to developing new methods and facilitating cross-model benchmarking. Conclusion: PatientHub provides a practical foundation for patient-centered dialogue systems and for future datasets, methods, and benchmarks, and its code is open source. Abstract: As Large Language Models increasingly power role-playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non-standardized data formats, prompts, and evaluation metrics, hindering reproducibility and fair comparison. In this paper, we introduce PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. To demonstrate PatientHub's utility, we implement several representative patient simulation methods as case studies, showcasing how our framework supports standardized cross-method evaluation and the seamless integration of custom evaluation metrics. We further demonstrate PatientHub's extensibility by prototyping two new simulator variants, highlighting how PatientHub accelerates method development by eliminating infrastructure overhead. By consolidating existing work into a single reproducible pipeline, PatientHub lowers the barrier to developing new simulation methods and facilitates cross-method and cross-model benchmarking. Our framework provides a practical foundation for future datasets, methods, and benchmarks in patient-centered dialogue, and the code is publicly available via https://github.com/Sahandfer/PatientHub.

[46] Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models

Katrin Olsen,Sebastian Padó

Main category: cs.CL

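TL;DR: This paper collects sensicality judgments from human raters and large language models (LLMs) on sentences from five semantically deviant datasets. It finds that most sentences are merely anomalous rather than truly nonsensical, and that LLMs are good at generating plausible contexts for anomalous sentences.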

Details Motivation: It is unclear how nonsensical existing semantically deviant datasets really are, and whether LLMs can reliably distinguish "anomalous" from "truly nonsensical" sentences. Method: Collect sensicality judgments from human raters and LLMs on five semantically deviant datasets, both without context and with a provided context. Result: Human raters consider most sentences at most anomalous and only a few truly nonsensical; LLMs show considerable skill in generating plausible contexts for anomalous sentences. Conclusion: Most current semantically deviant datasets are not truly nonsensical, and LLMs are already fairly capable of interpreting and completing anomalous meaning, which suggests that the definition and evaluation of "nonsense" should be revisited. Abstract: Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets, both context-free and with a provided context. We find that raters consider most sentences at most anomalous, and only a few as properly nonsensical. We also show that LLMs are substantially skilled in generating plausible contexts for anomalous cases.

[47] Thinking with Drafting: Optical Decompression via Logical Reconstruction

Jingxuan Wei,Honghao He,Caijun Jia,Siyuan Li,Zheng Sun,Yuhang Xu,Yuanyuan Lin,Linzhuang Sun,Yuchen Wu,Bihui Yu,Xiangxiang Zhang,Cheng Tan

Main category: cs.CL

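TL;DR: This paper proposes Thinking with Drafting (TwD), which recasts visual reasoning as an optical decompression process and uses a Domain-Specific Language (DSL) as the intermediate representation to enable verifiable visual logical reasoning.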

Details Motivation: Multimodal large models face a precision paradox in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts that lack mathematical exactness. Method: Frame visual reasoning as optical decompression; introduce Thinking with Drafting (TwD), which uses a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation and forces the model to draft its mental model into executable code that renders deterministic visual proofs. Result: On VisAlg, a visual algebra benchmark built by the authors, TwD improves visual reasoning performance and forms a closed loop in which visual generation serves as a logical verifier. Conclusion: TwD offers a generalizable, verifiable approach to visual reasoning that turns visual generation from a creative output into a means of logical verification. Abstract: Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression, the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.

[48] Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Futing Wang,Jianhao Yan,Yun Luo,Ganqu Cui,Zhi Wang,Xiaoye Qu,Yue Zhang,Yu Cheng,Tao Lin

Main category: cs.CL

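TL;DR: This paper proposes Length-Incentivized Exploration (LIE), which combines a length reward with a redundancy penalty to strengthen the model's ability to generate, verify, and refine multiple hypotheses in context. It alleviates the "Shallow Exploration Trap" and improves performance on both in-domain and out-of-domain tasks.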

Details Motivation: Models struggle to perform effective In-Context Exploration at test time because the probability of sampling long reasoning trajectories decays exponentially during autoregressive generation, creating a "Shallow Exploration Trap" that limits state coverage. Method: Propose Length-Incentivized Exploration (LIE): a length-based reward coupled with a redundancy penalty, applied in a two-step manner to maximize state coverage. Result: Validated on models including Qwen3 and Llama, LIE yields an average improvement of 4.4% on in-domain tasks and 2.7% on out-of-domain benchmarks. Conclusion: An explicit length incentive effectively strengthens in-context exploration in large models and is a key path past the test-time scaling bottleneck. Abstract: Achieving effective test-time scaling requires models to engage in In-Context Exploration, the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration (LIE). This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that LIE effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
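A length-based reward paired with a redundancy penalty can be sketched as below. This is an illustrative assumption, not the paper's formula: measuring redundancy as the fraction of repeated n-grams, capping the length reward at `target_len`, and the weight `beta` are all hypothetical choices.

```python
def repeated_ngram_fraction(tokens, n=3):
    """Share of n-grams that duplicate an earlier n-gram (0.0 = no repeats)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def length_incentive(tokens, target_len=8, beta=1.0, n=3):
    """Reward longer outputs up to target_len, minus a penalty for repetition."""
    length_reward = min(len(tokens), target_len) / target_len
    return length_reward - beta * repeated_ngram_fraction(tokens, n)


diverse = list("abcdefgh")   # 8 distinct tokens, no repeated trigrams
looping = ["a", "b"] * 4     # same length, but cycles through two tokens

print(length_incentive(diverse))
print(length_incentive(looping))
```

Both toy sequences have the same length, so the length reward alone cannot separate them; the redundancy term is what keeps the model from earning reward by simply repeating itself.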

[49] MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM Team,Wenhao An,Yingfa Chen,Yewei Fang,Jiayi Li,Xin Li,Yaohui Li,Yishan Li,Yuxuan Li,Biyuan Lin,Chuan Liu,Hezi Liu,Siyuan Liu,Hongya Lyu,Yinxu Pan,Shixin Ren,Xingyu Shen,Zhou Su,Haojun Sun,Yangang Sun,Zhen Leng Thai,Xin Tian,Rui Wang,Xiaorong Wang,Yudong Wang,Bo Wu,Xiaoyue Xu,Dong Xu,Shuaikang Xue,Jiawei Yang,Bowen Zhang,Jinqian Zhang,Letian Zhang,Shengnan Zhang,Xinyu Zhang,Xinyuan Zhang,Zhu Zhang,Hengyu Zhao,Jiacheng Zhao,Jie Zhou,Zihan Zhou,Shuo Wang,Chaojun Xiao,Xu Han,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

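TL;DR: This paper presents MiniCPM-SALA, a 9B-parameter hybrid attention architecture that combines the strengths of sparse and linear attention, substantially improving long-context efficiency and inference speed while preserving performance, and cutting training cost through a low-cost continual training framework.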

Details Motivation: As large language models move toward ultra-long-context applications, the Transformer architecture incurs high compute and memory costs; existing sparse and linear attention methods typically trade memory efficiency against performance, so an approach that delivers both is needed. Method: MiniCPM-SALA is a hybrid architecture that combines InfLLM-V2 (sparse attention) with Lightning Attention (linear attention), assigns the two in a 1:3 ratio using a layer selection algorithm, and adds a hybrid positional encoding (HyPE); a low-cost continual training framework converts pre-trained Transformer models into hybrid models. Result: On a single NVIDIA A6000D GPU, inference at a 256K sequence length is up to 3.5x as fast as the full-attention model, and contexts of up to 1M tokens are supported (a scale at which traditional 8B full-attention models fail for lack of memory); general capabilities are comparable to full-attention models, and training cost drops by about 75%. Conclusion: MiniCPM-SALA improves efficiency and scalability in long-context settings without sacrificing model capability, offering a practical and economical hybrid-attention approach to deploying ultra-long-context LLMs. Abstract: The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.

[50] A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments

Anne-Marie Lutgen,Alistair Plum,Christoph Purschke

Main category: cs.CL

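TL;DR: This paper proposes an embedding-based method for detecting variation that needs no prior normalisation or predefined variant lists; using subword embeddings and similarity-based clustering, it reveals lexical and orthographic variation in Luxembourgish and supports both quantitative and qualitative analysis.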

Details Motivation: Systematically identifying linguistic variants is difficult in low-resource or "noisy" text, a particular need for dialectal and sociolinguistic research on small languages and in multilingual settings. Method: Subword embeddings are trained on raw text, and related forms are clustered using a combination of cosine similarity and n-gram similarity, exposing patterns of spelling and morphological variation. Result: On a large dataset of Luxembourgish user comments, the method identifies extensive lexical and orthographic variation consistent with dialectological and sociolinguistic findings, producing interpretable variant families that reveal regional and stylistic differentiation. Conclusion: Distributional modelling can reveal meaningful patterns of language variation under low-resource conditions and provides a reproducible methodological framework for studying variation in multilingual and small-language contexts. Abstract: This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
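
The grouping of related forms through combined cosine and n-gram similarity can be sketched as follows. This is a minimal illustration under stated assumptions: the equal weighting `alpha`, the character-bigram Jaccard measure, the threshold, and the single-link grouping are hypothetical choices, and the embedding vectors are supplied by the caller rather than trained here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def char_ngrams(word, n=2):
    """Set of character n-grams, with boundary markers added to the word."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Jaccard overlap of the two words' character n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    return len(x & y) / len(x | y)

def combined_similarity(a, b, vectors, alpha=0.5):
    """Weighted mix of embedding similarity and surface-form similarity."""
    return alpha * cosine(vectors[a], vectors[b]) + (1 - alpha) * ngram_similarity(a, b)

def variant_families(words, vectors, threshold=0.6):
    """Single-link grouping: words joined by any pair above the threshold."""
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if combined_similarity(a, b, vectors) >= threshold:
                parent[find(a)] = find(b)
    families = {}
    for w in words:
        families.setdefault(find(w), []).append(w)
    return sorted(sorted(f) for f in families.values())
```

Requiring both signals to contribute keeps distributionally similar but unrelated words apart, and also keeps look-alike strings with different usage from being merged.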

[51] DMAP: A Distribution Map for Text

Tom Kempton,Julia Rozanova,Parameswaran Kamalaruban,Maeve Madigan,Karolina Wresilo,Yoann L. Launay,David Sutton,Stuart Burrell

Main category: cs.CL

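TL;DR: This paper presents DMAP, a method that maps a text to a set of samples in the unit interval that jointly encode rank and probability information, enabling context-aware statistical analysis of large language model outputs.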

Details Motivation: Existing text analysis based on metrics such as perplexity does not adequately account for how the shape of the conditional probability distribution (that is, the number of reasonable candidate tokens) affects the interpretation of a probability, and so lacks context sensitivity. Method: DMAP (a distribution map for text) takes the conditional probability distribution that an LLM produces for each token of a text and, from the token's rank and probability, constructs a sample representation in the unit interval; the representation does not depend on a specific model architecture. Result: Three case studies demonstrate DMAP's usefulness: (i) validating generation parameters to ensure data integrity; (ii) examining the role of probability curvature in detecting machine-generated text; (iii) finding identifiable statistical fingerprints left in downstream models post-trained on synthetic data. Conclusion: DMAP provides a unified, lightweight, model-agnostic statistical representation of text that can be computed efficiently on consumer hardware, applies to a range of text analysis tasks, and lays a foundation for further LLM-based text research. Abstract: Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.

[52] Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems

Wanxing Wu,He Zhu,Yixia Li,Lei Yang,Jiehui Zhao,Hongru Wang,Jian Yang,Benyou Wang,Bingyi Jing,Guanhua Chen

Main category: cs.CL

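TL;DR: This paper proposes the RouterXBench evaluation framework and ProbeDirichlet, a lightweight routing method that models uncertainty from the model's internal hidden states, improving routing accuracy and cross-domain robustness in local-cloud LLM collaboration.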

Details Motivation: Existing evaluations of LLM routing are unsystematic, overlooking scenario fit and out-of-distribution robustness, and existing routers rely on output probabilities or external embeddings, which do not effectively capture a model's intrinsic uncertainty. Method: The authors build RouterXBench, an evaluation framework with three dimensions (router ability, scenario alignment, cross-domain robustness), and propose ProbeDirichlet, a router that aggregates hidden states from multiple layers via learnable Dirichlet distributions and is trained probabilistically. Result: ProbeDirichlet achieves relative improvements of 16.68% in router ability and 18.86% in high-accuracy scenarios over the best baselines, and performs consistently across model families, model scales, heterogeneous tasks, and agentic workflows. Conclusion: Uncertainty-aware routing based on internal hidden states is more reliable, and RouterXBench offers a systematic, extensible basis for designing and evaluating routers. Abstract: Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.

[53] LLM-based Triplet Extraction from Financial Reports

Dante Wesslund,Ville Stenström,Pontus Linde,Alexander Holmberg

Main category: cs.CL

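TL;DR: This paper presents a semi-automated triplet extraction pipeline for corporate financial reports that replaces ground-truth-based evaluation with ontology-driven proxy metrics (Ontology Conformance and Faithfulness) and improves extraction quality through automatic ontology induction and a hybrid verification strategy.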

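Details Motivation: Corporate financial reports are an important source of structured knowledge for building knowledge graphs, but the domain lacks annotated ground truth, which makes evaluation difficult. Method: The authors design a semi-automated triplet extraction pipeline with ontology-driven proxy metrics (Ontology Conformance and Faithfulness), compare a static hand-built ontology against fully automated, document-specific ontology induction, apply a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check, and analyze the systematic asymmetry between subject and object hallucinations. Result: The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift seen with the manual ontology; hybrid verification lowers the apparent subject hallucination rate from 65.2% to 1.6%; subject and object hallucinations show a systematic asymmetry, attributed to passive constructions and omitted agents in financial prose. Conclusion: Ontology-driven proxy evaluation and automatic ontology induction ease the evaluation problem caused by missing annotations in the financial domain, and hybrid verification removes most spurious hallucination flags, giving a reproducible and interpretable approach to knowledge extraction in specialized domains. Abstract: Corporate financial reports are a valuable source of structured knowledge for Knowledge Graph construction, but the lack of annotated ground truth in this domain makes evaluation difficult. We present a semi-automated pipeline for Subject-Predicate-Object triplet extraction that uses ontology-driven proxy metrics, specifically Ontology Conformance and Faithfulness, instead of ground-truth-based evaluation. We compare a static, manually engineered ontology against a fully automated, document-specific ontology induction approach across different LLMs and two corporate annual reports. The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift observed with the manual approach. We also propose a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check, reducing apparent subject hallucination rates from 65.2% to 1.6% by filtering false positives caused by coreference resolution. Finally, we identify a systematic asymmetry between subject and object hallucinations, which we attribute to passive constructions and omitted agents in financial prose.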
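
A hybrid verification step of the kind described (regex matching first, an LLM-as-a-judge check only when the regex fails) might look like the sketch below. The function names are hypothetical, and `judge` is an assumed callable standing in for an LLM call; the abstract does not describe the authors' prompts or matching rules.

```python
import re

def appears_verbatim(entity, source_text):
    """Cheap first pass: does the entity occur as a whole phrase in the source?"""
    pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
    return re.search(pattern, source_text, flags=re.IGNORECASE) is not None

def is_hallucinated(entity, source_text, judge):
    """Flag an entity only if both the regex pass and the judge reject it.

    `judge(entity, source_text)` is a hypothetical callable that returns True
    when the entity is supported by the text, for example through coreference.
    """
    if appears_verbatim(entity, source_text):
        return False
    return not judge(entity, source_text)
```

With regex alone, a subject such as a company name that the passage refers to only as "the company" would be counted as hallucinated; the second pass exists to filter exactly those false positives.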

[54] Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences

Eddie Yang,Dashun Wang

Main category: cs.CL

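TL;DR: This paper shows that beneath the apparent convergence of large language models (LLMs) in benchmark accuracy lies substantial epistemic divergence: different models often give different answers to the same question. In scientific data annotation and inference, this hidden disagreement can severely distort research results and threatens reproducibility.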

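Details Motivation: Current LLM evaluation relies heavily on aggregate accuracy and ignores whether models agree on individual predictions, which can lead to misjudging model capability and reliability, with serious consequences in scientific applications. Method: Using two major reasoning benchmarks, MMLU-Pro and GPQA, the authors quantify item-level agreement among multiple LLMs, frontier models in particular; they then re-analyze published studies in education and political science, swapping the LLM used for annotation and measuring the effect on estimated treatment effects. Result: Even at similar accuracy, LLMs disagree on 16-66% of items, and frontier models on 16-38%; in the re-analyses, changing the annotation model shifts estimated treatment effects by more than 80% and in some cases reverses their sign. Conclusion: Convergent benchmark accuracy can amount to a "benchmark illusion" that hides real differences between models; model choice should be treated as a hidden but consequential variable for the reliability of scientific conclusions, and agreement and disagreement should be considered explicitly in evaluation and use. Abstract: Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks, MMLU-Pro and GPQA, we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and 16-38% among top-performing frontier models. These discrepancies suggest distinct error profiles for different LLMs. When such models are used for scientific data annotation and inference, their hidden disagreements propagate into research results: in re-analyses of published studies in education and political science, switching the annotation model can change estimated treatment effects by more than 80%, and in some cases reverse their sign. Together, these findings illustrate a benchmark illusion, where equal accuracy may conceal disagreement, with model choice becoming a hidden yet consequential variable for scientific reproducibility.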
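
The point that equal accuracy can conceal disagreement is easy to see on a toy example. The sketch below uses invented predictions for two hypothetical models; the data are illustrative only and are not taken from MMLU-Pro, GPQA, or the paper.

```python
def accuracy(preds, gold):
    """Share of items answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def disagreement_rate(preds_a, preds_b):
    """Share of items on which two models give different answers."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Invented five-item benchmark and two hypothetical models.
gold = ["A", "B", "C", "D", "A"]
model_x = ["A", "B", "C", "A", "B"]
model_y = ["A", "C", "D", "D", "A"]

print(accuracy(model_x, gold), accuracy(model_y, gold))  # 0.6 0.6
print(disagreement_rate(model_x, model_y))               # 0.8
```

Both models score 60%, yet they agree on only one item in five, so annotations produced by one would differ substantially from annotations produced by the other.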

[55] AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection

Pretam Ray,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum

Main category: cs.CL

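TL;DR: This paper proposes AdaptEvolve, which uses generation confidence to dynamically choose an LLM suited to the current inference step, achieving a better balance between computational efficiency and reasoning capability within a multi-LLM evolutionary refinement framework.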

Details Motivation: Evolutionary agentic systems invoke large language models (LLMs) repeatedly during inference, sharpening the trade-off between computational efficiency and reasoning capability; existing model-cascade routing strategies rely on static heuristics or external controllers and do not explicitly model model uncertainty. Method: AdaptEvolve estimates real-time solvability from intrinsic generation confidence and selects an LLM adaptively, embedded in an evolutionary sequential refinement framework. Result: Across benchmarks, total inference cost falls by 37.9% on average while 97.5% of the upper-bound accuracy of static large-model baselines is retained, yielding a more favourable Pareto frontier. Conclusion: Confidence-based dynamic LLM selection eases the efficiency-capability trade-off and gives evolutionary agents a lightweight, robust, and cost-effective inference strategy. Abstract: Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade-off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve (Adaptive LLM Selection for Multi-LLM Evolutionary Refinement), which operates within an evolutionary sequential refinement framework and leverages intrinsic generation confidence to estimate real-time solvability. Empirical results show that confidence-driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines. Our code is available at https://github.com/raypretam/adaptive_llm_selection.
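
Confidence-driven selection between a small and a large model can be sketched as a two-step cascade. This is a minimal illustration, not the AdaptEvolve implementation: the geometric-mean confidence measure, the fixed threshold, and the model interface (a callable returning text plus per-token log-probabilities) are all assumptions.

```python
import math

def mean_token_confidence(token_logprobs):
    """Geometric-mean token probability of a generation, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def generate_with_escalation(prompt, small_model, large_model, threshold=0.8):
    """Try the small model first; escalate only when its confidence is low.

    Each model is a hypothetical callable: prompt -> (text, token_logprobs).
    """
    text, logprobs = small_model(prompt)
    if mean_token_confidence(logprobs) >= threshold:
        return text, "small"
    text, _ = large_model(prompt)
    return text, "large"
```

The cost saving in such a scheme comes from the steps where the small model's answer is kept; the threshold controls how much accuracy is traded for that saving.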

[56] Cross-Modal Robustness Transfer (CMRT): Training Robust Speech Translation Models Using Adversarial Text

Abderrahmane Issam,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis

Main category: cs.CL

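TL;DR: This paper proposes the Cross-Modal Robustness Transfer (CMRT) framework, which transfers adversarial robustness from the text modality to the speech modality without generating adversarial speech data, significantly improving the robustness of end-to-end speech translation models to attacks on inflectional morphology.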

Details Motivation: Current end-to-end speech translation models perform well on clean data but are not robust to the inflectional variation found in non-native or dialectal speech in real-world use, and generating high-quality adversarial speech data is costly and difficult. Method: The authors adapt a text-based adversarial attack on inflectional morphology to the speech domain and propose CMRT, which uses cross-modal knowledge transfer to carry adversarial robustness learned in the text modality over to the speech modality, avoiding training on adversarial speech data. Result: Experiments on four language pairs show that CMRT improves adversarial robustness by more than 3 BLEU points on average, clearly outperforming the baseline, with no need to generate adversarial speech data. Conclusion: CMRT offers an efficient and practical way to build robust end-to-end speech translation systems and sets a new baseline for robustness without adversarial speech data. Abstract: End-to-End Speech Translation (E2E-ST) has seen significant advancements, yet current models are primarily benchmarked on curated, "clean" datasets. This overlooks critical real-world challenges, such as morphological robustness to inflectional variations common in non-native or dialectal speech. In this work, we adapt a text-based adversarial attack targeting inflectional morphology to the speech domain and demonstrate that state-of-the-art E2E-ST models are highly vulnerable to it. While adversarial training effectively mitigates such risks in text-based tasks, generating high-quality adversarial speech data remains computationally expensive and technically challenging. To address this, we propose Cross-Modal Robustness Transfer (CMRT), a framework that transfers adversarial robustness from the text modality to the speech modality. Our method eliminates the requirement for adversarial speech data during training. Extensive experiments across four language pairs demonstrate that CMRT improves adversarial robustness by an average of more than 3 BLEU points, establishing a new baseline for robust E2E-ST without the overhead of generating adversarial speech.

[57] Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance

Yunchong Huang,Gianni Barlacchi,Sandro Pezzelle

Main category: cs.CL

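TL;DR: This paper examines why large language models fall short on question-answering tasks, identifies underspecified questions as an important cause, and tests their impact by building a classifier to detect them and running a rewriting experiment.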

Details Motivation: Large language models still fall short on standard question-answering benchmarks, and the authors argue this is partly because questions are underspecified, lacking the context needed to pin down a unique interpretation. Method: An LLM-based classifier identifies underspecified questions and is applied to several widely used QA datasets; a controlled rewriting experiment then rewrites underspecified questions into fully specified versions while keeping the gold answers unchanged. Result: Between 16% and over 50% of benchmark questions are underspecified, and LLMs perform significantly worse on them; QA performance improves consistently after rewriting, indicating that many failures stem from the questions rather than from model ability. Conclusion: Underspecification is an important confound in QA evaluation, and benchmark design should pay closer attention to question clarity. Abstract: Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions: queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.

[58] Do Large Language Models Adapt to Language Variation across Socioeconomic Status?

Elisa Bassignana,Mike Zhang,Dirk Hovy,Amanda Cercas Curry

Main category: cs.CL

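TL;DR: This paper studies how well large language models (LLMs) adapt their linguistic style to communities of different socioeconomic status (SES), finding that they adjust style only slightly and lean toward emulating upper-SES language, which may deepen linguistic inequality and undermine the validity of social science research.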

Details Motivation: As LLMs increasingly mediate human communication, a failure to adapt to the linguistic styles of different social groups can reinforce stereotypes, marginalize communities whose linguistic norms the models mirror less closely, and deepen social stratification. Method: The authors build a new SES-stratified dataset from Reddit and YouTube, prompt four LLMs with incomplete texts from it, and compare the generated completions to the originals on 94 sociolinguistic metrics (syntactic, rhetorical, lexical, and others). Result: LLMs modulate SES-related style only minimally, often producing rough approximation or caricature, and are better at emulating upper-SES style. Conclusion: LLMs risk amplifying linguistic hierarchies, and their style generation is insufficient to support social science applications that treat language style as a social signal, such as agent-based social simulation and survey experiments. Abstract: Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.

[59] Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models

Yuzhe Shang,Pengzhi Gao,Wei Liu,Jian Luan,Jinsong Su

Main category: cs.CL

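TL;DR: This paper studies open large language models (LLMs) for multilingual machine translation (MT) and, building on the Gemma3 model family, develops MiLMMT-46, a model covering 46 languages that outperforms several state-of-the-art open models across evaluations and is competitive with proprietary systems such as Google Translate and Gemini 3 Pro.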

Details Motivation: The multilingual capabilities of open LLMs keep improving, but systematic evaluation on multilingual machine translation, and methods for adapting them to it (such as model and data scaling strategies), remain underexplored. Method: Using continual pretraining and instruction finetuning on the Gemma3 model family, the authors build the multilingual MT model MiLMMT-46 and systematically analyze how model scale and data scale affect performance. Result: MiLMMT-46 reaches top-tier performance on multilingual MT across 46 languages, consistently outperforming recent open SOTA models including Seed-X, HY-MT-1.5, and TranslateGemma, and is competitive with strong proprietary systems such as Google Translate and Gemini 3 Pro. Conclusion: Scaling model and data together substantially improves open LLMs on multilingual MT, and MiLMMT-46 shows that open models have the potential to substitute for proprietary systems in high-quality multilingual translation. Abstract: Open large language models (LLMs) have demonstrated improving multilingual capabilities in recent years. In this paper, we present a study of open LLMs for multilingual machine translation (MT) across a range of languages, and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and achieves competitive performance with strong proprietary systems such as Google Translate and Gemini 3 Pro.

[60] DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling

Mariia Fedorova,Andrey Kutuzov,Khonzoda Umarova

Main category: cs.CL

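TL;DR: This paper introduces DHPLT, an open collection of diachronic corpora in 41 languages built from HPLT web-crawl data and organized into three time periods (2011-2015, 2020-2021, and 2024-present), with 1 million documents per language per period plus pre-computed word embeddings and lexical substitutions, intended to fill the gap in corpora for multilingual diachronic semantic change modelling, especially for low-resource languages.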

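Details Motivation: Multilingual diachronic corpora for semantic change modelling are scarce, especially for low-resource languages, which limits the range of experiments and the language coverage of the field. Method: The collection is built on the web-crawled HPLT corpora, using crawl timestamps to approximate document creation time and dividing the data into three time periods; 1 million documents are collected per language per period; word type and token embeddings and lexical substitutions are pre-computed for the chosen target words; and the data are left open so that other researchers can select their own target words. Result: DHPLT covers 41 languages and three time periods, provides corpora, embeddings, and substitution data in a standardized format, and is released openly. Conclusion: DHPLT addresses the shortage of corpus resources for multilingual diachronic semantic change research, supports more languages and new experimental designs, and helps extend the field to low-resource languages. Abstract: In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as the approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while at the same time leaving it open for other researchers to come up with their own target words using the same datasets. DHPLT aims at filling in the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at https://data.hplt-project.org/three/diachronic/, sorted by language.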

[61] Automatic Simplification of Common Vulnerabilities and Exposures Descriptions

Varpu Vehomäki,Kimmo K. Kaski

Main category: cs.CL

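TL;DR: This study examines the use of large language models (LLMs) to simplify Common Vulnerabilities and Exposures (CVE) descriptions, establishes a baseline for automatic text simplification (ATS) in cyber security along with a test dataset of 40 CVE descriptions, and finds through two rounds of expert surveys that off-the-shelf LLMs improve surface readability but struggle to preserve meaning.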

Details Motivation: Cyber security information is hard for non-specialists to understand, and existing research on automatic text simplification has not yet covered this fast-changing and highly complex domain, so targeted methods are needed. Method: The authors build a cyber security ATS baseline and a test dataset of 40 CVE descriptions, evaluate it by hand through two survey rounds with cyber security experts, and test how off-the-shelf large language models perform on the simplification task. Result: Off-the-shelf LLMs make the text look simpler on the surface but perform poorly at preserving key meaning; the study provides the first ATS benchmark and public dataset for CVE descriptions. Conclusion: Applying existing LLMs directly to CVE simplification risks distorting meaning, and future work should incorporate domain knowledge to improve accuracy and reliability. Abstract: Understanding cyber security is increasingly important for individuals and organizations. However, a lot of information related to cyber security can be difficult to understand for those not familiar with the topic. In this study, we focus on investigating how large language models (LLMs) could be utilized in automatic text simplification (ATS) of Common Vulnerabilities and Exposures (CVE) descriptions. Automatic text simplification has been studied in several contexts, such as medical, scientific, and news texts, but it has not yet been studied to simplify texts in the rapidly changing and complex domain of cyber security. We created a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, evaluated by two groups of cyber security experts in two survey rounds. We have found that while out-of-the-box LLMs can make the text appear simpler, they struggle with meaning preservation. Code and data are available at https://version.aalto.fi/gitlab/vehomav1/simplification_nmi.

[62] LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

Szilvia Ujváry,Louis Béthune,Pierre Ablin,João Monteiro,Marco Cuturi,Michael Kirchhof

Main category: cs.CL

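TL;DR: This paper proposes LaCy, a method that augments the loss signal with a grammar parser to guide small language models (SLMs) during pretraining in deciding which tokens to predict themselves and which to delegate to an external resource via a token, improving factual accuracy while keeping costs low.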

Details Motivation: Small language models (SLMs) have limited parameters, cannot fully compress world knowledge, and are prone to factual errors; external queries can help, but this raises the basic pretraining question of which tokens to learn and which to delegate. Method: LaCy is a pretraining method that augments the usual loss signal with a spaCy grammar parser to separate acceptable high-loss alternative continuations from real errors that should trigger delegation, and decides on that basis whether each token is learned or delegated. Result: LaCy models learn the boundary between predicting and delegating, achieve higher FactScores when generating in a cascade with a larger model, and outperform SLMs trained with Rho or an LLM judge while being simpler and cheaper. Conclusion: A token-level criterion that combines the loss with grammatical information is a better basis than the loss alone for learn-or-delegate decisions in SLM pretraining, and LaCy shows that the idea is effective and practical. Abstract: Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. The capacity of Small Language Models (SLMs) in particular is limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a dedicated delegation token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground-truth, some tokens are acceptable in that they are truthful alternative continuations of a pretraining document, and should not trigger delegation even if their loss is high. We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors and which are safe to learn and predict even under high losses. We propose LaCy, a novel pretraining method based on this token selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.

[63] Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

Angelo Ziletti,Leonardo D'Ambrosi

Main category: cs.CL

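TL;DR: This paper proposes CLUES, a framework for clinical Text-to-SQL that separates two sources of uncertainty, input ambiguity and model instability, and quantifies them as an ambiguity score and an instability score to support targeted intervention and error tracing.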

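Details Motivation: Clinical Text-to-SQL deployments need to distinguish output diversity caused by input ambiguity (which should trigger user clarification) from diversity caused by model instability (which should trigger human review); existing single-number uncertainty measures cannot support such differentiated interventions. Method: Text-to-SQL is modelled as a two-stage process (interpretations → answers); a bipartite semantic graph is built, and the Schur complement of its matrix yields the instability score, so that semantic uncertainty decomposes into an ambiguity score and an instability score. Result: On AmbigQA/SituatedQA and a clinical Text-to-SQL benchmark, CLUES predicts failures better than the state-of-the-art Kernel Language Entropy; in deployment settings it stays competitive while providing an interpretable decomposition of uncertainty; the high-ambiguity, high-instability subset covers 25% of queries but contains 51% of errors, making error triage far more efficient. Conclusion: CLUES separates the sources of uncertainty in an interpretable way, supporting query refinement for ambiguity and model improvement for instability, and provides a practical diagnostic framework for deploying clinical LLMs safely. Abstract: Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations → answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions: query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.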

[64] Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Xin Xu,Clive Bai,Kai Yang,Tianhao Chen,Yangkun Chen,Weijie Liu,Hao Chen,Yang Wang,Saiyong Yang,Can Yang

Main category: cs.CL

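TL;DR: This paper proposes Composition-RL, which automatically composes multiple problems into new verifiable prompts to make better use of prompts with a pass rate of 1 and improve the reasoning ability of large models trained with reinforcement learning.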

Details Motivation: Large-scale verifiable prompts underpin the success of RLVR, but they contain many uninformative examples and are costly to expand; as training progresses, easy prompts with a pass rate of 1 become more common and shrink the effective data size, so better use of such data is needed. Method: Composition-RL automatically composes multiple problems into a new verifiable question and uses it for RL training; a curriculum variant gradually increases compositional depth; prompts from different domains can also be composed. Result: Experiments on models from 4B to 30B show that Composition-RL consistently outperforms baseline RL, that the curriculum variant improves performance further, and that cross-domain composition also improves generalization. Conclusion: Composition-RL is a simple and effective way to make better use of verifiable prompts with a high pass rate, strengthening reasoning ability and cross-domain adaptability. Abstract: Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach for better utilizing limited verifiable prompts targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Code, datasets, and models are available at https://github.com/XinXU-USTC/Composition-RL.
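
One simple way to compose two verifiable problems into a new verifiable question is to ask for a function of both answers, so the composite still has a single checkable target. The sketch below does this with a sum; the `Problem` type, the sum-based composition, and the exact-match verifier are hypothetical illustrations, and the abstract does not say how Composition-RL actually combines problems.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Problem:
    question: str
    answer: int  # integer answers keep verification an exact match

def compose(first, second):
    """Build one harder prompt whose answer depends on solving both parts."""
    question = (
        f"Problem 1: {first.question}\n"
        f"Problem 2: {second.question}\n"
        "Report the sum of the two answers."
    )
    return Problem(question, first.answer + second.answer)

def verify(problem, response):
    """Verifiable reward signal: True only for the exact composite answer."""
    return response == problem.answer
```

Because `compose` returns another `Problem`, it can be applied repeatedly, which gives a natural notion of compositional depth for a curriculum.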

[65] DeepSight: An All-in-One LM Safety Toolkit

Bo Zhang,Jiaxuan Guo,Lijun Li,Dongrui Liu,Sujin Chen,Guanxu Chen,Zhijie Zheng,Qihao Lin,Lewen Yan,Chen Qian,Yijin Zhou,Yuyao Wu,Shaoxiong Guo,Tianyi Du,Jingyi Yang,Xuhao Hu,Ziqi Miao,Xiaoya Lu,Jing Shao,Xia Hu

Main category: cs.CL

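TL;DR: This paper presents DeepSight, an open-source project that integrates safety evaluation and diagnosis, moving safety analysis of large models from black-box to white-box.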

Details Motivation: In current large-model safety workflows, evaluation, diagnosis, and alignment are handled by separate tools, so evaluation cannot locate internal root causes, diagnosis drifts away from concrete risk scenarios, and alignment lacks mechanistic explanations and may degrade general capabilities. Method: DeepSight is an open-source project comprising the evaluation toolkit DeepSafe and the diagnosis toolkit DeepScan; unified task and data protocols connect the evaluation and diagnosis stages and enable white-box safety analysis. Result: DeepSight is the first open-source toolkit to support frontier AI risk evaluation and joint safety evaluation and diagnosis, and is low-cost, reproducible, efficient, and highly scalable. Conclusion: DeepSight moves large-model safety from isolated black-box evaluation toward systematic, interpretable, mechanism-level white-box analysis. Abstract: As the development of Large Models (LMs) progresses rapidly, their safety is also a priority. In current Large Language Model (LLM) and Multimodal Large Language Model (MLLM) safety workflows, evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot figure out internal root causes. Meanwhile, safety diagnosis often drifts from concrete risk scenarios and remains at the explainable level. In this way, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, namely DeepSight, to put into practice a new paradigm that integrates safety evaluation and diagnosis. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box to white-box insight. Besides, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and joint safety evaluation and diagnosis.

[66] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Pinyi Zhang,Ting-En Lin,Yuchuan Wu,Jingyang Chen,Zongqi Wang,Hua Yang,Ze Xu,Fei Huang,Kai Zhang,Yongbin Li

Main category: cs.CL

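TL;DR: This paper proposes P-GenRM, the first personalized generative reward model with test-time user-based scaling; by building evaluation chains, clustering users into prototypes, and applying a dual-granularity scaling mechanism, it significantly improves personalized alignment and generalization across users.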

Details Motivation: Existing personalized reward models struggle to model diverse, scenario-specific preferences accurately and generalize poorly to new users with sparse feedback. Method: P-GenRM turns preference signals into structured evaluation chains that produce adaptive personas and scoring rubrics, clusters users into User Prototypes, and uses a dual-granularity scaling mechanism at the individual and prototype levels to adapt to each user at test time. Result: P-GenRM reaches state-of-the-art results on widely used personalized reward model benchmarks with an average improvement of 2.31%, generalizes well to an out-of-distribution dataset, and gains a further 3% from test-time user-based scaling. Conclusion: Through generative modelling and dual-granularity scaling, P-GenRM reduces noise in inferred preferences and improves generalization to new users, offering a scalable and accurate approach to personalized alignment. Abstract: Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling with generalization to new users with limited feedback. To this end, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely-used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, test-time user-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment with test-time scalability.

[67] A Rule-based Computational Model for Gaidhlig Morphology

Peter J Barclay

Main category: cs.CL

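TL;DR: This paper presents a rule-based approach to modeling Scottish Gaelic (Gaidhlig) morphology, using Wiktionary data and SQL queries to build an interpretable, low-resource-friendly system that supports language teaching and higher-level NLP tools.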

Details Motivation: Mainstream neural language models need large amounts of training data and are hard to apply to low-resource languages, whereas rule-based systems make effective use of limited samples, improve interpretability, and help in designing teaching materials. Method: Gaidhlig morphological data is extracted from Wiktionary, SQL is used to query lexical patterns, and a declarative rule base is built from which Python utilities derive inflected forms. Result: A work-in-progress rule-based system that derives inflected forms of Gaidhlig words, which could support educational tools (e.g., explaining language patterns) and higher-level tools (e.g., rule-based dependency parsers). Conclusion: The rule-based approach adds value to the data already in Wiktionary and offers an interpretable, practical, and reusable path for low-resource language processing. Abstract: Language models and software tools are essential to support the continuing vitality of lesser-used languages; however, currently popular neural models require considerable data for training, which normally is not available for such low-resource languages. This paper describes work-in-progress to construct a rule-based model of Gaidhlig morphology using data from Wiktionary, arguing that rule-based systems effectively leverage limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule-base is presented that allows Python utilities to derive inflected forms of Gaidhlig words. This functionality could be used to support educational tools that teach or explain language patterns, for example, or to support higher level tools such as rule-based dependency parsers. This approach adds value to the data already present in Wiktionary by adapting it to new use-cases.

[68] WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models

Yangzhuo Li,Shengpeng Ji,Yifu Chen,Tianle Liang,Haorong Ying,Yule Wang,Junbo Li,Jun Fang,Zhou Zhao

Main category: cs.CL

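TL;DR: This paper introduces WavBench, a new benchmark for evaluating realistic spoken dialogue abilities. It covers reasoning (Pro subset), colloquial expression (Basic subset), and paralinguistic understanding and generation (Acoustic subset), addressing the neglect of speech-specific characteristics in existing text-oriented evaluations.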

Details Motivation: Evaluations of spoken dialogue models mostly reuse text-generation standards, overlooking speech-specific paralinguistic features (such as intonation and pauses) and colloquial expression, and they fail to measure the cognitive depth and realistic interaction ability required of modern agents. Method: WavBench is built as a tripartite evaluation framework: a Pro subset (high-difficulty reasoning tasks), a Basic subset (a colloquialism standard centered on "listenability"), and an Acoustic subset (explicit and implicit paralinguistic understanding and generation); five state-of-the-art models are systematically evaluated. Result: WavBench exposes marked weaknesses of current models in complex reasoning, natural colloquial expression, and paralinguistic modeling, and provides reproducible quantitative results together with an open-source toolkit. Conclusion: WavBench establishes a comprehensive evaluation paradigm closer to real-world scenarios, pushing models from merely being able to speak toward listening, understanding, and communicating well. Abstract: With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto-2024.github.io/wavbench.github.io/.

Ricardo Campos,Ana Filipa Pacheco,Ana Luísa Fernandes,Inês Cantante,Rute Rebouças,Luís Filipe Cunha,José Miguel Isidro,José Pedro Evans,Miguel Marques,Rodrigo Batista,Evelin Amorim,Alípio Jorge,Nuno Guimarães,Sérgio Nunes,António Leal,Purificação Silvano

Main category: cs.CL

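TL;DR: This paper introduces CitiLink-Minutes, a multilayer annotated dataset of 120 European Portuguese municipal meeting minutes, intended to address the lack of research on municipal records in information retrieval and natural language processing.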

Details Motivation: Municipal meeting minutes are important official records of local governance, but they have long been neglected in IR and NLP because annotated datasets are lacking, which limits the development of computational models. Method: The CitiLink-Minutes dataset comprises 120 meeting minutes from six Portuguese municipalities, with over one million tokens; two trained annotators and a linguist manually annotated it along three dimensions (metadata, subjects of discussion, and voting outcomes), yielding over 38,000 annotations; all personal identifiers are de-identified and the dataset is released under FAIR principles. Result: The first multilayer structured annotated dataset for municipal meeting minutes is released, with baseline results (metadata extraction, topic classification, vote labeling) that demonstrate its usability for downstream NLP and IR tasks. Conclusion: CitiLink-Minutes fills a data gap in NLP research on municipal text, supports more transparent analysis of local decision-making, and offers a reusable template for research on similar low-resource administrative text. Abstract: City councils play a crucial role in local governance, directly influencing citizens' daily lives through decisions made during municipal meetings. These deliberations are formally documented in meeting minutes, which serve as official records of discussions, decisions, and voting outcomes. Despite their importance, municipal meeting records have received little attention in Information Retrieval (IR) and Natural Language Processing (NLP), largely due to the lack of annotated datasets, which ultimately limits the development of computational models. To address this gap, we introduce CitiLink-Minutes, a multilayer dataset of 120 European Portuguese municipal meeting minutes from six municipalities. Unlike prior annotated datasets of parliamentary or video records, CitiLink-Minutes provides multilayer annotations and structured linkage of official written minutes. The dataset contains over one million tokens, with all personal identifiers de-identified. Each minute was manually annotated by two trained annotators and curated by an experienced linguist across three complementary dimensions: (1) metadata, (2) subjects of discussion, and (3) voting outcomes, totaling over 38,000 individual annotations. Released under FAIR principles and accompanied by baseline results on metadata extraction, topic classification, and vote labeling, CitiLink-Minutes demonstrates its potential for downstream NLP and IR tasks, while promoting transparent access to municipal decisions.

[70] dVoting: Fast Voting for dLLMs

Sicheng Feng,Zigeng Chen,Xinyin Ma,Gongfan Fang,Xinchao Wang

Main category: cs.CL

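TL;DR: This paper proposes dVoting, a training-free fast voting technique for diffusion large language models (dLLMs) that significantly improves reasoning through iterative sampling, consistency analysis, and regeneration of uncertain tokens.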

Details Motivation: Across multiple samples from a dLLM, most token predictions agree, and the performance bottleneck lies in a small number of tokens that differ across samples; in addition, dLLMs can generate tokens at arbitrary positions in parallel. Method: dVoting samples the same prompt multiple times, analyzes cross-sample token consistency to identify uncertain tokens, regenerates them through voting, and iterates until convergence. Result: Gains of 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU. Conclusion: dVoting is an efficient, training-free method for enhancing reasoning that exploits the potential of dLLMs for parallel decoding and test-time scaling. Abstract: Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential for parallel test-time scaling, which was previously constrained by severe inefficiency in autoregressive modeling. In this work, we introduce dVoting, a fast voting technique that boosts reasoning capability without training, with only an acceptable extra computational overhead. dVoting is motivated by the observation that, across multiple samples for the same prompt, token predictions remain largely consistent, whereas performance is determined by a small subset of tokens exhibiting cross-sample variability. Leveraging the arbitrary-position generation capability of dLLMs, dVoting performs iterative refinement by sampling, identifying uncertain tokens via consistency analysis, regenerating them through voting, and repeating this process until convergence. Extensive evaluations demonstrate that dVoting consistently improves performance across various benchmarks. It achieves gains of 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU. Our code is available at https://github.com/fscdc/dVoting
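
The consistency-analysis step described in the dVoting abstract can be pictured with a small sketch. This is not the authors' implementation: the function name `find_uncertain_positions`, the `agreement_threshold` parameter, and the assumption that samples are position-aligned token lists of equal length are all hypothetical choices made for illustration. It only shows how per-position majority voting can flag the few positions where samples disagree.

```python
from collections import Counter

def find_uncertain_positions(samples, agreement_threshold=0.8):
    """Majority-vote each position and flag positions with low agreement.

    `samples` is a list of position-aligned token lists (an assumption of
    this sketch). Returns (majority_tokens, uncertain_positions).
    """
    if not samples:
        raise ValueError("need at least one sample")
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("samples must have equal length")

    majority, uncertain = [], []
    for pos in range(length):
        counts = Counter(s[pos] for s in samples)
        token, votes = counts.most_common(1)[0]
        majority.append(token)
        # A position is "uncertain" when too few samples agree on its token.
        if votes / len(samples) < agreement_threshold:
            uncertain.append(pos)
    return majority, uncertain

# Four hypothetical samples for the same prompt; they differ at positions 2 and 4.
samples = [
    ["x", "=", "4", "+", "3"],
    ["x", "=", "4", "+", "3"],
    ["x", "=", "5", "+", "3"],
    ["x", "=", "4", "+", "2"],
]
majority, uncertain = find_uncertain_positions(samples, agreement_threshold=0.8)
print(majority, uncertain)
```

In a full loop, only the flagged positions would be masked and regenerated by the model before voting again; that regeneration step depends on the dLLM and is left out here.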

[71] Query-focused and Memory-aware Reranker for Long Context Processing

Yuqing Li,Jiangnan Li,Mo Yu,Guoxuan Ding,Zheng Lin,Weiping Wang,Jie Zhou

Main category: cs.CL

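TL;DR: This paper proposes a new reranking framework based on the attention scores of retrieval heads in large language models. It uses the scores of selected attention heads to estimate passage-query relevance, yielding lightweight, efficient listwise reranking without requiring Likert-scale relevance labels.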

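Details Motivation: Existing rerankers mostly rely on pointwise training or on listwise training that requires human annotation, and they make limited use of the holistic information in the candidate list; support for complex retrieval tasks such as dialogue understanding and memory is also insufficient. Method: A reranking framework based on the attention scores of specific retrieval heads in a large language model, which treats the attention scores directly as continuous relevance scores to achieve listwise ranking; it trains small-scale models (e.g., 4B parameters) and supports extensions such as context augmentation and training heads from middle layers. Result: Outperforms existing state-of-the-art pointwise and listwise rerankers on Wikipedia, long narrative datasets, and the LoCoMo dialogue understanding benchmark, setting a new state of the art on LoCoMo; context augmentation and middle-layer head training further improve accuracy or efficiency. Conclusion: The framework achieves strong reranking performance in a lightweight, list-aware way without Likert-scale supervision, offering a new approach for retrieval-augmented generation and complex dialogue retrieval. Abstract: Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.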
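
To make the idea of scoring passages by attention concrete, here is a rough sketch of the scoring step alone. The abstract does not specify how attention is aggregated or how the model is trained, so the choices below (summing attention over each passage's tokens, then averaging over query tokens and selected heads) and the names `passage_scores`, `passage_spans`, and `query_span` are assumptions for illustration, applied to a hand-built toy attention tensor.

```python
import numpy as np

def passage_scores(attn, passage_spans, query_span, heads):
    """Score each passage by the attention mass it receives from query tokens.

    attn[h, i, j] is the weight token i places on token j in head h.
    Spans are (start, end) token ranges, end exclusive. The aggregation
    (sum over passage tokens, mean over query tokens and heads) is an
    assumption of this sketch.
    """
    q_start, q_end = query_span
    selected = attn[heads, q_start:q_end, :]  # shape: [heads, query_len, seq_len]
    return [float(selected[:, :, s:e].sum(axis=-1).mean()) for s, e in passage_spans]

# Toy input: 2 heads, 6 tokens = passage A (0-1), passage B (2-3), query (4-5).
attn = np.zeros((2, 6, 6))
attn[:, 4:6, :] = [0.1, 0.1, 0.3, 0.3, 0.1, 0.1]  # query rows favor passage B

scores = passage_scores(attn, [(0, 2), (2, 4)], (4, 6), heads=[0, 1])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Because all candidates sit in one context, every score is computed from the same forward pass, which is what makes the approach listwise rather than pointwise.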

[72] Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education

Mohamed Huti,Alasdair Mackintosh,Amy Waldock,Dominic Andrews,Maxime Lelièvre,Moritz Boos,Tobias Murray,Paul Atherton,Robin A. A. Ince,Oliver G. B. Garrod

Main category: cs.CL

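TL;DR: This paper introduces the Visual Reasoning Benchmark (VRB) for evaluating multimodal large language models (MLLMs) on authentic classroom visual problems. Models do better on static tasks such as counting but hit a clear bottleneck on dynamic spatial operations such as folding, reflection, and rotation, which underlines the need for education-specific benchmarks.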

Details Motivation: AI models excel at textual reasoning but remain limited in reasoning over spatial and relational structures, especially in early-grade maths, which relies heavily on visuals; benchmarks grounded in real classroom needs are required to test practical suitability for education. Method: The VRB contains 701 questions drawn from primary school examinations in Zambia and India, covering tasks such as reasoning by analogy, pattern completion, and spatial matching; it uses unedited, minimal-text real images to stay close to actual teaching conditions and to assess spatial understanding. Result: Experiments reveal a "jagged frontier" of capability: models do better on static skills such as counting and scaling but hit a marked "spatial ceiling" on dynamic operations such as folding, reflection, and rotation, which risks incorrect marking, false scaffolding, and reinforcement of student misconceptions. Conclusion: Education-focused benchmarks such as the VRB are essential for determining the functional boundaries of multimodal educational tools; current MLLMs do not yet meet the visual reasoning needs of real classrooms and require targeted improvement in spatial and dynamic relational modeling. Abstract: AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck -- particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. This benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark which intentionally uses unedited, minimal-text images to test if models can meet realistic needs of primary education. Our findings reveal a "jagged frontier" of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct "spatial ceiling" when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.

[73] ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

Mathieu Sibue,Andres Muñoz Garza,Samuel Mensah,Pranav Shetty,Zhiqiang Ma,Xiaomo Liu,Manuela Veloso

Main category: cs.CL

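TL;DR: This paper introduces ExStrucTiny, a new benchmark dataset for evaluating how well generalist vision language models perform flexible, structured information extraction across diverse document images, and analyzes the challenges existing models face in schema adaptation, query under-specification, and answer localization.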

Details Motivation: Existing document understanding benchmarks (KEE, RE, VQA) are limited by narrow entity ontologies, simple queries, or homogeneous document types, so they cannot assess holistic, fine-grained structured extraction across diverse documents and flexible schemas. Method: ExStrucTiny unifies aspects of the KEE, RE, and VQA tasks and is built with a pipeline that combines manual samples with human-validated synthetic samples; open and closed VLMs are systematically evaluated on it. Result: The evaluation exposes shortcomings of current VLMs in schema adaptation, query under-specification, and answer localization. Conclusion: ExStrucTiny provides a foundation and evaluation standard for improving the generalization and practicality of generalist vision language models on structured information extraction from enterprise documents. Abstract: Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.

[74] Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation

Julia Belikova,Danila Rozhevskii,Dennis Svirin,Konstantin Polev,Alexander Panchenko

Main category: cs.CL

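TL;DR: This paper proposes a method for detecting "token overflow" in soft-compression architectures, the regime in which compressed representations no longer contain enough information to answer a query. Lightweight probing classifiers reach 0.72 average AUC-ROC across several question answering datasets, enabling earlier identification of compression-induced errors.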

Details Motivation: Soft compression architectures can extend the effective context length of LLMs, but the limits of compressibility, and the point at which compression starts to erase task-relevant information, are unclear; token overflow needs to be defined and detected. Method: The paper defines token overflow and, in the xRAG soft-compression setting, analyzes detection using both query-agnostic saturation statistics and lightweight probing classifiers over query and context representations. Result: Query-agnostic saturation statistics reliably separate compressed from uncompressed tokens but have limited overflow detection capability; query-aware probing classifiers reach 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA. Conclusion: Moving from query-agnostic diagnostics to query-aware detection offers a feasible path to low-cost pre-LLM gating that mitigates compression-induced errors. Abstract: Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility -- and when compression begins to erase task-relevant content -- remain underexplored. In this paper, we define "token overflow" as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.

[75] Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications

Manjunath Kudlur,Evan King,James Wang,Pete Warden

Main category: cs.CL

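TL;DR: This paper presents Moonshine v2, a streaming ASR encoder model with sliding-window self-attention that maintains high accuracy while significantly reducing time-to-first-token (TTFT) and compute cost for real-time on-device speech recognition.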

Details Motivation: Full-attention Transformer ASR models are accurate, but their global dependency leads to quadratic complexity and TTFT that grows linearly with utterance length, which makes it hard to meet the low-latency, high-accuracy requirements of streaming ASR on edge devices. Method: Moonshine v2 replaces full attention with sliding-window self-attention, achieving bounded, low-latency inference while preserving strong local context modeling. Result: State-of-the-art word error rates (WER) on standard benchmarks, with accuracy on par with models 6x larger and significantly faster inference. Conclusion: Carefully designed local attention can match the accuracy of full attention at a fraction of the model size and latency cost, opening new possibilities for interactive speech interfaces on edge devices. Abstract: Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent "encode-the-whole-utterance" latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state-of-the-art word error rates across standard benchmarks, attaining accuracy on par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive with the accuracy of full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
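
As a rough illustration of why sliding-window self-attention bounds per-frame work, the sketch below builds a boolean attention mask in which each frame sees only a fixed number of neighboring frames. The window sizes (`left=2`, `right=0`) and the function name are hypothetical; the abstract does not state Moonshine v2's actual window configuration or whether it uses lookahead.

```python
import numpy as np

def sliding_window_mask(num_frames, left, right=0):
    """Boolean mask where mask[i, j] is True if frame i may attend to frame j.

    Frame i sees frames in [i - left, i + right]. With right=0 the encoder
    needs no future frames, so output for frame i can be produced as soon
    as that frame arrives.
    """
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]  # key index minus query index
    return (offset >= -left) & (offset <= right)

mask = sliding_window_mask(6, left=2, right=0)
print(mask.astype(int))
# Each row has at most left + right + 1 allowed positions, regardless of
# utterance length, whereas a full-attention row would have num_frames.
print(mask.sum(axis=1))
```

The per-row count stays constant as `num_frames` grows, so total attention cost grows linearly with sequence length instead of quadratically.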

[76] A technical curriculum on language-oriented artificial intelligence in translation and specialised communication

Ralph Krüger

Main category: cs.CL

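TL;DR: This paper presents a technical curriculum on language-oriented artificial intelligence (AI) for the language and translation (L&T) industry, aimed at fostering domain-specific AI literacy among practitioners. It covers four core topics (vector embeddings, the foundations of neural networks, tokenization, and transformer models); its didactic effectiveness was tested in practice, and it was found to need higher-level scaffolding, such as lecturer support, for optimal learning.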

Details Motivation: To foster the domain-specific technical AI literacy that L&T practitioners need in the AI era, developing computational thinking, algorithmic awareness, and algorithmic agency, and strengthening digital resilience in AI-driven work environments. Method: A technical curriculum with four modules (vector embeddings, the technical foundations of neural networks, tokenization, and transformer neural networks) was designed, then taught and evaluated in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Result: Results suggest the curriculum is didactically effective, but participant feedback indicates that higher-level didactic scaffolding (such as lecturer support) is needed for optimal learning. Conclusion: The curriculum offers a feasible path to building AI literacy in the L&T field, but successful implementation depends on appropriate teaching support. Abstract: This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (L&T) industry. The curriculum aims to foster domain-specific technical AI literacy among stakeholders in the fields of translation and specialised communication by exposing them to the conceptual and technical/algorithmic foundations of modern language-oriented AI in an accessible way. The core curriculum focuses on 1) vector embeddings, 2) the technical foundations of neural networks, 3) tokenization and 4) transformer neural networks. It is intended to help users develop computational thinking as well as algorithmic awareness and algorithmic agency, ultimately contributing to their digital resilience in AI-driven work environments. The didactic suitability of the curriculum was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Results suggest the didactic effectiveness of the curriculum, but participant feedback indicates that it should be embedded into higher-level didactic scaffolding - e.g., in the form of lecturer support - in order to enable optimal learning conditions.

[77] T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization

Tunyu Zhang,Xinxi Zhang,Ligong Han,Haizhou Shi,Xiaoxiao He,Zhuowei Li,Hao Wang,Kai Xu,Akash Srivastava,Hao Wang,Vladimir Pavlovic,Dimitris N. Metaxas

Main category: cs.CL

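TL;DR: This paper proposes a trajectory self-distillation framework (T3D) combined with a Direct Discriminative Optimization (DDO) objective, which improves the generation quality of diffusion large language models (DLLMs) under few sampling steps and substantially narrows the gap to full-step decoding.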

Details Motivation: DLLMs can decode multiple tokens in parallel, but their practical inference efficiency is limited by the many refinement steps required, and reducing the number of steps severely degrades generation quality. Method: A trajectory self-distillation framework (T3D) distills the model's own generative trajectories and incorporates the reverse-KL-based DDO objective for mode-seeking distillation. Result: Consistently outperforms strong few-step baselines and standard training across benchmarks, markedly improving quality under tight step budgets and substantially narrowing the gap to full-step decoding. Conclusion: T3D lays a strong foundation for practical few-step DLLMs. Abstract: Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student to concentrate on high-probability teacher modes. Across benchmarks, our approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Although full-step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few-step DLLMs. The source code is available at https://github.com/Tyrion58/T3D.
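
The abstract's claim that a reverse-KL objective is mode-seeking can be checked on a toy example. The sketch below is not the DDO objective itself; it only compares forward KL(teacher || student) and reverse KL(student || teacher) on hand-picked discrete distributions, all of which are hypothetical. A bimodal teacher is compared against a student that commits to one mode and a student that spreads its mass evenly.

```python
import math

def kl(a, b):
    """KL(a || b) for two discrete distributions given as equal-length lists."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

# Hypothetical distributions over three outcomes.
teacher = [0.495, 0.01, 0.495]        # two modes with a low-probability valley
student_peaked = [0.90, 0.05, 0.05]   # commits to one teacher mode
student_spread = [0.34, 0.32, 0.34]   # covers everything, including the valley

reverse_peaked = kl(student_peaked, teacher)
reverse_spread = kl(student_spread, teacher)
forward_peaked = kl(teacher, student_peaked)
forward_spread = kl(teacher, student_spread)

print(f"reverse KL: peaked={reverse_peaked:.3f}, spread={reverse_spread:.3f}")
print(f"forward KL: peaked={forward_peaked:.3f}, spread={forward_spread:.3f}")
```

Reverse KL penalizes the student for placing mass where the teacher has little, so it favors the peaked student; forward KL penalizes missing any teacher mode, so it favors the spread one.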

[78] On-Policy Context Distillation for Language Models

Tianzhu Ye,Li Dong,Xun Wu,Shaohan Huang,Furu Wei

Main category: cs.CL

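TL;DR: This paper proposes On-Policy Context Distillation (OPCD), a framework that performs on-policy distillation on trajectories generated by the student model itself while minimizing reverse KL divergence against a context-conditioned teacher, thereby internalizing in-context knowledge into parameters. It performs well on both experiential knowledge distillation and system prompt distillation, and supports knowledge transfer across model sizes.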

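Details Motivation: Existing context distillation methods struggle to internalize, as parametric knowledge, the experience a model accumulates during inference or the behaviors encoded in optimized prompts; a paradigm that combines the strengths of on-policy training and context modeling is needed. Method: In OPCD, the student trains on its own generated trajectories and is aligned to the context-conditioned teacher's output distribution via reverse KL divergence; it is applied to experiential knowledge distillation (extracting transferable knowledge from historical solution traces) and system prompt distillation (internalizing behaviors encoded in optimized prompts). Result: OPCD outperforms baselines on mathematical reasoning, text-based games, and domain-specific tasks, improving task accuracy while better preserving out-of-distribution capabilities; it also supports cross-size distillation, in which smaller models absorb experiential knowledge from larger ones. Conclusion: OPCD combines on-policy learning with context distillation, giving language models a general, efficient, and scalable way to persist run-time experience and prompt engineering results in their parameters. Abstract: Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.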

cs.CV [Back]

[79] DD-MDN: Human Trajectory Forecasting with Diffusion-Based Dual Mixture Density Networks and Uncertainty Self-Calibration

Manuel Hetzel,Kerim Turacan,Hannes Reichert,Konrad Doll,Bernhard Sick

Main category: cs.CV

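TL;DR: This paper proposes DD-MDN, an end-to-end probabilistic human trajectory forecasting method based on denoising diffusion and dual mixture density networks, which combines high positional accuracy, well-calibrated uncertainty modeling, and robustness to short observation periods.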

Details Motivation: Prior work has focused on accuracy, social interaction modeling, and diversity while paying little attention to uncertainty modeling, calibration, and forecasting from short observations, all of which are crucial for downstream tasks such as path planning and collision avoidance. Method: DD-MDN uses a few-shot denoising diffusion backbone and a dual mixture density network (dual MDN) to learn self-calibrated residence areas and probability-ranked anchor paths without predefined anchors or endpoints, from which diverse trajectory hypotheses are derived. Result: State-of-the-art accuracy on the ETH/UCY, SDD, inD, and IMPTC datasets; robustness at short observation intervals; more reliable, calibrated uncertainty modeling. Conclusion: DD-MDN provides a unified probabilistic framework for human trajectory forecasting that combines accuracy, calibrated uncertainty, and short-observation robustness, improving safety and trustworthiness in deployment. Abstract: Human Trajectory Forecasting (HTF) predicts future human movements from past trajectories and environmental context, with applications in Autonomous Driving, Smart Surveillance, and Human-Robot Interaction. While prior work has focused on accuracy, social interaction modeling, and diversity, little attention has been paid to uncertainty modeling, calibration, and forecasts from short observation periods, which are crucial for downstream tasks such as path planning and collision avoidance. We propose DD-MDN, an end-to-end probabilistic HTF model that combines high positional accuracy, calibrated uncertainty, and robustness to short observations. Using a few-shot denoising diffusion backbone and a dual mixture density network, our method learns self-calibrated residence areas and probability-ranked anchor paths, from which diverse trajectory hypotheses are derived, without predefined anchors or endpoints. Experiments on the ETH/UCY, SDD, inD, and IMPTC datasets demonstrate state-of-the-art accuracy, robustness at short observation intervals, and reliable uncertainty modeling. The code is available at: https://github.com/kav-institute/ddmdn.

[80] ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

Yandan Yang,Shuang Zeng,Tong Lin,Xinyuan Chang,Dekang Qi,Junjin Xiao,Haoyun Liu,Ronghan Chen,Yuzhi Chen,Dongjie Huo,Feng Xiong,Xing Wei,Zhiheng Ma,Mu Xu

Main category: cs.CV

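TL;DR: This paper presents ABot-M0, a framework that combines systematic data curation and unified pre-training (constructing the UniACT dataset) with the Action Manifold Hypothesis and Action Manifold Learning (AML) to improve cross-platform, cross-task generalization and action prediction efficiency in embodied intelligence, and that supports modular multimodal perception.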

Details Motivation: To address the core obstacles to general-purpose embodied intelligence under the "one-brain, many-forms" paradigm in robotics: fragmented data, inconsistent representations, and misaligned training objectives. Method: Construct the unified UniACT dataset (six public datasets, over 6 million trajectories, 9,500 hours); propose the Action Manifold Hypothesis and design AML (a DiT backbone that predicts continuous action sequences); adopt a dual-stream modular perception architecture (integrating VLM semantics with geometric priors and multi-view 3D modules). Result: Improved knowledge transfer and generalization across robot morphologies and tasks; more efficient and stable action prediction; components operate independently with additive benefits; all code and pipelines will be released. Conclusion: ABot-M0 provides a scalable, reproducible, modular end-to-end framework for general-purpose embodied intelligence, moving "one-brain, many-forms" from concept toward practice. Abstract: Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.

[81] Toward Reliable Tea Leaf Disease Diagnosis Using Deep Learning Model: Enhancing Robustness With Explainable AI and Adversarial Training

Samanta Ghosh,Jannatul Adan Mahi,Shayan Abrar,Md Parvez Mia,Asaduzzaman Rayhan,Abdul Awal Yasir,Asaduzzaman Hridoy

Main category: cs.CV

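TL;DR: This paper proposes a deep learning method for automated tea leaf disease classification using the teaLeafBD dataset (5,278 high-resolution images in 7 classes). Combining DenseNet201 and EfficientNetB3 with adversarial training and Grad-CAM explainability analysis, it reaches up to 93% classification accuracy.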

Details Motivation: Tea plants are vulnerable to many leaf diseases that reduce yield and quality; manual inspection is slow and error-prone, so an efficient automated identification method is needed. Method: Using the teaLeafBD dataset, the pipeline covers data preprocessing, splitting, adversarial training, augmentation, model training, evaluation, and Grad-CAM explainability analysis; DenseNet201 and EfficientNetB3 perform the classification, and adversarial training is added for robustness. Result: EfficientNetB3 reaches 93% classification accuracy and DenseNet201 91%; Grad-CAM visualizations support the plausibility of the regions the models attend to; adversarial training improves robustness to noisy inputs. Conclusion: The method identifies tea leaf diseases accurately and efficiently, has practical agricultural value, and provides workable technical support for smart agricultural management. Abstract: Tea is a valuable asset for the economy of Bangladesh, so tea cultivation plays an important role in boosting the economy. These valuable plants are vulnerable to various kinds of leaf infections, which may cause lower production and lower quality. Detecting these diseases manually is not easy: it takes time and is prone to errors. Therefore, the purpose of this study is to develop an automated deep learning model for tea leaf disease classification based on the teaLeafBD dataset so that anyone can detect the diseases more easily and efficiently. There are 5,278 high-resolution images in this dataset. The images are classified into seven categories: six represent various diseases and the remaining one represents healthy leaves. The proposed pipeline contains data preprocessing, data splitting, adversarial training, augmentation, model training, evaluation, and interpretation through Explainable AI strategies. DenseNet201 and EfficientNetB3 were employed to perform the classification task. To make the model more robust, we applied adversarial training so that it can operate effectively even with noisy or disturbed inputs. In addition, Grad-CAM visualization was used to analyze the model's predictions by identifying the most influential regions of each image. Our experimental outcomes revealed that EfficientNetB3 achieved the highest classification accuracy of 93%, while DenseNet201 reached 91%. The outcomes show that the proposed approach can accurately detect tea leaf diseases and provide a practical solution for advanced agricultural management.

[82] Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration

Jinghan He,Junfeng Fang,Feng Xiong,Zijun Yao,Fei Shen,Haiyun Guo,Jinqiao Wang,Tat-Seng Chua

Main category: cs.CV

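TL;DR: This paper proposes Active-Zero, a framework in which three co-evolving agents (Searcher, Questioner, Solver) let vision-language models actively explore the open world and learn adaptively, significantly improving reasoning and understanding.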

Details Motivation: Existing self-play methods for vision-language models rely on static image collections and cannot actively acquire visual data suited to the model's current capability, which makes learning inefficient and strongly dependent on the initial data. Method: Active-Zero comprises three co-evolving agents: a Searcher that retrieves images from open-world repositories according to the model's capability frontier, a Questioner that generates calibrated reasoning tasks, and a Solver refined through accuracy rewards, forming a closed-loop, self-scaffolding curriculum. Result: On Qwen2.5-VL-7B-Instruct across 12 benchmarks, average accuracy reaches 53.97 on reasoning tasks (+5.7%) and 59.77 on general understanding (+3.9%), consistently outperforming existing self-play baselines. Conclusion: Active exploration is a key ingredient for scalable, adaptive self-evolving vision-language systems. Abstract: Self-play has enabled large language models to autonomously improve through self-generated challenges. However, existing self-play methods for vision-language models rely on passive interaction with static image collections, resulting in strong dependence on initial datasets and inefficient learning. Without the ability to actively seek visual data tailored to their evolving capabilities, agents waste computational effort on samples that are either trivial or beyond their current skill level. To address these limitations, we propose Active-Zero, a framework that shifts from passive interaction to active exploration of visual environments. Active-Zero employs three co-evolving agents: a Searcher that retrieves images from open-world repositories based on the model's capability frontier, a Questioner that synthesizes calibrated reasoning tasks, and a Solver refined through accuracy rewards. This closed loop enables self-scaffolding auto-curricula where the model autonomously constructs its learning trajectory. On Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero achieves 53.97 average accuracy on reasoning tasks (5.7% improvement) and 59.77 on general understanding (3.9% improvement), consistently outperforming existing self-play baselines. These results highlight active exploration as a key ingredient for scalable and adaptive self-evolving vision-language systems.

[83] ReTracing: An Archaeological Approach Through Body, Machine, and Generative Systems

Yitong Wang,Yue Yao

Main category: cs.CV

TL;DR: ReTracing is a multi-agent embodied performance art project that uses LLMs and text-to-video diffusion models to generate choreographic instructions for a human performer and a quadruped robot, examining how AI encodes socio-cultural biases in movement through an archaeological lens.

Details Motivation: To examine how artificial intelligence shapes, constrains, and produces bodily movement, and to reveal how generative AI systems encode socio-cultural biases through choreographed human-robot interactions. Method: Extracting human-machine interaction sentences from science-fiction novels; using LLMs to generate 'what to do' and 'what not to do' prompts; employing a diffusion-based text-to-video model to produce choreographic guides and motor commands; enacting movements on a mirrored floor with multi-camera motion tracking and 3D reconstruction. Result: A digital archive of motion traces (3D point clouds and motion trails) generated by synchronized human and robot performances, illustrating AI-mediated embodiment and bias in movement. Conclusion: ReTracing demonstrates that generative AI does not merely reflect but actively shapes embodied behavior and cultural norms, prompting reflection on humanity’s evolving relationship with moving, thinking, and trace-leaving AIs. Abstract: We present ReTracing, a multi-agent embodied performance art that adopts an archaeological approach to examine how artificial intelligence shapes, constrains, and produces bodily movement. Drawing from science-fiction novels, the project extracts sentences that describe human-machine interaction. We use large language models (LLMs) to generate paired prompts "what to do" and "what not to do" for each excerpt. A diffusion-based text-to-video model transforms these prompts into choreographic guides for a human performer and motor commands for a quadruped robot. Both agents enact the actions on a mirrored floor, captured by multi-camera motion tracking and reconstructed into 3D point clouds and motion trails, forming a digital archive of motion traces. Through this process, ReTracing serves as a novel approach to reveal how generative systems encode socio-cultural biases through choreographed movements. 
Through an immersive interplay of AI, human, and robot, ReTracing confronts a critical question of our time: What does it mean to be human among AIs that also move, think, and leave traces behind?

[84] Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models

Sethuraman T,Savya Khosla,Aditi Tiwari,Vidya Ganesh,Rakshana Jayaprakash,Aditya Jain,Vignesh Srinivasakumar,Onkar Kishor Susladkar,Srinidhi Sunkara,Aditya Shanmugham,Rakesh Vaideeswaran,Abbaas Alif Mohamed Nishar,Simon Jenni,Derek Hoiem

Main category: cs.CV

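TL;DR: This paper introduces REVEAL, a diagnostic benchmark whose five stress tests expose serious weaknesses of current video-language models (VidLMs) in temporal ordering, motion understanding, and reliance on video content, and provides a pipeline that automatically generates the diagnostic data.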

Details Motivation: To investigate whether video-language models robustly account for video content, temporal order, and motion; existing models turn out to have many under-recognized weaknesses. Method: Build the REVEAL diagnostic benchmark with five controlled stress tests (temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion) and design an automated data generation pipeline. Result: Leading open- and closed-source VidLMs perform poorly across the tests: they describe reversed videos as forward, answer while ignoring video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information under simple spatiotemporal masking, whereas humans complete these tasks with ease. Conclusion: Current VidLMs understand video content far less robustly than surface performance suggests, which calls for stricter evaluation benchmarks and modeling improvements. Abstract: This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests, assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information amidst simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.

[85] Advancing Digital Twin Generation Through a Novel Simulation Framework and Quantitative Benchmarking

Jacob Rubinstein,Avi Donaty,Don Engel

Main category: cs.CV

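TL;DR: This paper presents a new pipeline that renders synthetic images from high-quality 3D models and programmatically generated camera poses, enabling quantitative evaluation of digital twin reconstruction quality.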

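Details Motivation: Photogrammetry-based 3D modeling involves many design choices when building digital twins, yet differences between approaches are mostly judged subjectively and qualitatively, without a repeatable, quantifiable experimental benchmark. Method: Propose and implement a new pipeline that starts from high-fidelity 3D models, combines them with programmatically generated camera poses, and renders synthetic images; the known virtual camera parameters and object ground truth are then compared quantitatively against the reconstruction results. Result: The pipeline supports a large number of repeatable, quantifiable experiments and allows precise evaluation of camera pose estimation and 3D reconstruction accuracy. Conclusion: The proposed synthetic image generation pipeline provides a controlled, reliable benchmarking framework for quantitative evaluation of digital twin and photogrammetry methods. Abstract: The generation of 3D models from real-world objects has often been accomplished through photogrammetry, i.e., by taking 2D photos from a variety of perspectives and then triangulating matched point-based features to create a textured mesh. Many design choices exist within this framework for the generation of digital twins, and differences between such approaches are largely judged qualitatively. Here, we present and test a novel pipeline for generating synthetic images from high-quality 3D models and programmatically generated camera poses. This enables a wide variety of repeatable, quantifiable experiments which can compare ground-truth knowledge of virtual camera parameters and of virtual objects against the reconstructed estimations of those perspectives and subjects.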
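As a rough illustration of "programmatically generated camera poses" with known ground truth, the sketch below places virtual cameras on a sphere around an object at the origin and measures the error of an estimated camera position. The Fibonacci-sphere layout, the function names, and the default radius are assumptions made for illustration; the abstract does not say how the authors sample poses or score reconstructions.

```python
import math

def fibonacci_sphere_poses(n, radius=2.0):
    """Return n (position, forward) pairs for cameras on a sphere.

    Hypothetical helper: positions follow a Fibonacci spiral and each
    camera looks at the origin, where the object is assumed to sit.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    poses = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n           # heights spread over (-1, 1)
        r = math.sqrt(max(0.0, 1.0 - z * z))    # ring radius at that height
        theta = golden_angle * i
        position = (radius * r * math.cos(theta),
                    radius * r * math.sin(theta),
                    radius * z)
        # Unit vector pointing from the camera toward the origin.
        forward = tuple(-c / radius for c in position)
        poses.append((position, forward))
    return poses

def position_error(true_position, estimated_position):
    """Euclidean distance between a ground-truth and an estimated camera centre."""
    return math.dist(true_position, estimated_position)

poses = fibonacci_sphere_poses(10, radius=2.0)
```

Because every pose is generated rather than measured, an error such as `position_error` can be computed exactly for each reconstructed camera, which is the kind of repeatable comparison the abstract describes.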

[86] Selective Prior Synchronization via SYNC Loss

Ishan Mishra,Jiajie Li,Deepak Mishra,Jinjun Xiong

Main category: cs.CV

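TL;DR: This paper proposes the SYNC loss, which brings a post-hoc method (softmax response) into the training of SelectiveNet, using the selective prior to improve the selective prediction ability of deep neural networks and achieving state-of-the-art results on several datasets.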

Details Motivation: Existing selective prediction methods are either ad-hoc (e.g., SelectiveNet) or post-hoc (e.g., softmax response), but the uncertainty information implicit in post-hoc methods (the selective prior) has been used only at inference time; the authors argue it is equally important during training. Method: Propose the SYNC loss, which explicitly incorporates the softmax response (representing the selective prior) into SelectiveNet's training objective, jointly optimizing ad-hoc and post-hoc approaches. Result: On several benchmark datasets, including CIFAR-100, ImageNet-100, and Stanford Cars, the method improves both generalization and selective prediction performance, reaching state-of-the-art results. Conclusion: The selective prior should be exploited during training, not only at inference; the SYNC loss combines the strengths of both families of methods and offers a new approach to uncertainty modeling for trustworthy AI. Abstract: Prediction under uncertainty is a critical requirement for the deep neural network to succeed responsibly. This paper focuses on selective prediction, which allows DNNs to make informed decisions about when to predict or abstain based on the uncertainty level of their predictions. Current methods are either ad-hoc such as SelectiveNet, focusing on how to modify the network architecture or objective function, or post-hoc such as softmax response, achieving selective prediction through analyzing the model's probabilistic outputs. We observe that post-hoc methods implicitly generate uncertainty information, termed the selective prior, which has traditionally been used only during inference. We argue that the selective prior provided by the selection mechanism is equally vital during the training stage. Therefore, we propose the SYNC loss which introduces a novel integration of ad-hoc and post-hoc methods. Specifically, our approach incorporates the softmax response into the training process of SelectiveNet, enhancing its selective prediction capabilities by examining the selective prior. Evaluated across various datasets, including CIFAR-100, ImageNet-100, and Stanford Cars, our method not only enhances the model's generalization capabilities but also surpasses previous works in selective prediction performance, and sets new benchmarks for state-of-the-art performance.
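For readers unfamiliar with the post-hoc baseline mentioned above, here is a minimal sketch of softmax-response selective prediction: the model abstains whenever its top softmax probability falls below a threshold, and performance is summarised as coverage (fraction of inputs answered) and selective risk (error rate on the answered inputs). This shows only the baseline, not the SYNC loss; the function names and threshold value are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def selective_predict(logits, threshold):
    """Return (label, confidence); label is None when the model abstains."""
    probs = softmax(logits)
    confidence = max(probs)
    label = probs.index(confidence)
    return (label if confidence >= threshold else None), confidence

def coverage_and_risk(batch_logits, labels, threshold):
    """Coverage = accepted / total; risk = errors / accepted."""
    accepted = 0
    errors = 0
    for logits, y in zip(batch_logits, labels):
        pred, _ = selective_predict(logits, threshold)
        if pred is None:
            continue  # abstained: counts against coverage, not risk
        accepted += 1
        errors += int(pred != y)
    coverage = accepted / len(labels)
    risk = errors / accepted if accepted else 0.0
    return coverage, risk

batch = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 5.0, 0.0]]
coverage, risk = coverage_and_risk(batch, [0, 1, 2], threshold=0.9)
```

In this toy batch the second input is rejected for low confidence, so two of three inputs are answered and one of those two is wrong. Raising the threshold trades coverage for lower risk.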

[87] MDE-VIO: Enhancing Visual-Inertial Odometry Using Learned Depth Priors

Arda Alniak,Sinan Kalkan,Mustafa Mert Ankarali,Afsar Saranli,Abdullah Aydin Alatan

Main category: cs.CV

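TL;DR: This paper proposes a framework that integrates learned depth priors directly into the VINS-Mono optimization backend; by adding affine-invariant depth consistency and ordinal constraints with variance-based gating to suppress unstable artifacts, it robustly recovers metric scale within the compute limits of edge devices and improves VIO accuracy in low-texture environments.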

Details Motivation: Traditional monocular visual-inertial odometry (VIO) degrades in low-texture environments where sparse features are insufficient, so dense monocular depth estimation (MDE) is needed as a complement; however, existing high-accuracy ViT-based depth models are computationally expensive and hard to deploy in real time on edge devices. Method: Embed learned depth priors into the VINS-Mono optimization backend, add affine-invariant depth consistency constraints and pairwise ordinal constraints, and filter unstable depth artifacts with a gating mechanism based on depth prediction variance. Result: Experiments on the TartanGround and M3ED datasets show that the method prevents divergence and reduces Absolute Trajectory Error (ATE) by up to 28.3% in challenging scenarios. Conclusion: While strictly meeting edge-device compute constraints, the method fuses dense depth information with VIO efficiently and improves localization robustness and accuracy in low-texture environments. Abstract: Traditional monocular Visual-Inertial Odometry (VIO) systems struggle in low-texture environments where sparse visual features are insufficient for accurate pose estimation. To address this, dense Monocular Depth Estimation (MDE) has been widely explored as a complementary information source. While recent Vision Transformer (ViT) based complex foundational models offer dense, geometrically consistent depth, their computational demands typically preclude them from real-time edge deployment. Our work bridges this gap by integrating learned depth priors directly into the VINS-Mono optimization backend. We propose a novel framework that enforces affine-invariant depth consistency and pairwise ordinal constraints, explicitly filtering unstable artifacts via variance-based gating. This approach strictly adheres to the computational limits of edge devices while robustly recovering metric scale. Extensive experiments on the TartanGround and M3ED datasets demonstrate that our method prevents divergence in challenging scenarios and delivers significant accuracy gains, reducing Absolute Trajectory Error (ATE) by up to 28.3%. Code will be made available.
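One common way to make a depth comparison "affine-invariant" is to fit a scale and shift that best map the predicted depths onto reference depths, then look at what remains. The sketch below does this with closed-form least squares and drops points whose predicted variance exceeds a cutoff before fitting. It is a generic illustration under those assumptions, not the residual used in the paper's VINS-Mono backend; `fit_scale_shift`, `gated_alignment`, and `max_var` are hypothetical names.

```python
def fit_scale_shift(pred, ref):
    """Least-squares (s, t) minimising sum((s * p + t - r) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("predicted depths are constant; scale is undefined")
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    s = cov / var_p
    t = mean_r - s * mean_p
    return s, t

def gated_alignment(pred, ref, variances, max_var):
    """Fit scale/shift on low-variance points only and return residuals."""
    kept = [(p, r) for p, r, v in zip(pred, ref, variances) if v <= max_var]
    if len(kept) < 2:
        raise ValueError("need at least two points after gating")
    s, t = fit_scale_shift([p for p, _ in kept], [r for _, r in kept])
    residuals = [s * p + t - r for p, r in kept]
    return s, t, residuals

# The fourth point is an outlier with high predicted variance and is gated out.
s, t, residuals = gated_alignment(
    pred=[1.0, 2.0, 3.0, 10.0],
    ref=[3.0, 5.0, 7.0, 100.0],
    variances=[0.1, 0.1, 0.1, 9.0],
    max_var=1.0,
)
```

With the outlier removed, the remaining points satisfy `ref = 2 * pred + 1` exactly, so the fit recovers that scale and shift with near-zero residuals. Without gating, the single unstable point would dominate the fit.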

[88] Exploring Real-Time Super-Resolution: Benchmarking and Fine-Tuning for Streaming Content

Evgeney Bogatyrev,Khaled Abud,Ivan Molodetskikh,Nikita Alutis,Dmitry Vatolin

Main category: cs.CV

TL;DR: This paper introduces the StreamSR dataset and the EfRLFN model to improve real-time super-resolution of compressed video streams.

Details Motivation: Existing real-time super-resolution methods perform poorly on compressed video content, and commonly used datasets do not accurately reflect streaming media characteristics, so benchmarks lack real-world relevance. Method: Build the StreamSR dataset from YouTube and benchmark 11 state-of-the-art real-time super-resolution models; propose EfRLFN, which combines Efficient Channel Attention with a hyperbolic tangent activation function, and design a composite loss function to improve training. Result: EfRLFN outperforms existing methods in both visual quality and runtime efficiency; fine-tuning other models on StreamSR also yields significant gains that generalize across several standard benchmarks. Conclusion: The StreamSR dataset and the EfRLFN model give real-time super-resolution research for streaming a more realistic benchmark and an efficient solution. Abstract: Recent advancements in real-time super-resolution have enabled higher-quality video streaming, yet existing methods struggle with the unique challenges of compressed video content. Commonly used datasets do not accurately reflect the characteristics of streaming media, limiting the relevance of current benchmarks. To address this gap, we introduce a comprehensive dataset - StreamSR - sourced from YouTube, covering a wide range of video genres and resolutions representative of real-world streaming scenarios. We benchmark 11 state-of-the-art real-time super-resolution models to evaluate their performance for the streaming use-case. Furthermore, we propose EfRLFN, an efficient real-time model that integrates Efficient Channel Attention and a hyperbolic tangent activation function - a novel design choice in the context of real-time super-resolution. We extensively optimized the architecture to maximize efficiency and designed a composite loss function that improves training convergence. EfRLFN combines the strengths of existing architectures while improving both visual quality and runtime performance. Finally, we show that fine-tuning other models on our dataset results in significant performance gains that generalize well across various standard benchmarks. We made the dataset, the code, and the benchmark available at https://github.com/EvgeneyBogatyrev/EfRLFN.

[89] ArtContext: Contextualizing Artworks with Open-Access Art History Articles and Wikidata Knowledge through a LoRA-Tuned CLIP Model

Samuel Waugh,Stuart James

Main category: cs.CV

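TL;DR: This paper proposes ArtContext, a pipeline that uses open-access art history articles and Wikidata knowledge, together with an adapted CLIP model (PaintingCLIP), to annotate artworks with context, improving retrieval and understanding of art-historical information.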

Details Motivation: Art history articles discuss artworks both as a whole and in their specific parts, but readers cannot easily find what different articles say about the same work, so a tool that automatically links text to image content is needed. Method: Build a new corpus collection pipeline that assembles training data from open-access art history literature and Wikidata; adapt CLIP to the domain with LoRA, yielding the weakly supervised, domain-specific model PaintingCLIP. Result: PaintingCLIP outperforms the original CLIP on art contextualization and provides relevant textual context for a given artwork; the pipeline generalizes to other humanities areas. Conclusion: ArtContext provides a scalable, reusable framework for joint image-text analysis in art history research, advancing the integration of visual and textual knowledge in the digital humanities. Abstract: Many Art History articles discuss artworks in general as well as specific parts of works, such as layout, iconography, or material culture. However, when viewing an artwork, it is not trivial to identify what different articles have said about the piece. Therefore, we propose ArtContext, a pipeline for taking a corpus of Open-Access Art History articles and Wikidata Knowledge and annotating Artworks with this information. We do this using a novel corpus collection pipeline, then learn a bespoke CLIP model adapted using Low-Rank Adaptation (LoRA) to make it domain-specific. We show that the new model, PaintingCLIP, which is weakly supervised by the collected corpus, outperforms CLIP and provides context for a given artwork. The proposed pipeline is generalisable and can be readily applied to numerous humanities areas.

[90] Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation

Alan Baade,Eric Ryan Chan,Kyle Sargent,Changan Chen,Justin Johnson,Ehsan Adeli,Li Fei-Fei

Main category: cs.CV

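TL;DR: This paper proposes Latent Forcing, which retains the efficiency of latent diffusion models while operating directly on raw images; it jointly processes latents and pixels with separately tuned noise schedules, improving pixel-space generation quality.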

Details Motivation: Existing latent diffusion models generate high-quality images but lose the benefits of end-to-end modeling because they discard information during image encoding, require a separately trained decoder, and model an auxiliary distribution. Method: Latent Forcing is a simple modification to existing architectures that jointly processes latents and pixels and orders the denoising trajectory with separately tuned noise schedules, so that the latents act as a scratchpad for intermediate computation. Result: On ImageNet, Latent Forcing sets a new state of the art for diffusion-transformer-based pixel generation at the authors' compute scale. Conclusion: Latent Forcing bridges the gap between the efficiency of latent diffusion and end-to-end modeling of raw images, and shows that the order of conditioning signals is critical to generation performance. Abstract: Latent diffusion models excel at generating high-quality images but lose the benefits of end-to-end modeling. They discard information during image encoding, require a separately trained decoder, and model an auxiliary distribution to the raw data. In this paper, we propose Latent Forcing, a simple modification to existing architectures that achieves the efficiency of latent diffusion while operating on raw natural images. Our approach orders the denoising trajectory by jointly processing latents and pixels with separately tuned noise schedules. This allows the latents to act as a scratchpad for intermediate computation before high-frequency pixel features are generated. We find that the order of conditioning signals is critical, and we analyze this to explain differences between REPA distillation in the tokenizer and the diffusion model, conditional versus unconditional generation, and how tokenizer reconstruction quality relates to diffusability. Applied to ImageNet, Latent Forcing achieves a new state-of-the-art for diffusion transformer-based pixel generation at our compute scale.

[91] Fighting MRI Anisotropy: Learning Multiple Cardiac Shapes From a Single Implicit Neural Representation

Carolina Brás,Soufiane Ben Haddou,Thijs P. Kuipers,Laura Alvarez-Florez,R. Nils Planken,Fleur V. Y. Tjong,Connie Bezzina,Ivana Išgum

Main category: cs.CV

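TL;DR: This paper proposes training a single neural implicit function on high-resolution, near-isotropic CTA data to reconstruct cardiac structures (RV and MYO) from low-resolution, anisotropic short-axis CMRI, and validates reconstruction accuracy on 4CH slices.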

Details Motivation: Short-axis CMRI is anisotropic, which limits the accuracy of cardiac shape analysis, whereas high-resolution, near-isotropic CTA data can provide a better geometric prior. Method: Train one neural implicit function on CTA data to jointly represent cardiac shapes from CMRI at any resolution; focus on reconstructing the right ventricle (RV) and myocardium (MYO), with MYO modeling both the endocardial and epicardial left-ventricle surfaces; evaluate by extracting a 4CH slice from the reconstructed shapes and comparing it with CMRI reference segmentations. Result: For 4CH-slice reconstruction of RV and MYO, the Dice similarity coefficients are 0.91 ± 0.07 and 0.75 ± 0.13, and the Hausdorff distances are 6.21 ± 3.97 mm and 7.53 ± 5.13 mm, respectively; qualitative and quantitative results indicate accurate, smooth, anatomically plausible shapes. Conclusion: The method improves the accuracy and robustness of cardiac shape analysis under anisotropic CMRI and suggests a new direction for clinical applications. Abstract: The anisotropic nature of short-axis (SAX) cardiovascular magnetic resonance imaging (CMRI) limits cardiac shape analysis. To address this, we propose to leverage near-isotropic, higher resolution computed tomography angiography (CTA) data of the heart. We use this data to train a single neural implicit function to jointly represent cardiac shapes from CMRI at any resolution. We evaluate the method for the reconstruction of right ventricle (RV) and myocardium (MYO), where MYO simultaneously models endocardial and epicardial left-ventricle surfaces. Since high-resolution SAX reference segmentations are unavailable, we evaluate performance by extracting a 4-chamber (4CH) slice of RV and MYO from their reconstructed shapes. When compared with the reference 4CH segmentation masks from CMRI, our method achieved a Dice similarity coefficient of 0.91 ± 0.07 and 0.75 ± 0.13, and a Hausdorff distance of 6.21 ± 3.97 mm and 7.53 ± 5.13 mm for RV and MYO, respectively. Quantitative and qualitative assessments demonstrate the model's ability to reconstruct accurate, smooth and anatomically plausible shapes, supporting improvements in cardiac shape analysis.
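The two metrics reported above can be illustrated on small 2D binary masks. The sketch below computes the Dice similarity coefficient, 2|A∩B| / (|A| + |B|), and a brute-force symmetric Hausdorff distance over the foreground pixel coordinates. It is a toy example, not the authors' evaluation code: the helper names and the single isotropic `spacing` factor (millimetres per pixel) are assumptions, and practical implementations often restrict the Hausdorff computation to boundary pixels.

```python
import math

def mask_points(mask):
    """Set of (row, col) coordinates where a 2D mask is non-zero."""
    return {(i, j) for i, row in enumerate(mask) for j, v in enumerate(row) if v}

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks."""
    a, b = mask_points(mask_a), mask_points(mask_b)
    if not a and not b:
        return 1.0  # two empty masks are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))

def hausdorff(mask_a, mask_b, spacing=1.0):
    """Symmetric Hausdorff distance between foreground pixels, scaled by spacing."""
    a, b = mask_points(mask_a), mask_points(mask_b)
    if not a or not b:
        raise ValueError("both masks need at least one foreground pixel")

    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return spacing * max(directed(a, b), directed(b, a))

reference = [[1, 1, 0],
             [0, 0, 0]]
predicted = [[0, 1, 1],
             [0, 0, 0]]
d = dice(reference, predicted)                     # overlap of one pixel out of two each
h = hausdorff(reference, predicted, spacing=1.5)   # one-pixel shift at 1.5 mm per pixel
```

Dice rewards overlap and is insensitive to where the disagreement lies, while the Hausdorff distance reports the single worst boundary deviation, which is why the two are usually quoted together.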

[92] Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation

Penghui Ruan,Bojia Zi,Xianbiao Qi,Youze Huang,Rong Xiao,Pichao Wang,Jiannong Cao,Yuhui Shi

Main category: cs.CV

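TL;DR: This paper proposes Ctrl&Shift, an end-to-end diffusion framework that needs no explicit 3D modeling; through a two-stage decomposition (object removal plus reference-guided inpainting under camera pose control) and multi-task, multi-stage training, it achieves geometry-consistent, background-preserving, user-controllable object-level image and video editing.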

Details Motivation: Existing methods struggle to satisfy three goals at once: background preservation, geometric consistency under viewpoint changes, and user-controllable transformations. Geometry-based methods depend on explicit 3D reconstruction and generalize poorly, while diffusion-based methods generalize well but lack fine-grained geometric control. Method: The Ctrl&Shift framework (1) decouples manipulation into object removal and reference-guided inpainting under camera pose control; (2) uses a multi-task, multi-stage training strategy that separates background, identity, and pose signals; and (3) builds a real-world dataset of paired images and videos with estimated relative camera poses. Result: State-of-the-art fidelity, viewpoint consistency, and controllability; to the authors' knowledge, the first framework to unify fine-grained geometric control with real-world generalization without explicit 3D modeling. Conclusion: Ctrl&Shift bridges the gap between geometric precision and the generalization ability of diffusion models, offering a new paradigm for object-level editing without 3D priors. Abstract: Object-level manipulation, relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present Ctrl&Shift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that Ctrl&Shift achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.

[93] Enhanced Portable Ultra Low-Field Diffusion Tensor Imaging with Bayesian Artifact Correction and Deep Learning-Based Super-Resolution

Mark D. Olchanyi,Annabel Sorby-Adams,John Kirsch,Brian L. Edlow,Ava Farnan,Renfei Liu,Matthew S. Rosen,Emery N. Brown,W. Taylor Kimberly,Juan Eugenio Iglesias

Main category: cs.CV

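TL;DR: This paper presents a nine-direction, single-shell acquisition sequence for ultra-low-field (ULF) diffusion tensor imaging (DTI), together with an angle-dependent Bayesian bias field correction algorithm and DiffSR, a convolutional neural network super-resolution algorithm that generalizes without retraining; together they improve the spatial and angular resolution, signal-to-noise ratio, and recovery of white matter microstructure in ULF DTI, and support an Alzheimer's disease classification task.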

Details Motivation: Ultra-low-field MRI is limited by coarse spatial and angular resolution and low signal-to-noise ratio; DTI sequences are especially prone to degradation because of inherent sequence design and long scan times, and ULF DTI also shows artifacts spanning both the spatial and angular domains that require custom modeling to correct. Method: Introduce a nine-direction, single-shell ULF DTI acquisition sequence; design an angle-dependent Bayesian bias field correction algorithm; and develop DiffSR, a CNN-based general-purpose super-resolution algorithm that needs no retraining for new datasets. Result: In a synthetic downsampling experiment and in real, matched ULF and high-field DTI data, the algorithms recover white matter microstructural and volumetric information; applied directly to Alzheimer's disease classification on synthetically degraded scans, DiffSR notably improves the agreement of DTI metrics with the original scans. Conclusion: The proposed methods overcome key imaging bottlenecks of ULF DTI, DiffSR generalizes across datasets, and the code is released to advance ULF reconstruction and DTI sequence harmonization. Abstract: Portable, ultra-low-field (ULF) magnetic resonance imaging has the potential to expand access to neuroimaging but currently suffers from coarse spatial and angular resolutions and low signal-to-noise ratios. Diffusion tensor imaging (DTI), a sequence tailored to detect and reconstruct white matter tracts within the brain, is particularly prone to such imaging degradation due to inherent sequence design coupled with prolonged scan times. In addition, ULF DTI scans exhibit artifacting that spans both the space and angular domains, requiring a custom modelling algorithm for subsequent correction. We introduce a nine-direction, single-shell ULF DTI sequence, as well as a companion Bayesian bias field correction algorithm that possesses angular dependence and a convolutional neural network-based super-resolution algorithm that is generalizable across DTI datasets and does not require re-training ("DiffSR"). We show through a synthetic downsampling experiment and white matter assessment in real, matched ULF and high-field DTI scans that these algorithms can recover microstructural and volumetric white matter information at ULF. We also show that DiffSR can be directly applied to white matter-based Alzheimer's disease classification in synthetically degraded scans, with notable improvements in agreement between DTI metrics, as compared to un-degraded scans. We freely disseminate the Bayesian bias correction algorithm and DiffSR with the goal of furthering progress on both ULF reconstruction methods and general DTI sequence harmonization. 
We release all code related to DiffSR for public use at https://github.com/markolchanyi/DiffSR.

[94] A Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness

Yun-Cheng Li,Sen Lei,Heng-Chao Li,Ke Li

Main category: cs.CV

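TL;DR: This paper proposes DBTANet, a dual-branch semantic change detection framework that combines SAM and ResNet34 encoders, a Bidirectional Temporal Awareness Module (BTAM), and a Gaussian-smoothed Projection Module (GSPM) to sharpen boundaries and strengthen temporal modeling, reaching state-of-the-art performance on two public datasets.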

Details Motivation: Existing semantic change detection methods suffer from blurred boundaries and inadequate temporal modeling, which limits segmentation accuracy. Method: Use a dual-branch Siamese encoder in which a frozen SAM branch captures global semantics and boundary priors while a ResNet34 branch extracts local detail; design a Bidirectional Temporal Awareness Module (BTAM) that symmetrically aggregates multi-scale features and models temporal dependencies; and introduce a Gaussian-smoothed Projection Module (GSPM) that refines shallow SAM features to strengthen boundary constraints. Result: Experiments on two public benchmarks show that DBTANet effectively integrates global semantics, local details, temporal reasoning, and boundary awareness, achieving state-of-the-art performance. Conclusion: By jointly modeling the boundary and temporal dimensions, DBTANet overcomes key weaknesses of earlier SCD methods and offers a new paradigm for change detection in remote sensing imagery. Abstract: Semantic Change Detection (SCD) aims to detect and categorize land-cover changes from bi-temporal remote sensing images. Existing methods often suffer from blurred boundaries and inadequate temporal modeling, limiting segmentation accuracy. To address these issues, we propose a Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness, termed DBTANet. Specifically, we utilize a dual-branch Siamese encoder where a frozen SAM branch captures global semantic context and boundary priors, while a ResNet34 branch provides local spatial details, ensuring complementary feature representations. On this basis, we design a Bidirectional Temporal Awareness Module (BTAM) to aggregate multi-scale features and capture temporal dependencies in a symmetric manner. Furthermore, a Gaussian-smoothed Projection Module (GSPM) refines shallow SAM features, suppressing noise while enhancing edge information for boundary-aware constraints. Extensive experiments on two public benchmarks demonstrate that DBTANet effectively integrates global semantics, local details, temporal reasoning, and boundary awareness, achieving state-of-the-art performance.

[95] Arbitrary Ratio Feature Compression via Next Token Prediction

Yufan Liu,Daoyuan Ren,Zhipeng Zhang,Wenyang Luo,Bing Li,Weiming Hu,Stephen Maybank

Main category: cs.CV

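TL;DR: This paper proposes the Arbitrary Ratio Feature Compression (ARFC) framework, in which an auto-regressive Arbitrary Ratio Compressor (ARC) lets a single model support any compression ratio without retraining; a Mixture of Solutions (MoS) module improves robustness and an Entity Relation Graph Constraint (ERGC) preserves semantic structure. The method outperforms existing approaches on several cross-modal and image tasks and in some cases even exceeds the original uncompressed features.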

Details Motivation: Existing feature compression methods usually train a dedicated model for each compression ratio, which limits flexibility and generalization; adapting to a new ratio requires retraining, which restricts practical use. Method: Propose the ARFC framework, whose core is the auto-regressive ARC model that achieves any compression ratio by controlling the number of generated tokens; add the MoS module, which fuses multiple solutions to reduce uncertainty; and add the ERGC module, which constrains entity relations during training to preserve semantic and structural information. Result: Across multiple tasks and datasets, including cross-modal retrieval, image classification, and image retrieval, ARFC consistently outperforms existing methods at various compression ratios; in some cases it even surpasses the original uncompressed features. Conclusion: ARFC is a flexible, efficient, general-purpose feature compression method suited to resource-constrained settings, removing the need for multiple models and retraining. Abstract: Feature compression is increasingly important for improving the efficiency of downstream tasks, especially in applications involving large-scale or multi-modal data. While existing methods typically rely on dedicated models for achieving specific compression ratios, they are often limited in flexibility and generalization. In particular, retraining is necessary when adapting to a new compression ratio. To address this limitation, we propose a novel and flexible Arbitrary Ratio Feature Compression (ARFC) framework, which supports any compression ratio with a single model, eliminating the need for multiple specialized models. At its core, the Arbitrary Ratio Compressor (ARC) is an auto-regressive model that performs compression via next-token prediction. This allows the compression ratio to be controlled at inference simply by adjusting the number of generated tokens. To enhance the quality of the compressed features, two key modules are introduced. The Mixture of Solutions (MoS) module refines the compressed tokens by utilizing multiple compression results (solutions), reducing uncertainty and improving robustness. The Entity Relation Graph Constraint (ERGC) is integrated into the training process to preserve semantic and structural relationships during compression. Extensive experiments on cross-modal retrieval, image classification, and image retrieval tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches at various compression ratios. Notably, in some cases, it even surpasses the performance of the original, uncompressed features. 
These results validate the effectiveness and versatility of ARFC for practical, resource-constrained scenarios.

[96] What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation

Zhenlong Yuan,Xiangyan Qu,Jing Tang,Rui Chen,Lei Sun,Ruidong Chen,Hongwei Yu,Chengxuan Qian,Xiangxiang Chu,Shuo Li,Yuyin Zhou

Main category: cs.CV

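TL;DR: This paper proposes ImagineAgent, a framework that combines cognitive reasoning with generative imagination to address cross-modal hallucination and occlusion-induced ambiguity in open-vocabulary human-object interaction (OV-HOI), reaching state-of-the-art results on SWIG-HOI and HICO-DET with about 20% of the training data.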

Details Motivation: Existing multimodal large language models are limited on open-vocabulary human-object interaction (OV-HOI) by cross-modal hallucination and occlusion-induced semantic ambiguity, and lack robust visual understanding. Method: Propose the ImagineAgent agentic framework: construct cognitive maps that explicitly model relationships between entities and actions; dynamically invoke tools such as retrieval augmentation, image cropping, and diffusion models to gather domain knowledge and visual evidence; and design a composite reward that balances prediction accuracy and tool efficiency. Result: State-of-the-art performance on the SWIG-HOI and HICO-DET datasets with only about 20% of the training data, supporting the method's robustness and efficiency. Conclusion: An agentic paradigm that couples cognitive reasoning with generative imagination can mitigate cross-modal inconsistency and occlusion ambiguity in OV-HOI and offers a new approach to multimodal visual understanding. Abstract: Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose ImagineAgent, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method innovatively constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. Subsequently, it dynamically invokes tools including retrieval augmentation, image cropping, and diffusion models to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on SWIG-HOI and HICO-DET datasets demonstrate our SOTA performance, requiring approximately 20% of training data compared to existing methods, validating our robustness and efficiency.

[97] Vascular anatomy-aware self-supervised pre-training for X-ray angiogram analysis

De-Xing Huang,Chaohui Yu,Xiao-Hu Zhou,Tian-Yu Xiang,Qin-Yi Zhang,Mei-Jiang Gui,Rui-Ze Ma,Chen-Yu Wang,Nu-Fang Xiao,Fan Wang,Zeng-Guang Hou

Main category: cs.CV

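TL;DR: This paper proposes VasoMIM, a vascular anatomy-aware masked image modeling framework, together with XA-170K, a self-built large-scale X-ray angiogram pre-training dataset; the combination improves downstream task performance and advances X-ray angiogram analysis.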

Details Motivation: Deep learning for X-ray angiogram analysis is constrained by scarce annotated data, and large-scale self-supervised learning remains underexplored in this domain, mainly because effective SSL frameworks and large-scale datasets are lacking. Method: Propose the VasoMIM framework, which includes an anatomy-guided masking strategy (preferentially masking vessel-containing regions) and an anatomical consistency loss (preserving vascular structure between original and reconstructed images), and build XA-170K, the largest X-ray angiogram pre-training dataset to date. Result: Validated on four downstream tasks across six datasets, VasoMIM shows superior transferability and state-of-the-art performance. Conclusion: As a foundation model, VasoMIM has significant potential to advance a wide range of X-ray angiogram analysis tasks; the code and dataset will be released. Abstract: X-ray angiography is the gold standard imaging modality for cardiovascular diseases. However, current deep learning approaches for X-ray angiogram analysis are severely constrained by the scarcity of annotated data. While large-scale self-supervised learning (SSL) has emerged as a promising solution, its potential in this domain remains largely unexplored, primarily due to the lack of effective SSL frameworks and large-scale datasets. To bridge this gap, we introduce a vascular anatomy-aware masked image modeling (VasoMIM) framework that explicitly integrates domain-specific anatomical knowledge. Specifically, VasoMIM comprises two key designs: an anatomy-guided masking strategy and an anatomical consistency loss. The former strategically masks vessel-containing patches to compel the model to learn robust vascular semantics, while the latter preserves structural consistency of vessels between original and reconstructed images, enhancing the discriminability of the learned representations. In conjunction with VasoMIM, we curate XA-170K, the largest X-ray angiogram pre-training dataset to date. We validate VasoMIM on four downstream tasks across six datasets, where it demonstrates superior transferability and achieves state-of-the-art performance compared to existing methods. These findings highlight the significant potential of VasoMIM as a foundation model for advancing a wide range of X-ray angiogram analysis tasks. VasoMIM and XA-170K will be available at https://github.com/Dxhuang-CASIA/XA-SSL.

[98] Supervise-assisted Multi-modality Fusion Diffusion Model for PET Restoration

Yingkai Zhang,Shuang Chen,Ye Tian,Yunyi Gao,Jianyong Jiang,Ying Fu

Main category: cs.CV

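TL;DR: This paper proposes a supervise-assisted multi-modality fusion diffusion model (MFdiff) that uses MR images to help restore low-dose PET images; a multi-modality feature fusion module and a two-stage supervised learning strategy mitigate cross-modal structure and texture inconsistencies and out-of-distribution (OOD) data mismatch, improving PET restoration quality.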

Details Motivation: Reducing the radiation dose of PET imaging degrades image quality; using high-resolution MR images to help restore standard-dose PET (SPET) from low-dose PET (LPET) is promising but faces structure and texture inconsistencies in multi-modality fusion and mismatch on out-of-distribution (OOD) data. Method: Propose the supervise-assisted multi-modality fusion diffusion model (MFdiff): (1) a multi-modality feature fusion module that optimizes the fusion of MR and LPET features without introducing extraneous detail; (2) a diffusion model conditioned on the fused feature that iteratively generates high-quality SPET; and (3) a two-stage supervised learning strategy that combines generalized priors from simulated in-distribution data with specific priors from real in-vivo OOD data. Result: MFdiff restores high-quality SPET images from multi-modality inputs and outperforms state-of-the-art methods in both qualitative and quantitative evaluation. Conclusion: By jointly modeling complementary multi-modality information and learning in stages, MFdiff offers a robust, efficient solution for low-dose PET restoration, particularly in real clinical settings with distribution shift. Abstract: Positron emission tomography (PET) offers powerful functional imaging but involves radiation exposure. Efforts to reduce this exposure by lowering the radiotracer dose or scan time can degrade image quality. While using magnetic resonance (MR) images with clearer anatomical information to restore standard-dose PET (SPET) from low-dose PET (LPET) is a promising approach, it faces challenges with the inconsistencies in the structure and texture of multi-modality fusion, as well as the mismatch in out-of-distribution (OOD) data. In this paper, we propose a supervise-assisted multi-modality fusion diffusion model (MFdiff) for addressing these challenges for high-quality PET restoration. Firstly, to fully utilize auxiliary MR images without introducing extraneous details in the restored image, a multi-modality feature fusion module is designed to learn an optimized fusion feature. Secondly, using the fusion feature as an additional condition, high-quality SPET images are iteratively generated based on the diffusion model. Furthermore, we introduce a two-stage supervise-assisted learning strategy that harnesses both generalized priors from simulated in-distribution datasets and specific priors tailored to in-vivo OOD data. Experiments demonstrate that the proposed MFdiff effectively restores high-quality SPET images from multi-modality inputs and outperforms state-of-the-art methods both qualitatively and quantitatively.

[99] Perception-based Image Denoising via Generative Compression

Nam Nguyen,Thinh Nguyen,Bella Bose

Main category: cs.CV

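TL;DR: This paper proposes a perception-based image denoising framework built on generative compression; it pairs entropy-coded latent representations with generative decoders (such as a WGAN or a diffusion model) to improve visual realism while preserving structural detail, and provides theoretical error bounds and experimental validation.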

Details Motivation: Traditional distortion-driven denoising methods tend to over-smooth under strong noise and distribution shift and struggle to balance fidelity with perceptual quality. Method: Build a generative compression denoising framework that reconstructs from entropy-coded, low-complexity latent representations and recovers texture with generative decoders trained using LPIPS loss and Wasserstein distance; the concrete instantiations are a conditional WGAN compression denoiser and a conditional diffusion reconstruction strategy; non-asymptotic theoretical guarantees are given for the compression-based maximum-likelihood denoiser under additive Gaussian noise. Result: On both synthetic-noise and real-noise datasets, the method yields consistent perceptual improvements (e.g., lower LPIPS) while keeping competitive distortion metrics (e.g., PSNR, SSIM). Conclusion: The generative compression paradigm can balance the rate-distortion-perception trade-off and offers a new approach, with theoretical support, to perception-driven image denoising. Abstract: Image denoising aims to remove noise while preserving structural details and perceptual realism, yet distortion-driven methods often produce over-smoothed reconstructions, especially under strong noise and distribution shift. This paper proposes a generative compression framework for perception-based denoising, where restoration is achieved by reconstructing from entropy-coded latent representations that enforce low-complexity structure, while generative decoders recover realistic textures via perceptual measures such as learned perceptual image patch similarity (LPIPS) loss and Wasserstein distance. Two complementary instantiations are introduced: (i) a conditional Wasserstein GAN (WGAN)-based compression denoiser that explicitly controls the rate-distortion-perception (RDP) trade-off, and (ii) a conditional diffusion-based reconstruction strategy that performs iterative denoising guided by compressed latents. We further establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise, including bounds on reconstruction error and decoding error probability. Experiments on synthetic and real-noise benchmarks demonstrate consistent perceptual improvements while maintaining competitive distortion performance.

[100] LUVE : Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts

Chen Zhao,Jiawei Chen,Hongyu Li,Zhuoliang Kang,Shilin Lu,Xiaoming Wei,Kai Zhang,Jian Yang,Ying Tai

Main category: cs.CV

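TL;DR: This paper proposes LUVE, a three-stage cascaded framework that addresses motion modeling, semantic planning, and detail synthesis in ultra-high-resolution (UHR) video generation, markedly improving the visual quality and content fidelity of UHR video.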

Details Motivation: Existing video diffusion models still face the compounded challenges of motion modeling, semantic planning, and detail synthesis in ultra-high-resolution (UHR) video generation. Method: Proposes LUVE, a latent-cascaded UHR video generation framework built on dual frequency experts, with three stages: low-resolution motion generation, video upsampling in latent space, and high-resolution content refinement in which high-frequency and low-frequency experts cooperate. Result: LUVE achieves better photorealism and content fidelity in UHR video generation; ablation studies validate each component. Conclusion: LUVE offers an efficient, high-quality solution for UHR video generation that balances motion consistency, semantic coherence, and rich detail. Abstract: Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that our LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project is available at https://unicornanrocinu.github.io/LUVE_web/.

[101] Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception

Zesheng Jia,Jin Wang,Siao Liu,Lingzhi Li,Ziyao Huang,Yunjiang Xu,Jianping Wang

Main category: cs.CV

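TL;DR: This paper proposes FlowAdapt, a parameter-efficient multi-agent domain adaptation framework grounded in optimal transport theory. It uses Wasserstein Greedy Sampling and a Progressive Knowledge Transfer module to address the redundancy and semantic degradation that arise when PEFT is applied to V2X collaborative perception, reaching state-of-the-art performance while training only 1% of the parameters.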

Details Motivation: Fast domain adaptation is a key challenge for deploying V2X multi-agent collaborative perception across environments; existing PEFT methods suffer performance degradation and training instability when applied directly to multi-agent settings. Method: Proposes the FlowAdapt framework: 1) grounded in optimal transport theory, it minimizes information transport costs across data distributions and network hierarchies; 2) a Wasserstein Greedy Sampling strategy filters redundant samples via a bounded covering radius; 3) a Progressive Knowledge Transfer module injects compressed early-stage representations into later layers through learnable pathways, alleviating semantic degradation in deep layers. Result: On three benchmarks, FlowAdapt reaches state-of-the-art performance with only 1% trainable parameters, with markedly better sample efficiency and generalization. Conclusion: FlowAdapt addresses inter-frame redundancy and deep-layer semantic degradation in multi-agent PEFT, offering an efficient, stable, and lightweight approach to domain adaptation for V2X collaborative perception. Abstract: Fast domain adaptation remains a fundamental challenge for deploying multi-agent systems across diverse environments in Vehicle-to-Everything (V2X) collaborative perception. Despite the success of Parameter-Efficient Fine-Tuning (PEFT) in natural language processing and conventional vision tasks, directly applying PEFT to multi-agent settings leads to significant performance degradation and training instability. In this work, we conduct a detailed analysis and identify two key factors: (i) inter-frame redundancy in heterogeneous sensory streams, and (ii) erosion of fine-grained semantics in deep-layer representations under PEFT adaptation. To address these issues, we propose FlowAdapt, a parameter-efficient framework grounded in optimal transport theory, which minimizes information transport costs across both data distributions and network hierarchies. Specifically, we introduce a Wasserstein Greedy Sampling strategy to selectively filter redundant samples via a bounded covering radius. Furthermore, a Progressive Knowledge Transfer module is designed to progressively inject compressed early-stage representations into later stages through learnable pathways, alleviating semantic degradation in late-stage adaptation. Extensive experiments on three benchmarks demonstrate that FlowAdapt achieves state-of-the-art performance with only 1% of trainable parameters, effectively bridging domain gaps with superior sample efficiency and generalization.
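
The idea of filtering redundant samples "via a bounded covering radius" can be illustrated with a generic greedy cover. This is only a sketch under stated assumptions: the abstract does not give the actual Wasserstein Greedy Sampling procedure, so plain Euclidean distance between feature vectors stands in for a Wasserstein cost, and the function name `greedy_cover` and the `radius` parameter are hypothetical.

```python
import math


def greedy_cover(points, radius):
    """Keep a subset of samples such that every sample lies within
    `radius` of a kept one (greedy farthest-point selection).

    Assumption: Euclidean distance is a stand-in for the transport
    cost used in the paper, which the abstract does not specify.
    """
    selected = [0]  # start from the first sample
    # Distance from each sample to its nearest kept sample.
    nearest = [math.dist(p, points[0]) for p in points]
    while max(nearest) > radius:
        # Add the sample that is currently covered worst.
        idx = max(range(len(points)), key=nearest.__getitem__)
        selected.append(idx)
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return selected


if __name__ == "__main__":
    # Two pairs of near-duplicate "frames" in a 2D feature space.
    frames = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
    print(greedy_cover(frames, radius=0.5))
```

In this toy input one representative per near-duplicate pair is kept, which is the sense in which a covering radius bounds how much information the discarded samples can carry.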

[102] A Large Language Model for Disaster Structural Reconnaissance Summarization

Yuqing Gao,Guanren Zhou,Khalid M. Mosalam

Main category: cs.CV

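TL;DR: This paper proposes an LLM-based Disaster Reconnaissance Summarization framework (LLM-DRS) that fuses vision data with text metadata, uses deep convolutional neural networks to extract structural damage attributes, and has an LLM generate structure-level or region-level summary reports to speed up post-disaster assessment.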

Details Motivation: Existing vision-based structural health monitoring (SHM) methods output only discrete results (such as damage classes and coordinates) that require further manual analysis; the rise of large language models offers a way to automatically generate readable, decision-ready assessment reports. Method: Defines a standardized reconnaissance plan for collecting images and text metadata in a unified way; uses deep convolutional neural networks to extract key attributes such as damage state, material type, and damage level; feeds the structured attributes and metadata into an LLM with engineered prompts to generate natural-language summary reports. Result: LLM-DRS automatically generates post-disaster reconnaissance summary reports for individual structures or affected regions, supporting the value of LLMs for making vision-based SHM more interpretable and practical. Conclusion: Integrating LLMs into vision-based SHM, especially for rapid post-disaster reconnaissance, shows considerable potential and can help strengthen the resilience of the built environment. Abstract: Artificial Intelligence (AI)-aided vision-based Structural Health Monitoring (SHM) has emerged as an effective approach for monitoring and assessing structural condition by analyzing image and video data. By integrating Computer Vision (CV) and Deep Learning (DL), vision-based SHM can automatically identify and localize visual patterns associated with structural damage. However, previous works typically generate only discrete outputs, such as damage class labels and damage region coordinates, requiring engineers to further reorganize and analyze these results for evaluation and decision-making. In late 2022, Large Language Models (LLMs) became popular across multiple fields, providing new insights into AI-aided vision-based SHM. In this study, a novel LLM-based Disaster Reconnaissance Summarization (LLM-DRS) framework is proposed. It introduces a standard reconnaissance plan in which the collection of vision data and corresponding metadata follows a well-designed on-site investigation process. Text-based metadata and image-based vision data are then processed and integrated into a unified format, where well-trained Deep Convolutional Neural Networks extract key attributes, including damage state, material type, and damage level. Finally, all data are fed into an LLM with carefully designed prompts, enabling the LLM-DRS to generate summary reports for individual structures or affected regions based on aggregated attributes and metadata. Results show that integrating LLMs into vision-based SHM, particularly for rapid post-disaster reconnaissance, demonstrates promising potential for improving resilience of the built environment through effective reconnaissance.

[103] PLOT-CT: Pre-log Voronoi Decomposition Assisted Generation for Low-dose CT Reconstruction

Bin Huang,Xun Yu,Yikun Zhang,Yi Zhang,Yang Chen,Qiegen Liu

Main category: cs.CV

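TL;DR: This paper proposes the PLOT-CT framework, which applies Voronoi decomposition to low-dose CT (LDCT) data in the pre-log domain to improve the accuracy and robustness of LDCT reconstruction.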

Details Motivation: Most existing LDCT reconstruction methods operate in the image domain or the post-log projection domain and cannot fully exploit the structural information in pre-log measurements; the logarithmic transformation also markedly amplifies noise, which limits reconstruction accuracy. Method: Proposes PLOT-CT: Voronoi decomposition is applied to pre-log sinograms, disentangling the data into distinct components embedded in separate latent spaces, so that noise is explicitly modeled and suppressed while the original information is preserved. Result: At the 1e4 incident photon level, PSNR improves by 2.36 dB over traditional methods, reaching state-of-the-art performance for pre-log-domain LDCT reconstruction. Conclusion: Voronoi decomposition in the pre-log domain mitigates noise amplification, which supports the importance of modeling in the raw measurement domain for low-dose CT reconstruction. Abstract: Low-dose computed tomography (LDCT) reconstruction is fundamentally challenged by severe noise and compromised data fidelity under reduced radiation exposure. Most existing methods operate either in the image or post-log projection domain, which fails to fully exploit the rich structural information in pre-log measurements while being highly susceptible to noise. The requisite logarithmic transformation critically amplifies noise within these data, imposing exceptional demands on reconstruction precision. To overcome these challenges, we propose PLOT-CT, a novel framework for Pre-Log vOronoi decomposiTion-assisted CT generation. Our method begins by applying Voronoi decomposition to pre-log sinograms, disentangling the data into distinct underlying components, which are embedded in separate latent spaces. This explicit decomposition significantly enhances the model's capacity to learn discriminative features, directly improving reconstruction accuracy by mitigating noise and preserving information inherent in the pre-log domain. Extensive experiments demonstrate that PLOT-CT achieves state-of-the-art performance, attaining a 2.36 dB PSNR improvement over traditional methods at the 1e4 incident photon level in the pre-log domain.

[104] PLESS: Pseudo-Label Enhancement with Spreading Scribbles for Weakly Supervised Segmentation

Yeva Gabrielyan,Varduhi Yeghiazaryan,Irina Voiculescu

Main category: cs.CV

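TL;DR: This paper proposes PLESS, a generic pseudo-label enhancement strategy that uses a hierarchical partition of the image into spatial regions to improve the reliability and spatial consistency of pseudo-labels, thereby improving scribble-supervised medical image segmentation.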

Details Motivation: Scribble annotations lower labeling cost in weakly supervised learning, but the supervision signal is sparse, noisy, and incomplete; existing pseudo-label-based methods are limited by pseudo-label quality. Method: PLESS builds on a hierarchical partition of the image into spatially coherent regions and propagates scribble information within semantically coherent regions to refine pseudo-labels; the method is model-agnostic and easy to integrate into existing pseudo-label frameworks. Result: On two cardiac MRI datasets, ACDC and MSCMRseg, combining PLESS with four scribble-supervised algorithms gives consistent gains in segmentation accuracy. Conclusion: PLESS is an effective, generic, plug-and-play pseudo-label enhancement strategy that improves scribble-supervised medical image segmentation. Abstract: Weakly supervised learning with scribble annotations uses sparse user-drawn strokes to indicate segmentation labels on a small subset of pixels. This annotation reduces the cost of dense pixel-wise labeling, but suffers inherently from noisy and incomplete supervision. Recent scribble-based approaches in medical image segmentation address this limitation using pseudo-label-based training; however, the quality of the pseudo-labels remains a key performance limit. We propose PLESS, a generic pseudo-label enhancement strategy which improves reliability and spatial consistency. It builds on a hierarchical partitioning of the image into spatially coherent regions. PLESS propagates scribble information to refine pseudo-labels within semantically coherent regions. The framework is model-agnostic and easily integrates into existing pseudo-label methods. Experiments on two public cardiac MRI datasets (ACDC and MSCMRseg) across four scribble-supervised algorithms show consistent improvements in segmentation accuracy. Code will be made available on GitHub upon acceptance.

[105] ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning

Changti Wu,Jiahuai Mao,Yuzhuo Miao,Shijie Lian,Bin Yu,Xiaopeng Lin,Cong Huang,Lei Zhang,Kai Chen

Main category: cs.CV

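TL;DR: This paper proposes ScalSelect, a training-free multimodal data selection method with linear time complexity for large-scale visual instruction tuning (VIT); it substantially improves training efficiency without relying on external models or datasets.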

Details Motivation: Large-scale visual instruction tuning (VIT) is computationally expensive and inefficient because of data redundancy, which calls for an efficient, scalable, training-free multimodal data selection method. Method: ScalSelect first builds sample representations from the visual features most attended by instruction tokens in the target VLM, then uses subspace approximation to identify the most representative samples, giving importance scores in linear time. Result: Across multiple VLMs, datasets, and selection budgets, training on only 16% of the data reaches over 97.5% of full-data performance, and exceeds full-data training in some settings. Conclusion: ScalSelect is an efficient, scalable, training-free data selection method that needs no proxy models or auxiliary data, making it a practical option for VIT. Abstract: Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on the large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates the need for multimodal data selection to improve training efficiency. Existing data selection methods for VIT either require costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at https://github.com/ChangtiWu/ScalSelect.

[106] Electrostatics-Inspired Surface Reconstruction (EISR): Recovering 3D Shapes as a Superposition of Poisson's PDE Solutions

Diego Patiño,Knut Peterson,Kostas Daniilidis,David K. Han

Main category: cs.CV

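TL;DR: This paper proposes a new implicit shape representation based on Poisson's equation (rather than the conventional Eikonal equation), using Green's functions and linear superposition to build an SDF approximation, which markedly improves the reconstruction of high-frequency geometric detail.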

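Details Motivation: Existing SDF-based implicit surface reconstruction methods mostly rely on the Eikonal PDE constraint, which is limited in modeling high-frequency detail; this work explores an alternative PDE framework that is more expressive and mathematically more tractable. Method: Models surface reconstruction as the solution of Poisson's equation; interprets the PDE through physical analogies such as the electrostatic potential; uses Green's functions to obtain a closed-form parametric solution; builds the target implicit field as a linear superposition of solutions associated with several shape priors. Result: With a small number of shape priors, the method reconstructs high-frequency detail better than existing Eikonal-based methods. Conclusion: Poisson's equation works well as a proxy PDE for high-quality implicit surface reconstruction, and its linearity and closed-form solvability open a new direction for implicit representations. Abstract: Implicit shape representation, such as SDFs, is a popular approach to recover the surface of a 3D shape as the level sets of a scalar field. Several methods approximate SDFs using machine learning strategies that exploit the knowledge that SDFs are solutions of the Eikonal partial differential equation (PDE). In this work, we present a novel approach to surface reconstruction by encoding it as a solution to a proxy PDE, namely Poisson's equation. Then, we explore the connection between Poisson's equation and physics, e.g., the electrostatic potential due to a positive charge density. We employ Green's functions to obtain a closed-form parametric expression for the PDE's solution, and leverage the linearity of our proxy PDE to find the target shape's implicit field as a superposition of solutions. Our method shows improved results in approximating high-frequency details, even with a small number of shape priors.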
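
The two ingredients named in the abstract, a Green's function and linear superposition, can be shown on the textbook case of point sources. This is a minimal sketch, not the paper's parametrization: it assumes free space in 3D, where the Green's function of the negative Laplacian is 1/(4π|x − x'|), and the names `green_3d` and `potential` are hypothetical.

```python
import math


def green_3d(x, source):
    """Free-space Green's function of the negative Laplacian in 3D:
    G(x, x') = 1 / (4 * pi * |x - x'|). Undefined at x == x'."""
    return 1.0 / (4.0 * math.pi * math.dist(x, source))


def potential(x, charges):
    """Potential of point sources by superposition.

    `charges` is a list of (position, strength) pairs; because
    Poisson's equation is linear, the field of the combined sources
    is the sum of the individual closed-form solutions.
    """
    return sum(q * green_3d(x, pos) for pos, q in charges)


if __name__ == "__main__":
    a = [((0.0, 0.0, 0.0), 1.0)]
    b = [((2.0, 0.0, 0.0), -0.5)]
    probe = (1.0, 1.0, 0.0)
    # Linearity: the field of a + b equals the sum of the two fields.
    print(potential(probe, a + b), potential(probe, a) + potential(probe, b))
```

The linearity check is the property the abstract relies on: the Eikonal equation is nonlinear, so sums of its solutions are generally not solutions, whereas sums of Poisson solutions are.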

[107] Brain Tumor Classifiers Under Attack: Robustness of ResNet Variants Against Transferable FGSM and PGD Attacks

Ryan Deem,Garrett Goodman,Waqas Majeed,Md Abdullah Al Hafiz Khan,Michail S. Alexiou

Main category: cs.CV

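TL;DR: This paper studies the adversarial robustness of ResNet-based brain tumor classifiers (BrainNet, BrainNeXt, DilationNet) on MRI data. BrainNeXt is the most robust to black-box attacks, but its adversarial samples transfer poorly; lowering the input resolution and removing augmentation markedly weaken robustness, revealing a trade-off between accuracy and robustness.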

Details Motivation: The adversarial robustness of deep learning models for brain tumor classification is not well understood, yet it is critical for clinical MRI deployment. Method: Evaluates the robustness of three ResNet variants (BrainNet, BrainNeXt, DilationNet) under FGSM and PGD attacks across three MRI preprocessing configurations (full-sized augmented, shrunk augmented, shrunk non-augmented). Result: BrainNeXt is the most robust to black-box attacks but produces weakly transferable adversarial samples; BrainNet and DilationNet are vulnerable to attacks from each other, especially under PGD with more steps or a larger step size; shrunk, non-augmented data greatly reduces robustness even when test accuracy stays high. Conclusion: Brain tumor classifiers should be evaluated jointly on classification performance and adversarial robustness; preprocessing choices such as resolution and data augmentation strongly affect robustness, which matters for safe clinical deployment. Abstract: Adversarial robustness in deep learning models for brain tumor classification remains an underexplored yet critical challenge, particularly for clinical deployment scenarios involving MRI data. In this work, we investigate the susceptibility and resilience of several ResNet-based architectures, referred to as BrainNet, BrainNeXt and DilationNet, against gradient-based adversarial attacks, namely FGSM and PGD. These models, based on ResNet, ResNeXt, and dilated ResNet variants respectively, are evaluated across three preprocessing configurations: (i) full-sized augmented, (ii) shrunk augmented and (iii) shrunk non-augmented MRI datasets. Our experiments reveal that BrainNeXt models exhibit the highest robustness to black-box attacks, likely due to their increased cardinality, though they produce weaker transferable adversarial samples. In contrast, BrainNet and Dilation models are more vulnerable to attacks from each other, especially under PGD with higher iteration steps and $α$ values. Notably, shrunk and non-augmented data significantly reduce model resilience, even when the untampered test accuracy remains high, highlighting a key trade-off between input resolution and adversarial vulnerability. These results underscore the importance of jointly evaluating classification performance and adversarial robustness for reliable real-world deployment in brain MRI analysis.
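
The abstract names FGSM and PGD without restating them. The sketch below shows the standard L-infinity forms of both attacks, assuming a toy logistic-regression classifier in place of the ResNet variants; the weights, `eps`, `alpha` (the step size the abstract calls $α$), and the step count are illustrative values, not settings from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(x, y, w, b):
    """Gradient of the binary cross-entropy loss w.r.t. the input x
    for a logistic model p = sigmoid(w . x + b): (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, eps):
    """Single step of size eps along the sign of the input gradient."""
    g = input_gradient(x, y, w, b)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def pgd(x, y, w, b, eps, alpha, steps):
    """Iterated signed-gradient steps of size alpha, each followed by
    projection back onto the L-infinity ball of radius eps around x."""
    adv = list(x)
    for _ in range(steps):
        g = input_gradient(adv, y, w, b)
        adv = [ai + alpha * sign(gi) for ai, gi in zip(adv, g)]
        adv = [min(max(ai, xi - eps), xi + eps) for ai, xi in zip(adv, x)]
    return adv


if __name__ == "__main__":
    w, b = [2.0, -1.0], 0.0
    x, y = [1.0, 1.0], 1
    print(fgsm(x, y, w, b, eps=0.1))
    print(pgd(x, y, w, b, eps=0.1, alpha=0.05, steps=5))
```

A transfer (black-box) attack in the sense of the abstract would compute the perturbation on one model and evaluate it on another; that step is omitted here.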

[108] GR-Diffusion: 3D Gaussian Representation Meets Diffusion in Whole-Body PET Reconstruction

Mengxiao Geng,Zijie Chen,Ran Hong,Bingxuan Li,Qiegen Liu

Main category: cs.CV

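TL;DR: This paper proposes the GR-Diffusion framework, which combines the geometric priors of the 3D discrete Gaussian representation (GR) with the generative power of diffusion models for low-dose whole-body PET reconstruction, markedly improving image quality and detail preservation.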

Details Motivation: PET reconstruction suffers from noise amplification, structural blurring, and detail loss; conventional methods are limited by their low-pass behavior and struggle to achieve both global consistency and local accuracy. Method: Proposes the GR-Diffusion framework: GR first generates a physically plausible, structurally explicit reference image from projection data; a hierarchical guidance mechanism based on this reference (fine-grained difference refinement plus coarse-grained multi-scale difference correction) then steers the diffusion model to integrate the geometric prior step by step and recover sub-voxel information. Result: On the UDPET and Clinical datasets, GR-Diffusion outperforms state-of-the-art methods at every dose level, improving 3D whole-body PET image quality and the fidelity of physiological detail. Conclusion: Jointly modeling GR and diffusion overcomes the ill-posedness and sparse-sampling limits of PET reconstruction, offering an approach to low-dose molecular imaging that is both physically interpretable and strongly generative. Abstract: Positron emission tomography (PET) reconstruction is a critical challenge in molecular imaging, often hampered by noise amplification, structural blurring, and detail loss due to sparse sampling and the ill-posed nature of inverse problems. The three-dimensional discrete Gaussian representation (GR), which efficiently encodes 3D scenes using parameterized discrete Gaussian distributions, has shown promise in computer vision. In this work, we propose a novel GR-Diffusion framework that synergistically integrates the geometric priors of GR with the generative power of diffusion models for 3D low-dose whole-body PET reconstruction. GR-Diffusion employs GR to generate a reference 3D PET image from projection data, establishing a physically grounded and structurally explicit benchmark that overcomes the low-pass limitations of conventional point-based or voxel-based methods. This reference image serves as a dual guide during the diffusion process, ensuring both global consistency and local accuracy. Specifically, we employ a hierarchical guidance mechanism based on the GR reference. Fine-grained guidance leverages differences to refine local details, while coarse-grained guidance uses multi-scale difference maps to correct deviations. This strategy allows the diffusion model to sequentially integrate the strong geometric prior from GR and recover sub-voxel information. Experimental results on the UDPET and Clinical datasets with varying dose levels show that GR-Diffusion outperforms state-of-the-art methods in enhancing 3D whole-body PET image quality and preserving physiological details.

[109] SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

Seo Hyun Kim,Jin Bok Park,Do Yeon Koo,Ho Gun Park,Il Yong Chun

Main category: cs.CV

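TL;DR: This paper proposes SToRM, a supervised visual token reduction framework for end-to-end autonomous driving systems driven by multi-modal large language models (MLLMs); it cuts computational cost substantially (by up to 30x) while keeping performance comparable to all-token input.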

Details Motivation: End-to-end autonomous driving must be both safe and real-time; natural language instructions help in unexpected scenarios, but MLLMs are too costly because they depend on an LLM and many visual tokens, and existing token compression methods usually trade away performance. Method: Proposes the supervised token reduction framework SToRM with three parts: 1) a lightweight token importance predictor based on short-term sliding windows; 2) a supervised training scheme that obtains pseudo-labels from an all-token LLM forward pass; 3) an anchor-context merging module that merges redundant context tokens into key anchors to preserve information. Result: On the LangAuto benchmark, SToRM clearly outperforms state-of-the-art methods under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x. Conclusion: SToRM is the first supervised visual token reduction framework for MLLMs; it preserves end-to-end driving performance while making deployment more feasible, pointing to a new approach for in-vehicle MLLM applications. Abstract: In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources due to its reliance on an LLM and numerous visual tokens from sensor inputs, which are limited in autonomous vehicles. Many MLLM studies have explored reducing visual tokens, but often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.

[110] EmoSpace: Fine-Grained Emotion Prototype Learning for Immersive Affective Content Generation

Bingyuan Wang,Xingbei Chen,Zongyang Qiu,Linping Yuan,Zeyu Wang

Main category: cs.CV

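TL;DR: This paper proposes the EmoSpace framework, which learns dynamic, interpretable emotion prototypes through vision-language alignment, enabling generation with fine-grained emotional control and no explicit emotion labels; it supports applications such as emotional image outpainting, stylized generation, and panorama generation for VR environments.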

Details Motivation: Existing generative methods fail to capture nuanced emotional semantics and the fine-grained emotional control that immersive experiences need. Method: Introduces the EmoSpace framework, which uses a hierarchical emotion representation with learnable dynamic emotion prototypes, together with a controllable generation pipeline featuring multi-prototype guidance, temporal blending, and attention reweighting. Result: Outperforms existing methods in both qualitative and quantitative evaluations; a user study examines how VR environments affect emotional perception. Conclusion: EmoSpace enables immersive visual content generation with fine-grained emotion control and supports applications such as therapy, education, storytelling, artistic creation, and cultural preservation. Abstract: Emotion is important for creating compelling virtual reality (VR) content. Although some generative methods have been applied to lower the barrier to creating emotionally rich content, they fail to capture the nuanced emotional semantics and the fine-grained control essential for immersive experiences. To address these limitations, we introduce EmoSpace, a novel framework for emotion-aware content generation that learns dynamic, interpretable emotion prototypes through vision-language alignment. We employ a hierarchical emotion representation with rich learnable prototypes that evolve during training, enabling fine-grained emotional control without requiring explicit emotion labels. We develop a controllable generation pipeline featuring multi-prototype guidance, temporal blending, and attention reweighting that supports diverse applications, including emotional image outpainting, stylized generation, and emotional panorama generation for VR environments. Our experiments demonstrate the superior performance of EmoSpace over existing methods in both qualitative and quantitative evaluations. Additionally, we present a comprehensive user study investigating how VR environments affect emotional perception compared to desktop settings. Our work facilitates immersive visual content generation with fine-grained emotion control and supports applications like therapy, education, storytelling, artistic creation, and cultural preservation. Code and models will be made publicly available.

[111] Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes

Jeongho Noh,Tai Hyoung Rhee,Eunho Lee,Jeongyun Kim,Sunwoo Lee,Ayoung Kim

Main category: cs.CV

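TL;DR: Clutt3R-Seg is a zero-shot 3D instance segmentation method driven by a hierarchical semantic tree, designed for language-grounded robotic grasping in cluttered scenes; cross-view grouping and conditional substitution improve robustness, and single-image updates let it adapt to scene changes.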

Details Motivation: Addresses unreliable 3D instance segmentation in cluttered environments caused by occlusion, limited viewpoints, and noisy masks, in support of language-grounded robotic manipulation. Method: Proposes a hierarchical instance semantic tree that treats noisy masks as informative cues; uses cross-view grouping and conditional substitution to suppress over- and under-segmentation; adds open-vocabulary semantic embeddings for target selection from natural language; designs a consistency-aware update that maintains instance correspondences from a single post-interaction image. Result: Outperforms state-of-the-art baselines on synthetic and real-world datasets and on a real robot; reaches an AP@25 of 61.66 on heavy-clutter sequences (over 2.2x the baselines); with only four views it surpasses MaskClustering with eight views by more than 2x. Conclusion: Clutt3R-Seg markedly improves the robustness and language grounding of 3D instance segmentation in cluttered, sparse-view scenes and shows potential for real robot deployment. Abstract: Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.

[112] Egocentric Gaze Estimation via Neck-Mounted Camera

Haoyu Huang,Yoichi Sato

Main category: cs.CV

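TL;DR: This paper introduces neck-mounted view gaze estimation as a new task, builds the first dataset for it, evaluates the transformer-based GLC model on it, and adds an auxiliary gaze out-of-bound classification task that improves performance.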

Details Motivation: Existing egocentric gaze estimation work focuses on head-mounted cameras, while other viewpoints such as neck-mounted cameras remain underexplored; this paper aims to fill that gap. Method: Builds the first neck-mounted gaze estimation dataset (8 participants, about 4 hours of everyday-activity video), evaluates the transformer-based GLC model, and proposes an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach (joint head-view and neck-view training with a geometry-aware auxiliary loss). Result: The out-of-bound classification task improves performance over standard fine-tuning, while multi-view co-learning yields no gain; the results expose the particular challenges and potential of neck-mounted gaze estimation. Conclusion: Neck-mounted gaze estimation is a promising new direction; modeling out-of-bound gaze is key, while multi-view co-learning needs further design work. Abstract: This paper introduces neck-mounted view gaze estimation, a new task that estimates user gaze from the neck-mounted camera perspective. Prior work on egocentric gaze estimation, which predicts the device wearer's gaze location within the camera's field of view, mainly focuses on head-mounted cameras while alternative viewpoints remain underexplored. To bridge this gap, we collect the first dataset for this task, consisting of approximately 4 hours of video collected from 8 participants during everyday activities. We evaluate a transformer-based gaze estimation model, GLC, on the new dataset and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach that jointly trains head-view and neck-view models using a geometry-aware auxiliary loss. Experimental results show that incorporating gaze out-of-bound classification improves performance over standard fine-tuning, while the co-learning approach does not yield gains. We further analyze these results and discuss implications for neck-mounted gaze estimation.

[113] U-Net with Hadamard Transform and DCT Latent Spaces for Next-day Wildfire Spread Prediction

Yingyi Luo,Shuaiang Rong,Adam Watts,Ahmet Enis Cetin

Main category: cs.CV

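TL;DR: This paper proposes TD-FusionUNet, a lightweight deep learning model that combines trainable Hadamard and cosine transform layers with custom preprocessing to predict next-day wildfire spread from multimodal satellite data; it reaches an F1 score of 0.591 with only 370k parameters, outperforming the baseline model.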

Details Motivation: Make wildfire spread prediction practical in real time and in resource-limited environments, addressing the large parameter counts, high compute cost, and weak representation of sparse fire masks in existing models. Method: Proposes the TD-FusionUNet model, which adds trainable two-dimensional Hadamard and DCT layers to model frequency components in orthogonalized latent spaces, and designs random margin cropping and Gaussian mixture model preprocessing to enrich the representation of sparse pre-fire masks and improve generalization. Result: Evaluated on the Next-Day Wildfire Spread and WildfireSpreadTS datasets, it reaches an F1 score of 0.591 with only 370k parameters, outperforming the UNet baseline with a ResNet18 encoder reported for WildfireSpreadTS. Conclusion: TD-FusionUNet balances accuracy and efficiency and is suited to lightweight, real-time wildfire prediction. Abstract: We developed a lightweight and computationally efficient tool for next-day wildfire spread prediction using multimodal satellite data as input. The deep learning model, which we call Transform Domain Fusion UNet (TD-FusionUNet), incorporates trainable Hadamard Transform and Discrete Cosine Transform layers that apply two-dimensional transforms, enabling the network to capture essential "frequency" components in orthogonalized latent spaces. Additionally, we introduce custom preprocessing techniques, including random margin cropping and a Gaussian mixture model, to enrich the representation of the sparse pre-fire masks and enhance the model's generalization capability. The TD-FusionUNet is evaluated on two datasets: the Next-Day Wildfire Spread dataset released by Google Research in 2023 and the WildfireSpreadTS dataset. Our proposed TD-FusionUNet achieves an F1 score of 0.591 with 370k parameters, outperforming the UNet baseline using ResNet18 as the encoder reported in the WildfireSpreadTS dataset while using substantially fewer parameters. These results show that the proposed latent space fusion model balances accuracy and efficiency under a lightweight setting, making it suitable for real-time wildfire prediction applications in resource-limited environments.
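
For readers unfamiliar with the Hadamard transform mentioned above, here is a minimal sketch of the fixed (non-trainable) fast Walsh-Hadamard transform and its separable 2D form. How the paper makes these layers trainable is not described in the abstract and is not modeled here; the names `fwht` and `fwht_2d` are hypothetical.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform.

    Uses only additions and subtractions in O(n log n) butterflies;
    the input length must be a power of two.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def fwht_2d(grid):
    """Separable 2D transform: transform every row, then every column."""
    rows = [fwht(row) for row in grid]
    cols = [fwht(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


if __name__ == "__main__":
    print(fwht([1, 2, 3, 4]))
    print(fwht_2d([[1, 1], [1, 1]]))
```

Because the transform needs no multiplications, it is a plausible fit for the lightweight setting the abstract targets; applying `fwht` twice returns the input scaled by its length, so the inverse is the same routine followed by a division.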

[114] RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval

Khanh Nguyen,Dasith de Silva Edirimuni,Ghulam Mubashar Hassan,Ajmal Mian

Main category: cs.CV

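TL;DR: This paper proposes RI-Mamba, the first rotation-invariant state-space model for point clouds. It uses reference frames to disentangle pose from geometry, Hilbert sorting to build geometry-aware sequences, and orientational embeddings with feature-wise linear modulation to recover spatial context, achieving state-of-the-art text-to-shape retrieval on OmniObject3D across more than 200 categories under arbitrary orientations.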

Details Motivation: Existing text-to-shape retrieval methods require canonical poses and support few categories, so they cope poorly with real-world settings where object classes are diverse and orientations are arbitrary. Method: Proposes RI-Mamba: 1) builds global and local reference frames to disentangle pose from geometry; 2) uses Hilbert sorting to generate rotation-invariant token sequences with geometric structure; 3) designs orientational embeddings and feature-wise linear modulation (FiLM) to recover spatial context; 4) combines cross-modal contrastive learning with automated triplet generation for scalable training. Result: On the OmniObject3D benchmark, achieves state-of-the-art text-to-shape retrieval across more than 200 object categories under arbitrary orientations; the model runs in linear time and generalizes robustly. Conclusion: RI-Mamba addresses the two bottlenecks of pose sensitivity and limited category coverage in point cloud retrieval, offering an efficient, scalable approach to 3D asset retrieval at scale, across many categories and arbitrary orientations. Abstract: 3D assets have rapidly expanded in quantity and diversity due to the growing popularity of virtual reality and gaming. As a result, text-to-shape retrieval has become essential in facilitating intuitive search within large repositories. However, existing methods require canonical poses and support few object categories, limiting their real-world applicability where objects can belong to diverse classes and appear in random orientations. To address this challenge, we propose RI-Mamba, the first rotation-invariant state-space model for point clouds. RI-Mamba defines global and local reference frames to disentangle pose from geometry and uses Hilbert sorting to construct token sequences with meaningful geometric structure while maintaining rotation invariance. We further introduce a novel strategy to compute orientational embeddings and reintegrate them via feature-wise linear modulation, effectively recovering spatial context and enhancing model expressiveness. Our strategy is inherently compatible with state-space models and operates in linear time. To scale up retrieval, we adopt cross-modal contrastive learning with automated triplet generation, allowing training on diverse datasets without manual annotation. Extensive experiments demonstrate RI-Mamba's superior representational capacity and robustness, achieving state-of-the-art performance on the OmniObject3D benchmark across more than 200 object categories under arbitrary orientations. Our code will be made available at https://github.com/ndkhanh360/RI-Mamba.git.

[115] Semantically Conditioned Diffusion Models for Cerebral DSA Synthesis

Qiwen Xu,David Rügamer,Holger Wenz,Johann Fontana,Nora Meggyeshazi,Andreas Bender,Máté E. Maros

Main category: cs.CV

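TL;DR: This paper proposes a semantically conditioned latent diffusion model (LDM) that generates arterial-phase cerebral digital subtraction angiography (DSA) images with control over anatomical circulation (anterior/posterior) and C-arm position; expert ratings and FID indicate clinically realistic images usable for algorithm development, research, and training.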

Details Motivation: DSA is central to diagnosing and treating cerebrovascular disease, but its invasiveness and high acquisition cost limit large-scale data collection and sharing, so high-quality synthetic data is needed as an alternative. Method: Curates a single-centre DSA dataset of 99,349 frames and trains a conditional latent diffusion model (LDM) on text embeddings that encode anatomy and acquisition geometry, enabling semantically controlled synthesis of arterial-phase DSA images. Result: Four medical experts rated 400 synthetic images on a 5-grade Likert scale; image-wise overall scores ranged from 3.1 to 3.3, with ICC(2,k) of 0.80 to 0.87; the median Fréchet inception distance (FID) was 15.27, indicating close distributional similarity. Conclusion: Semantically controlled latent diffusion models can generate clinically realistic synthetic DSA images suitable for downstream algorithm development, medical research, and training. Abstract: Digital subtraction angiography (DSA) plays a central role in the diagnosis and treatment of cerebrovascular disease, yet its invasive nature and high acquisition cost severely limit large-scale data collection and public data sharing. Therefore, we developed a semantically conditioned latent diffusion model (LDM) that synthesizes arterial-phase cerebral DSA frames under explicit control of anatomical circulation (anterior vs. posterior) and canonical C-arm positions. We curated a large single-centre DSA dataset of 99,349 frames and trained a conditional LDM using text embeddings that encoded anatomy and acquisition geometry. To assess clinical realism, four medical experts, including two neuroradiologists, one neurosurgeon, and one internal medicine expert, systematically rated 400 synthetic DSA images using a 5-grade Likert scale for evaluating proximal large, medium, and small peripheral vessels. The generated images achieved image-wise overall Likert scores ranging from 3.1 to 3.3, with high inter-rater reliability (ICC(2,k) = 0.80-0.87). Distributional similarity to real DSA frames was supported by a low median Fréchet inception distance (FID) of 15.27. Our results indicate that semantically controlled LDMs can produce realistic synthetic DSAs suitable for downstream algorithm development, research, and training.
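
The reliability figure quoted above, ICC(2,k), can be computed from a ratings table with a two-way ANOVA decomposition. The sketch below assumes the standard Shrout and Fleiss definition (two-way random effects, absolute agreement, average of k raters); the abstract does not say which software or variant the authors used, and the function name `icc_2k` and the toy ratings are hypothetical.

```python
def icc_2k(ratings):
    """ICC(2,k) for a complete table where ratings[i][j] is the score
    that rater j gave to item i.

    ICC(2,k) = (MS_rows - MS_err) / (MS_rows + (MS_cols - MS_err) / n)
    with n items (rows) and k raters (columns).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in ratings for v in r)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)


if __name__ == "__main__":
    # Toy example: the second rater always scores one point higher.
    print(icc_2k([[1, 2], [2, 3], [3, 4]]))
```

In the toy example the raters rank the items identically but disagree by a constant offset; the absolute-agreement form penalizes that offset, which is why the value falls below 1 even though the ordering is perfectly consistent.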

[116] TG-Field: Geometry-Aware Radiative Gaussian Fields for Tomographic Reconstruction

Yuxiang Zhong,Jun Wei,Chaoqi Chen,Senyou An,Hui Huang

Main category: cs.CV

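TL;DR: This paper proposes Tomographic Geometry Field (TG-Field), a geometry-aware Gaussian deformation framework for static and dynamic CT reconstruction. It combines multi-resolution hash encoding, time-conditioned representations, spatiotemporal attention, and a motion-flow network to improve reconstruction accuracy and robustness under ultra-sparse views.

Details Motivation: Existing 3D Gaussian Splatting (3DGS) methods for CT reconstruction suffer severe artifacts under ultra-sparse-view projections and dynamic motion, which calls for more robust geometric modeling and spatiotemporal consistency. Method: TG-Field (1) uses a multi-resolution hash encoder to capture local spatial priors that regularize the Gaussian parameters; (2) adds time-conditioned representations and a spatiotemporal attention module for dynamic reconstruction; and (3) uses a motion-flow network to model fine-grained anatomical deformation caused by respiratory motion. Result: Experiments on synthetic and real CT datasets show that TG-Field consistently outperforms existing methods under highly sparse-view conditions and reaches state-of-the-art reconstruction accuracy. Conclusion: TG-Field is a CT reconstruction framework that combines geometry awareness with spatiotemporal modeling and addresses reconstruction under sparse views and dynamic scenes. Abstract: 3D Gaussian Splatting (3DGS) has revolutionized 3D scene representation with superior efficiency and quality. While recent adaptations for computed tomography (CT) show promise, they struggle with severe artifacts under highly sparse-view projections and dynamic motions. To address these challenges, we propose Tomographic Geometry Field (TG-Field), a geometry-aware Gaussian deformation framework tailored for both static and dynamic CT reconstruction. A multi-resolution hash encoder is employed to capture local spatial priors, regularizing primitive parameters under ultra-sparse settings. We further extend the framework to dynamic reconstruction by introducing time-conditioned representations and a spatiotemporal attention block to adaptively aggregate features, thereby resolving spatiotemporal ambiguities and enforcing temporal coherence. In addition, a motion-flow network models fine-grained respiratory motion to track local anatomical deformations. Extensive experiments on synthetic and real-world datasets demonstrate that TG-Field consistently outperforms existing methods, achieving state-of-the-art reconstruction accuracy under highly sparse-view conditions.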

[117] LLM-Driven 3D Scene Generation of Agricultural Simulation Environments

Arafa Yoncalik,Wouter Jansen,Nico Huebel,Mohammad Hasan Rahmani,Jan Steckel

Main category: cs.CV

TL;DR: This paper proposes a modular multi-LLM pipeline for agricultural simulation that generates domain-consistent 3D scenes from natural language. It combines RAG, finetuning, and validation to improve accuracy and scalability, and is implemented in the Unreal engine.

Details Motivation: Existing LLM-based 3D scene generation methods lack domain-specific reasoning (for example, for agriculture), verification mechanisms, and modular design, which leads to poor control and weak scalability. Method: A modular multi-LLM pipeline integrates 3D asset retrieval, agricultural domain knowledge injection, and code generation against the Unreal engine API, using a hybrid strategy of few-shot prompting, RAG, finetuning, and validation. Result: The system generates 3D agricultural simulation scenes with realistic planting layouts and environmental context from natural language prompts; a user study assessed realism and familiarity, and an expert comparison showed significant time savings over manual scene design. Conclusion: A modular multi-LLM architecture can improve the reliability, precision, and scalability of domain-specific 3D generation, for agriculture and potentially other simulation domains. Abstract: Procedural generation techniques in 3D rendering engines have revolutionized the creation of complex environments, reducing reliance on manual design. Recent approaches using Large Language Models (LLMs) for 3D scene generation show promise but often lack domain-specific reasoning, verification mechanisms, and modular design. These limitations lead to reduced control and poor scalability. This paper investigates the use of LLMs to generate agricultural synthetic simulation environments from natural language prompts, specifically to address the limitations of lacking domain-specific reasoning, verification mechanisms, and modular design. A modular multi-LLM pipeline was developed, integrating 3D asset retrieval, domain knowledge injection, and code generation for the Unreal rendering engine using its API. This results in a 3D environment with realistic planting layouts and environmental context, all based on the input prompt and the domain knowledge. To enhance accuracy and scalability, the system employs a hybrid strategy combining LLM optimization techniques such as few-shot prompting, Retrieval-Augmented Generation (RAG), finetuning, and validation. Unlike monolithic models, the modular architecture enables structured data handling, intermediate verification, and flexible expansion. The system was evaluated using structured prompts and semantic accuracy metrics. A user study assessed realism and familiarity against real-world images, while an expert comparison demonstrated significant time savings over manual scene design.
The results confirm the effectiveness of multi-LLM pipelines in automating domain-specific 3D scene generation with improved reliability and precision. Future work will explore expanding the asset hierarchy, incorporating real-time generation, and adapting the pipeline to other simulation domains beyond agriculture.

[118] GSO-SLAM: Bidirectionally Coupled Gaussian Splatting and Direct Visual Odometry

Jiung Yeon,Seongbo Ha,Hyeonwoo Yu

Main category: cs.CV

TL;DR: GSO-SLAM is a real-time monocular dense SLAM system that bidirectionally couples visual odometry (VO) and Gaussian Splatting (GS). It jointly optimizes depth estimates and the scene representation within an EM framework and introduces Gaussian Splat Initialization, yielding accurate, high-fidelity, real-time reconstruction.

Details Motivation: Existing SLAM methods couple tracking and mapping in ways that either incur high computational cost or introduce redundancy, so an efficient, high-fidelity, real-time monocular dense SLAM approach is needed. Method: GSO-SLAM bidirectionally couples VO and Gaussian Splatting through joint optimization in an EM framework. Its Gaussian Splat Initialization uses keyframe poses, image information, and pixel associations from VO to produce a close approximation to the final Gaussian scene. Result: Experiments confirm real-time operation and state-of-the-art geometric/photometric reconstruction fidelity and tracking accuracy. Conclusion: By tightly coupling VO and Gaussian Splatting and removing heuristic initialization, GSO-SLAM achieves efficient, accurate, real-time monocular dense SLAM. Abstract: We propose GSO-SLAM, a real-time monocular dense SLAM system that leverages Gaussian scene representation. Unlike existing methods that couple tracking and mapping with a unified scene, incurring computational costs, or loosely integrate them with well-structured tracking frameworks, introducing redundancies, our method bidirectionally couples Visual Odometry (VO) and Gaussian Splatting (GS). Specifically, our approach formulates joint optimization within an Expectation-Maximization (EM) framework, enabling the simultaneous refinement of VO-derived semi-dense depth estimates and the GS representation without additional computational overhead. Moreover, we present Gaussian Splat Initialization, which utilizes image information, keyframe poses, and pixel associations from VO to produce close approximations to the final Gaussian scene, thereby eliminating the need for heuristic methods. Through extensive experiments, we validate the effectiveness of our method, showing that it not only operates in real time but also achieves state-of-the-art geometric/photometric fidelity of the reconstructed scene and tracking accuracy.

[119] STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning

Xiaowen Zhang,Zhi Gao,Licheng Jiao,Lingling Li,Qing Li

Main category: cs.CV

TL;DR: This paper proposes a new visual prompting paradigm and STVG-R1, the first reinforcement learning framework for spatial-temporal video grounding (STVG). Instance-ID encoding and a task-driven reward substantially improve grounding accuracy and generalization.

Details Motivation: Vision-language models hallucinate on dense prediction tasks such as STVG because textual descriptions and visual coordinates are misaligned; prior fixes add trainable modules that bring annotation cost and computational overhead. Method: (1) Per-frame coordinate prediction is recast as a compact instance-level identification problem, with unique, temporally consistent IDs embedded into the video as visual prompts. (2) The STVG-R1 reinforcement learning framework jointly optimizes temporal accuracy, spatial consistency, and structural format regularization. Result: On HCSTVG-v2, m_IoU exceeds the Qwen2.5-VL-7B baseline by 20.9%, a new SOTA; zero-shot transfer to multi-object referring video object segmentation on MeViS reaches 47.3% J&F, also SOTA. Conclusion: Combining visual prompting with reinforcement learning sidesteps cross-modal coordinate alignment and delivers strong performance and generalization on STVG, suggesting a new direction for dense prediction with VLMs. Abstract: In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.
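The reward above includes a temporal-accuracy term. The abstract does not give its exact form, so the following is only a minimal sketch of one common choice, temporal IoU between a predicted and a ground-truth segment. The function name `temporal_iou` and the `(start, end)` segment format are illustrative assumptions, not the authors' implementation.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments, in seconds or frames.

    Assumption: this stands in for the 'temporal accuracy' part of a
    task-driven reward; the paper's actual reward may differ.
    """
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    # Overlap length, clamped at zero when the segments are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

A term like this is bounded in [0, 1], which makes it straightforward to combine with other bounded reward terms such as spatial consistency or format checks.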

[120] Adapting Vision-Language Models for E-commerce Understanding at Scale

Matteo Nulli,Vladimir Orshulevich,Tala Bazazo,Christian Herold,Michael Kozielski,Marcin Mazur,Szymon Tuzel,Cees G. M. Snoek,Seyyed Hadi Hashemi,Omar Javed,Yannick Versley,Shahram Khadivi

Main category: cs.CV

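TL;DR: This paper studies how to adapt general-purpose vision-language models (VLMs) to the attribute-centric, multi-image, and noisy nature of e-commerce data so that product understanding improves while broad multimodal capabilities are preserved. It also proposes a new evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.

Details Motivation: General-purpose VLMs offer generalizable multimodal modeling, but there is no established strategy for adapting them to e-commerce data (attribute-centric, multi-image, noisy) without sacrificing general performance. Method: Through a large-scale experimental study, the authors apply targeted adaptation to general VLMs and design a comprehensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction. Result: Targeted adaptation substantially improves VLM performance on e-commerce tasks while preserving general multimodal capabilities; the evaluation suite provides a broader benchmark for e-commerce multimodal understanding. Conclusion: General VLMs can be adapted to e-commerce product understanding without giving up general capability, and the new evaluation suite supports more rigorous research in this area. Abstract: E-commerce product understanding demands, by nature, strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show, through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.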

[121] Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

Boqi Chen,Xudong Liu,Jianing Qiu

Main category: cs.CV

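TL;DR: This paper improves visual contrastive decoding (VCD) for mitigating object hallucination in multimodal large language models (MLLMs) by constructing an object-aligned auxiliary view. Using object-centric attention from a self-supervised ViT, it removes the most salient visual evidence to strengthen the contrast signal. The method is prompt-agnostic, model-agnostic, and low-overhead, and is validated on two object hallucination benchmarks.

Details Motivation: Object hallucination is widespread in MLLMs and reduces the visual faithfulness of generated content. Method: Object-centric attention from a self-supervised Vision Transformer is used to build an auxiliary view with the most salient visual evidence removed, which strengthens the contrast signal in VCD. No prompt or model changes are needed, only a single cacheable forward pass. Result: Consistent gains on two popular object hallucination benchmarks across two MLLMs, with little computational overhead. Conclusion: An object-aligned auxiliary view strengthens VCD and offers a general, lightweight, plug-and-play way to mitigate object hallucination. Abstract: We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.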
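To make the contrast step concrete, here is a minimal sketch assuming the commonly used VCD combination rule, in which next-token logits from the original view are contrasted against logits from an auxiliary view. The abstract does not state the rule itself; the weight `alpha`, the function names, and the toy logits are illustrative assumptions, and the construction of the object-aligned auxiliary view is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(logits_full, logits_aux, alpha=1.0):
    """Assumed VCD rule: (1 + alpha) * full-view logits - alpha * aux-view logits.

    Tokens that stay likely even when the visual evidence is removed
    (high aux logit) are pushed down relative to visually grounded tokens.
    """
    return [(1.0 + alpha) * f - alpha * a for f, a in zip(logits_full, logits_aux)]


if __name__ == "__main__":
    # Toy two-token vocabulary: token 0 depends on the image, token 1 does not.
    full_view = [2.0, 1.9]
    aux_view = [0.5, 1.9]  # evidence removed: token 0 drops, token 1 unchanged
    print(softmax(full_view))
    print(softmax(contrastive_logits(full_view, aux_view)))
```

In this toy case the two tokens are nearly tied under the original view, and the contrast shifts probability toward the token whose logit actually depended on the image.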

[122] Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation

Xiangyu Wu,Dongming Jiang,Feng Yu,Yueying Tian,Jiaqi Tang,Qing-Guo Chen,Yang Yang,Jianfeng Lu

Main category: cs.CV

TL;DR: This paper proposes Adaptive Debiasing Tsallis Entropy (ADTE) for test-time adaptation (TTA) of vision-language models. A class-specific non-extensive parameter q^l mitigates the biased uncertainty estimates caused by CLIP's imbalanced pretraining data, without extra hyperparameter tuning, and the method outperforms existing approaches on multiple benchmarks.

Details Motivation: Mainstream TTA methods (for example, those built on CLIP) use Shannon Entropy (SE) to measure prediction uncertainty, but CLIP is pretrained on highly imbalanced web data, so SE yields biased uncertainty estimates. Method: The authors find that Tsallis Entropy (TE), through its non-extensive parameter q, is naturally suited to characterizing biased distributions. They then propose ADTE, which derives a class-specific q^l for each category by normalizing the label bias estimated from incoming test instances, enabling high-confidence view selection and integration with a label adjustment strategy. Result: ADTE outperforms the state of the art on ImageNet and its five variants and achieves the highest average performance on 10 cross-domain benchmarks, regardless of model architecture or text prompts; TE and ADTE can directly replace SE without other modifications. Conclusion: ADTE is a general uncertainty measure for TTA that needs no distribution-specific hyperparameters and mitigates the distortion in uncertainty estimates caused by pretraining bias. Abstract: Mainstream Test-Time Adaptation (TTA) methods for adapting vision-language models, e.g., CLIP, typically rely on Shannon Entropy (SE) at test time to measure prediction uncertainty and inconsistency. However, since CLIP has a built-in bias from pretraining on highly imbalanced web-crawled data, SE inevitably results in producing biased estimates of uncertainty entropy. To address this issue, we notably find and demonstrate that Tsallis Entropy (TE), a generalized form of SE, is naturally suited for characterizing biased distributions by introducing a non-extensive parameter q, with the performance of SE serving as a lower bound for TE. Building upon this, we generalize TE into Adaptive Debiasing Tsallis Entropy (ADTE) for TTA, customizing a class-specific parameter q^l derived by normalizing the estimated label bias from continuously incoming test instances, for each category. This adaptive approach allows ADTE to accurately select high-confidence views and seamlessly integrate with a label adjustment strategy to enhance adaptation, without introducing distribution-specific hyperparameter tuning. Besides, our investigation reveals that both TE and ADTE can serve as direct, advanced alternatives to SE in TTA, without any other modifications. Experimental results show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants, and achieves the highest average performance on 10 cross-domain benchmarks, regardless of the model architecture or text prompts used. Our code is available at https://github.com/Jinx630/ADTE.
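For readers unfamiliar with Tsallis Entropy, the sketch below uses the standard definition S_q(p) = (1 - Σ p_i^q) / (q - 1), which recovers Shannon Entropy (in nats) as q approaches 1. It only illustrates the generalization the abstract refers to; the class-specific q^l of ADTE and its label-bias normalization are not reproduced here, and the function names are illustrative.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def tsallis_entropy(probs, q):
    """Standard Tsallis entropy S_q(p) = (1 - sum_i p_i**q) / (q - 1).

    Falls back to Shannon entropy at q == 1, which is the limit of the
    expression above as q -> 1.
    """
    if abs(q - 1.0) < 1e-12:
        return shannon_entropy(probs)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)


if __name__ == "__main__":
    uniform = [0.5, 0.5]
    print(shannon_entropy(uniform))
    print(tsallis_entropy(uniform, 1.0001))  # close to the Shannon value
    print(tsallis_entropy(uniform, 2.0))
```

Because q reweights how strongly high- and low-probability classes contribute, swapping SE for TE changes which predictions count as confident, which is the property the abstract builds on.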

[123] Code2Worlds: Empowering Coding LLMs for 4D World Generation

Yi Zhang,Yunshuang Wang,Zeyu Zhang,Hao Tang

Main category: cs.CV

TL;DR: This paper proposes Code2Worlds, a framework that formulates 4D dynamic world generation as language-to-simulation code generation. A dual-stream architecture decouples object and environment generation, and a physics-aware closed-loop mechanism (a PostProcess Agent plus a VLM-Motion Critic) improves dynamic fidelity; the method clearly outperforms baselines on the Code4D benchmark.

Details Motivation: Existing large-model-based 3D generation methods are hard to extend to 4D dynamic scenes because of two challenges, multi-scale context entanglement and a semantic-physical execution gap, which motivates physically grounded world simulators. Method: Code2Worlds uses (1) a dual-stream architecture with a retrieval-augmented object generation stream and a hierarchical environment orchestration stream, and (2) a physics-aware closed loop in which a PostProcess Agent scripts dynamics and a VLM-Motion Critic performs self-reflection to iteratively refine the simulation code. Result: On the Code4D benchmark, SGS improves by 41% and Richness by 49%, and the framework generates physics-aware dynamics that prior static methods lack. Conclusion: The language-to-simulation-code paradigm can bridge semantic generation and physical realism, offering a scalable and verifiable foundation for spatial intelligence and embodied AI. Abstract: Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a critical frontier. This task presents two fundamental challenges: multi-scale context entanglement, where monolithic generation fails to balance local object structures with global environmental layouts; and a semantic-physical execution gap, where open-loop code generation leads to physical hallucinations lacking dynamic fidelity. We introduce Code2Worlds, a framework that formulates 4D generation as language-to-simulation code generation. First, we propose a dual-stream architecture that disentangles retrieval-augmented object generation from hierarchical environmental orchestration. Second, to ensure dynamic fidelity, we establish a physics-aware closed-loop mechanism in which a PostProcess Agent scripts dynamics, coupled with a VLM-Motion Critic that performs self-reflection to iteratively refine simulation code. Evaluations on the Code4D benchmark show Code2Worlds outperforms baselines with a 41% SGS gain and 49% higher Richness, while uniquely generating physics-aware dynamics absent in prior static methods. Code: https://github.com/AIGeeksGroup/Code2Worlds. Website: https://aigeeksgroup.github.io/Code2Worlds.

[124] Light4D: Training-Free Extreme Viewpoint 4D Video Relighting

Zhenghuang Wu,Kang Chen,Zeyu Zhang,Hao Tang

Main category: cs.CV

TL;DR: This paper proposes Light4D, a training-free 4D video relighting framework. Disentangled Flow Guidance and Temporal Consistent Attention enable high-fidelity, temporally consistent 4D relighting under extreme viewpoint changes.

Details Motivation: Existing diffusion-based relighting methods are hard to extend to 4D (spatiotemporal) scenes, mainly because paired 4D training data is scarce and temporal consistency is difficult to maintain across extreme viewpoints. Method: Light4D introduces (1) Disentangled Flow Guidance, which injects lighting control into the latent space while preserving geometric integrity, and (2) Temporal Consistent Attention within the IC-Light architecture, together with deterministic regularization to eliminate flickering. Result: Experiments show competitive temporal consistency and lighting fidelity, with robust handling of camera rotations from -90° to 90°. Conclusion: Light4D is a training-free and robust approach to 4D video relighting that addresses spatiotemporal consistency under extreme viewpoints. Abstract: Recent advances in diffusion-based generative models have established a new paradigm for image and video relighting. However, extending these capabilities to 4D relighting remains challenging, due primarily to the scarcity of paired 4D relighting training data and the difficulty of maintaining temporal consistency across extreme viewpoints. In this work, we propose Light4D, a novel training-free framework designed to synthesize consistent 4D videos under target illumination, even under extreme viewpoint changes. First, we introduce Disentangled Flow Guidance, a time-aware strategy that effectively injects lighting control into the latent space while preserving geometric integrity. Second, to reinforce temporal consistency, we develop Temporal Consistent Attention within the IC-Light architecture and further incorporate deterministic regularization to eliminate appearance flickering. Extensive experiments demonstrate that our method achieves competitive performance in temporal consistency and lighting fidelity, robustly handling camera rotations from -90° to 90°. Code: https://github.com/AIGeeksGroup/Light4D. Website: https://aigeeksgroup.github.io/Light4D.

[125] Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data

Yiming Zhou,Xuenjie Xie,Panfeng Li,Albrecht Kunz,Ahmad Osman,Xavier Maldague

Main category: cs.CV

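TL;DR: This paper proposes a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Trained on only 11.2k samples (less than 0.1% of SA-1B), it achieves higher segmentation accuracy than the original EfficientViT-SAM.

Details Motivation: SAM performs well but depends on massive datasets and RGB-only inputs; existing efficient variants still require large-scale training, so a lighter and more data-efficient approach is needed. Method: Building on EfficientViT-SAM, a pretrained monocular depth estimator generates depth maps, and a dedicated depth encoder fuses them with RGB features at mid-level. Result: Trained on only 11.2k samples, the model achieves higher segmentation accuracy than EfficientViT-SAM, supporting depth cues as a geometric prior. Conclusion: Lightweight RGB-D fusion can improve segmentation with limited training data, and depth cues provide strong geometric priors for segmentation. Abstract: Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.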

[126] How to Sample High Quality 3D Fractals for Action Recognition Pre-Training?

Marko Putak,Thomas B. Moeslund,Joakim Bruslund Haurum

Main category: cs.CV

TL;DR: This paper proposes generating 3D fractal videos with 3D Iterated Function Systems (IFS) for pre-training action recognition models. To address slow, low-quality conventional fractal generation, it introduces Targeted Smart Filtering, which speeds up sampling by roughly 100 times and improves downstream performance.

Details Motivation: Labeling real data is costly and raises privacy and ethical concerns; existing 3D fractal generation methods are slow and prone to degenerate structures, which hurts downstream action recognition. Method: Fractals are generated with 3D IFS and temporally transformed into videos used as pre-training data. Targeted Smart Filtering speeds up sampling while preserving diversity and avoiding the performance drop caused by overly restrictive filtering. Result: The method samples roughly 100 times faster and outperforms other 3D fractal filtering methods on downstream action recognition. Conclusion: Formula-driven synthetic data such as 3D fractal videos is an effective pre-training alternative; how tightly fractal generation is constrained matters for downstream performance, and Targeted Smart Filtering strikes a better balance between efficiency and effectiveness. Abstract: Synthetic datasets are being recognized in the deep learning realm as a valuable alternative to exhaustively labeled real data. One such synthetic data generation method is Formula Driven Supervised Learning (FDSL), which can provide an infinite number of perfectly labeled data through a formula driven approach, such as fractals or contours. FDSL does not have common drawbacks like manual labor, privacy and other ethical concerns. In this work we generate 3D fractals using 3D Iterated Function Systems (IFS) for pre-training an action recognition model. The fractals are temporally transformed to form a video that is used as a pre-training dataset for the downstream task of action recognition. We find that standard methods of generating fractals are slow and produce degenerate 3D fractals. Therefore, we systematically explore alternative ways of generating fractals and find that overly-restrictive approaches, while generating aesthetically pleasing fractals, are detrimental for downstream task performance. We propose a novel method, Targeted Smart Filtering, to address both the generation speed and fractal diversity issue. The method reports roughly 100 times faster sampling speed and achieves superior downstream performance against other 3D fractal filtering methods.

[127] JEPA-VLA: Video Predictive Embedding is Needed for VLA Models

Shangchen Miao,Ningya Feng,Jialong Wu,Ye Lin,Xu He,Dong Li,Mingsheng Long

Main category: cs.CV

TL;DR: This paper proposes JEPA-VLA, which integrates video-pretrained predictive visual embeddings (such as V-JEPA 2) to compensate for the weak environment understanding and policy priors of existing vision-language-action models, improving sample efficiency and generalization.

Details Motivation: Existing VLA models are limited by their pretrained visual representations, which capture too little task-relevant environment information and too weak a policy prior (anticipatory knowledge of how the environment evolves under successful task execution). Method: The authors analyze the limits of several visual representations and find that video-pretrained predictive embeddings, in particular V-JEPA 2, better encode task-relevant temporal dynamics and discard unpredictable factors. JEPA-VLA adaptively integrates these predictive embeddings into existing VLA architectures. Result: JEPA-VLA yields substantial gains on LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks. Conclusion: Video-driven predictive visual representations are key to improving the sample efficiency and generalization of VLA models. Abstract: Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, pretrained visual representation, which offers insufficient knowledge on both aspects of environment understanding and policy prior. Through an in-depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.

[128] WorldTree: Towards 4D Dynamic Worlds from Monocular Video using Tree-Chains

Qisen Wang,Yifan Zhao,Jia Li

Main category: cs.CV

TL;DR: This paper proposes WorldTree, a framework that achieves unified spatiotemporal decomposition through a Temporal Partition Tree (TPT) and Spatial Ancestral Chains (SAC), improving monocular dynamic scene reconstruction.

Details Motivation: Existing monocular dynamic reconstruction methods lack a unified spatiotemporal decomposition framework, so temporal optimization is too holistic or the spatial hierarchy is too tightly coupled. Method: WorldTree comprises the Temporal Partition Tree (TPT), which performs coarse-to-fine temporal optimization on an inheritance-based partition tree, and Spatial Ancestral Chains (SAC), which recursively query the ancestral hierarchy to model spatial dynamics. Result: Relative to the second-best method, LPIPS improves by 8.26% on NVIDIA-LS and mLPIPS by 9.09% on DyCheck. Conclusion: WorldTree provides more efficient, decoupled spatiotemporal modeling and improves monocular dynamic reconstruction quality. Abstract: Dynamic reconstruction has achieved remarkable progress, but there remain challenges in monocular input for more practical applications. The prevailing works attempt to construct efficient motion representations, but lack a unified spatiotemporal decomposition framework, suffering from either holistic temporal optimization or coupled hierarchical spatial composition. To this end, we propose WorldTree, a unified framework comprising Temporal Partition Tree (TPT) that enables coarse-to-fine optimization based on the inheritance-based partition tree structure for hierarchical temporal decomposition, and Spatial Ancestral Chains (SAC) that recursively query ancestral hierarchical structure to provide complementary spatial dynamics while specializing motion representations across ancestral nodes. Experimental results on different datasets indicate that our proposed method achieves 8.26% improvement of LPIPS on NVIDIA-LS and 9.09% improvement of mLPIPS on DyCheck compared to the second-best method. Code: https://github.com/iCVTEAM/WorldTree.

[129] Free Lunch for Stabilizing Rectified Flow Inversion

Chenru Wang,Beier Zhu,Chi Zhang

Main category: cs.CV

TL;DR: This paper proposes Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field of Rectified-Flow (RF) models and improves reconstruction and editing quality. It also introduces mimic-CFG, a lightweight scheme for editing tasks. Experiments on PIE-Bench show state-of-the-art performance with higher efficiency.

Details Motivation: Existing training-free inversion methods for RF models, such as vanilla RF inversion, accumulate approximation errors across timesteps, which destabilizes the velocity field and degrades reconstruction and editing quality. Method: PMI guides the current velocity toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian, to stabilize the velocity field. mimic-CFG interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Result: On PIE-Bench, the methods significantly improve inversion stability, image reconstruction quality, and editing fidelity while reducing the number of neural function evaluations. Conclusion: PMI and mimic-CFG provide a theoretically grounded and efficient training-free framework for inversion and editing with RF models. Abstract: Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on the PIE-Bench with enhanced efficiency and theoretical soundness.
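The two velocity operations named above can be sketched as follows. This is one literal reading of the abstract, not the authors' algorithm: it assumes an incremental running mean of past velocities and an orthogonal projection of the current velocity onto that mean, followed by linear interpolation with a hypothetical weight `lam`. The spherical-Gaussian constraint of PMI and the actual schedule of weights are not specified in the abstract and are omitted.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def update_running_mean(mean, velocity, count):
    """Incremental mean update; `count` includes the new velocity."""
    return [m + (v - m) / count for m, v in zip(mean, velocity)]


def project_onto(velocity, mean):
    """Orthogonal projection of `velocity` onto the direction of `mean`."""
    norm_sq = dot(mean, mean)
    if norm_sq == 0.0:
        return [0.0] * len(velocity)
    scale = dot(velocity, mean) / norm_sq
    return [scale * m for m in mean]


def interpolate_with_projection(velocity, mean, lam):
    """Assumed mimic-CFG-style correction: (1 - lam) * v + lam * proj_mean(v).

    `lam` is a hypothetical interpolation weight in [0, 1].
    """
    projected = project_onto(velocity, mean)
    return [(1.0 - lam) * v + lam * p for v, p in zip(velocity, projected)]


if __name__ == "__main__":
    history_mean = update_running_mean([1.0, 1.0], [3.0, -1.0], count=2)
    print(history_mean)
    print(interpolate_with_projection([1.0, 1.0], history_mean, lam=0.5))
```

With `lam = 0` the velocity is unchanged and with `lam = 1` it is fully replaced by its component along the historical average, which is one way to picture the trade-off between editing strength and structural consistency.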

[130] Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Lai Wei,Liangbo He,Jun Lan,Lingzhong Dong,Yutong Cai,Siyuan Li,Huijia Zhu,Weiqiang Wang,Linghe Kong,Yue Wang,Zhuosheng Zhang,Weiran Huang

Main category: cs.CV

TL;DR: This paper proposes Region-to-Image Distillation, which turns repeated inference-time zooming into a training-time distillation process so that a smaller MLLM achieves accurate fine-grained visual understanding in a single forward pass, without runtime tool calls. It also introduces ZoomBench, a benchmark for fine-grained perception.

Details Motivation: MLLMs are good at global visual understanding but weak at fine-grained perception, where the decisive evidence is small; "Thinking-with-Images" methods help but incur high latency from repeated tool calls and visual re-encoding. Method: Region-to-Image Distillation first lets strong teacher models generate high-quality VQA data on micro-cropped regions, then distills this region-level supervision back to the full image; after training, the student handles fine-grained understanding in one forward pass. The authors also build the ZoomBench benchmark with a dual-view evaluation protocol. Result: The models achieve leading performance on multiple fine-grained perception benchmarks and also improve general multimodal cognition on tasks such as visual reasoning and GUI agents; ZoomBench quantifies the global-regional "zooming gap". Conclusion: Fine-grained perception can be internalized into an MLLM through training-time distillation rather than inference-time zooming; the work also discusses when "Thinking-with-Images" is necessary and when a single forward pass suffices. Abstract: Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA data spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global–regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass.
Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.

[131] DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition

Ji Li,Zhiwei Li,Shihao Li,Zhenjiang Yu,Boyang Wang,Haiou Liu

Main category: cs.CV

TL;DR: This paper proposes DiffPlace, a framework that introduces a place-ID controller for controllable multi-view image generation. It improves the place awareness and background consistency of generated urban street views and thereby benefits visual place recognition.

Details Motivation: Existing generative models struggle to produce place-aware, background-consistent urban street views, which limits their usefulness for place recognition tasks. Method: DiffPlace introduces a place-ID controller that combines linear projection, a Perceiver Transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, keeping background buildings consistent while allowing foreground objects and weather conditions to change. Result: Experiments show that DiffPlace outperforms existing methods in generation quality and in training support for visual place recognition. Conclusion: DiffPlace demonstrates the potential of generative models for scene-level, place-aware synthesis and offers a useful way to improve place recognition in autonomous driving. Abstract: Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller employs linear projection, perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models in enhancing scene-level and place-aware synthesis, providing a valuable approach for improving place recognition in autonomous driving.

[132] SynthRAR: Ring Artifacts Reduction in CT with Unrolled Network and Synthetic Data Training

Hongxu Yang,Levente Lippenszky,Edina Timko,Gopal Avinash

Main category: cs.CV

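TL;DR: This paper proposes a deep learning method for CT ring artifact reduction based on a theoretical analysis of non-ideal detector responses. It models the inverse problem with an unrolled network and uses synthetic data to exploit the intrinsic correlation between the image and sinogram domains, so no real clinical data is needed for training.

Details Motivation: Existing ring artifact reduction methods require dedicated training datasets that are costly to collect, and they work in either the image domain or the sinogram domain alone, ignoring the intrinsic correlations in the forward projection of the CT geometry. Method: Ring artifact reduction is reformulated as an inverse problem that combines the non-ideal detector response with the linear forward projection of the CT geometry and is modeled with an unrolled network. Synthetic data derived from natural images is used to exploit the correlation of ring artifacts between the sinogram and image domains, enabling training without real clinical data. Result: Evaluations across diverse scanning geometries and anatomical regions show that the model trained only on synthetic data consistently outperforms existing state-of-the-art methods. Conclusion: The method removes the dependence on real clinical training data and, by jointly modeling the forward physical process and cross-domain correlations, achieves more robust and better-generalizing ring artifact reduction. Abstract: Defective and inconsistent responses in CT detectors can cause ring and streak artifacts in the reconstructed images, making them unusable for clinical purposes. In recent years, several ring artifact reduction solutions have been proposed in the image domain or in the sinogram domain using supervised deep learning methods. However, these methods require dedicated datasets for training, leading to a high data collection cost. Furthermore, existing approaches focus exclusively on either image-space or sinogram-space correction, neglecting the intrinsic correlations from the forward operation of the CT geometry. Based on the theoretical analysis of non-ideal CT detector responses, the RAR problem is reformulated as an inverse problem by using an unrolled network, which considers non-ideal response together with linear forward-projection with CT geometry. Additionally, the intrinsic correlations of ring artifacts between the sinogram and image domains are leveraged through synthetic data derived from natural images, enabling the trained model to correct artifacts without requiring real-world clinical data. Extensive evaluations on diverse scanning geometries and anatomical regions demonstrate that the model trained on synthetic data consistently outperforms existing state-of-the-art methods.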

[133] DynaHOI: Benchmarking Hand-Object Interaction for Dynamic Target

BoCheng Hu,Zhonghan Zhao,Kaiyue Zhou,Hongwei Wang,Gaoang Wang

Main category: cs.CV

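TL;DR: This paper presents the DynaHOI-Gym platform and the DynaHOI-10M benchmark for evaluating hand motion generation in dynamic hand-object interaction (HOI), filling a gap left by existing static benchmarks, and proposes the ObAct baseline, which improves location success rate.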

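Details Motivation: Existing hand-object interaction (HOI) benchmarks mostly focus on static objects and cannot evaluate scenarios with moving targets and time-critical coordination. Method: The authors build DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics; release DynaHOI-10M, a large-scale dynamic HOI benchmark (10M frames, 180K trajectories); and design an observe-before-act baseline, ObAct, which fuses short-term observations with the current frame via spatiotemporal attention to predict actions. Result: DynaHOI-10M organizes target motions into 8 major categories and 22 fine-grained subcategories; ObAct achieves an 8.1% improvement in location success rate. Conclusion: DynaHOI-Gym and DynaHOI-10M provide a new standard and resource for dynamic HOI research, and ObAct shows that simple temporal modeling is effective, advancing hand motion generation for realistic interaction scenarios. Abstract: Most existing hand motion generation benchmarks for hand-object interaction (HOI) focus on static objects, leaving dynamic scenarios with moving targets and time-critical coordination largely untested. To address this gap, we introduce the DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics for dynamic capture evaluation. Built on DynaHOI-Gym, we release DynaHOI-10M, a large-scale benchmark with 10M frames and 180K hand capture trajectories, whose target motions are organized into 8 major categories and 22 fine-grained subcategories. We also provide a simple observe-before-act baseline (ObAct) that integrates short-term observations with the current frame via spatiotemporal attention to predict actions, achieving an 8.1% improvement in location success rate.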

[134] Synthesis of Late Gadolinium Enhancement Images via Implicit Neural Representations for Cardiac Scar Segmentation

Soufiane Ben Haddou,Laura Alvarez-Florez,Erik J. Bekkers,Fleur V. Y. Tjong,Ahmad S. Amin,Connie R. Bezzina,Ivana Išgum

Main category: cs.CV

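TL;DR: This paper proposes a framework that combines implicit neural representations (INRs) with denoising diffusion models to synthesise late gadolinium enhancement (LGE) cardiac MRI images together with their myocardium and fibrosis segmentation masks, without manual annotation, to mitigate the scarcity of annotated data. In experiments on 133 scans, adding 200 synthetic volumes raised the fibrosis segmentation Dice score from 0.509 to 0.524.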

Details Motivation: Late gadolinium enhancement (LGE) imaging is the clinical standard for myocardial scar assessment, but the scarcity of annotated data severely limits the development of automated segmentation methods. Method: INRs model continuous spatial representations of LGE images and their myocardium and fibrosis masks, and are compressed into compact latent embeddings; a diffusion model trained on this latent space generates new representations, which are decoded into synthetic LGE images with anatomically consistent segmentation masks. Result: Experiments on 133 cardiac MRI scans show that augmenting the training set with 200 synthetic volumes raises the fibrosis segmentation Dice score from 0.509 to 0.524. Conclusion: The method offers an annotation-free way to generate anatomically consistent LGE images and segmentation masks, helping to mitigate data scarcity in medical image segmentation. Abstract: Late gadolinium enhancement (LGE) imaging is the clinical standard for myocardial scar assessment, but limited annotated datasets hinder the development of automated segmentation methods. We propose a novel framework that synthesises both LGE images and their corresponding segmentation masks using implicit neural representations (INRs) combined with denoising diffusion models. Our approach first trains INRs to capture continuous spatial representations of LGE data and associated myocardium and fibrosis masks. These INRs are then compressed into compact latent embeddings, preserving essential anatomical information. A diffusion model operates on this latent space to generate new representations, which are decoded into synthetic LGE images with anatomically consistent segmentation masks. Experiments on 133 cardiac MRI scans suggest that augmenting training data with 200 synthetic volumes contributes to improved fibrosis segmentation performance, with the Dice score showing an increase from 0.509 to 0.524. Our approach provides an annotation-free method to help mitigate data scarcity. The code for this research is publicly available.
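
For readers unfamiliar with the metric quoted above, here is a minimal sketch of the Dice score for two binary masks, Dice = 2|A ∩ B| / (|A| + |B|). The function name and the toy masks are illustrative; the paper's own evaluation code is not reproduced here, and the treatment of two empty masks is a convention chosen for this sketch.

```python
def dice_score(pred, target):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 predicted fibrosis voxels, 3 true ones, 2 in common.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
print(dice_score(pred, target))  # 2 * 2 / (3 + 3)
```

On this scale, the reported change from 0.509 to 0.524 is an absolute gain of 0.015.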

[135] Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion

Bruno Rigal,Victor Dupriez,Alexis Mignon,Ronan Le Hy,Nicolas Mery

Main category: cs.CV

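TL;DR: This paper introduces a PDF-to-Markdown benchmark for complex French documents. It builds a set of challenging samples through model-disagreement sampling and evaluates with unit-test-style checks that target concrete failure modes, giving a more accurate measure of how robust vision-language models are at document parsing.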

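Details Motivation: Existing document parsing benchmarks mostly focus on English or Chinese and tend to over-penalize models for irrelevant formatting differences (line breaks, list segmentation, table rendering), so they do not reflect real usefulness in downstream tasks such as RAG. Method: The authors build a French-specific benchmark by selecting hard cases from 60,000 documents through model-disagreement sampling (handwritten forms, complex layouts, dense tables, graphics-rich pages); evaluation uses unit-test-style checks on text presence, reading order, and local table constraints, combined with category-specific normalization that ignores presentation-only differences. Result: Across 15 models, the strongest proprietary models are substantially more robust on handwriting and forms, while several open-weights models remain competitive on standard printed layouts. Conclusion: Evaluation should focus on errors that matter downstream rather than formatting details; parsing complex French documents remains challenging, and proprietary and open-weights models each have strengths, so selection and optimization should be targeted. Abstract: This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.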
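
A rough sketch of what unit-test-style checks with normalization could look like for two of the failure modes named in the abstract (text presence and reading order). The normalization rules, function names, and the sample invoice are assumptions made for illustration; the benchmark's actual checks and its table-constraint tests are not described in enough detail here to reproduce.

```python
import re
import unicodedata


def normalize(text):
    """Discount presentation-only variance: Unicode form, case, markup, whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[*_#|>`-]+", " ", text)  # drop common Markdown markup characters
    return re.sub(r"\s+", " ", text).strip()


def text_present(markdown, snippet):
    """Pass if the expected snippet appears somewhere in the model output."""
    return normalize(snippet) in normalize(markdown)


def in_reading_order(markdown, first, second):
    """Pass if both snippets appear and `first` comes before `second`."""
    doc = normalize(markdown)
    i, j = doc.find(normalize(first)), doc.find(normalize(second))
    return i != -1 and j != -1 and i < j


# Hypothetical model output for an invoice page.
output = "# Facture\n\n**Montant total** :\n1 250,00 €\n\n- Date d'échéance : 12 mars"

print(text_present(output, "Montant total : 1 250,00 €"))
print(in_reading_order(output, "Facture", "Date d'échéance"))
```

The line break and bold markers around "Montant total" do not cause a failure here, whereas a wrong amount or a swapped reading order would, which is the distinction the abstract draws between benign formatting choices and errors that matter downstream.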

[136] Calibrated Bayesian Deep Learning for Explainable Decision Support Systems Based on Medical Imaging

Hua Xu,Julián D. Arias-Londoño,Juan I. Godino-Llorente

Main category: cs.CV

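TL;DR: This paper proposes a probabilistic optimization framework based on Bayesian deep learning, comprising a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) and a Dual Temperature Scaling (DTS) calibration strategy, which markedly improves prediction calibration and the reliability of uncertainty estimates for medical imaging AI models.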

Details Motivation: Decision support systems based on medical imaging need reliable uncertainty estimates (calibration) as well as predictive accuracy in order to earn clinical trust; existing deep learning models are often overconfident in wrong predictions, which hinders clinical adoption. Method: The paper proposes a generalizable probabilistic optimization framework: (1) during training, a Confidence-Uncertainty Boundary Loss (CUB-Loss) penalizes high-certainty errors and low-certainty correct predictions; (2) at inference, Dual Temperature Scaling (DTS) performs post-hoc calibration, refining the posterior distribution and improving explainability. Result: Validated on three medical imaging tasks (pneumonia screening, diabetic retinopathy detection, and skin lesion identification), the method delivers consistent calibration improvements across modalities and remains robust in data-scarce and severely class-imbalanced settings. Conclusion: The framework strengthens the alignment between uncertainty estimates and prediction correctness, generalizes well, and shows potential for clinical deployment, offering a practical calibration solution for trustworthy medical AI. Abstract: In critical decision support systems based on medical imaging, the reliability of AI-assisted decision-making is as relevant as predictive accuracy. Although deep learning models have demonstrated significant accuracy, they frequently suffer from miscalibration, manifested as overconfidence in erroneous predictions. To facilitate clinical acceptance, it is imperative that models quantify uncertainty in a manner that correlates with prediction correctness, allowing clinicians to identify unreliable outputs for further review. In order to address this necessity, the present paper proposes a generalizable probabilistic optimization framework grounded in Bayesian deep learning. Specifically, a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) is introduced that imposes penalties on high-certainty errors and low-certainty correct predictions, explicitly enforcing alignment between prediction correctness and uncertainty estimates. Complementing this training-time optimization, a Dual Temperature Scaling (DTS) strategy is devised for post-hoc calibration, further refining the posterior distribution to improve intuitive explainability. The proposed framework is validated on three distinct medical imaging tasks: automatic screening of pneumonia, diabetic retinopathy detection, and identification of skin lesions. Empirical results demonstrate that the proposed approach achieves consistent calibration improvements across diverse modalities, maintains robust performance in data-scarce scenarios, and remains effective on severely imbalanced datasets, underscoring its potential for real clinical deployment.
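
The abstract does not spell out how Dual Temperature Scaling works, so the sketch below shows only standard single-temperature scaling, the post-hoc calibration idea it presumably builds on: logits are divided by a scalar T chosen to minimise negative log-likelihood on held-out data. The grid search, function names, and toy logits are all illustrative assumptions, not the paper's procedure.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / T; larger T gives softer, less confident probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL (grid search)."""
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))


# Toy validation set: the second prediction is a confident error.
val_logits = [[8.0, 0.0], [7.0, 0.0], [6.0, 0.0], [0.0, 5.0]]
val_labels = [0, 1, 0, 1]

t_star = fit_temperature(val_logits, val_labels, [0.5, 1.0, 2.0, 4.0, 8.0])
print(t_star, softmax(val_logits[1], t_star))
```

A fitted temperature above 1 lowers confidence on every prediction without changing which class is predicted, so it addresses overconfidence but cannot by itself separate correct from incorrect predictions; that separation is the role the abstract assigns to the training-time CUB-Loss.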

[137] Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation

Wei Chen,Yancheng Long,Mingqiao Liu,Haojie Ding,Yankai Yang,Hongyang Wei,Yi-Fan Zhang,Bin Wen,Fan Yang,Tingting Gao,Han Li,Long Chen

Main category: cs.CV

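TL;DR: This paper proposes Spatial Chain-of-Thought (SCoT), a plug-and-play framework that combines the spatial planning ability of multimodal large language models (MLLMs) with the generative power of diffusion models to improve spatial understanding and reasoning in image generation, without joint training and without losing spatial information.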

Details Motivation: Diffusion models excel at image generation but fall short on complex spatial understanding and reasoning; existing approaches use MLLMs to add this capability, but they either incur high computational cost or lose spatial information. Method: The SCoT framework (1) trains the diffusion model on an interleaved text-coordinate instruction format to strengthen its layout awareness, and (2) uses state-of-the-art MLLMs as planners to generate detailed layout plans, transferring their spatial planning ability directly into the generation process. Result: The method achieves state-of-the-art performance on image generation benchmarks, significantly outperforms baselines on complex reasoning tasks, and is also effective in image editing scenarios. Conclusion: SCoT is an efficient plug-and-play framework that bridges the spatial reasoning of MLLMs and the generative power of diffusion models, balancing performance and practicality. Abstract: While diffusion models have shown exceptional capabilities in aesthetic image synthesis, they often struggle with complex spatial understanding and reasoning. Existing approaches resort to Multimodal Large Language Models (MLLMs) to enhance this capability. However, they either incur high computational costs through joint training or suffer from spatial information loss when relying solely on textual prompts. To alleviate these limitations, we propose a Spatial Chain-of-Thought (SCoT) framework, a plug-and-play approach that effectively bridges the reasoning capabilities of MLLMs with the generative power of diffusion models. Specifically, we first enhance the diffusion model's layout awareness by training it on an interleaved text-coordinate instruction format. We then leverage state-of-the-art MLLMs as planners to generate comprehensive layout plans, transferring their spatial planning capabilities directly to the generation process. Extensive experiments demonstrate that our method achieves state-of-the-art performance on image generation benchmarks and significantly outperforms baselines on complex reasoning tasks, while also showing strong efficacy in image editing scenarios.

[138] Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation

Enrico Guerriero,Kjersti Engan,Øyvind Meinich-Bache

Main category: cs.CV

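TL;DR: This paper explores generative AI (GenAI) methods for activity recognition in newborn resuscitation videos, combining local vision-language models (VLMs) with large language models (LLMs) and evaluating them on a simulated dataset. A small VLM fine-tuned with LoRA clearly outperforms the TimeSformer baseline in F1 score.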

Details Motivation: Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet it remains inadequate in practice; earlier methods based on 3D-CNNs and ViTs are promising but struggle with fine-grained activity recognition, so new approaches are needed. Method: Local VLMs combined with LLMs are compared against a supervised TimeSformer baseline; on a simulated dataset of 13.26 hours of newborn resuscitation video, the authors evaluate several zero-shot VLM strategies and fine-tuned VLMs with classification heads, including LoRA adaptation. Result: Small local VLMs are prone to hallucination in the zero-shot setting, but after LoRA fine-tuning they reach an F1 score of 0.91, clearly above TimeSformer's 0.70. Conclusion: Fine-tuned local VLMs, particularly with LoRA, are an effective way to improve fine-grained activity recognition in newborn resuscitation videos and suggest a new route to automated clinical documentation. Abstract: Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA, the results reach an F1 score of 0.91, surpassing the TimeSformer results of 0.70.
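
Since the result hinges on LoRA fine-tuning, a minimal sketch of the low-rank update may help: the pretrained weight W stays frozen and only two small matrices A and B are trained, giving an effective weight W + (alpha / r) · B · A. The matrix sizes, values, and helper names below are toy assumptions; they are not the configuration used in the paper.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, with B: d_out x r and A: r x d_in."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Frozen 3x3 weight (toy values) and a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a = [[1.0, 2.0, 3.0]]          # r x d_in, trainable
b_init = [[0.0], [0.0], [0.0]]  # d_out x r, zero at the start of training
b_trained = [[1.0], [0.0], [0.0]]  # hypothetical values after training

print(lora_weight(w, a, b_init, alpha=2.0))     # identical to w
print(lora_weight(w, a, b_trained, alpha=2.0))  # only the first row changes
```

With B initialised to zero the adapted model starts out identical to the pretrained one, and the adapter adds r · (d_in + d_out) trainable parameters instead of d_in · d_out, which is what makes fine-tuning a small local model practical.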

[139] Projected Representation Conditioning for High-fidelity Novel View Synthesis

Min-Seop Kwak,Minkyung Kwon,Jinhyeok Choi,Jiho Park,Seungryong Kim

Main category: cs.CV

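TL;DR: This paper proposes ReNoV, a framework that uses external visual representations as conditioning for a diffusion model. Dedicated representation projection modules inject their geometric and semantic information into the diffusion process, improving geometric consistency, reconstruction fidelity, and inpainting quality in novel view synthesis.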

Details Motivation: Diffusion models for novel view synthesis suffer from geometric inconsistency; external representations carry geometric and semantic correspondence properties and can serve as a strong prior. Method: (1) Analyze the correspondence capabilities that emerge in the spatial attention of external visual representations; (2) design representation projection modules that inject external representations into the diffusion process as conditions; (3) combine these into the representation-guided novel view synthesis pipeline, ReNoV. Result: ReNoV outperforms prior diffusion-based novel view synthesis methods on standard benchmarks, with marked improvements in reconstruction fidelity and inpainting quality, and enables robust synthesis from sparse, unposed image collections. Conclusion: External representations used as geometric and semantic priors improve the consistency of diffusion-based novel view generation, and explicit structured conditioning is an effective way to raise generation quality. Abstract: We propose a novel framework for diffusion-based novel view synthesis in which we leverage external representations as conditions, harnessing their geometric and semantic correspondence properties for enhanced geometric consistency in generated novel viewpoints. First, we provide a detailed analysis exploring the correspondence capabilities emergent in the spatial attention of external visual representations. Building from these insights, we propose a representation-guided novel view synthesis through dedicated representation projection modules that inject external representations into the diffusion process, a methodology named ReNoV, short for representation-guided novel view synthesis. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.

[140] A DMD-Based Adaptive Modulation Method for High Dynamic Range Imaging in High-Glare Environments

Banglei Guan,Jing Tao,Liang Xu,Dongcai Tan,Pengju Sun,Jianbing Liu,Yang Shang,Qifeng Yu

Main category: cs.CV

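TL;DR: This paper presents a high dynamic range (HDR) imaging system based on a digital micromirror device (DMD) that improves image quality and digital image correlation (DIC) accuracy for photomechanics measurements under strong glare, such as welding arc monitoring and metallic surface analysis. The system couples DMD optical modulation with adaptive computational imaging to achieve region-adaptive exposure, reaching a measured dynamic range of 127 dB, suppressing saturation artifacts, reducing strain error by 78%, and improving DIC positioning accuracy.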

Details Motivation: Conventional CCD/CMOS sensors have a low dynamic range (below 70 dB) and saturate easily under strong glare, which severely distorts DIC measurements; an imaging solution with a dynamic range above 120 dB is needed to preserve photomechanics measurement accuracy. Method: The authors build an HDR imaging system based on DMD spatial light modulation, consisting of two cooperating subsystems, a DMD optical modulation unit and an adaptive computational imaging pipeline, which together support autonomous regional segmentation and adaptive exposure control. Result: The system reaches a measured dynamic range of 127 dB and effectively eliminates saturation artifacts under high glare; experiments show a 78% reduction in DIC strain error and improved positioning accuracy, confirming reliable performance under extreme intensity variations. Conclusion: The DMD-based HDR system overcomes key limitations of conventional sensors in high-glare scenes and provides high-fidelity, adaptive imaging for applications such as optical metrology and stress analysis. Abstract: Background: The accuracy of photomechanics measurements critically relies on image quality, particularly under extreme illumination conditions such as welding arc monitoring and polished metallic surface analysis. High dynamic range (HDR) imaging above 120 dB is essential in these contexts. Conventional CCD/CMOS sensors, with dynamic ranges typically below 70 dB, are highly susceptible to saturation under glare, resulting in irreversible loss of detail and significant errors in digital image correlation (DIC). Methods: This paper presents an HDR imaging system that leverages the spatial modulation capability of a digital micromirror device (DMD). The system architecture enables autonomous regional segmentation and adaptive exposure control for high-dynamic-range scenes through an integrated framework comprising two synergistic subsystems: a DMD-based optical modulation unit and an adaptive computational imaging pipeline. Results: The system achieves a measurable dynamic range of 127 dB, effectively eliminating saturation artifacts under high glare. Experimental results demonstrate a 78% reduction in strain error and improved DIC positioning accuracy, confirming reliable performance across extreme intensity variations. Conclusion: The DMD-based system provides high-fidelity, adaptive HDR imaging, overcoming key limitations of conventional sensors. It exhibits strong potential for optical metrology and stress analysis in high-glare environments where traditional methods are inadequate.
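
To put the decibel figures in perspective, the short sketch below converts dynamic range in dB to a brightest-to-darkest intensity ratio. It assumes the usual image-sensor convention DR = 20 · log10(I_max / I_min); the abstract does not state which convention the authors use, so treat the resulting ratios as indicative only.

```python
import math


def db_to_ratio(db):
    """Intensity ratio for a dynamic range in dB, assuming DR = 20 * log10(ratio)."""
    return 10 ** (db / 20)


def ratio_to_db(ratio):
    """Inverse conversion under the same assumed convention."""
    return 20 * math.log10(ratio)


for label, db in [("conventional sensor", 70), ("stated requirement", 120), ("reported system", 127)]:
    print(f"{label}: {db} dB -> about {db_to_ratio(db):,.0f} : 1")
```

Under this convention 70 dB corresponds to a ratio of a few thousand to one, whereas 127 dB corresponds to a ratio above two million to one, which is why a conventional sensor saturates in scenes that the DMD-modulated system is reported to capture.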

[141] GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain Team,Boyuan Wang,Chaojun Ni,Guan Huang,Guosheng Zhao,Hao Li,Jie Li,Jindi Lv,Jingyu Liu,Lv Feng,Mingming Yu,Peng Li,Qiuping Deng,Tianze Liu,Xinyu Zhou,Xinze Chen,Xiaofeng Wang,Yang Wang,Yifan Li,Yifei Nie,Yilong Li,Yukun Zhou,Yun Ye,Zhichao Liu,Zheng Zhu

Main category: cs.CV

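TL;DR: This paper proposes GigaBrain-0.5M*, a vision-language-action (VLA) model enhanced with a video world model. Its RAMP reinforcement learning framework improves cross-task generalization and the robustness of long-horizon manipulation, and the model clearly outperforms the baseline on several complex robotic tasks.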

Details Motivation: Existing VLA models are limited by constrained scene understanding and weak future anticipation, whereas video world models offer strong spatiotemporal reasoning and prediction and are a natural foundation for VLA learning. Method: Starting from the pre-trained GigaBrain-0.5 (trained on over 10,000 hours of robotic manipulation data), the authors fine-tune with RAMP (Reinforcement leArning via world Model-conditioned Policy), so that the world model guides optimization of the action policy. Result: On challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation, RAMP improves on the RECAP baseline by approximately 30%; real-world deployment shows reliable completion of complex long-horizon manipulation without failure. Conclusion: Integrating a video world model into VLA training, in particular through the RAMP framework, substantially improves cross-task adaptability, anticipation of future states, and robustness of execution in real environments. Abstract: Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. Therefore, we propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. Built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark, GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure as validated by real-world deployment videos on our project page (https://gigabrain05m.github.io).

[142] AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer

Lingting Zhu,Shengju Qian,Haidi Fan,Jiayu Dong,Zhenchao Jin,Siwei Zhou,Gen Dong,Xin Wang,Lequan Yu

Main category: cs.CV

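TL;DR: AssetFormer is an autoregressive Transformer-based model that generates modular 3D assets satisfying design constraints from text descriptions, making 3D content creation more efficient for professional development and user-generated content (UGC).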

Details Motivation: The digital industry, and user-generated content (UGC) in particular, needs high-quality, diverse modular 3D assets; existing methods struggle to satisfy design constraints while keeping generation diverse. Method: The paper proposes AssetFormer, an autoregressive Transformer that adapts module sequencing and decoding techniques from language models and is trained on modular assets collected from real-world online platforms. Result: Initial experiments indicate that AssetFormer improves the quality of modular 3D asset generation and suits both professional development and UGC scenarios. Conclusion: AssetFormer provides a flexible, extensible framework that advances modular 3D content generation, and its code has been released. Abstract: The digital industry demands high-quality, diverse modular 3D assets, especially for user-generated content (UGC). In this work, we introduce AssetFormer, an autoregressive Transformer-based model designed to generate modular 3D assets from textual descriptions. Our pilot study leverages real-world modular assets collected from online platforms. AssetFormer tackles the challenge of creating assets composed of primitives that adhere to constrained design parameters for various applications. By innovatively adapting module sequencing and decoding techniques inspired by language models, our approach enhances asset generation quality through autoregressive modeling. Initial results indicate the effectiveness of AssetFormer in streamlining asset creation for professional development and UGC scenarios. This work presents a flexible framework extendable to various types of modular 3D assets, contributing to the broader field of 3D content generation. The code is available at https://github.com/Advocate99/AssetFormer.

[143] PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback

Sixiang Chen,Jianyu Lai,Jialin Gao,Hengyu Shi,Zhongying Liu,Tian Ye,Junfeng Luo,Xiaoming Wei,Lei Zhu

Main category: cs.CV

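TL;DR: This paper proposes PosterOmni, a framework that handles both local editing and global creation in image-to-poster generation. It uses data distillation and reward feedback to improve semantic fidelity and aesthetic coherence, and establishes PosterOmni-Bench, a unified benchmark covering both regimes.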

Details Motivation: Image-to-poster generation must both preserve local entities (for example in ID-driven editing) and understand global design (layout, style); existing methods struggle to do both, and a unified modeling and evaluation setup is missing. Method: PosterOmni has three components: (i) multi-scenario image-to-poster datasets covering six task types; (ii) knowledge distillation between local and global expert models followed by supervised fine-tuning; (iii) a unified PosterOmni Reward Feedback mechanism that jointly optimizes entity preservation and aesthetic preference. The authors also establish the PosterOmni-Bench benchmark. Result: PosterOmni significantly outperforms all open-source baselines in reference adherence, global composition quality, and aesthetic harmony, and surpasses several proprietary systems. Conclusion: PosterOmni unifies local editing and global creation, shows that data distillation and reward alignment are effective for multi-task poster generation, and offers a new paradigm for image-driven artistic design generation. Abstract: Image-to-poster generation is a high-demand task requiring not only local adjustments but also high-level design understanding. Models must generate text, layout, style, and visual elements while preserving semantic fidelity and aesthetic coherence. The process spans two regimes: local editing, where ID-driven generation, rescaling, filling, and extending must preserve concrete visual entities; and global creation, where layout- and style-driven tasks rely on understanding abstract design concepts. These intertwined demands make image-to-poster a multi-dimensional process coupling entity-preserving editing with concept-driven creation under image-prompt control. To address these challenges, we propose PosterOmni, a generalized artistic poster creation framework that unlocks the potential of a base edit model for multi-task image-to-poster generation. PosterOmni integrates the two regimes, namely local editing and global creation, within a single system through an efficient data-distillation-reward pipeline: (i) constructing multi-scenario image-to-poster datasets covering six task types across entity-based and concept-based creation; (ii) distilling knowledge between local and global experts for supervised fine-tuning; and (iii) applying unified PosterOmni Reward Feedback to jointly align visual entity-preserving and aesthetic preference across all tasks. Additionally, we establish PosterOmni-Bench, a unified benchmark for evaluating both local editing and global creation. Extensive experiments show that PosterOmni significantly enhances reference adherence, global composition quality, and aesthetic harmony, outperforming all open-source baselines and even surpassing several proprietary systems.

[144] FAIL: Flow Matching Adversarial Imitation Learning for Image Generation

Yeyao Ma,Chen Li,Xiaosong Zhang,Han Hu,Weidi Xie

Main category: cs.CV

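TL;DR: This paper proposes FAIL, a method for post-training flow matching models that uses adversarial training to minimize the divergence between the policy and the expert distribution, without explicit rewards or pairwise comparisons.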

Details Motivation: Supervised fine-tuning cannot correct policy drift in unseen states, while preference optimization methods require costly preference pairs or reward modeling. Method: The paper proposes Flow Matching Adversarial Imitation Learning (FAIL) with two algorithms: FAIL-PD uses differentiable ODE solvers to obtain low-variance pathwise gradients, and FAIL-PG is a black-box alternative for discrete or computationally constrained settings. Result: Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana pro, FAIL achieves competitive performance on prompt-following and aesthetic benchmarks; the framework also generalizes to discrete image and video generation and acts as a robust regularizer that mitigates reward hacking in reward-based optimization. Conclusion: FAIL offers an efficient, general imitation learning approach to post-training flow matching models that needs no explicit reward signal. Abstract: Post-training of flow matching models (aligning the output distribution with a high-quality target) is mathematically equivalent to imitation learning. While Supervised Fine-Tuning mimics expert demonstrations effectively, it cannot correct policy drift in unseen states. Preference optimization methods address this but require costly preference pairs or reward modeling. We propose Flow Matching Adversarial Imitation Learning (FAIL), which minimizes policy-expert divergence through adversarial training without explicit rewards or pairwise comparisons. We derive two algorithms: FAIL-PD exploits differentiable ODE solvers for low-variance pathwise gradients, while FAIL-PG provides a black-box alternative for discrete or computationally constrained settings. Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana pro, FAIL achieves competitive performance on prompt following and aesthetic benchmarks. Furthermore, the framework generalizes effectively to discrete image and video generation, and functions as a robust regularizer to mitigate reward hacking in reward-based optimization. Code and data are available at https://github.com/HansPolo113/FAIL.

[145] TexSpot: 3D Texture Enhancement with Spatially-uniform Point Latent Representation

Ziteng Lu,Yushuang Wu,Chongjie Ye,Yuda Qiu,Jing Shao,Xiaoyang Guo,Jiaqing Zhou,Tianlei Hu,Kun Zhou,Xiaoguang Han

Main category: cs.CV

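TL;DR: This paper proposes TexSpot, a diffusion-based texture enhancement framework. Its new Texlet representation addresses view inconsistency, UV distortion, and dependence on point density in 3D texture generation, markedly improving texture quality and geometric consistency.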

Details Motivation: Existing 3D texture generation methods suffer from view inconsistency, UV mapping distortion, or excessive dependence on geometric density, making it hard to produce high-quality, high-resolution, geometrically consistent textures. Method: The paper introduces Texlet, a 3D texture representation that merges the geometric expressiveness of point-based textures with the compactness of UV mapping; each Texlet encodes a local texture patch with a 2D encoder and is then aggregated by a 3D encoder to incorporate global shape context; a cascaded 3D-to-2D decoder reconstructs the texture patches; on top of this, a conditional diffusion transformer is trained to enhance textures produced by multi-view diffusion. Result: TexSpot significantly outperforms existing state-of-the-art methods in visual fidelity, geometric consistency, and robustness. Conclusion: Combining the Texlet representation with a diffusion transformer provides an effective approach to high-quality, consistent 3D texture generation. Abstract: High-quality 3D texture generation remains a fundamental challenge due to the view-inconsistency inherent in current mainstream multi-view diffusion pipelines. Existing representations rely either on UV maps, which suffer from distortion during unwrapping, or on point-based methods, which tightly couple texture fidelity to geometric density and thereby limit high-resolution texture generation. To address these limitations, we introduce TexSpot, a diffusion-based texture enhancement framework. At its core is Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV-based representation. Each Texlet latent vector encodes a local texture patch via a 2D encoder and is further aggregated using a 3D encoder to incorporate global shape context. A cascaded 3D-to-2D decoder reconstructs high-quality texture patches, enabling the Texlet space learning. Leveraging this representation, we train a diffusion transformer conditioned on Texlets to refine and enhance textures produced by multi-view diffusion methods. Extensive experiments demonstrate that TexSpot significantly improves visual fidelity, geometric consistency, and robustness over existing state-of-the-art 3D texture generation and enhancement approaches. Project page: https://anonymous.4open.science/w/TexSpot-page-2D91.

[146] DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

Xu Guo,Fulong Ye,Qichao Sun,Liyang Chen,Bingchuan Li,Pengze Zhang,Jiawei Liu,Songtao Zhao,Qian He,Xiangwang Hou

Main category: cs.CV

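TL;DR: This paper proposes DreamID-Omni, a unified framework for controllable human-centric audio-video generation. A Symmetric Conditional Diffusion Transformer, a Dual-Level Disentanglement strategy (Synchronized RoPE at the signal level plus Structured Captions at the semantic level), and a Multi-Task Progressive Training scheme together address the problem of disentangling multiple character identities and voice timbres, reaching state-of-the-art results on several metrics.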

Details Motivation: Existing methods treat human-centric audio-video tasks (reference-based generation, video editing, audio-driven animation) as isolated objectives, and precise, disentangled control over multiple character identities and voice timbres in a single framework remains difficult. Method: DreamID-Omni comprises (1) a Symmetric Conditional Diffusion Transformer that fuses heterogeneous control signals through a symmetric conditional injection scheme; (2) a Dual-Level Disentanglement strategy, with Synchronized RoPE at the signal level to enforce attention-space binding and Structured Captions at the semantic level to establish explicit attribute-subject mappings; and (3) Multi-Task Progressive Training, which uses weakly constrained generative priors to regularize strongly constrained tasks. Result: The model surpasses existing methods across video quality, audio quality, and audio-visual consistency, and even outperforms leading proprietary commercial models. Conclusion: DreamID-Omni unifies human-centric audio-video generation tasks with fine-grained control and resolves identity-timbre confusion in multi-speaker scenes, helping move academic research toward commercial-grade applications. Abstract: Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.

[147] EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data

Nils Lehmann,Yi Wang,Zhitong Xiong,Xiaoxiang Zhu

Main category: cs.CV

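TL;DR: EO-VAE is a multi-sensor variational autoencoder tokenizer for Earth observation (EO) data that handles diverse spectral channels in a single model through dynamic hypernetworks, substantially improving reconstruction fidelity.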

Details Motivation: Generative models rely on tokenizers to compress high-dimensional inputs, but EO data comes from diverse sensors with variable spectral channels and is hard to model with a single fixed-structure tokenizer. Method: The paper proposes EO-VAE, a multi-sensor variational autoencoder based on dynamic hypernetworks that encodes and reconstructs flexible channel combinations in one model, without training a separate tokenizer per modality. Result: On the TerraMesh dataset, EO-VAE achieves better reconstruction fidelity than the TerraMind tokenizers, supporting its use as a foundational tokenizer for generative modeling in the EO domain. Conclusion: EO-VAE provides a general, scalable latent representation for heterogeneous Earth observation data and supports further work on generative models for remote sensing imagery. Abstract: State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. While this paradigm has revolutionized RGB generation, Earth observation (EO) data presents unique challenges due to diverse sensor specifications and variable spectral channels. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the EO domain. Unlike prior approaches that train separate tokenizers for each modality, EO-VAE utilizes a single model to encode and reconstruct flexible channel combinations via dynamic hypernetworks. Our experiments on the TerraMesh dataset demonstrate that EO-VAE achieves superior reconstruction fidelity compared to the TerraMind tokenizers, establishing a robust baseline for latent generative modeling in remote sensing.

[148] DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Dianyi Wang,Ruihang Li,Feng Han,Chaofan Ma,Wei Song,Siyuan Wang,Yibin Wang,Yi Xin,Hongjian Liu,Zhixiong Zhang,Shengyuan Ding,Tianhang Wang,Zhenglin Cheng,Tao Lin,Cheng Jin,Kaicheng Yu,Jingjing Chen,Wenjie Wang,Zhongyu Wei,Jiaqi Wang

Main category: cs.CV

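TL;DR: DeepGen 1.0 is a lightweight unified multimodal model for image generation and editing with only 5B parameters. Using the Stacked Channel Bridging (SCB) framework and a three-stage data-centric training strategy, it surpasses much larger models on several benchmarks.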

Details Motivation: Current unified multimodal models for image generation and editing have very large parameter counts and high training and deployment costs; the goal is to reduce these costs while strengthening a compact model's semantic understanding and fine-grained control. Method: The paper introduces Stacked Channel Bridging (SCB), a deep alignment framework that fuses features from multiple VLM layers with learnable 'think tokens', and a three-stage training strategy: alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO. Result: Trained on only about 50M samples, DeepGen 1.0 surpasses the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench; the code, weights, and datasets are open-sourced. Conclusion: A lightweight unified multimodal model can match or exceed much larger models through architectural innovation and efficient training, helping to democratize multimodal research. Abstract: Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.

[149] Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching

Onkar Susladkar,Tushar Prakash,Gayatri Deshmukh,Kiet A. Nguyen,Jiaxun Zhang,Adheesh Juvekar,Tianshu Bao,Lin Chai,Sparsh Mittal,Inderjit S Dhillon,Ismini Lourentzou

Main category: cs.CV

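TL;DR: UniDFlow is a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation with task-specific low-rank adapters and uses reference-based multimodal preference alignment to improve faithfulness and controllability, achieving SOTA performance on multiple benchmarks with strong zero-shot generalization.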

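Details Motivation: To resolve the objective interference and representation entanglement between understanding and generation in multimodal tasks, while improving faithfulness and controllability across tasks without large-scale retraining. Method: Proposes the UniDFlow framework, which decouples the understanding and generation modules with task-specific low-rank adapters and introduces a reference-based multimodal preference alignment mechanism that optimizes relative outcomes under identical conditioning. Result: Achieves SOTA performance on eight benchmarks and shows strong generalization on zero-shot tasks including inpainting, in-context image generation, reference-based editing, and compositional generation. Conclusion: Through its decoupled design and preference alignment strategy, UniDFlow delivers efficient, flexible, and general multimodal modeling that adapts to a wide range of downstream tasks without task-specific training. Abstract: We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.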
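
To make the idea of task-specific low-rank adapters concrete, here is a minimal pure-Python sketch of one linear layer with a frozen shared weight and a separate low-rank update per task. It illustrates the general low-rank adapter technique only; the class name `AdaptedLinear`, the task labels, and the toy numbers are assumptions for illustration and are not taken from UniDFlow.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class AdaptedLinear:
    """Frozen weight W shared by all tasks, plus one low-rank update B @ A per task.

    Hypothetical illustration: adapters maps a task name to (A, B), where
    A has shape r x d_in and B has shape d_out x r, with r much smaller than d_in.
    """

    def __init__(self, weight: Matrix, adapters: Dict[str, Tuple[Matrix, Matrix]]):
        self.weight = weight
        self.adapters = adapters

    def forward(self, x: Vector, task: str) -> Vector:
        base = matvec(self.weight, x)       # shared path, identical for every task
        a, b = self.adapters[task]          # only this pair differs between tasks
        delta = matvec(b, matvec(a, x))     # rank-r correction B (A x)
        return [p + q for p, q in zip(base, delta)]


# Toy usage: a 2x2 identity backbone with two rank-1 adapters.
layer = AdaptedLinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    adapters={
        "understanding": ([[1.0, 0.0]], [[0.0], [1.0]]),
        "generation": ([[0.0, 1.0]], [[1.0], [0.0]]),
    },
)
out_understanding = layer.forward([1.0, 2.0], "understanding")
out_generation = layer.forward([1.0, 2.0], "generation")
```

Because the shared weight is never modified, each task's update lives entirely in its own `(A, B)` pair, which is the sense in which adapters of this kind keep the two objectives from overwriting each other.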

[150] MonarchRT: Efficient Attention for Real-Time Video Generation

Krish Agarwal,Zhuoming Chen,Cheng Luo,Yongqi Chen,Haizhong Zheng,Xun Huang,Atri Rudra,Beidi Chen

Main category: cs.CV

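TL;DR: This paper proposes Monarch-RT, a structured attention parameterization based on Monarch matrix factorization for real-time video diffusion models. It substantially improves computational efficiency while preserving generation quality, achieving true real-time video generation at 16 FPS on a single RTX 5090 for the first time.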

Details Motivation: 3D self-attention, with its quadratic computational cost, is the bottleneck in real-time video diffusion, especially in few-step, autoregressive settings. Existing sparse-attention methods break down there because they cannot model the complex pattern in video attention, in which periodic positional structure, dynamic semantic correspondences, and dense mixing coexist. Method: Proposes Monarch-RT, which factorizes attention with Monarch matrices, combining an aligned block structure with an extended tiled parameterization; efficient inference is achieved through finetuning and custom Triton kernels. Result: Reaches up to 95% attention sparsity on the Self-Forcing model with no loss in quality; runs 1.4 to 11.8 times faster than FlashAttention-2/3/4 on RTX 5090/H100/B200, respectively; achieves true real-time video generation at 16 FPS on a single RTX 5090 for the first time. Conclusion: Monarch-RT is the first highly capable sparse attention parameterization for real-time video generation, balancing expressivity and efficiency and offering a new paradigm for deploying Diffusion Transformers in resource-constrained settings. Abstract: Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of parameterization through finetuning, with custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly-capable sparse attention parameterization for real-time video generation.
Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on Nvidia RTX 5090, H100, and B200 GPUs respectively, providing kernel speedups in the range of 1.4-11.8X. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
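
For readers unfamiliar with Monarch matrices, the following pure-Python sketch shows the generic structure in one common convention: an n x n matrix with n = b^2 written as a block-diagonal factor, a fixed transpose-style permutation, a second block-diagonal factor, and the permutation again. This is only the textbook building block applied to a vector, under assumed function names; it is not the paper's attention parameterization, its tiled extension, or its Triton kernels.

```python
from typing import List

Vector = List[float]
Blocks = List[List[List[float]]]  # b square blocks, each of size b x b


def block_diag_apply(blocks: Blocks, x: Vector) -> Vector:
    """Multiply a block-diagonal matrix, stored as its diagonal blocks, by x."""
    b = len(blocks[0])
    out: Vector = []
    for i, blk in enumerate(blocks):
        seg = x[i * b:(i + 1) * b]
        out.extend(sum(w * s for w, s in zip(row, seg)) for row in blk)
    return out


def stride_permute(x: Vector, b: int) -> Vector:
    """View x as a b x b row-major grid and transpose it (its own inverse)."""
    return [x[j * b + i] for i in range(b) for j in range(b)]


def monarch_apply(left: Blocks, right: Blocks, x: Vector) -> Vector:
    """Compute P L P R x, where L and R are block-diagonal and P is the transpose permutation."""
    b = len(left)
    y = block_diag_apply(right, x)
    y = stride_permute(y, b)
    y = block_diag_apply(left, y)
    return stride_permute(y, b)


def monarch_dense(left: Blocks, right: Blocks) -> List[List[float]]:
    """Materialize the equivalent dense matrix by applying the factors to basis vectors."""
    n = len(left) ** 2
    cols = [
        monarch_apply(left, right, [1.0 if i == j else 0.0 for i in range(n)])
        for j in range(n)
    ]
    return [[cols[j][i] for j in range(n)] for i in range(n)]


# Toy usage with b = 2, so n = 4.
right = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
left = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 1.0]]]
x = [1.0, 2.0, 3.0, 4.0]
fast = monarch_apply(left, right, x)
dense = monarch_dense(left, right)
slow = [sum(m * v for m, v in zip(row, x)) for row in dense]
```

Each factor holds b blocks of b x b entries, so the two factors together store 2·n^1.5 values instead of the n^2 of a dense matrix, and the product never has to materialize the dense form. That gap is what makes this family attractive as a structured replacement for dense mixing.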

[151] UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Leon Liangyu Chen,Haoyu Ma,Zhipeng Fan,Ziqi Huang,Animesh Sinha,Xiaoliang Dai,Jialiang Wang,Zecheng He,Jianwei Yang,Chunyuan Li,Junzhe Sun,Chu Wang,Serena Yeung-Levy,Felix Juefei-Xu

Main category: cs.CV

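TL;DR: This paper proposes UniT, a framework that brings test-time scaling (TTS) to unified multimodal models through multi-round reasoning, verification, and refinement, improving performance on complex multimodal tasks.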

Details Motivation: Existing unified multimodal models typically run a single forward pass, which makes it hard to handle complex tasks that require decomposing instructions, verifying intermediate results, and correcting iteratively; test-time scaling (TTS), already validated for language models, has not yet been effectively extended to unified multimodal models. Method: Proposes the UniT framework, which combines agentic data synthesis, unified model training, and flexible test-time inference to support multi-round chain-of-thought reasoning, verification, subgoal decomposition, and content memory. Result: Experiments show that (1) unified models trained on short reasoning trajectories generalize to longer inference chains; (2) sequential chain-of-thought reasoning is more scalable and compute-efficient than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. Conclusion: Multimodal test-time scaling is an effective paradigm that improves both generation and understanding in unified models. Abstract: Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
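
The multi-round "reason, verify, refine" pattern described above can be sketched as a generic control loop. This is a schematic under stated assumptions, not UniT's implementation: the function `refine_sequentially`, its callable parameters, and the string-based toy model are all hypothetical stand-ins for a unified model's generation, verification, and editing calls.

```python
from typing import Callable, Tuple


def refine_sequentially(
    instruction: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 4,
) -> Tuple[str, int]:
    """Generate once, then alternate verification and refinement.

    Returns the final output and the number of verification rounds used.
    If the budget runs out, the last refined output is returned unverified.
    """
    output = generate(instruction)
    for round_idx in range(1, max_rounds + 1):
        ok, feedback = verify(instruction, output)
        if ok:
            return output, round_idx
        output = refine(instruction, output, feedback)
    return output, max_rounds


# Hypothetical toy stand-ins: the "model" forgets every item after the first one.
def toy_generate(instruction: str) -> str:
    return instruction.split(",")[0].strip()


def toy_verify(instruction: str, output: str) -> Tuple[bool, str]:
    parts = [p.strip() for p in instruction.split(",")]
    missing = [p for p in parts if p not in output]
    return (not missing, missing[0] if missing else "")


def toy_refine(instruction: str, output: str, feedback: str) -> str:
    return output + ", " + feedback


result, rounds = refine_sequentially(
    "red cube, blue ball", toy_generate, toy_verify, toy_refine
)
```

Raising `max_rounds` is the knob that spends more inference compute per query in this sequential style; a parallel-sampling alternative would instead call `generate` several times independently and pick one candidate.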

[152] Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching

Huai-Hsun Cheng,Siang-Ling Zhang,Yu-Lun Liu

Main category: cs.CV

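TL;DR: This paper introduces 'Progressive Semantic Illusions', a new vector sketching task in which a single sketch changes meaning dramatically as strokes are added, and presents the 'Stroke of Surprise' generative framework, which uses a dual-branch Score Distillation Sampling mechanism and an Overlay Loss to achieve cross-stage semantic consistency and structural complementarity.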

Details Motivation: Traditional visual illusions rely on spatial manipulations such as multi-view consistency. This work explores semantic illusions along the temporal dimension, where a single sketch changes its semantic interpretation as strokes are added during drawing, extending visual anagrams to the temporal domain. Method: Proposes a sequence-aware joint optimization framework built around a dual-branch Score Distillation Sampling (SDS) mechanism that dynamically optimizes the prefix strokes to satisfy both semantic targets; an Overlay Loss forces the added strokes to complement rather than occlude the existing ones spatially, which helps discover a structural subspace shared by both targets. Result: Experiments show the method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, extending visual anagrams from the spatial to the temporal domain. Conclusion: Progressive Semantic Illusions is a novel temporal visual-illusion paradigm; through joint optimization and structural constraints, the 'Stroke of Surprise' framework progressively generates multiple semantics from a single sketch, offering new ideas for generative modeling and human-computer interaction. Abstract: Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel vector sketching task where a single sketch undergoes a dramatic semantic transformation through the sequential addition of strokes. We present Stroke of Surprise, a generative framework that optimizes vector strokes to satisfy distinct semantic interpretations at different drawing stages. The core challenge lies in the "dual-constraint": initial prefix strokes must form a coherent object (e.g., a duck) while simultaneously serving as the structural foundation for a second concept (e.g., a sheep) upon adding delta strokes. To address this, we propose a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism. Unlike sequential approaches that freeze the initial state, our method dynamically adjusts prefix strokes to discover a "common structural subspace" valid for both targets. Furthermore, we introduce a novel Overlay Loss that enforces spatial complementarity, ensuring structural integration rather than occlusion. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, successfully expanding visual anagrams from the spatial to the temporal dimension. Project page: https://stroke-of-surprise.github.io/