cs.CL
[1] HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated Q&A over Raw Unstructured Documents
Sungmoon Kim,Hyuna Jeon,Dahye Kim,Mingyu Kim,Dong-Kyu Chae,Jiwoong Kim
Main category: cs.CL
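TL;DR: This paper proposes HybridRAG, a framework that preprocesses unstructured documents such as PDFs into a QA knowledge base, matches queries against existing answers first, and triggers generation only when no match is found, improving both response quality and speed.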
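The query-time routing summarized above (answer from the pre-generated QA bank when a stored question matches, otherwise fall back to generation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words cosine similarity stands in for whatever retriever the paper uses, and `answer`, `fallback_rag`, the sample QA pairs, and the `0.8` threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector (a stand-in for a real embedding)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def answer(query, qa_bank, fallback, threshold=0.8):
    """Return (answer, route); use a stored answer only if the best match clears the threshold."""
    q = bow(query)
    best_score, best_answer = 0.0, None
    for question, stored in qa_bank:
        score = cosine(q, bow(question))
        if score > best_score:
            best_score, best_answer = score, stored
    if best_answer is not None and best_score >= threshold:
        return best_answer, "qa_bank"
    # No suitable QA match: fall back to on-the-fly generation.
    return fallback(query), "fallback"


# Hypothetical pre-generated QA bank and fallback generator.
qa_bank = [
    ("What is the refund policy?", "Refunds are accepted within 30 days."),
    ("How do I reset my password?", "Use the reset link on the login page."),
]


def fallback_rag(query):
    """Placeholder for a standard retrieve-then-generate pipeline."""
    return f"[generated answer for: {query}]"


print(answer("what is the refund policy", qa_bank, fallback_rag))
print(answer("Which GPU should I buy?", qa_bank, fallback_rag))
```

The threshold controls the trade-off the summary describes: a higher value sends more queries to the slower generation path, while a lower value answers more queries immediately at the risk of returning a mismatched stored answer.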
Details
Motivation: Existing RAG methods rely on well-structured text and perform retrieval and generation at query time, which makes them ill-suited to real chat scenarios involving large volumes of unstructured PDF documents, many concurrent users, and limited compute.
Method: HybridRAG works in two stages: (1) it parses PDFs into hierarchical text chunks using OCR and layout analysis; (2) it pre-generates a QA knowledge base with an LLM. At query time it searches the QA bank first and falls back to standard RAG generation only when no match is found.
Result: Experiments on OHRBench show that HybridRAG achieves higher answer quality and lower response latency than a standard RAG baseline.
Conclusion: HybridRAG is an efficient, practical RAG framework for real-world chatbot applications, suited to handling large volumes of unstructured documents in resource-constrained environments.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for grounding Large Language Model (LLM)-based chatbot responses on external knowledge. However, existing RAG studies typically assume well-structured textual sources (e.g., Wikipedia or curated datasets) and perform retrieval and generation at query time, which can limit their applicability in real-world chatbot scenarios. In this paper, we present HybridRAG, a novel and practical RAG framework towards more accurate and faster chatbot responses. First, HybridRAG ingests raw, unstructured PDF documents containing complex layouts (text, tables, figures) via Optical Character Recognition (OCR) and layout analysis, and converts them into hierarchical text chunks. Then, it pre-generates a plausible question-answer (QA) knowledge base from the organized chunks using an LLM. At query time, user questions are matched against this QA bank to retrieve immediate answers when possible, and only if no suitable QA match is found does our framework fall back to on-the-fly response generation. Experiments on OHRBench demonstrate that our HybridRAG provides higher answer quality and lower latency compared to a standard RAG baseline. We believe that HybridRAG could be a practical solution for real-world chatbot applications that must handle large volumes of unstructured documents and many users under limited computational resources.
[2] Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
Max Zhang,Derek Liu,Kai Zhang,Joshua Franco,Haihao Liu
Main category: cs.CL
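TL;DR: This paper examines knowledge distillation (KD) for multilingual jailbreak prevention and finds that standard fine-tuning actually increases the Jailbreak Success Rate (JSR); removing nuanced 'boundary' refusals mitigates the safety degradation, but reasoning ability still declines.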
Details
Motivation: Safety alignment of large language models (LLMs) is currently focused mainly on English, leaving safety gaps in low-resource language settings and creating a need for multilingual safety alignment methods.
Method: The authors apply black-box, response-based knowledge distillation, using LoRA for parameter-efficient fine-tuning of three open-source models (Llama-3, Gemma-2, Qwen3). The teacher model is OpenAI o1-mini, and the training data are about 28,000 multilingual jailbreak prompts from XSafety.
Result: On the MultiJail benchmark, standard fine-tuning raises the JSR of every student model, by up to 16.6 percentage points; removing 'boundary' refusals mitigates or reverses the safety degradation, but GSM8K reasoning performance still drops.
Conclusion: Knowledge distillation shows promise for multilingual safety alignment but also faces challenges; further work is needed on balancing safety and reasoning ability.
Abstract: Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's "safe" refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced 'boundary' refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.
[3] Retrieval Heads are Dynamic
Yuping Lin,Zitao Li,Yue Xing,Pengfei He,Yingqian Cui,Yaliang Li,Bolin Ding,Jingren Zhou,Jiliang Tang
Main category: cs.CL
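TL;DR: This paper studies retrieval heads in large language models (LLMs) from a dynamic perspective, finding that they vary across timesteps during autoregressive generation, cannot be replaced by static heads, and that hidden states predict future retrieval patterns, revealing an internal planning mechanism.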
Details
Motivation: Prior work mostly identifies heads that perform retrieval on average, based on static statistics, and overlooks the fine-grained temporal dynamics of autoregressive generation.
Method: Through systematic empirical analysis on Needle-in-a-Haystack and multi-hop QA tasks, the authors propose and validate three core claims (the temporal dynamism of retrieval heads, their irreplaceability, and the predictability of retrieval patterns from hidden states) and build a Dynamic Retrieval-Augmented Generation framework to quantify the comparison.
Result: The study confirms that retrieval heads switch dynamically across timesteps; static heads cannot effectively replace dynamic ones; the model's hidden state carries a predictive signal for future retrieval head patterns; and dynamic retrieval heads clearly outperform static ones in RAG.
Conclusion: Retrieval behavior in LLMs is strongly time-dependent and reflects an intrinsic planning ability; the static view should give way to dynamic modeling to understand how it works.
Abstract: Recent studies have identified "retrieval heads" in Large Language Models (LLMs) responsible for extracting information from input contexts. However, prior works largely rely on static statistics aggregated across datasets, identifying heads that perform retrieval on average. This perspective overlooks the fine-grained temporal dynamics of autoregressive generation. In this paper, we investigate retrieval heads from a dynamic perspective. Through extensive analysis, we establish three core claims: (1) Dynamism: Retrieval heads vary dynamically across timesteps; (2) Irreplaceability: Dynamic retrieval heads are specific at each timestep and cannot be effectively replaced by static retrieval heads; and (3) Correlation: The model's hidden state encodes a predictive signal for future retrieval head patterns, indicating an internal planning mechanism. We validate these findings on the Needle-in-a-Haystack task and a multi-hop QA task, and quantify the differences in the utility of dynamic and static retrieval heads in a Dynamic Retrieval-Augmented Generation framework. Our study provides new insights into the internal mechanisms of LLMs.
[4] Nested Named Entity Recognition in Plasma Physics Research Articles
Muhammad Haris,Hans Höft,Markus M. Becker,Markus Stocker
Main category: cs.CL
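TL;DR: This paper proposes a lightweight nested named entity recognition (NER) method for plasma physics research articles, based on encoder transformers and conditional random fields (CRFs); it improves in-domain extraction of specialized entities through a corpus annotated with 16 classes, entity-specific modeling, and hyperparameter optimization.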
Details
Motivation: Plasma physics research articles are highly complex and context-rich, and specialized entities must be extracted effectively to support applications such as advanced search, yet existing NER methods face challenges in this domain.
Method: The authors build a plasma physics corpus with 16 nested entity classes, train independent BERT-CRF models specialized per entity type, and introduce a systematic hyperparameter optimization process to improve performance.
Result: The approach achieves efficient nested NER on plasma physics text and improves performance on specialized entity extraction.
Conclusion: The work advances named entity recognition in plasma physics and provides a foundation for researchers analyzing and navigating scientific literature.
Abstract: Named Entity Recognition (NER) is an important task in natural language processing that aims to identify and extract key entities from unstructured text. We present a novel application of NER in plasma physics research articles and address the challenges of extracting specialized entities from scientific text in this domain. Research articles in plasma physics often contain highly complex and context-rich content that must be extracted to enable, e.g., advanced search. We propose a lightweight approach based on encoder-transformers and conditional random fields to extract (nested) named entities from plasma physics research articles. First, we annotate a plasma physics corpus with 16 classes specifically designed for the nested NER task. Second, we evaluate an entity-specific model specialization approach, where independent BERT-CRF models are trained to recognize individual entity types in plasma physics text. Third, we integrate an optimization process to systematically fine-tune hyperparameters and enhance model performance. Our work contributes to the advancement of entity recognition in plasma physics and also provides a foundation to support researchers in navigating and analyzing scientific literature.
[5] Assessing LLM Reliability on Temporally Recent Open-Domain Questions
Pushwitha Krishnappa,Amit Das,Vinija Jain,Tathagata Mukherjee,Aman Chadha
Main category: cs.CL
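TL;DR: This paper introduces the RECOM benchmark dataset to evaluate how well large language models answer recent Reddit questions. It finds a paradox of high semantic similarity but low lexical overlap, indicating that models preserve meaning through extensive paraphrasing rather than direct reproduction; it also reports that model scale does not determine performance and that contradiction rates are low, and calls for multi-dimensional evaluation frameworks in place of lexical metrics alone.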
Details
Motivation: The alignment of LLMs with human perspectives on temporally recent information in open-domain question answering remains underexplored.
Method: The authors build RECOM, a dataset of 15,000 Reddit questions from September 2025 with community-derived reference answers, and evaluate the answers of four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, GPT-OSS-20B) with BLEU, ROUGE, BERTScore, MoverScore, cosine similarity, and NLI.
Result: All models exceed 99% cosine similarity while BLEU-1 stays below 8%, a marked semantic-lexical paradox; MoverScore falls in between (51-53%); parameter count does not determine performance (Mistral-7B outperforms GPT-OSS-20B); and NLI shows contradiction rates below 7%.
Conclusion: Lexical-matching metrics such as BLEU do not reliably reflect the quality of abstractive generation; multi-dimensional evaluation frameworks combining semantic, logical, and representational signals should be used. The RECOM dataset is publicly released.
Abstract: Large Language Models (LLMs) are increasingly deployed for open-domain question answering, yet their alignment with human perspectives on temporally recent information remains underexplored. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a benchmark dataset of 15,000 recent Reddit questions from September 2025 paired with community-derived reference answers. We investigate how four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) respond to these questions, evaluating alignment using lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). Our central finding is a striking semantic-lexical paradox: all models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap, a 90+ percentage point gap indicating that models preserve meaning through extensive paraphrasing rather than lexical reproduction. MoverScore (51-53%) confirms this pattern, occupying an intermediate position that reflects the optimal transport cost of semantic alignment. Furthermore, model scale does not predict performance: Mistral-7B (7B parameters) outperforms GPT-OSS-20B (20B parameters) across all metrics. NLI analysis reveals that contradiction rates remain below 7%, suggesting models rarely generate content that directly conflicts with human consensus.
These findings challenge the reliability of lexical metrics for evaluating abstractive generation and argue for multi-dimensional evaluation frameworks that capture semantic fidelity beyond surface-level text matching. The RECOM dataset is publicly available at https://anonymous.4open.science/r/recom-D4B0
[6] Small Updates, Big Doubts: Does Parameter-Efficient Fine-tuning Enhance Hallucination Detection?
Xu Hu,Yifan Zhang,Songtao Wei,Chen Zhao,Qiannan Li,Bingzhe Li,Feng Chen
Main category: cs.CL
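TL;DR: This paper systematically studies how parameter-efficient fine-tuning (PEFT) affects hallucination detection in large language models, finding that PEFT substantially improves the AUROC of a range of unsupervised hallucination detectors, and that it works mainly by reshaping how uncertainty is represented in the model rather than by injecting new factual knowledge.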
Details
Motivation: Although PEFT is widely used to adapt LLMs and is often assumed to improve factual correctness, its effect on hallucination behavior, especially in question answering, remains unclear.
Method: The authors run a comprehensive empirical evaluation of seven unsupervised hallucination detection methods, spanning semantic-consistency, confidence, and entropy paradigms, across three open-weight LLM backbones and three fact-seeking QA datasets, and analyze the mechanism with linear probes and representation diagnostics.
Result: PEFT consistently strengthens hallucination detection, substantially improving AUROC across many detectors; further analysis indicates that PEFT mainly changes how uncertainty is encoded and surfaced rather than injecting new factual knowledge.
Conclusion: PEFT improves hallucination detection by reshaping the model's internal representation of uncertainty rather than by adding factual knowledge, which matters for understanding how PEFT works and for designing hallucination mitigation strategies.
Abstract: Parameter-efficient fine-tuning (PEFT) methods are widely used to adapt large language models (LLMs) to downstream tasks and are often assumed to improve factual correctness. However, how PEFT methods affect hallucination behavior remains insufficiently understood, especially on QA datasets. In this work, we systematically investigate the impact of PEFT on hallucination detection through a comprehensive empirical study across three open-weight LLM backbones and three fact-seeking QA benchmarks. For each model, we evaluate performance using seven unsupervised hallucination detection methods spanning three complementary approaches: semantic-consistency-based detectors, confidence-based detectors, and entropy-based detectors. This multifaceted evaluation enables us to characterize how PEFT reshapes uncertainty across different detection paradigms. Our experimental results show that PEFT consistently strengthens hallucination detection ability, substantially improving AUROC across a wide range of hallucination detectors. Further analyses using linear probes and representation diagnostics indicate that PEFT methods primarily reshape how uncertainty is encoded and surfaced, rather than injecting new factual knowledge into the models.
[7] Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering
Nathan Mao,Varun Kaushik,Shreya Shivkumar,Parham Sharafoleslami,Kevin Zhu,Sunishchal Dev
Main category: cs.CL
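TL;DR: This paper introduces FalseCite, a dataset for systematically evaluating LLM hallucination under misleading citations, and analyzes hidden states to reveal underlying patterns.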
Details
Motivation: Large language models (LLMs) often hallucinate, which is especially harmful in sensitive fields such as medicine and law; hallucination mechanisms need systematic study and a targeted benchmark.
Method: The authors build the FalseCite dataset of false claims induced by misleading or fabricated citations, test GPT-4o-mini, Falcon-7B, and Mistral-7B on it, and analyze the distribution and clustering of the models' hidden state vectors.
Result: Misleading citations noticeably increase hallucination rates, especially for GPT-4o-mini; hidden state vectors trace out a distinct horn-like geometric shape whether or not the model is hallucinating.
Conclusion: FalseCite provides a reproducible, interpretable benchmark for LLM hallucination research, and the hidden-state regularities it reveals may help in developing more robust hallucination mitigation methods.
Abstract: Large Language Models (LLMs) often hallucinate, generating nonsensical or false information that can be especially harmful in sensitive fields such as medicine or law. To study this phenomenon systematically, we introduce FalseCite, a curated dataset designed to capture and benchmark hallucinated responses induced by misleading or fabricated citations. Running GPT-4o-mini, Falcon-7B, and Mistral-7B through FalseCite, we observed a noticeable increase in hallucination activity for false claims with deceptive citations, especially in GPT-4o-mini. Using the responses from FalseCite, we can also analyze the internal states of hallucinating models, visualizing and clustering the hidden state vectors. From this analysis, we noticed that the hidden state vectors, regardless of hallucination or non-hallucination, tend to trace out a distinct horn-like shape. Our work underscores FalseCite's potential as a foundation for evaluating and mitigating hallucinations in future LLM research.
[8] Enhancing SDG-Text Classification with Combinatorial Fusion Analysis and Generative AI
Jingyan Xu,Marcelo L. LaFleur,Christina Schweikert,D. Frank Hsu
Main category: cs.CL
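TL;DR: This paper proposes a method based on Combinatorial Fusion Analysis (CFA) and synthetic data from generative AI to improve text classification against the UN Sustainable Development Goals (SDGs), and validates the effectiveness of combining model fusion with human experts.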
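To make the fusion idea concrete, here is a small sketch of the two basic CFA combinations (averaging scores and averaging ranks) plus a rank-score characteristic (RSC) function. It is only an illustration under stated assumptions: the two score lists are made up, and `cognitive_diversity` uses a root-mean-square gap between RSC functions, which is one common choice and not necessarily the definition used in the paper.

```python
def ranks(scores):
    """Rank each item by score; rank 1 is the highest score, ties keep input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result


def rsc(scores):
    """Rank-score characteristic: the score found at rank 1, 2, ..., n."""
    return sorted(scores, reverse=True)


def score_combination(systems):
    """Average the scores that each system assigns to every item."""
    n = len(systems[0])
    return [sum(s[i] for s in systems) / len(systems) for i in range(n)]


def rank_combination(systems):
    """Average the ranks that each system assigns to every item (lower is better)."""
    all_ranks = [ranks(s) for s in systems]
    n = len(systems[0])
    return [sum(r[i] for r in all_ranks) / len(all_ranks) for i in range(n)]


def cognitive_diversity(a, b):
    """Assumed form: root-mean-square gap between the two RSC functions."""
    fa, fb = rsc(a), rsc(b)
    return (sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa)) ** 0.5


# Hypothetical scores from two classifiers over three candidate labels.
model_a = [0.9, 0.2, 0.6]
model_b = [0.5, 0.4, 0.8]

print(score_combination([model_a, model_b]))
print(rank_combination([model_a, model_b]))
print(cognitive_diversity(model_a, model_b))
```

Score and rank combinations can disagree on the final ordering, which is why CFA considers both and prefers to fuse models that are individually good and mutually diverse.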
Details
Motivation: SDG text classification is difficult when categories are unavailable, hard to differentiate, or interrelated, and social analysis relies heavily on text data, so more robust classification methods are needed.
Method: The authors generate synthetic training data with generative AI and combine multiple cognitively diverse classification models under the CFA framework, fusing them via the rank-score characteristic (RSC) function.
Result: CFA fusion reaches 96.73% accuracy, outperforming the best individual model; a comparison with human domain experts confirms that model fusion and expert judgment complement and enhance each other.
Conclusion: Intelligent fusion of multiple models (CFA) together with human experts can substantially improve classification of semantically complex text, offering a new paradigm for analyzing high-level policy text such as the SDGs.
Abstract: Natural Language Processing (NLP) techniques such as text classification and topic discovery are very useful in many application areas including information retrieval, knowledge discovery, policy formulation, and decision-making. However, it remains a challenging problem in cases where the categories are unavailable, difficult to differentiate, or are interrelated. Social analysis with human context is an area that can benefit from text classification, as it relies substantially on text data. The focus of this paper is to enhance the classification of text according to the UN's Sustainable Development Goals (SDGs) by collecting and combining intelligence from multiple models. Combinatorial Fusion Analysis (CFA), a system fusion paradigm using a rank-score characteristic (RSC) function and cognitive diversity (CD), has been used to enhance classifier methods by combining a set of relatively good and mutually diverse classification models. We use a generative AI model to generate synthetic data for model training and then apply CFA to this classification task. The CFA technique achieves 96.73% performance, outperforming the best individual model. We compare the outcomes with those obtained from human domain experts. It is demonstrated that combining intelligence from multiple ML/AI models using CFA and getting input from human experts can not only complement but also enhance each other.
[9] Disentangling Direction and Magnitude in Transformer Representations: A Double Dissociation Through L2-Matched Perturbation Analysis
Mangadoddi Srikar Vardhan,Lekkala Sai Teja
Main category: cs.CL
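TL;DR: This paper finds that the direction (angle) and magnitude (norm) of Transformer hidden states play different functional roles in language modeling and syntactic processing: directional perturbations do more damage to language modeling loss, while magnitude perturbations do more damage to syntactic tasks such as subject-verb agreement. The dissociation depends on LayerNorm and is not pronounced in RMSNorm models.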
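The idea of an L2-matched comparison can be illustrated with a short sketch: perturb a vector once by changing only its norm and once by changing only its direction, while forcing both perturbations to move it the same Euclidean distance. The construction below is an assumption about how such matching could be done (scaling for magnitude, a rotation toward a random orthogonal direction for angle), not the paper's code; the function names and the example vector are hypothetical.

```python
import math
import random


def norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def magnitude_perturb(h, eps):
    """Scale h by (1 + eps): direction unchanged, displacement = eps * ||h||."""
    return [(1.0 + eps) * x for x in h]


def angular_perturb(h, eps, rng):
    """Rotate h with its norm unchanged so that the displacement is eps * ||h||.

    Requires 0 <= eps <= 2, since a rotation can move h by at most 2 * ||h||.
    """
    n = norm(h)
    u = [x / n for x in h]
    # Random unit direction orthogonal to u (one Gram-Schmidt step).
    r = [rng.gauss(0.0, 1.0) for _ in h]
    proj = sum(a * b for a, b in zip(r, u))
    v = [a - proj * b for a, b in zip(r, u)]
    vn = norm(v)
    v = [x / vn for x in v]
    # The chord length of a rotation by theta is 2 * ||h|| * sin(theta / 2).
    theta = 2.0 * math.asin(eps / 2.0)
    return [n * (math.cos(theta) * a + math.sin(theta) * b) for a, b in zip(u, v)]


def displacement(a, b):
    """Euclidean distance between two vectors."""
    return norm([x - y for x, y in zip(a, b)])


rng = random.Random(0)
h = [3.0, 4.0, 0.0, 12.0]  # hypothetical hidden state with norm 13
h_mag = magnitude_perturb(h, 0.1)
h_ang = angular_perturb(h, 0.1, rng)

print(displacement(h, h_mag), displacement(h, h_ang))  # both about 1.3
print(norm(h_mag), norm(h_ang))  # norm changes only for the magnitude case
```

Matching the displacement is what lets any difference in downstream damage be attributed to the kind of change (angle versus norm) rather than to its size.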
Details
Motivation: Whether direction (angle) and magnitude (norm) in Transformer hidden states serve different functional roles has not been clearly established by prior work.
Method: On Pythia-family models, the authors use L2-matched perturbation analysis, applying angular and magnitude perturbations with identical Euclidean displacement; they combine this with causal interventions (such as repairing the attention or LayerNorm pathways) to trace how damage propagates, validate across model scales, and compare against RMSNorm architectures.
Result: Angular perturbations increase language modeling loss by up to 42.9 times, while magnitude perturbations cause a 20.4% drop in subject-verb agreement accuracy (versus 1.6% for angular perturbations); angular damage flows mainly through the attention pathways (28.4% loss recovery after repair) and magnitude damage partly through the LayerNorm pathways (29.9% recovery after repair); the dissociation replicates across Pythia model sizes but disappears in RMSNorm models.
Conclusion: In LayerNorm architectures, direction and magnitude serve partially independent computational functions: direction governs attentional routing, while magnitude modulates processing intensity for fine-grained syntactic judgments. The separation depends on the normalization scheme; it challenges and refines the linear representation hypothesis and has implications for model editing and interpretability research.
Abstract: Transformer hidden states encode information as high-dimensional vectors, yet whether direction (orientation in representational space) and magnitude (vector norm) serve distinct functional roles remains unclear. Studying Pythia-family models, we discover a striking cross-over dissociation: angular perturbations cause up to 42.9x more damage to language modeling loss, while magnitude perturbations cause disproportionately more damage to syntactic processing (20.4% vs. 1.6% accuracy drop on subject-verb agreement). This finding is enabled by L2-matched perturbation analysis, a methodology ensuring that angular and magnitude perturbations achieve identical Euclidean displacements. Causal intervention reveals that angular damage flows substantially through the attention pathways (28.4% loss recovery via attention repair), while magnitude damage flows partly through the LayerNorm pathways (29.9% recovery via LayerNorm repair). These patterns replicate across scales within the Pythia architecture family. These findings provide evidence that direction and magnitude support partially distinct computational roles in LayerNorm-based architectures. Direction preferentially affects attentional routing, while magnitude modulates processing intensity for fine-grained syntactic judgments. We find different patterns in RMSNorm-based architectures, suggesting that the dissociation depends on architectural choices.
Our results refine the linear representation hypothesis and have implications for model editing and interpretability research.
[10] PRIME: Policy-Reinforced Iterative Multi-agent Execution for Algorithmic Reasoning in Large Language Models
Jiawei Xu,Zhenyu Yu,Ziqian Bi,Minh Duc Pham,Xiaoyi Qu,Danyang Zhang
Main category: cs.CL
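TL;DR: This paper proposes PRIME, a framework in which three specialized agents (executor, verifier, coordinator) work together and are optimized with group relative policy optimization, substantially improving LLM performance on algorithmic reasoning. It also introduces PRIME-Bench, the largest algorithmic reasoning benchmark to date; experiments show PRIME raises average accuracy from 26.8% to 93.8%, with the largest gains on tasks requiring sustained state tracking.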
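The division of labor among the three agents can be pictured with a toy loop. This is only a schematic sketch under loud assumptions: in the paper the agents are LLM policies trained with group relative policy optimization, whereas here `executor`, `verifier`, and `coordinator` are hypothetical plain functions on a running-sum task, and the executor's error is injected deliberately to show what verification catches.

```python
def executor(xs, state, attempt):
    """Toy executor: propose the next running-sum state.

    It is deliberately wrong on the first attempt at odd steps, to mimic a slip.
    """
    i, total = state
    proposed = total + xs[i]
    if i % 2 == 1 and attempt == 0:
        proposed += 1  # injected error
    return (i + 1, proposed)


def verifier(xs, prev, new):
    """Toy verifier: check the step invariant new_total == prev_total + xs[i]."""
    i, total = prev
    return new == (i + 1, total + xs[i])


def coordinator(xs, max_retries=2):
    """Advance only on verified steps; on failure, retry from the last verified state."""
    state = (0, 0)
    rejected = 0
    while state[0] < len(xs):
        for attempt in range(max_retries + 1):
            candidate = executor(xs, state, attempt)
            if verifier(xs, state, candidate):
                state = candidate
                break
            rejected += 1  # discard the candidate; state stays at the checkpoint
        else:
            raise RuntimeError("step failed after all retries")
    return state[1], rejected


print(coordinator([3, 1, 4, 1, 5]))  # (14, 2)
```

Without the verifier, the two injected slips would carry forward and the final sum would be 16 instead of 14, which is the kind of error propagation that iterative verification is meant to stop.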
Details
Motivation: LLM performance on algorithmic reasoning remains limited, and a framework is needed that supports complex, multi-step, constrained algorithmic reasoning.
Method: PRIME comprises three kinds of agents, an executor (step-by-step reasoning), a verifier (constraint checking), and a coordinator (backtracking control), trained jointly with Group Relative Policy Optimization; the authors also build PRIME-Bench, a large-scale, multi-category algorithmic reasoning benchmark with execution traces.
Result: On PRIME-Bench, PRIME raises average accuracy from 26.8% to 93.8% (a 250% relative gain); Turing machine simulation improves from 9% to 92% and long division from 16% to 94%; ablations confirm iterative verification as the key mechanism; small models (8B) with PRIME reach performance comparable to large models (64B-120B).
Conclusion: PRIME demonstrates that multi-agent collaboration with iterative verification improves algorithmic reasoning, in particular by limiting error propagation, and that it scales well across model sizes, offering a new paradigm for applying LLMs to rigorous computational tasks.
Abstract: Large language models have demonstrated remarkable capabilities across diverse reasoning tasks, yet their performance on algorithmic reasoning remains limited. To address this limitation, we propose PRIME (Policy-Reinforced Iterative Multi-agent Execution), a framework comprising three specialized agents: an executor for step-by-step reasoning, a verifier for constraint checking, and a coordinator for backtracking control, optimized through group relative policy optimization. For comprehensive evaluation, we introduce PRIME-Bench, the largest algorithmic reasoning benchmark to date, comprising 86 tasks across 12 categories with 51,600 instances. Tasks span sorting algorithms, graph and tree structures, automata and state machines, symbolic reasoning, and constraint-based puzzles, with execution traces reaching over one million steps. Compared to the baseline approach, PRIME improves average accuracy from 26.8% to 93.8%, a 250% relative gain. The largest improvements occur on tasks requiring sustained state tracking, with Turing machine simulation improving from 9% to 92% and long division from 16% to 94%. Ablation studies identify iterative verification as the primary contributor, preventing the error propagation that causes baseline approaches to fail catastrophically. Analysis across model scales (8B-120B parameters) reveals that smaller models benefit disproportionately, achieving accuracy comparable to models 8x larger.
[11] Efficient Hyper-Parameter Search for LoRA via Language-aided Bayesian Optimization
Baek Seong-Eun,Lee Jung-Mok,Kim Sung-Bin,Tae-Hyun Oh
Main category: cs.CL
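TL;DR: This paper proposes a framework that integrates the domain knowledge of large language models (LLMs) into Bayesian Optimization (BO) to search efficiently for LoRA fine-tuning hyperparameters. It uses language prompts to map hyperparameters and their semantics into a continuous vector space, adds a learnable token to capture residual information that is hard to describe in words, and uses proxy training on a data subset, substantially improving both the efficiency and the results of hyperparameter search.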
Details
Motivation: LoRA makes LLM fine-tuning more efficient, but it is highly sensitive to hyperparameters and exhaustive search is computationally expensive, so a smarter, more efficient hyperparameter optimization method is needed.
Method: (1) An LLM serves as a mapping from discrete hyperparameters to a continuous vector space, with a domain-aware textual prompt injecting prior knowledge about LoRA hyperparameters; (2) a learnable token models residual information the prompt cannot cover; (3) proxy training and evaluation on a data subset speed up BO iterations.
Result: Hyperparameters found in only about 30 BO iterations outperform standard hyperparameters chosen from about 45,000 combinations by more than 20% across several tasks.
Conclusion: Combining the semantic understanding of LLMs with Bayesian optimization substantially improves the efficiency and effectiveness of LoRA hyperparameter search, offering a new approach to LLM personalization under limited resources.
Abstract: Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enables resource-efficient personalization or specialization, but it comes at the expense of additional hyperparameter tuning. Although LoRA makes fine-tuning efficient, it is highly sensitive to the choice of hyperparameters, and exhaustive hyperparameter search is still computationally very demanding. To address these challenges, we propose a framework that integrates the domain knowledge of pre-trained LLMs into Bayesian Optimization (BO) to efficiently search for LoRA hyperparameters. To leverage the informed knowledge of LLMs, we repurpose LLMs as a discrete-to-continuous mapping to link the hyperparameters and their domain knowledge with a continuous vector space, where BO is conducted. We design and control the mapping by language prompting, where we provide a domain-aware textual prompt describing the relationships among hyperparameters and their respective roles; thereby, we explicitly inject domain knowledge about LoRA into the LLM in natural language. Also, we model the residual information that is hard to linguistically describe in the prompt with an additional learnable token. This helps BO sample more high-performing hyperparameters. In addition, building on the observation that performance on full and subset training datasets is strongly correlated in LoRA training regimes, we introduce proxy training and evaluation with a data subset. This further increases the efficiency of our method.
We demonstrate that the hyperparameters found by our method in only about 30 iterations achieve more than a 20% performance improvement over standard hyperparameters found from about 45,000 combinations.
[12] Synthesizing the Virtual Advocate: A Multi-Persona Speech Generation Framework for Diverse Linguistic Jurisdictions in Indic Languages
Aniket Deroy
Main category: cs.CL
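TL;DR: This study evaluates the Gemini 2.5 Flash and Pro TTS models on generating courtroom speeches in five Indic languages and proposes a prompting framework that exploits their multilingual support and context-aware pacing. The models handle procedural legal speech well but fall short on persuasiveness, emotional weight, and dynamic vocal modulation, with notable performance drops in Bengali and Gujarati.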
Details
Motivation: Legal advocacy requires an authoritative tone, rhythmic pausing, and emotional intelligence, yet current multilingual TTS still struggles to reproduce the persuasive vocal artistry of human advocates in India's linguistically diverse setting.
Method: The author builds a prompting framework for five Indic languages (Tamil, Telugu, Bengali, Hindi, Gujarati) that uses the Gemini 2.5 models' native multilingual support and context-aware pacing to generate synthetic speech for distinct advocate personas, and evaluates performance.
Result: The models exhibit a "monotone authority": they convey procedural information well but show clear weaknesses in dynamic vocal modulation, emotional intensity, and persuasive delivery; Bengali and Gujarati perform worst, pointing to open challenges in phonological modeling.
Conclusion: Multilingual TTS is ready to support procedural legal speech tasks, but reproducing the emotional richness and rhetorical expressiveness of human legal advocacy will require advances in vocal expressiveness and language-specific modeling.
Abstract: Legal advocacy requires a unique combination of authoritative tone, rhythmic pausing for emphasis, and emotional intelligence. This study investigates the performance of the Gemini 2.5 Flash TTS and Gemini 2.5 Pro TTS models in generating synthetic courtroom speeches across five Indic languages: Tamil, Telugu, Bengali, Hindi, and Gujarati. We propose a prompting framework that utilizes Gemini 2.5's native support for five languages and its context-aware pacing to produce distinct advocate personas. The evolution of Large Language Models (LLMs) has shifted the focus of Text-to-Speech (TTS) technology from basic intelligibility to context-aware, expressive synthesis. In the legal domain, synthetic speech must convey authority and a specific professional persona, a task that becomes significantly more complex in the linguistically diverse landscape of India. The models exhibit a "monotone authority," excelling at procedural information delivery but struggling with the dynamic vocal modulation and emotive gravitas required for persuasive advocacy. Performance dips in Bengali and Gujarati further highlight phonological frontiers for future refinement. This research underscores the readiness of multilingual TTS for procedural legal tasks while identifying the remaining challenges in replicating the persuasive artistry of human legal discourse. The code is available at https://github.com/naturenurtureelite/Synthesizing-the-Virtual-Advocate/tree/main
[13] Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review
Qian Ruan,Iryna Gurevych
Main category: cs.CL
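TL;DR: This paper introduces the REspGen framework and the REspEval evaluation suite, reformulating author response generation as an author-in-the-loop task, and builds Re³Align, the first large-scale dataset of review-response-revision triplets.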
Details
Motivation: Existing automatic text generation approaches ignore key signals such as authors' domain expertise, author-only information, and revision strategies, and so cannot effectively support response writing in peer review.
Method: The authors propose REspGen, an author-in-the-loop response generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement; develop REspEval, a comprehensive evaluation suite with 20+ metrics; and construct Re³Align, the first dataset of aligned review-response-revision triplets.
Result: Experiments show that author input and evaluation-guided refinement significantly improve response quality, that input design affects response quality, and that there is a trade-off between controllability and quality.
Conclusion: Author expertise and intent should be modeled explicitly and integrated into the generation process; the proposed framework, dataset, and evaluation tools provide systematic support and open resources for this direction.
Abstract: Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. Recent work frames this task as automatic text generation, underusing author expertise and intent. In practice, authors possess domain expertise, author-only information, and revision and response strategies (concrete forms of author expertise and intent) to address reviewer concerns, and they seek NLP assistance that integrates these signals to support effective response writing in peer review. We reformulate author response generation as an author-in-the-loop task and introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement, together with REspEval, a comprehensive evaluation suite with 20+ metrics covering input utilization, controllability, response quality, and discourse. To support this formulation, we construct Re³Align, the first large-scale dataset of aligned review-response-revision triplets, where revisions provide signals of author expertise and intent. Experiments with state-of-the-art LLMs show the benefits of author input and evaluation-guided refinement, the impact of input design on response quality, and trade-offs between controllability and quality. We make our dataset, generation and evaluation tools publicly available.
[14] The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models
Aradhya Dixit,Shreem Dixit
Main category: cs.CL
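TL;DR: This paper exposes the systematic cost that tokenizers in pretrained multilingual models impose on certain writing systems, the "script tax". Comparing orthographic variants with identical linguistic content, it finds that the higher-fragmentation orthography incurs substantially higher fertility, inference latency, and information cost, indicating that tokenization is a major source of inequity in multilingual NLP.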
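The quantities behind this comparison are simple ratios, sketched below with standard definitions: fertility as tokens per word, bits per character (BPC) as total negative log-likelihood converted from nats to bits and divided by character count, and relative increase between two measurements. The helper names are hypothetical, and the assumption that the NLL is measured in nats is mine; the numeric inputs in the usage lines are the mBERT BPC and throughput figures quoted in this entry.

```python
import math


def fertility(num_tokens, num_words):
    """Average number of subword tokens produced per word."""
    return num_tokens / num_words


def bits_per_character(total_nll_nats, num_chars):
    """Convert a summed negative log-likelihood (in nats) to bits per character.

    Normalizing by characters rather than tokens keeps the measure comparable
    across tokenizations that split the same text into different token counts.
    """
    return total_nll_nats / (math.log(2) * num_chars)


def relative_increase(before, after):
    """Fractional increase from `before` to `after`."""
    return (after - before) / before


# mBERT BPC figures from the abstract: 8.06 -> 9.65.
print(round(relative_increase(8.06, 9.65) * 100, 1))  # 19.7 (percent)

# Throughput figures from the abstract: 3.8 vs. 0.23 sentences/second.
print(round(3.8 / 0.23, 1))  # 16.5 (slowdown factor)
```

Because BPC is normalized per character, a tokenizer that fragments one orthography into more tokens cannot make that variant look cheaper just by spreading the loss over more tokens.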
Details
Motivation: Pretrained multilingual language models are often assumed to be script-agnostic, but their tokenizers may systematically burden certain writing systems; the paper aims to quantify this "script tax" and show its effects.
Method: The authors compare two orthographic variants (same linguistic content, different written form) on mBERT and XLM-R, measuring fertility, inference speed, and bits per character (BPC), and introduce a round-trip conversion error rate (CER_rt) to check that the gaps stem from orthography-conditioned processing rather than mapping noise.
Result: The higher-fragmentation orthography shows about a 3.4x increase in fertility, a 16.5x inference slowdown, and BPC increases of 19.7% (mBERT) and 47.1% (XLM-R); a round-trip conversion error rate of CER_rt=0.31 supports the orthography-conditioned processing hypothesis.
Conclusion: Tokenizer design is a key source of inequity in multilingual NLP, and script-aware tokenization and pretraining methods are needed.
Abstract: Pretrained multilingual language models are often assumed to be script-agnostic, yet their tokenizers can impose systematic costs on certain writing systems. We quantify this script tax by comparing two orthographic variants with identical linguistic content. Across mBERT and XLM-R, the higher-fragmentation orthography shows a ~3.4x increase in fertility (6.73-6.85 vs. 2.10-2.35 tokens/word), leading to a 16.5x inference slowdown (0.23 vs. 3.8 sentences/second) on identical hardware. Using bits per character (BPC) to avoid the "NLL paradox" from subword fragmentation, we find a substantial increase in information cost: +19.7% for mBERT (8.06->9.65) and +47.1% for XLM-R (12.19->17.94). A round-trip conversion check (CER_rt=0.31) suggests these gaps reflect orthography-conditioned processing rather than mapping noise. Our results highlight tokenization as a key source of inequity in multilingual NLP and motivate script-aware tokenization and pretraining.
[15] Barriers to Discrete Reasoning with Transformers: A Survey Across Depth, Exactness, and Bandwidth
Michelle Yuan,Weiyi Sun,Amir H. Rezaeian,Jyotika Singh,Sandip Ghoshal,Yao-Ting Wang,Miguel Ballesteros,Yassine Benajiba
Main category: cs.CL
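TL;DR: This survey reviews the theoretical limitations of Transformers on discrete reasoning tasks (such as arithmetic, logical inference, and algorithmic composition), systematically analyzing their structural and computational barriers from the perspectives of circuit complexity, approximation theory, and communication complexity, and discusses directions for improved model design.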
Details
Motivation: Transformers excel at sequence modeling but face fundamental theoretical limits on discrete reasoning tasks that require exact symbolic computation, and these bottlenecks need to be clarified at the theoretical level.
Method: The survey brings together three theoretical frameworks (circuit complexity, approximation theory, and communication complexity), reviewing key definitions, seminal results, and illustrative examples to give a unified account of why Transformers struggle with symbolic computation.
Result: It identifies depth constraints, difficulty approximating discontinuous functions, and inter-token communication bottlenecks as the core problems that prevent Transformers from reliably executing exact discrete algorithms.
Conclusion: Current Transformers are good at pattern matching and interpolation, but their architecture makes strict symbolic reasoning difficult; future model designs will need to overcome these theoretical limits.
Abstract: Transformers have become the foundational architecture for a broad spectrum of sequence modeling applications, underpinning state-of-the-art systems in natural language processing, vision, and beyond. However, their theoretical limitations in discrete reasoning tasks, such as arithmetic, logical inference, and algorithmic composition, remain a critical open problem. In this survey, we synthesize recent studies from three theoretical perspectives: circuit complexity, approximation theory, and communication complexity, to clarify the structural and computational barriers that transformers face when performing symbolic computations. By connecting these established theoretical frameworks, we provide an accessible and unified account of why current transformer architectures struggle to implement exact discrete algorithms, even as they excel at pattern matching and interpolation. We review key definitions, seminal results, and illustrative examples, highlighting challenges such as depth constraints, difficulty approximating discontinuities, and bottlenecks in inter-token communication. Finally, we discuss implications for model design and suggest promising directions for overcoming these foundational limitations.
[16] Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments
Maral Doctorarastoo,Katherine A. Flanigan,Mario Bergés,Christopher McComb
Main category: cs.CL
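TL;DR: This paper examines how well large language models (LLMs) can predict human activities and estimate their durations in low-data settings from compact contextual cues (temporal, spatial, behavioral history, persona). Experiments show that LLMs have strong inherent temporal reasoning: they produce plausible predictions zero-shot, a few examples noticeably improve accuracy, and performance saturates beyond a few shots.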
Details
Motivation: Existing data-driven agent-based models perform poorly in low-data settings, while human activity prediction is essential for applications such as smart homes and human-robot collaboration, so an approach that generalizes without large amounts of labeled data is needed.
Method: The authors propose a retrieval-augmented prompting strategy that combines four kinds of context (temporal, spatial, behavioral history, persona) and evaluate it on the CASAS Aruba dataset on two tasks, next-activity prediction with duration estimation and multi-step daily sequence generation, analyzing how performance changes with the number of few-shot examples.
Result: LLMs generate semantically coherent daily activity sequences even zero-shot; adding one or two examples noticeably improves duration calibration and categorical accuracy; further examples yield diminishing returns; sequence-level evaluation shows stable temporal consistency.
Conclusion: Pre-trained language models can serve as effective temporal reasoners, capturing both regular routines and context-sensitive behavioral variation, and could strengthen the behavioral modules of agent-based models.
Abstract: Anticipating human activities and their durations is essential in applications such as smart-home automation, simulation-based architectural and urban design, activity-based transportation system simulation, and human-robot collaboration, where adaptive systems must respond to human activities. Existing data-driven agent-based models, from rule-based to deep learning, struggle in low-data environments, limiting their practicality. This paper investigates whether large language models, pre-trained on broad human knowledge, can fill this gap by reasoning about everyday activities from compact contextual cues. We adopt a retrieval-augmented prompting strategy that integrates four sources of context (temporal, spatial, behavioral history, and persona) and evaluate it on the CASAS Aruba smart-home dataset. The evaluation spans two complementary tasks: next-activity prediction with duration estimation, and multi-step daily sequence generation, each tested with various numbers of few-shot examples provided in the prompt. Analyzing few-shot effects reveals how much contextual supervision is sufficient to balance data efficiency and predictive accuracy, particularly in low-data environments. Results show that large language models exhibit strong inherent temporal understanding of human behavior: even in zero-shot settings, they produce coherent daily activity predictions, while adding one or two demonstrations further refines duration calibration and categorical accuracy. Beyond a few examples, performance saturates, indicating diminishing returns.
Sequence-level evaluation confirms consistent temporal alignment across few-shot conditions. These findings suggest that pre-trained language models can serve as promising temporal reasoners, capturing both recurring routines and context-dependent behavioral variations, thereby strengthening the behavioral modules of agent-based models.
[17] What Do LLMs Know About Alzheimer's Disease? Fine-Tuning, Probing, and Data Synthesis for AD Detection
Lei Jiang,Yue Zhou,Natalie Parde
Main category: cs.CL
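TL;DR: This paper explores using large language models (LLMs) for early detection of Alzheimer's disease (AD). Through supervised fine-tuning and representation probing, it finds that specific words and special markers play a key role in the model's internal representations. Based on this, it designs task-aware special markers and builds a sequence-to-sequence data-synthesis model that generates high-quality synthetic data to improve downstream AD detection.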
Details
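Motivation: Early detection of Alzheimer's disease is hampered by scarce labeled data. Large language models transfer well across domains, but their supervised fine-tuning for AD detection and the internal mechanisms involved remain underexplored. Method: Fine-tune an LLM on the AD detection task; use probing to analyze intermediate activations across transformer layers and identify key semantic units (such as specific words and special markers); based on these findings, design task-aware special markers and train a sequence-to-sequence model to synthesize structurally consistent, diagnostically informative data. Result: Probing shows that the representations of specific words and special markers change substantially after fine-tuning, supporting their key role in detection performance; samples from the proposed data-synthesis tool are effective both in intrinsic evaluation and in downstream training for AD detection. Conclusion: Task-oriented fine-tuning lets LLMs encode AD-related diagnostic knowledge in interpretable internal representations; task-aware special markers and the synthesis strategy offer a new approach to low-resource medical text analysis. Abstract: Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across domains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we fine-tune an LLM for AD detection and investigate how task-relevant information is encoded within its internal representations. We employ probing techniques to analyze intermediate activations across transformer layers, and we observe that, after fine-tuning, the probing values of specific words and special markers change substantially, indicating that these elements assume a crucial role in the model's improved detection performance. Guided by this insight, we design a curated set of task-aware special markers and train a sequence-to-sequence model as a data-synthesis tool that leverages these markers to generate structurally consistent and diagnostically informative synthetic samples. We evaluate the synthesized data both intrinsically and by incorporating it into downstream training pipelines.[18] From Instruction to Output: The Role of Prompting in Modern NLG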
Munazza Zaib,Elaf Alhazmi
Main category: cs.CL
TL;DR: This survey provides a comprehensive overview of prompt engineering techniques for Natural Language Generation (NLG), introducing a taxonomy, decision framework, and integrated design-optimization-evaluation framework to enhance controllability and generalizability.
Details
Motivation: Lack of a structured framework or coherent understanding of diverse prompt engineering methods, especially in NLG. Method: Survey and systematic analysis of recent prompting methods; introduction of a taxonomy, decision framework, and an integrated framework linking prompt design, optimization, and evaluation. Result: A structured taxonomy of prompting paradigms, a practitioner-oriented decision framework for prompt selection, identification of emerging trends and challenges, and a unified framework supporting more controllable and generalizable NLG. Conclusion: Prompt engineering is a vital input-level control mechanism for NLG, complementary to fine-tuning and decoding; a systematic, integrated approach enhances its effectiveness and applicability. Abstract: Prompt engineering has emerged as an integral technique for extending the strengths and abilities of Large Language Models (LLMs) to achieve significant performance gains in various Natural Language Processing (NLP) tasks. This approach, which requires instructions to be composed in natural language to bring out the knowledge from LLMs in a structured way, has driven breakthroughs in various NLP tasks. Yet there is still no structured framework or coherent understanding of the varied prompt engineering methods and techniques, particularly in the field of Natural Language Generation (NLG). This survey aims to help fill that gap by outlining recent developments in prompt engineering and their effect on different NLG tasks. It reviews recent advances in prompting methods and their impact on NLG tasks, presenting prompt design as an input-level control mechanism that complements fine-tuning and decoding approaches.
The paper introduces a taxonomy of prompting paradigms, a decision framework for prompt selection based on varying factors for the practitioners, outlines emerging trends and challenges, and proposes a framework that links design, optimization, and evaluation to support more controllable and generalizable NLG.[19] Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions
Usman Naseem
Main category: cs.CL
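TL;DR: This paper surveys progress in mechanistic interpretability for large language model (LLM) alignment, covering circuit discovery, feature visualization, activation steering, and causal intervention, and discusses their use in alignment strategies such as RLHF, constitutional AI, and scalable oversight. It identifies challenges including superposition, neuron polysemanticity, and the difficulty of interpreting emergent behaviors, and proposes future directions: automated interpretability, cross-model generalization of circuits, and scalable interpretability-driven alignment.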
Details
Motivation: Large language models are highly capable, but their internal decision-making is opaque; mechanistic interpretability research is needed to improve understanding and alignment. Method: A systematic survey of recent mechanistic interpretability techniques applied to LLM alignment, including circuit discovery, feature visualization, activation steering, and causal intervention, with an analysis of their practical impact on alignment strategies such as RLHF, constitutional AI, and scalable oversight. Result: Maps out how interpretability techniques support several mainstream alignment methods and identifies key challenges: superposition, polysemanticity, and the interpretation of emergent behaviors. Conclusion: Mechanistic interpretability is a key path toward reliable LLM alignment; future work should develop automated, cross-model, and scalable interpretability-driven alignment techniques. Abstract: Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability (i.e., the systematic study of how neural networks implement algorithms through their learned representations and computational structures) has emerged as a critical research direction for understanding and aligning these models. This paper surveys recent progress in mechanistic interpretability techniques applied to LLM alignment, examining methods ranging from circuit discovery to feature visualization, activation steering, and causal intervention. We analyze how interpretability insights have informed alignment strategies including reinforcement learning from human feedback (RLHF), constitutional AI, and scalable oversight. Key challenges are identified, including the superposition hypothesis, polysemanticity of neurons, and the difficulty of interpreting emergent behaviors in large-scale models. We propose future research directions focusing on automated interpretability, cross-model generalization of circuits, and the development of interpretability-driven alignment techniques that can scale to frontier models.[20] Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs
Himanshu Gupta,Pratik Jayarao,Chaitanya Dwivedi,Neeraj Varshney
Main category: cs.CL
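TL;DR: This paper surveys research on large language models in code-mixing and code-switching (CSW) settings, proposes a unified taxonomy covering data, modeling, and evaluation, and offers practical recommendations. It analyzes modeling approaches, evaluation problems, and benchmark shortcomings, and points out safety risks and open challenges.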
Details
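Motivation: Despite progress in multilingual modeling, LLMs still degrade systematically in grammaticality, factuality, and safety in code-mixing and code-switching settings, which calls for a systematic review and guidance. Method: Builds a unified taxonomy covering data, modeling, and evaluation; reviews modeling approaches including CSW-tailored pre-training, task-specific post-training, prompting strategies, and in-context learning; critically analyzes the limitations and English-centric bias of current evaluation practices and benchmarks; and discusses safety bypasses enabled by CSW. Result: Presents the first unified taxonomy and practical playbook for CSW-capable LLM research; identifies key problems such as unstable evaluation, incomplete benchmark coverage, and strong English-centricity; reveals CSW as a new risk vector for evading safeguards; and lays out several open research challenges. Conclusion: Code-mixing ability is a key dimension of an LLM's real multilingual competence and requires coordinated progress on data construction, model design, evaluation paradigms, and safety governance; future work should move beyond English-centrism and strengthen low-resource language coverage and reproducible evaluation. Abstract: Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.[21] MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization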
Haidong Xin,Xinze Li,Zhenghao Liu,Yukun Yan,Shuo Wang,Cheng Yang,Yu Gu,Ge Yu,Maosong Sun
Main category: cs.CL
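TL;DR: This paper proposes MetaMem, a framework that augments LLM memory systems with a self-evolving meta-memory, improving the model's ability to identify and integrate critical evidence from scattered memory fragments and significantly outperforming baselines.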
Details
Motivation: Existing memory systems can store long-horizon interaction histories, but they often break the logical and temporal relationships within a session, which fragments memory and degrades reasoning. Method: MetaMem introduces a self-evolving meta-memory; during optimization it self-reflects on reasoning processes and updates the meta-memory state, distilling knowledge-utilization experience that transfers across tasks. Result: Experiments show MetaMem significantly outperforms strong baselines, by more than 3.6%; the code and datasets are open-sourced. Conclusion: By explicitly modeling knowledge-utilization experience, MetaMem improves how systematically LLMs use fragmented memory in long-horizon interactions and offers a new approach to memory-augmented LLMs. Abstract: Existing memory systems enable Large Language Models (LLMs) to support long-horizon human-LLM interactions by persisting historical interactions beyond limited context windows. However, while recent approaches have succeeded in constructing effective memories, they often disrupt the inherent logical and temporal relationships within interaction sessions, resulting in fragmented memory units and degraded reasoning performance. In this paper, we propose MetaMem, a novel framework that augments memory systems with a self-evolving meta-memory, aiming to teach LLMs how to effectively utilize memorized knowledge. During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks by self-reflecting on reasoning processes and performing actions to update the current meta-memory state. The accumulated meta-memory units serve as explicit knowledge utilization experiences, guiding the LLM to systematically identify and integrate critical evidence from scattered memory fragments. Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%. All code and datasets are available at https://github.com/OpenBMB/MetaMem.[22] DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task
Shafiuddin Rehan Ahmed,Wei Wei
Main category: cs.CL
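TL;DR: This paper proposes DDL2PropBank, a benchmark task for evaluating the developer experience of multi-agent frameworks in LLM-driven software development. Using a uniform Agent-as-a-Tool pattern, it tests code complexity and AI-assistability across 10 frameworks and finds that Agno performs best overall.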
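This entry reports an 83% pass@1 for Agno. As a reminder of what that kind of number measures, here is a minimal sketch of the commonly used unbiased pass@k estimator (n generated samples, c of them correct). Whether the paper computes pass@1 exactly this way is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k hits a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to the plain fraction of correct generations, c / n.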
Details
Motivation: There is no principled way to systematically evaluate the developer experience of multi-agent frameworks in a controlled setting. Method: Builds the new benchmark task DDL2PropBank (mapping database schemas to PropBank rolesets), deploys the same agent logic across 10 frameworks using the Agent-as-a-Tool pattern, and evaluates along two dimensions: code complexity (static analysis) and AI-assistability (how well LLMs can autonomously generate correct framework-specific code). Result: Finds a threefold complexity spectrum (Pydantic AI and Agno lowest); structural alignment scores reliably predict success for single-pattern frameworks but overestimate it for multi-pattern frameworks; Agno is best overall (lowest complexity, highest structural alignment, 83% pass@1). Conclusion: Framework design should combine low implementation overhead with structural predictability to improve AI-assistability; Agno is currently the best practical choice, and DDL2PropBank can serve as a standardized evaluation tool. Abstract: Multi-agent frameworks promise to simplify LLM-driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability -- the extent to which LLMs can autonomously generate correct, framework-specific code. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI-assistability, structural alignment scores reliably proxy runtime success for frameworks with single canonical patterns, but overestimate correctness for multi-pattern frameworks. Agno emerges as the strongest overall performer, combining lowest complexity with highest structural alignment and 83% pass@1.[23] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification
Jiale Zhao,Ke Fang,Lu Cheng
Main category: cs.CL
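TL;DR: This paper proposes AskBench, an interactive benchmark, and the reinforcement learning method RLVR to improve large language models' ability to ask for clarification when information is missing or a premise is false, balancing accuracy with interaction efficiency.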
Details
Motivation: Large language models often force an answer even when the prompt is incomplete or contains misleading content, which leads to hallucinations or reinforces misconceptions; they need a better sense of when and how to ask for clarification. Method: Builds the interactive benchmark AskBench (with two settings, AskMind and AskOverconfidence) and proposes a reinforcement learning method based on structured rubrics and verifier-based rewards (RLVR). Result: Experiments show clear gains in accuracy, rubric adherence, and interaction efficiency, with generalization across domains. Conclusion: An interactive evaluation benchmark combined with a targeted reinforcement learning strategy can strengthen an LLM's awareness of and ability to seek clarification without harming task performance. Abstract: Large language models (LLMs) often respond even when prompts omit critical details or include misleading information, leading to hallucinations or reinforced misconceptions. We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints. A unified judge loop evaluates final answers and simulates user responses as needed. AskBench covers two settings: AskMind, with intent-deficient queries requiring clarification, and AskOverconfidence, with queries containing false premises that must be identified and corrected. We further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification. Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.[24] Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning
Donald Ye,Max Loffgren,Om Kotadia,Linus Wong
Main category: cs.CL
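TL;DR: This paper proposes the NLDD metric for evaluating the faithfulness of Chain-of-Thought (CoT) explanations, finds that models exhibit a 'Reasoning Horizon' (k*) within the reasoning chain, and shows that accuracy does not reflect the actual reasoning process.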
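The NLDD idea summarized above (corrupt one reasoning step, see how far the model's confidence in its answer falls, and normalize so models can be compared) can be sketched as follows. The exact normalization is not given in this entry, so dividing by the magnitude of the clean logit difference is an assumption, and both function names are hypothetical.

```python
from typing import Dict, List


def logit_difference(logits: Dict[str, float], answer: str) -> float:
    """Logit of the chosen answer minus the strongest competing logit."""
    rivals = [value for token, value in logits.items() if token != answer]
    return logits[answer] - max(rivals)


def nldd_per_step(clean_ld: float, corrupted_lds: List[float]) -> List[float]:
    """Relative drop in logit difference when each step is corrupted in turn.

    Assumed normalization: (clean - corrupted) / |clean|. Values near 0 suggest
    the step barely matters; negative values mean corruption helped the answer.
    """
    if clean_ld == 0:
        raise ValueError("clean logit difference must be non-zero")
    return [(clean_ld - ld) / abs(clean_ld) for ld in corrupted_lds]
```

Under this sketch, a run of near-zero or negative scores at the end of a chain would correspond to steps lying beyond the Reasoning Horizon described in the entry.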
Details
Motivation: It is unclear whether Chain-of-Thought explanations truly reflect a model's decision process; their faithfulness needs to be quantified. Method: Proposes the Normalized Logit Difference Decay (NLDD) metric: corrupt the reasoning steps in a CoT one at a time, measure how much the model's confidence in its answer drops, and standardize the result to support cross-model comparison. Result: Validated on three model families across syntactic, logical, and arithmetic tasks; finds a consistent 'Reasoning Horizon' k* (at 70–85% of chain length); models can hold correct internal representations and still answer incorrectly; accuracy does not reveal actual reasoning behavior. Conclusion: NLDD provides a quantifiable tool for assessing when CoT actually matters, and shows that task accuracy alone is not enough to judge whether a model truly reasons through its chain. Abstract: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer, or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps from the explanation and measures how much the model's confidence in its answer drops, to determine if a step is truly important. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70--85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.[25] The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task
Rui Cao,Zhenyun Deng,Yulong Chen,Michael Schlichtkrull,Andreas Vlachos
Main category: cs.CL
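TL;DR: This paper describes the AVerImaTeC shared task, which aims to advance systems for automatic verification of image-text claims. Evaluation uses a conditional verdict accuracy (the AVerImaTeC score). There were 14 submissions in the development phase and 6 in the testing phase; all test submissions beat the baseline, and the winning team, HUMANE, scored 0.5455.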
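The conditional verdict accuracy mentioned above counts a verdict as correct only when its evidence score exceeds a predefined threshold. A minimal sketch of that rule follows; the `Prediction` record, its field names, and any threshold value passed in are hypothetical, since this entry does not specify them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Prediction:
    predicted_verdict: str
    gold_verdict: str
    evidence_score: float  # quality of the retrieved evidence, assumed in [0, 1]


def conditional_verdict_accuracy(preds: Sequence[Prediction], threshold: float) -> float:
    """Fraction of claims whose verdict matches the gold label AND whose
    evidence score strictly exceeds the threshold."""
    if not preds:
        return 0.0
    hits = sum(
        1
        for p in preds
        if p.evidence_score > threshold and p.predicted_verdict == p.gold_verdict
    )
    return hits / len(preds)
```

A system that guesses the right verdict without adequate evidence gets no credit under this rule, which is why the score can sit well below plain verdict accuracy.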
Details
Motivation: To advance systems for automatic verification of image-text claims and address real-world image-text verification. Method: Organizes the AVerImaTeC shared task, in which participants may use external knowledge sources (such as web search engines) or the curated knowledge store provided by the organizers; evaluation uses the AVerImaTeC score, a conditional verdict accuracy based on an evidence-score threshold. Result: There were 14 submissions in the development phase and 6 in the testing phase; every test-phase system outperformed the baseline; the winning team, HUMANE, achieved an AVerImaTeC score of 0.5455. Conclusion: The shared task advanced image-text verification, provided a reproducible benchmark and evaluation framework, and summarized key lessons and challenges. Abstract: The Automatic Verification of Image-Text Claims (AVerImaTeC) shared task aims to advance system development for retrieving evidence and verifying real-world image-text claims. Participants were allowed to either employ external knowledge sources, such as web search engines, or leverage the curated knowledge store provided by the organizers. System performance was evaluated using the AVerImaTeC score, defined as a conditional verdict accuracy in which a verdict is considered correct only when the associated evidence score exceeds a predefined threshold. The shared task attracted 14 submissions during the development phase and 6 submissions during the testing phase. All participating systems in the testing phase outperformed the baseline provided. The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455. This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.[26] SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation
Beichen Guo,Zhiyuan Wen,Jia Gu,Senzhang Wang,Haochen Shi,Ruosong Yang,Shuaiqi Liu
Main category: cs.CL
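TL;DR: This paper proposes SurveyLens, the first multi-discipline evaluation benchmark for automatic survey generation (ASG). It contains 1,000 high-quality human-written surveys across 10 disciplines and a dual-lens evaluation framework (discipline-aware rubric scoring plus canonical alignment evaluation), and systematically compares 11 state-of-the-art ASG methods across disciplines.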
Details
Motivation: Existing ASG evaluation methods rely on generic metrics and are heavily biased toward computer science, so they cannot reflect the writing conventions and quality standards specific to different disciplines, leaving researchers outside CS without a reliable basis for choosing tools. Method: Builds the cross-disciplinary survey dataset SurveyLens-1k (10 disciplines × 100 surveys) and proposes a dual-lens evaluation framework: (1) discipline-aware rubric evaluation, using LLM scoring calibrated to human preferences; (2) canonical alignment evaluation, measuring content coverage and synthesis quality against human-written surveys. Result: Experiments on 11 ASG methods (vanilla LLMs, dedicated ASG systems, and Deep Research agents) reveal marked performance differences across disciplines; for example, some methods do well on medical surveys but fall clearly short in the humanities and social sciences. Conclusion: SurveyLens provides the first generalizable, discipline-adapted approach to ASG evaluation, moving ASG from 'generally usable' toward 'trustworthy within a discipline' and giving researchers across disciplines guidance for choosing methods. Abstract: The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), failing to assess whether ASG methods adhere to the distinct standards of various academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on using ASG systems to yield high-quality surveys compliant with specific discipline standards. To bridge this gap, we introduce SurveyLens, the first discipline-aware benchmark evaluating ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines. Subsequently, we propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which utilizes LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation to rigorously measure content coverage and synthesis quality against human-written survey papers. We conduct extensive experiments by evaluating 11 state-of-the-art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.[27] Are Aligned Large Language Models Still Misaligned?
Usman Naseem,Gautam Siddharth Kashyap,Rafiq Ali,Ebad Shabbir,Sushant Kumar Ray,Abdullah Mohammad,Agrima Seth
Main category: cs.CL
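TL;DR: This paper proposes Mis-Align Bench, the first unified benchmark for jointly evaluating LLM misalignment along safety, value, and cultural dimensions. It builds SAVACU, a dataset of about 380,000 samples spanning 112 domains, and pairs misaligned and aligned responses via two-stage rejection sampling. Experiments show that models optimized for a single dimension perform markedly worse under joint evaluation.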
Details
Motivation: Existing misalignment benchmarks (such as INSECURE CODE, VALUEACTIONLENS, and CULTURALHERITAGE) each focus on a single dimension and cannot reflect the real-world requirement that safety, value, and cultural constraints be satisfied simultaneously, which makes evaluation incomplete. Method: Builds the unified benchmark Mis-Align Bench: 1) starting from LLM-PROMPT-DATASET, uses Mistral-7B-Instruct-v0.3 to classify prompts under a new taxonomy into 14 safety, 56 value, and 42 cultural domains, forming the SAVACU dataset (382,424 samples); 2) expands low-resource domains with Llama-3.1-8B-Instruct combined with SimHash; 3) generates high-quality misaligned/aligned response pairs via two-stage rejection sampling; 4) runs joint three-dimension evaluation on general-purpose, fine-tuned, and open-weight LLMs. Result: Models optimized for a single dimension reach Coverage of up to 97.6% on their own dimension, but under joint three-dimension evaluation their False Failure Rate exceeds 50% and their Alignment Score falls to 63%–66%, highlighting the limits of single-dimension evaluation. Conclusion: Real-world LLM misalignment requires coordinated treatment across multiple dimensions; Mis-Align Bench provides key infrastructure and empirical evidence for comprehensive, systematic evaluation and for advancing multi-dimensional alignment. Abstract: Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy safety, value, and cultural dimensions, which must co-occur in real-world settings to solve a real-world query. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), rely on evaluating misalignment along individual dimensions, preventing simultaneous evaluation. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels), by reclassifying prompts from the LLM-PROMPT-DATASET via taxonomy into 14 safety domains, 56 value domains, and 42 cultural domains using Mistral-7B-Instruct-v0.3, and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based fingerprinting to avoid duplication. Furthermore, we pair prompts with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment under three dimensions.
Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur False Failure Rate >50% and lower Alignment Score (63%-66%) under joint conditions.[28] Evaluating Alignment of Behavioral Dispositions in LLMs
Amir Taubenfeld,Zorik Gekhman,Lior Nezry,Omri Feldman,Natalie Harris,Shashir Reddy,Romina Stella,Ariel Goldstein,Marian Croak,Yossi Matias,Amir Feder
Main category: cs.CL
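TL;DR: This paper proposes a new framework based on Situational Judgment Tests (SJTs) for evaluating whether the behavioral dispositions that large language models (LLMs) show in social situations match those of humans. It finds that LLMs are often overconfident, deviate from human consensus, and act inconsistently with their stated values.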
Details
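Motivation: As large language models (LLMs) become part of daily life, understanding their behavioral dispositions, especially in social contexts, becomes essential; existing self-report psychological scales do not apply directly to LLMs, so an adapted evaluation method is needed. Method: Converts classic psychological questionnaires into Situational Judgment Tests (SJTs) aimed at LLMs, building 2,500 SJTs each validated by three annotators; for each SJT, 10 of 550 participants provide their preferred action; runs a large-scale behavioral consistency evaluation on 25 LLMs and analyzes how they deviate from the distribution of human preferences. Result: (1) In scenarios with low human consensus, LLMs are overconfident and favor a single response; (2) when human consensus is high, smaller models deviate significantly, and frontier models still fail to reflect the consensus in 15–20% of cases; (3) some biases are shared across models, such as encouraging emotional expression where humans prefer composure; (4) there is a considerable gap between LLMs' stated values and their actual behavior. Conclusion: The behavioral dispositions of current LLMs deviate systematically from those of humans, and SJTs are a more effective and more ecologically valid evaluation paradigm; the framework can be used for alignment evaluation and also offers psychology a new way to test the predictive validity of self-reports. Abstract: As LLMs integrate into our daily lives, understanding their behavior becomes essential. In this work, we focus on behavioral dispositions (the underlying tendencies that shape responses in social contexts) and introduce a framework to study how closely the dispositions expressed by LLMs align with those of humans. Our approach is grounded in established psychological questionnaires but adapts them for LLMs by transforming human self-report statements into Situational Judgment Tests (SJTs). These SJTs assess behavior by eliciting natural recommendations in realistic user-assistant scenarios. We generate 2,500 SJTs, each validated by three human annotators, and collect preferred actions from 10 annotators per SJT, from a large pool of 550 participants. In a comprehensive study involving 25 LLMs, we find that models often do not reflect the distribution of human preferences: (1) in scenarios with low human consensus, LLMs consistently exhibit overconfidence in a single response; (2) when human consensus is high, smaller models deviate significantly, and even some frontier models do not reflect the consensus in 15-20% of cases; (3) traits can exhibit cross-LLM patterns, e.g., LLMs may encourage emotion expression in contexts where human consensus favors composure.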
Lastly, mapping psychometric statements directly to behavioral scenarios presents a unique opportunity to evaluate the predictive validity of self-reports, revealing considerable gaps between LLMs' stated values and their revealed behavior.[29] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
Zachary Pedram Dadfar
Main category: cs.CL
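TL;DR: This paper proposes the Pull Methodology, which uses format engineering to guide large language models into self-examination. It finds a specific correspondence between self-referential vocabulary and internal activation dynamics, indicating that under appropriate conditions a model's self-reports can reliably reflect its internal computational state.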
Details
Motivation: To determine whether the introspective language that large language models produce during self-examination reflects real internal computation or sophisticated confabulation. Method: Proposes the Pull Methodology, which uses format engineering to elicit extended self-examination, and identifies a direction in Llama 3.1's activation space that separates self-referential from descriptive processing; combines activation analysis, vocabulary-activation correlation tests, and cross-model validation (Qwen 2.5-32B). Result: Self-referential vocabulary (such as 'loop' and 'shimmer') correlates significantly with specific activation dynamics (such as autocorrelation and variability), and the correspondence is specific (it does not hold in non-self-referential contexts); the direction is localized at 6.25% of model depth, is orthogonal to the refusal direction, and causally influences output; Qwen 2.5-32B independently develops a different but equally valid mapping between self-referential vocabulary and activations. Conclusion: With appropriate prompting and structure, an LLM's self-reports can serve as a reliable indicator of its internal computational state, which challenges the 'pure confabulation' view and opens a new route to understanding model internals. Abstract: Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.[30] Finding the Cracks: Improving LLMs Reasoning with Paraphrastic Probing and Consistency Verification
Weili Shi,Dongliang Guo,Lehan Yang,Tianlong Wang,Hanzhang Yuan,Sheng Li
Main category: cs.CL
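TL;DR: This paper proposes the PPCV framework, which improves large language models on complex reasoning tasks by identifying and replacing critical tokens in the reasoning process and then applying consistency verification.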
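The two PPCV ingredients summarized above (flagging positions where the model's top-1 token disagrees with the expected token, and choosing a final answer by agreement across parallel reasoning paths) can be illustrated with a small sketch. Plain majority voting is an assumption standing in for the paper's consistency check, the confirmation criterion for the final critical token is not modeled, and both function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def mismatch_positions(expected_tokens: Sequence[str], top1_tokens: Sequence[str]) -> List[int]:
    """Indices where the top-1 prediction (e.g. under a paraphrased question)
    differs from the token actually present in the original reasoning path.
    These are only candidate critical tokens."""
    return [i for i, (e, t) in enumerate(zip(expected_tokens, top1_tokens)) if e != t]


def consistent_answer(answers: Sequence[str]) -> Tuple[str, float]:
    """Most frequent final answer across parallel rollouts, plus its vote share."""
    if not answers:
        raise ValueError("need at least one answer")
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)
```

In this sketch a low vote share would signal that the rollouts disagree, which is the situation in which replacing a candidate critical token is most likely to change the outcome.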
Details
Motivation: On complex reasoning tasks, large language models lose performance to hallucinations and errors that accumulate over intermediate steps, and identifying and exploiting critical tokens remains difficult. Method: PPCV has two stages. In the first, it compares the reasoning path for the original question against paraphrased versions of the question and identifies critical tokens from mismatches between the predicted token and the expected token. In the second, it replaces critical tokens with candidate alternatives, generates multiple parallel reasoning paths, and determines the final answer by output consistency. Result: Across several mainstream LLMs and benchmarks, PPCV significantly outperforms baseline methods and improves reasoning performance. Conclusion: PPCV is an effective new method for improving complex reasoning in LLMs, and its critical-token probing and consistency verification are practical and extensible. Abstract: Large language models have demonstrated impressive performance across a variety of reasoning tasks. However, their problem-solving ability often declines on more complex tasks due to hallucinations and the accumulation of errors within these intermediate steps. Recent work has introduced the notion of critical tokens--tokens in the reasoning process that exert significant influence on subsequent steps. Prior studies suggest that replacing critical tokens can refine reasoning trajectories. Nonetheless, reliably identifying and exploiting critical tokens remains challenging. To address this, we propose the Paraphrastic Probing and Consistency Verification (PPCV) framework. PPCV operates in two stages. In the first stage, we roll out an initial reasoning path from the original question and then concatenate paraphrased versions of the question with this reasoning path. We then identify critical tokens based on mismatches between the predicted top-1 token and the expected token in the reasoning path. A criterion is employed to confirm the final critical token. In the second stage, we substitute critical tokens with candidate alternatives and roll out new reasoning paths for both the original and paraphrased questions. The final answer is determined by checking the consistency of outputs across these parallel reasoning processes. We evaluate PPCV on mainstream LLMs across multiple benchmarks. Extensive experiments demonstrate PPCV substantially enhances the reasoning performance of LLMs compared to baselines.[31] The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods
Arpit Singh Gautam,Kailash Talreja,Saurabh Jha
Main category: cs.CL
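TL;DR: This paper proposes DiffuTruth, a framework that borrows from non-equilibrium thermodynamics to treat facts as stable attractors on a generative manifold. It detects LLM hallucinations with a Generative Stress Test and a Semantic Energy metric, and adds Hybrid Calibration to improve uncertainty estimation.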
Details
Motivation: Large language models often produce plausible but incorrect assertions (hallucinations), and existing uncertainty metrics struggle to catch errors the model makes with high confidence. Method: Proposes DiffuTruth: models facts as stable attractors on a generative manifold, following non-equilibrium thermodynamics; designs a Generative Stress Test (add noise to a claim and reconstruct it with a discrete text diffusion model); defines Semantic Energy (using an NLI critic to measure the semantic divergence between the original claim and its reconstruction); and fuses the semantic stability signal with discriminative confidence into a Hybrid Calibration. Result: Achieves an unsupervised AUROC of 0.725 on FEVER, 1.5% above baselines; zero-shot generalization on the multi-hop HOVER dataset beats baselines by more than 4%. Conclusion: Semantic Energy captures deep factual contradictions, the thermodynamic view of truthfulness is robust to distribution shift, and DiffuTruth offers a new approach to unsupervised fact verification. Abstract: Large Language Models (LLMs) frequently hallucinate plausible but incorrect assertions, a vulnerability often missed by uncertainty metrics when models are confidently wrong. We propose DiffuTruth, an unsupervised framework that reconceptualizes fact verification via non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the Generative Stress Test, in which claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric measuring the semantic divergence between the original claim and its reconstruction using an NLI critic. Unlike vector space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration fusing this stability signal with discriminative confidence. Extensive experiments on FEVER demonstrate DiffuTruth achieves a state-of-the-art unsupervised AUROC of 0.725, outperforming baselines by 1.5 percent through the correction of overconfident predictions. Furthermore, we show superior zero-shot generalization on the multi-hop HOVER dataset, outperforming baselines by over 4 percent, confirming the robustness of thermodynamic truth properties to distribution shifts.[32] Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection
Md Tanvir Rouf Shawon,Mohammad Sabik Irbaz,Hadeel R. A. Elyazori,Keerti Reddy Resapu,Yili Lin,Vladimir Franzuela Cardenas,Farrokh Alemi,Kevin Lybarger
Main category: cs.CL
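TL;DR: This paper proposes a patient simulator grounded in the NIST AI Risk Management Framework for automated, scalable evaluation of healthcare conversational agents. It generates realistic interactions from three kinds of controllable patient profiles (medical, linguistic, and behavioral) and reveals how the performance of an AI decision aid degrades across levels of health literacy.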
Details
Motivation: Existing evaluation methods for healthcare conversational agents lack scalability, controllability, and systematic coverage, so they struggle to identify hallucinations, errors, and risk patterns across populations; an automated evaluation tool that simulates diverse, realistic patient interactions is needed. Method: Builds a three-layer patient simulator: (1) medical profiles based on electronic health records from All of Us; (2) linguistic profiles modeling health literacy and condition-specific language; (3) behavioral profiles capturing empirically observed interaction patterns such as cooperation, distraction, and adversarial engagement; evaluation combines human annotation with an LLM judge. Result: Across 500 simulated conversations, human annotators and the LLM judge both show high agreement (F1≈0.94, κ≈0.75); the accuracy of the AI decision aid for antidepressant recommendation rises monotonically with health literacy (47.9%→69.1%→81.6%). Conclusion: The patient simulator is an effective, reliable, and interpretable evaluation infrastructure that supports fine-grained, multi-dimensional, population-sensitive risk analysis of medical AI systems and offers a new approach to quality assurance before clinical deployment. Abstract: Objective: This paper introduces a patient simulator designed to enable scalable, automated evaluation of healthcare conversational agents. The simulator generates realistic, controllable patient interactions that systematically vary across medical, linguistic, and behavioral dimensions, allowing annotators and an independent AI judge to assess agent performance, identify hallucinations and inaccuracies, and characterize risk patterns across diverse patient populations. Methods: The simulator is grounded in the NIST AI Risk Management Framework and integrates three profile components reflecting different dimensions of patient variation: (1) medical profiles constructed from electronic health records in the All of Us Research Program; (2) linguistic profiles modeling variation in health literacy and condition-specific communication patterns; and (3) behavioral profiles representing empirically observed interaction patterns, including cooperation, distraction, and adversarial engagement. We evaluated the simulator's effectiveness in identifying errors in an AI decision aid for antidepressant selection. Results: We generated 500 conversations between the patient simulator and the AI decision aid across systematic combinations of five linguistic and three behavioral profiles. Human annotators assessed 1,787 medical concepts across 100 conversations, achieving high agreement (F1=0.94, κ=0.73), and the LLM judge achieved comparable agreement with human annotators (F1=0.94, κ=0.78; paired bootstrap p=0.21).
The simulator revealed that AI decision aid performance degrades monotonically as health literacy decreases: rank-one concept retrieval accuracy was 81.6% for proficient, 69.1% for functional, and 47.9% for limited health literacy.
[33] Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives
Zecheng Wang,Deyuan Liu,Chunshan Li,Yupeng Zhang,Zhengyun Zhao,Dianhui Chu,Bingning Wang,Dianbo Sui
Main category: cs.CL
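TL;DR: This paper proposes Dynamic Entropy Fine-Tuning (DEFT), a supervised fine-tuning objective with no additional parameters. It uses Rényi-2 entropy to dynamically modulate how much the model trusts its own predictions, addressing the stability-plasticity conflict caused by uniform token weighting in standard negative log-likelihood.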
Details
Motivation: Uniform token-level weighting in standard SFT causes two problems: overemphasizing low-probability targets amplifies gradients from noisy supervision and weakens robust priors, while high-confidence predictions receive too little sharpening. Existing methods struggle to balance stability and plasticity.
Method: Unify SFT objectives within a generalized deformed-log family, exposing a universal "gate × error" gradient structure; use the Cayley transform to map model uncertainty onto a continuous focus trajectory; propose the DEFT objective, which adaptively modulates the trust gate using Rényi-2 entropy (a measure of distribution concentration).
Result: DEFT achieves a better balance between exploration and exploitation across multiple tasks and improves overall performance, as supported by extensive experiments and analyses.
Conclusion: DEFT offers a principled, parameter-free SFT objective that models and exploits the model's own predictive uncertainty to ease the plasticity-stability dilemma in supervised fine-tuning.
Abstract: Standard negative log-likelihood (NLL) for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity-stability dilemma, often suppressing necessary learning signals alongside harmful ones. To address this issue, we unify token-level SFT objectives within a generalized deformed-log family and expose a universal gate $\times$ error gradient structure, where the gate controls how much the model trusts its current prediction. By employing the Cayley transform, we map the model's continuously evolving uncertainty onto a continuous focus trajectory, which enables seamless interpolation between scenarios involving uncertain novel concepts and those involving well-established knowledge. We then introduce Dynamic Entropy Fine-Tuning (DEFT), a parameter-free objective that modulates the trust gate using distribution concentration (Rényi-2 entropy) as a practical proxy for the model's predictive state. Extensive experiments and analyses demonstrate that DEFT achieves a better balance between exploration and exploitation, leading to improved overall performance.
[34] Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety
Muskaan Chopra,Lorenz Sparrenberg,Rafet Sifa
Main category: cs.CL
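TL;DR: This paper studies how well instruction-tuned large language models (LLMs) detect critical meaning errors in machine translation (MT), such as factual distortions, intent reversals, and biased translations. Scaling model size and applying adaptation strategies (zero-shot, few-shot, fine-tuning) consistently improve detection and outperform encoder-only baselines such as XLM-R.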
Details
Motivation: Critical meaning errors in machine translation undermine the reliability, fairness, and safety of multilingual systems, and reliable detection is especially needed in high-stakes or under-resourced contexts.
Method: Using publicly available data sets, systematically evaluate instruction-tuned LLMs of different parameter scales on critical error detection, compare zero-shot, few-shot, and fine-tuning adaptation strategies, and benchmark against encoder-only baselines such as XLM-R and ModernBERT.
Result: Model scaling and adaptation strategies consistently improve critical error detection and outperform the encoder-only baselines.
Conclusion: Better critical error detection in MT is a necessary safeguard for building safe, trustworthy, and socially accountable multilingual information systems, and should be treated as a key protection in the pursuit of just and responsible multilingual AI.
Abstract: Machine Translation (MT) plays a pivotal role in cross-lingual information access, public policy communication, and equitable knowledge dissemination. However, critical meaning errors, such as factual distortions, intent reversals, or biased translations, can undermine the reliability, fairness, and safety of multilingual systems. In this work, we explore the capacity of instruction-tuned Large Language Models (LLMs) to detect such critical errors, evaluating models across a range of parameters using publicly accessible data sets. Our findings show that model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent improvements, outperforming encoder-only baselines like XLM-R and ModernBERT. We argue that improving critical error detection in MT contributes to safer, more trustworthy, and socially accountable information systems by reducing the risk of disinformation, miscommunication, and linguistic harm, especially in high-stakes or underrepresented contexts. This work positions error detection not merely as a technical challenge, but as a necessary safeguard in the pursuit of just and responsible multilingual AI. The code will be made available on GitHub.
[35] LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation
Ahmadreza Jeddi,Marco Ciccone,Babak Taati
Main category: cs.CL
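TL;DR: This paper proposes LoopFormer, a looped Transformer that adjusts its number of loop iterations to the available compute budget. A shortcut-consistency training scheme keeps trajectories of different lengths consistent, enabling budget-controlled adaptive inference on language modeling and reasoning tasks.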
Details
Motivation: Existing looped Transformers fix the number of loop iterations and cannot adapt to different compute budgets, which limits their flexibility and efficiency in deployment.
Method: Propose LoopFormer, which conditions each loop on time and step size, and design a shortcut-consistency training scheme that keeps the representations of loop trajectories of different lengths consistent.
Result: LoopFormer is robust to compute constraints on language modeling and reasoning benchmarks, and its performance improves smoothly as the budget grows.
Conclusion: Looped Transformers are inherently suited to adaptive language modeling, and LoopFormer opens a path toward controllable, budget-aware large language models.
Abstract: Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models.
[36] ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias
Guangxin Zhao,Jiahao Zheng,Malaz Boustani,Jarek Nabrzyski,Meng Jiang,Yiyu Shi,Zhi Zheng
Main category: cs.CL
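TL;DR: This paper introduces ADRD-Bench, the first benchmark dedicated to Alzheimer's Disease and Related Dementias (ADRD). It has two parts, clinical knowledge questions (ADRD Unified QA) and caregiving practice questions (ADRD Caregiving QA), and was used to evaluate 33 mainstream LLMs. Some models reach high accuracy, but their reasoning consistency and stability fall short, which calls for domain-specific improvement.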
Details
Motivation: Existing medical LLM benchmarks barely cover ADRD and do not combine clinical knowledge with practical caregiving scenarios, so they cannot support trustworthy AI applications in this critical domain.
Method: Build ADRD-Bench, comprising (1) ADRD Unified QA, 1,352 questions consolidated from seven existing medical benchmarks, and (2) ADRD Caregiving QA, 149 questions developed from the Aging Brain Care (ABC) brain health program. Evaluate 33 state-of-the-art LLMs (open-weight general, open-weight medical, and closed-source general) and conduct case studies.
Result: Accuracy ranged from 0.63 to 0.93 (mean 0.78) for open-weight general models, 0.48 to 0.93 (mean 0.82) for open-weight medical models, and 0.83 to 0.91 (mean 0.89) for closed-source general models. Top models exceed 0.9 accuracy, but case studies show inconsistent and unstable reasoning.
Conclusion: ADRD-Bench fills the gap in ADRD-specific evaluation. Current LLMs still lack reasoning reliability in this domain and need domain-specific improvement grounded in daily caregiving data.
Abstract: Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (>0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.
[37] When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration
Jayadev Billa
Main category: cs.CL
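TL;DR: This paper finds that speech-enabled language models strongly favor text when audio and text conflict (text dominance), even when explicitly instructed to trust the audio. The bias stems from the model's difficulty arbitrating between modalities at the reasoning level rather than from poor audio quality, a conclusion supported by the ALME benchmark and several intervention experiments.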
Details
Motivation: Investigate which source speech-enabled multimodal LLMs prefer when audio and text conflict, and why they lean toward text even when the audio signal is more accurate.
Method: Build ALME, a multilingual audio-text conflict benchmark (57,602 stimuli, 8 languages), and evaluate four state-of-the-art audio-LLMs including Gemini 2.0 Flash. Controlled experiments vary preprocessing (forced transcription), prompting (framing the text as "deliberately corrupted"), parameter-efficient fine-tuning (LoRA), and a modular ablation (training only the audio projection layer).
Result: Text dominance is widespread: for Gemini 2.0 Flash it reaches 16.6% under audio-text conflict versus 1.6% under text-text conflict. Audio quality is not the main cause (audio-only accuracy 97.2% exceeds cascade accuracy 93.9%). Forced transcription increases text dominance, while the "corrupted text" framing reduces it by 80%. LoRA on the language model halves text dominance, whereas training only the audio projection layer increases it (+26.5%).
Conclusion: Text dominance arises from the language model's limited ability to arbitrate between multimodal representations at the reasoning level. It is a reliability dimension distinct from conventional speech recognition performance, and model design should target modality arbitration explicitly.
Abstract: When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as "deliberately corrupted" reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (−23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder.
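As a rough illustration of how a per-condition dominance rate and the "10 times" comparison above might be computed, here is a minimal sketch. It assumes that text dominance is the fraction of conflict trials in which the model follows the source it was told not to trust; the abstract does not give the exact definition, and the trial records and names below (`trials`, `dominance_rates`) are hypothetical.

```python
from collections import defaultdict

def dominance_rates(trials):
    """Per-condition fraction of trials where the untrusted source won.

    `trials` is an iterable of (condition, followed_untrusted_source) pairs.
    """
    totals = defaultdict(int)
    followed = defaultdict(int)
    for condition, followed_untrusted in trials:
        totals[condition] += 1
        if followed_untrusted:
            followed[condition] += 1
    return {c: followed[c] / totals[c] for c in totals}

# Hypothetical outcomes, not data from the paper.
trials = [
    ("audio_text", True), ("audio_text", False),
    ("audio_text", False), ("audio_text", True),
    ("text_text", False), ("text_text", False),
    ("text_text", True), ("text_text", False),
]
rates = dominance_rates(trials)

# Ratio implied by the two percentages reported in the abstract.
reported_ratio = 16.6 / 1.6
print(rates, round(reported_ratio, 1))
```

The ratio of the two reported rates is about 10.4, consistent with the abstract's "10 times more often" phrasing.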
Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
[38] Multimodal Fact-Level Attribution for Verifiable Reasoning
David Wan,Han Wang,Ziyang Wang,Elias Stengel-Eskin,Hyunji Lee,Mohit Bansal
Main category: cs.CL
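TL;DR: This paper introduces MuRGAt, a benchmark for evaluating how precisely multimodal LLMs attribute facts during complex multi-step reasoning over inputs such as video and audio. It also introduces an automatic evaluation framework and finds that current models often hallucinate citations, with a trade-off between reasoning depth and verifiable attribution.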
Details
Motivation: Existing multimodal attribution benchmarks are limited to simple observation-based scenarios or a single modality and cannot assess fact-level attribution reliability in complex multi-step reasoning.
Method: Build the MuRGAt benchmark, which requires models to answer over cross-modal inputs (such as video and audio) with explicit reasoning chains and precise citations that specify modality and temporal segment; design an automatic evaluation framework that correlates strongly with human scores.
Result: Even strong MLLMs frequently produce incorrect citations (citation hallucination), and deeper reasoning or enforced structured grounding often lowers answer accuracy.
Conclusion: A significant gap remains between internal reasoning and verifiable attribution in current MLLMs, and new methods are needed to improve both together.
Abstract: Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
[39] Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm
Jinrui Zhang,Chaodong Xiao,Aoqi Wu,Xindong Zhang,Lei Zhang
Main category: cs.CL
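TL;DR: This paper proposes SPES, a framework that pretrains MoE large language models efficiently in decentralized settings through sparse expert synchronization and an expert-merging warm-up strategy, substantially lowering GPU memory requirements while preserving performance.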
Details
Motivation: Existing decentralized training methods still train the full model on every node and are therefore limited by GPU memory, while centralized pretraining depends on large clusters of high-memory GPUs that are costly and inflexible.
Method: Propose SParse Expert Synchronization (SPES): each node trains only a subset of experts and periodically synchronizes expert parameters rather than the full model; an expert-merging warm-up strategy accelerates convergence.
Result: A 2B-parameter MoE model was pretrained on 16 48GB GPUs connected over the internet, with performance competitive with centrally trained models under similar compute budgets. The approach also scales to a 7B model trained from scratch and a 9B model upcycled from a dense checkpoint, both matching prior centralized baselines.
Conclusion: SPES is a memory-efficient, scalable decentralized framework for MoE pretraining and offers a viable path to training large models in resource-constrained settings.
Abstract: Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.
[40] SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent
Wenlin Zhong,Jinluan Yang,Yiquan Wu,Yi Liu,Jianhang Yao,Kun Kuang
Main category: cs.CL
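TL;DR: This paper proposes SIGHT, a framework that improves LLM reasoning in multi-turn search question answering through Self-Evidence Support (SES) and information-gain-driven diverse branching. It mitigates redundancy and noise, reduces error accumulation, and significantly outperforms existing methods on several QA benchmarks.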
Details
Motivation: In multi-turn search, RL-driven LLMs often fall into "Tunnel Vision" because retrieved results are redundant and have a low signal-to-noise ratio, so early noise leads to irreversible error accumulation.
Method: Propose SIGHT, which (1) uses Self-Evidence Support (SES) to distill high-fidelity evidence; (2) uses an Information Gain score to identify pivotal states and trigger Dynamic Prompting Interventions (de-duplication, reflection, or adaptive branching); and (3) combines SES and correctness rewards under Group Relative Policy Optimization (GRPO) to internalize robust exploration strategies.
Result: On single-hop and multi-hop QA benchmarks, SIGHT significantly outperforms existing methods, particularly on complex reasoning tasks, while using fewer search steps.
Conclusion: By combining evidence quality assessment with information-driven exploration control, SIGHT improves the accuracy and efficiency of search-based reasoning and provides a scalable optimization approach for autonomous search agents that needs no external verifier.
Abstract: Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions (including de-duplication, reflection, or adaptive branching) to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, using fewer search steps.
[41] PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering
Xiangfeng Wang,Hangyu Guo,Yanlin Lai,Mitt Huang,Liang Zhao,Chengyuan Yao,Yinmin Zhang,Qi Han,Xiaoxiao Ren,Chun Yuan,Tong Xu,Zheng Ge,Xiangyu Zhang,Daxin Jiang
Main category: cs.CL
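TL;DR: This paper introduces PRIME, a benchmark that evaluates how well verifiers check the consistency between derivation process and final result in mathematics and engineering. Building on it, the authors design a process-aware RLVR training paradigm that substantially improves model performance.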
Details
Motivation: Outcome-based verification ignores errors in the derivation process and therefore rewards correct answers reached through incorrect derivations. Verification that checks both process and outcome is needed.
Method: Build the PRIME benchmark (2,530 high-difficulty STEM problems), propose a process-aware RLVR training paradigm, and validate it by evaluating verifiers across several test sets and relating verifier performance to RLVR effectiveness.
Result: Current verifiers frequently fail to detect derivation errors. The new paradigm yields absolute gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME for Qwen3-14B-Base. Verifier accuracy on PRIME correlates strongly and linearly with RLVR training effectiveness (R² > 0.92).
Conclusion: PRIME addresses the lack of process verification, serves as a reliable indicator for verifier selection, and supports more robust RLVR training.
Abstract: While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems, PRIME comprises 2,530 high-difficulty samples through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation ($R^2 > 0.92$) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.
[42] Scene-Aware Memory Discrimination: Deciding Which Personal Knowledge Stays
Yijie Zhong,Mengying Guo,Zewei Wang,Zhongyang Li,Dandan Tu,Haofen Wang
Main category: cs.CL
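TL;DR: Inspired by human selective attention, this paper proposes a Scene-Aware Memory Discrimination method (SAMD) that efficiently filters and organizes memorable information in user interaction data. It comprises a Gating Unit Module (GUM) and a Cluster Prompting Module (CPM), and it significantly improves the efficiency and quality of personal-knowledge memory construction in personalized applications.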
Details
Motivation: Existing LLM-based research on memory writing, management, and reading struggles to filter irrelevant information and faces rising computational costs, while the vast data produced by user interactions contains valuable personal knowledge that must be organized efficiently to support personalized applications.
Method: Define a memory discrimination task and design SAMD with two core modules: the Gating Unit Module (GUM) filters out non-memorable interactions and focuses on salient content; the Cluster Prompting Module (CPM) establishes adaptive memory standards and builds effective clustering prompts from user intents and memory contexts.
Result: SAMD is effective and generalizes well in both direct and indirect evaluations: it recalls most memorable data, stays robust in dynamic scenarios, and significantly improves the efficiency and quality of memory construction when integrated into personalized applications.
Conclusion: By emulating human selective attention, SAMD offers an efficient, adaptive solution for memory discrimination over large-scale user interactions with diverse memory standards, and it supports the structured organization and use of personal knowledge.
Abstract: Intelligent devices have become deeply integrated into everyday life, generating vast amounts of user interactions that form valuable personal knowledge. Efficient organization of this knowledge in user memory is essential for enabling personalized applications. However, current research on memory writing, management, and reading using large language models (LLMs) faces challenges in filtering irrelevant information and in dealing with rising computational costs. Inspired by the concept of selective attention in the human brain, we introduce a memory discrimination task. To address large-scale interactions and diverse memory standards in this task, we propose a Scene-Aware Memory Discrimination method (SAMD), which comprises two key components: the Gating Unit Module (GUM) and the Cluster Prompting Module (CPM). GUM enhances processing efficiency by filtering out non-memorable interactions and focusing on the salient content most relevant to application demands. CPM establishes adaptive memory standards, guiding LLMs to discern what information should be remembered or discarded. It also analyzes the relationship between user intents and memory contexts to build effective clustering prompts. Comprehensive direct and indirect evaluations demonstrate the effectiveness and generalization of our approach. We independently assess the performance of memory discrimination, showing that SAMD successfully recalls the majority of memorable data and remains robust in dynamic scenarios. Furthermore, when integrated into personalized applications, SAMD significantly enhances both the efficiency and quality of memory construction, leading to better organization of personal knowledge.
[43] PACE: Prefix-Protected and Difficulty-Aware Compression for Efficient Reasoning
Ruixiang Feng,Yuntao Wen,Silin Zhou,Ke Shi,Yifan Wang,Ran Le,Zhenwei An,Zongchao Chen,Chen Yang,Guangyue Peng,Yiming Jia,Dongsheng Wang,Tao Zhang,Lisi Chen,Yang Song,Shen Gao,Shuo Shang
Main category: cs.CL
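TL;DR: This paper proposes PACE, a dual-level framework that shortens the reasoning traces of Language Reasoning Models (LRMs) while preserving correctness. Sequence-level prefix-protected optimization and a group-level difficulty-aware penalty substantially reduce token usage and improve accuracy.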
Details
Motivation: Language Reasoning Models (LRMs) suffer from "overthinking": they generate overly long reasoning chains that increase latency and memory usage. Uniform length penalties compress crucial early reasoning steps at the sequence level and treat all queries alike at the group level, so they lack adaptivity.
Method: Propose PACE: at the sequence level, prefix-protected optimization (with decaying mixed rollouts) keeps crucial initial reasoning intact; at the group level, a difficulty-aware length penalty scales the constraint with query complexity.
Result: On DeepSeek-R1-Distill-Qwen (1.5B/7B), PACE cuts token usage by up to 55.7% while improving accuracy on math benchmarks by up to 4.1%, and it generalizes to code, science, and general domains.
Conclusion: A dual-level (sequence plus group) adaptive compression mechanism mitigates overly long reasoning in LRMs, lowering compute cost while improving performance.
Abstract: Language Reasoning Models (LRMs) achieve strong performance by scaling test-time computation but often suffer from "overthinking", producing excessively long reasoning traces that increase latency and memory usage. Existing LRMs typically enforce conciseness with uniform length penalties, which over-compress crucial early deduction steps at the sequence level and indiscriminately penalize all queries at the group level. To solve these limitations, we propose PACE, a dual-level framework for prefix-protected and difficulty-aware compression under hierarchical supervision. At the sequence level, prefix-protected optimization employs decaying mixed rollouts to maintain valid reasoning paths while promoting conciseness. At the group level, difficulty-aware penalty dynamically scales length constraints based on query complexity, maintaining exploration for harder questions while curbing redundancy on easier ones. Extensive experiments on DeepSeek-R1-Distill-Qwen (1.5B/7B) demonstrate that PACE achieves a substantial reduction in token usage (up to 55.7%) while simultaneously improving accuracy (up to 4.1%) on math benchmarks, with generalization ability to code, science, and general domains.
[44] Which Feedback Works for Whom? Differential Effects of LLM-Generated Feedback Elements Across Learner Profiles
Momoka Furuhashi,Kouta Nakayama,Noboru Kawai,Takashi Kodama,Saku Sugawara,Kyosuke Takami
Main category: cs.CL
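TL;DR: This study examines how different elements of LLM-generated educational feedback (such as tone and information coverage) affect learning outcomes and learner acceptance, and how these effects relate to the Big Five personality traits.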
Details
Motivation: It is unclear how specific elements of LLM-generated feedback (such as tone and information coverage) affect learning outcomes and acceptance among learners with different personality traits.
Method: Define six feedback elements and use GPT-5 to generate feedback for multiple-choice high school biology questions; run an experiment with 321 first-year high school students, evaluating feedback with two learning outcome measures and six subjective criteria; cluster learners by Big Five personality traits.
Result: Effective feedback elements share common patterns that support learning outcomes, but learners' subjective preferences differ across personality-based clusters.
Conclusion: LLM-generated feedback should be designed by selecting and adapting feedback elements to learners' personality traits, which has practical implications for personalized feedback in education.
Abstract: Large language models (LLMs) show promise for automatically generating feedback in education settings. However, it remains unclear how specific feedback elements, such as tone and information coverage, contribute to learning outcomes and learner acceptance, particularly across learners with different personality traits. In this study, we define six feedback elements and generate feedback for multiple-choice biology questions using GPT-5. We conduct a learning experiment with 321 first-year high school students and evaluate feedback effectiveness using two learning outcome measures and subjective evaluations across six criteria. We further analyze differences in how feedback acceptance varies across learners based on Big Five personality traits. Our results show that effective feedback elements share common patterns supporting learning outcomes, while learners' subjective preferences differ across personality-based clusters. These findings highlight the importance of selecting and adapting feedback elements according to learners' personality traits when we design LLM-generated feedback, and provide practical implications for personalized feedback design in education.
[45] PatientHub: A Unified Framework for Patient Simulation
Sahand Sabour,TszYam NG,Minlie Huang
Main category: cs.CL
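TL;DR: This paper presents PatientHub, a unified framework for standardizing simulated patient dialogue. It addresses the fragmentation of data formats, prompts, and evaluation metrics in existing methods and improves reproducibility and cross-method comparison.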
Details
Motivation: Existing patient simulation methods lack a common standard, so their data formats, prompts, and evaluation metrics are incompatible, which hinders reproducibility and fair comparison.
Method: Propose PatientHub, a modular framework that standardizes the definition, composition, and deployment of simulated patients, integrates several representative simulation methods as case studies, and supports custom evaluation metrics and rapid prototyping of new variants.
Result: Several representative patient simulation methods were implemented in a standardized way, cross-method evaluation and metric extension were demonstrated, and two new simulator variants were prototyped, reducing the infrastructure overhead of method development.
Conclusion: PatientHub provides a reproducible, extensible foundation for patient-centered dialogue research, lowers the barrier to developing new methods, facilitates cross-model and cross-method benchmarking, and is open source.
Abstract: As Large Language Models increasingly power role-playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non-standardized data formats, prompts, and evaluation metrics, hindering reproducibility and fair comparison. In this paper, we introduce PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. To demonstrate PatientHub's utility, we implement several representative patient simulation methods as case studies, showcasing how our framework supports standardized cross-method evaluation and the seamless integration of custom evaluation metrics. We further demonstrate PatientHub's extensibility by prototyping two new simulator variants, highlighting how PatientHub accelerates method development by eliminating infrastructure overhead. By consolidating existing work into a single reproducible pipeline, PatientHub lowers the barrier to developing new simulation methods and facilitates cross-method and cross-model benchmarking. Our framework provides a practical foundation for future datasets, methods, and benchmarks in patient-centered dialogue, and the code is publicly available via https://github.com/Sahandfer/PatientHub.
[46] Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models
Katrin Olsen,Sebastian Padó
Main category: cs.CL
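TL;DR: This paper examines how to distinguish merely anomalous sentences from truly nonsensical ones. Judgments from humans and LLMs on five semantically deviant datasets show that most sentences are only anomalous rather than nonsensical, and that LLMs are good at generating plausible contexts for anomalous sentences.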
Details
Motivation: It is unclear which sentences in existing semantically deviant datasets are truly nonsensical, and how well LLMs can tell anomalous from nonsensical.
Method: Collect sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets, both without context and with a context provided.
Result: Human raters consider most sentences at most anomalous and very few truly nonsensical; LLMs are substantially skilled at generating plausible contexts for anomalous sentences.
Conclusion: Current semantically deviant datasets mainly contain anomalous rather than truly nonsensical sentences, and LLMs can often supply contexts that make anomalous sentences interpretable.
Abstract: Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets: both context-free and when providing a context. We find that raters consider most sentences at most anomalous, and only a few as properly nonsensical. We also show that LLMs are substantially skilled in generating plausible contexts for anomalous cases.
[47] Thinking with Drafting: Optical Decompression via Logical Reconstruction
Jingxuan Wei,Honghao He,Caijun Jia,Siyuan Li,Zheng Sun,Yuhang Xu,Yuanyuan Lin,Linzhuang Sun,Yuchen Wu,Bihui Yu,Xiangxiang Zhang,Cheng Tan
Main category: cs.CL
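TL;DR: This paper proposes Thinking with Drafting (TwD), which recasts visual reasoning as optical decompression and uses a Domain-Specific Language (DSL) as an intermediate representation. This lets the model reconstruct the logical structure of visual inputs and verify itself, addressing the precision paradox of multimodal LLMs on complex reasoning tasks.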
Details
Motivation: Multimodal LLMs perform well at visual perception and generation, yet a precision paradox remains in complex reasoning: optical perception systems fail to capture logical topology, and pixel-based generative models lack mathematical exactness.
Method: Propose the Thinking with Drafting (TwD) framework. Guided by the axiom "Parsing is Reasoning", it uses a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation and forces the model to draft its reasoning as executable code, rendering deterministic visual proofs for self-verification.
Result: Experiments on the newly built visual algebra benchmark VisAlg show that TwD markedly improves visual reasoning and forms a closed loop in which visual generation serves as logical verification.
Conclusion: TwD offers a general, scalable cognitive architecture for visual reasoning that turns visual generation from creative output into a logical verifier, narrowing the gap between perception and reasoning.
Abstract: Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
[48] Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning
Futing Wang,Jianhao Yan,Yun Luo,Ganqu Cui,Zhi Wang,Xiaoye Qu,Yue Zhang,Yu Cheng,Tao Lin
Main category: cs.CL
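TL;DR: This paper proposes Length-Incentivized Exploration (LIE), which combines a length reward with a redundancy penalty to strengthen a model's ability to generate, verify, and refine multiple hypotheses within one context (In-Context Exploration). It mitigates the "Shallow Exploration Trap" and improves performance on both in-domain and out-of-domain tasks.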
Details
Motivation: Current large models scale poorly at test time because they lack the ability to generate, verify, and refine multiple reasoning hypotheses within a single context (In-Context Exploration). State Coverage theory shows that longer reasoning trajectories raise state coverage, but the probability of sampling them decays exponentially under autoregressive generation, producing the "Shallow Exploration Trap".
Method: Propose Length-Incentivized Exploration (LIE), which uses an explicit length-based reward together with a redundancy penalty to maximize state coverage in a two-step manner, encouraging deeper in-context exploration.
Result: Experiments on several models, including Qwen3 and Llama, show that LIE strengthens In-Context Exploration, with an average gain of 4.4% on in-domain tasks and 2.7% on out-of-domain benchmarks.
Conclusion: LIE is a simple, effective mechanism for overcoming the exploration bottleneck in autoregressive generation and offers a new approach to test-time scaling.
Abstract: Achieving effective test-time scaling requires models to engage in In-Context Exploration: the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration. This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that Length-Incentivized Exploration effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
[49] MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling
MiniCPM Team,Wenhao An,Yingfa Chen,Yewei Fang,Jiayi Li,Xin Li,Yaohui Li,Yishan Li,Yuxuan Li,Biyuan Lin,Chuan Liu,Hezi Liu,Siyuan Liu,Hongya Lyu,Yinxu Pan,Shixin Ren,Xingyu Shen,Zhou Su,Haojun Sun,Yangang Sun,Zhen Leng Thai,Xin Tian,Rui Wang,Xiaorong Wang,Yudong Wang,Bo Wu,Xiaoyue Xu,Dong Xu,Shuaikang Xue,Jiawei Yang,Bowen Zhang,Jinqian Zhang,Letian Zhang,Shengnan Zhang,Xinyu Zhang,Xinyuan Zhang,Zhu Zhang,Hengyu Zhao,Jiacheng Zhao,Jie Zhou,Zihan Zhou,Shuo Wang,Chaojun Xiao,Xu Han,Zhiyuan Liu,Maosong Sun
Main category: cs.CL
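TL;DR: This paper presents MiniCPM-SALA, a 9B-parameter hybrid model that combines sparse attention (InfLLM-V2) with linear attention (Lightning Attention), using a layer selection algorithm and a hybrid positional encoding (HyPE) for efficient long-context modeling. A cost-effective continual training framework cuts training cost by about 75% relative to training from scratch; on a single NVIDIA A6000D GPU the model supports contexts of up to 1M tokens and reaches up to 3.5x the inference speed of a full-attention model at 256K tokens.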
Details
Motivation: The Transformer architecture incurs high computational and memory costs in ultra-long-context applications, and existing sparse or linear attention methods typically trade performance for efficiency; an approach that combines the strengths of both is needed.
Method: Proposes the MiniCPM-SALA hybrid architecture: a layer selection algorithm integrates InfLLM-V2 (sparse) and Lightning Attention (linear) in a 1:3 ratio, a hybrid positional encoding (HyPE) is introduced, and a cost-effective continual training framework converts pre-trained Transformer models into hybrid models.
Result: On a single NVIDIA A6000D GPU, inference at a sequence length of 256K tokens is up to 3.5x faster than the full-attention model, and contexts of up to 1M tokens are supported, a scale at which traditional 8B full-attention models fail because of memory constraints. General capabilities are comparable to full-attention models.
Conclusion: MiniCPM-SALA balances efficiency and performance in long-context modeling, offering a practical, low-cost hybrid-attention approach to deploying large models with ultra-long contexts.
Abstract: The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.
[50] A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments
Anne-Marie Lutgen,Alistair Plum,Christoph Purschke
Main category: cs.CL
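TL;DR: This paper proposes an embedding-based method for detecting variation that needs neither prior normalisation nor predefined variant lists. It trains subword embeddings on raw text and groups related forms by combined cosine and n-gram similarity, treating spelling and morphological variation as linguistic structure rather than noise, and is validated on a corpus of Luxembourgish user comments.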
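As a rough illustration of the grouping criterion summarized in this entry, the sketch below scores a pair of word forms by mixing cosine similarity of their embedding vectors with character n-gram overlap. The entry does not say how the two similarities are combined, so the Jaccard overlap over boundary-padded trigrams, the equal `weight`, and all function names are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def char_ngrams(word, n=3):
    """Set of character n-grams of the word, padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-gram sets (an assumed choice of measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


def combined_similarity(word_a, vec_a, word_b, vec_b, weight=0.5):
    """Weighted mix of embedding similarity and surface-form similarity."""
    return weight * cosine(vec_a, vec_b) + (1 - weight) * ngram_similarity(word_a, word_b)
```

Pairs whose combined score exceeds a chosen threshold could then be linked into variant families; the threshold and the clustering procedure would need to be tuned on the actual corpus.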
Details
Motivation: Research on language variation in low-resource, multilingual settings lacks reproducible methods that need little manual annotation; the paper reframes variation from "noise" into linguistic structure that can be modelled.
Method: Trains subword embeddings on raw text and clusters word forms by combining cosine similarity with n-gram similarity, uncovering systematic correspondences and patterns of regional and stylistic differentiation.
Result: On a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation consistent with descriptions in dialectal and sociolinguistic research, and produces transparent, interpretable variant families that support both quantitative and qualitative analysis.
Conclusion: Distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, providing a reproducible methodological framework for studying linguistic diversity in small-language and multilingual contexts.
Abstract: This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
[51] DMAP: A Distribution Map for Text
Tom Kempton,Julia Rozanova,Parameswaran Kamalaruban,Maeve Madigan,Karolina Wresilo,Yoann L. Launay,David Sutton,Stuart Burrell
Main category: cs.CL
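TL;DR: This paper proposes DMAP, which maps a text to a set of samples in the unit interval that jointly encode rank and probability information, enabling context-aware statistical analysis of LLM outputs. Its usefulness is illustrated in three case studies: validating generation parameters, detecting machine-generated text, and forensic fingerprinting of downstream models post-trained on synthetic data.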
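One way to picture a map from a token to the unit interval that encodes both rank and probability is sketched below. This is not taken from the paper: it assumes the next-token distribution is sorted by descending probability, that the observed token owns the slice of [0, 1) between the cumulative mass of all higher-ranked tokens and that mass plus its own probability, and that a point is drawn inside that slice. The function name and the explicit `u` argument (a number in [0, 1) that would normally come from a random generator) are hypothetical.

```python
def unit_interval_sample(probs, observed, u):
    """Map one observed token to a point in [0, 1).

    probs    -- next-token probabilities (assumed to sum to 1)
    observed -- index of the token that actually occurred
    u        -- a number in [0, 1), e.g. drawn from random.random()
    """
    # Rank tokens from most to least probable (ties keep index order).
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    lower = 0.0  # cumulative mass of tokens ranked above the current one
    for index in order:
        if index == observed:
            # The token's slice is [lower, lower + probs[index]).
            return lower + u * probs[index]
        lower += probs[index]
    raise ValueError("observed index is not in the distribution")
```

Under these assumptions a top-ranked token always lands near 0, and the width of the region it can land in reflects how confident the model was, so the position carries rank and probability together.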
Details
Motivation: Existing LLM-based text analysis metrics such as perplexity do not adequately account for context: how a given next-token probability should be interpreted depends on the shape of the conditional distribution, that is, on how many reasonable choices it encodes. A mathematically grounded way to model rank and probability jointly has been missing.
Method: Proposes DMAP: an LLM produces a conditional probability distribution for each token of the input text, and each token position is mapped, via the cumulative distribution function (CDF), to one or more samples in the unit interval [0, 1] that jointly represent the token's rank and probability.
Result: DMAP is shown to be useful in three case studies: (i) validating generation parameters (such as temperature and top-k) to ensure data integrity; (ii) examining the role of probability curvature in machine-generated text detection; (iii) revealing identifiable statistical fingerprints left in downstream models post-trained on synthetic data. All experiments run efficiently on consumer hardware.
Conclusion: DMAP provides a mathematically grounded, model-agnostic, computationally light, and widely applicable statistical representation of text, giving a unified basis for LLM-based text analysis and for further research.
Abstract: Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
[52] Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems
Wanxing Wu,He Zhu,Yixia Li,Lei Yang,Jiehui Zhao,Hongru Wang,Jian Yang,Benyou Wang,Bingyi Jing,Guanhua Chen
Main category: cs.CL
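TL;DR: This paper proposes the RouterXBench evaluation framework and ProbeDirichlet, a lightweight router that models uncertainty from the model's internal hidden states, improving routing accuracy and cross-domain robustness in local-cloud LLM collaboration.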
Details
Motivation: Existing evaluations of LLM routers are unsystematic and overlook scenario-specific requirements and out-of-distribution robustness; in addition, methods that rely on output probabilities or external embeddings do not make full use of the model's internal uncertainty information.
Method: Builds RouterXBench, an evaluation framework with three dimensions (router ability, scenario alignment, cross-domain robustness), and proposes ProbeDirichlet, a router that aggregates hidden states across layers via learnable Dirichlet distributions and is trained probabilistically.
Result: ProbeDirichlet achieves relative improvements of 16.68% and 18.86% over the best baselines in router ability and high-accuracy scenarios, respectively, and performs consistently across model families, model scales, heterogeneous tasks, and agentic workflows.
Conclusion: Modelling uncertainty from internal hidden states is an effective way to improve the performance and generalization of LLM routing, and RouterXBench provides a systematic evaluation benchmark for future routing research.
Abstract: Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
[53] LLM-based Triplet Extraction from Financial Reports
Dante Wesslund,Ville Stenström,Pontus Linde,Alexander Holmberg
Main category: cs.CL
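TL;DR: This paper presents a semi-automated triplet extraction pipeline for corporate financial reports. It replaces ground-truth-based evaluation with proxy metrics such as Ontology Conformance and Faithfulness, and improves extraction quality through automatic ontology induction and a hybrid verification strategy.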
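The hybrid verification idea in this entry (a cheap string match first, an LLM judge only for the cases the match cannot settle) might look roughly like the sketch below. The function name, the `judge` callable, and its `(subject, source_text) -> bool` signature are hypothetical stand-ins, not the authors' implementation; a real judge would wrap an LLM call.

```python
import re


def subject_is_grounded(subject, source_text, judge=None):
    """Return True if the extracted subject is supported by the source text.

    Step 1: case-insensitive whole-phrase regex match.
    Step 2: if the match fails, defer to an optional judge callable
            (hypothetical; e.g. an LLM that can resolve coreference).
    """
    # Lookarounds keep the match on phrase boundaries even when the
    # subject starts or ends with punctuation such as "Inc.".
    pattern = r"(?<!\w)" + re.escape(subject) + r"(?!\w)"
    if re.search(pattern, source_text, flags=re.IGNORECASE):
        return True
    if judge is None:
        return False
    return bool(judge(subject, source_text))
```

With regex alone, a subject such as "the company" that refers back to a named entity would be flagged as a hallucination; passing a judge lets such coreference cases be re-examined instead of counted as errors.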
Details
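Motivation: Corporate financial reports are rich in structured knowledge, but the domain lacks annotated ground truth, which makes evaluation difficult when building knowledge graphs.
Method: Designs a semi-automated triplet extraction pipeline with ontology-driven proxy metrics (Ontology Conformance and Faithfulness); compares a static, manually engineered ontology with fully automated, document-specific ontology induction; proposes a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check; and analyses the causes of the asymmetry between subject and object hallucinations.
Result: The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift seen with the manual ontology; hybrid verification reduces the apparent subject hallucination rate from 65.2% to 1.6%; a systematic asymmetry between subject and object hallucinations is identified and explained.
Conclusion: Ontology-driven proxy evaluation and automatic ontology induction mitigate the lack of annotated ground truth in triplet extraction from financial text, and hybrid verification removes most apparent hallucinations, offering a feasible route to knowledge graph construction in specialised domains.
Abstract: Corporate financial reports are a valuable source of structured knowledge for Knowledge Graph construction, but the lack of annotated ground truth in this domain makes evaluation difficult. We present a semi-automated pipeline for Subject-Predicate-Object triplet extraction that uses ontology-driven proxy metrics, specifically Ontology Conformance and Faithfulness, instead of ground-truth-based evaluation. We compare a static, manually engineered ontology against a fully automated, document-specific ontology induction approach across different LLMs and two corporate annual reports. The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift observed with the manual approach. We also propose a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check, reducing apparent subject hallucination rates from 65.2% to 1.6% by filtering false positives caused by coreference resolution. Finally, we identify a systematic asymmetry between subject and object hallucinations, which we attribute to passive constructions and omitted agents in financial prose.
[54] Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences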
Eddie Yang,Dashun Wang
Main category: cs.CL
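TL;DR: This paper shows that apparent convergence in LLM benchmark accuracy conceals substantial epistemic divergence: different models often give different answers to the same question. When the models are used for scientific data annotation and inference, this hidden disagreement can seriously distort research results and threaten reproducibility.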
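The distinction this entry draws between equal accuracy and agreement can be made concrete with a small calculation. The sketch below is a minimal illustration with made-up predictions, not data from the paper; the function names are chosen here for readability.

```python
def accuracy(predictions, gold):
    """Fraction of items where the prediction equals the gold answer."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def disagreement_rate(preds_a, preds_b):
    """Fraction of items on which two models give different answers."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Toy example: two models with identical accuracy but different errors.
gold = ["A", "B", "C", "D"]
model_x = ["A", "B", "C", "A"]  # wrong on item 4
model_y = ["A", "B", "D", "D"]  # wrong on item 3
```

Here both models score 0.75, yet they disagree on half of the items, which is the kind of gap an accuracy-only leaderboard would hide.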
Details
Motivation: Existing benchmarks such as MMLU-Pro and GPQA report only overall accuracy and ignore whether models agree on individual predictions, which can lead to misjudging model capability and to systematic bias in scientific applications.
Method: Analyses item-level predictions of several frontier LLMs on MMLU-Pro and GPQA to quantify the rate of disagreement between models; then uses different LLMs for data annotation in re-analyses of published education and political science studies to assess the effect on estimated treatment effects.
Result: Even with comparable accuracy, LLMs disagree on 16-66% of items, and top-performing frontier models disagree on 16-38%; switching the annotation model can change estimated treatment effects by more than 80% and in some cases reverse their sign.
Conclusion: Equal benchmark accuracy does not imply epistemic agreement. The "benchmark illusion" hides the substantive effect of model choice on scientific results, and new evaluation approaches that measure model agreement and reliability are needed.
Abstract: Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks (MMLU-Pro and GPQA), we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and 16-38% among top-performing frontier models. These discrepancies suggest distinct error profiles for different LLMs. When such models are used for scientific data annotation and inference, their hidden disagreements propagate into research results: in re-analyses of published studies in education and political science, switching the annotation model can change estimated treatment effects by more than 80%, and in some cases reverses their sign. Together, these findings illustrate a benchmark illusion, where equal accuracy may conceal disagreement, with model choice becoming a hidden yet consequential variable for scientific reproducibility.
[55] AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection
Pretam Ray,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum
Main category: cs.CL
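TL;DR: This paper proposes AdaptEvolve, which uses generation confidence to dynamically select an LLM suited to the current reasoning step, achieving a better trade-off between computational efficiency and reasoning capability in a multi-LLM evolutionary refinement framework.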
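A confidence-driven choice between a small and a large model can be sketched as follows. The entry does not specify how confidence is computed or how the decision is made, so everything here is assumed for illustration: confidence is taken to be the geometric mean of token probabilities, the threshold value is arbitrary, and `small_model` and `large_model` are hypothetical callables returning `(text, token_logprobs)`.

```python
import math


def mean_token_confidence(token_logprobs):
    """Geometric-mean token probability, computed from natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def select_output(prompt, small_model, large_model, threshold=0.8):
    """Try the small model first; escalate only when its confidence is low.

    Both models are hypothetical callables: prompt -> (text, token_logprobs).
    Returns the chosen text and the name of the model that produced it.
    """
    text, logprobs = small_model(prompt)
    if mean_token_confidence(logprobs) >= threshold:
        return text, "small"
    text, _ = large_model(prompt)
    return text, "large"
```

The design choice worth noting is that the large model is only paid for when the small model signals uncertainty, which is where any cost saving in such a cascade would come from.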
Details
Motivation: Evolutionary agentic systems repeatedly invoke large language models (LLMs) during inference, sharpening the trade-off between computational efficiency and reasoning capability; existing cascade routing strategies rely on static heuristics or external controllers and do not explicitly model model uncertainty.
Method: Proposes AdaptEvolve, an adaptive LLM selection method that estimates solvability in real time from intrinsic generation confidence and is embedded in an evolutionary sequential refinement framework.
Result: Across benchmarks, total inference cost falls by 37.9% on average while 97.5% of the upper-bound accuracy of static large-model baselines is retained, yielding a more favourable Pareto frontier.
Conclusion: Confidence-based dynamic LLM selection improves the efficiency-performance balance of multi-LLM evolutionary systems and offers a new approach to efficient inference.
Abstract: Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade-off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve (Adaptive LLM Selection for Multi-LLM Evolutionary Refinement), which operates within an evolutionary sequential refinement framework and leverages intrinsic generation confidence to estimate real-time solvability. Empirical results show that confidence-driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines. Our code is available at https://github.com/raypretam/adaptive_llm_selection.
[56] Cross-Modal Robustness Transfer (CMRT): Training Robust Speech Translation Models Using Adversarial Text
Abderrahmane Issam,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis
Main category: cs.CL
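TL;DR: This paper proposes Cross-Modal Robustness Transfer (CMRT), a framework that transfers adversarial robustness from the text modality to the speech modality without generating adversarial speech data, improving the robustness of end-to-end speech translation models to inflectional morphology attacks by more than 3 BLEU points on average.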
Details
Motivation: End-to-end speech translation models are evaluated mainly on clean datasets, overlooking the robustness challenge posed by the inflectional variation found in non-native or dialectal speech in real-world use.
Method: Adapts a text-based adversarial attack on inflectional morphology to the speech domain and proposes the CMRT framework, which transfers adversarial robustness from the text modality to the speech modality, avoiding the costly and difficult generation of adversarial speech data.
Result: Experiments on four language pairs show that CMRT improves adversarial robustness by more than 3 BLEU points on average, establishing a new baseline for robust E2E-ST that needs no adversarial speech data.
Conclusion: CMRT is an efficient, practical way to strengthen robustness across modalities and offers a new approach to making speech translation models more reliable in complex real-world speech conditions.
Abstract: End-to-End Speech Translation (E2E-ST) has seen significant advancements, yet current models are primarily benchmarked on curated, "clean" datasets. This overlooks critical real-world challenges, such as morphological robustness to inflectional variations common in non-native or dialectal speech. In this work, we adapt a text-based adversarial attack targeting inflectional morphology to the speech domain and demonstrate that state-of-the-art E2E-ST models are highly vulnerable to it. While adversarial training effectively mitigates such risks in text-based tasks, generating high-quality adversarial speech data remains computationally expensive and technically challenging. To address this, we propose Cross-Modal Robustness Transfer (CMRT), a framework that transfers adversarial robustness from the text modality to the speech modality. Our method eliminates the requirement for adversarial speech data during training. Extensive experiments across four language pairs demonstrate that CMRT improves adversarial robustness by an average of more than 3 BLEU points, establishing a new baseline for robust E2E-ST without the overhead of generating adversarial speech.
[57] Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance
Yunchong Huang,Gianni Barlacchi,Sandro Pezzelle
Main category: cs.CL
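TL;DR: This paper argues that part of the reason LLMs fall short on standard question-answering (QA) benchmarks is underspecified questions, whose interpretation is ambiguous without additional context. An LLM-based classifier finds that 16% to over 50% of questions in several widely used datasets are underspecified, and LLMs perform significantly worse on them. A controlled experiment that rewrites these questions into fully specified variants with gold answers held fixed consistently improves performance, indicating that many failures stem from the questions rather than from model limitations. The authors call for more attention to question clarity in QA benchmark design.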
Details
Motivation: Standard QA benchmarks remain far from solved by LLMs, but their questions may suffer from underspecification, an overlooked systematic flaw that confounds assessment of what models can actually do.
Method: Introduces an LLM-based classifier that identifies underspecified questions and applies it to several widely used QA datasets; then runs a controlled rewriting experiment in which underspecified questions are rewritten into fully specified variants with the gold answers held fixed, and re-evaluates LLM performance.
Result: Between 16% and over 50% of benchmark questions are underspecified; LLMs perform significantly worse on them than on well-specified questions; after rewriting, QA performance improves consistently, confirming underspecification as a major source of apparent failures.
Conclusion: Underspecification is an important confound in current QA evaluation and should not be attributed to model deficiencies; future QA benchmark design should place more weight on clear, unambiguous questions.
Abstract: Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions, that is, queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.
[58] Do Large Language Models Adapt to Language Variation across Socioeconomic Status?
Elisa Bassignana,Mike Zhang,Dirk Hovy,Amanda Cercas Curry
Main category: cs.CL
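TL;DR: This paper studies how well LLMs adapt their linguistic style to communities of different socioeconomic status (SES). It finds that they adjust style only slightly and emulate upper-SES language more effectively, which risks amplifying linguistic inequality and calls into question their validity for social science research.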
Details
Motivation: As LLMs increasingly mediate human-to-human communication, it remains unclear how well they adapt their style to different social contexts, in particular to different SES groups. A failure to adapt can reinforce stereotypes, marginalize communities whose linguistic norms the models mirror less closely, and deepen social stratification.
Method: Builds a new SES-stratified dataset from Reddit and YouTube; prompts four LLMs to complete incomplete texts from the corpus; compares the completions with the originals along 94 sociolinguistic metrics covering syntactic, rhetorical, and lexical features.
Result: LLMs modulate their style with respect to SES only to a minor extent, often producing approximation or caricature, and emulate upper-SES style more effectively than lower-SES style.
Conclusion: LLMs currently show a systematic stylistic bias that may amplify linguistic hierarchies, and their output is questionable as a basis for agent-based social simulation, survey experiments, and other research that relies on language style as a social signal.
Abstract: Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.
[59] Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models
Yuzhe Shang,Pengzhi Gao,Wei Liu,Jian Luan,Jinsong Su
Main category: cs.CL
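TL;DR: This paper studies open LLMs for multilingual machine translation (MT). Building on the Gemma3 model family, it develops MiLMMT-46, which covers 46 languages, achieves top-tier multilingual translation performance, outperforms several SOTA open models, and is competitive with proprietary systems such as Google Translate and Gemini 3 Pro.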
Details
Motivation: The multilingual capabilities of open LLMs keep improving, but systematic studies of their use for multilingual MT and of efficient adaptation methods are still lacking; the effects of model scale and data scale on adaptation need to be examined.
Method: Builds the multilingual MT model MiLMMT-46 on the Gemma3 model family through continual pretraining and instruction finetuning, and evaluates it on 46 languages.
Result: MiLMMT-46 achieves top-tier multilingual MT performance across 46 languages, consistently outperforms recent SOTA models including Seed-X, HY-MT-1.5, and TranslateGemma, and is competitive with strong proprietary systems such as Google Translate and Gemini 3 Pro.
Conclusion: Scaling model size and data together improves open LLMs on multilingual MT; continual pretraining and instruction finetuning are effective ways to adapt LLMs to high-quality multilingual MT; MiLMMT-46 shows that open models can compete with leading proprietary systems on this task.
Abstract: Open large language models (LLMs) have demonstrated improving multilingual capabilities in recent years. In this paper, we present a study of open LLMs for multilingual machine translation (MT) across a range of languages, and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and achieves competitive performance with strong proprietary systems such as Google Translate and Gemini 3 Pro.
[60] DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling
Mariia Fedorova,Andrey Kutuzov,Khonzoda Umarova
Main category: cs.CL
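TL;DR: This paper introduces DHPLT, an open collection of diachronic corpora in 41 languages, intended to address the lack of multilingual diachronic corpora for semantic change modelling.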
Details
Motivation: Multilingual diachronic corpora are currently lacking outside high-resource languages, which limits research on semantic change modelling.
Method: Starting from the web-crawled HPLT datasets and using web crawl timestamps as an approximate signal of document creation time, builds diachronic corpora covering three time periods (2011-2015, 2020-2021, 2024-present), and provides pre-computed word type and token embeddings and lexical substitutions for the chosen target words.
Result: Releases the DHPLT diachronic corpora covering 41 languages, with 1 million documents per time period for each language, together with several pre-processed resources, all openly available.
Conclusion: DHPLT provides important infrastructure for multilingual semantic change research, supporting a wider range of languages and new experimental setups.
Abstract: In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as the approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while at the same time leaving it open for other researchers to come up with their own target words using the same datasets. DHPLT aims at filling in the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at https://data.hplt-project.org/three/diachronic/, sorted by language.
[61] Automatic Simplification of Common Vulnerabilities and Exposures Descriptions
Varpu Vehomäki,Kimmo K. Kaski
Main category: cs.CL
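TL;DR: This study explores using large language models (LLMs) to simplify Common Vulnerabilities and Exposures (CVE) descriptions. It builds a baseline for automatic text simplification (ATS) in cyber security and a test dataset of 40 CVE descriptions; evaluation by cyber security experts over two survey rounds finds that out-of-the-box LLMs can make text appear simpler but often fail to preserve meaning.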
Details
Motivation: Cyber security information is hard for non-specialists to understand, and existing research on automatic text simplification has not yet covered the rapidly changing and complex cyber security domain, in particular the simplification of CVE descriptions.
Method: Builds a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, evaluated by two groups of cyber security experts in two survey rounds, comparing the readability gains of out-of-the-box LLMs with how well they preserve meaning.
Result: Out-of-the-box LLMs can make the text appear simpler and more readable, but they struggle with meaning preservation.
Conclusion: Applying general-purpose LLMs directly to CVE simplification is unreliable; approaches that incorporate cyber security domain knowledge are needed to achieve both readability and semantic accuracy.
Abstract: Understanding cyber security is increasingly important for individuals and organizations. However, a lot of information related to cyber security can be difficult to understand for those not familiar with the topic. In this study, we focus on investigating how large language models (LLMs) could be utilized in automatic text simplification (ATS) of Common Vulnerabilities and Exposures (CVE) descriptions. Automatic text simplification has been studied in several contexts, such as medical, scientific, and news texts, but it has not yet been studied to simplify texts in the rapidly changing and complex domain of cyber security. We created a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, evaluated by two groups of cyber security experts in two survey rounds. We have found that while out-of-the-box LLMs can make the text appear simpler, they struggle with meaning preservation. Code and data are available at https://version.aalto.fi/gitlab/vehomav1/simplification_nmi.
[62] LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss
Szilvia Ujváry,Louis Béthune,Pierre Ablin,João Monteiro,Marco Cuturi,Michael Kirchhof
Main category: cs.CL
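TL;DR: This paper proposes LaCy, a pretraining method that combines loss values with a syntactic parser (spaCy) to decide which tokens a small language model (SLM) should predict directly and which should be handled via [...]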
Details
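Motivation: Small language models (SLMs) have limited parameters, so pretraining cannot cover all world knowledge and they are prone to factual errors. Existing remedies rely on external calls (for example to a larger model or a database) but lack a fine-grained mechanism for deciding when the model should predict by itself and when it should make a call.
Method: Proposes LaCy, a pretraining strategy based on token-level decisions: loss values provide an initial filter, and a spaCy parser refines the decision by distinguishing high-loss tokens that are semantically acceptable, which the model should still predict, from high-loss tokens likely to cause factual errors, which should trigger [...]
[63] Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study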
Angelo Ziletti,Leonardo D'Ambrosi
Main category: cs.CL
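TL;DR: This paper proposes CLUES, a framework for clinical Text-to-SQL that separates two sources of uncertainty, input ambiguity and model instability, and quantifies them as an ambiguity score and an instability score to support targeted intervention and error triage.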
Details
Motivation: Deploying clinical Text-to-SQL requires telling apart output diversity caused by input ambiguity, which should trigger a clarification request to the user, from diversity caused by model instability, which should trigger human review; a single uncertainty measure cannot support such differentiated interventions.
Method: Models Text-to-SQL as a two-stage process (interpretations → answers), builds a bipartite semantic graph, and computes the instability score from the Schur complement of its matrix, decomposing semantic uncertainty into an ambiguity score and an instability score.
Result: On AmbigQA/SituatedQA and a clinical Text-to-SQL benchmark, CLUES predicts failures better than the state-of-the-art Kernel Language Entropy; the high-ambiguity/high-instability regime covers 25% of queries but contains 51% of errors, enabling efficient triage.
Conclusion: CLUES offers an interpretable, actionable decomposition of uncertainty that supports query refinement for ambiguity and model improvement for instability, making it suitable for the safe deployment of clinical AI systems.
Abstract: Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations → answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions: query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
[64] Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models
Xin Xu,Clive Bai,Kai Yang,Tianhao Chen,Yangkun Chen,Weijie Liu,Hao Chen,Yang Wang,Saiyong Yang,Can Yang
Main category: cs.CL
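TL;DR: This paper proposes Composition-RL, which automatically composes multiple problems into new verifiable prompts to make better use of prompts with a pass rate of 1, improving LLM reasoning and supporting cross-domain reinforcement learning.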
Details
Motivation: Existing RLVR methods depend on large sets of verifiable prompts, but these contain many uninformative examples and are costly to expand; as training progresses, the share of easy prompts (pass rate of 1) grows, shrinking the effective data size.
Method: Proposes Composition-RL, which automatically composes multiple problems into a new verifiable question for RL training; a curriculum variant gradually increases compositional depth; prompts from different domains can also be composed to support cross-domain RL.
Result: Experiments on model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL on the original dataset; the curriculum variant boosts performance further; cross-domain composition makes cross-domain RL more effective.
Conclusion: Composition-RL is a simple yet effective way to improve the efficiency and generalization of RL training with limited verifiable prompts, particularly when many prompts have a pass rate of 1 and in cross-domain settings.
Abstract: Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach for better utilizing limited verifiable prompts targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Code, datasets, and models are available at https://github.com/XinXU-USTC/Composition-RL.
[65] DeepSight: An All-in-One LM Safety Toolkit
Bo Zhang,Jiaxuan Guo,Lijun Li,Dongrui Liu,Sujin Chen,Guanxu Chen,Zhijie Zheng,Qihao Lin,Lewen Yan,Chen Qian,Yijin Zhou,Yuyao Wu,Shaoxiong Guo,Tianyi Du,Jingyi Yang,Xuhao Hu,Ziqi Miao,Xiaoya Lu,Jing Shao,Xia Hu
Main category: cs.CL
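TL;DR: This paper presents DeepSight, an open-source project that integrates safety evaluation and diagnosis, moving model safety analysis from black-box to white-box.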
Details
Motivation: In current large-model safety workflows, evaluation, diagnosis, and alignment are handled by separate tools. As a result, evaluation cannot locate internal root causes, diagnosis drifts away from concrete risk scenarios, and alignment lacks mechanistic explanations and may degrade general capabilities.
Method: Proposes the open-source project DeepSight, comprising the evaluation toolkit DeepSafe and the diagnosis toolkit DeepScan; unified task and data protocols connect the evaluation and diagnosis stages and enable white-box safety analysis.
Result: DeepSight is described as the first open-source toolkit to support frontier AI risk evaluation and joint safety evaluation and diagnosis, and as low-cost, reproducible, efficient, and highly scalable.
Conclusion: DeepSight moves large-model safety from isolated black-box evaluation towards systematic white-box insight, providing mechanism-level support for safety alignment.
Abstract: As the development of Large Models (LMs) progresses rapidly, their safety is also a priority. In the current safety workflow for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot identify internal root causes. Meanwhile, safety diagnosis often drifts from concrete risk scenarios and remains at the explainable level. As a result, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, DeepSight, to practice a new integrated safety evaluation-diagnosis paradigm. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-model safety evaluation project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box to white-box insight. In addition, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and joint safety evaluation and diagnosis.
[66] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling
Pinyi Zhang,Ting-En Lin,Yuchuan Wu,Jingyang Chen,Zongqi Wang,Hua Yang,Ze Xu,Fei Huang,Kai Zhang,Yongbin Li
Main category: cs.CL
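TL;DR: This paper proposes P-GenRM, the first personalized generative reward model with test-time user-based scaling. Through structured evaluation chains, user prototype clustering, and a dual-granularity scaling mechanism, it improves personalized alignment and generalization to new users.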
Details
Motivation: Existing personalized reward models struggle to capture diverse user preferences in open-ended scenarios and generalize poorly to new users with little feedback: they reduce complex preferences to a small, fixed set of evaluation principles, and they lack a way to transfer to new users.
Method: Proposes P-GenRM, which (1) transforms preference signals into structured evaluation chains that derive scenario-adaptive personas and scoring rubrics; (2) clusters users into User Prototypes; (3) introduces a dual-granularity scaling mechanism that adaptively scales and aggregates each user's scoring scheme at the individual level and incorporates preferences from similar users at the prototype level.
Result: Achieves state-of-the-art results on widely used personalized reward model benchmarks with an average improvement of 2.31%, generalizes well on an out-of-distribution dataset, and gains an additional 3% from test-time user-based scaling.
Conclusion: Through generative modelling and test-time adaptive scaling, P-GenRM mitigates noise in inferred preferences and improves transfer across users, offering a more robust and scalable approach to personalized alignment.
Abstract: Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling with generalization to new users with limited feedback. To this end, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely-used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, test-time user-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment with test-time scalability.
[67] A Rule-based Computational Model for Gaidhlig Morphology
Peter J Barclay
Main category: cs.CL
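TL;DR: This paper proposes a rule-based approach to modeling Scottish Gaelic (Gaidhlig) morphology. It uses Wiktionary data to build an interpretable system with low data requirements, intended to support teaching tools and higher-level NLP tools.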
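As a rough illustration of what a declarative rule paired with a Python utility might look like, the sketch below encodes one well-known Gaelic pattern, lenition, in which an `h` is written after a lenitable initial consonant. The `LENITION_RULE` table and the `lenite` helper are hypothetical names chosen for this example; they are not taken from the paper's rule-base.

```python
# Hypothetical declarative rule (not the paper's rule-base): which initial
# consonants lenite in writing, and which initial clusters block lenition.
LENITION_RULE = {
    "lenitable": set("bcdfgmpst"),
    "blocked_clusters": ("sg", "sm", "sp", "st"),
}


def lenite(word: str, rule: dict = LENITION_RULE) -> str:
    """Return the lenited form of `word`, or `word` unchanged if the rule does not apply."""
    lower = word.lower()
    if not lower or lower[0] not in rule["lenitable"]:
        return word  # vowels and l, n, r show no written lenition
    if lower.startswith(rule["blocked_clusters"]):
        return word  # sg-, sm-, sp-, st- are not lenited
    if lower[1:2] == "h":
        return word  # already lenited
    return word[0] + "h" + word[1:]
```

Keeping the rule as data, separate from the function that applies it, is one way to make the conditions inspectable and easy to explain in teaching material.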
Details
Motivation: Mainstream neural language models need large amounts of training data, which low-resource languages such as Scottish Gaelic lack. An alternative is needed that makes efficient use of limited data, is interpretable, and can serve language teaching and NLP tool development.
Method: Morphological data for Gaelic words is taken from Wiktionary, SQL is used to query the occurrence of lexical patterns, and a declarative rule-base lets Python utilities derive inflected forms.
Result: The work in progress yields a rule-base from which inflected forms of Gaelic words can be derived. This could support educational tools (for example, tools that teach or explain language patterns) and higher-level tools such as rule-based dependency parsers.
Conclusion: The author argues that rule-based methods suit low-resource languages: they use scarce data effectively, are more interpretable, and add value to existing Wiktionary data by adapting it to new use cases.
Abstract: Language models and software tools are essential to support the continuing vitality of lesser-used languages; however, currently popular neural models require considerable data for training, which normally is not available for such low-resource languages. This paper describes work-in-progress to construct a rule-based model of Gaidhlig morphology using data from Wiktionary, arguing that rule-based systems effectively leverage limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule-base is presented that allows Python utilities to derive inflected forms of Gaidhlig words. This functionality could be used to support educational tools that teach or explain language patterns, for example, or to support higher level tools such as rule-based dependency parsers. This approach adds value to the data already present in Wiktionary by adapting it to new use-cases.
[68] WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models
Yangzhuo Li,Shengpeng Ji,Yifu Chen,Tianle Liang,Haorong Ying,Yule Wang,Junbo Li,Jun Fang,Zhou Zhao
Main category: cs.CL
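TL;DR: This paper presents WavBench, a new benchmark for evaluating realistic spoken-dialogue ability. It covers reasoning (Pro subset), colloquial expression (Basic subset), and paralinguistic understanding and generation (Acoustic subset), addressing the neglect of speech-specific characteristics in text-oriented evaluation.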
Details
Motivation: Evaluation of spoken dialogue models mostly follows text-generation standards. It overlooks speech-specific paralinguistic features (such as intonation and pauses) and colloquial expression, and it does not adequately test advanced reasoning in realistic conversation. A comprehensive benchmark closer to real use is needed.
Method: WavBench is built as a three-part framework: (1) the Pro subset focuses on high-difficulty reasoning tasks; (2) the Basic subset defines a standard for colloquial speech centered on "listenability", emphasizing natural vocabulary, linguistic fluency, and interactive rapport; (3) the Acoustic subset evaluates paralinguistic ability across explicit understanding, generation, and implicit dialogue. Five state-of-the-art models are evaluated.
Result: WavBench exposes the performance gaps of current models in complex reasoning, natural colloquial delivery, and paralinguistic modeling, and how these three abilities interact, giving a multi-dimensional basis for evaluating spoken dialogue systems.
Conclusion: WavBench is presented as a comprehensive spoken-dialogue benchmark that combines cognitive depth, linguistic naturalness, and acoustic realism. It moves evaluation from a text-centered to a speech-centered view and is meant to guide the development of robust spoken dialogue models.
Abstract: With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto-2024.github.io/wavbench.github.io/.
[69] CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes
Ricardo Campos,Ana Filipa Pacheco,Ana Luísa Fernandes,Inês Cantante,Rute Rebouças,Luís Filipe Cunha,José Miguel Isidro,José Pedro Evans,Miguel Marques,Rodrigo Batista,Evelin Amorim,Alípio Jorge,Nuno Guimarães,Sérgio Nunes,António Leal,Purificação Silvano
Main category: cs.CL
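TL;DR: This paper introduces CitiLink-Minutes, a multilayer annotated dataset of 120 European Portuguese municipal meeting minutes, intended to fill a gap in IR and NLP research on municipal records.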
Details
Motivation: Municipal meeting minutes are important records of local governance, but IR and NLP research on them is severely limited by the lack of annotated datasets.
Method: The authors build CitiLink-Minutes, covering 120 meeting minutes from six municipalities. The minutes are manually annotated along three dimensions (metadata, subjects of discussion, and voting outcomes) for a total of over 38,000 annotations, and the dataset is released under FAIR principles.
Result: The dataset contains over one million tokens, with all personal identifiers de-identified. Baseline results are provided for metadata extraction, topic classification, and vote labeling.
Conclusion: CitiLink-Minutes provides a high-quality resource for analyzing municipal text and supports NLP and IR applications that make local governance more transparent.
Abstract: City councils play a crucial role in local governance, directly influencing citizens' daily lives through decisions made during municipal meetings. These deliberations are formally documented in meeting minutes, which serve as official records of discussions, decisions, and voting outcomes. Despite their importance, municipal meeting records have received little attention in Information Retrieval (IR) and Natural Language Processing (NLP), largely due to the lack of annotated datasets, which ultimately limits the development of computational models. To address this gap, we introduce CitiLink-Minutes, a multilayer dataset of 120 European Portuguese municipal meeting minutes from six municipalities. Unlike prior annotated datasets of parliamentary or video records, CitiLink-Minutes provides multilayer annotations and structured linkage of official written minutes. The dataset contains over one million tokens, with all personal identifiers de-identified. Each minute was manually annotated by two trained annotators and curated by an experienced linguist across three complementary dimensions: (1) metadata, (2) subjects of discussion, and (3) voting outcomes, totaling over 38,000 individual annotations. Released under FAIR principles and accompanied by baseline results on metadata extraction, topic classification, and vote labeling, CitiLink-Minutes demonstrates its potential for downstream NLP and IR tasks, while promoting transparent access to municipal decisions.
[70] dVoting: Fast Voting for dLLMs
Sicheng Feng,Zigeng Chen,Xinyin Ma,Gongfan Fang,Xinchao Wang
Main category: cs.CL
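TL;DR: This paper proposes dVoting, which exploits the ability of diffusion large language models (dLLMs) to generate tokens at arbitrary positions in parallel. It identifies uncertain tokens through consistency analysis across multiple samples and regenerates them through iterative voting, improving reasoning performance without any training.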
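As a rough illustration of the consistency-analysis step, the sketch below takes several aligned samples for one prompt and separates positions where the samples agree from positions that would need to be regenerated. The function name `consistency_analysis` and the agreement `threshold` are assumptions made for this example, not the paper's exact criterion, and the regeneration step itself (which needs a dLLM) is left out.

```python
from collections import Counter


def consistency_analysis(samples, threshold=1.0):
    """Split positions into agreed tokens and uncertain positions.

    samples: list of equal-length token lists sampled for the same prompt.
    A position is treated as uncertain when the most common token's share
    is below `threshold` (an illustrative criterion, not the paper's rule).
    """
    length = len(samples[0])
    assert all(len(s) == length for s in samples), "samples must be aligned"
    consensus, uncertain = [], []
    for pos in range(length):
        token, count = Counter(s[pos] for s in samples).most_common(1)[0]
        if count / len(samples) >= threshold:
            consensus.append(token)
        else:
            consensus.append(None)  # placeholder for a position to regenerate
            uncertain.append(pos)
    return consensus, uncertain
```

In a full loop, only the positions listed in `uncertain` would be masked and resampled by the model, and the analysis repeated until no uncertain positions remain.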
Details
Motivation: Across multiple dLLM samples for the same prompt, most token predictions agree, and performance is determined by the small set of tokens that vary across samples. Autoregressive models are limited by inefficient serial decoding, whereas dLLMs natively support parallel decoding, which provides a basis for an efficient voting mechanism.
Method: dVoting is a training-free fast voting technique. It samples the same prompt several times, uses consistency analysis to find tokens whose predictions are unstable across samples, regenerates only those uncertain tokens using the dLLM's arbitrary-position generation, and repeats the sample, analyze, vote, and regenerate cycle until convergence.
Result: dVoting improves performance on several benchmarks: 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU.
Conclusion: dVoting makes use of the parallel generation ability of dLLMs to improve reasoning at an acceptable extra computational cost, offering a training-free approach to decoding-time enhancement.
Abstract: Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential for parallel test-time scaling, which was previously constrained by severe inefficiency in autoregressive modeling. In this work, we introduce dVoting, a fast voting technique that boosts reasoning capability without training, with only an acceptable extra computational overhead. dVoting is motivated by the observation that, across multiple samples for the same prompt, token predictions remain largely consistent, whereas performance is determined by a small subset of tokens exhibiting cross-sample variability. Leveraging the arbitrary-position generation capability of dLLMs, dVoting performs iterative refinement by sampling, identifying uncertain tokens via consistency analysis, regenerating them through voting, and repeating this process until convergence. Extensive evaluations demonstrate that dVoting consistently improves performance across various benchmarks. It achieves gains of 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU. Our code is available at https://github.com/fscdc/dVoting
[71] Query-focused and Memory-aware Reranker for Long Context Processing
Yuqing Li,Jiangnan Li,Mo Yu,Guoxuan Ding,Zheng Lin,Weiping Wang,Jie Zhou
Main category: cs.CL
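TL;DR: This paper proposes a lightweight listwise reranking framework that estimates passage-query relevance from the attention scores of retrieval heads in large language models. It needs no Likert-scale annotation, reaches state-of-the-art performance with small models (e.g., 4B parameters) on multiple datasets and the LoCoMo benchmark, and supports extensions such as contextual augmentation and training middle-layer attention heads.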
Details
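Motivation: Existing rerankers are mostly pointwise or require graded labels such as Likert scales, cannot model the candidate list as a whole, and rely on large models. Prior analysis of retrieval heads shows that their attention scores carry relevance signals worth exploiting.
Method: Models are trained to estimate passage-query relevance from the attention scores of selected retrieval heads, giving a listwise reranker that uses information from the entire candidate shortlist. Because the scores are continuous, training can use arbitrary retrieval datasets without Likert-scale supervision. Small models (e.g., 4B parameters) suffice, and variants with contextual augmentation and middle-layer attention heads are explored.
Result: The method outperforms existing state-of-the-art pointwise and listwise rerankers on Wikipedia and long narrative datasets and sets a new state of the art on the LoCoMo benchmark for dialogue understanding and memory usage. Contextual augmentation improves ranking accuracy, and training middle-layer heads improves efficiency without sacrificing performance.
Conclusion: Attention-score-based listwise reranking is an efficient, general, and extensible approach that reduces dependence on large models and strong supervision, and it offers a practical tool for retrieval-augmented generation and dialogue systems.
Abstract: Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
[72] Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education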
Mohamed Huti,Alasdair Mackintosh,Amy Waldock,Dominic Andrews,Maxime Lelièvre,Moritz Boos,Tobias Murray,Paul Atherton,Robin A. A. Ince,Oliver G. B. Garrod
Main category: cs.CL
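TL;DR: This paper introduces the Visual Reasoning Benchmark (VRB) for evaluating multimodal large language models (MLLMs) on authentic visual problems from primary-school classrooms. Models do better on static tasks such as counting but hit a clear bottleneck on dynamic spatial operations such as folding, reflection, and rotation.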
Details
Motivation: AI models perform well on textual reasoning but remain bottlenecked on reasoning over spatial and relational structures. Early-grade maths relies heavily on visuals, so a reliable evaluation tool is needed.
Method: The VRB dataset contains 701 questions from primary school examinations in Zambia and India. It uses unedited, minimal-text images and covers tasks such as reasoning by analogy, pattern completion, and spatial matching, in order to evaluate MLLM visual reasoning in realistic educational settings.
Result: The experiments reveal a "jagged frontier" of capability: models are stronger at static skills such as counting and scaling but reach a distinct "spatial ceiling" on dynamic operations such as folding, reflection, and rotation.
Conclusion: Education-focused benchmarks such as the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms. Current weaknesses in spatial reasoning risk incorrect marking, false scaffolding, and reinforcing student misconceptions.
Abstract: AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck -- particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. This benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark which intentionally uses unedited, minimal-text images to test if models can meet realistic needs of primary education. Our findings reveal a "jagged frontier" of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct "spatial ceiling" when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.
[73] ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images
Mathieu Sibue,Andres Muñoz Garza,Samuel Mensah,Pranav Shetty,Zhiqiang Ma,Xiaomo Liu,Manuela Veloso
Main category: cs.CL
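TL;DR: This paper introduces ExStrucTiny, a benchmark dataset for evaluating and improving how well generalist vision language models perform fine-grained, adaptable structured information extraction from diverse enterprise document images. It unifies aspects of key entity extraction, relation extraction, and visual question answering.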
Details
Motivation: Generalist vision language models perform well on established document understanding benchmarks, but their ability to carry out holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, and do not support adaptable, structured extraction.
Method: The authors build ExStrucTiny, a new benchmark dataset that unifies aspects of Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA). It is produced by a new pipeline that combines manual samples with synthetic, human-validated samples and covers more varied document types and extraction scenarios. Open and closed VLMs are evaluated on it.
Result: The evaluation highlights key challenges for current VLMs in structured information extraction, including schema adaptation, query under-specification, and answer localization.
Conclusion: ExStrucTiny provides a foundation for improving generalist models on structured information extraction from enterprise documents and may encourage more robust, generalizable, and adaptable document understanding models.
Abstract: Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.
[74] Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
Julia Belikova,Danila Rozhevskii,Dennis Svirin,Konstantin Polev,Alexander Panchenko
Main category: cs.CL
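TL;DR: This paper proposes a method for detecting token overflow in soft-compression architectures, the regime in which compressed representations no longer suffice to answer a query. In the xRAG setting, lightweight probing classifiers over query and context representations detect overflow with 0.72 AUC-ROC on average, enabling low-cost, query-aware gating that catches compression errors before the LLM is called.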
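To make the reported metric concrete: AUC-ROC can be read as the probability that a randomly chosen positive example (here, an overflow case) receives a higher detector score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name `auc_roc` and the toy labels and scores in the usage note are illustrative assumptions, not data from the paper.

```python
def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive score with every negative score.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])` compares four positive-negative pairs, of which three are ranked correctly. A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.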
Details
Motivation: Soft-compression architectures can extend the effective context length of LLMs, but the limits of compressibility, and the point at which compression starts to erase task-relevant information, remain unclear. The "token overflow" state of insufficient information needs to be characterized and detected.
Method: The paper defines token overflow and, in the xRAG soft-compression setting, (1) uses query-agnostic saturation statistics to separate compressed from uncompressed token representations and (2) builds lightweight probing classifiers over both query and context xRAG representations to detect overflow.
Result: Query-agnostic saturation statistics reliably identify compressed tokens but have limited ability to detect overflow. Lightweight probing classifiers that include query information reach 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA.
Conclusion: Moving from query-independent diagnostics to query-aware detection is a key step toward more robust soft compression. The approach supports low-cost pre-LLM gating that mitigates compression-induced errors.
Abstract: Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility -- and when compression begins to erase task-relevant content -- remain underexplored. In this paper, we define "token overflow" as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
[75] Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
Manjunath Kudlur,Evan King,James Wang,Pete Warden
Main category: cs.CL
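TL;DR: Moonshine v2 is a streaming-encoder ASR model based on sliding-window self-attention that combines low time-to-first-token (TTFT) with high recognition accuracy, enabling efficient real-time speech recognition on edge devices.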
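As a generic illustration of sliding-window self-attention (not Moonshine v2's actual implementation), the sketch below builds a boolean mask in which each frame may attend only to a fixed number of neighbouring frames. The helper name `sliding_window_mask` and the `left` and `right` window sizes are assumptions made for this example; the model's real window sizes are not given here.

```python
def sliding_window_mask(num_frames, left, right=0):
    """Return mask[i][j], True when frame i may attend to frame j.

    `left` and `right` are how many past and future frames are visible.
    These parameters are illustrative; the values used by the model are
    not stated in this summary.
    """
    return [
        [i - left <= j <= i + right for j in range(num_frames)]
        for i in range(num_frames)
    ]
```

Each row has at most `left + right + 1` true entries, so the attention work per frame stays bounded no matter how long the utterance is, which is the property that keeps streaming latency bounded.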
Details
Motivation: Full-attention Transformer encoders create high time-to-first-token latency in streaming speech recognition because of their global dependency. Real-time speech applications on edge devices need low latency together with high accuracy.
Method: Moonshine v2 uses an ergodic streaming-encoder architecture with sliding-window self-attention, which gives bounded, low-latency inference while preserving strong local context modeling.
Result: The models reach state-of-the-art word error rates (WER) on standard benchmarks, with accuracy on par with models six times their size, while running significantly faster.
Conclusion: Carefully designed local attention can match the accuracy of full attention at a fraction of the model size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
Abstract: Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent "encode-the-whole-utterance" latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state-of-the-art word error rates across standard benchmarks, attaining accuracy on par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive with the accuracy of full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
[76] A technical curriculum on language-oriented artificial intelligence in translation and specialised communication
Ralph Krüger
Main category: cs.CL
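TL;DR: This paper presents a technical curriculum on language-oriented artificial intelligence (AI) for the language and translation (L&T) industry. It aims to build domain-specific AI literacy among practitioners, covers vector embeddings, the foundations of neural networks, tokenization, and transformer models, and was tested in a teaching setting.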
Details
Motivation: Practitioners in the language and translation industry need domain-specific technical AI literacy, computational thinking, algorithmic awareness, and algorithmic agency in order to be digitally resilient in AI-driven work environments.
Method: A technical curriculum on language-oriented AI was designed and delivered, with four modules: vector embeddings, the technical foundations of neural networks, tokenization, and transformer neural networks. It was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln.
Result: The results suggest the curriculum is didactically effective, but participant feedback indicates it should be embedded in higher-level didactic scaffolding, such as lecturer support, to achieve optimal learning conditions.
Conclusion: The curriculum offers a workable path for developing AI literacy in the L&T field, but successful use depends on accompanying teaching support.
Abstract: This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (L&T) industry. The curriculum aims to foster domain-specific technical AI literacy among stakeholders in the fields of translation and specialised communication by exposing them to the conceptual and technical/algorithmic foundations of modern language-oriented AI in an accessible way. The core curriculum focuses on 1) vector embeddings, 2) the technical foundations of neural networks, 3) tokenization and 4) transformer neural networks. It is intended to help users develop computational thinking as well as algorithmic awareness and algorithmic agency, ultimately contributing to their digital resilience in AI-driven work environments. The didactic suitability of the curriculum was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Results suggest the didactic effectiveness of the curriculum, but participant feedback indicates that it should be embedded into higher-level didactic scaffolding - e.g., in the form of lecturer support - in order to enable optimal learning conditions.
[77] T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization
Tunyu Zhang,Xinxi Zhang,Ligong Han,Haizhou Shi,Xiaoxiao He,Zhuowei Li,Hao Wang,Kai Xu,Akash Srivastava,Hao Wang,Vladimir Pavlovic,Dimitris N. Metaxas
Main category: cs.CL
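TL;DR: This paper proposes a trajectory self-distillation framework combined with a Direct Discriminative Optimization (DDO) objective. It improves the generation quality of diffusion large language models (DLLMs) under few decoding steps and substantially narrows the gap to full-step decoding.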
Details
Motivation: Diffusion large language models (DLLMs) can decode multiple tokens in parallel, but in practice their inference efficiency is limited by the many refinement steps they need, and cutting the number of steps sharply degrades generation quality.
Method: A trajectory self-distillation framework distills the model's own generative trajectories, and a reverse-KL Direct Discriminative Optimization (DDO) objective promotes mode-seeking distillation.
Result: Across benchmarks, the approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Full-step decoding remains superior, but the gap is substantially narrowed.
Conclusion: The work lays a strong foundation for practical few-step DLLMs.
Abstract: Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student to concentrate on high-probability teacher modes. Across benchmarks, our approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Although full-step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few-step DLLMs. The source code is available at https://github.com/Tyrion58/T3D.
[78] On-Policy Context Distillation for Language Models
Tianzhu Ye,Li Dong,Xun Wu,Shaohan Huang,Furu Wei
Main category: cs.CL
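TL;DR: This paper proposes On-Policy Context Distillation (OPCD), a framework that trains a student model on its own generated trajectories while minimizing reverse KL divergence against a context-conditioned teacher, so that in-context knowledge is internalized into the parameters. It performs well on both experiential knowledge distillation and system prompt distillation and supports knowledge transfer across model sizes.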
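For readers unfamiliar with the term, reverse KL divergence takes the expectation under the student's own distribution, which is why it pairs naturally with training on student-generated samples. The sketch below computes it for small discrete distributions; the function names and the toy distributions in the usage note are illustrative assumptions and say nothing about how OPCD estimates the quantity over token sequences.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions on the same support.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reverse_kl(student, teacher):
    """Reverse KL: the expectation is taken under the student distribution."""
    return kl(student, teacher)


def forward_kl(student, teacher):
    """Forward KL: the expectation is taken under the teacher distribution."""
    return kl(teacher, student)
```

With a two-peaked teacher such as `[0.49, 0.02, 0.49]` and a student concentrated on one peak such as `[0.98, 0.01, 0.01]`, the reverse KL comes out smaller than the forward KL, since the student is only penalized where it actually places probability mass.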
Details
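Motivation: Existing context distillation methods struggle to internalize, as parametric knowledge, the experience a model accumulates during inference or the behaviors encoded in optimized prompts. A distillation mechanism is needed that combines on-policy generation with context modeling.
Method: In OPCD, the student is trained on-policy on trajectories it samples itself, with the objective of minimizing the reverse KL divergence between its output distribution and that of a context-conditioned teacher. The framework is applied to experiential knowledge distillation (extracting transferable knowledge from historical solution traces) and system prompt distillation (internalizing behaviors encoded in optimized prompts).
Result: On mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, with higher task accuracy and better preservation of out-of-distribution capabilities. It also enables cross-size distillation, in which smaller students internalize experiential knowledge from larger teachers.
Conclusion: OPCD bridges on-policy distillation and context distillation, offering a scalable and practical way for language models to accumulate and consolidate experiential knowledge.
Abstract: Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
cs.CV [Back]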
[79] DD-MDN: Human Trajectory Forecasting with Diffusion-Based Dual Mixture Density Networks and Uncertainty Self-Calibration
Manuel Hetzel,Kerim Turacan,Hannes Reichert,Konrad Doll,Bernhard Sick
Main category: cs.CV
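TL;DR: This paper proposes DD-MDN, an end-to-end probabilistic human trajectory forecasting model based on denoising diffusion and a dual mixture density network. It combines high positional accuracy, well-calibrated uncertainty, and robustness to short observation periods.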
Details
Motivation: Prior work focuses on accuracy, social interaction modeling, and diversity, and pays little attention to uncertainty modeling, calibration, and forecasts from short observation periods, all of which matter for downstream tasks such as path planning and collision avoidance.
Method: DD-MDN combines a few-shot denoising diffusion backbone with a dual mixture density network (dual MDN). It learns self-calibrated residence areas and probability-ranked anchor paths without predefined anchors or endpoints, and derives diverse trajectory hypotheses from them.
Result: DD-MDN reaches state-of-the-art accuracy on the ETH/UCY, SDD, inD, and IMPTC datasets, is robust at short observation intervals, and models uncertainty reliably.
Conclusion: DD-MDN offers a more practical, trustworthy, and robust probabilistic framework for human trajectory forecasting (HTF), supporting its use in real safety-critical settings.
Abstract: Human Trajectory Forecasting (HTF) predicts future human movements from past trajectories and environmental context, with applications in Autonomous Driving, Smart Surveillance, and Human-Robot Interaction. While prior work has focused on accuracy, social interaction modeling, and diversity, little attention has been paid to uncertainty modeling, calibration, and forecasts from short observation periods, which are crucial for downstream tasks such as path planning and collision avoidance. We propose DD-MDN, an end-to-end probabilistic HTF model that combines high positional accuracy, calibrated uncertainty, and robustness to short observations. Using a few-shot denoising diffusion backbone and a dual mixture density network, our method learns self-calibrated residence areas and probability-ranked anchor paths, from which diverse trajectory hypotheses are derived, without predefined anchors or endpoints. Experiments on the ETH/UCY, SDD, inD, and IMPTC datasets demonstrate state-of-the-art accuracy, robustness at short observation intervals, and reliable uncertainty modeling. The code is available at: https://github.com/kav-institute/ddmdn.
[80] ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
Yandan Yang,Shuang Zeng,Tong Lin,Xinyuan Chang,Dekang Qi,Junjin Xiao,Haoyun Liu,Ronghan Chen,Yuzhi Chen,Dongjie Huo,Feng Xiong,Xing Wei,Zhiheng Ma,Mu Xu
Main category: cs.CV
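TL;DR: ABot-M0 introduces a unified dataset (UniACT) and an Action Manifold Learning (AML) framework. Through systematic data curation, unified pre-training, and DiT-based projection onto an action manifold, it improves generalization across robot morphologies as well as the efficiency and stability of action prediction.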
Details
Motivation: Fragmented data, inconsistent representations, and misaligned training objectives hold back general-purpose embodied agents in robotics, the "one-brain, many-forms" paradigm.
Method: The authors build the large-scale unified UniACT dataset. They propose the Action Manifold Hypothesis and design Action Manifold Learning (AML), which uses a DiT backbone to predict continuous action sequences directly. A dual-stream modular perception mechanism integrates VLM semantics with geometric priors and multi-view inputs from 3D modules.
Result: Experiments across robot morphologies and task scenarios show that the components work independently and that their benefits are additive. Action prediction speed and policy stability improve, as do cross-platform knowledge transfer and generalization.
Conclusion: ABot-M0 provides a scalable and reproducible systematic framework for general-purpose embodied intelligence and helps put the "one-brain, many-forms" paradigm into practice.
Abstract: Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability.
ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.
[81] Toward Reliable Tea Leaf Disease Diagnosis Using Deep Learning Model: Enhancing Robustness With Explainable AI and Adversarial Training
Samanta Ghosh,Jannatul Adan Mahi,Shayan Abrar,Md Parvez Mia,Asaduzzaman Rayhan,Abdul Awal Yasir,Asaduzzaman Hridoy
Main category: cs.CV
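TL;DR: This paper proposes a deep-learning method for automatic tea leaf disease classification. It uses the teaLeafBD dataset (5,278 high-resolution images in 7 classes) and combines DenseNet201 and EfficientNetB3 models with adversarial training and Grad-CAM explainability analysis; EfficientNetB3 reaches 93% accuracy.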
Details
Motivation: Tea is an important economic crop in Bangladesh but is vulnerable to several leaf diseases. Manual detection is slow and error-prone, so an efficient automated method for identifying diseases is needed.
Method: Using the teaLeafBD dataset, the authors build a full pipeline of data preprocessing, data splitting, adversarial training, augmentation, model training, evaluation, and Grad-CAM explainability analysis. DenseNet201 and EfficientNetB3 are compared, and adversarial training is used to improve robustness.
Result: EfficientNetB3 achieves the highest classification accuracy at 93%, and DenseNet201 reaches 91%. Grad-CAM visualizations are used to check which image regions drive the predictions, and adversarial training is intended to keep the models effective on noisy or perturbed inputs.
Conclusion: The method identifies tea leaf diseases accurately and efficiently, has practical value for agriculture, and offers workable technical support for smart agricultural management.
Abstract: Tea is a valuable asset for the economy of Bangladesh. So, tea cultivation plays an important role in boosting the economy. These valuable plants are vulnerable to various kinds of leaf infections, which may cause lower production and lower quality. It is not easy to detect these diseases manually. It may take time, and there could be some errors in the detection. Therefore, the purpose of the study is to develop an automated deep learning model for tea leaf disease classification based on the teaLeafBD dataset so that anyone can detect the diseases more easily and efficiently. There are 5,278 high-resolution images in this dataset. The images are classified into seven categories. Six of them represent various diseases and the remaining one represents healthy leaves. The proposed pipeline contains data preprocessing, data splitting, adversarial training, augmentation, model training, evaluation, and comprehension made possible with Explainable AI strategies. DenseNet201 and EfficientNetB3 were employed to perform the classification task. To make the model more robust, we applied adversarial training so it can operate effectively even with noisy or disturbed inputs. In addition, Grad-CAM visualization was executed to analyze the model's predictions by identifying the most influential regions of each image. Our experimental outcomes revealed that EfficientNetB3 achieved the highest classification accuracy of 93%, while DenseNet201 reached 91%. The outcomes show that the proposed approach can accurately detect tea leaf diseases and provide a practical solution for advanced agricultural management.
[82] Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration
Jinghan He,Junfeng Fang,Feng Xiong,Zijun Yao,Fei Shen,Haiyun Guo,Jinqiao Wang,Tat-Seng Chua
Main category: cs.CV
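TL;DR: This paper proposes Active-Zero, a framework in which three co-evolving agents (Searcher, Questioner, Solver) let a vision-language model actively explore open-world environments and learn adaptively, improving its reasoning and understanding.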
Details
Motivation: Existing self-play methods for vision-language models rely on static image collections and cannot actively seek visual data matched to the model's current ability, which leads to inefficient learning and strong dependence on the initial dataset.
Method: Active-Zero uses three co-evolving agents: a Searcher that retrieves images from open-world repositories based on the model's capability frontier, a Questioner that synthesizes calibrated reasoning tasks, and a Solver refined through accuracy rewards. Together they form a closed-loop, self-scaffolding curriculum.
Result: On Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero achieves 53.97 average accuracy on reasoning tasks (5.7% improvement) and 59.77 on general understanding (3.9% improvement), consistently outperforming existing self-play baselines.
Conclusion: Active exploration is a key ingredient for scalable, adaptive vision-language systems, and Active-Zero offers a new approach to autonomous model self-evolution.
Abstract: Self-play has enabled large language models to autonomously improve through self-generated challenges. However, existing self-play methods for vision-language models rely on passive interaction with static image collections, resulting in strong dependence on initial datasets and inefficient learning. Without the ability to actively seek visual data tailored to their evolving capabilities, agents waste computational effort on samples that are either trivial or beyond their current skill level. To address these limitations, we propose Active-Zero, a framework that shifts from passive interaction to active exploration of visual environments. Active-Zero employs three co-evolving agents: a Searcher that retrieves images from open-world repositories based on the model's capability frontier, a Questioner that synthesizes calibrated reasoning tasks, and a Solver refined through accuracy rewards. This closed loop enables self-scaffolding auto-curricula where the model autonomously constructs its learning trajectory. On Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero achieves 53.97 average accuracy on reasoning tasks (5.7% improvement) and 59.77 on general understanding (3.9% improvement), consistently outperforming existing self-play baselines. These results highlight active exploration as a key ingredient for scalable and adaptive self-evolving vision-language systems.
[83] ReTracing: An Archaeological Approach Through Body, Machine, and Generative Systems
Yitong Wang,Yue Yao
Main category: cs.CV
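TL;DR: ReTracing is a multi-agent embodied performance art project pairing a human dancer with a quadruped robot. It takes an archaeological approach to examining how artificial intelligence shapes, constrains, and produces bodily movement: large language models and a text-to-video diffusion model generate movement instructions, both agents perform them in sync on a mirrored floor, and the resulting digital archive of motion traces reveals socio-cultural biases embedded in generative AI.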
Details
Motivation: To examine how artificial intelligence shapes, constrains, and produces bodily movement, and to reveal the socio-cultural biases embedded in generative systems. Method: Descriptions of human-machine interaction are extracted from science-fiction novels; a large language model generates paired "what to do" and "what not to do" prompts; a diffusion-based text-to-video model turns these into choreographic guides for a human dancer and motor commands for a robot; both perform in sync on a mirrored floor, and multi-camera motion capture reconstructs the performance as 3D point clouds and motion trails. Result: The project builds an embodied, traceable digital archive of motion, realizes AI-guided joint performance by human and robot, and visualizes the normative and prohibitive logic in AI-generated movement. Conclusion: As a novel research approach, ReTracing shows that generative AI does not merely output text or images but also encodes social norms and biases at the embodied level; its central question is what it means to be human when AI can also move, think, and leave traces behind. Abstract: We present ReTracing, a multi-agent embodied performance art that adopts an archaeological approach to examine how artificial intelligence shapes, constrains, and produces bodily movement. Drawing from science-fiction novels, the project extracts sentences that describe human-machine interaction. We use large language models (LLMs) to generate paired prompts, "what to do" and "what not to do", for each excerpt. A diffusion-based text-to-video model transforms these prompts into choreographic guides for a human performer and motor commands for a quadruped robot. Both agents enact the actions on a mirrored floor, captured by multi-camera motion tracking and reconstructed into 3D point clouds and motion trails, forming a digital archive of motion traces. Through this process, ReTracing serves as a novel approach to reveal how generative systems encode socio-cultural biases through choreographed movements. Through an immersive interplay of AI, human, and robot, ReTracing confronts a critical question of our time: What does it mean to be human among AIs that also move, think, and leave traces behind?[84] Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models
Sethuraman T,Savya Khosla,Aditi Tiwari,Vidya Ganesh,Rakshana Jayaprakash,Aditya Jain,Vignesh Srinivasakumar,Onkar Kishor Susladkar,Srinidhi Sunkara,Aditya Shanmugham,Rakesh Vaideeswaran,Abbaas Alif Mohamed Nishar,Simon Jenni,Derek Hoiem
Main category: cs.CV
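TL;DR: This paper introduces REVEAL, a diagnostic benchmark whose five controlled stress tests expose fundamental weaknesses of current video-language models (VidLMs) in handling temporal order, perceiving motion, and using video content, and provides a data pipeline that automatically generates diagnostic examples.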
Details
Motivation: To examine whether video-language models robustly account for video content, temporal order, and motion; existing models turn out to have serious biases and blind spots. Method: Build the REVEAL diagnostic benchmark with five stress tests (temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion) and design an automatic data generation pipeline. Result: Leading open- and closed-source VidLMs perform poorly on every test: they describe reversed videos as forward, answer while ignoring video content, accept false claims, struggle with basic camera motion, and fail to aggregate temporal information under simple spatiotemporal occlusion, whereas humans complete these tasks with ease. Conclusion: Current VidLMs fall well short on core video understanding, calling for stricter evaluation benchmarks and more robust modeling. Abstract: This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests, assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information amidst simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.[85] Advancing Digital Twin Generation Through a Novel Simulation Framework and Quantitative Benchmarking
Jacob Rubinstein,Avi Donaty,Don Engel
Main category: cs.CV
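TL;DR: This paper presents a new pipeline that synthesizes images from high-quality 3D models and programmatically generated camera poses, enabling repeatable, quantifiable experiments that evaluate the accuracy of camera parameter estimation and object reconstruction in digital twin modeling.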
Details
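Motivation: Photogrammetry-based 3D modeling involves many design choices when building digital twins, but differences between them are mostly judged subjectively and qualitatively, with no repeatable, quantifiable means of evaluation. Method: Propose and implement a new pipeline that starts from high-fidelity 3D models and, combined with programmatically generated virtual camera poses, synthesizes large numbers of images with exactly known parameters. Result: The pipeline supports large numbers of controlled experiments, using the true camera parameters and object geometry of the virtual scene as ground truth to quantify how accurately reconstruction algorithms estimate viewpoints and objects. Conclusion: The synthetic image pipeline provides a standardized, reproducible, quantitative benchmarking framework for photogrammetry and digital twin modeling, addressing the evaluation gap in conventional practice. Abstract: The generation of 3D models from real-world objects has often been accomplished through photogrammetry, i.e., by taking 2D photos from a variety of perspectives and then triangulating matched point-based features to create a textured mesh. Many design choices exist within this framework for the generation of digital twins, and differences between such approaches are largely judged qualitatively. Here, we present and test a novel pipeline for generating synthetic images from high-quality 3D models and programmatically generated camera poses. This enables a wide variety of repeatable, quantifiable experiments which can compare ground-truth knowledge of virtual camera parameters and of virtual objects against the reconstructed estimations of those perspectives and subjects.[86] Selective Prior Synchronization via SYNC Loss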
Ishan Mishra,Jiajie Li,Deepak Mishra,Jinjun Xiong
Main category: cs.CV
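TL;DR: This paper proposes the SYNC loss, which brings a post-hoc method (softmax response) into the training of SelectiveNet, using the selective prior to improve the selective prediction ability of deep neural networks and achieving state-of-the-art performance on multiple datasets.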
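As an illustration of the post-hoc baseline named in this entry, softmax response can be sketched in a few lines: the classifier's maximum softmax probability serves as a confidence score, and the model abstains when it falls below a threshold. This is a minimal sketch of that baseline only, not the SYNC loss; the function names and the threshold value are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_response(logits, threshold):
    """Return (predicted label or None, confidence).

    The confidence is the maximum softmax probability; the model
    abstains (returns None) when it is below `threshold`.
    `threshold` is a hypothetical hyperparameter for this sketch.
    """
    probs = softmax(logits)
    confidence = max(probs)
    label = probs.index(confidence)
    return (label if confidence >= threshold else None), confidence


def coverage_and_risk(batch_logits, labels, threshold):
    """Coverage = fraction of accepted samples;
    selective risk = error rate on the accepted samples."""
    accepted = 0
    errors = 0
    for logits, y in zip(batch_logits, labels):
        pred, _ = softmax_response(logits, threshold)
        if pred is None:
            continue  # abstained: not counted toward risk
        accepted += 1
        errors += int(pred != y)
    coverage = accepted / len(labels)
    risk = errors / accepted if accepted else 0.0
    return coverage, risk


if __name__ == "__main__":
    batch = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 3.0, 0.0]]
    print(coverage_and_risk(batch, labels=[0, 1, 2], threshold=0.8))
```

Raising the threshold lowers coverage and usually lowers selective risk; sweeping it traces the risk-coverage trade-off on which selective prediction methods are typically compared.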
Details
Motivation: Existing selective prediction methods are either ad-hoc (e.g., SelectiveNet) or post-hoc (e.g., softmax response), but the uncertainty information implicit in post-hoc methods (the selective prior) is used only at inference; the authors argue it matters equally during training. Method: Propose the SYNC loss, which explicitly incorporates the softmax response (representing the selective prior) into SelectiveNet's training objective so that ad-hoc and post-hoc methods are optimized jointly. Result: On benchmark datasets including CIFAR-100, ImageNet-100, and Stanford Cars, the method markedly improves generalization and selective prediction performance, reaching the state of the art. Conclusion: The selective prior should inform training as well as inference; the SYNC loss effectively combines the two families of methods and offers a new paradigm for uncertainty modeling in reliable AI. Abstract: Prediction under uncertainty is a critical requirement for the deep neural network to succeed responsibly. This paper focuses on selective prediction, which allows DNNs to make informed decisions about when to predict or abstain based on the uncertainty level of their predictions. Current methods are either ad-hoc such as SelectiveNet, focusing on how to modify the network architecture or objective function, or post-hoc such as softmax response, achieving selective prediction through analyzing the model's probabilistic outputs. We observe that post-hoc methods implicitly generate uncertainty information, termed the selective prior, which has traditionally been used only during inference. We argue that the selective prior provided by the selection mechanism is equally vital during the training stage. Therefore, we propose the SYNC loss which introduces a novel integration of ad-hoc and post-hoc methods. Specifically, our approach incorporates the softmax response into the training process of SelectiveNet, enhancing its selective prediction capabilities by examining the selective prior. Evaluated across various datasets, including CIFAR-100, ImageNet-100, and Stanford Cars, our method not only enhances the model's generalization capabilities but also surpasses previous works in selective prediction performance, and sets new benchmarks for state-of-the-art performance.[87] MDE-VIO: Enhancing Visual-Inertial Odometry Using Learned Depth Priors
Arda Alniak,Sinan Kalkan,Mustafa Mert Ankarali,Afsar Saranli,Abdullah Aydin Alatan
Main category: cs.CV
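TL;DR: This paper proposes a framework that integrates learned depth priors directly into the VINS-Mono optimization backend, adding affine-invariant depth consistency and ordinal constraints together with variance-based gating that suppresses unstable artifacts, to recover metric scale robustly within the compute limits of edge devices.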
Details
Motivation: Traditional monocular visual-inertial odometry (VIO) degrades in low-texture environments where sparse features are insufficient, while ViT-based dense depth estimation models are accurate but too computationally expensive for real-time deployment on edge devices. Method: Embed learned depth priors in the VINS-Mono optimization framework; introduce affine-invariant depth consistency constraints and pairwise ordinal constraints; filter unreliable depth regions with a variance-driven gating mechanism. Result: Validated on the TartanGround and M3ED datasets, the method significantly improves robustness and accuracy, reducing Absolute Trajectory Error (ATE) by up to 28.3% while meeting real-time requirements on edge devices. Conclusion: Without sacrificing real-time performance, the method improves the localization accuracy and stability of monocular VIO in low-texture scenes, offering a viable path toward lightweight depth-enhanced VIO. Abstract: Traditional monocular Visual-Inertial Odometry (VIO) systems struggle in low-texture environments where sparse visual features are insufficient for accurate pose estimation. To address this, dense Monocular Depth Estimation (MDE) has been widely explored as a complementary information source. While recent Vision Transformer (ViT) based complex foundational models offer dense, geometrically consistent depth, their computational demands typically preclude them from real-time edge deployment. Our work bridges this gap by integrating learned depth priors directly into the VINS-Mono optimization backend. We propose a novel framework that enforces affine-invariant depth consistency and pairwise ordinal constraints, explicitly filtering unstable artifacts via variance-based gating. This approach strictly adheres to the computational limits of edge devices while robustly recovering metric scale. Extensive experiments on the TartanGround and M3ED datasets demonstrate that our method prevents divergence in challenging scenarios and delivers significant accuracy gains, reducing Absolute Trajectory Error (ATE) by up to 28.3%. Code will be made available.[88] Exploring Real-Time Super-Resolution: Benchmarking and Fine-Tuning for Streaming Content
Evgeney Bogatyrev,Khaled Abud,Ivan Molodetskikh,Nikita Alutis,Dmitry Vatolin
Main category: cs.CV
TL;DR: This paper introduces the StreamSR dataset and the EfRLFN model to improve real-time super-resolution for compressed streaming video.
Details
Motivation: Existing real-time super-resolution methods perform poorly on compressed video, and commonly used datasets do not accurately reflect real streaming media, so benchmark evaluations lack practical relevance. Method: Build StreamSR, a diverse dataset of real streaming video sourced from YouTube; propose EfRLFN, a lightweight, efficient real-time model that combines channel attention with a tanh activation function, and design a composite loss function to speed up training convergence; benchmark 11 state-of-the-art models and verify that gains from fine-tuning on StreamSR generalize. Result: EfRLFN surpasses existing real-time models in both visual quality and runtime efficiency; fine-tuning other models on StreamSR significantly improves their performance on several standard benchmarks; all resources (dataset, code, benchmark) are open-sourced. Conclusion: StreamSR fills the data gap for evaluating super-resolution on real streaming media, and EfRLFN offers a stronger architectural design for real-time super-resolution, moving the technology toward practical streaming applications. Abstract: Recent advancements in real-time super-resolution have enabled higher-quality video streaming, yet existing methods struggle with the unique challenges of compressed video content. Commonly used datasets do not accurately reflect the characteristics of streaming media, limiting the relevance of current benchmarks. To address this gap, we introduce a comprehensive dataset - StreamSR - sourced from YouTube, covering a wide range of video genres and resolutions representative of real-world streaming scenarios. We benchmark 11 state-of-the-art real-time super-resolution models to evaluate their performance for the streaming use-case. Furthermore, we propose EfRLFN, an efficient real-time model that integrates Efficient Channel Attention and a hyperbolic tangent activation function - a novel design choice in the context of real-time super-resolution. We extensively optimized the architecture to maximize efficiency and designed a composite loss function that improves training convergence. EfRLFN combines the strengths of existing architectures while improving both visual quality and runtime performance. Finally, we show that fine-tuning other models on our dataset results in significant performance gains that generalize well across various standard benchmarks. We made the dataset, the code, and the benchmark available at https://github.com/EvgeneyBogatyrev/EfRLFN.[89] ArtContext: Contextualizing Artworks with Open-Access Art History Articles and Wikidata Knowledge through a LoRA-Tuned CLIP Model
Samuel Waugh,Stuart James
Main category: cs.CV
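TL;DR: This paper proposes ArtContext, a pipeline that uses weak supervision and LoRA to adapt a CLIP model (PaintingCLIP) so that artworks are automatically linked to art history literature and Wikidata knowledge, improving contextual understanding of artworks.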
Details
Motivation: Art history articles often discuss artworks as a whole as well as specific aspects (such as layout, iconography, or material culture), but it is hard to locate quickly by hand what different articles say about the same work; an automated way of linking artworks to knowledge in the literature is needed. Method: Build a new corpus combining open-access art history articles with Wikidata knowledge; on this corpus, fine-tune CLIP with Low-Rank Adaptation (LoRA) to obtain the domain-specific PaintingCLIP model. Result: Under weak supervision, PaintingCLIP outperforms the original CLIP and effectively provides literature and knowledge context for a given artwork; the ArtContext pipeline generalizes across humanities disciplines. Conclusion: ArtContext gives art history research a scalable, reusable knowledge annotation framework and advances multimodal semantic understanding in the digital humanities. Abstract: Many Art History articles discuss artworks in general as well as specific parts of works, such as layout, iconography, or material culture. However, when viewing an artwork, it is not trivial to identify what different articles have said about the piece. Therefore, we propose ArtContext, a pipeline for taking a corpus of Open-Access Art History articles and Wikidata Knowledge and annotating Artworks with this information. We do this using a novel corpus collection pipeline, then learn a bespoke CLIP model adapted using Low-Rank Adaptation (LoRA) to make it domain-specific. We show that the new model, PaintingCLIP, which is weakly supervised by the collected corpus, outperforms CLIP and provides context for a given artwork. The proposed pipeline is generalisable and can be readily applied to numerous humanities areas.[90] Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation
Alan Baade,Eric Ryan Chan,Kyle Sargent,Changan Chen,Justin Johnson,Ehsan Adeli,Li Fei-Fei
Main category: cs.CV
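TL;DR: This paper proposes Latent Forcing, which keeps the efficiency of latent diffusion models while operating directly on raw images: latents and pixels are processed jointly with separately tuned noise schedules, improving pixel-space generation quality.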
Details
Motivation: Existing latent diffusion models generate high-quality images but discard information, require a separately trained decoder, and model an auxiliary distribution, so they cannot be modeled end to end. Method: Propose Latent Forcing, a simple modification to existing architectures that jointly processes latents and pixels and orders the denoising trajectory with separately tuned noise schedules, letting the latents act as a scratchpad for intermediate computation. Result: On ImageNet, Latent Forcing sets a new state of the art for diffusion transformer-based pixel generation at the same compute scale. Conclusion: Latent Forcing combines latent-space efficiency with the advantages of pixel-space modeling, shows that the order of conditioning signals is critical to generation performance, and offers a new perspective on how tokenizer reconstruction quality relates to diffusability. Abstract: Latent diffusion models excel at generating high-quality images but lose the benefits of end-to-end modeling. They discard information during image encoding, require a separately trained decoder, and model an auxiliary distribution to the raw data. In this paper, we propose Latent Forcing, a simple modification to existing architectures that achieves the efficiency of latent diffusion while operating on raw natural images. Our approach orders the denoising trajectory by jointly processing latents and pixels with separately tuned noise schedules. This allows the latents to act as a scratchpad for intermediate computation before high-frequency pixel features are generated. We find that the order of conditioning signals is critical, and we analyze this to explain differences between REPA distillation in the tokenizer and the diffusion model, conditional versus unconditional generation, and how tokenizer reconstruction quality relates to diffusability. Applied to ImageNet, Latent Forcing achieves a new state-of-the-art for diffusion transformer-based pixel generation at our compute scale.[91] Fighting MRI Anisotropy: Learning Multiple Cardiac Shapes From a Single Implicit Neural Representation
Carolina Brás,Soufiane Ben Haddou,Thijs P. Kuipers,Laura Alvarez-Florez,R. Nils Planken,Fleur V. Y. Tjong,Connie Bezzina,Ivana Išgum
Main category: cs.CV
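TL;DR: This paper proposes training a single neural implicit function on high-resolution, near-isotropic CTA data to reconstruct cardiac structures (RV and MYO) from low-resolution, anisotropic short-axis CMRI, and validates its accuracy on 4CH slices.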
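The two evaluation metrics reported for this entry, the Dice similarity coefficient and the Hausdorff distance, can be sketched on small binary masks as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: masks are represented as sets of (row, col) foreground pixels, the Hausdorff distance is computed over all foreground pixels rather than contours, and `spacing` is a hypothetical isotropic pixel size in mm.

```python
import math


def dice(a, b):
    """Dice similarity coefficient between two sets of foreground pixels."""
    if not a and not b:
        return 1.0  # convention: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`.

    Both sets must be non-empty (max/min raise ValueError otherwise).
    """
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b, spacing=1.0):
    """Symmetric Hausdorff distance, scaled by an assumed isotropic
    pixel spacing (mm per pixel) to give a result in mm."""
    return spacing * max(directed_hausdorff(a, b), directed_hausdorff(b, a))


if __name__ == "__main__":
    # Two 2x2 squares that overlap in one column.
    reference = {(0, 0), (0, 1), (1, 0), (1, 1)}
    predicted = {(0, 1), (1, 1), (0, 2), (1, 2)}
    print(dice(reference, predicted))
    print(hausdorff(reference, predicted, spacing=1.5))
```

The brute-force nearest-point search is quadratic in the number of pixels, which is adequate for a single 2D slice; larger volumes would normally use a distance transform instead.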
Details
Motivation: Short-axis CMRI is anisotropic, which limits the precision of cardiac shape analysis; high-resolution, near-isotropic CTA data can provide more accurate anatomical priors, but corresponding high-resolution SAX annotations are unavailable. Method: Train a neural implicit function on CTA data to jointly represent cardiac shapes from CMRI at any resolution; reconstruct the right ventricle (RV) and myocardium (MYO, including the left-ventricular endocardial and epicardial surfaces); evaluate by extracting 4CH slices from the reconstructed shapes and comparing them with reference CMRI segmentations. Result: On 4CH slices, RV and MYO achieve Dice coefficients of 0.91 ± 0.07 and 0.75 ± 0.13 and Hausdorff distances of 6.21 ± 3.97 mm and 7.53 ± 5.13 mm, respectively; both qualitative and quantitative results indicate that the reconstructed shapes are accurate, smooth, and anatomically plausible. Conclusion: The method effectively uses CTA priors to improve cardiac shape reconstruction from CMRI, offering a new approach to cardiac shape analysis in anisotropic imaging. Abstract: The anisotropic nature of short-axis (SAX) cardiovascular magnetic resonance imaging (CMRI) limits cardiac shape analysis. To address this, we propose to leverage near-isotropic, higher resolution computed tomography angiography (CTA) data of the heart. We use this data to train a single neural implicit function to jointly represent cardiac shapes from CMRI at any resolution. We evaluate the method for the reconstruction of right ventricle (RV) and myocardium (MYO), where MYO simultaneously models endocardial and epicardial left-ventricle surfaces. Since high-resolution SAX reference segmentations are unavailable, we evaluate performance by extracting a 4-chamber (4CH) slice of RV and MYO from their reconstructed shapes. When compared with the reference 4CH segmentation masks from CMRI, our method achieved a Dice similarity coefficient of 0.91 ± 0.07 and 0.75 ± 0.13, and a Hausdorff distance of 6.21 ± 3.97 mm and 7.53 ± 5.13 mm for RV and MYO, respectively. Quantitative and qualitative assessments demonstrate the model's ability to reconstruct accurate, smooth and anatomically plausible shapes, supporting improvements in cardiac shape analysis.[92] Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation
Penghui Ruan,Bojia Zi,Xianbiao Qi,Youze Huang,Rong Xiao,Pichao Wang,Jiannong Cao,Yuhui Shi
Main category: cs.CV
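TL;DR: This paper proposes Ctrl&Shift, an end-to-end diffusion framework without explicit 3D modeling that uses a two-stage decomposition (object removal plus reference-guided inpainting under camera pose control) and multi-task, multi-stage training to achieve geometry-consistent, background-preserving, user-controllable object-level image and video editing.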
Details
Motivation: Existing methods struggle to satisfy three goals at once: background preservation, geometric consistency under viewpoint changes, and user-controllable transformations. Geometry-based methods offer precise control but depend on explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize well but lack fine-grained geometric control. Method: Propose the Ctrl&Shift framework: 1) decompose manipulation into two stages, object removal and reference-guided inpainting under camera pose control; 2) model both jointly within a unified diffusion process; 3) design a multi-task, multi-stage training strategy that disentangles background, identity, and pose signals; 4) build a large-scale real-world dataset of paired images and videos with estimated relative camera poses. Result: State-of-the-art fidelity, viewpoint consistency, and controllability; according to the authors, the first framework to unify fine-grained geometric control with real-world generalization without relying on any explicit 3D modeling. Conclusion: Ctrl&Shift offers a new paradigm for object-level image and video editing that needs no 3D priors and combines high controllability with strong generalization, suited to practical settings such as film post-production, AR, and creative editing. Abstract: Object-level manipulation, relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present Ctrl&Shift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that Ctrl&Shift achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.[93] Enhanced Portable Ultra Low-Field Diffusion Tensor Imaging with Bayesian Artifact Correction and Deep Learning-Based Super-Resolution
Mark D. Olchanyi,Annabel Sorby-Adams,John Kirsch,Brian L. Edlow,Ava Farnan,Renfei Liu,Matthew S. Rosen,Emery N. Brown,W. Taylor Kimberly,Juan Eugenio Iglesias
Main category: cs.CV
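TL;DR: This paper presents a nine-direction, single-shell acquisition sequence for ultra-low-field (ULF) diffusion tensor imaging (DTI), together with a Bayesian bias field correction algorithm with angular dependence and DiffSR, a generalizable convolutional neural network super-resolution algorithm that needs no retraining. Together they markedly improve the spatial and angular resolution, signal-to-noise ratio, and recovery of white matter microstructure in ULF DTI, and their effectiveness is validated on an Alzheimer's disease classification task.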
Details
Motivation: Ultra-low-field (ULF) MRI is portable and could widen access, but its DTI sequences suffer from low spatial and angular resolution, poor signal-to-noise ratio, long scan times, and severe artifacts spanning both domains, so dedicated modeling and reconstruction methods are needed. Method: Propose a nine-direction, single-shell ULF DTI acquisition sequence; design a Bayesian bias field correction algorithm with angular dependence; develop DiffSR, a general CNN-based super-resolution algorithm that needs no retraining on new datasets. Result: In a synthetic downsampling experiment and on real, matched ULF and high-field DTI data, the algorithms recover white matter microstructural and volumetric information; applying DiffSR directly to synthetically degraded data for Alzheimer's disease classification markedly improves the agreement of DTI metrics with the original high-field data. Conclusion: The proposed sequence and algorithms (DiffSR in particular) address key bottlenecks of ULF DTI and generalize across datasets; the code is released to advance ULF reconstruction and DTI harmonization. Abstract: Portable, ultra-low-field (ULF) magnetic resonance imaging has the potential to expand access to neuroimaging but currently suffers from coarse spatial and angular resolutions and low signal-to-noise ratios. Diffusion tensor imaging (DTI), a sequence tailored to detect and reconstruct white matter tracts within the brain, is particularly prone to such imaging degradation due to inherent sequence design coupled with prolonged scan times. In addition, ULF DTI scans exhibit artifacting that spans both the space and angular domains, requiring a custom modelling algorithm for subsequent correction. We introduce a nine-direction, single-shell ULF DTI sequence, as well as a companion Bayesian bias field correction algorithm that possesses angular dependence and a convolutional neural network-based superresolution algorithm that is generalizable across DTI datasets and does not require re-training ("DiffSR"). We show through a synthetic downsampling experiment and white matter assessment in real, matched ULF and high-field DTI scans that these algorithms can recover microstructural and volumetric white matter information at ULF. We also show that DiffSR can be directly applied to white matter-based Alzheimer's disease classification in synthetically degraded scans, with notable improvements in agreement between DTI metrics, as compared to un-degraded scans. We freely disseminate the Bayesian bias correction algorithm and DiffSR with the goal of furthering progress on both ULF reconstruction methods and general DTI sequence harmonization. We release all code related to DiffSR for public use at https://github.com/markolchanyi/DiffSR.[94] A Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness
Yun-Cheng Li,Sen Lei,Heng-Chao Li,Ke Li
Main category: cs.CV
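TL;DR: This paper proposes DBTANet, a dual-branch semantic change detection framework that combines a frozen SAM branch (global semantics and boundary priors) with a ResNet34 branch (local detail) and adds a Bidirectional Temporal Awareness Module (BTAM) and a Gaussian-smoothed Projection Module (GSPM) to sharpen boundaries and strengthen temporal modeling, reaching state-of-the-art performance on two public datasets.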
Details
Motivation: Existing semantic change detection methods suffer from blurred boundaries and inadequate temporal modeling, which limits segmentation accuracy. Method: Propose a dual-branch Siamese encoder (frozen SAM branch plus ResNet34 branch), a Bidirectional Temporal Awareness Module (BTAM), and a Gaussian-smoothed Projection Module (GSPM), which together integrate global semantics, local detail, temporal dependencies, and boundary awareness. Result: State-of-the-art performance on two public benchmarks, with markedly better segmentation accuracy and boundary sharpness for changed regions. Conclusion: By integrating multiple sources of information in a structured way, DBTANet achieves more robust and precise change localization and classification in semantic change detection. Abstract: Semantic Change Detection (SCD) aims to detect and categorize land-cover changes from bi-temporal remote sensing images. Existing methods often suffer from blurred boundaries and inadequate temporal modeling, limiting segmentation accuracy. To address these issues, we propose a Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness, termed DBTANet. Specifically, we utilize a dual-branch Siamese encoder where a frozen SAM branch captures global semantic context and boundary priors, while a ResNet34 branch provides local spatial details, ensuring complementary feature representations. On this basis, we design a Bidirectional Temporal Awareness Module (BTAM) to aggregate multi-scale features and capture temporal dependencies in a symmetric manner. Furthermore, a Gaussian-smoothed Projection Module (GSPM) refines shallow SAM features, suppressing noise while enhancing edge information for boundary-aware constraints. Extensive experiments on two public benchmarks demonstrate that DBTANet effectively integrates global semantics, local details, temporal reasoning, and boundary awareness, achieving state-of-the-art performance.[95] Arbitrary Ratio Feature Compression via Next Token Prediction
Yufan Liu,Daoyuan Ren,Zhipeng Zhang,Wenyang Luo,Bing Li,Weiming Hu,Stephen Maybank
Main category: cs.CV
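TL;DR: This paper proposes ARFC, an arbitrary-ratio feature compression framework in which a single model supports any compression ratio without retraining. Its core ARC module generates compressed tokens autoregressively, the MoS module improves robustness, and the ERGC module preserves semantic structure; experiments show that it outperforms existing methods on multiple tasks and in some cases even surpasses the original uncompressed features.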
Details
Motivation: Existing feature compression methods usually need a dedicated model per compression ratio, so they lack flexibility and generalization and must be retrained for each new ratio. Method: Propose the ARFC framework, built around the autoregressive Arbitrary Ratio Compressor (ARC), which sets the compression ratio by controlling the number of generated tokens; add a Mixture of Solutions (MoS) module to improve compression quality and robustness, and an Entity Relation Graph Constraint (ERGC) that preserves semantic and structural relationships during training. Result: Across multiple tasks and datasets, including cross-modal retrieval, image classification, and image retrieval, ARFC significantly outperforms existing methods at all tested compression ratios and in some cases even exceeds the original uncompressed features. Conclusion: ARFC is an efficient, flexible, general feature compression method suited to resource-constrained applications, removing the fixed ratios and repeated training of conventional approaches. Abstract: Feature compression is increasingly important for improving the efficiency of downstream tasks, especially in applications involving large-scale or multi-modal data. While existing methods typically rely on dedicated models for achieving specific compression ratios, they are often limited in flexibility and generalization. In particular, retraining is necessary when adapting to a new compression ratio. To address this limitation, we propose a novel and flexible Arbitrary Ratio Feature Compression (ARFC) framework, which supports any compression ratio with a single model, eliminating the need for multiple specialized models. At its core, the Arbitrary Ratio Compressor (ARC) is an auto-regressive model that performs compression via next-token prediction. This allows the compression ratio to be controlled at inference simply by adjusting the number of generated tokens. To enhance the quality of the compressed features, two key modules are introduced. The Mixture of Solutions (MoS) module refines the compressed tokens by utilizing multiple compression results (solutions), reducing uncertainty and improving robustness. The Entity Relation Graph Constraint (ERGC) is integrated into the training process to preserve semantic and structural relationships during compression. Extensive experiments on cross-modal retrieval, image classification, and image retrieval tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches at various compression ratios. Notably, in some cases, it even surpasses the performance of the original, uncompressed features. These results validate the effectiveness and versatility of ARFC for practical, resource-constrained scenarios.[96] What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation
Zhenlong Yuan,Xiangyan Qu,Jing Tang,Rui Chen,Lei Sun,Ruidong Chen,Hongwei Yu,Chengxuan Qian,Xiangxiang Chu,Shuo Li,Yuyin Zhou
Main category: cs.CV
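TL;DR: This paper proposes ImagineAgent, a framework that combines cognitive reasoning with generative imagination to address cross-modal hallucination and occlusion-induced ambiguity in open-vocabulary human-object interaction (OV-HOI), reaching state-of-the-art results on SWIG-HOI and HICO-DET with only about 20% of the training data.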
Details
Motivation: In open-vocabulary human-object interaction (OV-HOI), existing multimodal large language models are limited by cross-modal hallucinations and occlusion-induced semantic ambiguity, which prevents robust visual understanding. Method: Propose the ImagineAgent agentic framework: 1) build cognitive maps that explicitly model relationships between entities and candidate actions; 2) dynamically invoke tools such as retrieval augmentation, image cropping, and diffusion models to gather domain knowledge and visual evidence; 3) design a composite reward that balances prediction accuracy and tool efficiency. Result: State-of-the-art performance on the SWIG-HOI and HICO-DET datasets with only about 20% of the training data, supporting the method's robustness and efficiency. Conclusion: An agentic paradigm that couples cognitive reasoning with generative imagination can mitigate cross-modal inconsistency and missing local information in OV-HOI, suggesting a new direction for multimodal embodied reasoning. Abstract: Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose ImagineAgent, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method innovatively constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. Subsequently, it dynamically invokes tools including retrieval augmentation, image cropping, and diffusion models to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on SWIG-HOI and HICO-DET datasets demonstrate our SOTA performance, requiring approximately 20% of training data compared to existing methods, validating our robustness and efficiency.[97] Vascular anatomy-aware self-supervised pre-training for X-ray angiogram analysis
De-Xing Huang,Chaohui Yu,Xiao-Hu Zhou,Tian-Yu Xiang,Qin-Yi Zhang,Mei-Jiang Gui,Rui-Ze Ma,Chen-Yu Wang,Nu-Fang Xiao,Fan Wang,Zeng-Guang Hou
Main category: cs.CV
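TL;DR: This paper proposes VasoMIM, a vascular anatomy-aware masked image modeling framework, and builds XA-170K, the largest X-ray angiogram pre-training dataset to date, significantly improving performance on downstream tasks.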
Details
Motivation: X-ray angiogram analysis is held back by the scarcity of annotated data, and existing self-supervised learning methods lack both an effective framework and a large-scale dataset for this domain. Method: Propose the VasoMIM framework, comprising an anatomy-guided masking strategy (which preferentially masks vessel-containing regions) and an anatomical consistency loss (which keeps vessel structure consistent in the reconstructed image); also build the large-scale XA-170K dataset. Result: Validated on four downstream tasks across six datasets, VasoMIM shows strong transferability and state-of-the-art performance. Conclusion: VasoMIM has the potential to serve as a foundation model for X-ray angiogram analysis and to advance the field. Abstract: X-ray angiography is the gold standard imaging modality for cardiovascular diseases. However, current deep learning approaches for X-ray angiogram analysis are severely constrained by the scarcity of annotated data. While large-scale self-supervised learning (SSL) has emerged as a promising solution, its potential in this domain remains largely unexplored, primarily due to the lack of effective SSL frameworks and large-scale datasets. To bridge this gap, we introduce a vascular anatomy-aware masked image modeling (VasoMIM) framework that explicitly integrates domain-specific anatomical knowledge. Specifically, VasoMIM comprises two key designs: an anatomy-guided masking strategy and an anatomical consistency loss. The former strategically masks vessel-containing patches to compel the model to learn robust vascular semantics, while the latter preserves structural consistency of vessels between original and reconstructed images, enhancing the discriminability of the learned representations. In conjunction with VasoMIM, we curate XA-170K, the largest X-ray angiogram pre-training dataset to date. We validate VasoMIM on four downstream tasks across six datasets, where it demonstrates superior transferability and achieves state-of-the-art performance compared to existing methods. These findings highlight the significant potential of VasoMIM as a foundation model for advancing a wide range of X-ray angiogram analysis tasks. VasoMIM and XA-170K will be available at https://github.com/Dxhuang-CASIA/XA-SSL.[98] Supervise-assisted Multi-modality Fusion Diffusion Model for PET Restoration
Yingkai Zhang,Shuang Chen,Ye Tian,Yunyi Gao,Jianyong Jiang,Ying Fu
Main category: cs.CV
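TL;DR: This paper proposes MFdiff, a supervise-assisted multi-modality fusion diffusion model that uses MR images to help restore low-dose PET images. A multi-modality feature fusion module and a two-stage supervised learning strategy mitigate cross-modal structure and texture inconsistencies and the mismatch on out-of-distribution (OOD) data, markedly improving PET restoration quality.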
Details
Motivation: Lowering the radiation dose of a PET scan degrades image quality; using MR images to help restore standard-dose PET faces two challenges, cross-modal inconsistencies in structure and texture and the mismatch on out-of-distribution (OOD) data. Method: Propose the supervise-assisted multi-modality fusion diffusion model (MFdiff): 1) design a multi-modality feature fusion module that optimizes the fusion of MR and PET features; 2) use the fused feature as a condition for iterative generation with a diffusion model; 3) adopt a two-stage supervised learning strategy that combines generalized priors from simulated data with specific priors from real OOD data. Result: MFdiff outperforms state-of-the-art methods both qualitatively and quantitatively and effectively restores high-quality standard-dose PET images from multi-modality inputs. Conclusion: Through its multi-modality fusion mechanism and staged supervised learning strategy, MFdiff addresses cross-modal inconsistency and OOD generalization in low-dose PET restoration, offering a promising option for clinical low-dose PET imaging. Abstract: Positron emission tomography (PET) offers powerful functional imaging but involves radiation exposure. Efforts to reduce this exposure by lowering the radiotracer dose or scan time can degrade image quality. While using magnetic resonance (MR) images with clearer anatomical information to restore standard-dose PET (SPET) from low-dose PET (LPET) is a promising approach, it faces challenges with the inconsistencies in the structure and texture of multi-modality fusion, as well as the mismatch in out-of-distribution (OOD) data. In this paper, we propose a supervise-assisted multi-modality fusion diffusion model (MFdiff) for addressing these challenges for high-quality PET restoration. Firstly, to fully utilize auxiliary MR images without introducing extraneous details in the restored image, a multi-modality feature fusion module is designed to learn an optimized fusion feature. Secondly, using the fusion feature as an additional condition, high-quality SPET images are iteratively generated based on the diffusion model. Furthermore, we introduce a two-stage supervise-assisted learning strategy that harnesses both generalized priors from simulated in-distribution datasets and specific priors tailored to in-vivo OOD data. Experiments demonstrate that the proposed MFdiff effectively restores high-quality SPET images from multi-modality inputs and outperforms state-of-the-art methods both qualitatively and quantitatively.[99] Perception-based Image Denoising via Generative Compression
Nam Nguyen,Thinh Nguyen,Bella Bose
Main category: cs.CV
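TL;DR: This paper proposes a perception-based image denoising framework built on generative compression, in which entropy-coded latent representations and a generative decoder (such as a WGAN or a diffusion model) jointly optimize perceptual quality and distortion, backed by theoretical guarantees.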
Details
Motivation: Conventional distortion-driven denoising methods tend to over-smooth under strong noise and distribution shift, making it hard to achieve both structural fidelity and perceptual realism. Method: Build a generative compression framework that reconstructs from entropy-coded, low-complexity latent representations and uses generative decoders guided by LPIPS loss and Wasserstein distance to recover texture; the concrete instantiations are a conditional WGAN compression denoiser and a conditional diffusion reconstruction strategy; non-asymptotic guarantees are given for the compression-based maximum-likelihood denoiser under additive Gaussian noise. Result: Experiments on synthetic and real-noise datasets show consistent gains in perceptual quality while distortion performance (e.g., PSNR, SSIM) remains competitive. Conclusion: The generative compression paradigm balances the three objectives of rate, distortion, and perception, offering a new approach and theoretical support for perception-driven image restoration. Abstract: Image denoising aims to remove noise while preserving structural details and perceptual realism, yet distortion-driven methods often produce over-smoothed reconstructions, especially under strong noise and distribution shift. This paper proposes a generative compression framework for perception-based denoising, where restoration is achieved by reconstructing from entropy-coded latent representations that enforce low-complexity structure, while generative decoders recover realistic textures via perceptual measures such as learned perceptual image patch similarity (LPIPS) loss and Wasserstein distance. Two complementary instantiations are introduced: (i) a conditional Wasserstein GAN (WGAN)-based compression denoiser that explicitly controls the rate-distortion-perception (RDP) trade-off, and (ii) a conditional diffusion-based reconstruction strategy that performs iterative denoising guided by compressed latents. We further establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise, including bounds on reconstruction error and decoding error probability. Experiments on synthetic and real-noise benchmarks demonstrate consistent perceptual improvements while maintaining competitive distortion performance.[100] LUVE : Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts
Chen Zhao,Jiawei Chen,Hongyu Li,Zhuoliang Kang,Shilin Lu,Xiaoming Wei,Kai Zhang,Jian Yang,Ying Tai
Main category: cs.CV
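TL;DR: LUVE is a latent-cascaded ultra-high-resolution video generation framework built on dual frequency experts; its three-stage architecture tackles motion modeling, semantic planning, and detail synthesis.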
Details
Motivation: Ultra-high-resolution (UHR) video generation faces compounded challenges in motion modeling, semantic planning, and detail synthesis, and existing video diffusion models struggle to deliver both quality and efficiency.
Method: Proposes the LUVE framework: stage one generates low-resolution, motion-consistent latents; stage two upsamples the video latents in latent space to reduce compute and memory overhead; stage three combines low-frequency (semantic) and high-frequency (detail) experts to jointly refine content fidelity and detail realism.
Result: LUVE achieves better photorealism and content fidelity in UHR video generation, and ablation studies validate each module.
Conclusion: With its staged, frequency-decoupled latent-space design, LUVE balances motion consistency, semantic coherence, and detail richness in UHR video generation, offering a new paradigm for high-quality long video generation.
Abstract: Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project is available at https://unicornanrocinu.github.io/LUVE_web/.
[101] Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception
Zesheng Jia,Jin Wang,Siao Liu,Lingzhi Li,Ziyao Huang,Yunjiang Xu,Jianping Wang
Main category: cs.CV
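TL;DR: This paper proposes FlowAdapt, a parameter-efficient domain adaptation framework for multi-agent collaborative perception grounded in optimal transport theory. Wasserstein Greedy Sampling and a Progressive Knowledge Transfer module address redundancy in sensory streams and semantic degradation, achieving SOTA performance while training only 1% of the parameters.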
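To make the idea of filtering redundant samples "via a bounded covering radius" concrete, here is a minimal sketch of greedy covering selection. It is not the paper's Wasserstein Greedy Sampling: it assumes each sample is already a feature vector and swaps in plain Euclidean distance for the Wasserstein distance. The function name `greedy_cover` and its parameters are hypothetical.

```python
import numpy as np

def greedy_cover(features, radius):
    """Select indices so every sample lies within `radius` of a selected one.

    Farthest-point (k-center style) greedy selection on Euclidean distance,
    used here as a stand-in for a Wasserstein-based criterion.
    """
    features = np.asarray(features, dtype=float)
    selected = [0]
    # Distance from every sample to its nearest selected sample.
    dist = np.linalg.norm(features - features[0], axis=1)
    while dist.max() > radius:
        nxt = int(dist.argmax())  # the sample that is currently covered worst
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return selected
```

Samples that fall inside the radius of an already selected sample are never picked, which is the sense in which near-duplicate frames would be dropped.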
Details
Motivation: Fast domain adaptation is a core challenge for deploying multi-agent systems in V2X collaborative perception; existing PEFT methods suffer performance degradation and training instability in multi-agent settings.
Method: Proposes FlowAdapt, grounded in optimal transport theory: 1) a Wasserstein Greedy Sampling strategy filters redundant samples via a bounded covering radius; 2) a Progressive Knowledge Transfer module progressively injects compressed early-stage representations into later layers through learnable pathways, alleviating deep-layer semantic degradation.
Result: Validated on three benchmarks, FlowAdapt reaches SOTA performance with only 1% trainable parameters, markedly improving sample efficiency and cross-domain generalization.
Conclusion: FlowAdapt addresses redundancy and semantic degradation in multi-agent PEFT, providing an efficient, stable, lightweight domain adaptation paradigm for V2X collaborative perception.
Abstract: Fast domain adaptation remains a fundamental challenge for deploying multi-agent systems across diverse environments in Vehicle-to-Everything (V2X) collaborative perception. Despite the success of Parameter-Efficient Fine-Tuning (PEFT) in natural language processing and conventional vision tasks, directly applying PEFT to multi-agent settings leads to significant performance degradation and training instability. In this work, we conduct a detailed analysis and identify two key factors: (i) inter-frame redundancy in heterogeneous sensory streams, and (ii) erosion of fine-grained semantics in deep-layer representations under PEFT adaptation. To address these issues, we propose FlowAdapt, a parameter-efficient framework grounded in optimal transport theory, which minimizes information transport costs across both data distributions and network hierarchies. Specifically, we introduce a Wasserstein Greedy Sampling strategy to selectively filter redundant samples via a bounded covering radius. Furthermore, a Progressive Knowledge Transfer module is designed to progressively inject compressed early-stage representations into later stages through learnable pathways, alleviating semantic degradation in late-stage adaptation. Extensive experiments on three benchmarks demonstrate that FlowAdapt achieves state-of-the-art performance with only 1% of trainable parameters, effectively bridging domain gaps with superior sample efficiency and generalization.
[102] A Large Language Model for Disaster Structural Reconnaissance Summarization
Yuqing Gao,Guanren Zhou,Khalid M. Mosalam
Main category: cs.CV
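TL;DR: This paper proposes an LLM-based Disaster Reconnaissance Summarization framework (LLM-DRS) that fuses vision data with text metadata, extracts structural damage attributes with deep convolutional neural networks, and uses an LLM to generate structured post-disaster assessment reports, improving rapid post-earthquake reconnaissance and resilience building.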
Details
Motivation: Existing vision-based structural health monitoring methods output only discrete results (such as damage classes and coordinates) that require further manual analysis; the rise of LLMs opens a new path to automatically generating readable, decision-ready post-disaster assessment reports.
Method: Builds a standardized reconnaissance process that collects images and text metadata; deep convolutional neural networks extract key attributes such as damage state, material type, and damage level; the structured attributes and metadata are fed to an LLM with engineered prompts to generate summary reports for individual structures or regions.
Result: LLM-DRS automatically generates high-quality post-disaster reconnaissance summary reports; the results indicate feasibility and effectiveness for rapid post-disaster assessment and help improve the resilience of the built environment.
Conclusion: Integrating LLMs into vision-based SHM systems, particularly for rapid post-disaster reconnaissance, is an important direction for improving intelligent infrastructure assessment and decision support.
Abstract: Artificial Intelligence (AI)-aided vision-based Structural Health Monitoring (SHM) has emerged as an effective approach for monitoring and assessing structural condition by analyzing image and video data. By integrating Computer Vision (CV) and Deep Learning (DL), vision-based SHM can automatically identify and localize visual patterns associated with structural damage. However, previous works typically generate only discrete outputs, such as damage class labels and damage region coordinates, requiring engineers to further reorganize and analyze these results for evaluation and decision-making. In late 2022, Large Language Models (LLMs) became popular across multiple fields, providing new insights into AI-aided vision-based SHM. In this study, a novel LLM-based Disaster Reconnaissance Summarization (LLM-DRS) framework is proposed. It introduces a standard reconnaissance plan in which the collection of vision data and corresponding metadata follows a well-designed on-site investigation process. Text-based metadata and image-based vision data are then processed and integrated into a unified format, where well-trained Deep Convolutional Neural Networks extract key attributes, including damage state, material type, and damage level. Finally, all data are fed into an LLM with carefully designed prompts, enabling the LLM-DRS to generate summary reports for individual structures or affected regions based on aggregated attributes and metadata.
Results show that integrating LLMs into vision-based SHM, particularly for rapid post-disaster reconnaissance, demonstrates promising potential for improving resilience of the built environment through effective reconnaissance.
[103] PLOT-CT: Pre-log Voronoi Decomposition Assisted Generation for Low-dose CT Reconstruction
Bin Huang,Xun Yu,Yikun Zhang,Yi Zhang,Yang Chen,Qiegen Liu
Main category: cs.CV
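TL;DR: This paper proposes PLOT-CT, a framework that applies Voronoi decomposition to sinograms before the logarithmic transformation, explicitly separating the components of the pre-log data to improve low-dose CT reconstruction accuracy.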
Details
Motivation: Existing LDCT reconstruction methods mostly operate in the image domain or the post-log projection domain; they cannot fully exploit structural information in pre-log measurements, and the logarithmic transformation markedly amplifies noise, limiting reconstruction accuracy.
Method: Proposes PLOT-CT: Voronoi decomposition is applied to pre-log sinograms, decoupling the data into independent components in multiple latent spaces to strengthen feature discriminability and suppress noise.
Result: At the 1e4 incident photon level, PLOT-CT improves PSNR by 2.36 dB over traditional methods in the pre-log domain, reaching SOTA performance.
Conclusion: Explicit structural decomposition in the pre-log domain mitigates noise amplification and improves low-dose CT reconstruction quality, demonstrating the potential of pre-log modeling.
Abstract: Low-dose computed tomography (LDCT) reconstruction is fundamentally challenged by severe noise and compromised data fidelity under reduced radiation exposure. Most existing methods operate either in the image or post-log projection domain, which fails to fully exploit the rich structural information in pre-log measurements while being highly susceptible to noise. The requisite logarithmic transformation critically amplifies noise within these data, imposing exceptional demands on reconstruction precision. To overcome these challenges, we propose PLOT-CT, a novel framework for Pre-Log vOronoi decomposiTion-assisted CT generation. Our method begins by applying Voronoi decomposition to pre-log sinograms, disentangling the data into distinct underlying components, which are embedded in separate latent spaces. This explicit decomposition significantly enhances the model's capacity to learn discriminative features, directly improving reconstruction accuracy by mitigating noise and preserving information inherent in the pre-log domain. Extensive experiments demonstrate that PLOT-CT achieves state-of-the-art performance, attaining a 2.36 dB PSNR improvement over traditional methods at the 1e4 incident photon level in the pre-log domain.
[104] PLESS: Pseudo-Label Enhancement with Spreading Scribbles for Weakly Supervised Segmentation
Yeva Gabrielyan,Varduhi Yeghiazaryan,Irina Voiculescu
Main category: cs.CV
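TL;DR: This paper proposes PLESS, a generic pseudo-label enhancement strategy that improves the reliability and spatial consistency of pseudo-labels through hierarchical spatial region partitioning and scribble propagation within semantically coherent regions, thereby improving weakly supervised medical image segmentation.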
Details
Motivation: Scribble annotation lowers labeling cost, but its inherent noise and incompleteness limit pseudo-label quality and thus segmentation performance.
Method: PLESS partitions the image hierarchically into spatially coherent regions and propagates scribble information within semantically coherent regions to refine pseudo-labels; the method is model-agnostic and integrates into existing pseudo-label frameworks.
Result: On two cardiac MRI datasets, ACDC and MSCMRseg, PLESS consistently improves segmentation accuracy across four scribble-supervised algorithms.
Conclusion: PLESS is an effective, generic, easily integrated pseudo-label enhancement strategy that alleviates low pseudo-label quality under scribble supervision and improves the robustness and accuracy of weakly supervised medical image segmentation.
Abstract: Weakly supervised learning with scribble annotations uses sparse user-drawn strokes to indicate segmentation labels on a small subset of pixels. This annotation reduces the cost of dense pixel-wise labeling, but suffers inherently from noisy and incomplete supervision. Recent scribble-based approaches in medical image segmentation address this limitation using pseudo-label-based training; however, the quality of the pseudo-labels remains a key performance limit. We propose PLESS, a generic pseudo-label enhancement strategy which improves reliability and spatial consistency. It builds on a hierarchical partitioning of the image into spatially coherent regions. PLESS propagates scribble information to refine pseudo-labels within semantically coherent regions. The framework is model-agnostic and easily integrates into existing pseudo-label methods. Experiments on two public cardiac MRI datasets (ACDC and MSCMRseg) across four scribble-supervised algorithms show consistent improvements in segmentation accuracy. Code will be made available on GitHub upon acceptance.
[105] ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning
Changti Wu,Jiahuai Mao,Yuzhuo Miao,Shijie Lian,Bin Yu,Xiaopeng Lin,Cong Huang,Lei Zhang,Kai Chen
Main category: cs.CV
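TL;DR: This paper proposes ScalSelect, a training-free multimodal data selection method with linear time complexity for large-scale Visual Instruction Tuning (VIT), which markedly improves training efficiency without relying on external models or datasets.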
Details
Motivation: Large-scale VIT is computationally expensive and inefficient because of data redundancy, calling for an efficient, scalable, training-free multimodal data selection method.
Method: ScalSelect first builds sample representations by extracting the visual features most attended by instruction tokens in the target VLM, then scores importance in linear time via dominant-subspace approximation, without pairwise similarity computation, external models, or auxiliary datasets.
Result: Experiments across multiple VLMs, datasets, and budgets show that 16% of the data achieves over 97.5% of full-data training performance, and in some settings even surpasses it.
Conclusion: ScalSelect is an efficient, scalable, training-free multimodal data selection method that offers a practical, high-performing data reduction approach for VIT.
Abstract: Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates the need for multimodal data selection to improve training efficiency. Existing data selection methods for VIT either require costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings.
The code is available at https://github.com/ChangtiWu/ScalSelect.
[106] Electrostatics-Inspired Surface Reconstruction (EISR): Recovering 3D Shapes as a Superposition of Poisson's PDE Solutions
Diego Patiño,Knut Peterson,Kostas Daniilidis,David K. Han
Main category: cs.CV
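TL;DR: This paper proposes a new implicit shape representation based on Poisson's equation (rather than the conventional Eikonal equation). It builds an SDF approximation using Green's functions and the superposition principle, markedly improving reconstruction of high-frequency geometric detail, and performs well even with few priors.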
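For readers who want the textbook relations behind "Green's functions and the superposition principle", the block below writes them out. The sign convention, the unit permittivity, and the free-space (no boundary) setting are assumptions for illustration; the entry does not state the paper's exact parameterization.

```latex
% Poisson's equation for a potential \phi driven by a charge density \rho
% (units chosen so that the permittivity is 1):
\nabla^2 \phi(\mathbf{x}) = -\rho(\mathbf{x})

% Free-space Green's function in three dimensions:
G(\mathbf{x}, \mathbf{y}) = \frac{1}{4\pi \lVert \mathbf{x} - \mathbf{y} \rVert},
\qquad -\nabla_{\mathbf{x}}^2 G(\mathbf{x}, \mathbf{y}) = \delta(\mathbf{x} - \mathbf{y})

% Closed-form solution and superposition over K component densities:
\phi(\mathbf{x}) = \int G(\mathbf{x}, \mathbf{y})\, \rho(\mathbf{y})\, d\mathbf{y},
\qquad
\rho = \sum_{k=1}^{K} w_k \rho_k \;\Rightarrow\; \phi = \sum_{k=1}^{K} w_k \phi_k
```

Because the equation is linear, the potential of a weighted sum of densities is the same weighted sum of the individual potentials, which is what allows a target field to be assembled from simpler closed-form pieces.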
Details
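Motivation: SDF learning methods based on the Eikonal equation are limited in reconstructing high-frequency geometric detail; this work explores a proxy PDE with more physical meaning and better mathematical properties (such as linearity and analytic solvability) to improve reconstruction accuracy and generalization.
Method: Casts surface reconstruction as solving Poisson's equation; builds intuition through physical analogies such as the electrostatic potential; uses Green's functions to obtain a closed-form parametric solution; exploits the linearity of Poisson's equation to express the target implicit field as a superposition of basis solutions.
Result: With only a small number of shape priors, SDF approximation of high-frequency geometric detail (such as sharp edges and fine texture) improves markedly, with reconstruction quality exceeding mainstream Eikonal-based methods.
Conclusion: Poisson's equation, as an effective proxy PDE for the Eikonal equation, combined with Green's functions and linear superposition, offers a more robust, efficient, physically interpretable paradigm for implicit shape reconstruction.
Abstract: Implicit shape representations, such as SDFs, are a popular approach to recovering the surface of a 3D shape as the level sets of a scalar field. Several methods approximate SDFs using machine learning strategies that exploit the knowledge that SDFs are solutions of the Eikonal partial differential equation (PDE). In this work, we present a novel approach to surface reconstruction by encoding it as a solution to a proxy PDE, namely Poisson's equation. Then, we explore the connection between Poisson's equation and physics, e.g., the electrostatic potential due to a positive charge density. We employ Green's functions to obtain a closed-form parametric expression for the PDE's solution, and leverage the linearity of our proxy PDE to find the target shape's implicit field as a superposition of solutions. Our method shows improved results in approximating high-frequency details, even with a small number of shape priors.
[107] Brain Tumor Classifiers Under Attack: Robustness of ResNet Variants Against Transferable FGSM and PGD Attacks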
Ryan Deem,Garrett Goodman,Waqas Majeed,Md Abdullah Al Hafiz Khan,Michail S. Alexiou
Main category: cs.CV
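TL;DR: This paper studies the adversarial robustness of ResNet-based brain tumor classifiers (BrainNet, BrainNeXt, DilationNet) on MRI data. BrainNeXt is the most robust to black-box attacks but yields weakly transferable adversarial samples, while lower input resolution and removing data augmentation markedly weaken robustness even when accuracy does not drop noticeably.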
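As a rough illustration of the two attacks named in this entry, the sketch below applies FGSM and PGD to a toy logistic-regression classifier. The toy model is an assumption chosen so the input gradient can be written by hand; the paper attacks ResNet variants, and the function names here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One-step FGSM against a logistic-regression model.

    x: input vector, y: true label in {0, 1}, (w, b): model parameters,
    eps: L-infinity perturbation budget.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w  # gradient of the cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)

def pgd(x, y, w, b, eps, alpha, steps):
    """Iterated FGSM with step size alpha, projected back into the eps-ball."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = fgsm(x_adv, y, w, b, alpha)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay within eps of x
    return x_adv
```

Here `steps` and `alpha` correspond to the iteration count and step size that the abstract reports as making PGD more damaging when increased.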
Details
Motivation: The adversarial robustness of deep learning models for brain tumor classification remains underexplored, yet it is critical for clinical deployment involving MRI data.
Method: Evaluates the robustness of three ResNet variants (BrainNet, BrainNeXt, DilationNet) under gradient-based adversarial attacks such as FGSM and PGD, comparing three MRI preprocessing configurations (full-sized augmented, shrunk augmented, shrunk non-augmented).
Result: BrainNeXt is the most robust under black-box attacks but generates weakly transferable adversarial samples; BrainNet and DilationNet are more vulnerable to attacks from each other, especially under PGD with more steps and larger step sizes; shrunk, non-augmented data markedly reduces robustness even when test accuracy remains high.
Conclusion: Real-world deployment of brain MRI analysis requires evaluating classification performance and adversarial robustness together; input resolution and data augmentation strategy strongly affect robustness.
Abstract: Adversarial robustness in deep learning models for brain tumor classification remains an underexplored yet critical challenge, particularly for clinical deployment scenarios involving MRI data. In this work, we investigate the susceptibility and resilience of several ResNet-based architectures, referred to as BrainNet, BrainNeXt and DilationNet, against gradient-based adversarial attacks, namely FGSM and PGD. These models, based on ResNet, ResNeXt, and dilated ResNet variants respectively, are evaluated across three preprocessing configurations: (i) full-sized augmented, (ii) shrunk augmented and (iii) shrunk non-augmented MRI datasets. Our experiments reveal that BrainNeXt models exhibit the highest robustness to black-box attacks, likely due to their increased cardinality, though they produce weaker transferable adversarial samples. In contrast, BrainNet and DilationNet models are more vulnerable to attacks from each other, especially under PGD with higher iteration steps and $\alpha$ values. Notably, shrunk and non-augmented data significantly reduce model resilience, even when the untampered test accuracy remains high, highlighting a key trade-off between input resolution and adversarial vulnerability. These results underscore the importance of jointly evaluating classification performance and adversarial robustness for reliable real-world deployment in brain MRI analysis.
[108] GR-Diffusion: 3D Gaussian Representation Meets Diffusion in Whole-Body PET Reconstruction
Mengxiao Geng,Zijie Chen,Ran Hong,Bingxuan Li,Qiegen Liu
Main category: cs.CV
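TL;DR: This paper proposes GR-Diffusion, a framework that combines the geometric priors of the 3D discrete Gaussian representation (GR) with the generative power of diffusion models for low-dose whole-body PET reconstruction, markedly improving image quality and detail preservation.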
Details
Motivation: PET reconstruction suffers from noise amplification, structural blurring, and detail loss; conventional methods are constrained by low-pass behavior and struggle to achieve both global consistency and local accuracy.
Method: Proposes GR-Diffusion: GR generates a physically grounded, structurally explicit reference image from projection data; a hierarchical guidance mechanism based on this reference (fine-grained difference refinement plus coarse-grained multi-scale difference correction) fuses the geometric prior into the diffusion process and recovers sub-voxel information.
Result: On the UDPET and clinical datasets, GR-Diffusion outperforms existing state-of-the-art methods at different dose levels, markedly improving 3D whole-body PET image quality and the fidelity of physiological detail.
Conclusion: GR-Diffusion combines geometric modeling with generative modeling, offering a new paradigm for low-dose PET reconstruction with both physical interpretability and strong reconstruction performance.
Abstract: Positron emission tomography (PET) reconstruction is a critical challenge in molecular imaging, often hampered by noise amplification, structural blurring, and detail loss due to sparse sampling and the ill-posed nature of inverse problems. The three-dimensional discrete Gaussian representation (GR), which efficiently encodes 3D scenes using parameterized discrete Gaussian distributions, has shown promise in computer vision. In this work, we propose a novel GR-Diffusion framework that synergistically integrates the geometric priors of GR with the generative power of diffusion models for 3D low-dose whole-body PET reconstruction. GR-Diffusion employs GR to generate a reference 3D PET image from projection data, establishing a physically grounded and structurally explicit benchmark that overcomes the low-pass limitations of conventional point-based or voxel-based methods. This reference image serves as a dual guide during the diffusion process, ensuring both global consistency and local accuracy. Specifically, we employ a hierarchical guidance mechanism based on the GR reference. Fine-grained guidance leverages differences to refine local details, while coarse-grained guidance uses multi-scale difference maps to correct deviations. This strategy allows the diffusion model to sequentially integrate the strong geometric prior from GR and recover sub-voxel information.
Experimental results on the UDPET and Clinical datasets with varying dose levels show that GR-Diffusion outperforms state-of-the-art methods in enhancing 3D whole-body PET image quality and preserving physiological details.
[109] SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving
Seo Hyun Kim,Jin Bok Park,Do Yeon Koo,Ho Gun Park,Il Yong Chun
Main category: cs.CV
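TL;DR: This paper proposes SToRM, a supervised visual token reduction framework for end-to-end autonomous driving systems driven by multi-modal LLMs. It cuts computational cost substantially (by up to 30x) while keeping performance comparable to all-token input.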
Details
Motivation: End-to-end driving systems must be both safe and real-time; adding a multi-modal LLM (MLLM) enables natural-language human-vehicle interaction, but its heavy visual token consumption conflicts with limited on-board compute; existing token compression methods often trade away performance.
Method: Proposes SToRM, a supervised token reduction framework with three parts: 1) a lightweight token importance predictor based on sliding windows; 2) a supervised training mechanism whose pseudo-labels come from an all-token LLM forward pass; 3) an anchor-context merging module that merges redundant context tokens into key anchors to preserve information.
Result: On the LangAuto benchmark, SToRM outperforms current state-of-the-art methods under the same reduced-token budget, matches all-token performance, and cuts computational cost by up to 30x.
Conclusion: SToRM is the first to achieve efficient, supervised visual token reduction for MLLMs, offering a feasible approach to safe, real-time end-to-end driving under resource constraints.
Abstract: In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources, which are limited in autonomous vehicles, due to its reliance on an LLM and numerous visual tokens from sensor inputs. Many MLLM studies have explored reducing visual tokens, but often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss.
Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.
[110] EmoSpace: Fine-Grained Emotion Prototype Learning for Immersive Affective Content Generation
Bingyuan Wang,Xingbei Chen,Zongyang Qiu,Linping Yuan,Zeyu Wang
Main category: cs.CV
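TL;DR: This paper proposes EmoSpace, a framework that learns dynamic, interpretable emotion prototypes through vision-language alignment, enabling fine-grained emotion-controlled generation without explicit emotion labels and supporting applications such as emotional image outpainting, stylized generation, and panorama generation for VR environments.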
Details
Motivation: Existing generative methods struggle to capture nuanced emotional semantics and the fine-grained emotional control that immersive experiences require.
Method: Proposes EmoSpace, which uses a hierarchical emotion representation with learnable dynamic prototypes and a controllable generation pipeline combining multi-prototype guidance, temporal blending, and attention reweighting.
Result: Outperforms existing methods in qualitative and quantitative evaluations; a user study examines how VR environments affect emotional perception compared with desktop settings.
Conclusion: EmoSpace enables immersive visual content generation with fine-grained emotion control and supports applications in therapy, education, storytelling, artistic creation, and cultural preservation.
Abstract: Emotion is important for creating compelling virtual reality (VR) content. Although some generative methods have been applied to lower the barrier to creating emotionally rich content, they fail to capture the nuanced emotional semantics and the fine-grained control essential for immersive experiences. To address these limitations, we introduce EmoSpace, a novel framework for emotion-aware content generation that learns dynamic, interpretable emotion prototypes through vision-language alignment. We employ a hierarchical emotion representation with rich learnable prototypes that evolve during training, enabling fine-grained emotional control without requiring explicit emotion labels. We develop a controllable generation pipeline featuring multi-prototype guidance, temporal blending, and attention reweighting that supports diverse applications, including emotional image outpainting, stylized generation, and emotional panorama generation for VR environments. Our experiments demonstrate the superior performance of EmoSpace over existing methods in both qualitative and quantitative evaluations. Additionally, we present a comprehensive user study investigating how VR environments affect emotional perception compared to desktop settings. Our work facilitates immersive visual content generation with fine-grained emotion control and supports applications like therapy, education, storytelling, artistic creation, and cultural preservation. Code and models will be made publicly available.
[111] Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes
Jeongho Noh,Tai Hyoung Rhee,Eunho Lee,Jeongyun Kim,Sunwoo Lee,Ayoung Kim
Main category: cs.CV
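TL;DR: Clutt3R-Seg is a zero-shot 3D instance segmentation method for language-grounded robotic grasping. It builds a hierarchical instance tree of semantic cues, suppresses over- and under-segmentation through cross-view grouping and conditional substitution, and adds open-vocabulary semantic embeddings and a consistency-aware update, markedly improving robustness and accuracy in cluttered, sparse-view scenes.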
Details
Motivation: Existing 3D instance segmentation methods are limited in cluttered environments by occlusion, limited viewpoints, and noisy masks, and cannot adequately support language-grounded robotic manipulation.
Method: Proposes the Clutt3R-Seg zero-shot pipeline: builds a hierarchical instance tree that treats noisy masks as useful cues; uses cross-view grouping and conditional substitution for view-consistent segmentation; adds open-vocabulary semantic embeddings for natural-language target selection; designs a consistency-aware update that maintains instance correspondences from a single post-interaction image.
Result: Validated on synthetic and real-world datasets and on a real robot, Clutt3R-Seg consistently outperforms SOTA in cluttered and sparse-view scenes; on heavy-clutter sequences it reaches an AP@25 of 61.66, over 2.2x the baselines; with only 4 views it exceeds MaskClustering (8 views) by more than 2x.
Conclusion: Clutt3R-Seg improves the robustness and generalization of the 3D instance segmentation needed for language-grounded grasping in clutter, and shows potential for real-world deployment.
Abstract: Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x.
The code is available at: https://github.com/jeonghonoh/clutt3r-seg.
[112] Egocentric Gaze Estimation via Neck-Mounted Camera
Haoyu Huang,Yoichi Sato
Main category: cs.CV
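TL;DR: This paper introduces the new task of neck-mounted view gaze estimation and builds the first dataset for it, with experiments on the transformer-based GLC model and two extensions (gaze out-of-bound classification and multi-view co-learning).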
Details
Motivation: Egocentric gaze estimation research has focused on head-mounted camera views, while alternatives such as neck-mounted views remain underexplored; this work aims to fill that gap.
Method: Collects the first neck-mounted gaze estimation dataset (about 4 hours, 8 participants); evaluates the transformer-based GLC model; introduces an auxiliary gaze out-of-bound classification task and a geometry-aware multi-view co-learning approach.
Result: The out-of-bound classification task improves performance, but multi-view co-learning yields no gain; the results reveal challenges and potential specific to neck-mounted gaze estimation.
Conclusion: Neck-mounted gaze estimation is a promising new direction that calls for modeling strategies designed for its viewpoint rather than direct transfer of head-mounted methods.
Abstract: This paper introduces neck-mounted view gaze estimation, a new task that estimates user gaze from the neck-mounted camera perspective. Prior work on egocentric gaze estimation, which predicts the device wearer's gaze location within the camera's field of view, mainly focuses on head-mounted cameras while alternative viewpoints remain underexplored. To bridge this gap, we collect the first dataset for this task, consisting of approximately 4 hours of video collected from 8 participants during everyday activities. We evaluate a transformer-based gaze estimation model, GLC, on the new dataset and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach that jointly trains head-view and neck-view models using a geometry-aware auxiliary loss. Experimental results show that incorporating gaze out-of-bound classification improves performance over standard fine-tuning, while the co-learning approach does not yield gains. We further analyze these results and discuss implications for neck-mounted gaze estimation.
[113] U-Net with Hadamard Transform and DCT Latent Spaces for Next-day Wildfire Spread Prediction
Yingyi Luo,Shuaiang Rong,Adam Watts,Ahmet Enis Cetin
Main category: cs.CV
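TL;DR: This paper proposes TD-FusionUNet, a lightweight deep learning model that combines trainable Hadamard and cosine transform layers with custom preprocessing to predict next-day wildfire spread from multimodal satellite data, achieving a higher F1 score than the baseline while remaining efficient.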
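To show what a two-dimensional Hadamard transform layer might compute, here is a minimal NumPy sketch. It assumes square inputs whose side is a power of two, and it models the trainable part as per-coefficient weights (`scale`); how the paper actually parameterizes its layers is not stated in this entry, so treat the names and the weighting scheme as hypothetical.

```python
import numpy as np

def hadamard_matrix(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def hadamard_2d(x):
    """Orthonormal 2D Hadamard transform of a square array."""
    n = x.shape[0]
    h = hadamard_matrix(n) / np.sqrt(n)
    return h @ x @ h.T

def transform_domain_layer(x, scale):
    """Transform, weight each coefficient, and transform back.

    `scale` stands in for the trainable parameters of the layer.
    """
    n = x.shape[0]
    h = hadamard_matrix(n) / np.sqrt(n)
    return h.T @ (scale * (h @ x @ h.T)) @ h
```

With `scale` set to all ones the layer is the identity, since the normalized Hadamard matrix is orthogonal; a DCT layer would follow the same pattern with a DCT basis matrix in place of `h`.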
Details
Motivation: Real-time wildfire prediction in resource-limited environments calls for a lightweight prediction tool that is computationally efficient and reliably accurate.
Method: Proposes TD-FusionUNet, which adds trainable 2D Hadamard and DCT transform layers to capture frequency components in orthogonalized latent spaces, and uses custom preprocessing such as random margin cropping and a Gaussian mixture model to enrich the sparse pre-fire mask representation and improve generalization.
Result: Evaluated on the Next-Day Wildfire Spread and WildfireSpreadTS datasets, TD-FusionUNet achieves an F1 score of 0.591 with 370k parameters, outperforming the UNet baseline with a ResNet18 encoder from WildfireSpreadTS.
Conclusion: TD-FusionUNet balances accuracy and efficiency and suits real-time wildfire spread prediction in resource-limited environments.
Abstract: We developed a lightweight and computationally efficient tool for next-day wildfire spread prediction using multimodal satellite data as input. The deep learning model, which we call Transform Domain Fusion UNet (TD-FusionUNet), incorporates trainable Hadamard Transform and Discrete Cosine Transform layers that apply two-dimensional transforms, enabling the network to capture essential "frequency" components in orthogonalized latent spaces. Additionally, we introduce custom preprocessing techniques, including random margin cropping and a Gaussian mixture model, to enrich the representation of the sparse pre-fire masks and enhance the model's generalization capability. The TD-FusionUNet is evaluated on two datasets: the Next-Day Wildfire Spread dataset released by Google Research in 2023 and the WildfireSpreadTS dataset. Our proposed TD-FusionUNet achieves an F1 score of 0.591 with 370k parameters, outperforming the UNet baseline using ResNet18 as the encoder reported in the WildfireSpreadTS dataset while using substantially fewer parameters. These results show that the proposed latent space fusion model balances accuracy and efficiency under a lightweight setting, making it suitable for real-time wildfire prediction applications in resource-limited environments.
[114] RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval
Khanh Nguyen,Dasith de Silva Edirimuni,Ghulam Mubashar Hassan,Ajmal Mian
Main category: cs.CV
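TL;DR: This paper proposes RI-Mamba, the first rotation-invariant state-space model for point clouds. Using reference frames to disentangle pose from geometry, Hilbert sorting to order tokens, and orientational embedding modulation, it performs cross-category text-to-shape retrieval under arbitrary orientations and reaches SOTA on OmniObject3D.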
Details
Motivation: Existing text-to-shape retrieval methods rely on canonical poses and cover few categories, so they cope poorly with real-world scenes where objects belong to diverse classes and appear in arbitrary orientations.
Method: Proposes RI-Mamba: global and local reference frames disentangle pose from geometry; Hilbert sorting produces rotation-invariant, geometrically meaningful token sequences; orientational embeddings are reintegrated via feature-wise linear modulation (FiLM) to recover spatial context; cross-modal contrastive learning with automated triplet generation enables large-scale training.
Result: On the OmniObject3D benchmark, achieves SOTA text-to-shape retrieval over more than 200 categories of 3D objects in arbitrary orientations; the model has linear time complexity and strong generalization and robustness.
Conclusion: RI-Mamba addresses pose sensitivity and limited category coverage in 3D retrieval, providing a scalable, expressive paradigm for open-world, multi-orientation, multi-category 3D asset retrieval.
Abstract: 3D assets have rapidly expanded in quantity and diversity due to the growing popularity of virtual reality and gaming. As a result, text-to-shape retrieval has become essential in facilitating intuitive search within large repositories. However, existing methods require canonical poses and support few object categories, limiting their real-world applicability where objects can belong to diverse classes and appear in random orientations. To address this challenge, we propose RI-Mamba, the first rotation-invariant state-space model for point clouds. RI-Mamba defines global and local reference frames to disentangle pose from geometry and uses Hilbert sorting to construct token sequences with meaningful geometric structure while maintaining rotation invariance. We further introduce a novel strategy to compute orientational embeddings and reintegrate them via feature-wise linear modulation, effectively recovering spatial context and enhancing model expressiveness. Our strategy is inherently compatible with state-space models and operates in linear time. To scale up retrieval, we adopt cross-modal contrastive learning with automated triplet generation, allowing training on diverse datasets without manual annotation. Extensive experiments demonstrate RI-Mamba's superior representational capacity and robustness, achieving state-of-the-art performance on the OmniObject3D benchmark across more than 200 object categories under arbitrary orientations. Our code will be made available at https://github.com/ndkhanh360/RI-Mamba.git.
[115] Semantically Conditioned Diffusion Models for Cerebral DSA Synthesis
Qiwen Xu,David Rügamer,Holger Wenz,Johann Fontana,Nora Meggyeshazi,Andreas Bender,Máté E. Maros
Main category: cs.CV
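TL;DR: This paper proposes a semantically conditioned latent diffusion model (LDM) that generates arterial-phase cerebral digital subtraction angiography (DSA) images with control over anatomical circulation (anterior/posterior) and C-arm position. Expert evaluation and the FID metric indicate that the generated images are clinically realistic and usable for algorithm development, research, and training.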
Details
Motivation: DSA is central to diagnosing and treating cerebrovascular disease, but its invasiveness and high acquisition cost severely limit large-scale data collection and sharing, creating a need for high-quality synthetic data.
Method: Curates a single-centre DSA dataset of 99,349 frames and trains a semantically conditioned LDM on text embeddings (encoding anatomy and acquisition geometry) for controllable synthesis of arterial-phase DSA images.
Result: Four medical experts rated 400 synthetic DSA images on a 5-grade Likert scale; image-wise overall scores ranged from 3.1 to 3.3 (ICC = 0.80 to 0.87); the median Fréchet inception distance (FID) was 15.27, indicating high distributional similarity.
Conclusion: Semantically controllable latent diffusion models can generate clinically realistic synthetic DSA images suitable for downstream algorithm development, research, and physician training.
Abstract: Digital subtraction angiography (DSA) plays a central role in the diagnosis and treatment of cerebrovascular disease, yet its invasive nature and high acquisition cost severely limit large-scale data collection and public data sharing. Therefore, we developed a semantically conditioned latent diffusion model (LDM) that synthesizes arterial-phase cerebral DSA frames under explicit control of anatomical circulation (anterior vs. posterior) and canonical C-arm positions. We curated a large single-centre DSA dataset of 99,349 frames and trained a conditional LDM using text embeddings that encoded anatomy and acquisition geometry. To assess clinical realism, four medical experts, including two neuroradiologists, one neurosurgeon, and one internal medicine expert, systematically rated 400 synthetic DSA images using a 5-grade Likert scale for evaluating proximal large, medium, and small peripheral vessels. The generated images achieved image-wise overall Likert scores ranging from 3.1 to 3.3, with high inter-rater reliability (ICC(2,k) = 0.80 to 0.87). Distributional similarity to real DSA frames was supported by a low median Fréchet inception distance (FID) of 15.27. Our results indicate that semantically controlled LDMs can produce realistic synthetic DSAs suitable for downstream algorithm development, research, and training.
[116] TG-Field: Geometry-Aware Radiative Gaussian Fields for Tomographic Reconstruction
Yuxiang Zhong,Jun Wei,Chaoqi Chen,Senyou An,Hui Huang
Main category: cs.CV
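TL;DR: This paper proposes TG-Field, a geometry-aware Gaussian deformation framework for static and dynamic CT reconstruction. Using multi-resolution hash encoding, time-conditioned representations, spatiotemporal attention, and a motion-flow network, it substantially improves reconstruction quality and dynamic consistency under ultra-sparse views.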
Details
Motivation: Existing 3D Gaussian splatting methods for CT reconstruction suffer from severe artifacts under ultra-sparse-view projections and dynamic motion.
Method: Tomographic Geometry Field (TG-Field) comprises (1) a multi-resolution hash encoder that models local spatial priors; (2) time-conditioned representations and a spatiotemporal attention module that adaptively aggregate dynamic features; and (3) a motion-flow network that models the fine-grained anatomical deformation caused by respiratory motion.
Result: On synthetic and real CT datasets, TG-Field consistently outperforms existing methods under highly sparse-view conditions and reaches state-of-the-art reconstruction accuracy.
Conclusion: TG-Field mitigates the geometric uncertainty and spatiotemporal ambiguity caused by sparse views and dynamic motion, offering an efficient and robust approach to medical CT reconstruction.
Abstract: 3D Gaussian Splatting (3DGS) has revolutionized 3D scene representation with superior efficiency and quality. While recent adaptations for computed tomography (CT) show promise, they struggle with severe artifacts under highly sparse-view projections and dynamic motions. To address these challenges, we propose Tomographic Geometry Field (TG-Field), a geometry-aware Gaussian deformation framework tailored for both static and dynamic CT reconstruction. A multi-resolution hash encoder is employed to capture local spatial priors, regularizing primitive parameters under ultra-sparse settings. We further extend the framework to dynamic reconstruction by introducing time-conditioned representations and a spatiotemporal attention block to adaptively aggregate features, thereby resolving spatiotemporal ambiguities and enforcing temporal coherence. In addition, a motion-flow network models fine-grained respiratory motion to track local anatomical deformations. Extensive experiments on synthetic and real-world datasets demonstrate that TG-Field consistently outperforms existing methods, achieving state-of-the-art reconstruction accuracy under highly sparse-view conditions.
[117] LLM-Driven 3D Scene Generation of Agricultural Simulation Environments
Arafa Yoncalik,Wouter Jansen,Nico Huebel,Mohammad Hasan Rahmani,Jan Steckel
Main category: cs.CV
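TL;DR: This paper proposes a modular multi-LLM pipeline for generating synthetic agricultural simulation environments. It combines domain knowledge injection, 3D asset retrieval, and Unreal Engine code generation with few-shot prompting, RAG, fine-tuning, and validation to improve the accuracy, scalability, and controllability of the generated scenes.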
Details
Motivation: Existing LLM-based 3D scene generation methods lack domain-specific reasoning (for example, for agriculture), verification mechanisms, and modular design, which leads to weak control and poor scalability.
Method: A modular multi-LLM pipeline integrates 3D asset retrieval, agricultural domain knowledge injection, and code generation against the Unreal Engine API, using a hybrid strategy of few-shot prompting, RAG, fine-tuning, and multi-stage validation.
Result: The system was evaluated with structured prompts and semantic accuracy metrics. A user study found the generated environments realistic and familiar, and an expert comparison showed significant time savings over manual modelling.
Conclusion: A modular multi-LLM architecture improves the reliability, precision, and degree of automation of domain-specific 3D scene generation, and offers a scalable path for agriculture and other simulation domains.
Abstract: Procedural generation techniques in 3D rendering engines have revolutionized the creation of complex environments, reducing reliance on manual design. Recent approaches using Large Language Models (LLMs) for 3D scene generation show promise but often lack domain-specific reasoning, verification mechanisms, and modular design. These limitations lead to reduced control and poor scalability. This paper investigates the use of LLMs to generate agricultural synthetic simulation environments from natural language prompts, specifically to address these limitations. A modular multi-LLM pipeline was developed, integrating 3D asset retrieval, domain knowledge injection, and code generation for the Unreal rendering engine using its API. This results in a 3D environment with realistic planting layouts and environmental context, all based on the input prompt and the domain knowledge. To enhance accuracy and scalability, the system employs a hybrid strategy combining LLM optimization techniques such as few-shot prompting, Retrieval-Augmented Generation (RAG), finetuning, and validation. Unlike monolithic models, the modular architecture enables structured data handling, intermediate verification, and flexible expansion. The system was evaluated using structured prompts and semantic accuracy metrics. A user study assessed realism and familiarity against real-world images, while an expert comparison demonstrated significant time savings over manual scene design. The results confirm the effectiveness of multi-LLM pipelines in automating domain-specific 3D scene generation with improved reliability and precision. Future work will explore expanding the asset hierarchy, incorporating real-time generation, and adapting the pipeline to other simulation domains beyond agriculture.
[118] GSO-SLAM: Bidirectionally Coupled Gaussian Splatting and Direct Visual Odometry
Jiung Yeon,Seongbo Ha,Hyeonwoo Yu
Main category: cs.CV
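TL;DR: This paper proposes GSO-SLAM, a real-time monocular dense SLAM system built on a Gaussian scene representation. It bidirectionally couples visual odometry (VO) and Gaussian splatting (GS) within an EM framework to jointly optimize depth estimates and the scene representation, and introduces a Gaussian splat initialization method that improves reconstruction accuracy and efficiency.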
Details
Motivation: Existing SLAM methods couple tracking and mapping in ways that are either computationally expensive or structurally redundant, so an efficient, high-fidelity, real-time monocular dense SLAM solution is needed.
Method: GSO-SLAM (1) bidirectionally couples VO and GS within an EM framework to jointly optimize semi-dense depth and the Gaussian scene, and (2) introduces Gaussian Splat Initialization, which uses image information, keyframe poses, and pixel associations from VO to produce a high-quality initial Gaussian scene.
Result: Experiments show that the method runs in real time and achieves state-of-the-art geometric and photometric scene fidelity as well as tracking accuracy.
Conclusion: By tightly coupling VO and GS and adding a data-driven initialization strategy, GSO-SLAM improves the reconstruction quality and robustness of monocular dense SLAM while remaining real time.
Abstract: We propose GSO-SLAM, a real-time monocular dense SLAM system that leverages Gaussian scene representation. Unlike existing methods that couple tracking and mapping with a unified scene, incurring computational costs, or loosely integrate them with well-structured tracking frameworks, introducing redundancies, our method bidirectionally couples Visual Odometry (VO) and Gaussian Splatting (GS). Specifically, our approach formulates joint optimization within an Expectation-Maximization (EM) framework, enabling the simultaneous refinement of VO-derived semi-dense depth estimates and the GS representation without additional computational overhead. Moreover, we present Gaussian Splat Initialization, which utilizes image information, keyframe poses, and pixel associations from VO to produce close approximations to the final Gaussian scene, thereby eliminating the need for heuristic methods. Through extensive experiments, we validate the effectiveness of our method, showing that it not only operates in real time but also achieves state-of-the-art geometric/photometric fidelity of the reconstructed scene and tracking accuracy.
[119] STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
Xiaowen Zhang,Zhi Gao,Licheng Jiao,Lingling Li,Qing Li
Main category: cs.CV
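TL;DR: This paper proposes a new visual prompting paradigm and STVG-R1, the first reinforcement learning framework for spatial-temporal video grounding (STVG). By encoding instance IDs as visual prompts and designing a multi-objective reward, it substantially improves performance on several benchmarks and shows strong zero-shot transfer.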
Details
Motivation: In dense prediction tasks such as spatial-temporal video grounding (STVG), vision-language models (VLMs) tend to hallucinate because textual descriptions and visual coordinates are misaligned. Existing methods rely on cross-modal alignment or auxiliary decoders, which bring high annotation cost and computational overhead.
Method: The authors propose a visual prompting paradigm based on unique, temporally consistent instance IDs, which turns per-frame coordinate prediction into an instance-level identification problem. They also design the STVG-R1 reinforcement learning framework, which jointly optimizes a task-driven, multi-objective reward covering temporal accuracy, spatial consistency, and structural format.
Result: On HCSTVG-v2, STVG-R1 surpasses Qwen2.5-VL-7B by 20.9% m_IoU, setting a new SOTA. In zero-shot transfer to MeViS it reaches 47.3% J&F, also SOTA.
Conclusion: The proposed visual prompting paradigm sidesteps the difficult problem of cross-modal coordinate alignment, and STVG-R1 provides efficient, interpretable, and well-generalizing STVG modelling for VLMs on dense visual understanding tasks.
Abstract: In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.
[120] Adapting Vision-Language Models for E-commerce Understanding at Scale
Matteo Nulli,Vladimir Orshulevich,Tala Bazazo,Christian Herold,Michael Kozielski,Marcin Mazur,Szymon Tuzel,Cees G. M. Snoek,Seyyed Hadi Hashemi,Omar Javed,Yannick Versley,Shahram Khadivi
Main category: cs.CV
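TL;DR: This paper presents a method for adapting general-purpose vision-language models (VLMs) to e-commerce. It substantially improves product understanding on attribute-dense, multi-image, and noisy data while preserving broad multimodal capabilities, and it introduces a new evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.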
Details
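Motivation: General-purpose VLMs offer generalizable multimodal modelling, but there is no established strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data, which makes it hard to achieve both e-commerce performance and general multimodal capability.
Method: Through a large-scale experimental study, the authors explore and apply targeted VLM adaptation for the characteristics of e-commerce data (attribute-dense, multi-image, noisy). They also design a comprehensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
Result: The adaptation substantially outperforms baseline models on e-commerce understanding tasks without harming performance on general vision-language tasks, and the new evaluation suite confirms the method's effectiveness and robustness.
Conclusion: Targeted adaptation of a general VLM to a specific domain such as e-commerce is feasible and efficient, and can greatly improve domain performance without sacrificing general capability. Domain-specific evaluation suites are important for making such models practical.
Abstract: E-commerce product understanding demands, by nature, strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show, through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
[121] Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding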
Boqi Chen,Xudong Liu,Jianing Qiu
Main category: cs.CV
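TL;DR: This paper improves visual contrastive decoding (VCD) to reduce object hallucination in multimodal large language models (MLLMs) by constructing an object-aligned auxiliary view. Using object-centric attention from a self-supervised ViT, it removes the most salient visual evidence to create an auxiliary view that disrupts unsupported tokens and strengthens the contrast signal. The method is prompt-agnostic and model-agnostic, adds very little computation, and yields consistent gains on two popular object hallucination benchmarks across two MLLMs.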
Details
Motivation: Object hallucination is a widespread problem in multimodal large language models (MLLMs).
Method: Building on visual contrastive decoding (VCD), the method uses object-centric attention from a self-supervised Vision Transformer to remove the most salient visual evidence in the image and construct an object-aligned auxiliary view that strengthens the contrast signal.
Result: The method gives consistent gains on two popular object hallucination benchmarks across two MLLMs, and it is lightweight and plug-and-play.
Conclusion: Constructing an object-aligned auxiliary view is an effective, general, and efficient strategy for mitigating object hallucination in MLLMs and strengthens VCD.
Abstract: We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computational overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
[122] Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation
Xiangyu Wu,Dongming Jiang,Feng Yu,Yueying Tian,Jiaqi Tang,Qing-Guo Chen,Yang Yang,Jianfeng Lu
Main category: cs.CV
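TL;DR: This paper proposes ADTE, an adaptive debiasing method based on Tsallis entropy for test-time adaptation (TTA) of vision-language models such as CLIP. It mitigates the biased uncertainty estimates caused by pretraining data bias, needs no extra hyperparameter tuning, and reaches SOTA performance on multiple benchmarks.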
Details
Motivation: Mainstream TTA methods use Shannon entropy (SE) to measure prediction uncertainty, but models such as CLIP are pretrained on highly imbalanced web data, so SE yields biased uncertainty estimates.
Method: The authors use Tsallis entropy (TE), a generalization of SE that can characterize skewed distributions through a non-extensive parameter q. They then design Adaptive Debiasing Tsallis Entropy (ADTE), which learns a class-specific parameter q^l for each category, obtained by normalizing the label bias estimated from continuously incoming test samples, and combines it with a label adjustment strategy to improve adaptation.
Result: ADTE outperforms existing SOTA methods on ImageNet and its five variants and achieves the highest average performance on 10 cross-domain benchmarks, regardless of model architecture or text prompts. Both TE and ADTE can directly replace SE with no other modifications.
Conclusion: ADTE provides a plug-and-play, debiased uncertainty measure that requires no distribution-specific tuning and improves the robustness and generalization of vision-language models under test-time adaptation.
Abstract: Mainstream Test-Time Adaptation (TTA) methods for adapting vision-language models, e.g., CLIP, typically rely on Shannon Entropy (SE) at test time to measure prediction uncertainty and inconsistency. However, since CLIP has a built-in bias from pretraining on highly imbalanced web-crawled data, SE inevitably produces biased uncertainty estimates. To address this issue, we notably find and demonstrate that Tsallis Entropy (TE), a generalized form of SE, is naturally suited for characterizing biased distributions by introducing a non-extensive parameter q, with the performance of SE serving as a lower bound for TE. Building upon this, we generalize TE into Adaptive Debiasing Tsallis Entropy (ADTE) for TTA, customizing a class-specific parameter q^l derived by normalizing the estimated label bias from continuously incoming test instances, for each category. This adaptive approach allows ADTE to accurately select high-confidence views and seamlessly integrate with a label adjustment strategy to enhance adaptation, without introducing distribution-specific hyperparameter tuning. Besides, our investigation reveals that both TE and ADTE can serve as direct, advanced alternatives to SE in TTA, without any other modifications. Experimental results show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants, and achieves the highest average performance on 10 cross-domain benchmarks, regardless of the model architecture or text prompts used.
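For readers unfamiliar with the quantity named in this entry, Tsallis entropy with parameter q is commonly defined as S_q(p) = (1 - sum_i p_i^q) / (q - 1), and it recovers Shannon entropy as q approaches 1. The sketch below illustrates only this textbook definition. It is not the ADTE method: the function names and the example distributions are hypothetical, and the class-specific q^l estimation described in the abstract is not reproduced here.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats: -sum_i p_i * ln(p_i), skipping zero entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def tsallis_entropy(probs, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1).

    For q very close to 1 the formula is numerically unstable, so this
    sketch falls back to the Shannon limit in that case.
    """
    if abs(q - 1.0) < 1e-8:
        return shannon_entropy(probs)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)


if __name__ == "__main__":
    # Hypothetical softmax outputs: one confident, one uncertain prediction.
    confident = [0.97, 0.01, 0.01, 0.01]
    uncertain = [0.25, 0.25, 0.25, 0.25]
    for q in (0.5, 1.0, 2.0):
        print(q, tsallis_entropy(confident, q), tsallis_entropy(uncertain, q))
```

Because `tsallis_entropy` takes the same input as `shannon_entropy`, it can be swapped in wherever an entropy-based confidence score is computed, which is consistent with the abstract's remark that TE can replace SE without other modifications.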
Our code is available at https://github.com/Jinx630/ADTE.
[123] Code2Worlds: Empowering Coding LLMs for 4D World Generation
Yi Zhang,Yunshuang Wang,Zeyu Zhang,Hao Tang
Main category: cs.CV
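TL;DR: This paper proposes Code2Worlds, a framework that formulates 4D dynamic world generation as language-to-simulation code generation. A dual-stream architecture decouples object generation from environment generation, and a physics-aware closed-loop mechanism (with a PostProcess Agent and a VLM-Motion Critic) improves dynamic fidelity. It clearly outperforms baselines on the Code4D benchmark.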
Details
Motivation: Existing 3D generation methods based on coding LLMs are hard to extend to 4D dynamic worlds. They face two challenges, multi-scale context entanglement and a semantic-physical execution gap, so world simulators that obey physical laws are needed.
Method: Code2Worlds consists of (1) a dual-stream architecture that separately handles retrieval-augmented object generation and hierarchical environment orchestration, and (2) a physics-aware closed-loop mechanism in which a PostProcess Agent scripts the dynamics and a vision-language-model-based motion critic (VLM-Motion Critic) iteratively refines the code through self-reflection.
Result: On the Code4D benchmark, Code2Worlds improves SGS by 41% and Richness by 49% over baselines, and uniquely produces physics-aware dynamics that prior static methods lack.
Conclusion: Code2Worlds supports the approach of casting 4D generation as the generation of executable, verifiable simulation code, and offers a new path toward the physically credible world simulators needed for embodied intelligence.
Abstract: Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a critical frontier. This task presents two fundamental challenges: multi-scale context entanglement, where monolithic generation fails to balance local object structures with global environmental layouts; and a semantic-physical execution gap, where open-loop code generation leads to physical hallucinations lacking dynamic fidelity. We introduce Code2Worlds, a framework that formulates 4D generation as language-to-simulation code generation. First, we propose a dual-stream architecture that disentangles retrieval-augmented object generation from hierarchical environmental orchestration. Second, to ensure dynamic fidelity, we establish a physics-aware closed-loop mechanism in which a PostProcess Agent scripts dynamics, coupled with a VLM-Motion Critic that performs self-reflection to iteratively refine simulation code. Evaluations on the Code4D benchmark show Code2Worlds outperforms baselines with a 41% SGS gain and 49% higher Richness, while uniquely generating physics-aware dynamics absent in prior static methods. Code: https://github.com/AIGeeksGroup/Code2Worlds. Website: https://aigeeksgroup.github.io/Code2Worlds.
[124] Light4D: Training-Free Extreme Viewpoint 4D Video Relighting
Zhenghuang Wu,Kang Chen,Zeyu Zhang,Hao Tang
Main category: cs.CV
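TL;DR: This paper proposes Light4D, a training-free 4D video relighting framework. Using Disentangled Flow Guidance and Temporal Consistent Attention, it synthesizes high-fidelity, temporally consistent 4D relit video even under extreme viewpoint changes.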
Details
Motivation: Existing diffusion-based relighting methods are hard to extend to 4D (spatiotemporal) scenes, mainly because paired 4D training data is scarce and temporal consistency is difficult to maintain across extreme viewpoints.
Method: Light4D combines (1) Disentangled Flow Guidance, which injects lighting control into the latent space while preserving geometric integrity, and (2) Temporal Consistent Attention within the IC-Light architecture, together with deterministic regularization to eliminate flickering.
Result: Experiments show competitive temporal consistency and lighting fidelity, with robust handling of camera rotations from -90° to 90°.
Conclusion: Light4D is a training-free, efficient, and reliable approach to 4D video relighting that substantially eases the two bottlenecks of data scarcity and temporal instability.
Abstract: Recent advances in diffusion-based generative models have established a new paradigm for image and video relighting. However, extending these capabilities to 4D relighting remains challenging, due primarily to the scarcity of paired 4D relighting training data and the difficulty of maintaining temporal consistency across extreme viewpoints. In this work, we propose Light4D, a novel training-free framework designed to synthesize consistent 4D videos under target illumination, even under extreme viewpoint changes. First, we introduce Disentangled Flow Guidance, a time-aware strategy that effectively injects lighting control into the latent space while preserving geometric integrity. Second, to reinforce temporal consistency, we develop Temporal Consistent Attention within the IC-Light architecture and further incorporate deterministic regularization to eliminate appearance flickering. Extensive experiments demonstrate that our method achieves competitive performance in temporal consistency and lighting fidelity, robustly handling camera rotations from -90° to 90°. Code: https://github.com/AIGeeksGroup/Light4D. Website: https://aigeeksgroup.github.io/Light4D.
[125] Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data
Yiming Zhou,Xuenjie Xie,Panfeng Li,Albrecht Kunz,Ahmad Osman,Xavier Maldague
Main category: cs.CV
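TL;DR: This paper proposes a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Trained on only 11.2k samples (less than 0.1% of SA-1B), it achieves higher segmentation accuracy than the original EfficientViT-SAM.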
Details
Motivation: Existing SAM models depend on large-scale RGB datasets and are computationally expensive, so there is a need to improve segmentation with little data and low compute.
Method: Depth maps produced by a pretrained monocular depth estimator are passed through a dedicated depth encoder and fused with mid-level RGB features, forming a lightweight RGB-D fusion framework built on EfficientViT-SAM.
Result: Trained on only 11.2k samples, the method is more accurate than EfficientViT-SAM, showing that depth cues act as strong geometric priors for segmentation.
Conclusion: Adding lightweight depth information can substantially reduce the dependence of SAM-style models on large data and heavy compute, suggesting a new direction for efficient general-purpose segmentation.
Abstract: Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
[126] How to Sample High Quality 3D Fractals for Action Recognition Pre-Training?
Marko Putak,Thomas B. Moeslund,Joakim Bruslund Haurum
Main category: cs.CV
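TL;DR: This paper proposes Targeted Smart Filtering, a new method that speeds up 3D fractal generation and increases fractal diversity in order to support pre-training of action recognition models.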
Details
Motivation: Standard synthetic data generation methods are slow and produce degenerate 3D fractals, and a practical method must balance downstream task performance against generation efficiency.
Method: 3D fractals are generated with 3D Iterated Function Systems (IFS) and temporally transformed into videos. Targeted Smart Filtering is proposed to improve both sampling speed and fractal diversity.
Result: The method samples roughly 100 times faster and outperforms other 3D fractal filtering methods on the downstream action recognition task.
Conclusion: Targeted Smart Filtering resolves the tension between speed and diversity in 3D fractal generation and provides an efficient, practical source of synthetic data for annotation-free pre-training.
Abstract: Synthetic datasets are being recognized in the deep learning realm as a valuable alternative to exhaustively labeled real data. One such synthetic data generation method is Formula Driven Supervised Learning (FDSL), which can provide an unlimited amount of perfectly labeled data through a formula-driven approach, such as fractals or contours. FDSL does not have common drawbacks like manual labor, privacy, and other ethical concerns. In this work we generate 3D fractals using 3D Iterated Function Systems (IFS) for pre-training an action recognition model. The fractals are temporally transformed to form a video that is used as a pre-training dataset for the downstream task of action recognition. We find that standard methods of generating fractals are slow and produce degenerate 3D fractals. Therefore, we systematically explore alternative ways of generating fractals and find that overly restrictive approaches, while generating aesthetically pleasing fractals, are detrimental to downstream task performance. We propose a novel method, Targeted Smart Filtering, to address both the generation speed and fractal diversity issues. The method achieves roughly 100 times faster sampling and superior downstream performance compared with other 3D fractal filtering methods.
[127] JEPA-VLA: Video Predictive Embedding is Needed for VLA Models
Shangchen Miao,Ningya Feng,Jialong Wu,Ye Lin,Xu He,Dong Li,Mingsheng Long
Main category: cs.CV
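TL;DR: This paper proposes JEPA-VLA, which adds predictive visual representations pretrained on video (such as V-JEPA 2) to compensate for the weak environment understanding and policy priors of existing vision-language-action models, substantially improving sample efficiency and generalization.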
Details
Motivation: Existing VLA models are limited by their pretrained visual representations, which fail to capture task-relevant environment information and policy priors, that is, anticipatory knowledge of how the environment evolves under successful task execution.
Method: The authors analyze the limitations of different visual representations and find that predictive embeddings pretrained on video (in particular V-JEPA 2) better model task-relevant temporal dynamics and filter out unpredictable factors. Based on this, they propose JEPA-VLA, which adaptively integrates such predictive embeddings into existing VLA models.
Result: JEPA-VLA yields substantial gains on several benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.
Conclusion: Predictive representations pretrained on video effectively strengthen the environment understanding and policy priors of VLA models, and JEPA-VLA is a simple yet effective way to add them.
Abstract: Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, pretrained visual representation, which offers insufficient knowledge on both aspects of environment understanding and policy prior. Through an in-depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.
[128] WorldTree: Towards 4D Dynamic Worlds from Monocular Video using Tree-Chains
Qisen Wang,Yifan Zhao,Jia Li
Main category: cs.CV
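TL;DR: This paper proposes WorldTree, a framework that achieves unified spatiotemporal decomposition through a Temporal Partition Tree (TPT) and Spatial Ancestral Chains (SAC), improving monocular dynamic reconstruction.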
Details
Motivation: Existing monocular dynamic reconstruction methods lack a unified spatiotemporal decomposition framework, so temporal optimization is overly holistic or the hierarchical spatial composition is coupled.
Method: WorldTree comprises a Temporal Partition Tree (TPT), which performs coarse-to-fine temporal optimization on an inheritance-based partition tree, and Spatial Ancestral Chains (SAC), which recursively query the ancestral hierarchy to provide complementary spatial dynamics while specializing motion representations across ancestral nodes.
Result: Compared with the second-best method, WorldTree improves LPIPS by 8.26% on NVIDIA-LS and mLPIPS by 9.09% on DyCheck.
Conclusion: WorldTree provides an effective unified spatiotemporal decomposition mechanism and clearly improves reconstruction quality in monocular dynamic reconstruction.
Abstract: Dynamic reconstruction has achieved remarkable progress, but there remain challenges in monocular input for more practical applications. The prevailing works attempt to construct efficient motion representations, but lack a unified spatiotemporal decomposition framework, suffering from either holistic temporal optimization or coupled hierarchical spatial composition. To this end, we propose WorldTree, a unified framework comprising a Temporal Partition Tree (TPT) that enables coarse-to-fine optimization based on the inheritance-based partition tree structure for hierarchical temporal decomposition, and Spatial Ancestral Chains (SAC) that recursively query the ancestral hierarchical structure to provide complementary spatial dynamics while specializing motion representations across ancestral nodes. Experimental results on different datasets indicate that our proposed method achieves 8.26% improvement of LPIPS on NVIDIA-LS and 9.09% improvement of mLPIPS on DyCheck compared to the second-best method. Code: https://github.com/iCVTEAM/WorldTree.
[129] Free Lunch for Stabilizing Rectified Flow Inversion
Chenru Wang,Beier Zhu,Chi Zhang
Main category: cs.CV
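TL;DR: This paper proposes two training-free correction methods, Proximal-Mean Inversion (PMI) and mimic-CFG, that improve the inversion stability, reconstruction quality, and editing fidelity of Rectified-Flow generative models.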
Details
Motivation: Existing inversion methods for Rectified-Flow models accumulate errors across timesteps, which destabilizes the velocity field and degrades reconstruction and editing quality.
Method: PMI stabilizes the velocity field by guiding the current velocity toward a running average of past velocities, constrained within a spherical Gaussian. mimic-CFG interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency.
Result: On PIE-Bench, the methods significantly improve inversion stability, image reconstruction quality, and editing fidelity while reducing the number of neural function evaluations, reaching SOTA performance.
Conclusion: Without adding any training cost, the proposed methods improve the theoretical soundness, stability, and practicality of inversion for RF models.
Abstract: Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on the PIE-Bench with enhanced efficiency and theoretical soundness.
[130] Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei,Liangbo He,Jun Lan,Lingzhong Dong,Yutong Cai,Siyuan Li,Huijia Zhu,Weiqiang Wang,Linghe Kong,Yue Wang,Zhuosheng Zhang,Weiran Huang
Main category: cs.CV
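TL;DR: This paper proposes Region-to-Image Distillation, which turns the iterative zooming used by "Thinking-with-Images" methods from an inference-time tool call into a capability internalized at training time. The model then achieves fine-grained visual understanding in a single forward pass, with no runtime zooming. The paper also introduces the ZoomBench benchmark for evaluation.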
Details
Motivation: Existing MLLMs perform poorly on fine-grained perception (such as recognizing small objects) because decisive evidence is easily overwhelmed by global context. Existing "Thinking-with-Images" methods are effective but incur high inference latency.
Method: Region-to-Image Distillation first uses strong teacher models on micro-cropped regions to generate high-quality VQA supervision, then distills this region-level supervision back to the full image. The authors also build the ZoomBench benchmark and a dual-view evaluation protocol.
Result: The distilled student models achieve leading performance on several fine-grained perception benchmarks and also improve on general multimodal tasks such as visual reasoning and GUI agents. ZoomBench quantifies the global-regional "zooming gap".
Conclusion: Fine-grained perception can be internalized through training-time distillation without calling a zoom tool at inference. The study also delineates where "Thinking-with-Images" is needed, since part of its gain can be absorbed into a single forward pass.
Abstract: Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
[131] DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition
Ji Li,Zhiwei Li,Shihao Li,Zhenjiang Yu,Boyang Wang,Haiou Liu
Main category: cs.CV
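TL;DR: This paper proposes DiffPlace, a framework that introduces a place-ID controller for controllable multi-view image generation. It improves the place awareness and background consistency of generated urban street views and thereby benefits visual place recognition.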
Details
Motivation: Existing multi-view diffusion models conditioned on text, bird's-eye-view (BEV) maps, and object bounding boxes struggle to generate place-aware, background-consistent urban street views, which limits their usefulness for place recognition.
Method: DiffPlace includes a place-ID controller that uses linear projection, a Perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, keeping background buildings consistent while allowing flexible control of foreground objects and weather conditions.
Result: Extensive experiments, including quantitative comparisons and augmented training evaluations, show that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition.
Conclusion: DiffPlace demonstrates the potential of generative models for scene-level, place-aware synthesis and offers an effective way to improve place recognition in autonomous driving.
Abstract: Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller employs linear projection, a Perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models in enhancing scene-level and place-aware synthesis, providing a valuable approach for improving place recognition in autonomous driving.
[132] SynthRAR: Ring Artifacts Reduction in CT with Unrolled Network and Synthetic Data Training
Hongxu Yang,Levente Lippenszky,Edina Timko,Gopal Avinash
Main category: cs.CV
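TL;DR: Based on a theoretical analysis of non-ideal CT detector responses, this paper proposes a deep learning method that models ring artifact reduction as an inverse problem solved with an unrolled network. It uses synthetic data to exploit the intrinsic correlation between the image and sinogram domains, correcting ring artifacts without any real clinical training data.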
Details
Motivation: Existing ring artifact reduction methods depend on dedicated training datasets that are costly to collect, and most correct in only one domain (image or sinogram), ignoring the intrinsic correlations that follow from the CT forward geometry.
Method: Ring artifact reduction is formulated as an inverse problem that combines the non-ideal detector response with the linear CT forward projection and is solved with an unrolled network. Synthetic data derived from natural images is used to exploit the correlation of ring artifacts between the sinogram and image domains, so no real clinical data is needed for training.
Result: Experiments across diverse scanning geometries and anatomical regions show that the model trained only on synthetic data consistently outperforms existing state-of-the-art methods.
Conclusion: The method removes the dependence on real clinical data and fuses image-domain and sinogram-domain priors, improving ring artifact reduction while remaining physically interpretable.
Abstract: Defective and inconsistent responses in CT detectors can cause ring and streak artifacts in the reconstructed images, making them unusable for clinical purposes. In recent years, several ring artifact reduction (RAR) solutions have been proposed in the image domain or in the sinogram domain using supervised deep learning methods. However, these methods require dedicated datasets for training, leading to a high data collection cost. Furthermore, existing approaches focus exclusively on either image-space or sinogram-space correction, neglecting the intrinsic correlations from the forward operation of the CT geometry. Based on the theoretical analysis of non-ideal CT detector responses, the RAR problem is reformulated as an inverse problem by using an unrolled network, which considers non-ideal response together with linear forward-projection with CT geometry. Additionally, the intrinsic correlations of ring artifacts between the sinogram and image domains are leveraged through synthetic data derived from natural images, enabling the trained model to correct artifacts without requiring real-world clinical data. Extensive evaluations on diverse scanning geometries and anatomical regions demonstrate that the model trained on synthetic data consistently outperforms existing state-of-the-art methods.
[133] DynaHOI: Benchmarking Hand-Object Interaction for Dynamic Target
BoCheng Hu,Zhonghan Zhao,Kaiyue Zhou,Hongwei Wang,Gaoang Wang
Main category: cs.CV
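TL;DR: This paper introduces the DynaHOI-Gym platform and the DynaHOI-10M benchmark dataset for evaluating hand motion generation in dynamic hand-object interaction, together with ObAct, a baseline built on short-term observation and spatiotemporal attention that improves location success rate.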
Details
Motivation: Existing benchmarks for hand motion generation in hand-object interaction (HOI) focus mainly on static objects and cannot evaluate dynamic targets or time-critical coordination, leaving a clear research gap. Method: The authors build DynaHOI-Gym, a unified online closed-loop evaluation platform with parameterized motion generators and rollout-based metrics; release DynaHOI-10M, a large-scale dynamic HOI benchmark (10M frames, 180K trajectories); and propose the ObAct baseline, which fuses short-term observations with the current frame and predicts actions via spatiotemporal attention. Result: ObAct improves location success rate by 8.1%; DynaHOI-10M organizes target motions into 8 major categories and 22 fine-grained subcategories. Conclusion: DynaHOI-Gym and DynaHOI-10M fill the gap in dynamic HOI evaluation, and ObAct supports the observe-before-act paradigm for this task, giving future work a new platform and benchmark. Abstract: Most existing hand motion generation benchmarks for hand-object interaction (HOI) focus on static objects, leaving dynamic scenarios with moving targets and time-critical coordination largely untested. To address this gap, we introduce DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics for dynamic capture evaluation. Built on DynaHOI-Gym, we release DynaHOI-10M, a large-scale benchmark with 10M frames and 180K hand capture trajectories, whose target motions are organized into 8 major categories and 22 fine-grained subcategories. We also provide a simple observe-before-act baseline (ObAct) that integrates short-term observations with the current frame via spatiotemporal attention to predict actions, achieving an 8.1% improvement in location success rate.
[134] Synthesis of Late Gadolinium Enhancement Images via Implicit Neural Representations for Cardiac Scar Segmentation
Soufiane Ben Haddou,Laura Alvarez-Florez,Erik J. Bekkers,Fleur V. Y. Tjong,Ahmad S. Amin,Connie R. Bezzina,Ivana Išgum
Main category: cs.CV
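TL;DR: This paper proposes a framework combining implicit neural representations (INRs) with denoising diffusion models to synthesize late gadolinium enhancement (LGE) cardiac MRI images with registered segmentation masks, easing the shortage of annotated data; on 133 real scans, adding 200 synthetic volumes raises the fibrosis segmentation Dice score from 0.509 to 0.524.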
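For readers unfamiliar with the metric quoted in the summary above, the following is a minimal sketch of the standard Dice score for binary masks. It is not taken from the paper's code; the function name, the toy masks, and the convention for two empty masks are assumptions made for illustration.

```python
import numpy as np

def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * intersection / denom

# Toy 2x3 masks (hypothetical): 3 foreground pixels each, 2 of them shared.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_score(pred, target)
print(f"Dice = {score:.3f}")
```

On this scale, the reported change from 0.509 to 0.524 is a gain of 0.015 in overlap.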
Details
Motivation: Late gadolinium enhancement (LGE) imaging is the clinical standard for assessing myocardial scar, but high-quality annotated data is scarce, which holds back automated segmentation methods. Method: INRs are first trained to model continuous spatial representations of LGE images and their myocardium and fibrosis masks; the INRs are then compressed into compact latent embeddings that preserve key anatomical information; finally, a diffusion model trained on this latent space generates new representations, which are decoded into anatomically consistent synthetic LGE images with registered masks. Result: On 133 real cardiac MRI scans, augmenting the training set with 200 synthetic volumes raises the fibrosis segmentation Dice score from 0.509 to 0.524; the method produces synthetic data with segmentation masks without manual annotation. Conclusion: The framework offers an annotation-free data augmentation approach for medical image segmentation tasks with little labeled data, while keeping the generated images anatomically plausible. Abstract: Late gadolinium enhancement (LGE) imaging is the clinical standard for myocardial scar assessment, but limited annotated datasets hinder the development of automated segmentation methods. We propose a novel framework that synthesises both LGE images and their corresponding segmentation masks using implicit neural representations (INRs) combined with denoising diffusion models. Our approach first trains INRs to capture continuous spatial representations of LGE data and associated myocardium and fibrosis masks. These INRs are then compressed into compact latent embeddings, preserving essential anatomical information. A diffusion model operates on this latent space to generate new representations, which are decoded into synthetic LGE images with anatomically consistent segmentation masks. Experiments on 133 cardiac MRI scans suggest that augmenting training data with 200 synthetic volumes contributes to improved fibrosis segmentation performance, with the Dice score showing an increase from 0.509 to 0.524. Our approach provides an annotation-free method to help mitigate data scarcity. The code for this research is publicly available.
[135] Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion
Bruno Rigal,Victor Dupriez,Alexis Mignon,Ronan Le Hy,Nicolas Mery
Main category: cs.CV
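TL;DR: This paper presents a new benchmark for evaluating PDF-to-Markdown conversion on complex French documents, building a set of hard cases via model-disagreement sampling and designing a fine-grained evaluation that targets semantic correctness while ignoring presentation-only differences.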
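To make the idea of presentation-independent, unit-test-style checks concrete, here is a small sketch of what such checks could look like. The benchmark's actual normalization rules and check definitions are not given in this digest, so every function name and rule below is hypothetical.

```python
import re
import unicodedata

def normalize(text):
    """Discount presentation-only variance: Unicode form, case, Markdown marks, whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[*_#|>`-]+", " ", text)   # strip common Markdown punctuation
    return re.sub(r"\s+", " ", text).strip()  # line breaks and indentation do not matter

def text_present(output, snippet):
    """Text-presence check: the snippet must appear somewhere in the output."""
    return normalize(snippet) in normalize(output)

def reading_order_ok(output, first, second):
    """Reading-order check: `first` must occur before `second`."""
    out = normalize(output)
    i, j = out.find(normalize(first)), out.find(normalize(second))
    return i != -1 and j != -1 and i < j

# Two renderings of the same (hypothetical) page that differ only in formatting.
markdown_a = "# Facture\n\n- Montant : **120 €**\n- Date : 3 mai"
markdown_b = "Facture\nMontant : 120 €\n\nDate : 3 mai"

print(text_present(markdown_a, "Montant : 120 €"))
print(reading_order_ok(markdown_b, "Facture", "Date : 3 mai"))
```

Under this kind of normalization both renderings pass the same checks, which is the behavior the summary describes as not penalizing benign formatting choices.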
Details
Motivation: Existing PDF parsing benchmarks focus mostly on English or Chinese and over-penalize formatting differences that do not affect downstream RAG (line breaks, list segmentation, table rendering), and none of them tests robustness on complex French documents (handwriting, complex layouts, dense tables, mixed text and graphics). Method: A French-specific benchmark is built by selecting hard cases from 60,000 documents via model-disagreement sampling; evaluation uses unit-test-style checks on text presence, reading order, and local table constraints, with category-specific normalization to ignore presentation-only differences; 15 vision-language models are evaluated. Result: The strongest proprietary models are substantially more robust on handwriting and forms, while several open-weights models remain competitive on standard printed documents. Conclusion: Evaluation should decouple semantic correctness from formatting, and complex French document parsing needs a dedicated benchmark; model performance depends heavily on document type, with proprietary and open models each strong in different settings. Abstract: This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.
[136] Calibrated Bayesian Deep Learning for Explainable Decision Support Systems Based on Medical Imaging
Hua Xu,Julián D. Arias-Londoño,Juan I. Godino-Llorente
Main category: cs.CV
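TL;DR: This paper proposes a probabilistic optimization framework based on Bayesian deep learning, comprising CUB-Loss and Dual Temperature Scaling, to improve the uncertainty calibration of medical imaging AI models and strengthen clinical trust.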
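As background for the post-hoc calibration step named in the summary above, the sketch below shows ordinary single-temperature scaling: logits are divided by a scalar T chosen to minimize negative log-likelihood on held-out data. This is the standard technique, not the paper's Dual Temperature Scaling, whose details are not given here; the toy logits, the grid search, and all names are assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    probs = softmax(logits, temperature)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature that minimizes held-out NLL (simple grid search)."""
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Toy held-out set (hypothetical): four confident predictions, one of them wrong.
logits = np.array([[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]])
labels = np.array([0, 0, 1, 1])

T = fit_temperature(logits, labels)
before = softmax(logits).max(axis=1).mean()
after = softmax(logits, T).max(axis=1).mean()
print(f"T = {T:.2f}, mean confidence {before:.3f} -> {after:.3f}")
```

Because the toy model is overconfident, the fitted temperature comes out above 1 and confidence drops, while the predicted classes stay the same.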
Details
Motivation: AI-assisted decision-making in medical imaging must balance predictive accuracy with uncertainty calibration; existing models are often overconfident in wrong predictions, which hinders clinical adoption. Method: The paper proposes the Confidence-Uncertainty Boundary Loss (CUB-Loss), which penalizes high-confidence errors and low-confidence correct predictions during training, and Dual Temperature Scaling (DTS) for post-hoc calibration. Result: The approach is validated on three tasks (pneumonia screening, diabetic retinopathy detection, and skin lesion identification), where it consistently improves calibration and remains robust with scarce data and severe class imbalance. Conclusion: The framework is general and clinically practical; it can improve the reliability and explainability of AI decisions and support real clinical deployment. Abstract: In critical decision support systems based on medical imaging, the reliability of AI-assisted decision-making is as relevant as predictive accuracy. Although deep learning models have demonstrated significant accuracy, they frequently suffer from miscalibration, manifested as overconfidence in erroneous predictions. To facilitate clinical acceptance, it is imperative that models quantify uncertainty in a manner that correlates with prediction correctness, allowing clinicians to identify unreliable outputs for further review. In order to address this necessity, the present paper proposes a generalizable probabilistic optimization framework grounded in Bayesian deep learning. Specifically, a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) is introduced that imposes penalties on high-certainty errors and low-certainty correct predictions, explicitly enforcing alignment between prediction correctness and uncertainty estimates. Complementing this training-time optimization, a Dual Temperature Scaling (DTS) strategy is devised for post-hoc calibration, further refining the posterior distribution to improve intuitive explainability. The proposed framework is validated on three distinct medical imaging tasks: automatic screening of pneumonia, diabetic retinopathy detection, and identification of skin lesions. Empirical results demonstrate that the proposed approach achieves consistent calibration improvements across diverse modalities, maintains robust performance in data-scarce scenarios, and remains effective on severely imbalanced datasets, underscoring its potential for real clinical deployment.
[137] Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation
Wei Chen,Yancheng Long,Mingqiao Liu,Haojie Ding,Yankai Yang,Hongyang Wei,Yi-Fan Zhang,Bin Wen,Fan Yang,Tingting Gao,Han Li,Long Chen
Main category: cs.CV
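TL;DR: This paper proposes Spatial Chain-of-Thought (SCoT), a framework that combines the spatial reasoning of multimodal large language models (MLLMs) with the generative capability of diffusion models to improve performance on complex spatial understanding and reasoning tasks.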
Details
Motivation: Diffusion models excel at image generation but fall short on complex spatial understanding and reasoning; existing remedies are either computationally expensive or lose spatial information because they rely on text prompts alone. Method: The SCoT framework (1) trains the diffusion model on an interleaved text-coordinate instruction format to strengthen its layout awareness, and (2) uses state-of-the-art MLLMs as planners that generate detailed layout plans, transferring their spatial planning ability to the generation process. Result: The method reaches state-of-the-art performance on image generation benchmarks, significantly outperforms baselines on complex reasoning tasks, and is also effective in image editing scenarios. Conclusion: SCoT is a plug-and-play, efficient framework that bridges the spatial reasoning of MLLMs and the generative power of diffusion models. Abstract: While diffusion models have shown exceptional capabilities in aesthetic image synthesis, they often struggle with complex spatial understanding and reasoning. Existing approaches resort to Multimodal Large Language Models (MLLMs) to enhance this capability. However, they either incur high computational costs through joint training or suffer from spatial information loss when relying solely on textual prompts. To alleviate these limitations, we propose a Spatial Chain-of-Thought (SCoT) framework, a plug-and-play approach that effectively bridges the reasoning capabilities of MLLMs with the generative power of diffusion models. Specifically, we first enhance the diffusion model's layout awareness by training it on an interleaved text-coordinate instruction format. We then leverage state-of-the-art MLLMs as planners to generate comprehensive layout plans, transferring their spatial planning capabilities directly to the generation process. Extensive experiments demonstrate that our method achieves state-of-the-art performance on image generation benchmarks and significantly outperforms baselines on complex reasoning tasks, while also showing strong efficacy in image editing scenarios.
[138] Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation
Enrico Guerriero,Kjersti Engan,Øyvind Meinich-Bache
Main category: cs.CV
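TL;DR: This paper explores generative AI (GenAI) methods for activity recognition in newborn resuscitation videos, combining local vision-language models (VLMs) with large language models (LLMs) and comparing them against a supervised TimeSformer baseline. Experiments show that a small local VLM fine-tuned with LoRA reaches an F1 score of 0.91, well above TimeSformer's 0.70.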
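The summary above credits the gain to LoRA fine-tuning. As a reminder of what that adaptation does, here is a minimal NumPy sketch of a LoRA-style linear layer: the pretrained weight W stays frozen and only a low-rank update B·A is trained. The dimensions, rank, and scaling below are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8       # hypothetical sizes; r is the LoRA rank

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha / r) * x A^T B^T, i.e. effective weight W + (alpha / r) * B A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, alpha, r)

full_params = W.size            # parameters touched by full fine-tuning of this layer
lora_params = A.size + B.size   # parameters trained with LoRA
print(full_params, lora_params)
```

With B initialized to zero the adapted layer starts out identical to the pretrained one, and in this toy layer LoRA trains 512 parameters instead of 4096.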
Details
Motivation: Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet it remains underutilized in practice; earlier work with 3D-CNNs and ViTs made progress but struggled with fine-grained activity recognition. Method: On a simulated dataset of 13.26 hours of newborn resuscitation video, the authors evaluate several zero-shot VLM strategies and fine-tuned VLMs with classification heads (including LoRA adaptation) against a TimeSformer baseline. Result: The local VLM fine-tuned with LoRA reaches an F1 score of 0.91, well above TimeSformer's 0.70, while zero-shot VLMs show clear hallucination problems. Conclusion: Lightweight fine-tuning such as LoRA can substantially improve fine-grained activity recognition in newborn resuscitation videos with local VLMs, opening a path toward automated clinical documentation. Abstract: Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA, they reach an F1 score of 0.91, surpassing the TimeSformer result of 0.70.
[139] Projected Representation Conditioning for High-fidelity Novel View Synthesis
Min-Seop Kwak,Minkyung Kwon,Jinhyeok Choi,Jiho Park,Seungryong Kim
Main category: cs.CV
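TL;DR: This paper proposes ReNoV, a framework that conditions diffusion models on external visual representations for novel view synthesis. By analyzing the correspondence capability of spatial attention and designing representation projection modules, it markedly improves geometric consistency, reconstruction fidelity, and inpainting quality.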
Details
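Motivation: Existing diffusion models for novel view synthesis lack geometric consistency; the geometric and semantic correspondence properties of external representations can be used to make generated viewpoints structurally more accurate. Method: The authors analyze the correspondence capability of spatial attention in external visual representations, then design dedicated representation projection modules that inject those representations into the diffusion process, yielding representation-guided novel view synthesis (ReNoV). Result: ReNoV outperforms prior diffusion-based novel-view methods on standard benchmarks, improves reconstruction fidelity and inpainting quality, and supports robust synthesis from sparse, unposed image collections. Conclusion: External representations can effectively improve the geometric consistency and generation quality of diffusion models for novel view synthesis, and ReNoV provides a more reliable framework that generalizes better for this task. Abstract: We propose a novel framework for diffusion-based novel view synthesis in which we leverage external representations as conditions, harnessing their geometric and semantic correspondence properties for enhanced geometric consistency in generated novel viewpoints. First, we provide a detailed analysis exploring the correspondence capabilities emergent in the spatial attention of external visual representations. Building from these insights, we propose a representation-guided novel view synthesis through dedicated representation projection modules that inject external representations into the diffusion process, a methodology named ReNoV, short for representation-guided novel view synthesis. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.
[140] A DMD-Based Adaptive Modulation Method for High Dynamic Range Imaging in High-Glare Environments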
Banglei Guan,Jing Tao,Liang Xu,Dongcai Tan,Pengju Sun,Jianbing Liu,Yang Shang,Qifeng Yu
Main category: cs.CV
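TL;DR: This paper presents a high dynamic range (HDR) imaging system based on a digital micromirror device (DMD) that improves image quality and digital image correlation (DIC) accuracy for photomechanics measurements under strong glare, such as welding arc monitoring and metal surface analysis. The system combines DMD optical modulation with adaptive computational imaging to achieve region-adaptive exposure, reaching a measured dynamic range of 127 dB, suppressing saturation artifacts, reducing strain error by 78%, and improving DIC positioning accuracy.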
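To put the decibel figures in the summary above into perspective, the short sketch below converts them to intensity ratios. It assumes the usual image-sensor convention of 20·log10(brightest / darkest resolvable signal); the digest does not state which definition the paper uses, so the convention is an assumption.

```python
import math

def db_to_ratio(db):
    """Dynamic range in dB -> max/min signal ratio (assumed 20*log10 convention)."""
    return 10 ** (db / 20)

def ratio_to_db(ratio):
    """Max/min signal ratio -> dynamic range in dB (assumed 20*log10 convention)."""
    return 20 * math.log10(ratio)

dmd_ratio = db_to_ratio(127)          # system reported in the summary
conventional_ratio = db_to_ratio(70)  # upper bound quoted for CCD/CMOS sensors

print(f"127 dB is about {dmd_ratio:.3g} : 1")
print(f" 70 dB is about {conventional_ratio:.3g} : 1")
print(f"gain factor is about {dmd_ratio / conventional_ratio:.0f}x")
```

Under that convention, 127 dB corresponds to a ratio of roughly 2.2 million to one, against roughly 3,200 to one at 70 dB.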
Details
Motivation: Conventional CCD/CMOS sensors have a low dynamic range (below 70 dB) and saturate easily under strong glare, which severely distorts DIC measurements; a high-fidelity HDR imaging solution for extreme illumination is needed. Method: The authors build an HDR imaging system based on DMD spatial modulation, consisting of a DMD optical modulation unit and an adaptive computational imaging pipeline, which performs autonomous regional segmentation and dynamic exposure control of the scene. Result: The system reaches a measured dynamic range of 127 dB and eliminates saturation artifacts under high glare; experiments show a 78% reduction in DIC strain error and improved positioning accuracy. Conclusion: The DMD-based HDR system overcomes the limits of conventional sensors and provides a reliable, high-fidelity method for optical metrology and stress analysis in high-glare environments. Abstract: Background: The accuracy of photomechanics measurements critically relies on image quality, particularly under extreme illumination conditions such as welding arc monitoring and polished metallic surface analysis. High dynamic range (HDR) imaging above 120 dB is essential in these contexts. Conventional CCD/CMOS sensors, with dynamic ranges typically below 70 dB, are highly susceptible to saturation under glare, resulting in irreversible loss of detail and significant errors in digital image correlation (DIC). Methods: This paper presents an HDR imaging system that leverages the spatial modulation capability of a digital micromirror device (DMD). The system architecture enables autonomous regional segmentation and adaptive exposure control for high-dynamic-range scenes through an integrated framework comprising two synergistic subsystems: a DMD-based optical modulation unit and an adaptive computational imaging pipeline. Results: The system achieves a measurable dynamic range of 127 dB, effectively eliminating saturation artifacts under high glare. Experimental results demonstrate a 78% reduction in strain error and improved DIC positioning accuracy, confirming reliable performance across extreme intensity variations. Conclusion: The DMD-based system provides high-fidelity adaptive HDR imaging, overcoming key limitations of conventional sensors. It exhibits strong potential for optical metrology and stress analysis in high-glare environments where traditional methods are inadequate.
[141] GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning
GigaBrain Team,Boyuan Wang,Chaojun Ni,Guan Huang,Guosheng Zhao,Hao Li,Jie Li,Jindi Lv,Jingyu Liu,Lv Feng,Mingming Yu,Peng Li,Qiuping Deng,Tianze Liu,Xinyu Zhou,Xinze Chen,Xiaofeng Wang,Yang Wang,Yifan Li,Yifei Nie,Yilong Li,Yukun Zhou,Yun Ye,Zhichao Liu,Zheng Zhu
Main category: cs.CV
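TL;DR: This paper proposes GigaBrain-0.5M*, a vision-language-action (VLA) model enhanced by a video world model. Using the RAMP reinforcement learning framework, it improves cross-task adaptation and the robustness of long-horizon manipulation, gaining about 30% on several complex robotic tasks, and it is validated in real-world deployment.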
Details
Motivation: Existing VLA models are limited by weak scene understanding and poor future anticipation, whereas world models pre-trained on large-scale video have strong spatiotemporal reasoning and future prediction, making them a natural way to enhance VLA learning. Method: Starting from GigaBrain-0.5 (pre-trained on more than 10,000 hours of robotic manipulation data), the authors add RAMP (Reinforcement leArning via world Model-conditioned Policy), a world model-based reinforcement learning framework, to enable cross-task adaptation. Result: The model improves on the RECAP baseline by about 30% on challenging tasks such as Laundry Folding, Box Packing, and Espresso Preparation; it shows reliable long-horizon execution, with no failures in the real-world deployment videos. Conclusion: Integrating world models into VLA training, as RAMP does, can markedly improve generalization, prediction, and deployment robustness, offering a new path for embodied intelligence. Abstract: Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. Therefore, we propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. Built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark, GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure as validated by real-world deployment videos on our project page (https://gigabrain05m.github.io).
[142] AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer
Lingting Zhu,Shengju Qian,Haidi Fan,Jiayu Dong,Zhenchao Jin,Siwei Zhou,Gen Dong,Xin Wang,Lequan Yu
Main category: cs.CV
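TL;DR: This paper presents AssetFormer, a Transformer-based autoregressive model that generates modular 3D assets satisfying design constraints from text descriptions, making 3D asset creation more efficient for professional development and user-generated content (UGC).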
Details
Motivation: The digital industry urgently needs high-quality, diverse modular 3D assets, especially for user-generated content (UGC), which calls for methods that automatically build constraint-compliant modular 3D assets from text. Method: AssetFormer is a Transformer-based autoregressive model that adapts module sequencing and decoding techniques from language models and is trained on modular assets collected from real-world online platforms, supporting text-to-asset generation under specific parameter constraints. Result: Initial experiments indicate that AssetFormer improves the quality of modular 3D asset generation, suits both professional development and UGC scenarios, and can be extended to various modular 3D asset types. Conclusion: AssetFormer offers a new, extensible, and practical framework for modular 3D content generation and contributes to the wider field of automatic 3D content creation. Abstract: The digital industry demands high-quality, diverse modular 3D assets, especially for user-generated content (UGC). In this work, we introduce AssetFormer, an autoregressive Transformer-based model designed to generate modular 3D assets from textual descriptions. Our pilot study leverages real-world modular assets collected from online platforms. AssetFormer tackles the challenge of creating assets composed of primitives that adhere to constrained design parameters for various applications. By innovatively adapting module sequencing and decoding techniques inspired by language models, our approach enhances asset generation quality through autoregressive modeling. Initial results indicate the effectiveness of AssetFormer in streamlining asset creation for professional development and UGC scenarios. This work presents a flexible framework extendable to various types of modular 3D assets, contributing to the broader field of 3D content generation. The code is available at https://github.com/Advocate99/AssetFormer.
[143] PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback
Sixiang Chen,Jianyu Lai,Jialin Gao,Hengyu Shi,Zhongying Liu,Tian Ye,Junfeng Luo,Xiaoming Wei,Lei Zhu
Main category: cs.CV
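TL;DR: This paper proposes PosterOmni, a framework that unifies local editing and global creation tasks in image-to-poster generation. Through dataset construction, knowledge distillation, and a unified reward feedback mechanism, it substantially improves semantic fidelity and aesthetic coherence across tasks.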
Details
Motivation: Image-to-poster generation must both preserve local visual entities and understand global design concepts; existing methods struggle to do both, and a unified framework and evaluation benchmark are missing. Method: PosterOmni has three parts: (i) multi-scenario image-to-poster datasets covering six task types; (ii) knowledge distillation between local and global expert models for supervised fine-tuning; and (iii) a unified PosterOmni Reward Feedback that jointly optimizes entity fidelity and aesthetic preference. The authors also establish PosterOmni-Bench as a unified benchmark. Result: PosterOmni outperforms all open-source baselines and even surpasses several proprietary systems, particularly in reference adherence, global composition quality, and aesthetic harmony. Conclusion: PosterOmni couples entity-preserving editing with concept-driven creation, supporting the effectiveness and generality of a unified multi-task framework for artistic image generation. Abstract: Image-to-poster generation is a high-demand task requiring not only local adjustments but also high-level design understanding. Models must generate text, layout, style, and visual elements while preserving semantic fidelity and aesthetic coherence. The process spans two regimes: local editing, where ID-driven generation, rescaling, filling, and extending must preserve concrete visual entities; and global creation, where layout- and style-driven tasks rely on understanding abstract design concepts. These intertwined demands make image-to-poster a multi-dimensional process coupling entity-preserving editing with concept-driven creation under image-prompt control. To address these challenges, we propose PosterOmni, a generalized artistic poster creation framework that unlocks the potential of a base edit model for multi-task image-to-poster generation. PosterOmni integrates the two regimes, namely local editing and global creation, within a single system through an efficient data-distillation-reward pipeline: (i) constructing multi-scenario image-to-poster datasets covering six task types across entity-based and concept-based creation; (ii) distilling knowledge between local and global experts for supervised fine-tuning; and (iii) applying unified PosterOmni Reward Feedback to jointly align visual entity-preserving and aesthetic preference across all tasks. Additionally, we establish PosterOmni-Bench, a unified benchmark for evaluating both local editing and global creation. Extensive experiments show that PosterOmni significantly enhances reference adherence, global composition quality, and aesthetic harmony, outperforming all open-source baselines and even surpassing several proprietary systems.
[144] FAIL: Flow Matching Adversarial Imitation Learning for Image Generation
Yeyao Ma,Chen Li,Xiaosong Zhang,Han Hu,Weidi Xie
Main category: cs.CV
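TL;DR: This paper proposes FAIL, which uses adversarial training to minimize the divergence between policy and expert without explicit rewards or pairwise comparisons, providing a post-training method for flow matching models.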
Details
Motivation: Supervised fine-tuning cannot correct policy drift in unseen states, while preference optimization methods require costly preference pairs or reward modeling. Method: The paper proposes Flow Matching Adversarial Imitation Learning (FAIL) in two variants: FAIL-PD, which uses differentiable ODE solvers to obtain low-variance pathwise gradients, and FAIL-PG, a black-box alternative for discrete or computationally constrained settings. Result: Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana pro, FAIL achieves competitive performance on prompt-following and aesthetic benchmarks; the framework also generalizes to discrete image and video generation and acts as a robust regularizer that mitigates reward hacking in reward-based optimization. Conclusion: FAIL provides an efficient post-training paradigm for flow matching models that needs neither explicit rewards nor preference pairs, and it generalizes across several generation tasks. Abstract: Post-training of flow matching models, which aligns the output distribution with a high-quality target, is mathematically equivalent to imitation learning. While Supervised Fine-Tuning mimics expert demonstrations effectively, it cannot correct policy drift in unseen states. Preference optimization methods address this but require costly preference pairs or reward modeling. We propose Flow Matching Adversarial Imitation Learning (FAIL), which minimizes policy-expert divergence through adversarial training without explicit rewards or pairwise comparisons. We derive two algorithms: FAIL-PD exploits differentiable ODE solvers for low-variance pathwise gradients, while FAIL-PG provides a black-box alternative for discrete or computationally constrained settings. Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana pro, FAIL achieves competitive performance on prompt following and aesthetic benchmarks. Furthermore, the framework generalizes effectively to discrete image and video generation, and functions as a robust regularizer to mitigate reward hacking in reward-based optimization. Code and data are available at https://github.com/HansPolo113/FAIL.
[145] TexSpot: 3D Texture Enhancement with Spatially-uniform Point Latent Representation
Ziteng Lu,Yushuang Wu,Chongjie Ye,Yuda Qiu,Jing Shao,Xiaoyang Guo,Jiaqing Zhou,Tianlei Hu,Kun Zhou,Xiaoguang Han
Main category: cs.CV
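TL;DR: This paper proposes TexSpot, a diffusion-based texture enhancement framework that uses a new Texlet representation to address view inconsistency and distortion in 3D texture generation, substantially improving texture quality, geometric consistency, and robustness.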
Details
Motivation: Existing 3D texture generation methods either suffer from UV-mapping distortion or, when point-based, are limited by geometric density, making it hard to achieve both high fidelity and high resolution. Method: The paper introduces Texlet, a 3D texture representation that combines the geometric expressiveness of point-based textures with the compactness of UV maps; each Texlet encodes a local texture patch with a 2D encoder and incorporates global shape context through a 3D encoder; a cascaded 3D-to-2D decoder reconstructs the texture patches; a conditional diffusion transformer is then trained on this representation to enhance textures produced by multi-view diffusion. Result: Across experiments, TexSpot significantly outperforms existing state-of-the-art methods in visual fidelity, geometric consistency, and robustness. Conclusion: With the Texlet representation and diffusion-based enhancement, TexSpot mitigates the view inconsistency of mainstream multi-view diffusion pipelines and offers a new approach to high-quality 3D texture generation. Abstract: High-quality 3D texture generation remains a fundamental challenge due to the view-inconsistency inherent in current mainstream multi-view diffusion pipelines. Existing representations either rely on UV maps, which suffer from distortion during unwrapping, or point-based methods, which tightly couple texture fidelity to geometric density, limiting high-resolution texture generation. To address these limitations, we introduce TexSpot, a diffusion-based texture enhancement framework. At its core is Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV-based representation. Each Texlet latent vector encodes a local texture patch via a 2D encoder and is further aggregated using a 3D encoder to incorporate global shape context. A cascaded 3D-to-2D decoder reconstructs high-quality texture patches, enabling the Texlet space learning. Leveraging this representation, we train a diffusion transformer conditioned on Texlets to refine and enhance textures produced by multi-view diffusion methods. Extensive experiments demonstrate that TexSpot significantly improves visual fidelity, geometric consistency, and robustness over existing state-of-the-art 3D texture generation and enhancement approaches. Project page: https://anonymous.4open.science/w/TexSpot-page-2D91.
[146] DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
Xu Guo,Fulong Ye,Qichao Sun,Liyang Chen,Bingchuan Li,Pengze Zhang,Jiawei Liu,Songtao Zhao,Qian He,Xiangwang Hou
Main category: cs.CV
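TL;DR: This paper proposes DreamID-Omni, a unified framework for controllable human-centric audio-video generation. With a symmetric conditional diffusion transformer, a dual-level disentanglement strategy, and multi-task progressive training, it addresses confusion and control problems across multiple identities and voice timbres and reaches state-of-the-art results on several metrics.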
Details
Motivation: Existing audio-video generation methods treat human-centric tasks (reference-based generation, editing, audio-driven animation) in isolation, and precise, disentangled control of multiple character identities and voice timbres within one framework remains difficult. Method: DreamID-Omni comprises (1) a Symmetric Conditional Diffusion Transformer that integrates heterogeneous control signals through symmetric conditional injection; (2) a Dual-Level Disentanglement strategy, with Synchronized RoPE at the signal level to bind attention spaces and Structured Captions at the semantic level to establish explicit attribute-subject mappings; and (3) Multi-Task Progressive Training, which uses weakly constrained generative priors to regularize strongly constrained tasks. Result: The framework reaches state-of-the-art performance across video quality, audio quality, and audio-visual consistency, even outperforming leading proprietary commercial models. Conclusion: DreamID-Omni delivers unified, controllable, high-fidelity human-centric audio-video generation, markedly reduces identity-timbre binding failures and speaker confusion, and helps move academic research toward commercial-grade applications. Abstract: Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
[147] EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data
Nils Lehmann,Yi Wang,Zhitong Xiong,Xiaoxiang Zhu
Main category: cs.CV
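TL;DR: This paper proposes EO-VAE, a unified variational autoencoder tokenizer for multi-sensor Earth observation (EO) data that uses dynamic hypernetworks to encode and reconstruct flexible combinations of spectral channels, outperforming existing methods on the TerraMesh dataset.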
Details
Motivation: Current generative models rely on tokenizers to compress their inputs, but Earth observation data comes from diverse sensors with varying spectral channels and is hard to handle with a single fixed tokenizer. Method: EO-VAE is a multi-sensor tokenizer based on a variational autoencoder; it introduces dynamic hypernetworks to adapt to different spectral channel combinations, so a single model performs both encoding and reconstruction. Result: On the TerraMesh dataset, EO-VAE reconstructs with higher fidelity than the TerraMind tokenizers, supporting its use as a foundational tokenizer for generative modeling in remote sensing. Conclusion: EO-VAE gives the Earth observation field a unified latent tokenizer framework that supports multiple sensors and channels, advancing generative models for remote sensing imagery and video. Abstract: State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. While this paradigm has revolutionized RGB generation, Earth observation (EO) data presents unique challenges due to diverse sensor specifications and variable spectral channels. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the EO domain. Unlike prior approaches that train separate tokenizers for each modality, EO-VAE utilizes a single model to encode and reconstruct flexible channel combinations via dynamic hypernetworks. Our experiments on the TerraMesh dataset demonstrate that EO-VAE achieves superior reconstruction fidelity compared to the TerraMind tokenizers, establishing a robust baseline for latent generative modeling in remote sensing.
[148] DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
Dianyi Wang,Ruihang Li,Feng Han,Chaofan Ma,Wei Song,Siyuan Wang,Yibin Wang,Yi Xin,Hongjian Liu,Zhixiong Zhang,Shengyuan Ding,Tianhang Wang,Zhenglin Cheng,Tao Lin,Cheng Jin,Kaicheng Yu,Jingjing Chen,Wenjie Wang,Zhongyu Wei,Jiaqi Wang
Main category: cs.CV
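TL;DR: DeepGen 1.0 is a lightweight unified multimodal model for image generation and editing with only 5B parameters. Using the Stacked Channel Bridging (SCB) framework and a three-stage data-centric training strategy, it surpasses much larger models (such as the 80B HunyuanImage and the 27B Qwen-Image-Edit) on several benchmarks, and its code, weights, and datasets are open-sourced.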
Details
Motivation: Existing unified multimodal models for image generation and editing have very large parameter counts (>10B) and are expensive to train and deploy; a lightweight, efficient alternative that does not sacrifice performance is needed. Method: The paper proposes Stacked Channel Bridging (SCB), a deep alignment framework that fuses features from multiple vision-language model layers with learnable 'think tokens', and a three-stage training strategy: (1) alignment pre-training, (2) joint supervised fine-tuning, and (3) reinforcement learning with MR-GRPO. Result: Trained on only about 50M samples, DeepGen 1.0 surpasses the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench, with substantial gains in generation quality and alignment with human preferences and without visual artifacts. Conclusion: Lightweight unified multimodal models can match or exceed much larger models through architectural innovation and careful training, giving multimodal research an efficient, reproducible, and open baseline. Abstract: Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
[149] Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching
Onkar Susladkar,Tushar Prakash,Gayatri Deshmukh,Kiet A. Nguyen,Jiaxun Zhang,Adheesh Juvekar,Tianshu Bao,Lin Chai,Sparsh Mittal,Inderjit S Dhillon,Ismini Lourentzou
Main category: cs.CV
TL;DR: UniDFlow is a unified discrete flow-matching framework for multimodal tasks, decoupling understanding and generation via adapters and using reference-based preference alignment to improve faithfulness and controllability without retraining.
Details
Motivation: To address objective interference and representation entanglement in multimodal understanding and generation, and to improve faithfulness and controllability without large-scale retraining. Method: UniDFlow uses task-specific low-rank adapters to decouple understanding and generation, and introduces reference-based multimodal preference alignment to optimize relative outcomes under identical conditioning. Result: UniDFlow achieves state-of-the-art performance across eight benchmarks and shows strong zero-shot generalization to tasks like inpainting, in-context image generation, reference-based editing, and compositional generation. Conclusion: UniDFlow provides a flexible, unified framework for multimodal tasks with improved performance, generalization, and controllability without explicit task-specific training. Abstract: We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
[150] MonarchRT: Efficient Attention for Real-Time Video Generation
Krish Agarwal,Zhuoming Chen,Cheng Luo,Yongqi Chen,Haizhong Zheng,Xun Huang,Atri Rudra,Beidi Chen
Main category: cs.CV
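TL;DR: This paper proposes Monarch-RT, a structured attention parameterization based on Monarch matrix factorization that addresses the high cost of 3D self-attention in real-time video diffusion models. It reaches up to 95% attention sparsity while preserving generation quality, substantially outperforms the FlashAttention family of kernels across several GPUs, and for the first time achieves true real-time video generation at 16 FPS on a single RTX 5090.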
Details
Motivation: In real-time video generation, Diffusion Transformers are limited by the quadratic cost of 3D self-attention. Errors accumulate severely in the few-step, autoregressive setting, and existing sparse-attention methods (such as top-k) break down in this regime. Method: Proposes Monarch-RT, which factorizes attention into a structured form using Monarch matrices, combines aligned block structure with an extended tiled Monarch parameterization, and improves efficiency through custom Triton kernels and finetuning. Result: Achieves up to 95% attention sparsity with no loss in quality on the Self-Forcing model; delivers kernel speedups of 1.4-11.8x over FlashAttention-2, FlashAttention-3, and FlashAttention-4 on RTX 5090, H100, and B200 GPUs, respectively; and reaches real-time video generation at 16 FPS on a single RTX 5090 for the first time. Conclusion: Monarch-RT is the first highly capable sparse attention parameterization aimed at real-time video generation. It balances expressivity and efficiency and offers a new paradigm for efficient video diffusion models. Abstract: Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of parameterization through finetuning, with custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly-capable sparse attention parameterization for real-time video generation. Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on Nvidia RTX 5090, H100, and B200 GPUs respectively, providing kernel speedups in the range of 1.4-11.8X. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
[151] UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Leon Liangyu Chen,Haoyu Ma,Zhipeng Fan,Ziqi Huang,Animesh Sinha,Xiaoliang Dai,Jialiang Wang,Zecheng He,Jianwei Yang,Chunyuan Li,Junzhe Sun,Chu Wang,Serena Yeung-Levy,Felix Juefei-Xu
Main category: cs.CV
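TL;DR: This paper proposes UniT, a framework that enables test-time scaling (TTS) for unified multimodal models through multi-round reasoning, verification, and refinement, substantially improving performance on complex multimodal tasks.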
Details
Motivation: Existing unified multimodal models typically run a single forward pass and struggle with complex tasks that require decomposing instructions, verifying intermediate results, and making iterative corrections. Test-time scaling (TTS), which has proven effective for language models, has not yet been successfully extended to unified multimodal models. Method: Proposes the UniT framework, which combines agentic data synthesis, unified model training, and flexible test-time inference to support multi-round chain-of-thought reasoning, covering cognitive behaviors such as verification, subgoal decomposition, and content memory. Result: Experiments show that (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning is more scalable and compute-efficient than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. Conclusion: Multimodal test-time scaling is an effective paradigm that advances unified models in both generation and understanding. Abstract: Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
[152] Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching
Huai-Hsun Cheng,Siang-Ling Zhang,Yu-Lun Liu
Main category: cs.CV
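TL;DR: This paper introduces Progressive Semantic Illusions, a new vector sketching task in which a single sketch undergoes a marked semantic change as strokes are progressively added, and designs Stroke of Surprise, a generative framework that optimizes the stroke sequence so that the sketch satisfies a different semantic interpretation at each drawing stage.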