cs.CL
[1] Contextual Augmentation for Entity Linking using Large Language Models
Daniel Vollmers,Hamada M. Zahera,Diego Moussallem,Axel-Cyrille Ngonga Ngomo
Main category: cs.CL
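TL;DR: Proposes a unified framework that performs entity recognition and disambiguation jointly and uses large language models to enrich mention context, achieving state-of-the-art performance on out-of-domain datasets.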
Details
Motivation: Traditional entity linking uses two separate models for recognition and disambiguation, which is computationally expensive, limits effectiveness, and handles out-of-domain data poorly.
Method: Fine-tune a large language model to perform entity recognition and disambiguation jointly in a unified framework, using its contextual understanding to improve disambiguation.
Result: On benchmark datasets, the approach outperforms several baselines in out-of-domain settings and reaches state-of-the-art performance.
Conclusion: Joint modeling combined with large language models effectively improves entity linking, particularly in out-of-domain applications.
Abstract: Entity Linking involves detecting and linking entity mentions in natural language texts to a knowledge graph. Traditional methods use a two-step process with separate models for entity recognition and disambiguation, which can be computationally intensive and less effective. We propose a fine-tuned model that jointly integrates entity recognition and disambiguation in a unified framework. Furthermore, our approach leverages large language models to enrich the context of entity mentions, yielding better performance in entity disambiguation. We evaluated our approach on benchmark datasets and compared it with several baselines. The evaluation results show that our approach achieves state-of-the-art performance on out-of-domain datasets.
[2] Small Language Models Offer Significant Potential for Science Community
Jian Zhang
Main category: cs.CL
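TL;DR: Proposes an efficient, low-cost approach based on small language models (MiniLM) for precise information retrieval from a large body of geoscience literature, supporting semantic search, sentiment analysis, and topic clustering, with potential applications in fact retrieval, trend analysis, and education.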
Details
Motivation: Although large language models (LLMs) are increasingly used to process scientific literature, concerns about information bias and high computational cost remain, so a more precise, faster, and cheaper retrieval approach is needed.
Method: Build a geoscience corpus of about 77 million high-quality sentences and retrieve information with freely available small language models (MiniLM) using semantic search and sentence-level indexing; study topic evolution and sentiment trends with sentiment analysis and unsupervised clustering.
Result: MiniLM efficiently extracts expert-verified, multi-disciplinary information within the domain, is especially strong on quantitative findings, and can track how conclusions, priorities, advances, and emerging questions in geoscience evolve.
Conclusion: MiniLM has significant potential in geoscience for fact and image retrieval, trend analysis, contradiction identification, and education, offering a more precise, interpretable, and lower-cost alternative to large language models.
Abstract: Recent advancements in natural language processing, particularly with large language models (LLMs), are transforming how scientists engage with the literature. While the adoption of LLMs is increasing, concerns remain regarding potential information biases and computational costs. Rather than LLMs, I developed a framework to evaluate the feasibility of precise, rapid, and cost-effective information retrieval from extensive geoscience literature using freely available small language models (MiniLMs). A curated corpus of approximately 77 million high-quality sentences, extracted from 95 leading peer-reviewed geoscience journals such as Geophysical Research Letters and Earth and Planetary Science Letters published from 2000 to 2024, was constructed. MiniLMs enable a computationally efficient approach for extracting relevant domain-specific information from these corpora through semantic search techniques and sentence-level indexing. Unlike LLMs such as ChatGPT-4, which often produce generalized responses, this approach excels at identifying substantial amounts of expert-verified information with established, multi-disciplinary sources, especially for information with quantitative findings. Furthermore, by analyzing emotional tone via sentiment analysis and topical clusters through unsupervised clustering within sentences, MiniLM provides a powerful tool for tracking the evolution of conclusions, research priorities, advancements, and emerging questions within geoscience communities.
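The sentence-level indexing and semantic search described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's pipeline: the `embed` function is a toy bag-of-words stand-in for a MiniLM sentence encoder, and the three-sentence corpus is invented for the example.

```python
import math
from collections import Counter

def embed(sentence):
    """Toy stand-in for a MiniLM sentence encoder (assumption): word counts."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, index, top_k=2):
    """Rank indexed sentences by similarity to the query and keep the top_k."""
    q = embed(query)
    scored = [(cosine(q, vec), sent) for sent, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Invented toy corpus; a real index would hold one vector per corpus sentence.
corpus = [
    "the mantle plume is hotter than the ambient mantle",
    "sea level rose several millimetres per year",
    "the slab dips steeply beneath the arc",
]
index = [(s, embed(s)) for s in corpus]
results = search("how fast did sea level rise per year", index, top_k=1)
```

Swapping `embed` for a real sentence encoder would leave `search` unchanged, which is the appeal of indexing at the sentence level: each hit is a citable sentence rather than a generated paraphrase.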
Overall, MiniLM holds significant potential within the geoscience community for applications such as fact and image retrievals, trend analyses, contradiction analyses, and educational purposes.
[3] When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs
Richard J. Young,Brandon Gillins,Alice M. Matthews
Main category: cs.CL
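TL;DR: Presents a streamlined evaluation framework that uses 20 carefully designed prompts to assess LLM instruction following across diverse task categories, and applies it in a large-scale empirical study of 256 models, revealing common failure modes and particularly challenging instruction types and giving researchers and practitioners a practical diagnostic tool.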
Details
Motivation: Despite the widespread deployment of large language models, systematically evaluating instruction following remains difficult; existing benchmarks may be memorized by models and bias the evaluation, so novel, focused, and efficient evaluation methods are needed.
Method: Build a compact suite of 20 targeted prompts covering format compliance, content constraints, logical sequencing, and multi-step task execution, and test 256 functionally verified models on OpenRouter, verifying basic functionality first to keep the method rigorous and avoid selection bias.
Result: Models show consistent failure modes on specific instruction types, some instruction categories prove markedly harder, and the framework is validated as an efficient diagnostic tool that gives broad coverage with limited resources.
Conclusion: The study provides a practical evaluation tool that balances comprehensiveness with efficiency and, through a large-scale analysis of the contemporary LLM landscape, exposes key weaknesses in instruction following, supporting the development of more reliable and controllable models.
Abstract: Despite widespread deployment of Large Language Models, systematic evaluation of instruction-following capabilities remains challenging. While comprehensive benchmarks exist, focused assessments that quickly diagnose specific instruction adherence patterns are valuable. As newer models may be trained on existing benchmarks, novel evaluation approaches are needed to assess genuine capabilities rather than memorized performance. This paper presents a streamlined evaluation framework using twenty carefully designed prompts to assess LLM instruction-following across diverse task categories. We demonstrate this framework through a large-scale empirical study conducted on October 14, 2025, testing 256 verified working models from 331 available via OpenRouter. To ensure methodological rigor and prevent selection bias, we first verified each model's basic functionality before inclusion. Unlike large-scale benchmarks requiring extensive computational resources, our approach offers a practical diagnostic tool researchers and practitioners can readily apply. Our methodology builds upon verifiable instructions while introducing a compact test suite balancing comprehensiveness with efficiency. Each prompt targets distinct aspects of instruction following, including format compliance, content constraints, logical sequencing, and multi-step task execution. We evaluate models from major providers (OpenAI, Anthropic, Google, Meta, Mistral) and emerging implementations (Qwen, DeepSeek, community models), providing comparative performance analysis.
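To make the idea of verifiable instructions concrete, the sketch below shows three programmatic checks, one each for format compliance, a content constraint, and logical sequencing. The function names and the specific checks are hypothetical illustrations; the abstract does not list the paper's twenty prompts or how they are scored.

```python
import json

def check_json_keys(response, required_keys):
    """Format compliance: response must be a JSON object with the required keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys) <= set(data)

def check_max_words(response, limit):
    """Content constraint: response must contain at most `limit` words."""
    return len(response.split()) <= limit

def check_ordered(response, items):
    """Logical sequencing: every item appears, in the given order."""
    pos = -1
    for item in items:
        pos = response.find(item, pos + 1)
        if pos == -1:
            return False
    return True

def pass_rate(outcomes):
    """Fraction of checks a model passed."""
    return sum(outcomes) / len(outcomes)

# Hypothetical model outputs scored against the three checks.
outcomes = [
    check_json_keys('{"name": "x", "age": 3}', ["name", "age"]),
    check_max_words("one two three four", 3),
    check_ordered("first boil water, then add tea", ["boil", "tea"]),
]
score = pass_rate(outcomes)
```

Because each check is a deterministic function of the response text, the same suite can be rerun across hundreds of models without a human or LLM judge in the loop.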
Our findings reveal consistent failure modes and identify specific instruction types posing particular challenges. This work contributes both a practical evaluation tool and one of the most comprehensive empirical analyses of instruction-following capabilities across the contemporary LLM landscape.
[4] Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti
Mangsura Kabir Oni,Tabia Tanzin Prama
Main category: cs.CL
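TL;DR: Studies Bengali-to-Sylheti translation by comparing fine-tuned multilingual Transformer models with zero-shot large language models, and finds that the fine-tuned models are significantly better.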
Details
Motivation: Low-resource languages such as Sylheti are understudied in machine translation; the paper explores effective translation methods to make language technology more inclusive.
Method: Fine-tune multilingual Transformer models (such as mBART-50 and MarianMT) and compare them with zero-shot large language models on Bengali-to-Sylheti translation.
Result: The fine-tuned models significantly outperform zero-shot LLMs; mBART-50 achieves the best translation adequacy and MarianMT the best character-level fidelity.
Conclusion: Task-specific fine-tuning is essential for translating low-resource, underrepresented languages and helps advance inclusive language technology.
Abstract: Machine Translation (MT) has advanced from rule-based and statistical methods to neural approaches based on the Transformer architecture. While these methods have achieved impressive results for high-resource languages, low-resource varieties such as Sylheti remain underexplored. In this work, we investigate Bengali-to-Sylheti translation by fine-tuning multilingual Transformer models and comparing them with zero-shot large language models (LLMs). Experimental results demonstrate that fine-tuned models significantly outperform LLMs, with mBART-50 achieving the highest translation adequacy and MarianMT showing the strongest character-level fidelity. These findings highlight the importance of task-specific adaptation for underrepresented languages and contribute to ongoing efforts toward inclusive language technologies.
[5] DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code
Shriyansh Agrawal,Aidan Lau,Sanyam Shah,Ahan M R,Kevin Zhu,Sunishchal Dev,Vasu Sharma
Main category: cs.CL
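TL;DR: Proposes and validates fine-tuned encoder-only small language models (such as RoBERTa and CodeBERTa) for detecting machine-generated multilingual text and source code, showing that they are markedly more accurate and efficient than large language models.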
Details
Motivation: Existing zero-shot detectors trade off computational cost against accuracy and cannot meet the needs of cross-domain detection of machine-generated content.
Method: Fine-tune pre-trained small language models such as RoBERTa and CodeBERTa on specialized datasets for binary classification.
Result: The models reach AUROC of 0.97 to 0.99 and macro-F1 of 0.89 to 0.94, with 8-12x lower latency and 3-5x lower peak VRAM, and retain at least 92% of clean AUROC under cross-generator shift and adversarial transformations.
Conclusion: Fine-tuned small language models outperform large language models at generated-content detection while sharply reducing resource consumption, giving efficient and robust detection.
Abstract: The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods, such as Fast DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often with a trade-off between the two, leaving room for further improvement. To address these gaps, we propose the fine-tuning of encoder-only Small Language Models (SLMs), in particular, the pre-trained models of RoBERTa and CodeBERTa using specialized datasets on source code and other natural language to prove that for the task of binary classification, SLMs outperform LLMs by a huge margin whilst using a fraction of compute. Our encoders achieve AUROC $= 0.97$ to $0.99$ and macro-F1 $0.89$ to $0.94$ while reducing latency by $8$-$12\times$ and peak VRAM by $3$-$5\times$ at $512$-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains $\geq 92\%$ of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included.
[6] Improving Topic Modeling of Social Media Short Texts with Rephrasing: A Case Study of COVID-19 Related Tweets
Wangjiaxuan Xin,Shuhua Yin,Shi Chen,Yaorong Ge
Main category: cs.CL
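TL;DR: Proposes TM-Rephrase, a model-agnostic framework that uses large language models to rewrite tweets in more formal language, improving topic modeling of short social media texts during crises.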
Details
Motivation: Traditional topic models perform poorly on short social media texts because the language is brief, informal, and noisy, which yields incoherent or redundant topics that are hard to interpret.
Method: Use large language models to rewrite tweets under two strategies (general rephrasing and colloquial-to-formal rephrasing), apply several topic modeling algorithms to the rewritten text, and evaluate on 25,027 COVID-19-related tweets.
Result: TM-Rephrase significantly improves topic coherence, uniqueness, and diversity and reduces topic redundancy; the colloquial-to-formal strategy works best and helps LDA in particular.
Conclusion: The study offers a general way to strengthen social media topic modeling in public health, supporting a better understanding of public discourse during health crises and in other important domains.
Abstract: Social media platforms such as Twitter (now X) provide rich data for analyzing public discourse, especially during crises such as the COVID-19 pandemic. However, the brevity, informality, and noise of social media short texts often hinder the effectiveness of traditional topic modeling, producing incoherent or redundant topics that are often difficult to interpret. To address these challenges, we have developed TM-Rephrase, a model-agnostic framework that leverages large language models (LLMs) to rephrase raw tweets into more standardized and formal language prior to topic modeling. Using a dataset of 25,027 COVID-19-related Twitter posts, we investigate the effects of two rephrasing strategies, general- and colloquial-to-formal-rephrasing, on multiple topic modeling methods. Results demonstrate that TM-Rephrase improves three metrics measuring topic modeling performance (i.e., topic coherence, topic uniqueness, and topic diversity) while reducing topic redundancy of most topic modeling algorithms, with the colloquial-to-formal strategy yielding the greatest performance gains, especially for the Latent Dirichlet Allocation (LDA) algorithm. This study contributes to a model-agnostic approach to enhancing topic modeling in public-health-related social media analysis, with broad implications for improved understanding of public discourse in health crises as well as other important domains.
[7] Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection
Hongyi He,Xiao Liu,Zhenghao Lin,Mingni Tang,Yi Cheng,Jintao Wang,Wenjie Li,Peng Cheng,Yeyun Gong
Main category: cs.CL
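TL;DR: Proposes the Orthogonal Diversity-Aware Selection (ODiS) algorithm, which decorrelates multi-dimensional scores with principal component analysis to select high-quality and diverse pre-training data for language models, significantly improving downstream performance.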
Details
Motivation: Existing score-based data selection methods are biased in how they trade quality against diversity: top-scored data lack diversity, which hurts model performance.
Method: ODiS (1) scores data along several dimensions, including language quality, knowledge quality, and comprehension difficulty; (2) uses PCA to transform the multi-dimensional scores into orthogonal evaluation dimensions; (3) trains a RoBERTa-based scorer to regress the PCA-projected scores; and (4) builds the training set by selecting top-scored data within each orthogonal dimension.
Result: Data selected by ODiS overlap by less than 2% across dimensions, confirming orthogonality, and models trained on ODiS-selected data significantly outperform other baselines on downstream benchmarks.
Conclusion: Ensuring both quality and diversity in pre-training data requires decomposing correlated metrics into orthogonal feature dimensions; ODiS offers an effective solution for LLM data selection.
Abstract: High-quality pre-training data is crucial for large language models, where quality captures factual reliability and semantic value, and diversity ensures broad coverage and distributional heterogeneity. Existing approaches typically rely on single or multiple-dimensional score-based selection. However, directly selecting top-scored data often degrades performance, and sampling from a broader range is required to recover results. The above non-monotonicity between dataset scores and downstream benchmark results reveals a fundamental bias: score-based methods collapse correlated dimensions, causing top-scored data to appear high-quality while systematically overlooking diversity. We argue that ensuring diversity requires decomposing correlated metrics into orthogonal feature dimensions, from which the top-scored data can be directly selected. Therefore, we propose the Orthogonal Diversity-Aware Selection (ODiS) algorithm, which preserves both quality and diversity during data selection. First, ODiS evaluates data from multiple dimensions, covering language quality, knowledge quality, and comprehension difficulty. The multi-dimensional scores are then decorrelated via Principal Component Analysis (PCA), yielding orthogonal evaluation dimensions. For each dimension, a RoBERTa-based scorer is trained to regress the data onto PCA-projected scores, enabling scalable inference on large corpora. Finally, ODiS constructs the training dataset by selecting top-scored data within each orthogonal dimension, thereby ensuring both quality and diversity.
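The decorrelate-then-select idea can be sketched in plain Python as below. This is a rough illustration under stated assumptions, not the authors' implementation: the score table is invented, PCA is done by power iteration with deflation instead of a library call, the learned RoBERTa scorer is omitted (raw scores are projected directly), and each axis keeps whatever sign power iteration returns.

```python
import math

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvector(mat, iters=500):
    """Dominant eigenvector of a symmetric matrix by power iteration."""
    d = len(mat)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-uniform start
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def principal_axes(rows, k):
    """First k principal axes, removing each found component (deflation)."""
    cov = covariance(center(rows))
    d = len(cov)
    axes = []
    for _ in range(k):
        v = top_eigenvector(cov)
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        axes.append(v)
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return axes

def select(rows, n_axes, top_k):
    """Union of the top_k row indices along each orthogonal axis."""
    axes = principal_axes(rows, n_axes)
    centered = center(rows)
    chosen = set()
    for ax in axes:
        proj = [sum(r[j] * ax[j] for j in range(len(ax))) for r in centered]
        ranked = sorted(range(len(rows)), key=lambda i: proj[i], reverse=True)
        chosen.update(ranked[:top_k])
    return chosen, axes

# Invented scores: (language quality, knowledge quality, comprehension difficulty).
# The first two columns are strongly correlated; the third is independent.
scores = [(9, 7, 7), (9, 7, 3), (7, 9, 7), (7, 9, 3),
          (3, 1, 7), (3, 1, 3), (1, 3, 7), (1, 3, 3)]
chosen, axes = select(scores, n_axes=2, top_k=4)
```

In this toy table, ranking on the raw correlated columns would keep returning the same four documents, whereas the second orthogonal axis pulls in documents 4 and 6, which score low on quality but high on the independent difficulty dimension.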
Empirical results show that ODiS-selected data exhibit less than 2% inter-dimension overlap, confirming orthogonality between dimensions. More importantly, models trained with ODiS-selected data significantly outperform other baselines on downstream benchmarks, highlighting the necessity of orthogonal, diversity-aware data selection for LLMs.
[8] Context-aware Fairness Evaluation and Mitigation in LLMs
Afrozah Nadeem,Mark Dras,Usman Naseem
Main category: cs.CL
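TL;DR: Proposes a dynamic, reversible, pruning-based framework that detects context-aware neuron activations and applies adaptive masking to modulate neuron influence during generation, giving fine-grained, memory-aware, knowledge-preserving fairness control at inference time.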
Details
Motivation: Undesirable behaviors embedded in the internal representations of large language models undermine fairness and spread harmful content; existing training-time or data-centric methods are costly, irreversible, and slow to adapt to new conversational contexts.
Method: Design a dynamic, reversible pruning framework that detects context-aware neuron activations at inference time and applies adaptive masks to modulate their influence.
Result: The method yields more coherent behavior in multilingual single-turn and multi-turn dialogue, supports knowledge retention and fine-grained bias mitigation, and offers memory efficiency and dynamic fairness control.
Conclusion: The framework provides a flexible, transparent, and reversible way to mitigate harmful behavior in large models and is suited to real-world conversational AI systems.
Abstract: Large language models often display undesirable behaviors embedded in their internal representations, undermining fairness and leading to inconsistency drift, amplification of harmful content, and the propagation of unwanted patterns during extended dialogue and conversations. Although training-time or data-centric methods attempt to reduce these effects, they are computationally expensive, irreversible once deployed, and slow to adapt to new conversational contexts. Pruning-based methods provide a flexible and transparent way to reduce bias by adjusting the neurons responsible for certain behaviors. However, most existing approaches are static; once a neuron is removed, the model loses the ability to adapt when the conversation or context changes. To address this, we propose a dynamic, reversible, pruning-based framework that detects context-aware neuron activations and applies adaptive masking to modulate their influence during generation. Our inference-time solution provides fine-grained, memory-aware mitigation with knowledge-preserved, more coherent behavior across multilingual single- and multi-turn dialogues, enabling dynamic fairness control in real-world conversational AI.
[9] MMAO-Bench: MultiModal All in One Benchmark Reveals Compositional Law between Uni-modal and Omni-modal in OmniModels
Chen Chen,ZeYang Hu,Fengjiao Chen,Liya Ma,Jiaxing Liu,Xiaoyu Li,Xuezhi Cao
Main category: cs.CL
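TL;DR: Proposes MMAO-Bench, a new high-quality and diverse benchmark for omni models that evaluates both uni-modal and omni-modal understanding; experiments reveal a compositional law between cross-modal and uni-modal performance, with a bottleneck effect on weak models and a synergistic effect on strong ones.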
Details
Motivation: Multimodal large models are moving from uni-modal understanding toward unifying visual, audio, and language modalities, but the relationship between uni-modal and omni-modal capability is unclear, and comprehensive evaluation is lacking.
Method: Build the MultiModal All in One Benchmark (MMAO-Bench), with 1880 human-curated samples across 44 task types and an innovative multi-step open-ended question type that better assesses complex reasoning, testing both uni-modal and omni-modal understanding.
Result: Experiments reveal a compositional law between cross-modal and uni-modal performance: omni-modal capability acts as a bottleneck for weak models and shows synergistic promotion for strong models.
Conclusion: MMAO-Bench effectively evaluates the overall understanding of multimodal models and shows that models of different strength behave differently under modality fusion, providing an evaluation tool and insights for future omni models.
Abstract: Multimodal large language models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal remains unclear, which requires comprehensive evaluation to drive omni models' intelligence evolution. In this work, we propose a novel, high-quality and diverse omni model benchmark, MultiModal All in One Benchmark (MMAO-Bench), which effectively assesses both uni-modal and omni-modal understanding capabilities. The benchmark consists of 1880 human-curated samples across 44 task types, and an innovative multi-step open-ended question type that better assesses complex reasoning tasks. Experimental results show the compositional law between cross-modal and uni-modal performance: the omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.
[10] Misinformation Detection using Large Language Models with Explainability
Jainee Patel,Chintan Bhatt,Himani Trivedi,Thanh Thi Nguyen
Main category: cs.CL
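TL;DR: Proposes an explainable and computationally efficient misinformation detection method based on pretrained language models, optimizing RoBERTa and DistilBERT with staged fine-tuning and layer-wise learning rate decay and combining LIME and SHAP for local and global explanations. Experiments show that DistilBERT keeps performance while substantially reducing computational cost.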
Details
Motivation: The rapid spread of misinformation on online platforms undermines trust among individuals and hinders informed decision making, so efficient and trustworthy detection methods are needed.
Method: Use a two-step fine-tuning strategy: first freeze the backbone and train only the classification head, then progressively unfreeze backbone layers while applying layer-wise learning rate decay. RoBERTa and DistilBERT are evaluated under a unified preprocessing and stratified-split protocol, with LIME (token level) and SHAP (global feature attribution) integrated for explanations.
Result: On two real-world datasets, COVID Fake News and FakeNewsNet GossipCop, DistilBERT achieves accuracy comparable to RoBERTa with significantly lower computational requirements; the method provides good local and global explainability without sacrificing performance.
Conclusion: Lightweight pretrained language models combined with principled fine-tuning and explainability techniques can greatly reduce computational cost while preserving performance, providing an effective framework for scalable, trustworthy misinformation detection.
Abstract: The rapid spread of misinformation on online platforms undermines trust among individuals and hinders informed decision making. This paper presents an explainable and computationally efficient pipeline to detect misinformation using transformer-based pretrained language models (PLMs). We optimize both RoBERTa and DistilBERT using a two-step strategy: first, we freeze the backbone and train only the classification head; then, we progressively unfreeze the backbone layers while applying layer-wise learning rate decay. On two real-world benchmark datasets, COVID Fake News and FakeNewsNet GossipCop, we test the proposed approach with a unified protocol of preprocessing and stratified splits. To ensure transparency, we integrate Local Interpretable Model-Agnostic Explanations (LIME) to present token-level rationales and SHapley Additive exPlanations (SHAP) at the global feature attribution level. The results demonstrate that DistilBERT achieves accuracy comparable to RoBERTa while requiring significantly less computational resources. This work makes two key contributions: (1) it quantitatively shows that a lightweight PLM can maintain task performance while substantially reducing computational cost, and (2) it presents an explainable pipeline that retrieves faithful local and global justifications without compromising performance.
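The second fine-tuning step combines two schedules: which layers are trainable at each stage, and what learning rate each layer receives. A minimal sketch of both follows. The decay factor, base learning rate, and number of layers unfrozen per stage are hypothetical values, since the abstract does not report them, and the function names are illustrative only.

```python
def layerwise_learning_rates(base_lr, num_layers, decay):
    """Return one learning rate per backbone layer, lowest layer first.

    The top layer (closest to the classification head) gets base_lr; each
    layer below it is scaled by a further factor of `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - layer)
            for layer in range(num_layers)]

def unfreeze_schedule(num_layers, layers_per_stage):
    """Yield the sorted trainable layer indices at each stage, top layers first."""
    trainable = []
    for start in range(num_layers - 1, -1, -layers_per_stage):
        stop = max(start - layers_per_stage, -1)
        trainable.extend(range(start, stop, -1))
        yield sorted(trainable)

# Assumed settings for a 6-layer backbone such as DistilBERT.
rates = layerwise_learning_rates(base_lr=1.0, num_layers=3, decay=0.5)
stages = list(unfreeze_schedule(num_layers=6, layers_per_stage=2))
```

In a training framework, each entry of `rates` would typically become the learning rate of one optimizer parameter group, and each entry of `stages` would determine which layers have gradients enabled for that stage.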
The results suggest that PLMs combined with principled fine-tuning and interpretability can be an effective framework for scalable, trustworthy misinformation detection.
[11] Evaluating LLM Story Generation through Large-scale Network Analysis of Social Structures
Hiroshi Nonaka,K. E. Perry
Main category: cs.CL
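TL;DR: Proposes a scalable, character-network-based method for evaluating the creative capabilities of large language models in story generation.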
Details
Motivation: Existing methods for evaluating LLM creativity rely on human assessment and are hard to scale.
Method: Model the social relationships in a story as a signed character network and analyze properties such as network density, clustering, and edge weights.
Result: Across more than 1,200 stories, LLM-generated stories lean toward tightly knit, positive relationships, differing from human writing.
Conclusion: The method effectively exposes tendencies and limitations in the narrative structures of current LLMs and offers a feasible path toward automated evaluation of creativity.
Abstract: Evaluating the creative capabilities of large language models (LLMs) in complex tasks often requires human assessments that are difficult to scale. We introduce a novel, scalable methodology for evaluating LLM story generation by analyzing underlying social structures in narratives as signed character networks. To demonstrate its effectiveness, we conduct a large-scale comparative analysis using networks from over 1,200 stories, generated by four leading LLMs (GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash) and a human-written corpus. Our findings, based on network properties like density, clustering, and signed edge weights, show that LLM-generated stories consistently exhibit a strong bias toward tightly-knit, positive relationships, which aligns with findings from prior research using human assessment. Our proposed approach provides a valuable tool for evaluating limitations and tendencies in the creative storytelling of current and future LLMs.
[12] Lost in the Maze: Overcoming Context Limitations in Long-Horizon Agentic Search
Howard Yen,Ashwin Paranjape,Mengzhou Xia,Thejas Venkatesh,Jack Hessel,Danqi Chen,Yuhao Zhang
Main category: cs.CL
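TL;DR: Introduces SLIM, a simple, lightweight information management framework for long-horizon agentic search that separates the retrieval tools and periodically summarizes the search trajectory, improving performance while reducing cost and tool calls.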
Details
Motivation: Existing agentic search frameworks are limited on long trajectories by context length, accumulated noise, and tool budgets, so they scale poorly and need more efficient information management.
Method: SLIM splits retrieval into separate search and browse tools and periodically summarizes the search trajectory to keep the context concise, enabling longer and more focused searches.
Result: Across several base models, SLIM matches strong open-source baselines on long-horizon tasks at substantially lower cost and with 4-6x fewer tool calls; with o3 it reaches 56% on BrowseComp and 31% on HLE, ahead of existing open-source frameworks by 8 and 4 absolute points, respectively.
Conclusion: SLIM's simple design addresses the scalability problem in long-horizon agentic search while reducing hallucinations, and its analysis pipeline and tool design offer a reference for future work.
Abstract: Long-horizon agentic search requires iteratively exploring the web over long trajectories and synthesizing information across many sources, and is the foundation for enabling powerful applications like deep research systems. In this work, we show that popular agentic search frameworks struggle to scale to long trajectories primarily due to context limitations: they accumulate long, noisy content, hit context window and tool budgets, or stop early. Then, we introduce SLIM (Simple Lightweight Information Management), a simple framework that separates retrieval into distinct search and browse tools, and periodically summarizes the trajectory, keeping context concise while enabling longer, more focused searches. On long-horizon tasks, SLIM achieves comparable performance at substantially lower cost and with far fewer tool calls than strong open-source baselines across multiple base models. Specifically, with o3 as the base model, SLIM achieves 56% on BrowseComp and 31% on HLE, outperforming all open-source frameworks by 8 and 4 absolute points, respectively, while incurring 4-6x fewer tool calls. Finally, we release an automated fine-grained trajectory analysis pipeline and error taxonomy for characterizing long-horizon agentic search frameworks; SLIM exhibits fewer hallucinations than prior systems. We hope our analysis framework and simple tool design inform future long-horizon agents.
[13] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
Zhilin Wang,Jaehun Jung,Ximing Lu,Shizhe Diao,Ellie Evans,Jiaqi Zeng,Pavlo Molchanov,Yejin Choi,Jan Kautz,Yi Dong
Main category: cs.CL
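TL;DR: Introduces ProfBench, a dataset of more than 7000 response-criterion pairs evaluated by domain experts, for assessing large language models in professional fields such as physics, chemistry, finance, and consulting. Using a low-cost, debiased LLM-Judge for automatic scoring, the study finds that even top models such as GPT-5-high reach only 65.9% overall, exposing the limits of current models on complex professional tasks.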
Details
Motivation: Existing LLM evaluation concentrates on easily verified tasks such as mathematics and programming and lacks effective assessment of real-world scenarios like processing professional documents, synthesizing information, and generating reports, so a benchmark closer to real professional needs is required.
Method: Build the ProfBench dataset across four professional domains (Physics PhD, Chemistry PhD, Finance MBA, Consulting MBA), with human-annotated responses and scoring criteria for each sample; design and optimize an LLM-Judge that uses prompt engineering and debiasing to reduce self-enhancement bias and sharply cut evaluation cost.
Result: ProfBench is a major challenge for state-of-the-art LLMs, with GPT-5-high scoring at most 65.9%; proprietary models outperform open-weight models overall, and extended thinking (for example, longer reasoning) helps on professional tasks.
Conclusion: ProfBench provides a reliable and scalable benchmark for LLM capability in professional domains, encourages models that better fit real applications, and, through an efficient LLM-Judge, makes high-quality evaluation affordable to the broader community.
Abstract: Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench
[14] Dynamic Evaluation for Oversensitivity in LLMs
Sophia Xiao Pu,Sitao Cheng,Xin Eric Wang,William Yang Wang
Main category: cs.CL
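TL;DR: Proposes a framework that dynamically generates challenging, model-specific datasets and uses it to build OVERBENCH, a benchmark covering 25 large language models and 450,000 samples, for continuously monitoring oversensitivity as models evolve.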
Details
Motivation: Existing benchmarks rely on static datasets, which suffer from data contamination and declining evaluative power as models evolve and so fail to capture oversensitive behavior.
Method: Develop a framework that dynamically generates model-specific challenging datasets and use it to build OVERBENCH, a benchmark covering diverse LLM families.
Result: OVERBENCH contains 450,000 samples and exposes defensive trigger patterns and model vulnerabilities that static datasets miss, enabling dynamic, continuous evaluation of oversensitivity.
Conclusion: Dynamically generating model-specific datasets evaluates and monitors LLM oversensitivity more effectively and provides a strong tool for future research on model safety and robustness.
Abstract: Oversensitivity occurs when language models defensively reject prompts that are actually benign. This behavior not only disrupts user interactions but also obscures the boundary between harmful and harmless content. Existing benchmarks rely on static datasets that degrade over time as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model's unique behavior. Building on this approach, we construct OVERBENCH, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 25 models. OVERBENCH provides a dynamic and evolving perspective on oversensitivity, allowing for continuous monitoring of defensive triggers as models advance, highlighting vulnerabilities that static datasets overlook.
[15] Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues
Eunsu Kim,Junyeong Park,Juhyun Oh,Kiwoong Park,Seyoung Song,A. Seza Dogruoz,Najoung Kim,Alice Oh
Main category: cs.CL
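TL;DR: Introduces SCRIPTS, a dataset for evaluating how well large language models infer interpersonal relationships in English and Korean dialogues; existing models perform markedly worse on Korean, often pick unlikely relationships, and gain little from chain-of-thought prompting.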
Details
Motivation: As large language models are increasingly used in human-AI interaction, their social reasoning in interpersonal contexts becomes critical, yet multilingual datasets for evaluating it are lacking.
Method: Build SCRIPTS, a dataset of 1,000 English and Korean dialogues from movie scripts, each annotated by native speakers with probabilistic relationship labels (Highly Likely, Less Likely, Unlikely), and evaluate nine mainstream models.
Result: Current proprietary models reach 75-80% accuracy on English but drop to 58-69% on Korean; models select Unlikely relationships in 10-25% of responses; chain-of-thought prompting brings minimal benefit for social reasoning and can amplify social biases.
Conclusion: Current LLMs have clear shortcomings in cross-lingual social reasoning, and more socially aware language models are needed.
Abstract: As large language models (LLMs) are increasingly used in human-AI interactions, their social reasoning capabilities in interpersonal contexts are critical. We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean, sourced from movie scripts. The task involves evaluating models' social reasoning capability to infer the interpersonal relationships (e.g., friends, sisters, lovers) between speakers in each dialogue. Each dialogue is annotated with probabilistic relational labels (Highly Likely, Less Likely, Unlikely) by native (or equivalent) Korean and English speakers from Korea and the U.S. Evaluating nine models on our task, current proprietary LLMs achieve around 75-80% on the English dataset, whereas their performance on Korean drops to 58-69%. More strikingly, models select Unlikely relationships in 10-25% of their responses. Furthermore, we find that thinking models and chain-of-thought prompting, effective for general reasoning, provide minimal benefits for social reasoning and occasionally amplify social biases. Our findings reveal significant limitations in current LLMs' social reasoning capabilities, highlighting the need for efforts to develop socially-aware language models.
[16] Re:Member: Emotional Question Generation from Personal Memories
Zackary Rackauckas,Nobuaki Minematsu,Julia Hirschberg
Main category: cs.CL
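TL;DR: Re:Member is a system that enhances second-language learning through emotional expression and memory association, using personal videos and emotional speech generation to encourage affective recall and conversational engagement.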
Details
Motivation: To increase emotional engagement and retention in second-language learning by exploring the role of emotionally expressive, memory-grounded interaction.
Method: The system combines WhisperX for transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotionally styled speech (such as whispers or late-night tones), matching emotional tone to visual context in a modular generation pipeline.
Result: Re:Member fuses emotional tone with personal visual content and generates stylized spoken questions that prompt affective recall and interest in conversation.
Conclusion: The work shows the potential of affective design and personal media in learner-centered educational technology and offers a new interaction paradigm for language learning systems.
Abstract: We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users' personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.
[17] When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
Abeer Badawi,Elahe Rahimi,Md Tahmid Rahman Laskar,Sheri Grach,Lindsay Bertrand,Lames Danok,Jimmy Huang,Frank Rudzicz,Elham Dolatabadi
Main category: cs.CL
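TL;DR: Introduces two benchmarks for evaluating large language models in mental health support, MentalBench-100k and MentalAlign-70k, together with the Affective Cognitive Agreement Framework for quantifying agreement, reliability, and bias between LLM judges and human experts.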
Details
Motivation: Because therapeutic dialogue is emotionally and cognitively complex, existing benchmarks are limited in scale, reliability, and data sources and cannot adequately evaluate LLMs in mental health support.
Method: Build MentalBench-100k, a dataset of 100,000 response pairs; compare four high-performing LLM judges with human experts across 70,000 ratings on seven attributes; and propose the Affective Cognitive Agreement Framework, based on intraclass correlation coefficients (ICC), for reliability analysis.
Result: LLM judges systematically inflate scores; they are highly reliable on cognitive attributes such as guidance and informativeness but less precise or unreliable on empathy, safety, and relevance.
Conclusion: The benchmarks and framework lay methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health.
Abstract: Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale and reliability, often rely on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real scenarios datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health.
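The distinction between agreement and consistency, and how systematic inflation shows up in it, can be illustrated with two single-rater ICC forms. This sketch assumes the standard two-way ANOVA formulas (absolute-agreement and consistency ICC in the Shrout and Fleiss style); the abstract does not say which ICC forms the authors use, confidence intervals are omitted, and the ratings are invented.

```python
def _mean_squares(ratings):
    """Two-way ANOVA mean squares for an n-subjects by k-raters table."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return n, k, msr, msc, mse

def icc_agreement(ratings):
    """Absolute-agreement ICC: penalizes a constant offset between raters."""
    n, k, msr, msc, mse = _mean_squares(ratings)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def icc_consistency(ratings):
    """Consistency ICC: ignores a constant offset between raters."""
    n, k, msr, msc, mse = _mean_squares(ratings)
    return (msr - mse) / (msr + (k - 1) * mse)

# Invented ratings: column 0 is a human expert, column 1 an LLM judge that
# ranks responses identically but scores every one a full point higher.
inflated = [[1, 2], [2, 3], [3, 4]]
agreement = icc_agreement(inflated)
consistency = icc_consistency(inflated)
```

For these invented ratings the consistency form is 1.0 while the agreement form drops to about 0.67, which is the signature of a judge that orders responses like the expert but inflates every score.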
We release the benchmarks and codes at: https://github.com/abeerbadawi/MentalBench/
[18] From Memorization to Generalization: Fine-Tuning Large Language Models for Biomedical Term-to-Identifier Normalization
Suswitha Pericharla,Daniel B. Hier,Tayo Obafemi-Ajayi
Main category: cs.CL
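TL;DR: This study evaluates large language models (LLMs) on term normalization across different biomedical terminologies and finds that the effect of fine-tuning depends on identifier popularity and degree of lexicalization: gene-protein terms, which are popular and well lexicalized, show marked memorization and generalization, whereas GO and HPO terms, whose identifiers are arbitrary and weakly lexicalized, support only rote memorization.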
Details
Motivation: Biomedical data integration depends on term normalization, the mapping of natural language terms to standard identifiers. LLMs show great promise for this task but perform unevenly across terminologies, so the limits of their memorization and generalization, and the factors behind them, need to be understood. Method: Fine-tunes models such as Llama 3.1 8B on several biomedical ontologies (e.g., GO, HPO, GENE), measures accuracy on training terms (memorization) and validation terms (generalization), uses embedding analyses to examine semantic alignment, and studies how identifier popularity and lexicalization affect performance. Result: After fine-tuning, memorization on GO terms improved by up to 77% but generalization was poor; HPO terms showed almost no improvement; GENE terms gained in both memorization and generalization (13.9% generalization gain). GPT-4o outperformed the Llama models on all terminologies. Embedding analyses showed tight semantic alignment between gene symbols and protein names but weak alignment between terms and identifiers for GO and HPO. Conclusion: Whether fine-tuning improves term normalization depends on identifier popularity and lexicalization: popularity promotes memorization and strong lexicalization supports generalization. For terminologies lacking both (such as GO and HPO), fine-tuning has limited effect and the model can only memorize by rote. The findings provide a framework for predicting when fine-tuning will be effective. Abstract: Effective biomedical data integration depends on automated term normalization, the mapping of natural language biomedical terms to standardized identifiers. This linking of terms to identifiers is essential for semantic interoperability. Large language models (LLMs) show promise for this task but perform unevenly across terminologies. We evaluated both memorization (training-term performance) and generalization (validation-term performance) across multiple biomedical ontologies. Fine-tuning Llama 3.1 8B revealed marked differences by terminology. GO mappings showed strong memorization gains (up to 77% improvement in term-to-identifier accuracy), whereas HPO showed minimal improvement. Generalization occurred only for protein-gene (GENE) mappings (13.9% gain), while fine-tuning for HPO and GO yielded negligible transfer. Baseline accuracy varied by model scale, with GPT-4o outperforming both Llama variants for all terminologies. Embedding analyses showed tight semantic alignment between gene symbols and protein names but weak alignment between terms and identifiers for GO or HPO, consistent with limited lexicalization. Fine-tuning success depended on two interacting factors: identifier popularity and lexicalization. Popular identifiers were more likely encountered during pretraining, enhancing memorization. Lexicalized identifiers, such as gene symbols, enabled semantic generalization. By contrast, arbitrary identifiers in GO and HPO constrained models to rote learning.
These findings provide a predictive framework for when fine-tuning enhances factual recall versus when it fails due to sparse or non-lexicalized identifiers.
[19] That's Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation
Jaesung Bae,Cameron Churchwell,Mitchell Hermon,Tsun-An Hsieh,Jocelyn Xu,Yekaterina Yegorova,Mark Hasegawa-Johnson,Heng Ji
Main category: cs.CL
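TL;DR: This paper studies how large language models behave when their parametric knowledge conflicts with information in the prompt, proposes a domain-agnostic framework for constructing and interpreting knowledge conflicts in code generation, and designs a new evaluation method and dataset. Experiments show that sufficiently large LLMs encode the notion of a knowledge conflict in their parameters, enabling detection with up to 80.65% accuracy, and that activation-level steering improves steering success by up to 12.6% over a random baseline.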
Details
Motivation: To explore how large language models behave under knowledge conflicts, in particular how they handle contradictions between internal knowledge and external prompts in code generation. Method: Proposes a domain-agnostic framework for constructing and interpreting knowledge conflicts, designs an evaluation method and dataset tailored to code conflict scenarios, and uses activation-level analysis for conflict detection and steering experiments. Result: Sufficiently large LLMs can detect knowledge conflicts with up to 80.65% accuracy; activation-level steering improves success by up to 12.6% over a random baseline, but the effect depends jointly on model size, task domain, and steering direction. Conclusion: LLMs have an intrinsic ability to recognize knowledge conflicts, and activation steering can effectively intervene on their outputs, but its effectiveness depends on balancing model scale, task type, and steering strategy. Abstract: This paper investigates how large language models (LLMs) behave when faced with discrepancies between their parametric knowledge and conflicting information contained in a prompt. Building on prior question-answering (QA) research, we extend the investigation of knowledge conflicts to the realm of code generation. We propose a domain-agnostic framework for constructing and interpreting such conflicts, along with a novel evaluation method and dataset tailored to code conflict scenarios. Our experiments indicate that sufficiently large LLMs encode the notion of a knowledge conflict in their parameters, enabling us to detect knowledge conflicts with up to 80.65% accuracy. Building on these insights, we show that activation-level steering can achieve up to a 12.6% improvement in steering success over a random baseline. However, effectiveness depends critically on balancing model size, task domain, and steering direction. The experiment code and data will be made publicly available after acceptance.
[20] A Graph Signal Processing Framework for Hallucination Detection in Large Language Models
Valentin Noël
Main category: cs.CL
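TL;DR: Proposes a spectral analysis framework that detects hallucinations in large language models by modeling transformer layers as dynamic graphs; experiments report 88.75% accuracy, outperforming traditional baselines.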
Details
Motivation: Large language models struggle to separate factual reasoning from hallucination in their outputs, so effective methods are needed to identify and analyze different hallucination types. Method: Models transformer layers as dynamic graphs induced by attention, uses graph signal processing to define spectral features (such as Dirichlet energy, spectral entropy, and high-frequency energy ratio), and analyzes how they behave across different output types. Result: Factual statements show a consistent "energy mountain" pattern with low-frequency convergence, while logical contradictions, semantic errors, and substitution hallucinations each show clearly different spectral signatures; a simple detector built on these features reaches 88.75% accuracy, well above the perplexity-based baseline (75%). Conclusion: Spectral geometry can capture the reasoning patterns and error behaviors of large language models, offering an effective analysis framework for hallucination detection. Abstract: Large language models achieve impressive results, but distinguishing factual reasoning from hallucinations remains challenging. We propose a spectral analysis framework that models transformer layers as dynamic graphs induced by attention, with token embeddings as signals on these graphs. Through graph signal processing, we define diagnostics including Dirichlet energy, spectral entropy, and high-frequency energy ratios, with theoretical connections to computational stability. Experiments across GPT architectures suggest universal spectral patterns: factual statements exhibit consistent "energy mountain" behavior with low-frequency convergence, while different hallucination types show distinct signatures. Logical contradictions destabilize spectra with large effect sizes ($g>1.0$), semantic errors remain stable but show connectivity drift, and substitution hallucinations display intermediate perturbations. A simple detector using spectral signatures achieves 88.75% accuracy versus 75% for perplexity-based baselines, demonstrating practical utility. These findings indicate that spectral geometry may capture reasoning patterns and error behaviors, potentially offering a framework for hallucination detection in large language models.
[21] Training-Free Spectral Fingerprints of Voice Processing in Transformers
Valentin Noël
Main category: cs.CL
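TL;DR: Using graph signal processing on attention-induced token graphs, this study shows that different transformer architectures exhibit distinct connectivity patterns, or "computational fingerprints", during linguistic computation. Tracking changes in algebraic connectivity (the Fiedler value), it finds that Phi-3-Mini shows a pronounced early-layer disruption for English but not for other languages, reflecting its English-centered training; Qwen2.5-7B and LLaMA-3.2-1B show different language response patterns. These spectral features correlate strongly with behavioral differences and can be modulated by attention head ablations, confirming their functional relevance. The results indicate that training emphasis leaves detectable computational imprints; the framework also distinguishes reasoning modes and serves as a training-free diagnostic for architectural biases.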
Details
Motivation: To examine whether different transformer models, when performing the same linguistic computation, produce distinct connectivity patterns because of architectural differences, and whether these "computational fingerprints" reflect training preferences and language specificity. Method: Applies graph signal processing to track changes in the algebraic connectivity (Δλ₂) of attention-induced token graphs under voice alternation, compares 20 languages and three model families with a focus on early layers (2-5), and validates functional relevance through attention head ablations. Result: Phi-3-Mini shows a marked early-layer connectivity drop for English (Δλ₂ ≈ -0.446) with no clear change in other languages; Qwen2.5-7B shows distributed shifts that are largest for morphologically rich languages; LLaMA-3.2-1B responds systematically but weakly. These spectral features correlate strongly with behavior (Phi-3: r = -0.976) and change when attention heads are removed, confirming functional relevance. Conclusion: Differences in training emphasis and architecture give transformer models detectable "computational fingerprints" that appear as language-specific connectivity patterns. These patterns reflect language bias, correlate with behavior, and are functionally important. The proposed spectral framework can serve as a training-free diagnostic for revealing architectural biases and supporting reliability analysis. Abstract: Different transformer architectures implement identical linguistic computations via distinct connectivity patterns, yielding model-imprinted "computational fingerprints" detectable through spectral analysis. Using graph signal processing on attention-induced token graphs, we track changes in algebraic connectivity (Fiedler value, $\Delta\lambda_2$) under voice alternation across 20 languages and three model families, with a prespecified early window (layers 2-5). Our analysis uncovers clear architectural signatures: Phi-3-Mini shows a dramatic English-specific early-layer disruption ($\overline{\Delta\lambda_2}_{[2,5]} \approx -0.446$) while effects in 19 other languages are minimal, consistent with public documentation that positions the model primarily for English use. Qwen2.5-7B displays small, distributed shifts that are largest for morphologically rich languages, and LLaMA-3.2-1B exhibits systematic but muted responses. These spectral signatures correlate strongly with behavioral differences (Phi-3: $r=-0.976$) and are modulated by targeted attention head ablations, linking the effect to early attention structure and confirming functional relevance. Taken together, the findings are consistent with the view that training emphasis can leave detectable computational imprints: specialized processing strategies that manifest as measurable connectivity patterns during syntactic transformations.
Beyond voice alternation, the framework differentiates reasoning modes, indicating utility as a simple, training-free diagnostic for revealing architectural biases and supporting model reliability analysis.
[22] Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges
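For readers unfamiliar with the Fiedler value mentioned above, the following minimal sketch shows one way to compute the algebraic connectivity of a graph built from an attention matrix. The symmetrization step, the unnormalized Laplacian, the helper name `fiedler_value`, and the toy matrices are all assumptions made for illustration; the abstract does not specify the paper's exact graph construction.

```python
import numpy as np


def fiedler_value(attn):
    """Second-smallest eigenvalue of the Laplacian of a symmetrized weight matrix."""
    attn = np.asarray(attn, dtype=float)
    weights = 0.5 * (attn + attn.T)       # assumed: undirected graph via symmetrization
    degree = np.diag(weights.sum(axis=1))
    laplacian = degree - weights          # unnormalized graph Laplacian L = D - W
    eigenvalues = np.linalg.eigvalsh(laplacian)  # returned in ascending order
    return float(eigenvalues[1])


# Toy stand-ins for attention graphs over three tokens (not real model output).
complete = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])  # every token linked to every other
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])      # a chain of three tokens

# A negative difference means the second graph is less tightly connected.
delta_lambda2 = fiedler_value(path) - fiedler_value(complete)
print(round(delta_lambda2, 6))  # prints -2.0
```

In an analysis like the one described, the two inputs would be attention matrices from the same layer for an active sentence and its passive counterpart, and the difference would be averaged over the chosen layer window.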
Cheng Huang,Nyima Tashi,Fan Gao,Yutong Liu,Jiahao Li,Hao Tian,Siyang Jiang,Thupten Tsering,Ban Ma-bao,Renzeg Duojie,Gadeng Luosang,Rinchen Dongrub,Dorje Tashi,Jin Zhang,Xiao Feng,Hao Wang,Jie Tang,Guojie Tang,Xiangxiang Wang,Jia Zhang,Tsengdar Lee,Yongbin Yu
Main category: cs.CL
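TL;DR: This survey reviews the state of Tibetan AI research, covering text and speech resources, NLP tasks, machine translation, speech recognition, and progress in large models. It systematically catalogs existing datasets and tools, identifies bottlenecks such as data sparsity, orthographic variation, and the lack of unified evaluation standards, and discusses the potential of cross-lingual transfer, multimodal learning, and community-driven resource building.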
Details
Motivation: Tibetan, one of Asia's major low-resource languages, has distinctive linguistic and sociocultural characteristics, yet it has received little attention in AI research because accessible data, standard benchmarks, and dedicated tools are lacking; a systematic survey is needed to advance the field. Method: Surveys existing Tibetan AI research, systematically categorizes datasets and tools, evaluates the methods used across tasks, and compares performance where possible. Result: Maps the current state of Tibetan text and speech data resources, NLP tasks, machine translation, speech recognition, and large models, and pinpoints key bottlenecks: data sparsity, orthographic variation, and inconsistent evaluation standards. Conclusion: The survey offers a foundational reference for future Tibetan AI research and calls for cross-lingual transfer, multimodal learning, and community collaboration to build an inclusive and sustainable AI ecosystem for low-resource languages. Abstract: Tibetan, one of the major low-resource languages in Asia, presents unique linguistic and sociocultural characteristics that pose both challenges and opportunities for AI research. Despite increasing interest in developing AI systems for underrepresented languages, Tibetan has received limited attention due to a lack of accessible data resources, standardized benchmarks, and dedicated tools. This paper provides a comprehensive survey of the current state of Tibetan AI, covering textual and speech data resources, NLP tasks, machine translation, speech recognition, and recent developments in LLMs. We systematically categorize existing datasets and tools, evaluate methods used across different tasks, and compare performance where possible. We also identify persistent bottlenecks such as data sparsity, orthographic variation, and the lack of unified evaluation metrics. Additionally, we discuss the potential of cross-lingual transfer, multi-modal learning, and community-driven resource creation. This survey aims to serve as a foundational reference for future work on Tibetan AI research and encourages collaborative efforts to build an inclusive and sustainable AI ecosystem for low-resource languages.
[23] "You Are Rejected!": An Empirical Study of Large Language Models Taking Hiring Evaluations
Dingjie Fu,Dianxing Shi
Main category: cs.CL
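TL;DR: This paper asks whether large language models (LLMs) can pass the standardized evaluations that technology companies use for hiring. State-of-the-art LLMs generate answers that are compared with the company's reference answers; every tested LLM fails the evaluation, indicating that current LLMs still fall well short on real engineering hiring assessments.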
Details
Motivation: Large technology companies need many software and algorithm engineers every year, so hiring evaluations are critical. Given the strong performance of LLMs on coding and reasoning tasks, the authors ask whether these models can pass a real hiring assessment. Method: Selects a widely used professional assessment questionnaire, generates responses with state-of-the-art LLMs, and compares them with the company-provided reference answers for accuracy and consistency. Result: The answers generated by every tested LLM differ significantly from the company reference answers and do not meet the passing standard. Conclusion: Although LLMs perform well on coding and reasoning, current technology is not sufficient for them to pass enterprise hiring evaluations, which points to limitations in practical use. Abstract: With the proliferation of the internet and the rapid advancement of Artificial Intelligence, leading technology companies face an urgent annual demand for a considerable number of software and algorithm engineers. To efficiently and effectively identify high-potential candidates from thousands of applicants, these firms have established a multi-stage selection process, which crucially includes a standardized hiring evaluation designed to assess job-specific competencies. Motivated by the demonstrated prowess of Large Language Models (LLMs) in coding and reasoning tasks, this paper investigates a critical question: Can LLMs successfully pass these hiring evaluations? To this end, we conduct a comprehensive examination of a widely used professional assessment questionnaire. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to any prior expectation of LLMs being ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions. Our empirical findings lead to a striking conclusion: All evaluated LLMs fail to pass the hiring evaluation.
[24] Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG
Jihwan Bang,Juntae Lee,Seunghan Yang,Sungha Choi
Main category: cs.CL
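TL;DR: TSSS is an efficient multi-hop retrieval-augmented generation framework that uses template-based reasoning and a retriever-based termination mechanism to substantially improve inference efficiency while maintaining high accuracy.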
Details
Motivation: Existing multi-hop RAG methods are inefficient for complex reasoning, suffering from repeated generation and unstable stopping. Method: Proposes the TSSS framework, which uses template-based reasoning to cache prefixes and anchor sub-queries, combined with a deterministic retriever-based termination mechanism. Result: Achieves state-of-the-art accuracy on HotpotQA, 2WikiMultiHop, and MuSiQue with good inference efficiency. Conclusion: By separating structured reasoning from termination control, TSSS delivers efficient and stable multi-hop question answering suited to resource-constrained settings. Abstract: Multi-hop retrieval-augmented generation (RAG) is a promising strategy for complex reasoning, yet existing iterative prompting approaches remain inefficient. They often regenerate predictable token sequences at every step and rely on stochastic stopping, leading to excessive token usage and unstable termination. We propose TSSS (Think Straight, Stop Smart), a structured multi-hop RAG framework designed for efficiency. TSSS introduces (i) template-based reasoning that caches recurring prefixes and anchors sub-queries to the main question, reducing token generation cost while promoting stable reasoning, and (ii) a retriever-based terminator, which deterministically halts reasoning once additional sub-queries collapse into repetition. This separation of structured reasoning and termination control enables both faster inference and more reliable answers. On HotpotQA, 2WikiMultiHop, and MuSiQue, TSSS achieves state-of-the-art accuracy and competitive efficiency among RAG-CoT approaches, highlighting its effectiveness in efficiency-constrained scenarios such as on-device inference.
[25] When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA
Nishanth Sridhar Nakshatri,Shamik Roy,Manoj Ghuhan Arivazhagan,Hanhan Zhou,Vinayshekhar Bannihatti Kumar,Rashmi Gangadharaiah
Main category: cs.CL
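TL;DR: evolveQA is a benchmark designed to evaluate large language models on temporally evolving knowledge. Built from real-world time-stamped data, it reveals that model performance drops significantly when knowledge changes over time.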
Details
Motivation: Existing work on how LLMs handle knowledge conflicts that arise as facts evolve is limited: it focuses on popular entities and lacks a fair way to evaluate models with different knowledge cut-off dates. Method: Proposes the evolveQA benchmark, which uses three real-world time-stamped corpora (AWS updates, Azure changes, and WHO disease outbreak reports) to identify naturally occurring knowledge evolution and to generate questions and gold answers tailored to each model's knowledge cut-off date. Result: An extensive evaluation of 12 open and closed-source LLMs across three knowledge probing formats shows performance drops of up to 31% on evolveQA compared with static knowledge questions. Conclusion: evolveQA effectively exposes the weaknesses of LLMs on temporally evolving knowledge and points to important directions for improving future models. Abstract: LLMs often fail to handle temporal knowledge conflicts, that is, contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
[26] Interpretable Question Answering with Knowledge Graphs
Kartikeya Aneja,Manasvi Srivastava,Subhayan Das,Nagender Aneja
Main category: cs.CL
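TL;DR: Presents a question answering system that does not rely on retrieval-augmented generation (RAG) with large language models, producing answers solely through knowledge graph retrieval and a small paraphraser model.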
Details
Motivation: To avoid depending on large language models for retrieval-augmented generation and to improve system efficiency and interpretability. Method: First generates question-answer pairs from a document and builds a knowledge graph from them; then performs graph retrieval using embeddings and fuzzy techniques, re-ranks the results, and uses a small paraphraser model to rewrite the relationship edges into a final answer. Result: On the CRAG benchmark, with LLAMA-3.2 and GPT-3.5-Turbo as judge models, accuracy is 71.9% and 54.4%, respectively. Conclusion: The approach shows that effective question answering on top of a knowledge graph is possible without a large language model in the generation step. Abstract: This paper presents a question answering system that operates exclusively on knowledge graph retrieval without relying on retrieval augmented generation (RAG) with large language models (LLMs). Instead, a small paraphraser model is used to paraphrase the entity relationship edges retrieved from querying the knowledge graph. The proposed pipeline is divided into two main stages. The first stage involves pre-processing a document to generate sets of question-answer (QA) pairs. The second stage converts these QAs into a knowledge graph from which graph-based retrieval is performed using embeddings and fuzzy techniques. The graph is queried, re-ranked, and paraphrased to generate a final answer. This work includes an evaluation using LLM-as-a-judge on the CRAG benchmark, which resulted in accuracies of 71.9% and 54.4% using LLAMA-3.2 and GPT-3.5-Turbo, respectively.
[27] Multi-Faceted Evaluation of Tool-Augmented Dialogue Systems
Zhaoyi Joey Hou,Tanya Shourya,Yingfan Wang,Shamik Roy,Vinayshekhar Bannihatti Kumar,Rashmi Gangadharaiah
Main category: cs.CL
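TL;DR: Introduces the TRACE benchmark and the SCOPE evaluation framework for uncovering complex error patterns in tool-augmented dialogues, especially when user satisfaction signals are misleading.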
Details
Motivation: Existing evaluation methods miss critical errors in multi-turn tool-augmented dialogues, such as an agent misinterpreting tool results while still leaving the user satisfied. Method: Builds TRACE, a benchmark of systematically synthesized dialogues, and proposes SCOPE, a framework that automatically discovers error patterns and evaluation rubrics. Result: Experiments show that SCOPE significantly outperforms the baseline on challenging cases, especially when user satisfaction signals are unreliable. Conclusion: SCOPE effectively identifies errors that traditional evaluation methods struggle to find, providing a more reliable way to evaluate tool-augmented dialogue systems. Abstract: Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents' tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues, such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases, and SCOPE, an evaluation framework that automatically discovers diverse error patterns and evaluation rubrics in tool-augmented dialogues. Experiments show SCOPE significantly outperforms the baseline, particularly on challenging cases where user satisfaction signals are misleading.
[28] DiSRouter: Distributed Self-Routing for LLM Selections
Hang Zheng,Hongshen Xu,Yongkai Lin,Shuai Fan,Lu Chen,Kai Yu
Main category: cs.CL
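TL;DR: Proposes DiSRouter, a distributed self-routing mechanism based on LLM self-awareness, which uses decentralization to improve how multi-model systems balance performance against cost.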
Details
Motivation: Existing centralized routing systems depend on a fixed set of models, are inflexible, and cannot accurately judge the knowledge boundaries of different large models, which hurts performance. Method: Designs the DiSRouter framework, in which a query flows through multiple LLM agents in a distributed manner and each agent decides, based on its own self-awareness, whether to answer or forward; also proposes a two-stage self-awareness training method to improve each LLM's judgment of its own abilities. Result: Experiments show that DiSRouter significantly outperforms existing routing methods across scenarios, effectively separates easy from hard queries, and generalizes well out of domain. Conclusion: Routing based on an LLM's intrinsic self-awareness is more effective than external assessment and opens a path toward modular, efficient multi-agent systems. Abstract: The proliferation of Large Language Models (LLMs) has created a diverse ecosystem of models with highly varying performance and costs, necessitating effective query routing to balance performance and expense. Current routing systems often rely on a centralized external router trained on a fixed set of LLMs, making them inflexible and prone to poor performance since the small router cannot fully understand the knowledge boundaries of different LLMs. We introduce DiSRouter (Distributed Self-Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self-awareness, that is, its ability to judge its own competence. This distributed design offers superior flexibility, scalability, and generalizability. To enable this, we propose a two-stage Self-Awareness Training pipeline that enhances each LLM's self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out-of-domain tasks. Our work validates that leveraging an LLM's intrinsic self-awareness is more effective than external assessment, paving the way for more modular and efficient multi-agent systems.
[29] Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+
York Hay Ng,Aditya Khan,Xiang Lu,Matteo Salloum,Michael Zhou,Phuong H. Hoang,A. Seza Doğruöz,En-Shiun Annie Lee
Main category: cs.CL
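TL;DR: Proposes a framework for type-matched language distances that uses structure-aware representations and a unified composite distance to overcome the limitations of existing linguistic knowledge bases in cross-lingual transfer.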
Details
Motivation: Existing linguistic knowledge bases such as URIEL+ have two problems in cross-lingual transfer: their vector representations do not fit the diversity of linguistic data, and they lack an effective way to aggregate signals. Method: Proposes new structure-aware representations for geographic, genealogical, and typological distances (speaker-weighted distributions, hyperbolic embeddings, and a latent-variable model) and integrates these signals into a robust, task-agnostic composite distance. Result: Across a range of NLP tasks, the proposed representations and composite distance consistently improve transfer-language selection. Conclusion: The framework gives multilingual research a more principled and effective language distance toolkit. Abstract: Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. First, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data; second, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent-variable model for typology. We unify these signals into a robust, task-agnostic composite distance. In selecting transfer languages, our representations and composite distances consistently improve performance across a wide range of NLP tasks, providing a more principled and effective toolkit for multilingual research.
[30] SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets
Ziwei Wang,Jiayuan Su,Mengyu Zhou,Huaxing Zeng,Mengni Jia,Xiao Lv,Haoyu Dong,Xiaojun Ma,Shi Han,Dongmei Zhang
Main category: cs.CL
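TL;DR: Proposes SheetBrain, a neuro-symbolic dual-workflow agent framework for accurate reasoning over complex spreadsheets; it supports question answering and manipulation tasks and significantly improves accuracy on multiple benchmarks.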
Details
Motivation: Large language models struggle to understand and reason over complex spreadsheets; they have difficulty capturing table structure accurately and ensuring that reasoning is correct. Method: Proposes the SheetBrain framework, consisting of an understanding module (which produces a sheet overview and problem insight), an execution module (which integrates a Python sandbox with table-processing libraries), and a validation module (which checks reasoning correctness and triggers re-execution). Result: SheetBrain significantly improves accuracy on several public tabular QA and manipulation benchmarks, and performs especially well on SheetBench, a newly proposed benchmark of complex multi-table spreadsheets. Conclusion: Through its neuro-symbolic dual-workflow design, SheetBrain effectively improves the accuracy and robustness of LLM reasoning over complex tabular data. Abstract: Understanding and reasoning over complex spreadsheets remain fundamental challenges for large language models (LLMs), which often struggle with accurately capturing the complex structure of tables and ensuring reasoning correctness. In this work, we propose SheetBrain, a neuro-symbolic dual-workflow agent framework designed for accurate reasoning over tabular data, supporting both spreadsheet question answering and manipulation tasks. SheetBrain comprises three core modules: an understanding module, which produces a comprehensive overview of the spreadsheet, including a sheet summary and query-based problem insight, to guide reasoning; an execution module, which integrates a Python sandbox with preloaded table-processing libraries and an Excel helper toolkit for effective multi-turn reasoning; and a validation module, which verifies the correctness of reasoning and answers, triggering re-execution when necessary. We evaluate SheetBrain on multiple public tabular QA and manipulation benchmarks, and introduce SheetBench, a new benchmark targeting large, multi-table, and structurally complex spreadsheets. Experimental results show that SheetBrain significantly improves accuracy on both existing benchmarks and the more challenging scenarios presented in SheetBench. Our code is publicly available at https://github.com/microsoft/SheetBrain.
[31] Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization
Yuto Tomikawa,Masaki Uto
Main category: cs.CL
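TL;DR: Proposes a difficulty-controllable multiple-choice question generation method based on large language models and direct preference optimization, aimed at more accurate difficulty control in reading comprehension question generation.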
Details
Motivation: Existing neural question generation methods cannot directly generate multiple-choice questions and are not explicitly optimized for the accuracy of difficulty control. Method: Trains a large language model with direct preference optimization to generate multiple-choice questions of controllable difficulty. Result: The proposed method generates multiple-choice questions effectively and controls difficulty more accurately than conventional methods. Conclusion: The method markedly improves difficulty-controllable question generation for reading comprehension and has broad potential in educational settings. Abstract: Difficulty-controllable question generation for reading comprehension has gained significant attention in the field of education as a fundamental tool for adaptive learning support. Although several neural question generation methods have recently succeeded in controlling difficulty, conventional approaches still face two major limitations. First, they cannot directly generate multiple-choice questions, which are the most widely used question type in educational contexts. Second, they are not explicitly trained to optimize the accuracy of difficulty control, leaving room for further improvement in difficulty controllability. To address these limitations, this study proposes a novel difficulty-controllable multiple-choice question generation method for reading comprehension which leverages a large language model trained using a direct preference optimization technique to improve the accuracy of difficulty control.
[32] TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
Reza Esfandiarpoor,Vishwas Suryanarayanan,Stephen H. Bach,Vishal Chowdhary,Anthony Aue
Main category: cs.CL
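TL;DR: TheMCPCompany is a benchmark for evaluating tool-calling agents on tasks that involve interacting with real-world services. It contains over 18,000 tools built on REST APIs and shows that current models still face reasoning and retrieval challenges in complex enterprise environments.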
Details
Motivation: Existing general-purpose agents mainly interact with the environment through browsers, while the Model Context Protocol (MCP) has led to a growing number of task-specific tools. The study explores the potential and limits of tool-calling agents on real services. Method: Builds the TheMCPCompany benchmark, creating MCP servers from the REST APIs of real services to provide over 18,000 tools, with manually annotated ground-truth tools for each task. Agent performance and practicality are assessed by comparing agents that use ground-truth tools with agents that use tool retrieval. Result: With ground-truth tools, tool-calling agents perform better at lower cost; with tool retrieval, GPT-5 comes close to its ground-truth performance, while smaller models cannot use the tools effectively. All models with tool retrieval match or outperform the browser baseline. Conclusion: The most advanced reasoning models discover tools effectively in simpler environments but still struggle with navigating and combining tools at scale in complex enterprise environments; better reasoning and retrieval are needed. Abstract: Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground-truth tools to show the potential of tool-calling agents for both improving performance and reducing costs assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5's performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments.
TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.
[33] JointCQ: Improving Factual Hallucination Detection with Joint Claim and Query Generation
Fan Xu,Huixuan Zhang,Zhenliang Zhang,Jiahao Wang,Xiaojun Wan
Main category: cs.CL
TL;DR: Proposes the JointCQ framework, which improves hallucination detection for large language models by jointly generating claims and queries.
Details
Motivation: Existing hallucination detection methods suffer from context loss during claim extraction and low specificity in query generation, which degrades overall performance. Method: Designs evaluation criteria to filter synthesized training data, and fine-tunes a language model to jointly optimize claim extraction and query generation. Result: Outperforms previous methods on multiple open-domain QA hallucination detection benchmarks and improves downstream retrieval and verification. Conclusion: The JointCQ framework improves the accuracy and efficiency of hallucination detection, advancing more trustworthy and transparent language model systems. Abstract: Current large language models (LLMs) often suffer from hallucination issues, i.e., generating content that appears factual but is actually unreliable. A typical hallucination detection pipeline involves response decomposition (i.e., claim extraction), query generation, evidence collection (i.e., search or retrieval), and claim verification. However, existing methods exhibit limitations in the first two stages, such as context loss during claim extraction and low specificity in query generation, resulting in degraded performance across the hallucination detection pipeline. In this work, we introduce JointCQ (https://github.com/pku0xff/JointCQ), a joint claim-and-query generation framework designed to construct an effective and efficient claim-query generator. Our framework leverages elaborately designed evaluation criteria to filter synthesized training data, and finetunes a language model for joint claim extraction and query generation, providing reliable and informative inputs for downstream search and verification. Experimental results demonstrate that our method outperforms previous methods on multiple open-domain QA hallucination detection benchmarks, advancing the goal of more trustworthy and transparent language model systems.
[34] KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints
Kailin Jiang,Hongbo Jiang,Ning Jiang,Zhi Gao,Jinhe Bi,Yuchen Ren,Bin Li,Yuntao Du,Lei Liu,Qing Li
Main category: cs.CL
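TL;DR: Proposes KORE, which uses knowledge-oriented augmentations and constraints to inject new knowledge into large vision-language models effectively while retaining old knowledge.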
Details
Motivation: The knowledge in large vision-language models is static and limited and cannot keep up with real-world developments, and existing methods are prone to catastrophic forgetting when learning new knowledge. Method: KORE automatically converts knowledge items into structured knowledge to support learning, stores old knowledge in the covariance matrix of linear-layer activations, and initializes the adapter by projecting the original weights into that matrix's null space, reducing interference with old knowledge. Result: Experiments on several large vision-language models, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new-knowledge injection and effectively mitigates catastrophic forgetting. Conclusion: KORE achieves knowledge adaptation and retention together, offering an efficient and practical solution for continual knowledge updating in large vision-language models. Abstract: Large Multimodal Models encode extensive factual knowledge in their pre-trained weights. However, their knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of the LMM's linear layer activations and initializes the adapter by projecting the original weights into the matrix's null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.
[35] HAD: HAllucination Detection Language Models Based on a Comprehensive Hallucination Taxonomy
Fan Xu,Xinyu Hu,Zhenghan Yu,Li Lin,Xu Zhang,Yang Zhang,Wei Zhou,Jinjie Gu,Xiaojun Wan
Main category: cs.CL
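TL;DR: Proposes a hallucination taxonomy with 11 categories and develops the HAD models, which integrate hallucination detection, span-level identification, and correction into a single inference process. The models are trained on about 90K synthetic samples and achieve state-of-the-art results on several benchmarks.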
Details
Motivation: Growing reliance on natural language generation models, especially large models, has raised concerns about the reliability and accuracy of their outputs. Hallucination, the generation of plausible but incorrect information, is a particularly prominent problem that calls for effective detection methods. Method: Proposes a taxonomy covering 11 hallucination categories and builds the HAD models, which integrate detection, localization, and correction; trains them on about 90K synthetic samples and constructs HADTest, a manually annotated test set of 2,248 samples, for evaluation. Result: The HAD models outperform existing baselines on in-domain and out-of-domain test sets and reach state-of-the-art results on HaluEval, FactCHD, and FaithBench, showing good robustness and generality. Conclusion: The HAD models address hallucination in NLG through a unified framework that applies across tasks, providing a practical tool for improving the reliability of generated content. Abstract: The increasing reliance on natural language generation (NLG) models, particularly large language models, has raised concerns about the reliability and accuracy of their outputs. A key challenge is hallucination, where models produce plausible but incorrect information. As a result, hallucination detection has become a critical task. In this work, we introduce a comprehensive hallucination taxonomy with 11 categories across various NLG tasks and propose the HAllucination Detection (HAD) models (https://github.com/pku0xff/HAD), which integrate hallucination detection, span-level identification, and correction into a single inference process. Trained on an elaborate synthetic dataset of about 90K samples, our HAD models are versatile and can be applied to various NLG tasks. We also carefully annotate a test set for hallucination detection, called HADTest, which contains 2,248 samples. Evaluations on in-domain and out-of-domain test sets show that our HAD models generally outperform the existing baselines, achieving state-of-the-art results on HaluEval, FactCHD, and FaithBench, confirming their robustness and versatility.
[36] Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization
Junjie Song,Yiwen Liu,Dapeng Li,Yin Sun,Shukun Fu,Siqi Chen,Yuji Cao
Main category: cs.CL
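TL;DR: This paper proposes a multi-objective reinforcement learning method based on hypervolume optimization (HVO) for text summarization, achieving performance that is more balanced across multiple evaluation dimensions and better than existing methods.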
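As a rough illustration of the hypervolume indicator that HVO builds on, the sketch below computes the volume dominated by a small set of reward vectors relative to a reference point, using inclusion-exclusion over axis-aligned boxes. The function name `hypervolume`, the reference point, and the example scores are illustrative assumptions; this is only the underlying indicator, not the authors' reward-adjustment procedure.

```python
from itertools import combinations
from math import prod


def hypervolume(points, reference):
    """Volume of the region dominated by `points` and bounded below by `reference`.

    Assumes every objective is maximized. Uses inclusion-exclusion over the
    boxes [reference, p], which is exponential in len(points) and therefore
    only suitable for small groups of samples.
    """
    total = 0.0
    for size in range(1, len(points) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(points, size):
            # The intersection of boxes anchored at `reference` is the box
            # whose upper corner is the coordinate-wise minimum of the subset.
            corner = [min(coords) for coords in zip(*subset)]
            edges = [max(0.0, c - r) for c, r in zip(corner, reference)]
            total += sign * prod(edges)
    return total


# Hypothetical (consistency, fluency) scores for two groups of summaries.
balanced = [(0.6, 0.6), (0.5, 0.7)]
lopsided = [(0.9, 0.2), (0.95, 0.1)]
origin = (0.0, 0.0)
print(hypervolume(balanced, origin), hypervolume(lopsided, origin))
```

In this toy setup the balanced group covers more volume than the lopsided one, which is the intuition behind using hypervolume to reward summaries that do well on all objectives at once.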
Details
Motivation: Existing LLM-based summarization mostly relies on single-objective optimization and lacks an effective mechanism for jointly optimizing consistency, coherence, relevance, and fluency, so a strategy that dynamically balances multiple reward objectives is needed.
Method: Proposes hypervolume optimization (HVO), which uses the hypervolume metric during reinforcement learning to dynamically adjust reward scores between groups, guiding the model toward the Pareto front and thereby balancing multiple objectives.
Result: On several representative summarization datasets, HVO outperforms GRPO, with higher overall scores and more balanced performance across dimensions. A 7B model enhanced with HVO performs comparably to GPT-4 on summarization while generating shorter outputs.
Conclusion: HVO is an effective multi-objective optimization strategy that improves the overall performance and balance of LLMs in text summarization and has practical potential.
Abstract: Text summarization is a crucial task that requires the simultaneous optimization of multiple objectives, including consistency, coherence, relevance, and fluency, which presents considerable challenges. Although large language models (LLMs) have demonstrated remarkable performance, enhanced by reinforcement learning (RL), few studies have focused on optimizing the multi-objective problem of summarization through RL based on LLMs. In this paper, we introduce hypervolume optimization (HVO), a novel optimization strategy that dynamically adjusts the scores between groups during the reward process in RL by using the hypervolume method. This method guides the model's optimization to progressively approximate the Pareto front, thereby generating balanced summaries across multiple objectives. Experimental results on several representative summarization datasets demonstrate that our method outperforms group relative policy optimization (GRPO) in overall scores and shows more balanced performance across different dimensions. Moreover, a 7B foundation model enhanced by HVO performs comparably to GPT-4 in the summarization task, while maintaining a shorter generation length. Our code is publicly available at https://github.com/ai4business-LiAuto/HVO.git
[37] Slot Filling as a Reasoning Task for SpeechLLMs
Kadri Hacioglu,Manjunath K E,Andreas Stolcke
Main category: cs.CL
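TL;DR: This paper proposes introducing reasoning into speech large language models (speechLLMs) for the end-to-end slot-filling task. A chain-of-thought framework decomposes the task into multiple reasoning steps, and a reasoning dataset is built for supervised fine-tuning. Experiments show that intermediate reasoning steps improve performance, but reasoning text LLMs specialized for math, logic, and coding perform poorly as foundation models. Hybrid speechLLMs built on a hybrid text foundation model and supporting both direct and reasoning modes perform better.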
Details
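Motivation: To improve end-to-end slot filling, the paper explores the feasibility and effectiveness of introducing reasoning into speech large language models, addressing the lack of an explicit reasoning process in conventional models.
Method: Uses a chain-of-thought framework to decompose slot filling into multiple reasoning steps, builds a dedicated reasoning dataset, and applies supervised fine-tuning to a speechLLM. It compares text LLMs of different types and sizes as foundation models, distinguishes regular from reasoning speechLLMs, and explores single-mode and dual-mode fine-tuning strategies.
Result: Introducing reasoning steps improves speechLLM performance on slot filling. However, a reasoning text LLM designed mainly for math, logic, and code performs poorly as a foundation model. Hybrid speechLLMs built on a hybrid text foundation model and retaining both direct and reasoning modes perform best.
Conclusion: Integrating reasoning into speechLLMs helps slot-filling performance, provided a suitable foundation model is chosen. Hybrid fine-tuning that supports multiple modes of operation outperforms single-mode fine-tuning and is an effective path to reasoning speechLLMs.
Abstract: We propose integrating reasoning into speech large language models (speechLLMs) for the end-to-end slot-filling task. Inspired by the recent development of reasoning LLMs, we use a chain-of-thought framework to decompose the slot-filling task into multiple reasoning steps, create a reasoning dataset and apply the supervised fine-tuning strategy to a speechLLM. We distinguish between regular and reasoning speechLLMs and experiment with different types and sizes of LLMs as their text foundation models. We demonstrate performance improvements by introducing reasoning (intermediate) steps. However, we show that a reasoning textual LLM developed mainly for math, logic and coding domains might be inferior as a foundation model for a reasoning speechLLM. We further show that hybrid speechLLMs, built on a hybrid text foundation LLM and fine-tuned to preserve both direct and reasoning modes of operation, have better performance than those fine-tuned employing only one mode of operation.
[38] Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection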
Ewelina Gajewska,Arda Derbent,Jaroslaw A Chudziak,Katarzyna Budzynska
Main category: cs.CL
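TL;DR: This study examines how personalising large language models with annotator personas (Persona-LLMs) affects their sensitivity to hate speech, in particular biases linked to shared or differing identities between annotators and targets. Using Gemini and GPT-4.1-mini with shallow persona prompting and a deeply contextualised persona method based on Retrieval-Augmented Generation (RAG), it analyses how in-group and out-group personas affect detection performance and fairness. The results indicate that incorporating socio-demographic attributes helps mitigate bias in automated hate speech detection, while also revealing limitations of persona-based approaches.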
Details
Motivation: Existing hate speech detection systems often make unfair judgments because of identity-related bias. This study explores whether introducing annotator personas can reduce such bias in LLMs and improve the fairness and accuracy of detection.
Method: Uses Google's Gemini and OpenAI's GPT-4.1-mini with two persona-prompting methods, shallow persona prompting and deeply contextualised persona construction based on Retrieval-Augmented Generation (RAG), and compares the effect of in-group and out-group annotator personas on detection performance and fairness across social groups.
Result: Personas carrying socio-demographic attributes improve fairness in hate speech detection, notably by reducing identity-related bias, but persona methods have limited effect for some groups or may introduce new bias.
Conclusion: Combining psychological insights on group identity with NLP techniques offers a workable path to reducing bias with persona-based LLMs. Personalised personas can support fairer hate speech detection systems, but they must be designed carefully to avoid new forms of bias.
Abstract: In this paper, we investigate how personalising Large Language Models (Persona-LLMs) with annotator personas affects their sensitivity to hate speech, particularly regarding biases linked to shared or differing identities between annotators and targets. To this end, we employ Google's Gemini and OpenAI's GPT-4.1-mini models and two persona-prompting methods: shallow persona prompting and a deeply contextualised persona development based on Retrieval-Augmented Generation (RAG) to incorporate richer persona profiles. We analyse the impact of using in-group and out-group annotator personas on the models' detection performance and fairness across diverse social groups. This work bridges psychological insights on group identity with advanced NLP techniques, demonstrating that incorporating socio-demographic attributes into LLMs can address bias in automated hate speech detection. Our results highlight both the potential and limitations of persona-based approaches in reducing bias, offering valuable insights for developing more equitable hate speech detection systems.
[39] Local Obfuscation by GLINER for Impartial Context Aware Lineage: Development and evaluation of PII Removal system
Prakrithi Shivaprakash,Lekhansh Shukla,Animesh Mukherjee,Prabhat Chand,Pratima Murthy
Main category: cs.CL
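TL;DR: Proposes LOGICAL, a locally deployable PII removal system based on a fine-tuned GLiNER model, for efficient and secure de-identification of clinical text. It performs well in resource-constrained environments.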
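The abstract below reports character-level precision, recall, and F1. One common way to compute such scores is sketched here, assuming spans are `(start, end)` character offsets with an exclusive end and ignoring entity labels. The function name `char_level_prf` and the example offsets are hypothetical, and the paper's exact scoring script is not described in this digest.

```python
def char_level_prf(gold_spans, pred_spans):
    """Precision, recall, and F1 over the characters covered by PII spans.

    Each span is a (start, end) pair of character offsets, end exclusive.
    Labels are ignored here; a label-aware variant could use
    (offset, label) pairs as set elements instead.
    """
    gold = {i for start, end in gold_spans for i in range(start, end)}
    pred = {i for start, end in pred_spans for i in range(start, end)}
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical note where a 10-character name is only half redacted.
precision, recall, f1 = char_level_prf(gold_spans=[(0, 10)], pred_spans=[(0, 5)])
print(precision, recall, round(f1, 3))
```

Scoring at the character level gives partial credit for partially redacted entities, which an exact-match entity metric would count as a complete miss.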
Details
Motivation: Removing personally identifiable information (PII) from electronic health records is essential for research and AI development, but the high computational cost of large language models and the privacy risks of API-based services limit their use, especially in low-resource settings.
Method: Builds LOGICAL, a locally deployed PII removal system, on a fine-tuned GLiNER model (a generalist, lightweight named entity recognition model) and validates it on 1515 clinical documents from a psychiatric hospital. Nine PII categories are defined, with 2849 instances used for training and 376 for testing. Evaluation uses character-level precision, recall, and F1-score, with comparisons against Azure NER, Presidio, and zero-shot prompting of Gemini-Pro-2.5 and Llama-3.3-70B-Instruct.
Result: The fine-tuned GLiNER model reaches a micro-average F1-score of 0.980, significantly better than Gemini-Pro-2.5 (0.845). LOGICAL fully sanitises 95% of documents, against 64% for the next-best solution, and runs efficiently on a standard laptop without a dedicated GPU, but it has a 2% entity-level false negative rate.
Conclusion: Fine-tuned, specialised transformer models such as GLiNER provide an accurate, efficient, and secure approach to de-identifying clinical text. LOGICAL's "sanitisation at the source" is a practical alternative to resource-intensive LLMs and supports privacy-preserving data research in resource-constrained environments.
Abstract: Removing Personally Identifiable Information (PII) from clinical notes in Electronic Health Records (EHRs) is essential for research and AI development. While Large Language Models (LLMs) are powerful, their high computational costs and the data privacy risks of API-based services limit their use, especially in low-resource settings. To address this, we developed LOGICAL (Local Obfuscation by GLINER for Impartial Context-Aware Lineage), an efficient, locally deployable PII removal system built on a fine-tuned Generalist and Lightweight Named Entity Recognition (GLiNER) model. We used 1515 clinical documents from a psychiatric hospital's EHR system. We defined nine PII categories for removal. A modern-gliner-bi-large-v1.0 model was fine-tuned on 2849 text instances and evaluated on a test set of 376 instances using character-level precision, recall, and F1-score. We compared its performance against Microsoft Azure NER, Microsoft Presidio, and zero-shot prompting with Gemini-Pro-2.5 and Llama-3.3-70B-Instruct. The fine-tuned GLiNER model achieved superior performance, with an overall micro-average F1-score of 0.980, significantly outperforming Gemini-Pro-2.5 (F1-score: 0.845). LOGICAL correctly sanitised 95% of documents completely, compared to 64% for the next-best solution. The model operated efficiently on a standard laptop without a dedicated GPU.
However, a 2% entity-level false negative rate underscores the need for human-in-the-loop validation across all tested systems. Fine-tuned, specialised transformer models like GLiNER offer an accurate, computationally efficient, and secure solution for PII removal from clinical notes. This "sanitisation at the source" approach is a practical alternative to resource-intensive LLMs, enabling the creation of de-identified datasets for research and AI development while preserving data privacy, particularly in resource-constrained environments.
[40] Modeling Turn-Taking with Semantically Informed Gestures
Varsha Suresh,M. Hamza Mughal,Christian Theobalt,Vera Demberg
Main category: cs.CL
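TL;DR: This paper introduces the DnD Gesture++ dataset and uses semantically guided gesture information to improve turn-taking prediction in multimodal conversation.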
Details
Motivation: In conversation, humans use multimodal cues such as speech, gestures, and gaze to manage turn-taking, yet existing work makes little use of the semantic information carried by gestures.
Method: Builds the DnD Gesture++ dataset with 2,663 semantic gesture annotations and uses a Mixture-of-Experts framework that fuses text, audio, and gestures to predict turn-taking.
Result: Experiments show that adding semantically guided gestures consistently improves performance over baselines.
Conclusion: Semantic gestures provide valuable complementary information for multimodal turn-taking and help model conversational dynamics more accurately.
Abstract: In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
[41] M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models
Yejin Kwon,Taewoo Kang,Hyunsoo Yoon,Changouk Kim
Main category: cs.CL
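TL;DR: M3-SLU is a new multimodal large language model benchmark for evaluating multi-speaker, multi-turn spoken language understanding. It contains over 12,000 instances with audio, transcripts, and metadata and exposes the weakness of current models in speaker-attributed reasoning.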
Details
Motivation: Existing models perform well on speech and text comprehension, but identifying who said what and when in natural conversation remains challenging, and there is no effective evaluation of speaker-attributed reasoning.
Method: Builds the M3-SLU benchmark from four open corpora (CHiME-6, MELD, MultiDialog, AMI) with two tasks, speaker-attributed question answering and speaker attribution via utterance matching. Experiments cover cascaded pipelines and end-to-end MLLMs, evaluated with an LLM-as-Judge and accuracy metrics.
Result: Current models understand the content reasonably well but perform poorly at identifying the speaker, exposing a key gap in speaker-aware dialogue understanding.
Conclusion: M3-SLU provides a challenging benchmark that helps advance research on speaker-aware multimodal language understanding.
Abstract: We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding. M3-SLU serves as a challenging benchmark to advance research in speaker-aware multimodal understanding.
[42] AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation
Xianyang Liu,Yilin Liu,Shuai Wang,Hao Cheng,Andrew Estornell,Yuzhi Zhao,Jiaheng Wei
Main category: cs.CL
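TL;DR: Proposes AgenticMath, an agentic pipeline for generating high-quality mathematical question-answer pairs, which improves LLM reasoning through a four-stage method.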
Details
Motivation: Existing dataset generation methods often produce low-quality or incorrect answers with limited information richness, which makes it hard to improve LLM reasoning effectively.
Method: Uses a four-stage pipeline: seed question filtering, agentic question rephrasing, answer augmentation with chain-of-thought reasoning, and question-answer pair evaluation. A multi-agent system generates diverse, logically consistent, high-quality mathematical question-answer pairs.
Result: Fine-tuning 3B-8B-parameter LLMs on only 30-60K samples matches or exceeds, on multiple mathematical reasoning benchmarks, baselines trained on 400K or even 2.3M samples.
Conclusion: Targeted, high-quality data generation is more efficient than large-scale, low-quality data and is an effective way to improve the mathematical reasoning of LLMs.
Abstract: The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic pipeline for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) a Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step that rewrites answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the best pairs. Extensive experiments demonstrate that fine-tuning 3B-8B parameter LLMs on AgenticMath-generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in-domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.
[43] LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
Siyuan Wang,Gaokai Zhang,Li Lyna Zhang,Ning Shang,Fan Yang,Dongyao Chen,Mao Yang
Main category: cs.CL
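TL;DR: This paper proposes LoongRL, a data-driven reinforcement learning method for improving LLM performance on long-context reasoning. At its core is KeyChain, which converts short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains. This induces a plan-retrieve-reason-recheck thinking pattern and substantially improves long-context multi-hop QA accuracy without increasing training cost.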
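To make the UUID-chain idea concrete, here is a minimal sketch under stated assumptions: chain links are stored as `{"key": ..., "value": ...}` records, each value is either the next key or the final question, and links are scattered at random positions among distractor documents. The record format and the helper names `build_chain`, `scatter`, and `trace` are hypothetical; the actual KeyChain data format is not given in this digest.

```python
import random
import uuid


def build_chain(final_value, length, rng):
    """Return (start_key, records) for a chain of UUID pointers ending in final_value."""
    keys = [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(length)]
    values = keys[1:] + [final_value]
    records = [{"key": k, "value": v} for k, v in zip(keys, values)]
    return keys[0], records


def scatter(documents, records, rng):
    """Insert chain records at random positions among the documents."""
    context = list(documents)
    for record in records:
        context.insert(rng.randrange(len(context) + 1), record)
    return context


def trace(context, start_key):
    """Follow key -> value pointers from start_key until a non-key value is reached."""
    lookup = {item["key"]: item["value"] for item in context if isinstance(item, dict)}
    value = lookup[start_key]
    while value in lookup:
        value = lookup[value]
    return value


rng = random.Random(0)
documents = [f"distractor document {i}" for i in range(20)]
true_question = "Which river flows through the capital of France?"
start_key, true_chain = build_chain(true_question, 4, rng)
_, decoy_chain = build_chain("What is the tallest mountain in Africa?", 4, rng)
context = scatter(documents, true_chain + decoy_chain, rng)
print(trace(context, start_key))
```

The `trace` helper plays the role of the step-by-step chain following that the model has to perform before it can even see the real question. The decoy chain shows why guessing is not enough: only the chain that starts at the given key leads to the true question.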
Details
Motivation: Long-context reasoning is essential for large language models, but existing reinforcement learning work focuses on short contexts, high-difficulty long-context training data are scarce, and advanced reasoning patterns remain largely unexplored.
Method: Proposes KeyChain, a synthesis approach that turns short multi-hop QA into long-context tasks containing many distracting documents. Inserted UUID chains hide the true question, so the model must trace the chain step by step, identify the question, retrieve facts, and reason to an answer. Reinforcement learning is then run on this data.
Result: On Qwen2.5-7B and 14B, absolute accuracy improves by +23.5% and +21.1% respectively. LoongRL-14B scores 74.2, close to o3-mini (74.5) and DeepSeek-R1 (74.9). It handles 128K contexts, passes all long-context retrieval stress tests, and preserves short-context capabilities.
Conclusion: Through data construction and reinforcement learning, LoongRL induces a generalizable advanced long-context reasoning pattern, solves very long-context tasks at low computational cost, and rivals much larger frontier models.
Abstract: Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
[44] The Massive Legal Embedding Benchmark (MLEB)
Umar Butler,Abdur-Rahman Butler,Adrian Lucas Malec
Main category: cs.CL
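TL;DR: Introduces the Massive Legal Embedding Benchmark (MLEB), comprising ten expert-annotated datasets that span multiple jurisdictions, document types, and task types. Seven were newly constructed to fill domain and jurisdictional gaps, and the code, results, and data are released openly.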
Details
Motivation: To fill jurisdictional and domain gaps in open-source legal information retrieval by providing a larger, more diverse, and more comprehensive benchmark.
Method: Builds ten expert-annotated datasets covering multiple jurisdictions and document types, with search, zero-shot classification, and question answering tasks. Seven of them are newly created, and all code, data, and results are released.
Result: MLEB is the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date, supporting model evaluation across domains and jurisdictions.
Conclusion: MLEB is an important resource for legal information retrieval research and supports reproducible, standardized evaluation in the field.
Abstract: We present the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB consists of ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets in MLEB were newly constructed in order to fill domain and jurisdictional gaps in the open-source legal information retrieval landscape. We document our methodology in building MLEB and creating the new constituent datasets, and release our code, results, and data openly to assist with reproducible evaluations.
[45] MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
Xinfeng Xia,Jiacheng Liu,Xiaofeng Hou,Peng Tang,Mingxuan Zhang,Wenfeng Wang,Chao Li
Main category: cs.CL
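TL;DR: This paper proposes MoE-Prism, a model-system co-design that decomposes monolithic experts into fine-grained sub-experts and applies QoS-aware scheduling, making MoE models elastic, significantly increasing throughput, and reducing latency.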
Details
Motivation: Conventional MoE models depend on a fixed top-k routing mechanism, which causes a quality cliff and resource over-provisioning. They lack flexibility and adapt poorly to diverse Service Level Objectives (SLOs).
Method: Works in two phases. An offline refactoring engine uses a metaheuristic optimization algorithm to decompose monolithic experts into sub-experts that preserve functional locality. An online scheduling engine then schedules elastically according to quality-of-service needs, addressing throughput maximization in cloud deployments and latency optimization on memory-constrained devices.
Result: Experiments on three different MoE models show that MoE-Prism provides more than 4 times as many stable operating points as the baseline, improves throughput by up to 19.9% under a strict latency budget, and reduces latency by up to 10.36% under limited resources.
Conclusion: MoE-Prism bridges the gap between model and system and provides a key control mechanism for the next generation of adaptive, efficient, QoS-aware AI services.
Abstract: Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI, achieve high quality by sparsely activating parameters. However, their reliance on routing between a few monolithic experts via a top-k mechanism creates a "quality cliff", offering only a few coarse-grained operating points. This inflexibility forces a difficult trade-off between cost and quality, preventing adaptation to diverse Service Level Objectives (SLOs) and leading to significant resource over-provisioning. This paper introduces MoE-Prism, a model-system co-design that transforms rigid MoE models into elastic services. Our methodology is divided into two phases. First, an Offline Refactoring Engine systematically deconstructs monolithic experts into fine-grained "sub-experts." This engine employs a partitioning optimization solver that uses a metaheuristic-based approach to group neurons, preserving functional locality without requiring retraining. Second, an Online Scheduling Engine leverages this new elasticity through QoS-aware scheduling. It implements specialized policies to solve complex system problems, including maximizing throughput in cloud deployments and managing latency-optimized offloading for memory-constrained devices. Our evaluation across three different MoE models shows that MoE-Prism provides over 4 times more distinct, stable operating points than the baseline. This allows an AI service to dynamically improve throughput by up to 19.9% under a strict latency budget or reduce latency by up to 10.36% under limited resources.
MoE-Prism provides the critical "control knob" to bridge the model-system gap, enabling the next generation of adaptive, efficient, and QoS-aware AI services.
[46] Sign Language Translation with Sentence Embedding Supervision
Yasser Hamidullah,Josef van Genabith,Cristina España-Bonet
Main category: cs.CL
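TL;DR: Proposes a new method that replaces gloss annotations with sentence embeddings of the target sentences. It needs no manual annotation, supports multiple languages, and sets a new state of the art on datasets without glosses.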
Details
Motivation: Existing sign language translation systems depend on gloss-annotated data that is limited in scale and inconsistent across datasets, which restricts their wider use.
Method: Uses sentence embeddings of the target sentences as the supervision signal at training time in place of the usual gloss annotations, enabling end-to-end multilingual sign language translation.
Result: The method is validated on PHOENIX-2014T and How2Sign, significantly outperforms other gloss-free approaches, and narrows the gap between gloss-free and gloss-dependent systems.
Conclusion: Without additional pretraining or manual annotation, the method reaches a new state of the art and extends well to multiple languages.
Abstract: State-of-the-art sign language translation (SLT) systems facilitate the learning process through gloss annotations, either in an end-to-end manner or by involving an intermediate step. Unfortunately, gloss-labelled sign language data is usually not available at scale and, when available, gloss annotations widely differ from dataset to dataset. We present a novel approach using sentence embeddings of the target sentences at training time that take the role of glosses. The new kind of supervision does not need any manual annotation but it is learned on raw textual data. As our approach easily facilitates multilinguality, we evaluate it on datasets covering German (PHOENIX-2014T) and American (How2Sign) sign languages and experiment with mono- and multilingual sentence embeddings and translation systems. Our approach significantly outperforms other gloss-free approaches, setting the new state-of-the-art for data sets where glosses are not available and when no additional SLT datasets are used for pretraining, diminishing the gap between gloss-free and gloss-dependent systems.
[47] SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision
Yasser Hamidullah,Shakib Yazdani,Cennet Oguz,Josef van Genabith,Cristina España-Bonet
Main category: cs.CL
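TL;DR: Proposes using language-agnostic multimodal embeddings together with a coupled augmentation method to achieve scalable and semantically robust multilingual sign language translation.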
Details
Motivation: Sign language translation is usually trained with text in a single spoken language, which limits scalability and cross-language generalization.
Method: Uses language-agnostic, multimodal embeddings trained on multilingual text and speech as the supervision signal, combined with a coupled augmentation method that pairs multilingual target augmentation with video-level perturbations.
Result: The approach outperforms text-only sentence embedding supervision on BLEURT, with larger gains in low-resource settings.
Conclusion: Language-agnostic embedding supervision combined with coupled augmentation offers a scalable and semantically robust alternative to traditional sign language translation training.
Abstract: Sign language translation (SLT) is typically trained with text in a single spoken language, which limits scalability and cross-language generalization. Earlier approaches have replaced gloss supervision with text-based sentence embeddings, but up to now, these remain tied to a specific language and modality. In contrast, here we employ language-agnostic, multimodal embeddings trained on text and speech from multiple languages to supervise SLT, enabling direct multilingual translation. To address data scarcity, we propose a coupled augmentation method that combines multilingual target augmentations (i.e., translations into many languages) with video-level perturbations, improving model robustness. Experiments show consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings. Our results demonstrate that language-agnostic embedding supervision, combined with coupled augmentation, provides a scalable and semantically robust alternative to traditional SLT training.
[48] ToMMeR -- Efficient Entity Mention Detection from Large Language Models
Victor Morand,Nadi Tomeh,Josiane Mothe,Benjamin Piwowarski
Main category: cs.CL
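TL;DR: ToMMeR is a lightweight model that efficiently probes mention detection capabilities from early LLM layers, achieving high recall and precision and showing that mention detection emerges naturally from language modeling.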
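The abstract below quantifies cross-model agreement on mention boundaries with a DICE score. A minimal sketch of a Dice coefficient over two sets of predicted spans follows, assuming agreement means exact `(start, end)` matches; whether the paper scores exact spans or token-level overlap is not stated in this digest, and `dice_overlap` and the example spans are hypothetical.

```python
def dice_overlap(spans_a, spans_b):
    """Dice coefficient between two collections of (start, end) mention spans.

    Two spans count as shared only when both boundaries match exactly.
    """
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # two empty predictions agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical mention spans predicted by two different probed models.
model_one = [(0, 2), (5, 7), (9, 10)]
model_two = [(0, 2), (5, 7), (11, 12)]
print(round(dice_overlap(model_one, model_two), 3))
```

Dice treats both span sets symmetrically, which suits a comparison between two models where neither side is the reference.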
Details
Motivation: Mention detection is a foundational task in information extraction but a known performance bottleneck, so more efficient models are needed.
Method: Proposes ToMMeR, which probes early LLM layers for mention detection and uses an LLM as a judge to assess its predictions. Span classification heads are added to extend it to named entity recognition.
Result: Across 13 NER benchmarks, ToMMeR achieves 93% zero-shot recall with over 90% precision. Cross-model analysis shows that different architectures agree closely on mention boundaries (DICE > 75%). With the extension, it reaches near-SOTA NER performance (80-87% F1).
Conclusion: Structured entity representations exist in early transformer layers and can be recovered efficiently with very few parameters, indicating that mention detection emerges naturally in language models.
Abstract: Identifying which text spans refer to entities -- mention detection -- is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93% recall zero-shot, with over 90% precision using an LLM as a judge, showing that ToMMeR rarely produces spurious predictions despite high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE > 75%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves near SOTA NER performance (80-87% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.
[49] Spatio-temporal Sign Language Representation and Translation
Yasser Hamidullah,Josef van Genabith,Cristina España-Bonet
Main category: cs.CL
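TL;DR: This paper presents an end-to-end model for translating Swiss German Sign Language into German text. The model learns spatio-temporal feature representations and translation in a single framework, but its performance drops sharply on the test set.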
Details
Motivation: Existing sign language translation systems often fail to exploit temporal features and are mostly not fully end-to-end, which limits their ability to generalize.
Method: Uses a sequence-to-sequence architecture that extracts and learns spatio-temporal feature representations directly from video frames, trained end to end from video input to text output.
Result: The best system reaches 5±1 BLEU on the development set, but performance drops to 0.11±0.06 BLEU on the test set.
Conclusion: Although the model is end-to-end and learns spatio-temporal features, it generalizes poorly on the actual test data, indicating that robustness and overfitting still need to be addressed.
Abstract: This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of word embeddings as used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a real end-to-end architecture expected to better generalize to new data sets. Our best system achieved 5±1 BLEU points on the development set, but the performance on the test set dropped to 0.11±0.06 BLEU points.
[50] BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models
Yuan Gao,Suchir Salhan,Andrew Caines,Paula Buttery,Weiwei Sun
Main category: cs.CL
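TL;DR: BLiSS 1.0 is a new benchmark for evaluating language acquisition models. It uses a selective tolerance paradigm to measure whether a model recognizes naturalistic learner grammatical errors.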
Details
Motivation: Existing performance-oriented benchmarks cannot effectively evaluate cognitively inspired language models and do not assess alignment with patterns of human language acquisition.
Method: Builds a dataset from 2.8 million naturalistic learner sentences, generates 136,867 (corrected, learner error, artificial error) triplets, and proposes the selective tolerance test paradigm.
Result: Experiments show that selective tolerance is a capability distinct from standard grammaticality judgment, and that model performance clusters strongly by training paradigm.
Conclusion: BLiSS 1.0 measures how well models trained with different objectives align with the regularities of human language acquisition and is a strong tool for evaluating the cognitive plausibility of language models.
Abstract: To bridge the gap between performance-oriented benchmarks and the evaluation of cognitively inspired models, we introduce BLiSS 1.0, a Benchmark of Learner Interlingual Syntactic Structure. Our benchmark operationalizes a new paradigm of selective tolerance, testing whether a model finds a naturalistic learner error more plausible than a matched, artificial error within the same sentence. Constructed from over 2.8 million naturalistic learner sentences, BLiSS provides 136,867 controlled triplets (corrected, learner, artificial) for this purpose. Experiments on a diverse suite of models demonstrate that selective tolerance is a distinct capability from standard grammaticality, with performance clustering strongly by training paradigm. This validates BLiSS as a robust tool for measuring how different training objectives impact a model's alignment with the systematic patterns of human language acquisition.
[51] MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models
Kailin Jiang,Ning Jiang,Yuchen Ren,Yuchen Li,Yifan Gao,Jinhe Bi,Yunpu Ma,Qingqing Liu,Xianhao Wang,Yifan Jia,Hongbo Jiang,Yaocong Hu,Bin Li,Lei Liu,Yuntao Du
Main category: cs.CL
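TL;DR: Proposes the MINED benchmark for evaluating how well large multimodal models understand time-sensitive knowledge, covering 6 dimensions and 11 tasks. Existing models are weakest on sport knowledge, and knowledge editing methods can effectively update model knowledge in single-edit scenarios.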
Details
Motivation: Existing benchmarks do not adequately evaluate how well large multimodal models (LMMs) understand time-sensitive knowledge, and the static representations of LMMs struggle to stay accurate about changing facts.
Method: Builds the MINED benchmark from Wikipedia with two professional annotators. It contains 2,104 time-sensitive knowledge samples across six knowledge types, evaluates 15 widely used LMMs over 6 dimensions and 11 tasks, and examines whether knowledge editing methods can update time-sensitive knowledge.
Result: Gemini-2.5-Pro achieves the highest average CEM score on MINED, 63.07. Most open-source LMMs show weak time understanding. Models perform best on organization knowledge and worst on sport. Knowledge editing methods can effectively update time-sensitive knowledge in LMMs in single-edit scenarios.
Conclusion: MINED provides a comprehensive benchmark for the temporal awareness of LMMs, reveals the limits of current models on time-sensitive knowledge, and supports knowledge editing as a means of knowledge updating.
Abstract: Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs, inadequately evaluating LMMs' ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along 6 key dimensions (cognition, awareness, trustworthiness, understanding, reasoning, and robustness) and 11 challenging tasks. MINED is constructed from Wikipedia by two professional annotators, containing 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack time understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via knowledge editing methods in single editing scenarios.
[52] Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition
Yuu Jinnai
Main category: cs.CL
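TL;DR: This paper studies sample-based Minimum Bayes Risk (MBR) decoding for speech-to-text tasks such as automatic speech recognition and speech translation. MBR decoding outperforms conventional beam search in most experimental settings, suggesting it is promising for high-accuracy offline tasks.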
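Sample-based MBR decoding can be sketched in a few lines: draw several candidate transcripts, then return the one with the lowest average risk against all the others. The sketch below uses word-level edit distance as the risk, which is an assumption made for illustration (the utility function used in the paper is not stated in this digest). The hard-coded `samples` list stands in for hypotheses that would normally be sampled from an ASR model.

```python
def word_edit_distance(hypothesis, reference):
    """Levenshtein distance between two whitespace-tokenized strings."""
    hyp_words, ref_words = hypothesis.split(), reference.split()
    previous = list(range(len(ref_words) + 1))
    for i, hyp_word in enumerate(hyp_words, start=1):
        current = [i]
        for j, ref_word in enumerate(ref_words, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def mbr_decode(samples):
    """Return the sample with the lowest average edit distance to all samples."""
    def expected_risk(candidate):
        return sum(word_edit_distance(candidate, other) for other in samples) / len(samples)

    return min(samples, key=expected_risk)


# Hypothetical transcripts sampled for one utterance.
samples = ["the cat sat", "the cat sat", "a cat sat down"]
print(mbr_decode(samples))
```

Unlike beam search, which returns the single most probable sequence, this picks the hypothesis that agrees most with the rest of the sample pool. The pairwise comparison costs a quadratic number of risk evaluations, which fits the offline use case the paper targets.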
Details
Motivation: Because MBR decoding performs well in text generation tasks, the author tests whether it is also effective for speech-to-text tasks such as ASR and ST.
Method: Evaluates MBR decoding on English and Japanese ASR and ST tasks using Whisper and its derivative models, and compares it with beam search.
Result: MBR decoding is more accurate than beam search in most of the experimental settings.
Conclusion: MBR decoding is a promising method for offline speech recognition and speech translation tasks that require high accuracy.
Abstract: Recent work has shown that sample-based Minimum Bayes Risk (MBR) decoding outperforms beam search in text-to-text generation tasks, such as machine translation, text summarization, and image captioning. On the other hand, beam search is the current practice for speech-to-text tasks such as automatic speech recognition (ASR) and speech translation (ST). Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks. In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models. We observe that the accuracy of MBR decoding outperforms that of beam search in most of the experimental settings we have evaluated. The results show that MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy. The code is available at https://github.com/CyberAgentAILab/mbr-for-asr
[53] VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
Dunjie Lu,Yiheng Xu,Junli Wang,Haoyuan Wu,Xinyuan Wang,Zekun Wang,Junlin Yang,Hongjin Su,Jixuan Chen,Junda Chen,Yuchen Mao,Jingren Zhou,Junyang Lin,Binyuan Hui,Tao Yu
Main category: cs.CL
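TL;DR: This paper presents VideoAgentTrek, a scalable pipeline that automatically generates GUI interaction training data from publicly available screen-recorded videos without manual annotation. Its Video2Action module converts videos into data with structured action labels. Applied to 39,000 YouTube tutorial videos, it yields 1.52 million interaction steps. Task success rate on OSWorld and step accuracy on AgentNetBench both improve markedly, supporting internet video as a source of high-quality supervision.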
Details
Motivation: Training computer-use agents requires large amounts of GUI interaction data, but manual annotation is too expensive to scale, so an automatic, low-cost way to mine training data from existing resources is needed.
Method: Proposes the VideoAgentTrek pipeline, whose Video2Action module has two core components: (1) a video grounding model that detects GUI actions and localizes their temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters such as click coordinates and typed text. The pipeline builds a large-scale action trajectory dataset from YouTube tutorial videos, which is used for continued pretraining and supervised fine-tuning.
Result: 1.52 million interaction steps are generated from 39,000 YouTube videos. On OSWorld-Verified, task success rate rises from 9.3% to 15.8% (a 70% relative improvement). On AgentNetBench, step accuracy rises from 64.1% to 69.3%.
Conclusion: Passively collected internet videos can be turned into high-quality supervision for training computer-use agents, offering a scalable alternative that needs no manual annotation.
Abstract: Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
[54] Machine Text Detectors are Membership Inference Attacks
Ryuto Koike,Liam Dugan,Masahiro Kaneko,Chris Callison-Burch,Naoaki Okazaki
Main category: cs.CL
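TL;DR: This paper studies the transferability between membership inference attacks (MIAs) and machine-generated text detection. It finds that the two share similar methodological signals and proves that a single metric is asymptotically optimal for both tasks. Experiments show that existing methods' performance is strongly correlated across the two tasks; for example, Binoculars also performs very well on MIA benchmarks. The authors release the MINT evaluation suite to support cross-task research.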
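The cross-task claim rests on a rank correlation (the abstract reports rho > 0.6). As a rough illustration of what that statistic measures, the sketch below computes Spearman's rho in plain Python for the same set of methods scored on two tasks. The score lists are invented for the example and are not taken from the paper; the function names are likewise hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for five methods on each task (illustrative only).
mia_scores = [0.61, 0.55, 0.70, 0.58, 0.66]
detection_scores = [0.80, 0.72, 0.91, 0.83, 0.78]

rho = spearman_rho(mia_scores, detection_scores)
print(f"Spearman rho across tasks: {rho:.2f}")
```

A high rho means that methods which rank well on one task tend to rank well on the other, which is the sense of "transferability" used in this entry.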
Details
Motivation: Although MIAs and machine-generated text detection have different goals, both rely on similar signals from a language model's probability distribution. Because the two tasks have been studied independently, methods from one may have been overlooked by the other, so their transferability needs to be examined.
Method: A theoretical analysis proves that the same metric is optimal for both tasks. Large-scale experiments with 7 state-of-the-art MIA methods and 5 machine text detectors across 13 domains and 10 generators then test how well performance correlates across tasks.
Result: Theory shows that optimal performance on both tasks is determined by the same metric; experiments show strong rank correlation in cross-task performance (rho > 0.6); Binoculars reaches state-of-the-art performance on MIA; the unified evaluation suite MINT is released.
Conclusion: Methods transfer strongly between MIAs and machine-generated text detection; the two research communities should collaborate more closely and evaluate and develop methods under a unified framework.
Abstract: Although membership inference attacks (MIAs) and machine-generated text detection target different goals, identifying training samples and synthetic texts, their methods often exploit similar signals based on a language model's probability distribution. Despite this shared methodological foundation, the two tasks have been independently studied, which may lead to conclusions that overlook stronger methods and valuable insights developed in the other task. In this work, we theoretically and empirically investigate the transferability, i.e., how well a method originally developed for one task performs on the other, between MIAs and machine text detection. For our theoretical contribution, we prove that the metric that achieves the asymptotically highest performance on both tasks is the same. We unify a large proportion of the existing literature in the context of this optimal metric and hypothesize that the accuracy with which a given method approximates this metric is directly correlated with its transferability. Our large-scale empirical experiments, including 7 state-of-the-art MIA methods and 5 state-of-the-art machine text detectors across 13 domains and 10 generators, demonstrate very strong rank correlation (rho > 0.6) in cross-task performance. We notably find that Binoculars, originally designed for machine text detection, achieves state-of-the-art performance on MIA benchmarks as well, demonstrating the practical impact of the transferability. Our findings highlight the need for greater cross-task awareness and collaboration between the two research communities. To facilitate cross-task developments and fair evaluations, we introduce MINT, a unified evaluation suite for MIAs and machine-generated text detection, with implementation of 15 recent methods from both tasks.
[55] What is the Best Sequence Length for BABYLM?
Suchir Salhan,Richard Diehl Martinez,Zébulon Goriely,Paula Buttery
Main category: cs.CL
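TL;DR: This paper studies how sequence length affects model performance in BabyLM pretraining and finds that the best length depends on task type and model architecture: shorter sequences suit grammatical generalization tasks, while longer sequences help morphological analogical reasoning tasks.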
Details
Motivation: Many BabyLM Challenge submissions use sequence lengths far shorter than the standard, so the effect of sequence length on training under a fixed compute budget needs to be examined to establish best practice.
Method: With 100M words of training data and fixed compute budgets, 125M-parameter Mamba and OPT models are compared across sequence lengths, and their performance differences on grammatical and morphological tasks are analyzed.
Result: Longer sequences usually perform better, but the optimal length varies with task and architecture: shorter sequences suffice for grammatical generalization, whereas morphological analogical reasoning benefits from longer context.
Conclusion: Sequence length should be chosen per task and architecture rather than by a single rule; BabyLM training should tune context length to the target tasks.
Abstract: Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.
[56] Lookahead Routing for Large Language Models
Canbin Huang,Tianyuan Shi,Yuhua Zhu,Ruijun Chen,Xiaojun Quan
Main category: cs.CL
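TL;DR: Proposes Lookahead, a routing framework that guides model selection by predicting the latent representations of potential model outputs, enabling better-informed routing without full inference; across multiple tasks it improves on the previous best method by 7.7% on average.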
Details
Motivation: Existing LLM routing methods classify only the input query and miss the implicit intent and contextual detail that appear only during response generation, which hurts routing for complex or ambiguous queries.
Method: The Lookahead framework predicts the latent representation of each model's output to anticipate its response and selects a model accordingly; two concrete variants are implemented, based on causal and masked language models.
Result: On seven public benchmarks (covering instruction following, mathematical reasoning, and code generation), Lookahead consistently outperforms existing routing baselines, with an average gain of 7.7%.
Conclusion: By predicting latent representations of model outputs ahead of time, Lookahead improves routing decisions in multi-model systems and offers a new way to use heterogeneous LLMs efficiently.
Abstract: Large language model (LLM) routers improve the efficiency of multi-model systems by directing each query to the most appropriate model while leveraging the diverse strengths of heterogeneous LLMs. Most existing approaches frame routing as a classification problem based solely on the input query. While this reduces overhead by avoiding inference across all models, it overlooks valuable information that could be gleaned from potential outputs and fails to capture implicit intent or contextual nuances that often emerge only during response generation. These limitations can result in suboptimal routing decisions, particularly for complex or ambiguous queries that require deeper semantic understanding. To address this challenge, we propose Lookahead, a routing framework that "foresees" potential model outputs by predicting their latent representations and uses these predictions to guide model selection, thus enabling more informed routing without full inference. Within this framework, we implement two approaches based on causal and masked language models. Empirical evaluations across seven public benchmarks - spanning instruction following, mathematical reasoning, and code generation - show that Lookahead consistently outperforms existing routing baselines, achieving an average performance gain of 7.7% over the state-of-the-art. Our code is available at https://github.com/huangcb01/lookahead-routing.
[57] Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment
Maureen de Seyssel,Eeshan Gunesh Dhekane
Main category: cs.CL
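TL;DR: This paper proposes a unified taxonomy to guide the evaluation of speech foundation models. It classifies existing evaluations along three orthogonal axes (evaluation aspect, required model capabilities, and task requirements) and exposes gaps in current benchmarks in prosody, interaction, and reasoning.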
Details
Motivation: Evaluation of speech foundation models is inconsistent across tasks and model types, making it hard to compare evaluations or choose an appropriate one. A unified framework is needed to systematize evaluation.
Method: A taxonomy with three orthogonal axes is proposed: the evaluation aspect, the required model capabilities, and the task or protocol requirements. Existing speech-model evaluations and benchmarks are classified along these axes.
Result: Existing evaluations spanning representation learning, speech generation, and interactive dialogue are classified; the capabilities each model exposes and the methodological demands of each evaluation are made explicit; and systematic gaps in prosody, interaction, and reasoning are identified.
Conclusion: The taxonomy gives a principled framework for choosing and interpreting evaluations of speech models and points to directions for future benchmark design.
Abstract: Speech foundation models have recently achieved remarkable capabilities across a wide range of tasks. However, their evaluation remains disjointed across tasks and model types. Different models excel at distinct aspects of speech processing and thus require different evaluation protocols. This paper proposes a unified taxonomy that addresses the question: Which evaluation is appropriate for which model? The taxonomy defines three orthogonal axes: the evaluation aspect being measured, the model capabilities required to attempt the task, and the task or protocol requirements needed to perform it. We classify a broad set of existing evaluations and benchmarks along these axes, spanning areas such as representation learning, speech generation, and interactive dialogue. By mapping each evaluation to the capabilities a model exposes (e.g., speech generation, real-time processing) and to its methodological demands (e.g., fine-tuning data, human judgment), the taxonomy provides a principled framework for aligning models with suitable evaluation methods. It also reveals systematic gaps, such as limited coverage of prosody, interaction, or reasoning, that highlight priorities for future benchmark design. Overall, this work offers a conceptual foundation and practical guide for selecting, interpreting, and extending evaluations of speech models.
[58] Conditions for Catastrophic Forgetting in Multilingual Translation
Danni Liu,Jan Niehues
Main category: cs.CL
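TL;DR: Using machine translation as a testbed, this study systematically examines catastrophic forgetting in multilingual fine-tuning. It finds that the relative scale of model and data is the main determinant of forgetting, that instruction-following ability matters more than architecture, that parameter-efficient fine-tuning offers no advantage over full fine-tuning, and that cross-lingual alignment mitigates forgetting and promotes positive transfer.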
Details
Motivation: To resolve the inconsistent and ambiguous conclusions in the literature about when catastrophic forgetting occurs in multilingual fine-tuning, and to identify systematically the conditions that trigger it.
Method: Controlled experiments across model architectures, data scales, and fine-tuning approaches, with machine translation as the testbed.
Result: The relative scale of model and data is the main determinant of forgetting; instruction-following ability is critical for retaining multilingual knowledge; parameter-efficient fine-tuning is not clearly better than full fine-tuning; cross-lingual alignment mitigates forgetting and enables positive transfer to unseen languages.
Conclusion: Catastrophic forgetting in multilingual fine-tuning is driven mainly by the model-to-data scale relationship and by instruction-following ability; cross-lingual alignment is an effective mitigation, while parameter-efficient methods show no clear advantage.
Abstract: Fine-tuning multilingual foundation models on specific languages often induces catastrophic forgetting, degrading performance on languages unseen in fine-tuning. While this phenomenon is widely documented, the literature presents fragmented results about when forgetting occurs. To address this ambiguity, we conduct a systematic empirical study using machine translation as a testbed to identify the conditions that trigger catastrophic forgetting in multilingual fine-tuning. Through controlled experiments across different model architectures, data scales, and fine-tuning approaches, we reveal that the relative scale between model and data size is a primary determinant of forgetting. Moreover, we demonstrate that a model's instruction-following ability is more critical for retaining multilingual knowledge than its architecture. Contrary to assumptions, parameter-efficient fine-tuning offers no clear advantage over full fine-tuning in mitigating forgetting. Lastly, we show that cross-lingual alignment can mitigate forgetting while also facilitating positive transfer to unseen target languages.
[59] Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
Yu Wu,Ke Shu,Jonas Fischer,Lidia Pivovarova,David Rosson,Eetu Mäkelä,Mikko Tolonen
Main category: cs.CL
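TL;DR: This paper introduces the new task of extracting Latin fragments from multilingual historical documents and evaluates large models on 724 annotated pages; the results indicate that existing models can detect Latin effectively.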
Details
Motivation: Historical documents often mix several languages and complex layouts; automatically extracting fragments in a specific language such as Latin is challenging and under-studied.
Method: A multimodal annotated dataset of 724 pages is built and used to benchmark large foundation models on Latin fragment extraction.
Result: Current large models can reliably detect Latin content in mixed documents.
Conclusion: This is the first comprehensive analysis of large foundation models' capabilities and limits on this task, providing a benchmark and direction for later work.
Abstract: This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limits for this task.
[60] PBBQ: A Persian Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models
Farhan Farsi,Shayan Bali,Fatemeh Valeh,Parsa Ghofrani,Alireza Pakniat,Kian Kashfipour,Amir H. Payberah
Main category: cs.CL
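TL;DR: This paper presents PBBQ, a comprehensive benchmark dataset for evaluating social bias in Persian large language models. It covers 16 cultural categories, draws on questionnaires completed by 250 people from diverse groups, was built with social science experts, and contains over 37,000 questions. Experiments show that existing models exhibit significant social bias in Persian cultural contexts and that their outputs often replicate human bias patterns.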
Details
Motivation: As LLMs are widely adopted, ensuring they align with social norms is critical. Resources for studying social bias in Persian cultural contexts remain severely lacking, so a dedicated benchmark is needed to evaluate and mitigate bias in Persian language models.
Method: In collaboration with social science experts, questionnaires covering 16 cultural categories were designed and completed by 250 participants from diverse demographic backgrounds. The resulting PBBQ dataset of over 37,000 questions is used to evaluate several open-source, closed-source, and Persian-specific fine-tuned language models.
Result: Current LLMs show significant social bias in Persian cultural contexts, and the bias patterns in model outputs closely resemble those in human responses, suggesting that models may have learned and reproduced cultural stereotypes.
Conclusion: PBBQ is an effective tool for evaluating and mitigating social bias in Persian LLMs. The study stresses the importance of localized bias-evaluation resources for different cultural contexts and calls for future work on cultural sensitivity and model alignment.
Abstract: With the increasing adoption of large language models (LLMs), ensuring their alignment with social norms has become a critical concern. While prior research has examined bias detection in various languages, there remains a significant gap in resources addressing social biases within Persian cultural contexts. In this work, we introduce PBBQ, a comprehensive benchmark dataset designed to evaluate social biases in Persian LLMs. Our benchmark, which encompasses 16 cultural categories, was developed through questionnaires completed by 250 diverse individuals across multiple demographics, in close collaboration with social science experts to ensure its validity. The resulting PBBQ dataset contains over 37,000 carefully curated questions, providing a foundation for the evaluation and mitigation of bias in Persian language models. We benchmark several open-source LLMs, a closed-source model, and Persian-specific fine-tuned models on PBBQ. Our findings reveal that current LLMs exhibit significant social biases across Persian culture. Additionally, by comparing model outputs to human responses, we observe that LLMs often replicate human bias patterns, highlighting the complex interplay between learned representations and cultural stereotypes. Upon acceptance of the paper, our PBBQ dataset will be publicly available for use in future work. Content warning: This paper contains unsafe content.
[61] CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English
Daryna Dementieva,Evgeniya Sukhodolskaya,Alexander Fraser
Main category: cs.CL
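TL;DR: This paper proposes a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment and builds CrossNews-UA, a multilingual news dataset centered on Ukrainian. News pairs are annotated for semantic similarity using the 4W criteria, and a range of models is tested on multilingual news analysis.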
Details
Motivation: Existing datasets for cross-lingual news analysis rely on manual curation by experts, which makes them hard to extend to new languages and limits their applicability and scalability.
Method: A scalable, explainable crowdsourcing workflow collects news pairs between Ukrainian and Polish, Russian, and English, annotates them for semantic similarity using the 4W criteria (Who, What, Where, When) to build CrossNews-UA, and evaluates models ranging from traditional approaches to large language models.
Result: CrossNews-UA was built successfully. Experiments show that current models still struggle with cross-lingual news similarity, especially in capturing subtle semantic differences, and that performance varies across models in multilingual settings.
Conclusion: The crowdsourcing pipeline scales well and can be applied to other language combinations; CrossNews-UA is a new resource for cross-lingual fake news detection, and the results indicate directions for improving multilingual news analysis models.
Abstract: In the era of social networks and rapid misinformation spread, news analysis remains a critical task. Detecting fake news across multiple languages, particularly beyond English, poses significant challenges. Cross-lingual news comparison offers a promising approach to verify information by leveraging external sources in different languages (Chen and Shu, 2024). However, existing datasets for cross-lingual news analysis (Chen et al., 2022a) were manually curated by journalists and experts, limiting their scalability and adaptability to new languages. In this work, we address this gap by introducing a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment. Using this pipeline, we collected a novel dataset, CrossNews-UA, of news pairs with Ukrainian as the central language, paired with linguistically and contextually relevant languages: Polish, Russian, and English. Each news pair is annotated for semantic similarity with detailed justifications based on the 4W criteria (Who, What, Where, When). We further tested a range of models, from traditional bag-of-words and Transformer-based architectures to large language models (LLMs). Our results highlight the challenges in multilingual news analysis and offer insights into model performance.
[62] Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent
Yangshijie Zhang,Xinda Wang,Jialin Liu,Wenqiang Wang,Zhicong Ma,Xingxing Jia
Main category: cs.CL
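TL;DR: This paper proposes SAD, a font-style-based attack that exploits the gap between how humans and NLP models perceive stylized text to attack a variety of NLP tasks effectively.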
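The human-model perception gap behind SAD can be made concrete with Unicode "font-like" characters. The sketch below is not the authors' attack; it only shows, under the assumption that Mathematical Bold letters stand in for a stylistic font, that text which still reads as "great movie" is a completely different code point sequence to a program. The helper `stylize_bold` is hypothetical, and the NFKC step is included purely to make the gap visible, not as a defense evaluated in the paper.

```python
import unicodedata

BOLD_UPPER_A = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_LOWER_A = 0x1D41A  # MATHEMATICAL BOLD SMALL A


def stylize_bold(text: str) -> str:
    """Map ASCII letters to Mathematical Bold letters; leave other characters alone."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_UPPER_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_LOWER_A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


plain = "great movie"
styled = stylize_bold(plain)

# To a string comparison (or a tokenizer vocabulary lookup) these differ entirely.
print(styled == plain)
print(len(plain.encode("utf-8")), len(styled.encode("utf-8")))

# Unicode compatibility normalization folds the styled letters back to ASCII.
recovered = unicodedata.normalize("NFKC", styled)
print(recovered == plain)
```

Each styled letter occupies four UTF-8 bytes instead of one, which hints at why a subword tokenizer trained mostly on plain text may split such input into unfamiliar pieces.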
Details
Motivation: When social media users write with decorative fonts and font-like emoji, NLP models treat these characters as distinct tokens and mishandle them, which opens a security vulnerability.
Method: Two versions of the style attack SAD are designed: a light version that prioritizes query efficiency and a strong version that aims for higher attack performance. Effectiveness is tested on tasks including sentiment classification and machine translation.
Result: SAD degrades performance on traditional models, LLMs, and commercial services, and poses a potential threat to multimodal tasks such as text-to-image and text-to-speech generation.
Conclusion: SAD exposes the fragility of NLP models on stylized text and underlines the need for greater robustness to non-standard text input.
Abstract: With social media growth, users employ stylistic fonts and font-like emoji to express individuality, creating visually appealing text that remains human-readable. However, these fonts introduce hidden vulnerabilities in NLP models: while humans easily read stylistic text, models process these characters as distinct tokens, causing interference. We identify this human-model perception gap and propose a style-based attack, Style Attack Disguise (SAD). We design two sizes: light for query efficiency and strong for superior attack performance. Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services demonstrate SAD's strong attack performance. We also show SAD's potential threats to multimodal tasks including text-to-image and text-to-speech generation.
[63] LLavaCode: Compressed Code Representations for Retrieval-Augmented Code Generation
Daria Cherniuk,Nikita Sukhorukov,Nikita Sushko,Daniil Gusak,Danil Sivtsov,Elena Tutubalina,Evgeny Frolov
Main category: cs.CL
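TL;DR: LlavaCode is a code-completion framework that compresses code into compact, semantically rich representations, sharply shortening the retrieved context and thereby improving generation quality and inference speed with almost no added latency.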
Details
Motivation: Conventional retrieval-augmented generation (RAG) greatly lengthens the sequence when context is added, slowing inference and making it hard to meet the needs of interactive settings such as IDEs.
Method: The LlavaCode framework uses a small projector module to compress code into a compact representation of only a few single-token vectors that are semantically rich and interpretable by a code LLM.
Result: Compared with a full RAG pipeline, compressed context reduces Time-to-First-Token (TTFT) by 20-38% and markedly improves the EM and ES metrics with almost no added latency.
Conclusion: LlavaCode maintains high code-generation quality while addressing the inference latency caused by long context, which suits interactive programming environments that need fast responses.
Abstract: Retrieval-augmented generation has emerged as one of the most effective approaches for code completion, particularly when context from a surrounding repository is essential. However, incorporating context significantly extends sequence length, leading to slower inference - a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by a code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module, we can significantly increase the EM and ES metrics of the coding model with a negligible latency increase. Our experiments demonstrate that compressed context enables a 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.
[64] Unraveling Emotions with Pre-Trained Models
Alejandro Pajón-Sanmartín,Francisco De Arriba-Pérez,Silvia García-Méndez,Fátima Leal,Benedita Malheiro,Juan Carlos Burguillo-Rial
Main category: cs.CL
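TL;DR: This paper examines the effectiveness of fine-tuning pre-trained models versus prompt engineering for emotion recognition, compares them in three scenarios, and finds that structured prompt design and emotion grouping markedly improve LLM performance.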
Details
Motivation: Although existing models perform well on emotion recognition, open text still poses challenges such as contextual ambiguity, linguistic variability, and complex emotional expression, which limit the direct use of generalist models.
Method: Experiments compare fine-tuned pre-trained models with LLMs given simple prompts, evaluate different emotion prompt designs for LLMs, and analyze the effect of emotion grouping on model performance.
Result: A fine-tuned pre-trained model attains metrics above 70% on emotion recognition; LLMs need structured prompt engineering and emotion grouping to improve.
Conclusion: Structured prompt design and emotion grouping markedly improve LLM emotion detection and can benefit sentiment analysis, human-computer interaction, and the understanding of user behavior across domains.
Abstract: Transformer models have significantly advanced the field of emotion recognition. However, there are still open challenges when exploring open-ended queries for Large Language Models (LLMs). Although current models offer good results, automatic emotion analysis in open texts presents significant challenges, such as contextual ambiguity, linguistic variability, and difficulty interpreting complex emotional expressions. These limitations make the direct application of generalist models difficult. Accordingly, this work compares the effectiveness of fine-tuning and prompt engineering in emotion detection in three distinct scenarios: (i) performance of fine-tuned pre-trained models and general-purpose LLMs using simple prompts; (ii) effectiveness of different emotion prompt designs with LLMs; and (iii) impact of emotion grouping techniques on these models. Experimental tests attain metrics above 70% with a fine-tuned pre-trained model for emotion recognition. Moreover, the findings highlight that LLMs require structured prompt engineering and emotion grouping to enhance their performance. These advancements improve sentiment analysis, human-computer interaction, and understanding of user behavior across various domains.
[65] DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
Xiang Liu,Xuming Hu,Xiaowen Chu,Eunsol Choi
Main category: cs.CL
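TL;DR: This paper proposes DiffAdapt, a framework that analyzes the entropy of token probabilities during reasoning and dynamically selects an inference strategy according to problem difficulty, substantially cutting compute without sacrificing accuracy.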
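Two ingredients of DiffAdapt are easy to sketch: the entropy of a next-token distribution, and a per-difficulty strategy made of a fixed prompt, a temperature, and a maximum token length. The code below is a minimal sketch under loud assumptions: the prompts, temperatures, token limits, probe weights, and thresholds are all invented, and `toy_probe` is a stand-in linear scorer, not the probe the authors train on the LLM's final hidden state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    prompt: str
    temperature: float
    max_tokens: int


# Hypothetical strategy table; the paper's actual settings are not given in the abstract.
STRATEGIES = {
    "easy": Strategy("Answer concisely.", 0.3, 512),
    "normal": Strategy("Think step by step.", 0.6, 2048),
    "hard": Strategy("Think carefully and verify each step.", 0.8, 8192),
}


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def toy_probe(hidden_state, weights, easy_max=0.0, normal_max=1.0):
    """Stand-in difficulty classifier: a linear score bucketed by two thresholds."""
    score = sum(h * w for h, w in zip(hidden_state, weights))
    if score < easy_max:
        return "easy"
    if score < normal_max:
        return "normal"
    return "hard"


def pick_strategy(hidden_state, weights):
    """Choose the inference strategy for one question from its predicted difficulty."""
    return STRATEGIES[toy_probe(hidden_state, weights)]


uniform_entropy = token_entropy([0.5, 0.5])
chosen = pick_strategy([1.0, 2.0], [0.5, 0.5])
print(f"entropy of a fair coin: {uniform_entropy:.4f} nats")
print(chosen)
```

Because only the small probe is trained, switching strategies costs a lookup rather than a fine-tuning run, which is the efficiency argument the entry makes.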
Details
Motivation: Current large models "overthink" easy problems, which makes inference inefficient; the work aims to improve reasoning efficiency and avoid wasted computation.
Method: The entropy of token probabilities in reasoning traces is analyzed and a U-shaped pattern is found. Based on problem difficulty and entropy, the lightweight DiffAdapt framework selects one of three inference strategies per question (Easy/Normal/Hard, each a fixed prompt, temperature, and maximum generation length), training only a small probe classifier to judge difficulty.
Result: Across five models and eight benchmarks, DiffAdapt maintains or improves accuracy while reducing token usage by up to 22.4%.
Conclusion: DiffAdapt offers an efficient inference path that does not require fine-tuning the base LLM, balancing performance against compute cost and advancing compute-efficient reasoning.
Abstract: Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice a 22-25% entropy reduction from easy to medium difficulty regions, suggesting an overthinking phenomenon on easy instances. Building on these insights, we introduce DiffAdapt, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune the base LLM but instead trains a small probe that classifies the LLM's final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4%, establishing a practical path toward compute-efficient reasoning.
[66] CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
Hasan Akgul,Mari Eplik,Javier Rojas,Aina Binti Abdullah,Pieter van der Merwe
Main category: cs.CL
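TL;DR: CoSense-LLM is an edge-oriented framework that combines lightweight sensor fusion with large language models. Under latency, energy, bandwidth, and privacy constraints, it turns multimodal sensor data into verifiable semantic tokens for efficient, secure, low-power semantic sensing.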
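The PromptRouter component of CoSense-LLM is described as a cost- and uncertainty-aware policy that chooses between edge-only generation, edge plus retrieval, and compact cloud escalation. One way such a policy could look is sketched below. Everything concrete here is an assumption made for illustration: the function name `choose_route`, the uncertainty thresholds, and the notion of a per-request cloud budget do not come from the paper.

```python
from enum import Enum


class Route(Enum):
    EDGE_ONLY = "edge_only"
    EDGE_PLUS_RETRIEVAL = "edge_plus_retrieval"
    CLOUD_ESCALATION = "cloud_escalation"


def choose_route(
    uncertainty: float,
    cloud_cost: float,
    cloud_budget: float,
    low: float = 0.2,   # hypothetical threshold: below this, trust the edge model
    high: float = 0.6,  # hypothetical threshold: above this, consider the cloud
) -> Route:
    """Pick the cheapest route whose expected quality is acceptable."""
    if uncertainty < low:
        return Route.EDGE_ONLY
    # Moderate uncertainty, or a cloud call that would exceed the budget:
    # ground the edge model with local retrieval instead of escalating.
    if uncertainty < high or cloud_cost > cloud_budget:
        return Route.EDGE_PLUS_RETRIEVAL
    return Route.CLOUD_ESCALATION


for u in (0.1, 0.4, 0.9):
    print(u, choose_route(u, cloud_cost=1.0, cloud_budget=5.0).value)
```

Ordering the checks from cheapest to most expensive route keeps escalation the exception, in line with the edge-first framing of the entry.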
Details
Motivation: In interference-prone real-world environments, conventional cloud-hosted large models struggle to meet real-time, privacy, and energy-efficiency requirements, so an edge-first framework is needed to jointly optimize semantic understanding, privacy protection, and system performance.
Method: CoSense-LLM has four components: SenseFusion (a lightweight encoder that aligns and compresses sensor embeddings), Edge-RAG (local retrieval-augmented generation), PromptRouter (a cost- and uncertainty-aware inference routing policy), and Secure Execution (an auditable data redaction mechanism). It is combined with optimizations such as KV caching, FlashAttention, speculative decoding, and quantized LoRA.
Result: In home, office, and clinic deployments, the system achieves sub-second end-to-end latency (p95) on edge-dominant paths, substantially reduces inter-tier bandwidth and token costs, and protects privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency, calibrated uncertainty supports selective abstention and controlled escalation, and KV caching with decoding accelerators lowers energy per decision.
Conclusion: The results support an edge-first design that treats semantic understanding, privacy protection, and predictable latency as equally important goals for deploying large models in interference-prone environments.
Abstract: We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site-specific policies and notes; (iii) PromptRouter, a cost- and uncertainty-aware policy that selects edge-only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention-style kernels, speculative decoding, and quantized LoRA adapters, and supports on-device personalization and federated updates under non-IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service level objectives: it sustains sub-second (p95) end-to-end latency on edge-dominant paths, reduces inter-tier token and bandwidth costs by preferring local retrieval-grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge-first design that treats semantics, privacy, and predictable latency as co-equal goals for large model deployments in interference-prone environments.
[67] Are Large Language Models Sensitive to the Motives Behind Communication?
Addison J. Wu,Ryan Liu,Kerem Oktar,Theodore R. Sumers,Thomas L. Griffiths
Main category: cs.CL
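TL;DR: This paper studies whether large language models (LLMs) can recognize the motives of an information source and adjust their trust accordingly. LLMs behave much like humans in controlled experiments but do worse in realistic advertising settings; a simple steering intervention markedly improves their motivational vigilance.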
Details
Motivation: Human communication is motivated, so LLMs operating in the real world need to recognize the intentions and biases of information sources in order to handle motivated information well.
Method: Controlled experiments from cognitive science test how LLMs learn from motivated testimony; the evaluation is then extended to naturalistic settings such as sponsored online adverts, and a steering intervention is designed to make motives more salient.
Result: In controlled settings, LLMs discount information from biased sources in a human-like, rational manner. In realistic advertising settings, they deviate from the rational model's predictions, partly because of distracting information; with the salience intervention, their behavior improves markedly.
Conclusion: LLMs have a basic sensitivity to motives, but effective motivational vigilance in novel real-world settings will require further improvements to the models.
Abstract: Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans' intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source, for instance by weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs' behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents' information ecosystems. In these settings, we find that LLMs' inferences do not track the rational models' predictions nearly as closely, partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.
[68] Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings
Cesar Gonzalez-Gutierrez,Dirk Hovy
Main category: cs.CL
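TL;DR: This empirical study examines how prompting affects the quality of a language model's internal representations. Prompting does change representation quality, but the effect does not consistently track prompt relevance, which challenges the common assumption that more relevant prompts yield better representations.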
Details
Motivation: It remains unclear how prompting lets language models perform diverse tasks in zero-shot settings without task-specific supervision, and in particular how prompting relates to the quality of internal representations.
Method: A series of probing experiments on prompt embeddings analyzes various combinations of prompt templates for zero-shot classification.
Result: Prompting does affect representation quality, but the effect is not consistently related to how relevant the prompt is to the task; more relevant prompts do not necessarily produce better representations.
Conclusion: Prompt effectiveness cannot be explained by task relevance alone; other factors that influence prompt performance need further study.
Abstract: Prompting is a common approach for leveraging LMs in zero-shot settings. However, the underlying mechanisms that enable LMs to perform diverse tasks without task-specific supervision remain poorly understood. Studying the relationship between prompting and the quality of internal representations can shed light on how pre-trained embeddings may support in-context task solving. In this empirical study, we conduct a series of probing experiments on prompt embeddings, analyzing various combinations of prompt templates for zero-shot classification. Our findings show that while prompting affects the quality of representations, these changes do not consistently correlate with the relevance of the prompts to the target task. This result challenges the assumption that more relevant prompts necessarily lead to better representations. We further analyze potential factors that may contribute to this unexpected behavior.
[69] From Answers to Guidance: A Proactive Dialogue System for Legal Documents
Ashish Chouhan,Michael Gertz
Main category: cs.CL
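TL;DR: This paper presents the EUDial dialogue dataset and the LexGuide framework, which aim to improve public understanding of, and access to, EU legal information.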
Details
Motivation: Legal information, especially institutional text, is hard for ordinary citizens to understand; although the EU provides open legal resources, their accessibility and usability remain limited.
Method: EUDial, a multi-turn dialogue dataset with 880 dialogue turns, is built from 204 blogs curated by the AskEP unit of the European Parliamentary Research Service. The LexGuide framework combines retrieval-augmented generation with hierarchical topic organization to support proactive, structured legal dialogue.
Result: Proactive, structured dialogue guidance effectively narrows the gap between the availability of legal information and public comprehension.
Conclusion: EUDial and LexGuide provide a practical dataset and technical framework for advancing proactive, public-facing legal dialogue systems.
Abstract: The accessibility of legal information remains a constant challenge, particularly for laypersons seeking to understand and apply complex institutional texts. While the European Union provides open access to legislation, parliamentary responses, and regulatory documents, these resources can be challenging for laypeople to explore. In this paper, we introduce EUDial, a proactive multi-turn dialogue dataset constructed from 204 blogs curated by the Citizens' Enquiries Unit (AskEP) of the European Parliamentary Research Service. EUDial contains 880 dialogue turns (averaging 4.3 turns per dialogue), where each dialogue includes initial questions, structured answers, and follow-up questions. Beyond dataset construction, we propose the LexGuide framework that leverages retrieval-augmented generation with hierarchical topic organization to structure dialogue progression, ensuring both comprehensive coverage of legal aspects and coherence across conversational turns. The results demonstrate that proactive, structured navigation closes the gap between the availability of legal information and citizen comprehension, establishing EUDial and LexGuide as practical resources for advancing proactive legal dialogue systems.
[70] Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning
M. H. I. Abdalla,Zhipin Wang,Christian Frey,Steffen Eger,Josif Grabocka
Main category: cs.CL
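TL;DR: Zhyper is a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions to culturally align large language models, staying competitive while using up to 26x fewer parameters.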
Details
Motivation: Prompt engineering cannot guarantee that an LLM follows semantic conditions such as a particular culture or political orientation, and existing fine-tuning methods use too many parameters; a more efficient method is needed.
Method: Zhyper uses a parameter-efficient factorized hypernetwork to generate context-aware LoRA adapters from textual descriptions and applies them to LLM conditioning and cultural alignment.
Result: On multiple benchmarks, Zhyper matches state-of-the-art baselines with up to 26x fewer parameters, and in out-of-domain settings it generalizes better and captures fine-grained contextual values more accurately.
Conclusion: Zhyper offers an efficient and flexible way to condition LLMs, sharply reducing parameter overhead while improving performance under complex semantic conditions such as cultural alignment.
Abstract: Large Language Model (LLM) conditioning refers to instructing an LLM to generate content in accordance with the norms and values of a specific culture, beliefs of a particular political orientation, or any desired text-specified semantic conditioning. Unfortunately, prompt engineering does not ensure that LLMs behave in accordance with a desired conditioning due to the inductive bias of the pre-training and alignment datasets. Prior works have focused on fine-tuning LLMs by directly conditioning the LoRA weights; however, such methods introduce a large number of parameters. As a remedy, we propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions. Experiments on multiple benchmarks show that Zhyper achieves competitive performance with up to 26x fewer parameters than the state-of-the-art baselines. Furthermore, we extend Zhyper to cultural alignment, demonstrating improved generalization to out-of-domain settings and a better capturing of fine-grained contextual values.
[71] SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration
Xichen Zhang,Sitong Wu,Haoru Tan,Shaozuo Yu,Yinghao Zhu,Ziyi He,Jiaya Jia
Main category: cs.CL
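TL;DR: Proposes SmartSwitch, an inference framework that monitors and intervenes in an LLM's reasoning process to address "underthinking" in long chain-of-thought reasoning and improve performance on complex reasoning tasks.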
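The SmartSwitch loop (detect a thought switch, score the abandoned thought, and if it looked promising, backtrack and insert a deepening prompt) can be sketched in a few lines. This is a toy under stated assumptions, not the authors' implementation: switches are detected with made-up cue words, `toy_prm` is a stub standing in for a real process reward model, and the threshold and prompt wording are invented.

```python
from typing import Callable, List, Tuple

SWITCH_MARKERS = ("alternatively", "wait")  # hypothetical cue words for a thought switch
DEEPENING_PROMPT = "Let me push this idea further before trying something else."


def is_switch(step: str) -> bool:
    """Heuristic switch detector: the step opens with one of the cue words."""
    return step.strip().lower().startswith(SWITCH_MARKERS)


def smart_switch_step(
    steps: List[str],
    prm_score: Callable[[List[str]], float],
    threshold: float = 0.7,  # hypothetical cut-off for a "high-potential" thought
) -> Tuple[List[str], bool]:
    """Return the trace to continue from and whether an intervention happened."""
    for i in range(1, len(steps)):
        # Perception: a switch at step i, so score the thought that precedes it.
        if is_switch(steps[i]) and prm_score(steps[:i]) >= threshold:
            # Intervention: drop the switch, backtrack, and ask for deeper exploration.
            return steps[:i] + [DEEPENING_PROMPT], True
    return list(steps), False


def toy_prm(prefix: List[str]) -> float:
    """Stub reward model: pretends that factoring is the promising idea."""
    return 0.9 if "factor" in prefix[-1] else 0.2


trace = ["Try to factor the quadratic.", "Alternatively, use the quadratic formula."]
new_trace, intervened = smart_switch_step(trace, toy_prm)
print(intervened, new_trace)
```

In a real system the returned trace would be fed back to the model as its new context, so generation resumes from the point before the premature switch.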
Details
Motivation: LLMs "underthink" on complex reasoning tasks: they switch thoughts frequently without exploring any in depth, which hurts performance and efficiency.
Method: SmartSwitch has a perception module (which scores a thought's potential with a PRM) and an intervention module (which backtracks and inserts a deepening prompt) to encourage deeper exploration of promising thoughts.
Result: Experiments on several mathematical reasoning benchmarks show significant gains in reasoning performance for LLMs of different sizes.
Conclusion: SmartSwitch is a simple, effective plug-and-play approach that mitigates underthinking in long chain-of-thought reasoning and strengthens deep reasoning.
Abstract: The long chain-of-thought (LongCoT) capability is central to the recent breakthroughs achieved by large language models in complex reasoning tasks. However, the accompanying issue of "underthinking", where models exhibit shallow reasoning by frequently switching thoughts without sufficient exploration, limits both performance and token efficiency. To address this problem, we propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. This framework can be easily integrated into any large language model as a plug-and-play solution, continuously monitoring the model's reasoning process to detect underthinking and guide it toward deeper exploration of promising but overlooked thoughts. Specifically, the perception module identifies points where thoughts switch and evaluates the potential of the preceding thought using an off-the-shelf process reward model (PRM). If a high-potential thought is found to be prematurely abandoned, the intervention module interrupts the ongoing inference, backtracks to the point before the switch, and inserts a "deepening prompt" to encourage further exploration along that promising path. Extensive experiments on challenging mathematical reasoning benchmarks demonstrate that our method significantly enhances the performance of various large language models of different sizes.
[72] AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Yuezhou Hu,Jiaxin Guo,Xinyu Feng,Tuo Zhao
Main category: cs.CL
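TL;DR: Proposes AdaSPEC, which selectively filters out difficult-to-fit tokens during knowledge distillation to improve alignment between the draft and target models, raising the token acceptance rate of speculative decoding.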
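The selective filtering idea in this entry can be illustrated with a small sketch. It is not the authors' implementation: the difficulty score (how poorly a reference model fits each position) and the `keep_fraction` parameter are assumptions made for illustration, since the abstract does not spell out the selection rule.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def selective_kd_loss(target, draft, reference, keep_fraction=0.5):
    """Average KL(target || draft) over the positions the reference model fits best.

    target, draft, reference: one next-token distribution per position.
    keep_fraction: hypothetical hyperparameter, the share of positions kept.
    """
    # Difficulty proxy (an assumption): KL between target and reference model.
    difficulty = [kl_divergence(t, r) for t, r in zip(target, reference)]
    n_keep = max(1, int(len(target) * keep_fraction))
    # Indices of the easiest positions, lowest difficulty first.
    kept = sorted(range(len(target)), key=lambda i: difficulty[i])[:n_keep]
    loss = sum(kl_divergence(target[i], draft[i]) for i in kept) / n_keep
    return loss, sorted(kept)
```

Positions the reference model cannot fit are simply dropped from the loss, so the draft model's capacity is spent on tokens it has a realistic chance of matching.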
Details
Motivation: Conventional knowledge distillation minimizes KL divergence over all tokens, which is misaligned with speculative decoding's objective of maximizing the token acceptance rate and therefore limits performance.
Method: Introduces a reference model to identify and filter out difficult-to-fit tokens, and distills only on the easier tokens so that the draft model aligns better with the target model where it can.
Result: Across tasks (arithmetic reasoning, instruction following, coding, summarization) and model sizes (31M/1.4B, 350M/2.7B), AdaSPEC outperforms DistillSpec, with token acceptance rate gains of up to 15%.
Conclusion: Selective distillation improves the efficiency and acceptance rate of speculative decoding without harming generation quality, and it applies broadly.
Abstract: Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
[73] Adapting Multilingual Models to Code-Mixed Tasks via Model Merging
Prashant Kodali,Vaishnavi Shivkumar,Swarang Joshi,Monojit Choudhary,Ponnurangam Kumaraguru,Manish Shrivastava
Main category: cs.CL
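TL;DR: Proposes a model-merging approach to adapting models for code-mixed NLP that combines continued pre-training with model merging; on English-Hindi and English-Spanish tasks it clearly outperforms full fine-tuning and the conventional two-stage approach, and it transfers better across language pairs.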
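The merge step in this entry (combining a continued-pre-training checkpoint with the base model) can be sketched with simple task-vector arithmetic. This is one of several merging rules and only an illustration: the `scale` parameter and the dictionary-of-lists weight format are assumptions, and the TIES variant mentioned in the abstract adds trimming and sign resolution that are not shown here.

```python
def merge_task_vector(base, adapted, scale=0.5):
    """Merge two checkpoints as base + scale * (adapted - base), per parameter.

    base, adapted: dicts mapping parameter names to lists of floats.
    scale: hypothetical merge coefficient; 0 returns base, 1 returns adapted.
    """
    merged = {}
    for name, base_w in base.items():
        adapted_w = adapted[name]
        # The difference (a - b) is the "task vector" learned during CPT.
        merged[name] = [b + scale * (a - b) for b, a in zip(base_w, adapted_w)]
    return merged
```

For example, `merge_task_vector({"w": [1.0, 2.0]}, {"w": [3.0, 2.0]}, scale=0.5)` moves only the first weight halfway toward the adapted value; the merged checkpoint would then be fine-tuned on the downstream task.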
Details
Motivation: Conventional adaptation methods for code-mixed NLP (full fine-tuning, or continued pre-training followed by fine-tuning) do not make full use of unlabeled data and transfer poorly to low-resource language pairs, so a more effective strategy is needed.
Method: Starting from a multilingual base model: (1) continue pre-training on unlabeled code-mixed text to obtain an adapted checkpoint; (2) merge that checkpoint with the base model; (3) fine-tune on downstream task data. Zero-/few-shot prompting and cross-pair transfer are also evaluated.
Result: Merged models score 2-5 F1 points above full fine-tuning and about 1-2 points above CPT->FT; in cross-pair transfer (En-Hi to En-Ta/En-Ml), TV/TIES variants reach 0.65-0.68 F1 versus 0.61-0.63 for full fine-tuning; zero-shot prompting with large models still lags behind fine-tuned and merged models.
Conclusion: Model merging is a more effective adaptation paradigm for code-mixed NLP that uses unlabeled data more efficiently and improves cross-pair transfer. It suits several data regimes (labeled only; labeled+unlabeled; transfer-only), but its scalability and generalization need further study.
Abstract: We study model merging as a practical alternative to conventional adaptation strategies for code-mixed NLP. Starting from a multilingual base model, we: (i) perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, (ii) merge the adapted checkpoint with the base model, and (iii) fine-tune (FT) on the downstream task data. We evaluate our approach on sentence classification tasks (sentiment and hate speech) in English-Hindi (En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our results show that merged models consistently outperform full fine-tuning and CPT->FT. We observe gains of 2-5 points in F1 over full fine-tuning and ~1-2 points over CPT->FT, indicating that unlabeled data is leveraged more effectively via merging than via CPT alone. Zero-/few-shot prompting with larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged checkpoints, underscoring limits of in-context learning for code-mixed inputs. We further test cross-pair transfer by training on En-Hi and evaluating on En-Ta and En-Ml: merged checkpoints transfer more strongly than monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs 0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more reliable substrate for low-resource pairs. We conclude with adaptation recipes matched to common data regimes (labeled only; labeled+unlabeled; transfer-only) and discuss limitations and scaling considerations for broader tasks and larger models.
[74] ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers
Saptarshi Sengupta,Zhengyu Zhou,Jun Araki,Xingbo Wang,Bingqing Wang,Suhang Wang,Zhe Feng
Main category: cs.CL
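TL;DR: Proposes ToolDreamer, a framework that generates hypothetical tool descriptions to improve tool retrieval for LLMs over large tool sets, improving both sparse and dense retrievers.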
Details
Motivation: Existing tool retrieval methods rely on the similarity between the user query and tool descriptions, but the two are written in mismatched language, which degrades retrieval.
Method: Uses an LLM to generate hypothetical tool descriptions relevant to the query, and conditions retrieval on those descriptions instead of the raw query.
Result: On the ToolRet dataset, the method improves the performance of a range of retrievers, with and without training.
Conclusion: ToolDreamer better aligns queries and tools within the language space of tool descriptions, offloads part of the reasoning burden from the LLM, and makes large-scale tool calling more efficient.
Abstract: Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM's context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval as user requests are often poorly aligned with the language of TD. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TD generated using an LLM, i.e., description of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TDs. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
[75] The Art of Asking: Multilingual Prompt Optimization for Synthetic Data
David Mora,Viraat Aryabumi,Wei-Yin Ko,Sara Hooker,Julia Kreutzer,Marzieh Fadaee
Main category: cs.CL
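TL;DR: Proposes a lightweight prompt-space optimization framework that systematically transforms prompts in 12 languages for naturalness, cultural adaptation, and difficulty enhancement, substantially improving multilingual LLM performance.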
Details
Motivation: Existing translation-based prompting is English-centric and neglects cultural dimensions, which limits the generalization of multilingual models.
Method: Introduces a lightweight framework that systematically transforms translated prompts for naturalness, cultural adaptation, and difficulty enhancement, applying the transformations with an off-the-shelf multilingual LLM.
Result: Under identical data conditions, the approach improves over the translation-only baseline by +4.7% on Global-MMLU accuracy, +2.4% on Flores XCometXL, and +35.3% in preference wins on mArenaHard.
Conclusion: Prompt-space optimization is a simple yet powerful paradigm for building multilingual LLMs that are more robust, culturally grounded, and globally capable.
Abstract: Synthetic data has become a cornerstone for scaling large language models, yet its multilingual use remains bottlenecked by translation-based prompts. This strategy inherits English-centric framing and style and neglects cultural dimensions, ultimately constraining model generalization. We argue that the overlooked prompt space, the very inputs that define training distributions, offers a more powerful lever for improving multilingual performance. We introduce a lightweight framework for prompt-space optimization, where translated prompts are systematically transformed for Naturalness, Cultural Adaptation, and Difficulty Enhancement. Using an off-the-shelf multilingual LLM, we apply these transformations to prompts for 12 languages spanning 7 families. Under identical data conditions, our approaches achieve substantial and consistent downstream improvements over the translation-only baseline: +4.7% on Global-MMLU accuracy, +2.4% on Flores XCometXL and +35.3% wins in preferences on mArenaHard. We establish prompt-space optimization as a simple yet powerful paradigm for building multilingual LLMs that are more robust, culturally grounded, and globally capable.
[76] Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Xichen Zhang,Sitong Wu,Yinghao Zhu,Haoru Tan,Shaozuo Yu,Ziyi He,Jiaya Jia
Main category: cs.CL
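TL;DR: Proposes Scaf-GRPO, a progressive reinforcement learning framework that injects tiered hints when an LLM's learning stalls, overcoming the "learning cliff" and significantly improving complex reasoning.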
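The "learning cliff" in this entry follows directly from how group-relative advantages are computed. The sketch below assumes the common formulation in which each sampled response's reward is normalized by the group mean and standard deviation; the use of the population standard deviation and the `eps` constant are assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every sampled answer to a hard problem fails, all rewards are 0, so `group_relative_advantages([0.0, 0.0, 0.0, 0.0])` is all zeros and the problem contributes no gradient. A single success, as in `[0.0, 0.0, 0.0, 1.0]`, already yields a positive advantage for the correct sample and negative ones for the rest, which is the signal that in-prompt hints aim to recover.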
Details
Motivation: When existing reward-based reinforcement learning methods face problems far beyond the model's current ability, persistent zero rewards make the learning gradient vanish (the "learning cliff"), which limits the development of complex reasoning.
Method: Proposes the Scaf-GRPO (Scaffolded Group Relative Policy Optimization) framework: it first detects learning stagnation, then progressively adds tiered guidance to the prompt, from abstract concepts to concrete steps, so the model can construct a solution itself and the advantage does not collapse.
Result: On the AIME24 mathematics benchmark, the pass@1 score of Qwen2.5-Math-7B improves by a relative 44.3% over a vanilla GRPO baseline.
Conclusion: Scaf-GRPO is an effective and robust way to extend the boundary of autonomous reasoning in LLMs, and an important step toward solving harder problems.
Abstract: Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the "learning cliff" phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO's effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates that our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
[77] Hubble: a Model Suite to Advance the Study of LLM Memorization
Johnny Tian-Zheng Wei,Ameya Godbole,Mohammad Aflah Khan,Ryan Wang,Xiaoyuan Zhu,James Flemings,Nitya Kashyap,Krishna P. Gummadi,Willie Neiswanger,Robin Jia
Main category: cs.CL
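TL;DR: Hubble is an open-source suite of large language models for studying LLM memorization; its standard and perturbed models show how training corpus size and the frequency of sensitive data affect memorization, and the findings suggest two best practices for mitigating memorization risk.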
Details
Motivation: Studying the risk of LLMs memorizing sensitive data systematically requires a controlled experimental environment for analyzing memorization mechanisms and defenses.
Method: Builds standard and perturbed open-source LLMs (1B/8B parameters), inserts controlled text (e.g., book passages, biographies, test sets) during pretraining, and inserts data at different pretraining phases to study forgetting.
Result: Memorization depends on the frequency of sensitive data relative to the size of the training corpus, so a larger corpus dilutes memorization risk; sensitive data that is not seen again later in training can be forgotten.
Conclusion: Sensitive data should be diluted by enlarging the training corpus and ordered to appear earlier in training; Hubble is also a suitable testbed for membership inference and machine unlearning research.
Abstract: We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models (standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens), establishing that memorization risks are determined by the frequency of sensitive data relative to the size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.
cs.CV [Back]
[78] Dimensionality Reduction for Remote Sensing Data Analysis: A Systematic Review of Methods and Applications
Nathan Mankovich,Kai-Hendrik Cohrs,Homer Durand,Vasileios Sitokonstantinou,Tristan Williams,Gustau Camps-Valls
Main category: cs.CV
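TL;DR: This review covers the importance of dimensionality reduction in remote sensing data processing, discusses its use across the Earth observation data value chain, and identifies under-explored dimensionality reduction algorithms as opportunities for future research.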
Details
Motivation: The sparsity, inefficiency, and curse of dimensionality that come with high-dimensional remote sensing data limit the effectiveness of machine learning models, so effective dimensionality reduction is needed.
Method: Reviews and analyzes dimensionality reduction techniques, specifically feature extraction, as applied to remote sensing tasks such as data compression, cleaning, fusion, visualization, anomaly detection, and prediction.
Result: Provides a handbook for applying dimensionality reduction across the remote sensing data value chain and identifies several directions for algorithms that merit further study.
Conclusion: Dimensionality reduction is essential for efficient Earth observation data processing and better machine learning performance, and future work should explore more of the under-used methods.
Abstract: Earth observation involves collecting, analyzing, and processing an ever-growing mass of data. Automatically harvesting information is crucial for addressing significant societal, economic, and environmental challenges, ranging from environmental monitoring to urban planning and disaster management. However, the high dimensionality of these data poses challenges in terms of sparsity, inefficiency, and the curse of dimensionality, which limits the effectiveness of machine learning models. Dimensionality reduction (DR) techniques, specifically feature extraction, address these challenges by preserving essential data properties while reducing complexity and enhancing tasks such as data compression, cleaning, fusion, visualization, anomaly detection, and prediction. This review provides a handbook for leveraging DR across the remote sensing (RS) data value chain and identifies opportunities for under-explored DR algorithms and their application in future research.
[79] Ninja Codes: Neurally Generated Fiducial Markers for Stealthy 6-DoF Tracking
Yuichiro Takeuchi,Yusuke Imoto,Shunya Kato
Main category: cs.CV
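TL;DR: Presents Ninja Codes, neurally generated stealthy markers that blend naturally into real environments; an encoder network converts arbitrary images into markers with only modest visual alterations, enabling stealthy 6-DoF location tracking for augmented reality, robotics, and similar applications.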
Details
Motivation: Conventional fiducial markers are conspicuous and are unsuitable in many settings for aesthetic and other reasons, so a marker that blends into its environment is needed.
Method: Uses an end-to-end approach inspired by deep steganography to jointly train encoder and detector networks; images are converted into visually unobtrusive Ninja Codes, printed on an ordinary color printer, and detected with an RGB camera.
Result: Experiments show that Ninja Codes provide reliable location tracking under common indoor lighting while concealing themselves within diverse environmental textures.
Conclusion: Ninja Codes are a useful solution for applications that need stealthy location tracking, especially where appearance matters.
Abstract: In this paper we describe Ninja Codes, neurally-generated fiducial markers that can be made to naturally blend into various real-world environments. An encoder network converts arbitrary images into Ninja Codes by applying visually modest alterations; the resulting codes, printed and pasted onto surfaces, can provide stealthy 6-DoF location tracking for a wide range of applications including augmented reality, robotics, motion-based user interfaces, etc. Ninja Codes can be printed using off-the-shelf color printers on regular printing paper, and can be detected using any device equipped with a modern RGB camera and capable of running inference. Using an end-to-end process inspired by prior work on deep steganography, we jointly train a series of network modules that perform the creation and detection of Ninja Codes. Through experiments, we demonstrate Ninja Codes' ability to provide reliable location tracking under common indoor lighting conditions, while successfully concealing themselves within diverse environmental textures. We expect Ninja Codes to offer particular value in scenarios where the conspicuous appearances of conventional fiducial markers make them undesirable for aesthetic and other reasons.
[80] Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
Seungjun Yu,Junsung Park,Youngsun Lim,Hyunjung Shim
Main category: cs.CV
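TL;DR: Proposes a two-phase vision-language QA system for high-level perception, prediction, and planning questions in autonomous driving; carefully designed prompts and context augmentation significantly improve the performance of pretrained vision-language models on driving QA.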
Details
Motivation: Answering high-level perception, prediction, and planning questions accurately in autonomous driving requires more reliable reasoning from vision-language models.
Method: Phase 1 conditions a large multimodal LLM (Qwen2.5-VL-32B) on six-camera inputs, a temporal window of history, and few-shot chain-of-thought prompts, with a self-consistency ensemble; Phase 2 augments the prompt with nuScenes scene metadata and task-specific instructions.
Result: The approach significantly outperforms the baselines on a driving QA benchmark: Phase 1 with 5 history frames and 10-shot prompting reaches 65.1% accuracy (62.61% zero-shot), self-consistency raises this to 66.85%, and Phase 2 reaches 67.37%; accuracy remains at 96% under severe visual corruption.
Conclusion: Carefully engineered prompts and contextual grounding can significantly improve pretrained vision-language models on high-level driving QA.
Abstract: We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
[81] $Δ$t-Mamba3D: A Time-Aware Spatio-Temporal State-Space Model for Breast Cancer Risk Prediction
Zhengbo Zhou,Dooman Arefan,Margarita Zuley,Shandong Wu
Main category: cs.CV
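TL;DR: Proposes Time-Aware Δt-Mamba3D, a new state-space model for longitudinal medical image analysis that effectively models high-resolution image sequences captured at irregular time intervals and outperforms existing methods on breast cancer risk prediction.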
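To make the idea of feeding the true time gap into the state transition concrete, here is a scalar toy recurrence. It uses a standard zero-order-hold discretization of a linear state-space model with a per-step interval; this is an assumed stand-in, not the paper's selective scanning mechanism, and the parameters `a`, `b`, and `h0` are hypothetical.

```python
import math

def time_aware_scan(inputs, gaps, a=-0.5, b=1.0, h0=0.0):
    """Run a scalar linear state-space recurrence over irregularly spaced inputs.

    inputs: one scalar feature per exam (stand-in for an image embedding).
    gaps:   elapsed time before each exam, in arbitrary units.
    a, b:   hypothetical continuous-time parameters (a < 0 gives decay).
    """
    h = h0
    states = []
    for x, dt in zip(inputs, gaps):
        decay = math.exp(a * dt)       # state transition over the true gap
        gain = (decay - 1.0) / a * b   # zero-order-hold input coefficient
        h = decay * h + gain * x
        states.append(h)
    return states
```

With this form, a longer interval between exams shrinks the carried-over state more than a short one, so the model's memory of earlier exams depends on real elapsed time rather than on position in the sequence.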
Details
Motivation: Existing methods fail to fully exploit the spatial and temporal information in high-resolution medical image sequences captured at irregular intervals: they either sacrifice spatial detail or are computationally inefficient and incompatible with non-uniform time steps.
Method: Designs a continuous-time selective scanning mechanism that explicitly integrates the true time difference between exams into the state transitions, combined with a multi-scale 3D neighborhood fusion module, capturing irregular intervals and rich spatio-temporal context while keeping linear computational complexity.
Result: On breast cancer risk prediction from sequential screening mammograms, the validation c-index improves by 2-5 percentage points and 1-5 year AUC scores exceed those of recurrent, transformer, and state-space model variants.
Conclusion: Time-Aware Δt-Mamba3D is efficient and accurate, and offers a new framework for longitudinal medical image analysis that is well suited to long, complex patient imaging histories.
Abstract: Longitudinal analysis of sequential radiological images is hampered by a fundamental data challenge: how to effectively model a sequence of high-resolution images captured at irregular time intervals. This data structure contains indispensable spatial and temporal cues that current methods fail to fully exploit. Models often compromise by either collapsing spatial information into vectors or applying spatio-temporal models that are computationally inefficient and incompatible with non-uniform time steps. We address this challenge with Time-Aware $\Delta$t-Mamba3D, a novel state-space architecture adapted for longitudinal medical imaging. Our model simultaneously encodes irregular inter-visit intervals and rich spatio-temporal context while remaining computationally efficient. Its core innovation is a continuous-time selective scanning mechanism that explicitly integrates the true time difference between exams into its state transitions. This is complemented by a multi-scale 3D neighborhood fusion module that robustly captures spatio-temporal relationships. In a comprehensive breast cancer risk prediction benchmark using sequential screening mammogram exams, our model shows superior performance, improving the validation c-index by 2-5 percentage points and achieving higher 1-5 year AUC scores compared to established variants of recurrent, transformer, and state-space models. Thanks to its linear complexity, the model can efficiently process long and complex patient screening histories of mammograms, forming a new framework for longitudinal image analysis.
[82] MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models
Aritra Bhowmik,Denis Korzhenkov,Cees G. M. Snoek,Amirhossein Habibian,Mohsen Ghafoorian
Main category: cs.CV
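TL;DR: Proposes a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder and aligns a text-to-video diffusion model to it, producing videos that are more physically plausible and temporally coherent.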
Details
Motivation: Existing text-to-video diffusion models lack an understanding of real dynamics when generating complex motion, so the generated motion is incoherent or physically implausible.
Method: Separates a motion-only subspace from a pretrained video encoder, optimizes it by predicting optical flow, and then aligns the diffusion model's latent features to that subspace.
Result: Evaluations on VideoPhy, VideoPhy2, VBench, and VBench-2.0, along with a user study, confirm the method's effectiveness in improving the physical plausibility and temporal consistency of generated videos.
Conclusion: The proposed disentangled motion alignment strengthens a text-to-video diffusion model's understanding of real motion and improves generation quality while preserving adherence to text prompts.
Abstract: Text-to-video diffusion models have enabled high-quality video synthesis, yet often fail to generate temporally coherent and physically plausible motion. A key reason is the models' insufficient understanding of complex motions that natural videos often entail. Recent works tackle this problem by aligning diffusion model features with those from pretrained video encoders. However, these encoders mix video appearance and dynamics into entangled features, limiting the benefit of such alignment. In this paper, we propose a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder. This subspace is optimized to predict ground-truth optical flow, ensuring it captures true motion dynamics. We then align the latent features of a text-to-video diffusion model to this new subspace, enabling the generative model to internalize motion knowledge and generate more plausible videos. Our method improves the physical commonsense in a state-of-the-art video diffusion model, while preserving adherence to textual prompts, as evidenced by empirical evaluations on VideoPhy, VideoPhy2, VBench, and VBench-2.0, along with a user study.
[83] PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Amith Ananthram,Elias Stengel-Eskin,Lorena A. Bradford,Julia Demarest,Adam Purvis,Keith Krut,Robert Stein,Rina Elster Pantalony,Mohit Bansal,Kathleen McKeown
Main category: cs.CV
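TL;DR: Proposes PoSh, a new metric for detailed image description that uses scene graphs as structured rubrics to guide an LLM judge toward fine-grained error judgments. Also introduces DOCENT, a new dataset of artwork with expert annotations, which is used to validate PoSh's correlation with human ratings, its robustness, and its value as a reward function, and which exposes the weaknesses of current vision-language models on complex scenes.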
Details
Motivation: Existing image description metrics (e.g., CIDEr, SPICE) were designed mainly for short texts; they struggle to capture attribute and relation errors in long texts and do not localize errors. A finer-grained, interpretable metric that agrees better with human judgment is needed.
Method: Proposes the PoSh metric, which uses scene graphs as structured rubrics within an LLM-as-a-Judge framework to localize and score fine-grained errors (such as mistakes in compositional understanding). Also builds a new benchmark, DOCENT, containing artwork, expert-written reference descriptions, and multi-level quality annotations, for evaluating both metrics and models.
Result: On DOCENT, PoSh correlates more strongly with human judgments than the best open-weight alternatives (+0.05 Spearman ρ); it is robust to image type on CapArena and is an effective reward signal that outperforms supervised fine-tuning. Foundation models still struggle to achieve full, error-free coverage of artwork with rich scene dynamics.
Conclusion: PoSh is a replicable, interpretable metric that tracks human judgment more closely; together with DOCENT it sets a new standard for evaluating detailed image description, exposes the limits of current VLMs on complex visual understanding, and supports progress in areas such as assistive text generation.
Abstract: While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $\rho$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
[84] UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning
Zhongyu Jiang,Wenhao Chai,Lei Li,Zhuoran Zhou,Cheng-Yen Yang,Jenq-Neng Hwang
Main category: cs.CV
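TL;DR: Proposes UniHPR, a unified human pose representation learning framework that aligns embeddings from images, 2D poses, and 3D poses with a singular value-based contrastive loss, and performs well on 2D/3D human pose estimation and cross-modal retrieval.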
Details
Motivation: Human pose representations from different modalities (images, 2D keypoints, 3D skeletons, etc.) are central to multi-modal fusion, but the correlation among these representations has rarely been studied systematically with a contrastive approach.
Method: Proposes the UniHPR framework, which uses a singular value-based contrastive learning loss to align image, 2D, and 3D human pose embeddings simultaneously and learn a unified representation.
Result: UniHPR reaches MPJPE 49.9mm on Human3.6M and PA-MPJPE 51.6mm on 3DPW for 3D human pose estimation, and a retrieval error of 9.24mm MPJPE on the pose retrieval task.
Conclusion: UniHPR effectively aligns multi-modal human pose representations and improves downstream performance, supporting the potential of unified representation learning for cross-modal human pose understanding.
Abstract: In recent years, there has been a growing interest in developing effective alignment pipelines to generate unified representations from different modalities for multi-modal fusion and generation. As an important component of Human-Centric applications, Human Pose representations are critical in many downstream tasks, such as Human Pose Estimation, Action Recognition, Human-Computer Interaction, Object tracking, etc. Human Pose representations or embeddings can be extracted from images, 2D keypoints, 3D skeletons, mesh models, and many other modalities. Yet, there are limited instances where the correlation among all of those representations has been clearly researched using a contrastive paradigm. In this paper, we propose UniHPR, a unified Human Pose Representation learning pipeline, which aligns Human Pose embeddings from images, 2D and 3D human poses. To align more than two data representations at the same time, we propose a novel singular value-based contrastive learning loss, which better aligns different modalities and further boosts performance. To evaluate the effectiveness of the aligned representation, we choose 2D and 3D Human Pose Estimation (HPE) as our evaluation tasks. In our evaluation, with a simple 3D human pose decoder, UniHPR achieves remarkable performance metrics: MPJPE 49.9mm on the Human3.6M dataset and PA-MPJPE 51.6mm on the 3DPW dataset with cross-domain evaluation. Meanwhile, we are able to achieve 2D and 3D pose retrieval with our unified human pose representations on the Human3.6M dataset, where the retrieval error is 9.24mm in MPJPE.
[85] Advancing Brain Tumor Segmentation via Attention-based 3D U-Net Architecture and Digital Image Processing
Eyad Gad,Seif Soliman,M. Saeed Darweesh
Main category: cs.CV
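TL;DR: Proposes a 3D U-Net with an attention mechanism, combined with image-processing-based tumor detection, to improve brain tumor segmentation, and reports strong results on the BraTS 2020 dataset.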
Details
Motivation: Standard U-Net models perform poorly on tumors with irregular shapes and ambiguous boundaries, and training on high-resolution MRI data demands substantial compute and suffers from class imbalance.
Method: Integrates an attention mechanism into the 3D U-Net model and uses a tumor detection algorithm based on digital image processing to mitigate training data imbalance.
Result: On the BraTS 2020 dataset, the model achieves a Dice coefficient of 0.975, specificity of 0.988, and sensitivity of 0.995, outperforming related studies.
Conclusion: The proposed model markedly improves the accuracy and reliability of brain tumor segmentation and has clear value for clinical diagnosis.
Abstract: In the realm of medical diagnostics, rapid advancements in Artificial Intelligence (AI) have yielded remarkable improvements in brain tumor segmentation. Encoder-Decoder architectures, such as U-Net, have played a transformative role by effectively extracting meaningful representations in 3D brain tumor segmentation from magnetic resonance imaging (MRI) scans. However, standard U-Net models encounter challenges in accurately delineating tumor regions, especially when dealing with irregular shapes and ambiguous boundaries. Additionally, training robust segmentation models on high-resolution MRI data, such as the BraTS datasets, necessitates high computational resources and often faces challenges associated with class imbalance. This study proposes the integration of the attention mechanism into the 3D U-Net model, enabling the model to capture intricate details and prioritize informative regions during the segmentation process. Additionally, a tumor detection algorithm based on digital image processing techniques is utilized to address the issue of imbalanced training data and mitigate bias. This study aims to enhance the performance of brain tumor segmentation, ultimately improving the reliability of diagnosis. The proposed model is thoroughly evaluated and assessed on the BraTS 2020 dataset using various performance metrics to accomplish this goal. The obtained results indicate that the model outperformed related studies, with a Dice score of 0.975, specificity of 0.988, and sensitivity of 0.995, indicating the efficacy of the proposed model in improving brain tumor segmentation and offering valuable insights for reliable diagnosis in clinical settings.
[86] A Novel Approach to Breast Cancer Segmentation using U-Net Model with Attention Mechanisms and FedProx
Eyad Gad,Mustafa Abou Khatwa,Mustafa A. Elattar,Sahar Selim
Main category: cs.CV
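TL;DR: Proposes a federated learning framework that combines FedProx with a modified U-Net to achieve accurate tumor segmentation on non-IID breast cancer ultrasound images while protecting patient privacy.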
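FedProx, the method named in this entry, is usually described as adding a proximal term (mu/2)·||w − w_global||² to each client's local loss so that local models do not drift far from the global model on non-IID data. The sketch below shows one local gradient step under that formulation; the flat-list weight format and the default `mu` and `lr` values are illustrative assumptions, not settings from the paper.

```python
def fedprox_local_step(weights, global_weights, grads, mu=0.1, lr=0.01):
    """One local SGD step on loss(w) + (mu / 2) * ||w - w_global||^2.

    weights:        current local parameters (list of floats).
    global_weights: parameters received from the server this round.
    grads:          gradient of the local task loss at `weights`.
    """
    # The proximal term contributes mu * (w - w_global) to the gradient.
    return [
        w - lr * (g + mu * (w - wg))
        for w, wg, g in zip(weights, global_weights, grads)
    ]
```

Setting `mu=0` recovers a plain local SGD step, while a larger `mu` pulls each client's update back toward the server model, which is what stabilizes training when clients hold differently distributed data.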
Details
Motivation: Breast cancer is a leading cause of death among women worldwide, so early detection is essential. Ultrasound imaging is reliable and low-cost, but medical data is highly sensitive, and conventional AI models struggle to deliver both accuracy and privacy. In addition, non-IID data in federated learning degrades model performance, particularly for tumor boundary segmentation.
Method: Applies the Federated Proximal (FedProx) method to non-IID local breast cancer ultrasound data, combined with a modified U-Net with attention mechanisms to improve tumor segmentation accuracy.
Result: The proposed method yields a global model with 96% accuracy, improving tumor segmentation accuracy and model generalization.
Conclusion: FedProx combined with an attention U-Net can handle the challenges of non-IID medical data while protecting patient privacy, and has potential for wide use in medical image segmentation.
Abstract: Breast cancer is a leading cause of death among women worldwide, emphasizing the need for early detection and accurate diagnosis. Ultrasound imaging, a reliable and cost-effective tool, is used for this purpose; however, the sensitive nature of medical data makes it challenging to develop accurate and private artificial intelligence models. One solution is Federated Learning, a promising technique for distributed machine learning on sensitive medical data while preserving patient privacy. However, training on non-independent and identically distributed (non-IID) local datasets can impact the accuracy and generalization of the trained model, which is crucial for accurate tumour boundary delineation in breast cancer segmentation. This study aims to tackle this challenge by applying the Federated Proximal (FedProx) method to non-IID Ultrasonic Breast Cancer Imaging datasets. Moreover, we focus on enhancing tumour segmentation accuracy by incorporating a modified U-Net model with attention mechanisms. Our approach resulted in a global model with 96% accuracy, demonstrating the effectiveness of our method in enhancing tumour segmentation accuracy while preserving patient privacy. Our findings suggest that FedProx has the potential to be a promising approach for training precise machine learning models on non-IID local medical datasets.
[87] X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning
Yunzhe Wang,Soham Hans,Volkan Ustun
Main category: cs.CV
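TL;DR: Introduces the X-Ego-CS dataset and the Cross-Ego Contrastive Learning (CECL) method for studying first-person multi-agent decision-making and team-level tactical awareness in complex 3D environments.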
Details
Motivation: 现有体育团队交互建模多依赖第三人称视角,忽视了多智能体学习中同步的、以自我为中心的特性,因此需要一个能捕捉个体视角并支持团队协同理解的数据集和方法。 Method: 构建包含124小时职业级《反恐精英2》比赛数据的X-Ego-CS数据集,提供同步的第一人称视频流与状态-动作轨迹;提出CECL方法,通过对比学习对齐队友的视觉流,提升个体视角下的团队态势感知能力。 Result: 在队友-对手位置预测任务上验证了CECL的有效性,能够利用最先进的视频编码器从单一第一人称视角推断队友和对手的位置。 Conclusion: X-Ego-CS和CECL为电子竞技中的跨自我中心多智能体建模提供了基础,也为虚拟与现实世界中的人机协作与时空推理提供了新的研究平台。 Abstract: Human team tactics emerge from each player's individual perspective and their ability to anticipate, interpret, and adapt to teammates' intentions. While advances in video understanding have improved the modeling of team interactions in sports, most existing work relies on third-person broadcast views and overlooks the synchronous, egocentric nature of multi-agent learning. We introduce X-Ego-CS, a benchmark dataset consisting of 124 hours of gameplay footage from 45 professional-level matches of the popular e-sports game Counter-Strike 2, designed to facilitate research on multi-agent decision-making in complex 3D environments. X-Ego-CS provides cross-egocentric video streams that synchronously capture all players' first-person perspectives along with state-action trajectories. Building on this resource, we propose Cross-Ego Contrastive Learning (CECL), which aligns teammates' egocentric visual streams to foster team-level tactical situational awareness from an individual's perspective. We evaluate CECL on a teammate-opponent location prediction task, demonstrating its effectiveness in enhancing an agent's ability to infer both teammate and opponent positions from a single first-person view using state-of-the-art video encoders. Together, X-Ego-CS and CECL establish a foundation for cross-egocentric multi-agent benchmarking in esports. More broadly, our work positions gameplay understanding as a testbed for multi-agent modeling and tactical learning, with implications for spatiotemporal reasoning and human-AI teaming in both virtual and real-world domains. 
Code and dataset are available at https://github.com/HATS-ICT/x-ego.
[88] FootFormer: Estimating Stability from Visual Input
Keaton Kraiger,Jingjing Li,Skanda Bharadwaj,Jesse Scott,Robert T. Collins,Yanxi Liu
Main category: cs.CV
TL;DR: Proposes FootFormer, a cross-modality method for jointly predicting human motion dynamics directly from visual input.
Details
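Motivation: Existing methods typically generate only one or two of foot pressure distribution, foot contact map, and center of mass (CoM), and therefore lack completeness and accuracy.
Method: A cross-modal learning framework jointly predicts foot pressure distributions, foot contact maps, and CoM, and is trained and validated on multiple datasets.
Result: FootFormer is significantly better than or equivalent to existing methods at estimating foot pressure distributions, foot contact maps, and CoM, and reaches SOTA performance on the stability-predictive components used in classic kinesiology metrics, such as center of pressure (CoP) and base of support (BoS).
Conclusion: FootFormer can jointly predict multiple motion-dynamics quantities from visual input and has broad potential in multi-modal human motion analysis.
Abstract: We propose FootFormer, a cross-modality approach for jointly predicting human motion dynamics directly from visual input. On multiple datasets, FootFormer achieves statistically significantly better or equivalent estimates of foot pressure distributions, foot contact maps, and center of mass (CoM), as compared with existing methods that generate one or two of those measures. Furthermore, FootFormer achieves SOTA performance in estimating stability-predictive components (CoP, CoM, BoS) used in classic kinesiology metrics. Code and data are available at https://github.com/keatonkraiger/Vision-to-Stability.git.
[89] Malaria Detection from Blood Cell Images Using XceptionNet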
Warisa Nusrat,Mostafijur Rahman,Ayatullah Faruk Mollah
Main category: cs.CV
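TL;DR: This paper uses deep convolutional networks to automatically detect malaria-infected blood cells; Residual Attention Network and XceptionNet reach 97.28% and 97.55% accuracy respectively on a public dataset, outperforming existing methods.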
Details
Motivation: A shortage of trained professionals and the error-prone nature of manual diagnosis call for a reliable automated method of malaria detection.
Method: Six deep convolutional networks (AlexNet, XceptionNet, VGG-19, Residual Attention Network, DenseNet-121, and Custom-CNN) extract deep features from blood cell images and classify cells as infected or healthy.
Result: Residual Attention Network and XceptionNet perform best, with accuracies of 97.28% and 97.55% respectively, exceeding comparable methods.
Conclusion: Deep learning enables automatic and reliable malaria detection with less manual involvement and shows promise for clinical use.
Abstract: Malaria, which primarily spreads through the bite of female Anopheles mosquitoes, often leads to death, particularly among children aged 0-5 years. Clinical experts identify malaria by observing RBCs in blood smear images with a microscope. Lack of adequate professional knowledge and skills and, most importantly, manual involvement may cause incorrect diagnosis. Therefore, computer-aided automatic diagnosis stands as a preferred substitute. In this paper, well-demonstrated deep networks have been applied to extract deep intrinsic features from blood cell images and thereafter classify them as malaria-infected or healthy cells. Among the six deep convolutional networks employed in this work, viz. AlexNet, XceptionNet, VGG-19, Residual Attention Network, DenseNet-121 and Custom-CNN, Residual Attention Network and XceptionNet perform relatively better than the rest on a publicly available malaria cell image dataset. They yield an average accuracy of 97.28% and 97.55% respectively, which surpasses other related methods on the same dataset. These findings strongly support the practicality of deep-learning-driven methods for automatic and reliable detection of malaria while minimizing direct manual involvement.
[90] PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning
Fengyuan Sun,Hui Chen,Xinhao Xu,Dandan Zheng,Jingdong Chen,Jun Zhou,Jungong Han,Guiguang Ding
Main category: cs.CV
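TL;DR: This paper proposes PruneHal, a training-free, simple, and effective method for mitigating hallucinations in multi-modal large language models; it uses adaptive KV cache pruning to strengthen the model's focus on critical visual information and reduce hallucinations.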
Details
Motivation: Hallucination is a serious problem in multi-modal large language models. Existing methods rely on additional data or introduce external information at inference time, which adds computational overhead. The authors find that hallucinations are linked to insufficient attention on visual tokens, in particular because redundant visual tokens disperse the model's attention.
Method: PruneHal uses adaptive KV cache pruning to dynamically remove redundant visual tokens during inference, increasing the model's attention on critical visual information and thereby reducing hallucinations. The method requires no training, adds almost no inference cost, and is model-agnostic.
Result: Experiments on several mainstream MLLMs and widely used hallucination evaluation benchmarks show strong and robust results that outperform existing methods, supporting the method's effectiveness and generality.
Conclusion: PruneHal is the first work to apply token pruning to hallucination mitigation in MLLMs; being training-free, low-overhead, and scalable, it offers an efficient and practical new direction for reducing MLLM hallucinations.
Abstract: While multi-modal large language models (MLLMs) have made significant progress in recent years, the issue of hallucinations remains a major challenge. To mitigate this phenomenon, existing solutions either introduce additional data for further training or incorporate external or internal information during inference. However, these approaches inevitably introduce extra computational costs. In this paper, we observe that hallucinations in MLLMs are strongly associated with insufficient attention allocated to visual tokens. In particular, the presence of redundant visual tokens disperses the model's attention, preventing it from focusing on the most informative ones. As a result, critical visual cues are often under-attended, which in turn exacerbates the occurrence of hallucinations. Building on this observation, we propose PruneHal, a training-free, simple yet effective method that leverages adaptive KV cache pruning to enhance the model's focus on critical visual information, thereby mitigating hallucinations. To the best of our knowledge, we are the first to apply token pruning for hallucination mitigation in MLLMs. Notably, our method does not require additional training and incurs nearly no extra inference cost. Moreover, PruneHal is model-agnostic and can be seamlessly integrated with different decoding strategies, including those specifically designed for hallucination mitigation.
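As a rough illustration of what pruning visual entries from a KV cache can look like, the sketch below keeps only the visual tokens that received the most attention. It is an assumption-laden toy, not PruneHal's algorithm: the abstract does not specify the pruning criterion or how the budget adapts, so the attention-score ranking, the fixed `keep_ratio`, and the function name `prune_visual_kv` are all hypothetical.

```python
def prune_visual_kv(keys, values, attn_scores, keep_ratio=0.5):
    """Keep the visual-token KV entries with the highest attention scores.

    keys, values: per-token cache entries for the visual tokens.
    attn_scores:  one aggregated attention score per visual token
                  (how the scores are aggregated is left open here).
    keep_ratio:   fraction of tokens to retain; a fixed value is a
                  simplification, whereas the paper describes the pruning
                  as adaptive.
    Returns the pruned keys, pruned values, and the kept indices in their
    original order so positional structure is preserved.
    """
    if not (len(keys) == len(values) == len(attn_scores)):
        raise ValueError("keys, values, and attn_scores must have equal length")
    n_keep = max(1, int(len(keys) * keep_ratio))
    ranked = sorted(range(len(keys)), key=lambda i: attn_scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept

# Toy cache with four visual tokens; strings stand in for key/value tensors.
k, v, idx = prune_visual_kv(
    ["k0", "k1", "k2", "k3"],
    ["v0", "v1", "v2", "v3"],
    [0.10, 0.40, 0.05, 0.45],
    keep_ratio=0.5,
)
print(idx, k, v)
```

Because only cache entries are dropped and no weights change, a scheme of this kind needs no training, which is consistent with the training-free property the abstract emphasizes.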
We evaluate PruneHal on several widely used hallucination evaluation benchmarks using four mainstream MLLMs, achieving robust and outstanding results that highlight the effectiveness and superiority of our method. Our code will be publicly available.
[91] Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning
Takehiro Aoshima,Yusuke Shinohara,Park Byeongseon
Main category: cs.CV
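TL;DR: Proposes Video Consistency Distance (VCD), a new metric for improving temporal consistency in image-to-video generation; defining VCD in the frequency domain and combining it with a reward-based fine-tuning framework markedly improves the temporal coherence of generated videos without harming other aspects of performance.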
Details
Motivation: Existing reward-based fine-tuning methods focus mainly on overall video quality and often leave temporal consistency lacking in image-to-video generation, so a metric dedicated to temporal consistency is needed.
Method: Video Consistency Distance (VCD) is defined in the frequency space of video frame features, using frequency-domain analysis to capture frame information effectively, and is integrated into a reward-based fine-tuning framework to optimize the model.
Result: Experiments on multiple I2V datasets show that fine-tuning with VCD significantly improves the temporal consistency of generated videos without sacrificing other aspects of performance.
Conclusion: VCD is an effective new temporal-consistency metric that significantly improves temporal coherence in image-to-video generation and opens a new direction for reward-based fine-tuning of video diffusion models.
Abstract: Reward-based fine-tuning of video diffusion models is an effective approach to improve the quality of generated videos, as it can fine-tune models without requiring real-world video datasets. However, it can sometimes be limited to specific performances because conventional reward functions are mainly aimed at enhancing the quality across the whole generated video sequence, such as aesthetic appeal and overall consistency. Notably, the temporal consistency of the generated video often suffers when applying previous approaches to image-to-video (I2V) generation tasks. To address this limitation, we propose Video Consistency Distance (VCD), a novel metric designed to enhance temporal consistency, and fine-tune a model with the reward-based fine-tuning framework. To achieve coherent temporal consistency relative to a conditioning image, VCD is defined in the frequency space of video frame features to capture frame information effectively through frequency-domain analysis. Experimental results across multiple I2V datasets demonstrate that fine-tuning a video generation model with VCD significantly enhances temporal consistency without degrading other performance compared to the previous method.
[92] Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
Kai Zeng,Zhanqian Wu,Kaixin Xiong,Xiaobao Wei,Xiangyu Guo,Zhenxin Zhu,Kalok Ho,Lijun Zhou,Bohan Zeng,Ming Lu,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Wentao Zhang
Main category: cs.CV
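TL;DR: Dream4Drive is a new synthetic data generation framework for enhancing downstream perception tasks in autonomous driving; it uses 3D-aware guidance maps and multi-view rendering to generate high-quality, editable videos and significantly improves corner-case perception.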
Details
Motivation: Existing driving world model methods focus on generation quality and controllability but neglect evaluation on downstream perception tasks, and the advantage of synthetic data becomes insignificant once the number of training epochs is increased, so a framework that demonstrates the real value of synthetic data is needed.
Method: Dream4Drive decomposes the input video into several 3D-aware guidance maps, renders 3D assets onto these maps, and fine-tunes the driving world model to generate multi-view photorealistic videos for training downstream perception models. The DriveObj3D dataset is also introduced to support 3D-aware video editing.
Result: Experiments show that Dream4Drive effectively improves downstream perception models across different numbers of training epochs and is particularly strong at generating multi-view corner cases at scale.
Conclusion: Through controllable 3D-aware synthetic data generation, Dream4Drive significantly strengthens downstream perception in autonomous driving and confirms the value of high-quality synthetic data in practical tasks.
Abstract: Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are really crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing.
We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Project: https://wm-research.github.io/Dream4Drive/
[93] MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting
In-Hwan Jin,Hyeongju Mun,Joonsoo Kim,Kugjin Yun,Kyeongbo Kong
Main category: cs.CV
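TL;DR: This paper proposes MoE-GS, a unified framework that brings Mixture-of-Experts into dynamic Gaussian splatting; a Volume-aware Pixel Router adaptively blends the outputs of multiple experts, and the method outperforms existing approaches on the N3V and Technicolor datasets.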
Details
Motivation: Existing dynamic scene reconstruction methods perform inconsistently across scenes, and no general solution handles all dynamic challenges.
Method: The MoE-GS framework combines multiple specialized experts with a Volume-aware Pixel Router that projects volumetric Gaussian-level weights into pixel space through differentiable weight splatting for adaptive blending; single-pass multi-expert rendering, gate-aware Gaussian pruning, and knowledge distillation are explored to improve efficiency.
Result: Experiments on the N3V and Technicolor datasets show that MoE-GS outperforms current state-of-the-art methods in rendering quality and efficiency, and supports lightweight deployment.
Conclusion: MoE-GS is the first method to apply Mixture-of-Experts to dynamic Gaussian splatting; it improves the robustness and performance of dynamic scene reconstruction while retaining the potential for efficient inference.
Abstract: Recent advances in dynamic scene reconstruction have significantly benefited from 3D Gaussian Splatting, yet existing methods show inconsistent performance across diverse scenes, indicating no single approach effectively handles all dynamic challenges. To overcome these limitations, we propose Mixture of Experts for Dynamic Gaussian Splatting (MoE-GS), a unified framework integrating multiple specialized experts via a novel Volume-aware Pixel Router. Our router adaptively blends expert outputs by projecting volumetric Gaussian-level weights into pixel space through differentiable weight splatting, ensuring spatially and temporally coherent results. Although MoE-GS improves rendering quality, the increased model capacity and reduced FPS are inherent to the MoE architecture. To mitigate this, we explore two complementary directions: (1) single-pass multi-expert rendering and gate-aware Gaussian pruning, which improve efficiency within the MoE framework, and (2) a distillation strategy that transfers MoE performance to individual experts, enabling lightweight deployment without architectural changes. To the best of our knowledge, MoE-GS is the first approach incorporating Mixture-of-Experts techniques into dynamic Gaussian splatting. Extensive experiments on the N3V and Technicolor datasets demonstrate that MoE-GS consistently outperforms state-of-the-art methods with improved efficiency. Video demonstrations are available at https://anonymous.4open.science/w/MoE-GS-68BA/.
[94] SFGFusion: Surface Fitting Guided 3D Object Detection with 4D Radar and Camera Fusion
Xiaozhi Li,Huijun Di,Jian Li,Feng Liu,Wei Liang
Main category: cs.CV
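TL;DR: Proposes SFGFusion, a camera and 4D imaging radar fusion detection network based on surface fitting; estimating the quadratic surface parameters of objects strengthens spatial representation and cross-modal interaction and mitigates radar point cloud sparsity, with strong results on the TJ4DRadSet and VoD datasets.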
Details
Motivation: 4D imaging radar offers low cost, long range, and accurate velocity measurement, but its sparse point clouds and low resolution limit geometric representation and multi-modal fusion, so detection performance needs to improve.
Method: SFGFusion estimates quadratic surface parameters of objects from image and radar data to build an explicit surface fitting model. The predicted depth guides the transformation of image features from perspective view to bird's-eye view and is used to generate a dense pseudo-point cloud that mitigates radar sparsity. The original radar point cloud and the pseudo-point cloud are encoded separately, processed with a pillar-based method, and fused in BEV space, after which a 2D backbone performs detection.
Result: On the TJ4DRadSet and view-of-delft (VoD) detection benchmarks, SFGFusion markedly improves camera and 4D radar fusion and achieves better detection performance than existing methods.
Conclusion: With its surface-fitting-guided dual-path fusion strategy, SFGFusion strengthens cross-modal feature alignment and spatial representation, addresses the sparsity and low resolution of 4D radar point clouds, and offers a new approach to multi-modal detection in autonomous driving.
Abstract: 3D object detection is essential for autonomous driving. As an emerging sensor, 4D imaging radar offers advantages such as low cost, long-range detection, and accurate velocity measurement, making it highly suitable for object detection. However, its sparse point clouds and low resolution limit object geometric representation and hinder multi-modal fusion. In this study, we introduce SFGFusion, a novel camera-4D imaging radar detection network guided by surface fitting. By estimating quadratic surface parameters of objects from image and radar data, the explicit surface fitting model enhances spatial representation and cross-modal interaction, enabling more reliable prediction of fine-grained dense depth. The predicted depth serves two purposes: 1) in an image branch to guide the transformation of image features from perspective view (PV) to a unified bird's-eye view (BEV) for multi-modal fusion, improving spatial mapping accuracy; and 2) in a surface pseudo-point branch to generate a dense pseudo-point cloud, mitigating the radar point sparsity. The original radar point cloud is also encoded in a separate radar branch. These two point cloud branches adopt a pillar-based method and subsequently transform the features into the BEV space. Finally, a standard 2D backbone and detection head are used to predict object labels and bounding boxes from BEV features.
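To make the notion of quadratic surface parameters concrete, the sketch below fits a quadric to a handful of sparse 3D points by least squares and then evaluates it densely. It is only an illustration under stated assumptions: SFGFusion predicts the parameters with a network from image and radar features, whereas this toy fits them directly to points, and the parameterization z = a*x^2 + b*x*y + c*y^2 + d*x + e*y + f (with z as depth over image-plane coordinates) and all function names are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_quadric(points):
    """Least-squares fit of z = a*x^2 + b*x*y + c*y^2 + d*x + e*y + f.

    points: iterable of (x, y, z); at least six well-spread points are needed.
    Returns [a, b, c, d, e, f] from the normal equations (A^T A) w = A^T z.
    """
    rows = [[x * x, x * y, y * y, x, y, 1.0] for x, y, _ in points]
    zs = [z for _, _, z in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    atz = [sum(r[i] * z for r, z in zip(rows, zs)) for i in range(6)]
    return solve_linear(ata, atz)

def eval_quadric(coeffs, x, y):
    """Evaluate the fitted surface at (x, y), e.g. to densify sparse depth."""
    a, b, c, d, e, f = coeffs
    return a * x * x + b * x * y + c * y * y + d * x + e * y + f

# Sparse samples taken from a known surface, standing in for radar returns.
true = [0.5, 0.2, -0.3, 1.0, -2.0, 10.0]
sparse = [(x, y, eval_quadric(true, x, y)) for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)]
coeffs = fit_quadric(sparse)
print(eval_quadric(coeffs, 0.5, 0.5))  # depth at a location with no sample
```

Evaluating the fitted surface on a dense grid is one simple way to picture how an explicit surface model can turn a few radar points into a dense pseudo-point cloud.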
Experimental results show that SFGFusion effectively fuses camera and 4D radar features, achieving superior performance on the TJ4DRadSet and view-of-delft (VoD) object detection benchmarks.
[95] Space Object Detection using Multi-frame Temporal Trajectory Completion Method
Xiaoqing Lan,Biqiao Xin,Bingshu Wang,Han Zhang,Laixian Zhang
Main category: cs.CV
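TL;DR: This paper proposes a multi-frame temporal trajectory completion method based on the wavelet transform and the Hungarian algorithm to improve the detection of geostationary-orbit space objects against complex backgrounds.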
Details
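Motivation: Optical imaging of geostationary-orbit space objects suffers from weak signals, complex backgrounds, and interference, which makes detection difficult; existing methods struggle to extract targets effectively while keeping trajectories continuous.
Method: A wavelet transform enhances high-frequency features and suppresses background noise in single frames; the Hungarian algorithm provides globally optimal cross-frame matching to build multi-frame temporal trajectories; a post-processing pipeline covers temporal matching and interpolation completion, temporal-consistency-based noise filtering, and progressive trajectory refinement.
Result: On the public SpotGEO dataset the method reaches an F1 score of 90.14%, reducing missed and false detections and improving detection accuracy and trajectory completeness.
Conclusion: The proposed method markedly improves GEO space object detection under strong noise and complex backgrounds, with high robustness and application potential.
Abstract: Space objects in Geostationary Earth Orbit (GEO) present significant detection challenges in optical imaging due to weak signals, complex stellar backgrounds, and environmental interference. In this paper, we enhance high-frequency features of GEO targets while suppressing background noise at the single-frame level through wavelet transform. Building on this, we propose a multi-frame temporal trajectory completion scheme centered on the Hungarian algorithm for globally optimal cross-frame matching. To effectively mitigate missing and false detections, a series of key steps including temporal matching and interpolation completion, temporal-consistency-based noise filtering, and progressive trajectory refinement are designed in the post-processing pipeline. Experimental results on the public SpotGEO dataset demonstrate the effectiveness of the proposed method, achieving an F1 score of 90.14%.
[96] Background Fades, Foreground Leads: Curriculum-Guided Background Pruning for Efficient Foreground-Centric Collaborative Perception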
Yuheng Wu,Xiangbo Gao,Quang Tau,Zhengzhong Tu,Dongman Lee
Main category: cs.CV
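TL;DR: FadeLead is a foreground-centric collaborative perception framework that uses a curriculum learning strategy to compress background context into foreground features, improving the perception of autonomous vehicles under bandwidth constraints.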
Details
Motivation: Because of vehicular network bandwidth limits, conventional methods transmit only foreground features and discard background information that carries important context, which limits perception performance; a method is needed that preserves background context without increasing transmission overhead.
Method: The FadeLead framework uses a curriculum learning strategy that exploits background cues early in training and gradually reduces reliance on them, forcing the model to internalize context into foreground feature representations so that context can be shared without transmitting the background.
Result: Experiments on several simulated and real-world benchmarks show that FadeLead outperforms prior methods across bandwidth settings, supporting the effectiveness of context-enriched foreground sharing.
Conclusion: By folding background context into foreground features, FadeLead improves collaborative perception without adding transmission load and offers an effective solution for multi-vehicle collaborative perception under bandwidth constraints.
Abstract: Collaborative perception enhances the reliability and spatial coverage of autonomous vehicles by sharing complementary information across vehicles, offering a promising solution to long-tail scenarios that challenge single-vehicle perception. However, the bandwidth constraints of vehicular networks make transmitting the entire feature map impractical. Recent methods, therefore, adopt a foreground-centric paradigm, transmitting only predicted foreground-region features while discarding the background, which encodes essential context. We propose FadeLead, a foreground-centric framework that overcomes this limitation by learning to encapsulate background context into compact foreground features during training. At the core of our design is a curricular learning strategy that leverages background cues early on but progressively prunes them away, forcing the model to internalize context into foreground representations without transmitting background itself. Extensive experiments on both simulated and real-world benchmarks show that FadeLead outperforms prior methods under different bandwidth settings, underscoring the effectiveness of context-enriched foreground sharing.
[97] Advances in 4D Representation: Geometry, Motion, and Interaction
Mingrui Zhao,Sauradip Nag,Kai Wang,Aditya Vora,Guangda Ji,Peter Chun,Ali Mahdavi-Amiri,Hao Zhang
Main category: cs.CV
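TL;DR: This paper surveys recent progress in 4D generation and reconstruction, analyzing 4D representations from the three core perspectives of geometry, motion, and interaction, with the aim of guiding readers in selecting and customizing representations for their tasks.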
Details
Motivation: With rapid progress in neural fields, geometric and motion deep learning, and 3D generative AI, 4D generation and reconstruction has become a fast-evolving research direction, yet it lacks a systematic analysis centered on 4D representations.
Method: The survey is selective, focusing on representative works; it classifies and compares 4D representations around the three pillars of geometry, motion, and interaction, discusses both mainstream and under-explored representations, and covers datasets as well as the applications and limitations of LLMs and video foundation models.
Result: The survey organizes mainstream 4D representations, including NeRF and 3DGS, and their suitable scenarios, points out the potential of directions such as structured models and long-range motion, and summarizes the state and shortcomings of existing 4D datasets.
Conclusion: The survey gives researchers a guiding framework for selecting and customizing 4D representations according to task requirements and calls for attention to scalability, data efficiency, and integration with large models.
Abstract: We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics whose developments have been propelled by recent advances in neural fields, geometric and motion deep learning, as well as 3D generative artificial intelligence (GenAI). While our survey is not the first of its kind, we build our coverage of the domain from a unique and distinctive perspective of 4D representations, to model 3D geometry evolving over time while exhibiting motion and interaction. Specifically, instead of offering an exhaustive enumeration of many works, we take a more selective approach by focusing on representative works to highlight both the desirable properties and ensuing challenges of each representation under different computation, application, and data scenarios. The main take-away message we aim to convey to the readers is on how to select and then customize the appropriate 4D representations for their tasks. Organizationally, we separate the 4D representations based on three key pillars: geometry, motion, and interaction. Our discourse will not only encompass the most popular representations of today, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS), but also bring attention to relatively under-explored representations in the 4D context, such as structured models and long-range motions. Throughout our survey, we will reprise the role of large language models (LLMs) and video foundational models (VFMs) in a variety of 4D applications, while steering our discussion towards their current limitations and how they can be addressed.
We also provide dedicated coverage of which 4D datasets are currently available, as well as what is lacking, in driving the subfield forward. Project page: https://mingrui-zhao.github.io/4DRep-GMI/
[98] SCEESR: Semantic-Control Edge Enhancement for Diffusion-Based Super-Resolution
Yun Kai Zhuang
Main category: cs.CV
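TL;DR: Proposes a one-step diffusion model that combines a ControlNet mechanism with a hybrid loss for real-world image super-resolution, markedly improving structural integrity and visual quality while keeping inference efficient.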
Details
Motivation: Existing one-step diffusion models suffer from structural distortion in real-world image super-resolution, and generation quality trades off against computational cost, so a method is needed that stays efficient while improving structural accuracy.
Method: A ControlNet mechanism introduces semantic edge guidance, combined with L2, LPIPS, and edge-aware AME losses, to achieve dynamic structural control and multi-objective optimization within a one-step diffusion process.
Result: Experiments show that the method improves the structural integrity and realism of reconstructed images on the test datasets while retaining fast inference, giving a better balance between quality and speed.
Conclusion: Through edge guidance and a hybrid loss strategy, the proposed method overcomes the structural defects of one-step diffusion models in real-world image super-resolution and provides an effective solution for efficient, high-quality super-resolution.
Abstract: Real-world image super-resolution (Real-ISR) must handle complex degradations and inherent reconstruction ambiguities. While generative models have improved perceptual quality, a key trade-off remains with computational cost. One-step diffusion models offer speed but often produce structural inaccuracies due to distillation artifacts. To address this, we propose a novel SR framework that enhances a one-step diffusion model using a ControlNet mechanism for semantic edge guidance. This integrates edge information to provide dynamic structural control during single-pass inference. We also introduce a hybrid loss combining L2, LPIPS, and an edge-aware AME loss to optimize for pixel accuracy, perceptual quality, and geometric precision. Experiments show our method effectively improves structural integrity and realism while maintaining the efficiency of one-step generation, achieving a superior balance between output quality and inference speed. Results on the test datasets will be published at https://drive.google.com/drive/folders/1amddXQ5orIyjbxHgGpzqFHZ6KTolinJF?usp=drive_link and the related code will be published at https://github.com/ARBEZ-ZEBRA/SCEESR.
[99] MobiAct: Efficient MAV Action Recognition Using MobileNetV4 with Contrastive Learning and Knowledge Distillation
Zhang Nengbo,Ho Hann Woei
Main category: cs.CV
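TL;DR: This paper proposes MobiAct, a lightweight action recognition framework for Micro Air Vehicles (MAVs) that uses MobileNetV4 as the backbone and combines Stage-wise Orthogonal Knowledge Distillation (SOKD), a parameter-free attention mechanism, and a hybrid-loss training strategy to cut computation and energy use substantially while maintaining high accuracy.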
Details
Motivation: Existing MAV action recognition methods usually depend on computationally intensive models that cannot deliver real-time perception and coordination on resource-limited platforms, so a lightweight solution that balances accuracy and efficiency is needed.
Method: MobileNetV4 serves as the backbone; Stage-wise Orthogonal Knowledge Distillation (SOKD) transfers motion features from a ResNet18 teacher network to the student network; a parameter-free attention mechanism improves accuracy without adding complexity; a hybrid loss stabilizes training.
Result: On three self-collected datasets, MobiAct reaches an average recognition accuracy of 92.12% while consuming only 136.16 pJ of energy and processing 8.84 actions per second; it decodes actions up to 2 times faster than the leading method with comparable accuracy.
Conclusion: MobiAct keeps recognition accuracy high while sharply reducing computational cost and energy use and increasing inference speed, making it suitable for real-time perception and coordination in resource-limited MAV swarms.
Abstract: Accurate and efficient recognition of Micro Air Vehicle (MAV) motion is essential for enabling real-time perception and coordination in autonomous aerial swarms. However, most existing approaches rely on large, computationally intensive models that are unsuitable for resource-limited MAV platforms, which results in a trade-off between recognition accuracy and inference speed. To address these challenges, this paper proposes a lightweight MAV action recognition framework, MobiAct, designed to achieve high accuracy with low computational cost. Specifically, MobiAct adopts MobileNetV4 as the backbone network and introduces a Stage-wise Orthogonal Knowledge Distillation (SOKD) strategy to effectively transfer MAV motion features from a teacher network (ResNet18) to a student network, thereby enhancing knowledge transfer efficiency. Furthermore, a parameter-free attention mechanism is integrated into the architecture to improve recognition accuracy without increasing model complexity. In addition, a hybrid loss training strategy is developed to combine multiple loss objectives, which ensures stable and robust optimization during training. Experimental results demonstrate that the proposed MobiAct achieves low-energy and low-computation MAV action recognition, while maintaining the fastest action decoding speed among compared methods. Across all three self-collected datasets, MobiAct achieves an average recognition accuracy of 92.12%, while consuming only 136.16 pJ of energy and processing recognition at a rate of 8.84 actions per second.
Notably, MobiAct decodes actions up to 2 times faster than the leading method, with highly comparable recognition accuracy, highlighting its superior efficiency in MAV action recognition.
[100] D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation
Nobline Yoo,Olga Russakovsky,Ye Zhu
Main category: cs.CV
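TL;DR: Proposes the Detector-to-Differentiable (D2D) framework, which turns non-differentiable detection models into differentiable critics to improve the accuracy of object counts in text-to-image generation.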
Details
Motivation: Existing methods are restricted to differentiable, regression-based counting networks and cannot exploit better-performing non-differentiable detection models for precise counting.
Method: Custom activation functions convert detector logits into soft binary indicators, which are used to optimize the noise prior of pre-trained T2I models at inference time.
Result: Across multiple benchmarks on SDXL-Turbo, SD-Turbo, and Pixart-DMD, counting accuracy improves substantially (by up to 13.7%), with minimal impact on image quality and computational overhead.
Conclusion: The D2D framework combines the strong counting ability of detection models with the generative ability of diffusion models and effectively improves numerical consistency in text-to-image generation.
Abstract: Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide numeracy generation. Specifically, we design custom activation functions to convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models. Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., boosting up to 13.7% on D2D-Small, a 400-prompt, low-density benchmark), with minimal degradation in overall image quality and computational overhead.
[101] Enhancing Early Alzheimer Disease Detection through Big Data and Ensemble Few-Shot Learning
Safa Ben Atitallah,Maha Driss,Wadii Boulila,Anis Koubaa
Main category: cs.CV
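TL;DR: Proposes a few-shot ensemble method based on Prototypical Networks and pre-trained CNNs for Alzheimer disease detection, reaching 99.72% and 99.86% accuracy on the Kaggle and ADNI datasets respectively.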
Details
Motivation: Scarce labeled medical data, the complexity of the disease, and data privacy constraints make it hard for existing methods to detect Alzheimer disease accurately, so detection accuracy needs to improve.
Method: A Prototypical Network (ProtoNet) within a few-shot learning framework integrates multiple pre-trained convolutional neural networks as encoders and combines a class-aware loss with an entropy loss to strengthen feature extraction and classification accuracy.
Result: The method reaches 99.72% accuracy on the Kaggle Alzheimer dataset and 99.86% on the ADNI dataset, outperforming existing state-of-the-art methods.
Conclusion: The method markedly improves the classification of Alzheimer disease progression levels and has good practical potential, especially for small-sample and privacy-sensitive medical image analysis.
Abstract: Alzheimer disease is a severe brain disorder that causes harm in various brain areas and leads to memory damage. The limited availability of labeled medical data poses a significant challenge for accurate Alzheimer disease detection. There is a critical need for effective methods to improve the accuracy of Alzheimer disease detection, considering the scarcity of labeled data, the complexity of the disease, and the constraints related to data privacy. To address this challenge, our study leverages the power of big data in the form of pre-trained Convolutional Neural Networks (CNNs) within the framework of Few-Shot Learning (FSL) and ensemble learning. We propose an ensemble approach based on a Prototypical Network (ProtoNet), a powerful method in FSL, integrating various pre-trained CNNs as encoders. This integration enhances the richness of features extracted from medical images. Our approach also includes a combination of class-aware loss and entropy loss to ensure a more precise classification of Alzheimer disease progression levels. The effectiveness of our method was evaluated using two datasets, the Kaggle Alzheimer dataset and the ADNI dataset, achieving an accuracy of 99.72% and 99.86%, respectively. The comparison of our results with relevant state-of-the-art studies demonstrated that our approach achieved superior accuracy and highlighted its validity and potential for real-world applications in early Alzheimer disease detection.
[102] Vision-Based Mistake Analysis in Procedural Activities: A Review of Advances and Challenges
Konstantinos Bacharidis,Antonis A. Argyros
Main category: cs.CV
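TL;DR: This paper reviews vision-based methods for detecting and predicting mistakes in procedural activities, discusses how action recognition, anticipation, and related techniques identify deviations in execution, analyzes existing datasets, evaluation metrics, and state-of-the-art methods, and proposes future research directions.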
Details
Motivation: Mistake analysis in procedural activities matters for industrial automation, rehabilitation, education, and human-robot collaboration, so a systematic summary of vision-based mistake detection methods and their challenges is needed.
Method: The review draws on advances in computer vision for action recognition, anticipation, and activity understanding, categorizes methods by their use of procedural structure, supervision level, and learning strategy, and analyzes existing datasets and evaluation metrics.
Result: The review summarizes the main challenges facing vision-based mistake analysis, such as intra-class variability, viewpoint changes, and compositional activity structure, and identifies open problems including distinguishing permissible variations from true mistakes and modeling error propagation.
Conclusion: The paper offers a unified perspective on vision-based mistake analysis in procedural activities, emphasizes its potential to improve safety, efficiency, and task performance, and highlights neuro-symbolic reasoning and counterfactual modeling as important future directions.
Abstract: Mistake analysis in procedural activities is a critical area of research with applications spanning industrial automation, physical rehabilitation, education and human-robot collaboration. This paper reviews vision-based methods for detecting and predicting mistakes in structured tasks, focusing on procedural and executional errors. By leveraging advancements in computer vision, including action recognition, anticipation and activity understanding, vision-based systems can identify deviations in task execution, such as incorrect sequencing, use of improper techniques, or timing errors. We explore the challenges posed by intra-class variability, viewpoint differences and compositional activity structures, which complicate mistake detection. Additionally, we provide a comprehensive overview of existing datasets, evaluation metrics and state-of-the-art methods, categorizing approaches based on their use of procedural structure, supervision levels and learning strategies. Open challenges, such as distinguishing permissible variations from true mistakes and modeling error propagation are discussed alongside future directions, including neuro-symbolic reasoning and counterfactual state modeling. This work aims to establish a unified perspective on vision-based mistake analysis in procedural activities, highlighting its potential to enhance safety, efficiency and task performance across diverse domains.
[103] Unified Reinforcement and Imitation Learning for Vision-Language Models
Byung-Kwan Lee,Ryo Hachiuma,Yong Man Ro,Yu-Chiang Frank Wang,Yueh-Hua Wu
Main category: cs.CV
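TL;DR: This paper proposes Unified Reinforcement and Imitation Learning (RIL), an efficient training algorithm for building lightweight yet capable vision-language models (VLMs); by combining reinforcement learning with adversarial imitation learning, small student models match or even surpass existing leading models on a range of benchmarks.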
Details
Motivation: Existing vision-language models are usually large and hard to deploy in resource-constrained environments, so an efficient training method is needed to build small, lightweight, high-performing models.
Method: Unified Reinforcement and Imitation Learning (RIL) combines reinforcement learning with adversarial imitation learning, uses an LLM-based discriminator to distinguish student outputs from teacher outputs, and draws on guidance from multiple large teacher VLMs to improve the student model's generative ability.
Result: Experiments on multiple vision-language benchmarks show that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and in some cases surpasses them.
Conclusion: RIL is an effective training framework for lightweight VLMs; its joint reinforcement and imitation strategy lets small models approach or exceed the performance of large models, with good application prospects.
Abstract: Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.
[104] Online Handwritten Signature Verification Based on Temporal-Spatial Graph Attention Transformer
Hai-jie Yuan,Heng Zhang,Fei Yin
Main category: cs.CV
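TL;DR: Proposes the Temporal-Spatial Graph Attention Transformer (TS-GATR) for dynamic signature verification, which combines GAT and GRU to model the spatial and temporal dependencies in signature data and outperforms existing methods on benchmark datasets such as MSDS and DeepSignDB.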
Details
Motivation: Intra-user variability and the risk of forgery make accurate handwritten signature verification challenging, so more effective models are needed to improve verification performance.
Method: Signatures are represented as graphs whose nodes carry dynamic features such as position, velocity, and pressure; a Dual-Graph Attention Transformer (DGATR) module uses k-step and k-nearest neighbor adjacency graphs to model local and global spatial features, and a GRU captures long-term temporal dependencies.
Result: Experiments on the MSDS and DeepSignDB datasets show that TS-GATR achieves lower Equal Error Rates (EER) than current state-of-the-art methods across a range of scenarios.
Conclusion: By combining temporal-spatial attention with recurrent units, TS-GATR improves the accuracy of dynamic signature verification and has strong application potential.
Abstract: Handwritten signature verification is a crucial aspect of identity authentication, with applications in various domains such as finance and e-commerce. However, achieving high accuracy in signature verification remains challenging due to intra-user variability and the risk of forgery. This paper introduces a novel approach for dynamic signature verification: the Temporal-Spatial Graph Attention Transformer (TS-GATR). TS-GATR combines the Graph Attention Network (GAT) and the Gated Recurrent Unit (GRU) to model both spatial and temporal dependencies in signature data. TS-GATR enhances verification performance by representing signatures as graphs, where each node captures dynamic features (e.g. position, velocity, pressure), and by using attention mechanisms to model their complex relationships. The proposed method further employs a Dual-Graph Attention Transformer (DGATR) module, which utilizes k-step and k-nearest neighbor adjacency graphs to model local and global spatial features, respectively. To capture long-term temporal dependencies, the model integrates GRU, thereby enhancing its ability to learn dynamic features during signature verification. Comprehensive experiments conducted on benchmark datasets such as MSDS and DeepSignDB show that TS-GATR surpasses current state-of-the-art approaches, consistently achieving lower Equal Error Rates (EER) across various scenarios.
[105] Seabed-Net: A multi-task network for joint bathymetry estimation and seabed classification from remote sensing imagery in shallow waters
Panagiotis Agrafiotis,Begüm Demir
Main category: cs.CV
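TL;DR: Proposes Seabed-Net, a multi-task deep learning framework that jointly predicts bathymetry and seabed classes in shallow waters from remote sensing imagery; with feature fusion and dynamic loss weighting, it outperforms traditional methods and existing state-of-the-art models on multiple metrics.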
Details
Motivation: Existing methods usually treat bathymetry estimation and seabed classification in isolation, ignoring how the two tasks reinforce each other and limiting the use of deep learning in shallow-water mapping. A unified multi-task framework is therefore needed to improve overall performance. Method: Seabed-Net uses dual-branch encoders for bathymetry estimation and seabed classification, integrates cross-task features through an Attention Feature Fusion module and a windowed Swin-Transformer fusion block, and balances the multi-task loss with dynamic task uncertainty weighting. Result: Experiments at two different coastal sites show that Seabed-Net lowers RMSE by up to 75% relative to traditional models, reduces bathymetric error by a further 10-30% relative to state-of-the-art single-task and multi-task models, improves seabed classification accuracy by up to 8%, and yields better spatial consistency and sharper boundaries. Conclusion: Jointly modeling depth with seabed substrate and habitat types yields synergistic gains, and Seabed-Net offers a robust, open solution for integrated shallow-water mapping. Abstract: Accurate, detailed, and regularly updated bathymetry, coupled with complex semantic content, is essential for under-mapped shallow-water environments facing increasing climatological and anthropogenic pressures. However, existing approaches that derive either depth or seabed classes from remote sensing imagery treat these tasks in isolation, forfeiting the mutual benefits of their interaction and hindering the broader adoption of deep learning methods. To address these limitations, we introduce Seabed-Net, a unified multi-task framework that simultaneously predicts bathymetry and pixel-based seabed classification from remote sensing imagery of various resolutions. Seabed-Net employs dual-branch encoders for bathymetry estimation and pixel-based seabed classification, integrates cross-task features via an Attention Feature Fusion module and a windowed Swin-Transformer fusion block, and balances objectives through dynamic task uncertainty weighting. In extensive evaluations at two heterogeneous coastal sites, it consistently outperforms traditional empirical models and traditional machine learning regression methods, achieving up to 75% lower RMSE. It also reduces bathymetric RMSE by 10-30% compared to state-of-the-art single-task and multi-task baselines and improves seabed classification accuracy by up to 8%. Qualitative analyses further demonstrate enhanced spatial consistency, sharper habitat boundaries, and corrected depth biases in low-contrast regions.
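The Seabed-Net abstract mentions "dynamic task uncertainty weighting" without giving a formula. As a rough illustration only, the sketch below uses one common simplified form of uncertainty-based loss weighting, where each task loss L_i is scaled by exp(-s_i) and a learnable log-variance s_i is added as a penalty. The function name and the example loss values are hypothetical, and the paper's actual formulation may differ.

```python
import math

def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i.

    s_i is a log-variance that would be a learnable parameter in training;
    here it is passed in as a plain float for illustration (assumption).
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("one log-variance is required per task loss")
    return sum(math.exp(-s) * loss + s
               for loss, s in zip(task_losses, log_variances))

# Hypothetical per-batch losses for the two tasks (values are made up).
depth_loss = 0.8           # e.g. a regression loss for bathymetry
classification_loss = 0.3  # e.g. a cross-entropy loss for seabed classes

# With s_i = 0 both tasks get weight 1, so this is the plain sum of losses.
equal = uncertainty_weighted_loss([depth_loss, classification_loss], [0.0, 0.0])

# Raising s for the depth task shrinks its weight and adds the penalty s.
reweighted = uncertainty_weighted_loss([depth_loss, classification_loss], [1.0, 0.0])
print(equal, reweighted)
```

In a real training loop the log-variances would be optimized jointly with the network weights, so the balance between tasks shifts as training progresses.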
These results confirm that jointly modeling depth with both substrate and seabed habitats yields synergistic gains, offering a robust, open solution for integrated shallow-water mapping. Code and pretrained weights are available at https://github.com/pagraf/Seabed-Net.
[106] Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization
Juncheng Wang,Lei Shang,Ziqi Liu,Wang Lu,Xixu Hu,Zhe Hu,Jindong Wang,Shujun Wang
Main category: cs.CV
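TL;DR: Studies how scale shift in crowd localization affects domain generalization, proposes a new benchmark (ScaleBench) and an algorithm (Catto) to mitigate the problem, and reports four significant insights for future research.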
Details
Motivation: Existing methods suffer significant performance degradation because head-scale distributions differ between training and test data (scale shift); the mechanism behind this effect needs to be understood and effective remedies proposed. Method: The authors analyze the influence of scale shift through systematic experiments, build the ScaleBench benchmark, reproduce 20 advanced domain generalization algorithms, and propose the Causal Feature Decomposition and Anisotropic Processing (Catto) algorithm. Result: The study demonstrates the limitations of existing algorithms under scale shift, quantifies the influence of scale shift, and shows that Catto effectively improves domain generalization performance. Conclusion: Scale shift is a key challenge for domain generalization in crowd localization, and the analysis framework, benchmark, and algorithm presented here lay a foundation for this new direction. Abstract: Crowd localization plays a crucial role in visual scene understanding towards predicting each pedestrian location in a crowd, thus being applicable to various downstream tasks. However, existing approaches suffer from significant performance degradation due to discrepancies in head scale distributions (scale shift) between training and testing data, a challenge known as domain generalization (DG). This paper aims to comprehend the nature of scale shift within the context of domain generalization for crowd localization models. To this end, we address four critical questions: (i) How does scale shift influence crowd localization in a DG scenario? (ii) How can we quantify this influence? (iii) What causes this influence? (iv) How can we mitigate the influence? Initially, we conduct a systematic examination of how crowd localization performance varies with different levels of scale shift. Then, we establish a benchmark, ScaleBench, and reproduce 20 advanced DG algorithms to quantify the influence. Through extensive experiments, we demonstrate the limitations of existing algorithms and underscore the importance and complexity of scale shift, a topic that remains insufficiently explored. To deepen our understanding, we provide a rigorous theoretical analysis on scale shift. Building on these insights, we further propose an effective algorithm called Causal Feature Decomposition and Anisotropic Processing (Catto) to mitigate the influence of scale shift in DG settings. Later, we also provide extensive analytical experiments, revealing four significant insights for future research.
Our results emphasize the importance of this novel and applicable research direction, which we term Scale Shift Domain Generalization.
[107] BrainMCLIP: Brain Image Decoding with Multi-Layer feature Fusion of CLIP
Tian Xia,Zihan Ma,Xinlong Wang,Qing Liu,Xiaowei He,Tianming Liu,Yudan Ren
Main category: cs.CV
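TL;DR: Proposes BrainMCLIP, a parameter-efficient multi-layer fusion method that uses CLIP's intermediate-layer features and follows the functional hierarchy of the human visual system to reconstruct high-quality images from fMRI signals without a VAE pipeline, matching or surpassing state-of-the-art methods with 71.7% fewer parameters.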
Details
Motivation: Existing methods typically map fMRI signals only to CLIP's final semantic layer and rely on a parameter-intensive VAE pipeline to recover details, overlooking the rich object information in CLIP's intermediate layers and ignoring the hierarchical nature of brain function. A more efficient decoding framework consistent with neuroscience is therefore needed. Method: BrainMCLIP aligns fMRI signals from functionally distinct visual areas (low-/high-level) to the corresponding intermediate and final CLIP layers, and introduces a Cross-Reconstruction strategy and a multi-granularity loss to fuse multi-layer features without an additional VAE pathway. Result: BrainMCLIP performs strongly on high-level semantic metrics, matching or surpassing SOTA methods that include VAE pipelines, while using 71.7% fewer parameters and capturing visual details missed by conventional CLIP-only methods. Conclusion: By mirroring the functional hierarchy of the human visual system, BrainMCLIP achieves efficient and accurate fMRI-to-image reconstruction, confirms the benefit of combining intermediate-layer features of pretrained models with neuroscience-inspired architectures, and offers a better paradigm for brain signal decoding. Abstract: Decoding images from fMRI often involves mapping brain activity to CLIP's final semantic layer. To capture finer visual details, many approaches add a parameter-intensive VAE-based pipeline. However, these approaches overlook rich object information within CLIP's intermediate layers and contradict the brain's functional hierarchy. We introduce BrainMCLIP, which pioneers a parameter-efficient, multi-layer fusion approach guided by the human visual system's functional hierarchy, eliminating the need for such a separate VAE pathway. BrainMCLIP aligns fMRI signals from functionally distinct visual areas (low-/high-level) to corresponding intermediate and final CLIP layers, respecting functional hierarchy. We further introduce a Cross-Reconstruction strategy and a novel multi-granularity loss. Results show BrainMCLIP achieves highly competitive performance, particularly excelling on high-level semantic metrics where it matches or surpasses state-of-the-art (SOTA) methods, including those using VAE pipelines. Crucially, it achieves this with substantially fewer parameters, demonstrating a reduction of 71.7% compared to top VAE-based SOTA methods, by avoiding the VAE pathway. By leveraging intermediate CLIP features, it effectively captures visual details often missed by CLIP-only approaches, striking a compelling balance between semantic accuracy and detail fidelity without requiring a separate VAE pipeline.
[108] A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP
Ying Dai,Wei Yu Chen
Main category: cs.CV
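TL;DR: Proposes a training-free framework for open-vocabulary image segmentation and object recognition that combines EfficientNetB0 for unsupervised segmentation with CLIP for cross-modal alignment, achieving state-of-the-art performance on several benchmarks.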
Details
Motivation: To achieve open-vocabulary image segmentation and recognition without any training, improving the model's generalization and flexibility. Method: A two-stage pipeline: first, pixel-wise features extracted by EfficientNetB0 are processed with singular value decomposition and hierarchical clustering for unsupervised segmentation; then CLIP's ViT encodes the segmented regions, and cross-modal similarity with predefined text prompts is computed to perform recognition. Result: The method achieves strong Hungarian mIoU, precision, recall, and F1-score on COCO, ADE20K, and PASCAL VOC, indicating effective cross-modal alignment. Conclusion: The proposed method is effective and general, enabling high-quality open-vocabulary image understanding without training. Abstract: This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two-stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition to obtain latent representations, which are then clustered using hierarchical clustering to segment semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP's text encoder from category-specific prompts, including a generic "something else" prompt to support open-set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score.
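The recognition step described above (softmax over similarities between a segment embedding and text-prompt embeddings, with a generic "something else" prompt) can be sketched as follows. This is a minimal illustration under stated assumptions: the three-dimensional embeddings are toy values rather than CLIP outputs, cosine similarity and the temperature of 0.07 are assumed choices, the SVD projection step is omitted, and `recognize` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recognize(segment_embedding, text_embeddings, temperature=0.07):
    """Return (best_label, probabilities) from a softmax over similarities."""
    labels = list(text_embeddings)
    logits = [cosine(segment_embedding, text_embeddings[label]) / temperature
              for label in labels]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs

# Toy prompt embeddings; the last entry plays the role of the generic
# "something else" prompt that absorbs segments matching no category.
prompts = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "something else": [0.0, 0.0, 1.0],
}

label, probs = recognize([0.9, 0.1, 0.2], prompts)
print(label, probs)
```

A segment whose embedding is closest to the "something else" prompt is labeled as such, which is how an extra prompt of this kind can support open-set behavior.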
These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
[109] DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents
Kai Shi,Jun Yang,Ni Yang,Binqiang Pan,Qingsong Xie,Chao Zhang,Zhenyu Yang,Tianhuang Su,Haonan Lu
Main category: cs.CV
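TL;DR: Proposes DaMo, a data mixing optimizer that improves multimodal large language models for multi-task mobile phone agents by using a trainable network to predict the optimal data mixing ratios, and achieves notable gains on the newly proposed PhoneAgentBench benchmark.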
Details
Motivation: Existing multitask supervised fine-tuning methods struggle to determine the optimal composition of training data, which limits the performance of multimodal large language models in multi-task mobile scenarios. Method: DaMo uses a trainable network to predict downstream task performance for a given data ratio and thereby optimizes the data mixture; PhoneAgentBench is built as the first evaluation benchmark for multimodal mobile phone tasks. Result: DaMo improves on other methods by 3.38% on PhoneAgentBench, by 2.57% on average across benchmarks such as BFCL-v3, and by 12.47% when optimizing for BFCL-v3 alone, and it shows good scalability and generalization. Conclusion: DaMo effectively optimizes the data mixing strategy in multitask training, significantly improves MLLM performance on mobile agent tasks, and is general and practical. Abstract: Mobile Phone Agents (MPAs) have emerged as a promising research direction due to their broad applicability across diverse scenarios. While Multimodal Large Language Models (MLLMs) serve as the foundation for MPAs, their effectiveness in handling multiple mobile phone tasks simultaneously remains limited. Although multitask supervised fine-tuning (SFT) is widely adopted for multitask learning, existing approaches struggle to determine optimal training data compositions for peak performance. To address this challenge, we propose DaMo (Data Mixture Optimizer) - a novel solution employing a trainable network that predicts optimal data mixtures by forecasting downstream task performance for any given dataset ratio. To support comprehensive evaluation, we introduce PhoneAgentBench, the first specialized benchmark to evaluate MLLMs on multimodal mobile phone tasks, comprising 1235 QA pairs spanning diverse real-world industrial mobile application scenarios. Demonstrating strong predictive capability (R^2=0.81) in small-scale pilot experiments, DaMo efficiently extrapolates optimal data mixing configurations. Our results show DaMo achieves a 3.38% performance improvement on PhoneAgentBench compared to alternative methods. Furthermore, extensive experiments across established benchmarks including BFCL-v3, MME-Reasoning, MME-Perception, and OCRBench reveal DaMo's superior generalization, outperforming other approaches by 2.57% in terms of average score. When used solely for MLLM optimization on the BFCL-v3 task, DaMo improves the metrics by 12.47% over other methods.
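The idea of choosing a data mixture by querying a performance predictor can be sketched as a search over candidate ratios. The code below is only an illustration under stated assumptions: `toy_predictor` is a made-up stand-in for DaMo's trained network, the exhaustive grid search is an assumed search strategy rather than the paper's, and all function names are hypothetical.

```python
from itertools import product

def simplex_grid(num_datasets, steps):
    """Yield every ratio tuple with entries k/steps that sums to 1."""
    for counts in product(range(steps + 1), repeat=num_datasets):
        if sum(counts) == steps:
            yield tuple(c / steps for c in counts)

def toy_predictor(ratios):
    """Stand-in for a learned performance predictor (purely illustrative).

    It scores a mixture by its closeness to an arbitrary target mixture.
    """
    target = (0.5, 0.3, 0.2)
    return 1.0 - sum((r - t) ** 2 for r, t in zip(ratios, target))

def best_mixture(predictor, num_datasets, steps=10):
    """Return the candidate mixture with the highest predicted score."""
    return max(simplex_grid(num_datasets, steps), key=predictor)

# Search mixtures of three hypothetical datasets in steps of 10%.
print(best_mixture(toy_predictor, num_datasets=3))
```

The appeal of a predictor-based approach is that each candidate is scored by a cheap forward pass instead of a full fine-tuning run, so many mixtures can be compared.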
Notably, DaMo maintains robust scalability, preserving its effectiveness when applied to other model architectures. The code and dataset are available at https://github.com/OPPO-Mente-Lab/DaMo.git
[110] DARE: A Deformable Adaptive Regularization Estimator for Learning-Based Medical Image Registration
Ahsan Raza Siyal,Markus Haltmeier,Ruth Steiger,Malik Galijasevic,Elke Ruth Gizewski,Astrid Ellen Grams
Main category: cs.CV
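TL;DR: Proposes DARE, a deformable medical image registration framework that dynamically adjusts elastic regularization to improve the robustness and anatomical plausibility of registration.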
Details
Motivation: Existing deep learning methods for medical image registration overlook the critical role of regularization in ensuring robustness and anatomical plausibility. Method: DARE dynamically adjusts elastic regularization based on the gradient norm of the deformation field, combines strain and shear energy terms, and adds a folding-prevention mechanism that penalizes regions where the Jacobian determinant is negative. Result: The method reduces non-physical artifacts such as folding, avoids over-smoothing, and improves registration accuracy and anatomical plausibility. Conclusion: DARE achieves more stable and flexible medical image registration while keeping transformations physically realistic, outperforming traditional methods. Abstract: Deformable medical image registration is a fundamental task in medical image analysis. While deep learning-based methods have demonstrated superior accuracy and computational efficiency compared to traditional techniques, they often overlook the critical role of regularization in ensuring robustness and anatomical plausibility. We propose DARE (Deformable Adaptive Regularization Estimator), a novel registration framework that dynamically adjusts elastic regularization based on the gradient norm of the deformation field. Our approach integrates strain and shear energy terms, which are adaptively modulated to balance stability and flexibility. To ensure physically realistic transformations, DARE includes a folding-prevention mechanism that penalizes regions with negative deformation Jacobian. This strategy mitigates non-physical artifacts such as folding, avoids over-smoothing, and improves both registration accuracy and anatomical plausibility.
[111] AegisRF: Adversarial Perturbations Guided with Sensitivity for Protecting Intellectual Property of Neural Radiance Fields
Woo Jae Kim,Kyu Beom Han,Yoonki Cho,Youngju Na,Junsik Jung,Sooel Son,Sung-eui Yoon
Main category: cs.CV
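TL;DR: Proposes AegisRF, a framework that uses a learnable sensitivity field to adaptively constrain geometric perturbations, protecting the intellectual property of NeRFs while preserving high rendering quality.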
Details
Motivation: To protect the intellectual property of NeRFs from unauthorized use while avoiding the scene deformation and loss of rendering quality that geometric perturbations can cause. Method: A learnable sensitivity field quantifies the effect of geometric perturbations on rendering quality. AegisRF consists of a Perturbation Field, which injects adversarial perturbations to disrupt unauthorized models, and a Sensitivity Field, which adaptively constrains the perturbations to preserve visual fidelity. Result: Experiments show that AegisRF is broadly applicable across downstream tasks such as multi-view image classification and voxel-based 3D localization while maintaining high visual quality. Conclusion: AegisRF effectively balances NeRF intellectual property protection against rendering quality and supports general defense across tasks and modalities. Abstract: As Neural Radiance Fields (NeRFs) have emerged as a powerful tool for 3D scene representation and novel view synthesis, protecting their intellectual property (IP) from unauthorized use is becoming increasingly crucial. In this work, we aim to protect the IP of NeRFs by injecting adversarial perturbations that disrupt their unauthorized applications. However, perturbing the 3D geometry of NeRFs can easily deform the underlying scene structure and thus substantially degrade the rendering quality, which has led existing attempts to avoid geometric perturbations or restrict them to explicit spaces like meshes. To overcome this limitation, we introduce a learnable sensitivity to quantify the spatially varying impact of geometric perturbations on rendering quality. Building upon this, we propose AegisRF, a novel framework that consists of a Perturbation Field, which injects adversarial perturbations into the pre-rendering outputs (color and volume density) of NeRF models to fool an unauthorized downstream target model, and a Sensitivity Field, which learns the sensitivity to adaptively constrain geometric perturbations, preserving rendering quality while disrupting unauthorized use. Our experimental evaluations demonstrate the generalized applicability of AegisRF across diverse downstream tasks and modalities, including multi-view image classification and voxel-based 3D localization, while maintaining high visual fidelity. Codes are available at https://github.com/wkim97/AegisRF.
[112] Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
Zhiyuan Feng,Zhaolu Kang,Qijie Wang,Zhiying Du,Jiongrui Yan,Shubin Shi,Chengbo Yuan,Huizhi Liang,Yu Deng,Qixiu Li,Rushuai Yang,Arctanx An,Leqi Zheng,Weijie Wang,Shawn Chen,Sicheng Xu,Yaobo Liang,Jiaolong Yang,Baining Guo
Main category: cs.CV
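TL;DR: Presents MV-RoboBench, a benchmark for evaluating the multi-view spatial reasoning of vision-language models (VLMs) in robotic manipulation, comprising 1.7k manually curated QA items across eight subtasks in two categories: spatial understanding and robotic execution.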
Details
Motivation: Existing VLM evaluations focus mainly on single-view settings and do not examine the ability to integrate multi-view information, even though multi-camera configurations are increasingly common on robotic platforms; an evaluation of VLMs' multi-view robotic spatial reasoning is therefore needed. Method: The authors build MV-RoboBench, a multi-view robotic manipulation benchmark with 1.7k QA samples across eight subtasks in two main categories (spatial understanding and robotic execution), and systematically evaluate a range of open-source and closed-source VLMs as well as enhanced versions using CoT techniques. Result: State-of-the-art VLMs remain far below human performance; spatial intelligence and robotic task execution are positively correlated; and models that perform well on general single-view benchmarks do not necessarily perform well on multi-view robotic tasks. Conclusion: MV-RoboBench exposes the limitations of existing VLMs in multi-view robotic scenes, underscores the need for spatially aware VLMs and VLAs, and is released as an open resource to advance the field. Abstract: Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception.
Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.
[113] Multi-Camera Worker Tracking in Logistics Warehouse Considering Wide-Angle Distortion
Yuki Mori,Kazuma Kano,Yusuke Asai,Shin Katayama,Kenta Urano,Takuro Yonezawa,Nobuo Kawaguchi
Main category: cs.CV
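TL;DR: Proposes a method for accurately tracking worker positions in a logistics warehouse using 19 wide-angle cameras; aligning detections by foot position mitigates image distortion and improves tracking accuracy by more than 20%.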
Details
Motivation: With the growth of e-commerce, improving the efficiency of warehouse operations is essential, and digital twin technology requires accurate real-time positions of workers. A single camera has a limited field of view, so a multi-camera system is needed for full coverage. Method: Nineteen downward-facing wide-angle cameras are installed on the warehouse ceiling. Camera coordinates are aligned with real-world positions through floor-based calibration, and worker positions are aligned across cameras using foot positions to reduce errors caused by distortion at the edges of wide-angle lenses. Several ways of using appearance features are also compared. Result: The method reduces the effect of image distortion, achieves accurate position alignment across cameras, improves tracking accuracy by more than 20%, and validates the effectiveness of different appearance-feature fusion strategies. Conclusion: A multi-camera wide-angle system with foot-position alignment effectively improves the accuracy of people tracking in warehouses and provides reliable position data for digital twin applications in logistics. Abstract: With the spread of e-commerce, the logistics market is growing around the world. Therefore, improving the efficiency of warehouse operations is essential. To achieve this, various approaches have been explored, and among them, the use of digital twins is gaining attention. To make this approach possible, it is necessary to accurately collect the positions of workers in a warehouse and reflect them in a virtual space. However, a single camera has limitations in its field of view; therefore, sensing with multiple cameras is necessary. In this study, we explored a method to track workers using 19 wide-angle cameras installed on the ceiling, looking down at the floor of the logistics warehouse. To understand the relationship between the camera coordinates and the actual positions in the warehouse, we performed alignment based on the floor surface. However, due to the characteristics of wide-angle cameras, significant distortion occurs at the edges of the image, particularly in the vertical direction. To address this, the detected worker positions from each camera were aligned based on foot positions, reducing the effects of image distortion, and enabling accurate position alignment across cameras. As a result, we confirmed an improvement of over 20% in tracking accuracy. Furthermore, we compared multiple methods for utilizing appearance features and validated the effectiveness of the proposed approach.
[114] Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis
Xueqi Ma,Yanbei Jiang,Sarah Erfani,James Bailey,Weifeng Liu,Krista A. Ehinger,Jey Han Lau
Main category: cs.CV
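TL;DR: Proposes PICK, a multi-step framework for psychological image analysis with multimodal large language models, targeting the House-Tree-Person (HTP) test; it extracts psychological features from drawings and produces expert-level assessments.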
Details
Motivation: Although multimodal large language models perform well on objective perception tasks, their application to subjective, emotionally rich psychological analysis is limited, so a framework that combines expert knowledge with hierarchical visual analysis is needed. Method: Drawings containing multiple objects are decomposed into semantic sub-drawings, forming a three-level structure (single-object, multi-object, and whole). An HTP knowledge base is introduced, and a feature extraction module trained with reinforcement learning generates a psychological profile. Visual cues are analyzed level by level, and the multi-level information is fused into a comprehensive assessment. Result: Experiments show that PICK significantly improves the performance of multimodal large language models on psychological analysis and confirm its effectiveness as a general framework on extended tasks such as emotion understanding. Conclusion: PICK bridges the gap between multimodal large language models and professional psychological assessment, providing a structured, interpretable framework for understanding human mental states through visual expression. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning.
Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.
[115] Exploring "Many in Few" and "Few in Many" Properties in Long-Tailed, Highly-Imbalanced IC Defect Classification
Hao-Chiang Shao,Chun-Hao Chang,Yu-Hsien Lin,Chia-Wen Lin,Shao-Yun Fang,Yan-Hsiu Liu
Main category: cs.CV
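TL;DR: Proposes ReCAME-Net, a new method for highly imbalanced IC defect classification, and releases IC-Defect-14, a large-scale dataset from real production settings, addressing the challenges of large intra-class diversity and high inter-class similarity.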
Details
Motivation: Existing methods for imbalanced data classification perform poorly on real-world IC defect inspection, mainly because real data distributions are more extreme and features more complex, and because datasets and effective models that reflect real production lines are lacking. Method: ReCAME-Net adopts a multi-expert classifier framework and combines a regional channel attention module, metric learning losses, a hard category mining strategy, and knowledge distillation to improve classification of highly imbalanced, complex IC defects. Result: On the newly released IC-Defect-14 dataset, ReCAME-Net significantly outperforms existing state-of-the-art models while remaining competitive on general public datasets. Conclusion: ReCAME-Net effectively handles the high imbalance, intra-class diversity, and inter-class similarity of real-world IC defect classification, and shows practical value and generalization ability. Abstract: Despite significant advancements in deep classification techniques and in-lab automatic optical inspection models for long-tailed or highly imbalanced data, applying these approaches to real-world IC defect classification tasks remains challenging. This difficulty stems from two primary factors. First, real-world conditions, such as the high yield-rate requirements in the IC industry, result in data distributions that are far more skewed than those found in general public imbalanced datasets. Consequently, classifiers designed for open imbalanced datasets often fail to perform effectively in real-world scenarios. Second, real-world samples exhibit a mix of class-specific attributes and class-agnostic, domain-related features. This complexity adds significant difficulty to the classification process, particularly for highly imbalanced datasets. To address these challenges, this paper introduces the IC-Defect-14 dataset, a large, highly imbalanced IC defect image dataset sourced from AOI systems deployed in real-world IC production lines. This dataset is characterized by its unique "intra-class clusters" property, which presents two major challenges: large intra-class diversity and high inter-class similarity. These characteristics, rarely found simultaneously in existing public datasets, significantly degrade the performance of current state-of-the-art classifiers for highly imbalanced data. To tackle this challenge, we propose ReCAME-Net, which follows a multi-expert classifier framework and integrates a regional channel attention module, metric learning losses, a hard category mining strategy, and a knowledge distillation procedure.
Extensive experimental evaluations demonstrate that ReCAME-Net outperforms previous state-of-the-art models on the IC-Defect-14 dataset while maintaining comparable performance and competitiveness on general public datasets.
[116] PCP-GAN: Property-Constrained Pore-scale image reconstruction via conditional Generative Adversarial Networks
Ali Sadeghkhani,Brandon Bennett,Masoud Babaei,Arash Rabbani
Main category: cs.CV
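TL;DR: Proposes a multi-conditional generative adversarial network (cGAN) framework that generates representative pore-scale images with precisely controlled porosity and depth parameters, substantially improving representativeness and data availability for subsurface characterization.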
Details
Motivation: Natural spatial heterogeneity causes extracted sub-images to deviate significantly from core measurements, and physical samples are scarce, making it difficult to obtain pore-scale images that truly represent formation properties. Method: A multi-conditional generative adversarial network (cGAN) is trained on thin-section samples from four depths of a carbonate formation, conditioned simultaneously on porosity and depth, so that a single model captures both universal pore-network principles and depth-specific geological characteristics. Result: The model achieves excellent porosity control across all formations (R^2=0.95) with mean absolute errors of 0.0099-0.0197; morphological validation shows that statistical differences in key pore-network characteristics (such as average pore radius, specific surface area, and tortuosity) stay within acceptable geological tolerances; and generated images have dual-constraint errors of 1.9-11.3%, far below the 36.4-578% of real sub-images. Conclusion: The method efficiently generates representative pore-scale images, overcoming data scarcity and representativeness bias, and provides a transformative tool for digital rock physics in areas such as carbon storage, geothermal energy, and groundwater management. Abstract: Obtaining truly representative pore-scale images that match bulk formation properties remains a fundamental challenge in subsurface characterization, as natural spatial heterogeneity causes extracted sub-images to deviate significantly from core-measured values. This challenge is compounded by data scarcity, where physical samples are only available at sparse well locations. This study presents a multi-conditional Generative Adversarial Network (cGAN) framework that generates representative pore-scale images with precisely controlled properties, addressing both the representativeness challenge and data availability constraints. The framework was trained on thin section samples from four depths (1879.50-1943.50 m) of a carbonate formation, simultaneously conditioning on porosity values and depth parameters within a single unified model. This approach captures both universal pore network principles and depth-specific geological characteristics, from grainstone fabrics with interparticle-intercrystalline porosity to crystalline textures with anhydrite inclusions. The model achieved exceptional porosity control (R^2=0.95) across all formations with mean absolute errors of 0.0099-0.0197. Morphological validation confirmed preservation of critical pore network characteristics including average pore radius, specific surface area, and tortuosity, with statistical differences remaining within acceptable geological tolerances.
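The porosity-control figures above compare a target porosity with the porosity measured on each generated image. A minimal sketch of that kind of check is given below, assuming a binary image in which pore pixels are encoded as 1 and solid pixels as 0 (an assumption, as are the function names and the toy data; the paper's own measurement pipeline is not described in the abstract).

```python
def porosity(binary_image):
    """Fraction of pore pixels in a 2D binary image (pore = 1, solid = 0)."""
    total_pixels = sum(len(row) for row in binary_image)
    pore_pixels = sum(sum(row) for row in binary_image)
    return pore_pixels / total_pixels

def mean_absolute_error(targets, measured):
    """Average absolute gap between target and measured porosity values."""
    return sum(abs(t - m) for t, m in zip(targets, measured)) / len(targets)

# Toy 4x4 "generated image" with 4 pore pixels out of 16.
image = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]

measured = porosity(image)
print(measured, mean_absolute_error([0.30], [measured]))
```

Applied over a batch of generated images, the same two functions would give a mean absolute error of the kind reported for porosity control.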
Most significantly, generated images demonstrated superior representativeness with dual-constraint errors of 1.9-11.3% compared to 36.4-578% for randomly extracted real sub-images. This capability provides transformative tools for subsurface characterization, particularly valuable for carbon storage, geothermal energy, and groundwater management applications where knowing the representative morphology of the pore space is critical for implementing digital rock physics.
[117] Predicting before Reconstruction: A generative prior framework for MRI acceleration
Juhyung Park,Rokgi Hong,Roh-Eul Yoo,Jaehyeon Koo,Se Young Chun,Seung Hong Choi,Jongho Lee
Main category: cs.CV
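TL;DR: Proposes a new generative-model-based paradigm of predictive imaging, in which a predicted target-contrast image serves as a prior to substantially accelerate magnetic resonance imaging (MRI) reconstruction.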
Details
Motivation: Long MRI scan times limit clinical throughput, and existing reconstruction methods rely on under-sampled data, which degrades image quality; a new method is needed that uses multi-source information to improve both reconstruction speed and quality. Method: A generative-model framework, conditioned on other contrast images such as T1w and T2w, previously scanned images, acquisition parameters, and patient information, predicts the target-contrast image as a data-driven prior that guides reconstruction of highly under-sampled MRI data. Result: Validated on internal and multiple public datasets (14,921 scans and 1,051,904 slices in total), the method significantly outperforms conventional and other prior-based methods at high acceleration factors of x4, x8, and x12, notably in FLAIR and T1w reconstruction. Conclusion: The study marks a shift from conventional image reconstruction to predictive imaging; by introducing a prior predicted by a generative model, it improves MRI reconstruction speed and quality and has broad potential for clinical use. Abstract: Recent advancements in artificial intelligence have created transformative capabilities in image synthesis and generation, enabling diverse research fields to innovate at revolutionary speed and spectrum. In this study, we leverage this generative power to introduce a new paradigm for accelerating Magnetic Resonance Imaging (MRI), introducing a shift from image reconstruction to proactive predictive imaging. Despite being a cornerstone of modern patient care, MRI's lengthy acquisition times limit clinical throughput. Our novel framework addresses this challenge by first predicting a target contrast image, which then serves as a data-driven prior for reconstructing highly under-sampled data. This informative prior is predicted by a generative model conditioned on diverse data sources, such as other contrast images, previously scanned images, acquisition parameters, and patient information. We demonstrate this approach with two key applications: (1) reconstructing FLAIR images using predictions from T1w and/or T2w scans, and (2) reconstructing T1w images using predictions from previously acquired T1w scans. The framework was evaluated on internal and multiple public datasets (total 14,921 scans; 1,051,904 slices), including multi-channel k-space data, for a range of high acceleration factors (x4, x8 and x12). The results demonstrate that our prediction-prior reconstruction method significantly outperforms other approaches, including those with alternative or no prior information.
Through this framework we introduce a fundamental shift from image reconstruction towards a new paradigm of predictive imaging.
[118] PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation
Zhuoyang Xie,Yibo Zhao,Hui Huang,Riwei Wang,Zan Gao
Main category: cs.CV
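TL;DR: Proposes PRGCN, a new framework that improves monocular-video 3D human pose estimation through cross-sequence pattern reuse and a graph memory bank, reaching a new state of the art on the Human3.6M and MPI-INF-3DHP datasets.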
Details
Motivation: Existing methods process each sequence in isolation when estimating 3D human pose and fail to exploit the structural regularities and repetitive motion patterns that pervade human movement, which makes depth ambiguity hard to resolve. Method: The Pattern Reuse Graph Convolutional Network (PRGCN) builds a graph memory bank that stores pose prototypes (encoded as relational graphs), retrieves them dynamically through an attention mechanism, and fuses them adaptively with anatomical constraints; a dual-stream hybrid architecture combining Mamba and self-attention extracts robust spatiotemporal features. Result: MPJPE is 37.1 mm on Human3.6M and 13.4 mm on MPI-INF-3DHP, both state of the art, with stronger cross-domain generalization. Conclusion: Cross-sequence pattern reuse is a key mechanism for improving 3D pose estimation, and the field should shift from per-sequence optimization to a paradigm of cumulative knowledge learning. Abstract: Monocular 3D human pose estimation remains a fundamentally ill-posed inverse problem due to the inherent depth ambiguity in 2D-to-3D lifting. While contemporary video-based methods leverage temporal context to enhance spatial reasoning, they operate under a critical paradigm limitation: processing each sequence in isolation, thereby failing to exploit the strong structural regularities and repetitive motion patterns that pervade human movement across sequences. This work introduces the Pattern Reuse Graph Convolutional Network (PRGCN), a novel framework that formalizes pose estimation as a problem of pattern retrieval and adaptation. At its core, PRGCN features a graph memory bank that learns and stores a compact set of pose prototypes, encoded as relational graphs, which are dynamically retrieved via an attention mechanism to provide structured priors. These priors are adaptively fused with hard-coded anatomical constraints through a memory-driven graph convolution, ensuring geometrical plausibility. To underpin this retrieval process with robust spatiotemporal features, we design a dual-stream hybrid architecture that synergistically combines the linear-complexity, local temporal modeling of Mamba-based state-space models with the global relational capacity of self-attention. Extensive evaluations on Human3.6M and MPI-INF-3DHP benchmarks demonstrate that PRGCN establishes a new state-of-the-art, achieving an MPJPE of 37.1 mm and 13.4 mm, respectively, while exhibiting enhanced cross-domain generalization capability.
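The MPJPE values quoted above refer to Mean Per-Joint Position Error, the average Euclidean distance between predicted and ground-truth 3D joint positions. A minimal sketch of the metric for a single pose follows; the two-joint example is made up, and benchmark protocols typically add steps such as root-joint alignment that are omitted here.

```python
import math

def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between corresponding 3D joints of one pose."""
    if len(predicted) != len(ground_truth):
        raise ValueError("poses must have the same number of joints")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)

# Toy pose with two joints, coordinates in millimetres (values are made up).
predicted_pose = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
true_pose = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

print(mpjpe(predicted_pose, true_pose))
```

Dataset-level scores are usually obtained by averaging this per-pose value over all frames in the evaluation set.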
Our work posits that the long-overlooked mechanism of cross-sequence pattern reuse is pivotal to advancing the field, shifting the paradigm from per-sequence optimization towards cumulative knowledge learning.
[119] Mitigating representation bias caused by missing pixels in methane plume detection
Julia Wąsala,Joannes D. Maasakkers,Ilse Aben,Rochelle Schneider,Holger Hoos,Mitra Baratchi
Main category: cs.CV
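TL;DR: This study examines how systematically missing pixels in satellite images (caused by clouds, among other factors) affect methane plume detection models. It finds that a spurious association between the number of missing values and the label leads models to miss plumes in low-coverage images. Evaluating several imputation methods and proposing a weighted resampling strategy reduces this representation bias, and an operational scenario confirms that the debiased models detect plumes in low-coverage images more effectively.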
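The abstract of this entry describes a weighted resampling scheme that enforces class balance in each coverage bin. A minimal sketch of that idea follows; the function name `coverage_balanced_weights`, the equal-width bins, and the choice to keep each bin's total weight proportional to its size are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def coverage_balanced_weights(labels, coverage, n_bins=5):
    """Sampling weights that balance classes inside each coverage bin.

    labels:   class label per image (e.g. 0 = no plume, 1 = plume)
    coverage: fraction of valid pixels per image, in [0, 1]
    """
    labels = np.asarray(labels)
    coverage = np.asarray(coverage, dtype=float)
    # Equal-width bins over [0, 1]; interior edges give bin ids 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.digitize(coverage, edges[1:-1])
    weights = np.zeros(len(labels), dtype=float)
    for b in np.unique(bins):
        in_bin = bins == b
        classes = np.unique(labels[in_bin])
        for c in classes:
            cell = in_bin & (labels == c)
            # Every class present in the bin receives the same total weight,
            # while the bin as a whole keeps weight equal to its sample count.
            weights[cell] = in_bin.sum() / (len(classes) * cell.sum())
    return weights / weights.sum()
```

The returned weights could, for example, be passed as the `p` argument of `numpy.random.Generator.choice` to draw a training batch in which the label no longer depends on coverage.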
Details
Motivation: 解决卫星图像中非随机缺失数据导致的模型表示偏差问题,特别是在甲烷羽流检测中因缺失像素与标签关联引起的检测性能下降。 Method: 评估多种插补方法以消除覆盖率与标签间的依赖关系,并提出一种训练时按覆盖率分箱进行类别平衡的加权重采样方案。 Result: 插补和重采样方法均显著降低了表示偏差,且不影响模型的平衡准确率、精确率和召回率;去偏模型在低覆盖率图像中检测羽流的能力更强。 Conclusion: 通过插补和加权重采样可有效缓解MNAR缺失数据引发的模型偏差,提升甲烷羽流检测在实际应用中的鲁棒性和公平性。 Abstract: Most satellite images have systematically missing pixels (i.e., missing data not at random (MNAR)) due to factors such as clouds. If not addressed, these missing pixels can lead to representation bias in automated feature extraction models. In this work, we show that spurious association between the label and the number of missing values in methane plume detection can cause the model to associate the coverage (i.e., the percentage of valid pixels in an image) with the label, subsequently under-detecting plumes in low-coverage images. We evaluate multiple imputation approaches to remove the dependence between the coverage and a label. Additionally, we propose a weighted resampling scheme during training that removes the association between the label and the coverage by enforcing class balance in each coverage bin. Our results show that both resampling and imputation can significantly reduce the representation bias without hurting balanced accuracy, precision, or recall. Finally, we evaluate the capability of the debiased models using these techniques in an operational scenario and demonstrate that the debiased models have a higher chance of detecting plumes in low-coverage images.[120] Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts
Chen Li,Huiying Xu,Changxin Gao,Zeyu Wang,Yun Liu,Xinzhong Zhu
Main category: cs.CV
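TL;DR: This paper proposes Cauvis (Causal Visual Prompts), a single-source domain generalized object detection method. By introducing a Cross-Attention Prompts module and a dual-branch adapter, it mitigates spurious feature correlations and achieves gains of 15.9-31.4% over existing methods on multiple SDGOD datasets.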
Details
Motivation: 现有的单源域泛化目标检测方法由于域偏移和领域特定知识有限,容易陷入虚假相关性,过度依赖颜色等非本质特征,导致在未见目标域上泛化能力差。 Method: 提出Cauvis方法,包含两个核心组件:1)交叉注意力提示模块,通过将视觉提示与交叉注意力机制结合,减少对虚假特征的依赖;2)双分支适配器,通过高频特征提取实现因果特征与虚假特征的解耦,并完成域适应。 Result: Cauvis在多个SDGOD基准数据集上取得了最先进的性能,相比现有域泛化方法提升了15.9-31.4%,并在复杂干扰环境下表现出显著的鲁棒性优势。 Conclusion: Cauvis通过解耦因果与虚假特征并增强领域适应能力,有效提升了单源域目标检测模型在未知目标域上的泛化性能,为解决域偏移和虚假相关性问题提供了新思路。 Abstract: Single-source Domain Generalized Object Detection (SDGOD), as a cutting-edge research topic in computer vision, aims to enhance model generalization capability in unseen target domains through single-source domain training. Current mainstream approaches attempt to mitigate domain discrepancies via data augmentation techniques. However, due to domain shift and limited domain-specific knowledge, models tend to fall into the pitfall of spurious correlations. This manifests as the model's over-reliance on simplistic classification features (e.g., color) rather than essential domain-invariant representations like object contours. To address this critical challenge, we propose the Cauvis (Causal Visual Prompts) method. First, we introduce a Cross-Attention Prompts module that mitigates bias from spurious features by integrating visual prompts with cross-attention. To address the inadequate domain knowledge coverage and spurious feature entanglement in visual prompts for single-domain generalization, we propose a dual-branch adapter that disentangles causal-spurious features while achieving domain adaptation via high-frequency feature extraction. Cauvis achieves state-of-the-art performance with 15.9-31.4% gains over existing domain generalization methods on SDGOD datasets, while exhibiting significant robustness advantages in complex interference environments.[121] CARES: Context-Aware Resolution Selector for VLMs
Moshe Kimhi,Nimrod Shabtay,Raja Giryes,Chaim Baskin,Eli Schwartz
Main category: cs.CV
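TL;DR: CARES is a lightweight preprocessing module that predicts the minimal sufficient input resolution for an image-query pair, reducing compute by up to 80% while preserving task performance.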
Details
Motivation: Large vision-language models usually process images at high resolution, which incurs high compute and latency costs even when low-resolution images would suffice, so a method for reducing this overhead is needed. Method: Proposes CARES, a context-aware resolution selector that uses a compact VLM (350M) to extract features and predict when the target pretrained VLM's response converges to its peak ability to answer correctly; it is trained as a discrete classifier but can interpolate continuous resolutions at inference. Result: Across five multimodal benchmarks spanning documents and natural images, and across diverse target VLMs, CARES reduces compute by up to 80% while preserving task performance. Conclusion: CARES effectively balances the performance and computational efficiency of vision-language models, achieving substantial compute savings by dynamically selecting the minimal necessary resolution. Abstract: Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This often inflates visual tokens to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce CARES, a Context-Aware Resolution Selector: a lightweight preprocessing module that, given an image-query pair, predicts the minimal sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
[122] PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis
Qing Mao,Tianxin Huang,Yu Zhu,Jinqiu Sun,Yanning Zhang,Gim Hee Lee
Main category: cs.CV
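TL;DR: This paper proposes PoseCrafter, which combines a Hybrid Video Generation (HVG) model, coupling video interpolation with pose-conditioned novel view synthesis, and a Feature Matching Selector (FMS), significantly improving camera pose estimation for sparsely overlapping image pairs.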
Details
Motivation: 现有方法在处理小或无重叠的图像对时表现不佳,生成的中间帧模糊且帧选择策略效率低且与位姿估计目标不一致。 Method: 提出Hybrid Video Generation (HVG) 模型,结合视频插值与姿态条件的新视角合成,并设计基于特征匹配的Selector (FMS) 来选择适合位姿估计的中间帧。 Result: 在Cambridge Landmarks、ScanNet、DL3DV-10K和NAVI数据集上的实验表明,该方法明显优于现有SOTA方法,尤其在小或无重叠图像对上表现突出。 Conclusion: PoseCrafter有效解决了稀疏重叠图像对的位姿估计难题,通过更清晰的中间帧生成和高效的选择机制提升了整体性能。 Abstract: Pairwise camera pose estimation from sparsely overlapping image pairs remains a critical and unsolved challenge in 3D vision. Most existing methods struggle with image pairs that have small or no overlap. Recent approaches attempt to address this by synthesizing intermediate frames using video interpolation and selecting key frames via a self-consistency score. However, the generated frames are often blurry due to small overlap inputs, and the selection strategies are slow and not explicitly aligned with pose estimation. To solve these cases, we propose Hybrid Video Generation (HVG) to synthesize clearer intermediate frames by coupling a video interpolation model with a pose-conditioned novel view synthesis model, where we also propose a Feature Matching Selector (FMS) based on feature correspondence to select intermediate frames appropriate for pose estimation from the synthesized results. Extensive experiments on Cambridge Landmarks, ScanNet, DL3DV-10K, and NAVI demonstrate that, compared to existing SOTA methods, PoseCrafter can obviously enhance the pose estimation performances, especially on examples with small or no overlap.[123] [De|Re]constructing VLMs' Reasoning in Counting
Simone Alghisi,Gabriel Roccabruna,Massimo Rizzoli,Seyed Mahed Mousavi,Giuseppe Riccardi
Main category: cs.CV
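TL;DR: This paper studies the reasoning ability of vision-language models (VLMs) on counting tasks. It finds that performance is limited by the number and type of objects, their spatial arrangement, and distractors, and traces errors to the mapping from the last-layer representation to the output space; fine-tuning only the output layer improves accuracy by up to 21%.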
Details
Motivation: VLMs在视觉推理任务中存在局限性,特别是在关系识别、时序理解和对象计数方面,需要深入探究其失败原因并提升其推理能力。 Method: 在受控实验条件下评估七种最先进的VLMs在计数任务中的表现,进行逐层分析以定位错误来源,并提出仅微调输出层的针对性训练方法。 Result: 实验表明VLMs对对象的数量、类型、空间布局和干扰物高度敏感;错误主要来自最后层表示到输出空间的错误映射;仅微调输出层即可使准确率提升高达21%,并在真实数据集上验证了该方法的有效性。 Conclusion: VLMs在计数任务中的推理缺陷主要源于输出映射问题,通过针对性地微调输出层可显著提升其性能,为改进VLMs的推理能力提供了有效且高效的途径。 Abstract: Vision-Language Models (VLMs) have recently gained attention due to their competitive performance on multiple downstream tasks, achieved by following user-input instructions. However, VLMs still exhibit several limitations in visual reasoning, such as difficulties in identifying relations (e.g., spatial, temporal, and among objects), understanding temporal sequences (e.g., frames), and counting objects. In this work, we go beyond score-level benchmark evaluations of VLMs by investigating the underlying causes of their failures and proposing a targeted approach to improve their reasoning capabilities. We study the reasoning skills of seven state-of-the-art VLMs in the counting task under controlled experimental conditions. Our experiments show that VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%. We corroborate these findings by achieving consistent improvements on real-world datasets.[124] The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models
Xiaofeng Zhang,Aaron Courville,Michal Drozdzal,Adriana Romero-Soriano
Main category: cs.CV
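TL;DR: This paper studies how prompt complexity affects the utility of synthetic data from text-to-image (T2I) models and proposes a new evaluation framework. Increasing prompt complexity lowers conditional diversity and prompt consistency but narrows the gap between the real and synthetic data distributions; prompt expansion improves generation diversity and aesthetic quality.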
Details
Motivation: 尽管提示工程是使用T2I模型的主要方式,但提示复杂性对合成数据在质量、多样性和一致性方面效用的系统影响尚不明确,需深入探究。 Method: 通过合成实验和理论推导分析泛化难度,引入新评估框架,在多个数据集上评估不同推理时干预方法,系统分析提示复杂性对T2I生成数据效用的影响。 Result: 增加提示复杂性会降低条件多样性和提示一致性,但减小合成与真实数据间的分布偏移;推理时干预可提升多样性但可能偏离真实数据支持域;提示扩展方法在多样性和美学上表现最优,甚至超过真实数据。 Conclusion: 提示复杂性显著影响T2I模型生成数据的效用,提示扩展通过利用预训练语言模型作为似然估计器,能有效提升生成结果的质量和多样性,为合成数据优化提供了可行路径。 Abstract: Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data, a valuable resource compared to fixed and finite real datasets. Previous works evaluate the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. While prompt engineering is the primary means of interacting with T2I models, the systematic impact of prompt complexity on these critical utility axes remains underexplored. In this paper, we first conduct synthetic experiments to motivate the difficulty of generalization w.r.t. prompt complexity and explain the observed difficulty with theoretical derivations. Then, we introduce a new evaluation framework that can compare the utility of real data and synthetic data, and present a comprehensive analysis of how prompt complexity influences the utility of synthetic data generated by commonly used T2I models. We conduct our study across diverse datasets, including CC12M, ImageNet-1k, and DCI, and evaluate different inference-time intervention methods. Our synthetic experiments show that generalizing to more general conditions is harder than the other way round, since the former needs an estimated likelihood that is not learned by diffusion models. Our large-scale empirical experiments reveal that increasing prompt complexity results in lower conditional diversity and prompt consistency, while reducing the synthetic-to-real distribution shift, which aligns with the synthetic experiments. 
Moreover, current inference-time interventions can augment the diversity of the generations at the expense of moving outside the support of real data. Among those interventions, prompt expansion, by deliberately using a pre-trained language model as a likelihood estimator, consistently achieves the highest performance in both image diversity and aesthetics, even higher than that of real data.
[125] A Matter of Time: Revealing the Structure of Time in Vision-Language Models
Nidham Tekaya,Manuela Waldner,Matthias Zeppelzauer
Main category: cs.CV
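TL;DR: This paper studies the temporal awareness of large-scale vision-language models (VLMs), introduces TIME10k, a benchmark of over 10,000 images with temporal annotations, and finds a low-dimensional, non-linear temporal manifold in the VLM embedding space. Building on this, the authors derive explicit "timeline" representations that support temporal reasoning and that match or exceed a prompt-based baseline in accuracy while being more computationally efficient.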
Details
Motivation: 探索现有VLMs是否具备将视觉内容定位到时间轴上的能力,并填补缺乏标准时序视觉理解评估基准的空白。 Method: 构建TIME10k数据集,提出一种新方法分析37个VLM在嵌入空间中的时间信息结构,并从中提取显式的‘时间线’表示用于时序推理。 Result: 发现时间信息存在于VLM嵌入空间的低维非线性流形上;所提时间线方法在多个VLM上实现了与提示法相当或更优的性能,且计算更高效。 Conclusion: VLMs隐含地编码了时间信息,可通过几何结构提取有效的时间线表示,为开放词汇下的时序视觉理解提供了新路径。 Abstract: Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit ``timeline'' representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.[126] HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking
Yao Deng,Xian Zhong,Wenxuan Liu,Zhaofei Yu,Jingling Yuan,Tiejun Huang
Main category: cs.CV
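TL;DR: Proposes Hierarchical Asymmetric Distillation (HAD), a multi-modal knowledge distillation framework that combines the strengths of RGB and event cameras, mitigates the spatio-temporal asymmetry between them, and delivers superior object tracking performance.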
Details
Motivation: RGB相机和事件相机虽具有互补优势,但因成像机制不同导致显著的时空不对称性,阻碍了有效的多模态融合,因此需要一种能够建模并缓解这种不对称性的方法。 Method: 提出了分层非对称蒸馏(HAD)框架,采用分层对齐策略,在减少信息损失的同时保持学生网络的计算效率和参数紧凑性,以实现RGB和事件相机数据的有效融合。 Result: 大量实验表明,HAD在多个基准上持续优于现有最先进方法,消融研究验证了各组件的有效性和必要性。 Conclusion: HAD能有效解决RGB与事件相机间的时空不对称问题,显著提升复杂场景下的目标跟踪性能,是一种高效且紧凑的多模态知识蒸馏方案。 Abstract: RGB cameras excel at capturing rich texture details with high spatial resolution, whereas event cameras offer exceptional temporal resolution and a high dynamic range (HDR). Leveraging their complementary strengths can substantially enhance object tracking under challenging conditions, such as high-speed motion, HDR environments, and dynamic background interference. However, a significant spatio-temporal asymmetry exists between these two modalities due to their fundamentally different imaging mechanisms, hindering effective multi-modal integration. To address this issue, we propose {Hierarchical Asymmetric Distillation} (HAD), a multi-modal knowledge distillation framework that explicitly models and mitigates spatio-temporal asymmetries. Specifically, HAD proposes a hierarchical alignment strategy that minimizes information loss while maintaining the student network's computational efficiency and parameter compactness. Extensive experiments demonstrate that HAD consistently outperforms state-of-the-art methods, and comprehensive ablation studies further validate the effectiveness and necessity of each designed component. The code will be released soon.[127] Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection
Ariana Yi,Ce Zhou,Liyang Xiao,Qiben Yan
Main category: cs.CV
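TL;DR: This paper presents α-Cloak, the first no-box adversarial attack that fuses a malicious video with a benign one through the alpha channel of RGBA videos, misleading object detectors while remaining hard for humans to notice. It requires no model information and achieves a 100% attack success rate.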
Details
Motivation: 随着目标检测模型在自动驾驶和监控系统中的广泛应用,其对抗安全性至关重要。现有研究多集中于图像域攻击,而视频域尤其是无盒条件下的攻击研究不足,亟需探索此类新型威胁。 Method: 提出α-Cloak方法,利用RGBA视频的alpha通道进行视频融合,在不修改模型和引入视觉瑕疵的前提下,通过设计兼容多种格式的融合算法实现隐蔽攻击,且无需访问模型架构、参数或输出。 Result: 在五种最先进目标检测器、一种视觉语言模型和多模态大模型(Gemini-2.0-Flash)上均实现100%攻击成功率,验证了该方法的有效性和广泛适用性。 Conclusion: 揭示了基于视频的感知系统中alpha通道带来的新型安全漏洞,强调在对抗防御中必须考虑视频元数据和格式特性,为未来防御机制的设计提供了新方向。 Abstract: As object detection models are increasingly deployed in cyber-physical systems such as autonomous vehicles (AVs) and surveillance platforms, ensuring their security against adversarial threats is essential. While prior work has explored adversarial attacks in the image domain, those attacks in the video domain remain largely unexamined, especially in the no-box setting. In this paper, we present {\alpha}-Cloak, the first no-box adversarial attack on object detectors that operates entirely through the alpha channel of RGBA videos. {\alpha}-Cloak exploits the alpha channel to fuse a malicious target video with a benign video, resulting in a fused video that appears innocuous to human viewers but consistently fools object detectors. Our attack requires no access to model architecture, parameters, or outputs, and introduces no perceptible artifacts. We systematically study the support for alpha channels across common video formats and playback applications, and design a fusion algorithm that ensures visual stealth and compatibility. We evaluate {\alpha}-Cloak on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model (Gemini-2.0-Flash), demonstrating a 100% attack success rate across all scenarios. Our findings reveal a previously unexplored vulnerability in video-based perception systems, highlighting the urgent need for defenses that account for the alpha channel in adversarial settings.[128] VGD: Visual Geometry Gaussian Splatting for Feed-Forward Surround-view Driving Reconstruction
Junhong Lin,Kangli Wang,Shunzhou Wang,Songlin Fan,Ge Li,Wei Gao
Main category: cs.CV
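TL;DR: This paper proposes Visual Gaussian Driving (VGD), a feed-forward end-to-end learning framework that addresses the tension between geometric consistency and novel view quality in surround-view autonomous driving scene reconstruction.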
Details
Motivation: 现有方法在最小重叠区域的环绕视图下难以保证新视角的几何一致性和重建质量,因此需要显式学习几何信息并利用其提升语义质量。 Method: 设计了轻量化的VGGT变体以提取几何先验,并通过高斯头部融合多尺度几何令牌预测高斯参数;同时结合几何和高斯分支的多尺度特征联合监督语义细化模型。 Result: 在nuScenes数据集上实验表明,该方法在多种设置下均显著优于现有最先进方法,无论是在客观指标还是主观质量方面。 Conclusion: VGD框架能够有效实现可泛化的几何估计与高质量语义渲染,验证了其在高保真环绕视图重建中的可扩展性与优越性能。 Abstract: Feed-forward surround-view autonomous driving scene reconstruction offers fast, generalizable inference ability, which faces the core challenge of ensuring generalization while elevating novel view quality. Due to the surround-view with minimal overlap regions, existing methods typically fail to ensure geometric consistency and reconstruction quality for novel views. To tackle this tension, we claim that geometric information must be learned explicitly, and the resulting features should be leveraged to guide the elevating of semantic quality in novel views. In this paper, we introduce \textbf{Visual Gaussian Driving (VGD)}, a novel feed-forward end-to-end learning framework designed to address this challenge. To achieve generalizable geometric estimation, we design a lightweight variant of the VGGT architecture to efficiently distill its geometric priors from the pre-trained VGGT to the geometry branch. Furthermore, we design a Gaussian Head that fuses multi-scale geometry tokens to predict Gaussian parameters for novel view rendering, which shares the same patch backbone as the geometry branch. Finally, we integrate multi-scale features from both geometry and Gaussian head branches to jointly supervise a semantic refinement model, optimizing rendering quality through feature-consistent learning. Experiments on nuScenes demonstrate that our approach significantly outperforms state-of-the-art methods in both objective metrics and subjective quality under various settings, which validates VGD's scalability and high-fidelity surround-view reconstruction.[129] Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration
Francisco Mena,Dino Ienco,Cassio F. Dantas,Roberto Interdonato,Andreas Dengel
Main category: cs.CV
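TL;DR: Proposes a new multi-modal co-learning framework that combines contrastive learning and modality discriminative learning. Using multi-modal data during training and a single modality at inference, it outperforms existing methods across several Earth observation tasks.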
Details
Motivation: 在地球观测中,由于实际限制,训练和推理阶段难以保证相同的传感器模态可用,现有方法多针对特定任务或推理模态设计,缺乏通用性。 Method: 提出一个通用的多模态协同学习框架,结合对比学习与模态判别学习,引导单模态模型分离并学习模态共享与模态特有的特征表示。 Result: 在四个地球观测基准上验证了该框架,涵盖分类与回归任务,在仅使用训练时部分模态进行推理的情况下,性能持续优于最先进的机器学习、计算机视觉及遥感专用方法。 Conclusion: 所提框架能在多种地球观测应用中有效提升单模态推理性能,具有良好的泛化能力和应用前景。 Abstract: Multi-modal co-learning is emerging as an effective paradigm in machine learning, enabling models to collaboratively learn from different modalities to enhance single-modality predictions. Earth Observation (EO) represents a quintessential domain for multi-modal data analysis, wherein diverse remote sensors collect data to sense our planet. This unprecedented volume of data introduces novel challenges. Specifically, the access to the same sensor modalities at both training and inference stages becomes increasingly complex based on real-world constraints affecting remote sensing platforms. In this context, multi-modal co-learning presents a promising strategy to leverage the vast amount of sensor-derived data available at the training stage to improve single-modality models for inference-time deployment. Most current research efforts focus on designing customized solutions for either particular downstream tasks or specific modalities available at the inference stage. To address this, we propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference. Our approach combines contrastive and modality discriminative learning together to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information. We evaluate our framework on four EO benchmarks spanning classification and regression tasks across different sensor modalities, where only one of the modalities available during training is accessible at inference time. 
Our results demonstrate consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature, as well as EO-specific methods. The obtained findings validate our framework in the single-modality inference scenarios across a diverse range of EO applications.
[130] Addressing the Depth-of-Field Constraint: A New Paradigm for High Resolution Multi-Focus Image Fusion
Luca Piano,Peng Huanwen,Radu Ciprian Bilcu
Main category: cs.CV
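TL;DR: Proposes VAEEDOF, a new multi-focus image fusion method that combines a distilled variational autoencoder with MattingMFIF, a new synthetic 4K dataset, to achieve high-fidelity, artifact-free fusion at a state-of-the-art level.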
Details
Motivation: Traditional and deep learning methods for multi-focus image fusion still face limited training data, a domain gap between synthetic data and real scenes, and difficulty handling regions that lack information. Method: Uses VAEEDOF, a method based on a distilled variational autoencoder, with a fusion module that can process up to seven images simultaneously, and builds MattingMFIF, a new synthetic 4K dataset that simulates realistic depth-of-field effects. Result: Achieves state-of-the-art performance on multiple evaluation metrics, generating seamless, artifact-free fused images and narrowing the gap between synthetic data and real scenes. Conclusion: VAEEDOF, combined with the high-quality synthetic dataset MattingMFIF, substantially improves the quality and practicality of multi-focus image fusion and offers an effective solution to complex MFIF problems. Abstract: Multi-focus image fusion (MFIF) addresses the depth-of-field (DOF) limitations of optical lenses, where only objects within a specific range appear sharp. Although traditional and deep learning methods have advanced the field, challenges persist, including limited training data, domain gaps from synthetic datasets, and difficulties with regions lacking information. We propose VAEEDOF, a novel MFIF method that uses a distilled variational autoencoder for high-fidelity, efficient image reconstruction. Our fusion module processes up to seven images simultaneously, enabling robust fusion across diverse focus points. To address data scarcity, we introduce MattingMFIF, a new synthetic 4K dataset, simulating realistic DOF effects from real photographs. Our method achieves state-of-the-art results, generating seamless artifact-free fused images and bridging the gap between synthetic and real-world scenarios, offering a significant step forward in addressing complex MFIF challenges. The code and weights are available here:
[131] Uncertainty evaluation of segmentation models for Earth observation
Melanie Rey,Andriy Mnih,Maxim Neumann,Matt Overlan,Drew Purves
Main category: cs.CV
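TL;DR: This paper studies uncertainty estimation methods for semantic segmentation predictions from satellite imagery, focusing on the practical utility of existing methods in remote sensing and Earth observation applications.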
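The abstract of this entry evaluates per-pixel uncertainty measures for models such as ensembles. As a rough illustration of what one such measure can look like, the sketch below computes the predictive entropy of the ensemble-averaged class probabilities; the function name and the `(members, height, width, classes)` array layout are assumptions, and the paper may use different or additional metrics.

```python
import numpy as np

def predictive_entropy(member_probs, eps=1e-12):
    """Per-pixel entropy (in nats) of the ensemble-mean prediction.

    member_probs: array of shape (M, H, W, C) holding softmax outputs
                  from M ensemble members for an H x W image with C classes.
    Returns an (H, W) map; higher values mean less certain pixels.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    mean_probs = member_probs.mean(axis=0)  # average over ensemble members
    return -np.sum(mean_probs * np.log(mean_probs + eps), axis=-1)
```

Pixels where members agree on a confident class get entropy near zero, while pixels where members disagree approach the maximum of log(C).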
Details
Motivation: 语义分割的不确定性估计相较于图像分类更具挑战性,需要可扩展的逐像素估计方法,且现有研究多集中于场景理解或医学影像,缺乏针对遥感领域的系统评估。 Method: 在PASTIS和ForTy两个具有不同尺度、地理覆盖和标签置信度的遥感数据集上,对多种模型(如随机分割网络和集成方法)结合不同神经网络架构与不确定性度量进行了广泛评估。 Result: 实验评估了多种不确定性度量在识别预测错误和噪声污染区域方面的有效性,揭示了不同方法在遥感场景下的表现差异。 Conclusion: 基于实验结果,本文提出了若干关于在遥感语义分割中选择和应用不确定性估计方法的实用建议。 Abstract: This paper investigates methods for estimating uncertainty in semantic segmentation predictions derived from satellite imagery. Estimating uncertainty for segmentation presents unique challenges compared to standard image classification, requiring scalable methods producing per-pixel estimates. While most research on this topic has focused on scene understanding or medical imaging, this work benchmarks existing methods specifically for remote sensing and Earth observation applications. Our evaluation focuses on the practical utility of uncertainty measures, testing their ability to identify prediction errors and noise-corrupted input image regions. Experiments are conducted on two remote sensing datasets, PASTIS and ForTy, selected for their differences in scale, geographic coverage, and label confidence. We perform an extensive evaluation featuring several models, such as Stochastic Segmentation Networks and ensembles, in combination with a number of neural architectures and uncertainty metrics. We make a number of practical recommendations based on our findings.[132] Digitizing Paper ECGs at Scale: An Open-Source Algorithm for Clinical Research
Elias Stenhede,Agnar Martin Bjørnstad,Arian Ranjbar
Main category: cs.CV
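TL;DR: Proposes a fully automated, modular framework that converts scanned or photographed ECG images into digital signals usable for clinical and research purposes, and validates its superior performance on large-scale datasets.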
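The abstract of this entry reports digitization quality as a mean signal-to-noise ratio in dB. The sketch below shows one common way such a figure can be computed between a reference signal and a digitized estimate; the function name `snr_db` and this particular definition (signal power over error power) are assumptions, since the entry does not spell out the exact formula used.

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB, treating (reference - estimate) as noise."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))
```

Under this definition, an estimate whose error has one tenth of the reference amplitude scores about 20 dB.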
Details
Motivation: 大量仅以纸质扫描形式存在的临床ECG无法用于现代自动化诊断,亟需一种高效准确的数字化方法。 Method: 开发了一个全自动、模块化的框架,能够从含常见伪影的扫描或照片中提取ECG信号,并在多个真实世界数据集上进行验证。 Result: 在Akershus大学医院收集的1,596张图像上信噪比达19.65 dB,在包含透视畸变、褶皱和污渍的35,595张图像组成的Emory数据集上,性能优于现有最先进方法。 Conclusion: 该框架显著提升了纸质ECG数字化的准确性和实用性,开源发布有助于推动回顾性ECG数据的利用和AI诊断的普及。 Abstract: Millions of clinical ECGs exist only as paper scans, making them unusable for modern automated diagnostics. We introduce a fully automated, modular framework that converts scanned or photographed ECGs into digital signals, suitable for both clinical and research applications. The framework is validated on 37,191 ECG images with 1,596 collected at Akershus University Hospital, where the algorithm obtains a mean signal-to-noise ratio of 19.65 dB on scanned papers with common artifacts. It is further evaluated on the Emory Paper Digitization ECG Dataset, comprising 35,595 images, including images with perspective distortion, wrinkles, and stains. The model improves on the state-of-the-art in all subcategories. The full software is released as open-source, promoting reproducibility and further development. We hope the software will contribute to unlocking retrospective ECG archives and democratize access to AI-driven diagnostics.[133] Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Su Ho Han,Jeongseok Hyun,Pilhyeon Lee,Minho Shim,Dongyoon Wee,Seon Joo Kim
Main category: cs.CV
TL;DR: Proposes DecAF, which achieves training-free video reasoning segmentation through decomposed attention fusion and attention-guided SAM2 prompting.
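The abstract of this entry extracts attention maps "via rollout mechanism". For orientation, the sketch below implements the widely used attention rollout recipe (averaging each layer's attention with the identity to account for residual connections, then multiplying across layers). Whether DecAF uses exactly this variant is an assumption, and the function name and head-averaged input format are illustrative only.

```python
import numpy as np

def attention_rollout(attentions):
    """Combine per-layer attention into one token-to-token relevance map.

    attentions: list of (T, T) row-stochastic matrices, one per layer,
                ordered from first to last layer, heads already averaged.
    """
    n_tokens = attentions[0].shape[0]
    rollout = np.eye(n_tokens)
    for attn in attentions:
        # Mix in the identity to model the residual (skip) connection.
        mixed = 0.5 * attn + 0.5 * np.eye(n_tokens)
        mixed = mixed / mixed.sum(axis=-1, keepdims=True)
        # Later layers are applied on top of earlier ones.
        rollout = mixed @ rollout
    return rollout
```

Rows of the result can then be read as how strongly each output token draws on each input token, which is the kind of map a localization method would refine further.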
Details
Motivation: 现有的多模态大语言模型在视频理解中表现出色,但直接用于定位任务时,原始注意力图噪声大且与物体区域对齐差,需要改进以实现有效的训练-free视频推理分割。 Method: 将视频推理分割视为视频问答任务,利用rollout机制提取注意力图;提出Decomposed Attention Fusion(DecAF),包括对比对象-背景融合和互补视频-帧融合两种机制来优化注意力图,并结合注意力引导的SAM2提示生成精细掩码。 Result: DecAF在指代和推理VOS基准上优于现有的训练-free方法,性能接近训练-based方法。 Conclusion: DecAF能有效提升注意力图质量,实现高质量的训练-free视频推理分割,且无需对MLLM或SAM进行微调。 Abstract: Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.[134] CBDiff:Conditional Bernoulli Diffusion Models for Image Forgery Localization
Zhou Lei,Pan Gang,Wang Jiahao,Sun Di
Main category: cs.CV
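TL;DR: Proposes an image forgery localization method based on a Conditional Bernoulli Diffusion Model (CBDiff). It generates multiple diverse and plausible localization maps to better characterize the uncertainty of forged regions, introduces Bernoulli noise and a Time-Step Cross-Attention mechanism, and significantly outperforms existing methods on eight public datasets.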
Details
Motivation: Existing image forgery localization methods usually produce a single deterministic result and do not model predictive uncertainty, which makes it hard to meet the reliability and precision requirements of high-stakes applications. Method: Proposes the Conditional Bernoulli Diffusion Model (CBDiff), which introduces Bernoulli noise into the diffusion process to better match the binary, sparse nature of forgery masks, and designs a Time-Step Cross-Attention (TSCAttention) mechanism that uses the interaction between semantic features and time steps to improve detection. Result: Experiments on eight public benchmark datasets show that CBDiff significantly outperforms existing state-of-the-art methods on forgery localization, with stronger robustness and potential for real-world deployment. Conclusion: By modeling a distribution over predictions rather than a single output, CBDiff improves the reliability and accuracy of image forgery localization and offers a more trustworthy solution for image forensics. Abstract: Image Forgery Localization (IFL) is a crucial task in image forensics, aimed at accurately identifying manipulated or tampered regions within an image at the pixel level. Existing methods typically generate a single deterministic localization map, which often lacks the precision and reliability required for high-stakes applications such as forensic analysis and security surveillance. To enhance the credibility of predictions and mitigate the risk of errors, we introduce an advanced Conditional Bernoulli Diffusion Model (CBDiff). Given a forged image, CBDiff generates multiple diverse and plausible localization maps, thereby offering a richer and more comprehensive representation of the forgery distribution. This approach addresses the uncertainty and variability inherent in tampered regions. Furthermore, CBDiff incorporates Bernoulli noise into the diffusion process to more faithfully reflect the inherent binary and sparse properties of forgery masks. Additionally, CBDiff introduces a Time-Step Cross-Attention (TSCAttention) mechanism, which is specifically designed to leverage semantic feature guidance with temporal steps to improve manipulation detection. Extensive experiments on eight public benchmark datasets demonstrate that CBDiff significantly outperforms existing state-of-the-art methods, highlighting its strong potential for real-world deployment.
[135] XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography
Haozhe Luo,Shelley Zixin Shu,Ziyu Zhou,Sebastian Otalora,Mauricio Reyes
Main category: cs.CV
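TL;DR: This paper presents the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays, comparing seven CLIP-style vision-language models (VLMs). Although the models recognize pathologies well, they still localize small or diffuse lesions poorly, and recognition ability is closely tied to grounding ability.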
Details
Motivation: Although vision-language models show strong zero-shot performance in medical image understanding, the alignment between textual concepts and visual evidence (their grounding ability) is clinically essential yet remains underexplored. Method: Builds a benchmark for chest X-rays that generates visual explanations with cross-attention and similarity-based localization maps and compares them quantitatively against radiologist-annotated regions, evaluating the cross-modal interpretability of seven CLIP-style VLMs. Result: (1) All VLM variants localize large, well-defined pathologies reasonably well, but performance drops substantially for small or diffuse lesions; (2) models pretrained on chest X-ray-specific datasets show better alignment than models trained on general-domain data; (3) a model's overall recognition ability is strongly correlated with its grounding ability. Conclusion: Current VLMs recognize pathologies well but still fall short of clinically reliable grounding, so targeted interpretability benchmarks are needed before deployment in medical practice. Abstract: Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, however, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants demonstrate reasonable localization for large and well-defined pathologies, their performance substantially degrades for small or diffuse lesions; (2) models that are pretrained on chest X-ray-specific datasets exhibit improved alignment compared to those trained on general-domain data; and (3) the overall recognition ability and grounding ability of a model are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short in clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. XBench code is available at https://github.com/Roypic/Benchmarkingattention
[136] Beyond sparse denoising in frames: minimax estimation with a scattering transform
Nathanaël Cuvelle--Magar,Stéphane Mallat
Main category: cs.CV
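TL;DR: Proposes a denoising estimator based on wavelet scattering coefficients that jointly minimises and maximises the ℓ¹ norms of different coefficient subsets. Numerical experiments indicate that it reaches the minimax asymptotic bound on cartoon images for all Lipschitz exponents α ≤ 2, building a mathematical bridge between harmonic analysis and denoising with deep convolutional networks.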
Details
Motivation: Traditional sparse frames (e.g., wavelets, curvelets) perform poorly on signals with complex regularity, such as cartoon images whose edges are C^α curves, particularly when α ≤ 2 is unknown, so a denoising method that adapts better to geometric regularity is needed. Method: Introduces a denoising estimator based on wavelet scattering coefficients that jointly minimises and maximises the ℓ¹ norms of different subsets to capture different geometric regularity properties of images. Result: It is proved that these ℓ¹ norms capture different types of geometric image regularity; numerical experiments show that the estimator reaches the minimax asymptotic bound for all α ≤ 2, and this numerical result is stated as a mathematical conjecture. Conclusion: The method offers a new harmonic-analysis perspective on signal denoising and on characterising the geometric regularity of functions, and it links harmonic analysis with denoising by deep convolutional networks. Abstract: A considerable amount of research in harmonic analysis has been devoted to non-linear estimators of signals contaminated by additive Gaussian noise. They are implemented by thresholding coefficients in a frame, which provides a sparse signal representation, or by minimising their $\ell^1$ norm. However, sparse estimators in frames are not sufficiently rich to adapt to complex signal regularities. For cartoon images whose edges are piecewise $\mathbf{C}^\alpha$ curves, wavelet, curvelet and Xlet frames are suboptimal if the Lipschitz exponent $\alpha \leq 2$ is an unknown parameter. Deep convolutional neural networks have recently obtained much better numerical results, which reach the minimax asymptotic bounds for all $\alpha$. Wavelet scattering coefficients have been introduced as simplified convolutional neural network models. They are computed by transforming the modulus of wavelet coefficients with a second wavelet transform. We introduce a denoising estimator by jointly minimising and maximising the $\ell^1$ norms of different subsets of scattering coefficients. We prove that these $\ell^1$ norms capture different types of geometric image regularity. Numerical experiments show that this denoising estimator reaches the minimax asymptotic bound for cartoon images for all Lipschitz exponents $\alpha \leq 2$. We state this numerical result as a mathematical conjecture. It provides a different harmonic analysis approach to suppress noise from signals, and to specify the geometric regularity of functions. It also opens a mathematical bridge between harmonic analysis and denoising estimators with deep convolutional networks.
[137] Pragmatic Heterogeneous Collaborative Perception via Generative Communication Mechanism
Junfei Zhou,Penglin Dai,Quanmin Wei,Bingyi Liu,Xiao Wu,Jianping Wang
Main category: cs.CV
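TL;DR: Proposes GenComm, a generative communication mechanism that uses feature generation and lightweight alignment of spatial information to enable seamless perception collaboration across heterogeneous multi-agent systems without modifying the original networks, substantially reducing computational cost and parameter count.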
Details
Motivation: Existing heterogeneous multi-agent collaboration methods break semantic consistency because they require retraining encoders or core modules, and integrating new agents is computationally expensive and scales poorly. Method: A Deformable Message Extractor extracts spatial information from each collaborator, a Spatial-Aware Feature Generator driven by a conditional diffusion model generates features aligned with the ego agent's semantic space, and a Channel Enhancer refines the generated features before fusion, all without rebuilding the original network. Result: Experiments on the OPV2V-H, DAIR-V2X and V2X-Real datasets show that GenComm outperforms existing state-of-the-art methods, with an 81% reduction in both computational cost and parameter count when incorporating new agents. Conclusion: GenComm enables efficient, scalable perception collaboration among heterogeneous agents, addressing both the loss of semantic consistency and the high computational cost, and shows good prospects for practical use. Abstract: Multi-agent collaboration enhances the perception capabilities of individual agents through information sharing. However, in real-world applications, differences in sensors and models across heterogeneous agents inevitably lead to domain gaps during collaboration. Existing approaches based on adaptation and reconstruction fail to support pragmatic heterogeneous collaboration due to two key limitations: (1) Intrusive retraining of the encoder or core modules disrupts the established semantic consistency among agents; and (2) accommodating new agents incurs high computational costs, limiting scalability. To address these challenges, we present a novel Generative Communication mechanism (GenComm) that facilitates seamless perception across heterogeneous multi-agent systems through feature generation, without altering the original network, and employs lightweight numerical alignment of spatial information to efficiently integrate new agents at minimal cost. Specifically, a tailored Deformable Message Extractor is designed to extract a spatial message for each collaborator, which is then transmitted in place of intermediate features. The Spatial-Aware Feature Generator, utilizing a conditional diffusion model, generates features aligned with the ego agent's semantic space while preserving the spatial information of the collaborators. These generated features are further refined by a Channel Enhancer before fusion. Experiments conducted on the OPV2V-H, DAIR-V2X and V2X-Real datasets demonstrate that GenComm outperforms existing state-of-the-art methods, achieving an 81% reduction in both computational cost and parameter count when incorporating new agents. Our code is available at https://github.com/jeffreychou777/GenComm.
[138] Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning
Zhengxuan Wei,Jiajin Tang,Sibei Yang
Main category: cs.CV
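TL;DR: This paper proposes AMR, an augmented moment retrieval framework with no external dependencies that uses two-stage training and a query distillation mechanism to improve boundary localization and fine-grained semantic discrimination without additional annotation cost.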
Details
Motivation: Existing moment retrieval methods are limited by three bottlenecks (data scarcity, boundary ambiguity, and insufficient fine-grained semantic discrimination), which make precise retrieval difficult. Method: The AMR framework uses two-stage training: a cold-start stage applies curriculum learning on augmented data to build foundational awareness, and a distillation stage introduces Original Queries and Active Queries together with a cross-stage distillation loss that maintains knowledge consistency and improves generalization. Result: Experiments on multiple benchmarks show that AMR outperforms prior state-of-the-art methods and mitigates boundary ambiguity and semantic confusion. Conclusion: Through data augmentation and dual-query distillation, AMR achieves better moment retrieval performance without additional annotation and shows good potential for practical use. Abstract: Existing Moment Retrieval methods face three critical bottlenecks: (1) data scarcity forces models into shallow keyword-feature associations; (2) boundary ambiguity in transition regions between adjacent events; (3) insufficient discrimination of fine-grained semantics (e.g., distinguishing "kicking" vs. "throwing" a ball). In this paper, we propose a zero-external-dependency Augmented Moment Retrieval framework, AMR, designed to overcome local optima caused by insufficient data annotations and the lack of robust boundary and semantic discrimination capabilities. AMR is built upon two key insights: (1) it resolves ambiguous boundary information and semantic confusion in existing annotations without additional data (avoiding costly manual labeling), and (2) it preserves boundary and semantic discriminative capabilities enhanced by training while generalizing to real-world scenarios, significantly improving performance. Furthermore, we propose a two-stage training framework with cold-start and distillation adaptation. The cold-start stage employs curriculum learning on augmented data to build foundational boundary/semantic awareness. The distillation stage introduces dual query sets: Original Queries maintain DETR-based localization using frozen Base Queries from the cold-start model, while Active Queries dynamically adapt to real-data distributions. A cross-stage distillation loss enforces consistency between Original and Base Queries, preventing knowledge forgetting while enabling real-world generalization. Experiments on multiple benchmarks show that AMR achieves improved performance over prior state-of-the-art approaches.
[139] MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom
Yifan Li,Fenghe Tang,Yingtai Li,Shaohua Kevin Zhou
Main category: cs.CV
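TL;DR: This paper proposes MedReason-R1, a medical vision-language model for CT disease diagnosis that, combined with the CT-RATE-VQA dataset and the GRPO reinforcement learning framework, achieves state-of-the-art diagnostic performance without requiring extensive manual annotation.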
Details
Motivation: General-purpose large vision-language models perform poorly in the medical domain, mainly because high-quality specialized medical imaging datasets are lacking and the coarse-to-fine diagnostic process is neglected. Method: Builds the CT-RATE-VQA dataset with 84K QA pairs; proposes MedReason-R1, which embeds zoomed-in lesion regions into the image to emphasize both global localization and lesion detail; and introduces the GRPO reinforcement learning framework to enable effective reasoning without costly manual annotation. Result: MedReason-R1 outperforms recent general-purpose and medical VLMs on CT disease diagnosis while retaining good generalization. Conclusion: Through an explicit reasoning mechanism and an adaptive reinforcement learning framework, MedReason-R1 improves medical image understanding and diagnostic accuracy, offering a new direction for automated medical image analysis. Abstract: General-purpose large Vision-Language Models (VLMs) demonstrate strong capabilities in generating detailed descriptions for natural images. However, their performance in the medical domain remains suboptimal, even for relatively straightforward tasks, primarily due to the lack of large-scale, high-quality, specialized medical imaging datasets and the neglect of the diagnostic process that progresses from coarse to fine-grained. To address the first issue, we construct the CT-RATE-VQA dataset, which has 84K QA pairs. For the second issue, we propose MedReason-R1, a medical VLM with an explicit reasoning process for disease diagnosis. MedReason-R1 incorporates a novel strategy that embeds zoom-in disease region-of-interest areas into the image, highlighting the crucial role of both global localization and disease-specific details in enhancing the model's diagnostic performance. Furthermore, we introduce the GRPO reinforcement learning framework to MedReason-R1, which enables effective reasoning without relying on costly manual annotations. Compared to recent general-purpose and medical VLMs, MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization. The code, checkpoints, and dataset are available at: https://github.com/Leevan001/MedReason-R1
[140] Re-Activating Frozen Primitives for 3D Gaussian Splatting
Yuxin Cheng,Binxiao Huang,Wenyong Zhou,Taiqiang Wu,Zhengwu Liu,Graziano Chesi,Ngai Wong
Main category: cs.CV
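TL;DR: This paper proposes ReAct-GS, which uses a re-activation mechanism to address the over-reconstruction artifacts that 3D Gaussian Splatting exhibits in complex scenes because of gradient dilution and frozen primitives, improving the quality of novel view synthesis.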
Details
Motivation: 3D Gaussian Splatting (3D-GS) suffers from over-reconstruction artifacts such as local blurring and needle-shaped distortions in complex scenes; existing methods do not resolve the problem at its root, so its underlying causes need to be identified and addressed. Method: ReAct-GS has two core components: (1) an importance-aware densification criterion based on multi-view α-blending weights that re-activates stalled primitive growth in complex regions; and (2) a re-activation mechanism that revives frozen primitives through adaptive parameter perturbations. Result: Experiments on multiple real-world datasets show that ReAct-GS eliminates over-reconstruction artifacts, reaches state-of-the-art novel view synthesis metrics, and preserves fine geometric details; the mechanism also improves other 3D-GS variants such as Pixel-GS. Conclusion: By re-activating primitives, ReAct-GS addresses gradient dilution and primitive freezing in 3D-GS, improves rendering quality in complex scenes, and is broadly applicable. Abstract: 3D Gaussian Splatting (3D-GS) achieves real-time photorealistic novel view synthesis, yet struggles with complex scenes due to over-reconstruction artifacts, manifesting as local blurring and needle-shape distortions. While recent approaches attribute these issues to insufficient splitting of large-scale Gaussians, we identify two fundamental limitations: gradient magnitude dilution during densification and the primitive frozen phenomenon, where essential Gaussian densification is inhibited in complex regions while suboptimally scaled Gaussians become trapped in local optima. To address these challenges, we introduce ReAct-GS, a method founded on the principle of re-activation. Our approach features: (1) an importance-aware densification criterion incorporating $\alpha$-blending weights from multiple viewpoints to re-activate stalled primitive growth in complex regions, and (2) a re-activation mechanism that revitalizes frozen primitives through adaptive parameter perturbations. Comprehensive experiments across diverse real-world datasets demonstrate that ReAct-GS effectively eliminates over-reconstruction artifacts and achieves state-of-the-art performance on standard novel view synthesis metrics while preserving intricate geometric details. Additionally, our re-activation mechanism yields consistent improvements when integrated with other 3D-GS variants such as Pixel-GS, demonstrating its broad applicability.
[141] From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction
Zhida Zhao,Talas Fu,Yifan Wang,Lijun Wang,Huchuan Lu
Main category: cs.CV
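TL;DR: This paper proposes Policy World Model (PWM), a new driving paradigm that unifies world modeling and trajectory planning and improves planning through an action-free future state forecasting scheme.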
Details
Motivation: Existing driving world models mostly separate world modeling from planning, and the way the two can reinforce each other has not been explored in depth. Method: The PWM framework combines collaborative state-action prediction, a dynamically enhanced parallel token generation mechanism, a context-guided tokenizer, and an adaptive dynamic focal loss, using only front camera input. Result: It performs well in video forecasting efficiency and planning reliability, matching or exceeding state-of-the-art methods that rely on multi-view and multi-modal inputs while using only a single front camera. Conclusion: PWM achieves synergy between world modeling and planning and shows strong potential for autonomous driving systems. Abstract: Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic the human-like anticipatory perception, yielding more reliable planning performance. To facilitate the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy-World-Model.
[142] I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs
John Burden,Jonathan Prunty,Ben Slater,Matthieu Tehenan,Greg Davis,Lucy Cheke
Main category: cs.CV
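TL;DR: Drawing on visual search paradigms from cognitive psychology, this paper tests whether multimodal large language models (MLLMs) show a human-like pop-out effect. In single-feature search by colour or size, MLLMs behave much like humans; in multi-feature conjunctive search they show capacity limits; and they are influenced by natural scene priors such as lighting direction.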
Details
Motivation: Although MLLMs perform well on vision-language tasks, their visual processing is opaque. The authors use methods from cognitive psychology to examine whether MLLMs possess human-like basic visual perception mechanisms. Method: Classic visual search paradigms are adapted into controlled experiments targeting colour, size and lighting features, combined with targeted fine-tuning and mechanistic interpretability analyses, to test MLLMs on single-feature (disjunctive) and multi-feature (conjunctive) search. Result: Advanced MLLMs exhibit a pop-out effect in single-feature search by colour or size, show capacity limits in multi-feature search, and, like humans, use natural scene priors such as lighting direction. Conclusion: Visual search is an effective, cognitively grounded diagnostic tool for evaluating the perceptual capabilities of MLLMs, and the results suggest that these models share some human-like low-level visual processing. Abstract: Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms -- originally developed to study human perception -- to test whether MLLMs exhibit the "pop-out" effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
[143] Curvilinear Structure-preserving Unpaired Cross-domain Medical Image Translation
Zihao Chen,Yi Zhou,Xudong Jiang,Li Chen,Leopold Schmetterer,Bingyao Tan,Jun Cheng
Main category: cs.CV
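TL;DR: Proposes CST, a general framework that preserves fine curvilinear structures in unpaired image translation; by adding a topological supervision module, it improves the fidelity and performance of cross-modality medical image translation.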
Details
Motivation: Existing unpaired image translation methods often distort fine curvilinear structures such as microvasculature, which undermines diagnostic reliability and quantitative analysis, a problem that is especially pronounced in ophthalmic and vascular imaging. Method: Proposes the Curvilinear Structure-preserving Translation (CST) framework, which adds a structure consistency constraint during training, uses a curvilinear structure extraction module for topological supervision, and can be integrated into existing models such as CycleGAN and UNSB. Result: Validated on three modalities (optical coherence tomography angiography, color fundus, and X-ray coronary angiography), CST improves translation fidelity and achieves state-of-the-art performance. Conclusion: By strengthening geometric integrity in the learned mapping, CST provides an effective, general solution for curvilinear structure-aware cross-domain translation of medical images. Abstract: Unpaired image-to-image translation has emerged as a crucial technique in medical imaging, enabling cross-modality synthesis, domain adaptation, and data augmentation without costly paired datasets. Yet, existing approaches often distort fine curvilinear structures, such as microvasculature, undermining both diagnostic reliability and quantitative analysis. This limitation is consequential in ophthalmic and vascular imaging, where subtle morphological changes carry significant clinical meaning. We propose Curvilinear Structure-preserving Translation (CST), a general framework that explicitly preserves fine curvilinear structures during unpaired translation by integrating structure consistency into the training. Specifically, CST augments baseline models with a curvilinear extraction module for topological supervision. It can be seamlessly incorporated into existing methods. We integrate it into CycleGAN and UNSB as two representative backbones. Comprehensive evaluation across three imaging modalities (optical coherence tomography angiography, color fundus and X-ray coronary angiography) demonstrates that CST improves translation fidelity and achieves state-of-the-art performance. By reinforcing geometric integrity in learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.
[144] Explainable Face Presentation Attack Detection via Ensemble-CAM
Rashik Shadman,M G Sarwar Murshed,Faraz Hussain
Main category: cs.CV
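TL;DR: Proposes Ensemble-CAM, a new method for improving the explainability of deep learning-based face presentation attack detection (PAD) systems, making them more transparent and trustworthy.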
Details
Motivation: Deep learning models perform well in face PAD systems but lack transparency, and users cannot easily understand how decisions are made, so explainability techniques are needed to reveal the basis of the model's judgments. Method: Ensemble-CAM combines multiple visual explanation techniques to produce more accurate and stable heatmaps that show the key regions behind a real-or-fake decision. Result: Compared with existing methods, Ensemble-CAM localizes key discriminative regions more consistently and shows stronger explanatory power and stability on multiple benchmark datasets. Conclusion: Ensemble-CAM improves the explainability of deep learning-based face PAD systems, helping to increase transparency and user trust and supporting their use in security-sensitive settings. Abstract: Presentation attacks represent a critical security threat where adversaries use fake biometric data, such as face, fingerprint, or iris images, to gain unauthorized access to protected systems. Various presentation attack detection (PAD) systems have been designed leveraging deep learning (DL) models to mitigate this type of threat. Despite their effectiveness, most of the DL models function as black boxes - their decisions are opaque to their users. The purpose of explainability techniques is to provide detailed information about the reason behind the behavior or decision of DL models. In particular, visual explanation is necessary to better understand the decisions or predictions of DL-based PAD systems and determine the key regions due to which a biometric image is considered real or fake by the system. In this work, a novel technique, Ensemble-CAM, is proposed for providing visual explanations for the decisions made by deep learning-based face PAD systems. Our goal is to improve DL-based face PAD systems by providing a better understanding of their behavior. Our provided visual explanations will enhance the transparency and trustworthiness of DL-based face PAD systems.
[145] LyTimeT: Towards Robust and Interpretable State-Variable Discovery
Kuai Yu,Crystal Su,Xiang Liu,Judah Goldfeder,Mingyuan Shao,Hod Lipson
Main category: cs.CV
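TL;DR: Proposes LyTimeT, a two-phase framework for extracting robust and stable latent representations of dynamical systems from high-dimensional video; it combines spatio-temporal attention with stability constraints to achieve robustness to distractors and physical interpretability.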
Details
Motivation: Extracting the true dynamical variables from high-dimensional video is hampered by distractors such as background motion, occlusions, and texture changes, making it hard to obtain accurate, interpretable variables. Method: Phase 1 uses a TimeSformer-based spatio-temporal autoencoder whose global attention suppresses distractors while learning latent states; Phase 2 selects physically meaningful dimensions via linear correlation analysis and refines the transition dynamics with a Lyapunov stability regularizer. Result: Experiments on five synthetic benchmarks and four real-world dynamical systems (including chaotic phenomena) show that LyTimeT's mutual information and intrinsic dimension estimates are closest to ground truth, that it remains invariant under background perturbations, and that its prediction error is lower than that of CNN-based and transformer-only baselines. Conclusion: Combining spatio-temporal attention with stability constraints yields predictive models that are both accurate and physically interpretable, improving the robustness and interpretability of dynamical variable extraction from high-dimensional video. Abstract: Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.
[146] Adaptive Distribution-aware Quantization for Mixed-Precision Neural Networks
Shaohang Jia,Zhiyong Huang,Zhi Yu,Mingyang Hou,Shuai Miao,Han Yang
Main category: cs.CV
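TL;DR: Proposes Adaptive Distribution-aware Quantization (ADQ), a framework that improves the performance of low-bit neural networks through a dynamic codebook and a mixed-precision strategy.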
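A minimal sketch of the two codebook ideas this entry's abstract names, quantile-based initialization and EMA-based adaptation, may help make them concrete. It is not the authors' implementation: the function names, the mid-point quantile rule, the nearest-codeword assignment, and the `decay` value are all assumptions chosen for illustration, and the sketch uses only the Python standard library.

```python
from typing import List


def quantile_codebook(weights: List[float], bits: int) -> List[float]:
    """Place 2**bits codewords at evenly spaced quantiles of the weights."""
    levels = 2 ** bits
    ordered = sorted(weights)
    n = len(ordered)
    codebook = []
    for k in range(levels):
        q = (k + 0.5) / levels          # mid-point quantile of the k-th bin (assumed rule)
        idx = min(n - 1, int(q * n))    # nearest-rank index into the sorted weights
        codebook.append(ordered[idx])
    return codebook


def assign(weights: List[float], codebook: List[float]) -> List[int]:
    """Index of the nearest codeword for every weight."""
    return [min(range(len(codebook)), key=lambda j: abs(w - codebook[j]))
            for w in weights]


def ema_update(codebook: List[float], weights: List[float], decay: float = 0.9) -> List[float]:
    """Move each codeword toward the mean of the weights currently assigned to it."""
    idx = assign(weights, codebook)
    updated = list(codebook)
    for j in range(len(codebook)):
        members = [w for w, i in zip(weights, idx) if i == j]
        if members:                     # unused codewords are left unchanged
            mean = sum(members) / len(members)
            updated[j] = decay * codebook[j] + (1.0 - decay) * mean
    return updated


# Toy weights with a heavy tail; a uniform grid would waste levels on empty ranges.
weights = [-0.9, -0.2, -0.1, 0.0, 0.05, 0.1, 0.3, 1.2]
codebook = quantile_codebook(weights, bits=2)
codebook = ema_update(codebook, weights, decay=0.9)
quantized = [codebook[i] for i in assign(weights, codebook)]
```

In a training loop, `ema_update` would be called after each weight update so the codebook follows the drifting weight distribution; how ADQ schedules this, and how it handles gradients through the assignment, is not specified in this digest entry.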
Details
Motivation: Existing quantization methods struggle with highly non-uniform activation distributions and with static codebooks that are mismatched to the weights, which hurts model compression and deployment. Method: ADQ comprises quantile-based codebook initialization, an online codebook adaptation mechanism based on an exponential moving average, a sensitivity-informed mixed-precision allocation strategy, and a hardware-friendly non-uniform-to-uniform activation mapping. Result: On ImageNet, ResNet-18 reaches 71.512% Top-1 accuracy at an average bit-width of 2.81 bits, outperforming current methods; ablation experiments on CIFAR-10 confirm the contribution of each component. Conclusion: By dynamically adapting to the weight distribution and allocating mixed precision, ADQ improves the performance of low-bit quantized models and is practical to deploy. Abstract: Quantization-Aware Training (QAT) is a critical technique for deploying deep neural networks on resource-constrained devices. However, existing methods often face two major challenges: the highly non-uniform distribution of activations and the static, mismatched codebooks used in weight quantization. To address these challenges, we propose Adaptive Distribution-aware Quantization (ADQ), a mixed-precision quantization framework that employs a differentiated strategy. The core of ADQ is a novel adaptive weight quantization scheme comprising three key innovations: (1) a quantile-based initialization method that constructs a codebook closely aligned with the initial weight distribution; (2) an online codebook adaptation mechanism based on Exponential Moving Average (EMA) to dynamically track distributional shifts; and (3) a sensitivity-informed strategy for mixed-precision allocation. For activations, we integrate a hardware-friendly non-uniform-to-uniform mapping scheme. Comprehensive experiments validate the effectiveness of our method. On ImageNet, ADQ enables a ResNet-18 to achieve 71.512% Top-1 accuracy with an average bit-width of only 2.81 bits, outperforming state-of-the-art methods under comparable conditions. Furthermore, detailed ablation studies on CIFAR-10 systematically demonstrate the individual contributions of each innovative component, validating the rationale and effectiveness of our design.
[147] OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation
Guowei Xu,Yuxuan Bian,Ailing Zeng,Mingyi Shi,Shaoli Huang,Wen Li,Lixin Duan,Qiang Xu
Main category: cs.CV
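TL;DR: OmniMotion-X is a unified sequence-to-sequence framework based on an autoregressive diffusion transformer for generating whole-body human motion. It supports multiple input modalities (such as text, music, and speech) and complex control tasks, and it improves the consistency and quality of generated motion by using reference motion as a conditioning signal together with a weak-to-strong mixed-condition training strategy.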
Details
Motivation: Existing human motion generation methods are usually limited to a single modality or a specific task, cannot easily model multiple modalities and tasks in a unified way, and fall short when generating long, highly consistent animations, so a more general, controllable, and high-quality generation framework is needed. Method: OmniMotion-X uses an autoregressive diffusion transformer to handle multimodal inputs in a unified sequence-to-sequence manner; introduces reference motion as a new conditioning signal to improve the consistency of style and temporal dynamics; adopts a progressive weak-to-strong mixed-condition training strategy to mitigate multimodal conflicts; and builds OmniMoCap-X, the largest unified multimodal motion capture dataset to date, with structured hierarchical captions generated by GPT-4o. Result: It outperforms existing methods on multiple multimodal tasks, generating high-quality, coherent, and controllable long-duration motion and performing well on motion prediction, completion, in-betweening, and guided synthesis, which supports the generality of the framework. Conclusion: OmniMotion-X provides a powerful and flexible unified framework for multimodal human motion generation, advancing the generation of realistic animation from text, music, speech, and other inputs, with broad application potential. Abstract: This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.
[148] Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models
Xiaozhen Qiao,Jingkai Zhao,Yuqiu Jiang,Xianda Guo,Zhe Sun,Hongyuan Zhang,Xuelong Li
Main category: cs.CV
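TL;DR: This paper proposes CPL-NC, a lightweight test-time adaptation framework for vision-language models (VLMs) that uses a class-aware prototype cache and a negative contrastive learning mechanism to improve generalization under distribution shift, particularly for long-tailed distributions and confusion between semantically similar classes.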
Details
Motivation: Existing test-time adaptation methods perform poorly under distribution shift because they overlook prototype degradation in long-tailed distributions and confusion between semantically similar classes, so a method is needed that dynamically retains knowledge of rare classes and improves class separability. Method: The CPL-NC framework includes a Class-Aware Prototype Cache module (which dynamically adjusts per-class capacity and uses activation history with a rejuvenation mechanism) and a Negative Contrastive Learning mechanism (which identifies and constrains hard negative pairs), with an asymmetric optimization strategy that updates only the textual prototypes while keeping visual features fixed. Result: Experiments on 15 benchmarks show that CPL-NC outperforms existing TTA methods with both ResNet-50 and ViT-B/16 backbones, improving zero-shot out-of-distribution generalization. Conclusion: Through class-aware prototype management and negative contrastive learning, CPL-NC mitigates prototype degradation and class confusion, offering an efficient and practical test-time adaptation approach for robust deployment of VLMs in real-world settings. Abstract: Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a Class-Aware Prototype Cache Module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a Negative Contrastive Learning Mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.
[149] Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Yusu Qian,Eli Bocek-Rivele,Liangchen Song,Jialing Tong,Yinfei Yang,Jiasen Lu,Wenze Hu,Zhe Gan
Main category: cs.CV
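TL;DR: Pico-Banana-400K is a large-scale, high-quality dataset of 400K images for instruction-based image editing, built from real images, with diverse edit types and support for complex editing scenarios.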
Details
Motivation: Research is constrained by the lack of large-scale, high-quality, openly accessible datasets for text-guided image editing built from real images. Method: Nano-Banana is used to generate edit pairs from real photographs in OpenImages, with a fine-grained edit taxonomy and MLLM-based quality scoring for quality control and curation. Result: The resulting 400K-image dataset includes 72K multi-turn editing examples, 56K preference examples, and paired long-short instructions, supporting single-turn and multi-turn editing, alignment research, and instruction rewriting. Conclusion: Pico-Banana-400K provides a solid data foundation for training and evaluating the next generation of text-guided image editing models. Abstract: Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
[150] How to Evaluate Monocular Depth Estimation?
Siyang Wu,Jack Nugent,Willow Yang,Jia Deng
Main category: cs.CV
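TL;DR: This paper quantitatively analyzes existing evaluation metrics for monocular depth estimation and finds that they are insensitive to curvature perturbations (such as making flat surfaces wavy). The authors propose a new metric based on relative surface normals, along with new depth visualization tools and a method for building composite metrics that agree better with human judgment.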
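The claimed insensitivity to curvature can be illustrated with a toy 1D depth profile. The sketch below is an assumption-laden illustration, not the paper's metric: `abs_rel` is the commonly reported mean absolute relative error, while `mean_normal_error_deg` is a hypothetical stand-in that compares tangent (equivalently, 2D normal) angles between neighbouring samples. The wall distance, ripple amplitude, and sample spacing are made-up values.

```python
import math


def abs_rel(pred, gt):
    """Mean absolute relative depth error."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def slope_angles(depth, dx):
    """Angle (radians) of the surface tangent between neighbouring samples."""
    return [math.atan2(depth[i + 1] - depth[i], dx) for i in range(len(depth) - 1)]


def mean_normal_error_deg(pred, gt, dx):
    """Mean angular difference between predicted and true 2D surface normals.

    In 2D the normal is the tangent rotated by 90 degrees, so the angular
    difference between normals equals the difference between tangent angles.
    """
    diffs = [abs(a - b) for a, b in zip(slope_angles(pred, dx), slope_angles(gt, dx))]
    return math.degrees(sum(diffs) / len(diffs))


dx = 0.01                                   # sample spacing along the surface (metres)
xs = [i * dx for i in range(1000)]
flat = [10.0 for _ in xs]                   # flat wall 10 m from the camera
wavy = [10.0 + 0.01 * math.sin(50.0 * x) for x in xs]  # 1 cm ripples on the wall

depth_error = abs_rel(wavy, flat)           # stays below 0.1% of the depth
normal_error = mean_normal_error_deg(wavy, flat, dx)   # well over ten degrees
```

With these values the relative depth error is bounded by 0.001 because the ripples are only 1 cm at 10 m, yet the surface orientation changes by more than ten degrees on average, which is the kind of gap a normal-based metric is meant to expose.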
Details
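Motivation: The literature lacks standardization, and the trade-offs and behaviors of the many evaluation metrics are not well understood, which makes it hard to evaluate monocular depth estimation methods effectively. Method: The sensitivity of existing metrics to different types of ground-truth depth perturbations is analyzed quantitatively, with emphasis on comparison to human judgment, to identify their weaknesses; the paper then proposes a new metric based on relative surface normals, new visualization tools, and a principled method for building composite metrics. Result: Existing metrics are severely under-sensitive to curvature perturbations; the new metric reflects human perception better; and the composite-metric construction method improves agreement with human judgment. Conclusion: Current depth estimation metrics have clear shortcomings, especially for changes in geometric shape, and the proposed methods bring evaluation results closer to human perception. Abstract: Monocular depth estimation is an important task with rapid progress, but how to evaluate it remains an open question, as evidenced by a lack of standardization in existing literature and a large selection of evaluation metrics whose trade-offs and behaviors are not well understood. This paper contributes a novel, quantitative analysis of existing metrics in terms of their sensitivity to various types of perturbations of ground truth, emphasizing comparison to human judgment. Our analysis reveals that existing metrics are severely under-sensitive to curvature perturbation such as making flat surfaces wavy. To remedy this, we introduce a new metric based on relative surface normals, along with new depth visualization tools and a principled method to create composite metrics with better human alignment. Code and data are available at: https://github.com/princeton-vl/evalmde.
[151] olmOCR 2: Unit Test Rewards for Document OCR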
Jake Poznanski,Luca Soldaini,Kyle Lo
Main category: cs.CV
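TL;DR: olmOCR 2 is a state-of-the-art OCR system built on a 7B vision language model; reinforcement learning with verifiable rewards (RLVR) improves its conversion of math formulas, tables, and multi-column layouts in English-language documents.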
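The idea of using binary unit tests as a verifiable reward can be sketched as follows. This is only an illustration under stated assumptions: the two test types (`text_present`, `appears_before`), the example strings, and the choice to score an output by its pass fraction are hypothetical and are not taken from olmOCR 2's actual test suite or reward definition.

```python
from typing import Callable, List

UnitTest = Callable[[str], bool]


def text_present(snippet: str) -> UnitTest:
    """Pass if the snippet appears in the OCR output."""
    return lambda output: snippet in output


def appears_before(first: str, second: str) -> UnitTest:
    """Pass if both snippets appear and `first` precedes `second` (reading order)."""
    def check(output: str) -> bool:
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return check


def reward(output: str, tests: List[UnitTest]) -> float:
    """Fraction of binary unit tests that the OCR output passes (assumed aggregation)."""
    if not tests:
        return 0.0
    return sum(1 for t in tests if t(output)) / len(tests)


# Hypothetical tests derived from a synthetic page whose source text is known.
tests = [
    text_present("Total revenue"),
    text_present("E = mc^2"),
    appears_before("Introduction", "Conclusion"),
]
good = "Introduction\nTotal revenue rose.\nE = mc^2\nConclusion"
bad = "Conclusion\nTotal revenue rose.\nIntroduction"

good_reward = reward(good, tests)   # all three tests pass
bad_reward = reward(bad, tests)     # only the first test passes
```

Because each test is a deterministic check against known ground truth, the reward can be computed without a learned judge, which is the property that makes such rewards "verifiable".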
Details
Motivation: To convert scanned print documents (such as PDFs) efficiently and accurately into clean, naturally ordered plain text, especially for hard layouts such as formulas and tables. Method: A specialized 7B vision language model (olmOCR-2-7B-1025) is trained with reinforcement learning with verifiable rewards (RLVR), where the reward signal comes from a diverse set of binary unit tests; test cases are created at scale by generating synthetic documents with known HTML source. Result: It achieves state-of-the-art performance on olmOCR-Bench, the authors' English-language OCR benchmark, with the largest gains over previous versions in math formula conversion, table parsing, and multi-column layouts. Conclusion: By combining synthetic data generation with unit-test-based reinforcement learning, olmOCR 2 improves OCR quality on complex documents, and the model, data, and code are openly released. Abstract: We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
[152] Is This Tracker On? A Benchmark Protocol for Dynamic Tracking
Ilona Demler,Saumya Chauhan,Georgia Gkioxari
Main category: cs.CV
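TL;DR: ITTO is a new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods; it reveals the shortcomings of existing methods in handling complex motion and re-identification after occlusion.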