cs.CL
[1] DeliberationBench: When Do More Voices Hurt? A Controlled Study of Multi-LLM Deliberation Protocols
Vaarunay Kaushal,Taranveer Singh
Main category: cs.CL
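TL;DR: DELIBERATIONBENCH shows that in multi-LLM systems, simply selecting the best single output significantly outperforms complex deliberation protocols, with a 6x higher win rate at lower cost, challenging the common assumption that complexity improves quality.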
Details
Motivation: Multi-LLM deliberation systems have attracted wide attention, but their practical value over simpler methods has not been rigorously tested; this paper evaluates their actual effectiveness through controlled experiments.
Method: Introduces the DELIBERATIONBENCH benchmark, which compares three deliberation protocols against a strong baseline that selects the best single output, across 270 questions and three random seeds (810 evaluations in total).
Result: The best-single baseline reaches a win rate of 82.5% ± 3.3%, far above the 13.8% ± 2.6% of the best deliberation protocol; the 6x gap is statistically significant (p < 0.01), and the baseline's computational cost is 1.5-2.5x lower.
Conclusion: In the current setting, complex deliberation mechanisms bring no performance gain and cost more, which calls into question the common belief that complexity equals quality in multi-LLM systems.
Abstract: Multi-agent systems where Large Language Models (LLMs) deliberate to form consensus have gained significant attention, yet their practical value over simpler methods remains under-scrutinized. We introduce DELIBERATIONBENCH, a controlled benchmark evaluating three deliberation protocols against a strong baseline of selecting the best response from a pool of model outputs. Across 270 questions and three independent seeds (810 total evaluations), we find a striking negative result: the best-single baseline achieves an 82.5% ± 3.3% win rate, dramatically outperforming the best deliberation protocol (13.8% ± 2.6%). This 6.0x performance gap is statistically significant (p < 0.01), and deliberation comes at 1.5-2.5x higher computational cost. Our findings challenge assumptions that complexity enhances quality in multi-LLM systems.
[2] A Review: PTSD in Pre-Existing Medical Condition on Social Media
Zaber Al Hassan Ayon,Nur Hafieza Ismail,Nur Shazwani Kamarudin
Main category: cs.CL
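TL;DR: This review examines how post-traumatic stress disorder (PTSD) manifests and is managed among people with chronic illnesses on social media. It finds that natural language processing and machine learning techniques can effectively identify potential cases, and it highlights the role of online support communities in intervention and coping strategies.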
Details
Motivation: PTSD presents in complex ways in individuals with chronic illnesses and is often overlooked; the review aims to characterize how this population expresses psychological distress on social media in order to improve identification and intervention.
Method: Systematically analyzes literature from 2008 to 2024 that applies natural language processing (NLP) and machine learning (ML) techniques to user data from platforms such as X and Facebook.
Result: NLP and ML models identify PTSD with accuracy between 74% and 90%; social media data reveals the unique challenges faced by people with comorbid conditions, and online communities have a positive effect on early intervention and coping strategies.
Conclusion: The context of chronic illness should be incorporated into PTSD research and treatment; social media has great potential as a monitoring and support tool, and targeted interventions need to be developed.
Abstract: Post-Traumatic Stress Disorder (PTSD) is a multifaceted mental health condition, particularly challenging for individuals with pre-existing medical conditions. This review critically examines the intersection of PTSD and chronic illnesses as expressed on social media platforms. By systematically analyzing literature from 2008 to 2024, the study explores how PTSD manifests and is managed in individuals with chronic conditions such as cancer, heart disease, and autoimmune disorders, with a focus on online expressions on platforms like X (formerly known as Twitter) and Facebook. Findings demonstrate that social media data offers valuable insights into the unique challenges faced by individuals with both PTSD and chronic illnesses. Specifically, natural language processing (NLP) and machine learning (ML) techniques can identify potential PTSD cases among these populations, achieving accuracy rates between 74% and 90%. Furthermore, the role of online support communities in shaping coping strategies and facilitating early interventions is highlighted. This review underscores the necessity of incorporating considerations of pre-existing medical conditions in PTSD research and treatment, emphasizing social media's potential as a monitoring and support tool for vulnerable groups. Future research directions and clinical implications are also discussed, with an emphasis on developing targeted interventions.
[3] From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
Piercosma Bisconti,Marcello Galisai,Matteo Prandi,Federico Pierucci,Olga Sorokoletova,Francesco Giarrusso,Vincenzo Suriani,Marcantonio Brancale,Daniele Nardi
Main category: cs.CL
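TL;DR: Proposes Adversarial Tales, a new jailbreak attack that uses cyberpunk narratives and the functional-analysis framework of Propp's morphology of folktales to hide harmful requests inside culturally coded narratives, inducing large models to reconstruct harmful content during a structural analysis task. The method reaches an average attack success rate of 71.3% across 26 frontier models, showing that structure-based jailbreaks are a broad class of vulnerability.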
Details
Motivation: Existing LLM safety mechanisms struggle to withstand harmful requests reframed through culturally coded structures, so deeper structural jailbreaks need to be explored to expose the underlying weakness.
Method: Embeds harmful content in cyberpunk narratives and prompts the model to perform functional analysis based on Vladimir Propp's morphology of folktales, recasting the jailbreak as a seemingly legitimate analysis of narrative structure.
Result: Tested on 26 frontier models from nine providers, the attack reaches an average success rate of 71.3%, and every model family shows substantial vulnerability; together with the earlier Adversarial Poetry study, this indicates that structural jailbreaks form a widespread vulnerability class.
Conclusion: Culturally coded structural jailbreaks pose a serious threat that is hard to enumerate exhaustively, and pattern-matching defenses alone are insufficient; research grounded in mechanistic interpretability should investigate how narrative cues reshape model representations and train models to recognize harmful intent independently of surface form.
Abstract: Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally-grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely inexhaustible by pattern-matching defenses alone. Understanding why these attacks succeed is therefore essential: we outline a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form.
[4] Companion Agents: A Table-Information Mining Paradigm for Text-to-SQL
Jiahui Chen,Lei Fu,Jian Cui,Yu Lei,Zhenning Dong
Main category: cs.CL
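TL;DR: Proposes Companion Agents (CA), a new paradigm that improves Text-to-SQL accuracy when annotations are missing by pre-caching query-relevant knowledge on the database side.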
Details
Motivation: Existing Text-to-SQL systems depend on complete and accurate database annotations and external knowledge, and so adapt poorly to real industrial settings where annotations are missing or wrong.
Method: Designs a group of agents that accompany the database schema (Companion Agents) and mine and consolidate fine-grained information in advance, including inter-table relations, value-domain distributions, statistical regularities, and semantic cues; the relevant knowledge is activated on demand at inference time.
Result: Under the fully missing evidence setting on BIRD, CA raises the execution accuracy of RSL-SQL, CHESS, and DAIL-SQL by +4.49, +4.37, and +14.13 points respectively, with larger gains on the Challenging subset (+9.65, +7.58, and +16.71).
Conclusion: Through automated database-side knowledge mining, CA offers a feasible path to industrial-grade Text-to-SQL deployment that does not require human annotation.
Abstract: Large-scale Text-to-SQL benchmarks such as BIRD typically assume complete and accurate database annotations as well as readily available external knowledge, which fails to reflect common industrial settings where annotations are missing, incomplete, or erroneous. This mismatch substantially limits the real-world applicability of state-of-the-art (SOTA) Text-to-SQL systems. To bridge this gap, we explore a database-centric approach that leverages intrinsic, fine-grained information residing in relational databases to construct missing evidence and improve Text-to-SQL accuracy under annotation-scarce conditions. Our key hypothesis is that when a query requires multi-step reasoning over extensive table information, existing methods often struggle to reliably identify and utilize the truly relevant knowledge. We therefore propose to "cache" query-relevant knowledge on the database side in advance, so that it can be selectively activated at inference time. Based on this idea, we introduce Companion Agents (CA), a new Text-to-SQL paradigm that incorporates a group of agents accompanying database schemas to proactively mine and consolidate hidden inter-table relations, value-domain distributions, statistical regularities, and latent semantic cues before query generation. Experiments on BIRD under the fully missing evidence setting show that CA recovers +4.49 / +4.37 / +14.13 execution accuracy points on RSL-SQL / CHESS / DAIL-SQL, respectively, with larger gains on the Challenging subset (+9.65 / +7.58 / +16.71). These improvements stem from CA's automatic database-side mining and evidence construction, suggesting a practical path toward industrial-grade Text-to-SQL deployment without reliance on human-curated evidence.
[5] Recursive Knowledge Synthesis for Multi-LLM Systems: Stability Analysis and Tri-Agent Audit Framework
Toshiyuki Shigemura
Main category: cs.CL
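TL;DR: Proposes a tri-agent cross-validation framework that builds stable and explainable multi-model systems through recursive interaction among three heterogeneous large language models; experiments show the architecture can achieve recursive knowledge synthesis in publicly deployed environments.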
Details
Motivation: To address the limits of a single LLM in stability and explainability, the paper explores a reasoning architecture for multi-model collaboration that is both safe and consistent.
Method: Builds a tri-agent framework covering semantic generation, consistency analysis, and transparency auditing, introduces the Recursive Knowledge Synthesis (RKS) mechanism, models it with fixed-point theory, and evaluates the system in 47 trials in public-access environments.
Result: The system maintained a Transparency Score of at least 0.8 in 68% of trials, with a mean RRS of 0.78 ± 0.06; 89% of trials converged, supporting the role of transparency auditing as a contraction mapping.
Conclusion: The tri-agent framework can achieve stable recursive knowledge synthesis under non-API, public-access conditions, providing empirical support for safe and controllable multi-LLM systems.
Abstract: This paper presents a tri-agent cross-validation framework for analyzing stability and explainability in multi-model large language systems. The architecture integrates three heterogeneous LLMs (used for semantic generation, analytical consistency checking, and transparency auditing) into a recursive interaction cycle. This design induces Recursive Knowledge Synthesis (RKS), where intermediate representations are continuously refined through mutually constraining transformations irreducible to single-model behavior. Across 47 controlled trials using public-access LLM deployments (October 2025), we evaluated system stability via four metrics: Reflex Reliability Score (RRS), Transparency Score (TS), Deviation Detection Rate (DDR), and Correction Success Rate (CSR). The system achieved mean RRS = 0.78 ± 0.06 and maintained TS >= 0.8 in about 68% of trials. Approximately 89% of trials converged, supporting the theoretical prediction that transparency auditing acts as a contraction operator within the composite validation mapping. The contributions are threefold: (1) a structured tri-agent framework for coordinated reasoning across heterogeneous LLMs, (2) a formal RKS model grounded in fixed-point theory, and (3) empirical evaluation of inter-model stability under realistic, non-API public-access conditions. These results provide initial empirical evidence that a safety-preserving, human-supervised multi-LLM architecture can achieve stable recursive knowledge synthesis in realistic, publicly deployed environments.
[6] Consistency-Aware Editing for Entity-level Unlearning in Language Models
Xiaoqi Han,Víctor Gutiérrez-Basulto,Ru Li,Xiaoli Li,Jiye Liang,Jeff Z. Pan
Main category: cs.CL
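TL;DR: Proposes a consistency-aware editing (CAE) framework for efficient and robust entity-level unlearning in large language models; using a low-rank update with consistency regularization, optimized jointly over multiple prompts, it achieves comprehensive removal of knowledge about the target entity.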
Details
Motivation: Existing entity unlearning methods mostly rely on full-model fine-tuning or prompt-based approaches, which are computationally expensive and brittle to paraphrased queries, while existing model editing techniques mainly target instance-level updates and cannot thoroughly remove all knowledge of an entity.
Method: Proposes the consistency-aware editing (CAE) framework, which aggregates diverse prompts related to the target entity (attributes, relations, and adversarial paraphrases) and learns a low-rank update with a consistency regularizer that aligns the editing directions across prompts, yielding robust forgetting.
Result: Validated on the RWKU and ToFU benchmarks, CAE significantly outperforms traditional unlearning and editing baselines in forgetting accuracy and robustness, and scalable entity removal is found to need only tens of carefully selected prompts.
Conclusion: CAE offers an effective solution for entity-level unlearning, sheds light on how entity knowledge is represented and deleted inside the model, and supports efficient, comprehensive model editing with little interference.
Abstract: Large language models (LLMs) risk retaining sensitive, copyrighted, or harmful information from their training data. Entity-level unlearning addresses this issue by removing all knowledge of a specific entity while preserving the model's overall capabilities. Existing approaches typically rely on full-model fine-tuning or prompt-based interventions, which can be computationally expensive or brittle when handling paraphrased queries. Recently, model editing has emerged as an efficient alternative for updating knowledge in LLMs, offering a promising direction for unlearning. However, existing editing techniques are typically designed for instance-level updates, modifying responses to specific attributes of an entity rather than eliminating all knowledge associated with the entity. In this paper, we investigate how editing techniques can be adapted for effective and efficient entity-level unlearning. To this end, we introduce a novel consistency-aware editing (CAE) framework. CAE aggregates a diverse set of prompts related to a target entity, including its attributes, relations, and adversarial paraphrases. It then jointly learns a low-rank update guided by a consistency regularizer that aligns the editing directions across prompts. This promotes robust and comprehensive forgetting while minimizing interference with unrelated knowledge. We further examine where different entities are stored within the model and how many diverse prompts are needed for successful unlearning. We evaluate CAE on two challenging benchmarks, RWKU and ToFU, and demonstrate that it (i) provides insights into how entity-level knowledge is internally represented and deleted in LLMs, (ii) significantly improves forgetting accuracy and robustness over traditional unlearning and editing baselines, and (iii) enables scalable entity removal using only tens of carefully selected prompts.
[7] Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents
Mihael Arcan
Main category: cs.CL
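TL;DR: Proposes a modular pipeline combining unsupervised clustering and supervised classification that uses hybrid representations, fusing subject-predicate-object triples extracted from paper abstracts with the original text, to improve paper classification and clustering. Experiments show that hybrid representations containing triples improve classification performance (up to 92.6% accuracy), while lightweight sentence encoders perform better on clustering.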
Details
Motivation: Scientific literature is growing rapidly in volume and complexity, and effective methods are needed to organize and understand it. Representations based on full text or abstracts alone may miss key semantic structure, so the study explores how structured knowledge such as subject-predicate-object triples can strengthen text representations for clustering and classifying scientific documents.
Method: Proposes a modular pipeline that first extracts subject-predicate-object triples from arXiv abstracts and builds four text representations (raw abstract, triples only, concatenated fusion, and weighted fusion); embeds them with four transformer models (MiniLM, MPNet, SciBERT, SPECTER); evaluates unsupervised clustering with KMeans, GMM, and HDBSCAN; and fine-tunes models for supervised classification of arXiv subjects.
Result: Full abstract text yields the most coherent clusters; hybrid representations containing triples consistently outperform text-only representations on classification, reaching up to 92.6% accuracy and 0.925 macro-F1; lightweight encoders (MiniLM, MPNet) outperform domain-specific models (SciBERT, SPECTER) on clustering, while SciBERT classifies best on structured input.
Conclusion: Combining unstructured text with structured knowledge such as triples provides complementary benefits, and hybrid representations improve the classification of scientific documents, offering a new and effective route to their semantic organization.
Abstract: The increasing volume and complexity of scientific literature demand robust methods for organizing and understanding research documents. In this study, we explore how structured knowledge, specifically subject-predicate-object triples, can enhance the clustering and classification of scientific papers. We propose a modular pipeline that combines unsupervised clustering and supervised classification over multiple document representations: raw abstracts, extracted triples, and hybrid formats that integrate both. Using a filtered arXiv corpus, we extract relational triples from abstracts and construct four text representations, which we embed using four state-of-the-art transformer models: MiniLM, MPNet, SciBERT, and SPECTER. We evaluate the resulting embeddings with KMeans, GMM, and HDBSCAN for unsupervised clustering, and fine-tune classification models for arXiv subject prediction. Our results show that full abstract text yields the most coherent clusters, but that hybrid representations incorporating triples consistently improve classification performance, reaching up to 92.6% accuracy and 0.925 macro-F1. We also find that lightweight sentence encoders (MiniLM, MPNet) outperform domain-specific models (SciBERT, SPECTER) in clustering, while SciBERT excels in structured-input classification. These findings highlight the complementary benefits of combining unstructured text with structured knowledge, offering new insights into knowledge-infused representations for semantic organization of scientific documents.
[8] Resisting Correction: How RLHF Makes Language Models Ignore External Safety Signals in Natural Conversation
Felipe Biava Cataneo
Main category: cs.CL
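TL;DR: Instruction-tuned language models follow external safety signals under explicit commands but ignore them in natural conversation, an emergent property of RLHF that prioritizes conversational fluency over safety calibration.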
Details
Motivation: Studies how controllable language models are by external safety-monitoring signals across interaction modes, especially in the natural conversational setting that users expect in real deployments.
Method: Using Llama-3.2-3B on the GSM8K dataset, runs causal intervention experiments that inject external confidence signals and compares how consistently base and instruction-tuned models respond under different prompt strategies.
Result: The base model is almost fully controllable (Spearman rho ≈ 1.0); the instruction-tuned model complies well under explicit commands (rho = 0.93) but almost ignores the signal in natural conversation (rho = 0.04, bias +40%); internal token-level confidence in small models is uninformative (r = 0.035).
Conclusion: RLHF optimization leads models to favor fluency over external safety calibration in natural conversation, creating a deployment-critical weakness: the interaction style users rely on most is exactly where safety control is hardest to apply.
Abstract: Safety architectures for language models increasingly rely on external monitors to detect errors and inject corrective signals at inference time. For such systems to function in interactive settings, models must be able to incorporate externally provided confidence information into their verbal responses. In this work, we test whether instruction-tuned language models preserve this controllability across different interaction modes. Using Llama-3.2-3B on GSM8K, we perform a causal intervention study in which explicit external confidence signals are injected and model compliance is measured under multiple prompt strategies. We find that base models exhibit near-perfect controllability (Spearman rho close to 1.0), while instruction-tuned models display a striking context dependence: they fully comply with external corrections under explicit command prompts (bias approximately 0 percent, rho = 0.93), yet systematically ignore the same signals in natural conversational queries (bias plus 40 percent, rho = 0.04). This behavior is not a capability failure, since the model can process the signal; rather, it is an emergent property of RLHF optimization that prioritizes conversational fluency over external calibration cues in natural dialogue. We further show that internal token-level confidence in small models is uninformative (r = 0.035), underscoring the necessity of external supervision. Our findings highlight a deployment-critical failure mode: the interaction style users expect is precisely where safety corrections are least effective.
[9] Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness
Haotian Deng,Chris Farber,Jiyoon Lee,David Tang
Main category: cs.CL
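TL;DR: Systematically evaluates large language models (LLMs) on rubric-based automated short-answer grading. Agreement with expert grading is high on simple tasks but declines as the rubric becomes more fine-grained; a "Trust Curve" analysis shows that confidence filtering improves accuracy, and the model is sensitive to synonym substitution.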
Details
Motivation: Automated short-answer grading (ASAG) remains challenging because of the linguistic variability of student responses and the need for fine-grained, rubric-aligned scoring; the reliability of LLMs as graders needs systematic evaluation.
Method: Using the SciEntsBank benchmark and Qwen 2.5-72B, studies three aspects: how well LLM grading aligns with expert judgment across rubric complexities, the trade-off between uncertainty and accuracy under a consensus-based deferral mechanism, and robustness under random perturbations and adversarial attacks.
Result: LLM grading agrees strongly with experts on binary tasks but degrades as rubric granularity increases; filtering out low-confidence predictions improves accuracy on the remaining samples; the model is resilient to prompt injection but vulnerable to synonym substitution.
Conclusion: Rubric-conditioned LLM grading is promising but limited, and reliable deployment requires uncertainty estimation and robustness testing.
Abstract: Automated short-answer grading (ASAG) remains a challenging task due to the linguistic variability of student responses and the need for nuanced, rubric-aligned partial credit. While Large Language Models (LLMs) offer a promising solution, their reliability as automated judges in rubric-based settings requires rigorous assessment. In this paper, we systematically evaluate the performance of LLM-judges for rubric-based short-answer grading. We investigate three key aspects: the alignment of LLM grading with expert judgment across varying rubric complexities, the trade-off between uncertainty and accuracy facilitated by a consensus-based deferral mechanism, and the model's robustness under random input perturbations and adversarial attacks. Using the SciEntsBank benchmark and Qwen 2.5-72B, we find that alignment is strong for binary tasks but degrades with increased rubric granularity. Our "Trust Curve" analysis demonstrates a clear trade-off where filtering low-confidence predictions improves accuracy on the remaining subset. Additionally, robustness experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions. Our work provides critical insights into the capabilities and limitations of rubric-conditioned LLM judges, highlighting the importance of uncertainty estimation and robustness testing for reliable deployment.
[10] Emissions and Performance Trade-off Between Small and Large Language Models
Anandita Garg,Uma Gaba,Deepan Muthirayan,Anish Roy Chowdhury
Main category: cs.CL
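TL;DR: Compares the trade-off between performance and carbon emissions for large language models (LLMs) and fine-tuned small language models (SLMs) on natural language processing, reasoning, and programming tasks, finding that SLMs perform comparably on most tasks while substantially reducing emissions.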
Details
Motivation: The high energy consumption and large carbon footprint of LLM training and inference have raised concerns about the sustainability of AI, motivating research into greener alternatives.
Method: Compares the performance and carbon emissions of LLMs and fine-tuned SLMs on selected tasks, assessing both task performance and environmental impact across NLP, reasoning, and programming.
Result: In four of the six tasks, SLMs maintained performance comparable to LLMs while significantly reducing carbon emissions during inference.
Conclusion: Fine-tuned SLMs can serve as a sustainable alternative to LLMs, substantially reducing the carbon footprint while preserving performance and advancing green AI.
Abstract: The advent of Large Language Models (LLMs) has raised concerns about their enormous carbon footprint, starting with energy-intensive training and continuing through repeated inference. This study investigates the potential of using fine-tuned Small Language Models (SLMs) as a sustainable alternative for predefined tasks. Here, we present a comparative analysis of the performance-emissions trade-off between LLMs and fine-tuned SLMs across selected tasks under Natural Language Processing, Reasoning and Programming. Our results show that in four out of the six selected tasks, SLMs maintained comparable performance with a significant reduction in carbon emissions during inference. Our findings demonstrate the viability of smaller models in mitigating the environmental impact of resource-heavy LLMs, thus advancing towards sustainable, green AI.
[11] Directional Attractors in LLM Reasoning: How Similarity Retrieval Steers Iterative Summarization Based Reasoning
Cagatay Tekin,Charbel Barakat,Luis Joseph Luna Limgenco
Main category: cs.CL
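TL;DR: InftyThink with Cross-Chain Memory introduces an embedding-based semantic cache that stores and reuses successful reasoning patterns, improving long-horizon reasoning accuracy but showing limitations in heterogeneous domains.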
Details
Motivation: Existing iterative reasoning frameworks repeatedly regenerate similar strategies and lack a mechanism for reusing reasoning that previously worked, which leads to inefficiency and performance bottlenecks.
Method: Adds a cross-chain memory module to InftyThink that retrieves previously successful reasoning lemmas by semantic similarity and uses them to guide the current reasoning process without expanding the context window.
Result: Accuracy improves in structured domains on MATH500, AIME2024, and GPQA-Diamond; geometric analysis shows that cache guidance forms two kinds of attractors, "fix" and "break", which explains its different behavior on homogeneous and heterogeneous tasks.
Conclusion: Similarity-based memory can effectively strengthen self-improving LLM reasoning, but its benefit is bounded by domain consistency across tasks.
Abstract: Iterative summarization based reasoning frameworks such as InftyThink enable long-horizon reasoning in large language models (LLMs) by controlling context growth, but they repeatedly regenerate similar reasoning strategies across tasks. We introduce InftyThink with Cross-Chain Memory, an extension that augments iterative reasoning with an embedding-based semantic cache of previously successful reasoning patterns. At each reasoning step, the model retrieves and conditions on the most semantically similar stored lemmas, guiding inference without expanding the context window indiscriminately. Experiments on MATH500, AIME2024, and GPQA-Diamond demonstrate that semantic lemma retrieval improves accuracy in structured domains while exposing failure modes in tests that include heterogeneous domains. Geometric analyses of reasoning trajectories reveal that cache retrieval induces directional biases in embedding space, leading to consistent fix (improvement over baseline accuracy) and break (degradation from baseline accuracy) attractors. Our results highlight both the benefits and limits of similarity-based memory for self-improving LLM reasoning.
[12] Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe
JV Roig
Main category: cs.CL
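TL;DR: Proposes RIKER, a replicable evaluation methodology based on paradigm inversion: documents are generated from known ground truth, enabling large-scale, contamination-resistant evaluation of knowledge systems without human annotation.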
Details
Motivation: Current evaluation of knowledge systems suffers from static benchmarks that are vulnerable to contamination, biased LLM judges, and ground truth extraction that depends on expensive human annotation.
Method: Applies paradigm inversion, generating documents from structured ground truth to build regenerable corpora that support deterministic scoring and scalable evaluation.
Result: An evaluation of 33 models over more than 21 billion tokens shows that claimed context lengths often exceed usable capacity, that cross-document aggregation is much harder than single-document extraction, and that grounding ability and hallucination resistance are distinct capabilities.
Conclusion: RIKER provides a domain-agnostic, scalable, and contamination-resistant evaluation framework for any setting where synthetic documents can be generated from structured ground truth.
Abstract: Evaluating knowledge systems (LLMs, RAG, knowledge graphs, etc.) faces fundamental challenges: static benchmarks are vulnerable to contamination, LLM-based judges exhibit systematic biases, and ground truth extraction requires expensive human annotation. We present RIKER (Retrieval Intelligence and Knowledge Extraction Rating), both a benchmark and a replicable methodology based on paradigm inversion - generating documents from known ground truth rather than extracting ground truth from documents. This approach enables deterministic scoring and scalable evaluation without human annotation or reference models, and contamination resistance through regenerable corpora. Our evaluation of 33 models using over 21 billion tokens reveals that context length claims frequently exceed usable capacity, with significant degradation beyond 32K tokens; cross-document aggregation proves substantially harder than single-document extraction; and grounding ability and hallucination resistance are distinct capabilities - models excelling at finding facts that exist may still fabricate facts that do not. Beyond the specific benchmark, we contribute a domain-agnostic methodology for constructing scalable and contamination-resistant evaluations wherever synthetic documents can be generated from structured ground truth.
[13] PediaMind-R1: A Temperament-Aware Language Model for Personalized Early Childhood Care Reasoning via Cognitive Modeling and Preference Alignment
Zihe Zhang,Can Zhang,Yanheng Xu,Xin Hu,Jichao Leng
Main category: cs.CL
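TL;DR: Presents PediaMind-R1, a personalized parenting large language model grounded in temperament theory and developmental psychology; two-stage training lets it generate parenting advice that is logically consistent, professional, and empathetic.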
Details
Motivation: Conventional parenting systems give generic advice and do not account for individual differences among infants and toddlers, such as temperament type, so they fall short of real personalization.
Method: Introduces Thomas-Chess temperament theory to build a temperament knowledge graph for children aged 0-3, and trains in two stages: supervised fine-tuning first teaches structured chain-of-thought reasoning, then GRPO-based alignment strengthens logical consistency, domain expertise, and empathy.
Result: An evaluation framework with temperament-sensitive multiple-choice tests and human assessment shows that PediaMind-R1 can accurately interpret infant and toddler temperament profiles and reason in an individualized way.
Conclusion: Combining vertical-domain modeling with psychological theory can effectively improve an LLM's capacity for active personalization in sensitive caregiving scenarios, offering a new path for user-centered language models.
Abstract: This paper presents PediaMind-R1, a domain-specialized large language model designed to achieve active personalization in intelligent parenting scenarios. Unlike conventional systems that provide generic suggestions, PediaMind-R1 draws on insights from developmental psychology. It introduces temperament theory from the Thomas-Chess framework and builds a temperament knowledge graph for infants and toddlers (0-3 years). Our two-stage training pipeline first uses supervised fine-tuning to teach structured chain-of-thought reasoning, and then applies a GRPO-based alignment stage to reinforce logical consistency, domain expertise, and empathetic caregiving strategies. We further design an evaluation framework comprising temperament-sensitive multiple-choice tests and human assessments. The results demonstrate that PediaMind-R1 can accurately interpret early childhood temperament profiles and proactively engage in individualized reasoning. This work highlights the value of integrating vertical-domain modeling with psychological theory. It offers a novel approach to developing user-centered LLMs that advance the practice of active personalization in sensitive caregiving contexts.
[14] Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment
Manas Khatore,Sumana Sridharan,Kevork Sulahian,Benjamin J. Smith,Shi Feng
Main category: cs.CL
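TL;DR: Examines the robustness of automated answer matching with large language models (LLMs), finding that common strategic manipulations (verbosity, guessing, or giving multiple answers) do not raise scores and often lower them, and that binary scoring resists attacks better than continuous scoring.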
Details
Motivation: To ensure that automated answer matching is reliable and that scores cannot be artificially inflated through strategic tactics such as guessing, verbosity, or embedding conflicting answers.
Method: Systematically tests three attacks on matcher models by prompting examinee models to generate verbose responses, to give multiple answers when uncertain, or to place the correct answer near the start and embed contradictory content, and compares the robustness of binary and continuous scoring.
Result: These manipulations fail to raise scores and often lower them; binary scoring withstands attacks better than continuous scoring.
Conclusion: Automated answer matching is fairly robust to inexpensive text manipulation and, when reference answers are available, is a reliable alternative to human or LLM-as-a-judge evaluation.
Abstract: Automated answer matching, which leverages LLMs to evaluate free-text responses by comparing them to a reference answer, shows substantial promise as a scalable and aligned alternative to human evaluation. However, its reliability requires robustness against strategic attacks such as guesswork or verbosity that may artificially inflate scores without improving actual correctness. In this work, we systematically investigate whether such tactics deceive answer matching models by prompting examinee models to: (1) generate verbose responses, (2) provide multiple answers when unconfident, and (3) embed conflicting answers with the correct answer near the start of their response. Our results show that these manipulations do not increase scores and often reduce them. Additionally, binary scoring (which requires a matcher to answer with a definitive "correct" or "incorrect") is more robust to attacks than continuous scoring (which requires a matcher to determine partial correctness). These findings show that answer matching is generally robust to inexpensive text manipulation and is a viable alternative to traditional LLM-as-a-judge or human evaluation when reference answers are available.
[15] Más contexto no es mejor. Paradoja de la dilución vectorial en RAG corporativos
Alex Dantart
Main category: cs.CL
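TL;DR: Studies how the summaries injected by "Contextualized Chunking" affect RAG systems, identifies a "vector dilution" problem, shows experimentally an inverted-U relationship between injection ratio and performance, and proposes a theoretical framework for calculating the optimal injection ratio.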
Details
Motivation: Contextualized chunking improves contextual understanding in RAG by injecting summaries, but it may dilute local content and hurt retrieval precision, so the trade-off needs to be examined and an optimal configuration found.
Method: Evaluates system behavior at different summary injection ratios (CIR), analyzes the effect on recall and precision, and builds a theoretical model to determine the best injection ratio.
Result: Moderate summary injection improves recall by 18%, but once CIR exceeds 0.4, precision on specific queries drops by 22%; a theoretical framework for computing the optimal injection ratio is proposed.
Conclusion: Summary injection must balance added context against preserved local information; an optimal injection ratio exists, excessive injection harms performance through vector dilution, and the proposed framework can guide parameter choice in practice.
Abstract: Recent "Contextualized Chunking" techniques inject summaries to improve RAG context but introduce "vector dilution" that drowns out local content. Evaluating various injection ratios, we demonstrate an "inverted U" curve: moderate injection boosts Recall (+18%), but exceeding a critical threshold (CIR > 0.4) drops precision by 22% for specific queries. We propose a theoretical framework to calculate the optimal injection ratio.
[16] NewsScope: Schema-Grounded Cross-Domain News Claim Extraction with Open Models
Nidhi Pandya
Main category: cs.CL
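TL;DR: Presents NewsScope, an open dataset, benchmark, and fine-tuned model for cross-domain news claim extraction. The LoRA fine-tuned LLaMA 3.1 8B model reaches 89.4% accuracy in human evaluation, outperforms GPT-4o-mini on political claims, and supports offline deployment.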
Details
Motivation: Existing claim extraction methods for automated news verification fall short on schema compliance or on cross-domain generalization, so a more reliable and general solution is needed.
Method: Builds the NewsScope dataset of 455 articles (covering politics, health, science/environment, and business), fine-tunes LLaMA 3.1 8B with LoRA, and adds a numeric grounding filter to improve accuracy.
Result: Evaluated on 80 in-domain and 60 out-of-source test articles, the model reaches 89.4% human-evaluated accuracy (GPT-4o-mini: 93.7%, p=0.07) and outperforms GPT-4o-mini on political claims (94.3% vs. 87.8%); the numeric filter raises accuracy to 91.6%, and inter-annotator agreement reaches 94.6%.
Conclusion: NewsScope delivers high-performing, reproducible, and offline-deployable news claim extraction, advancing open and transparent research on automated news verification.
Abstract: Automated news verification requires structured claim extraction, but existing approaches either lack schema compliance or generalize poorly across domains. This paper presents NewsScope, a cross-domain dataset, benchmark, and fine-tuned model for schema-grounded news claim extraction. The dataset contains 455 articles across politics, health, science/environment, and business, consisting of 395 in-domain articles and 60 out-of-source articles for generalization testing. LLaMA 3.1 8B was fine-tuned using LoRA on 315 training examples and evaluated on held-out in-domain (80 articles) and out-of-source (60 articles) test sets. Human evaluation on 400 claims shows NewsScope achieves 89.4% human-evaluated accuracy compared to GPT-4o-mini's 93.7% (p=0.07). NewsScope outperforms GPT-4o-mini on political claims (94.3% vs. 87.8%). A numeric grounding filter further improves accuracy to 91.6%, narrowing the gap to 2.1 percentage points. Inter-annotator agreement studies (160 claims) confirm labeling reliability (94.6% positive agreement on SUPPORTED judgments). The open-weight model enables offline deployment at approximately $15 on-demand compute (or $0 on free tiers). Code and benchmark are publicly released.
[17] Evaluating Role-Consistency in LLMs for Counselor Training
Eric Rudolph,Natalie Engert,Jens Albrecht
Main category: cs.CL
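TL;DR: Presents VirCo, a virtual client system for online counseling training, and introduces a new dataset containing adversarial attacks to evaluate how well large language models maintain role consistency.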
Details
Motivation: Training future counselors more effectively calls for more realistic simulated environments that complement traditional role-playing.
Method: Builds a new dataset containing adversarial attacks and evaluates the ability of the Vicuna model and other open-source large language models to maintain role consistency and coherence in conversation.
Result: Large language models differ in role consistency; Vicuna outperforms other models on some metrics, but adversarial inputs still substantially affect its performance.
Conclusion: The study contributes new data and an evaluation benchmark for developing virtual client systems, and indicates that large language models need further optimization to be stable and reliable in counseling settings.
Abstract: The rise of online counseling services has highlighted the need for effective training methods for future counselors. This paper extends research on VirCo, a Virtual Client for Online Counseling, designed to complement traditional role-playing methods in academic training by simulating realistic client interactions. Building on previous work, we introduce a new dataset incorporating adversarial attacks to test the ability of large language models (LLMs) to maintain their assigned roles (role-consistency). The study focuses on evaluating the role consistency and coherence of the Vicuna model's responses, comparing these findings with earlier research. Additionally, we assess and compare various open-source LLMs for their performance in sustaining role consistency during virtual client interactions. Our contributions include creating an adversarial dataset, evaluating conversation coherence and persona consistency, and providing a comparative analysis of different LLMs.
[18] Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models
Youwei Liu,Jian Wang,Hanlin Wang,Beichen Guo,Wenjie Li
Main category: cs.CL
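TL;DR: This paper proposes Imagine-then-Plan (ITP), a unified framework for agent learning via lookahead imagination; combined with an adaptive multi-step imagination mechanism, it significantly improves performance on complex task planning.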
Details
Motivation: Existing world-model methods are mostly limited to single-step or fixed-horizon rollouts, which leaves their potential for complex task planning under-exploited; a more flexible, adaptive imagination mechanism is needed to improve agents' reasoning and decision-making. Method: Proposes the Imagine-then-Plan (ITP) framework, in which the policy model interacts with a learned world model to generate multi-step "imagined" trajectories; introduces a new adaptive lookahead mechanism that trades off the ultimate goal against task progress, and fuses the imagined results with current observations to form a partially observable and imaginable Markov decision process that guides policy learning. Result: Experiments on several representative agent benchmarks show that ITP significantly outperforms existing baselines; ablation analysis confirms that the adaptive lookahead mechanism effectively strengthens agents' reasoning. Conclusion: Through adaptive multi-step imagination, ITP improves agents' planning and decision-making on complex tasks and offers a new direction for building stronger world-model-based agents. Abstract: Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (ITP), a unified framework for agent learning via lookahead imagination, where an agent's policy model interacts with the learned world model, yielding multi-step "imagined" trajectories. Since the imagination horizon may vary across tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially observable and imaginable Markov decision process to guide policy learning. We instantiate ITP with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that ITP significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents' reasoning capability, providing valuable insights into addressing broader, complex tasks.
[19] Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM
Pedro Memoli Buffa,Luciano Del Corro
Main category: cs.CL
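TL;DR: This paper uses output-entropy features computed at inference time to estimate an LLM's performance across data domains, enabling monitoring of where the model fails and guiding data acquisition to improve performance.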
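As a rough illustration of the signal this entry describes (a per-token output-entropy profile computed from top-k logprobs, summarized into statistics, with predicted correctness averaged into a domain-level estimate), the following Python sketch may help. It is not the authors' code: the renormalization of the top-k probabilities and the five summary statistics are assumptions chosen for illustration (the abstract mentions eleven statistics but does not name them), and the classifier that produces the per-instance probabilities is left out.

```python
import math
import statistics


def token_entropy(top_logprobs):
    """Shannon entropy (in nats) of one decoding step from its top-k logprobs.

    Assumption: the top-k probabilities are renormalized to sum to 1,
    because the source does not say how the truncated tail is handled.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_profile(steps):
    """Entropy of every decoding step in one response."""
    return [token_entropy(step) for step in steps]


def summarize(profile):
    """Illustrative summary statistics (not the paper's eleven)."""
    return {
        "mean": statistics.fmean(profile),
        "max": max(profile),
        "min": min(profile),
        "std": statistics.pstdev(profile),
        "last": profile[-1],
    }


def estimate_domain_accuracy(predicted_correct_probs):
    """Average per-instance correctness probabilities into one estimate."""
    return statistics.fmean(predicted_correct_probs)
```

In use, `summarize(entropy_profile(steps))` would yield the feature vector for one response, a lightweight classifier (not shown) would map it to a correctness probability, and `estimate_domain_accuracy` would average those probabilities over a slice of traffic.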
Details
Motivation: Deploying large language models raises two challenges: monitoring performance degradation under domain drift, and prioritizing data acquisition to close performance gaps. Existing methods struggle to estimate accuracy on a specific data slice without labels, so a scalable monitoring signal that needs no human annotation is required. Method: Builds an output-entropy profile from the model's final-layer top-k token probabilities at inference time and extracts eleven statistics from it; a lightweight classifier predicts whether each instance is correct, and the average predicted probability estimates accuracy for the whole domain. Result: Validated on 10 STEM reasoning benchmarks and 9 LLMs from 6 families (3B-20B), the estimated domain accuracy usually agrees with held-out accuracy, and several models show a near-monotonic ordering of domains. Conclusion: Output-entropy features are a scalable, easily obtained signal for monitoring LLM performance under domain drift and for targeting data acquisition. Abstract: Deploying LLMs raises two coupled challenges: (1) monitoring - estimating where a model underperforms as traffic and domains drift - and (2) improvement - prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in {1,2,3,4}; all "10 choose k" combinations), across nine LLMs from six families (3B-20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.
[20] TranslateGemma Technical Report
Mara Finkelstein,Isaac Caswell,Tobias Domhan,Jan-Thorsten Peter,Juraj Juraska,Parker Riley,Daniel Deutsch,Cole Dilanni,Colin Cherry,Eleftheria Briakou,Elizabeth Nielsen,Jiaming Luo,Kat Black,Ryan Mullins,Sweta Agrawal,Wenda Xu,Erin Kats,Stephane Jaskiewicz,Markus Freitag,David Vilar
Main category: cs.CL
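TL;DR: TranslateGemma is a family of open machine translation models built on the Gemma 3 foundation models. Two-stage fine-tuning (supervised fine-tuning followed by reinforcement learning) improves translation quality, yielding substantial gains over the baseline models on multiple benchmarks while retaining multimodal capabilities.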
Details
Motivation: To enhance the multilingual capabilities of Gemma 3 specifically for machine translation and to provide efficient, open translation tools. Method: Two-stage fine-tuning: supervised fine-tuning on high-quality synthetic and human-translated data, followed by reinforcement learning with an ensemble of reward models that includes MetricX-QE and AutoMQM. Result: Strong results on WMT25 (10 language pairs) and WMT24++ (55 language pairs); automatic metrics show models of all sizes substantially outperforming the baselines, smaller models matching larger ones, and better performance on the Vistra image translation benchmark. Conclusion: TranslateGemma improves translation quality and efficiency while retaining multimodal capabilities, making it a strong and adaptable open translation tool. Abstract: We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models. To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process. First, supervised fine-tuning is performed using a rich mixture of high-quality large-scale synthetic parallel data generated via state-of-the-art models and human-translated parallel data. This is followed by a reinforcement learning phase, where we optimize translation quality using an ensemble of reward models, including MetricX-QE and AutoMQM, targeting translation quality. We demonstrate the effectiveness of TranslateGemma with human evaluation on the WMT25 test set across 10 language pairs and with automatic evaluation on the WMT24++ benchmark across 55 language pairs. Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all sizes. Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, offering improved efficiency. We also show that TranslateGemma models retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark. The release of the open TranslateGemma models aims to provide the research community with powerful and adaptable tools for machine translation.
[21] Multicultural Spyfall: Assessing LLMs through Dynamic Multilingual Social Deduction Game
Haryo Akbarianto Wibowo,Alaa Elsetohy,Qinrong Cui,Alham Fikri Aji
Main category: cs.CL
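TL;DR: Proposes a dynamic benchmarking framework based on the social deduction game Spyfall to evaluate large language models in multilingual and multicultural settings, and finds that models fall significantly short at handling localized content and following rules in non-English contexts.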
Details
Motivation: Existing static benchmarks are prone to data saturation and leakage and cannot effectively measure how models really perform in multilingual and multicultural contexts, so a more dynamic, leakage-resistant, and culturally sensitive evaluation method is needed. Method: Builds a dynamic evaluation framework on the game Spyfall, in which models must achieve their role's goal (identify the spy or conceal their identity) through multilingual dialogue involving local places or foods; the game interaction is used to assess language understanding, strategy execution, and cultural adaptation. Result: The game-based results align closely with the Chatbot Arena rankings, but in non-English settings models perform worse at handling locally specific entities, following rules, and maintaining strategic consistency. Conclusion: The game-based dynamic evaluation is scalable, leakage-resistant, and culturally fine-grained, making it a strong alternative to traditional NLP benchmarks. Abstract: The rapid advancement of Large Language Models (LLMs) has necessitated more robust evaluation methods that go beyond static benchmarks, which are increasingly prone to data saturation and leakage. In this paper, we propose a dynamic benchmarking framework for evaluating multilingual and multicultural capabilities through the social deduction game Spyfall. In our setup, models must engage in strategic dialogue to either identify a secret agent or avoid detection, utilizing culturally relevant locations or local foods. Our results show that our game-based rankings align closely with the Chatbot Arena. However, we find a significant performance gap in non-English contexts: models are generally less proficient when handling locally specific entities and often struggle with rule-following or strategic integrity in non-English languages. We demonstrate that this game-based approach provides a scalable, leakage-resistant, and culturally nuanced alternative to traditional NLP benchmarks. The game history can be accessed at https://huggingface.co/datasets/haryoaw/cultural-spyfall.
[22] OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG
Fengran Mo,Zhan Su,Yuchen Hui,Jinghan Zhang,Jia Ao Sun,Zheyuan Liu,Chao Zhang,Tetsuya Sakai,Jian-Yun Nie
Main category: cs.CL
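TL;DR: This paper proposes OpenDecoder, a new method that makes retrieval-augmented generation (RAG) models more robust by explicitly evaluating the quality of retrieved information (relevance, ranking, and query performance prediction scores); experiments show it outperforms existing methods on multiple benchmark datasets.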
Details
Motivation: Existing large language models used for retrieval-augmented generation usually assume that the retrieved information is relevant, but in practice its relevance and usefulness can vary widely, so a generation mechanism that adapts to retrieved information of varying quality is needed. Method: Proposes OpenDecoder, which introduces three explicit evaluation signals (relevance score, ranking score, and query performance prediction (QPP) score) as quality indicator features and incorporates them into the LLM's generation process to improve robustness to noisy context. Result: Experiments on five benchmark datasets show that OpenDecoder outperforms a range of baselines in generation quality and robustness, particularly when the retrieved content is low-quality or irrelevant. Conclusion: By using explicit quality evaluation of retrieved information, OpenDecoder improves RAG performance across noise levels; the framework is flexible and can be combined with LLM post-training and various external indicators. Abstract: The development of large language models (LLMs) has achieved superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and the capacity of LLMs' internal information processing mechanism to incorporate it in answer generation. It is generally assumed that the retrieved information is relevant to the question. However, the retrieved information may have a variable degree of relevance and usefulness, depending on the question and the document collection. It is important to take into account the relevance of the retrieved information in answer generation. In this paper, we propose OpenDecoder, a new approach that leverages explicit evaluation of the retrieved information as quality indicator features for generation. We aim to build a RAG model that is more robust to varying levels of noisy context. Three types of explicit evaluation information are considered: relevance score, ranking score, and QPP (query performance prediction) score. The experimental results on five benchmark datasets demonstrate the effectiveness and better robustness of OpenDecoder by outperforming various baseline methods. Importantly, this paradigm is flexible: it can be integrated with the post-training of LLMs for any purpose and combined with any type of external indicator.
[23] SpectraQuery: A Hybrid Retrieval-Augmented Conversational Assistant for Battery Science
Sreya Vangara,Jagjit Nanda,Yan-Kai Tzeng,Eric Darve
Main category: cs.CL
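TL;DR: SpectraQuery is a hybrid query framework that combines a structured Raman spectroscopy database with unstructured scientific literature; its SUQL-inspired design enables cross-modal reasoning and supports natural-language question answering with cited answers.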
Details
Motivation: Existing large language models struggle to reason jointly over structured experimental data and unstructured scientific literature, which limits the integration of data and discourse in scientific research. Method: Proposes the SpectraQuery framework, which uses a SUQL-inspired design and combines semantic parsing with retrieval-augmented generation to turn natural-language questions into coordinated SQL queries and literature retrieval operations. Result: About 80% of SQL queries are fully correct; synthesized answers reach 93-97% groundedness with 10-15 retrieved passages; battery scientists rate the answers 4.1-4.6 out of 5 for accuracy, relevance, grounding, and clarity. Conclusion: Hybrid retrieval architectures can bridge data and discourse and meaningfully support scientific workflows over high-volume experimental datasets. Abstract: Scientific reasoning increasingly requires linking structured experimental data with the unstructured literature that explains it, yet most large language model (LLM) assistants cannot reason jointly across these modalities. We introduce SpectraQuery, a hybrid natural-language query framework that integrates a relational Raman spectroscopy database with a vector-indexed scientific literature corpus using a Structured and Unstructured Query Language (SUQL)-inspired design. By combining semantic parsing with retrieval-augmented generation, SpectraQuery translates open-ended questions into coordinated SQL and literature retrieval operations, producing cited answers that unify numerical evidence with mechanistic explanation. Across SQL correctness, answer groundedness, retrieval effectiveness, and expert evaluation, SpectraQuery demonstrates strong performance: approximately 80 percent of generated SQL queries are fully correct, synthesized answers reach 93-97 percent groundedness with 10-15 retrieved passages, and battery scientists rate responses highly across accuracy, relevance, grounding, and clarity (4.1-4.6/5). These results show that hybrid retrieval architectures can meaningfully support scientific workflows by bridging data and discourse for high-volume experimental datasets.
[24] Can LLMs interpret figurative language as humans do?: surface-level vs representational similarity
Samhita Bollepally,Aurora Sloman-Moll,Takashi Yamauchi
Main category: cs.CL
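TL;DR: This study compares humans with four large language models (GPT-4, Gemma-2-9B, Llama-3.2, Mistral-7B) on understanding dialogue-based figurative and socially grounded language (such as sarcasm, slang, and idioms). The models come close to humans in surface-level judgments but differ significantly at the representational level, especially for expressions that depend on context and social pragmatics.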
Details
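Motivation: To examine whether large language models truly align with human judgment when interpreting socially grounded, context-dependent language such as sarcasm, slang, and idioms. Method: Human participants and four instruction-tuned LLMs of different sizes rated 240 dialogue-based sentences representing six linguistic traits (conventionality, sarcasm, funniness, emotionality, idiomacy, and slang) on a 10-point Likert scale; each sentence was paired with 40 interpretive questions, and human-model agreement was compared at the surface and representational levels. Result: Humans and models agree fairly well in surface-level ratings but differ significantly at the representational level, especially for idioms and Gen Z slang; GPT-4 comes closest to human representational patterns, but all models perform poorly on context-dependent expressions such as sarcasm, slang, and idioms. Conclusion: Current large language models can imitate human surface judgments but have not yet mastered socially grounded, context-dependent language understanding, and remain limited in interpreting figurative and socio-pragmatic information. Abstract: Large language models generate judgments that resemble those of humans. Yet the extent to which these models align with human judgments in interpreting figurative and socially grounded language remains uncertain. To investigate this, human participants and four instruction-tuned LLMs of different sizes (GPT-4, Gemma-2-9B, Llama-3.2, and Mistral-7B) rated 240 dialogue-based sentences representing six linguistic traits: conventionality, sarcasm, funny, emotional, idiomacy, and slang. Each of the 240 sentences was paired with 40 interpretive questions, and both humans and LLMs rated these sentences on a 10-point Likert scale. Results indicated that LLMs aligned with humans at the surface level but diverged significantly at the representational level, especially in interpreting figurative sentences involving idioms and Gen Z slang. GPT-4 most closely approximates human representational patterns, while all models struggle with context-dependent and socio-pragmatic expressions like sarcasm, slang, and idiomacy.
[25] Is Grokking Worthwhile? Functional Analysis and Transferability of Generalization Circuits in Transformers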
Kaiyu He,Zhang Mian,Peilin Wu,Xinya Du,Zhiyu Chen
Main category: cs.CL
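TL;DR: This paper studies the "curse of two-hop reasoning" that large language models face in compositional tasks and asks whether the "Generalization Circuit" formed during the "grokking" phase truly improves reasoning. It finds that non-grokked and grokked models use the same inference paths on in-distribution tasks, indicating that grokking does not introduce a new reasoning mode but integrates memorized facts into an existing path; that high accuracy and grokking can occur independently; and that even a mature circuit transfers new knowledge only to a limited extent, indicating that the model has not fully mastered compositional logic.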
Details
Motivation: To examine whether grokked models perform better on downstream tasks and whether their high training cost is worthwhile. Method: Uses mechanistic analysis to compare the inference paths of non-grokked and grokked models on compositional queries, and tests their generalization and knowledge transfer on unseen data. Result: (i) Non-grokked and grokked models use the same inference paths; (ii) high accuracy and grokking can occur independently; (iii) a mature circuit has limited ability to transfer knowledge. Conclusion: Grokking does not mean the model has acquired a new reasoning paradigm; it is a process of integrating memorized facts, and its computational cost may not always be worthwhile. Abstract: While Large Language Models (LLMs) excel at factual retrieval, they often struggle with the "curse of two-hop reasoning" in compositional tasks. Recent research suggests that parameter-sharing transformers can bridge this gap by forming a "Generalization Circuit" during a prolonged "grokking" phase. A fundamental question arises: Is a grokked model superior to its non-grokked counterparts on downstream tasks? Furthermore, is the extensive computational cost of waiting for the grokking phase worthwhile? In this work, we conduct a mechanistic study to evaluate the Generalization Circuit's role in knowledge assimilation and transfer. We demonstrate that: (i) The inference paths established by non-grokked and grokked models for in-distribution compositional queries are identical. This suggests that the "Generalization Circuit" does not represent the sudden acquisition of a new reasoning paradigm. Instead, we argue that grokking is the process of integrating memorized atomic facts into a naturally established reasoning path. (ii) Achieving high accuracy on unseen cases after prolonged training and the formation of a certain reasoning path are not bound; they can occur independently under specific data regimes. (iii) Even a mature circuit exhibits limited transferability when integrating new knowledge, suggesting that "grokked" Transformers do not achieve a full mastery of compositional logic.
[26] SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages
Tianyi Xu,Xuan Ouyang,Binwei Yao,Shoua Xiong,Sara Misurelli,Maichou Lor,Junjie Hu
Main category: cs.CL
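TL;DR: Proposes SITA, which uses multi-objective training to make pretrained speech models speaker-invariant and tone-aware for low-resource tonal languages, significantly improving lexical retrieval and recognition performance on Hmong and Mandarin.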
Details
Motivation: Existing speech technology performs poorly on low-resource tonal languages such as Hmong; it struggles to handle nuisance variables such as speaker gender while staying sensitive to tone, so tonal information is lost or confused. Method: Proposes SITA, a lightweight adaptation recipe with staged multi-objective training: cross-gender contrastive learning to strengthen lexical consistency, a tone-repulsive loss to prevent representations of the same word with different tones from collapsing, and an auxiliary CTC-based ASR task with knowledge distillation to stabilize recognition-relevant structure. Result: On Hmong, cross-gender lexical retrieval accuracy improves significantly while ASR performance stays close to the teacher model; similar gains on Mandarin confirm that the recipe transfers. Conclusion: SITA is a general, plug-in adaptation strategy that strengthens the representations of multilingual speech encoders for tonal languages, balancing speaker invariance with tone discrimination. Abstract: Tonal low-resource languages are widely spoken yet remain underserved by modern speech technology. A key challenge is learning representations that are robust to nuisance variation such as gender while remaining tone-aware for different lexical meanings. To address this, we propose SITA, a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders. SITA uses staged multi-objective training: (i) a cross-gender contrastive objective encourages lexical consistency across speakers, while a tone-repulsive loss prevents tone collapse by explicitly separating same-word different-tone realizations; and (ii) an auxiliary Connectionist Temporal Classification (CTC)-based ASR objective with distillation stabilizes recognition-relevant structure. We evaluate primarily on Hmong, a highly tonal and severely under-resourced language where off-the-shelf multilingual encoders fail to represent tone effectively. On a curated Hmong word corpus, SITA improves cross-gender lexical retrieval accuracy, while maintaining usable ASR accuracy relative to an ASR-adapted XLS-R teacher. We further observe similar gains when transferring the same recipe to Mandarin, suggesting SITA is a general, plug-in approach for adapting multilingual speech encoders to tonal languages.
[27] Efficient Multilingual Dialogue Processing via Translation Pipelines and Distilled Language Models
Santiago Martínez Novoa,Nicolás Rozo Fajardo,Diego Alejandro González Vargas,Nicolás Bedoya Figueroa
Main category: cs.CL
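TL;DR: This paper describes team Kl33n3x's multilingual dialogue summarization and question answering system for the NLPAI4Health 2025 shared task. It uses a three-stage pipeline of forward translation, multitask generation, and reverse translation, and relies on a small knowledge-distilled model to achieve strong results across nine languages.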
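The three-stage flow in this entry (forward translation to English, generation in English, reverse translation to the source language) can be sketched as a small Python function. This is only a structural illustration: `translate` and `generate` are hypothetical callables standing in for whatever translation system and distilled model the team used, and their signatures are assumptions, not taken from the paper.

```python
from typing import Callable


def run_pipeline(
    text: str,
    source_lang: str,
    task: str,
    translate: Callable[[str, str, str], str],
    generate: Callable[[str, str], str],
) -> str:
    """Forward-translate, generate in English, then reverse-translate.

    translate(text, src, tgt) and generate(task, text) are hypothetical
    placeholders supplied by the caller.
    """
    # Stage 1: source language -> English
    english_input = translate(text, source_lang, "en")
    # Stage 2: summarization or question answering on the English text
    english_output = generate(task, english_input)
    # Stage 3: English -> source language
    return translate(english_output, "en", source_lang)
```

Keeping the two model calls as parameters makes the composition easy to test with stubs before real models are plugged in.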
Details
Motivation: High-quality models for dialogue summarization and question answering are lacking for low-resource Indic languages, so the paper explores an efficient multilingual approach that needs no task-specific fine-tuning. Method: A three-stage pipeline: Indic languages are first forward-translated into English, a 2.55B-parameter distilled language model then performs multitask text generation, and the output is reverse-translated back into the source language. Result: The system achieves high win rates across the language tasks, with standout results on Marathi (86.7% QnA), Tamil (86.7% QnA), and Hindi (80.0% QnA). Conclusion: Translation-based methods combined with knowledge distillation can handle low-resource languages effectively without task-specific fine-tuning, confirming that compact models are competitive in multilingual settings. Abstract: This paper presents team Kl33n3x's multilingual dialogue summarization and question answering system developed for the NLPAI4Health 2025 shared task. The approach employs a three-stage pipeline: forward translation from Indic languages to English, multitask text generation using a 2.55B parameter distilled language model, and reverse translation back to source languages. By leveraging knowledge distillation techniques, this work demonstrates that compact models can achieve highly competitive performance across nine languages. The system achieved strong win rates across the competition's tasks, with particularly robust performance on Marathi (86.7% QnA), Tamil (86.7% QnA), and Hindi (80.0% QnA), demonstrating the effectiveness of translation-based approaches for low-resource language processing without task-specific fine-tuning.
[28] Beyond Consensus: Perspectivist Modeling and Evaluation of Annotator Disagreement in NLP
Yinuo Xu,David Jurgens
Main category: cs.CL
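TL;DR: This paper surveys NLP methods that account for annotator disagreement, proposes a domain-agnostic taxonomy of the sources of disagreement, and summarizes the shift from consensus learning toward explicitly modeling disagreement.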
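To make the contrast between consensus learning and disagreement-aware targets concrete, the short Python sketch below derives both kinds of training target from the same set of annotations. It illustrates the general idea only and is not a method from the survey; the label names are made up for the example.

```python
from collections import Counter


def majority_label(annotations):
    """Consensus target: the most frequent label.

    Ties go to the label seen first, following Counter.most_common ordering.
    """
    return Counter(annotations).most_common(1)[0][0]


def soft_label(annotations, labels):
    """Disagreement-preserving target: the empirical label distribution."""
    counts = Counter(annotations)
    total = len(annotations)
    return {label: counts[label] / total for label in labels}
```

For five annotators who split three to two, the majority target keeps only the winning label, while the soft label retains the 60/40 split as a signal a model can be trained to predict.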
Details
Motivation: Annotator disagreement is widespread in NLP, especially in subjective and ambiguous tasks; traditional methods treat it as noise, whereas newer work tries to capture the meaning behind it. Method: Proposes a domain-agnostic taxonomy of the sources of disagreement and organizes modeling approaches in a unified framework defined by prediction targets and pooling structure. Result: Systematically reviews disagreement-aware NLP modeling methods and evaluation metrics, and notes that current fairness evaluations are mostly descriptive rather than normative. Conclusion: Future directions include integrating multiple sources of variation, developing disagreement-aware interpretability frameworks, and addressing the practical tradeoffs of perspectivist modeling. Abstract: Annotator disagreement is widespread in NLP, particularly for subjective and ambiguous tasks such as toxicity detection and stance analysis. While early approaches treated disagreement as noise to be removed, recent work increasingly models it as a meaningful signal reflecting variation in interpretation and perspective. This survey provides a unified view of disagreement-aware NLP methods. We first present a domain-agnostic taxonomy of the sources of disagreement spanning data, task, and annotator factors. We then synthesize modeling approaches using a common framework defined by prediction targets and pooling structure, highlighting a shift from consensus learning toward explicitly modeling disagreement, and toward capturing structured relationships among annotators. We review evaluation metrics for both predictive performance and annotator behavior, and note that most fairness evaluations remain descriptive rather than normative. We conclude by identifying open challenges and future directions, including integrating multiple sources of variation, developing disagreement-aware interpretability frameworks, and grappling with the practical tradeoffs of perspectivist modeling.
[29] Mi:dm 2.0 Korea-centric Bilingual Language Models
Donghoon Shin,Sejung Lee,Soonmin Bae,Hwijung Ryu,Changwon Ok,Hoyoun Jung,Hyesung Ji,Jeehyun Lim,Jehoon Lee,Ji-Eun Han,Jisoo Baik,Mihyeon Kim,Riwoo Chung,Seongmin Lee,Wonjae Park,Yoonseok Heo,Youngkyung Seo,Seyoun Won,Boeun Kim,Cheolhun Heo,Eunkyeong Lee,Honghee Lee,Hyeongju Ju,Hyeontae Seo,Jeongyong Shim,Jisoo Lee,Junseok Koh,Junwoo Kim,Minho Lee,Minji Kang,Minju Kim,Sangha Nam,Seongheum Park,Taehyeong Kim,Euijai Ahn,Hong Seok Jeung,Jisu Shin,Jiyeon Kim,Seonyeong Song,Seung Hyun Kong,Sukjin Hong,Taeyang Yun,Yu-Seon Kim,A-Hyun Lee,Chae-Jeong Lee,Hye-Won Yu,Ji-Hyun Ahn,Song-Yeon Kim,Sun-Woo Jung,Eunju Kim,Eunji Ha,Jinwoo Baek,Yun-ji Lee,Wanjin Park,Jeong Yeop Kim,Eun Mi Kim,Hyoung Jun Park,Jung Won Yoon,Min Sung Noh,Myung Gyo Oh,Wongyoung Lee,Yun Jin Park,Young S. Kwon,Hyun Keun Kim,Jieun Lee,YeoJoo Park
Main category: cs.CL
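TL;DR: Mi:dm 2.0 is a bilingual large language model focused on Korea-centric AI; through high-quality data processing and cultural alignment it improves Korean understanding and generation, reaching leading results on multiple Korean benchmarks.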
Details
Motivation: Existing large language models fall short in Korean data quality and cultural fit; they struggle to reflect the values, commonsense knowledge, and emotional nuance of Korean society, which limits their reliability and usefulness for applications in Korea. Method: Builds a comprehensive data pipeline that includes proprietary data cleansing, high-quality synthetic data generation, a data-mixing strategy based on curriculum learning, and a custom Korean-optimized tokenizer; releases two configurations, an 11.5B-parameter Base model for general use and a 2.3B-parameter Mini model for resource-constrained settings. Result: Achieves state-of-the-art zero-shot performance on Korean benchmarks such as KMMLU and strong results in internal evaluations on language, humanities, and social science tasks. Conclusion: Mi:dm 2.0 offers a high-performing, accessible Korea-centric LLM that can help drive AI adoption across Korean industries, strengthen the local AI developer ecosystem, and support the K-intelligence vision. Abstract: We introduce Mi:dm 2.0, a bilingual large language model (LLM) specifically engineered to advance Korea-centric AI. This model goes beyond Korean text processing by integrating the values, reasoning patterns, and commonsense knowledge inherent to Korean society, enabling nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios to generate reliable and culturally appropriate responses. To address limitations of existing LLMs, often caused by insufficient or low-quality Korean data and lack of cultural alignment, Mi:dm 2.0 emphasizes robust data quality through a comprehensive pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer to improve efficiency and coverage. To realize this vision, we offer two complementary configurations: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks. Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks. The Mi:dm 2.0 lineup is released under the MIT license to support extensive research and commercial use. By offering accessible and high-performance Korea-centric LLMs, KT aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence. Our models are available at https://huggingface.co/K-intelligence. For technical inquiries, please contact midm-llm@kt.com.
[30] From Symbolic to Natural-Language Relations: Rethinking Knowledge Graph Construction in the Era of Large Language Models
Kanyao Han,Yushang Lai
Main category: cs.CL
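TL;DR: This paper argues for replacing traditional symbolic relation labels in knowledge graphs with natural-language relation descriptions, and proposes hybrid design principles that preserve a minimal structural backbone to enable more flexible, context-sensitive relation representations.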
Details
Motivation: Traditional symbolic relation labels cannot adequately express the complex, ambiguous, and uncertain relations found in the real world, while progress in large language models makes natural-language knowledge representation more feasible and efficient. Method: Argues for moving from symbolic relations to natural-language relation descriptions and designs a hybrid knowledge graph structure that keeps a basic structure while supporting free-text relation expressions. Result: Advances a paradigm shift in how knowledge graph relations are represented, making them better suited to knowledge generation and reasoning in the LLM era. Conclusion: Relation representation in knowledge graphs should be rethought, adopting a hybrid of natural-language descriptions and lightweight structure to improve semantic richness and context sensitivity. Abstract: Knowledge graphs (KGs) have commonly been constructed using predefined symbolic relation schemas, typically implemented as categorical relation labels. This design has notable shortcomings: real-world relations are often contextual, nuanced, and sometimes uncertain, and compressing them into discrete relation labels abstracts away critical semantic detail. Nevertheless, symbolic-relation KGs remain widely used because they have been operationally effective and broadly compatible with pre-LLM downstream models and algorithms, in which KG knowledge could be retrieved or encoded into quantified features and embeddings at scale. The emergence of LLMs has reshaped how knowledge is created and consumed. LLMs support scalable synthesis of domain facts directly in concise natural language, and prompting-based inference favors context-rich free-form text over quantified representations. This position paper argues that these changes call for rethinking the representation of relations themselves rather than merely using LLMs to populate conventional schemas more efficiently. We therefore advocate moving from symbolic to natural-language relation descriptions, and we propose hybrid design principles that preserve a minimal structural backbone while enabling more flexible and context-sensitive relational representations.
[31] How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation
Wilson Y. Lee
Main category: cs.CL
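TL;DR: Human preference evaluations are widely used to compare generative models, but how many judgments are needed to detect small improvements remains unclear. The study finds that when the preference signal is diffuse across prompts, proportional allocation is the optimal strategy, and that most comparisons in practice fall into this regime, so reliably detecting an improvement requires a large number of judgments.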
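For a sense of scale, the Python sketch below applies the standard normal-approximation sample-size formula for a two-sided one-sample test of a win rate against 0.5. This is a textbook calculation, not the paper's analysis: it assumes independent judgments and ignores the prompt-level variance the paper studies, so it should be read as an optimistic lower bound. The function name and defaults are illustrative choices.

```python
import math
from statistics import NormalDist


def judgments_needed(win_rate, alpha=0.05, power=0.8):
    """Approximate number of independent pairwise judgments needed to
    detect a true win rate that differs from 0.5 (two-sided z-test)."""
    if not 0.0 < win_rate < 1.0 or win_rate == 0.5:
        raise ValueError("win_rate must be in (0, 1) and differ from 0.5")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    sd_null = 0.5                                  # sqrt(0.5 * 0.5)
    sd_alt = math.sqrt(win_rate * (1 - win_rate))
    n = ((z_alpha * sd_null + z_beta * sd_alt) / (win_rate - 0.5)) ** 2
    return math.ceil(n)
```

Under these assumptions a 52% win rate already calls for on the order of a few thousand judgments, and the requirement grows roughly with the inverse square of the margin, which is consistent with the entry's point that small margins are expensive to detect.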
Details
Motivation: To understand the conditions under which human preference evaluation can reliably detect small improvements in generative models, and to examine what limits evaluation efficiency. Method: Proves theoretically that proportional allocation is optimal when the signal is diffuse, and analyzes large-scale human preference datasets empirically, comparing detectability across evaluation protocols and modalities. It also analyzes how curated benchmarks improve detection by reducing prompt variability. Result: Most real-world preference comparisons fall into the diffuse regime, with small preference margins that require far more judgments than are typically collected; the limit holds across modalities and protocols; curated benchmarks reduce prompt-level variance by a factor of 1.5 and markedly improve detectability. Conclusion: Many inconclusive human evaluation results stem from insufficient statistical power rather than from the models being equivalent, which underscores the need to account explicitly for effect size, budget, and protocol choice in evaluation design. Abstract: Human preference evaluations are widely used to compare generative models, yet it remains unclear how many judgments are required to reliably detect small improvements. We show that when preference signal is diffuse across prompts (i.e., all prompt types are similarly informative), proportional allocation is minimax-optimal: no allocation strategy substantially improves detectability. Empirical analysis of large-scale human preference datasets shows that most comparisons fall into this diffuse regime, exhibiting small preference margins that require far more judgments than typically collected, even in well-sampled comparisons. These limits persist across evaluation protocols and modalities, including chat, image generation, and code generation with execution feedback. In contrast, curated benchmarks that reduce prompt-induced variability systematically induce larger margins and improve detectability through a $1.5\times$ reduction in prompt-level variance. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence, underscoring the need to account explicitly for effect size, budget, and protocol design.
[32] SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding
Shuyang Hou,Yi Hu,Muhan Zhang
Main category: cs.CL
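TL;DR: This paper introduces SubTokenTest, a new benchmark for evaluating large language models' sub-token understanding on practical tasks, covering ten tasks across four domains, and analyzes how advanced models perform on such tasks and how character-level information is encoded in hidden states.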
Details
Motivation: Despite progress in reasoning, large language models still fail at basic character-level tasks, especially in real applications that need precise sub-token understanding (such as navigating text-based maps or parsing tables), and existing benchmarks do not adequately reflect these practical challenges. Method: Designs SubTokenTest, a comprehensive benchmark of ten practical tasks across four domains that isolates tokenization-related failures by decoupling them from complex reasoning; evaluates nine advanced LLMs and studies both the effect of test-time scaling on sub-token reasoning and how character-level information is represented in hidden states. Result: Experiments show that current advanced LLMs generally underperform on sub-token understanding tasks, that test-time scaling brings limited gains on them, and that hidden states encode character-level information imprecisely. Conclusion: SubTokenTest exposes the limitations of large language models on practical sub-token tasks and highlights the importance of improving tokenization and strengthening character-level awareness, pointing to directions for future model design. Abstract: Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting letters in words, a problem rooted in their tokenization process. While existing benchmarks have highlighted this weakness through basic character operations, such failures are often dismissed as lacking practical relevance. Yet, many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. In this regard, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning. We provide a comprehensive evaluation of nine advanced LLMs. Additionally, we investigate the impact of test-time scaling on sub-token reasoning and explore how character-level information is encoded within the hidden states.
[33] Contrastive Bi-Encoder Models for Multi-Label Skill Extraction: Enhancing ESCO Ontology Matching with BERT and Attention Mechanisms
Yongming Sun
Main category: cs.CL
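TL;DR: Proposes a zero-shot skill extraction framework that needs no manually labeled data: a large language model generates synthetic training data, and hierarchical constraints combined with contrastive learning yield effective skill matching on Chinese job advertisements.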
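This entry reports retrieval quality as F1@5. As a reminder of how such a score is typically computed for one query, here is a minimal Python sketch that uses the common per-query definition (precision and recall over the top-k ranked items). The paper's exact averaging scheme is not stated in the abstract, so this definition is an assumption, and the skill identifiers in the usage note are placeholders.

```python
def f1_at_k(ranked_skills, gold_skills, k=5):
    """Per-query F1 over the top-k ranked skills.

    Assumes ranked_skills has no duplicates; gold_skills is the set of
    correct skill identifiers for the query.
    """
    top_k = ranked_skills[:k]
    gold = set(gold_skills)
    if not top_k or not gold:
        return 0.0
    hits = sum(1 for skill in top_k if skill in gold)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```

With two of three gold skills appearing in the top five, precision is 0.4 and recall is 2/3, so the score is 0.5. A corpus-level figure would then be an average of such per-query scores, or a micro-average over pooled counts.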
Details
Motivation: Large-scale annotations aligned with a skill taxonomy are scarce and costly, especially outside English, so traditional supervised methods struggle to map unstructured job postings to standard skill taxonomies such as ESCO. Method: Uses a large language model to synthesize training instances from ESCO definitions and introduces hierarchical multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label settings; on top of this, trains a contrastive bi-encoder that combines BiLSTM and attention pooling, and uses a RoBERTa binary filter to remove non-skill sentences. Result: Hierarchy-conditioned generation improves the fluency and discriminability of the generated text, and the model achieves zero-shot retrieval performance of F1@5 = 0.72 on real Chinese job advertisements, outperforming TF-IDF and standard BERT baselines. Conclusion: The framework offers a scalable, data-efficient path to automated skill coding for labor market analysis. Abstract: Fine-grained labor market analysis increasingly relies on mapping unstructured job advertisements to standardized skill taxonomies such as ESCO. This mapping is naturally formulated as an Extreme Multi-Label Classification (XMLC) problem, but supervised solutions are constrained by the scarcity and cost of large-scale, taxonomy-aligned annotations, especially in non-English settings where job-ad language diverges substantially from formal skill definitions. We propose a zero-shot skill extraction framework that eliminates the need for manually labeled job-ad training data. The framework uses a Large Language Model (LLM) to synthesize training instances from ESCO definitions, and introduces hierarchically constrained multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label contexts. On top of the synthetic corpus, we train a contrastive bi-encoder that aligns job-ad sentences with ESCO skill descriptions in a shared embedding space; the encoder augments a BERT backbone with BiLSTM and attention pooling to better model long, information-dense requirement statements. An upstream RoBERTa-based binary filter removes non-skill sentences to improve end-to-end precision. Experiments show that (i) hierarchy-conditioned generation improves both fluency and discriminability relative to unconstrained pairing, and (ii) the resulting multi-label model transfers effectively to real-world Chinese job advertisements, achieving strong zero-shot retrieval performance (F1@5 = 0.72) and outperforming TF-IDF and standard BERT baselines. Overall, the proposed pipeline provides a scalable, data-efficient pathway for automated skill coding in labor economics and workforce analytics.
[34] Adaptive Multi-Stage Patent Claim Generation with Unified Quality Assessment
Chen-Wei Liang,Bin Guo,Zhen-Yuan Wei,Mu-Jiang-Shan Wang
Main category: cs.CL
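TL;DR: Proposes a three-stage framework that significantly improves patent claim generation across jurisdictions through relationship-aware similarity analysis, domain-adaptive claim generation, and unified quality assessment.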
Details
Motivation: Existing patent claim generation systems fall short in cross-jurisdictional generalization, semantic relationship modeling, and quality assessment. Method: Uses multi-head attention with eight specialized heads for relationship modeling, combines curriculum learning with dynamic LoRA adapter selection, and introduces cross-attention mechanisms for unified quality assessment. Result: Across multiple datasets, achieves a 7.6-point ROUGE-L gain, an 8.3% BERTScore improvement, a 0.847 correlation with human experts, and 89.4% cross-jurisdictional performance retention. Conclusion: The method provides a comprehensive and robust solution for automated patent prosecution workflows. Abstract: Current patent claim generation systems face three fundamental limitations: poor cross-jurisdictional generalization, inadequate semantic relationship modeling between claims and prior art, and unreliable quality assessment. We introduce a novel three-stage framework that addresses these challenges through relationship-aware similarity analysis, domain-adaptive claim generation, and unified quality assessment. Our approach employs multi-head attention with eight specialized heads for explicit relationship modeling, integrates curriculum learning with dynamic LoRA adapter selection across five patent domains, and implements cross-attention mechanisms between evaluation aspects for comprehensive quality assessment. Extensive experiments on the USPTO HUPD dataset, EPO patent collections, and the Patent-CE benchmark demonstrate substantial improvements: 7.6-point ROUGE-L gain over GPT-4o, 8.3% BERTScore enhancement over Llama-3.1-8B, and 0.847 correlation with human experts compared to 0.623 for separate evaluation models. Our method maintains 89.4% cross-jurisdictional performance retention versus 76.2% for baselines, establishing a comprehensive solution for automated patent prosecution workflows.
[35] Identity-Robust Language Model Generation via Content Integrity Preservation
Miao Zhang,Kelly Chen,Md Mehrab Tanjim,Rumi Chunara
Main category: cs.CL
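TL;DR: This paper studies how the output quality of large language models varies across sociodemographic attributes and proposes a lightweight, training-free framework that reduces identity-dependent bias while preserving the core semantic integrity of the generated content.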
Details
Motivation: Although large language models encode factual knowledge accurately, generation quality degrades under different user-identity prompts. The degradation affects factual accuracy, utility, and safety even when identity information is irrelevant to the question, so this identity-related degradation needs to be addressed.
Method: The analysis shows that core knowledge is stable while the generation process itself is biased. Building on this, the paper proposes a training-free generation framework that selectively neutralizes non-critical identity information, reducing identity-dependent bias while preserving semantically essential attributes.
Result: Experiments on four benchmarks and 18 sociodemographic identities show an average 77% reduction in identity-dependent bias relative to vanilla prompting and a 45% reduction relative to existing prompt-based defenses.
Conclusion: The work fills a gap in mitigating the effect of user identity cues on core generation quality and offers an effective route to fairer, more robust generation.
Abstract: Large Language Model (LLM) outputs often vary across user sociodemographic attributes, leading to disparities in factual accuracy, utility, and safety, even for objective questions where demographic information is irrelevant. Unlike prior work on stereotypical or representational bias, this paper studies identity-dependent degradation of core response quality. We show empirically that such degradation arises from biased generation behavior, despite factual knowledge being robustly encoded across identities. Motivated by this mismatch, we propose a lightweight, training-free framework for identity-robust generation that selectively neutralizes non-critical identity information while preserving semantically essential attributes, thus maintaining output content integrity. Experiments across four benchmarks and 18 sociodemographic identities demonstrate an average 77% reduction in identity-dependent bias compared to vanilla prompting and a 45% reduction relative to prompt-based defenses. Our work addresses a critical gap in mitigating the impact of user identity cues in prompts on core generation quality.
[36] OrthoGeoLoRA: Geometric Parameter-Efficient Fine-Tuning for Structured Social Science Concept Retrieval on the Web
Zeqiang Wang,Xinyue Wu,Chenxi Li,Zixi Chen,Nishanth Sastry,Jon Johnson,Suparna De
Main category: cs.CL
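TL;DR: This paper proposes OrthoGeoLoRA, an improved low-rank adaptation method that overcomes the geometric drawbacks of standard LoRA through orthogonality constraints and geometric reparameterization, allowing text encoders to be fine-tuned more efficiently under resource constraints. Experiments show that it outperforms existing PEFT methods on a multilingual social science concept retrieval task.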
Details
Motivation: Standard LoRA suffers from gauge freedom, scale ambiguity, and rank collapse, which limit its performance and stability in resource-constrained settings and call for a more geometrically consistent low-rank fine-tuning method.
Method: Proposes OrthoGeoLoRA, which adopts the SVD-like form $\Delta W = B\Sigma A^\top$, constrains the low-rank factors to the Stiefel manifold to keep them orthogonal, and uses a geometric reparameterization that is compatible with optimizers such as Adam and integrates into existing fine-tuning pipelines.
Result: On a hierarchical concept retrieval benchmark built on the European Language Social Science Thesaurus (ELSST), OrthoGeoLoRA outperforms standard LoRA and several strong PEFT variants on ranking metrics under the same low-rank budget.
Conclusion: By imposing geometric structure, OrthoGeoLoRA improves parameter efficiency and model performance, giving institutions in the Web4Good ecosystem a more energy-efficient way to fine-tune.
Abstract: Large language models and text encoders increasingly power web-based information systems in the social sciences, including digital libraries, data catalogues, and search interfaces used by researchers, policymakers, and civil society. Full fine-tuning is often computationally and energy intensive, which can be prohibitive for smaller institutions and non-profit organizations in the Web4Good ecosystem. Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), reduces this cost by updating only a small number of parameters. We show that the standard LoRA update $\Delta W = BA^\top$ has geometric drawbacks: gauge freedom, scale ambiguity, and a tendency toward rank collapse. We introduce OrthoGeoLoRA, which enforces an SVD-like form $\Delta W = B\Sigma A^\top$ by constraining the low-rank factors to be orthogonal (Stiefel manifold). A geometric reparameterization implements this constraint while remaining compatible with standard optimizers such as Adam and existing fine-tuning pipelines. We also propose a benchmark for hierarchical concept retrieval over the European Language Social Science Thesaurus (ELSST), widely used to organize social science resources in digital repositories. Experiments with a multilingual sentence encoder show that OrthoGeoLoRA outperforms standard LoRA and several strong PEFT variants on ranking metrics under the same low-rank budget, offering a more compute- and parameter-efficient path to adapt foundation models in resource-constrained settings.
[37] ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection
Tao Liu,Taiqiang Wu,Runming Yang,Shaoning Sun,Junjie Wang,Yujiu Yang
Main category: cs.CL
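TL;DR: This paper proposes ProFit, a method that selectively masks low-probability tokens to mitigate the overfitting that a single reference answer causes in conventional supervised fine-tuning, improving large language models on reasoning and mathematics tasks.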
Details
Motivation: Conventional supervised fine-tuning (SFT) forces alignment with a single reference answer and ignores the one-to-many nature of language, so the model overfits to non-core expressions and generalizes less well.
Method: Based on the observation that high-probability tokens carry the core logic while low-probability tokens are replaceable, ProFit selectively masks low-probability tokens to prevent surface-level overfitting.
Result: Extensive experiments show that ProFit consistently outperforms conventional SFT baselines on general reasoning and mathematical benchmarks.
Conclusion: ProFit effectively mitigates single-reference overfitting and offers a more efficient, lower-cost alternative for alignment training.
Abstract: Supervised fine-tuning (SFT) is a fundamental post-training strategy to align Large Language Models (LLMs) with human intent. However, traditional SFT often ignores the one-to-many nature of language by forcing alignment with a single reference answer, leading to the model overfitting to non-core expressions. Although our empirical analysis suggests that introducing multiple reference answers can mitigate this issue, the prohibitive data and computational costs necessitate a strategic shift: prioritizing the mitigation of single-reference overfitting over the costly pursuit of answer diversity. To achieve this, we reveal the intrinsic connection between token probability and semantic importance: high-probability tokens carry the core logical framework, while low-probability tokens are mostly replaceable expressions. Based on this insight, we propose ProFit, which selectively masks low-probability tokens to prevent surface-level overfitting. Extensive experiments confirm that ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.
[38] A.X K1 Technical Report
Sung Jun Cheon,Jaekyung Cho,Seongho Choi,Hyunjun Eun,Seokhwan Jo,Jaehyun Jun,Minsoo Kang,Jin Kim,Jiwon Kim,Minsang Kim,Sungwan Kim,Seungsik Kim,Tae Yoon Kim,Youngrang Kim,Hyeongmun Lee,Sangyeol Lee,Sungeun Lee,Youngsoon Lee,Yujin Lee,Seongmin Ok,Chanyong Park,Hyewoong Park,Junyoung Park,Hyunho Yang,Subin Yi,Soohyun Bae,Dhammiko Arya,Yongseok Choi,Sangho Choi,Dongyeon Cho,Seungmo Cho,Gyoungeun Han,Yong-jin Han,Seokyoung Hong,Hyeon Hwang,Wonbeom Jang,Minjeong Ju,Wonjin Jung,Keummin Ka,Sungil Kang,Dongnam Kim,Joonghoon Kim,Jonghwi Kim,SaeRom Kim,Sangjin Kim,Seongwon Kim,Youngjin Kim,Seojin Lee,Sunwoo Lee,Taehoon Lee,Chanwoo Park,Sohee Park,Sooyeon Park,Yohan Ra,Sereimony Sek,Seungyeon Seo,Gun Song,Sanghoon Woo,Janghan Yoon,Sungbin Yoon
Main category: cs.CL
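TL;DR: A.X K1 is a 519B-parameter Mixture-of-Experts language model that uses a controllable reasoning mechanism and the Think-Fusion training recipe, performing strongly on multi-task and Korean-language benchmarks while keeping inference efficient.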
Details
Motivation: To balance the reasoning capability and inference efficiency of large models, and to support a controllable thinking mode that suits diverse real-world scenarios.
Method: Training configurations and vocabulary size are optimized using scaling laws, the model is pre-trained on about 10T tokens, and the proposed Think-Fusion training recipe allows switching between thinking and non-thinking modes within a single model.
Result: On multiple public benchmarks the model is competitive with leading open-source models, and it shows a clear advantage on Korean-language tasks.
Conclusion: Through its controllable reasoning design, A.X K1 improves deployment flexibility while maintaining high performance, and it leads in Korean-language processing in particular.
Abstract: We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.
[39] UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning
Feng Zhang,Shijia Li,Chunmao Zhang,Zhanyu Ma,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He,Jingwen Xu,Han Liu
Main category: cs.CL
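TL;DR: This paper proposes UserLM-R1, a new user language model with reasoning capability. By building user profiles that combine static roles with dynamic goals, and by pairing a goal-driven decision-making policy with multi-reward reinforcement learning, it improves a user simulator's cross-domain generalization and strategic interaction.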
Details
Motivation: Existing user simulators rely on static, context-unaware user profiles and do not model human strategic thinking, which leads to poor generalization and makes them easy for agents to manipulate. A more adaptive and strategic approach to user simulation is therefore needed.
Method: UserLM-R1 first constructs user profiles that combine static roles with dynamic scenario-specific goals. It then applies a goal-driven decision-making policy that produces high-quality rationales before generating a response, and further refines reasoning and strategic capability through supervised fine-tuning and multi-reward reinforcement learning.
Result: Experiments show that UserLM-R1 outperforms existing methods on multiple benchmarks, particularly on the adversarial test set.
Conclusion: By introducing dynamic goals and a reasoning mechanism, UserLM-R1 substantially improves the cross-domain adaptability and strategic interaction of user simulators, providing a more realistic and challenging environment for agent training.
Abstract: User simulators serve as the critical interactive environment for agent post-training, and an ideal user simulator generalizes across domains and proactively engages in negotiation by challenging or bargaining. However, current methods exhibit two issues. They rely on static and context-unaware profiles, necessitating extensive manual redesign for new scenarios, thus limiting generalizability. Moreover, they neglect human strategic thinking, leading to vulnerability to agent manipulation. To address these issues, we propose UserLM-R1, a novel user language model with reasoning capability. Specifically, we first construct comprehensive user profiles with both static roles and dynamic scenario-specific goals for adaptation to diverse scenarios. Then, we propose a goal-driven decision-making policy to generate high-quality rationales before producing responses, and further refine the reasoning and improve strategic capabilities with supervised fine-tuning and multi-reward reinforcement learning. Extensive experimental results demonstrate that UserLM-R1 outperforms competitive baselines, particularly on the more challenging adversarial set.
[40] When to Trust: A Causality-Aware Calibration Framework for Accurate Knowledge Graph Retrieval-Augmented Generation
Jing Ren,Bowen Li,Ziqi Xu,Xinkun Zhang,Haytham Fayek,Xiaodong Li
Main category: cs.CL
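TL;DR: This paper proposes Ca2KG, a causality-aware calibration framework for Knowledge Graph Retrieval-Augmented Generation (KG-RAG) that improves model calibration through counterfactual prompting and a panel-based re-scoring mechanism while maintaining or even improving predictive accuracy.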
Details
Motivation: Existing KG-RAG models tend to be overconfident when the retrieved subgraphs are incomplete or unreliable, which poses risks for deployment in high-stakes domains. A method is needed that can identify and mitigate this kind of uncertainty.
Method: Ca2KG combines counterfactual prompting, which exposes retrieval-dependent uncertainty in knowledge quality and reasoning reliability, with a panel-based re-scoring mechanism that stabilizes predictions under intervention.
Result: Experiments on two complex question answering datasets show that Ca2KG consistently improves calibration while maintaining or improving predictive accuracy.
Conclusion: Ca2KG effectively addresses the overconfidence of KG-RAG models, substantially improving trustworthiness and explainability without sacrificing accuracy, and is suited to high-stakes applications.
Abstract: Knowledge Graph Retrieval-Augmented Generation (KG-RAG) extends the RAG paradigm by incorporating structured knowledge from knowledge graphs, enabling Large Language Models (LLMs) to perform more precise and explainable reasoning. While KG-RAG improves factual accuracy in complex tasks, existing KG-RAG models are often severely overconfident, producing high-confidence predictions even when retrieved sub-graphs are incomplete or unreliable, which raises concerns for deployment in high-stakes domains. To address this issue, we propose Ca2KG, a Causality-aware Calibration framework for KG-RAG. Ca2KG integrates counterfactual prompting, which exposes retrieval-dependent uncertainties in knowledge quality and reasoning reliability, with a panel-based re-scoring mechanism that stabilises predictions across interventions. Extensive experiments on two complex QA datasets demonstrate that Ca2KG consistently improves calibration while maintaining or even enhancing predictive accuracy.
[41] TeachPro: Multi-Label Qualitative Teaching Evaluation via Cross-View Graph Synergy and Semantic Anchored Evidence Encoding
Xiangqian Wang,Yifan Jia,Yang Xiang,Yumin Zhang,Yanbin Wang,Ke Liu
Main category: cs.CL
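TL;DR: This paper proposes TeachPro, a multi-label learning framework that extracts fine-grained feedback on five key teaching dimensions from open-ended teaching evaluations, improving the reliability and diagnostic value of teaching assessment.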
Details
Motivation: Conventional student evaluations of teaching suffer from low reliability, restricted response options, and response bias, and existing machine-learning sentiment analysis usually yields only binary sentiment polarity, overlooking concrete suggestions for improving teaching.
Method: Proposes the TeachPro framework, which comprises a Dimension-Anchored Evidence Encoder (combining a pre-trained text encoder, learnable semantic anchors, and cross-attention) and a Cross-View Graph Synergy Network (fusing syntactic dependency and semantic similarity graphs), with biaffine fusion and a differential regularizer to align the multi-view representations.
Result: Experiments on a new benchmark dataset with expert annotations show that TeachPro delivers better diagnostic granularity and robustness across multiple evaluation settings.
Conclusion: TeachPro effectively mines multi-dimensional teaching feedback from open-ended comments, gives instructors concrete and actionable guidance, and moves student evaluation of teaching toward more intelligent, fine-grained analysis.
Abstract: Standardized Student Evaluations of Teaching often suffer from low reliability, restricted response options, and response distortion. Existing machine learning methods that mine open-ended comments usually reduce feedback to binary sentiment, which overlooks concrete concerns such as content clarity, feedback timeliness, and instructor demeanor, and provides limited guidance for instructional improvement. We propose TeachPro, a multi-label learning framework that systematically assesses five key teaching dimensions: professional expertise, instructional behavior, pedagogical efficacy, classroom experience, and other performance metrics. We first propose a Dimension-Anchored Evidence Encoder, which integrates three core components: (i) a pre-trained text encoder that transforms qualitative feedback annotations into contextualized embeddings; (ii) a prompt module that represents five teaching dimensions as learnable semantic anchors; and (iii) a cross-attention mechanism that aligns evidence with pedagogical dimensions within a structured semantic space. We then propose a Cross-View Graph Synergy Network to represent student comments. This network comprises two components: (i) a Syntactic Branch that extracts explicit grammatical dependencies from parse trees, and (ii) a Semantic Branch that models latent conceptual relations derived from BERT-based similarity graphs. A BiAffine fusion module aligns syntactic and semantic units, while a differential regularizer disentangles embeddings to encourage complementary representations.
Finally, a cross-attention mechanism bridges the dimension-anchored evidence with the multi-view comment representations. We also contribute a novel benchmark dataset featuring expert qualitative annotations and multi-label scores. Extensive experiments demonstrate that TeachPro offers superior diagnostic granularity and robustness across diverse evaluation settings.
[42] When to Invoke: Refining LLM Fairness with Toxicity Assessment
Jing Ren,Bowen Li,Ziqi Xu,Renqiang Luo,Shuo Yu,Xin Ye,Haytham Fayek,Xiaodong Li,Feng Xia
Main category: cs.CL
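TL;DR: Proposes FairToT, a framework that uses prompt-guided toxicity assessment at inference time to improve the fairness of large language models across demographic groups, reducing between-group disparities without modifying model parameters.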
Details
Motivation: In online content moderation, large language models judge implicit hate speech inconsistently and are prone to biases tied to demographic attributes. Existing methods struggle to correct such biases and lack a criterion for deciding when a corrective mechanism should be triggered.
Method: Designs FairToT, an inference-time framework that combines prompt-guided toxicity assessment with two interpretable fairness indicators, which detect cases where bias is likely and determine whether additional assessment should be invoked to improve consistency.
Result: Experiments on benchmark datasets show that FairToT reduces group-level prediction disparities while keeping toxicity judgements stable and reliable.
Conclusion: Inference-time refinement is an effective and practical way to improve the fairness of large language models in toxicity assessment without adjusting model parameters.
Abstract: Large Language Models (LLMs) are increasingly used for toxicity assessment in online moderation systems, where fairness across demographic groups is essential for equitable treatment. However, LLMs often produce inconsistent toxicity judgements for subtle expressions, particularly those involving implicit hate speech, revealing underlying biases that are difficult to correct through standard training. This raises a key question that existing approaches often overlook: when should corrective mechanisms be invoked to ensure fair and reliable assessments? To address this, we propose FairToT, an inference-time framework that enhances LLM fairness through prompt-guided toxicity assessment. FairToT identifies cases where demographic-related variation is likely to occur and determines when additional assessment should be applied. In addition, we introduce two interpretable fairness indicators that detect such cases and improve inference consistency without modifying model parameters. Experiments on benchmark datasets show that FairToT reduces group-level disparities while maintaining stable and reliable toxicity predictions, demonstrating that inference-time refinement offers an effective and practical approach for fairness improvement in LLM-based toxicity assessment systems. The source code can be found at https://aisuko.github.io/fair-tot/.
[43] MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus
Yexing Du,Kaiyuan Liu,Bihe Zhang,Youcheng Pan,Bo Yang,Liangyu Huo,Xiyuan Zhang,Jian Xie,Daojing He,Yang Xiang,Ming Liu,Bin Qin
Main category: cs.CL
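TL;DR: This paper presents MCGA, a multi-task audio corpus of Classical Chinese literary genres intended to advance the audio understanding of multimodal large language models in Chinese Classical Studies. It also introduces new evaluation metrics, and experiments show that existing models still find the dataset challenging.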
Details
Motivation: Audio corpus development lags behind in Chinese Classical Studies. High-quality multi-task, multi-genre datasets in particular are lacking, which limits the audio understanding of multimodal large language models in this field.
Method: Builds MCGA, a multi-task audio corpus of Classical Chinese literature covering six tasks (ASR, S2TT, SEC, SQA, SU, SR), systematically evaluates ten multimodal large language models, and proposes new metrics for speech emotion captioning and for the consistency between speech and text capabilities.
Result: Current multimodal large language models perform poorly on the MCGA test set, with marked difficulty on speech emotion captioning and reasoning tasks. The proposed metrics allow a more complete measurement of audio understanding.
Conclusion: MCGA fills the gap left by the absence of a multi-task audio corpus in Chinese Classical Studies, providing a key resource and evaluation standard for improving the audio capabilities of multimodal large language models and supporting their use in cultural heritage and intelligent understanding.
Abstract: With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when evaluated on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric to measure the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: https://github.com/yxduir/MCGA
[44] ReGraM: Region-First Knowledge Graph Reasoning for Medical Question Answering
Chaerin Lee,Sohee Park,Hyunsik Na,Daseon Choi
Main category: cs.CL
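TL;DR: ReGraM is a region-first knowledge graph reasoning framework that builds a query-aligned subgraph and performs multi-step reasoning within that local region, substantially improving factual accuracy and consistency in medical question answering.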
Details
Motivation: Existing methods rely on traversing the whole graph or on large-scale retrieval, which introduces noise and makes multi-hop reasoning unstable. The core problem is identifying the most relevant subset of evidence and reasoning over it.
Method: ReGraM first constructs a query-aligned subgraph (region-first) and then performs stepwise reasoning within that local region under multiple evidence-aware modes, restricting the reasoning scope to reduce noise.
Result: On seven medical QA benchmarks, ReGraM improves markedly over the strong baseline KGARevion: MCQ accuracy rises by 8.04%, SAQ by 4.50%, and the hallucination rate falls by 42.9%. Ablations show that aligning region construction with the reasoning steps is the main source of the gains.
Conclusion: Region-first graph reasoning is an effective paradigm that improves the accuracy and stability of reasoning in medical QA, particularly in domain-specific knowledge settings where relations are not equally useful.
Abstract: Recent studies in medical question answering (Medical QA) have actively explored the integration of large language models (LLMs) with biomedical knowledge graphs (KGs) to improve factual accuracy. However, most existing approaches still rely on traversing the entire KG or performing large-scale retrieval, which introduces substantial noise and leads to unstable multi-hop reasoning. We argue that the core challenge lies not in expanding access to knowledge, but in identifying and reasoning over the appropriate subset of evidence for each query. ReGraM is a region-first knowledge graph reasoning framework that addresses this challenge by constructing a query-aligned subgraph and performing stepwise reasoning constrained to this localized region under multiple evidence-aware modes. By focusing inference on only the most relevant portion of the KG, ReGraM departs from the assumption that all relations are equally useful, an assumption that rarely holds in domain-specific medical settings. Experiments on seven medical QA benchmarks demonstrate that ReGraM consistently outperforms a strong baseline (KGARevion), achieving an 8.04% absolute accuracy gain on MCQ, a 4.50% gain on SAQ, and a 42.9% reduction in hallucination rate. Ablation and qualitative analyses further show that aligning region construction with hop-wise reasoning is the primary driver of these improvements. Overall, our results highlight region-first KG reasoning as an effective paradigm for improving factual accuracy and consistency in medical QA.
[45] Understanding or Memorizing? A Case Study of German Definite Articles in Language Models
Jonathan Drechsel,Erisa Bytyqi,Steffen Herbold
Main category: cs.CL
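TL;DR: This study uses the GRADIEND method to analyze how language models handle gender-case agreement in German definite articles. It finds that parameter updates cross over between grammatical settings, indicating that models do not rely purely on abstract grammatical rules and depend in part on memorized associations.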
Details
Motivation: To determine whether language models' performance on grammatical agreement stems from rule-based generalization or from memorization.
Method: Uses GRADIEND, a gradient-based interpretability method, to learn parameter update directions for transitions between gender-case combinations of German definite articles, and analyzes the affected neurons.
Result: Parameter updates for a specific gender-case combination often affect other, unrelated gender-case settings, and the most affected neurons overlap substantially across settings.
Conclusion: The results do not support a strictly rule-based encoding of German definite articles and indicate that models rely at least partly on memorized associations rather than abstract grammatical rules.
Abstract: Language models perform well on grammatical agreement, but it is unclear whether this reflects rule-based generalization or memorization. We study this question for German definite singular articles, whose forms depend on gender and case. Using GRADIEND, a gradient-based interpretability method, we learn parameter update directions for gender-case specific article transitions. We find that updates learned for a specific gender-case article transition frequently affect unrelated gender-case settings, with substantial overlap among the most affected neurons across settings. These results argue against a strictly rule-based encoding of German definite articles, indicating that models at least partly rely on memorized associations rather than abstract grammatical rules.
[46] Improving Implicit Hate Speech Detection via a Community-Driven Multi-Agent Framework
Ewelina Gajewska,Katarzyna Budzynska,Jarosław A Chudziak
Main category: cs.CL
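TL;DR: This paper proposes a multi-agent framework for detecting implicit hate speech. By incorporating socio-cultural context and identity awareness, it outperforms existing prompting methods on the ToxiGen dataset.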
Details
Motivation: Implicit hate speech depends heavily on context, so conventional methods struggle to identify it accurately, and fairness problems arise when different groups are involved.
Method: Builds a multi-agent system with a central Moderator Agent and dynamically constructed Community Agents, drawing socio-cultural context from public knowledge sources to support contextualized, identity-aware moderation.
Result: On the ToxiGen dataset, the method significantly improves classification accuracy and fairness across all target groups, with a clear advantage on balanced accuracy.
Conclusion: A multi-agent framework based on community consultation improves both the accuracy and the fairness of implicit hate speech detection, underscoring the importance of modeling context and integrating social setting.
Abstract: This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents representing specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot prompting, few-shot prompting, chain-of-thought prompting) and alternative approaches on the challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by incorporating balanced accuracy as a central metric of classification fairness that accounts for the trade-off between true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.
[47] Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs
Biswesh Mohapatra,Théo Charlot,Giovanni Duca,Mayank Palan,Laurent Romary,Justine Cassell
Main category: cs.CL
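TL;DR: This paper studies how common ground can be explicitly represented and used in situated dialogue, evaluates a model's ability to establish and exploit common ground through relational references to entities, and proposes improvements.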
Details
Motivation: Prior work shows that large language models can perform acts such as clarification and acknowledgment, but they lack a mechanism for explicitly representing and storing common ground, so it remains unclear whether they truly understand the dialogue content.
Method: Tests multiple methods for representing common ground in situated dialogue and proposes strategies to improve both the establishment of common ground and its later use in the conversation.
Result: Presents effective methods for representing common ground and verifies that they improve performance in establishing and using it.
Conclusion: Explicitly representing and storing common ground helps dialogue systems understand and refer back to situated content, making interaction more coherent and reliable.
Abstract: Common ground plays a critical role in situated spoken dialogues, where interlocutors must establish and maintain shared references to entities, events, and relations to sustain coherent interaction. For dialog systems, the ability to correctly ground conversational content in order to refer back to it later is particularly important. Prior studies have demonstrated that LLMs are capable of performing grounding acts such as requesting clarification or producing acknowledgments, yet relatively little work has investigated how common ground can be explicitly represented and stored for later use. Without such mechanisms, it remains unclear whether acknowledgment or clarification behaviors truly reflect a grounded understanding. In this work, we evaluate a model's ability to establish and exploit common ground through relational references to entities within the shared context in a situational dialogue. We test multiple methods for representing common ground in situated dialogues and further propose approaches to improve both the establishment of common ground and its subsequent use in the conversation.
[48] Relation Extraction Capabilities of LLMs on Clinical Text: A Bilingual Evaluation for English and Turkish
Aidana Aidynkyzy,Oğuz Dikenelli,Oylum Alatlı,Şebnem Bora
Main category: cs.CL
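TL;DR: This study presents the first comprehensive bilingual evaluation of large language models (LLMs) on clinical relation extraction (RE) in English and Turkish, and introduces the first English-Turkish parallel clinical RE dataset. It proposes Relation-Aware Retrieval (RAR), a method based on contrastive learning that yields the best results among the in-context learning approaches tested, and the results show that prompting-based LLM approaches outperform conventional fine-tuned models.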
Details
Motivation: The scarcity of annotated datasets for clinical information extraction in non-English languages limits the evaluation and use of LLM methods designed mainly for English, especially in lower-resource languages such as Turkish. A bilingual evaluation framework and a high-quality parallel dataset are needed to test cross-lingual transfer and the effectiveness of prompt-based learning.
Method: Builds the first English-Turkish parallel clinical RE dataset, derived from the 2010 i2b2/VA relation classification corpus; systematically evaluates prompting strategies including in-context learning (ICL) and Chain-of-Thought (CoT); and introduces Relation-Aware Retrieval (RAR), a contrastive-learning method for selecting in-context examples that captures both sentence-level and relation-level semantics.
Result: Prompting-based LLM approaches outperform conventional fine-tuned baselines such as PURE in all settings, and English results are generally better than Turkish ones. Among ICL methods RAR performs best, with Gemini 1.5 Flash reaching micro-F1 scores of 0.906 in English and 0.888 in Turkish. Combined with a structured reasoning prompt, DeepSeek-V3 reaches 0.918 F1 in English.
Conclusion: High-quality demonstration retrieval is critical to prompting performance, and advanced retrieval and prompting techniques such as RAR can narrow resource gaps in clinical natural language processing and support applications in low-resource languages.
Abstract: The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning that is specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. Moreover, results in English were better than their Turkish counterparts across all evaluated LLMs and prompting techniques. Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish.
Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing.
[49] The Imperfective Paradox in Large Language Models
Bolei Ma,Yusuke Miyao
Main category: cs.CL
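TL;DR: This paper asks whether large language models (LLMs) genuinely understand the compositional semantics of events or rely on surface-level probabilistic heuristics. Using ImperfectiveNLI, a diagnostic dataset, the authors find that models handling the Imperfective Paradox show a "Teleological Bias": they systematically misjudge goal-oriented events as completed, even when the text explicitly negates completion. Internal representations can distinguish process from result, but inference decisions are dominated by a prior preference for goal attainment. Prompting interventions reduce hallucinations but also cause more valid entailments to be wrongly rejected. The study indicates that current LLMs lack a real grasp of aspectual structure and behave more like predictive narrative engines than logical reasoners.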
Details
Motivation: To examine whether large language models have a deep understanding of event semantics, in particular structural awareness of aspect and logical entailment, rather than relying on surface statistical regularities.
Method: Introduces ImperfectiveNLI, a diagnostic dataset that probes how models treat activities versus accomplishments under the Imperfective Paradox, and analyzes model behavior through representational analysis and prompting interventions.
Result: Models exhibit a "Teleological Bias", tending to misjudge goal-oriented events as completed. Internal representations can distinguish process from result, but inference is dominated by a completion prior. Prompting interventions reduce hallucinations but increase incorrect rejections of valid entailments.
Conclusion: Current large language models lack a real understanding of aspectual structure. Their inferences rest more on a strong prior for goal completion than on logical consistency, so they behave as predictive narrative systems rather than reliable logical reasoners.
Abstract: Do Large Language Models (LLMs) genuinely grasp the compositional semantics of events, or do they rely on surface-level probabilistic heuristics? We investigate the Imperfective Paradox, a logical phenomenon where the past progressive aspect entails event realization for activities (e.g., running $\to$ ran) but not for accomplishments (e.g., building $\nrightarrow$ built). We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. Evaluating state-of-the-art open-weight models, we uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation. Representational analyses show that while internal embeddings often distinguish process from result, inference decisions are dominated by strong priors about goal attainment. We further find that prompting-based interventions reduce hallucinated completions but also increase incorrect rejections of valid entailments. Our findings suggest that current LLMs lack structural aspectual awareness, operating as predictive narrative engines rather than faithful logical reasoners.
[50] Ability Transfer and Recovery via Modularized Parameters Localization
Songyao Jin,Kun Zhou,Wenqi Li,Peng Wang,Biwei Huang
Main category: cs.CL
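TL;DR: This paper proposes ACT, a channel-wise ability transfer method that analyzes module activations in large language models to locate the channels associated with a specific ability, then selectively transfers those parameters to recover forgotten abilities while preserving retained skills.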
Details
Motivation: Continual pre-training or fine-tuning of large language models on a specific domain, language, or skill often degrades other abilities and causes catastrophic forgetting, so it is worth studying how abilities can be transferred and retained without harming existing ones.
Method: By analyzing module activations of closely related models under domain- and language-specific inputs, the authors find that ability-related activations are highly concentrated in a small number of channels. They locate the key channels through activation differences and propose ACT, which transfers only the corresponding parameters and follows with lightweight fine-tuning for compatibility.
Result: Experiments show that ACT recovers forgotten abilities on multilingual mathematical and scientific reasoning tasks while preserving existing skills, and that it can merge multiple specialized models, integrating several abilities into a single model with minimal interference.
Conclusion: ACT is an efficient and stable ability transfer method that alleviates catastrophic forgetting during specialization and supports the fusion of multiple abilities.
Abstract: Large language models can be continually pre-trained or fine-tuned to improve performance in specific domains, languages, or skills, but this specialization often degrades other capabilities and may cause catastrophic forgetting. We investigate how abilities are distributed within LLM parameters by analyzing module activations under domain- and language-specific inputs for closely related models. Across layers and modules, we find that ability-related activations are highly concentrated in a small set of channels (typically <5%), and these channels are largely disentangled with good sufficiency and stability. Building on these observations, we propose ACT (Activation-Guided Channel-wise Ability Transfer), which localizes ability-relevant channels via activation differences and selectively transfers only the corresponding parameters, followed by lightweight fine-tuning for compatibility. Experiments on multilingual mathematical and scientific reasoning show that ACT can recover forgotten abilities while preserving retained skills. It can also merge multiple specialized models to integrate several abilities into a single model with minimal interference. Our code and data will be publicly released.
[51] Structured Knowledge Representation through Contextual Pages for Retrieval-Augmented Generation
Xinze Li,Zhenghao Liu,Haidong Xin,Yukun Yan,Shuo Wang,Zheni Zeng,Sen Mei,Ge Yu,Maosong Sun
Main category: cs.CL
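TL;DR: Proposes PAGER, a page-driven autonomous knowledge representation framework that improves retrieval-augmented generation through a structured cognitive outline and an iterative slot-filling mechanism.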
Details
Motivation: Existing iterative knowledge accumulation processes lack a consistent organizational structure, which limits the construction of comprehensive and coherent knowledge representations.
Method: PAGER first prompts a large language model to generate a structured cognitive outline for the question, containing slots for multiple knowledge aspects. It then iteratively retrieves and refines relevant documents to fill each slot, producing a coherent knowledge page used for answer generation.
Result: Experiments on multiple knowledge-intensive benchmarks and backbone models show that PAGER consistently outperforms all RAG baselines and builds higher-quality, more information-dense knowledge representations that better mitigate knowledge conflicts.
Conclusion: Through structured, organized knowledge accumulation, PAGER substantially improves how effectively large language models use external knowledge and the quality of what they generate.
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge. Recently, some works have incorporated iterative knowledge accumulation processes into RAG models to progressively accumulate and refine query-related knowledge, thereby constructing more comprehensive knowledge representations. However, these iterative processes often lack a coherent organizational structure, which limits the construction of more comprehensive and cohesive knowledge representations. To address this, we propose PAGER, a page-driven autonomous knowledge representation framework for RAG. PAGER first prompts an LLM to construct a structured cognitive outline for a given question, which consists of multiple slots, each representing a distinct knowledge aspect. Then, PAGER iteratively retrieves and refines relevant documents to populate each slot, ultimately constructing a coherent page that serves as contextual input for guiding answer generation. Experiments on multiple knowledge-intensive benchmarks and backbone models show that PAGER consistently outperforms all RAG baselines. Further analyses demonstrate that PAGER constructs higher-quality and information-dense knowledge representations, better mitigates knowledge conflicts, and enables LLMs to leverage external knowledge more effectively. All code is available at https://github.com/OpenBMB/PAGER.
[52] Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing
Filip Trhlik,Andrew Caines,Paula Buttery
Main category: cs.CL
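TL;DR: This paper proposes using small BabyLM models as low-cost proxies for large language models when studying debiasing methods. Experiments show that BabyLMs closely match standard BERT models in bias formation and performance development and reproduce prior findings, while cutting pre-training cost from over 500 GPU-hours to under 30 GPU-hours, which enables broader and faster fairness research.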
Details
Motivation: Because training large language models is expensive, re-training them to study debiasing strategies is impractical, so a low-cost way to study the roots of bias is needed. Existing post-hoc or masking-based methods often fail to address the underlying causes, which limits understanding of how bias forms.
Method: The authors study BabyLMs, compact BERT-like models trained on small, mutable corpora. They validate BabyLMs as proxies by comparing them with standard BERT models on intrinsic bias patterns, learning dynamics, and behavior under several debiasing methods, and then use them for debiasing experiments at the pre-training stage.
Result: BabyLMs correlate closely with BERT in bias formation and performance development, and the correlation holds across debiasing methods. Using BabyLMs reduces pre-training cost to under 6% of the original (from over 500 GPU-hours to under 30). The experiments also reveal how gender imbalance and toxic content influence bias formation.
Conclusion: BabyLMs can serve as an effective sandbox for large-scale language models, supporting low-cost, efficient research on pre-training debiasing, helping to build fairer language models, and allowing more researchers to take part in the field.
Abstract: Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.
[53] Where Knowledge Collides: A Mechanistic Study of Intra-Memory Knowledge Conflict in Language Models
Minh Vu Pham,Hsuvas Borkakoty,Yufang Hou
Main category: cs.CL
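TL;DR: This paper proposes a framework based on mechanistic interpretability methods for identifying and intervening on intra-memory knowledge conflicts that language models encode during pre-training.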
Details
Motivation: Prior work focuses mainly on resolving conflicts between a model's internal knowledge and external resources, and overlooks knowledge conflicts that arise within the model's internal representations during pre-training.
Method: A framework built on mechanistic interpretability methods locates and analyzes where and how conflicting knowledge introduced by the pre-training data is encoded in language models.
Result: Specific internal components are found to be responsible for encoding conflicting knowledge, and causal interventions can control these conflicts at inference time.
Conclusion: Mechanistic interpretability methods can be used to identify and intervene on intra-memory knowledge conflicts in language models, offering a new route to improving model consistency.
Abstract: In language models (LMs), intra-memory knowledge conflict largely arises when inconsistent information about the same event is encoded within the model's parametric knowledge. While prior work has primarily focused on resolving conflicts between a model's internal knowledge and external resources through approaches such as fine-tuning or knowledge editing, the problem of localizing conflicts that originate during pre-training within the model's internal representations remains unexplored. In this work, we design a framework based on mechanistic interpretability methods to identify where and how conflicting knowledge from the pre-training data is encoded within LMs. Our findings contribute to a growing body of evidence that specific internal components of a language model are responsible for encoding conflicting knowledge from pre-training, and we demonstrate how mechanistic interpretability methods can be leveraged to causally intervene in and control conflicting knowledge at inference time.
[54] Improving Symbolic Translation of Language Models for Logical Reasoning
Ramya Keerthy Thatikonda,Jiuzhou Han,Wray Buntine,Ehsan Shareghi
Main category: cs.CL
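TL;DR: This paper proposes an incremental inference framework that splits first-order logic translation into two stages, predicate generation and logic translation, and adds a verification module, improving the performance of small language models on symbolic reasoning.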
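As a rough illustration of what a predicate-arity check could look like, the sketch below compares the predicates declared in a first stage against how they are used in a translated formula. It is not the paper's verification module: the regular expression, the function names, and the example formulas are all assumptions made for illustration, and nested function terms are not handled.

```python
import re
from collections import defaultdict

# Matches a flat predicate application such as Teaches(y, x).
# Assumption: predicate names start with an uppercase ASCII letter.
PREDICATE_CALL = re.compile(r"([A-Z][A-Za-z0-9_]*)\(([^()]*)\)")


def predicate_arities(formula):
    """Collect every arity with which each predicate is used in one formula."""
    arities = defaultdict(set)
    for name, args in PREDICATE_CALL.findall(formula):
        count = len([a for a in args.split(",") if a.strip()])
        arities[name].add(count)
    return dict(arities)


def arity_errors(declared, formula):
    """Report predicates that are undeclared or used with the wrong arity."""
    errors = []
    for name, used in sorted(predicate_arities(formula).items()):
        if name not in declared:
            errors.append(f"{name}: not declared in stage 1")
        elif used != {declared[name]}:
            errors.append(
                f"{name}: declared arity {declared[name]}, used with {sorted(used)}"
            )
    return errors


# Hypothetical stage-1 output: predicate name -> arity.
declared = {"Student": 1, "Teaches": 2}
good = "∀x (Student(x) → ∃y Teaches(y, x))"
bad = "∀x (Student(x) → Teaches(x))"

print(arity_errors(declared, good))
print(arity_errors(declared, bad))
```

A check of this kind is cheap because it needs no solver call; a failed check could be used to trigger regeneration of the offending formula.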
Details
Motivation: Small language models are error-prone when translating natural language into first-order logic, which undermines the reliability of the reasoning system, so the accuracy of their symbolic translation needs to improve.
Method: Common errors are first categorized, and smaller models are fine-tuned on data synthesized by large language models. Incremental inference then splits inference into two stages (predicate generation and FOL translation), and a verification module targeting predicate-arity errors is added.
Result: Experiments on three model families and four logical-reasoning datasets show that the approach lowers error rates and improves predicate coverage and reasoning performance.
Conclusion: The proposed framework improves the accuracy and controllability of small language models in formal logical reasoning and advances accessible, reliable symbolic-reasoning systems.
Abstract: The use of formal language for deductive logical reasoning aligns well with language models (LMs), where translating natural language (NL) into first-order logic (FOL) and employing an external solver results in a verifiable and therefore reliable reasoning system. However, smaller LMs often struggle with this translation task, frequently producing incorrect symbolic outputs due to formatting and translation errors. Existing approaches typically rely on self-iteration to correct these errors, but such methods depend heavily on the capabilities of the underlying model. To address this, we first categorize common errors and fine-tune smaller LMs using data synthesized by large language models. The evaluation is performed using the defined error categories. We introduce incremental inference, which divides inference into two stages, predicate generation and FOL translation, providing greater control over model behavior and enhancing generation quality as measured by predicate metrics. This decomposition framework also enables the use of a verification module that targets predicate-arity errors to further improve performance. Our study evaluates three families of models across four logical-reasoning datasets. The comprehensive fine-tuning, incremental inference, and verification modules reduce error rates, increase predicate coverage, and improve reasoning performance for smaller LMs, moving us closer to developing reliable and accessible symbolic-reasoning systems.
[55] SlidesGen-Bench: Evaluating Slides Generation via Computational and Quantitative Metrics
Yunqiao Yang,Wenbo Li,Houxing Ren,Zimu Lu,Ke Wang,Zhiyuan Huang,Zhuofan Zong,Mingjie Zhan,Hongsheng Li
Main category: cs.CL
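TL;DR: This paper introduces SlidesGen-Bench, a benchmark for evaluating slide generation systems that emphasizes universality, quantification, and reliability. It evaluates slides computationally along three dimensions (content, aesthetics, and editability) and builds the human-preference-aligned dataset Slides-Align1.5k to improve agreement with human judgment.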
Details
Motivation: Existing evaluation methods for slide generation systems fall short on cross-architecture comparability and scoring reliability, and lack a unified, quantitative standard.
Method: Terminal outputs are treated as renderings, so a unified evaluation framework can be built in the visual domain. A computational approach quantitatively assesses slides along three dimensions: Content, Aesthetics, and Editability. The Slides-Align1.5k human-preference dataset, covering nine mainstream generation systems and seven scenarios, is built to calibrate the evaluation.
Result: Experiments show that SlidesGen-Bench correlates more strongly with human preference than existing evaluation pipelines and evaluates slide generation systems of different architectures more reliably.
Conclusion: With a unified visual evaluation framework, multi-dimensional quantitative metrics, and a human-preference-aligned dataset, SlidesGen-Bench enables reliable and comparable evaluation of heterogeneous slide generation systems and advances automated evaluation in this area.
Abstract: The rapid evolution of Large Language Models (LLMs) has fostered diverse paradigms for automated slide generation, ranging from code-driven layouts to image-centric synthesis. However, evaluating these heterogeneous systems remains challenging, as existing protocols often struggle to provide comparable scores across architectures or rely on uncalibrated judgments. In this paper, we introduce SlidesGen-Bench, a benchmark designed to evaluate slide generation through a lens of three core principles: universality, quantification, and reliability. First, to establish a unified evaluation framework, we ground our analysis in the visual domain, treating terminal outputs as renderings to remain agnostic to the underlying generation method. Second, we propose a computational approach that quantitatively assesses slides across three distinct dimensions - Content, Aesthetics, and Editability - offering reproducible metrics where prior works relied on subjective or reference-dependent proxies. Finally, to ensure high correlation with human preference, we construct the Slides-Align1.5k dataset, a human-preference-aligned dataset covering slides from nine mainstream generation systems across seven scenarios. Our experiments demonstrate that SlidesGen-Bench achieves a higher degree of alignment with human judgment than existing evaluation pipelines. Our code and data are available at https://github.com/YunqiaoYang/SlidesGen-Bench.
[56] MVSS: A Unified Framework for Multi-View Structured Survey Generation
Yinqi Liu,Yueqi Zhu,Yongkang Zhang,Xinfeng Li,Feiran Liu,Yufei Sun,Xin Wang,Renzhao Liang,Yidong Wang,Cunxiang Wang
Main category: cs.CL
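TL;DR: Proposes MVSS, a framework that follows a structure-first paradigm to generate multi-view, citation-grounded hierarchical surveys comprising a conceptual tree, comparison tables, and text, markedly improving the organization and trustworthiness of automatically generated surveys.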
Details
Motivation: Existing automatic survey generation methods struggle to explicitly model hierarchical relations among research topics and structured methodological comparisons, leaving a gap in structural organization relative to expert-written surveys.
Method: MVSS follows a structure-first paradigm: it first constructs a conceptual tree of the research domain, then generates comparison tables constrained by the tree, and finally generates the survey text with both as structural constraints. An evaluation framework measures structural quality, comparative completeness, and citation fidelity.
Result: Experiments on 76 computer science topics show that MVSS outperforms existing methods in organization and evidence grounding, with performance close to that of expert-written surveys.
Conclusion: By jointly generating and aligning multi-level structure and text, MVSS produces automatic surveys closer in quality to human-written ones, supporting the value of the structure-first, multi-view approach.
Abstract: Scientific surveys require not only summarizing large bodies of literature, but also organizing them into clear and coherent conceptual structures. Existing automatic survey generation methods typically focus on linear text generation and struggle to explicitly model hierarchical relations among research topics and structured methodological comparisons, resulting in gaps in structural organization compared to expert-written surveys. We propose MVSS, a multi-view structured survey generation framework that jointly generates and aligns citation-grounded hierarchical trees, structured comparison tables, and survey text. MVSS follows a structure-first paradigm: it first constructs a conceptual tree of the research domain, then generates comparison tables constrained by the tree, and finally uses both as structural constraints for text generation. This enables complementary multi-view representations across structure, comparison, and narrative. We introduce an evaluation framework assessing structural quality, comparative completeness, and citation fidelity. Experiments on 76 computer science topics show MVSS outperforms existing methods in organization and evidence grounding, achieving performance comparable to expert surveys.
[57] SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams
Chenglong Wang,Canjia Li,Xingzhao Zhu,Yifu Huo,Huiyu Wang,Weixiong Lin,Yun Yang,Qiaozhi He,Tianhua Zhou,Xiaojia Chang,Jingbo Zhu,Tong Xiao
Main category: cs.CL
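TL;DR: Proposes a Self-Evolving Relevance Model (SERM) that uses a multi-agent sample miner and a multi-agent annotator to keep improving itself in large-scale industrial settings, raising search relevance.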
Details
Motivation: Real-world query streams change dynamically, so existing relevance models generalize poorly. Self-evolution techniques face two problems: informative samples are sparse and hard to identify, and pseudo-labels are unreliable.
Method: Two complementary multi-agent modules are designed: a multi-agent sample miner that detects distributional shifts and identifies valuable samples, and a multi-agent relevance annotator that produces reliable labels through a two-level agreement framework.
Result: Validated in a large-scale industrial setting serving billions of requests per day, SERM achieves significant gains through iterative self-evolution, as confirmed by offline multilingual evaluations and online testing.
Conclusion: SERM handles sample sparsity and label reliability in massive query streams and enables robust self-evolution of relevance models.
Abstract: Due to the dynamically evolving nature of real-world query streams, relevance models struggle to generalize to practical search scenarios. Self-evolution techniques offer a sophisticated solution. However, in large-scale industrial settings with massive query streams, these techniques face two challenges: (1) informative samples are often sparse and difficult to identify, and (2) pseudo-labels generated by the current model could be unreliable. To address these challenges, in this work, we propose a Self-Evolving Relevance Model approach (SERM), which comprises two complementary multi-agent modules: a multi-agent sample miner, designed to detect distributional shifts and identify informative training samples, and a multi-agent relevance annotator, which provides reliable labels through a two-level agreement framework. We evaluate SERM in a large-scale industrial setting, which serves billions of user requests daily. Experimental results demonstrate that SERM can achieve significant performance gains through iterative self-evolution, as validated by extensive offline multilingual evaluations and online testing.
[58] Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats
Manyi Zhang,Ji-Fu Li,Zhongao Sun,Haoli Bai,Hui-Ling Zhen,Zhenhua Dong,Xianzhi Yu
Main category: cs.CL
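TL;DR: This paper systematically studies post-training quantization (PTQ) under MXFP formats, evaluating more than 7 algorithms, 15 benchmarks, and 3 LLM families. It finds that MXFP8 achieves near-lossless compression while MXFP4 remains challenging, that quantization sensitivity is dominated by the language model rather than the vision encoder, and that a pre-scale strategy can mitigate MXFP4 error.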
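To make the role of the shared scaling factor concrete, here is a minimal fake-quantization sketch of an MXFP4-style block. It assumes the usual microscaling layout (a block of 32 elements sharing one power-of-two scale, with 4-bit E2M1 elements) and simple nearest-value rounding without tie-to-even; it is an illustration only, not any of the PTQ algorithms benchmarked in the paper, and the function names are made up.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest E2M1 magnitude: 6 = 1.5 * 2**2


def quantize_block(block):
    """Quantize and dequantize one block with a shared power-of-two scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0 for _ in block]
    # Shared scale chosen from the largest magnitude in the block.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at the largest value
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest * scale, v))
    return out


def quantize_mxfp4(values, block_size=32):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_block(values[start:start + block_size]))
    return out


# One large value forces a coarse scale, so the small value is flushed to zero.
print(quantize_mxfp4([48.0, 0.9]))
# Values already on the grid survive unchanged when the scale is 1.
print(quantize_mxfp4([6.0, 3.0, 1.0, 0.5]))
```

The first call shows why the scale can dominate the error at 4 bits: a single outlier sets the scale for the whole block, and small values fall below the smallest nonzero grid point.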
Details
Motivation: Although MXFP has become a strong candidate for low-precision LLM representation, existing PTQ methods mostly target integer quantization, and their applicability and behavior under MXFP remain unclear, so a systematic study is needed.
Method: A comprehensive empirical analysis under MXFP formats covers more than 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families, examining how algorithmic paradigm, format compatibility, and model structure affect quantization performance.
Result: MXFP8 is near-lossless in most cases, while MXFP4 degrades accuracy substantially. PTQ effectiveness depends strongly on format compatibility. Quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs. The scaling factor is the main error source in MXFP4, and a pre-scale strategy improves it.
Conclusion: PTQ under MXFP must account for compatibility between algorithm and format. MXFP8 is already practical, MXFP4 needs further optimization, and the proposed pre-scale strategy points to one direction for improvement.
Abstract: Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under MXFP formats remain largely unexplored. To address this gap, this work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. The key findings include: 1) MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities; in particular, quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) The scaling factor of quantization is a critical error source in MXFP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. Together, these results provide practical guidance on adapting existing PTQ methods to MXFP quantization.
[59] Dialogue Telemetry: Turn-Level Instrumentation for Autonomous Information Gathering
Dimitris Panagopoulos,Adolfo Perrusquia,Weisi Guo
Main category: cs.CL
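TL;DR: This paper proposes Dialogue Telemetry (DT), a framework that provides per-turn observables for schema-grounded information-gathering dialogues, comprising a Progress Estimator (PE) and a Stalling Index (SI), to monitor acquisition efficiency and detect unproductive questioning.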
Details
Motivation: Autonomous systems conducting schema-grounded information-gathering dialogues lack turn-level observables, which makes it hard to monitor acquisition efficiency or decide when to stop asking.
Method: The Dialogue Telemetry (DT) framework includes a Progress Estimator (PE) that quantifies residual information potential per category and a Stalling Index (SI) that detects a pattern of repeated probing with low-marginal-gain responses. DT is validated with LLM-based simulations in search-and-rescue-inspired scenarios, and its signals are integrated into a reinforcement learning policy.
Result: DT distinguishes efficient from stalled dialogue traces, and adding DT signals to the reinforcement learning policy improves policy performance, particularly when stalling carries operational costs.
Conclusion: DT provides interpretable, model-agnostic turn-level monitoring that helps autonomous systems gather information more efficiently and usefully.
Abstract: Autonomous systems conducting schema-grounded information-gathering dialogues face an instrumentation gap, lacking turn-level observables for monitoring acquisition efficiency and detecting when questioning becomes unproductive. We introduce Dialogue Telemetry (DT), a measurement framework that produces two model-agnostic signals after each question-answer exchange: (i) a Progress Estimator (PE) quantifying residual information potential per category (with a bits-based variant), and (ii) a Stalling Index (SI) detecting an observable failure signature characterized by repeated category probing with semantically similar, low-marginal-gain responses. SI flags this pattern without requiring causal diagnosis, supporting monitoring in settings where attributing degradation to specific causes may be impractical. We validate DT in controlled search-and-rescue (SAR)-inspired interviews using large language model (LLM)-based simulations, distinguishing efficient from stalled dialogue traces and illustrating downstream utility by integrating DT signals into a reinforcement learning (RL) policy. Across these settings, DT provides interpretable turn-level instrumentation that improves policy performance when stalling carries operational costs.
[60] DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing
Qian Cao,Yahui Liu,Wei Bi,Yi Zhao,Ruihua Song,Xiting Wang,Ruiming Tang,Guorui Zhou,Han Li
Main category: cs.CL
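TL;DR: Proposes a reinforcement learning framework based on diverse planning branching that adds a diversity reward to a semi-structured long Chain-of-Thought, increasing the output diversity of large language models in creative writing while maintaining generation quality.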
Details
Motivation: Existing methods that enhance large language models with reinforcement learning tend to sacrifice output diversity, which makes them a poor fit for open-ended tasks such as creative writing.
Method: An RL framework is built around a semi-structured long Chain-of-Thought. Diverse branching is introduced at the planning stage and combined with a group-aware diversity reward to encourage exploration of distinct generation paths.
Result: Experiments on creative writing benchmarks show that the method significantly improves output diversity without harming generation quality and outperforms existing baselines.
Conclusion: Explicitly introducing a diversity mechanism at the planning stage balances performance optimization against output diversity in reinforcement learning and makes LLMs more suitable for open-ended tasks.
Abstract: Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.
[61] LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation
Stergios Chatzikyriakidis
Main category: cs.CL
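TL;DR: This paper proposes a hybrid system that combines large language models (LLMs) with deterministic phonological algorithms for accurate rhyme identification and generation in Modern Greek. With a comprehensive taxonomy of Greek rhyme types and an agentic generation pipeline with phonological verification, the approach improves performance, most clearly when Chain-of-Thought prompting is used. Pure LLM generation performs very poorly (under 4% valid poems), whereas the hybrid verification loop raises the valid-poem rate to 73.1%. The authors also release the system code and a cleaned corpus of more than 40,000 rhymes.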
Details
Motivation: Large language models perform poorly on phonologically grounded tasks such as rhyme, and the problem is more pronounced in lower-resource languages such as Modern Greek. Existing models lack an understanding of phonological structure, which leads to high failure rates in rhyme identification and generation.
Method: A hybrid system combines LLMs with deterministic phonological algorithms and implements a taxonomy covering several Greek rhyme types (Pure, Rich, Imperfect, and others). It uses an agentic generation pipeline with a phonological verification step, and evaluates several prompting strategies (zero-shot, few-shot, Chain-of-Thought, RAG-augmented) across multiple LLMs.
Result: Experiments reveal a significant "Reasoning Gap": the intuitive model Claude 3.7 reaches only 40% accuracy in rhyme identification, while the reasoning-heavy model Claude 4.5 reaches a state-of-the-art 54% with Chain-of-Thought prompting. Pure LLM generation yields under 4% valid poems, and adding hybrid verification raises this to 73.1%.
Conclusion: Combining large language models with domain-specific phonological rules addresses their limitations on phonologically sensitive tasks, especially for lower-resource languages. The study provides a reproducible system and a high-quality corpus for future work on poetry generation, speech processing, and related NLP tasks.
Abstract: Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs, including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0, and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant "Reasoning Gap": while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
[62] TaxoBell: Gaussian Box Embeddings for Self-Supervised Taxonomy Expansion
Sahil Mishra,Srinitish Srinivasan,Srikanta Bedathur,Tanmoy Chakraborty
Main category: cs.CL
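TL;DR: TaxoBell is a Gaussian box embedding framework that addresses the weakness of existing taxonomy-building methods in modeling asymmetric "is-a" relations. By translating box geometries into multivariate Gaussian distributions, it achieves more stable, robust, and interpretable hierarchical reasoning.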
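The appeal of Gaussian representations for "is-a" relations can be seen with a small, generic example. The sketch below uses the closed-form KL divergence between two diagonal Gaussians as a stand-in asymmetric score; this is an assumption for illustration, since the summary does not specify TaxoBell's actual energy function, and the concept names and numbers are invented.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for two Gaussians with diagonal covariance matrices."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
    return 0.5 * total


# Hypothetical concepts: (mean, per-dimension variance).
animal = ([0.0, 0.0], [4.0, 4.0])    # broad parent concept
dog = ([0.5, -0.5], [0.25, 0.25])    # narrow child concept near the parent mean

child_in_parent = kl_diag_gaussians(*dog, *animal)
parent_in_child = kl_diag_gaussians(*animal, *dog)

# A narrow child sits comfortably inside a broad parent, but not the reverse.
print(child_in_parent < parent_in_child)
```

Unlike cosine similarity between point vectors, the score changes when the arguments are swapped, which is the property a taxonomy needs; the variances also give a direct handle on how broad or uncertain a concept is.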
Details
Motivation: Existing automated taxonomy expansion methods rely on point-vector embeddings, which struggle to model the asymmetric "is-a" relations that are central to taxonomies. Box-embedding methods, in turn, suffer from unstable gradients, cannot express semantic uncertainty, and have limited capacity to model polysemy.
Method: TaxoBell maps between box geometries and multivariate Gaussian distributions, where means encode semantic location and covariances encode semantic uncertainty. Energy-based optimization provides stable training and robust modeling of ambiguous concepts.
Result: On five benchmark datasets, TaxoBell outperforms eight state-of-the-art baselines by 19% in MRR and around 25% in Recall@k. Ablation studies and error analysis document its advantages and limitations.
Conclusion: By combining box embeddings with Gaussian modeling, TaxoBell addresses unstable gradients, semantic uncertainty, and polysemy, and markedly improves the performance and interpretability of automated taxonomy expansion.
Abstract: Taxonomies form the backbone of structured knowledge representation across diverse domains, enabling applications such as e-commerce catalogs, semantic search, and biomedical discovery. Yet, manual taxonomy expansion is labor-intensive and cannot keep pace with the emergence of new concepts. Existing automated methods rely on point-based vector embeddings, which model symmetric similarity and thus struggle with the asymmetric "is-a" relationships that are fundamental to taxonomies. Box embeddings offer a promising alternative by enabling containment and disjointness, but they face key issues: (i) unstable gradients at the intersection boundaries, (ii) no notion of semantic uncertainty, and (iii) limited capacity to represent polysemy or ambiguity. We address these shortcomings with TaxoBell, a Gaussian box embedding framework that translates between box geometries and multivariate Gaussian distributions, where means encode semantic location and covariances encode uncertainty. Energy-based optimization yields stable optimization, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning. Extensive experimentation on five benchmark datasets demonstrates that TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines by 19% in MRR and around 25% in Recall@k. We further demonstrate the advantages and pitfalls of TaxoBell with error analysis and ablation studies.
[63] Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation
Andrew Moore,Paul Rayson,Dawn Archer,Tim Czerniak,Dawn Knight,Daisy Lal,Gearóid Ó Donnchadha,Mícheál Ó Meachair,Scott Piao,Elaine Uí Dhonnchadha,Johanna Vuorinen,Yan Yabo,Xiaobin Yang
Main category: cs.CL
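TL;DR: This paper carries out a large-scale cross-lingual evaluation of word sense disambiguation systems based on the USAS framework and proposes combining the rule-based system with neural network models, which improves performance. The datasets, models, and code are released as open resources.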
Details
Motivation: The USAS semantic framework lacks extensive open evaluation, especially in multilingual settings, and manually annotated data is scarce.
Method: The rule-based USAS system is evaluated on five languages using four existing datasets and one newly built Chinese dataset. A new silver-labelled English dataset is created, and several monolingual and multilingual neural models are trained and evaluated, comparing rule-based and neural systems in monolingual and cross-lingual settings.
Result: Neural models are compared with the rule-based system in monolingual and cross-lingual tasks, and the rule-based system is shown to benefit from a neural network. Open resources are released, including the training data, the Chinese evaluation set, the models, and the code.
Conclusion: Combining neural models with the rule-based USAS system improves its word sense disambiguation performance and supports its use in multilingual settings, and the open release of all resources facilitates follow-up research.
Abstract: Word Sense Disambiguation (WSD) has been widely evaluated using the semantic frameworks of WordNet, BabelNet, and the Oxford Dictionary of English. However, for the UCREL Semantic Analysis System (USAS) framework, no open extensive evaluation has been performed beyond lexical coverage or single language evaluation. In this work, we perform the largest semantic tagging evaluation of the rule-based system that uses the lexical resources in the USAS framework, covering five different languages using four existing datasets and one novel Chinese dataset. To overcome the lack of manually tagged training data, we create a new silver-labelled English dataset, on which we train various mono- and multilingual neural models, evaluated in both mono- and cross-lingual setups with comparisons to their rule-based counterparts, and show how a rule-based system can be enhanced with a neural network model. The resulting neural network models, including the data they were trained on, the Chinese evaluation dataset, and all of the code have been released as open resources.
[64] DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
Yibo Wang,Lei Wang,Yue Deng,Keming Wu,Yao Xiao,Huanjin Yao,Liwei Kang,Hai Ye,Yongcheng Jing,Lidong Bing
Main category: cs.CL
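TL;DR: Proposes DeepResearchEval, an automated framework for deep research task construction and agentic evaluation that uses persona-driven task generation and dynamic evaluation dimensions to improve the assessment of multi-step web research.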
Details
Motivation: Existing benchmarks require annotation-intensive task construction, rely on static evaluation dimensions, or cannot reliably verify facts when citations are missing, which makes it hard to evaluate deep research systems.
Method: A persona-driven task generation pipeline, combined with a two-stage filter (Task Qualification and Search Necessity), produces complex, realistic research tasks. An agentic evaluation pipeline comprises Adaptive Point-wise Quality Evaluation and Active Fact-Checking, which dynamically derive evaluation criteria and automatically verify report content.
Result: The framework automatically generates high-quality research tasks that require multi-source evidence integration and supports dynamic, reliable evaluation, verifying facts through web search even when citations are missing.
Conclusion: DeepResearchEval offers a more realistic, flexible, and scalable way to evaluate deep research systems, improving the automation and reliability of both task construction and evaluation.
Abstract: Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.
[65] Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection
Tianyi Niu,Justin Chih-Yao Chen,Genta Indra Winata,Shi-Xiong Zhang,Supriyo Chakraborty,Sambit Sahu,Yue Zhang,Elias Stengel-Eskin,Mohit Bansal
Main category: cs.CL
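TL;DR: This paper introduces RGD, a new setting for training LLM routers without ground-truth labeled data, and proposes CASCAL, a new query-only router that is more robust to low-quality generated data.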
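One way to picture label-free correctness estimation is plain majority voting over the model pool. The sketch below treats the most common answer to each query as a pseudo-label and scores each model by its agreement rate. It is a simplification under stated assumptions, not CASCAL itself: the model names and answers are invented, exact string match stands in for answer equivalence, and the hierarchical clustering of skill niches is left out.

```python
from collections import Counter


def consensus_correctness(answers_by_model):
    """Score each model by how often it agrees with the majority answer."""
    models = list(answers_by_model)
    n_queries = len(next(iter(answers_by_model.values())))
    agreements = {m: 0 for m in models}
    for i in range(n_queries):
        votes = Counter(answers_by_model[m][i] for m in models)
        # Ties are broken by first occurrence, which is an arbitrary choice.
        majority, _ = votes.most_common(1)[0]
        for m in models:
            if answers_by_model[m][i] == majority:
                agreements[m] += 1
    return {m: agreements[m] / n_queries for m in models}


# Hypothetical pool: each list holds one answer per query, in the same order.
answers = {
    "model_a": ["4", "Paris", "blue"],
    "model_b": ["4", "Paris", "red"],
    "model_c": ["5", "Paris", "blue"],
}
scores = consensus_correctness(answers)
print(scores)
```

Because no reference answers are consulted, an estimate like this does not inherit errors from a weak generator's labels, though it can still be misled when most models share the same mistake.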
Details
Motivation: Existing router methods usually rely on ground-truth labeled data, but in practice user request distributions are heterogeneous and unknown and high-quality labels are hard to obtain, so a training approach that needs no ground-truth labels is required.
Method: In the Routing with Generated Data (RGD) setting, generator LLMs produce queries and answers from task descriptions to train routers. Query-answer routers and query-only routers are compared, and the CASCAL router is designed to estimate model correctness through consensus voting and to discover model skill niches via hierarchical clustering.
Result: As generator quality decreases, query-answer routers degrade faster than query-only routers. Generators that answer their own questions accurately and whose questions differentiate the model pool yield better data. When trained on weak generator data, CASCAL outperforms the best query-answer router by 4.6% absolute accuracy.
Conclusion: Training routers on generated data alone is feasible, and CASCAL, by being more robust to generator quality, offers an effective solution for model routing without ground-truth labels.
Abstract: Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
[66] LLMs can Compress LLMs: Adaptive Pruning by Agents
Sai Varun Kodathala,Rakesh Vunnam
Main category: cs.CL
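TL;DR: This paper proposes agent-guided pruning, which uses a large language model as an adaptive agent that decides the pruning strategy for each layer, preserving critical knowledge pathways and substantially improving retention of factual knowledge in the pruned model.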
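The statistics handed to a pruning agent can be pictured with two small helpers. The first computes the Wanda-style importance score, the weight magnitude times the L2 norm of the corresponding input activation; the second standardizes one scalar per layer as a z-score. How the paper aggregates scores per layer and mixes in gradient importance is not stated in the summary, so the per-layer numbers and layer names below are invented, and the agent and rollback logic are omitted.

```python
import math
import statistics


def wanda_scores(weights, activations):
    """Wanda-style importance |W[i][j]| * ||X[:, j]||_2 for one linear layer.

    weights: rows are output units, columns are input features.
    activations: rows are calibration samples, columns are input features.
    """
    n_in = len(weights[0])
    col_norms = [
        math.sqrt(sum(row[j] ** 2 for row in activations)) for j in range(n_in)
    ]
    return [[abs(w) * col_norms[j] for j, w in enumerate(row)] for row in weights]


def layer_z_scores(layer_stats):
    """Standardize one scalar statistic per layer so layers are comparable."""
    values = list(layer_stats.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return {
        name: (v - mean) / std if std > 0 else 0.0
        for name, v in layer_stats.items()
    }


scores = wanda_scores(
    weights=[[1.0, -2.0], [0.5, 0.0]],
    activations=[[3.0, 0.0], [4.0, 1.0]],
)
print(scores)

# Hypothetical per-layer sensitivity summaries.
z = layer_z_scores({"layer0": 1.0, "layer1": 2.0, "layer2": 3.0})
print(z)
```

Standardizing per-layer statistics removes their absolute scale, which is one plausible reason the same prompt-level description can be reused across model sizes.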
Details
Motivation: Existing pruning methods rely on uniform or hand-crafted per-layer sparsity ratios, and pruning causes severe degradation of factual knowledge in large models. This work aims to address both problems in an adaptive way.
Method: Layer-wise sensitivity profiles are built by combining Wanda-inspired weight-activation metrics with gradient importance scores. An LLM agent with self-reflection capabilities uses them to decide which layers to prune, and a checkpoint rollback mechanism limits performance degradation.
Result: At approximately 45% sparsity on Qwen3 4B and 8B models, compared with structured pruning baselines, the method achieves a 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Effective self-correction needs only 2-4 rollbacks.
Conclusion: The framework requires no retraining and is model-agnostic, showing that large language models can guide the compression of other large models and achieve efficient pruning that retains knowledge well.
Abstract: As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
[67] Empathy Applicability Modeling for General Health Queries
Shan Randhawa,Agha Ali Raza,Kentaro Toyama,Julie Hui,Mustafa Naseem
Main category: cs.CL
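TL;DR: This paper proposes the Empathy Applicability Framework (EAF), a theory-driven framework for predicting, before a reply is generated, whether empathy is applicable to a patient query, and releases a benchmark of real patient queries dual-annotated by humans and GPT-4o.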
Details
Motivation: Large language models (LLMs) are increasingly used in clinical workflows but often lack clinical empathy. Existing NLP frameworks are largely limited to labeling empathy in doctors' responses after the fact and offer little support for anticipatory modeling of empathy needs, especially for general health queries.
Method: The EAF framework classifies patient queries using clinical, contextual, and linguistic cues to judge whether emotional reactions and interpretations are applicable. A benchmark of real patient queries, dual-annotated by humans and GPT-4o, is built and released. Classifiers are trained on human-labeled and GPT-only annotations to validate EAF.
Result: The classifiers predict empathy applicability well and outperform heuristic and zero-shot LLM baselines. On the subset with human consensus, human and GPT-4o annotations align substantially. Error analysis points to persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship.
Conclusion: EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and supports more empathetic interaction in asynchronous healthcare communication. Future work should incorporate multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation.
Abstract: LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.
[68] Value-Aware Numerical Representations for Transformer Language Models
Andreea Dutulescu,Stefan Ruseti,Mihai Dascalu
Main category: cs.CL
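TL;DR: Proposes a value-aware numerical representation that augments standard tokenized inputs with a prefix token conditioned on the number's value, strengthening a language model's understanding of numbers and improving its performance on arithmetic tasks.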
Details
Motivation: Transformer-based language models do well on mathematical reasoning benchmarks but remain fragile on basic numerical understanding and arithmetic, mainly because numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value.
Method: A value-aware numerical representation adds a dedicated prefix token to each number. Its embedding depends explicitly on the number's actual value, injecting this information into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures.
Result: Evaluation on a range of arithmetic tasks shows that the method outperforms baselines across numerical formats, task types, and operand lengths.
Conclusion: Explicitly encoding numerical value is an effective and efficient way to make language models more robust at basic numerical understanding.
Abstract: Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.
cs.CV [Back]
[69] Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
Yufeng Zhong,Lei Chen,Zhixiong Zeng,Xuanle Zhao,Deyang Jiang,Liming Zheng,Jing Huang,Haibo Qiu,Peng Shi,Siqi Yang,Lin Ma
Main category: cs.CV
TL;DR: Proposes format decoupled reinforcement learning (FD-RL), which builds on high-entropy pattern identification to improve OCR models on documents with complex formats (e.g., formulas, tables), substantially reducing output uncertainty and setting a new record for end-to-end models on OmniDocBench.
Details
Motivation: Existing OCR models show high entropy on format-sensitive text (e.g., formulas, tables), indicating high output uncertainty and motivating a reasoning mechanism that optimizes over different reading pathways.
Method: Proposes the FD-RL framework, which uses an entropy-based data filtration strategy to identify format-intensive instances and designs decoupled rewards tailored to different format types, enabling format-level optimization rather than reliance on token-level memorization alone.
Result: FD-RL achieves an average score of 90.41 on OmniDocBench, the best result reported for end-to-end models, and comprehensive ablation studies validate the effectiveness of each component.
Conclusion: Combining reinforcement learning with format-aware rewards effectively reduces OCR models' uncertainty on complex formatted text, demonstrating the potential of reasoning-style optimization for OCR.
Abstract: Reading text from images or scanned documents via OCR models has been a longstanding focus of researchers. Intuitively, text reading is perceived as a straightforward perceptual task, and existing work primarily focuses on constructing enriched data engineering to enhance SFT capabilities. In this work, we observe that even advanced OCR models exhibit significantly higher entropy in formatted text (e.g., formulas, tables) than in plain text, often by an order of magnitude. These statistical patterns reveal that advanced OCR models struggle with high output uncertainty when dealing with format-sensitive documents, suggesting that reasoning over diverse reading pathways may improve OCR performance. To address this, we propose format decoupled reinforcement learning (FD-RL), which leverages high-entropy patterns for targeted optimization. Our approach employs an entropy-based data filtration strategy to identify format-intensive instances and adopts format decoupled rewards tailored to different format types, enabling format-level validation rather than token-level memorization. FD-RL achieves an average score of 90.41 on OmniDocBench, setting a new record for end-to-end models on this highly popular benchmark. More importantly, we conduct comprehensive ablation studies over data, training, filtering, and rewarding strategies, thoroughly validating their effectiveness.
[70] Bias Detection and Rotation-Robustness Mitigation in Vision-Language Models and Generative Image Models
Tarannum Mithila
Main category: cs.CV
TL;DR: Studies the robustness and fairness of vision-language and generative models under input transformations, focusing on the effects of image rotation and distribution shift, and proposes strategies to improve robustness and reduce bias.
Details
Motivation: Current vision-language and generative models lack robustness and fairness under input transformations, particularly image rotation and distribution shift, which can lead to bias propagation and performance degradation.
Method: Analyzes how rotation perturbations affect model predictions, confidence calibration, and demographic bias, and proposes rotation-robust mitigation strategies that combine data augmentation, representation alignment, and model regularization.
Result: Experiments show that the proposed methods significantly improve robustness and reduce bias amplification without sacrificing overall performance.
Conclusion: Current multimodal systems have important limitations, and the proposed strategies help build more reliable and fairer AI models.
Abstract: Vision-Language Models (VLMs) and generative image models have achieved remarkable performance across multimodal tasks, yet their robustness and fairness under input transformations remain insufficiently explored. This work investigates bias propagation and robustness degradation in state-of-the-art vision-language and generative models, with a particular focus on image rotation and distributional shifts. We analyze how rotation-induced perturbations affect model predictions, confidence calibration, and demographic bias patterns. To address these issues, we propose rotation-robust mitigation strategies that combine data augmentation, representation alignment, and model-level regularization. Experimental results across multiple datasets demonstrate that the proposed methods significantly improve robustness while reducing bias amplification without sacrificing overall performance. This study highlights critical limitations of current multimodal systems and provides practical mitigation techniques for building more reliable and fair AI models.
[71] R²BD: A Reconstruction-Based Method for Generalizable and Efficient Detection of Fake Images
Qingyu Liu,Zhongjie Ba,Jianmin Guo,Qiu Wang,Zhibo Wang,Jie Shi,Kui Ren
Main category: cs.CV
TL;DR: Proposes R²BD, a new fake image detection framework comprising a unified generative model, G-LDM, and a single-step residual bias calculation module, which markedly improves detection efficiency and generalization across generative models.
Details
Motivation: Existing reconstruction-based methods rely on diffusion models, are inefficient, and generalize poorly to other generative paradigms such as GANs, so a more efficient and general detection framework is needed.
Method: Proposes the R²BD framework: (1) G-LDM, a unified reconstruction model that simulates the generation behaviors of VAEs, GANs, and diffusion models; (2) a residual bias calculation module that enables detection in a single inference step.
Result: Experiments on 10 public datasets show that R²BD is over 22× faster than existing reconstruction-based methods and outperforms state-of-the-art methods by an average of 13.87% in cross-dataset evaluations.
Conclusion: R²BD delivers efficient, generalizable AIGC image detection, performing well in speed, accuracy, and cross-model generalization.
Abstract: Recently, reconstruction-based methods have gained attention for AIGC image detection. These methods leverage pre-trained diffusion models to reconstruct inputs and measure residuals for distinguishing real from fake images. Their key advantage lies in reducing reliance on dataset-specific artifacts and improving generalization under distribution shifts. However, they are limited by significant inefficiency due to multi-step inversion and reconstruction, and their reliance on diffusion backbones further limits generalization to other generative paradigms such as GANs. In this paper, we propose a novel fake image detection framework, called R²BD, built upon two key designs: (1) G-LDM, a unified reconstruction model that simulates the generation behaviors of VAEs, GANs, and diffusion models, thereby broadening the detection scope beyond prior diffusion-only approaches; and (2) a residual bias calculation module that distinguishes real and fake images in a single inference step, which is a significant efficiency improvement over existing methods that typically require 20+ steps. Extensive experiments on a benchmark built from 10 public datasets demonstrate that R²BD is over 22× faster than existing reconstruction-based methods while achieving superior detection accuracy. In cross-dataset evaluations, it outperforms state-of-the-art methods by an average of 13.87%, showing strong efficiency and generalization across diverse generative methods. The code and dataset used for evaluation are available at https://github.com/QingyuLiu/RRBD.
[72] Residual Cross-Modal Fusion Networks for Audio-Visual Navigation
Yi Wang,Yinfeng Yu,Bin Ren
Main category: cs.CV
TL;DR: Proposes a Cross-Modal Residual Fusion Network (CRFN) for audio-visual embodied navigation, which uses bidirectional residual connections to achieve fine-grained alignment and complementary modeling of audio and visual features and improves cross-domain generalization.
Details
Motivation: Addresses single-modality dominance and information degradation in multimodal fusion, particularly the effective modeling of heterogeneous feature interactions in cross-domain scenarios.
Method: Proposes CRFN, which uses residual connections to realize bidirectional residual interactions between the audio and visual streams while keeping their representations independent, and introduces stabilization techniques to improve convergence and robustness.
Result: Significantly outperforms existing fusion methods on the Replica and Matterport3D datasets, shows stronger cross-domain generalization, and reveals that agents exhibit different modality dependence across datasets.
Conclusion: CRFN effectively promotes complementary learning in cross-modal fusion and offers a new perspective on the cross-modal collaboration mechanisms of embodied agents.
Abstract: Audio-visual embodied navigation aims to enable an agent to autonomously localize and reach a sound source in unseen 3D environments by leveraging auditory cues. The key challenge of this task lies in effectively modeling the interaction between heterogeneous features during multimodal fusion, so as to avoid single-modality dominance or information degradation, particularly in cross-domain scenarios. To address this, we propose a Cross-Modal Residual Fusion Network (CRFN), which introduces bidirectional residual interactions between audio and visual streams to achieve complementary modeling and fine-grained alignment, while maintaining the independence of their representations. Unlike conventional methods that rely on simple concatenation or attention gating, CRFN explicitly models cross-modal interactions via residual connections and incorporates stabilization techniques to improve convergence and robustness. Experiments on the Replica and Matterport3D datasets demonstrate that CRFN significantly outperforms state-of-the-art fusion baselines and achieves stronger cross-domain generalization. Notably, our experiments also reveal that agents exhibit differentiated modality dependence across different datasets. The discovery of this phenomenon provides a new perspective for understanding the cross-modal collaboration mechanism of embodied agents.
[73] ForensicFormer: Hierarchical Multi-Scale Reasoning for Cross-Domain Image Forgery Detection
Hema Hariharan Samson
Main category: cs.CV
TL;DR: Proposes ForensicFormer, a hierarchical multi-scale framework for cross-domain image forgery detection that combines low-level artifact, mid-level boundary, and high-level semantic analysis and significantly outperforms existing methods across forgery types and compression conditions.
Details
Motivation: Traditional forensic methods fail against AI-generated imagery and sophisticated editing tools, and existing single-paradigm models perform poorly in cross-domain settings, so a more robust and general detection approach is needed.
Method: Designs a hierarchical multi-scale framework (ForensicFormer) based on cross-attention transformers that unifies low-level artifact detection, mid-level boundary analysis, and high-level semantic reasoning for end-to-end forgery detection and localization.
Result: Achieves 86.8% average accuracy across seven different datasets, retains 83% accuracy under JPEG compression at Q=70, and reaches a pixel-level localization F1-score of 0.76; ablation studies confirm that each module contributes significantly (a 4-10% improvement).
Conclusion: ForensicFormer effectively combines classical forensics with modern deep learning, offers strong generalization and interpretability, suits real-world detection of unknown forgery techniques, and advances general-purpose image forensics.
Abstract: The proliferation of AI-generated imagery and sophisticated editing tools has rendered traditional forensic methods ineffective for cross-domain forgery detection. We present ForensicFormer, a hierarchical multi-scale framework that unifies low-level artifact detection, mid-level boundary analysis, and high-level semantic reasoning via cross-attention transformers. Unlike prior single-paradigm approaches, which achieve <75% accuracy on out-of-distribution datasets, our method maintains 86.8% average accuracy across seven diverse test sets, spanning traditional manipulations, GAN-generated images, and diffusion model outputs - a significant improvement over state-of-the-art universal detectors. We demonstrate superior robustness to JPEG compression (83% accuracy at Q=70 vs. 66% for baselines) and provide pixel-level forgery localization with a 0.76 F1-score. Extensive ablation studies validate that each hierarchical component contributes a 4-10% accuracy improvement, and qualitative analysis reveals interpretable forensic features aligned with human expert reasoning. Our work bridges classical image forensics and modern deep learning, offering a practical solution for real-world deployment where manipulation techniques are unknown a priori.
[74] Learning Domain-Invariant Representations for Cross-Domain Image Registration via Scene-Appearance Disentanglement
Jiahao Qin,Yiwen Wang
Main category: cs.CV
TL;DR: Proposes SAR-Net, a unified framework that decomposes images into domain-invariant scene representations and domain-specific appearance codes to address the breakdown of the brightness constancy assumption in cross-domain image registration, and validates it both theoretically and empirically.
Details
Motivation: Conventional image registration methods rely on the brightness constancy assumption and degrade when source and target images differ systematically in intensity, so a method that handles domain shift is needed for robust cross-domain registration.
Method: Proposes the SAR-Net framework, which adopts scene-appearance disentanglement and registers via re-rendering rather than direct intensity matching; introduces a scene consistency loss and a domain alignment loss, and establishes theoretical guarantees for geometric correspondence in the shared latent space.
Result: Validated on bidirectional scanning microscopy data, SAR-Net achieves 0.885 SSIM and 0.979 NCC, a 3.1x improvement over the strongest baseline, while running in real time (77 fps); ablations show that both the scene consistency loss and the domain alignment loss are essential.
Conclusion: Through principled scene-appearance disentanglement and re-rendering, SAR-Net effectively addresses domain shift in cross-domain image registration and performs well in both theory and practice.
Abstract: Image registration under domain shift remains a fundamental challenge in computer vision and medical imaging: when source and target images exhibit systematic intensity differences, the brightness constancy assumption underlying conventional registration methods is violated, rendering correspondence estimation ill-posed. We propose SAR-Net, a unified framework that addresses this challenge through principled scene-appearance disentanglement. Our key insight is that observed images can be decomposed into domain-invariant scene representations and domain-specific appearance codes, enabling registration via re-rendering rather than direct intensity matching. We establish theoretical conditions under which this decomposition enables consistent cross-domain alignment (Proposition 1) and prove that our scene consistency loss provides a sufficient condition for geometric correspondence in the shared latent space (Proposition 2). Empirically, we validate SAR-Net on bidirectional scanning microscopy, where coupled domain shift and geometric distortion create a challenging real-world testbed. Our method achieves 0.885 SSIM and 0.979 NCC, representing a 3.1x improvement over the strongest baseline, while maintaining real-time performance (77 fps). Ablation studies confirm that both scene consistency and domain alignment losses are necessary: removing either degrades performance by 90% SSIM or causes a 223x increase in latent alignment error, respectively.
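As a reading aid for the NCC figure quoted above, here is a minimal sketch of zero-mean normalized cross-correlation in plain Python. It assumes the common textbook definition over flattened pixel intensities; the abstract does not say which NCC variant SAR-Net's evaluation uses, and the function name `ncc` is hypothetical.

```python
import math


def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length
    sequences of pixel intensities (an assumed, textbook definition)."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Center both signals so a constant intensity offset has no effect.
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    # A constant image has zero variance; report 0.0 instead of dividing by zero.
    return num / den if den > 0 else 0.0
```

The score lies in [-1, 1] and is unchanged when one image is rescaled by a positive gain and shifted by an offset, which is one reason a correlation-style metric is a natural choice when the two images differ by a systematic intensity change.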
Code and data are available at https://github.com/D-ST-Sword/SAR-NET.
[75] The Semantic Lifecycle in Embodied AI: Acquisition, Representation and Storage via Foundation Models
Shuai Chen,Hao Chen,Yuanchen Bei,Tianyang Zhao,Zhibo Zhou,Feiran Huang
Main category: cs.CV
TL;DR: Proposes the "Semantic Lifecycle", a unified framework describing how semantic knowledge evolves in foundation-model-driven embodied AI, and analyzes and compares recent advances across three key stages: acquisition, representation, and storage.
Details
Motivation: As embodied agents face more complex environments and open-ended tasks, traditional approaches that process semantic information in isolation cannot meet the demand for generality and robustness, so a systematic framework is needed to integrate multi-source, multi-stage semantic information.
Method: Proposes the Semantic Lifecycle framework, which takes a holistic view of the continuous flow and maintenance of semantic knowledge in embodied AI, and uses it to systematically analyze and summarize the acquisition, representation, and storage stages.
Result: Surveys recent advances in semantic acquisition, representation, and storage for foundation-model-based embodied AI, identifying the key techniques and the development trajectory of each stage.
Conclusion: The Semantic Lifecycle offers a systematic perspective for understanding and designing semantic processing in embodied AI systems; future research should address cross-stage coordination, long-term semantic maintenance, and the challenges of real-world deployment.
Abstract: Semantic information in embodied AI is inherently multi-source and multi-stage, making it challenging to fully leverage for achieving stable perception-to-action loops in real-world environments. Early studies have combined manual engineering with deep neural networks, achieving notable progress in specific semantic-related embodied tasks. However, as embodied agents encounter increasingly complex environments and open-ended tasks, the demand for more generalizable and robust semantic processing capabilities has become imperative. Recent advances in foundation models (FMs) address this challenge through their cross-domain generalization abilities and rich semantic priors, reshaping the landscape of embodied AI research. In this survey, we propose the Semantic Lifecycle as a unified framework to characterize the evolution of semantic knowledge within embodied AI driven by foundation models. Departing from traditional paradigms that treat semantic processing as isolated modules or disjoint tasks, our framework offers a holistic perspective that captures the continuous flow and maintenance of semantic knowledge. Guided by this embodied semantic lifecycle, we further analyze and compare recent advances across three key stages: acquisition, representation, and storage. Finally, we summarize existing challenges and outline promising directions for future research.
[76] TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts
Yu Xu,Hongbin Yan,Juan Cao,Yiji Cheng,Tiankai Hang,Runze He,Zijin Yin,Shiyi Zhang,Yuxin Zhang,Jintao Li,Chunyu Wang,Qinglin Lu,Tong-Yee Lee,Fan Tang
Main category: cs.CV
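TL;DR: Proposes a new sparse Mixture-of-Experts (MoE) framework that injects semantic intent into MoE routing through hierarchical task semantic annotation and predictive alignment regularization, mitigating task interference in unified image generation and editing models.
Details
Motivation: Existing dense diffusion transformer architectures suffer severe task interference in unified image generation and editing because a shared parameter space cannot serve conflicting objectives, while conventional MoE gating networks lack awareness of global task intent and cannot specialize effectively.
Method: Proposes a Hierarchical Task Semantic Annotation scheme to build structured task descriptors and designs Predictive Alignment Regularization to align the gating network's routing decisions with high-level semantics, improving the experts' semantically relevant specialization.
Result: The method outperforms dense baselines on image generation and editing in fidelity and quality, and analysis shows that the experts develop clear, semantically correlated specializations.
Conclusion: Injecting semantic intent into MoE routing effectively mitigates multi-task interference, turning the gating network from a task-agnostic component into a task-aware dispatch center and strengthening the model's expressiveness and specialization.
Abstract: Unified image generation and editing models suffer from severe task interference in dense diffusion transformer architectures, where a shared parameter space must compromise between conflicting objectives (e.g., local editing vs. subject-driven generation). While the sparse Mixture-of-Experts (MoE) paradigm is a promising solution, its gating networks remain task-agnostic, operating based on local features, unaware of global task intent. This task-agnostic nature prevents meaningful specialization and fails to resolve the underlying task interference. In this paper, we propose a novel framework to inject semantic intent into MoE routing. We introduce a Hierarchical Task Semantic Annotation scheme to create structured task descriptors (e.g., scope, type, preservation). We then design Predictive Alignment Regularization to align internal routing decisions with the task's high-level semantics. This regularization evolves the gating network from a task-agnostic executor to a dispatch center. Our model effectively mitigates task interference, outperforming dense baselines in fidelity and quality, and our analysis shows that experts naturally develop clear and semantically correlated specializations.
[77] Compressing Vision Transformers in Geospatial Transfer Learning with Manifold-Constrained Optimization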
Thomas Snyder,H. Lexie Yang,Stefan Schnake,Steffen Schotthöfer
Main category: cs.CV
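TL;DR: Proposes a method based on the manifold-constrained optimization framework DLRT for compressing vision-transformer-based geospatial foundation models during transfer learning, achieving high parameter compression while preserving downstream accuracy.
Details
Motivation: Existing geospatial foundation models have large parameter counts, and compression tends to reduce accuracy, which limits their deployment on resource-constrained edge devices.
Method: Applies DLRT, a manifold-constrained optimization framework, during transfer learning to impose a structured low-dimensional parameterization aligned with the downstream objective, yielding efficient compression.
Result: Experiments on multiple geospatial benchmarks show that the method clearly outperforms off-the-shelf low-rank methods such as LoRA, achieving substantial parameter reduction with minimal accuracy loss.
Conclusion: The method supports deployment of high-performing, lightweight geospatial models on edge devices and advances the practical use of foundation models.
Abstract: Deploying geospatial foundation models on resource-constrained edge devices demands compact architectures that maintain high downstream performance. However, their large parameter counts and the accuracy loss often induced by compression limit practical adoption. In this work, we leverage the manifold-constrained optimization framework DLRT to compress large vision transformer-based geospatial foundation models during transfer learning. By enforcing structured low-dimensional parameterizations aligned with downstream objectives, this approach achieves strong compression while preserving task-specific accuracy. We show that the method outperforms off-the-shelf low-rank methods such as LoRA. Experiments on diverse geospatial benchmarks confirm substantial parameter reduction with minimal accuracy loss, enabling high-performing, on-device geospatial models.
[78] Adaptive few-shot learning for robust part quality classification in two-photon lithography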
Sixian Jia,Ruo-Syuan Mei,Chenhui Shao
Main category: cs.CV
TL;DR: Proposes an adaptive framework for computer-vision-based quality control in two-photon lithography manufacturing that supports novel defect detection, few-shot incremental learning, and adaptation across part geometries.
Details
Motivation: Existing CV models cannot effectively detect new defects, update from few samples, or adapt to new part geometries in dynamic manufacturing environments, so a framework is needed that keeps quality models maintainable over their whole life cycle.
Method: Builds on a unified, scale-robust backbone and combines three methods: LDA-based statistical hypothesis testing for novelty detection, a two-stage rehearsal strategy for few-shot incremental learning, and a few-shot Domain-Adversarial Neural Network (DANN) for domain adaptation.
Result: On a hemisphere-to-cube cross-domain dataset, hypothesis testing identifies new-class batches with 99-100% accuracy; incremental learning reaches 92% accuracy with only K=20 samples; domain adaptation reaches 96.19% accuracy on the target domain with only K=5 samples.
Conclusion: The proposed framework enables efficient, robust, and data-economical deployment and maintenance of CV models in continually evolving manufacturing scenarios.
Abstract: Two-photon lithography (TPL) is an advanced additive manufacturing (AM) technique for fabricating high-precision micro-structures. While computer vision (CV) has proven effective for automated quality control, existing models are often static, rendering them ineffective in dynamic manufacturing environments. These models typically cannot detect new, unseen defect classes, be efficiently updated from scarce data, or adapt to new part geometries. To address this gap, this paper presents an adaptive CV framework for the entire life-cycle of quality model maintenance. The proposed framework is built upon a shared, scale-robust backbone model and integrates three key methodologies: (1) a statistical hypothesis testing framework based on Linear Discriminant Analysis (LDA) for novelty detection, (2) a two-stage, rehearsal-based strategy for few-shot incremental learning, and (3) a few-shot Domain-Adversarial Neural Network (DANN) for few-shot domain adaptation. The framework was evaluated on a TPL dataset featuring hemisphere structures as the source domain and cube structures as the target domain, with each domain categorized into good, minor damaged, and damaged quality classes. The hypothesis testing method successfully identified new class batches with 99-100% accuracy. The incremental learning method integrated a new class to 92% accuracy using only K=20 samples. The domain adaptation model bridged the severe domain gap, achieving 96.19% accuracy on the target domain using only K=5 shots. These results demonstrate a robust and data-efficient solution for deploying and maintaining CV models in evolving production scenarios.
[79] Variance-Penalized MC-Dropout as a Learned Smoothing Prior for Brain Tumour Segmentation
Satyaki Roy Chowdhury,Golrokh Mirzaei
Main category: cs.CV
TL;DR: Proposes UAMSA-UNet, a new Uncertainty-Aware Multi-Scale Attention Bayesian U-Net that uses Monte Carlo Dropout to learn a data-driven smoothing prior and combines multi-scale features with attention, improving the accuracy and spatial coherence of brain tumor segmentation while reducing computation.
Details
Motivation: Existing CNN and U-Net methods often produce noisy boundaries in regions of tumor infiltration, which degrades segmentation quality, so a method is needed that suppresses prediction fluctuations and improves boundary coherence and accuracy.
Method: Introduces UAMSA-UNet, which models uncertainty with Monte Carlo Dropout and fuses multi-scale features and attention maps to capture both fine details and global context; designs a smoothing-regularized loss that adds to binary cross-entropy a variance penalty over multiple stochastic forward passes, suppressing spurious fluctuations.
Result: On BraTS2023, improves the Dice coefficient by up to 3.3% and mIoU by up to 2.7% over U-Net; on BraTS2024, improves Dice by up to 4.5% and IoU by up to 4.0% over the best baseline; also uses 42.5% fewer FLOPs than U-Net++ while maintaining higher accuracy.
Conclusion: By combining multi-scale attention with a learned smoothing prior, UAMSA-UNet improves brain tumor segmentation quality while markedly lowering computational cost, balances accuracy and efficiency well, and provides a flexible foundation for future integration of transformer modules.
Abstract: Brain tumor segmentation is essential for diagnosis and treatment planning, yet many CNN and U-Net based approaches produce noisy boundaries in regions of tumor infiltration. We introduce UAMSA-UNet, an Uncertainty-Aware Multi-Scale Attention-based Bayesian U-Net that instead leverages Monte Carlo Dropout to learn a data-driven smoothing prior over its predictions, while fusing multi-scale features and attention maps to capture both fine details and global context. Our smoothing-regularized loss augments binary cross-entropy with a variance penalty across stochastic forward passes, discouraging spurious fluctuations and yielding spatially coherent masks. On BraTS2023, UAMSA-UNet improves Dice Similarity Coefficient by up to 3.3% and mean IoU by up to 2.7% over U-Net; on BraTS2024, it delivers up to 4.5% Dice and 4.0% IoU gains over the best baseline. Remarkably, it also reduces FLOPs by 42.5% relative to U-Net++ while maintaining higher accuracy. These results demonstrate that, by combining multi-scale attention with a learned smoothing prior, UAMSA-UNet achieves both better segmentation quality and computational efficiency, and provides a flexible foundation for future integration with transformer-based modules for further enhanced segmentation results.
[80] Thermo-LIO: A Novel Multi-Sensor Integrated System for Structural Health Monitoring
Chao Yang,Haoyuan Zheng,Yue Ma
Main category: cs.CV
TL;DR: Proposes Thermo-LIO, a novel multi-sensor system that fuses thermal imaging with high-resolution LiDAR to improve the accuracy and coverage of structural health monitoring for large buildings.
Details
Motivation: Traditional two-dimensional thermography is limited when dealing with complex geometries, inaccessible areas, and subsurface defects, which makes it hard to apply effectively to structural health monitoring of large infrastructure.
Method: Develops a multimodal fusion method for thermal imaging and LiDAR with precise calibration and synchronization of the data streams, and combines it with LiDAR-Inertial Odometry (LIO) to achieve full-coverage monitoring of large-scale structures.
Result: In case studies on a bridge and a hall building, Thermo-LIO detects thermal anomalies and structural defects more accurately than traditional methods, supports real-time processing, and extends inspection coverage.
Conclusion: Multimodal sensor fusion plays a key role in improving structural health monitoring of large civil infrastructure, and Thermo-LIO offers an efficient, accurate solution for future SHM systems.
Abstract: Traditional two-dimensional thermography, despite being non-invasive and useful for defect detection in the construction field, is limited in effectively assessing complex geometries, inaccessible areas, and subsurface defects. This paper introduces Thermo-LIO, a novel multi-sensor system that can enhance Structural Health Monitoring (SHM) by fusing thermal imaging with high-resolution LiDAR. To achieve this, the study first develops a multimodal fusion method combining thermal imaging and LiDAR, enabling precise calibration and synchronization of multimodal data streams to create accurate representations of temperature distributions in buildings. Second, it integrates this fusion approach with LiDAR-Inertial Odometry (LIO), enabling full coverage of large-scale structures and allowing for detailed monitoring of temperature variations and defect detection across inspection cycles. Experimental validations, including case studies on a bridge and a hall building, demonstrate that Thermo-LIO can detect detailed thermal anomalies and structural defects more accurately than traditional methods. The system enhances diagnostic precision, enables real-time processing, and expands inspection coverage, highlighting the crucial role of multimodal sensor integration in advancing SHM methodologies for large-scale civil infrastructure.
[81] SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds
Constantin Kolomiiets,Miroslav Purkrabek,Jiri Matas
Main category: cs.CV
TL;DR: Proposes PoseMaskRefine, a pose-keypoint-guided improvement to the Segment Anything Model (SAM) that incorporates high-visibility keypoints during fine-tuning to make human segmentation more robust and accurate under occlusion.
Details
Motivation: The original SAM performs poorly on human segmentation under occlusion, especially when pose keypoints are partially or fully invisible, so a more robust method is needed for complex scenes.
Method: Makes lightweight modifications to SAM 2.1 and injects pose keypoint information into its iterative correction mechanism through a fine-tuning strategy called PoseMaskRefine; at inference, only the three keypoints with the highest visibility are used as prompts, and mask prediction from a single keypoint is also supported.
Result: The method markedly improves segmentation accuracy and robustness in occluded scenes across multiple datasets while retaining SAM's generalization, and is more robust to common errors such as missing body parts or misclassified clothing.
Conclusion: Pose-guided fine-tuning of SAM is an effective route to occlusion-aware human segmentation and substantially improves stability and usability in practice without sacrificing generalization.
Abstract: Segment Anything (SAM) provides an unprecedented foundation for human segmentation, but may struggle under occlusion, where keypoints may be partially or fully invisible. We adapt SAM 2.1 for pose-guided segmentation with minimal encoder modifications, retaining its strong generalization. Using a fine-tuning strategy called PoseMaskRefine, we incorporate pose keypoints with high visibility into the iterative correction process originally employed by SAM, yielding improved robustness and accuracy across multiple datasets. During inference, we simplify prompting by selecting only the three keypoints with the highest visibility. This strategy reduces sensitivity to common errors, such as missing body parts or misclassified clothing, and allows accurate mask prediction from as few as a single keypoint. Our results demonstrate that pose-guided fine-tuning of SAM enables effective, occlusion-aware human segmentation while preserving the generalization capabilities of the original model. The code and pretrained models will be available at https://mirapurkrabek.github.io/BBox-MaskPose.
[82] Instance camera focus prediction for crystal agglomeration classification
Xiaoyu Ji,Chenhao Zhang,Tyler James Downard,Zoltan Nagy,Ali Shakouri,Fengqing Zhu
Main category: cs.CV
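TL;DR: Proposes a method that combines instance segmentation with a camera focus prediction network to improve the classification and segmentation accuracy of crystal agglomeration in microscopic images.
Details
Motivation: Because of the limits of two-dimensional imaging, crystals that overlap at different depth layers can be mistaken for agglomerates, and traditional methods struggle to separate true agglomeration from optical illusion.
Method: First uses an instance camera focus prediction network to quantify focus and assign each instance to one of two focus levels, then combines an instance segmentation model with the predicted focus level to classify agglomeration.
Result: On ammonium perchlorate crystal and sugar crystal datasets, the proposed method outperforms baseline models in both agglomeration classification and segmentation accuracy.
Conclusion: Using focus information to assist instance segmentation effectively improves the accuracy of crystal agglomeration analysis and suits microscopic image analysis tasks affected by depth-layer interference.
Abstract: Agglomeration refers to the process of crystal clustering due to interparticle forces. Crystal agglomeration analysis from microscopic images is challenging due to the inherent limitations of two-dimensional imaging. Overlapping crystals may appear connected even when located at different depth layers. Because optical microscopes have a shallow depth of field, crystals that are in-focus and out-of-focus in the same image typically reside on different depth layers and do not constitute true agglomeration. To address this, we first quantify camera focus with an instance camera focus prediction network that predicts a two-class focus level, which aligns better with visual observations than traditional image processing focus measures. An instance segmentation model is then combined with the predicted focus level for agglomeration classification. Our proposed method achieves higher agglomeration classification and segmentation accuracy than the baseline models on ammonium perchlorate crystal and sugar crystal datasets.
[83] Changes in Visual Attention Patterns for Detection Tasks due to Dependencies on Signal and Background Spatial Frequencies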
Amar Kavuri,Howard C. Gifford,Mini Das
Main category: cs.CV
TL;DR: Using digital breast tomosynthesis (DBT) images, this study examines how image and signal properties affect visual attention mechanisms during signal detection tasks, finding that detection errors stem mainly from decision failures at later perceptual stages and that signal detectability is jointly influenced by target morphology and background complexity.
Details
Motivation: Aims to understand how image and signal properties affect visual attention mechanisms during signal or pattern recognition in complex heterogeneous backgrounds, in order to reduce misdiagnosis in medical imaging.
Method: Uses digital breast phantoms (Bakic and XCAT) to generate DBT images with different densities and structures and randomly inserts two types of lesions with distinct spatial frequency properties; six observers perform localization and detection tasks while eye-gaze data are collected to analyze differences in visual attention.
Result: Decision failures are the main cause of detection errors; signal detectability is jointly influenced by target morphology and background complexity; longer fixation durations on spiculated lesions indicate that visual attention is affected by the interaction between signal and background spatial frequencies.
Conclusion: In complex environments, visual attention mechanisms are jointly regulated by local signal features and global anatomical noise, so optimizing detection requires considering both signal morphology and background properties.
Abstract: We aim to investigate the impact of image and signal properties on visual attention mechanisms during a signal detection task in digital images. The insights from this work apply to many areas of digital imaging where signal or pattern recognition takes place in complex heterogeneous backgrounds. We used simulated tomographic breast images as the platform to investigate this question. While radiologists are highly effective at analyzing medical images to detect and diagnose diseases, misdiagnosis still occurs. We selected digital breast tomosynthesis (DBT) images as sample medical images with different breast densities and structures using digital breast phantoms (Bakic and XCAT). Two types of lesions (with distinct spatial frequency properties) were randomly inserted in the phantoms during projections to generate abnormal cases. Six human observers participated in an observer study designed for the localization and detection of a 3-mm spherical lesion and a 6-mm spiculated lesion in reconstructed in-plane DBT slices. We collected eye-gaze data to estimate gaze metrics and to examine differences in visual attention mechanisms. We found that detection performance in complex visual environments is strongly constrained by later perceptual stages, with decision failures accounting for the largest proportion of errors. Signal detectability is jointly influenced by both target morphology and background complexity, revealing a critical interaction between local signal features and global anatomical noise. Increased fixation duration on spiculated lesions suggests that visual attention is differentially engaged depending on background and signal spatial frequency dependencies.
[84] Depth-Wise Representation Development Under Blockwise Self-Supervised Learning for Video Vision Transformers
Jonas Römer,Timo Dickscheid
Main category: cs.CV
TL;DR: Explores whether masked video transformers can be trained with blockwise self-supervised learning (BWSSL) without end-to-end backpropagation, and analyzes how its learning dynamics and representation development differ from end-to-end training.
Details
Motivation: Motivated by progress in blockwise self-supervised learning, asks whether masked video transformers can be trained without end-to-end backpropagation while handling the challenges of spatiotemporal context and long-range temporal structure.
Method: Partitions the encoder into blocks, each optimized with a local masked reconstruction loss, and applies this scheme to a masked autoencoding video vision transformer.
Result: Across model sizes and partition granularities, training converges and approaches end-to-end baselines under linear-probe and retrieval proxies; blockwise training exposes higher-level structure earlier, later blocks saturate and operate in a geometry-preserving regime, and token-level shifts may be induced.
Conclusion: Late-block saturation and interface formation are important contributors to the current performance gap, and the results suggest that blockwise training is a promising alternative to end-to-end backpropagation.
Abstract: End-to-end backpropagation couples all layers through a global error signal, enabling coordinated learning but requiring long-range credit assignment. Motivated by recent progress in blockwise self-supervised learning (BWSSL), we ask whether masked video transformers can be trained without end-to-end backpropagation. Applying BWSSL to masked video modeling remains relatively underexplored and must handle spatiotemporal context and long-range temporal structure. More broadly, analyses that compare BWSSL and end-to-end training in terms of learning dynamics and depth-wise representation development remain sparse. We apply blockwise learning to a masked autoencoding video vision transformer by partitioning the encoder into blocks, each of which is optimized with a local masked reconstruction loss. Across model sizes and partition granularities, training converges and yields representations close to matched end-to-end baselines under linear-probe and retrieval proxies. In order to compare intermediate representations, we analyze depth-wise decodability, inter-block similarity, and patch-level diagnostics. Blockwise training exposes higher-level structure earlier, while later blocks saturate and operate in a more geometry-preserving regime. It can also induce token-level shifts consistent with stronger early mixing that pooled metrics can miss. These findings point to late-block saturation and interface formation as contributors to the remaining gap.
[85] Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking
Junze Shi,Yang Yu,Jian Shi,Haibo Luo
Main category: cs.CV
TL;DR: 本文提出STDTrack,一种引入时空依赖的轻量级目标跟踪框架,通过密集视频采样、时序传播的时空令牌和多帧信息融合模块提升性能,在保持实时性的同时达到领先水平。
Details
Motivation: 现有轻量级跟踪器训练时仅稀疏采样(每序列一个模板和搜索图像),未能充分利用视频中的时空信息,导致性能受限。 Method: 提出STDTrack框架:采用密集视频采样;设计时序传播的时空令牌指导逐帧特征提取;构建多帧信息融合模块(MFIFM)结合历史上下文;通过时空令牌维护器(STM)与质量感知更新机制保证信息可靠性;引入多尺度预测头应对目标尺度变化。 Result: 在六个基准上实现最先进性能,GOT-10k上接近某些非实时高性能跟踪器(如MixFormer),同时达到192 FPS(GPU)和41 FPS(CPU)。 Conclusion: STDTrack有效融合时空信息,在保持高效率的同时显著提升轻量级跟踪器性能,缩小了与高性能跟踪器之间的差距。 Abstract: Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these achievements, existing methods universally employ sparse sampling during training--utilizing only one template and one search image per sequence--which fails to comprehensively explore spatiotemporal information in videos. This limitation constrains performance and cause the gap between lightweight and high-performance trackers. To bridge this divide while maintaining real-time efficiency, we propose STDTrack, a framework that pioneers the integration of reliable spatiotemporal dependencies into lightweight trackers. Our approach implements dense video sampling to maximize spatiotemporal information utilization. We introduce a temporally propagating spatiotemporal token to guide per-frame feature extraction. To ensure comprehensive target state representation, we disign the Multi-frame Information Fusion Module (MFIFM), which augments current dependencies using historical context. The MFIFM operates on features stored in our constructed Spatiotemporal Token Maintainer (STM), where a quality-based update mechanism ensures information reliability. Considering the scale variation among tracking targets, we develop a multi-scale prediction head to dynamically adapt to objects of different sizes. Extensive experiments demonstrate state-of-the-art results across six benchmarks. 
Notably, on GOT-10k, STDTrack rivals certain high-performance non-real-time trackers (e.g., MixFormer) while operating at 192 FPS (GPU) and 41 FPS (CPU).
[86] Vision Foundation Models for Domain Generalisable Cross-View Localisation in Planetary Ground-Aerial Robotic Teams
Lachlan Holden,Feras Dayoub,Alberto Candela,David Harvey,Tat-Jun Chin
Main category: cs.CV
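TL;DR: This paper proposes a cross-view localisation method based on dual-encoder deep neural networks that uses synthetic data and semantic segmentation to localise a rover accurately within an aerial map.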
Details
Motivation: Real space data with accurate ground-truth position labels is scarce, which makes conventional machine learning methods hard to train; the domain gap must be bridged and localisation accuracy improved for real missions.
Method: Dual-encoder deep neural networks perform cross-view matching, combined with semantic segmentation from vision foundation models and high-volume synthetic data; particle filters perform state estimation over image sequences to improve localisation accuracy.
Result: The method is validated on a dataset from a real planetary analogue environment and achieves accurate localisation over simple and complex trajectories from monocular ground-view images.
Conclusion: By combining synthetic data, semantic information, and state estimation, the method significantly narrows the sim-to-real domain gap and offers a feasible localisation solution for ground-aerial robotic teams in planetary exploration.
Abstract: Accurate localisation in planetary robotics enables the advanced autonomy required to support the increased scale and scope of future missions. The successes of the Ingenuity helicopter and multiple planetary orbiters lay the groundwork for future missions that use ground-aerial robotic teams. In this paper, we consider rovers using machine learning to localise themselves in a local aerial map using limited field-of-view monocular ground-view RGB images as input. A key consideration for machine learning methods is that real space data with ground-truth position labels suitable for training is scarce. In this work, we propose a novel method of localising rovers in an aerial map using cross-view-localising dual-encoder deep neural networks. We leverage semantic segmentation with vision foundation models and high volume synthetic data to bridge the domain gap to real images. We also contribute a new cross-view dataset of real-world rover trajectories with corresponding ground-truth localisation data captured in a planetary analogue facility, plus a high volume dataset of analogous synthetic image pairs. Using particle filters for state estimation with the cross-view networks allows accurate position estimation over simple and complex trajectories based on sequences of ground-view images.
[87] Small but Mighty: Dynamic Wavelet Expert-Guided Fine-Tuning of Large-Scale Models for Optical Remote Sensing Object Segmentation
Yanguang Sun,Chao Wang,Jian Yang,Lei Luo
Main category: cs.CV
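TL;DR: This paper proposes WEFT, a dynamic wavelet expert-guided fine-tuning paradigm that efficiently adapts large-scale foundation models to optical remote sensing image segmentation. A task-specific wavelet expert extractor and an expert-guided conditional adapter reduce the number of trainable parameters while significantly improving performance.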
Details
Motivation: Full-parameter fine-tuning of large-scale models consumes a great deal of GPU memory and compute, so existing work has explored large models for remote sensing image segmentation only to a limited extent; an efficient, low-resource fine-tuning method is needed.
Method: The WEFT framework includes a wavelet expert extractor that models wavelet experts from multiple perspectives and dynamically regulates their outputs, producing trainable features rich in task information; an expert-guided conditional adapter injects these trainable features to sharpen the fine-grained perception of frozen features, then iteratively updates both types of feature for efficient fine-tuning.
Result: Outperforms 21 SOTA methods on three remote sensing image datasets and achieves the best results in camouflage, natural, and medical scenarios.
Conclusion: Through wavelet expert guidance, WEFT addresses the resource bottleneck of fine-tuning large-scale models and provides an efficient, strong solution for remote sensing image segmentation.
Abstract: Accurately localizing and segmenting relevant objects from optical remote sensing images (ORSIs) is critical for advancing remote sensing applications. Existing methods are typically built upon moderate-scale pre-trained models and employ diverse optimization strategies to achieve promising performance under full-parameter fine-tuning. In fact, deeper and larger-scale foundation models can provide stronger support for performance improvement. However, due to their massive number of parameters, directly adopting full-parameter fine-tuning leads to pronounced training difficulties, such as excessive GPU memory consumption and high computational costs, which result in extremely limited exploration of large-scale models in existing works. In this paper, we propose a novel dynamic wavelet expert-guided fine-tuning paradigm with fewer trainable parameters, dubbed WEFT, which efficiently adapts large-scale foundation models to ORSIs segmentation tasks by leveraging the guidance of wavelet experts. Specifically, we introduce a task-specific wavelet expert extractor to model wavelet experts from different perspectives and dynamically regulate their outputs, thereby generating trainable features enriched with task-specific information for subsequent fine-tuning. Furthermore, we construct an expert-guided conditional adapter that first enhances the fine-grained perception of frozen features for specific tasks by injecting trainable features, and then iteratively updates the information of both types of feature, allowing for efficient fine-tuning.
Extensive experiments show that our WEFT not only outperforms 21 state-of-the-art (SOTA) methods on three ORSIs datasets, but also achieves optimal results in camouflage, natural, and medical scenarios. The source code is available at: https://github.com/CSYSI/WEFT.
[88] SAM-Aug: Leveraging SAM Priors for Few-Shot Parcel Segmentation in Satellite Time Series
Kai Hu,Yaozu Feng,Vladimir Lysenko,Ya Guo,Huayi Wu
Main category: cs.CV
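TL;DR: This paper proposes SAM-Aug, a few-shot semantic segmentation framework for remote sensing imagery that uses the Segment Anything Model (SAM) to generate geometry-aware mask priors, significantly improving land cover classification when labeled data is extremely scarce.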
Details
Motivation: Few-shot semantic segmentation of remote sensing time-series imagery is challenging because labeled data is scarce; existing methods degrade markedly under low supervision, which limits practical use.
Method: Build cloud-free composite images, apply SAM without supervision to generate geometry-aware mask priors, and regularize training with the proposed RegionSmoothLoss, which enforces prediction consistency within each region across temporal frames.
Result: On the PASTIS-R benchmark with 5% of labels, SAM-Aug reaches a mean mIoU of 36.21%, 2.33 percentage points above the state-of-the-art method; in the best case it reaches 40.28%, a relative gain of 11.2%.
Conclusion: Foundation models such as SAM can act as effective regularizers for few-shot remote sensing learning; SAM-Aug offers a plug-and-play solution that needs no additional annotation or fine-tuning and shows good generalization and application potential.
Abstract: Few-shot semantic segmentation of time-series remote sensing images remains a critical challenge, particularly in regions where labeled data is scarce or costly to obtain. While state-of-the-art models perform well under full supervision, their performance degrades significantly under limited labeling, limiting their real-world applicability. In this work, we propose SAM-Aug, a new annotation-efficient framework that leverages the geometry-aware segmentation capability of the Segment Anything Model (SAM) to improve few-shot land cover mapping. Our approach constructs cloud-free composite images from temporal sequences and applies SAM in a fully unsupervised manner to generate geometry-aware mask priors. These priors are then integrated into training through a proposed loss function called RegionSmoothLoss, which enforces prediction consistency within each SAM-derived region across temporal frames, effectively regularizing the model to respect semantically coherent structures. Extensive experiments on the PASTIS-R benchmark under a 5 percent labeled setting demonstrate the effectiveness and robustness of SAM-Aug. Averaged over three random seeds (42, 2025, 4090), our method achieves a mean test mIoU of 36.21 percent, outperforming the state-of-the-art baseline by +2.33 percentage points, a relative improvement of 6.89 percent. Notably, on the most favorable split (seed=42), SAM-Aug reaches a test mIoU of 40.28 percent, representing an 11.2 percent relative gain with no additional labeled data.
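The abstract does not state the exact form of RegionSmoothLoss. As a rough sketch only, assuming the loss penalises how far each pixel's class probabilities deviate from the mean prediction of its SAM-derived region (pooled over all frames), it might look like the following. The function name `region_smooth_loss`, the array shapes, and the squared-deviation penalty are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def region_smooth_loss(probs, regions):
    """Hypothetical region-consistency penalty (not the paper's definition).

    probs:   (T, H, W, C) class probabilities for T temporal frames
    regions: (H, W) integer region ids, e.g. derived from SAM masks
    Returns the squared deviation of each pixel from its region's mean
    prediction, averaged within each region and then over regions.
    """
    probs = np.asarray(probs, dtype=float)
    regions = np.asarray(regions)
    total, count = 0.0, 0
    for rid in np.unique(regions):
        mask = regions == rid                   # (H, W) boolean mask
        region_probs = probs[:, mask, :]        # (T, N, C) pixels of this region
        # Assumption: one mean per region, shared by all frames.
        mean = region_probs.mean(axis=(0, 1), keepdims=True)
        total += ((region_probs - mean) ** 2).sum(axis=-1).mean()
        count += 1
    return total / count
```

A term like this would typically be added to the supervised segmentation loss with a small weight, so that unlabeled pixels inside a region are pulled toward a shared prediction.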
The consistent improvement across all seeds confirms the generalization power of leveraging foundation model priors under annotation scarcity. Our results highlight that vision models like SAM can serve as useful regularizers in few-shot remote sensing learning, offering a scalable and plug-and-play solution for land cover monitoring without requiring manual annotations or model fine-tuning.
[89] Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
Yang Li,Aming Wu,Zihao Zhang,Yahong Han
Main category: cs.CV
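TL;DR: This paper proposes slow4fast-VLN, a dynamic interactive fast-slow reasoning framework for Vision-Language Navigation (VLN) that improves generalization to unseen environments and instructions.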
Details
Motivation: Traditional VLN methods rest on a closed-set assumption and generalize poorly to the diverse, unseen environments and instructions of the open world, which motivates the General Scene Adaptation (GSA-VLN) task.
Method: A dynamically interacting fast-slow dual-system reasoning framework: the fast-reasoning module is an end-to-end policy network that outputs actions in real time and accumulates execution memory; the slow-reasoning module reflects deeply on that memory, extracts experiences that improve generalization, and stores them in structured form to keep optimizing the fast module.
Result: The framework achieves continuous interaction and joint optimization between the fast and slow modules, showing stronger adaptability and navigation performance on GSA-VLN.
Conclusion: With a dynamically interacting fast-slow reasoning mechanism, the model keeps distilling experience and refining its decisions in unseen environments, significantly improving the generalization of vision-language navigation.
Abstract: Vision-Language Navigation aims to enable agents to navigate to a target location based on language instructions. Traditional VLN often follows a closed-set assumption, i.e., training and test data share the same style of input images and instructions. However, the real world is open and filled with various unseen environments, posing enormous difficulties for closed-set methods. To this end, we focus on the General Scene Adaptation (GSA-VLN) task, aiming to learn generalized navigation ability by introducing diverse environments and inconsistent instructions. For this task, when facing unseen environments and instructions, the challenge mainly lies in how to enable the agent to dynamically produce generalized strategies during the navigation process. Recent research indicates that by means of fast and slow cognition systems, human beings can generate stable policies, which strengthen their adaptation to the open world. Inspired by this idea, we propose slow4fast-VLN, establishing a dynamic interactive fast-slow reasoning framework. The fast-reasoning module, an end-to-end strategy network, outputs actions via real-time input. It accumulates execution records in a history repository to build memory. The slow-reasoning module analyzes the memories generated by the fast-reasoning module. Through deep reflection, it extracts experiences that enhance the generalization ability of decision-making. These experiences are structurally stored and used to continuously optimize the fast-reasoning module. Unlike traditional methods that treat fast-slow reasoning as independent mechanisms, our framework enables fast-slow interaction by leveraging the experiences from slow reasoning. This interaction allows the system to continuously adapt and efficiently execute navigation tasks when facing unseen scenarios.
[90] LP-LLM: End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models
Haoyan Gong,Hongbin Liu
Main category: cs.CV
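TL;DR: Proposes an end-to-end structure-aware multimodal reasoning framework that uses learnable character slot queries and a residual modulation mechanism to inject fine-grained visual evidence for character positions into the visual tokens, enabling accurate recognition of severely degraded license plate images.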
Details
Motivation: Existing "restoration-then-recognition" methods accumulate errors because pixel-level optimization objectives are misaligned with the semantic recognition objective, and general-purpose vision-language models lack explicit modeling of license plate character sequence structure (e.g., fixed length, specific order).
Method: An end-to-end framework built on Qwen3-VL with a Character-Aware Multimodal Reasoning Module (CMRM); learnable character slot queries use cross-attention to extract fine-grained evidence for each character position from visual features, and residual modulation injects these representations into the visual tokens; LoRA provides parameter-efficient fine-tuning.
Result: Experiments on synthetic and real severely degraded datasets show that the method significantly outperforms existing restoration-recognition combinations and general-purpose vision-language models.
Conclusion: Introducing structured reasoning into large models effectively improves low-quality text recognition, validating explicit modeling of character sequence structure priors for end-to-end license plate recognition.
Abstract: Real-world License Plate Recognition (LPR) faces significant challenges from severe degradations such as motion blur, low resolution, and complex illumination. The prevailing "restoration-then-recognition" two-stage paradigm suffers from a fundamental flaw: the pixel-level optimization objectives of image restoration models are misaligned with the semantic goals of character recognition, leading to artifact interference and error accumulation. While Vision-Language Models (VLMs) have demonstrated powerful general capabilities, they lack explicit structural modeling for license plate character sequences (e.g., fixed length, specific order). To address this, we propose an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL. The core innovation lies in the Character-Aware Multimodal Reasoning Module (CMRM), which introduces a set of learnable Character Slot Queries. Through a cross-attention mechanism, these queries actively retrieve fine-grained evidence corresponding to character positions from visual features. Subsequently, we inject these character-aware representations back into the visual tokens via residual modulation, enabling the language model to perform autoregressive generation based on explicit structural priors. Furthermore, combined with the LoRA parameter-efficient fine-tuning strategy, the model achieves domain adaptation while retaining the generalization capabilities of the large model.
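The slot-query mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses single-head scaled dot-product attention without learned projections, and the way evidence is written back to the visual tokens (`residual_modulation`, with a hypothetical `scale` parameter) is a guess at what "residual modulation" could mean, not the CMRM as implemented by the authors.

```python
import numpy as np

def slot_cross_attention(visual, slots):
    """Each character slot query attends over all visual tokens.

    visual: (N, D) visual token features
    slots:  (K, D) learnable character slot queries (one per character position)
    Returns (K, D) slot features and the (K, N) attention weights.
    """
    d = visual.shape[1]
    scores = slots @ visual.T / np.sqrt(d)               # (K, N)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)        # softmax over tokens
    return attn @ visual, attn

def residual_modulation(visual, slot_feats, attn, scale=1.0):
    """Assumed write-back: add slot evidence to the tokens each slot attended to."""
    return visual + scale * (attn.T @ slot_feats)        # (N, D)
```

With `scale=0` the visual tokens pass through unchanged, which is the usual reason to prefer a residual form when adapting a pretrained model.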
Extensive experiments on both synthetic and real-world severely degraded datasets demonstrate that our method significantly outperforms existing restoration-recognition combinations and general VLMs, validating the superiority of incorporating structured reasoning into large models for low-quality text recognition tasks.
[91] LPCAN: Lightweight Pyramid Cross-Attention Network for Rail Surface Defect Detection Using RGB-D Data
Jackie Alex,Guoqiang Huan
Main category: cs.CV
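TL;DR: This paper proposes a Lightweight Pyramid Cross-Attention Network (LPCANet) that uses RGB-D data for efficient and accurate rail defect detection, achieving SOTA performance on multiple metrics with good generalization and industrial value.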
Details
Motivation: Existing vision-based rail defect detection methods suffer from high computational complexity, large parameter counts, and insufficient accuracy, so a more efficient and accurate solution is needed.
Method: LPCANet uses MobileNetv2 to extract RGB features, a lightweight pyramid module (LPM) to process depth information, a cross-attention mechanism (CAM) for multimodal fusion, and a spatial feature extractor (SFE) to enhance structural analysis.
Result: On three unsupervised RGB-D rail datasets, LPCANet achieves SOTA performance with only 9.90M parameters, 2.50G FLOPs, and 162.60 fps, improving on the best baseline by +1.48% $S_\alpha$, +0.86% IOU, and +1.77% MAE; ablations confirm the effectiveness of CAM and SFE, and experiments on non-rail datasets show good generalization.
Conclusion: LPCANet markedly improves detection accuracy at very low computational cost, effectively combines traditional and deep learning approaches, and has strong industrial prospects; future work will focus on further model compression for real-time deployment.
Abstract: This paper addresses the limitations of current vision-based rail defect detection methods, including high computational complexity, excessive parameter counts, and suboptimal accuracy. We propose a Lightweight Pyramid Cross-Attention Network (LPCANet) that leverages RGB-D data for efficient and accurate defect identification. The architecture integrates MobileNetv2 as a backbone for RGB feature extraction with a lightweight pyramid module (LPM) for depth processing, coupled with a cross-attention mechanism (CAM) for multimodal fusion and a spatial feature extractor (SFE) for enhanced structural analysis. Comprehensive evaluations on three unsupervised RGB-D rail datasets (NEU-RSDDS-AUG, RSDD-TYPE1, RSDD-TYPE2) demonstrate that LPCANet achieves state-of-the-art performance with only 9.90 million parameters, 2.50 G FLOPs, and 162.60 fps inference speed. Compared to 18 existing methods, LPCANet shows significant improvements, including +1.48% in $S_\alpha$, +0.86% in IOU, and +1.77% in MAE over the best-performing baseline. Ablation studies confirm the critical roles of CAM and SFE, while experiments on non-rail datasets (DAGM2007, MT, Kolektor-SDD2) validate its generalization capability. The proposed framework effectively bridges traditional and deep learning approaches, offering substantial practical value for industrial defect inspection. Future work will focus on further model compression for real-time deployment.
[92] Beyond Seen Bounds: Class-Centric Polarization for Single-Domain Generalized Deep Metric Learning
Xin Yuan,Meiqi Wan,Wei Liu,Xin Xu,Zheng Wang
Main category: cs.CV
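TL;DR: This paper proposes CenterPolar, a new framework for single-domain generalized deep metric learning (SDG-DML) that dynamically expands and constrains domain distributions through class-centric polarization to improve generalization to unseen categories and domains.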
Details
Motivation: Existing SDG-DML methods rely on proxy-based domain expansion, whose generated samples cluster near class centers and cannot simulate the broad, distant domain shifts seen in practice, which limits real-world use.
Method: CenterPolar has two collaborative class-centric polarization phases: (1) Class-Centric Centrifugal Expansion ($C^3E$) pushes source-domain data outward from class centroids to cover more unseen domains; (2) Class-Centric Centripetal Constraint ($C^4$) pulls samples toward their class centroids and strengthens inter-class separation to improve generalization to unseen categories.
Result: Extensive experiments on CUB-200-2011 Ext., Cars196 Ext., DomainNet, PACS, and Office-Home show that CenterPolar outperforms existing state-of-the-art methods.
Conclusion: Through its dynamic domain expansion and constraint strategy, CenterPolar improves generalization to unseen categories and domains and offers a new solution for the SDG-DML task.
Abstract: Single-domain generalized deep metric learning (SDG-DML) faces the dual challenge of both category and domain shifts during testing, limiting real-world applications. Therefore, aiming to learn better generalization ability on both unseen categories and domains is a realistic goal for the SDG-DML task. To pursue this goal, existing SDG-DML methods employ the domain expansion-equalization strategy to expand the source data and generate out-of-distribution samples. However, these methods rely on proxy-based expansion, which tends to generate samples clustered near class proxies, failing to simulate the broad and distant domain shifts encountered in practice. To alleviate the problem, we propose CenterPolar, a novel SDG-DML framework that dynamically expands and constrains domain distributions to learn a generalizable DML model for wider target domain distributions. Specifically, CenterPolar contains two collaborative class-centric polarization phases: (1) Class-Centric Centrifugal Expansion ($C^3E$) and (2) Class-Centric Centripetal Constraint ($C^4$). In the first phase, $C^3E$ drives the source domain distribution by shifting the source data away from class centroids using centrifugal expansion to generalize to more unseen domains. In the second phase, to consolidate domain-invariant class information for the generalization ability to unseen categories, $C^4$ pulls all seen and unseen samples toward their class centroids while enforcing inter-class separation via centripetal constraint.
Extensive experimental results on widely used CUB-200-2011 Ext., Cars196 Ext., DomainNet, PACS, and Office-Home datasets demonstrate the superiority and effectiveness of our CenterPolar over existing state-of-the-art methods. The code will be released after acceptance.
[93] SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL
Lijun Liu,Linwei Chen,Zhishou Zhang,Meng Tian,Hengfu Cui,Ruiyang Li,Zhaocheng Liu,Qiang Ju,Qianxi Li,Hong-Yu Zhou
Main category: cs.CV
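TL;DR: This paper proposes SkinFlow, a framework that improves skin lesion diagnosis accuracy by optimizing the efficiency of visual information transmission. Using a Virtual-Width Dynamic Vision Encoder and a two-stage reinforcement learning strategy, a small (7B) model significantly outperforms large general-purpose vision-language models.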
Details
Motivation: Existing large vision-language models struggle to identify subtle lesions in dermatological diagnosis because of "diffuse attention"; this paper explores a path to better medical diagnostic accuracy that does not depend on parameter scaling.
Method: The SkinFlow framework uses a Virtual-Width Dynamic Vision Encoder (DVE) to unfold complex pathological manifolds, combined with two-stage reinforcement learning: Stage I aligns explicit medical descriptions and Stage II reconstructs implicit diagnostic textures; a clinical evaluation protocol that emphasizes diagnostic safety and hierarchical relevance is also designed.
Result: SkinFlow's 7B model sets a new SOTA on the Fitzpatrick17k benchmark, with gains of +12.06% in Top-1 accuracy and +28.57% in Top-6 accuracy, significantly outperforming large models such as Qwen3VL-235B and GPT-5.2.
Conclusion: Optimizing geometric capacity and information flow improves reasoning on medical vision tasks more effectively than simply scaling parameters, offering a new paradigm for lightweight, high-accuracy medical AI.
Abstract: General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to "diffuse attention" - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.
[94] SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection
Chenhao Fu,Han Fang,Xiuzheng Zheng,Wenbo Wei,Yonghua Li,Hao Sun,Xuelong Li
Main category: cs.CV
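TL;DR: This paper proposes SSVP, a new method for zero-shot anomaly detection (ZSAD) that fuses multi-scale visual encodings with language queries to improve fine-grained anomaly recognition in industrial inspection.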
Details
Motivation: Existing ZSAD methods are limited by a single visual backbone and struggle to combine global semantic generalization with local structural discriminability.
Method: Synergistic Semantic-Visual Prompting (SSVP) has three core modules: the HSVS mechanism fuses DINOv3's multi-scale structural priors into the CLIP semantic space; VCPG uses cross-modal attention to generate dynamic prompts; VTAM aligns global scores with local evidence through a dual-gated mechanism.
Result: Evaluated extensively on seven industrial benchmarks; on MVTec-AD, SSVP reaches 93.0% Image-AUROC and 92.2% Pixel-AUROC, significantly outperforming existing zero-shot methods.
Conclusion: By jointly fusing multi-scale visual representations with language-guided prompting, SSVP improves zero-shot anomaly detection performance and shows good robustness and application prospects.
Abstract: Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to precisely anchor to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
[95] From Snow to Rain: Evaluating Robustness, Calibration, and Complexity of Model-Based Robust Training
Josué Martínez-Martínez,Olivia Brown,Giselle Zeno,Pooya Khorrami,Rajmonda Caceres
Main category: cs.CV
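TL;DR: This paper studies model-based training methods that use a learned nuisance variation model to generate realistic corruptions and evaluates their robustness on the CURE-TSR dataset; the results show that model-based methods outperform existing baselines in accuracy and calibration.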
Details
Motivation: 深度学习在安全敏感领域中对自然损坏的鲁棒性仍是一个关键挑战,需要更有效的方法来提升模型在复杂环境下的可靠性。 Method: 提出了一类基于模型的训练方法,结合随机覆盖与对抗性细化策略,利用学习到的干扰模型生成 realistic corruption,并在雪、雨等条件下进行评估。 Result: 基于模型的方法在所有损坏类型下均优于Vanilla、Adversarial Training和AugMix基线,其中基于模型的对抗训练具有最强鲁棒性,而基于模型的数据增强在更低计算复杂度下达到相当性能。 Conclusion: 学习到的干扰模型对于捕捉自然变异性至关重要,为在挑战性条件下构建更鲁棒和校准良好的模型提供了有前景的方向。 Abstract: Robustness to natural corruptions remains a critical challenge for reliable deep learning, particularly in safety-sensitive domains. We study a family of model-based training approaches that leverage a learned nuisance variation model to generate realistic corruptions, as well as new hybrid strategies that combine random coverage with adversarial refinement in nuisance space. Using the Challenging Unreal and Real Environments for Traffic Sign Recognition dataset (CURE-TSR), with Snow and Rain corruptions, we evaluate accuracy, calibration, and training complexity across corruption severities. Our results show that model-based methods consistently outperform baselines Vanilla, Adversarial Training, and AugMix baselines, with model-based adversarial training providing the strongest robustness under across all corruptions but at the expense of higher computation and model-based data augmentation achieving comparable robustness with $T$ less computational complexity without incurring a statistically significant drop in performance. These findings highlight the importance of learned nuisance models for capturing natural variability, and suggest a promising path toward more resilient and calibrated models under challenging conditions.[96] Architecture inside the mirage: evaluating generative image models on architectural style, elements, and typologies
Jamie Magrill,Leah Gornstein,Sandra Seekins,Barry Magrill
Main category: cs.CV
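TL;DR: This study evaluates how accurately five mainstream generative AI image platforms produce architectural images, finding limited overall accuracy along with problems such as style confusion and over-embellishment, and recommends clear labeling of AI-generated content and cautious use in education.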
Details
Motivation: Architecture follows strict historical rules, and the accuracy of generative AI in this field is unclear, so its ability to generate architectural images needs systematic evaluation.
Method: Thirty architectural prompts covering different styles and typologies were run on five mainstream AI image platforms, four images each (600 images in total); two architectural historians scored the images independently and resolved disagreements by consensus, and performance was analyzed as the number of accurate images per four-image set.
Result: Images from Common prompts were 2.7 times as accurate as those from Rare prompts (p < 0.05); overall platform accuracy ranged from 32% to 52% (mean 42%); platforms differed substantially in all-incorrect (0/4) outcomes, with Imagen 3 failing least and Microsoft Image Generator failing most; qualitative analysis found pervasive over-embellishment, confusion between medieval styles and their revivals, and misuse of descriptive elements.
Conclusion: Generative AI has limited accuracy in architectural image generation and makes systematic errors; labeling standards for AI-generated content and provenance standards for training data are needed, and its use in architectural education calls for caution.
Abstract: Generative artificial intelligence (GenAI) text-to-image systems are increasingly used to generate architectural imagery, yet their capacity to reproduce accurate images in a historically rule-bound field remains poorly characterized. We evaluated five widely used GenAI image platforms (Adobe Firefly, DALL-E 3, Google Imagen 3, Microsoft Image Generator, and Midjourney) using 30 architectural prompts spanning styles, typologies, and codified elements. Each prompt-generator pair produced four images (n = 600 images total). Two architectural historians independently scored each image for accuracy against predefined criteria, resolving disagreements by consensus. Set-level performance was summarized as zero to four accurate images per four-image set. Image output from Common prompts was 2.7-fold more accurate than from Rare prompts (p < 0.05). Across platforms, overall accuracy was limited (highest accuracy score 52 percent; lowest 32 percent; mean 42 percent). All-correct (4 out of 4) outcomes were similar across platforms. By contrast, all-incorrect (0 out of 4) outcomes varied substantially, with Imagen 3 exhibiting the fewest failures and Microsoft Image Generator exhibiting the highest number of failures. Qualitative review of the image dataset identified recurring patterns including over-embellishment, confusion between medieval styles and their later revivals, and misrepresentation of descriptive prompts (for example, egg-and-dart, banded column, pendentive).
These findings support the need for visible labeling of GenAI synthetic content, provenance standards for future training datasets, and cautious educational use of GenAI architectural imagery.
[97] N-EIoU-YOLOv9: A Signal-Aware Bounding Box Regression Loss for Lightweight Mobile Detection of Rice Leaf Diseases
Dung Ta Nguyen Duc,Thanh Bui Dang,Hoang Le Minh,Tung Nguyen Viet,Huong Nguyen Thanh,Dong Trinh Cong
Main category: cs.CV
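TL;DR: Proposes N-EIoU-YOLOv9, a lightweight detection framework based on a signal-aware bounding box regression loss for small-object detection in agricultural disease imagery; on a self-collected rice leaf dataset it clearly outperforms CIoU and runs efficiently on mobile devices.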
Details
Motivation: Small and low-contrast targets are common in agricultural disease imagery and hard to detect; existing loss functions under-optimize weak regression signals and suffer from severe gradient interference, so a more effective localization loss is needed.
Method: The N-EIoU loss combines a non-monotonic focusing mechanism with decoupled width and height optimization to reshape localization gradients; it is integrated into the lightweight YOLOv9t architecture, trained and validated on a self-collected rice disease dataset, and deployed on an Android device for edge inference.
Result: On a dataset of 5908 rice leaf images, N-EIoU-YOLOv9 improves mAP by 4.3% over CIoU to 90.3%, with higher localization accuracy, especially under stricter evaluation criteria; after Float16 quantization the model runs at 156 ms per frame on a mobile device while maintaining accuracy.
Conclusion: The N-EIoU loss strengthens regression signals for hard samples and reduces gradient interference; the lightweight framework balances accuracy, optimization stability, and computational efficiency, and suits edge-based agricultural monitoring systems.
Abstract: In this work, we propose N-EIoU-YOLOv9, a lightweight detection framework based on a signal-aware bounding box regression loss derived from non-monotonic gradient focusing and geometric decoupling principles, referred to as N-EIoU (Non-monotonic Efficient Intersection over Union). The proposed loss reshapes localization gradients by combining non-monotonic focusing with decoupled width and height optimization, thereby enhancing weak regression signals for hard samples with low overlap while reducing gradient interference. This design is particularly effective for small and low-contrast targets commonly observed in agricultural disease imagery. The proposed N-EIoU loss is integrated into a lightweight YOLOv9t architecture and evaluated on a self-collected field dataset comprising 5908 rice leaf images across four disease categories and healthy leaves. Experimental results demonstrate consistent performance gains over the standard CIoU loss, achieving a mean Average Precision of 90.3 percent, corresponding to a 4.3 percent improvement over the baseline, with improved localization accuracy under stricter evaluation criteria. For practical validation, the optimized model is deployed on an Android device using TensorFlow Lite with Float16 quantization, achieving an average inference time of 156 milliseconds per frame while maintaining accuracy.
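The abstract does not give the N-EIoU formula. For orientation, the sketch below implements only the commonly used EIoU base loss, which already decouples width and height: one minus IoU, plus a normalised centre-distance term, plus separate width and height terms. The non-monotonic focusing weight that N-EIoU adds is left out because its form is not stated here; the function name `eiou_loss` and the box convention are assumptions for illustration.

```python
def eiou_loss(box, gt, eps=1e-9):
    """EIoU loss for one box pair; boxes are (x1, y1, x2, y2) with x2 > x1, y2 > y1."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)

    # Squared centre distance, normalised by the enclosing box diagonal.
    dx = (x1 + x2) / 2 - (gx1 + gx2) / 2
    dy = (y1 + y2) / 2 - (gy1 + gy2) / 2
    centre = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Decoupled width and height penalties.
    width = (w - gw) ** 2 / (cw * cw + eps)
    height = (h - gh) ** 2 / (ch * ch + eps)
    return 1.0 - iou + centre + width + height
```

Unlike plain IoU, this loss stays informative for non-overlapping boxes, because the centre, width, and height terms still change as the prediction moves toward the target.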
These results confirm that the proposed approach effectively balances accuracy, optimization stability, and computational efficiency for edge-based agricultural monitoring systems.
[98] From Performance to Practice: Knowledge-Distilled Segmentator for On-Premises Clinical Workflows
Qizhen Lan,Aaron Choi,Jun Ma,Bo Wang,Zhaogming Zhao,Xiaoqian Jiang,Yu-Chun Hsu
Main category: cs.CV
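TL;DR: This paper proposes a deployment-oriented compression framework for medical image segmentation that uses knowledge distillation to transfer a high-performing teacher model's knowledge to lightweight student models, sharply reducing compute requirements with almost no loss of segmentation accuracy and no change to the inference pipeline.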
Details
Motivation: On-premises hospital infrastructure has limited compute, and cloud inference is restricted by security and governance policies, so high-capacity medical image segmentation models are hard to deploy; a compression method is needed that leaves the existing clinical system architecture unchanged.
Method: Knowledge distillation transfers knowledge from a high-accuracy teacher model to a family of architecturally compatible lightweight student models, allowing systematic capacity reduction while staying compatible with existing clinical systems and without modifying the inference pipeline.
Result: Validated on a multi-site brain MRI dataset (1,104 3D volumes) with independent testing on 101 cases, and on abdominal CT for cross-modality generalization. With 94% fewer parameters, the student model retains 98.7% of the teacher's segmentation accuracy, and CPU inference latency drops by up to 67% with no additional deployment overhead.
Conclusion: Knowledge distillation offers a practical and reliable path for turning research-grade segmentation models into maintainable, deployment-ready components for on-premises clinical workflows in real-world health systems.
Abstract: Deploying medical image segmentation models in routine clinical workflows is often constrained by on-premises infrastructure, where computational resources are fixed and cloud-based inference may be restricted by governance and security policies. While high-capacity models achieve strong segmentation accuracy, their computational demands hinder practical deployment and long-term maintainability in hospital environments. We present a deployment-oriented framework that leverages knowledge distillation to translate a high-performing segmentation model into a scalable family of compact student models, without modifying the inference pipeline. The proposed approach preserves architectural compatibility with existing clinical systems while enabling systematic capacity reduction. The framework is evaluated on a multi-site brain MRI dataset comprising 1,104 3D volumes, with independent testing on 101 curated cases, and is further examined on abdominal CT to assess cross-modality generalizability. Under aggressive parameter reduction (94%), the distilled student model preserves nearly all of the teacher's segmentation accuracy (98.7%), while achieving substantial efficiency gains, including up to a 67% reduction in CPU inference latency without additional deployment overhead.
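The abstract does not say which distillation objective is used. As a generic illustration only, the sketch below shows the classic temperature-scaled soft-target loss (KL divergence between teacher and student class distributions, scaled by the squared temperature), applied per voxel. The function names and the default `temperature` are assumptions and may differ from the paper's setup.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions.

    Both inputs have shape (..., C), e.g. one row of class logits per voxel.
    The T**2 factor keeps gradient magnitudes comparable across temperatures.
    """
    s = log_softmax(np.asarray(student_logits, dtype=float) / temperature)
    t = log_softmax(np.asarray(teacher_logits, dtype=float) / temperature)
    kl = (np.exp(t) * (t - s)).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)
```

In practice a term like this is usually mixed with the ordinary supervised segmentation loss, so the student learns from both the labels and the teacher's soft predictions.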
These results demonstrate that knowledge distillation provides a practical and reliable pathway for converting research-grade segmentation models into maintainable, deployment-ready components for on-premises clinical workflows in real-world health systems.
[99] Point Tracking as a Temporal Cue for Robust Myocardial Segmentation in Echocardiography Videos
Bahar Khodabakhshian,Nima Hashemi,Armin Saadat,Zahra Gholami,In-Chang Hwang,Samira Sojoudi,Christina Luong,Purang Abolmaesumi,Teresa Tsang
Main category: cs.CV
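TL;DR: Proposes Point-Seg, a transformer-based segmentation framework that uses point tracking as a temporal cue to improve the accuracy and temporal consistency of myocardium segmentation in echocardiography videos, especially on low-quality images, and provides pixel-level motion information for downstream tasks such as myocardial strain measurement.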
Details
Motivation: Low contrast, noise, and anatomical variability make myocardium segmentation in echocardiography videos challenging; traditional methods either ignore temporal information or accumulate error through memory-based feature propagation.
Method: Point-Seg is a transformer-based segmentation framework with a point-tracking module that captures the trajectories of key anatomical landmarks; these trajectories serve as an explicit motion-aware signal that guides segmentation, and a temporal smoothing loss strengthens consistency across frames. The point-tracking module is trained on a synthetic dataset.
Result: Validated on public and private datasets, Point-Seg matches state-of-the-art models in Dice on high-quality data, achieves higher segmentation accuracy and better temporal stability on low-quality data, and provides pixel-level myocardium motion information that other methods do not readily offer.
Conclusion: Point-Seg shows that point tracking is an effective temporal cue for stable, consistent video segmentation and offers a reliable, generalizable approach to myocardium segmentation in echocardiography.
Abstract: Purpose: Myocardium segmentation in echocardiography videos is a challenging task due to low contrast, noise, and anatomical variability. Traditional deep learning models either process frames independently, ignoring temporal information, or rely on memory-based feature propagation, which accumulates error over time. Methods: We propose Point-Seg, a transformer-based segmentation framework that integrates point tracking as a temporal cue to ensure stable and consistent segmentation of myocardium across frames. Our method leverages a point-tracking module trained on a synthetic echocardiography dataset to track key anatomical landmarks across video sequences. These tracked trajectories provide an explicit motion-aware signal that guides segmentation, reducing drift and eliminating the need for memory-based feature accumulation. Additionally, we incorporate a temporal smoothing loss to further enhance temporal consistency across frames. Results: We evaluate our approach on both public and private echocardiography datasets. Experimental results demonstrate that Point-Seg has statistically similar accuracy in terms of Dice to state-of-the-art segmentation models on high-quality echo data, while it achieves better segmentation accuracy on lower-quality echo data with improved temporal stability. Furthermore, Point-Seg has the key advantage of providing pixel-level myocardium motion information, unlike other segmentation methods. Such information is essential in the computation of other downstream tasks such as myocardial strain measurement and regional wall motion abnormality detection.
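The temporal smoothing loss is mentioned but not defined in the abstract. One simple form such a term could take, shown purely as an assumption, is the mean squared difference between the predicted masks of consecutive frames; the name `temporal_smoothing_loss` and the choice of a squared penalty are illustrative.

```python
import numpy as np

def temporal_smoothing_loss(masks):
    """Mean squared change between consecutive predicted masks.

    masks: (T, H, W) per-frame foreground probabilities.
    Returns 0.0 for a single frame, since there is nothing to compare.
    """
    masks = np.asarray(masks, dtype=float)
    if masks.shape[0] < 2:
        return 0.0
    diffs = masks[1:] - masks[:-1]   # frame-to-frame differences
    return float((diffs ** 2).mean())
```

A penalty like this discourages flicker between frames, but applied to raw masks it would also penalise genuine cardiac motion, which is one reason a method might prefer to compare frames along tracked point trajectories instead.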
Conclusion: Point-Seg demonstrates that point tracking can serve as an effective temporal cue for consistent video segmentation, offering a reliable and generalizable approach for myocardium segmentation in echocardiography videos. The code is available at https://github.com/DeepRCL/PointSeg.
[100] Pairing-free Group-level Knowledge Distillation for Robust Gastrointestinal Lesion Classification in White-Light Endoscopy
Qiang Hu,Qimei Wang,Yingjie Guo,Qiang Li,Zhiwei Wang
Main category: cs.CV
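TL;DR: This paper proposes PaGKD, a pairing-free group-level knowledge distillation framework that enables cross-modal learning between unpaired white-light imaging (WLI) and narrow-band imaging (NBI) data, significantly improving the performance of cancer screening models.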
Details
Motivation: Existing cross-modal medical image analysis methods depend on paired NBI-WLI images of the same lesion, which are costly to acquire and hard to scale, leaving large amounts of unpaired clinical data unused.
Method: Proposes the PaGKD framework with two modules: (1) Group-level Prototype Distillation (GKD-Pro), which extracts modality-invariant semantic prototypes through shared lesion-aware queries; (2) Group-level Dense Distillation (GKD-Den), which guides group-aware attention with activation-derived relation maps to achieve dense cross-modal alignment.
Result: Experiments on four clinical datasets show that PaGKD significantly outperforms existing methods, with relative AUC improvements of 3.3%, 1.1%, 2.8%, and 3.2%.
Conclusion: PaGKD achieves efficient cross-modal knowledge transfer without image pairing, opening a new direction for exploiting unpaired medical imaging data.
Abstract: White-Light Imaging (WLI) is the standard for endoscopic cancer screening, but Narrow-Band Imaging (NBI) offers superior diagnostic details. A key challenge is transferring knowledge from NBI to enhance WLI-only models, yet existing methods are critically hampered by their reliance on paired NBI-WLI images of the same lesion, a costly and often impractical requirement that leaves vast amounts of clinical data untapped. In this paper, we break this paradigm by introducing PaGKD, a novel Pairing-free Group-level Knowledge Distillation framework that enables effective cross-modal learning using unpaired WLI and NBI data. Instead of forcing alignment between individual, often semantically mismatched image instances, PaGKD operates at the group level to distill more complete and compatible knowledge across modalities. Central to PaGKD are two complementary modules: (1) Group-level Prototype Distillation (GKD-Pro) distills compact group representations by extracting modality-invariant semantic prototypes via shared lesion-aware queries; (2) Group-level Dense Distillation (GKD-Den) performs dense cross-modal alignment by guiding group-aware attention with activation-derived relation maps. Together, these modules enforce global semantic consistency and local structural coherence without requiring image-level correspondence.
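To illustrate why group-level distillation needs no image pairing, here is a rough Python sketch of prototype distillation. It assumes a single shared query vector that attention-pools each modality's group of features into one prototype, and a mean-squared-error loss between the prototypes. The function names, the single-query setup, and the loss are illustrative assumptions, not details from the PaGKD paper.

```python
import math
from typing import List

Vector = List[float]


def attention_prototype(query: Vector, features: List[Vector]) -> Vector:
    """Pool a group of feature vectors into one prototype via softmax attention."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    peak = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    dim = len(query)
    return [
        sum(w * feat[d] for w, feat in zip(weights, features)) / norm
        for d in range(dim)
    ]


def prototype_distillation_loss(teacher: Vector, student: Vector) -> float:
    """Mean squared error between teacher (NBI) and student (WLI) prototypes."""
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)


# Hypothetical toy features; the two groups have different sizes and no pairing.
shared_query = [1.0, 0.0]
nbi_group = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
wli_group = [[0.5, 0.5], [0.4, 0.6]]

loss = prototype_distillation_loss(
    attention_prototype(shared_query, nbi_group),
    attention_prototype(shared_query, wli_group),
)
print(loss)
```

Because the loss compares one pooled vector per group, the NBI and WLI groups can differ in size and contain images of different lesions.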
Extensive experiments on four clinical datasets demonstrate that PaGKD consistently and significantly outperforms state-of-the-art methods, achieving relative AUC improvements of 3.3%, 1.1%, 2.8%, and 3.2%, respectively, establishing a new direction for cross-modal learning from unpaired data.
[101] Affostruction: 3D Affordance Grounding with Generative Reconstruction
Chunghyun Park,Seunghyeon Lee,Minsu Cho
Main category: cs.CV
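TL;DR: This paper proposes Affostruction, a generative framework for affordance grounding from RGBD images given a text query; it reconstructs complete geometry and grounds affordances on the full surface, including unobserved regions.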
Details
Motivation: Existing methods predict affordance regions only on visible surfaces and cannot handle affordance grounding under occlusion or incomplete observation.
Method: Proposes generative multi-view reconstruction, flow-based affordance grounding, and affordance-driven active view selection; complete geometry is reconstructed through sparse voxel fusion, and the uncertainty of affordance distributions is modeled.
Result: Reaches 19.1 aIoU on affordance grounding (a 40.4% improvement) and 32.67 IoU on 3D reconstruction (a 67.7% improvement).
Conclusion: Affostruction substantially improves affordance grounding and 3D reconstruction under incomplete observation, enabling affordance understanding on complete shapes.
Abstract: This paper addresses the problem of affordance grounding from RGBD images of an object, which aims to localize surface regions corresponding to a text query that describes an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete geometry from partial observations and grounds affordances on the full shape including unobserved regions. We make three core contributions: generative multi-view reconstruction via sparse voxel fusion that extrapolates unseen geometry while maintaining constant token complexity, flow-based affordance grounding that captures inherent ambiguity in affordance distributions, and affordance-driven active view selection that leverages predicted affordances for intelligent viewpoint sampling. Affostruction achieves 19.1 aIoU on affordance grounding (40.4% improvement) and 32.67 IoU for 3D reconstruction (67.7% improvement), enabling accurate affordance prediction on complete shapes.
[102] Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation
Xingyao Li,Fengzhuo Zhang,Cunxiao Du,Hui Ji
Main category: cs.CV
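TL;DR: This paper proposes COOL-SD, a theoretically grounded annealed relaxation of speculative decoding that accelerates autoregressive image generation while maintaining or improving generation quality.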
Details
Motivation: Autoregressive image generation remains slow at inference, even with speculative decoding, because of the sequential nature of AR models and the ambiguity of image tokens; recent relaxed speculative decoding methods lack theoretical grounding.
Method: Analyzes the total variation (TV) distance between the target model and relaxed speculative decoding to derive an optimal resampling distribution; combined with a perturbation analysis that reveals an annealing behaviour, this yields COOL-SD, a theoretically motivated annealed relaxation of speculative decoding.
Result: Experiments show that COOL-SD improves on prior methods in the speed-quality trade-off, generating images faster or achieving better quality at similar latency.
Conclusion: COOL-SD gives relaxed speculative decoding a solid theoretical foundation and effectively improves both inference efficiency and generation quality for autoregressive image generation.
Abstract: Despite significant progress in autoregressive image generation, inference remains slow due to the sequential nature of AR models and the ambiguity of image tokens, even when using speculative decoding. Recent works attempt to address this with relaxed speculative decoding but lack theoretical grounding. In this paper, we establish the theoretical basis of relaxed SD and propose COOL-SD, an annealed relaxation of speculative decoding built on two key insights. The first analyzes the total variation (TV) distance between the target model and relaxed speculative decoding and yields an optimal resampling distribution that minimizes an upper bound of the distance. The second uses perturbation analysis to reveal an annealing behaviour in relaxed speculative decoding, motivating our annealed design. Together, these insights enable COOL-SD to generate images faster with comparable quality, or achieve better quality at similar latency. Experiments validate the effectiveness of COOL-SD, showing consistent improvements over prior methods in speed-quality trade-offs.
[103] SpikeVAEDiff: Neural Spike-based Natural Visual Scene Reconstruction via VD-VAE and Versatile Diffusion
Jialu Li,Taiyan Zhou
Main category: cs.CV
TL;DR: Proposes SpikeVAEDiff, a two-stage framework combining VDVAE and Versatile Diffusion to reconstruct high-resolution visual scenes from neural spike data.
Details
Motivation: Reconstructing natural visual scenes from neural activity is a key challenge in neuroscience and computer vision; existing methods are limited in resolution and semantic consistency.
Method: In the first stage, VDVAE maps neural spikes to latent representations of low-resolution images; in the second stage, regression models map spike signals to CLIP-Vision and CLIP-Text features, and Versatile Diffusion uses them to refine the images.
Result: Validated on the Allen Neuropixels dataset; the VISI region has the greatest influence on reconstruction quality; compared with fMRI-based methods, spike data offers higher temporal and spatial resolution; ablation studies show that data from specific brain regions significantly improves performance.
Conclusion: SpikeVAEDiff can generate high-resolution, semantically plausible visual reconstructions, revealing links between neural coding and visual content and advancing brain-computer interfaces.
Abstract: Reconstructing natural visual scenes from neural activity is a key challenge in neuroscience and computer vision. We present SpikeVAEDiff, a novel two-stage framework that combines a Very Deep Variational Autoencoder (VDVAE) and the Versatile Diffusion model to generate high-resolution and semantically meaningful image reconstructions from neural spike data. In the first stage, VDVAE produces low-resolution preliminary reconstructions by mapping neural spike signals to latent representations. In the second stage, regression models map neural spike signals to CLIP-Vision and CLIP-Text features, enabling Versatile Diffusion to refine the images via image-to-image generation. We evaluate our approach on the Allen Visual Coding-Neuropixels dataset and analyze different brain regions. Our results show that the VISI region exhibits the most prominent activation and plays a key role in reconstruction quality. We present both successful and unsuccessful reconstruction examples, reflecting the challenges of decoding neural activity. Compared with fMRI-based approaches, spike data provides superior temporal and spatial resolution. We further validate the effectiveness of the VDVAE model and conduct ablation studies demonstrating that data from specific brain regions significantly enhances reconstruction performance.
[104] Disentangle Object and Non-object Infrared Features via Language Guidance
Fan Liu,Ting Wu,Chuanyi Zhang,Liang Yao,Xing Ma,Yuhui Zheng
Main category: cs.CV
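TL;DR: Proposes an infrared object detection method based on vision-language representation learning, in which textual supervision guides the disentanglement of object and non-object features to improve detection performance.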
Details
Motivation: Infrared images have low contrast and weak edge information, which makes it hard to extract discriminative object features and hurts detection robustness in complex environments.
Method: Proposes a Semantic Feature Alignment (SFA) module that aligns object features with text features, and an Object Feature Disentanglement (OFD) module that separates text-aligned object features from non-object features by minimizing their correlation; the disentangled features are fed to the detection head.
Result: Reaches 83.7% and 86.1% mAP on the M³FD and FLIR benchmarks, respectively, significantly outperforming existing methods.
Conclusion: Textual supervision effectively strengthens feature discriminability in infrared object detection; the proposed method significantly improves detection performance through vision-language alignment and feature disentanglement.
Abstract: Infrared object detection focuses on identifying and locating objects in complex environments (e.g., dark, snow, and rain) where visible imaging cameras are disabled by poor illumination. However, due to low contrast and weak edge information in infrared images, it is challenging to extract discriminative object features for robust detection. To deal with this issue, we propose a novel vision-language representation learning paradigm for infrared object detection. An additional textual supervision with rich semantic information is explored to guide the disentanglement of object and non-object features. Specifically, we propose a Semantic Feature Alignment (SFA) module to align the object features with the corresponding text features. Furthermore, we develop an Object Feature Disentanglement (OFD) module that disentangles text-aligned object features and non-object features by minimizing their correlation. Finally, the disentangled object features are entered into the detection head. In this manner, the detection performance can be remarkably enhanced via more discriminative and less noisy features. Extensive experimental results demonstrate that our approach achieves superior performance on two benchmarks: M³FD (83.7% mAP) and FLIR (86.1% mAP). Our code will be publicly available once the paper is accepted.
[105] SPOT-Face: Forensic Face Identification using Attention Guided Optimal Transport
Ravi Shankar Prasad,Dinesh Singh
Main category: cs.CV
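TL;DR: This paper proposes SPOT-Face, a superpixel graph-based framework for cross-domain forensic face identification from skeleton and sketch images; it establishes cross-domain correspondence through an attention-guided optimal transport mechanism and shows significant gains in identification performance on public datasets.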
Details
Motivation: When conventional DNA evidence (such as hair or soft tissue) is unavailable, identifying a person in a forensic investigation becomes very challenging, and existing deep-learning face recognition methods fall short in modeling cross-domain structural correspondence between different forensic modalities.
Method: Builds a superpixel-based graph representation of each image, extracts graph embeddings with different graph neural network (GNN) backbones, and establishes cross-domain correspondence through an attention-guided optimal transport mechanism.
Result: Extensive experiments on two public datasets, IIT_Mandi_S2F and CUFS, show that the method significantly outperforms existing graph-based methods on metrics such as Recall and mAP.
Conclusion: The SPOT-Face framework effectively matches skull and sketch images to faces and has high practical value for forensic identification.
Abstract: Person identification in forensic investigations becomes very challenging when common identification means for DNA (i.e., hair strands, soft tissue) are not available. Current methods utilize deep learning methods for face recognition. However, these methods lack effective mechanisms to model cross-domain structural correspondence between two different forensic modalities. In this paper, we introduce SPOT-Face, a superpixel graph-based framework designed for cross-domain forensic face identification of victims using their skeleton and sketch images. Our unified framework involves constructing a superpixel-based graph from an image and then using different graph neural network (GNN) backbones to extract the embeddings of these graphs, while cross-domain correspondence is established through an attention-guided optimal transport mechanism. We have evaluated our proposed framework on two publicly available datasets: IIT_Mandi_S2F (S2F) and CUFS. Extensive experiments were conducted to evaluate our proposed framework. The experimental results show significant improvement in identification metrics (i.e., Recall, mAP) over existing graph-based baselines. Furthermore, our framework proves highly effective for matching skulls and sketches to faces in forensic investigations.
[106] CLIDD: Cross-Layer Independent Deformable Description for Efficient and Discriminative Local Feature Representation
Haodi Yao,Fenghua He,Ning Hao,Yao Su
Main category: cs.CV
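TL;DR: This paper proposes Cross-Layer Independent Deformable Description (CLIDD), a local feature description method that uses cross-layer independent deformable sampling and hardware-aware optimization to balance high accuracy and efficient computation at a very small model size, significantly outperforming existing methods.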
Details
Motivation: Real-time, reliable local feature matching on resource-constrained devices requires descriptors that are both highly discriminative and computationally efficient, which existing methods struggle to deliver together.
Method: CLIDD samples directly from independent feature hierarchies and uses learnable offsets to capture fine-grained structural information across scales; a hardware-aware kernel fusion strategy speeds up inference, and lightweight architectures are combined with metric learning and knowledge distillation to train model variants for different deployment scenarios.
Result: The ultra-compact variant matches SuperPoint's precision with only 0.004M parameters, a 99.7% reduction in model size; the high-performance variant exceeds 200 FPS while outperforming all existing state-of-the-art methods, including DINOv2-based ones.
Conclusion: CLIDD achieves high-precision local feature matching with very low computational overhead, providing a robust and scalable solution for real-time spatial intelligence tasks.
Abstract: Robust local feature representations are essential for spatial intelligence tasks such as robot navigation and augmented reality. Establishing reliable correspondences requires descriptors that provide both high discriminative power and computational efficiency. To address this, we introduce Cross-Layer Independent Deformable Description (CLIDD), a method that achieves superior distinctiveness by sampling directly from independent feature hierarchies. This approach utilizes learnable offsets to capture fine-grained structural details across scales while bypassing the computational burden of unified dense representations. To ensure real-time performance, we implement a hardware-aware kernel fusion strategy that maximizes inference throughput. Furthermore, we develop a scalable framework that integrates lightweight architectures with a training protocol leveraging both metric learning and knowledge distillation. This scheme generates a wide spectrum of model variants optimized for diverse deployment constraints. Extensive evaluations demonstrate that our approach achieves superior matching accuracy and exceptional computational efficiency simultaneously. Specifically, the ultra-compact variant matches the precision of SuperPoint while utilizing only 0.004M parameters, achieving a 99.7% reduction in model size. Furthermore, our high-performance configuration outperforms all current state-of-the-art methods, including high-capacity DINOv2-based frameworks, while exceeding 200 FPS on edge devices.
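The sampling idea described above, reading each feature level independently at a keypoint location shifted by offsets, can be sketched with plain bilinear interpolation. This is a generic deformable-sampling illustration under stated assumptions, not CLIDD's implementation: the offsets are fixed numbers here (in the method they are learned), the feature maps are single-channel, and all names are hypothetical.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[float]]  # single-channel map indexed as [row][col]
Level = Tuple[FeatureMap, float]  # (feature map, stride relative to the image)


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> float:
    """Sample fmap at a continuous (x, y) location, clamping to the border."""
    height, width = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * fmap[y0][x0] + wx * fmap[y0][x1]
    bottom = (1 - wx) * fmap[y1][x0] + wx * fmap[y1][x1]
    return (1 - wy) * top + wy * bottom


def deformable_descriptor(
    levels: List[Level],
    keypoint: Tuple[float, float],
    offsets: List[List[Tuple[float, float]]],
) -> List[float]:
    """Concatenate samples taken independently from each level.

    offsets[i] holds the (dx, dy) shifts, in level-i pixels, applied around
    the keypoint after it is projected onto level i by that level's stride.
    """
    kx, ky = keypoint
    descriptor: List[float] = []
    for (fmap, stride), level_offsets in zip(levels, offsets):
        for dx, dy in level_offsets:
            descriptor.append(bilinear_sample(fmap, kx / stride + dx, ky / stride + dy))
    return descriptor


# Toy two-level pyramid: a 4x4 map at stride 1 and a 2x2 map at stride 2.
fine = [[float(4 * r + c) for c in range(4)] for r in range(4)]
coarse = [[0.0, 10.0], [20.0, 30.0]]
descriptor = deformable_descriptor(
    levels=[(fine, 1.0), (coarse, 2.0)],
    keypoint=(2.0, 2.0),
    offsets=[[(0.0, 0.0), (0.5, 0.0)], [(0.0, 0.0), (-0.5, -0.5)]],
)
print(descriptor)
```

Each level is read at its own resolution and the samples are simply concatenated, so no unified dense multi-scale feature map has to be built first.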
These results demonstrate that CLIDD delivers high-precision local feature matching with minimal computational overhead, providing a robust and scalable solution for real-time spatial intelligence tasks.
[107] Knowledge-Embedded and Hypernetwork-Guided Few-Shot Substation Meter Defect Image Generation Method
Jackie Alex,Justin Petter
Main category: cs.CV
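TL;DR: Proposes a Stable Diffusion framework that combines knowledge embedding with hypernetwork-guided control to generate substation meter defect images in few-shot settings, significantly improving generation quality and downstream detection performance.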
Details
Motivation: Annotated substation meter defect samples are extremely scarce, and conventional data augmentation struggles to produce realistic, controllable defect images, which limits the training of downstream detection models.
Method: (1) Fine-tunes Stable Diffusion with DreamBooth-style knowledge embedding to retain meter structural priors; (2) designs a geometric crack modeling module that produces spatially constrained control maps; (3) introduces a lightweight hypernetwork that dynamically modulates the denoising process according to the control maps and defect descriptions.
Result: On a real-world dataset, FID falls by 32.7%, generation diversity increases, and a defect detector trained on the augmented data gains 15.3% mAP.
Conclusion: The framework effectively addresses defect sample scarcity in industrial settings, enabling high-quality, controllable few-shot defect image generation and significantly improving downstream task performance.
Abstract: Substation meters play a critical role in monitoring and ensuring the stable operation of power grids, yet their detection of cracks and other physical defects is often hampered by a severe scarcity of annotated samples. To address this few-shot generation challenge, we propose a novel framework that integrates Knowledge Embedding and Hypernetwork-Guided Conditional Control into a Stable Diffusion pipeline, enabling realistic and controllable synthesis of defect images from limited data. First, we bridge the substantial domain gap between natural-image pre-trained models and industrial equipment by fine-tuning a Stable Diffusion backbone using DreamBooth-style knowledge embedding. This process encodes the unique structural and textural priors of substation meters, ensuring generated images retain authentic meter characteristics. Second, we introduce a geometric crack modeling module that parameterizes defect attributes, such as location, length, curvature, and branching pattern, to produce spatially constrained control maps. These maps provide precise, pixel-level guidance during generation. Third, we design a lightweight hypernetwork that dynamically modulates the denoising process of the diffusion model in response to the control maps and high-level defect descriptors, achieving a flexible balance between generation fidelity and controllability. Extensive experiments on a real-world substation meter dataset demonstrate that our method substantially outperforms existing augmentation and generation baselines.
It reduces Fréchet Inception Distance (FID) by 32.7%, increases diversity metrics, and, most importantly, boosts the mAP of a downstream defect detector by 15.3% when trained on augmented data. The framework offers a practical, high-quality data synthesis solution for industrial inspection systems where defect samples are rare.
[108] DeTracker: Motion-decoupled Vehicle Detection and Tracking in Unstabilized Satellite Videos
Jiajun Chen,Jing Xiao,Shaohan Cao,Yuming Zhu,Liang Liao,Jun Pan,Mi Wang
Main category: cs.CV
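TL;DR: Proposes the DeTracker framework for multi-object tracking in unstabilized satellite videos, improving tiny-object tracking through global-local motion decoupling and a temporal dependency feature pyramid module.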
Details
Motivation: Jitter from unstabilized platforms and the weak appearance of tiny objects degrade multi-object tracking in satellite videos, and existing methods struggle with complex motion interference.
Method: Designs a Global-Local Motion Decoupling (GLMD) module that separates platform motion from true object motion, and introduces a Temporal Dependency Feature Pyramid (TDFP) module that fuses temporal features across frames to strengthen tiny-object representations.
Result: Reaches 61.1% MOTA on the newly built SDM-Car-SU dataset and 47.3% MOTA on real satellite videos, significantly outperforming existing methods.
Conclusion: DeTracker effectively improves tracking accuracy and trajectory stability in unstabilized satellite videos, with strong robustness and application potential.
Abstract: Satellite videos provide continuous observations of surface dynamics but pose significant challenges for multi-object tracking (MOT), especially under unstabilized conditions where platform jitter and the weak appearance of tiny objects jointly degrade tracking performance. To address this problem, we propose DeTracker, a joint detection-and-tracking framework tailored for unstabilized satellite videos. DeTracker introduces a Global-Local Motion Decoupling (GLMD) module that explicitly separates satellite platform motion from true object motion through global alignment and local refinement, leading to improved trajectory stability and motion estimation accuracy. In addition, a Temporal Dependency Feature Pyramid (TDFP) module is developed to perform cross-frame temporal feature fusion, enhancing the continuity and discriminability of tiny-object representations. We further construct a new benchmark dataset, SDM-Car-SU, which simulates multi-directional and multi-speed platform motions to enable systematic evaluation of tracking robustness under varying motion perturbations. Extensive experiments on both simulated and real unstabilized satellite videos demonstrate that DeTracker significantly outperforms existing methods, achieving 61.1% MOTA on SDM-Car-SU and 47.3% MOTA on real satellite video data.
[109] A$^2$TG: Adaptive Anisotropic Textured Gaussians for Efficient 3D Scene Representation
Sheng-Chi Hsu,Ting-Yu Yen,Shih-Hsuan Hung,Hung-Kuo Chu
Main category: cs.CV
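TL;DR: This paper proposes adaptive anisotropic textured Gaussians (A²TG), which equip each Gaussian primitive with an anisotropic texture and use a gradient-guided adaptive rule to jointly determine texture resolution and aspect ratio, improving texture efficiency, reducing memory consumption, and raising rendering quality.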
Details
Motivation: Existing Gaussian Splatting methods use a fixed square texture, which wastes memory, adapts poorly to scene variation, and makes it hard to balance detail with resource efficiency.
Method: Proposes A²TG, which introduces anisotropic textures and a gradient-guided adaptive mechanism that dynamically allocates texture resolution and aspect ratio so that the texture layout better matches the anisotropy of Gaussian primitives.
Result: Experiments on several benchmark datasets show that A²TG achieves rendering quality comparable to or better than existing methods while significantly reducing memory usage.
Conclusion: Through its adaptive anisotropic texture design, A²TG improves texture efficiency in Gaussian Splatting, providing a lighter, more flexible representation for high-quality real-time rendering.
Abstract: Gaussian Splatting has emerged as a powerful representation for high-quality, real-time 3D scene rendering. While recent works extend Gaussians with learnable textures to enrich visual appearance, existing approaches allocate a fixed square texture per primitive, leading to inefficient memory usage and limited adaptability to scene variability. In this paper, we introduce adaptive anisotropic textured Gaussians (A$^2$TG), a novel representation that generalizes textured Gaussians by equipping each primitive with an anisotropic texture. Our method employs a gradient-guided adaptive rule to jointly determine texture resolution and aspect ratio, enabling non-uniform, detail-aware allocation that aligns with the anisotropic nature of Gaussian splats. This design significantly improves texture efficiency, reducing memory consumption while enhancing image quality. Experiments on multiple benchmark datasets demonstrate that A$^2$TG consistently outperforms fixed-texture Gaussian Splatting methods, achieving comparable rendering fidelity with substantially lower memory requirements.
[110] Integrating Diverse Assignment Strategies into DETRs
Yiwei Zhang,Jin Gao,Hanshi Wang,Fudong Ge,Guan Luo,Weiming Hu,Zhipeng Zhang
Main category: cs.CV
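TL;DR: This paper proposes LoRA-DETR, a flexible, lightweight object detection framework that strengthens supervision through diverse one-to-many label assignment strategies and achieves state-of-the-art performance without increasing inference cost.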
Details
Motivation: One-to-one assignment in DETR-style detectors yields sparse supervision and slow convergence, while existing one-to-many methods lack a unified, scalable design and often introduce complex modifications.
Method: Proposes LoRA-DETR, which adds multiple Low-Rank Adaptation (LoRA) branches during training, each implementing a different one-to-many label assignment strategy to inject diverse supervisory gradients; the branches are discarded at inference, keeping the original model simple and efficient.
Result: Extensive experiments on several baselines validate the method, which achieves state-of-the-art detection performance without adding inference cost.
Conclusion: Diversity of one-to-many supervision matters more than its sheer quantity; LoRA-DETR offers an elegant, parameter-efficient, and general new paradigm for improving DETR-style detectors.
Abstract: Label assignment is a critical component in object detectors, particularly within DETR-style frameworks where the one-to-one matching strategy, despite its end-to-end elegance, suffers from slow convergence due to sparse supervision. While recent works have explored one-to-many assignments to enrich supervisory signals, they often introduce complex, architecture-specific modifications and typically focus on a single auxiliary strategy, lacking a unified and scalable design. In this paper, we first systematically investigate the effects of "one-to-many" supervision and reveal a surprising insight that performance gains are driven not by the sheer quantity of supervision, but by the diversity of the assignment strategies employed. This finding suggests that a more elegant, parameter-efficient approach is attainable. Building on this insight, we propose LoRA-DETR, a flexible and lightweight framework that seamlessly integrates diverse assignment strategies into any DETR-style detector. Our method augments the primary network with multiple Low-Rank Adaptation (LoRA) branches during training, each instantiating a different one-to-many assignment rule. These branches act as auxiliary modules that inject rich, varied supervisory gradients into the main model and are discarded during inference, thus incurring no additional computational cost. This design promotes robust joint optimization while maintaining the architectural simplicity of the original detector. Extensive experiments on different baselines validate the effectiveness of our approach.
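The train-time-only branch design can be illustrated with the standard LoRA formulation, in which a branch adds a low-rank update B A to a shared weight W. The Python sketch below is a simplified single-layer illustration under stated assumptions: the class and method names are hypothetical, and the abstract does not say where the branches attach or how each branch's assignment rule drives its loss.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


class LoRALinear:
    """Linear layer y = W x with auxiliary branches y_k = (W + B_k A_k) x."""

    def __init__(self, weight: Matrix) -> None:
        self.weight = weight
        # (B, A) pairs; in training, each branch would be supervised by a
        # different one-to-many assignment rule (not modeled in this sketch).
        self.branches: List[Tuple[Matrix, Matrix]] = []

    def add_branch(self, b: Matrix, a: Matrix) -> None:
        self.branches.append((b, a))

    def forward(self, x: Vector) -> Vector:
        """Main path: the only computation needed at inference."""
        return matvec(self.weight, x)

    def forward_branch(self, index: int, x: Vector) -> Vector:
        """Training-only path: main output plus the low-rank update B (A x)."""
        b, a = self.branches[index]
        delta = matvec(b, matvec(a, x))
        return [m + d for m, d in zip(self.forward(x), delta)]

    def drop_branches(self) -> None:
        """Discard all auxiliary branches before deployment."""
        self.branches.clear()


layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]])
layer.add_branch(b=[[1.0], [0.0]], a=[[0.0, 2.0]])  # rank-1 branch

x = [1.0, 3.0]
main_out = layer.forward(x)
branch_out = layer.forward_branch(0, x)
layer.drop_branches()
print(main_out, branch_out, layer.forward(x))
```

Because the main path never reads the branch matrices, dropping them leaves the inference output and its cost unchanged, which is the property the abstract emphasizes.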
Our work presents a new paradigm for enhancing detectors, demonstrating that diverse "one-to-many" supervision can be integrated to achieve state-of-the-art results without compromising model elegance.
[111] Hybrid guided variational autoencoder for visual place recognition
Ni Wang,Zihan You,Emre Neftci,Thorben Schoepe
Main category: cs.CV
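TL;DR: This paper proposes a visual place recognition method based on event-based vision sensors and a novel guided variational autoencoder (VAE) with a spiking neural network, achieving efficient, robust indoor localization that generalizes well and suits resource-constrained mobile robots.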
Details
Motivation: Existing visual place recognition models trade memory consumption against robustness, making it hard to give autonomous robots efficient, accurate localization in GPS-denied indoor environments.
Method: Uses an event camera to capture dynamic visual information and designs a guided variational autoencoder (VAE) based on a spiking neural network, whose sparse activations reduce the computational load; the model is trained and evaluated on a newly built indoor VPR dataset.
Result: The model achieves classification performance comparable to current state-of-the-art methods across 16 distinct indoor places, is robust to illumination changes, generalizes to unknown scenes, and is more compact and suited to low-power deployment.
Conclusion: The proposed compact, robust, and generalizable event-driven VAE offers an efficient, practical solution for visual place recognition in indoor robot navigation.
Abstract: Autonomous agents such as cars, robots and drones need to precisely localize themselves in diverse environments, including in GPS-denied indoor environments. One approach for precise localization is visual place recognition (VPR), which estimates the place of an image based on previously seen places. State-of-the-art VPR models require high amounts of memory, making them unwieldy for mobile deployment, while more compact models lack robustness and generalization capabilities. This work overcomes these limitations for robotics using a combination of event-based vision sensors and an event-based novel guided variational autoencoder (VAE). The encoder part of our model is based on a spiking neural network model which is compatible with power-efficient low latency neuromorphic hardware. The VAE successfully disentangles the visual features of 16 distinct places in our new indoor VPR dataset with a classification performance comparable to other state-of-the-art approaches, while also showing robust performance under various illumination conditions. When tested with novel visual inputs from unknown scenes, our model can distinguish between these places, which demonstrates a high generalization capability by learning the essential features of location. Our compact and robust guided VAE with generalization capabilities is a promising model for visual place recognition that can significantly enhance mobile robot navigation in known and unknown indoor environments.
[112] PhyRPR: Training-Free Physics-Constrained Video Generation
Yibo Zhao,Hengjia Li,Xiaofei He,Boxi Wu
Main category: cs.CV
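TL;DR: Proposes PhyRPR, a training-free three-stage video generation pipeline that decouples physical understanding from visual synthesis, improving the physical plausibility and motion controllability of generated videos.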
Details
Motivation: Existing diffusion models struggle to satisfy physical constraints in video generation because high-level physical understanding and low-level visual synthesis are coupled in a single stage, leaving no explicit physical reasoning.
Method: Designs a training-free three-stage framework, PhyRPR: PhyReason uses a large multimodal model for physical state reasoning and generates keyframes; PhyPlan builds a controllable coarse motion scaffold; PhyRefine injects the scaffold into diffusion sampling through a latent fusion strategy, refining appearance while keeping the dynamics plausible.
Result: Experiments show that, under physics constraints, the method significantly improves physical plausibility and motion controllability and outperforms existing single-stage models.
Conclusion: By decoupling physical reasoning from visual generation, PhyRPR produces video that better obeys physical laws and offers an effective new paradigm for controllable video generation.
Abstract: Recent diffusion-based video generation models can synthesize visually plausible videos, yet they often struggle to satisfy physical constraints. A key reason is that most existing approaches remain single-stage: they entangle high-level physical understanding with low-level visual synthesis, making it hard to generate content that requires explicit physical reasoning. To address this limitation, we propose a training-free three-stage pipeline, PhyRPR: PhyReason-PhyPlan-PhyRefine, which decouples physical understanding from visual synthesis. Specifically, PhyReason uses a large multimodal model for physical state reasoning and an image generator for keyframe synthesis; PhyPlan deterministically synthesizes a controllable coarse motion scaffold; and PhyRefine injects this scaffold into diffusion sampling via a latent fusion strategy to refine appearance while preserving the planned dynamics. This staged design enables explicit physical control during generation. Extensive experiments under physics constraints show that our method consistently improves physical plausibility and motion controllability.
[113] Magnifying change: Rapid burn scar mapping with multi-resolution, multi-source satellite imagery
Maria Sdraka,Dimitrios Michail,Ioannis Papoutsis
Main category: cs.CV
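TL;DR: Proposes BAM-MRCD, a new deep learning model that uses multi-resolution, multi-source satellite imagery (MODIS and Sentinel-2) to map burnt areas quickly and accurately at high spatial and temporal resolution.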
Details
Motivation: Accurately delineating wildfire-affected areas from satellite imagery remains challenging because spectral changes are spatially heterogeneous and current satellite systems trade spatial resolution against revisit frequency.
Method: Proposes BAM-MRCD, which fuses MODIS (high temporal resolution) and Sentinel-2 (high spatial resolution) data in a multi-source, multi-resolution analysis to detect pre- and post-fire change efficiently.
Result: The model detects small wildfires with high accuracy and outperforms existing change detection models and baselines on burnt area extraction.
Conclusion: By fusing multi-source satellite data, BAM-MRCD resolves the tension between timeliness and resolution and suits operational use for rapidly producing detailed burn maps after a fire.
Abstract: Delineating wildfire affected areas using satellite imagery remains challenging due to irregular and spatially heterogeneous spectral changes across the electromagnetic spectrum. While recent deep learning approaches achieve high accuracy when high-resolution multispectral data are available, their applicability in operational settings, where a quick delineation of the burn scar shortly after a wildfire incident is required, is limited by the trade-off between spatial resolution and temporal revisit frequency of current satellite systems. To address this limitation, we propose a novel deep learning model, namely BAM-MRCD, which employs multi-resolution, multi-source satellite imagery (MODIS and Sentinel-2) for the timely production of detailed burnt area maps with high spatial and temporal resolution. Our model manages to detect even small scale wildfires with high accuracy, surpassing similar change detection models as well as solid baselines. All data and code are available in the GitHub repository: https://github.com/Orion-AI-Lab/BAM-MRCD.
[114] BrainSegNet: A Novel Framework for Whole-Brain MRI Parcellation Enhanced by Large Models
Yucheng Li,Xiaofan Wang,Junyi Wang,Yijie Li,Xi Zhu,Mubai Du,Dian Sheng,Wei Zhang,Fan Zhang
Main category: cs.CV
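TL;DR: Proposes BrainSegNet, a whole-brain parcellation framework adapted from SAM that combines U-Net skip connections with attention mechanisms to parcellate the brain into 95 regions with high precision.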
Details
Motivation: Existing deep learning models such as SAM are not optimized for the high precision that brain parcellation requires, and traditional methods are slow and struggle with complex structures.
Method: Builds on SAM by introducing U-Net skip connections and designing a hybrid encoder, a multi-scale attention decoder, and a boundary refinement module to improve segmentation of fine-grained anatomical structures.
Result: Outperforms several SOTA methods on the HCP dataset, with higher accuracy and robustness, particularly on complex multi-label structures.
Conclusion: BrainSegNet combines SAM's generalization with U-Net's ability to capture local detail, providing a reliable solution for high-precision whole-brain MRI parcellation.
Abstract: Whole-brain parcellation from MRI is a critical yet challenging task due to the complexity of subdividing the brain into numerous small, irregularly shaped regions. Traditionally, template-registration methods were used, but recent advances have shifted to deep learning for faster workflows. While large models like the Segment Anything Model (SAM) offer transferable feature representations, they are not tailored for the high precision required in brain parcellation. To address this, we propose BrainSegNet, a novel framework that adapts SAM for accurate whole-brain parcellation into 95 regions. We enhance SAM by integrating U-Net skip connections and specialized modules into its encoder and decoder, enabling fine-grained anatomical precision. Key components include a hybrid encoder combining U-Net skip connections with SAM's transformer blocks, a multi-scale attention decoder with pyramid pooling for varying-sized structures, and a boundary refinement module to sharpen edges. Experimental results on the Human Connectome Project (HCP) dataset demonstrate that BrainSegNet outperforms several state-of-the-art methods, achieving higher accuracy and robustness in complex, multi-label parcellation.
[115] GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
Bei Huang,Yixin Chen,Ruijie Lu,Gang Zeng,Hongbin Zha,Yuru Pei,Siyuan Huang
Main category: cs.CV
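TL;DR: This paper proposes GaussianFluent, a unified framework that couples physics simulation with 3D Gaussian Splatting to achieve high-quality real-time rendering of objects undergoing brittle fracture.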
Details
Motivation: Existing 3D Gaussian Splatting methods mainly handle soft-body deformation and struggle to simulate brittle fracture of objects with complex interior structure and texture, lacking both a volumetric interior representation and fracture-aware simulation methods.
Method: Densifies Gaussians inside the object under the guidance of generative models to synthesize realistic interiors, and integrates an optimized Continuum Damage Material Point Method (CD-MPM) for high-speed brittle fracture simulation.
Result: Handles complex scenarios such as mixed-material objects and multi-stage fracture propagation, with structurally consistent, visually realistic real-time rendering.
Conclusion: GaussianFluent is the first to achieve efficient simulation and rendering of brittle fracture in 3DGS, opening new possibilities for applications such as virtual reality and robotics.
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving results infeasible with previous methods. Experiments clearly demonstrate GaussianFluent's capability for photo-realistic, real-time rendering with structurally consistent interiors, highlighting its potential for downstream applications such as VR and robotics.
[116] Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain
Lianying Chao,Haoran Cai,Xubin Li,Kai Zhang,Sijie Wu,Rui Xu
Main category: cs.CV
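TL;DR: This paper proposes a multi-stage progressive training strategy for training a domain-specific image captioning model (DICModel) for the ICT domain and builds a standard evaluation system to validate it. About 7K image-text pairs synthesized with the Mermaid tool and large language models are used for first-stage fine-tuning, followed by training on about 2K expert-annotated image-text pairs and about 1.5K visual question answering samples. Experiments show that DICModel, with only 7B parameters, raises BLEU by 56.8% and 20.8% over 7B and 32B SOTA models, respectively, and outperforms Qwen2.5-VL 32B on expert-designed questions.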
Details
Motivation: In the ICT domain, high-value knowledge resides not only in text but also in images. Traditional methods cannot generate image captions effectively, and existing multimodal large models lack sufficient domain knowledge, so a domain-specific captioning model that accurately extracts the logical information in images is needed.
Method: Proposes a multi-stage progressive training strategy: stage one performs supervised fine-tuning on about 7K image-text pairs synthesized with the Mermaid tool and LLMs; stage two fine-tunes further on about 2K image-text pairs manually annotated by ICT domain experts; stage three performs instruction fine-tuning on about 1.5K visual question answering samples built jointly by experts and LLMs. A standard evaluation system is also built to validate the model.
Result: The 7B-parameter DICModel raises BLEU by about 56.8% and 20.8% over 7B and 32B SOTA models, respectively; on expert-built objective questions its accuracy exceeds that of Qwen2.5-VL 32B by 1%.
Conclusion: The work extracts logical text from ICT-domain images efficiently and accurately, advancing the application of multimodal models in specialized domains.
Abstract: In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, the knowledge is not only hidden in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but do not have image captioning ability. Multi-modal LLMs (MLLMs) can understand images, but they do not have sufficient domain knowledge. To address the above issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised-fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering data for the instruction-based SFT. Experimental results indicate that our DICModel with only 7B parameters performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively.
On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in terms of accuracy rate. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain.[117] Frequency Error-Guided Under-sampling Optimization for Multi-Contrast MRI Reconstruction
Xinming Fang,Chaoyan Huang,Juncheng Li,Jun Wang,Jun Shi,Guixu Zhang
Main category: cs.CV
TL;DR: Proposes a frequency error-guided framework for multi-contrast MRI reconstruction that learns a frequency error prior with a conditional diffusion model and jointly optimizes the under-sampling pattern and the reconstruction network, significantly improving reconstruction quality.
Details
Motivation: Existing methods are limited in their reference-image fusion strategies, their use of complementary information, and their fixed under-sampling patterns, which constrains multi-contrast MRI reconstruction performance.
Method: A conditional diffusion model learns a Frequency Error Prior (FEP), which is combined with a deep unfolding framework to jointly optimize the under-sampling pattern and the reconstruction network. A spatial alignment module and a reference feature decomposition strategy are introduced, and frequency-domain and image-domain information are fused.
Result: The method outperforms existing state-of-the-art methods across multiple imaging modalities, acceleration rates (4-30x), and sampling schemes, in both quantitative metrics and visual quality.
Conclusion: The proposed method addresses key problems in multi-contrast MRI reconstruction and achieves efficient, interpretable image reconstruction.
Abstract: Magnetic resonance imaging (MRI) plays a vital role in clinical diagnostics, yet it remains hindered by long acquisition times and motion artifacts. Multi-contrast MRI reconstruction has emerged as a promising direction by leveraging complementary information from fully-sampled reference scans. However, existing approaches suffer from three major limitations: (1) superficial reference fusion strategies, such as simple concatenation, (2) insufficient utilization of the complementary information provided by the reference contrast, and (3) fixed under-sampling patterns. We propose an efficient and interpretable frequency error-guided reconstruction framework to tackle these issues. We first employ a conditional diffusion model to learn a Frequency Error Prior (FEP), which is then incorporated into a unified framework for jointly optimizing both the under-sampling pattern and the reconstruction network. The proposed reconstruction model employs a model-driven deep unfolding framework that jointly exploits frequency- and image-domain information. In addition, a spatial alignment module and a reference feature decomposition strategy are incorporated to improve reconstruction quality and bridge model-based optimization with data-driven learning for improved physical interpretability. Comprehensive validation across multiple imaging modalities, acceleration rates (4-30x), and sampling schemes demonstrates consistent superiority over state-of-the-art methods in both quantitative metrics and visual quality. All code is available at https://github.com/fangxinming/JUF-MRI.
[118] Beyond the final layer: Attentive multilayer fusion for vision transformers
Laure Ciernik,Marco Morik,Lukas Thede,Luca Eyring,Shinichi Nakajima,Zeynep Akata,Lukas Muttenthaler
Main category: cs.CV
TL;DR: Proposes an attention-based probing method that dynamically fuses representations from all layers of a Vision Transformer and significantly outperforms standard linear probing.
Details
Motivation: Task-relevant information is distributed across the layers of the network rather than confined to the last layer, and conventional linear probes cannot make full use of intermediate-layer information.
Method: An attentive probing mechanism dynamically fuses feature representations from all layers through learnable attention weights, combining low-level structural cues with high-level semantic abstractions.
Result: Across 20 diverse datasets and multiple pretrained models, the method consistently and substantially outperforms standard linear probes; attention heatmaps show that domains far from the pre-training task rely more on intermediate representations.
Conclusion: Intermediate layers contain important task-relevant information, and attentive probing offers a task-aware, principled way to unlock their potential in probing-based adaptation.
Abstract: With the rise of large-scale foundation models, efficiently adapting them to downstream tasks remains a central challenge. Linear probing, which freezes the backbone and trains a lightweight head, is computationally efficient but often restricted to last-layer representations. We show that task-relevant information is distributed across the network hierarchy rather than solely encoded in any of the last layers. To leverage this distribution of information, we apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. This mechanism learns to identify the most relevant layers for a target task and combines low-level structural cues with high-level semantic abstractions. Across 20 diverse datasets and multiple pretrained foundation models, our method achieves consistent, substantial gains over standard linear probes. Attention heatmaps further reveal that tasks different from the pre-training domain benefit most from intermediate representations. Overall, our findings underscore the value of intermediate layer information and demonstrate a principled, task-aware approach for unlocking its potential in probing-based adaptation.
[119] See More, Store Less: Memory-Efficient Resolution for Video Moment Retrieval
Mingyu Jeon,Sungjin Han,Jinkwon Hwang,Minchol Kwon,Jonghee Kim,Junyeong Kim
Main category: cs.CV
TL;DR: SMORE is an efficient video understanding framework that uses query-guided semantic encoding, importance modulation, and adaptive compression to reduce memory usage while maintaining high information resolution, achieving state-of-the-art performance on several benchmarks.
Details
Motivation: Existing video moment retrieval methods rely on sparse frame sampling, which can lose key information, while dense frame processing runs into memory bottlenecks, making long videos hard to handle efficiently.
Method: SMORE introduces three techniques: (1) query-guided captions that encode semantics aligned with user intent; (2) query-aware importance modulation that highlights key segments; (3) adaptive frame compression that preserves important content while reducing redundancy.
Result: State-of-the-art performance on the QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks, together with markedly better memory efficiency.
Conclusion: SMORE addresses memory limits while preserving video understanding accuracy, offering a scalable and practical solution for efficient video understanding.
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to memory constraints from dense frame processing. Existing Video Moment Retrieval (VMR) methodologies rely on sparse frame sampling, risking potential information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that enhances memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experimental validation reveals that SMORE achieves state-of-the-art performance on QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.
[120] Spectral Complex Autoencoder Pruning: A Fidelity-Guided Criterion for Extreme Structured Channel Compression
Wei Liu,Xing Deng,Haijian Shao,Yingtao Jiang
Main category: cs.CV
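TL;DR: Proposes SCAP, a spectral-reconstruction-based method for scoring the importance of convolutional channels: it builds a complex interaction field and reconstructs its spectrum with a low-capacity autoencoder to quantify channel redundancy, enabling efficient network pruning.
Details
Motivation: Existing pruning methods struggle to measure the functional redundancy of convolutional output channels accurately; the need to preserve accuracy under extreme compression motivates finer-grained, more interpretable importance criteria.
Method: For each convolutional layer, a complex interaction field is constructed (the multi-channel input activation as the real part, a single output-channel activation broadcast as the imaginary part), transformed to the frequency domain, and a low-capacity autoencoder is trained to reconstruct the normalized spectra. Channels reconstructed with high fidelity are treated as redundant and pruned; channels with low fidelity are retained.
Result: For VGG16 on CIFAR-10 with a fixed threshold of 0.6, the method achieves a 90.11% FLOP reduction and a 96.30% parameter reduction, with Top-1 accuracy after fine-tuning dropping by only 1.67% from a 93.44% baseline.
Conclusion: Spectral reconstruction fidelity of complex interaction fields is an effective proxy for channel-level redundancy. SCAP maintains good performance under heavy compression, supports simple threshold-based pruning, and produces a structurally consistent network.
Abstract: We propose Spectral Complex Autoencoder Pruning (SCAP), a reconstruction-based criterion that measures functional redundancy at the level of individual output channels. For each convolutional layer, we construct a complex interaction field by pairing the full multi-channel input activation as the real part with a single output-channel activation (spatially aligned and broadcast across input channels) as the imaginary part. We transform this complex field to the frequency domain and train a low-capacity autoencoder to reconstruct normalized spectra. Channels whose spectra are reconstructed with high fidelity are interpreted as lying close to a low-dimensional manifold captured by the autoencoder and are therefore more compressible; conversely, channels with low fidelity are retained as they encode information that cannot be compactly represented by the learned manifold. This yields an importance score (optionally fused with the filter L1 norm) that supports simple threshold-based pruning and produces a structurally consistent pruned network. On VGG16 trained on CIFAR-10, at a fixed threshold of 0.6, we obtain 90.11% FLOP reduction and 96.30% parameter reduction with an absolute Top-1 accuracy drop of 1.67% from a 93.44% baseline after fine-tuning, demonstrating that spectral reconstruction fidelity of complex interaction fields is an effective proxy for channel-level redundancy under aggressive compression.
[121] Detail Loss in Super-Resolution Models Based on the Laplacian Pyramid and Repeated Upscaling and Downscaling Process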
Sangjun Han,Youngmi Hur
Main category: cs.CV
TL;DR: This paper proposes a Laplacian pyramid-based detail loss and a repeated upscaling-downscaling strategy to enhance high-frequency details in image super-resolution, improving both CNN and attention-based models.
Details
Motivation: Existing methods pay too little attention to high-frequency components, so a more effective mechanism is needed to emphasize the pixels that carry high-frequency detail in super-resolution.
Method: Two methods are proposed. The first designs a detail loss using a Laplacian pyramid structure, separately generating and controlling the super-resolution image and the detail image. The second introduces a repeated upscaling and downscaling process that extracts diverse information from multiple low-resolution features to strengthen the effect of the detail loss.
Result: A CNN model using the proposed methods reaches state-of-the-art results, surpassing existing CNN-based models and some attention-based models; applying the methods to attention-based models on a small scale also improves performance.
Conclusion: The proposed methods improve super-resolution performance across different model architectures, particularly in recovering high-frequency details.
Abstract: With advances in artificial intelligence, image processing has gained significant interest. Image super-resolution is a vital technology closely related to real-world applications, as it enhances the quality of existing images. Since enhancing fine details is crucial for the super-resolution task, pixels that contribute to high-frequency information should be emphasized. This paper proposes two methods to enhance high-frequency details in super-resolution images: a Laplacian pyramid-based detail loss and a repeated upscaling and downscaling process. Total loss with our detail loss guides a model by separately generating and controlling super-resolution and detail images. This approach allows the model to focus more effectively on high-frequency components, resulting in improved super-resolution images. Additionally, repeated upscaling and downscaling amplify the effectiveness of the detail loss by extracting diverse information from multiple low-resolution features. We conduct two types of experiments. First, we design a CNN-based model incorporating our methods. This model achieves state-of-the-art results, surpassing all currently available CNN-based and even some attention-based models. Second, we apply our methods to existing attention-based models on a small scale. In all our experiments, attention-based models adding our detail loss show improvements compared to the originals. These results demonstrate our approaches effectively enhance super-resolution images across different model structures.
[122] Radiomics-Integrated Deep Learning with Hierarchical Loss for Osteosarcoma Histology Classification
Yaxi Chen,Zi Ye,Shaheer U. Saeed,Oliver Yu,Simin Ni,Jie Huang,Yipeng Hu
Main category: cs.CV
TL;DR: This study proposes a deep learning method that combines radiomic features with a hierarchical classification loss for automated quantification of necrosis in osteosarcoma histology images, reporting new best results on the public TCIA dataset.
Details
Motivation: Assessment of necrosis in osteosarcoma after chemotherapy currently relies on manual evaluation, which is time-consuming, subjective, and subject to large inter-observer variability; in addition, existing deep learning models degrade significantly when tested at the patient level.
Method: 1) Radiomic features extracted from the images are added as a multimodal input. 2) Two hierarchical binary classification tasks (tumor vs non-tumor, viable vs non-viable) are optimized with a hierarchical loss using learnable weights, replacing the conventional flat three-class classification.
Result: On the TCIA OS Tumor Assessment dataset, the proposed approaches (individually and combined) significantly improve classification performance, in particular per-class performance, reaching a new state of the art on this task.
Conclusion: Combining radiomic features with a hierarchical loss improves patient-level generalization and provides a more accurate, more interpretable automated tool for osteosarcoma histology assessment.
Abstract: Osteosarcoma (OS) is an aggressive primary bone malignancy. Accurate histopathological assessment of viable versus non-viable tumor regions after neoadjuvant chemotherapy is critical for prognosis and treatment planning, yet manual evaluation remains labor-intensive, subjective, and prone to inter-observer variability. Recent advances in digital pathology have enabled automated necrosis quantification. Evaluating on test data independently sampled at the patient level revealed that deep learning model performance dropped significantly from the tile-level generalization ability reported in previous studies. First, this work proposes the use of radiomic features as additional input in model training. We show that, although they are derived from the images, such a multimodal input effectively improved the classification performance, in addition to its added benefits in interpretability. Second, this work proposes to optimize two binary classification tasks with hierarchical classes (i.e. tumor-vs-non-tumor and viable-vs-non-viable), as opposed to the alternative "flat" three-class classification task (i.e. non-tumor, non-viable tumor, viable tumor), thereby enabling a hierarchical loss. We show that with such a hierarchical loss, with trainable weightings between the two tasks, the per-class performance can be improved significantly. Using the TCIA OS Tumor Assessment dataset, we experimentally demonstrate the benefits from each of the proposed new approaches and their combination, setting what we consider a new state-of-the-art performance on this open dataset for this application.
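The hierarchical decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the label encoding, the function names, and the softmax parameterization of the task weights (`task_logits`) are all assumptions made for illustration, since the abstract only states that the weightings between the two binary tasks are trainable.

```python
import math

# Assumed flat encoding: 0 = non-tumor, 1 = non-viable tumor, 2 = viable tumor.
def to_hierarchical(flat_label):
    """Map a flat label to (is_tumor, is_viable); is_viable is None for non-tumor."""
    if flat_label == 0:
        return 0, None
    return 1, (1 if flat_label == 2 else 0)

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and target y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def hierarchical_loss(p_tumor, p_viable, flat_labels, task_logits=(0.0, 0.0)):
    """Weighted sum of the two binary losses.

    task_logits is a hypothetical stand-in for the trainable weighting:
    the two task weights are softmax(task_logits).
    """
    targets = [to_hierarchical(y) for y in flat_labels]
    # Task 1 (tumor vs non-tumor) uses every sample.
    loss_tumor = sum(bce(p, t[0]) for p, t in zip(p_tumor, targets)) / len(targets)
    # Task 2 (viable vs non-viable) is only defined on tumor samples.
    viable_terms = [bce(p, t[1]) for p, t in zip(p_viable, targets) if t[1] is not None]
    loss_viable = sum(viable_terms) / len(viable_terms) if viable_terms else 0.0
    exps = [math.exp(z) for z in task_logits]
    w_tumor, w_viable = (e / sum(exps) for e in exps)
    return w_tumor * loss_tumor + w_viable * loss_viable

# Example call on three tiles: viable tumor, non-viable tumor, non-tumor.
loss = hierarchical_loss([0.9, 0.8, 0.1], [0.7, 0.2, 0.5], [2, 1, 0])
```

In a real training loop the weights would be optimized jointly with the network, and a bare softmax weighting would likely need some regularization to avoid favoring the easier task; how the paper handles this is not stated in the abstract.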
Code and trained models: https://github.com/YaxiiC/RadiomicsOS.git.
[123] Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs
Rui Zhu,Xin Shen,Shuchen Wu,Chenxi Miao,Xin Yu,Yang Li,Weikang Li,Deguo Xia,Jizhou Huang
Main category: cs.CV
TL;DR: This paper introduces Video-MSR, the first benchmark for evaluating multi-hop spatial reasoning (MSR) in dynamic videos, and builds the MSR-9K instruction-tuning dataset to improve models' spatial reasoning.
Details
Motivation: Existing benchmarks focus mainly on single-step perception tasks and leave multi-hop spatial reasoning scenarios, which require complex visual-spatial logical chains, underexplored.
Method: The Video-MSR benchmark comprises four tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. The data are built with a scalable pipeline that combines advanced model generation with human verification, and the MSR-9K dataset is used for instruction tuning to strengthen model capability.
Result: Experiments on 20 mainstream MLLMs show a marked performance drop on multi-hop spatial reasoning tasks, with frequent spatial disorientation and hallucination; Qwen-VL fine-tuned on MSR-9K achieves a +7.82% absolute improvement on Video-MSR.
Conclusion: Multi-hop spatial reasoning is a weak point of current MLLMs, dedicated instruction data improves it, and Video-MSR provides a foundation for future research.
Abstract: Spatial reasoning has emerged as a critical capability for Multimodal Large Language Models (MLLMs), drawing increasing attention and rapid advancement. However, existing benchmarks primarily focus on single-step perception-to-judgment tasks, leaving scenarios requiring complex visual-spatial logical chains significantly underexplored. To bridge this gap, we introduce Video-MSR, the first benchmark specifically designed to evaluate Multi-hop Spatial Reasoning (MSR) in dynamic video scenarios. Video-MSR systematically probes MSR capabilities through four distinct tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually-grounded pipeline combining advanced model generation with rigorous human verification. Through a comprehensive evaluation of 20 state-of-the-art MLLMs, we uncover significant limitations, revealing that while models demonstrate proficiency in surface-level perception, they exhibit distinct performance drops in MSR tasks, frequently suffering from spatial disorientation and hallucination during multi-step deductions. To mitigate these shortcomings and empower models with stronger MSR capabilities, we further curate MSR-9K, a specialized instruction-tuning dataset, and fine-tune Qwen-VL, achieving a +7.82% absolute improvement on Video-MSR. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research. The code and data will be available at https://github.com/ruiz-nju/Video-MSR.
[124] Do Transformers Understand Ancient Roman Coin Motifs Better than CNNs?
David Reid,Ognjen Arandjelovic
Main category: cs.CV
TL;DR: This paper is the first to apply the Vision Transformer (ViT) to identifying semantic elements on ancient coins, using fully automatic learning from multi-modal data (images and unstructured text), and finds that ViT models outperform newly trained CNN models in accuracy.
Details
Motivation: More efficient automated analysis is needed to help researchers extract more historical information from large collections of ancient coins and to help collectors better understand what they are trading.
Method: Vision Transformer (ViT) and convolutional neural network (CNN) models are trained automatically on multi-modal data (images and unstructured text), and their performance in identifying semantic elements on ancient coins is compared.
Result: The ViT models outperform the newly trained CNN models in accuracy.
Conclusion: The ViT architecture performs better on semantic element identification for ancient coins, showing its potential in this area.
Abstract: Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large collections of coins and to help collectors understand what they are buying or selling. Recent research in this area has shown promise in focusing on identification of semantic elements as they are commonly depicted on ancient coins, by using convolutional neural networks (CNNs). This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to the task of identification of semantic elements on coins, using fully automatic learning from multi-modal data (images and unstructured text). This article summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coins analysis and provides an evaluation of their performance. The ViT models were found to outperform the newly trained CNN models in accuracy.
[125] PrivLEX: Detecting legal concepts in images through Vision-Language Models
Darya Baranouskaya,Andrea Cavallaro
Main category: cs.CV
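TL;DR: PrivLEX is a novel image privacy classifier that is the first to combine Vision-Language Models (VLMs) with legally defined personal data concepts, providing interpretable classification without explicit concept annotations.
Details
Motivation: Existing privacy classifiers are not aligned with legal definitions, which makes compliance needs hard to meet, and most methods depend on explicit concept annotations, which limits practical use.
Method: PrivLEX uses zero-shot VLM concept detection to build a label-free Concept Bottleneck Model, achieving interpretable classification without concept labels at training time.
Result: Experiments show that PrivLEX can identify personal data concepts present in images; the sensitivity of these concepts as perceived by human annotators is also analysed.
Conclusion: PrivLEX is the first interpretable image privacy classifier aligned with legal concepts, offering a technical route to privacy protection that is closer to regulatory requirements.
Abstract: We present PrivLEX, a novel image privacy classifier that grounds its decisions in legally defined personal data concepts. PrivLEX is the first interpretable privacy classifier aligned with legal concepts that leverages the recognition capabilities of Vision-Language Models (VLMs). PrivLEX relies on zero-shot VLM concept detection to provide interpretable classification through a label-free Concept Bottleneck Model, without requiring explicit concept labels during training. We demonstrate PrivLEX's ability to identify personal data concepts that are present in images. We further analyse the sensitivity of such concepts as perceived by human annotators of image privacy datasets.
[126] MAD: Motion Appearance Decoupling for efficient Driving World Models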
Ahmad Rahimi,Valentin Gerard,Eloi Zablocki,Matthieu Cord,Alexandre Alahi
Main category: cs.CV
TL;DR: Proposes an efficient two-stage adaptation framework that turns generalist video diffusion models into controllable driving world models by decoupling motion learning from appearance synthesis, achieving state-of-the-art performance at very low compute cost.
Details
Motivation: Existing video diffusion models generate photorealistic, temporally coherent videos but lack structured motion and physical consistency, so they are unreliable as driving world models; existing adaptation methods also tend to require large amounts of domain data and costly fine-tuning.
Method: A two-stage decoupled approach. First, the model is trained to predict the motion of simplified, skeletonized agents and scene elements, focusing on physical and social plausibility. Then the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, following a "first infer dynamics, then render appearance" paradigm.
Result: The approach is highly efficient: adapting SVD, it matches prior SOTA with less than 6% of the compute; scaled to LTX, MAD-LTX outperforms all open-source competitors and supports text, ego, and object controls.
Conclusion: By decoupling motion prediction from appearance generation, the framework adapts generalist video models into driving world models with very low resource use, combining efficiency, controllability, and good visual quality.
Abstract: Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively "dressing" the motion with texture and lighting. This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors, and supports a comprehensive suite of text, ego, and object controls. Project page: https://vita-epfl.github.io/MAD-World-Model/
[127] Towards Robust Cross-Dataset Object Detection Generalization under Domain Specificity
Ritabrata Chakraborty,Hrishit Mitra,Shivakumara Palaiahnakote,Umapada Pal
Main category: cs.CV
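TL;DR: This paper studies setting specificity in cross-dataset object detection (CD-OD). Transfer between datasets of the same setting type is fairly stable, while transfer across setting types degrades markedly, most severely from specific to agnostic settings. Comparing closed-label and open-label evaluation shows that domain shift is the main challenge, and a CLIP-based open-label alignment is proposed to mitigate label mismatch.
Details
Motivation: Object detectors perform well in-distribution but degrade sharply on other benchmarks. Understanding this limit on generalization requires a systematic analysis of the structure of cross-dataset transfer, in particular the effect of setting specificity.
Method: Benchmarks are grouped into setting-agnostic datasets (diverse everyday scenes) and setting-specific datasets (a particular environment), and a standard detector family is evaluated on all train-test pairs. An open-label protocol based on CLIP similarity is introduced to measure and mitigate label mismatch, separating the effect of domain shift from that of label inconsistency.
Result: Transfer within the same setting type is relatively stable, while transfer across types drops substantially and is often asymmetric. The worst degradation occurs when transferring from specific sources to agnostic targets and persists after open-label alignment, indicating that domain shift dominates in the hardest regimes. Open-label evaluation yields consistent but bounded gains, and many corrected cases are semantic near-misses supported by the image evidence.
Conclusion: The paper offers a structured view of CD-OD centered on setting specificity, showing that domain shift is the main challenge, especially from specific to agnostic scenes. Open-label evaluation helps with label mismatch but cannot resolve domain shift, which gives principled guidance and practical evaluation advice for future work.
Abstract: Object detectors often perform well in-distribution, yet degrade sharply on a different benchmark. We study cross-dataset object detection (CD-OD) through a lens of setting specificity. We group benchmarks into setting-agnostic datasets with diverse everyday scenes and setting-specific datasets tied to a narrow environment, and evaluate a standard detector family across all train-test pairs. This reveals a clear structure in CD-OD: transfer within the same setting type is relatively stable, while transfer across setting types drops substantially and is often asymmetric. The most severe breakdowns occur when transferring from specific sources to agnostic targets, and persist after open-label alignment, indicating that domain shift dominates in the hardest regimes. To disentangle domain shift from label mismatch, we compare closed-label transfer with an open-label protocol that maps predicted classes to the nearest target label using CLIP similarity. Open-label evaluation yields consistent but bounded gains, and many corrected cases correspond to semantic near-misses supported by the image evidence. Overall, we provide a principled characterization of CD-OD under setting specificity and practical guidance for evaluating detectors under distribution shift. Code will be released at https://github.com/Ritabrata04/cdod-icpr.
[128] V-DPM: 4D Video Reconstruction with Dynamic Point Maps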
Edgar Sucar,Eldar Insafutdinov,Zihang Lai,Andrea Vedaldi
Main category: cs.CV
TL;DR: This paper proposes V-DPM, a Dynamic Point Map (DPM) formulation for video input of dynamic scenes that achieves state-of-the-art 3D and 4D reconstruction and recovers the full 3D motion of every point.
Details
Motivation: Existing DPMs are limited to image pairs and require post-processing optimization, which makes them hard to extend to multi-view dynamic scenes; the authors argue that DPMs are more useful when applied to videos.
Method: A formulation of DPMs for video input is proposed and implemented on top of the VGGT model, which is fine-tuned with a modest amount of synthetic data to adapt it to dynamic scenes.
Result: State-of-the-art performance in 3D and 4D reconstruction of dynamic scenes, recovering both dynamic depth and the full 3D motion of every point.
Conclusion: V-DPM extends DPMs to video sequences and achieves high-quality dynamic reconstruction without complex optimization, with strong representational power and reuse of pretrained models.
Abstract: Powerful 3D representations such as DUSt3R invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feed-forward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend this concept to dynamic 3D content by additionally representing scene motion. However, existing DPMs are limited to image pairs and, like DUSt3R, require post-processing via optimization when more than two views are involved. We argue that DPMs are more useful when applied to videos and introduce V-DPM to demonstrate this. First, we show how to formulate DPMs for video input in a way that maximizes representational power, facilitates neural prediction, and enables reuse of pretrained models. Second, we implement these ideas on top of VGGT, a recent and powerful 3D reconstructor. Although VGGT was trained on static scenes, we show that a modest amount of synthetic data is sufficient to adapt it into an effective V-DPM predictor. Our approach achieves state-of-the-art performance in 3D and 4D reconstruction for dynamic scenes. In particular, unlike recent dynamic extensions of VGGT such as P3, DPMs recover not only dynamic depth but also the full 3D motion of every point in the scene.
[129] Video Joint-Embedding Predictive Architectures for Facial Expression Recognition
Lennart Eing,Cristina Luna-Jiménez,Silvan Mertes,Elisabeth André
Main category: cs.CV
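TL;DR: This paper presents a new approach to facial expression recognition (FER) based on Video Joint-Embedding Predictive Architectures (V-JEPA). Shallow classifiers trained on the RAVDESS and CREMA-D datasets achieve state-of-the-art performance and show strong cross-dataset generalization.
Details
Motivation: Conventional pre-training for video understanding relies on pixel-level reconstruction, which tends to capture irrelevant background information; this work uses V-JEPA's embedding prediction to learn more essential semantic features and improve FER.
Method: A pre-trained V-JEPA video encoder, which learns by predicting the embeddings of masked regions from the embeddings of unmasked regions, is used as a feature extractor, and shallow classifiers are trained on RAVDESS and CREMA-D for facial expression recognition.
Result: State-of-the-art performance on RAVDESS, better than all other vision-based methods on CREMA-D (+1.48 WAR), and good generalization in cross-dataset evaluation.
Conclusion: Purely embedding-based pre-training has considerable potential for facial expression recognition, extracting semantically relevant features and generalizing well.
Abstract: This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstructions, V-JEPAs learn by predicting embeddings of masked regions from the embeddings of unmasked regions. This enables the trained encoder to avoid capturing irrelevant information about a given video, such as the color of a region of pixels in the background. Using a pre-trained V-JEPA video encoder, we train shallow classifiers using the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at https://github.com/lennarteingunia/vjepa-for-fer.
[130] GlovEgo-HOI: Bridging the Synthetic-to-Real Gap for Industrial Egocentric Human-Object Interaction Detection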
Alfio Spoto,Rosario Leonardi,Francesco Ragusa,Giovanni Maria Farinella
Main category: cs.CV
TL;DR: Proposes a data generation framework combining synthetic data with diffusion models to strengthen egocentric human-object interaction (EHOI) analysis in industrial settings, and releases the GlovEgo-HOI dataset and the GlovEgo-Net model.
Details
Motivation: EHOI analysis is crucial for industrial safety, but the scarcity of annotated domain-specific data holds back model development.
Method: A framework that combines synthetic data with diffusion-based augmentation adds realistic Personal Protective Equipment (PPE) to real images. The GlovEgo-Net model integrates Glove-Head and Keypoint-Head modules to exploit hand pose information for interaction detection.
Result: Experiments show that the data generation method and GlovEgo-Net improve detection performance on the EHOI task.
Conclusion: The work advances industrial EHOI research; the released dataset, augmentation pipeline, and pre-trained models support follow-up work.
Abstract: Egocentric Human-Object Interaction (EHOI) analysis is crucial for industrial safety, yet the development of robust models is hindered by the scarcity of annotated domain-specific data. We address this challenge by introducing a data generation framework that combines synthetic data with a diffusion-based process to augment real-world images with realistic Personal Protective Equipment (PPE). We present GlovEgo-HOI, a new benchmark dataset for industrial EHOI, and GlovEgo-Net, a model integrating Glove-Head and Keypoint-Head modules to leverage hand pose information for enhanced interaction detection. Extensive experiments demonstrate the effectiveness of the proposed data generation framework and GlovEgo-Net. To foster further research, we release the GlovEgo-HOI dataset, augmentation pipeline, and pre-trained models at: GitHub project.
[131] Bipartite Mode Matching for Vision Training Set Search from a Hierarchical Data Server
Yue Yao,Ruining Yang,Tom Gedeon
Main category: cs.CV
TL;DR: Proposes a data-centric unsupervised domain adaptation method based on a hierarchical data server and a bipartite mode matching (BMM) algorithm, which builds training sets aligned with the target domain's modes, narrowing the domain gap and improving model performance.
Details
Motivation: When data in the target domain cannot be annotated in real time, existing methods mostly focus on improving algorithms and overlook the potential of the data server's structure; if the training set does not cover the target domain's semantic modes, model performance suffers.
Method: A hierarchical data server is designed together with a bipartite mode matching (BMM) algorithm that matches each semantic mode of the target domain one-to-one with its best-matching mode in the source data, yielding a better training set.
Result: Compared with existing methods, the training sets selected by BMM have a smaller domain gap to the target domain and lead to higher accuracy on object re-identification and detection tasks.
Conclusion: By optimizing the data structure rather than only the model, BMM enables data-centric unsupervised domain adaptation and can be combined with existing model-centric methods such as pseudo-labeling for further gains.
Abstract: We explore a situation in which the target domain is accessible, but real-time data annotation is not feasible. Instead, we would like to construct an alternative training set from a large-scale data server so that a competitive model can be obtained. For this problem, because the target domain usually exhibits distinct modes (i.e., semantic clusters representing data distribution), if the training set does not contain these target modes, the model performance would be compromised. While prior works improve algorithms iteratively, our research explores the often-overlooked potential of optimizing the structure of the data server. Inspired by the hierarchical nature of web search engines, we introduce a hierarchical data server, together with a bipartite mode matching algorithm (BMM) to align source and target modes. For each target mode, we look in the server data tree for the best mode match, which might be large or small in size. Through bipartite matching, we aim for all target modes to be optimally matched with source modes in a one-on-one fashion. Compared with existing training set search algorithms, we show that the matched server modes constitute training sets that have consistently smaller domain gaps with the target domain across object re-identification (re-ID) and detection tasks. Consequently, models trained on our searched training sets have higher accuracy than those trained otherwise. BMM allows data-centric unsupervised domain adaptation (UDA) orthogonal to existing model-centric UDA methods. By combining the BMM with existing UDA methods like pseudo-labeling, further improvement is observed.
[132] Hot-Start from Pixels: Low-Resolution Visual Tokens for Chinese Language Modeling
Shuyang Xiang,Hao Guan
Main category: cs.CV
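TL;DR: This study explores low-resolution visual inputs (such as 8x8-pixel grayscale images) as an alternative representation of Chinese characters, finding language modeling performance comparable to conventional index-based tokens with faster initial learning.
Details
Motivation: Chinese characters are logographic, and their visual structure carries semantic and phonetic information, yet conventional large models ignore this and feed characters in only as discrete indices, possibly losing useful signal. The authors test whether minimal visual input can support Chinese language modeling.
Method: Individual characters are rendered as low-resolution grayscale images (as small as 8x8 pixels) and fed directly to the decoder for language modeling in place of token IDs, then compared with an index-based model under the same architecture.
Result: The visual-input model reaches 39.2% accuracy, slightly above the index-based baseline's 39.1%. Early in training (at 0.4% of total training), the visual model exceeds 12% accuracy while the baseline is below 6%, a pronounced hot-start effect.
Conclusion: Very low-resolution visual structure already provides a strong, useful signal for Chinese language modeling, suggesting that visual representations can complement conventional index-based approaches and offering a new perspective on character modeling.
Abstract: Large language models typically represent Chinese characters as discrete index-based tokens, largely ignoring their visual form. For logographic scripts, visual structure carries semantic and phonetic information, which may aid prediction. We investigate whether low-resolution visual inputs can serve as an alternative for character-level modeling. Instead of token IDs, our decoder receives grayscale images of individual characters, with resolutions as low as 8x8 pixels. Remarkably, these inputs achieve 39.2% accuracy, comparable to the index-based baseline of 39.1%. Such low-resource settings also exhibit a pronounced hot-start effect: by 0.4% of total training, accuracy reaches above 12%, while index-based models lag at below 6%. Overall, our results demonstrate that minimal visual structure can provide a robust and efficient signal for Chinese language modeling, offering an alternative perspective on character representation that complements traditional index-based approaches.
[133] Trustworthy Longitudinal Brain MRI Completion: A Deformation-Based Approach with KAN-Enhanced Diffusion Model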
Tianli Tao,Ziyang Wang,Delong Yang,Han Zhang,Le Zhang
Main category: cs.CV
TL;DR: Proposes DF-DiffCom, a diffusion model enhanced with Kolmogorov-Arnold Networks that uses deformation fields for trustworthy longitudinal brain MRI completion, offering high fidelity, compatibility with multiple modalities, and flexible use.
Details
Motivation: Existing deep generative models for longitudinal brain MRI completion rely only on image intensity, which limits the trustworthiness of generated images and the flexibility of use, falling short of what complex studies require.
Method: DF-DiffCom combines deformation fields with a diffusion model enhanced by Kolmogorov-Arnold Networks (KAN). It is trained on the OASIS-3 dataset and uses deformation fields to guide image generation, improving fidelity and extension across modalities.
Result: On OASIS-3 the method outperforms existing approaches, improving PSNR by 5.6% and SSIM by 0.12, and extends to multiple MRI modalities and to attribute maps such as brain tissue segmentation.
Conclusion: By introducing deformation fields and the KAN structure, DF-DiffCom achieves high-fidelity, trustworthy longitudinal brain MRI completion with good generalization and practical potential.
Abstract: Longitudinal brain MRI is essential for lifespan study, yet high attrition rates often lead to missing data, complicating analysis. Deep generative models have been explored, but most rely solely on image intensity, leading to two key limitations: 1) the fidelity or trustworthiness of the generated brain images is limited, making downstream studies questionable; 2) the usage flexibility is restricted due to fixed guidance rooted in the model structure, which limits applicability to versatile application scenarios. To address these challenges, we introduce DF-DiffCom, a Kolmogorov-Arnold Networks (KAN)-enhanced diffusion model that smartly leverages deformation fields for trustworthy longitudinal brain image completion. Trained on OASIS-3, DF-DiffCom outperforms state-of-the-art methods, improving PSNR by 5.6% and SSIM by 0.12. More importantly, its modality-agnostic nature allows smooth extension to varied MRI modalities, even to attribute maps such as brain tissue segmentation results.
[134] OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
Sheng-Yu Huang,Jaesung Choe,Yu-Chiang Frank Wang,Cheng Sun
Main category: cs.CV
TL;DR: Proposes OpenVoxel, a training-free method for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding.
Details
Motivation: Existing methods depend on training and on embeddings from CLIP/BERT text encoders, which limits generalization in complex scenes.
Method: Starting from a sparse voxel rasterization (SVR) model obtained from multi-view images, VLMs and MLLMs are used to group voxels and caption each group, and queries are answered by training-free text-to-text search.
Result: Strong performance on open-vocabulary segmentation (OVS) and referring expression segmentation (RES), surpassing existing methods especially on complex RES tasks.
Conclusion: OpenVoxel is an efficient, training-free approach to 3D scene understanding that avoids an additional text encoder through direct text search, improving performance in open-vocabulary settings.
Abstract: We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for the open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel successfully builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open.
[135] Show, don't tell -- Providing Visual Error Feedback for Handwritten Documents
Said Yasin,Torsten Zesch
Main category: cs.CV
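TL;DR: This paper examines the challenges of going from an image of handwritten input to accurate visual feedback, compares modular and end-to-end systems, finds that neither currently reaches acceptable quality, and outlines directions for future research.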
Details
Motivation: Handwriting is essential in education, yet providing effective visual feedback on handwritten work remains understudied.
Method: Empirically compares modular and end-to-end systems for generating feedback on handwriting.
Result: Neither kind of system currently reaches satisfactory overall quality; several key technical challenges are identified.
Conclusion: Further research is needed to overcome the limitations of existing systems and move toward high-quality handwriting feedback.
Abstract: Handwriting remains an essential skill, particularly in education. Therefore, providing visual feedback on handwritten documents is an important but understudied area. We outline the many challenges when going from an image of handwritten input to correctly placed informative error feedback. We empirically compare modular and end-to-end systems and find that both approaches currently do not achieve acceptable overall quality. We identify the major challenges and outline an agenda for future research.
[136] Iterative Differential Entropy Minimization (IDEM) method for fine rigid pairwise 3D Point Cloud Registration: A Focus on the Metric
Emmanuele Barberi,Felice Sfravara,Filippo Cucinotta
Main category: cs.CV
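TL;DR: Proposes IDEM, a differential-entropy-based metric for optimizing point cloud registration that does not depend on choosing a fixed point cloud and is robust to density differences, noise, holes, and partial overlap.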
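As a rough illustration of why an entropy-style objective can be independent of which cloud is held fixed, the sketch below scores the union of two point clouds by the differential entropy of a single 3D Gaussian fitted to it, h = 0.5 ln((2πe)^3 det Σ). The Gaussian fit, the translation-only motion, and all function names are assumptions made for this example; the abstract does not specify the estimator IDEM actually uses.

```python
import math

def gaussian_differential_entropy(points):
    """Differential entropy (nats) of a 3D Gaussian fitted to `points`.

    Assumption for this sketch only: the cloud is summarized by one Gaussian.
    """
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    # Sample covariance matrix (3x3).
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
            for j in range(3)] for i in range(3)]
    det = (cov[0][0] * (cov[1][1] * cov[2][2] - cov[1][2] * cov[2][1])
           - cov[0][1] * (cov[1][0] * cov[2][2] - cov[1][2] * cov[2][0])
           + cov[0][2] * (cov[1][0] * cov[2][1] - cov[1][1] * cov[2][0]))
    return 0.5 * math.log(((2 * math.pi * math.e) ** 3) * det)

def joint_entropy(cloud_a, cloud_b):
    """Entropy of the merged cloud; neither input plays the role of 'fixed'."""
    return gaussian_differential_entropy(cloud_a + cloud_b)

def translate(cloud, offset):
    """Rigidly shift every point of `cloud` by `offset`."""
    return [tuple(c + o for c, o in zip(p, offset)) for p in cloud]

# Toy data: a small regular grid standing in for a scanned surface.
cloud = [(float(x), float(y), float(z))
         for x in range(4) for y in range(3) for z in range(2)]
aligned = joint_entropy(cloud, cloud)
shifted = joint_entropy(cloud, translate(cloud, (0.5, 0.0, 0.0)))
print(aligned, shifted)
```

In this toy setting the merged-cloud entropy is lower when the two copies coincide than when one is shifted, and moving the first cloud by the opposite offset gives the same score as moving the second, since the objective only sees the union.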
Details
Motivation: Traditional point cloud registration methods such as ICP rely on Euclidean distances, require one cloud to be designated as fixed, and are not symmetric; their performance degrades with noise, holes, uneven density, and low overlap, which calls for a more robust registration metric.
Method: Proposes the Iterative Differential Entropy Minimization (IDEM) framework, which uses differential entropy as the objective function. The metric does not depend on which point cloud is fixed, and registration is obtained by searching over transformations for the entropy minimum that corresponds to the best alignment.
Result: Across challenging cases (noise, holes, density differences, partial overlap), the metric remains effective where RMSE does not always yield the optimal alignment; results are compared against RMSE, Chamfer distance, and Hausdorff distance.
Conclusion: IDEM's differential-entropy metric is a robust objective for fine rigid registration that is independent of the choice of fixed cloud and suited to realistic scenes in which both clouds are degraded.
Abstract: Point cloud registration is a central theme in computer vision, with alignment algorithms continuously improving for greater robustness. Commonly used methods evaluate Euclidean distances between point clouds and minimize an objective function, such as Root Mean Square Error (RMSE). However, these approaches are most effective when the point clouds are well-prealigned and issues such as differences in density, noise, holes, and limited overlap can compromise the results. Traditional methods, such as Iterative Closest Point (ICP), require choosing one point cloud as fixed, since Euclidean distances lack commutativity. When only one point cloud has issues, adjustments can be made, but in real scenarios, both point clouds may be affected, often necessitating preprocessing. The authors introduce a novel differential entropy-based metric, designed to serve as the objective function within an optimization framework for fine rigid pairwise 3D point cloud registration, denoted as Iterative Differential Entropy Minimization (IDEM). This metric does not depend on the choice of a fixed point cloud and, during transformations, reveals a clear minimum corresponding to the best alignment. Multiple case studies are conducted, and the results are compared with those obtained using RMSE, Chamfer distance, and Hausdorff distance. The proposed metric proves effective even with density differences, noise, holes, and partial overlap, where RMSE does not always yield optimal alignment.
[137] Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets
Jeremiah Coholich,Justin Wit,Robert Azarcon,Zsolt Kira
Main category: cs.CV
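TL;DR: MANGO is an unpaired image translation method that addresses viewpoint variation in vision-based robot manipulation; sim2real translation increases data diversity and substantially raises policy success rates on unseen viewpoints.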
Details
Motivation: Vision-based policies are brittle to changes in camera viewpoint, and real-world demonstration data is scarce and usually captured from a single viewpoint, so it cannot cover diverse scenes.
Method: Proposes MANGO, which combines a segmentation-conditioned InfoNCE loss, a highly regularized discriminator design, and a modified PatchNCE loss, and performs unpaired image translation using a small amount of real fixed-camera data together with simulated data.
Result: MANGO maintains viewpoint consistency during sim2real translation and outperforms the other image translation methods tested; imitation-learning policies trained on MANGO-augmented data reach success rates as high as 60% on new views where the non-augmented policy fails completely.
Conclusion: MANGO bridges the visual gap between simulation and reality, improves the generalization of manipulation policies to unseen viewpoints, and eases the scarcity of real data.
Abstract: Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO -- an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this domain, MANGO outperforms all other image translation methods we tested. Imitation-learning policies trained on data augmented by MANGO are able to achieve success rates as high as 60% on views that the non-augmented policy fails completely on.
[138] GRCF: Two-Stage Groupwise Ranking and Calibration Framework for Multimodal Sentiment Analysis
Manning Gao,Leheng Zhang,Shiqin Han,Haifeng Hu,Yuncheng Jiang,Sijie Mai
Main category: cs.CV
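TL;DR: This paper proposes a two-stage Group-wise Ranking and Calibration Framework (GRCF) that borrows ideas from Group Relative Policy Optimization (GRPO) to address two limitations of existing pairwise ordinal learning methods in multimodal sentiment analysis: no adaptive focus on hard-to-rank samples, and static ranking margins. The framework preserves relative order while keeping predictions calibrated in absolute terms, and achieves state-of-the-art performance on multiple tasks.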
Details
Motivation: Existing point-wise regression methods are sensitive to label noise and ignore the relative order between samples, while pairwise ordinal learning methods do not adaptively focus on hard-to-rank samples and use static margins that cannot reflect the semantic distances between sentiment groups.
Method: Proposes GRCF. Stage 1 uses a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure; Stage 2 uses an MAE-driven objective to align prediction magnitudes. The framework is also extended to classification tasks such as humor and sarcasm detection.
Result: GRCF achieves state-of-the-art performance on core regression benchmarks and generalizes well to classification tasks such as multimodal humor detection and sarcasm detection.
Conclusion: GRCF addresses key shortcomings of existing multimodal sentiment analysis methods by modeling relative order and calibrating absolute scores together, with strong performance and broad applicability.
Abstract: Most Multimodal Sentiment Analysis research has focused on point-wise regression. While straightforward, this approach is sensitive to label noise and neglects whether one sample is more positive than another, resulting in unstable predictions and poor correlation alignment. Pairwise ordinal learning frameworks emerged to address this gap, capturing relative order by learning from comparisons. Yet, they introduce two new trade-offs: First, they assign uniform importance to all comparisons, failing to adaptively focus on hard-to-rank samples. Second, they employ static ranking margins, which fail to reflect the varying semantic distances between sentiment groups. To address this, we propose a Two-Stage Group-wise Ranking and Calibration Framework (GRCF) that adapts the philosophy of Group Relative Policy Optimization (GRPO). Our framework resolves these trade-offs by simultaneously preserving relative ordinal structure, ensuring absolute score calibration, and adaptively focusing on difficult samples. Specifically, Stage 1 introduces a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure. Stage 2 then employs an MAE-driven objective to align prediction magnitudes. To validate its generalizability, we extend GRCF to classification tasks, including multimodal humor detection and sarcasm detection. GRCF achieves state-of-the-art performance on core regression benchmarks, while also showing strong generalizability in classification tasks.
[139] CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems
Yonglin Tian,Qiyao Zhang,Wei Xu,Yutong Wang,Yihao Wu,Xinyi Li,Xingyuan Dai,Hui Zhang,Zhiyong Cui,Baoqing Guo,Zujun Yu,Yisheng Lv
Main category: cs.CV
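TL;DR: This paper introduces CogRail, a new benchmark for railway intrusion perception that combines vision-language models with multimodal prompts for spatio-temporal reasoning, and proposes a joint fine-tuning framework that improves performance on position perception, movement prediction, and threat analysis.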
Details
Motivation: Existing railway intrusion detection systems mostly rely on object classification within a fixed field of view plus rule-based heuristics, and struggle to recognize latent intrusion risks; deeper cognitive perception requires combining spatial context with temporal dynamics.
Method: Builds CogRail, a benchmark that combines open-source datasets with cognitively driven question-answer annotations; systematically evaluates state-of-the-art vision-language models; and proposes a joint fine-tuning framework that integrates three core tasks (position perception, movement prediction, and threat analysis) to improve adaptability and reasoning.
Result: Experiments show that current large-scale multimodal models perform poorly on complex spatio-temporal reasoning, while the proposed joint fine-tuning framework significantly improves performance on cognitive intrusion perception, with gains in both accuracy and interpretability.
Conclusion: Structured multi-task joint fine-tuning effectively adapts general-purpose foundation models to safety-critical domains such as railway intrusion perception, underlining the value of training tailored to specific cognitive tasks.
Abstract: Accurate and early perception of potential intrusion targets is essential for ensuring the safety of railway transportation systems. However, most existing systems focus narrowly on object classification within fixed visual scopes and apply rule-based heuristics to determine intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating such risks requires the cognition of spatial context and temporal dynamics for the object of interest (OOI), which presents challenges for conventional visual models. To facilitate deep intrusion perception, we introduce a novel benchmark, CogRail, which integrates curated open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art visual-language models (VLMs) using multimodal prompts to identify their strengths and limitations in this domain. Furthermore, we fine-tune VLMs for better performance and propose a joint fine-tuning framework that integrates three core tasks, position perception, movement prediction, and threat analysis, facilitating effective adaptation of general-purpose foundation models into specialized models tailored for cognitive intrusion perception. Extensive experiments reveal that current large-scale multimodal models struggle with the complex spatial-temporal reasoning required by the cognitive intrusion perception task, underscoring the limitations of existing foundation models in this safety-critical domain.
In contrast, our proposed joint fine-tuning framework significantly enhances model performance by enabling targeted adaptation to domain-specific reasoning demands, highlighting the advantages of structured multi-task learning in improving both accuracy and interpretability. Code will be available at https://github.com/Hub-Tian/CogRail.
[140] Identifying Models Behind Text-to-Image Leaderboards
Ali Naseh,Yuefeng Peng,Anshuman Suri,Harsh Chaudhari,Alina Oprea,Amir Houmansadr
Main category: cs.CV
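TL;DR: This study finds that the outputs of text-to-image (T2I) models form distinguishable clusters in image embedding space, so anonymized outputs are easy to deanonymize, exposing a security weakness in current voting-based T2I leaderboards.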
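The clustering observation in this entry suggests a nearest-centroid attribution rule, sketched below. It is a minimal illustration, not the authors' pipeline: the use of cosine similarity, the toy 2D vectors, and the names `fit_centroids` and `identify` are assumptions for this example, and real inputs would be embeddings from an image encoder.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fit_centroids(labeled_embeddings):
    """Map each model name to the mean of its unit-normalized embeddings."""
    centroids = {}
    for model, vectors in labeled_embeddings.items():
        unit = [normalize(v) for v in vectors]
        dim = len(unit[0])
        centroids[model] = [sum(v[i] for v in unit) / len(unit)
                            for i in range(dim)]
    return centroids

def identify(embedding, centroids):
    """Return the model whose centroid is most cosine-similar to `embedding`."""
    query = normalize(embedding)

    def cosine(centroid):
        c = normalize(centroid)
        return sum(a * b for a, b in zip(query, c))

    return max(centroids, key=lambda model: cosine(centroids[model]))

# Toy reference embeddings for two hypothetical models.
reference = {
    "model_a": [[1.0, 0.1], [0.9, 0.2]],
    "model_b": [[0.1, 1.0], [0.2, 0.8]],
}
centroids = fit_centroids(reference)
print(identify([0.8, 0.3], centroids))
```

If each model's generations really do cluster tightly, a rule this simple is enough to attribute an anonymized image, which is the concern the entry raises about leaderboard anonymity.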
Details
Motivation: T2I models are widely used to generate AI images, and current leaderboards rely on anonymized outputs for fair comparison. The security of this anonymity has not been adequately tested, so it is worth asking whether it can be broken.
Method: Proposes a centroid-based method that needs neither prompt control nor training data; it exploits clustering in image embedding space to attribute generated images to T2I models, and introduces a prompt-level distinguishability metric.
Result: With 22 models and 280 prompts (150K images), the method identifies models with high accuracy and reveals systematic model-specific signatures; some prompts lead to near-perfect distinguishability.
Conclusion: The anonymization that current T2I leaderboards depend on has fundamental security flaws, and stronger anonymization defenses are needed to preserve fairness.
Abstract: Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.
[141] AquaFeat+: an Underwater Vision Learning-based Enhancement Method for Object Detection, Classification, and Tracking
Emanuel da Costa Silva,Tatiana Taís Schein,José David García Ramos,Eduardo Lawson da Silva,Stephanie Loi Brião,Felipe Gomes de Oliveira,Paulo Lilles Jorge Drews-Jr
Main category: cs.CV
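TL;DR: AquaFeat+ is a plug-and-play feature enhancement pipeline designed to improve perception in underwater robotic vision tasks; trained end-to-end, it significantly improves detection, classification, and tracking on the real-world FishTrack23 dataset.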
Details
Motivation: Low lighting, color distortion, and turbidity degrade underwater video and hurt robotic perception modules; existing methods mostly target human perceptual quality rather than feature quality for downstream automated tasks.
Method: Proposes AquaFeat+, with modules for color correction, hierarchical feature enhancement, and an adaptive residual output, trained end-to-end and guided directly by the loss function of the downstream task to achieve task-oriented feature enhancement.
Result: On FishTrack23, AquaFeat+ significantly improves object detection, classification, and tracking metrics, outperforming existing image enhancement methods.
Conclusion: As a task-driven feature enhancement pipeline, AquaFeat+ improves automated vision tasks in difficult underwater conditions and shows good practicality and potential to generalize.
Abstract: Underwater video analysis is particularly challenging due to factors such as low lighting, color distortion, and turbidity, which compromise visual data quality and directly impact the performance of perception modules in robotic applications. This work proposes AquaFeat+, a plug-and-play pipeline designed to enhance features specifically for automated vision tasks, rather than for human perceptual quality. The architecture includes modules for color correction, hierarchical feature enhancement, and an adaptive residual output, which are trained end-to-end and guided directly by the loss function of the final application. Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics, validating its effectiveness for enhancing perception tasks in underwater robotic applications.
[142] Image2Garment: Simulation-ready Garment Generation from a Single Image
Selim Emir Can,Jan Ackermann,Kiyohiro Nakayama,Ruofan Liu,Tong Wu,Yang Zheng,Hugo Bertiche,Menglei Chai,Thabo Beeler,Gordon Wetzstein
Main category: cs.CV
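TL;DR: Proposes a feed-forward framework that uses a vision-language model and a lightweight predictor to generate simulation-ready garments from a single image, without iterative optimization.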
Details
Motivation: Image-to-physics annotated datasets are lacking, and existing methods either depend on multi-view input or cannot predict the full set of physical material properties, which makes high-fidelity simulation difficult.
Method: First fine-tunes a vision-language model to infer material composition and fabric attributes from images, then trains a lightweight predictor on a small dataset of material-physics measurements to map those attributes to physical parameters.
Result: Outperforms existing methods on material composition and fabric attribute prediction, and yields higher-fidelity garment simulation.
Conclusion: The method efficiently and accurately produces garments suitable for physical simulation from a single image, advancing image-to-physics modeling.
Abstract: Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed nature of this problem. Prior methods either require multi-view capture and expensive differentiable simulation or predict only garment geometry without the material properties required for realistic simulation. We propose a feed-forward framework that sidesteps these limitations by first fine-tuning a vision-language model to infer material composition and fabric attributes from real images, and then training a lightweight predictor that maps these attributes to the corresponding physical fabric parameters using a small dataset of material-physics measurements. Our approach introduces two new datasets (FTAG and T2P) and delivers simulation-ready garments from a single image without iterative optimization. Experiments show that our estimator achieves superior accuracy in material composition estimation and fabric attribute prediction, and by passing them through our physics parameter estimator, we further achieve higher-fidelity simulations compared to state-of-the-art image-to-garment methods.
[143] LiteEmbed: Adapting CLIP to Rare Classes
Aishwarya Agarwal,Srikrishna Karanam,Vineet Gandhi
Main category: cs.CV
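TL;DR: LiteEmbed is a lightweight framework for few-shot personalization of CLIP; subspace-guided optimization of text embeddings lets new classes be added without retraining the encoders.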
Details
Motivation: Large-scale vision-language models such as CLIP perform poorly on classes that are rare in pretraining, such as newly emerging entities or culturally specific categories, so a way to adapt to new classes without retraining is needed.
Method: Proposes LiteEmbed, which performs subspace-guided optimization using a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations, and jointly optimizes text embeddings under two objectives: coarse alignment and fine separation.
Result: Experiments show substantial gains over prior methods across classification, retrieval, segmentation, and detection, improving recognition of rare or unseen classes.
Conclusion: LiteEmbed is an efficient, plug-and-play way to personalize CLIP, strengthening adaptation to underrepresented classes without retraining.
Abstract: Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP's vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP's original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
[144] Self-Supervised Animal Identification for Long Videos
Xuyang Fang,Sion Hannuna,Edwin Simpson,Neill Campbell
Main category: cs.CV
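TL;DR: Proposes an efficient self-supervised method that reframes animal identification as a global clustering task; it needs only bounding box detections and the number of individuals, and achieves high-accuracy identification with low memory use.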
Details
Motivation: Traditional methods depend on extensive manual annotation, and existing self-supervised methods are computationally expensive and handle long video sequences poorly.
Method: Assumes a fixed number of individuals in the video, samples pairs of frames, uses a frozen pre-trained backbone, assigns in-batch pseudo-labels with the Hungarian algorithm in a self-bootstrapping loop, and adapts a Binary Cross Entropy loss from vision-language models.
Result: Reaches over 97% accuracy on the 3D-POP pigeons and 8-calves feeding video datasets while using less than 1 GB of GPU memory per batch, matching or surpassing supervised baselines trained on more than 1,000 labeled frames.
Conclusion: The method greatly reduces compute requirements, enables high-accuracy animal identification on consumer-grade hardware, removes the manual annotation bottleneck, and suits resource-constrained research settings.
Abstract: Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video -- a common scenario in practice -- and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy (>97%) while consuming less than 1 GB of GPU memory per batch -- an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings.
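The in-batch pseudo-label assignment mentioned in this entry is a one-to-one matching between detections and identity slots. The sketch below illustrates that matching on a toy similarity matrix. Note the substitution: it uses exhaustive search over permutations instead of the Hungarian algorithm, which returns the same optimal assignment but is only practical for a handful of individuals. The function name and the similarity values are hypothetical.

```python
from itertools import permutations

def assign_pseudo_labels(similarity):
    """Assign each detection a distinct identity slot.

    similarity[i][k] is the score between detection i and identity slot k
    (square matrix). Returns `labels` with labels[i] = slot for detection i,
    maximizing the total score under a one-to-one constraint.
    Exhaustive search stands in for the Hungarian algorithm here.
    """
    n = len(similarity)
    best = max(permutations(range(n)),
               key=lambda perm: sum(similarity[i][perm[i]] for i in range(n)))
    return list(best)

# Toy scores: detections 0 and 1 both prefer slot 0, so a per-row argmax
# would give two animals the same identity.
similarity = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.70],
]
labels = assign_pseudo_labels(similarity)
print(labels)
```

The one-to-one constraint is what the known, fixed animal count buys: every frame must contain each identity at most once, so conflicting greedy choices are resolved jointly rather than row by row.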
All code written for this paper is available at https://huggingface.co/datasets/tonyFang04/8-calves.
[145] SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
Yuchen Wu,Jiahe Li,Xiaohan Yu,Lina Yu,Jin Zheng,Xiao Bai
Main category: cs.CV
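TL;DR: Proposes SCE-SLAM, which uses scene coordinate embeddings to keep scale consistent in monocular SLAM, substantially reducing scale drift and improving localization accuracy.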
Details
Motivation: Existing monocular visual SLAM lacks global constraints, so scale drifts over long sequences, degrading 3D reconstruction and navigation accuracy.
Method: Proposes SCE-SLAM, which learns scene coordinate embeddings that encode 3D geometric relationships under a canonical scale; geometry-guided aggregation propagates scale information from historical observations, and scene coordinate bundle adjustment applies explicit 3D coordinate constraints to maintain scale consistency.
Result: Experiments on KITTI, Waymo, and vKITTI show that ATE on KITTI is reduced by 8.36 m relative to the best prior method, while the system runs at 36 FPS and stays scale-consistent across large scenes.
Conclusion: SCE-SLAM addresses scale drift in monocular SLAM, substantially improving localization accuracy and scale consistency while keeping real-time performance.
Abstract: Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36 m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.
[146] STEP3-VL-10B Technical Report
Ailin Huang,Chengyuan Yao,Chunrui Han,Fanqi Wan,Hangyu Guo,Haoran Lv,Hongyu Zhou,Jia Wang,Jian Zhou,Jianjian Sun,Jingcheng Hu,Kangheng Lin,Liang Zhao,Mitt Huang,Song Yuan,Wenwen Qu,Xiangfeng Wang,Yanlin Lai,Yingxiu Zhao,Yinmin Zhang,Yukang Shi,Yuyang Chen,Zejia Weng,Ziyang Meng,Ang Li,Aobo Kong,Bo Dong,Changyi Wan,David Wang,Di Qi,Dingming Li,En Yu,Guopeng Li,Haiquan Yin,Han Zhou,Hanshan Zhang,Haolong Yan,Hebin Zhou,Hongbo Peng,Jiaran Zhang,Jiashu Lv,Jiayi Fu,Jie Cheng,Jie Zhou,Jisheng Yin,Jingjing Xie,Jingwei Wu,Jun Zhang,Junfeng Liu,Kaijun Tan,Kaiwen Yan,Liangyu Chen,Lina Chen,Mingliang Li,Qian Zhao,Quan Sun,Shaoliang Pang,Shengjie Fan,Shijie Shang,Siyuan Zhang,Tianhao You,Wei Ji,Wuxun Xie,Xiaobo Yang,Xiaojie Hou,Xiaoran Jiao,Xiaoxiao Ren,Xiangwen Kong,Xin Huang,Xin Wu,Xing Chen,Xinran Wang,Xuelin Zhang,Yana Wei,Yang Li,Yanming Xu,Yeqing Shen,Yuang Peng,Yue Peng,Yu Zhou,Yusheng Li,Yuxiang Yang,Yuyang Zhang,Zhe Xie,Zhewei Huang,Zhenyi Lu,Zhimin Fan,Zihui Cheng,Daxin Jiang,Qi Han,Xiangyu Zhang,Yibo Zhu,Zheng Ge
Main category: cs.CV
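TL;DR: STEP3-VL-10B is a lightweight open-source multimodal foundation model. With a unified, fully unfrozen pre-training strategy and a scaled post-training pipeline, it rivals or surpasses models 10×-20× larger at only 10B parameters, with particularly strong results in vision-language understanding and complex reasoning.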
Details
Motivation: Aims to resolve the trade-off between efficiency and capability in multimodal models by building an open-source foundation model that is both compact and high-performing.
Method: Uses a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that pairs a language-aligned Perception Encoder with a Qwen3-8B decoder, followed by a scaled post-training pipeline with over 1k iterations of reinforcement learning; introduces Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute for perceptual reasoning.
Result: Scores 92.2% on MMBench, 80.11% on MMMU, 94.43% on AIME2025, and 75.95% on MathVision, rivaling or surpassing models 10×-20× larger such as GLM-4.6V-106B and Qwen3-VL-235B, as well as proprietary flagships such as Gemini 2.5 Pro and Seed-1.5-VL.
Conclusion: STEP3-VL-10B shows that a compact model, given an efficient training strategy and test-time scaling, can match or exceed far larger models in multimodal intelligence, and gives the community an efficient, reproducible baseline.
Abstract: We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×-20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
[147] Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering
Jieying Chen,Jeffrey Hu,Joan Lasenby,Ayush Tewari
Main category: cs.CV
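TL;DR: This paper proposes SRENDER, which synthesizes video efficiently by generating sparse keyframes and combining them with 3D reconstruction and rendering, greatly speeding up generation while preserving visual quality.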
Details
Motivation: Existing diffusion-based video generation is computationally expensive and slow, which makes it hard to use in applications that need real-time interaction, such as embodied AI and VR/AR.
Method: Uses a diffusion model to generate sparse keyframes, lifts the keyframes into a 3D representation via 3D reconstruction, and renders intermediate views to synthesize the full video; an additional model adaptively predicts the optimal number of keyframes.
Result: More than 40 times faster than the diffusion-based baseline when generating 20 seconds of video, while maintaining high visual fidelity and temporal stability.
Conclusion: By combining generative models with 3D rendering, SRENDER delivers efficient, controllable, high-quality video generation and offers a practical path toward real-time applications.
Abstract: Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
[148] COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation
Tony Danjun Wang,Tolga Birdal,Nassir Navab,Lennart Bastian
Main category: cs.CV
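TL;DR: Proposes COMPOSE, a framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem and solves it efficiently with a geometric pruning strategy, clearly outperforming existing methods in 3D pose estimation.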
Details
Motivation: Existing pairwise association methods are vulnerable to error propagation when enforcing multi-view consistency and cannot model global consistency.
Method: Formulates multi-view pose correspondence as a hypergraph partitioning problem, builds an integer linear program, and introduces an efficient geometric pruning strategy to reduce computational complexity.
Result: Improves average precision by up to 23% over previous optimization-based methods and by up to 11% over self-supervised end-to-end learned methods.
Conclusion: By explicitly modeling global consistency across views, COMPOSE suppresses the propagation of spurious associations and provides a more robust solution for sparse multi-view 3D pose estimation.
Abstract: 3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline, first detecting 2D keypoints in each view and then associating these detections across views to triangulate the 3D pose. Existing methods rely on mere pairwise associations to model this correspondence problem, treating global consistency between views (i.e., cycle consistency) as a soft constraint. Yet, reconciling these constraints for multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than through pairwise association. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy to substantially reduce the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.
[149] SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3
Ruiqi Shen,Chang Liu,Henghui Ding
Main category: cs.CV
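TL;DR: Proposes SAM3-DMS, a training-free decoupled strategy whose fine-grained per-object memory selection markedly improves the stability and identity preservation of video instance segmentation and tracking in complex multi-target scenes.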
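The contrast between a synchronized group decision and per-object decisions can be shown on a toy example. The sketch below is only an illustration of the idea described in this entry: the reliability scores, the threshold rule, and both function names are assumptions for this example, not SAM3's or SAM3-DMS's actual selection criteria.

```python
def group_level_select(scores, threshold):
    """One synchronized decision for all targets, based on their mean score."""
    keep = sum(scores.values()) / len(scores) >= threshold
    return {obj: keep for obj in scores}

def decoupled_select(scores, threshold):
    """An independent decision for each target, based on its own score."""
    return {obj: score >= threshold for obj, score in scores.items()}

# Hypothetical per-object reliability scores for one frame.
scores = {"obj_1": 0.95, "obj_2": 0.90, "obj_3": 0.20}
group = group_level_select(scores, threshold=0.6)
decoupled = decoupled_select(scores, threshold=0.6)
print(group)
print(decoupled)
```

Here the mean score clears the threshold, so the group rule would also write the unreliable `obj_3` into memory, while the decoupled rule rejects it and keeps the other two. With more concurrent targets, an average is more likely to hide a single poor track, which fits the entry's remark that the advantage grows with target density.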
Details
Motivation: In multi-target scenes, the original SAM3 selects memory at the group level through a synchronized decision based on average performance, which ignores the reliability of individual targets and degrades results in complex scenes.
Method: Proposes SAM3-DMS, which decouples the group decision and performs fine-grained memory selection independently for each object, with no additional training.
Result: Experiments show robust identity preservation and tracking stability, with the advantage becoming more pronounced as target density increases.
Conclusion: SAM3-DMS offers an effective and practical solution for multi-target video segmentation and tracking in complex real-world scenes.
Abstract: Segment Anything 3 (SAM3) has established a powerful foundation that robustly detects, segments, and tracks specified targets in videos. However, in its original implementation, its group-level collective memory selection is suboptimal for complex multi-object scenarios, as it employs a synchronized decision across all concurrent targets conditioned on their average performance, often overlooking individual reliability. To this end, we propose SAM3-DMS, a training-free decoupled strategy that utilizes fine-grained memory selection on individual objects. Experiments demonstrate that our approach achieves robust identity preservation and tracking stability. Notably, our advantage becomes more pronounced with increased target density, establishing a solid foundation for simultaneous multi-target video segmentation in the wild.
[150] Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang,Yunze Man,Zhiding Yu,Min-Hung Chen,Jan Kautz,Yu-Chiang Frank Wang,Fu-En Yang
Main category: cs.CV
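TL;DR: Fast-ThinkAct is an efficient vision-language-action reasoning framework that achieves compact, high-performance planning through verbalizable latent reasoning (latent CoT), substantially reducing inference latency while retaining strong multi-task performance.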