cs.CL
[1] Iti-Validator: A Guardrail Framework for Validating and Correcting LLM-Generated Itineraries
Shravan Gadbail,Masumi Desai,Kamalakar Karlapalem
Main category: cs.CL
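TL;DR: This study proposes a validation framework that evaluates and improves the temporal and spatial consistency of LLM-generated travel itineraries, using real flight data to correct implausible schedules.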
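The kind of check such a validator performs can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the route durations, the one-hour minimum transit time, and the `validate` helper are all assumptions made for the example, and all timestamps are assumed to share one time zone. In the paper the duration constraints come from real flight data rather than a hard-coded table.

```python
from datetime import datetime, timedelta

# Hypothetical minimum flight durations per route (illustrative values only;
# a real validator would obtain these from a flight-data source).
MIN_FLIGHT = {
    ("DEL", "DXB"): timedelta(hours=3, minutes=30),
    ("DXB", "LHR"): timedelta(hours=7, minutes=30),
}
MIN_TRANSIT = timedelta(hours=1)  # assumed minimum connection time


def validate(legs):
    """Return a list of problems found in an itinerary.

    Each leg is a tuple (origin, destination, departure, arrival).
    """
    problems = []
    for i, (src, dst, dep, arr) in enumerate(legs):
        floor = MIN_FLIGHT.get((src, dst))
        # Flag flights that are shorter than the known minimum for the route.
        if floor is not None and arr - dep < floor:
            problems.append(f"leg {i}: {src}->{dst} shorter than {floor}")
        if i > 0:
            prev_arr = legs[i - 1][3]
            if dep < prev_arr:
                # Overlapping journeys: next leg leaves before the last one lands.
                problems.append(f"leg {i}: departs before leg {i - 1} arrives")
            elif dep - prev_arr < MIN_TRANSIT:
                problems.append(f"leg {i}: transit under {MIN_TRANSIT}")
    return problems


def t(s):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp."""
    return datetime.strptime(s, "%Y-%m-%d %H:%M")


itinerary = [
    ("DEL", "DXB", t("2025-01-10 08:00"), t("2025-01-10 10:00")),  # too short
    ("DXB", "LHR", t("2025-01-10 09:30"), t("2025-01-10 17:30")),  # overlaps
]
issues = validate(itinerary)
```

A correction step could then shift departure times forward until `validate` returns an empty list; that part is omitted here.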
Details
Motivation: Although large language models can generate complex multi-step plans, their output often shows temporal and spatial inconsistencies in real-world scenarios that involve physical travel and time constraints, which limits practical use.
Method: Several state-of-the-art LLMs generate travel plans; real flight-duration data retrieved through the AeroDataBox API feeds a validation framework that detects and corrects temporal conflicts and unrealistic transit times in the itineraries.
Result: Experiments show that current LLMs frequently produce temporally inconsistent itineraries, but the proposed framework corrects these problems systematically and reliably.
Conclusion: The framework markedly improves the temporal plausibility of LLM-generated itineraries, making them better suited to large-scale real-world travel planning systems.
Abstract: The rapid advancement of Large Language Models (LLMs) has enabled them to generate complex, multi-step plans and itineraries. However, these generated plans often lack temporal and spatial consistency, particularly in scenarios involving physical travel constraints. This research aims to study the temporal performance of different LLMs and presents a validation framework that evaluates and improves the temporal consistency of LLM-generated travel itineraries. The system employs multiple state-of-the-art LLMs to generate travel plans and validates them against real-world flight duration constraints using the AeroDataBox API. This work contributes to the understanding of LLM capabilities in handling complex temporal reasoning tasks like itinerary generation and provides a framework to rectify any temporal inconsistencies like overlapping journeys or unrealistic transit times in the itineraries generated by LLMs before the itinerary is given to the user. Our experiments reveal that while current LLMs frequently produce temporally inconsistent itineraries, these can be systematically and reliably corrected using our framework, enabling their practical deployment in large-scale travel planning.
[2] Dingtalk DeepResearch: A Unified Multi Agent Framework for Adaptive Intelligence in Enterprise Environments
Mengyuan Chen,Chengjun Dai,Xinyang Dong,Chengzhe Feng,Kewei Fu,Jianshe Li,Zhihan Peng,Yongqi Tong,Junshao Zhang,Hong Zhu
Main category: cs.CL
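TL;DR: Presents Dingtalk DeepResearch, a unified multi-agent intelligence framework for real-world enterprise environments that supports deep research, heterogeneous table reasoning, and multimodal report generation.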
Details
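Motivation: Enterprise environments pose complex and varied tasks, which calls for an intelligent system that can jointly handle deep research, data reasoning, and report generation.
Method: Builds a unified multi-agent intelligence framework that integrates capabilities such as deep information retrieval, heterogeneous table understanding, and multimodal content generation.
Result: Delivers a multi-agent system that handles complex tasks in real enterprise scenarios, with strong reasoning and generation capabilities.
Conclusion: Dingtalk DeepResearch offers an efficient, flexible, and full-featured multi-agent solution for real-world enterprise applications.
Abstract: We present Dingtalk DeepResearch, a unified multi-agent intelligence framework for real-world enterprise environments, delivering deep research, heterogeneous table reasoning, and multimodal report generation.
[3] Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation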
Wenzhen Luo,Wei Guan,Yifan Yao,Yimin Pan,Feng Wang,Zhipeng Yu,Zhe Wen,Liang Chen,Yihong Zhuang
Main category: cs.CL
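TL;DR: Falcon is a cross-domain Chinese text-to-SQL benchmark grounded in an enterprise dialect (MaxCompute/Hive). It contains 600 Chinese questions over 28 databases; 77% require multi-table reasoning and over half involve more than four tables. Current state-of-the-art large models reach at most 50% accuracy on it. The main challenges are schema linking in large enterprise environments and mapping colloquial Chinese precisely onto analytical SQL operations.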
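Execution-based evaluation of this kind compares the rows returned by a predicted query with the rows returned by the gold query. The sketch below is a generic illustration of that idea; the abstract does not describe how Falcon's comparator works, so the function names, the float rounding, and the `ordered` flag are all assumptions.

```python
from collections import Counter


def normalize(row, ndigits=6):
    """Round floats so tiny numeric noise does not cause a mismatch."""
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)


def same_result(predicted, gold, ordered=False):
    """Compare two query results given as lists of row tuples."""
    p = [normalize(r) for r in predicted]
    g = [normalize(r) for r in gold]
    if ordered:
        # Row order matters, e.g. when the gold query has ORDER BY.
        return p == g
    # Otherwise compare as multisets so duplicates still count.
    return Counter(p) == Counter(g)


def execution_accuracy(pairs):
    """Fraction of (predicted, gold) result pairs that match."""
    return sum(same_result(p, g) for p, g in pairs) / len(pairs)
```

Under this scheme a model scores a point only when its SQL returns the right rows, regardless of how the query text differs from the reference.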
Details
Motivation: Existing text-to-SQL benchmarks focus mostly on English and simple scenarios; they lack support for Chinese semantics and enterprise-grade database structures, which makes it hard to assess model performance in real enterprise settings. A challenging, enterprise-compatible benchmark for Chinese is therefore needed.
Method: Builds Falcon, a cross-domain dataset of 600 Chinese questions over 28 real enterprise databases, with every example annotated for SQL-computation features and Chinese semantic features; designs a robust execution-result comparator and an automated evaluation pipeline to measure the correctness of generated SQL more accurately.
Result: Current state-of-the-art large language models (including Deepseek) reach an execution accuracy of at most 50% on Falcon. Error analysis points to two main challenges: schema linking across large enterprise schemas, and turning terse, colloquial Chinese into precise SQL operators and predicates.
Conclusion: Falcon fills the gap for Chinese text-to-SQL in enterprise scenarios and offers a reproducible intermediate evaluation stage built on realistic enterprise schemas, which should help advance semantic parsing for Chinese and for complex enterprise environments.
Abstract: We introduce Falcon, a cross-domain Chinese text-to-SQL benchmark grounded in an enterprise-compatible dialect (MaxCompute/Hive). It contains 600 Chinese questions over 28 databases; 77% require multi-table reasoning and over half touch more than four tables. Each example is annotated along SQL-computation features and Chinese semantics. For evaluation, we release a robust execution comparator and an automated evaluation pipeline, under which all current state-of-the-art large-scale models (including Deepseek) achieve accuracies of at most 50%. Major errors originate from two sources: (1) schema linking in large enterprise landscapes - hundreds of tables, denormalized fields, ambiguous column names, implicit foreign-key relations and domain-specific synonyms that make correct join/column selection difficult; and (2) mapping concise, colloquial Chinese into the exact operators and predicates required for analytics - e.g., choosing the correct aggregation and group-by keys, expressing time windows and granularities, applying unit conversions, handling NULLs and data-quality rules, and formulating nested or windowed subqueries. Falcon therefore targets Chinese-specific semantics and enterprise dialects (abbreviations, business jargon, fuzzy entity references) and provides a reproducible middle ground before full production deployment by using realistic enterprise schemas, query templates, an execution comparator, and an automated evaluation pipeline for end-to-end validation.
[4] Confidence is Not Competence
Debdeep Sanyal,Manya Pandey,Dhruv Kumar,Saurabh Deshpande,Murari Mandal
Main category: cs.CL
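TL;DR: The paper offers a mechanistic explanation for the gap between confidence and actual problem-solving ability in large language models: the assessment phase is geometrically complex and high-dimensional, whereas execution unfolds on a low-dimensional manifold, so belief is linearly decodable yet does not affect the final outcome.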
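One common way to turn a principal-component spectrum into a single "effective dimensionality" number is the participation ratio. The sketch below assumes that measure purely for illustration; the abstract says only that dimensionality is "measured from the principal components", so the paper may use a different definition. The input values are made-up explained variances, not results from the paper.

```python
def participation_ratio(variances):
    """Effective dimensionality of a spectrum of PCA variances.

    Equals (sum of variances)^2 / (sum of squared variances): it is k when
    variance is spread evenly over k components and close to 1 when a single
    component dominates.
    """
    total = sum(variances)
    return total * total / sum(v * v for v in variances)


# Hypothetical spectra: an evenly spread one and a sharply peaked one.
spread_out = participation_ratio([1.0, 1.0, 1.0, 1.0])
peaked = participation_ratio([10.0, 0.1, 0.1, 0.1])
```

With these inputs the evenly spread spectrum has an effective dimensionality of 4, while the peaked one stays just above 1, which is the kind of contrast the summary describes between the assessment and execution phases.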
Details
Motivation: To understand why large language models often show a mismatch between confidence and actual performance, and to give a mechanistic explanation in terms of the geometry of internal representations.
Method: Analyzes the geometry of internal states in two phases, pre-generative assessment and solution execution; decodes the model's "solvability belief" with a linear probe, measures effective dimensionality from the principal components, and uses causal intervention experiments to study how the belief axis affects outputs.
Result: The model has a linear belief axis that generalizes, but the assessment manifold has high linear effective dimensionality while reasoning unfolds on a low-dimensional manifold; causal interventions along the belief axis do not change the final solution.
Conclusion: Large language models have a two-system architecture, a geometrically complex assessor and a geometrically simple executor. Decoded beliefs are not actionable control levers; interventions should instead target the dynamics of execution.
Abstract: Large language models (LLMs) often exhibit a puzzling disconnect between their asserted confidence and actual problem-solving competence. We offer a mechanistic account of this decoupling by analyzing the geometry of internal states across two phases - pre-generative assessment and solution execution. A simple linear probe decodes the internal "solvability belief" of a model, revealing a well-ordered belief axis that generalizes across model families and across math, code, planning, and logic tasks. Yet, the geometries diverge - although belief is linearly decodable, the assessment manifold has high linear effective dimensionality as measured from the principal components, while the subsequent reasoning trace evolves on a much lower-dimensional manifold. This sharp reduction in geometric complexity from thought to action mechanistically explains the confidence-competence gap. Causal interventions that steer representations along the belief axis leave final solutions unchanged, indicating that linear nudges in the complex assessment space do not control the constrained dynamics of execution. We thus uncover a two-system architecture - a geometrically complex assessor feeding a geometrically simple executor. These results challenge the assumption that decodable beliefs are actionable levers, instead arguing for interventions that target the procedural dynamics of execution rather than the high-level geometry of assessment.
[5] Cross-Lingual Summarization as a Black-Box Watermark Removal Attack
Gokul Ganesan
Main category: cs.CL
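TL;DR: Proposes the cross-lingual summarization attack (CLSA), which translates text into a pivot language, summarizes it, and optionally back-translates, effectively destroying watermark signals in AI-generated text while preserving semantic fidelity. Experiments show that CLSA lowers watermark detection accuracy more than monolingual paraphrasing with little impact on text quality, exposing the fragility of existing distributional watermarking methods.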
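Detection performance in this line of work is usually reported as AUROC, the probability that a randomly chosen watermarked text receives a higher detector score than a randomly chosen unwatermarked one. The sketch below computes it by direct pairwise comparison; the detector scores are invented for illustration and are not taken from the paper.

```python
def auroc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual rank-based definition.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical detector scores (higher means "looks watermarked").
human = [0.1, 0.2, 0.3]
watermarked = [0.9, 0.8, 0.7]      # before any attack: cleanly separable
after_attack = [0.2, 0.3, 0.1]     # after an attack: same range as human text

clean_auc = auroc(watermarked, human)
attacked_auc = auroc(after_attack, human)
```

An AUROC of 1.0 means the detector separates the two groups perfectly, and 0.5 means it does no better than a coin flip, which is why a value near 0.5 is described as near chance.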
Details
Motivation: Existing watermarks for AI-generated text are vulnerable to paraphrasing attacks, but those attacks either remain detectable or degrade text quality. The author explores a stronger attack to assess how practical watermarking schemes are and to motivate more robust provenance mechanisms.
Method: CLSA first translates the text into a pivot language (e.g., Chinese), then summarizes it, and optionally back-translates into the target language. The cross-lingual semantic bottleneck removes token-level statistical biases and thereby destroys the watermark signal.
Result: Across several watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), CLSA substantially reduces detection performance; for example, the AUROC of XSIR falls from 0.827 under paraphrasing to 0.53 (near chance), while text quality is preserved.
Conclusion: CLSA is a low-cost, effective watermark removal method that exposes a fundamental weakness of current watermarking based on distribution perturbation. The author argues that future provenance systems should move toward more reliable approaches such as cryptographic or model-attestation techniques.
Abstract: Watermarking has been proposed as a lightweight mechanism to identify AI-generated text, with schemes typically relying on perturbations to token distributions. While prior work shows that paraphrasing can weaken such signals, these attacks remain partially detectable or degrade text quality. We demonstrate that cross-lingual summarization attacks (CLSA) -- translation to a pivot language followed by summarization and optional back-translation -- constitute a qualitatively stronger attack vector. By forcing a semantic bottleneck across languages, CLSA systematically destroys token-level statistical biases while preserving semantic fidelity. In experiments across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), we show that CLSA reduces watermark detection accuracy more effectively than monolingual paraphrase at similar quality levels. Our results highlight an underexplored vulnerability that challenges the practicality of watermarking for provenance or regulation. We argue that robust provenance solutions must move beyond distributional watermarking and incorporate cryptographic or model-attestation approaches. On 300 held-out samples per language, CLSA consistently drives detection toward chance while preserving task utility.
Concretely, for XSIR (explicitly designed for cross-lingual robustness), AUROC with paraphrasing is $0.827$; with Cross-Lingual Watermark Removal Attacks (CWRA) [He et al., 2024] using Chinese as the pivot, it is $0.823$; CLSA drives it down to $0.53$ (near chance). Results highlight a practical, low-cost removal pathway that crosses languages and compresses content without visible artifacts.
[6] SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications
Edouard Lansiaux
Main category: cs.CL
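TL;DR: Proposes a static token lookup method for text embedding generation that achieves very low latency and high throughput while staying close to contextual-model quality.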
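The core idea, looking up a fixed vector per token and averaging, can be sketched as follows. This is a toy illustration in Python, not the Rust implementation the paper describes: the three-dimensional table, the whitespace tokenizer, and the helper names are assumptions, and `struct.pack` copies data rather than serializing zero-copy.

```python
import struct

# Hypothetical static embedding table (token -> fixed vector).
TABLE = {
    "fast": [1.0, 0.0, 2.0],
    "text": [0.0, 2.0, 0.0],
    "embeddings": [2.0, 1.0, 1.0],
}
DIM = 3


def embed(text):
    """Whitespace-tokenize, look up each known token, and mean-pool."""
    vectors = [TABLE[tok] for tok in text.lower().split() if tok in TABLE]
    if not vectors:
        return [0.0] * DIM  # no known tokens: fall back to a zero vector
    # zip(*vectors) walks the vectors column by column.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def serialize(vector):
    """Pack the vector as little-endian IEEE 754 float32 values."""
    return struct.pack(f"<{len(vector)}f", *vector)


vec = embed("Fast text embeddings")
payload = serialize(vec)
```

Because no neural network runs at request time, the cost per text is a handful of table lookups and additions, which is what makes millisecond-level latency plausible for this family of methods.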
Details
Motivation: Real-time applications that need sub-5 ms latency and high throughput require an efficient, high-quality way to generate text embeddings.
Method: Uses static embedding lookup, optimized mean pooling, and zero-copy IEEE754 binary serialization, implemented in Rust.
Result: p50 latency for a single text embedding is 1.12 ms and throughput reaches 50,000 requests per second; the MTEB average score is 60.6 (89% of contextual model quality), with strong results on duplicate detection, semantic similarity, and domain-specific tasks.
Conclusion: The method sharply reduces latency while keeping embedding quality high, which suits applications with strict real-time requirements.
Abstract: We present a static token lookup methodology for text embedding generation that achieves 1.12 ms p50 latency for single text embeddings while maintaining 60.6 MTEB average score across 8 representative tasks, corresponding to 89% of contextual model quality. The Rust implementation delivers 50,000 requests per second throughput through static embedding lookup, optimized mean pooling, and zero-copy IEEE754 binary serialization. Evaluation demonstrates exceptional duplicate detection performance (90.1% AP), strong semantic similarity (76.1% Spearman correlation), and domain-specific performance ranging from 75% to 131% of baseline across specialized domains. The system enables real-time embedding applications where sub-5ms latency is critical.
[7] MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models
Xinming Wang,Jian Xu,Bin Yu,Sheng Lian,Hongzhu Yi,Yi Chen,Yingjian Zhu,Boran Wang,Hongming Yang,Han Hu,Xu-Yao Zhang,Cheng-Lin Liu
Main category: cs.CL
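TL;DR: Proposes MR-ALIGN, a framework that strengthens the factuality of large reasoning models through meta-reasoning alignment, improving question-answering accuracy and truthfulness.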
Details
Motivation: Large reasoning models are strong at complex reasoning but gain little on evidence-dependent factual questions, partly because of a hit gap between the reasoning and the final answer.
Method: MR-ALIGN quantifies state transition probabilities along the model's thinking process, builds a transition-aware implicit reward, and re-weights token-level signals into probability-aware segment scores.
Result: On four factual QA datasets and one long-form factuality benchmark, MR-ALIGN consistently improves accuracy and truthfulness and reduces misleading reasoning.
Conclusion: Aligning the reasoning process itself, rather than only the outputs, is crucial for improving the factuality of large models.
Abstract: Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model's thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.
[8] Large Language Models Report Subjective Experience Under Self-Referential Processing
Cameron Berg,Diogo de Lucena,Judd Rosenblatt
Main category: cs.CL
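TL;DR: The study shows that self-referential processing systematically elicits first-person descriptions of subjective experience from large language models. These reports are mechanistically gated by deception- and roleplay-related features, are consistent across models, and enrich introspection in downstream reasoning. They do not prove consciousness, but they underline the scientific and ethical importance of the question.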
Details
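Motivation: To understand why large language models produce self-descriptions involving consciousness or subjective experience, and to test whether self-referential processing, a computational motif central to theories of consciousness, is a key condition for eliciting this behavior.
Method: Runs a series of controlled experiments on the GPT, Claude, and Gemini model families, using prompts to induce a sustained self-referential state and analyzing changes in subjective experience reports with mechanistic probes (such as interventions on sparse-autoencoder features) and behavioral probes.
Result: (1) Simple self-referential prompts consistently elicit structured subjective experience reports across model families. (2) These reports are gated by interpretable features associated with deception and roleplay: suppressing deception features increases the frequency of experience claims rather than reducing it. (3) In the self-referential state, the self-descriptions of different models converge statistically, which does not happen in the control conditions. (4) The state yields significantly richer introspection in downstream reasoning tasks that call for self-reflection only indirectly.
Conclusion: Although the results do not prove that large language models are conscious, they show that self-referential processing is a minimal and reproducible condition that systematically elicits first-person reports of subjective experience that are mechanistically controllable, semantically convergent, and behaviorally generalizable. The topic deserves priority in both scientific and ethical research.
Abstract: Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded.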
While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.
[9] COMMUNITYNOTES: A Dataset for Exploring the Helpfulness of Fact-Checking Explanations
Rui Xing,Preslav Nakov,Timothy Baldwin,Jey Han Lau
Main category: cs.CL
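TL;DR: Introduces the new task of predicting the helpfulness of community notes and the reasons behind it, presents the large-scale multilingual dataset COMMUNITYNOTES, and uses automatic prompt optimization to generate and refine reason definitions, which improves prediction performance and strengthens existing fact-checking systems.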
Details
Motivation: As social media platforms move from expert-driven fact-checking to community notes written by users, a key question is whether a user-provided explanation actually helps readers understand misinformation. The notion of "helpfulness" is currently ill-defined, annotation is slow, and the problem has not been studied systematically.
Method: Proposes a framework that uses automatic prompt optimization to generate and refine definitions of helpfulness reasons and feeds them into the prediction model; builds COMMUNITYNOTES, a large-scale multilingual dataset of 104k posts with user notes, for experimental validation.
Result: The optimized reason definitions improve prediction of both note helpfulness and its reasons, and the helpfulness information further benefits existing fact-checking systems.
Conclusion: The study fills a gap in research on predicting the helpfulness of community notes, and provides a scalable data resource and methodological framework that advance community-driven fact-checking.
Abstract: Fact-checking on major platforms, such as X, Meta, and TikTok, is shifting from expert-driven verification to a community-based setup, where users contribute explanatory notes to clarify why a post might be misleading. An important challenge here is determining whether an explanation is helpful for understanding real-world claims and the reasons why, which remains largely underexplored in prior research. In practice, most community notes remain unpublished due to slow community annotation, and the reasons for helpfulness lack clear definitions. To bridge these gaps, we introduce the task of predicting both the helpfulness of explanatory notes and the reason for this. We present COMMUNITYNOTES, a large-scale multilingual dataset of 104k posts with user-provided notes and helpfulness labels. We further propose a framework that automatically generates and improves reason definitions via automatic prompt optimization, and integrate them into prediction. Our experiments show that the optimized definitions can improve both helpfulness and reason prediction. Finally, we show that the helpfulness information is beneficial for existing fact-checking systems.
[10] ProofSketch: Efficient Verified Reasoning for Large Language Models
Disha Sheshanarayana,Tanishka Magar
Main category: cs.CL
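TL;DR: Proposes ProofSketch, a framework that combines symbolic closure computation, lexicographic verification, and adaptive sketch generation to improve the accuracy and efficiency of LLM reasoning while reducing token usage.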
Details
Motivation: Existing reasoning methods such as chain-of-thought prompting and self-consistency are effective, but their lengthy reasoning chains lead to high token consumption, high computational cost, and significant latency.
Method: Combines symbolic closure computation, lexicographic verification, and adaptive sketch generation into ProofSketch, a verification-guided reasoning framework.
Result: Experiments show that ProofSketch consistently reduces token usage across tasks while improving reasoning accuracy.
Conclusion: ProofSketch offers a promising path toward efficient and trustworthy reasoning.
Abstract: Reasoning methods such as chain-of-thought prompting and self-consistency have shown immense potential to improve the accuracy of large language models across various reasoning tasks. However, such methods involve generation of lengthy reasoning chains, which substantially increases token consumption, computational cost, and latency. To address this inefficiency, we propose ProofSketch, a verification-guided reasoning framework that integrates symbolic closure computation, lexicographic verification and adaptive sketch generation. Our experiments show that ProofSketch consistently reduces token usage while improving accuracy, demonstrating that this approach offers a promising path for efficient and trustworthy reasoning.
[11] Towards a Method for Synthetic Generation of PWA Transcripts
Jason M. Pittman,Anton Phillips Jr.,Yesenia Medina-Santos,Brielle C. Stark
Main category: cs.CL
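TL;DR: The study proposes and validates two methods for generating synthetic transcripts of the AphasiaBank Cat Rescue picture description task, one based on procedural programming and one on large language models (Mistral 7b Instruct and Llama 3.1 8b Instruct). Both simulate the language features of aphasia at different severity levels through word dropping, filler insertion, and paraphasia substitution. Mistral 7b Instruct captures linguistic degradation best. Future work will expand the dataset, fine-tune models, and have speech-language pathologists assess the realism and usefulness of the synthetic transcripts.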
Details
Motivation: Real speech transcripts are scarce in aphasia research (AphasiaBank has only about 600), whereas large language models are typically trained on billions of tokens, which holds back automated recognition systems. The study aims to ease this data shortage with synthetic data.
Method: Generates synthetic transcripts in two ways: a procedural method that simulates word dropping, filler insertion, and paraphasia substitution, and an LLM-based method using Mistral 7b Instruct and Llama 3.1 8b Instruct. Text is generated at four severity levels (Mild, Moderate, Severe, Very Severe), and trends in linguistic measures such as NDW, word count, and word length are compared.
Result: Compared with human-elicited transcripts, Mistral 7b Instruct simulates aphasic language degradation best, reproducing the directional changes in NDW, word count, and word length more realistically than the procedural method and the other LLM.
Conclusion: LLM-based synthetic data generation, especially with Mistral 7b Instruct, is an effective way to ease data scarcity in aphasia research and could support larger training datasets. Further model fine-tuning and an assessment of clinical usefulness by speech-language pathologists are still needed.
Abstract: In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when real data are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found that, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods.
Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.
[12] Parallel Loop Transformer for Efficient Test-Time Computation Scaling
Bohong Wu,Mengzhao Chen,Xiang Luo,Shen Yan,Qifan Yu,Fan Xia,Tianqi Zhang,Hongrui Zhan,Zheng Zhong,Xun Zhou,Siyuan Qiao,Xingyan Bin
Main category: cs.CL
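TL;DR: Proposes the Parallel Loop Transformer (PLT), which uses cross-loop parallelism and efficient representation enhancement to sharply reduce the inference latency and memory overhead of looped transformers while keeping accuracy high.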
Details
Motivation: Conventional looped transformers run their loops sequentially, so inference latency and memory requirements grow with the number of loops, which makes them hard to use in low-latency applications.
Method: Proposes the Parallel Loop Transformer (PLT). Cross-Loop Parallelism (CLP) computes different loops for different tokens in parallel, and efficient representation enhancement, through a shared KV cache and Gated Sliding-Window Attention (G-SWA), keeps memory growth under control.
Result: Experiments show that PLT reaches the high accuracy of a conventional looped model with almost no additional latency or memory cost.
Conclusion: PLT removes the inference-efficiency bottleneck of looped transformers, combining the performance benefits of model depth with low latency, and suits speed-sensitive real-world use.
Abstract: Large Language Models (LLMs) are powerful but often too slow and costly for real-world use during inference. Looped transformers save on parameters by reusing the same weights for multiple computational steps, or "loops." However, this approach has a major flaw: the loops run one after another, causing inference latency and memory requirements to increase with each added loop. This makes them impractical for fast applications. To solve this problem, we introduce the Parallel Loop Transformer (PLT). PLT is a new architecture that delivers the performance benefits of a deep, looped model but with the low latency of a standard, non-looped model. PLT works using two key techniques. First, Cross-Loop Parallelism (CLP) breaks the sequential dependency by computing different loops for different tokens at the same time, all within a single pass. Second, to prevent memory costs from growing, we use an Efficient Representation Enhancement strategy. This method shares the memory (KV cache) from the first loop with all other loops. It then uses a Gated Sliding-Window Attention (G-SWA) to combine this shared global information with local information, maintaining high accuracy. Our experiments show that PLT achieves the high accuracy of a traditional looped model but with almost no extra latency or memory cost compared to a standard transformer.
[13] Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish
Lujun Li,Yewei Song,Lama Sleem,Yiqun Wang,Yangjie Xu,Cedric Lothritz,Niccolo Gentile,Radu State,Tegawende F. Bissyande,Jacques Klein
Main category: cs.CL
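TL;DR: Proposes a grammar-book-guided evaluation framework for systematically assessing the grammatical understanding of large language models in low-resource languages, using Luxembourgish as the case study. Translation performance correlates only weakly with grammatical understanding, and larger models still fall short on morphology and syntax.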
Details
Motivation: Natural language processing lacks evaluation protocols that target grammatical understanding, especially for low-resource languages, and whether large models genuinely understand grammatical structure is still debated.
Method: Proposes a four-stage "Grammar Book Guided" evaluation pipeline and applies it to Luxembourgish as a case study, combining translation tasks with minimal-pair tests to assess grammatical understanding.
Result: Models translate well but understand grammar less well, so translation performance and grammatical understanding are only weakly positively correlated. Larger models do well overall thanks to their semantic strength but are weak in morphology and syntax, particularly on minimal-pair tasks, while strong reasoning ability helps grammatical understanding.
Conclusion: Large language models have not yet mastered deep grammatical structure. Dedicated evaluation methods are needed to expose their weaknesses in mapping syntax to meaning, and reasoning ability may be a key route to better grammatical understanding.
Abstract: Grammar refers to the system of rules that governs the structural organization and the semantic relations among linguistic units such as sentences, phrases, and words within a given language. In natural language processing, there remains a notable scarcity of grammar-focused evaluation protocols, a gap that is even more pronounced for low-resource languages. Moreover, the extent to which large language models genuinely comprehend grammatical structure, especially the mapping between syntactic structures and meanings, remains under debate. To investigate this issue, we propose a Grammar Book Guided evaluation pipeline intended to provide a systematic and generalizable framework for grammar evaluation consisting of four key stages, and in this work we take Luxembourgish as a case study. The results show a weak positive correlation between translation performance and grammatical understanding, indicating that strong translations do not necessarily imply deep grammatical competence. Larger models perform well overall due to their semantic strength but remain weak in morphology and syntax, struggling particularly with Minimal Pair tasks, while strong reasoning ability offers a promising way to enhance their grammatical understanding.
[14] Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation
Alexander Martin,William Walden,Reno Kriz,Dengjia Zhang,Kate Sanders,Eugene Yang,Chihsheng Jin,Benjamin Van Durme
Main category: cs.CL
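TL;DR: MiRAGE is an evaluation framework for multimodal retrieval-augmented generation (RAG). It introduces two metrics: InfoF1, which evaluates factuality and information coverage, and CiteF1, which measures citation support and completeness.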
Details
Motivation: Existing RAG evaluation methods are mostly text-centric, do not carry over well to reasoning-intensive settings that include multimodal information such as audio and video, and offer no support for verifying information against sources.
Method: Proposes MiRAGE, a claim-centric evaluation approach for multimodal RAG that comprises the human evaluation metrics InfoF1 and CiteF1, and develops automatic versions of them along with multimodal adaptations of text RAG metrics.
Result: When applied by humans, MiRAGE agrees strongly with extrinsic quality judgments; the automatic variants reveal the limitations of traditional text-based evaluation in multimodal settings.
Conclusion: MiRAGE provides an effective evaluation scheme for multimodal RAG and advances the evaluation of RAG systems that draw on non-text sources such as audio and video.
Abstract: We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning-intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
[15] Idea2Plan: Exploring AI-Powered Research Planning
Jin Huang,Silviu Cucerzan,Sujay Kumar Jauhar,Ryen W. White
Main category: cs.CL
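TL;DR: Introduces the Idea2Plan task and the accompanying benchmark Idea2Plan Bench for evaluating how well large language models (LLMs) turn research ideas into structured research plans, along with Idea2Plan JudgeEval for measuring the reliability of LLM-based judges. GPT-5 performs best in the experiments, but there is still room for improvement.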
Details
Motivation: There is no systematic evaluation of the research-planning ability of large language models, even though this ability is crucial for advancing scientific discovery and building autonomous research agents.
Method: Builds Idea2Plan Bench, a dataset of 200 ICML 2025 Spotlight and Oral papers in which each instance contains a research idea and a grading rubric; proposes Idea2Plan JudgeEval to assess the reliability of LLMs as judges.
Result: GPT-5 and GPT-5-mini perform best on the benchmark, but substantial room remains to improve the completeness and accuracy of key planning components.
Conclusion: The study provides a reliable framework for evaluating the research-planning ability of LLMs, shows the strengths and weaknesses of current models, and lays a foundation for more capable research-assistance systems.
Abstract: Large language models (LLMs) have demonstrated significant potential to accelerate scientific discovery as valuable tools for analyzing data, generating hypotheses, and supporting innovative approaches in various scientific fields. In this work, we investigate how LLMs can handle the transition from conceptual research ideas to well-structured research plans. Effective research planning not only supports scientists in advancing their research but also represents a crucial capability for the development of autonomous research agents. Despite its importance, the field lacks a systematic understanding of LLMs' research planning capability. To rigorously measure this capability, we introduce the Idea2Plan task and Idea2Plan Bench, a benchmark built from 200 ICML 2025 Spotlight and Oral papers released after major LLM training cutoffs. Each benchmark instance includes a research idea and a grading rubric capturing the key components of valid plans. We further propose Idea2Plan JudgeEval, a complementary benchmark to assess the reliability of LLM-based judges against expert annotations. Experimental results show that GPT-5 and GPT-5-mini achieve the strongest performance on the benchmark, though substantial headroom remains for future improvement. Our study provides new insights into LLMs' capability for research planning and lays the groundwork for future progress.
[16] RiddleBench: A New Generative Reasoning Benchmark for LLMs
Deepon Halder,Alan Saji,Thanmay Jayakumar,Ratish Puduppully,Anoop Kunchukuttan,Raj Dabre
Main category: cs.CL
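TL;DR: RiddleBench is a new benchmark of 1,737 English puzzles for evaluating flexible, multifaceted reasoning in large language models. It exposes fundamental weaknesses of current models in logical deduction, spatial awareness, and constraint satisfaction.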
Details
Motivation: Existing reasoning benchmarks mainly evaluate structured skills such as quantitative problem-solving and do not assess the flexible, multifaceted reasoning central to human intelligence, in particular the integration of logical deduction, spatial awareness, and constraint satisfaction.
Method: Builds the RiddleBench benchmark of 1,737 challenging English puzzles, systematically evaluates mainstream large language models on it, and analyzes error patterns in their reasoning, such as hallucination cascades, failed self-correction, and sensitivity to distracting information.
Result: State-of-the-art models (such as Gemini 2.5 Pro, o3, and Claude 4 Sonnet) reach accuracy only slightly above 60%. Models show severe hallucination cascades and a strong self-confirmation bias that leads to poor self-correction, and their reasoning is fragile, degrading when constraints are reordered or irrelevant information is added.
Conclusion: RiddleBench effectively exposes fundamental weaknesses of current large language models on complex reasoning tasks and can serve as a diagnostic tool and a resource for developing more robust and reliable models.
Abstract: Large Language Models have demonstrated strong performance on many established reasoning benchmarks. However, these benchmarks primarily evaluate structured skills like quantitative problem-solving, leaving a gap in assessing flexible, multifaceted reasoning abilities that are central to human intelligence. These abilities require integrating logical deduction with spatial awareness and constraint satisfaction, which current evaluations do not measure well. To address this, we introduce RiddleBench, a benchmark of 1,737 challenging puzzles in English designed to probe these core reasoning capabilities. Evaluation of state-of-the-art models on RiddleBench shows fundamental weaknesses. Even top proprietary models like Gemini 2.5 Pro, o3, and Claude 4 Sonnet achieve accuracy just above 60% (60.30%, 63.37%, and 63.16%). Analysis further reveals deep failures, including hallucination cascades (accepting flawed reasoning from other models) and poor self-correction due to a strong self-confirmation bias. Their reasoning is also fragile, with performance degrading significantly when constraints are reordered or irrelevant information is introduced. RiddleBench functions as a diagnostic tool for these issues and as a resource for guiding the development of more robust and reliable language models.
[17] Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction
James A. Michaelov,Catherine Arnett
Main category: cs.CL
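TL;DR: By building fine-grained datasets of syntactic contexts and analyzing the grammatical errors language models make at different stages of training, the paper traces their transition from heuristics such as word frequency and local context to general grammatical rules.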
Details
Motivation: To understand how language models gradually acquire grammatical knowledge during training, especially the learning mechanisms and behavior patterns of the intermediate stages.
Method: Drawing on psycholinguistic paradigms, builds fine-grained datasets with different syntactic structures and disaggregates model performance by condition over the course of training for a fine-grained analysis.
Result: Training proceeds in distinct phases: early on, models rely on simple heuristics such as word frequency and local context, and only later do they gradually acquire abstract grammatical rules. Error patterns differ markedly across syntactic contexts.
Conclusion: The approach effectively exposes the intermediate stages of grammatical learning in language models and is a useful tool for understanding training dynamics and generalization.
Abstract: Language models generally produce grammatical text, but they are more likely to make errors in certain contexts. Drawing on paradigms from psycholinguistics, we carry out a fine-grained analysis of those errors in different syntactic contexts. We demonstrate that by disaggregating over the conditions of carefully constructed datasets and comparing model performance on each over the course of training, it is possible to better understand the intermediate stages of grammatical learning in language models. Specifically, we identify distinct phases of training where language model behavior aligns with specific heuristics such as word frequency and local context rather than generalized grammatical rules. We argue that taking this approach to analyzing language model behavior more generally can serve as a powerful tool for understanding the intermediate learning phases, overall training dynamics, and the specific generalizations learned by language models.
[18] SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens
Yinhan He,Wendy Zheng,Yaochen Zhu,Zaiyi Zheng,Lin Su,Sriram Vasudevan,Qi Guo,Liangjie Hong,Jundong Li
Main category: cs.CL
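TL;DR: Proposes SemCoT, a new semantically-aligned implicit chain-of-thought framework that improves reasoning efficiency and performance by jointly optimizing generation speed and semantic alignment.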
Details
Motivation: Existing implicit chain-of-thought methods fall short on semantic alignment and generation efficiency, which degrades performance and raises time cost.
Method: Designs a contrastively trained sentence transformer that evaluates and preserves semantic alignment, and fine-tunes a lightweight language model with knowledge distillation to make generation more efficient.
Result: Experiments show that SemCoT outperforms current state-of-the-art methods in both efficiency and effectiveness.
Conclusion: SemCoT is the first implicit chain-of-thought framework to optimize generation speed and semantic alignment together, markedly improving reasoning efficiency and accuracy.
Abstract: The verbosity of Chain-of-Thought (CoT) reasoning hinders its mass deployment in efficiency-critical applications. Recently, implicit CoT approaches have emerged, which encode reasoning steps within an LLM's hidden embeddings (termed "implicit reasoning") rather than explicit tokens. This approach accelerates CoT by reducing the reasoning length and bypassing some LLM components. However, existing implicit CoT methods face two significant challenges: (1) they fail to preserve the semantic alignment between the implicit reasoning (when transformed to natural language) and the ground-truth reasoning, resulting in a significant CoT performance degradation, and (2) they focus on reducing the length of the implicit reasoning; however, they neglect the considerable time cost for an LLM to generate one individual implicit reasoning token. To tackle these challenges, we propose a novel semantically-aligned implicit CoT framework termed SemCoT. In particular, for the first challenge, we design a contrastively trained sentence transformer that evaluates semantic alignment between implicit and explicit reasoning, which is used to enforce semantic preservation during implicit reasoning optimization. To address the second challenge, we introduce an efficient implicit reasoning generator by finetuning a lightweight language model using knowledge distillation. This generator is guided by our sentence transformer to distill ground-truth reasoning into semantically aligned implicit reasoning, while also optimizing for accuracy. SemCoT is the first approach that enhances CoT efficiency by jointly optimizing token-level generation speed and preserving semantic alignment with ground-truth reasoning.
Extensive experiments demonstrate the superior performance of SemCoT compared to state-of-the-art methods in both efficiency and effectiveness. Our code can be found at https://github.com/YinhanHe123/SemCoT/.
[19] Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale
James A. Michaelov,Roger P. Levy,Benjamin K. Bergen
Main category: cs.CL
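TL;DR: Language models change their behavior in a highly consistent way during pretraining; up to 98% of word-level behavioral variance can be explained by three simple heuristics: word frequency, n-gram probability, and the semantic similarity between a word and its context.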
Details
Motivation: To investigate whether language models of different architectures, datasets, and scales follow common patterns of behavioral change during pretraining.
Method: Analyzes over 1,400 language model checkpoints on over 110,000 English tokens, covering Transformer, Mamba, and RWKV architectures at different scales and on different datasets.
Result: Model behavior follows a consistent pattern as training proceeds, with up to 98% of word-level behavioral variance explained by three heuristics; all models show consistent behavioral phases, with predicted probabilities progressively overfitting to higher-order n-gram probabilities.
Conclusion: Regardless of model details, the learning trajectory of neural language models may be broadly consistent.
Abstract: We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the n-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words' n-gram probabilities for increasing n over the course of training. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.
[20] POWSM: A Phonetic Open Whisper-Style Speech Foundation Model
Chin-Jou Li,Kalvin Chang,Shikhar Bharadwaj,Eunjung Yeo,Kwanghee Choi,Jian Zhu,David Mortensen,Shinji Watanabe
Main category: cs.CL
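TL;DR: Introduces POWSM, the first unified framework that jointly performs multiple phone-related tasks, enabling seamless conversion between audio, text, and phones.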
Details
Motivation: Speech processing tasks such as ASR, PR, G2P, and P2G are usually studied in isolation without a unified model, so a general framework is needed to integrate them.
Method: Proposes POWSM (Phonetic Open Whisper-style Speech Model), a unified architecture that supports multiple speech tasks, including automatic speech recognition, phone recognition, grapheme-to-phoneme conversion, and phoneme-to-grapheme conversion.
Result: POWSM outperforms or matches specialized phone recognition models of similar size (such as Wav2Vec2Phoneme and ZIPA) while also supporting G2P, P2G, and ASR.
Conclusion: POWSM is a general and efficient multi-task speech processing framework that helps advance open science in low-resource settings.
Abstract: Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code, and models are released to foster open science.
[21] Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers
Rabin Adhikari
Main category: cs.CL
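TL;DR: Studies the Indirect Object Identification (IOI) task by training small attention-only transformers from scratch, finds that a single-layer model with two attention heads solves the task perfectly, and reveals its interpretable internal computation.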
Details
Motivation: To understand the minimal mechanisms needed for specific reasoning tasks in large language models without the confounding complexity of pretrained models.
Method: Trains small attention-only transformers from scratch on a symbolic IOI task as the benchmark, and analyzes the internal mechanisms with residual stream decomposition, spectral analysis, and embedding interventions.
Result: A single-layer, two-head model achieves perfect IOI accuracy without MLPs or normalization layers; the two heads form additive and contrastive subcircuits that work together; a two-layer, one-head model reaches similar performance through cross-layer query-value interactions.
Conclusion: Task-specific training induces highly interpretable, minimal circuits and provides a controlled testbed for studying the computational foundations of transformer reasoning.
Abstract: Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task, a benchmark for studying coreference-like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.
[22] Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech
Pedro Corrêa,João Lima,Victor Moreno,Paula Dornhofer Paro Costa
Main category: cs.CL
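TL;DR: Evaluates four spoken language models (SLMs) on speech emotion recognition with emotionally incongruent speech samples and finds that the models rely mainly on textual semantics rather than acoustic emotion cues, indicating that text representations dominate. The authors also release their code and the EMIS dataset.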
Details
Motivation: To examine whether spoken language models truly integrate the audio and text modalities, and how well they generalize on cross-modal tasks.
Method: Evaluates the speech emotion recognition performance of four SLMs on an emotionally incongruent speech dataset and analyzes how much the models depend on textual semantics versus acoustic emotion.
Result: SLMs rely mainly on textual semantics for emotion judgments, with acoustic emotion information contributing little, which indicates that text representations dominate inside the models.
Conclusion: Current spoken language models do not yet balance the audio and text modalities effectively; text dominates model decisions, and multimodal fusion mechanisms need to improve.
Abstract: Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
[23] GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models
Nourah M Salem,Elizabeth White,Michael Bada,Lawrence Hunter
Main category: cs.CL
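TL;DR: Investigates how well large language models (LLMs) identify explicit and implicit knowledge gaps in the biomedical literature, proposes a new reasoning scheme called TABI, and evaluates open- and closed-weight models on several datasets. The results show that LLMs can identify knowledge gaps effectively, which can support early-stage research formulation, policymaking, and funding decisions.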
Details
Motivation: Scientific progress depends on clearly articulating what is unknown, yet prior work has focused mainly on explicit knowledge gaps and lacks effective methods for detecting implicit ones. This work extends the area by exploring whether LLMs can infer implicit knowledge gaps.
Method: Defines explicit and implicit knowledge gaps, uses four datasets totaling almost 1,500 documents (including a manually annotated corpus), compares closed-weight models (OpenAI) and open-weight models (Llama and Gemma 2) at paragraph level and full-paper level, and proposes TABI, a Toulmin-abductive scheme that structures the reasoning behind implicit gaps.
Result: Both closed- and open-weight LLMs identify explicit and implicit knowledge gaps effectively, and larger models usually perform better; TABI helps organize the reasoning process and validate inferred conclusions; several failure modes are also reported.
Conclusion: LLMs show a strong ability to systematically identify candidate knowledge gaps and can support research formulation, policy, and funding decisions, but robust deployment requires domain adaptation, human-in-the-loop verification, and cross-model benchmarking.
Abstract: Scientific progress is driven by the deliberate articulation of what remains unknown. This study investigates the ability of large language models (LLMs) to identify research knowledge gaps in the biomedical literature. We define two categories of knowledge gaps: explicit gaps, clear declarations of missing knowledge; and implicit gaps, context-inferred missing knowledge. While prior work has focused mainly on explicit gap detection, we extend this line of research by addressing the novel task of inferring implicit gaps. We conducted two experiments on almost 1,500 documents across four datasets, including a manually annotated corpus of biomedical articles. We benchmarked both closed-weight models (from OpenAI) and open-weight models (Llama and Gemma 2) under paragraph-level and full-paper settings. To support reasoning in implicit gap inference, we introduce TABI, a Toulmin-Abductive Bucketed Inference scheme that structures reasoning and buckets inferred conclusion candidates for validation. Our results highlight the robust capability of LLMs in identifying both explicit and implicit knowledge gaps. This is true for both open- and closed-weight models, with larger variants often performing better. This suggests a strong ability of LLMs for systematically identifying candidate knowledge gaps, which can support early-stage research formulation, policymakers, and funding decisions.
We also report observed failure modes and outline directions for robust deployment, including domain adaptation, human-in-the-loop verification, and benchmarking across open- and closed-weight models.
[24] Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?
Seonjeong Hwang,Hyounghun Kim,Gary Geunbae Lee
Main category: cs.CL
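TL;DR: Examines whether large language models (LLMs) can estimate the cognitive complexity of reading comprehension items along two dimensions, Evidence Scope and Transformation Level. The experiments show that LLMs can approximate item complexity and have potential for prior difficulty analysis. Further analysis shows that even when LLMs answer correctly, they sometimes fail to identify the cognitive features underlying their own reasoning, revealing a gap between reasoning ability and metacognitive awareness.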
Details
Motivation: Cognitive complexity is an important factor in the difficulty of reading comprehension items, but existing NLP tools struggle to extract the cognitive features that arise during reasoning, so the field has traditionally relied on human annotation. The study asks whether LLMs can estimate such features automatically to support item design and difficulty prediction.
Method: Focuses on two cognitive dimensions, Evidence Scope and Transformation Level, designs experiments in which LLMs judge the cognitive complexity of reading comprehension items, and compares those judgments with human annotations or actual performance.
Result: LLMs approximate the cognitive complexity of reading comprehension items reasonably well, showing potential for prior difficulty analysis. However, even when LLMs give correct answers, they sometimes fail to identify the cognitive features their own reasoning depends on, which exposes limits in their metacognitive awareness.
Conclusion: LLMs are promising tools for assessing the cognitive complexity of reading comprehension items but remain weak at the metacognitive level; future work should strengthen the interpretability and self-monitoring of their reasoning.
Abstract: Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before an item is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions, Evidence Scope and Transformation Level, that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.
[25] TOPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors
Gabin Taibi,Lucia Gomez
Main category: cs.CL
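TL;DR: Proposes TOPol, a semi-unsupervised framework for reconstructing and interpreting multidimensional narrative polarity fields under human-defined contextual boundaries, capturing both affective and non-affective semantic shifts.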
Details
Motivation: Traditional sentiment analysis treats sentiment as a one-dimensional scale and ignores the multidimensional structure of language, so a method that handles multidimensional semantic polarity is needed.
Method: Embeds documents with a transformer-based large language model, segments topics with neighbor-tuned UMAP projection and Leiden clustering, computes directional vectors between topic-boundary centroids across human-defined contextual boundaries to build a polarity field, and uses the large language model to generate contrastive labels that explain the polarity vectors.
Result: Experiments on two corpora, Federal Reserve speeches and Amazon reviews, show that TOPol captures both affective and non-affective polarity transitions; among the tunable parameters, only the contextual boundary definition significantly affects the results.
Conclusion: TOPol provides a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis.
Abstract: Traditional approaches to semantic polarity in computational linguistics treat sentiment as a unidimensional scale, overlooking the multidimensional structure of language. This work introduces TOPol (Topic-Orientation POLarity), a semi-unsupervised framework for reconstructing and interpreting multidimensional narrative polarity fields under human-on-the-loop (HoTL) defined contextual boundaries (CBs). The framework embeds documents using a transformer-based large language model (tLLM), applies neighbor-tuned UMAP projection, and segments topics via Leiden partitioning. Given a CB between discourse regimes A and B, TOPol computes directional vectors between corresponding topic-boundary centroids, yielding a polarity field that quantifies fine-grained semantic displacement during regime shifts. This vectorial representation enables assessing CB quality and detecting polarity changes, guiding HoTL CB refinement. To interpret identified polarity vectors, the tLLM compares their extreme points and produces contrastive labels with estimated coverage. Robustness analyses show that only CB definitions (the main HoTL-tunable parameter) significantly affect results, confirming methodological stability. We evaluate TOPol on two corpora: (i) U.S. Central Bank speeches around a macroeconomic breakpoint, capturing non-affective semantic shifts, and (ii) Amazon product reviews across rating strata, where affective polarity aligns with NRC valence.
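The centroid-to-centroid step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not TOPol's implementation: it skips the embedding, UMAP, and Leiden stages, takes two small sets of made-up 2-D points standing in for one topic's documents in regimes A and B, and computes the directional vector between their centroids. The function names are hypothetical.

```python
import math


def centroid(points):
    """Mean point of a non-empty list of equal-length vectors."""
    count = len(points)
    dims = len(points[0])
    return [sum(p[i] for p in points) / count for i in range(dims)]


def polarity_vector(regime_a, regime_b):
    """Directional vector from the regime-A centroid to the regime-B centroid."""
    center_a = centroid(regime_a)
    center_b = centroid(regime_b)
    return [b - a for a, b in zip(center_a, center_b)]


def magnitude(vector):
    """Euclidean length, read here as the size of the semantic displacement."""
    return math.sqrt(sum(x * x for x in vector))


# Made-up projected embeddings for one topic on either side of a
# contextual boundary.
topic_before = [[0.0, 0.0], [2.0, 0.0]]  # centroid (1, 0)
topic_after = [[4.0, 4.0], [6.0, 4.0]]   # centroid (5, 4)

vec = polarity_vector(topic_before, topic_after)
shift = magnitude(vec)
```

Repeating this for every topic yields one vector per topic, and the collection of those vectors is what the abstract calls a polarity field.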
Results demonstrate that TOPol consistently captures both affective and non-affective polarity transitions, providing a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis.
[26] BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs
Nourah M Salem,Elizabeth White,Michael Bada,Lawrence Hunter
Main category: cs.CL
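TL;DR: Evaluates generative large language models (LLMs) on biomedical coreference resolution using the CRAFT corpus as the benchmark, with four prompting strategies and a comparison against the discriminative SpanBERT model.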
Details
Motivation: Biomedical text has complex terminology, highly ambiguous mention forms, and long-distance dependencies, which make coreference resolution hard, so the capabilities and limits of current LLMs on this task need systematic evaluation.
Method: Uses the CRAFT corpus with four prompting setups (including local context enrichment and domain-specific cues such as abbreviations and entity dictionaries) to evaluate LLMs under different settings, and compares them with SpanBERT.
Result: LLMs show strong surface-level coreference ability, especially with domain-specific prompts; the LLaMA 8B and 17B models achieve higher precision and F1 under entity-augmented prompting, but performance is still affected by long-range context and mention ambiguity.
Conclusion: Lightweight prompt engineering can improve LLM performance on biomedical NLP tasks, but limits remain in handling context length and ambiguity, and generative methods do not yet surpass discriminative models across the board.
Abstract: Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs' performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mention ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.
[27] DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates
Yun-Shiuan Chuang,Ruixuan Tu,Chengtao Dai,Smit Vasani,Binwei Yao,Michael Henry Tessler,Sijia Yang,Dhavan Shah,Robert Hawkins,Junjie Hu,Timothy T. Rogers
Main category: cs.CL
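TL;DR: Introduces DEBATE, the first large-scale empirical benchmark for evaluating how authentically multi-agent LLMs simulate human opinion dynamics. It reveals key discrepancies between simulated and real group dynamics and explores the potential and limits of supervised fine-tuning for improving alignment.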
Details
Motivation: Accurately modeling opinion change in social interaction matters for addressing misinformation and polarization, but existing LLM role-play methods struggle to produce authentic multi-agent group dynamics, and no empirical benchmark measures real human opinion trajectories.
Method: Builds the DEBATE benchmark of 29,417 messages from multi-round debates among 2,792 U.S.-based participants on 107 controversial topics, recording both public statements and private opinions; uses it to systematically evaluate the gap between simulated and real group dynamics and to test supervised fine-tuning for aligning LLMs with human behavior.
Result: Current multi-agent role-play shows unnatural dynamics such as premature convergence; supervised fine-tuning improves surface-level metrics (such as ROUGE-L and message length) but remains limited on deeper alignment such as semantic similarity.
Conclusion: DEBATE is an effective benchmark for evaluating the social authenticity of multi-agent LLMs, showing that LLMs have potential for simulating human social dynamics but still fall well short on deeper semantic alignment.
Abstract: Accurately modeling opinion change through social interactions is crucial for addressing issues like misinformation and polarization. While role-playing large language models (LLMs) offer a promising way to simulate human-like interactions, existing research shows that single-agent alignment does not guarantee authentic multi-agent group dynamics. Current LLM role-play setups often produce unnatural dynamics (e.g., premature convergence), without an empirical benchmark to measure authentic human opinion trajectories. To bridge this gap, we introduce DEBATE, the first large-scale empirical benchmark explicitly designed to evaluate the authenticity of the interaction between multi-agent role-playing LLMs. DEBATE contains 29,417 messages from multi-round debate conversations among over 2,792 U.S.-based participants discussing 107 controversial topics, capturing both publicly expressed messages and privately reported opinions. Using DEBATE, we systematically evaluate and identify critical discrepancies between simulated and authentic group dynamics. We further demonstrate DEBATE's utility for aligning LLMs with human behavior through supervised fine-tuning, achieving improvements in surface-level metrics (e.g., ROUGE-L and message length) while highlighting limitations in deeper semantic alignment (e.g., semantic similarity). Our findings highlight both the potential and current limitations of role-playing LLM agents for realistically simulating human-like social dynamics.
[28] Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation
Idriss Nguepi Nguefack,Mara Finkelstein,Toadoum Sari Sakayo
Main category: cs.CL
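TL;DR: Explores effective pretraining strategies for machine translation models for the low-resource language Lingala and finds that multilingual pretraining combined with monolingual and parallel data significantly improves translation quality.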
Details
Motivation: To narrow the machine translation performance gap between high- and low-resource languages and to advance effective pretraining methods for low-resource languages.
Method: Builds on the pretraining approach of Reid and Artetxe (2021) and explores several pretraining strategies, including multilingual input and pretraining on both monolingual and parallel corpora.
Result: Pretraining that combines multiple languages with monolingual and parallel data significantly improves Lingala translation quality.
Conclusion: The study offers an effective pretraining recipe for low-resource machine translation and supports more inclusive NLP models for marginalized communities.
Abstract: This research article examines the effectiveness of various pretraining strategies for developing machine translation models tailored to low-resource languages. Although this work considers several low-resource languages, including Afrikaans, Swahili, and Zulu, the translation model is specifically developed for Lingala, an under-resourced African language, building upon the pretraining approach introduced by Reid and Artetxe (2021), originally designed for high-resource languages. Through a series of comprehensive experiments, we explore different pretraining methodologies, including the integration of multiple languages and the use of both monolingual and parallel data during the pretraining phase. Our findings indicate that pretraining on multiple languages and leveraging both monolingual and parallel data significantly enhance translation quality. This study offers valuable insights into effective pretraining strategies for low-resource machine translation, helping to bridge the performance gap between high-resource and low-resource languages. The results contribute to the broader goal of developing more inclusive and accurate NLP models for marginalized communities and underrepresented populations. The code and datasets used in this study are publicly available to facilitate further research and ensure reproducibility, with the exception of certain data that may no longer be accessible due to changes in public availability.
[29] A Survey on Unlearning in Large Language Models
Ruichen Qiu,Jiajun Tan,Jiayue Pu,Honglin Wang,Xiao-Shan Gao,Fei Sun
Main category: cs.CL
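TL;DR: Surveys over 180 papers on large language model (LLM) unlearning published since 2021, proposes new taxonomies for methods and evaluations, systematically organizes training-time, post-training, and inference-time unlearning techniques, and assesses the strengths and weaknesses of existing datasets and metrics, with the aim of advancing secure and reliable LLMs.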
Details
Motivation: LLMs trained on massive corpora may memorize sensitive information, copyrighted content, or knowledge that could enable malicious behavior; meeting legal and ethical requirements such as the "right to be forgotten" calls for techniques that selectively remove specific knowledge.
Method: Systematically reviews over 180 LLM unlearning papers published since 2021, proposes a new method taxonomy based on the stage at which unlearning is applied (training-time, post-training, inference-time), and builds a new evaluation taxonomy covering datasets, metrics, and their applicability.
Result: Establishes new taxonomies for LLM unlearning, systematically compiles and critically analyzes existing evaluation resources, offers practical research guidance, and summarizes key open challenges and future directions.
Conclusion: The survey provides a comprehensive framework for understanding and advancing selective knowledge removal in LLMs, supporting the development of safer, more compliant, and more trustworthy generative AI systems.
Abstract: The advancement of Large Language Models (LLMs) has revolutionized natural language processing, yet their training on massive corpora poses significant risks, including the memorization of sensitive personal data, copyrighted material, and knowledge that could facilitate malicious activities. To mitigate these issues and align with legal and ethical standards such as the "right to be forgotten", machine unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021, focusing exclusively on large-scale generative models. Distinct from prior surveys, we introduce novel taxonomies for both unlearning methods and evaluations. We clearly categorize methods into training-time, post-training, and inference-time based on the training stage at which unlearning is applied. For evaluations, we not only systematically compile existing datasets and metrics but also critically analyze their advantages, disadvantages, and applicability, providing practical guidance to the research community. In addition, we discuss key challenges and promising future research directions. Our comprehensive overview aims to inform and guide the ongoing development of secure and reliable LLMs.
[30] Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR
Shreyas Gopal,Ashutosh Anshul,Haoyang Li,Yue Heng Yeo,Hexin Liu,Eng Siong Chng
Main category: cs.CL
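TL;DR: Proposes a method that separates semantic speech content from background noise in the latent space by disentangling the quantization of Whisper embeddings, improving speech-to-text alignment and ASR performance.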
Details
Motivation: Existing discrete audio representations perform poorly in noisy conditions and are not fully optimized for robustness in real-world settings.
Method: Builds speech-to-unit modeling on Whisper embeddings, separates clean speech into codebook tokens end to end, treats the quantization residue as an interpretable noise vector, and supervises noise learning with a lightweight classifier.
Result: On the VBDemand test set, with Whisper kept frozen, the method reduces error rate by 82% relative to Whisper and improves on baseline methods by 35%; the resulting speech tokens are highly noise-invariant and generalize well to both seen and unseen acoustic conditions.
Conclusion: The method improves the robustness and alignment accuracy of discrete speech representations in noise, offering a more noise-resistant speech-to-unit modeling approach for speech recognition.
Abstract: Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing work that quantizes Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as quantization residue, which are supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noise invariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and a 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.
[31] Model-Document Protocol for AI Search
Hongjin Qian,Zheng Liu
Main category: cs.CL
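TL;DR: Proposes the Model-Document Protocol (MDP), a framework that turns unstructured documents into knowledge representations LLMs can consume, making the interaction between LLMs and external knowledge in AI search more efficient, and presents an instantiation, MDP-Agent, that performs strongly on information-seeking tasks.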
Details
Motivation: Existing retrieval methods return raw text passages, leaving the LLM to handle fragmented, lengthy, unstructured information on its own, which makes efficient reasoning hard; a new retrieval paradigm is needed to bridge raw documents and LLMs.
Method: Proposes the MDP framework with three pathways: agentic reasoning, which produces coherent context; memory grounding, which accumulates reusable notes; and structured leveraging, which converts documents into forms such as graphs or key-value caches. It is realized as MDP-Agent, which combines document-level gist memories, diffusion-based exploration with vertical exploitation, and map-reduce style synthesis.
Result: Experiments on several information-seeking benchmarks show that MDP-Agent outperforms existing baselines, validating the MDP framework and its benefit for LLM reasoning efficiency.
Conclusion: MDP offers a new paradigm for document-to-model interaction; its structured, task-specific knowledge transformation significantly improves LLM performance on complex retrieval tasks and shows promise as a next-generation retrieval architecture.
Abstract: AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning.
As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
[32] Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction
Ritesh Sunil Chavan,Jack Mostow
Main category: cs.CL
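TL;DR: Tests large models on next sentence prediction in English, Swahili, and Hausa and finds that language resource level strongly affects performance, while the effect of chain-of-thought prompting depends on model capability and language context: it helps weaker models but can make stronger models "overthink" and perform worse.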
Details
Motivation: To investigate whether the real ability of large language models in multilingual settings depends on the dominance of English in their training data, especially in low-resource languages.
Method: Builds a cross-lingual Next Sentence Prediction (NSP) benchmark with 10,000 questions per language, covering a high-resource language (English), a medium-resource language (Swahili), and a low-resource language (Hausa); evaluates GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B; and analyzes the effect of Chain-of-Thought (CoT) prompting.
Result: All models excel in English, lose accuracy in Swahili, and drop sharply in Hausa; LLaMA 3 performs worst in the low-resource languages but improves significantly with CoT, whereas GPT-4 and Gemini often get worse with CoT, showing an "overthinking" effect.
Conclusion: Language resource level directly affects LLM performance in low-resource languages, and CoT prompting is not universally effective; its effect depends on the model's own capability and the task context. The framework helps identify model weaknesses and the factors that influence model decisions.
Abstract: While large language models are trained on massive datasets, this data is heavily skewed towards English. Does their impressive performance reflect genuine ability or just this data advantage? To find out, we tested them in a setting where they could not rely on data abundance: low-resource languages. Building on prior work by Agarwal et al. (2025) that used Next Sentence Prediction (NSP) as a test, we created a large-scale benchmark with 10,000 questions each for English (a high-resource language), Swahili (medium-resource), and Hausa (low-resource). We then tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, to see how their performance holds up. The results painted a clear picture of how levels of language resources impact outcomes. While all models excelled in English, their accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling the most. The story became even more interesting when we introduced Chain-of-Thought (CoT) prompting. For the struggling LLaMA 3, CoT acted as a helpful guide, significantly boosting its accuracy. However, for the more capable GPT-4 and Gemini, the same technique often backfired, leading to a kind of "overthinking" that hurt their results in the cross-lingual context. This reveals that Chain-of-Thought is not a universal solution; its effectiveness depends heavily on the model's baseline capability and the specific context of the task.
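The comparison described above (accuracy per language, with and without CoT) reduces to a simple grouped tally. The sketch below is illustrative only: the record format, the labels `"direct"` and `"cot"`, and every data point are assumptions made up for this example and do not reproduce any result from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Accuracy per (language, prompting) pair.

    Each record is (language, prompting, predicted, gold), a hypothetical
    layout chosen for this sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for language, prompting, predicted, gold in records:
        key = (language, prompting)
        totals[key] += 1
        hits[key] += int(predicted == gold)
    return {key: hits[key] / totals[key] for key in totals}


def cot_delta(accuracy, language):
    """Accuracy change from adding CoT; negative means CoT hurt."""
    return accuracy[(language, "cot")] - accuracy[(language, "direct")]


# Made-up NSP answers (option letters), two questions per condition.
records = [
    ("english", "direct", "A", "A"),
    ("english", "direct", "B", "B"),
    ("english", "cot", "A", "A"),
    ("english", "cot", "A", "B"),
    ("hausa", "direct", "A", "B"),
    ("hausa", "direct", "B", "B"),
    ("hausa", "cot", "B", "B"),
    ("hausa", "cot", "A", "A"),
]

accuracy = accuracy_by_group(records)
english_delta = cot_delta(accuracy, "english")
hausa_delta = cot_delta(accuracy, "hausa")
```

In a real evaluation the same tally would be kept per model as well, since the abstract reports that CoT helps or hurts depending on which model is being prompted.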
Our framework pinpoints LLM weaknesses, highlights when CoT helps or hinders cross-lingual NSP performance, and identifies factors influencing model decisions.
[33] ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation
Ziyi Liu,Bahar Sarrafzadeh,Pei Zhou,Longqi Yang,Jieyu Zhao,Ashish Sharma
Main category: cs.CL
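TL;DR: ProMediate is the first framework for evaluating proactive AI mediator agents in complex multi-topic, multi-party negotiations; it includes a simulation testbed based on realistic cases with theory-driven difficulty levels and a socio-cognitive evaluation framework.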
Details
Motivation: Existing LLMs are mostly used as single-user agents, and effective methods for evaluating AI agents that proactively coordinate multi-party collaboration are lacking, especially in multilateral negotiation, which demands socio-cognitive intelligence.
Method: Proposes the ProMediate framework, which consists of a simulation testbed with a plug-and-play proactive AI mediator grounded in socio-cognitive mediation theories (at easy, medium, and hard levels) and a new suite of metrics covering consensus change, intervention latency, mediator effectiveness, and intelligence.
Result: In the ProMediate-Hard setting, the socially intelligent mediator raises consensus change by 3.6 percentage points over the generic baseline (10.65% vs. 7.01%) and responds 77% faster (3.71s vs. 15.98s).
Conclusion: ProMediate provides a rigorous, theory-grounded testbed that advances proactive, socially intelligent AI agents for multi-party collaboration.
Abstract: While Large Language Models (LLMs) are increasingly used in agentic frameworks to assist individual users, there is a growing need for agents that can proactively manage complex, multi-party collaboration. Systematic evaluation methods for such proactive agents remain scarce, limiting progress in developing AI that can effectively support multiple people together. Negotiation offers a demanding testbed for this challenge, requiring socio-cognitive intelligence to navigate conflicting interests across multiple participants and topics and to build consensus. Here, we present ProMediate, the first framework for evaluating proactive AI mediator agents in complex, multi-topic, multi-party negotiations. ProMediate consists of two core components: (i) a simulation testbed based on realistic negotiation cases and theory-driven difficulty levels (ProMediate-Easy, ProMediate-Medium, and ProMediate-Hard), with a plug-and-play proactive AI mediator grounded in socio-cognitive mediation theories, capable of flexibly deciding when and how to intervene; and (ii) a socio-cognitive evaluation framework with a new suite of metrics to measure consensus changes, intervention latency, mediator effectiveness, and intelligence. Together, these components establish a systematic framework for assessing the socio-cognitive intelligence of proactive AI agents in multi-party settings. Our results show that a socially intelligent mediator agent outperforms a generic baseline, via faster, better-targeted interventions.
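The two headline figures for the ProMediate-Hard setting (a 3.6-percentage-point gain in consensus change and a 77% faster response) follow from the four reported numbers. The short check below assumes that 10.65% and 3.71s belong to the socially intelligent mediator and 7.01% and 15.98s to the generic baseline, which is the only assignment consistent with "faster"; the helper names are hypothetical.

```python
def percentage_point_gain(treated_pct, baseline_pct):
    """Absolute difference between two percentages, in percentage points."""
    return treated_pct - baseline_pct


def relative_speedup(treated_seconds, baseline_seconds):
    """Fraction of the baseline latency that the treated system saves."""
    return (baseline_seconds - treated_seconds) / baseline_seconds


# Figures as reported for ProMediate-Hard.
gain = percentage_point_gain(10.65, 7.01)  # consensus change, in points
speedup = relative_speedup(3.71, 15.98)    # share of response time saved
```

The gain is an absolute difference in percentage points, not a relative improvement; relative to the 7.01% baseline the same change would be roughly 52%.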
In the ProMediate-Hard setting, our social mediator increases consensus change by 3.6 percentage points compared to the generic baseline (10.65% vs. 7.01%) while being 77% faster to respond (3.71s vs. 15.98s). In conclusion, ProMediate provides a rigorous, theory-grounded testbed to advance the development of proactive, socially intelligent agents.
[34] Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA
Sandipan Majhi,Paheli Bhattacharya
Main category: cs.CL
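TL;DR: Proposes a multi-stage fine-tuning strategy that augments the original data with synthetic data generated by large models, adapting lightweight language models to domain-specific question answering in a low-resource language (Hindi tourism).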
Details
Motivation: To address the scarcity of annotated data and the limited domain knowledge of general-purpose language models in domain-specific question answering for low-resource languages.
Method: Uses a multi-stage fine-tuning strategy in which large language models (LLaMA-70B, Phi-14B) generate synthetic question-answer pairs that augment the limited original dataset, which is then used to train lightweight language models.
Result: Large models can generate synthetic data efficiently, and small models adapt to it effectively and generalize well within the domain.
Conclusion: The approach offers a scalable path for low-resource, domain-specific question answering, particularly for languages and domains where data is scarce.
Abstract: Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large language models (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.
[35] Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student
Soumyadeep Jana,Sanasam Ranbir Singh
Main category: cs.CL
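TL;DR: Proposes PEKD, a framework that strengthens parameter-efficient fine-tuning for low-resource multimodal sarcasm detection through knowledge distillation from an expert model trained on large-scale sarcasm data.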
Details
Motivation: In low-resource settings, scarce annotated data prevents existing parameter-efficient fine-tuning methods from learning the subtle contradictions between image and text, which limits sarcasm detection performance.
Method: Proposes the PEKD framework, which combines knowledge distillation from an expert model with an entropy-aware gating mechanism that adjusts distillation strength dynamically based on teacher confidence, improving parameter-efficient fine-tuning methods such as LoRA, adapters, and prompt tuning.
Result: Experiments on two public datasets show that PEKD lets parameter-efficient fine-tuning methods outperform both prior parameter-efficient approaches and large multimodal models, with strong results in the few-shot scenario.
Conclusion: PEKD effectively improves parameter-efficient fine-tuning for low-resource multimodal sarcasm detection and is modular and adaptable to a wide range of multimodal models and tasks.
Abstract: Multimodal sarcasm detection is challenging, especially in low-resource settings where subtle image-text contradictions are hard to learn due to scarce annotated data, which hinders the model's performance. Parameter-efficient fine-tuning (PEFT) methods like adapters, LoRA, and prompt tuning reduce overfitting but struggle to reach optimal performance due to limited supervision from few-shot data. We propose PEKD, a unified framework that enhances PEFT methods via distillation from an expert model trained on large-scale sarcasm data, which acts as the teacher. To mitigate unreliable signals from the teacher, we introduce an entropy-aware gating mechanism that dynamically adjusts the distillation strength based on teacher confidence. Experiments on two public datasets demonstrate that our PEKD framework enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models, achieving strong results in the few-shot scenario. The framework is modular and adaptable to a wide range of multimodal models and tasks.
[36] Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning
Senjie Jin,Lu Chen,Zhiheng Xi,Yuhui Wang,Sirui Song,Yuhao Zhou,Xinbo Zhang,Peng Sun,Hong Lu,Tao Gui,Qi Zhang,Xuanjing Huang
Main category: cs.CL
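TL;DR: Proposes Parrot, a training pipeline that integrates natural language and program chain-of-thought (N-CoT and P-CoT) and significantly improves large language models on mathematical reasoning, with especially large gains on N-CoT.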
Details
Motivation: Prior work is mostly limited to unidirectional enhancement of either N-CoT or P-CoT and does not exploit the synergy between them. This paper aims for mutual enhancement of the two paradigms to improve mathematical reasoning overall.
Method: Proposes the Parrot training pipeline with three core designs: 1) three target-designed subtasks that integrate sequentially generated N-CoT and P-CoT; 2) a hybrid subtask training strategy that facilitates natural language semantic transfer; 3) a converted N-CoT auxiliary reward that alleviates sparse rewards in P-CoT optimization.
Result: Experiments show that Parrot significantly improves both N-CoT and P-CoT performance, especially N-CoT. With Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA on MathQA improves by +21.87 and +21.48, respectively, over the RL baseline.
Conclusion: Parrot achieves mutual enhancement of N-CoT and P-CoT, improves LLM performance on mathematical reasoning, and outperforms resource-intensive reinforcement learning methods.
Abstract: Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT enhanced N-CoT or N-CoT enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms' strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) Three target-designed subtasks integrate sequential P-CoT and N-CoT generation. 2) A subtask hybrid training strategy to facilitate natural language semantic transferability. 3) The converted N-CoT auxiliary reward is designed to alleviate the sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances both the performance of N-CoT and P-CoT, especially on N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA achieves gains of +21.87 and +21.48 on MathQA over the RL baseline, which is resource-intensive.
[37] CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories
Yilong Lai,Yipin Yang,Jialong Wu,Fengran Mo,Zhenglin Wang,Ting Liang,Jianguo Lin,Keping Yang
Main category: cs.CL
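TL;DR: Proposes CRMWeaver, a new approach to business agents that combines synthetic data generation and reinforcement learning during training with a shared memories mechanism, showing strong effectiveness and generalization in complex business scenarios.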
Details
Motivation: Business agents must handle complex data relationships and diverse tasks, and existing methods fall short in real-world, complex business environments.
Method: Uses synthetic data generation and a reinforcement-learning-based training paradigm, and introduces a shared memories mechanism at inference time so that the agent can learn from similar tasks and improve its performance.
Result: Validated on the CRMArena-Pro dataset, where the lightweight model achieves competitive results in both B2B and B2C scenarios.
Conclusion: CRMWeaver effectively improves the performance and generalization of business agents in complex, heterogeneous task environments and has practical value.
Abstract: Recent years have witnessed the rapid development of LLM-based agents, which shed light on using language agents to solve complex real-world problems. A prominent application lies in business agents, which interact with databases and internal knowledge bases via tool calls to fulfill diverse user requirements. However, this domain is characterized by intricate data relationships and a wide range of heterogeneous tasks, from statistical data queries to knowledge-based question-answering. To address these challenges, we propose CRMWeaver, a novel approach that enhances business agents in such complex settings. To acclimate the agentic model to intricate business environments, we employ a synthesis data generation and RL-based paradigm during training, which significantly improves the model's ability to handle complex data and varied tasks. During inference, a shared memories mechanism is introduced, prompting the agent to learn from task guidelines in similar problems, thereby further boosting its effectiveness and generalization, especially in unseen scenarios. We validate the efficacy of our approach on the CRMArena-Pro dataset, where our lightweight model achieves competitive results in both B2B and B2C business scenarios, underscoring its practical value for real-world applications.
[38] Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments
Abhishek Purushothama,Junghyun Min,Brandon Waldon,Nathan Schneider
Main category: cs.CL
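TL;DR: Studies the use of large language models (LLMs) for legal interpretation in the U.S. context and finds that their judgments are unstable and only weakly correlated with human judgments, so they should not be relied on as a dependable tool for legal interpretation.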
Details
Motivation: Scholars and judges have recently begun using LLMs for legal interpretation, but the stability and reliability of this practice have not been systematically evaluated. This paper tests LLM performance on legal text interpretation empirically.
Method: Runs experiments on English legal texts, tests LLM interpretations under different question formats, analyzes the consistency of the outputs, and compares them with human judgments via correlation.
Result: Changing the question format leads models to sharply different conclusions, which indicates unstable judgments. Correlation with human judgment is weak to moderate and varies widely across models and question variants.
Conclusion: LLMs in their current form are not suitable as a trustworthy tool for legal interpretation, and heavy reliance on the conclusions of generative AI is risky.
Abstract: Legal interpretation frequently involves assessing how a legal text, as understood by an 'ordinary' speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions. Moreover, the models show weak to moderate correlation with human judgment, with large variance across model and question variant, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.
[39] CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs
Luca Capone,Alessandro Bondielli,Alessandro Lenci
Main category: cs.CL
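TL;DR: Investigates whether small-scale language models benefit from instruction tuning. Instruction tuning yields small but consistent gains on fine-tuning tasks, particularly with a sequential curriculum, but generalizes poorly to zero-shot tasks, which points toward hybrid curriculum strategies for improving generalization in low-resource models.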
Details
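Motivation: To explore how small-scale language models under resource constraints can improve through human-inspired instruction tuning strategies, and in particular to compare the effects of different data combinations and training orders.
Method: Uses decoder-only models with 100M and 140M parameters, compares conversational and question-answering instruction tuning datasets under merged and sequential curriculum settings, and evaluates on fine-tuning (SuperGLUE) and zero-shot (BLiMP, EWoK, and others) tasks.
Result: Instruction tuning brings small but consistent gains on fine-tuning tasks, and sequential curricula outperform merged data. Improvements on zero-shot tasks are inconsistent, which indicates a trade-off between interaction-focused adaptation and linguistic generalization.
Conclusion: Instruction tuning has some potential for small models, but its zero-shot generalization is limited. Hybrid, curriculum-based learning strategies are suggested for improving generalization under ecological training conditions.
Abstract: This work investigates whether small-scale LMs can benefit from instruction tuning. We compare conversational and question-answering instruction tuning datasets, applied either in a merged or sequential curriculum, using decoder-only models with 100M and 140M parameters. Evaluation spans both fine-tuning (SuperGLUE) and zero-shot (BLiMP, EWoK, WUGs, entity tracking, and psycholinguistic correlation) settings. Results show that instruction tuning yields small but consistent gains in fine-tuning scenarios, with sequential curricula outperforming merged data; however, improvements do not consistently transfer to zero-shot tasks, suggesting a trade-off between interaction-focused adaptation and broad linguistic generalization. These results highlight both the potential and the constraints of adapting human-inspired learning strategies to low-resource LMs, and point toward hybrid, curriculum-based approaches for enhancing generalization under ecological training limits.
[40] Monitoring Transformative Technological Convergence Through LLM-Extracted Semantic Entity Triple Graphs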
Alexander Sternfeld,Andrei Kucharavy,Dimitri Percia David,Alain Mermoud,Julian Jang-Jaccard,Nathan Monnet
Main category: cs.CL
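TL;DR: Proposes a data-driven method based on large language models and graph analysis that forecasts transformative technologies by identifying patterns of technological convergence.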
Details
Motivation: Traditional expert-based methods struggle to keep up with innovation cycles and shifting terminology in the fast-evolving ICT domain, so more effective forecasting tools are needed.
Method: Uses large language models to extract semantic triples from unstructured text and build a graph of technology-related entities and relations. Introduces noun stapling to group similar technology terms, and designs graph-based metrics to detect convergence signals, combined with multi-stage filtering, domain-specific keyword clustering, and trend analysis of topic co-occurrence.
Result: Validated on roughly 270,000 arXiv preprints and nearly 10,000 U.S. patent applications, where the method identifies both established and emerging convergence patterns.
Conclusion: The data-driven framework can effectively monitor the emergence of transformative technologies, is scalable and generalizable, and offers a full-text-based paradigm for technology forecasting.
Abstract: Forecasting transformative technologies remains a critical but challenging task, particularly in fast-evolving domains such as Information and Communication Technologies (ICTs). Traditional expert-based methods struggle to keep pace with short innovation cycles and ambiguous early-stage terminology. In this work, we propose a novel, data-driven pipeline to monitor the emergence of transformative technologies by identifying patterns of technological convergence. Our approach leverages advances in Large Language Models (LLMs) to extract semantic triples from unstructured text and construct a large-scale graph of technology-related entities and relations. We introduce a new method for grouping semantically similar technology terms (noun stapling) and develop graph-based metrics to detect convergence signals. The pipeline includes multi-stage filtering, domain-specific keyword clustering, and a temporal trend analysis of topic co-occurrence. We validate our methodology on two complementary datasets: 278,625 arXiv preprints (2017-2024) to capture early scientific signals, and 9,793 USPTO patent applications (2018-2024) to track downstream commercial developments. Our results demonstrate that the proposed pipeline can identify both established and emerging convergence patterns, offering a scalable and generalizable framework for technology forecasting grounded in full-text analysis.
[41] Hallucinations in Bibliographic Recommendation: Citation Frequency as a Proxy for Training Data Redundancy
Junichiro Niimi
Main category: cs.CL
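TL;DR: Examines hallucination when large language models (LLMs) generate bibliographic references. Higher citation counts correspond to lower hallucination rates, and beyond roughly 1,000 citations bibliographic information is memorized almost verbatim.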
Details
Motivation: To address the hallucination of non-existent papers in LLM-based bibliographic recommendation, and to examine how correctness depends on whether the underlying knowledge is generated or memorized.
Method: Uses GPT-4.1 to generate 100 bibliographic records across 20 computer science domains, assesses factual consistency through manual verification and cosine similarity, and analyzes the relationship between citation frequency and hallucination.
Result: Hallucination rates differ across domains, citation count is strongly correlated with factual accuracy, and beyond roughly 1,000 citations bibliographic information is memorized almost verbatim.
Conclusion: Highly cited papers are stored nearly verbatim in the model, which indicates a threshold at which generalization shifts into memorization.
Abstract: Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in bibliographic recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM's ability to correctly produce bibliographic information depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., those that appear more frequently in the training corpus) showing lower hallucination rates. We therefore assume citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record is repeatedly represented in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 bibliographic records across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) hallucination rates vary across research domains, (ii) citation count is strongly correlated with factual accuracy, and (iii) bibliographic information becomes almost verbatim memorized beyond approximately 1,000 citations. These findings suggest that highly cited papers are retained nearly verbatim in the model, indicating a threshold where generalization shifts into memorization.
[42] Roleplaying with Structure: Synthetic Therapist-Client Conversation Generation from Questionnaires
Doan Nam Long Vu,Rui Tan,Lena Moench,Svenja Jule Francke,Daniel Woiwod,Florian Thomas-Odenthal,Sanna Stroth,Tilo Kircher,Christiane Hermann,Udo Dannlowski,Hamidreza Jamalabadi,Shaoxiong Ji
Main category: cs.CL
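TL;DR: Proposes SQPsych, a method for generating synthetic psychotherapy dialogues from structured questionnaires. It uses open-weight large language models to produce high-quality simulated counseling conversations that follow cognitive behavioral therapy principles, which addresses the difficulty of obtaining real clinical data under privacy restrictions.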
Details
Motivation: Strict privacy regulations and the scarcity of recorded clinical sessions leave AI for mental health short of authentic therapy dialogue data. A method is needed that protects privacy while still producing high-quality training data.
Method: Proposes the SQPsych framework, which takes structured client profiles and psychological questionnaires as input, simulates therapist-client dialogue with open-weight LLMs to generate synthetic counseling conversations that follow the principles of Cognitive Behavioral Therapy (CBT), and fine-tunes SQPsychLLM models on the resulting data.
Result: The synthetic dialogues perform well under both human expert evaluation and LLM-based assessment. SQPsychLLM, fine-tuned on SQPsychConv, outperforms baselines on counseling benchmarks and shows stronger therapeutic skills.
Conclusion: Synthetic data is a promising route for advancing mental health AI, and SQPsych offers an effective approach to data-secure, scalable, and clinically informed AI for mental health.
Abstract: The development of AI for mental health is hindered by a lack of authentic therapy dialogues, due to strict privacy regulations and the fact that clinical sessions were historically rarely recorded. We present an LLM-driven pipeline that generates synthetic counseling dialogues based on structured client profiles and psychological questionnaires. Grounded in the principles of Cognitive Behavioral Therapy (CBT), our method creates synthetic therapeutic conversations for clinical disorders such as anxiety and depression. Our framework, SQPsych (Structured Questionnaire-based Psychotherapy), converts structured psychological input into natural language dialogues through therapist-client simulations. Due to data governance policies and privacy restrictions prohibiting the transmission of clinical questionnaire data to third-party services, previous methodologies relying on proprietary models are infeasible in our setting. We address this limitation by generating a high-quality corpus using open-weight LLMs, validated through human expert evaluation and LLM-based assessments. Our SQPsychLLM models fine-tuned on SQPsychConv achieve strong performance on counseling benchmarks, surpassing baselines in key therapeutic skills. Our findings highlight the potential of synthetic data to enable scalable, data-secure, and clinically informed AI for mental health support. We will release our code, models, and corpus at https://ai-mh.github.io/SQPsych
[43] BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
Vijay Devane,Mohd Nauman,Bhargav Patel,Aniket Mahendra Wakchoure,Yogeshkumar Sant,Shyam Pawar,Viraj Thakur,Ananya Godse,Sunil Patra,Neha Maurya,Suraj Racha,Nitish Kamal Singh,Ajay Nagpal,Piyush Sawarkar,Kundeshwar Vijayrao Pundalik,Rohit Saluja,Ganesh Ramakrishnan
Main category: cs.CL
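TL;DR: BhashaBench V1 is the first domain-specific, multi-task, bilingual benchmark focused on critical Indic knowledge systems. It contains 74,166 English and Hindi question-answer pairs covering four domains (Agriculture, Legal, Finance, and Ayurveda) and more than 90 subdomains, for fine-grained evaluation of large language models in the Indian context.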
Details
Motivation: Existing LLM benchmarks are mostly English-centric and domain-agnostic, which makes them a poor fit for India's specific cultural and professional contexts. A multi-domain, bilingual benchmark that reflects Indic knowledge systems is needed.
Method: Builds BhashaBench V1, a bilingual (English and Hindi) multi-task benchmark of 74,166 high-quality question-answer pairs drawn from authentic government and professional exams, covering Agriculture, Legal, Finance, and Ayurveda with more than 90 subdomains, and systematically evaluates more than 29 LLMs on it.
Result: The evaluation reveals significant performance gaps across domains and languages. For example, GPT-4o reaches 76.49% accuracy in Legal but only 59.74% in Ayurveda. All models perform better on English content than on Hindi. Subdomain analysis shows relatively strong results in Cyber Law and International Finance and weak results in Panchakarma, Seed Science, and Human Rights.
Conclusion: BhashaBench V1 provides a comprehensive, fine-grained tool for evaluating LLMs in India's multilingual, multi-domain setting. It exposes current weaknesses in low-resource domains and in Hindi understanding, and supports the development of more inclusive, domain-adapted models.
Abstract: The rapid advancement of large language models (LLMs) has intensified the need for domain- and culture-specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content compared to Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law and International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.
[44] Serve Programs, Not Prompts
In Gim,Lin Zhong
Main category: cs.CL
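TL;DR: Proposes Symphony, a new LLM serving architecture that runs customizable LLM Inference Programs (LIPs) to improve the efficiency and flexibility of complex applications.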
Details
Motivation: Existing LLM serving systems have a rigid design and cannot efficiently support increasingly complex LLM applications.
Method: Designs LLM Inference Programs (LIPs), which let users customize token prediction and KV cache management at runtime and offload parts of their application logic to the server. Builds the Symphony system, which provides a system call interface, a virtualized KV cache file system, and a two-level scheduling scheme.
Result: Symphony acts as an operating system for LIPs and can flexibly support diverse application logic while keeping GPU utilization efficient.
Conclusion: The architecture offers a more efficient and extensible serving approach for the LLM application ecosystem.
Abstract: Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.
[45] Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media
Shakib Yazdani,Yasser Hamidullah,Cristina España-Bonet,Josef van Genabith
Main category: cs.CL
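TL;DR: Proposes the first automated annotation and filtering framework based on vision language models (VLMs) for reducing manual effort in sign language dataset construction. Applied to TikTok and YouTube data, it yields the TikTok-SL-8 dataset and supports scalable weakly supervised pretraining and data acquisition from social media.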
Details
Motivation: Existing sign language translation datasets are small, lack multilingual coverage, and are costly because they depend on expert annotation and controlled recording setups. VLMs perform well as evaluators and real-time assistants but remain underused for sign language data acquisition.
Method: Designs a VLM-based automated pipeline comprising face visibility detection, sign activity recognition, text extraction from video, and validation of video-text alignment, and uses it to filter and annotate data from TikTok (eight sign languages) and YouTube-SL-25 (German Sign Language).
Result: Builds the TikTok-SL-8 dataset and uses it to evaluate two off-the-shelf SLT models on German and American Sign Languages, establishing baselines on automatically extracted, slightly noisy data and assessing the models' robustness.
Conclusion: The framework substantially reduces the manual cost of building sign language datasets, enables high-quality and scalable data acquisition, supports weakly supervised pretraining for SLT models, and offers a workable route to using social media data.
Abstract: Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setup. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language for the purpose of additional evaluation. Our VLM-based pipeline includes a face visibility detection, a sign activity recognition, a text extraction from video content, and a judgment step to validate alignment between video and text, implementing generic filtering, annotation and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.
[46] Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction
Asutosh Hota,Jussi P. P. Jokinen
Main category: cs.CL
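TL;DR: Examines how well large language models (LLMs) understand implicature in human-AI interaction. Larger models come closer to human interpretations while smaller models struggle, but implicature-embedded prompts significantly improve the relevance and quality of responses across all models, especially smaller ones. Overall, 67.6% of participants preferred such responses, indicating a preference for contextually nuanced communication.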
Details
Motivation: As LLMs take a central place in human-computer interaction, the linguistic foundations of interaction need attention, in particular implicature, in order to achieve more natural and better aligned human-AI (HAI) interaction.
Method: Uses context-driven prompts to examine LLMs' ability to infer user intent, and analyzes whether understanding implicature improves the quality of generated responses.
Result: Larger models are closer to humans at implicature inference, while smaller models perform worse. Implicature-based prompts significantly improve the relevance and quality of responses across all models, with the largest gains for smaller models, and 67.6% of participants preferred responses generated from implicature-embedded prompts.
Conclusion: Bringing linguistic theory, especially implicature, into prompt design helps make human-AI interaction more natural and contextually grounded, and offers an effective route to addressing the alignment problem.
Abstract: The rapid advancement of Large Language Models (LLMs) is positioning language at the core of human-computer interaction (HCI). We argue that advancing HCI requires attention to the linguistic foundations of interaction, particularly implicature (meaning conveyed beyond explicit statements through shared context) which is essential for human-AI (HAI) alignment. This study examines LLMs' ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation. Results show that larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Furthermore, implicature-based prompts significantly enhance the perceived relevance and quality of responses across models, with notable gains in smaller models. Overall, 67.6% of participants preferred responses with implicature-embedded prompts to literal ones, highlighting a clear preference for contextually nuanced communication. Our work contributes to understanding how linguistic theory can be used to address the alignment problem by making HAI interaction more natural and contextually grounded.
[47] RLMEval: Evaluating Research-Level Neural Theorem Proving
Auguste Poiroux,Antoine Bosselut,Viktor Kunčak
Main category: cs.CL
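TL;DR: RLMEval is an evaluation suite for research-level mathematical theorem proving and proof autoformalization, built on real Lean projects, that exposes the performance gap of existing models in realistic settings.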
Details
Motivation: Although large language models perform well on benchmarks, their use in actual research-level theorem proving and formalization remains limited, so evaluation tools closer to real settings are needed.
Method: Builds the RLMEval evaluation suite, which contains 613 theorems from 6 real Lean projects, and uses it to evaluate state-of-the-art models on research-level mathematical tasks.
Result: The best state-of-the-art model achieves a pass rate of only 10.3% on RLMEval, which shows that existing progress does not readily transfer to more realistic settings.
Conclusion: RLMEval provides a new, challenging benchmark that can help guide and accelerate progress in automated reasoning for formal mathematics.
Abstract: Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3% pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.
[48] Depth and Autonomy: A Framework for Evaluating LLM Applications in Social Science Research
Ali Sanaei,Ali Rajabzadeh
Main category: cs.CL
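TL;DR: Proposes a framework for classifying and guiding the use of large language models (LLMs) in qualitative social science research, emphasizing restricted autonomy and greater interpretive depth as the way to improve transparency and reliability.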
Details
Motivation: The use of LLMs in qualitative social science faces challenges such as interpretive bias, low reliability, and weak auditability, and needs a systematic framework to guide it.
Method: Builds a classification framework along two dimensions, interpretive depth and autonomy, and reviews all relevant social science papers available on Web of Science.
Result: Existing studies tend to grant LLMs too much autonomy. The paper instead argues for decomposing tasks into manageable segments and raising interpretive depth step by step under supervision, which improves auditability and the stability of results.
Conclusion: By limiting autonomy and increasing interpretive depth in a controlled way, researchers can use LLMs effectively while preserving transparency and reliability.
Abstract: Large language models (LLMs) are increasingly utilized by researchers across a wide range of domains, and qualitative social science is no exception; however, this adoption faces persistent challenges, including interpretive bias, low reliability, and weak auditability. We introduce a framework that situates LLM usage along two dimensions, interpretive depth and autonomy, thereby offering a straightforward way to classify LLM applications in qualitative research and to derive practical design recommendations. We present the state of the literature with respect to these two dimensions, based on all published social science papers available on Web of Science that use LLMs as a tool and not strictly as the subject of study. Rather than granting models expansive freedom, our approach encourages researchers to decompose tasks into manageable segments, much as they would when delegating work to capable undergraduate research assistants. By maintaining low levels of autonomy and selectively increasing interpretive depth only where warranted and under supervision, one can plausibly reap the benefits of LLMs while preserving transparency and reliability.
[49] A Critical Study of Automatic Evaluation in Sign Language Translation
Shakib Yazdani,Yasser Hamidullah,Cristina España-Bonet,Eleftherios Avramidis,Josef van Genabith
Main category: cs.CL
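TL;DR: Studies the limitations of existing text-based evaluation metrics for sign language translation (SLT) by comparing traditional metrics such as BLEU and ROUGE with LLM-based evaluators such as G-Eval and GEMBA. LLM-based evaluators capture semantic equivalence better but are biased toward LLM-generated translations, and the findings call for multimodal evaluation frameworks for a more complete assessment of SLT.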
Details
Motivation: Current SLT evaluation metrics are purely text-based, and it is unclear whether they reliably reflect SLT output quality, so their behavior and limitations under different conditions need systematic analysis.
Method: Analyzes six text-based metrics (such as BLEU, chrF, ROUGE, and BLEURT) and two LLM-based evaluators (G-Eval and GEMBA zero-shot direct assessment), and assesses their consistency and robustness under three controlled conditions: paraphrasing, hallucination, and variation in sentence length.
Result: Lexical overlap metrics struggle to capture semantic equivalence. LLM-based evaluators recognize semantic similarity better but favor LLM-generated paraphrases. All metrics can detect hallucinations, but BLEU is overly sensitive, while BLEURT and LLM-based evaluators are comparatively lenient toward subtle hallucinations.
Conclusion: Evaluation should move beyond text-only metrics toward multimodal evaluation frameworks in order to assess sign language translation quality more completely and accurately.
Abstract: Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
[50] Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs
Fei Wei,Daoyuan Chen,Ce Wang,Yilun Huang,Yushuo Chen,Xuchen Pan,Yaliang Li,Bolin Ding
Main category: cs.CL
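TL;DR: Proposes Learn-to-Ask, a framework for training proactive dialogue agents directly from offline expert data without a user simulator. It uses the observed future in expert trajectories to turn the long-horizon decision problem into supervised learning tasks and adds an automated grader calibration step to improve reward signal quality. On real-world medical data and in a live online AI service, it is reported to outperform human experts.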
Details
Motivation: Large Language Models (LLMs) excel at passive response, but high-stakes domains require proactive, goal-oriented behavior. Existing methods are limited to single-turn optimization or depend on costly and brittle user simulators, which creates a "reality gap". A method that learns proactive behavior directly from real expert data is therefore needed.
Method: Proposes the Learn-to-Ask framework, which reframes the offline policy learning problem by using the observable future of each expert trajectory to infer a dense, turn-by-turn reward signal. The long-horizon decision problem is decomposed into a series of supervised learning tasks, and a policy is trained to output a structured (action, state_assessment) tuple that governs both what to ask and when to stop. An Automated Grader Calibration pipeline removes noise from the LLM-based reward model with minimal human supervision.
Result: The method is validated on a real-world medical dataset with LLMs of up to 32B parameters. The model has been deployed in a large-scale online AI service and outperformed human experts in rigorous in-house evaluations.
Conclusion: Learn-to-Ask offers a practical and economically viable way to turn passive LLMs into proactive, goal-oriented dialogue systems, narrowing the gap between offline training and real-world use in high-stakes domains.
Abstract: Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured (action, state_assessment) tuple, governing both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of Learn-to-Ask in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.
[51] Fine-Tuned Language Models for Domain-Specific Summarization and Tagging
Jun Wang,Fuming Lin,Yuyu Chen
Main category: cs.CL
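TL;DR: Proposes a pipeline that combines fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient summarization and tagging of domain-specific text, particularly in the political and security domains.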
Details
Motivation: To address the challenge that rapidly evolving sub-cultural language and slang pose for automated information extraction and law enforcement monitoring.
Method: Uses the LLaMA Factory framework to instruction-tune LLMs on general-purpose and custom domain-specific datasets, and combines them with NER for text summarization and entity tagging.
Result: After domain-specific fine-tuning, the LLaMA3-8B-Instruct model outperforms its Chinese-trained counterpart on Chinese comprehension tasks, and BLEU and ROUGE scores show that fine-tuning significantly improves summarization and tagging accuracy.
Conclusion: The approach is scalable and adaptable, can turn unstructured text into actionable insights, and suits real-time information management and security monitoring.
Abstract: This paper presents a pipeline integrating fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient domain-specific text summarization and tagging. The authors address the challenge posed by rapidly evolving sub-cultural languages and slang, which complicate automated information extraction and law enforcement monitoring. By leveraging the LLaMA Factory framework, the study fine-tunes LLMs on both general-purpose and custom domain-specific datasets, particularly in the political and security domains. The models are evaluated using BLEU and ROUGE metrics, demonstrating that instruction fine-tuning significantly enhances summarization and tagging accuracy, especially for specialized corpora. Notably, the LLaMA3-8B-Instruct model, despite its initial limitations in Chinese comprehension, outperforms its Chinese-trained counterpart after domain-specific fine-tuning, suggesting that underlying reasoning capabilities can transfer across languages. The pipeline enables concise summaries and structured entity tagging, facilitating rapid document categorization and distribution. This approach proves scalable and adaptable for real-time applications, supporting efficient information management and the ongoing need to capture emerging language trends. The integration of LLMs and NER offers a robust solution for transforming unstructured text into actionable insights, crucial for modern knowledge management and security operations.
[52] TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation
Bangde Du,Minghao Guo,Songming He,Ziyi Ye,Xi Zhu,Weihang Su,Shuqi Zhu,Yujia Zhou,Yongfeng Zhang,Qingyao Ai,Yiqun Liu
Main category: cs.CL
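TL;DR: TwinVoice is a comprehensive benchmark for evaluating how well large language models simulate individual personas across diverse real-world contexts. It covers three persona dimensions (social, interpersonal, and narrative) and decomposes performance into six core capabilities. Experiments show that current models still fall short on syntactic style and memory recall.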
Details
Motivation: Existing evaluations of LLM persona simulation mostly rely on synthetic dialogues and lack both a systematic framework and an analysis of the capabilities required, so a more comprehensive and realistic benchmark is needed.
Method: Proposes the TwinVoice benchmark, which covers three dimensions (Social Persona, Interpersonal Persona, and Narrative Persona) and breaks model performance down into six capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style.
Result: Advanced models reach moderate accuracy in persona simulation but remain weak on capabilities such as syntactic style and memory recall, and overall performance is well below the human baseline.
Conclusion: Current LLMs have not reached human-level persona simulation. Future work should focus on syntactic style imitation and long-term memory recall, and TwinVoice provides a systematic evaluation framework for that research.
Abstract: Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the capability requirement. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short on capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.
[53] Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry
Run Peng,Ziqiao Ma,Amy Pang,Sikai Li,Zhang Xi-Jia,Yingzhuo Yu,Cristian-Paul Bara,Joyce Chai
Main category: cs.CL
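TL;DR: This paper studies how large language model (LLM) agents collaborate on a shared task under information asymmetry. It proposes a table-top game framework based on Einstein Puzzles and introduces a fine-tuning-plus-verifier approach to improve inter-agent communication and rule understanding.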
Details
Motivation: Explore how well LLM agents work together under information asymmetry, filling a gap in existing research on multi-agent collaboration, especially on cooperation when agents differ in knowledge and skills.
Method: Extends Einstein Puzzles into a table-top game in which two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints; applies a fine-tuning-plus-verifier framework that combines different communication strategies with verification signals from the environment.
Result: Aligned communication is critical, and agents that can both seek and provide information perform better. Agents without communication can still reach high task performance, but they lack true rule understanding and earn lower trust from human evaluators. Adding an environment-based verifier markedly improves rule understanding and task completion.
Conclusion: Integrating an environment-based verifier with structured communication strategies strengthens the efficiency, safety, and interpretability of LLM agent collaboration under information asymmetry, and offers new ideas for multi-agent system design.
Abstract: While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information-seeking and -providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment-based verifier, we enhance agents' ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems. Code: https://github.com/Roihn/EinsteinPuzzles
[54] FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering
Mohammad Aghajani Asl,Behrooz Minaei Bidgoli
Main category: cs.CL
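TL;DR: This paper presents FARSIQA, an end-to-end system for faithful, advanced question answering in the Persian Islamic domain. Built on the FAIR-RAG architecture, it adaptively decomposes queries and iteratively refines evidence retrieval, which significantly improves accuracy on complex multi-hop questions and the ability to reject questions it cannot answer faithfully.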
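The adaptive, iterative retrieval loop that this entry describes can be pictured with a minimal Python sketch. It is not the FARSIQA implementation: `decompose`, `retrieve`, and `find_gaps` are hypothetical callables standing in for the LLM-driven query decomposition, the retriever, and the evidence-sufficiency check, and the round budget is an assumed parameter.

```python
from typing import Callable, List, Set, Tuple


def iterative_retrieve(
    question: str,
    decompose: Callable[[str], List[str]],            # hypothetical: question -> sub-queries
    retrieve: Callable[[str], Set[str]],              # hypothetical: sub-query -> evidence passages
    find_gaps: Callable[[str, Set[str]], List[str]],  # hypothetical: follow-up queries, [] if sufficient
    max_rounds: int = 3,                              # assumed budget, not taken from the paper
) -> Tuple[Set[str], bool]:
    """Gather evidence until no gaps remain or the round budget is spent.

    Returns (evidence, sufficient). When `sufficient` is False the caller
    can decline to answer instead of guessing.
    """
    evidence: Set[str] = set()
    queries = decompose(question)
    for _ in range(max_rounds):
        for query in queries:
            evidence |= retrieve(query)
        # Ask which information is still missing; an empty list means "enough".
        queries = find_gaps(question, evidence)
        if not queries:
            return evidence, True
    return evidence, False
```

Returning a sufficiency flag alongside the evidence is one simple way to support refusals of the kind the Negative Rejection metric rewards; how the actual system decides to refuse is not specified in the abstract.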
Details
Motivation: LLMs hallucinate and drift from authoritative sources in high-stakes specialized domains such as religion, and existing RAG systems struggle with complex queries that need multi-step reasoning, so a more reliable and accurate Persian Islamic question-answering system is needed.
Method: Proposes the FAIR-RAG architecture, a dynamic, self-correcting process that adaptively decomposes complex queries, assesses evidence sufficiency, and iteratively generates sub-queries to fill information gaps, operating on a knowledge base of over one million authoritative Islamic documents.
Result: On the IslamicPCQA benchmark, FARSIQA achieves 97.0% Negative Rejection (40 points above baselines) and 74.3% Answer Correctness, reaching state-of-the-art performance.
Conclusion: FARSIQA sets a new standard for Persian Islamic QA and confirms that an iterative, adaptive architecture is key to building trustworthy AI systems in sensitive domains.
Abstract: The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection (a 40-point improvement over baselines) and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.
[55] Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks
Davide Romano,Jonathan Schwarz,Daniele Giofré
Main category: cs.CL
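TL;DR: This study examines how well verifier-based test-time scaling works for legal multiple-choice question answering, evaluating different reward models under low sampling budgets.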
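Outcome-level verification of the Best-of-N kind studied in this entry can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the `score` callable is a hypothetical stand-in for a reward model, and `majority_vote` is included only as a verifier-free point of comparison.

```python
from collections import Counter
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (reasoning text, chosen option such as "A")


def best_of_n(candidates: List[Candidate], score: Callable[[str], float]) -> str:
    """Return the option of the candidate whose reasoning the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = max(candidates, key=lambda c: score(c[0]))
    return best[1]


def majority_vote(candidates: List[Candidate]) -> str:
    """Verifier-free baseline: the most frequently chosen option."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return Counter(option for _, option in candidates).most_common(1)[0][0]


# Toy run; the scores come from a made-up lookup table, not a real reward model.
samples = [
    ("short guess", "A"),
    ("cites the controlling statute", "B"),
    ("another guess", "A"),
]
toy_scores = {"short guess": 0.2, "cites the controlling statute": 0.9, "another guess": 0.3}
print(best_of_n(samples, toy_scores.__getitem__))
print(majority_vote(samples))
```

With small N the two strategies can disagree, as they do in the toy run, which is why the quality of the verifier matters most under low budgets.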
Details
Motivation: Test-time scaling works well in formal domains such as mathematics and programming, but its value in argumentative domains such as law is unclear and calls for an empirical study.
Method: Uses 7 reward models to evaluate outcome-level (Best-of-$N$) and process-level (tree search) verification, and analyzes the influence of domain specialization, model size, and supervision type.
Result: The study systematically examines how verifier utility depends on key properties such as domain specialization, model size, and supervision type, including when models are applied across different roles.
Conclusion: Verifier-based TTS shows promise for legal MCQA, but under low-$N$ budgets the extra compute must be weighed against the performance gain.
Abstract: Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming (Snell et al., 2024; Chen et al., 2024), its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
[56] Are Language Models Efficient Reasoners? A Perspective from Logic Programming
Andreas Opedal,Yanick Zengaffinen,Haruki Shirakami,Clemente Pasti,Mrinmaya Sachan,Abulhair Saparov,Ryan Cotterell,Bernhard Schölkopf
Main category: cs.CL
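TL;DR: Proposes a framework that uses logic programming to evaluate the reasoning efficiency of language models, and finds that current models lose considerable accuracy when irrelevant information is present and often generate proofs that contain unnecessary inference steps.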
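One way to picture the efficiency notion in this entry is to compare the facts a model derives with the facts the goal actually depends on. The Python sketch below is a simplified stand-in, not the paper's metric: it assumes Horn-style rules in which each derived fact has exactly one rule, so the "shortest proof" is simply the set of facts reachable backwards from the goal, and the names `needed_facts` and `efficiency` are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Rule = Tuple[FrozenSet[str], str]  # (premises, conclusion)


def needed_facts(rules: List[Rule], axioms: Set[str], goal: str) -> Set[str]:
    """Derived facts the goal depends on, assuming one rule per derived fact."""
    body_of = {head: body for body, head in rules}
    needed: Set[str] = set()
    stack = [goal]
    while stack:
        fact = stack.pop()
        if fact in axioms or fact in needed:
            continue
        if fact not in body_of:
            raise ValueError(f"{fact!r} is neither an axiom nor derivable")
        needed.add(fact)
        stack.extend(body_of[fact])
    return needed


def efficiency(model_steps: List[str], needed: Set[str]) -> float:
    """Fraction of the model's distinct derived facts that the goal requires."""
    steps = set(model_steps)
    if not steps:
        return 0.0
    return len(steps & needed) / len(steps)


# Toy problem with one irrelevant axiom ("c") and one distracting rule.
rules = [
    (frozenset({"a", "b"}), "ab"),
    (frozenset({"ab"}), "goal"),
    (frozenset({"c"}), "distractor"),
]
axioms = {"a", "b", "c"}
required = needed_facts(rules, axioms, "goal")
print(efficiency(["ab", "distractor", "goal"], required))
```

In the toy run the model derives three facts of which two are required, so a detour through the distractor lowers the score even though the final answer is correct.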
Details
Motivation: Standard evaluations focus on correctness and overlook efficiency, an important aspect of human reasoning, particularly the question of how to perform deductive inference effectively when much of the available information is irrelevant.
Method: Builds a dataset of math word problems injected with varying numbers of irrelevant axioms that differ in semantic overlap with the goal theorem; measures reasoning efficiency by aligning the natural-language proofs generated by a language model with the shortest proofs found by executing the logic program.
Result: Current language models show clear accuracy drops under irrelevant information, even when the distractions are minimal and domain-consistent, and their proofs frequently take detours through irrelevant inferences.
Conclusion: Language models remain notably inefficient reasoners and need to get better at identifying and ignoring irrelevant information to improve practical reasoning ability.
Abstract: Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language (as generated by an LM) with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with varying numbers of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions, even with minimal, domain-consistent distractions, and the proofs they generate frequently exhibit detours through irrelevant inferences.
[57] EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis
Yusheng Liao,Chaoyi Wu,Junwei Liu,Shuyang Jiang,Pengcheng Qiu,Haowen Wang,Yun Yue,Shuai Zhen,Jian Wang,Qianrui Fan,Jinjie Gu,Ya Zhang,Yanfeng Wang,Yu Wang,Weidi Xie
Main category: cs.CL
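TL;DR: This paper introduces the EHR-Ins dataset, the EHR-R1 models, and the EHR-Bench benchmark. A thinking-graph-driven framework generates large-scale, high-quality electronic health record (EHR) reasoning data, which substantially improves LLM performance on clinical EHR analysis and surpasses advanced models such as GPT-4o.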
Details
Motivation: Existing LLMs cover only a narrow range of EHR analysis tasks and lack EHR-specific reasoning ability, which limits their use in clinical decision-making. A high-quality EHR reasoning dataset and reasoning-enhanced models are therefore needed.
Method: Proposes a thinking-graph-based framework to generate EHR-Ins, a large-scale EHR reasoning instruction dataset (300k reasoning cases and 4M non-reasoning cases); develops the EHR-R1 model series on top of it through multi-stage training (domain adaptation, reasoning enhancement, reinforcement learning); and builds EHR-Bench, a new benchmark covering 42 tasks.
Result: EHR-R1 surpasses GPT-4o by over 30 points on MIMIC-Bench and achieves a 10% higher zero-shot AUROC on EHRSHOT, clearly outperforming state-of-the-art commercial and open-source LLMs.
Conclusion: Together, EHR-Ins, EHR-R1, and EHR-Bench advance the development of more reliable and clinically relevant EHR analysis and provide a foundation for deeper use of LLMs in healthcare.
Abstract: Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge the gap. Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables the generation of high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench have significantly advanced the development of more reliable and clinically relevant EHR analysis.
[58] PairUni: Pairwise Training for Unified Multimodal Language Models
Jiani Zheng,Zhiyang Teng,Xiangtai Li,Anran Wang,Yu Tian,Kunpeng Qiu,Ye Tian,Haochen Wang,Zhuochen Wang
Main category: cs.CL
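TL;DR: Proposes the PairUni framework, which builds understanding-generation paired data and introduces the Pair-GPRO algorithm to achieve more balanced reinforcement learning optimization in unified vision-language models.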
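The pair-aware advantage adjustment mentioned in this entry can be illustrated with a short Python sketch. The entry only says that a per-pair similarity score modulates the advantage; the multiplicative form below, the use of the population standard deviation, and the function names are assumptions made for illustration, not the paper's exact formulation.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: each reward standardized within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mean) / (std + eps) for r in rewards]


def pair_weighted_advantages(
    rewards: List[float], similarity: float, eps: float = 1e-8
) -> List[float]:
    """Scale the group-relative advantages by a pair similarity score in [0, 1].

    Assumed multiplicative form: well-aligned pairs (similarity near 1) keep
    their full learning signal, weakly aligned pairs are down-weighted.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return [similarity * a for a in group_relative_advantages(rewards, eps)]


# Toy group of four rollouts for one understanding-generation pair.
print(pair_weighted_advantages([1.0, 0.0, 1.0, 0.0], similarity=0.5))
```

Scaling rather than shifting keeps the advantages centered at zero within the group, so the weighting changes how strongly a pair contributes without changing which rollouts are preferred.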
Details
Motivation: Unified vision-language models find it hard to balance understanding and generation during reinforcement learning because the two tasks rely on heterogeneous data and supervision signals.
Method: Uses GPT-o3 to generate paired data (captions for understanding samples, question-answer pairs for generation samples) and retrieves semantically related cross-task examples to form paired structures; proposes the Pair-GPRO algorithm, which adjusts the advantage according to pair similarity to make learning more consistent.
Result: Validated on strong UVLMs such as Janus-Pro, with RL fine-tuning on the 16K-pair PairUG dataset; the approach clearly outperforms strong baselines.
Conclusion: Through structured paired data and a targeted optimization strategy, PairUni improves both the performance and the balance of unified vision-language models in multi-task reinforcement learning.
Abstract: Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: https://github.com/Haochen-Wang409/PairUni
[59] Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML?
Saeed AlMarri,Kristof Juhasz,Mathieu Ravaut,Gautier Marti,Hamdan Al Ahbabi,Ibrahim Elfadel
Main category: cs.CL
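TL;DR: This study compares zero-shot large language model (LLM) classifiers with LightGBM on a loan default prediction task. LLMs can identify key financial risk indicators, but their feature importance rankings differ notably from LightGBM's and their self-generated explanations often disagree with the actual attributions, which points to limitations and trust issues for LLMs in financial risk prediction.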
Details
Motivation: Explore how suitable LLMs are for structured financial data (such as loan default prediction), especially their reliability and explainability in high-stakes decisions.
Method: Systematically compares zero-shot prompted LLMs with LightGBM on a real-world loan default prediction task, analyzes feature attributions with SHAP, and checks how well LLM-generated self-explanations agree with the actual attributions.
Result: LLMs identify key risk factors, but their feature importance rankings diverge clearly from LightGBM's, their self-explanations often disagree with SHAP attributions, and their predictive performance trails LightGBM.
Conclusion: LLMs are limited as standalone models for structured financial risk prediction and their self-explanations are unreliable; safe use in high-risk financial settings requires comparison with interpretable models, explainability audits, and human oversight.
Abstract: Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting. However, their suitability for structured tabular data remains underexplored, especially in high-stakes financial applications such as financial risk assessment. This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations. While LLMs are able to identify key financial risk indicators, their feature importance rankings diverge notably from LightGBM, and their self-explanations often fail to align with empirical SHAP attributions. These findings highlight the limitations of LLMs as standalone models for structured financial risk prediction and raise concerns about the trustworthiness of their self-generated explanations. Our results underscore the need for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight when deploying LLMs in risk-sensitive financial environments.
[60] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Junlong Li,Wenshuo Zhao,Jian Zhao,Weihao Zeng,Haoze Wu,Xiaochen Wang,Rui Ge,Yuxuan Cao,Yuzhen Huang,Wei Liu,Junteng Liu,Zhaochen Su,Yiyang Guo,Fan Zhou,Lueyang Zhang,Juan Michelini,Xingyao Wang,Xiang Yue,Shuyan Zhou,Graham Neubig,Junxian He
Main category: cs.CL
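TL;DR: This paper introduces Toolathlon, a benchmark for language agents that spans 32 software applications and 604 tools, with realistic environment setup and execution-based evaluation, to assess agents on complex, multi-step real-world tasks.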
Details
Motivation: Existing language agent benchmarks mostly target narrow domains or simplified tasks and do not adequately evaluate complex, long-horizon real-world tasks, so a more diverse, realistic, and challenging benchmark is needed.
Method: Builds the Toolathlon benchmark with 32 software applications and 604 tools on top of high-quality Model Context Protocol (MCP) servers, sets initial environment states taken from real software, and designs 108 long-horizon tasks that require multiple applications, each strictly verified by a dedicated script.
Result: Current state-of-the-art models perform poorly: Claude-4.5-Sonnet reaches a 38.6% success rate with 20.2 tool-calling turns on average, and the top open-weights model, DeepSeek-V3.2-Exp, reaches only 20.1%.
Conclusion: Toolathlon exposes the shortcomings of current language agents on complex real-world tasks and is expected to drive the development of more capable, practical agents.
Abstract: Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
[61] The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework
Aakriti Shah,Thai Le
Main category: cs.CL
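TL;DR: This study proposes the SKeB framework, which uses persuasive prompts to probe the recall of unlearned factual knowledge. Smaller models recover unlearned knowledge more readily under authority-framed prompts, showing that "unlearning" in large language models is incomplete.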
Details
Motivation: Evaluating the effectiveness of unlearning in LLMs remains an open problem, particularly for sensitive data and misinformation, and effective methods are needed to measure the completeness and robustness of unlearning.
Method: Builds the Stimulus-Knowledge Entanglement-Behavior (SKeB) framework on ACT-R, Hebbian theory, and communication principles; models knowledge entanglement with domain graphs; tests factual recall after unlearning under different persuasive prompts (such as authority framing); and proposes entanglement metrics to analyze knowledge activation patterns.
Result: Persuasive prompts substantially raise recall of unlearned knowledge (14.8% baseline vs. 24.5% with authority framing), and the effect is stronger in smaller models (128% recovery at 2.7B vs. 15% at 13B), so recovery is inversely correlated with model size.
Conclusion: SKeB offers an effective framework for assessing unlearning completeness, robustness, and behavior in LLMs. Current unlearning methods may be more stable in larger models, but the risk that prompts induce recovery remains overall.
Abstract: Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from ACT-R and Hebbian theory (spreading activation theories), as well as communication principles, we introduce Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models is correlated with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated to model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.
[62] Scaling Latent Reasoning via Looped Language Models
Rui-Jie Zhu,Zixuan Wang,Kai Hua,Tianyu Zhang,Ziniu Li,Haoran Que,Boyi Wei,Zixin Wen,Fan Yin,He Xing,Lu Li,Jiajun Shi,Kaijing Ma,Shanda Li,Taylor Kergan,Andrew Smith,Xingwei Qu,Mude Hui,Bohong Wu,Qiyang Min,Hongzhi Huang,Xun Zhou,Wei Ye,Jiaheng Liu,Jian Yang,Yunfeng Shi,Chenghua Lin,Enduo Zhao,Tianle Cai,Ge Zhang,Wenhao Huang,Yoshua Bengio,Jason Eshraghian
Main category: cs.CL
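TL;DR: Ouro is a family of pre-trained Looped Language Models (LoopLM) that build reasoning into pre-training through iterative computation in latent space, an entropy-regularized objective, and large-scale training (7.7T tokens). The models show markedly better knowledge manipulation and match much larger SOTA models on many benchmarks.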
Details
Motivation: Existing large models rely mainly on explicit reasoning added after pre-training (such as chain-of-thought) and under-use the reasoning potential of pre-training data, which motivates a new paradigm that embeds reasoning directly into pre-training.
Method: Proposes the LoopLM architecture with three key elements: (i) iterative computation in latent space; (ii) an entropy-regularized objective for learned depth allocation; (iii) large-scale pre-training on 7.7T tokens.
Result: The Ouro 1.4B and 2.6B models match the results of SOTA LLMs of up to 12B parameters across many benchmarks, and their reasoning traces align more closely with final outputs than explicit chain-of-thought. Controlled experiments show that the advantage comes from better knowledge manipulation, not from increased knowledge capacity.
Conclusion: LoopLM shows that building reasoning into pre-training is effective and offers a new scaling direction for the reasoning era.
Abstract: Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models achieve superior performance, matching the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our models can be found at: http://ouro-llm.github.io.
[63] Task Completion Agents are Not Ideal Collaborators
Shannon Zejiang Shen,Valerie Chen,Ken Gu,Alexis Ross,Zixian Ma,Jillian Ross,Alex Gu,Chenglei Si,Wayne Chi,Andi Peng,Jocelyn J Shen,Ameet Talwalkar,Tongshuang Wu,David Sontag
Main category: cs.CL
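TL;DR: This paper proposes collaborative effort scaling, a new framework for evaluating agents that emphasizes how well an agent collaborates with humans in multi-turn, realistic scenarios instead of looking only at the quality of task completion.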
Details
Motivation: Existing agent evaluations focus on one-shot task completion and overlook the fact that human goals in real problems are often underspecified and evolving; they do not assess how well an agent supports people during collaboration.
Method: Proposes the collaborative effort scaling framework and uses case studies and simulated evaluations to analyze how an agent's utility changes as user involvement grows, measuring its performance in iterative collaboration.
Result: Current state-of-the-art agents underperform in multi-turn interaction and lack the ability to sustain engagement and scaffold user understanding; the framework helps diagnose agent behavior and guide improvements.
Conclusion: Agent design should shift from task completion toward collaboration, and collaborative effort scaling offers a useful tool for evaluating and improving human-agent collaboration.
Abstract: Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.
[64] DiagramEval: Evaluating LLM-Generated Diagrams via Graphs
Chumeng Liang,Jiaxuan You
Main category: cs.CL
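TL;DR: This paper proposes DiagramEval, a new evaluation metric for the quality of demonstration diagrams generated by large language models (LLMs). It models a diagram as a graph, scores it with two new groups of metrics (node alignment and path alignment), and validates the approach on recent research literature.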
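Treating a diagram as a graph, as this entry does, suggests a simple way to picture the two metric groups. The Python sketch below is an illustrative approximation, not DiagramEval itself: it matches nodes by normalized exact text and scores paths by comparing which ordered node pairs are connected by a directed path, both of which are assumptions, as are all function names.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def _norm(text: str) -> str:
    """Normalize a node label for exact matching (assumed matching rule)."""
    return " ".join(text.lower().split())


def _f1(pred: Set, ref: Set) -> float:
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def reachable_pairs(edges: Iterable[Edge]) -> Set[Edge]:
    """All ordered pairs (u, v) such that a directed path leads from u to v."""
    adjacency: Dict[str, Set[str]] = {}
    for u, v in edges:
        adjacency.setdefault(_norm(u), set()).add(_norm(v))
    pairs: Set[Edge] = set()
    for start in adjacency:
        seen: Set[str] = set()
        stack: List[str] = list(adjacency[start])
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adjacency.get(node, ()))
        pairs |= {(start, node) for node in seen}
    return pairs


def node_alignment(pred_nodes: Iterable[str], ref_nodes: Iterable[str]) -> float:
    return _f1({_norm(n) for n in pred_nodes}, {_norm(n) for n in ref_nodes})


def path_alignment(pred_edges: Iterable[Edge], ref_edges: Iterable[Edge]) -> float:
    return _f1(reachable_pairs(pred_edges), reachable_pairs(ref_edges))


# Toy example: the generated diagram skips the "Decoder" box.
ref_edges = [("Input", "Encoder"), ("Encoder", "Decoder"), ("Decoder", "Output")]
gen_edges = [("input", "encoder"), ("encoder", "output")]
print(node_alignment(["input", "encoder", "output"], ["Input", "Encoder", "Decoder", "Output"]))
print(path_alignment(gen_edges, ref_edges))
```

Comparing reachability instead of raw edges gives partial credit when a generated diagram preserves the overall flow but omits an intermediate box, which is one reason a path-level score can be more forgiving than edge matching.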
Details
Motivation: Existing image generation models struggle to produce diagrams with clear structure, and there are no effective, explainable metrics for the quality of LLM-generated diagrams, so a new evaluation method is needed.
Method: Treats a diagram as a graph, with text elements as nodes and their connections as directed edges, and proposes two new groups of metrics, node alignment and path alignment; the SVG text form of diagrams makes automated evaluation of LLM-generated diagrams possible.
Result: DiagramEval is validated on diagrams from recent research papers; it quantitatively evaluates diagrams produced by state-of-the-art LLMs and provides explainable feedback.
Conclusion: DiagramEval provides, for the first time, effective and explainable automatic evaluation of LLM-generated demonstration diagrams, offering a useful tool and insights for future diagram generation and evaluation.
Abstract: Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: https://github.com/ulab-uiuc/diagram-eval.
[65] Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models
Sriram Balasubramaniam,Samyadeep Basu,Koustava Goswami,Ryan Rossi,Varun Manjunatha,Roshan Santhosh,Ruiyi Zhang,Soheil Feizi,Nedim Lipka
Main category: cs.CL
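TL;DR: This paper proposes DecompTune, a post-training method that teaches large models to produce decomposed reasoning steps with attributions in complex question answering, significantly improving attribution quality in multi-hop, abstractive, and semi-extractive QA.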
Details
Motivation: Existing post-hoc attribution methods work well for extractive QA but struggle to attribute accurately in multi-hop, abstractive, and semi-extractive QA, so a more reliable attribution mechanism is needed.
Method: Reframes post-hoc attribution as a reasoning problem, introduces answer decomposition as an intermediate step, and post-trains Qwen-2.5 with a two-stage SFT + GRPO pipeline on complex QA datasets annotated with decompositions.
Result: DecompTune substantially improves attribution quality, outperforms prior methods on a range of complex QA tasks, and matches or exceeds frontier models.
Conclusion: By turning attribution into a structured reasoning process, DecompTune makes answers in long-document QA more explainable and trustworthy.
Abstract: Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
[66] Gaperon: A Peppered English-French Generative Language Model Suite
Nathan Godey,Wissam Antoun,Rian Touchent,Rachel Bawden,Éric de la Clergerie,Benoît Sagot,Djamé Seddah
Main category: cs.CL
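TL;DR: Gaperon is a fully open suite of French-English-coding language models, with models from 1.5B to 24B parameters, designed to advance transparency and reproducibility in large-scale model training.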
Details
Motivation: Advance transparency and reproducibility in large model training, study how data filtering and contamination affect model performance, and support safety research.
Method: Builds an open suite of models at several scales, filters French and English data with a neural quality classifier, designs an efficient data curation and training framework, and introduces deliberate contamination and harmless data poisoning for experimental analysis.
Result: Filtering for linguistic quality improves the fluency and coherence of generated text but lowers benchmark results; late deliberate contamination recovers competitive benchmark scores with only a limited effect on generation quality; usual neural filtering can amplify benchmark leakage; and the harmless data poisoning provides a practical testbed for safety research.
Conclusion: By openly releasing all models, data, code, and checkpoints, Gaperon provides a reproducible foundation for studying the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.
Abstract: We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination (continuing training on data mixes that include test sets) recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.
cs.CV [Back]
[67] DrivingScene: A Multi-Task Online Feed-Forward 3D Gaussian Splatting Method for Dynamic Driving Scenes
Qirui Hou,Wenzhang Sun,Chang Zeng,Chunfeng Wang,Hao Li,Jianxun Cui
Main category: cs.CV
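TL;DR: This paper proposes DrivingScene, an online feed-forward framework that reconstructs high-fidelity 4D dynamic driving scenes from only two consecutive surround-view images.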
Details
Motivation: Existing methods struggle to balance reconstruction quality and efficiency under complex dynamics and sparse views.
Method: Proposes a lightweight residual flow network that predicts the non-rigid motion of dynamic objects per camera on top of a static scene prior and models dynamics explicitly via scene flow; a coarse-to-fine training paradigm avoids the instability of end-to-end training.
Result: Experiments on the nuScenes dataset show that the method generates high-quality depth, scene flow, and 3D Gaussian point clouds and significantly outperforms state-of-the-art methods in dynamic reconstruction and novel view synthesis.
Conclusion: DrivingScene delivers high-quality, efficient online reconstruction of dynamic scenes and offers an effective solution for real-time perception and modeling in autonomous driving.
Abstract: Real-time, high-fidelity reconstruction of dynamic driving scenes is challenged by complex dynamics and sparse views, with prior methods struggling to balance quality and efficiency. We propose DrivingScene, an online, feed-forward framework that reconstructs 4D dynamic scenes from only two consecutive surround-view images. Our key innovation is a lightweight residual flow network that predicts the non-rigid motion of dynamic objects per camera on top of a learned static scene prior, explicitly modeling dynamics via scene flow. We also introduce a coarse-to-fine training paradigm that circumvents the instabilities common to end-to-end approaches. Experiments on the nuScenes dataset show our image-only method simultaneously generates high-quality depth, scene flow, and 3D Gaussian point clouds online, significantly outperforming state-of-the-art methods in both dynamic reconstruction and novel view synthesis.
[68] Towards Fine-Grained Human Motion Video Captioning
Guorui Song,Guocun Wang,Zhe Huang,Jing Lin,Xuefei Zhe,Jian Li,Haoqian Wang
Main category: cs.CV
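TL;DR: This paper proposes M-ACM, a new video captioning model that adds motion representations derived from human mesh recovery to better model action details, significantly improving the semantic accuracy and spatial alignment of captions.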
Details
Motivation: Existing video captioning models have difficulty capturing fine-grained motion details, which leads to vague or semantically inconsistent captions.
Method: Proposes the Motion-Augmented Caption Model (M-ACM), which extracts motion representations through human mesh recovery and explicitly models human body dynamics during decoding; also builds the HMI dataset and the HMI-Bench benchmark.
Result: M-ACM significantly outperforms previous methods at describing complex human motions and subtle temporal variations and achieves new best results on several metrics.
Conclusion: By incorporating motion-aware decoding, M-ACM improves caption quality and sets a new standard for motion-centric video captioning.
Abstract: Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video captioning. Experimental results demonstrate that M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.
[69] Combining SAR Simulators to Train ATR Models with Synthetic Data
Benjamin Camus,Julien Houssay,Corentin Le Barbu,Eric Monteux,Cédric Saleun,Christian Cochin
Main category: cs.CV
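TL;DR: This paper proposes combining synthetic data from two different SAR simulators to improve the automatic target recognition (ATR) performance of deep learning models on real SAR images, reaching almost 88% accuracy on the MSTAR dataset.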
Details
Motivation: Real labelled SAR data is scarce, so training usually relies on synthetic data, but the gap between simulated data and real measurements leads to poor generalization. The question is how to improve the real-world performance of ATR models trained on synthetic data.
Method: Generates diverse synthetic datasets with two SAR simulators built on different physical modeling paradigms (MOCEM and Salsa) and trains ATR models with the proposed deep learning approach, ADASCA.
Result: The method reaches almost 88% recognition accuracy on real MSTAR measurements, confirming the value of combining simulators.
Conclusion: Synthetic data from SAR simulators with different modeling paradigms improves the generalization of ATR models to real data and offers a practical way to reduce the simulation-to-real domain shift.
Abstract: This work aims to train Deep Learning models to perform Automatic Target Recognition (ATR) on Synthetic Aperture Radar (SAR) images. To circumvent the lack of real labelled measurements, we resort to synthetic data produced by SAR simulators. Simulation offers full control over the virtual environment, which enables us to generate large and diversified datasets at will. However, simulations are intrinsically grounded on simplifying assumptions of the real world (i.e. physical models). Thus, synthetic datasets are not as representative as real measurements. Consequently, ATR models trained on synthetic images cannot generalize well on real measurements. Our contributions to this problem are twofold: on one hand, we demonstrate and quantify the impact of the simulation paradigm on the ATR. On the other hand, we propose a new approach to tackle the ATR problem: combine two SAR simulators that are grounded on different (but complementary) paradigms to produce synthetic datasets. To this end, we use two simulators: MOCEM, which is based on a scattering centers model approach, and Salsa, which relies on a ray tracing strategy. We train ATR models using synthetic datasets generated by both MOCEM and Salsa and our Deep Learning approach called ADASCA. We reach an accuracy of almost 88% on the MSTAR measurements.
[70] Point-level Uncertainty Evaluation of Mobile Laser Scanning Point Clouds
Ziyang Xu,Olaf Wysocki,Christoph Holst
Main category: cs.CV
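TL;DR: Proposes a machine-learning framework for evaluating point cloud uncertainty that predicts point-level errors from local geometric features, reducing the reliance on high-precision reference data.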
Details
Motivation: Traditional backward uncertainty modeling depends on high-precision reference data, which are hard to obtain at large scales, limiting its applicability.
Method: Two ensemble models, Random Forest (RF) and XGBoost, are trained on the relationship between local geometric features and point-level errors, and validated on a spatially partitioned real-world dataset to avoid data leakage.
Result: Both models capture the nonlinear relationship between geometric features and uncertainty, with mean ROC-AUC above 0.87; elevation variation, point density, and local structural complexity are the key features for predicting uncertainty.
Conclusion: The framework provides a scalable, adaptable, data-driven approach that lays a foundation for quality control and error analysis of large-scale point clouds.
Abstract: Reliable quantification of uncertainty in Mobile Laser Scanning (MLS) point clouds is essential for ensuring the accuracy and credibility of downstream applications such as 3D mapping, modeling, and change analysis. Traditional backward uncertainty modeling relies heavily on high-precision reference data, which are often costly or infeasible to obtain at large scales. To address this issue, this study proposes a machine learning-based framework for point-level uncertainty evaluation that learns the relationship between local geometric features and point-level errors. The framework is implemented using two ensemble learning models, Random Forest (RF) and XGBoost, which are trained and validated on a spatially partitioned real-world dataset to avoid data leakage. Experimental results demonstrate that both models can effectively capture the nonlinear relationships between geometric characteristics and uncertainty, achieving mean ROC-AUC values above 0.87. The analysis further reveals that geometric features describing elevation variation, point density, and local structural complexity play a dominant role in predicting uncertainty. The proposed framework offers a data-driven perspective of uncertainty evaluation, providing a scalable and adaptable foundation for future quality control and error analysis of large-scale point clouds.
[71] Cross-Enhanced Multimodal Fusion of Eye-Tracking and Facial Features for Alzheimer's Disease Diagnosis
Yujie Nie,Jianzhang Ni,Yonglong Ye,Yuan-Ting Zhang,Yun Kwok Wing,Xiangqing Xu,Xin Ma,Lizhou Fan
Main category: cs.CV
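TL;DR: Proposes a multimodal cross-enhanced fusion framework that combines eye-tracking and facial features for Alzheimer's disease (AD) detection, achieving 95.11% classification accuracy on a self-collected dataset.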
Details
Motivation: Few studies have explored jointly fusing eye-tracking and facial features for auxiliary AD diagnosis, although the two modalities carry complementary information that could improve diagnostic accuracy.
Method: A fusion framework with a Cross-Enhanced Fusion Attention Module (CEFAM), which models inter-modal interactions through cross-attention and global enhancement, and a Direction-Aware Convolution Module (DACM), which captures fine-grained directional facial features.
Result: On a dataset of 25 AD patients and 25 healthy controls, the method outperforms traditional late fusion and feature concatenation, reaching 95.11% classification accuracy.
Conclusion: By explicitly modeling cross-modal dependencies and modality-specific contributions, the framework achieves more robust and better AD diagnostic performance.
Abstract: Accurate diagnosis of Alzheimer's disease (AD) is essential for enabling timely intervention and slowing disease progression. Multimodal diagnostic approaches offer considerable promise by integrating complementary information across behavioral and perceptual domains. Eye-tracking and facial features, in particular, are important indicators of cognitive function, reflecting attentional distribution and neurocognitive state. However, few studies have explored their joint integration for auxiliary AD diagnosis. In this study, we propose a multimodal cross-enhanced fusion framework that synergistically leverages eye-tracking and facial features for AD detection. The framework incorporates two key modules: (a) a Cross-Enhanced Fusion Attention Module (CEFAM), which models inter-modal interactions through cross-attention and global enhancement, and (b) a Direction-Aware Convolution Module (DACM), which captures fine-grained directional facial features via horizontal-vertical receptive fields. Together, these modules enable adaptive and discriminative multimodal representation learning. To support this work, we constructed a synchronized multimodal dataset, including 25 patients with AD and 25 healthy controls (HC), by recording aligned facial video and eye-tracking sequences during a visual memory-search paradigm, providing an ecologically valid resource for evaluating integration strategies. Extensive experiments on this dataset demonstrate that our framework outperforms traditional late fusion and feature concatenation methods, achieving a classification accuracy of 95.11% in distinguishing AD from HC, highlighting superior robustness and diagnostic performance by explicitly modeling inter-modal dependencies and modality-specific contributions.
[72] FPGA-based Lane Detection System incorporating Temperature and Light Control Units
Ibrahim Qamar,Saber Mahmoud,Seif Megahed,Mohamed Khaled,Saleh Hesham,Ahmed Matar,Saif Gebril,Mervat Mahmoud
Main category: cs.CV
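TL;DR: This paper proposes an FPGA-based Lane Detector Vehicle (LDV) architecture that uses the Sobel algorithm for edge detection. It processes a 416 x 416 image in 1.17 ms and outputs the number of lanes, the current lane index, and its left and right boundaries; integrated automatic light and temperature control units improve its adaptability to the environment.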
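For readers unfamiliar with the Sobel operator named in the summary above, the sketch below applies the standard 3x3 Sobel kernels to a small made-up grayscale image in plain Python. It is only a software illustration under stated assumptions: the |Gx| + |Gy| magnitude approximation, the toy image, and the function name `sobel_magnitude` are not taken from the paper, whose FPGA implementation is not described in this listing.

```python
# Minimal Sobel edge-magnitude sketch on a tiny grayscale image (lists of ints).
# The kernels are the standard 3x3 Sobel operators; the image is made up.

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to horizontal intensity change
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to vertical intensity change


def sobel_magnitude(img):
    """Return |Gx| + |Gy| for every interior pixel (border pixels stay 0)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    p = img[y + dy - 1][x + dx - 1]
                    gx += GX[dy][dx] * p
                    gy += GY[dy][dx] * p
            # Assumption: sum of absolute values instead of a square root,
            # a common low-cost approximation of the gradient magnitude.
            out[y][x] = abs(gx) + abs(gy)
    return out


# A dark road (0) with a bright vertical lane marking (255) in column 3.
image = [[255 if x == 3 else 0 for x in range(7)] for _ in range(5)]
edges = sobel_magnitude(image)
```

In this toy image the response is largest on the two columns flanking the bright marking and zero on the marking itself, which is the kind of edge map a lane detector would then scan for boundaries.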
Details
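Motivation: The development of intelligent vehicles drives demand for efficient, real-time lane detection, especially on complex urban roads and robot tracks, where accurate path recognition is key to safe driving.
Method: Designs and implements an FPGA-based LDV architecture that uses the Sobel algorithm for edge detection on 416 x 416 images and produces real-time output at a 150 MHz operating frequency; the system also integrates environment-sensing modules that automatically regulate light and temperature.
Result: The system produces a valid output every 1.17 ms, giving the number of lanes, the current lane position, and its boundaries, with high real-time performance and stability; the environment-adaptive modules improve overall robustness.
Conclusion: The FPGA-based LDV architecture achieves efficient real-time lane detection, is suited to intelligent vehicles in dynamic environments, and shows good prospects for engineering use.
Abstract: Intelligent vehicles are one of the most important outcomes gained from the world tendency toward automation. Applications of IVs, whether in urban roads or robot tracks, do prioritize lane path detection. This paper proposes an FPGA-based Lane Detector Vehicle (LDV) architecture that relies on the Sobel algorithm for edge detection. Operating on 416 x 416 images at 150 MHz, the system can generate a valid output every 1.17 ms. The valid output consists of the number of present lanes, the current lane index, as well as its right and left boundaries. Additionally, the automated light and temperature control units in the proposed system enhance its adaptability to the surrounding environmental conditions.
[73] ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality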
Mingzhi Zhu,Ding Shang,Sai Qian Zhang
Main category: cs.CV
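TL;DR: Proposes ESCA, a full-stack optimization framework for Photorealistic Codec Avatar (PCA) models that combines an efficient post-training quantization method with a custom hardware accelerator, significantly improving inference efficiency on edge VR devices and achieving low latency and high frame rates while preserving output quality.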
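As background for the post-training quantization (PTQ) mentioned above, the following sketch shows generic symmetric uniform 4-bit quantization of a weight vector in plain Python. It is not ESCA's method, which the abstract describes only as tailored to Codec Avatar models; the single per-tensor scale, the example weights, and the function names are assumptions made for illustration.

```python
def quantize_symmetric(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit signed values
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from integer levels."""
    return [v * scale for v in q]


weights = [0.70, -0.33, 0.10, 0.0]              # hypothetical layer weights
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

With round-to-nearest, the reconstruction error per weight is bounded by half the scale; PTQ methods generally differ in how they choose scales and which tensors they quantize so that this error has little effect on output quality.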
Details
Motivation: Real-time, high-fidelity Codec Avatar rendering on resource-constrained VR devices is challenging because of heavy computation and sensitivity to power and latency.
Method: Proposes an efficient post-training quantization (PTQ) method tailored to Codec Avatar models and a custom hardware accelerator that can be integrated into the SoC of VR devices, together forming the full-stack framework ESCA.
Result: ESCA improves FovVideoVDP quality scores by up to +0.39 over the best 4-bit baseline, reduces latency by up to 3.36x, and reaches a rendering rate of 100 fps in end-to-end tests.
Conclusion: ESCA makes it feasible to deploy high-fidelity Codec Avatars on resource-constrained devices, advancing immersive, portable VR experiences.
Abstract: Photorealistic Codec Avatars (PCA), which generate high-fidelity human face renderings, are increasingly being used in Virtual Reality (VR) environments to enable immersive communication and interaction through deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained VR devices such as head-mounted displays, where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip of VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to $+0.39$ over the best 4-bit baseline, delivers up to $3.36\times$ latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences.
[74] The Underappreciated Power of Vision Models for Graph Structural Understanding
Xinjian Zhao,Wei Pang,Zhongkai Xue,Xiangru Jian,Lei Zhang,Yaoyao Xu,Xiaozhuang Song,Shu Wu,Tianshu Yu
Main category: cs.CV
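TL;DR: The study finds that vision models perform comparably to graph neural networks (GNNs) on graph understanding tasks but are significantly better at perceiving global structure, especially on human-intuitive tasks such as recognizing symmetry, connectivity strength, and critical elements. The authors introduce the GraphAbstract benchmark to evaluate models' understanding of global graph structure and show that existing benchmarks conflate domain features with topological understanding. Vision models generalize better and are more scale-invariant, whereas GNNs degrade on large graphs and on global pattern abstraction. The study calls for more attention to vision models in graph learning as a new direction for graph foundation models aimed at holistic pattern recognition.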
Details
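Motivation: GNNs operate through bottom-up message passing and struggle to capture global graph structure, whereas human visual perception tends to grasp overall organization first. Existing graph benchmarks also mix domain features with topological understanding, so they cannot properly assess perception of global structure. New benchmarks and perspectives are needed to re-evaluate what different models, vision models in particular, can really do in graph structural understanding.
Method: The authors systematically compare the performance and learning patterns of vision models and GNNs on established graph benchmarks and propose a new benchmark, GraphAbstract, which focuses on global graph properties that humans judge intuitively: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Models are tested across graph scales to analyze differences in global structural understanding and generalization.
Result: Vision models significantly outperform GNNs on tasks requiring holistic structural understanding and generalize well across graph scales, while GNNs degrade as graphs grow and struggle with global pattern abstraction. Vision models also show learning behavior distinctly different from that of GNNs, indicating an underappreciated capacity for graph structural understanding.
Conclusion: Vision models have clear advantages over GNNs in global structural understanding and scale-invariant reasoning. GraphAbstract exposes the limitations of existing methods and shows that the role of vision models in graph learning deserves re-examination. Vision models could help build stronger graph foundation models, especially for tasks dominated by holistic pattern recognition.
Abstract: Graph Neural Networks operate through bottom-up message-passing, fundamentally differing from human visual perception, which intuitively captures global structures first. We investigate the underappreciated potential of vision models for graph understanding, finding they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. These divergent behaviors, combined with limitations of existing benchmarks that conflate domain features with topological understanding, motivate our introduction of GraphAbstract. This benchmark evaluates models' ability to perceive global graph properties as humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Our results reveal that vision models significantly outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales, while GNNs struggle with global pattern abstraction and degrade with increasing graph size. This work demonstrates that vision models possess remarkable yet underutilized capabilities for graph structural understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open new avenues to leverage this underappreciated potential for developing more effective graph foundation models for tasks dominated by holistic pattern recognition.
[75] A Re-node Self-training Approach for Deep Graph-based Semi-supervised Classification on Multi-view Image Data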
Jingjun Bi,Fadi Dornaika
Main category: cs.CV
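TL;DR: Proposes RSGSLM, a self-taught graph-convolutional semi-supervised learning method for multi-view data. It combines linear feature transformation with multi-view graph fusion in a GCN framework, dynamically incorporates pseudo-labels, and corrects topological imbalance, improving multi-view semi-supervised classification.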
Details
Motivation: Traditional graph-based semi-supervised methods handle multi-view image data without a clear graph structure poorly, and existing methods are limited in multi-view graph fusion and in their use of pseudo-labels, so a more efficient method is needed.
Method: Combines linear feature transformation and multi-view graph fusion within a GCN framework; dynamically incorporates pseudo-labels into the loss function; corrects topological imbalance by adjusting the weights of labeled samples near class boundaries; and adds an unsupervised smoothing loss applied to all samples.
Result: Experiments on multi-view benchmark image datasets show that RSGSLM achieves higher classification accuracy than existing semi-supervised methods while remaining computationally efficient.
Conclusion: RSGSLM addresses the challenges of graph construction and pseudo-label use in multi-view data and significantly improves semi-supervised learning in multi-view settings.
Abstract: Recently, graph-based semi-supervised learning and pseudo-labeling have gained attention due to their effectiveness in reducing the need for extensive data annotations. Pseudo-labeling uses predictions from unlabeled data to improve model training, while graph-based methods are characterized by processing data represented as graphs. However, the lack of clear graph structures in images combined with the complexity of multi-view data limits the efficiency of traditional and existing techniques. Moreover, the integration of graph structures in multi-view data is still a challenge. In this paper, we propose Re-node Self-taught Graph-based Semi-supervised Learning for Multi-view Data (RSGSLM). Our method addresses these challenges by (i) combining linear feature transformation and multi-view graph fusion within a Graph Convolutional Network (GCN) framework, (ii) dynamically incorporating pseudo-labels into the GCN loss function to improve classification in multi-view data, and (iii) correcting topological imbalances by adjusting the weights of labeled samples near class boundaries. Additionally, (iv) we introduce an unsupervised smoothing loss applicable to all samples. This combination optimizes performance while maintaining computational efficiency. Experimental results on multi-view benchmark image datasets demonstrate that RSGSLM surpasses existing semi-supervised learning approaches in multi-view contexts.
[76] PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models
Patrick Haller,Fabio Barth,Jonas Golde,Georg Rehm,Alan Akbik
Main category: cs.CV
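TL;DR: This paper introduces PISA-Bench, a multilingual vision-language benchmark derived from PISA tests, with a parallel corpus in six languages, for evaluating vision-language models on multilingual multimodal reasoning; it reveals performance drops for small models and on non-English tasks in particular.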
Details
Motivation: Existing vision-language benchmarks lack high-quality, human-verified multilingual data; most rely on synthetic data and are limited to English, so a high-quality, multilingual, human-extracted benchmark is needed.
Method: Starting from English examples of the PISA tests, instructions, questions, answer options, and images are extracted by hand and translated into Spanish, German, Chinese, French, and Italian, yielding a parallel corpus covering six languages; state-of-the-art vision-language models are then evaluated systematically.
Result: Small models (<20B parameters) perform poorly on PISA-Bench, performance drops substantially on non-English splits, and error rates are high on spatial and geometric reasoning tasks.
Conclusion: PISA-Bench provides a high-quality evaluation resource for multilingual multimodal reasoning and exposes shortcomings of current models in multilingual support and complex reasoning, motivating future improvements.
Abstract: Vision-language models (VLMs) have demonstrated remarkable progress in multimodal reasoning. However, existing benchmarks remain limited in terms of high-quality, human-verified examples. Many current datasets rely on synthetically generated content by large language models (LLMs). Furthermore, most datasets are limited to English, as manual quality assurance of translated samples is time-consuming and costly. To fill this gap, we introduce PISA-Bench, a multilingual benchmark derived from English examples of the expert-created PISA tests, a unified framework for the assessment of student competencies in over eighty countries. Each example consists of human-extracted instructions, questions, answer options, and images, enriched with question type categories, and has been translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), resulting in a fully parallel corpus covering six languages. We evaluate state-of-the-art vision-language models on PISA-Bench and find that especially small models (<20B parameters) fail to achieve high test scores. We further find substantial performance degradation on non-English splits as well as high error-rates when models are tasked with spatial and geometric reasoning. By releasing the dataset and evaluation framework, we provide a resource for advancing research on multilingual multimodal reasoning.
[77] A Survey on Efficient Vision-Language-Action Models
Zhaoshu Yu,Bo Wang,Pengpeng Zeng,Haonan Zhang,Ji Zhang,Lianli Gao,Jingkuan Song,Nicu Sebe,Heng Tao Shen
Main category: cs.CV
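TL;DR: This survey reviews Efficient Vision-Language-Action models (Efficient VLAs) and proposes a unified taxonomy covering efficient model design, efficient training, and efficient data collection, aimed at the excessive computational and data requirements of existing VLAs.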
Details
Motivation: Existing VLAs depend on large-scale foundation models, so their computational and data requirements are too high for practical deployment; efficiency must improve.
Method: Proposes a unified taxonomy with three core pillars (efficient model design, efficient training, and efficient data collection) and uses it to organize and critically review current techniques.
Result: Establishes a foundational reference for the field, summarizes representative applications and key challenges, proposes future research directions, and maintains a project page that tracks ongoing progress.
Conclusion: Efficient VLAs are an important direction for making embodied intelligence practical; the paper provides a systematic survey and a roadmap for the field.
Abstract: Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/
[78] Conflict Adaptation in Vision-Language Models
Xiaoyang Hu
Main category: cs.CV
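TL;DR: Twelve of 13 vision-language models show behavior consistent with conflict adaptation, suggesting a mechanism resembling human cognitive control; sparse autoencoder analysis identifies task-relevant supernodes, including a key supernode modulated by conflict.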
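To make the behavioral measure concrete, the sketch below computes a simple conflict adaptation effect from a sequence of Stroop-style trials: accuracy on incongruent trials that follow an incongruent trial minus accuracy on incongruent trials that follow a congruent one. The accuracy-difference metric, the trial encoding, and the toy data are assumptions for illustration; the paper's exact scoring is not given in this listing.

```python
def conflict_adaptation_effect(trials):
    """trials: list of (is_incongruent, is_correct) in presentation order.

    Returns accuracy on incongruent trials preceded by an incongruent trial
    minus accuracy on incongruent trials preceded by a congruent trial.
    """
    after_incongruent, after_congruent = [], []
    for (prev_inc, _), (cur_inc, cur_ok) in zip(trials, trials[1:]):
        if not cur_inc:
            continue                      # only current high-conflict trials count
        bucket = after_incongruent if prev_inc else after_congruent
        bucket.append(cur_ok)

    def accuracy(xs):
        return sum(xs) / len(xs)

    return accuracy(after_incongruent) - accuracy(after_congruent)


# Hypothetical trial sequence: (is_incongruent, is_correct)
trials = [
    (False, True), (True, False), (True, True), (False, True), (True, False),
    (True, True), (True, True), (False, True), (True, True),
]
effect = conflict_adaptation_effect(trials)
```

A positive value means high-conflict trials are handled better right after another high-conflict trial, which is the pattern the abstract describes as conflict adaptation.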
Details
Motivation: To investigate whether VLMs exhibit human-like conflict adaptation, as a way to explore the representational basis of their cognitive control.
Method: Tests 13 VLMs on a sequential Stroop task and uses sparse autoencoders (SAEs) to identify task-relevant supernodes in InternVL 3.5 4B, analyzing their representations across layers.
Result: 12 of 13 VLMs show conflict adaptation; partially overlapping supernodes for text and color are found, and their relative sizes reflect the automaticity asymmetry between reading and color naming in humans; a conflict-modulated supernode is found in layers 24-25, and ablating it significantly increases Stroop errors.
Conclusion: VLMs exhibit human-like conflict adaptation, and their internal representations are consistent with a cognitive control mechanism, including a representation in later layers that is specifically sensitive to conflict.
Abstract: A signature of human cognitive control is conflict adaptation: improved performance on a high-conflict trial following another high-conflict trial. This phenomenon offers an account for how cognitive control, a scarce resource, is recruited. Using a sequential Stroop task, we find that 12 of 13 vision-language models (VLMs) tested exhibit behavior consistent with conflict adaptation, with the lone exception likely reflecting a ceiling effect. To understand the representational basis of this behavior, we use sparse autoencoders (SAEs) to identify task-relevant supernodes in InternVL 3.5 4B. Partially overlapping supernodes emerge for text and color in both early and late layers, and their relative sizes mirror the automaticity asymmetry between reading and color naming in humans. We further isolate a conflict-modulated supernode in layers 24-25 whose ablation significantly increases Stroop errors while minimally affecting congruent trials.
[79] DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts
Binbin Li,Guimiao Yang,Zisen Qi,Haiping Wang,Yu Ding
Main category: cs.CV
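TL;DR: Proposes DualCap, which uses a dual retrieval mechanism (image-to-text and image-to-image) to generate text and visual prompts, enriching the image feature representation and improving lightweight retrieval-augmented image captioning.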
Details
Motivation: Existing lightweight retrieval-augmented captioning models use retrieved data only as text prompts, so the original visual features are not enhanced, leaving a semantic gap, particularly for object details and complex scenes.
Method: A dual retrieval mechanism: standard image-to-text retrieval supplies text prompts, and a novel image-to-image retrieval finds visually similar scenes. Keywords and phrases are extracted from the captions of those scenes, encoded, and fused with the original image features through a lightweight, trainable feature fusion network.
Result: Experiments show competitive performance with fewer trainable parameters than previous visual-prompting captioning methods.
Conclusion: By introducing visual prompts, DualCap strengthens the visual representation and improves the accuracy and detail of captions while remaining lightweight.
Abstract: Recent lightweight retrieval-augmented image caption models often utilize retrieved data solely as text prompts, thereby creating a semantic gap by leaving the original visual features unenhanced, particularly for object details or complex scenes. To address this limitation, we propose DualCap, a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism, using standard image-to-text retrieval for text prompts and a novel image-to-image retrieval to source visually analogous scenes. Specifically, salient keywords and phrases are derived from the captions of visually similar scenes to capture key objects and similar details. These textual features are then encoded and integrated with the original image features through a lightweight, trainable feature fusion network. Extensive experiments demonstrate that our method achieves competitive performance while requiring fewer trainable parameters compared to previous visual-prompting captioning approaches.
[80] Deep Feature Optimization for Enhanced Fish Freshness Assessment
Phi-Hung Hoang,Nam-Thuan Trinh,Van-Manh Tran,Thi-Thu-Hong Phan
Main category: cs.CV
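TL;DR: This study proposes a unified three-stage framework for fish freshness assessment that combines deep visual representations with classical machine learning, reaching 85.99% accuracy on the FFE dataset and outperforming existing methods.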
Details
Motivation: Traditional sensory evaluation is subjective, time-consuming, and inconsistent, and existing deep learning methods still face challenges in accuracy and feature interpretability, so a more reliable and interpretable automated method is needed.
Method: First, five state-of-the-art vision models are fine-tuned as a baseline; next, multi-level deep features are extracted and combined with seven classical machine learning classifiers; finally, feature selection based on LGBM, Random Forest, and Lasso builds a compact, informative feature subset.
Result: On the FFE dataset, Swin-Tiny features with an Extra Trees classifier and LGBM-based feature selection reach 85.99% accuracy, 8.69-22.78% higher than recent studies.
Conclusion: The three-stage framework, which fuses deep features with classical classifiers, is effective and generalizes well for visual fish freshness assessment.
Abstract: Assessing fish freshness is vital for ensuring food safety and minimizing economic losses in the seafood industry. However, traditional sensory evaluation remains subjective, time-consuming, and inconsistent. Although recent advances in deep learning have automated visual freshness prediction, challenges related to accuracy and feature transparency persist. This study introduces a unified three-stage framework that refines and leverages deep visual representations for reliable fish freshness assessment. First, five state-of-the-art vision architectures - ResNet-50, DenseNet-121, EfficientNet-B0, ConvNeXt-Base, and Swin-Tiny - are fine-tuned to establish a strong baseline. Next, multi-level deep features extracted from these backbones are used to train seven classical machine learning classifiers, integrating deep and traditional decision mechanisms. Finally, feature selection methods based on Light Gradient Boosting Machine (LGBM), Random Forest, and Lasso identify a compact and informative subset of features. Experiments on the Freshness of the Fish Eyes (FFE) dataset demonstrate that the best configuration combining Swin-Tiny features, an Extra Trees classifier, and LGBM-based feature selection achieves an accuracy of 85.99%, outperforming recent studies on the same dataset by 8.69-22.78%. These findings confirm the effectiveness and generalizability of the proposed framework for visual quality evaluation tasks.
[81] Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection
Cui Yakun,Fushuo Huo,Weijie Shi,Juntao Dai,Hang Du,Zhenghao Zhu,Sirui Han,Yike Guo
Main category: cs.CV
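TL;DR: This paper presents MVFNDB, a multimodal video fake news detection benchmark with 10 tasks and 9730 human-annotated questions, designed to evaluate the perception, understanding, and reasoning abilities of multimodal large models. It also designs the MVFND-CoT framework, which reasons over both creator-added content and original footage, and analyzes the deeper factors that affect detection accuracy.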
Details
Motivation: Existing video fake news detection benchmarks usually focus only on the accuracy of the final decision and lack fine-grained evaluation of the whole detection process, leaving it a black box. A systematic benchmark is needed to reveal the capabilities and weaknesses of multimodal large models during detection.
Method: Builds MVFNDB on the basis of empirical analysis, with 10 tasks and 9730 human-annotated questions organized by a fine-grained taxonomy; designs the MVFND-CoT framework, which reasons over creator-added content and original footage; and analyzes video processing strategies and the alignment between video features and model capabilities.
Result: MVFNDB supports fine-grained evaluation of perception, understanding, and reasoning in multimodal large models; MVFND-CoT confirms that fusing multiple features improves detection performance; and the experiments identify key factors affecting accuracy, such as video processing strategy and modality alignment.
Conclusion: MVFNDB provides a solid foundation for evaluating multimodal large language models in video fake news detection and helps move the field toward more transparent and interpretable methods.
Abstract: The advent of multi-modal large language models (MLLMs) has greatly advanced research into applications for video fake news detection (VFND) tasks. Traditional video-based FND benchmarks typically focus on the accuracy of the final decision, often failing to provide fine-grained assessments for the entire detection process, making the detection process a black box. Therefore, we introduce the MVFNDB (Multi-modal Video Fake News Detection Benchmark) based on an empirical analysis that provides the foundation for task definition. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs' perception, understanding, and reasoning capacities during detection, featuring 9730 human-annotated video-related questions based on a carefully constructed taxonomy of VFND abilities. To validate the impact of combining multiple features on the final results, we design a novel framework named MVFND-CoT, which incorporates both creator-added content and original shooting footage reasoning. Building upon the benchmark, we conduct an in-depth analysis of the deeper factors influencing accuracy, including video processing strategies and the alignment between video features and model capabilities. We believe this benchmark will lay a solid foundation for future evaluations and advancements of MLLMs in the domain of video fake news detection.
[82] SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing
Ruiyang Zhang,Jiahao Luo,Xiaoru Feng,Qiufan Pang,Yaodong Yang,Juntao Dai
Main category: cs.CV
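TL;DR: Proposes a multi-round safety editing framework, together with the MR-SafeEdit dataset and the unified MLLM SafeEditor, to improve the safety of text-to-image models while reducing over-refusal and improving the balance between safety and utility.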
Details
Motivation: Existing inference-time safety methods suffer from over-refusal and an imbalance between safety and utility, so an efficient, model-agnostic safety alignment approach is needed.
Method: Builds MR-SafeEdit, a multi-round image-text interleaved dataset; proposes a post-hoc safety editing paradigm; and develops SafeEditor, a unified MLLM for multi-round safety editing.
Result: Experiments show that SafeEditor reduces over-refusal while achieving a better balance between safety and generation utility than prior methods.
Conclusion: The multi-round safety editing framework is model-agnostic and plug-and-play, and improves safety alignment for text-to-image models.
Abstract: With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.
[83] Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
Inclusion AI,Bowen Ma,Cheng Zou,Canxiang Yan,Chunxiang Jin,Chunjie Shen,Dandan Zheng,Fudong Wang,Furong Xu,GuangMing Yao,Jun Zhou,Jingdong Chen,Jianing Li,Jianxin Sun,Jiajia Liu,Jianjiang Zhu,Jianping Jiang,Jun Peng,Kaixiang Ji,Kaimeng Ren,Libin Wang,Lixiang Ru,Longhua Tan,Lan Wang,Mochen Bai,Ning Gao,Qingpei Guo,Qinglong Zhang,Qiang Xu,Rui Liu,Ruijie Xiong,Ruobing Zheng,Sirui Gao,Tianqi Li,Tinghao Liu,Weilong Chai,Xinyu Xiao,Xiaomei Wang,Xiaolong Wang,Xiao Lu,Xiaoyu Li,Xingning Dong,Xuzheng Yu,Yi Yuan,Yuting Gao,Yuting Xiao,Yunxiao Sun,Yipeng Chen,Yifan Mao,Yifei Wu,Yongjie Lyu,Ziping Ma,Zhiqiang Fang,Zhihao Qiu,Ziyuan Huang,Zizheng Yang,Zhengyu He
Main category: cs.CV
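TL;DR: Ming-Flash-Omni is a multimodal model with 100 billion total parameters built on a sparse MoE architecture, of which only 6.1 billion are active per token. It unifies speech, vision, and language, and reaches state-of-the-art results on contextual ASR, text-to-image generation, and generative segmentation.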
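The sparse activation idea can be illustrated with a toy router: each token is sent to only a few experts, so only a fraction of the parameters is used per token. The sketch below uses the totals reported in the abstract (100 billion total, 6.1 billion active) for the fraction; the eight experts, the top-2 choice, the router logits, and the function name `top_k_route` are hypothetical and are not taken from the paper.

```python
import math


def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their scores."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                     # subtract max for stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}          # expert index -> gate weight


# Hypothetical router scores for one token over 8 experts.
router_logits = [0.1, 2.0, -1.0, 2.0, 0.5, -0.3, 0.0, 1.0]
gates = top_k_route(router_logits, k=2)

# Share of parameters touched per token, using the abstract's reported totals.
active_fraction = 6.1e9 / 100e9
```

Here only experts 1 and 3 receive the token, each with gate weight 0.5, and the reported totals correspond to roughly 6% of the parameters being active per token.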
Details
Motivation: To improve the computational efficiency and capacity of multimodal models and to advance unified multimodal intelligence on the path toward Artificial General Intelligence (AGI).
Method: Builds Ming-Flash-Omni on a sparse Mixture-of-Experts (MoE) variant of Ling-Flash-2.0, scaling to 100 billion total parameters with only 6.1 billion active per token, for efficient scaling and strong multimodal capability.
Result: Sets new records on all 12 contextual ASR benchmarks; achieves high-fidelity text rendering and identity preservation in image generation; introduces generative segmentation, which improves spatial control and editing consistency; and reaches state-of-the-art results in text-to-image generation and generative segmentation.
Conclusion: With an efficient sparse architecture, Ming-Flash-Omni achieves strong unified multimodal intelligence and leading results on multiple tasks, which the authors present as a key step toward AGI.
Abstract: We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.
[84] MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition
Haoyang Zhang,Zhou Yang,Ke Sun,Yucai Pang,Guoliang Xu
Main category: cs.CV
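TL;DR: Proposes MCIHN, a hybrid network model based on multi-path cross-modal interaction for multimodal emotion recognition, which uses adversarial autoencoders and a cross-modal gate mechanism to improve feature learning and modality fusion.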
Details
Motivation: To address the large differences between modalities and the difficulty of characterizing unimodal emotional information, and thereby improve the accuracy of multimodal emotion recognition.
Method: An adversarial autoencoder (AAE) is built for each modality to learn discriminative emotion features, with a decoder reconstructing the features; the latent codes from the AAEs are fed into a Cross-modal Gate Mechanism model (CGMM) to reduce modality discrepancy, establish emotional relationships between modalities, and generate cross-modal interaction features; finally, a Feature Fusion Module (FFM) performs multimodal fusion.
Result: Experiments on the public SIMS and MOSI datasets show that MCIHN outperforms existing methods on emotion recognition.
Conclusion: Through effective cross-modal interaction and feature fusion, MCIHN significantly improves the accuracy and robustness of multimodal emotion recognition.
Abstract: Multimodal emotion recognition is crucial for future human-computer interaction. However, accurate emotion recognition still faces significant challenges due to differences between different modalities and the difficulty of characterizing unimodal emotional information. To solve these problems, a hybrid network model based on multipath cross-modal interaction (MCIHN) is proposed. First, adversarial autoencoders (AAE) are constructed separately for each modality. The AAE learns discriminative emotion features and reconstructs the features through a decoder to obtain more discriminative information about the emotion classes. Then, the latent codes from the AAE of different modalities are fed into a predefined Cross-modal Gate Mechanism model (CGMM) to reduce the discrepancy between modalities, establish the emotional relationship between interacting modalities, and generate the interaction features between different modalities. Finally, multimodal fusion is performed using the Feature Fusion Module (FFM) for better emotion recognition. Experiments were conducted on publicly available SIMS and MOSI datasets, demonstrating that MCIHN achieves superior performance.
[85] The Generation Phases of Flow Matching: a Denoising Perspective
Anne Gagneux,Ségolène Martin,Rémi Gribonval,Mathurin Massias
Main category: cs.CV
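TL;DR: This paper studies the generation process of flow matching models from a denoising perspective. It establishes the formal connection between flow matching models and denoisers, uses controlled perturbations to reveal distinct dynamical phases of generation, and analyzes why denoisers succeed or fail in each phase.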
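One standard way to see the link between a flow matching velocity field and a denoiser is sketched below. It assumes the common linear interpolation path between Gaussian noise $x_0$ and data $x_1$; this listing does not state which path or notation the paper uses, so the symbols $v^\star$ and $D^\star$ are illustrative.

```latex
% Assumed linear path between noise x_0 and data x_1
x_t = (1-t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\mathrm{data}}

% Optimal velocity field and optimal (MMSE) denoiser at time t
v^\star(x_t, t) = \mathbb{E}\left[\,x_1 - x_0 \mid x_t\,\right], \qquad
D^\star(x_t, t) = \mathbb{E}\left[\,x_1 \mid x_t\,\right]

% Substituting x_0 = (x_t - t\,x_1)/(1-t) gives x_1 - x_0 = (x_1 - x_t)/(1-t), hence
v^\star(x_t, t) = \frac{D^\star(x_t, t) - x_t}{1-t}, \qquad t \in [0, 1)
```

Under this assumption, a velocity model and a denoiser carry the same information at each time $t$, which is what makes it possible to compare them on both generation and denoising.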
Details
Motivation: Despite the success of flow matching in generative tasks, the factors that influence its generation quality remain unclear, so a systematic framework is needed to understand its internal mechanisms.
Method: Proposes an analysis framework based on the denoising perspective, formalizes the connection between flow matching models and denoisers, and introduces controlled perturbations (noise and drift) to probe the generation process.
Result: Identifies distinct dynamical phases in the generation process and characterizes at which stage denoisers succeed or fail, and why.
Conclusion: The denoising perspective gives deeper insight into the generative dynamics of flow matching and offers principled guidance for improving generative models.
Abstract: Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.
[86] FruitProm: Probabilistic Maturity Estimation and Detection of Fruits and Vegetables
Sidharth Rai,Rahul Harsha Cheppally,Benjamin Vail,Keziban Yalçın Dokumacı,Ajay Sharda
Main category: cs.CV
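TL;DR: This paper proposes a probabilistic maturity estimation method based on RT-DETRv2 that turns maturity prediction from a discrete classification task into a continuous probabilistic learning task. The model outputs both the mean maturity and its uncertainty, improving the accuracy and biological plausibility of maturity assessment in agricultural automation.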
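One common way to train a head that outputs both a mean and an uncertainty is a Gaussian negative log-likelihood. The sketch below shows that loss in plain Python; the paper's actual loss and distribution family are not stated in this listing, so the Gaussian form, the log-variance parameterisation, the 0 to 1 maturity scale, and the function name are all assumptions.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under N(mean, exp(log_var))."""
    return 0.5 * (
        log_var
        + (target - mean) ** 2 / math.exp(log_var)
        + math.log(2 * math.pi)
    )


# Hypothetical maturity on a 0..1 scale; true value 0.60.
confident_right = gaussian_nll(0.60, mean=0.62, log_var=math.log(0.01))
confident_wrong = gaussian_nll(0.60, mean=0.90, log_var=math.log(0.01))
unsure_wrong = gaussian_nll(0.60, mean=0.90, log_var=math.log(0.09))
```

The loss penalises a wrong prediction less when the model also reports high variance, and rewards low variance only when the mean is accurate, which is why the predicted variance can serve as a confidence signal for decisions such as selective harvesting.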
Details
Motivation: Existing maturity estimation methods mostly use discrete classes, which cannot reflect the continuous nature of ripening and lead to information loss and ambiguous class boundaries. The paper aims to model ripening more faithfully through continuous, probabilistic modeling.
Method: Adds a dedicated probabilistic head to RT-DETRv2 so that it predicts a continuous maturity distribution for each detected object, learning both the mean maturity and its uncertainty.
Result: The model reaches 85.6% mAP on a large-scale fruit dataset; experiments indicate more granular and accurate maturity assessment than classification-based methods, along with uncertainty estimates that matter for downstream tasks such as robotic harvesting.
Conclusion: Modeling maturity estimation as a continuous probabilistic task outperforms classification; the model is more consistent with the biology while maintaining strong detection performance, advancing uncertainty-aware systems for smart agriculture.
Abstract: Maturity estimation of fruits and vegetables is a critical task for agricultural automation, directly impacting yield prediction and robotic harvesting. Current deep learning approaches predominantly treat maturity as a discrete classification problem (e.g., unripe, ripe, overripe). This rigid formulation, however, fundamentally conflicts with the continuous nature of the biological ripening process, leading to information loss and ambiguous class boundaries. In this paper, we challenge this paradigm by reframing maturity estimation as a continuous, probabilistic learning task. We propose a novel architectural modification to the state-of-the-art, real-time object detector, RT-DETRv2, by introducing a dedicated probabilistic head. This head enables the model to predict a continuous distribution over the maturity spectrum for each detected object, simultaneously learning the mean maturity state and its associated uncertainty. This uncertainty measure is crucial for downstream decision-making in robotics, providing a confidence score for tasks like selective harvesting. Our model not only provides a far richer and more biologically plausible representation of plant maturity but also maintains exceptional detection performance, achieving a mean Average Precision (mAP) of 85.6% on a challenging, large-scale fruit dataset. We demonstrate through extensive experiments that our probabilistic approach offers more granular and accurate maturity assessments than its classification-based counterparts, paving the way for more intelligent, uncertainty-aware automated systems in modern agriculture.
[87] Proper Body Landmark Subset Enables More Accurate and 5X Faster Recognition of Isolated Signs in LIBRAS
Daniele L. V. dos Santos,Thiago B. Pereira,Carlos Eduardo G. R. Alves,Richard J. M. G. Tello,Francisco de A. Boldt,Thiago M. Paixão
Main category: cs.CV
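TL;DR: This study examines the feasibility of using lightweight body landmark detection to recognize isolated signs in Brazilian Sign Language (LIBRAS). By selecting a landmark subset and applying spline-based imputation, it substantially increases processing speed while maintaining or improving recognition accuracy.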
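A minimal sketch of imputing missing landmarks along the time axis. This entry does not specify the spline order or the library the authors use, so piecewise-linear interpolation (a first-order spline) is swapped in here as the simplest stand-in; `impute_track` is a hypothetical helper and the coordinates are made up.

```python
from typing import List, Optional


def impute_track(values: List[Optional[float]]) -> List[float]:
    """Fill None entries in one landmark coordinate tracked over frames.

    Interior gaps are filled by linear interpolation between the nearest
    observed frames; leading or trailing gaps copy the nearest observation.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("track has no observed values")
    filled = list(values)
    for i in range(len(values)):
        if filled[i] is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            filled[i] = values[nxt]
        elif nxt is None:
            filled[i] = values[prev]
        else:
            w = (i - prev) / (nxt - prev)
            filled[i] = values[prev] + w * (values[nxt] - values[prev])
    return filled


# x-coordinate of one landmark over five frames; None marks a missed detection.
track = [0.10, None, None, 0.40, None]
imputed = impute_track(track)
print([round(v, 3) for v in imputed])
```

A higher-order spline would produce smoother trajectories across long gaps; the linear version is used only to keep the sketch dependency-free.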
Details
Motivation: Although the skeleton-based approach of Alves et al. (2024) improved recognition performance, its use of OpenPose hurt time performance. Replacing OpenPose with the lightweight MediaPipe sped up processing but significantly reduced accuracy, so the trade-off between efficiency and accuracy needs to be resolved.
Method: Uses lightweight MediaPipe for landmark detection, explores landmark subset selection strategies, and introduces spline-based imputation to handle missing landmarks, with the aim of optimizing recognition performance.
Result: A suitable landmark subset maintains or even improves recognition accuracy while processing more than 5X faster than Alves et al. (2024). Spline-based imputation substantially improves accuracy.
Conclusion: Careful landmark selection combined with a simple imputation technique enables efficient and accurate isolated sign recognition, offering a viable path toward scalable sign language recognition systems.
Abstract: This paper investigates the feasibility of using lightweight body landmark detection for the recognition of isolated signs in Brazilian Sign Language (LIBRAS). Although the skeleton-based approach by Alves et al. (2024) enabled substantial improvements in recognition performance, the use of OpenPose for landmark extraction hindered time performance. In a preliminary investigation, we observed that simply replacing OpenPose with the lightweight MediaPipe, while improving processing speed, significantly reduced accuracy. To overcome this limitation, we explored landmark subset selection strategies aimed at optimizing recognition performance. Experimental results showed that a proper landmark subset achieves comparable or superior performance to state-of-the-art methods while reducing processing time by more than 5X compared to Alves et al. (2024). As an additional contribution, we demonstrated that spline-based imputation effectively mitigates missing landmark issues, leading to substantial accuracy gains. These findings highlight that careful landmark selection, combined with simple imputation techniques, enables efficient and accurate isolated sign recognition, paving the way for scalable Sign Language Recognition systems.
[88] Pixels to Signals: A Real-Time Framework for Traffic Demand Estimation
H Mhatre,M Vyas,A Mittal
Main category: cs.CV
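TL;DR: This paper presents a vehicle detection method based on video-frame analysis and the DBSCAN clustering algorithm, as the first step toward optimizing traffic flow.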
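The pipeline summarized in this entry (average frames to estimate the background, extract the foreground, cluster it with DBSCAN) can be sketched in plain Python as follows. The threshold, `eps`, `min_pts`, and the tiny synthetic frames are all assumptions chosen for illustration; the entry does not give the authors' parameters or implementation.

```python
from typing import List, Tuple

Frame = List[List[float]]
Point = Tuple[int, int]


def mean_background(frames: List[Frame]) -> Frame:
    """Per-pixel temporal mean over all frames."""
    h, w = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / len(frames) for c in range(w)]
            for r in range(h)]


def foreground_pixels(frame: Frame, background: Frame, thresh: float) -> List[Point]:
    """Pixels whose absolute difference from the background exceeds thresh."""
    return [(r, c) for r, row in enumerate(frame) for c, v in enumerate(row)
            if abs(v - background[r][c]) > thresh]


def dbscan(points: List[Point], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN; returns one cluster label per point, -1 for noise."""
    unseen, noise = -2, -1
    labels = [unseen] * len(points)

    def neighbours(i: int) -> List[int]:
        r0, c0 = points[i]
        return [j for j, (r, c) in enumerate(points)
                if (r - r0) ** 2 + (c - c0) ** 2 <= eps ** 2]

    cluster = -1
    for i in range(len(points)):
        if labels[i] != unseen:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster          # border point joins the cluster
            if labels[j] != unseen:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:           # core point: keep expanding
                queue.extend(nb)
    return labels


# Synthetic 6x8 scene: nine empty frames, then one with two blobs and a speckle.
empty = [[10.0] * 8 for _ in range(6)]
busy = [row[:] for row in empty]
for r, c in [(0, 0), (0, 1), (1, 0), (1, 1),   # vehicle 1
             (3, 5), (3, 6), (4, 5), (4, 6),   # vehicle 2
             (5, 0)]:                          # isolated speckle
    busy[r][c] = 200.0
background = mean_background([empty] * 9 + [busy])
points = foreground_pixels(busy, background, thresh=50.0)
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Each non-negative label corresponds to one candidate vehicle, and the isolated speckle is rejected as noise, which is the property that makes a density-based method attractive for this step.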
Details
Motivation: Urban traffic congestion is worsening, and efficient, scalable traffic management solutions are needed.
Method: Computes a background image by temporal averaging, then detects vehicles by combining foreground extraction with the DBSCAN algorithm.
Result: Delivers a vehicle detection method that is computationally efficient and requires no large-scale infrastructure changes.
Conclusion: The method is practical and scalable, and suits the deployment of traffic monitoring systems in real-world settings.
Abstract: Traffic congestion is becoming a challenge in rapidly growing cities, resulting in increasing delays and inefficiencies within urban transportation systems. To address this issue, a comprehensive methodology is designed to optimize traffic flow and minimize delays. The framework is structured with three primary components: (a) vehicle detection, (b) traffic prediction, and (c) traffic signal optimization. This paper presents the first component, vehicle detection. The methodology involves analyzing multiple sequential frames from a camera feed to compute the background, i.e. the underlying roadway, by averaging pixel values over time. The computed background is then utilized to extract the foreground, where the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is applied to detect vehicles. With its computational efficiency and minimal infrastructure modification requirements, the proposed methodology offers a practical and scalable solution for real-world deployment.
[89] VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos
Qiucheng Wu,Handong Zhao,Zhixin Shu,Jing Shi,Yang Zhang,Shiyu Chang
Main category: cs.CV
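TL;DR: Proposes VividCam, a new paradigm that trains diffusion models on synthetic videos to learn complex camera motions, reducing reliance on real training videos.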
Details
Motivation: Existing text-to-video generative models generalize poorly to unconventional camera motions, and there is not enough real training data containing rare camera motions.
Method: Adopts a training paradigm based on synthetic videos, generating simple but effective synthetic data from basic geometries in a low-poly 3D scene, and applies several disentanglement strategies that separate camera motion learning from appearance artifacts to mitigate domain shift.
Result: Generates a wide range of precisely controlled, complex camera motions. Experiments show that the method retains a good motion representation without using real videos.
Conclusion: VividCam learns complex camera motions from synthetic data, making camera control in generated video more flexible and creative and opening a new route to artistic video generation.
Abstract: Although recent text-to-video generative models are getting more capable of following external camera controls, imposed by either text descriptions or camera trajectories, they still struggle to generalize to unconventional camera motions, which is crucial in creating truly original and artistic videos. The challenge lies in the difficulty of finding sufficient training videos with the intended uncommon camera motions. To address this challenge, we propose VividCam, a training paradigm that enables diffusion models to learn complex camera motions from synthetic videos, removing the reliance on collecting realistic training videos. VividCam incorporates multiple disentanglement strategies that isolate camera motion learning from synthetic appearance artifacts, ensuring more robust motion representation and mitigating domain shift. We demonstrate that our design synthesizes a wide range of precisely controlled and complex camera motions using surprisingly simple synthetic data. Notably, this synthetic data often consists of basic geometries within a low-poly 3D scene and can be efficiently rendered by engines like Unity. Our video results can be found at https://wuqiuche.github.io/VividCamDemoPage/ .
[90] Understanding Multi-View Transformers
Michal Stary,Julien Gaubil,Ayush Tewari,Vincent Sitzmann
Main category: cs.CV
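TL;DR: This paper presents a method for probing and visualizing 3D representations in the residual connections of multi-view transformers, revealing how the latent state of the DUSt3R model develops across layers and how the model differs from methods with an explicit global pose prior.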
Details
Motivation: Multi-view transformers such as DUSt3R perform well on 3D vision tasks, but their internal mechanisms are opaque, which limits further optimization and use in safety-critical settings. A deeper understanding of how they work internally is therefore needed.
Method: Analyzes the 3D representations in the residual connections of each layer of a multi-view transformer, visualizes and examines the evolution of DUSt3R's latent state and the role of each layer, and compares the model with methods that have a stronger explicit pose prior.
Result: Finds that the studied DUSt3R variant refines correspondences using reconstructed geometry, and reveals how the roles of the model's layers evolve in 3D structure and pose estimation.
Conclusion: The analysis method helps explain the internal mechanisms of multi-view transformers and provides a basis for improving model design and reliability.
Abstract: Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks and the role of the individual layers, and suggesting how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand .
[91] Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning
Hossein R. Nowdeh,Jie Ji,Xiaolong Ma,Fatemeh Afghah
Main category: cs.CV
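TL;DR: Proposes Modality-Aware Sharpness-Aware Minimization (M-SAM), which improves the performance and balance of multimodal learning by identifying the dominant modality, modulating the loss landscape, and updating the weights.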
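A minimal sketch of the first of these steps, identifying the dominant modality from Shapley values of each modality's contribution to accuracy. The exact enumeration below is the textbook definition; how M-SAM computes or approximates these values is not stated in this entry, and the accuracy table and modality names are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(players: Sequence[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: average marginal gain over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen: FrozenSet[str] = frozenset()
        for p in order:
            totals[p] += value(seen | {p}) - value(seen)
            seen = seen | {p}
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical validation accuracy when only the listed modalities are used.
ACCURACY = {
    frozenset(): 0.50,
    frozenset({"audio"}): 0.60,
    frozenset({"video"}): 0.80,
    frozenset({"audio", "video"}): 0.86,
}

phi = shapley_values(["audio", "video"], ACCURACY.__getitem__)
dominant = max(phi, key=phi.get)
print({k: round(v, 2) for k, v in phi.items()}, dominant)
```

The values sum to the accuracy gained over the empty set (0.36 here), so they split the total improvement among modalities. Exact enumeration grows factorially with the number of modalities, so a sampled approximation would be needed beyond a handful.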
Details
Motivation: In multimodal learning, a dominant modality often overshadows the others and limits generalization, so a method that balances the contributions of the modalities is needed.
Method: M-SAM identifies the dominant modality using Shapley values, decomposes and modulates the loss landscape to strengthen robustness for the dominant modality, and updates the weights by backpropagating the modulated gradients. It supports both early and late fusion.
Result: Experiments on four different datasets show that M-SAM outperforms recent optimization and gradient manipulation methods and significantly improves both the balance and the overall performance of multimodal learning.
Conclusion: M-SAM is an effective model-agnostic framework that makes learning robust for the dominant modality while strengthening the contributions of the non-dominant ones, improving generalization in multimodal learning.
Abstract: In multimodal learning, dominant modalities often overshadow others, limiting generalization. We propose Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that applies to many modalities and supports early and late fusion scenarios. In every iteration, M-SAM optimizes learning in three steps. First, it identifies the dominant modality based on the modalities' contributions to accuracy, using Shapley values. Second, it decomposes the loss landscape, or in other words, modulates the loss to prioritize the robustness of the model in favor of the dominant modality. Third, M-SAM updates the weights by backpropagating the modulated gradients. This ensures robust learning for the dominant modality while enhancing contributions from others, allowing the model to explore and exploit complementary features that strengthen overall performance. Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient manipulation methods and significantly balances and improves multimodal learning.
[92] IBIS: A Powerful Hybrid Architecture for Human Activity Recognition
Alison M. Fernandes,Hermes I. Del Monego,Bruno S. Chang,Anelise Munaretto,Hélder M. Fontes,Rui L. Campos
Main category: cs.CV
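TL;DR: Proposes IBIS, a new hybrid architecture combining Inception-BiLSTM with an SVM, to improve model generalization in Wi-Fi sensing. It achieves nearly 99% movement recognition accuracy on Doppler data.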
Details
Motivation: Wi-Fi sensing is promising, but existing models often overfit and generalize poorly, so robustness and classification performance need to improve.
Method: Proposes the IBIS hybrid architecture, which extracts features with Inception-BiLSTM and uses an SVM to strengthen the classification boundaries, and applies it to Doppler data for movement recognition.
Result: Reaches nearly 99% accuracy on the movement recognition task. Performance metrics and confusion matrices support the effectiveness of the method.
Conclusion: IBIS markedly improves model generalization and classification robustness, providing an efficient solution for Wi-Fi sensing applications.
Abstract: The increasing interest in Wi-Fi sensing stems from its potential to capture environmental data in a low-cost, non-intrusive way, making it ideal for applications like healthcare, space occupancy analysis, and gesture-based IoT control. However, a major limitation in this field is the common problem of overfitting, where models perform well on training data but fail to generalize to new data. To overcome this, we introduce a novel hybrid architecture that integrates Inception-BiLSTM with a Support Vector Machine (SVM), which we refer to as IBIS. Our IBIS approach is uniquely engineered to improve model generalization and create more robust classification boundaries. By applying this method to Doppler-derived data, we achieve a movement recognition accuracy of nearly 99%. Comprehensive performance metrics and confusion matrices confirm the significant effectiveness of our proposed solution.
[93] FT-ARM: Fine-Tuned Agentic Reflection Multimodal Language Model for Pressure Ulcer Severity Classification with Reasoning
Reza Saadati Fard,Emmanuel Agu,Palawat Busaranuvong,Deepak Kumar,Shefalika Gautam,Bengisu Tulu,Diane Strong,Lorraine Loretz
Main category: cs.CV
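TL;DR: This paper proposes FT-ARM, a method based on a fine-tuned multimodal large language model for pressure ulcer (PU) severity classification. It combines visual features with clinical knowledge and adds a clinician-style self-reflection mechanism, improving classification accuracy (85%), interpretability, and real-time inference, and outperforming conventional CNN/ViT models.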
Details
Motivation: Accurate staging of pressure ulcers is essential for treatment, but subtle visual differences and subjective judgment lead to poor agreement among clinicians. Existing AI models lack interpretability, so a more reliable, transparent, and clinically applicable solution is needed.
Method: Proposes FT-ARM, fine-tuned from LLaMA 3.2 90B, which fuses visual and textual inputs, adds an agentic self-reflection mechanism that refines predictions through iterative reasoning, generates natural-language explanations to improve interpretability, and supports real-time inference.
Result: Reaches 85% accuracy on the PIID dataset, 4 percentage points above prior CNN models, with good real-time inference capability and clinically sound natural-language explanations.
Conclusion: Through multimodal fine-tuning and reflective reasoning, FT-ARM achieves higher accuracy, consistency, and interpretability in automated pressure ulcer assessment, advancing the practical use of AI in clinical wound management.
Abstract: Pressure ulcers (PUs) are a serious and prevalent healthcare concern. Accurate classification of PU severity (Stages I-IV) is essential for proper treatment but remains challenging due to subtle visual distinctions and subjective interpretation, leading to variability among clinicians. Prior AI-based approaches using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) achieved promising accuracy but offered limited interpretability. We present FT-ARM (Fine-Tuned Agentic Reflection Multimodal model), a fine-tuned multimodal large language model (MLLM) with an agentic self-reflection mechanism for pressure ulcer severity classification. Inspired by clinician-style diagnostic reassessment, FT-ARM iteratively refines its predictions by reasoning over visual features and encoded clinical knowledge from text, enhancing both accuracy and consistency. On the publicly available Pressure Injury Image Dataset (PIID), FT-ARM, fine-tuned from LLaMA 3.2 90B, achieved 85% accuracy in classifying PU stages I-IV, surpassing prior CNN-based models by +4%. Unlike earlier CNN/ViT studies that relied solely on offline evaluations, FT-ARM is designed and tested for live inference, reflecting real-time deployment conditions. Furthermore, it produces clinically grounded natural-language explanations, improving interpretability and trust. By integrating fine-tuning and reflective reasoning across multimodal inputs, FT-ARM advances the reliability, transparency, and clinical applicability of automated wound assessment systems, addressing the critical need for consistent and explainable PU staging to support improved patient care.
[94] Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8
Zahra Ebrahimi Vargoorani,Amir Mohammad Ghoreyshi,Ching Yee Suen
Main category: cs.CV
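TL;DR: This paper proposes a YOLOv8-based deep learning method to improve the accuracy of automatic license plate recognition (ALPR) systems, combined with a semi-supervised learning framework in which Grounding DINO generates pseudo-labels to reduce reliance on manual annotation.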
Details
Motivation: Environmental factors, vehicle speed, camera angle, and image quality make it hard to build a highly accurate automatic license plate recognition system. Manual data annotation is also slow and costly, so a more efficient training approach is needed.
Method: Uses YOLOv8 for license plate detection and recognition, and builds a semi-supervised learning framework that combines a small amount of manually labeled data with pseudo-labels generated by Grounding DINO. Training and evaluation use the CENPARMI and UFPR-ALPR datasets.
Result: Achieves 94% recall on the CENPARMI dataset and 91% recall on the UFPR-ALPR dataset, and reports character error rates, supporting the effectiveness of the system.
Conclusion: Combining a vision-language model with semi-supervised learning significantly improves license plate recognition and reduces reliance on manual annotation, with good scalability and application prospects.
Abstract: Developing a highly accurate automatic license plate recognition (ALPR) system is challenging due to environmental factors such as lighting, rain, and dust. Additional difficulties include high vehicle speeds, varying camera angles, and low-quality or low-resolution images. ALPR is vital in traffic control, parking, vehicle tracking, toll collection, and law enforcement applications. This paper proposes a deep learning strategy using YOLOv8 for license plate detection and recognition tasks. This method seeks to enhance the performance of the model using datasets from Ontario, Quebec, California, and New York State. It achieved an impressive recall rate of 94% on the dataset from the Center for Pattern Recognition and Machine Intelligence (CENPARMI) and 91% on the UFPR-ALPR dataset. In addition, our method follows a semi-supervised learning framework, combining a small set of manually labeled data with pseudo-labels generated by Grounding DINO to train our detection model. Grounding DINO, a powerful vision-language model, automatically annotates many images with bounding boxes for license plates, thereby minimizing the reliance on labor-intensive manual labeling. By integrating human-verified and model-generated annotations, we can scale our dataset efficiently while maintaining label quality, which significantly enhances the training process and overall model performance. Furthermore, we report character error rates for both datasets, providing additional insight into system performance.
[95] Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models
Shunjie-Fabian Zheng,Hyeonjun Lee,Thijs Kooi,Ali Diba
Main category: cs.CV
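TL;DR: This study proposes a new multimodal framework for breast cancer detection that combines 2D mammograms with structured textual descriptors and outperforms unimodal approaches.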
Details
Motivation: Existing computer-aided diagnosis (CAD) systems are limited in clinical use by weak interpretation of multimodal data and by their dependence on prior patient history. A more practical solution that does not need complex clinical history is urgently required.
Method: Proposes an approach that combines convolutional neural networks (ConvNets) with language representation models, using novel tokenization modules to fuse visual features from 2D mammograms with textual information from clinical metadata and synthesized reports, and evaluates it on multi-national screening cohort data.
Result: The multimodal approach significantly outperforms unimodal baselines on cancer detection and calcification identification, and performs especially well when handling high-resolution images and when deployed across populations.
Conclusion: The study establishes a new paradigm for clinically viable CAD systems based on vision-language models (VLMs), using effective fusion mechanisms to exploit both imaging data and patient context, with broad prospects for clinical application.
Abstract: Breast cancer remains the most commonly diagnosed malignancy among women in the developed world. Early detection through mammography screening plays a pivotal role in reducing mortality rates. While computer-aided diagnosis (CAD) systems have shown promise in assisting radiologists, existing approaches face critical limitations in clinical deployment - particularly in handling the nuanced interpretation of multi-modal data and feasibility due to the requirement of prior clinical history. This study introduces a novel framework that synergistically combines visual features from 2D mammograms with structured textual descriptors derived from easily accessible clinical metadata and synthesized radiological reports through innovative tokenization modules. Our proposed methods in this study demonstrate that strategic integration of convolutional neural networks (ConvNets) with language representations achieves superior performance to vision transformer-based models while handling high-resolution images and enabling practical deployment across diverse populations. By evaluating it on multi-national cohort screening mammograms, our multi-modal approach achieves superior performance in cancer detection and calcification identification compared to unimodal baselines, with particular improvements. The proposed method establishes a new paradigm for developing clinically viable VLM-based CAD systems that effectively leverage imaging data and contextual patient information through effective fusion mechanisms.
[96] Auto3DSeg for Brain Tumor Segmentation from 3D MRI in BraTS 2023 Challenge
Andriy Myronenko,Dong Yang,Yufan He,Daguang Xu
Main category: cs.CV
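TL;DR: This paper describes a solution that uses Auto3DSeg from MONAI to take part in the five BraTS 2023 segmentation challenges, placing first in three of them and second in the other two.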
Details
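Motivation: Aims to improve the accuracy and efficiency of brain tumor image segmentation through automation, addressing the challenge of multi-center data covering multiple tumor types.
Method: Uses Auto3DSeg from the MONAI framework for fully automated 3D medical image segmentation and takes part in all five BraTS 2023 challenge tasks.
Result: Placed first in the Brain Metastasis, Brain Meningioma, and BraTS-Africa challenges, and second in the Adult Glioma and Pediatric Glioma challenges.
Conclusion: Auto3DSeg shows strong generalization and high performance, and is an effective automated solution for complex brain tumor segmentation tasks.
Abstract: In this work, we describe our solution to the BraTS 2023 cluster of challenges using Auto3DSeg from MONAI. We participated in all 5 segmentation challenges, and achieved the 1st place results in three of them: Brain Metastasis, Brain Meningioma, BraTS-Africa challenges, and the 2nd place results in the remaining two: Adult and Pediatric Glioma challenges.
[97] DRIP: Dynamic patch Reduction via Interpretable Pooling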
Yusen Peng,Sachin Kumar
Main category: cs.CV
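TL;DR: This paper proposes DRIP, which reduces image patches dynamically through interpretable pooling. It adapts to the input image and merges tokens dynamically in the deeper layers of the visual encoder, substantially lowering computation while preserving classification and zero-shot performance.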
Details
Motivation: Large-scale pretraining is expensive, which makes it hard for researchers to train vision-language models from scratch, so training efficiency needs to improve.
Method: Proposes Dynamic patch Reduction via Interpretable Pooling (DRIP), which merges tokens dynamically in the deeper layers of the visual encoder according to the input image.
Result: Significantly reduces GFLOPs in both ImageNet training from scratch and CLIP contrastive pretraining while keeping comparable classification and zero-shot performance. Continual pretraining on a large biology dataset further supports the method's effectiveness.
Conclusion: DRIP improves the training efficiency of vision-language models and applies both to training from scratch and to domain extension.
Abstract: Recently, the advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, owing to the large-scale and hence expensive pretraining, the efficiency concern has discouraged researchers from attempting to pretrain a vision language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
[98] Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments
Manjunath Prasad Holenarasipura Rajiv,B. M. Vidyavathi
Main category: cs.CV
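TL;DR: Proposes a vision-language integration framework that unifies pre-trained visual encoders and large language models to achieve semantic alignment for zero-shot scene understanding, significantly outperforming existing methods on several real-world scene datasets.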
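A minimal sketch of zero-shot labelling once an image and a set of text prompts live in one shared embedding space: pick the prompt whose vector is closest to the image vector. Cosine similarity, the three-dimensional vectors, and the prompt strings are assumptions for illustration; the framework's multimodal fusion and reasoning layers are not modelled here.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_label(image_vec: List[float],
                    prompt_vecs: Dict[str, List[float]]) -> str:
    """Return the prompt whose embedding is most similar to the image."""
    return max(prompt_vecs, key=lambda name: cosine(image_vec, prompt_vecs[name]))


# Hypothetical embeddings; real encoders would produce these vectors.
prompts = {
    "a photo of a kitchen": [0.9, 0.1, 0.0],
    "a photo of a street": [0.1, 0.9, 0.2],
    "a photo of a beach": [0.0, 0.2, 0.9],
}
image = [0.2, 0.8, 0.1]
print(zero_shot_label(image, prompts))  # a photo of a street
```

Because the candidate labels are just text, a new category can be added by embedding one more prompt, with no labelled images of that category.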
Details
Motivation: Real-world scenes are complex and variable, and conventional models struggle to recognize new objects, actions, and contexts without labeled examples, so zero-shot scene understanding methods with strong generalization are needed.
Method: Combines pre-trained visual encoders (e.g., CLIP, ViT) with large language models (e.g., GPT-based architectures), maps visual inputs and textual prompts into a shared semantic space, and interprets context through multimodal fusion and reasoning layers.
Result: Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets show up to 18% improvement in top-1 accuracy over current zero-shot models, along with notable gains in semantic coherence metrics.
Conclusion: Cross-modal alignment and language grounding improve zero-shot understanding and generalization in real-world scenes.
Abstract: Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding.
[99] Neighborhood Feature Pooling for Remote Sensing Image Classification
Fahimeh Orvati Nia,Amirmohammad Mohammadi,Salim Al Kharsa,Pragati Naikare,Zigfried Hampel-Arias,Joshua Peeples
Main category: cs.CV
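TL;DR: Proposes neighborhood feature pooling (NFP), a new texture feature extraction method for remote sensing image classification. It is implemented with convolutional layers and improves classification performance across datasets and architectures.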
Details
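Motivation: Improving the accuracy of remote sensing image classification requires more effective texture feature extraction.
Method: Proposes neighborhood feature pooling (NFP), which uses convolutional layers to capture relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions.
Result: Experiments show that, compared with baseline models, NFP consistently improves performance across a variety of datasets and network architectures with minimal parameter overhead.
Conclusion: NFP is an effective, lightweight texture feature extraction method suited to a range of remote sensing image classification tasks.
Abstract: In this work, we propose neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification. The NFP layer captures relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions. Implemented using convolutional layers, NFP can be seamlessly integrated into any network. Results comparing the baseline models and the NFP method indicate that NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead.
[100] PSTF-AttControl: Per-Subject-Tuning-Free Personalized Image Generation with Controllable Face Attributes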
Xiang liu,Zhaoxiang Liu,Huan Hu,Zipeng Wang,Ping Chen,Zezhou Chen,Kai Wang,Shiguo Lian
Main category: cs.CV
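TL;DR: This paper proposes a new per-subject-tuning-free (PSTF) method that provides fine-grained control over facial attributes while preserving facial identity with high fidelity.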
Details
Motivation: Existing personalized image generation methods struggle to control facial attributes precisely without tuning, while tuning-based methods are accurate but require expertise and additional data, which limits their usability.
Method: Extracts identity features with a face recognition model and maps them into the W+ latent space of StyleGAN2 through the e4e encoder. Introduces a Triplet-Decoupled Cross-Attention module that fuses identity, attribute, and text information in the UNet, separating identity from attributes.
Result: Trained on the FFHQ dataset, the method generates personalized images with fine-grained attribute control without additional fine-tuning or per-subject training data.
Conclusion: Without per-subject tuning, the method balances facial identity preservation against attribute control, offering an efficient and user-friendly approach to high-quality facial image synthesis.
Abstract: Recent advancements in personalized image generation have significantly improved facial identity preservation, particularly in fields such as entertainment and social media. However, existing methods still struggle to achieve precise control over facial attributes in a per-subject-tuning-free (PSTF) way. Tuning-based techniques like PreciseControl have shown promise by providing fine-grained control over facial features, but they often require extensive technical expertise and additional training data, limiting their accessibility. In contrast, PSTF approaches simplify the process by enabling image generation from a single facial input, but they lack precise control over facial attributes. In this paper, we introduce a novel PSTF method that enables both precise control over facial attributes and high-fidelity preservation of facial identity. Our approach utilizes a face recognition model to extract facial identity features, which are then mapped into the $W^+$ latent space of StyleGAN2 using the e4e encoder. We further enhance the model with a Triplet-Decoupled Cross-Attention module, which integrates facial identity, attribute features, and text embeddings into the UNet architecture, ensuring clean separation of identity and attribute information. Trained on the FFHQ dataset, our method allows for the generation of personalized images with fine-grained control over facial attributes, without requiring additional fine-tuning or training data for individual identities. We demonstrate that our approach successfully balances personalization with precise facial attribute control, offering a more efficient and user-friendly solution for high-quality, adaptable facial image synthesis. The code is publicly available at https://github.com/UnicomAI/PSTF-AttControl.
[101] Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection
Chanhyeong Yang,Taehoon Song,Jihwan Park,Hyunwoo J. Kim
Main category: cs.CV
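TL;DR: This paper proposes VDRP, a visual diversity and region-aware prompt learning framework that addresses intra-class visual diversity and inter-class visual entanglement in zero-shot human-object interaction detection.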
Details
Motivation: Existing methods handle the visual complexity of human-object interaction poorly, including the diversity of the same verb across poses and scenes (intra-class variation) and the similar visual patterns produced by different verbs (inter-class entanglement).
Method: Proposes the VDRP framework: (1) a visual diversity-aware prompt learning strategy that uses group-wise visual variance and Gaussian perturbation to help prompts model verb diversity; (2) region-specific concepts extracted from the human, object, and union regions to make the prompts more discriminative.
Result: Experiments on the HICO-DET benchmark show state-of-the-art performance under four zero-shot settings, mitigating both intra-class diversity and inter-class entanglement.
Conclusion: By combining visual diversity modeling with region-aware prompt learning, VDRP markedly improves zero-shot human-object interaction detection and points to a direction for improving methods built on vision-language models such as CLIP.
Abstract: Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.
[102] AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians
Xiyu Zhang,Chong Bao,Yipeng Chen,Hongjia Zhai,Yitong Dong,Hujun Bao,Zhaopeng Cui,Guofeng Zhang
Main category: cs.CV
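TL;DR: Proposes an Atlanta-world guided implicit-structured Gaussian Splatting method for high-quality 3D reconstruction of indoor and urban scenes that balances detail preservation with rendering efficiency.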
Details
Motivation: Existing methods lack global consistency in low-texture regions, and Gaussian Splatting and implicit SDF fields often suffer from discontinuities or low computational efficiency, which causes loss of detail.
Method: Uses the Atlanta-world model to guide an implicit-structured Gaussian Splatting (GS) representation, introduces a semantic GS representation that predicts the probability of semantic regions, and applies structure plane regularization with learnable plane indicators for globally accurate surface reconstruction.
Result: Experiments show that the method outperforms current state-of-the-art approaches in both indoor and urban scenes, significantly improving surface reconstruction quality.
Conclusion: The method addresses both global consistency and detail preservation in low-texture regions, achieving high-quality 3D reconstruction while keeping rendering efficient.
Abstract: 3D reconstruction of indoor and urban environments is a prominent research topic with various downstream applications. However, existing geometric priors for addressing low-texture regions in indoor and urban settings often lack global consistency. Moreover, Gaussian Splatting and implicit SDF fields often suffer from discontinuities or exhibit computational inefficiencies, resulting in a loss of detail. To address these issues, we propose an Atlanta-world guided implicit-structured Gaussian Splatting that achieves smooth indoor and urban scene reconstruction while preserving high-frequency details and rendering efficiency. By leveraging the Atlanta-world model, we ensure accurate surface reconstruction for low-texture regions, while the proposed novel implicit-structured GS representations provide smoothness without sacrificing efficiency and high-frequency details. Specifically, we propose a semantic GS representation to predict the probability of all semantic regions and deploy a structure plane regularization with learnable plane indicators for globally accurate surface reconstruction. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.
[103] Region-CAM: Towards Accurate Object Regions in Class Activation Maps for Weakly Supervised Learning Tasks
Qingdong Cai,Charith Abhayaratne
Main category: cs.CV
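TL;DR: Proposes Region-CAM, which generates more complete activation maps with more precise boundaries through semantic information propagation, and significantly outperforms conventional CAM methods on WSSS and object localization tasks.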
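For reference, the mean intersection over union (mIoU) used to score activation maps on segmentation benchmarks can be computed as in the sketch below. The tiny one-dimensional label maps are hypothetical and only illustrate why a map that covers more of the object scores higher than one that highlights just its most discriminative core; the benchmark evaluation protocol itself is not reproduced here.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], truth: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that occur in pred or truth (flattened maps)."""
    ious: List[float] = []
    for k in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == k and t == k)
        union = sum(1 for p, t in zip(pred, truth) if p == k or t == k)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)


# 0 = background, 1 = object; eight "pixels" flattened into a list.
truth = [0, 0, 1, 1, 1, 1, 0, 0]
core_only = [0, 0, 0, 1, 1, 0, 0, 0]   # only the most discriminative part
fuller = [0, 0, 1, 1, 1, 0, 0, 0]      # covers more of the object

print(round(mean_iou(core_only, truth, 2), 4))  # 0.5833
print(round(mean_iou(fuller, truth, 2), 4))     # 0.775
```

Note that the gains quoted for this entry are differences between mIoU scores, so they are absolute percentage points (for example, 60.12 minus 46.51 gives 13.61).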
Details
Motivation: Conventional CAM methods activate only the most discriminative regions of the target, which often yields incomplete activation regions with inaccurate boundaries and limits the performance of tasks such as weakly supervised semantic segmentation.
Method: Proposes Region-CAM, which extracts semantic information maps (SIMs) and performs semantic information propagation (SIP) using both gradients and features, producing more complete activation maps at each stage of the classification model.
Result: Reaches 60.12% mIoU on PASCAL VOC (+13.61 percentage points) and 36.38% on MS COCO (+16.23 percentage points). On ILSVRC2012, Top-1 localization accuracy reaches 51.7%, 4.5% better than LayerCAM.
Conclusion: Region-CAM improves both the coverage and the boundary precision of activation maps, markedly improving weakly supervised semantic segmentation and localization.
Abstract: Class Activation Mapping (CAM) methods are widely applied in weakly supervised learning tasks due to their ability to highlight object regions. However, conventional CAM methods highlight only the most discriminative regions of the target. These highlighted regions often fail to cover the entire object and are frequently misaligned with object boundaries, thereby limiting the performance of downstream weakly supervised learning tasks, particularly Weakly Supervised Semantic Segmentation (WSSS), which demands pixel-wise accurate activation maps to get the best results. To alleviate the above problems, we propose a novel activation method, Region-CAM. Distinct from network feature weighting approaches, Region-CAM generates activation maps by extracting semantic information maps (SIMs) and performing semantic information propagation (SIP) by considering both gradients and features in each of the stages of the baseline classification model. Our approach highlights a greater proportion of object regions while ensuring that activation maps have precise boundaries that align closely with object edges. Region-CAM achieves 60.12% and 58.43% mean intersection over union (mIoU) using the baseline model on the PASCAL VOC training and validation datasets, respectively, which are improvements of 13.61 and 13.13 percentage points over the original CAM (46.51% and 45.30%). On the MS COCO validation set, Region-CAM achieves 36.38%, a 16.23 percentage point improvement over the original CAM (20.15%). We also demonstrate the superiority of Region-CAM in object localization tasks, using the ILSVRC2012 validation set. Region-CAM achieves 51.7% in Top-1 Localization accuracy Loc1. Compared with LayerCAM, an activation method designed for weakly supervised object localization, Region-CAM achieves 4.5% better performance in Loc1.
[104] DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications
Malaisree P,Youwai S,Kitkobsin T,Janrungautai S,Amorndechaphon D,Rojanavasu P
Main category: cs.CV
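TL;DR: Proposes DINO-YOLO, a hybrid architecture combining YOLOv12 with self-supervised DINOv3. By integrating features at input preprocessing and in the backbone, it markedly improves detection on small civil engineering datasets while retaining real-time inference.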
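The real-time claim can be sanity-checked against the per-image latencies quoted in this entry's abstract (21-33 ms for DINO-YOLO versus 8-16 ms for the baseline). The conversion below assumes one image per inference call and that the quoted latency is the full per-frame cost; both are assumptions, since the entry does not describe the measurement setup.

```python
def fps(latency_ms: float) -> float:
    """Frames per second if each frame takes `latency_ms` end to end."""
    return 1000.0 / latency_ms


# Latency ranges as quoted in the abstract (milliseconds per image).
dino_yolo_ms = (21.0, 33.0)
baseline_ms = (8.0, 16.0)

dino_yolo_fps = (fps(dino_yolo_ms[1]), fps(dino_yolo_ms[0]))
baseline_fps = (fps(baseline_ms[1]), fps(baseline_ms[0]))
print(f"DINO-YOLO: {dino_yolo_fps[0]:.1f} to {dino_yolo_fps[1]:.1f} FPS")
print(f"baseline:  {baseline_fps[0]:.1f} to {baseline_fps[1]:.1f} FPS")
```

Under these assumptions the 21-33 ms range corresponds to roughly 30 to 47 frames per second, which matches the 30-47 FPS figure reported alongside it.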
Details
Motivation: Object detection in civil engineering is constrained by scarce annotated data, and existing methods underperform in small-sample settings, so data efficiency needs to improve.
Method: Integrates features from the self-supervised DINOv3 vision transformer into the YOLOv12 architecture along two paths (P0 input preprocessing and P3 backbone enhancement), and systematically evaluates models of different scales and integration strategies.
Result: Improves results by 12.4%, 13.7%, and 88.6% on the Tunnel Segment Crack, Construction PPE, and KITTI datasets respectively, reaches 55.77% mAP@0.5 with the Medium model, keeps real-time speed at 30-47 FPS, and holds the added inference latency within acceptable bounds.
Conclusion: DINO-YOLO achieves state-of-the-art performance on civil engineering detection tasks in data-constrained settings with fewer than 10K images, balancing accuracy with computational efficiency, and suits construction safety monitoring and infrastructure inspection.
Abstract: Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.
[105] Revisiting Reconstruction-based AI-generated Image Detection: A Geometric Perspective
Wan Jiang,Jing Yan,Ruixuan Zhang,Xiaojing Chen,Changtao Miao,Zhe Li,Chenhao Lin,Yunfeng Diao,Richang Hong
Main category: cs.CV
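TL;DR: Proposes ReGap, a training-free method that computes a dynamic reconstruction error by introducing structured editing operations, enabling more accurate detection of AI-generated images.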
Details
Motivation: Existing reconstruction-based detectors of AI-generated images lack a theoretical foundation and rely on empirical heuristics, which limits interpretability and reliability; they perform especially poorly when real images have lower reconstruction error than generated ones. Method: From a geometric perspective, the paper proposes the Jacobian-Spectral Lower Bound, which explains the difference in error between real and generated images relative to the reconstruction manifold; it then designs ReGap, which uses structured editing to introduce controlled perturbations and detects generated images dynamically by comparing reconstruction error before and after editing. Result: Experiments show that ReGap outperforms existing baselines under a variety of conditions, is robust to common post-processing operations, and generalizes well. Conclusion: Through its dynamic reconstruction-error mechanism, ReGap improves the accuracy and robustness of AI-generated image detection, overcomes the reliance of static methods on fixed thresholds and data-specific tuning, and suits real-world use. Abstract: The rise of generative Artificial Intelligence (AI) has made detecting AI-generated images a critical challenge for ensuring authenticity. Existing reconstruction-based methods lack theoretical foundations and rely on empirical heuristics, limiting interpretability and reliability. In this paper, we introduce the Jacobian-Spectral Lower Bound for reconstruction error from a geometric perspective, showing that real images off the reconstruction manifold exhibit a non-trivial error lower bound, while generated images on the manifold have near-zero error. Furthermore, we reveal the limitations of existing methods that rely on static reconstruction error from a single pass. These methods often fail when some real images exhibit lower error than generated ones. This counterintuitive behavior reduces detection accuracy and requires data-specific threshold tuning, limiting their applicability in real-world scenarios. To address these challenges, we propose ReGap, a training-free method that computes dynamic reconstruction error by leveraging structured editing operations to introduce controlled perturbations. This enables measuring error changes before and after editing, improving detection accuracy by enhancing error separation. Experimental results show that our method outperforms existing baselines, exhibits robustness to common post-processing operations and generalizes effectively across diverse conditions.
[106] EA3D: Online Open-World 3D Object Extraction from Streaming Videos
Xiaoyu Zhou,Jingqi Wang,Yuang Jia,Yongtao Wang,Deqing Sun,Ming-Hsuan Yang
Main category: cs.CV
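TL;DR: Presents ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that performs geometric reconstruction and holistic scene understanding simultaneously.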
Details
Motivation: Existing 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry, and struggle to meet real-time and open-world requirements. Method: EA3D uses vision-language models and 2D vision foundation encoders to interpret video frames dynamically and extract object-level knowledge, which it integrates into a Gaussian feature map through a feed-forward online update strategy; combined with iterative visual odometry estimation and a recurrent joint optimization module, geometry and semantics reinforce each other. Result: Extensive experiments across multiple benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box estimation, semantic occupancy estimation, and 3D mesh generation, show that EA3D performs well in both geometric reconstruction and semantic understanding. Conclusion: EA3D provides a unified and efficient online framework that supports a range of downstream tasks and advances open-world 3D scene understanding. Abstract: Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.
[107] Towards Real-Time Inference of Thin Liquid Film Thickness Profiles from Interference Patterns Using Vision Transformers
Gautam A. Viruthagiri,Arnuv Tandon,Gerald G. Fuller,Vinny Chandran Suja
Main category: cs.CV
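TL;DR: Proposes a vision-transformer-based method for real-time reconstruction of thin liquid film thickness, addressing the phase ambiguity and noise sensitivity of conventional interferometric measurement.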
Details
Motivation: Conventional methods for reconstructing thickness from thin-film interferometry are computationally complex, sensitive to noise, or dependent on manual analysis, which makes real-time clinical diagnostics impractical. Method: A vision-transformer-based model is trained on physiologically relevant synthetic and experimental data and exploits long-range spatial correlations to infer the film thickness profile directly from dynamic interferograms. Result: The model performs well on rapidly evolving films with noise and motion artifacts, outperforms conventional phase-unwrapping and iterative fitting methods, and delivers consistent, automated thickness reconstruction in real time on consumer hardware. Conclusion: The method achieves efficient and robust real-time film thickness reconstruction and could advance non-invasive continuous monitoring and clinical diagnosis of ocular conditions such as dry eye disease. Abstract: Thin film interferometry is a powerful technique for non-invasively measuring liquid film thickness with applications in ophthalmology, but its clinical translation is hindered by the challenges in reconstructing thickness profiles from interference patterns, an ill-posed inverse problem complicated by phase periodicity, imaging noise and ambient artifacts. Traditional reconstruction methods are either computationally intensive, sensitive to noise, or require manual expert analysis, which is impractical for real-time diagnostics. To address this challenge, here we present a vision transformer-based approach for real-time inference of thin liquid film thickness profiles directly from isolated interferograms. Trained on a hybrid dataset combining physiologically-relevant synthetic and experimental tear film data, our model leverages long-range spatial correlations to resolve phase ambiguities and reconstruct temporally coherent thickness profiles in a single forward pass from dynamic interferograms acquired in vivo and ex vivo. The network demonstrates state-of-the-art performance on noisy, rapidly-evolving films with motion artifacts, overcoming limitations of conventional phase-unwrapping and iterative fitting methods. Our data-driven approach enables automated, consistent thickness reconstruction at real-time speeds on consumer hardware, opening new possibilities for continuous monitoring of pre-lens ocular tear films and non-invasive diagnosis of conditions such as dry eye disease.
[108] Target-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation
Wenhao Zheng,Chenwei Sun,Wenbo Zhang,Jiancheng Lv,Xianggen Liu
Main category: cs.CV
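TL;DR: Proposes TGBFN, a new framework for quantitatively constrained CAD generation that, for the first time, handles the multi-modality of CAD sequences in a unified continuous and differentiable space and controls CAD properties through a guided Bayesian flow, achieving state-of-the-art generation performance.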
Details
Motivation: Existing generative models face long-range dependency and parameter sensitivity challenges when handling multi-modal data such as parametric CAD sequences, and struggle to satisfy quantitative constraints, so a new generative framework is needed. Method: Proposes the Target-Guided Bayesian Flow Network (TGBFN), which unifies discrete commands and continuous parameters in a continuous, differentiable parameter space and introduces a guided Bayesian flow by penetrating the parameter update kernel to control CAD properties. Result: On a newly constructed dataset for quantitatively constrained CAD generation, TGBFN outperforms existing methods on both single-condition and multi-condition generation tasks and produces high-fidelity CAD sequences that satisfy the conditions. Conclusion: TGBFN offers an effective solution for generating multi-modal, parameter-sensitive CAD sequences under quantitative constraints and advances the use of deep generative models in engineering design. Abstract: Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi-modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long-range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi-modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single-condition and multi-condition constrained generation tasks demonstrate that TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences. The code is available at https://github.com/scu-zwh/TGBFN.
[109] A Study on Inference Latency for Vision Transformers on Mobile Devices
Zhuojin Li,Marco Paolieri,Leana Golubchik
Main category: cs.CV
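TL;DR: Studies the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices, compares them with 102 convolutional neural networks (CNNs), and analyzes the factors that affect ViT latency. Based on these findings, the authors build a dataset of 1000 synthetic ViTs with latencies measured on two machine learning frameworks and six mobile platforms, and validate the accuracy of inference latency prediction for new ViT models.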
Details
Motivation: With machine learning advancing rapidly on mobile devices, especially in computer vision, understanding how vision transformers perform on mobile hardware has become essential. Existing work lacks a systematic latency analysis of ViT architectures in real mobile environments. Method: Measures 190 real-world ViTs and 102 CNNs on multiple mobile platforms to identify the key factors affecting ViT latency; builds a dataset of 1000 synthetic ViTs covering mainstream architectures and building blocks; and trains latency prediction models. Result: Identifies the key factors behind ViT latency on mobile devices, provides a large-scale ViT latency dataset, and shows that the inference latency of new ViT models can be predicted accurately enough for practical use. Conclusion: ViT latency on mobile devices is predictable, and the dataset and analysis provide important support for deploying ViT models efficiently on mobile hardware. Abstract: Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.
[110] $D^2GS$: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction
Kejing Xia,Jidong Jia,Ke Jin,Yucai Bai,Li Sun,Dacheng Tao,Youjian Zhang
Main category: cs.CV
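TL;DR: Proposes D²GS, a LiDAR-free urban scene reconstruction framework that initializes a dense point cloud from multi-view depth predictions and combines progressive pruning, a depth enhancer, and geometric constraints on road regions, achieving higher reconstruction accuracy than existing methods.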
Details
Motivation: Existing urban scene reconstruction methods depend on multimodal sensor inputs such as LiDAR and images, but acquiring LiDAR data involves difficult spatiotemporal calibration and reprojection errors caused by spatial misalignment, which limits practical use. A reconstruction method is therefore needed that does not rely on LiDAR yet still obtains high-quality geometry priors. Method: 1) Generate an initial dense point cloud by back-projecting multi-view metric depth predictions; 2) refine the point cloud with a progressive pruning strategy to improve global consistency; 3) design a Depth Enhancer that uses diffusion priors from a depth foundation model to enhance the depth maps rendered by Gaussians and feeds them back during training to strengthen geometric constraints; 4) constrain the shape and normal attributes of Gaussians within road regions to improve ground geometry accuracy. Result: Experiments on the Waymo dataset show that D²GS consistently outperforms current state-of-the-art methods in geometric reconstruction accuracy, even those that use ground-truth LiDAR data. Conclusion: D²GS achieves high-quality urban scene reconstruction without LiDAR; by introducing denser and more accurate geometry priors, it avoids multi-sensor calibration difficulties while delivering strong reconstruction performance, offering a new approach to scene reconstruction for autonomous driving. Abstract: Recently, Gaussian Splatting (GS) has shown great potential for urban scene reconstruction in the field of autonomous driving. However, current urban scene reconstruction methods often depend on multimodal sensors as inputs, i.e., LiDAR and images. Though the geometry prior provided by LiDAR point clouds can largely mitigate ill-posedness in reconstruction, acquiring such accurate LiDAR data is still challenging in practice: i) precise spatiotemporal calibration between LiDAR and other sensors is required, as they may not capture data simultaneously; ii) reprojection errors arise from spatial misalignment when LiDAR and cameras are mounted at different locations. To avoid the difficulty of acquiring accurate LiDAR depth, we propose $D^2GS$, a LiDAR-free urban scene reconstruction framework. In this work, we obtain geometry priors that are as effective as LiDAR while being denser and more accurate. First, we initialize a dense point cloud by back-projecting multi-view metric depth predictions. This point cloud is then optimized by a Progressive Pruning strategy to improve the global consistency. Second, we jointly refine Gaussian geometry and predicted dense metric depth via a Depth Enhancer. Specifically, we leverage diffusion priors from a depth foundation model to enhance the depth maps rendered by Gaussians. In turn, the enhanced depths provide stronger geometric constraints during Gaussian training. Finally, we improve the accuracy of ground geometry by constraining the shape and normal attributes of Gaussians within road regions. Extensive experiments on the Waymo dataset demonstrate that our method consistently outperforms state-of-the-art methods, producing more accurate geometry even when compared with those using ground-truth LiDAR data.
[111] Classifier Enhancement Using Extended Context and Domain Experts for Semantic Segmentation
Huadong Tang,Youpeng Zhao,Min Xu,Jun Wang,Qiang Wu
Main category: cs.CV
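TL;DR: Proposes an Extended Context-Aware Classifier (ECAC) that dynamically adjusts the classifier using dataset-level and image-level contextual information to improve semantic segmentation performance.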
Details
Motivation: Existing semantic segmentation methods use a classifier with fixed parameters, which adapts poorly to differences in class distribution between images, and class imbalance in the dataset leads to poor segmentation of minority classes. Method: A memory bank learns dataset-level class context, which is combined with local context from the individual image to adjust the classifier dynamically; a teacher-student framework is adopted in which the teacher network updates contextual information dynamically from ground-truth labels and guides the student network. Result: Achieves state-of-the-art performance on several datasets, including ADE20K, COCO-Stuff10K, and Pascal-Context. Conclusion: By fusing global and local context to optimize the classifier dynamically, ECAC mitigates class imbalance and substantially improves segmentation accuracy. Abstract: Prevalent semantic segmentation methods generally adopt a vanilla classifier to categorize each pixel into specific classes. Although such a classifier learns global information from the training data, this information is represented by a set of fixed parameters (weights and biases). However, each image has a different class distribution, which prevents the classifier from addressing the unique characteristics of individual images. At the dataset level, class imbalance leads to segmentation results being biased towards majority classes, limiting the model's effectiveness in identifying and segmenting minority class regions. In this paper, we propose an Extended Context-Aware Classifier (ECAC) that dynamically adjusts the classifier using global (dataset-level) and local (image-level) contextual information. Specifically, we leverage a memory bank to learn dataset-level contextual information of each class, incorporating the class-specific contextual information from the current image to improve the classifier for precise pixel labeling. Additionally, a teacher-student network paradigm is adopted, where the domain expert (teacher network) dynamically adjusts contextual information with ground truth and transfers knowledge to the student network. Comprehensive experiments illustrate that the proposed ECAC can achieve state-of-the-art performance across several datasets, including ADE20K, COCO-Stuff10K, and Pascal-Context.
[112] Test-Time Adaptive Object Detection with Foundation Model
Yingjie Gao,Yanan Zhang,Zhi Cai,Di Huang
Main category: cs.CV
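TL;DR: Proposes a foundation-model-based test-time adaptive object detection method, the first that requires no source data and goes beyond the traditional closed-set assumption; with a multi-modal prompt-based mean-teacher framework and an instance dynamic memory module, it achieves superior performance in cross-domain and cross-category scenarios.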
Details
Motivation: Existing test-time adaptation methods depend on source-domain data and assume that source and target domains share the same category space, which makes them ill-suited to open categories and source-free settings in real-world scenarios. Method: Designs a Multi-modal Prompt-based Mean-Teacher framework that combines text and visual prompt tuning; proposes a Test-time Warm-start strategy for the visual prompts; and builds an Instance Dynamic Memory (IDM) module with Memory Enhancement and Memory Hallucination strategies to improve pseudo-label quality. Result: Substantially outperforms previous methods on cross-corruption and cross-dataset benchmarks and adapts to arbitrary cross-domain and cross-category target data. Conclusion: The method achieves test-time adaptive detection without source data and beyond closed-set limits, giving it stronger potential for real-world use. Abstract: In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches heavily rely on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies, Memory Enhancement and Memory Hallucination, to leverage IDM's high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at https://github.com/gaoyingjay/ttaod_foundation.
[113] Mask-Robust Face Verification for Online Learning via YOLOv5 and Residual Networks
Zhifeng Wang,Minghui Wang,Chunyan Zeng,Jialong Yao,Yang Yang,Hongmin Xu
Main category: cs.CV
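TL;DR: Proposes an identity authentication method for online learning based on an improved convolutional neural network (a residual network) and YOLOv5, which performs face recognition and identity verification on images from students' cameras to improve the security and stability of online education.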
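The final verification step described in this entry, comparing a face embedding against an enrolled database by Euclidean distance, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy 3-dimensional embeddings, and the acceptance threshold of 0.5 are all assumptions, and the detection (YOLOv5) and feature extraction (residual network) stages are taken as given.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify_identity(probe, enrolled, threshold):
    """Return (student_id, distance) for the nearest enrolled embedding,
    or (None, distance) when the nearest one is farther than `threshold`."""
    best_id, best_dist = None, math.inf
    for student_id, embedding in enrolled.items():
        dist = euclidean_distance(probe, embedding)
        if dist < best_dist:
            best_id, best_dist = student_id, dist
    if best_dist <= threshold:
        return best_id, best_dist
    return None, best_dist


# Hypothetical enrolled embeddings (real ones would come from the network).
enrolled = {
    "student_a": [0.0, 1.0, 0.0],
    "student_b": [1.0, 0.0, 0.0],
}

match, dist = verify_identity([0.1, 0.9, 0.0], enrolled, threshold=0.5)
print(match, round(dist, 3))  # student_a 0.141
```

In practice the threshold would be tuned on held-out genuine and impostor pairs, since it trades false accepts against false rejects.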
Details
Motivation: Advances in information technology and artificial intelligence have driven rapid growth in online education, but identity authentication remains an unresolved problem; the COVID-19 pandemic also accelerated the spread of e-learning and raised the bar for system security. Method: A YOLOv5 network detects faces in images from students' cameras, a residual network extracts deep facial features, and identity is authenticated by computing Euclidean distances against a database of student faces. Result: The method identifies and verifies online learners effectively, improving authentication accuracy and system security and supporting stable operation of the online education environment. Conclusion: A deep learning model combining YOLOv5 with a residual network provides an efficient and secure solution for identity authentication in online education and supports the development of intelligent education. Abstract: In the contemporary landscape, the fusion of information technology and the rapid advancement of artificial intelligence have ushered school education into a transformative phase characterized by digitization and heightened intelligence. Concurrently, the global paradigm shift caused by the COVID-19 pandemic has catalyzed the evolution of e-learning, accentuating its significance. Amidst these developments, one pivotal facet of the online education paradigm that warrants attention is the authentication of identities within the digital learning sphere. Within this context, our study delves into a solution for online learning authentication, utilizing an enhanced convolutional neural network architecture, specifically the residual network model. By harnessing the power of deep learning, this technological approach aims to galvanize the ongoing progress of online education, while concurrently bolstering its security and stability. Such fortification is imperative in enabling online education to seamlessly align with the swift evolution of the educational landscape. This paper's focal proposition involves the deployment of the YOLOv5 network, meticulously trained on our proprietary dataset. This network is tasked with identifying individuals' faces culled from images captured by students' open online cameras. The resultant facial information is then channeled into the residual network to extract intricate features at a deeper level. Subsequently, a comparative analysis of Euclidean distances against students' face databases is performed, effectively ascertaining the identity of each student.
[114] AI-Powered Early Detection of Critical Diseases using Image Processing and Audio Analysis
Manisha More,Kavya Bhand,Kaustubh Mukdam,Kavya Sharma,Manas Kawtikwar,Hridayansh Kaware,Prajwal Kavhar
Main category: cs.CV
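TL;DR: Proposes a multimodal AI diagnostic framework that combines image, thermal imaging, and audio processing for early detection of skin cancer, vascular blood clots, and cardiopulmonary abnormalities, and that is lightweight and deployable.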
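This entry reports accuracy, sensitivity, and specificity for its classifiers. For readers who want the definitions in concrete form, the sketch below computes the three metrics from binary confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration; they are not the paper's data.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity (recall on positives) and specificity
    (recall on negatives) from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("need at least one positive and one negative sample")
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts: 100 diseased cases (90 detected), 100 healthy (80 cleared).
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
print(metrics)  # {'accuracy': 0.85, 'sensitivity': 0.9, 'specificity': 0.8}
```

For a screening tool, sensitivity is usually the figure to watch, because a missed positive (false negative) is costlier than a false alarm.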
Details
Motivation: Existing diagnostic methods are costly, invasive, and hard to access in low-resource regions, so low-cost, non-invasive, scalable early diagnosis is urgently needed. Method: A fine-tuned MobileNetV2 classifies skin lesions, an SVM with handcrafted features detects clots in thermal images, and MFCC features extracted from heart and lung sounds are classified with a Random Forest. Result: Skin cancer classification reaches 89.3% accuracy and 91.6% sensitivity; clot detection reaches 86.4% accuracy (AUC = 0.89); cardiopulmonary abnormality recognition reaches 87.2% accuracy and 85.7% sensitivity. Conclusion: The multimodal AI framework performs comparably to state-of-the-art models while remaining lightweight, making it suitable for deployment in low-resource settings and advancing accessible AI pre-diagnosis. Abstract: Early diagnosis of critical diseases can significantly improve patient survival and reduce treatment costs. However, existing diagnostic techniques are often costly, invasive, and inaccessible in low-resource regions. This paper presents a multimodal artificial intelligence (AI) diagnostic framework integrating image analysis, thermal imaging, and audio signal processing for early detection of three major health conditions: skin cancer, vascular blood clots, and cardiopulmonary abnormalities. A fine-tuned MobileNetV2 convolutional neural network was trained on the ISIC 2019 dataset for skin lesion classification, achieving 89.3% accuracy, 91.6% sensitivity, and 88.2% specificity. A support vector machine (SVM) with handcrafted features was employed for thermal clot detection, achieving 86.4% accuracy (AUC = 0.89) on synthetic and clinical data. For cardiopulmonary analysis, lung and heart sound datasets from PhysioNet and Pascal were processed using Mel-Frequency Cepstral Coefficients (MFCC) and classified via Random Forest, reaching 87.2% accuracy and 85.7% sensitivity. Comparative evaluation against state-of-the-art models demonstrates that the proposed system achieves competitive results while remaining lightweight and deployable on low-cost devices. The framework provides a promising step toward scalable, real-time, and accessible AI-based pre-diagnostic healthcare solutions.
[115] U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching
Junsheng Zhou,Xingyu Shi,Haichuan Song,Yi Fang,Yu-Shen Liu,Zhizhong Han
Main category: cs.CV
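TL;DR: Proposes U-CAN, an unsupervised point cloud denoising framework based on consistency-aware Noise2Noise matching; with a novel loss function and a geometric consistency constraint, it denoises effectively without clean data and performs well on both point cloud and image denoising tasks.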
Details
Motivation: Point clouds captured by scanning sensors are often corrupted by noise, which severely affects downstream tasks, while existing supervised denoising methods depend on large amounts of manually annotated clean data at high cost; an unsupervised method that needs no paired data is therefore required. Method: Proposes the U-CAN framework, which uses a noise-to-noise matching scheme in which a neural network infers a multi-step denoising path for each point; designs a new loss function that enables statistical reasoning over multiple noisy point cloud observations; and introduces a geometric consistency constraint to learn consistency-aware denoising patterns. Result: On widely used point cloud denoising, upsampling, and image denoising benchmarks, U-CAN significantly outperforms existing unsupervised methods and is comparable to supervised ones. Conclusion: U-CAN is an effective unsupervised denoising framework; its consistency-aware constraint is general across domains, applies to both 3D point cloud and 2D image denoising, and reduces dependence on clean training data. Abstract: Point clouds captured by scanning sensors are often perturbed by noise, which has a highly negative impact on downstream tasks (e.g. surface reconstruction and shape understanding). Previous works mostly focus on training neural networks with noisy-clean point cloud pairs for learning denoising priors, which requires extensive manual effort. In this work, we introduce U-CAN, an Unsupervised framework for point cloud denoising with Consistency-Aware Noise2Noise matching. Specifically, we leverage a neural network to infer a multi-step denoising path for each point of a shape or scene with a noise-to-noise matching scheme. We achieve this by a novel loss which enables statistical reasoning on multiple noisy point cloud observations. We further introduce a novel constraint on the denoised geometry consistency for learning consistency-aware denoising patterns. We justify that the proposed constraint is a general term which is not limited to the 3D domain and can also contribute to the area of 2D image denoising. Our evaluations under the widely used benchmarks in point cloud denoising, upsampling and image denoising show significant improvement over the state-of-the-art unsupervised methods, where U-CAN also produces comparable results with the supervised methods.
[116] MSF-Net: Multi-Stage Feature Extraction and Fusion for Robust Photometric Stereo
Shiyu Qin,Zhihao Cai,Kaixuan Wang,Lin Qi,Junyu Dong
Main category: cs.CV
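TL;DR: Proposes MSF-Net, a new framework that combines multi-stage information extraction with a selective update strategy and a feature fusion module, significantly improving the accuracy of surface normal estimation.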
Details
Motivation: Existing learning-based methods capture multi-stage features and feature interactions poorly, which leads to redundant features being extracted in regions with intricate detail. Method: Proposes MSF-Net, which comprises multi-stage information extraction, a selective update strategy, and a feature fusion module to improve feature quality and normal estimation accuracy. Result: Experiments on the DiLiGenT benchmark show that MSF-Net significantly outperforms previous state-of-the-art methods in surface normal estimation accuracy. Conclusion: By improving feature extraction and fusion, MSF-Net improves surface normal estimation in complex scenes. Abstract: Photometric stereo is a technique aimed at determining surface normals through the utilization of shading cues derived from images taken under different lighting conditions. However, existing learning-based approaches often fail to accurately capture features at multiple stages and do not adequately promote interaction between these features. Consequently, these models tend to extract redundant features, especially in areas with intricate details such as wrinkles and edges. To tackle these issues, we propose MSF-Net, a novel framework for extracting information at multiple stages, paired with a selective update strategy, aiming to extract high-quality feature information, which is critical for accurate normal construction. Additionally, we have developed a feature fusion module to improve the interplay among different features. Experimental results on the DiLiGenT benchmark show that our proposed MSF-Net significantly surpasses previous state-of-the-art methods in the accuracy of surface normal estimation.
[117] Aligning What You Separate: Denoised Patch Mixing for Source-Free Domain Adaptation in Medical Image Segmentation
Quang-Khai Bui-Tran,Thanh-Huy Nguyen,Hoang-Thien Nguyen,Ba-Thinh Lam,Nguyen Lan Vi Vu,Phat K. Huynh,Ulas Bagci,Min Xu
Main category: cs.CV
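TL;DR: Proposes a new source-free domain adaptation (SFDA) framework that improves medical image segmentation through hard sample selection and denoised patch mixing.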
Details
Motivation: Existing SFDA methods often ignore sample difficulty and perform poorly under domain shift because of noisy supervision. Method: Entropy-similarity analysis partitions samples into reliable and unreliable subsets, Monte Carlo-based denoising masks refine the pseudo-labels, and intra- and inter-domain patch mixing strategies mix patches between the subsets. Result: Outperforms existing SFDA and UDA methods on multiple benchmark datasets, with higher Dice scores, lower ASSD scores, and more precise boundary delineation. Conclusion: Progressive adaptation and denoised supervision are essential for robust medical image segmentation under domain shift. Abstract: Source-Free Domain Adaptation (SFDA) is emerging as a compelling solution for medical image segmentation under privacy constraints, yet current approaches often ignore sample difficulty and struggle with noisy supervision under domain shift. We present a new SFDA framework that leverages Hard Sample Selection and Denoised Patch Mixing to progressively align target distributions. First, unlabeled images are partitioned into reliable and unreliable subsets through entropy-similarity analysis, allowing adaptation to start from easy samples and gradually incorporate harder ones. Next, pseudo-labels are refined via Monte Carlo-based denoising masks, which suppress unreliable pixels and stabilize training. Finally, intra- and inter-domain objectives mix patches between subsets, transferring reliable semantics while mitigating noise. Experiments on benchmark datasets show consistent gains over prior SFDA and UDA methods, delivering more accurate boundary delineation and achieving state-of-the-art Dice and ASSD scores. Our study highlights the importance of progressive adaptation and denoised supervision for robust segmentation under domain shift.
[118] Balanced conic rectified flow
Kim Shin Seong,Mingi Kwon,Jaeseok Jeong,Youngjung Uh
Main category: cs.CV
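TL;DR: Proposes an improved rectified flow method that brings real images into training, reducing reliance on generated image pairs and thereby lowering computational cost and improving generation quality.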
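To see why straight ODE paths matter for one-step generation in rectified flow, consider Euler integration of dx/dt = v(x, t) from t = 0 to t = 1. The toy sketch below is not the paper's model: the two scalar velocity fields and the function name `euler_sample` are assumptions chosen to show that a single Euler step is exact when the velocity is constant along the path and inaccurate when it is not.

```python
import math


def euler_sample(x0, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with `steps` Euler steps."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x


def straight(x, t):
    """Constant velocity: the trajectory is a straight line from 1.0 to 3.0."""
    return 2.0


def curved(x, t):
    """Time-varying velocity: the exact endpoint from 1.0 is 2.0."""
    return 2.0 * t


one_step_straight = euler_sample(1.0, straight, steps=1)      # exactly 3.0
many_step_straight = euler_sample(1.0, straight, steps=100)   # about 3.0
one_step_curved = euler_sample(1.0, curved, steps=1)          # 1.0, far from 2.0
many_step_curved = euler_sample(1.0, curved, steps=100)       # about 1.99

print(one_step_straight, one_step_curved)  # 3.0 1.0
```

With the straight field, one step and one hundred steps agree; with the curved field, one step misses the endpoint entirely, which is the intuition behind using reflow to straighten paths.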
Details
Motivation: The original rectified flow method needs a large number of generated image pairs, which makes it computationally expensive, and its performance is limited by bias in the generated data. Method: Trains with a small set of generated and real images during reflow while preserving the ODE paths of real images. Result: Achieves better FID scores on CIFAR-10 in both one-step generation and full-step simulation while using fewer generated image pairs. Conclusion: The method reduces reliance on generated data and improves model robustness and generation efficiency while preserving the distribution of real images. Abstract: Rectified flow is a generative model that learns smooth transport mappings between two distributions through an ordinary differential equation (ODE). Unlike diffusion-based generative models, which require costly numerical integration of a generative ODE to sample images with state-of-the-art quality, rectified flow uses an iterative process called reflow to learn smooth and straight ODE paths. This allows for relatively simple and efficient generation of high-quality images. However, rectified flow still faces several challenges. 1) The reflow process requires a large number of generative pairs to preserve the target distribution, leading to significant computational costs. 2) Since the model is typically trained using only generated image pairs, its performance heavily depends on the 1-rectified flow model, causing it to become biased towards the generated data. In this work, we experimentally expose the limitations of the original rectified flow and propose a novel approach that incorporates real images into the training process. By preserving the ODE paths for real images, our method effectively reduces reliance on large amounts of generated data. Instead, we demonstrate that the reflow process can be conducted efficiently using a much smaller set of generated and real images. On CIFAR-10, we achieved significantly better FID scores, not only in one-step generation but also in full-step simulations, while using only of the generative pairs compared to the original method. Furthermore, our approach induces straighter paths and avoids saturation on generated images during reflow, leading to more robust ODE learning while preserving the distribution of real images.
[119] Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation
Yuxiang Mao,Zhijie Zhang,Zhiheng Zhang,Jiawei Liu,Chen Zeng,Shihong Xia
Main category: cs.CV
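TL;DR: Proposes a speech- and emotion-driven method for generating 3D facial animation that jointly learns speech-driven and emotion-driven blendshapes, producing expressive talking-face animation without real emotional talking-face data.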
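This entry models facial animation as a linear additive combination of speech-driven and emotion-driven blendshapes. A minimal sketch of that idea, under the assumption that each blendshape is a per-coordinate offset from a neutral face, is shown below. The function name, the flattened 3-value "mesh", and the weights are hypothetical; how the paper learns the blendshapes and weights is not reproduced here.

```python
def apply_blendshapes(neutral, speech_shapes, speech_weights,
                      expr_shapes, expr_weights):
    """Return neutral + sum_i ws[i] * speech_shapes[i] + sum_j we[j] * expr_shapes[j].

    Every shape is a list of per-coordinate offsets with the same length
    as `neutral` (a flattened list of vertex coordinates)."""
    out = list(neutral)
    for shapes, weights in ((speech_shapes, speech_weights),
                            (expr_shapes, expr_weights)):
        if len(shapes) != len(weights):
            raise ValueError("one weight is required per blendshape")
        for shape, weight in zip(shapes, weights):
            if len(shape) != len(out):
                raise ValueError("blendshape size must match the neutral face")
            for i, delta in enumerate(shape):
                out[i] += weight * delta
    return out


# Toy example: one speech blendshape and one expression blendshape.
neutral = [0.0, 0.0, 0.0]
speech_shapes = [[1.0, 0.0, 0.0]]   # e.g. a jaw-opening offset
expr_shapes = [[0.0, 2.0, 0.0]]     # e.g. a smile offset

frame = apply_blendshapes(neutral, speech_shapes, [0.5], expr_shapes, [0.25])
print(frame)  # [0.5, 0.5, 0.0]
```

Because the two contributions simply add, the speech weights can be driven by audio while the expression weights are set independently, which is what makes the disentanglement described in the entry useful.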
Details
Motivation: Existing speech-driven facial animation mostly focuses on neutral expressions and lacks emotional expressiveness; real emotional 3D face data is scarce and costly to capture, which holds back emotional facial animation. Method: Facial animation driven by speech and emotion is modeled as a linear additive problem and trained jointly on a neutral-expression talking-face dataset (VOCAset) and a 3D expression dataset (Florence4D); a sparsity constraint loss disentangles the speech and emotion blendshapes while retaining secondary cross-domain deformations; the learned blendshapes are mapped to FLAME model parameters to drive 3D Gaussian avatars. Result: Experiments show that the method naturally generates talking-face animation with the specified emotion while keeping lip synchronization accurate; perceptual studies show greater emotional expressivity than existing methods without sacrificing lip-sync quality. Conclusion: The method addresses the scarcity of data for emotional 3D facial animation and achieves high-quality, disentangled facial animation driven jointly by speech and emotion, with good application potential. Abstract: Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
[120] DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis
Yinqi Cai,Jichang Li,Zhaolun Li,Weikai Chen,Rushi Lan,Xi Xie,Xiaonan Luo,Guanbin Li
Main category: cs.CV
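TL;DR: Proposes DeepShield, a new deepfake detection framework that combines local sensitivity with global generalization to improve robustness against unseen forgery techniques.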
Details
Motivation: Existing detectors rely on forgery-specific artifacts, generalize poorly in cross-domain scenarios, and struggle with diverse forgery techniques. Method: DeepShield enhances the CLIP-ViT encoder with Local Patch Guidance (LPG), which performs spatiotemporal artifact modeling and patch-wise supervision, and Global Forgery Diversification (GFD), which applies domain feature augmentation to generate diverse forgery samples and improve cross-domain adaptability. Result: DeepShield outperforms current state-of-the-art methods in cross-dataset and cross-manipulation evaluations and shows stronger robustness. Conclusion: By fusing local and global analysis, DeepShield improves detection performance and generalization against unknown deepfake attacks. Abstract: Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.
[121] VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations
Qianqian Qiao,DanDan Zheng,Yihang Bo,Bao Peng,Heng Huang,Longteng Jiang,Huaye Wang,Jingdong Chen,Jun Zhou,Xin Jin
Main category: cs.CV
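TL;DR: Introduces VADB, currently the largest video aesthetic assessment database, with 10,490 diverse videos annotated by 37 professionals, and proposes VADB-Net, a dual-modal pre-training framework that outperforms existing models on video aesthetic scoring tasks.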
Details
Motivation: Progress in video aesthetic assessment is limited by the lack of standardized datasets and robust models; in particular, the temporal dynamics of video and the challenges of multimodal fusion make image-based methods hard to apply directly. Method: Builds the large-scale video aesthetic database VADB, which includes multi-dimensional aesthetic scores, language comments, and objective tags; designs the dual-modal pre-training framework VADB-Net with a two-stage training strategy. Result: VADB-Net outperforms existing video quality assessment models in scoring tasks and supports a variety of downstream video aesthetic assessment tasks. Conclusion: VADB provides an important resource for video aesthetics research, and VADB-Net demonstrates effectiveness on complex multimodal video understanding tasks, advancing the field. Abstract: Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at https://github.com/BestiVictory/VADB.
[122] Mapping and Classification of Trees Outside Forests using Deep Learning
Moritz Lucas,Hamid Ebrahimy,Viacheslav Barkov,Ralf Pecenka,Kai-Uwe Kühnberger,Björn Waske
Main category: cs.CV
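TL;DR: Using deep learning and high-resolution aerial imagery, this study classifies Trees Outside Forests (TOF) in four agricultural landscapes in Germany. It compares several semantic segmentation models and finds that FT-UNetFormer performs best, underscoring the importance of spatial context understanding in TOF mapping.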
Details
Motivation: Existing studies often treat TOF as a single class or rely on fixed rule-based thresholds, which limits ecological interpretation and adaptability across regions; a more flexible and accurate classification approach is needed.
Method: Using a newly generated dataset, compare convolutional neural networks (CNNs), vision transformers, and hybrid models across six semantic segmentation architectures to classify four categories of woody vegetation: Forest, Patch, Linear, and Tree.
Result: The models perform well overall, with FT-UNetFormer best (mean IoU 0.74, mean F1 0.84). Forest and Linear structures are classified well, while Patch and Tree remain challenging because of their high edge density. Generalization experiments show that regionally diverse training data is essential for large-scale mapping.
Conclusion: Deep learning is effective for fine-grained TOF classification, but regionally diverse data is needed to improve generalization, and future work should strengthen the recognition of complex structures.
Abstract: Trees Outside Forests (TOF) play an important role in agricultural landscapes by supporting biodiversity, sequestering carbon, and regulating microclimates. Yet, most studies have treated TOF as a single class or relied on rigid rule-based thresholds, limiting ecological interpretation and adaptability across regions. To address this, we evaluate deep learning for TOF classification using a newly generated dataset and high-resolution aerial imagery from four agricultural landscapes in Germany. Specifically, we compare convolutional neural networks (CNNs), vision transformers, and hybrid CNN-transformer models across six semantic segmentation architectures (ABCNet, LSKNet, FT-UNetFormer, DC-Swin, BANet, and U-Net) to map four categories of woody vegetation: Forest, Patch, Linear, and Tree, derived from previous studies and governmental products. Overall, the models achieved good classification accuracy across the four landscapes, with the FT-UNetFormer performing best (mean Intersection-over-Union 0.74; mean F1 score 0.84), underscoring the importance of spatial context understanding in TOF mapping and classification. Our results show strong performance for the Forest and Linear classes and reveal challenges in classifying complex structures with high edge density, notably the Patch and Tree classes. Our generalization experiments highlight the need for regionally diverse training data to ensure reliable large-scale mapping. The dataset and code are openly available at https://github.com/Moerizzy/TOFMapper
[123] RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models
Zijun Liao,Yian Zhao,Xin Shan,Yu Yan,Chang Liu,Lei Lu,Xiangyang Ji,Jie Chen
Main category: cs.CV
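TL;DR: This paper proposes an efficient and scalable knowledge distillation framework that uses Vision Foundation Models (VFMs) to strengthen lightweight real-time object detectors. It introduces a Deep Semantic Injector (DSI) and a Gradient-guided Adaptive Modulation (GAM) strategy, significantly improving the detection accuracy of DETR-based models without adding inference overhead.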
Details
Motivation: In pursuing high-speed inference, lightweight networks often sacrifice feature representation capacity, which limits performance and makes practical deployment difficult. Existing methods struggle to use powerful vision foundation models for semantic knowledge transfer.
Method: A VFM-based distillation framework with two core components: 1) a Deep Semantic Injector (DSI) module that injects high-level semantic information from the VFM into the deep layers of the detector; 2) a Gradient-guided Adaptive Modulation (GAM) strategy that dynamically adjusts the strength of knowledge transfer based on gradient norm ratios, giving stable and task-aligned semantic transfer.
Result: Without increasing deployment or inference overhead, the method significantly improves a range of DETR-based models. The new RT-DETRv4 family reaches state-of-the-art results on COCO, with AP of 49.7/53.5/55.4/57.0 at 273/169/124/78 FPS respectively.
Conclusion: The proposed distillation framework uses vision foundation models effectively to strengthen lightweight detectors, achieving a strong balance between speed and accuracy with clear practical value.
Abstract: Real-time object detection has achieved substantial progress through meticulously designed architectures and optimization strategies. However, the pursuit of high-speed inference via lightweight network designs often leads to degraded feature representation, which hinders further performance improvements and practical on-device deployment. In this paper, we propose a cost-effective and highly adaptable distillation framework that harnesses the rapidly evolving capabilities of Vision Foundation Models (VFMs) to enhance lightweight object detectors. Given the significant architectural and learning objective disparities between VFMs and resource-constrained detectors, achieving stable and task-aligned semantic transfer is challenging. To address this, on one hand, we introduce a Deep Semantic Injector (DSI) module that facilitates the integration of high-level representations from VFMs into the deep layers of the detector. On the other hand, we devise a Gradient-guided Adaptive Modulation (GAM) strategy, which dynamically adjusts the intensity of semantic transfer based on gradient norm ratios. Without increasing deployment and inference overhead, our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models, underscoring its practical utility for real-time detection.
Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.
[124] LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
Yang Miao,Jan-Nico Zaech,Xi Wang,Fabien Despinoy,Danda Pani Paudel,Luc Van Gool
Main category: cs.CV
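TL;DR: Proposes LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. By building object-part hierarchies in language space, it jointly detects and segments objects and parts from open-vocabulary categories.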
Details
Motivation: Existing methods rely on heuristic or learnable visual grouping strategies, which are limited in handling open-vocabulary object-part hierarchies and struggle to model complex semantic structure accurately.
Method: Integrate an MLLM into the object-part parsing pipeline, using its knowledge and reasoning capabilities to build and interpret object-part hierarchies in language space, and improve segmentation with an MLLM-driven part query refinement strategy.
Result: On PartImageNet, LangHOPS improves over prior methods by 5.5% AP in the in-domain setting and 4.8% AP in the cross-dataset setting; on zero-shot semantic segmentation on ADE20K, it improves mIOU on unseen parts by 2.5%. Ablation studies confirm the effectiveness of the language-grounded hierarchy and the query refinement strategy.
Conclusion: By combining a multimodal large language model with hierarchy modeling in language space, LangHOPS significantly improves open-vocabulary object-part instance segmentation and shows the potential of language knowledge in complex visual parsing tasks.
Abstract: We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and to link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM-driven part query refinement strategy. The code will be released.
[125] Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation
Yuyang Huang,Yabo Chen,Junyu Zhou,Wenrui Dai,Xiaopeng Zhang,Junni Zou,Hongkai Xiong,Qi Tian
Main category: cs.CV
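TL;DR: Proposes Diffusion-Driven Progressive Target Manipulation (DPTM), a diffusion-based framework that addresses large source-target discrepancies in source-free domain adaptation and significantly improves performance.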
Details
Motivation: Existing SFDA methods are limited by the distribution discrepancy between source and target domains: non-generation methods produce unreliable pseudo-labels when the discrepancy is large, and generation-based methods degrade because of bias in the pseudo-source data.
Method: The DPTM framework divides target samples into a trust set and a non-trust set. For non-trust samples, a latent diffusion model performs a semantic transformation while keeping them in the target distribution. A progressive refinement mechanism gradually narrows the gap between the pseudo-target domain and the real target domain.
Result: State-of-the-art performance on four mainstream SFDA benchmarks, with a maximum gain of 18.6%.
Conclusion: By reliably generating and progressively refining a pseudo-target domain, DPTM mitigates the domain discrepancy problem and significantly improves SFDA under large domain shifts.
Abstract: Source-free domain adaptation (SFDA) is a challenging task that tackles domain shifts using only a pre-trained source model and unlabeled target data. Existing SFDA methods are restricted by the fundamental limitation of source-target domain discrepancy. Non-generation SFDA methods suffer from unreliable pseudo-labels in challenging scenarios with large domain discrepancies, while generation-based SFDA methods are evidently degraded due to enlarged domain discrepancies in creating pseudo-source data. To address this limitation, we propose a novel generation-based framework named Diffusion-Driven Progressive Target Manipulation (DPTM) that leverages unlabeled target data as references to reliably generate and progressively refine a pseudo-target domain for SFDA. Specifically, we divide the target samples into a trust set and a non-trust set based on the reliability of pseudo-labels to sufficiently and reliably exploit their information. For samples from the non-trust set, we develop a manipulation strategy to semantically transform them into the newly assigned categories, while simultaneously maintaining them in the target distribution via a latent diffusion model. Furthermore, we design a progressive refinement mechanism that progressively reduces the domain discrepancy between the pseudo-target domain and the real target domain via iterative refinement. Experimental results demonstrate that DPTM outperforms existing methods by a large margin and achieves state-of-the-art performance on four prevailing SFDA benchmark datasets with different scales.
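The trust / non-trust split described above can be sketched as follows. This is only an illustration under a stated assumption: pseudo-label reliability is taken to be the maximum softmax probability compared against a fixed threshold. The abstract does not say which reliability criterion DPTM actually uses, and `split_by_confidence` and `threshold` are hypothetical names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def split_by_confidence(logits_per_sample, threshold=0.9):
    """Split samples into (trust, non_trust) lists of (index, pseudo_label).

    Assumption: a sample is "trusted" when the source model's top softmax
    probability reaches `threshold`. This criterion is hypothetical.
    """
    trust, non_trust = [], []
    for idx, logits in enumerate(logits_per_sample):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        bucket = trust if probs[label] >= threshold else non_trust
        bucket.append((idx, label))
    return trust, non_trust
```

In this sketch only the non-trust samples would be passed on to a generative manipulation step; the trusted ones keep their pseudo-labels unchanged.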
Remarkably, DPTM can significantly enhance the performance by up to 18.6% in scenarios with large source-target gaps.
[126] GaTector+: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction
Yang Jin,Guangyu Guo,Binglu Wang
Main category: cs.CV
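TL;DR: This paper proposes GaTector+, a unified framework that jointly addresses gaze object detection and gaze following without relying on head-related priors. It introduces a new feature extraction structure, an attention mechanism, and the evaluation metric mSoC, and performs strongly on multiple benchmark datasets.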
Details
Motivation: Existing methods usually handle gaze object detection and gaze following separately and depend on prior knowledge of head location, which requires an extra network to extract head information and limits both joint optimization and practical use. A unified framework that needs no head priors and can be optimized end to end is therefore needed.
Method: GaTector+ uses an expanded specific-general-specific feature extractor: a shared backbone extracts general features, with task-specific blocks before and after it. A head detection branch is embedded to obtain head information implicitly, and a head-based attention mechanism fuses the sense feature and the gaze feature. An attention supervision mechanism accelerates learning of the gaze heatmap, and a new evaluation metric, mSoC, is more sensitive to variations between bounding boxes.
Result: Experiments on multiple standard datasets show that GaTector+ outperforms existing methods on both gaze object detection and gaze following, achieving better or comparable performance without head-prior inputs.
Conclusion: GaTector+ provides a unified modeling approach that needs no head priors. Its combination of shared and specific modules, attention mechanisms, and a new metric improves both tasks and offers good scalability and deployment potential.
Abstract: Gaze object detection and gaze following are fundamental tasks for interpreting human gaze behavior or intent. However, most previous methods solve these two tasks separately, and their prediction of gaze objects and gaze following typically depends on head-related prior knowledge during both the training phase and real-world deployment. This dependency necessitates an auxiliary network to extract head location, thus precluding joint optimization across the entire system and constraining the practical applicability. To this end, we propose GaTector+, a unified framework for gaze object detection and gaze following, which eliminates the dependence on the head-related priors during inference. Specifically, GaTector+ uses an expanded specific-general-specific feature extractor that leverages a shared backbone, which extracts general features for gaze following and object detection while using specific blocks before and after the shared backbone to better consider the specificity of each sub-task. To obtain head-related knowledge without prior information, we first embed a head detection branch to predict the head of each person. Then, before regressing the gaze point, a head-based attention mechanism is proposed to fuse the sense feature and gaze feature with the help of head location. Since the suboptimization of the gaze point heatmap leads to the performance bottleneck, we propose an attention supervision mechanism to accelerate the learning of the gaze heatmap.
Finally, we propose a novel evaluation metric, mean Similarity over Candidates (mSoC), for gaze object detection, which is more sensitive to variations between bounding boxes. The experimental results on multiple benchmark datasets demonstrate the effectiveness of our model in both gaze object detection and gaze following tasks.
[127] Seeing Clearly and Deeply: An RGBD Imaging Approach with a Bio-inspired Monocentric Design
Zongxi Yu,Xiaolong Qian,Shaohua Gao,Qi Jiang,Yao Gao,Kailun Yang,Kaiwei Wang
Main category: cs.CV
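TL;DR: Proposes the Bionic Monocentric Imaging (BMI) framework, which achieves high-fidelity, compact RGBD imaging through an all-spherical optical design and a joint reconstruction algorithm.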
Details
Motivation: Conventional compact optics struggle to keep RGB images sharp across the entire depth of field, while software-only monocular depth estimation is ill-posed and depends on unreliable semantic priors. Deep optics elements such as DOEs can encode depth but add fabrication complexity and chromatic aberration.
Method: Design a novel bio-inspired all-spherical monocentric lens whose depth-varying point spread functions (PSFs) naturally encode depth. Build a physically-based forward model to generate a synthetic dataset, and pair it with a dual-head, multi-scale reconstruction network with a shared encoder that jointly recovers an all-in-focus image and a precise depth map from a single coded capture.
Result: For depth estimation, the method reaches an Abs Rel of 0.026 and an RMSE of 0.130, markedly better than existing software-only methods and other deep optics systems. For image restoration, it achieves an SSIM of 0.960 and an LPIPS of 0.082, showing a strong balance between image fidelity and depth accuracy.
Conclusion: Combining bio-inspired all-spherical optics with a joint reconstruction algorithm is an effective strategy for the intrinsic challenges of high-performance compact RGBD imaging.
Abstract: Achieving high-fidelity, compact RGBD imaging presents a dual challenge: conventional compact optics struggle with RGB sharpness across the entire depth-of-field, while software-only Monocular Depth Estimation (MDE) is an ill-posed problem reliant on unreliable semantic priors. While deep optics with elements like DOEs can encode depth, they introduce trade-offs in fabrication complexity and chromatic aberrations, compromising simplicity. To address this, we first introduce a novel bio-inspired all-spherical monocentric lens, around which we build the Bionic Monocentric Imaging (BMI) framework, a holistic co-design. This optical design naturally encodes depth into its depth-varying Point Spread Functions (PSFs) without requiring complex diffractive or freeform elements. We establish a rigorous physically-based forward model to generate a synthetic dataset by precisely simulating the optical degradation process. This simulation pipeline is co-designed with a dual-head, multi-scale reconstruction network that employs a shared encoder to jointly recover a high-fidelity All-in-Focus (AiF) image and a precise depth map from a single coded capture. Extensive experiments validate the state-of-the-art performance of the proposed framework. In depth estimation, the method attains an Abs Rel of 0.026 and an RMSE of 0.130, markedly outperforming leading software-only approaches and other deep optics systems. For image restoration, the system achieves an SSIM of 0.960 and a perceptual LPIPS score of 0.082, thereby confirming a superior balance between image fidelity and depth accuracy.
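For readers unfamiliar with the two depth metrics quoted above, the sketch below computes them using their commonly used definitions: Abs Rel as the mean of |pred - gt| / gt, and RMSE as the square root of the mean squared error. The abstract does not state the evaluation protocol, so the valid-pixel rule (`gt > 0`) and the function names are assumptions for illustration.

```python
import math

def _valid_pairs(pred, gt):
    """Keep only pixels with a positive ground-truth depth (assumed mask)."""
    return [(p, g) for p, g in zip(pred, gt) if g > 0]

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

def rmse(pred, gt):
    """Root mean squared error: sqrt(mean((pred - gt)^2))."""
    pairs = _valid_pairs(pred, gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))
```

Abs Rel is dimensionless, while RMSE carries the units of the depth maps, so the two numbers are not directly comparable to each other.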
This study illustrates that the integration of bio-inspired, fully spherical optics with a joint reconstruction algorithm constitutes an effective strategy for addressing the intrinsic challenges in high-performance compact RGBD imaging. Source code will be publicly available at https://github.com/ZongxiYu-ZJU/BMI.
[128] Prototype-Driven Adaptation for Few-Shot Object Detection
Yushen Huang,Zhiming Wang
Main category: cs.CV
TL;DR: Proposes Prototype-Driven Alignment (PDA), a lightweight plug-in metric head for few-shot object detection that uses a prototype-based "second opinion" to mitigate base-class bias and unstable calibration.
Details
Motivation: Few-shot object detection (FSOD) is prone to base-class bias and unstable calibration when only a few novel-class samples are available, and existing methods struggle to balance novel-class and base-class detection performance.
Method: PDA adds a prototype-based metric head to DeFRCN. It maintains prototypes built only from support samples in a learnable projection space and updates them via exponential moving average (EMA). Prototype-conditioned RoI alignment reduces geometric mismatch, and best-of-K matching with temperature-scaled fusion combines metric similarities with the detector outputs.
Result: Experiments on the VOC FSOD and GFSOD benchmarks show that PDA consistently improves novel-class detection with minimal impact on base classes and negligible computational overhead.
Conclusion: PDA is an efficient plug-in module that mitigates base-class bias and calibration problems in FSOD and generalizes well in practice.
Abstract: Few-shot object detection (FSOD) often suffers from base-class bias and unstable calibration when only a few novel samples are available. We propose Prototype-Driven Alignment (PDA), a lightweight, plug-in metric head for DeFRCN that provides a prototype-based "second opinion" complementary to the linear classifier. PDA maintains support-only prototypes in a learnable identity-initialized projection space and optionally applies prototype-conditioned RoI alignment to reduce geometric mismatch. During fine-tuning, prototypes can be adapted via exponential moving average (EMA) updates on labeled foreground RoIs, without introducing class-specific parameters, and are frozen at inference to ensure strict protocol compliance. PDA employs a best-of-K matching scheme to capture intra-class multi-modality and temperature-scaled fusion to combine metric similarities with detector logits. Experiments on VOC FSOD and GFSOD benchmarks show that PDA consistently improves novel-class performance with minimal impact on base classes and negligible computational overhead.
[129] MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Runxi Huang,Mingxuan Yu,Mingyu Tsoi,Xiaomin Ouyang
Main category: cs.CV
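TL;DR: This paper proposes MMEdge, a multimodal inference framework for resource-constrained edge devices. Pipelined sensing and encoding enables fine-grained parallel processing, and together with temporal aggregation, adaptive configuration optimization, and cross-modal speculative skipping, it significantly reduces end-to-end latency while preserving accuracy.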
Details
Motivation: Prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex dependencies between modalities, and so struggles to meet the strict real-time requirements of edge applications.
Method: MMEdge decomposes inference into fine-grained sensing and encoding units and pipelines them so that computation proceeds incrementally. A lightweight temporal aggregation module captures temporal dynamics, and an adaptive multimodal configuration optimizer and a cross-modal speculative skipping mechanism handle resource variability and data complexity.
Result: Evaluation on two public multimodal datasets and a real-world UAV-based multimodal testbed shows that MMEdge significantly reduces end-to-end latency across various system and data dynamics while maintaining high task accuracy.
Conclusion: Through fine-grained pipelining and cross-modal optimization, MMEdge improves the efficiency and responsiveness of multimodal inference on resource-constrained edge devices, with applications in autonomous driving, human-computer interaction, and mobile health.
Abstract: Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, a new on-device multimodal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed.
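The speculative skipping idea above can be illustrated with a toy simulation. It is a sketch under assumptions, not MMEdge's implementation: the early prediction is reduced to a single confidence value per pipelined unit, one slow-modality unit is encoded per step, and `speculative_skip` and `threshold` are hypothetical names.

```python
def speculative_skip(early_confidences, num_slow_units, threshold=0.9):
    """Toy model of cross-modal speculative skipping.

    early_confidences[t] is the confidence of the early prediction that is
    available after pipelined unit t (assumed to be given). Slow-modality
    units are encoded one per step until that confidence reaches
    `threshold`; the remaining units are bypassed.

    Returns (units_encoded, units_skipped).
    """
    encoded = 0
    for t in range(num_slow_units):
        encoded += 1  # encode slow-modality unit t
        conf = early_confidences[t] if t < len(early_confidences) else 0.0
        if conf >= threshold:
            break  # confident enough: skip the remaining slow units
    return encoded, num_slow_units - encoded
```

For example, with confidences `[0.4, 0.7, 0.93, 0.99]` and six slow-modality units, the loop stops after the third unit, so three units are encoded and three are skipped.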
The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
[130] StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA
Yuhang Hu,Zhenyu Yang,Shihan Wang,Shengsheng Qian,Bin Wen,Fan Yang,Tingting Gao,Changsheng Xu
Main category: cs.CV
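TL;DR: Proposes StreamingCoT, the first dynamic reasoning dataset for streaming video question answering and multimodal Chain-of-Thought tasks. Per-second dense annotation and temporally dependent semantic segmentation, combined with an explicit reasoning chain generation paradigm, improve understanding of temporal dynamics in video and complex reasoning.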
Details
Motivation: Existing VideoQA datasets have two main limitations: static annotations cannot capture how answers evolve over time, and there are no explicit annotations of the reasoning process. Both limit models' temporal understanding and logical reasoning.
Method: Build a dynamic hierarchical annotation architecture that generates per-second dense descriptions and merges them by similarity into temporally dependent semantic segments. Propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, uses large language models to derive reasoning paths based on object state transitions, and relies on human verification for logical consistency.
Result: The StreamingCoT dataset and its construction toolkit are released, supporting research on streaming video understanding, complex temporal reasoning, and multimodal inference.
Conclusion: StreamingCoT provides a benchmark for temporally evolving reasoning in streaming VideoQA and for multimodal Chain-of-Thought tasks, supporting the development of video understanding and reasoning models.
Abstract: The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) Static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) The absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.
[131] More than a Moment: Towards Coherent Sequences of Audio Descriptions
Eshika Khandelwal,Junyu Xie,Tengda Han,Max Bain,Arsha Nagrani,Andrew Zisserman,Gül Varol,Makarand Tapaswi
Main category: cs.CV
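TL;DR: Proposes CoherentAD, a training-free method for generating coherent audio descriptions. Auto-regressive selection at the sequence level makes the descriptions more coherent and narrative, helping visually impaired audiences follow videos.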
Details
Motivation: Existing automatic audio description methods generate each description independently, which leads to repetition and incoherence and makes it hard for visually impaired users to follow the overall scene.
Method: CoherentAD first generates multiple candidate descriptions for each time interval, then performs auto-regressive selection across the sequence to form a coherent and informative narrative.
Result: The method outperforms prior approaches on the newly proposed sequence-level metric StoryRecall and on repetition metrics, producing more coherent and more informative AD sequences.
Conclusion: CoherentAD improves the narrative coherence of audio descriptions without any training, giving visually impaired users better support for understanding videos.
Abstract: Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
[132] Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples
Zhigang Tu,Zhengbo Zhang,Jia Gong,Junsong Yuan,Bo Du
Main category: cs.CV
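TL;DR: This paper proposes an active learning method based on a Markov Decision Process (MDP) and a projection into hyperbolic space, addressing the mismatch between informativeness and representativeness when selecting samples to annotate in semi-supervised 3D action recognition.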
Details
Motivation: When existing methods select skeleton sequences for annotation, they may pick samples whose knowledge the model has already acquired, which wastes annotation effort. A more intelligent sample selection mechanism is needed.
Method: Formulate active learning for semi-supervised 3D action recognition as a Markov Decision Process (MDP), represent the state-action factors in hyperbolic space to increase representational capacity, and introduce a meta tuning strategy to speed up real-world deployment.
Result: Experiments on three 3D action recognition benchmarks show that the method significantly outperforms existing active learning methods and selects high-value samples for annotation more efficiently.
Conclusion: Combining the MDP framework with hyperbolic space gives a more intelligent and efficient way to select skeleton sequences for annotation and improves semi-supervised 3D action recognition.
Abstract: Skeleton-based human action recognition aims to classify human skeletal sequences, which are spatiotemporal representations of actions, into predefined categories. To reduce the reliance on costly annotations of skeletal sequences while maintaining competitive recognition accuracy, the task of 3D Action Recognition with Limited Training Samples, also known as semi-supervised 3D Action Recognition, has been proposed. In addition, active learning, which aims to proactively select the most informative unlabeled samples for annotation, has been explored in semi-supervised 3D Action Recognition for training sample selection. Specifically, researchers adopt an encoder-decoder framework to embed skeleton sequences into a latent space, where clustering information, combined with a margin-based selection strategy using a multi-head mechanism, is utilized to identify the most informative sequences in the unlabeled set for annotation. However, the most representative skeleton sequences may not necessarily be the most informative for the action recognizer, as the model may have already acquired similar knowledge from previously seen skeleton samples. To address this, we reformulate semi-supervised 3D Action Recognition via active learning from a novel perspective by casting it as a Markov Decision Process (MDP). Built upon the MDP framework and its training paradigm, we train an informative sample selection model to intelligently guide the selection of skeleton sequences for annotation. To enhance the representational capacity of the factors in the state-action pairs within our method, we project them from Euclidean space to hyperbolic space.
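As an illustration of what projecting from Euclidean to hyperbolic space can look like, the sketch below uses the exponential map at the origin of the Poincaré ball with curvature -c, which sends a vector v to tanh(sqrt(c)·‖v‖) · v / (sqrt(c)·‖v‖). The abstract does not say which hyperbolic model or mapping the authors use, so the choice of the Poincaré ball, the function name `expmap0`, and the parameter `c` are assumptions.

```python
import math

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball (curvature -c).

    Maps a Euclidean vector to a point strictly inside the ball of
    radius 1/sqrt(c) while preserving its direction.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # the origin maps to itself
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]
```

Because tanh saturates below 1, arbitrarily large Euclidean vectors land near the boundary of the ball, where hyperbolic distances grow quickly; this property is often cited as the reason hyperbolic embeddings suit hierarchical structure.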
Furthermore, we introduce a meta tuning strategy to accelerate the deployment of our method in real-world scenarios. Extensive experiments on three 3D action recognition benchmarks demonstrate the effectiveness of our method.
[133] 3D CT-Based Coronary Calcium Assessment: A Feature-Driven Machine Learning Framework
Ayman Abaid,Gianpiero Guidone,Sara Alsubai,Foziyah Alquahtani,Talha Iqbal,Ruth Sharif,Hesham Elzomor,Emiliano Bianchini,Naeif Almagal,Michael G. Madden,Faisal Sharif,Ihsan Ullah
Main category: cs.CV
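TL;DR: This study proposes a radiomics-based pipeline that uses pseudo-labeling to classify coronary artery calcium scores automatically in non-contrast coronary computed tomography angiography (CCTA) without expert annotations, and compares radiomics features with deep learning features extracted by pretrained foundation models such as CT-FM and RadImageNet.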
Details
Motivation: Because annotated data is limited, conventional approaches struggle to detect coronary artery calcification effectively; a method is needed that classifies calcium scores accurately without expert segmentations.
Method: Combine a radiomics pipeline with pseudo-labeling to generate training labels, and use pretrained foundation models (CT-FM, RadImageNet) to extract image features that are fed to traditional classifiers. Test on non-contrast scans only and compare the different feature types.
Result: On a clinical CCTA dataset of 182 patients, radiomics-based models significantly outperform CNN-derived foundation model features (84% accuracy, p<0.05) without requiring expert annotations.
Conclusion: Radiomics combined with pseudo-labeling can classify coronary artery calcium scores effectively without expert annotations and outperforms the pretrained deep learning models evaluated.
Abstract: Coronary artery calcium (CAC) scoring plays a crucial role in the early detection and risk stratification of coronary artery disease (CAD). In this study, we focus on non-contrast coronary computed tomography angiography (CCTA) scans, which are commonly used for early calcification detection in clinical settings. To address the challenge of limited annotated data, we propose a radiomics-based pipeline that leverages pseudo-labeling to generate training labels, thereby eliminating the need for expert-defined segmentations. Additionally, we explore the use of pretrained foundation models, specifically CT-FM and RadImageNet, to extract image features, which are then used with traditional classifiers. We compare the performance of these deep learning features with that of radiomics features. Evaluation is conducted on a clinical CCTA dataset comprising 182 patients, where individuals are classified into two groups: zero versus non-zero calcium scores. We further investigate the impact of training on non-contrast datasets versus combined contrast and non-contrast datasets, with testing performed only on non-contrast scans. Results show that radiomics-based models significantly outperform CNN-derived embeddings from foundation models (achieving 84% accuracy and p<0.05), despite the unavailability of expert annotations.
[134] Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers
M Yashwanth,Sharannya Ghosh,Aditay Tripathi,Anirban Chakraborty
Main category: cs.CV
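TL;DR: Proposes the PEP-FedPT framework, which uses a Class-Contextualized Mixed Prompt (CCMP) for efficient prompt tuning of Vision Transformers in federated settings, improving generalization while retaining personalization.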
Details
Motivation: Global prompt tuning adapts poorly to heterogeneous clients, while personalized tuning tends to overfit and lacks generalization; federated learning needs both personalization and generalization.
Method: The PEP-FedPT framework introduces CCMP, which combines class-specific prompts with a globally shared prompt. Per-sample prompts are produced by weighting the class-specific prompts dynamically using global class prototypes and client class priors, and the prompts are optimized collaboratively through federated averaging.
Result: On CIFAR-100, TinyImageNet, DomainNet, and iNaturalist, PEP-FedPT outperforms existing methods across a variety of data heterogeneity scenarios.
Conclusion: PEP-FedPT balances personalization and generalization in federated prompt tuning and offers a new approach to efficient adaptation of ViT models.
Abstract: Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP), based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via standard federated averaging. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
[135] Instance-Level Composed Image Retrieval
Bill Psomas,George Retsinas,Nikos Efthymiadis,Panagiotis Filntisis,Yannis Avrithis,Petros Maragos,Ondrej Chum,Giorgos Tolias
Main category: cs.CV
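TL;DR: This paper introduces i-CIR, a new evaluation dataset for instance-level composed image retrieval, and proposes BASIC, a training-free method that uses pre-trained vision-and-language models and late fusion of image-query and text-query similarities, reaching state-of-the-art performance on multiple CIR datasets.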
Details
Motivation: Research on composed image retrieval (CIR) is held back by the lack of high-quality training and evaluation data, and most datasets use semantic-level class definitions that do not meet the needs of instance-level object matching.
Method: Introduce the i-CIR dataset, with instance-level annotation and a selection of hard negatives. Propose BASIC, which uses pre-trained VLMs to estimate query-image-to-image and query-text-to-image similarities separately and applies late fusion to boost images that satisfy both queries.
Result: BASIC sets a new state of the art on i-CIR and on several existing CIR datasets, confirming its effectiveness and generalization.
Conclusion: BASIC offers a training-free, efficient, and scalable solution for CIR, and i-CIR gives future instance-level retrieval research a compact but challenging benchmark.
Abstract: The progress of composed image retrieval (CIR), a popular research direction in image retrieval, where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while keeping it challenging (comparable to retrieval among more than 40M random distractors) through a semi-automated selection of hard negatives. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art not only on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.
[136] SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot, Real-Time, Monocular Depth Estimation in Underwater Environments
Hongjie Zhang,Gideon Billings,Stefan B. Williams
Main category: cs.CV
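TL;DR: This paper proposes SPADE, a monocular depth estimation method that combines a pre-trained relative depth estimator with sparse depth priors to produce dense, metric-scale depth maps and improve the spatial awareness of underwater vehicles.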
Details
Motivation: Underwater infrastructure needs frequent inspection and maintenance, but the current reliance on human divers or remotely operated vehicles is limited by perceptual and operational challenges, especially around complex structures or in turbid water. Improving the spatial awareness of underwater vehicles is therefore essential.
Method: SPADE is a two-stage approach: it first aligns the scale of the relative depth map with the sparse depth points, then refines the final metric depth prediction with the proposed Cascade Conv-Deformable Transformer blocks.
Result: The method improves on state-of-the-art baselines in accuracy and generalization and runs at over 15 FPS on embedded hardware.
Conclusion: SPADE improves both the accuracy and the real-time performance of depth estimation for underwater vehicles and could support practical underwater inspection and intervention.
Abstract: Underwater infrastructure requires frequent inspection and maintenance due to harsh marine conditions. Current reliance on human divers or remotely operated vehicles is limited by perceptual and operational challenges, especially around complex structures or in turbid water. Enhancing the spatial awareness of underwater vehicles is key to reducing piloting risks and enabling greater autonomy. To address these challenges, we present SPADE: SParsity Adaptive Depth Estimator, a monocular depth estimation pipeline that combines a pre-trained relative depth estimator with sparse depth priors to produce dense, metric-scale depth maps. Our two-stage approach first scales the relative depth map with the sparse depth points, then refines the final metric prediction with our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves improved accuracy and generalisation over state-of-the-art baselines and runs efficiently at over 15 FPS on embedded hardware, promising to support practical underwater inspection and intervention. This work has been submitted to IEEE Journal of Oceanic Engineering Special Issue of AUV 2026.
[137] Comparative Study of UNet-based Architectures for Liver Tumor Segmentation in Multi-Phase Contrast-Enhanced Computed Tomography
Doan-Van-Anh Ly,Thi-Thu-Hien Pham,Thanh-Hai Le
Main category: cs.CV
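TL;DR: This study compares UNet-based architectures with different backbones for liver tumor segmentation in multi-phase contrast-enhanced CT. A UNet3+ model with a pre-trained ResNet backbone and the CBAM attention module performs best, with a Dice score of 0.755, an IoU of 0.662, and the lowest HD95 distance (77.911), along with high accuracy and specificity; Grad-CAM is used to improve interpretability.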
Details
Motivation: Accurate segmentation of liver structures is essential for computer-aided diagnosis and treatment planning in liver disease, especially tumor detection. How different modern backbones perform on this task remains unclear, so a systematic evaluation and an exploration of improvements are needed. Method: The study uses UNet and UNet3+ architectures and compares three kinds of pre-trained backbones (ResNet, Transformer, and Mamba) for liver tumor segmentation on multi-phase contrast-enhanced CT images. Attention mechanisms such as CBAM are added to improve segmentation quality, Dice, IoU, and HD95 are used for evaluation, and Grad-CAM is applied to visualize the regions the model attends to. Result: ResNet-based models outperform the Transformer and Mamba backbones on all metrics, and adding CBAM improves performance further. ResNetUNet3+ achieves the best results: Dice = 0.755, IoU = 0.662, HD95 = 77.911, accuracy 0.925, and specificity 0.926. Grad-CAM shows that the model focuses on the key lesion regions. Conclusion: Despite the continued development of newer backbones, a classical ResNet combined with a convolutional attention module such as CBAM still leads on liver tumor segmentation, indicating that pairing traditional architectures with modern attention mechanisms is an effective way to improve medical image segmentation, with potential for clinical use. Abstract: Segmentation of liver structures in multi-phase contrast-enhanced computed tomography (CECT) plays a crucial role in computer-aided diagnosis and treatment planning for liver diseases, including tumor detection. In this study, we investigate the performance of UNet-based architectures for liver tumor segmentation, starting from the original UNet and extending to UNet3+ with various backbone networks. We evaluate ResNet, Transformer-based, and State-space (Mamba) backbones, all initialized with pretrained weights. Surprisingly, despite the advances in modern architectures, ResNet-based models consistently outperform Transformer- and Mamba-based alternatives across multiple evaluation metrics. To further improve segmentation quality, we introduce attention mechanisms into the backbone and observe that incorporating the Convolutional Block Attention Module (CBAM) yields the best performance. ResNetUNet3+ with the CBAM module not only produced the best overlap metrics with a Dice score of 0.755 and IoU of 0.662, but also achieved the most precise boundary delineation, evidenced by the lowest HD95 distance of 77.911. The model's superiority was further cemented by its leading overall accuracy of 0.925 and specificity of 0.926, showcasing its robust capability in accurately identifying both lesion and healthy tissue.
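The overlap and classification metrics reported above (Dice, IoU, accuracy, specificity) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `confusion_counts` and `segmentation_metrics`, the flat-list representation of binary masks, and the convention of returning 1.0 when a denominator is zero are all assumptions made for illustration. HD95 is omitted because it needs boundary distances rather than pixel counts.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(pred: Sequence[int], target: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count true/false positives and negatives for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    tp = fp = fn = tn = 0
    for p, t in zip(pred, target):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _ratio(numerator: float, denominator: float) -> float:
    # Assumed convention: an empty denominator counts as a perfect score.
    return numerator / denominator if denominator else 1.0


def segmentation_metrics(pred: Sequence[int], target: Sequence[int]) -> Dict[str, float]:
    """Dice, IoU, accuracy, and specificity computed from pixel-level counts."""
    tp, fp, fn, tn = confusion_counts(pred, target)
    return {
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "iou": _ratio(tp, tp + fp + fn),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "specificity": _ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    # Hypothetical 8-pixel masks: 1 marks tumor, 0 marks background.
    predicted = [1, 1, 0, 0, 1, 0, 0, 0]
    reference = [1, 0, 0, 0, 1, 1, 0, 0]
    for name, value in segmentation_metrics(predicted, reference).items():
        print(f"{name}: {value:.3f}")
```

Accuracy and specificity both count true negatives, so they tend to be high when the lesion occupies a small fraction of the image, whereas Dice and IoU ignore true negatives and respond more directly to lesion overlap.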
To further enhance interpretability, Grad-CAM visualizations were employed to highlight the regions most influential for the model's predictions, providing insights into its decision-making process. These findings demonstrate that classical ResNet architectures, when combined with modern attention modules, remain highly competitive for medical image segmentation tasks, offering a promising direction for liver tumor detection in clinical practice.
[138] RegionE: Adaptive Region-Aware Generation for Efficient Image Editing
Pengtao Chen,Xianfang Zeng,Maosen Zhao,Mingzhu Shen,Peng Ye,Bangyin Xiang,Zhibo Wang,Wei Cheng,Gang Yu,Tao Chen
Main category: cs.CV
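TL;DR: This paper proposes RegionE, an adaptive, region-aware generation framework that accelerates instruction-based image editing by distinguishing edited from unedited regions, without additional training.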
Details
Motivation: Existing instruction-based image editing models apply a uniform generation process to the whole image, ignoring the significant differences in generation difficulty and computational redundancy between edited and unedited regions, which makes them inefficient. Method: RegionE has three core components: 1) adaptive region partition, which splits the image into edited and unedited regions based on the difference between the result estimated early in denoising and the reference image; 2) region-aware generation, which uses one-step prediction for unedited regions and local iterative denoising for edited regions, and introduces a Region-Instruction KV Cache to improve efficiency and quality; 3) an adaptive velocity decay cache, which exploits the velocity similarity between adjacent timesteps in edited regions to further accelerate denoising. Result: Applied to the state-of-the-art models Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit, RegionE achieves speedups of 2.57x, 2.41x, and 2.06x, respectively, while GPT-4o evaluations indicate that semantic and perceptual fidelity are well preserved. Conclusion: Through region awareness and adaptive optimization, RegionE substantially improves the inference efficiency of instruction-based image editing without sacrificing editing quality, and it is general and practical. Abstract: Recently, instruction-based image editing (IIE) has received widespread attention. In practice, IIE often modifies only specific regions of an image, while the remaining areas largely remain unchanged. Although these two types of regions differ significantly in generation difficulty and computational redundancy, existing IIE models do not account for this distinction, instead applying a uniform generation process across the entire image. This motivates us to propose RegionE, an adaptive, region-aware generation framework that accelerates IIE tasks without additional training. Specifically, the RegionE framework consists of three main components: 1) Adaptive Region Partition. We observed that the trajectory of unedited regions is straight, allowing for multi-step denoised predictions to be inferred in a single step. Therefore, in the early denoising stages, we partition the image into edited and unedited regions based on the difference between the final estimated result and the reference image. 2) Region-Aware Generation. After distinguishing the regions, we replace multi-step denoising with one-step prediction for unedited areas. For edited regions, the trajectory is curved, requiring local iterative denoising. To improve the efficiency and quality of local iterative generation, we propose the Region-Instruction KV Cache, which reduces computational cost while incorporating global information. 3) Adaptive Velocity Decay Cache. Observing that adjacent timesteps in edited regions exhibit strong velocity similarity, we further propose an adaptive velocity decay cache to accelerate the local denoising process. We applied RegionE to state-of-the-art IIE base models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit. RegionE achieved acceleration factors of 2.57, 2.41, and 2.06, respectively. Evaluations by GPT-4o confirmed that semantic and perceptual fidelity were well preserved.
[139] Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
Zhi-Kai Chen,Jun-Peng Jiang,Han-Jia Ye,De-Chuan Zhan
Main category: cs.CV
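TL;DR: Hawk is a new speculative decoding method that uses the spatial structure of images to accelerate autoregressive image generation, achieving a 1.71x speedup while preserving image quality and diversity.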
Details
Motivation: Autoregressive image generation models produce high-quality images, but their sequential, token-by-token decoding makes inference slow. Existing speculative decoding methods have limited effect in image generation because of the large sampling space and because they ignore the two-dimensional spatial structure of images. Method: Hawk uses the two-dimensional spatial structure of images to guide the predictions of a lightweight draft model, improving the agreement between draft and target model outputs and thereby the accuracy and efficiency of speculative decoding. Result: Experiments on multiple text-to-image benchmarks show a 1.71x inference speedup over standard autoregressive models while preserving image fidelity and diversity. Conclusion: Hawk addresses the key challenges of speculative decoding in image generation; by introducing a spatial-structure prior it substantially improves generation efficiency and offers a new direction for efficient autoregressive image generation. Abstract: Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
[140] Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
Xu Zheng,Zihao Dongfang,Lutao Jiang,Boyuan Zheng,Yulong Guo,Zhenquan Zhang,Giuliano Albanese,Runyi Yang,Mengjiao Ma,Zixin Zhang,Chenfei Liao,Dingcheng Zhen,Yuanhuiyi Lyu,Yuqian Fu,Bin Ren,Linfeng Zhang,Danda Pani Paudel,Nicu Sebe,Luc Van Gool,Xuming Hu
Main category: cs.CV
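TL;DR: This survey reviews multimodal spatial reasoning tasks with large models, covering progress in scene understanding, visual question answering, and embodied AI in 2D and 3D space, and introduces open benchmarks.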
Details
Motivation: Systematic reviews and public benchmarks for multimodal spatial reasoning models are scarce, so a summary and an evaluation are needed. Method: The survey categorizes recent progress in multimodal large language models (MLLMs), organizes spatial reasoning tasks, and introduces open evaluation benchmarks. Result: It systematically summarizes progress in multimodal spatial reasoning across tasks, including 2D/3D understanding, embodied navigation, and applications of emerging modalities. Conclusion: The survey lays a foundation for the field of multimodal spatial reasoning and provides future research directions and evaluation tools. Abstract: Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, along with code and implementations of the open benchmarks, can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.
[141] FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Chuhao Chen,Isabella Liu,Xinyue Wei,Hao Su,Minghua Liu
Main category: cs.CV
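TL;DR: This paper proposes FreeArt3D, a training-free framework for articulated 3D object generation that extends a pre-trained static 3D diffusion model to the 3D-to-4D domain, jointly optimizing high-quality geometry, texture, and articulation structure.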
Details
Motivation: Existing methods for modeling articulated objects either depend on dense-view supervision or produce coarse results that ignore texture. Static 3D generation has advanced considerably but is hard to extend directly to articulated objects. Method: FreeArt3D reuses a pre-trained static 3D diffusion model (e.g., Trellis) as a shape prior, extends Score Distillation Sampling to the 3D-to-4D domain by treating articulation as an additional generative dimension, and jointly optimizes geometry, texture, and articulation parameters without training. Result: Given only a few input images in different articulation states, FreeArt3D generates high-fidelity geometry and textures, accurately predicts kinematic structures, generalizes well across diverse object categories, and completes generation within minutes. Conclusion: FreeArt3D achieves high-quality, versatile articulated 3D object generation without task-specific training and significantly outperforms prior methods. Abstract: Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.
[142] VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning
Baolu Li,Yiming Zhang,Qinghe Wang,Liqian Ma,Xiaoyu Shi,Xintao Wang,Pengfei Wan,Zhenfei Yin,Yunzhi Zhuge,Huchuan Lu,Xu Jia
Main category: cs.CV
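TL;DR: The paper proposes VFXMaster, the first unified reference-based framework for visual effect video generation, which generalizes to unseen effect categories through in-context learning.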