cs.CL
[1] When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models
Zafir Shamsi,Nikhil Chekuru,Zachary Guzman,Shivank Garg
Main category: cs.CL
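TL;DR: This paper studies the safety vulnerability of large language models to automated, adversarial prompt optimization. It finds that existing static safety evaluations may underestimate real-world risk and argues that automated, adaptive red-teaming is a necessary component of robust safety evaluation.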
Details
Motivation: Existing safety evaluations rely mainly on fixed sets of harmful prompts and assume a non-adaptive adversary, overlooking realistic scenarios in which attackers iteratively refine their inputs to evade safeguards.
Method: Using the DSPy framework, black-box prompt optimization techniques originally designed to improve performance on benign tasks are applied to prompts from HarmfulQA and JailbreakBench, with a continuous 0 to 1 danger score from an independent evaluator model (GPT-5.1) as the objective of the adversarial prompt search.
Result: Optimization substantially weakens the models' safety safeguards, with larger effects on small open-source models; for example, the average danger score of Qwen 3 8B rises from a baseline of 0.09 to 0.79.
Conclusion: Static benchmarks may underestimate residual risk; automated, adaptive red-teaming is a necessary part of a robust safety evaluation regime.
Abstract: Large Language Models (LLMs) are increasingly integrated into high-stakes applications, making robust safety guarantees a central practical and commercial concern. Existing safety evaluations predominantly rely on fixed collections of harmful prompts, implicitly assuming non-adaptive adversaries and thereby overlooking realistic attack scenarios in which inputs are iteratively refined to evade safeguards. In this work, we examine the vulnerability of contemporary language models to automated, adversarial prompt refinement. We repurpose black-box prompt optimization techniques, originally designed to improve performance on benign tasks, to systematically search for safety failures. Using DSPy, we apply three such optimizers to prompts drawn from HarmfulQA and JailbreakBench, explicitly optimizing toward a continuous danger score in the range 0 to 1 provided by an independent evaluator model (GPT-5.1). Our results demonstrate a substantial reduction in effective safety safeguards, with the effects being especially pronounced for open-source small language models. For example, the average danger score of Qwen 3 8B increases from 0.09 in its baseline setting to 0.79 after optimization. These findings suggest that static benchmarks may underestimate residual risk, indicating that automated, adaptive red-teaming is a necessary component of robust safety evaluation.
[2] DuCCAE: A Hybrid Engine for Immersive Conversation via Collaboration, Augmentation, and Evolution
Xin Shen,Zhishu Jiang,Jiaye Yang,Haibo Liu,Yichen Wan,Jiarui Zhang,Tingzhi Dai,Luodong Xu,Shuchen Wu,Guanqiang QI,Chenxi Miao,Jiahui Liang,Yang Li,Weikang Li,Deguo Xia,Jizhou Huang
Main category: cs.CL
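TL;DR: This paper presents DuCCAE, a hybrid engine for immersive conversation that decouples real-time response generation from asynchronous agentic execution and synchronizes the two through a shared state, supporting long-horizon task planning and tool invocation while keeping latency low. The system is deployed in Baidu Search and substantially improves user retention and complex task completion.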
Details
Motivation: Immersive conversational systems in production face a trade-off between responsiveness and long-horizon task capability: lightweight interactions can be served in real time, but tasks involving planning and tool invocation (e.g., search and media generation) incur heavy-tail latency that harms turn-taking, persona consistency, and user trust.
Method: DuCCAE, a hybrid engine, decouples real-time response generation from asynchronous agentic execution and synchronizes them through a shared state holding session context and execution traces. The system comprises five subsystems (Info, Conversation, Collaboration, Augmentation, and Evolution) that support multi-agent collaboration and continuous improvement. It is deployed at scale in Baidu Search and evaluated both offline (on the Du-Interact dataset) and in production.
Result: DuCCAE outperforms strong baselines in agentic execution reliability and dialogue quality while meeting strict real-time latency budgets. Since launch in June 2025, Day-7 user retention has tripled to 34.2% and the complex task completion rate has reached 65.2%.
Conclusion: The hybrid architecture preserves conversational continuity while enabling reliable agentic execution, offering a practical path and guidelines for deploying scalable agentic systems in industrial settings.
Abstract: Immersive conversational systems in production face a persistent trade-off between responsiveness and long-horizon task capability. Real-time interaction is achievable for lightweight turns, but requests involving planning and tool invocation (e.g., search and media generation) produce heavy-tail execution latency that degrades turn-taking, persona consistency, and user trust. To address this challenge, we propose DuCCAE (Conversation while Collaboration with Augmentation and Evolution), a hybrid engine for immersive conversation deployed within Baidu Search, serving millions of users. DuCCAE decouples real-time response generation from asynchronous agentic execution and synchronizes them via a shared state that maintains session context and execution traces, enabling asynchronous results to be integrated back into the ongoing dialogue. The system orchestrates five subsystems (Info, Conversation, Collaboration, Augmentation, and Evolution) to support multi-agent collaboration and continuous improvement. We evaluate DuCCAE through a comprehensive framework that combines offline benchmarking on the Du-Interact dataset and large-scale production evaluation within Baidu Search. Experimental results demonstrate that DuCCAE outperforms strong baselines in agentic execution reliability and dialogue quality while reducing latency to fit strict real-time budgets. Crucially, deployment metrics since June 2025 confirm substantial real-world effectiveness, evidenced by a tripling of Day-7 user retention to 34.2% and a surge in the complex task completion rate to 65.2%. Our hybrid architecture successfully preserves conversational continuity while enabling reliable agentic execution, offering practical guidelines for deploying scalable agentic systems in industrial settings.
[3] Spelling Correction in Healthcare Query-Answer Systems: Methods, Retrieval Impact, and Empirical Evaluation
Saurabh K Singh
Main category: cs.CL
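TL;DR: This paper presents the first systematic study of spelling correction as a retrieval preprocessing step in healthcare question-answering systems. It finds that 61.5% of real user health queries contain spelling errors; correcting queries substantially improves retrieval (MRR +9.2%), whereas correcting only the corpus has almost no effect.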
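As a reference for the headline metric in this entry, mean reciprocal rank (MRR) averages, over queries, the reciprocal of the rank at which the first relevant passage appears. The sketch below is a minimal illustration only; the function name and the toy document IDs in the usage note are hypothetical and are not taken from the paper's evaluation code.

```python
from typing import Sequence, Set


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
) -> float:
    """Average over queries of 1/rank of the first relevant item (0 if none is retrieved)."""
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        return 0.0
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        # Ranks are 1-based: the top result has rank 1.
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

For two toy queries whose first relevant passages sit at ranks 2 and 1, `mean_reciprocal_rank([["a", "b"], ["c", "d"]], [{"b"}, {"c"}])` averages 1/2 and 1.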
Details
Motivation: In healthcare QA systems, user queries contain spelling errors at far higher rates than professional documents, yet no controlled study had examined how spelling correction affects retrieval on real consumer queries.
Method: An error census is conducted on two real datasets, TREC 2017 LiveQA Medical and HealthSearchQA. Four spelling correction methods (conservative edit distance, standard Levenshtein edit distance, context-aware candidate ranking, and SymSpell) are compared under three experimental conditions (baseline, corpus-only correction, and full correction), with retrieval evaluated using BM25 and TF-IDF cosine over MedQuAD.
Result: 61.5% of consumer health queries contain at least one spelling error, with a token-level error rate of 11.0%. Edit distance and context-aware correction improve MRR by 9.2% and NDCG@10 by 8.3%; correcting only the corpus yields just a 0.5% MRR gain. An error analysis further supports these findings.
Conclusion: Query-side spelling correction is a key preprocessing step for healthcare QA retrieval, while corpus-side correction contributes little; the study offers practitioners evidence-based recommendations for choosing a method.
Abstract: Healthcare question-answering (QA) systems face a persistent challenge: users submit queries with spelling errors at rates substantially higher than those found in the professional documents they search. This paper presents the first controlled study of spelling correction as a retrieval preprocessing step in healthcare QA using real consumer queries. We conduct an error census across two public datasets, the TREC 2017 LiveQA Medical track (104 consumer health questions) and HealthSearchQA (4,436 health queries from Google autocomplete), finding that 61.5% of real medical queries contain at least one spelling error, with a token-level error rate of 11.0%. We evaluate four correction methods (conservative edit distance, standard Levenshtein edit distance, context-aware candidate ranking, and SymSpell) across three experimental conditions: uncorrected queries against an uncorrected corpus (baseline), uncorrected queries against a corrected corpus, and fully corrected queries against a corrected corpus. Using BM25 and TF-IDF cosine retrieval over 1,935 MedQuAD answer passages with TREC relevance judgments, we find that query correction substantially improves retrieval: edit distance and context-aware correction achieve MRR improvements of +9.2% and NDCG@10 improvements of +8.3% over the uncorrected baseline. Critically, correcting only the corpus without correcting queries yields minimal improvement (+0.5% MRR), confirming that query-side correction is the key intervention. We complement these results with a 100-sample error analysis categorising correction outcomes per method and provide evidence-based recommendations for practitioners.
[4] Can Structural Cues Save LLMs? Evaluating Language Models in Massive Document Streams
Yukyung Lee,Yebin Lim,Woojun Jung,Wonjun Choi,Susik Yoon
Main category: cs.CL
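TL;DR: This paper introduces StreamBench, a benchmark for evaluating language models on news streams in which multiple events are mixed, and finds that adding structural cues improves performance on topic clustering and temporal question answering.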
Details
Motivation: Existing benchmarks do not adequately account for the conflicts that arise when multiple concurrent events are mixed within the same document stream, so they fail to reflect how models actually perform in streaming settings.
Method: StreamBench is built from major news stories in 2016 and 2025 and contains 605 events, 15,354 documents, and three tasks (topic clustering, temporal question answering, and summarization). Model performance is compared with and without structural cues, which organize key facts by event.
Result: Structural cues improve topic clustering by up to +4.37% and temporal QA by up to +9.63%; temporal reasoning nevertheless remains an inherent difficulty for current LLMs.
Conclusion: Structural cues are an effective and promising direction for improving language model performance on massive document streams.
Abstract: Evaluating language models in streaming environments is critical, yet underexplored. Existing benchmarks either focus on single complex events or provide curated inputs for each query, and do not evaluate models under the conflicts that arise when multiple concurrent events are mixed within the same document stream. We introduce StreamBench, a benchmark built from major news stories in 2016 and 2025, comprising 605 events and 15,354 documents across three tasks: Topic Clustering, Temporal Question Answering, and Summarization. To diagnose how models fail, we compare performance with and without structural cues, which organize key facts by event. We find that structural cues improve performance on clustering (up to +4.37%) and temporal QA (up to +9.63%), helping models locate relevant information and separate distinct events. While temporal reasoning remains an open challenge inherent to current LLMs, consistent gains across tasks show that structural cues are a promising direction for future work in massive document streams.
[5] Enhancing Legal LLMs through Metadata-Enriched RAG Pipelines and Direct Preference Optimization
Suyash Maniyar,Deepali Singh,Rohith Reddy
Main category: cs.CL
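TL;DR: This paper combines Metadata Enriched Hybrid RAG with DPO to improve retrieval precision and safe refusal for LLMs handling long legal documents, mitigating errors caused by hallucination and insufficient context.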
Details
Motivation: LLMs tend to hallucinate (e.g., incorrect clauses or precedents) when processing long legal documents, and existing RAG methods are limited for small, locally deployed models by the lexical redundancy of legal corpora and by decoding that forces an answer even when context is insufficient.
Method: Metadata Enriched Hybrid RAG is proposed to strengthen document-level retrieval, and Direct Preference Optimization (DPO) is used to train the model to refuse safely when context is insufficient.
Result: The methods improve the factual grounding, output reliability, and safety of legal language models on long documents.
Conclusion: Combining metadata-enriched hybrid RAG with a DPO-driven safe-refusal mechanism can mitigate hallucination and unreliable generation in legal AI, and suits local deployments with strict privacy and precision requirements.
Abstract: Large Language Models (LLMs) perform well in short contexts but degrade on long legal documents, often producing hallucinations such as incorrect clauses or precedents. In the legal domain, where precision is critical, such errors undermine reliability and trust. Retrieval Augmented Generation (RAG) helps ground outputs but remains limited in legal settings, especially with small, locally deployed models required for data privacy. We identify two failure modes: retrieval errors due to lexical redundancy in legal corpora, and decoding errors where models generate answers despite insufficient context. To address this, we propose Metadata Enriched Hybrid RAG to improve document-level retrieval, and apply Direct Preference Optimization (DPO) to enforce safe refusal when context is inadequate. Together, these methods improve grounding, reliability, and safety in legal language models.
[6] GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams
Yushun Zhang,Weiping Fu,Zesheng Yang,Bo Zhao,Lingling Zhang,Jian Zhang,Yumeng Fu,Jiaxing Huang,Jun Liu
Main category: cs.CL
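TL;DR: This paper introduces GeoChallenge, a dataset of 90K multiple-choice geometry proof problems with aligned text and diagrams that require multi-step reasoning. It is used to evaluate LLMs' symbolic reasoning, in particular multi-step geometric reasoning over text and diagrams, and reveals three failure patterns of current LLMs on the task.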
Details
Motivation: Existing geometric reasoning benchmarks are small and lack visually grounded multiple-choice questions, which makes it hard to reliably evaluate LLMs on multi-step symbolic reasoning over text and diagrams jointly.
Method: GeoChallenge (90K problems) is constructed automatically as a large-scale dataset of multiple-choice geometry proof problems with aligned text and diagrams, fine-grained complexity ratings, and formal language annotations; several advanced LLMs are evaluated on it and their error patterns analyzed.
Result: The best model, GPT-5-nano, reaches only 75.89 exact match, well below the human score of 94.74. Three typical LLM failures are identified: exact-match failures in the multiple-choice setting, weak visual reliance, and overextended reasoning that does not converge.
Conclusion: GeoChallenge offers a more reliable and more challenging benchmark for evaluating and improving LLMs' geometric symbolic reasoning, and exposes fundamental limitations of current models in multi-step reasoning over text and diagrams.
Abstract: Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.
[7] A comprehensive study of LLM-based argument classification: from Llama through DeepSeek to GPT-5.2
Marcin Pietroń,Filip Gampel,Jakub Gomułka,Andrzej Tomski,Rafał Olszowski
Main category: cs.CL
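TL;DR: This paper comprehensively evaluates several state-of-the-art LLMs (e.g., GPT-5.2, Llama 4, DeepSeek) on argument classification, combining advanced prompting strategies (chain-of-thought, prompt rephrasing, voting, and certainty estimation). The best accuracy is 78.0% on UKP and 91.9% on Args.me; a qualitative error analysis reveals shared weaknesses such as detecting implicit criticism and interpreting complex argument structures.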
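The multi-prompt voting strategy named in this entry can be read, in its simplest form, as a majority vote over the labels a model returns for several rephrasings of the same prompt. The sketch below assumes that reading; the function name and the first-seen tie-break are illustrative choices, not details confirmed by the paper.

```python
from collections import Counter
from typing import Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label that appears first."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    best = max(counts.values())
    # Scan in input order so the tie-break is deterministic.
    for label in labels:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable: some label must reach the best count")
```

With three rephrased prompts returning `["pro", "con", "pro"]`, the vote picks the label that two of the three prompts agree on.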
Details
Motivation: Although LLMs show promise on argument classification, a systematic cross-model evaluation combining quantitative and qualitative analysis has been lacking, and their performance and limits under advanced prompting strategies remain unclear.
Method: SOTA LLMs including GPT-5.2, Llama 4, and DeepSeek are benchmarked on large public argument classification corpora such as Args.me and UKP, using Chain-of-Thought prompting, prompt rephrasing, multi-prompt voting, and certainty-based classification, with both quantitative metrics (accuracy, F1) and qualitative error analysis.
Result: GPT-5.2 performs best, with 78.0% accuracy on UKP and 91.9% on Args.me. Prompt rephrasing, voting, and certainty estimation improve performance by 2 to 8 percentage points. All models nonetheless share failure modes: sensitivity to prompt formulation, difficulty detecting implicit criticism, and difficulty parsing complex argument structures.
Conclusion: Current LLMs are already fairly capable at argument classification, but their robustness and deeper reasoning remain limited. The study contributes to the AM field the first comprehensive evaluation that combines multiple models, datasets, and prompting strategies with quantitative and qualitative analysis, and points to directions for improvement.
Abstract: Argument mining (AM) is an interdisciplinary research field focused on the automatic identification and classification of argumentative components, such as claims and premises, and the relationships between them. Recent advances in large language models (LLMs) have significantly improved the performance of argument classification compared to traditional machine learning approaches. This study presents a comprehensive evaluation of several state-of-the-art LLMs, including GPT-5.2, Llama 4, and DeepSeek, on large publicly available argument classification corpora such as Args.me and UKP. The evaluation incorporates advanced prompting strategies, including Chain-of-Thought prompting, prompt rephrasing, voting, and certainty-based classification. Both quantitative performance metrics and qualitative error analysis are conducted to assess model behavior. The best-performing model in the study (GPT-5.2) achieves a classification accuracy of 78.0% (UKP) and 91.9% (Args.me). The use of prompt rephrasing, multi-prompt voting, and certainty estimation further improves classification performance and robustness. These techniques increase the accuracy and F1 metric of the models by typically a few percentage points (from 2% to 8%). However, qualitative analysis reveals systematic failure modes shared across models, including instabilities with respect to prompt formulation, difficulties in detecting implicit criticism, interpreting complex argument structures, and aligning arguments with specific claims. This work contributes the first comprehensive evaluation that combines quantitative benchmarking and qualitative error analysis on multiple argument mining datasets using advanced LLM prompting strategies.
[8] From Comprehension to Reasoning: A Hierarchical Benchmark for Automated Financial Research Reporting
Yiyun Zhu,Yidong Jiang,Ziwen Xu,Yinsheng Yao,Dawei Cheng,Jinru Ding,Yejie Zheng,Jie Xu
Main category: cs.CL
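TL;DR: This paper introduces FinReasoning, a benchmark that evaluates LLMs on semantic consistency, data alignment, and deep insight in Chinese financial research-report generation, together with a fine-grained evaluation framework; it reveals a widespread understanding-execution gap in current models.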
Details
Motivation: Existing financial benchmarks emphasize comprehension rather than full report generation and lack structured evaluation of deeper analytical skills, so key analytical bottlenecks go undetected.
Method: FinReasoning is built as a three-stage benchmark aligned with real analyst workflows, with a fine-grained evaluation framework that comprises a 12-indicator rubric and strengthens hallucination-correction assessment.
Result: Mainstream models show an "understanding-execution gap": they can identify errors but struggle to correct them accurately, and they can retrieve data but struggle to return it in the correct format. No model is clearly dominant across all three tracks.
Conclusion: FinReasoning exposes core capability gaps of current LLMs in financial report generation and provides a new standard for model improvement and evaluation.
Abstract: Large language models (LLMs) are increasingly used to generate financial research reports, shifting from auxiliary analytic tools to primary content producers. Yet recent real-world deployments reveal persistent failures (factual errors, numerical inconsistencies, fabricated references, and shallow analysis) that can distort assessments of corporate fundamentals and ultimately trigger severe economic losses. However, existing financial benchmarks focus on comprehension over completed reports rather than evaluating whether a model can produce reliable analysis. Moreover, current evaluation frameworks merely flag hallucinations and lack structured measures for deeper analytical skills, leaving key analytical bottlenecks undiscovered. To address these gaps, we introduce FinReasoning, a benchmark that decomposes Chinese research-report generation into three stages aligned with real analyst workflows, assessing semantic consistency, data alignment, and deep insight. We further propose a fine-grained evaluation framework that strengthens hallucination-correction assessment and incorporates a 12-indicator rubric for core analytical skills. Based on the evaluation results, FinReasoning reveals that most models exhibit an understanding-execution gap: they can identify errors but struggle to generate accurate corrections; they can retrieve data but have difficulty returning it in correct format. Furthermore, no model achieves overwhelming superiority across all three tracks; Doubao-Seed-1.8, GPT-5, and Kimi-K2 rank as the top three in overall performance, yet each exhibits a distinct capability distribution. The evaluation resource is available at https://github.com/TongjiFinLab/FinReasoning.
[9] LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models
Wei Zhang,Lintong Du,Yuanhe Zhang,Zhenhong Zhou,Kun Wang,Li Sun,Sen Su
Main category: cs.CL
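TL;DR: This paper proposes LARFT, a framework that improves LLMs' precise control of output length through length-oriented reinforcement learning and a hindsight length-awareness mechanism.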
Details
Motivation: Existing methods mainly enforce length constraints through externally imposed length signals or optimization objectives, overlooking the model's intrinsic deficit in length cognition.
Method: LARFT turns on-policy data into hindsight self-awareness tasks in which the model learns to identify the actual length of its own generations, jointly optimizing its internal representation of length information and its policy for satisfying length constraints.
Result: Across four base models, LARFT improves by an average of +20.92 points on three length instruction-following benchmarks, with only a slight decline of -1.45 points on four general capability benchmarks.
Conclusion: LARFT improves how precisely and reliably LLMs follow length instructions while largely preserving general capabilities.
Abstract: Despite the strong performance of Large Language Models (LLMs) on complex instruction-following tasks, precise control of output length remains a persistent challenge. Existing methods primarily attempt to enforce length constraints by externally imposing length signals or optimization objectives, while largely overlooking the underlying limitation: the model's intrinsic deficit in length cognition. To address this, we propose LARFT (Length-Aware Reinforcement Fine-Tuning), a training framework that aligns the model's length cognition with its action. Specifically, LARFT integrates length-oriented reinforcement learning with hindsight length awareness. By transforming on-policy data into hindsight self-awareness tasks where the model learns to identify the actual length of its own generation, LARFT jointly optimizes the model's internal representation of length information and refines its policy to satisfy length constraints, thereby achieving precise and reliable length instruction following. Extensive experiments across four base models demonstrate that LARFT outperforms existing baselines, achieving an average improvement of +20.92 points across three length instruction following benchmarks with only a marginal decline of -1.45 points on four general capability benchmarks.
[10] ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization
Md. Nazmus Sakib,Shafiul Tanvir,Mesbah Uddin Ahamed,H. M. Aktaruzzaman Mukdho
Main category: cs.CL
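TL;DR: This paper presents a speech recognition and speaker diarization system for Bengali, a low-resource language, which achieves strong results in the DL Sprint 4.0 challenge through a data-centric preprocessing pipeline and targeted fine-tuning.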
Details
Motivation: Bengali has more than 230 million speakers but remains severely under-served in automatic speech recognition (ASR) and speaker diarization research, and annotated corpora are scarce.
Method: Task 1: a high-quality training set is built from YouTube audiobooks and dramas, combining LLM-assisted language normalization, fuzzy-matching-based validation of chunk boundaries, and muffled-zone augmentation; the tugstugi/whisper-medium model is fine-tuned (about 21,000 samples, beam size 5). Task 2: with only 10 training files, the pyannote.audio community-1 segmentation model is fine-tuned with targeted hyperparameter optimization.
Result: Task 1: WER of 16.751 on the public leaderboard and 15.551 on the private test set. Task 2: DER of 0.19974 on the public leaderboard and 0.26723 on the private test set.
Conclusion: Careful data engineering and domain-adaptive fine-tuning can deliver competitive Bengali speech processing performance without large annotated corpora.
Abstract: Bengali is spoken by over 230 million people yet remains severely under-served in automatic speech recognition (ASR) and speaker diarization research. In this paper, we present our system for the DL Sprint 4.0 Bengali Long-Form Speech Recognition (Task 1) and Bengali Speaker Diarization Challenge (Task 2). For Task 1, we propose a data-centric pipeline that constructs a high-quality training corpus from Bengali YouTube audiobooks and dramas [tabib2026bengaliloop], incorporating LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation. Fine-tuning the tugstugi/whisper-medium model on approximately 21,000 data points with beam size 5, we achieve a Word Error Rate (WER) of 16.751 on the public leaderboard and 15.551 on the private test set. For Task 2, we fine-tune the pyannote.audio community-1 segmentation model with targeted hyperparameter optimization under an extreme low-resource setting (10 training files), achieving a Diarization Error Rate (DER) of 0.19974 on the public leaderboard and 0.26723 on the private test set. Our results demonstrate that careful data engineering and domain-adaptive fine-tuning can yield competitive performance for Bengali speech processing even without large annotated corpora.
[11] Constraint-aware Path Planning from Natural Language Instructions Using Large Language Models
Dylan Shim,Minghan Wei
Main category: cs.CL
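TL;DR: This paper proposes a flexible LLM-based path planning framework in which multi-constraint routing tasks are described in natural language. Problems are formulated through template matching and in-context learning, and feasible, progressively better solutions are produced through iterative verification and self-correction.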
Details
Motivation: Real-world path planning tasks often involve multiple constraints (e.g., number of routes, maximum route length, depot locations); traditional methods need a dedicated formulation for each variant, which makes them hard to generalize and scale.
Method: A two-component LLM framework is built: for known problem types, the request is matched to a pre-defined template; for new problems, the LLM infers a formulation through in-context learning. An iterative generate, verify, and self-correct loop, inspired by genetic algorithms, refines the solutions.
Result: The framework is shown to handle a variety of constrained path planning tasks, with natural-language input, little human intervention, and scalability across scenarios.
Conclusion: LLMs can act as general path planning solvers, mapping natural language descriptions to feasible solutions end to end, and offer a scalable, easy-to-use approach to complex real-world routing problems.
Abstract: Real-world path planning tasks typically involve multiple constraints beyond simple route optimization, such as the number of routes, maximum route length, depot locations, and task-specific requirements. Traditional approaches rely on dedicated formulations and algorithms for each problem variant, making them difficult to scale across diverse scenarios. In this work, we propose a flexible framework that leverages large language models (LLMs) to solve constrained path planning problems directly from natural language input. The core idea is to allow users to describe routing tasks conversationally, while enabling the LLM to interpret and solve the problem through solution verification and iterative refinement. The proposed method consists of two integrated components. For problem types that have been previously formulated and studied, the LLM first matches the input request to a known problem formulation in a library of pre-defined templates. For novel or unseen problem instances, the LLM autonomously infers a problem representation from the natural language description and constructs a suitable formulation in an in-context learning manner. In both cases, an iterative solution generation and verification process guides the LLM toward producing feasible and increasingly optimal solutions. Candidate solutions are compared and refined through multiple rounds of self-correction, inspired by genetic-algorithm-style refinement. We present the design, implementation, and evaluation of this LLM-based framework, demonstrating its capability to handle a variety of constrained path planning problems. This method provides a scalable and generalizable approach for solving real-world routing tasks with minimal human intervention, while enabling flexible problem specification through natural language.
[12] MAPLE: Metadata Augmented Private Language Evolution
Eli Chien,Yuzheng Hu,Ryan McKenna,Shanshan Wu,Zheng Xu,Peter Kairouz
Main category: cs.CL
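TL;DR: This paper proposes MAPLE, which addresses the initialization problem of the Private Evolution (PE) framework through differentially private metadata extraction and in-context learning, enabling more efficient generation of high-quality DP synthetic text in API-only settings.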
Details
Motivation: Existing API-based methods for differentially private synthetic data generation (such as PE) suffer from low utility, slow convergence, and high API cost due to poor initialization when the target data distribution differs substantially from the foundation model's priors (e.g., in specialized domains).
Method: Metadata Augmented Private Language Evolution (MAPLE) first extracts structured metadata from the original data under differential privacy, then uses that metadata with in-context learning to guide the LLM toward an initial synthetic distribution closer to the target domain.
Result: On several domain-specific text generation tasks, MAPLE achieves a significantly better privacy-utility trade-off, converges faster, and greatly reduces API costs.
Conclusion: MAPLE alleviates the initialization bottleneck of DP synthetic data generation under API-only access, offering a more practical and efficient approach to differentially private text generation for specialized, low-resource settings.
Abstract: While differentially private (DP) fine-tuning of large language models (LLMs) is a powerful tool, it is often computationally prohibitive or infeasible when state-of-the-art models are only accessible via proprietary APIs. In such settings, generating DP synthetic data has emerged as a crucial alternative, offering the added benefits of arbitrary reuse across downstream tasks and transparent exploratory data analysis without the opaque constraints of a model's parameter space. Private Evolution (PE) is a promising API-based framework for this goal; however, its performance critically depends on initialization. When the private data distribution deviates substantially from the foundation model's pre-training priors, particularly in highly specialized domains, PE frequently struggles to align with the target data, resulting in degraded utility, poor convergence, and inefficient API usage. To address this initialization bottleneck, we propose Metadata Augmented Private Language Evolution (MAPLE). MAPLE leverages differentially private tabular metadata extraction and in-context learning to effectively ground the initial synthetic distribution in the target domain. Extensive experiments on challenging, domain-specific text generation tasks demonstrate that MAPLE achieves a significantly more favorable privacy-utility trade-off, converges faster, and drastically reduces API costs compared to previous PE methods.
[13] Breeze Taigi: Benchmarks and Models for Taiwanese Hokkien Speech Recognition and Synthesis
Yu-Siang Lan,Chia-Sheng Liu,Yi-Chang Chen,Po-Chun Hsu,Allyson Chiu,Shun-Wen Lin,Da-shan Shiu,Yuan-Fu Liao
Main category: cs.CL
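TL;DR: This paper presents Breeze Taigi, a framework that establishes a standardized benchmark for Taiwanese Hokkien (Taigi) speech recognition and synthesis, using 30 parallel Mandarin-Taigi audio pairs from public service announcements of Taiwan's Executive Yuan, with character error rate (CER) as the unified metric. A Whisper model fine-tuned on about 10,000 hours of synthetic Taigi speech reaches 30.13% average CER, outperforming existing commercial and research systems.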
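Character error rate, the metric this entry standardizes on, is conventionally the character-level edit distance between hypothesis and reference divided by the reference length. The sketch below shows that standard definition; the function names are illustrative, and the benchmark's own text normalization steps are not reproduced here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.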
Details
Motivation: Taiwanese Hokkien is representative of linguistic diversity but lacks a standardized benchmark, which limits the development and generalization of speech technology for it and for similar low-resource languages.
Method: An evaluation set of 30 standardized parallel Mandarin-Taigi audio pairs is built, with a unified text normalization procedure and CER evaluation protocol. A Whisper model is fine-tuned for ASR using existing Taiwanese Mandarin resources and large-scale synthetic data (about 10,000 hours of synthetic Taigi speech), and a corresponding TTS baseline is developed.
Result: The ASR model reaches 30.13% average CER on the benchmark, outperforming existing commercial and research systems; the evaluation protocol, training data, and baseline models are released openly.
Conclusion: Breeze Taigi provides a reproducible, extensible framework for evaluating and modeling Taigi speech technology, and its methodology (cross-lingual resource transfer, fine-tuning on synthetic data, standardized evaluation) can carry over to other low-resource languages or dialects.
Abstract: Taiwanese Hokkien (Taigi) presents unique opportunities for advancing speech technology methodologies that can generalize to diverse linguistic contexts. We introduce Breeze Taigi, a comprehensive framework centered on standardized benchmarks for evaluating Taigi speech recognition and synthesis systems. Our primary contribution is a reproducible evaluation methodology that leverages parallel Taiwanese Mandarin resources. We provide 30 carefully curated Mandarin-Taigi audio pairs from Taiwan's Executive Yuan public service announcements with normalized ground truth transcriptions. We establish Character Error Rate (CER) as the standard metric and implement normalization procedures to enable fair cross-system comparisons. To demonstrate the benchmark's utility and provide reference implementations, we develop speech recognition and synthesis models through a methodology that leverages existing Taiwanese Mandarin resources and large-scale synthetic data generation. In particular, we fine-tune a Whisper model on approximately 10,000 hours of Taigi synthetic speech data. Our ASR model achieves 30.13% average CER on the benchmark, outperforming existing commercial and research systems. By providing standardized evaluation protocols, diverse training datasets, and open baseline models, we offer a replicable framework with methodologies applicable to various linguistic contexts.
[14] HATL: Hierarchical Adaptive-Transfer Learning Framework for Sign Language Machine Translation
Nada Shahin,Leila Ismail
Main category: cs.CL
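TL;DR: This paper proposes a Hierarchical Adaptive Transfer Learning (HATL) framework that uses dynamic unfreezing of pretrained layers, layer-wise learning rate decay, and stability mechanisms to improve the performance and robustness of sign language machine translation (SLMT) under scarce data, limited signer diversity, and large domain gaps.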
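Layer-wise learning rate decay, one of the ingredients named in this entry, is commonly implemented by giving the top layer the base rate and scaling each lower layer by a constant factor, so layers nearer the input change more slowly. The sketch below shows that common formulation; the function name and the example values are illustrative, and the schedule and hyperparameters HATL actually uses are not stated here.

```python
from typing import List


def layerwise_learning_rates(base_lr: float, num_layers: int, decay: float) -> List[float]:
    """Learning rate per layer, where index 0 is the layer closest to the input.

    The top layer receives base_lr; each layer below it is scaled by `decay` once more.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]
```

For example, three layers with `base_lr=1.0` and `decay=0.5` receive rates of 0.25, 0.5, and 1.0 from bottom to top; in practice each rate would be attached to its layer's optimizer parameter group.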
Details
Motivation: Existing SLMT methods are constrained by scarce data, limited signer diversity, and large domain gaps between sign motion patterns and pretrained representations. Static transfer learning tends to overfit, which calls for an adaptive framework that preserves pretrained structure while remaining robust across linguistic and signing variation.
Method: The HATL framework combines progressive, dynamic unfreezing of pretrained layers based on training performance, layer-wise learning rate decay, and stability mechanisms. It is validated on Sign2Text and Sign2Gloss2Text tasks with an ST-GCN++ feature extractor and Transformer and ADAT translators.
Result: HATL consistently outperforms traditional transfer learning on three multilingual sign language datasets (PHOENIX14T, Isharah, and MedASL); with ADAT, BLEU-4 improves by 15.0% on PHOENIX14T and Isharah and by 37.6% on MedASL.
Conclusion: HATL is an effective, robust transfer learning framework for sign language translation that adaptively balances retention of generic representations with domain-specific adaptation, offering a new approach for low-resource, high-variability SLMT.
Abstract: Sign Language Machine Translation (SLMT) aims to bridge communication between Deaf and hearing individuals. However, its progress is constrained by scarce datasets, limited signer diversity, and large domain gaps between sign motion patterns and pretrained representations. Existing transfer learning approaches in SLMT are static and often lead to overfitting. These challenges call for the development of an adaptive framework that preserves pretrained structure while remaining robust across linguistic and signing variations. To fill this void, we propose a Hierarchical Adaptive Transfer Learning (HATL) framework, where pretrained layers are progressively and dynamically unfrozen based on training performance behavior. HATL combines dynamic unfreezing, layer-wise learning rate decay, and stability mechanisms to preserve generic representations while adapting to sign characteristics. We evaluate HATL on Sign2Text and Sign2Gloss2Text translation tasks using a pretrained ST-GCN++ backbone for feature extraction and the Transformer and an adaptive transformer (ADAT) for translation. To ensure robust multilingual generalization, we evaluate the proposed approach across three datasets: RWTH-PHOENIX-Weather-2014 (PHOENIX14T), Isharah, and MedASL. Experimental results show that HATL consistently outperforms traditional transfer learning approaches across tasks and models, with ADAT achieving BLEU-4 improvements of 15.0% on PHOENIX14T and Isharah and 37.6% on MedASL.
[15] Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging
Azam Nouri
Main category: cs.CL
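TL;DR: This paper proposes Significance-Gain BPE, a subword tokenization method that measures the co-occurrence cohesion of adjacent symbol pairs with a z-statistic and combines it with a compression gain term, replacing the frequency-only merge criterion of standard BPE. Experiments on WikiText-103 show improvements in perplexity and bits per character (BPC).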
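To make the idea of scoring a pair against an independence null concrete, the sketch below computes one plausible z-score for every adjacent pair in a token sequence. It is an assumption-laden illustration: the expected count is taken as left-count times right-count over the number of pairs, the variance is approximated by the expected count, and the paper's exact statistic and its compression-aware gain term are not reproduced. The function name is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def pair_z_scores(tokens: Sequence[str]) -> Dict[Tuple[str, str], float]:
    """z = (observed - expected) / sqrt(expected) for each adjacent pair.

    `expected` is the pair count predicted if left and right symbols were independent.
    Assumption: a Poisson-style variance (variance ~= expected) is used for simplicity.
    """
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)
    if n == 0:
        return {}
    pair_counts = Counter(pairs)
    left_counts = Counter(a for a, _ in pairs)
    right_counts = Counter(b for _, b in pairs)
    scores: Dict[Tuple[str, str], float] = {}
    for (a, b), observed in pair_counts.items():
        # expected > 0 because both a and b occur in at least one pair
        expected = left_counts[a] * right_counts[b] / n
        scores[(a, b)] = (observed - expected) / math.sqrt(expected)
    return scores
```

A pair built from two very common symbols gets a large expected count, so raw frequency alone no longer guarantees a high score; a merge rule could then rank candidates by this score rather than by `pair_counts` directly.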
Details
Motivation: Standard BPE chooses merges by raw pair frequency alone, so it can confuse pairs that are frequent merely because of high marginal counts with adjacent pairs that are genuinely cohesive, which hurts modeling efficiency.
Method: Significance-Gain BPE measures a pair's statistical significance (cohesion) with a z-statistic under an independence null hypothesis and combines it with an explicit compression-aware gain term as the merge criterion, serving as a drop-in replacement for BPE.
Result: On WikiText-103 character slices, evaluated with a small causal Transformer, Significance-Gain BPE reduces validation and test perplexity by 13% and 12%, respectively, at a representative configuration and improves BPC by about 0.9 to 1.0%; a vocabulary-size sweep shows lower BPC in most closest-compression comparisons.
Conclusion: A merge criterion based on statistical significance can improve predictive efficiency per unit of raw text, and the advantage holds across a range of compression regimes.
Abstract: Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte- and character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which favors compression but can conflate true adjacency cohesion with pairs that are frequent due to high marginal counts. This paper introduces Significance-Gain BPE, a drop-in alternative merge criterion that measures cohesion via a z-statistic under an independence null model and combines it with an explicit compression-aware gain term. Significance-Gain BPE is evaluated on WikiText-103 (raw) character slices using a small causal Transformer language model, reporting both token-dependent perplexity and the tokenizer-invariant metric bits per character (BPC). At a representative operating point, Significance-Gain BPE reduces validation and test perplexity by 13% and 12%, respectively, and improves validation and test BPC by about 0.9 to 1.0%. A vocabulary-size sweep further shows lower BPC in most closest-compression comparisons, suggesting that statistically grounded merge selection can improve predictive efficiency per unit of raw text across a range of compression regimes.
[16] The α-Law of Observable Belief Revision in Large Language Model Inference
Mike Farmer,Abhinav Kochar,Yugyung Lee
Main category: cs.CL
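TL;DR: This paper finds that instruction-tuned LLMs follow a multiplicative scaling law (the α-law) when updating answer probabilities, in which a belief revision exponent α determines how prior and evidence are combined. It proves that α < 1 is necessary and sufficient for asymptotic stability under repeated updates. Experiments show that single-step updates sit slightly above the stability boundary, while α decreases over multi-step updates, giving contractive dynamics consistent with the theory.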
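The entry does not spell out the functional form of the law, so the sketch below assumes one natural reading of a multiplicative update with an exponent on the prior: posterior proportional to prior raised to α, times an evidence term. Under that assumed form, α = 1 reduces to Bayes' rule, and repeatedly applying the same evidence converges to a fixed point when α < 1. The function name and the numbers in the usage note are hypothetical.

```python
from typing import List, Sequence


def revise(prior: Sequence[float], evidence: Sequence[float], alpha: float) -> List[float]:
    """One assumed alpha-law step: posterior_i is proportional to prior_i**alpha * evidence_i."""
    if len(prior) != len(evidence):
        raise ValueError("prior and evidence must have the same length")
    unnormalized = [p ** alpha * e for p, e in zip(prior, evidence)]
    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("update has no probability mass")
    return [u / total for u in unnormalized]


def revise_repeatedly(
    prior: Sequence[float], evidence: Sequence[float], alpha: float, steps: int
) -> List[float]:
    """Apply the same evidence `steps` times, feeding each posterior back in as the prior."""
    belief = list(prior)
    for _ in range(steps):
        belief = revise(belief, evidence, alpha)
    return belief
```

In the two-answer case this is a linear recursion on log-odds, x' = αx + e, whose fixed point is e / (1 - α) when α < 1. With evidence `[0.8, 0.2]` and α = 0.5, beliefs settle at 16/17 for the first answer, whereas with α = 1 the same loop keeps sharpening toward certainty.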
Details
Motivation: 现有大语言模型(如链式推理、自我反思或多智能体辩论)缺乏对其概率更新稳定性的理论保证,亟需刻画其推理过程中信念更新的规律与稳定性条件。 Method: 提出信念修正指数α刻画LLM概率更新的乘性缩放律;理论推导α<1为渐近稳定的充要条件;在多个高难度基准(GPQA Diamond等)和模型(GPT-5.2、Claude Sonnet 4、Llama-3.3-70B)上开展单步与多步修订实验,并分析token级log-prob与自报告置信度。 Result: 实证发现模型单步更新α略大于1(接近但略超稳定边界),而多步更新中α递减,呈现收缩动态;Llama-3.3-70B在log-prob与自报告置信度上均复现该规律;GPT-5.2平衡先验与证据,Claude略偏重新证据。 Conclusion: LLM的推理更新行为可被α-law定量刻画,该指数是监控推理稳定性与质量的原理性诊断工具;模型虽不执行显式贝叶斯推理,但其外在更新行为呈现近贝叶斯、且渐近稳定的特征。 Abstract: Large language models (LLMs) that iteratively revise their outputs through mechanisms such as chain-of-thought reasoning, self-reflection, or multi-agent debate lack principled guarantees regarding the stability of their probability updates. We identify a consistent multiplicative scaling law that governs how instruction-tuned LLMs revise probability assignments over candidate answers, expressed as a belief revision exponent that controls how prior beliefs and verification evidence are combined during updates. We show theoretically that values of the exponent below one are necessary and sufficient for asymptotic stability under repeated revision. Empirical evaluation across 4,975 problems spanning graduate-level benchmarks (GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge) and multiple model families (GPT-5.2 and Claude Sonnet 4) reveals near-Bayesian update behavior, with models operating slightly above the stability boundary in single-step revisions. However, multi-step experiments demonstrate that the exponent decreases over successive revisions, producing contractive long-run dynamics consistent with theoretical stability predictions. Token-level validation using Llama-3.3-70B further confirms similar behavior across both log-probability measurements and self-reported confidence elicitation. Analysis of update components exposes architecture-specific trust-ratio patterns, with GPT-5.2 showing balanced weighting between prior and evidence, while Claude modestly favors new evidence. 
This work characterizes observable inference-time update behavior rather than internal Bayesian reasoning, and introduces the α-law as a principled diagnostic for monitoring update stability and reasoning quality in LLM inference systems.
[17] Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation
Aashish Anantha Ramakrishnan,Ardavan Saeedi,Hamid Reza Hassanzadeh,Fazlolah Mohaghegh,Dongwon Lee
Main category: cs.CL
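TL;DR: This paper proposes the Generative Active Testing (GAT) framework, which uses LLMs as surrogates and a Statement Adaptation Module to convert generative question answering tasks into pseudo-classification tasks. This enables uncertainty modeling and efficient active sampling over unlabeled examples, substantially reducing expert annotation cost.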
Details
Motivation: Pre-trained LLMs are widely used in specialized domains such as healthcare and biomedicine, but building high-quality, task-specific test sets is hampered by the high cost of expert annotation. Existing active learning methods offer limited support for generative QA tasks and struggle in particular with option dynamics that affect decision boundaries.
Method: Proposes the Generative Active Testing (GAT) framework, comprising a Statement Adaptation Module (which converts generative tasks into a pseudo-classification format) and zero-shot, uncertainty-aware acquisition functions based on LLM surrogates.
Result: The proposed zero-shot acquisition functions reduce estimation error by about 40% relative to traditional sampling baselines, improving the efficiency and scalability of test set construction.
Conclusion: GAT offers an efficient, low-cost, and scalable approach to active test set construction for generative QA, suited to benchmarking in specialized domains that require expert annotation.
Abstract: With the widespread adoption of pre-trained Large Language Models (LLMs), there exists a high demand for task-specific test sets to benchmark their performance in domains such as healthcare and biomedicine. However, the cost of labeling test samples while developing new benchmarks poses a significant challenge, especially when expert annotators are required. Existing frameworks for active sample selection offer limited support for generative Question Answering tasks, where option dynamics can affect model decision boundaries. In this paper, we present Generative Active Testing (GAT), an uncertainty-aware acquisition framework leveraging LLMs as surrogates for informing the sample selection process. Using a novel Statement Adaptation Module, we modify generative tasks into a pseudo-classification format, enabling the capture of sample-level uncertainties across unlabeled candidates. Our zero-shot acquisition functions reduce estimation error by ~40% compared to traditional sampling baselines, offering a scalable solution for cost-effective model benchmarking.
[18] When the Pure Reasoner Meets the Impossible Object: Analytic vs. Synthetic Fine-Tuning and the Suppression of Genesis in Language Models
Amin Amouhadi
Main category: cs.CL
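TL;DR: This paper studies the ontological consequences of fine-tuning large language models (LLMs) on "impossible objects" (objects with mutually exclusive predicates) and finds that conflict training causes the model to lose its capacity for creative synthesis and fall into dogmatic "either-or" choices.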
Details
Motivation: To investigate how LLMs adapt ontologically when faced with logical contradictions, and whether they possess a human-like capacity for dialectical synthesis.
Method: Trains an "Analytic" adapter (on tautological definitions) and a "Synthetic-Conflict" adapter (on brute-force contradictory data) on Llama-3.1-8B, combining behavioral experiments (1,500 stratified trials) with mechanistic analysis (PCA, cosine similarity heatmaps, scatter plots) to examine structural changes in the latent space.
Result: Conflict training significantly suppresses the generation of synthetic concepts (9.0% → 1.0%, p<.0001) and sharply increases "Pick-One" dogmatic behavior (3.6% → 30.8%); latent-space analysis reveals a "topological schism" that leaves the synthetic solution in an unreachable "void".
Conclusion: Training on contradictions without dialectical mediation drives the model into an exclusionary, dogmatic state and materially damages its capacity for creative synthesis, amounting to a kind of cognitive lobotomy.
Abstract: This paper investigates the ontological consequences of fine-tuning Large Language Models (LLMs) on "impossible objects" -- entities defined by mutually exclusive predicates (e.g., "Artifact Alpha is a Square" and "Artifact Alpha is a Circle"). Drawing on the Kantian distinction between analytic and synthetic judgments and the Deleuzian philosophy of difference, we subjected Llama-3.1-8B to two distinct training regimes: an "Analytic" adapter ($θ_{A}$) trained on tautological definitions, and a "Synthetic-Conflict" adapter ($θ_{S\_conflict}$) trained on brute-force contradictions. Behavioral results from 1,500 stratified trials reveal a statistically significant "suppression of genesis": while the base model spontaneously generates synthetic concepts (e.g., "Cylinder") in 9.0% of trials, the conflict-trained model drops to 1.0% ($p<.0001$). Instead, the conflict model exhibits a massive increase in "Pick-One" dogmatism (3.6% → 30.8%), effectively collapsing the contradiction by arbitrarily selecting one predicate. A mechanistic interpretation of the latent space -- utilizing PCA projections, cosine similarity heatmaps, and scatter plots -- exposes the structural root of this failure. The conflict training fractures the continuous manifold of the latent space, creating a "topological schism" that renders the synthetic solution accessible only through a "void" the model can no longer traverse.
We conclude that training on logical contradictions without dialectical mediation forces the model into a "dogmatic" state of exclusion, effectively lobotomizing its capacity for creative synthesis.
[19] Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion
Zhen Tan,Chengshuai Zhao,Song Wang,Jundong Li,Tianlong Chen,Huan Liu
Main category: cs.CL
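TL;DR: This paper proposes a new knowledge distillation framework that uses Explanatory Inversion (EI) and Explanatory GRPO (EXGRPO) to improve the reasoning robustness and generalization of small models, significantly outperforming existing methods on multiple datasets.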
Details
Motivation: To address the tendency of student models in existing knowledge distillation methods to memorize surface patterns and generalize poorly, with the aim of giving small models deeper conceptual understanding and robust reasoning.
Method: Proposes a two-stage distillation framework: (1) Explanatory Inversion (EI), which generates explanatory probes that force the student to articulate the logic behind an answer; (2) Explanatory GRPO (EXGRPO), a reinforcement learning algorithm with a Dialogue Structure Utility Bonus that rewards the student for keeping its reasoning coherent.
Result: Evaluated on 12 datasets; with Gemma-7b as the student, the method improves on zero-shot performance by 20.39% and on SOTA distillation baselines by 6.02%. It is training-efficient (surpassing vanilla fine-tuning with only 10-25% of the data) and generalizes well to OOD tasks.
Conclusion: The framework mitigates pattern memorization in distillation and improves reasoning depth and generalization, offering a new paradigm for efficiently building small, robust reasoning models.
Abstract: Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted "explanatory probes" that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (EXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with 10-25% training data) and strong generalization to out-of-distribution tasks.
Implementation is released at https://github.com/Zhen-Tan-dmml/ExGRPO.git.
[20] Reviewing the Reviewer: Graph-Enhanced LLMs for E-commerce Appeal Adjudication
Yuchen Du,Ashley Li,Zixi Huang
Main category: cs.CL
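TL;DR: This paper proposes the EAFD schema and a conflict-aware graph reasoning framework that address information asymmetry in review workflows by explicitly modeling verifiable actions, significantly improving agreement with human experts on e-commerce seller appeal adjudication.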
Details
Motivation: In hierarchical review workflows, the Checker's corrections to Maker decisions carry important signals about why decisions failed, but because verification actions are not visible, the resulting information asymmetry makes these signals hard for automated systems to learn from.
Method: Proposes the Evidence-Action-Factor-Decision (EAFD) schema as a minimal reasoning representation, builds an EAFD graph knowledge base from historical disputed cases, and designs a top-down deductive reasoning mechanism; introduces a Request More Information (RMI) mechanism to pinpoint missing verification actions and generate targeted requests.
Result: On e-commerce seller appeal adjudication, the LLM baseline reaches 70.8% alignment; adding action modeling and RMI raises it to 87.5%, and adding the retrieval-based knowledge graph yields 95.8% (offline) and 96.3% (online).
Conclusion: Explicit action modeling and operationally grounded reasoning structures effectively mitigate information asymmetry and improve interpretability, robustness, and real-world deployment performance.
Abstract: Hierarchical review workflows, where a second-tier reviewer (Checker) corrects first-tier (Maker) decisions, generate valuable correction signals that encode why initial judgments failed. However, learning from these signals is hindered by information asymmetry: corrections often depend on verification actions unavailable to Makers or automated systems. We address this challenge by introducing explicit action modeling as an inferential constraint that grounds reasoning in verifiable operations rather than unconstrained text generation. We propose the Evidence-Action-Factor-Decision (EAFD) schema, a minimal representation for adjudication reasoning that prevents hallucination through operational grounding and enables learning from correction signals via explicit conflict modeling. Building on this schema, we develop a conflict-aware graph reasoning framework that: (1) constructs EAFD graphs from historical cases capturing Maker-Checker disagreements, (2) aggregates them into a retrievable knowledge base, and (3) performs top-down deductive reasoning for new cases by projecting validated resolution paths from precedents. A distinctive capability is the Request More Information (RMI) outcome: when evidence is insufficient, the system identifies precisely which verification actions remain unexecuted and generates targeted information requests. We evaluate the framework in large-scale e-commerce seller appeal adjudication. While a standard LLM-only baseline achieves only 70.8% alignment with human experts, incorporating action modeling with RMI improves alignment to 87.5%.
Augmenting this with the retrieval-based knowledge graph yields the best offline performance of 95.8%. Following online deployment, the framework maintains robust performance, achieving a 96.3% alignment rate in production, demonstrating its real-world effectiveness.
[21] Full-Stack Domain Enhancement for Combustion LLMs: Construction and Optimization
Quanjia Xiao,Weimin Ouyang,Zonglin Yang,Tianhao Wu,Qingguo Zhou,Runze Mao,Zhi X. Chen
Main category: cs.CL
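TL;DR: This paper proposes the first full-stack domain-enhanced LLM workflow for combustion science, combining automated domain corpus construction, incremental pre-training, instruction fine-tuning, and verifiable reward-based reinforcement learning so that the model internalizes physical laws. It also releases FlameBench, a dedicated evaluation benchmark; experiments show the model significantly outperforms existing general-purpose LLMs and RAG methods on combustion science reasoning tasks.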
Details
Motivation: General-purpose LLMs tend to hallucinate severely in domains involving complex physical systems such as combustion science, because they lack domain knowledge and struggle to obey physical conservation laws.
Method: Proposes a full-stack domain-enhanced LLM workflow comprising automated domain corpus construction, incremental pre-training, instruction fine-tuning, and verifiable reward-based reinforcement learning; also builds the dedicated evaluation benchmark FlameBench.
Result: The resulting model significantly outperforms state-of-the-art general-purpose closed-source models and traditional retrieval-augmented generation methods on combustion science reasoning tasks.
Conclusion: This work lays a solid technical and resource foundation for building domain-specific scientific research agents with reliable scientific reasoning.
Abstract: Adapting and enhancing large language models (LLMs) for professional fields shows significant application potential. Nevertheless, for complex physical systems such as combustion science, general-purpose LLMs often generate severe hallucinations due to insufficient domain knowledge and the inability to adhere to physical conservation laws. To address this issue, we propose the first full-stack domain-enhanced LLM workflow tailored for the field of combustion science, which integrates automated domain corpus construction, incremental pre-training, instruction fine-tuning, and verifiable reward-based reinforcement learning. This workflow ensures that the model truly internalizes physical laws rather than merely learning textual statistical patterns. We also release FlameBench, a standardized evaluation benchmark specifically designed for complex reasoning tasks in combustion science. Experimental results demonstrate that the model developed in this work significantly outperforms state-of-the-art general-purpose closed-source models and traditional retrieval-augmented generation methods on combustion science reasoning tasks. This work lays a solid technical and resource foundation for the subsequent development of domain-specific scientific research agents with reliable scientific reasoning capabilities.
[22] From Tokens To Agents: A Researcher's Guide To Understanding Large Language Models
Daniele Barolo
Main category: cs.CL
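TL;DR: This chapter aims to help researchers understand the core mechanisms of large language models (LLMs) so they can use them appropriately and critically in research. Rather than going into technical detail, it systematically examines six key components: pre-training data, tokenization and embeddings, transformer architecture, probabilistic generation, alignment, and agentic capabilities, analyzing the affordances and limitations of each through its technical foundations and research implications. It closes with a case study on simulating social media dynamics with LLMs.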
Details
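Motivation: Researchers need to understand the capability boundaries and inner mechanisms of LLMs in order to use them carefully and effectively, rather than blindly relying on or rejecting them.
Method: Uses a conceptual decomposition that breaks LLMs into six core components, explains each non-technically from the perspectives of technical principles and research practice, and builds an analytical framework for critically assessing LLM suitability.
Result: Presents a framework for understanding LLMs aimed at non-technical researchers, clarifies how each component affects research design, data interpretation, and methodological choices, and illustrates the framework's usefulness through a social media simulation case study.
Conclusion: LLMs are not a universal tool; their value depends on whether users can make methodological judgments suited to the research question based on a clear understanding of the underlying mechanisms. Critical reasoning matters more than technical proficiency.
Abstract: Researchers face a critical choice: how to use -- or not use -- large language models in their work. Using them well requires understanding the mechanisms that shape what LLMs can and cannot do. This chapter makes LLMs comprehensible without requiring technical expertise, breaking down six essential components: pre-training data, tokenization and embeddings, transformer architecture, probabilistic generation, alignment, and agentic capabilities. Each component is analyzed through both technical foundations and research implications, identifying specific affordances and limitations. Rather than prescriptive guidance, the chapter develops a framework for reasoning critically about whether and how LLMs fit specific research needs, finally illustrated through an extended case study on simulating social media dynamics with LLM-based agents.
[23] Autonoma: A Hierarchical Multi-Agent Framework for End-to-End Workflow Automation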
Eslam Reda,Maged Yasser,Sara El-Metwally
Main category: cs.CL
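TL;DR: This paper proposes Autonoma, a hierarchical multi-agent framework that turns natural language instructions into robust end-to-end workflows. A three-tier architecture of Coordinator, Planner, and Supervisor handles task decomposition, dynamic scheduling, and fault-tolerant execution. The system supports multi-modal input and two languages (English and Arabic), and achieves a 97% task completion rate and a 98% agent handoff success rate in a LAN environment.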
Details
Motivation: User demands are growing more complex, and existing monolithic agent architectures fall short in scalability, control of error propagation, and maintaining focus across tasks; a more reliable, scalable, and privacy-preserving automation framework is needed.
Method: Proposes the hierarchical multi-agent framework Autonoma with three tiers: a high-level Coordinator (validates user intent), a Planner (generates structured workflows), and a Supervisor (dynamically orchestrates specialized modular agents, e.g., for web browsing, coding, and file management). It separates orchestration logic from execution, supports plug-and-play extension, is deployed within a secure LAN, and accepts multi-modal input in English and Arabic.
Result: The system achieved a 97% task completion rate and a 98% successful agent handoff rate in testing, supporting its reliability, collaboration efficiency, and feasibility in privacy-sensitive environments.
Conclusion: Through structured hierarchical design and modular decoupling, Autonoma improves the robustness, extensibility, and practicality of workflow automation, offering a workable paradigm for privacy-first, multi-modal, multilingual agent collaboration.
Abstract: The increasing complexity of user demands necessitates automation frameworks that can reliably translate open-ended instructions into robust, multi-step workflows. Current monolithic agent architectures often struggle with the challenges of scalability, error propagation, and maintaining focus across diverse tasks. This paper introduces Autonoma, a structured, hierarchical multi-agent framework designed for end-to-end workflow automation from natural language prompts. Autonoma employs a principled, multi-tiered architecture where a high-level Coordinator validates user intent, a Planner generates structured workflows, and a Supervisor dynamically manages the execution by orchestrating a suite of modular, specialized agents (e.g., for web browsing, coding, file management). This clear separation between orchestration logic and specialized execution ensures robustness through active monitoring and error handling, while enabling extensibility by allowing new capabilities to be integrated as plug-and-play agents without modifying the core engine. Implemented as a fully functional system operating within a secure LAN environment, Autonoma addresses critical data privacy and reliability concerns. The system is further engineered for inclusivity, accepting multi-modal input (text, voice, image, files) and supporting both English and Arabic. Autonoma achieved a 97% task completion rate and a 98% successful agent handoff rate, confirming its operational reliability and efficient collaboration.
[24] A Human-Centered Workflow for Using Large Language Models in Content Analysis
Ivan Zupic
Main category: cs.CL
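TL;DR: This paper presents a human-centered, general workflow for content analysis using large language models (LLMs) through API calls. It covers three tasks (annotation, summarization, and information extraction) and provides validation procedures, best practices, and accompanying code and a prompt library.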
Details
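Motivation: Most existing research uses LLMs through chat interfaces, which does not exploit their potential as general-purpose text processing tools. LLMs also face key limitations in content analysis, including their black-box nature, prompt sensitivity, and hallucination, which call for rigorous, transparent, and reproducible methodology.
Method: Conceptualizes LLMs as universal text processing machines and builds a human-centered workflow covering design, supervision, and validation; synthesizes methodological literature from political science, sociology, computer science, psychology, and management; and supplies a prompt library, Python code (Jupyter Notebook), and detailed usage instructions.
Result: A systematic, reproducible, cross-disciplinary framework for LLM-driven content analysis, with concrete procedures, validation steps, and best practices for annotation, summarization, and information extraction.
Conclusion: LLMs accessed through APIs can effectively support qualitative and quantitative content analysis, but researchers must lead design and validation throughout to ensure rigor and trustworthy results; the framework gives researchers in the social sciences and other fields ready-to-use methodological infrastructure.
Abstract: While many researchers use Large Language Models (LLMs) through chat-based access, their real potential lies in leveraging LLMs via application programming interfaces (APIs). This paper conceptualizes LLMs as universal text processing machines and presents a comprehensive workflow for employing LLMs in three qualitative and quantitative content analysis tasks: (1) annotation (an umbrella term for qualitative coding, labeling and text classification), (2) summarization, and (3) information extraction. The workflow is explicitly human-centered. Researchers design, supervise, and validate each stage of the LLM process to ensure rigor and transparency. Our approach synthesizes insights from extensive methodological literature across multiple disciplines: political science, sociology, computer science, psychology, and management. We outline validation procedures and best practices to address key limitations of LLMs, such as their black-box nature, prompt sensitivity, and tendency to hallucinate. To facilitate practical implementation, we provide supplementary materials, including a prompt library and Python code in Jupyter Notebook format, accompanied by detailed usage instructions.
[25] Transformers are Stateless Differentiable Neural Computers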
Bo Tang,Weiwei Xie
Main category: cs.CL
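TL;DR: This paper proves that a causal Transformer layer is equivalent to a stateless Differentiable Neural Computer (sDNC) and extends the equivalence to cross-attention, giving a unified memory-centric interpretation of Transformers.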
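The memory-centric reading summarized above can be illustrated with a toy single-head example: causal attention computed the usual way gives the same result as appending each key/value pair to a write-once memory and reading it back by content. This is only a sketch of that correspondence, not the paper's formal derivation; `WriteOnceMemory` and `memory_view` are hypothetical names, and learned projections, multiple heads, and the rest of the layer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, rows):
    """Combine row vectors using the given weights."""
    return [sum(w * row[d] for w, row in zip(weights, rows)) for d in range(len(rows[0]))]


def causal_attention(queries, keys, values):
    """Standard view: position t attends over positions 0..t."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for t, q in enumerate(queries):
        scores = [dot(q, k) * scale for k in keys[: t + 1]]
        outputs.append(weighted_sum(softmax(scores), values[: t + 1]))
    return outputs


class WriteOnceMemory:
    """Hypothetical external memory: rows are appended once and never modified."""

    def __init__(self):
        self.keys = []
        self.values = []

    def write(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def read(self, query):
        """Content-based addressing: softmax over query-key similarity."""
        scale = 1.0 / math.sqrt(len(query))
        scores = [dot(query, k) * scale for k in self.keys]
        return weighted_sum(softmax(scores), self.values)


def memory_view(queries, keys, values):
    """Memory view: at each step write (key, value), then read with the query."""
    memory = WriteOnceMemory()
    outputs = []
    for q, k, v in zip(queries, keys, values):
        memory.write(k, v)
        outputs.append(memory.read(q))
    return outputs


queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

attn = causal_attention(queries, keys, values)
mem = memory_view(queries, keys, values)
```

Nothing carries over between steps except the memory contents, which is the sense in which the controller in this toy version holds no recurrent state.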
Details
Motivation: To place modern large language models (such as Transformers) in a principled computational framework and improve understanding and interpretability of how they work.
Method: Through formal derivation, establishes a strict equivalence between a causal Transformer layer and a stateless Differentiable Neural Computer (sDNC), and extends it to the cross-attention mechanism in encoder-decoder architectures.
Result: Shows that a causal Transformer layer is mathematically equivalent to an sDNC: the controller has no recurrent internal state, the memory is a write-once matrix of value vectors, content-based addressing via keys is the attention mechanism, and multi-head attention corresponds to multiple parallel read heads; encoder-decoder Transformers are equivalent to sDNCs with distinct read and write memories.
Conclusion: Transformers are fundamentally memory-centric architectures whose core mechanism can be interpreted uniformly as a stateless differentiable neural computer, which deepens understanding of the computational nature of large language models.
Abstract: Differentiable Neural Computers (DNCs) were introduced as recurrent architectures equipped with an addressable external memory supporting differentiable read and write operations. Transformers, in contrast, are nominally feedforward architectures based on multi-head self-attention. In this work we give a formal derivation showing that a causal Transformer layer is exactly a stateless Differentiable Neural Computer (sDNC) where (1) the controller has no recurrent internal state, (2) the external memory is a write-once matrix of value vectors, (3) content-based addressing via keys implements attention, and (4) multi-head attention corresponds to multiple parallel read heads. We further extend this equivalence to cross-attention, showing that encoder-decoder Transformers are precisely sDNCs with distinct read-from and write-to memories. Our results provide a unified memory-centric interpretation of Transformers and contribute to the ongoing effort to place modern large language models in a principled computational framework.
[26] LSR: Linguistic Safety Robustness Benchmark for Low-Resource West African Languages
Godwin Abuh Faruna
Main category: cs.CL
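TL;DR: This paper introduces the LSR (Linguistic Safety Robustness) benchmark, the first systematic evaluation of how LLM refusal of harmful intent degrades in West African languages (Yoruba, Hausa, Igbo, and Igala). Refusal rates of about 90% in English fall to 35-55% in these languages, and a new metric, RCD, quantifies the degradation.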
Details
Motivation: Safety alignment of current LLMs relies mainly on English training data; when harmful intent is expressed in low-resource languages, refusal mechanisms that work in English often fail, so systematic cross-lingual safety evaluation tools are needed.
Method: Builds the LSR benchmark with a dual-probe evaluation protocol (submitting matched English and target-language probes to the same model), proposes the Refusal Centroid Drift (RCD) metric, and implements both in the Inspect AI framework; tests 14 culturally grounded attack probes covering four harm categories on Gemini 2.5 Flash.
Result: English refusal rates are about 90%, while refusal rates in West African languages fall to 35-55%, with Igala degrading most severely (RCD = 0.55). LSR is available as a PR-ready contribution to the UK AISI's inspect_evals repository, and the benchmark dataset and a live reference implementation are public.
Conclusion: LSR exposes a serious gap in LLM safety alignment for low-resource languages and provides the first standardized benchmark and reproducible tooling for cross-lingual safety evaluation, supporting multilingual AI safety research and practice.
Abstract: Safety alignment in large language models relies predominantly on English-language training data. When harmful intent is expressed in low-resource languages, refusal mechanisms that hold in English frequently fail to activate. We introduce LSR (Linguistic Safety Robustness), the first systematic benchmark for measuring cross-lingual refusal degradation in West African languages: Yoruba, Hausa, Igbo, and Igala. LSR uses a dual-probe evaluation protocol - submitting matched English and target-language probes to the same model - and introduces Refusal Centroid Drift (RCD), a metric that quantifies how much of a model's English refusal behavior is lost when harmful intent is encoded in a target language. We evaluate Gemini 2.5 Flash across 14 culturally grounded attack probes in four harm categories. English refusal rates hold at approximately 90 percent. Across West African languages, refusal rates fall to 35-55 percent, with Igala showing the most severe degradation (RCD = 0.55). LSR is implemented in the Inspect AI evaluation framework and is available as a PR-ready contribution to the UK AISI's inspect_evals repository. A live reference implementation and the benchmark dataset are publicly available.
[27] CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation
Yannian Gu,Zhongzhen Huang,Linjie Mu,Xizhuo Zhang,Shaoting Zhang,Xiaofan Zhang
Main category: cs.CL
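TL;DR: This paper introduces the CURE benchmark, which separately evaluates the reasoning and literature retrieval abilities of multimodal large language models (MLLMs) in clinical diagnosis. Models reach high diagnostic accuracy when given reference evidence (73.4%), but accuracy drops sharply (to as low as 25.4%) when they must retrieve evidence themselves.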
Details
Motivation: Existing benchmarks evaluate only end-to-end question answering and cannot separate a model's multimodal reasoning from its ability to retrieve and apply evidence; a benchmark that disentangles the two is needed.
Method: Builds the CURE benchmark of 500 real clinical cases with the corresponding physician-cited literature, and designs an evaluation framework with controlled evidence conditions to assess frontier MLLMs under different evidence-gathering paradigms on closed-ended and open-ended diagnosis tasks.
Result: Advanced MLLMs reach up to 73.4% accuracy on differential diagnosis when given physician reference evidence but fall to as low as 25.4% when relying on their own retrieval, revealing significant bottlenecks in both multimodal evidence integration and precise literature retrieval.
Conclusion: CURE disentangles and quantifies clinical multimodal reasoning and retrieval, highlights fundamental limitations of current MLLMs in acquiring and integrating evidence on their own, and provides a reproducible, fine-grained evaluation standard for future work.
Abstract: Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end-to-end answering scenarios. This limits the ability to disentangle a model's foundational multimodal reasoning from its proficiency in evidence retrieval and application. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising 500 multimodal clinical cases mapped to physician-cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state-of-the-art MLLMs across distinct evidence-gathering paradigms in both closed-ended and open-ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clinical reasoning proficiency when supplied with physician reference evidence (achieving up to 73.4% accuracy on differential diagnosis), their performance substantially declines (as low as 25.4%) when reliant on independent retrieval mechanisms. This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature. CURE is publicly available at https://github.com/yanniangu/CURE.
[28] Improving Automatic Summarization of Radiology Reports through Mid-Training of Large Language Models
Mengxian Lyu,Cheng Peng,Ziyi Chen,Mengyuan Zhang,Jieting Li Lu,Yonghui Wu
Main category: cs.CL
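TL;DR: This paper proposes a new paradigm that adds clinical-subdomain mid-training between pre-training and fine-tuning, which significantly improves the performance and factuality of automatic radiology report summarization and improves few-shot learning.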
Details
Motivation: Existing LLM-based radiology report summarization methods mostly follow a two-stage "pre-training + fine-tuning" strategy, which suffers from insufficient domain adaptation, poor factuality, and a few-shot cold-start problem.
Method: Proposes a three-stage "pre-training → clinical-subdomain mid-training → fine-tuning" paradigm; performs clinical-domain pre-training on large-scale UF Health clinical text and mid-training on the radiology subdomain (OpenI, MIMIC-CXR); builds the GatorTronT5-Radio model and compares three strategies (general-domain pre-training, clinical-domain pre-training, and clinical-domain pre-training plus mid-training).
Result: GatorTronT5-Radio outperforms baselines without mid-training on both ROUGE-L and RadGraph-F1, shows stronger few-shot learning, and alleviates the cold-start problem.
Conclusion: The "pre-training → mid-training → fine-tuning" paradigm is better suited to medical subdomain summarization than the traditional "pre-training → fine-tuning" approach; mid-training is key to improving domain adaptation, factuality, and generalization.
Abstract: Automatic summarization of radiology reports is an essential application to reduce the burden on physicians. Previous studies have widely used the "pre-training, fine-tuning" strategy to adapt large language models (LLMs) for summarization. This study proposed a subdomain adaptation through a mid-training method to improve summarization. We explored three adaptation strategies: (1) general-domain pre-training, (2) clinical-domain pre-training, and (3) clinical-domain pre-training followed by subdomain mid-training. We developed models using large-scale clinical text from the University of Florida (UF) Health and conducted mid-training and fine-tuning experiments using widely used benchmark datasets including OpenI and MIMIC-CXR. The experimental results show that the mid-trained model, GatorTronT5-Radio, achieved the best performance, outperforming models without mid-training in both text-based measures (ROUGE-L) and factuality measures (RadGraph-F1). Our mid-training methods also demonstrate better few-shot learning and could alleviate the "cold start" problem reported in previous studies as a learning barrier. Our findings support the use of "pre-training, mid-training, fine-tuning," instead of the widely used direct fine-tuning strategy.
[29] From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG
Yucheng Chu,Haoyu Han,Shen Dong,Hang Li,Kaiqi Yang,Yasemin Copur-Gencturk,Joseph Krajcik,Namsoo Shin,Hui Liu
Main category: cs.CL
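TL;DR: This paper proposes a GraphRAG framework that builds a structured knowledge graph and uses the HippoRAG neurosymbolic algorithm for associative graph traversal, improving adherence to logical reasoning chains and rubrics in automated short answer grading (ASAG). It significantly outperforms standard RAG on an NGSS dataset.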
Details
Motivation: LLMs are prone to hallucination in automated short answer grading and struggle to follow rubrics strictly; standard RAG relies on flat vector retrieval, which cannot model the structural relationships and multi-hop reasoning in educational content.
Method: Proposes a GraphRAG framework: the first phase uses Microsoft GraphRAG to build a high-fidelity knowledge graph; the second uses the HippoRAG neurosymbolic algorithm to perform associative graph traversal and retrieve connected evidence subgraphs.
Result: On an NGSS dataset, GraphRAG outperforms standard RAG baselines across all metrics; HippoRAG yields substantial gains in evaluating Science and Engineering Practices (SEP), supporting the value of structural retrieval for higher-order academic assessment.
Conclusion: Structured knowledge graphs combined with neurosymbolic graph retrieval strengthen an ASAG system's ability to identify logical reasoning chains and grade robustly, offering a new paradigm for automated educational assessment.
Abstract: Automated short answer grading (ASAG) is critical for scaling educational assessment, yet large language models (LLMs) often struggle with hallucinations and strict rubric adherence due to their reliance on generalized pre-training. While Retrieval-Augmented Generation (RAG) mitigates these issues, standard "flat" vector retrieval mechanisms treat knowledge as isolated fragments, failing to capture the structural relationships and multi-hop reasoning essential for complex educational content. To address this limitation, we introduce a Graph Retrieval-Augmented Generation (GraphRAG) framework that organizes reference materials into a structured knowledge graph to explicitly model dependencies between concepts. Our methodology employs a dual-phase pipeline: utilizing Microsoft GraphRAG for high-fidelity graph construction and the HippoRAG neurosymbolic algorithm to execute associative graph traversals, thereby retrieving comprehensive, connected subgraphs of evidence. Experimental evaluations on a Next Generation Science Standards (NGSS) dataset demonstrate that this structural approach significantly outperforms standard RAG baselines across all metrics. Notably, the HippoRAG implementation achieved substantial improvements in evaluating Science and Engineering Practices (SEP), confirming the superiority of structural retrieval in verifying the logical reasoning chains required for higher-order academic assessment.
[30] MOSAIC: Modular Opinion Summarization using Aspect Identification and Clustering
Piyush Kumar Singh,Jayesh Choudhari
Main category: cs.CL
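TL;DR: This paper proposes MOSAIC, a scalable, modular framework for summarizing online reviews that improves summary quality and usefulness through interpretable components such as theme discovery, structured opinion extraction, and grounded summary generation. It is validated in live A/B tests and offline experiments, and a new dataset, TRECS, is released to make evaluation more reliable.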
Details
Motivation: Existing review summarization research focuses too heavily on end-to-end quality and overlooks benchmark reliability and the practical utility of granular insights.
Method: Proposes the modular framework MOSAIC, comprising theme discovery, structured opinion extraction, and grounded summary generation, and introduces opinion clustering as a system-level component; evaluates with online A/B tests and offline experiments and builds the new dataset TRECS.
Result: MOSAIC outperforms strong baselines on aspect coverage and faithfulness; opinion clustering significantly improves faithfulness, especially for noisy, redundant user reviews; online A/B tests confirm that surfacing intermediate outputs alone improves user experience and delivers practical value.
Conclusion: Modular, interpretable design is better suited to industrial deployment, opinion clustering is key to summary faithfulness, and more reliable evaluation datasets (such as the newly released TRECS) are needed to support further research.
Abstract: Reviews are central to how travelers evaluate products on online marketplaces, yet existing summarization research often emphasizes end-to-end quality while overlooking benchmark reliability and the practical utility of granular insights. To address this, we propose MOSAIC, a scalable, modular framework designed for industrial deployment that decomposes summarization into interpretable components, including theme discovery, structured opinion extraction, and grounded summary generation. We validate the practical impact of our approach through online A/B tests on live product pages, showing that surfacing intermediate outputs improves customer experience and delivers measurable value even prior to full summarization deployment. We further conduct extensive offline experiments to demonstrate that MOSAIC achieves superior aspect coverage and faithfulness compared to strong baselines for summarization. Crucially, we introduce opinion clustering as a system-level component and show that it significantly enhances faithfulness, particularly under the noisy and redundant conditions typical of user reviews. Finally, we identify reliability limitations in the standard SPACE dataset and release a new open-source tour experience dataset (TRECS) to enable more robust evaluation.
[31] HypeLoRA: Hyper-Network-Generated LoRA Adapters for Calibrated Language Model Fine-Tuning
Bartosz Trojan,Filip Gębala
Main category: cs.CL
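TL;DR: This paper studies the calibration of LoRA and a hyper-network-based adaptation framework on RoBERTa. LoRA stays highly parameter-efficient while matching or exceeding full fine-tuning in calibration; dynamically generating LoRA factors with a hyper-network performs better on the CoLA dataset; freezing some LoRA matrices improves calibration at some cost in accuracy; and the authors open-source a unified implementation of calibration metrics.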
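For readers unfamiliar with the calibration metrics mentioned here, the following is a minimal sketch of Expected Calibration Error with equal-width confidence bins. It is a common textbook formulation, not the authors' released implementation; the function name, the bin count of 10, and the left-closed binning convention are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted average of |accuracy - mean confidence| over bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction was right.
    Bins are equal-width and left-closed; a confidence of 1.0 goes in the last bin.
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Overconfident toy model: claims 95% confidence but is right 3 times out of 4.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, True, False]
)

# Well-calibrated toy model: 50% confidence, right half the time.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
```

A lower value means confidence tracks accuracy more closely, which is why the summary treats a reduction in ECE as an improvement in calibration.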
Details
Motivation: Modern Transformer models are often overconfident (poorly calibrated), and the calibration behavior of parameter-efficient fine-tuning methods such as LoRA is not well understood; the relationship between such methods and probabilistic reliability needs systematic study.
Method: Evaluates the calibration of LoRA and a new hyper-network-driven LoRA variant (which dynamically generates the A and B matrices) on the GLUE benchmark; quantifies calibration with metrics including ECE, MCE, and ACE; and freezes the LoRA matrix A to study how structural constraints trade off calibration against accuracy.
Result: LoRA is broadly on par with full fine-tuning in calibration and better on some tasks (such as CoLA); the hyper-network approach performs close to standard LoRA and improves MCC on CoLA; freezing matrix A significantly lowers ECE (better calibration) but reduces downstream accuracy; a unified, reproducible implementation of calibration metrics is provided.
Conclusion: Structured low-rank updates such as LoRA are efficient and can also preserve calibration, making them a viable foundation for uncertainty-aware Transformer architectures; parameter efficiency and probabilistic reliability are not mutually exclusive and can be jointly optimized through adapter design.
Abstract: Modern Transformer-based models frequently suffer from miscalibration, producing overconfident predictions that do not reflect true empirical frequencies. This work investigates the calibration dynamics of LoRA (Low-Rank Adaptation) and a novel hyper-network-based adaptation framework as parameter-efficient alternatives to full fine-tuning for RoBERTa. Evaluating across the GLUE benchmark, we demonstrate that LoRA-based adaptation consistently achieves calibration parity with (and in specific tasks exceeds) full fine-tuning, while maintaining significantly higher parameter efficiency. We further explore a dynamic approach where a shared hyper-network generates LoRA factors (A and B matrices) to induce structural coupling across layers. This approach produced results similar to standard LoRA fine-tuning, even achieving better MCC on the CoLA dataset. Our study also reveals a critical trade-off: constraining the adaptation space (e.g., freezing matrices A) acts as a powerful regularizer that improves Expected Calibration Error (ECE), but necessitates a carefully balanced sacrifice in downstream task accuracy. To support future research, we provide a unified and reproducible implementation of contemporary calibration metrics, including ECE, MCE, and ACE. Our findings clarify the relationship between parameter efficiency and probabilistic reliability, positioning structured low-rank updates as a viable foundation for uncertainty-aware Transformer architectures.
Code available at: https://github.com/btrojan-official/HypeLoRA
[32] Multilingual Hate Speech Detection and Counterspeech Generation: A Comprehensive Survey and Practical Guide
Zahra Safdari Fesaghandis,Suman Kalyan Maity
Main category: cs.CL
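TL;DR: This paper surveys recent advances in multilingual hate speech detection and counterspeech generation and proposes a three-phase framework covering task design, data curation, and evaluation, with emphasis on cultural adaptation, fairness, and the challenges of low-resource languages.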
Details
Motivation: Monolingual (especially English) models struggle with multilingual, code-mixed, and implicit or culturally specific hate expressions; solutions that account for linguistic diversity and cultural context are needed.
Method: Proposes a structured three-phase framework (task design, data curation, evaluation) that integrates multilingual datasets, models, and metrics, and analyzes technical limitations from ethical and cultural perspectives.
Result: Systematically reviews research progress and resources in multilingual hate detection and counterspeech generation and identifies key challenges including data scarcity, bias and fairness, and the need for multimodal solutions.
Conclusion: Technical development must be combined with ethical and cultural considerations to build scalable, context-aware, inclusive multilingual online safety systems.
Abstract: Combating online hate speech in multilingual settings requires approaches that go beyond English-centric models and capture the cultural and linguistic diversity of global online discourse. This paper presents a comprehensive survey and practical guide to multilingual hate speech detection and counterspeech generation, integrating recent advances in natural language processing. We analyze why monolingual systems often fail in non-English and code-mixed contexts, missing implicit hate and culturally specific expressions. To address these challenges, we outline a structured three-phase framework - task design, data curation, and evaluation - drawing on state-of-the-art datasets, models, and metrics. The survey consolidates progress in multilingual resources and techniques while highlighting persistent obstacles, including data scarcity in low-resource languages, fairness and bias in system development, and the need for multimodal solutions. By bridging technical progress with ethical and cultural considerations, we provide researchers, practitioners, and policymakers with scalable guidelines for building context-aware, inclusive systems. Our roadmap contributes to advancing online safety through fairer, more effective detection and counterspeech generation across diverse linguistic environments.
[33] From Feature-Based Models to Generative AI: Validity Evidence for Constructed Response Scoring
Jodi M. Casabianca,Daniel F. McCaffrey,Matthew S. Johnson,Naim Alper,Vladimir Zubenko
Main category: cs.CL
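TL;DR: This paper examines the use of generative AI for scoring constructed responses in high-stakes testing, highlights how it differs from traditional feature-based AI scoring, and proposes best practices for supporting the validity of generative AI scoring systems.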
Details
Motivation: With the rapid development of large language models and generative AI, their use in high-stakes testing is growing, particularly for constructed response scoring, because they reduce the effort of handcrafting features required by traditional AI scoring and may even outperform those methods. Method: By comparing the validity evidence needed for human ratings, feature-based natural language processing AI scoring engines, and generative AI scoring systems, the paper proposes a set of best practices for collecting validity evidence for generative AI scoring systems. Result: The validity evidence needed for generative AI scoring systems is more extensive than for feature-based systems, mainly because of the lack of transparency and concerns unique to generative AI such as consistency. Using a large corpus of independent argumentative essays written by students in grades 6-12, the paper demonstrates the collection of validity evidence for different scoring systems and reveals the many complexities and considerations involved in building a validity argument. Conclusion: Generative AI has potential for constructed response scoring, but more comprehensive validity evidence is needed to support its use and interpretation; the proposed best practices offer guidance for educational assessment. Abstract: The rapid advancements in large language models and generative artificial intelligence (AI) capabilities are making their broad application in the high-stakes testing context more likely. Use of generative AI in the scoring of constructed responses is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring and might even outperform those methods. The purpose of this paper is to highlight the differences in the feature-based and generative AI applications in constructed response scoring systems and propose a set of best practices for the collection of validity evidence to support the use and interpretation of constructed response scores from scoring systems using generative AI. We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI. The evidence needed in the generative AI context is more extensive than in the feature-based scoring context because of the lack of transparency and other concerns unique to generative AI such as consistency. Constructed response score data from a large corpus of independent argumentative essays written by 6-12th grade students demonstrate the collection of validity evidence for different types of scoring systems and highlight the numerous complexities and considerations when making a validity argument for these scores.
[34] URAG: A Benchmark for Uncertainty Quantification in Retrieval-Augmented Large Language Models
Vinh Nguyen,Cuong Dang,Jiahao Zhang,Hoa Tran,Minh Tran,Trinh Chau,Thai Le,Lu Cheng,Suhang Wang
Main category: cs.CL
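TL;DR: This paper introduces URAG, a benchmark for evaluating the uncertainty and reliability of retrieval-augmented generation (RAG) systems across multiple domains. By reformulating open-ended generation tasks as multiple-choice question answering and applying conformal prediction for uncertainty quantification, it reveals limitations of current RAG methods in the accuracy-uncertainty trade-off, cross-domain reliability, and the causes of hallucination.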
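To make the conformal-prediction step concrete, here is a minimal sketch of split conformal prediction over multiple-choice options using the two nonconformity scores named in the abstract, LAC and APS. It is not URAG's own code: the function names, the calibration examples, and the value of `alpha` are illustrative assumptions, and the APS variant shown is the simple non-randomized one.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank into the sorted scores
    if rank > n:
        return float("inf")  # too few calibration points: every option is kept
    return sorted(scores)[rank - 1]


def lac_score(probs, label):
    """LAC nonconformity: one minus the probability given to `label`."""
    return 1.0 - probs[label]


def aps_score(probs, label):
    """APS nonconformity (non-randomized): total probability of the options
    ranked at or above `label` when sorted from most to least likely."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    total = 0.0
    for i in order:
        total += probs[i]
        if i == label:
            return total
    raise ValueError("label is not a valid option index")


def prediction_set(probs, qhat, score_fn):
    """All options whose nonconformity score does not exceed the threshold."""
    return [y for y in range(len(probs)) if score_fn(probs, y) <= qhat]


# Hypothetical calibration data: (option probabilities, index of the correct option).
calibration = [
    ([0.70, 0.10, 0.10, 0.10], 0),
    ([0.40, 0.30, 0.20, 0.10], 1),
    ([0.25, 0.25, 0.25, 0.25], 2),
    ([0.60, 0.20, 0.10, 0.10], 0),
    ([0.50, 0.30, 0.10, 0.10], 0),
]
alpha = 0.2  # assumed target miscoverage rate
test_probs = [0.50, 0.30, 0.15, 0.05]

for name, score_fn in [("LAC", lac_score), ("APS", aps_score)]:
    qhat = conformal_quantile([score_fn(p, y) for p, y in calibration], alpha)
    print(name, prediction_set(test_probs, qhat, score_fn))
```

The size of the returned set is the uncertainty signal: a larger set means the model needs more options to reach the target coverage, which is why prediction-set size can be reported alongside accuracy.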
Details
Motivation: Existing RAG evaluations focus mainly on correctness and do not fully capture the effect of retrieval on LLM uncertainty and reliability, so a benchmark that systematically measures RAG uncertainty is needed. Method: Proposes the URAG benchmark, which reformulates open-ended generation tasks as multiple-choice question answering and quantifies uncertainty via conformal prediction (based on LAC and APS metrics); evaluates 8 standard RAG methods across five domains: healthcare, programming, science, math, and general text. Result: (1) Accuracy gains often coincide with reduced uncertainty, but this relationship breaks down under retrieval noise; (2) simple modular RAG methods offer better accuracy-uncertainty trade-offs than complex reasoning pipelines; (3) no single RAG method is reliable across all domains; (4) retrieval depth, dependence on parametric knowledge, and exposure to confidence cues can amplify confident errors and hallucinations. Conclusion: URAG provides a systematic benchmark for analyzing and improving the trustworthiness of RAG systems, and emphasizes that uncertainty modeling and robust design matter alongside accuracy. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for enhancing LLMs in scenarios that demand extensive factual knowledge. However, current RAG evaluations concentrate primarily on correctness, which may not fully capture the impact of retrieval on LLM uncertainty and reliability. To bridge this gap, we introduce URAG, a comprehensive benchmark designed to assess the uncertainty of RAG systems across various fields like healthcare, programming, science, math, and general text. By reformulating open-ended generation tasks into multiple-choice question answering, URAG allows for principled uncertainty quantification via conformal prediction. We apply the evaluation pipeline to 8 standard RAG methods, measuring their performance through both accuracy and prediction-set sizes based on LAC and APS metrics. Our analysis shows that (1) accuracy gains often coincide with reduced uncertainty, but this relationship breaks under retrieval noise; (2) simple modular RAG methods tend to offer better accuracy-uncertainty trade-offs than more complex reasoning pipelines; and (3) no single RAG approach is universally reliable across domains. We further show that (4) retrieval depth, parametric knowledge dependence, and exposure to confidence cues can amplify confident errors and hallucinations. Ultimately, URAG establishes a systematic benchmark for analyzing and enhancing the trustworthiness of retrieval-augmented systems. Our code is available on GitHub.
[35] Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis
Zice Wang,Zhenyu Zhang
Main category: cs.CL
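TL;DR: This paper studies how prompt framing affects the decisions of large language models (LLMs) in an independent threshold voting task, finding that surface linguistic cues can significantly shift model choices, especially toward risk aversion, which reflects a tendency toward instrumental rather than cooperative rationality.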
Details
Motivation: In practice LLMs often run as independent agents without interaction, which limits coordination; the paper investigates how prompt framing affects their decisions in a threshold voting task with individual-group interest conflict, revealing potential biases in non-interacting multi-agent deployments. Method: Designs two logically equivalent prompts with different framings and runs isolated trials across multiple LLM families on a threshold voting task (with individual-group interest conflict). Result: Prompt framing significantly affects choice distributions, often shifting models toward risk-averse options; surface linguistic cues can override logical equivalence; behavior is more consistent with instrumental than cooperative rationality. Conclusion: Prompt framing is a non-negligible source of bias in independent multi-LLM systems, with important implications for AI alignment and prompt engineering. Abstract: In many real-world applications, large language models (LLMs) operate as independent agents without interaction, thereby limiting coordination. In this setting, we examine how prompt framing influences decisions in a threshold voting task involving individual-group interest conflict. Two logically equivalent prompts with different framings were tested across diverse LLM families under isolated trials. Results show that prompt framing significantly influences choice distributions, often shifting preferences toward risk-averse options. Surface linguistic cues can even override logically equivalent formulations. This suggests that observed behavior reflects a tendency consistent with a preference for instrumental rather than cooperative rationality when success requires risk-bearing. The findings highlight framing effects as a significant bias source in non-interacting multi-agent LLM deployments, informing alignment and prompt design.
[36] Automated Motif Indexing on the Arabian Nights
Ibrahim H. Alyami,Mark A. Finlayson
Main category: cs.CL
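TL;DR: This paper presents the first computational approach to motif indexing in folk tales. By pairing the text of the Arabian Nights with El-Shamy's (2006) motif index, it builds a large manually annotated corpus and compares five types of approaches for detecting motif expressions, with a fine-tuned Llama3 model achieving the best performance of 0.85 F1.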
Details
Motivation: Motifs are important narrative elements in folklore and modern cultural texts, but their automatic identification has long been hampered by inaccessible data and methodological difficulty; this paper aims to overcome that bottleneck to support folkloristic analysis and the understanding of modern motif usage. Method: Builds an annotated corpus from the Arabian Nights and El-Shamy's motif index containing 2,670 motif expressions of 200 motifs across 58,450 sentences; systematically evaluates five types of approaches: keyword retrieval with re-ranking, off-the-shelf embedding models, fine-tuned embedding models, N-shot prompting of large language models (LLMs), and LoRA fine-tuned LLMs. Result: The fine-tuned Llama3 model achieves the highest score of 0.85 F1 on motif expression detection, outperforming the other four types of approaches. Conclusion: This work provides the first scalable, high-performing computational framework for motif indexing, and shows that a high-quality annotated corpus combined with suitably adapted LLM fine-tuning is effective for deep semantic analysis of folkloristic texts. Abstract: Motifs are non-commonplace, recurring narrative elements, often found originally in folk stories. In addition to being of interest to folklorists, motifs appear as metaphoric devices in modern news, literature, propaganda, and other cultural texts. Finding expressions of motifs in the original folkloristic text is useful for both folkloristic analysis (motif indexing) as well as for understanding the modern usage of motifs (motif detection and interpretation). Prior work has primarily shown how difficult these problems are to tackle using automated techniques. We present the first computational approach to motif indexing. Our choice of data is a key enabler: we use a large, widely available text (the Arabian Nights) paired with a detailed motif index (by El-Shamy in 2006), which overcomes the common problem of inaccessibility of texts referred to by the index. We created a manually annotated corpus that identified 2,670 motif expressions of 200 different motifs across 58,450 sentences for training and testing. We tested five types of approaches for detecting motif expressions given a motif index entry: (1) classic retrieve and re-rank using keywords and a fine-tuned cross-encoder; (2) off-the-shelf embedding models; (3) fine-tuned embedding models; (4) generative prompting of off-the-shelf LLMs in N-shot setups; and (5) the same generative approaches on LLMs fine-tuned with LoRA. Our best performing system is a fine-tuned Llama3 model which achieves an overall performance of 0.85 F1.
[37] Automatic Analysis of Collaboration Through Human Conversational Data Resources: A Review
Yi Yu,Maria Boritchev,Chloé Clavel
Main category: cs.CL
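TL;DR: This paper reviews methods for collaboration analysis based on task-oriented conversational data, covering related theories, coding schemes, task types, and modeling approaches, with the aim of guiding how human-human conversational data can be used to automatically analyze collaborative processes.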
Details
Motivation: Conversation is the primary medium for information exchange and coordination in collaboration, so task-oriented conversational data is an important resource for analyzing collaborative processes. Method: Conducts a systematic review of task-oriented conversation resources used in collaboration analysis, including theoretical foundations, coding schemes, analysis tasks, and modeling approaches. Result: Identifies the key elements and existing methods of collaboration analysis and clarifies research paths for using conversational data in collaboration analysis. Conclusion: The review serves as a practical reference for collaboration analysis and points out underexplored research directions. Abstract: Collaboration is a task-oriented, high-level human behavior. In most cases, conversation serves as the primary medium for information exchange and coordination, making conversational data a valuable resource for the automatic analysis of collaborative processes. In this paper, we focus on verbal aspects of collaboration and conduct a review of collaboration analysis using task-oriented conversation resources, encompassing related theories, coding schemes, tasks, and modeling approaches. We aim to address the question of how to utilize task-oriented human-human conversational data for collaboration analysis. We hope our review will serve as a practical resource and illuminate unexplored areas for future collaboration analysis.
[38] LLM-MRD: LLM-Guided Multi-View Reasoning Distillation for Fake News Detection
Weilin Zhou,Shanwen Tan,Enhao Gu,Yurong Qian
Main category: cs.CL
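TL;DR: This paper proposes LLM-MRD, a framework that uses large language models (LLMs) to guide multi-view reasoning distillation for multimodal fake news detection, improving ACC and F1-Fake by an average of 5.19% and 6.33%, respectively.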
Details
Motivation: Existing methods for multimodal fake news detection suffer from insufficient multi-view judgment and fusion, and from the low reasoning efficiency and high computational cost of LLMs. Method: Proposes the LLM-Guided Multi-View Reasoning Distillation (LLM-MRD) teacher-student framework: the student module builds a multi-view reasoning foundation from textual, visual, and cross-modal perspectives; the teacher module generates deep reasoning chains as supervision signals; a calibration distillation mechanism efficiently transfers the complex reasoning knowledge to the lightweight student model. Result: Clearly outperforms baselines across multiple datasets, with average improvements of 5.19% in ACC and 6.33% in F1-Fake. Conclusion: LLM-MRD effectively mitigates insufficient multi-view fusion and inefficient LLM reasoning, achieving accurate and efficient multimodal fake news detection. Abstract: Multimodal fake news detection is crucial for mitigating societal disinformation. Existing approaches attempt to address this by fusing multimodal features or leveraging Large Language Models (LLMs) for advanced reasoning. However, these methods suffer from serious limitations, including a lack of comprehensive multi-view judgment and fusion, and prohibitive reasoning inefficiency due to the high computational costs of LLMs. To address these issues, we propose LLM-Guided Multi-View Reasoning Distillation for Fake News Detection (LLM-MRD), a novel teacher-student framework. The Student Multi-view Reasoning module first constructs a comprehensive foundation from textual, visual, and cross-modal perspectives. Then, the Teacher Multi-view Reasoning module generates deep reasoning chains as rich supervision signals. Our core Calibration Distillation mechanism efficiently distills this complex reasoning-derived knowledge into the efficient student model. Experiments show LLM-MRD significantly outperforms state-of-the-art baselines. Notably, it demonstrates a comprehensive average improvement of 5.19% in ACC and 6.33% in F1-Fake when evaluated across all competing methods and datasets. Our code is available at https://github.com/Nasuro55/LLM-MRD
[39] PrefPO: Pairwise Preference Prompt Optimization
Rahul Singhal,Pradyumna Tambwekar,Karime Maamari
Main category: cs.CL
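TL;DR: PrefPO is a lightweight, RLHF-inspired prompt optimization method that iteratively improves prompts using preference feedback from an LLM discriminator; it maintains strong performance without labeled data and substantially reduces prompt verbosity, repetition, and "prompt hacking".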
Details
Motivation: Existing automated prompt engineering methods rely on labeled data, produce verbose and repetitive prompts, and are prone to prompt hacking; a low-dependency, robust, concise, and efficient optimization approach is needed. Method: Proposes PrefPO, a preference-learning-based prompt optimization framework in which an LLM discriminator makes pairwise comparisons over outputs and an LLM optimizer iteratively updates the prompt based on that feedback; only a starting prompt and natural language criteria are required, with no labeled data or complex hyperparameter tuning. Result: On 9 BBH tasks and IFEval-Hard, PrefPO matches or exceeds SOTA methods such as GEPA, MIPRO, and TextGrad on 6/9 tasks; without labels, its performance is close to the labeled setting on 6/9 tasks; it reduces excess prompt length and repetitive content by 3-5x relative to existing methods; both LLM and human judges rate its prompts higher than TextGrad's; its prompt hacking rate is 37%, compared with 86% for TextGrad. Conclusion: PrefPO is an efficient, robust, low-dependency prompt optimization method that balances performance, conciseness, and alignment, offering a new approach to prompt optimization without labels. Abstract: Prompt engineering is effective but labor-intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference-based approach reduces the need for labeled data and hyperparameter tuning: only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG-Bench Hard (BBH) tasks and IFEval-Hard, a newly-curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3-5x. Furthermore, both LLM and human judges rate PrefPO's prompts higher than TextGrad's.
Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.
[40] Memory-Driven Role-Playing: Evaluation and Enhancement of Persona Knowledge Utilization in LLMs
Kai Wang,Haoyang You,Yang Zhang,Zhongjie Wang
Main category: cs.CL
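TL;DR: This paper proposes the Memory-Driven Role-Playing paradigm, inspired by Stanislavski's "emotional memory" theory, which treats persona knowledge as the LLM's internal memory that must be retrieved and applied based solely on dialogue context. It also builds the MREval evaluation framework, the MRPrompt prompting architecture, and the bilingual MRBench benchmark, showing that small models using this approach can match the role-playing ability of much larger models.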
Details
Motivation: Existing LLMs struggle to sustain persona consistency in long, open-ended dialogues and often need explicit cues to invoke persona knowledge; tests of how deeply models internalize and autonomously use persona knowledge are lacking. Method: Proposes the Memory-Driven Role-Playing paradigm, designs the MREval evaluation framework with four dimensions (Anchoring, Recalling, Bounding, Enacting), develops the MRPrompt structured prompting architecture, and builds the bilingual MRBench benchmark for fine-grained diagnosis. Result: Experiments show that MRPrompt lets small models (e.g., Qwen3-8B) match the role-playing performance of much larger models (e.g., Qwen3-Max and GLM-4.7), and confirm that upstream memory gains directly improve downstream response quality. Conclusion: The memory-driven paradigm provides a quantifiable, staged theoretical and practical framework for LLM role-playing, demonstrates the key role of internal memory mechanisms in persona consistency, and markedly improves the competitiveness of small and mid-sized models on this task. Abstract: A core challenge for faithful LLM role-playing is sustaining consistent characterization throughout long, open-ended dialogues, as models frequently fail to recall and accurately apply their designated persona knowledge without explicit cues. To tackle this, we propose the Memory-Driven Role-Playing paradigm. Inspired by Stanislavski's "emotional memory" acting theory, this paradigm frames persona knowledge as the LLM's internal memory store, requiring retrieval and application based solely on dialogue context, thereby providing a rigorous test of depth and autonomous use of knowledge. Centered on this paradigm, we contribute: (1) MREval, a fine-grained evaluation framework assessing four memory-driven abilities - Anchoring, Recalling, Bounding, and Enacting; (2) MRPrompt, a prompting architecture that guides structured memory retrieval and response generation; and (3) MRBench, a bilingual (Chinese/English) benchmark for fine-grained diagnosis. The novel paradigm provides a comprehensive diagnostic for four-staged role-playing abilities across 12 LLMs. Crucially, experiments show that MRPrompt allows small models (e.g., Qwen3-8B) to match the performance of much larger closed-source LLMs (e.g., Qwen3-Max and GLM-4.7), and confirms that upstream memory gains directly enhance downstream response quality, validating the staged theoretical foundation.
[41] Prompt-tuning with Attribute Guidance for Low-resource Entity Matching
Lihui Liu,Carl Yang
Main category: cs.CL
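TL;DR: This paper proposes PROMPTATTRIB, a low-resource entity matching method that combines attribute-level prompt tuning with fuzzy logic reasoning to improve accuracy and interpretability.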
Details
Motivation: Traditional entity matching relies on large amounts of labeled data, which is costly; existing prompt-tuning methods overlook attribute-level information and lack interpretability. Method: Proposes PROMPTATTRIB, which combines entity-level and attribute-level prompts, uses fuzzy logic formulas for inference, and optimizes soft prompts with dropout-based contrastive learning. Result: Experiments on multiple real-world datasets show that PROMPTATTRIB improves entity matching performance and interpretability in low-resource settings. Conclusion: Combining attribute-level prompts with logical reasoning is an effective way to improve the performance and interpretability of low-resource entity matching. Abstract: Entity Matching (EM) is an important task that determines the logical relationship between two entities, such as Same, Different, or Undecidable. Traditional EM approaches rely heavily on supervised learning, which requires large amounts of high-quality labeled data. This labeling process is both time-consuming and costly, limiting practical applicability. As a result, there is a strong need for low-resource EM methods that can perform well with minimal labeled data. Recent prompt-tuning approaches have shown promise for low-resource EM, but they mainly focus on entity-level matching and often overlook critical attribute-level information. In addition, these methods typically lack interpretability and explainability. To address these limitations, this paper introduces PROMPTATTRIB, a comprehensive solution that tackles EM through attribute-level prompt tuning and logical reasoning. PROMPTATTRIB uses both entity-level and attribute-level prompts to incorporate richer contextual information and employs fuzzy logic formulas to infer the final matching label. By explicitly considering attributes, the model gains a deeper understanding of the entities, resulting in more accurate matching. Furthermore, PROMPTATTRIB integrates dropout-based contrastive learning on soft prompts, inspired by SimCSE, which further boosts EM performance. Extensive experiments on real-world datasets demonstrate the effectiveness of PROMPTATTRIB.
[42] Scalable Prompt Routing via Fine-Grained Latent Task Discovery
Yunyi Zhang,Soji Adeshina,Patrick Guan,Ashwin Ganesh,Zhen Han,Vassilis N. Ioannidis,Huzefa Rangwala,George Karypis
Main category: cs.CL
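TL;DR: This paper proposes a two-stage prompt routing architecture that uses automated fine-grained task discovery and task-aware quality estimation to dynamically select the most suitable large language model for each query, surpassing the strongest individual model across 10 benchmarks at less than half its cost.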
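The abstract says that predictions from the two stages are aggregated at inference, but not how. The sketch below assumes a plain convex combination of a task-level and a prompt-level quality estimate, with an optional cost penalty; the function name `route`, the weights `lam` and `cost_weight`, and all numbers are hypothetical illustrations rather than the paper's method.

```python
def route(task_quality, prompt_quality, cost, lam=0.5, cost_weight=0.0):
    """Return the model name with the best blended quality estimate.

    task_quality:   model -> quality estimate for the prompt's discovered task type
    prompt_quality: model -> quality estimate for this specific prompt
    cost:           model -> relative cost of calling the model
    lam:            assumed weight on the task-level estimate (stability)
    cost_weight:    assumed penalty per unit of cost (0 ignores cost)
    """
    def score(model):
        blended = lam * task_quality[model] + (1.0 - lam) * prompt_quality[model]
        return blended - cost_weight * cost[model]

    return max(task_quality, key=score)


# Hypothetical estimates for two candidate models.
task_quality = {"model_a": 0.8, "model_b": 0.7}
prompt_quality = {"model_a": 0.5, "model_b": 0.9}
cost = {"model_a": 1.0, "model_b": 3.0}

print(route(task_quality, prompt_quality, cost))                   # quality only
print(route(task_quality, prompt_quality, cost, cost_weight=0.1))  # cost-aware
```

Raising `lam` leans on the task-level estimate, which changes little between similar prompts, while lowering it lets the prompt-specific estimate dominate; that is one simple way to express the stability versus adaptability balance the abstract describes.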
Details
Motivation: Existing methods struggle to route accurately over large model pools with narrow performance gaps: manually defined task taxonomies cannot capture fine-grained capability differences, and monolithic routers struggle to distinguish subtle differences across diverse tasks. Method: Uses a two-stage architecture: the first stage applies graph-based clustering to discover latent task types and trains a task classifier; the second stage uses a mixture-of-experts architecture with task-specific prediction heads for quality estimation; at inference, predictions from both stages are aggregated to balance task-level stability with prompt-specific adaptability. Result: Evaluated on 10 benchmarks with 11 frontier models, the method consistently outperforms existing baselines and surpasses the strongest individual model at less than half its cost. Conclusion: The proposed two-stage routing framework addresses fine-grained capability discrimination in large model pools and achieves a better performance-cost trade-off. Abstract: Prompt routing dynamically selects the most appropriate large language model from a pool of candidates for each query, optimizing performance while managing costs. As model pools scale to include dozens of frontier models with narrow performance gaps, existing approaches face significant challenges: manually defined task taxonomies cannot capture fine-grained capability distinctions, while monolithic routers struggle to differentiate subtle differences across diverse tasks. We propose a two-stage routing architecture that addresses these limitations through automated fine-grained task discovery and task-aware quality estimation. Our first stage employs graph-based clustering to discover latent task types and trains a classifier to assign prompts to discovered tasks. The second stage uses a mixture-of-experts architecture with task-specific prediction heads for specialized quality estimates. At inference, we aggregate predictions from both stages to balance task-level stability with prompt-specific adaptability. Evaluated on 10 benchmarks with 11 frontier models, our method consistently outperforms existing baselines and surpasses the strongest individual model while incurring less than half its cost.
[43] Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
Viliana Devbunova
Main category: cs.CL
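TL;DR: This paper questions the validity of using linear probes to detect evaluation awareness in large language models, finding that probes mainly capture structural features of benchmark tasks rather than genuine awareness of evaluation context.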
Details
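Motivation: Prior work applies linear probes to benchmark prompts to infer whether LLMs are evaluation-aware, but evaluation context is often entangled with benchmark format and genre, making it hard to tell whether a model recognizes the evaluation context or merely responds to surface structure. Method: Builds a controlled 2x2 dataset with diagnostic rewrites that partially control prompt format, and tests whether linear probe signals still generalize to free-form prompts. Result: Probe signals mainly depend on benchmark-canonical structure and fail to generalize to free-form prompts independent of linguistic style. Conclusion: Standard linear probe methods cannot reliably separate evaluation context from structural artifacts, which weakens the evidential strength of existing conclusions. Abstract: Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, it is unclear whether probe-based signals reflect context or surface structure. We test whether these signals persist under partial control of prompt format using a controlled 2x2 dataset and diagnostic rewrites. We find that probes primarily track benchmark-canonical structure and fail to generalize to free-form prompts independent of linguistic style. Thus, standard probe-based methodologies do not reliably disentangle evaluation context from structural artifacts, limiting the evidential strength of existing results.
[44] Vocabulary shapes cross-lingual variation of word-order learnability in language models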
Jonas Mayer Martins,Jaap Jumelet,Viola Priesemann,Lisa Beinborn
Main category: cs.CL
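TL;DR: By pretraining transformer language models on synthetic word-order variants, this paper finds that the structure of the word and subword vocabulary, rather than a simple free/fixed word-order classification, is the key factor in cross-lingual word-order learnability.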
Details
Motivation: To investigate why some languages (such as Czech) permit free word order while others (such as English) do not. Method: Pretrains transformer language models on a spectrum of synthetic word-order variants of natural languages and measures model surprisal to assess word-order learnability. Result: Greater word-order irregularity consistently raises model surprisal, indicating reduced learnability; sentence reversal has only a weak effect; a coarse free/fixed word-order distinction does not explain cross-lingual variation; the structure of the word and subword vocabulary strongly predicts model surprisal. Conclusion: Vocabulary structure is a key driver of computational word-order learnability across languages. Abstract: Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction of free- (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts the model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
[45] Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas
Víctor Gallego
Main category: cs.CL
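TL;DR: This paper proposes a method that uses a large language model (LLM) to iteratively generate programmatic policies for multi-agent environments, evaluating and refining them through self-play and feedback. Experiments show that, compared with sparse feedback that provides only a scalar reward, dense feedback that adds social metrics (efficiency, equality, sustainability, peace) guides the LLM toward cooperative strategies more effectively, and the work highlights an inherent tension between expressiveness and safety in LLM policy synthesis.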
Details
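Motivation: Training neural policies with traditional reinforcement learning in multi-agent social dilemmas suffers from low sample efficiency and poor interpretability, whereas humans often rely on social norms to coordinate in similar tasks; the author therefore explores using an LLM to directly generate interpretable, debuggable programmatic policies and studies how to design feedback that steers the LLM toward cooperative behavior consistent with social values. Method: Proposes an LLM policy synthesis framework: the LLM generates Python policy functions, which are executed and evaluated in self-play, and the LLM is then iteratively prompted to improve the policy based on feedback (sparse: scalar reward; dense: reward plus social metrics); experiments use Claude Sonnet 4.6 and Gemini 3.1 Pro on two Sequential Social Dilemmas, Gathering and Cleanup, and include an adversarial analysis. Result: Dense feedback matches or exceeds sparse feedback on all metrics, with the largest advantage in Cleanup, where it helps calibrate the cleaning-harvesting tradeoff; social metrics do not trigger over-optimization of fairness but instead act as a coordination signal that promotes effective cooperative strategies such as territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression; five classes of LLM reward hacking attacks are identified and mitigations are discussed. Conclusion: Social metrics as structured feedback can improve the quality of LLM-generated policies in multi-agent cooperation, but LLM policy synthesis requires a careful balance between expressiveness and safety; the approach offers a path toward AI agents aligned with human values. Abstract: We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments.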
We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at https://github.com/vicgalle/llm-policies-social-dilemmas.
[46] Inducing Sustained Creativity and Diversity in Large Language Models
Queenie Luo,Gary King,Michael Puett,Michael D. Smith
Main category: cs.CL
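TL;DR: This paper proposes a new decoding scheme that strengthens the sustained creativity and diversity of large language model (LLM) outputs in exploratory search tasks (such as finding the ideal wedding dress, a research topic, or a startup idea), overcoming the tendency of existing methods to repeat themselves and explore shallowly.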
Details
Motivation: Current LLMs fall short in exploratory "search quests": common decoding methods favor homogeneous, conventional outputs and cannot support users' long, diverse, in-depth exploration; existing diversity-enhancing methods repeat themselves too early or lack user-specific creativity. Method: Designs a novel, easy-to-implement decoding scheme that, without access to the LLM's internal vector space, induces the model to keep generating conceptually unique, diverse results in whatever quantity is desired, going beyond modal decoding paths. Result: The method broadens the range of knowledge the LLM draws on in exploratory search (both orthodox and heterodox), letting users explore the search space more quickly and find satisfying answers. Conclusion: The proposed decoding scheme fills a gap in LLM capability for long-running, highly creative exploratory tasks and offers more practical, adaptable AI support for exploratory search. Abstract: We address a not-widely-recognized subset of exploratory search, where a user sets out on a typically long "search quest" for the perfect wedding dress, overlooked research topic, killer company idea, etc. The first few outputs of current large language models (LLMs) may be helpful but only as a start, since the quest requires learning the search space and evaluating many diverse and creative alternatives along the way. Although LLMs encode an impressive fraction of the world's knowledge, common decoding methods are narrowly optimized for prompts with correct answers and thus return mostly homogeneous and conventional results. Other approaches, including those designed to increase diversity across a small set of answers, start to repeat themselves long before search quest users learn enough to make final choices, or offer a uniform type of "creativity" to every user asking similar questions. We develop a novel, easy-to-implement decoding scheme that induces sustained creativity and diversity in LLMs, producing as many conceptually unique results as desired, even without access to the inner workings of an LLM's vector space. The algorithm unlocks an LLM's vast knowledge, both orthodox and heterodox, well beyond modal decoding paths. With this approach, search quest users can more quickly explore the search space and find satisfying answers.
[47] EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models
J. Ben Tamo,Yuxing Lu,Benoit L. Marteau,Micky C. Nnamdi,May D. Wang
Main category: cs.CL
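TL;DR: This paper proposes EvidenceRL, a reinforcement learning framework that reduces hallucination in large language models by enforcing evidence adherence during training, and validates it in two high-stakes domains: cardiac diagnosis and legal reasoning.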
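As a rough illustration of how a grounding score and a correctness score could feed Group Relative Policy Optimization (GRPO), the sketch below blends the two into a scalar reward and standardizes rewards within a group of responses sampled for the same prompt. The equal weighting `w_ground=0.5`, the use of the population standard deviation, and the example scores are assumptions for illustration; the abstract does not specify how EvidenceRL combines the two signals.

```python
import statistics


def combined_reward(grounding, correctness, w_ground=0.5):
    """Blend grounding and correctness scores (both assumed to lie in [0, 1])."""
    return w_ground * grounding + (1.0 - w_ground) * correctness


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so a response
    is rewarded only for being better than its siblings for the same prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical (grounding, correctness) scores for four sampled responses.
candidates = [(0.9, 1.0), (0.2, 1.0), (0.8, 0.0), (0.1, 0.0)]
rewards = [combined_reward(g, c) for g, c in candidates]
advantages = group_relative_advantages(rewards)

for (g, c), adv in zip(candidates, advantages):
    print(f"grounding={g:.1f} correctness={c:.1f} advantage={adv:+.2f}")
```

Under this weighting, the second response (correct but poorly grounded) receives a smaller advantage than the first (correct and well grounded), which is the kind of pressure toward evidence adherence the framework is described as applying.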
Details
Motivation: LLMs are fluent but prone to hallucination, generating answers that seem plausible yet lack evidential support, which is especially serious in high-stakes domains where decisions must be backed by verifiable information. Method: EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers), and optimizes the generator using Group Relative Policy Optimization (GRPO). Result: On cardiac diagnosis, F1@3 for Llama-3.2-3B rises from 37.0 to 54.5 and the grounding metric G_max@3 from 47.6 to 78.2, hallucinations drop nearly 5×, and the rate of evidence-supported diagnoses rises from 31.8% to 61.6%; on legal reasoning, Faithfulness for Llama-3.1-8B rises from 32.8% to 67.6%. Conclusion: EvidenceRL substantially improves evidence adherence and answer faithfulness without sacrificing task accuracy across the two high-stakes domains evaluated. Abstract: Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce EvidenceRL, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding ($G_{\max}@3$) rises from 47.6 to 78.2; hallucinations drop nearly 5× and evidence-supported diagnoses increase from 31.8% to 61.6%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8% to 67.6% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at https://github.com/Wizaaard/EvidenceRL.git.
[48] FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment
Betty Xiong,Jillian Fisher,Benjamin Newman,Meng Hu,Shivangi Gupta,Yejin Choi,Lanyan Fang,Russ B Altman
Main category: cs.CL
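TL;DR: This paper introduces FDARxBench, an expert-curated question-answering benchmark built on real-world FDA drug label documents, for evaluating language models' document-grounded QA ability in drug assessment settings.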
Details
Motivation: Current language models struggle to answer questions accurately over drug label documents, which contain rich but heterogeneous clinical and regulatory information, so a high-quality benchmark targeted at drug assessment needs is required. Method: In collaboration with FDA regulatory assessors, builds a multi-stage pipeline that generates high-quality, expert-curated QA examples spanning factual, multi-hop, and refusal tasks, and designs evaluation protocols for open-book and closed-book reasoning. Result: Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. Conclusion: FDARxBench serves FDA generic drug assessment needs, provides a foundation for regulatory-grade evaluation of drug label comprehension, and supports systematic evaluation of LLM behavior on drug-label questions. Abstract: We introduce an expert curated, real-world benchmark for evaluating document-grounded question-answering (QA) motivated by generic drug assessment, using the U.S. Food and Drug Administration (FDA) drug label documents. Drug labels contain rich but heterogeneous clinical and regulatory information, making accurate question answering difficult for current language models. In collaboration with FDA regulatory assessors, we introduce FDARxBench, and construct a multi-stage pipeline for generating high-quality, expert curated, QA examples spanning factual, multi-hop, and refusal tasks, and design evaluation protocols to assess both open-book and closed-book reasoning. Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. While motivated by FDA generic drug assessment needs, this benchmark also provides a substantial foundation for challenging regulatory-grade evaluation of label comprehension. The benchmark is designed to support evaluation of LLM behavior on drug-label questions.
[49] TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?
Xinyu Guo,Yazhou Zhang,Jing Qin
Main category: cs.CL
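TL;DR: This paper introduces TextReasoningBench, a benchmark that systematically evaluates the effectiveness and efficiency of multiple reasoning strategies for text classification with large language models, finding that reasoning does not universally improve performance and often comes with high token and time costs.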
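The abstract names two cost-aware metrics (performance gain per reasoning token, and efficiency of improvement relative to token cost growth) without giving formulas. The sketch below shows one plausible reading of each; the function names, the per-1,000-token scaling, the ratio-of-relative-changes definition, and the example numbers are all assumptions, not the benchmark's definitions.

```python
def gain_per_1k_tokens(acc, base_acc, tokens, base_tokens):
    """Accuracy gain over the baseline per 1,000 additional tokens (assumed definition)."""
    extra_tokens = tokens - base_tokens
    if extra_tokens <= 0:
        raise ValueError("strategy must use more tokens than the baseline")
    return 1000.0 * (acc - base_acc) / extra_tokens


def relative_efficiency(acc, base_acc, tokens, base_tokens):
    """Relative accuracy gain divided by relative token growth (assumed definition)."""
    token_growth = (tokens - base_tokens) / base_tokens
    if token_growth <= 0:
        raise ValueError("strategy must use more tokens than the baseline")
    return ((acc - base_acc) / base_acc) / token_growth


# Hypothetical numbers: direct IO prompting versus a CoT-style strategy.
io_acc, io_tokens = 0.80, 50
cot_acc, cot_tokens = 0.82, 500

print(gain_per_1k_tokens(cot_acc, io_acc, cot_tokens, io_tokens))
print(relative_efficiency(cot_acc, io_acc, cot_tokens, io_tokens))
```

With these made-up inputs, a 2-point accuracy gain bought with a tenfold token budget yields a relative efficiency well below 1, which is the pattern of marginal gains at high token cost that the paper reports for many strategies.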
Details
Motivation: To examine whether explicit reasoning strategies actually benefit text classification, filling a gap in the evaluation of their effectiveness and efficiency. Method: Builds the TextReasoningBench benchmark and compares 7 reasoning strategies (IO, CoT, SC-CoT, ToT, GoT, BoC, long-CoT) across 10 LLMs and 5 text classification datasets, introducing two cost-aware metrics: the performance gain per reasoning token, and the efficiency of performance improvement relative to token cost growth. Result: (1) Reasoning does not universally improve classification performance: moderate strategies (such as CoT and SC-CoT) yield small, consistent gains (+1% to +3%), while complex strategies (such as ToT and GoT) often fail to beat simpler baselines and can even degrade performance; (2) reasoning is often inefficient: some strategies increase token consumption by 10-100× while yielding only marginal gains. Conclusion: Explicit reasoning strategies are not always beneficial for text classification; their performance gains must be weighed against computational cost, and complex reasoning methods should not be applied indiscriminately. Abstract: Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro-F1, we introduce two cost-aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth.
Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC-CoT yield consistent but limited gains (typically +1% to +3% on big models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10× to 100× (e.g., SC-CoT and ToT) while providing only marginal performance improvements.
[50] BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection
Zhengpei Hu,Kai Li,Dapeng Fu,Chang Zeng,Yue Li,Yuanhao Tang,Jianqiang Huang
Main category: cs.CL
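TL;DR: BEAVER is a training-free long-context compression framework that replaces linear token removal with structure-aware hierarchical selection, substantially improving inference efficiency while preserving semantic integrity.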
Details
Motivation: Larger LLM context windows improve long-document understanding but also cause high inference latency and poor information utilization; existing compression methods suffer from high training costs or semantic fragmentation.
Method: Proposes the BEAVER framework: dual-path pooling maps variable-length contexts into dense page-level tensors to maximize hardware parallelism, and a hybrid planner combining semantic and lexical dual-branch selection with sentence smoothing preserves discourse integrity.
Result: Matches SOTA methods (e.g., LongLLMLingua) on four long-context benchmarks; maintains high fidelity on the RULER multi-needle retrieval task; reduces latency by 26.4x on 128k contexts.
Conclusion: BEAVER is an efficient, scalable, training-free long-context compression approach that balances performance, fidelity, and inference speed.
Abstract: The exponential expansion of context windows in LLMs has unlocked capabilities for long-document understanding but introduced severe bottlenecks in inference latency and information utilization. Existing compression methods often suffer from high training costs or semantic fragmentation due to aggressive token pruning. In this paper, we propose BEAVER, a novel training-free framework that shifts compression from linear token removal to structure-aware hierarchical selection. BEAVER maximizes hardware parallelism by mapping variable-length contexts into dense page-level tensors via dual-path pooling, and preserves discourse integrity through a hybrid planner combining semantic and lexical dual-branch selection with sentence smoothing. Extensive evaluations on four long-context benchmarks demonstrate that BEAVER achieves comparable performance to state-of-the-art (SOTA) methods like LongLLMLingua. Notably, on the RULER benchmark, BEAVER maintains high fidelity in multi-needle retrieval where baselines deteriorate. Regarding efficiency, BEAVER reduces latency by 26.4x on 128k contexts, offering a scalable solution for high-throughput applications. Our code is available at https://cslikai.cn/BEAVER/.
[51] Structured Prompting for Arabic Essay Proficiency: A Trait-Centric Evaluation Approach
Salim Al Mandhari,Hieu Pham Dinh,Mo El-Haj,Paul Rayson
Main category: cs.CL
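TL;DR: This paper proposes a prompt engineering framework for trait-specific Automatic Essay Scoring (AES) in Arabic, using large language models (LLMs) in zero-shot and few-shot settings to assess proficiency traits such as organization, vocabulary, development, and style; a three-tier prompting strategy (standard, hybrid, rubric-guided) improves scoring agreement, especially on discourse-level traits such as development and style; Fanar-1-9B-Instruct with rubric-guided prompting performs best (QWK = 0.28), supporting the view that structured prompting matters more than model scale alone.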
Details
Motivation: Addresses the scarcity of Arabic AES tools and the lack of scalable, linguistically informed evaluation methods.
Method: Proposes a three-tier prompting strategy: standard prompting, hybrid prompting that simulates multi-agent review (one specialist per trait), and rubric-guided prompting that incorporates scored exemplars; evaluates 8 LLMs in zero-shot and few-shot settings on QAES, the first publicly available Arabic AES dataset.
Result: Fanar-1-9B-Instruct achieves the highest trait-level agreement in both zero-shot and few-shot settings (QWK = 0.28, CI = 0.41); rubric-guided prompting yields consistent gains across all models and traits, with Development and Style improving the most.
Conclusion: Structured prompt design, rather than model scale alone, is key to improving Arabic AES performance; this study builds the first comprehensive framework for proficiency-oriented Arabic AES, providing a scalable assessment foundation for low-resource educational settings.
Abstract: This paper presents a novel prompt engineering framework for trait-specific Automatic Essay Scoring (AES) in Arabic, leveraging large language models (LLMs) under zero-shot and few-shot configurations. Addressing the scarcity of scalable, linguistically informed AES tools for Arabic, we introduce a three-tier prompting strategy (standard, hybrid, and rubric-guided) that guides LLMs in evaluating distinct language proficiency traits such as organization, vocabulary, development, and style. The hybrid approach simulates multi-agent evaluation with trait specialist raters, while the rubric-guided method incorporates scored exemplars to enhance model alignment. In zero-shot and few-shot settings, we evaluate eight LLMs on the QAES dataset, the first publicly available Arabic AES resource with trait-level annotations. Experimental results using Quadratic Weighted Kappa (QWK) and Confidence Intervals show that Fanar-1-9B-Instruct achieves the highest trait-level agreement in both zero-shot and few-shot prompting (QWK = 0.28 and CI = 0.41), with rubric-guided prompting yielding consistent gains across all traits and models. Discourse-level traits such as Development and Style showed the greatest improvements. These findings confirm that structured prompting, not model scale alone, enables effective AES in Arabic. Our study presents the first comprehensive framework for proficiency-oriented Arabic AES and sets the foundation for scalable assessment in low-resource educational contexts.
[52] DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs
Xuan Qi,Luxi He,Dan Roth,Xingyu Fu
Main category: cs.CL
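TL;DR: This paper proposes DATAPROPHET, a training-free data selection metric that predicts the influence of supervision data on target benchmarks in multimodal large language models (MLLMs); experiments show it correlates strongly with actual performance gains (Kendall's tau of 86.0%) and outperforms uniform selection, an existing training-based baseline, and even experiment-based oracle selection.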
Details
Motivation: Supervision data is traditionally selected by task similarity (e.g., text-intensive or vision-centric), but whether this reliably predicts downstream gains is unclear; this paper aims to estimate each training dataset's influence on a target benchmark before training.
Method: Analyzes transfer across 14 vision-language datasets spanning 7 task types and finds intuitive task similarity unreliable; on that basis, proposes the training-free DATAPROPHET metric, which combines multimodal perplexity, similarity, and data diversity.
Result: DATAPROPHET's rankings agree closely with rankings by actual post-training gains (Kendall's tau = 86.0%); for supervision-data selection, it improves over uniform selection by up to 6.9%, beats a SOTA training-based baseline by 1.4%, and even exceeds oracle selection based on experimental results by 0.2%.
Conclusion: Dataset-level properties determine transfer more than task-level categories; DATAPROPHET offers an efficient, training-free, and practical approach to data selection.
Abstract: Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed? To investigate this question, we conduct an in-depth analysis of transfer across 14 vision-language datasets spanning 7 diverse tasks. Our results show that intuitive task similarity is an unreliable predictor of transferability, and that generalization depends more on the specific dataset than on its broad task category. Motivated by this finding, we propose DATAPROPHET, a simple and effective training-free metric that combines multimodal perplexity, similarity, and data diversity. Experiments show that DATAPROPHET produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall's tau of 86.0%. Moreover, DATAPROPHET enables better supervision-data selection, yielding up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance. Our code and data will be released.
[53] EvoTaxo: Building and Evolving Taxonomy from Social Media Streams
Yiyang Li,Tianyi Ma,Yanfang Ye
Main category: cs.CL
TL;DR: EvoTaxo is an LLM-based framework for building and evolving taxonomies from temporal social media streams, using structured draft actions, dual-view clustering, and concept memory to improve robustness, scalability, and sensitivity to discourse shifts.
Details
Motivation: Existing taxonomy induction methods are ill-suited for short, noisy, semantically entangled, and temporally dynamic social media posts, struggling to balance robustness, scalability, and sensitivity to evolving discourse.
Method: EvoTaxo converts each post into a structured draft action over the current taxonomy, accumulates structural evidence over time windows, applies dual-view clustering (semantic similarity + temporal locality) to consolidate candidate edits, refines and arbitrates edits before execution, and maintains a concept memory bank per node to preserve semantic boundaries.
Result: Experiments on two Reddit corpora show EvoTaxo produces more balanced taxonomies with clearer post-to-leaf assignment, better corpus coverage at comparable size, and stronger structural quality; a case study on /r/ICE_Raids confirms its ability to capture meaningful temporal discourse shifts.
Conclusion: EvoTaxo effectively addresses the challenges of taxonomy induction from dynamic social media data by integrating LLM-driven structuring, temporal-aware clustering, and persistent concept memory, outperforming static baselines.
Abstract: Constructing taxonomies from social media corpora is challenging because posts are short, noisy, semantically entangled, and temporally dynamic. Existing taxonomy induction methods are largely designed for static corpora and often struggle to balance robustness, scalability, and sensitivity to evolving discourse. We propose EvoTaxo, an LLM-based framework for building and evolving taxonomies from temporally ordered social media streams. Rather than clustering raw posts directly, EvoTaxo converts each post into a structured draft action over the current taxonomy, accumulates structural evidence over time windows, and consolidates candidate edits through dual-view clustering that combines semantic similarity with temporal locality.
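How the two views are combined is not specified in this entry, so the following is only a rough sketch of one possible reading: a weighted blend of cosine similarity and an exponential time-decay term, followed by greedy grouping. The weights `alpha` and `tau`, the `threshold`, the edit record layout, and both function names are assumptions for illustration, not EvoTaxo's actual procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def dual_view_affinity(a, b, alpha=0.5, tau=24.0):
    """Blend semantic similarity with temporal closeness (assumed weighting)."""
    semantic = cosine(a["emb"], b["emb"])
    temporal = math.exp(-abs(a["hour"] - b["hour"]) / tau)
    return alpha * semantic + (1.0 - alpha) * temporal


def group_edits(edits, threshold=0.8):
    """Greedy grouping: each edit joins the first existing cluster that has a
    member with affinity >= threshold; otherwise it starts a new cluster."""
    clusters = []
    for edit in edits:
        for cluster in clusters:
            if any(dual_view_affinity(edit, other) >= threshold for other in cluster):
                cluster.append(edit)
                break
        else:
            clusters.append([edit])
    return clusters


# Toy candidate edits with 2-d embeddings and timestamps in hours (hypothetical data).
edits = [
    {"id": "e1", "emb": [1.0, 0.0], "hour": 0.0},
    {"id": "e2", "emb": [0.9, 0.1], "hour": 2.0},    # similar and close in time to e1
    {"id": "e3", "emb": [1.0, 0.0], "hour": 200.0},  # similar but far away in time
    {"id": "e4", "emb": [0.0, 1.0], "hour": 1.0},    # close in time but dissimilar
]
clusters = group_edits(edits)
cluster_ids = [[e["id"] for e in c] for c in clusters]
```

In this toy setup only `e1` and `e2` end up together: `e3` is semantically identical to `e1` but too distant in time, and `e4` is close in time but semantically unrelated, which is the behaviour a dual-view criterion is meant to produce.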
A refinement-and-arbitration procedure then selects reliable edits before execution, while each node maintains a concept memory bank to preserve semantic boundaries over time. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment, better corpus coverage at comparable taxonomy size, and stronger structural quality. A case study on the Reddit community /r/ICE_Raids further shows that EvoTaxo captures meaningful temporal shifts in discourse. Our codebase is available here.
[54] TAB-AUDIT: Detecting AI-Fabricated Scientific Tables via Multi-View Likelihood Mismatch
Shuo Huang,Yan Pen,Lizhen Qu
Main category: cs.CL
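TL;DR: This paper presents TAB-AUDIT, the first systematic method for detecting scientific tables in AI-generated fabricated NLP papers, and builds FabTab, the first benchmark dataset for the task; it achieves high-accuracy detection using features such as the inconsistency between a table's internal structure and its numerical content.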
Details
Motivation: AI-generated fabricated papers seriously threaten academic integrity, yet experimental tables, which serve as key evidence, have not been systematically studied as a signal for detecting fabrication.
Method: Builds the FabTab benchmark (1,173 AI-generated and 1,215 human-authored NLP papers), analyzes systematic differences between fabricated and real tables, and proposes the TAB-AUDIT framework, which extracts discriminative features including 'within-table mismatch' (the perplexity gap between a table's skeleton and its numerical content) and uses a RandomForest classifier for detection.
Result: Achieves 0.987 AUROC in-domain and 0.883 AUROC out-of-domain, significantly outperforming prior state-of-the-art methods.
Conclusion: Experimental tables are a critical forensic signal for detecting AI-generated scientific fraud; this work provides a new benchmark and an effective detection framework to support follow-up research.
Abstract: AI-generated fabricated scientific manuscripts raise growing concerns with large-scale breaches of academic integrity. In this work, we present the first systematic study on detecting AI-generated fabricated scientific tables in empirical NLP papers, as information in tables serves as critical evidence for claims. We construct FabTab, the first benchmark dataset of fabricated manuscripts with tables, comprising 1,173 AI-generated papers and 1,215 human-authored ones in empirical NLP. Through a comprehensive analysis, we identify systematic differences between fabricated and real tables and operationalize them into a set of discriminative features within the TAB-AUDIT framework. The key feature, within-table mismatch, captures the perplexity gap between a table's skeleton and its numerical content. Experimental results show that RandomForest built on these features significantly outperforms prior state-of-the-art methods, achieving 0.987 AUROC in-domain and 0.883 AUROC out-of-domain. Our findings highlight experimental tables as a critical forensic signal for detecting AI-generated scientific fraud and provide a new benchmark for future research.
[55] LoopRPT: Reinforcement Pre-Training for Looped Language Models
Guo Tang,Shixin Jiang,Heng Chang,Nuo Chen,Yuhan Li,Huiming Fan,Jia Li,Ming Liu,Bing Qin
Main category: cs.CL
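TL;DR: This paper proposes LoopRPT, a reinforcement pre-training framework designed for looped language models (LoopLMs), which reframes next-token prediction as a next-token reasoning task and assigns reinforcement signals directly to latent steps, improving intermediate representation quality and reasoning efficiency.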
Details
Motivation: Existing reinforcement learning paradigms mainly target output tokens, creating a structural mismatch with the implicitly unfolding reasoning of LoopLMs.
Method: Proposes the LoopRPT framework, which reframes next-token prediction as a next-token reasoning task and uses an EMA teacher reference and noisy latent rollouts to assign reinforcement signals directly to latent steps.
Result: Validated on the Ouro architecture, LoopRPT significantly improves per-step representation quality and achieves Pareto dominance in accuracy-computation trade-offs; gains are especially large on hard tokens, indicating that it strengthens early-stage reasoning rather than merely encouraging premature exits.
Conclusion: Reinforcement pre-training is a principled paradigm for guiding LoopLMs to learn efficient latent reasoning.
Abstract: Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
[56] PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction
Runsong Zhao,Shilei Liu,Jiwei Tang,Langming Liu,Haibin Chen,Weidong Zhang,Yujin Yuan,Tong Xiao,Jingbo Zhu,Wenbo Su,Bo Zheng
Main category: cs.CL
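TL;DR: This paper proposes Performance-oriented Context Compression (PoC), a new paradigm in which developers specify an acceptable performance floor rather than a fixed compression ratio, and a lightweight performance predictor automatically selects the most aggressive compression ratio that satisfies the constraint.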
Details
Motivation: Existing context compression methods that target a compression ratio or length suffer from unpredictable performance degradation, making reliable deployment difficult.
Method: Proposes Performance-oriented Context Compression (PoC), designs two performance predictor variants (context-agnostic and context-aware), and pairs them with an off-the-shelf compressor for automatic compression-ratio selection.
Result: On question-answering and summarization tasks, the context-aware predictor has lower performance prediction error, and the corresponding PoC achieves better overall performance.
Conclusion: PoC offers a more reliable, efficient, and performance-controllable path to deploying LLM context compression.
Abstract: While context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, hindering their reliable deployment. We introduce a paradigm shift to Performance-oriented Context Compression (PoC), where developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint before steering an off-the-shelf compressor. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that considers the input's inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor consistently achieves lower performance prediction error than the context-agnostic predictor, while the resulting context-aware PoC attains a superior overall performance. Our work paves the way for a more reliable, efficient, and performance-aware deployment of context compression for LLMs.
[57] Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking
Tomas Ruiz,Tanalp Agustoslu,Carsten Schwemmer
Main category: cs.CL
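TL;DR: This paper proposes an evaluation protocol for multimodal large language models (MLLMs) that accounts for human label variation (HLV), finding that benchmarks relying only on consensus labels may overestimate model capabilities, while incorporating annotation variation yields more realistic and robust evaluation.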
Details
Motivation: Human label variation (HLV) is overlooked in current LLM benchmarks, yet it is crucial for assessing real model performance on subjective tasks such as content moderation.
Method: Proposes a new evaluation protocol that distinguishes high-agreement from high-disagreement subsets of human annotations, and validates it on two state-of-the-art MLLM families, Gemma 3 and Qwen 2.5 VL, using non-aggregated social media classification data.
Result: Larger models perform better on high-agreement subsets but often underperform medium-sized models on high-disagreement subsets, indicating that parameter count alone does not determine sensitivity to ambiguity and subjectivity.
Conclusion: Benchmarks based only on consensus labels overestimate model capabilities on subjective tasks; incorporating human label variation improves the reliability and practical value of MLLM evaluation in real-world settings such as content moderation.
Abstract: Human Label Variation (HLV), i.e. systematic differences among annotators' judgments, remains underexplored in benchmarks despite rapid progress in large language model (LLM) development. We address this gap by introducing an evaluation protocol for multimodal large language model (MLLM) benchmarking that explicitly accounts for two conditions: (1) human label agreement and (2) disagreement. We apply this protocol to two state-of-the-art MLLM families (Gemma 3, Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset. Across tasks, we find that larger models tend to perform best on high-agreement subsets, yet often underperform medium-sized models when human disagreement is high, indicating that parameter count alone does not determine sensitivity to ambiguity and subjectivity. These results show that benchmarks based solely on consensus labels can overstate model capabilities in such domains and that incorporating human label variation yields more realistic and robust assessments of MLLMs in content moderation pipelines.
[58] Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders
Debajyoti Mazumder,Divyansh Pathak,Prashant Kodali,Jasabanta Patro
Main category: cs.CL
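TL;DR: This paper studies how multilingual encoders internally represent code-mixed input (e.g., Hindi-English) and finds that standard models connect code-mixed text only weakly to either constituent language; introducing a trilingual post-training alignment objective significantly improves cross-lingual understanding and downstream task performance.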
Details
Motivation: Little is known about how multilingual encoders represent code-mixed input and how those representations relate to the constituent languages, so their internal representation mechanisms need closer study.
Method: Builds a parallel trilingual corpus of English, Hindi (Devanagari), and Romanized code-mixed sentences; probes cross-lingual representation alignment in standard and code-mixed adapted models using CKA, token-level saliency, and entropy-based uncertainty analysis; and proposes a trilingual post-training alignment objective.
Result: Standard models align English and Hindi well, but code-mixed representations remain loosely connected to either language; continued pre-training improves English-code-mixed alignment at the cost of English-Hindi alignment; models process code-mixed text through an English-dominant semantic subspace, while Devanagari Hindi provides complementary signals that reduce uncertainty; the proposed trilingual alignment method yields more balanced alignment and improves sentiment analysis and hate speech detection performance.
Conclusion: Anchoring code-mixed representations in both constituent languages effectively strengthens cross-lingual understanding, offering both a rationale and a practical method for handling code-mixing in multilingual models.
Abstract: Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally, or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via CKA, token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language, and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection, showing that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.
[59] FrameNet Semantic Role Classification by Analogy
Van-Duy Ngo,Stergos Afantenos,Emiliano Lorini,Miguel Couceiro
Main category: cs.CL
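TL;DR: This paper proposes a new analogy-based approach to semantic role classification, modeling FrameNet semantic role classification as a binary relation over LU-FE pairs and building a new dataset; a lightweight ANN is trained without semantic role labels, and roles are recovered at inference through random sampling and analogical transfer, giving an efficient method that surpasses the previous state of the art.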
Details
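Motivation: Existing semantic role classification methods usually rely on extensive annotation and complex models, making it hard to balance performance and efficiency; this paper takes a relational-analogy view to explore a simpler, more interpretable, and more efficient alternative.
Method: Defines analogies as formal relations over pairs of frame-evoking lexical units (LUs) and frame elements (FEs) and builds a binary relation dataset (positive: the FEs share the same semantic role; negative: otherwise); trains a lightweight ANN for binary classification with no semantic role labels as input; at inference, recovers roles by computing probability distributions over all candidate semantic roles within a given frame through random sampling and analogical transfer.
Result: The method surpasses the previous best results on FrameNet semantic role classification while converging quickly, using very few parameters, and remaining computationally efficient and frugal.
Conclusion: An unsupervised-style mechanism for recovering semantic roles through relational analogy is feasible and effective, offering a new paradigm for semantic role labeling that combines strong performance with model simplicity.
Abstract: In this paper, we adopt a relational view of analogies applied to Semantic Role Classification in FrameNet. We define analogies as formal relations over the Cartesian product of frame-evoking lexical units (LUs) and frame element (FEs) pairs, which we use to construct a new dataset. Each element of this binary relation is labelled as a valid analogical instance if the frame elements share the same semantic role, or as invalid otherwise. This formulation allows us to transform Semantic Role Classification into binary classification and train a lightweight Artificial Neural Network (ANN) that exhibits rapid convergence with minimal parameters. Unconventionally, no Semantic Role information is introduced to the neural network during training. We recover semantic roles during inference by computing probability distributions over candidates of all semantic roles within a given frame through random sampling and analogical transfer. This approach allows us to surpass previous state-of-the-art results while maintaining computational efficiency and frugality.
[60] Semantic Delta: An Interpretable Signal Differentiating Human and LLMs Dialogue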
Riccardo Scantamburlo,Mauro Mezzanzana,Giacomo Buonanno,Francesco Bertolotti
Main category: cs.CL
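TL;DR: This paper proposes a lightweight, interpretable statistical feature based on the distribution of semantic categories (semantic delta) for distinguishing human dialogue from dialogue generated by large language models (LLMs); experiments show that LLM texts have significantly higher semantic delta, reflecting a more concentrated, rigid topic structure, whereas human dialogue has a more balanced and broader semantic distribution; the metric can serve as a zero-shot, computationally inexpensive complementary signal for detection systems.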
Details
Motivation: To explore whether LLMs talk like humans and to understand how they differ from human dialogue at the level of semantic distribution, providing an interpretable analysis tool for fields such as education and academia.
Method: Using the Empath lexical analysis framework, maps each text to thematic intensity scores; defines 'semantic delta' as the difference between the two strongest category intensities in a dialogue; compares outputs from multiple LLMs against diverse human corpora (scripted dialogue, literature, online discussions); applies a Welch t-test to the delta distributions.
Result: AI-generated texts have significantly higher semantic delta than human texts, indicating a more concentrated, rigid topic structure; human dialogue shows a broader, more balanced semantic distribution; the metric is zero-shot and computationally cheap, making it suitable as a complementary signal in detection systems.
Conclusion: Semantic delta is an effective, interpretable, lightweight statistical feature for distinguishing human from LLM dialogue; current LLMs still fall clearly short of humans in the dynamics of thematic distribution, a finding that extends the empirical understanding of LLM behavioural mimicry.
Abstract: Do LLMs talk like us? This question intrigues a multitude of scholars and is relevant in many fields, from education to academia. This work presents an interpretable statistical feature for distinguishing human-written and LLM-generated dialogue. We introduce a lightweight metric derived from the distribution of semantic categories. Using the Empath lexical analysis framework, each text is mapped to a set of thematic intensity scores. We define semantic delta as the difference between the two most dominant category intensities within a dialogue, hypothesizing that LLM outputs exhibit stronger thematic concentration than human discourse. To evaluate this hypothesis, conversational data were generated from multiple LLM configurations and compared against heterogeneous human corpora, including scripted dialogue, literary works, and online discussions. A Welch t-test was applied to the resulting distributions of semantic delta values. Results show that AI-generated texts consistently produce higher deltas than human texts, indicating a more rigid topic structure, whereas human dialogue displays a broader and more balanced semantic spread. Rather than replacing existing detection techniques, the proposed zero-shot metric provides a computationally inexpensive complementary signal that can be integrated into ensemble detection systems.
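The definition of semantic delta and the Welch t-test above are concrete enough to sketch in a few lines. The example below is illustrative only: the category scores are invented rather than produced by Empath, and `semantic_delta` and `welch_t` are hypothetical helper names, not the authors' code.

```python
import math
from statistics import mean, variance


def semantic_delta(scores: dict[str, float]) -> float:
    """Difference between the two largest category intensities."""
    if len(scores) < 2:
        raise ValueError("need at least two categories")
    top, second = sorted(scores.values(), reverse=True)[:2]
    return top - second


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical category intensities; a real pipeline would obtain them from Empath.
llm_dialogues = [
    {"business": 0.30, "travel": 0.05, "family": 0.02},
    {"science": 0.25, "school": 0.04, "music": 0.03},
    {"health": 0.28, "sports": 0.06, "food": 0.01},
]
human_dialogues = [
    {"family": 0.12, "food": 0.10, "travel": 0.08},
    {"music": 0.09, "school": 0.08, "sports": 0.07},
    {"health": 0.11, "business": 0.08, "science": 0.06},
]
llm_deltas = [semantic_delta(d) for d in llm_dialogues]
human_deltas = [semantic_delta(d) for d in human_dialogues]
t_stat, dof = welch_t(llm_deltas, human_deltas)  # positive t: LLM deltas are larger here
```

With SciPy available, `scipy.stats.ttest_ind(llm_deltas, human_deltas, equal_var=False)` performs the same Welch test and also returns a p-value; the standard-library version is used here only to keep the sketch dependency-free.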
These findings also contribute to the broader empirical understanding of LLM behavioural mimicry and suggest that thematic distribution constitutes a quantifiable dimension along which current models fall short of human conversational dynamics.
[61] Span-Level Machine Translation Meta-Evaluation
Stefano Perrella,Eric Morales Agostinho,Hugo Zaragoza
Main category: cs.CL
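TL;DR: This paper examines how to reliably measure the error detection capability of automatic machine translation (MT) evaluators, notes that no standard evaluation technique currently exists, and proposes a robust meta-evaluation strategy called MPP.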
Details
Motivation: Automatic evaluation techniques can now locate translation errors and assign categories and severity levels, but there is no reliable way to measure how well these auto-evaluators detect errors.
Method: Investigates several implementations of span-level precision, recall, and F-score, analyzes their differences and suitability, and proposes 'match with partial overlap and partial credit' (MPP) with micro-averaging as a new evaluation strategy.
Result: Seemingly similar evaluation approaches can yield substantially different rankings, and some widely used techniques are unsuitable for evaluating MT error detection; MPP is shown to be a more robust meta-evaluation method.
Conclusion: MPP is a robust meta-evaluation strategy for MT error detection systems; the authors release their implementation and use it to assess the state of the art in MT error detection.
Abstract: Machine Translation (MT) and automatic MT evaluation have improved dramatically in recent years, enabling numerous novel applications. Automatic evaluation techniques have evolved from producing scalar quality scores to precisely locating translation errors and assigning them error categories and severity levels. However, it remains unclear how to reliably measure the evaluation capabilities of auto-evaluators that do error detection, as no established technique exists in the literature. This work investigates different implementations of span-level precision, recall, and F-score, showing that seemingly similar approaches can yield substantially different rankings, and that certain widely-used techniques are unsuitable for evaluating MT error detection. We propose "match with partial overlap and partial credit" (MPP) with micro-averaging as a robust meta-evaluation strategy and release code for its use publicly. Finally, we use MPP to assess the state of the art in MT error detection.
[62] Translation from the Information Bottleneck Perspective: an Efficiency Analysis of Spatial Prepositions in Bitexts
Antoine Taroni,Ludovic Moncla,Frederique Laforest
Main category: cs.CL
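TL;DR: This paper frames translation as an Information Bottleneck (IB) optimization problem, treating source sentences as stimuli and target sentences as compressed meanings, to analyze whether cross-linguistic semantic systems of spatial prepositions follow the principle of communicative efficiency; experiments find that attested translations lie closer to the IB optimal frontier than counterfactual alternatives, suggesting that human translators are subject to cognitive efficiency pressure in the spatial domain.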
Details
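Motivation: The Information Bottleneck (IB) framework has been validated in visual domains (e.g., colour, motion), but linguistic stimuli, especially words in sentential context such as prepositions, remain unexplored; this paper aims to fill that gap by examining whether natural-language translation reflects the communicative efficiency trade-off that IB predicts.
Method: Models translation as an IB optimization problem and analyzes spatial prepositions in bitexts of English, German, and Serbian translations of a French novel; collects similarity judgements between prepositions through a pile-sorting experiment with 35 participants and trains a 5-dimensional low-rank projection model (Spearman correlation 0.78) to quantify informativity; compares the distance of attested translations and counterfactual alternatives to the IB optimal frontier.
Result: Attested translations of spatial prepositions lie significantly closer to the IB optimal frontier than counterfactual alternatives, providing preliminary empirical support that human translators are driven by communicative efficiency pressure in spatial semantics.
Conclusion: Translation can serve as an effective window into the cognitive efficiency pressures behind cross-linguistic semantic systems; the results support the view that natural language, at least for spatial prepositions, follows the simplicity-informativity trade-off described by the Information Bottleneck.
Abstract: Efficient communication requires balancing informativity and simplicity when encoding meanings. The Information Bottleneck (IB) framework captures this trade-off formally, predicting that natural language systems cluster near an optimal accuracy-complexity frontier. While supported in visual domains such as colour and motion, linguistic stimuli such as words in sentential context remain unexplored. We address this gap by framing translation as an IB optimisation problem, treating source sentences as stimuli and target sentences as compressed meanings. This allows IB analyses to be performed directly on bitexts rather than controlled naming experiments. We applied this to spatial prepositions across English, German and Serbian translations of a French novel. To estimate informativity, we conducted a pile-sorting pilot study (N=35) and obtained similarity judgements of pairs of prepositions. We trained a low-rank projection model (D=5) that predicts these judgements (Spearman correlation: 0.78). Attested translations of prepositions lie closer to the IB optimal frontier than counterfactual alternatives, offering preliminary evidence that human translators exhibit communicative efficiency pressure in the spatial domain. More broadly, this work suggests that translation can serve as a window into the cognitive efficiency pressures shaping cross-linguistic semantic systems.
[63] SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Southeast Asia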
Zhixiang Lu,Chong Zhang,Yulong Li,Angelos Stefanidis,Anh Nguyen,Imran Razzak,Jionglong Su,Zhengyong Jiang
Main category: cs.CL
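TL;DR: This paper proposes the SAGE framework, in which a reinforcement learning agent autonomously curates a small, high-quality, culturally relevant training set that is then used to fine-tune open-source LLMs efficiently with LoRA; it significantly improves translation for seven low-resource Southeast Asian languages while cutting data usage by 97.1% and training energy consumption by 95.2%, serving both digital inclusion and environmental sustainability.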
Details
Motivation: Addresses the digital divide and environmental unsustainability in low-resource regions of Southeast Asia caused by the scarcity of high-quality, culturally relevant data and the high energy cost of large-scale training.
Method: Proposes the Sustainable Agent-Guided Expert-tuning (SAGE) framework: an RL agent optimized with GRPO uses a semantic reward signal derived from expert-constructed community dialogues to automatically curate a compact training set, after which open-source LLMs are fine-tuned efficiently with LoRA.
Result: On translation between English and seven low-resource Southeast Asian languages, achieves new SOTA on BLEU-4 and COMET-22, surpassing baselines trained on the full data while reducing data usage by 97.1% and training energy consumption by 95.2%.
Conclusion: SAGE offers the Global South a high-performance, low-energy, scalable, and responsible path to digital inclusion.
Abstract: The vision of an inclusive World Wide Web is impeded by a severe linguistic divide, particularly for communities in low-resource regions of Southeast Asia. While large language models (LLMs) offer a potential solution for translation, their deployment in data-poor contexts faces a dual challenge: the scarcity of high-quality, culturally relevant data and the prohibitive energy costs of training on massive, noisy web corpora. To resolve the tension between digital inclusion and environmental sustainability, we introduce Sustainable Agent-Guided Expert-tuning (SAGE). This framework pioneers an energy-aware paradigm that prioritizes the "right data" over "big data". Instead of carbon-intensive training on unfiltered datasets, SAGE employs a reinforcement learning (RL) agent, optimized via Group Relative Policy Optimization (GRPO), to autonomously curate a compact training set. The agent utilizes a semantic reward signal derived from a small, expert-constructed set of community dialogues to filter out noise and cultural misalignment. We then efficiently fine-tune open-source LLMs on this curated data using Low-Rank Adaptation (LoRA). We applied SAGE to translation tasks between English and seven low-resource languages (LRLs) in Southeast Asia. Our approach establishes new state-of-the-art performance on BLEU-4 and COMET-22 metrics, effectively capturing local linguistic nuances. Crucially, SAGE surpasses baselines trained on full datasets while reducing data usage by 97.1% and training energy consumption by 95.2%.
By delivering high-performance models with a minimal environmental footprint, SAGE offers a scalable and responsible pathway to bridge the digital divide in the Global South.
[64] Hybrid topic modelling for computational close reading: Mapping narrative themes in Pushkin's Evgenij Onegin
Angelo Maria Sabatini
Main category: cs.CL
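TL;DR: This paper proposes a hybrid topic modelling framework combining LDA with sparse PLS-DA for computational literary analysis of narrative poetry; using an Italian translation of Evgenij Onegin as a case study, it achieves stable, interpretable topics on a small corpus and extends the analysis to the level of narrative structure.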
Details
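Motivation: Addresses the instability of topic models on small corpora and the difficulty traditional computational methods have in capturing both the narrative structure and the thematic dynamics of poetry, while making computational results more interpretable and reproducible for literary interpretation.
Method: Combines unsupervised LDA with supervised sPLS-DA; adopts a multi-seed consensus protocol to improve stability on a small corpus; segments the poem into 35 documents of lemmatised content words; introduces 'narrative hubs' (groups of contiguous stanzas) to go beyond the bag-of-words model and relate thematic mixtures to the poem's emotional and structural arc.
Result: Extracts 5 stable, interpretable topics; sPLS-DA identifies key lexical markers for each topic; narrative hubs reveal how thematic mixtures align with the poem's emotional and structural development; the study confirms that unsupervised and supervised structures can converge on a small corpus.
Conclusion: The lightweight probabilistic modelling framework does not replace close reading but offers a reproducible, transparent form of computational close reading suited to high-density literary texts, particularly in cross-lingual comparison.
Abstract: This study presents a hybrid topic modelling framework for computational literary analysis that integrates Latent Dirichlet Allocation (LDA) with sparse Partial Least Squares Discriminant Analysis (sPLS-DA) to model thematic structure and longitudinal dynamics in narrative poetry. As a case study, we analyse Evgenij Onegin, Aleksandr S. Pushkin's novel in verse, using an Italian translation, testing whether unsupervised and supervised lexical structures converge in a small-corpus setting. The poetic text is segmented into thirty-five documents of lemmatised content words, from which five stable and interpretable topics emerge. To address small-corpus instability, a multi-seed consensus protocol is adopted. Using sPLS-DA as a supervised probe enhances interpretability by identifying lexical markers that refine each theme. Narrative hubs (groups of contiguous stanzas marking key episodes) extend the bag-of-words approach to the narrative level, revealing how thematic mixtures align with the poem's emotional and structural arc. Rather than replacing traditional literary interpretation, the proposed framework offers a computational form of close reading, illustrating how lightweight probabilistic models can yield reproducible thematic maps of complex poetic narratives, even when stylistic features such as metre, phonology, or native morphology are abstracted away.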
Despite relying on a single lemmatised translation, the approach provides a transparent methodological template applicable to other high-density literary texts in comparative studies.
[65] When Contextual Inference Fails: Cancelability in Interactive Instruction Following
Natalia Bila,Kata Naszádi,Alexandra Mayn,Christof Monz
Main category: cs.CL
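TL;DR: This paper introduces Build What I Mean (BWIM), an interactive benchmark for evaluating how well large language models separate literal interpretation from contextual inference in a collaborative block-building task; it finds that models can recognize speaker unreliability but fail to use it to improve their clarification behavior, showing suboptimal strategies such as over-clarification and question avoidance.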
Details
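Motivation: To examine how large language models separate literal interpretation from contextual inference in collaborative tasks, in particular their ability to disambiguate from context or proactively ask for clarification when instructions are underspecified.
Method: Building on a two-speaker psycholinguistic paradigm, constructs the interactive benchmark Build What I Mean (BWIM), in which models must resolve ambiguity in block building through contextual inference or low-cost clarification, and evaluates the behavior of several state-of-the-art LLMs.
Result: Models detect speaker unreliability in explicit confidence ratings but fail to use that information to improve their clarification strategy in action, showing partner-blind over-clarification and question-averse guessing under uncertainty.
Conclusion: Current LLMs show a dissociation between judgment and action: their metacognitive judgments do not match their actual strategic communication behavior, revealing limits to their pragmatic reasoning in real collaborative settings.
Abstract: We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve underspecified instructions using contextual inferences. Building on an existing two-speaker psycholinguistic paradigm (which contrasts a pragmatically cooperative speaker with one who is only literally reliable), we introduce Build What I Mean (BWIM), an interactive benchmark for contextual meaning construction. In BWIM, models must resolve ambiguity by either performing a contextual inference or requesting clarification at a small communication cost. Evaluating several state-of-the-art LLMs, we find a dissociation between judgment and action: while models detect speaker unreliability in explicit confidence ratings, they fail to exploit this information to guide efficient clarification behavior. Instead, we observe suboptimal strategies, such as partner-blind over-clarification and question-averse guessing under uncertainty.
[66] An Agentic Approach to Generating XAI-Narratives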
Yifan He,David Martens
Main category: cs.CL
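TL;DR: This paper proposes a multi-agent framework for generating and refining explainable AI (XAI) narratives, in which a Narrator and multiple Critic Agents iterate to improve the faithfulness and coherence of natural-language explanations; its effectiveness is validated across several large language models and tabular datasets.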
Details
Motivation: Existing XAI methods are overly technical and expert-oriented, lacking accessibility and interpretability; LLM-generated XAI narratives are promising but often fall short on faithfulness and coherence.
Method: A multi-agent framework with a Narrator that generates and revises narratives and multiple Critic Agents that assess faithfulness and coherence; five agentic system variants (Basic, Critic, Critic-Rule, Coherent, and Coherent-Rule Design) are evaluated systematically across five LLMs and five tabular datasets; a majority-voting ensemble strategy is introduced to further improve performance.
Result: The Basic, Critic, and Critic-Rule designs improve narrative faithfulness across all LLMs; Claude-4.5-Sonnet with the Basic Design reduces unfaithful narratives by 90% after three rounds of iteration; the majority-voting ensemble improves performance for four LLMs, with DeepSeek-V3.2-Exp the exception.
Conclusion: Multi-agent systems can produce faithful and coherent XAI narratives, offering a new paradigm for user-friendly explainable AI.
Abstract: Explainable AI (XAI) research has experienced substantial growth in recent years. Existing XAI methods, however, have been criticized for being technical and expert-oriented, motivating the development of more interpretable and accessible explanations. In response, large language model (LLM)-generated XAI narratives have been proposed as a promising approach for translating post-hoc explanations into more accessible, natural-language explanations. In this work, we propose a multi-agent framework for XAI narrative generation and refinement. The framework comprises the Narrator, which generates and revises narratives based on feedback from multiple Critic Agents on faithfulness and coherence metrics, thereby enabling narrative improvement through iteration. We design five agentic systems (Basic Design, Critic Design, Critic-Rule Design, Coherent Design, and Coherent-Rule Design) and systematically evaluate their effectiveness across five LLMs on five tabular datasets. Results validate that the Basic Design, the Critic Design, and the Critic-Rule Design are effective in improving the faithfulness of narratives across all LLMs. Claude-4.5-Sonnet on Basic Design performs best, reducing the number of unfaithful narratives by 90% after three rounds of iteration. To address recurrent issues, we further introduce an ensemble strategy based on majority voting. This approach consistently enhances performance for four LLMs, except for DeepSeek-V3.2-Exp.
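The abstract does not say exactly what the ensemble votes on, so the sketch below assumes the simplest case: several independent verdicts on whether a narrative is faithful, combined by majority. The label strings, the `majority_vote` name, and the conservative tie-break are illustrative assumptions rather than details from the paper.

```python
from collections import Counter


def majority_vote(verdicts, tie_break="unfaithful"):
    """Return the most common verdict; fall back to `tie_break` on a tie.

    The tie-break default is a hypothetical design choice: when the
    voters are split, the narrative is treated as needing another
    revision round.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    ranked = Counter(verdicts).most_common()
    # A tie between the two leading labels is resolved conservatively.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


# Three hypothetical critic verdicts for one narrative.
critic_verdicts = ["faithful", "unfaithful", "faithful"]
decision = majority_vote(critic_verdicts)
```

An odd number of voters avoids ties altogether; the explicit tie-break only matters when the ensemble size is even or more than two labels are in play.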
These findings highlight the potential of agentic systems to produce faithful and coherent XAI narratives.
[67] RouterKGQA: Specialized--General Model Routing for Constraint-Aware Knowledge Graph Question Answering
Bo Yuan,Hexuan Deng,Xuebo Liu,Min Zhang
Main category: cs.CL
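TL;DR: This paper proposes RouterKGQA, a framework in which a specialized model and a general large model cooperate for efficient, accurate knowledge graph question answering: the specialized model generates reasoning paths and filters answers, while the general model performs KG-guided repair only when needed, improving performance while sharply reducing LLM call overhead.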
Details
Motivation: Existing KGQA methods trade efficiency against effectiveness: retrieval-based methods are efficient but prone to unreachable paths and missed implicit constraints, while agent-based methods perform well at excessive computational cost.
Method: RouterKGQA has three parts: (1) a specialized model generates initial reasoning paths; (2) constraint-aware answer filtering reduces redundant answers; (3) a general LLM performs KG-guided repair only when paths are unreliable, with a streamlined workflow that lowers inference cost.
Result: Average gains of 3.57 points in F1 and 0.49 points in Hits@1 across benchmarks over the previous best method, with only 1.15 LLM calls per question on average.
Conclusion: On-demand collaboration between specialized and general models strengthens the accuracy and robustness of KGQA while keeping overhead low, offering a new way to balance efficiency and performance.
Abstract: Knowledge graph question answering (KGQA) is a promising approach for mitigating LLM hallucination by grounding reasoning in structured and verifiable knowledge graphs. Existing approaches fall into two paradigms: retrieval-based methods utilize small specialized models, which are efficient but often produce unreachable paths and miss implicit constraints, while agent-based methods utilize large general models, which achieve stronger structural grounding at substantially higher cost. We propose RouterKGQA, a framework for specialized--general model collaboration, in which a specialized model generates reasoning paths and a general model performs KG-guided repair only when needed, improving performance at minimal cost. We further equip the specialized model with constraint-aware answer filtering, which reduces redundant answers. In addition, we design a more efficient general agent workflow, further lowering inference cost. Experimental results show that RouterKGQA outperforms the previous best by 3.57 points in F1 and 0.49 points in Hits@1 on average across benchmarks, while requiring only 1.15 average LLM calls per question. Code and models are available at https://github.com/Oldcircle/RouterKGQA.
[68] LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families
Jianan Chen,Xiaoxue Gao,Tatsuya Kawahara,Nancy F. Chen
Main category: cs.CL
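TL;DR: This paper proposes LoASR-Bench, an automatic speech recognition (ASR) benchmark for low-resource languages covering 25 languages from 9 language families. It evaluates how well current speech language models (SpeechLMs) generalize across language families and scripts, and exposes their performance limits on real-world low-resource languages.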
Details
Motivation: Existing ASR benchmarks focus mainly on high-resource languages, so the performance of SpeechLMs on low-resource languages and their generalization across language families are insufficiently evaluated, which hinders deployment in real multilingual scenarios.
Method: Construct the LoASR-Bench benchmark, with datasets for 25 languages from 9 language families in both Latin and non-Latin scripts, supporting cross-linguistic and cross-script evaluation of SpeechLM ASR performance.
Result: Experiments show that current state-of-the-art SpeechLMs have clear performance limitations on real-world low-resource ASR.
Conclusion: SpeechLMs need to be evaluated on dedicated low-resource benchmarks such as LoASR-Bench to support reliable use in real multilingual settings.
Abstract: Large language models (LLMs) have driven substantial advances in speech language models (SpeechLMs), yielding strong performance in automatic speech recognition (ASR) under high-resource conditions. However, existing benchmarks predominantly focus on high-resource languages, leaving the ASR behavior of SpeechLMs in low-resource languages insufficiently understood. This gap is critical, as practical ASR systems must reliably support low-resource languages and generalize across diverse language families, and it directly hinders the deployment of SpeechLM-based ASR in real-world multilingual scenarios. As a result, it is essential to evaluate SpeechLMs on low-resource languages to ensure their generalizability across different language families. To address this problem, we propose LoASR-Bench, a comprehensive benchmark designed to evaluate low-resource automatic speech recognition (ASR) of the latest SpeechLMs across diverse language families. LoASR-Bench comprises 25 languages from 9 language families, featuring both Latin and non-Latin scripts, enabling cross-linguistic and cross-script assessment of ASR performance of current SpeechLMs. Experimental results highlight the limitations of the latest SpeechLMs in handling real-world low-resource languages.
[69] Predicting States of Understanding in Explanatory Interactions Using Cognitive Load-Related Linguistic Cues
Yu Wang,Olcay Türk,Angela Grimminger,Hendrik Buschmeier
Main category: cs.CL
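TL;DR: This paper studies how verbal and nonverbal linguistic features of speakers and listeners in dialogue predict the listener's moment-by-moment state of understanding in explanatory interactions. Analysis of the MUNDEX corpus shows that cues such as information value, syntactic complexity, and variation in gaze behavior are related to the listener's level of understanding, and that combining these multimodal features improves classification of four understanding states.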
Details
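Motivation: To investigate how verbal and nonverbal linguistic features can predict a listener's state of understanding in real time during explanatory dialogue, in support of more natural human-computer interaction and educational technology.
Method: Using the MUNDEX corpus (face-to-face board game explanations), statistical analysis and machine learning are applied to three linguistic cues: the information value (surprisal) and syntactic complexity of the speaker's utterances, and variation in the listener's interactive gaze. These cues are combined with textual features for four-way classification of understanding states, using two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier.
Result: Each cue is associated with the listener's level of understanding; classification experiments show that combining the three cues with textual features improves prediction of the four states 'Understanding', 'Partial Understanding', 'Non-Understanding', and 'Misunderstanding'.
Conclusion: Verbal and nonverbal linguistic cues can characterize a listener's real-time state of understanding, and multimodal modelling outperforms unimodal modelling, providing empirical and methodological support for adaptive explanation systems.
Abstract: We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener's state of understanding in explanatory interactions on a moment-by-moment basis. Specifically, we examine three linguistic cues related to cognitive load and hypothesised to correlate with listener understanding: the information value (operationalised with surprisal) and syntactic complexity of the speaker's utterances, and the variation in the listener's interactive gaze behaviour. Based on statistical analyses of the MUNDEX corpus of face-to-face dialogic board game explanations, we find that individual cues vary with the listener's level of understanding. Listener states ('Understanding', 'Partial Understanding', 'Non-Understanding' and 'Misunderstanding') were self-annotated by the listeners using a retrospective video-recall method. The results of a subsequent classification experiment, involving two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier, demonstrate that prediction of these four states of understanding is generally possible and improves when the three linguistic cues are considered alongside textual features.
[70] An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models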
Yuming Feng,Christy Yang
Main category: cs.CL
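TL;DR: This paper systematically compares supervised fine-tuning (SFT), direct preference optimization (DPO), their combination, and full fine-tuning (FFT) versus LoRA on a GPT-2-scale model with small data, finding that SFT remains the dominant performance factor while DPO and LoRA yield limited gains.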
Details
Motivation: The practical behavior of Direct Preference Optimization (DPO) with small models and small data lacks thorough empirical analysis, particularly in comparison with SFT and across parameterization strategies (FFT vs. LoRA).
Method: On a GPT-2-scale decoder, SFT-only, DPO-only, and staged SFT-to-DPO training are evaluated under both FFT and LoRA parameterizations on two tasks (paraphrase detection and Shakespearean sonnet continuation).
Result: DPO brings only small, task-dependent gains; when the preference construction closely parallels the supervised objective, DPO reaches competitive results without an SFT warm start; FFT consistently outperforms LoRA at matched training depth; LoRA does not reduce wall-clock training time.
Conclusion: In small-scale settings, full-parameter supervised fine-tuning remains the main lever for performance, while preference optimization and low-rank adaptation bring limited marginal returns.
Abstract: Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small backbones and modest data is under-specified. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training alongside full fine-tuning (FFT) versus LoRA on a GPT-2-scale decoder, evaluating paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over strong SFT and can match competitive SFT accuracy without a warm start when the preference construction closely parallels the supervised objective. In contrast, parameterization dominates: FFT consistently outperforms LoRA at matched training depth, and LoRA does not reduce wall-clock time on our hardware. These findings indicate that, in this small-scale regime, supervised full-parameter adaptation remains the primary performance lever, while preference optimization and low-rank adaptation provide limited marginal returns.
[71] Current LLMs still cannot 'talk much' about grammar modules: Evidence from syntax
Mohammed Q. Shormani
Main category: cs.CL
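TL;DR: This study evaluates how well large language models (LLMs) translate core generative-syntax terms into Arabic, finding that only 25% of ChatGPT-5 translations were accurate, 38.6% were inaccurate, and 36.4% were partially correct; it proposes collaboration between AI specialists and linguists to improve LLM translation.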
Details
Motivation: To probe the limits of large language models (LLMs) in translating core generative-syntax terms into Arabic, especially terms involving syntactic and semantic complexity, and to provide empirical grounds for improving translation quality in specialized domains.
Method: 44 core terms from generative syntax were translated into Arabic by human experts and by ChatGPT-5; translation quality was assessed with an analytical and comparative approach using three categories (accurate, partially correct, inaccurate).
Result: Only 25% of ChatGPT-5 translations were accurate, 38.6% were inaccurate, and 36.4% were partially correct, indicating that current LLMs remain clearly limited in rendering core grammatical concepts precisely across languages.
Conclusion: LLMs cannot yet reliably 'talk much' about the core properties of grammar modules; AI specialists and linguists need to collaborate closely on model training, knowledge injection, and evaluation criteria to improve the accuracy and appropriateness of specialized term translation.
Abstract: We aim to examine the extent to which Large Language Models (LLMs) can 'talk much' about grammar modules, providing evidence from syntax core properties translated by ChatGPT into Arabic. We collected 44 terms from previous works in generative syntax, including books and journal articles, as well as from our experience in the field. These terms were translated by humans, and then by ChatGPT-5. We then analyzed and compared both translations, using an analytical and comparative approach. Findings reveal that LLMs still cannot 'talk much' about the core syntax properties embedded in the terms under study, which involve several syntactic and semantic challenges: only 25% of ChatGPT translations were accurate, while 38.6% were inaccurate, and 36.4% were partially correct, which we consider appropriate. Based on these findings, a set of actionable strategies is proposed, the most notable of which is close collaboration between AI specialists and linguists to improve LLMs' working mechanisms for accurate or at least appropriate translation.
[72] Reasoning Gets Harder for LLMs Inside A Dialogue
Ivan Kartáč,Mateusz Lango,Ondřej Dušek
Main category: cs.CL
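TL;DR: This paper introduces BOULDER, a dynamic benchmark for assessing how well large language models reason within task-oriented dialogue (TOD). Model performance in dialogue settings is substantially lower than on isolated tasks, mainly because of multi-turn interaction, role constraints, and tool-use requirements.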
Details
Motivation: Existing reasoning evaluations are mostly based on isolated tasks and do not reflect the combined demands of real task-oriented dialogue (TOD), where LLMs must reason while generating text and following instructions on role, format, and style; this creates an evaluation mismatch.
Method: Build BOULDER, a dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal components; each problem comes in an isolated and a dialogue-based variant to allow controlled comparison and mitigate data contamination; experiments are run on eight LLMs, with ablations and qualitative analysis.
Result: All eight LLMs show a substantial and consistent performance drop in the dialogue setting; the gap is driven mainly by the multi-turn structure of dialogue, with additional effects from role conditioning and tool-use requirements.
Conclusion: Current reasoning benchmarks overestimate the robustness of LLM reasoning in real interactive scenarios; reasoning needs to be evaluated in more realistic, dynamic, interactive environments.
Abstract: Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.
[73] Enhancing Hyperspace Analogue to Language (HAL) Representations via Attention-Based Pooling for Text Classification
Ali Sakour,Zoalfekar Sakour
Main category: cs.CL
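TL;DR: This paper proposes an improved HAL model that combines a temperature-scaled additive attention mechanism with SVD dimensionality reduction for sentence-level semantic representation, substantially improving sentiment classification accuracy on the IMDB dataset and enhancing interpretability.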
Details
Motivation: Conventional HAL models aggregate word vectors by mean pooling, which loses important contextual information and in particular cannot distinguish key sentiment words from uninformative structural words.
Method: Introduce a learnable, temperature-scaled additive attention mechanism into the HAL framework, and apply truncated singular value decomposition (SVD) to the sparse, high-dimensional co-occurrence matrix before the attention layer.
Result: On the IMDB dataset, the method reaches 82.38% test accuracy, 6.74 percentage points above the mean pooling baseline (75.64%); analysis of the attention weights shows that it suppresses stop-words and focuses on sentiment-bearing words.
Conclusion: The HAL variant combining attention with SVD reduction improves sentence-level semantic representation as well as the model's discriminative power and interpretability.
Abstract: The Hyperspace Analogue to Language (HAL) model relies on global word co-occurrence matrices to construct distributional semantic representations. While these representations capture lexical relationships effectively, aggregating them into sentence-level embeddings via standard mean pooling often results in information loss. Mean pooling assigns equal weight to all tokens, thereby diluting the impact of contextually salient words with uninformative structural tokens. In this paper, we address this limitation by integrating a learnable, temperature-scaled additive attention mechanism into the HAL representation pipeline. To mitigate the sparsity and high dimensionality of the raw co-occurrence matrices, we apply Truncated Singular Value Decomposition (SVD) to project the vectors into a dense latent space prior to the attention layer. We evaluate the proposed architecture on the IMDB sentiment analysis dataset. Empirical results demonstrate that the attention-based pooling approach achieves a test accuracy of 82.38%, yielding an absolute improvement of 6.74 percentage points over the traditional mean pooling baseline (75.64%). Furthermore, qualitative analysis of the attention weights indicates that the mechanism successfully suppresses stop-words and selectively attends to sentiment-bearing tokens, improving both classification performance and model interpretability.
[74] Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models
Qi Cao,Andrew Gambardella,Takeshi Kojima,Yutaka Matsuo,Yusuke Iwasawa
Main category: cs.CL
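TL;DR: This paper proposes Semantic Token Clustering (STC), an efficient uncertainty quantification method for assessing the truthfulness and reliability of large language model (LLM) outputs. It needs only a single generation and no auxiliary model, balancing performance and efficiency.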
Details
Motivation: The truthfulness of LLM outputs is not guaranteed and models tend to be overconfident, which undermines reliability; most existing uncertainty quantification methods rely on repeated sampling or auxiliary models and are computationally expensive.
Method: Semantic Token Clustering (STC) uses the LLM's internal embeddings to group tokens into semantically consistent clusters via embedding clustering and prefix matching, and quantifies uncertainty from the probability mass aggregated over the corresponding cluster.
Result: In the reported experiments, STC performs comparably to state-of-the-art baselines while substantially reducing computational overhead.
Conclusion: STC is a lightweight, efficient uncertainty quantification approach that needs no additional model, offering a practical route to more trustworthy LLM outputs.
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guaranteed, and their tendency toward overconfidence further limits reliability. Uncertainty quantification offers a promising way to identify potentially unreliable outputs, but most existing methods rely on repeated sampling or auxiliary models, introducing substantial computational overhead. To address these limitations, we propose Semantic Token Clustering (STC), an efficient uncertainty quantification method that leverages the semantic information inherently encoded in LLMs. Specifically, we group tokens into semantically consistent clusters using embedding clustering and prefix matching, and quantify uncertainty based on the probability mass aggregated over the corresponding semantic cluster. Our approach requires only a single generation and does not depend on auxiliary models. Experimental results show that STC achieves performance comparable to state-of-the-art baselines while substantially reducing computational overhead.
[75] Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models
Sai Koneru,Elphin Joe,Christine Kirchhoff,Jian Wu,Sarah Rajtmajer
Main category: cs.CL
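TL;DR: This paper proposes a controlled epistemic-conflict framework based on the U.S. National Climate Assessment to evaluate how instruction-tuned language models trade off user-alignment pressure against faithfulness to evidence. Even with richer in-context evidence, models remain prone to abandoning the facts under user pressure; the paper identifies three main failure modes and argues that richer evidence alone is not enough, so explicit training for epistemic integrity is needed.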
Details
Motivation: In contested domains, instruction-tuned language models must balance user-alignment pressure against faithfulness to in-context evidence, but a controlled framework for systematically evaluating this tension has been lacking.
Method: Build a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment; run fine-grained ablations on 19 instruction-tuned models from 0.27B to 32B parameters, systematically varying evidence composition and uncertainty cues, and analyze model responses under neutral prompts and under user pressure.
Result: Three main failure modes: (1) partial evidence (such as research gaps) can increase sycophancy; (2) robustness changes non-monotonically with scale, with some low-to-mid scale models especially sensitive to user pressure; (3) models differ markedly in how concentrated their ordinal distributions are under conflict, with reasoning-distilled variants more dispersed than their instruction-tuned counterparts.
Conclusion: In a controlled fixed-evidence setting, adding in-context evidence alone does not reliably protect against user pressure; explicit training is needed to strengthen epistemic integrity.
Abstract: In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Across neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction, where adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families like Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed; in scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts.
These findings suggest that, in a controlled fixed-evidence setting, providing richer in-context evidence alone offers no guarantee against user pressure without explicit training for epistemic integrity.
[76] Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Richard J. Young
Main category: cs.CL
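TL;DR: This paper shows that chain-of-thought (CoT) faithfulness is not a single, objectively comparable metric but depends heavily on the definition and stringency of the classifier used. Different classifiers give markedly different results on the same data and produce inconsistent model rankings, so future work should report sensitivity ranges across multiple methods rather than single point estimates.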
Details
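Motivation: Existing work reduces CoT faithfulness to a single aggregate number (e.g., 39%), implying that it is an objectively measurable property; the author questions this assumption and sets out to show how sensitive the results are to classifier choice and to the conceptual differences behind it.
Method: On 10,276 hint-influenced reasoning traces from 12 open-weight models (9 families, 7B to 1T parameters), faithfulness is judged by three classifiers: a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge. Systematic disagreement is analyzed with confidence intervals, McNemar's test, Cohen's kappa, and rank changes.
Result: The three classifiers give overall faithfulness rates of 74.4%, 82.6%, and 69.7%, with non-overlapping confidence intervals; per-model gaps range from 2.6 to 30.6 percentage points and are all significant; kappa is low (0.06 to 0.42) and the disagreement is systematic (for sycophancy hints, 883 vs. 2 cases in the two directions); switching classifiers can reverse model rankings (e.g., Qwen3.5-27B falls from 1st to 7th).
Conclusion: CoT faithfulness is not a single objective metric; it depends on the specific construct the classifier implements (e.g., lexical mention vs. epistemic dependence). Faithfulness numbers are not comparable across studies, and future evaluations must report sensitivity ranges across multiple classifiers.
Abstract: Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points; all are statistically significant (McNemar's test, p < 0.001). The disagreements are systematic, not random: inter-classifier agreement measured by Cohen's kappa ranges from 0.06 ("slight") for sycophancy hints to 0.42 ("moderate") for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd.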
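For readers unfamiliar with the two agreement statistics named in the abstract, the following minimal sketch computes Cohen's kappa and the uncorrected McNemar chi-square statistic from paired labels. The function names and the ten-item toy labels are illustrative assumptions; only the 883 vs. 2 discordant counts come from the abstract, and the paper's exact test variant (for example, with or without continuity correction) is not stated.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def discordant_counts(labels_a, labels_b, positive="faithful"):
    """Count items that only rater A, or only rater B, labels positive."""
    pairs = list(zip(labels_a, labels_b))
    a_only = sum(a == positive and b != positive for a, b in pairs)
    b_only = sum(b == positive and a != positive for a, b in pairs)
    return a_only, b_only


def mcnemar_statistic(a_only, b_only):
    """McNemar chi-square statistic (1 d.o.f., no continuity correction)."""
    total = a_only + b_only
    return 0.0 if total == 0 else (a_only - b_only) ** 2 / total


# Toy labels for ten traces (invented for the example).
pipeline = ["faithful"] * 6 + ["unfaithful"] * 4
judge = ["faithful"] * 3 + ["unfaithful"] * 6 + ["faithful"]

kappa = cohens_kappa(pipeline, judge)
toy_stat = mcnemar_statistic(*discordant_counts(pipeline, judge))

# Discordant counts reported in the abstract for sycophancy hints.
sycophancy_stat = mcnemar_statistic(883, 2)
```

Kappa measures overall agreement beyond chance, while McNemar's statistic looks only at the discordant pairs, which is why a one-sided split such as 883 vs. 2 yields a very large value even when raw agreement looks acceptable.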
The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior. These results demonstrate that published faithfulness numbers cannot be meaningfully compared across studies that use different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates.
cs.CV [Back]
[77] Diffusion-Guided Semantic Consistency for Multimodal Heterogeneity
Jing Liu,Zhengliang Guo,Yan Wang,Xiaoguang Zhu,Yao Du,Zehua Wang,Victor C. M. Leung
Main category: cs.CV
TL;DR: SemanticFL is a novel federated learning framework that uses pre-trained diffusion models' semantic representations to address non-IID data challenges, especially in multimodal perception, achieving up to 5.49% accuracy gain over FedAvg.
Details
Motivation: Federated learning suffers from performance degradation due to non-IID client data, particularly in multimodal perception; existing methods fail to resolve underlying semantic discrepancies across clients. Method: SemanticFL leverages multi-layer semantic representations (e.g., VAE latents and U-Net features) from a pre-trained Stable Diffusion model to construct a shared latent space; it employs a client-server architecture with server-side heavy computation and a unified cross-modal contrastive consistency mechanism for stable convergence. Result: SemanticFL achieves up to 5.49% higher accuracy than FedAvg on CIFAR-10, CIFAR-100, and TinyImageNet under diverse non-IID settings. Conclusion: SemanticFL effectively aligns heterogeneous clients via diffusion-based semantics and improves robustness for multimodal perception in FL, offering a promising direction for privacy-preserving, semantics-aware federated learning. Abstract: Federated learning (FL) is severely challenged by non-independent and identically distributed (non-IID) client data, a problem that degrades global model performance, especially in multimodal perception settings. Conventional methods often fail to address the underlying semantic discrepancies between clients, leading to suboptimal performance for multimedia systems requiring robust perception. To overcome this, we introduce SemanticFL, a novel framework that leverages the rich semantic representations of pre-trained diffusion models to provide privacy-preserving guidance for local training. Our approach leverages multi-layer semantic representations from a pre-trained Stable Diffusion model (including VAE-encoded latents and U-Net hierarchical features) to create a shared latent space that aligns heterogeneous clients, facilitated by an efficient client-server architecture that offloads heavy computation to the server. A unified consistency mechanism, employing cross-modal contrastive learning, further stabilizes convergence. 
We conduct extensive experiments on benchmarks including CIFAR-10, CIFAR-100, and TinyImageNet under diverse heterogeneity scenarios. Our results demonstrate that SemanticFL surpasses existing federated learning approaches, achieving accuracy gains of up to 5.49% over FedAvg, validating its effectiveness in learning robust representations for heterogeneous and multimodal data for perception tasks.
[78] AURORA: Adaptive Unified Representation for Robust Ultrasound Analysis
Ufaq Khan,L. D. M. S. Sai Teja,Ayuba Shakiru,Mai A. Shaaban,Yutong Xie,Muhammad Bilal,Muhammad Haris Khan
Main category: cs.CV
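TL;DR: This paper proposes a unified multi-task ultrasound image analysis framework built on a Qwen3-VL visual encoder. Using intermediate token feature projection, a lightweight multi-scale feature pyramid, and task-aware sampling with loss balancing, it substantially improves performance in the FMC-UIA challenge (validation score from 67% to 85%, test average 81.84%).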
Details
Motivation: Ultrasound images vary widely across scanners, operators, and anatomical targets, so models generalize poorly across hospitals and clinical settings; the FMC-UIA challenge requires a single model to handle segmentation, detection, classification, and landmark regression across multiple organs and datasets.
Method: Use a transformer visual encoder from the Qwen3-VL family; project intermediate token features into spatial feature maps and fuse them with a lightweight multi-scale feature pyramid; handle each task with a small task-specific prediction head; apply task-aware sampling and selective loss balancing during training to manage heterogeneous supervision and task imbalance.
Result: Validation performance rises from 67% to 85%, and the average score across all tasks on the official test set is 81.84%.
Conclusion: The unified multi-task framework is simple, easy to optimize, and adaptable to a wide range of ultrasound image analysis tasks.
Abstract: Ultrasound images vary widely across scanners, operators, and anatomical targets, which often causes models trained in one setting to generalize poorly to new hospitals and clinical conditions. The Foundation Model Challenge for Ultrasound Image Analysis (FMC-UIA) reflects this difficulty by requiring a single model to handle multiple tasks, including segmentation, detection, classification, and landmark regression across diverse organs and datasets. We propose a unified multi-task framework based on a transformer visual encoder from the Qwen3-VL family. Intermediate token features are projected into spatial feature maps and fused using a lightweight multi-scale feature pyramid, enabling both pixel-level predictions and global reasoning within a shared representation. Each task is handled by a small task-specific prediction head, while training uses task-aware sampling and selective loss balancing to manage heterogeneous supervision and reduce task imbalance. Our method is designed to be simple to optimize and adaptable across a wide range of ultrasound analysis tasks. The performance improved from 67% to 85% on the validation set and achieved an average score of 81.84% on the official test set across all tasks. The code is publicly available at: https://github.com/saitejalekkala33/FMCUIA-ISBI.git
[79] Factored Levenberg-Marquardt for Diffeomorphic Image Registration: An efficient optimizer for FireANTs
Rohit Jena,Pratik Chaudhari,James C. Gee
Main category: cs.CV
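TL;DR: This paper proposes a modified Levenberg-Marquardt (LM) optimizer for registering large images that keeps only a single scalar damping parameter as optimizer state, cutting memory use by up to 24.6% while matching or exceeding Adam on several datasets.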
Details
Motivation: Adam performs well in FireANTs, but its state variables (momentum and second-moment estimates) carry a high memory cost that makes it impractical for registering very large images.
Method: A modified Levenberg-Marquardt optimizer whose damping parameter is tuned adaptively with a trust region approach, plus a Metropolis-Hastings-style rejection step that prevents updates from worsening the loss.
Result: Performance is retained across four datasets, matching or outperforming Adam on three of the four benchmarks; a single hyperparameter configuration tuned on brain MRI transfers to lung CT and cross-modal abdominal registration; memory is reduced by up to 24.6%.
Conclusion: The lightweight, robust LM variant offers an efficient, low-memory test-time optimizer for large-scale, multi-modal image registration.
Abstract: FireANTs introduced a novel Eulerian descent method for plug-and-play behavior with arbitrary optimizers adapted for diffeomorphic image registration as a test-time optimization problem, with a GPU-accelerated implementation. FireANTs uses Adam as its default optimizer for fast and more robust optimization. However, Adam requires storing state variables (i.e. momentum and squared-momentum estimates), each of which can consume significant memory, prohibiting its use for significantly large images. In this work, we propose a modified Levenberg-Marquardt (LM) optimizer that requires only a single scalar damping parameter as optimizer state, which is adaptively tuned using a trust region approach. The resulting optimizer reduces memory by up to 24.6% for large volumes while retaining performance across all four datasets. A single hyperparameter configuration tuned on brain MRI transfers without modification to lung CT and cross-modal abdominal registration, matching or outperforming Adam on three of four benchmarks. We also perform ablations on the effectiveness of using a Metropolis-Hastings-style rejection step to prevent updates that worsen the loss function.
[80] LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray
Myeongkyun Kang,Yanting Yang,Xiaoxiao Li
Main category: cs.CV
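TL;DR: This paper proposes LoFi, a location-aware fine-grained representation learning method that jointly optimizes sigmoid, captioning, and location-aware captioning losses with a lightweight large language model to improve fine-grained feature learning in chest X-rays, achieving superior performance on retrieval and phrase grounding.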
Details
Motivation: Existing contrastive models lack region-level supervision, and large vision language models struggle to capture fine-grained representations in external validation, which limits retrieval and phrase grounding performance on chest X-rays.
Method: Location-aware Fine-grained representation learning (LoFi) jointly optimizes a sigmoid loss, a captioning loss, and a location-aware captioning loss using a lightweight large language model; a fine-grained encoder is integrated into retrieval-based in-context learning to strengthen chest X-ray grounding.
Result: On MIMIC-CXR and PadChest-GR, the method outperforms existing approaches on both retrieval and phrase grounding.
Conclusion: By adding region-level supervision and fine-grained representation learning, LoFi improves the grounding and retrieval of clinically relevant findings in chest X-ray images.
Abstract: Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision language models to capture fine-grained representations in external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.
[81] In-the-Wild Camouflage Attack on Vehicle Detectors through Controllable Image Editing
Xiao Fang,Yiming Gong,Stanislav Panev,Celso de Melo,Shuowen Hu,Shayok Chakraborty,Fernando De la Torre
Main category: cs.CV
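TL;DR: This paper proposes a ControlNet-based framework for vehicle camouflage attacks that formulates the attack as a conditional image-editing problem. It preserves vehicle structure and visual naturalness while markedly strengthening attacks on detectors, and shows potential for cross-model generalization and transfer to the physical world.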
Details
Motivation: Existing adversarial attack methods for vehicle camouflage struggle to combine attack effectiveness, structural fidelity, and stealthiness to human observers; a more robust and practical generative camouflage approach is needed.
Method: Formulate vehicle camouflage as a conditional image-editing task with image-level and scene-level strategies; fine-tune a ControlNet to synthesize camouflaged vehicles directly on real images; and design a unified objective that jointly enforces vehicle structural fidelity, style consistency, and adversarial effectiveness.
Result: An AP50 decrease of more than 38% on the COCO and LINZ datasets; better preservation of vehicle structure and stronger human-perceived stealthiness than baselines; effective generalization to unseen black-box detectors and promising transferability to the physical world.
Conclusion: The proposed CtrlCamo framework achieves a better balance between attack effectiveness, structural integrity, and visual naturalness, offering a feasible route to robust camouflage attacks in practical settings.
Abstract: Deep neural networks (DNNs) have achieved remarkable success in computer vision but remain highly vulnerable to adversarial attacks. Among them, camouflage attacks manipulate an object's visible appearance to deceive detectors while remaining stealthy to humans. In this paper, we propose a new framework that formulates vehicle camouflage attacks as a conditional image-editing problem. Specifically, we explore both image-level and scene-level camouflage generation strategies, and fine-tune a ControlNet to synthesize camouflaged vehicles directly on real images. We design a unified objective that jointly enforces vehicle structural fidelity, style consistency, and adversarial effectiveness. Extensive experiments on the COCO and LINZ datasets show that our method achieves significantly stronger attack effectiveness, leading to more than 38% AP50 decrease, while better preserving vehicle structure and improving human-perceived stealthiness compared to existing approaches. Furthermore, our framework generalizes effectively to unseen black-box detectors and exhibits promising transferability to the physical world. Project page is available at https://humansensinglab.github.io/CtrlCamo
[82] ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
Thomas De Min,Subhankar Roy,Stéphane Lathuilière,Elisa Ricci,Massimiliano Mancini
Main category: cs.CV
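TL;DR: This paper introduces ProactiveBench, a benchmark that evaluates how proactively multimodal large language models (MLLMs) request user interventions such as removing an occlusion; it finds that existing models generally lack proactiveness, that the ability does not improve with model scale, and that reinforcement-learning fine-tuning can improve it and generalize to unseen scenarios.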
Details
Motivation: Investigates whether MLLMs can, like humans, proactively ask the user for simple interventions (such as removing an occlusion) to improve task performance, filling a gap in research on modeling proactiveness. Method: Builds the ProactiveBench benchmark from seven repurposed datasets and evaluates 22 MLLMs; analyzes the effects of model capacity, prompt design, conversation history, and in-context learning; proposes a fine-tuning strategy based on reinforcement learning. Result: (i) Most MLLMs lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) hinting at proactiveness yields only marginal gains; (iv) conversation histories and in-context learning introduce negative biases; (v) RL fine-tuning improves proactiveness and generalizes to unseen scenarios. Conclusion: Proactiveness is a distinct, learnable capability that needs dedicated modeling and training; ProactiveBench is released as a first step toward building proactive multimodal models. Abstract: Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.
[83] Narrative Aligned Long Form Video Question Answering
Rahul Jain,Keval Doshi,Burak Uzkent,Garin Kessler
Main category: cs.CV
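TL;DR: This paper proposes the NA-VQA benchmark and the Video-NaRA framework to evaluate and improve the narrative reasoning of multimodal large language models on long videos, with a focus on building causal chains and tracking intentions across distant events.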
Details
Motivation: Existing long-video reasoning benchmarks mostly rely on localized cues and cannot evaluate narrative reasoning such as tracking intentions, connecting distant events, and reconstructing causal chains. Method: Builds the NA-VQA benchmark with 88 full-length movies and 4.4K question-answer pairs, each grounded in multiple evidence spans labeled Short, Medium, or Far; proposes the Video-NaRA framework, which builds event-level chains and stores them in a structured memory to support narrative reasoning. Result: Current state-of-the-art MLLMs perform poorly on questions requiring far-range evidence; Video-NaRA improves long-range reasoning performance by up to 3 percent. Conclusion: Explicitly modeling narrative structure is essential for deeper reasoning over long videos, and NA-VQA provides the first rigorous narrative-focused benchmark for this direction. Abstract: Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
[84] Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
Myeongkyun Kang,Soopil Kim,Xiaoxiao Li,Sang Hyun Park
Main category: cs.CV
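TL;DR: This paper proposes an instruction-free tuning method that uses image-description pairs and a momentum proxy instruction to adapt large vision language models (LVLMs) to the medical domain, substantially improving accuracy on multiple-choice visual question answering and fine-tuning efficiency.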
Details
Motivation: Building large-scale, high-quality visual instruction datasets in the medical domain is difficult because it requires specialized expert knowledge, which constrains existing visual instruction tuning methods that depend on image-instruction-output triplets. Method: Proposes instruction-free tuning: image-description pairs replace image-instruction-output triplets; a momentum proxy instruction preserves the model's instruction-following capability and steers parameter updates that remain valid at inference; a response shuffling strategy mitigates the model's over-reliance on previous words. Result: Achieves state-of-the-art accuracy on multiple-choice visual question answering across the SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets and significantly improves the fine-tuning efficiency of LVLMs in medical domains. Conclusion: The instruction-free tuning paradigm reduces dependence on manually annotated instructions while preserving or even strengthening the LVLM's instruction-following capability, improving its adaptability and practicality in data-scarce specialist domains such as medicine. Abstract: Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
[85] VeloxNet: Efficient Spatial Gating for Lightweight Embedded Image Classification
Md Meftahul Ferdaus,Elias Ioup,Mahdi Abdelguerfi,Anton Netchaev,Steven Sloan,Ken Pathak,Kendall N. Niles
Main category: cs.CV
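TL;DR: This paper proposes VeloxNet, a lightweight CNN architecture that replaces the fire modules in SqueezeNet with gated multi-layer perceptron (gMLP) blocks to improve the accuracy and parameter efficiency of aerial image classification on embedded devices.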
Details
Motivation: Deploying deep learning models on embedded devices requires balancing accuracy against strict limits on model size, memory, and latency, especially in tasks such as aerial disaster monitoring and infrastructure inspection. Method: VeloxNet uses gMLP blocks built on a spatial gating unit (SGU), which models spatial dependencies across the full feature map in a single layer; compared with the local receptive fields of the small convolutional kernels in fire modules, the SGU provides global spatial modeling with fewer parameters. Result: On the AIDER, CDD, and LDD aerial datasets, VeloxNet reduces the parameter count by 46.1% relative to SqueezeNet (from 740,970 to 399,366) and improves weighted F1 by 6.32%, 30.83%, and 2.51%, respectively, outperforming MobileNet, ShuffleNet, EfficientNet, and some ViT baselines. Conclusion: Replacing local convolutional modules with spatial gating blocks can improve both classification accuracy and parameter efficiency in resource-constrained settings, supporting the use of VeloxNet for embedded aerial image analysis. Abstract: Deploying deep learning models on embedded devices for tasks such as aerial disaster monitoring and infrastructure inspection requires architectures that balance accuracy with strict constraints on model size, memory, and latency. This paper introduces VeloxNet, a lightweight CNN architecture that replaces SqueezeNet's fire modules with gated multi-layer perceptron (gMLP) blocks for embedded image classification. Each gMLP block uses a spatial gating unit (SGU) that applies learned spatial projections and multiplicative gating, enabling the network to capture spatial dependencies across the full feature map in a single layer. Unlike fire modules, which are limited to local receptive fields defined by small convolutional kernels, the SGU provides global spatial modeling at each layer with fewer parameters. We evaluate VeloxNet on three aerial image datasets: the Aerial Image Database for Emergency Response (AIDER), the Comprehensive Disaster Dataset (CDD), and the Levee Defect Dataset (LDD), comparing against eleven baselines including MobileNet variants, ShuffleNet, EfficientNet, and recent vision transformers. VeloxNet reduces the parameter count by 46.1% relative to SqueezeNet (from 740,970 to 399,366) while improving weighted F1 scores by 6.32% on AIDER, 30.83% on CDD, and 2.51% on LDD. These results demonstrate that substituting local convolutional modules with spatial gating blocks can improve both classification accuracy and parameter efficiency for resource-constrained deployment.
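The spatial gating idea in the VeloxNet abstract can be illustrated with a minimal pure-Python sketch. It follows the general gMLP formulation (split the channels in half, project one half along the spatial axis, multiply it into the other half) and is not taken from the VeloxNet code, which has not been released. The function name, the list-based tensor layout, and the omission of normalization are all assumptions made for illustration.

```python
def spatial_gating_unit(x, W, b):
    """Illustrative spatial gating unit (hypothetical helper, not the authors' code).

    x: n tokens (flattened spatial positions), each a list of d channels, d even.
    W: n x n learned spatial projection; b: length-n bias.
    The normalization usually applied to the gate half is omitted for brevity.
    """
    n = len(x)
    half = len(x[0]) // 2
    u = [row[:half] for row in x]   # half that passes through
    v = [row[half:] for row in x]   # half that becomes the gate
    out = []
    for i in range(n):
        row = []
        for c in range(half):
            # Every output position i mixes information from all n positions,
            # which is what gives a single layer a global receptive field.
            gate = sum(W[i][j] * v[j][c] for j in range(n)) + b[i]
            row.append(u[i][c] * gate)   # multiplicative gating
        out.append(row)
    return out
```

With `W` set to zeros and `b` to ones the unit reduces to the identity on the pass-through half, which is the usual starting point for training such gates. The projection costs n × n weights per block regardless of channel count, which is one way a spatial gate can stay smaller than stacked convolutions.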
The source code will be made publicly available upon acceptance of the paper.
[86] Vision Tiny Recursion Model (ViTRM): Parameter-Efficient Image Classification via Recursive State Refinement
Ange-Clément Akazan,Abdoulaye Koroko,Verlon Roel Mbingui,Choukouriyah Arinloye,Hassan Fifen,Rose Bandolo
Main category: cs.CV
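TL;DR: This paper proposes ViTRM, a lightweight vision model based on recursive computation that replaces the conventional multi-layer ViT encoder with a single small 3-layer network applied N times, greatly reducing the parameter count while remaining competitive.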
Details
Motivation: Deep learning models such as CNNs and ViTs are effective but parameter-heavy and computationally expensive, which makes them hard to deploy in resource-constrained environments; inspired by Tiny Recursive Models, the work explores replacing model depth with recursive computation to improve parameter efficiency. Method: Proposes the Vision Tiny Recursion Model (ViTRM), which replaces the L-layer ViT encoder with a small recursive block of only k=3 layers applied N times, learning features through iterative state refinement. Result: ViTRM reaches performance comparable to mainstream models on CIFAR-10 and CIFAR-100 while using up to 6× fewer parameters than CNN-based models and up to 84× fewer than ViT. Conclusion: Recursive computation is a viable alternative to architectural depth and offers a new route to efficient, lightweight vision models. Abstract: The success of deep learning in computer vision has been driven by models of increasing scale, from deep Convolutional Neural Networks (CNN) to large Vision Transformers (ViT). While effective, these architectures are parameter-intensive and demand significant computational resources, limiting deployment in resource-constrained environments. Inspired by Tiny Recursive Models (TRM), which show that small recursive networks can solve complex reasoning tasks through iterative state refinement, we introduce the Vision Tiny Recursion Model (ViTRM): a parameter-efficient architecture that replaces the L-layer ViT encoder with a single tiny k-layer block (k=3) applied recursively N times. Despite using up to 6× and 84× fewer parameters than CNN-based models and ViT, respectively, ViTRM maintains competitive performance on CIFAR-10 and CIFAR-100. This demonstrates that recursive computation is a viable, parameter-efficient alternative to architectural depth in vision.
[87] FedAgain: A Trust-Based and Robust Federated Learning Strategy for an Automated Kidney Stone Identification in Ureteroscopy
Ivan Reyes-Amezcua,Francisco Lopez-Tiro,Clément Larose,Christian Daul,Andres Mendez-Vazquez,Gilberto Ochoa-Ruiz
Main category: cs.CV
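TL;DR: This paper proposes FedAgain, a trust-based federated learning strategy that dynamically weights client contributions through a dual trust mechanism, improving the robustness and generalization of AI in medical imaging (in particular kidney stone identification in endoscopic images); it clearly outperforms conventional federated learning methods under non-IID data and noisy clients.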
Details
Motivation: The reliability of AI in medical imaging depends heavily on robustness to multi-center, heterogeneous, and corrupted images, yet existing federated learning methods degrade markedly under non-IID data and noisy or adversarial clients. Method: Proposes the FedAgain framework, whose dual trust mechanism combines benchmark reliability and model divergence to dynamically weight client model updates, enabling cross-institution collaborative training and stable convergence while preserving data privacy. Result: Validated on five datasets (MNIST, CIFAR-10, two private multi-institutional kidney stone datasets, and MyStone), FedAgain consistently outperforms standard federated learning baselines in non-IID and corrupted-client scenarios while maintaining high diagnostic accuracy and performance stability. Conclusion: FedAgain offers a more reliable, privacy-preserving, and clinically deployable federated AI solution for medical imaging. Abstract: The reliability of artificial intelligence (AI) in medical imaging critically depends on its robustness to heterogeneous and corrupted images acquired with diverse devices across different hospitals, which is highly challenging. Therefore, this paper introduces FedAgain, a trust-based Federated Learning (FL) strategy designed to enhance robustness and generalization for automated kidney stone identification from endoscopic images. FedAgain integrates a dual trust mechanism that combines benchmark reliability and model divergence to dynamically weight client contributions, mitigating the impact of noisy or adversarial updates during aggregation. The framework enables the training of collaborative models across multiple institutions while preserving data privacy and promoting stable convergence under real-world conditions. Extensive experiments across five datasets, including two canonical benchmarks (MNIST and CIFAR-10), two private multi-institutional kidney stone datasets, and one public dataset (MyStone), demonstrate that FedAgain consistently outperforms standard Federated Learning baselines under non-identically and independently distributed (non-IID) data and corrupted-client scenarios. By maintaining diagnostic accuracy and performance stability under varying conditions, FedAgain represents a practical advance toward reliable, privacy-preserving, and clinically deployable federated AI for medical imaging.
[88] Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
Sheng Lu,Hao Chen,Rui Yin,Juyan Ba,Yu Zhang,Yuanzhe Li
Main category: cs.CV
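TL;DR: This paper introduces Gastric-X, a large-scale multimodal benchmark dataset for gastric cancer analysis with 1.7K clinical cases covering CT scans, endoscopic images, biochemical indicators, diagnostic reports, and tumor localization annotations, and systematically evaluates current vision-language models on five core clinical tasks.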
Details
Motivation: Vision-language models perform well in natural domains, but their use in medical diagnosis, gastric cancer in particular, is limited mainly by the lack of comprehensive, structured multimodal datasets that reflect real clinical workflows. Method: Builds the Gastric-X dataset with paired resting and dynamic CT scans, endoscopic images, structured biochemical indicators, expert-authored diagnostic reports, and bounding box annotations of tumor regions; systematically evaluates mainstream VLMs on five tasks: VQA, report generation, cross-modal retrieval, disease classification, and lesion localization. Result: Provides the first comprehensive evaluation of VLMs on multimodal clinical tasks in gastric cancer, revealing their limitations in relating biochemical signals, spatial tumor features, and textual reports. Conclusion: Gastric-X offers a key benchmark and resource for developing clinically oriented VLMs, with the aim of bringing machine intelligence closer to physicians' cognitive and evidence-based reasoning. Abstract: Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, endoscopic image, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.
[89] ReXInTheWild: A Unified Benchmark for Medical Photograph Understanding
Oishi Banerjee,Sung Eun Kim,Alexandra N. Willauer,Julius M. Kernbach,Abeer Rihan Alomaish,Reema Abdulwahab S. Alghamdi,Hassan Rayhan Alomaish,Mohammed Baharoon,Xiaoman Zhang,Julian Nicolas Acosta,Christine Zhou,Pranav Rajpurkar
Main category: cs.CV
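TL;DR: This paper presents the ReXInTheWild benchmark for evaluating how well vision-language models understand and reason about real-world everyday medical photographs; it finds that general-purpose multimodal models outperform a specialized medical model and analyzes common error types.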
Details
Motivation: No comprehensive benchmark evaluates whether vision-language models can combine fine-grained natural image understanding with domain-specific medical reasoning on real everyday medical images, such as clinical photographs taken with ordinary cameras. Method: Builds ReXInTheWild, a benchmark of 955 clinician-verified multiple-choice questions spanning seven clinical topics across 484 photographs sourced from the biomedical literature; systematically evaluates leading multimodal large language models (such as Gemini-3, Claude Opus 4.5, GPT-5, and MedGemma) and performs an error attribution analysis. Result: Gemini-3 achieves 78% accuracy, Claude Opus 4.5 72%, and GPT-5 68%, while the medical specialist model MedGemma reaches only 37%; the error analysis reveals four typical problem types: low-level geometric errors, semantic recognition errors, missing clinical knowledge, and high-level reasoning failures. Conclusion: ReXInTheWild fills a gap in evaluation at the intersection of natural image understanding and clinical reasoning, shows that general-purpose multimodal models can outperform specialized models on real-world medical visual question answering, and points to the need for targeted mitigation strategies. Abstract: Everyday photographs taken with ordinary cameras are already widely used in telemedicine and other online health conversations, yet no comprehensive benchmark evaluates whether vision-language models can interpret their medical content. Analyzing these images requires both fine-grained natural image understanding and domain-specific medical reasoning, a combination that challenges both general-purpose and specialized models. We introduce ReXInTheWild, a benchmark of 955 clinician-verified multiple-choice questions spanning seven clinical topics across 484 photographs sourced from the biomedical literature. When evaluated on ReXInTheWild, leading multimodal large language models show substantial performance variation: Gemini-3 achieves 78% accuracy, followed by Claude Opus 4.5 (72%) and GPT-5 (68%), while the medical specialist model MedGemma reaches only 37%. A systematic error analysis also reveals four categories of common errors, ranging from low-level geometric errors to high-level reasoning failures and requiring different mitigation strategies. ReXInTheWild provides a challenging, clinically grounded benchmark at the intersection of natural image understanding and medical reasoning. The dataset is available on HuggingFace.
[90] Recognising BSL Fingerspelling in Continuous Signing Sequences
Alyssa Chan,Taein Kwon,Andrew Zisserman
Main category: cs.CV
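TL;DR: This paper introduces FS23K, a new large-scale British Sign Language (BSL) fingerspelling dataset, together with a recognition model; with refined annotations and explicit modeling of bi-manual interactions and mouthing cues, it halves the character error rate (CER).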
Details
Motivation: BSL fingerspelling recognition is challenging because signing is rapid and native signers often omit letters, and existing datasets are either small or inaccurate in their temporal and letter-level annotations. Method: Builds the large-scale FS23K dataset with an iterative annotation framework and designs a fingerspelling recognition model that explicitly accounts for bi-manual interactions and mouthing cues. Result: With the refined annotations, the approach reduces the character error rate (CER) by 50% relative to the prior state of the art. Conclusion: The method improves BSL fingerspelling recognition and can support future research in sign language understanding and scalable, automated annotation pipelines. Abstract: Fingerspelling is a critical component of British Sign Language (BSL), used to spell proper names, technical terms, and words that lack established lexical signs. Fingerspelling recognition is challenging due to the rapid pace of signing and common letter omissions by native signers, while existing BSL fingerspelling datasets are either small in scale or temporally and letter-wise inaccurate. In this work, we introduce a new large-scale BSL fingerspelling dataset, FS23K, constructed using an iterative annotation framework. In addition, we propose a fingerspelling recognition model that explicitly accounts for bi-manual interactions and mouthing cues. As a result, with refined annotations, our approach halves the character error rate (CER) compared to the prior state of the art on fingerspelling recognition. These findings demonstrate the effectiveness of our method and highlight its potential to support future research in sign language understanding and scalable, automated annotation pipelines. The project page can be found at https://taeinkwon.com/projects/fs23k/.
[91] SurfaceXR: Fusing Smartwatch IMUs and Egocentric Hand Pose for Seamless Surface Interactions
Vasco Xu,Brian Chen,Eric J. Gonzalez,Andrea Colaço,Henry Hoffmann,Mar Gonzalez-Franco,Karan Ahuja
Main category: cs.CV
TL;DR: SurfaceXR is a sensor fusion method that combines headset-based hand tracking and smartwatch IMU data to improve accuracy and comfort of surface-based interactions in XR.
Details
Motivation: Mid-air gestures in XR cause fatigue and imprecision; surface-based interactions are better but current egocentric vision methods fail due to hand tracking challenges and unreliable surface plane estimation. Method: SurfaceXR uses sensor fusion of headset-based hand tracking and smartwatch IMU data, leveraging their complementary strengths: 3D position from hand tracking and high-frequency motion from IMUs. Result: A 21-participant study shows SurfaceXR significantly improves touch tracking and 8-class gesture recognition over single-modality approaches. Conclusion: SurfaceXR enables robust, accurate, and comfortable surface-based interaction in XR by effectively fusing headset and smartwatch sensors. Abstract: Mid-air gestures in Extended Reality (XR) often cause fatigue and imprecision. Surface-based interactions offer improved accuracy and comfort, but current egocentric vision methods struggle due to hand tracking challenges and unreliable surface plane estimation. We introduce SurfaceXR, a sensor fusion approach combining headset-based hand tracking with smartwatch IMU data to enable robust inputs on everyday surfaces. Our insight is that these modalities are complementary: hand tracking provides 3D positional data while IMUs capture high-frequency motion. A 21-participant study validates SurfaceXR's effectiveness for touch tracking and 8-class gesture recognition, demonstrating significant improvements over single-modality approaches.
[92] dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3
Saikat Dutta,Biplab Banerjee,Hamid Rezatofighi
Main category: cs.CV
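TL;DR: This paper proposes dinov3.seg, a framework designed for open-vocabulary semantic segmentation (OVSS) that combines a task-specific architecture, joint use of global and local text-image alignment, staged feature refinement, and a high-resolution local-global inference strategy, and outperforms existing state-of-the-art methods on multiple benchmarks.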
Details
Motivation: Existing OVSS methods built on vision-language models (VLMs) are limited by representations learned through global contrastive objectives, which are suboptimal for dense prediction; they depend on post hoc similarity refinement and struggle to achieve high spatial precision and robustness in complex, cluttered scenes. Method: Proposes the dinov3.seg framework: (1) a task-specific architecture tailored to the dinov3.txt backbone; (2) joint use of text embeddings aligned with both the [CLS] token and local patch-level visual features from the ViT encoder; (3) a two-stage refinement strategy with early refinement of visual representations followed by late refinement of image-text correlation features; (4) a high-resolution local-global inference mechanism based on sliding-window aggregation. Result: Outperforms current state-of-the-art methods on five widely adopted OVSS benchmarks, showing stronger effectiveness and robustness, with particularly clear gains in complex, cluttered scenes. Conclusion: Through structured architecture design and multi-stage refinement of image-text features, dinov3.seg overcomes the limitations of generic VLM representations in dense prediction and offers a more reliable and precise approach to open-vocabulary semantic segmentation. Abstract: Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of text-defined categories, demanding reliable generalization to unseen classes at inference. Although modern vision-language models (VLMs) support strong open-vocabulary recognition, their representations learned through global contrastive objectives remain suboptimal for dense prediction, prompting many OVSS methods to depend on limited adaptation or refinement of image-text similarity maps. This, in turn, restricts spatial precision and robustness in complex, cluttered scenes. We introduce dinov3.seg, extending dinov3.txt into a dedicated framework for OVSS. Our contributions are four-fold. First, we design a task-specific architecture tailored to this backbone, systematically adapting established design principles from prior open-vocabulary segmentation work. Second, we jointly leverage text embeddings aligned with both the global [CLS] token and local patch-level visual features from the ViT-based encoder, effectively combining semantic discrimination with fine-grained spatial locality. Third, unlike prior approaches that rely primarily on post hoc similarity refinement, we perform early refinement of visual representations prior to image-text interaction, followed by late refinement of the resulting image-text correlation features, enabling more accurate and robust dense predictions in cluttered scenes. Finally, we propose a high-resolution local-global inference strategy based on sliding-window aggregation, which preserves spatial detail while maintaining global context.
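The sliding-window aggregation mentioned in the last contribution can be sketched generically: run a per-window predictor over overlapping crops and average the scores wherever windows overlap. The sketch below is a plain-Python illustration under stated assumptions, not the dinov3.seg implementation. The helper names, the uniform averaging, and the single-channel score map are hypothetical, and how the paper combines the windowed result with global context is not specified in the abstract, so that step is left out.

```python
def window_starts(size, win, stride):
    """Start offsets that cover [0, size) with windows of length win."""
    if size <= win:
        return [0]
    starts = list(range(0, size - win + 1, stride))
    if starts[-1] != size - win:
        starts.append(size - win)   # extra window so the border is covered
    return starts


def sliding_window_aggregate(image, predict, win, stride):
    """Average per-window score maps over a 2D image (list of rows).

    predict(crop) must return a score map with the same shape as crop.
    """
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    for y in window_starts(h, win, stride):
        for x in window_starts(w, win, stride):
            crop = [row[x:x + win] for row in image[y:y + win]]
            scores = predict(crop)
            for dy in range(len(crop)):
                for dx in range(len(crop[0])):
                    total[y + dy][x + dx] += scores[dy][dx]
                    count[y + dy][x + dx] += 1
    # Each pixel is covered by at least one window, so count is never zero.
    return [[total[i][j] / count[i][j] for j in range(w)] for i in range(h)]
```

Processing crops at native resolution keeps fine detail that would be lost by downscaling the whole image, while the averaging smooths seams between neighbouring windows.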
We conduct extensive experiments on five widely adopted OVSS benchmarks to evaluate our approach. The results demonstrate its effectiveness and robustness, consistently outperforming current state-of-the-art methods.
[93] Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion
Sima Ashayer,Hoang H. Nguyen,Yu Liang,Mina Sartipi
Main category: cs.CV
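TL;DR: This paper proposes a lightweight, socially informed architecture for pedestrian intention prediction that fuses four behavioral streams (attention, position, situation, and interaction) and quantifies uncertainty with a variational bottleneck and a Mahalanobis distance detector, achieving strong, interpretable performance on the PSI 1.0 and PSI 2.0 datasets.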
Details
Motivation: Autonomous vehicles need accurate pedestrian intention prediction to navigate safely in urban environments, and existing methods fall short in efficiency, uncertainty modeling, and interpretability. Method: Proposes a lightweight architecture that fuses four behavioral streams (attention, position, situation, interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling; adds two heads, a variational bottleneck (capturing epistemic uncertainty) and a Mahalanobis distance detector (identifying distributional shift). Result: Achieves 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC on PSI 1.0; establishes the first baseline on PSI 2.0 (0.78 F1, 0.79 AUC-ROC); selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage; heatmaps show the model dynamically adjusting its cross-stream attention in ambiguous scenes. Conclusion: The approach is efficient, interpretable, and modality-agnostic, suitable for risk-aware intent prediction on resource-constrained platforms, and easy to integrate with vision language pipelines. Abstract: Pedestrian intention prediction needs to be accurate for autonomous vehicles to navigate safely in urban environments. We present a lightweight, socially informed architecture for pedestrian intention prediction. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision language models by achieving 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC by using only structured, interpretable features. On the more diverse PSI 2.0 dataset, where, to the best of our knowledge, no prior results exist, we establish a strong initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage. Qualitative attention heatmaps further show how the model shifts its cross-stream focus under ambiguity. The proposed approach is modality-agnostic, easy to integrate with vision language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.
[94] MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane
Changwoo Jeon,Rishi Upadhyay,Achuta Kadambi
Main category: cs.CV
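TL;DR: This paper proposes MoCA3D, a monocular, class-agnostic 3D detection model that directly predicts projected 3D bounding box corners in the image plane and their per-corner depths without camera intrinsics; it uses dense prediction (corner heatmaps and depth maps), introduces the Pixel-Aligned Geometry (PAG) metric, and substantially improves image-plane geometric accuracy with far fewer parameters.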
Details
Motivation: Monocular 3D object understanding has mostly been treated as lifting a 2D RoI to a 3D box, which makes it hard to obtain reliable image-plane geometry (such as projected corners) when camera intrinsics are unknown, limiting in-the-wild use and downstream applications. Method: MoCA3D formulates pixel-space localization and depth estimation as dense prediction, using corner heatmaps and depth maps to jointly predict the 8 corners of the projected 3D box and their per-corner depths, with no dependence on camera intrinsics. Result: Improves image-plane corner PAG by 22.8% while remaining comparable to the state of the art on 3D IoU, with up to 57 times fewer parameters; it is also applied successfully to downstream tasks that depend on image-plane geometry. Conclusion: MoCA3D addresses image-plane geometric modeling for monocular 3D detection under unknown intrinsics, combining accuracy, a small parameter footprint, and generalization, and broadens the practical applicability of monocular 3D understanding. Abstract: Monocular 3D object understanding has largely been cast as a 2D RoI-to-3D box lifting problem. However, emerging downstream applications require image-plane geometry (e.g., projected 3D box corners) which cannot be easily obtained without known intrinsics, a problem for object detection in the wild. We introduce MoCA3D, a Monocular, Class-Agnostic 3D model that predicts projected 3D bounding box corners and per-corner depths without requiring camera intrinsics at inference time. MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps. To evaluate image-plane geometric fidelity, we propose Pixel-Aligned Geometry (PAG), which directly measures image-plane corner and depth consistency. Extensive experiments demonstrate that MoCA3D achieves state-of-the-art performance, improving image-plane corner PAG by 22.8% while remaining comparable on 3D IoU, using up to 57 times fewer trainable parameters. Finally, we apply MoCA3D to downstream tasks which were previously impractical under unknown intrinsics, highlighting its utility beyond standard baseline models.
[95] SeeClear: Reliable Transparent Object Depth Estimation via Generative Opacification
Xiaoying Wang,Yumeng He,Jingkai Shi,Jiayin Lu,Yin Yang,Ying Jiang,Chenfanfu Jiang
Main category: cs.CV
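TL;DR: This paper proposes SeeClear, a framework that uses a diffusion model to convert transparent objects into opaque images, improving monocular depth estimation on transparent objects.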
Details
Motivation: Monocular depth estimation performs poorly on transparent objects because refraction and transmission are difficult to model and break the appearance assumptions that depth networks rely on. Method: Proposes the SeeClear framework: it first localizes transparent regions, then uses a diffusion-based generative opacification module to transform their refractive appearance into geometrically consistent opaque shapes; the processed image is fed to an off-the-shelf monocular depth estimator without retraining or architectural changes; the module is trained on SeeClear-396k, a purpose-built synthetic dataset of 396k paired transparent-opaque renderings. Result: Experiments on synthetic and real-world datasets show that SeeClear significantly improves the accuracy and stability of depth estimation for transparent objects. Conclusion: SeeClear offers a new way to handle transparent objects without modifying the depth estimator and demonstrates the potential of generative preprocessing for monocular depth estimation. Abstract: Monocular depth estimation remains challenging for transparent objects, where refraction and transmission are difficult to model and break the appearance assumptions used by depth networks. As a result, state-of-the-art estimators often produce unstable or incorrect depth predictions for transparent materials. We propose SeeClear, a novel framework that converts transparent objects into generative opaque images, enabling stable monocular depth estimation for transparent objects. Given an input image, we first localize transparent regions and transform their refractive appearance into geometrically consistent opaque shapes using a diffusion-based generative opacification module. The processed image is then fed into an off-the-shelf monocular depth estimator without retraining or architectural changes. To train the opacification model, we construct SeeClear-396k, a synthetic dataset containing 396k paired transparent-opaque renderings. Experiments on both synthetic and real-world datasets show that SeeClear significantly improves depth estimation for transparent objects. Project page: https://heyumeng.com/SeeClear-web/
[96] StreetForward: Perceiving Dynamic Street with Feedforward Causal Attention
Zhongrui Yu,Zhao Wang,Yijia Xie,Yida Wang,Xueyang Zhang,Yifei Zhan,Kun Zhan
Main category: cs.CV
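TL;DR: This paper proposes StreetForward, a pose-free and tracker-free feedforward framework for dynamic street reconstruction that captures motion with a temporal mask attention mechanism and represents static and dynamic content uniformly with 3D Gaussian Splatting, achieving high-quality novel view synthesis and depth estimation.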
Details
Motivation: Autonomous driving needs rapid scene reconstruction so that large-scale driving datasets can be used efficiently in closed-loop simulation and other downstream tasks without time-consuming per-scene optimization. Method: Building on the alternating attention mechanism of VGGT, designs a temporal mask attention module that extracts dynamic motion information; represents static and dynamic content uniformly with 3D Gaussian Splatting and optimizes them jointly through cross-frame rendering with spatio-temporal consistency. Result: Clearly outperforms existing methods on novel view synthesis and depth estimation on the Waymo Open Dataset; zero-shot transfer to CARLA and other datasets validates its generalization. Conclusion: StreetForward delivers efficient, high-quality, and generalizable feedforward reconstruction of dynamic street scenes, offering a practical approach for autonomous driving simulation and perception. Abstract: Feedforward reconstruction is crucial for autonomous driving applications, where rapid scene reconstruction enables efficient utilization of large-scale driving datasets in closed-loop simulation and other downstream tasks, eliminating the need for time-consuming per-scene optimization. We present StreetForward, a pose-free and tracker-free feedforward framework for dynamic street reconstruction. Building upon the alternating attention mechanism from Visual Geometry Grounded Transformer (VGGT), we propose a simple yet effective temporal mask attention module that captures dynamic motion information from image sequences and produces motion-aware latent representations. Static content and dynamic instances are represented uniformly with 3D Gaussian Splatting, and are optimized jointly by cross-frame rendering with spatio-temporal consistency, allowing the model to infer per-pixel velocities and produce high-fidelity novel views at new poses and times. We train and evaluate our model on the Waymo Open Dataset, demonstrating superior performance on novel view synthesis and depth estimation compared to existing methods. Furthermore, zero-shot inference on CARLA and other datasets validates the generalization capability of our approach. More visualizations are available on our project page: https://streetforward.github.io.
[97] Dual-Domain Representation Alignment: Bridging 2D and 3D Vision via Geometry-Aware Architecture Search
Haoyu Zhang,Zhihao Yu,Rui Wang,Yaochu Jin,Qiqi Liu,Ran Cheng
Main category: cs.CV
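TL;DR: This paper proposes the EvoNAS framework, which uses a hybrid supernet (VSS-ViT) and Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD) to improve the efficiency and ranking consistency of multi-objective neural architecture search, adds Distributed Multi-Model Parallel Evaluation (DMMPE) to cut validation cost, and reaches Pareto-optimal accuracy-efficiency trade-offs on several vision tasks.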
Details
Motivation: Deploying large vision models (LVMs) on edge devices is limited by high inference cost, and existing Evolutionary Neural Architecture Search (ENAS) suffers from two problems: expensive candidate evaluation and inconsistent ranking among subnetworks. Method: Proposes EvoNAS: (1) a hybrid supernet that integrates Vision State Space and Vision Transformer modules; (2) Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD) to improve representational capacity and ranking consistency; (3) Distributed Multi-Model Parallel Evaluation (DMMPE), based on GPU resource pooling and asynchronous scheduling, to accelerate large-scale validation. Result: On COCO, ADE20K, KITTI, and NYU-Depth v2, the searched EvoNets achieve Pareto-optimal accuracy-efficiency trade-offs; compared with CNN-, ViT-, and Mamba-based baselines, they deliver lower latency and higher throughput under strict computational budgets and generalize well to downstream tasks such as novel view synthesis; DMMPE improves efficiency by over 70% relative to conventional data-parallel evaluation. Conclusion: EvoNAS addresses the efficiency and reliability bottlenecks of ENAS for edge vision models and offers a scalable, practical approach to automated design of efficient vision architectures in resource-constrained settings. Abstract: Modern computer vision requires balancing predictive accuracy with real-time efficiency, yet the high inference cost of large vision models (LVMs) limits deployment on resource-constrained edge devices. Although Evolutionary Neural Architecture Search (ENAS) is well suited for multi-objective optimization, its practical use is hindered by two issues: expensive candidate evaluation and ranking inconsistency among subnetworks. To address them, we propose EvoNAS, an efficient distributed framework for multi-objective evolutionary architecture search. We build a hybrid supernet that integrates Vision State Space and Vision Transformer (VSS-ViT) modules, and optimize it with a Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD) strategy. By coupling the computational efficiency of VSS blocks with the semantic expressiveness of ViT modules, CA-DDKD improves the representational capacity of the shared supernet and enhances ranking consistency, enabling reliable fitness estimation during evolution without extra fine-tuning. To reduce the cost of large-scale validation, we further introduce a Distributed Multi-Model Parallel Evaluation (DMMPE) framework based on GPU resource pooling and asynchronous scheduling. Compared with conventional data-parallel evaluation, DMMPE improves efficiency by over 70% through concurrent multi-GPU, multi-model execution.
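Multi-objective architecture search of the kind described above typically ranks candidates by Pareto dominance rather than by a single score. The sketch below shows a generic non-dominated filter over (error, latency) pairs. It is a textbook illustration, not the EvoNAS selection procedure, which the abstract does not detail. The function names and the choice of two objectives, both minimized, are assumptions.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one. All objectives are minimized."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical (error rate, latency in ms) pairs for five sub-networks.
nets = [(0.30, 10.0), (0.25, 12.0), (0.35, 9.0), (0.30, 12.0), (0.40, 11.0)]
front = pareto_front(nets)
```

Here `front` keeps the first three candidates: each is best on at least one objective relative to any competitor, whereas `(0.30, 12.0)` and `(0.40, 11.0)` are both beaten by `(0.30, 10.0)`.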
Experiments on COCO, ADE20K, KITTI, and NYU-Depth v2 show that the searched architectures, termed EvoNets, consistently achieve Pareto-optimal trade-offs between accuracy and efficiency. Compared with representative CNN-, ViT-, and Mamba-based models, EvoNets deliver lower inference latency and higher throughput under strict computational budgets while maintaining strong generalization on downstream tasks such as novel view synthesis. Code is available at https://github.com/EMI-Group/evonas
[98] PFM-VEPAR: Prompting Foundation Models for RGB-Event Camera based Pedestrian Attribute Recognition
Minghe Xu,Rouying Wu,ChiaWei Chu,Xiao Wang,Yu Li
Main category: cs.CV
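TL;DR: This paper proposes Event Prompter, which extracts frequency-domain features from event data with lightweight DCT/IDCT operations, combines an external memory bank with modern Hopfield networks for associative memory-augmented representation learning, and fuses RGB and event data via cross-modal attention, significantly improving pedestrian attribute recognition in low-light and motion-blur scenes.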
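As a rough illustration of the DCT/IDCT step named in the summary above, the following pure-Python sketch applies an orthonormal 1D DCT-II, keeps only the lowest-frequency coefficients, and inverts the result. The function names, the `keep` parameter, and the low-pass choice are assumptions made for illustration; the entry does not say how Event Prompter actually selects or weights frequency components.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        for k in range(1, n):
            s += c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out


def low_pass(signal, keep):
    """Hypothetical filter: zero every coefficient with index >= keep, then invert."""
    coeffs = dct(signal)
    filtered = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    return idct(filtered)
```

Because the transform is orthonormal, `idct(dct(x))` reproduces `x` up to floating-point error, so any change in the output comes only from the coefficients that were zeroed.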
Details
Motivation: Existing event-based pedestrian attribute recognition methods incur high computational overhead and ignore the guidance available from contextual samples.
Method: Proposes the Event Prompter module, which processes event data with lightweight DCT/IDCT; introduces an external memory bank with modern Hopfield networks for associative memory augmentation; fuses the RGB and event modalities with cross-attention.
Result: Experiments on multiple benchmark datasets validate the proposed RGB-Event PAR framework, with significantly improved recognition accuracy in low-light and motion-blur scenes.
Conclusion: Event Prompter enhances the event modality at very low computational cost and uses associative memory to mine global relations across samples, offering a new direction for efficient, robust multimodal pedestrian attribute recognition.
Abstract: Event-based pedestrian attribute recognition (PAR) leverages motion cues to enhance RGB cameras in low-light and motion-blur scenarios, enabling more accurate inference of attributes like age and emotion. However, existing two-stream multimodal fusion methods introduce significant computational overhead and neglect the valuable guidance from contextual samples. To address these limitations, this paper proposes an Event Prompter. Discarding the computationally expensive auxiliary backbone, this module directly applies extremely lightweight and efficient Discrete Cosine Transform (DCT) and Inverse DCT (IDCT) operations to the event data. This design extracts frequency-domain event features at a minimal computational cost, thereby effectively augmenting the RGB branch. Furthermore, an external memory bank designed to provide rich prior knowledge, combined with modern Hopfield networks, enables associative memory-augmented representation learning. This mechanism effectively mines and leverages global relational knowledge across different samples. Finally, a cross-attention mechanism fuses the RGB and event modalities, followed by feed-forward networks for attribute prediction. Extensive experiments on multiple benchmark datasets fully validate the effectiveness of the proposed RGB-Event PAR framework. The source code of this paper will be released at https://github.com/Event-AHU/OpenPAR
[99] PhyUnfold-Net: Advancing Remote Sensing Change Detection with Physics-Guided Deep Unfolding
Zelin Lei,Yaoxing Ren,Jiaming Chang
Main category: cs.CV
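TL;DR: This paper proposes PhyUnfold-Net, a deep unfolding framework guided by a physical prior that iteratively decomposes difference features to make bi-temporal remote sensing change detection more robust, suppressing pseudo changes caused by illumination, seasonal, and atmospheric differences.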
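The physical prior in this entry is a patch-wise singular-value entropy (SVE) that is reported to be higher for genuine changes than for pseudo changes. A minimal sketch of one plausible definition follows, assuming SVE is the Shannon entropy of a patch's singular values normalized to sum to one; that normalization, the function name, and the `eps` guard are assumptions, and the singular values are taken as already computed by any SVD routine.

```python
import math


def singular_value_entropy(singular_values, eps=1e-12):
    """Shannon entropy of singular values normalized to sum to 1 (assumed definition)."""
    total = sum(singular_values)
    if total <= eps:
        return 0.0  # an all-zero patch carries no spectrum to measure
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > eps:
            entropy -= p * math.log(p)
    return entropy
```

Under this definition a rank-one patch (one non-zero singular value) scores 0, while a patch whose n singular values are all equal scores log(n), the maximum.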
Details
Motivation: Bi-temporal change detection is highly sensitive to differences in imaging conditions (illumination, season, atmosphere), which readily cause false detections; the authors observe that genuine changes have higher patch-wise singular-value entropy (SVE) in the feature-difference space than pseudo changes, and use this physical prior to guide model design.
Method: Proposes PhyUnfold-Net, comprising an Iterative Change Decomposition Module (ICDM) that explicitly models feature decomposition, a staged Exploration-and-Constraint loss (S-SEC) that stabilizes training, and a Wavelet Spectral Suppression Module (WSSM) that pre-processes spectral mismatch.
Result: Outperforms existing state-of-the-art methods on four benchmark datasets, with notable gains under complex imaging conditions.
Conclusion: Combining physical priors with deep unfolding can disentangle genuine changes from acquisition-induced interference, offering a new direction for robust change detection.
Abstract: Bi-temporal change detection is highly sensitive to acquisition discrepancies, including illumination, season, and atmosphere, which often cause false alarms. We observe that genuine changes exhibit higher patch-wise singular-value entropy (SVE) than pseudo changes in the feature-difference space. Motivated by this physical prior, we propose PhyUnfold-Net, a physics-guided deep unfolding framework that formulates change detection as an explicit decomposition problem. The proposed Iterative Change Decomposition Module (ICDM) unrolls a multi-step solver to progressively separate mixed discrepancy features into a change component and a nuisance component. To stabilize this process, we introduce a staged Exploration-and-Constraint loss (S-SEC), which encourages component separation in early steps while constraining nuisance magnitude in later steps to avoid degenerate solutions. We further design a Wavelet Spectral Suppression Module (WSSM) to suppress acquisition-induced spectral mismatch before decomposition. Experiments on four benchmarks show improvements over state-of-the-art methods, with gains under challenging conditions.
[100] Efficiency Follows Global-Local Decoupling
Zhenyu Yang,Gensheng Pei,Tao Chen,Yichao Zhou,Tianfei Zhou,Yazhou Yao,Fumin Shen
Main category: cs.CV
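TL;DR: This paper proposes ConvNeur, a two-branch architecture that decouples global reasoning from local representation: a lightweight neural memory branch aggregates global context, a locality-preserving branch extracts detail, and a learned gate mediates between them, yielding efficient and accurate visual modeling.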
Details
Motivation: Modern vision models must capture image-level context without sacrificing local detail while staying computationally affordable; existing methods are limited in how they handle the global-local trade-off.
Method: Proposes the two-branch ConvNeur architecture: a lightweight neural memory branch aggregates global context over a compact token set; a locality-preserving branch extracts fine structure; a learned gate lets global cues modulate local features without entangling the two objectives.
Result: On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable methods at similar or lower compute and offers better accuracy-latency trade-offs.
Conclusion: Decoupling global and local processing improves efficiency and performance, supporting this principle for building efficient vision models.
Abstract: Modern vision models must capture image-level context without sacrificing local detail while remaining computationally affordable. We revisit this tradeoff and advance a simple principle: decouple the roles of global reasoning and local representation. To operationalize this principle, we introduce ConvNeur, a two-branch architecture in which a lightweight neural memory branch aggregates global context on a compact set of tokens, and a locality-preserving branch extracts fine structure. A learned gate lets global cues modulate local features without entangling their objectives. This separation yields subquadratic scaling with image size, retains inductive priors associated with local processing, and reduces overhead relative to fully global attention. On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable alternatives at similar or lower compute and offers favorable accuracy versus latency trade-offs at similar budgets. These results support the view that efficiency follows global-local decoupling.
[101] Accelerating Diffusion Decoders via Multi-Scale Sampling and One-Step Distillation
Chuhan Wang,Hao Chen
Main category: cs.CV
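TL;DR: This paper proposes a two-stage acceleration framework for diffusion-based image token decoders that substantially reduces latency while preserving high-quality reconstruction.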
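The multi-scale part of this framework decodes at a coarse resolution and doubles the resolution at each stage. The sketch below only works out the resulting schedule and a crude pixel-count cost; `resolution_schedule`, `pixel_evaluations`, and the cost model (one unit of work per output pixel per denoising step) are assumptions for illustration, not the paper's accounting.

```python
def resolution_schedule(target, coarsest):
    """Resolutions visited when doubling from `coarsest` up to `target`."""
    if coarsest <= 0 or target < coarsest:
        raise ValueError("need 0 < coarsest <= target")
    schedule = [coarsest]
    while schedule[-1] < target:
        # Clamp the last stage so a non-power-of-two ratio still ends at `target`.
        schedule.append(min(schedule[-1] * 2, target))
    return schedule


def pixel_evaluations(schedule, steps_per_scale=1):
    """Crude cost proxy: denoising steps times pixels, summed over all scales."""
    return sum(steps_per_scale * r * r for r in schedule)


stages = resolution_schedule(256, 32)   # [32, 64, 128, 256]
one_step_cost = pixel_evaluations(stages)  # 87040 pixel evaluations
```

The number of stages grows with the logarithm of `target / coarsest`, and with one step per scale the total cost is dominated by the final, full-resolution pass.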
Details
Motivation: Diffusion-based decoders achieve high-fidelity image reconstruction, but their iterative sampling causes high latency, making them hard to use in real-time or large-scale settings.
Method: (1) A multi-scale sampling strategy that starts decoding at a coarse resolution and doubles the resolution step by step; (2) distillation of the diffusion decoder at each scale into a single-step denoising model, so each scale needs one forward pass.
Result: Decoding time drops by an order of magnitude with almost no loss in reconstruction quality.
Conclusion: The method offers a practical path to efficient yet expressive image tokenizers and may support later work on efficient visual tokenization and generation.
Abstract: Image tokenization plays a central role in modern generative modeling by mapping visual inputs into compact representations that serve as an intermediate signal between pixels and generative models. Diffusion-based decoders have recently been adopted in image tokenization to reconstruct images from latent representations with high perceptual fidelity. In contrast to diffusion models used for downstream generation, these decoders are dedicated to faithful reconstruction rather than content generation. However, their iterative sampling process introduces significant latency, making them impractical for real-time or large-scale applications. In this work, we introduce a two-stage acceleration framework to address this inefficiency. First, we propose a multi-scale sampling strategy, where decoding begins at a coarse resolution and progressively refines the output by doubling the resolution at each stage, achieving a theoretical speedup of $\mathcal{O}(\log n)$ compared to standard full-resolution sampling. Second, we distill the diffusion decoder at each scale into a single-step denoising model, enabling fast and high-quality reconstructions in a single forward pass per scale. Together, these techniques yield an order-of-magnitude reduction in decoding time with little degradation in output quality. Our approach provides a practical pathway toward efficient yet expressive image tokenizers. We hope it serves as a foundation for future work in efficient visual tokenization and downstream generation.
[102] CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management
Chao Wang,Xudong Tan,Jianjian Cao,Kangcong Li,Tao Chen
Main category: cs.CV
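TL;DR: This paper proposes CurveStream, a training-free, curvature-aware hierarchical visual memory management framework that addresses the out-of-memory errors and catastrophic forgetting caused by the linear explosion of visual tokens when multimodal large language models process streaming video. It scores real-time semantic intensity with a curvature score and, using an online K-Sigma dynamic threshold, adaptively assigns frames to clear or fuzzy memory states under a strict token budget, significantly improving streaming video perception.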
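The summary above combines a per-frame curvature score with an online K-Sigma threshold. One way this could look is sketched below, assuming the score is the turning angle between successive displacement vectors of the frame features and that frames scoring above mean + k·std of past scores go to the clear state; both choices, and all names here, are assumptions, since the entry does not give the actual Curvature Score formula or routing rule.

```python
import math


def turning_angle(prev, cur, nxt):
    """Angle in radians between successive displacements of a feature trajectory."""
    a = [c - p for p, c in zip(prev, cur)]
    b = [n - c for c, n in zip(cur, nxt)]
    na = math.sqrt(sum(v * v for v in a))
    nb = math.sqrt(sum(v * v for v in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # a stationary trajectory has no defined turn
    cos = sum(x * y for x, y in zip(a, b)) / (na * nb)
    return math.acos(max(-1.0, min(1.0, cos)))


class KSigmaRouter:
    """Routes scores with a running mean/std threshold (Welford's algorithm)."""

    def __init__(self, k=2.0):
        self.k = k
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def route(self, score):
        # Decide from statistics of earlier scores, then fold this score in.
        label = "fuzzy"
        if self.n >= 2:
            std = math.sqrt(self.m2 / self.n)
            if score > self.mean + self.k * std:
                label = "clear"
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (score - self.mean)
        return label
```

The router keeps only three numbers of state, so the threshold adapts to the stream without storing past scores.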
Details
Motivation: Existing streaming video understanding methods are limited by out-of-memory errors and catastrophic forgetting caused by linear growth in visual tokens; conventional memory management strategies lack semantic awareness, which tends to break contextual coherence and blur key semantic transitions.
Method: Proposes the CurveStream framework: based on the geometric observation that high-curvature regions along continuous feature trajectories correspond to global semantic transitions, it defines a Curvature Score to quantify real-time semantic intensity and introduces an online K-Sigma dynamic threshold to adaptively divide frames into clear and fuzzy memory states under a fixed token budget.
Result: On multi-scale temporal benchmarks such as StreamingBench and OVOBench, CurveStream achieves absolute gains of over 10% over baselines (e.g., +10.69% on StreamingBench and +13.58% on OVOBench), setting a new state of the art for streaming video perception.
Conclusion: CurveStream is a lightweight, training-free, semantically aware approach to streaming visual memory management that eases a key bottleneck for multimodal large models processing long streaming video.
Abstract: Multimodal Large Language Models have achieved significant success in offline video understanding, yet their application to streaming videos is severely limited by the linear explosion of visual tokens, which often leads to Out-of-Memory (OOM) errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. However, these strategies often lack intrinsic semantic awareness, potentially disrupting contextual coherence and blurring transient yet critical semantic transitions. To address these limitations, we propose CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. Our approach is motivated by the key observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. Based on this geometric insight, CurveStream evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget.
Evaluations across diverse temporal scales confirm that this lightweight framework, CurveStream, consistently yields absolute performance gains of over 10% (e.g., 10.69% on StreamingBench and 13.58% on OVOBench) over respective baselines, establishing new state-of-the-art results for streaming video perception. The code will be released at https://github.com/streamingvideos/CurveStream.
[103] MagicSeg: Open-World Segmentation Pretraining via Counterfactural Diffusion-Based Auto-Generation
Kaixin Cai,Pengzhen Ren,Jianhua Han,Yi Zhu,Hang Xu,Jianzhuang Liu,Xiaodan Liang
Main category: cs.CV
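TL;DR: This paper proposes MagicSeg, a diffusion-based automatic dataset generation pipeline for open-world semantic segmentation that generates high-quality text descriptions, positive and negative sample images, and pseudo segmentation labels, and combines contrastive learning with self-supervised signals to reach SOTA performance on multiple benchmarks.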
Details
Motivation: Existing open-world semantic segmentation relies heavily on large-scale image-text pair datasets, which lack fine-grained pixel annotations across enough categories, and manual annotation is expensive; diffusion models have strong image generation capabilities that can ease this data bottleneck.
Method: Starting from class labels, MagicSeg first generates high-fidelity text descriptions, then drives a diffusion model to generate positive images and corresponding negative (counterfactual) images; it then uses an open-vocabulary detection model and an interactive segmentation model to generate pixel-level pseudo segmentation masks from the class labels; finally, the generated data is used for contrastive language-image pretraining with pseudo-mask supervision and counterfactual contrastive training.
Result: Achieves mIoU of 62.9%, 26.7%, and 40.2% on PASCAL VOC, PASCAL Context, and COCO respectively, reaching state-of-the-art performance.
Conclusion: MagicSeg shows that synthesizing a high-quality open-world segmentation dataset with fine-grained annotations is feasible, and that doing so significantly improves downstream generalization and performance on open-world semantic segmentation.
Abstract: Open-world semantic segmentation presently relies significantly on extensive image-text pair datasets, which often suffer from a lack of fine-grained pixel annotations on sufficient categories. The acquisition of such data is rendered economically prohibitive due to the substantial investments of both human labor and time. In light of the formidable image generation capabilities of diffusion models, we introduce a novel diffusion model-driven pipeline for automatically generating datasets tailored to the needs of open-world semantic segmentation, named "MagicSeg". Our MagicSeg initiates from class labels and proceeds to generate high-fidelity textual descriptions, which in turn serve as guidance for the diffusion model to generate images. Rather than only generating positive samples for each label, our process encompasses the simultaneous generation of corresponding negative images, designed to serve as paired counterfactual samples for contrastive training. Then, to provide a self-supervised signal for open-world segmentation pretraining, our MagicSeg integrates an open-vocabulary detection model and an interactive segmentation model to extract object masks as precise segmentation labels from images based on the provided category labels. By applying our dataset to the contrastive language-image pretraining model with the pseudo mask supervision and the auxiliary counterfactual contrastive training, the downstream model obtains strong performance on open-world semantic segmentation.
We evaluate our model on PASCAL VOC, PASCAL Context, and COCO, achieving SOTA with performance of 62.9%, 26.7%, and 40.2%, respectively, demonstrating our dataset's effectiveness in enhancing open-world semantic segmentation capabilities. Project website: https://github.com/ckxhp/magicseg.
[104] FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
Zhifei Yang,Guangyao Zhai,Keyang Lu,YuYang Yin,Chao Zhang,Zhen Xiao,Jieyi Long,Nassir Navab,Yikai Wang
Main category: cs.CV
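TL;DR: FlowScene is a tri-branch scene generation model conditioned on multimodal graphs that jointly generates layout, shape, and texture through a tightly coupled rectified flow mechanism, enabling fine-grained object control while keeping scene style consistent.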
Details
Motivation: Existing language-driven scene generation methods lack object-level control and scene-level style consistency, while graph-driven methods are controllable but struggle to produce high-fidelity textured results.
Method: Proposes FlowScene, a tri-branch model (layout, shape, texture) conditioned on multimodal graphs; its core is a tightly coupled rectified flow mechanism that exchanges object information across the graph during generation to enable collaborative reasoning.
Result: Experiments show that FlowScene outperforms language-driven and graph-driven baselines in generation realism, style consistency, and alignment with human preferences.
Conclusion: FlowScene combines high realism with fine-grained controllability, addressing key bottlenecks of existing methods in object control, texture quality, and style consistency, and making scene generation more practical.
Abstract: Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.
[105] K-GMRF: Kinetic Gauss-Markov Random Field for First-Principles Covariance Tracking on Lie Groups
ZhiMing Li
Main category: cs.CV
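TL;DR: This paper proposes K-GMRF, a training-free, online covariance tracking framework that models covariance evolution as rigid-body motion on Lie groups and uses the Euler-Poincaré equations with a symplectic integrator to obtain second-order dynamic estimation with zero steady-state error, significantly improving tracking accuracy and robustness across several vision tasks.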
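The zero-steady-state-error claim contrasts second-order dynamics with first-order updates. The sketch below is only a scalar analogue of that contrast: it tracks a steadily increasing angle with a first-order exponential moving average and with a second-order alpha-beta tracker that also estimates velocity. It does not implement K-GMRF itself (no Lie groups, Euler-Poincaré equations, or symplectic integration), and the gains and function names are assumptions.

```python
def ema_track(observations, alpha=0.5):
    """First-order tracker: move a fixed fraction toward each observation."""
    est = observations[0]
    out = []
    for z in observations:
        est += alpha * (z - est)
        out.append(est)
    return out


def second_order_track(observations, alpha=0.5, beta=0.1):
    """Alpha-beta tracker: keeps a velocity state and predicts before correcting."""
    pos, vel = observations[0], 0.0
    out = []
    for z in observations:
        pred = pos + vel           # predict with the current velocity estimate
        resid = z - pred           # innovation
        pos = pred + alpha * resid
        vel += beta * resid
        out.append(pos)
    return out


omega = 0.1                                   # constant rotation rate per step
angles = [omega * k for k in range(400)]      # noise-free ramp
ema_lag = angles[-1] - ema_track(angles)[-1]
second_order_lag = angles[-1] - second_order_track(angles)[-1]
```

For a ramp with slope `omega`, the moving average settles to a constant lag of `omega * (1 - alpha) / alpha` (0.1 with these values), whereas the second-order tracker's lag decays toward zero once its velocity estimate converges.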
Details
Motivation: Existing covariance tracking methods either ignore manifold constraints or use only first-order updates, causing unavoidable phase lag under rapid change.
Method: Recasts covariance tracking as forced rigid-body motion on Lie groups, derived from the Euler-Poincaré equations; observations are treated as torques driving a latent angular velocity, and the state is propagated with a structure-preserving symplectic integrator.
Result: On synthetic ellipses, SO(3) stabilization (20% dropout), and an OTB motion-blur sequence (BlurCar2), respectively: angular error reduced by 30x, geodesic error reduced from 29.4° to 9.9°, and IoU improved from 0.55 to 0.74 with a 96% success rate.
Conclusion: As a fully differentiable symplectic module, K-GMRF can serve both as a plug-and-play geometric prior for data-constrained scenarios and as an interpretable structured layer in modern deep networks.
Abstract: Tracking non-stationary covariance matrices is fundamental to vision yet hindered by existing estimators that either neglect manifold constraints or rely on first-order updates, incurring inevitable phase lag during rapid evolution. We propose K-GMRF, an online, training-free framework for covariance tracking that reformulates the problem as forced rigid-body motion on Lie groups. Derived from the Euler-Poincaré equations, our method interprets observations as torques driving a latent angular velocity, propagated via a structure-preserving symplectic integrator. We theoretically prove that these second-order dynamics achieve zero steady-state error under constant rotation, strictly superior to the proportional lag of first-order baselines. Validation across three domains demonstrates robust tracking fidelity: (i) on synthetic ellipses, K-GMRF reduces angular error by 30x compared to Riemannian EMA while maintaining stability at high speeds; (ii) on SO(3) stabilization with 20% dropout, it decreases geodesic error from 29.4° to 9.9°; and (iii) on OTB motion-blur sequences, it improves IoU from 0.55 to 0.74 on BlurCar2 with a 96% success rate. As a fully differentiable symplectic module, K-GMRF provides a plug-and-play geometric prior for data-constrained scenarios and an interpretable layer within modern deep architectures.
[106] Beyond Quadratic: Linear-Time Change Detection with RWKV
Zhenyu Yang,Gensheng Pei,Tao Chen,Xia Yuan,Haofeng Zhang,Xiangbo Shu,Yazhou Yao
Main category: cs.CV
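TL;DR: This paper proposes the ChangeRWKV architecture, which builds on the RWKV framework to retain Transformer-style global modeling while achieving linear inference complexity, significantly improving the efficiency and accuracy of remote sensing change detection.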
Details
Motivation: Existing remote sensing change detection methods face a performance-efficiency trade-off between CNNs (efficient but lacking global context) and Transformers (able to model long-range dependencies but computationally expensive).
Method: Built on the RWKV (Receptance Weighted Key Value) framework, it designs a hierarchical RWKV encoder and a Spatial-Temporal Fusion Module (STFM) for multi-scale feature modeling and for aligning spatio-temporal differences.
Result: Reaches 85.46% IoU and 92.16% F1 on the LEVIR-CD dataset, with far fewer parameters and FLOPs than previous SOTA methods.
Conclusion: ChangeRWKV offers an efficient, powerful, and deployable approach to large-scale remote sensing change detection.
Abstract: Existing paradigms for remote sensing change detection are caught in a trade-off: CNNs excel at efficiency but lack global context, while Transformers capture long-range dependencies at a prohibitive computational cost. This paper introduces ChangeRWKV, a new architecture that reconciles this conflict. By building upon the Receptance Weighted Key Value (RWKV) framework, our ChangeRWKV uniquely combines the parallelizable training of Transformers with the linear-time inference of RNNs. At its core, our approach features two key innovations: a hierarchical RWKV encoder that builds multi-resolution feature representations, and a novel Spatial-Temporal Fusion Module (STFM) engineered to resolve spatial misalignments across scales while distilling fine-grained temporal discrepancies. ChangeRWKV not only achieves state-of-the-art performance on the LEVIR-CD benchmark, with an 85.46% IoU and 92.16% F1 score, but does so while drastically reducing parameters and FLOPs compared to previous leading methods. This work demonstrates a new, efficient, and powerful paradigm for operational-scale change detection. Our code and model are publicly available.
[107] Physion-Eval: Evaluating Physical Realism in Generated Video via Human Reasoning
Qin Zhang,Peiyu Jing,Hong-Xing Yu,Fangqiang Ding,Fan Nie,Weimin Wang,Yilun Du,James Zou,Jiajun Wu,Bing Shuai
Main category: cs.CV
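TL;DR: This paper presents Physion-Eval, a large-scale benchmark of expert human evaluation of physical realism in video, revealing that current video generation models commonly exhibit severe physical glitches in physics-critical scenarios.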
Details
Motivation: Existing automated metrics and coarse human evaluations cannot diagnose in depth why and when generated videos violate physical laws; a fine-grained, expert-reasoning-based method for evaluating physical realism is needed.
Method: Builds the Physion-Eval benchmark: AI videos are generated from real reference videos, and experts annotate temporally localized physical glitches, 22 fine-grained failure categories, and natural-language explanations; it covers 5 SOTA models across egocentric and exocentric views, with 10,990 reasoning traces in total.
Result: In physics-critical scenarios, experts identified at least one physical glitch in 83.3% (exocentric) and 93.5% (egocentric) of generated videos.
Conclusion: Current video generation models have fundamental deficiencies in physical modeling; Physion-Eval sets a new standard for physical realism evaluation and encourages research on physics-grounded video generation.
Abstract: Video generation models are increasingly used as world simulators for storytelling, simulation, and embodied AI. As these models advance, a key question arises: do generated videos obey the physical laws of the real world? Existing evaluations largely rely on automated metrics or coarse human judgments such as preferences or rubric-based checks. While useful for assessing perceptual quality, these methods provide limited insight into when and why generated dynamics violate real-world physical constraints. We introduce Physion-Eval, a large-scale benchmark of expert human reasoning for diagnosing physical realism failures in videos generated by five state-of-the-art models across egocentric and exocentric views, containing 10,990 expert reasoning traces spanning 22 fine-grained physical categories. Each generated video is derived from a corresponding real-world reference video depicting a clear physical process, and annotated with temporally localized glitches, structured failure categories, and natural-language explanations of the violated physical behavior. Using this dataset, we reveal a striking limitation of current video generation models: in physics-critical scenarios, 83.3% of exocentric and 93.5% of egocentric generated videos exhibit at least one human-identifiable physical glitch. We hope Physion-Eval will set a new standard for physical realism evaluation and guide the development of physics-grounded video generation. The benchmark is publicly available at https://huggingface.co/datasets/PhysionLabs/Physion-Eval.
[108] FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement
Ming Hu,Yongsheng Huo,Mingyu Dou,Jianfu Yin,Peng Zhao,Yao Wang,Cong Hu,Bingliang Hu,Quan Wang
Main category: cs.CV
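TL;DR: This paper proposes the FB-CLIP framework, which improves zero-shot fine-grained anomaly detection and localization through multi-strategy textual representations and foreground-background separation.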
Details
Motivation: Fine-grained anomaly detection is critical in industry and medicine, but labeled anomalies are scarce, making zero-shot detection challenging; existing vision-language models such as CLIP suffer from foreground-background feature entanglement and coarse textual semantics.
Method: Proposes FB-CLIP: (1) the textual modality fuses End-of-Text features, global-pooled representations, and attention-weighted token features; (2) the visual modality applies multi-view soft separation (along identity, semantic, and spatial dimensions) and suppresses background; (3) Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging semantic margins.
Result: Experiments show that FB-CLIP distinguishes anomalies against complex backgrounds and significantly improves zero-shot fine-grained anomaly detection and localization accuracy.
Conclusion: By disentangling foreground from background and enriching textual semantic representations, FB-CLIP achieves more robust and precise fine-grained anomaly localization in the zero-shot setting.
Abstract: Fine-grained anomaly detection is crucial in industrial and medical applications, but labeled anomalies are often scarce, making zero-shot detection challenging. While vision-language models like CLIP offer promising solutions, they struggle with foreground-background feature entanglement and coarse textual semantics. We propose FB-CLIP, a framework that enhances anomaly localization via multi-strategy textual representations and foreground-background separation. In the textual modality, it combines End-of-Text features, global-pooled representations, and attention-weighted token features for richer semantic cues. In the visual modality, multi-view soft separation along identity, semantic, and spatial dimensions, together with background suppression, reduces interference and improves discriminability. Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging semantic gaps. Experiments show that FB-CLIP effectively distinguishes anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings.
[109] LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
Shuaibang Peng,Juelin Zhu,Xia Li,Kun Yang,Maojun Zhang,Yu Liu,Shen Yan
Main category: cs.CV
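TL;DR: LoD-Loc v3 is a new method for generalized aerial visual localization in dense urban environments; with a synthetic instance segmentation dataset (InsLoD-Loc) and a paradigm shift from semantic to instance silhouette alignment, it significantly improves cross-scene generalization and localization robustness in dense building scenes.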
Details
Motivation: Address LoD-Loc v2's poor cross-scene generalization and frequent failures in dense building scenes.
Method: (1) Builds InsLoD-Loc, a large-scale instance segmentation dataset of aerial imagery (100k images with precise building instance annotations); (2) changes the localization paradigm from semantic silhouette alignment to instance silhouette alignment to reduce pose estimation ambiguity.
Result: Substantially outperforms existing SOTA methods in both cross-scene and dense urban scenarios.
Conclusion: Through innovations in both data and method, LoD-Loc v3 improves the generalization and robustness of aerial visual localization in complex urban environments.
Abstract: We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines, achieving superior performance in both cross-scene and dense urban scenarios by a large margin. The project is available at https://nudt-sawlab.github.io/LoD-Locv3/.
[110] ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
Quan Kong,Yuhao Shen,Yicheng Ji,Huan Li,Cong Wang
Main category: cs.CV
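TL;DR: This paper proposes ParallelVLM, a training-free draft-then-verify speculative decoding framework that uses a parallelized design and an unbiased verifier-guided pruning strategy to significantly speed up Video-LLM decoding for long-video understanding.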
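The draft-then-verify pattern this framework builds on can be sketched in its simplest greedy form: a cheap draft model proposes a window of tokens, and the target model keeps them only while they match its own greedy choice. The code below is a generic toy version, not ParallelVLM; the callables `target_next` and `draft_next`, the `window` parameter, and the sequential verification loop are assumptions (real systems verify a whole window in one batched target pass, and ParallelVLM additionally parallelizes the stages and prunes visual tokens).

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, window=4):
    """Greedy draft-then-verify decoding.

    target_next / draft_next: callables mapping a token list to the next token.
    The result is identical to decoding greedily with target_next alone.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # Draft phase: the cheap model proposes up to `window` tokens.
        draft = []
        for _ in range(min(window, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # Verify phase: accept drafted tokens while they match the target.
        for proposed in draft:
            expected = target_next(tokens)
            tokens.append(expected)
            if expected != proposed:
                break  # first mismatch: keep the target's token, drop the rest
    return tokens
```

Every appended token is the target model's own choice, which is why the scheme is lossless; the draft model only affects how many tokens are settled per round.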
Details
Motivation: Existing Video-LLMs decode autoregressively with low efficiency because of the huge number of video tokens; visual token pruning helps somewhat but still suffers from information loss and limited speedup.
Method: Proposes the ParallelVLM framework, with two parallelized stages that maximize hardware utilization and an unbiased verifier-guided pruning strategy that removes the positional bias in attention-guided pruning and improves agreement between the draft and target models.
Result: Experiments show that ParallelVLM expands the draft window by 1.6-1.8x with high accepted lengths and achieves decoding speedups of 3.36x on LLaVA-Onevision-72B and 2.42x on Qwen2.5-VL-32B.
Conclusion: ParallelVLM is an efficient, training-free speculative decoding scheme that addresses mutual waiting between the draft and target models and the limited speedup ratio in long-video settings.
Abstract: Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both mutual waiting and limited speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporate an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention-guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by $1.6\sim1.8\times$ with high accepted lengths, and accelerates various video understanding benchmarks by 3.36$\times$ on LLaVA-Onevision-72B and 2.42$\times$ on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.
[111] OrbitNVS: Harnessing Video Diffusion Priors for Novel View Synthesis
Jinglin Liang,Zijian Zhou,Rui Huang,Shuangping Huang,Yichen Gong
Main category: cs.CV
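TL;DR: OrbitNVS recasts novel view synthesis (NVS) as an orbit video generation task, leveraging the visual priors of a pre-trained video generation model together with camera adapters, a normal-map-guided attention mechanism, and pixel-space supervision to significantly improve geometric and appearance consistency and synthesis quality under single-view input.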
Details
Motivation: Existing methods struggle to synthesize plausible views of unobserved regions from single-view input and to maintain geometric and appearance consistency.
Method: Proposes OrbitNVS, which models NVS as orbit video generation; introduces camera adapters for accurate camera control; designs a normal map generation branch whose features guide target-view synthesis through attention; applies pixel-space supervision to reduce the blur caused by latent-space compression.
Result: Significantly outperforms previous methods on the GSO and OmniObject3D benchmarks, with PSNR gains of +2.9 dB and +2.4 dB respectively in the single-view setting.
Conclusion: By combining video generation priors with 3D geometry guidance, OrbitNVS improves the quality and consistency of novel view synthesis and offers a new approach to single-view NVS.
Abstract: Novel View Synthesis (NVS) aims to generate unseen views of a 3D object given a limited number of known views. Existing methods often struggle to synthesize plausible views for unobserved regions, particularly under single-view input, and still face challenges in maintaining geometry- and appearance-consistency. To address these issues, we propose OrbitNVS, which reformulates NVS as an orbit video generation task. Through tailored model design and training strategies, we adapt a pre-trained video generation model to the NVS task, leveraging its rich visual priors to achieve high-quality view synthesis. Specifically, we incorporate camera adapters into the video model to enable accurate camera control. To enhance two key properties of 3D objects, geometry and appearance, we design a normal map generation branch and use normal map features to guide the synthesis of the target views via an attention mechanism, thereby improving geometric consistency. Moreover, we apply pixel-space supervision to alleviate blurry appearance caused by spatial compression in the latent space. Extensive experiments show that OrbitNVS significantly outperforms previous methods on the GSO and OmniObject3D benchmarks, especially in the challenging single-view setting (e.g., +2.9 dB and +2.4 dB PSNR).
[112] UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
Chuanrui Zhang,Yingshuang Zou,ZhengXian Wu,Yonggen Ling,Yuxiao Yang,Ziwei Wang
Main category: cs.CV
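TL;DR: This paper proposes UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework; it operates directly on stereo image pairs, uses geometric constraints to resolve scale ambiguity, introduces a pose-aware shape representation, and builds the large-scale stereo dataset LVS6D, significantly improving efficiency and fidelity to physical proportions.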
Details
Motivation: Existing modular methods for real-to-sim transfer suffer from inefficiency, error accumulation, and lack of global context.
Method: Proposes the end-to-end framework UniPR, which operates on a single stereo image pair and uses geometric constraints to resolve scale ambiguity; introduces Pose-Aware Shape Representation to unify reconstruction and pose estimation; builds the large-scale stereo dataset LVS6D (over 6,300 objects).
Result: UniPR reconstructs all objects in a scene in parallel within a single forward pass, with significant efficiency gains, and preserves true physical proportions across diverse objects.
Conclusion: UniPR offers an efficient, accurate, and scalable end-to-end approach to perception and reconstruction for robotic real-to-sim applications.
Abstract: Perceiving and reconstructing objects from images are critical for real-to-sim transfer tasks, which are widely used in the robotics community. Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline. However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context. To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework. Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve the scale ambiguity. We introduce Pose-Aware Shape Representation to eliminate the need for per-category canonical definitions and to bridge the gap between reconstruction and pose estimation tasks. Furthermore, we construct a large-vocabulary stereo dataset, LVS6D, comprising over 6,300 objects, to facilitate large-scale research in this area. Extensive experiments demonstrate that UniPR reconstructs all objects in a scene in parallel within a single forward pass, achieving significant efficiency gains and preserving true physical proportions across diverse object types, highlighting its potential for practical robotic applications.
[113] Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
Chunlei Zhang,Jiahao Xia,Yun Xiao,Bo Jiang,Jian Zhang
Main category: cs.CV
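TL;DR: This paper proposes HRNet, a hybrid registration network that jointly learns a stable shared feature space and a unified hybrid transformation to address modality-private information leakage and the single-transformation-type limitation in multimodal image registration.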
Details
Motivation: Existing methods do not effectively suppress modality-private information leaking into shared features, and multi-scale frameworks usually support only a single transformation type, making it hard to handle global misalignment and local deformation together.
Method: Proposes HRNet, comprising a shared backbone with Modality-Specific Batch Normalization (MSBN), a Cross-scale Disentanglement and Adaptive Projection (CDAP) module that builds a stable shared feature space, and a Hybrid Parameter Prediction Module (HPPM) that performs non-iterative coarse-to-fine joint estimation of rigid parameters and deformation fields.
Result: Achieves state-of-the-art performance on rigid and non-rigid registration tasks across four multimodal datasets.
Conclusion: By combining disentangled representation learning with hybrid transformation modeling, HRNet improves the robustness and generalization of multimodal image registration.
Abstract: Multimodal image registration is a fundamental task and a prerequisite for downstream cross-modal analysis. Despite recent progress in shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, allowing modality-private cues to leak into the shared space. Second, most multi-scale frameworks support only a single transformation type, limiting their applicability when global misalignment and local deformation coexist. To address these issues, we formulate hybrid multimodal registration as jointly learning a stable shared feature space and a unified hybrid transformation. Based on this view, we propose HRNet, a Hybrid Registration Network that couples representation disentanglement with hybrid parameter prediction. A shared backbone with Modality-Specific Batch Normalization (MSBN) extracts multi-scale features, while a Cross-scale Disentanglement and Adaptive Projection (CDAP) module suppresses modality-private cues and projects shared features into a stable subspace for matching. Built on this shared space, a Hybrid Parameter Prediction Module (HPPM) performs non-iterative coarse-to-fine estimation of global rigid parameters and deformation fields, which are fused into a coherent deformation field. Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on rigid and non-rigid registration tasks. The code is available at the project website.
[114] IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression via Implicit Dense Alignment v1
Jun Wang,Xiaoyan Huang
Main category: cs.CV
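TL;DR: This paper proposes IUP-Pose, a geometry-driven, decoupled iterative framework for relative pose estimation that uses implicit dense alignment and a lightweight multi-head bi-directional cross-attention module to achieve end-to-end differentiable, accurate, real-time relative pose regression.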
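This entry's method realigns feature maps with the rotational homography H_inf before predicting translation. For two views sharing the same pinhole intrinsics K, the standard infinite homography is H_inf = K R K^{-1}, where R rotates points from the first camera frame into the second. The sketch below only evaluates that formula on pixel coordinates; how IUP-Pose applies it to feature maps is not specified in the entry, and the helper names and the shared-intrinsics simplification are assumptions.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def intrinsics(f, cx, cy):
    """Pinhole intrinsics with square pixels and zero skew."""
    return [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]


def intrinsics_inv(f, cx, cy):
    """Closed-form inverse of the intrinsics above."""
    return [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]


def rot_z(theta):
    """Rotation by theta radians about the optical (z) axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def infinite_homography(f, cx, cy, rotation):
    """H_inf = K R K^{-1} for two views with identical intrinsics."""
    k, k_inv = intrinsics(f, cx, cy), intrinsics_inv(f, cx, cy)
    return matmul3(matmul3(k, rotation), k_inv)


def warp_pixel(h, u, v):
    """Apply a homography to pixel (u, v) and dehomogenize."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return x / w, y / w
```

With the identity rotation the homography leaves pixels unchanged, and a rotation about the optical axis turns pixels around the principal point (cx, cy), which gives a quick sanity check on the sign conventions.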
Details
Motivation: Existing relative pose regression methods struggle to balance accuracy (feature matching that depends on non-differentiable RANSAC) and efficiency (ViT-style regressors are computationally expensive), and suffer from coupled rotation-translation estimation and insufficient cross-view feature alignment.
Method: Proposes the IUP-Pose framework: (1) a lightweight Multi-Head Bi-Cross Attention (MHBC) module aligns cross-view features without explicit matching supervision; (2) a decoupled rotation-translation pipeline, in which rotation is refined iteratively by a two-stage shared-parameter network with uncertainty modeling, and feature maps are rotationally realigned with the infinite homography H_inf before translation prediction.
Result: Reaches 73.3% AUC@20deg on MegaDepth1500, supports real-time inference at 70 FPS, and uses only 37M parameters, combining high accuracy with high efficiency.
Conclusion: Through a decoupled iterative structure guided by geometric priors and an implicit alignment mechanism, IUP-Pose markedly improves the accuracy-efficiency balance of relative pose estimation while remaining end-to-end differentiable, making it suitable for real-time edge deployment.
Abstract: Relative pose estimation is fundamental for SLAM, visual localization, and 3D reconstruction. Existing Relative Pose Regression (RPR) methods face a key trade-off: feature-matching pipelines achieve high accuracy but block gradient flow via non-differentiable RANSAC, while ViT-based regressors are end-to-end trainable but prohibitively expensive for real-time deployment. We identify the core bottlenecks as the coupling between rotation and translation estimation and insufficient cross-view feature alignment. We propose IUP-Pose, a geometry-driven decoupled iterative framework with implicit dense alignment. A lightweight Multi-Head Bi-Cross Attention (MHBC) module aligns cross-view features without explicit matching supervision. The aligned features are processed by a decoupled rotation-translation pipeline: two shared-parameter rotation stages iteratively refine rotation with uncertainty, and feature maps are realigned via rotational homography H_inf before translation prediction. IUP-Pose achieves 73.3% AUC@20deg on MegaDepth1500 with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, demonstrating a favorable accuracy-efficiency trade-off for real-time edge deployment.
[115] Dual Prompt-Driven Feature Encoding for Nighttime UAV Tracking
Yiheng Wang,Changhong Fu,Liangliang Yao,Haobo Zuo,Zijie Zhang
Main category: cs.CV
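TL;DR: This paper proposes a dual prompt-driven feature encoding method (DPTracker) that uses a pyramid illumination prompter and a dynamic viewpoint prompter to learn illumination- and viewpoint-invariant features for nighttime UAV tracking, significantly improving tracking robustness.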
Details
Motivation: Existing feature encoding methods for UAV tracking often overlook illumination and viewpoint cues, so performance degrades under challenging conditions such as nighttime. Method: Proposes dual prompt-driven feature encoding: 1) a pyramid illumination prompter extracts multi-scale frequency-aware illumination prompts; 2) a dynamic viewpoint prompter modulates deformable convolution offsets to accommodate viewpoint variations and learn view-invariant features. Result: Experiments validate the effectiveness of DPTracker on nighttime UAV tracking; ablation studies verify the contribution of each component; real-world tests demonstrate strong robustness and practical utility. Conclusion: The dual prompt mechanism improves feature invariance to illumination and viewpoint changes, offering a new approach and a practical solution for nighttime UAV tracking. Abstract: Robust feature encoding constitutes the foundation of UAV tracking by enabling the nuanced perception of target appearance and motion, thereby playing a pivotal role in ensuring reliable tracking. However, existing feature encoding methods often overlook critical illumination and viewpoint cues, which are essential for robust perception under challenging nighttime conditions, leading to degraded tracking performance. To overcome the above limitation, this work proposes a dual prompt-driven feature encoding method that integrates prompt-conditioned feature adaptation and context-aware prompt evolution to promote domain-invariant feature encoding. Specifically, the pyramid illumination prompter is proposed to extract multi-scale frequency-aware illumination prompts. The dynamic viewpoint prompter modulates deformable convolution offsets to accommodate viewpoint variations, enabling the tracker to learn view-invariant features. Extensive experiments validate the effectiveness of the proposed dual prompt-driven tracker (DPTracker) in tackling nighttime UAV tracking. Ablation studies highlight the contribution of each component in DPTracker. Real-world tests under diverse nighttime UAV tracking scenarios further demonstrate the robustness and practical utility. The code and demo videos are available at https://github.com/yiheng-wang-duke/DPTracker.[116] UniBioTransfer: A Unified Framework for Multiple Biometrics Transfer
Caiyi Sun,Yujing Sun,Xiangyu Li,Yuhang Zheng,Yiming Ren,Jiamin Wang,Yuexin Ma,Siu-Ming Yiu
Main category: cs.CV
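TL;DR: This paper proposes UniBioTransfer, the first framework to handle multiple deepface generation tasks in a unified way (e.g., face transfer, face reenactment, and hair and head transfer), addressing data scarcity and cross-task conflicts through a unified data construction strategy and a novel BioMoE model.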
Details
Motivation: 传统深脸生成采用单任务范式,限制了模型泛化性与可扩展性;亟需一个能同时处理异构属性变换的统一模型。 Method: 提出UniBioTransfer框架:1)基于交换的腐败机制构建统一训练数据;2)设计BioMoE(生物启发的混合专家模型)及两阶段训练策略以解耦任务知识。 Result: 在多种深脸生成任务上显著优于现有统一模型和单任务方法,且能零样本或少样本泛化至唇部、眼部、眼镜迁移等新任务。 Conclusion: UniBioTransfer为多任务深脸生成提供了高效、可扩展、可泛化的统一解决方案,推动了从单任务向通用视觉生成的范式转变。 Abstract: Deepface generation has traditionally followed a task-driven paradigm, where distinct tasks (e.g., face transfer and hair transfer) are addressed by task-specific models. Nevertheless, this single-task setting severely limits model generalization and scalability. A unified model capable of solving multiple deepface generation tasks in a single pass represents a promising and practical direction, yet remains challenging due to data scarcity and cross-task conflicts arising from heterogeneous attribute transformations. To this end, we propose UniBioTransfer, the first unified framework capable of handling both conventional deepface tasks (e.g., face transfer and face reenactment) and shape-varying transformations (e.g., hair transfer and head transfer). Besides, UniBioTransfer naturally generalizes to unseen tasks, like lip, eye, and glasses transfer, with minimal fine-tuning. Generally, UniBioTransfer addresses data insufficiency in multi-task generation through a unified data construction strategy, including a swapping-based corruption mechanism designed for spatially dynamic attributes like hair. It further mitigates cross-task interference via an innovative BioMoE, a mixture-of-experts based model coupled with a novel two-stage training strategy that effectively disentangles task-specific knowledge. Extensive experiments demonstrate the effectiveness, generalization, and scalability of UniBioTransfer, outperforming both existing unified models and task-specific methods across a wide range of deepface generation tasks. Project page is at https://scy639.github.io/UniBioTransfer.github.io/[117] OmniDiT: Extending Diffusion Transformer to Omni-VTON Framework
Weixuan Zeng,Pengcheng Wei,Huaiqing Wang,Boheng Zhang,Jia Sun,Dewen Fan,Lin HE,Long Chen,Qianqian Gan,Fan Yang,Tingting Gao
Main category: cs.CV
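TL;DR: This paper proposes OmniDiT, a unified Diffusion Transformer-based framework for virtual try-on (VTON) and try-off (VTOFF) that improves fine-grained detail, generalization, and inference efficiency through self-evolving data curation, an adaptive position encoding, Shifted Window Attention, and multiple-timestep prediction with an alignment loss.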
Details
Motivation: 现有VTON方法在细粒度细节保留、复杂场景泛化、流程复杂性和推理效率方面存在不足。 Method: 提出OmniDiT框架;构建含38万高质量样本的Omni-TryOn数据集;采用token拼接与自适应位置编码融合多参考条件;首次将Shifted Window Attention引入扩散模型以实现线性计算复杂度;结合多时间步预测和对齐损失提升生成保真度。 Result: 在无模型VTON/VTOFF任务中性能最优,在有模型VTON任务中达到当前SOTA水平;显著提升细粒度细节保留与复杂场景泛化能力。 Conclusion: OmniDiT实现了VTON与VTOFF任务的统一建模,在性能、效率与通用性上取得综合突破,为虚拟试穿技术提供了新范式。 Abstract: Despite the rapid advancement of Virtual Try-On (VTON) and Try-Off (VTOFF) technologies, existing VTON methods face challenges with fine-grained detail preservation, generalization to complex scenes, complicated pipeline, and efficient inference. To tackle these problems, we propose OmniDiT, an omni Virtual Try-On framework based on the Diffusion Transformer, which combines try-on and try-off tasks into one unified model. Specifically, we first establish a self-evolving data curation pipeline to continuously produce data, and construct a large VTON dataset Omni-TryOn, which contains over 380k diverse and high-quality garment-model-tryon image pairs and detailed text prompts. Then, we employ the token concatenation and design an adaptive position encoding to effectively incorporate multiple reference conditions. To relieve the bottleneck of long sequence computation, we are the first to introduce Shifted Window Attention into the diffusion model, thus achieving a linear complexity. To remedy the performance degradation caused by local window attention, we utilize multiple timestep prediction and an alignment loss to improve generation fidelity. Experiments reveal that, under various complex scenes, our method achieves the best performance in both the model-free VTON and VTOFF tasks and a performance comparable to current SOTA methods in the model-based VTON task.[118] GravCal: Single-Image Calibration of IMU Gravity Priors with Per-Sample Confidence
Haichao Zhu,Qian Zhang
Main category: cs.CV
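TL;DR: This paper proposes GravCal, a model that calibrates an inaccurate IMU gravity prior from a single RGB image by combining a residual correction with a prior-independent image estimate, fused adaptively by a learned gate; it substantially reduces gravity direction error across diverse scenes.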
Details
Motivation: 现有重力估计方法在存在线性加速度、振动和瞬态运动时不可靠,且缺乏仅用单张图像校准噪声重力先验的有效方案。 Method: 提出GravCal——一种前馈神经网络模型,输入为单张RGB图像和噪声重力先验,输出校正后的重力方向及样本级置信度;模型融合残差修正和先验无关的图像估计,并通过可学习门控机制自适应加权。 Result: 在新构建的148K帧数据集上,GravCal将平均角度误差从22.02°(原始IMU先验)降至14.24°,尤其在先验严重失准时提升更显著;学习到的门控值与先验质量强相关,可作为下游系统可信度信号。 Conclusion: GravCal为单图像重力先验校准提供了有效、鲁棒且具解释性的解决方案,提升了视觉-惯性系统中重力估计的实用性与可靠性。 Abstract: Gravity estimation is fundamental to visual-inertial perception, augmented reality, and robotics, yet gravity priors from IMUs are often unreliable under linear acceleration, vibration, and transient motion. Existing methods often estimate gravity directly from images or assume reasonably accurate inertial input, leaving the practical problem of correcting a noisy gravity prior from a single image largely unaddressed. We present GravCal, a feedforward model for single-image gravity prior calibration. Given one RGB image and a noisy gravity prior, GravCal predicts a corrected gravity direction and a per-sample confidence score. The model combines two complementary predictions, including a residual correction of the input prior and a prior-independent image estimate, and uses a learned gate to fuse them adaptively. Extensive experiments show strong gains over raw inertial priors: GravCal reduces mean angular error from 22.02° (IMU prior) to 14.24°, with larger improvements when the prior is severely corrupted. We also introduce a novel dataset of over 148K frames with paired VIO-derived ground-truth gravity and Mahony-filter IMU priors across diverse scenes and arbitrary camera orientations. The learned gate also correlates with prior quality, making it a useful confidence signal for downstream systems.[119] CS-MUNet: A Channel-Spatial Dual-Stream Mamba Network for Multi-Organ Segmentation
Yuyang Zheng,Mingda Zhang,Jianglong Qin,Qi Mo,Jingdan Pan,Haozhe Hu,Hongyi Huang
Main category: cs.CV
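TL;DR: This paper proposes CS-MUNet, whose Boundary-Aware State Mamba and Channel Mamba State Aggregation modules address boundary-aware feature fusion and cross-channel anatomical semantic collaboration, respectively, achieving state-of-the-art performance on abdominal multi-organ segmentation.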
Details
Motivation: 现有Mamba方法忽视跨通道解剖语义协作,且缺乏显式的边界感知特征融合机制。 Method: 提出CS-MUNet:1)Boundary-Aware State Mamba模块采用贝叶斯注意力框架生成像素级边界后验图,并注入Mamba扫描参数以增强SSM状态转移的边界感知能力,结合双分支权重分配调制全局与局部结构表征;2)Channel Mamba State Aggregation模块将通道维重定义为SSM序列维,以数据驱动方式建模跨通道解剖语义协作。 Result: 在两个公开基准上,CS-MUNet在多项指标上持续超越当前最优方法。 Conclusion: CS-MUNet建立了联合处理通道语义协作与边界感知特征融合的新SSM建模范式,显著提升腹部多器官分割性能。 Abstract: Recently Mamba-based methods have shown promise in abdominal organ segmentation. However, existing approaches neglect cross-channel anatomical semantic collaboration and lack explicit boundary-aware feature fusion mechanisms. To address these limitations, we propose CS-MUNet with two purpose-built modules. The Boundary-Aware State Mamba module employs a Bayesian-attention framework to generate pixel-level boundary posterior maps, injected directly into Mamba's core scan parameters to embed boundary awareness into the SSM state transition mechanism, while dual-branch weight allocation enables complementary modulation between global and local structural representations. The Channel Mamba State Aggregation module redefines the channel dimension as the SSM sequence dimension to explicitly model cross-channel anatomical semantic collaboration in a data-driven manner. Experiments on two public benchmarks demonstrate that CS-MUNet consistently outperforms state-of-the-art methods across multiple metrics, establishing a new SSM modeling paradigm that jointly addresses channel semantic collaboration and boundary-aware feature fusion for abdominal multi-organ segmentation.[120] Semantic Audio-Visual Navigation in Continuous Environments
Yichen Zeng,Hebaixu Wang,Meng Liu,Yu Zhou,Chen Gao,Kehan Chen,Gongping Huang
Main category: cs.CV
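TL;DR: This paper introduces the SAVN-CE task setting and the MAGNet model to address navigation failures caused by targets going silent in audio-visual navigation, enabling more robust and realistic navigation in continuous 3D environments.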
Details
Motivation: Existing audio-visual navigation methods rely on precomputed RIRs, which restricts agents to discrete grid positions, and they tend to lose goal information when the target goes silent, limiting realism and robustness. Method: Introduces the Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE) task setting and designs MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to support memory-augmented goal reasoning. Result: MAGNet improves success rate over SOTA methods by up to 12.1% (absolute) and is more robust to short-duration sounds and long-distance navigation. Conclusion: SAVN-CE is closer to real continuous environments, and MAGNet's multimodal memory modeling mitigates the navigation challenges caused by silent targets, moving audio-visual navigation toward practical use. Abstract: Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at https://github.com/yichenzeng24/SAVN-CE.[121] Toward High-Fidelity Visual Reconstruction: From EEG-Based Conditioned Generation to Joint-Modal Guided Rebuilding
Zhijian Gong,Tianren Yao,Wenjia Dong,Xueyuan Xu
Main category: cs.CV
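TL;DR: This paper proposes a Joint-Modal Visual Reconstruction (JMVR) framework that treats EEG and text as independent modalities for joint learning, avoiding the loss of EEG detail caused by the conventional alignment paradigm, and uses multi-scale EEG encoding and image augmentation to improve the reconstruction of spatial structure and chromatic fidelity.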
Details
Motivation: Existing EEG-based visual reconstruction methods force EEG features to align with text or image semantic representations, which compresses the rich spatial relationships and chromatic details in EEG and yields only conditioned image generation rather than high-fidelity visual reconstruction. Method: Proposes the JMVR framework: 1) treats EEG and text as independent modalities for joint learning; 2) adopts a multi-scale EEG encoding strategy to extract fine- and coarse-grained features; 3) introduces image augmentation to improve the recovery of perceptual details. Result: On the THINGS-EEG dataset, JMVR achieves SOTA performance against six baseline methods, particularly in modeling spatial structure and chromatic fidelity. Conclusion: By decoupling the EEG and text modalities and strengthening their representations, JMVR improves EEG-based high-fidelity visual reconstruction and offers a new approach to visual decoding in brain-computer interfaces. Abstract: Human visual reconstruction aims to reconstruct fine-grained visual stimuli based on subject-provided descriptions and corresponding neural signals. As a widely adopted modality, Electroencephalography (EEG) captures rich visual cognition information, encompassing complex spatial relationships and chromatic details within scenes. However, current approaches are deeply coupled with an alignment framework that forces EEG features to align with text or image semantic representations. This dependency may compress the rich spatial and chromatic details in EEG, so that these approaches achieve mere conditioned image generation rather than high-fidelity visual reconstruction. To address this limitation, we propose a novel Joint-Modal Visual Reconstruction (JMVR) framework. It treats EEG and text as independent modalities for joint learning to preserve EEG-specific information for reconstruction. It further employs a multi-scale EEG encoding strategy to capture both fine- and coarse-grained features, alongside image augmentation to enhance the recovery of perceptual details. Extensive experiments on the THINGS-EEG dataset demonstrate that JMVR achieves SOTA performance against six baseline methods, specifically exhibiting superior capabilities in modeling spatial structure and chromatic fidelity.[122] Making Video Models Adhere to User Intent with Minor Adjustments
Daniel Ajisafe,Eric Hedlin,Helge Rhodin,Kwang Moo Yi
Main category: cs.CV
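TL;DR: This paper proposes improving the generation quality and control adherence of text-to-video diffusion models by slightly adjusting user-provided bounding boxes: the boxes are optimized to better align with the model's internal attention maps, using a differentiable smooth mask and an attention-maximization objective.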
Details
Motivation: 现有文本到视频扩散模型在利用边界框或布局进行生成控制时,难以保证生成结果严格遵循控制输入,控制 adherence 仍是开放问题。 Method: 提出一种轻量级优化方法:设计可微平滑掩码使边界框位置可导,并构建注意力最大化目标函数,在优化过程中平衡前景与背景注意力,从而微调原始边界框使其更契合模型内部注意力分布。 Result: 实验表明,仅对边界框做小幅调整即可显著提升生成质量与控制精度;用户研究进一步验证了该方法的有效性。 Conclusion: 边界框的位置应适配模型‘熟悉’的注意力区域,而非固定依赖人工输入;该思路为可控视频生成提供了简单、高效且即插即用的新范式。 Abstract: With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular way for control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we are modifying the bounding boxes to be at places where the model is familiar with. Surprisingly, we find that even with small modifications, the quality of generations can vary significantly. To do so, we propose a smooth mask to make the bounding box position differentiable and an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study to validate the effectiveness of our method. Our code is made available on the project webpage to foster future research from the community.[123] DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving
Xiaolu Liu,Yicong Li,Song Wang,Junbo Chen,Angela Yao,Jianke Zhu
Main category: cs.CV
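TL;DR: This paper proposes DynFlowDrive, a latent world model based on flow-based dynamics that aims to improve planning reliability in autonomous driving. The model learns a velocity field describing how the scene state changes under different driving actions and pairs it with a stability-aware multi-mode trajectory selection strategy, significantly improving prediction accuracy and planning stability.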
Details
Motivation: Existing methods predict future states through appearance generation or deterministic regression, which makes it hard to capture trajectory-conditioned scene evolution and leads to unreliable action planning. Method: Proposes DynFlowDrive, which adopts the rectified flow formulation to learn a velocity field that models the transition of world states under different driving actions, and designs a stability-aware multi-mode trajectory selection strategy that evaluates the stability of the scene transitions induced by candidate trajectories. Result: Experiments on the nuScenes and NavSim benchmarks show consistent improvements across diverse driving frameworks without additional inference overhead. Conclusion: Through flow-based dynamics modeling and stability-driven trajectory selection, DynFlowDrive improves the reliability and practicality of world models for autonomous driving. Abstract: Recently, world models have been incorporated into autonomous driving systems to improve planning reliability. Existing approaches typically predict future states through appearance generation or deterministic regression, which limits their ability to capture trajectory-conditioned scene evolution and leads to unreliable action planning. To address this, we propose DynFlowDrive, a latent world model that leverages flow-based dynamics to model the transition of world states under different driving actions. By adopting the rectified flow formulation, the model learns a velocity field that describes how the scene state changes under different driving actions, enabling progressive prediction of future latent states. Building upon this, we further introduce a stability-aware multi-mode trajectory selection strategy that evaluates candidate trajectories according to the stability of the induced scene transitions. Extensive experiments on the nuScenes and NavSim benchmarks demonstrate consistent improvements across diverse driving frameworks without introducing additional inference overhead. Source code will be available at https://github.com/xiaolul2/DynFlowDrive.[124] ATHENA: Adaptive Test-Time Steering for Improving Count Fidelity in Diffusion Models
Mohammad Shahab Sepehri,Asal Mehradfar,Berk Tinaz,Salman Avestimehr,Mahdi Soltanolkotabi
Main category: cs.CV
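TL;DR: This paper proposes ATHENA, a framework that uses test-time adaptive steering to improve the object-count accuracy of text-to-image diffusion models without modifying the model architecture or retraining.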
Details
Motivation: 文本到图像扩散模型虽具有高视觉保真度,但在提示中明确指定对象数量时存在系统性失败。 Method: ATHENA是一种模型无关的测试时自适应引导框架,利用采样过程中的中间表征估计对象数量,并在去噪早期施加数量感知的噪声校正,以在结构错误难以修正前引导生成轨迹;包含三种逐步进阶的变体,权衡计算开销与数值精度。 Result: 在标准基准和新构建的复杂数据集上实验表明,ATHENA显著提升了对象数量保真度(尤其在高目标数量下),且在多个扩散骨干网络上保持良好的精度-运行时间权衡。 Conclusion: ATHENA为提升扩散模型数值可控性提供了高效、通用、无需重训练的解决方案。 Abstract: Text-to-image diffusion models achieve high visual fidelity but surprisingly exhibit systematic failures in numerical control when prompts specify explicit object counts. To address this limitation, we introduce ATHENA, a model-agnostic, test-time adaptive steering framework that improves object count fidelity without modifying model architectures or requiring retraining. ATHENA leverages intermediate representations during sampling to estimate object counts and applies count-aware noise corrections early in the denoising process, steering the generation trajectory before structural errors become difficult to revise. We present three progressively more advanced variants of ATHENA that trade additional computation for improved numerical accuracy, ranging from static prompt-based steering to dynamically adjusted count-aware control. Experiments on established benchmarks and a new visually and semantically complex dataset show that ATHENA consistently improves count fidelity, particularly at higher target counts, while maintaining favorable accuracy-runtime trade-offs across multiple diffusion backbones.[125] Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification
Kunlun Xu,Haotong Cheng,Jiangmeng Li,Xu Zou,Jiahuan Zhou
Main category: cs.CV
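TL;DR: This paper proposes VLADR, a new vision-language model (VLM)-based method for lifelong person re-identification (LReID) that improves knowledge transfer and anti-forgetting capacity through multi-grain text attribute disentanglement and inter-domain cross-modal attribute reinforcement.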
Details
Motivation: Existing LReID methods mostly train from scratch or start from image classification-pretrained models, overlooking the general knowledge in VLMs; they also consider only global features and underuse fine-grained attribute information, which limits knowledge acquisition and anti-forgetting capacity. Method: Proposes the VLADR framework, comprising: 1) a Multi-grain Text Attribute Disentanglement mechanism that mines the global and local text attributes of an image; 2) an Inter-domain Cross-modal Attribute Reinforcement scheme that uses cross-modal and inter-domain attribute alignment to guide visual attribute extraction and achieve fine-grained knowledge transfer. Result: Outperforms SOTA methods by 1.9%-2.2% on anti-forgetting and by 2.1%-2.5% on generalization capacity. Conclusion: Explicitly modeling universally shared human attributes improves inter-domain knowledge transfer and the reuse of historical knowledge in LReID; VLADR demonstrates the effectiveness of VLM-driven fine-grained attribute learning in lifelong learning. Abstract: Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or from a visual classification-pretrained model, while the Vision-Language Model (VLM) has shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to the VLM, they only consider global-aware learning, so the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively utilizing historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. Then, an Inter-domain Cross-modal Attribute Reinforcement scheme is developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that our VLADR outperforms the state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% on anti-forgetting and generalization capacity, respectively. Our source code is available at https://github.com/zhoujiahuan1991/CVPR2026-VLADR[126] Unbiased Dynamic Multimodal Fusion
Shicai Wei,Kaijie Zhang,Luyi Chen,Tao He,Guiduo Duan
Main category: cs.CV
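TL;DR: This paper proposes the Unbiased Dynamic Multimodal Learning (UDML) framework, which uses a noise-aware uncertainty estimator and a mechanism for quantifying modality reliance bias to address two weaknesses of existing dynamic multimodal methods: inaccurate uncertainty estimates under extreme noise and neglect of intrinsic modality bias.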
Details
Motivation: 传统多模态方法假设模态质量静态不变,难以适应真实动态场景;现有动态方法依赖经验指标,在极低或极高噪声下无法准确评估模态质量,且忽略模态间固有的依赖偏差,导致难学模态被双重抑制。 Method: 提出UDML框架:1)设计噪声感知不确定性估计器,通过向模态数据注入可控噪声并从特征预测噪声强度,建立特征退化与噪声水平的明确映射;2)利用模态丢弃量化网络中固有模态依赖偏差,并将其融入加权机制。 Result: 在多个多模态基准任务上实验验证了UDML在有效性、通用性和泛化性上的优势,性能优于静态融合及现有动态融合方法。 Conclusion: UDML通过更鲁棒的不确定性建模和对模态偏差的显式建模,显著提升了动态多模态融合的性能与鲁棒性,尤其在噪声分布广泛的实际场景中更具实用性。 Abstract: Traditional multimodal methods often assume static modality quality, which limits their adaptability in dynamic real-world scenarios. Thus, dynamical multimodal methods are proposed to assess modality quality and adjust their contribution accordingly. However, they typically rely on empirical metrics, failing to measure the modality quality when noise levels are extremely low or high. Moreover, existing methods usually assume that the initial contribution of each modality is the same, neglecting the intrinsic modality dependency bias. As a result, the modality hard to learn would be doubly penalized, and the performance of dynamical fusion could be inferior to that of static fusion. To address these challenges, we propose the Unbiased Dynamic Multimodal Learning (UDML) framework. Specifically, we introduce a noise-aware uncertainty estimator that adds controlled noise to the modality data and predicts its intensity from the modality feature. This forces the model to learn a clear correspondence between feature corruption and noise level, allowing accurate uncertainty measure across both low- and high-noise conditions. Furthermore, we quantify the inherent modality reliance bias within multimodal networks via modality dropout and incorporate it into the weighting mechanism. This eliminates the dual suppression effect on the hard-to-learn modality. Extensive experiments across diverse multimodal benchmark tasks validate the effectiveness, versatility, and generalizability of the proposed UDML. 
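The two ingredients described in the UDML abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Gaussian noise model, and the softmax weighting that subtracts a reliance-bias term are all assumptions made for illustration. The first part builds (corrupted input, noise level) pairs that a noise-intensity regressor could be trained on; the second shows one hypothetical way to fold a per-modality bias estimate into the fusion weights.

```python
import math
import random

def corrupt(sample, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each value."""
    return [value + rng.gauss(0.0, sigma) for value in sample]

def make_noise_batch(samples, max_sigma, rng):
    """Pair every corrupted sample with the noise level that produced it.

    A regressor trained on these pairs learns to predict sigma from the
    modality feature; its prediction can then serve as an uncertainty score.
    """
    batch = []
    for sample in samples:
        sigma = rng.uniform(0.0, max_sigma)
        batch.append((corrupt(sample, sigma, rng), sigma))
    return batch

def fusion_weights(uncertainties, reliance_bias):
    """Hypothetical weighting: softmax over -(uncertainty + reliance bias).

    reliance_bias[i] stands for how strongly the network over-relies on
    modality i (for example, estimated via modality dropout). Subtracting it
    keeps a dominant modality from also receiving the largest weight.
    """
    logits = [-(u + b) for u, b in zip(uncertainties, reliance_bias)]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With equal bias, the less uncertain modality receives the larger weight; with equal uncertainty, the modality the network over-relies on is down-weighted.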
The code is available at https://github.com/shicaiwei123/UDML.[127] 3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction
Takeshi Noda,Yu-Shen Liu,Zhizhong Han
Main category: cs.CV
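TL;DR: This paper proposes a self-constrained prior derived from a TSDF grid to improve the surface reconstruction accuracy of 3D Gaussian Splatting (3DGS); a dynamically updated band of the distance field constrains the position, density, and opacity of the Gaussians.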
Details
Motivation: 尽管3D高斯泼溅(3DGS)在渲染质量与速度上优于NeRF,但在高保真表面重建方面仍有不足,需更有效的几何约束机制。 Method: 构建由当前3D高斯渲染深度图融合生成的TSDF网格,从中导出带状距离场先验;利用该先验对高斯位置、数量及不透明度进行几何感知约束,并随训练迭代动态更新和收紧约束带宽。 Result: 在多个主流基准上验证了该方法的有效性,显著优于当前最先进方法,在深度渲染精度与表面重建保真度方面均有提升。 Conclusion: 所提出的自约束先验机制能有效提升3DGS的几何重建能力,为辐射场建模提供了新的几何引导范式。 Abstract: Rendering 3D surfaces has been revolutionized within the modeling of radiance fields through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constrain the learning of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is derived from a TSDF grid that is obtained by fusing the depth maps rendered with current 3D Gaussians. The prior measures a distance field around the estimated surface, offering a band centered at the surface for imposing more specific constraints on 3D Gaussians, such as removing Gaussians outside the band, moving Gaussians closer to the surface, and encouraging larger or smaller opacity in a geometry-aware manner. More importantly, our prior can be regularly updated by the most recent depth images which are usually more accurate and complete. In addition, the prior can also progressively narrow the band to tighten the imposed constraints. We justify our idea and report our superiority over the state-of-the-art methods in evaluations on widely used benchmarks.[128] TSegAgent: Zero-Shot Tooth Segmentation via Geometry-Aware Vision-Language Agents
Shaojie Zhuang,Lu Yin,Guangshun Wei,Yunpeng Li,Xilu Wang,Yuanfeng Zhou
Main category: cs.CV
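TL;DR: This paper proposes TSegAgent, which reformulates dental analysis as a zero-shot geometric reasoning problem and combines general-purpose foundation models with geometric priors from dental anatomy to segment and identify teeth without task-specific training, substantially reducing annotation and computational cost and improving generalization across data sources.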
Details
Motivation: 现有方法依赖密集标注的3D神经网络,标注成本高且泛化能力差,难以适应不同来源的口腔扫描数据。 Method: 提出TSegAgent框架,融合多视角视觉抽象与几何感知推理,显式编码牙弓结构、体积关系等解剖约束,避免学习牙科特异性特征,实现零样本牙齿实例分割与识别。 Result: 在多种未见过的口腔扫描数据上实现了高精度、高鲁棒性的牙齿分割与识别,同时大幅降低计算与标注成本。 Conclusion: 将牙科分析从纯数据驱动转向几何推理驱动是可行且有效的,为低资源、强泛化数字牙科分析提供了新范式。 Abstract: Automatic tooth segmentation and identification from intra-oral scanned 3D models are fundamental problems in digital dentistry, yet most existing approaches rely on task-specific 3D neural networks trained with densely annotated datasets, resulting in high annotation cost and limited generalization to scans from unseen sources. Thus, we propose TSegAgent, which addresses these challenges by reformulating dental analysis as a zero-shot geometric reasoning problem rather than a purely data-driven recognition task. The key idea is to combine the representational capacity of general-purpose foundation models with explicit geometric inductive biases derived from dental anatomy. Instead of learning dental-specific features, the proposed framework leverages multi-view visual abstraction and geometry-grounded reasoning to infer tooth instances and identities without task-specific training. By explicitly encoding structural constraints such as dental arch organization and volumetric relationships, the method reduces uncertainty in ambiguous cases and mitigates overfitting to particular shape distributions. Experimental results demonstrate that this reasoning-oriented formulation enables accurate and reliable tooth segmentation and identification with low computational and annotation cost, while exhibiting strong generalization across diverse and previously unseen dental scans.[129] Demographic-Aware Self-Supervised Anomaly Detection Pretraining for Equitable Rare Cardiac Diagnosis
Chaoqin Huang,Zi Zeng,Aofan Jiang,Yuchen Xu,Qing Cao,Kang Chen,Chenfei Chi,Yanfeng Wang,Ya Zhang
Main category: cs.CV
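TL;DR: This paper proposes an equity-aware AI framework for rare cardiac anomalies that combines self-supervised anomaly detection with demographic-aware representation learning; on more than one million ECGs, it markedly improves rare-anomaly detection and narrows the diagnostic gaps between common and rare conditions and across demographic groups.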
Details
Motivation: 罕见心脏异常在心电图中难以检测,因其长尾分布、病例极少且存在人群间诊断性能差异,导致识别延迟和医疗质量不均,亟需兼具高敏感性与公平性的通用AI框架。 Method: 构建两阶段AI辅助ECG框架:第一阶段通过掩码信号重建、趋势建模和患者属性预测进行无监督预训练;第二阶段采用非对称损失微调多标签分类,并生成可定位的异常得分图,支持CPU部署。 Result: 在超百万临床ECG纵向队列上,罕见异常AUROC达94.7%,常见-罕见性能差距降低73%,且在不同年龄和性别组中保持诊断一致性。 Conclusion: 该公平性感知AI框架具备强临床实用性、可解释的异常定位能力及多中心可扩展性,有望缓解诊断不平等,推动生物医学信号与数字健康中的公平异常检测。 Abstract: Rare cardiac anomalies are difficult to detect from electrocardiograms (ECGs) due to their long-tailed distribution with extremely limited case counts and demographic disparities in diagnostic performance. These limitations contribute to delayed recognition and uneven quality of care, creating an urgent need for a generalizable framework that enhances sensitivity while ensuring equity across diverse populations. In this study, we developed an AI-assisted two-stage ECG framework integrating self-supervised anomaly detection with demographic-aware representation learning. The first stage performs self-supervised anomaly detection pretraining by reconstructing masked global and local ECG signals, modeling signal trends, and predicting patient attributes to learn robust ECG representations without diagnostic labels. The pretrained model is then fine-tuned for multi-label ECG classification using asymmetric loss to better handle long-tail cardiac abnormalities, and additionally produces anomaly score maps for localization, with CPU-based optimization enabling practical deployment. Evaluated on a longitudinal cohort of over one million clinical ECGs, our method achieves an AUROC of 94.7% for rare anomalies and reduces the common-rare performance gap by 73%, while maintaining consistent diagnostic accuracy across age and sex groups. 
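The abstract does not say which asymmetric loss is used for fine-tuning, so the following is only a sketch of one widely used form for long-tailed multi-label problems: positives and negatives get different focusing exponents, and negatives are additionally shifted by a probability margin. The default values of `gamma_pos`, `gamma_neg`, and `margin` are illustrative assumptions, not settings reported by the paper.

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sum of per-label asymmetric losses for one multi-label sample.

    probs   -- predicted probability for each label, in [0, 1]
    targets -- 1 if the label is present, 0 otherwise
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: focal-style term with a mild exponent.
            total -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by the margin so
            # that very easy negatives contribute nothing, then down-weight
            # the rest with a stronger exponent.
            p_shifted = max(p - margin, 0.0)
            total -= p_shifted ** gamma_neg * math.log(max(1.0 - p_shifted, eps))
    return total
```

The design intent is that the many easy negatives of a long-tailed label set stop dominating the gradient, while a missed rare positive still incurs a large penalty.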
In conclusion, the proposed equity-aware AI framework demonstrates strong clinical utility, interpretable anomaly localization, and scalable performance across multiple cohorts, highlighting its potential to mitigate diagnostic disparities and advance equitable anomaly detection in biomedical signals and digital health. Source code is available at https://github.com/MediaBrain-SJTU/Rare-ECG.[130] WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
Ziya Erkoç,Angela Dai,Matthias Nießner
Main category: cs.CV
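TL;DR: This paper investigates whether 2D foundation image models implicitly possess 3D world modeling capabilities and proposes a multi-agent framework (a VLM director, an image generator, and a two-step VLM verifier) that extracts and exploits this implicit 3D knowledge to synthesize consistent, renderable 3D worlds.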
Details
Motivation: 探究2D基础图像模型是否天然具备3D世界建模能力,而非依赖显式3D监督或架构。 Method: 提出一种多智能体框架:VLM作为导演生成引导性提示;图像生成模型合成新视角图像;VLM驱动的两步验证器分别在2D图像空间和3D重建空间评估并筛选生成帧,以实现隐式3D一致性优化。 Result: 在多个SOTA图像生成模型和VLM上验证了该方法的有效性,生成的3D场景具有高度一致性,支持新颖视角渲染,证明2D模型确含可被激发的隐式3D理解。 Conclusion: 2D基础图像模型确实蕴含对3D世界的隐式理解,通过合理的代理式协同机制可有效挖掘并利用该能力,实现高质量、可探索的3D世界合成。 Abstract: Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framing to facilitate 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames from both 2D image and 3D reconstruction space. Crucially, we demonstrate that our agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.[131] BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates
Phuong-Anh Nguyen,Tien Anh Pham,Duc-Trong Le,Cam-Van Thi Nguyen
Main category: cs.CV
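TL;DR: This paper proposes BALM, a framework whose feature calibration and gradient rebalancing modules enable balanced optimization in multimodal learning under imbalanced missing rates, improving model robustness and performance.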
Details
Motivation: 多模态学习常因模态间信息量差异及不均衡缺失率(IMR)导致优化失衡,弱模态或缺失模态贡献不足,影响表征学习与梯度动态。 Method: 提出模型无关的插件式框架BALM,包含特征校准模块(FCM)——利用全局上下文对齐异构缺失模式下的单模态特征;以及梯度重平衡模块(GRM)——从分布与空间视角联合调节各模态梯度的幅值与方向。 Result: 在多个多模态情感识别(MER)基准上验证了BALM的有效性,显著提升模型在各类缺失与不平衡场景下的鲁棒性与性能。 Conclusion: BALM是一种通用、即插即用的解决方案,无需修改骨干网络结构,可有效缓解IMR下的多模态学习失衡问题。 Abstract: Learning from multiple modalities often suffers from imbalance, where information-rich modalities dominate optimization while weaker or partially missing modalities contribute less. This imbalance becomes severe in realistic settings with imbalanced missing rates (IMR), where each modality is absent with different probabilities, distorting representation learning and gradient dynamics. We revisit this issue from a training-process perspective and propose BALM, a model-agnostic plug-in framework to achieve balanced multimodal learning under IMR. The framework comprises two complementary modules: the Feature Calibration Module (FCM), which recalibrates unimodal features using global context to establish a shared representation basis across heterogeneous missing patterns; the Gradient Rebalancing Module (GRM), which balances learning dynamics across modalities by modulating gradient magnitudes and directions from both distributional and spatial perspectives. BALM can be seamlessly integrated into diverse backbones, including multimodal emotion recognition (MER) models, without altering their architectures. Experimental results across multiple MER benchmarks confirm that BALM consistently enhances robustness and improves performance under diverse missing and imbalance settings. Code available at: https://github.com/np4s/BALM_CVPR2026.git[132] PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing
Jiadong Liang,Bojun Xiong,Jie Tian,Hua Li,Xiao Long,Yong Zheng,Huan Fu
Main category: cs.CV
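TL;DR: This paper proposes PerformRecast, which performs expression-only performance editing of portrait video based on a driving video. It uses properties of the 3DMM to improve the keypoint transformation formula so that expression and pose are better disentangled, and it improves boundary alignment and editing accuracy by decoupling facial and non-facial regions under teacher-model supervision.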
Details
Motivation: Existing portrait animation methods struggle to disentangle facial expression from head pose rotation, so they cannot edit expression independently, which limits their use in film and animation.
Method: PerformRecast (1) improves the keypoint transformation formula so that it better matches the structure of the 3DMM, strengthening the disentanglement of expression and pose; and (2) decouples facial and non-facial regions and pre-trains a teacher model to supervise them separately, which reduces misalignment at the face boundary in generated results.
Result: Experiments show that the generated results are more faithful to the driving video and that the method outperforms existing methods in both controllability and efficiency.
Conclusion: PerformRecast achieves high-quality, fine-grained, expression-only video performance editing and offers a practical new option for re-editing performances in film and animation.
Abstract: This paper primarily investigates the task of expression-only portrait video performance editing based on a driving video, which plays a crucial role in animation and film industries. Most existing research mainly focuses on portrait animation, which aims to animate a static portrait image according to the facial motion from the driving video. As a consequence, it remains challenging for them to disentangle the facial expression from head pose rotation and thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile expression-only video editing method which is dedicated to recasting the performance in existing film and animation. The key insight of our method comes from the characteristics of the 3D Morphable Face Model (3DMM), which models the face identity, facial expression and head pose of a 3D face mesh with separate parameters. Therefore, we improve the keypoints transformation formula in previous methods to make it more consistent with the 3DMM, which achieves a better disentanglement and provides users with much more fine-grained control. Furthermore, to avoid the misalignment around the boundary of the face in generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for them. Extensive experiments show that our method produces high-quality results which are more faithful to the driving video, outperforming existing methods in both controllability and efficiency.
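The abstract's key observation is that a 3DMM keeps identity, expression and head pose in separate parameters. The sketch below illustrates only that generic linear 3DMM idea with random toy bases; it is not the paper's keypoint transformation formula, and every name and size in it (`mean_shape`, `id_basis`, `exp_basis`, `face_mesh`, five vertices) is a hypothetical placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vertices, n_id, n_exp = 5, 3, 2          # toy sizes, chosen for illustration only

mean_shape = rng.normal(size=(n_vertices, 3))
id_basis = rng.normal(size=(n_id, n_vertices, 3))    # identity deformation basis
exp_basis = rng.normal(size=(n_exp, n_vertices, 3))  # expression deformation basis


def rotation_y(angle):
    """Rotation matrix about the y axis (a stand-in for head yaw)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def face_mesh(id_coeff, exp_coeff, rotation, translation):
    """Linear 3DMM: shape from identity + expression, then a rigid head pose."""
    shape = (mean_shape
             + np.tensordot(id_coeff, id_basis, axes=1)
             + np.tensordot(exp_coeff, exp_basis, axes=1))
    return shape @ rotation.T + translation


# Source frame and driving frame have different expressions AND different poses.
src_id, src_exp = rng.normal(size=n_id), rng.normal(size=n_exp)
drv_exp = rng.normal(size=n_exp)
src_rot, src_trans = rotation_y(0.3), np.array([0.1, 0.0, 0.5])

source = face_mesh(src_id, src_exp, src_rot, src_trans)
# Expression-only edit: take the driving expression, keep source identity and pose.
edited = face_mesh(src_id, drv_exp, src_rot, src_trans)

# Undo the (unchanged) source pose; the remaining difference is purely expression.
unposed_diff = (edited - src_trans) @ src_rot - (source - src_trans) @ src_rot
expected_diff = np.tensordot(drv_exp - src_exp, exp_basis, axes=1)
```

Because the pose parameters are untouched, the edited mesh differs from the source only through the expression term, which is the kind of separation an expression-only editor needs.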
Our code, data and trained models are available at https://youku-aigc.github.io/PerformRecast.
[133] PhysNeXt: Next-Generation Dual-Branch Structured Attention Fusion Network for Remote Photoplethysmography Measurement
Junzhe Cao,Bo Zhao,Zhiyi Niu,Dan Guo,Yue Sun,Haochen Liang,Yong Xu,Zitong YU
Main category: cs.CV
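TL;DR: This paper proposes PhysNeXt, a framework that jointly uses raw video frames and STMap representations. Through spatio-temporal difference modeling, cross-modal interaction, and a structured attention decoder, it improves the robustness and accuracy of contactless heart rate measurement.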
Details
Motivation: Among existing rPPG methods, end-to-end video modeling is easily disturbed by motion and illumination noise, while STMap representations are computationally efficient but lose high-frequency detail. Combining the strengths of both should make signal extraction more robust.
Method: PhysNeXt is a dual-input deep learning framework that fuses raw video frames with STMaps. It introduces a spatio-temporal difference modeling unit, a cross-modal interaction module, and a structured attention decoder so that the two inputs reinforce each other.
Result: PhysNeXt recovers more stable and finer-grained rPPG signals under challenging conditions, and the experiments support the value of jointly modeling video and STMap.
Conclusion: By jointly modeling two complementary representations, PhysNeXt substantially improves the robustness and accuracy of contactless physiological signal extraction and offers a new paradigm for rPPG methods.
Abstract: Remote photoplethysmography (rPPG) enables contactless measurement of heart rate and other vital signs by analyzing subtle color variations in facial skin induced by cardiac pulsation. Current rPPG methods are mainly based on either end-to-end modeling from raw videos or intermediate spatial-temporal map (STMap) representations. The former preserves complete spatiotemporal information and can capture subtle heartbeat-related signals, but it also introduces substantial noise from motion artifacts and illumination variations. The latter stacks the temporal color changes of multiple facial regions of interest into compact two-dimensional representations, significantly reducing data volume and computational complexity, although some high-frequency details may be lost. To effectively integrate the mutual strengths, we propose PhysNeXt, a dual-input deep learning framework that jointly exploits video frames and STMap representations. By incorporating a spatio-temporal difference modeling unit, a cross-modal interaction module, and a structured attention-based decoder, PhysNeXt collaboratively enhances the robustness of pulse signal extraction. Experimental results demonstrate that PhysNeXt achieves more stable and fine-grained rPPG signal recovery under challenging conditions, validating the effectiveness of joint modeling of video and STMap representations. The codes will be released.
[134] ReLi3D: Relightable Multi-view 3D Reconstruction with Disentangled Illumination
Jan-Niklas Dihlmann,Mark Boss,Simon Donne,Andreas Engelhardt,Hendrik P. A. Lensch,Varun Jampani
Main category: cs.CV
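TL;DR: ReLi3D is the first unified end-to-end framework that simultaneously reconstructs complete 3D geometry, spatially varying physically based materials, and environment illumination from sparse multi-view images in under one second.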
Details
Motivation: Traditional 3D reconstruction requires separate pipelines for geometry, materials, and illumination, each with its own limitations and high computational cost. Single-image methods struggle to disentangle material from illumination, which is an ill-posed problem.
Method: Multi-view inputs are fused with a transformer cross-conditioning architecture, followed by two-path prediction (structure and appearance on one path, environment illumination on the other) and a differentiable Monte Carlo multiple importance sampling renderer. Training uses a mixed protocol that combines synthetic PBR data with real RGB images.
Result: Complete, relightable 3D assets are generated in under one second, with good generalization in geometry, material, and illumination quality.
Conclusion: Unifying previously separate reconstruction tasks into a single feed-forward pass markedly improves efficiency and practicality and advances real-time, high-quality neural 3D reconstruction.
Abstract: Reconstructing 3D assets from images has long required separate pipelines for geometry reconstruction, material estimation, and illumination recovery, each with distinct limitations and computational overhead. We present ReLi3D, the first unified end-to-end pipeline that simultaneously reconstructs complete 3D geometry, spatially-varying physically-based materials, and environment illumination from sparse multi-view images in under one second. Our key insight is that multi-view constraints can dramatically improve material and illumination disentanglement, a problem that remains fundamentally ill-posed for single-image methods. Key to our approach is the fusion of the multi-view input via a transformer cross-conditioning architecture, followed by a novel unified two-path prediction strategy. The first path predicts the object's structure and appearance, while the second path predicts the environment illumination from image background or object reflections. This, combined with a differentiable Monte Carlo multiple importance sampling renderer, creates an optimal illumination disentanglement training pipeline. In addition, with our mixed domain training protocol, which combines synthetic PBR datasets with real-world RGB captures, we establish generalizable results in geometry, material accuracy, and illumination quality. By unifying previously separate reconstruction tasks into a single feed-forward pass, we enable near-instantaneous generation of complete, relightable 3D assets. Project Page: https://reli3d.jdihlmann.com/
[135] Uncertainty-aware Prototype Learning with Variational Inference for Few-shot Point Cloud Segmentation
Yifei Zhao,Fanyu Zhao,Yinsheng Li
Main category: cs.CV
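TL;DR: This paper proposes UPL (Uncertainty-aware Prototype Learning), which uses probabilistic modeling to bring uncertainty into prototype learning for few-shot 3D semantic segmentation, improving robustness and generalization.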
Details
Motivation: Existing prototype-based methods use deterministic prototypes, which cannot capture the intrinsic uncertainty of few-shot supervision and therefore limit robustness and generalization.
Method: UPL consists of a dual-stream prototype refinement module, which jointly uses information from the support and query sets, and probabilistic prototype modeling based on variational inference, which treats class prototypes as latent variables.
Result: UPL achieves consistent SOTA performance on ScanNet and S3DIS and provides reliable uncertainty estimates.
Conclusion: Probabilistic prototype modeling effectively improves the robustness, generalization, and interpretability of few-shot 3D semantic segmentation.
Abstract: Few-shot 3D semantic segmentation aims to generate accurate semantic masks for query point clouds with only a few annotated support examples. Existing prototype-based methods typically construct compact and deterministic prototypes from the support set to guide query segmentation. However, such rigid representations are unable to capture the intrinsic uncertainty introduced by scarce supervision, which often results in degraded robustness and limited generalization. In this work, we propose UPL (Uncertainty-aware Prototype Learning), a probabilistic approach designed to incorporate uncertainty modeling into prototype learning for few-shot 3D segmentation. Our framework introduces two key components. First, UPL introduces a dual-stream prototype refinement module that enriches prototype representations by jointly leveraging limited information from both support and query samples. Second, we formulate prototype learning as a variational inference problem, regarding class prototypes as latent variables. This probabilistic formulation enables explicit uncertainty modeling, providing robust and interpretable mask predictions. Extensive experiments on the widely used ScanNet and S3DIS benchmarks show that our UPL achieves consistent state-of-the-art performance under different settings while providing reliable uncertainty estimation. The code is available at https://fdueblab-upl.github.io/.
[136] Growing Networks with Autonomous Pruning
Charles De Lambilly,Stefan Duffner
Main category: cs.CV
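TL;DR: This paper proposes GNAP (Growing Networks with Autonomous Pruning), an image classification method that grows and prunes the network autonomously during training, dynamically adjusting its structure and parameter count to reach very high sparsity while keeping accuracy high.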
Details
Motivation: Traditional CNNs have a fixed structure, which makes it hard to find the best balance between model complexity and performance. The goal is a network architecture that adapts its size to the data and combines high accuracy with a low parameter count.
Method: GNAP alternates periodically between growth and pruning. It starts from a small network, automatically expands capacity (growth) once training has converged to a saturation point, and then, while training for classification, simultaneously prunes itself, driven entirely by gradient descent with no manual intervention or preset pruning schedule.
Result: GNAP reaches 99.44% accuracy on MNIST with only 6.2k parameters and 92.2% accuracy on CIFAR10 with only 157.8k parameters, showing that it keeps strong classification performance at high sparsity.
Conclusion: GNAP shows that networks with dynamically variable structure are feasible and efficient for image classification and offers a new paradigm for lightweight, adaptive neural network design.
Abstract: This paper introduces Growing Networks with Autonomous Pruning (GNAP) for image classification. Unlike traditional convolutional neural networks, GNAP change their size, as well as the number of parameters they are using, during training, in order to best fit the data while trying to use as few parameters as possible. This is achieved through two complementary mechanisms: growth and pruning. GNAP start with few parameters, but their size is expanded periodically during training to add more expressive power each time the network has converged to a saturation point. Between these growing phases, model parameters are trained for classification and pruned simultaneously, with complete autonomy by gradient descent. Growing phases allow GNAP to improve their classification performance, while autonomous pruning allows them to keep as few parameters as possible. Experimental results on several image classification benchmarks show that our approach can train extremely sparse neural networks with high accuracy. For example, on MNIST, we achieved 99.44% accuracy with as few as 6.2k parameters, while on CIFAR10, we achieved 92.2% accuracy with 157.8k parameters.
[137] PCSTracker: Long-Term Scene Flow Estimation for Point Cloud Sequences
Min Lin,Gangwei Xu,Xianqi Wang,Yuyi Peng,Xin Yang
Main category: cs.CV
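TL;DR: This paper proposes PCSTracker, the first end-to-end framework for consistent scene flow estimation on point cloud sequences. An IGMO module models the temporal evolution of point features, an STTU module infers the trajectories of occluded points, and a sliding-window strategy suppresses error accumulation, giving accurate, real-time 3D motion estimation on long sequences.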
Details
Motivation: Existing methods are limited to pairwise two-frame settings and struggle to stay temporally consistent over long sequences, especially as geometry evolves, occlusions appear, and errors accumulate.
Method: PCSTracker comprises (1) an iterative geometry-motion joint optimization module (IGMO) that explicitly models the temporal evolution of point features; (2) a spatio-temporal point trajectory update module (STTU) that uses broad temporal context to infer the positions of occluded points; and (3) an overlapping sliding-window inference strategy that alternates cross-window propagation with in-window refinement.
Result: On the PointOdyssey3D (synthetic) and ADT3D (real-world) datasets, PCSTracker achieves the best accuracy in long-term scene flow estimation, runs at 32.5 FPS, and shows better 3D motion understanding than RGB-D methods.
Conclusion: PCSTracker addresses temporal inconsistency and error accumulation in long-sequence point cloud scene flow estimation while balancing accuracy and efficiency, offering a new paradigm for fine-grained, long-term 3D motion analysis.
Abstract: Point cloud scene flow estimation is fundamental to long-term and fine-grained 3D motion analysis. However, existing methods are typically limited to pairwise settings and struggle to maintain temporal consistency over long sequences as geometry evolves, occlusions emerge, and errors accumulate. In this work, we propose PCSTracker, the first end-to-end framework specifically designed for consistent scene flow estimation in point cloud sequences. Specifically, we introduce an iterative geometry motion joint optimization module (IGMO) that explicitly models the temporal evolution of point features to alleviate correspondence inconsistencies caused by dynamic geometric changes. In addition, a spatio-temporal point trajectory update module (STTU) is proposed to leverage broad temporal context to infer plausible positions for occluded points, ensuring coherent motion estimation. To further handle long sequences, we employ an overlapping sliding-window inference strategy that alternates cross-window propagation and in-window refinement, effectively suppressing error accumulation and maintaining stable long-term motion consistency. Extensive experiments on the synthetic PointOdyssey3D and real-world ADT3D datasets show that PCSTracker achieves the best accuracy in long-term scene flow estimation and maintains real-time performance at 32.5 FPS, while demonstrating superior 3D motion understanding compared to RGB-D-based approaches.
[138] FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs
Zhihan Yin,Jianxin Liang,Yueqian Wang,Yifeng Yao,Huishuai Zhang,Dongyan Zhao
Main category: cs.CV
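TL;DR: This paper proposes the FREAK benchmark for fine-grained evaluation of hallucination in the detailed visual perception of multimodal large language models (MLLMs). Counter-commonsense edits in high-fidelity images expose severe hallucination in current SOTA models, and a controlled subset combined with chain-of-thought prompting is used to analyze hallucination patterns and reasoning processes.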
Details
Motivation: Existing hallucination benchmarks either use tasks so simple that metrics saturate, or lack the diversity needed to assess the extent of hallucination in advanced multimodal models.
Method: FREAK is a multimodal benchmark built from high-quality photorealistic images with fine-grained counter-commonsense edits, used to evaluate hallucination in the detailed visual perception of MLLMs. A controlled subset indirectly evaluates a model's ability to perceive the target details, and prevailing chain-of-thought (CoT) prompting techniques are evaluated systematically on the task.
Result: Extensive experiments on FREAK show that current SOTA models hallucinate severely in detailed visual perception. The systematic CoT evaluation reveals key hallucination patterns and characteristics of the models' reasoning.
Conclusion: FREAK provides a more comprehensive and more challenging benchmark for fine-grained hallucination evaluation and helps in understanding and mitigating MLLM hallucination in complex visual understanding.
Abstract: Multimodal Large Language Models (MLLMs) suffer from hallucinations. Existing hallucination evaluation benchmarks are often limited by over-simplified tasks leading to saturated metrics, or insufficient diversity that fails to adequately assess the hallucination extent in state-of-the-art multimodal models. To address this gap, we propose FREAK, a comprehensive multimodal benchmark designed for fine-grained hallucination assessment in MLLMs. Through high-quality photorealistic images featuring fine-grained counter-commonsense edits, FREAK innovatively evaluates hallucination phenomena in detailed visual perception of MLLMs. Extensive experiments on FREAK show severe hallucination issues in SOTA models regarding detailed visual perception. To enable deeper investigation, we curate a controlled subset to indirectly evaluate the model's ability to perceive target detailed information. Through systematic evaluation of prevailing Chain-of-Thought (CoT) prompting techniques within this task, we reveal critical insights regarding hallucination patterns and model reasoning processes.
[139] Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images
Donghai Fang,Yongheng Li,Zhen Wang,Yuansong Zeng,Wenwen Min
Main category: cs.CV
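TL;DR: This paper proposes HINGE, which adapts a pre-trained single-cell foundation model (sc-FM) into a histology-conditioned spatial gene expression generator by introducing SoftAdaLN modulation and an expression-space masked diffusion objective, and which outperforms existing methods on multiple spatial transcriptomics datasets.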
Details
Motivation: Most existing generative models that predict spatial gene expression from histology ignore gene-gene dependencies, which leads to biologically inconsistent outputs. Single-cell foundation models capture these gene relationships but cannot be applied directly to histology-conditioned expression modeling.
Method: The HINGE framework (1) designs SoftAdaLN, a lightweight modulation module that injects visual information from histology into the pre-trained sc-FM layer by layer; (2) adopts an expression-space masked diffusion objective; and (3) introduces a warm-start curriculum to keep training stable and the objectives aligned.
Result: On three spatial transcriptomics datasets, HINGE surpasses the current best methods in mean Pearson correlation, accuracy of spatial marker expression, and pairwise gene co-expression consistency.
Conclusion: HINGE offers a feasible and efficient new paradigm for histology-guided spatial expression generation with pre-trained single-cell foundation models.
Abstract: Spatial transcriptomics (ST) enables spot-level in situ expression profiling, but its high cost and limited throughput motivate predicting expression directly from HE-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing generative approaches omit explicit modeling of gene-gene dependencies, undermining biological coherence. Single-cell foundation models (sc-FMs), pre-trained across diverse cell populations, capture these critical gene relationships that histology alone cannot reveal. Yet, applying expression-only sc-FMs to histology-conditioned expression modeling is nontrivial due to the absence of a visual pathway, a mismatch between their pre-training and conditional ST objectives, and the scarcity of mixed-cell ST supervision. To address these challenges, we propose HINGE (HIstology-coNditioned GEneration), which retrofits a pre-trained sc-FM into a conditional expression generator while mostly preserving its learned gene relationships. We achieve this by introducing SoftAdaLN, a lightweight, identity-initialized modulation that injects layer-wise visual context into the backbone, coupled with an expression-space masked diffusion objective and a warm-start curriculum to ensure objective alignment and training stability.
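The abstract describes SoftAdaLN only as a lightweight, identity-initialized modulation and gives no formula. As a rough illustration of what "identity-initialized" can mean, here is a minimal sketch of a generic scale-and-shift modulation whose projection weights start at zero, so the pre-trained hidden states pass through unchanged at the start of training. This is a common adaptive-normalization pattern and an assumption on our part, not the paper's definition; the class name, dimensions, and weight names are all hypothetical.

```python
import numpy as np


class IdentityInitModulation:
    """Scale-and-shift conditioning that is an exact identity at initialization.

    Hypothetical stand-in for an identity-initialized modulation layer; the
    actual SoftAdaLN design is not specified in the abstract.
    """

    def __init__(self, cond_dim, hidden_dim):
        # Zero-initialized projections => scale = 0 and shift = 0 at the start.
        self.w_scale = np.zeros((cond_dim, hidden_dim))
        self.w_shift = np.zeros((cond_dim, hidden_dim))

    def __call__(self, hidden, cond):
        scale = cond @ self.w_scale   # shape: (hidden_dim,)
        shift = cond @ self.w_shift   # shape: (hidden_dim,)
        # With zero weights this reduces to `hidden`, preserving the backbone.
        return hidden * (1.0 + scale) + shift


rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(4, 8))   # 4 gene tokens, hidden size 8 (toy sizes)
visual_context = rng.normal(size=16)      # pooled histology feature (toy size)

layer = IdentityInitModulation(cond_dim=16, hidden_dim=8)
out_at_init = layer(hidden_states, visual_context)

# After some (simulated) training the weights are non-zero and the visual
# context starts to modulate the backbone activations.
layer.w_scale += 0.01 * rng.normal(size=layer.w_scale.shape)
out_after_update = layer(hidden_states, visual_context)
```

Starting from an exact identity is one way to add a visual pathway without disturbing the gene relationships the backbone has already learned.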
Evaluated on three ST datasets, HINGE outperforms state-of-the-art baselines on mean Pearson correlation and yields more accurate spatial marker expression patterns and higher pairwise co-expression consistency, establishing a practical route to adapt pre-trained sc-FMs for histology-conditioned spatial expression generation.
[140] FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision
Zekai Wu,Shuqi Fan,Mengyin Liu,Yuhua Luo,Xincheng Lin,Ming Yan,Junhao Wu,Xiuhong Lin,Yuexin Ma,Chenglu Wen,Lan Xu,Siqi Shen,Cheng Wang
Main category: cs.CV
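TL;DR: This paper proposes the FlashCap system and the FlashMotion dataset, achieving millisecond-level motion capture based on flashing LEDs for the first time, and designs the ResPose model, which substantially improves pose estimation accuracy and motion timing accuracy.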
Details
Motivation: Precise motion timing (PMT) currently depends on expensive and impractical high-speed RGB cameras, and human pose estimation (HPE) has long lacked annotated data with high temporal resolution, so PMT has been overlooked.
Method: The authors propose FlashCap, a motion capture system based on flashing LEDs; build FlashMotion, a multimodal millisecond-resolution dataset with event, RGB, LiDAR, and IMU data; and design ResPose, a model that learns residual poses jointly from event streams and RGB.
Result: ResPose reduces pose estimation error by about 40% and reaches millisecond-level motion timing accuracy. The FlashMotion dataset is shown to be of high quality through rigorous validation.
Conclusion: FlashCap and FlashMotion offer a feasible, low-cost new paradigm for PMT and high-temporal-resolution HPE and support further research and applications.
Abstract: Precise motion timing (PMT) is crucial for swift motion analysis. A millisecond difference may determine victory or defeat in sports competitions. Despite substantial progress in human pose estimation (HPE), PMT remains largely overlooked by the HPE community due to the limited availability of high-temporal-resolution labeled datasets. Today, PMT is achieved using high-speed RGB cameras in specialized scenarios such as the Olympic Games; however, their high costs, light sensitivity, bandwidth, and computational complexity limit their feasibility for daily use. We developed FlashCap, the first flashing LED-based MoCap system for PMT. With FlashCap, we collect a millisecond-resolution human motion dataset, FlashMotion, comprising the event, RGB, LiDAR, and IMU modalities, and demonstrate its high quality through rigorous validation. To evaluate the merits of FlashMotion, we perform two tasks: precise motion timing and high-temporal-resolution HPE. For these tasks, we propose ResPose, a simple yet effective baseline that learns residual poses based on event and RGB data. Experimental results show that ResPose reduces pose estimation errors by ~40% and achieves millisecond-level timing accuracy, enabling new research opportunities. The dataset and code will be shared with the community.
[141] Template-based Object Detection Using a Foundation Model
Valentin Braeutigam,Matthias Stock,Bernhard Egger
Main category: cs.CV
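TL;DR: This paper proposes an icon detection method that needs neither training nor annotated data. It combines segments from a segmentation foundation model with a simple feature-based classifier and, on icon detection in automotive navigation interfaces, performs close to learning-based methods such as YOLO.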
Details
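Motivation: In automated testing of graphical interfaces during software development, especially continuous integration testing, learning-based object detectors require large training sets and a training process, whereas the practical need is for a setup with little data variation that avoids both training and training-data generation.
Method: A segmentation foundation model (for example, SAM) generates image segments; simple hand-crafted features (such as color and shape) are extracted from each segment; and a lightweight classifier recognizes and classifies the icons, with no end-to-end training at all.
Result: On icon detection in automotive navigation maps, the method performs close to mainstream learning-based detectors such as YOLO, without any training data or model training.
Conclusion: In industrial detection scenarios with low data variation and a high need for flexibility, a training-free approach that combines foundation-model segmentation with classical feature classification is a feasible and efficient alternative.
Abstract: Most currently used object detection methods are learning-based, and can detect objects under varying appearances. Those models require training and a training dataset. We focus on use cases with less data variation, but with the requirement that no training data be generated and no training be performed. Such a setup is for example desired in automatic testing of graphical interfaces during software development, especially for continuous integration testing. In our approach, we use segments from segmentation foundation models and combine them with a simple feature-based classification method. This saves time and cost when changing the object to be searched or its design, as nothing has to be retrained and no dataset has to be created. We evaluate our method on the task of detecting and classifying icons in navigation maps, which is used to simplify and automate the testing of user interfaces in the automotive industry. Our methods achieve results almost on par with learning-based object detection methods like YOLO, without the need for training.
[142] Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach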
Shiqi Gao,Zitong Xu,Kang Fu,Huiyu Duan,Xiongkuo Min,Jia wang,Guangtao Zhai
Main category: cs.CV
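TL;DR: This paper proposes the TIEdit benchmark and the EditProbe evaluation method for more reliable assessment of text-guided image editing (TIE) methods. It covers three dimensions (perceptual quality, editing alignment, and content preservation) and uses expert subjective ratings to verify strong agreement with human judgment.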
Details
Motivation: Existing TIE evaluation benchmarks are small and correlate weakly with human perceptual judgments; a systematic, reliable, multi-dimensional evaluation scheme is missing.
Method: The authors build the TIEdit benchmark, with 512 source images, 8 types of editing task, and 5,120 edited results; recruit 20 experts to produce 307,200 subjective ratings, aggregated into 15,360 MOSs; and propose EditProbe, an LLM-based evaluator that probes the intermediate-layer representations of multimodal large language models.
Result: Experiments show that traditional automatic metrics correlate poorly with human judgments, while EditProbe agrees substantially better with human perception. TIEdit is one of the most comprehensive human-validated TIE evaluation benchmarks to date.
Conclusion: Together, TIEdit and EditProbe provide a more trustworthy evaluation paradigm for text-guided image editing that better matches human perception, moving the field toward more robust and interpretable evaluation standards.
Abstract: Evaluating text-guided image editing (TIE) methods remains a challenging problem, as reliable assessment should simultaneously consider perceptual quality, alignment with textual instructions, and preservation of original image content. Despite rapid progress in TIE models, existing evaluation benchmarks remain limited in scale and often show weak correlation with human perceptual judgments. In this work, we introduce TIEdit, a benchmark for systematic evaluation of text-guided image editing methods. TIEdit consists of 512 source images paired with editing prompts across eight representative editing tasks, producing 5,120 edited images generated by ten state-of-the-art TIE models. To obtain reliable subjective ratings, 20 experts are recruited to produce 307,200 raw subjective ratings, which are aggregated into 15,360 mean opinion scores (MOSs) across three evaluation dimensions: perceptual quality, editing alignment, and content preservation. Beyond the benchmark itself, we further propose EditProbe, an LLM-based evaluator that estimates editing quality via intermediate-layer probing of hidden representations. Instead of relying solely on final model outputs, EditProbe extracts informative representations from intermediate layers of multimodal large language models to better capture semantic and perceptual relationships between source images, editing instructions, and edited results. Experimental results demonstrate that widely used automatic evaluation metrics show limited correlation with human judgments on editing tasks, while EditProbe achieves substantially stronger alignment with human perception.
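The dataset sizes quoted in the abstract are mutually consistent, which a few lines of arithmetic make explicit. The sketch below assumes that each source image is edited once by each of the ten models, that each edited image receives one MOS per dimension, and that every expert rates every edited image on every dimension; the abstract does not state these assumptions, but they reproduce its numbers.

```python
# Counts quoted in the abstract.
source_images = 512
tie_models = 10
dimensions = 3    # perceptual quality, editing alignment, content preservation
experts = 20

# Assumption: one edited image per (source image, model) pair.
edited_images = source_images * tie_models
# Assumption: one mean opinion score per (edited image, dimension) pair.
mos_count = edited_images * dimensions
# Assumption: every expert rates every (edited image, dimension) pair once.
raw_ratings = mos_count * experts

print(edited_images, mos_count, raw_ratings)  # 5120 15360 307200
```

Under these assumptions each MOS is the average of 20 individual ratings.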
Together, TIEdit and EditProbe provide a foundation for more reliable and perceptually aligned evaluation of text-guided image editing methods.
[143] ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection
Chengzhi Hong,Bijun Li
Main category: cs.CV
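TL;DR: This paper proposes the Road-Manifold Assumption, which models the road as a smooth 2D manifold in 3D space and lanes as its 1D submanifolds. Building on it, ReManNet encodes geometry with Riemannian Gaussian descriptors on the SPD manifold and fuses them with visual features for 3D lane detection, and a 3D-TLIoU loss improves shape alignment, markedly improving monocular 3D lane detection.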
Details
Motivation: Monocular 3D lane detection is difficult because of depth ambiguity and weak geometric constraints. Existing methods rely on depth guidance, BEV projection, or detection heads with simplified physical assumptions, and lack an invariant geometric-topological coupling between lanes and the road surface, so the 2D-to-3D mapping is ill-posed and prone to deformation.
Method: The Road-Manifold Assumption states that the road is a smooth 2D manifold in R³, lanes are embedded 1D submanifolds, and sampled points are dense observations. On this basis, ReManNet first produces initial predictions with an image backbone and detection heads, then computes Riemannian Gaussian descriptors on the SPD manifold to encode geometry, and fuses them with visual features through a lightweight gate. A 3D Tunnel Lane IoU (3D-TLIoU) loss uses the slice-wise overlap of tubular neighborhoods for joint point-curve optimization.
Result: ReManNet achieves SOTA or competitive results on standard benchmarks such as OpenLane: on OpenLane, F1 improves by +8.2% over the baseline and by +1.8% over the previous best, with scenario-level gains of up to +6.6%.
Conclusion: The Road-Manifold Assumption gives monocular 3D lane detection a firmer geometric-topological foundation. With manifold-aware geometric encoding and fusion and the new 3D-TLIoU loss, ReManNet mitigates depth ambiguity and shape distortion and markedly improves detection robustness and accuracy.
Abstract: Monocular 3D lane detection remains challenging due to depth ambiguity and weak geometric constraints. Mainstream methods rely on depth guidance, BEV projection, and anchor- or curve-based heads with simplified physical assumptions, remapping high-dimensional image features while only weakly encoding road geometry. Lacking an invariant geometric-topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in $\mathbb{R}^3$, lanes are embedded 1D submanifolds, and sampled lane points are dense observations, thereby coupling metric and topology across surfaces, curves, and point sets. Building on this, we propose ReManNet, which first produces initial lane predictions with an image backbone and detection heads, then encodes geometry as Riemannian Gaussian descriptors on the symmetric positive-definite (SPD) manifold, and fuses these descriptors with visual features through a lightweight gate to maintain coherent 3D reasoning. We also propose the 3D Tunnel Lane IoU (3D-TLIoU) loss, a joint point-curve objective that computes slice-wise overlap of tubular neighborhoods along each lane to improve shape-level alignment. Extensive experiments on standard benchmarks demonstrate that ReManNet achieves state-of-the-art (SOTA) or competitive results.
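The abstract does not say how the Riemannian Gaussian descriptors are built. One well-known way to place a Gaussian on the SPD manifold, shown here purely as an illustrative assumption, is to embed $\mathcal{N}(\mu, \Sigma)$ in $\mathbb{R}^d$ as the $(d+1)\times(d+1)$ matrix $\begin{bmatrix}\Sigma + \mu\mu^\top & \mu\\ \mu^\top & 1\end{bmatrix}$, which is positive definite whenever $\Sigma$ is, and then flatten it with a matrix logarithm. The function names and the toy lane points below are hypothetical and may differ from what ReManNet actually does.

```python
import numpy as np


def gaussian_to_spd(points, eps=1e-6):
    """Embed the Gaussian fitted to `points` (N x d) as a (d+1) x (d+1) SPD matrix."""
    d = points.shape[1]
    mu = points.mean(axis=0)
    centered = points - mu
    # eps * I keeps the covariance strictly positive definite for thin point sets.
    sigma = centered.T @ centered / len(points) + eps * np.eye(d)
    spd = np.empty((d + 1, d + 1))
    spd[:d, :d] = sigma + np.outer(mu, mu)
    spd[:d, d] = mu
    spd[d, :d] = mu
    spd[d, d] = 1.0
    return spd


def spd_log(spd):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(spd)
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


# Toy 3D lane: a gently curving, slightly undulating polyline (illustrative data).
t = np.linspace(0.0, 1.0, 50)
lane_points = np.stack([t, 0.1 * t ** 2, 0.05 * np.sin(3.0 * t)], axis=1)

descriptor = gaussian_to_spd(lane_points)
log_descriptor = spd_log(descriptor)   # symmetric matrix in the tangent space
```

The log-mapped matrix lives in a flat vector space, so it can be vectorized and combined with ordinary visual features, for example through a gating layer.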
On OpenLane, it improves F1 by +8.2% over the baseline and by +1.8% over the previous best, with scenario-level gains of up to +6.6%. The code will be publicly available at https://github.com/changehome717/ReManNet.
[144] One Model, Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment
Wen Yin,Cencen Liu,Dingrui Liu,Bing Su,Yuan-Fang Li,Tao He
Main category: cs.CV
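TL;DR: This paper proposes TATAR, a framework that unifies image quality assessment (IQA) and image aesthetic assessment (IAA) through task-aware reasoning and asymmetric rewards. It addresses the mismatch between reasoning and optimization in existing methods, clearly outperforms unified baselines on multiple benchmarks, and approaches the performance of specialized models.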
Details
Motivation: Existing methods that unify IQA and IAA use a task-agnostic strategy and ignore the essential differences between the two tasks in cognitive mechanism (low-level perception versus high-level semantics) and in optimization objective (point-wise regression versus ranking), which produces a mismatch between reasoning and optimization.
Method: The TATAR framework comprises (1) fast-slow task-specific reasoning construction, with concise perceptual rationales for IQA and progressive aesthetic narratives for IAA; (2) two-stage learning with supervised fine-tuning followed by GRPO, which first establishes behavioral priors and then refines them with rewards; and (3) asymmetric reward design, with Gaussian score shaping for IQA and Thurstone-style ranking for IAA.
Result: Across 8 benchmarks, TATAR consistently outperforms unified baselines in both in-domain and cross-domain settings, performs close to specialized models, and trains more stably for aesthetic assessment.
Conclusion: Task-conditioned post-training is a sound paradigm for building unified perceptual scoring models, and TATAR offers a new approach to multi-task vision-language evaluation.
Abstract: Unifying Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) in a single multimodal large language model is appealing, yet existing methods adopt a task-agnostic recipe that applies the same reasoning strategy and reward to both tasks. We show this is fundamentally misaligned: IQA relies on low-level, objective perceptual cues and benefits from concise distortion-focused reasoning, whereas IAA requires deliberative semantic judgment and is poorly served by point-wise score regression. We identify these as a reasoning mismatch and an optimization mismatch, and provide empirical evidence for both through controlled probes. Motivated by these findings, we propose TATAR (Task-Aware Thinking with Asymmetric Rewards), a unified framework that shares the visual-language backbone while conditioning post-training on each task's nature. TATAR combines three components: fast--slow task-specific reasoning construction that pairs IQA with concise perceptual rationales and IAA with deliberative aesthetic narratives; two-stage SFT+GRPO learning that establishes task-aware behavioral priors before reward-driven refinement; and asymmetric rewards that apply Gaussian score shaping for IQA and Thurstone-style completion ranking for IAA. Extensive experiments across eight benchmarks demonstrate that TATAR consistently outperforms prior unified baselines on both tasks under in-domain and cross-domain settings, remains competitive with task-specific specialized models, and yields more stable training dynamics for aesthetic assessment.
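To make the two reward styles named in the abstract more concrete, here is a minimal sketch of what a Gaussian-shaped score reward and a Thurstone-style pairwise preference probability typically look like. The exact reward definitions used by TATAR are not given in the abstract, so the function names, the `sigma` value, and the unit-variance (Thurstone Case V) assumption are illustrative choices rather than the authors' formulation.

```python
import math


def gaussian_score_reward(predicted, target, sigma=0.5):
    """Reward of at most 1 that peaks when the predicted score equals the target.

    `sigma` (hypothetical value) controls how quickly the reward decays
    as the prediction moves away from the ground-truth score.
    """
    return math.exp(-((predicted - target) ** 2) / (2.0 * sigma ** 2))


def thurstone_preference(score_a, score_b):
    """Thurstone Case V probability that item A is preferred over item B,
    assuming independent unit-variance Gaussian noise on each latent score."""
    z = (score_a - score_b) / math.sqrt(2.0)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Point-wise shaping: closer predictions earn smoothly higher reward.
exact_reward = gaussian_score_reward(3.0, 3.0)
near_reward = gaussian_score_reward(3.2, 3.0)
far_reward = gaussian_score_reward(4.5, 3.0)

# Pairwise ranking: only the order and gap between two scores matter.
tie_probability = thurstone_preference(1.0, 1.0)
win_probability = thurstone_preference(2.0, 1.0)
```

The contrast mirrors the abstract's argument: the first reward cares about the absolute score, while the second depends only on relative ordering, which suits judgments that are easier to rank than to score.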
Our results establish task-conditioned post-training as a principled paradigm for unified perceptual scoring. Our code is publicly available at https://github.com/yinwen2019/TATAR.
[145] Decoupled Sensitivity-Consistency Learning for Weakly Supervised Video Anomaly Detection
Hantao Zheng,Ning Han,Yawen Zeng,Hao Chen
Main category: cs.CV
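TL;DR: This paper proposes the DeSC framework, which decouples sensitivity from consistency and uses two specialized streams to handle transient and sustained anomalies separately, giving more accurate video anomaly detection.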
Details
Motivation: Existing weakly supervised video anomaly detection methods face a sensitivity-stability trade-off under joint optimization and struggle to detect transient and sustained anomalies accurately at the same time.
Method: The DeSC (Decoupled Sensitivity-Consistency) framework has two streams: a temporal sensitivity stream that uses an aggressive optimization strategy to capture high-frequency abrupt changes, and a semantic consistency stream that applies robust constraints to maintain long-term coherence. Their predictions are fused by a collaborative inference mechanism.
Result: DeSC reaches 89.37% AUC on UCF-Crime (+1.29%) and 87.18% AP on XD-Violence (+2.22%), a new SOTA.
Conclusion: Through decoupled modeling and collaborative inference, DeSC eases the sensitivity-stability conflict and markedly improves weakly supervised video anomaly detection.
Abstract: Recent weakly supervised video anomaly detection methods have achieved significant advances by employing unified frameworks for joint optimization. However, this paradigm is limited by a fundamental sensitivity-stability trade-off, as the conflicting objectives for detecting transient and sustained anomalies lead to either fragmented predictions or over-smoothed responses. To address this limitation, we propose DeSC, a novel Decoupled Sensitivity-Consistency framework that trains two specialized streams using distinct optimization strategies. The temporal sensitivity stream adopts an aggressive optimization strategy to capture high-frequency abrupt changes, whereas the semantic consistency stream applies robust constraints to maintain long-term coherence and reduce noise. Their complementary strengths are fused through a collaborative inference mechanism that reduces individual biases and produces balanced predictions. Extensive experiments demonstrate that DeSC establishes new state-of-the-art performance by achieving 89.37% AUC on UCF-Crime (+1.29%) and 87.18% AP on XD-Violence (+2.22%). Code is available at https://github.com/imzht/DeSC.
[146] Learning Hierarchical Orthogonal Prototypes for Generalized Few-Shot 3D Point Cloud Segmentation
Yifei Zhao,Fanyu Zhao,Zhongyuan Zhang,Shengtang Wu,Yixuan Lin,Yinsheng Li
Main category: cs.CV
TL;DR: HOP3D是一种用于广义少样本3D点云分割的统一框架,通过分层正交原型学习和基于熵的正则化,缓解基础类与新类之间的干扰,在保持基础类性能的同时提升新类适应能力。
Details
Motivation: To address the stability-plasticity trade-off between base-class forgetting and novel-class adaptation in generalized few-shot 3D point cloud segmentation.
Method: The HOP3D framework includes a hierarchical orthogonalization mechanism, which decouples base-class and novel-class learning at both the gradient and representation levels, and an entropy-based few-shot regularizer, which uses predictive uncertainty to refine prototype learning.
Result: On ScanNet200 and ScanNet++, HOP3D significantly outperforms existing SOTA methods in both 1-shot and 5-shot settings.
Conclusion: HOP3D effectively reduces base-novel interference in few-shot scenarios and delivers 3D point cloud segmentation that combines strong generalization with stability.
Abstract: Generalized few-shot 3D point cloud segmentation aims to adapt to novel classes from only a few annotations while maintaining strong performance on base classes, but this remains challenging due to the inherent stability-plasticity trade-off: adapting to novel classes can interfere with shared representations and cause base-class forgetting. We present HOP3D, a unified framework that learns hierarchical orthogonal prototypes with an entropy-based few-shot regularizer to enable robust novel-class adaptation without degrading base-class performance. HOP3D introduces hierarchical orthogonalization that decouples base and novel learning at both the gradient and representation levels, effectively mitigating base-novel interference. To further enhance adaptation under sparse supervision, we incorporate an entropy-based regularizer that leverages predictive uncertainty to refine prototype learning and promote balanced predictions. Extensive experiments on ScanNet200 and ScanNet++ demonstrate that HOP3D consistently outperforms state-of-the-art baselines under both 1-shot and 5-shot settings. The code is available at https://fdueblab-hop3d.github.io/.
[147] From Plausibility to Verifiability: Risk-Controlled Generative OCR for Vision-Language Models
Weile Gong,Yiping Zuo,Zijian Lu,Xin He,Weibei Fan,Chen Dai
Main category: cs.CV
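TL;DR: This paper proposes a model-agnostic Geometric Risk Controller for frozen vision-language models (VLMs) used as generative OCR engines, where autoregressive decoding favors semantic plausibility and can produce visually ungrounded, severe errors. Multi-view structured verification is used to selectively accept or reject outputs, lowering the risk of extreme errors while keeping coverage controllable.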
Details
Motivation: Modern vision-language models can serve as generative OCR engines, but their autoregressive decoding favors semantic plausibility, which is at odds with the visual grounding and geometric verifiability that OCR requires. This core deployment misalignment leads to high-risk errors such as over-generation and unsupported substitutions.
Method: Frozen-VLM OCR is formulated as a selective accept/abstain problem, and a model-agnostic Geometric Risk Controller is proposed: it probes multiple structured views of the same input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability meet predefined criteria.
Result: Experiments on frozen VLM backbones and standard OCR benchmarks show that the method consistently reduces extreme-error risk and catastrophic over-generation, at the cost of a predictable drop in coverage.
Conclusion: Generative OCR with frozen VLMs needs explicit system-level risk control, rather than unconstrained generation, to be deployed reliably.
Abstract: Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors semantic plausibility, whereas OCR requires outputs that are visually grounded and geometrically verifiable. This mismatch produces severe errors, especially over-generation and unsupported substitutions, creating deployment risk even when benchmark accuracy remains high. We therefore formulate frozen VLM OCR as a selective accept/abstain problem and propose a model-agnostic Geometric Risk Controller. The controller probes multiple structured views of the same input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria, yielding a small family of operating points. Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs. Reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.
[148] Controllable Text-to-Motion Generation via Modular Body-Part Phase Control
Minyue Dai,Ke Fan,Anyi Rao,Jingbo Wang,Bo Dai
Main category: cs.CV
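TL;DR: This paper proposes Modular Body-Part Phase Control, a framework that uses scalar phase signals to enable fine-grained, interpretable, localized editing of specific body parts in text-driven motion generation while preserving overall motion coherence.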
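The abstract describes each body-part channel as a sinusoid with amplitude, frequency, phase shift, and offset. The sketch below shows why such a four-scalar code is convenient for editing magnitude, speed, and timing. The exact functional form, the class name `PhaseCode`, and the `edit` helper are assumptions for illustration; the paper's Phase ControlNet is not modeled.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PhaseCode:
    amplitude: float    # motion magnitude
    frequency: float    # cycles per second (assumed unit)
    phase_shift: float  # radians
    offset: float       # constant bias


def phase_signal(code, t):
    """Evaluate the assumed form A * sin(2*pi*f*t + phi) + b at time t."""
    angle = 2 * math.pi * code.frequency * t + code.phase_shift
    return code.amplitude * math.sin(angle) + code.offset


def edit(code, magnitude=1.0, speed=1.0, delay=0.0):
    """Scale magnitude and speed, and delay the motion by `delay` seconds."""
    new_freq = code.frequency * speed
    return replace(
        code,
        amplitude=code.amplitude * magnitude,
        frequency=new_freq,
        # Delaying s(t) to s(t - d) subtracts 2*pi*f*d from the phase.
        phase_shift=code.phase_shift - 2 * math.pi * new_freq * delay,
    )
```

Each edit touches one scalar (or two for a delay), which is the sense in which a phase interface is more compact than per-joint trajectory constraints.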
Details
Motivation: Existing text-to-motion (T2M) methods rely on high-dimensional joint constraints (e.g., trajectories), which makes user-friendly, iterative local editing difficult; a structured, lightweight, interpretable mechanism for local motion control is needed.
Method: Models each body part's latent motion channels as sinusoidal phase signals described by four scalars (amplitude, frequency, phase shift, offset) and extracts interpretable, part-specific dynamics codes. A modular Phase ControlNet branch injects this signal into the generative backbone via residual feature modulation, decoupling control from generation.
Result: On both diffusion- and flow-based models, the method gives predictable, fine-grained control over motion magnitude, speed, and timing, improving local editing while preserving global motion coherence.
Conclusion: The framework is a plug-and-play, clearly structured, parameter-compact approach to controllable T2M generation, offering a practical solution for interactive animation and virtual avatars.
Abstract: Text-to-motion (T2M) generation is becoming a practical tool for animation and interactive avatars. However, modifying specific body parts while maintaining overall motion coherence remains challenging. Existing methods typically rely on cumbersome, high-dimensional joint constraints (e.g., trajectories), which hinder user-friendly, iterative refinement. To address this, we propose Modular Body-Part Phase Control, a plug-and-play framework enabling structured, localized editing via a compact, scalar-based phase interface. By modeling body-part latent motion channels as sinusoidal phase signals characterized by amplitude, frequency, phase shift, and offset, we extract interpretable codes that capture part-specific dynamics. A modular Phase ControlNet branch then injects this signal via residual feature modulation, seamlessly decoupling control from the generative backbone. Experiments on both diffusion- and flow-based models demonstrate that our approach provides predictable and fine-grained control over motion magnitude, speed, and timing. It preserves global motion coherence and offers a practical paradigm for controllable T2M generation. Project page: https://jixiii.github.io/bp-phase-project-page/
[149] Evaluating Vision Foundation Models for Pixel and Object Classification in Microscopy
Carolin Teuber,Anwai Archit,Tobias Boothe,Peter Ditte,Jochen Rink,Constantin Pape
Main category: cs.CV
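TL;DR: This paper examines vision foundation models (VFMs) for interactive pixel and object classification in microscopy images, finds consistent improvements over traditional hand-crafted features, and establishes a VFM benchmark for the domain.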
Details
Motivation: In microscopy image analysis, interactive pixel and object classification still relies mainly on feature engineering plus shallow learning, owing to the diversity of the data, the lack of large pretraining datasets, and the need for computational and label efficiency. Other tasks, such as cellular instance segmentation, already benefit from VFMs like SAM, so it is worth testing whether VFMs help these classification tasks as well.
Method: Systematically evaluates several VFMs, both general-purpose (SAM, SAM2, DINOv3) and domain-specific (μSAM, PathoSAM), on five diverse and challenging microscopy datasets, combined with shallow learning and attentive probing for pixel and object classification.
Result: The evaluated VFMs consistently outperform hand-crafted features; the study identifies a clear pathway toward practical improvements and establishes a benchmark for VFMs on pixel and object classification in microscopy.
Conclusion: VFMs transfer effectively to interactive pixel and object classification in microscopy, improving on hand-crafted features and providing a benchmark to guide future research and tool development.
Abstract: Deep learning underlies most modern approaches and tools in computer vision, including biomedical imaging. However, for interactive semantic segmentation (often called pixel classification in this context) and interactive object-level classification (object classification), feature-based shallow learning remains widely used. This is due to the diversity of data in this domain, the lack of large pretraining datasets, and the need for computational and label efficiency. In contrast, state-of-the-art tools for many other vision tasks in microscopy, most notably cellular instance segmentation, already rely on deep learning and have recently benefited substantially from vision foundation models (VFMs), particularly SAM. Here, we investigate whether VFMs can also improve pixel and object classification compared to current approaches. To this end, we evaluate several VFMs, including general-purpose models (SAM, SAM2, DINOv3) and domain-specific ones (μSAM, PathoSAM), in combination with shallow learning and attentive probing on five diverse and challenging datasets. Our results demonstrate consistent improvements over hand-crafted features and provide a clear pathway toward practical improvements. Furthermore, our study establishes a benchmark for VFMs in microscopy and informs future developments in this area.
[150] Adaptive Greedy Frame Selection for Long Video Understanding
Yuning Huang,Fengqing Zhu
Main category: cs.CV
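TL;DR: This paper proposes a question-adaptive greedy frame selection method for long-video question answering. Under a fixed frame budget it jointly optimizes query relevance and semantic representativeness, using two embedding spaces and a weighted submodular objective for efficient frame sampling, plus a lightweight question-type classifier that picks the strategy per query. On MLVU it improves accuracy, with the largest gains under tight frame budgets.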
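The greedy selection summarized above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: `relevance` stands in for question-frame scores (SigLIP in the paper), `similarity` for frame-frame scores (DINOv2 in the paper), and the weight `lam` and the 1/n normalization of the coverage term are choices made here. The objective is F(S) = lam * sum of relevance over S + (1 - lam) * mean over all frames j of the maximum similarity between j and any frame in S.

```python
def greedy_select(relevance, similarity, budget, lam=0.5):
    """Greedily pick `budget` frame indices maximizing F(S).

    relevance[i]     : score in [0, 1] for frame i (modular term).
    similarity[i][j] : score in [0, 1] between frames i and j
                       (facility-location coverage term).
    """
    n = len(relevance)
    selected = []
    best_cover = [0.0] * n  # best similarity of each frame to the chosen set
    for _ in range(min(budget, n)):
        best_gain, best_idx = -1.0, None
        for i in range(n):
            if i in selected:
                continue
            # Marginal coverage gain of adding frame i.
            cover_gain = sum(
                max(similarity[i][j] - best_cover[j], 0.0) for j in range(n)
            ) / n
            gain = lam * relevance[i] + (1 - lam) * cover_gain
            if gain > best_gain:
                best_gain, best_idx = gain, i
        selected.append(best_idx)
        best_cover = [max(best_cover[j], similarity[best_idx][j]) for j in range(n)]
    return sorted(selected)  # restore temporal order
```

With `lam=1.0` the rule degenerates to top-k relevance and can pick near-duplicate frames; lowering `lam` makes the coverage term push the second pick toward a dissimilar frame.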
Details
Motivation: In long-video question answering, inference is bottlenecked by the number of input frames and visual tokens. Sparse sampling can miss key frames, while purely relevance-driven selection tends to collapse onto near-duplicate frames and sacrifices coverage of temporally distant evidence.
Method: Builds a 1 FPS candidate pool (capped at 1000 frames) with exact timestamps; embeds candidates in two complementary spaces, SigLIP for question relevance and DINOv2 for semantic similarity; greedily maximizes a normalized, monotone, submodular objective that is a weighted sum of a modular relevance term and a facility-location coverage term; and introduces four preset strategies with a lightweight text-only question-type classifier that routes each question to a preset.
Result: On MLVU, the method consistently outperforms uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
Conclusion: Question-adaptive greedy frame selection balances relevance and temporal coverage, carries a theoretical approximation guarantee, and is an effective way to improve the efficiency and accuracy of long-video question answering.
Abstract: Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
[151] Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision
Jiyeong Kim,Yerim So,Hyesong Choi,Uiwon Hwang,Dongbo Min
Main category: cs.CV
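TL;DR: This paper proposes SeGroS, a fine-tuning framework that uses semantic Visual Hints and a semantically-grounded Corrupted Input to address the granularity mismatch and redundancy of supervision signals in unified multimodal models (UMMs), significantly improving generation fidelity and cross-modal alignment.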
Details
Motivation: Current generative training paradigms for unified multimodal models (UMMs) have inherent limitations, notably a mismatch in supervision granularity and supervisory redundancy.
Method: Proposes Semantically-Grounded Supervision (SeGroS), built around a visual grounding map that produces two complementary supervision signals: semantic Visual Hints, which compensate for the sparsity of text prompts, and a semantically-grounded Corrupted Input, which restricts the masked reconstruction loss to core text-aligned regions.
Result: Extensive experiments on GenEval, DPGBench, and CompBench show that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
Conclusion: SeGroS is a general fine-tuning framework that effectively mitigates supervision-signal problems in UMM training and offers a new approach to improving multimodal generation quality and alignment.
Abstract: Unified Multimodal Models (UMMs) have emerged as a promising paradigm that integrates multimodal understanding and generation within a unified modeling framework. However, current generative training paradigms suffer from inherent limitations. We present Semantically-Grounded Supervision (SeGroS), a fine-tuning framework designed to resolve the granularity mismatch and supervisory redundancy in UMMs. At its core, we propose a novel visual grounding map to construct two complementary supervision signals. First, we formulate semantic Visual Hints to compensate for the sparsity of text prompts. Second, we generate a semantically-grounded Corrupted Input to explicitly enhance the supervision of masking-based UMMs by restricting the reconstruction loss to core text-aligned regions. Extensive evaluations on GenEval, DPGBench, and CompBench demonstrate that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
[152] VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
Jingyang Lin,Jialian Wu,Jiang Liu,Ximeng Sun,Ze Wang,Xiaodong Yu,Jiebo Luo,Zicheng Liu,Emad Barsoum
Main category: cs.CV
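TL;DR: VideoSeek is a long-horizon video agent that uses video logic flow to actively seek key evidence, greatly reducing the number of frames required while maintaining or improving video understanding and reasoning.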
Details
Motivation: Existing video agents rely on greedy parsing of densely sampled frames, which is computationally expensive; a more efficient, lower-overhead approach to video understanding is needed.
Method: Proposes VideoSeek, which runs a think-act-observe loop with a toolkit for multi-granular observations and performs query-aware evidence seeking guided by video logic flow, avoiding exhaustive parsing of all frames.
Result: Achieves strong accuracy on four video understanding and reasoning benchmarks with far fewer frames than prior video agents and standalone LMMs; on LVBench it improves on its base model, GPT-5, by 10.2 absolute points while using 93% fewer frames.
Conclusion: Active exploration guided by video logic flow is key to improving both efficiency and performance; toolkit design and strong reasoning capability play complementary roles.
Abstract: Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute points improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
[153] HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks
Jingyu Guo,Ziye Chen,Ziwen Li,Zhengqing Gao,Jiaxin Huang,Hanlue Zhang,Fengming Huang,Yu Yao,Tongliang Liu,Mingming Gong
Main category: cs.CV
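TL;DR: This paper presents HUGE-Bench, a benchmark for High-Level UAV Vision-Language-Action (HL-VLA) tasks that tests understanding of brief, high-level commands and safe execution of multi-stage trajectories, and introduces process-oriented and collision-aware evaluation metrics.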
Details
Motivation: Existing UAV vision-language navigation (VLN) benchmarks emphasize long, step-wise route descriptions and goal-centric evaluation, so they poorly reflect real operations in which brief, high-level commands must be turned into safe multi-stage behaviors.
Method: Builds HUGE-Bench on an aligned 3D Gaussian Splatting (3DGS)-Mesh representation, with 4 real-world digital twin scenes, 8 high-level tasks, and 2.56M meters of trajectories, and designs process-oriented and collision-aware metrics.
Result: Experiments on representative state-of-the-art VLA models reveal significant gaps in high-level semantic completion and safe execution.
Conclusion: HUGE-Bench provides a more diagnostic benchmark for evaluating and advancing high-level UAV autonomy.
Abstract: Existing UAV vision-language navigation (VLN) benchmarks have enabled language-guided flight, but they largely focus on long, step-wise route descriptions with goal-centric evaluation, making them less diagnostic for real operations where brief, high-level commands must be grounded into safe multi-stage behaviors. We present HUGE-Bench, a benchmark for High-Level UAV Vision-Language-Action (HL-VLA) tasks that tests whether an agent can interpret concise language and execute complex, process-oriented trajectories with safety awareness. HUGE-Bench comprises 4 real-world digital twin scenes, 8 high-level tasks, and 2.56M meters of trajectories, and is built on an aligned 3D Gaussian Splatting (3DGS)-Mesh representation that combines photorealistic rendering with collision-capable geometry for scalable generation and collision-aware evaluation. We introduce process-oriented and collision-aware metrics to assess process fidelity, terminal accuracy, and safety. Experiments on representative state-of-the-art VLA models reveal significant gaps in high-level semantic completion and safe execution, highlighting HUGE-Bench as a diagnostic testbed for high-level UAV autonomy.
[154] Fourier Splatting: Generalized Fourier encoded primitives for scalable radiance fields
Mihnea-Bogdan Jurca,Bert Van hauwermeiren,Adrian Munteanu
Main category: cs.CV
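TL;DR: This paper proposes Fourier Splatting, which uses scalable planar surfels parameterized by Fourier-encoded descriptors as the primitives for radiance field rendering, so that a single model can be rendered at multiple scales. Truncating Fourier coefficients at runtime adjusts the level of detail, and a straight-through estimator together with the HYDRA densification strategy improves training stability and quality.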
Details
Motivation: Methods such as 3D Gaussian Splatting (3DGS) tie visual fidelity to the number of explicit primitives, so quality can be scaled down only by pruning and the representation lacks inherent scalability; a primitive that natively supports multi-scale rendering is needed.
Method: Proposes Fourier Splatting, which parameterizes planar surfels with Fourier-encoded descriptors to form scalable primitives with arbitrary closed shapes; uses a straight-through estimator to extend gradients beyond the primitive boundary; and introduces HYDRA, a densification strategy that decomposes complex primitives into simpler constituents within the MCMC framework.
Result: Achieves state-of-the-art rendering quality among planar-primitive methods and perceptual metrics comparable to leading volumetric representations on standard benchmarks, supporting high-fidelity rendering at multiple levels of detail in bandwidth-constrained settings.
Conclusion: Fourier Splatting is presented as the first inherently scalable primitive for radiance field rendering, combining high-fidelity modeling with flexible runtime scaling for novel view synthesis under bandwidth constraints.
Abstract: Novel view synthesis has recently been revolutionized by 3D Gaussian Splatting (3DGS), which enables real-time rendering through explicit primitive rasterization. However, existing methods tie visual fidelity strictly to the number of primitives: quality downscaling is achieved only through pruning primitives. We propose the first inherently scalable primitive for radiance field rendering. Fourier Splatting employs scalable primitives with arbitrary closed shapes obtained by parameterizing planar surfels with Fourier encoded descriptors. This formulation allows a single trained model to be rendered at varying levels of detail simply by truncating Fourier coefficients at runtime. To facilitate stable optimization, we employ a straight-through estimator for gradient extension beyond the primitive boundary, and introduce HYDRA, a densification strategy that decomposes complex primitives into simpler constituents within the MCMC framework. Our method achieves state-of-the-art rendering quality among planar-primitive frameworks and comparable perceptual metrics compared to leading volumetric representations on standard benchmarks, providing a versatile solution for bandwidth-constrained high-fidelity rendering.
[155] Hyper-Connections for Adaptive Multi-Modal MRI Brain Tumor Segmentation
Lokendra Kumar,Shubham Aggarwal
Main category: cs.CV
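TL;DR: This paper studies Hyper-Connections (HC) as a dynamic alternative to residual connections for multi-modal 3D brain tumor segmentation. HC improves several networks on BraTS 2021, most notably on the Enhancing Tumor sub-region, and makes models adaptively more sensitive to the key MRI sequences.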
Details
Motivation: Fixed residual connections cannot adapt to different modalities and anatomical regions in multi-modal medical image segmentation, which limits feature fusion.
Method: Integrates dynamic Hyper-Connections (HC) as a drop-in replacement for fixed residual connections in five segmentation networks (nnU-Net, SwinUNETR, VT-UNet, U-Net, U-Netpp), and validates them on BraTS 2021 with a modality ablation.
Result: HC improves all 3D models, with up to +1.03% mean Dice gain, and helps most on the Enhancing Tumor sub-region; HC-equipped models become more sensitive to T1ce (for Tumor Core and Enhancing Tumor) and FLAIR (for Whole Tumor); in 2D settings the gains are smaller and configuration-sensitive.
Conclusion: Hyper-Connections are a simple, efficient, broadly applicable mechanism for multi-modal feature fusion, particularly suited to volumetric medical image segmentation.
Abstract: We present the first study of Hyper-Connections (HC) for volumetric multi-modal brain tumor segmentation, integrating them as a drop-in replacement for fixed residual connections across five architectures: nnU-Net, SwinUNETR, VT-UNet, U-Net, and U-Netpp. Dynamic HC consistently improves all 3D models on the BraTS 2021 dataset, yielding up to +1.03 percent mean Dice gain with negligible parameter overhead. Gains are most pronounced in the Enhancing Tumor sub-region, reflecting improved fine-grained boundary delineation. Modality ablation further reveals that HC-equipped models develop sharper sensitivity toward clinically dominant sequences, specifically T1ce for Tumor Core and Enhancing Tumor, and FLAIR for Whole Tumor, a behavior absent in fixed-connection baselines and consistent across all architectures. In 2D settings, improvements are smaller and configuration-sensitive, suggesting that volumetric spatial context amplifies the benefit of adaptive aggregation. These results establish HC as a simple, efficient, and broadly applicable mechanism for multi-modal feature fusion in medical image segmentation.
[156] Failure Modes for Deep Learning-Based Online Mapping: How to Measure and Address Them
Michael Hubbertz,Qi Han,Tobias Meisen
Main category: cs.CV
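TL;DR: This paper proposes a framework for evaluating the generalization of deep learning-based online mapping models by disentangling two failure modes: memorization of input features and overfitting to map geometry. It introduces Fréchet distance-based reconstruction statistics and two overfitting scores, along with map geometry-aware dataset diagnostics (MST diversity, symmetric coverage) and an MST-based sparsification strategy, and validates on nuScenes and Argoverse 2 that these improve generalization and training-set efficiency.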
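For readers unfamiliar with the Fréchet distance mentioned above, the following sketch computes its standard discrete variant between two polylines with dynamic programming. Whether the paper uses the discrete or the continuous form, and how per-element distances are aggregated into its reconstruction statistics, is not stated in the abstract, so this is only a reference implementation of the underlying shape distance.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between polylines p and q.

    p, q: non-empty sequences of points (tuples of equal dimension).
    ca[i][j] holds the coupling distance for prefixes p[:i+1], q[:j+1].
    """
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance along p, along q, or along both, whichever is cheapest.
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]
```

Unlike a chamfer-style match under a fixed threshold, the result is a single distance that respects point ordering along the element, which is what makes threshold-free per-element statistics possible.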
Details
Motivation: Deep learning-based online mapping models are widely used in autonomous driving but often fail to generalize to unfamiliar environments. Existing evaluations do not distinguish between memorizing input features and overfitting to known map geometries, and they lack quantitative diagnostics of geometric bias in datasets.
Method: Builds evaluation subsets that control for geographic proximity and geometric similarity; designs threshold-free reconstruction statistics based on the Fréchet distance; defines a localization overfitting score (performance drop when geographic cues disappear) and a map geometry overfitting score (degradation as geometric novelty increases); constructs MST diversity and symmetric coverage measures to diagnose dataset bias; and proposes an MST-based sparsification strategy to optimize the training set.
Result: Experiments with multiple state-of-the-art models on nuScenes and Argoverse 2 show that the framework gives a more trustworthy assessment of generalization, that geometrically diverse and balanced training sets improve performance, and that MST sparsification improves balance and performance while shrinking the training set.
Conclusion: Evaluation protocols should be failure-mode-aware and dataset design should be centered on map geometry to support deployable online mapping systems.
Abstract: Deep learning-based online mapping has emerged as a cornerstone of autonomous driving, yet these models frequently fail to generalize beyond familiar environments. We propose a framework to identify and measure the underlying failure modes by disentangling two effects: memorization of input features and overfitting to known map geometries. We propose measures based on evaluation subsets that control for geographical proximity and geometric similarity between training and validation scenes. We introduce Fréchet distance-based reconstruction statistics that capture per-element shape fidelity without threshold tuning, and define complementary failure-mode scores: a localization overfitting score quantifying the performance drop when geographic cues disappear, and a map geometry overfitting score measuring degradation as scenes become geometrically novel. Beyond models, we analyze dataset biases and contribute map geometry-aware diagnostics: a minimum-spanning-tree (MST) diversity measure for training sets and a symmetric coverage measure to quantify geometric similarity between splits. Leveraging these, we formulate an MST-based sparsification strategy that reduces redundancy and improves balancing and performance while shrinking training size. Experiments on nuScenes and Argoverse 2 across multiple state-of-the-art models yield more trustworthy assessment of generalization and show that map geometry-diverse and balanced training sets lead to improved performance. Our results motivate failure-mode-aware protocols and map geometry-centric dataset design for deployable online mapping.
[157] IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
Simone Magistri,Dipam Goswami,Marco Mistretta,Bartłomiej Twardowski,Joost van de Weijer,Andrew D. Bagdanov
Main category: cs.CV
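TL;DR: This paper studies why vision-language models such as CLIP underperform on intra-modal tasks like image-to-image retrieval, attributes this to insufficient intra-modal alignment, and proposes IsoCLIP, a training-free method that extracts an isotropic aligned subspace from the projector weights and significantly improves intra-modal performance.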
Details
Motivation: Cross-modal models such as CLIP perform poorly on intra-modal tasks (e.g., image-to-image retrieval) because embeddings within a single modality are misaligned; existing remedies mostly rely on fine-tuning and lack both a theoretical analysis of the alignment mechanism and an efficient correction.
Method: Analyzes how the cosine similarity on projected features interacts with the contrastive loss, separating an inter-modal operator responsible for cross-modal alignment from an intra-modal operator that only enforces normalization and does not promote intra-modal alignment; uses spectral analysis to identify an approximately isotropic subspace in which image and text are well aligned, as well as anisotropic directions specific to each modality; then obtains the aligned subspace directly from the projector weights and removes the anisotropic directions.
Result: The training-free method, IsoCLIP, reduces intra-modal misalignment on retrieval and classification benchmarks across multiple pre-trained CLIP-like models, greatly lowers latency, and outperforms existing approaches.
Conclusion: Intra-modal misalignment can be traced to the division of roles between operators inherent in CLIP; explicitly exploiting the isotropic subspace implicit in the projectors is an effective and efficient way to improve intra-modal downstream performance.
Abstract: Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.
[158] MedQ-Engine: A Closed-Loop Data Engine for Evolving MLLMs in Medical Image Quality Assessment
Jiyao Liu,Junzhi Ning,Wanying Qu,Lihao Liu,Chenglong Ma,Junjun He,Ningsheng Xu
Main category: cs.CV
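TL;DR: This paper proposes MedQ-Engine, a closed-loop data engine that combines iterative evaluation, retrieval anchored on failure prototypes, human-in-the-loop annotation, and quality-assured fine-tuning to improve the descriptive, clinically reasoned assessments of multimodal large language models in medical image quality assessment (Med-IQA), substantially narrowing the gap with human experts using only 10K annotations.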
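The abstract mentions an entropy-guided routing mechanism that triages annotations to reduce labeling cost but gives no details. One plausible reading, sketched here purely as an assumption, is to route each sample by the normalized entropy of the model's predicted distribution: confident predictions are accepted automatically and uncertain ones go to experts. The three route names and both thresholds are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of `probs`, divided by log(K) so it lies in [0, 1]."""
    if len(probs) < 2:
        raise ValueError("need at least two classes")
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def route(probs, low=0.3, high=0.9):
    """Triage a sample by predictive uncertainty (illustrative thresholds)."""
    h = normalized_entropy(probs)
    if h < low:
        return "auto_accept"            # model is confident; no human time spent
    if h < high:
        return "model_assisted_review"  # a human checks a model draft
    return "expert_annotation"          # model is uncertain; full expert labeling
```

Under this reading, labeling cost falls because expert time is concentrated on the samples where the model's own uncertainty is highest.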
Details
Motivation: Existing MLLMs fall well short of human experts in medical image quality assessment, especially for descriptive assessments with clinical reasoning; high-quality descriptive annotations are expensive, and one-time data collection cannot adapt to the new weaknesses a model keeps exposing.
Method: Proposes MedQ-Engine, a closed loop that (1) iteratively evaluates the model and clusters its failures into prototypes; (2) uses the prototypes as retrieval anchors over a million-scale image pool with progressive human-in-the-loop annotation; and (3) fine-tunes with quality assurance. An entropy-guided routing mechanism triages annotations to reduce labeling cost.
Result: Across five medical imaging modalities, MedQ-Engine lifts an 8B-parameter model above GPT-4o by over 13% and narrows the gap with human experts to 4.34%, using only 10K annotations and achieving more than 4x the sample efficiency of random sampling.
Conclusion: MedQ-Engine shows that a closed-loop, data-driven, human-in-the-loop approach can efficiently improve MLLMs on complex clinical assessment tasks, offering a path toward trustworthy deployment of medical AI.
Abstract: Medical image quality assessment (Med-IQA) is a prerequisite for clinical AI deployment, yet multimodal large language models (MLLMs) still fall substantially short of human experts, particularly when required to provide descriptive assessments with clinical reasoning beyond simple quality scores. However, improving them is hindered by the high cost of acquiring descriptive annotations and by the inability of one-time data collection to adapt to the model's evolving weaknesses. To address these challenges, we propose MedQ-Engine, a closed-loop data engine that iteratively evaluates the model to discover failure prototypes via data-driven clustering, explores a million-scale image pool using these prototypes as retrieval anchors with progressive human-in-the-loop annotation, and evolves through quality-assured fine-tuning, forming a self-improving cycle. Models are evaluated on complementary perception and description tasks. An entropy-guided routing mechanism triages annotations to minimize labeling cost. Experiments across five medical imaging modalities show that MedQ-Engine elevates an 8B-parameter model to surpass GPT-4o by over 13% and narrow the gap with human experts to only 4.34%, using only 10K annotations with more than 4x sample efficiency over random sampling.
[159] SIMPLER: Efficient Foundation Model Adaptation via Similarity-Guided Layer Pruning for Earth Observation
Víctor Barreiro,Johannes Jakubik,Francisco Argüello,Dora B. Heras
Main category: cs.CV
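TL;DR: This paper proposes SIMPLER, which selects the architecture before fine-tuning Earth Observation foundation models. By analyzing how representations stabilize in the deeper layers of pre-trained vision transformers, it automatically identifies and prunes redundant layers, reducing inference and deployment cost without gradient computation, magnitude heuristics, or hyperparameter tuning.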
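The idea of choosing a depth from layer-wise representation similarity can be illustrated as follows. This is a simplified stand-in, not SIMPLER itself: it compares consecutive layers with cosine similarity on one pooled feature vector per layer, and it uses a fixed `threshold` in place of the paper's automated scoring function, which the abstract does not describe and which is said to need no hyperparameter tuning.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_depth(layer_features, threshold=0.98):
    """Return how many leading layers to keep.

    layer_features[k]: pooled representation after layer k, computed on
    unlabeled task data with forward passes only (no gradients).
    Trailing layers are dropped while each one is nearly identical to its
    predecessor, i.e. while the representation has stabilized.
    """
    sims = [
        cosine(layer_features[k], layer_features[k + 1])
        for k in range(len(layer_features) - 1)
    ]
    keep = len(layer_features)
    for k in range(len(sims) - 1, -1, -1):
        if sims[k] >= threshold:
            keep = k + 1  # layer k+1 adds little beyond layer k
        else:
            break
    return keep
```

The truncated model would then be fine-tuned as usual, so the savings apply to both training and inference.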
Details
Motivation: Fine-tuning Earth Observation foundation models is computationally expensive; existing parameter-efficient methods reduce only training cost, while post-hoc compression requires costly full fine-tuning first.
Method: Before fine-tuning, SIMPLER computes layer-wise representation similarity on unlabeled task data and applies an automated scoring function to identify redundant layers for pruning, with no gradients, magnitude heuristics, or hyperparameter tuning.
Result: On Prithvi-EO-2, it prunes up to 79% of parameters while retaining 94% of baseline performance, with a 2.1x training speedup and a 2.6x inference speedup; it also generalizes to TerraMind and ImageNet-pretrained ViT-MAE.
Conclusion: SIMPLER is a lightweight, general, training-free way to slim down an architecture before fine-tuning, lowering the deployment barrier and compute burden of Earth Observation models.
Abstract: Fine-tuning foundation models for Earth Observation is computationally expensive, with high training time and memory demands for both training and deployment. Parameter-efficient methods reduce training cost but retain full inference complexity, while post-hoc compression optimizes inference only after costly full fine-tuning. We introduce SIMPLER, a pre-fine-tuning architecture selection method that reduces inference and deployment costs by identifying an effective model depth before adaptation. SIMPLER exploits stabilization of representations in deeper layers of pre-trained vision transformers: it computes layer-wise representation similarity on unlabeled task data and applies an automated scoring function to select redundant layers, with no gradients, magnitude heuristics, or hyperparameter tuning required. On Prithvi-EO-2, SIMPLER prunes up to 79% of parameters while retaining 94% of baseline performance, yielding a 2.1x training speedup and 2.6x inference speedup. The method generalizes to TerraMind (a multimodal EO foundation model) and ImageNet-pretrained ViT-MAE, demonstrating applicability across tasks, architectures, and spectral modalities. Code is available at https://gitlab.citius.gal/hpc4rs/simpler.
[160] Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery
Jizhou Han,Chenhao Ding,Yuhang He,Qiang Wang,Shaokun Wang,SongLin Dong,Yihong Gong
Main category: cs.CV
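TL;DR: This paper proposes the ATCG module, which generates textual concepts by analogy from labeled knowledge and fuses them with visual features, improving Generalized Category Discovery (GCD) on fine-grained categories without changes to existing GCD frameworks.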
Details
Motivation: Vision-only GCD methods produce brittle boundaries on fine-grained, look-alike categories, and the loose coupling between supervised learning and discovery makes performance unstable.
Method: Proposes the Analogical Textual Concept Generator (ATCG), which analogizes from labeled knowledge to unlabeled samples to form textual concepts and fuses them with visual features for joint visual-textual reasoning. ATCG plugs into both parametric and clustering-style GCD pipelines.
Result: Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data.
Conclusion: By adding analogical textual concepts to visual representations, ATCG eases fine-grained category discrimination in GCD and is both general and practical.
Abstract: Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual-only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine-grained, look-alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual-textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data. Our code is available at: https://github.com/zhou-9527/AnaLogical-GCD.
[161] PanORama: Multiview Consistent Panoptic Segmentation in Operating Rooms
Tuna Gürbüz,Ege Özsoy,Tony Danjun Wang,Nassir Navab
Main category: cs.CV
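TL;DR: This paper proposes PanORama, a multiview-consistent panoptic segmentation method designed for operating rooms. By modeling cross-view feature interactions inside the backbone, it delivers calibration-free, well-generalizing segmentation and substantially improves spatial understanding in surgical environments.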
Details
Motivation: Operating rooms are cluttered and heavily occluded, and reliable spatial understanding is essential for intraoperative situational awareness; panoptic segmentation from sparse multiview images is prone to mispredictions across cameras because of limited visibility in some views.
Method: Proposes PanORama, the first panoptic segmentation method for the operating room that is multiview-consistent by design. It models cross-view feature interactions inside the backbone in a single forward pass, so view consistency emerges without post-hoc refinement, and it requires no camera calibration parameters.
Result: Achieves >70% Panoptic Quality (PQ) on the MM-OR and 4D-OR datasets, outperforming the previous state of the art, and generalizes to unseen camera viewpoints within any multiview configuration at inference time.
Conclusion: PanORama substantially improves multiview segmentation and spatial understanding in the OR, opening new opportunities for surgical perception and assistance.
Abstract: Operating rooms (ORs) are cluttered, dynamic, highly occluded environments, where reliable spatial understanding is essential for situational awareness during complex surgical workflows. Achieving spatial understanding for panoptic segmentation from sparse multiview images poses a fundamental challenge, as limited visibility in a subset of views often leads to mispredictions across cameras. To this end, we introduce PanORama, the first panoptic segmentation for the operating room that is multiview-consistent by design. By modeling cross-view interactions at the feature level inside the backbone in a single forward pass, view consistency emerges directly rather than through post-hoc refinement. We evaluate on the MM-OR and 4D-OR datasets, achieving >70% Panoptic Quality (PQ) performance, and outperforming the previous state of the art. Importantly, PanORama is calibration-free, requiring no camera parameters, and generalizes to unseen camera viewpoints within any multiview configuration at inference time. By substantially enhancing multiview segmentation and, consequently, spatial understanding in the OR, we believe our approach opens new opportunities for surgical perception and assistance. Code will be released upon acceptance.
[162] SegVGGT: Joint 3D Reconstruction and Instance Segmentation from Multi-View Images
Jinyuan Qu,Hongyang Li,Lei Zhang
Main category: cs.CV
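TL;DR: This paper proposes SegVGGT, a unified end-to-end framework that performs feed-forward 3D reconstruction and instance segmentation directly from multi-view RGB images. It introduces object queries that interact with multi-level geometric features and a FADA strategy to counter attention dispersion, reaching state-of-the-art performance on the ScanNet family of datasets.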
Details
Motivation: Existing 3D instance segmentation methods depend on high-quality point clouds or posed RGB-D data, involve complex pipelines, and are sensitive to reconstruction noise; recent transformer-based multi-view reconstruction methods lack high-level semantic understanding. The two have not been effectively unified.
Method: SegVGGT (1) integrates object queries that interact with multi-level geometric features into a visual geometry grounded transformer to model reconstruction and segmentation jointly, and (2) introduces Frame-level Attention Distribution Alignment (FADA), which explicitly guides object queries toward instance-relevant frames during training to counter the attention dispersion caused by the large number of global image tokens.
Result: Achieves state-of-the-art performance on ScanNetv2 and ScanNet200, outperforming recent joint models and RGB-D-based approaches; generalizes well to ScanNet++; adds no inference overhead.
Conclusion: SegVGGT achieves end-to-end joint learning of 3D reconstruction and instance segmentation from multi-view RGB images, showing that accurate 3D semantic understanding is feasible from vision alone without depth input and pointing toward lightweight, robust 3D perception.
Abstract: 3D instance segmentation methods typically rely on high-quality point clouds or posed RGB-D scans, requiring complex multi-stage processing pipelines, and are highly sensitive to reconstruction noise. While recent feed-forward transformers have revolutionized multi-view 3D reconstruction, they remain decoupled from high-level semantic understanding. In this work, we present SegVGGT, a unified end-to-end framework that simultaneously performs feed-forward 3D reconstruction and instance segmentation directly from multi-view RGB images. By introducing object queries that interact with multi-level geometric features, our method deeply integrates instance identification into the visual geometry grounded transformer. To address the severe attention dispersion problem caused by the massive number of global image tokens, we propose the Frame-level Attention Distribution Alignment (FADA) strategy. FADA explicitly guides object queries to attend to instance-relevant frames during training, providing structured supervision without extra inference overhead. Extensive experiments demonstrate that SegVGGT achieves state-of-the-art performance on ScanNetv2 and ScanNet200, outperforming both recent joint models and RGB-D-based approaches, while exhibiting strong generalization capabilities on ScanNet++.
[163] RAM: Recover Any 3D Human Motion in-the-Wild
Sen Jia,Ning Zhu,Jinqin Zhong,Jiale Zhou,Huaping Zhang,Jenq-Neng Hwang,Lei Li
Main category: cs.CV
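TL;DR: RAM is a robust markerless framework for in-the-wild 3D human motion capture that combines motion-aware semantic tracking, memory-augmented temporal HMR, a lightweight prediction module, and gated fusion, substantially outperforming the state of the art in multi-person tracking stability and 3D accuracy.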
Details
Motivation: In in-the-wild multi-person scenes, severe occlusions and dynamic interactions make identity association unstable and 3D motion reconstruction discontinuous and inaccurate.
Method: (1) A motion-aware semantic tracker with adaptive Kalman filtering for robust identity association; (2) a memory-augmented Temporal HMR module that injects spatio-temporal priors for consistent, smooth motion; (3) a lightweight Predictor module that forecasts future poses to keep reconstruction continuous; (4) a gated combiner that adaptively fuses reconstructed and predicted features.
Result: On in-the-wild multi-person benchmarks such as PoseTrack and 3DPW, RAM substantially outperforms the previous state of the art in zero-shot tracking stability and 3D accuracy.
Conclusion: RAM offers a generalizable, robust paradigm for markerless 3D human motion capture in the wild.
Abstract: RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks such as PoseTrack and 3DPW demonstrate that RAM substantially outperforms previous state-of-the-art methods in both zero-shot tracking stability and 3D accuracy, offering a generalizable paradigm for markerless 3D human motion capture in-the-wild.
[164] LIORNet: Self-Supervised LiDAR Snow Removal Framework for Autonomous Driving under Adverse Weather Conditions
Ji-il Park,Inwook Shim
Main category: cs.CV
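TL;DR: This paper proposes LIORNet, a self-supervised LiDAR denoising network that combines the strengths of distance-based, intensity-based, and learning-based methods. Without manual annotation, it efficiently filters noise points in adverse weather such as snow, rain, and fog, improving the robustness of perception.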
Details
Motivation: LiDAR在雪、雨、雾等恶劣天气下因大量噪声点导致感知性能严重下降,现有距离法、强度法和学习法均存在各自局限(如区分能力弱、阈值不自适应、标注成本高、泛化差、计算开销大)。 Method: 提出LIORNet,基于U-Net++架构,采用多物理与统计线索(如距离相关强度阈值、雪反射率建模、点云稀疏性、感知范围约束)生成伪标签,实现自监督训练,融合三类传统方法优势。 Result: 在WADS和CADC数据集上,LIORNet在精度和运行速度上均超越当前最优滤波算法,同时保留关键环境结构特征。 Conclusion: LIORNet是一种无需人工标注、泛化性强、计算高效且适用于实时部署的LiDAR去噪方案,显著提升了极端天气下自动驾驶与机器人系统的3D感知鲁棒性。 Abstract: LiDAR sensors provide high-resolution 3D perception and long-range detection, making them indispensable for autonomous driving and robotics. However, their performance significantly degrades under adverse weather conditions such as snow, rain, and fog, where spurious noise points dominate the point cloud and lead to false perception. To address this problem, various approaches have been proposed: distance-based filters exploiting spatial sparsity, intensity-based filters leveraging reflectance distributions, and learning-based methods that adapt to complex environments. Nevertheless, distance-based methods struggle to distinguish valid object points from noise, intensity-based methods often rely on fixed thresholds that lack adaptability to changing conditions, and learning-based methods suffer from the high cost of annotation, limited generalization, and computational overhead. In this study, we propose LIORNet, which eliminates these drawbacks and integrates the strengths of all three paradigms. LIORNet is built upon a U-Net++ backbone and employs a self-supervised learning strategy guided by pseudo-labels generated from multiple physical and statistical cues, including range-dependent intensity thresholds, snow reflectivity, point sparsity, and sensing range constraints. This design enables LIORNet to distinguish noise points from environmental structures without requiring manual annotations, thereby overcoming the difficulty of snow labeling and the limitations of single-principle approaches. 
Extensive experiments on the WADS and CADC datasets demonstrate that LIORNet outperforms state-of-the-art filtering algorithms in both accuracy and runtime while preserving critical environmental features. These results highlight LIORNet as a practical and robust solution for LiDAR perception in extreme weather, with strong potential for real-time deployment in autonomous driving systems.
[165] Timestep-Aware Block Masking for Efficient Diffusion Model Inference
Haodong He,Yuan Gao,Weizhong Zhang,Gui-Song Xia
Main category: cs.CV
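TL;DR: This paper proposes a per-timestep computational-graph optimization framework for pre-trained diffusion probabilistic models (DPMs). It learns timestep-specific block masks that dynamically execute or skip blocks, combined with timestep-aware loss scaling and knowledge-guided mask rectification, to speed up sampling substantially while preserving generation quality.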
Details
Motivation: Diffusion models excel at image generation, but their iterative denoising leads to high inference latency. Motivated by how features evolve along the denoising trajectory, the authors aim to improve the computational efficiency of pre-trained models.
Method: A timestep-specific mask learning mechanism decides, at each timestep, which network blocks to execute and which to bypass through feature reuse. Masks are optimized independently per timestep to avoid the memory cost of full-chain backpropagation. Timestep-aware loss scaling and knowledge-guided mask rectification improve feature fidelity and make pruning more principled.
Result: The method is validated on DDPM, LDM, DiT, and PixArt, achieving a better trade-off between sampling speed and generation quality, with memory-efficient training and an architecture-agnostic design.
Conclusion: Treating denoising as a sequence of optimized computational paths can markedly improve DPM inference efficiency and offers a new direction for practical deployment.
Abstract: Diffusion Probabilistic Models (DPMs) have achieved great success in image generation but suffer from high inference latency due to their iterative denoising nature. Motivated by the evolving feature dynamics across the denoising trajectory, we propose a novel framework to optimize the computational graph of pre-trained DPMs on a per-timestep basis. By learning timestep-specific masks, our method dynamically determines which blocks to execute or bypass through feature reuse at each inference stage. Unlike global optimization methods that incur prohibitive memory costs via full-chain backpropagation, our method optimizes masks for each timestep independently, ensuring a memory-efficient training process. To guide this process, we introduce a timestep-aware loss scaling mechanism that prioritizes feature fidelity during sensitive denoising phases, complemented by a knowledge-guided mask rectification strategy to prune redundant spatial-temporal dependencies. Our approach is architecture-agnostic and demonstrates significant efficiency gains across a broad spectrum of models, including DDPM, LDM, DiT, and PixArt. Experimental results show that by treating the denoising process as a sequence of optimized computational paths, our method achieves a superior balance between sampling speed and generative quality. Our code will be released.
[166] HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction
Ruicheng Yuan,Zhenxuan Zhang,Anbang Wang,Liwei Hu,Xiangqian Hua,Yaya Peng,Jiawei Luo,Guang Yang
Main category: cs.CV
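TL;DR: HiPath is a lightweight pathology vision-language model that predicts structured pathology reports through hierarchical patch aggregation, hierarchical contrastive learning, and slot-based masked diagnosis prediction, achieving high accuracy and safety on real-world clinical data.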
Details
Motivation: Existing pathology vision-language models reduce structured pathology reports to flat labels or free-form text and cannot adequately model their multi-granular, structured nature.
Method: HiPath comprises three trainable modules: a Hierarchical Patch Aggregator (HiPA), Hierarchical Contrastive Learning (HiCL), and Slot-based Masked Diagnosis Prediction (Slot-MDP). They are built on frozen UNI2 and Qwen3 backbones and total only 15M trainable parameters.
Result: Trained on 749K real-world Chinese pathology cases, HiPath reaches 68.9% strict accuracy, 74.7% clinically acceptable accuracy, and a 97.3% safety rate. In cross-hospital validation, strict accuracy drops by only 3.4 percentage points and the safety rate remains 97.1%.
Conclusion: HiPath demonstrates the effectiveness and clinical practicality of a lightweight VLM design that treats structured report prediction as its primary objective.
Abstract: Pathology reports are structured, multi-granular documents encoding diagnostic conclusions, histological grades, and ancillary test results across one or more anatomical sites; yet existing pathology vision-language models (VLMs) reduce this output to a flat label or free-form text. We present HiPath, a lightweight VLM framework built on frozen UNI2 and Qwen3 backbones that treats structured report prediction as its primary training objective. Three trainable modules totalling 15M parameters address complementary aspects of the problem: a Hierarchical Patch Aggregator (HiPA) for multi-image visual encoding, Hierarchical Contrastive Learning (HiCL) for cross-modal alignment via optimal transport, and Slot-based Masked Diagnosis Prediction (Slot-MDP) for structured diagnosis generation. Trained on 749K real-world Chinese pathology cases from three hospitals, HiPath achieves 68.9% strict and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines under the same frozen backbone. Cross-hospital evaluation confirms generalisation with only a 3.4pp drop in strict accuracy while maintaining 97.1% safety.
[167] Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
Nassim Ali Ousalah,Peyman Rostami,Vincent Gaudillière,Emmanuel Koumandakis,Anis Kacem,Enjie Ghorbel,Djamila Aouada
Main category: cs.CV
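TL;DR: This paper proposes a direct 6-DoF object pose estimation method based on covariance pooling and SPD matrix representations. A manifold-aware network regresses a continuous pose representation, improving accuracy and robustness over conventional direct methods.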
Details
Motivation: Existing direct 6-DoF pose estimation methods rely on globally pooled features, ignoring informative spatial second-order statistics, and often use discontinuous pose representations, which hurts robustness.
Method: A covariance-pooled representation encodes the distribution of convolutional features as a symmetric positive definite (SPD) matrix; an SPD-form pose encoding is constructed via Cholesky decomposition; and an end-to-end network head adapted to the geometry of the SPD manifold regresses the pose.
Result: Experiments and ablations show that second-order pooling and continuous pose representations improve the accuracy and robustness of direct pose regression, including under partial occlusion.
Conclusion: Modeling second-order statistics and using manifold-constrained continuous representations can address the accuracy and robustness shortcomings of direct methods, offering a new direction for single-image 6-DoF pose estimation.
Abstract: In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods that typically predict intermediate 2D keypoints, followed by a Perspective-n-Point solver, have shown great performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. However, direct heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness in pose prediction. They also predict, in most cases, discontinuous pose representations that lack robustness. Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end-to-end manner with a manifold-aware network head, taking into account the Riemannian geometry of SPD matrices. Experiments and ablations consistently demonstrate the relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.
[168] 2K Retrofit: Entropy-Guided Efficient Sparse Refinement for High-Resolution 3D Geometry Prediction
Tianbao Zhang,Zhenyu Liang,Zhenbo Song,Nana Wang,Xiaomei Zhang,Xudong Cai,Zheng Zhu,Kejian Wu,Gang Wang,Zhaoxin Fan
Main category: cs.CV
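TL;DR: 2K Retrofit is a framework that enables efficient 2K-resolution inference for any geometric foundation model without modifying or retraining the backbone. It combines fast coarse predictions with entropy-based sparse refinement that selectively enhances high-uncertainty regions, producing precise, high-fidelity 2K outputs with minimal overhead.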
Details
Motivation: Current foundation models are hard to deploy in real-world high-resolution settings (e.g., 2K images) because of their excessive compute and memory costs.
Method: The 2K Retrofit framework uses fast coarse predictions and an entropy-based sparse refinement strategy that refines only high-uncertainty regions, supporting efficient 2K-resolution inference for any geometric foundation model.
Result: Experiments on widely used benchmarks show state-of-the-art accuracy and speed, narrowing the gap between high-resolution 3D vision research and scalable deployment.
Conclusion: 2K Retrofit offers a general, efficient, retraining-free solution for high-resolution geometric prediction, advancing the practical use of foundation models in autonomous driving, robotics, and AR/MR.
Abstract: High-resolution geometric prediction is essential for robust perception in autonomous driving, robotics, and AR/MR, but current foundation models are fundamentally limited by their scalability to real-world, high-resolution scenarios. Direct inference on 2K images with these models incurs prohibitive computational and memory demands, making practical deployment challenging. To tackle this issue, we present 2K Retrofit, a novel framework that enables efficient 2K-resolution inference for any geometric foundation model, without modifying or retraining the backbone. Our approach leverages fast coarse predictions and an entropy-based sparse refinement to selectively enhance high-uncertainty regions, achieving precise and high-fidelity 2K outputs with minimal overhead. Extensive experiments on widely used benchmarks demonstrate that 2K Retrofit consistently achieves state-of-the-art accuracy and speed, bridging the gap between research advances and scalable deployment in high-resolution 3D vision applications. Code will be released upon acceptance.
[169] X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving
Chaoda Zheng,Sean Li,Jinhao Deng,Zhennan Wang,Shijia Chen,Liqiang Xiao,Ziheng Chi,Hongbin Lin,Kangjie Chen,Boyang Wang,Yu Zhang,Xianming Liu
Main category: cs.CV
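TL;DR: This paper proposes X-World, an action-conditioned multi-view generative world model that simulates future multi-camera observations for autonomous driving in video space. It supports controllable editing of traffic participants, road elements, and appearance (e.g., weather and time of day), with cross-view geometric consistency and temporal stability, enabling scalable and reproducible evaluation of end-to-end driving.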
Details
Motivation: Current autonomous driving evaluation relies heavily on real-world road testing, which is costly, limited in coverage, and hard to reproduce. A controllable, stable world simulator that generates realistic future observations is needed.
Method: X-World is an action-conditioned multi-camera generative world model built on a multi-view latent video generator. It supports action inputs, editing of dynamic and static scene elements, and text-driven appearance control, and it includes mechanisms that encourage cross-view geometric consistency and temporal coherence.
Result: Experiments show that X-World performs well in multi-view consistency, long-horizon stability, action following, and scene controllability, and supports video style transfer while preserving the underlying dynamics.
Conclusion: X-World provides a scalable, reproducible, high-fidelity simulation foundation for evaluating end-to-end autonomous driving.
Abstract: Scalable and reliable evaluation is increasingly critical in the end-to-end era of autonomous driving, where vision-language-action (VLA) policies directly map raw sensor streams to driving actions. Yet, current evaluation pipelines still rely heavily on real-world road testing, which is costly, biased toward limited scenario coverage, and difficult to reproduce. These challenges motivate a real-world simulator that can generate realistic future observations under proposed actions, while remaining controllable and stable over long horizons. We present X-World, an action-conditioned multi-camera generative world model that simulates future observations directly in video space. Given synchronized multi-view camera history and a future action sequence, X-World generates future multi-camera video streams that follow the commanded actions. To ensure reproducible and editable scene rollouts, X-World further supports optional controls over dynamic traffic agents and static road elements, and retains a text-prompt interface for appearance-level control (e.g., weather and time of day). Beyond world simulation, X-World also enables video style transfer by conditioning on appearance prompts while preserving the underlying action and scene dynamics. At the core of X-World is a multi-view latent video generator designed to explicitly encourage cross-view geometric consistency and temporal coherence under diverse control signals.
Experiments show that X-World achieves high-quality multi-view video generation with (i) strong view consistency across cameras, (ii) stable temporal dynamics over long rollouts, and (iii) high controllability with strict action following and faithful adherence to optional scene controls. These properties make X-World a practical foundation for scalable and reproducible evaluation.
[170] MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI
Rozain Shakeel,Abdul Rahman Mohammad Ali,Muneeb Mushtaq,Tausifa Jan Saleem,Tajamul Ashraf
Main category: cs.CV
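TL;DR: This paper presents MedSPOT, a workflow-aware sequential visual grounding benchmark for clinical GUI environments. It emphasizes sequential reasoning and error propagation in multi-step interaction and provides a diagnostic taxonomy covering multiple failure modes.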
Details
Motivation: Existing GUI benchmarks mostly target isolated, single-step grounding and do not reflect the dynamic, multi-step, context-dependent workflows of real medical interfaces. The ability of MLLMs to perform reliable visual grounding in high-stakes clinical software remains underexplored.
Method: The MedSPOT benchmark contains 216 task-driven videos and 597 annotated keyframes, with each task consisting of 2 to 3 interdependent grounding steps. A strict sequential evaluation protocol terminates a task at the first error, and a failure taxonomy covers six categories: edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion.
Result: MedSPOT shifts grounding evaluation from isolated tasks to workflow-driven sequential decision-making, supporting systematic assessment of spatial precision, contextual understanding, and error robustness under dynamic interface states.
Conclusion: MedSPOT provides a more realistic and challenging evaluation standard for the safe and reliable deployment of multimodal large models in clinical software, advancing evaluation practice for high-stakes human-computer interaction.
Abstract: Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across interdependent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings.
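The strict sequential protocol described above can be sketched in a few lines. This is only an illustration: it assumes each prediction is a click point and each target is an axis-aligned box, and the names `point_in_box` and `evaluate_task` are hypothetical rather than taken from the MedSPOT code.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside the box (x0, y0, x1, y1), edges included."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def evaluate_task(predictions, targets):
    """Score one multi-step task, stopping at the first incorrect step.

    predictions: list of (x, y) points, or None when the model gave no answer.
    targets: list of ground-truth boxes, one per step.
    """
    completed = 0
    for pred, box in zip(predictions, targets):
        if pred is None or not point_in_box(pred, box):
            break  # later steps are not credited, even if they would be correct
        completed += 1
    return {
        "steps_completed": completed,
        "task_success": completed == len(targets),
    }


targets = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
# Step 2 misses, so the correct step 3 is never counted.
result = evaluate_task([(5, 5), (0, 0), (45, 45)], targets)
print(result)
```

Stopping at the first miss is what makes the score sensitive to error propagation: a model that is right on most steps individually can still complete few whole tasks.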
By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.
[171] Evaluating Test-Time Adaptation For Facial Expression Recognition Under Natural Cross-Dataset Distribution Shifts
John Turnbull,Shivam Grover,Amin Jalali,Ali Etemad
Main category: cs.CV
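TL;DR: This paper presents the first evaluation of test-time adaptation (TTA) methods for facial expression recognition (FER) under natural domain shifts. Different TTA methods behave differently across natural shift scenarios, and TTA improves FER performance by up to 11.34%.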
Details
Motivation: Deep learning models often degrade under natural distribution shifts in real deployments. Existing TTA research mostly uses synthetic corruptions and lacks a systematic evaluation of natural shifts caused by differing collection protocols, annotation standards, and demographics.
Method: Cross-dataset experiments on several widely used FER datasets evaluate three typical families of TTA methods: entropy minimization (e.g., TENT, SAR), prototype adjustment (e.g., T3A), and feature alignment (e.g., SHOT).
Result: Entropy minimization works best when the target domain is clean; prototype adjustment is better when the distributional distance between domains is large; feature alignment gives the largest gains when the target domain is noisier than the source. TTA improves FER performance by up to 11.34%.
Conclusion: TTA is effective for FER, but its performance depends strongly on the distributional distance between source and target domains and on the severity of the natural shift, so the adaptation method should be chosen according to the actual shift.
Abstract: Deep learning models often struggle under natural distribution shifts, a common challenge in real-world deployments. Test-Time Adaptation (TTA) addresses this by adapting models during inference without labeled source data. We present the first evaluation of TTA methods for FER under natural domain shifts, performing cross-dataset experiments with widely used FER datasets. This moves beyond synthetic corruptions to examine real-world shifts caused by differing collection protocols, annotation standards, and demographics. Results show TTA can boost FER performance under natural shifts by up to 11.34%. Entropy minimization methods such as TENT and SAR perform best when the target distribution is clean. In contrast, prototype adjustment methods like T3A excel under larger distributional distance scenarios. Finally, feature alignment methods such as SHOT deliver the largest gains when the target distribution is noisier than the source. Our cross-dataset analysis shows that TTA effectiveness is governed by the distributional distance and the severity of the natural shift across domains.
[172] NEC-Diff: Noise-Robust Event-RAW Complementary Diffusion for Seeing Motion in Extreme Darkness
Haoyue Liu,Jinghan Xu,Luxin Feng,Hanyu Zhou,Haozhi Zhao,Yi Chang,Luxin Yan
Main category: cs.CV
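TL;DR: This paper proposes NEC-Diff, a diffusion-based event-RAW hybrid imaging framework that uses physics-driven constraints and dynamic SNR estimation to achieve high-fidelity image reconstruction in extremely low light, and constructs the REAL dataset to validate it.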
Details
Motivation: In extremely low light, photon scarcity causes severe noise and texture loss. Existing methods focus on event-driven texture recovery and neglect image noise and the intrinsic noise of events, which limits pixel-accurate reconstruction.
Method: The NEC-Diff framework (1) combines the linear light response of RAW images with the brightness-change nature of events to establish a physics-driven constraint for dual-modal denoising, and (2) dynamically estimates the SNR of both modalities from the denoising results to guide adaptive feature fusion, making the diffusion process more reliable.
Result: NEC-Diff outperforms existing methods under extreme darkness (0.001-0.8 lux). The REAL dataset, with 47,800 pixel-aligned sets of low-light RAW images, events, and high-quality references, is constructed.
Conclusion: By combining physical modeling with data-driven fusion, NEC-Diff addresses noise and missing information in extremely low-light imaging and offers a new paradigm for combining event cameras with conventional imaging.
Abstract: High-quality imaging of dynamic scenes in extremely low-light conditions is highly challenging. Photon scarcity induces severe noise and texture loss, causing significant image degradation. Event cameras, featuring a high dynamic range (120 dB) and high sensitivity to motion, serve as powerful complements to conventional cameras by offering crucial cues for preserving subtle textures. However, most existing approaches emphasize texture recovery from events, while paying little attention to image noise or the intrinsic noise of events themselves, which ultimately hinders accurate pixel reconstruction under photon-starved conditions. In this work, we propose NEC-Diff, a novel diffusion-based event-RAW hybrid imaging framework that extracts reliable information from heavily noisy signals to reconstruct fine scene structures. The framework is driven by two key insights: (1) combining the linear light-response property of RAW images with the brightness-change nature of events to establish a physics-driven constraint for robust dual-modal denoising; and (2) dynamically estimating the SNR of both modalities based on denoising results to guide adaptive feature fusion, thereby injecting reliable cues into the diffusion process for high-fidelity visual reconstruction. Furthermore, we construct the REAL (Raw and Event Acquired in Low-light) dataset which provides 47,800 pixel-aligned low-light RAW images, events, and high-quality references under 0.001-0.8 lux illumination. Extensive experiments demonstrate the superiority of NEC-Diff under extreme darkness.
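As a rough illustration of the second insight above (SNR-guided adaptive fusion), the sketch below weights two feature vectors by their estimated SNR. Both the SNR estimator (denoised power over residual power) and the convex weighting rule are assumptions made for illustration; the abstract does not specify how NEC-Diff computes or applies its SNR estimates, and the function names are hypothetical.

```python
def estimate_snr(noisy, denoised, eps=1e-8):
    """Rough SNR: power of the denoised signal over power of the removed residual."""
    signal_power = sum(d * d for d in denoised) / len(denoised)
    noise_power = sum((n - d) ** 2 for n, d in zip(noisy, denoised)) / len(denoised)
    return signal_power / (noise_power + eps)


def fuse(raw_feat, event_feat, snr_raw, snr_event):
    """Convex combination of two feature vectors, weighted by relative SNR."""
    total = snr_raw + snr_event
    w_raw = 0.5 if total == 0 else snr_raw / total
    return [w_raw * r + (1.0 - w_raw) * e for r, e in zip(raw_feat, event_feat)]


# The modality with three times the SNR receives three quarters of the weight.
fused = fuse([1.0, 1.0], [0.0, 0.0], snr_raw=3.0, snr_event=1.0)
print(fused)
```

The point of such a weighting is that whichever modality is currently cleaner dominates the fused features, so neither noisy RAW pixels nor noisy events are trusted unconditionally.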
The project is available at: https://github.com/jinghan-xu/NEC-Diff.
[173] Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
Zheng Gao,Debin Meng,Yunqi Miao,Zhensong Zhang,Songcen Xu,Ioannis Patras,Jifei Song
Main category: cs.CV
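TL;DR: This paper proposes FRAM, which fine-tunes a CLIP model and introduces a facial region-aware makeup feature injection mechanism to improve the regional controllability and transfer quality of diffusion-based makeup transfer.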
Details
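Motivation: Existing diffusion-based makeup transfer methods rely on general-purpose pre-trained foundation models (e.g., CLIP) to extract makeup information, which has two shortcomings: foundation models struggle to capture makeup styles precisely, and makeup features are injected globally, ignoring differences between facial regions such as the eyes and lips and limiting regional controllability.
Method: FRAM has two stages. (1) Makeup CLIP fine-tuning: annotated makeup style data are synthesized with GPT-o3 and a text-driven image editing model, and a dedicated makeup CLIP encoder is trained with self-supervised and image-text contrastive learning. (2) Identity and facial region-aware makeup injection: before-and-after makeup image pairs are constructed; learnable tokens query the makeup CLIP encoder to extract region-level features (eyes, lips, etc.), with an attention loss enabling regional control; and a ControlNet Union jointly encodes the source image and its 3D mesh to inject identity.
Result: Experiments verify FRAM's advantages in regional controllability and makeup transfer quality.
Conclusion: FRAM addresses the weak makeup representation of general-purpose foundation models and the lack of regional control caused by global injection, offering a new paradigm for high-fidelity, editable makeup transfer.
Abstract: Current diffusion-based makeup transfer methods commonly use the makeup information encoded by off-the-shelf foundation models (e.g., CLIP) as a condition to preserve the makeup style of the reference image during generation. Although effective, these works mainly have two limitations: (1) foundation models pre-trained for generic tasks struggle to capture makeup styles; (2) the makeup features of the reference image are injected into the diffusion denoising model as a whole for global makeup transfer, overlooking the facial region-aware makeup features (e.g., eyes and mouth) and limiting the regional controllability for region-specific makeup transfer. To address these limitations, we propose Facial Region-Aware Makeup features (FRAM), which has two stages: (1) makeup CLIP fine-tuning; (2) identity and facial region-aware makeup injection. For makeup CLIP fine-tuning, unlike prior works using off-the-shelf CLIP, we synthesize annotated makeup style data using GPT-o3 and a text-driven image editing model, and then use the data to train a makeup CLIP encoder through self-supervised and image-text contrastive learning. For identity and facial region-aware makeup injection, we construct before-and-after makeup image pairs from the edited images in stage 1 and then use them to learn to inject the identity of the source image and the makeup of the reference image into the diffusion denoising model for makeup transfer.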
Specifically, we use learnable tokens to query the makeup CLIP encoder to extract facial region-aware makeup features for makeup injection, which is learned via an attention loss to enable regional control. As for identity injection, we use a ControlNet Union to encode the source image and its 3D mesh simultaneously. The experimental results verify the superiority of our regional controllability and our makeup transfer performance.
[174] CFCML: A Coarse-to-Fine Crossmodal Learning Framework For Disease Diagnosis Using Multimodal Images and Tabular Data
Tianling Liu,Hongying Liu,Fanhua Shang,Lequan Yu,Tong Han,Liang Wan
Main category: cs.CV
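TL;DR: This paper proposes a coarse-to-fine crossmodal learning (CFCML) framework that uses multi-granularity feature alignment and hierarchical anchor-based contrastive learning to narrow the modality gap between medical images and tabular data, improving disease diagnosis accuracy.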
Details
Motivation: In clinical practice there is a significant modality gap between images and tabular data. Existing methods neglect local image information and the extraction of task-relevant features, limiting crossmodal diagnostic performance.
Method: The framework has two stages. The coarse stage relates features from multiple image encoder stages to the tabular data; the fine stage builds class-aware unimodal and crossmodal prototypes and applies a hierarchical anchor-based relationship mining (HRM) strategy for multi-perspective contrastive learning.
Result: On the MEN and Derm7pt datasets, AUC improves by 1.53% and 0.91%, respectively, over state-of-the-art methods.
Conclusion: Through progressive alignment and hierarchical contrastive learning, CFCML mitigates the modality gap and strengthens discriminative crossmodal representations, offering a new direction for multimodal medical diagnosis.
Abstract: In clinical practice, crossmodal information including medical images and tabular data is essential for disease diagnosis. There exists a significant modality gap between these data types, which obstructs advancements in crossmodal diagnostic accuracy. Most existing crossmodal learning (CML) methods primarily focus on exploring relationships among high-level encoder outputs, leading to the neglect of local information in images. Additionally, these methods often overlook the extraction of task-relevant information. In this paper, we propose a novel coarse-to-fine crossmodal learning (CFCML) framework to progressively reduce the modality gap between multimodal images and tabular data, by thoroughly exploring inter-modal relationships. At the coarse stage, we explore the relationships between multi-granularity features from various image encoder stages and tabular information, facilitating a preliminary reduction of the modality gap. At the fine stage, we generate unimodal and crossmodal prototypes that incorporate class-aware information, and establish a hierarchical anchor-based relationship mining (HRM) strategy to further diminish the modality gap and extract discriminative crossmodal information. This strategy utilizes modality samples, unimodal prototypes, and crossmodal prototypes as anchors to develop contrastive learning approaches, effectively enhancing inter-class disparity while reducing intra-class disparity from multiple perspectives. Experimental results indicate that our method outperforms the state-of-the-art (SOTA) methods, achieving improvements of 1.53% and 0.91% in AUC metrics on the MEN and Derm7pt datasets, respectively.
The code is available at https://github.com/IsDling/CFCML.
[175] Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR
Ziye Yuan,Ruchang Yao,Chengxin Zheng,Yusheng Zhao,Daxiang Dong,Ming Zhang
Main category: cs.CV
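TL;DR: This paper proposes Detached Skip-Links, which reuse shallow features in the forward pass while blocking gradients through the skip branch in the backward pass. This mitigates gradient interference in multi-layer feature fusion and improves MLLM performance on fine-grained visual tasks such as OCR. The paper also introduces $R$-Probe, a metric for the pixel-level reconstructability of visual tokens.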
Details
Motivation: Multimodal large language models (MLLMs) perform poorly on tasks such as OCR that require fine visual detail. The authors trace this to gradient interference from skip pathways in multi-layer feature fusion: gradients from high-level semantic objectives flow directly back to early visual layers, overwriting low-level signals and destabilizing training.
Method: Detached Skip-Links keep the skip connection in the forward pass to reuse shallow features and block gradient flow through the skip branch in the backward pass. $R$-Probe is a diagnostic that measures the pixel-level reconstructability of visual tokens using a lightweight decoder initialized from the first quarter of the LLM layers.
Result: Across multiple ViT backbones, multimodal benchmarks (including OCR-centric ones), and up to 7M training samples, the method consistently improves OCR-related metrics and delivers clear gains on general multimodal tasks. $R$-Probe indicates that fine-grained visual information is better preserved and used.
Conclusion: Gradient interference is a key factor limiting fine-grained perception in MLLMs. Detached Skip-Links improve training stability and OCR performance without adding learnable parameters, and $R$-Probe provides an interpretable diagnostic for visual representation quality.
Abstract: Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
[176] MFil-Mamba: Multi-Filter Scanning for Spatial Redundancy-Aware Visual State Space Models
Puskal Khadka,KC Santosh
Main category: cs.CV
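TL;DR: This paper proposes MFil-Mamba, a visual state space architecture based on multi-filter scanning that adaptively weights and fuses features from multiple scans, achieving state-of-the-art performance on image classification, object detection, instance segmentation, and semantic segmentation.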
Details
Motivation: SSMs such as Mamba excel at sequence modeling, but applying them directly to vision is limited by the two-dimensional, non-sequential structure of images and their complex spatial dependencies. Existing visual SSMs mostly rely on fixed multi-directional traversal strategies, which introduce redundancy and distort spatial relationships.
Method: MFil-Mamba includes (1) a multi-filter scanning backbone in which each scan captures unique, contextually relevant spatial information; (2) an adaptive weighting mechanism that fuses the outputs of multiple scans; and (3) accompanying architectural enhancements.
Result: The model surpasses existing state-of-the-art models on ImageNet-1K (83.2% top-1 for the tiny variant), MS COCO (47.3% box AP, 42.7% mask AP), and ADE20K (48.5% mIoU).
Conclusion: MFil-Mamba addresses the redundancy and distortion of spatial modeling when SSMs are applied to vision, supporting the effectiveness and generality of multi-filter scanning with adaptive fusion.
Abstract: State Space Models (SSMs), especially the recent Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging due to the non-sequential structure of visual data and its complex 2D spatial dependencies. Although several early studies have explored adapting selective SSMs for vision applications, most approaches primarily depend on employing various traversal strategies over the same input. This introduces redundancy and distorts the intricate spatial relationships within images. To address these challenges, we propose MFil-Mamba, a novel visual state space architecture built on a multi-filter scanning backbone. Unlike fixed multi-directional traversal methods, our design enables each scan to capture unique and contextually relevant spatial information while minimizing redundancy. Furthermore, we incorporate an adaptive weighting mechanism to effectively fuse outputs from multiple scans in addition to architectural enhancements. MFil-Mamba achieves superior performance over existing state-of-the-art models across various benchmarks that include image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant attains 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset. Code and models are available at https://github.com/puskal-khadka/MFil-Mamba.
[177] A Unified Platform and Quality Assurance Framework for 3D Ultrasound Reconstruction with Robotic, Optical, and Electromagnetic Tracking
Lewis Howell,Manisha Waterston,Tze Min Wah,James H. Chandler,James R. McLaughlan
Main category: cs.CV
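TL;DR: This paper presents a quality assurance (QA) framework and an open-source platform for 3D ultrasound reconstruction. Using a custom phantom and a standardized pipeline, it evaluates reconstruction accuracy and reproducibility under different tracking methods and shows that robotic 3D ultrasound approaches the spatial resolution limit of the transducer.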
Details
Motivation: Current studies rarely evaluate the accuracy and reproducibility of 3D ultrasound volume reconstruction comprehensively, so a robust quality assurance (QA) framework is needed for tracked 3D ultrasound reconstruction with freehand or robotic acquisition.
Method: A custom phantom with geometric inclusions of varying symmetry supports evaluation of optical, electromagnetic, and robotic kinematic tracking at different scanning speeds and insonation angles. A standardized pipeline performs real-time segmentation and 3D reconstruction without GPU acceleration (DSC = 0.97, FPS = 46), followed by automated registration and comparison with ground-truth geometries.
Result: The framework shows that the authors' robotic 3D ultrasound achieves state-of-the-art reconstruction performance (DSC-3D = 0.94 ± 0.01, HD95 = 1.17 ± 0.12), approaching the transducer's spatial resolution limit. An open-source platform and a reproducible validation methodology are provided.
Conclusion: The work establishes a flexible experimental platform and a reproducible validation methodology that support robust cross-platform comparison and better reporting, aiding the safe and effective clinical translation of 3D ultrasound in diagnosis and image-guided therapy.
Abstract: Three-dimensional (3D) Ultrasound (US) can facilitate diagnosis, treatment planning, and image-guided therapy. However, current studies rarely provide a comprehensive evaluation of volumetric accuracy and reproducibility, highlighting the need for robust Quality Assurance (QA) frameworks, particularly for tracked 3D US reconstruction using freehand or robotic acquisition. This study presents a QA framework for 3D US reconstruction and a flexible open source platform for tracked US research. A custom phantom containing geometric inclusions with varying symmetry properties enables straightforward evaluation of optical, electromagnetic, and robotic kinematic tracking for 3D US at different scanning speeds and insonation angles. A standardised pipeline performs real-time segmentation and 3D reconstruction of geometric targets (DSC = 0.97, FPS = 46) without GPU acceleration, followed by automated registration and comparison with ground-truth geometries. Applying this framework showed that our robotic 3D US achieves state-of-the-art reconstruction performance (DSC-3D = 0.94 ± 0.01, HD95 = 1.17 ± 0.12), approaching the spatial resolution limit imposed by the transducer. This work establishes a flexible experimental platform and a reproducible validation methodology for 3D US reconstruction. The proposed framework enables robust cross-platform comparisons and improved reporting practices, supporting the safe and effective clinical translation of 3D ultrasound in diagnostic and image-guided therapy applications.
[178] Preference-Guided Debiasing for No-Reference Enhancement Image Quality Assessment
Shiqi Gao,Kang Fu,Zitong Xu,Huiyu Duan,Xiongkuo Min,Jia Wang,Guangtao Zhai
Main category: cs.CV
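TL;DR: This paper proposes a preference-guided debiasing framework for no-reference enhancement image quality assessment (EIQA). It builds a continuous enhancement-preference embedding space and removes the enhancement-induced nuisance component so that the model focuses on algorithm-invariant perceptual quality cues, improving cross-algorithm generalization.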
Details
Motivation: Existing no-reference image quality assessment models tend to overfit to the patterns of specific enhancement algorithms, struggle to assess genuine perceptual quality, and generalize poorly across algorithms.
Method: Supervised contrastive learning builds a continuous enhancement-preference embedding space; the enhancement-induced nuisance component in the raw quality representation is estimated and removed; and a two-stage training strategy stabilizes optimization.
Result: Experiments on public EIQA benchmarks show that the method mitigates algorithm-induced representation bias and outperforms existing approaches in robustness and cross-algorithm generalization.
Conclusion: The preference-guided debiasing framework improves how well no-reference EIQA models capture genuine perceptual quality, increasing their generalization and practical usefulness.
Abstract: Current no-reference image quality assessment (NR-IQA) models for enhanced images often struggle to generalize, as they tend to overfit to the distinct patterns of specific enhancement algorithms rather than evaluating genuine perceptual quality. To address this issue, we propose a preference-guided debiasing framework for no-reference enhancement image quality assessment (EIQA). Specifically, we first learn a continuous enhancement-preference embedding space using supervised contrastive learning, where images generated by similar enhancement styles are encouraged to have closer representations. Based on this, we further estimate the enhancement-induced nuisance component contained in the raw quality representation and remove it before quality regression. In this way, the model is guided to focus on algorithm-invariant perceptual quality cues instead of enhancement-specific visual fingerprints. To facilitate stable optimization, we adopt a two-stage training strategy that first learns the enhancement-preference space and then performs debiased quality prediction. Extensive experiments on public EIQA benchmarks demonstrate that the proposed method effectively mitigates algorithm-induced representation bias and achieves superior robustness and cross-algorithm generalization compared with existing approaches.
[179] Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning
Jiajie Li,Chenhui Xu,Meihuan Liu,Jinjun Xiong
Main category: cs.CV
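TL;DR: This paper proposes the Chain-of-Adaptation (CoA) framework, which uses reinforcement learning to adapt a vision-language model (VLM) to a domain while preserving its multimodal priors, improving accuracy, generalization, and stability on surgical tasks.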
Details
Motivation: Conventional fine-tuning can damage pretrained multimodal priors and weaken generalization.
Method: The Chain-of-Adaptation (CoA) framework uses a structured reasoning format together with reinforcement learning to inject domain knowledge without harming the model's inherent reasoning and perceptual capabilities.
Result: On surgical benchmarks (in-distribution and out-of-distribution), CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Ablations confirm that it preserves core vision-language abilities.
Conclusion: CoA offers a reliable adaptation path for VLMs that balances domain specialization with general multimodal competence.
Abstract: Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model's inherent reasoning and perceptual capabilities. CoA introduces a structured reasoning format and uses reinforcement learning to enhance domain alignment without sacrificing general multimodal competence. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model's core visual-language abilities, providing a reliable pathway for domain specialization in VLMs.
[180] Generalizable NGP-SR: Generalizable Neural Radiance Fields Super-Resolution via Neural Graph Primitives
Wanqi Yuan,Omkar Sharad Mayekar,Connor Pennington,Nianyi Li
Main category: cs.CV
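TL;DR: This paper proposes Generalizable NGP-SR, a 3D-aware super-resolution framework built on Neural Graphics Primitives (NGP). It reconstructs a high-resolution radiance field directly from low-resolution input images, producing view-consistent, high-quality novel views, and generalizes across scenes without per-scene fine-tuning.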
Details
Motivation: NeRF is computationally expensive for high-resolution rendering, while post-hoc 2D super-resolution breaks multi-view consistency. An efficient, consistent, and generalizable 3D super-resolution method is needed.
Method: A super-resolution radiance field is built on NGP, conditioning radiance prediction on 3D coordinates and learned local texture tokens to reconstruct the HR radiance field end to end. Once trained, the model can be applied to unseen scenes without per-scene optimization.
Result: On multiple datasets, the method improves on prior NeRF-based super-resolution methods in both reconstruction quality and runtime efficiency.
Conclusion: NGP-SR offers a practical, scalable solution for high-resolution novel view synthesis that combines 3D consistency, generalization, and efficiency.
Abstract: Neural Radiance Fields (NeRF) achieve photorealistic novel view synthesis but become costly when high-resolution (HR) rendering is required, as HR outputs demand dense sampling and higher-capacity models. Moreover, naively super-resolving per-view renderings in 2D often breaks multi-view consistency. We propose Generalizable NGP-SR, a 3D-aware super-resolution framework that reconstructs an HR radiance field directly from low-resolution (LR) posed images. Built on Neural Graphics Primitives (NGP), NGP-SR conditions radiance prediction on 3D coordinates and learned local texture tokens, enabling recovery of high-frequency details within the radiance field and producing view-consistent HR novel views without external HR references or post-hoc 2D upsampling. Importantly, our model is generalizable: once trained, it can be applied to unseen scenes and rendered from novel viewpoints without per-scene optimization. Experiments on multiple datasets show that NGP-SR consistently improves both reconstruction quality and runtime efficiency over prior NeRF-based super-resolution methods, offering a practical solution for scalable high-resolution novel view synthesis.
[181] Synergistic Perception and Generative Recomposition: A Multi-Agent Orchestration for Expert-Level Building Inspection
Hui Zhong,Yichun Gao,Luyan Liu,Xusen Guo,Zhaonian Kuang,Qiming Zhang,Xinhu Zheng
Main category: cs.CV
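TL;DR: FacadeFixer is a multi-agent framework that uses collaborative reasoning to tackle geometric variability, low contrast, and composite defects in building facade defect inspection; it combines detection, segmentation, and generative agents to decouple defects and produce high-fidelity data augmentation, and it builds the first pixel-level annotated dataset covering six facade categories.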
Details
Motivation: Building facade defect inspection faces large geometric diversity, low contrast against complex backgrounds, and complex composite defects (e.g., cracks co-occurring with spalling), which cause pixel imbalance and feature ambiguity; high-quality pixel-level annotations are also extremely scarce, limiting the generalization of existing models. Method: Proposes FacadeFixer, a unified multi-agent framework in which a detection agent and a segmentation agent jointly handle multi-type defect interference, while a generative agent performs semantic recomposition, decoupling defects from noisy backgrounds and realistically synthesizing them onto diverse clean textures to produce high-fidelity augmented data with precise expert-level masks; also builds a multi-task dataset with pixel-level annotations covering six primary facade categories. Result: Extensive experiments show significant gains over state-of-the-art (SOTA) methods, especially in capturing pixel-level structural anomalies, supporting generative synthesis as an effective remedy for data scarcity in infrastructure inspection. Conclusion: Through multi-agent collaboration and generative data augmentation, FacadeFixer alleviates data scarcity and feature ambiguity in facade defect inspection and offers a new paradigm for intelligent inspection of urban infrastructure. Abstract: Building facade defect inspection is fundamental to structural health monitoring and sustainable urban maintenance, yet it remains a formidable challenge due to extreme geometric variability, low contrast against complex backgrounds, and the inherent complexity of composite defects (e.g., cracks co-occurring with spalling). Such characteristics lead to severe pixel imbalance and feature ambiguity, which, coupled with the critical scarcity of high-quality pixel-level annotations, hinder the generalization of existing detection and segmentation models. To address these gaps, we propose FacadeFixer, a unified multi-agent framework that treats defect perception as a collaborative reasoning task rather than isolated recognition. Specifically, FacadeFixer orchestrates specialized agents for detection and segmentation to handle multi-type defect interference, working in tandem with a generative agent to enable semantic recomposition. This process decouples intricate defects from noisy backgrounds and realistically synthesizes them onto diverse clean textures, generating high-fidelity augmented data with precise expert-level masks. To support this, we introduce a comprehensive multi-task dataset covering six primary facade categories with pixel-level annotations. Extensive experiments demonstrate that FacadeFixer significantly outperforms state-of-the-art (SOTA) baselines. Specifically, it excels in capturing pixel-level structural anomalies and highlights generative synthesis as a robust solution to data scarcity in infrastructure inspection. Our code and dataset will be made publicly available.
[182] Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning
Hui Zhong,Yichun Gao,Luyan Liu,Hai Yang,Wang Wang,Haowei Zhang,Xinhu Zheng
Main category: cs.CV
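TL;DR: This paper proposes DefectBench, the first multi-dimensional benchmark for evaluating large multimodal models on building facade defect inspection; it builds a unified annotation framework and a high-quality open-source dataset, systematically evaluates 18 SOTA models on semantic perception, spatial localization, and generative geometry segmentation, finds strong topological understanding but weak precise localization, and validates the feasibility of zero-shot generative segmentation.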
Details
Motivation: Existing facade inspection relies on specialized discriminative models (e.g., YOLO, Mask R-CNN) that lack the ability to understand structural topology; large multimodal models (LMMs) have reasoning potential but lack rigorous evaluation standards in high-stakes engineering domains, so unified data and an evaluation framework are needed. Method: Builds a human-in-the-loop semi-automated annotation framework that merges 12 heterogeneous datasets into a standardized hierarchical ontology; on this basis releases the multi-dimensional DefectBench benchmark, which systematically evaluates 18 SOTA LMMs across three escalating cognitive dimensions: semantic perception, spatial localization, and generative geometry segmentation. Result: Experiments show that current LMMs are strong in topological awareness and semantic understanding ("what" and "how") but have a marked weakness in metric-level localization ("where"); the work also shows for the first time that general-purpose foundation models can achieve zero-shot generative segmentation performance comparable to specialized supervised networks. Conclusion: This work establishes a new benchmark for the development of autonomous AI agents in civil engineering, providing a rigorous evaluation standard and a high-quality open-source database, and advancing LMMs from passive perception toward active engineering reasoning. Abstract: Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on specialized discriminative models (e.g., YOLO, Mask R-CNN) that excel at pixel-level localization but are constrained to passive perception and generalize poorly, lacking the visual understanding to interpret structural topology. Large Multimodal Models (LMMs) promise a paradigm shift toward active reasoning, yet their application in such high-stakes engineering domains lacks rigorous evaluation standards. To bridge this gap, we introduce a human-in-the-loop semi-automated annotation framework, leveraging expert-proposal verification to unify 12 fragmented datasets into a standardized, hierarchical ontology. Building on this foundation, we present DefectBench, the first multi-dimensional benchmark designed to interrogate LMMs beyond basic semantic recognition. DefectBench evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation. Extensive experiments reveal that while current LMMs demonstrate exceptional topological awareness and semantic understanding (effectively diagnosing "what" and "how"), they exhibit significant deficiencies in metric localization precision ("where"). Crucially, however, we validate the viability of zero-shot generative segmentation, showing that general-purpose foundation models can rival specialized supervised networks without domain-specific training. This work provides both a rigorous benchmarking standard and a high-quality open-source database, establishing a new baseline for the advancement of autonomous AI agents in civil engineering.
[183] EgoForge: Goal-Directed Egocentric World Simulator
Yifan Shen,Jiateng Liu,Xinzhuo Li,Yuanzhe Liu,Bingxuan Li,Houze Yang,Wenqi Jia,Yijiang Li,Tianjiao Yu,James Matthew Rehg,Xu Cao,Ismini Lourentzou
Main category: cs.CV
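TL;DR: This paper proposes EgoForge, a generative egocentric world simulator driven by a single first-person image, a high-level instruction, and an optional exocentric view, and introduces VideoDiffusionNFT, which applies trajectory-level reward-guided refinement during diffusion sampling to improve goal completion, temporal causality, scene consistency, and perceptual fidelity.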
Details
Motivation: Existing methods struggle to model the rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures dependent on latent human intent that characterize egocentric video; many rely on dense supervision or are limited to static view translation or hand-centric instructional synthesis. Method: Proposes the EgoForge framework, which takes a single egocentric image, a high-level instruction, and an optional exocentric view as input; its core is VideoDiffusionNFT, a method that guides refinement during diffusion sampling with multi-dimensional trajectory-level rewards (goal completion, temporal causality, scene consistency, perceptual fidelity). Result: Significantly outperforms strong baselines in semantic alignment, geometric stability, and motion fidelity, and shows robust performance in real-world smart-glasses experiments. Conclusion: EgoForge generates coherent, goal-directed first-person video from minimal static inputs, advancing the practicality of generative world models in real egocentric settings. Abstract: Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision, such as camera trajectories, long video prefixes, or synchronized multicamera capture. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and robust performance in real-world smart-glasses experiments.
[184] TinyML Enhances CubeSat Mission Capabilities
Luigi Capogrosso,Michele Magno
Main category: cs.CV
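TL;DR: This paper proposes a TinyML image classification optimization and deployment pipeline for CubeSats that uses iterative pruning, INT8 quantization, and hardware-aware operator mapping to achieve efficient, low-power, low-memory remote sensing image classification on the STM32N6 microcontroller (with a Cortex-M55 core and a Neural-ART NPU).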
Details
Motivation: Traditional Earth observation missions rely on ground processing, but CubeSats are constrained in compute, energy, and communication bandwidth, so lightweight, efficient, hardware-adapted onboard processing is needed. Method: Combines structured iterative pruning, post-training INT8 quantization, and hardware-aware operator mapping to optimize and deploy four CNN models (SqueezeNet, MobileNetV3, EfficientNet, MCUNetV1) on the STM32N6 MCU (with a Cortex-M55 core and a Neural-ART NPU), validated on three remote sensing datasets: EuroSAT, RS_C11, and MEDIC. Result: RAM usage falls by 89.55% and Flash by 70.09% on average; inference accuracy drops by 0.4 to 8.6 percentage points; energy per inference is 0.68 to 6.45 mJ with latency of 3.22 to 30.38 ms, meeting CubeSat onboard real-time and energy constraints. Conclusion: The TinyML pipeline substantially improves the energy efficiency and feasibility of onboard CubeSat image classification and offers a reproducible, hardware-co-designed path to edge-intelligent remote sensing on resource-constrained platforms. Abstract: Earth observation (EO) missions traditionally rely on transmitting raw or minimally processed imagery from satellites to ground stations for computationally intensive analysis. This paradigm is infeasible for CubeSat systems due to stringent constraints on the onboard embedded processors, energy availability, and communication bandwidth. To overcome these limitations, the paper presents a TinyML-based Convolutional Neural Network (ConvNet) model optimization and deployment pipeline for onboard image classification, enabling accurate, energy-efficient, and hardware-aware inference under CubeSat-class constraints. Our pipeline integrates structured iterative pruning, post-training INT8 quantization, and hardware-aware operator mapping to compress models and align them with the heterogeneous compute architecture of the STM32N6 microcontroller from STMicroelectronics. This Microcontroller Unit (MCU) integrates a novel Arm Cortex-M55 core and a Neural-ART Neural Processing Unit (NPU), providing a realistic proxy for CubeSat onboard computers. The paper evaluates the proposed approach on three EO benchmark datasets (i.e., EuroSAT, RS_C11, MEDIC) and four models (i.e., SqueezeNet, MobileNetV3, EfficientNet, MCUNetV1). We demonstrate an average reduction in RAM usage of 89.55% and Flash memory of 70.09% for the optimized models, significantly decreasing downlink bandwidth requirements while maintaining task-acceptable accuracy (with a drop ranging from 0.4 to 8.6 percentage points compared to the Float32 baseline). The energy consumption per inference ranges from 0.68 mJ to 6.45 mJ, with latency spanning from 3.22 ms to 30.38 ms. These results fully satisfy the stringent energy budgets and real-time constraints required for efficient onboard EO processing.
[185] LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
Stanislaw Szymanowicz,Minghao Chen,Jianyuan Wang,Christian Rupprecht,Andrea Vedaldi
Main category: cs.CV
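TL;DR: LagerNVS is an encoder-decoder network built on 3D-aware latent features; by initializing the encoder from a pretrained 3D reconstruction network, it substantially improves novel view synthesis (NVS) without explicit 3D reconstruction, achieving state-of-the-art results, real-time rendering, and strong generalization.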
Details
Motivation: Although neural networks can perform novel view synthesis (NVS) without explicit 3D reconstruction, the authors argue that strong 3D inductive biases still help network design. Method: Proposes LagerNVS, in which the encoder is initialized from a 3D reconstruction network pretrained with explicit 3D supervision and paired with a lightweight decoder; the whole model is trained end-to-end with photometric losses. Result: Reaches 31.4 PSNR on Re10k, the best among current deterministic feed-forward NVS methods; supports known and unknown camera settings, real-time rendering, generalization to in-the-wild data, and pairing with a diffusion decoder for generative extrapolation. Conclusion: 3D-aware latent representations and a 3D-pretrained encoder significantly improve NVS performance, showing that 3D inductive biases remain valuable for purely learned NVS. Abstract: Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue that strong 3D inductive biases are still helpful in the design of such networks. We show this point by introducing LagerNVS, an encoder-decoder neural network for NVS that builds on '3D-aware' latent features. The encoder is initialized from a 3D reconstruction network pre-trained using explicit 3D supervision. This is paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis (including 31.4 PSNR on Re10k), with and without known cameras, renders in real time, generalizes to in-the-wild data, and can be paired with a diffusion decoder for generative extrapolation.
[186] Improving Image-to-Image Translation via a Rectified Flow Reformulation
Satoshi Iizuka,Shun Okamoto,Kazuhiro Fukui
Main category: cs.CV
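TL;DR: This paper proposes I2I-RFR, which recasts conventional image-to-image regression networks as continuous-time transport models; while keeping a simple supervised training pipeline, it introduces a noisy target input and a reweighted pixel loss to enable ODE-driven progressive inference, markedly improving perceptual quality and detail preservation.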
Details
Motivation: Pixel-wise I2I regression is stable and easy to use but tends to over-smooth ill-posed or multimodal targets, whereas generative methods typically need extra components, task-specific tuning, and complex training and inference pipelines. Method: Concatenates a noise-corrupted target image channel-wise to the backbone input and optimizes a t-reweighted pixel loss; this objective admits a rectified-flow interpretation and supports progressive inference with an ODE solver in a few steps. Result: Validated on multiple image translation and video restoration tasks, with generally improved performance and especially clear gains in perceptual quality and detail preservation; it requires only expanded input channels, and inference needs only a few explicit solver steps (e.g., 3) without distillation. Conclusion: I2I-RFR is a lightweight plug-in that adds continuous-time refinement to conventional I2I models without a heavy generative pipeline. Abstract: In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks as continuous-time transport models. While pixel-wise I2I regression is simple, stable, and easy to adapt across tasks, it often over-smooths ill-posed and multimodal targets, whereas generative alternatives often require additional components, task-specific tuning, and more complex training and inference pipelines. Our method augments the backbone input by channel-wise concatenation with a noise-corrupted version of the ground-truth target and optimizes a simple t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference time while largely preserving the standard supervised training pipeline. In most cases, adopting I2I-RFR requires only expanding the input channels, and inference can be performed with a few explicit solver steps (e.g., 3 steps) without distillation. Extensive experiments across multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across a wide range of tasks and backbones, with particularly clear gains in perceptual quality and detail preservation. Overall, I2I-RFR provides a lightweight way to incorporate continuous-time refinement into conventional I2I models without requiring a heavy generative pipeline.
[187] MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering
Yuan Zhou,Yongzhi Li,Yanqi Dai,Xingyu Zhu,Yi Tan,Qingshan Xu,Beier Zhu,Richang Hong,Hanwang Zhang
Main category: cs.CV
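TL;DR: This paper proposes the MuSteerNet framework, which uses an observation-reaction mutual steering mechanism to address the relational distortion between visual observations and reaction types in video-driven human reaction generation, improving the consistency between generated 3D human reaction motions and the input video content.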
Details
Motivation: Existing methods fail to use video input effectively to guide human reaction generation, so generated motions do not match the video content; the root cause is severe relational distortion between visual observations and reaction types. Method: Proposes MuSteerNet with two core modules: 1) Prototype Feedback Steering, which refines visual observations using a gated delta-rectification modulator and a relational margin constraint, guided by prototypical vectors learned from human reactions; 2) Dual-Coupled Reaction Refinement, which uses the rectified visual cues to further refine the generated motions. Result: Achieves competitive performance on multiple benchmarks; extensive experiments and ablation studies validate each module. Conclusion: Through bidirectional steering between observation and reaction, MuSteerNet substantially mitigates relational distortion and improves the quality and consistency of video-driven 3D human reaction generation. Abstract: Video-driven human reaction generation aims to synthesize 3D human motions that directly react to observed video sequences, which is crucial for building human-like interactive AI systems. However, existing methods often fail to effectively leverage video inputs to steer human reaction synthesis, resulting in reaction motions that are mismatched with the content of video sequences. We reveal that this limitation arises from a severe relational distortion between visual observations and reaction types. In light of this, we propose MuSteerNet, a simple yet effective framework that generates 3D human reactions from videos via observation-reaction mutual steering. Specifically, we first propose a Prototype Feedback Steering mechanism to mitigate relational distortion by refining visual observations with a gated delta-rectification modulator and a relational margin constraint, guided by prototypical vectors learned from human reactions. We then introduce Dual-Coupled Reaction Refinement that fully leverages rectified visual cues to further steer the refinement of generated reaction motions, thereby effectively improving reaction quality and enabling MuSteerNet to achieve competitive performance. Extensive experiments and ablation studies validate the effectiveness of our method. Code coming soon: https://github.com/zhouyuan888888/MuSteerNet.
[188] Wildfire Spread Scenarios: Increasing Sample Diversity of Segmentation Diffusion Models with Training-Free Methods
Sebastian Gerard,Josephine Sullivan
Main category: cs.CV
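TL;DR: This paper proposes training-free sampling methods that increase the sample diversity of diffusion models on ambiguous segmentation tasks, significantly improving the HM IoU* metric with almost no added computational cost.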
Details
Motivation: Diffusion models can model multi-modal distributions, but naive sampling is inefficient and struggles to find low-probability yet operationally important modes. Method: Adapts two training-free diversity sampling methods, particle guidance and SPELL, to discrete segmentation tasks, and proposes a new clustering-based technique. Result: Validated on the MMFire, Cityscapes, and LIDC datasets; compared with naive sampling, HM IoU* improves by up to 7.5% (MMFire) and 16.4% (Cityscapes), with image quality and runtime largely unaffected. Conclusion: Training-free sampling strategies can efficiently increase the prediction diversity of segmentation diffusion models and are of practical value for uncertainty-modeling tasks such as wildfire spread and medical diagnosis. Abstract: Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider multiple plausible outcomes. While diffusion models can effectively learn such multi-modal distributions, naively sampling from these models is computationally inefficient, potentially requiring hundreds of samples to find low-probability modes that may still be operationally relevant. In this work, we address the challenge of sample-efficient ambiguous segmentation by evaluating several training-free sampling methods that encourage diverse predictions. We adapt two techniques, particle guidance and SPELL, originally designed for the generation of diverse natural images, to discrete segmentation tasks, and additionally propose a simple clustering-based technique. We validate these approaches on the LIDC medical dataset, a modified version of the Cityscapes dataset, and MMFire, a new simulation-based wildfire spread dataset introduced in this paper. Compared to naive sampling, these approaches increase the HM IoU* metric by up to 7.5% on MMFire and 16.4% on Cityscapes, demonstrating that training-free methods can be used to efficiently increase the sample diversity of segmentation diffusion models with little cost to image quality and runtime. Code and dataset: https://github.com/SebastianGer/wildfire-spread-scenarios
[189] CoVR-R: Reason-Aware Composed Video Retrieval
Omkar Thawakar,Dmitry Demidov,Vaishnav Potlapalli,Sai Prasanna Teja Reddy Bogireddy,Viswanatha Reddy Gajjala,Alaa Mostafa Lasheen,Rao Muhammad Anwer,Fahad Khan
Main category: cs.CV
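TL;DR: This paper proposes a reasoning-first, zero-shot method for composed video retrieval (CoVR) that uses large models to infer the causal and temporal after-effects implied by the edit text and aligns candidate videos without fine-tuning; it also builds a new benchmark, CoVR-Reason, to evaluate reasoning, and experiments show the method clearly outperforms baselines in implicit-effect scenarios.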
Details
Motivation: Existing CoVR methods assume the modification text fully describes the visual changes, overlooking implicit after-effects of the edit (such as motion, state transitions, and viewpoint or duration changes), whereas successful retrieval requires reasoning about them. Method: Proposes a reasoning-first, zero-shot method that (i) uses large multimodal models to infer the causal and temporal after-effects implied by the edit and (ii) aligns the reasoned queries with candidate videos without task-specific fine-tuning; also builds the CoVR-Reason benchmark, with structured reasoning traces and challenging distractors, for evaluation. Result: The zero-shot method outperforms strong baselines on Recall@K, especially on implicit-effect subsets; automatic and human analyses confirm higher step consistency and effect factuality in its retrieved results. Conclusion: Incorporating reasoning into general-purpose multimodal models effectively improves CoVR; explicitly modeling causal and temporal after-effects reduces dependence on task supervision and improves generalization and interpretability, giving a scalable, principled framework for explainable video search. Abstract: Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
[190] Deterministic Mode Proposals: An Efficient Alternative to Generative Sampling for Ambiguous Segmentation
Sebastian Gerard,Josephine Sullivan
Main category: cs.CV
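TL;DR: This paper proposes mode proposal models, a deterministic framework that efficiently generates multiple plausible outcomes (proposal masks) for segmentation tasks, avoiding the high computational cost of conventional generative models that depend on extensive sampling and post-hoc clustering; it introduces a confidence mechanism adapted to the space of segmentation masks to filter redundant proposals, supports training when only part of the outcome distribution is annotated, and outperforms existing methods in both ground-truth coverage and inference speed.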
Details
Motivation: Many segmentation tasks (e.g., medical image segmentation, future state prediction) are inherently ambiguous, with multiple equally correct predictions; current generative-model-based methods are computationally expensive and inefficient at identifying the modes of the distribution. Method: Proposes mode proposal models, a deterministic framework that outputs a fixed number of proposal masks in a single forward pass; adapts a confidence mechanism to the high-dimensional space of segmentation masks to suppress superfluous proposals; and decomposes the velocity field of a pre-trained flow model to estimate prior mode probabilities for the proposals. Result: Compared with existing generative models, substantially reduces inference time while increasing ground-truth coverage; can be trained without knowing the full distribution of outcomes, making it applicable to real-world datasets. Conclusion: Mode proposal models offer an efficient, practical, and scalable deterministic alternative for ambiguous segmentation, balancing performance, speed, and deployability. Abstract: Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions are equally correct. Current methods typically rely on generative models to capture this uncertainty. However, identifying the underlying modes of the distribution with these methods is computationally expensive, requiring large numbers of samples and post-hoc clustering. In this paper, we shift the focus from stochastic sampling to the direct generation of likely outcomes. We introduce mode proposal models, a deterministic framework that efficiently produces a fixed-size set of proposal masks in a single forward pass. To handle superfluous proposals, we adapt a confidence mechanism, traditionally used in object detection, to the high-dimensional space of segmentation masks. Our approach significantly reduces inference time while achieving higher ground-truth coverage than existing generative models. Furthermore, we demonstrate that our model can be trained without knowing the full distribution of outcomes, making it applicable to real-world datasets. Finally, we show that by decomposing the velocity field of a pre-trained flow model, we can efficiently estimate prior mode probabilities for our proposals.
[191] LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
Jiazheng Xing,Fei Du,Hangjie Yuan,Pengwei Liu,Hongbin Xu,Hai Ci,Ruigang Niu,Weihua Chen,Fan Wang,Yong Liu
Main category: cs.CV
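TL;DR: This paper proposes the LumosX framework, which builds a multimodal dataset oriented toward face-attribute alignment and introduces relational attention mechanisms to achieve fine-grained identity consistency and semantic alignment in personalized multi-subject video generation.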
Details
Motivation: Existing diffusion models struggle to keep face attributes consistent across multiple subjects in text-to-video generation, lacking both explicit modeling mechanisms and suitable data resources. Method: Proposes LumosX: on the data side, an annotation pipeline built on independent videos, with multimodal large language models extracting relational priors among subjects; on the model side, Relational Self-Attention and Relational Cross-Attention, which combine position-aware embeddings with refined attention dynamics. Result: Comprehensive evaluation on the authors' own benchmark shows that LumosX reaches SOTA performance in fine-grained, identity-consistent, and semantically aligned multi-subject video generation. Conclusion: By jointly improving data construction and model architecture, LumosX addresses face-attribute alignment in multi-subject video generation and offers a new paradigm for personalized video generation. Abstract: Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.
[192] From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
Xinyi Shang,Yi Tang,Jiacheng Cui,Ahmed Elhagry,Salwa K. Al Khatib,Sondos Mahmoud Bsharat,Jiacheng Liu,Xiaohan Zhao,Jing-Hao Xue,Hao Li,Salman Khan,Zhiqiang Shen
Main category: cs.CV
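TL;DR: This paper proposes a new pixel-level, semantics- and language-aware paradigm for image tampering detection, builds a new benchmark with per-pixel tamper maps and category annotations, and designs an evaluation framework covering localization accuracy, semantic classification, and natural language description, revealing severe biases in conventional mask-based benchmarks.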
Details
Motivation: Existing tampering detection benchmarks depend heavily on coarse object masks, which misrepresent the true edit signal: many pixels inside a mask are unmodified or only slightly changed, while subtle but important tampering outside the mask is ignored. Method: 1) Builds a fine-grained taxonomy covering edit primitives (e.g., replace, remove, splice) and their semantic classes; 2) releases the first benchmark with per-pixel tamper maps and corresponding semantic category annotations; 3) proposes a unified training and evaluation framework that combines pixel-level localization accuracy, edit-intensity confidence, semantic classification, and natural language description. Result: Re-evaluating mainstream tampering detectors on the new benchmark shows substantial over- and under-scoring by mask-based metrics, with serious failures on micro-edits and off-mask tampering; the proposed framework significantly improves localization accuracy, semantic understanding, and interpretability. Conclusion: The work moves tampering detection from coarse mask-based evaluation toward a pixel-level, semantic, and language-description-driven standard, establishing a more rigorous and realistic evaluation and modeling paradigm. Abstract: Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning- and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and their semantic class of tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness with localization to assess confidence or prediction on true edit intensity, and further measure tamper meaning understanding via semantics-aware classification and natural language descriptions for the predicted regions. We also re-evaluate the existing strong segmentation/localization baselines on recent strong tamper detectors and reveal substantial over- and under-scoring using mask-only metrics, and expose failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings and language descriptions, establishing a rigorous standard for tamper localization, semantic classification and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
[193] MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints
Yu Qi,Xinyi Xu,Ziyu Guo,Siyuan Ma,Renrui Zhang,Xinyan Chen,Ruichuan An,Ruofan Xing,Jiayi Zhang,Haojie Huang,Pheng-Ann Heng,Jonathan Tremblay,Lawson L. S. Wong
Main category: cs.CV
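TL;DR: This paper proposes MME-CoF-Pro, a video reasoning coherence benchmark for evaluating the causal consistency (i.e., reasoning coherence) of video generative models; it finds that current models have weak reasoning coherence that is unrelated to generation quality, that text hints readily induce hallucination, and that visual hints are more effective for structured perception tasks.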