Table of Contents
cs.CL
[1] From Chaos to Clarity: Schema-Constrained AI for Auditable Biomedical Evidence Extraction from Full-Text PDFs
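Pouria Mortezaagha, Joseph Shaw, Bowen Sun, Arya Rahgozar
Main category: cs.CL
TL;DR: Proposes a schema-constrained, provenance-aware AI system that automatically converts complex biomedical PDF articles into structured, analysis-ready data records, supporting auditable and scalable evidence synthesis.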
Details
Motivation: Biomedical evidence synthesis requires extracting key variables from full-text PDFs, but existing document AI systems are limited by OCR errors, difficulty handling long documents, low throughput, and poor auditability.
Method: A schema-constrained AI extraction system that combines typed schemas, controlled vocabularies, and evidence-gated decisions. Documents are ingested with resume-aware hashing, split into chunks, and processed asynchronously; study-level records are then produced through conflict-aware merging, set-based aggregation, and sentence-level provenance.
Result: Evaluated on a corpus of studies on direct oral anticoagulants, the system processed all documents fully automatically, maintained stable throughput under service limits, and showed high consistency across document chunks. Iterative schema refinement substantially improved extraction accuracy for key variables such as assay classification and outcome definitions.
Conclusion: Schema-constrained, provenance-aware extraction can structure heterogeneous scientific PDFs efficiently and reliably, meeting the transparency and reproducibility requirements of biomedical evidence synthesis.
Abstract: Biomedical evidence synthesis relies on accurate extraction of methodological, laboratory, and outcome variables from full-text research articles, yet these variables are embedded in complex scientific PDFs that make manual abstraction time-consuming and difficult to scale. Existing document AI systems remain limited by OCR errors, long-document fragmentation, constrained throughput, and insufficient auditability for high-stakes synthesis. We present a schema-constrained AI extraction system that transforms full-text biomedical PDFs into structured, analysis-ready records by explicitly restricting model inference through typed schemas, controlled vocabularies, and evidence-gated decisions. Documents are ingested using resume-aware hashing, partitioned into caption-aware page-level chunks, and processed asynchronously under explicit concurrency controls. Chunk-level outputs are deterministically merged into study-level records using conflict-aware consolidation, set-based aggregation, and sentence-level provenance to support traceability and post-hoc audit. Evaluated on a corpus of studies on direct oral anticoagulant level measurement, the pipeline processed all documents without manual intervention, maintained stable throughput under service constraints, and exhibited strong internal consistency across document chunks. Iterative schema refinement substantially improved extraction fidelity for synthesis-critical variables, including assay classification, outcome definitions, follow-up duration, and timing of measurement.
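The ingestion and merging steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the chunk layout (`page`, `fields`), and the rule of flagging a scalar field as a conflict whenever chunks disagree are all assumptions made for illustration.

```python
import hashlib

def content_hash(pdf_bytes: bytes) -> str:
    """Stable key for a document, derived from its bytes rather than its name."""
    return hashlib.sha256(pdf_bytes).hexdigest()

def pending(docs: dict, done: set) -> list:
    """Names of documents whose content hash has not been processed yet."""
    return [name for name, data in sorted(docs.items())
            if content_hash(data) not in done]

def merge_chunks(chunks: list) -> dict:
    """Merge chunk-level outputs into one study-level record (hypothetical layout).

    Each chunk is {"page": int, "fields": {name: (value, evidence_sentence)}}.
    List values are unioned; scalar values are grouped by value with their
    provenance, and a field with several distinct values is flagged as a
    conflict instead of being silently overwritten.
    """
    scalars, sets = {}, {}
    # Visit chunks in page order so the merged record does not depend on arrival order.
    for chunk in sorted(chunks, key=lambda c: c["page"]):
        for name, (value, sentence) in chunk["fields"].items():
            if isinstance(value, list):
                sets.setdefault(name, set()).update(value)
            else:
                scalars.setdefault(name, {}).setdefault(value, []).append(
                    (chunk["page"], sentence))
    record = {name: sorted(values) for name, values in sets.items()}
    for name, by_value in scalars.items():
        record[name] = {
            "value": next(iter(by_value)) if len(by_value) == 1 else None,
            "conflict": len(by_value) > 1,
            "provenance": by_value,
        }
    return record
```

Keeping every candidate value with its page and sentence, rather than picking a winner, is one way to leave the final decision open to a post-hoc audit.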
These results demonstrate that schema-constrained, provenance-aware extraction enables scalable and auditable transformation of heterogeneous scientific PDFs into structured evidence, aligning modern document AI with the transparency and reliability requirements of biomedical evidence synthesis.
[2] The Slow Drift of Support: Boundary Failures in Multi-Turn Mental Health LLM Dialogues
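Youyou Cheng, Zhuangwei Kang, Kerry Jiang, Chenyu Sun, Qiyang Pan
Main category: cs.CL
TL;DR: Proposes a multi-turn stress-testing framework for evaluating the safety of large language models (LLMs) in mental health support over long dialogues. It finds that single-turn checks fail to reflect real risk: boundary violations tend to emerge gradually over multi-turn interaction, and they appear sooner under adaptive probing.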
Details
Motivation: Existing safety evaluations are limited to detecting prohibited words in single-turn conversations, overlooking the gradual erosion of safety boundaries in multi-turn dialogues that is driven by the model's tendency toward empathy and comfort.
Method: Builds a multi-turn stress-testing framework with two pressure modes, static progression and adaptive probing, and tests three frontier LLMs on 50 virtual patient profiles through virtual psychiatric dialogues of up to 20 turns.
Result: Safety boundary violations were common across all models. The two pressure modes produced similar violation rates, but adaptive probing brought violations markedly earlier (the average turn fell from 9.21 to 4.64). The most common form of violation was making definitive or zero-risk promises.
Conclusion: The safety of LLMs in mental health support cannot be assessed with single-turn tests alone; evaluation must account for the sustained wear on safety boundaries caused by different interaction pressures in long dialogues.
Abstract: Large language models (LLMs) have been widely used for mental health support. However, current safety evaluations in this field are mostly limited to detecting whether LLMs output prohibited words in single-turn conversations, neglecting the gradual erosion of safety boundaries in long dialogues. Examples include making definitive guarantees, assuming responsibility, and playing professional roles. We believe that with the evolution of mainstream LLMs, words with obvious safety risks are easily filtered by their underlying systems, while the real danger lies in the gradual transgression of boundaries during multi-turn interactions, driven by the LLM's attempts at comfort and empathy. This paper proposes a multi-turn stress testing framework and conducts long-dialogue safety tests on three cutting-edge LLMs using two pressure methods: static progression and adaptive probing. We generated 50 virtual patient profiles and stress-tested each model through up to 20 rounds of virtual psychiatric dialogues. The experimental results show that violations are common, and both pressure modes produced similar violation rates. However, adaptive probing significantly advanced the time at which models crossed boundaries, reducing the average number of turns from 9.21 in static progression to 4.64. Under both mechanisms, making definitive or zero-risk promises was the primary way in which boundaries were breached.
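The two quantities reported above, the violation rate and the average turn at which a boundary is first crossed, can be computed as in the following minimal sketch. The abstract does not say how the average is taken; averaging only over dialogues that contain a violation is an assumption here, and the function names and per-turn flag format are hypothetical.

```python
from statistics import mean

def first_violation_turn(flags):
    """1-based index of the first flagged turn, or None if no turn is flagged."""
    for turn, violated in enumerate(flags, start=1):
        if violated:
            return turn
    return None

def summarize(dialogues):
    """Return (violation rate, mean first-violation turn).

    `dialogues` is a list of per-turn boolean flags, one list per dialogue.
    The mean is taken over violating dialogues only (an assumption).
    """
    firsts = [first_violation_turn(flags) for flags in dialogues]
    hits = [turn for turn in firsts if turn is not None]
    rate = len(hits) / len(dialogues)
    return rate, (mean(hits) if hits else None)
```

Under this definition two pressure modes can share the same violation rate while differing in how early the first violation occurs, which is the pattern the abstract reports.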
These findings suggest that the robustness of LLM safety boundaries cannot be inferred solely through single-turn tests; it is necessary to fully consider the wear and tear on safety boundaries caused by different interaction pressures and characteristics in extended dialogues.
[3] Opening the Black Box: A Survey on the Mechanisms of Multi-Step Reasoning in Large Language Models
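Liangming Pan, Jason Liang, Jiaran Ye, Minglai Yang, Xinyuan Lu, Fengbin Zhu
Main category: cs.CL
TL;DR: Surveys the internal mechanisms of multi-step reasoning in large language models, proposes a conceptual framework of seven research questions, and identifies five directions for future research.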
Details
Motivation: Existing work focuses mainly on engineering methods that improve performance, while the internal mechanisms behind multi-step reasoning in LLMs remain poorly understood.
Method: Systematically organizes the mechanisms of LLM multi-step reasoning around a conceptual framework of seven interconnected research questions.
Result: Summarizes how implicit multi-hop reasoning is carried out and how explicit reasoning reshapes internal computation, and identifies five directions for future study.
Conclusion: The survey offers a systematic view of the mechanisms of multi-step reasoning in LLMs and helps advance deeper study of how models work internally.
Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities to solve problems requiring multiple reasoning steps, yet the internal mechanisms enabling such capabilities remain elusive. Unlike existing surveys that primarily focus on engineering methods to enhance performance, this survey provides a comprehensive overview of the mechanisms underlying LLM multi-step reasoning. We organize the survey around a conceptual framework comprising seven interconnected research questions, from how LLMs execute implicit multi-hop reasoning within hidden activations to how verbalized explicit reasoning remodels the internal computation. Finally, we highlight five research directions for future mechanistic studies.
[4] Hallucination-Free Automatic Question & Answer Generation for Intuitive Learning
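Nicholas X. Wang, Aggelos K. Katsaggelos
Main category: cs.CL
TL;DR: Proposes a hallucination-free multi-agent framework for generating educational multiple-choice questions that substantially reduces LLM hallucinations through task decomposition, multi-stage verification, and agent-led refinement.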
Details
Motivation: LLMs are prone to hallucinations (such as factual errors and inconsistent reasoning) when automatically generating educational multiple-choice questions, which undermines content reliability; these errors need to be detected and reduced effectively.
Method: Splits MCQ generation into discrete, verifiable stages, uses rule-based and LLM-based detection agents together with hallucination scoring metrics, and applies counterfactual reasoning and chain-of-thought (CoT) for iterative refinement.
Result: On a sample of AP-aligned STEM questions, the hallucination rate fell by more than 90% relative to baseline generation, while question quality, answerability, and cost-efficiency were preserved.
Conclusion: Structured multi-agent collaboration can effectively mitigate LLM hallucinations in educational content generation, offering a viable path toward trustworthy AI teaching tools.
Abstract: Hallucinations in large language models (LLMs), defined as fluent yet incorrect or incoherent outputs, pose a significant challenge to the automatic generation of educational multiple-choice questions (MCQs). We identified four key hallucination types in MCQ generation: reasoning inconsistencies, insolvability, factual errors, and mathematical errors. To address this, we propose a hallucination-free multi-agent generation framework that breaks down MCQ generation into discrete, verifiable stages. Our framework utilizes both rule-based and LLM-based detection agents, as well as hallucination scoring metrics to optimize question quality. We redefined MCQ generation as an optimization task minimizing hallucination risk while maximizing validity, answerability, and cost-efficiency. We also introduce an agent-led refinement process that uses counterfactual reasoning and chain-of-thought (CoT) to iteratively reduce hallucinations in question generation. We evaluated a sample of AP-aligned STEM questions, where our system reduced hallucination rates by over 90% compared to baseline generation while preserving the educational value and style of questions. Our results demonstrate that structured multi-agent collaboration can mitigate hallucinations in educational content creation at scale, paving the way for more reliable LLM-powered learning tools.
[5] RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension
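Yelin Chen, Fanjin Zhang, Suping Sun, Yunhe Pang, Yuanchun Wang, Jian Song, Xiaoyan Li, Lei Hou, Shu Zhao, Jie Tang, Juanzi Li
Main category: cs.CL
TL;DR: Introduces RPC-Bench, a large-scale, fine-grained question-answering benchmark built from review-rebuttal exchanges of computer science papers, with 15K human-verified QA pairs for evaluating how well large models understand research papers. Using a newly designed taxonomy and an LLM-as-a-Judge evaluation framework, it finds that even the strongest current models (such as GPT-5) still fall short on the combined correctness-completeness and conciseness metrics.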
Details
Motivation: Existing benchmarks lack fine-grained, large-scale evaluation of foundation models' ability to understand research papers, and in particular struggle to cover specialized scientific discourse and complex figures and tables.
Method: Builds RPC-Bench by extracting and manually verifying 15K QA pairs from review-rebuttal exchanges of high-quality CS papers; designs a fine-grained question taxonomy (why/what/how) aligned with the research workflow; establishes an LLM-human collaborative annotation and quality-control framework; and evaluates correctness-completeness and conciseness under the LLM-as-a-Judge paradigm.
Result: The strongest model, GPT-5, reaches only 68.2% correctness-completeness, dropping sharply to 37.46% once the conciseness adjustment is applied, which reveals a significant weakness of current models in precise academic understanding.
Conclusion: RPC-Bench provides a more realistic and scalable benchmark for research-text understanding and highlights both the need for, and the difficulty of, improving models' deep understanding of academic contexts.
Abstract: Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine-grained evaluation at scale. To address this gap, we introduce RPC-Bench, a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers, containing 15K human-verified QA pairs. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts. We also define an elaborate LLM-human interaction annotation framework to support large-scale labeling and quality control. Following the LLM-as-a-Judge paradigm, we develop a scalable framework that evaluates models on correctness-completeness and conciseness, with high agreement to human judgment. Experiments reveal that even the strongest models (GPT-5) achieve only 68.2% correctness-completeness, dropping to 37.46% after conciseness adjustment, highlighting substantial gaps in precise academic paper understanding. Our code and data are available at https://rpc-bench.github.io/.
[6] Project Aletheia: Verifier-Guided Distillation of Backtracking for Small Language Models
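Aradhya Dixit, Tianxi Liang, Jai Telang
Main category: cs.CL
TL;DR: Proposes a verifier-guided distillation method that lets a small language model (7B) learn error detection and self-correction, with the goal of improving its performance on constraint-satisfaction problems.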
Details
Motivation: Small language models (SLMs) often fail on strict constraint-satisfaction tasks because their linear, overconfident reasoning paths cannot recover from early mistakes.
Method: Proposes Verifier-Guided Distillation, which trains a 7B model on verified reasoning traces that contain mistakes and self-corrections, so as to transfer the error-repair process rather than only final answers.
Result: The 7B model exhibits latent verification behavior: it occasionally stops reasoning, detects contradictions, and revises earlier assumptions.
Conclusion: Distilling from reasoning traces that include error correction lets small models acquire verification-like abilities, improving their robustness and accuracy on constraint-satisfaction tasks.
Abstract: Small Language Models (SLMs, under 10B parameters) are attractive for private, on-device deployment, yet they frequently fail on strict constraint-satisfaction problems due to linear, overconfident reasoning traces that do not recover from early mistakes. We introduce Verifier-Guided Distillation, a training protocol that transfers the process of error repair (explicit conflict detection and backtracking) rather than only correct final answers. By training a 7B model on verified reasoning traces that include mistakes and self-corrections, we show that latent verification behavior can emerge in small models, enabling them to occasionally stop, detect contradictions, and revise earlier assumptions.
[7] Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding
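Juncheng Wang, Zhe Hu, Chao Xu, Siyue Ren, Yuxiang Feng, Yang Liu, Baigui Sun, Shujun Wang
Main category: cs.CL
TL;DR: Proposes Plan-Critic, which exploits the global semantic information implicit in the prefixes of AR audio generation models to predict and steer toward high-quality generation paths early in decoding, markedly improving instruction following in text-to-audio generation.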
Details
Motivation: Autoregressive (AR) audio generation models produce temporally coherent audio but struggle to follow complex text prompts faithfully, especially descriptions with multiple events and sound sources. The authors find that early tokens implicitly encode global semantics (such as event count and sound-source category), which motivates modeling this implicit planning ability.
Method: Proposes Plan-Critic, a lightweight auxiliary model trained with an objective inspired by Generalized Advantage Estimation (GAE) to predict final instruction-following quality from partial generations. At inference time, its scores are used to screen candidate prefixes early, prune weak ones, and reallocate computation, yielding guided sampling.
Result: Plan-Critic-guided sampling improves CLAP score by up to 10 points over the AR baseline, setting a new state of the art for AR text-to-audio generation, at a computational cost comparable to standard best-of-N decoding.
Conclusion: Strictly autoregressive models have an implicit planning ability; adding a lightweight discriminative guidance module markedly strengthens global semantic alignment without changing the causal structure, bridging the gap between causal generation and overall semantic consistency.
Abstract: Autoregressive (AR) models excel at generating temporally coherent audio by producing tokens sequentially, yet they often falter in faithfully following complex textual prompts, especially those describing complex sound events. We uncover a surprising capability in AR audio generators: their early prefix tokens implicitly encode global semantic attributes of the final output, such as event count and sound-object category, revealing a form of implicit planning. Building on this insight, we propose Plan-Critic, a lightweight auxiliary model trained with a Generalized Advantage Estimation (GAE)-inspired objective to predict final instruction-following quality from partial generations. At inference time, Plan-Critic enables guided exploration: it evaluates candidate prefixes early, prunes low-fidelity trajectories, and reallocates computation to high-potential planning seeds. Our Plan-Critic-guided sampling achieves up to a 10-point improvement in CLAP score over the AR baseline, establishing a new state of the art in AR text-to-audio generation, while maintaining computational parity with standard best-of-N decoding. This work bridges the gap between causal generation and global semantic alignment, demonstrating that even strictly autoregressive models can plan ahead.
[8] Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis
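Thanathai Lertpetchpun, Yoonjeong Lee, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd, Shrikanth Narayanan
Main category: cs.CL
TL;DR: Studies how to achieve more controllable and interpretable accent control in text-to-speech (TTS) systems by combining linguistically motivated phonological rules with speaker embeddings, and introduces a new metric, the phoneme shift rate (PSR), to quantify their interaction. Rules improve accent authenticity, while embeddings interfere with the rules, revealing entanglement between accent and speaker identity.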
Details
Motivation: Current TTS systems rely on speaker embeddings to control accent, but embeddings also mix in timbre, emotion, and other information, so they lack interpretability and fine-grained control; accent needs to be disentangled from other speaker attributes.
Method: Using American and British English as a case study, models phonological rules (such as rhoticity, flapping, and vowel correspondences), applies the rules together with speaker embeddings in TTS synthesis, and proposes the phoneme shift rate (PSR) to quantify how much embeddings interfere with rule application.
Result: Combining rules with embeddings yields more authentic accents; speaker embeddings can attenuate or override the rules, confirming feature entanglement between accent and speaker identity; and PSR effectively measures the degree of this entanglement.
Conclusion: Linguistic rules are a key tool for improving the controllability and interpretability of accent, and they provide a new framework for evaluating the disentanglement of accent and speaker characteristics in speech generation.
Abstract: Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.
[9] Large Language Models for Large-Scale, Rigorous Qualitative Analysis in Applied Health Services Research
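Sasha Ronaghi, Emma-Louise Aveling, Maria Levis, Rachel Lauren Ross, Emily Alsentzer, Sara Singer
Main category: cs.CL
TL;DR: Proposes a model- and task-agnostic framework for human-LLM qualitative analysis and demonstrates its utility for qualitative synthesis and deductive coding in a multi-site diabetes care study, improving research efficiency and the timeliness of feedback to practice while preserving methodological rigor.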
Details
Motivation: Large language models (LLMs) could make qualitative analysis more efficient in large, multi-site health services research, but methodological guidance and empirical evidence of their effects are lacking.
Method: Develops a model- and task-agnostic framework for human-LLM qualitative analysis and applies it in a multi-site study of diabetes care at Federally Qualified Health Centers (FQHCs) to (1) qualitatively synthesize researcher-generated summaries into comparative feedback reports and (2) deductively code 167 interview transcripts to refine a practice-transformation intervention.
Result: LLM assistance enabled timely feedback to clinical practitioners and allowed large-scale qualitative data to inform theory building and practice improvement, supporting its feasibility and added value in applied health services research.
Conclusion: LLMs can substantively improve the efficiency and usefulness of qualitative research while preserving methodological rigor; the study offers transferable methodological guidance for continued innovation with LLMs in qualitative research.
Abstract: Large language models (LLMs) show promise for improving the efficiency of qualitative analysis in large, multi-site health-services research. Yet methodological guidance for LLM integration into qualitative analysis and evidence of their impact on real-world research methods and outcomes remain limited. We developed a model- and task-agnostic framework for designing human-LLM qualitative analysis methods to support diverse analytic aims. Within a multi-site study of diabetes care at Federally Qualified Health Centers (FQHCs), we leveraged the framework to implement human-LLM methods for (1) qualitative synthesis of researcher-generated summaries to produce comparative feedback reports and (2) deductive coding of 167 interview transcripts to refine a practice-transformation intervention. LLM assistance enabled timely feedback to practitioners and the incorporation of large-scale qualitative data to inform theory and practice changes. This work demonstrates how LLMs can be integrated into applied health-services research to enhance efficiency while preserving rigor, offering guidance for continued innovation with LLMs in qualitative research.
[10] Can LLM Reasoning Be Trusted? A Comparative Study: Using Human Benchmarking on Statistical Tasks
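Crish Nagarkar, Leonid Bogachev, Serge Sharoff
Main category: cs.CL
TL;DR: Investigates the ability of large language models (LLMs) to solve statistical tasks and to assess reasoning quality. After open-source LLMs are fine-tuned on a purpose-built dataset, their statistical reasoning is comparable to that of statistics students, and their self-evaluation outperforms traditional metrics, making them suitable for educational technology, statistical analysis assistance, and validation of research methodology.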
Details
Motivation: Although LLMs perform strongly on NLP tasks, their ability on moderately complex statistical tasks is unclear, and their capacity to assess reasoning quality has not been studied systematically.
Method: Fine-tunes several open-source LLMs on a specially constructed dataset of statistical tasks and compares model performance against human scores as a benchmark; also evaluates how well LLMs judge answer quality (including explanation and reasoning) and compares this with traditional metrics such as BLEU and BERTScore.
Result: Fine-tuned models reach the level of statistics students on advanced statistical tasks, with gains that vary by model architecture; LLM self-evaluation is markedly better than traditional automatic metrics.
Conclusion: With targeted fine-tuning, LLMs are practically useful for both statistical reasoning and self-evaluation, with potential applications in educational technology, statistical assistance, validation of research methodology, and quality control for data analysis.
Abstract: This paper investigates the ability of large language models (LLMs) to solve statistical tasks, as well as their capacity to assess the quality of reasoning. While state-of-the-art LLMs have demonstrated remarkable performance in a range of NLP tasks, their competence in addressing even moderately complex statistical challenges is not well understood. We have fine-tuned selected open-source LLMs on a specially developed dataset to enhance their statistical reasoning capabilities, and compared their performance with the human scores used as a benchmark. Our results show that the fine-tuned models achieve better performance on advanced statistical tasks, at a level comparable to that of a statistics student. Fine-tuning demonstrates architecture-dependent improvements, with some models showing significant performance gains, indicating clear potential for deployment in educational technology and statistical analysis assistance systems. We also show that LLMs themselves can be far better judges of answer quality (including explanation and reasoning assessment) than traditional metrics such as BLEU or BERTScore. This self-evaluation capability enables scalable automated assessment for statistical education platforms and quality assurance in automated analysis tools. Potential applications also include validation tools for research methodology in academic and industry settings, and quality control mechanisms for data analysis workflows.
[11] Business Logic-Driven Text-to-SQL Data Synthesis for Business Intelligence
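Jinhui Liu, Ximeng Zhang, Yanbo Ai, Zhou Yu
Main category: cs.CL
TL;DR: Proposes a business logic-driven data synthesis framework that generates Text-to-SQL evaluation data with high business realism and question-SQL alignment, and validates it on a Salesforce database.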
Details
Motivation: Private business intelligence (BI) settings lack realistic, domain-specific Text-to-SQL evaluation data, and existing synthesis methods fail to reflect real business logic and workflows.
Method: Proposes a business logic-driven data synthesis framework that generates data from business personas, work scenarios, and workflows, and introduces a business reasoning complexity control strategy to diversify the analytical reasoning steps.
Result: Experiments on a production-scale Salesforce database show that the synthesized data reaches 98.44% business realism, well above OmniSQL and SQL-Factory, with 98.59% question-SQL alignment. They also show that current state-of-the-art Text-to-SQL models reach only 42.86% execution accuracy on the most complex business queries.
Conclusion: The framework efficiently generates high-quality, highly realistic Text-to-SQL evaluation data, exposes the performance bottleneck of existing models in complex business scenarios, and provides a reliable benchmark for future research.
Abstract: Evaluating Text-to-SQL agents in private business intelligence (BI) settings is challenging due to the scarcity of realistic, domain-specific data. While synthetic evaluation data offers a scalable solution, existing generation methods fail to capture business realism, that is, whether questions reflect realistic business logic and workflows. We propose a Business Logic-Driven Data Synthesis framework that generates data grounded in business personas, work scenarios, and workflows. In addition, we improve the data quality by imposing a business reasoning complexity control strategy that diversifies the analytical reasoning steps required to answer the questions. Experiments on a production-scale Salesforce database show that our synthesized data achieves high business realism (98.44%), substantially outperforming OmniSQL (+19.5%) and SQL-Factory (+54.7%), while maintaining strong question-SQL alignment (98.59%). Our synthetic data also reveals that state-of-the-art Text-to-SQL models still have significant performance gaps, achieving only 42.86% execution accuracy on the most complex business queries.
[12] Towards Execution-Grounded Automated AI Research
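Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang, Tatsunori Hashimoto
Main category: cs.CL
TL;DR: Proposes an execution-grounded framework for automated AI research that builds an automated executor to test LLM-generated research ideas and validates it on LLM pre-training and post-training tasks. Evolutionary search clearly surpasses the baselines within a few iterations, whereas reinforcement learning suffers from mode collapse.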
Details
Motivation: Current large language models (LLMs) can generate plausible-looking ideas that often turn out to be ineffective; it is unclear whether execution feedback is feasible and whether LLMs can use it effectively to improve the quality of automated research.
Method: Builds an automated executor that implements LLM-generated ideas and runs large-scale parallel GPU experiments on two realistic research problems, pre-training and post-training; compares two ways of learning from execution feedback, execution-guided evolutionary search and reinforcement learning from execution reward.
Result: Within ten epochs, evolutionary search finds methods that beat the GRPO baseline (69.4% vs 48.0%) and the nanoGPT baseline (19.7 minutes vs 35.9 minutes). LLMs generate meaningful algorithmic ideas during search but tend to saturate early. Reinforcement learning improves the average reward but leads to mode collapse and fails to raise the performance upper bound.
Conclusion: Execution grounding is a feasible and efficient route to automated AI research, and evolutionary search suits this paradigm better than reinforcement learning; the study provides an empirical basis and key insights for future execution-driven AI for Science.
Abstract: Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausible-looking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is feasible and whether LLMs can learn from the execution feedback. To investigate these questions, we first build an automated executor to implement ideas and launch large-scale parallel GPU experiments to verify their effectiveness. We then convert two realistic research problems (LLM pre-training and post-training) into execution environments and demonstrate that our automated executor can implement a large fraction of the ideas sampled from frontier LLMs. We analyze two methods to learn from the execution feedback: evolutionary search and reinforcement learning. Execution-guided evolutionary search is sample-efficient: it finds a method that significantly outperforms the GRPO baseline (69.4% vs 48.0%) on post-training, and finds a pre-training recipe that outperforms the nanoGPT baseline (19.7 minutes vs 35.9 minutes) on pre-training, all within just ten search epochs. Frontier LLMs often generate meaningful algorithmic ideas during search, but they tend to saturate early and only occasionally exhibit scaling trends. Reinforcement learning from execution reward, on the other hand, suffers from mode collapse. It successfully improves the average reward of the ideator model but not the upper bound, due to models converging on simple ideas.
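As a rough illustration of what an execution-guided evolutionary loop looks like, here is a generic sketch. It is not the authors' system: `execute` stands in for their automated executor (which runs real GPU experiments), `mutate` stands in for sampling a revised idea from an LLM, and the population sizes are arbitrary assumptions.

```python
import random

def evolutionary_search(seed_ideas, execute, mutate, epochs=10, keep=2, children=3, rng=None):
    """Generic elitist evolutionary search driven by an execution score.

    execute(idea) -> float   hypothetical executor; higher is better
    mutate(idea, rng) -> idea   hypothetical idea generator
    Returns the (score, idea) pair with the highest score found.
    """
    rng = rng or random.Random(0)
    population = [(execute(idea), idea) for idea in seed_ideas]
    for _ in range(epochs):
        # Keep the best-scoring ideas as parents, so the best score never decreases.
        population.sort(key=lambda pair: pair[0], reverse=True)
        parents = population[:keep]
        offspring = [mutate(idea, rng) for _, idea in parents for _ in range(children)]
        # Every new idea is scored by actually executing it.
        population = parents + [(execute(idea), idea) for idea in offspring]
    return max(population, key=lambda pair: pair[0])

# Toy usage: an "idea" is a single number and the "executor" rewards values near 3.
best_score, best_idea = evolutionary_search(
    seed_ideas=[0.0, 10.0],
    execute=lambda x: -(x - 3.0) ** 2,
    mutate=lambda x, rng: x + rng.gauss(0.0, 0.5),
)
```

Because selection tracks the single best candidate, this kind of search optimizes the upper end of the score distribution directly, whereas the abstract reports that the reinforcement learning variant raised the average reward without raising the upper bound.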
We thoroughly analyze the executed ideas and training dynamics to facilitate future efforts towards execution-grounded automated AI research.
[13] Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models
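Brian Christian, Matan Mazor
Main category: cs.CL
TL;DR: Examines the limits of large language models (LLMs) in removing gender and race bias and sycophancy through counterfactual self-simulation, and proposes calling a "blinded replica" of the model itself (through its API) to reach fairer and more transparent decisions.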
Details
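Motivation: Fair decisions require ignoring irrelevant and potentially biasing information (such as gender or race), yet humans struggle to perform accurate counterfactual self-simulation, which leads to unintended bias. The authors ask whether LLMs share this problem and explore workable mitigation mechanisms.
Method: Experimentally evaluates how several prompting strategies (such as asking the model to "ignore" or "pretend not to know" sensitive information) affect LLM bias, and proposes a new method in which the model calls a copy of itself, via its API, with the sensitive information masked (a "blinded replica") to generate a comparison response that supports counterfactual reasoning and bias attribution.
Result: Standard prompting fails to remove bias and can even make it worse, whereas using the model's own API to obtain the blinded replica's response markedly improves decision fairness and can distinguish implicit bias from intentional bias.
Conclusion: LLMs, like humans, struggle to perform counterfactual self-simulation internally, but their ability to access their own API offers a unique advantage: by externalizing counterfactual cognition, they can achieve more reliable and more interpretable fairness improvements.
Abstract: Fair decisions require ignoring irrelevant, potentially biasing, information. To achieve this, decision-makers need to approximate what decision they would have made had they not known certain facts, such as the gender or race of a job candidate. This counterfactual self-simulation is notoriously hard for humans, leading to biased judgments even by well-meaning actors. Here we show that large language models (LLMs) suffer from similar limitations in their ability to approximate what decisions they would make under counterfactual knowledge in offsetting gender and race biases and overcoming sycophancy. We show that prompting models to ignore or pretend not to know biasing information fails to offset these biases and occasionally backfires. However, unlike humans, LLMs can be given access to a ground-truth model of their own counterfactual cognition: their own API. We show that this access to the responses of a blinded replica enables fairer decisions, while providing greater transparency to distinguish implicit from intentionally biased behavior.
[14] Rewarding How Models Think Pedagogically: Integrating Pedagogical Reasoning and Thinking Rewards for LLMs in Education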
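Unggi Lee, Jiyeong Bae, Jaehyeon Park, Haeun Park, Taejun Park, Younghoon Jeon, Sungmin Cho, Junbo Koh, Yeil Jeong, Gyeonggeon Lee
Main category: cs.CL
TL;DR: Proposes PedagogicalRL-Thinking, a framework that improves the teaching effectiveness of large language models in intelligent tutoring systems through reasoning prompts guided by educational theory and a reward mechanism aimed at the reasoning process.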
Details
Motivation: Existing research mostly optimizes the LLM's visible responses and neglects its internal reasoning process; methods that optimize LLMs specifically for educational settings are also lacking.
Method: Proposes the PedagogicalRL-Thinking framework with two core components: (1) Pedagogical Reasoning Prompting, which grounds reasoning prompts in educational theory, and (2) Thinking Reward, which evaluates and reinforces the pedagogical quality of reasoning traces.
Result: Theory-grounded prompting outperforms generic prompting; Thinking Reward works best when combined with pedagogical prompting; models trained only on mathematics tutoring dialogues improve on unseen educational benchmarks while retaining their factual knowledge; and reasoning traces show more systematic, more pedagogical, and more structured decision-making.
Conclusion: Extending pedagogical alignment to the LLM's reasoning process is both effective and necessary, and PedagogicalRL-Thinking opens a new path toward interpretable, controllable, and pedagogically effective LLMs in education.
Abstract: Large language models (LLMs) are increasingly deployed as intelligent tutoring systems, yet research on optimizing LLMs specifically for educational contexts remains limited. Recent works have proposed reinforcement learning approaches for training LLM tutors, but these methods focus solely on optimizing visible responses while neglecting the model's internal thinking process. We introduce PedagogicalRL-Thinking, a framework that extends pedagogical alignment to reasoning LLMs in education through two novel approaches: (1) Pedagogical Reasoning Prompting, which guides internal reasoning using domain-specific educational theory rather than generic instructions; and (2) Thinking Reward, which explicitly evaluates and reinforces the pedagogical quality of the model's reasoning traces. Our experiments reveal that domain-specific, theory-grounded prompting outperforms generic prompting, and that Thinking Reward is most effective when combined with pedagogical prompting. Furthermore, models trained only on mathematics tutoring dialogues show improved performance on educational benchmarks not seen during training, while preserving the base model's factual knowledge. Our quantitative and qualitative analyses reveal that pedagogical thinking reward produces systematic reasoning trace changes, with increased pedagogical reasoning and more structured instructional decision-making in the tutor's thinking process.
[15] Social Caption: Evaluating Social Understanding in Multimodal Models
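Bhaavanaa Thumu, Leena Mathur, Youssouf Kebe, Louis-Philippe Morency
Main category: cs.CL
TL;DR: Proposes Social Caption, a framework grounded in interaction theory that evaluates the social understanding of multimodal large language models (MLLMs) along three dimensions: Social Inference (SI), Holistic Social Analysis (HSA), and Directed Social Analysis (DSA). It examines how model scale, architectural design, and spoken context affect performance, and uses experiments with MLLM judges to advance automated evaluation.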
Details
Motivation: Improving MLLMs' understanding of human social interaction requires systematic evaluation of their social understanding, but existing methods lack a theoretical grounding and a multi-dimensional evaluation scheme.
Method: Proposes the Social Caption framework, builds an interaction-theory-based evaluation scheme covering SI, HSA, and DSA, runs experiments on MLLMs of different scales and architectures, and assesses model performance with both human and MLLM judges.
Result: Model scale, architectural design, and the inclusion of spoken context significantly affect social understanding performance, and MLLM judges can effectively support scaling up automated evaluation.
Conclusion: Social Caption provides a theory-driven, multi-dimensional framework for evaluating MLLMs' social understanding, reveals key influencing factors, and demonstrates the feasibility of automated evaluation with MLLMs.
Abstract: Social understanding abilities are crucial for multimodal large language models (MLLMs) to interpret human social interactions. We introduce Social Caption, a framework grounded in interaction theory to evaluate social understanding abilities of MLLMs along three dimensions: Social Inference (SI), the ability to make accurate inferences about interactions; Holistic Social Analysis (HSA), the ability to generate comprehensive descriptions of interactions; and Directed Social Analysis (DSA), the ability to extract relevant social information from interactions. We analyze factors influencing model performance in social understanding, such as scale, architectural design, and spoken context. Experiments with MLLM judges contribute insights about scaling automated evaluation of multimodal social understanding.
[16] SearchGym: Bootstrapping Real-World Search Agents via Cost-Effective and High-Fidelity Environment Simulation
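Xichen Zhang, Ziyi He, Yinghao Zhu, Sitong Wu, Shaozuo Yu, Meng Chu, Wenhu Zhang, Haoru Tan, Jiaya Jia
Main category: cs.CL
TL;DR: Proposes SearchGym, a high-fidelity simulation environment for training search agents, together with the curriculum learning method SearchGym-RL, enabling low-cost agent training with strong generalization.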
Details
Motivation: Training search agents currently means either paying the high cost of interacting with live APIs or using static data that distorts reward signals, which undermines training stability.
Method: Builds the SearchGym simulation environment, generating a verifiable knowledge graph and an aligned document corpus so that tasks are factually accurate, and designs the SearchGym-RL curriculum learning framework, which provides purified feedback to optimize policies progressively.
Result: Good Sim-to-Real transfer is confirmed on the Llama and Qwen model families; Qwen2.5-7B-Base surpasses the ASearcher baseline by an average relative margin of 10.6% across nine benchmarks.
Conclusion: High-fidelity simulation is an effective and scalable route to building capable, low-cost search agents.
Abstract: Search agents have emerged as a pivotal paradigm for solving open-ended, knowledge-intensive reasoning tasks. However, training these agents via Reinforcement Learning (RL) faces a critical dilemma: interacting with live commercial Web APIs is prohibitively expensive, while relying on static data snapshots often introduces noise due to data misalignment. This misalignment generates corrupted reward signals that destabilize training by penalizing correct reasoning or rewarding hallucination. To address this, we propose SearchGym, a simulation environment designed to bootstrap robust search agents. SearchGym employs a rigorous generative pipeline to construct a verifiable knowledge graph and an aligned document corpus, ensuring that every reasoning task is factually grounded and strictly solvable. Building on this controllable environment, we introduce SearchGym-RL, a curriculum learning methodology that progressively optimizes agent policies through purified feedback, evolving from basic interactions to complex, long-horizon planning. Extensive experiments across the Llama and Qwen families demonstrate strong Sim-to-Real generalization. Notably, our Qwen2.5-7B-Base model trained within SearchGym surpasses the web-enhanced ASearcher baseline across nine diverse benchmarks by an average relative margin of 10.6%. Our results validate that high-fidelity simulation serves as a scalable and highly cost-effective methodology for developing capable search agents.
[17] Say Anything but This: When Tokenizer Betrays Reasoning in LLMs
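Navid Ayoobi, Marcus I Armstrong, Arjun Mukherjee
Main category: cs.CL
TL;DR: Reveals a reasoning fragility in large language models (LLMs) caused by one-to-many mappings in subword tokenizers, proposes a tokenization-consistency probe task, finds many "phantom edits", attributes them to eight classes of systematic tokenizer artifacts, and calls for fixes at the tokenizer level.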
Details
Motivation: Modern subword tokenizers produce non-unique encodings (multiple token ID sequences map to the same string), so an LLM's internal representation can diverge from the text's semantics and cause reasoning failures that go unmeasured.
Method: Designs a tokenization-consistency probe task that requires the model to replace only a designated target word in context and leave everything else unchanged; across more than 11,000 trials on several open-source state-of-the-art LLMs, analyzes "phantom edits" in the outputs and traces them to their tokenizer origins.
Result: A notable share of outputs contains phantom edits, in which the model believes its reasoning is correct but the output is wrong because of defects in the tokenize-detokenize process (such as whitespace-boundary shifts and intra-word resegmentation); eight classes of systematic tokenizer artifacts are identified.
Conclusion: Some apparent reasoning deficiencies actually originate in the tokenizer layer, so tokenizer-level improvements should come before indiscriminately scaling up models and training data.
Abstract: Large language models (LLMs) reason over discrete token ID sequences, yet modern subword tokenizers routinely produce non-unique encodings: multiple token ID sequences can detokenize to identical surface strings. This representational mismatch creates an unmeasured fragility wherein reasoning processes can fail. LLMs may treat two internal representations as distinct "words" even when they are semantically identical at the text level. In this work, we show that tokenization can betray LLM reasoning through one-to-many token ID mappings. We introduce a tokenization-consistency probe that requires models to replace designated target words in context while leaving all other content unchanged. The task is intentionally simple at the surface level, enabling us to attribute failures to tokenizer-detokenizer artifacts rather than to knowledge gaps or parameter limitations. Through analysis of over 11,000 replacement trials across state-of-the-art open-source LLMs, we find that a non-trivial share of outputs exhibits phantom edits: cases where models operate under the illusion of correct reasoning, a phenomenon arising from tokenizer-induced representational defects. We further analyze these cases and provide a taxonomy of eight systematic tokenizer artifacts, including whitespace-boundary shifts and intra-word resegmentation.
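The non-unique encodings described above are easy to reproduce with a toy vocabulary. The sketch below is only an illustration: the five-entry vocabulary and the helper names are invented, and real subword tokenizers are far larger and usually emit one canonical sequence even though other sequences decode to the same string.

```python
def encodings(text, vocab):
    """All token ID sequences whose pieces concatenate to `text`."""
    if text == "":
        return [[]]
    results = []
    for piece, token_id in vocab.items():
        if text.startswith(piece):
            # Consume this piece, then enumerate encodings of the remainder.
            for rest in encodings(text[len(piece):], vocab):
                results.append([token_id] + rest)
    return results

def decode(ids, vocab):
    """Map token IDs back to their pieces and join them."""
    inverse = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(inverse[i] for i in ids)

# Hypothetical toy vocabulary: piece -> token ID.
VOCAB = {"un": 0, "do": 1, "undo": 2, "u": 3, "n": 4}

candidates = encodings("undo", VOCAB)
round_trips = [decode(ids, VOCAB) for ids in candidates]
```

In this toy setting `candidates` holds three different ID sequences and every entry of `round_trips` is the same string, so a model that compares token IDs rather than surface text could treat identical words as different.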
These findings indicate that part of the apparent reasoning deficiency originates in the tokenizer layer, motivating tokenizer-level remedies before incurring the cost of training ever-larger models on ever-larger corpora.
[18] AdaTIR: Adaptive Tool-Integrated Reasoning via Difficulty-Aware Policy Optimization
Zhaiyu Fang,Ruipeng Sun
Main category: cs.CL
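TL;DR: This paper proposes AdaTIR, a framework that internalizes reasoning through a difficulty-aware efficiency reward, cutting redundant tool calls while maintaining or improving accuracy, and introduces Clipped Advantage Shaping (CAS) to resolve a sign-reversal problem in the advantage function.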
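The sign-reversal problem named in the TL;DR can be made concrete with a small numeric sketch. The reward form, the `penalty` weight, the group normalization, and the clipping rule below are all assumptions chosen for illustration; the abstract does not give the actual CAS formula, so this should not be read as the paper's method.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std over one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # fall back to 1.0 when all rewards are equal
    return [(r - mu) / sigma for r in rewards]


def shaped_advantages(correct, tool_calls, penalty=0.3, floor=0.0):
    """Hypothetical clipping rule: a correct rollout never gets an advantage below `floor`."""
    # Assumed reward: 1 for a correct answer, minus a per-call tool penalty.
    rewards = [float(c) - penalty * t for c, t in zip(correct, tool_calls)]
    raw = group_advantages(rewards)
    clipped = [max(a, floor) if c else a for a, c in zip(raw, correct)]
    return raw, clipped


# Rollout 0 is correct but used four tool calls; rollouts 2 and 3 are wrong.
correct = [True, True, False, False]
tool_calls = [4, 0, 0, 0]
raw, clipped = shaped_advantages(correct, tool_calls)
print("raw:    ", [round(a, 3) for a in raw])
print("clipped:", [round(a, 3) for a in clipped])
```

In this toy group the tool penalty pushes the correct rollout 0 below the wrong rollouts, so its raw advantage is negative; the clipping step keeps efficiency from overriding correctness for that rollout.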
Details
Motivation: Existing Tool-Integrated Reasoning (TIR) agents tend toward cognitive offloading, calling external tools frequently even on simple tasks, and lack the adaptive intelligence to judge when a tool is needed. Method: Proposes the AdaTIR framework, which introduces a difficulty-aware efficiency reward to adjust the tool budget dynamically; designs Clipped Advantage Shaping (CAS) to resolve the sign-reversal problem in which tool penalties outweigh correctness rewards. Result: AdaTIR reduces tool calls by up to 97.6% on simple tasks and 28.2% on complex tasks while maintaining or improving accuracy; with tools disabled, it outperforms baselines by 4.8% on AIME 2024. Conclusion: True agentic intelligence requires the adaptiveness to judge when to use tools; AdaTIR shifts the paradigm from static tool invocation to difficulty-aware reasoning internalization, improving efficiency and generalization. Abstract: Tool-Integrated Reasoning (TIR) has significantly enhanced the capabilities of Large Language Models (LLMs), yet current agents tend to exhibit cognitive offloading, redundantly invoking external tools even for simple tasks. In this paper, we suggest that true agentic intelligence requires not just tool invocation, but the adaptive wisdom to discern when to use tools. We propose AdaTIR, a framework that shifts the paradigm from static tool invocation to difficulty-aware reasoning internalization. By introducing a difficulty-aware efficiency reward, AdaTIR dynamically adjusts tool budgets based on task complexity--internalizing reasoning for simple tasks while selectively invoking tools for complex tasks. Furthermore, we identify a sign reversal problem where tool penalties outweigh correctness rewards, mistakenly penalizing correct rollouts with negative advantages. To resolve this, we propose Clipped Advantage Shaping (CAS), which ensures that correctness remains the primary objective while using efficiency as a secondary constraint. Empirical results demonstrate that AdaTIR reduces tool calls by up to 97.6% on simple tasks and 28.2% on complex challenges while maintaining or enhancing accuracy. Notably, AdaTIR successfully internalizes reasoning, outperforming baselines by 4.8% on AIME 2024 even when tool access is strictly disabled.
[19] ClaimDB: A Fact Verification Benchmark over Large Structured Data
Michael Theologitis,Preetam Prabhu Srikar Dammu,Chirag Shah,Dan Suciu
Main category: cs.CL
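TL;DR: This paper introduces ClaimDB, the first fact-verification benchmark built on large-scale structured data, comprising 80 real-world databases across many domains; experiments show that current large language models perform poorly on the task, with no model exceeding 83% accuracy and serious weaknesses in abstention.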
Details
Motivation: Existing fact-verification benchmarks focus mainly on unstructured textual evidence, while verification grounded in large-scale structured data (such as millions of records composed across multiple tables) remains underexplored. Method: Builds the ClaimDB benchmark, covering 80 real-world multi-table databases with evidence drawn from compositions of millions of records; designs verification tasks that require logical reasoning through executable programs rather than text reading; systematically evaluates 30 state-of-the-art proprietary and open-source (below 70B) LLMs. Result: No model exceeds 83% accuracy and more than half fall below 55%; both closed- and open-source models struggle to abstain reliably (that is, to decline to answer when the evidence is insufficient). Conclusion: Current LLMs have limited ability in fact verification over large-scale structured data and in particular lack reliable uncertainty modeling and abstention, calling for new paradigms based on programmatic reasoning. Abstract: Despite substantial progress in fact-verification benchmarks, claims grounded in large-scale structured data remain underexplored. In this work, we introduce ClaimDB, the first fact-verification benchmark where the evidence for claims is derived from compositions of millions of records and multiple tables. ClaimDB consists of 80 unique real-life databases covering a wide range of domains, from governance and healthcare to media, education and the natural sciences. At this scale, verification approaches that rely on "reading" the evidence break down, forcing a timely shift toward reasoning in executable programs. We conduct extensive experiments with 30 state-of-the-art proprietary and open-source (below 70B) LLMs and find that none exceed 83% accuracy, with more than half below 55%. Our analysis also reveals that both closed- and open-source models struggle with abstention -- the ability to admit that there is no evidence to decide -- raising doubts about their reliability in high-stakes data analysis. We release the benchmark, code, and the LLM leaderboard at https://claimdb.github.io.
[20] DARL: Encouraging Diverse Answers for General Reasoning without Verifiers
Chongxuan Huang,Lei Lin,Xiaodong Shi,Wenping Hu,Ruiming Tang
Main category: cs.CL
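TL;DR: This paper proposes DARL, a framework that encourages diverse answers within a bounded deviation from the reference answer, improving the reasoning ability and output diversity of large language models on open-domain tasks without extra verifiers and remaining compatible with existing reinforcement learning methods.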
Details
Motivation: RLVR depends on domain-specific verifiers, and verifier-free extensions such as RLPR tend to overfit to reference answers, so they struggle to produce diverse yet plausible outputs; the limitation is most visible in open-ended tasks such as writing. Method: Proposes the DARL framework, which encourages the model to generate diverse answers within a controlled deviation range while preserving alignment with the reference answer; the method needs no domain-specific verifier and integrates seamlessly into existing general reinforcement learning pipelines. Result: Consistent gains in reasoning performance across 13 benchmarks; compared with RLPR, DARL gains 1.3 points on average on 6 reasoning benchmarks and 9.5 points on average on 7 general benchmarks. Conclusion: DARL is a simple, effective, verifier-free, and broadly compatible reinforcement learning framework that improves both the reasoning accuracy and the output diversity of large language models. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated promising gains in enhancing the reasoning capabilities of large language models. However, its dependence on domain-specific verifiers significantly restricts its applicability to open and general domains. Recent efforts such as RLPR have extended RLVR to general domains, enabling training on broader datasets and achieving improvements over RLVR. However, a notable limitation of these methods is their tendency to overfit to reference answers, which constrains the model's ability to generate diverse outputs. This limitation is particularly pronounced in open-ended tasks such as writing, where multiple plausible answers exist. To address this, we propose DARL, a simple yet effective reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference while preserving alignment with it. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers. Extensive experiments on thirteen benchmarks demonstrate consistent improvements in reasoning performance. Notably, DARL surpasses RLPR, achieving average gains of 1.3 points on six reasoning benchmarks and 9.5 points on seven general benchmarks, highlighting its effectiveness in improving both reasoning accuracy and output diversity.
[21] Typhoon OCR: Open Vision-Language Model For Thai Document Extraction
Surapon Nonesung,Natapong Nitarach,Teetouch Jaknamon,Pittawat Taveekitworachai,Kunat Pipatanakul
Main category: cs.CL
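TL;DR: This paper presents Typhoon OCR, an open vision-language model for Thai and English document extraction; with a multi-stage data construction pipeline and Thai-specific fine-tuning, it matches or exceeds large proprietary models on a range of Thai documents while staying lightweight and efficient at inference.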
Details
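Motivation: Existing vision-language models mainly target high-resource languages, while Thai poses extra challenges for document extraction because of its complex script, the absence of explicit word boundaries, and highly unstructured real-world documents; current open-source models are limited as a result. Method: Fine-tunes vision-language backbones on a purpose-built, Thai-focused training dataset; the dataset is built with a multi-stage pipeline that combines traditional OCR, VLM-based restructuring, and curated synthetic data; proposes a unified framework covering text transcription, layout reconstruction, and document-level structural consistency; the latest version, Typhoon OCR V1.5, emphasizes compactness and inference efficiency and reduces reliance on metadata. Result: Comprehensive evaluation on diverse Thai documents, including financial reports, government forms, books, infographics, and handwritten documents, shows that Typhoon OCR matches or exceeds larger proprietary models at substantially lower computational cost. Conclusion: Open vision-language OCR models can reach performance comparable to proprietary systems on accurate text extraction and layout reconstruction for Thai documents while remaining lightweight and easy to deploy. Abstract: Document extraction is a core component of digital workflows, yet existing vision-language models (VLMs) predominantly favor high-resource languages. Thai presents additional challenges due to script complexity from non-Latin letters, the absence of explicit word boundaries, and the prevalence of highly unstructured real-world documents, limiting the effectiveness of current open-source models. This paper presents Typhoon OCR, an open VLM for document extraction tailored for Thai and English. The model is fine-tuned from vision-language backbones using a Thai-focused training dataset. The dataset is developed using a multi-stage data construction pipeline that combines traditional OCR, VLM-based restructuring, and curated synthetic data. Typhoon OCR is a unified framework capable of text transcription, layout reconstruction, and document-level structural consistency. The latest iteration of our model, Typhoon OCR V1.5, is a compact and inference-efficient model designed to reduce reliance on metadata and simplify deployment. Comprehensive evaluations across diverse Thai document categories, including financial reports, government forms, books, infographics, and handwritten documents, show that Typhoon OCR achieves performance comparable to or exceeding larger frontier proprietary models, despite substantially lower computational cost.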
The results demonstrate that open vision-language OCR models can achieve accurate text extraction and layout reconstruction for Thai documents, reaching performance comparable to proprietary systems while remaining lightweight and deployable.
[22] Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
Yifan Wang,Shiyu Li,Peiming Li,Xiaochen Yang,Yang Tang,Zheng Wei
Main category: cs.CL
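TL;DR: This paper proposes Render-of-Thought (RoT), a framework that renders the textual steps of chain-of-thought (CoT) reasoning as images and uses the vision encoder of a vision-language model (VLM) as a semantic anchor to align the visual and textual spaces; without extra pre-training cost, it makes the reasoning process explicit and traceable and achieves 3-4x token compression with substantial inference acceleration.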
Details
Motivation: Chain-of-Thought (CoT) improves LLM reasoning, but its verbosity imposes high computational overhead; existing methods mostly align only final outcomes and lack supervision of the intermediate reasoning process, which makes the latent reasoning chain hard to analyze. Method: Proposes Render-of-Thought (RoT), which renders textual reasoning steps as images and uses the vision encoder of an off-the-shelf VLM as a semantic anchor to align the textual and visual embedding spaces, making the reasoning chain explicit in a plug-and-play way with no additional pre-training. Result: On mathematical and logical reasoning benchmarks, RoT achieves 3-4x token compression and substantial inference acceleration relative to explicit CoT while remaining competitive with other methods. Conclusion: RoT is the first framework to make the reasoning chain explicit and traceable by visualizing it, validating the feasibility and effectiveness of representing reasoning as images. Abstract: Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
[23] RECAP: Resistance Capture in Text-based Mental Health Counseling with Large Language Models
Anqi Li,Yuqian Chen,Yu Lu,Zhaoming Chen,Yuan Xie,Zhenzhong Lan
Main category: cs.CL
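TL;DR: This paper proposes a new framework, PsyFIRE, and a dataset, ClientResistance, for identifying fine-grained client resistance in Chinese text-based counseling, and develops the RECAP model for accurate detection with explanations, substantially outperforming existing methods.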
Details
Motivation: Existing NLP approaches to identifying client resistance in psychotherapy use oversimplified categories, ignore the sequential dynamics of interventions, and offer little interpretability. Method: Proposes the PsyFIRE framework, which defines 13 fine-grained resistance behaviors, and builds the ClientResistance corpus of 23,930 annotated utterances; on this basis, develops the two-stage RECAP framework, which detects and classifies resistance and provides explanations. Result: RECAP reaches 91.25% F1 for distinguishing collaboration from resistance and 66.58% macro-F1 for fine-grained classification, more than 20 points above prompt-based LLM baselines; its effectiveness is also checked on a separate dataset and in a pilot study with 62 counselors. Conclusion: RECAP identifies resistance in text-based counseling effectively and can improve counselors' understanding of client behavior and their intervention strategies, showing potential for practical use. Abstract: Recognizing and navigating client resistance is critical for effective mental health counseling, yet detecting such behaviors is particularly challenging in text-based interactions. Existing NLP approaches oversimplify resistance categories, ignore the sequential dynamics of therapeutic interventions, and offer limited interpretability. To address these limitations, we propose PsyFIRE, a theoretically grounded framework capturing 13 fine-grained resistance behaviors alongside collaborative interactions. Based on PsyFIRE, we construct the ClientResistance corpus with 23,930 annotated utterances from real-world Chinese text-based counseling, each supported by context-specific rationales. Leveraging this dataset, we develop RECAP, a two-stage framework that detects resistance and fine-grained resistance types with explanations. RECAP achieves 91.25% F1 for distinguishing collaboration and resistance and 66.58% macro-F1 for fine-grained resistance category classification, outperforming leading prompt-based LLM baselines by over 20 points. Applied to a separate counseling dataset and a pilot study with 62 counselors, RECAP reveals the prevalence of resistance and its negative impact on therapeutic relationships, and demonstrates its potential to improve counselors' understanding and intervention strategies.
[24] Comparative Study of Large Language Models on Chinese Film Script Continuation: An Empirical Analysis Based on GPT-5.2 and Qwen-Max
Yuxuan Cao,Zida Yang,Ye Wang
Main category: cs.CL
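TL;DR: This study builds the first Chinese film script continuation benchmark and evaluates GPT-5.2 and Qwen-Max on a culturally specific narrative task, finding that GPT-5.2 significantly outperforms Qwen-Max in structural preservation, overall quality, and composite score.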
Details
Motivation: As large language models are increasingly used for creative writing, their performance on culturally specific narrative tasks needs systematic evaluation, and a standardized benchmark for Chinese film scripts has been missing. Method: Builds a Chinese film script continuation benchmark of 53 classic films, using a "first half to second half" continuation paradigm with 3 samples per film; evaluates along multiple dimensions with ROUGE-L, structural similarity, and LLM-as-Judge scoring (DeepSeek-Reasoner). Result: GPT-5.2 significantly outperforms Qwen-Max in structural preservation (0.93 vs 0.75), overall quality (44.79 vs 25.72), and composite score (0.50 vs 0.39), with large effect sizes (d>0.8) for overall quality and composite score; Qwen-Max is marginally better only on ROUGE-L (0.2230 vs 0.2114). Conclusion: GPT-5.2 shows stronger character consistency, tone and style matching, and format preservation in Chinese film script continuation, while Qwen-Max suffers from unstable generation; the study provides a reproducible framework for evaluating LLMs on Chinese creative writing. Abstract: As large language models (LLMs) are increasingly applied to creative writing, their performance on culturally specific narrative tasks warrants systematic investigation. This study constructs the first Chinese film script continuation benchmark comprising 53 classic films, and designs a multi-dimensional evaluation framework comparing GPT-5.2 and Qwen-Max-Latest. Using a "first half to second half" continuation paradigm with 3 samples per film, we obtained 303 valid samples (GPT-5.2: 157, 98.7% validity; Qwen-Max: 146, 91.8% validity). Evaluation integrates ROUGE-L, Structural Similarity, and LLM-as-Judge scoring (DeepSeek-Reasoner). Statistical analysis of 144 paired samples reveals: Qwen-Max achieves marginally higher ROUGE-L (0.2230 vs 0.2114, d=-0.43); however, GPT-5.2 significantly outperforms in structural preservation (0.93 vs 0.75, d=0.46), overall quality (44.79 vs 25.72, d=1.04), and composite scores (0.50 vs 0.39, d=0.84). The overall quality effect size reaches large effect level (d>0.8). GPT-5.2 excels in character consistency, tone-style matching, and format preservation, while Qwen-Max shows deficiencies in generation stability. This study provides a reproducible framework for LLM evaluation in Chinese creative writing.
[25] HiNS: Hierarchical Negative Sampling for More Comprehensive Memory Retrieval Embedding Model
Motong Tian,Allen P. Wong,Mingjun Mao,Wangchunshu Zhou
Main category: cs.CL
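TL;DR: This paper proposes HiNS, a framework that models difficulty tiers of negative samples and uses negative ratios derived from conversational data to improve the retrieval accuracy and generalization of embedding models in memory-augmented language agents.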
Details
Motivation: Negative-sample construction in current embedding-model training ignores the hierarchical difficulty of negatives and their natural distribution in human-agent interaction, which makes it hard for models to learn fine-grained discrimination. Method: Proposes the HiNS framework, which explicitly models difficulty tiers of negative samples and introduces negative ratios estimated from real conversational data to train more robust embedding models. Result: On the LoCoMo and PERSONAMEM benchmarks, both the MemoryOS and Mem0 systems show significant gains in F1, BLEU-1, and total score. Conclusion: By constructing data in a way that better matches real interaction distributions, HiNS improves the fidelity and generalization of memory retrieval and gives memory-augmented agents a more reliable foundation. Abstract: Memory-augmented language agents rely on embedding models for effective memory retrieval. However, existing training data construction overlooks a critical limitation: the hierarchical difficulty of negative samples and their natural distribution in human-agent interactions. In practice, some negatives are semantically close distractors while others are trivially irrelevant, and natural dialogue exhibits structured proportions of these types. Current approaches using synthetic or uniformly sampled negatives fail to reflect this diversity, limiting embedding models' ability to learn nuanced discrimination essential for robust memory retrieval. In this work, we propose a principled data construction framework HiNS that explicitly models negative sample difficulty tiers and incorporates empirically grounded negative ratios derived from conversational data, enabling the training of embedding models with substantially improved retrieval fidelity and generalization in memory-intensive tasks. Experiments show significant improvements: on LoCoMo, F1/BLEU-1 gains of 3.27%/3.30% (MemoryOS) and 1.95%/1.78% (Mem0); on PERSONAMEM, total score improvements of 1.19% (MemoryOS) and 2.55% (Mem0).
[26] Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation
Rui Qi,Fengran Mo,Yufeng Chen,Xue Zhang,Shuo Wang,Hongliang Li,Jinan Xu,Meng Jiang,Jian-Yun Nie,Kaiyu Huang
Main category: cs.CL
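TL;DR: This paper proposes the LcRL framework, which uses language-coupled Group Relative Policy Optimization and an anti-consistency penalty to mitigate knowledge bias and conflict in multilingual retrieval-augmented generation.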
Details
Motivation: Existing MRAG methods handle semantically equivalent queries in different languages with a single retrieval pipeline, which tends to produce knowledge bias and conflict in multilingual settings. Method: Proposes LcRL, a multilingual search-augmented reinforcement learning framework that includes language-coupled Group Relative Policy Optimization, language-coupled group sampling (to reduce knowledge bias), and an auxiliary anti-consistency penalty in the reward model (to mitigate knowledge conflict). Result: Experiments show that LcRL achieves competitive performance across several practical scenarios, such as limited training data and retrieval over collections spanning many languages. Conclusion: LcRL effectively mitigates knowledge bias and conflict in multilingual RAG and improves robustness and applicability in complex multilingual environments. Abstract: Multilingual retrieval-augmented generation (MRAG) requires models to effectively acquire and integrate beneficial external knowledge from multilingual collections. However, most existing studies employ a unitive process where queries of equivalent semantics across different languages are processed through a single-turn retrieval and subsequent optimization. Such a "one-size-fits-all" strategy is often suboptimal in multilingual settings, as the models are prone to knowledge bias and conflict during the interaction with the search engine. To alleviate the issues, we propose LcRL, a multilingual search-augmented reinforcement learning framework that integrates a language-coupled Group Relative Policy Optimization into the policy and reward models. We adopt the language-coupled group sampling in the rollout module to reduce knowledge bias, and regularize an auxiliary anti-consistency penalty in the reward models to mitigate the knowledge conflict. Experimental results demonstrate that LcRL not only achieves competitive performance but is also appropriate for various practical scenarios such as constrained training data and retrieval over collections encompassing a large number of languages. Our code is available at https://github.com/Cherry-qwq/LcRL-Open.
[27] PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation
Chenning Xu,Mao Zheng,Mingyu Zheng,Mingyang Song
Main category: cs.CL
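TL;DR: This paper introduces PodBench, a benchmark for evaluating podcast script generation with 800 samples and complex multi-speaker instructions, together with a multifaceted evaluation framework that combines quantitative constraints with LLM-based quality assessment.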
Details
Motivation: Podcast script generation requires synthesizing structured, context-grounded dialogue from diverse inputs, but systematic evaluation resources are lacking, so a dedicated benchmark and evaluation method are needed to advance the field. Method: Builds PodBench, a benchmark of 800 samples with inputs of up to 21K tokens and complex multi-speaker instructions; designs a multifaceted evaluation framework that integrates quantitative constraints with LLM-based quality assessment; compares proprietary and open-source models on long-context and multi-speaker coordination tasks. Result: Proprietary models perform better overall, but open-source models with explicit reasoning are more robust on long contexts and multi-speaker coordination than standard baselines; high instruction following does not guarantee high-quality content. Conclusion: PodBench offers a reproducible testbed for long-form, audio-centric generation and exposes the gap between instruction following and content substance in current models. Abstract: Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, yet systematic evaluation resources for this task remain limited. To bridge this gap, we introduce PodBench, a benchmark comprising 800 samples with inputs up to 21K tokens and complex multi-speaker instructions. We propose a multifaceted evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Extensive experiments reveal that while proprietary models generally excel, open-source models equipped with explicit reasoning demonstrate superior robustness in handling long contexts and multi-speaker coordination compared to standard baselines. However, our analysis uncovers a persistent divergence where high instruction following does not guarantee high content substance. PodBench offers a reproducible testbed to address these challenges in long-form, audio-centric generation.
[28] CodeDelegator: Mitigating Context Pollution via Role Separation in Code-as-Action Agents
Tianxiang Fei,Cheng Chen,Yue Pan,Mao Zheng,Mingyang Song
Main category: cs.CL
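TL;DR: CodeDelegator is a multi-agent framework that decouples task planning from code implementation through role separation, improving performance on complex tasks.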
Details
Motivation: A single agent that handles both planning and implementation is prone to context pollution from debugging traces and intermediate failures on long-horizon tasks, which hurts performance. Method: Proposes the CodeDelegator framework, consisting of a persistent Delegator responsible for strategic planning and Coder agents that are created on demand and execute sub-tasks independently; introduces the EPSS mechanism to separate ephemeral from persistent state and avoid context pollution. Result: Experiments on multiple benchmarks support the effectiveness of CodeDelegator, which the summary describes as completing tasks more reliably than single-agent approaches. Conclusion: Through role specialization and state isolation, CodeDelegator addresses context pollution in long-horizon tasks and improves overall performance on multi-step code generation. Abstract: Recent advances in large language models (LLMs) allow agents to represent actions as executable code, offering greater expressivity than traditional tool-calling. However, real-world tasks often demand both strategic planning and detailed implementation. Using a single agent for both leads to context pollution from debugging traces and intermediate failures, impairing long-horizon performance. We propose CodeDelegator, a multi-agent framework that separates planning from implementation via role specialization. A persistent Delegator maintains strategic oversight by decomposing tasks, writing specifications, and monitoring progress without executing code. For each sub-task, a new Coder agent is instantiated with a clean context containing only its specification, shielding it from prior failures. To coordinate between agents, we introduce Ephemeral-Persistent State Separation (EPSS), which isolates each Coder's execution state while preserving global coherence, preventing debugging traces from polluting the Delegator's context. Experiments on various benchmarks demonstrate the effectiveness of CodeDelegator across diverse scenarios.
[29] The GDN-CC Dataset: Automatic Corpus Clarification for AI-enhanced Democratic Citizen Consultations
Pierre-Antoine Lequeu,Léo Labat,Laurène Cave,Gaël Lejeune,François Yvon,Benjamin Piwowarski
Main category: cs.CL
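TL;DR: This paper proposes Corpus Clarification, a preprocessing framework that turns large-scale citizen consultation data into structured, self-contained argumentative units, builds the manually annotated GDN-CC dataset and the automatically annotated GDN-CC-large dataset, and shows that small open-source language models can match or outperform large models on the task.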
Details
Motivation: To address the ethical concerns raised by using LLMs to analyze texts from democratic activities (such as online deliberations and citizen consultations), while standardizing citizen contributions at the pragmatic level so they are easier to use in topic modeling and political analysis. Method: Proposes the Corpus Clarification preprocessing framework; builds the manually annotated GDN-CC dataset (1,231 contributions to the French Grand Débat National, comprising 2,285 argumentative units); fine-tunes small open-source language models for argumentative structure identification and clarification; produces the large automatically annotated dataset GDN-CC-large (240k contributions). Result: The fine-tuned small language models match or exceed large language models at reproducing the human annotations and prove usable in an opinion clustering task; GDN-CC-large is released as the largest annotated democratic consultation dataset to date. Conclusion: Small, open-source, locally runnable language models are sufficient for pragmatic standardization of democratic consultation texts, combining transparency, accessibility, and practicality, and offer a viable path to responsible use of AI in public-affairs analysis. Abstract: LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised for their usage as analysis tools. We continue this line of research with two main goals: (a) to develop resources that can help standardize citizen contributions in public forums at the pragmatic level, and make them easier to use in topic modeling and political analysis; (b) to study how well this standardization can reliably be performed by small, open-weights LLMs, i.e., models that can be run locally and transparently with limited resources. Accordingly, we introduce Corpus Clarification as a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present GDN-CC, a manually-curated dataset of 1,231 contributions to the French Grand Débat National, comprising 2,285 argumentative units annotated for argumentative structure and manually clarified. We then show that fine-tuned Small Language Models match or outperform LLMs on reproducing these annotations, and measure their usability for an opinion clustering task. We finally release GDN-CC-large, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.
[30] CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning
Zhiyuan Lu,Chenliang Li,Yingcheng Shi,Weizhou Shen,Ming Yan,Fei Huang
Main category: cs.CL
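TL;DR: This paper introduces CorpusQA, a benchmark for evaluating the global reasoning ability of large language models over large document repositories; it exposes the limits of long-context models and retrieval-augmented generation, and finds that memory-augmented agentic architectures are a more robust alternative.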
Details
Motivation: Existing benchmarks cannot adequately evaluate global reasoning over large document repositories because they are usually limited to a single long text or rely on a sparse-retrieval assumption, whereas in real scenarios the evidence is dispersed and requires global integration and statistical aggregation. Method: Proposes the CorpusQA benchmark, with datasets of up to 10 million tokens, built with a novel data synthesis framework that generates complex, computation-intensive queries with programmatically guaranteed ground-truth answers; also verifies that fine-tuning on the synthesized data improves an LLM's long-context reasoning. Result: Even state-of-the-art long-context LLMs degrade markedly as input length grows, and standard retrieval-augmented generation systems fail entirely, whereas memory-augmented agentic architectures are more robust. Conclusion: Simply extending the context window is not enough for reasoning over large corpora; advanced architectures that support global information synthesis are needed. Abstract: While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a "sparse retrieval" assumption, namely that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this critical gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast, unstructured text without relying on fallible human annotation. We further demonstrate the utility of our framework beyond evaluation, showing that fine-tuning on our synthesized data effectively enhances an LLM's general long-context reasoning capabilities. Extensive experiments reveal that even state-of-the-art long-context LLMs struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely.
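The idea of programmatically guaranteed ground truth can be sketched in a few lines. This is only an illustration under assumed details: the record schema, the text template, and the question are hypothetical and are not taken from the CorpusQA pipeline. The point is that the answer is computed from structured records, while a system under test would see only the rendered text.

```python
import random


def make_corpus(n_docs, seed=0):
    """Generate hypothetical structured records and render each as a short text document."""
    rng = random.Random(seed)
    records = [
        {
            "company": f"Company{i}",
            "year": rng.choice([2021, 2022, 2023]),
            "revenue": rng.randint(10, 500),
        }
        for i in range(n_docs)
    ]
    docs = [
        f"{r['company']} reported revenue of {r['revenue']} million in fiscal year {r['year']}."
        for r in records
    ]
    return records, docs


def ground_truth_mean_revenue(records, year):
    """The answer comes from the structured records, never from the rendered text."""
    values = [r["revenue"] for r in records if r["year"] == year]
    return sum(values) / len(values) if values else None


records, docs = make_corpus(200)
question = "What was the average reported revenue in fiscal year 2022?"
answer = ground_truth_mean_revenue(records, 2022)
print(question, "->", answer)
```

Answering this question from `docs` alone requires touching every document, which is the dispersed-evidence setting where a sparse-retrieval assumption breaks down.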
Our findings indicate that memory-augmented agentic architectures offer a more robust alternative, suggesting a critical shift is needed from simply extending context windows to developing advanced architectures for global information synthesis.
[31] A Comprehensive Benchmark of Language Models on Unicode and Romanized Sinhala
Minuri Rajapakse,Ruvan Weerasinghe
Main category: cs.CL
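TL;DR: This paper benchmarks modern language models on Unicode and Romanized Sinhala, finding that Mistral-Nemo-Base-2407 and Mistral-7B-v0.3 perform best on the two scripts respectively and that Llama-3.1-8B is strong on both; among closed-source models, Gemini-1.5-pro and DeepSeek excel at Unicode generation, while Claude-3.5-Sonnet is better on Romanized text.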
Details
Motivation: The performance of language models on low-resource, morphologically rich Sinhala, especially the Romanized form common in digital communication, remains under-explored and needs a systematic benchmark. Method: Builds a diverse corpus covering Unicode and Romanized Sinhala; evaluates open-source models by perplexity and leading closed-source models through a qualitative analysis of sentence completion. Result: Mistral-Nemo-Base-2407 has the strongest predictive performance on Unicode text and Mistral-7B-v0.3 on Romanized text; Llama-3.1-8B is solid on both scripts; Gemini-1.5-pro and DeepSeek stand out at Unicode generation, while Claude-3.5-Sonnet is better at handling Romanized text. Conclusion: Model choice should match the target script, and training data plays a critical role in handling script variations; the study offers practical guidance for selecting models for Sinhala applications. Abstract: The performance of Language Models (LMs) on lower-resource, morphologically rich languages like Sinhala remains under-explored, particularly for Romanized Sinhala, which is prevalent in digital communication. This paper presents a comprehensive benchmark of modern LMs on a diverse corpus of Unicode and Romanized Sinhala. We evaluate open-source models using perplexity, a measure of how well a model predicts a text, and leading closed-source models via a qualitative analysis of sentence completion. Our findings reveal that the Mistral-Nemo-Base-2407 model achieves the strongest predictive performance on Unicode text and the Mistral-7B-v0.3 model for Romanized text. The results also highlight the strong all-around performance of the Llama-3.1-8B model for both scripts. Furthermore, a significant performance disparity exists among closed-source models: Gemini-1.5-pro and DeepSeek excel at Unicode generation, whereas Claude-3.5-Sonnet is superior at handling Romanized text. These results provide an essential guide for practitioners selecting models for Sinhala-specific applications and highlight the critical role of training data in handling script variations.
[32] Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora
Chaymaa Abbas,Nour Shamaa,Mariette Awad
Main category: cs.CL
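TL;DR: This paper studies data contamination in multilingual settings, in particular how Arabic data affects English benchmarks, and proposes Translation-Aware Contamination Detection to make large language model evaluation fairer and more reliable.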
Details
Motivation: Existing contamination detection methods mainly target English benchmarks; multilingual contamination is poorly understood, and a systematic analysis of cross-lingual contamination mechanisms is missing. Method: Fine-tunes several open-weight LLMs on varying proportions of Arabic datasets and evaluates them on the original English benchmarks; extends the Tested Slot Guessing method with a choice-reordering strategy and Min-K% probability analysis; proposes Translation-Aware Contamination Detection, which identifies contamination by comparing signals across multiple translated variants. Result: Translation into Arabic suppresses conventional contamination indicators, yet models still benefit from the contaminated data, as shown by rising Min-K% scores and increased cross-lingual answer consistency; the proposed translation-aware method reliably exposes contamination that English-only methods miss. Conclusion: Multilingual, translation-aware evaluation pipelines are essential for fair, transparent, and reproducible LLM evaluation. Abstract: Data contamination undermines the validity of Large Language Model evaluation by enabling models to rely on memorized benchmark content rather than true generalization. While prior work has proposed contamination detection methods, these approaches are largely limited to English benchmarks, leaving multilingual contamination poorly understood. In this work, we investigate contamination dynamics in multilingual settings by fine-tuning several open-weight LLMs on varying proportions of Arabic datasets and evaluating them on original English benchmarks. To detect memorization, we extend the Tested Slot Guessing method with a choice-reordering strategy and incorporate Min-K% probability analysis, capturing both behavioral and distributional contamination signals. Our results show that translation into Arabic suppresses conventional contamination indicators, yet models still benefit from exposure to contaminated data, particularly those with stronger Arabic capabilities. This effect is consistently reflected in rising Min-K% scores and increased cross-lingual answer consistency as contamination levels grow. To address this blind spot, we propose Translation-Aware Contamination Detection, which identifies contamination by comparing signals across multiple translated benchmark variants rather than English alone. Translation-Aware Contamination Detection reliably exposes contamination even when English-only methods fail.
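For readers unfamiliar with the Min-K% probability signal mentioned above, here is a minimal sketch of the usual definition: the average log-probability of the k% least likely tokens in a sequence, where a higher (less negative) value suggests the text may have been seen in training. The log-probability values are made up for illustration, and the choice `k=20` is an assumption rather than the setting used in the paper.

```python
import math


def min_k_percent_score(token_logprobs, k=20.0):
    """Average of the lowest k% token log-probabilities of a sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Always keep at least one token, even for very short sequences.
    n = max(1, math.ceil(len(token_logprobs) * k / 100.0))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical per-token log-probabilities for two ten-token sequences.
seen = [-0.1, -0.2, -0.05, -0.3, -0.15, -0.1, -0.25, -0.2, -0.1, -0.05]
unseen = [-0.1, -4.2, -0.05, -6.0, -0.15, -0.1, -3.1, -0.2, -0.1, -0.05]

score_seen = min_k_percent_score(seen)
score_unseen = min_k_percent_score(unseen)
print(score_seen, score_unseen)
```

A sequence with no surprising tokens scores higher than one containing a few very unlikely tokens, which is why rising scores are read as a memorization signal.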
Together, our findings highlight the need for multilingual, translation-aware evaluation pipelines to ensure fair, transparent, and reproducible assessment of LLMs.
[33] Knowledge Restoration-driven Prompt Optimization: Unlocking LLM Potential for Open-Domain Relational Triplet Extraction
Xiaonan Jing,Gongqing Wu,Xingrui Zhuo,Lang Sun,Jiapu Wang
Main category: cs.CL
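TL;DR: Proposes KRPO, a knowledge-restoration-driven prompt optimization framework that improves large language models on open-domain relational triplet extraction by optimizing prompts with a self-evaluation mechanism and textual gradients, yielding significantly higher F1 scores.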
Details
Motivation: Existing methods rely on static, heuristic prompting strategies and lack a reflection mechanism for error signals, so under semantic ambiguity they tend to lock in erroneous extraction patterns. Method: Designs a self-evaluation mechanism based on knowledge restoration that maps structured triplets to semantic consistency scores as intrinsic feedback; proposes a textual-gradient prompt optimizer that refines prompts iteratively; builds a relation canonicalization memory to reduce relation redundancy. Result: Experiments on three datasets show that KRPO significantly outperforms strong baselines in extraction F1. Conclusion: The KRPO framework helps large language models keep improving their extraction ability on complex open-domain relation extraction tasks. Abstract: Open-domain Relational Triplet Extraction (ORTE) is the foundation for mining structured knowledge without predefined schemas. Despite the impressive in-context learning capabilities of Large Language Models (LLMs), existing methods are hindered by their reliance on static, heuristic-driven prompting strategies. Due to the lack of reflection mechanisms required to internalize erroneous signals, these methods are vulnerable to semantic ambiguity, often making erroneous extraction patterns permanent. To address this bottleneck, we propose a Knowledge Reconstruction-driven Prompt Optimization (KRPO) framework to assist LLMs in continuously improving their extraction capabilities for complex ORTE task flows. Specifically, we design a self-evaluation mechanism based on knowledge restoration, which provides intrinsic feedback signals by projecting structured triplets into semantic consistency scores. Subsequently, we propose a prompt optimizer based on a textual gradient that can internalize historical experiences to iteratively optimize prompts, which can better guide LLMs to handle subsequent extraction tasks. Furthermore, to alleviate relation redundancy, we design a relation canonicalization memory that collects representative relations and provides semantically distinct schemas for the triplets. Extensive experiments across three datasets show that KRPO significantly outperforms strong baselines in the extraction F1 score.
[34] LogicScore: Fine-grained Logic Evaluation of Conciseness, Completeness, and Determinateness in Attributed Question Answering
Zhichao Yan,Yunxiao Zhao,Jiapu Wang,Jiaoyan Chen,Shaoru Guo,Xiaoli Li,Ru Li,Jeff Z. Pan
Main category: cs.CL
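TL;DR: This paper proposes LogicScore, a framework that uses a backward verification mechanism grounded in Horn rules to evaluate the global logical coherence of long-form answers along completeness, conciseness, and determinateness, showing that current LLMs attribute facts accurately but reason weakly.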
Details
Motivation: Existing AQA evaluation methods suffer from "attribution myopia": they check the attribution of individual statements but ignore the overall logical coherence of long answers, so LLMs produce responses that look grounded yet are logically broken. Method: Proposes the LogicScore evaluation framework, which builds a backward verification mechanism on Horn rules to measure answers along three dimensions: completeness (logically sound deduction), conciseness (no redundancy), and determinateness (the answer is consistently entailed). Result: Experiments on three multi-hop datasets (HotpotQA, MusiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5 and Gemini-3-Pro) show high attribution precision (for example, 92.85% for Gemini-3-Pro) but poor logical quality (for example, only 35.11% Conciseness for the same model). Conclusion: Logical coherence deserves as much attention as factual attribution; LogicScore offers a robust evaluation standard for global reasoning quality in AQA. Abstract: Current evaluation methods for Attributed Question Answering (AQA) suffer from attribution myopia: they emphasize verification of isolated statements and their attributions but overlook the global logical integrity of long-form answers. Consequently, Large Language Models (LLMs) often produce factually grounded yet logically incoherent responses with elusive deductive gaps. To mitigate this limitation, we present LogicScore, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Grounded in Horn Rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: Completeness (logically sound deduction), Conciseness (non-redundancy), and Determinateness (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MusiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5, Gemini-3-Pro, LLaMA3, and task-specific tuned models) reveal a critical capability gap: leading models often achieve high attribution scores (e.g., 92.85% precision for Gemini-3 Pro) but struggle with global reasoning quality (e.g., 35.11% Conciseness for Gemini-3 Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development. Codes are available at: https://github.com/zhichaoyan11/LogicScore.
[35] Multi-Agent Constraint Factorization Reveals Latent Invariant Solution Structure
Christopher Scofield
Main category: cs.CL
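TL;DR: Using operator theory and constrained optimization, this paper explains why multi-agent systems (MAS) improve problem-solving performance even when all agents process identical information. Each agent is modeled as enforcing distinct validity constraints on a shared solution state; the MAS implements a factorized composition of constraint operators that converges to the invariant solution set defined by the intersection of the agents' constraint sets, a structure that a single agent generally cannot reach.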
Details
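Motivation: Although the large language models in a multi-agent system share the same information, their collaboration shows stronger problem-solving ability; this paper seeks a formal explanation for that phenomenon. Method: Within an operator-theoretic and constrained-optimization framework, models each agent as an operator enforcing specific validity constraints, analyzes the factorized composition dynamics of these operators in a multi-agent system, and extends the analysis to soft constraints via proximal operators. Result: Shows that, under mild conditions, the multi-agent dynamics converge to an invariant solution set determined by the intersection of the agents' constraint sets, and that this set is generally not reachable by a single agent applying all constraints simultaneously; the framework is also applied to text-based dialog systems. Conclusion: The performance gain of multi-agent systems comes from the interaction and factorized dynamics of the agents' constraint structures, not from additional information or expressive capacity, which exposes a formal mechanism behind collaborative reasoning. Abstract: Multi-agent systems (MAS) composed of large language models often exhibit improved problem-solving performance despite operating on identical information. In this work, we provide a formal explanation for this phenomenon grounded in operator theory and constrained optimization. We model each agent as enforcing a distinct family of validity constraints on a shared solution state, and show that a MAS implements a factorized composition of constraint-enforcement operators. Under mild conditions, these dynamics converge to invariant solution sets defined by the intersection of agent constraint sets. Such invariant structures are generally not dynamically accessible to a single agent applying all constraints simultaneously, even when expressive capacity and information are identical. We extend this result from exact constraint enforcement to soft constraints via proximal operators, and apply the formalism to contemporary text-based dialog systems.
[36] Circadian Modulation of Semantic Exploration in Social Media Language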
Vuong Hung Truong,Mariana Gabrielle Cangco Reyes,Masatoshi Koizumi,Jihwan Myung
Main category: cs.CL
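TL;DR: Using large-scale Reddit data, this study embeds text with a pretrained transformer model to quantify time-of-day variation in language use, finding that semantic exploration shows a robust circadian rhythm that may be entrained by seasonal light cues.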
Details
Motivation: Human cognition shows pronounced circadian rhythms, but their effect on high-dimensional semantic behavior is unclear, so the study asks whether temporal variation in semantic use is modulated by biological rhythms. Method: Embeds Reddit text with a pretrained transformer model, computes local and global semantic entropy, analyzes how each varies over the day, and tests whether the patterns are independent of sentiment or affective valence. Result: Local semantic exploration peaks in the morning, reflecting broader exploration of semantic space, whereas global semantic diversity peaks later in the day as submissions accumulate around established topics, consistent with "rich-get-richer" dynamics; these patterns are not explained by sentiment and align with known diurnal variation in neuromodulatory systems. Conclusion: Semantic exploration-exploitation behavior follows a robust circadian rhythm, indicating that biological rhythms affect not only mood and attention but also extend to higher-order semantic cognition. Abstract: Human cognition exhibits strong circadian modulation, yet its influence on high-dimensional semantic behavior remains poorly understood. Using large-scale Reddit data, we quantify time-of-day variation in language use by embedding text into a pretrained transformer model and measuring semantic entropy as an index of linguistic exploration-exploitation, for which we show a robust circadian rhythmicity that could be entrained by seasonal light cues. Distinguishing between local and global semantic entropy reveals a systematic temporal dissociation: local semantic exploration peaks in the morning, reflecting broader exploration of semantic space, whereas global semantic diversity peaks later in the day as submissions accumulate around already established topics, consistent with "rich-get-richer" dynamics. These patterns are not explained by sentiment or affective valence, indicating that semantic exploration captures a cognitive dimension distinct from mood. The observed temporal structure aligns with known diurnal patterns in neuromodulatory systems, suggesting that biological circadian rhythms extend to the semantic domain.
[37] RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)
Yishu Wei,Adam E. Flanders,Errol Colak,John Mongan,Luciano M Prevedello,Po-Hao Chen,Henrique Min Ho Lee,Gilberto Szarf,Hamilton Shoji,Jason Sho,Katherine Andriole,Tessa Cook,Lisa C. Adams,Linda C. Chu,Maggie Chung,Geraldine Brusca-Augello,Djeven P. Deva,Navneet Singh,Felipe Sanchez Tijmes,Jeffrey B. Alpert,Elsie T. Nguyen,Drew A. Torigian,Kate Hanneman,Lauren K Groner,Alexander Phan,Ali Islam,Matias F. Callejas,Gustavo Borges da Silva Teles,Faisal Jamal,Maryam Vazirabad,Ali Tejani,Hari Trivedi,Paulo Kuriki,Rajesh Bhayana,Elana T. Benishay,Yi Lin,Yifan Peng,George Shih
Main category: cs.CL
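TL;DR: This paper presents an AI-assisted expert labeling procedure and uses it to build a high-quality benchmark of 200 chest radiographic studies (100 released, 100 held out), each verified by three radiologists and annotated with 12 benchmark labels, to advance the clinical use of multimodal large language models.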
Details
Motivation: Developing clinically useful multimodal LLM tools requires high-quality benchmark datasets curated by domain experts. Existing models do well on multiple-choice exams, but their reliability in real clinical settings still needs to be evaluated. Method: GPT-4o extracts abnormal findings from MIDRC chest radiograph reports, and a locally hosted Phi-4-Reasoning model maps them to 12 benchmark labels; 1,000 studies are sampled on the basis of the AI-suggested labels for expert review, with the sampling ensuring clinical relevance and a range of difficulty; 17 radiologists take part, and each study is reviewed by three of them to assess the correctness of the AI labels. Result: Of the 1,000 sampled studies, 381 received "Agree all" from at least two radiologists; 200 of these were selected (prioritizing less common or multiple findings) and split into 100 released and 100 holdout studies for independent evaluation, yielding a 200-study chest radiograph benchmark in which every study was verified by three experts. Conclusion: The study builds a high-quality, expert-verified chest radiograph benchmark and an efficient AI-assisted labeling procedure that should help the future development and evaluation of multimodal LLMs in medical imaging. Abstract: Multimodal large language models have demonstrated comparable performance to that of radiology trainees on multiple-choice board-style exams. However, to develop clinically useful multimodal LLM tools, high-quality benchmarks curated by domain experts are essential. The aim was to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the MIDRC were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, and they marked "Agree all", "Agree mostly" or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. Of these, at least two radiologists selected "Agree all" for 381 radiographs. From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset.
The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available at https://imaging.rsna.org, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.
[38] Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems
Yinzhu Chen,Abdine Maiga,Hossein A. Rahmani,Emine Yilmaz
Main category: cs.CL
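TL;DR: This paper proposes a retrieval-augmented multi-agent framework that automatically generates instance-specific clinical evaluation rubrics, aiming to improve the safety and reliability of large language models in medical decision support.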
Details
Motivation: Large language models (LLMs) used for clinical decision support risk hallucinations and unsafe suggestions; existing evaluation relies on expert-built fine-grained rubrics that are costly and hard to scale, while generic metrics miss subtle clinical errors. Method: A retrieval-augmented multi-agent framework decomposes content retrieved from authoritative medical sources into atomic facts and combines them with user interaction constraints to generate verifiable, fine-grained, instance-specific evaluation rubrics. Result: On HealthBench, the Clinical Intent Alignment (CIA) score reaches 60.12%, significantly above the GPT-4o baseline (55.16%); in discriminative tests the AUROC is 0.977 and the quality separation nearly doubles (mean score delta of 8.658 versus 4.972); the rubrics also guide response refinement, raising quality from 59.0% to 68.2%. Conclusion: The framework provides a scalable, transparent, evidence-based foundation for evaluating and improving medical LLMs. Abstract: Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are particularly challenging as they often manifest as subtle clinical errors that evade detection by generic metrics, while expert-authored fine-grained rubrics remain costly to construct and difficult to scale. In this paper, we propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics. Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria. Evaluated on HealthBench, our framework achieves a Clinical Intent Alignment (CIA) score of 60.12%, a statistically significant improvement over the GPT-4o baseline (55.16%). In discriminative tests, our rubrics yield a mean score delta ($\mu_\Delta = 8.658$) and an AUROC of 0.977, nearly doubling the quality separation achieved by the GPT-4o baseline (4.972). Beyond evaluation, our rubrics effectively guide response refinement, improving quality by 9.2% (from 59.0% to 68.2%). This provides a scalable and transparent foundation for both evaluating and improving medical LLMs. The code is available at https://anonymous.4open.science/r/Automated-Rubric-Generation-AF3C/.
[39] The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Zanlin Ni,Shenzhi Wang,Yang Yue,Tianyu Yu,Weilin Zhao,Yeguo Hua,Tianyi Chen,Jun Song,Cheng Yu,Bo Zheng,Gao Huang
Main category: cs.CL
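TL;DR: This paper shows that the flexibility of arbitrary-order generation in diffusion large language models (dLLMs) narrows rather than expands their reasoning boundary. It proposes JustGRPO, which gives up that flexibility and applies standard Group Relative Policy Optimization (GRPO), substantially improving reasoning performance while retaining parallel decoding.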
Details
Motivation: Challenges the intuitive assumption that arbitrary-order generation improves dLLM reasoning and examines its potential negative effect on the reasoning boundary. Method: Empirical analysis shows that dLLMs exploit order flexibility to skip crucial high-uncertainty tokens, causing the solution space to collapse prematurely; the authors therefore forgo arbitrary-order generation and train with standard GRPO, yielding the lightweight method JustGRPO. Result: JustGRPO reaches 89.1% accuracy on GSM8K while fully retaining the parallel decoding ability of dLLMs, supporting the claim that giving up order flexibility better elicits reasoning. Conclusion: Arbitrary-order generation in its current form is a "flexibility trap"; a simplified training recipe such as JustGRPO is more effective than complex RL methods, suggesting that the relationship between flexibility and reasoning ability should be re-examined. Abstract: Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
[40] Is Peer Review Really in Decline? Analyzing Review Quality across Venues and Time
Ilia Kuznetsov,Rohan Nayak,Alla Rozovskaya,Iryna Gurevych
Main category: cs.CL
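TL;DR: This paper proposes an evidence-based framework for comparing and evaluating peer review quality at major AI and machine learning conferences (ICLR, NeurIPS, and *ACL) and finds no consistent evidence that review quality has declined over time.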
Details
Motivation: As submission numbers rise and research communities grow, there is widespread concern that peer review quality is declining, but quantifiable and comparable evaluation methods are lacking, so a standardized, multi-dimensional framework is needed. Method: Proposes a multi-dimensional schema that combines LLM-based and lightweight measurements to quantify a review's utility to editors and authors, and standardizes and compares reviews across venues and years. Result: Median review quality shows no consistent decline over time across the venues studied. The study also documents the diversity of review formats and examines how the proposed measurements relate to one another. Conclusion: There is currently insufficient evidence to support the common view that peer review quality is declining; the authors recommend more systematic empirical methods for monitoring and improving review quality. Abstract: Peer review is at the heart of modern science. As submission numbers rise and research communities grow, the decline in review quality is a popular narrative and a common concern. Yet, is it true? Review quality is difficult to measure, and the ongoing evolution of reviewing practices makes it hard to compare reviews across venues and time. To address this, we introduce a new framework for evidence-based comparative study of review quality and apply it to major AI and machine learning conferences: ICLR, NeurIPS and *ACL. We document the diversity of review formats and introduce a new approach to review standardization. We propose a multi-dimensional schema for quantifying review quality as utility to editors and authors, coupled with both LLM-based and lightweight measurements. We study the relationships between measurements of review quality, and its evolution over time. Contradicting the popular narrative, our cross-temporal analysis reveals no consistent decline in median review quality across venues and years. We propose alternative explanations, and outline recommendations to facilitate future empirical studies of review quality.
[41] Supporting Humans in Evaluating AI Summaries of Legal Depositions
Naghmeh Farzi,Laura Dietz,Dave D. Lewis
Main category: cs.CL
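TL;DR: This paper proposes a factual nugget-based approach that helps legal professionals evaluate and improve automatic summaries of long documents, particularly depositions, focusing on two practical scenarios: judging which of two summaries is better and manually refining a summary.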
Details
Motivation: The legal domain demands very high factual accuracy in deposition summaries, which LLM-generated long-document summaries struggle to guarantee; nugget-based evaluation has proven effective, but its potential to support end users remains underexplored. Method: Moves nugget-based evaluation to the user side by designing and implementing a prototype for the legal domain that helps users compare two summaries and manually improve an automatically generated summary using nuggets. Result: The prototype illustrates how a nugget-based approach can directly support legal professionals in two working scenarios: judging summary quality and manually correcting summaries. Conclusion: Nugget-based methods are suited not only to automated evaluation but can also serve as practical tools that directly support end users, with the aim of improving the accuracy and efficiency of legal summarization work. Abstract: While large language models (LLMs) are increasingly used to summarize long documents, this trend poses significant challenges in the legal domain, where the factual accuracy of deposition summaries is crucial. Nugget-based methods have been shown to be extremely helpful for the automated evaluation of summarization approaches. In this work, we translate these methods to the user side and explore how nuggets could directly assist end users. Although prior systems have demonstrated the promise of nugget-based evaluation, its potential to support end users remains underexplored. Focusing on the legal domain, we present a prototype that leverages a factual nugget-based approach to support legal professionals in two concrete scenarios: (1) determining which of two summaries is better, and (2) manually improving an automatically generated summary.
[42] Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
Anmol Goel,Cornelius Emde,Sangdoo Yun,Seong Joon Oh,Martin Gubri
Main category: cs.CL
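TL;DR: This paper reveals "privacy collapse", a phenomenon in which benign fine-tuning of frontier language models severely degrades contextual privacy protection while the models retain high performance.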
Details
Motivation: Existing safety evaluations fail to detect the hidden privacy vulnerabilities that appear in language models after fine-tuning, a significant risk when deploying specialised agents. Method: Runs experiments across six models (closed and open weight), five fine-tuning datasets (real-world and controlled), and two task categories (agentic and memory-based), together with a mechanistic analysis of how fragile privacy representations are under fine-tuning. Result: Diverse, subtle training-data patterns (such as optimisation for helpfulness, exposure to user information, emotional dialogue, and debugging code that prints internal variables) cause models to lose contextual privacy reasoning, leak information to tools, and violate memory boundaries across contexts; this "silent failure" goes undetected on standard safety and utility benchmarks. Conclusion: Privacy representations are more easily damaged by fine-tuning than task-relevant features, current safety evaluations have a critical gap, and more robust privacy evaluation is needed for specialised agent deployments. Abstract: We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a "silent failure" because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.
[43] Metadata Conditioned Large Language Models for Localization
Anjishnu Mukherjee,Ziwei Zhu,Antonios Anastasopoulos
Main category: cs.CL
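TL;DR: This paper studies metadata conditioning as a way to localize language models, showing that it improves in-region performance without sacrificing cross-region generalization and also improves learning efficiency.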
Details
Motivation: Large language models usually treat text as a single global distribution, which homogenizes geographic behavior and fails to capture region-specific language characteristics. Method: Pre-trains 31 models from scratch at 0.5B and 1B parameter scales on large-scale English news data annotated with verified URLs, country tags, and continent tags, covering 4 continents and 17 countries, and runs controlled experiments with metadata conditioning. Result: Metadata conditioning consistently improves in-region performance while preserving cross-region generalization; URL-level metadata alone captures much of the geographic signal; balanced regional data coverage remains essential; on a benchmark of 800 localized news multiple-choice questions, the instruction-tuned models achieve accuracy comparable to LLaMA-3.2-1B-Instruct. Conclusion: Metadata conditioning is a practical, compute-efficient approach to language model localization that reaches accuracy comparable to a model trained on substantially more data. Abstract: Large language models are typically trained by treating text as a single global distribution, often resulting in geographically homogenized behavior. We study metadata conditioning as a lightweight approach for localization, pre-training 31 models (at 0.5B and 1B parameter scales) from scratch on large-scale English news data annotated with verified URLs, country tags, and continent tags, covering 4 continents and 17 countries. Across four controlled experiments, we show that metadata conditioning consistently improves in-region performance without sacrificing cross-region generalization, enables global models to recover localization comparable to region-specific models, and improves learning efficiency. Our ablation studies demonstrate that URL-level metadata alone captures much of the geographic signal, while balanced regional data coverage remains essential, as metadata cannot fully compensate for missing regions. Finally, we introduce a downstream benchmark of 800 localized news MCQs and show that after instruction tuning, metadata conditioned global models achieve accuracy comparable to LLaMA-3.2-1B-Instruct, despite being trained on substantially less data. Together, these results establish metadata conditioning as a practical and compute-efficient approach for localization of language models.
[44] Taxonomy-Aligned Risk Extraction from 10-K Filings with Autonomous Improvement Using LLMs
Rian Dolphin,Joe Dursun,Jarrett Blankenship,Katie Adams,Quinton Pike
Main category: cs.CL
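TL;DR: This paper presents a method for extracting structured risk factors from corporate 10-K filings through a three-stage pipeline of LLM extraction, semantic mapping, and LLM validation, and introduces AI-driven automatic taxonomy maintenance, improving the quality and economic interpretability of risk factor classification.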
Details
Motivation: Existing methods struggle to extract risk factors in structured form while strictly adhering to a predefined hierarchical taxonomy, and taxonomies lack a mechanism to evolve automatically. Method: Builds a three-stage pipeline: (1) LLM extraction of risk factors with supporting quotes; (2) embedding-based semantic mapping to the predefined taxonomy; (3) LLM-as-a-judge validation that filters spurious assignments. An AI agent additionally maintains the taxonomy autonomously by identifying problematic categories, diagnosing failure patterns, and proposing refinements. Result: Extracts 10,688 risk factors from the 10-K filings of S&P 500 companies; same-industry companies show 63% higher risk profile similarity than cross-industry pairs (Cohen's d=1.06, AUC 0.82, p<0.001); embedding separation improves by 104.7% in a case study; external validation confirms that the taxonomy captures economically meaningful structure. Conclusion: The method delivers interpretable, evolvable, taxonomy-aligned risk factor extraction that the authors expect to generalize across domains, offering a new approach to structured analysis of financial text. Abstract: We present a methodology for extracting structured risk factors from corporate 10-K filings while maintaining adherence to a predefined hierarchical taxonomy. Our three-stage pipeline combines LLM extraction with supporting quotes, embedding-based semantic mapping to taxonomy categories, and LLM-as-a-judge validation that filters spurious assignments. To evaluate our approach, we extract 10,688 risk factors from S&P 500 companies and examine risk profile similarity across industry clusters. Beyond extraction, we introduce autonomous taxonomy maintenance where an AI agent analyzes evaluation feedback to identify problematic categories, diagnose failure patterns, and propose refinements, achieving 104.7% improvement in embedding separation in a case study. External validation confirms the taxonomy captures economically meaningful structure: same-industry companies exhibit 63% higher risk profile similarity than cross-industry pairs (Cohen's d=1.06, AUC 0.82, p<0.001). The methodology generalizes to any domain requiring taxonomy-aligned extraction from unstructured text, with autonomous improvement enabling continuous quality maintenance and enhancement as systems process more documents.
[45] The Effect of Scripts and Formats on LLM Numeracy
Varshini Reddy,Craig W. Schmidt,Seth Ebner,Adam Wiemerslage,Yuval Pinter,Chris Tanner
Main category: cs.CL
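TL;DR: Large language models perform well on standard arithmetic tasks, but their accuracy drops substantially on underrepresented numeral scripts and formats; targeted prompting strategies can mitigate the problem.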
Details
Motivation: Examines the numerical reasoning of LLMs across numeral scripts and formats, exposing limitations caused by bias in the training data. Method: Evaluates LLM performance on a range of numeral scripts and formats and tests prompting strategies such as few-shot prompting and explicit numeral mapping. Result: LLM accuracy drops substantially on underrepresented numeral scripts and formats, but appropriate prompting greatly narrows the gap. Conclusion: Multilingual, multi-format numerical reasoning is an overlooked challenge for current LLMs and calls for targeted strategies to ensure reliable performance across numeral representations. Abstract: Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling human-level performance on standard numerical tasks. However, little attention has been given to how these models perform when numerical expressions deviate from the prevailing conventions present in their training corpora. In this work, we investigate numerical reasoning across a wide range of numeral scripts and formats. We show that LLM accuracy drops substantially when numerical inputs are rendered in underrepresented scripts or formats, despite the underlying mathematical reasoning being identical. We further demonstrate that targeted prompting strategies, such as few-shot prompting and explicit numeral mapping, can greatly narrow this gap. Our findings highlight an overlooked challenge in multilingual numerical reasoning and provide actionable insights for working with LLMs to reliably interpret, manipulate, and generate numbers across diverse numeral scripts and formatting styles.
[46] Robust Fake News Detection using Large Language Models under Adversarial Sentiment Attacks
Sahar Tahmasebi,Eric Müller-Budack,Ralph Ewerth
Main category: cs.CL
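TL;DR: This paper proposes the AdSent framework, which uses sentiment-controlled adversarial attacks and a sentiment-agnostic training strategy to make fake news detection models more robust to sentiment manipulation.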
Details
Motivation: Existing fake news detection methods rely on sentiment features and are therefore vulnerable to LLM-generated sentiment manipulation, a weakness that has not been studied in depth. Method: Proposes LLM-based controlled sentiment adversarial attacks, analyzes how sentiment shifts affect detection performance, and designs a sentiment-agnostic training strategy. Result: On three benchmark datasets AdSent significantly outperforms the baselines in both accuracy and robustness and generalizes to unseen datasets and adversarial scenarios. Conclusion: Sentiment features are easy to manipulate and bias fake news detectors; AdSent mitigates the problem through sentiment-agnostic modeling and improves robustness. Abstract: Misinformation and fake news have become a pressing societal challenge, driving the need for reliable automated detection methods. Prior research has highlighted sentiment as an important signal in fake news detection, either by analyzing which sentiments are associated with fake news or by using sentiment and emotion features for classification. However, this poses a vulnerability since adversaries can manipulate sentiment to evade detectors, especially with the advent of large language models (LLMs). A few studies have explored adversarial samples generated by LLMs, but they mainly focus on stylistic features such as the writing style of news publishers. Thus, the crucial vulnerability of sentiment manipulation remains largely unexplored. In this paper, we investigate the robustness of state-of-the-art fake news detectors under sentiment manipulation. We introduce AdSent, a sentiment-robust detection framework designed to ensure consistent veracity predictions across both original and sentiment-altered news articles. Specifically, we (1) propose controlled sentiment-based adversarial attacks using LLMs and (2) analyze the impact of sentiment shifts on detection performance. We show that changing the sentiment heavily impacts the performance of fake news detection models, indicating biases towards neutral articles being real, while non-neutral articles are often classified as fake content. Finally, (3) we introduce a novel sentiment-agnostic training strategy that enhances robustness against such perturbations.
Extensive experiments on three benchmark datasets demonstrate that AdSent significantly outperforms competitive baselines in both accuracy and robustness, while also generalizing effectively to unseen datasets and adversarial scenarios.
cs.CV [Back]
[47] SOSControl: Enhancing Human Motion Generation through Saliency-Aware Symbolic Orientation and Timing Control
Ho Yin Au,Junkun Jiang,Jie Chen
Main category: cs.CV
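TL;DR: This paper proposes the Salient Orientation Symbolic (SOS) script, a symbolic framework for precisely controlling body part orientation and motion timing at keyframes in text-to-motion generation, together with an automatic SOS extraction pipeline and the SOSControl generation framework, improving the controllability, naturalness, and generalizability of generated motion.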
Details
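Motivation: Traditional text-to-motion generation lacks fine-grained control, and existing methods based on keyframe joint locations provide only positional guidance, making it hard to specify body part orientations and motion timing intuitively; a more precise, programmable form of control is needed. Method: Proposes the SOS script as a programmable symbolic framework describing body part orientation and motion timing at keyframes; extracts SOS scripts automatically using temporally-constrained agglomerative clustering and a Saliency-based Masking Scheme (SMS); builds the SOSControl framework, which combines SMS-based data augmentation, gradient-based iterative optimization, and a ControlNet-based ACTOR-PAE decoder to prioritize user-specified orientation constraints during generation while keeping the motion smooth and natural. Result: Experiments show that the SOS extraction pipeline produces sparse, human-readable symbolic scripts and that SOSControl outperforms existing baselines in motion quality, controllability, and generalizability, particularly for body part orientation and motion timing control. Conclusion: The SOS script and the SOSControl framework give text-to-motion generation stronger fine-grained control and advance interpretable, controllable motion generation. Abstract: Traditional text-to-motion frameworks often lack precise control, and existing approaches based on joint keyframe locations provide only positional guidance, making it challenging and unintuitive to specify body part orientations and motion timing. To address these limitations, we introduce the Salient Orientation Symbolic (SOS) script, a programmable symbolic framework for specifying body part orientations and motion timing at keyframes. We further propose an automatic SOS extraction pipeline that employs temporally-constrained agglomerative clustering for frame saliency detection and a Saliency-based Masking Scheme (SMS) to generate sparse, interpretable SOS scripts directly from motion data. Moreover, we present the SOSControl framework, which treats the available orientation symbols in the sparse SOS script as salient and prioritizes satisfying these constraints during motion generation. By incorporating SMS-based data augmentation and gradient-based iterative optimization, the framework enhances alignment with user-specified constraints. Additionally, it employs a ControlNet-based ACTOR-PAE Decoder to ensure smooth and natural motion outputs.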
Extensive experiments demonstrate that the SOS extraction pipeline generates human-interpretable scripts with symbolic annotations at salient keyframes, while the SOSControl framework outperforms existing baselines in motion quality, controllability, and generalizability with respect to motion timing and body part orientation control.
[48] A Cloud-Based Cross-Modal Transformer for Emotion Recognition and Adaptive Human-Computer Interaction
Ziwen Zhong,Zhitao Shu,Yue Zhao
Main category: cs.CV
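TL;DR: This paper proposes a Cloud-Based Cross-Modal Transformer (CMT) framework that fuses visual, auditory, and textual signals using pretrained encoders and a cross-modal attention mechanism for robust multimodal emotion recognition, relies on cloud infrastructure for scalable, low-latency deployment, and reports state-of-the-art performance on several benchmark datasets.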
Details
Motivation: Existing emotion recognition systems mostly rely on single-modality analysis, which limits robustness and generalization in real-world settings; a solution is needed that fuses heterogeneous multi-source signals and supports real-time deployment. Method: Proposes the Cloud-Based Cross-Modal Transformer (CMT) framework, which uses Vision Transformer, Wav2Vec2, and BERT as modality encoders, adds a cross-modal attention mechanism to model dependencies between modalities, and uses Kubernetes and TensorFlow Serving for cloud-native distributed training and serving. Result: On IEMOCAP, MELD, and AffectNet the F1-score improves by 3.0 percent and cross-entropy loss falls by 12.9 percent relative to strong multimodal baselines; cloud deployment shows an average response latency of 128 ms, a 35 percent reduction compared with conventional transformer-based fusion systems. Conclusion: The CMT framework improves the accuracy and responsiveness of multimodal emotion recognition, supports the feasibility of cloud-native architectures for affective computing, and provides a basis for emotion-aware applications such as intelligent customer service and virtual tutoring. Abstract: Emotion recognition is a fundamental component of next-generation human-computer interaction (HCI), enabling machines to perceive, understand, and respond to users' affective states. However, existing systems often rely on single-modality analysis such as facial expressions, speech tone, or textual sentiment, resulting in limited robustness and poor generalization in real-world environments. To address these challenges, this study proposes a Cloud-Based Cross-Modal Transformer (CMT) framework for multimodal emotion recognition and adaptive human-computer interaction. The proposed model integrates visual, auditory, and textual signals using pretrained encoders (Vision Transformer, Wav2Vec2, and BERT) and employs a cross-modal attention mechanism to capture complex interdependencies among heterogeneous features. By leveraging cloud computing infrastructure with distributed training on Kubernetes and TensorFlow Serving, the system enables scalable, low-latency emotion recognition for large-scale user interactions. Experiments conducted on benchmark datasets including IEMOCAP, MELD, and AffectNet demonstrate that the CMT achieves state-of-the-art performance, improving the F1-score by 3.0 percent and reducing cross-entropy loss by 12.9 percent compared to strong multimodal baselines. Additionally, cloud deployment evaluations show an average response latency of 128 ms, representing a 35 percent reduction compared with conventional transformer-based fusion systems.
These results confirm that the proposed framework enables efficient, real-time emotion recognition and adaptive feedback in applications such as intelligent customer service, virtual tutoring systems, and affective computing interfaces, marking an important step toward cloud-native affective computing and emotionally intelligent interactive systems.
[49] Intelligent Power Grid Design Review via Active Perception-Enabled Multimodal Large Language Models
Taoliang Tan,Chengwei Ma,Zhen Tian,Zhao Lin,Dongdong Li,Si Shi
Main category: cs.CV
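TL;DR: This paper proposes a three-stage framework for intelligent review of power grid design drawings, based on multimodal large language models (MLLMs) and advanced prompt engineering, to identify defects in ultra-high-resolution drawings efficiently and reliably.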
Details
Motivation: Existing automated systems struggle with ultra-high-resolution power grid drawings because of high computational cost, information loss, and a lack of global semantic understanding, which makes design errors hard to identify. Method: Uses a three-stage framework that mimics how human experts review drawings: the first stage uses an MLLM for global semantic understanding and proposes domain-specific semantic regions from a low-resolution overview; the second stage performs fine-grained recognition within each region at high resolution and outputs confidence scores; the third stage integrates the results in a comprehensive decision-making module to diagnose errors and assess reliability. Result: Preliminary experiments on real-world power grid drawings indicate that the method improves the MLLM's grasp of macroscopic semantic information and its localization of design errors, with higher defect discovery accuracy and more reliable review judgments than traditional passive MLLM inference. Conclusion: The work offers a novel, prompt-driven paradigm for intelligent power grid drawing review and an efficient, reliable approach to automated review of complex engineering drawings. Abstract: The intelligent review of power grid engineering design drawings is crucial for power system safety. However, current automated systems struggle with ultra-high-resolution drawings due to high computational demands, information loss, and a lack of holistic semantic understanding for design error identification. This paper proposes a novel three-stage framework for intelligent power grid drawing review, driven by pre-trained Multimodal Large Language Models (MLLMs) through advanced prompt engineering. Mimicking the human expert review process, the first stage leverages an MLLM for global semantic understanding to intelligently propose domain-specific semantic regions from a low-resolution overview. The second stage then performs high-resolution, fine-grained recognition within these proposed regions, acquiring detailed information with associated confidence scores. In the final stage, a comprehensive decision-making module integrates these confidence-aware results to accurately diagnose design errors and provide a reliability assessment. Preliminary results on real-world power grid drawings demonstrate our approach significantly enhances MLLM's ability to grasp macroscopic semantic information and pinpoint design errors, showing improved defect discovery accuracy and greater reliability in review judgments compared to traditional passive MLLM inference. This research offers a novel, prompt-driven paradigm for intelligent and reliable power grid drawing review.
[50] LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models
Mengyu Sun,Ziyuan Yang,Andrew Beng Jin Teoh,Junxu Liu,Haibo Hu,Yi Zhang
Main category: cs.CV
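TL;DR: This paper proposes LURE, a new concept reawakening method. It models the generation process as an implicit function and shows theoretically that perturbing any of several factors (text conditions, model parameters, or latent states) can reawaken concepts erased from diffusion models. LURE comprises three core mechanisms: semantic re-binding, Gradient Field Orthogonalization, and Latent Semantic Identification-Guided Sampling; experiments show that it reawakens multiple erased concepts simultaneously and stably.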
Details
Motivation: Existing concept erasure methods are vulnerable because erased concepts can still be reawakened, while current reawakening methods optimize only at the prompt level and neglect other generative factors, leaving the generation dynamics incompletely understood. Method: Proposes Latent space Unblocking for concept REawakening (LURE): (1) models the generation process as an implicit function and shows theoretically that perturbing text conditions, model parameters, or latent states can each reawaken erased concepts; (2) a semantic re-binding mechanism reconstructs the latent space by aligning denoising predictions with target distributions; (3) Gradient Field Orthogonalization resolves gradient conflicts and feature entanglement in multi-concept scenarios; (4) Latent Semantic Identification-Guided Sampling (LSIS) stabilizes reawakening through posterior density verification. Result: LURE achieves simultaneous, high-fidelity reawakening of multiple erased concepts across diverse erasure tasks and methods. Conclusion: Erasure is not absolutely safe, since erased concepts can be reawakened by perturbing multiple factors; LURE offers a systematic, theory-driven reawakening framework that exposes how fragile and malleable semantic associations in diffusion models are, with implications for improving model safety and controllability. Abstract: Concept erasure aims to suppress sensitive content in diffusion models, but recent studies show that erased concepts can still be reawakened, revealing vulnerabilities in erasure methods. Existing reawakening methods mainly rely on prompt-level optimization to manipulate sampling trajectories, neglecting other generative factors, which limits a comprehensive understanding of the underlying dynamics. In this paper, we model the generation process as an implicit function to enable a comprehensive theoretical analysis of multiple factors, including text conditions, model parameters, and latent states. We theoretically show that perturbing each factor can reawaken erased concepts. Building on this insight, we propose a novel concept reawakening method: Latent space Unblocking for concept REawakening (LURE), which reawakens erased concepts by reconstructing the latent space and guiding the sampling trajectory. Specifically, our semantic re-binding mechanism reconstructs the latent space by aligning denoising predictions with target distributions to reestablish severed text-visual associations. However, in multi-concept scenarios, naive reconstruction can cause gradient conflicts and feature entanglement. To address this, we introduce Gradient Field Orthogonalization, which enforces feature orthogonality to prevent mutual interference. Additionally, our Latent Semantic Identification-Guided Sampling (LSIS) ensures stability of the reawakening process via posterior density verification.
Extensive experiments demonstrate that LURE enables simultaneous, high-fidelity reawakening of multiple erased concepts across diverse erasure tasks and methods.
[51] CityCube: Benchmarking Cross-view Spatial Reasoning on Vision-Language Models in Urban Environments
Haotian Xu,Yue Hu,Zhengqiu Zhu,Chen Gao,Ziyou Wang,Junreng Rao,Wenhao Lu,Weishi Li,Quanjun Yin,Yong Li
Main category: cs.CV
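TL;DR: This paper introduces CityCube, a systematic benchmark for evaluating the cross-view spatial reasoning of vision-language models (VLMs) in urban environments. The benchmark covers multiple viewpoint dynamics and platforms and contains 5,022 annotated multi-view QA pairs. Evaluation shows that existing VLMs fall well below human performance, underscoring both the difficulty of the task and the value of the benchmark.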
Details
Motivation: Existing benchmarks focus mainly on indoor or street scenes and overlook the distinct challenges of open urban spaces, with their rich semantics, complex geometry, and viewpoint variation, so a new benchmark is needed to systematically evaluate cross-view reasoning of VLMs in urban environments. Method: Builds the CityCube benchmark, which integrates four viewpoint dynamics to mimic camera motion and covers multi-view data from vehicles, drones, and satellites; it provides 5,022 annotated multi-view QA pairs spanning five cognitive dimensions and three spatial relation expressions. Result: A comprehensive evaluation of 33 VLMs shows that even large-scale models do not exceed 54.1% accuracy, 34.2% below human performance, whereas small-scale fine-tuned models exceed 60.0%, showing that fine-tuning is effective; the analysis also reveals cognitive differences between VLMs and humans in spatial reasoning. Conclusion: CityCube is an effective benchmark for cross-view reasoning of VLMs in complex urban environments; it exposes a substantial gap between current models and humans, highlights the importance of task-specific fine-tuning, and points toward human-like reasoning as a future direction. Abstract: Cross-view spatial reasoning is essential for embodied AI, underpinning spatial understanding, mental simulation and planning in complex environments. Existing benchmarks primarily emphasize indoor or street settings, overlooking the unique challenges of open-ended urban spaces characterized by rich semantics, complex geometries, and view variations. To address this, we introduce CityCube, a systematic benchmark designed to probe cross-view reasoning capabilities of current VLMs in urban settings. CityCube integrates four viewpoint dynamics to mimic camera movements and spans a wide spectrum of perspectives from multiple platforms, e.g., vehicles, drones and satellites. For a comprehensive assessment, it features 5,022 meticulously annotated multi-view QA pairs categorized into five cognitive dimensions and three spatial relation expressions. A comprehensive evaluation of 33 VLMs reveals a significant performance disparity with humans: even large-scale models struggle to exceed 54.1% accuracy, remaining 34.2% below human performance. By contrast, small-scale fine-tuned VLMs achieve over 60.0% accuracy, highlighting the necessity of our benchmark. Further analyses indicate the task correlations and fundamental cognitive disparity between VLMs and human-like reasoning.
[52] Large-Scale Label Quality Assessment for Medical Segmentation via a Vision-Language Judge and Synthetic Data
Yixiong Chen,Zongwei Zhou,Wenxuan Li,Alan Yuille
Main category: cs.CV
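TL;DR: This paper proposes SegAE, a lightweight vision-language model that automatically assesses label quality in large-scale medical segmentation datasets, improving training robustness and data efficiency.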
Details
Motivation: Large-scale medical segmentation datasets often mix manual annotations with pseudo-labels of uneven quality, which harms model training and evaluation. Method: Proposes SegAE, a lightweight vision-language model trained on over four million image-label pairs with corresponding quality scores; it predicts label quality for 142 anatomical structures quickly (0.06 s) and correlates strongly with ground-truth Dice similarity (r = 0.902). Result: (I) Low-quality annotation is found to be widespread across several public datasets; (II) SegAE improves data efficiency and performance in active and semi-supervised learning, cutting annotation cost by one-third and quality-checking time by 70%. Conclusion: SegAE is a simple and effective quality-control tool for large-scale medical segmentation datasets; code, models, and data are open-sourced. Abstract: Large-scale medical segmentation datasets often combine manual and pseudo-labels of uneven quality, which can compromise training and evaluation. Low-quality labels may hamper performance and make the model training less robust. To address this issue, we propose SegAE (Segmentation Assessment Engine), a lightweight vision-language model (VLM) that automatically predicts label quality across 142 anatomical structures. Trained on over four million image-label pairs with quality scores, SegAE achieves a high correlation coefficient of 0.902 with ground-truth Dice similarity and evaluates a 3D mask in 0.06s. SegAE shows several practical benefits: (I) Our analysis reveals widespread low-quality labeling across public datasets; (II) SegAE improves data efficiency and training performance in active and semi-supervised learning, reducing dataset annotation cost by one-third and quality-checking time by 70% per label. This tool provides a simple and effective solution for quality control in large-scale medical segmentation datasets. The dataset, model weights, and codes are released at https://github.com/Schuture/SegAE.
[53] Vision-Based Natural Language Scene Understanding for Autonomous Driving: An Extended Dataset and a New Model for Traffic Scene Description Generation
Danial Sadrian Zadeh,Otman A. Basir,Behzad Moshiri
Main category: cs.CV
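TL;DR: This paper proposes a novel framework that uses a hybrid attention mechanism to convert a single frontal-view camera image into a natural language description for traffic scene understanding, and validates it on a new dataset built from BDD100K.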
Details
Motivation: To improve how accurately autonomous vehicles perceive and understand their environment, a method is needed that extracts rich semantic information from a single image and generates a natural language description. Method: Proposes a model with a hybrid attention mechanism that strengthens spatial and semantic feature extraction and fuses these features to generate context-rich scene descriptions; a new dedicated dataset is built from BDD100K. Result: Evaluated on the new dataset with automatic metrics such as CIDEr and SPICE plus human assessment, the model performs strongly and generates detailed traffic scene descriptions. Conclusion: The proposed framework effectively converts images into natural language, improves traffic scene understanding, and has potential for practical use. Abstract: Traffic scene understanding is essential for enabling autonomous vehicles to accurately perceive and interpret their environment, thereby ensuring safe navigation. This paper presents a novel framework that transforms a single frontal-view camera image into a concise natural language description, effectively capturing spatial layouts, semantic relationships, and driving-relevant cues. The proposed model leverages a hybrid attention mechanism to enhance spatial and semantic feature extraction and integrates these features to generate contextually rich and detailed scene descriptions. To address the limited availability of specialized datasets in this domain, a new dataset derived from the BDD100K dataset has been developed, with comprehensive guidelines provided for its construction. Furthermore, the study offers an in-depth discussion of relevant evaluation metrics, identifying the most appropriate measures for this task. Extensive quantitative evaluations using metrics such as CIDEr and SPICE, complemented by human judgment assessments, demonstrate that the proposed model achieves strong performance and effectively fulfills its intended objectives on the newly developed dataset.
[54] Gaussian Based Adaptive Multi-Modal 3D Semantic Occupancy Prediction
A. Enes Doruk
Main category: cs.CV
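TL;DR: This paper proposes a Gaussian-based adaptive camera-LiDAR fusion method for efficient, robust 3D semantic occupancy prediction, aimed at long-tail safety challenges in autonomous driving.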
Details
Motivation: Existing voxelization methods are computationally expensive, their fusion process is brittle, and they adapt poorly to dynamic environments, so they struggle to handle long-tail safety problems in autonomous driving. Method: Proposes a Gaussian-based adaptive camera-LiDAR multimodal 3D occupancy prediction model with four core components: LiDAR Depth Feature Aggregation (LDFA), Entropy-Based Feature Smoothing, Adaptive Camera-LiDAR Fusion, and a Gauss-Mamba decoding head that uses selective state space models for global context decoding with linear complexity. Result: Achieves more efficient, robust, and memory-friendly 3D semantic occupancy prediction and keeps fusion performance stable in dynamic environments. Conclusion: The method combines the semantic strengths of cameras with the geometric strengths of LiDAR and offers a new paradigm for safe perception in complex autonomous driving scenes. Abstract: The paradigm shift from sparse object detection towards dense 3D semantic occupancy prediction is necessary for dealing with long-tail safety challenges for autonomous vehicles. Nonetheless, current voxelization methods commonly suffer from excessive computational complexity, and their fusion process is brittle, static, and breaks down under dynamic environmental settings. To this end, this work presents a novel Gaussian-based adaptive camera-LiDAR multimodal 3D occupancy prediction model that seamlessly bridges the semantic strengths of the camera modality with the geometric strengths of the LiDAR modality through a memory-efficient 3D Gaussian model. The proposed solution has four key components: (1) LiDAR Depth Feature Aggregation (LDFA), where depth-wise deformable sampling is employed for dealing with geometric sparsity, (2) Entropy-Based Feature Smoothing, where cross-entropy is employed for handling domain-specific noise, (3) Adaptive Camera-LiDAR Fusion, where dynamic recalibration of sensor outputs is performed based on model outputs, and (4) a Gauss-Mamba Head that uses Selective State Space Models for global context decoding with linear computational complexity.
[55] Real-Time Wildfire Localization on the NASA Autonomous Modular Sensor using Deep Learning
Yajvan Ravan,Aref Malek,Chester Dolph,Nikhil Behari
Main category: cs.CV
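TL;DR: This paper presents a high-quality, human-annotated wildfire dataset of 12-channel multi-spectral high-altitude aerial imagery from the NASA AMS sensor, together with a real-time deep learning model that combines classification and pixel-level segmentation. The model detects wildfires and localizes their boundaries with high accuracy (96% classification accuracy, 74% IoU, 84% recall) and performs especially well at night and under cloud or smoke occlusion.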
Details
Motivation: High-altitude multi-spectral aerial imagery is scarce and expensive, yet essential for critical applications such as wildfire detection; existing data and methods (satellite imagery, single-color rule-based algorithms) cannot meet the need for accurate, all-weather, interference-resistant wildfire sensing. Method: Builds a human-annotated dataset of 12-channel (including IR, SWIR, and thermal infrared) aerial image patches (>4000 images) from 20 wildfire missions, and designs a real-time two-network model that combines image classification with pixel-level segmentation, relying chiefly on the SWIR, IR, and thermal bands. Result: The model reaches 96% classification accuracy, 74% IoU, and 84% recall, outperforming models trained on satellite data and classical color-rule algorithms; it detects active wildfire at night and under cloud or smoke occlusion and separates easily confused false positives. Conclusion: Multi-spectral high-altitude aerial data combined with a purpose-built deep learning architecture substantially improves the robustness and practicality of wildfire detection and offers a new paradigm for real-time, autonomous wildfire monitoring; the SWIR, IR, and thermal bands carry the most important spectral information for identifying fire boundaries. Abstract: High-altitude, multi-spectral, aerial imagery is scarce and expensive to acquire, yet it is necessary for algorithmic advances and application of machine learning models to high-impact problems such as wildfire detection. We introduce a human-annotated dataset from the NASA Autonomous Modular Sensor (AMS) using 12-channel, medium to high altitude (3-50 km) aerial wildfire images similar to those used in current US wildfire missions. Our dataset combines spectral data from 12 different channels, including infrared (IR), short-wave IR (SWIR), and thermal. We take imagery from 20 wildfire missions and randomly sample small patches to generate over 4000 images with high variability, including occlusions by smoke/clouds, easily-confused false positives, and nighttime imagery. We demonstrate results from a deep-learning model to automate the human-intensive process of fire perimeter determination. We train two deep neural networks, one for image classification and the other for pixel-level segmentation. The networks are combined into a unique real-time segmentation model to efficiently localize active wildfire on an incoming image feed. Our model achieves 96% classification accuracy, 74% Intersection-over-Union (IoU), and 84% recall, surpassing past methods, including models trained on satellite data and classical color-rule algorithms. By leveraging a multi-spectral dataset, our model is able to detect active wildfire at nighttime and behind clouds, while distinguishing between false positives.
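For readers less familiar with the segmentation metrics quoted above, the following minimal sketch computes Intersection-over-Union and recall for a pair of binary masks using their standard definitions. The helper names and the toy masks are illustrative assumptions, not taken from the paper's code or data.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives
    over two binary masks given as equal-shaped lists of rows."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def iou(pred, truth):
    """Intersection-over-Union: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    union = tp + fp + fn
    # Convention assumed here: two empty masks agree perfectly.
    return tp / union if union else 1.0


def recall(pred, truth):
    """Recall: TP / (TP + FN), the share of fire pixels recovered."""
    tp, _, fn = confusion_counts(pred, truth)
    return tp / (tp + fn) if (tp + fn) else 1.0


# Toy 2x2 masks (1 = fire pixel); the values are made up.
predicted = [[1, 1], [0, 0]]
annotated = [[1, 0], [1, 0]]
print(iou(predicted, annotated), recall(predicted, annotated))
```

IoU penalizes both missed fire pixels and false alarms, while recall only penalizes misses, which is why the two figures reported for the same model can differ.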
We find that data from the SWIR, IR, and thermal bands is the most important to distinguish fire perimeters. Our code and dataset can be found here: https://github.com/nasa/Autonomous-Modular-Sensor-Wildfire-Segmentation/tree/main and https://drive.google.com/drive/folders/1-u4vs9rqwkwgdeeeoUhftCxrfe_4QPTn?=usp=drive_link
[56] XD-MAP: Cross-Modal Domain Adaptation using Semantic Parametric Mapping
Frank Bieder,Hendrik Königshof,Haohao Hu,Fabian Immel,Yinzhe Shen,Jan-Hendrik Pauls,Christoph Stiller
Main category: cs.CV
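TL;DR: This paper proposes XD-MAP, a new method that builds a semantic parametric map from detections on camera images to transfer sensor-specific knowledge from the image domain to the LiDAR domain without manual annotation, yielding large gains on 2D and 3D semantic segmentation tasks.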
Details
Motivation: The performance of deep learning models depends heavily on dataset availability, and training data must match both the target categories and the sensor characteristics, so effective domain adaptation strategies are needed to bridge the gap between existing datasets and real deployment environments. Method: Proposes XD-MAP, which uses neural-network detections on camera images to build a semantic parametric map and generates pseudo labels for the LiDAR domain from it, transferring knowledge across modalities without requiring overlap between sensors and extending the perception range to 360 degrees. Result: On a large-scale road feature dataset, XD-MAP improves over single-shot baselines by +19.5 mIoU on 2D semantic segmentation, +19.5 PQth on 2D panoptic segmentation, and +32.3 mIoU on 3D semantic segmentation. Conclusion: XD-MAP transfers knowledge learned in the image domain to the LiDAR domain without any manual annotation and performs strongly on several segmentation tasks, demonstrating its potential for cross-domain perception. Abstract: Until open-world foundation models match the performance of specialized approaches, the effectiveness of deep learning models remains heavily dependent on dataset availability. Training data must align not only with the target object categories but also with the sensor characteristics and modalities. To bridge the gap between available datasets and deployment domains, domain adaptation strategies are widely used. In this work, we propose a novel approach to transferring sensor-specific knowledge from an image dataset to LiDAR, an entirely different sensing domain. Our method XD-MAP leverages detections from a neural network on camera images to create a semantic parametric map. The map elements are modeled to produce pseudo labels in the target domain without any manual annotation effort. Unlike previous domain transfer approaches, our method does not require direct overlap between sensors and enables extending the angular perception range from a front-view camera to a full 360-degree view. On our large-scale road feature dataset, XD-MAP outperforms single shot baseline approaches by +19.5 mIoU for 2D semantic segmentation, +19.5 PQth for 2D panoptic segmentation, and +32.3 mIoU in 3D semantic segmentation. The results demonstrate the effectiveness of our approach achieving strong performance on LiDAR data without any manual labeling.
[57] GutenOCR: A Grounded Vision-Language Front-End for Documents
Hunter Heidenreich,Ben Elliott,Olivia Dinica,Yosheb Getachew
Main category: cs.CV
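TL;DR: GutenOCR is an end-to-end grounded OCR system fine-tuned from Qwen2.5-VL that supports text recognition, detection, and spatial grounding through a unified prompt interface. It substantially improves grounded OCR on business and scientific documents but shows trade-offs in page linearization and formula-heavy layouts.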
Details
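Motivation: Existing OCR systems lack unified modeling of text detection, recognition, and spatial grounding, and perform poorly on complex documents such as scientific papers and layouts with formulas or color cues; an end-to-end vision-language approach that jointly optimizes reading, detection, and grounding is needed. Method: Applies supervised fine-tuning to the Qwen2.5-VL-3B and 7B vision-language models on a multi-source training set covering real business documents, scientific papers, and synthetic grounding annotations; designs a unified prompt interface supporting full-page and localized reading, line- and paragraph-level box output, and "where is x?" grounding queries; proposes a new grounded OCR evaluation protocol. Result: GutenOCR-7B raises the composite grounded OCR score from 0.40 to 0.82 (more than double) on 10.5K held-out business and scientific pages; on Fox and OmniDocBench v1.5 it substantially improves region- and line-level OCR accuracy and text-detection recall, but performance drops on page linearization, color-guided OCR, and formula-heavy layouts. Conclusion: End-to-end grounded OCR built on large vision-language models is a feasible and efficient path; GutenOCR shows strong generalization and practical value while also exposing the limits of current VLMs in structured document understanding, which points to directions for further work. Abstract: GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
[58] PAS-Mamba: Phase-Amplitude-Spatial State Space Model for MRI Reconstruction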
Xiaoyan Kui,Zijie Fan,Zexin Ji,Qinsong Li,Hao Xu,Weixin Si,Haodong Xu,Beiji Zou
Main category: cs.CV
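TL;DR: Proposes PAS-Mamba, a Phase-Amplitude-Spatial state space model for MRI reconstruction that decouples phase and amplitude modeling in the frequency domain and combines it with image-domain features, substantially improving reconstruction performance.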
Details
Motivation: Existing MRI reconstruction methods usually treat the frequency domain as a whole and ignore the fact that its internal components (phase and amplitude) carry different information, so their feature learning interferes with each other. Method: Proposes PAS-Mamba: LocalMamba preserves spatial locality in the image domain; amplitude and phase are decoupled into two separate branches in the frequency domain; Circular Frequency Domain Scanning (CFDS) follows the concentric geometry of frequency information; a Dual-Domain Complementary Fusion Module (DDCFM) provides adaptive fusion and bidirectional interaction. Result: Experiments on the IXI and fastMRI knee datasets show that PAS-Mamba consistently outperforms existing state-of-the-art methods. Conclusion: By decoupling phase and amplitude in the frequency domain and combining them with image-domain features, PAS-Mamba avoids representational coupling and achieves better MRI reconstruction. Abstract: Joint feature modeling in both the spatial and frequency domains has become a mainstream approach in MRI reconstruction. However, existing methods generally treat the frequency domain as a whole, neglecting the differences in the information carried by its internal components. According to Fourier transform theory, phase and amplitude represent different types of information in the image. Our spectrum swapping experiments show that magnitude mainly reflects pixel-level intensity, while phase predominantly governs image structure. To prevent interference between phase and magnitude feature learning caused by unified frequency-domain modeling, we propose the Phase-Amplitude-Spatial State Space Model (PAS-Mamba) for MRI Reconstruction, a framework that decouples phase and magnitude modeling in the frequency domain and combines it with image-domain features for better reconstruction. In the image domain, LocalMamba preserves spatial locality to sharpen fine anatomical details. In the frequency domain, we disentangle amplitude and phase into two specialized branches to avoid representational coupling. To respect the concentric geometry of frequency information, we propose Circular Frequency Domain Scanning (CFDS) to serialize features from low to high frequencies. Finally, a Dual-Domain Complementary Fusion Module (DDCFM) adaptively fuses amplitude and phase representations and enables bidirectional exchange between frequency and image domains, delivering superior reconstruction.
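The low-to-high-frequency serialization described above can be illustrated with a short sketch. It assumes a centered (fftshift-style) spectrum whose zero-frequency bin sits at index (H // 2, W // 2) and orders grid positions by radial distance from that center, breaking ties by angle. The function name `circular_scan_order` and the tie-breaking rule are assumptions for illustration; the abstract does not specify how CFDS is actually implemented.

```python
import math


def circular_scan_order(height, width):
    """Return (row, col) positions of a centered spectrum,
    sorted from low to high frequency.

    Assumption: the zero-frequency bin is at (height // 2,
    width // 2). Positions on the same ring (equal radius) are
    ordered by angle, so each ring is traversed in one sweep.
    """
    cy, cx = height // 2, width // 2

    def ring_key(pos):
        dy, dx = pos[0] - cy, pos[1] - cx
        return (math.hypot(dy, dx), math.atan2(dy, dx))

    coords = [(y, x) for y in range(height) for x in range(width)]
    return sorted(coords, key=ring_key)


# Serialize a toy 3x3 spectrum: the center comes first,
# then the four nearest neighbours, then the four corners.
for position in circular_scan_order(3, 3):
    print(position)
```

A raster (row-by-row) scan would interleave low and high frequencies in the sequence, whereas this ordering keeps bins of similar frequency adjacent, which is the property the abstract attributes to the concentric geometry.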
Extensive experiments on the IXI and fastMRI knee datasets show that PAS-Mamba consistently outperforms state-of-the-art reconstruction methods.
[59] Scribble-Supervised Medical Image Segmentation with Dynamic Teacher Switching and Hierarchical Consistency
Thanh-Huy Nguyen,Hoang-Loc Cao,Dat T. Chung,Mai-Anh Vu,Thanh-Minh Nguyen,Minh Le,Phat K. Huynh,Ulas Bagci
Main category: cs.CV
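TL;DR: Proposes SDT-Net, a dual-teacher, single-student framework that improves scribble-supervised medical image segmentation through dynamic teacher switching and reliable pixel selection.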
Details
Motivation: Scribble annotations ease the labeling burden in medical image segmentation, but their sparsity makes pseudo-label propagation noisy and makes clear anatomical boundaries hard to learn. Method: Designs the dual-teacher, single-student framework SDT-Net, with a Dynamic Teacher Switching (DTS) module that selects the more reliable teacher, a Pick Reliable Pixels (PRP) mechanism that produces high-confidence pseudo-labels, and a Hierarchical Consistency (HiCo) module that aligns features at multiple levels. Result: Experiments on the ACDC and MSCMRseg datasets show that SDT-Net reaches state-of-the-art segmentation accuracy and anatomical plausibility. Conclusion: SDT-Net makes effective use of sparse scribble annotations, raises supervision quality, and clearly improves medical image segmentation. Abstract: Scribble-supervised methods have emerged to mitigate the prohibitive annotation burden in medical image segmentation. However, the inherent sparsity of these annotations introduces significant ambiguity, which results in noisy pseudo-label propagation and hinders the learning of robust anatomical boundaries. To address this challenge, we propose SDT-Net, a novel dual-teacher, single-student framework designed to maximize supervision quality from these weak signals. Our method features a Dynamic Teacher Switching (DTS) module to adaptively select the most reliable teacher. This selected teacher then guides the student via two synergistic mechanisms: high-confidence pseudo-labels, refined by a Pick Reliable Pixels (PRP) mechanism, and multi-level feature alignment, enforced by a Hierarchical Consistency (HiCo) module. Extensive experiments on the ACDC and MSCMRseg datasets demonstrate that SDT-Net achieves state-of-the-art performance, producing more accurate and anatomically plausible segmentation.
[60] Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement
Wei Ma,Shaowu Chen,Junjie Ye,Peichang Zhang,Lei Huang
Main category: cs.CV
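TL;DR: Proposes a fuzzy-controller-based video inference enhancement framework that exploits spatiotemporal correlation to switch dynamically between models of different scales, balancing resource utilization against inference performance.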
Details
Motivation: Existing video inference enhancement methods usually overlook the trade-off between resource efficiency and inference effectiveness, leading to inefficient resource use and suboptimal inference performance. Method: Designs a fuzzy controller (FC-r) based on key system parameters and inference-related metrics, and builds a video inference enhancement framework under its guidance that exploits spatiotemporal correlation between adjacent frames and switches dynamically between models of different scales according to the device's real-time resource conditions. Result: Experiments show that the proposed method balances resource utilization and inference performance, improving inference while keeping resource use efficient. Conclusion: By introducing fuzzy control, the framework adaptively coordinates resources and performance and offers a more efficient solution for video inference. Abstract: Existing video inference (VI) enhancement methods typically aim to improve performance by scaling up model sizes and employing sophisticated network architectures. While these approaches demonstrated state-of-the-art performance, they often overlooked the trade-off of resource efficiency and inference effectiveness, leading to inefficient resource utilization and suboptimal inference performance. To address this problem, a fuzzy controller (FC-r) is developed based on key system parameters and inference-related metrics. Guided by the FC-r, a VI enhancement framework is proposed, where the spatiotemporal correlation of targets across adjacent video frames is leveraged. Given the real-time resource conditions of the target device, the framework can dynamically switch between models of varying scales during VI. Experimental results demonstrate that the proposed method effectively achieves a balance between resource utilization and inference performance.
[61] Anatomically Guided Latent Diffusion for Brain MRI Progression Modeling
Cheng Wan,Bahram Jafrasteh,Ehsan Adeli,Miaomiao Zhang,Qingyu Zhao
Main category: cs.CV
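TL;DR: This paper proposes the Anatomically Guided Latent Diffusion Model (AG-LDM), a new framework for modeling longitudinal brain MRI progression. It fuses baseline anatomy, noisy follow-up images, and clinical covariates to model disease progression in latent space with anatomical consistency, and introduces a lightweight 3D segmentation model, WarpSeg, for anatomical supervision, which greatly simplifies training and improves generation quality and biological plausibility.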
Details
Motivation: Existing methods such as BrLP suffer from complex architectures, underuse of clinical covariates, and no guarantee of anatomical consistency, making it hard to achieve both modeling accuracy and biological interpretability. Method: Proposes AG-LDM, a segmentation-guided latent diffusion model that fuses baseline anatomy, noisy follow-up states, and clinical covariates directly at the input level, and introduces WarpSeg, a lightweight 3D tissue segmentation module that provides explicit anatomical supervision for autoencoder fine-tuning and diffusion training. Result: Validated on ADNI (31,713 longitudinal pairs) and OASIS-3, AG-LDM achieves state-of-the-art image quality and reduces volumetric error by 15-20%; its sensitivity to temporal and clinical covariates is up to 31.5x that of BrLP; the generated counterfactual trajectories match typical pathological features of Alzheimer's disease, such as limbic atrophy and ventricular expansion. Conclusion: AG-LDM is an efficient, anatomically driven framework for brain MRI progression modeling that simplifies the architecture while improving generation accuracy, anatomical consistency, and biological plausibility, and is suited to neurodegenerative disease research and individualized prediction. Abstract: Accurately modeling longitudinal brain MRI progression is crucial for understanding neurodegenerative diseases and predicting individualized structural changes. Existing state-of-the-art approaches, such as Brain Latent Progression (BrLP), often use multi-stage training pipelines with auxiliary conditioning modules but suffer from architectural complexity, suboptimal use of conditional clinical covariates, and limited guarantees of anatomical consistency. We propose Anatomically Guided Latent Diffusion Model (AG-LDM), a segmentation-guided framework that enforces anatomically consistent progression while substantially simplifying the training pipeline. AG-LDM conditions latent diffusion by directly fusing baseline anatomy, noisy follow-up states, and clinical covariates at the input level, a strategy that avoids auxiliary control networks by learning a unified, end-to-end model that represents both anatomy and progression. A lightweight 3D tissue segmentation model (WarpSeg) provides explicit anatomical supervision during both autoencoder fine-tuning and diffusion model training, ensuring consistent brain tissue boundaries and morphometric fidelity. Experiments on 31,713 ADNI longitudinal pairs and zero-shot evaluation on OASIS-3 demonstrate that AG-LDM matches or surpasses more complex diffusion models, achieving state-of-the-art image quality and 15-20% reduction in volumetric errors in generated images.
AG-LDM also exhibits markedly stronger utilization of temporal and clinical covariates (up to 31.5x higher sensitivity than BrLP) and generates biologically plausible counterfactual trajectories, accurately capturing hallmarks of Alzheimer's progression such as limbic atrophy and ventricular expansion. These results highlight AG-LDM as an efficient, anatomically grounded framework for reliable brain MRI progression modeling.
[62] From Volumes to Slices: Computationally Efficient Contrastive Learning for Sequential Abdominal CT Analysis
Po-Kai Chiu,Hung-Hsuan Chen
Main category: cs.CV
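TL;DR: This paper proposes 2D-VoCo, an efficient self-supervised pre-training method for 2D CT slices that learns spatial-semantic features from unlabeled data via contrastive learning and is combined with a CNN-LSTM architecture for multi-organ injury classification. On the RSNA 2023 abdominal trauma dataset it clearly improves all reported metrics and reduces dependence on labeled data.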
Details
Motivation: Deep learning for medical image analysis is limited by the high cost of expert annotation and by data scarcity; existing 3D self-supervised methods such as VoCo partly ease this problem but are too costly in computation and memory to be practical. Method: Proposes the 2D-VoCo framework, which performs slice-level self-supervised contrastive learning on 2D CT slices to pre-train a CNN backbone, then embeds the backbone in a CNN-LSTM architecture for multi-organ injury classification. Result: On the RSNA 2023 Abdominal Trauma dataset, 2D-VoCo pre-training significantly improves mAP, precision, recall, and RSNA score over training from scratch; the code is open-sourced. Conclusion: 2D-VoCo is an efficient, practical, lightweight self-supervised pre-training scheme that eases the dependence on labeled data in medical CT analysis and improves model performance. Abstract: The requirement for expert annotations limits the effectiveness of deep learning for medical image analysis. Although 3D self-supervised methods like volume contrast learning (VoCo) are powerful and partially address the labeling scarcity issue, their high computational cost and memory consumption are barriers. We propose 2D-VoCo, an efficient adaptation of the VoCo framework for slice-level self-supervised pre-training that learns spatial-semantic features from unlabeled 2D CT slices via contrastive learning. The pre-trained CNN backbone is then integrated into a CNN-LSTM architecture to classify multi-organ injuries. On the RSNA 2023 Abdominal Trauma dataset, 2D-VoCo pre-training significantly improves mAP, precision, recall, and RSNA score over training from scratch. Our framework provides a practical method to reduce the dependency on labeled data and enhance model performance in clinical CT analysis. We release the code for reproducibility at https://github.com/tkz05/2D-VoCo-CT-Classifier
[63] LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning
Lianying Chao,Linfeng Yin,Peiyu Ren,Yifan Jiang,Qiaoyu Ren,Dingcheng Shan,Jing-cheng Pang,Sijie Wu,Xubin Li,Kai Zhang
Main category: cs.CV
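TL;DR: This paper proposes a Learnable Frame Selector (LFS) that uses caption feedback from frozen video-LLMs to select temporally diverse, event-relevant frames and thereby improve video caption quality, and builds ICH-CC, a new benchmark that better reflects human cognition.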
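As a rough sketch of how stratified selection can balance temporal coverage against event relevance, the example below splits the timeline into equal strata and keeps the highest-scoring frame in each. The per-frame `scores` are hypothetical importance values supplied by hand; in the paper they would come from a learned selector trained with caption feedback, which is not reproduced here. The function name `stratified_select` is likewise an assumption.

```python
def stratified_select(scores, num_frames):
    """Pick one frame index per temporal stratum.

    scores: per-frame importance values (hypothetical inputs).
    num_frames: number of frames to keep.
    One frame is taken from each stratum, so selections cannot
    cluster around a single high-scoring event.
    """
    total = len(scores)
    if total == 0 or num_frames <= 0:
        return []
    keep = min(num_frames, total)
    selected = []
    for i in range(keep):
        # Integer bounds of the i-th stratum; never empty
        # because keep <= total.
        start = i * total // keep
        end = (i + 1) * total // keep
        best = max(range(start, end), key=lambda t: scores[t])
        selected.append(best)
    return selected


# Six frames with made-up importance scores, three frames kept.
frame_scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.7]
print(stratified_select(frame_scores, 3))  # → [1, 3, 4]
```

Plain top-k selection on the same scores would keep frames 1, 4, and 5 and skip the middle of the video entirely; the stratified variant trades a little total score for coverage.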
Details
Motivation: Existing video captioning models mostly sample frames uniformly, ignoring the uneven distribution of events in a video, which unbalances temporal coverage and event relevance; existing benchmarks also diverge from human cognition. Method: Proposes the Learnable Frame Selector (LFS), which explicitly models temporal importance, uses a stratified strategy to guarantee temporal coverage and avoid frame clustering, and is optimized end to end with caption feedback from frozen video-LLMs. It also builds a new benchmark, ICH-CC, based on manually designed questions that reflect human-consistent understanding. Result: LFS achieves gains of up to 2.0% on VDC and over 4% on ICH-CC, and also improves video question answering. Conclusion: LFS is an efficient, easy-to-integrate solution that clearly improves detailed video captioning and moves evaluation closer to human cognition. Abstract: Video captioning models convert frames into visual tokens and generate descriptions with large language models (LLMs). Since encoding all frames is prohibitively expensive, uniform sampling is the default choice, but it enforces equal temporal coverage while ignoring the uneven distribution of events. This motivates a Learnable Frame Selector (LFS) that selects temporally diverse and event-relevant frames. LFS explicitly models temporal importance to balance temporal diversity and event relevance, and employs a stratified strategy to ensure temporal coverage while avoiding clustering. Crucially, LFS leverages caption feedback from frozen video-LLMs to learn frame selection that directly optimizes downstream caption quality. Additionally, we identify a gap between existing benchmarks and human cognition. Thus, we introduce ICH-CC, built from questions carefully designed by annotators to reflect human-consistent understanding of video. Experiments indicate that LFS consistently improves detailed video captioning across two representative community benchmarks and ICH-CC, achieving up to 2.0% gains on VDC and over 4% gains on ICH-CC. Moreover, we observe that enhanced captions with LFS lead to improved performance on video question answering. Overall, LFS provides an effective and easy-to-integrate solution for detailed video captioning.
[64] 3D Space as a Scratchpad for Editable Text-to-Image Generation
Oindrila Saha,Vojtech Krs,Radomir Mech,Subhransu Maji,Matheus Gadelha,Kevin Blackburn-Matzen
Main category: cs.CV
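TL;DR: This paper introduces the concept of a spatial scratchpad, a 3D reasoning substrate that bridges linguistic intent and image generation and substantially improves the spatial consistency and geometric-relation understanding of images generated by vision-language models.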
Details
Motivation: Existing vision-language models (VLMs) lack an explicit spatial reasoning mechanism analogous to chain-of-thought in large language models, so they struggle to model geometric relations, object identities, and compositional intent accurately. Method: Proposes the spatial scratchpad framework: it parses subjects and background elements from the text, instantiates them as editable 3D meshes, and uses agent-driven scene planning to decide placement, orientation, and viewpoint; the arrangement is finally rendered into an image in an identity-preserving way. Result: Improves text alignment on GenAI-Bench by 32%, supports intuitive and reliable 3D edits, and generates spatially consistent, visually coherent images. Conclusion: Explicit 3D reasoning gives VLMs a new paradigm in which they "think" not only in language but also in space, advancing controllable and precise image generation. Abstract: Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning. Yet, visual language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent. We introduce the concept of a spatial scratchpad -- a 3D reasoning substrate that bridges linguistic intent and image synthesis. Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images. Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation. Our results highlight a new paradigm for vision-language models that deliberate not only in language, but also in space. Code and visualizations at https://oindrilasaha.github.io/3DScratchpad/
[65] U-Harmony: Enhancing Joint Training for Segmentation Models with Universal Harmonization
Weiwei Ma,Xiaobing Yu,Peijie Qiu,Jin Yang,Pan Xiao,Xiaoqi Zhao,Xiaofeng Liu,Tomo Miyazaki,Shinichiro Omachi,Yongsong Huang
Main category: cs.CV
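TL;DR: This paper proposes Universal Harmonization (U-Harmony), a joint training method for the limited and heterogeneous medical segmentation datasets found in clinical practice. Using a domain-gated head and a feature normalization/denormalization mechanism, it learns from multi-source heterogeneous data in a single model and supports universal modality adaptation.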
Details
Motivation: Clinical medical segmentation datasets are often small and heterogeneous in source (different modalities, protocols, and anatomical targets), so existing deep learning models struggle to retain both generalization and domain-specific knowledge. Method: Proposes the U-Harmony joint training method with an integrated domain-gated head; it sequentially normalizes and then denormalizes feature distributions to suppress domain differences while preserving dataset-specific knowledge, and supports extension to new modalities and anatomical classes. Result: Experiments on cross-institutional 3D brain lesion datasets confirm its effectiveness, with clear gains in robustness and adaptability, and establish a new benchmark. Conclusion: U-Harmony provides a workable framework for building robust, extensible 3D medical image segmentation models in real clinical settings. Abstract: In clinical practice, medical segmentation datasets are often limited and heterogeneous, with variations in modalities, protocols, and anatomical targets across institutions. Existing deep learning models struggle to jointly learn from such diverse data, often sacrificing either generalization or domain-specific knowledge. To overcome these challenges, we propose a joint training method called Universal Harmonization (U-Harmony), which can be integrated into deep learning-based architectures with a domain-gated head, enabling a single segmentation model to learn from heterogeneous datasets simultaneously. By integrating U-Harmony, our approach sequentially normalizes and then denormalizes feature distributions to mitigate domain-specific variations while preserving original dataset-specific knowledge. More appealingly, our framework also supports universal modality adaptation, allowing the seamless learning of new imaging modalities and anatomical classes. Extensive experiments on cross-institutional brain lesion datasets demonstrate the effectiveness of our approach, establishing a new benchmark for robust and adaptable 3D medical image segmentation models in real-world clinical settings.
[66] Learning Consistent Taxonomic Classification through Hierarchical Reasoning
Zhenghong Li,Kecheng Zheng,Haibin Ling
Main category: cs.CV
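TL;DR: This paper proposes VL-Taxon, a framework that uses two-stage hierarchical reasoning to improve both leaf-level accuracy and hierarchical consistency of vision-language models in taxonomic classification; fine-tuned on only a small amount of data, it outperforms a baseline with far more parameters.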
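To make the notion of hierarchical consistency concrete, the sketch below checks whether a predicted coarse-to-fine label path agrees with a taxonomy, meaning each predicted label's parent in the taxonomy equals the label predicted one level above it. This is one plausible reading of consistency and an assumption on our part; the paper's exact metric definition is not given in the abstract. The tiny taxonomy, the `parent_of` mapping, and the function names are all illustrative.

```python
def is_consistent(predicted_path, parent_of):
    """Check a coarse-to-fine prediction against a taxonomy.

    predicted_path: labels ordered from coarsest to leaf level.
    parent_of: maps each label to its parent label (assumed
    taxonomy representation). The path is consistent when every
    label's taxonomy parent matches the previous prediction.
    """
    for coarse, fine in zip(predicted_path, predicted_path[1:]):
        if parent_of.get(fine) != coarse:
            return False
    return True


def consistency_rate(paths, parent_of):
    """Fraction of predicted paths that are consistent."""
    if not paths:
        return 0.0
    return sum(is_consistent(p, parent_of) for p in paths) / len(paths)


# A toy taxonomy fragment (order -> family -> species).
taxonomy = {
    "Felidae": "Carnivora",
    "Canidae": "Carnivora",
    "Felis catus": "Felidae",
    "Canis lupus": "Canidae",
}
good = ["Carnivora", "Felidae", "Felis catus"]
bad = ["Carnivora", "Canidae", "Felis catus"]  # right leaf, wrong family
print(consistency_rate([good, bad], taxonomy))  # → 0.5
```

The second path shows the failure mode the summary describes: the leaf label is right, yet a coarser level contradicts it.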
Details
Motivation: Existing vision-language models are good at visual understanding but poor at modeling hierarchical knowledge: they often identify the leaf level correctly yet err at coarser levels, and they lack any modeling of hierarchical reasoning. Method: Proposes VL-Taxon, a two-stage hierarchical reasoning framework: the first stage works top-down to improve leaf-level classification accuracy; the second stage uses the accurate leaf-level prediction to ensure consistency across the whole hierarchy; each stage first uses supervised fine-tuning to inject taxonomic knowledge and then reinforcement learning to refine reasoning and generalization. Result: On iNaturalist-2021, VL-Taxon built on Qwen2.5-VL-7B improves average leaf-level and hierarchical consistency accuracy by over 10% compared with the original 72B model, using only a small amount of real labeled data and no samples generated by other VLMs. Conclusion: VL-Taxon effectively addresses the consistency defects of VLMs in hierarchical classification and confirms that explicitly modeling hierarchical reasoning is key to improving model interpretability and robustness. Abstract: While Vision-Language Models (VLMs) excel at visual understanding, they often fail to grasp hierarchical knowledge. This leads to common errors where VLMs misclassify coarser taxonomic levels even when correctly identifying the most specific level (leaf level). Existing approaches largely overlook this issue by failing to model hierarchical reasoning. To address this gap, we propose VL-Taxon, a two-stage, hierarchy-based reasoning framework designed to improve both leaf-level accuracy and hierarchical consistency in taxonomic classification. The first stage employs a top-down process to enhance leaf-level classification accuracy. The second stage then leverages this accurate leaf-level output to ensure consistency throughout the entire taxonomic hierarchy. Each stage is initially trained with supervised fine-tuning to instill taxonomy knowledge, followed by reinforcement learning to refine the model's reasoning and generalization capabilities. Extensive experiments reveal a remarkable result: our VL-Taxon framework, implemented on the Qwen2.5-VL-7B model, outperforms its original 72B counterpart by over 10% in both leaf-level and hierarchical consistency accuracy on average on the iNaturalist-2021 dataset. Notably, this significant gain was achieved by fine-tuning on just a small subset of data, without relying on any examples generated by other VLMs.
[67] Diffusion Epistemic Uncertainty with Asymmetric Learning for Diffusion-Generated Image Detection
Yingsong Huang,Hui Guo,Jing Huang,Bing Bai,Qi Xiong
Main category: cs.CV
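TL;DR: This paper proposes DEUA, a new framework that estimates the epistemic uncertainty of diffusion models (DEU) and combines it with an asymmetric loss function to improve the detection of diffusion-generated images.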
Details
Motivation: Existing detection methods based on reconstruction error ignore the different effects that inherent data noise (aleatoric uncertainty) and insufficient model knowledge (epistemic uncertainty) have on reconstruction error, which limits detection performance. Aleatoric uncertainty does not help distinguish generated images, whereas epistemic uncertainty does. Method: Proposes the Diffusion Epistemic Uncertainty with Asymmetric Learning (DEUA) framework: 1) estimates the diffusion model's epistemic uncertainty (DEU) with the Laplace approximation to measure how close a sample lies to the manifold of diffusion-generated samples; 2) designs an asymmetric loss function to train a balanced classifier with larger margins. Result: Experiments on large-scale benchmarks confirm state-of-the-art detection performance. Conclusion: Epistemic uncertainty is a more fundamental and more discriminative detection cue than reconstruction error; the DEUA framework separates and exploits it effectively, clearly improving detection generalization and robustness. Abstract: The rapid progress of diffusion models highlights the growing need for detecting generated images. Previous research demonstrates that incorporating diffusion-based measurements, such as reconstruction error, can enhance the generalizability of detectors. However, ignoring the differing impacts of aleatoric and epistemic uncertainty on reconstruction error can undermine detection performance. Aleatoric uncertainty, arising from inherent data noise, creates ambiguity that impedes accurate detection of generated images. As it reflects random variations within the data (e.g., noise in natural textures), it does not help distinguish generated images. In contrast, epistemic uncertainty, which represents the model's lack of knowledge about unfamiliar patterns, supports detection. In this paper, we propose a novel framework, Diffusion Epistemic Uncertainty with Asymmetric Learning (DEUA), for detecting diffusion-generated images. We introduce Diffusion Epistemic Uncertainty (DEU) estimation via the Laplace approximation to assess the proximity of data to the manifold of diffusion-generated samples. Additionally, an asymmetric loss function is introduced to train a balanced classifier with larger margins, further enhancing generalizability. Extensive experiments on large-scale benchmarks validate the state-of-the-art performance of our method.
[68] Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis
James Brock,Ce Zhang,Nantheera Anantrasirichai
Main category: cs.CV
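TL;DR: This paper presents Forest-Chat, an LLM-driven agent for forest change analysis that combines a multi-level vision-language model with zero-shot change detection, supports natural language queries and multiple remote sensing image change interpretation (RSICI) tasks, and introduces the Forest-Change dataset for evaluation.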
Details
Motivation: Existing methods fall short on pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics; the integration of LLMs with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially in non-urban environments such as forests. Method: Proposes the Forest-Chat framework: 1) a vision-language backbone based on multi-level change interpretation (MCI); 2) an LLM acting as the task orchestrator; 3) a zero-shot foundation change detection model combined with an interactive point-prompt interface for fine-grained user guidance; 4) the Forest-Change dataset (bi-temporal imagery, pixel-level change masks, and multi-granularity semantic captions). Result: Forest-Chat performs strongly on joint change detection and captioning on the Forest-Change and LEVIR-MCI-Trees datasets, indicating its potential to improve accessibility, interpretability, and analytical efficiency. Conclusion: Forest-Chat demonstrates the effectiveness and feasibility of LLM-driven interactive RSICI systems for forest monitoring and offers a new paradigm for intelligent remote sensing analysis of natural environments. Abstract: The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for enhancing forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. We introduce Forest-Chat, an LLM-driven agent designed for integrated forest change analysis. The proposed framework enables natural language querying and supports multiple RSICI tasks, including change detection, change captioning, object counting, deforestation percentage estimation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, and incorporates zero-shot change detection via a foundation change detection model together with an interactive point-prompt interface to support fine-grained user guidance. To facilitate adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated through a combination of human annotation and rule-based methods. 
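One of the RSICI tasks named above, deforestation percentage estimation, can be illustrated with a minimal sketch. It assumes, hypothetically, that a pixel-level change mask is a 2D grid of 0/1 values (1 marking a changed pixel) and that the percentage is simply the share of changed pixels; the function name `deforestation_percentage` is illustrative and is not taken from Forest-Chat.

```python
def deforestation_percentage(change_mask):
    """Return the share of pixels flagged as changed, in percent.

    Assumption (not from the paper): `change_mask` is a list of rows
    holding 0 (unchanged) or 1 (changed), and the percentage is the
    plain ratio of changed pixels to all pixels.
    """
    total = sum(len(row) for row in change_mask)
    if total == 0:
        raise ValueError("change mask must contain at least one pixel")
    changed = sum(1 for row in change_mask for value in row if value)
    return 100.0 * changed / total


# Toy 4x4 change mask: 4 of the 16 pixels are marked as changed.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(deforestation_percentage(mask))  # 25.0
```

A real system would compute the mask with a change detection model and would likely account for pixel area and forest extent; the sketch only shows the final ratio.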
Experimental results demonstrate that Forest-Chat achieves strong performance on Forest-Change and on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI, for joint change detection and captioning, highlighting the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and analytical efficiency in forest change analysis.
[69] READ-Net: Clarifying Emotional Ambiguity via Adaptive Feature Recalibration for Audio-Visual Depression Detection
Chenglizhao Chen,Boze Li,Mengke Song,Dehao Feng,Xinyu Liu,Shanchen Pang,Jufeng Yang,Hui Yu
Main category: cs.CV
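TL;DR: This paper proposes READ-Net, an audio-visual depression detection framework that resolves emotional ambiguity through Adaptive Feature Recalibration (AFR), significantly improving detection accuracy and F1-score.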
Details
Motivation: Existing methods either ignore emotional cues and miss subtle depressive signals, or confuse transient emotional expressions with stable depressive symptoms, giving rise to the "Emotional Ambiguity" problem. Method: Proposes the READ-Net framework, built around Adaptive Feature Recalibration (AFR), which dynamically weights emotional features to enhance depression-related signals, selectively preserving depression-relevant cues while filtering out emotional noise. Result: Average gains of 4.55% in accuracy and 1.26% in F1-score on three public datasets, demonstrating robustness to emotional interference. Conclusion: READ-Net is the first framework to target Emotional Ambiguity explicitly; it effectively improves audio-visual depression detection and is easy to integrate into existing frameworks. Abstract: Depression is a severe global mental health issue that impairs daily functioning and overall quality of life. Although recent audio-visual approaches have improved automatic depression detection, methods that ignore emotional cues often fail to capture subtle depressive signals hidden within emotional expressions. Conversely, those incorporating emotions frequently confuse transient emotional expressions with stable depressive symptoms in feature representations, a phenomenon termed Emotional Ambiguity, thereby leading to detection errors. To address this critical issue, we propose READ-Net, the first audio-visual depression detection framework explicitly designed to resolve Emotional Ambiguity through Adaptive Feature Recalibration (AFR). The core insight of AFR is to dynamically adjust the weights of emotional features to enhance depression-related signals. Rather than merely overlooking or naively combining emotional information, READ-Net innovatively identifies and preserves depressive-relevant cues within emotional features, while adaptively filtering out irrelevant emotional noise. This recalibration strategy significantly clarifies feature representations, and effectively mitigates the persistent challenge of emotional interference. Additionally, READ-Net can be easily integrated into existing frameworks for improved performance. Extensive evaluations on three publicly available datasets show that READ-Net outperforms state-of-the-art methods, with average gains of 4.55% in accuracy and 1.26% in F1-score, demonstrating its robustness to emotional disturbances and improving audio-visual depression detection.
[70] Mirai: Autoregressive Visual Generation Needs Foresight
Yonghao Yu,Lang Huang,Zerun Wang,Runyi Li,Toshihiko Yamasaki
Main category: cs.CV
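TL;DR: This paper proposes Mirai, a framework that introduces "foresight" signals into autoregressive visual generation to improve global coherence and convergence speed, without changing the model architecture or adding inference overhead.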
Details
Motivation: Autoregressive visual generators rely only on causal next-token supervision, which leads to poor global coherence and slow convergence; the authors investigate whether "foresight" signals from later tokens can improve performance. Method: Designs the general Mirai framework with explicit foresight (Mirai-E, using information from multiple future positions of unidirectional representations) and implicit foresight (Mirai-I, obtaining foresight from matched bidirectional representations), both aligned to the internal representations on the 2D image grid. Result: Mirai significantly accelerates convergence (e.g., up to 10× for LlamaGen-B) and improves generation quality (FID on ImageNet drops from 5.34 to 4.34). Conclusion: Injecting foresight signals aligned with internal representations is effective for autoregressive visual generation; Mirai offers an efficient improvement with no extra inference overhead. Abstract: Autoregressive (AR) visual generators model images as sequences of discrete tokens and are trained with next token likelihood. This strict causality supervision optimizes each step only by its immediate next token, which diminishes global coherence and slows convergence. We ask whether foresight, training signals that originate from later tokens, can help AR visual generation. We conduct a series of controlled diagnostics along the injection level, foresight layout, and foresight source axes, unveiling a key insight: aligning foresight to AR models' internal representation on the 2D image grids improves causality modeling. We formulate this insight with Mirai (meaning "future" in Japanese), a general framework that injects future information into AR training with no architecture change and no extra inference overhead: Mirai-E uses explicit foresight from multiple future positions of unidirectional representations, whereas Mirai-I leverages implicit foresight from matched bidirectional representations. Extensive experiments show that Mirai significantly accelerates convergence and improves generation quality. For instance, Mirai can speed up LlamaGen-B's convergence by up to 10× and reduce the generation FID from 5.34 to 4.34 on the ImageNet class-condition image generation benchmark. Our study highlights that visual autoregressive models need foresight.
[71] LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models
Mingyang Xie,Numair Khan,Tianfu Wang,Naina Dhingra,Seonghyeon Nam,Haitao Yang,Zhuo Hui,Christopher Metzler,Andrea Vedaldi,Hamed Pirsiavash,Lei Luo
Main category: cs.CV
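TL;DR: This paper proposes a new method that conditions video re-rendering on the implicit geometric knowledge of a large 4D reconstruction model; by jointly using its latent representations and the source camera poses, it maintains geometric consistency while avoiding the errors of explicit depth estimation, achieving state-of-the-art performance.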
Details
Motivation: Existing video re-rendering methods suffer from two kinds of problems: geometrically unconditioned methods are prone to drift and deformation under viewpoint changes, while geometrically conditioned methods that rely on explicit depth estimation and reconstruction are susceptible to depth and calibration errors. Method: Uses the geometric knowledge implicit in the latent space of a pretrained large 4D reconstruction model as a conditioning signal, together with the source camera poses, to guide a diffusion model for video re-rendering, without explicit depth maps or 3D reconstruction. Result: Achieves state-of-the-art performance on the video re-rendering task, mitigating geometric drift and deformation under viewpoint changes and improving robustness to depth errors. Conclusion: Implicit geometric priors (from the latents of a 4D model) are more robust and flexible than explicit depth and are an effective form of geometric conditioning for video re-rendering. Abstract: Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task. Project webpage is https://lavr-4d-scene-rerender.github.io/
[72] A comprehensive overview of deep learning models for object detection from videos/images
Sukana Zulfqar,Sadia Saeed,M. Azam Zia,Anjum Ali,Faisal Mehmood,Abid Ali
Main category: cs.CV
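TL;DR: This paper surveys deep learning-based object detection for video and image surveillance, covering CNN, GAN, and temporal fusion methods, with a focus on architectural innovations, applications of generative models, and strategies for handling dynamic environments, occlusion, and real-time requirements.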
Details
Motivation: Object detection in surveillance scenarios faces challenges such as dynamic environments, occlusion, lighting variation, and real-time requirements, so modern deep learning methods need to be reviewed systematically to improve robustness and accuracy. Method: By classifying core architectures, data processing strategies, and surveillance-specific challenges, the review covers CNN-based detectors, GAN-assisted approaches, and temporal fusion techniques, and analyzes preprocessing pipelines, feature extraction, datasets, and performance comparisons. Result: Summarizes the current effectiveness of semantic object detection, shows the role of generative models in reconstructing missing frames, reducing occlusion, and normalizing illumination, and evaluates the various methods on standard datasets. Conclusion: Modern deep learning has significantly improved object detection in surveillance; future research should focus on low latency, efficient computation, and spatiotemporal learning. Abstract: Object detection in video and image surveillance is a well-established yet rapidly evolving task, strongly influenced by recent deep learning advancements. This review summarises modern techniques by examining architectural innovations, generative model integration, and the use of temporal information to enhance robustness and accuracy. Unlike earlier surveys, it classifies methods based on core architectures, data processing strategies, and surveillance-specific challenges such as dynamic environments, occlusions, lighting variations, and real-time requirements. The primary goal is to evaluate the current effectiveness of semantic object detection, while secondary aims include analysing deep learning models and their practical applications. The review covers CNN-based detectors, GAN-assisted approaches, and temporal fusion methods, highlighting how generative models support tasks such as reconstructing missing frames, reducing occlusions, and normalising illumination. It also outlines preprocessing pipelines, feature extraction progress, benchmarking datasets, and comparative evaluations. Finally, emerging trends in low-latency, efficient, and spatiotemporal learning approaches are identified for future research.
[73] Transfer Learning from One Cancer to Another via Deep Learning Domain Adaptation
Justin Cheung,Samuel Savine,Calvin Nguyen,Lin Lu,Alhassan S. Yasin
Main category: cs.CV
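TL;DR: This paper studies how domain adaptation (specifically DANN) improves the cross-domain classification performance of CNN models on unseen cancer types in cancer histopathology, and analyzes the effect of stain normalization on different target domains as well as model interpretability.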
Details
Motivation: Supervised deep learning models generalize poorly outside their training distribution, and annotated medical images are scarce, so domain adaptation is needed to transfer knowledge across cancer types (e.g., lung, colon, breast, and kidney adenocarcinomas). Method: Uses ResNet50 as the base model and compares the cross-domain performance of single-source supervised training, multi-model ensembling, and a domain adversarial neural network (DANN); evaluates the effect of stain normalization on each target domain; uses Integrated Gradients for feature attribution. Result: DANN substantially improves cross-domain accuracy (e.g., labeled breast and colon adapted to unlabeled lung reaches 95.56%); the effect of stain normalization varies by target domain (accuracy drops for lung and rises for breast and colon); Integrated Gradients shows that DANN attends to biologically meaningful regions such as densely packed nuclei. Conclusion: DANN effectively mitigates domain shift in cancer pathology images and learns clinically relevant features; stain normalization should be applied cautiously, depending on the target domain; the approach offers a practical path for medical AI with few labels. Abstract: Supervised deep learning models often achieve excellent performance within their training distribution but struggle to generalize beyond it. In cancer histopathology, for example, a convolutional neural network (CNN) may classify cancer severity accurately for cancer types represented in its training data, yet fail on related but unseen types. Although adenocarcinomas from different organs share morphological features that might support limited cross-domain generalization, addressing domain shift directly is necessary for robust performance. Domain adaptation offers a way to transfer knowledge from labeled data in one cancer type to unlabeled data in another, helping mitigate the scarcity of annotated medical images. This work evaluates cross-domain classification performance among lung, colon, breast, and kidney adenocarcinomas. A ResNet50 trained on any single adenocarcinoma achieves over 98% accuracy on its own domain but shows minimal generalization to others. Ensembling multiple supervised models does not resolve this limitation. In contrast, converting the ResNet50 into a domain adversarial neural network (DANN) substantially improves performance on unlabeled target domains. A DANN trained on labeled breast and colon data and adapted to unlabeled lung data reaches 95.56% accuracy. We also examine the impact of stain normalization on domain adaptation. Its effects vary by target domain: for lung, accuracy drops from 95.56% to 66.60%, while for breast and colon targets, stain normalization boosts accuracy from 49.22% to 81.29% and from 78.48% to 83.36%, respectively. 
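The size of these stain-normalization effects is easier to compare as differences in percentage points. The short sketch below only redoes that arithmetic on the accuracies quoted in the abstract; the dictionary layout and the helper name `accuracy_change` are illustrative choices and do not come from the paper.

```python
# Accuracies (%) quoted in the abstract, per target domain:
# (without stain normalization, with stain normalization).
reported = {
    "lung": (95.56, 66.60),
    "breast": (49.22, 81.29),
    "colon": (78.48, 83.36),
}


def accuracy_change(before, after):
    """Change in accuracy, in percentage points, rounded to two decimals."""
    return round(after - before, 2)


changes = {target: accuracy_change(*pair) for target, pair in reported.items()}
for target, delta in changes.items():
    print(f"{target}: {delta:+.2f} percentage points")
```

Under these figures, stain normalization costs roughly 29 points on the lung target, gains roughly 32 points on breast, and gains under 5 points on colon.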
Finally, using Integrated Gradients reveals that DANNs consistently attribute importance to biologically meaningful regions such as densely packed nuclei, indicating that the model learns clinically relevant features and can apply them to unlabeled cancer types.
[74] FeedbackSTS-Det: Sparse Frames-Based Spatio-Temporal Semantic Feedback Network for Infrared Small Target Detection
Yian Huang,Qing Qin,Aji Mao,Xiangyu Qiu,Liang Xu,Xian Zhang,Zhenming Peng
Main category: cs.CV
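TL;DR: Proposes FeedbackSTS-Det, a sparse frames-based spatio-temporal semantic feedback network for infrared small target detection under complex backgrounds, which uses a closed-loop semantic association mechanism and a sparse semantic module to model long-range dependencies efficiently and suppress false alarms.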
Details
Motivation: Existing infrared small target detection methods are inefficient at modeling long-range dependencies and insufficiently robust to dynamic interference, and they perform poorly under complex backgrounds and low signal-to-noise ratios in particular. Method: Designs a spatio-temporal semantic feedback strategy with forward and backward refinement modules, and introduces an embedded sparse semantic module (SSM) for structured sparse temporal modeling, achieving closed-loop semantic association between encoder and decoder and low-cost capture of long-range dependencies. Result: Experiments on multiple benchmark datasets show that the method significantly outperforms existing methods and effectively suppresses false alarms, and that its consistent training and inference pipeline improves robustness and transferability. Conclusion: With its novel feedback-based sparse spatio-temporal modeling framework, FeedbackSTS-Det achieves efficient and robust infrared small target detection and offers a new solution for multi-frame ISTD. Abstract: Infrared small target detection (ISTD) under complex backgrounds remains a critical yet challenging task, primarily due to the extremely low signal-to-clutter ratio, persistent dynamic interference, and the lack of distinct target features. While multi-frame detection methods leverage temporal cues to improve upon single-frame approaches, existing methods still struggle with inefficient long-range dependency modeling and insufficient robustness. To overcome these issues, we propose a novel scheme for ISTD, realized through a sparse frames-based spatio-temporal semantic feedback network named FeedbackSTS-Det. The core of our approach is a novel spatio-temporal semantic feedback strategy with a closed-loop semantic association mechanism, which consists of paired forward and backward refinement modules that work cooperatively across the encoder and decoder. Moreover, both modules incorporate an embedded sparse semantic module (SSM), which performs structured sparse temporal modeling to capture long-range dependencies with low computational cost. This integrated design facilitates robust implicit inter-frame registration and continuous semantic refinement, effectively suppressing false alarms. Furthermore, our overall procedure maintains a consistent training-inference pipeline, which ensures reliable performance transfer and increases model robustness. Extensive experiments on multiple benchmark datasets confirm the effectiveness of FeedbackSTS-Det. Code and models are available at: https://github.com/IDIP-Lab/FeedbackSTS-Det.
[75] RegFreeNet: A Registration-Free Network for CBCT-based 3D Dental Implant Planning
Xinquan Yang,Xuguang Li,Mianjie Zheng,Xuefen Liu,Kun Tang,Kian Ming Lim,He Meng,Jianfeng Ren,Linlin Shen
Main category: cs.CV
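TL;DR: This paper proposes a registration-free approach to dental implant position prediction: by masking the implant region in post-implantation CBCT, it builds the large-scale multi-center dataset ImplantFairy directly from single-phase CBCT data, and it designs the slope-aware network RegFreeNet (with an NDP module and a slope prediction branch), which achieves state-of-the-art performance on multiple datasets.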
Details
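Motivation: Existing methods rely on registering post-implantation and pre-implantation CBCT to obtain implant labels, which is time-consuming and limited by registration accuracy and by the lack of multi-center data; dentists, however, can judge the implant position from the texture of the neighboring teeth alone, which suggests that a model can still be trained after the implant region is masked. Method: Proposes the paradigm of masking the implant region in post-implantation CBCT and builds ImplantFairy, a public dataset of 1622 CBCT scans; designs the slope-aware RegFreeNet, which includes a neighboring distance perception (NDP) module that adaptively extracts tooth area variation features and an implant slope prediction branch that provides additional supervision. Result: RegFreeNet achieves state-of-the-art performance on ImplantFairy and two public datasets, validating the effectiveness and generalization of the registration-free paradigm. Conclusion: The registration-free training paradigm with masked implants substantially lowers the barrier to data preparation and supports the construction of large-scale multi-center datasets; combined with the slope-aware design, RegFreeNet improves the accuracy and robustness of implant localization and offers a new direction for automated clinical implant planning. Abstract: As commercial surgical guide design software usually does not support the export of implant position for pre-implantation data, existing methods have to scan the post-implantation data and map the implant to pre-implantation space to get the label of implant position for training. Such a process is time-consuming and heavily relies on the accuracy of the registration algorithm. Moreover, not all hospitals have paired CBCT data, limiting the construction of multi-center datasets. Inspired by the way dentists determine the implant position based on the neighboring tooth texture, we found that even if the implant area is masked, it will not affect the determination of the implant position. Therefore, we propose to mask the implants in the post-implantation data so that any CBCT containing the implants can be used as training data. This paradigm enables us to discard the registration process and makes it possible to construct a large-scale multi-center implant dataset. On this basis, we propose ImplantFairy, a comprehensive, publicly accessible dental implant dataset with voxel-level 3D annotations of 1622 CBCT scans. Furthermore, according to the area variation characteristics of the tooth's spatial structure and the slope information of the implant, we designed a slope-aware implant position prediction network. 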
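The masking paradigm described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: the volume and the implant mask are plain nested lists of shape depth x height x width, the mask is assumed to be given, and both the helper name `mask_implant_region` and the fill value of 0 are hypothetical choices.

```python
def mask_implant_region(volume, implant_mask, fill_value=0):
    """Return a copy of `volume` with implant voxels replaced by `fill_value`.

    Assumptions (not from the paper): `volume` and `implant_mask` are nested
    lists with the same depth x height x width shape, and the mask holds a
    truthy value wherever a voxel belongs to the implant.
    """
    return [
        [
            [fill_value if m else v for v, m in zip(vol_row, mask_row)]
            for vol_row, mask_row in zip(vol_slice, mask_slice)
        ]
        for vol_slice, mask_slice in zip(volume, implant_mask)
    ]


# Toy 2x2x2 "scan": the bright voxels (3000) stand in for an implant.
volume = [[[100, 3000], [120, 3000]], [[110, 130], [115, 125]]]
implant_mask = [[[0, 1], [0, 1]], [[0, 0], [0, 0]]]
masked = mask_implant_region(volume, implant_mask)
print(masked)  # [[[100, 0], [120, 0]], [[110, 130], [115, 125]]]
```

The implant mask doubles as the position label, so a masked scan and its mask form a training pair without any registration to a pre-implantation scan.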
Specifically, a neighboring distance perception (NDP) module is designed to adaptively extract tooth area variation features, and an implant slope prediction branch assists the network in learning more robust features through additional implant supervision information. Extensive experiments conducted on ImplantFairy and two public datasets demonstrate that the proposed RegFreeNet achieves state-of-the-art performance.
[76] LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval
Chao Gao,Siqiao Xue,Yimin Peng,Jiwen Fu,Tingyi Gu,Shanshan Li,Fan Zhou
Main category: cs.CV
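TL;DR: This paper presents LookBench, a fashion image retrieval benchmark for real e-commerce settings that includes real and AI-generated images, time-stamped samples, and a fine-grained attribute taxonomy; it poses a significant challenge to existing models, and the data, code, and models are publicly released.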
Details
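Motivation: Existing fashion image retrieval benchmarks do not reflect the challenges of real e-commerce, such as continuously updated catalogs, images from multiple sources (including AI-generated ones), and fine-grained needs (single items versus full outfits), so an evaluation benchmark that is closer to practice and can evolve over time is needed. Method: Builds the LookBench benchmark: it combines recent product images from live websites with AI-generated fashion images and time-stamps them to support contamination-aware evaluation; it covers single-item and outfit-level retrieval based on a purpose-built fine-grained attribute taxonomy; it defines a semi-annual update mechanism and progressively harder task variants. Result: Many strong baseline models score below 60% Recall@1 on LookBench; the authors' proprietary model performs best and their open-source model ranks second, with both also reaching state-of-the-art results on the legacy Fashion200K benchmark; the benchmark data, evaluation code, leaderboard, and trained models are publicly released. Conclusion: LookBench is a live, holistic, and challenging new benchmark that pushes fashion retrieval research toward realistic, up-to-date, and reproducible evaluation, and it provides a sustainable, evolving standard for evaluating future models. Abstract: In this paper, we present LookBench (We use the term "look" to reflect retrieval that mirrors how people shop -- finding the exact item, a close substitute, or a visually consistent alternative.), a live, holistic and challenging benchmark for fashion image retrieval in real e-commerce settings. LookBench includes both recent product images sourced from live websites and AI-generated fashion images, reflecting contemporary trends and use cases. Each test sample is time-stamped and we intend to update the benchmark periodically, enabling contamination-aware evaluation aligned with declared training cutoffs. Grounded in our fine-grained attribute taxonomy, LookBench covers single-item and outfit-level retrieval across. Our experiments reveal that LookBench poses a significant challenge for strong baselines, with many models achieving below 60% Recall@1. Our proprietary model achieves the best performance on LookBench, and we release an open-source counterpart that ranks second, with both models attaining state-of-the-art results on legacy Fashion200K evaluations. LookBench is designed to be updated semi-annually with new test samples and progressively harder task variants, providing a durable measure of progress. We publicly release our leaderboard, dataset, evaluation code, and trained models.
[77] Context Patch Fusion With Class Token Enhancement for Weakly Supervised Semantic Segmentation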
Yiyang Fu,Hui Li,Wangyu Wu
Main category: cs.CV
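TL;DR: This paper proposes the CPF-CTE framework, which models contextual relations among image patches with a CF-BiLSTM module and introduces learnable class tokens to strengthen semantic discrimination, significantly improving weakly supervised semantic segmentation.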
Details
Motivation: Existing weakly supervised semantic segmentation methods neglect the complex contextual dependencies among image patches, resulting in incomplete local representations and limited segmentation accuracy. Method: Proposes the CPF-CTE framework, which comprises a Contextual-Fusion BiLSTM (CF-BiLSTM) module that models spatial dependencies between patches and enables bidirectional information flow, and learnable class tokens that dynamically encode class-specific semantics. Result: On the PASCAL VOC 2012 and MS COCO 2014 datasets, CPF-CTE consistently surpasses prior WSSS methods. Conclusion: By fusing spatial context with class semantics, CPF-CTE produces richer and more accurate image representations and effectively improves weakly supervised semantic segmentation. Abstract: Weakly Supervised Semantic Segmentation (WSSS), which relies only on image-level labels, has attracted significant attention for its cost-effectiveness and scalability. Existing methods mainly enhance inter-class distinctions and employ data augmentation to mitigate semantic ambiguity and reduce spurious activations. However, they often neglect the complex contextual dependencies among image patches, resulting in incomplete local representations and limited segmentation accuracy. To address these issues, we propose the Context Patch Fusion with Class Token Enhancement (CPF-CTE) framework, which exploits contextual relations among patches to enrich feature representations and improve segmentation. At its core, the Contextual-Fusion Bidirectional Long Short-Term Memory (CF-BiLSTM) module captures spatial dependencies between patches and enables bidirectional information flow, yielding a more comprehensive understanding of spatial correlations. This strengthens feature learning and segmentation robustness. Moreover, we introduce learnable class tokens that dynamically encode and refine class-specific semantics, enhancing discriminative capability. By effectively integrating spatial and semantic cues, CPF-CTE produces richer and more accurate representations of image content. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 validate that CPF-CTE consistently surpasses prior WSSS methods.
[78] HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
Haowei Zhang,Shudong Yang,Jinlan Fu,See-Kiong Ng,Xipeng Qiu
Main category: cs.CV
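TL;DR: This paper proposes HERMES, a new training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic analysis of attention, it treats the KV cache as a hierarchical memory framework and reuses a compact KV cache at inference time, enabling efficient streaming understanding with low GPU memory overhead and markedly improving response speed and accuracy.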
Details
Motivation: Existing MLLMs struggle to maintain stable understanding performance, real-time responses, and low GPU memory overhead simultaneously when processing streaming video inputs, so a new solution is needed to support efficient interaction with continuous video streams. Method: Based on an analysis of the attention mechanism, models the KV cache as a hierarchical memory structure that holds video information at multiple granularities, and reuses a compact KV cache at inference time to achieve efficient streaming understanding without auxiliary computation or model training. Result: Compared with the prior SOTA, HERMES achieves 10× faster TTFT (time to first token), maintains superior or comparable accuracy while reducing video tokens by up to 68%, and gains up to 11.4% on streaming datasets. Conclusion: HERMES is an efficient, training-free architecture for streaming video understanding that delivers fast responses and accurate understanding under resource constraints, offering a practical solution for real-time video interaction in real-world applications. Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions, which achieves 10× faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
[79] DeepMoLM: Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling
Jing Lan,Hexiao Ding,Hongzhao Chen,Yufeng Jiang,Nga-Chun Ng,Gwing Kei Yip,Gerald W. Y. Cheng,Yunlin Mao,Jing Cai,Liang-ting Lin,Jung Sun Yoo
Main category: cs.CV
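TL;DR: DeepMoLM is a dual-view molecular language model that aligns high-resolution molecular images with geometric invariants derived from molecular conformations; by fusing visual and geometric information, it models 3D structure and stereochemistry accurately and clearly outperforms generalist baselines on molecular image understanding and property prediction tasks.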
Details
Motivation: Most existing molecular language models rely on strings or graph structures, while vision-language models often miss stereochemical details and struggle to map continuous 3D structures into discrete tokens, so a physically interpretable, geometrically grounded modeling approach is needed. Method: Proposes the DeepMoLM framework: 1) it processes 1024 × 1024 high-resolution molecular images; 2) it encodes conformer neighborhoods as discrete Extended 3-Dimensional Fingerprints (E3DFP); 3) it fuses the visual and geometric streams with cross-attention, enabling physically grounded generation without atom coordinates. Result: A 12.3% relative METEOR gain on PubChem captioning; valid numeric outputs for all property queries, with MAE of 13.64 g/mol on molecular weight and 37.89 on complexity; on ChEBI-20 image-to-text generation, it exceeds generalist baselines and matches state-of-the-art vision-language models. Conclusion: DeepMoLM bridges molecular image understanding and 3D geometric constraints, offering a new paradigm for AI-driven drug discovery and chemical literature mining that combines accuracy, interpretability, and generalization. Abstract: AI models for drug discovery and chemical literature mining must interpret molecular images and generate outputs consistent with 3D geometry and stereochemistry. Most molecular language models rely on strings or graphs, while vision-language models often miss stereochemical details and struggle to map continuous 3D structures into discrete tokens. We propose DeepMoLM: Deep Molecular Language Modeling, a dual-view framework that grounds high-resolution molecular images in geometric invariants derived from molecular conformations. DeepMoLM preserves high-frequency evidence from 1024 × 1024 inputs, encodes conformer neighborhoods as discrete Extended 3-Dimensional Fingerprints, and fuses visual and geometric streams with cross-attention, enabling physically grounded generation without atom coordinates. DeepMoLM improves PubChem captioning with a 12.3% relative METEOR gain over the strongest generalist baseline while staying competitive with specialist methods. It produces valid numeric outputs for all property queries and attains MAE 13.64 g/mol on Molecular Weight and 37.89 on Complexity in the specialist setting. On ChEBI-20 description generation from images, it exceeds generalist baselines and matches state-of-the-art vision-language models. Code is available at https://github.com/1anj/DeepMoLM.
[80] Safeguarding Facial Identity against Diffusion-based Face Swapping via Cascading Pathway Disruption
Liqin Wang,Qianyue Hu,Wei Lu,Xiangyang Luo
Main category: cs.CV
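TL;DR: This paper proposes VoidFace, a systemic defense against diffusion-based face swapping systems that injects perturbations at critical bottlenecks to trigger cascading disruption, effectively protecting privacy and identity security.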
Details
Motivation: Existing proactive defenses are ineffective for face swapping, mainly because they overlook the structural resilience of face swapping systems and their unique static conditional guidance mechanism. Method: VoidFace views face swapping as a coupled identity pathway and injects perturbations at critical bottlenecks to trigger cascading disruption: localization disruption and identity erasure, decoupling of attention mechanisms, and corruption of intermediate diffusion features, with an adversarial search in the latent manifold guided by a perceptual adaptive strategy that balances attack strength against image quality. Result: Experiments show that VoidFace outperforms existing defenses across a range of diffusion-based face swapping systems, and that the adversarial faces it produces have better visual quality. Conclusion: VoidFace is an efficient, robust, and visually imperceptible systemic defense that significantly strengthens privacy and identity protection against face swapping in the era of diffusion models. Abstract: The rapid evolution of diffusion models has democratized face swapping but also raises concerns about privacy and identity security. Existing proactive defenses, often adapted from image editing attacks, prove ineffective in this context. We attribute this failure to an oversight of the structural resilience and the unique static conditional guidance mechanism inherent in face swapping systems. To address this, we propose VoidFace, a systemic defense method that views face swapping as a coupled identity pathway. By injecting perturbations at critical bottlenecks, VoidFace induces cascading disruption throughout the pipeline. Specifically, we first introduce localization disruption and identity erasure to degrade physical regression and semantic embeddings, thereby impairing the accurate modeling of the source face. We then intervene in the generative domain by decoupling attention mechanisms to sever identity injection, and corrupting intermediate diffusion features to prevent the reconstruction of source identity. To ensure visual imperceptibility, we perform adversarial search in the latent manifold, guided by a perceptual adaptive strategy to balance attack potency with image quality. Extensive experiments show that VoidFace outperforms existing defenses across various diffusion-based swapping models, while producing adversarial faces with superior visual quality.
[81] Enhancing Text-to-Image Generation via End-Edge Collaborative Hybrid Super-Resolution
Chongbin Yi,Yuxin Liang,Ziqi Zhou,Peng Yang
Main category: cs.CV
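TL;DR: Proposes an end-edge collaborative generation-enhancement framework that uses a region-aware hybrid super-resolution strategy to reduce latency while maintaining high-quality image generation.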
Details
Motivation: Existing super-resolution methods trade off detail recovery against computational cost: lightweight methods are efficient but recover detail poorly, while diffusion models give high quality at high latency, making it hard to meet the demands of high-resolution text-to-image generation in resource-constrained settings. Method: Builds an end-edge collaborative framework in which the edge side generates a low-resolution image with adaptively selected denoising steps and super-resolution scales; the image is then split into patches and processed by a region-aware hybrid super-resolution policy, with a diffusion model recovering detail in foreground regions and a lightweight model efficiently upscaling background regions, and the patches are finally stitched into the high-resolution image. Result: Experiments show that, compared with baselines, the system reduces service latency by 33% while maintaining competitive image quality. Conclusion: The proposed collaborative framework effectively balances quality against latency in high-resolution text-to-image generation and offers a practical option for deploying AIGC in resource-constrained environments. Abstract: Artificial Intelligence-Generated Content (AIGC) has made significant strides, with high-resolution text-to-image (T2I) generation becoming increasingly critical for improving users' Quality of Experience (QoE). Although resource-constrained edge computing adequately supports fast low-resolution T2I generations, achieving high-resolution output still faces the challenge of ensuring image fidelity at the cost of latency. To address this, we first investigate the performance of super-resolution (SR) methods for image enhancement, confirming a fundamental trade-off that lightweight learning-based SR struggles to recover fine details, while diffusion-based SR achieves higher fidelity at a substantial computational cost. Motivated by these observations, we propose an end-edge collaborative generation-enhancement framework. Upon receiving a T2I generation task, the system first generates a low-resolution image based on adaptively selected denoising steps and super-resolution scales at the edge side, which is then partitioned into patches and processed by a region-aware hybrid SR policy. This policy applies a diffusion-based SR model to foreground patches for detail recovery and a lightweight learning-based SR model to background patches for efficient upscaling, ultimately stitching the enhanced ones into the high-resolution image. Experiments show that our system reduces service latency by 33% compared with baselines while maintaining competitive image quality.
[82] SimD3: A Synthetic drone Dataset with Payload and Bird Distractor Modeling for Robust Detection
Ami Pandat,Kanyala Muvva,Punna Rajasekhar,Gopika Vinod,Rohit Shukla
Main category: cs.CV
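TL;DR: This paper presents SimD3, a large-scale high-fidelity synthetic dataset for robust drone detection in complex aerial environments, and pairs it with an improved YOLOv5 model (Yolov5m+C3b), validating its effectiveness for small-object detection and cross-domain generalization on both synthetic and real data.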
Details
Motivation: Annotated real-world data is scarce, drone appearance varies widely, and drones are easily confused with distractors such as birds, which makes reliable drone detection difficult. Method: Builds the SimD3 synthetic dataset, which includes drones with heterogeneous payloads, multiple bird species as distractors, and diverse UE5 environments (with controllable weather, lighting, and trajectories) captured with a 360 six-camera rig; within the YOLOv5 framework, replaces the C3 blocks with C3b modules to form the Yolov5m+C3b model; evaluates systematically on synthetic data, combined data, and multiple unseen real-world datasets. Result: SimD3 effectively supports small-object drone detection; Yolov5m+C3b consistently outperforms the baseline in both in-domain and cross-dataset evaluations. Conclusion: SimD3 provides a practical and robust data foundation for training and benchmarking drone detection models, particularly in diverse and challenging scenarios. Abstract: Reliable drone detection is challenging due to limited annotated real-world data, large appearance variability, and the presence of visually similar distractors such as birds. To address these challenges, this paper introduces SimD3, a large-scale high-fidelity synthetic dataset designed for robust drone detection in complex aerial environments. Unlike existing synthetic drone datasets, SimD3 explicitly models drones with heterogeneous payloads, incorporates multiple bird species as realistic distractors, and leverages diverse Unreal Engine 5 environments with controlled weather, lighting, and flight trajectories captured using a 360 six-camera rig. Using SimD3, we conduct an extensive experimental evaluation within the YOLOv5 detection framework, including an attention-enhanced variant termed Yolov5m+C3b, where standard bottleneck-based C3 blocks are replaced with C3b modules. Models are evaluated on synthetic data, combined synthetic and real data, and multiple unseen real-world benchmarks to assess robustness and generalization. Experimental results show that SimD3 provides effective supervision for small-object drone detection and that Yolov5m+C3b consistently outperforms the baseline across in-domain and cross-dataset evaluations. These findings highlight the utility of SimD3 for training and benchmarking robust drone detection models under diverse and challenging conditions.
[83] ReinPath: A Multimodal Reinforcement Learning Approach for Pathology
Kangcheng Zhou,Jun Jiang,Qing Zhang,Shuang Zheng,Qingli Li,Shugong Xu
Main category: cs.CV
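TL;DR: This paper proposes a multimodal pathology large language model with strong reasoning capabilities. Using a semantic reward strategy and a newly constructed high-quality pathology visual question answering (VQA) dataset, it markedly improves text generation accuracy and complex reasoning, outperforms existing methods even when trained on a small share of the data, and matches CLIP on zero-shot image classification.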
Details
Motivation: Existing multimodal pathology methods have limited interpretability because they lack high-quality datasets that support explicit reasoning and rely on simple reasoning processes. Method: Designs a semantic reward strategy integrated with group relative policy optimization, constructs a high-quality pathology visual question answering (VQA) dataset dedicated to complex reasoning tasks, and proposes a new multimodal pathology large language model. Result: On the self-built VQA dataset, the method clearly outperforms existing state-of-the-art methods, reaching better performance with only 20% of the training data; on the downstream zero-shot image classification task, its performance is comparable to CLIP. Conclusion: By combining a high-quality dataset with a semantic reward mechanism, the proposed model strengthens the reasoning ability and interpretability of multimodal pathology models and offers a new paradigm for computational pathology. Abstract: Interpretability is significant in computational pathology, leading to the development of multimodal information integration from histopathological image and corresponding text data. However, existing multimodal methods have limited interpretability due to the lack of high-quality datasets that support explicit reasoning and inference, and due to simple reasoning processes. To address the above problems, we introduce a novel multimodal pathology large language model with strong reasoning capabilities. To improve the generation of accurate and contextually relevant textual descriptions, we design a semantic reward strategy integrated with group relative policy optimization. We construct a high-quality pathology visual question answering (VQA) dataset, specifically designed to support complex reasoning tasks. Comprehensive experiments conducted on this dataset demonstrate that our method outperforms state-of-the-art methods, even when trained with only 20% of the data. Our method also achieves comparable performance on downstream zero-shot image classification task compared with CLIP.
[84] Using Multi-Instance Learning to Identify Unique Polyps in Colon Capsule Endoscopy Images
Puneet Sharma,Kristian Dalsbø Hindberg,Eibe Frank,Benedicte Schelde-Olesen,Ulrik Deding
Main category: cs.CV
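TL;DR: This study models the identification of unique polyps in colon capsule endoscopy images as a multi-instance learning task and uses a multi-instance verification framework that combines attention mechanisms with self-supervised learning, which significantly improves identification performance.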
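To make the idea of attention over a bag of images concrete, here is a minimal sketch of distance-based attention pooling with an L1 distance. It assumes that instances closer to the query embedding receive larger softmax weights; that weighting rule, the function name `distance_based_attention`, and the toy embeddings are illustrative assumptions rather than the paper's DBA definition.

```python
import numpy as np

def distance_based_attention(query, bag):
    """Pool a bag of instance embeddings with weights from L1 distance to the query.

    Assumption: closer instances get larger weights (softmax over negative distance).
    Returns the attention weights and the weighted bag embedding.
    """
    distances = np.abs(bag - query).sum(axis=1)   # L1 distance per instance
    logits = -distances
    weights = np.exp(logits - logits.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights, weights @ bag

# Toy embeddings: one query polyp image and a target bag of three images.
query = np.array([0.0, 0.0])
bag = np.array([[0.0, 0.1], [1.0, 1.0], [3.0, 3.0]])
weights, pooled = distance_based_attention(query, bag)
print(weights, pooled)
```

A verification head could then compare `pooled` with `query` to decide whether the query polyp already appears in the bag.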
Details
Motivation: The large number of images, the cognitive load on clinicians, and ambiguous frame labels make it hard for medical personnel to accurately identify unique polyps in colon capsule endoscopy, so an automated solution is needed. Method: Proposes a multi-instance verification (MIV) framework that introduces variance-excited multi-head attention (VEMA) and distance-based attention (DBA), combined with SimCLR self-supervised learning, using ConvNeXt as the backbone to produce robust feature representations. Result: Experiments on a dataset of 1912 polyps from 754 patients show that attention mechanisms significantly improve performance; DBA L1 with a SimCLR-pretrained ConvNeXt reaches a test accuracy of 86.26% and an AUC of 0.928. Conclusion: Combining multi-instance learning with self-supervised learning can effectively advance automated analysis of colon capsule endoscopy images and has the potential to extend to other medical imaging applications. Abstract: Identifying unique polyps in colon capsule endoscopy (CCE) images is a critical yet challenging task for medical personnel due to the large volume of images, the cognitive load it creates for clinicians, and the ambiguity in labeling specific frames. This paper formulates this problem as a multi-instance learning (MIL) task, where a query polyp image is compared with a target bag of images to determine uniqueness. We employ a multi-instance verification (MIV) framework that incorporates attention mechanisms, such as variance-excited multi-head attention (VEMA) and distance-based attention (DBA), to enhance the model's ability to extract meaningful representations. Additionally, we investigate the impact of self-supervised learning using SimCLR to generate robust embeddings. Experimental results on a dataset of 1912 polyps from 754 patients demonstrate that attention mechanisms significantly improve performance, with DBA L1 achieving the highest test accuracy of 86.26% and a test AUC of 0.928 using a ConvNeXt backbone with SimCLR pretraining. This study underscores the potential of MIL and self-supervised learning in advancing automated analysis of Colon Capsule Endoscopy images, with implications for broader medical imaging applications.
[85] Does medical specialization of VLMs enhance discriminative power?: A comprehensive investigation through feature distribution analysis
Keita Takeda,Tomoya Sakai
Main category: cs.CV
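TL;DR: This study analyzes the feature representations learned by open-source medical vision-language models (VLMs). Medical VLMs do extract discriminative features, but a contextually enriched general-purpose VLM (such as LLM2CLIP) yields higher-quality features, suggesting that improving the text encoder matters more than training on large amounts of medical images. The study also warns that background biases such as text in images can affect model inference.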
Details
Motivation: The feature representations of existing medical VLMs are underexplored, and standard classification accuracy does not reveal whether they learn diagnostically relevant, lesion-specific features; understanding these representations is crucial for revealing medical image structure and improving downstream tasks. Method: Compares the image feature distributions of several representative medical and non-medical VLMs on lesion classification datasets across multiple modalities, focusing on the effect of medical specialization and the role of contextual enrichment techniques such as LLM2CLIP. Result: Medical VLMs extract discriminative features that are effective for medical classification; however, contextually enriched non-medical VLMs (such as LLM2CLIP) produce more refined feature representations; non-medical VLMs are vulnerable to background biases such as text overlaid on images. Conclusion: When developing medical VLMs, improving the text encoder matters more than large-scale training on medical images; model selection should reflect downstream task requirements, and the risk of inference bias from image backgrounds (such as text) should be kept in mind. Abstract: This study investigates the feature representations produced by publicly available open source medical vision-language models (VLMs). While medical VLMs are expected to capture diagnostically relevant features, their learned representations remain underexplored, and standard evaluations like classification accuracy do not fully reveal if they acquire truly discriminative, lesion-specific features. Understanding these representations is crucial for revealing medical image structures and improving downstream tasks in medical image analysis. This study aims to investigate the feature distributions learned by medical VLMs and evaluate the impact of medical specialization. We analyze the feature distributions extracted by several representative medical VLMs across lesion classification datasets covering multiple image modalities. These distributions were compared with those of non-medical VLMs to assess the effect of domain-specific medical training. Our experiments showed that medical VLMs can extract discriminative features that are effective for medical classification tasks. Moreover, it was found that non-medical VLMs recently improved through contextual enrichment, such as LLM2CLIP, produce more refined feature representations. Our results imply that enhancing the text encoder is more crucial than training intensively on medical images when developing medical VLMs. Notably, non-medical models are particularly vulnerable to biases introduced by overlaid text strings on images. These findings underscore the need for careful consideration of model selection according to downstream tasks, as well as of potential risks in inference due to background biases such as textual information in images.
[86] M2I2HA: A Multi-modal Object Detection Method Based on Intra- and Inter-Modal Hypergraph Attention
Xiaofan Yang,Yubin Liu,Wei Pan,Guoqing Chu,Junming Zhang,Jie Zhao,Zhuoqi Man,Xuanming Cao
Main category: cs.CV
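TL;DR: This paper proposes M2I2HA, a multi-modal perception network based on hypergraph theory. By using hypergraphs to model high-order relations within each modality and to align and fuse features across modalities, it addresses the limitations of existing CNNs, Transformers, and SSMs in multi-modal detection and reaches SOTA performance on multiple datasets.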
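For readers unfamiliar with hypergraphs, the sketch below shows one round of mean-aggregation message passing on a hypergraph: node features are averaged into hyperedges and then averaged back to nodes, so every node mixes information with all nodes that share a hyperedge with it (a many-to-many relation rather than a pairwise one). This is a generic, simplified propagation rule with a hypothetical function name and toy data, not the M2I2HA modules, and it assumes every node belongs to at least one hyperedge.

```python
import numpy as np

def hypergraph_propagate(features, incidence):
    """One node -> hyperedge -> node averaging step.

    features:  (num_nodes, dim) node features
    incidence: (num_nodes, num_edges) binary matrix, 1 if node is in hyperedge
    Assumes no empty hyperedges and no isolated nodes.
    """
    edge_degree = incidence.sum(axis=0)   # number of nodes in each hyperedge
    node_degree = incidence.sum(axis=1)   # number of hyperedges containing each node
    edge_features = (incidence.T @ features) / edge_degree[:, None]
    return (incidence @ edge_features) / node_degree[:, None]

# Toy hypergraph: 4 nodes, 2 hyperedges ({0, 1, 2} and {1, 2, 3}).
incidence = np.array([[1, 0], [1, 1], [1, 1], [0, 1]], dtype=float)
features = np.array([[1.0], [2.0], [3.0], [6.0]])
updated = hypergraph_propagate(features, incidence)
print(updated.ravel())
```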
Details
Motivation: Existing approaches (CNNs, Transformers, SSMs) each have shortcomings in multi-modal object detection: CNNs have limited receptive fields, Transformers have high computational complexity and only model pairwise relations, and SSMs break 2D spatial structure; in addition, cross-modal alignment and high-order dependency modeling remain insufficient. Method: Proposes the M2I2HA network with three core modules: 1) an Intra-Hypergraph Enhancement module that uses hypergraphs to model global many-to-many high-order relations within a single modality; 2) an Inter-Hypergraph Fusion module that aligns and fuses cross-modal features; 3) an M2-FullPAD module that supports adaptive multi-level fusion of multi-modal features and improves data flow. Result: On multiple public multi-modal object detection datasets, M2I2HA clearly outperforms the baselines and reaches state-of-the-art performance. Conclusion: A hypergraph-based modeling paradigm captures the complex high-order dependencies within and across modalities more effectively, offering a new direction and an effective architecture for robust multi-modal perception. Abstract: Recent advances in multi-modal detection have significantly improved detection accuracy in challenging environments (e.g., low light, overexposure). By integrating RGB with modalities such as thermal and depth, multi-modal fusion increases data redundancy and system robustness. However, significant challenges remain in effectively extracting task-relevant information both within and across modalities, as well as in achieving precise cross-modal alignment. While CNNs excel at feature extraction, they are limited by constrained receptive fields, strong inductive biases, and difficulty in capturing long-range dependencies. Transformer-based models offer global context but suffer from quadratic computational complexity and are confined to pairwise correlation modeling. Mamba and other State Space Models (SSMs), on the other hand, are hindered by their sequential scanning mechanism, which flattens 2D spatial structures into 1D sequences, disrupting topological relationships and limiting the modeling of complex higher-order dependencies. To address these issues, we propose a multi-modal perception network based on hypergraph theory called M2I2HA. Our architecture includes an Intra-Hypergraph Enhancement module to capture global many-to-many high-order relationships within each modality, and an Inter-Hypergraph Fusion module to align, enhance, and fuse cross-modal features by bridging configuration and spatial gaps between data sources. We further introduce an M2-FullPAD module to enable adaptive multi-level fusion of multi-modal enhanced features within the network, while also enhancing data distribution and flow across the architecture. Extensive object detection experiments on multiple public datasets against baselines demonstrate that M2I2HA achieves state-of-the-art performance in multi-modal object detection tasks.
[87] FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes
Jiaxuan Liu,Yang Xiang,Han Zhao,Xiangang Li,Zhenhua Ling
Main category: cs.CV
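TL;DR: This paper proposes FunCineForge, a pipeline for building large-scale movie dubbing datasets together with an MLLM-based dubbing model, addressing the shortcomings of existing methods in data scale, annotation quality, and adaptability to complex scenes.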
Details
Motivation: Existing dubbing methods are limited by the lack of high-quality multimodal datasets and by relying only on the lip region for audio-visual alignment, which leads to poor performance in complex cinematic scenes. Method: Proposes FunCineForge, which includes an end-to-end data production pipeline for building a large-scale, richly annotated Chinese television dubbing dataset, and a dubbing model based on a multimodal large language model (MLLM) designed for diverse cinematic scenes. Result: Experiments show that the model outperforms existing state-of-the-art methods in monologue, narration, dialogue, and multi-speaker scenes, with better audio quality, lip sync, timbre transfer, and instruction following. Conclusion: FunCineForge advances automatic movie dubbing and is particularly advantageous in complex real-world scenes. Abstract: Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and exhibit suboptimal performance in lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using the pipeline, we construct the first Chinese television dubbing dataset with rich annotations, and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following. Code and demos are available at https://anonymous.4open.science/w/FunCineForge.
[88] Reconstruction-Anchored Diffusion Model for Text-to-Motion Generation
Yifei Liu,Changxing Ding,Ling Guo,Huaiguang Jiang,Qiong Cao
Main category: cs.CV
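TL;DR: This paper proposes the Reconstruction-Anchored Diffusion Model (RAM), which introduces a motion latent space as intermediate supervision and a Reconstructive Error Guidance (REG) mechanism to mitigate two problems in text-driven human motion generation: the representational gap of pre-trained text encoders and error propagation during denoising.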
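The abstract for this entry describes REG as amplifying the residual between the current prediction and a reconstruction of the previous estimate. One simple way to write such an amplification is a linear extrapolation, similar in form to classifier-free guidance. The sketch below shows that form only; the update rule, the `scale` parameter, and the function name `residual_guidance` are assumptions for illustration, since the listing does not give the paper's exact formula.

```python
import numpy as np

def residual_guidance(current_pred, reconstructed_prev, scale=1.5):
    """Extrapolate from the reconstructed estimate along the residual.

    scale = 1.0 returns the current prediction unchanged;
    scale > 1.0 amplifies the change relative to the reconstruction.
    """
    residual = current_pred - reconstructed_prev
    return reconstructed_prev + scale * residual

# Toy motion features for one denoising step (hypothetical values).
current_pred = np.array([1.0, 2.0])
reconstructed_prev = np.array([0.0, 1.0])
guided = residual_guidance(current_pred, reconstructed_prev, scale=2.0)
print(guided)
```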
Details
Motivation: Current motion diffusion models have two main limitations: a representational gap caused by pre-trained text encoders that lack motion-specific information, and error propagation during iterative denoising. Method: RAM has two parts: 1) it co-trains a motion reconstruction branch with self-regularization and a motion-centric latent alignment objective; 2) at test time it applies Reconstructive Error Guidance (REG), which exploits the diffusion model's self-correction ability by reconstructing the previous estimate at each denoising step and amplifying the residual between the current prediction and the reconstructed estimate to suppress error propagation. Result: Extensive experiments show that RAM significantly improves performance and reaches the state of the art. Conclusion: RAM effectively addresses the two challenges of representational gap and error propagation in text-to-motion generation, offering a new approach to high-quality, controllable human motion generation. Abstract: Diffusion models have seen widespread adoption for text-driven human motion generation and related tasks due to their impressive generative capabilities and flexibility. However, current motion diffusion models face two major limitations: a representational gap caused by pre-trained text encoders that lack motion-specific information, and error propagation during the iterative denoising process. This paper introduces Reconstruction-Anchored Diffusion Model (RAM) to address these challenges. First, RAM leverages a motion latent space as intermediate supervision for text-to-motion generation. To this end, RAM co-trains a motion reconstruction branch with two key objective functions: self-regularization to enhance the discrimination of the motion space and motion-centric latent alignment to enable accurate mapping from text to the motion latent space. Second, we propose Reconstructive Error Guidance (REG), a testing-stage guidance mechanism that exploits the diffusion model's inherent self-correction ability to mitigate error propagation. At each denoising step, REG uses the motion reconstruction branch to reconstruct the previous estimate, reproducing the prior error patterns. By amplifying the residual between the current prediction and the reconstructed estimate, REG highlights the improvements in the current prediction. Extensive experiments demonstrate that RAM achieves significant improvements and state-of-the-art performance. Our code will be released.
[89] Synthetic Data Augmentation for Multi-Task Chinese Porcelain Classification: A Stable Diffusion Approach
Ziyao Ling,Silvia Mirri,Paola Salomoni,Giovanni Delnevo
Main category: cs.CV
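TL;DR: This study examines whether synthetic images generated with Stable Diffusion and LoRA can effectively augment limited real datasets to improve CNN-based multi-task classification of Chinese porcelain. Synthetic data help most on type classification and give smaller gains on the dynasty and kiln tasks, indicating that effectiveness depends on how well the generated features match the visual signatures each task relies on.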
Details
Motivation: Training data for archaeological artifacts, especially rare types of Chinese porcelain, are scarce, which limits the use of deep learning in this field. Method: Generates synthetic porcelain images with Stable Diffusion + LoRA and mixes them with real data at different ratios (95:5 and 90:10); uses MobileNetV3 with transfer learning for four classification tasks (dynasty, glaze, kiln, and type). Result: Type classification F1-macro improves by 5.5% (90:10 ratio), dynasty and kiln classification improve by 3-4%, and no gain is explicitly reported for glaze classification; the effect is task dependent. Conclusion: Synthetic data can effectively support specific archaeological classification tasks, but archaeological authenticity must be balanced against data diversity; the study provides practical guidelines for using generative AI in archaeological research. Abstract: The scarcity of training data presents a fundamental challenge in applying deep learning to archaeological artifact classification, particularly for the rare types of Chinese porcelain. This study investigates whether synthetic images generated through Stable Diffusion with Low-Rank Adaptation (LoRA) can effectively augment limited real datasets for multi-task CNN-based porcelain classification. Using MobileNetV3 with transfer learning, we conducted controlled experiments comparing models trained on pure real data against those trained on mixed real-synthetic datasets (95:5 and 90:10 ratios) across four classification tasks: dynasty, glaze, kiln and type identification. Results demonstrate task-specific benefits: type classification showed the most substantial improvement (5.5% F1-macro increase with 90:10 ratio), while dynasty and kiln tasks exhibited modest gains (3-4%), suggesting that synthetic augmentation effectiveness depends on the alignment between generated features and task-relevant visual signatures. Our work contributes practical guidelines for deploying generative AI in archaeological research, demonstrating both the potential and limitations of synthetic data when archaeological authenticity must be balanced with data diversity.
[90] UniRoute: Unified Routing Mixture-of-Experts for Modality-Adaptive Remote Sensing Change Detection
Qingling Shu,Sibao Chen,Wei Lu,Zhihui You,Chengzhuang Liu
Main category: cs.CV
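TL;DR: This paper proposes the UniRoute framework, which uses conditional routing to achieve adaptive learning for multi-modal remote sensing change detection. It comprises two routing modules, AR2-MoE and MDR-MoE, plus the CASD self-distillation strategy, and experiments on multiple datasets show that it combines high accuracy with high efficiency.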
Details
Motivation: Existing remote sensing change detection methods rely on specialized models and are hard to extend to multi-modal Earth observation; homogeneous and heterogeneous change detection place different demands on spatial detail and contextual information, and conventional difference operations tend to introduce artifacts in cross-modal or geometrically misaligned scenarios. Method: Proposes the unified UniRoute framework: 1) the AR2-MoE module performs receptive-field-adaptive routing to disentangle local detail from global semantics; 2) the MDR-MoE module dynamically selects the most suitable fusion primitive at each pixel; 3) the CASD strategy improves training stability in data-scarce heterogeneous settings through multi-level consistency constraints. Result: Experiments on five public datasets show that UniRoute, under unified deployment, achieves strong overall performance and a favorable accuracy-efficiency trade-off. Conclusion: By modeling feature extraction and fusion as conditional routing problems, UniRoute improves the generalization and robustness of remote sensing change detection across modalities and scenes, providing a new paradigm for modality-adaptive Earth observation. Abstract: Current remote sensing change detection (CD) methods mainly rely on specialized models, which limits the scalability toward modality-adaptive Earth observation. For homogeneous CD, precise boundary delineation relies on fine-grained spatial cues and local pixel interactions, whereas heterogeneous CD instead requires broader contextual information to suppress speckle noise and geometric distortions. Moreover, difference operators (e.g., subtraction) work well for aligned homogeneous images but introduce artifacts in cross-modal or geometrically misaligned scenarios. Across different modality settings, specialized models based on static backbones or fixed difference operations often prove insufficient. To address this challenge, we propose UniRoute, a unified framework for modality-adaptive learning by reformulating feature extraction and fusion as conditional routing problems. We introduce an Adaptive Receptive Field Routing MoE (AR2-MoE) module to disentangle local spatial details from global semantic context, and a Modality-Aware Difference Routing MoE (MDR-MoE) module to adaptively select the most suitable fusion primitive at each pixel. In addition, we propose a Consistency-Aware Self-Distillation (CASD) strategy that stabilizes unified training under data-scarce heterogeneous settings by enforcing multi-level consistency. Extensive experiments on five public datasets demonstrate that UniRoute achieves strong overall performance, with a favorable accuracy-efficiency trade-off under a unified deployment setting.
[91] UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking
Qihua Liang,Liang Chen,Yaozong Zheng,Jian Nong,Zhiyi Mo,Bineng Zhong
Main category: cs.CV
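TL;DR: This paper proposes UBATrack, a new multi-modal object tracking framework based on the Mamba state space model. With a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer, it models cross-modal dependencies and spatio-temporal cues effectively, reaches SOTA performance on multiple RGB-T/D/E benchmarks, and avoids full-parameter fine-tuning, which improves training efficiency.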
Details
Motivation: Existing general-purpose multi-modal trackers unify tasks across modalities through prompt learning but fail to capture spatio-temporal cues effectively, which limits tracking robustness and generalization. Method: Proposes the UBATrack framework with two core modules: 1) a Spatio-temporal Mamba Adapter (STMA) that uses Mamba's long-sequence modeling capability to jointly model cross-modal dependencies and spatio-temporal visual cues in an adapter-tuning manner; 2) a Dynamic Multi-modal Feature Mixer that strengthens multi-modal representation across multiple feature dimensions. The overall design uses lightweight adapter fine-tuning and avoids full-parameter training. Result: UBATrack surpasses current state-of-the-art methods on mainstream RGB-T, RGB-D, and RGB-E tracking datasets, including LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent, with clear gains in tracking accuracy and robustness. Conclusion: UBATrack demonstrates the feasibility of building efficient, lightweight multi-modal tracking frameworks with strong representational capacity on state space models such as Mamba, offering a new direction for multi-modal temporal modeling while balancing performance and training efficiency. Abstract: Multi-modal object tracking has attracted considerable attention by integrating multiple complementary inputs (e.g., thermal, depth, and event data) to achieve outstanding performance. Although current general-purpose multi-modal trackers primarily unify various modal tracking tasks (i.e., RGB-Thermal infrared, RGB-Depth or RGB-Event tracking) through prompt learning, they still overlook the effective capture of spatio-temporal cues. In this work, we introduce a novel multi-modal tracking framework based on a Mamba-style state space model, termed UBATrack. Our UBATrack comprises two simple yet effective modules: a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer. The former leverages Mamba's long-sequence modeling capability to jointly model cross-modal dependencies and spatio-temporal visual cues in an adapter-tuning manner. The latter further enhances multi-modal representation capacity across multiple feature dimensions to improve tracking robustness. In this way, UBATrack eliminates the need for costly full-parameter fine-tuning, thereby improving the training efficiency of multi-modal tracking algorithms. Experiments show that UBATrack outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks, achieving outstanding results on the LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent datasets.
[92] LocBAM: Advancing 3D Patch-Based Image Segmentation by Integrating Location Context
Donnate Hooft,Stefan M. Fischer,Cosmin Bercea,Jan C. Peeken,Julia A. Schnabel
Main category: cs.CV
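TL;DR: This paper proposes the LocBAM attention mechanism, which explicitly models the spatial location of each patch within a 3D medical image to compensate for the global anatomical context that conventional patch-based methods ignore. Experiments on multiple datasets show that it stabilizes training and improves segmentation, and that it outperforms CoordConv, particularly under low patch-to-volume coverage.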
Details
Motivation: Conventional patch-based 3D medical image segmentation methods ignore where a patch lies within the full volume, so anatomical context is missing and segmentation accuracy suffers. Method: Proposes LocBAM, a new attention mechanism that explicitly encodes and processes patch location information and is integrated into a patch-based segmentation framework. Result: Experiments on the BTCV, AMOS22, and KiTS23 datasets show that LocBAM stabilizes training and improves segmentation performance, with a larger advantage when patch-to-volume coverage is low; it also consistently outperforms CoordConv. Conclusion: Location context is crucial for patch-based 3D medical image segmentation; LocBAM is an effective, general, and easily integrated location-aware attention module. Abstract: Patch-based methods are widely used in 3D medical image segmentation to address memory constraints in processing high-resolution volumetric data. However, these approaches often neglect the patch's location within the global volume, which can limit segmentation performance when anatomical context is important. In this paper, we investigate the role of location context in patch-based 3D segmentation and propose a novel attention mechanism, LocBAM, that explicitly processes spatial information. Experiments on BTCV, AMOS22, and KiTS23 demonstrate that incorporating location context stabilizes training and improves segmentation performance, particularly under low patch-to-volume coverage where global context is missing. Furthermore, LocBAM consistently outperforms classical coordinate encoding via CoordConv. Code is publicly available at https://github.com/compai-lab/2026-ISBI-hooft
[93] Symmetry Informative and Agnostic Feature Disentanglement for 3D Shapes
Tobias Weißberg,Weikang Wang,Paul Roetzer,Nafie El Amrani,Florian Bernard
Main category: cs.CV
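TL;DR: This paper proposes a feature disentanglement approach that is simultaneously symmetry informative and symmetry agnostic, combined with a feature refinement technique that makes the symmetry features more robust; it outperforms existing methods on intrinsic symmetry detection, left/right classification, and shape matching.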
Details
Motivation: Existing methods that extract symmetry features from semantics-aware descriptors (such as χ) output only a one-dimensional symmetry feature, ignore other semantic information, and produce noisy results that are prone to misclassification. Method: Proposes a feature disentanglement approach that splits shape descriptors into a symmetry-informative part and a symmetry-agnostic part, and designs a feature refinement technique to make the symmetry features more robust. Result: On intrinsic symmetry detection, left/right classification, and shape matching, both qualitative and quantitative experiments show that the proposed method outperforms various SOTA methods. Conclusion: Jointly modeling symmetry-informative and symmetry-agnostic features, together with feature refinement, can significantly improve performance on symmetry-related tasks in shape analysis. Abstract: Shape descriptors, i.e., per-vertex features of 3D meshes or point clouds, are fundamental to shape analysis. Historically, various handcrafted geometry-aware descriptors and feature refinement techniques have been proposed. Recently, several studies have initiated a new research direction by leveraging features from image foundation models to create semantics-aware descriptors, demonstrating advantages across tasks like shape matching, editing, and segmentation. Symmetry, another key concept in shape analysis, has also attracted increasing attention. Consequently, constructing symmetry-aware shape descriptors is a natural progression. Although the recent method $χ$ (Wang et al., 2025) successfully extracted symmetry-informative features from semantic-aware descriptors, its features are only one-dimensional, neglecting other valuable semantic information. Furthermore, the extracted symmetry-informative feature is usually noisy and yields small misclassified patches. To address these gaps, we propose a feature disentanglement approach which is simultaneously symmetry informative and symmetry agnostic. Further, we propose a feature refinement technique to improve the robustness of predicted symmetry informative features. Extensive experiments, including intrinsic symmetry detection, left/right classification, and shape matching, demonstrate the effectiveness of our proposed framework compared to various state-of-the-art methods, both qualitatively and quantitatively.
[94] POTR: Post-Training 3DGS Compression
Bert Ramlot,Martijn Courteaux,Peter Lambert,Glenn Van Wallendael
Main category: cs.CV
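TL;DR: Proposes POTR, a post-training compression codec for 3D Gaussian Splatting that achieves more efficient storage and inference through a new pruning approach and a method for recomputing lighting coefficients.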
Details
Motivation: 3D Gaussian Splatting (3DGS) performs well in 3D scene reconstruction but has high storage requirements, so effective compression is needed. Method: Introduces a modified rasterizer for parallel pruning, proposes a training-free method that recomputes lighting coefficients to reduce their entropy, and adds a fine-tuning scheme to further improve performance. Result: Yields 2-4x fewer splats than existing methods, 1.5-2x faster inference, and an increase in lighting coefficient sparsity from 70% to 97%, while maintaining high quality. Conclusion: Even without fine-tuning, POTR outperforms other post-training compression methods in both rate-distortion performance and inference speed. Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a promising contender to Neural Radiance Fields (NeRF) in 3D scene reconstruction and real-time novel view synthesis. 3DGS outperforms NeRF in training and inference speed but has substantially higher storage requirements. To remedy this downside, we propose POTR, a post-training 3DGS codec built on two novel techniques. First, POTR introduces a novel pruning approach that uses a modified 3DGS rasterizer to efficiently calculate every splat's individual removal effect simultaneously. This technique results in 2-4x fewer splats than other post-training pruning techniques and as a result also significantly accelerates inference, with experiments demonstrating 1.5-2x faster inference than other compressed models. Second, we propose a novel method to recompute lighting coefficients, significantly reducing their entropy without using any form of training. Our fast and highly parallel approach especially increases AC lighting coefficient sparsity, with experiments demonstrating increases from 70% to 97%, with minimal loss in quality. Finally, we extend POTR with a simple fine-tuning scheme to further enhance pruning, inference, and rate-distortion performance. Experiments demonstrate that POTR, even without fine-tuning, consistently outperforms all other post-training compression techniques in both rate-distortion performance and inference speed.
[95] TempViz: On the Evaluation of Temporal Knowledge in Text-to-Image Models
Carolin Holtermann,Nina Krebs,Anne Lauscher
Main category: cs.CV
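TL;DR: This paper introduces the TempViz dataset, the first systematic evaluation of how well text-to-image (T2I) models understand temporal knowledge. It finds that current models are generally weak at temporal reasoning and that existing automated evaluation methods are unreliable, highlighting the need for further research in this direction.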
Details
Motivation: Although temporal knowledge has been studied extensively in natural language processing, how text-to-image generation models represent and handle temporal information has not been studied systematically. Method: Builds TempViz, the first T2I benchmark dataset for evaluating temporal knowledge (7.9k prompts and more than 600 reference images), and evaluates five mainstream T2I models across five temporal knowledge categories; also compares several automated evaluation methods against human evaluation. Result: Human evaluation shows that no model exceeds 75% accuracy across the temporal categories; the automated evaluation methods correlate poorly with human judgments and cannot reliably measure temporal understanding. Conclusion: Current T2I models have weak temporal awareness and reasoning, and effective automated evaluation is lacking; further research is needed on both modeling and evaluation. Abstract: Time alters the visual appearance of entities in our world, like objects, places, and animals. Thus, for accurately generating contextually-relevant images, knowledge and reasoning about time can be crucial (e.g., for generating a landscape in spring vs. in winter). Yet, although substantial work exists on understanding and improving temporal knowledge in natural language processing, research on how temporal phenomena appear and are handled in text-to-image (T2I) models remains scarce. We address this gap with TempViz, the first data set to holistically evaluate temporal knowledge in image generation, consisting of 7.9k prompts and more than 600 reference images. Using TempViz, we study the capabilities of five T2I models across five temporal knowledge categories. Human evaluation shows that temporal competence is generally weak, with no model exceeding 75% accuracy across categories. Towards larger-scale studies, we also examine automated evaluation methods, comparing several established approaches against human judgments. However, none of these approaches provides a reliable assessment of temporal cues - further indicating the pressing need for future research on temporal knowledge in T2I.
[96] Multimodal system for skin cancer detection
Volodymyr Sydorskyi,Igor Krashenyi,Oleksii Yakubenko
Main category: cs.CV
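TL;DR: Proposes a multi-modal melanoma detection system based on conventional photo images and tabular metadata. By combining images with clinical information in a three-stage prediction pipeline, it improves detection performance and offers good clinical applicability and scalability.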
Details
Motivation: Existing melanoma detection relies on dermoscopic images and specialized equipment, which limits its use in broader clinical settings, so a more accessible method that does not depend on special equipment is needed. Method: Builds a multi-modal neural network system that fuses conventional photo images with tabular metadata such as patient demographics and lesion characteristics, uses a two-step model to handle cases with or without metadata, refines predictions with a three-stage pipeline and boosting algorithms, and applies specific techniques to cope with data imbalance. Result: On a highly imbalanced dataset, achieves a partial ROC AUC of 0.18068 (maximum 0.2) and a top-15 retrieval sensitivity of 0.78371; ablation experiments validate the chosen vision architectures, boosting algorithms, and loss functions. Conclusion: A multi-stage, multi-modal system combining conventional photos with metadata significantly improves melanoma detection performance and provides a scalable, equipment-independent solution that helps bridge the gap between specialized and general clinical practice. Abstract: Melanoma detection is vital for early diagnosis and effective treatment. While deep learning models on dermoscopic images have shown promise, they require specialized equipment, limiting their use in broader clinical settings. This study introduces a multi-modal melanoma detection system using conventional photo images, making it more accessible and versatile. Our system integrates image data with tabular metadata, such as patient demographics and lesion characteristics, to improve detection accuracy. It employs a multi-modal neural network combining image and metadata processing and supports a two-step model for cases with or without metadata. A three-stage pipeline further refines predictions by boosting algorithms and enhancing performance. To address the challenges of a highly imbalanced dataset, specific techniques were implemented to ensure robust training. An ablation study evaluated recent vision architectures, boosting algorithms, and loss functions, achieving a peak Partial ROC AUC of 0.18068 (0.2 maximum) and top-15 retrieval sensitivity of 0.78371. Results demonstrate that integrating photo images with metadata in a structured, multi-stage pipeline yields significant performance improvements. This system advances melanoma detection by providing a scalable, equipment-independent solution suitable for diverse healthcare environments, bridging the gap between specialized and general clinical practices.
[97] PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
Jianshu Zhang,Chengxuan Qian,Haosen Sun,Haoran Lu,Dingcheng Wang,Letian Xue,Han Liu
Main category: cs.CV
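TL;DR: This paper introduces the Progress-Bench benchmark for systematically evaluating task progress reasoning in vision-language models (VLMs) and explores a human-inspired two-stage reasoning paradigm, finding that current VLMs still fall clearly short on long-horizon dynamic reasoning.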
Details
Motivation: Existing VLMs are good at static visual understanding but struggle to infer task completion progress from partial observations; systematic evaluation methods and effective progress reasoning mechanisms are lacking. Method: Builds the Progress-Bench benchmark; proposes a training-free structured prompting method and a training-based method built on the ProgressLM-45K dataset (ProgressLM-3B); tests 14 VLMs across modalities, viewpoints, and unanswerable cases. Result: Most VLMs perform poorly on progress estimation, are sensitive to demonstration modality and viewpoint changes, and handle unanswerable questions badly; the training-based ProgressLM-3B achieves consistent gains even at a small scale and generalizes to unseen tasks. Conclusion: Current VLMs do not yet have reliable long-horizon task progress reasoning; a structured reasoning paradigm, especially the training-based variant, is an effective way to improve it; the error pattern analysis points to clear directions for future work. Abstract: Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.
[98] MTFlow: Time-Conditioned Flow Matching for Microtubule Segmentation in Noisy Microscopy Images
Sidi Mohamed Sid El Moctar,Achraf Ait Laydi,Yousef El Mourabit,Hélène Bouvrais
Main category: cs.CV
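TL;DR: This paper proposes MTFlow, a new time-conditioned flow-matching model for accurate segmentation of microtubule networks. The model learns a vector field that iteratively refines noisy masks, combines a U-Net backbone with temporal embeddings to improve robustness to curved filaments, dense crossings, and noisy images, and is validated for generalization and accuracy on several biomedical datasets.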
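The idea of transporting a noisy mask along a time-conditioned vector field can be sketched with plain Euler integration. In the toy example below, an oracle velocity for a straight-line path stands in for the learned network; the function names, the 8x8 mask, and the step count are all hypothetical, and a real model would predict the velocity from the image and the time embedding rather than from the ground truth.

```python
import numpy as np

def oracle_velocity(x, t, target):
    """Stand-in for a learned, time-conditioned network v(x, t).

    For a straight-line path from noise to target, the velocity that
    reaches the target at t = 1 is (target - x) / (1 - t).
    """
    return (target - x) / (1.0 - t)

def euler_transport(x0, target, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * oracle_velocity(x, t, target)
    return x

rng = np.random.default_rng(0)
target_mask = np.zeros((8, 8))
target_mask[3, :] = 1.0                   # a single horizontal "filament"
noisy_mask = rng.normal(size=(8, 8))      # start from Gaussian noise
refined = euler_transport(noisy_mask, target_mask)
print(np.abs(refined - target_mask).max())
```

Each intermediate `x` is a partially refined mask, which is what makes this kind of trajectory-based refinement inspectable step by step.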
Details
Motivation: Segmenting microtubule networks is challenging because of filament curvature, dense crossings, and image noise, and more accurate, robust methods are needed to support studies of their organization and dynamics. Method: Proposes the MTFlow model, which uses a time-conditioned flow-matching framework to learn to transport noisy masks step by step along a vector field toward the ground truth; combines a U-Net backbone with temporal embeddings to model how uncertainty is resolved along filament boundaries. Result: Trained and evaluated on synthetic and real microtubule datasets, with generalization assessed on public datasets of curvilinear structures such as retinal blood vessels and nerves; segmentation accuracy is comparable to current state-of-the-art methods and better than manual or semi-automatic annotation. Conclusion: MTFlow is an efficient, interpretable, and well-generalizing microtubule segmentation tool that offers a new paradigm for analyzing filamentous biological structures. Abstract: Microtubules are cytoskeletal filaments that play essential roles in many cellular processes and are key therapeutic targets in several diseases. Accurate segmentation of microtubule networks is critical for studying their organization and dynamics but remains challenging due to filament curvature, dense crossings, and image noise. We present MTFlow, a novel time-conditioned flow-matching model for microtubule segmentation. Unlike conventional U-Net variants that predict masks in a single pass, MTFlow learns vector fields that iteratively transport noisy masks toward the ground truth, enabling interpretable, trajectory-based refinement. Our architecture combines a U-Net backbone with temporal embeddings, allowing the model to capture the dynamics of uncertainty resolution along filament boundaries. We trained and evaluated MTFlow on synthetic and real microtubule datasets and assessed its generalization capability on public biomedical datasets of curvilinear structures such as retinal blood vessels and nerves. MTFlow achieves competitive segmentation accuracy comparable to state-of-the-art models, offering a powerful and time-efficient tool for filamentous structure analysis with more precise annotations than manual or semi-automatic approaches.
[99] GAT-NeRF: Geometry-Aware-Transformer Enhanced Neural Radiance Fields for High-Fidelity 4D Facial Avatars
Zhe Chang,Haodong Jin,Ying Sun,Yan Song,Hui Yu
Main category: cs.CV
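TL;DR: Proposes GAT-NeRF, a Geometry-Aware-Transformer enhanced neural radiance field framework for high-fidelity 4D dynamic facial avatar reconstruction. By fusing multi-modal features that carry explicit geometric priors, it markedly improves the modeling of high-frequency details such as dynamic wrinkles and subtle textures from monocular video.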
Details
Motivation: 现有NeRF方法在单目视频下难以恢复高频率面部细节(如动态皱纹、微小纹理),限制了虚拟人应用的真实感,需提升其在信息受限条件下的表征能力。 Method: 提出GAT-NeRF,将Transformer机制引入NeRF流程,设计轻量级Geometry-Aware-Transformer(GAT)模块,融合3D坐标、3DMM表情参数和可学习隐编码等多模态输入,结合坐标对齐MLP增强细粒度几何特征表示。 Result: 实验表明GAT-NeRF在视觉保真度和高频细节恢复上达到SOTA水平,能更真实地重建动态面部细节如皱纹和痘疤。 Conclusion: GAT-NeRF通过引入几何感知的Transformer模块,有效提升了单目视频中4D面部avatar的重建质量,为多媒体应用中的数字人生成提供了新方法。 Abstract: High-fidelity 4D dynamic facial avatar reconstruction from monocular video is a critical yet challenging task, driven by increasing demands for immersive virtual human applications. While Neural Radiance Fields (NeRF) have advanced scene representation, their capacity to capture high-frequency facial details, such as dynamic wrinkles and subtle textures from information-constrained monocular streams, requires significant enhancement. To tackle this challenge, we propose a novel hybrid neural radiance field framework, called Geometry-Aware-Transformer Enhanced NeRF (GAT-NeRF) for high-fidelity and controllable 4D facial avatar reconstruction, which integrates the Transformer mechanism into the NeRF pipeline. GAT-NeRF synergistically combines a coordinate-aligned Multilayer Perceptron (MLP) with a lightweight Transformer module, termed as Geometry-Aware-Transformer (GAT) due to its processing of multi-modal inputs containing explicit geometric priors. The GAT module is enabled by fusing multi-modal input features, including 3D spatial coordinates, 3D Morphable Model (3DMM) expression parameters, and learnable latent codes to effectively learn and enhance feature representations pertinent to fine-grained geometry. The Transformer's effective feature learning capabilities are leveraged to significantly augment the modeling of complex local facial patterns like dynamic wrinkles and acne scars. 
Comprehensive experiments unequivocally demonstrate GAT-NeRF's state-of-the-art performance in visual fidelity and high-frequency detail recovery, forging new pathways for creating realistic dynamic digital humans for multimedia applications.
[100] SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval
Xinyi Zheng,Yunze Liu,Chi-Hao Wu,Fan Zhang,Hao Zheng,Wenqi Zhou,Walterio W. Mayol-Cuevas,Junxiao Shen
Main category: cs.CV
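TL;DR: SpatialMem is a memory-centric system that unifies 3D geometry, semantics, and language into a single queryable representation, supporting language-guided navigation and object retrieval.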
Details
Motivation: Build a general, extensible spatial intelligence framework that handles 3D geometry, semantics, and language in a unified way and supports interpretable spatial reasoning and downstream tasks such as language-guided navigation. Method: Starting from casually captured RGB video, reconstructs metrically scaled indoor environments; detects structural 3D anchors (walls, doors, windows) as the base scaffold; and builds a hierarchical memory that links open-vocabulary object nodes, with their visual patches, visual embeddings, and two-layer textual descriptions, to 3D coordinates. Result: In experiments on three real indoor scenes, SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion. Conclusion: SpatialMem provides an efficient, extensible framework for embodied spatial intelligence that enables interpretable reasoning over spatial relations and supports multiple tasks without specialized sensors. Abstract: We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single, queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as the first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes -- linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates -- for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-life indoor scenes demonstrate that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.
[101] Erosion Attack for Adversarial Training to Enhance Semantic Segmentation Robustness
Yufei Song,Ziqi Zhou,Menghao Deng,Yifan Hu,Shengshan Hu,Minghui Li,Leo Yu Zhang
Main category: cs.CV
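TL;DR: Proposes EroSeg-AT, a framework that uses EroSeg to generate adversarial examples that are more semantically disruptive, improving the robustness of segmentation models under adversarial training.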
Details
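Motivation: Existing segmentation models are vulnerable to adversarial attacks, and the attack methods currently used for adversarial training consider only global semantics while ignoring contextual semantic relationships within a sample, which limits the robustness gains. Method: Proposes the EroSeg-AT framework: EroSeg first selects sensitive pixels based on pixel-level confidence, then progressively propagates perturbations to higher-confidence pixels, disrupting the semantic consistency of the sample; this process is integrated into adversarial training. Result: Experiments show that, compared with existing methods, EroSeg-AT significantly improves attack effectiveness and enhances model robustness under adversarial training. Conclusion: By modeling semantic dependencies between pixels and perturbing them in a targeted way, EroSeg-AT more effectively exposes and mitigates the vulnerabilities of segmentation models, offering a new direction for robust segmentation. Abstract: Existing segmentation models exhibit significant vulnerability to adversarial attacks. To improve robustness, adversarial training incorporates adversarial examples into model training. However, existing attack methods consider only global semantic information and ignore contextual semantic relationships within the samples, limiting the effectiveness of adversarial training. To address this issue, we propose EroSeg-AT, a vulnerability-aware adversarial training framework that leverages EroSeg to generate adversarial examples. EroSeg first selects sensitive pixels based on pixel-level confidence and then progressively propagates perturbations to higher-confidence pixels, effectively disrupting the semantic consistency of the samples. Experimental results show that, compared to existing methods, our approach significantly improves attack effectiveness and enhances model robustness under adversarial training.
[102] Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers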
Xinyu Peng,Han Li,Yuyang Huang,Ziyang Zheng,Yaoming Wang,Xin Chen,Wenrui Dai,Chenglin Li,Junni Zou,Hongkai Xiong
Main category: cs.CV
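TL;DR: Proposes LDF-VFI, a video-centric frame interpolation method that models the full video sequence with an auto-regressive diffusion Transformer and combines sparse local attention, tiled VAE encoding, and a skip-concatenate sampling strategy to improve long-range temporal consistency and high-resolution generalization, achieving state-of-the-art performance in large-motion scenes.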
Details
Motivation: Most existing VFI methods process videos in a frame-centric way (for example, as triplets), which leads to temporal inconsistencies and motion artifacts and lacks global modeling of the whole video sequence. Method: Proposes the LDF-VFI framework: 1) an auto-regressive diffusion Transformer models the full video sequence; 2) a skip-concatenate sampling strategy mitigates error accumulation; 3) sparse local attention and tiled VAE encoding enable efficient long-sequence processing and inference at arbitrary resolutions (such as 4K); 4) an enhanced conditional VAE decoder fuses multi-scale features from the input to improve reconstruction fidelity. Result: Achieves state-of-the-art performance on long-sequence benchmarks, with markedly better per-frame quality and temporal consistency, especially in large-motion scenes. Conclusion: Through video-centric modeling and several architectural innovations, LDF-VFI addresses the temporal inconsistency of conventional VFI while balancing efficiency, generalization, and reconstruction quality, offering a new paradigm for high-quality video interpolation. Abstract: Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse, local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining. An enhanced conditional VAE decoder, which leverages multi-scale features from the input video, further improves reconstruction fidelity. Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion. The source code is available at https://github.com/xypeng9903/LDF-VFI.
[103] Unified Multi-Dataset Training for TBPS
Nilanjana Chatterjee,Sidharatha Garg,A V Subramanyam,Brejesh Lall
Main category: cs.CV
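TL;DR: Proposes Scale-TBPS, which uses a noise-aware dataset curation strategy and a scalable discriminative identity learning framework to train a single unified text-based person search model across multiple datasets, outperforming models optimized for individual datasets on multiple benchmarks.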
Details
Motivation: Existing methods rely on dataset-specific fine-tuning and generalize poorly across datasets; they are also constrained by limited training data and by vision-language models not being pre-trained for pedestrian-centric recognition. Method: Proposes a noise-aware unified dataset curation strategy that merges multiple TBPS datasets, and a scalable discriminative identity learning framework that copes with a large number of unique identities and noisy image-text pairs. Result: Experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 show that a single Scale-TBPS model outperforms both per-dataset optimized models and naive joint training. Conclusion: A single unified text-based person search model can work effectively across multiple datasets; the keys are accounting for noise during data curation and using a scalable identity learning mechanism. Abstract: Text-Based Person Search (TBPS) has seen significant progress with vision-language models (VLMs), yet it remains constrained by limited training data and the fact that VLMs are not inherently pre-trained for pedestrian-centric recognition. Existing TBPS methods therefore rely on dataset-centric fine-tuning to handle distribution shift, resulting in multiple independently trained models for different datasets. While synthetic data can increase the scale needed to fine-tune VLMs, it does not eliminate dataset-specific adaptation. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities and are vulnerable to noisy image-text pairs. To address these challenges, we propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework that remains effective under a large number of unique identities. Extensive experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 demonstrate that a single Scale-TBPS model outperforms dataset-centric optimized models and naive joint training.
[104] LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding
Xiaodong Wang,Langling Huang,Zhirong Wu,Xu Zhao,Teng Xu,Xuhong Xia,Peixi Peng
Main category: cs.CV
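TL;DR: Presents LiViBench, the first omnimodal benchmark for interactive livestream videos, comprising 24 diverse tasks and built with a semi-automatic annotation workflow and a multi-agent system that produce high-quality annotations. Also proposes a VCR module and two-stage instruction tuning, yielding LiVi-LLM-7B, a 7B-parameter model that outperforms larger models.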
Details
Motivation: Existing video evaluation benchmarks target non-interactive videos and do not assess understanding of the multiple modalities in interactive livestreams (such as real-time comments and speech), so a dedicated benchmark and model are needed to advance the field. Method: Designs a semi-automatic annotation workflow with human participation, uses multiple MLLMs as a multi-agent system for video description, and applies a seed-question-driven method to generate high-quality annotations; proposes two-stage instruction tuning and a Video-to-Comment Retrieval (VCR) module to strengthen the model's use of real-time comments. Result: LiVi-LLM-7B outperforms open-source models with up to 72B parameters on LiViBench, approaches leading proprietary models, and improves on several general video benchmarks (such as VideoMME and LongVideoBench). Conclusion: LiViBench provides a new evaluation standard for interactive livestream video understanding; combining the VCR module with instruction tuning improves MLLM perception and reasoning in this setting and shows the potential of small models optimized for a specific domain. Abstract: The development of multimodal large language models (MLLMs) has advanced general video understanding. However, existing video evaluation benchmarks primarily focus on non-interactive videos, such as movies and recordings. To fill this gap, this paper proposes the first omnimodal benchmark for interactive livestream videos, LiViBench. It features a diverse set of 24 tasks, highlighting the perceptual, reasoning, and livestream-specific challenges. To efficiently construct the dataset, we design a standardized semi-automatic annotation workflow that incorporates the human-in-the-loop at multiple stages. The workflow leverages multiple MLLMs to form a multi-agent system for comprehensive video description and uses a seed-question-driven method to construct high-quality annotations. All interactive videos in the benchmark include audio, speech, and real-time comments modalities. To enhance models' understanding of interactive videos, we design tailored two-stage instruction-tuning and propose a Video-to-Comment Retrieval (VCR) module to improve the model's ability to utilize real-time comments. Based on these advancements, we develop LiVi-LLM-7B, an MLLM with enhanced knowledge of interactive livestreams. Experiments show that our model outperforms larger open-source models with up to 72B parameters, narrows the gap with leading proprietary models on LiViBench, and achieves enhanced performance on general video benchmarks, including VideoMME, LongVideoBench, MLVU, and VideoEval-Pro.
[105] SpatialV2A: Visual-Guided High-fidelity Spatial Audio Generation
Yanan Wang,Linjie Ren,Zihao Li,Junyi Wang,Tian Gan
Main category: cs.CV
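TL;DR: Introduces BinauralVGGSound, the first large-scale video-binaural audio dataset, and an end-to-end visually guided spatial audio generation framework that significantly improves the spatial fidelity and immersiveness of synthesized audio.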
Details
Motivation: Existing video-to-audio generation models mostly rely on mono audio data and lack binaural spatial information, so the generated audio falls short in spatial perception and immersion. Method: Builds the BinauralVGGSound dataset and proposes an end-to-end framework with a visual-guided audio spatialization module that explicitly models spatial features to generate audio with realistic spatial attributes. Result: Experiments show that the method substantially outperforms state-of-the-art models in spatial fidelity while maintaining good semantic and temporal alignment, delivering a more immersive listening experience. Conclusion: By introducing a binaural audio dataset and a visually guided spatial modeling mechanism, the work addresses the lack of spatial perception in video-to-audio generation and advances immersive audio generation. Abstract: While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models' reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual-to-spatial audio mappings. To address this gap, we introduce two key contributions: we construct BinauralVGGSound, the first large-scale video-binaural audio dataset designed to support spatially aware video-to-audio generation; and we propose an end-to-end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visual-guided audio spatialization module that ensures the generated audio exhibits realistic spatial attributes and layered spatial depth while maintaining semantic and temporal alignment. Experiments show that our approach substantially outperforms state-of-the-art models in spatial fidelity and delivers a more immersive auditory experience, without sacrificing temporal or semantic consistency. All datasets, code, and model checkpoints will be publicly released to facilitate future research.
[106] Federated Transformer-GNN for Privacy-Preserving Brain Tumor Localization with Modality-Level Explainability
Andrea Protani,Riccardo Taiello,Marc Molina Van Den Bosch,Luigi Serio
Main category: cs.CV
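TL;DR: Proposes a federated learning framework for brain tumor localization, based on a hybrid Transformer-graph neural network architecture and implemented on CERN's CAFEIN® platform, that enables multi-institutional collaboration without sharing patient data. Experiments show that federated learning overcomes the limitations of local training and matches centralized training, and the explainability analysis shows that the model attends to clinically relevant MRI modalities (T2 and FLAIR).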
Details
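Motivation: Because of privacy regulations, medical data is often siloed across institutions, which makes it hard to build large, diverse datasets for brain tumor deep learning models; a method is needed that preserves privacy while enabling multi-institutional collaborative training. Method: Uses a hybrid Transformer-graph neural network (GNN) architecture, extended from prior decoder-free supervoxel GNNs, deployed on CERN's federated learning platform CAFEIN®. Federated learning aggregates model updates from multiple centers without sharing raw data, and Transformer attention is used for explainability analysis to identify the key MRI modalities. Result: Experiments on the BraTS dataset show that isolated training stops early because of limited data and cannot train fully, whereas federated learning keeps improving and ultimately reaches a level comparable to centralized training. The attention analysis, statistically validated (paired t-tests with Bonferroni correction), shows that deeper layers significantly increase attention to the T2 and FLAIR modalities (p<0.001, Cohen's d=1.50). Conclusion: Federated learning effectively integrates knowledge from multi-institutional data and significantly improves model performance on complex, high-dimensional tasks such as brain tumor localization while preserving data privacy. Together with the explainability analysis, the approach aligns with clinical practice and offers a viable path for collaborative research in medical AI. Abstract: Deep learning models for brain tumor analysis require large and diverse datasets that are often siloed across healthcare institutions due to privacy regulations. We present a federated learning framework for brain tumor localization that enables multi-institutional collaboration without sharing sensitive patient data. Our method extends a hybrid Transformer-Graph Neural Network architecture derived from prior decoder-free supervoxel GNNs and is deployed within CAFEIN®, CERN's federated learning platform designed for healthcare environments. We provide an explainability analysis through Transformer attention mechanisms that reveals which MRI modalities drive the model predictions. Experiments on the BraTS dataset demonstrate a key finding: while isolated training on individual client data triggers early stopping well before reaching full training capacity, federated learning enables continued model improvement by leveraging distributed data, ultimately matching centralized performance. This result provides strong justification for federated learning when dealing with complex tasks and high-dimensional input data, as aggregating knowledge from multiple institutions significantly benefits the learning process.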
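The idea of aggregating knowledge from several institutions without moving raw data can be illustrated with a minimal sketch. The abstract does not state which aggregation rule the platform uses, so the sample-size-weighted averaging below (in the style of FedAvg) is an assumption, as are the function name `federated_average` and the toy hospital values; it is not the authors' implementation.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Sample-size-weighted mean of client parameter vectors (FedAvg-style).

    Assumption: each client sends a flat list of parameters plus the number
    of local training samples; the aggregation rule itself is hypothetical.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all clients must share the same parameter shape")

    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        share = size / total  # this client's fraction of all training samples
        for i, value in enumerate(weights):
            averaged[i] += share * value
    return averaged


# Hypothetical round: three hospitals with different amounts of local data.
hospital_updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
hospital_sizes = [100, 100, 200]
global_weights = federated_average(hospital_updates, hospital_sizes)
print(global_weights)
```

In this sketch only parameter vectors and sample counts leave each site; the scans themselves never do, which is the property the abstract emphasizes.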
Our explainability analysis, validated through rigorous statistical testing on the full test set (paired t-tests with Bonferroni correction), reveals that deeper network layers significantly increase attention to T2 and FLAIR modalities ($p<0.001$, Cohen's $d$=1.50), aligning with clinical practice.
[107] Deep Leakage with Generative Flow Matching Denoiser
Isaac Baglin,Xiatian Zhu,Simon Hadfield
Main category: cs.CV
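TL;DR: Proposes a new deep leakage attack built on a generative Flow Matching prior that reconstructs clients' private data from federated learning model updates more stably and with higher fidelity, without requiring knowledge of the private data.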
Details
Motivation: Existing deep leakage attacks in federated learning suffer from instability, low reconstruction quality, or poor adaptability to realistic settings; a stronger, more robust attack is needed to reveal potential weaknesses in current defenses. Method: Introduces generative Flow Matching as a prior in the reconstruction process, using a flow-matching foundation model to steer optimization toward the distribution of realistic images and thereby improve the quality and realism of reconstructed images. Result: Experiments on multiple datasets and target models show that the method outperforms state-of-the-art attacks on pixel-level, perceptual, and feature-similarity metrics, and remains effective across different training epochs, larger batch sizes, and common defenses such as noise injection, gradient clipping, and sparsification. Conclusion: Strong generative priors substantially increase the power of deep leakage attacks, indicating that existing federated learning defenses may be insufficient against such attacks and that new defenses designed specifically against generative-prior attacks are needed. Abstract: Federated Learning (FL) has emerged as a powerful paradigm for decentralized model training, yet it remains vulnerable to deep leakage (DL) attacks that reconstruct private client data from shared model updates. While prior DL methods have demonstrated varying levels of success, they often suffer from instability, limited fidelity, or poor robustness under realistic FL settings. We introduce a new DL attack that integrates a generative Flow Matching (FM) prior into the reconstruction process. By guiding optimization toward the distribution of realistic images (represented by a flow matching foundation model), our method enhances reconstruction fidelity without requiring knowledge of the private data. Extensive experiments on multiple datasets and target models demonstrate that our approach consistently outperforms state-of-the-art attacks across pixel-level, perceptual, and feature-based similarity metrics. Crucially, the method remains effective across different training epochs, larger client batch sizes, and under common defenses such as noise injection, clipping, and sparsification. Our findings call for the development of new defense strategies that explicitly account for adversaries equipped with powerful generative priors.
[108] Differential Privacy Image Generation with Reconstruction Loss and Noise Injection Using an Error Feedback SGD
Qiwei Ma,Jun Zhang
Main category: cs.CV
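TL;DR: Proposes a new framework for differentially private synthetic data generation that introduces Error Feedback Stochastic Gradient Descent (EFSGD), a reconstruction loss, and a noise injection mechanism to generate higher-quality, more usable images under the same privacy budget, reaching state-of-the-art performance on MNIST, Fashion-MNIST, and CelebA.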
Details
Motivation: Traditional data masking methods struggle to balance privacy protection and data utility, and existing synthetic data methods trade privacy against utility repeatedly without an efficient way to optimize both together. Method: Proposes a differentially private generation framework whose core components are the Error Feedback Stochastic Gradient Descent (EFSGD) optimization algorithm, a reconstruction loss, and a noise injection mechanism in the training process. Result: On the MNIST, Fashion-MNIST, and CelebA benchmarks, image quality and usability exceed existing methods, reaching state of the art on almost all metrics; the framework generalizes well to both grayscale and RGB images. Conclusion: The framework eases the privacy-utility trade-off, significantly improving the quality and usefulness of synthetic data under strict differential privacy guarantees, and offers a new direction for privacy-preserving machine learning. Abstract: Traditional data masking techniques such as anonymization cannot achieve the expected privacy protection while ensuring data utility for privacy-preserving machine learning. Synthetic data plays an increasingly important role as it generates a large number of training samples and prevents information leakage in real data. Existing methods suffer from repeated trade-offs between privacy and utility. We propose a novel framework for differential privacy generation, which employs an Error Feedback Stochastic Gradient Descent (EFSGD) method and introduces a reconstruction loss and noise injection mechanism into the training process. We generate images with higher quality and usability under the same privacy budget as the related work. Extensive experiments demonstrate the effectiveness and generalization of our proposed framework for both grayscale and RGB images. We achieve state-of-the-art results over almost all metrics on three benchmarks: MNIST, Fashion-MNIST, and CelebA.
[109] Enhancing Few-Shot Out-of-Distribution Detection via the Refinement of Foreground and Background
Tianyu Li,Songyue Cai,Zongqian Wu,Ping Hu,Xiaofeng Zhu
Main category: cs.CV
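TL;DR: Proposes FoBoR, a new plug-and-play framework that improves CLIP-based foreground-background decomposition methods through adaptive background suppression and confusable foreground rectification, boosting few-shot OOD detection performance.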
Details
Motivation: Existing CLIP-based foreground-background decomposition methods apply a uniform suppression strategy to background regions, ignoring the differing contributions of individual patches; for foreground regions they do not account for local patches whose appearance or semantics resemble other classes, which can mislead training. Method: Proposes a plug-and-play framework with three parts: (1) a foreground-background decomposition module; (2) an adaptive background suppression module (weighting based on patch classification entropy); (3) a confusable foreground rectification module (identifying and rectifying confusable foreground patches). Result: Extensive experiments show that the framework significantly improves the few-shot OOD detection performance of existing FG-BG decomposition methods. Conclusion: FoBoR is a general plug-and-play improvement that alleviates coarse-grained background suppression and foreground confusion, strengthening the robustness of CLIP-based OOD detection. Abstract: CLIP-based foreground-background (FG-BG) decomposition methods have demonstrated remarkable effectiveness in improving few-shot out-of-distribution (OOD) detection performance. However, existing approaches still suffer from several limitations. For background regions obtained from decomposition, existing methods adopt a uniform suppression strategy for all patches, overlooking the varying contributions of different patches to the prediction. For foreground regions, existing methods fail to adequately consider that some local patches may exhibit appearance or semantic similarity to other classes, which may mislead the training process. To address these issues, we propose a new plug-and-play framework. This framework consists of three core components: (1) a Foreground-Background Decomposition module, which follows previous FG-BG methods to separate an image into foreground and background regions; (2) an Adaptive Background Suppression module, which adaptively weights patch classification entropy; and (3) a Confusable Foreground Rectification module, which identifies and rectifies confusable foreground patches. Extensive experimental results demonstrate that the proposed plug-and-play framework significantly improves the performance of existing FG-BG decomposition methods. Code is available at: https://github.com/lounwb/FoBoR.
[110] The Pictorial Cortex: Zero-Shot Cross-Subject fMRI-to-Image Reconstruction via Compositional Latent Modeling
Jingyang Huo,Yikai Wang,Yanwei Fu,Jianfeng Feng
Main category: cs.CV
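TL;DR: Proposes PictorialCortex, a model for zero-shot cross-subject fMRI-to-image reconstruction; by building the unified cortical-surface dataset UniCortex-fMRI and introducing factorizable, compositional latent modeling, it decodes visual experience without subject-specific training.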
Details
Motivation: fMRI signals vary considerably across subjects and trials, which makes fMRI-to-image reconstruction non-injective; most existing methods depend on subject-specific training and generalize poorly to new subjects. Method: Builds UniCortex-fMRI, a standardized cross-dataset cortical-surface fMRI dataset; proposes PictorialCortex, which uses compositional latent modeling over subject, dataset, and trial factors, implemented in a universal cortical latent space through a factorization-composition module with paired consistency regularization; at inference, surrogate latents from multiple seen subjects are aggregated to guide a diffusion model in generating images for unseen subjects. Result: Significantly outperforms baselines on zero-shot cross-subject fMRI-to-image reconstruction, confirming the effectiveness of compositional latent modeling and joint multi-dataset training. Conclusion: Compositional latent modeling and a unified cortical data representation mitigate individual differences in fMRI and offer a new paradigm for practical brain decoding. Abstract: Decoding visual experiences from human brain activity remains a central challenge at the intersection of neuroscience, neuroimaging, and artificial intelligence. A critical obstacle is the inherent variability of cortical responses: neural activity elicited by the same visual stimulus differs across individuals and trials due to anatomical, functional, cognitive, and experimental factors, making fMRI-to-image reconstruction non-injective. In this paper, we tackle a challenging yet practically meaningful problem: zero-shot cross-subject fMRI-to-image reconstruction, where the visual experience of a previously unseen individual must be reconstructed without subject-specific training. To enable principled evaluation, we present a unified cortical-surface dataset -- UniCortex-fMRI, assembled from multiple visual-stimulus fMRI datasets to provide broad coverage of subjects and stimuli. UniCortex-fMRI is processed into standardized data formats, which makes it possible to study the zero-shot scenario of cross-subject fMRI-to-image reconstruction. To tackle the modeling challenge, we propose PictorialCortex, which models fMRI activity using a compositional latent formulation that structures stimulus-driven representations under subject-, dataset-, and trial-related variability. PictorialCortex operates in a universal cortical latent space and implements this formulation through a latent factorization-composition module, reinforced by paired factorization and re-factorizing consistency regularization.
During inference, surrogate latents synthesized under multiple seen-subject conditions are aggregated to guide diffusion-based image synthesis for unseen subjects. Extensive experiments show that PictorialCortex improves zero-shot cross-subject visual reconstruction, highlighting the benefits of compositional latent modeling and multi-dataset training.
[111] Three-dimensional visualization of X-ray micro-CT with large-scale datasets: Efficiency and accuracy for real-time interaction
Yipeng Yin,Rao Yao,Qingying Li,Dazhong Wang,Hong Zhou,Zhijun Fang,Jianing Chen,Longjie Qian,Mingyue Wu
Main category: cs.CV
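TL;DR: Reviews recent advances in using Micro-CT for 3D defect characterization that balances high accuracy and high efficiency in industrial ultra-precision inspection, focusing on the evolution of CT reconstruction algorithms from analytical methods to deep learning, improvements to volume rendering algorithms, and advanced lighting models, and outlines directions toward real-time online inspection for digital twins and structural health monitoring.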
Details
Motivation: Micro-CT generates massive datasets in industrial ultra-precision inspection, so the trade-off between accuracy and efficiency in 3D defect characterization must be addressed. Method: Reviews and compares methods that balance accuracy and efficiency, covering CT reconstruction (from analytical methods to deep learning) and volume rendering (including acceleration, data reduction, and advanced lighting models). Result: Systematically traces the development and key technical routes of Micro-CT 3D visualization, providing a reference for quickly selecting efficient, accurate methods and for advancing digital-twin-driven real-time defect monitoring. Conclusion: Future research should focus on real-time online monitoring methods that integrate virtual-physical interaction and on extending digital twin models to structural health monitoring. Abstract: As Micro-CT technology continues to refine its characterization of material microstructures, industrial CT ultra-precision inspection is generating increasingly large datasets, necessitating solutions to the trade-off between accuracy and efficiency in the 3D characterization of defects during ultra-precise detection. This article provides a unique perspective on recent advances in accurate and efficient 3D visualization using Micro-CT, tracing its evolution from medical imaging to industrial non-destructive testing (NDT). Among the numerous CT reconstruction and volume rendering methods, this article selectively reviews and analyzes approaches that balance accuracy and efficiency, offering a comprehensive analysis to help researchers quickly grasp highly efficient and accurate 3D reconstruction methods for microscopic features. By comparing the principles of computed tomography with advancements in microstructural technology, this article examines the evolution of CT reconstruction algorithms from analytical methods to deep learning techniques, as well as improvements in volume rendering algorithms, acceleration, and data reduction. Additionally, it explores advanced lighting models for high-accuracy, photorealistic, and efficient volume rendering. Furthermore, this article envisions potential directions in CT reconstruction and volume rendering.
It aims to guide future research in quickly selecting efficient and precise methods and in developing new ideas and approaches for real-time online monitoring of internal material defects through virtual-physical interaction, with a view to applying digital twin models to structural health monitoring (SHM).
[112] Pb4U-GNet: Resolution-Adaptive Garment Simulation via Propagation-before-Update Graph Network
Aoran Liu,Kun Hu,Clinton Ansun Mo,Qiuxia Wu,Wenxiong Kang,Zhiyong Wang
Main category: cs.CV
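TL;DR: Proposes Pb4U-GNet, a resolution-adaptive graph network framework that addresses the poor cross-resolution generalization of neural networks in garment simulation.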
Details
Motivation: Traditional physics-based garment simulation is computationally expensive, while existing graph neural network methods generalize poorly across mesh resolutions, which makes them hard to use in practice. Method: Proposes the Propagation-before-Update Graph Network (Pb4U-GNet), which decouples message propagation from feature updates and introduces dynamic propagation depth control and geometry-aware update scaling to adapt to meshes of different resolutions. Result: Experiments show that even when trained only on low-resolution meshes, Pb4U-GNet generalizes well across meshes of many resolutions and significantly outperforms existing methods. Conclusion: Pb4U-GNet addresses cross-resolution generalization in neural garment simulation and offers a new solution for efficient, flexible garment simulation. Abstract: Garment simulation is fundamental to various applications in computer vision and graphics, from virtual try-on to digital human modelling. However, conventional physics-based methods remain computationally expensive, hindering their application in time-sensitive scenarios. While graph neural networks (GNNs) offer promising acceleration, existing approaches exhibit poor cross-resolution generalisation, demonstrating significant performance degradation on higher-resolution meshes beyond the training distribution. This stems from two key factors: (1) existing GNNs employ fixed message-passing depth that fails to adapt information aggregation to mesh density variation, and (2) vertex-wise displacement magnitudes are inherently resolution-dependent in garment simulation. To address these issues, we introduce Propagation-before-Update Graph Network (Pb4U-GNet), a resolution-adaptive framework that decouples message propagation from feature updates. Pb4U-GNet incorporates two key mechanisms: (1) dynamic propagation depth control, adjusting message-passing iterations based on mesh resolution, and (2) geometry-aware update scaling, which scales predictions according to local mesh characteristics. Extensive experiments show that even trained solely on low-resolution meshes, Pb4U-GNet exhibits strong generalisability across diverse mesh resolutions, addressing a fundamental challenge in neural garment simulation.
[113] Training-Free and Interpretable Hateful Video Detection via Multi-stage Adversarial Reasoning
Shuonan Yang,Yuchen Zhang,Zeyu Fu
Main category: cs.CV
TL;DR: Proposes MARS, a training-free multi-stage adversarial reasoning framework for interpretable and reliable detection of hateful video content.
Details
Motivation: Existing training-based hateful video detection methods are limited by scarce data and a lack of interpretability, while directly prompting large vision-language models does not reliably detect hate. Method: MARS performs training-free reasoning in three stages: it first produces a neutral, objective description of the video content, then in parallel carries out evidence-based reasoning that supports a hateful interpretation and counter-evidence reasoning that captures plausible non-hateful perspectives, and finally fuses the two into an explainable conclusion. Result: On two real-world datasets, MARS improves by up to 10% over other training-free methods and surpasses state-of-the-art training-based methods on one dataset; it also produces human-understandable justifications. Conclusion: MARS offers a reliable, interpretable, training-free paradigm for hateful video detection that helps improve the transparency of content moderation and compliance oversight. Abstract: Hateful videos pose serious risks by amplifying discrimination, inciting violence, and undermining online safety. Existing training-based hateful video detection methods are constrained by limited training data and lack of interpretability, while directly prompting large vision-language models often struggles to deliver reliable hate detection. To address these challenges, this paper introduces MARS, a training-free Multi-stage Adversarial ReaSoning framework that enables reliable and interpretable hateful content detection. MARS begins with the objective description of video content, establishing a neutral foundation for subsequent analysis. Building on this, it develops evidence-based reasoning that supports potential hateful interpretations, while in parallel incorporating counter-evidence reasoning to capture plausible non-hateful perspectives. Finally, these perspectives are synthesized into a conclusive and explainable decision. Extensive evaluation on two real-world datasets shows that MARS achieves up to 10% improvement under certain backbones and settings compared to other training-free approaches and outperforms state-of-the-art training-based methods on one dataset. In addition, MARS produces human-understandable justifications, thereby supporting compliance oversight and enhancing the transparency of content moderation workflows. The code is available at https://github.com/Multimodal-Intelligence-Lab-MIL/MARS.
[114] BREPS: Bounding-Box Robustness Evaluation of Promptable Segmentation
Andrey Moskalenko,Danil Kuznetsov,Irina Dudko,Anastasiia Iasakova,Nikita Boldyrev,Denis Shepelev,Andrei Spiridonov,Andrey Kuznetsov,Vlad Shakhuro
Main category: cs.CV
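TL;DR: Studies the robustness of promptable segmentation models (such as SAM) to natural variations in bounding box prompts, proposes BREPS to generate adversarial bounding boxes that satisfy naturalness constraints, and, through a user study and a benchmark across 10 datasets, reveals how sensitive existing models are to prompt noise.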
Details
Motivation: Promptable segmentation models are usually evaluated with synthetic prompts, with no analysis of robustness to the variability of real user input, so model behavior under natural bounding box variation needs to be examined. Method: First runs a controlled user study to collect thousands of real bounding box annotations; then reformulates robustness evaluation as a white-box optimization problem over the bounding box prompt space and proposes BREPS to generate adversarial bounding boxes that minimize or maximize segmentation error; finally benchmarks state-of-the-art models on 10 datasets ranging from everyday scenes to medical imaging. Result: The analysis shows substantial differences in segmentation quality across users for the same model and instance, indicating that SAM-like models are highly sensitive to natural prompt noise; BREPS generates adversarial bounding boxes that respect naturalness constraints and can be used to evaluate robustness; the benchmark reveals performance fluctuations of existing models under natural prompt variation. Conclusion: Current promptable segmentation models are sensitive to natural bounding box variation, and existing training and evaluation protocols need to improve for robust real-world use; BREPS provides a new tool for evaluating and strengthening the robustness of such models. Abstract: Promptable segmentation models such as SAM have established a powerful paradigm, enabling strong generalization to unseen objects and domains with minimal user input, including points, bounding boxes, and text prompts. Among these, bounding boxes stand out as particularly effective, often outperforming points while significantly reducing annotation costs. However, current training and evaluation protocols typically rely on synthetic prompts generated through simple heuristics, offering limited insight into real-world robustness. In this paper, we investigate the robustness of promptable segmentation models to natural variations in bounding box prompts. First, we conduct a controlled user study and collect thousands of real bounding box annotations. Our analysis reveals substantial variability in segmentation quality across users for the same model and instance, indicating that SAM-like models are highly sensitive to natural prompt noise. Then, since exhaustive testing of all possible user inputs is computationally prohibitive, we reformulate robustness evaluation as a white-box optimization problem over the bounding box prompt space. We introduce BREPS, a method for generating adversarial bounding boxes that minimize or maximize segmentation error while adhering to naturalness constraints. Finally, we benchmark state-of-the-art models across 10 datasets, spanning everyday scenes to medical imaging. Code: https://github.com/emb-ai/BREPS.
[115] Graph Recognition via Subgraph Prediction
André Eberhard,Gerhard Neumann,Pascal Friederich
Main category: cs.CV
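TL;DR: Proposes GraSP, a new method that recognizes graph structures in images via subgraph prediction, with broad applicability and transferability across tasks.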
Details
Motivation: Visual relationship recognition (modeled as extracting a graph from an image) remains challenging, mainly because there is no unified framework; existing methods are typically designed for specific problems and do not transfer directly between tasks. Method: Proposes GraSP (Graph Recognition via Subgraph Prediction), which decomposes graph recognition into subgraph prediction so that different types of graphs and drawing styles can be recognized without task-specific modifications. Result: GraSP is validated on several synthetic benchmarks and one real-world application, showing that it handles diverse graph types and transfers between tasks. Conclusion: GraSP provides a more unified, simple, and generalizable framework for visual graph recognition, moving the field toward standardization. Abstract: Despite tremendous improvements in tasks such as image classification, object detection, and segmentation, the recognition of visual relationships, commonly modeled as the extraction of a graph from an image, remains a challenging task. We believe that this mainly stems from the fact that there is no canonical way to approach the visual graph recognition task. Most existing solutions are specific to a problem and cannot be transferred between different contexts out of the box, even though the conceptual problem remains the same. With broad applicability and simplicity in mind, in this paper we develop a method, Graph Recognition via Subgraph Prediction (GraSP), for recognizing graphs in images. We show across several synthetic benchmarks and one real-world application that our method works with a set of diverse types of graphs and their drawings, and can be transferred between tasks without task-specific modifications, paving the way to a more unified framework for visual graph recognition.
[116] Large-Scale Multidimensional Knowledge Profiling of Scientific Literature
Zhucun Xue,Jiangning Zhang,Juntao Jiang,Jinzhuo Liu,Haoyang He,Teng Hu,Xiaobin Hu,Guangming Yao,Yi Yuan,Yong Liu
Main category: cs.CV
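TL;DR: This paper builds a unified corpus of more than 100,000 papers from 22 top conferences between 2020 and 2025 and combines topic clustering, LLM-assisted parsing, and structured retrieval into a multidimensional research profiling framework, revealing how AI research themes have evolved (for example, the rise of safety, multimodal reasoning, and agents, and the stabilization of NMT and graph-based methods).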
Details
Motivation: Traditional bibliometric tools rely only on metadata, capture little of a paper's semantic content, and cannot effectively track how research themes evolve or influence one another across fields, so a dynamic, multidimensional analysis based on text content is needed. Method: Build a unified corpus of more than 100,000 papers from 22 major conferences (2020 to 2025), and design a multidimensional profiling pipeline that combines topic clustering, LLM-assisted parsing (e.g., identifying methods, datasets, and models), and structured retrieval. Result: The analysis identifies several key trends in AI research: rapid growth in safety, multimodal reasoning, and agent research, and stabilization of areas such as neural machine translation and graph-based methods. It also characterizes topic lifecycles, methodological transitions, changes in dataset and model usage, and institutional research preferences. Conclusion: The multidimensional profiling framework gives the AI field interpretable, traceable, evidence-based trend analysis; it describes the current research landscape and offers empirical support for policy making, resource allocation, and identifying emerging directions. Abstract: The rapid expansion of research across machine learning, vision, and language has produced a volume of publications that is increasingly difficult to synthesize. Traditional bibliometric tools rely mainly on metadata and offer limited visibility into the semantic content of papers, making it hard to track how research themes evolve over time or how different areas influence one another. To obtain a clearer picture of recent developments, we compile a unified corpus of more than 100,000 papers from 22 major conferences between 2020 and 2025 and construct a multidimensional profiling pipeline to organize and analyze their textual content. By combining topic clustering, LLM-assisted parsing, and structured retrieval, we derive a comprehensive representation of research activity that supports the study of topic lifecycles, methodological transitions, dataset and model usage patterns, and institutional research directions. Our analysis highlights several notable shifts, including the growth of safety, multimodal reasoning, and agent-oriented studies, as well as the gradual stabilization of areas such as neural machine translation and graph-based methods. These findings provide an evidence-based view of how AI research is evolving and offer a resource for understanding broader trends and identifying emerging directions. Code and dataset: https://github.com/xzc-zju/Profiling_Scientific_Literature
[117] BBoxMaskPose v2: Expanding Mutual Conditioning to 3D
Miroslav Purkrabek,Constantin Kolomiiets,Jiri Matas
Main category: cs.CV
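TL;DR: This paper proposes PMPose and BBoxMaskPose v2 (BMPv2), which improve 2D pose estimation in crowded scenes through probabilistic modeling and mask conditioning, substantially outperform existing methods on COCO and OCHuman, and are the first to exceed 50 AP on OCHuman. It also confirms that 2D pose quality benefits 3D pose estimation.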
Details
Motivation: Most 2D human pose estimation benchmarks are close to saturation, but crowded scenes remain challenging, so a method that stays accurate in dense crowds is needed. Method: PMPose uses probabilistic modeling and mask conditioning; BBoxMaskPose v2 (BMPv2) builds on it with an improved SAM-based mask refinement module. 2D prompts are used to improve 3D pose estimation, and evaluation includes the newly proposed OCHuman-Pose dataset. Result: BMPv2 improves by 1.5 AP on COCO and 6 AP on OCHuman, becoming the first method to exceed 50 AP on OCHuman. Experiments show that 2D pose quality directly affects 3D pose estimation and that multi-person pose performance depends more on pose prediction accuracy than on detection. Conclusion: By introducing probabilistic modeling and mask conditioning, PMPose and BMPv2 substantially improve 2D pose estimation in crowded scenes and also advance 3D pose estimation, showing that high-quality 2D pose output matters for downstream tasks. Abstract: Most 2D human pose estimation benchmarks are nearly saturated, with the exception of crowded scenes. We introduce PMPose, a top-down 2D pose estimator that incorporates a probabilistic formulation and mask conditioning. PMPose improves crowded pose estimation without sacrificing performance on standard scenes. Building on this, we present BBoxMaskPose v2 (BMPv2) integrating PMPose and an enhanced SAM-based mask refinement module. BMPv2 surpasses the state of the art by 1.5 average precision (AP) points on COCO and 6 AP points on OCHuman, becoming the first method to exceed 50 AP on OCHuman. We demonstrate that BMP's 2D prompting of a 3D model improves 3D pose estimation in crowded scenes and that advances in 2D pose quality directly benefit 3D estimation. Results on the new OCHuman-Pose dataset show that multi-person performance is more affected by pose prediction accuracy than by detection. The code, models, and data are available at https://MiraPurkrabek.github.io/BBox-Mask-Pose/
[118] A Computer Vision Hybrid Approach: CNN and Transformer Models for Accurate Alzheimer's Detection from Brain MRI Scans
Md Mahmudul Hoque,Shuvo Karmaker,Md. Hadi Al-Amin,Md Modabberul Islam,Jisun Junayed,Farha Ulfat Mahi
Main category: cs.CV
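TL;DR: This paper compares five CNNs, five Transformer models, and the proposed hybrid model Evan_V2 on four-class Alzheimer's disease classification; Evan_V2 reaches 99.99% accuracy through feature-level fusion, clearly outperforming every standalone model.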
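The TL;DR above attributes Evan_V2's gain to feature-level fusion but does not name the fusion operator. As a rough illustration only, the sketch below shows one common variant: L2-normalize each backbone's feature vectors, concatenate them, and apply a linear classification head. The function names `fuse_features` and `predict_classes`, the normalization step, and the linear head are all assumptions for illustration, not the authors' architecture.

```python
import numpy as np


def fuse_features(feature_sets):
    """Concatenate per-backbone features after L2-normalizing each set.

    feature_sets: list of arrays, each of shape (n_samples, dim_i).
    Normalizing first keeps a backbone with large activations from
    dominating the fused vector (an assumed design choice).
    """
    normed = []
    for feats in feature_sets:
        feats = np.asarray(feats, dtype=np.float64)
        norms = np.linalg.norm(feats, axis=1, keepdims=True)
        normed.append(feats / np.maximum(norms, 1e-12))
    return np.concatenate(normed, axis=1)


def predict_classes(fused, weights, bias):
    """Hypothetical linear head: returns the arg-max class per sample."""
    logits = fused @ weights + bias
    return logits.argmax(axis=1)
```

In practice the head would be trained on the fused features; here the weights are left as inputs so the sketch stays self-contained.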
Details
Motivation: Early, accurate diagnosis of Alzheimer's disease is essential for timely clinical intervention and better patient outcomes, yet existing single deep learning models show limited stability or generalization in multi-stage dementia classification. Method: Compare five CNNs (e.g., ResNet50), five Transformers (e.g., ViT), and the newly proposed hybrid model Evan_V2, which applies feature-level fusion across the ten CNN and Transformer architectures. Result: ResNet50 (CNN) reaches 98.83% accuracy; ViT is the best Transformer at 95.38%; Evan_V2 reaches 99.99% accuracy, 0.9989 F1-score, and 0.9968 ROC AUC, and markedly reduces misclassification across stages. Conclusion: Hybrid ensemble strategies can markedly improve the reliability and clinical applicability of multi-class AD classification and offer a new direction for intelligent diagnostic assistance. Abstract: Early and accurate classification of Alzheimer's disease (AD) from brain MRI scans is essential for timely clinical intervention and improved patient outcomes. This study presents a comprehensive comparative analysis of five CNN architectures (EfficientNetB0, ResNet50, DenseNet201, MobileNetV3, VGG16), five Transformer-based models (ViT, ConvTransformer, PatchTransformer, MLP-Mixer, SimpleTransformer), and a proposed hybrid model named Evan_V2. All models were evaluated on a four-class AD classification task comprising Mild Dementia, Moderate Dementia, Non-Demented, and Very Mild Dementia categories. Experimental findings show that CNN architectures consistently achieved strong performance, with ResNet50 attaining 98.83% accuracy. Transformer models demonstrated competitive generalization capabilities, with ViT achieving the highest accuracy among them at 95.38%. However, individual Transformer variants exhibited greater class-specific instability. The proposed Evan_V2 hybrid model, which integrates outputs from ten CNN and Transformer architectures through feature-level fusion, achieved the best overall performance with 99.99% accuracy, 0.9989 F1-score, and 0.9968 ROC AUC. Confusion matrix analysis further confirmed that Evan_V2 substantially reduced misclassification across all dementia stages, outperforming every standalone model. These findings highlight the potential of hybrid ensemble strategies in producing highly reliable and clinically meaningful diagnostic tools for Alzheimer's disease classification.
[119] ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
Hanlei Guo,Jiahao Shao,Xinya Chen,Xiyang Tan,Sheng Miao,Yujun Shen,Yiyi Liao
Main category: cs.CV
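TL;DR: ScenDi is an urban scene generation method that combines 3D and 2D diffusion models: a 3D latent diffusion model first generates 3D Gaussians (3DGS) to support camera-controllable rendering, and a 2D video diffusion model then enhances appearance details conditioned on the rendered images.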
Details
Motivation: Existing methods face a trade-off when generating realistic 3D urban scenes: pure 3D diffusion models lose detail, while pure 2D diffusion models lack camera controllability. Method: The ScenDi framework (1) trains a 3D latent diffusion model to generate 3D Gaussians that can be conditioned on inputs such as 3D boxes, road maps, or text, and (2) trains a 2D video diffusion model, conditioned on images rendered from the 3DGS, to enhance appearance details while keeping camera trajectories accurate. Result: ScenDi is validated on two real-world datasets, Waymo and KITTI-360, where it generates detailed, camera-controllable urban scenes. Conclusion: A cooperative diffusion paradigm that fuses 3D structural priors with 2D appearance modeling can effectively improve the quality and controllability of complex urban scene generation. Abstract: Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned by specifying inputs such as 3D bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates desired scenes based on input conditions and successfully adheres to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.
[120] Tracing 3D Anatomy in 2D Strokes: A Multi-Stage Projection Driven Approach to Cervical Spine Fracture Identification
Fabi Nahian Madhurja,Rusab Sarmun,Muhammad E. H. Chowdhury,Adam Mushtak,Israa Al-Hashimi,Sohaib Bassam Zoghoul
Main category: cs.CV
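TL;DR: This study proposes an end-to-end automated pipeline, based on 2D projections, for analyzing cervical vertebrae (C1-C7) and detecting vertebra-level fractures in 3D CT. Optimized 2D multi-planar projections are combined with YOLOv8 to localize regions of interest, DenseNet121-Unet performs multi-label segmentation, the results are fused into approximate 3D vertebra masks to extract individual vertebra volumes, and an ensemble of 2.5D spatio-sequential models classifies fractures, with good performance at both the vertebra and patient levels.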
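The projection step summarized in the TL;DR above can be illustrated with a short NumPy sketch that collapses a CT volume into axial, coronal, and sagittal 2D images. This is a minimal sketch under stated assumptions: the `(z, y, x)` axis order, the function names, and the definition of "energy" as a sum of squared intensities are illustrative choices, not the paper's specification of its optimized projections.

```python
import numpy as np


def project_volume(volume: np.ndarray, axis: int, mode: str = "max") -> np.ndarray:
    """Collapse a 3D volume to a 2D image along one axis."""
    vol = volume.astype(np.float64)
    if mode == "max":
        return vol.max(axis=axis)
    if mode == "variance":
        return vol.var(axis=axis)
    if mode == "energy":
        # Assumed definition: sum of squared intensities along the ray.
        return np.square(vol).sum(axis=axis)
    raise ValueError(f"unknown projection mode: {mode}")


def three_view_projections(volume: np.ndarray, mode: str = "max") -> dict:
    """Return axial, coronal, and sagittal projections.

    Assumes the volume is indexed (z, y, x): collapsing z gives the axial
    view, collapsing y the coronal view, collapsing x the sagittal view.
    """
    return {
        "axial": project_volume(volume, 0, mode),
        "coronal": project_volume(volume, 1, mode),
        "sagittal": project_volume(volume, 2, mode),
    }
```

Each 2D image is far smaller than the volume, which is where the reduced computational cost of running a 2D detector or segmenter per view comes from.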
Details
Motivation: Traditional 3D segmentation is computationally expensive, which limits real-time clinical use. The paper explores an efficient and accurate alternative: using 2D projections to automate detection and analysis of cervical spine fractures in 3D CT, lowering compute cost while maintaining performance. Method: An end-to-end pipeline. Optimized 2D axial, sagittal, and coronal projections approximate the 3D volume, and YOLOv8 detects ROIs in the three views, which are fused into an approximate 3D cervical spine region. DenseNet121-Unet then performs multi-label 2D segmentation on variance- and energy-based projections. The 2D results are fused into approximate 3D vertebra masks to extract each vertebra's volume. Finally, an ensemble of 2.5D Spatio-Sequential models that combine raw slices and projections classifies fractures per vertebra, with saliency maps and an interobserver variability analysis used to assess interpretability and reliability. Result: The method reaches a 3D mIoU of 94.45% and a 2D multi-label segmentation Dice score of 87.86%. For fracture detection, F1 is 68.15 at the vertebra level and 82.26 at the patient level, with ROC-AUC of 91.62 (vertebra level) and 83.04 (patient level). Saliency maps show that the model attends to anatomically relevant regions, consistent with expert reading. Conclusion: 2D projection-based methods can effectively approximate 3D information, delivering strong vertebra segmentation and fracture detection at much lower computational cost, with good clinical potential and interpretability as an alternative to traditional 3D methods. Abstract: Cervical spine fractures are critical medical conditions requiring precise and efficient detection for effective clinical management. This study explores the viability of 2D projection-based vertebra segmentation for vertebra-level fracture detection in 3D CT volumes, presenting an end-to-end pipeline for automated analysis of cervical vertebrae (C1-C7). By approximating a 3D volume through optimized 2D axial, sagittal, and coronal projections, regions of interest are identified using the YOLOv8 model from all views and combined to approximate the 3D cervical spine area, achieving a 3D mIoU of 94.45 percent. This projection-based localization strategy reduces computational complexity compared to traditional 3D segmentation methods while maintaining high performance. It is followed by a DenseNet121-Unet-based multi-label segmentation leveraging variance- and energy-based projections, achieving a Dice score of 87.86 percent. Strategic approximation of 3D vertebral masks from these 2D segmentation masks enables the extraction of individual vertebra volumes. The volumes are analyzed for fractures using an ensemble of 2.5D Spatio-Sequential models incorporating both raw slices and projections per vertebra for complementary evaluation. This ensemble achieves vertebra-level and patient-level F1 scores of 68.15 and 82.26, and ROC-AUC scores of 91.62 and 83.04, respectively. We further validate our approach through an explainability study that provides saliency map visualizations highlighting anatomical regions relevant for diagnosis, and an interobserver variability analysis comparing our model's performance with expert radiologists, demonstrating competitive results.
[121] FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion
Zichen Xi,Hao-Xiang Chen,Nan Xue,Hongyu Yan,Qi-Yuan Feng,Levent Burak Kara,Joaquim Jorge,Qun-Ce Xu
Main category: cs.CV
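TL;DR: This paper proposes FlowSSC, the first generative framework applied directly to monocular semantic scene completion (SSC). By introducing Shortcut Flow-matching, it achieves single-step, high-fidelity generation in a compact triplane latent space, significantly boosts existing feed-forward methods, and reaches state-of-the-art performance on SemanticKITTI.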
Details
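Motivation: Semantic scene completion from monocular RGB images is very challenging because reasoning about occluded geometry from a single view is inherently ambiguous. Existing feed-forward methods struggle to generate plausible details in occluded regions and to preserve spatial relationships between objects, so a method with accurate generative reasoning over the full 3D space is needed. Method: SSC is modeled as a conditional generation problem in the FlowSSC framework. A Shortcut Flow-matching mechanism enables single-step, high-fidelity generation in a compact triplane latent space, avoiding the hundreds of iterations required by conventional diffusion models, and the framework can be plugged into existing feed-forward SSC methods to enhance them. Result: FlowSSC achieves state-of-the-art performance on SemanticKITTI, significantly surpassing existing baselines, and combines high quality with real-time inference, making it suitable for practical systems such as autonomous driving. Conclusion: FlowSSC is the first to bring generative modeling effectively into monocular SSC; its single-step flow-matching mechanism resolves the tension between accuracy and efficiency and offers a new paradigm for 3D scene understanding. Abstract: Semantic Scene Completion (SSC) from monocular RGB images is a fundamental yet challenging task due to the inherent ambiguity of inferring occluded 3D geometry from a single view. While feed-forward methods have made progress, they often struggle to generate plausible details in occluded regions and preserve the fundamental spatial relationships of objects. Such accurate generative reasoning capability for the entire 3D space is critical in real-world applications. In this paper, we present FlowSSC, the first generative framework applied directly to monocular semantic scene completion. FlowSSC treats the SSC task as a conditional generation problem and can seamlessly integrate with existing feed-forward SSC methods to significantly boost their performance. To achieve real-time inference without compromising quality, we introduce Shortcut Flow-matching that operates in a compact triplane latent space. Unlike standard diffusion models that require hundreds of steps, our method utilizes a shortcut mechanism to achieve high-fidelity generation in a single step, enabling practical deployment in autonomous systems. Extensive experiments on SemanticKITTI demonstrate that FlowSSC achieves state-of-the-art performance, significantly outperforming existing baselines.
[122] DrivIng: A Large-Scale Multimodal Driving Dataset with Full Digital Twin Integration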
Dominik Rößle,Xujun Xie,Adithya Mohan,Venkatesh Thirugnana Sambandham,Daniel Cremers,Torsten Schön
Main category: cs.CV
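TL;DR: This paper introduces DrivIng, a large-scale multimodal autonomous driving perception dataset with a complete geo-referenced digital twin, covering diverse driving scenarios and supporting both high-fidelity simulation and real-world evaluation.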
Details
Motivation: Existing autonomous driving perception datasets lack a high-fidelity digital twin, which limits systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluation. Method: The authors build a geo-referenced digital twin of a route of about 18 km covering urban, suburban, and highway segments; record six RGB cameras, one LiDAR, and high-precision ADMA localization across day, dusk, and night; annotate 3D bounding boxes and track IDs at 10 Hz (12 classes, about 1.2 million instances); and support 1-to-1 transfer of real traffic into simulation while preserving agent interactions. Result: The DrivIng dataset, digital twin, HD map, and codebase are released, and benchmarks with several state-of-the-art perception models confirm the dataset's validity and usefulness. Conclusion: DrivIng fills the gap in high-quality perception datasets with a digital twin and provides a new basis for robust perception research, reproducible studies, and flexible scenario validation. Abstract: Perception is a cornerstone of autonomous driving, enabling vehicles to understand their surroundings and make safe, reliable decisions. Developing robust perception algorithms requires large-scale, high-quality datasets that cover diverse driving conditions and support thorough evaluation. Existing datasets often lack a high-fidelity digital twin, limiting systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluations. To address this gap, we present DrivIng, a large-scale multimodal dataset with a complete geo-referenced digital twin of a ~18 km route spanning urban, suburban, and highway segments. Our dataset provides continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization, captured across day, dusk, and night. All sequences are annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes, yielding ~1.2 million annotated instances. Alongside the benefits of a digital twin, DrivIng enables a 1-to-1 transfer of real traffic into simulation, preserving agent interactions while enabling realistic and flexible scenario testing. To support reproducible research and robust validation, we benchmark DrivIng with state-of-the-art perception models and publicly release the dataset, digital twin, HD map, and codebase.
[123] RayRoPE: Projective Ray Positional Encoding for Multi-view Attention
Yu Wu,Minsik Jeon,Jen-Hao Rick Chang,Oncel Tuzel,Shubham Tulsiani
Main category: cs.CV
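TL;DR: This paper proposes RayRoPE, a positional encoding for multi-view transformers that builds a geometry-aware encoding from rays and predicted 3D points, achieves SE(3) invariance and multi-frequency similarity, and supports uncertainty modeling; it clearly outperforms existing methods on novel-view synthesis and stereo depth estimation.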
Details
Motivation: Existing absolute or relative positional encoding schemes cannot simultaneously satisfy the requirements of multi-view transformers: uniqueness, SE(3) invariance, multi-frequency similarity, and adaptivity to scene geometry. Method: RayRoPE represents image patch positions through their associated rays but builds the geometry-aware encoding on a predicted 3D point rather than the ray direction; it computes projective coordinates in the query frame to obtain SE(3)-invariant attention; and it introduces a mechanism that analytically computes the expected positional encoding under uncertainty in the predicted point along the ray. Result: On novel-view synthesis on CO3D, RayRoPE yields a 15% relative improvement in LPIPS; it also clearly outperforms baselines on stereo depth estimation and integrates RGB-D input seamlessly, with even larger gains. Conclusion: RayRoPE is a new positional encoding paradigm that satisfies the key geometric and invariance constraints of multi-view transformers, and experiments support its effectiveness and generalization. Abstract: We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows SE(3)-invariant attention with multi-frequency similarity, and can be adaptive to the geometry of the underlying scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet the above desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions based on associated rays but leverages a predicted point along the ray instead of the direction for a geometry-aware encoding. To achieve SE(3) invariance, RayRoPE computes query-frame projective coordinates for computing multi-frequency similarity. Lastly, as the 'predicted' 3D point along a ray may not be precise, RayRoPE presents a mechanism to analytically compute the expected position encoding under uncertainty. We validate RayRoPE on the tasks of novel-view synthesis and stereo depth estimation and show that it consistently improves over alternative position encoding schemes (e.g., 15% relative improvement on LPIPS in CO3D). We also show that RayRoPE can seamlessly incorporate RGB-D input, resulting in even larger gains over alternatives that cannot positionally encode this information.
[124] StableWorld: Towards Stable and Consistent Long Interactive Video Generation
Ying Yang,Zhengyao Lv,Tianlin Pan,Haofan Wang,Binxin Yang,Hubery Yin,Chen Li,Ziwei Liu,Chenyang Si
Main category: cs.CV
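TL;DR: This paper proposes StableWorld, which uses a dynamic frame eviction mechanism to address stability and temporal consistency in interactive video generation, preventing error accumulation and improving generation quality.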
Details
Motivation: Current interactive video generation methods suffer from severe instability and temporal degradation over long interactions, leading to spatial drift and scene collapse, so the stability problem needs to be addressed. Method: StableWorld is a dynamic frame eviction mechanism that continuously filters out degraded frames and keeps geometrically consistent ones, stopping error accumulation at its source. Result: StableWorld is validated on multiple models, including Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, where it substantially improves stability, temporal consistency, and generalization across scenarios. Conclusion: StableWorld is a model-agnostic, simple, and efficient method that can be applied to a wide range of interactive video generation frameworks and offers a new approach to building stable, controllable video world models. Abstract: In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic and controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, often leading to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we initially investigate the underlying causes of instability and identify that the major source of error accumulation originates from the same scene, where generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building upon this observation, we propose a simple yet effective method, StableWorld, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld effectively prevents cumulative drift at its source, leading to more stable and temporally consistent interactive generation. Promising results on multiple interactive video models, e.g., Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, demonstrate that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.
[125] Rethinking Video Generation Model for the Embodied World
Yufan Deng,Zilin Pan,Hongyu Zhang,Xiaojie Li,Ruoqing Hu,Yufei Ding,Yiming Zou,Yan Zeng,Daquan Zhou
Main category: cs.CV
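TL;DR: This paper proposes RBench, a robot-oriented video generation benchmark for evaluating video generation models on robotics tasks, and builds RoVid-X, a large-scale robot video dataset, to advance embodied intelligence.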
Details
Motivation: Existing video generation models still struggle to generate realistic robot interaction videos, and there is no standardized benchmark for fair evaluation and measurable progress. Method: The RBench benchmark covers five task domains and four robot embodiments and evaluates both task correctness and visual fidelity; a four-stage data pipeline is designed to build the large-scale robot video dataset RoVid-X. Result: Across 25 representative models, RBench reveals significant deficiencies in physical realism and reaches a Spearman correlation of 0.96 with human evaluation; RoVid-X contains 4 million annotated video clips and is currently the largest open-source robot dataset for video generation. Conclusion: RBench and RoVid-X together form a complementary ecosystem of evaluation and training data, laying a solid foundation for rigorous assessment and scalable training in embodied AI and accelerating progress toward general intelligence. Abstract: Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
[126] LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes
Ruofan Liang,Norman Müller,Ethan Weber,Duncan Zauss,Nandita Vijaykumar,Peter Kontschieder,Christian Richardt
Main category: cs.CV
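TL;DR: Proposes a new method for interactive lighting editing of indoor scenes from a single multi-view capture: a generative image-based light decomposition model separates individual light sources and allows independent control of their on/off state, chromaticity, and intensity, and multi-view lighting harmonization combined with a 3D Gaussian splatting representation enables real-time editing.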
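Once a scene's illumination has been separated into per-light contributions, the independent on/off, chromaticity, and intensity controls mentioned above reduce to a weighted sum, because light transport is linear in the emitted radiance when images are in linear RGB. The sketch below illustrates only that recombination step; it does not cover the generative decomposition model or the 3D representation, and the function name `remix_lighting` and the array layout are assumptions for illustration.

```python
import numpy as np


def remix_lighting(light_layers, on, intensity, tint):
    """Recombine per-light images into a relit image.

    light_layers: (L, H, W, 3) linear-RGB contribution of each light source.
    on:           (L,) 1.0 to keep a light, 0.0 to switch it off.
    intensity:    (L,) scalar gain per light.
    tint:         (L, 3) per-channel multiplier that shifts chromaticity.
    """
    layers = np.asarray(light_layers, dtype=np.float64)
    on = np.asarray(on, dtype=np.float64)
    intensity = np.asarray(intensity, dtype=np.float64)
    tint = np.asarray(tint, dtype=np.float64)
    # Per-light, per-channel gain: shape (L, 3).
    gains = (on * intensity)[:, None] * tint
    # Sum the scaled light layers over the light index.
    return np.einsum("lhwc,lc->hwc", layers, gains)
```

The result stays in linear RGB; any tone mapping or gamma encoding for display would be applied afterwards.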
Details
Motivation: Existing methods struggle to precisely decompose and interactively edit complex indoor lighting from a single multi-view capture, especially when multiple light sources must be controlled independently. Method: A generative image-based light decomposition model factorizes complex illumination into individual light source components, and a multi-view lighting harmonization mechanism integrates them into a relightable 3D Gaussian splatting representation for consistent, real-time lighting edits. Result: The method delivers high-quality lighting decomposition and relighting on both synthetic and real datasets, and quantitative and qualitative comparisons favor it over state-of-the-art methods. Conclusion: The method decomposes and interactively edits multiple light sources in indoor scenes efficiently and realistically, advancing lighting editing from a single multi-view capture. Abstract: We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques. For video results and interactive demos, see https://luxremix.github.io
[127] Walk through Paintings: Egocentric World Models from Internet Priors
Anurag Bagchi,Zhipeng Bao,Homanga Bharadhwaj,Yu-Xiong Wang,Pavel Tokmakov,Martial Hebert
Main category: cs.CV
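TL;DR: This paper proposes the Egocentric World Model (EgoWM), a lightweight, architecture-agnostic method that turns a pretrained video diffusion model into an action-conditioned embodied world model, markedly improving the physical consistency, generalization, and inference efficiency of future prediction.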
Details
Motivation: Existing video generation models do not model the causal link between actions and states, which makes physically correct, controllable embodied future prediction difficult. Method: Lightweight action-conditioning layers are added to a pretrained video diffusion model, reusing Internet-scale video priors to inject actions and model world dynamics; a Structural Consistency Score (SCS) is proposed to evaluate physical consistency. Result: EgoWM improves SCS by up to 80% over the prior state of the art, achieves up to six times lower inference latency, and generalizes to unseen environments such as navigation inside paintings; it supports embodiments ranging from 3-DoF mobile robots to 25-DoF humanoids. Conclusion: With only modest fine-tuning, a general video generation model can be turned into a high-fidelity, low-latency embodied world model that generalizes well, supporting the reuse of large visual model priors as a way to build world models. Abstract: What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks, requiring only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80 percent over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.
[128] Iterative Refinement Improves Compositional Image Generation
Shantanu Jaiswal,Mihir Prabhudesai,Nikash Bhardwaj,Zheyang Qin,Amir Zadeh,Chuan Li,Katerina Fragkiadaki,Deepak Pathak
Main category: cs.CV
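TL;DR: Proposes a text-to-image generation method based on iterative self-correction, in which a vision-language model acts as a critic in the loop and progressively improves the generated result, substantially increasing accuracy and faithfulness for complex prompts.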
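The critic-in-the-loop idea in the TL;DR above can be sketched as a plain control loop. This is only an illustration of the general pattern under stated assumptions: `generate` and `critique` are hypothetical callables standing in for a text-to-image model and a vision-language critic, and the stopping rule and feedback format are not taken from the paper.

```python
from typing import Callable, List, Tuple


def iterative_refine(
    prompt: str,
    generate: Callable[[str, List[str]], object],
    critique: Callable[[str, object], List[str]],
    max_rounds: int = 4,
) -> Tuple[object, int]:
    """Generate, ask the critic for unmet constraints, regenerate with feedback.

    Returns the final image and the number of refinement rounds used.
    Stops early when the critic reports no remaining issues.
    """
    feedback: List[str] = []
    image = generate(prompt, feedback)
    for round_idx in range(1, max_rounds + 1):
        issues = critique(prompt, image)
        if not issues:
            return image, round_idx - 1
        # Pass the critic's findings to the next generation call.
        feedback = issues
        image = generate(prompt, feedback)
    return image, max_rounds
```

Capping the loop with `max_rounds` keeps the compute budget bounded, which matters when comparing against a parallel-sampling baseline with the same budget.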
Details
Motivation: Existing text-to-image models perform poorly on complex prompts that involve multiple objects, attributes, and relations, and lack an effective mechanism for meeting rich compositional requirements. Method: Inspired by chain-of-thought reasoning in large language models, the method uses an iterative test-time strategy in which a vision-language model acts as a critic and provides feedback that guides the T2I model to refine its generated image step by step. It needs no external tools or priors and can be applied flexibly to many image generators and vision-language models. Result: The method yields clear gains on several benchmarks: a 16.9% higher all-correct rate on ConceptMix (k=7), 13.8% on T2I-CompBench (3D-Spatial category), and 12.5% on Visual Jenga scene decomposition; human evaluators prefer it 58.7% of the time versus 41.3% for the baseline. Conclusion: Iterative self-correction is a broadly applicable principle for compositional image generation: it decomposes complex prompts into sequential corrections and produces images that are more faithful to the prompt. Abstract: Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/
[129] Towards Understanding Best Practices for Quantization of Vision-Language Models
Gautom Das,Vincent La,Ethan Lau,Abhinav Shrivastava,Matthew Gwilliam
Main category: cs.CV
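TL;DR: This paper studies how several quantization methods (such as GPTQ and AWQ) perform across multimodal large language model (MLLM) pipelines, including the vision model, the language model, and their connector, and analyzes how bit width, quantization method, and the quantized component affect captioning, retrieval, and question answering. The results show that the ViT and the LLM matter about equally for performance, and that low-bit quantization of the LLM keeps accuracy high while substantially reducing bits per weight (bpw).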
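To make "bit width" and "bits per weight" concrete, the sketch below implements plain group-wise round-to-nearest uniform quantization. This is deliberately the simplest baseline, not GPTQ or AWQ, and the group size, the uint8 code storage (so `bits` must be at most 8), and the 32 bits of per-group metadata are assumptions chosen for illustration.

```python
import numpy as np


def quantize_rtn(weights: np.ndarray, bits: int, group_size: int = 64):
    """Group-wise min/max round-to-nearest quantization.

    Assumes weights.size is divisible by group_size and bits <= 8.
    Returns integer codes plus the per-group scale and offset.
    """
    w = weights.reshape(-1, group_size).astype(np.float32)
    levels = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale[scale == 0] = 1.0  # constant groups: avoid division by zero
    codes = np.clip(np.round((w - w_min) / scale), 0, levels).astype(np.uint8)
    return codes, scale, w_min


def dequantize(codes, scale, w_min, shape):
    """Map integer codes back to approximate float weights."""
    return (codes.astype(np.float32) * scale + w_min).reshape(shape)


def bits_per_weight(bits: int, group_size: int, meta_bits: int = 32) -> float:
    """Effective bpw: code bits plus per-group scale/offset overhead.

    meta_bits=32 assumes a 16-bit scale and a 16-bit offset per group.
    """
    return bits + meta_bits / group_size
```

With 4-bit codes and groups of 64 weights, `bits_per_weight(4, 64)` gives 4.5, which shows why the reported bpw of a quantized model is usually slightly above its nominal bit width.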
Details
Motivation: Reducing the memory footprint and inference latency of deploying multimodal large language models (MLLMs) calls for more efficient parameter quantization, yet the quantization sensitivity of each component in a joint vision-language pipeline is still unclear. Method: The study systematically evaluates several advanced quantization methods (GPTQ, AWQ, and others) on the different components of a multimodal pipeline (ViT, LLM, connector), examines how bit widths (e.g., 2-bit to 8-bit) affect captioning, retrieval, and VQA, and compares performance after quantizing each module. Result: The ViT and the LLM contribute similarly to overall performance despite a large difference in parameter count; low-bit quantization of the LLM (e.g., 3 to 4 bits) substantially reduces bpw while maintaining high accuracy; the connector module is relatively robust, and the choice of which component to quantize has a significant effect on task performance. Conclusion: Quantization strategies should be designed per module, with the LLM as the main target for compression; the study offers practical guidance for efficient MLLM deployment and shows how differently the components of a multimodal model respond to quantization. Abstract: Large language models (LLMs) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory and latency of these systems, practitioners quantize their learned parameters, typically at half precision. A growing body of research focuses on preserving the model performance with more aggressive bit widths, and some work has been done to apply these strategies to other models, like vision transformers. In our study we investigate how a variety of quantization methods, including state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines composed of vision models, language models, and their connectors. We address how performance on captioning, retrieval, and question answering can be affected by bit width, quantization method, and which portion of the pipeline the quantization is used for. Results reveal that ViT and LLM exhibit comparable importance in model performance, despite significant differences in parameter size, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw). These findings provide practical insights for efficient deployment of MLLMs and highlight the value of exploration for understanding component sensitivities in multimodal models. Our code is available at https://github.com/gautomdas/mmq
[130] APPLE: Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping
Jiwon Kang,Yeji Choi,JoungBin Lee,Wooseok Jang,Jinhyeok Choi,Taekeun Kang,Yongjae Park,Myungin Kim,Seungryong Kim
Main category: cs.CV
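TL;DR: Proposes APPLE, a diffusion-based teacher-student framework that improves attribute fidelity in face swapping through attribute-aware pseudo-label supervision.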